In early 2014, we were faced with a challenge: As our data grew, our current real-time data storage was no longer able to support our growth. We looked for other real-time ingestion systems, and that’s when one of our engineers, Maxime, came across Druid. Let’s backtrack for a minute. The “we” in this story is GumGum, inventors of In-Image Advertising and a leading digital marketing platform based in Santa Monica, the heart of Silicon Beach. The company was founded in 2007, and its data has consistently quadrupled in recent years. We’ve welcomed and embraced this rapid growth, but it’s also kept us in the market for a system that can handle a lot of data.
While we had had a good run with MySQL, which we were using exclusively until recently, we needed something much more scalable. We looked at Druid, which is designed to ingest and access large-scale real-time data and is horizontally scalable because it is distributed. Druid is especially suitable for our real time use cases where we aggregate large numbers of counters for a known set of keys. Because Druid is open source and created by another ad tech company that had needs similar to ours, we decided to give it a try. And the results so far have been impressive.
We now use Druid as part of a wishbone-shaped data processing architecture called Lambda, which is a generic, fault-tolerant, and scalable data processing system that is able to serve a wide range of workloads in which low-latency reads and updates are required. Lambda architecture scales out rather than up.
GumGum ad servers generate billions of events every day that are sent to two sources simultaneously (Kafka and S3). A Storm topology keeps consuming the stream of these events from Kafka in real time. The topology parses JSON events and identifies various metrics to send to Druid. In parallel, a MapReduce job reads the files from S3 every four hours and sends the data to Druid, which then overwrites previously sent data using Storm.
That real-time data gets overwritten by batch data is one of the most important facets of Lambda architecture, because currently, real-time data processing technologies do not guarantee ‘exactly once’ processing one hundred percent of the time. This means that some events may get processed again or some events may get dropped. Various open source technologies are working to make this better, but this is one of the reasons why Lambda architecture recommends that you replace real time data with batch data.
Druid allows us to overwrite our real-time data with batch data by providing an indexer MapReduce job. But because this MapReduce job expects the files to conform to Druid’s data source, we run one more MapReduce job to convert our complex, nested JSON to a format acceptable by Druid’s data source. These files are then sent to an indexing service (HadoopDruidIndexer) that creates Druid segments from these files, uploads them to S3, and updates the segment metadata in MySQL. Within a minute, a Druid coordinator node realizes the metadata change and loads the newly created segments on the historical nodes while discarding the old segments created by the Storm topology. Operating batch pipelines requires us to run two MapReduce jobs in succession. At GumGum, we use AWS Data Pipeline because it offers a workflow engine that orchestrates multiple tasks with complex dependencies.
We do not expose our Druid to client applications directly; instead, we have built a reporting server that sits in front of Druid. The reporting server allows us to join data from Druid with our MySQL database. It also insulates clients from any backend changes and converts data into a Google Visualization format that is easier for dashboard applications to consume. Druid does offer a REST API in case you don’t want to go through the trouble of building your own API server.
The architecture described above processes more than 3 billion events per day in real time, which amounts to 5 TB of new data per day. We are constantly trying to find the right instance types for our servers machines, but here is list of what we are presently using:
Brokers – 2 m4.xlarge (Round-robin DNS)
Coordinators – 2 c4.large
Historical (Cold) – 2 m4.2xlarge (1 x 1000GB EBS SSD)
Historical (Hot) – 4 m4.2xlarge (1 x 250GB EBS SSD)
Middle Managers – 15 c4.4xlarge (1 x 300GB EBS SSD)
Overlords – 2 c4.large
Zookeeper – 3 c4.large
MySQL – RDS – db.m3.medium
Druid has proven to be extremely fast, flexible, and reliable. We love using Druid as our data-accessing technology because it allows us to quickly ingest large amounts of data and query the data using simple REST APIs. Druid is horizontally scalable, so as our needs increase, we can simply add more middle managers and historical nodes. Druid is also easy to upgrade and maintain because most of the data stays in Amazon S3. Our goal next year is to allocate some dedicated time to contribute to Druid codebase. We are looking forward to a long partnership with the Druid community and are enthusiastic about being part of the software’s evolution.
The allure of cloud services is something most startups can understand. When you have big ideas and little capital, you need the technology that will enable you to move fast.
One of the most effective tools for this is Amazon Web Services. When you don’t have to reinvent the infrastructure wheel, you can pour your investments into areas that produce real innovation and value to your industry. At the early stages, operating in the cloud is an easy decision to get a startup’s technology off the ground.
Flash-Forward a Couple of Years: Your startup is rapidly growing, along with your data. You’ve now reached the crossroad that every startup faces as it matures. This is the point where the financial cost of staying with a cloud provider is possibly higher than owning and maintaining your own infrastructure. When you hit that price point, it’s time for some difficult conversations about whether or not it will continue to be advantageous to stay with the cloud. Many startups choose to leave the cloud behind and begin utilizing their own infrastructures, which is a valid move. However, far too often companies view this as their only option, which is simply not the case.
Why Staying in the Cloud May Be the Smart Move
Continuing to scale within a cloud infrastructure can provide a bright future for mature startups. Here are a few reasons why:
Ability to Support Exponential Growth: Mature startups usually grow at a substantial pace, expanding the number of servers used. Despite contrary chatter in the startup community, maturity doesn’t equate being able to support growth without being partnered with some kind of cloud service. Whether preparing for an increase in website traffic or international expansion, a cloud infrastructure allows companies to adjust the number of servers they need in real time.
Sophistication of an Established Cloud Platform: It’s often very difficult for startups to achieve on their own the same level of sophistication in infrastructure that an established cloud platform offers. Tools such as Amazon Web Services make it easy to recover from hardware failures. Say, for example, your RDS database goes down. AWS can recover it within minutes without any action needed on your part. Without the cloud in that instance, downtime would have been much longer, and every tech company knows that downtime can be deadly. And of course there’s the upfront investment required to design an infrastructure to begin with.
Hidden Costs of Going Off the Cloud: Sometimes people overlook the hidden costs of going off the cloud. It’s not just the manpower required to manage the infrastructure and the financial cost of acquiring servers. Think of the missed opportunity in innovation and business partnerships if you were to suddenly need to launch, say, 200 servers to try something new on infrastructure that doesn’t support it. Building out infrastructure takes time and in the technology sphere, time is a commodity you can’t afford to lose when operating in a fast-moving market.
It Encourages Innovation: In the same vein as the reason above, cloud computing encourages people to try a lot of different things simply because it’s easy to do so. You can’t put a price on an infrastructure solution that inspires innovation for our engineers.
You Have Access to a Powerhouse Arsenal of Services: Many cloud providers offer additional services on top of just renting virtual server space (e.g., (Microsoft Azure’s Cassandra and AWS’s Elastic MapReduce) so companies don’t need to build and maintain these services on their own. While it’s possible to use a select portion of Amazon’s services with our own infrastructure, the services in their cloud environment have a tight integration between their services and the reduced latency of accessing their services through their infrastructure.
The Cloud Forces Us to Design Redundant Architecture: The cloud forces startups to design systems that anticipate the possibility of servers going down at anytime. Organizations in the cloud are prepared for instance failure and have eliminated almost all “emergencies” within their organization. If a server or database node goes down, applications will continue to work without any problems because a new server or database node will automatically come up. Cloud allows startups to design for redundancy everywhere, something you may not achieve if you own your own costly hardware.
Every case is different and it’s important to continue finding the balance between utilizing cloud services and owning actual hardware. But it’s just as important to realize that you don’t have to give up the cloud just because your company has grown to a certain size. The Cloud can be a proven infrastructure solution to keep you nimble, agile, and right where you want to be–the fast lane–during a very exciting time in the tech industry.
The above article was originally published at www.adotas.com