Tagged: Big Data

Lambda Architecture with Druid at GumGum

In early 2014, we were faced with a challenge: As our data grew, our current real-time data storage was no longer able to support our growth. We looked for other real-time ingestion systems, and that’s when one of our engineers, Maxime, came across Druid. Let’s backtrack for a minute. The “we” in this story is GumGum, inventors of In-Image Advertising and a leading digital marketing platform based in Santa Monica, the heart of Silicon Beach. The company was founded in 2007, and its data has consistently quadrupled in recent years. We’ve welcomed and embraced this rapid growth, but it’s also kept us in the market for a system that can handle a lot of data.

While we had had a good run with MySQL, which we were using exclusively until recently, we needed something much more scalable. We looked at Druid, which is designed to ingest and access large-scale real-time data and is horizontally scalable because it is distributed. Druid is especially suitable for our real time use cases where we aggregate large numbers of counters for a known set of keys. Because Druid is open source and created by another ad tech company that had needs similar to ours, we decided to give it a try. And the results so far have been impressive.

We now use Druid as part of a wishbone-shaped data processing architecture called Lambda, which is a generic, fault-tolerant, and scalable data processing system that is able to serve a wide range of workloads in which low-latency reads and updates are required. Lambda architecture scales out rather than up.

GumGum ad servers generate billions of events every day that are sent to two sources simultaneously (Kafka and S3). A Storm topology keeps consuming the stream of these events from Kafka in real time. The topology parses JSON events and identifies various metrics to send to Druid. In parallel, a MapReduce job reads the files from S3 every four hours and sends the data to Druid, which then overwrites previously sent data using Storm.

That real-time data gets overwritten by batch data is one of the most important facets of Lambda architecture, because currently, real-time data processing technologies do not guarantee ‘exactly once’ processing one hundred percent of the time. This means that some events may get processed again or some events may get dropped. Various open source technologies are working to make this better, but this is one of the reasons why Lambda architecture recommends that you replace real time data with batch data.

Druid allows us to overwrite our real-time data with batch data by providing an indexer MapReduce job.  But because this MapReduce job expects the files to conform to Druid’s data source, we run one more MapReduce job to convert our complex, nested JSON to a format acceptable by Druid’s data source. These files are then sent to an indexing service (HadoopDruidIndexer) that creates Druid segments from these files, uploads them to S3, and updates the segment metadata in MySQL. Within a minute, a Druid coordinator node realizes the metadata change and loads the newly created segments on the historical nodes while discarding the old segments created by the Storm topology. Operating batch pipelines requires us to run two MapReduce jobs in succession. At GumGum, we use AWS Data Pipeline because it offers a workflow engine that orchestrates multiple tasks with complex dependencies.

GumGum-Data-Pipeline

We do not expose our Druid to client applications directly; instead, we have built a reporting server that sits in front of Druid. The reporting server allows us to join data from Druid with our MySQL database. It also insulates clients from any backend changes and converts data into a Google Visualization format that is easier for dashboard applications to consume. Druid does offer a REST API in case you don’t want to go through the trouble of building your own API server.

The architecture described above processes more than 3 billion events per day in real time, which amounts to 5 TB of new data per day. We are constantly trying to find the right instance types for our servers machines, but here is list of what we are presently using:

Brokers – 2 m4.xlarge (Round-robin DNS)

Coordinators – 2 c4.large

Historical (Cold) – 2 m4.2xlarge (1 x 1000GB EBS SSD)

Historical (Hot) – 4 m4.2xlarge (1 x 250GB EBS SSD)

Middle Managers – 15 c4.4xlarge (1 x 300GB EBS SSD)

Overlords – 2 c4.large

Zookeeper – 3 c4.large

MySQL – RDS – db.m3.medium

Druid has proven to be extremely fast, flexible, and reliable. We love using Druid as our data-accessing technology because it allows us to quickly ingest large amounts of data and query the data using simple REST APIs. Druid is horizontally scalable, so as our needs increase, we can simply add more middle managers and historical nodes. Druid is also easy to upgrade and maintain because most of the data stays in Amazon S3. Our goal next year is to allocate some dedicated time to contribute to Druid codebase. We are looking forward to a long partnership with the Druid community and are enthusiastic about being part of the software’s evolution.