AWS’s Database Migration Service (DMS) is often misunderstood as a service that can only migrate your data to the cloud. But it can be just as useful for replicating data between two datastores that are already in the cloud. In our case, both of our datastores – RDS (MySQL) and Redshift – were already in the cloud. The MySQL database is our primary OLTP database; many of our employees log in to an internal application to modify data in it, and it contains all of our dimension tables. Our reporting dashboard connects to our data warehouse built on Redshift, which contains all of our fact tables. To run reports accurately, it’s important that all modifications made to the dimension tables in MySQL are transferred to Redshift in a timely manner.
Enabling the general log on your MySQL RDS instance can be very useful, especially for auditing and accountability purposes, and it often helps when debugging problems as well. The general log can record every single query on your database in the mysql.general_log table. Alternatively, you can log all queries to a log file, but on an RDS instance, logging to a table was the best option in our situation. More information about the general log can be found in the MySQL documentation.
Here is a sample record from one of our general logs:
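The original sample didn’t survive in this copy of the post; as a reconstruction, the log can be inspected with a simple query like the one below. The user, host, and query text in the commented-out row are purely illustrative:

```sql
-- Look at the most recent entry in the general log
-- (the example row below is hypothetical).
SELECT event_time, user_host, command_type, argument
FROM mysql.general_log
ORDER BY event_time DESC
LIMIT 1;

-- event_time          | user_host              | command_type | argument
-- 2015-06-01 12:34:56 | app[app] @ [10.0.0.12] | Query        | UPDATE dim_campaign SET name = 'Summer' WHERE id = 42
```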
This table has the following columns: event_time, user_host, thread_id, server_id, command_type, argument. For us, event_time, user_host, and argument (the query text) are the most important columns: they tell us which application or user executed a given query at a given time.
You can enable the general log by modifying the DB Parameter Group associated with your RDS instance. To enable table-based logging, all you really need to do is set the general_log parameter to 1. The log_output parameter defaults to TABLE, so unless you want to switch to a file-based log, you don’t have to modify anything else; if you do want a file-based log, change the log_output parameter. More detailed information about these parameters can be found in the RDS documentation. The good news is that general_log is a dynamic parameter, which means you do not have to reboot your RDS instance for the change to take effect.
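As a sketch, assuming a parameter group named my-mysql-params (substitute your own), the change can also be applied with the AWS CLI; because general_log is dynamic, ApplyMethod=immediate works without a reboot:

```shell
# Hypothetical parameter group name; general_log is a dynamic
# parameter, so the change applies immediately without a reboot.
aws rds modify-db-parameter-group \
    --db-parameter-group-name my-mysql-params \
    --parameters "ParameterName=general_log,ParameterValue=1,ApplyMethod=immediate"
```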
The general log is a very good idea if you roll it up regularly. If you don’t, it can start occupying disk space pretty fast. Amazon’s default rules for rotating it are:
When TABLE logging is enabled, log tables are rotated every 24 hours if the space used by the table logs is more than 20 percent of the allocated storage space or the size of all logs combined is greater than 10 GB. If the amount of space used for a DB instance is greater than 90 percent of the DB instance’s allocated storage space, then the thresholds for log rotation are reduced. Log tables are then rotated if the space used by the table logs is more than 10 percent of the allocated storage space or the size of all logs combined is greater than 5 GB. You can subscribe to the low_free_storage event to be notified when log tables are rotated to free up space.
The problem with the default rules is that if your database is busy, you may accumulate hundreds of millions of rows before 20% of your disk space fills up or the total log size exceeds 10 GB – and the larger your allocated storage, the longer that takes. Your backups will contain all of this data, and if you regularly restore a backup to use as a test database, the restore will take much longer and consume more bandwidth and I/O operations. Every weekend, we create a new test database from our production database. One fine Monday we woke up to find that our test database still had not been restored; it turned out the restore took much longer because of the general_log table. So setting up a rollup policy became a priority.
Amazon actually makes it very easy to roll up the general log: you simply need to execute the stored procedure mysql.rds_rotate_general_log. Initially we decided to write a script calling this procedure every 3 days, but we soon realized that executing it requires the master user. In the limited time we spent on it, we couldn’t figure out the exact set of permissions required to rotate the log as a different user. So instead of setting up a cron script with master user credentials in it, we decided to simply use MySQL’s own event scheduler.
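The rotation itself is a single call (which, as noted, must be run as the master user):

```sql
CALL mysql.rds_rotate_general_log;
```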
MySQL events can be compared to triggers, except that they run on a schedule – like crontab entries – rather than in response to table changes. The event scheduler runs each event at its configured frequency. I created the following simple event to rotate the general log:
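The event definition itself didn’t survive in this copy of the post. A minimal version consistent with the description (rotation every 3 days; the event name is my choice) might look like:

```sql
-- Runs every 3 days under the event scheduler; must be created by a
-- user allowed to call the RDS rotation procedure (the master user).
CREATE EVENT rotate_general_log
    ON SCHEDULE EVERY 3 DAY
    COMMENT 'Roll up mysql.general_log to keep its size in check'
    DO CALL mysql.rds_rotate_general_log;
```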
You can execute the above statement to add the event, but by default the event scheduler is not enabled on an RDS instance. You will need to set the event_scheduler parameter in your DB Parameter Group to ON to make sure your event actually gets executed; do that before you add the event. Fortunately, this is also a dynamic parameter, so we didn’t have to reboot our database. Once we set the scheduler to ON, we simply added the above event to the mysql (system) schema. Voilà! The log got rotated immediately!
In early 2014, we were faced with a challenge: As our data grew, our current real-time data storage was no longer able to support our growth. We looked for other real-time ingestion systems, and that’s when one of our engineers, Maxime, came across Druid. Let’s backtrack for a minute. The “we” in this story is GumGum, inventors of In-Image Advertising and a leading digital marketing platform based in Santa Monica, the heart of Silicon Beach. The company was founded in 2007, and its data has consistently quadrupled in recent years. We’ve welcomed and embraced this rapid growth, but it’s also kept us in the market for a system that can handle a lot of data.
While we had had a good run with MySQL, which we were using exclusively until recently, we needed something much more scalable. We looked at Druid, which is designed to ingest and access large-scale real-time data and is horizontally scalable because it is distributed. Druid is especially suitable for our real-time use cases, where we aggregate large numbers of counters for a known set of keys. Because Druid is open source and was created by another ad tech company with needs similar to ours, we decided to give it a try. And the results so far have been impressive.
We now use Druid as part of a wishbone-shaped data processing architecture called Lambda, which is a generic, fault-tolerant, and scalable data processing system that is able to serve a wide range of workloads in which low-latency reads and updates are required. Lambda architecture scales out rather than up.
GumGum ad servers generate billions of events every day that are sent to two destinations simultaneously: Kafka and S3. A Storm topology continuously consumes the stream of these events from Kafka in real time, parses the JSON events, and identifies various metrics to send to Druid. In parallel, a MapReduce job reads the files from S3 every four hours and sends the data to Druid, which then overwrites the data previously sent through Storm.
That real-time data gets overwritten by batch data is one of the most important facets of the Lambda architecture, because current real-time data processing technologies do not guarantee ‘exactly once’ processing one hundred percent of the time. This means that some events may get processed twice and others may get dropped. Various open source projects are working to improve this, but it is one of the reasons why the Lambda architecture recommends replacing real-time data with batch data.
Druid allows us to overwrite our real-time data with batch data by providing an indexing MapReduce job. But because this job expects its input files to conform to Druid’s data source schema, we run one more MapReduce job to convert our complex, nested JSON into a format acceptable to Druid’s data source. These files are then handed to the indexing service (HadoopDruidIndexer), which creates Druid segments from them, uploads them to S3, and updates the segment metadata in MySQL. Within a minute, a Druid coordinator node notices the metadata change and loads the newly created segments onto the historical nodes while discarding the old segments created by the Storm topology. Running the batch pipeline therefore requires two MapReduce jobs in succession; at GumGum, we use AWS Data Pipeline because it offers a workflow engine that orchestrates multiple tasks with complex dependencies.
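The conversion job’s details aren’t shown in the post. As an illustration of the kind of flattening involved (the field names here are made up, and the real job is a MapReduce job, not a single function), nested JSON can be turned into the flat, column-like records Druid ingests:

```python
import json

def flatten(event, parent_key="", sep="."):
    """Recursively flatten a nested JSON object into dotted keys."""
    items = {}
    for key, value in event.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing keys with the parent path.
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

# Hypothetical ad event with one level of nesting.
event = {
    "timestamp": "2015-06-01T00:00:00Z",
    "ad": {"campaign": 42, "zone": "top"},
    "impressions": 1,
}
print(json.dumps(flatten(event), sort_keys=True))
```

Each flattened record can then be emitted as one line of newline-delimited JSON, which matches the shape of a Druid ingestion row.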
We do not expose our Druid to client applications directly; instead, we have built a reporting server that sits in front of Druid. The reporting server allows us to join data from Druid with our MySQL database. It also insulates clients from any backend changes and converts data into a Google Visualization format that is easier for dashboard applications to consume. Druid does offer a REST API in case you don’t want to go through the trouble of building your own API server.
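For reference, a Druid native query is just JSON posted to a broker. A sketch of a timeseries query (the data source and metric names here are hypothetical) looks like:

```json
{
  "queryType": "timeseries",
  "dataSource": "ad_events",
  "intervals": ["2015-06-01/2015-06-02"],
  "granularity": "hour",
  "aggregations": [
    { "type": "longSum", "name": "impressions", "fieldName": "impressions" }
  ]
}
```

This body would be POSTed with Content-Type: application/json to the broker’s /druid/v2 endpoint.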
The architecture described above processes more than 3 billion events per day in real time, which amounts to 5 TB of new data per day. We are constantly trying to find the right instance types for our servers, but here is a list of what we are presently using:
Brokers – 2 m4.xlarge (Round-robin DNS)
Coordinators – 2 c4.large
Historical (Cold) – 2 m4.2xlarge (1 x 1000GB EBS SSD)
Historical (Hot) – 4 m4.2xlarge (1 x 250GB EBS SSD)
Middle Managers – 15 c4.4xlarge (1 x 300GB EBS SSD)
Overlords – 2 c4.large
Zookeeper – 3 c4.large
MySQL – RDS – db.m3.medium
Druid has proven to be extremely fast, flexible, and reliable. We love using Druid as our data-accessing technology because it allows us to quickly ingest large amounts of data and query it through simple REST APIs. Druid is horizontally scalable, so as our needs increase, we can simply add more middle managers and historical nodes. Druid is also easy to upgrade and maintain because most of the data stays in Amazon S3. Our goal next year is to allocate some dedicated time to contribute to the Druid codebase. We are looking forward to a long partnership with the Druid community and are enthusiastic about being part of the software’s evolution.
The allure of cloud services is something most startups can understand. When you have big ideas and little capital, you need the technology that will enable you to move fast.
One of the most effective tools for this is Amazon Web Services. When you don’t have to reinvent the infrastructure wheel, you can pour your investments into areas that produce real innovation and value to your industry. At the early stages, operating in the cloud is an easy decision to get a startup’s technology off the ground.
Flash-forward a couple of years: your startup is rapidly growing, along with your data. You’ve now reached the crossroads that every startup faces as it matures: the point where the financial cost of staying with a cloud provider may exceed that of owning and maintaining your own infrastructure. When you hit that price point, it’s time for some difficult conversations about whether it will continue to be advantageous to stay in the cloud. Many startups choose to leave the cloud behind and begin running their own infrastructure, which is a valid move. However, far too often companies view this as their only option, which is simply not the case.
Why Staying in the Cloud May Be the Smart Move
Continuing to scale within a cloud infrastructure can provide a bright future for mature startups. Here are a few reasons why:
Ability to Support Exponential Growth: Mature startups usually grow at a substantial pace, expanding the number of servers they use. Despite contrary chatter in the startup community, maturity doesn’t equate to being able to support growth without some kind of cloud service partner. Whether preparing for an increase in website traffic or international expansion, a cloud infrastructure allows companies to adjust the number of servers they need in real time.
Sophistication of an Established Cloud Platform: It’s often very difficult for startups to achieve on their own the same level of sophistication in infrastructure that an established cloud platform offers. Tools such as Amazon Web Services make it easy to recover from hardware failures. Say, for example, your RDS database goes down. AWS can recover it within minutes without any action needed on your part. Without the cloud in that instance, downtime would have been much longer, and every tech company knows that downtime can be deadly. And of course there’s the upfront investment required to design an infrastructure to begin with.
Hidden Costs of Going Off the Cloud: People sometimes overlook the hidden costs of going off the cloud. It’s not just the manpower required to manage the infrastructure and the financial cost of acquiring servers. Think of the missed opportunities in innovation and business partnerships if you suddenly needed to launch, say, 200 servers to try something new on infrastructure that doesn’t support it. Building out infrastructure takes time, and in the technology sphere, time is a commodity you can’t afford to lose when operating in a fast-moving market.
It Encourages Innovation: In the same vein as the reason above, cloud computing encourages people to try a lot of different things simply because it’s easy to do so. You can’t put a price on an infrastructure solution that inspires engineers to innovate.
You Have Access to a Powerhouse Arsenal of Services: Many cloud providers offer additional services on top of renting virtual server space (e.g., Microsoft Azure’s Cassandra and AWS’s Elastic MapReduce) so companies don’t need to build and maintain these services on their own. While it’s possible to use a select portion of Amazon’s services with your own infrastructure, the services within their cloud environment benefit from tight integration with one another and from the reduced latency of being accessed from within the same infrastructure.
The Cloud Forces Us to Design Redundant Architecture: The cloud forces startups to design systems that anticipate the possibility of servers going down at any time. Organizations in the cloud are prepared for instance failure and have eliminated almost all “emergencies” within their organization. If a server or database node goes down, applications continue to work without any problems because a new server or database node automatically comes up. The cloud lets startups design for redundancy everywhere, something you may not achieve if you own your own costly hardware.
Every case is different, and it’s important to keep finding the balance between utilizing cloud services and owning actual hardware. But it’s just as important to realize that you don’t have to give up the cloud just because your company has grown to a certain size. The cloud can be a proven infrastructure solution to keep you nimble, agile, and right where you want to be – the fast lane – during a very exciting time in the tech industry.
The above article was originally published at www.adotas.com