Cassandra, DynamoDB, HBase, Hadoop and Big Data in general
In: HBase | 14 Jul 2011
Key HBase community members advise people not to host their HBase cluster on EC2, and they have good reasons for doing so. But in this post I am going to explain why we decided to host our HBase cluster on EC2 and why we continue to do so.
When we began experimenting with HBase in July of 2009, HBase was fairly new, and we were exploring Hadoop and HBase to learn how these technologies could help us solve our problems. At the time we didn’t have big data, but being an ad network we wanted the ability to scale horizontally as our network grew: one new publisher could take our traffic to a new level. We were a startup with just a few engineers.
Amazon EC2 was getting popular around that time, and with elasticity being our prime concern, we started using Auto Scaling for our web cluster. We thought that buying equipment wouldn’t be a wise use of money. Furthermore, we were still trying to validate our unique business model, which meant a product pivot could have invalidated the expenditure. Since we were already using EC2 for our web servers, it was natural to use EC2 for our backend, and just as natural to host our newly experimental HBase cluster there too.
As our business grew, even though the HBase community was talking about problems with EBS’s slow disk I/O, we started realizing many benefits of hosting HBase on EC2. First and foremost was the ability to upgrade and add nodes within minutes. Many times we had only two or three days’ notice before adding a big new publisher, and if we had hosted our own cluster, we couldn’t have ordered new machines, prepared them, and made them operational in two days. Furthermore, we didn’t have any sysops expertise. We had prepared an AMI with an HBase installation, so adding a new node was a matter of minutes for us.
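As a rough illustration of the prebaked-AMI workflow, here is a minimal Python sketch in the style of today’s boto3 API (which did not exist in 2011; the helper name and AMI id are hypothetical, and the AMI is assumed to configure HBase and join the cluster on boot):

```python
def add_regionserver(ec2, ami_id, instance_type="c1.xlarge"):
    """Launch one more HBase regionserver from a prebaked AMI.

    `ec2` is any object exposing a boto3-style run_instances();
    against real AWS you would pass a boto3 EC2 client. Sketch only:
    the AMI is assumed to self-configure and join the cluster on boot.
    """
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return resp["Instances"][0]["InstanceId"]
```

Passing the client in rather than constructing it inside the function keeps the sketch runnable without AWS credentials, and mirrors how such a helper can be exercised with a stub in tests.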
The second important benefit was the ability to back up easily. Backing up our HBase data was important to us because we were running MapReduce jobs that modified data, and we wanted to guard against accidental modifications. Snapshotting the EBS volumes containing the data gave us an easy way to back up (read HBase on EC2 using EBS volumes – Lessons Learned). This has advantages beyond serving as a backup: it also means we can create an identical cluster within a couple of hours with the exact same data as production. Such a cluster can be used for QA, for running MapReduce jobs that analyze old data, and so on. And when you are done, you can simply shut down the cluster, saving money.
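The backup itself reduces to one snapshot call per EBS volume. A minimal sketch, again in boto3 style rather than the 2011-era EC2 tools, with the client passed in so the loop is easy to exercise without AWS (the helper name and label are illustrative):

```python
def snapshot_cluster(ec2, volume_ids, label):
    """Snapshot every EBS volume backing the cluster and return the
    new snapshot ids. `ec2` needs a boto3-style create_snapshot().
    Restoring these snapshots into fresh volumes is how an identical
    QA or analytics cluster can be brought up from production data.
    """
    snapshot_ids = []
    for vol in volume_ids:
        resp = ec2.create_snapshot(
            VolumeId=vol,
            Description=f"{label} {vol}",  # e.g. "nightly vol-abc123"
        )
        snapshot_ids.append(resp["SnapshotId"])
    return snapshot_ids
```

In practice you would also want the snapshots taken at a consistent point (the linked lessons-learned post covers that), but the per-volume loop is the core of it.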
Easier upgrades were one more advantage. Although HBase upgrades are usually pretty simple, there have been times when we did them differently. For example, when upgrading from 0.20.6 to 0.90.3, we brought up a new cluster, transferred the data from the old cluster to the new one with a MapReduce job, and, once the new cluster was operational and we were confident in it, discontinued the old cluster. I don’t think this is easy to do unless you are using a cloud provider like EC2; you would otherwise need to keep spare capacity for such projects.
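One standard way to do such a cluster-to-cluster transfer is HBase’s bundled CopyTable MapReduce job, which mirrors a table into a second cluster identified by its ZooKeeper address. A sketch of assembling that invocation (the table name and ZooKeeper host are made up, and the exact flags can vary between HBase versions):

```python
def copytable_cmd(table, peer_zk_quorum, peer_zk_port=2181, peer_root="/hbase"):
    """Build the shell command for HBase's CopyTable job, which copies
    `table` into the cluster whose ZooKeeper quorum is given by the
    peer address. Run it from a node of the source cluster."""
    return [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--peer.adr={peer_zk_quorum}:{peer_zk_port}:{peer_root}",
        table,
    ]

# copytable_cmd("events", "zk-new")
# -> ['hbase', 'org.apache.hadoop.hbase.mapreduce.CopyTable',
#     '--peer.adr=zk-new:2181:/hbase', 'events']
```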
It’s not that we never faced performance problems; we faced them throughout as we grew, but they were clearly a function of our traffic and growth. And every time, with the help of extremely helpful HBase committers and other key community members, we were able to solve them by tuning our HBase and Hadoop configuration. Occasionally we upgraded instances, and we even resorted to adding an in-memory cache between HBase and our web cluster. But the cost of doing so was much smaller than moving to our own data center and hiring people to manage it.
One more feature of AWS made us continue using HBase on EC2: Amazon EMR. HBase experts advise against running heavy MapReduce jobs on the same cluster that is serving real-time data, so we use Amazon EMR to run our MapReduce jobs, even the jobs that need data from HBase. We save money on data transfer because the machines running MapReduce and the HBase cluster are in the same availability zone, and we pay only for the time the MapReduce jobs run. We were aware that we were giving up data locality by doing so, but our data size allowed the jobs to finish on EMR in reasonable time.
Today we are running 50 servers on EC2 at any given point. About 30 of them are web servers (c1.mediums), 9 make up the HBase cluster (c1.xlarges), and the rest are memcache servers, queue-processing servers, and so on. Also, every night we launch at least 30 more servers to run our MapReduce jobs on Amazon EMR. If we moved all this to our own data center, we would have to buy or rent at least 50 servers, hire people to manage them, and we might not be able to scale at the pace our business grows.
We are a lean, fast-moving startup. We don’t hire people unless we really need them, and we don’t buy equipment unless we have a long-term plan. EC2 enables us to grow at our own speed and takes a lot of complexity out of our life. It’s this kind of decision making that has helped us become the world’s largest in-image ad network.