12 Jun 2010

How do we accomplish real-time reports in a big data system? What if you want to count ad impressions and give real-time reports to your customers? HBase makes aggregation really easy. I am going to tell you how we accomplished aggregation with HBase.
The HTable class in the HBase client API contains an incredibly useful method: incrementColumnValue. This method increments a column value by a given amount. If the row and column do not exist, they are created and the amount is added to zero. The method is useful because it does the read and the write in a single call, so you do not have to make two calls to HBase. It reads the existing value, updates it to the new value and returns the updated value.
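To make those semantics concrete, here is a minimal in-memory sketch. It is not the HBase client: CounterTableModel and its String-based row and column arguments are hypothetical stand-ins. The real client call, as far as I know, takes byte arrays, roughly incrementColumnValue(row, family, qualifier, amount). The sketch only mirrors the behavior described above: a missing cell starts at zero, and the updated value comes back from the same call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for a counter table; NOT the HBase client.
// It only mirrors the increment behavior described in the text.
class CounterTableModel {
    // Key is "row/column"; value is the current counter.
    private final ConcurrentMap<String, Long> cells = new ConcurrentHashMap<>();

    // Adds amount to the cell, treating a missing cell as zero,
    // and returns the updated value in one atomic step.
    long incrementColumnValue(String row, String column, long amount) {
        return cells.merge(row + "/" + column, amount, Long::sum);
    }

    public static void main(String[] args) {
        CounterTableModel table = new CounterTableModel();
        // First call creates the cell: 0 + 1
        System.out.println(table.incrementColumnValue("ad-42", "impressions", 1)); // prints 1
        // Second call adds to the existing value: 1 + 5
        System.out.println(table.incrementColumnValue("ad-42", "impressions", 5)); // prints 6
    }
}
```

The caller never reads the old value itself, which is what makes the single-call approach convenient for counters.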
Thus it is very easy to aggregate data in HBase. Let’s say you are storing ad impression data in one table. Create another table for aggregation, with the same key (or a different one if you want to query it differently). With every update to the impressions table, update the aggregate table using the incrementColumnValue method. This will allow you to give real-time impression data to your customers.
If you want to aggregate data by multiple dimensions, you will have to set up one table for each dimension. To understand why, you first need to understand how HBase row keys work. An HBase row key is essentially a byte array, and HBase stores all rows in the lexicographic order of their keys. A key can be made up of more than one piece of information, such as a period and an impression id. If you build your key by combining period and impression id, with the period first, you can query all the impressions for a given period (hour, minute, day etc.) because their keys sit next to each other. A detailed discussion of this topic is given in Jonathan Gray’s post: http://devblog.streamy.com/2009/04/23/hbase-row-key-design-for-paging-limit-offset-queries/. There you can learn how to design HBase row keys for easy querying.
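The effect of key order can be sketched without HBase at all. The example below uses a Java TreeMap as a stand-in for a sorted table; the class name RowKeyDemo, the "|" separator and the yyyyMMddHH period format are my own assumptions for illustration, not anything HBase prescribes. Because the period comes first and has a fixed width, all keys for one hour are contiguous and can be read with a single prefix scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sorted-map stand-in for an HBase table; NOT the HBase client.
// For plain ASCII keys, String order matches byte-wise lexicographic order.
class RowKeyDemo {
    // Composite key: fixed-width period first, then the impression id.
    static String rowKey(String period, String impressionId) {
        return period + "|" + impressionId;
    }

    // Returns every key that starts with the given period, like a prefix scan.
    static List<String> scanPeriod(TreeMap<String, Long> table, String period) {
        String prefix = period + "|";
        List<String> hits = new ArrayList<>();
        // Start at the first key >= prefix and stop once the prefix no longer matches.
        for (String key : table.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> table = new TreeMap<>();
        table.put(rowKey("2010061215", "imp-1"), 1L);
        table.put(rowKey("2010061214", "imp-9"), 1L);
        table.put(rowKey("2010061213", "imp-7"), 1L);
        table.put(rowKey("2010061214", "imp-3"), 1L);
        // prints [2010061214|imp-3, 2010061214|imp-9]
        System.out.println(scanPeriod(table, "2010061214"));
    }
}
```

If the impression id came first instead, the rows for one hour would be scattered across the key space, which is why a second query pattern needs a second table with a different key.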
This method is very fast. In a past experimental product of ours, we were making 90 incrementColumnValue calls per web request. We made them asynchronously (by spawning a thread from a servlet). With our not-so-big cluster (1 master, 3 region servers, all Amazon EC2 m1.large instances), HBase took only a few milliseconds for the 90 calls!
There are a few minor disadvantages to this approach too. Fortunately we never encountered them, but you may if you have a bigger system than ours. First, hot spotting may occur depending on your key design. If you have time embedded in your key, all the requests for the current time will go to a single region. That means the region server hosting that region must be able to take all of the traffic. Again, we never encountered this, as our servers were big enough to handle all of our traffic, but scalability is limited to what one region server can handle. If you are updating multiple tables, the tables could be on different region servers, but the scalability of each is still limited to what one region server can take. The second disadvantage is that if you want to count impressions by multiple dimensions, you have to create a table for each dimension, because a table in HBase can be queried by only one dimension. (This may or may not be seen as a disadvantage; the number of tables doesn’t matter that much.) That is why we were doing 90 updates per request: impression count by publisher, by time, by advertiser, by country and so on. A third disadvantage, which I think is inherent to any distributed system, is that your counts may be off by a little. This happens because of errors, web server shut-downs and the like. If you are particular about accuracy, one remedy is to run a MapReduce job at night and correct the counts. I have seen Google AdSense data getting corrected the next day by a small amount!
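The "one table per dimension" fan-out can be sketched as follows. This is an in-memory illustration only, not our production code: DimensionFanOut, the "impressions_by_" table naming and the dimension names are all hypothetical, and plain HashMaps stand in for HBase tables. In a real system each loop iteration would be one increment call against the table for that dimension.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory illustration of per-dimension aggregate tables; NOT the HBase client.
class DimensionFanOut {
    // Table name -> (row key -> impression count).
    private final Map<String, Map<String, Long>> tables = new HashMap<>();

    // One impression produces one increment per dimension table.
    void recordImpression(Map<String, String> dimensions) {
        for (Map.Entry<String, String> d : dimensions.entrySet()) {
            String tableName = "impressions_by_" + d.getKey(); // hypothetical naming
            tables.computeIfAbsent(tableName, k -> new HashMap<>())
                  .merge(d.getValue(), 1L, Long::sum);
        }
    }

    // Reads the count for one value of one dimension; 0 if never seen.
    long count(String dimension, String value) {
        Map<String, Long> table = tables.get("impressions_by_" + dimension);
        return table == null ? 0L : table.getOrDefault(value, 0L);
    }

    public static void main(String[] args) {
        DimensionFanOut agg = new DimensionFanOut();

        Map<String, String> first = new LinkedHashMap<>();
        first.put("publisher", "pub-1");
        first.put("advertiser", "adv-7");
        first.put("country", "US");
        agg.recordImpression(first);

        Map<String, String> second = new LinkedHashMap<>();
        second.put("publisher", "pub-1");
        second.put("advertiser", "adv-9");
        second.put("country", "IN");
        agg.recordImpression(second);

        System.out.println(agg.count("publisher", "pub-1")); // prints 2
        System.out.println(agg.count("country", "US"));      // prints 1
    }
}
```

The number of writes per request grows with the number of dimensions, which is how a single impression turned into 90 updates for us.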
At the scale we were using it, this kind of aggregation worked very well for us. If you have experience doing aggregation with HBase or any other NoSQL system, please share it in the comments.