Aggregation with HBase

In: HBase

12 Jun 2010

How do we accomplish real time reports in a big data system? What if you want to count ad impressions and give real time reports to your customers? HBase makes it really easy to accomplish Aggregation. I am going to tell you how we accomplished aggregation with HBase.

The HTable class in HBase client API contains an incredibly useful method – incrementColumnValue. This method increments a column value by a given amount. If the row and column does not exist, the row and the column will be added and the amount specified will be added to zero. This method is very useful because it essentially does one read and one write without you having to make two calls to HBase. It reads the existing value, updates it to a new value and returns back the updated value.

Thus is is very easy to aggregate data in HBase. Let’s say you are storing ad impression data in one table. Create another table with the same key (or different if you want to query it differently) for aggregation. With every update to the impressions table, update this aggregate table using incrementColumnValue method. This will allow you to give real time impression data to your customers.

If you want to aggregate data by multiple dimensions, you will have to setup one table for each dimension. In order to understand why you need to do this, you need to first understand how HBase row keys work. HBase row key is essentially a byte array. HBase stores all the rows in the lexicographic order of the keys. The key can be made up of more than one information pieces such as period and impression id. If you design your key by combining period and impression id, you can query all the impressions for a given period (hour, minute, day etc). A detailed discussion about this topic is given in Jonathan Gray’s following post – http://devblog.streamy.com/2009/04/23/hbase-row-key-design-for-paging-limit-offset-queries/. You can learn how to design HBase row keys for easy querying.

This method is very fast. In our past experimental product, we were doing 90 incrementColumnValue calls per single web request. We were doing it asynchronously (By spawning a thread from a servlet). With our not so big cluster (1 master, 3 region servers – all Amazon ec2 m1.large instances), HBase took only few milliseconds for 90 calls!

There are few trivial disadvantages to this approach too. But fortunately we never encountered them. You may encounter them if you have a bigger system than ours. First, hot spotting may occur based on your key design. If you have time embedded in your key, all the requests will go to a a region for that time. That means the regions server with that region should be able to take all of the traffic. Again, we never encountered this as our servers were big enough to handle all of our traffic. The scalability is limited to what one region server can handle. If you are updating multiple tables, tables could be on different region servers but still the scalability is limited to what a region server can take. Second disadvantage is that if you want to count impressions by multiple dimensions, you have to create a table for each dimension. As a table in HBase can be queried only by one dimension (This may or may not be seen as a disadvantage – number of tables doesn’t matter that much). That is why we were doing 90 updates per request – impression count by publisher, by time, by advertiser, by country … etc. A third disadvantage – which I think is inherent to any distributed systems – is that your counts may be off by a little. This happens because of errors, or web server shut-downs etc. An alternative approach to this is to run a map reduce job in the night and correct the counts if you are anal about the accuracy. I have seen Google ad sense data getting corrected the next day by a little amount!

At the scale we were using it, this kind of aggregation worked very well for us. If anybody else has their experiences doing aggregation with HBase or any other nosql system, please share it in the comments.

Share and Enjoy:
  • Print
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google Bookmarks
  • Blogplay
  • http://twitter.com/abaranau Alex Baranov

    Just wanted to share another link for the readers: https://github.com/sematext/HBaseHUT. This tool is also meant for serving online stats and other data using HBase and is meant to increase write throughput during updates. Apart from other things it allows much more sophisticated records update logic (than counters).

  • Sonal

    Hi,

    How are you creating your reports from the data in HBase? We are working on a reporting application for HBase called Crux, open sourced at https://github.com/sonalgoyal/crux/ . Do you have something similar in house?

    Sonal

  • Anonymous

     Sonal, we created our own small framework that allows us to design a key with period in it. We wrote small piece of code that will allow us to formulate the key based on the period we are looking for. In our framework one row represents one month of data. Inside the row, we have columns that store data for each day, each  hour. I will take a look at Crux and let you know if it’s similar to ours. Thanks.

  • Sonal

    Sure, look forward to hearing from you. Thanks.

  • John Cohen

    How can I create the top 10 ad impression report? Assuming this is a real-time report so you can not use a map/reduce job because it would add too much overhead.

blog comments powered by Disqus

WhyNosql subscription by Email

Name:
E-Mail Address:

Top Commentators

Individuals who contribute to WhyNoSQL on a regular basis, through commenting, will be rewarded here. When will you be on this list?
  • No commentators.