Cassandra, DynamoDB, HBase, Hadoop and Big Data in general
In: HBase10 Dec 2010
Distributed counters is an important functionality many distributed databases offer. For an ad network distributed counters are important for many reasons. Real time ad impressions and click data can be used for ad optimization. HBase and Cassandra both support distributed counters.
Ultimately, whatever system you may choose, scaling distributed counters remains a challenge. It boils down to a fundamental problem – atomic updates. At a any given time unit, only one process/thread can update a counter. This counter cannot live on multiple servers at the same time. It has to be updated on one node. Thus, there is no such thing as a true distributed counter! You will always be limited by how many operations a node can execute in a given amount of time.
In spite of being a fundamental problem, there are some simple strategies you can employ to alleviate this problem. In this article I plan to discuss some strategies we used to solve this problem.
Break the counter into multiple counters
First strategy is to break the counter, make it more granular. Let’s say we want to update the ad impressions counter for ad that is being served millions of times across thousands of websites. Here, instead of updating one counter per ad, we can choose to update a counter per ad- website combination. Thus, when you want to find out how many ad impressions were served in any given time interval, you will have to sum up all the website impressions for an ad. This will help in distributing the load. In HBase speak, one ad-website combination can be on one region server and another ad-website combination can be on the other region server. HBase allows you to design your key in such a way that it is easy to do partial key queries.
Let’s say that we employ this strategy to find out that your ad is being served on a huge website (such as cnn.com) where an-ad website combination itself is not enough for a node to handle. What do you do then? You employ In Memory Aggregation.
In Memory Aggregation
In most systems, “Real Time” does not have to be an absolute real time. Let’s say you have a cluster of 40 web servers, and a cluster of 10 HBase nodes. All the 40 nodes are updating counters on 3 region servers. What if you keep these counters in the web server memory, accumulate them for 30 seconds, and then dump the counters to HBase? Thousands of requests to Hbase will immediately reduce to 40 requests per 30 second! That’s a huge reduction in load! Your ability to get real time data will go away, and you will always get 30 seconds delayed data. However, most systems can live with such a delay.
Java offers concept of shared memory; hence it’s easy to do In Memory Aggregation with Java. In Java, each process is handled by a thread in the same JVM process. On other platforms, you must have memcache, or a similar system installed on each of the servers as every request is processed by a separate process (no shared memory exist).
This optimization will take you so far, that you will not have to worry about scaling distributed counters unless your web server farms grows beyond thousands of servers! If that happens, you are in the same league as large companies such as Facebook and Google. In that case you wouldn’t need my advice, you would have some brilliant engineers dedicated to solving all your problems.