Cassandra, DynamoDB, HBase, Hadoop and Big Data in general
This year’s Hadoop Summit was the biggest ever, with 2200 attendees. Apart from a first-day lunch hiccup (there was no food for vegetarians), everything went smoothly. Storm is getting bigger. Nathan Marz’s talk was as good as his other talks. There was nothing special, but I think the response was noteworthy. Nathan Marz [...]
Amazon recently announced DynamoDB. I have to admit, this time Amazon might have gotten it right! SimpleDB was simply a disaster. But from what I have read so far, DynamoDB looks really promising.
Key HBase community members advise people not to host their HBase clusters on EC2, and they have good reasons for that advice. But in this post I am going to explain why we decided to host our HBase cluster on EC2 and why we continue to host it there. When we began experimenting with [...]
Recently we learned a few interesting lessons about architecting HBase on EC2. Since the lessons are more related to EC2 than to HBase, I decided to post them on my Amazon Web Services blog. For those who are planning to host their HBase/Hadoop systems on EC2, it’s a must read: http://aws-musings.com/hbase-on-ec2-using-ebs-volumes-lessons-learned/
Distributed counters are an important feature that many distributed databases offer. For an ad network, distributed counters matter for many reasons. Real-time ad impression and click data can be used for ad optimization. HBase and Cassandra both support distributed counters. Ultimately, whichever system you choose, scaling distributed counters remains a challenge. It boils [...]
Each HBase region server hosts many regions, possibly hundreds or even thousands. How do you find out which one of them is a hotspot? We saw that CPU usage on one of the region servers was spiking at peak traffic. But the region server had 4 tables (and hundreds of regions) and their access [...]
There are two useful tutorials on the web devoted to this topic (the HBase wiki and Yaan’s blog). But I think both of them missed a few steps. In spite of following the tutorials, I found myself struggling with compiling Thrift and with Python’s module-not-found errors. Hence this attempt.
How do we accomplish real-time reports in a big data system? What if you want to count ad impressions and give real-time reports to your customers? HBase makes aggregation really easy. I am going to tell you how we accomplished aggregation with HBase. The HTable class in HBase client API [...]
We are an ad network. We need to store impressions and clicks. We were evaluating various big data (or NoSQL, whatever you choose to call it) systems for our new project. We had used HBase for the past 8 months in an experimental product and were satisfied with it, but the hype about Cassandra [...]