I was really excited with Amazon’s DynamoDB annoucement. After developing for few days with it and using it in production, I was a bit disappointed.
Many of those disappointments come from comparing our existing HBase with DynamoDB. After this evaluation, we think that we simply cannot replace our HBase installation. We can use DynamoDB for some other lightweight applications.

1) 64KB limit on row size: We have many records in our HBase with data much more than 64KB in one row. We even had some records as big as 100MB. Limiting row size to such a tiny amounts rules out lots of use cases for us instantly. In one of the user forum threads here is what one of the Amazon guys say about the limit :
“64KB cumulative attribute size limit, which isn’t too hard to reach if you have a set of Number values for a single attribute When you have so many values for a set, you should consider switching your table to a hash-range schema.”
I think what he is missing is that there could be genuine cases where switching your table to hash range schema may not be possible. In some cases  single piece of data is bigger than 64KB. You cannot simply change your schema to use more rows instead of more columns in these cases. For example a page crawler may store the entire page content in one field. Somebody might want to store an Image as a field in your NoSQL database. You simply cannot use DynamoDB for such use cases. This is my biggest complaint.

2) Limited Data Types: It doesn’t accept binary data. You have to have strings or numbers or sets of strings or numbers. You cannot store images, byte arrays etc. You can get around it by encoding everything in string using base64 encoding, but base64 encoding produces bigger data. And you are counting your bytes as you cannot hit the 64KB limit!

3) 1MB limit on Querying and Scanning: You cannot get a result bigger than 1MB from Query or Scan operations. You are required to make LastEvaluatedKey call to start from wherever you stopped in the earlier scan request. This is not that bad, but it doesn’t allow you to optimize it for your use cases. In most of our use cases, making one trip for 1MB of data could be too much. Amazon should think about increasing this limit or allowing clients to specify this limit.

DynamoDB is supposed to be scalable. I think these limitation seriously challange the scalability claim. It makes me feel like Amazon cannot make it scalable without imposing these limitations.

4) Limited Capability of Query’s Comparision Operators: You cannot use CONTAINS, NOT_NULL and some other operators when you use Query features of DynamoDB. And the documentation may be wrong! Please read this thread for more information:
https://forums.aws.amazon.com/thread.jspa?threadID=85511&tstart=0

You can always use ‘Scan’ instead of ‘Query’ but then you will be forced to go through each and every record. It’s not necessarily any worse than any of the existing NoSQL solution. But since they offered Query mechanism (in additon to Scan) operation, I was little disapointed.

5) Time Required for Creation of a Table or API Call to know when the table is ready: When you create a table programatically (or even using AWS Console), the table doesn’t become available instantly. The call returns before the table is ready. This means you cannot create a table and use it instantly. Sometimes, we use dynamically created tables. I can undestand why it may take time, but it would be nice if they have an api call that can tell us when the table is ready.

Overall, I am really impressed by the simplicity of DynamoDB. The APIs (even though I don’t like the way they are designed) are pretty simple and schema modeling is also very simple. The forums have started buzzing and I think more and more people are trying DynamoDB out. What I will be watching is whether the points discussed above are preventing people from switching their existing NoSQL solution to Amazon’s managed DynamoDB. At GumGum, the first three issues are blockers and unless they are resolved, we are less likely to switch from HBase to DynamoDB.

 

Amazon recently announced DynamoDB. I have to admit, this time Amazon might have gotten it right! SimpleDB was simply a disaster. But from whatever I have read so far DynamoDB looks really promising.
Read the rest of this entry »

Key HBase community members advise people not to host their HBase cluster on EC2. And they have good reasons for advising so. But in this post I am going to explain why we decided to host our HBase cluster on EC2 and why we continue to host it on EC2.

When we began experimenting with HBase in July of 2009, HBase was fairly new and we were experimenting with Hadoop and HBase to learn how these technologies could help us solve our problems. By then we didn’t have big data but being an ad network, we wanted the ability to scale horizontally as our network grew. One new publisher could take our traffic to a new level. We were a startup with just a few engineers.
Read the rest of this entry »

Hadoop Summit is always interesting for Hadoopers. You get to learn the latest and greatest in Hadoop world and meet the people behind projects in the Hadoop ecosystem. In this post, I have tried to share my takeaways.

Currently there are many distributions of Hadoop floating around. Besides the main Apache Hadoop distribution, there is Cloudera, Yahoo, IBM and even Amazon uses there own distribution for their Elastic Map Reduce Service. All these distributions were born becuase the main Apache distribution is not good enough. Yahoo is now launching a separate company – Hortonworks to fix this problem. It is essentially going to be a Cloudera competitior but before they start providing support to clients, they are going to fix the main Apache Hadoop distribution. This is certainly a good news for the Hadoop community. Not only Cloudera will have more competition (and hence more support options for Hadoop), but also there will be a company focusing on making the main Apache distribution robust.
Read the rest of this entry »

Recently we learned few interesting lessons about architecting HBase on EC2. Since the lessons we learned are more related to EC2 than HBase, I decided to post it on my Amazon Web Services related blog. For those who are planning to host their HBase/Hadoop systems on EC2, it’s a must read – http://aws-musings.com/hbase-on-ec2-using-ebs-volumes-lessons-learned/

Distributed counters is an important functionality many distributed databases offer. For an ad network distributed counters are important for many reasons. Real time ad impressions and click data can be used for ad optimization. HBase and Cassandra both support distributed counters.

Ultimately, whatever system you may choose, scaling distributed counters remains a challenge. It boils down to a fundamental problem – atomic updates. At a any given time unit, only one process/thread can update a counter. This counter cannot live on multiple servers at the same time. It has to be updated on one node. Thus, there is no such thing as a true distributed counter! You will always be limited by how many operations a node can execute in a given amount of time.
Read the rest of this entry »

Each HBase region server hosts many regions – possibly hundreds or even thousands. How do you find out which one of them is a hotspot? We saw that CPU on one of the region server was shooting up at peak traffic. But the region server had 4 tables (and hundreds of regions) and their access patterns were similar. Each web request was accessing all four of the tables. So how do you find out which table is the bottleneck? Which region(s) is causing hotspotting?
Read the rest of this entry »

There are two useful tutorials (HBase wiki and Yaan’s blog) on the web devoted to this topic. But I think both of them missed few steps. In spite of following the tutorials, I found myself struggling with compiling thrift and python’s No module found errors. Hence this attempt.
Read the rest of this entry »

Aggregation with HBase

In: HBase

12 Jun 2010

How do we accomplish real time reports in a big data system? What if you want to count ad impressions and give real time reports to your customers? HBase makes it really easy to accomplish Aggregation. I am going to tell you how we accomplished aggregation with HBase.

The HTable class in HBase client API contains an incredibly useful method – incrementColumnValue. This method increments a column value by a given amount. If the row and column does not exist, the row and the column will be added and the amount specified will be added to zero. This method is very useful because it essentially does one read and one write without you having to make two calls to HBase. It reads the existing value, updates it to a new value and returns back the updated value.
Read the rest of this entry »

We are an ad network. We need to store impressions and clicks. We were evaluating various big data (or nosql -what ever you may choose to call it) systems for our new project. We had used HBase for past 8 months in an experimental product and are satisfied with it but the hype about Cassandra was so much that we decided to give it a shot. I think for some reason Cassandra team has succeeded in marketing itself very well. You will find even non technical people (such as VCs, CEOs, Product Managers) in Santa Monica recommending Cassandra to each other. Read the rest of this entry »

WhyNosql subscription by Email

Name:
E-Mail Address:

Top Commentators

Individuals who contribute to WhyNoSQL on a regular basis, through commenting, will be rewarded here. When will you be on this list?
  • Kitchen Cupboard ...