Cassandra, DynamoDB, HBase, Hadoop and Big Data in general
We are an ad network. We need to store impressions and clicks. We were evaluating various big data (or nosql -what ever you may choose to call it) systems for our new project. We had used HBase for past 8 months in an experimental product and are satisfied with it but the hype about Cassandra was so much that we decided to give it a shot. I think for some reason Cassandra team has succeeded in marketing itself very well. You will find even non technical people (such as VCs, CEOs, Product Managers) in Santa Monica recommending Cassandra to each other.
First impression of Cassandra was a good one. Their web page looks much more professional and nicer than HBase. It’s very easy to get it up and running. The website is well documented. Literally, it took me 5 minutes to set it up and get it running.
The real challenge was to understand Cassandra data model and try to implement it for our use cases. It was very clear to us how we will do it with HBase as we have a pretty good experience with it. Even though Cassandra inherits the same data model from BigTable, there are some fundamental differences between Cassandra and HBase. I have tried to tabulate the differences between the two systems below:
|Lacks concept of a Table. All the documentation will tell you that it's not common to have multiple keyspaces. That means you have to share a key space in a cluster. Furthermore adding a keyspace requires a cluster restart!||Concept of Table exists. Each table has it's own key space. This was a big win for us. You can add and remove table as as easily as a RDBMS.|
|Uses string keys. Very common to use uuids as the keys. You can use TimeUUID if you want your data to be sorted by time.||Uses binary keys. It's common to combine three different items together to form a key. This means you can search by more than one key in a give table.|
|Even if you use TimeUUID, as Cassandra load balances client requests, hot spotting problem won't occur. (All the client requests going to one server in a cluster is known as a hot spot problem).||If your key's first component is time or a sequential number, then hotspotting occurs. All of the new keys will be inserted to one region until it fills up (hence by causing a hotspotting problem).|
|Offers sorting of columns.||Does not have sorting of columns.|
|Concept of Supercolumn allows you to design very flexible, very complex schemas.||Does not have supercolumns. But you can design a super column like structure as column names and values are binary.|
|Does not have any convinience method to to increment a column value. In fact the vary nature of eventual consistency makes it difficult to update/write a record and read it instantly after the update. You have to make sure that R + W > N to achive strong consitency.||By design consitent. Offers a nice convinience method to increment counters. Very much suitable for data aggregation.|
|Map Reduce support is new. You will need a hadoop cluster to run it. Data will be tranferred from Cassandra cluster to the hadoop cluster. No suitable for running large data map reduce jobs.||Map Reduce support is native. HBase is built on Hadoop. Data does not get transferred.|
|Comparatively simpler to maintain if you don't have to have hadoop.||Comparatively complicated as you have it has many moving pieces such as Zookeeper, Hadoop and HBase itself.|
|Does not have a native java api as of now. No java doc. Even though written in Java, you have to use Thrift to communicate with the cluster.||Has a nice native java api. Feels much more java system than Cassandra. Being a java shop, it was important for us. HBase has a thrift interface for other laguages too.|
|No master server, hence no single point of failure.||Although there exists a concept of a master server, HBase itself does not depend on it heavily. HBase cluster can keep serving data even if the master goes down. Hadoop namenode is a single point of failure.|
After comparing the data model and features in this way, HBase was the clear winner for us. In my opinion, if consistency is what you need, HBase is a clear choice. Furthermore native map reduce, concept of a table and simpler schema that can be modified without a cluster restart are a big plus you cannot ignore. HBase is a much more mature platform. When people site Twitter, Facebook using Cassandra, they forget that the same organizations are using HBase too. In fact one of the committers of HBase was recenlty hired by Facebook which clearly shows their interest in HBase.
So in short, we will root for team HBase!