Cassandra vs HBase

In: Cassandra|HBase

3 Jun 2010

We are an ad network, and we need to store impressions and clicks. We were evaluating various big data (or NoSQL, whatever you choose to call them) systems for a new project. We had used HBase for the past 8 months in an experimental product and were satisfied with it, but the hype around Cassandra was so strong that we decided to give it a shot. For some reason, the Cassandra team has succeeded in marketing itself very well: you will find even non-technical people (VCs, CEOs, product managers) in Santa Monica recommending Cassandra to each other.

Our first impression of Cassandra was a good one. Its web page looks much more professional and polished than HBase's, the website is well documented, and it is very easy to get up and running. It literally took me 5 minutes to set it up.

The real challenge was to understand Cassandra's data model and map our use cases onto it. It was very clear to us how we would do this with HBase, as we have solid experience with it. Even though Cassandra inherits the same data model from BigTable, there are some fundamental differences between Cassandra and HBase. I have tried to tabulate them below:

Tables and keyspaces
Cassandra: Lacks the concept of a table. The documentation tells you it is not common to have multiple keyspaces, which means you have to share a single keyspace across the cluster. Furthermore, adding a keyspace requires a cluster restart!
HBase: The concept of a table exists, and each table has its own key space. This was a big win for us: you can add and remove tables as easily as in an RDBMS.

Keys
Cassandra: Uses string keys. It is very common to use UUIDs as keys, and you can use TimeUUID if you want your data sorted by time.
HBase: Uses binary keys. It is common to combine several items into a composite key, which means you can search by more than one component in a given table.

Hotspotting
Cassandra: Even if you use TimeUUID, Cassandra load-balances client requests, so the hotspotting problem does not occur. (All client requests going to one server in a cluster is known as the hotspotting problem.)
HBase: If your key's first component is a time or a sequential number, hotspotting occurs: all new keys are inserted into one region until it fills up.

Column sorting
Cassandra: Offers sorting of columns.
HBase: Does not have sorting of columns.

Supercolumns
Cassandra: The supercolumn concept lets you design very flexible, very complex schemas.
HBase: Does not have supercolumns, but you can build a supercolumn-like structure, since column names and values are binary.

Counters and consistency
Cassandra: Does not have any convenience method to increment a column value. In fact, the very nature of eventual consistency makes it difficult to update a record and read it immediately afterwards; you have to ensure that R + W > N to achieve strong consistency.
HBase: Consistent by design, and offers a nice convenience method to increment counters. Very well suited for data aggregation.

MapReduce
Cassandra: MapReduce support is new, and you need a Hadoop cluster to run it. Data is transferred from the Cassandra cluster to the Hadoop cluster, so it is not suitable for large MapReduce jobs.
HBase: MapReduce support is native. HBase is built on Hadoop, so data does not get transferred.

Operations
Cassandra: Comparatively simpler to maintain if you do not need Hadoop.
HBase: Comparatively complicated, as it has many moving pieces: ZooKeeper, Hadoop, and HBase itself.

Java API
Cassandra: Does not have a native Java API as of now, and no Javadoc. Even though it is written in Java, you have to use Thrift to communicate with the cluster.
HBase: Has a nice native Java API and feels much more like a Java system than Cassandra does. Being a Java shop, this was important to us. HBase also has a Thrift interface for other languages.

Single point of failure
Cassandra: No master server, hence no single point of failure.
HBase: Although there is a master server, HBase itself does not depend on it heavily; the cluster can keep serving data even if the master goes down. The Hadoop NameNode, however, is a single point of failure.
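To make the hotspotting point concrete, here is a minimal sketch of one common workaround on the HBase side: prefixing a composite binary row key with a small hash-derived salt so that time-ordered writes spread across regions instead of piling onto one. The field names (`campaign_id`, bucket count) are hypothetical, not from our actual schema:

```python
import hashlib
import struct

NUM_SALT_BUCKETS = 16  # assumed number of buckets to spread writes across


def make_row_key(campaign_id: str, timestamp_ms: int) -> bytes:
    """Build a binary row key: salt | campaign_id | timestamp.

    The 1-byte salt is derived from the campaign id, so keys for fresh
    timestamps no longer all sort together at the end of the key space;
    within one campaign, keys still sort by time.
    """
    salt = hashlib.md5(campaign_id.encode()).digest()[0] % NUM_SALT_BUCKETS
    return bytes([salt]) + campaign_id.encode() + struct.pack(">q", timestamp_ms)
```

The trade-off is that a time-range scan must now fan out across all salt buckets, which is why this trick suits write-heavy workloads like impression logging.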
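The R + W > N condition mentioned above is simple to state: with a replication factor of N, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one replica only when R + W > N. A tiny sketch of the check (the example consistency levels are typical settings, not a Cassandra API):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Return True if read and write quorums must overlap (R + W > N),
    the condition under which a read is guaranteed to see the latest
    acknowledged write."""
    return r + w > n


# With replication factor N = 3:
# write ONE / read ONE  -> quorums may miss each other (eventual consistency)
# write QUORUM (2) / read QUORUM (2) -> 2 + 2 > 3, reads see the latest write
```

This is why reading your own write immediately after an update is only safe in Cassandra when you pay for quorum-level reads and writes.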

After comparing the data models and features in this way, HBase was the clear winner for us. In my opinion, if consistency is what you need, HBase is the obvious choice. Furthermore, native MapReduce, the concept of a table, and a simpler schema that can be modified without a cluster restart are advantages you cannot ignore. HBase is also a much more mature platform. When people cite Twitter and Facebook as Cassandra users, they forget that the same organizations use HBase too. In fact, one of the HBase committers was recently hired by Facebook, which clearly shows their interest in HBase.

So in short, we will root for team HBase!

  • helwr

you have to look at CRUD (read/write) throughput and latency in addition to this qualitative checklist; run this benchmark on your cluster: research.yahoo.com/files/ycsb.pdf

  • http://php-app-engine.com/2010/nosql/cassandra-and-hbase-compared/ Cassandra and HBase Compared. › PHP App Engine

    [...] Cassandra and HBase Compared: Another comparison of Cassandra and HBase based on an ad network company requirements:. . We are [...]

  • Jeremy Hanna

    You don't have to run your hadoop cluster separately – if you run hadoop on your cassandra nodes, hadoop will be able to use the same data locality it uses for hdfs/hbase with the mapreduce tasks.

  • vpuranik

Jeremy, agreed. The only problem with that is that the whole system then becomes at least as complex as the HBase + Hadoop combination. When we chose HBase, it was implicitly understood that Hadoop is an absolute requirement, but with Cassandra that is not clear. When you learn HBase, you are forced to learn Hadoop; with Cassandra, if you skipped learning Hadoop, it becomes difficult to adjust to it later.

  • Denis Orlov

Membase (membase.org) is probably the most promising technology we've seen for our purposes @ Red Aril. We've looked @ HBase, Cassandra, Mongo and Redis; for ad serving, none of them even came close. We run analytics on a Hadoop cluster and feed result sets back to Membase. Consistently seeing <0.1 ms latency numbers … standard deviation approaching 0.

  • Ashwin Jayaprakash

    Well…twitter claims to have stopped using Cassandra (http://nosql.mypopescu.com/post/781834027/cassa…)

    The SuperColumn concept in Cassandra is really cool though.

    Membase looks like Voldemort+BDB minus the nifty atomic ops like – add/incr/decr etc.

  • Jeremy Hanna

    vpuranik – I was just referring to data locality. You don't have to move your data in order to do it. That's one of the benefits of how it's integrated with Hadoop.

    wrt complicated setup, I partially agree – if you're coming from Cassandra only and want to run MapReduce over its data, there is some learning that needs to take place and some complexity added to the cluster by adding task tracker and datanode (for distributed cache) daemons to your Cassandra nodes. But if you are familiar with Hadoop MapReduce and already have a Hadoop cluster, then it's not a big deal.

    However, I would argue that Cassandra+MapReduce is less complicated than running HBase on top of HDFS – there's no special purpose Cassandra node to worry about or need for ZooKeeper.

    That said, if you're happy with HBase for your needs and like the integration with Hadoop, awesome. If you're using Cassandra for particular qualities it has and you want to use MapReduce for analytics, it's not that big of a deal if you're already familiar with using Hadoop's MapReduce.

  • http://twitter.com/JamesMPhillips James Phillips

    membase actually does do atomic incr/decr

  • Ashwin Jayaprakash

    I meant Membase = V+BDB with nifty atomic ops like – add/incr/decr…

  • http://twitter.com/faltering Brandon Williams

    Tables:

    A table is analogous to a keyspace. There's nothing to stop you from using multiple keyspaces, but in practice one keyspace per application works well and most people don't need multiple keyspaces. In 0.7, you will be able to create, modify and remove keyspaces without a restart via an API. It's already in trunk.

    String keys:

    TimeUUID is a column name, not a row key. Row keys are strings in 0.6, and will be binary in 0.7 (they already are in trunk.)

    Incrementing:

Incrementing is very easy when your system isn't fully decentralized, but Cassandra is, so this isn't easy to do. Work is being done by Digg for 0.7 to add vector clocks, which will give users a convenient way to increment.

    MapReduce:

    As Jeremy pointed out, you can have locality. Of course you don't get locality if you run Hadoop on a different set of machines; it's logically impossible.

    Native Java API:

There is a native Java API, but it's not recommended to use it as it can change quite often. See contrib/client_only.

  • Devarajaswami

    But will Hadoop be able to assign map/reduce tasks to the same node as the file system blocks/regions/whatever that the tasks need to operate on?

    For this, Hadoop needs to understand how Cassandra assigns individual blocks/regions, and this is not built into Cassandra.

HBase has built-in support for Hadoop data locality, because HBase exposes each table region as a Hadoop input split. Each Hadoop task therefore runs on the same physical node as the part of the HBase table it needs to operate on, which makes things really fast.

Does this node-level task <-> table-region alignment hold for Cassandra with Hadoop? If not, then even if you install Hadoop on the same nodes as Cassandra, things will be slow, because rows have to be passed from the Cassandra node where they reside to the Hadoop node where the map or reduce task runs.

  • Jeremy Hanna

    Devarajaswami: yes, Cassandra gives Hadoop the location of its data by extending the Hadoop MapReduce RecordReader class – http://hadoop.apache.org/common/docs/current/ap… – which has a method called getLocations. That returns the locations for a particular block of data or for the particular record.

    As a result, Hadoop is able to prefer running the task on the same machine as the data is, even when the data is stored in Cassandra.

  • http://damnhandy.com/ Ryan J. McDonough

    “There is a native java api, but it’s not recommend to use it as it can change quite often. ”

If the interfaces are in a constant state of flux and the use of said interfaces is not recommended, then Cassandra does NOT have a native API ;) Just sayin'.


  • http://twitter.com/tovbinm Matthew Tovbin

This article is NOT worth reading. Unfortunately, the person who wrote it does not have enough experience to discuss either Cassandra or HBase; he mostly bases his conclusions on who got hired by Facebook. Quote: "Cassandra does not have a native java api as of now. No java doc. Even though written in Java, you have to use Thrift to communicate with the cluster." Quote: "Cassandra does not have any convinience method to to increment a column value."

  • Ashu Net

Sorting: sorting of columns in HBase is very much possible. All HBase data model operations return data in sorted order: first by row, then by column family, followed by column qualifier, and finally by timestamp (sorted in reverse, so the newest records are returned first).

