Hadoop Summit 2012

In: Hadoop|HBase

19 Jun 2012

This year’s Hadoop Summit was the biggest ever. There were 2200 people. Barring the first day lunch hiccup for not having food for vegetarians, everything went on smoothly.

Storm is getting bigger. Nathan Marz’s talk was as good as his other talks. There was nothing special, but I think the response was noteworthy. Nathan Marz was a celebrity at the conference for the small time I saw him there. Every single person I talked to have either heard about Storm or is evaluating Storm. Some of them are even using Storm already! Needless to say, GumGum is going to put Storm in production pretty soon!

Low Latency OLAP with HBase was another popular session. Andrei Dragomir from Adobe explained their OLAP strategies with what they call it ‘SaasBase’. SaasBase is their internal software for aggregation functions such as group by, count, sum, avg etc. Many people couldn’t get in the session. They let 7 people in after 15 minutes and I was one of them. The slides will eventually be up at Hadoop Summit site, but here is the talk at HBaseCon – http://www.hbasecon.com/sessions/low-latency-olap-with-hbase/. It’s almost the same talk. We have been doing that kind of aggregation since 2010. DrawnToScale  is building a similar database – Spire. Spire will allow you to query HBase with SQL like language. Sadly, it’s not open source. I have also heard about an open source project called Crux that helps you query and visualise data from HBase. I haven’t had a chance to play with it yet.

Mohammad Sabah’s talk on using machine learning techniques on social data at Netflix was very good. I am not a data scientist and I didn’t understand the talk fully. At Netflix they are using Markov Chains to do pretty cool stuff.

Todd Lipcon’s How to Optimize MR jobs was great too. He said that the most Map reduce jobs are are not cpu bound. He talked about how io.sort.mb setting can make a big difference and also mentioned that you won’t have to worry about this setting in Hadoop 2.0!

I was little bit disappointed with Richard Cole’s talk on best practices running Hadoop in the cloud. Most of the tips he gave were not that great and I already knew about them. We have a cluster running on AWS cloud and we have been using EMR for past two years.

Twitter’s P. Oscal Boykin talked about their DSL for Hadoop – Scalding. It’s a Scala based language over Cascading. I am not a big fan of Scala (yet). Pig and Hive being already mature, I couldn’t understand the exact problem they are trying to solve. Now a days there are many choices available if you don’t want to think in Map Reduce.

I also had a chance to attend the first ever Kafka Users group at Linked In. Kafka is a distributed pub sub messaging system commonly used for log aggregation. LinkedIn built it and open sourced it. Kafka together with Storm is a lethal combination. Storm can be used to consume Kafka topics for real time aggregations (and so many other use cases). Good news is that the video of the meeting is already up (Thanks to LinkedIn folks) here – https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations

Overall the Hadoop Summit was a great experience with lots of learning. If you are working in this space, you should definitely try to attend it next year.

I was really excited with Amazon’s DynamoDB annoucement. After developing for few days with it and using it in production, I was a bit disappointed.
Many of those disappointments come from comparing our existing HBase with DynamoDB. After this evaluation, we think that we simply cannot replace our HBase installation. We can use DynamoDB for some other lightweight applications.

1) 64KB limit on row size: We have many records in our HBase with data much more than 64KB in one row. We even had some records as big as 100MB. Limiting row size to such a tiny amounts rules out lots of use cases for us instantly. In one of the user forum threads here is what one of the Amazon guys say about the limit :
“64KB cumulative attribute size limit, which isn’t too hard to reach if you have a set of Number values for a single attribute When you have so many values for a set, you should consider switching your table to a hash-range schema.”
I think what he is missing is that there could be genuine cases where switching your table to hash range schema may not be possible. In some cases  single piece of data is bigger than 64KB. You cannot simply change your schema to use more rows instead of more columns in these cases. For example a page crawler may store the entire page content in one field. Somebody might want to store an Image as a field in your NoSQL database. You simply cannot use DynamoDB for such use cases. This is my biggest complaint.

2) Limited Data Types: It doesn’t accept binary data. You have to have strings or numbers or sets of strings or numbers. You cannot store images, byte arrays etc. You can get around it by encoding everything in string using base64 encoding, but base64 encoding produces bigger data. And you are counting your bytes as you cannot hit the 64KB limit!

3) 1MB limit on Querying and Scanning: You cannot get a result bigger than 1MB from Query or Scan operations. You are required to make LastEvaluatedKey call to start from wherever you stopped in the earlier scan request. This is not that bad, but it doesn’t allow you to optimize it for your use cases. In most of our use cases, making one trip for 1MB of data could be too much. Amazon should think about increasing this limit or allowing clients to specify this limit.

DynamoDB is supposed to be scalable. I think these limitation seriously challange the scalability claim. It makes me feel like Amazon cannot make it scalable without imposing these limitations.

4) Limited Capability of Query’s Comparision Operators: You cannot use CONTAINS, NOT_NULL and some other operators when you use Query features of DynamoDB. And the documentation may be wrong! Please read this thread for more information:
https://forums.aws.amazon.com/thread.jspa?threadID=85511&tstart=0

You can always use ‘Scan’ instead of ‘Query’ but then you will be forced to go through each and every record. It’s not necessarily any worse than any of the existing NoSQL solution. But since they offered Query mechanism (in additon to Scan) operation, I was little disapointed.

5) Time Required for Creation of a Table or API Call to know when the table is ready: When you create a table programatically (or even using AWS Console), the table doesn’t become available instantly. The call returns before the table is ready. This means you cannot create a table and use it instantly. Sometimes, we use dynamically created tables. I can undestand why it may take time, but it would be nice if they have an api call that can tell us when the table is ready.

Overall, I am really impressed by the simplicity of DynamoDB. The APIs (even though I don’t like the way they are designed) are pretty simple and schema modeling is also very simple. The forums have started buzzing and I think more and more people are trying DynamoDB out. What I will be watching is whether the points discussed above are preventing people from switching their existing NoSQL solution to Amazon’s managed DynamoDB. At GumGum, the first three issues are blockers and unless they are resolved, we are less likely to switch from HBase to DynamoDB.

 

Amazon recently announced DynamoDB. I have to admit, this time Amazon might have gotten it right! SimpleDB was simply a disaster. But from whatever I have read so far DynamoDB looks really promising.
Read the rest of this entry »

Key HBase community members advise people not to host their HBase cluster on EC2. And they have good reasons for advising so. But in this post I am going to explain why we decided to host our HBase cluster on EC2 and why we continue to host it on EC2.

When we began experimenting with HBase in July of 2009, HBase was fairly new and we were experimenting with Hadoop and HBase to learn how these technologies could help us solve our problems. By then we didn’t have big data but being an ad network, we wanted the ability to scale horizontally as our network grew. One new publisher could take our traffic to a new level. We were a startup with just a few engineers.
Read the rest of this entry »

Hadoop Summit is always interesting for Hadoopers. You get to learn the latest and greatest in Hadoop world and meet the people behind projects in the Hadoop ecosystem. In this post, I have tried to share my takeaways.

Currently there are many distributions of Hadoop floating around. Besides the main Apache Hadoop distribution, there is Cloudera, Yahoo, IBM and even Amazon uses there own distribution for their Elastic Map Reduce Service. All these distributions were born becuase the main Apache distribution is not good enough. Yahoo is now launching a separate company – Hortonworks to fix this problem. It is essentially going to be a Cloudera competitior but before they start providing support to clients, they are going to fix the main Apache Hadoop distribution. This is certainly a good news for the Hadoop community. Not only Cloudera will have more competition (and hence more support options for Hadoop), but also there will be a company focusing on making the main Apache distribution robust.
Read the rest of this entry »

Recently we learned few interesting lessons about architecting HBase on EC2. Since the lessons we learned are more related to EC2 than HBase, I decided to post it on my Amazon Web Services related blog. For those who are planning to host their HBase/Hadoop systems on EC2, it’s a must read – http://aws-musings.com/hbase-on-ec2-using-ebs-volumes-lessons-learned/

Distributed counters is an important functionality many distributed databases offer. For an ad network distributed counters are important for many reasons. Real time ad impressions and click data can be used for ad optimization. HBase and Cassandra both support distributed counters.

Ultimately, whatever system you may choose, scaling distributed counters remains a challenge. It boils down to a fundamental problem – atomic updates. At a any given time unit, only one process/thread can update a counter. This counter cannot live on multiple servers at the same time. It has to be updated on one node. Thus, there is no such thing as a true distributed counter! You will always be limited by how many operations a node can execute in a given amount of time.
Read the rest of this entry »

Each HBase region server hosts many regions – possibly hundreds or even thousands. How do you find out which one of them is a hotspot? We saw that CPU on one of the region server was shooting up at peak traffic. But the region server had 4 tables (and hundreds of regions) and their access patterns were similar. Each web request was accessing all four of the tables. So how do you find out which table is the bottleneck? Which region(s) is causing hotspotting?
Read the rest of this entry »

There are two useful tutorials (HBase wiki and Yaan’s blog) on the web devoted to this topic. But I think both of them missed few steps. In spite of following the tutorials, I found myself struggling with compiling thrift and python’s No module found errors. Hence this attempt.
Read the rest of this entry »

Aggregation with HBase

In: HBase

12 Jun 2010

How do we accomplish real time reports in a big data system? What if you want to count ad impressions and give real time reports to your customers? HBase makes it really easy to accomplish Aggregation. I am going to tell you how we accomplished aggregation with HBase.

The HTable class in HBase client API contains an incredibly useful method – incrementColumnValue. This method increments a column value by a given amount. If the row and column does not exist, the row and the column will be added and the amount specified will be added to zero. This method is very useful because it essentially does one read and one write without you having to make two calls to HBase. It reads the existing value, updates it to a new value and returns back the updated value.
Read the rest of this entry »

WhyNosql subscription by Email

Name:
E-Mail Address:

Top Commentators

Individuals who contribute to WhyNoSQL on a regular basis, through commenting, will be rewarded here. When will you be on this list?
  • No commentators.