A little over a month ago we had the chance to present a session with Eric Sammer of Cloudera on Designing Hadoop for the Enterprise Data Center, and to share our findings, at Strata + Hadoop World 2012.

Looking back, we started this initiative in early 2011, when demand for Hadoop was on the rise and we began to notice a lot of confusion among our customers about what Hadoop would mean for their data center infrastructure. This led us to our first presentation at Hadoop World 2011, where we shared an extensive testing effort with the goal of characterizing what happens when you run a Hadoop MapReduce job. Further, we illustrated how different network and compute considerations change these characteristics. As Hadoop deployments gained traction in the enterprise, we saw a need to develop a network reference architecture for Hadoop. This led us to another round of testing, concluded earlier this year and presented at Hadoop Summit, which examined design considerations such as architecture, availability, capacity, scale and management.

Finally, this brings us to last month and our presentation at Strata + Hadoop World 2012. We met with Cloudera in the months leading up to the event and discussed what we could share with the Hadoop community. We reviewed all the previous rounds of testing and concluded that, by combining them with customer experiences and another round of testing that examined multi-tenant environments, we could put together a talk that really addressed the fundamental design considerations for Hadoop in the Enterprise Data Center.

We examined the network traffic considerations for Hadoop in the data center in depth to show why 10GE to the server is strongly recommended, why multi-homing servers is very important, and why proper switch buffering is important to consider.

We explored various multi-tenant environments, focusing on application multi-tenancy (such as Hadoop + HBase), which requires a closer look at traffic patterns, as opposed to job-based or department-based multi-tenancy, which requires scheduling or permissions considerations. We actually started with a close look at HBase by itself, as it combines a need for low-latency reads with HDFS replication events that cause heavy congestion (major compactions). The traditional answer to congestion is to offer more buffers. However, there are alternatives for managing congestion, because adding more buffers to help with one large congestion scenario may have adverse effects on traffic that is latency sensitive. Thus adding buffer doesn't always help. In these scenarios a simple QoS setting can prioritize the North/South traffic of reads and updates over the East/West traffic of HDFS replication. With this simple configuration we demonstrated a dramatic improvement in read performance of up to 45% during a major compaction event. Secondly, we looked at combining HBase and Hadoop on the same cluster; the two are largely kept separate today. This scenario showed a similar result, with a 60% read improvement when applying a simple QoS policy prioritizing reads and updates.
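To make the idea of prioritizing reads and updates over replication traffic concrete, here is a minimal sketch in Python. It is a toy model, not a switch configuration and not the method used in our testing: the traffic class names, the `drain` function and the sample burst are all hypothetical, and the model assumes a single egress port with every packet already queued. It only shows how strict priority changes the order in which packets leave the port.

```python
from collections import deque

# Hypothetical traffic class labels, chosen for illustration only.
HIGH = "read_update"       # North/South: client reads and updates
LOW = "hdfs_replication"   # East/West: bulk HDFS replication


def drain(packets, prioritize):
    """Return the order in which packets leave a single egress port.

    packets: list of (traffic_class, packet_id) in arrival order,
             assumed to be already queued when draining starts.
    prioritize: if True, always serve the HIGH queue first (strict
                priority); if False, serve in arrival order (FIFO).
    """
    if not prioritize:
        return [packet_id for _, packet_id in packets]

    high = deque(pid for cls, pid in packets if cls == HIGH)
    low = deque(pid for cls, pid in packets if cls == LOW)
    order = []
    while high or low:
        # Replication packets are only sent when no read/update is waiting.
        order.append(high.popleft() if high else low.popleft())
    return order


# A replication burst with two client reads caught in the middle of it.
burst = [
    (LOW, "repl1"), (LOW, "repl2"), (LOW, "repl3"),
    (HIGH, "get1"), (LOW, "repl4"), (HIGH, "get2"),
]

fifo_order = drain(burst, prioritize=False)
qos_order = drain(burst, prioritize=True)

print("FIFO:", fifo_order)
print("QoS: ", qos_order)
```

In the FIFO case the two reads wait behind the replication packets that arrived ahead of them, while with strict priority they leave first. A larger buffer would only lengthen the FIFO queue ahead of the reads, which is the intuition behind preferring a QoS policy here.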

The benefits of Big Data come from close integration with existing infrastructure and data. With an understanding of the traffic patterns and a few simple design considerations, Hadoop can and should be integrated into data center infrastructure today, as easily and efficiently as any other data center application.

For complete information on our Big Data solutions, please visit www.cisco.com/go/bigdata.