Last week we participated in the annual Hadoop Summit held in San Jose, CA. When we first met with Hortonworks about the Summit many months back they mentioned this year’s Hadoop Summit would be promoting Reference Architectures from many companies in the Hadoop Ecosystem. This was great to hear as we had previously presented results from a large round of testing on Network and Compute Considerations for Hadoop at Hadoop World 2011 last November and we were looking to do a second round of testing to take our original findings and test/develop a set of best practices around them including failure and connectivity options. Further the set of validation demystifies the one key Enterprise ask “Can we use the same architecture/component for Hadoop deployments?”. Since a lot of the value of Hadoop is seen once it is integrated into current enterprise data models the goal of the testing was to not only define a reference architecture, but to define a set of best practices so Hadoop can be integrated into current enterprise architectures.
Below are the results of this new testing effort presented at Hadoop Summit, 2012. Thanks to Hortonworks for their collaboration throughout the testing.
We started this phase of testing in the lab with Hadoop on 96 Nodes (~768TB of raw disk) in a 3 rack configuration. The next step was to develop a set of tests around the following design considerations:
- Capacity, Scale and Oversubscription
- Management and Visibility
Some of the key findings include:
- Hadoop can fit in a variety of different architectures, but an architecture designed specifically for the Data Center is recommended. While Hadoop can work on a wide range of topologies, job completion times may be negatively impacted if our conclusions regarding design considerations are not taken into account.
- While Hadoop Distributed File System (HDFS) was designed to handle availability concerns through replication, losing connectivity to servers or racks can have a significant impact to performance, especially in small-to-medium sized clusters common in the enterprise. For example: Losing a rack in a 3000+ node cluster is far different than loosing a rack in ~100 node cluster, which is around the average size for an enterprise. This makes availability of the network an important consideration. The loss of a single-homed node due to a port/NIC/cable failure adversely impacted job completion times in our test cluster, thus dual-homing is recommended.
- A non-oversubscribed network, while ideal in theory, typically does not provide performance gain to offset the additional cost, however degree of oversubscription does matter. A more important consideration is buffer. Even the best non-oversubscribed design is still going to cause incast blocking driving buffer demands on the switch during large many-to-one traffic patters in Hadoop.
- A move to 10GE is going to be needed in Hadoop. While it is largely a cost/performance trade-off today keeping many Hadoop deployments at 1GE (or dual 1GE) Hadoop is reaching the limit of 1GE bandwidth. Testing also shows 10GE widening of the available incoming bandwidth to each node dramatically reduces the switch buffer utilization.
- Management and Visibility remain critical in the enterprise. Hadoop job completion needs to be monitored and over the time network characteristics needed to evaluated with tool available for buffer/link/system monitoring.
For more information on Cisco’s Hadoop and Big Data solutions please visit www.cisco.com/go/bigdata