At Hadoop Summit 2013, I presented “The Data Center and Hadoop”, which built upon the past two years of testing the effects of Hadoop on data center infrastructure. What makes Hadoop an important framework to study in the data center is that it is a distributed system combining a distributed file system (HDFS) with an execution framework (Map/Reduce). It also serves as a foundation for other systems, such as real-time key/value stores (HBase), among many other possibilities. Each comes with its own set of infrastructure requirements, including both throughput-sensitive and latency-sensitive components. In the data center, understanding how all these components work together is key to optimized deployments.
After studying many of these components and their effects, the very data we were analyzing became the topic of many of our discussions. We combined application performance data, application logs, compute data and network data to build a complete picture of what is happening in the data center.
With the advent of programmable networks (aka “Software Defined Networking”), it is important not only to make the network more application aware, but also to know where and how to analyze the data and make the right connections between the application and the network.
First, we looked at this from a pure operational and troubleshooting perspective. One key finding was that throughput-sensitive traffic had a direct impact on latency-sensitive traffic. For example, running a Map/Reduce job next to an HBase read request can cause large spikes in read latency due to buffer (queue) build-up on the network. To understand these scenarios, it is important to correlate the actual source of the queueing (the Map/Reduce job) with any other traffic that may be present at the same time. To do this, we took advantage of the programmability of the Nexus switches to provide, on the switch itself, a high-resolution view of the buffer occupancy caused by the running Map/Reduce jobs, along with details of the jobs themselves. We accomplished all of this through the Nexus Python framework we made available at github.com/datacenter.
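To illustrate the correlation idea (this is not the Nexus Python framework itself), here is a minimal sketch in plain Python. It assumes you already have timestamped buffer occupancy samples and a list of job phases with start and end times on the same clock; the function names, the job names, and the data layout are all hypothetical.

```python
from collections import defaultdict


def active_jobs(ts, jobs):
    """Return the names of all job phases running at timestamp ts.

    jobs is a list of (name, start, end) tuples; the layout is an
    assumption made for this sketch.
    """
    return [name for name, start, end in jobs if start <= ts <= end]


def peak_buffer_by_job(samples, jobs):
    """Attribute each buffer sample to the job phases active at that time.

    samples is a list of (timestamp, buffer_cells_used) tuples.
    Returns {job_name: highest buffer occupancy seen while it ran}.
    Samples taken while no job was running are ignored.
    """
    peaks = defaultdict(int)
    for ts, cells in samples:
        for name in active_jobs(ts, jobs):
            peaks[name] = max(peaks[name], cells)
    return dict(peaks)


# Hypothetical data: two overlapping phases and four buffer samples.
jobs = [("example-map", 0, 10), ("example-reduce", 8, 20)]
samples = [(1, 100), (9, 400), (15, 700), (25, 50)]
print(peak_buffer_by_job(samples, jobs))
```

The sample at timestamp 9 falls inside both phases, so it counts toward both. That overlap is exactly the case where a latency-sensitive request can be hurt by someone else's throughput-sensitive traffic, and it only shows up if both data sources share a common time base.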
Secondly, in order to correlate even more sources of data and to provide a longer-term view, we chose to export the switching information to a separate framework for more in-depth analysis. We found that while making the network application aware helps with operations and troubleshooting, more in-depth analysis across multiple sources is better accomplished off the network itself (for example, using Hadoop itself). For this, we again took advantage of the programmability of the switch. Instead of traditional SNMP, which is a pull model, we used a Python script on the switch to gather very high-resolution counter information, aggregate it, and push the data out. Optionally, we could also synchronize everything with IEEE 1588 Precision Time Protocol to get nanosecond-level accuracy when correlating events. Latency isn’t a key consideration for Hadoop, but when tracking log data, clock accuracy can be important.
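The gather, aggregate and push pattern can be sketched roughly as follows. This is not the script we ran on the switch: `read_buffer_cells` is a hypothetical stand-in for whatever counter read the platform exposes, and the JSON-over-UDP transport and record fields are assumptions chosen only to keep the example short.

```python
import json
import socket
import time
from statistics import mean


def read_buffer_cells():
    """Hypothetical stand-in for a platform-specific counter read."""
    return 0


def collect_window(read_fn, n_samples, interval_s,
                   clock=time.time, sleep=time.sleep):
    """Sample a counter n_samples times, interval_s seconds apart.

    Returns a list of (timestamp, value) tuples. clock and sleep are
    injectable so the loop can be exercised without real delays.
    """
    samples = []
    for _ in range(n_samples):
        samples.append((clock(), read_fn()))
        sleep(interval_s)
    return samples


def aggregate(samples):
    """Collapse one window of samples into a single summary record."""
    values = [value for _, value in samples]
    return {
        "start": samples[0][0],
        "end": samples[-1][0],
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
    }


def push(record, host, port):
    """Send one summary record to a collector as a JSON UDP datagram."""
    payload = json.dumps(record).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A driver loop would call `collect_window(read_buffer_cells, 100, 0.01)`, pass the result to `aggregate`, and `push` the record to a collector address of your choosing. Aggregating on the device keeps the fine-grained sampling local and sends only one small record per window, which is the main practical difference from polling each counter remotely.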
The first step is to simply add buffer utilization graphs to the commonly monitored statistics. This starts to add some more information to understand what is happening in the distributed architecture, but it doesn’t yet give the full picture.
To go one step further, we combine this with actual application data. The graph below shows buffer usage correlated with job completion and Map/Reduce information.
Understanding the correlation allowed us to solve an issue some customers were seeing with application multi-tenancy (two applications co-existing on common infrastructure) by proactively prioritizing HBase read requests over HBase’s major compaction phase. This also held true when using common infrastructure for Map/Reduce and HBase, as seen below.
The example code framework is available at github.com/datacenter.