It’s amazing how some concepts take off like gangbusters in a short period of time. Big Data is one such concept, creeping into our conversations because of all the market noise. There is definite merit to the fundamental premise behind Big Data for most businesses: create a better end-user experience, make intelligent business decisions, reduce intellectual waste, and monetize new opportunities, including opportunities that never presented themselves before. Hence the demand for data scientists, application developers, statisticians, mathematicians, etc. -- note that these roles are mostly on the development and analytics side of the house. What’s remarkable is that large databases have been around for the longest time; in many cases, even the data now targeted by Big Data applications has been available for the longest time. What has evolved rapidly are the application tools that enable optimized manipulation of massive data sets and flexible interfaces to diverse databases -- Hadoop being a prime example.
Use cases for Hadoop are many: banks analyzing millions of transactions a day for fraud detection, spending patterns, credit reviews, etc.; consumer companies analyzing social media feeds for product sentiment, buying patterns, reviews, etc.; security analysts correlating millions of data points ranging from data forensics to surveillance feeds; insurance verticals sifting through millions of data points for claims processing and for identifying service opportunities tied to certain claims patterns -- the list goes on.
The question I recently asked myself is: how would an application development team approach deploying Hadoop clusters, and how is that going to affect the infrastructure? Not at the scale of Yahoo, Facebook, or Twitter, but something within, say, a couple hundred nodes, with scale considerations. This is a very realistic scenario that I have discussed with several of my larger enterprise customers. As an infrastructure person, I don’t claim to be a developer or analyst, but what is clear is that deploying Hadoop requires a good understanding of the use cases and components, and coordination between the application and infrastructure teams.
Having said that, I am convinced the network (along with servers/storage) plays a crucial role in Hadoop-type deployments. Before you discount this opinion, please hear me out.
Hadoop architecture involves many different components:
There are many processes that run within a Hadoop cluster, but a few key relationships must be mentioned. NameNode and DataNode are HDFS components that work in a master/slave mode. The NameNode is the major component that controls HDFS, whereas the DataNodes handle block replication and read/write operations, and drive the workloads for HDFS.
JobTracker and TaskTracker are MapReduce components (not HDFS) that also work in master/slave mode: the JobTracker controls the map and reduce tasks at individual nodes, among other duties, while the TaskTrackers run at the node level and maintain communication with the JobTracker for all nodes within the cluster.
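To make the master/slave split concrete, here is a sketch of the Hadoop 1.x configuration that points slaves and clients at the two masters. The hostnames are placeholders of my own, and the ports shown are just the conventional defaults:

```xml
<!-- core-site.xml: where HDFS clients and DataNodes find the NameNode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

<!-- mapred-site.xml: where TaskTrackers and job clients find the JobTracker -->
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:8021</value>
</property>
```

Every slave node carries these same two settings, which is how the cluster-wide master/slave communication patterns form.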
Here is an example where I am running Cloudera OVM on 64-bit RHEL, and as you can see, the Hadoop processes are running as JVMs on a single node:
You can probably get away with running Hadoop on virtual machines for a very small 10 -- 15 node test environment, with multiple processes (like the above) on a single machine. But this is absolutely not recommended for a full-scale production deployment -- there you want to deploy on dedicated standalone hosts. We will come back to this later.
The other critical component is the MapReduce computational layer. This is a complex set of rules that Hadoop workloads depend on, where massive volumes of data are mapped and then reduced for efficient lookups, reads, and writes -- across all the nodes. Other processes, like combining, sorting, and shuffling, affect the reduced output. The reducer communicates with all nodes and produces the output. Since the reducer plays a sort of aggregation role, the ratio of reducers to nodes (mappers, etc.) affects application performance. It’s the TaskTracker’s responsibility to track tasks at the local node, while the JobTracker oversees all the nodes in the cluster.
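For readers who have not seen the flow end to end, here is a minimal, self-contained Python sketch of the map, shuffle/sort, and reduce phases, using the classic word-count example. This only illustrates the data flow between the phases; it is not how you would write an actual Hadoop job:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (key, value) pairs -- here, (word, 1) for each word.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle/sort phase: group all values by key, as Hadoop does
    # between the map and reduce stages (this is the network-heavy step).
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate the grouped values -- here, a simple sum.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # "the" appears twice across the two input lines
```

In a real cluster the mappers and reducers run on different nodes, so everything the shuffle step moves between those dictionaries travels over your network -- which is exactly why the reducer count and placement matter to infrastructure teams.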
We will skip the various databases, abstraction-layer tools, and top-level interfaces for now, but keep in mind that they also play a role in determining network traffic, as does any other application-level characteristic that could affect network performance and availability.
As you can see, there is a LOT of communication going on within a Hadoop cluster: health checks, data traffic, replication, client access, reads/writes, etc. All the above-mentioned roles and processes play a key part within the cluster. The NameNode is especially important for data locality, as it maintains metadata for data block locations across all racks within the data center in which the DataNodes are installed. This is done using the “Rack Awareness” concept, where any given data block is mapped to a certain rack and a certain node within that rack -- now imagine hundreds of racks and thousands of nodes!
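Rack Awareness is typically enabled by pointing Hadoop at an admin-supplied topology script (the `topology.script.file.name` property in Hadoop 1.x) that translates node addresses into rack paths. Here is a minimal sketch of such a script in Python; the subnet-to-rack table is entirely made up for illustration, and a real one would come from your DC addressing plan:

```python
#!/usr/bin/env python
# Sketch of a Hadoop "rack awareness" topology script. Hadoop invokes the
# script named by topology.script.file.name with one or more node IPs or
# hostnames as arguments and expects one rack path per node on stdout.
import sys

# Illustrative mapping: /24 subnet prefix -> rack path (made-up values).
RACK_MAP = {
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(node):
    # Map a node's IP to a rack by its first three octets.
    prefix = ".".join(node.split(".")[:3])
    return RACK_MAP.get(prefix, DEFAULT_RACK)

if __name__ == "__main__":
    # One rack path per argument, space-separated, as Hadoop expects.
    print(" ".join(rack_for(n) for n in sys.argv[1:]))
```

This mapping is what lets the NameNode place replicas on different racks, so keeping it in sync with the physical rack layout is an infrastructure-team responsibility.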
Let’s take a very common situation (this actually happened with one of my clients), where a server team (driven by Hadoop app/dev team requirements) hands over several servers to the network/DC infrastructure team to install. The infrastructure team has no knowledge of the purpose of these servers, so it finds the first available open space within current production racks and installs them there. Is this optimal when deploying Hadoop? The short answer is no. Regardless of which tasks these servers will handle within the Hadoop cluster (as defined above), they should never be mixed into the same racks as other production workloads -- doing so can cause performance issues and operational challenges, which is why it is not a common practice.
Let’s summarize some of the different infrastructure considerations for deploying Hadoop:
1. It is not a general practice to deploy Hadoop nodes as virtual machines, for many different reasons, primarily I/O performance and other shared-resource properties. There are many discussions around this on the Hadoop forums at Cloudera, Hortonworks, etc. that you can review -- the consensus is to deploy Hadoop clusters on “pizza box” type servers for performance and scale benefits.
You can check out this blog for Cisco UCS platforms supporting Big Data workloads.
2. I have briefly explained some of the critical Hadoop processes and roles above. For performance and scale benefits, you should run the NameNode, Secondary NameNode and/or Checkpoint Node, JobTracker, and the HBase (or any DB) master as dedicated standalone nodes, where these processes do not share a server with other Hadoop processes.
3. When there is an application performance issue, all eyes are usually on the network! However, when application-level deployment best practices are not followed, the network/infrastructure engineers usually don’t know any better. With Hadoop, it’s critical to distribute the network workload evenly across the cluster in order to scale performance. It’s important to understand the use case and the communication patterns across the Hadoop components: for example, the replication factors, NameNode and JobTracker placement, and the cluster-size dependencies on the mapper-to-reducer ratio. Unless you are in a small test environment, it’s asking for trouble to deploy Hadoop in production with a single reducer process -- that is when we run into buffering and latency issues on switches, based on the traffic load hitting that one reducer (it has happened before). This is one of those situations where the application architecture drives your underlying infrastructure architecture.
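As a concrete illustration of the single-reducer pitfall: in Hadoop 1.x the default number of reduce tasks per job is 1, so a job that never overrides it funnels all map output to one node. A sketch of raising the site-wide default in mapred-site.xml -- the value 16 is purely illustrative and should be sized to your own cluster:

```xml
<!-- mapred-site.xml: default number of reduce tasks per job.
     Hadoop 1.x ships with a default of 1; individual jobs can
     still override this value. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>16</value>
</property>
```

Spreading reducers across nodes spreads the shuffle traffic across switch ports instead of concentrating it on one.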
4. In general, storage for Hadoop deployments is local disk and/or JBOD attached. SAN-attached Hadoop clusters are not recommended, again for I/O and other performance reasons. HDFS is capable of handling huge data sets distributed over large compute clusters, so tuning HDFS is another factor that drives Hadoop performance; for example, increasing the HDFS block size to 128MB (from the default 64MB) is a recommended best practice. Also, backing up critical Hadoop configuration and metadata files, like the journal, checkpoint file, etc., should be built into the operations rules.
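The block-size change mentioned above is a one-line setting in hdfs-site.xml. The Hadoop 1.x property name is shown here; the value is in bytes and applies only to files written after the change:

```xml
<!-- hdfs-site.xml: raise the HDFS block size from the 64MB default -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 * 1024 * 1024 bytes = 128MB -->
</property>
```

Larger blocks mean fewer blocks per file, which reduces NameNode metadata pressure and suits the large sequential reads typical of MapReduce workloads.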
5. Hadoop deployment requirements, use cases, tools, and application interfaces are going to vary from one environment to another, so no one design is going to be applicable in all situations. In my opinion, it’s a safer design to create a standalone, dedicated network POD just to support Hadoop workloads -- especially when there are rapid scale requirements. This is when you dedicate racks of equipment to Big Data/Hadoop traffic profiles. Keep in mind that there could be a lot of application-layer communication based on interfaces/connectors, abstraction-layer tools, various traditional DB access, etc. -- this will also drive network traffic volumes -- so careful planning is required to reduce network bottlenecks. Transport bandwidth matters, hence 10Gb from the server is going to be increasingly common. With volume there are going to be other considerations, like latency, scale, buffers, and possibly network QoS based on production integration points. There is also going to be a need for appropriate monitoring, from both the application and infrastructure perspectives.
Here is my very high-level view of the network POD separation:
6. My final point will sound very basic, but it is seldom done effectively! The Hadoop app/dev teams and the infrastructure teams (facilities/server/network) must have joint planning meetings to strategize on the goals. You don’t want 300 2RU servers sitting on your loading dock with nowhere to rack them! In fact, some companies create a Big Data ops team to make joint decisions that affect each team. This also facilitates a cross-technology education and training process where the silos get blurred -- it’s a good thing, and might even look good on LinkedIn!
There are many educational resources out there that talk about different aspects of Hadoop (one book I found particularly informative is Hadoop Operations by Eric Sammer: http://shop.oreilly.com/product/0636920025085.do), but in this blog I wanted to bring you a non-developer’s operations perspective on deploying Hadoop and some of the characteristics to be aware of -- hopefully these will help as talking points in your next Hadoop deployment planning discussion.
If you can contribute some of your real-life experiences deploying Hadoop from an infrastructure perspective, I would love to hear your thoughts.