As discussed in my previous post, application developers and data analysts are demanding fast access to ever-larger data sets, both to reduce or even eliminate sampling error in their queries (query the entire raw data set!) and to ask new questions that were either inconceivable or impractical with traditional software and infrastructure. Hadoop emerged in this data arms race as a favored alternative to the RDBMS and SAN/NAS storage model. In this second half of the post, I’ll discuss how Hadoop was specifically designed to address those limitations.
Hadoop’s origins lie in two seminal Google white papers from 2003 and 2004: the first describing the Google File System (GFS), a persistent, massively scalable, reliable storage layer, and the second describing MapReduce, a framework for distributed data processing. Google used both to ingest and crunch the vast amounts of web data needed to provide timely and relevant search results, and the papers laid the groundwork for Apache Hadoop’s implementation of MapReduce running on top of the Hadoop Distributed File System (HDFS). Hadoop gained an early, dedicated following at companies like Yahoo!, Facebook, and Twitter, and has since found its way into enterprises of all types thanks to its unconventional approach to data and distributed computing. Hadoop tackles the problems discussed in Part 1 in the following ways:
- Hadoop moves the compute to the data, rather than pulling data from a remote array to the compute over a finite fabric. This model takes advantage of the speed and low cost of local data access and the commoditization of x86 compute, aiming to keep data processing local to the node that owns the data whenever possible. Traffic between individual data nodes is kept to a minimum; a master “NameNode” tracks filesystem metadata and a master “JobTracker” orchestrates task distribution and execution throughout the cluster, but neither sits in the direct data access path, so neither becomes a bottleneck for the data nodes. This is not just a distributed filesystem, of which there are many other examples; it is a distributed filesystem that leverages local processing for application execution and minimizes data movement. (A short sketch of asking the NameNode for a file’s block locations, the same metadata the scheduler uses to place tasks near their data, appears after this list.)
- Hadoop is designed to scale massively and predictably. Hadoop was built on the assumption that many relatively small, inexpensive computers with local storage could be clustered together into a single system with massive aggregate throughput to handle the big data growth problem. Each additional node should provide a proportional increase in capability, and an increase in system load should result in a proportional increase in job runtimes rather than outright failure. Shared-disk RDBMS clusters are lucky to scale to a few dozen nodes with any sort of linearity, whereas 1,000-node (and larger) Hadoop clusters aren’t uncommon. Every additional node means more storage capacity and more performance (measured in job completion times and/or the ability to support additional concurrent jobs). This makes for a more predictable system, which in turn invites high utilization rates: the more predictable the system, the more work you can throw at it without fear of it falling over.
- Hadoop is designed to support partial failure and recoverability. With a distributed, scalable, shared-nothing architecture built atop inexpensive servers as its foundation, Hadoop set out to ensure that the system would continue to function predictably and reliably despite the failure, and possible recovery, of individual components. Servers, storage, and networks were assumed to be unreliable, and the framework was designed to deal gracefully with a high degree of infrastructure unpredictability. Data is (by default) replicated three times across the cluster; data nodes can come and go without forcing a cluster or even a job restart; and the results of a job are consistent across runs despite failures (and/or recoveries!) during a given run. Failure and recovery are part of Hadoop’s DNA, not a retrofit, alleviating the difficult and time-consuming requirement for developers to reinvent the failure/recovery wheel for every distributed application they write. This is a significant asset that lets developers focus on the data and the questions they want to ask, rather than on how to code around a wonky hard drive or a dead NIC. (The replication sketch after this list shows how that three-copy default is controlled.)
- Hadoop provides a powerful and flexible data analytics framework, and alleviates many cumbersome distributed-computing coding challenges. There’s always overhead in parallelizing code: it takes work to break a job intelligently into smaller chunks, distribute those chunks evenly across available resources, monitor the job queues as they execute to keep available CPUs busy (but not too busy), and then retrieve and consolidate the results. The effort is frequently worth the trouble for large data-set analytics problems (Amdahl’s Law notwithstanding), but the coding can be painstaking. Hadoop, in addition to handling the complex failure and recovery scenarios described above, also takes on job distribution, parallelization, and result aggregation on the developer’s behalf. Again, this capability is Hadoop’s raison d’être, not some bolt-on feature enhancement. (The WordCount sketch after this list shows how little of that machinery surfaces in application code.)
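To make the data-locality point from the first bullet concrete, here is a minimal sketch, using the standard HDFS Java client, of asking the NameNode where a file’s blocks physically live; this is the metadata the MapReduce scheduler consults when it tries to place a map task on a node that already holds that task’s input split. The path /data/weblogs/part-00000 is hypothetical; substitute any file in your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input path; substitute a file that exists in your cluster.
        Path file = new Path("/data/weblogs/part-00000");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts: %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```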
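The three-copy default mentioned in the partial-failure bullet is just a setting. As a hedged sketch (the paths here are made up for illustration), replication can be tuned cluster-wide via dfs.replication in hdfs-site.xml, per client via the Configuration object, or per file through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files this client creates (3 is HDFS's default).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path: raise a frequently read file's replication to 5 copies.
        Path hot = new Path("/data/hot/lookup-table");
        fs.setReplication(hot, (short) 5);

        short actual = fs.getFileStatus(hot).getReplication();
        System.out.println("replication factor: " + actual);
        fs.close();
    }
}
```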
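Finally, the last bullet’s claim about simplified coding is easiest to see in the canonical WordCount job, essentially the example from the Hadoop MapReduce tutorial. The developer supplies only a map function and a reduce function; input splitting, task placement, the shuffle and sort between phases, retries after failures, and output collection are all handled by the framework:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split handed to this task.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups all counts for a given word and ships them here.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Everything outside map() and reduce() here is declarative wiring; the painstaking distribution, monitoring, and consolidation work described above never appears in the application code.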
None of this is to say that Hadoop is magic or that it masks poorly written applications. Sloppy, inefficient code is still sloppy and inefficient when run on Hadoop; it’s just now widely distributed and fault tolerant. But as a result of a dedicated and rigorous approach to these design principles, Hadoop solves many of the infrastructure problems faced by big data application developers, freeing them to focus on their data and their questions rather than on the mechanics of distribution. Hadoop’s flexible data model, reliable and cost-effective storage system, and efficient analytics engine allow IT departments to capture, retain, and analyze data that might otherwise go to waste or lie dormant. It is not a replacement for the RDBMS and the SAN or NAS array, but it does give enterprises an effective new tool for extracting business value from the flood of data being generated in their data centers.