Cisco Blogs

Why Hadoop? (Part 2)

November 4, 2011 - 4 Comments

Part 2

As discussed in my previous post, application developers and data analysts are demanding fast access to ever larger data sets so they can not only reduce or even eliminate sampling errors in their queries (query the entire raw data set!), but they can also begin to ask new questions that were either not conceivable or not practical using traditional software and infrastructure.  Hadoop emerged in this data arms race as a favored alternative to the RDBMS and SAN/NAS storage model.   In this second half of the post, I’ll discuss how Hadoop was specifically designed to address these limitations.

Hadoop’s origins derive from two seminal Google white papers from 2003-4, the first describing the Google Filesystem (GFS) for persistent, massively scalable, reliable storage and the second the MapReduce framework for distributed data processing, both of which Google used to ingest and crunch the vast amounts of web data needed to provide timely and relevant search results.  These papers laid the groundwork for Apache Hadoop’s implementation of MapReduce running on top of the Hadoop Filesystem (HDFS).  Hadoop gained an early, dedicated following from companies like Yahoo!, Facebook, and Twitter, and has since found its way into enterprises of all types due to its unconventional approach to data and distributed computing.  Hadoop tackles the problems discussed in Part 1 in the following ways:

  • Hadoop moves the compute to the data, rather than pulling the data from a remote array to the compute over a finite fabric.  This model takes advantage of the speed and low cost of local data access and the commoditization of x86 compute, aiming to keep data processing local to the node that owns the data whenever possible.  Traffic between individual data nodes is kept to a minimum; a master “Name Node” tracks filesystem metadata and a master “Job Tracker” node orchestrates job task distribution and execution throughout the cluster, but neither are in the direct data access path creating a bottleneck for the data nodes.  This is not just a distributed filesystem, of which there are many other examples; this is a distributed filesystem that leverages local processing for application execution and minimizes data movement.
  • Hadoop is designed to scale massively and predictably. Hadoop was built with the assumption that many relatively small and inexpensive computers with local storage could be clustered together to provide a single system with massive aggregated throughput to handle the big data growth problem.  Each additional node should provide a proportional increase in capability, and an increase in system load should result in a proportional decrease in job runtime without resulting in full failure.  Shared-disk RDBMS clusters might be lucky to scale to a few dozen nodes with any sort of linearity, whereas 1,000-node (and greater) Hadoop clusters aren’t uncommon.  Every additional node means more storage capacity and more performance (measured in job completion times and/or the ability to support additional concurrent jobs).  This makes for a more predictable system that in turn evokes high utilization rates – the more predicable, the more work you can throw at it without fear of it falling over.
  • Hadoop is designed to support partial failure and recoverability. With the assumption of a distributed, scalable, shared-nothing architecture built atop inexpensive servers as the foundation, Hadoop set out to ensure that the system would continue to function predictably and reliably despite failures and possibly recovery of individual components.  Servers, storage, and networks were assumed to be unreliable, and the framework was designed to gracefully deal with a high degree of infrastructure unpredictability.  Data is (by default) replicated three times in the cluster; data nodes can come and go without forcing a cluster or even a job restart; and the results of a job are consistent across runs despite potential failures (and/or recoveries!) during a given run.  Failure and recovery is part of Hadoop’s DNA, not a retrofit, alleviating the difficult and time consuming requirement for developers to reinvent the failure/recovery wheel for every distributed application they write.  This is a significant asset that allows developers to focus on the data and the questions they want to ask, rather than worrying about how to code for a wonky hard drive or dead NIC.
  • Hadoop provides a powerful and flexible data analytics framework, and alleviates many cumbersome distributed computing coding challenges. There’s always overhead in a parallel coding routine – it takes some work to break up a job intelligently into smaller chunks, distribute those chunks evenly across available resources, monitor the job queues as they execute to try to keep available CPUs busy (but not too busy), and then retrieve and consolidate the results.  The effort is frequently worth the trouble for large data-set analytics problems (Amdahl’s Law notwithstanding), but the coding work can be painstaking.  Hadoop, in addition to alleviating concerns about handling complex failure and recovery scenarios in distributed application code, also simplifies coding tasks for job distribution, parallelization, and result aggregation.  Again, this capability is Hadoop’s raison d’être, not some bolt-on feature enhancement.

None of this is to say that Hadoop is magic and masks poorly written applications.  Sloppy and inefficient code is still sloppy and inefficient when run on Hadoop; it’s just now widely distributed and fault tolerant.  🙂  But as a result of a dedicated and rigorous approach to these design principles, Hadoop solves many of the infrastructure problems faced by big data application developers, and frees them to focus on the data and their questions, and less so on the mechanics of distribution.  Hadoop’s flexible data model, its reliable and cost effective storage system, and its efficient analytics engine allows IT departments to capture, retain, and analyze data that otherwise might go to waste or lie dormant.   It is not a replacement for the RDBMS and SAN or NAS array, but it does provide an effective new alternative tool for enterprises to extract business value out of the flood of data being generated in their data centers.

In an effort to keep conversations fresh, Cisco Blogs closes comments after 60 days. Please visit the Cisco Blogs hub page for the latest content.


  1. Sean,

    Thanks for the article.

    As far as I understand genesis of Google GFS and MapReduce framework is not only to efficiently handle big data but also to reduce cost of managing that data.

    Map and Reduce paradigm of computing enables companies like Google and FB to deploy commodity hardware in very large numbers and reduce cost of compute hardware required to build their monstrous data centers.

    By the same token is it possible that, these same companies will invent new techniques in data center networking to reduce dependence on networking vendors. Is it good enough to construe that advent software defined networking, openflow, openvswitch will make commodity hardware become distributed switch or router threatening existing networking vendors?

    • Hi Sameer-

      You’re right, in the end it all comes down to cost. You can certainly try to use traditional RDBMS’s for big data workloads, it’s just that it’ll likely cost you much more to build a system with equivalent performance for a given workload that’s better suited to Hadoop. The Google’s and Facebook’s of the world have unique engineering requirements that push the limits of their infrastructure and lead them to innovate in new areas, like they did with GFS and MapReduce. Will innovations emerge in networking resulting from a similar dynamic? I wouldn’t bet against that – these companies are certainly on the bleeding edge of engineering. The question is whether those innovations will come in the form of SDN, or some new hardware paradigm, or a combination of the two. Stay tuned! In any case, I see these sorts of advancements as opportunities for Cisco, rather than threats. They give us a chance to deliver more value back to the customer, regardless of the technology.


  2. Sean – am interested in your take on the Cisco role in Big Data (other than the data gets into the big data sets via the network) and Big Data analytics. Thanks

    • Hi Christina-

      Cisco has a major role to play in the big data universe, and not just in the network. We’re one of the few vendors that can provide a complete infrastructure stack, end-to-end, to build big data environments and integrate them seamlessly into the rest of the enterprise. Our network infrastructure has been powering some of the largest big data clusters in the world to date, and we’re extending that experience into the compute side of the equation with our UCS servers. We’re forging partnerships with leaders in the big data arena – for example, our UCS servers and Nexus switches were recently certified for Cloudera’s Hadoop and Oracle’s NOSQL database. Our knowledge of the network gives us a unique perspective into big data clusters which, although they attempt to minimize network traffic by computing data locally, still place significant demands on the network during ingest, data shuffle, replication, etc.