
Part 2

As discussed in my previous post, application developers and data analysts are demanding fast access to ever-larger data sets so they can not only reduce or even eliminate sampling errors in their queries (query the entire raw data set!), but also begin to ask new questions that were either not conceivable or not practical with traditional software and infrastructure.  Hadoop emerged in this data arms race as a favored alternative to the RDBMS and SAN/NAS storage model.  In this second half of the post, I’ll discuss how Hadoop was specifically designed to address these limitations.

Hadoop’s origins derive from two seminal Google white papers from 2003-4: the first describing the Google File System (GFS), a persistent, massively scalable, reliable storage layer, and the second describing the MapReduce framework for distributed data processing.  Google used both to ingest and crunch the vast amounts of web data needed to provide timely and relevant search results.  These papers laid the groundwork for Apache Hadoop’s implementation of MapReduce running on top of the Hadoop Distributed File System (HDFS).  Hadoop gained an early, dedicated following at companies like Yahoo!, Facebook, and Twitter, and has since found its way into enterprises of all types due to its unconventional approach to data and distributed computing.  Hadoop tackles the problems discussed in Part 1 in the following ways:

None of this is to say that Hadoop is magic and masks poorly written applications.  Sloppy and inefficient code is still sloppy and inefficient when run on Hadoop; it’s just now widely distributed and fault tolerant.  :)  But as a result of a dedicated and rigorous approach to these design principles, Hadoop solves many of the infrastructure problems faced by big data application developers, and frees them to focus on the data and their questions, and less so on the mechanics of distribution.  Hadoop’s flexible data model, its reliable and cost effective storage system, and its efficient analytics engine allows IT departments to capture, retain, and analyze data that otherwise might go to waste or lie dormant.   It is not a replacement for the RDBMS and SAN or NAS array, but it does provide an effective new alternative tool for enterprises to extract business value out of the flood of data being generated in their data centers.
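
To make the MapReduce model described above a little more concrete, here is a minimal, single-process word-count sketch in plain Python. This is not Hadoop’s actual Java API, and nothing here is distributed; it simply illustrates the map, shuffle, and reduce steps that Hadoop parallelizes and makes fault tolerant across a cluster.

```python
# Toy word count in the MapReduce style: a map phase emits (word, 1)
# pairs, a shuffle groups the pairs by key, and a reduce phase sums the
# counts per word. Hadoop runs these same steps across many machines;
# here everything runs in one process purely for illustration.
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate pairs by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum all counts observed for a single word."""
    return key, sum(values)

if __name__ == "__main__":
    records = [
        "big data needs big storage",
        "hadoop stores big data on commodity hardware",
    ]
    intermediate = (pair for record in records for pair in map_phase(record))
    results = [reduce_phase(key, values) for key, values in shuffle(intermediate).items()]
    for word, count in sorted(results, key=lambda kv: -kv[1]):
        print(f"{word}\t{count}")
```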


4 Comments.


  1. Sean – I’m interested in your take on Cisco’s role in Big Data (other than the data getting into the big data sets via the network) and in Big Data analytics. Thanks


    • Hi Christina-

      Cisco has a major role to play in the big data universe, and not just in the network. We’re one of the few vendors that can provide a complete infrastructure stack, end-to-end, to build big data environments and integrate them seamlessly into the rest of the enterprise. Our network infrastructure has been powering some of the largest big data clusters in the world to date, and we’re extending that experience into the compute side of the equation with our UCS servers. We’re forging partnerships with leaders in the big data arena – for example, our UCS servers and Nexus switches were recently certified for Cloudera’s Hadoop and Oracle’s NoSQL Database. Our knowledge of the network gives us a unique perspective into big data clusters, which, although they try to minimize network traffic by processing data locally, still place significant demands on the network during ingest, data shuffle, replication, etc.

      -Sean


  2. Sean,

    Thanks for the article.

    As far as I understand, the genesis of Google’s GFS and the MapReduce framework was not only to handle big data efficiently but also to reduce the cost of managing that data.

    The Map and Reduce paradigm of computing enables companies like Google and Facebook to deploy commodity hardware in very large numbers, reducing the cost of the compute hardware required to build their monstrous data centers.

    By the same token, is it possible that these same companies will invent new techniques in data center networking to reduce their dependence on networking vendors? Is it reasonable to conclude that the advent of software-defined networking, OpenFlow, and Open vSwitch will turn commodity hardware into distributed switches and routers, threatening existing networking vendors?


    • Hi Sameer-

      You’re right: in the end it all comes down to cost. You can certainly try to use a traditional RDBMS for big data workloads; it will just likely cost you much more to build a system with equivalent performance for a workload that is better suited to Hadoop. The Googles and Facebooks of the world have unique engineering requirements that push the limits of their infrastructure and lead them to innovate in new areas, as they did with GFS and MapReduce. Will innovations emerge in networking from a similar dynamic? I wouldn’t bet against it – these companies are certainly on the bleeding edge of engineering. The question is whether those innovations will come in the form of SDN, some new hardware paradigm, or a combination of the two. Stay tuned! In any case, I see these sorts of advancements as opportunities for Cisco rather than threats. They give us a chance to deliver more value back to the customer, regardless of the technology.

      -Sean

