As discussed in my previous post, application developers and data analysts are demanding fast access to ever-larger data sets so they can not only reduce or even eliminate sampling error in their queries (by querying the entire raw data set), but also begin to ask new questions that were either inconceivable or impractical with traditional software and infrastructure. Hadoop emerged in this data arms race as a favored alternative to the RDBMS and SAN/NAS storage model. In this second half of the post, I’ll discuss how Hadoop was specifically designed to address these limitations.
Hadoop’s origins derive from two seminal Google white papers from 2003 and 2004: the first describing the Google File System (GFS) for persistent, massively scalable, reliable storage, and the second the MapReduce framework for distributed data processing. Google used both to ingest and crunch the vast amounts of web data needed to provide timely and relevant search results. These papers laid the groundwork for Apache Hadoop’s implementation of MapReduce running on top of the Hadoop Distributed File System (HDFS). Hadoop gained an early, dedicated following from companies like Yahoo!, Facebook, and Twitter, and has since found its way into enterprises of all types due to its unconventional approach to data and distributed computing. Hadoop tackles the problems discussed in Part 1 in the following ways:
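To make the MapReduce idea concrete, here is a minimal, single-process sketch of the programming model in Python. This is illustrative only and is not Hadoop’s actual API: Hadoop distributes the map, shuffle, and reduce phases across a cluster and exposes them through Java interfaces. The function names `map_phase` and `reduce_phase` are my own.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, value) pair -- here (word, 1) -- for each word.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key, then sum each key's values.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Classic word-count example over two "input splits".
lines = ["big data is big", "data wants to be big"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))  # 'big' maps to 3, 'data' to 2
```

Because the map step is independent per input line and the reduce step only needs values grouped by key, both phases parallelize naturally across machines, which is precisely what the Hadoop framework automates on top of HDFS.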
If you have been a regular reader of just about any technology blog or publication over the last year, you’d be hard-pressed not to have heard about big data, and especially the excitement (some might argue hype) surrounding Hadoop. Big data is becoming big business, and the buzz around it is building commensurately. What began as a specialized solution to a unique problem faced by the largest Web 2.0 search engines and social media outlets (namely, the need to ingest, store, and analyze vast amounts of semi-structured or unstructured data in a fast, efficient, cost-effective, and reliable manner that challenges traditional relational database management and storage approaches) has expanded in scope across nearly every industry vertical and trickled out into a wide variety of IT shops, from small technology startups to large enterprises. Big business has taken note, and major industry players such as IBM, Oracle, EMC, and Cisco have all begun investing directly in this space. But why has Hadoop itself proved so popular, and how has it solved some of the limitations of traditional structured relational database management systems (RDBMS) and associated SAN/NAS storage designs?
In Part 1 of this post I’ll take a closer look at some of those problems, and in tomorrow’s Part 2 I’ll show how Hadoop addresses them.
Businesses of all shapes and sizes are asking complex questions of their data to gain a competitive advantage: retail companies want to track changes in brand sentiment from online sources like Facebook and Twitter and react to them rapidly; financial services firms want to scour large swaths of transaction data to detect fraud patterns; power companies ingest terabytes of data every hour from millions of smart meters in hopes of uncovering new efficiencies in billing and delivery. As a result, developers and data analysts are demanding fast access to as large and “pure” a data set as possible, taxing the limits of traditional software and infrastructure and exposing the following technology challenges:
On October 25 at 9:00 am PST / 12:00 pm EST, join a very special webcast, “Evolutionary Fabric. Revolutionary Scale,” with customers, analysts, and Cisco executives and experts for conversations about the benefits of Cisco Unified Fabric.
“There is a lot going on in the data center these days. There is a continued expansion of virtualization, we see broader adoption of cloud, and we see emerging trends, big data being the newest and trendiest of the hot data center topics. So there are folks out there who will tell you that each of these needs special equipment, that they have unique requirements your regular infrastructure will not be able to handle. What we do believe is that while big data and cloud each have their own specific needs, we truly don’t believe that you need purpose-built hardware, at least if your infrastructure is built the right way.” – Omar Sultan
So this webcast is really about learning how Cisco’s fabric-based approach delivers architectural flexibility across physical, virtual and cloud environments for any application.
For Brian Gracely, the equation is simple to remember: Cisco Unified FABRIC is Fast, Agile, Best of breed, Resilient, Innovative, and Cisco-based.