Why Hadoop? (Part 1)
If you have been a regular reader of just about any technology blog or publication over the last year you’d be hard-pressed to have not heard about big data and especially the excitement (some might argue hype) surrounding Hadoop. Big data is becoming big business, and the buzz around it is building commensurately. What began as a specialized solution to a unique problem faced by the largest of Web 2.0 search engines and social media outlets – namely the need to ingest, store and analyze vast amounts of semi- or unstructured data in a fast, efficient, cost-effective and reliable manner that challenges traditional relational database management and storage approaches – has expanded in scope across nearly every industry vertical and trickled out into a wide variety of IT shops, from small technology startups to large enterprises. Big business has taken note, and major industry players such as IBM, Oracle, EMC, and Cisco have all begun investing directly in this space. But why has Hadoop itself proved so popular, and how has it solved some of the limitations of traditional structured relational database management systems (RDBMS) and associated SAN/NAS storage designs?
In the Part 1 of this blog I’ll start by taking a closer look at some of those problems, and tomorrow in Part 2 I’ll show how Hadoop addresses them.
Businesses of all shapes and sizes are asking complex questions of their data to gain a competitive advantage: retail companies want to be able to track changes in brand sentiment from online sources like Facebook and Twitter and react to them rapidly; financial services firms want to scour large swaths of transaction data to detect fraud patterns; power companies ingest terabytes of data from millions of smart meters generating data every hour in hopes of uncovering new efficiencies in billing and delivery. As a result, developers and data analysts are demanding fast access to as large and “pure” a data set as possible, taxing the limits of traditional software and infrastructure and exposing the following technology challenges:
- CPU horsepower/density continues to outpace spinning disk performance. As compute power tracks Moore’s Law and spinning disk capacities continue to advance at a rapid pace, we’re still stuck with relatively low bandwidth from a given spindle – maybe 150 MB/s sequential read throughput out of an enterprise-class SAS 15k RPM drive, or 80 MB/s out of a slower/cheaper/more dense 7.2k RPM SATA disk. A common solution to the bandwidth limit of a single hard disk spindle is to add more spindles and parallelize the read and write operations. SAN and NAS systems became hugely popular by providing far more spindles than a single server could hold internally and relatively fast access to them over a fibrechannel or Ethernet fabric. This separation of compute from storage works well as long as your application doesn’t need to read or write very large amounts of data very quickly, where bottlenecks in the fabric, server, or storage array can arise. However, with big data-scale applications addressing terabyte- and even petabyte-sized working sets, this compute/storage performance imbalance often leaves the compute starved of data in traditional SAN/NAS-based RDBMS architectures.
- The explosion of data, especially the unstructured or semi-structured variety, taxes traditional systems’ scalability. Customers are struggling to manage and analyze the massive influx of data from a variety of sources: system and network event logs, application clickstream data, sensor data from robots on the manufacturing floor, and other human-generated and especially machine-generated data sources. In many cases these data are being simply thrown away for lack of an effective means of capturing, storing, and analyzing them. RDBMS’s with their comparatively rigid data models and transactional focus can be cumbersome for developers to adapt to the varying data types and flexible analysis models required of these data sets.
- The need to scale out horizontally to overcome the processing and availability limitations of a single or small number of monolithic servers presents complex distributed computing challenges. A given database node can only be so big before it can’t handle the compute needs of the applications trying to access its data, and before it becomes too big to fail. Splitting the compute demands across multiple servers can address both performance and availability concerns, in much the same way that parallelizing I/O across multiple spindles in a RAID set can help aggregate throughput and improve reliability. But all of that distribution and parallelization comes at a cost – writing performant, distributed applications is difficult business and places its own demands on the network, compute, and storage for concurrency, synchronization, locking, and especially failure and recovery. Historically it’s been painful for developers to have to reinvent the distributed application wheel with each new app.
Acknowledging these difficulties with the traditional RDMBS+SAN/NAS model, a new breed of applications and underlying data management frameworks have emerged over the last decade intending to handle the needs of big data sets in a cost-effective and timely manner. Hadoop has become one of the most popular choices for big data problems, as it was purpose-built to address these shortcomings. In Part 2 of this post, I’ll take a closer look at how Hadoop works in this context.
Update: Cisco’s own Jacob Rapp will be speaking at HadoopWorld in NYC next week. See you there! http://www.hadoopworld.com/session/hadoop-networkcompute-architecture-considerations/