Is your data giving you the costly cold shoulder?
2nd Guest Blog by Ron Graham
Ron Graham has served as a Data Center Architect and Systems Engineer for some of the largest IT companies in the U.S., including Cisco Systems, NetApp, Sun Microsystems, and Oracle. He is currently working for Cisco Systems as a Big Data Analytics Engineer.
What I mean is: is your data not being used much, or is its temperature going from hot to cold? Hot data is used a lot and cold data is used sparingly. I think everyone runs into this problem at some point, where they are storing cold or frozen data on high-performance compute resources. Does it make sense to move unused data to an archive directory, as long as it is still in the same cluster and can still be accessed? In the majority of cases it does.
We have hot data and cold data, so what about warm data? Warm data sits in between: it is used less frequently than hot data but more often than cold data. Take a look at the graph below. I interpolated the graph from a tech posting by eBay and interviews with a former Disney admin.
On the business side, my analysis showed a 15.9% saving in CAPEX for a 1 petabyte (PB) Hadoop cluster with a hot-to-cold storage ratio of 4:1, which means that 80% of my data will be on high-performance storage platforms and 20% on storage-optimized platforms.
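To make the 4:1 arithmetic concrete, here is a minimal Python sketch of how a tiered cluster compares with an all-hot one. The function names and the per-terabyte costs passed in are hypothetical placeholders, not the figures behind my 15.9% number, and I am assuming 1 PB = 1,000 TB.

```python
def tier_split(total_tb, hot_parts=4, cold_parts=1):
    """Split total capacity between hot and cold tiers by ratio (4:1 -> 80% / 20%)."""
    whole = hot_parts + cold_parts
    hot_tb = total_tb * hot_parts / whole
    return hot_tb, total_tb - hot_tb


def capex_saving(total_tb, hot_cost_per_tb, cold_cost_per_tb,
                 hot_parts=4, cold_parts=1):
    """Fractional CAPEX saving of a tiered cluster versus an all-hot cluster.

    The per-TB costs are inputs you supply; none are implied by this post.
    """
    hot_tb, cold_tb = tier_split(total_tb, hot_parts, cold_parts)
    all_hot = total_tb * hot_cost_per_tb
    tiered = hot_tb * hot_cost_per_tb + cold_tb * cold_cost_per_tb
    return (all_hot - tiered) / all_hot


if __name__ == "__main__":
    hot_tb, cold_tb = tier_split(1000)  # 1 PB expressed in TB (assumption)
    print(f"hot tier: {hot_tb} TB, cold tier: {cold_tb} TB")
```

Plug in your own hardware quotes for the two per-TB costs; the saving scales with both the share of cold data and the price gap between the two platforms.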
As data becomes the New Oil, Cisco and Hortonworks are working together to handle data of all temperatures efficiently and cost-effectively. Having the right storage infrastructure and management is imperative as unprecedented amounts of unstructured data flow in from many different sources such as email, file services (video and audio), wearable medical technology, log files, appliances, and thousands of different sensors.
Hortonworks Data Platform supports storage tiers with the ability to move data between tiers using placement policies. Cisco and Hortonworks are working on an integrated solution to identify data temperature and automate its movement across storage tiers. For now, I am thinking there is a Python script in my near future.
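As a rough idea of what such a script might look like, here is a minimal sketch that labels a path hot, warm, or cold from its last access time and builds the matching HDFS commands. The 30-day and 90-day thresholds and the function names are my own assumptions for illustration; HOT, WARM, and COLD are the built-in HDFS storage policy names, and the script only builds the command lines rather than running them.

```python
import time

DAY = 86400  # seconds in one day


def classify_temperature(last_access_epoch, now=None,
                         warm_after_days=30, cold_after_days=90):
    """Return "HOT", "WARM", or "COLD" from the age of the last access.

    The thresholds are hypothetical defaults; tune them to your workload.
    """
    now = time.time() if now is None else now
    age_days = (now - last_access_epoch) / DAY
    if age_days >= cold_after_days:
        return "COLD"
    if age_days >= warm_after_days:
        return "WARM"
    return "HOT"


def storage_policy_command(path, policy):
    """Build the HDFS CLI call that assigns a storage policy to a path."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy",
            "-path", path, "-policy", policy]


def mover_command(paths):
    """Build the HDFS mover call that migrates existing blocks to match policy."""
    return ["hdfs", "mover", "-p", *paths]


if __name__ == "__main__":
    # Hypothetical path whose last access was 120 days ago.
    last_access = time.time() - 120 * DAY
    policy = classify_temperature(last_access)
    print(" ".join(storage_policy_command("/data/archive/2014", policy)))
    print(" ".join(mover_command(["/data/archive/2014"])))
```

Setting a policy only affects where new blocks land; the mover is what relocates blocks that already exist, so the two commands go together. Where the last access times come from is left to the caller.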
eBay tech blog: HDFS Storage Efficiency Using Tiered Storage, by Benoy Antony
– Meet us in person at Strata in NYC:
- Cisco: booth #425 | Hortonworks: booth #409
– Learn more about our joint reference architecture.
– Check out our tutorial.