Big Data is undoubtedly becoming an integral part of the enterprise IT ecosystem across major industry verticals, and Apache Hadoop is becoming almost synonymous with it as the foundation of the next-generation data management platform. Sometimes referred to as a Data Lake, this platform serves as the primary landing zone for data from a wide variety of sources. Traditional and several new application software vendors have been building the plumbing -- in software terms, data connectors and data movers -- to extract data from it for further processing. New to Apache Hadoop is YARN, which is effectively an operating system for Big Data, enabling multiple workloads -- batch, interactive, streaming, and real-time -- to coexist on a cluster.
The Hortonworks Data Platform combines the most useful and stable versions of Apache Hadoop and its related projects into a single tested and certified package. Cisco has been partnering with Hortonworks to provide an industry-leading platform for enterprise Hadoop deployments. The Cisco UCS solution for Hortonworks Data Platform is based on the Cisco UCS Common Platform Architecture Version 2 for Big Data – a popular platform for Data Lakes, widely adopted across major industry verticals. It features single connect, unified management, advanced monitoring capabilities, seamless management integration, and data integration (plumbing) capabilities with other enterprise application systems based on Oracle, Microsoft, SAS, SAP and others.
We are excited to see several joint wins with Hortonworks in the service provider, insurance, retail, healthcare and other sectors. The joint solution is available in three reference architectures: Performance-Capacity Balanced, Capacity Optimized, and Capacity Optimized with Flash. All three support up to 10 racks at 16 servers each without additional switches. Scaling beyond 10 racks (160 servers) can be implemented by interconnecting domains using Cisco Nexus 6000/7000/9000 series switches, which scales to thousands of servers and hundreds of petabytes of storage, all managed from a single pane using Cisco UCS Central.
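The scaling figures above come down to simple arithmetic. Here is a minimal sketch that assumes only the numbers quoted in this post (16 servers per rack, up to 10 racks per domain). The constant and function names are illustrative and not part of any Cisco tool:

```python
import math

SERVERS_PER_RACK = 16       # from the reference architectures described above
MAX_RACKS_PER_DOMAIN = 10   # racks supported without additional switches


def servers_per_domain() -> int:
    """Maximum servers in one domain before extra switches are needed."""
    return SERVERS_PER_RACK * MAX_RACKS_PER_DOMAIN


def domains_needed(total_servers: int) -> int:
    """Number of domains that must be interconnected for a given server count."""
    if total_servers < 0:
        raise ValueError("total_servers must be non-negative")
    return math.ceil(total_servers / servers_per_domain())


print(servers_per_domain())   # 160
print(domains_needed(1000))   # 7
```

A hypothetical 1,000-server deployment would therefore span seven interconnected domains under these assumptions.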
New to this partnership is Hortonworks Data Platform 2.1, which includes Apache Hive 13, which is significantly faster than the previous-generation Hive 12. We have jointly conducted extensive performance benchmarking using 20 queries derived from the TPC-DS Benchmark – an industry-standard benchmark for Decision Support Systems from the Transaction Processing Performance Council (TPC). The tests were conducted on a 16-node Cisco UCS CPA v2 Performance-Capacity Balanced cluster using a 30TB dataset. We observed about 300% performance acceleration for some queries with Hive 13 compared to Hive 12. See Figure 1.
Additional performance improvements are expected with the GA release. What does this mean? (i) First of all, Hive brings SQL-like abilities – SQL being the most common and expressive language for analytics – to petabyte-scale datasets in an economical manner; (ii) Hadoop becomes friendlier for SQL developers and SQL-based business analytics platforms; (iii) such performance improvements (from Hive 12 to 13) make migrations from proprietary systems to Hadoop even more compelling. More coming. Stay tuned!
Figure 1: Hive 13 vs. Hive 12
Disclaimer: The queries listed here are derived from the TPC-DS Benchmark. These results cannot be compared with TPC-DS Benchmark results. For more information, visit www.tpc.org.
The Internet of Everything (IoE) is a juggernaut of change, transforming organizations in profound ways. It sows disruption, and it grants enormous opportunities. But this sweeping wave of change is not reserved for what we normally think of as “technology companies.” In the IoE economy, even seemingly “analog” endeavors must be bestowed with network connectivity, no matter how venerable a company’s roots or old its traditions.
In a world where Everyone Is a Tech Company, there are some great examples of older companies that are heeding this new reality. Retail, manufacturing, transportation, and education are just a few of the places where people, process, data, and things are being connected in startling new ways. Companies that are ahead of the IoE transformation curve will ensure their competitiveness in marketplaces that are ever more vulnerable to disruption.
Dundee Precious Metals provides a great example of a company that is embracing change. A far-flung global organization, the company runs Europe’s largest mine in Chelopech, Bulgaria, from which it ships gold-rich copper ore to a smelter in Namibia. Yet through IoE-related technologies, executives at the company’s headquarters in Toronto, Canada, have gained unprecedented visibility into all aspects of their operations.
The end result? A boon in safety, efficiency, and productivity.
By now it is clear that big data analytics opens the door to unprecedented analytic opportunities for business innovation, customer retention and profit growth. However, a shortage of data scientists is creating a bottleneck as organizations move from early big data experiments into larger scale adoption. This constraint limits big data analytics and the positive business outcomes that could be achieved.
Click on the photo to hear from Comcast’s Jason Hull, Data Integration Specialist, about how his team uses data virtualization to get what they need done, faster.
It’s All About the Data
As every data scientist will tell you, the key to analytics is data. The more data the better, including big data as well as the myriad other data sources both in the enterprise and across the cloud. But accessing and massaging this data, in advance of data modeling and statistical analysis, typically consumes 50% or more of any new analytic development effort.
• What would happen if we could simplify the data aspect of the work?
• Would that free up data scientists to spend more time on analysis?
• Would it open the door for non-data scientists to contribute to analytic projects?
SQL is the key. Because of its ease and power, it has been the predominant method for accessing and massaging data for the past 30 years. Nearly all non-data scientists in IT can use SQL to access and massage data, but very few know MapReduce, the programming model traditionally used to access data from Hadoop sources.
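To make the contrast concrete, here is a minimal sketch of the same aggregation written two ways: once as explicit map, shuffle, and reduce steps, and once as a single SQL statement. It runs in plain Python, with the standard library's sqlite3 module standing in for a SQL engine such as Hive. The sample data and function names are hypothetical, and nothing here touches an actual Hadoop cluster:

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (store, sale_amount) pairs.
SALES = [("east", 10), ("west", 5), ("east", 7), ("west", 3), ("north", 4)]


def totals_mapreduce_style(rows):
    """Total sales per store as explicit map, shuffle, and reduce steps."""
    # Map: emit one (key, value) pair per input record.
    mapped = [(store, amount) for store, amount in rows]
    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: collapse each group to a single value.
    return {key: sum(values) for key, values in grouped.items()}


def totals_sql(rows):
    """The same aggregation as one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT store, SUM(amount) FROM sales GROUP BY store"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


print(totals_mapreduce_style(SALES) == totals_sql(SALES))  # True
```

Both functions return the same totals. The SQL version states what is wanted and leaves the grouping mechanics to the engine, which is why it is accessible to far more people.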
How Data Virtualization Helps
“We have a multitude of users…from BI to operational reporting, they are constantly coming to us requesting access to one server or another…we now have that one central place to say ‘you already have access to it’ and they immediately have access rather than having to grant access outside of the tool” -Jason Hull, Comcast
Data virtualization offerings, like Cisco’s, can help organizations bridge this gap and accelerate their big data analytics efforts. Cisco was the first data virtualization vendor to support Hadoop integration with its June 2011 release. This standardized SQL approach augments specialized MapReduce coding of Hadoop queries. By simplifying access to Hadoop data, organizations could for the first time use SQL to include big data sources, as well as enterprise, cloud and other data sources, in their analytics.
In February 2012, Cisco became the first data virtualization vendor to enable MapReduce programs to easily query virtualized data sources, on-demand with high performance. This allowed enterprises to extend MapReduce analyses beyond Hadoop stores to include diverse enterprise data previously integrated by the Cisco Information Server.
In 2013, Cisco maintained its big data integration leadership with updates of its support for Hive access to the leading Hadoop distributions including Apache Hadoop, Cloudera Distribution (CDH) and Hortonworks (HDP). In addition, Cisco now also supports access to Hadoop through HiveServer2 and Cloudera CDH through Impala.
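The core idea of data virtualization, one SQL query spanning several separate sources, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, with two in-memory databases standing in for an enterprise source and a big data source. The table names and data are hypothetical, and it does not use or represent the Cisco Information Server or its API:

```python
import sqlite3

# One connection plays the role of the virtualization layer.
conn = sqlite3.connect(":memory:")
# Attach a second, separate in-memory database as a stand-in for another source.
conn.execute("ATTACH DATABASE ':memory:' AS weblogs")

# Hypothetical "enterprise" source: a customer table in the main database.
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Hypothetical "big data" source: click events in the attached database.
conn.execute("CREATE TABLE weblogs.clicks (customer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO weblogs.clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home")],
)


def clicks_per_customer(connection):
    """Join the two sources in a single SQL statement."""
    query = """
        SELECT c.name, COUNT(*) AS clicks
        FROM main.customers AS c
        JOIN weblogs.clicks AS k ON k.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


print(clicks_per_customer(conn))  # [('Acme', 2), ('Globex', 1)]
```

The person writing the query needs no knowledge of where each table physically lives, which is the property the quote from Comcast describes.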
Once our Cisco Consulting Services colleagues finished winding through the streets of central Amsterdam each morning, we got down to the serious business of “hacking” some key global issues, together with our friends at THNK.
One of those issues has evolved into a Cisco/THNK partnership challenge, in which we will share Cisco’s expertise on the Internet of Everything (IoE) to solve some global problems around food safety and food distribution. I will say more about the Internet of Food initiative in a subsequent blog.
Another key challenge was to foster digital disruption in the Internet of Everything (IoE) age — a time when our enterprise customers, and especially their end users, are demanding rapid transformation.
That level of change stems from the kind of open innovation and inclusive creative processes promoted by THNK in Amsterdam. Those processes are also being embraced by Cisco at our innovation hubs in such places as Rio de Janeiro, Toronto, and Songdo, South Korea. At these centers, IoE cornerstones such as cloud, mobility, Big Data analytics, and social media are already enabling digital disruption — and will continue to accelerate it.
If you’re an Operations Technology (OT) pro, then the buzz about the Internet of Everything (IoE) should have you pretty excited, because it will likely impact your work. You won’t want to miss a chance to find out more about it at Cisco Live San Francisco, May 18-22.
Cisco has been hard at work building solutions to address your OT challenges. Cisco Live San Francisco is the place to find out the details…
Here are five (5) reasons not to miss this pivotal event:
#1. A Targeted OT Learning Track: We’ve put together a special program to bring OT and IT issues together and make it crystal clear how the Internet of Everything (IoE) – the convergence of machines, sensors, processes, people and data – is going to make your job a lot more interesting.