Big Data is becoming an integral part of the enterprise IT ecosystem across major industry verticals, and Apache Hadoop is becoming almost synonymous with it as the foundation of the next-generation data management platform. Sometimes referred to as a Data Lake, this platform serves as the primary landing zone for data from a wide variety of sources. Traditional and several new application software vendors have been building the plumbing -- in software terms, data connectors and data movers -- to extract data from it for further processing. New to Apache Hadoop is YARN, which is pretty much an operating system for Big Data, enabling multiple workloads -- batch, interactive, streaming, and real-time -- to coexist on a cluster.
The Hortonworks Data Platform combines the most useful and stable versions of Apache Hadoop and its related projects into a single tested and certified package. Cisco has been partnering with Hortonworks to provide an industry-leading platform for enterprise Hadoop deployments. The Cisco UCS solution for Hortonworks Data Platform is based on the Cisco UCS Common Platform Architecture Version 2 for Big Data, a popular platform for Data Lakes that is widely adopted across major industry verticals. It features single connect, unified management, advanced monitoring capabilities, seamless management integration and data integration (plumbing) capabilities with other enterprise application systems based on Oracle, Microsoft, SAS, SAP and others.
We are excited to see several joint wins with Hortonworks in the service provider, insurance, retail, healthcare and other sectors. The joint solution is available in three reference architectures: Performance-Capacity Balanced, Capacity Optimized, and Capacity Optimized with Flash. All three support up to 10 racks at 16 servers each without additional switches. Scaling beyond 10 racks (160 servers) can be implemented by interconnecting domains using Cisco Nexus 6000/7000/9000 series switches, scalable to thousands of servers and to hundreds of petabytes of storage, and managed from a single pane using Cisco UCS Central.
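As a rough illustration of the scaling figures above, the sketch below works out how many domains a given server count would need, assuming each domain tops out at 10 racks of 16 servers as stated. The function and constant names are hypothetical, chosen for illustration only; they are not part of any Cisco tool.

```python
import math

# Figures taken from the text: one domain supports up to 10 racks of 16 servers.
RACKS_PER_DOMAIN = 10
SERVERS_PER_RACK = 16
SERVERS_PER_DOMAIN = RACKS_PER_DOMAIN * SERVERS_PER_RACK  # 160


def domains_needed(servers: int) -> int:
    """Return how many domains are needed to host `servers` nodes (hypothetical helper)."""
    if servers < 0:
        raise ValueError("servers must be non-negative")
    return math.ceil(servers / SERVERS_PER_DOMAIN)


if __name__ == "__main__":
    for n in (16, 160, 161, 1000):
        print(n, "servers ->", domains_needed(n), "domain(s)")
```

Any count above 160 servers spills into a second domain, which is the point at which the interconnecting switches come into play.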
New to this partnership is Hortonworks Data Platform 2.1, which includes Apache Hive 13, which is significantly faster than the previous-generation Hive 12. We have jointly conducted extensive performance benchmarking using 20 queries derived from the TPC-DS Benchmark, an industry-standard benchmark for decision support systems from the Transaction Processing Performance Council (TPC). The tests were conducted on a 16-node Cisco UCS CPA v2 Performance-Capacity Balanced cluster using a 30TB dataset. We have observed about 300% performance acceleration for some queries with Hive 13 compared to Hive 12. See Figure 1.
Additional performance improvements are expected with the GA release. What does this mean? (i) First of all, Hive brings SQL-like abilities to petabyte-scale datasets in an economical manner, SQL being the most common and expressive language for analytics. (ii) Hadoop becomes friendlier for SQL developers and SQL-based business analytics platforms. (iii) Such performance improvements (from Hive 12 to 13) make migrations from proprietary systems to Hadoop even more compelling. More coming. Stay tuned!
Figure 1: Hive 13 vs. Hive 12
Disclaimer: The queries listed here are derived from the TPC-DS Benchmark. These results cannot be compared with TPC-DS Benchmark results. For more information visit www.tpc.org.
Tags: Big Data, Cisco UCS CPA, Hadoop
By now it is clear that big data analytics opens the door to unprecedented analytic opportunities for business innovation, customer retention and profit growth. However, a shortage of data scientists is creating a bottleneck as organizations move from early big data experiments into larger scale adoption. This constraint limits big data analytics and the positive business outcomes that could be achieved.
Click on the photo to hear from Comcast’s Jason Hull, Data Integration Specialist, about how his team uses data virtualization to get what they need done, faster.
It’s All About the Data
As every data scientist will tell you, the key to analytics is data. The more data the better, including big data as well as the myriad other data sources both in the enterprise and across the cloud. But accessing and massaging this data, in advance of data modeling and statistical analysis, typically consumes 50% or more of any new analytic development effort.
• What would happen if we could simplify the data aspect of the work?
• Would that free up data scientists to spend more time on analysis?
• Would it open the door for non-data scientists to contribute to analytic projects?
SQL is the key. Because of its ease and power, it has been the predominant method for accessing and massaging data for the past 30 years. Nearly all non-data scientists in IT can use SQL to access and massage data, but very few know MapReduce, the programming model traditionally used to access data from Hadoop sources.
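To make the contrast concrete, here is a minimal sketch that computes the same per-region total two ways: with a single SQL statement, and with hand-written map, shuffle, and reduce steps. It uses Python's built-in sqlite3 module and plain functions as stand-ins; it is not Hadoop or Hive code, and the sales table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Invented sample data: (region, amount) pairs.
ROWS = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 4)]


def totals_with_sql(rows):
    """Aggregate with one declarative SQL statement (sqlite3 stands in for a SQL engine)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def totals_with_map_reduce(rows):
    """Aggregate with explicit map, shuffle, and reduce phases written by hand."""
    # Map: emit a (key, value) pair per input record.
    mapped = [(region, amount) for region, amount in rows]
    # Shuffle: group values by key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: collapse each key's values to a single result.
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    print(totals_with_sql(ROWS))
    print(totals_with_map_reduce(ROWS))
```

Both functions return the same totals, but the SQL version states only what is wanted, while the map-reduce version spells out how to compute it. That gap is what makes SQL access attractive to analysts who are not MapReduce programmers.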
How Data Virtualization Helps
“We have a multitude of users…from BI to operational reporting, they are constantly coming to us requesting access to one server or another…we now have that one central place to say ‘you already have access to it’ and they immediately have access rather than having to grant access outside of the tool” -Jason Hull, Comcast
Data virtualization offerings, like Cisco’s, can help organizations bridge this gap and accelerate their big data analytics efforts. Cisco was the first data virtualization vendor to support Hadoop integration with its June 2011 release. This standardized SQL approach augments specialized MapReduce coding of Hadoop queries. By simplifying access to Hadoop data, organizations could for the first time use SQL to include big data sources, as well as enterprise, cloud and other data sources, in their analytics.
In February 2012, Cisco became the first data virtualization vendor to enable MapReduce programs to easily query virtualized data sources, on-demand with high performance. This allowed enterprises to extend MapReduce analyses beyond Hadoop stores to include diverse enterprise data previously integrated by the Cisco Information Server.
In 2013, Cisco maintained its big data integration leadership with updates of its support for Hive access to the leading Hadoop distributions including Apache Hadoop, Cloudera Distribution (CDH) and Hortonworks (HDP). In addition, Cisco now also supports access to Hadoop through HiveServer2 and Cloudera CDH through Impala.
Others, beyond Cisco, recognize this beneficial trend. In fact, Rick van der Lans, noted data virtualization expert and author, recently blogged on future developments in this area in "Convergence of Data Virtualization and SQL-on-Hadoop Engines."
So if your organization’s big data efforts are slowed by a shortage of data scientists, consider data virtualization as a way to break the bottleneck.
Tags: apache, Big Data, Cisco Data Center, Cisco Data virtualization, Cloudera, Composite Software, data integration, data virtualization, Hadoop, HiveServer2, Hortonworks, mapreduce, query, SQL, video
The Cloudera Sessions Roadshow helps companies navigate the Big Data journey. As Hadoop takes the data management market by storm, organizations are evolving the role it plays in the modern data center. The roadshow covers how this disruptive technology is quickly transforming an industry, the value it adds to the modern data center, and how you can leverage it today. When combined with Cisco Unified Computing System™ (Cisco UCS®), the joint solution helps you exploit the valuable insights contained in your data to drive meaningful change in your business.
The Cloudera Sessions roadshow is designed to help organizations to identify where they are on their Big Data journey and to navigate how to stay the course in a low-risk, productive way. The Cloudera Sessions’ attendees will benefit from hearing about Cloudera and its partners’ experiences with real-world deployments, as well as those of Hadoop users who plan and manage them.
Cisco is partnering with Cloudera to offer a comprehensive infrastructure and management solution, based on the Cisco Unified Computing System (UCS), to support our customers' big data initiatives. Cisco is a proud sponsor of this event, and I would encourage you to join us at one of the following scheduled stops to learn more about our joint solutions for big data:
San Francisco on June 4, 2014 (Registration Link Available Soon)
New York on June 18, 2014 (Registration Link Available Soon)
More Cities to be added
Tags: Big Data, Blade Servers, Cisco UCS, Cisco Unified Computing System, Cloudera, Cloudera Sessions, Hadoop, Rack Servers, UCS
Security concerns around cloud adoption can keep many IT and business leaders up at night. This blog series examines how organizations can take control of their cloud strategies. The first blog of this series discussing the role of data security in the cloud can be found here. The second blog of this series highlighting drivers for managed security and what to look for in a cloud provider can be found here.
In today’s workplace, employees are encouraged to find the most agile ways to accomplish business: this extends beyond using their own devices to work from anywhere at any time, to now choosing which cloud services to use.
Why Bring Your Own Service Needs to be on Infosec’s Radar
In many instances, this happens with little IT engagement. In fact, according to a 2013 Fortinet Survey, Generation Y users are increasingly willing to skirt company policies to use their own devices and cloud services. Coupled with estimates from Cisco’s Global Cloud Index that by the year 2017 over two thirds of all data center traffic will be based in the cloud, this user behavior shows that cloud computing is undeniable and unstoppable.
With this information in mind, how should IT and InfoSec teams manage their company’s data when hundreds of instances of new cloud deployments happen each month without their knowledge?
Additionally, what provisions need to be in place to limit risks from data being stored, processed and managed by third parties?
Here are a few considerations for IT and InfoSec teams as they try to secure our world of many clouds:
Tags: 2014 annual security report, CIO, Cisco Security, CiscoCloud, cloud, cloud security, data security, Fortinet, Hadoop, infosec, ITaaS, OLAP, security, Service Provider, wired
Huge amounts of information are flooding companies every second, which has led to an increased focus on big data and the ability to capture and analyze this sea of information. Enterprises are turning to big data and Apache Hadoop in order to improve business performance and provide a competitive advantage. But to unlock business value from data quickly, easily and cost-effectively, organizations need to find and deploy a truly reliable Hadoop infrastructure that can perform, scale, and be used safely for mission-critical applications.
As more and more Hadoop projects are being deployed to provide actionable results in real-time or near real-time, low latency has become a key factor that influences a company’s Hadoop distribution choice. Thus, performance and scalability should be evaluated closely before choosing a particular Hadoop solution.
The raw performance of a Hadoop platform is critical; it refers to how quickly the platform can ingest, process and analyze information. The MapR Distribution for Hadoop in particular provides world-record performance for MapReduce operations on Hadoop. Its advanced architecture harnesses distributed metadata with an optimized shuffle process, delivering consistent high performance.
The graph below compares the MapR M7 Edition with another Hadoop distribution, and it vividly illustrates the vast difference in latency and performance between these Hadoop distributions.
One particular solution that is optimized for performance is Cisco UCS with MapR. MapR on the Cisco Unified Computing System™ (Cisco UCS®) is a powerful, production-ready Hadoop solution that increases business and IT agility, supports mission-critical workloads, reduces total cost of ownership (TCO), and delivers exceptional return on investment (ROI) at scale.
Tags: Big Data, blade server, Blade Servers, Cisco UCS, Cisco UCS C240 M3 Rack Server, Cisco Unified Computing System, Cisco Unified Data Center, Cisco Unified Fabric, Hadoop, MapR, rack server, UCS Central, UCS service profiles