Cisco Blogs



How Data Virtualization Helps Data Scientists

By now it is clear that big data analytics opens the door to unprecedented analytic opportunities for business innovation, customer retention and profit growth. However, a shortage of data scientists is creating a bottleneck as organizations move from early big data experiments into larger scale adoption. This constraint limits big data analytics and the positive business outcomes that could be achieved.

Jason Hull

Comcast’s Jason Hull, Data Integration Specialist, on how his team uses data virtualization to get its work done faster

It’s All About the Data

As every data scientist will tell you, the key to analytics is data. The more data the better, including big data as well as the myriad other data sources both in the enterprise and across the cloud. But accessing and massaging this data, in advance of data modeling and statistical analysis, typically consumes 50% or more of any new analytic development effort.

• What would happen if we could simplify the data aspect of the work?
• Would that free up data scientists to spend more time on analysis?
• Would it open the door for non-data scientists to contribute to analytic projects?

SQL is the key. Because of its ease and power, it has been the predominant method for accessing and massaging data for the past 30 years. Nearly everyone in IT, data scientist or not, can use SQL to access and massage data, but very few know MapReduce, the programming model traditionally used to access data from Hadoop sources.
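To make the contrast concrete, here is a minimal sketch that computes the same per-customer total twice: once with a single SQL statement and once with hand-written map, shuffle and reduce steps. It assumes a made-up `ORDERS` data set and uses Python's built-in `sqlite3` module as a stand-in for a real SQL engine; the function names are hypothetical, and the MapReduce half only imitates the programming model in memory, not an actual Hadoop job.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, purchase amount).
ORDERS = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 20.0),
    ("carol", 7.5),
    ("bob", 2.5),
]


def totals_with_sql(rows):
    """Total per customer, expressed as one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    )
    result = dict(cursor)  # each row is a (customer, total) pair
    conn.close()
    return result


def totals_with_mapreduce(rows):
    """The same total, spelled out as explicit map, shuffle and reduce steps."""
    # Map: emit one (key, value) pair per input record.
    mapped = [(customer, amount) for customer, amount in rows]
    # Shuffle: group all values that share a key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: collapse each group of values to a single result.
    return {key: sum(values) for key, values in grouped.items()}
```

Both functions return the same dictionary, but the SQL version states only what is wanted, which is why it is accessible to a much wider group of people.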

How Data Virtualization Helps

“We have a multitude of users…from BI to operational reporting, they are constantly coming to us requesting access to one server or another…we now have that one central place to say ‘you already have access to it’ and they immediately have access rather than having to grant access outside of the tool” -Jason Hull, Comcast

Data virtualization offerings, like Cisco’s, can help organizations bridge this gap and accelerate their big data analytics efforts. Cisco was the first data virtualization vendor to support Hadoop integration with its June 2011 release. This standardized SQL approach augments specialized MapReduce coding of Hadoop queries. By simplifying access to Hadoop data, organizations could for the first time use SQL to include big data sources, as well as enterprise, cloud and other data sources, in their analytics.
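The idea of one SQL statement spanning several sources can be sketched in miniature. The example below is an analogy only, not Cisco's product or API: it assumes two separate in-memory SQLite databases standing in for a big data store and an enterprise store, and every table, column and function name is invented for illustration.

```python
import sqlite3


def build_virtual_view():
    """Create two independent databases reachable through one connection."""
    conn = sqlite3.connect(":memory:")
    # A second in-memory database, standing in for a separate data source.
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
    # Hypothetical "big data" source: raw clickstream events.
    conn.execute("CREATE TABLE main.clickstream (customer_id INTEGER, page TEXT)")
    # Hypothetical enterprise source: customer master data.
    conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO main.clickstream VALUES (?, ?)",
        [(1, "home"), (1, "pricing"), (2, "home"), (3, "support")],
    )
    conn.executemany(
        "INSERT INTO warehouse.customers VALUES (?, ?)",
        [(1, "east"), (2, "west"), (3, "east")],
    )
    return conn


def page_views_by_region(conn):
    """Join across both sources with a single SQL query."""
    query = """
        SELECT c.region, COUNT(*) AS views
        FROM main.clickstream AS s
        JOIN warehouse.customers AS c ON c.customer_id = s.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(query).fetchall()
```

The analyst writing the query does not need to know where each table physically lives; that is the convenience a virtualization layer aims to provide at enterprise scale.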

In February 2012, Cisco became the first data virtualization vendor to enable MapReduce programs to easily query virtualized data sources, on-demand with high performance. This allowed enterprises to extend MapReduce analyses beyond Hadoop stores to include diverse enterprise data previously integrated by the Cisco Information Server.

In 2013, Cisco maintained its big data integration leadership with updates of its support for Hive access to the leading Hadoop distributions including Apache Hadoop, Cloudera Distribution (CDH) and Hortonworks (HDP). In addition, Cisco now also supports access to Hadoop through HiveServer2 and Cloudera CDH through Impala.

Others, beyond Cisco, recognize this beneficial trend. In fact, Rick van der Lans, noted data virtualization expert and author, recently blogged on future developments in this area in his post “Convergence of Data Virtualization and SQL-on-Hadoop Engines.”

So if your organization’s big data efforts are slowed by a shortage of data scientists, consider data virtualization as a way to break the bottleneck.


Cisco Unified Computing XML API with Curl and xmlstarlet

February 9, 2012 at 2:04 pm PST

One of my favorite books is The Pillars of the Earth by Ken Follett. I’ve read and reread it many times, and each time I get something new out of it. With so many good books out there it seems silly to reread a book, especially a very long one. I think the story is so good and the characters so compelling that I don’t want to leave them, and when I finish the book I miss them. Fortunately, the book was made into a mini-series that I enjoyed and that brought a nice visualization of the story. I also think the mini-series may have attracted a new set of readers from the viewing audience.

New audiences come with new methods of distribution for the same, similar or different presentations of an already published work. With the intent of reaching a new audience, I am republishing a UCS XML API focused blog from another blog site, the UCS section of the Cisco Developer Network. I wrote this blog in April 2010, but the methods utilized seemed to flow from my prior entries on this site. The previously published blog has references to other blogs in the Cisco UCS section of the Cisco Developer Network site.

The previous blog…

Last time I wrote about using telnet to connect to the UCS Manager XML API as a way to introduce the API and show its lack of complexity. Now, I don’t expect anyone to write an application that uses telnet to manage a UCS system; I just wanted to get across that if text, XML structured text, can be pushed across an open port to the listening API process on the UCS, then it doesn’t matter how the push is done.

However, telnet is not very practical, so I thought I would write about curl and xmlstarlet (xmlstarlet is referred to as xml in this entry). curl handles the request and response cycle with the UCS, and xml processes the XML response. In some of my early scripts I used sed and awk to “parse” the output. I say parse, but it was more pattern matching. By the way, sed and awk are great tools, but maybe I’m partial to them because I’ve been around for a while. The reason I started with curl, sed and awk was not because I lacked XML experience but because I wanted to appeal to the administrators out there and show that XML experience, while beneficial, is not specifically needed.
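The same request and response cycle can be sketched without curl or xmlstarlet. The following is not the original post's script: it swaps in Python's standard library, with `urllib` playing the role of curl and `xml.etree.ElementTree` playing the role of xml. It assumes the UCS Manager login method is `aaaLogin` with `inName` and `inPassword` attributes, that a successful reply carries an `outCookie` attribute, and that a failure carries `errorCode` and `errorDescr`; treat those names, the sample response and the helper function names as assumptions to check against the UCS XML API documentation.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_login_request(username, password):
    """Build the XML text for a login call (assumed method name: aaaLogin)."""
    element = ET.Element("aaaLogin", {"inName": username, "inPassword": password})
    return ET.tostring(element, encoding="unicode")


def post_xml(url, body, timeout=10):
    """Push XML text to the API endpoint and return the raw XML reply."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_cookie(response_xml):
    """Parse the reply and return the session cookie, or raise on an error reply."""
    root = ET.fromstring(response_xml)
    if root.get("errorCode"):
        # Assumed error shape: errorCode and errorDescr attributes on the root.
        raise ValueError(f"login failed: {root.get('errorDescr')}")
    return root.get("outCookie")


# Made-up reply used to exercise the parser without a live UCS system.
SAMPLE_RESPONSE = (
    '<aaaLogin cookie="" response="yes" '
    'outCookie="example-cookie" outRefreshPeriod="600" />'
)
```

Against a real system the flow would be `extract_cookie(post_xml(url, build_login_request(user, password)))`, where `url` is whatever endpoint your UCS Manager exposes for the XML API. The parsing step works on the document structure rather than on text patterns, which is the advantage xmlstarlet has over sed and awk.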

