I recently attended the Strata + Hadoop World Conference in San Jose, and came away impressed with the accelerating pace of innovation in the world of Big Data. Companies and startups are innovating in every area of the Big Data value chain – from automating how data is collected, cleaned, and organized; to data governance and management; to data storage using a plethora of NoSQL database technologies; and to the numerous emerging tools for data science.
Of particular interest are the innovations in the area of streaming data analytics at the edge of the network. This will be critical in the emerging world of the Internet of Everything (IoE), where “things” are connected to the Internet in the context of “people”, “process” and “data”. Data analytics will provide the intelligence in IoE, transforming data generated by millions of edge devices and applications into useful business insights. Examples of IoE in action abound – from applications in connected healthcare, supply chain management, and the smart grid to Google’s self-driving car and Uber’s industry-transforming business model that connects riders to drivers.
Big Data and analytics are clearly seen as game-changing technologies. Data science is the foundational capability behind the enormous value that everyone expects from Big Data and IoE. Both a mature and a new discipline, data science is based on well-established techniques from inferential statistics and computer science.
So, what exactly is data science? D.J. Patil has called it a “team sport.” It is a multi-disciplinary approach that combines business domain knowledge, IT, communication skills, and change management skills with core expertise in statistical analysis and computer science to identify and capture business value from data. A colleague recently told me, “Data science is mostly about cleaning and preparing large datasets so that programmers can work on them.” This is partially true: data scientists do spend a lot of time understanding raw data and preparing “clean” datasets for subsequent analysis.
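To make the "cleaning and preparing" step concrete, here is a minimal Python sketch. The field names (`device_id`, `temp_c`) and the sample rows are hypothetical, invented purely for illustration, and real pipelines will need far more careful rules than these.

```python
# Hypothetical raw records, e.g. as read from a CSV export (all values are strings).
raw_rows = [
    {"device_id": " A1 ", "temp_c": "21.5"},
    {"device_id": "a1", "temp_c": "21.5"},   # duplicate once the ID is normalized
    {"device_id": "B2", "temp_c": ""},       # missing reading
    {"device_id": "C3", "temp_c": "n/a"},    # unparseable reading
    {"device_id": "D4", "temp_c": "19"},
]


def clean(rows):
    """Normalize IDs, drop rows with unusable readings, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        device = row["device_id"].strip().upper()
        try:
            temp = float(row["temp_c"])
        except ValueError:
            continue  # skip rows whose reading cannot be parsed as a number
        key = (device, temp)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"device_id": device, "temp_c": temp})
    return cleaned


clean_rows = clean(raw_rows)
print(clean_rows)
```

Even in this toy case, three of the five rows are discarded before any analysis can start, which is why this step takes up so much of a data scientist's time.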
The objective of data science is to identify and build an analytical model that can be scaled and operationalized (i.e., implemented in a “production” environment) to provide useful business insights, predictions and recommendations.
The typical data science process is shown in the figure below:
Four data science developments from the last decade stand out:
- Data science is gaining prominence across many organizations, even at the CEO level. For example, President Obama recently remarked, “Understanding and innovating with data has the potential to change the way we do almost anything for the better.” The U.S. government, through its open data initiative, has released more than 135,000 datasets on data.gov to encourage innovation, stimulate the economy, and drive job creation. A chief data scientist (D.J. Patil) has been named to the United States Office of Science and Technology Policy. Data science is not restricted to the CIO and the IT function. Many organizations have created a new role, the chief data officer, who may lead a centralized data science function for the company.
- A rapidly expanding toolset is now available for data scientists. This includes tools like IPython, open-source R (e.g., R’s caret package), numerous analytics software packages, and various tools and technologies available as part of the Apache open-source ecosystem (Spark, Mahout, Flume, Kafka, Storm, and others). A data scientist today is expected to be familiar with multiple tools and technologies.
- Tremendous advances in algorithms, machine learning, and recommender systems as well as data engineering technologies have led companies such as Amazon.com, Netflix, Google, leading financial institutions, retailers such as Target, and others to deploy data science-based models in business operations at scale for competitive advantage.
- The Internet of Everything (IoE), coupled with advances in IT infrastructure and networking technologies such as fog networking and cloud computing, is making it possible to do both real-time analytics (on streaming data from machines and sensors) and batch analytics (on data “at rest” in corporate databases), opening up huge opportunities for transforming business processes across marketing, finance, risk management, supply chain, and customer care. (See the recent Cisco white paper titled “Attaining IoT Value: How To Move from Connecting Things to Capturing Insights.”)
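The difference between real-time and batch analytics mentioned in the last point can be illustrated with a small Python sketch. The `RunningMean` class and the sensor readings are hypothetical, chosen only to show the idea: a streaming computation updates its answer as each reading arrives and never stores the full history, while a batch computation works on the complete dataset at rest.

```python
from statistics import mean


class RunningMean:
    """Streaming-style average: updated one reading at a time, no history kept."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: shift the mean toward the new value by 1/count.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Hypothetical temperature readings from an edge sensor.
readings = [20.0, 22.0, 21.0, 25.0]

# Real-time view: an up-to-date answer is available after every reading.
stream = RunningMean()
for reading in readings:
    current = stream.update(reading)
    print(f"after {stream.count} readings, running mean = {current:.2f}")

# Batch view: one answer computed over the stored dataset.
batch = mean(readings)
print(f"batch mean = {batch:.2f}")
```

Both approaches arrive at the same final average here. The streaming version trades access to the full history for constant memory use and an answer that is always current, which is what makes it attractive at the edge of the network.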
A word of caution: Above all, data scientists need to be data skeptics. Not all data is useful, and not every business problem can or should be solved using data science and analytics. George E.P. Box, a noted British statistician, once said, “Essentially, all models are wrong, but some are useful.” This is why any data science project should include a cross-functional team and use a healthy dose of business acumen and pragmatism to develop approaches that ultimately drive useful business outcomes in a cost-effective manner.