Recently I had an opportunity to sit down with the talented data scientists from Cisco’s Threat Research, Analysis, and Communications (TRAC) team to discuss Big Data security challenges, tools and methodologies. The following is part one of five in this series where Jisheng Wang, John Conley, and Preetham Raghunanda share how TRAC is tackling Big Data.
Given the hype surrounding “Big Data,” what does that term actually mean?
John: First of all, because of overuse, the “Big Data” term has become almost meaningless. For us and for SIO (Security Intelligence and Operations) it means a combination of infrastructure, tools, and data sources all coming together to make it possible to have unified repositories of data that can address problems that we never thought we could solve before. It really means taking advantage of new technologies, tools, and new ways of thinking about problems.
How large is the data that you’re analyzing?
John: We have a lot of disparate data sources. Web telemetry is one of the largest single sources, and we have tens of terabytes of data currently residing in our cluster and we expect that to quickly grow over the next couple of years to petabytes that we will be processing. The differentiation in the usefulness of our cluster is the merging of numerous data sources and sizes that are all in the same infrastructure and providing the ability to query across all of these different data sets using the same language and tools.
[One petabyte is one quadrillion bytes. For more information on "telemetry" please visit http://www.cisco.com/web/siteassets/legal/isa_supp.html and http://www.cisco.com/en/US/prod/collateral/vpndevc/ps10142/ps11720/at_a_glance_c45-729616.pdf]
Why do you like Hadoop as a primary tool for this big data effort?
John: Hadoop encompasses HDFS (Hadoop Distributed File System) and the MapReduce programming framework. Hadoop is very useful and something we use, but it’s not the only way of interacting with the data. The reason Hadoop is the industry standard for handling big data is because it’s very scalable. As you throw more disks and computing resources at it you receive better performance and higher data processing capabilities. It’s trivial to add more resources; scaling from a five node cluster to a thousand node cluster doesn’t excessively increase the burden to the administrator. Hadoop is also extremely fault tolerant. If a disk, or node, or even a whole rack of nodes goes down your data is replicated across the cluster in such a way that you won’t lose any data. The running jobs that are processing data are also fault tolerant, restarting tasks when necessary to ensure all the data is processed correctly. So you don’t have to worry about the nasty issues associated with big data spanning across multiple disks and multiple machines. Hadoop takes care of the resilience, fault tolerance, and scalability issues for you.
We also use MapR, not just open source Hadoop. MapR has created a more robust and feature-rich version of Hadoop and we found MapR has a lot of advantages for us. There are also many other tools we leverage or are investigating such as the distributed key-value store HBase (MapR also has their own version of Hbase), in-memory databases and analysis frameworks like Spark and Shark, as well as graph databases like InfiniteGraph and Titan. This is by no means an exhaustive list, the point is that for different use cases there are many different tools available and many different ways of interacting with big data. Hadoop is the hammer for which everything is a nail. You can use it for everything, but it’s not necessarily the best tool for every use case.
Historically businesses have been storing data in relational databases with some normalization and IT groups are comfortable maintaining and querying in this model. Now with the tidal wave of data that organizations are looking to store and profit from, NoSQL solutions are becoming much more necessary due to the resilience and scalability factors that you mentioned. Working with Hadoop initially feels a little daunting and the barriers to use appear higher specifically because of the Java API. How steep is the learning curve and what sort of human resources are required to implement Hadoop correctly?
Preetham: All of the NoSQL solutions fundamentally offer an ecosystem such that other technologies can sit on top of the stack in a hierarchy. There are two important factors when analyzing large data sets and Hadoop does both well: the ability to run processes across multiple nodes in the data center and a file system that can store results in a single view. A lot of companies tried their own technologies and techniques to achieve this and when Hadoop came along it instantly satisfied these two requirements in a dependable, open-source way. Over time the barriers to using Hadoop -- understanding the API and architecture, setting it up, and using the eco-system -- have become significantly lower. For example using MapR abstracts away a lot of the issues that initially made Hadoop difficult to use. It’s becoming easier to quickly start using Hadoop primarily due to additional companies creating tools and APIs with their own optimizations and stacks that let you focus more on the business problems and less on the infrastructure.
You mentioned Hive earlier which is a query language based on SQL and obviously IT workers tend to be comfortable crafting SQL queries. Can Hadoop be queried using a SQL model?
John: Absolutely. Hive is one solution in a long list that provides this functionality; there are many solutions like Shark and Apache Drill. It’s a very common use case. The data sitting on HDFS can be accessed with any standard SQL query. Solutions like Hive are useful for people who are comfortable with relational databases and know how to write a SQL query, but have never seen MapReduce before. You can write a typical SQL query and it will run. It’s a very common use case because so many people are used to thinking about data in a relational way. Having the data physically sitting in a distributed, fault-tolerant data store like HDFS and still being able to use SQL tools is extremely useful.
How long does it take to setup Hadoop and implement a SQL query type solution with no prior experience?
Preetham: Fifteen minutes. There are scripts that will assist with a default installation.
What was the motivation for applying Big Data to security at Cisco?
Jisheng: We started this effort of applying big data technology to security about 2 years ago. There were two main challenges that Cisco Security was facing at that time, and still faces today. First, we have lots of security and other related data sets inside Cisco, but they are owned by different groups and distributed in different data centers. Different groups of people occasionally become aware of the existence of others’ data through collaboration, but there is no centralized place where every group can bring in their data and share it with all other groups in Cisco.
The second Cisco Security challenge - that provided motivation for starting the big data effort -- has to do with people. While Cisco has many world-class security experts and data scientists in different groups, they have mostly been working on solving isolated customer security problems at a small scale. Many of these researchers have enjoyed collaborating, exchanging ideas, and brainstorming on an ad hoc basis, but there has never been a common platform and infrastructure where these experts can work together daily, on the same data, using the same tools.
These two challenges helped us realize the necessity for this effort. Thus far, we have been building a common data sharing architecture in partnership with the Threat Response, Intelligence and Development team (TRIAD), that can support hundreds of users to store and process petabytes of data in real time for different use cases.
From a security perspective, what are some of the use cases that can be solved with the data and tools?
Preetham: It’s correlation. By bringing the data together in one place we can perform analysis that simply wasn’t possible previously. That gives us a competitive advantage; as we centralize additional data sets and correlate them, we find new relationships in previously unimaginable ways.
John: To Preetham’s point about correlation, threats or malicious cyber events typically cross multiple channels and use multiple attack vectors. So you may have binaries involved in an attack or file hashes, log data, Command and Control (C2) infrastructure, host ownership, location, actor meta-data, etc. Before you know about a threat or compromise, you need to have all of the different data sets at your fingertips to query and quickly move between. This process has traditionally been done manually in a very ad-hoc manner; people have to go fetch different data sets from different places. The whole idea behind our approach is putting all of the data in one place where the relationships are manifest. Ultimately, we are working toward automated processes that discover relationships – the types of relationships that could have only been found through arduous manual analysis before – much more efficiently.
Jisheng: We are trying to provide tools to support four major use cases in our big data solution: stream, interactive, batch and graph processing, which are all standard in any comprehensive big data solution these days.
Stream processing is typically needed when we already have some known patterns that we use to identify threats, and we want to be able to detect those known threats from a high-volume stream of data in real time. Interactive processing is important to help our security researchers speed up the manual investigation process to find new patterns in different large datasets. Batch processing refers to typical MapReduce-based machine learning and modeling work that data scientists use to find new threat patterns in the data. Finally, graph processing has been proven to be one of the most effective and powerful ways to analyze correlations over different dimensions and data sets.
Are these different use cases independent from each other?
Jisheng: No, these use cases are not isolated; they interact closely with each other. For example, when security experts find vague patterns or label a small set of new threats from interactive processing, data scientists can take advantage of these manual investigation results and train a more comprehensive and robust model to detect these new threats through batch processing. Also, when we find any new threat patterns from either interactive or batch processing, we want to then implement them in stream processing so that we can detect and block these threats in real-time.
What tools have you considered and evaluated for different use cases? How do you decide which tool you want to use?
Jisheng: We have evaluated multiple popular tools for each use case in the past couple years, including both open source and commercial tools. For example, we’ve tried Flume, Sqoop, Kafka, Qpid and RabbitMQ as different ways to ingest data into Hadoop, using different data formats and persistence patterns. We’ve compared Storm with Akka for stream processing. We’ve looked into Apache Drill, Impala, Hadapt, and AMPLab’s Shark for the interactive SQL queries. For a NoSQL key-value database, we’ve used Apache Hbase, the MapR version of Hbase (called MapR Tables), and Redis. We also deeply investigated using Spark as our in-memory computing framework. Finally, for graph processing, we spent a lot of time and effort evaluating the InfiniteGraph solution from Objectivity, as well as other open source solutions like Titan and Neo4J.
There are several steps we normally take before deciding on a tool for each use case. First, we search for different currently popular solutions and do a survey regarding general pros and cons for each tool. Second, we narrow the list down to two or three candidate tools for each use case based on the fit between the tools’ strengths and our own requirements. Third, we conduct a complete benchmarking and comparison test among the candidate tools using our own data sets and use cases to decide which one fits our needs best. Normally, after these three steps, we find the best available tools to compose our big data solution.
That concludes our first Q&A session on TRAC tools. For an introduction to this week’s Big Data in Security series please visit http://cs.co/9005jAfZ. You can also listen to John and Preetham’s additional interview on TRAC Fahrenheit: http://cs.co/9006jANC.
Stop by tomorrow for my conversation with Mahdi Namazifar regarding U.C. Berkeley’s AMPLab Stack.
John Conley studied physics and math at Columbia University and attained a PhD in theoretical particle physics at Stanford. He subsequently completed a post-doctoral project at the University of Bonn in Germany before returning to the U.S. and joining Cisco. John’s physics research involved a lot of data analysis, statistics, and machine learning, so he was excited to join TRAC and apply his experience to security.
Jisheng Wang received his bachelor’s and master’s degrees in electrical engineering (EE) from Shanghai Jiao Tong University, and then started his 10-year journey into network security from Penn State University. He received his PhD in EE from Penn State in 2008, after working on automatic signature extraction for polymorphic worms.
During his first years at Cisco, Jisheng worked on different security features and products including ACL, FW, IPS, and QoS. Four years, ago he began to apply Big Data and data analysis technologies to various security challenges. As a researcher, he moved from traditional heuristic-driven security to data-driven security. Jisheng has already led a few data analysis projects to improve the efficacy of Cisco web and email reputation systems. Currently he is leading TRAC efforts to pioneer development of Cisco Security Intelligence Operations (SIO) and next-generation Cisco Threat Defense (CTD).
Preetham Raghunanda has been an integral part of Cisco’s security business for the past five years doing both research and development. He is enthusiastic about how TRAC is pioneering the application of data science and big data technologies within Cisco Security. In 2008, he earned a master’s degree in computer science from the University of Illinois at Chicago. Prior to that, he held various software engineering roles.