OpenSOC, an open source security analytics framework, helps organizations make big data part of their technical security strategy by providing a platform for the application of anomaly detection and incident forensics to the data loss problem. By integrating numerous elements of the Hadoop ecosystem such as Storm, Kafka, and Elasticsearch, OpenSOC provides a scalable platform incorporating capabilities such as full-packet capture indexing, storage, data enrichment, stream processing, batch processing, real-time search, and telemetry aggregation. It also provides a centralized platform to effectively enable security analysts to rapidly detect and respond to advanced security threats.
A few months ago we were really excited to bring OpenSOC to the open source community. Developing OpenSOC has been a challenging, yet rewarding experience. Our small team pushed the limits of what is possible to do with big data technologies and put a strong foundational framework together that the community can add to and enhance. With OpenSOC we strive to provide an open alternative to proprietary and often expensive analytics tools and do so at the scale of big data.
So what is so different about OpenSOC? OpenSOC is an application that runs natively on the Hadoop 2.x technology stack. It differs from most legacy Hadoop applications because it focuses on real-time streaming analytics. In security, speed matters. The capability to offer stream processing at scale, integrated into a Hadoop distribution, became available only recently, when Hortonworks released Storm on YARN. We identified this technology early as a key enabler for our real-time analytics use case and partnered with Hortonworks to accelerate its maturity. Using Storm, we went through multiple prototypes and variations of our architecture, but it became apparent that in addition to scalable stream processing we also needed a scalable message broker. Network traffic is bursty by nature and we needed a way to smooth out these bursts. After much searching we picked Kafka as our message broker because of its ability to buffer data at scale. With Kafka and Storm in place, we finally had the scalable stream processing capability we needed to start implementing OpenSOC.
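The burst-smoothing role of the message broker can be illustrated with a toy simulation. This is a minimal sketch and not Kafka client code: the `simulate` function, the arrival pattern, and the per-tick capacity are all hypothetical, chosen only to show how a buffer lets a fixed-rate consumer ride out a spike without dropping messages.

```python
from collections import deque

def simulate(arrivals, capacity_per_tick):
    """Return (processed_per_tick, backlog_per_tick) for a buffered consumer.

    arrivals: number of messages arriving at each tick (hypothetical data).
    capacity_per_tick: most messages the consumer can process in one tick.
    """
    buffer = deque()  # stand-in for the broker's buffered log
    processed, backlog = [], []
    for tick, count in enumerate(arrivals):
        # Producer side: everything that arrives is appended to the buffer.
        buffer.extend((tick, i) for i in range(count))
        # Consumer side: drain at most capacity_per_tick messages.
        done = 0
        while buffer and done < capacity_per_tick:
            buffer.popleft()
            done += 1
        processed.append(done)
        backlog.append(len(buffer))
    return processed, backlog

# A burst of 20 messages lands at tick 2; the consumer handles 5 per tick.
arrivals = [2, 2, 20, 1, 0, 0, 3]
processed, backlog = simulate(arrivals, capacity_per_tick=5)
```

In this run the consumer never sees more than 5 messages per tick, the backlog peaks right after the burst, and it drains back to zero over the following ticks.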
Our most difficult requirement was storing full PCAP data. No database at the time was powerful enough to store binary data on the order of millions of messages per second while also offering the sophisticated query and scanning capabilities needed to build on-demand PCAP files out of the captured packets. However, one came close: HBase. We worked with our Hortonworks partners to tune and build additional capabilities around HBase and were able to get it to meet our query and storage requirements. Now we had a scalable stream processor and somewhere to archive the data.
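To make "building an on-demand PCAP file out of captured packets" concrete, here is a minimal sketch that assembles a classic libpcap file from stored packet records. It is not OpenSOC's implementation: the `build_pcap` function and the `(ts_sec, ts_usec, raw_bytes)` record shape are assumptions for illustration, and the packets are placeholder bytes. Only the file layout (a 24-byte global header followed by a 16-byte header per packet) comes from the pcap format itself.

```python
import struct

# Classic libpcap global header (little-endian): magic number, version 2.4,
# timezone offset, timestamp accuracy, snapshot length, link type (1 = Ethernet).
PCAP_GLOBAL_HEADER = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)

def build_pcap(packets):
    """Assemble pcap bytes from (ts_sec, ts_usec, raw_bytes) tuples.

    The tuple shape is a hypothetical stand-in for whatever a packet store
    returns from a scan; packets are written in timestamp order.
    """
    out = [PCAP_GLOBAL_HEADER]
    for ts_sec, ts_usec, raw in sorted(packets, key=lambda p: (p[0], p[1])):
        # Per-packet header: timestamp, captured length, original length.
        out.append(struct.pack("<IIII", ts_sec, ts_usec, len(raw), len(raw)))
        out.append(raw)
    return b"".join(out)

# Placeholder packets, deliberately out of order to show the sort.
stored = [
    (1400000001, 500, b"\x00" * 60),
    (1400000000, 100, b"\x01" * 42),
]
pcap_bytes = build_pcap(stored)
```

Writing `pcap_bytes` to a file with a `.pcap` extension gives something a standard packet analysis tool can open, provided the raw bytes are real Ethernet frames.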
HBase was great, but it had one major limitation. To achieve the speeds we needed, we had to be careful about how we formulated HBase keys and could not put as much metadata into those keys as we wanted. This made analysis of metadata extremely limited. To enhance this capability in OpenSOC, we brought in Elasticsearch to serve as a secondary index around our HBase keys and started using it as a general metadata repository and even as an alerts repository. With some tuning help from Elasticsearch, their tool met our functional requirements and we integrated it into OpenSOC.
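The split between a compact row key and a separate metadata index can be sketched with plain Python dictionaries. This is an illustration under stated assumptions, not the OpenSOC schema: the key layout (a 4-byte flow hash plus an 8-byte timestamp), the function names, and the metadata fields are all hypothetical, and the two dictionaries merely stand in for the HBase table and the search index.

```python
import struct
from collections import defaultdict

packet_store = {}                  # stand-in for the HBase table: row key -> raw bytes
metadata_index = defaultdict(set)  # stand-in for the search index: (field, value) -> row keys

def make_key(flow_hash, ts_millis):
    """Pack a short, fixed-width, sortable row key (hypothetical layout)."""
    return struct.pack(">IQ", flow_hash, ts_millis)

def store_packet(flow_hash, ts_millis, raw, metadata):
    """Store the payload under the compact key and index the metadata separately."""
    key = make_key(flow_hash, ts_millis)
    packet_store[key] = raw
    for field, value in metadata.items():
        metadata_index[(field, value)].add(key)
    return key

def find_packets(field, value):
    """Resolve metadata to row keys via the index, then fetch payloads by key."""
    keys = metadata_index.get((field, value), ())
    return [packet_store[k] for k in sorted(keys)]

store_packet(0x1A2B3C4D, 1000, b"pkt-a", {"protocol": "dns", "src_ip": "10.0.0.5"})
store_packet(0x1A2B3C4D, 1001, b"pkt-b", {"protocol": "dns", "src_ip": "10.0.0.9"})
store_packet(0x00000099, 1002, b"pkt-c", {"protocol": "http", "src_ip": "10.0.0.5"})

dns_packets = find_packets("protocol", "dns")
```

The row key stays 12 bytes no matter how many metadata fields are attached, while queries on any field go through the index first and touch the packet store only by key.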
The final hurdle was getting non-packet telemetry into our system. Examples of such telemetry include syslog, machine exhaust data, and network appliance alerts. For this we brought in Flume. Flume has a large set of sources (adapters) that can pull the telemetry we need and push it into Kafka, after which our stream processor takes over. Flume is also extremely extensible and will grow nicely with us as our use case evolves. It was the last piece of our architecture, and with that in place we were ready to go into beta.
When we fielded our first cluster and started to get users and user input about the system, it became apparent that the streaming use case, although primary, was not the only use case for OpenSOC. Security analysts want the ability to run summary reports over large data lakes, and that requires batch processing. The logical choice to meet this requirement was to bring in Hive. Hive allowed us to run SQL-like queries in batch over the data we had in Hadoop. We also realized that security analysts are not the only audience. Data scientists are after the same data. They have an established set of tools they like to use with data in HDFS, and a lot of these tools can interface with ODBC/JDBC connectors. Hive had such connectors available, which enabled our data science tools to pull data directly from our data lake. With that addition, OpenSOC 0.1 was born.
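The kind of summary report described here is essentially an aggregate SQL query. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive so that it runs anywhere; it is not HiveQL and not an OpenSOC query. The `alerts` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a table in the data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alerts (day TEXT, src_ip TEXT, severity TEXT)")
conn.executemany(
    "INSERT INTO alerts VALUES (?, ?, ?)",
    [
        ("day-1", "10.0.0.5", "high"),
        ("day-1", "10.0.0.5", "high"),
        ("day-2", "10.0.0.9", "high"),
        ("day-2", "10.0.0.7", "low"),
    ],
)

# Summary report: high-severity alert counts per source address, busiest first.
report = conn.execute(
    "SELECT src_ip, COUNT(*) AS alert_count "
    "FROM alerts WHERE severity = 'high' "
    "GROUP BY src_ip ORDER BY alert_count DESC, src_ip"
).fetchall()
conn.close()
```

A data science tool connecting over ODBC/JDBC would issue a query of the same general shape and receive the aggregated rows rather than the raw events.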
We presented our platform architecture at Hadoop Summit in San Jose last year and have received extremely positive responses and many imitations since. We have continued evolving and enhancing OpenSOC and have introduced numerous new features in recent months. We are currently on our fifth point release and are working hard on the sixth.
The OpenSOC platform was rather challenging to design and mature. The team had to navigate an ever-changing landscape of big data technologies, drive capabilities within these technologies through partnering, and work with the open source community to test them and open them for public consumption. Some of the more notable of these components are the Flume plugin for Kafka; the Kafka spout for Storm; the HDFS, HBase, and Elasticsearch bolts for Storm; and a lot of tooling around operationalizing and delivering Hadoop as an on-premises appliance.
As we continue to innovate, we are looking for developers from outside of Cisco to contribute to the platform. OpenSOC is intended to be a true community-managed effort. Please check out http://opensoc.github.io/ for more information on OpenSOC and how you can contribute to its future development.