Big Data in Security – Part IV: Email Auto Rule Scoring on Hadoop
Following part three of our Big Data in Security series on graph analytics, I’m joined by expert data scientists Dazhuo Li and Jisheng Wang to talk about their work in developing an intelligent anti-spam solution using modern machine learning approaches on Hadoop.
What is ARS and what problem is it trying to solve?
Dazhuo: From a high-level view, Auto Rule Scoring (ARS) is the machine learning system for our anti-spam system. The system receives a lot of email and classifies whether it’s spam or not spam. From a more detailed view, the system has hundreds of millions of sample email messages and each one is tagged with a label. ARS extracts features or rules from these messages, builds a classification model, and predicts whether new messages are spam or not spam. The more variety of spam and ham (non-spam) that we receive the better our system works.
Jisheng: ARS is also a more general large-scale supervised learning use case. Assume you have tens (or hundreds) of thousands of features and hundreds of millions (or even billions) of labeled samples, and you need them to train a classification model which can be used to classify new data in real time.
How does ARS specifically work?
Dazhuo: Our email corpus stores raw email messages and additional metadata about them. These messages are labeled based on either the originating source or the classifier. Those that cannot be straightforwardly classified (via either technique) require manual labeling. The spam messages are derived from spam traps or from custom reporting. Ham is sampled from different internal sources. A variety of language and network features (for example, regular expressions, tokens, URI links, GeoIP, WHOIS) are derived from the corpus for the machine learning system. We currently use about 50,000 rules. Once we have carefully collected the messages, labels, and rules, the machine learning algorithm assigns a weight to each rule, so that when new messages arrive we can combine the weights of the rules that match to decide whether each message is spam or legitimate email.
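[As an illustration of scoring with weighted rules, here is a minimal Python sketch. The three rules, their weights, and the threshold are invented for the example, and combining the weights by a plain sum is an assumption; the interview does not describe the actual ARS rules or scoring formula.]

```python
import re

# Hypothetical rules: (name, pattern, learned weight). These three are
# invented for illustration and are not actual ARS rules.
RULES = [
    ("free_money", re.compile(r"free money", re.I), 2.5),
    ("has_unsubscribe", re.compile(r"unsubscribe", re.I), 0.4),
    ("meeting_agenda", re.compile(r"meeting agenda", re.I), -1.5),
]
THRESHOLD = 2.0  # assumed decision threshold


def score(message: str) -> float:
    """Sum the weights of every rule whose pattern matches the message."""
    return sum(weight for _, pattern, weight in RULES if pattern.search(message))


def is_spam(message: str) -> bool:
    """Flag the message as spam when its total score reaches the threshold."""
    return score(message) >= THRESHOLD
```

[Rules that indicate ham carry negative weights here, so a message that triggers both kinds of rule lands somewhere in between.]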
As you receive more spam and ham, do the machine learning algorithms continue to improve and therefore increase effectiveness in protecting Cisco customers?
Dazhuo: As I mentioned, a successful program has several pieces, and the algorithms are the last step; they presuppose that we are receiving quality email inputs. As we collect a larger variety of emails, and dedicate efforts to feature engineering, we certainly expect efficacy improvements to continue.
How important is parallelizing in ARS?
Dazhuo: For the past five years, we had a proprietary system that was parallelizing the ARS work. When you have this amount of data on any kind of cluster, a common problem is failure at any stage in the process. So we knew we needed a framework that could handle failures and we also needed the system to be scalable so that as our data grew we could add additional hardware. The stack of tools built on Hadoop, though not perfect, does address the infrastructure difficulties on a distributed file system, map-reduce, and scheduling. So we implemented ARS on our Hadoop stack.
Although the learning algorithm often appears to be the last component of the system, and thus not to affect efficacy, its output serves as feedback for a variety of efficacy engineering efforts (for example, data sampling and rule writing). The sooner we receive feedback on those efforts, the more they can improve efficacy.
Jisheng: We introduced two levels of parallelizing work into ARS: the process level and the data level. First, ARS training has a few hyperparameters, so we need to train over 100 different classification models, one for each hyperparameter combination, and finally aggregate them with weights. This process itself is a perfect fit for map-reduce. We use one mapper job to train each classification model, and finally use one reducer to combine the model results.
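[The process-level pattern can be sketched in plain Python, with the built-in map and functools.reduce standing in for Hadoop mappers and a reducer. Everything specific here is an assumption made for illustration: the toy data, the logistic model, the two hyperparameters (learning rate and L2 penalty), and the choice of training accuracy as the aggregation weight. The interview does not say how ARS weights its models.]

```python
from functools import reduce
from itertools import product
import math

# Toy data, invented: (feature vector, label), where 1 = spam and 0 = ham.
DATA = [([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]


def predict(w, x):
    """Logistic prediction: estimated probability that x is spam."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(params):
    """Mapper: train one model for one (learning rate, L2) combination."""
    lr, l2 = params
    w = [0.0, 0.0]
    for _ in range(200):
        for x, y in DATA:
            err = predict(w, x) - y
            w = [wi - lr * (err * xi + l2 * wi) for wi, xi in zip(w, x)]
    # Assumed aggregation weight: accuracy on the training data itself.
    acc = sum((predict(w, x) > 0.5) == bool(y) for x, y in DATA) / len(DATA)
    return w, acc


def combine(a, b):
    """Reducer: add up accuracy-scaled weight vectors and the accuracies."""
    (wa, sa), (wb, sb) = a, b
    return [p + q for p, q in zip(wa, wb)], sa + sb


grid = list(product([0.1, 0.5], [0.0, 0.01]))  # 4 hyperparameter combinations
mapped = [([wi * acc for wi in w], acc) for w, acc in map(train, grid)]
total, norm = reduce(combine, mapped)
final_model = [t / norm for t in total]  # accuracy-weighted average model
```

[Because each call to train depends only on its own hyperparameters, the calls can run on separate workers; only the small (weights, accuracy) pairs have to travel to the reducer.]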
The second parallelizing effort is at the data level, which is similar to the concept of Mahout. The idea is that when you have a large training set with billions of samples, rather than training a classification model on the entire data, you can actually parallelize the training work on each small block of the training samples, and finally aggregate the intermediate results into one final model.
[Mahout is a machine learning code library developed by Apache]
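[A minimal sketch of the data-level idea in Python: split the samples into blocks, train on each block independently, then merge the per-block results. Averaging the weight vectors is one simple way to merge and is an assumption here, as are the toy data, the logistic model, and the function names; the interview does not say how ARS or Mahout aggregates intermediate results.]

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_block(block, dim, lr=0.5, epochs=100):
    """Fit logistic-regression weights on one block with plain SGD."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in block:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def split(samples, n_blocks):
    """Deal the samples round-robin into n_blocks blocks."""
    return [samples[i::n_blocks] for i in range(n_blocks)]


def average(models):
    """Merge per-block weight vectors by averaging each coordinate."""
    return [sum(col) / len(models) for col in zip(*models)]


# Invented data: feature 0 fires on spam (label 1), feature 1 on ham (label 0).
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 6
blocks = split(samples, 3)
# Each train_block call is independent, so on a cluster each block could be
# handled by a separate worker before the final averaging step.
model = average([train_block(b, dim=2) for b in blocks])
```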
With process-level parallelizing, we can dramatically reduce the ARS training time, though the total can never drop below the time of a single training run. The data-level parallelizing, in turn, substantially reduces the time of each individual training run. Overall, with both parallelizing efforts, we reduced the ARS training time by a factor of over 100 on a small 9-node cluster.
What is the end product that benefits from these efforts?
Dazhuo: The ESA (Email Security Appliance) blocks malicious messages for our customers.
What were the challenges with the old ARS system?
Jisheng: There were two main challenges with the old ARS system. First was the scalability problem. The prior ARS system was not able to take advantage of the fast growth in email training samples and the new rules (used to generate ARS features) which we have been generating to catch the latest spam patterns. As we discussed before, ideally we want to use as much training data variety as possible to achieve the best coverage and efficacy.
The second problem was timeliness. We do have a group of very experienced email experts manually writing new pattern rules to quickly catch the latest spam patterns. These new rules cannot be used in our real-time classification system until we finish training the weights for them in ARS. In the old ARS system, the long training period actually delayed the use of the new pattern rules in the real-time detection system on the ESA, thus reducing the possible efficacy gains.
How do you continually improve ESA efficacy?
Dazhuo: It really comes down to engineering effort: being able to evaluate the effectiveness of each individual component from a system’s perspective.
Is there any further work planned for ARS?
Jisheng: Yes, we are working on using online machine learning algorithms to further reduce the ARS training time to nearly real time. The key idea of online machine learning is to keep refining the prediction hypothesis using the prediction and the true label of each new sample. So rather than retraining the classification model every time a new training sample arrives, we can predict the sample, label the sample, and refine the model using only this new sample. Vowpal Wabbit, which was originally developed at Yahoo Research and is currently used by Microsoft Research, is a very fast machine learning program focused on online gradient descent learning, and it can now be used with Hadoop.
[“Online gradient descent” enables very efficient updates on the estimated parameters (which are already learned from the original data set) to reflect newly collected samples. This is different from offline learning which needs to collect all interesting samples before learning. Despite not necessarily being the best optimization method, online gradient descent may still yield very good generalization performance.]
This online machine learning model fits our ARS system well, since we are constantly pushing new labeled email samples into the training set. If we can refine the model based on the prediction and true label of these new samples, then we can really update our ARS classification model in real time.
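[The predict, label, refine loop described above can be sketched as follows. This is a generic online gradient descent update for a logistic model written in plain Python. It does not use Vowpal Wabbit's interface, and the class name, learning rate, and toy stream are assumptions made for illustration.]

```python
import math


class OnlineLogisticModel:
    """Logistic model refined one sample at a time."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr  # assumed fixed learning rate

    def predict(self, x):
        """Estimated probability that x is spam under the current weights."""
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Predict first, then take one gradient step using the true label y."""
        p = self.predict(x)
        self.w = [wi - self.lr * (p - y) * xi for wi, xi in zip(self.w, x)]
        return p  # the prediction made before the update


model = OnlineLogisticModel(dim=2)
# Invented stream of newly labeled samples: feature 0 marks spam, feature 1 ham.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50
for x, y in stream:
    model.update(x, y)  # no retraining over earlier samples
```

[Each update touches only the current sample, so its cost does not grow with the number of messages already seen.]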
What other machine learning work is TRAC currently doing?
Jisheng: TRAC puts considerable effort into applying different machine learning techniques to petabytes of data to quickly extract important security value. In addition to the supervised learning that I previously mentioned, we are also exploring different graph-based unsupervised learning algorithms like Collaborative Filtering, Bayesian Networks, and Markov Random Fields to address correlation analysis across different dimensions and data sets.
That concludes part four of our Big Data in Security series. Don’t forget to catch up on this week’s previous Big Data in Security blogs featuring conversations around Graph Analytics, The AMPLab Stack, and TRAC Tools.
Tomorrow is our fifth and final chapter when I talk to Brennan Evans and Mahdi Namazifar about leveraging the cloud for improved anti-phishing systems.
Dazhuo Li is a Threat Research, Analysis, and Communications (TRAC) data scientist who builds data infrastructure and machine learning systems for intelligent Internet security. His PhD dissertation introduced an approximation algorithm for probabilistic graphical models. In addition to machine learning, Dazhuo is also interested in distributed databases and functional programming languages.
Jisheng Wang received his bachelor’s and master’s degrees in electrical engineering (EE) from Shanghai Jiao Tong University, and then started his 10-year journey into network security from Penn State University. He received his PhD in EE from Penn State in 2008, after working on automatic signature extraction for polymorphic worms.
During his first years at Cisco, Jisheng worked on different security features and products including ACL, FW, IPS, and QoS. Four years ago, he began to apply Big Data and data analysis technologies to various security challenges. As a researcher, he moved from traditional heuristic-driven security to data-driven security. Jisheng has already led a few data analysis projects to improve the efficacy of Cisco web and email reputation systems. Currently he is leading TRAC efforts to pioneer development of Cisco Security Intelligence Operations (SIO) and next-generation Cisco Threat Defense (CTD).