April kicked off with a 1:292 rate of malware encounters and closed with a rate of 1:315. Highest peak day was April 20 when the rate reached 1:177. Lowest was April 4 at 1:338. The median rate of web malware encounters in April 2014 was 1:292, representing a slight improvement over the median of 1:260 requests in March but still worse than the median of 1:341 requests in February.
Takedowns of prolific spam botnets, such as Rustock in 2011 and Grum in 2012, had a substantial effect on reducing overall global spam volumes. This, combined with diminishing returns for spammers sending via bots, had left many email recipients basking in the comfort of (mostly) clean inboxes. No doubt this downward trend in global spam volumes also saved countless dollars that would have otherwise been frittered away on phony university degrees, suspect weight loss products, and erectile dysfunction medication.
Unfortunately, however, the good times seem to be coming to an end. Spam volumes have increased to the point that spam is now at its highest level since late 2010. Below is the graph of global spam volume as reported by Cisco SenderBase. From June 2013 to January 2014, spam was averaging between 50-100 billion messages per month, but as of March 2014 volumes were peaking above 200 billion messages per month--more than a 2X increase above normal.
On January 16, 2014, Proofpoint discussed a spam attack conducted via “smart devices which have been compromised.” Among the devices cited by Proofpoint as participating in the “Thingbot” were routers, set-top boxes, game consoles, and purportedly, even one refrigerator. Of course, news about a refrigerator sending spam generates considerable media attention, as it should, since an attack by the Internet of Things (IoT) would represent a high-water mark in the evolution of (in)security on the Internet. However, soon after Proofpoint’s post, Symantec published a response indicating that IoT devices were not responsible for the spam attack in question, and the machines behind the spam attack were all really just infected Windows boxes. So why is determining the identify of the devices used in this spam attack so difficult?
Following part three of our Big Data in Security series on graph analytics, I’m joined by expert data scientists Dazhuo Li and Jisheng Wang to talk about their work in developing an intelligent anti-spam solution using modern machine learning approaches on Hadoop.
What is ARS and what problem is it trying to solve?
Dazhuo: From a high-level view, Auto Rule Scoring (ARS) is the machine learning system for our anti-spam system. The system receives a lot of email and classifies whether it’s spam or not spam. From a more detailed view, the system has hundreds of millions of sample email messages and each one is tagged with a label. ARS extracts features or rules from these messages, builds a classification model, and predicts whether new messages are spam or not spam. The more variety of spam and ham (non-spam) that we receive the better our system works.
Jisheng: ARS is also a more general large-scale supervised learning use case. Assume you have tens (or hundreds) of thousands of features and hundreds of millions (or even billions) of labeled samples, and you need them to train a classification model which can be used to classify new data in real time.