In the last chapter of our five-part Big Data in Security series, expert data scientists Brennan Evans and Mahdi Namazifar join me to discuss their work on a cloud anti-phishing solution.
Phishing is a well-known historical threat. Essentially, it’s social engineering via email and it continues to be effective and potent. What is TRAC currently doing in this space to protect Cisco customers?
Brennan: One of the ways that we have traditionally confronted this threat is through third-party intelligence in the form of data feeds. The problem is that these social engineering attacks have a high time dependency. If we solely rely on feeds, we risk delivering data to our customers that may be stale, so that solution isn’t terribly attractive. This is compounded by another issue with many of the common data sources out there: they attempt to enumerate the threat by listing compromised hosts, and in practice each vendor seems to see just a small slice of the problem space; as I just said, oftentimes it’s too late.
We have invested a lot of time in looking at how to avoid the problem of essentially being an intelligence redistributor and instead look at the problem firsthand using our own rich data sources – both external and internal – and really develop a system that is more flexible, timely, and robust in the types of attacks it can address.
Mahdi: In principle, we have designed and built prototypes around Cisco’s next-generation phishing detection solution. To meet the requirements for a phishing detection solution that is both effective and efficient, our design is based on Big Data and machine learning. The Big Data technology allows us to dig into the tremendous amount of data that we have for this problem and extract predictive signals for phishing. Machine learning algorithms, in turn, provide the means for using those predictive signals, captured from historical data, to build mathematical models that predict the probability of a URL or other content being phishing.
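To make that idea concrete, here is a minimal sketch of such a model: a classifier trained on a handful of URL-string features that outputs a phishing probability rather than a hard verdict. The feature set, example URLs, and use of scikit-learn are illustrative assumptions, not the actual production signals or tooling.

```python
# Minimal sketch (not the production system): train a classifier on
# URL-derived features and output a phishing probability per URL.
from sklearn.linear_model import LogisticRegression
import numpy as np

def url_features(url: str) -> list:
    """Toy URL-string features; stand-ins for real predictive signals."""
    return [
        int("@" in url),   # '@' embedded in the URL
        url.count("-"),    # number of dashes
        url.count("."),    # rough proxy for subdomain depth
        len(url),          # overall URL length
    ]

# Hypothetical labeled data: 1 = phishing, 0 = legitimate.
urls = ["http://secure-login@bank.example.evil.com/verify",
        "https://www.example.com/about",
        "http://paypa1-account-update-security.example.net/login",
        "https://docs.example.org/index.html"]
labels = np.array([1, 0, 1, 0])

X = np.array([url_features(u) for u in urls])
model = LogisticRegression().fit(X, labels)

# Probability that a new URL is phishing, rather than a binary block/allow.
new_url = "http://account-verify-login.example.biz/@signin"
print(model.predict_proba([url_features(new_url)])[0, 1])
```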
When you say, “deliver to the customer,” what does that specifically look like?
Brennan: It’s dependent on the customer’s context. When we have a customer that is reading an email with a fraud component, we want to identify the malicious attempt and alert the customer. A different customer may be performing web requests after falling prey to a phishing email, so that’s a different choke point where we can identify and alert. There are multiple places where we want to address the problem – it’s not specific to a platform solution – rather, we address the core problem and then tailor the solution that’s best applied to the customer’s endpoint. That may be a firewall, AnyConnect, or a client making a proxy request through our service. Our goal is to have the best intelligence available that addresses these sorts of threats.
So you’re describing making this real-time data available in the cloud for different devices that are on customers’ premises?
Brennan: Certainly, we want to make the data available to devices where appropriate, but our cloud solution is where we have the largest battery of resources and can do the most in-depth analysis, with a giant data set driving constant improvements to our models. Part of the challenge is then tailoring that cloud solution to make the best decision within the customer’s context. So, for example, if we’re looking at a suspicious web request, we may be able to make the decision at the endpoint, or we may need the device to call out to the cloud, where we have our full resources. We want to have a response that’s most appropriate for our customers. Social engineering attacks are not easy to prevent, so we really need to be smart about how we address these attacks regardless of the customer environment.
Mahdi: To emphasize Brennan’s point, we want to address the problem in two layers. The first layer sits on the box that the client is using and makes local decisions. These decisions are based on the intelligence and parameters that are passed to the local decision maker on the box from a central orchestrator. The second layer is the cloud piece, which receives requests and acts on them based on updated logic and then quickly passes the decision back to the client device. Both of these layers work based on centrally and continuously trained classifiers.
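The sketch below illustrates that two-layer flow in simplified form; the component names, thresholds, and lists are hypothetical stand-ins, not actual Cisco components.

```python
# Rough sketch of the two-layer decision flow described above. A
# lightweight local checker uses parameters pushed from a central
# orchestrator; anything it cannot decide confidently is escalated
# to the cloud layer.
ALLOW, BLOCK, ESCALATE = "allow", "block", "escalate"

class LocalDecisionMaker:
    def __init__(self, params):
        # Parameters (list snapshots, thresholds, model weights) are
        # refreshed periodically by the central orchestrator.
        self.params = params

    def decide(self, url, score):
        # 'score' would come from a small local model run on the box.
        if url in self.params["blacklist"]:
            return BLOCK
        if url in self.params["whitelist"]:
            return ALLOW
        if score >= self.params["block_threshold"]:
            return BLOCK
        if score <= self.params["allow_threshold"]:
            return ALLOW
        return ESCALATE  # defer to the cloud layer

def cloud_decide(url):
    # Placeholder for the heavier cloud-side analysis: fetch content,
    # run the full, continuously retrained classifier, return a verdict.
    return ALLOW

params = {"blacklist": {"evil.example.com/login"},
          "whitelist": {"www.example.com"},
          "block_threshold": 0.9, "allow_threshold": 0.1}
local = LocalDecisionMaker(params)

verdict = local.decide("unknown.example.net/verify", score=0.55)
if verdict == ESCALATE:
    verdict = cloud_decide("unknown.example.net/verify")
print(verdict)
```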
In the future, are white lists/black lists realistic for preventing HTTP connections or does the solution need to be behavioral?
Brennan: The answer is really both. If we have enough granular insight into networks, having a listing capability built into the system is quite helpful. But we want to avoid providing extreme answers like “block” or “don’t block.” Instead, we want to develop confidence ratings for malicious activity so that we can better contextualize our responses. With phishing and other social engineering attempts, we really look at the structure of the attack message and build a descriptive space around what’s happening – whether it’s something static like a URL or a time-dependent, dynamic network structure.
Mahdi: It really depends how you want to look at the problem. If the decision must be near instantaneous – like in the case of a potential phishing URL – then the options are limited to white lists, black lists, and very simple models, because we can’t open the URL, parse the content, analyze it, and determine whether it’s malicious. On the other hand, if the time restrictions are more flexible, then we can load the content, pass it to a more elaborate trained machine-learning model, and then make a decision based on the outcome. The other point about black lists and white lists is that they allow us to efficiently leverage our knowledge (for example, about the malicious nature of a specific URL) to make the right decision when we see it again in the future. From that perspective, these lists are definitely useful, but relying only on them is certainly not the best we could do.
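The trade-off Mahdi describes might look roughly like the sketch below: list lookups short-circuit the expensive content-based path, and a verdict, once computed, is cached into the lists so a repeat sighting of the same URL is decided instantly. The thresholds and heuristics are purely illustrative.

```python
# Illustrative only: lists handle the near-instantaneous path, the
# content-based classifier handles the relaxed path, and new verdicts
# are cached so the same URL is decided instantly next time.
blacklist, whitelist = set(), set()

def expensive_content_verdict(url):
    # Stand-in for fetching the page, parsing it, and running the full
    # content-based classifier (too slow for the inline path).
    return "phishing" if "login-verify" in url else "benign"

def classify(url, time_budget_ms):
    if url in blacklist:
        return "phishing"
    if url in whitelist:
        return "benign"
    if time_budget_ms < 50:
        # Near-instantaneous path: only lists and very simple models
        # are feasible, so fall back to a cheap URL-string heuristic.
        return "suspicious" if url.count("-") > 3 else "benign"
    verdict = expensive_content_verdict(url)
    # Cache the knowledge for future sightings of this URL.
    (blacklist if verdict == "phishing" else whitelist).add(url)
    return verdict

print(classify("http://login-verify.example.net/a", time_budget_ms=500))
print(classify("http://login-verify.example.net/a", time_budget_ms=5))  # now instant
```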
From your perspective, does the transaction begin when a user clicks on the link in the email, or when the user receives the email? Where do you focus resources when looking at the full infection chain?
Brennan: That’s a good question, and I’ll go back to what I said earlier: we need to look at the context of the individual opening the email. Our answer can’t assume that our email appliances are everywhere, because that’s not true in every customer environment. We want to respond to all of the attack channels, and it’s somewhat dependent on where our products are placed and the data available.
Mahdi: The kind of interaction that we have with our customers is broad, and that makes us think about general solutions that can be modified based on specific customer needs. This is different from something like Google’s phishing solution, for example, which is only useful if you are using Chrome. Our interactions with customers are very different and might take one or several of many different forms depending on the customer’s needs. We may use different trained models for specific customers, or we may need to build targeted white lists and black lists for customers in the financial services or healthcare industries, for example. So the solution we are building addresses all of our customers’ varying needs. We designed this system to be as flexible as possible, without making limiting assumptions, and we built different layers of decision making that can be turned on or off.
You mention machine learning. How much effort is going into machine learning with phishing specifically?
Brennan: There is a lot of promise in machine learning. We have an amazing amount of information in our email corpus, and within it we see a lot of phishing attacks, so we have been extracting those embedded URLs. With simple machine learning algorithms we receive very competitive answers compared to some of the phishing data sources that we were paying for, and that’s just scratching the surface. So there is tremendous promise in expanding this, looking at network dynamics and at more interesting models, like the content on a web page. All of that comes together to create a very promising solution.
Mahdi: To add to Brennan’s point, we see that all modern anti-phishing tools are in one way or another based on machine learning. The reason is clear: there are simply too many variables in deciding whether something is phishing, and if the process isn’t automated it’s really impossible to develop a reliable system. The vast majority of the research in this area revolves around machine learning. The important point to remember is that the input to machine learning algorithms is a data set that contains predictive signals. For the phishing problem, these signals fall into two categories: one based on the URL string only, and one based on the full content of an email or a web page. A couple of examples of URL-based signals are the presence of the at symbol (@) in the URL and the number of dashes in the URL. Examples of content-based signals include the presence of a login form and signals based on the Term Frequency-Inverse Document Frequency (TF-IDF) technique. Generally, in predictive analytics projects using machine learning, the modeling is the last 5 percent of the work, whereas understanding the data, finding the right way to use the data, cleaning up the data, and discovering the predictive signals in the data is 95 percent of the work. We have spent a lot of time getting the predictive signals correct.
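As a rough illustration of those two families of signals, the sketch below extracts a couple of toy URL-based features (the at symbol, dash count) and content-based features (a login form, TF-IDF terms). The regular expression, feature choices, and example pages are simplifications, not the production signal set.

```python
# A minimal sketch of URL-based and content-based predictive signals.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def url_signals(url: str) -> dict:
    return {
        "has_at_symbol": int("@" in url),  # presence of '@' in the URL
        "num_dashes": url.count("-"),      # number of dashes in the URL
    }

def content_signals(html: str) -> dict:
    # Presence of a login form is a strong (though not sufficient) signal.
    has_login_form = bool(re.search(r"<input[^>]+type=[\"']?password", html, re.I))
    return {"has_login_form": int(has_login_form)}

# TF-IDF turns page text into weighted term features that a downstream
# classifier can consume alongside the signals above.
pages = ["Please verify your account password to avoid suspension",
         "Welcome to our documentation portal for developers"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(pages)

print(url_signals("http://bank-secure-login@phish-example.com/update-info"))
print(content_signals('<form><input type="password" name="pw"></form>'))
print(tfidf.shape)  # (num_pages, vocabulary_size)
```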
You both mentioned working toward a model that is less reliant on third-party data feeds. Brennan mentioned the email corpus, what are some of the other homegrown data sources that drive this solution?
Mahdi: The feeds will continue to provide a valuable source of information, but as another source of data we also crawl the web based on our telemetry data, and for each crawled page we extract tons of URL- and content-based predictive signals. This is a powerful source of data that is unlabeled (that is, we don’t know whether a specific page is phishing or not), but it gives us a rich source of information that can be used in unsupervised and semi-supervised machine learning techniques. For example, using these techniques we can try to find the right labels (phishing or not phishing) for some of the crawled pages and use those labeled pages as homegrown feeds.
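A self-training loop is one simple way to realize that idea: a classifier trained on labeled feed data pseudo-labels the crawled pages it is most confident about, and those pages are folded back into the training set as homegrown labels. The features, threshold, and use of logistic regression below are assumptions for illustration, not the actual pipeline.

```python
# Sketch of semi-supervised self-training over crawled, unlabeled pages.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrices (rows = pages, columns = signals).
X_labeled = np.array([[1, 5], [0, 0], [1, 7], [0, 1]], dtype=float)
y_labeled = np.array([1, 0, 1, 0])            # 1 = phishing, 0 = benign
X_crawled = np.array([[1, 6], [0, 0], [1, 1]], dtype=float)  # unlabeled

CONFIDENCE = 0.8
for _ in range(3):  # a few self-training rounds
    model = LogisticRegression().fit(X_labeled, y_labeled)
    if len(X_crawled) == 0:
        break
    proba = model.predict_proba(X_crawled)
    confident = proba.max(axis=1) >= CONFIDENCE
    if not confident.any():
        break
    # Promote confidently labeled crawled pages into the labeled set.
    X_labeled = np.vstack([X_labeled, X_crawled[confident]])
    y_labeled = np.concatenate([y_labeled, proba[confident].argmax(axis=1)])
    X_crawled = X_crawled[~confident]

print(len(y_labeled), "labeled examples after self-training")
```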
Brennan: As Mahdi alluded to, better data trumps smarter modeling. In machine learning, we want the simplest model that describes the data. I like to look at the data, explore it, and think about it until it begins to tell a story. Then I see where that leads, and I begin to understand the problem and determine what I can learn from the data or what is missing. This approach leads to a solution that is not just effective theoretically, but effective practically in a product. Mahdi described our methodology for fetching new data and we mentioned the email corpus, but we also collect quite a bit of data from other sources such as honeypots, and we highly value the telemetry we have from deployed services. This visibility helps us with label confidence: do we know something is bad, do we think it’s bad, or does it describe a legitimate financial institution? All of that information comes together, and we pick from our internal sources to build a better data model.
That concludes our fifth and final segment for the Big Data in Security series. This week’s earlier discussions include TRAC Tools, The AMPLab Stack, Graph Analytics, and Email Auto Rule Scoring on Hadoop. Thank you for joining us this week.
Mahdi Namazifar joined TRAC in January 2013. Prior to Cisco, he was a scientist at Opera Solutions where he worked on a variety of data analytics and machine learning problems from industries such as finance and healthcare. He received a master of science in 2008 and a PhD in 2011, both in operations research, from the University of Wisconsin-Madison. During his time as a PhD student, Mahdi did internships at IBM T.J. Watson Research Lab and the San Diego Supercomputer Center. Mahdi’s research interests include machine learning, mathematical optimization, parallel computing, and Big Data.
Brennan Evans is a Threat Research, Analysis & Communications (TRAC) applied researcher. His background at Cisco dates back to IronPort Systems – a 2002 startup that Cisco acquired in 2007. Brennan has worked on product design, prototyping, and production software engineering for both backend systems and security appliances. In 1998, he joined Inktomi and worked on HTTP web proxies. He holds a bachelor’s degree in computer science from Cal Poly San Luis Obispo. His research interests include machine learning, data mining and analysis, and programming languages.
“If we solely rely on feeds we risk delivering data to our customers that may be stale so that solution isn’t terribly attractive.”
Excellent point. Stale (and even incorrect) data doesn’t do much for your customers, or for your own reputation with them. This is a problem many companies run into. You have to balance security with customer needs and business goals, and something (usually security) gets pushed to the back burner to make room for everything else.
Hadoop encompasses HDFS (the Hadoop Distributed File System) and the MapReduce programming framework. Hadoop is very useful and something we use, but it’s not the only way of interacting with the data. The reason Hadoop is the industry standard for handling big data is that it’s very scalable: as we throw more disks and computing resources at it, we get better performance and higher data-processing capacity.
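For readers unfamiliar with the pattern, here is a toy, single-machine simulation of the map, shuffle, and reduce steps (counting URL sightings per domain); in Hadoop the same logic runs as distributed tasks over data stored in HDFS, which is what lets it scale as nodes and disks are added. The data and counting task are hypothetical.

```python
# Toy, single-machine simulation of the MapReduce pattern.
from collections import defaultdict
from urllib.parse import urlparse

urls = ["http://phish.example.net/login", "https://www.example.com/",
        "http://phish.example.net/verify", "https://docs.example.org/"]

# Map: emit (domain, 1) for every URL sighting.
mapped = [(urlparse(u).netloc, 1) for u in urls]

# Shuffle: group values by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for domain, count in mapped:
    grouped[domain].append(count)

# Reduce: sum the counts for each domain.
counts = {domain: sum(vals) for domain, vals in grouped.items()}
print(counts)  # e.g. {'phish.example.net': 2, 'www.example.com': 1, ...}
```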