Defeating Polymorphic Malware with Cognitive Intelligence. Part 3: Static Analysis
Contributors: Jeff Burke
Nowadays, everyone likes to talk about the use of machine learning in cybersecurity. Almost every security vendor leverages machine learning in one form or another. Organizations employ security teams with data analysis skills to automate threat hunts. But what does it really take to build a scalable and effective machine learning capability with broad threat coverage?
Diverse data sets. Machine learning systems learn from what they observe. Balanced representation of malware encountered by customers of various industries, sizes, and geolocations is essential for the system to provide comprehensive coverage.
Network and endpoint data. These are the two common places to detect cyber threats, both of which provide telemetry needed for machine learning systems to learn over time. Limiting training data to just one of them reduces the efficacy and accuracy of detections.
Combination of machine learning algorithms. Multiple algorithms and algorithm categories working alongside each other are far more effective and more resilient to manipulation than any single algorithm can be. Together they reinforce each other to improve precision and attribution.
Frequently updated classifiers. As new threat data becomes available, the classifiers should continuously learn and adapt. A completely new piece of malware written from scratch is unlikely to be detected by a year-old classifier that was never exposed to the attributes of that new sample.
Smart feature engineering. Fundamental to the application of machine learning, feature engineering is the most complicated part of the puzzle. It takes a lot of brainstorming, domain knowledge, and extensive testing to select the attributes of a file (or a web flow) that drive accurate predictions.
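To illustrate why a combination of algorithms is harder to manipulate than a single one, here is a minimal sketch of soft voting, where several detectors each return a probability and the ensemble averages them. The three detectors, their thresholds, and the feature names (entropy, import_count, size_kb) are toy assumptions made up for illustration; they are not Cisco's classifiers or features.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical feature vector: attribute name -> numeric value.
Features = Dict[str, float]
# A detector maps features to a maliciousness probability in [0, 1].
Detector = Callable[[Features], float]


def entropy_detector(f: Features) -> float:
    # Toy rule (assumption): high byte entropy can indicate packing.
    return 0.9 if f.get("entropy", 0.0) > 7.0 else 0.2


def import_detector(f: Features) -> float:
    # Toy rule (assumption): very few imports can hint at a packed binary.
    return 0.8 if f.get("import_count", 0.0) < 5 else 0.1


def size_detector(f: Features) -> float:
    # Toy rule (assumption): small executables are slightly more suspicious.
    return 0.6 if f.get("size_kb", 0.0) < 50 else 0.3


def soft_vote(detectors: List[Detector], f: Features) -> float:
    """Average the detectors' probabilities (soft voting)."""
    return mean([d(f) for d in detectors])


DETECTORS: List[Detector] = [entropy_detector, import_detector, size_detector]

# A sample that trips two of the three toy detectors.
sample = {"entropy": 7.6, "import_count": 3, "size_kb": 400}
score = soft_vote(DETECTORS, sample)

# The same sample after an attacker lowers its entropy to evade one detector.
evasive = {"entropy": 5.0, "import_count": 3, "size_kb": 400}
evasive_score = soft_vote(DETECTORS, evasive)

print(f"ensemble score: {score:.2f}, after evasion: {evasive_score:.2f}")
```

Evading one detector lowers the ensemble score but does not remove the signal from the others, which is the resilience argument made above. Real systems would typically weight or stack trained models rather than average hand-written rules.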
Cisco’s approach to malware and threat detection combines a large variety of data types with multiple machine learning algorithms and algorithm categories (supervised, unsupervised, online or semi-supervised, graph-based). We have been working closely with many of our customers to help protect their organizations from threats observed on the network (using our threat-centric Firewall, Email and Web Security, and other offerings) or at the endpoint level (using AMP for Endpoints). This close connection and interaction with our customers has also resulted in global threat visibility and exposure to vast quantities of malware across a variety of organizations worldwide.
We always do our best to block these threats automatically, though some very new malware samples can still get through. Such threats are eliminated retrospectively. Retrospective detections can result from many different analysis techniques performed on the Cisco backend. One of these approaches, which we call Cross-Layer analytics, has been briefly described in one of our previous blog posts. By correlating network and endpoint data, Cross-Layer analytics boosts detection efficacy. As a result, security teams become more productive (what previously had to be hunted for is now automated) and can shift more focus to creating new hunts and detection mechanisms.
Over time, the machine learning capabilities introduced at the individual product level evolved into an architecture with several building blocks, including the Global Risk Map, Network Analysis Pipeline, Endpoint Analysis Pipeline, and Cross-Layer analytics.
Cisco’s Machine Learning Architecture
Each block contains a set of machine learning algorithms that are specialized in the given domain. All blocks can be used independently but reinforce each other when deployed together. Beyond simply detecting an intrusion, the goal is to correctly estimate the risk related to that intrusion and to prioritize it for efficient handling based on the score and recommended actions. That’s where Cisco Threat Response also plays a significant role, as it allows a customer’s telemetry from a variety of Cisco and third-party sources to be correlated in real time to come up with an adequate response strategy. Combining these methods helps provide optimal coverage against a large variety of attackers.
What’s New in 2018: Static File Analysis
Let’s take a closer look at the Static File Analysis algorithms, which are the closest match to traditional antivirus solutions. The algorithms inspect various file artifacts and classify them as either malicious or as yet unknown. The static analysis engine leverages the ground truth to generate forward-looking protection using machine learning techniques. What we strive to build for our customers can also be described as a time machine that allows you to proactively classify files that are likely to be more broadly known as confirmed malware in the future.
The training of the static file analysis classifier is done in the Cisco cloud. It is carefully vetted for training bias and optimized for fair representation of various malware families and categories. More specifically, we use our static and dynamic malware analysis results to attribute each malware sample to a specific malware family. This attribution lets us ensure that we are considering both low-risk and high-risk malware families, which pushes the trained classifier towards higher detection rates across a broad spectrum of threats. We do not actively seek the largest possible set of malware and legitimate files for training; instead, we aim for balanced representation. The 100,001st sample of the same polymorphic malware has little incremental value.
Today, the Static File Analysis algorithm is available to Threat Grid Cloud users as one of the indicators contributing to the overall sample analysis efficacy and threat score. Static File Analysis is not restricted to any particular source of sample submissions; rather, it analyzes every Portable Executable (PE) binary submitted to Threat Grid. The classifier also inspects the artifacts associated with binaries.
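One simple way to picture balanced representation is to cap how many samples any single family may contribute to the training set. The sketch below is an illustration under that assumption only; the function name cap_per_family, the cap of 100, and the family labels are hypothetical and do not describe Cisco's actual training pipeline.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Each sample is (sample_id, family_label); both are hypothetical placeholders.
Sample = Tuple[str, str]


def cap_per_family(samples: List[Sample], cap: int, seed: int = 0) -> List[Sample]:
    """Keep at most `cap` randomly chosen samples from each family."""
    by_family: Dict[str, List[str]] = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced: List[Sample] = []
    for family in sorted(by_family):
        ids = by_family[family]
        chosen = ids if len(ids) <= cap else rng.sample(ids, cap)
        balanced.extend((sample_id, family) for sample_id in chosen)
    return balanced


# A corpus dominated by one polymorphic family plus a rarely seen family.
corpus: List[Sample] = [(f"poly-{i}", "polymorphic_family_a") for i in range(1000)]
corpus += [(f"rare-{i}", "rare_family_b") for i in range(12)]

balanced = cap_per_family(corpus, cap=100)

counts: Dict[str, int] = defaultdict(int)
for _, family in balanced:
    counts[family] += 1
print(dict(counts))
```

The dominant family is cut down to the cap while the rare family is kept whole, so additional near-duplicate polymorphic variants stop crowding out less common families.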
Machine Learning-based Static File Analysis performed within Threat Grid Cloud
If you are not familiar with Threat Grid, this powerful platform serves as Cisco’s sample analysis backend and is tightly integrated with most products in the security portfolio. Behavioral indicators serve as the foundational building blocks of its analysis pipeline, which consists of both static and dynamic analysis using an “Outside Looking In” approach (monitoring the kernel from outside the VM to counteract sandbox evasion techniques). Behavioral indicators show up as part of the analysis report when a particular behavior is observed during sample execution. Each indicator has an associated score. The higher the score, the more confident we are that the behavior highlighted by the indicator, or a specific static attribute of the sample, is malicious. Together, all indicators that trigger during the analysis contribute to a final threat score assigned to the sample. Samples that score 95 and above are automatically marked malicious in the AMP Cloud architecture.
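The relationship between indicator scores, the final threat score, and the 95-point threshold can be sketched as follows. Only the threshold of 95 comes from the text. How Threat Grid actually combines indicator scores is not described here, so taking the maximum is purely an assumption, as are the 0 to 100 scale and the indicator names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# From the text: samples scoring 95 and above are marked malicious.
MALICIOUS_THRESHOLD = 95


@dataclass(frozen=True)
class Indicator:
    name: str   # hypothetical indicator name
    score: int  # assumed 0-100 scale; higher means more confidence


def threat_score(
    indicators: Iterable[Indicator],
    combine: Callable[[List[int]], int] = max,
) -> int:
    """Combine triggered indicator scores into one sample score.

    ASSUMPTION: the default combiner is max(); the real aggregation
    used by Threat Grid is not specified in this article.
    """
    scores = [indicator.score for indicator in indicators]
    return combine(scores) if scores else 0


def is_auto_convicted(score: int) -> bool:
    """Apply the 95-and-above rule stated in the text."""
    return score >= MALICIOUS_THRESHOLD


# Hypothetical indicators triggered during one analysis run.
triggered = [
    Indicator("example_behavioral_indicator", 60),
    Indicator("example_ml_static_indicator", 95),
]
final_score = threat_score(triggered)
print(final_score, is_auto_convicted(final_score))
```

Keeping the combiner as a parameter makes it easy to swap in a different aggregation rule without touching the threshold check.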
Machine Learning Model Identified Executable Artifact as Likely Malicious
So, what does the new Machine Learning-based (ML-based) Static Analysis have to offer Cisco customers? The answer is simple: further efficacy improvements for any of your AMP-enabled devices and endpoints. Whether file analysis submissions come from our AMP-enabled devices (Endpoint, Web, Email, or Firewall), are submitted manually through the User Interface, or arrive through Threat Grid’s APIs, all of them are inspected by the classifier.
Based on our initial evaluation of around 8.4 million samples submitted to Threat Grid within 5 days, we have observed an efficacy improvement of around 2%. Let’s break it down a little more. Threat Grid convicted around 15,800 of those samples. The ML-based Static Analysis classifier added another 320 unique samples that were confirmed to be true positives, and 320 out of 15,800 is roughly 2%. Not only does this help improve protection for our customers, but it also helps address coverage gaps across the entire security portfolio.
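The arithmetic behind the 2% figure can be checked in a few lines. The inputs are the rounded numbers quoted above, so the result is approximate; the variable names are ours.

```python
# Rounded figures quoted in the evaluation above.
baseline_convictions = 15_800  # samples convicted by Threat Grid
ml_only_convictions = 320      # extra true positives from the ML classifier

# Relative improvement: additional convictions over the existing baseline.
relative_gain = ml_only_convictions / baseline_convictions
total_convictions = baseline_convictions + ml_only_convictions

print(f"relative gain: {relative_gain:.2%}")
print(f"total convictions: {total_convictions}")
```

The gain is measured relative to the samples already convicted, not to all 8.4 million submissions, most of which were never convicted at all.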
Where To Go Next
To learn more about AMP for Endpoints: http://cisco.com/go/ampendpoint
To learn more about Threat Grid: http://cisco.com/go/threatgrid
To learn more about Cognitive Intelligence: http://cisco.com/go/cognitive
If you are attending Cisco Live Barcelona, join us for a comprehensive 8-hour technical deep dive to learn more about Cisco’s Endpoint Security and Advanced Threat offerings. Look for TECSEC-2599.