Netflow for Incident Response
This is the fourth part in the series “Missives from the Trenches.” (Here are the (first), (second), and (third) parts of the series.) In today’s blog post we will be discussing Cisco IOS Netflow. Netflow holds the interesting position of being both the most useful and the least used tool. When meeting with other companies I often ask them, “Do you use Netflow?” By asking this question I am actually asking several different questions: Do you care about the security of your site? Do you have any hope of managing and responding to events at your site? The answers unfortunately tend to run as follows: “What is Netflow?” “The network guys use it but we don’t.” “I think we capture it somewhere, but I’m not really sure where.” And so on. I then mention that Netflow is free, that they don’t have to buy anything to start using it, and that we use it for every large case we handle. At that point they start looking angrily at the sales engineer, asking why this is the first they are hearing about it. So what is Netflow, and why does Cisco CSIRT say it’s critical to daily event management? Read on to find out!
Cisco IT’s networking team uses the flow tools NetQos and Arbor.
NetQos is mainly used for capacity planning. It gives a highly summarized view that is great for establishing which links need more (or less) bandwidth, for historical trending of application use over time, and so on. This type of trending isn’t used for IR, although we occasionally pull stats. We use Arbor for anomaly detection on our points of presence (POPs), basically to detect and identify distributed denial of service (DDoS) attacks on the network. This is again highly summarized information; the product’s strength is in its ability to create very customizable relational databases of traffic amounts and alert on anomalies. We love that our IT networking team does this. We (CSIRT) used to be highly involved with IT in responding to DDoS, but over the years it became a normal IT service. Not much investigation is really needed: we just want the attack to stop, and IT can apply the appropriate controls. We are occasionally brought in when the DDoS is sourced inside Cisco, or when there is something unusual or targeted about it.
We (Cisco CSIRT) have recently replaced our large homegrown OSU flow-tools setup with Lancope. There are a couple of reasons behind that, but basically there is some great stuff out there, and we don’t need the hassle of development and support that a homegrown tool brings with it. Flows are a large part of how we deal with security at Cisco, and for that we need 1-to-1 Netflow records for forensic investigations. I have testified in court using reports based on Netflow, and having as close to the real record as possible is important. We also like some of the cool summarized views that Lancope provides. For example, we have all of our high-risk networks showing up in a 24-hour graph. No one in these networks is allowed to send data directly out to the Internet, and Lancope gives a great visual of that. However, we really need the ability to query the actual records, not just look at pretty graphs. As well as direct access to the actual flow records, we want as much backlog of flows as we can afford. We typically like to have a year of historical Netflow data.
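To make the high-risk network rule concrete, here is a minimal sketch of that kind of query run against raw flow records. It is not how Lancope does it. The Flow record layout, the function names, and both address ranges are assumptions made up for illustration.

```python
import ipaddress
from typing import Iterable, List, NamedTuple


class Flow(NamedTuple):
    """A simplified flow record (hypothetical layout, not a real export format)."""
    src: str
    dst: str
    dst_port: int
    nbytes: int


# Hypothetical address ranges; substitute your own.
HIGH_RISK_NETS = [ipaddress.ip_network("10.50.0.0/16")]
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]


def in_any(addr: str, nets) -> bool:
    """True if addr falls inside at least one of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in nets)


def direct_internet_flows(flows: Iterable[Flow]) -> List[Flow]:
    """Flows that start in a high-risk network and end outside the internal ranges."""
    return [
        f for f in flows
        if in_any(f.src, HIGH_RISK_NETS) and not in_any(f.dst, INTERNAL_NETS)
    ]
```

Anything this returns is a record you can pull up in full, which is the point of keeping the 1-to-1 flows and not only the graph.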
At Cisco we collect flows from all of our Internet POPs (currently 50 and growing) and across all of our data centers (40). We collect and store an average of 4.6 billion flows per day. We don’t look at each one of these flows, but we have them all in a searchable database. We use the Netflow records to provide the needed context for security events. We started using OSU flow-tools back when that was the only flow-based software available and have recently moved over to Lancope for storing our Netflow records. If we identify a security issue with IPS, we can query Netflow to find out exactly which IPs accessed the host, at what time they accessed it, and what that host did on the network after the issue. This gives us the context we need to effectively manage and respond to the event. Without Netflow we would not be able to see the chain of events leading to the compromise, nor the aftereffects (what other machines may have been touched).
So what are some real-world examples of our use of Netflow to support IR&M? One of the typical roles of an enterprise CSIRT is identifying 0-day malware that may be able to bypass typical security controls (often through user actions). Through IPS, we may identify a machine on our network making a connection out to an IRC server. Further examination (often as easy as looking at a bot name such as WXP-US-De3r44Der, which follows the pattern OS-COUNTRY-RANDOMDATA) finds that the machine is participating in a botnet and is under remote control by a miscreant. The first thing we would do is inject a BGP null route (real-time blackhole routing) for the botnet’s external IP address, to cut off any other machine that may be similarly compromised. Then we query our Netflow database for all connections to the IP address and port of the malicious IRC server. This way we can get a complete list of every machine that may still be infected and work through getting all of the Netflow-identified machines remediated.
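The botnet query above boils down to "who talked to this IP and port, and when did they start?" Here is a minimal sketch of that lookup, assuming the flows have already been exported as simple records. The Flow layout and the function name are hypothetical, and a real collector would run this as a database query, not a Python loop.

```python
from typing import Dict, Iterable, NamedTuple


class Flow(NamedTuple):
    """A simplified flow record (hypothetical layout)."""
    start: str      # ISO 8601 timestamp, e.g. "2011-03-01T08:15:00"
    src: str
    dst: str
    dst_port: int


def hosts_contacting(flows: Iterable[Flow], bad_ip: str, bad_port: int) -> Dict[str, str]:
    """Map each source address that contacted bad_ip:bad_port to its earliest flow time.

    Timestamps are compared as strings, which only works because they are
    assumed to share one fixed ISO 8601 format.
    """
    first_seen: Dict[str, str] = {}
    for f in flows:
        if f.dst == bad_ip and f.dst_port == bad_port:
            if f.src not in first_seen or f.start < first_seen[f.src]:
                first_seen[f.src] = f.start
    return first_seen
```

The keys are the remediation list. The earliest timestamp per host gives a starting point for working out how long each machine has been checking in.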
Further, we can create alerts that will tell us if we receive any new flows to that botnet. Another daily use is identifying (again, often through a policy-based IDS alert) a machine that was compromised. By querying Netflow we can instantly find where the attack originated and what other machines were impacted.
Another good use is policy-based alerting or reporting. For example, you could check any connections in areas of your enterprise where you need network connectivity but want to ensure employees are adhering to policy, such as not web surfing from a data center system. Similarly, you can use Netflow to evaluate your firewall access control lists. If, for example, you have a web server and a DNS server in a DMZ and you have applied access control lists to block all other traffic, you can set up alerts for any traffic not on port 80 or 53.
The last example that I will give (and you should be guessing by now that they are limitless) is using Netflow to detect covert channels and/or web-based uploads. This can be useful even where the data is encrypted. You can query for web traffic where the ratio of upload to download doesn’t match expected behavior. For example, if a user connects to a web server and uploads 20 megabytes of data while downloading 200 KB, the user is probably uploading files to the web server or tunneling traffic.
IPS (or deep packet inspection) is our #1 security defense; Netflow is a very close #2 (of course, now with version 9 you have some limited DPI too). If you don’t currently use Netflow to its full potential, you have an amazing opportunity ahead to make a real change by adding it to your arsenal.
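Two of the checks described above, the DMZ port policy and the upload-to-download ratio, are simple enough to sketch in a few lines. This is an illustration under stated assumptions, not a production detector. The Flow layout, the DMZ addresses, the set of web ports, and the thresholds (1 MB minimum upload, 10:1 ratio) are all made up for the example. Netflow records are one-directional, so the ratio check has to pair the client-to-server bytes with the server-to-client bytes itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Flow(NamedTuple):
    """A simplified, one-directional flow record (hypothetical layout)."""
    src: str
    src_port: int
    dst: str
    dst_port: int
    nbytes: int


# Hypothetical DMZ servers and the only ports each should receive traffic on.
DMZ_ALLOWED_PORTS = {"192.0.2.80": {80}, "192.0.2.53": {53}}
WEB_PORTS = {80, 443}  # assumption: what counts as "web traffic"


def off_policy_flows(flows: Iterable[Flow]) -> List[Flow]:
    """Flows to a DMZ server on a port the ACL was supposed to block."""
    return [
        f for f in flows
        if f.dst in DMZ_ALLOWED_PORTS and f.dst_port not in DMZ_ALLOWED_PORTS[f.dst]
    ]


def upload_heavy_pairs(
    flows: Iterable[Flow], min_upload: int = 1_000_000, ratio: float = 10.0
) -> List[Tuple[str, str]]:
    """(client, server) pairs whose web upload bytes dwarf their download bytes."""
    up: Dict[Tuple[str, str], int] = defaultdict(int)
    down: Dict[Tuple[str, str], int] = defaultdict(int)
    for f in flows:
        if f.dst_port in WEB_PORTS:        # client -> server direction
            up[(f.src, f.dst)] += f.nbytes
        elif f.src_port in WEB_PORTS:      # server -> client direction
            down[(f.dst, f.src)] += f.nbytes
    return sorted(
        pair for pair, sent in up.items()
        if sent >= min_upload and sent > ratio * down[pair]
    )
```

With these thresholds, the 20 MB up and 200 KB down example from the text is flagged, while an ordinary large download is not. Neither check looks at payload, which is why the ratio check still works on encrypted traffic.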
The last thought I will leave you with concerns enabling Netflow on your routers: Netflow uses almost no resources on the device. On our busiest gateway Netflow uses less than 3% CPU at peak. It is nothing like ACL logging, which has a significant resource impact. Most of the newer devices can do Netflow in hardware. For more information on the load on particular devices, see the NetFlow Services Solutions Guide and NetFlow Performance Analysis.
So enable flows, collect them, then start getting a handle on the security of your site.