In this post I build on the ideas covered in my previous post, Whales and IDS, and discuss why striving for the possible, not the perfect, is a valuable direction to take, not just with IPS but also with managing and monitoring IDS alerts.
At Cisco, my team (Cisco CSIRT) is responsible for investigations into any cyber attacks against Cisco.com. Back when we first deployed IDS, we found that hundreds of IPs from all over the world were attacking us all the time. Right now there are probably 100 different sources port-scanning, probing our web infrastructure, looking for a way in. An IPS would detect these attacks, but we have a relatively small team and we can’t act on everything, so we really have to make sure we DO act on the important stuff. In the Cisco.com environment, we get over a million inbound attacks every day. Very rarely do the attacks have any level of success, and no one can physically examine that many legitimate (but unsuccessful) attacks. Yes, examining each and every one of those attempts would be perfect, just not feasible. But we haven’t let the ideal get in the way of the possible. I’ll give you an example of how we used this line of thinking to improve the security of the site.
The web servers at Cisco.com see a lot of inbound activity (most involving customers buying routers or looking up information on Cisco products!). They don’t, however, see any outbound activity. If a system is compromised, the miscreants start looking around for other systems, trying to download code, exfiltrate data, etc.: stuff your web farms (and mine) should never be doing. So we turned the equation around and concentrated on outbound activity. Anything at all gets a follow-up; it doesn’t really matter what it is, we check it out thoroughly. So already we’ve gone from inspecting millions of alerts and failing to the following:
Inbound vs. outbound events on CCO (inbound: ≈1 million per day; outbound per day as listed)
Date      Outbound events
05/23/09  14
05/24/09  48
05/25/09  87
05/26/09  94
05/27/09  125
05/28/09  66
05/29/09  44
05/30/09  17
05/31/09  25
06/01/09
My team can go through these events in a few hours. Most end up being added to our tuning, so the case numbers keep dropping, our fidelity increases, and we have a stranglehold on anything happening in that environment. Most CSIRTs end up with responsibilities that seem like counting the sand on a beach or finding the proverbial needle in a haystack. Many look for (or are sold) magic security products that claim to reduce events to only the important ones. Magic never worked well for me, so I am not comfortable relying on it. I would suggest that you look at your monitoring to make sure you’re asking the right questions: start with what is possible (events that you know you can take action on) and work out from there.
So exactly how do we deal with things when we do have high event counts? The graph below shows daily event counts from our highly tuned 165-sensor IPS farm.
I will go over a couple of hints here that will demonstrate that you don’t need a magic algorithm but just some dedication and common sense. The first is assigning IDS location (locale, in Cisco IPS) variables. At Cisco we have IDS variables defined for anything meaningful. We use them in tuning, we use them in custom IDS sigs, and most importantly we use them to make the alert human-readable. Take the below alert as an example:
Assign “locality” to the source IP and destination IP:
sigDetails="STOR command on dst ports 20 and 21" src=64.104.X.X srcDir=DC_OTHER_DC_NETS srcport=41507 dst=210.210.X.X dstDir=OUT dstport=21
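As a rough illustration of how locality labels like srcDir and dstDir in the alert above could be derived, here is a minimal Python sketch. It is not how Cisco IPS implements locale variables; the network ranges, the INTERNAL_OTHER label, and the example addresses are all assumptions made up for the example.

```python
import ipaddress

# Hypothetical locale table: the names echo the post, the ranges are invented.
LOCALES = {
    "DC_OTHER_DC_NETS": ["64.104.0.0/16"],
    "MGT_SYSTEMS": ["10.6.30.0/24"],
}
# Hypothetical internal address space; anything outside it is labeled OUT.
INTERNAL = ["10.0.0.0/8", "64.104.0.0/16"]


def _in_any(ip, cidrs):
    """True if ip falls inside at least one of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def locality(ip):
    """Return the first matching locale name, else INTERNAL_OTHER or OUT."""
    for name, cidrs in LOCALES.items():
        if _in_any(ip, cidrs):
            return name
    return "INTERNAL_OTHER" if _in_any(ip, INTERNAL) else "OUT"


def annotate(alert):
    """Return a copy of the alert with srcDir/dstDir labels added."""
    labeled = dict(alert)
    labeled["srcDir"] = locality(alert["src"])
    labeled["dstDir"] = locality(alert["dst"])
    return labeled


# Made-up addresses standing in for the masked ones in the alert.
alert = {"src": "64.104.10.20", "srcport": 41507, "dst": "210.210.1.1", "dstport": 21}
labeled = annotate(alert)
print(labeled["srcDir"], "->", labeled["dstDir"])  # DC_OTHER_DC_NETS -> OUT
```

The point of the design is that the lookup happens once, when the alert is produced, so the person reading it never has to do it by hand.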
When our monitoring team views this alert they can immediately see (without doing any host lookups) that a host in one of our data centers (DC_OTHER_DC_NETS) made an outbound FTP connection to a site outside of Cisco (OUT). We investigate any outbound transfer from our data centers to the Internet, so this would be an immediate escalation from our monitoring team without them having to do any research. We annotate our networks and our services to make the alert understandable. This is also very useful when writing custom IPS signatures. For example, a network management locale lets you instantly tune out management systems, which may legitimately perform discoveries, from IDS sigs that look for one-to-many connections (worm-like scan activity). Below you can see us adding some systems to the locale variable:
xxx-dc-nms-1# conf t
service event-action-rules rules0
variables MGT_SYSTEMS address 10.6.30.5,10.6.30.6,10.30.6.7,10.50.1.5,10.50.1.6,10.50.1.7
And here we use a filter to tune those management systems from multiple IDS sigs:
filters insert drop_mgt_system_alerts
Then, when we find new management systems (usually through detection, but sometimes IT lets us know), it is really easy to update the variable and, in turn, all the IPS sigs that use that variable. If someone on our monitoring team sees a one-to-many scan coming from MGT_SYSTEMS, they know it’s expected.
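To show why a single named variable makes this kind of tuning cheap to maintain, here is a small Python sketch of the idea. It is only a model of the concept, not Cisco IPS syntax or behavior: the filter structure, the signature IDs 9001 and 9002, and the is_suppressed helper are hypothetical.

```python
# Hypothetical named address variable, mirroring the idea of MGT_SYSTEMS.
VARIABLES = {
    "MGT_SYSTEMS": {"10.6.30.5", "10.6.30.6", "10.50.1.5", "10.50.1.6"},
}

# Hypothetical filter: suppress several scan-type signatures when the
# attacker address belongs to the named variable. Signature IDs are invented.
FILTERS = [
    {
        "name": "drop_mgt_system_alerts",
        "sig_ids": {9001, 9002},
        "attacker_var": "MGT_SYSTEMS",
    },
]


def is_suppressed(alert):
    """True if any filter matches both the signature and the attacker address."""
    for flt in FILTERS:
        members = VARIABLES[flt["attacker_var"]]
        if alert["sig_id"] in flt["sig_ids"] and alert["src"] in members:
            return True
    return False


# A newly discovered management system still alerts...
scan = {"sig_id": 9001, "src": "10.50.1.7"}
before = is_suppressed(scan)

# ...until it is added to the variable, which updates every filter that
# references the variable without touching the filters themselves.
VARIABLES["MGT_SYSTEMS"].add("10.50.1.7")
after = is_suppressed(scan)

print(before, after)  # False True
```

Because the filters refer to the variable by name, adding one address in one place is the whole change.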
I will mention one last real-world 0-day outbreak detection method. This is a poor man’s, no-magic method for detecting a 0-day mass outbreak, and it can be applied to any type of IDS.
We chart the infected host count per detection vector, establish thresholds, and track the trend; a breached threshold is a great indication of a mass outbreak. A few of our IPS sigs are coded to fire only when the number of destinations scanned reaches 50 or more in 60 seconds. We arrived at this number by testing lower thresholds and evaluating the output until the baseline settled at its current value. Today this signature is our “rock” for detecting infected hosts propagating on vulnerable, “worm-able” ports (NBT, VNC, MSSQL, MySQL, etc.). Another example of baselining to detect an outbreak is recording the number of IP addresses found per run of each malware report and then looking for deviations from the expected count.
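Both ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual signature logic: the ScanDetector class, the breaches_baseline helper, the three-sigma rule, and the sample counts are all hypothetical, and the detector assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from statistics import mean, stdev

WINDOW_SECONDS = 60    # from the post: 50+ destinations in 60 seconds
MIN_DESTINATIONS = 50


class ScanDetector:
    """Flags a source that contacts many distinct destinations in a short window."""

    def __init__(self, window=WINDOW_SECONDS, threshold=MIN_DESTINATIONS):
        self.window = window
        self.threshold = threshold
        self.events = defaultdict(deque)  # src -> deque of (timestamp, dst)

    def observe(self, ts, src, dst):
        """Record one connection; return True if src is at or over the threshold."""
        queue = self.events[src]
        queue.append((ts, dst))
        # Drop connections that have aged out of the window.
        while queue and queue[0][0] <= ts - self.window:
            queue.popleft()
        return len({d for _, d in queue}) >= self.threshold


def breaches_baseline(history, today, sigmas=3.0):
    """True if today's count exceeds mean + sigmas * stdev of past counts."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    return today > mean(history) + sigmas * stdev(history)


# A fast scanner: a new destination every half second.
detector = ScanDetector()
hits = [detector.observe(i * 0.5, "10.1.1.1", f"10.2.0.{i}") for i in range(60)]
print(hits.index(True))  # 49, i.e. it fires on the 50th distinct destination

# Made-up daily infected-host counts for one detection vector.
history = [3, 5, 4, 6, 4, 5, 3]
print(breaches_baseline(history, 6), breaches_baseline(history, 40))  # False True
```

A slow scanner that stays under 50 destinations per minute never trips the first check, which is why the trend-based baseline is worth keeping alongside it.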
None of the stuff covered above is very sexy, none of it is very difficult, none of it is magic (or even perfect), but all of it helps to effectively reduce risk with IDS. For more information, view our Cisco on Cisco writeup (Cisco.com account needed).