This is the second and final part of my series about security logging in an enterprise.
We first logged IDS, some syslog from some UNIX hosts, and firewall logs (circa 1999). We then dropped firewall logging because it introduced overhead and we didn’t have any really good uses for it. (We still don’t.) Where did we go next? Read on.
When we established a formal CSIRT, we quickly found out we didn’t have the needed telemetry to be successful at computer security investigations. We set out to fix that by building out a syslog reflector system (and associated ACLs) so we could take logs from the network devices and many of the production *nix systems. We then tackled netflow. We had a POC that captured a few hours of flows from Cisco.com to the Sun workstation of one of the first CSIRT members. We went from there to 4 dedicated Linux boxes, then to 24 Linux boxes collecting flows.
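For readers who want a concrete picture of what a syslog reflector does, here is a minimal sketch in Python: it receives UDP syslog on the standard port and fans each message out to downstream collectors. The hostnames and ports are placeholders, and our production reflectors were of course more robust than this.

```python
#!/usr/bin/env python3
"""Minimal syslog "reflector" sketch: receive UDP syslog and re-send it
to one or more downstream collectors. Addresses are placeholders."""
import socket

LISTEN_ADDR = ("0.0.0.0", 514)                      # where devices send syslog (needs privileges)
COLLECTORS = [("csirt-logger.example.com", 514),    # CSIRT feed
              ("it-logger.example.com", 514)]       # IT gets a copy too

def main():
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(LISTEN_ADDR)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        datagram, _source = rx.recvfrom(65535)      # one syslog message
        for collector in COLLECTORS:
            tx.sendto(datagram, collector)          # fan out unchanged

if __name__ == "__main__":
    main()
```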
So we had some of the basics tied together but more gaps than data. We had coverage for deep packet inspection and IDS for all of our ingress/egress points but not all of our data centers. In about 12 months we completed the deep packet inspection and flow collection for our data center deployments. We then started filling in our logging gaps in earnest.
The team member leading the effort at the time had the mantra of “log everything” – which is a great idea but has some issues in practice. The first is the obvious one: the more you log, the more resources it takes. In particular, you need to be able to process and make forensic sense of the logs. You also need fast ways to search through them. And you need enough integrity monitoring to know when a needed source has stopped producing log data.
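One practical example of that last point: you want something that notices when an expected source goes quiet. Below is a rough sketch of such a check; the source names, file paths, and thresholds are purely illustrative.

```python
#!/usr/bin/env python3
"""Sketch of an "is this source still logging?" check: warn when a log
source has not written anything within its expected interval.
Paths and thresholds are illustrative only."""
import os
import time

# source name -> (path to its newest log file, max allowed silence in seconds)
EXPECTED_SOURCES = {
    "router-auth": ("/data/logs/router-auth/current.log", 15 * 60),
    "vpn":         ("/data/logs/vpn/current.log",         60 * 60),
    "dns":         ("/data/logs/dns/current.log",          5 * 60),
}

def check_sources(now=None):
    now = now or time.time()
    for name, (path, max_silence) in EXPECTED_SOURCES.items():
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            print(f"ALERT: {name}: log file missing ({path})")
            continue
        if age > max_silence:
            print(f"ALERT: {name}: no data for {age / 60:.0f} minutes")

if __name__ == "__main__":
    check_sources()
```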
So we slowly built on the logging we had. The basic idea was that if a log was very useful and not too huge (or we had needed it for an investigation and didn’t have it), we would bring it in. We had syslog reflectors that IT could point logs at (and IT got feeds too if they wanted them). We started with low-volume, highly useful logs like authentication from our routers. We created a list of what we needed to log, prioritized as mentioned earlier, then started knocking out sources/types as we had time and resources to do it. We were still using flat text files and grep for searching, and as our data stores grew larger and the team grew more diverse in skills (the addition of a tier 1, for example), we knew we needed a different approach. We, like most CSIRTs, did a bake-off (well, many bake-offs) and deployed a Security Information and Event Management (SIEM) system. This alleviated the problem for the tier 1s but didn’t give us the performance and flexibility we desired, so we continued to develop a separate logging solution.
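To give a sense of what “flat text files and grep” looked like in practice, here is a small Python sketch that brute-force scans compressed daily log files for a single indicator. The directory layout is made up for the example; the point is that every search walks every file, which is exactly why this stopped scaling as the data and the team grew.

```python
#!/usr/bin/env python3
"""Sketch of the early flat-file approach: brute-force scan a day's
worth of compressed syslog for an indicator (e.g. an IP address).
The file layout is illustrative; real searches were plain grep."""
import glob
import gzip
import sys

def search(indicator, pattern="/data/syslog/*/**/*.gz"):
    for path in glob.glob(pattern, recursive=True):
        with gzip.open(path, "rt", errors="replace") as fh:
            for line in fh:
                if indicator in line:
                    print(f"{path}: {line.rstrip()}")

if __name__ == "__main__":
    search(sys.argv[1])
```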
With netflow, we eventually moved to a commercial solution based on a column-oriented database (Lancope). For the systems-based logging, we started using Splunk. Splunk had a great mix of performance and flexibility: it provides a flexible search mechanism without binding you to the structure of a relational database, and it has a very functional CLI/API for making the data accessible through automated means. Unlike with the SIEM, we didn’t need special parsing agents or licensing to add a new data source. It was also expressive enough for the most advanced queries while allowing us to abstract the log sources for the tier 1s. At that point, we added in more and more sources, the main ones being VPN logs, web proxy, authentication, HIPS, HIDS, syslog, DHCP, DNS, AAA, ACNS, network scans, advanced malware tools, e-mail security, AV, sinkhole, and Active Directory.
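As an illustration of pulling data out through automated means, here is a sketch that streams search results from Splunk’s generic REST export endpoint. The host, credentials, index, and field names are placeholders; this is not our internal tooling, just the kind of thing the CLI/API makes easy.

```python
#!/usr/bin/env python3
"""Sketch of streaming results from a Splunk search via the REST
/services/search/jobs/export endpoint. Host, credentials, index,
and field names are placeholders."""
import json
import requests

SPLUNK = "https://splunk.example.com:8089"
AUTH = ("svc_csirt", "xxxxxxxx")   # placeholder service credentials

def export_search(query):
    """Stream results of a Splunk search as parsed JSON events."""
    resp = requests.post(
        f"{SPLUNK}/services/search/jobs/export",
        auth=AUTH,
        data={"search": f"search {query}", "output_mode": "json"},
        verify=False,     # internal certificate; for the sketch only
        stream=True,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            event = json.loads(line)
            if "result" in event:        # skip preview/status records
                yield event["result"]

if __name__ == "__main__":
    for hit in export_search('index=vpn "login failed" earliest=-24h'):
        print(hit.get("_time"), hit.get("user"), hit.get("src_ip"))
```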
We continued to run the SIEM in parallel, and at some point the same team member mentioned earlier said, “Why don’t we just keep the data in one place – and run all of our monitoring out of the SIEM?” The CSIRT has a list of mini-investigations that we have our tier 1 analysts complete, and these were all done in our SIEM (using a logging server to augment where needed). So I asked the team to check whether we could do those 100+ mini-investigations in the logger instead, and report back. Step one was to add any event source that existed only in the SIEM to the logger (mainly IDS). Then we needed to convert our SIEM queries into something Splunk could understand. We had over 100 of these mini-investigations (our “playbook,” as we called it). The tier 1 team converted them all, then we ran them in parallel with the SIEM and compared the results.

We found that our logger could easily do everything we did in the SIEM. Furthermore, in testing we found the performance to be hugely better than the SIEM: at the time of testing we had 200X the amount of data being queried in our logger compared to the SIEM, while queries were coming back over 20X faster. We ran both in parallel for quite a while, then eventually EOL’d the SIEM, finding no further use for it. The team is now looking beyond parity with the SIEM and exploring how to use the additional data sources and the additional flexibility to create a more advanced playbook. It is important to note that we moved from a traditional SIEM to Splunk because Splunk is designed and engineered for “big data” use cases. Our previous SIEM was not, and it simply could not scale to the data volumes we have.
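For the curious, the parity testing itself was conceptually simple. A sketch of the idea: export the results of one playbook item from each system and diff them on a key field. The file names and key column below are illustrative only.

```python
#!/usr/bin/env python3
"""Sketch of a parity test for one playbook item: compare the result
set exported from the SIEM with the result set from the equivalent
Splunk query. Assumes both were exported to CSV with a common key
column; file names and the key field are illustrative."""
import csv

def load_keys(path, key_field):
    with open(path, newline="") as fh:
        return {row[key_field] for row in csv.DictReader(fh)}

def compare(siem_csv, splunk_csv, key_field="src_ip"):
    siem = load_keys(siem_csv, key_field)
    splunk = load_keys(splunk_csv, key_field)
    print(f"in both: {len(siem & splunk)}")
    print(f"SIEM only: {sorted(siem - splunk)}")
    print(f"Splunk only: {sorted(splunk - siem)}")

if __name__ == "__main__":
    compare("play_example_siem.csv", "play_example_splunk.csv")
```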
Our new setup has proved to be useful and resilient as we grow into the big-data size you would expect of a company like Cisco. The most recent addition has been logging of all DNS (super useful for tracking some of the more advanced criminals). While netflow is great for understanding hub-and-spoke type traffic, DNS helps us understand lateral movement within our organization. For that we use ISC’s DNSDB for “answers” and a home-grown collection, storage, and query tool for “questions” (arguably the most useful piece). We are also in the process of deploying our own local instance of DNSDB, but it isn’t finished yet, so we’re still using the public instance. You can find more detailed information on this in Tracking Malicious Activity with Passive DNS Query Monitoring. We can call directly out to these databases from Splunk. The most recent addition to Splunk is Windows server logging. Finishing off, we do further processing on attachments with Vortex and other metadata tools, send the data to Hadoop, and save rolling full pcaps.
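To show what an “answers” lookup looks like, here is a sketch of a query against the public DNSDB passive DNS API (the classic rrset lookup endpoint). The API key is a placeholder, and our home-grown “questions” store is not shown.

```python
#!/usr/bin/env python3
"""Sketch of a passive DNS "answers" lookup against the public DNSDB
API (classic /lookup/rrset endpoint). The API key is a placeholder."""
import json
import requests

DNSDB_API = "https://api.dnsdb.info"
API_KEY = "xxxxxxxx"   # placeholder

def rrset_history(name):
    """Yield historical RRsets observed for a domain name."""
    resp = requests.get(
        f"{DNSDB_API}/lookup/rrset/name/{name}",
        headers={"X-API-Key": API_KEY, "Accept": "application/json"},
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():
        if line:
            yield json.loads(line)   # one JSON record per line

if __name__ == "__main__":
    for record in rrset_history("example.com"):
        print(record.get("rrname"), record.get("rrtype"), record.get("rdata"))
```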
While Splunk can be all things to all people, currently it’s not practical to have all our data in Splunk. We use a combination of technologies to help us manage all the data. The main ones are summarized below:
- Netflow – This data is ideally managed by a special-purpose system designed for flows (we use Lancope, but there are others). We don’t plan to use Splunk for flows.
- DNS – Cisco uses a home-grown solution due mainly to cost considerations. In the future we may keep this data in Splunk.
- Web logs (clickstream) – These are not in Splunk due to size and cost.
- Splunk – Splunk stores everything else (all logs other than the three categories mentioned above).
This blog gives you a point in time in a journey that continues. It’s important to note that the logging and log history this blog covers have been the most important tool and capability enabling us to keep Cisco secure from cyber incident–related threats. We capture and store trillions of records each day. That provides the base upon which we can build incident detection and response.
You say “Furthermore, in testing we found the performance to be hugely better in the SIEM.” Surely that’s a reason to keep the SIEM?
Thanks, that was an error – it should read:
“Furthermore, in testing we found the performance to be hugely better than the SIEM.”
I’ll get it fixed.
Thanks for the useful, specific insight – the Internet is full of executive-type opinions about doing SIEM without any technical detail.
We could debate the differences of course, and there are some, but arguably you just switched from one SIEM to another…with different qualities. We are considering taking the same step. Thanks for sharing.