Getting a Handle on Your Data
When your incident response team gets access to a new log data source, chances are that the events may not only contain an entirely different type of data, but may also be formatted differently than any log data source you already have. Having a data collection and organization standard will ease management and analysis of the data later on. Event attributes must be normalized to a standard format so events from disparate sources have meaning when viewed homogeneously. In addition to normalization, log events must be parsed into fields and labeled in a consistent way across data sources. Ensuring that log data is organized properly is a minimum requirement for efficient log analysis. Without digestible and flexible components, it’s extremely difficult to comprehend a log message. If you have ever paged through screen after screen of log data with no filter, you know what I’m talking about.
Data normalization is the process of transforming a log event into its canonical form, that is, the accepted standard representation of the data required by the organization consuming the data. If the same data can be represented in multiple formats, each possible iteration of the data can be considered a member of an equivalence class. To allow proper sorting, searching, and correlation, all data in the equivalence class must be formatted identically.
As an example, let’s consider timestamps. The C function strftime and its approximately 40 format specifiers give an indication of the potential number of ways a date and time can be represented. The lack of an internationally recognized standard timestamp format, combined with the fact that most programming libraries have adopted strftime’s conversion specifications, means that application developers are free to define timestamps as they see fit. Consuming data that includes timestamps requires recognizing the different formats and normalizing them to an organization’s adopted standard format. Other data contained in logs that may require normalization includes MAC addresses, phone numbers, alarm types, IP addresses, and DNS names. These are examples of equivalence classes, where the same data may be represented by different applications in different formats. In the case of an IP address or a DNS name, the CSIRT may find it beneficial not to normalize the data in-place, but rather to create an additional field, the labels of which are standardized across all data sources where possible.
Understanding the information in a given data source is much easier when using unique identifiers for identical attributes pervasively. A CSIRT might standardize on a field name of source_host for events in the monitoring infrastructure that include an originating IP address. Authentication logs containing a username might label that particular attribute field user. Whatever the name, use the same label each time a newly acquired data source possesses an extant field.
Hopefully the format and fully enumerated list of possible log event messages and all fields from any given data source is available, well documented by the system or application owner. Reference this documentation where it exists to help identify different messages, fields, and formats. In the absence of such documentation, the CSIRT team must dissect the log data using context about the data source itself or type of messages generated. Attribution data sources are likely to associate an action with a host or user, while classic security event data sources such as IDS contain alerts about an asset or action.
Knowing and understanding your data allows you to make decisions affecting system efficiency. By eliminating messages providing no value for investigations or event detection before they’re stored in the collector, the system saves disk space, improves index build time, and speeds the analysis of sorting through the data. Filtering data from entering the collector can be performed in multiple ways. The most straightforward way involves the data steward simply configuring the data feed so the CSIRT never receives data they do not want or need. Barring this possibility, CSIRT, as a data custodian, must manipulate the data in such a way as to filter prior to inclusion in the collector.
Knowing which messages to filter requires somehow identifying and matching those messages, and then dropping them from the data feed. This can be accomplished with rsyslog’s property-based filters, or syslog-ng’s various filter functions, which include the ability to specific string matching based on regular expressions. The match must be on a string unique to that log message so as not to unintentionally drop desirable data from storage. The log feed is likely to contain benign error messages, or verbose data that cannot otherwise be filtered from the exporting system itself. If you’ve parsed your data correctly into fields, you should be able to identify the unique event attribute field or combination of fields that uniquely identify the undesirable messages.
This blog post will discuss considerations for preparation of multiple data sources for consumption into a centralized log collector. Preparation includes the basic steps of identifying delimiters, normalizing potentially varied data formats, and field parsing.
Let’s begin by looking at some scrubbed examples of actual log data from actual data sources, realizing comparable applications in your environment may produce similar log events with different formats. Our first event is raw data from a single Host Intrusion Detection System (HIDS) log:
2013-09-21 14:40:02 -0700|mypc-WIN7|10.10.10.50|4437|The process 'C:\Users\mypc\Downloads\FontPack11000_XtdAlf_Lang.msi' (as user DOMAIN\mypc) attempted to initiate a connection as a client on TCP port 80 to 10.2.3.4 using interface Virtual\Cisco Systems VPN Adapter for 64-bit Windows. The operation was allowed by default (rule defaults).
An initial qualitative assessment of the message identifies | used as a field delimiter. Other log sources commonly use spaces, quotes, or tabs. Or, they may unfortunately lack a delimiter entirely. Assuming the delimiter is unique to the message—that is the delimiter does not exist elsewhere within the event message—it can easily parsed with a unix-like split() function. There are a handful of additional artifacts in the event worth remarking on. Starting with the timestamp, we see a date format of year-month-day, the use of a 24-hour clock specified to seconds, and PST time zone defined as a numeric offset from UTC. Already we have questions to answer. How does this timestamp format align with those of other logs already in our collector? Will any portion of or the entire timestamp need to be normalized to CSIRT’s standard?
Additionally delimited fields include an IP address 10.10.10.50, a hostname mypc-WIN7, a numeric rule identifier 4437, and an event action occupying the remainder of the event. Source IP address and hostname are fairly standard log attributes, and the collector should apply to the DHCP log feed, CSIRT’s standard field name for these attributes. Though possibly referred to by a different name, a rule identifier is also common across data sources—such as an IDS signature—and should also use the standard label for that field.
We’re left with the event action, in this case identifying an allowed Windows process making a network connection over a VPN interface. At a minimum, this entire field can be labeled as Event Action. Bear in mind that the same log feed will likely contain various actions. Are those actions finite and enumerable? Is there consistency within the same action for each log message? Different Operating systems or agent versions may generate similar, but varying messages for the same action. Can the parser properly handle inconsistently formatted event actions? Consistency within event actions is notable since there may be additional attributes within the action itself worth parsing. We’ll discuss these additional attributes later.
Let’s look at an event from another log source and see how it compares to the HIDS data. The following is an event from a web proxy:
1381888686.410 - 10.10.10.50 63549 80 - -6.9 http://[adware-URL-example-TLD]/toolbar/?ownerId=CT3311834 - 0 309 0 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" - 403 TCP_DENIED - "adware" "Domain reported and verified as serving malware. Identified malicious behavior on domain or URI." - - - - GET
Here we miss the convenience of an obviously unique delimiter. Though fields are delimited with spaces, spaces may also exist and be escaped within a quoted field. As a result, a simple split() will no longer work to break the event into fields. This log event requires a parser with logic that knows not to split when the delimiter is escaped or quoted.
Beyond the difference between delimiters, the timestamp is in a different format from the HIDS data. In this instance, the web proxy uses Unix time format. If at a minimum we have only the HIDS data and this web proxy data, at least one of the time values must be normalized to a standard so the data can be easily understood and correlated during analysis. As explained previously, log timestamp inconsistencies are rampant, and CSIRTs should prepare to convert and format data for each new instance of data sources received.
Similar to the HIDS data, the web proxy identifies a client IP Address 10.10.10.50. The same standardized field name used for source IP address should be used in the proxy logs. Because this is a web proxy log, there should be no surprise the event contains HTTP information such as requested URL, user agent, HTTP server response, HTTP method, and referrer, all of which should be parsed into uniquely representative fields.
As a final data source, we’ll introduce DHCP logs. This example shows multiple events:
10/08/2013 20:10:29 hn="mypc-WIN7" mac="e4:cd:8f:05:2b:ac" gw="10.10.10.1" action="DHCPREQUEST" 10/08/2013 20:10:29 hn="mypc-WIN7" mac="e4:cd:8f:05:2b:ac" ip="10.10.10.50" exp="10/09/2013 22:11:34 UTC" 10/08/2013 22:11:34 hn="mypc-WIN7" mac="e4:cd:8f:05:2b:ac" gw="10.10.10.1" action="DHCPREQUEST" 10/08/2013 22:11:34 hn="mypc-WIN7" mac="e4:cd:8f:05:2b:ac" ip="10.10.10.50" exp="10/09/2013 02:11:33 UTC" 10/09/2013 02:11:37 hn="mypc-WIN7" mac="e4:cd:8f:05:2b:ac" ip="10.10.10.50" exp="10/09/2013 02:11:33 UTC" 10/09/2013 02:11:38 hn="mypc-WIN7" mac="e4:cd:8f:05:2b:ac" ip="10.10.10.50" action="EXPIRED"
Initial analysis again reveals fields delimited via spaces. This data source clearly identifies fields in a key=”value” pair, with keys hn (hostname), mac (MAC address), ip (Source IP Address), gw (gateway), action (DHCP server action), and exp (DHCP lease expiration timestamp). It’s also worth noting that each value in the tuple is also quoted. Identifying these intricacies in message structure is required to write accurate parsers to separate field values. The data source may identify additional fields. Even if documentation identifying each field is lacking, an analysis can and should manually enumerate each field with a large enough sample set of log data.
This new data source is using yet another new timestamp format. We can see the use of month/day/year and a 24-hour clock. But what about time zone? Did the DHCP admins neglect to configure an export of time zone information in the log message or does the DHCP server not even support exporting the time zone? Either way, determination and normalization of the event timezone is a responsibility of the log custodian—as opposed to the log analyst—and must be performed prior to writing the event data to the collector.
Whereas earlier we looked at individual HIDS and web proxy log messages, six DHCP log messages are presented in this example. While each message has value individually, the combined six events show a single DHCP session over the course of approximately six hours for a single host. Log data that must be dealt with as a single unit of work over a period of time is referred to as a transaction. There are additional instances where it’s ideal to identify transactions in log data that have the potential to span multiple events such as VPN session data, authentication attempts, or web sessions. Once combined, the transaction can be used for holistic timelines or to identify anomalous behavior. Consider a transaction identifying normal VPN usage. If the same user initiates a new VPN connection before an older connection has been terminated, it may be indicative of a policy violation or shared VPN credentials—both the responsibility of a CSIRT to identify.
Fields Within Fields
As mentioned previously, additional attributes may exist within individual fields that are worth parsing themselves. Consider the action from the HIDS alert, and the Requested URL from the web proxy log:
The process 'C:\Users\mypc\Downloads\FontPack11000_XtdAlf_Lang.msi' (as user DOMAIN\mypc) attempted to initiate a connection as a client on TCP port 80 to 10.2.3.4 using interface Virtual\Cisco Systems VPN Adapter for 64-bit Windows. The operation was allowed by default (rule defaults). http://analytics.query.foo.com/v1/public/yql?format=json&yhlVer=2&ylClient=rapid&yhlS=2149249&ylCT=2&yhlBTMS=13812&yhlClientVer=3.9&yhlRnd=5vevo70DEDjJ7PTw&yhlCompressed=3
Contained within both messages is data that very well could be a field in and of itself in another data source. The HIDS log identifies a path, filename, Active Directory domain, destination IP address and port and protocol. The URL from the web proxy contains a domain and subdomain, URI, URI path, and filename. Each of these fields is a potential attribute that can be used to correlate, or link, the HIDS or web proxy event with another data source. From the HIDS event, an analyst may want to determine what other hosts also contacted the IP address on port 80 by searching Netflow data. Or the analyst may want to correlate the filename identified in the web proxy log with HIDS data to see if the same filename attempted execution on the same or other hosts.
In the next blog post, we’ll explore how to use the normalized and parsed data to search across different data sources.