In the spirit of National Cyber Security Awareness Month (NCSAM), I offer up a recent tale of intrigue and mystery from an ongoing Cisco Security Research project…

Prologue

One of Cisco Security Research and Operations’ ongoing projects is to oversee a massive infrastructure of several high-volume Internet POPs that send large amounts of network traffic into one of our research labs. We are collecting NetFlow and packet dumps from a geographically distributed sensor network. These pcap files each contain several million packets, but due to a configuration error in the packet capture process, there was some amount of packet duplication. This short blog article will talk about why the duplication happened, how we prevented it from recurring, and a unique solution that was employed to remove the duplicate packets from all of the affected pcap files.

A Love Story: The Hub, The Switch, Packet Sniffing, and Cisco SPAN

Before there were network switches, there were network hubs. The hub was a wonderful little creature that dutifully forwarded every packet it saw to every other port on its body. This was a wonderful situation for the packet sniffer, who wanted nothing more than to siphon up every packet on the network. He simply had to be plugged in to the hub on any port and immediately all the beautiful packets became available for consumption. Packet sniffer loves hub.

Enter the brutally efficient switch. After the switch boots, it begins to build a very selfish layer 2 forwarding table (the CAM table) that maps MAC addresses to switch ports. After it learns what comes whence and what goes where, it forwards or “switches” traffic only to the corresponding port. What a disaster for the packet sniffer! Now the packet sniffer could only see traffic destined for the MAC address that is registered to its port (plus whatever the switch broadcasts or floods).

Next comes Cisco SPAN (Switched Port Analyzer). SPAN is a feature on Cisco switches that lets packet sniffers see some or all packets on a switch or VLAN (a Virtual Local Area Network, where a physical network is partitioned into multiple smaller networks). To be clear, the switch is not simply forwarding each packet so that the sniffer can see it; the switch is actually making a copy of each packet and forwarding that copy to the packet sniffer. Essentially, by configuring SPAN on a Cisco switch, the packet sniffer can once again happily gobble up all of the packets on a network.
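To make the difference concrete, here is a toy Python model of the three forwarding behaviours described above. It is only a sketch: the `Hub` and `Switch` classes, the port numbers, and the short MAC strings are all made up for illustration, and a real switch does far more (aging, VLANs, and so on).

```python
class Hub:
    """Repeats every frame out of every port except the one it came in on."""

    def __init__(self, num_ports):
        self.num_ports = num_ports

    def forward(self, in_port, src_mac, dst_mac):
        return [p for p in range(self.num_ports) if p != in_port]


class Switch:
    """Learns which MAC lives on which port; optionally mirrors to a SPAN port."""

    def __init__(self, num_ports, span_port=None):
        self.num_ports = num_ports
        self.span_port = span_port   # hypothetical port where the sniffer sits
        self.cam = {}                # MAC address -> port (the CAM table)

    def forward(self, in_port, src_mac, dst_mac):
        self.cam[src_mac] = in_port                  # learn where the sender lives
        if dst_mac in self.cam:
            known = self.cam[dst_mac]
            out = [] if known == in_port else [known]  # switch to one port only
        else:
            # unknown destination: flood, but never onto the SPAN port itself
            out = [p for p in range(self.num_ports)
                   if p not in (in_port, self.span_port)]
        if self.span_port is not None and self.span_port not in out:
            out.append(self.span_port)               # SPAN: a copy for the sniffer
        return out


# The sniffer is plugged into port 3 in all three cases.
for device in (Hub(4), Switch(4), Switch(4, span_port=3)):
    device.forward(0, "aa", "bb")   # first frame: a switch has to flood it
    device.forward(1, "bb", "aa")   # the reply teaches a switch where "bb" is
    print(type(device).__name__, device.forward(0, "aa", "bb"))
```

Once the CAM table is populated, the plain switch delivers the third frame to port 1 only, while the hub and the SPAN-enabled switch both still hand a copy to port 3.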

Duplicate Packets

Given the above Switch/Packet Sniffer/SPAN love triangle, a common side effect of packet capturing on SPAN ports is duplicate packets. Duplicate packets most often occur when the packet capture source that is specified is a VLAN or port channel (a bundle of interfaces used to provide increased bandwidth and redundancy).

Cisco4948#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
Cisco4948(config)#monitor session 1 source interface vlan 5
!--- This configures VLAN 5 as the SPAN source.
Cisco4948(config)#monitor session 1 destination interface fastethernet 0/3
!--- This configures interface Fast Ethernet 0/3 as the destination port.

Cisco4948#show monitor session 1
Session 1
---------
Type : Local Session
Source Ports :
Both : vlan 5
Destination Ports : Fa0/3
Cisco4948#

Matters become even more convoluted when capturing bi-directional traffic, because a packet can be copied once as it enters a monitored port and again as it leaves one. The simplest and most proper solution is to alter the source of the SPAN configuration and make the source a specified interface, for example “interface FastEthernet 0/0”:

Cisco4948#configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
Cisco4948(config)#monitor session 1 source interface fastethernet 0/0
!--- This configures interface Fast Ethernet 0/0 as the source port.
Cisco4948(config)#monitor session 1 destination interface fastethernet 0/3
!--- This configures interface Fast Ethernet 0/3 as the destination port.

Cisco4948#show monitor session 1
Session 1
---------
Type : Local Session
Source Ports :
Both : Fa0/0
Destination Ports : Fa0/3
Cisco4948#

We were fortunate in our scenario to be able to do exactly as shown above: we changed the source of our SPAN from a VLAN to an interface. But suppose you encounter a situation where it is not possible to change the source of your SPAN. Perhaps, given the environment and architecture, the only option that lets you see the necessary data flow is a port channel or VLAN as the SPAN source. If this is the case, duplicate packets are going to show up due to the SPAN source.
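A rough way to see why the SPAN source matters is to count copies. The sketch below assumes a deliberately simplified model of a bi-directional ("both") session: one copy is made when a frame is received on a monitored port, and another when it is transmitted out of a monitored port. The `span_copies` function and the VLAN 5 port membership are hypothetical; this is not a description of how any particular switch implements SPAN.

```python
def span_copies(ingress_port, egress_port, monitored_ports):
    """Count the copies a sniffer receives for one frame.

    Simplified assumption: a 'both' SPAN session copies a frame once on
    receive (rx) and once on transmit (tx) for every monitored port it touches.
    """
    copies = 0
    if ingress_port in monitored_ports:
        copies += 1   # rx copy
    if egress_port in monitored_ports:
        copies += 1   # tx copy
    return copies


# Hypothetical membership of VLAN 5; with a VLAN source every member is monitored.
vlan5_ports = {"Fa0/0", "Fa0/1", "Fa0/2"}

# A frame switched from Fa0/0 to Fa0/1, both inside VLAN 5:
print(span_copies("Fa0/0", "Fa0/1", vlan5_ports))   # VLAN as source
print(span_copies("Fa0/0", "Fa0/1", {"Fa0/0"}))     # single interface as source
```

Under this model the VLAN source yields two copies of the frame and the single-interface source yields one, which is the behaviour the configuration change above relies on.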

While we were indeed fortunate enough to resolve our SPAN issue in the most effective manner possible, we were still left with thousands of pcap files, containing billions of packets, with some level of duplication. See for yourself!

[snarkbox:~/Projects] mike% capinfos -csd sample-02.pcap.gz
File name:           sample-02.pcap.gz
Number of packets:   2928239
File size:           109090270 bytes
Data size:           1953148233 bytes

Needless to say, we needed a tool to remedy the situation.

The Solution

We didn’t need something production quality; we just needed a stopgap tool that worked to remove the duplicate packets. Python, as it happens, is perfect for such rapid prototyping. A quick game-plan was drawn with the following requirements:

pdd.py: Pcap De-duplicator

The fruit of that labor was a 101-line Python program named pdd.py (that’s right, pdd = pcap de-duplicator). Note that the program is only 80 lines without docstrings. It has the following calling conventions:

[snarkbox:~/Projects] mike% ./pdd/pdd.py -h
usage: pdd.py [-h] -f INFILE_NAME [-w WINDOW_SIZE] [-v] [-o OUTFILE_NAME] [-z]
parse a pcap file and remove duplicate packets, accepts gzip'd pcap files

optional arguments:
  -h, --help            show this help message and exit
  -f INFILE_NAME, --file INFILE_NAME
                        pcap file to sift through
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        size of the sliding packet window, a larger window may
                        find more duplicate packets but will increase run-
                        time, default is 12
  -v, --verbose         be more verbose when reporting, -vv be even more
                        verbose
  -o OUTFILE_NAME, --outfile OUTFILE_NAME
                        output filename
  -z, --gzip            gzip the output file

A sample invocation shown against a smaller pcap file:

[snarkbox:~/Projects] mike% ./pdd.py -f sample-01.pcap.gz -o sample-01.pdd.pcap -vv
Using a window of 12, writing non-duplicates to sample-01.pdd.pcap
dup: 60 byte packet at 2010-04-30 16:43:41.859558 and 2010-04-30 16:43:41.859554: Ethernet(src='\x00\x1aK\x00\x02\x1a', dst='\x00\x1d\xa1\xea\xec\x1b', data=IP(src='redacted', off=16384, dst='redacted', sum=44674, len=40, p=6, data=TCP(seq=3825337896, win=0, sum=45839, flags=4, dport=443, sport=33949)))
dup: 60 byte packet at 2010-04-30 16:45:42.830688 and 2010-04-30 16:45:42.830685: Ethernet(src='\x00\x1aK\x00\x02\x1a', dst='\x00\x1d\xa1\xea\xec\x1b', data=IP(src='redacted', off=16384, dst='redacted', sum=44674, len=40, p=6, data=TCP(seq=1440424076, win=0, sum=3534, flags=4, dport=443, sport=40354)))
dup: 60 byte packet at 2010-04-30 16:47:43.831652 and 2010-04-30 16:47:43.831559: Ethernet(src='\x00\x1aK\x00\x02\x1a', dst='\x00\x1d\xa1\xea\xec\x1b', data=IP(src='redacted', off=16384, dst='redacted', sum=44674, len=40, p=6, data=TCP(seq=3325172995, win=0, sum=41214, flags=4, dport=443, sport=40355)))
dup: 60 byte packet at 2010-04-30 16:48:55.308183 and 2010-04-30 16:48:55.308180: Ethernet(src='\x00\x1d\xa1\xea\xec\x1b', dst='\x00PV\x8e?@', data=IP(src='redacted', off=16384, dst='redacted', sum=18987, len=40, p=6, ttl=43, data=TCP(seq=1751296284, win=0, sum=36867, flags=4, dport=2153, sport=80)))
dup: 60 byte packet at 2010-04-30 16:48:55.332592 and 2010-04-30 16:48:55.332588: Ethernet(src='\x00\x1d\xa1\xea\xec\x1b', dst='\x00PV\x8e?@', data=IP(src='redacted', off=16384, dst='redacted', sum=18991, len=40, p=6, ttl=43, data=TCP(seq=275966928, win=0, sum=42310, flags=4, dport=2150, sport=80)))
dup: 60 byte packet at 2010-04-30 16:49:44.832697 and 2010-04-30 16:49:44.832693: Ethernet(src='\x00\x1aK\x00\x02\x1a', dst='\x00\x1d\xa1\xea\xec\x1b', data=IP(src='redacted', off=16384, dst='redacted', sum=44674, len=40, p=6, data=TCP(seq=924029966, win=0, sum=47376, flags=4, dport=443, sport=40357)))
dup: 60 byte packet at 2010-04-30 16:51:45.833485 and 2010-04-30 16:51:45.833481: Ethernet(src='\x00\x1aK\x00\x02\x1a', dst='\x00\x1d\xa1\xea\xec\x1b', data=IP(src='redacted', off=16384, dst='redacted', sum=44674, len=40, p=6, data=TCP(seq=2836595558, win=0, sum=58154, flags=4, dport=443, sport=37427)))
dup: 60 byte packet at 2010-04-30 16:53:46.834433 and 2010-04-30 16:53:46.834430: Ethernet(src='\x00\x1aK\x00\x02\x1a', dst='\x00\x1d\xa1\xea\xec\x1b', data=IP(src='redacted', off=16384, dst='redacted', sum=44674, len=40, p=6, data=TCP(seq=434232427, win=0, sum=39254, flags=4, dport=443, sport=37428)))
Of 163831 total packets, I wrote 163823 and found 8 duplicates

Another invocation, this time against a large file:

[snarkbox:~/Projects] mike% ./pdd.py -f sample-02.pcap.gz
Using a window of 12, writing non-duplicates to sample-02.pcap.gz.pdd.22213
Of 2928239 total packets, I wrote 1467450 and found 1460789 duplicates
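The two summary lines above are easier to compare as percentages. The snippet below simply redoes the arithmetic on the reported counts; the `duplicate_share` helper is a name invented for this example.

```python
def duplicate_share(total, written, dups):
    """Return the fraction of captured packets that were duplicates."""
    if written + dups != total:
        raise ValueError("written + duplicates should equal the total")
    return dups / float(total)


small = duplicate_share(163831, 163823, 8)          # sample-01
large = duplicate_share(2928239, 1467450, 1460789)  # sample-02

print("sample-01: %.4f%% duplicates" % (100 * small))
print("sample-02: %.1f%% duplicates" % (100 * large))
```

The small file is almost clean (roughly 0.005% duplicates), whereas just under half of the large file is duplicates, which is what one would expect if nearly every packet had been captured twice.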

 

The Special Sauce

The real work of detecting duplicate packets inside pdd.py is accomplished by implementing a simple sliding window across the input stream of packets. Many readers will be familiar with the sliding window protocol used with TCP. As pdd.py runs, it checks each packet against the entries already in its window. If the packet does not already exist in the window (not a duplicate), it is written to the output file and appended to the left side of the window. If the window is already full, the oldest packet is first popped off the right side and thrown away. For pdd.py, the sliding window was implemented using Python’s collections.deque() object. This object was chosen since it supports efficient push and pop operations against both ends of the queue. The built-in Python list() object does support pushes and pops from both ends; however, it suffers a serious performance penalty when making changes to the head of the list, because every remaining element has to be shifted. This whole process is depicted below in Figure 1.

Figure 1
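For readers who have not used a deque before, here is a small sketch of the window mechanics on their own. It uses `deque(maxlen=...)`, which discards from the opposite end automatically; pdd.py itself pops the oldest entry by hand, so treat this as an equivalent illustration rather than the program's code. The window size of 3 and the `push_list` helper are made up for the example.

```python
from collections import deque

WINDOW_SIZE = 3  # illustrative only; pdd.py defaults to 12

# A bounded deque drops items from the opposite end once it is full.
window = deque(maxlen=WINDOW_SIZE)
for pkt in ["p1", "p2", "p3", "p4"]:
    window.appendleft(pkt)        # newest on the left, constant time
print(list(window))               # p1, the oldest, has fallen off the right


def push_list(window_list, pkt, size):
    """The same behaviour with a plain list, for comparison."""
    window_list.insert(0, pkt)    # linear time: every existing entry moves right
    if len(window_list) > size:
        window_list.pop()         # constant time: drop the oldest from the tail
```

Both versions end up holding the same packets in the same order; the difference is only in how much work each insertion at the head costs.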

The Python function employing the deque is shown below:

import sys
from collections import deque


def deduplicate_pcap(infile, outfile, pcap, window_size, verbosity):
    """Uses a sliding window of recently seen packets to remove duplicates.

    infile: original pcap file
    outfile: newly created output file obeying the dpkt.pcap interface
    pcap: iterable yielding (timestamp, packet) pairs read from infile
    window_size: size of the sliding window of packets that get compared
    verbosity: level of verbosity as specified by the user
    """
    sliding_window = deque()
    tot_count = 0
    pkt_count = 0
    dup_count = 0

    for ts, pkt in pcap:
        tot_count += 1
        for stored_pkt, stored_ts in sliding_window:
            if pkt == stored_pkt:
                dup_count += 1
                # found_dup() is pdd.py's reporting helper, defined elsewhere
                found_dup(pkt, ts, stored_ts, verbosity)
                break
        else:
            # no break above, so nothing in the window matched: keep the packet
            outfile.writepkt(pkt, ts)
            pkt_count += 1
            if len(sliding_window) >= window_size:
                # once deque is full pop off the rightmost (oldest) item
                sliding_window.pop()
            # add a new entry to the left side of the packet deque
            sliding_window.appendleft((pkt, ts))
    sys.stderr.write("Of %d total packets, I wrote %d and found %d duplicates\n"
                     % (tot_count, pkt_count, dup_count))
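The function above leans on Python's for/else construct, which trips up many readers: the else branch belongs to the loop, not to the if, and runs only when the loop finishes without hitting a break. A stripped-down sketch, with a hypothetical `classify` function and strings standing in for packets:

```python
def classify(pkt, window):
    """Return 'duplicate' if pkt is already in window, otherwise 'new'."""
    for stored in window:
        if pkt == stored:
            verdict = "duplicate"
            break              # leaving via break skips the else clause
    else:
        # reached only when the loop ran to completion without a break,
        # which includes the case of an empty window
        verdict = "new"
    return verdict


print(classify("a", ["b", "a"]))
print(classify("a", ["b"]))
print(classify("a", []))
```

This is why the write-and-remember logic in the de-duplication function sits under else: it should run exactly when no stored packet matched.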

 

What Size Window?

Certainly there is some magic in choosing a window size. Clearly, the larger the window, the longer pdd.py takes to run. Searching the deque is linear in the size of the window, and for every packet that isn’t a duplicate pdd.py iterates over the entire window. Therefore, the user should choose the smallest effective window size. However, if the window is too small, duplicate packets could be missed. An optimal value should probably hinge upon the number of packets captured and the network topology as it relates to the expected number of duplicates. We’re not really concerned with the packet rate or speed of the network since we’re using an absolute packet window and not a time-based window.

If the cause of duplication is a layer 2 configuration issue, the duplicates are likely to be laid out very tightly (in some cases sequentially), and a smaller window will suffice. If something is delaying the duplicate packets before the sniffer sees them, they might be scattered more sparsely throughout the pcap file, and a larger window is needed. As a default, we chose 12. It seemed to be a good compromise between speed and efficacy.
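The trade-off can be tried out on a toy stream. The sketch below is a variation on the idea rather than what pdd.py does: it keeps SHA-256 digests in the deque and mirrors them in a set, so the membership test no longer walks the whole window. The `dedupe_hashed` name, the byte strings standing in for packets, and the choice of hash are all assumptions for illustration, and a hash comparison accepts a (very small) chance of collisions.

```python
import hashlib
from collections import deque


def dedupe_hashed(packets, window_size):
    """Sliding-window de-duplication over raw packet bytes.

    Returns (kept_packets, duplicate_count).
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque()     # digests in arrival order, newest on the left
    in_window = set()    # the same digests, for constant-time lookups
    kept, dups = [], 0
    for pkt in packets:
        digest = hashlib.sha256(pkt).digest()
        if digest in in_window:
            dups += 1
            continue
        kept.append(pkt)
        if len(window) >= window_size:
            in_window.discard(window.pop())   # forget the oldest digest
        window.appendleft(digest)
        in_window.add(digest)
    return kept, dups


# A duplicate of b"A" that shows up after three other packets.
stream = [b"A", b"x1", b"x2", b"x3", b"A"]
for size in (2, 3, 4):
    print(size, dedupe_hashed(stream, size)[1])
```

With three packets between the original and its copy, windows of 2 and 3 miss the duplicate and a window of 4 catches it: the window has to be larger than the gap between a packet and its duplicate.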

Another Option: Wireshark’s editcap

At the time of the project, unbeknownst to the SR&O team, there existed prior art. As so often happens in the byzantine world of computer security, someone else had encountered and solved our problem before we did. Inside the Wireshark suite exists a cache of command line tools, including one called editcap. This handy tool allows a user to edit or translate the contents of a pcap file, including the ability to remove duplicate packets. As validation that we were indeed on the right track, editcap also employs a user-defined sliding window (and also offers the option to use a time-based window).
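A time-based window is a small change to the same idea: instead of remembering the last N packets, remember everything seen within the last few moments. The sketch below is our own illustration of that approach, not editcap's implementation; the `dedupe_time_window` name and the one-millisecond `max_age` are assumptions, and the input is assumed to be sorted by timestamp.

```python
from collections import deque


def dedupe_time_window(packets, max_age):
    """Drop a packet if an identical one was seen within max_age seconds.

    packets: iterable of (timestamp, packet) pairs in timestamp order.
    Returns (kept_pairs, duplicate_count).
    """
    window = deque()   # (timestamp, packet), oldest on the left
    kept, dups = [], 0
    for ts, pkt in packets:
        # evict everything that has aged out of the window
        while window and ts - window[0][0] > max_age:
            window.popleft()
        if any(pkt == stored for _, stored in window):
            dups += 1
        else:
            kept.append((ts, pkt))
            window.append((ts, pkt))
    return kept, dups


capture = [
    (0.000000, b"A"),
    (0.000004, b"A"),   # a copy 4 microseconds later: treated as a duplicate
    (0.500000, b"B"),
    (2.000000, b"A"),   # same bytes two seconds later: kept
]
kept, dups = dedupe_time_window(capture, max_age=0.001)
print(len(kept), dups)
```

The duplicates in the verbose output earlier all arrive within 100 microseconds of the original, so for captures like ours a short time window would behave much like the small packet window.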

Conclusion

There’s no doubt that other tools and options may also exist to resolve packet duplication issues, but we hope we have provided you with not only a few options/solutions from our own experiences, but also a clear understanding of the root cause and the effective solutions.

There are several key takeaways here:

Special Thanks

Andrae Middleton actually co-wrote this blog with me. He wrote all of the switch configurations. Additionally, I could never have gotten anywhere in life without being under the tutelage of esteemed programming leviathans William McVey and Nathan Ramella. From them, I learned the wonderful Python programming language.

4 Comments.


  1. As always, Mike is out on the bleeding edge doing what he does best – digging into the details and coming up with valuable nuggets.

  2. packet de/duplication is a serious issue when deploying IDS
    sometimes one might be able to get around it by using combinations of VACL & RSPAN

  3. Dr. Jose A. Wong - Perez

    Just a very detailed explanation…really enjoy it and indeed enhances my class’s security presentations…

  4. Another tool serving the same purpose from an old colleague. Couldn’t resist promoting it :)
    http://myoss.belgoline.com/despan
