Cisco Blogs


Network-Based File Carving

In this blog post you will first learn what file carving is and, with a simplified example, why it’s useful. Next you will learn how this powerful technique has been applied to the network and how its utility has been expanded beyond just forensics. We will talk about several tools in this article, but specific attention will be paid to the NFEX network file carving tool.


What is File Carving?

File Carving, sometimes contextually shortened to “carving,” is the name given to the technique of extracting files from a data source.  It is a specialized practice where files are located and extracted from a stream of bytes without having to rely on filesystem metadata. Most often, files are located by searching for a specific “magic number” byte-code called a header and carving out the logically contiguous bytes in between it and a closing code called a footer. A large list of these headers and footers is actively maintained on the File Signatures website.
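The header-and-footer search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular carving tool is implemented; the function name `carve` and the `max_size` default are made up for the example.

```python
def carve(data: bytes, header: bytes, footer: bytes, max_size: int = 10_000_000):
    """Yield (offset, blob) for each header...footer span found in data."""
    pos = 0
    while True:
        start = data.find(header, pos)
        if start == -1:
            return
        # Only accept a footer that ends within max_size bytes of the header.
        end = data.find(footer, start + len(header), start + max_size)
        if end == -1:
            pos = start + 1      # no footer in range; keep scanning
            continue
        end += len(footer)
        yield start, data[start:end]
        pos = end


JPEG_HEADER = b"\xff\xd8\xff"
JPEG_FOOTER = b"\xff\xd9"

if __name__ == "__main__":
    # A stand-in for a real image: header, some filler, footer.
    fake_jpeg = JPEG_HEADER + b"\xe0" + b"pretend image data" + JPEG_FOOTER
    stream = b"\x00" * 100 + fake_jpeg + b"\x11" * 50
    for offset, blob in carve(stream, JPEG_HEADER, JPEG_FOOTER):
        print(f"carved {len(blob)} bytes at offset {offset}")
```

A real carver has to cope with things this sketch ignores, such as footer bytes that happen to occur inside the file body and files that are not stored contiguously.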

A Simple Example

To illustrate further, we can put together a simple example that simulates a standard carving situation: you accidentally formatted a USB memory stick and lost your favorite jpeg. Oh no, what a disaster! Well, not so fast, Chicken Little… Only the filesystem information was erased; the data is still there, there just isn’t a convenient way to access it. It’s a needle-in-a-haystack problem. Using file carving, we can recover this file pretty handily.

First, let’s create the haystack using the ubiquitous Unix command dd. We’ll create two files of arbitrary size and fill them with completely random bytes:

[snarkbox:~/Desktop] mike% dd count=923 if=/dev/random of=haystack-01
923+0 records in
923+0 records out
472576 bytes transferred in 0.054913 secs (8605934 bytes/sec)
[snarkbox:~/Desktop] mike% dd count=841 if=/dev/random of=haystack-02
841+0 records in
841+0 records out
430592 bytes transferred in 0.051001 secs (8442842 bytes/sec)

Next we drop the needle into the middle of the haystack. We actually just place the jpeg at the end of the first haystack file and then append the second haystack file to the end of the first:

[snarkbox:~/Desktop] mike% ls -l needle.jpg
-rw-r--r--  1 mike  staff  19959 Aug 11 10:59 needle.jpg
[snarkbox:~/Desktop] mike% cat needle.jpg >> haystack-01
[snarkbox:~/Desktop] mike% cat haystack-02 >> haystack-01
[snarkbox:~/Desktop] mike% ls -l haystack-01
-rw-r--r--  1 mike  staff  923127 Aug 19 09:56 haystack-01
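The sizes in the transcript above add up, and they also tell us exactly where the needle landed. A quick check in Python, assuming dd's default 512-byte block size (no bs= was given); the variable names are just for illustration:

```python
BLOCK_SIZE = 512                 # dd's default block size when bs= is not given
haystack_01 = 923 * BLOCK_SIZE   # first file of random bytes
haystack_02 = 841 * BLOCK_SIZE   # second file of random bytes
needle = 19959                   # size of needle.jpg from the ls -l output

needle_start = haystack_01               # the jpeg was appended right here
needle_end = needle_start + needle       # exclusive end offset
total = haystack_01 + needle + haystack_02

print(haystack_01, haystack_02)   # 472576 430592
print(needle_start, needle_end)   # 472576 492535
print(total)                      # 923127
```

So the jpeg occupies bytes 472576 through 492534 of the combined 923127-byte file, which is where a hex editor or carving tool should find it.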

If we can forget for a few minutes that we do know the exact location of the dropped needle, we can treat haystack-01 as a random sea of bytes in which our jpeg has been lost. Using a handy little OS X-based hex editing tool called Hex Fiend, and some knowledge of the magic numbers used to identify jpeg files, we can confirm the file is still in there. Jpeg files are heralded by the presence of the magic number header “0xFF 0xD8 0xFF” and we can search for this value in Hex Fiend:

It follows then that we can search for the closing footer that terminates the file: “0xFF 0xD9”:

Now that we’ve confirmed the file is there, it’s a matter of extraction. To accomplish this, we fire up a stalwart yet precise file carving tool named Scalpel. Scalpel is a command-line tool that will carve files based on a robust configuration file. We’ll cover this in more detail later, but for now, suffice it to say that entries in the configuration file specify file types and magic numbers. We enable the jpeg search entry and fire away:

[snarkbox:~/Projects/Filecarving/scalpel-1.60] mike% ./scalpel -o lost_and_found haystack-01
Scalpel version 1.60
Written by Golden G. Richard III, based on Foremost 0.69.
Opening target "/Users/mike/Desktop/haystack-01" Image file pass 1/2.
haystack-01: 100.0% |******************************************|  901.5 KB    00:00 ETA
Allocating work queues...
Work queues allocation complete. Building carve lists...
Carve lists built.  Workload:
jpg with header "\xff\xd8\xff\xe0\x00\x10" and footer "\xff\xd9" --> 1 files
Carving files from image.
Image file pass 2/2.
haystack-01: 100.0% |******************************************|  901.5 KB    00:00 ETA
Processing of image file complete. Cleaning up...
Done.
Scalpel is done, files carved = 1, elapsed = 0 seconds.
[snarkbox:~/Desktop] mike% ls -l lost_and_found
total 8
-rw-r--r--  1 mike  staff  453 Aug 19 10:02 audit.txt
drwxr-xr-x  3 mike  staff  102 Aug 19 10:02 jpg-0-0
[snarkbox:~/Desktop] mike% ls -l lost_and_found/jpg-0-0/
total 40
-rw-r--r--  1 mike  staff  19959 Aug 19 10:02 00000000.jpg
[snarkbox:~/Desktop] mike% diff lost_and_found/jpg-0-0/00000000.jpg ~/Desktop/handsomedevil.jpg

We’ve carved our favorite jpeg back from oblivion, and just to ensure it’s legit and uncorrupted, we diff the carved file with the original. diff prints nothing, which means the two files are identical. Success. Astute readers will note that the header Scalpel used for jpeg files was the longer string “0xff 0xd8 0xff 0xe0 0x00 0x10.” The extra bytes are the APP0 marker and segment length that open a JFIF-format file, so this header catches JFIF jpegs but misses those that begin with a different marker, such as Exif files, which start with “0xff 0xd8 0xff 0xe1.” On the flip side, the longer header reduces the number of false positive matches.
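The trade-off between a short and a long header can be shown with a small Python sketch. The two byte strings below are hand-built openings of a JFIF-style and an Exif-style jpeg (the Exif segment length is an arbitrary value chosen for the example), and the false-hit estimate assumes uniformly random bytes like our haystack.

```python
SHORT_HEADER = b"\xff\xd8\xff"
LONG_HEADER = b"\xff\xd8\xff\xe0\x00\x10"   # the header Scalpel reported above

# Opening bytes of two jpeg flavors (illustrative prefixes, not full files).
jfif_start = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
exif_start = b"\xff\xd8\xff\xe1\x00\x2aExif\x00\x00"


def matches(prefix: bytes, header: bytes) -> bool:
    """True if the file prefix begins with the given magic number."""
    return prefix.startswith(header)


def expected_false_hits(n_bytes: int, header: bytes) -> float:
    """Rough expected count of chance matches in n_bytes of random data."""
    return n_bytes / 256 ** len(header)


if __name__ == "__main__":
    for name, prefix in (("JFIF", jfif_start), ("Exif", exif_start)):
        print(name, matches(prefix, SHORT_HEADER), matches(prefix, LONG_HEADER))
    for header in (SHORT_HEADER, LONG_HEADER):
        print(len(header), "byte header:", expected_false_hits(923127, header))
```

The short header matches both prefixes while the long one matches only the JFIF prefix; in exchange, each extra header byte cuts the chance of an accidental match in random data by a factor of 256.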

Traditionally, file carving was used not only to recover orphaned files, but also as a powerful technique during the post mortem forensic investigation of defunct hard disks. As we’ll see, however, it is also well suited for other important computer security-related tasks.

File Carving Tools

There are many file carving tools, a modest list of which is maintained on the Forensics Wiki. A few notable ones are mentioned below.

Foremost and Scalpel

Foremost, originally written by the United States Air Force Office of Special Investigations, is one of the first full-featured carving tools to make its way into the public domain. It can carve files from a variety of image files such as those created by dd, SafeBack or EnCase. The real power that Foremost brought to the table is the ability to flexibly match different headers and footers as specified by a user-tunable configuration file. In fact, this has proven so useful that many subsequent tools have adopted this configuration file format (we will cover this in more detail shortly). The original forensic investigator’s mainstay, Foremost is the metric by which other carving tools are measured and, in several cases, the tool from which they are derived. Scalpel, used in the example above, is based on Foremost but rewritten from the ground up; the main improvements are speed and memory efficiency.

EtherPEG and Driftnet

One of the earliest network-based file carving tools, EtherPEG was written to expose how naked most early Wi-Fi networks really were when deployed without WEP (unbeknownst to most people at the time, WEP itself would later be exposed as tragically flawed, paving the way for the current standard in Wi-Fi confidentiality, WPA2). EtherPEG is a Macintosh-only program that promiscuously sniffs the local Wi-Fi network, carves image files and displays them on the local machine. According to Internet lore, it was a quick hack written to shame or embarrass people into either encrypting their networks or ceasing to download “questionable” content. Driftnet added multi-platform support, the ability to run as a screen saver, and optional decoding and playback of mpeg files.

NFEX

NFEX is the Network File EXtraction tool, which I maintain. It is based on the defunct tcpxtract tool written by Nick Harbour. It is an asynchronous, Unix-based, command-line-driven standalone tool designed to perform real-time file carving from network streams. NFEX was built on top of the tcpxtract engine with some bug fixes, performance enhancements and new features. NFEX can be run either on an IPv4-based network device or against a pcap savefile. As it employs an asynchronous multiplexer, the user is able to query NFEX in real time to learn ongoing carving information and statistics. Finally, NFEX carves files based on a malleable, user-tunable, text-based configuration file.

Configuration File

The configuration file employed by NFEX is similar in format to those used by the Foremost and Scalpel tools mentioned above. Each entry takes the following form:

file_extension(size, header, footer)

  • file extension: a canonical label that will be used as an extension to name the file when it is carved from the input
  • size: a limiter that places a maximum size in bytes that NFEX will carve
  • header: a specifier that, when matched, signals the start of a carving session
  • footer: a specifier that, when matched, signals the carving session should close

The footer is optional. Without one, NFEX continues carving until it either hits the size limit or detects that the network session in question has closed; this is one of several complications with PE32 file carving, as discussed below. When a footer is specified but NFEX reaches the maximum size limit before locating it, NFEX closes out the carving. A few examples from the NFEX configuration file:

# PE32 executables
exe(10000000, \x4d\x5a\x90\x00);
# HTML files
html(50000, \x3chtml, \x3c\x2fhtml\x3e);
# Adobe PDF files
pdf(5000000, \x25PDF, \x25EOF\x0d);
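To make the format concrete, here is a minimal Python sketch that parses entries like the ones above into raw byte patterns. It is not the NFEX parser; the names `parse_config` and `decode_spec` are invented for the example, it assumes one entry per line with no literal commas or parentheses in the specifiers, and it does not handle wildcards.

```python
import re

# ext(size, header[, footer]);
ENTRY_RE = re.compile(
    r"^\s*(\w+)\s*\(\s*(\d+)\s*,\s*([^,)]+?)\s*(?:,\s*([^,)]+?)\s*)?\)\s*;\s*$"
)


def decode_spec(spec: str) -> bytes:
    """Turn a mix of literal ASCII characters and \\xNN escapes into bytes."""
    out = bytearray()
    i = 0
    while i < len(spec):
        if spec.startswith("\\x", i) and i + 4 <= len(spec):
            out.append(int(spec[i + 2:i + 4], 16))
            i += 4
        else:
            out.append(ord(spec[i]))
            i += 1
    return bytes(out)


def parse_config(text: str):
    """Return a list of (extension, max_size, header, footer_or_None)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        m = ENTRY_RE.match(line)
        if m is None:
            raise ValueError(f"unparseable entry: {line!r}")
        ext, size, header, footer = m.groups()
        rules.append((ext, int(size), decode_spec(header),
                      decode_spec(footer) if footer else None))
    return rules


SAMPLE = r"""
# PE32 executables
exe(10000000, \x4d\x5a\x90\x00);
# HTML files
html(50000, \x3chtml, \x3c\x2fhtml\x3e);
# Adobe PDF files
pdf(5000000, \x25PDF, \x25EOF\x0d);
"""

if __name__ == "__main__":
    for rule in parse_config(SAMPLE):
        print(rule)
```

Decoded this way, the exe header is the bytes "MZ" followed by 0x90 0x00, the html entry spans "<html" to "</html>", and the pdf entry spans "%PDF" to "%EOF" plus a carriage return.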

Asynchronous Interface

NFEX was designed to run either persistently on a network interface or ad hoc against large pcap savefiles. To complement this design goal, NFEX features an asynchronous interface that enables the user to interact with the program as it runs.

Session State Table

The main data structure inside of NFEX is the session state table, which is a large hash table. NFEX is TCP session driven, and as such, this is the most important and heavily trafficked data structure in the program. When NFEX finds a TCP session with data (referenced by its endpoint four-tuple of source address, source port, destination address and destination port), it stores the session here along with a timestamp. To compute the hash table index, the four-tuple is hashed in linear order. Several hashing algorithms were empirically tested, and the Fowler-Noll-Vo hash was found to offer a good balance of speed and collision avoidance. Periodically, a session expiry function is called to remove stale sessions from the table.
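As a rough illustration of the idea, here is a small Python sketch of a session table keyed by the TCP four-tuple and indexed with a Fowler-Noll-Vo hash. It is not NFEX code: the FNV-1a 32-bit variant, the bucket count, the 60-second idle limit and all of the names are assumptions made for the example.

```python
import socket
import struct
import time

FNV_OFFSET_BASIS = 0x811C9DC5   # 32-bit FNV offset basis
FNV_PRIME = 0x01000193          # 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor in each byte, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def session_key(src_ip, src_port, dst_ip, dst_port) -> bytes:
    """Pack the four-tuple in a fixed order so it always hashes the same way."""
    return struct.pack("!4sH4sH", socket.inet_aton(src_ip), src_port,
                       socket.inet_aton(dst_ip), dst_port)


class SessionTable:
    def __init__(self, n_buckets=1024, max_idle=60.0):
        self.buckets = [{} for _ in range(n_buckets)]
        self.max_idle = max_idle

    def _bucket(self, key):
        return self.buckets[fnv1a_32(key) % len(self.buckets)]

    def touch(self, key, now=None):
        """Insert the session, or refresh its timestamp if already present."""
        self._bucket(key)[key] = time.time() if now is None else now

    def expire(self, now=None):
        """Remove sessions idle longer than max_idle; return how many went."""
        now = time.time() if now is None else now
        removed = 0
        for bucket in self.buckets:
            stale = [k for k, seen in bucket.items() if now - seen > self.max_idle]
            for k in stale:
                del bucket[k]
            removed += len(stale)
        return removed

    def __len__(self):
        return sum(len(b) for b in self.buckets)


if __name__ == "__main__":
    table = SessionTable()
    table.touch(session_key("192.0.2.1", 1871, "198.51.100.7", 49793))
    print(len(table), "session(s) tracked")
```

The addresses in the usage lines are documentation-range placeholders. Sweeping every bucket on expiry keeps the sketch short; a busier table would want something cheaper.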

Search and Extract

The NFEX search engine is mostly untouched from the original elegant and efficient tcpxtract code. NFEX builds a finite state machine (FSM) by tokenizing each header and footer specified in the configuration file. The resulting FSM is similar in structure to the underlying mechanism used to implement regular expressions, albeit with a greatly reduced search grammar (just 256 possible values plus wildcards). This is because NFEX builds its searches from file headers and footers, which consist of single bytes (0-255). Each state transition table is 256 elements wide, and a wildcard simply sets all possible transitions. The result is a very fast pattern matching search interface. As NFEX adds new sessions to its session state table, it hands them to the search engine. The search engine scans every byte of data in all packet payloads looking for any of its header markers. If one is found, it tracks this session and calls the extraction interface, which carves out all of the data until the corresponding footer is detected.
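A toy version of such a byte-level state machine might look like the following Python sketch. It is written from the description above rather than from the tcpxtract or NFEX source: each state owns a 256-entry transition table, a wildcard fills every entry, and a small set of active states is carried between calls so that a marker split across two packets is still found. The `WILDCARD` marker and the `ByteMatcher` class are invented for the example.

```python
WILDCARD = None   # hypothetical marker meaning "any byte" inside a pattern


class ByteMatcher:
    def __init__(self):
        self.trans = []      # trans[state][byte] -> next state, or -1
        self.starts = []     # one start state per pattern
        self.accept = {}     # accepting state -> pattern label
        self.active = set()  # partially matched states kept between feeds
        self.consumed = 0    # total bytes seen so far

    def _new_state(self):
        self.trans.append([-1] * 256)
        return len(self.trans) - 1

    def add(self, label, pattern):
        """Compile one pattern (ints 0-255 or WILDCARD) into a chain of states."""
        state = self._new_state()
        self.starts.append(state)
        for token in pattern:
            nxt = self._new_state()
            if token is WILDCARD:
                self.trans[state] = [nxt] * 256   # every byte advances
            else:
                self.trans[state][token] = nxt
            state = nxt
        self.accept[state] = label

    def feed(self, chunk: bytes):
        """Scan one payload; return (label, end_offset) for each full match."""
        hits = []
        for byte in chunk:
            self.consumed += 1
            still_active = set()
            # A match may continue from an active state or begin at this byte.
            for state in self.active.union(self.starts):
                nxt = self.trans[state][byte]
                if nxt == -1:
                    continue
                if nxt in self.accept:
                    hits.append((self.accept[nxt], self.consumed))
                else:
                    still_active.add(nxt)
            self.active = still_active
        return hits


if __name__ == "__main__":
    m = ByteMatcher()
    m.add("exe", b"MZ\x90\x00")
    m.add("jpg", [0xFF, 0xD8, 0xFF, WILDCARD, 0x00, 0x10])
    print(m.feed(b"xxM"))             # header starts at the end of this payload
    print(m.feed(b"Z\x90\x00yy"))     # and completes in the next one
```

In a real per-session design each TCP session would need its own active set; the single matcher here stands in for one session's byte stream.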

Sample Invocation

NFEX is instantiated with the following command-line arguments:

  • -v enable verbose mode (extra diagnostic and statistical output)
  • -c specifies the path to the NFEX configuration file
  • -f specifies the path to the pcap savefile (implies offline mode)
  • -o specifies the output directory in which to save the extracted files
  • -g specifies geo IP targeting mode

[snarkbox:~/Projects/nfex-current] mike% ./src/nfex -v -c ./conf/nfex.conf -f ../pcaps/4291.pcap -o FOO -g
nfex - realtime network file extraction engine
loading configuration file...
1 exe search code compiled (10000000 byte max)
what we're working with:
output dir: FOO/
config file: ./conf/nfex.conf
pcap file: ../pcaps/4291.pcap
pcap filesize: 1200018390 bytes
pcap filter: tcp
index file: FOO/5530-index.txt
geoIP database: /usr/local/etc/nfex/GeoIPCity.dat
verbosity on
geoIP mode on

Once initialized, NFEX enters its main driver loop and starts to pull packets from the pcap file. As it finds TCP sessions it populates the session state table. NFEX notifies the user when its search engine finds a match and begins to carve an executable file from its packetized corpus.

program initialized, now the game can start...
extracting "exe" (Barlassina, IT:1871 -> Durham, US:49793) to FOO/5530-000001.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:35199) to FOO/5530-000002.exe

The user hits ‘?‘ to see what in-program options are available:

-[command summary]-
[c] - clear screen
[f] - show file search types
[g] - toggle geoIP mode
[r] - reset statistics
[s] - display statistics
[q] - quit
[V] - display program version
[v] - toggle verbose mode
[?] - help

The user hits ‘s‘ to get an update of what’s going on:

up-time: 5 seconds
sessions watched: 59007
packets churned: 984500
bytes churned: 209302827
pcap file processed: 17.4%
files extracted: 2
packet errors: 0
extraction errors: 0

More files are found:

extracting "exe" (Barlassina, IT:1871 -> Durham, US:60750) to FOO/5530-000003.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:48861) to FOO/5530-000004.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:34220) to FOO/5530-000005.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:46837) to FOO/5530-000006.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:52083) to FOO/5530-000007.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:52083) to FOO/5530-000008.exe

The user hits ‘s‘ again:

up-time: 14 seconds
sessions watched: 59994
packets churned: 1003600
bytes churned: 213553089
pcap file processed: 17.8%
files extracted: 8
packet errors: 0
extraction errors: 0
extracting "exe" (Barlassina, IT:1871 -> Durham, US:59349) to FOO/5530-000009.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:40004) to FOO/5530-000010.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:53639) to FOO/5530-000011.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:53639) to FOO/5530-000012.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:50573) to FOO/5530-000013.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:50573) to FOO/5530-000014.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:33356) to FOO/5530-000015.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:43958) to FOO/5530-000016.exe
extracting "exe" (Barlassina, IT:1871 -> Durham, US:40702) to FOO/5530-000017.exe

NFEX exits cleanly and gives the user a final tally of what went on:

running-time: 39 seconds
packets churned: 3999525
bytes churned: 749660498
pcap file processed: 62.5%
files extracted: 17
packet errors: 0
extraction errors: 0

The Problem with PE32 File Carving

Carving executable files, specifically PE32 files, poses a problem because there is no footer value with which to terminate the carving. The magic number header value for executable files is the MZ header, a two-byte code “0x4D 0x5A,” but there is no corresponding closing footer. This makes file carving difficult for tools that rely on such boundaries for extraction.

There are options, none of which are very elegant.  Once the PE32 header byte code is detected, it is possible to overlay a PE32 header over the file and manually parse the SizeOfImage value from the IMAGE_OPTIONAL_HEADER. The problem here is that this method isn’t very scalable and it only describes the size of the image file in memory, which may not accurately reflect the file on the wire.
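For the curious, reading SizeOfImage by hand looks roughly like this in Python. The offsets come from the PE file format (e_lfanew at 0x3C in the MS-DOS header, a 4-byte signature and 20-byte COFF header before the optional header, SizeOfImage 56 bytes into it); the function name is made up, and this is a sketch of the approach, not something NFEX does.

```python
import struct


def pe_size_of_image(data: bytes):
    """Return SizeOfImage from the PE optional header, or None if not PE."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew: little-endian offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None
    # Signature (4 bytes) + COFF file header (20 bytes) precede the optional
    # header; SizeOfImage sits 56 bytes into the optional header.
    field = e_lfanew + 4 + 20 + 56
    if len(data) < field + 4:
        return None
    (size,) = struct.unpack_from("<I", data, field)
    return size


if __name__ == "__main__":
    # Build a tiny synthetic stub rather than relying on a real executable.
    stub = bytearray(0x200)
    stub[:2] = b"MZ"
    struct.pack_into("<I", stub, 0x3C, 0x80)
    stub[0x80:0x84] = b"PE\x00\x00"
    struct.pack_into("<I", stub, 0x80 + 80, 0x5000)
    print(hex(pe_size_of_image(bytes(stub))))
```

As noted above, the value describes the image as mapped into memory, so it is at best a hint about how many bytes to carve off the wire.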

NFEX manages to sidestep this issue by the nature of its design. The executable files it extracts across the network are all inside of FTP and HTTP sessions. NFEX is contextually network-aware and it detects when a TCP session is closed, and consequently will end any corresponding extractions. From these sessions, the executable files are carved cleanly.

NFEX PE32 Header Validation Post Processing

NFEX offers a standalone utility, called nfex_exe_pp, to post-process MZ-based executables to determine if they’re PE32 files. It works by overlaying an MS-DOS header on top of the file and reading e_lfanew, the four-byte offset at which the PE32 header lives, and then looking at that offset for the PE32 signature bytes “0x50 0x45 0x00 0x00” (“PE” followed by two null bytes, or 0x00004550 when read as a little-endian 32-bit value). After determining that the file is a PE32 executable, nfex_exe_pp then uses the ClamAV library to scan the file against ClamAV’s database of known malware signatures. Abridged output from a sample invocation of nfex_exe_pp, run across the output from the above NFEX invocation, is shown below:
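The header check itself is simple enough to sketch in Python. This mirrors the steps described above (MZ magic, then e_lfanew, then the PE signature) but is not the nfex_exe_pp source, and it leaves out the ClamAV scan entirely; `find_pe_header` is a name invented for the example.

```python
import struct


def find_pe_header(data: bytes):
    """Return the offset of the PE signature, or None if the file is not PE."""
    # The MS-DOS header is 64 bytes long and starts with "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew lives at offset 0x3C and is stored little-endian.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None
    return e_lfanew


if __name__ == "__main__":
    # Synthetic file with the PE signature placed at 0xd8.
    stub = bytearray(0x100)
    stub[:2] = b"MZ"
    struct.pack_into("<I", stub, 0x3C, 0xD8)
    stub[0xD8:0xDC] = b"PE\x00\x00"
    offset = find_pe_header(bytes(stub))
    print("found PE header at", hex(offset))
```

A plain MS-DOS executable passes the MZ test but fails the signature test, which is exactly the case this post-processing step is meant to weed out.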

[snarkbox:~/Projects/nfex-current/FOO] mike% ./nfex_exe_pp 26120-index.txt
clamav: intializing...
clamav: loaded 640722 signatures...
checking 5530-000001.exe...
found MZ header
found PE header at 0xd8
malware detected: 5530-000001.exe is Trojan.Small-4287
checking 5530-000002.exe...
found MZ header
found PE header at 0xd8
malware detected: 5530-000002.exe is Worm.Padobot.M
[...]
checking 5530-000017.exe...
found MZ header
found PE header at 0xd8
malware detected: 5530-000017.exe is Worm.Padobot.M
program completed, normal exit

Denouement

While it is a reasonably technical concept, at its core, network-based file carving is a very useful and practical technique. To wit, internally at Cisco we needed an expedient and repeatable way to extract suspected malware from pcap files. To that end, we’ve deployed NFEX as a key component of an internal research project devoted to the automated detection of network-based malfeasance. We have a series of components that monitor and archive network traffic; NFEX sits in-line with these as a part of the automated system. NFEX is invoked via cron to carve out all PE32 executable files from pcap files, and then nfex_exe_pp is called to weed out all but the PE32-based malware files. From there, the malware is shipped off for analysis and cataloging.

I hope this blog post found you well and you found it informative. Comments welcomed.


2 Comments.


  1. Good article.

    File carving aficionados should check out Gary Kessler’s magic number site.

    http://www.garykessler.net/library/file_sigs.html


  2. Thanks for the article.

    Do you have any statistics on the amount of bandwidth this tool can handle on a live interface? Obviously this will be heavily dependent on the machine nfex is running on, but I’m just trying to get a general idea.
