by Keith J. Jones, Ph.D
Introduction
In the first post we added a stub to Zeek for JPEG file analysis. As you recall, instead of just checking code into the open source repository, our goal in this blog series is to “teach a person to fish” and to provide a few small fish as bait to get started. In this part we will give you that figurative bait and add logic to our JPEG analyzer so that we can output more data than just the Zeek file ID and timestamp. With the fundamentals addressed in the first post, this post will dive right into the technical explanations.
Additional pcaps containing JPEGs inside HTTP traffic can be downloaded from the Wireshark project at:
The source code for this post is available at:
The source code difference between the last post and this post is available at:
The source code difference between this post and the master branch at the time this blog series was written is available at:
What Do We Want To Parse?
In the last post, our parser was bare bones. It only parsed the first few bytes of the JPEG header, which was something the magic signature already determined. In this post, we will parse more of our JPEG files than the first few bytes. Specifically, let’s try to find the JFIF version, the image size (height/width in pixels) and any comments that might be in the image header. For this task, a little online research will provide the binary structures we hope to parse. While this post touches only on a few fields, the following resources provide all of the information we needed to accomplish this parsing and more:
- https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format#JFIF_APP0_marker_segment
- http://imrannazar.com/Let’s-Build-a-JPEG-Decoder:-Frames-and-Bitstreams
- https://link.springer.com/referenceworkentry/10.1007%2F0-387-30038-4_115
- https://www.disktuna.com/list-of-jpeg-markers/
- https://www.media.mit.edu/pia/Research/deepview/exif.html
We now have enough information to find the offsets and field sizes we will be parsing from our JPEG files. To summarize the resources above, a JPEG file can be thought of visually as the following diagram:

The purple box is the beginning of a JPEG file. It is the Start of Image (SOI) marker and is only two bytes long (0xFFD8). After the SOI, each section (or marker) is identified by a marker number and length. The first marker, in green above, shows that a marker begins with the hexadecimal byte 0xFF (the marker signature), followed by a second byte holding the marker value. The marker value represents the type of marker, such as SOI (0xD8), and is used to parse the data within the marker correctly. This is an example of the common type/length/value data encoding scheme (https://en.wikipedia.org/wiki/Type-length-value). Between the marker header and the data is the two-byte length of the data (including the two length bytes themselves). This length is used to find the second marker, and so on. There is an End of Image (EOI) marker, 0xFFD9, but it is unimportant for us because, as discussed in the previous blog post, we will not be parsing the whole file at once like host-based forensic software would. Unlike the other markers discussed in this post, the SOI and EOI markers have no length or data field after them.
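The marker layout described above can be sketched in a few lines of C++. This is only an illustration of the type/length/value walk over a buffer that is already fully in memory, which is not how the Zeek analyzer receives data. The names Segment and walk_segments are hypothetical and do not appear in the Zeek source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One JPEG segment: the marker type byte plus the bounds of its payload.
struct Segment {
    std::uint8_t type;        // second marker byte, e.g. 0xE0 for APP0
    std::size_t data_offset;  // offset of the first payload byte
    std::size_t data_len;     // payload length, excluding the two length bytes
};

// Walk the segments that follow the SOI marker. Stops after max_segments,
// at the end of the buffer, or at the first bytes that do not look like a
// marker followed by a valid length field.
std::vector<Segment> walk_segments(const std::vector<std::uint8_t>& buf,
                                   std::size_t max_segments) {
    std::vector<Segment> out;

    // The file must start with SOI (0xFF 0xD8), which has no length field.
    if (buf.size() < 2 || buf[0] != 0xFF || buf[1] != 0xD8)
        return out;

    std::size_t pos = 2;
    while (out.size() < max_segments && pos + 4 <= buf.size()) {
        if (buf[pos] != 0xFF)
            break;  // not a marker signature

        std::uint8_t type = buf[pos + 1];

        // The length is big-endian and counts its own two bytes.
        std::size_t len =
            (static_cast<std::size_t>(buf[pos + 2]) << 8) | buf[pos + 3];
        if (len < 2 || pos + 2 + len > buf.size())
            break;  // malformed or truncated segment

        out.push_back({type, pos + 4, len - 2});
        pos += 2 + len;  // jump to the next marker
    }
    return out;
}
```

Calling walk_segments(buffer, 5) on the first bytes of a JPEG would return up to five segments, each of which can then be interpreted according to its type byte.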
After reviewing the JPEG structure resources, you will see that there are some marker types we will be interested in parsing. The first is the “app0” marker identified by 0xFFE0. The second is the “Start of Frame 0 (SOF0)” marker identified by 0xFFC0. The “app0” marker will have the JFIF version in it as two bytes. The “SOF0” marker will have the image width and height. “SOF0” is just one type of frame; as presented in the JPEG resources above, there are SOF0-3, SOF5-7, SOF9-11, and SOF13-15 markers (the values 0xC4, 0xC8, and 0xCC within that range are not frame markers) that we must parse similarly to get the width and height for each of those marker types. For the purposes of this blog post, we will only parse the first SOFn marker we find for the width and height.
Lastly, there is also a comment marker, and its value is 0xFFFE. This marker simply holds a string comment after the length. Therefore, reading the markers and parsing them accordingly for the three types will address our needs of outputting the most basic JPEG information. If you have read through the JPEG resources linked above, you may have noticed that the overall length of the JPEG is not contained within the file like the length in a Portable Executable file header. This presents some unique technical “chicken and egg” parsing challenges for us that we will address later in this article.
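As a rough sketch of how the marker types of interest can be told apart and how a frame marker yields the image size, consider the following C++. It assumes the standard SOFn payload layout (one byte of sample precision, then two bytes of height, then two bytes of width, both big-endian). The names is_sof, Dimensions, and sof_dimensions are made up for this illustration.

```cpp
#include <cstdint>
#include <vector>

// True for the start-of-frame marker values: 0xC0-0xCF except 0xC4 (DHT),
// 0xC8 (reserved JPG extension), and 0xCC (DAC), which are not frame markers.
bool is_sof(std::uint8_t type) {
    return type >= 0xC0 && type <= 0xCF &&
           type != 0xC4 && type != 0xC8 && type != 0xCC;
}

struct Dimensions {
    unsigned height;
    unsigned width;
};

// payload holds the bytes that follow the two length bytes of an SOFn
// segment: precision (1 byte), height (2 bytes), width (2 bytes), ...
// Returns false if the payload is too short to contain both fields.
bool sof_dimensions(const std::vector<std::uint8_t>& payload, Dimensions& out) {
    if (payload.size() < 5)
        return false;
    out.height = (static_cast<unsigned>(payload[1]) << 8) | payload[2];
    out.width = (static_cast<unsigned>(payload[3]) << 8) | payload[4];
    return true;
}
```

The APP0 (0xE0) and comment (0xFE) markers need no such range test because each has a single marker value.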
Parsing Our JPEG Markers
We have a few methods at our disposal to try to parse JPEG files within Zeek:
1. Zeek scripts
2. C++ within JPEG.cc/.h
3. Binpac
We could try to parse our JPEG using Zeek scripts (#1) based upon our code in the last post, but we do not have the whole file at once when passed in through events, so this method is not going to work for this task. The next option of parsing JPEGs inside the C++ logic (#2) might work if the file were small, but the “DeliverStream” function only passes blocks of data at a time. It would be considerable work to implement everything in the JPEG.cc/.h files using such small windows of the data at a time. Our third option, binpac (#3), is a language that will generate parsers in C++ that we can tie into Zeek through the “DeliverStream” function, using #2, while driving the whole process with the ease of Zeek scripts in #1. In short, we must use a combination of all three methods to parse a JPEG file.
Defining Our Binpac Structures
Our first task is to define our binpac structures. The binpac language, in general, is outside the scope of this post but we will explain the binpac source code we developed here. Our first step is to remove the padding from our JPEG_Image record as we do not need it:

Next, we will want to add more fields to our “JPEG_Header” record in “jpeg-file-headers.pac”. This is where things begin to get tricky. Recall that we do not have the length of the file. We also do not know how many markers there are, or how long each marker is, without first reading some number of the markers. This is problematic because binpac needs to know how many bytes to read before it can start matching them against the structures we define for JPEGs. It is a “chicken and egg” problem, and the best we can do with binpac as it exists today is a tradeoff that is “good enough” for generic JPEG parsing. After reviewing the JPEGs in the pcaps used in these posts, it seemed that reading in just 500 bytes (to keep the window smaller than most images) allows you to parse at least the first five markers. We were able to find the JPEG metadata in the markers within the first 500 bytes of the JPEG images. This value could be tuned if you do not care about smaller JPEGs and want more markers, but that process is outside the scope of this blog post. Hopefully it will be addressed in a future post.
Within the first five markers, the information we seek is usually present. Therefore, to make all of this happen, we need the following binpac source in our “jpeg-file-headers.pac” file:
type Headers = record {
    jpeg_header : JPEG_Header;
} &let {
    # Do not care about parsing rest of the file so mark done now …
    proc: bool = $context.connection.mark_done();
};

type JPEG_Header = record {
    soi_start : uint8;
    soi_val   : uint8;
    markers   : JPEG_Marker[5];
} &length=500;

type JPEG_Marker = record {
    marker_start : uint8;
    marker_val   : uint8;
    length       : uint16;
    data         : bytestring &length=length-2;
} &length=length+2;
The source above cuts the SOI into two parts, a “start” and a “val”. The “start” and “val” are a byte in length, and the “start” is always 0xFF (since SOI is a marker, and markers begin with 0xFF). There is an array of five JPEG_Marker records called “markers”. Notice that we must tell binpac to read in 500 bytes at the end of the “JPEG_Header” definition above.
The second record type, “JPEG_Marker” is defined with a marker start, which should always be 0xFF, and a marker value (type). After the marker type, there is the length of the marker, and the associated data. The length of the record is calculated from the length in the JPEG file’s marker plus two bytes for the marker start and value.
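To get a feel for the 500-byte tradeoff, the arithmetic implied by the two “&length” attributes can be approximated offline. The sketch below is not how binpac works internally. It only adds up the SOI plus the first few marker records (each one “length + 2” bytes) and checks whether they fit in the window. The function names header_span and fits_window are hypothetical, and the defaults of five markers and 500 bytes simply mirror the values used in this post.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bytes covered by the SOI plus the first `count` marker records,
// or 0 if the buffer stops looking like a sequence of markers before that.
std::size_t header_span(const std::vector<std::uint8_t>& buf, std::size_t count) {
    if (buf.size() < 2 || buf[0] != 0xFF || buf[1] != 0xD8)
        return 0;

    std::size_t pos = 2;  // skip the two SOI bytes
    for (std::size_t i = 0; i < count; ++i) {
        if (pos + 4 > buf.size() || buf[pos] != 0xFF)
            return 0;
        std::size_t len =
            (static_cast<std::size_t>(buf[pos + 2]) << 8) | buf[pos + 3];
        if (len < 2)
            return 0;
        pos += 2 + len;  // mirrors "&length=length+2" on JPEG_Marker
    }
    return pos;
}

// True if at least `window` bytes are available and the first `count`
// markers end inside that window.
bool fits_window(const std::vector<std::uint8_t>& buf,
                 std::size_t count = 5, std::size_t window = 500) {
    std::size_t span = header_span(buf, count);
    return span != 0 && span <= window && buf.size() >= window;
}
```

Running a check like this over a sample of your own images is one way to decide whether a different window size or marker count would suit your traffic better.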
Add The New JPEG Marker Event
Next, we have to add the new event called “file_jpeg_marker” to the “events.bif” file. Add the following to the bottom of the file:
## This event is generated for each marker that file analysis parses
## from a JPEG file.
##
## f: The file.
## h: The JPEG marker data
##
event file_jpeg_marker%(f: fa_file, h: JPEG::JPEGMarker%);
type JPEG::JPEGMarker: record;
Add The New JPEGMarker Record
module JPEG;
export {
    type JPEG::JPEGMarker: record {
        ## The type of marker
        marker_val : count;
        ## The length, including the two length bytes
        len        : count;
        ## Data
        data       : string;
        ## The marker index in the JPEG
        marker_num : count;
    };
}
The record above creates the following fields: a marker value (the type), a length, the data, and the marker index (the offset with respect to markers) within the JPEG file. Notice that this record is what we will use to pass the attributes we parse from the JPEG file to the events in Zeek scripts so we can perform further processing on them. It contains all of the information in a marker, plus the marker’s index within the JPEG file.
The Binpac Parsing Functions
This is where understanding how binpac interfaces with Zeek’s C++ code can seem extremely vague. Binpac allows you to run functions on each of the parsed record types, if they are detected. What this means is that binpac will parse data, and in our case it will check 500 bytes to see if the structures we defined fit in there. In most cases, as long as the image is bigger than 500 bytes, this will be true. When binpac has parsed a “JPEG_Header”, it passes the record to a processing function, provided one has been attached to that record type with a “refine typeattr” statement like the one at the end of the listing below. You do not need to call the function yourself from the C++ analyzer. We are going to create a processing function in “jpeg-analyzer.pac” for the record type “JPEG_Header” called “proc_jpeg_header”. Make your “jpeg-analyzer.pac” file contain the following:
%extern{
#include "Event.h"
#include "file_analysis/File.h"
#include "events.bif.h"
%}

%header{
%}

%code{
%}

refine flow File += {
    function proc_jpeg_header(h: JPEG_Header): bool
        %{
        DBG_LOG(DBG_FILE_ANALYSIS, "TRYING TO PROCESS A JPEG!!!");
        if ( file_jpeg_marker )
            {
            DBG_LOG(DBG_FILE_ANALYSIS, "PROCESSING A JPEG!!!");
            int markers[] = { 0, 1, 2, 3, 4 };
            for (int m: markers)
                {
                RecordVal* dh = new RecordVal(BifType::Record::JPEG::JPEGMarker);
                dh->Assign(0, val_mgr->GetCount(${h.markers[m].marker_val}));
                dh->Assign(1, val_mgr->GetCount(${h.markers[m].length}));
                dh->Assign(2, new StringVal(${h.markers[m].data}.length(), (const char*) ${h.markers[m].data}.data()));
                dh->Assign(3, val_mgr->GetCount(m));
                mgr.QueueEventFast(file_jpeg_marker, {
                    connection()->bro_analyzer()->GetFile()->GetVal()->Ref(),
                    dh
                });
                }
            }
        DBG_LOG(DBG_FILE_ANALYSIS, "DONE PROCESSING A JPEG!!!");
        return true;
        %}
};

refine typeattr JPEG_Header += &let {
    proc : bool = $context.flow.proc_jpeg_header(this);
};
The file above defines a function called “proc_jpeg_header”. The “refine typeattr” block at the bottom attaches a “&let” field to “JPEG_Header” whose value is computed by calling this function, so binpac sends every parsed “JPEG_Header” into it without any further wiring in the C++ analyzer. This is important because you can then further process the JPEG_Header from this point before Zeek scripts are involved. In our example, we create a Zeek record value for each of the five markers, pass it to the event “file_jpeg_marker”, and fire the event. This makes the data parsed with binpac available for further processing in scriptland, which is the subject of the next section.
Create The “file_jpeg_marker” Event Handler
We can now add logic through Zeek scripts to parse the inbound “file_jpeg_marker” events. It will be much easier to parse the binary data through Zeek scripts rather than tedious C++ byte manipulation. The following code in scripts/base/files/jpeg/main.zeek will handle the new JPEG events:
module JPEG;

export {
    redef enum Log::ID += { LOG };

    type Info: record {
        ## Current timestamp.
        ts: time &log;
        ## File id of this JPEG file.
        id: string &log;
        total_bytes: count &log &optional;
        width: count &log &optional;
        height: count &log &optional;
        jfif_major: count &log &optional;
        jfif_minor: count &log &optional;
        comment: string &log &optional;
    };

    ## Event for accessing logged records.
    global log_jpeg: event(rec: Info);

    ## A hook that gets called when we first see a JPEG file.
    global set_file: hook(f: fa_file);
}

redef record fa_file += {
    jpeg: Info &optional;
};

const jpeg_mime_types = { "image/jpeg" };

event zeek_init() &priority=5
    {
    Files::register_for_mime_types(Files::ANALYZER_JPEG, jpeg_mime_types);
    Log::create_stream(LOG, [$columns=Info, $ev=log_jpeg, $path="jpeg"]);
    }

hook set_file(f: fa_file) &priority=5
    {
    if ( ! f?$jpeg )
        {
        f$jpeg = [$ts=network_time(), $id=f$id];
        }
    }

event file_jpeg(f: fa_file) &priority=5
    {
    hook set_file(f);
    if ( f?$total_bytes )
        f$jpeg$total_bytes = f$total_bytes;
    }

event file_jpeg_marker(f: fa_file, m: JPEG::JPEGMarker)
    {
    hook set_file(f);
    switch ( m$marker_val ) {
    # This will be SOF0 0xC0
    case 192:
        fallthrough;
    # This will be SOF1 0xC1
    case 193:
        fallthrough;
    # This will be SOF2 0xC2
    case 194:
        fallthrough;
    # This will be SOF3 0xC3
    case 195:
        fallthrough;
    # This will be SOF5 0xC5
    case 197:
        fallthrough;
    # This will be SOF6 0xC6
    case 198:
        fallthrough;
    # This will be SOF7 0xC7
    case 199:
        fallthrough;
    # This will be SOF9 0xC9
    case 201:
        fallthrough;
    # This will be SOF10 0xCA
    case 202:
        fallthrough;
    # This will be SOF11 0xCB
    case 203:
        fallthrough;
    # This will be SOF13 0xCD
    case 205:
        fallthrough;
    # This will be SOF14 0xCE
    case 206:
        fallthrough;
    # This will be SOF15 0xCF
    case 207:
        # Only the first SOFn marker is used. The data is one byte of
        # precision, then the height (2 bytes) and the width (2 bytes).
        if ( ! f$jpeg?$height )
            {
            f$jpeg$height = bytestring_to_count(m$data[1:3]);
            f$jpeg$width = bytestring_to_count(m$data[3:5]);
            }
        break;
    # This will be app0 0xE0
    case 224:
        # The data starts with "JFIF\0" (bytes 0-4), followed by the
        # major version (byte 5) and the minor version (byte 6).
        f$jpeg$jfif_major = bytestring_to_count(m$data[5:6]);
        f$jpeg$jfif_minor = bytestring_to_count(m$data[6:7]);
        break;
    # This will be comment 0xFE
    case 254:
        f$jpeg$comment = m$data;
        break;
    }
    }

event file_state_remove(f: fa_file) &priority=-5
    {
    if ( f?$jpeg )
        {
        Log::write(LOG, f$jpeg);
        }
    }
The first section of the Zeek script above creates the output record. We will output six fields: the file length, the width (in pixels), the height (in pixels), the JFIF major version number, the JFIF minor version number, and any comments in the JPEG.
Next, during the “file_jpeg” event, the total bytes are copied from the file object into the output record. Then, the “file_jpeg_marker” event handles the three types of markers in which we are interested (SOFn, APP0, and comment). “bytestring_to_count” is a convenience function provided by Zeek that translates a string of raw bytes into a Zeek count. There are similar functions for other types, should you need them.
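For readers more comfortable with C++ than Zeek script, the byte-to-number conversion can be sketched as follows. This only models the big-endian case, which is what the JPEG fields need, and it is not Zeek’s implementation. The names be_bytes_to_count, JfifVersion, and jfif_version are hypothetical. The version lookup assumes the JFIF APP0 layout from the resources above: the five-byte identifier “JFIF” plus a zero byte, then one byte each for the major and minor version.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Fold a string of raw bytes into an unsigned integer, most significant
// byte first (big-endian).
std::uint64_t be_bytes_to_count(const std::string& s) {
    std::uint64_t value = 0;
    for (unsigned char c : s)
        value = (value << 8) | c;
    return value;
}

struct JfifVersion {
    std::uint64_t major_version;
    std::uint64_t minor_version;
};

// app0_payload holds the bytes that follow the two length bytes of an APP0
// segment. Returns false if it is too short or lacks the JFIF identifier.
bool jfif_version(const std::string& app0_payload, JfifVersion& out) {
    const std::string ident("JFIF\0", 5);
    if (app0_payload.size() < 7 || app0_payload.compare(0, 5, ident) != 0)
        return false;
    out.major_version = be_bytes_to_count(app0_payload.substr(5, 1));
    out.minor_version = be_bytes_to_count(app0_payload.substr(6, 1));
    return true;
}
```

With this helper, the two bytes 0x01 0xF4 convert to 500, and a payload beginning with “JFIF”, a zero byte, 0x01, and 0x02 yields version 1.2.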
This code is all that is needed to create binpac parsers to handle the JPEG events so that we can output the basic information we are looking for.
Use Binpac Parsers
We must implement the binpac logic we just wrote, and that is done in JPEG.h/.cc. JPEG.h will create the necessary binpac::JPEG File and MockConnection structures. JPEG.cc will then instantiate these File and MockConnection structures. In the “DeliverStream” function, the data will be sent to the binpac parsers through a binpac::JPEG::File object, which matches the name of our Flow on purpose because of the CMake files that will compile and link everything together. This ties all of the functionality from binpac into the C++ plugin, which we can then use from Zeek scriptland.
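The interaction between block-wise delivery and a fixed-size header window can be modeled with a small stand-alone class. This is a simplified illustration only: binpac generates its own buffering code, and the class HeaderWindow with its deliver method is hypothetical rather than part of Zeek or binpac. The return value follows the same idea as the “DeliverStream” function shown below, namely that false means no more data is wanted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Collects the first `window` bytes of a stream that arrives in chunks of
// arbitrary size, then ignores everything after that.
class HeaderWindow {
public:
    explicit HeaderWindow(std::size_t window) : window_(window) {}

    // Feed one chunk. Returns true while more data is still wanted and
    // false once the window is full.
    bool deliver(const std::uint8_t* data, std::size_t len) {
        if (done_)
            return false;

        std::size_t want = window_ - buf_.size();
        std::size_t take = len < want ? len : want;
        buf_.insert(buf_.end(), data, data + take);

        if (buf_.size() == window_)
            done_ = true;  // the header can now be parsed in one pass
        return !done_;
    }

    bool done() const { return done_; }
    const std::vector<std::uint8_t>& bytes() const { return buf_; }

private:
    std::size_t window_;
    std::vector<std::uint8_t> buf_;
    bool done_ = false;
};
```

A file shorter than the window never fills it, which is consistent with the observation above that the structures are only matched when the image is bigger than 500 bytes.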
You should have the following content for your files:
JPEG.h
#pragma once

#include <string>

#include "Val.h"
#include "../File.h"
#include "jpeg_pac.h"

namespace file_analysis {

/**
 * Analyze JPEG files
 */
class JPEG : public file_analysis::Analyzer {
public:
    ~JPEG();

    static file_analysis::Analyzer* Instantiate(RecordVal* args, File* file)
        { return new JPEG(args, file); }

    virtual bool DeliverStream(const u_char* data, uint64_t len);
    virtual bool EndOfFile();

protected:
    JPEG(RecordVal* args, File* file);

    binpac::JPEG::File* interp;
    binpac::JPEG::MockConnection* conn;
    bool done;
};

} // namespace file_analysis
JPEG.cc
#include "JPEG.h"
#include "file_analysis/Manager.h"

using namespace file_analysis;

JPEG::JPEG(RecordVal* args, File* file)
    : file_analysis::Analyzer(file_mgr->GetComponentTag("JPEG"), args, file)
    {
    conn = new binpac::JPEG::MockConnection(this);
    interp = new binpac::JPEG::File(conn);
    done = false;

    if ( file_jpeg )
        mgr.QueueEventFast(file_jpeg, {
            GetFile()->GetVal()->Ref()
        });
    }

JPEG::~JPEG()
    {
    delete interp;
    delete conn;
    }

bool JPEG::DeliverStream(const u_char* data, uint64_t len)
    {
    if ( conn->is_done() )
        return false;

    try
        {
        interp->NewData(data, data + len);
        }
    catch ( const binpac::Exception& e )
        {
        return false;
        }

    return ! conn->is_done();
    }

bool JPEG::EndOfFile()
    {
    return false;
    }
Seeing The Output
You need to recompile and install Zeek for your changes to take effect:
make
sudo make install
Next, the following commands will run Zeek on the Wireshark pcap and show you the JPEG information we were able to parse:
$ zeek -B file_analysis -r pcaps/http_with_jpegs.cap
$ cat jpeg.log
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path jpeg
#open 2019-11-25-10-12-15
#fields ts id total_bytes width height jfif_major jfif_minor comment
#types time string count count count count count string
1100903355.573238 FxUtSi4RIZf03k0sFd 8281 512 512 1 1 Created with The GIMP
1100903355.580655 FKuUQ14DO6SOTBbaB5 9045 500 89 1 1 Created with The GIMP
1100903360.932707 FC536m3fh036oWUd3i 8963 180 240 1 1 Created with The GIMP
1100903360.939152 Fqu89k3QfZCvvvZN0g 10730 180 240 1 1 Created with The GIMP
1100903365.003584 FqMaUw18GBKJ7r5oca 191515 960 1280 1 1 Created with The GIMP
#close 2019-11-25-10-12-15
When we run the same code on the other pcap file we downloaded in part one, we see the following:
$ zeek -B file_analysis -r pcaps/http.pcap
$ cat jpeg.log
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path jpeg
#open 2019-11-25-10-13-01
#fields ts id total_bytes width height jfif_major jfif_minor comment
#types time string count count count count count string
1320279566.886920 FFTf9Zdgk3YkfCKo3 2300 70 70 1 1 -
1320279566.889283 FfXtOj3o7aub4vbs2j 2272 70 70 1 1 -
1320279566.898309 F21Ybs3PTqS6O4Q2Zh 2562 70 70 1 1 -
1320279566.898520 Fdk0MZ1wQmKWAJ4WH4 1595 70 52 1 1 -
1320279566.903330 FwCCcC3lGkQAwhCDX3 2242 70 70 1 1 -
1320279566.906766 FHK4nO28ZC5rrBZPqa 1874 70 52 1 1 -
1320279586.179367 FGB1Ugi2qN3KCeA3j 6921 366 81 1 1 -
1320279589.375105 F2GiAw3j1m22R2yIg2 893 120 90 - - -
1320279616.992134 FipMsu3eD5AnIRq2N 59209 500 433 1 1 -
Conclusion
This post showed you how to add logic to the JPEG stub we created in the last blog post. Through a specially crafted combination of binpac, C++, and Zeek scripting, we were able to analyze JPEG data and output basic information about JPEGs such as image width, height, version number, and comments. The next post in this blog series will address converting our working code to a dynamically loadable package so that we do not have to distribute it with the full Zeek source code.
About Keith J. Jones, Ph.D
Dr. Jones is an internationally industry-recognized expert with over two decades of experience in cyber security, incident response, and computer forensics. His expertise includes software development, innovative prototyping, information security consulting, application security, malware analysis & reverse engineering, software analysis/design and image/video/audio analysis.
Dr. Jones holds undergraduate degrees in Electrical Engineering and Computer Engineering from Michigan State University. He also earned a Master of Science degree in Electrical Engineering from MSU. Dr. Jones completed his Ph.D. in Cyber Operations at Dakota State University in 2019.