By Zach Medley
Getting started working on Zeek can be daunting because of the sheer size of the repository. Zeek is reasonably designed, but it is big, and even well-organized code can be a lot to handle. This blog post walks through how I added Zeek’s key-value for loop in the hope that it might make it easier for future Zeek developers to get started.
Zeek, formerly Bro, is an open-source network security monitoring tool that transforms raw traffic into rich logs, extracted files, and custom insights via the Turing-complete Zeek scripting language. It is developed on GitHub with its community.
Defining the Problem
Before the addition of a key-value for loop in Zeek, you could iterate over the items in a container with a standard range-based for loop:
However, looping over tables where there are both keys and values requires a separate lookup:
Christian suggested that we extend this tuple unpacking for use with key-value for loops.
The testing framework Zeek uses is called btest and tests written using it are commonly called “btests.” Zeek’s btests live in the testing/btest/ directory. Once you get the hang of them, they are pretty straightforward, but at first glance they can be a little confusing.
A btest usually consists of a test and a baseline. Btest works by running your test and comparing its output to a known baseline. A difference between the output and the baseline results in a failed test. In addition to cloning Zeek, you’ll need to install btest separately.
We suggest installing the development version of btest, since the master branch of Zeek may depend on a more recent btest than the released one. After cloning Zeek, change to the directory you cloned it into and run:
pip install -e aux/btest/
With btest installed, we can begin to write our tests. Zeek already has tests that cover for-loops in testing/btest/language/for.bro, so modifying that file is fine, but I chose to add a separate test file called key-value-for.bro. I wrote a couple of tests for key-value for-loops and added one for iterating over tables with more than one index value because there wasn’t a test for that yet. My tests for the key-value for-loop look like this:
With the test written, you’ll now have to add a baseline so that btest knows what the desired output should be. There is no single best way to create a baseline; several approaches work well. As long as you end up with a working test, whichever approach you prefer is likely fine.
The easiest way to create a simple baseline is to temporarily replace the test script with some ad-hoc script that produces the same output. For the above we might replace it with some print statements that produce the desired output. Then you can go ahead and run the test with the -U option, which records the test’s output as the new baseline. Once that’s done, don’t forget to go back and change the script back to the one you want to test.
For more complicated tests, though, this ad-hoc method can get troublesome. Here, Christian suggests running the real test, letting it fail, then copying the “out” file it creates over to the baseline directory.
More or less in line with Christian’s suggestion, I created my baseline by hand in the testing/btest/Baseline/ directory. There I created a new folder named <the btest folder your test is in>.<the name of your test file>. For example, my test file was named key-value-for.bro and lived in the testing/btest/language folder, so I added a folder called language.key-value-for to testing/btest/Baseline. Inside your new folder, add a file called out and write the expected output of your test in it. My out file looks like this:
Now we can run our test and see if it fails. To run the test, first build and install Zeek by running
Then, change back to the ./btest directory and run:
btest -d language/key-value-for.bro
Adding new language functionality in Zeek can be done in a couple of simple steps:
Modify parse.y so that the new syntax is recognized and handled properly;
Write the underlying C++ code to make it all work.
We’ll start by writing the code to parse the new for-loop.
Zeek uses lex and yacc to generate its parser. The part that we’re concerned with can be found in src/parse.y. Specifically, we’re interested in the part that parses the for statement, underneath for_head:
I’ll walk through this code to give an overview of how it works, and then show the new parsing rules for a key-value for-loop.
TOK_FOR ‘(‘ TOK_ID TOK_IN expr ‘)’
This line indicates the syntax that the following code handles. Each of the tokens can be referred to below by its position, with TOK_FOR corresponding to $1 and ‘)’ corresponding to $6.
When a Zeek script is parsed, objects can be associated with a location in the source. For more information on the utility of this, see Bison’s documentation on tracking locations. For a little more on how a location is represented, see src/Obj.h.
ID* loop_var = lookup_ID($3, current_module.c_str());
In this case, $3 refers to TOK_ID. Here we get loop_var’s previous definition if it already exists in the current module.
This is the meat of the parse phase. Here, if loop_var already has a definition, we make sure that it is not a global variable. Otherwise, we initialize it.
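The check described above can be modeled roughly as follows. This is a standalone sketch, not the code in parse.y: IdSketch, ScopeSketch, and ResolveLoopVar are hypothetical names, and throwing an exception is only a stand-in for however the parser actually reports the error.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an identifier's definition.
struct IdSketch {
    bool is_global = false;
};

// Hypothetical stand-in for the identifiers visible in the current module.
using ScopeSketch = std::map<std::string, IdSketch>;

// Returns the loop variable. An existing definition is reused unless it is
// global; if there is no definition yet, a new local one is installed.
IdSketch& ResolveLoopVar(ScopeSketch& scope, const std::string& name) {
    auto it = scope.find(name);
    if (it != scope.end()) {
        if (it->second.is_global)
            throw std::runtime_error("global variable used in for-loop: " + name);
        return it->second;
    }
    // Assumption: a freshly installed loop variable is always local.
    return scope.emplace(name, IdSketch{}).first->second;
}
```

Calling ResolveLoopVar with an unknown name adds a local entry to the scope, while calling it with the name of a global is rejected.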
$$ = new ForStmt(loop_vars, $5);
Finally, we build a new for-statement from the loop variables and $5, which refers to the expression we’re iterating over.
My implementation follows the basic for-loop’s parsing procedure very closely and calls an alternate version of the constructor that I’ll discuss next.
In order to preserve as much of the original for-loop’s functionality as possible, I opted to write an alternate constructor for the for-loop that included a variable for values to be stored in as the loop moves through the table. The constructor first calls the regular for-loop constructor on the loop variables and expression, and then runs some additional code to verify the type of the value variable.
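The shape of such a constructor can be sketched with a C++ delegating constructor. The class below is a simplified, hypothetical model rather than Zeek’s ForStmt: ForStmtSketch and VarSketch are made-up names, types are represented as plain strings, and the exact check performed on the value variable is an assumption.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical value variable; an empty type means "not declared yet".
struct VarSketch {
    std::string name;
    std::string type;
};

// Hypothetical, simplified model of a for-statement.
class ForStmtSketch {
public:
    // Regular for-loop: loop variables plus the table's yield type.
    ForStmtSketch(std::vector<std::string> loop_vars, std::string yield_type)
        : loop_vars_(std::move(loop_vars)), yield_type_(std::move(yield_type)) {}

    // Key-value for-loop: run the regular constructor first, then verify
    // the value variable (assumed check: its type must match the yield type).
    ForStmtSketch(std::vector<std::string> loop_vars, std::string yield_type,
                  VarSketch value_var)
        : ForStmtSketch(std::move(loop_vars), std::move(yield_type)) {
        if (value_var.type.empty())
            value_var.type = yield_type_;  // adopt the table's yield type
        else if (value_var.type != yield_type_)
            throw std::invalid_argument("value variable type does not match");
        value_var_ = std::move(value_var);
        has_value_var_ = true;
    }

    bool HasValueVar() const { return has_value_var_; }
    const VarSketch& ValueVar() const { return value_var_; }

private:
    std::vector<std::string> loop_vars_;
    std::string yield_type_;
    VarSketch value_var_;
    bool has_value_var_ = false;
};
```

Delegating keeps the regular for-loop setup in one place, so the key-value constructor only adds what is new.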
The most interesting part of the for-loop is the actual looping. This is done in the DoExec part of the for-loop in src/Stmt.cc.
We’re only interested in the part of the for-loop that deals with looping over tables because they are the only data type supported by key-value for-loops. This code is mostly self-explanatory, with the exception of the usage of Ref() and Unref().
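As a simplified model of the table branch (not the Zeek source itself), the sketch below visits each entry once and binds both the key and the value before running the loop body. LoopFrame and ForEachKeyValue are hypothetical names, a std::map of strings to integers stands in for a Zeek table, and reference counting is left out.

```cpp
#include <map>
#include <string>

// Hypothetical frame with one slot for the key and one for the value.
struct LoopFrame {
    std::string key;
    int value = 0;
};

// Visits every table entry once, storing the key and the value in the frame
// before the body runs.
template <typename Body>
void ForEachKeyValue(const std::map<std::string, int>& table, LoopFrame& frame,
                     Body body) {
    for (const auto& entry : table) {
        frame.key = entry.first;
        frame.value = entry.second;  // taken from the entry, no lookup by key
        body(frame);
    }
}
```

Because the value comes from the entry being visited, the body never has to look the key up again.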
Zeek uses reference counting under the hood to clean up objects when they’re done being used. If you’re familiar with modern C++, this is similar to how shared_ptr works. Each object keeps track of how many references it has; if that number drops to zero, Zeek will clean it up. Whenever we’re setting an element in a frame we need to call Ref() on it. This increases the value’s reference count, indicating that the frame needs that value until some time in the future when Unref() is called on it.
Reference counting in Zeek can be quite difficult to get the hang of, and mistakes lead to bugs that are hard to track down. Take care when using a value after passing it elsewhere; if you get a segfault, a value that was cleaned up too early is often the cause. Debuggers like gdb and tools like valgrind can help track down what it was that got deleted.
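To make the idea concrete, here is a minimal sketch of this style of reference counting. It is not Zeek’s implementation: CountedObj is a hypothetical class, and the Ref() and Unref() shown here are simplified stand-ins that only share their names with Zeek’s functions.

```cpp
#include <cassert>

// Hypothetical reference-counted object that stores its own count.
class CountedObj {
public:
    // The optional flag lets a caller observe when the object is destroyed.
    explicit CountedObj(bool* destroyed_flag = nullptr)
        : destroyed(destroyed_flag) {}
    ~CountedObj() {
        if (destroyed)
            *destroyed = true;
    }
    int RefCount() const { return ref_count; }

private:
    friend void Ref(CountedObj* o);
    friend void Unref(CountedObj* o);
    int ref_count = 1;  // the creator holds the first reference
    bool* destroyed;
};

// Records that one more holder needs the object.
void Ref(CountedObj* o) {
    if (o)
        ++o->ref_count;
}

// Drops one reference and deletes the object when none remain.
void Unref(CountedObj* o) {
    if (!o)
        return;
    assert(o->ref_count > 0);
    if (--o->ref_count == 0)
        delete o;
}
```

In this model, a value that is stored in a second place gets one Ref() call, and it is only deleted after both holders have called Unref(). Using the pointer after the final Unref() is the kind of mistake that shows up as a segfault.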
The addition of key-value for loops to Zeek makes the process of iterating over a table simpler and more performant:
When possible, key-value for loops should be preferred to regular loops over tables.
If you’re interested in contributing to Zeek there is no bar to entry. For C and C++ people, the Zeek core is a great place to get your feet wet developing a scripting language. You can also get involved just writing Zeek. Much of Zeek is written in Zeek. Even if you don’t program much, I wrote the README so I’m sure it’s got a couple spelling and grammar errors.
Helpful Links and information:
About Zeek (formerly Bro): Zeek is a powerful network analysis framework that is much different from the typical IDS you may know. https://www.zeek.org/