Zeek: Why Zeek keeps breaking your test baselines

With zeek-8.1.0, dns.log now includes information on dynamic update messages (RFC 2136), adding new opcode and opcode_name columns. During the RC phase we received feedback that this change to the log schema might break downstream consumers of dns.log, so we also included a new policy script, policy/protocols/dns/disable-opcode-log-fields.zeek, to disable these columns again. Two ways in which this change would break users came up:

The new log columns add volume to a Zeek log which for many sites already is one of the busiest. The increased data volume would break assumptions sites had made during storage capacity planning.
Adding columns to a Zeek log would break existing log snapshots in tests produced with, e.g., btest-diff, and a changing schema would make testing a package across different Zeek versions hard. Since Zeek’s package manager zkg by default runs tests during package installation, this would break many packages.

Both issues are related to schema evolution, which is something the Zeek project tries to handle deliberately and with care. This post details our approach and, we hope, gives you enough background so you can upgrade Zeek without grief. We also give some guidance to package developers on how to set up their packages for upstream Zeek changes.

Log schema evolution in Zeek

When talking about the schema of Zeek logs we mean the columns present (e.g., does dns.log contain an opcode column?) and their data types, e.g., time vs. count. A schema provides a contract to users so they can make assumptions, e.g., about whether a column is present or how to interpret its data. Since Zeek is a living project, log schemas change over time, and we take some care to make this as predictable as possible for users.

Under the hood Zeek logs are a list of serialized Zeek record values, e.g., entries in dns.log follow the schema defined in DNS::Info:

## The record type which contains the column fields of the DNS log.
type Info: record {
    ## The earliest time at which a DNS protocol message over the
    ## associated connection is observed.
    ts:            time               &log;
    ## A unique identifier of the connection over which DNS messages
    ## are being transferred.
    uid:           string             &log;
    ## The connection's 4-tuple of endpoint addresses/ports.
    id:            conn_id            &log;

    ... SNIP

    ## The domain name that is the subject of the DNS query.
    query:         string             &log &optional;

    ... SNIP

    ## The Authoritative Answer bit for response messages specifies
    ## that the responding name server is an authority for the
    ## domain name in the question section.
    AA:            bool               &log &default=F;

    ... SNIP

    ## The total number of resource records in a reply message's
    ## answer section.
    total_answers: count           &optional;

    ... SNIP

This schema helps us understand what form of dns.log to expect, e.g.,

DNS::Info contains a number of named fields, e.g., it has a field ts which holds a time. We should not make assumptions about their order though, e.g., TSV and JSON logs might render columns in a different order.
Fields with the &log attribute, such as ts, get written to logs, but there are also fields which hold internal state and are not logged, e.g., total_answers.
Some fields have an &optional attribute which means they might not be present, e.g., ts will always be present, but query might not be set.
Some fields specify a &default value which is used if the field was not explicitly set, e.g., AA defaults to F (false).
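
For consumers, the points above suggest looking columns up by name rather than by position, and tolerating unset &optional columns. The following is a minimal Python sketch of that idea for TSV logs. It assumes tab-separated logs with the usual `#fields` and `#unset_field` header lines, does not handle sets, vectors, or escaped characters, and uses a reduced, made-up sample instead of a real dns.log.

```python
def parse_zeek_tsv(lines):
    """Yield one dict per log entry, keyed by column name."""
    separator = "\t"
    unset_field = "-"
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            if line.startswith("#unset_field"):
                unset_field = line.split(separator, 1)[1]
            elif line.startswith("#fields"):
                # Column names come from the header, not from positions we assume.
                fields = line.split(separator)[1:]
            continue
        if not line:
            continue
        values = line.split(separator)
        # Leave unset &optional columns out instead of storing a placeholder.
        yield {k: v for k, v in zip(fields, values) if v != unset_field}


# Made-up sample with a reduced set of columns, for illustration only.
sample = [
    "#separator \\x09",
    "#unset_field\t-",
    "#fields\tts\tuid\tquery\tAA",
    "#types\ttime\tstring\tstring\tbool",
    "1700000000.000000\tCabc123\texample.com\tF",
    "1700000001.000000\tCdef456\t-\tT",
]

entries = list(parse_zeek_tsv(sample))
# `query` is &optional, so always provide a fallback when reading it.
print(entries[1].get("query", "<unset>"))
```

Because entries are keyed by name, a new column appearing in `#fields` does not shift the values of the columns the consumer already reads.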

When evolving the schema of a log or any other exported record in Zeek we follow these rules:

A required (not &optional) field cannot be removed without a deprecation cycle.
The data type of a column cannot be changed.
&optional or &default cannot be removed without a deprecation cycle.
New columns are always introduced as &optional or with a &default.

A deprecation cycle usually means that we mark a field with a &deprecated attribute and a target version for removal, and call out its deprecation in the Zeek release notes. Use of a record field marked &deprecated in a Zeek script raises a warning, but consumers of Zeek logs should still audit the release notes to catch deprecations. Once we release the target version the field might get dropped. When deciding on the target version we take our LTS policy into account, e.g., if we marked a field &deprecated with zeek-8.2 we would only remove it with or after zeek-9.1.
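
The rules above can be read as a compatibility check between two versions of a record schema. The Python sketch below is only an illustration of that reading and not a tool Zeek ships: the `Column` type and `check_evolution` function are hypothetical, a deprecation cycle is modeled simply as "the field was marked deprecated in the old version", and deprecation cycles for attribute removals are not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    type: str
    optional: bool = False     # models &optional
    has_default: bool = False  # models &default
    deprecated: bool = False   # models &deprecated


def check_evolution(old, new):
    """Return a list of rule violations when moving from `old` to `new`."""
    problems = []
    for name, before in old.items():
        after = new.get(name)
        if after is None:
            # Rule 1: required fields need a deprecation cycle before removal.
            if not before.optional and not before.deprecated:
                problems.append(f"{name}: required field removed without deprecation")
            continue
        # Rule 2: the data type must not change.
        if after.type != before.type:
            problems.append(f"{name}: type changed from {before.type} to {after.type}")
        # Rule 3: &optional and &default must not be dropped.
        if (before.optional and not after.optional) or (
            before.has_default and not after.has_default
        ):
            problems.append(f"{name}: &optional or &default removed")
    # Rule 4: new columns must be &optional or carry a &default.
    for name, after in new.items():
        if name not in old and not (after.optional or after.has_default):
            problems.append(f"{name}: new field must be &optional or have a &default")
    return problems


v1 = {"ts": Column("time"), "query": Column("string", optional=True)}
# The type and attributes of the added column are assumed for illustration.
v2 = {**v1, "opcode": Column("count", optional=True)}
print(check_evolution(v1, v2))
```

Adding an &optional column, as in `v2`, yields no violations, while changing the type of `ts` or dropping it outright would be reported.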

These rules allow users to prepare for breaking changes, but they also require some work on our side since we need to give users a way to migrate. For example, actually changing the type of a column typically means:

We introduce a new &optional &log record field with the desired type.
We mark the existing field as &deprecated and start a deprecation cycle.
The deprecated column can be removed at the end of the deprecation cycle.
Most of the time the new column likely stays &optional since we want to avoid another deprecation cycle.
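
On the consumer side, such a migration can be bridged by preferring the new column and falling back to the deprecated one while both exist. This is a minimal Python sketch under stated assumptions: the helper `read_migrated` and the columns `size` and `size_bytes` are hypothetical and only illustrate the pattern.

```python
def read_migrated(entry, new_name, old_name, convert):
    """Prefer the new column; fall back to converting the deprecated one."""
    if new_name in entry:
        return entry[new_name]
    if old_name in entry:
        return convert(entry[old_name])
    # Both columns are optional from the consumer's point of view.
    return None


# Hypothetical entries: `size` (string) is being replaced by `size_bytes` (count).
before_upgrade = {"uid": "Cabc123", "size": "1024"}
during_cycle = {"uid": "Cabc123", "size": "1024", "size_bytes": 1024}
after_removal = {"uid": "Cabc123", "size_bytes": 1024}

for entry in (before_upgrade, during_cycle, after_removal):
    print(read_migrated(entry, "size_bytes", "size", int))
```

The same consumer code then works before the upgrade, during the deprecation cycle, and after the old column is removed.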

What this means for log consumers

When ingesting logs into a SIEM

As a Zeek admin forwarding logs into a SIEM you interface between two worlds:

the logs provided by Zeek, and
the log information exposed to analysts in the SIEM.

In doing that you are always translating between Zeek’s log schemas and the schemas exposed in the SIEM (owned by you or somebody else on your end), either by explicit choice or implicitly when just forwarding Zeek’s logs. While we hope you are excited about making new Zeek log information available to your users as quickly as possible, being explicit about what you expose will help to decouple you from breaking changes in Zeek.

Zeek’s logging framework provides tools to adapt what is logged: you could use the include field of a Log::Filter to define an explicit allow list of columns to expose. As a consumer this is a strictly better approach than, e.g., dropping the &log attribute from select columns to define an exclusion list, since with an exclusion list any new column changes the schema you expose to your users. Your log ingestion pipeline might provide similar tools.
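
If you apply the allow list on the ingestion side rather than in Zeek, the idea could look like the following Python sketch. It assumes JSON logs with one entry per line; the function `project`, the chosen column list, and the sample entry are made up for illustration, and this is not Zeek's own filtering API.

```python
import json

# Explicit allow list: only these columns are exposed downstream.
ALLOWED_DNS_COLUMNS = ("ts", "uid", "query", "AA")


def project(line, allowed=ALLOWED_DNS_COLUMNS):
    """Keep only allow-listed columns of one JSON log entry."""
    entry = json.loads(line)
    # Columns missing from the entry (unset &optional fields) are skipped.
    return {name: entry[name] for name in allowed if name in entry}


# Made-up sample entry that already carries the new columns.
line = (
    '{"ts": 1700000000.0, "uid": "Cabc123", "query": "example.com", '
    '"AA": false, "opcode": 0, "opcode_name": "query"}'
)
print(json.dumps(project(line)))
```

With an allow list, the new opcode and opcode_name columns never reach downstream users until you decide to add them to the list.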

Even when being explicit you still might need to deal with column removals. These are called out in the Zeek release notes with ample time for you to migrate your users off the column’s data.

As a developer of Zeek packages

As a developer of Zeek packages you probably interact with Zeek logs since your package adds to the information Zeek emits; most often this means that your package either provides a new log, or adds columns to an existing log provided by, e.g., Zeek itself. When testing that your package works you might use BTest to create baselines of what you expect to see, and your tests might contain lines like

## Create a snapshot of `some.log`, track it with the package,
## and validate its shape in testing.
# @TEST-EXEC: btest-diff some.log

Depending on the log in question you might need to be explicit about what you include in your testing baselines:

If your package provides a new log, i.e., your package defines the record which gets logged, you could baseline it as is.
If your package adds data to an existing log you want to only baseline information you control, i.e., the columns you add or add to.

Columns in Zeek TSV logs can be narrowed down with zeek-cut, e.g., to select only the service column in conn.log one would use

cat conn.log | zeek-cut service

For JSON logs you can use jq to select columns, e.g.,

## Assuming JSON logs.
cat conn.log | jq '.service'

You can use zeek-cut with BTest’s canonifiers to preprocess the logs you baseline, e.g.,

# @TEST-EXEC: TEST_DIFF_CANONIFIER='zeek-cut service' btest-diff conn.log

TEST_DIFF_CANONIFIER is an environment variable interpreted by btest-diff. It is expected to hold a shell command which btest-diff pipes data into (via stdin) and receives processed data from (via stdout). Since TEST_DIFF_CANONIFIER contains a shell command, preprocessors can be composed, e.g.,

## First select `ts` and `service` columns, then process
## output with `custom_script`.
# @TEST-EXEC: TEST_DIFF_CANONIFIER='zeek-cut ts service | custom_script' btest-diff conn.log
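
What `custom_script` does is up to you; any filter from stdin to stdout works. As one possible illustration, the Python sketch below replaces timestamps with a fixed placeholder so baselines do not depend on when a trace was recorded. The function name `canonify_stream` and the placeholder are assumptions, and the demo uses in-memory streams with made-up input in place of real zeek-cut output.

```python
import io
import re

# A Zeek `time` value as rendered in TSV logs, e.g. `1700000000.123456`.
TIMESTAMP = re.compile(r"\b\d{10}\.\d{6}\b")


def canonify_stream(src, dst):
    """Copy `src` to `dst`, masking timestamps with a fixed placeholder."""
    for line in src:
        dst.write(TIMESTAMP.sub("XXXXXXXXXX.XXXXXX", line))


# Made-up stand-in for the output of `zeek-cut ts service`.
src = io.StringIO("1700000000.123456\tdns\n1700000001.000001\thttp\n")
dst = io.StringIO()
canonify_stream(src, dst)
print(dst.getvalue(), end="")
```

To use such a filter as a canonifier, one would call `canonify_stream(sys.stdin, sys.stdout)` from an executable script available on the test's `PATH`.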

As a general guideline, be wary of tests which baseline full logs, or which define the columns to baseline via an exclusion list instead of an allow list. In both cases baselines can suddenly break on Zeek upgrades.

# EXAMPLES OF BAD TESTS, AVOID IF POSSIBLE. 

# BAD: no control over schema of `conn.log` at all. 
# @ TEST-EXEC: btest-diff conn.log 

# BAD: Column set defined via exclusion `-n` instead of allow list. 
# @ TEST-EXEC: TEST_DIFF_CANONIFIER='zeek-cut -n local_orig' btest-diff conn.log

Summary

We hope this post has given you some context on how we approach evolving Zeek logs so you can follow Zeek releases with fewer issues. Please feel free to reach out to us on Slack or our Discourse forum.

Author

Benjamin Bannier

Benjamin Bannier works as a Senior Open Source Developer at Corelight where he spends most of his time maintaining and evolving Spicy and its integration into the Zeek ecosystem.
