Zeek has a long-standing issue with standby CPU usage on low-power systems and low-traffic networks: even when nothing is happening on the network, Zeek will continue to use 10-15% of a CPU doing nothing. This stems from the fact that the existing main loop of Zeek is effectively a busy-wait loop. On each pass through net_run(), the loop checks every registered IOSource for whether there is data to be processed, an event to handle, or any number of other tasks. This happens for each IOSource whether or not there is anything to handle. In other words, Zeek spends a large amount of time just spinning the loop checking for non-existent tasks.

This issue is mostly corrected in Zeek 3.1. Benchmarking shows that Zeek now takes roughly 0% CPU time when listening on a fully-idle network interface. This blog post describes the new architecture for the IO loop and changes made to IO sources to support the new architecture.

The major change to the IO loop is a switch away from constantly polling the individual sources for updates. Zeek 3.1 instead relies on the operating system’s built-in event queue mechanisms to watch for changes on file descriptors. It does this through a third-party library called libkqueue[1]. Libkqueue provides a system-agnostic implementation of the kqueue() interface from macOS and FreeBSD, which means Zeek only has to implement against that one interface. On Linux systems, libkqueue actually uses epoll underneath (or poll, if epoll is unavailable). As mentioned earlier, using this interface allows us to wait for actions on file descriptors and have the OS notify us when something is ready to be processed or when a timeout occurs.
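
To make the mechanism concrete, here is a minimal, self-contained sketch of the kind of kqueue()/kevent() usage described above. This is illustrative only, not Zeek’s actual loop: it registers one file descriptor for read readiness, then blocks until the descriptor is readable or a timeout expires. On Linux, libkqueue translates these same calls to epoll.

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstdio>

int main()
	{
	int kq = kqueue();
	if ( kq < 0 )
		return 1;

	// Watch stdin (fd 0) for readability. In Zeek, this would be a
	// descriptor belonging to one of the registered IO sources.
	struct kevent change;
	EV_SET(&change, 0, EVFILT_READ, EV_ADD, 0, 0, nullptr);
	kevent(kq, &change, 1, nullptr, 0, nullptr);

	// Block until the fd becomes readable or 100ms pass, instead of spinning.
	struct timespec timeout = { 0, 100 * 1000 * 1000 };
	struct kevent event;
	int n = kevent(kq, nullptr, 0, &event, 1, &timeout);

	if ( n > 0 )
		printf("fd %d is ready to be processed\n", (int)event.ident);
	else if ( n == 0 )
		printf("timed out with nothing to do\n");

	close(kq);
	return 0;
	}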

That “waiting for something to be processed” part is what drives most of the changes to the IO source code. When a new IO source is created at startup, it optionally registers a file descriptor with the IOSource manager. This descriptor is what kqueue watches to know when the source has data to be processed. This effectively pushes the handling of new data onto the IO sources: instead of the loop asking whether a source has something to do, the source tells the loop that it needs to do work and the loop acts on that information. This also removes the need to ask every source for its file descriptors on every loop pass, since sources can simply register and unregister them when necessary.
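
As a rough sketch of what that looks like from a source’s point of view, assuming manager methods along the lines of RegisterFd()/UnregisterFd() (the names and signatures here are approximations, not the authoritative API; check the iosource::Manager header in your Zeek version):

// Hypothetical source that owns a single descriptor, e.g. a socket.
class MyFdSource : public iosource::IOSource {
public:
	void InitSource() override
		{
		// Hand the descriptor to the manager once at startup; kqueue will
		// now wake the loop whenever it becomes readable, so the loop never
		// has to ask this source for its descriptors again.
		iosource_mgr->RegisterFd(sock_fd, this);
		}

	void Done() override
		{
		// Remove the descriptor when the source shuts down.
		iosource_mgr->UnregisterFd(sock_fd, this);
		}

private:
	int sock_fd = -1;
};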

On top of sources registering their file descriptors, there is also the matter of timeouts for the calls to kqueue. At the start of each pass of the IO loop, the loop asks each IO source for the next time at which it must process something. For example, the Timer manager returns the time at which the next active timer fires. The return value is always relative to the current network time. When kqueue starts waiting for data, it uses the soonest of these values as its timeout. The old IO loop managed timers by checking on every pass whether any timers were ready to be processed; since it was spinning anyway, this meant a lot of extra checks for no reason. Because kqueue can now simply time out when a timer is ready to be processed, we ensure that a) we’re not doing those extra checks and b) timers will always fire even in the absence of other data to push network time forward.
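
For illustration, here is a simplified, standalone sketch (not Zeek’s actual code) of the loop-side logic: collect the relative timeout each source reports, take the smallest non-negative one, and convert it into the timespec handed to kevent(). A negative value from a source means “no timeout needed.”

#include <ctime>
#include <vector>

// Pick the soonest timeout requested by any source; sources that return a
// negative value don't influence the wait.
struct timespec ComputeKqueueTimeout(const std::vector<double>& next_timeouts)
	{
	double soonest = -1;

	for ( double t : next_timeouts )
		if ( t >= 0 && (soonest < 0 || t < soonest) )
			soonest = t;

	// If no source asked for a timeout, fall back to a long wait; kqueue
	// still wakes up immediately when any registered fd becomes ready.
	if ( soonest < 0 )
		soonest = 60.0;

	struct timespec ts;
	ts.tv_sec = static_cast<time_t>(soonest);
	ts.tv_nsec = static_cast<long>((soonest - ts.tv_sec) * 1e9);
	return ts;
	}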

The same timeout mechanism can be used for other purposes as well. For example, a number of network interface types don’t return a usable file descriptor; the Myricom library is one of these. The file descriptor returned by that library cannot be reliably used to check whether the interface has data to process. For this packet source (and others with the same problem), we simply skip setting the file descriptor entirely. The PktSrc code then returns a very short (20 microsecond) timeout. Timeouts this short are well within kqueue’s capability: the loop simply wakes up and checks whether the interface has data available, which closely mimics the old IO loop. This causes the loop to use more CPU time than if the interface were fully supported, but still less than the old code, because each of those extra loop passes processes only that single source.
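
To illustrate the fallback, here is a rough, hypothetical version of that logic (the real PktSrc code differs in detail):

// A packet source whose descriptor can't signal readiness: instead of
// registering an fd, it asks for a tiny timeout so the loop polls it often.
struct PollOnlyPktSrc
	{
	bool have_usable_fd = false;

	double GetNextTimeout()
		{
		if ( ! have_usable_fd )
			return 0.00002;   // 20 microseconds between wake-ups

		return -1;            // fd is registered with kqueue; no polling needed
		}
	};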

Changes Required for IO Source Plugins

Unfortunately, all of the above requires some changes to non-packet-source IO source plugins in order to work properly with the new architecture. Changes are NOT required for packet source plugins, since the PktSrc interface abstracts these details away. These changes were mentioned in the release notes for 3.1, but I’ll go into them in more detail here. The GetFDs() and NextTimestamp() methods are no longer part of the IOSource interface. The first is removed for the reasons described earlier: sources now register their descriptors directly. The second is removed because the next timestamp for data is no longer meaningful; Zeek is notified exactly when data is available and doesn’t need to ask what the next data time is. One new method must be implemented by non-packet sources: GetNextTimeout(). As described earlier, this method returns the next timeout value for a source relative to the current network time. It can return -1 to indicate that this source shouldn’t be considered when setting the kqueue timeout.
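
Putting the pieces together, a non-packet IO source might now look roughly like the following. This is a hedged sketch: the surrounding method names (Process(), Tag()) and the network_time global are assumptions on my part rather than a verified copy of the interface, so check the iosource::IOSource header for the authoritative signatures.

class MyTimedSource : public iosource::IOSource {
public:
	// Replaces NextTimestamp(): report when this source next needs attention,
	// relative to the current network time, or -1 to be ignored when the
	// kqueue timeout is chosen.
	double GetNextTimeout() override
		{
		if ( ! have_pending_work )
			return -1;

		return next_work_time - network_time;
		}

	// Called by the loop when this source's fd became ready or its timeout
	// expired.
	void Process() override
		{
		// ... handle whatever is pending, update next_work_time ...
		have_pending_work = false;
		}

	const char* Tag() override { return "MyTimedSource"; }

private:
	bool have_pending_work = false;
	double next_work_time = 0.0;
};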

Benchmarking Data

Included below is some benchmarking data that goes into a bit more detail about the changes.

The first set of data is a comparison of the latest 3.0 release against the current HEAD on the Zeek GitHub repo. It is a benchmark run with 5 minutes of HTTPS/DNS (50/50 split) traffic at 500Mbps, replayed using tcpreplay onto a single network interface and read from that interface by a single Zeek process pinned to a single CPU. This benchmark uses the af_packet packet source plugin.

3.0 Release:

1585776369.065076 received termination signal
1585776369.065076 9199281 packets received on interface ens1f0, 0 dropped

 Performance counter stats for process id '3332':

         98,433.32 msec task-clock                #    0.631 CPUs utilized
         1,077,751      context-switches          #    0.011 M/sec
                 0      cpu-migrations            #    0.000 K/sec
         2,119,545      page-faults               #    0.022 M/sec
   283,888,577,871      cycles                    #    2.884 GHz
   274,225,123,031      instructions              #    0.97  insn per cycle
    57,914,999,753      branches                  #  588.368 M/sec
     1,197,059,679      branch-misses             #    2.07% of all branches

     156.020277819 seconds time elapsed

Maximum memory usage (max_rss): 2304488 bytes
Average CPU usage: 56.9%

Master:

1585775751.832691 received termination signal
1585775751.832691 9199281 packets received on interface ens1f0, 0 (0.00%) dropped

 Performance counter stats for process id '2731':

        101,784.81 msec task-clock                #    0.652 CPUs utilized
           306,675      context-switches          #    0.003 M/sec
                 0      cpu-migrations            #    0.000 K/sec
         3,340,496      page-faults               #    0.033 M/sec
   293,898,075,195      cycles                    #    2.887 GHz
   293,748,675,220      instructions              #    1.00  insn per cycle
    63,975,692,786      branches                  #  628.539 M/sec
     1,215,795,628      branch-misses             #    1.90% of all branches

     156.020450755 seconds time elapsed

Maximum memory usage (max_rss): 2271148 bytes
Average CPU usage: 56.9%

The CPU and memory usage remain nearly the same because data is being processed at a fast enough pace that the spinning of the main loop in the 3.0 version doesn’t show up in those metrics. The comparison does show, however, that there is no significant drop in performance between the two versions.

The second set of data is a comparison of top output for the latest 3.0 release against the current HEAD on the Zeek GitHub repo, running on a completely idle network interface. This is the output of `top | grep zeek`, left to run until 5 samples are received.

3.0 Release:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 4071 root      20   0  797296 222204 149416 S  18.6   0.1   0:00.56 zeek
 4071 root      20   0  797296 222204 149416 S  14.0   0.1   0:00.98 zeek
 4071 root      20   0  797296 222208 149416 S  14.0   0.1   0:01.40 zeek
 4071 root      20   0  797296 222208 149416 R  14.0   0.1   0:01.82 zeek
 4071 root      20   0  797296 222208 149416 S  14.0   0.1   0:02.24 zeek

Master:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3976 root      20   0  861008 213464 149516 S   9.3   0.1   0:00.28 zeek
 3976 root      20   0  861008 213464 149516 S   0.3   0.1   0:00.29 zeek
 3976 root      20   0  861008 213464 149516 S   0.3   0.1   0:00.30 zeek
 3976 root      20   0  861008 213464 149516 S   0.3   0.1   0:00.31 zeek
 3976 root      20   0  861008 213464 149516 S   0.3   0.1   0:00.32 zeek

If you want to run the first benchmark yourself on other data sets, the script is available on GitHub at https://github.com/zeek/zeek-aux/blob/master/devel-tools/perf-benchmark.
