wingolog

lightweight concurrency in lua

2018-05-16T15:17:32Z

Hello, all! Today I'd like to share some work I have done recently as part of the Snabb user-space networking toolkit. Snabb is mainly about high-performance packet processing, but it also needs to communicate with management-oriented parts of network infrastructure. These communication needs are performed by a dedicated manager process, but that process has many things to do, and can't afford to make blocking operations.

Snabb is written in Lua, which doesn't have built-in facilities for concurrency. What we'd like is to have fibers. Fortunately, Lua's coroutines are powerful enough to implement fibers. Let's do that!

fibers in lua

First we need a scheduling facility. Here's the smallest possible scheduler: simply a queue of tasks and a function to run those tasks.

local task_queue = {}

function schedule_task(thunk)
   table.insert(task_queue, thunk)
end

function run_tasks()
   local queue = task_queue
   task_queue = {}
   for _,thunk in ipairs(queue) do thunk() end
end

For our purposes, a task is just a function that will be called with no arguments.

Now let's build fibers. This is easier than you might think!

local current_fiber = false

function spawn_fiber(fn)
   local fiber = coroutine.create(fn)
   schedule_task(function () resume_fiber(fiber) end)
end

function resume_fiber(fiber, ...)
   current_fiber = fiber
   local ok, err = coroutine.resume(fiber, ...)
   current_fiber = nil
   if not ok then
      print('Error while running fiber: '..tostring(err))
   end
end

function suspend_current_fiber(block, ...)
   -- The block function should arrange to reschedule
   -- the fiber when it becomes runnable.
   block(current_fiber, ...)
   return coroutine.yield()
end

Here, a fiber is simply a coroutine underneath. Suspending a fiber suspends the coroutine. Resuming a fiber runs the coroutine. If you're unfamiliar with coroutines, or coroutines in Lua, maybe have a look at the lua-users wiki page on the topic.

The difference between a fibers facility and just coroutines is that with fibers, you have a scheduler as well. Very much like Scheme's call-with-prompt, coroutines are one of those powerful language building blocks that should rarely be used directly; concurrent programming needs more structure than what Lua offers.

If you're following along, it's probably worth it here to think how you would implement yield based on these functions. A yield implementation should yield control to the scheduler, and resume the fiber on the next scheduler turn. The answer is here.

communication

Once you have fibers and a scheduler, you have concurrency, which means that if you're not careful, you have a mess. Here I think the Go language got the essence of the idea exactly right: Do not communicate by sharing memory; instead, share memory by communicating.

Even though Lua doesn't support multiple machine threads running concurrently, concurrency between fibers can still be fraught with bugs. Tony Hoare's Communicating Sequential Processes showed that we can avoid a class of these bugs by treating communication as a first-class concept.

Happily, the Concurrent ML project showed that it's possible to build these first-class communication facilities as a library, provided the language you are working in has threads of some kind, and fibers are enough. Last year I built a Concurrent ML library for Guile Scheme, and when in Snabb we had a similar need, I ported that code over to Lua. As it's a new take on the problem in a different language, I think I've been able to simplify things even more.

So let's take a crack at implementing Concurrent ML in Lua. In CML, the fundamental primitive for communication is the operation. An operation represents the potential for communication. For example, if you have a channel, it would have methods to return "get operations" and "put operations" on that channel. Actually receiving or sending a message on a channel occurs by performing those operations. One operation can be performed many times, or not at all.

Compared to a system like Go, for example, there are two main advantages of CML. The first is that CML allows non-deterministic choice between a number of potential operations in a generic way. For example, you can construct a operation that, when performed, will either get on one channel or wait for a condition variable to be signalled, whichever comes first. In Go, you can only select between operations on channels.

The other interesting part of CML is that operations are built from a uniform protocol, and so users can implement new kinds of operations. Compare again to Go where all you have are channels, and nothing else.

The CML operation protocol consists three related functions: try which attempts to directly complete an operation in a non-blocking way; block, which is called after a fiber has suspended, and which arranges to resume the fiber when the operation completes; and wrap, which is called on the result of a successfully performed operation.

In Lua, we can call this an implementation of an operation, and create it like this:

function new_op_impl(try, block, wrap)
   return { try=try, block=block, wrap=wrap }
end

Now let's go ahead and write the guts of CML: the operation implementation. We'll represent an operation as a Lua object with two methods. The perform method will attempt to perform the operation, and return the resulting value. If the operation can complete immediately, the call to perform will return directly. Otherwise, perform will suspend the current fiber and arrange to continue only when the operation completes.

The wrap method "decorates" an operation, returning a new operation that, if and when it completes, will "wrap" the result of the completed operation with a function, by applying the function to the result. It's useful to distinguish the sub-operations of a non-deterministic choice from each other.

Here our new_op function will take an array of operation implementations and return an operation that, when performed, will synchronize on the first available operation. As you can see, it already has the equivalent of Go's select built in.

function new_op(impls)
   local op = { impls=impls }
   
   function op.perform()
      for _,impl in ipairs(impls) do
         local success, val = impl.try()
         if success then return impl.wrap(val) end
      end
      local function block(fiber)
         local suspension = new_suspension(fiber)
         for _,impl in ipairs(impls) do
            impl.block(suspension, impl.wrap)
         end
      end
      local wrap, val = suspend_current_fiber(block)
      return wrap(val)
   end

   function op.wrap(f)
      local wrapped = {}
      for _, impl in ipairs(impls) do
         local function wrap(val)
            return f(impl.wrap(val))
         end
         local impl = new_op_impl(impl.try, impl.block, wrap)
         table.insert(wrapped, impl)
      end
      return new_op(wrapped)
   end

   return op
end

There's only one thing missing there, which is new_suspension. When you go to suspend a fiber because none of the operations that it's trying to do can complete directly (i.e. all of the try functions of its impls returned false), at that point the corresponding block functions will publish the fact that the fiber is waiting. However the fiber only waits until the first operation is ready; subsequent operations becoming ready should be ignored. The suspension is the object that manages this state.

function new_suspension(fiber)
   local waiting = true
   local suspension = {}
   function suspension.waiting() return waiting end
   function suspension.complete(wrap, val)
      assert(waiting)
      waiting = false
      local function resume()
         resume_fiber(fiber, wrap, val)
      end
      schedule_task(resume)
   end
   return suspension
end

As you can see, the suspension's complete method is also the bit that actually arranges to resume a suspended fiber.

Finally, just to round out the implementation, here's a function implementing non-deterministic choice from among a number of sub-operations:

function choice(...)
   local impls = {}
   for _, op in ipairs({...}) do
      for _, impl in ipairs(op.impls) do
         table.insert(impls, impl)
      end
   end
   return new_op(impls)
end

on cml

OK, I'm sure this seems a bit abstract at this point. Let's implement something concrete in terms of these primitives: channels.

Channels expose two similar but different kinds of operations: put operations, which try to send a value, and get operations, which try to receive a value. If there's a sender already waiting to send when we go to perform a get_op, the operation continues directly, and we resume the sender; otherwise the receiver publishes its suspension to a queue. The put_op case is similar.

Finally we add some synchronous put and get convenience methods, in terms of their corresponding CML operations.

function new_channel()
   local ch = {}
   -- Queues of suspended fibers waiting to get or put values
   -- via this channel.
   local getq, putq = {}, {}

   local function default_wrap(val) return val end
   local function is_empty(q) return #q == 0 end
   local function peek_front(q) return q[1] end
   local function pop_front(q) return table.remove(q, 1) end
   local function push_back(q, x) q[#q+1] = x end

   -- Since a suspension could complete in multiple ways
   -- because of non-deterministic choice, it could be that
   -- suspensions on a channel's putq or getq are already
   -- completed.  This helper removes already-completed
   -- suspensions.
   local function remove_stale_entries(q)
      local i = 1
      while i <= #q do
         if q[i].suspension.waiting() then
            i = i + 1
         else
            table.remove(q, i)
         end
      end
   end

   -- Make an operation that if and when it completes will
   -- rendezvous with a receiver fiber to send VAL over the
   -- channel.  Result of performing operation is nil.
   function ch.put_op(val)
      local function try()
         remove_stale_entries(getq)
         if is_empty(getq) then
            return false, nil
         else
            local remote = pop_front(getq)
            remote.suspension.complete(remote.wrap, val)
            return true, nil
         end
      end
      local function block(suspension, wrap)
         remove_stale_entries(putq)
         push_back(putq, {suspension=suspension, wrap=wrap, val=val})
      end
      return new_op({new_op_impl(try, block, default_wrap)})
   end

   -- Make an operation that if and when it completes will
   -- rendezvous with a sender fiber to receive one value from
   -- the channel.  Result is the value received.
   function ch.get_op()
      local function try()
         remove_stale_entries(putq)
         if is_empty(putq) then
            return false, nil
         else
            local remote = pop_front(putq)
            remote.suspension.complete(remote.wrap)
            return true, remote.val
         end
      end
      local function block(suspension, wrap)
         remove_stale_entries(getq)
         push_back(getq, {suspension=suspension, wrap=wrap})
      end
      return new_op({new_op_impl(try, block, default_wrap)})
   end

   function ch.put(val) return ch.put_op(val).perform() end
   function ch.get()    return ch.get_op().perform()    end

   return ch
end

a wee example

You might be wondering what it's like to program with channels in Lua, so here's a little example that shows a prime sieve based on channels. It's not a great example of concurrency in that it's not an inherently concurrent problem, but it's cute to show computations in terms of infinite streams.

function prime_sieve(count)
   local function sieve(p, rx)
      local tx = new_channel()
      spawn_fiber(function ()
         while true do
            local n = rx.get()
            if n % p ~= 0 then tx.put(n) end
         end
      end)
      return tx
   end

   local function integers_from(n)
      local tx = new_channel()
      spawn_fiber(function ()
         while true do
            tx.put(n)
            n = n + 1
         end
      end)
      return tx
   end

   local function primes()
      local tx = new_channel()
      spawn_fiber(function ()
         local rx = integers_from(2)
         while true do
            local p = rx.get()
            tx.put(p)
            rx = sieve(p, rx)
         end
      end)
      return tx
   end

   local done = false
   spawn_fiber(function()
      local rx = primes()
      for i=1,count do print(rx.get()) end
      done = true
   end)

   while not done do run_tasks() end
end

Here you also see an example of running the scheduler in the last line.

where next?

Let's put this into perspective: in a couple hundred lines of code, we've gone from minimal Lua to a language with lightweight multitasking, extensible CML-based operations, and CSP-style channels; truly a delight.

There are a number of possible ways to extend this code. One of them is to implement true multithreading, if the language you are working in supports that. In that case there are some small protocol modifications to take into account; see the notes on the Guile CML implementation and especially the Manticore Parallel CML project.

The implementation above is pleasantly small, but it could be faster with the choice of more specialized data structures. I think interested readers probably see a number of opportunities there.

In a library, you might want to avoid the global task_queue and implement nested or multiple independent schedulers, and of course in a parallel situation you'll want core-local schedulers as well.

The implementation above has no notion of time. What we did in the Snabb implementation of fibers was to implement a timer wheel, inspired by Juho Snellman's Ratas, and then add that timer wheel as a task source to Snabb's scheduler. In Snabb, every time the equivalent of run_tasks() is called, a scheduler asks its sources to schedule additional tasks. The timer wheel implementation schedules expired timers. It's straightforward to build CML timeout operations in terms of timers.

Additionally, your system probably has other external sources of communication, such as sockets. The trick to integrating sockets into fibers is to suspend the current fiber whenever an operation on a file descriptor would block, and arrange to resume it when the operation can proceed. Here's the implementation in Snabb.

The only difficult bit with getting nice nonblocking socket support is that you need to be able to suspend the calling thread when you see the EWOULDBLOCK condition, and for coroutines that is often only possible if you implemented the buffered I/O yourself. In Snabb that's what we did: we implemented a compatible replacement for Lua's built-in streams, in Lua. That lets us handle EWOULDBLOCK conditions in a flexible manner. Integrating epoll as a task source also lets us sleep when there are no runnable tasks.

Likewise in the Snabb context, we are also working on a TCP implementation. In that case you want to structure TCP endpoints as fibers, and arrange to suspend and resume them as appropriate, while also allowing timeouts. I think the scheduler and CML patterns are going to allow us to do that without much trouble. (Of course, the TCP implementation will give us lots of trouble!)

Additionally your system might want to communicate with fibers from other threads. It's entirely possible to implement CML on top of pthreads, and it's entirely possible as well to support communication between pthreads and fibers. If this is interesting to you, see Guile's implementation.

When I talked about fibers in an earlier article, I built them in terms of delimited continuations. Delimited continuations are fun and more expressive than coroutines, but it turns out that for fibers, all you need is the expressive power of coroutines -- multi-shot continuations aren't useful. Also I think the presentation might be more straightforward. So if all your language has is coroutines, that's still good enough.

There are many more kinds of standard CML operations; implementing those is also another next step. In particular, I have found semaphores and condition variables to be quite useful. Also, standard CML supports "guards", invoked when an operation is performed, and "nacks", invoked when an operation is definitively not performed because a choice selected some other operation. These can be layered on top; see the Parallel CML paper for notes on "primitive CML".

Also, the choice operator above is left-biased: it will prefer earlier impls over later ones. You might want to not always start with the first impl in the list.

The scheduler shown above is the simplest thing I could come up with. You may want to experiment with other scheduling algorithms, e.g. capability-based scheduling, or kill-safe abstractions. Do it!

Or, it could be you already have a scheduler, like some kind of main loop that's already there. Cool, you can use it directly -- all that fibers needs is some way to schedule functions to run.

godspeed

In summary, I think Concurrent ML should be better-known. Its simplicity and expressivity make it a valuable part of any concurrent system. Already in Snabb it helped us solve some longstanding gnarly issues by making the right solutions expressible.

As Adam Solove says, Concurrent ML is great, but it has a branding problem. Its ideas haven't penetrated the industrial concurrent programming world to the extent that they should. This article is another attempt to try to get the word out. Thanks to Adam for the observation that CML is really a protocol; I'm sure the concepts could be made even more clear, but at least this is a step forward.

All the code in this article is up on a gitlab snippet along with instructions for running the example program from the command line. Give it a go, and happy hacking with CML!

Pfmatch, a packet filtering language embedded in Lua

2015-07-03T11:05:11Z

Greets, hackers! I just finished implementing a little embedded language in Lua and wanted to share it with you. First, a bit about the language, then some notes on how it works with Lua to reach the high performance targets of Snabb Switch.

the pfmatch language

Pfmatch is a language designed for filtering, classifying, and dispatching network packets in Lua. Pfmatch is built on the well-known pflang packet filtering language, using the fast pflua compiler for LuaJIT.

Here's an example of a simple pfmatch program that just divides up packets depending on whether they are TCP, UDP, or something else:

match {
   tcp => handle_tcp
   udp => handle_udp
   otherwise => handle_other
}

Unlike pflang filters written for such tools as tcpdump, a pfmatch program can dispatch packets to multiple handlers, potentially destructuring them along the way. In contrast, a pflang filter can only say "yes" or "no" on a packet.

Here's a more complicated example that passes all non-IP traffic, drops all IP traffic that is not going to or coming from certain IP addresses, and calls a handler on the rest of the traffic.

match {
   not ip => forward
   ip src 1.2.3.4 => incoming_ip
   ip dst 5.6.7.8 => outgoing_ip
   otherwise => drop
}

In the example above, the handlers after the arrows (forward, incoming_ip, outgoing_ip, and drop) are Lua functions. The part before the arrow (not ip and so on) is a pflang expression. If the pflang expression matches, its handler will be called with two arguments: the packet data and the length. For example, if the not ip pflang expression is true on the packet, the forward handler will be called.

It's also possible for the handler of an expression to be a sub-match:

match {
   not ip => forward
   ip src 1.2.3.4 => {
      tcp => incoming_tcp(&ip[0], &tcp[0])
      udp => incoming_udp(&ip[0], &ucp[0])
      otherwise => incoming_ip(&ip[0])
   }
   ip dst 5.6.7.8 => {
      tcp => outgoing_tcp(&ip[0], &tcp[0])
      udp => outgoing_udp(&ip[0], &ucp[0])
      otherwise => outgoing_ip(&ip[0])
   }
   otherwise => drop
}

As you can see, the handlers can also have additional arguments, beyond the implicit packet data and length. In the above example, if not ip doesn't match, then ip src 1.2.3.4 matches, then tcp matches, then the incoming_tcp function will be called with four arguments: the packet data as a uint8_t* pointer, its length in bytes, the offset of byte 0 of the IP header, and the offset of byte 0 of the TCP header. An argument to a handler can be any arithmetic expression of pflang; in this case &ip[0] is actually an extension. More on that later. For language lawyers, check the syntax and semantics over in our source repo.

Thanks especially to my colleague Katerina Barone-Adesi for long backs and forths about the language design; they really made it better. Fistbump!

pfmatch and lua

The challenge of designing pfmatch is to gain expressiveness, compared to writing filters by hand, while not endangering the performance targets of Pflua and Snabb Switch. These days Snabb is on target to give ASIC-driven network appliances a run for their money, so anything we come up with cannot sacrifice speed.

In practice what this means is compile, don't interpret. Using the pflua compiler allows us to generalize the good performance that we have gotten on pflang expressions to a multiple-dispatch scenario. It's a pretty straightword strategy. Naturally though, the interface with Lua is more complex now, so to understand the performance we should understand the interaction with Lua.

How does one make two languages interoperate, anyway? With pflang it's pretty clear: you compile pflang to a Lua function, and call the Lua function to match on packets. It returns true or false. It's a thin interface. Indeed with pflang and pflua you could just match the clauses in order:

not_ip = pf.compile('not ip')
incoming = pf.compile('ip src 1.2.3.4')
outgoing = pf.compile('ip dst 5.6.7.8')

function handle(packet, len)
   if not_ip(packet, len) then return forward(packet, len)
   elseif incoming(packet, len) then return incoming_ip(packet, len)
   elseif outgoing(packet, len) then return outgoing_ip(packet, len)
   else return drop(packet, len) end
end

But not only is this tedious, you don't get easy access to the packet itself, and you're missing out on opportunities for optimization. For example, if the packet fails the not_ip check, we don't need to check if it's an IP packet in the incoming check. Compiling a pfmatch program takes advantage of pflua's optimizer to produce good code for the match expression as a whole.

If this were Scheme I would make the right-hand side of an arrow be an expression and implement pfmatch as a macro; see Racket's match documentation for an example. In Lua or other languages that's harder to do; you would have to parse Lua, and it's not clear which parts of the production as a whole are the host language (Lua) and which are the embedded language (pfmatch).

Instead, I think embedding host language snippets by function name is a fine solution. It seems fairly clear that incoming_ip, for example, is some kind of function. It's easy to parse identifiers in an embedded language, both for humans and for programs, so that takes away a lot of implementation headache and cognitive overhead.

We are left with a few problems: how to map names to functions, what to do about the return value of match expressions, and how to tie it all together in the host language. Again, if this were Scheme then I'd use macros to embed expressions into the pfmatch term, and their names would be scoped into whatever environment the match term was defined. In Lua, the best way to implement a name/value mapping is with a table. So we have:

local handlers = {
   forward = function(data, len)
      ...
   end,
   drop = function(data, len)
      ...
   end,
   incoming_ip = function(data, len)
      ...
   end,
   outgoing_ip = function(data, len)
      ...
   end
}

Then we will pass the handlers table to the matcher function, and the matcher function will call the handlers by name. LuaJIT will mostly take care of the overhead of the table dispatch. We compile the filter like this:

local match = require('pf.match')

local dispatcher = match.compile([[match {
   not ip => forward
   ip src 1.2.3.4 => incoming_ip
   ip dst 5.6.7.8 => outgoing_ip
   otherwise => drop
}]])

To use it, you just invoke the dispatcher with the handlers, data, and length, and the return value is whatever the handler returns. Here let's assume it's a boolean.

function loop(self)
   local i, o = self.input.input, self.output.output
   while not link.empty() do
      local pkt = link.receive(i)
      if dispatcher(handlers, pkt.data, pkt.length) then
         link.transmit(o, pkt)
      end
   end
end

Finally, we're ready for an example of a compiled matcher function. Here's what pflua does with the match expression above:

local cast = require("ffi").cast
return function(self,P,length)
   if length < 14 then return self.forward(P, len) end
   if cast("uint16_t*", P+12)[0] ~= 8 then return self.forward(P, len) end
   if length < 34 then return self.drop(P, len) end
   if P[23] ~= 6 then return self.drop(P, len) end
   if cast("uint32_t*", P+26)[0] == 67305985 then return self.incoming_ip(P, len) end
   if cast("uint32_t*", P+30)[0] == 134678021 then return self.outgoing_ip(P, len) end
   return self.drop(P, len)
end

The result is a pretty good dispatcher. There are always things to improve, but it's likely that the function above is better than what you would write by hand, and it will continue to get better as pflua improves.

Getting back to what I mentioned earlier, when we write filtering code by hand, we inevitably end up writing interpreters for some kind of filtering language. Network functions are essentially linguistic in nature: static appliances are no good because network topologies change, and people want solutions that reflect their problems. Usually this means embedding an interpreter for some embedded language, for example BPF bytecode or iptables rules. Using pflua and pfmatch expressions, we can instead compile a filter suited directly for the problem at hand -- and while we're at it, we can forget about worrying about pesky offsets, constants, and bit-shifts.

challenges

I'm optimistic about pfmatch or something like it being a success, but there are some challenges too.

One challenge is that pflang is pretty weird. For example, attempting to access ip[100] will abort a filter immediately on a packet that is less than 100 bytes long, not including L2 encapsulation. It's wonky semantics, and in the context of pfmatch, aborting the entire pfmatch program would obviously be the wrong thing. That would abort too much. Instead it should probably just fail the pflang test in which that packet access appears. To this end, in pfmatch we turn those aborts into local expression match failures. However, this leads to an inconsistency with pflang. For example in (ip[100000] == 0 or (1==1)), instead of ip[100000] causing the whole pflang match to fail, it just causes the local test to fail. This leaves us with 1==1, which passes. We abort too little.

This inconsistency is probably a bug. We want people to be able to test clauses with vanilla pflang expressions, and have the result match the pfmatch behavior. Due to limitations in some of pflua's intermediate languages, it's likely to persist for a while. It is the only inconsistency that I know of, though.

Pflang is also underpowered in many ways. It has terrible IPv6 support; for example, tcp[0] only matches IPv4 packets, and at least as implemented in libpcap, most payload access on IPv6 packets does the wrong thing regarding chained extension headers. There is no facility in the language for binding names to intermediate results, there is no linguistic facility for talking about fragmentation, no ability to address IP source and destination addresses in arithmetic expressions by name, and so on. We can solve these in pflua with extensions to the language, but that introduces incompatibilities with pflang.

You might wonder why to stick with pflang, after all of this. If this is you, Juho Snellman wrote a great article on this topic, just for you: What's wrong with pcap filters.

Pflua's optimizer has mostly helped us, but there have been places where it could be more helpful. When compiling just one expression, you can often end up figuring out which branches are dead-ends, which helps the rest of the optimization to proceed. With more than one successful branch, we had to make a few improvements to the optimizer to actually get decent results. We also had to relax one restriction on the optimizer: usually we only permit transformations that make the code smaller. This way we know we're going in the right direction and will eventually terminate. However because of reasons™ we did decide to allow tail calls to be duplicated, so instead of having just one place in the match function that tail-calls a handler, you can end up with multiple calls. I suspect using a tracing compiler will largely make this moot, as control-flow splits effectively lead to trace duplication anyway, and making sure control-flow joins later doesn't effectively counter that. Still, I suspect that the resulting trace shape will rejoin only at the loop head, instead of in some intermediate point, which is probably OK.

future

With all of these concerns, is pfmatch still a win? Yes, probably! We're going to start using it when building Snabb apps, and will see how it goes. We'll probably end up adding a few more pflang extensions before we're done. If it's something you're in to, snabb-devel is the place to try it out, and see you on the bug tracker. Happy packet hacking!

high-performance packet filtering with pflua

2014-09-02T10:15:49Z

Greets! I'm delighted to be able to announce the release of Pflua, a high-performance packet filtering toolkit written in Lua.

Pflua implements the well-known libpcap packet filtering language, which we call pflang for short.

Unlike other packet filtering toolkits, which tend to use the libpcap library to compile pflang expressions bytecode to be run by the kernel, Pflua is a completely new implementation of pflang.

why lua?

At this point, regular readers are asking themselves why this Schemer is hacking on a Lua project. The truth is that I've always been looking for an excuse to play with the LuaJIT high-performance Lua implementation.

LuaJIT is a tracing compiler, which is different from other JIT systems I have worked on in the past. Among other characteristics, tracing compilers only emit machine code for branches that are taken at run-time. Tracing seems a particularly appropriate strategy for the packet filtering use case, as you end up with linear machine code that reflects the shape of actual network traffic. This has the potential to be much faster than anything static compilation techniques can produce.

The other reason for using Lua was because it was an excuse to hack with Luke Gorrie, who for the past couple years has been building the Snabb Switch network appliance toolkit, also written in Lua. A common deployment environment for Snabb is within the host virtual machine of a virtualized server, with Snabb having CPU affinity and complete control over a high-performance 10Gbit NIC, which it then routes to guest VMs. The administrator of such an environment might want to apply filters on the kinds of traffic passing into and out of the guests. To this end, we plan on integrating Pflua into Snabb so as to provide a pleasant, expressive, high-performance filtering facility.

Given its high performance, it is also reasonable to deploy Pflua on gateway routers and load-balancers, within virtualized networking appliances.

implementation

Pflua compiles pflang expressions to Lua source code, which are then optimized at run-time to native machine code.

There are actually two compilation pipelines in Pflua. The main one is fairly traditional. First, a custom parser produces a high-level AST of a pflang filter expression. This AST is lowered to a primitive AST, with a limited set of operators and ways in which they can be combined. This representation is then exhaustively optimized, folding constants and tests, inferring ranges of expressions and packet offset values, hoisting assertions that post-dominate success continuations, etc. Finally, we residualize Lua source code, performing common subexpression elimination as we go.

For example, if we compile the simple Pflang expression ip or ip6 with the default compilation pipeline, we get the Lua source code:

return function(P,length)
   if not (length >= 14) then return false end
   do
      local v1 = ffi.cast("uint16_t*", P+12)[0]
      if v1 == 8 then return true end
      do
         do return v1 == 56710 end
      end
   end
end

The other compilation pipeline starts with bytecode for the Berkeley packet filter VM. Pflua can load up the libpcap library and use it to compile a pflang expression to BPF. In any case, whether you start from raw BPF or from a pflang expression, the BPF is compiled directly to Lua source code, which LuaJIT can gnaw on as it pleases. Compiling ip or ip6 with this pipeline results in the following Lua code:

return function (P, length)
   local A = 0
   if 14 > length then return 0 end
   A = bit.bor(bit.lshift(P[12], 8), P[12+1])
   if (A==2048) then goto L2 end
   if not (A==34525) then goto L3 end
   ::L2::
   do return 65535 end
   ::L3::
   do return 0 end
   error("end of bpf")
end

We like the independence and optimization capabilities afforded by the native pflang pipeline. Pflua can hoist and eliminate bounds checks, whereas BPF is obligated to check that every packet access is valid. Also, Pflua can work on data in network byte order, whereas BPF must convert to host byte order. Both of these restrictions apply not only to Pflua's BPF pipeline, but also to all other implementations that use BPF (for example the interpreter in libpcap, as well as the JIT compilers in the BSD and Linux kernels).

However, though Pflua does a good job in implementing pflang, it is inevitable that there may be bugs or differences of implementation relative to what libpcap does. For that reason, the libpcap-to-bytecode pipeline can be a useful alternative in some cases.

performance

When Pflua hits the sweet spots of the LuaJIT compiler, performance screams.

(full image, analysis)

This synthetic benchmark runs over a packet capture of a ping flood between two machines and compares the following pflang implementations:

libpcap: The user-space BPF interpreter from libpcap
linux-bpf: The old Linux kernel-space BPF compiler from 2011. We have adapted this library to work as a loadable user-space module (source)
linux-ebpf: The new Linux kernel-space BPF compiler from 2014, also adapted to user-space (source)
bpf-lua: BPF bytecodes, cross-compiled to Lua by Pflua.
pflua: Pflang compiled directly to Lua by Pflua.

To benchmark a pflang implementation, we use the implementation to run a set of pflang expressions over saved packet captures. The result is a corresponding set of benchmark scores measured in millions of packets per second (MPPS). The first set of results is thrown away as a warmup. After warmup, the run is repeated 50 times within the same process to get multiple result sets. Each run checks to see that the filter matches the the expected number of packets, to verify that each implementation does the same thing, and also to ensure that the loop is not dead.

In all cases the same Lua program is used to drive the benchmark. We have tested a native C loop when driving libpcap and gotten similar results, so we consider that the LuaJIT interface to C is not a performance bottleneck. See the pflua-bench project for more on the benchmarking procedure and a more detailed analysis.

The graph above shows that Pflua can stream in packets from memory and run some simple pflang filters them at close to the memory bandwidth on this machine (100 Gbit/s). Because all of the filters are actually faster than the accept-all case, probably due to work causing prefetching, we actually don't know how fast the filters themselves can run. At any case, in this ideal situation, we're running at a handful of nanoseconds per packet. Good times!

(full image, analysis)

It's impossible to make real-world tests right now, especially since we're running over packet captures and not within a network switch. However, we can get more realistic. In the above test, we run a few filters over a packet capture from wingolog.org, which mostly operates as a web server. Here we see again that Pflua beats all of the competition. Oddly, the new Linux JIT appears to fare marginally worse than the old one. I don't know why that would be.

Sadly, though, the last tests aren't running at that amazing flat-out speed we were seeing before. I spent days figuring out why that is, and that's part of the subject of my last section here.

on lua, on luajit

I implement programming languages for a living. That doesn't mean I know everything there is to know about everything, or that everything I think I know is actually true -- in particular, I was quite ignorant about trace compilers, as I had never worked with one, and I hardly knew anything about Lua at all. With all of those caveats, here are some ignorant first impressions of Lua and LuaJIT.

LuaJIT has a ridiculously fast startup time. It also compiles really quickly: under a minute. Neither of these should be important but they feel important. Of course, LuaJIT is not written in Lua, so it doesn't have the bootstrap challenges that Guile has; but still, a fast compilation is refreshing.

LuaJIT's FFI is great. Five stars, would program again.

As a compilation target, Lua is OK. On the plus side, it has goto and efficient bit operations over 32-bit numbers. However, and this is a huge downer, the result range of bit operations is the signed int32 range, not the unsigned range. This means that bit.band(0xffffffff, x) might be negative. No one in the history of programming has ever wanted this. There are sensible meanings for negative results to bit operations, but only if an argument was negative. Grr. Otherwise, Lua shares the same concerns as other languages whose numbers are defined as 64-bit doubles.

Sometimes people get upset that Lua starts its indexes (in "arrays" or strings) with 1 instead of 0. It's foreign to me, so it's sometimes a challenge, but it can work as well as anything else. The problem comes in when working with the LuaJIT FFI, which starts indexes with 0, leading me to make errors as I forget which kind of object I am working on.

As a language to implement compilers, Lua desperately misses a pattern matching facility. Otherwise, a number of small gripes but no big ones; tables and closures abound, which leads to relatively terse code.

Finally, how well does trace compilation work for this task? I offer the following graph.

(full image, analysis)

Here the tests are paired. The first test of a pair, for example the leftmost portrange 0-6000, will match most packets. The second test of a pair, for example the second-from-the-left portrange 0-5, will reject all packets. The generated Lua code will be very similar, except for some constants being different. See portrange-0-6000.md for an example.

The Pflua performance of these filters is very different: the one that matches is slower than the one that doesn't, even though in most cases the non-matching filter will have to do more work. For example, a non-matching filter probably checks both src and dst ports, whereas a successful one might not need to check the dst.

It hurts to see Pflua's performance be less than the Linux JIT compilers, and even less than libpcap at times. I scratched my head for a long time about this. The Lua code is fine, and actually looks much like the BPF code. I had taken a look at the generated assembly code for previous traces and it looked fine -- some things that were not as good as they should be (e.g. a fair bit of conversions between integers and doubles, where these traces have no doubles), but things were OK. What changed?

Well. I captured the traces for portrange 0-6000 to a file, and dove in. Trace 66 contains the inner loop. It's interesting to see that there's a lot of dynamic checks in the beginning of the trace, although the loop itself is not bad (scroll down to see the word LOOP:), though with the double conversions I mentioned before.

It seems that trace 66 was captured for a packet whose src port was within range. Later, we end up compiling a second trace if the src port check fails: trace 67. The trace starts off with an absurd amount of loads and dynamic checks -- to a similar degree as trace 66, even though trace 66 dominates trace 67. It seems that there is a big penalty for transferring from one trace to another, even though they are both compiled.

Finally, once trace 67 is done -- and recall that all it has to do is check the destination port, and then update the counters from the inner loop) -- it jumps back to the top of trace 66 instead of the top of the loop, repeating all of the dynamic checks in trace 66! I can only think this is a current deficiency of LuaJIT, and not with trace compilation in general, although the amount of state transfer points to a lack of global analysis that you would get in a method JIT. I'm sure that values are being transferred that are actually dead.

This explains the good performance for the match-nothing cases: the first trace that gets compiled residualizes the loop expecting that all tests fail, and so only matching cases or variations incur the trace transfer-and-re-loop cost.

It could be that the Lua code that Pflua residualizes is in some way not idiomatic or not performant; tips in that regard are appreciated.

conclusion

I was going to pass some possible slogans by our marketing department, but we don't really have one, so I pass them on to you and you can tell me what you think:

"Pflua: A Totally Adequate Pflang Implementation"
"Pflua: Sometimes Amazing Performance!!!!1!!"
"Pflua: Organic Artisanal Network Packet Filtering"

Pflua was written by Igalians Diego Pino, Javier Muñoz, and myself for Snabb Gmbh, fine purveyors of high-performance networking solutions. If you are interested in getting Pflua in a Snabb context, we'd be happy to talk; drop a note to the snabb-devel forum. For Pflua in other contexts, file an issue or drop me a mail at wingo@igalia.com. Happy hackings with Pflua, the totally adequate pflang implementation!