wingolog

a new concurrent ml

29 June 2017 2:37 PM (guile | scheme | gnu | igalia | compilers | delimited continuations | prompts | fibers | concurrency | concurrent ml | curry on)

Good morning all!

In my last article I talked about how we composed a lightweight "fibers" facility in Guile out of lower-level primitives. What we implemented there is enough to be useful, but it is missing an important aspect of concurrency: communication. Sure, being able to spawn off fibers is nice, but you have to be able to actually talk to them.

Fibers had just gotten to the state described above about a year ago as I caught a train from Geneva to Rome for Curry On 2016. Train rides are magnificent for organizing thoughts, and I was in dire need of clarity. I had tentatively settled on Go-style channels by the time I got there, but when I saw that Matthias Felleisen and Matthew Flatt there, I had to take advantage of the opportunity to ask them what they thought. Would they recommend Racket-like threads and channels? Had that been a good experience?

The answer that I got in return was a "yes, that's what you should do", but also a "you should look at Concurrent ML". Concurrent ML? What's that? I looked and was a bit skeptical. It seemed old and hoary and maybe channels were just as expressive. I looked more deeply into this issue and it seemed CML is a bit more expressive than just channels but damn, it looked complicated to implement.

I was wrong. This article shows that what you need to do to implement multi-core CML is actually the same as what you need to do to implement channels in a multi-core environment. By building CML first and channels and whatever later, you get more power for the same amount of work.

Note that this article has an associated talk! If video is your thing, see my Curry On 2017 talk here:

Or, watch on the youtube if the inline video above doesn't work; slides here as well.

on channels

Let's first have a crack at implementing channels. Before we begin, we should be a bit more explicit about what a channel is. My first hack in this area did the wrong thing: I was used to asynchronous queues, and I thought that's what a channel was. Besides ignorance, apparently that's what Erlang does; a process's inbox is an unbounded queue of messages with only very slight back-pressure.

But an asynchronous queue is not a channel, at least in its classic sense. As they were originally formulated in "Communicating Sequential Processes" by Tony Hoare, adopted into David May's occam, and from there into many other languages, channels are meeting-places. Processes meet at a channel to exchange values; whichever party arrives first has to wait for the other party to show up. The message that is handed off in a channel send/receive operation is never "owned" by the channel; it is either owned by a sender who is waiting at the meeting point for a receiver, or it's accepted by a receiver. After the transaction is complete, both parties continue on.

You'd think this is a fine detail, but meeting-place channels are strictly more expressive than buffered channels. I was actually called out for this because my first implementation of channels for Fibers had effectively a minimum buffer size of 1. In Go, whose channels are unbuffered by default, you can use a channel for RPC:

package main

func double(ch chan int) {
  for { ch <- (<-ch * 2) }
}

func main() {
  ch := make(chan int)
  go double(ch)
  ch <- 2
  x := <-ch
  print(x)
}

Here you see that the main function sent a value on ch, then immediately read a response from the same channel. If the channel were buffered, then we'd probably read the value we sent instead of the doubled value supplied by the double goroutine. I say "probably" because it's not deterministic! Likewise the double routine could read its responses as its inputs.

Anyway, the channels we are looking to build are meeting-place channels. If you are interested in the broader design questions, you might enjoy the incomplete history of language facilities for concurrency article I wrote late last year.

With that prelude out of the way, here's a first draft at the implementation of the "receive" operation on a channel.

(define (recv ch)
  (match ch
    (($ $channel recvq sendq)
     (match (try-dequeue! sendq)
       (#(value resume-sender)
        (resume-sender)
        value)
       (#f
        (suspend
         (lambda (k)
           (define (resume val)
             (schedule (lambda () (k val)))
           (enqueue! recvq resume))))))))

;; Note: this code has a race!  Fixed later.

A channel is a record with two fields, its recvq and sendq. The receive queue (recvq) holds a FIFO queue of continuations that are waiting to receive values, and the send queue holds continuations that are waiting to send values, along with the value that they are sending. Both the recvq and the sendq are lockless queues.

To receive a value from a meeting-place channel, there are two possibilities: either there's a sender already there and waiting, or we have to wait for a sender. Those two cases are handled above, in that order. We use the suspend primitive from the last article to arrange for the fiber to wait; presumably the sender will resume us when they arrive later at the meeting-point.

an aside on lockless data structures

We'll go more deeply into the channel receive mechanics later, but first, a general question: what's the right way to implement a data structure that can be accessed and modified concurrently without locks? Though I am full of hubris, I don't have enough to answer this question definitively. I know many ways, but none that's optimal in all ways.

For what I needed in Fibers, I chose to err on the side of simplicity.

Some data in Fibers is never modified; this immutable data is safe to access concurrently from any code. This is the best, obviously :)

Some mutable data is only ever mutated from an "owner" core; it's safe to read without a lock from that owner core, and in Fibers we do not access this data from other cores. An example of this kind of data structure is the i/o map from file descriptors to continuations; it's core-local. I say "core-local" because in fibers we typically run one scheduler per core, with each core having a pinned POSIX thread; it's really thread-local but I don't want to use the word "thread" too much here as it's confusing.

Some mutable data needs to be read and written from many cores. An example of this is the recvq of a channel; many receivers and senders can be wanting to read and write there at once. The approach we take in Fibers is just to use immutable data stored inside an "atomic box". An atomic box holds a single value, and exposes operations to read, write, swap, and compare-and-swap (CAS) the value. To read a value, just fetch it from the box; you then have immutable data that you can analyze without locks. Having read a value, you can to compute a new state and use CAS on the atomic box to publish that change. If the CAS succeeds, then great; otherwise the state changed in the meantime, so you typically want to loop and try again.

Single-word CAS suffices for Guile when every value stored into an atomic box will be unique, a property that freshly-allocated objects have and of which GC ensures us an endless supply. Note that for this to work, the values can share structure internally but the outer reference has to be freshly allocated.

The combination of freshly-allocated data structures and atomic variables is a joy to use: no hassles about multi-word compare-and-swap or the ABA problem. Assuming your GC can keep up (Guile's does about 700 MB/s), it can be an effective strategy, and is certainly less error-prone than others.

back at the channel recv ranch

Now, the theme here is "growing a language": taking primitives and using them to compose more expressive abstractions. In that regard, sure, channel send and receive are nice, but what about select, which allows us to wait on any channel in a set of channels? How do we take what we have and built non-determinism on top?

I think we should begin by noting that select in Go for example isn't just about receiving messages. You can select on the first channel that can send, or between send and receive operations.

select {
case c <- x:
  x, y = y, x+y
case <-quit:
  return
}

As you can see, Go provides special syntax for select. Although in Guile we can of course provide macros, usually those macros expand out to a procedure call; the macro is sugar for a function. So we want select as a function. But because we need to be able to select over receiving and sending at the same time, the function needs to take some kind of annotation on what we are going to do with the channels:

(select (recv A) (send B v))

So what we do is to introduce the concept of an operation, which is simply data describing some event which may occur in the future. The arguments to select are now operations.

(select (recv-op A) (send-op B v))

Here recv-op is obviously a constructor for the channel-receive operation, and likewise for send-op. And actually, given that we've just made an abstraction over sending or receiving on a channel, we might as well make an abstraction over choosing the first available op among a set of operations. The implementation of select now creates such a choice-op, then performs it.

(define (select . ops)
  (perform (apply choice-op ops)))

But what we're missing here is the ability to know which operation actually happened. In Go, select's special syntax associates a clause of code with each sub-operation. In Scheme a clause of code is just a function, and so what we want to do is to be able to annotate an operation with a function that will get run if the operation succeeds.

So we define a (wrap-op op k), which makes an operation that itself annotates op, associating it with k. If op occurs, its result values will be passed to k. For example, if we make a fiber that tries to perform this operation:

(perform
 (wrap-op
  (recv-op A)
  (lambda (v)
    (string-append "hello, " v))))

If we send the string "world" on the channel A, then the result of this perform invocation will be "hello, world". Providing "wrapped" operations to select allows us to handle the various cases in separate, appropriate ways.

we just made concurrent ml

Hey, we just reinvented Concurrent ML! In his PLDI 1988 paper "Synchronous operations as first-class values", John Reppy proposes just this abstraction. I like to compare it to the relationship between an expression (exp) and wrapping that expression in a lambda ((lambda () exp)); evaluating an expression gives its value, and the expression just goes away, whereas evaluating a lambda gives a procedure that you can call in the future to evaluate the expression. You can call the lambda many times, or no times. In the same way, a channel-receive operation is an abstraction over receiving a value from a channel. You can perform that operation many times, once, or not at all.

Reppy consolidated this work in his PLDI 1991 paper, "CML: A higher-order concurrent language". Note that he uses the term "event" instead of "operation". To me the name "event" to apply to this abstraction never felt quite right; I guess I wrote too much code in the past against event loops. I see "events" as single instances in time and not an abstraction over the possibility of a, well, of an event. Indeed I wish I could refer to an instantiation of an operation as an event, but better not to muddy the waters. Likewise Reppy uses "synchronize" where I use "perform". As you like, really, it's still Concurrent ML; I just prefer to explain to my users using terms that make sense to me.

what's an op?

Let's return to that channel recv implementation. It had basically two parts: an optimistic part, where the operation could complete immediately, and a pessimistic part, where we had to wait for the other party to arrive. However, there was a race condition, as I noted in the comment. If a sender and a receiver concurrently arrive at a channel, it could be that they concurrently do the optimistic check, don't notice that the other is there, then they both suspend, waiting for each other to arrive: deadlock. To fix this for recv, we have to recheck the sendq after publishing our presence to the recvq.

I'll get to the details in a bit for channels, but it turns out that this is a general pattern. All kinds of ops have optimistic and pessimistic behavior.

(define (perform op)
  (match op
    (($ $op try block wrap)
     (define (do-op)
       ;; Return a thunk that has result values.
       (or optimistic
           pessimistic)))
     ;; Return values, passed through wrap function.
     ((compose wrap do-op)))))

In the optimistic phase, the calling fiber will try to commit the operation directly. If that succeeds, then the calling fiber resumes any other fibers that are part of the transaction, and the calling fiber continues. In the pessimistic phase, we park the calling fiber, publish the fact that we're ready and waiting for the operation, then to resolve the race condition we have to try again to complete the operation. In either case we pass the result(s) through the wrap function.

Given that the pessimistic phase has to include a re-check for operation completability, the optimistic phase is purely an optimization. It's a good optimization that everyone will want to implement, but it's not strictly necessary. It's OK for a try function to always return #f.

As shown in the above function, an operation is a plain old data structure with three fields: a try, a block, and a wrap function. The optimistic behavior is implemented by the try function; the pessimistic side is partly implemented by perform, which handles the fiber suspension part, and by the operation's block function. The wrap function implements the wrap-op behavior described above, and is applied to the result(s) of a successful operation.

Now, it was about at this point that I was thinking "jeebs, this CML thing is complicated". I was both wrong and right -- there's some complication inherent in multicore lockless communication, yes, but I believe CML captures something close to the minimum, and certainly it's just as much work as with a direct implementation of channels. In that spirit, I continue on with the implementation of channel operations in Fibers.

channel receive operation

Here's an implementation of a try function for a channel.

(define (try-recv ch)
  (match ch
    (($ $channel recvq sendq)
     (let ((q (atomic-ref sendq)))
       (match q
         (() #f)
         ((head . tail)
          (match head
            (#(val resume-sender state)
             (match (CAS! state 'W 'S)
               ('W
                (resume-sender (lambda () (values)))
                (CAS! sendq q tail) ; *
                (lambda () val))
               (_ #f))))))))))

In Fibers, a try function either succeeds and returns a thunk, or fails and returns #f. For channel receive, we only succeed if there is a sender already in the queue: the sender has arrived, suspended itself, and published its availability. The state variable is an atomic box that holds the operation state, which initially starts as W and when complete is S. More on that in a minute. If the CAS! compare-and-swap operation managed to change the state from W to S, then the optimistic phase suceeded -- yay! We resume the sender with no values, take the value that the sender gave us, and keep on trucking, returning that value wrapped in a thunk.

Additionally the sender's entry on the sendq is now stale, as the operation is already complete; we try to pop it off the queue at the line indicated with *, but that could fail due to concurrent queue modification. In that case, no biggie, someone else will do the collect our garbage for us.

The pessimistic case is a bit more involved. It's the last bit of code though; almost done here! I express the pessimistic phase as a function of the operation's block function.

(define (pessimistic block)
  ;; For consistency with optimistic phase, result of
  ;; pessimistic phase is a thunk that "perform" will
  ;; apply.
  (lambda ()
    ;; 1. Suspend the thread.  Expect to be resumed
    ;; with a thunk, which we arrange to invoke directly.
    ((suspend
       (lambda (k)
        (define (resume values-thunk)
          (schedule (lambda () (k values-thunk))))
        ;; 2. Make a fresh opstate.
        (define state (make-atomic-box 'W))
        ;; 3. Call op's block function.
        (block resume state))))))

So what about that state variable? Well basically, once we publish the fact that we're ready to perform an operation, fibers from other cores might concurrently try to complete our operation. We need for this perform invocation to complete at most once! So we introduce a state variable, the "opstate", held in an atomic box. It has three states:

  • W: "Waiting"; initial state

  • C: "Claimed"; temporary state

  • S: "Synched"; final state

There are four possible state transitions, of two kinds. Firstly there are the "local" transitions W->C, C->W, and C->S. These transitions may only ever occur as part of the "retry" phase a block function; notably, no remote fiber will cause these transitions on "our" state variable. Remote fibers can only make the W->S transition, committing an operation. The W->S transition can also be made locally of course.

Every time an operation is instantiated via the perform function, we make a new opstate. Operations themselves don't hold any state; only their instantiations do.

The need for the C state wasn't initially obvious to me, but after seeing the recv-op block function below, it will be clear to you I hope.

block functions

The block function itself has two jobs to do. Recall that it's called after the calling fiber was suspended, and is passed two arguments: a procedure that can be called to resume the fiber with some number of values, and the fresh opstate for this instantiation. The block function has two jobs: it needs to publish the resume function and the opstate to the channel's recvq, and then it needs to try again to receive. That's the "retry" phase I was mentioning before.

Retrying the recv can have three possible results:

  1. If the retry succeeds, we resume the sender. We also have to resume the calling fiber, as it has been suspended already. In general, whatever code manages to commit an operation has to resume any fibers that were waiting on it to complete.

  2. If the operation was already in the S state, that means some other party concurrently completed our operation on our behalf. In that case there's nothing to do; the other party resumed us already.

  3. Otherwise if the operation couldn't proceed, then when the other party or parties arrive, they will be responsible for completing the operation and ultimately resuming our fiber in the future.

With that long prelude out of the way, here's the gnarlies!

(define (block-recv ch resume-recv recv-state)
  (match ch
    (($ $channel recvq sendq)
     ;; Publish -- now others can resume us!
     (enqueue! recvq (vector resume-recv recv-state))
     ;; Try again to receive.
     (let retry ()
       (let ((q (atomic-ref sendq)))
         (match q
           ((head . tail)
            (match head
              (#(val resume-send send-state)
               (match (CAS! recv-state 'W 'C)   ; Claim txn.
                 ('W
                  (match (CAS! send-state 'W 'S)
                    ('W                         ; Case (1): yay!
                     (atomic-set! recv-state 'S)
                     (CAS! sendq q tail)        ; Maybe GC.
                     (resume-send (lambda () (values)))
                     (resume-recv (lambda () val)))
                    ('C                         ; Conflict; retry.
                     (atomic-set! recv-state 'W)
                     (retry))
                    ('S                         ; GC and retry.
                     (atomic-set! recv-state 'W)
                     (CAS! sendq q tail)
                     (retry))))
                 ('S #f)))))                    ; Case (2): cool!
           (() #f)))))))                        ; Case (3): we wait.

As we said, first we publish, then we retry. If there is a sender on the queue, we will try to complete their operation, but before we do that we have to prevent other fibers from completing ours; that's the purpose of going into the C state. If we manage to commit the sender's operation, then we commit ours too, going from C to S; otherwise we roll back to W. If the sender itself was in C then we had a conflict, and we spin to retry. We also try to GC off any completed operations from the sendq via unchecked CAS. If there's no sender on the queue, we just wait.

And that's it for the code! Thank you for suffering through this all. I only left off a few details: the try function can loop if sender is in the C state, and the block function needs to avoid a (choice-op (send-op A v) (recv-op A)) from sending v to itself. But because opstates are fresh allocations, we can know if a sender is actually ourself by comparing its opstate to ours (with eq?).

what about select?

I started about all this "op" business because I needed to annotate the arguments to select. Did I actually get anywhere? Good news, everyone: it turns out that select doesn't have to be a primitive!

Firstly, note that the choose-op try function just needs to run all try functions of sub-operations (possibly in random order), returning early if one succeeds. Pretty straightforward. And actually the story with the block function is the same: we just run the sub-operation block functions, knowing that the operation will commit at most one time. The only complication is plumbing through the respective wrap functions to all of the sub-operations, but of course that's the point of the wrap facility, so we pay the cost willingly.

(define (choice-op . ops)
  (define (try)
    (or-map
     (match-lambda
      (($ $op sub-try sub-block sub-wrap)
       (define thunk (sub-try))
       (and thunk (compose sub-wrap thunk))))
     ops))
  (define (block resume opstate)
    (for-each
     (match-lambda
      (($ $op sub-try sub-block sub-wrap)
       (define (wrapped-resume results-thunk)
         (resume (compose sub-wrap results-thunk)))
       (sub-block wrapped-resume opstate)))
     ops))
  (define wrap values)
  (make-op try block wrap))

There are optimizations possible, for example to randomize the order of visiting the sub-operations for more less deterministic behavior, but this is really all there is.

concurrent ml is inevitable

As far as I understand things, the protocol to implement CML-style operations on channels in a lock-free environment are exactly the same as what's needed if you wrote out the recv function by hand, without abstracting it to a recv-op.

You still need the ability to park a fiber in the block function, and you still need to retry the operation after parking. Although try is just an optimization, it's an optimization that you'll want.

So given that the cost of parallel CML is necessary, you might as well get what you pay for and have your language expose the more expressive CML interface in addition to the more "standard" channel operations.

concurrent ml between pthreads and fibers

One really cool aspect about implementing CML is that the bit that suspends the current thread is isolated in the perform function. Of course if you're in a fiber, you suspend the current fiber as we have described above. But what if you're not? What if you want to use CML to communicate between POSIX threads? You can do that, just create a mutex/cond pair and pass a procedure that will signal the cond as the resume argument to the block function. It just works! The channels implementation doesn't need to know anything about pthreads, or even fibers for that matter.

In fact, you can actually use CML operations to communicate between fibers and full pthreads. This can be really useful if you need to run some truly blocking operation in a side pthread, but you want most of your program to be in fibers.

a meta-note for a meta-language

This implementation was based on the Parallel CML paper from Reppy et al, describing the protocol implemented in Manticore. Since then there's been a lot of development there; you should check out Manticore! I also hear that Reppy has a new version of his "Concurrent Programming in ML" book coming out soon (not sure though).

This work is in Fibers, a concurrency facility for Guile Scheme, built as a library. Check out the manual for full details. Relative to the Parallel CML paper, this work has a couple differences beyond the superficial operation/perform event/sync name change.

Most significantly, Reppy's CML operations have three phases: poll, do, and block. Fibers uses just two, as in a concurrent context it doesn't make sense to check-then-do. There is no do, only try :)

Additionally the Fibers channel implementation is lockless, with an atomic sendq and recvq. In contrast, Manticore uses a spinlock and hence needs to mask/unmask interrupts at times.

On the other hand, the Parallel CML paper included some model checking work, which Fibers doesn't have. It would be nice to have some more confidence on correctness!

but what about perf

Performance! Does it scale? Let's poke it. Here I'm going to try to isolate my tests to measure the overhead of communication of channels as implemented in terms of Parallel CML ops. I have more real benchmarks for Fibers on a web server workload where it does well, but here I am really trying to focus on CML.

My test system is a 2 x E5-2620v3, which is two sockets each having 6 2.6GHz cores, hyperthreads off, performance governor on all cores. This is a system we use for Snabb testing, so the first core on each socket handles interrupts and all others are reserved; Linux won't schedule anything on them. When you run a fibers system, it will spawn a thread per available core, then set the thread's affinity to that core. In these tests, I'll give benchmarks progressively more cores and see how they do with the workload.

So this is a benchmark measuring total message sends per second on a chain of fibers communicating over channels. For 0 links, that means that there's just a sender and a receiver and no intermediate links. For 10 links, each message is relayed 10 times, for 11 total sends in the chain and 12 total fibers. For 0 links we expect pretty much no parallel speedup, and no slowdown, and that's what we see; but when we get to more links, we should expect more throughput. The fibers are allocated to cores at random (a randomized round-robin initial scheduling, then after that fibers have core affinity; though there is a limited work-stealing phase).

You would think that the 1-core case would be the same for all of them. Unfortunately it seems that currently there is a fixed cost for bouncing through epoll to pick up new I/O tasks, even though there are no I/O runnables in this test and the timeout is 0, so it will return immediately. It's definitely something to look into as it's a cost that all cores are paying.

Initially I expected a linear speedup but that's not what we're seeing. But then I thought about it and revised my expectations :) As we add more cores, we add more communication; we should see sublinear speedups as we have to do more cross-core wakeups and synchronizations. After all, we aren't measuring a nice parallelizable computational workload: we're measuring overhead.

On the other hand, the diminishing returns effect is pretty bad, and then we hit the NUMA cliff: as we cross from 6 to 7 cores, we start talking to the other CPU socket and everything goes to shit.

But here it's hard to isolate the test from three external factors, whose impact I don't understand: firstly, that Fibers itself has a significant wakeup cost for remote schedulers. I haven't measured contention on scheduler inboxes, but I suspect one issue is that when a remote scheduler has decided it has no runnables, it will sleep in epoll; and to wake it up we need to write on a socketpair. Guile can avoid that when there are lots of runnables and we see the remote scheduler isn't sleeping, but it's not perfect.

Secondly, Guile is a bytecode VM. I measured that Guile retires about 0.4 billion instructions per second per core on the test machine, whereas a 4 IPC native program will retire about 10 billion. There's overhead at various points, some of which will go away with native compilation in Guile but some might not for a while, given that Go (for example) has baked-in support for channels. So to what extent is it the protocol and to what extent the implementation overhead? I don't know.

Finally, and perhaps most importantly, we can't isolate this test from the garbage collector. Guile still uses the Boehm GC, which is just OK I think. It does have a nice parallel mark phase, but it uses POSIX signals to pause program threads instead of having those threads reach safepoints; and it's completely NUMA-unaware.

So, with all of those caveats mentioned, let's see a couple more graphs :) Firstly, similar to the previous one, here's total message send rate for N pairs of fibers that ping-pong their message back and forth. Core allocation was randomized round-robin.

My conclusion here is that when more fibers are runnable per scheduler turn, the overhead of the epoll phase is less.

Here's a test where there's one fiber producer, and N fibers competing to consume the messages sent. Ultimately we expect that the rate will be limited on the producer side, but there's still a nice speedup.

Next is a pretty weak-sauce benchmark where we're computing diagonal lengths on an N-dimensional cube; the squares of the dimensions happen in parallel fibers, then one fiber collects those lengths, sums and makes a square root.

The workload on that one is just very low, and the serial components become a bottleneck quickly. I think I need to rework that test case.

Finally, there's a false sieve of Erastothenes, in which every time we find a prime, we add another fiber onto the sieve chain that filters out multiples of that prime.

Even though the workload is really small, we still see speedups, which is somewhat satisfying. Still, on all of these, the NUMA cliff is something fierce.

For me what these benchmarks show is that there are still some bottlenecks to work on. We do OK in the handful-of-cores scenario, but the system as a whole doesn't really scale past that. On more real benchmarks with bigger workloads and proportionally much less communication, I get much more satisfactory results; but those tend to be I/O heavy anyway, so the bottleneck is elsewhere.

closing notes

There are other parts to CML events, namely guard functions and withNack functions. My understanding is that these are implementable in terms of this "primitive" CML as described here; that was a result of earlier work by Matthew Fluet. I haven't actually implemented these yet! A to-do item, truly.

There are other event types in CML systems of course! Besides being able to implement operations yourself, there are built-in condition variables (cvars), timeouts, thread join events, and so on. The Fibers manual mentions some of these, but it's an open set.

Finally and perhaps most significantly, Aaron Turon did some work a few years ago on "Reagents", a pattern library for composing parallel and concurrent operations, initially in Scala. It's claimed that Reagents generalizes CML. Is this the case? I am looking forward to finding out.

OK, that's it for this verrrrry long post :) I hope that you found that this made parallel CML seem a bit more approachable and interesting, whether as a language implementor, a library implementor, or a user. Comments and corrections welcome. Check out Fibers and give it a go!

growing fibers

27 June 2017 10:17 AM (guile | scheme | gnu | igalia | compilers | delimited continuations | prompts | fibers | concurrency)

Good day, Schemers!

Over the last 12 to 18 months, as we were preparing for the Guile 2.2 release, I was growing increasingly dissatisfied at not having a good concurrency story in Guile.

I wanted to be able to spawn a million threads on a core, to support highly-concurrent I/O servers, and Guile's POSIX threads are just not the answer. I needed something different, and this article is about the search for and the implementation of that thing.

on pthreads

It's worth being specific why POSIX threads are not a great abstraction. One is that they don't compose: two pieces of code that use mutexes won't necessarily compose together. A correct component A that takes locks might call a correct component B that takes locks, and the other way around, and if both happen concurrently you get the classic deadly-embrace deadlock.

POSIX threads are also terribly low-level. Asking someone to build a system with mutexes and cond vars is like building a house with exploding toothpicks.

I want to program network services in a straightforward way, and POSIX threads don't help me here either. I'd like to spawn a million "threads" (scare-quotes!), one for each client, each one just just looping reading a request, computing and writing the response, and so on. POSIX threads aren't the concrete implementation of this abstraction though, as in most systems you can't have more than a few thousand of them.

Finally as a Guile maintainer I have a duty to tell people the good ways to make their programs, but I can't in good conscience recommend POSIX threads to anyone. If someone is a responsible programmer, then yes we can discuss details of POSIX threads. But for a new Schemer? Never. Recommending POSIX threads is malpractice.

on scheme

In Scheme we claim to be minimalists. Whether we actually are that or not is another story, but it's true that we have a culture of trying to grow expressive systems from minimal primitives.

It's sometimes claimed that in Scheme, we don't need threads because we have call-with-current-continuation, an ultrapowerful primitive that lets us implement any kind of control structure we want. (The name screams for an abbreviation, so the alias call/cc is blessed; minimalism is whatever we say it is, right?) Unfortunately it turned out that while call/cc can implement any control abstraction, it can't implement any two. Abstractions built on call/cc don't compose!

Fortunately, there is a way to build powerful control abstractions that do compose. This article covers the first half of composing a concurrency facility out of a set of more basic primitives.

Just to be concrete, I have to start with a simple implementation of an event loop. We're going to build on it later, but for now, here we go:

(define (run sched)
  (match sched
    (($ $sched inbox i/o)
     (define (dequeue-tasks)
       (append (dequeue-all! inbox)
               (poll-for-tasks i/o)))
     (let lp ()
       (for-each (lambda (task) (task))
                 (dequeue-tasks))
       (lp)))))

This is a scheduler that is a record with two fields, inbox and i/o.

The inbox holds a queue of pending tasks, as thunks (procedure of no arguments). When something wants to enqueue a task, it posts a thunk to the inbox.

On the other hand, when a task needs to wait in some external input or output being available, it will register an event with i/o. Typically i/o will be a simple combination of an epollfd and a mapping of tasks to enqueue when a file descriptor becomes readable or writable. poll-for-tasks does the underlying epoll_wait call that pulls new I/O events from the kernel.

There are some details I'm leaving out, like when to have epoll_wait return directly, and when to have it wait for some time, and how to wake it up if it's sleeping while a task is posted to the scheduler's inbox, but ultimately this is the core of an event loop.

a long digression

Now you might think that I'm getting a little far afield from what my goal was, which was threads or fibers or something. But that's OK, let's go a little farther and talk about "prompts". The term "prompt" comes from the experience you get when you work on the command-line:

/home/wingo% ./prog

I don't know about you all, but I have the feeling that the /home/wingo% has a kind of solid reality, that my screen is not just an array of characters but there is a left-hand-side that belongs to the system, and a right-hand-side that's mine. The two parts are delimited by a prompt. Well prompts in Scheme allow you to provide this abstraction within your program: you can establish a program part that's a "system" facility, for whatever definition of "system" suits your purposes, and a part that's for the "user".

In a way, prompts generalize a pattern of system/user division that has special facilities in other programming languages, such as a try/catch block.

try {
  foo();
} catch (e) {
  bar();
}

Here again I put the "user" code in italics. Some other examples of control flow patterns that prompts generalize would be early exit of a subcomputation, coroutines, and nondeterminitic choice like SICP's amb operator. Coroutines is obviously where I'm headed here in the context of this article, but still there are some details to go over.

To make a prompt in Guile, you can use the % operator, which is pronounced "prompt":

(use-modules (ice-9 control))

(% expr
   (lambda (k . args) #f))

The name for this operator comes from Dorai Sitaram's 1993 paper, Handling Control; it's actually a pun on the tcsh prompt, if you must know. Anyway the basic idea in this example is that we run expr, but if it aborts we run the lambda handler instead, which just returns #f.

Really % is just syntactic sugar for call-with-prompt though. The previous example desugars to something like this:

(let ((tag (make-prompt-tag)))
  (call-with-prompt tag
    ;; Body:
    (lambda () expr)
    ;; Escape handler:
    (lambda (k . args) #f)))

(It's not quite the same; % uses a "default prompt tag". This is just a detail though.)

You see here that call-with-prompt is really the primitive. It will call the body thunk, but if an abort occurs within the body to the given prompt tag, then the body aborts and the handler is run instead.

So if you want to define a primitive that runs a function but allows early exit, we can do that:

(define-module (my-module)
  #:export (with-return))

(define-syntax-rule (with-return return body ...)
  (let ((t (make-prompt-tag)))
    (define (return . args)
      (apply abort-to-prompt t args))
    (call-with-prompt t
      (lambda () body ...)
      (lambda (k . rvals)
        (apply values rvals)))))

Here we define a module with a little with-return macro. We can use it like this:

(use-modules (my-module))

(with-return return
  (+ 3 (return 42)))
;; => 42

As you can see, calling return within the body will abort the computation and cause the with-return expression to evaluate to the arguments passed to return.

But what's up with the handler? Let's look again at the form of the call-with-prompt invocations we've been making.

(let ((tag (make-prompt-tag)))
  (call-with-prompt tag
    (lambda () ...)
    (lambda (k . args) ...)))

With the with-return macro, the handler took a first k argument, threw it away, and returned the remaining values. But the first argument to the handler is pretty cool: it is the continuation of the computation that was aborted, delimited by the prompt: meaning, it's the part of the computation between the abort-to-prompt and the call-with-prompt, packaged as a function that you can call.

If you call the k, the delimited continuation, you reinstate it:

(define (f)
  (define tag (make-prompt-tag))
  (call-with-prompt tag
   (lambda ()
     (+ 3
        (abort-to-prompt tag)))
   (lambda (k) k)))

(let ((k (f)))
  (k 1))
;; =& 4

Here, the abort-to-prompt invocation behaved simply like a "suspend" operation, returning the suspended computation k. Calling that continuation resumes it, supplying the value 1 to the saved continuation (+ 3 []), resulting in 4.

Basically, when a delimited continuation suspends, the first argument to the handler is a function that can resume the continuation.

tasks to fibers

And with that, we just built coroutines in terms of delimited continuations. We can turn our scheduler inside-out, giving the illusion that each task runs in its own isolated fiber.

(define tag (make-prompt-tag))

(define (call/susp thunk)
  (define (handler k on-suspend) (on-suspend k))
  (call-with-prompt tag thunk handler))

(define (suspend on-suspend)
  (abort-to-prompt tag on-suspend))

(define (schedule thunk)
  (match (current-scheduler)
    (($ $sched inbox i/o)
     (enqueue! inbox (lambda () (call/susp thunk))))))

So! Here we have a system that can run a thunk in a scheduler. Fine. No big deal. But if the thunk calls suspend, then it causes an abort back to a prompt. suspend takes a procedure as an argument, the on-suspend procedure, which will be called with one argument: the suspended continuation of the thunk. We've layered coroutines on top of the event loop.

Guile's virtual machine is a normal register virtual machine with a stack composed of function frames. It's not necessary to do full CPS conversion to implement delimited control, but if you don't, then your virtual machine needs primitive support for call-with-prompt, as Guile's VM does. In Guile then, a suspended continuation is an object composed of the slice of the stack captured between the prompt and the abort, and also the slice of the dynamic stack. (Guile keeps a parallel stack for dynamic bindings. Perhaps we should unify these; dunno.) This object is wrapped in a little procedure that uses VM primitives to push those stack frames back on, and continue.

I say all this just to give you a mental idea of what it costs to suspend a fiber. It will allocate storage proportional to the stack depth between the prompt and the abort. Usually this is a few dozen words, if there are 5 or 10 frames on the stack in the fiber.

We've gone from prompts to coroutines, and from here to fibers there's just a little farther to go. First, note that spawning a new fiber is simply scheduling a thunk:

(define (spawn-fiber thunk)
  (schedule thunk))

Many threading libraries provide a "yield" primitive, which simply suspends the current thread, allowing others to run. We can do this for fibers directly:

(define (yield)
  (suspend schedule))

Note that the on-suspend procedure here is just schedule, which re-schedules the continuation (but presumably at the back of the queue).

Similarly if we are reading on a non-blocking file descriptor and detect that we need more input before we can continue, but none is available, we can suspend and arrange for the epollfd to resume us later:

(define (wait-for-readable fd)
  (suspend
   (lambda (k)
     (match (current-scheduler)
       (($ $sched inbox i/o)
        (add-read-fd! i/o fd
                      (lambda () (schedule k))))))))

In Guile you can arrange to install this function as the "current read waiter", causing it to run whenever a port would block. The details are a little gnarly currently; see the Non-blocking I/O manual page for more.

Anyway the cool thing is that I can run any thunk within a spawn-fiber, without modification, and it will run as if in a new thread of some sort.

solid abstractions?

I admit that although I am very happy with Emacs, I never really took to using the shell from within Emacs. I always have a terminal open with a bunch of tabs. I think the reason for that is that I never quite understood why I could move the cursor over the bash prompt, or into previous expressions or results; it seemed like I was waking up groggily from some kind of dream where nothing was real. I like the terminal, where the only bit that's "mine" is the current command. All the rest is immutable text in the scrollback.

Similarly when you make a UI, you want to design things so that people perceive the screen as being composed of buttons and so on, not just lines. In essence you trick the user, a willing user who is ready to be tricked, into seeing buttons and text and not just weird pixels.

In the same way, with fibers we want to provide the illusion that fibers actually exist. To solidify this illusion, we're still missing a few elements.

One point relates to error handling. As it is, if an error happens in a fiber and the fiber doesn't handle it, the exception propagates out of the fiber, through the scheduler, and might cause the whole program to error out. So we need to wrap fibers in a catch-all.

(define (spawn-fiber thunk)
  (schedule
   (lambda ()
     (catch #t thunk
       (lambda (key . args)
         (print-exception (current-error-port) #f key args))))))

Well, OK. Exceptions won't propagate out of fibers, yay. In fact in Guile we add another catch inside the print-exception, in case the print-exception throws an exception... Anyway. Cool.

Another point relates to fiber-local variables. In an operating system, each process has a number of variables that are local to it, notably in UNIX we have the umask, the current effective user, the current directory, the open files and what file descriptors they are associated with, and so on. In Scheme we have similar facilities in the form of parameters.

Now the usual way that parameters are used is to bind a new value within the extent of some call:

(define (with-output-to-string thunk)
  (let ((p (open-output-string)))
    (parameterize ((current-output-port p))
      (thunk))
    (get-output-string p)))

Here the parameterize invocation established p as the current output port during the call to thunk. Parameters already compose quite well with prompts; Guile, like Racket, implements the protocol described by Kiselyov, Shan, and Sabry in their Delimited Dynamic Binding paper (well worth a read!).

The one missing piece is that parameters in Scheme are mutable (by default). Normally if you call (current-input-port), you just get the current value of the current input port parameter. But if you pass an argument, like (current-input-port p), then you actually set the current input port to that new value. This value will be in place until we leave some parameterize invocation that parameterizes the current input port.

The problem here is that it could be that there's an interesting parameter which some piece of Scheme code will want to just mutate, so that all further Scheme code will use the new value. This is fine if you have no concurrency: there's just one thing running. But when you have many fibers, you want to avoid mutations in one fiber from affecting others. You want some isolation with regards to parameters. In Guile, we do this with the with-dynamic-state facility, which isolates changes to the dynamic state (parameters and so on) within the extent of the with-dynamic-state call.

(define (spawn-fiber thunk)
  (let ((state (current-dynamic-state)))
    (schedule
     (lambda ()
       (catch #t
         (lambda ()
           (with-dynamic-state state thunk))
         (lambda (key . args)
           (print-exception (current-error-port) #f key args))))))

Interestingly, with-dynamic-state solves another problem as well. You would like for newly spawned fibers to inherit the parameters from the point at which they were spawned.

(parameterize ((current-output-port p))
  (spawn-fiber
   ;; New fiber should inherit current-output-port
   ;; binding as "p"
   (lambda () ...)))

Capturing the (current-dynamic-state) outside the thunk does this for us.

When I made this change in Guile, making sure that with-dynamic-state did not impose a continuation barrier, I ran into a problem. In Guile we implemented exceptions in terms of delimited continuations and dynamic binding. The current stack of exception handlers was a list, and each element included the exceptions handled by that handler, and what prompt to which to abort before running the exception handler. See where the problem is? If we ship this exception handler stack over to a new fiber, then an exception propagating out of the new fiber would be looking up handlers from another fiber, for prompts that probably aren't even on the stack any more.

The problem here is that if you store a heap-allocated stack of current exception handlers in a dynamic variable, and that dynamic variable is captured somehow (say, by a delimited continuation), then you capture the whole stack of handlers, not (in the case of delimited continuations) the delimited set of handlers that were active within the prompt. To fix this, we had to change Guile's exceptions to instead make catch just rebind the exception handler parameter to hold the handler installed by the catch. If Guile needs to walk the chain of exception handlers, we introduced a new primitive fluid-ref* to do so, building the chain from the current stack of parameterizations instead of some representation of that stack on the heap. It's O(n), but life is that way sometimes. This way also, delimited continuations capture the right set of exception handlers.

Finally, Guile also supports asynchronous interrupts. We can arrange to interrupt a Guile process (or POSIX thread) every so often, as measured in wall-clock or process time. It used to be that interrupt handlers caused a continuation barrier, but this is no longer the case, so now we can add pre-emption to a fibers using interrupts.

summary and reflections

In Guile we were able to create a solid-seeming abstraction for fibers by composing other basic building blocks from the Scheme toolkit. Guile users can take an abstraction that's implemented in terms of an event loop (any event loop) and layer fibers on top in a way that feels "real". We were able to do this because we have prompts (delimited continuation) and parameters (dynamic binding), and we were able to compose the two. Actually getting it all to work required fixing a few bugs.

In Fibers, we just use delimited continuations to implement coroutines, and then our fibers are coroutines. If we had coroutines as a primitive, that would work just as well. As it is, each suspension of a fiber will allocate a new continuation. Perhaps this is unimportant, given the average continuation size, but it would be comforting in a way to be able to re-use the allocation from the previous suspension (if any). Other languages with coroutine primitives might have an advantage here, though delimited dynamic binding is still relatively uncommon.

Another point is that because we use prompts to suspend fiberss, we effectively are always unwinding and rewinding the dynamic state. In practice this should be transparent to the user and the implementor should make this transparent from a performance perspective, with the exception of dynamic-wind. Basically any fiber suspension will be run the "out" guard of any enclosing dynamic-wind, and resumption will run the "in" guard. In practice we find that we defer "finalization" issues to with-throw-handler / catch, which unlike dynamic-wind don't run on every entry or exit of a dynamic extent and rather just run on exceptional exits. We will see over time if this situation is acceptable. It's certainly another nail in the coffin of dynamic-wind though.

This article started with pthreads malaise, and although we've solved the problem of having a million fibers, we haven't solved the communications problem. How should fibers communicate with each other? This is the topic for my next article. Until then, happy hacking :)

an early look at p4 for software networking

26 June 2017 2:00 PM (igalia | networking | p4 | compilers | snabb | vpp)

Happy midsummer, hackfriends!

As you know at work we have been trying to find ways to apply compilers technology to the networking space. We will compile high-level configurations into low-level network processing graphs, search algorithms into lookup routines optimized for the target data structures, packet filters into code ready to be further trace-compiled, or hash functions into parallel AVX2 code.

On one side, we try to provide fast implementations of existing "languages"; on the other side we can't help but try out new co-designed domain-specific languages that can be expressive and run fast. As an example, with pfmatch we extended pflang, the tcpdump language, with a more match-action kind of semantics. It worked fine but the embedding between pfmatch and the host language could have been more smooth; in the end the abstractions it offers don't really apply to what we have needed to build. For a long time we have been wondering if indeed there is a better domain-specific programming language to apply to the networking domain.

P4 claims to be this language, and I think it's worth a look. P4's goal is be able to define switches and other networking equipment in software, with the specific goal that they would like to be able for P4 programs to be synthesized to ASICs, or installed in the FPGA of a "Smart NIC", or compiled to CPUs. It's a wide target domain and the silicon-bakery side of things definitely constrains what is possible. Indeed P4 explicitly disclaims any ambition to be a general-purpose programming language. Still, I think they manage to achieve an admirable balance between declarative programming and transparent low-level compilability.

The best, most current intro to P4 out there is probably Vladimir Gurevich's slides from last month's P4 "developer day" in California. I think it does a good job linking the language's syntax and semantics with how they are intended to be applied to the target domain. For a more PL-friendly and abstract introduction, the P416 specification is a true delight.

Like I said, at work we build software switches and other network functions, and our target is commodity hardware. We write most of our work in Snabb, a powerful network toolkit built on LuaJIT, though we are branching out now to VPP/fd.io as well, just to broaden the offering a bit. Generally we try to build solutions that don't have any dependencies other than a commodity Xeon server and a commodity NIC like Intel's 82599. So how could P4 help us in what we're doing?

My first thought in this regard was that if there is a library of P4 building blocks out there, that it would be really convenient to be able to incorporate a functional block written in P4 within the graph of a Snabb program. For example, if we have an IPFIX collector written in Snabb (and we do!), it would be cool to stick that in the middle of a P4 traffic conditioner.

(Immediately I run into the problem that I am straining my mind to think of a network function that we wouldn't rather just write in Snabb -- something valuable enough that we wouldn't want to "own" it and instead we would import someone else's black box into our data-plane. Maybe this interesting in-network key-value cache counts? But I digress, let's assume that something exists here.)

One question is, why bother doing P4 in software? I can understand that if you have 1Tbps ports that you definitely need custom silicon pushing your packets around. You would like to be able to program that silicon, so P4 looks to be a compelling step forward. But if your needs are satisfied with 40Gbps ports and you have chosen a software networking solution for its low cost, low lock-in, high flexibility, and sufficient performance -- well does P4 buy you something?

Right now it would seem that the answer is "no". A Cisco group wrote a custom P4 compiler to VPP, which is architecturally pretty much the same as Snabb, and they had to do some work to get the performance within a couple percent of the hand-coded application. The only win I can see is if people start building up libraries of reusable P4 components that can be linked together -- but the language itself currently doesn't support any more global composition primitive than #include (yes, it uses CPP :).

Additionally, at least as things are now, it doesn't seem that there's a library of reusable, open source P4 components out there to take advantage of. If this changes, I'll have to have another look. And of course it's worth keeping an eye on what kinds of cool things people are building :)

Thanks to Luke Gorrie for conversations leading to this blog post. All opinions and errors mine, of course!

guile 2.2 omg!!!

15 March 2017 10:56 PM (guile | scheme | gnu | igalia | compilers | academia | relief)

Oh, good evening my hackfriends! I am just chuffed to share a thing with yall: tomorrow we release Guile 2.2.0. Yaaaay!

I know in these days of version number inflation that this seems like a very incremental, point-release kind of a thing, but it's a big deal to me. This is a project I have been working on since soon after the release of Guile 2.0 some 6 years ago. It wasn't always clear that this project would work, but now it's here, going into production.

In that time I have worked on JavaScriptCore and V8 and SpiderMonkey and so I got a feel for what a state-of-the-art programming language implementation looks like. Also in that time I ate and breathed optimizing compilers, and really hit the wall until finally paging in what Fluet and Weeks were saying so many years ago about continuation-passing style and scope, and eventually came through with a solution that was still CPS: CPS soup. At this point Guile's "middle-end" is, I think, totally respectable. The backend targets a quite good virtual machine.

The virtual machine is still a bytecode interpreter for now; native code is a next step. Oddly my journey here has been precisely opposite, in a way, to An incremental approach to compiler construction; incremental, yes, but starting from the other end. But I am very happy with where things are. Guile remains very portable, bootstrappable from C, and the compiler is in a good shape to take us the rest of the way to register allocation and native code generation, and performance is pretty ok, even better than some natively-compiled Schemes.

For a "scripting" language (what does that mean?), I also think that Guile is breaking nice ground by using ELF as its object file format. Very cute. As this seems to be a "Andy mentions things he's proud of" segment, I was also pleased with how we were able to completely remove the stack size restriction.

high fives all around

As is often the case with these things, I got the idea for removing the stack limit after talking with Sam Tobin-Hochstadt from Racket and the PLT group. I admire Racket and its makers very much and look forward to stealing fromworking with them in the future.

Of course the ideas for the contification and closure optimization passes are in debt to Matthew Fluet and Stephen Weeks for the former, and Andy Keep and Kent Dybvig for the the latter. The intmap/intset representation of CPS soup itself is highly endebted to the late Phil Bagwell, to Rich Hickey, and to Clojure folk; persistent data structures were an amazing revelation to me.

Guile's virtual machine itself was initially heavily inspired by JavaScriptCore's VM. Thanks to WebKit folks for writing so much about the early days of Squirrelfish! As far as the actual optimizations in the compiler itself, I was inspired a lot by V8's Crankshaft in a weird way -- it was my first touch with fixed-point flow analysis. As most of yall know, I didn't study CS, for better and for worse; for worse, because I didn't know a lot of this stuff, and for better, as I had the joy of learning it as I needed it. Since starting with flow analysis, Carl Offner's Notes on graph algorithms used in optimizing compilers was invaluable. I still open it up from time to time.

While I'm high-fiving, large ups to two amazing support teams: firstly to my colleagues at Igalia for supporting me on this. Almost the whole time I've been at Igalia, I've been working on this, for about a day or two a week. Sometimes at work we get to take advantage of a Guile thing, but Igalia's Guile investment mainly pays out in the sense of keeping me happy, keeping me up to date with language implementation techniques, and attracting talent. At work we have a lot of language implementation people, in JS engines obviously but also in other niches like the networking group, and it helps to be able to transfer hackers from Scheme to these domains.

I put in my own time too, of course; but my time isn't really my own either. My wife Kate has been really supportive and understanding of my not-infrequent impulses to just nerd out and hack a thing. She probably won't read this (though maybe?), but it's important to acknowledge that many of us hackers are only able to do our work because of the support that we get from our families.

a digression on the nature of seeking and knowledge

I am jealous of my colleagues in academia sometimes; of course it must be this way, that we are jealous of each other. Greener grass and all that. But when you go through a doctoral program, you know that you push the boundaries of human knowledge. You know because you are acutely aware of the state of recorded knowledge in your field, and you know that your work expands that record. If you stay in academia, you use your honed skills to continue chipping away at the unknown. The papers that this process reifies have a huge impact on the flow of knowledge in the world. As just one example, I've read all of Dybvig's papers, with delight and pleasure and avarice and jealousy, and learned loads from them. (Incidentally, I am given to understand that all of these are proper academic reactions :)

But in my work on Guile I don't actually know that I've expanded knowledge in any way. I don't actually know that anything I did is new and suspect that nothing is. Maybe CPS soup? There have been some similar publications in the last couple years but you never know. Maybe some of the multicore Concurrent ML stuff I haven't written about yet. Really not sure. I am starting to see papers these days that are similar to what I do and I have the feeling that they have a bit more impact than my work because of their medium, and I wonder if I could be putting my work in a more useful form, or orienting it in a more newness-oriented way.

I also don't know how important new knowledge is. Simply being able to practice language implementation at a state-of-the-art level is a valuable skill in itself, and releasing a quality, stable free-software language implementation is valuable to the world. So it's not like I'm negative on where I'm at, but I do feel wonderful talking with folks at academic conferences and wonder how to pull some more of that into my life.

In the meantime, I feel like (my part of) Guile 2.2 is my master work in a way -- a savepoint in my hack career. It's fine work; see A Virtual Machine for Guile and Continuation-Passing Style for some high level documentation, or many of these bloggies for the nitties and the gritties. OKitties!

getting the goods

It's been a joy over the last two or three years to see the growth of Guix, a packaging system written in Guile and inspired by GNU stow and Nix. The laptop I'm writing this on runs GuixSD, and Guix is up to some 5000 packages at this point.

I've always wondered what the right solution for packaging Guile and Guile modules was. At one point I thought that we would have a Guile-specific packaging system, but one with stow-like characteristics. We had problems with C extensions though: how do you build one? Where do you get the compilers? Where do you get the libraries?

Guix solves this in a comprehensive way. From the four or five bootstrap binaries, Guix can download and build the world from source, for any of its supported architectures. The result is a farm of weirdly-named files in /gnu/store, but the transitive closure of a store item works on any distribution of that architecture.

This state of affairs was clear from the Guix binary installation instructions that just have you extract a tarball over your current distro, regardless of what's there. The process of building this weird tarball was always a bit ad-hoc though, geared to Guix's installation needs.

It turns out that we can use the same strategy to distribute reproducible binaries for any package that Guix includes. So if you download this tarball, and extract it as root in /, then it will extract some paths in /gnu/store and also add a /opt/guile-2.2.0. Run Guile as /opt/guile-2.2.0/bin/guile and you have Guile 2.2, before any of your friends! That pack was made using guix pack -C lzip -S /opt/guile-2.2.0=/ guile-next glibc-utf8-locales, at Guix git revision 80a725726d3b3a62c69c9f80d35a898dcea8ad90.

(If you run that Guile, it will complain about not being able to install the locale. Guix, like Scheme, is generally a statically scoped system; but locales are dynamically scoped. That is to say, you have to set GUIX_LOCPATH=/opt/guile-2.2.0/lib/locale in the environment, for locales to work. See the GUIX_LOCPATH docs for the gnarlies.)

Alternately of course you can install Guix and just guix package -i guile-next. Guix itself will migrate to 2.2 over the next week or so.

Welp, that's all for this evening. I'll be relieved to push the release tag and announcements tomorrow. In the meantime, happy hacking, and yes: this blog is served by Guile 2.2! :)

it's probably spam

6 March 2017 2:16 PM (spam | tekuti | meta | bayesian | classification | scheme | guile | hubris)

Greetings, peoples. As you probably know, these words are served to you by Tekuti, a blog engine written in Scheme that uses Git as its database.

Part of the reason I wrote this blog software was that from the time when I was using Wordpress, I actually appreciated the comments that I would get. Sometimes nice folks visit this blog and comment with information that I find really interesting, and I thought it would be a shame if I had to disable those entirely.

But allowing users to add things to your site is tricky. There are all kinds of potential security vulnerabilities. I thought about the ones that were important to me, back in 2008 when I wrote Tekuti, and I thought I did a pretty OK job on preventing XSS and designing-out code execution possibilities. When it came to bogus comments though, things worked well enough for the time. Tekuti uses Git as a log-structured database, and so to delete a comment, you just revert the change that added the comment. I added a little security question ("what's your favorite number?"; any number worked) to prevent wordpress spammers from hitting me, and I was good to go.

Sadly, what was good enough in 2008 isn't good enough in 2017. In 2017 alone, some 2000 bogus comments made it through. So I took comments offline and painstakingly went through and separated the wheat from the chaff while pondering what to do next.

an aside

I really wondered why spammers bothered though. I mean, I added the rel="external nofollow" attribute on links, which should prevent search engines from granting relevancy to the spammer's links, so what gives? Could be that all the advice from the mid-2000s regarding nofollow is bogus. But it was definitely the case that while I was adding the attribute to commenter's home page links, I wasn't adding it to links in the comment. Doh! With this fixed, perhaps I will just have to deal with the spammers I have and not even more spammers in the future.

i digress

I started by simply changing my security question to require a number in a certain range. No dice; bogus comments still got through. I changed the range; could it be the numbers they were using were already in range? Again the bogosity continued undaunted.

So I decided to break down and write a bogus comment filter. Luckily, Git gives me a handy corpus of legit and bogus comments: all the comments that remain live are legit, and all that were ever added but are no longer live are bogus. I wrote a simple tokenizer across the comments, extracted feature counts, and fed that into a naive Bayesian classifier. I finally turned it on this morning; fingers crossed!

My trials at home show that if you train the classifier on half the data set (around 5300 bogus comments and 1900 legit comments) and then run it against the other half, I get about 6% false negatives and 1% false positives. The feature extractor interns sequences of 1, 2, and 3 tokens, and doesn't have a lower limit for number of features extracted -- a feature seen only once in bogus comments and never in legit comments is a fairly strong bogosity signal; as you have to make up the denominator in that case, I set it to indicate that such a feature is 99.9% bogus. A corresponding single feature in the legit set without appearance in the bogus set is 99% legit.

Of course with this strong of a bias towards precise features of the training set, if you run the classifier against its own training set, it produces no false positives and only 0.3% false negatives, some of which were simply reverted duplicate comments.

It wasn't straightforward to get these results out of a Bayesian classifier. The "smoothing" factor that you add to both numerator and denominator was tricky, as I mentioned above. Getting a useful tokenization was tricky. And the final trick was even trickier: limiting the significant-feature count when determining bogosity. I hate to cite Paul Graham but I have to do so here -- choosing the N most significant features in the document made the classification much less sensitive to the varying lengths of legit and bogus comments, and less sensitive to inclusions of verbatim texts from other comments.

We'll see I guess. If your comment gets caught by my filters, let me know -- over email or Twitter I guess, since you might not be able to comment! I hope to be able to keep comments open; I've learned a lot from yall over the years.

encyclopedia snabb and the case of the foreign drivers

24 February 2017 5:37 PM (snabb | igalia | networking | dpdk | drivers | lca | lwn | conway's law)

Peoples of the blogosphere, welcome back to the solipsism! Happy 2017 and all that. Today's missive is about Snabb (formerly Snabb Switch), a high-speed networking project we've been working on at work for some years now.

What's Snabb all about you say? Good question and I have a nice answer for you in video and third-party textual form! This year I managed to make it to linux.conf.au in lovely Tasmania. Tasmania is amazing, with wild wombats and pademelons and devils and wallabies and all kinds of things, and they let me talk about Snabb.

You can check that video on the youtube if the link above doesn't work; slides here.

Jonathan Corbet from LWN wrote up the talk in an article here, which besides being flattering is a real windfall as I don't have to write it up myself :)

In that talk I mentioned that Snabb uses its own drivers. We were recently approached by a customer with a simple and honest question: does this really make sense? Is it really a win? Why wouldn't we just use the work that the NIC vendors have already put into their drivers for the Data Plane Development Kit (DPDK)? After all, part of the attraction of a switch to open source is that you will be able to take advantage of the work that others have produced.

Our answer is that while it is indeed possible to use drivers from DPDK, there are costs and benefits on both sides and we think that when we weigh it all up, it makes both technical and economic sense for Snabb to have its own driver implementations. It might sound counterintuitive on the face of things, so I wrote this long article to discuss some perhaps under-appreciated points about the tradeoff.

Technically speaking there are generally two ways you can imagine incorporating DPDK drivers into Snabb:

  1. Bundle a snapshot of the DPDK into Snabb itself.

  2. Somehow make it so that Snabb could (perhaps optionally) compile against a built DPDK SDK.

As part of a software-producing organization that ships solutions based on Snabb, I need to be able to ship a "known thing" to customers. When we ship the lwAFTR, we ship it in source and in binary form. For both of those deliverables, we need to know exactly what code we are shipping. We achieve that by having a minimal set of dependencies in Snabb -- only LuaJIT and three Lua libraries (DynASM, ljsyscall, and pflua) -- and we include those dependencies directly in the source tree. This requirement of ours rules out (2), so the option under consideration is only (1): importing the DPDK (or some part of it) directly into Snabb.

So let's start by looking at Snabb and the DPDK from the top down, comparing some metrics, seeing how we could make this combination.

snabbdpdk
Code lines 61K 583K
Contributors (all-time) 60 370
Contributors (since Jan 2016) 32 240
Non-merge commits (since Jan 2016) 1.4K 3.2K

These numbers aren't directly comparable, of course; in Snabb our unit of code change is the merge rather than the commit, and in Snabb we include a number of production-ready applications like the lwAFTR and the NFV, but they are fine enough numbers to start with. What seems clear is that the DPDK project is significantly larger than Snabb, so adding it to Snabb would fundamentally change the nature of the Snabb project.

So depending on the DPDK makes it so that suddenly Snabb jumps from being a project that compiles in a minute to being a much more heavy-weight thing. That could be OK if the benefits were high enough and if there weren't other costs, but there are indeed other costs to including the DPDK:

  • Data-plane control. Right now when I ship a product, I can be responsible for the whole data plane: everything that happens on the CPU when packets are being processed. This includes the driver, naturally; it's part of Snabb and if I need to change it or if I need to understand it in some deep way, I can do that. But if I switch to third-party drivers, this is now out of my domain; there's a wall between me and something that running on my CPU. And if there is a performance problem, I now have someone to blame that's not myself! From the customer perspective this is terrible, as you want the responsibility for software to rest in one entity.

  • Impedance-matching development costs. Snabb is written in Lua; the DPDK is written in C. I will have to build a bridge, and keep it up to date as both Snabb and the DPDK evolve. This impedance-matching layer is also another source of bugs; either we make a local impedance matcher in C or we bind everything using LuaJIT's FFI. In the former case, it's a lot of duplicate code, and in the latter we lose compile-time type checking, which is a no-go given that the DPDK can and does change API and ABI.

  • Communication costs. The DPDK development list had 3K messages in January. Keeping up with DPDK development would become necessary, as the DPDK is now in your dataplane, but it costs significant amounts of time.

  • Costs relating to mismatched goals. Snabb tries to win development and run-time speed by searching for simple solutions. The DPDK tries to be a showcase for NIC features from vendors, placing less of a priority on simplicity. This is a very real cost in the form of the way network packets are represented in the DPDK, with support for such features as scatter/gather and indirect buffers. In Snabb we were able to do away with this complexity by having simple linear buffers, and our speed did not suffer; adding the DPDK again would either force us to marshal and unmarshal these buffers into and out of the DPDK's format, or otherwise to reintroduce this particular complexity into Snabb.

  • Abstraction costs. A network function written against the DPDK typically uses at least three abstraction layers: the "EAL" environment abstraction layer, the "PMD" poll-mode driver layer, and often an internal hardware abstraction layer from the network card vendor. (And some of those abstraction layers are actually external dependencies of the DPDK, as with Mellanox's ConnectX-4 drivers!) Any discrepancy between the goals and/or implementation of these layers and the goals of a Snabb network function is a cost in developer time and in run-time. Note that those low-level HAL facilities aren't considered acceptable in upstream Linux kernels, for all of these reasons!

  • Stay-on-the-train costs. The DPDK is big and sometimes its abstractions change. As a minor player just riding the DPDK train, we would have to invest a continuous amount of effort into just staying aboard.

  • Fork costs. The Snabb project has a number of contributors but is really run by Luke Gorrie. Because Snabb is so small and understandable, if Luke decided to stop working on Snabb or take it in a radically different direction, I would feel comfortable continuing to maintain (a fork of) Snabb for as long as is necessary. If the DPDK changed goals for whatever reason, I don't think I would want to continue to maintain a stale fork.

  • Overkill costs. Drivers written against the DPDK have many considerations that simply aren't relevant in a Snabb world: kernel drivers (KNI), special NIC features that we don't use in Snabb (RDMA, offload), non-x86 architectures with different barrier semantics, threads, complicated buffer layouts (chained and indirect), interaction with specific kernel modules (uio-pci-generic / igb-uio / ...), and so on. We don't need all of that, but we would have to bring it along for the ride, and any changes we might want to make would have to take these use cases into account so that other users won't get mad.

So there are lots of costs if we were to try to hop on the DPDK train. But what about the benefits? The goal of relying on the DPDK would be that we "automatically" get drivers, and ultimately that a network function would be driver-agnostic. But this is not necessarily the case. Each driver has its own set of quirks and tuning parameters; in order for a software development team to be able to support a new platform, the team would need to validate the platform, discover the right tuning parameters, and modify the software to configure the platform for good performance. Sadly this is not a trivial amount of work.

Furthermore, using a different vendor's driver isn't always easy. Consider Mellanox's DPDK ConnectX-4 / ConnectX-5 support: the "Quick Start" guide has you first install MLNX_OFED in order to build the DPDK drivers. What is this thing exactly? You go to download the tarball and it's 55 megabytes. What's in it? 30 other tarballs! If you build it somehow from source instead of using the vendor binaries, then what do you get? All that code, running as root, with kernel modules, and implementing systemd/sysvinit services!!! And this is just step one!!!! Worse yet, this enormous amount of code powering a DPDK driver is mostly driver-specific; what we hear from colleagues whose organizations decided to bet on the DPDK is that you don't get to amortize much knowledge or validation when you switch between an Intel and a Mellanox card.

In the end when we ship a solution, it's going to be tested against a specific NIC or set of NICs. Each NIC will add to the validation effort. So if we were to rely on the DPDK's drivers, we would have payed all the costs but we wouldn't save very much in the end.

There is another way. Instead of relying on so much third-party code that it is impossible for any one person to grasp the entirety of a network function, much less be responsible for it, we can build systems small enough to understand. In Snabb we just read the data sheet and write a driver. (Of course we also benefit by looking at DPDK and other open source drivers as well to see how they structure things.) By only including what is needed, Snabb drivers are typically only a thousand or two thousand lines of Lua. With a driver of that size, it's possible for even a small ISV or in-house developer to "own" the entire data plane of whatever network function you need.

Of course Snabb drivers have costs too. What are they? Are customers going to be stuck forever paying for drivers for every new card that comes out? It's a very good question and one that I know is in the minds of many.

Obviously I don't have the whole answer, as my role in this market is a software developer, not an end user. But having talked with other people in the Snabb community, I see it like this: Snabb is still in relatively early days. What we need are about three good drivers. One of them should be for a standard workhorse commodity 10Gbps NIC, which we have in the Intel 82599 driver. That chipset has been out for a while so we probably need to update it to the current commodities being sold. Additionally we need a couple cards that are going to compete in the 100Gbps space. We have the Mellanox ConnectX-4 and presumably ConnectX-5 drivers on the way, but there's room for another one. We've found that it's hard to actually get good performance out of 100Gbps cards, so this is a space in which NIC vendors can differentiate their offerings.

We budget somewhere between 3 and 9 months of developer time to create a completely new Snabb driver. Of course it usually takes less time to develop Snabb support for a NIC that is only incrementally different from others in the same family that already have drivers.

We see this driver development work to be similar to the work needed to validate a new NIC for a network function, with the additional advantage that it gives us up-front knowledge instead of the best-effort testing later in the game that we would get with the DPDK. When you add all the additional costs of riding the DPDK train, we expect that the cost of Snabb-native drivers competes favorably against the cost of relying on third-party DPDK drivers.

In the beginning it's natural that early adopters of Snabb make investments in this base set of Snabb network drivers, as they would to validate a network function on a new platform. However over time as Snabb applications start to be deployed over more ports in the field, network vendors will also see that it's in their interests to have solid Snabb drivers, just as they now see with the Linux kernel and with the DPDK, and given that the investment is relatively low compared to their already existing efforts in Linux and the DPDK, it is quite feasible that we will see the NIC vendors of the world start to value Snabb for the performance that it can squeeze out of their cards.

So in summary, in Snabb we are convinced that writing minimal drivers that are adapted to our needs is an overall win compared to relying on third-party code. It lets us ship solutions that we can feel responsible for: both for their operational characteristics as well as their maintainability over time. Still, we are happy to learn and share with our colleagues all across the open source high-performance networking space, from the DPDK to VPP and beyond.

An incomplete history of language facilities for concurrency

12 October 2016 1:45 PM (pl | concurrency | erlang | go | csp | guile | fibers | callback hell)

I have lately been in the market for better concurrency facilities in Guile. I want to be able to write network servers and peers that can gracefully, elegantly, and efficiently handle many tens of thousands of clients and other connections, but without blowing the complexity budget. It's a hard nut to crack.

Part of the problem is implementation, but a large part is just figuring out what to do. I have often thought that modern musicians must be crushed under the weight of recorded music history, but it turns out in our humble field that's also the case; there are as many concurrency designs as languages, just about. In this regard, what follows is an incomplete, nuanced, somewhat opinionated history of concurrency facilities in programming languages, with an eye towards what I should "buy" for the Fibers library I have been tinkering on for Guile.

* * *

Modern machines have the raw capability to serve hundreds of thousands of simultaneous long-lived connections, but it’s often hard to manage this at the software level. Fibers tries to solve this problem in a nice way. Before discussing the approach taken in Fibers, it’s worth spending some time on history to see how we got here.

One of the most dominant patterns for concurrency these days is “callbacks”, notably in the Twisted library for Python and the Node.js run-time for JavaScript. The basic observation in the callback approach to concurrency is that the efficient way to handle tens of thousands of connections at once is with low-level operating system facilities like poll or epoll. You add all of the file descriptors that you are interested in to a “poll set” and then ask the operating system which ones are readable or writable, as appropriate. Once the operating system says “yes, file descriptor 7145 is readable”, you can do something with that socket; but what? With callbacks, the answer is “call a user-supplied closure”: a callback, representing the continuation of the computation on that socket.

Building a network service with a callback-oriented concurrency system means breaking the program into little chunks that can run without blocking. Whereever a program could block, instead of just continuing the program, you register a callback. Unfortunately this requirement permeates the program, from top to bottom: you always pay the mental cost of inverting your program’s control flow by turning it into callbacks, and you always incur run-time cost of closure creation, even when the particular I/O could proceed without blocking. It’s a somewhat galling requirement, given that this contortion is required of the programmer, but could be done by the compiler. We Schemers demand better abstractions than manual, obligatory continuation-passing-style conversion.

Callback-based systems also encourage unstructured concurrency, as in practice callbacks are not the only path for data and control flow in a system: usually there is mutable global state as well. Without strong patterns and conventions, callback-based systems often exhibit bugs caused by concurrent reads and writes to global state.

Some of the problems of callbacks can be mitigated by using “promises” or other library-level abstractions; if you’re a Haskell person, you can think of this as lifting all possibly-blocking operations into a monad. If you’re not a Haskeller, that’s cool, neither am I! But if your typey spidey senses are tingling, it’s for good reason: with promises, your whole program has to be transformed to return promises-for-values instead of values anywhere it would block.

An obvious solution to the control-flow problem of callbacks is to use threads. In the most generic sense, a thread is a language feature which denotes an independent computation. Threads are created by other threads, but fork off and run independently instead of returning to their caller. In a system with threads, there is implicitly a scheduler somewhere that multiplexes the threads so that when one suspends, another can run.

In practice, the concept of threads is often conflated with a particular implementation, kernel threads. Kernel threads are very low-level abstractions that are provided by the operating system. The nice thing about kernel threads is that they can use any CPU that is the kernel knows about. That’s an important factor in today’s computing landscape, where Moore’s law seems to be giving us more cores instead of more gigahertz.

However, as a building block for a highly concurrent system, kernel threads have a few important problems.

One is that kernel threads simply aren’t designed to be allocated in huge numbers, and instead are more optimized to run in a one-per-CPU-core fashion. Their memory usage is relatively high for what should be a lightweight abstraction: some 10 kilobytes at least and often some megabytes, in the form of the thread’s stack. There are ongoing efforts to reduce this for some systems but we cannot expect wide deployment in the next 5 years, if ever. Even in the best case, a hundred thousand kernel threads will take at least a gigabyte of memory, which seems a bit excessive for book-keeping overhead.

Kernel threads can be a bit irritating to schedule, too: when one thread suspends, it’s for a reason, and it can be that user-space knows a good next thread that should run. However because kernel threads are scheduled in the kernel, it’s rarely possible for the kernel to make informed decisions. There are some “user-mode scheduling” facilities that are in development for some systems, but again only for some systems.

The other significant problem is that building non-crashy systems on top of kernel threads is hard to do, not to mention “correct” systems. It’s an embarrassing situation. For one thing, the low-level synchronization primitives that are typically provided with kernel threads, mutexes and condition variables, are not composable. Also, as with callback-oriented concurrency, one thread can silently corrupt another via unstructured mutation of shared state. It’s worse with kernel threads, though: a kernel thread can be interrupted at any point, not just at I/O. And though callback-oriented systems can theoretically operate on multiple CPUs at once, in practice they don’t. This restriction is sometimes touted as a benefit by proponents of callback-oriented systems, because in such a system, the callback invocations have a single, sequential order. With multiple CPUs, this is not the case, as multiple threads can run at the same time, in parallel.

Kernel threads can work. The Java virtual machine does at least manage to prevent low-level memory corruption and to do so with high performance, but still, even Java-based systems that aim for maximum concurrency avoid using a thread per connection because threads use too much memory.

In this context it’s no wonder that there’s a third strain of concurrency: shared-nothing message-passing systems like Erlang. Erlang isolates each thread (called processes in the Erlang world), giving each it its own heap and “mailbox”. Processes can spawn other processes, and the concurrency primitive is message-passing. A process that tries receive a message from an empty mailbox will “block”, from its perspective. In the meantime the system will run other processes. Message sends never block, oddly; instead, sending to a process with many messages pending makes it more likely that Erlang will pre-empt the sending process. It’s a strange tradeoff, but it makes sense when you realize that Erlang was designed for network transparency: the same message send/receive interface can be used to send messages to processes on remote machines as well.

No network is truly transparent, however. At the most basic level, the performance of network sends should be much slower than local sends. Whereas a message sent to a remote process has to be written out byte-by-byte over the network, there is no need to copy immutable data within the same address space. The complexity of a remote message send is O(n) in the size of the message, whereas a local immutable send is O(1). This suggests that hiding the different complexities behind one operator is the wrong thing to do. And indeed, given byte read and write operators over sockets, it’s possible to implement remote message send and receive as a process that serializes and parses messages between a channel and a byte sink or source. In this way we get cheap local channels, and network shims are under the programmer’s control. This is the approach that the Go language takes, and is the one we use in Fibers.

Structuring a concurrent program as separate threads that communicate over channels is an old idea that goes back to Tony Hoare’s work on “Communicating Sequential Processes” (CSP). CSP is an elegant tower of mathematical abstraction whose layers form a pattern language for building concurrent systems that you can still reason about. Interestingly, it does so without any concept of time at all, instead representing a thread’s behavior as a trace of instantaneous events. Threads themselves are like functions that unfold over the possible events to produce the actual event trace seen at run-time.

This view of events as instantaneous happenings extends to communication as well. In CSP, one communication between two threads is modelled as an instantaneous event, partitioning the traces of the two threads into “before” and “after” segments.

Practically speaking, this has ramifications in the Go language, which was heavily inspired by CSP. You might think that a channel is just a an asynchronous queue that blocks when writing to a full queue, or when reading from an empty queue. That’s a bit closer to the Erlang conception of how things should work, though as we mentioned, Erlang simply slows down writes to full mailboxes rather than blocking them entirely. However, that’s not what Go and other systems in the CSP family do; sending a message on a channel will block until there is a receiver available, and vice versa. The threads are said to “rendezvous” at the event.

Unbuffered channels have the interesting property that you can select between sending a message on channel a or channel b, and in the end only one message will be sent; nothing happens until there is a receiver ready to take the message. In this way messages are really owned by threads and never by the channels themselves. You can of course add buffering if you like, simply by making a thread that waits on either sends or receives on a channel, and which buffers sends and makes them available to receives. It’s also possible to add explicit support for buffered channels, as Go, core.async, and many other systems do, which can reduce the number of context switches as there is no explicit buffer thread.

Whether to buffer or not to buffer is a tricky choice. It’s possible to implement singly-buffered channels in a system like Erlang via an explicit send/acknowlege protocol, though it seems difficult to implement completely unbuffered channels. As we mentioned, it’s possible to add buffering to an unbuffered system by the introduction of explicit buffer threads. In the end though in Fibers we follow CSP’s lead so that we can implement the nice select behavior that we mentioned above.

As a final point, select is OK but is not a great language abstraction. Say you call a function and it returns some kind of asynchronous result which you then have to select on. It could return this result as a channel, and that would be fine: you can add that channel to the other channels in your select set and you are good. However, what if what the function does is receive a message on a channel, then do something with the message? In that case the function should return a channel, plus a continuation (as a closure or something). If select results in a message being received over that channel, then we call the continuation on the message. Fine. But, what if the function itself wanted to select over some channels? It could return multiple channels and continuations, but that becomes unwieldy.

What we need is an abstraction over asynchronous operations, and that is the main idea of a CSP-derived system called “Concurrent ML” (CML). Originally implemented as a library on top of Standard ML of New Jersey by John Reppy, CML provides this abstraction, which in Fibers is called an operation1. Calling send-operation on a channel returns an operation, which is just a value. Operations are like closures in a way; a closure wraps up code in its environment, which can be later called many times or not at all. Operations likewise can be performed2 many times or not at all; performing an operation is like calling a function. The interesting part is that you can compose operations via the wrap-operation and choice-operation combinators. The former lets you bundle up an operation and a continuation. The latter lets you construct an operation that chooses over a number of operations. Calling perform-operation on a choice operation will perform one and only one of the choices. Performing an operation will call its wrap-operation continuation on the resulting values.

While it’s possible to implement Concurrent ML in terms of Go’s channels and baked-in select statement, it’s more expressive to do it the other way around, as that also lets us implement other operations types besides channel send and receive, for example timeouts and condition variables.


1 CML uses the term event, but I find this to be a confusing name. In this isolated article my terminology probably looks confusing, but in the context of the library I think it can be OK. The jury is out though.

2 In CML, synchronized.

* * *

Well, that's my limited understanding of the crushing weight of history. Note that part of this article is now in the Fibers manual.

Thanks very much to Matthew Flatt, Matthias Felleisen, and Michael Sperber for pushing me towards CML. In the beginning I thought its benefits were small and complication large, but now I see it as being the reverse. Happy hacking :)

is go an acceptable cml?

21 September 2016 9:29 PM (go | concurrency | ml | ignorance | guile | clojure | core.async)

Yesterday I tried to summarize the things I know about Concurrent ML, and I came to the tentative conclusion that Go (and any Go-like system) was an acceptable CML. Turns out I was both wrong and right.

you were wrong when you said everything's gonna be all right

I was wrong, in the sense that programming against the CML abstractions lets you do more things than programming against channels-and-goroutines. Thanks to Sam Tobin-Hochstadt to pointing this out. As an example, consider a little process that tries to receive a message off a channel, and times out otherwise:

func withTimeout(ch chan int, timeout int) (result int) {
  var timeoutChannel chan int;
  var msg int;
  go func() {
    sleep(timeout)
    timeoutChannel <- 0
  }()
  select {
    case msg = <-ch: return msg;
    case msg = <-timeoutChannel: return 0;
  }
}

I think that's the first Go I've ever written. I don't even know if it's syntactically valid. Anyway, I think we see how it should work. We return the message from the channel, unless the timeout happens before.

But, what if the message is itself a composite message somehow? For example, say we have a transformer that reads a value from a channel and adds 1 to it:

func onePlus(in chan int) (result chan int) {
  var out chan int
  go func () { out <- 1 + <-in }()
  return out
}

What if we do a withTimeout(onePlus(numbers), 0)? Assume the timeout fires first and that's the result that select chooses. There's still that onePlus goroutine out there trying to read from in and at some point probably it will succeed, but nobody will read its value. At that point the number just vanishes into the ether. Maybe that's OK in certain domains, but certainly not in general!

What CML gives you is the ability to express an event (which is kinda like a possibility of sending or receiving a message on a channel) in such a way that we don't run into this situation. Specifically with the wrap combinator, we would make an event such that receiving on numbers would run a function on the received message and return that as the message value -- which is of course the same as what we have, except that in CML the select wouldn't actually read the message off unless it select'd that channel for input.

Of course in Go you could just rewrite your program, so that the select statement looks like this:

select {
  case msg = <-ch: return msg + 1;
  case msg = <-timeoutChannel: return 0;
}

But here we're operating at a lower level of abstraction; we were forced to intertwingle our concerns of adding 1 and our concerns of timeout. CML is more expressive than Go.

you were right when you said we're all just bricks in the wall

However! I was right in the sense that you can build a CML system on top of Go-like systems (though possibly not Go in particular). Thanks to Vesa Karvonen for this comment and the link to their proof-of-concept CML implementation in Clojure's core.async. I understand Vesa also has an implementation in F# as well.

Folks should read Vesa's code, after reading the Reppy papers of course; it's delightfully short and expressive. The basic idea is that event composition operators like choose and wrap build up data structures instead of doing things. The sync operation then grovels through those data structures to collect a list of channels to pass on to core.async's equivalent of select. When select returns, sync determines which event that chosen channel and message corresponds to, and proceeds to "activate" the event (and, as a side effect, possibly issue NACK messages to other channels).

Provided you can map from the chosen select channel/message back to the event, (something that core.async can mostly do, with a caveat; see the code), then you can build CML on top of channels and goroutines.

o/~ yeah you were wrong o/~

On the other hand! One advantage of CML is that its events are not limited to channel sends and receives. I understand that timeouts, thread joins, and maybe some other event types are first-class event kinds in many CML systems. Michael Sperber, current Scheme48 maintainer and functional programmer, tells me that simply wrapping events in channels+goroutines works but can incur a big performance overhead relative to supporting those event types natively, due to the need to make the new goroutine and channel and the scheduling costs. He quotes 10X as the overhead!

So although CML and Go appear to be inter-expressible, maybe a proper solution will base the simple channel send/receive interface on CML rather than the other way around.

Also, since these events are now second-class, it must be OK to lose these events, for the same reason that the naïve withTimeout could lose a message from numbers. This is the case for timeouts usually but maybe you have to think about this more, and possibly provide an infinite stream of the message. (Of course the wrapper goroutine would be collected if the channel becomes unreachable.)

you were right when you said this is the end

I've long wondered how contemporary musicians deal with the enormous, crushing weight of recorded music. I don't really pick any more but hoo am I feeling this now. I think for Guile, I will continue hacking on fibers in a separate library, and I think that things will remain that way for the next couple years and possibly more. We need more experience and more mistakes before blessing and supporting any particular formulation of highly concurrent programming. I will say though that I am delighted that we are able to actually do this experimentation on a library level and I look forward to seeing what works out :)

Thanks again to Vesa, Michael, and Sam for sharing their time and knowledge; all errors are of course mine. Happy hacking!

concurrent ml versus go

20 September 2016 9:33 PM (concurrency | go | ml | ignorance | guile)

Peoples! Lately I've been navigating the guile-ship through waters unknown. This post is something of an echolocation to figure out where the hell this ship is and where it should go.

Concretely, I have been working on getting a nice lightweight concurrency system rolling for Guile. I'll write more about that later, but you can think of it as being modelled on Go, though built as a library. (I had previously described it as "Erlang-like", but that's just not accurate.)

Earlier this year at Curry On this topic was burning in my mind and of course when I saw the language-hacker fam there I had to bend their ears. My targets: Matthew Flatt, the amazing boundary-crossing engineer, hacker, teacher, researcher, and implementor of Racket, and Matthias Felleisen, the godfather of the PLT research family. I saw them sitting together and I thought, you know what, what can they have to say to each other? These people have been talking together for 30 years right? Surely they are actually waiting for some ignorant dude to saunter up to the PL genius bar, right?

So saunter I do, saying, "if someone says to you that they want to build a server that will handle 100K or so simultaneous connections on Racket, what abstraction do you tell them to use? Racket threads?" Apparently: yes. A definitive yes, in the case of Matthias, with a pointer to Robby Findler's paper on kill-safe abstractions; and still a yes from Matthew with the caveat that for the concrete level of concurrency that I described, you'd have to run tests. More fundamentally, I was advised to look at Concurrent ML (on which Racket's concurrency facilities were based), that CML was much better put together than many modern variants like Go.

This was very interesting and new to me. As y'all probably know, I don't have a formal background in programming languages, and although I've read a lot of literature, reading things only makes you aware of the growing dimension of the not-yet-read. Concurrent ML was even beyond my not-yet-read horizon.

So I went back and read a bunch of papers. Turns out Concurrent ML is like Lisp in that it has a tribe and a tightly-clutched history and a diaspora that reimplements it in whatever language they happen to be working in at the moment. Kinda cool, and, um... a bit hard to appreciate in the current-day context when the only good references are papers from 10 or 20 years ago.

However, after reading a bunch of John Reppy papers, here is my understanding of what Concurrent ML is. I welcome corrections; surely I am getting this wrong.

1. CML is like Go, composed of channels and goroutines. (Forgive the modern referent; I assume most folks know Go at this point.)

2. Unlike Go, in CML a channel is never buffered. To make a buffered channel in CML, you spawn a thread that manages a buffer between two channels.

3. Message send and receive operations in CML are built on a lower-level primitive called "events". (send ch x) is instead euivalent to (sync (send-event ch x)). It's like an event is the derivative of a message send with respect to time, or something.

4. Events can be combined and transformed using the choose and wrap combinators.

5. Doing a sync on an event created by choose allows a user to build select in "user-space", as a library. Cool stuff. So this is what events are for.

6. There are separate event type implementations for timeouts, channel send/recv blocking operations, file descriptor blocking operations, syscalls, thread joins, and the like. These are supported by the CML implementation.

7. The early implementations of Concurrent ML were concurrent but not parallel; they did not run multiple "goroutines" on separate CPU cores at the same time. It was only in like 2009 that people started to do CML in parallel. I do not know if this late parallelism has a practical impact on the viability of CML.

ok go

What is the relationship of CML to Go? Specifically, is CML more expressive than Go? (I assume the reverse is not the case, but that would also be an interesting result!)

There are a few languages that only allow you to select over message receives (not sends), but Go's select doesn't have this limitation, so that's not a differentiator.

Some people say that it's nice to have events as the common denominator, but I don't get this argument. If the only event under consideration is message send or receive over a channel, events + choose + sync is the same in expressive power as a built-in select, as far as I can see. If there are other events, then your runtime already has to support them either way, and something like (let ((ch (make-channel))) (spawn-fiber (lambda () (put-message ch exp))) (get-message ch)) should be sufficient for any runtime-supported event in exp, like sleeps or timeouts or thread joins or whatever.

To me it seems like Go has made the right choices here. I do not see the difference, and that's why I wrote all this, is to be shown the error of my ways. Choosing channels, send, receive, and select as the primitives seems to have the same power as SML events.

Let this post be a pentagram on the floor, then, to summon the CML cognoscenti. Well-actuallies are very welcome; hit me up in the comments!

[edit: Sam Tobin-Hochstadt tells me I got it wrong and I believe him :) In the meantime while I work out how I was wrong, examples are welcome!]

a simple (local) solution to the pay gap

24 March 2016 9:49 PM (justice | tech | pay gap | igalia)

International Working Women's Day was earlier this month, a day that reminds the world how far it has yet to go to achieve just treatment of women in the workplace. Obviously there are many fronts on which to fight to dismantle patriarchy, and also cissexism, and also transphobia, and also racism, and sometimes it gets a bit overwhelming just to think of a world where people treat each other right.

Against this backdrop, it's surprising that some policies are rarely mentioned by people working on social change. This article is about one of them -- a simple local change that can eliminate the pay gap across all axes of unfair privilege.

Ready?

OK here it is: just pay everyone in a company the same hourly wage.

That's it!

on simple, on easy

But, you say, that's impossible!

Rich Hickey has this famous talk where he describes one thing as simple and the other as easy. In his narrative, simple is good but hard, and easy is bad but, you know, easy. I enjoy this talk because it's easy (hah!) to just call one thing simple and the other easy and it's codewords for good and bad, and you come across as having the facile prestidigitatory wisdom of a Malcolm Gladwell.

As far as simple, the substance of equal pay is as simple as it gets. And as far as practical implementation goes, it only needs buy-in from one person: your boss could do it tomorrow.

But, you say, a real business would never do this! This is getting closer to the real issues, but not there yet. There are plenty of instances of real businesses that do this. Incidentally, mine is one of them! I do not intend this to be an advertisement for my company, but I have to mention this early because society does its best to implant inside our brains the ideas that certain ideas are possible and certain others are not.

But, you say, this would be terrible for business! Here I think we are almost there. There's a question underneath, if we can manage to phrase it in a more scientific way -- I mean, the ideal sense in which science is a practice of humankind in which we use our limited powers to seek truth, with hypotheses but without prejudice. It might sound a bit pompous to invoke capital-S Science here, but I think few conversations of this kind try to honestly even consider existence proofs in the form of already-existing histories (like the company I work for), much less an unbiased study of the implications of modelling the future on those histories.

Let's assume that you and I want to work for justice, and in this more perfect world, men and women and nonbinary people will have equal pay for equal work, as will all people that lie on all axes of privilege that currently operate in society. If you are with me up to here: great. If not, we don't share a premise so it's not much use to go farther. You can probably skip to the next article in your reading list.

So, then, the questions: first of all, would a flat equal wage within a company actually help people in marginalized groups? What changes would happen to a company if it enacted a flat wage tomorrow? What are its limitations? How could this change come about?

would it help?

Let's take the most basic question first. How would this measure affect people in marginalized groups?

Let us assume that salaries are distributed inversely: the higher salaries are made by fewer people. A lower salary corresponds to more people. So firstly, we are in a situation where the median salary is less than the mean: that if we switched to pay everyone the mean, then most people would see an increase in their salary.

Assuming that marginalized people were evenly placed in a company, that would mean that most would benefit. But we know that is not the case: "marginalized" is the operative term. People are categorized at a lower point than their abilities; people's climb of the organizational hierarchy (and to higher salaries) is hindered by harassment, by undervalued diversity work, and by external structural factors, like institutionalized racism or the burden of having to go through a gender transition. So probably, even if a company touts equal pay within job classifications, the job classifications themselves unfairly put marginalized people lower than white dudes like me. So, proportionally marginalized people would benefit from an equal wage more than most.

Already this plan is looking pretty good: more money going to marginalized people is a necessary step to bootstrap a more just world.

All that said, many (but not most) people from marginalized groups will earn more than the mean. What for them? Some will decide that paying for a more just company as a whole is worth a salary reduction. (Incidentally, this applies to everyone: everyone has their price for justice. It might be 0.1%, it might be 5%, it might be 50%.)

Some, though, will decide it is not worth paying. They will go work elsewhere, probably for even more money (changing jobs being the best general way to advance your salary). I don't blame marginalized folks for getting all they can: more power to them.

From what I can tell, things are looking especially good for marginalized people under a local equal-wage initiative. Not perfect, not in all cases, but generally better.

won't someone think of the dudes

I don't believe in value as a zero-sum proposition: there are many ways in which a more fair world could be more productive, too. But in the short term, a balance sheet must balance. Salary increases in the bottom will come from salary decreases from the top, and the dudebro is top in tech.

We should first note that many and possibly most white men will see their wages increase under a flat-wage scheme, as most people earn below the mean.

Secondly, some men will be willing to pay for justice in the form of equal pay for equal work. An eloquent sales pitch as to what they are buying will help.

Some men would like to pay but have other obligations that a "mean" salary just can't even. Welp, there are lots of jobs out there. We'll give you a glowing recommendation :)

Finally there will be dudes that are fine with the pay gap. Maybe they have some sort of techno-libertarian justification? Oh well! They will find other jobs. As someone who cares about justice, you don't really want to work with these people anyway. Call it "bad culture fit", and treat it as a great policy to improve the composition of your organization.

an aside: what are we here for anyway?

A frequent objection to workplace change comes in the form of a pandering explanation of what companies are for, that corporations are legally obligated to always proceed along the the most profitable path.

I always find it extraordinarily ignorant to hear this parroted by people in tech: it's literally part of the CS canon to learn about the limitations of hill-climbing as an optimization strategy. But on the other hand, I do understand; the power of just-so neoliberal narrative is immense, filling your mind with pat explanations, cooling off your brain into a poorly annealed solid mass.

The funny thing about corporate determinism that it's not even true. Folks who say this have rarely run companies, otherwise they should know better. Loads of corporate decisions are made with a most tenuous link on profitability, and some that probably even go against the profit interest. It's always easy to go in a known-profitable direction, but that doesn't mean it's the only way to go, nor that all the profitable directions are known.

Sometimes this question is framed in the language of "what MyDesignCo really cares about is good design; we're worried about how this measure might affect our output". I respect this question more, because it's more materialist (you can actually answer the question!), but I disagree with the premise. I don't think any company really cares about the product in a significant way. Take the design company as an example. What do you want on your tombstone: "She made good advertisements"??? Don't get me wrong, I like my craft, and I enjoy practicing it with my colleagues. But if on my tombstone they wrote "He worked for justice", and also if there were a heaven, I would be p OK with that. What I'm saying is, you start a company, you have an initial idea, you pivot, whatever, it doesn't matter in the end. What matters is you relationship with life on the planet, and that is the criteria you should use to evaluate what you do.

Beyond all that -- it's amazing how much wrong you can wrap up in a snarky hacker news one-liner -- beyond all that, the concern begs the question by assuming that a flat-wage arrangement is less profitable. People will mention any down-side they can but never an up-side.

possible flat-wage up-sides from a corporate perspective

With that in mind, let's consider some ways that a flat wage can actually improve the commercial fate of a company.

A company with a flat wage already has a marketing point that they can use to attract people that care about this sort of thing. It can make your company stand out from the crowd and attract good people.

The people you attract will know you're doing the flat-wage thing, and so will be predisposed to want to work together. This can increase productivity. It also eliminates some material sources of conflict between different roles in an organization. You would still need "human resources" people but they would need to spend less time on mitigating the natural money-based conflicts that exist in other organizations.

Another positive side relates to the ability of the company to make collective sacrifices. For example a company that is going through harder times can collectively decide not to raise wages or even to lower them, rather than fire people. Obviously this outcome depends on the degree to which people feel responsible for the organization, which is incomplete without a feeling of collective self-management as in a cooperative, but even in a hierarchical organization these effects can be felt.

Incidentally a feeling of "investment" in the organization is another plus. When you work in a company in which compensation depends on random factors that you can't see, you always wonder if you're being cheated out of your true value. If everyone is being paid the same you know that everyone's interest in improving company revenue is aligned with their own salary interest -- you can't gain by screwing someone else over.

limitations of a flat wage at improving justice

All that said, paying all workers/partners/employees the same hourly wage is not a panacea for justice. It won't dismantle patriarchy overnight. It won't stop domestic violence, and it won't stop the cops from killing people of color. It won't stop microagressions or harassment in the workplace, and in some ways if there are feelings of resentment, it could even exacerbate them. It won't arrest attrition of marginalized people from the tech industry, and it won't fix hiring. Enacting the policy in a company won't fix the industry as a whole, even if all companies enacted it, as you would still have different wages at different companies. It won't fix the situation outside of the tech industry; a particularly egregious example being that in almost all places, cleaning staff are hired via subcontracts and not as employees. And finally, it won't resolve class conflict at work: the owner still owns. There are still pressures on the owner to keep the whole balance sheet secret, even if the human resources side of things is transparent.

All that said, these are mainly ways in which an equal wage policy is incomplete. A step in the right direction, on a justice level, but incomplete. In practice though the objections you get will be less related to justice and more commercial in nature. Let's take a look at some of them.

commercial challenges to a flat wage

Having everyone paid the same makes it extraordinarily difficult to hire people that are used to being paid on commission, like sales people. Sales people drive Rolexes and wear Mercedes. It is very, very tough to hire good sales people on salary. At my work we have had some limited success hiring, and some success growing technical folks into sales roles, but this compensation package will hinder your efforts to build and/or keep your sales team.

On the other hand, having the same compensation between sales and engineering does eliminate some of the usual sales-vs-product conflicts of interest.

Another point it that if you institute a flat-wage policy, you will expect to lose some fraction of your highly-skilled workers, as many of these are more highly paid. There are again some mitigations but it's still a reality. Perhaps more perniciously, you will have greater difficulties hiring senior people: you literally can't get into a bidding war with a competitor over a potential hire.

On the flip side, a flat salary can make it difficult to hire more junior positions. There are many theories here but I think that a company is healthy when it has a mix of experiences, that senior folks and junior folks bring different things to the table. But if your flat wage is higher than the standard junior wage, then your potential junior hires are now competing against more senior people -- internally it will be hard to keep a balance between different experiences.

Indeed junior workers that you already have are now competing at their wage level with potential hires that might be more qualified in some way. An unscrupulous management could fire those junior staff members and replace them with more senior candidates. An equal wage policy does not solve internal class conflicts; you need to have equal ownership and some form of workplace democracy for that.

You could sort people into pay grades, but in many ways this would formalize injustice. Marginalized people are by definition not equally distributed across pay grades.

Having a flat wage also removes a standard form of motivation, that your wage is always rising as you get older. It could be that after 5 years in a job, maybe your wages went up because the company's revenues went up, but they're still the same as a new hire's -- how do you feel about that? It's a tough question. I think an ever-rising wage has a lot of negative aspects, including decreasing the employability of older workers, but it's deeply rooted in tech culture at least.

Another point is motivation of people within the same cadre. Some people are motivated by bonuses, by performing relatively well compared to their peers. This wouldn't be an option in an organization with a purely flat wage. Does it matter? I do not know.

work with me tho

As the prophet Pratchett said, "against one perfect moment, the centuries beat in vain". There are some definite advantages to a flat wage within a company: it's concrete, it can be immediately enacted, it solves some immediate problems in a local way. Its commercial impact is unclear, but the force of narrative can bowl over many concerns in that department: what's important is to do the right thing. Everybody knows that!

As far as implementation, I see three-and-a-half ways this could happen in a company.

The first is that equal pay could be a founding principle of the company. This was mostly the case in the company I work for (and operate, and co-own equally with the other 40 or so partners). I wasn't a founder of the company, and the precise set of principles and policies has changed over the 15 years of the company's life, but it's more obvious for this arrangement to continue from a beginning than to change from the normal pay situation.

The second is, the change could come from the top down. Some CEOs get random brain waves and this happens. In this case, the change is super-easy to make: you proclaim the thing and it's done. As a person who has had to deal with cash-flow and payroll and balance sheets, I can tell you that this considerably simplifies HR from a management perspective.

The third is via collective action. This only works if workers are able to organize and can be convinced to be interested in justice in this specific way. In some companies, a worker's body might simply be able to negotiate this with management -- e.g., we try it out for 6 months and see. In most others you'd probably need to unionize and strike.

Finally, if this practice were more wider-spread in a sector, it could be that it just becomes "best practice" in some way -- that company management could be shamed into doing it, or it could just be the way things are done.

fin

Many of these points are probably best enacted in the context of a worker-owned cooperative, where you can do away with the worker-owner conflict at the same time. But still, they are worth thinking of in a broader context, and worth evaluating in the degree to which they work for (or against) justice in the workplace. But enough blathering from me today :) Happy hacking!