a new concurrent ml
Andy Wingo, wingolog (https://wingolog.org/2017/06/29/a-new-concurrent-ml), 29 June 2017

Good morning all!

In my last article I talked about how we composed a lightweight "fibers" facility in Guile out of lower-level primitives. What we implemented there is enough to be useful, but it is missing an important aspect of concurrency: communication. Sure, being able to spawn off fibers is nice, but you have to be able to actually talk to them.

Fibers had just gotten to the state described above about a year ago as I caught a train from Geneva to Rome for Curry On 2016. Train rides are magnificent for organizing thoughts, and I was in dire need of clarity. I had tentatively settled on Go-style channels by the time I got there, but when I saw that Matthias Felleisen and Matthew Flatt were there, I had to take advantage of the opportunity to ask them what they thought. Would they recommend Racket-like threads and channels? Had that been a good experience?

The answer that I got in return was a "yes, that's what you should do", but also a "you should look at Concurrent ML". Concurrent ML? What's that? I looked and was a bit skeptical. It seemed old and hoary and maybe channels were just as expressive. I looked more deeply into this issue and it seemed CML is a bit more expressive than just channels but damn, it looked complicated to implement.

I was wrong. This article shows that what you need to do to implement multi-core CML is actually the same as what you need to do to implement channels in a multi-core environment. By building CML first and channels and whatever later, you get more power for the same amount of work.

Note that this article has an associated talk! If video is your thing, see my Curry On 2017 talk on YouTube; slides are available as well.

on channels

Let's first have a crack at implementing channels. Before we begin, we should be a bit more explicit about what a channel is. My first hack in this area did the wrong thing: I was used to asynchronous queues, and I thought that's what a channel was. Besides ignorance, apparently that's what Erlang does; a process's inbox is an unbounded queue of messages with only very slight back-pressure.

But an asynchronous queue is not a channel, at least in its classic sense. As they were originally formulated in "Communicating Sequential Processes" by Tony Hoare, adopted into David May's occam, and from there into many other languages, channels are meeting-places. Processes meet at a channel to exchange values; whichever party arrives first has to wait for the other party to show up. The message that is handed off in a channel send/receive operation is never "owned" by the channel; it is either owned by a sender who is waiting at the meeting point for a receiver, or it's accepted by a receiver. After the transaction is complete, both parties continue on.

You'd think this is a fine detail, but meeting-place channels are strictly more expressive than buffered channels. I was actually called out for this because my first implementation of channels for Fibers had effectively a minimum buffer size of 1. In Go, whose channels are unbuffered by default, you can use a channel for RPC:

package main

func double(ch chan int) {
  for { ch <- (<-ch * 2) }
}

func main() {
  ch := make(chan int)
  go double(ch)
  ch <- 2
  x := <-ch
  print(x)
}

Here you see that the main function sent a value on ch, then immediately read a response from the same channel. If the channel were buffered, then we'd probably read the value we sent instead of the doubled value supplied by the double goroutine. I say "probably" because it's not deterministic! Likewise the double routine could read its responses as its inputs.

Anyway, the channels we are looking to build are meeting-place channels. If you are interested in the broader design questions, you might enjoy the incomplete history of language facilities for concurrency article I wrote late last year.

With that prelude out of the way, here's a first draft at the implementation of the "receive" operation on a channel.

(define (recv ch)
  (match ch
    (($ $channel recvq sendq)
     (match (try-dequeue! sendq)
       (#(value resume-sender)
        (resume-sender)
        value)
       (#f
        (suspend
         (lambda (k)
           (define (resume val)
             (schedule (lambda () (k val))))
           (enqueue! recvq resume))))))))

;; Note: this code has a race!  Fixed later.

A channel is a record with two fields, its recvq and sendq. The receive queue (recvq) holds a FIFO queue of continuations that are waiting to receive values, and the send queue holds continuations that are waiting to send values, along with the value that they are sending. Both the recvq and the sendq are lockless queues.
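
For concreteness, here is a hedged sketch of what such a channel record might look like, with the two queues represented as atomic boxes holding immutable lists. This is illustrative only; Fibers' actual queue representation is more efficient.

(use-modules (srfi srfi-9) (ice-9 atomic))

;; Sketch only: the $channel record as used in the snippets in this
;; article, with recvq and sendq as atomic boxes over immutable lists.
(define-record-type $channel
  (%make-channel recvq sendq)
  channel?
  (recvq channel-recvq)
  (sendq channel-sendq))

(define (make-channel)
  (%make-channel (make-atomic-box '()) (make-atomic-box '())))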

To receive a value from a meeting-place channel, there are two possibilities: either there's a sender already there and waiting, or we have to wait for a sender. Those two cases are handled above, in that order. We use the suspend primitive from the last article to arrange for the fiber to wait; presumably the sender will resume us when they arrive later at the meeting-point.

an aside on lockless data structures

We'll go more deeply into the channel receive mechanics later, but first, a general question: what's the right way to implement a data structure that can be accessed and modified concurrently without locks? Though I am full of hubris, I don't have enough to answer this question definitively. I know many ways, but none that's optimal in all ways.

For what I needed in Fibers, I chose to err on the side of simplicity.

Some data in Fibers is never modified; this immutable data is safe to access concurrently from any code. This is the best, obviously :)

Some mutable data is only ever mutated from an "owner" core; it's safe to read without a lock from that owner core, and in Fibers we do not access this data from other cores. An example of this kind of data structure is the i/o map from file descriptors to continuations; it's core-local. I say "core-local" because in fibers we typically run one scheduler per core, with each core having a pinned POSIX thread; it's really thread-local but I don't want to use the word "thread" too much here as it's confusing.

Some mutable data needs to be read and written from many cores. An example of this is the recvq of a channel; many receivers and senders may want to read and write it at once. The approach we take in Fibers is just to use immutable data stored inside an "atomic box". An atomic box holds a single value, and exposes operations to read, write, swap, and compare-and-swap (CAS) the value. To read a value, just fetch it from the box; you then have immutable data that you can analyze without locks. Having read a value, you can then compute a new state and use CAS on the atomic box to publish that change. If the CAS succeeds, then great; otherwise the state changed in the meantime, so you typically want to loop and try again.
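
As a concrete (if simplified) illustration of that read/compute/CAS loop, here is a sketch using Guile's (ice-9 atomic) boxes; the enqueue! helper below is hypothetical, not Fibers' actual queue code.

(use-modules (ice-9 atomic))

;; Sketch: push an item onto a queue represented as an immutable list
;; held in an atomic box, retrying if another core got there first.
(define (enqueue! q item)
  (let lp ()
    (let* ((old (atomic-box-ref q))            ; read current immutable state
           (new (append old (list item))))     ; compute a fresh state, FIFO order
      ;; compare-and-swap! returns the observed value; if it isn't the
      ;; one we read, someone else changed the queue, so retry.
      (unless (eq? old (atomic-box-compare-and-swap! q old new))
        (lp)))))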

Single-word CAS suffices for Guile when every value stored into an atomic box will be unique, a property that freshly-allocated objects have and of which GC ensures us an endless supply. Note that for this to work, the values can share structure internally but the outer reference has to be freshly allocated.

The combination of freshly-allocated data structures and atomic variables is a joy to use: no hassles about multi-word compare-and-swap or the ABA problem. Assuming your GC can keep up (Guile's does about 700 MB/s), it can be an effective strategy, and is certainly less error-prone than others.

back at the channel recv ranch

Now, the theme here is "growing a language": taking primitives and using them to compose more expressive abstractions. In that regard, sure, channel send and receive are nice, but what about select, which allows us to wait on any channel in a set of channels? How do we take what we have and build non-determinism on top?

I think we should begin by noting that select in Go for example isn't just about receiving messages. You can select on the first channel that can send, or between send and receive operations.

select {
case c <- x:
  x, y = y, x+y
case <-quit:
  return
}

As you can see, Go provides special syntax for select. Although in Guile we can of course provide macros, usually those macros expand out to a procedure call; the macro is sugar for a function. So we want select as a function. But because we need to be able to select over receiving and sending at the same time, the function needs to take some kind of annotation on what we are going to do with the channels:

(select (recv A) (send B v))

So what we do is to introduce the concept of an operation, which is simply data describing some event which may occur in the future. The arguments to select are now operations.

(select (recv-op A) (send-op B v))

Here recv-op is obviously a constructor for the channel-receive operation, and likewise for send-op. And actually, given that we've just made an abstraction over sending or receiving on a channel, we might as well make an abstraction over choosing the first available op among a set of operations. The implementation of select now creates such a choice-op, then performs it.

(define (select . ops)
  (perform (apply choice-op ops)))

But what we're missing here is the ability to know which operation actually happened. In Go, select's special syntax associates a clause of code with each sub-operation. In Scheme a clause of code is just a function, and so what we want to do is to be able to annotate an operation with a function that will get run if the operation succeeds.

So we define a (wrap-op op k), which makes an operation that itself annotates op, associating it with k. If op occurs, its result values will be passed to k. For example, if we make a fiber that tries to perform this operation:

(perform
 (wrap-op
  (recv-op A)
  (lambda (v)
    (string-append "hello, " v))))

If we send the string "world" on the channel A, then the result of this perform invocation will be "hello, world". Providing "wrapped" operations to select allows us to handle the various cases in separate, appropriate ways.
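
To make that concrete, here is a rough sketch of how the Go select from earlier might be written with wrapped operations. The channels c and quit and the value x are carried over from that example, and update-x-and-y! is a hypothetical stand-in for the x, y = y, x+y clause.

(select
 (wrap-op (send-op c x)
          (lambda ()                 ; send-op completes with no values
            (update-x-and-y!)))      ; hypothetical: x, y = y, x+y
 (wrap-op (recv-op quit)
          (lambda (msg)              ; value received on quit, unused
            'return)))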

we just made concurrent ml

Hey, we just reinvented Concurrent ML! In his PLDI 1988 paper "Synchronous operations as first-class values", John Reppy proposes just this abstraction. I like to compare it to the relationship between an expression (exp) and wrapping that expression in a lambda ((lambda () exp)); evaluating an expression gives its value, and the expression just goes away, whereas evaluating a lambda gives a procedure that you can call in the future to evaluate the expression. You can call the lambda many times, or no times. In the same way, a channel-receive operation is an abstraction over receiving a value from a channel. You can perform that operation many times, once, or not at all.

Reppy consolidated this work in his PLDI 1991 paper, "CML: A higher-order concurrent language". Note that he uses the term "event" instead of "operation". To me the name "event" to apply to this abstraction never felt quite right; I guess I wrote too much code in the past against event loops. I see "events" as single instances in time and not an abstraction over the possibility of a, well, of an event. Indeed I wish I could refer to an instantiation of an operation as an event, but better not to muddy the waters. Likewise Reppy uses "synchronize" where I use "perform". As you like, really, it's still Concurrent ML; I just prefer to explain to my users using terms that make sense to me.

what's an op?

Let's return to that channel recv implementation. It had basically two parts: an optimistic part, where the operation could complete immediately, and a pessimistic part, where we had to wait for the other party to arrive. However, there was a race condition, as I noted in the comment. If a sender and a receiver concurrently arrive at a channel, it could be that they concurrently do the optimistic check, don't notice that the other is there, then they both suspend, waiting for each other to arrive: deadlock. To fix this for recv, we have to recheck the sendq after publishing our presence to the recvq.

I'll get to the details in a bit for channels, but it turns out that this is a general pattern. All kinds of ops have optimistic and pessimistic behavior.

(define (perform op)
  (match op
    (($ $op try block wrap)
     (define (do-op)
       ;; Return a thunk that has result values.
       (or optimistic
           pessimistic))
     ;; Return values, passed through wrap function.
     ((compose wrap do-op)))))

In the optimistic phase, the calling fiber will try to commit the operation directly. If that succeeds, then the calling fiber resumes any other fibers that are part of the transaction, and the calling fiber continues. In the pessimistic phase, we park the calling fiber, publish the fact that we're ready and waiting for the operation, then to resolve the race condition we have to try again to complete the operation. In either case we pass the result(s) through the wrap function.

Given that the pessimistic phase has to include a re-check for operation completability, the optimistic phase is purely an optimization. It's a good optimization that everyone will want to implement, but it's not strictly necessary. It's OK for a try function to always return #f.

As shown in the above function, an operation is a plain old data structure with three fields: a try, a block, and a wrap function. The optimistic behavior is implemented by the try function; the pessimistic side is partly implemented by perform, which handles the fiber suspension part, and by the operation's block function. The wrap function implements the wrap-op behavior described above, and is applied to the result(s) of a successful operation.
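
Reading the pseudocode above a bit more concretely: under the conventions used in this article (a try function returns a thunk of result values or #f, and the pessimistic helper defined below also returns such a thunk), perform might be sketched like this. Illustrative only, not Fibers' exact code.

(define (perform op)
  (match op
    (($ $op try block wrap)
     (define (do-op)
       ;; Optimistic phase first; fall back to the pessimistic phase,
       ;; which suspends, publishes, and retries.
       (or (try)
           (pessimistic block)))
     ;; do-op returns a thunk of result values; call it and pass the
     ;; values through the wrap function.
     (call-with-values (do-op) wrap))))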

Now, it was about at this point that I was thinking "jeebs, this CML thing is complicated". I was both wrong and right -- there's some complication inherent in multicore lockless communication, yes, but I believe CML captures something close to the minimum, and certainly it's just as much work as with a direct implementation of channels. In that spirit, I continue on with the implementation of channel operations in Fibers.

channel receive operation

Here's an implementation of a try function for a channel.

(define (try-recv ch)
  (match ch
    (($ $channel recvq sendq)
     (let ((q (atomic-ref sendq)))
       (match q
         (() #f)
         ((head . tail)
          (match head
            (#(val resume-sender state)
             (match (CAS! state 'W 'S)
               ('W
                (resume-sender (lambda () (values)))
                (CAS! sendq q tail) ; *
                (lambda () val))
               (_ #f))))))))))

In Fibers, a try function either succeeds and returns a thunk, or fails and returns #f. For channel receive, we only succeed if there is a sender already in the queue: the sender has arrived, suspended itself, and published its availability. The state variable is an atomic box that holds the operation state, which initially starts as W and when complete is S. More on that in a minute. If the CAS! compare-and-swap operation managed to change the state from W to S, then the optimistic phase succeeded -- yay! We resume the sender with no values, take the value that the sender gave us, and keep on trucking, returning that value wrapped in a thunk.

Additionally the sender's entry on the sendq is now stale, as the operation is already complete; we try to pop it off the queue at the line indicated with *, but that could fail due to concurrent queue modification. In that case, no biggie; someone else will collect our garbage for us.

The pessimistic case is a bit more involved. It's the last bit of code though; almost done here! I express the pessimistic phase as a function of the operation's block function.

(define (pessimistic block)
  ;; For consistency with optimistic phase, result of
  ;; pessimistic phase is a thunk that "perform" will
  ;; apply.
  (lambda ()
    ;; 1. Suspend the thread.  Expect to be resumed
    ;; with a thunk, which we arrange to invoke directly.
    ((suspend
       (lambda (k)
        (define (resume values-thunk)
          (schedule (lambda () (k values-thunk))))
        ;; 2. Make a fresh opstate.
        (define state (make-atomic-box 'W))
        ;; 3. Call op's block function.
        (block resume state))))))

So what about that state variable? Well basically, once we publish the fact that we're ready to perform an operation, fibers from other cores might concurrently try to complete our operation. We need for this perform invocation to complete at most once! So we introduce a state variable, the "opstate", held in an atomic box. It has three states:

  • W: "Waiting"; initial state

  • C: "Claimed"; temporary state

  • S: "Synched"; final state

There are four possible state transitions, of two kinds. Firstly there are the "local" transitions W->C, C->W, and C->S. These transitions may only ever occur as part of the "retry" phase of a block function; notably, no remote fiber will cause these transitions on "our" state variable. Remote fibers can only make the W->S transition, committing an operation. The W->S transition can also be made locally of course.
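
Expressed as tiny helpers, using the CAS! and atomic-set! shorthands from the other snippets, the local transitions might look like this; these helpers are illustrative, not part of Fibers' API.

(define (claim! opstate)          ; W->C: returns #t if we won the claim
  (eq? 'W (CAS! opstate 'W 'C)))
(define (rollback! opstate)       ; C->W: give up the claim
  (atomic-set! opstate 'W))
(define (commit! opstate)         ; C->S: the operation happened
  (atomic-set! opstate 'S))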

Every time an operation is instantiated via the perform function, we make a new opstate. Operations themselves don't hold any state; only their instantiations do.

The need for the C state wasn't initially obvious to me, but after seeing the recv-op block function below, it will be clear to you I hope.

block functions

The block function itself has two jobs to do. Recall that it's called after the calling fiber has been suspended, and is passed two arguments: a procedure that can be called to resume the fiber with some number of values, and the fresh opstate for this instantiation. Its two jobs are, firstly, to publish the resume function and the opstate to the channel's recvq, and secondly, to try again to receive. That's the "retry" phase I was mentioning before.

Retrying the recv can have three possible results:

  1. If the retry succeeds, we resume the sender. We also have to resume the calling fiber, as it has been suspended already. In general, whatever code manages to commit an operation has to resume any fibers that were waiting on it to complete.

  2. If the operation was already in the S state, that means some other party concurrently completed our operation on our behalf. In that case there's nothing to do; the other party resumed us already.

  3. Otherwise if the operation couldn't proceed, then when the other party or parties arrive, they will be responsible for completing the operation and ultimately resuming our fiber in the future.

With that long prelude out of the way, here's the gnarlies!

(define (block-recv ch resume-recv recv-state)
  (match ch
    (($ $channel recvq sendq)
     ;; Publish -- now others can resume us!
     (enqueue! recvq (vector resume-recv recv-state))
     ;; Try again to receive.
     (let retry ()
       (let ((q (atomic-ref sendq)))
         (match q
           ((head . tail)
            (match head
              (#(val resume-send send-state)
               (match (CAS! recv-state 'W 'C)   ; Claim txn.
                 ('W
                  (match (CAS! send-state 'W 'S)
                    ('W                         ; Case (1): yay!
                     (atomic-set! recv-state 'S)
                     (CAS! sendq q tail)        ; Maybe GC.
                     (resume-send (lambda () (values)))
                     (resume-recv (lambda () val)))
                    ('C                         ; Conflict; retry.
                     (atomic-set! recv-state 'W)
                     (retry))
                    ('S                         ; GC and retry.
                     (atomic-set! recv-state 'W)
                     (CAS! sendq q tail)
                     (retry))))
                 ('S #f)))))                    ; Case (2): cool!
           (() #f)))))))                        ; Case (3): we wait.

As we said, first we publish, then we retry. If there is a sender on the queue, we will try to complete their operation, but before we do that we have to prevent other fibers from completing ours; that's the purpose of going into the C state. If we manage to commit the sender's operation, then we commit ours too, going from C to S; otherwise we roll back to W. If the sender itself was in C then we had a conflict, and we spin to retry. We also try to GC off any completed operations from the sendq via unchecked CAS. If there's no sender on the queue, we just wait.

And that's it for the code! Thank you for suffering through this all. I only left off a few details: the try function can loop if the sender is in the C state, and the block function needs to prevent a (choice-op (send-op A v) (recv-op A)) from sending v to itself. But because opstates are fresh allocations, we can know if a sender is actually ourself by comparing its opstate to ours (with eq?).
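
One detail the article leaves implicit is the sending side. For symmetry, here is a hedged sketch of what a try function for send might look like, mirroring try-recv above; as with receive, it can only succeed if a receiver is already waiting on the recvq.

(define (try-send ch val)
  (match ch
    (($ $channel recvq sendq)
     (let ((q (atomic-ref recvq)))
       (match q
         (() #f)
         ((head . tail)
          (match head
            (#(resume-recv recv-state)
             (match (CAS! recv-state 'W 'S)
               ('W
                (resume-recv (lambda () val))   ; hand the value to the receiver
                (CAS! recvq q tail)             ; maybe GC the stale entry
                (lambda () (values)))           ; send completes with no values
               (_ #f))))))))))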

what about select?

I started in on all this "op" business because I needed to annotate the arguments to select. Did I actually get anywhere? Good news, everyone: it turns out that select doesn't have to be a primitive!

Firstly, note that the choice-op try function just needs to run all try functions of sub-operations (possibly in random order), returning early if one succeeds. Pretty straightforward. And actually the story with the block function is the same: we just run the sub-operation block functions, knowing that the operation will commit at most one time. The only complication is plumbing through the respective wrap functions to all of the sub-operations, but of course that's the point of the wrap facility, so we pay the cost willingly.

(define (choice-op . ops)
  (define (try)
    (or-map
     (match-lambda
      (($ $op sub-try sub-block sub-wrap)
       (define thunk (sub-try))
       (and thunk (compose sub-wrap thunk))))
     ops))
  (define (block resume opstate)
    (for-each
     (match-lambda
      (($ $op sub-try sub-block sub-wrap)
       (define (wrapped-resume results-thunk)
         (resume (compose sub-wrap results-thunk)))
       (sub-block wrapped-resume opstate)))
     ops))
  (define wrap values)
  (make-op try block wrap))

There are optimizations possible, for example to randomize the order of visiting the sub-operations for less deterministic behavior, but this is really all there is.

concurrent ml is inevitable

As far as I understand things, the protocol to implement CML-style operations on channels in a lock-free environment is exactly the same as what's needed if you wrote out the recv function by hand, without abstracting it to a recv-op.

You still need the ability to park a fiber in the block function, and you still need to retry the operation after parking. Although try is just an optimization, it's an optimization that you'll want.

So given that the cost of parallel CML is necessary, you might as well get what you pay for and have your language expose the more expressive CML interface in addition to the more "standard" channel operations.

concurrent ml between pthreads and fibers

One really cool aspect about implementing CML is that the bit that suspends the current thread is isolated in the perform function. Of course if you're in a fiber, you suspend the current fiber as we have described above. But what if you're not? What if you want to use CML to communicate between POSIX threads? You can do that, just create a mutex/cond pair and pass a procedure that will signal the cond as the resume argument to the block function. It just works! The channels implementation doesn't need to know anything about pthreads, or even fibers for that matter.
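
As a sketch of that mutex/cond approach -- illustrative only; perform/pthread and its internals are not Fibers' actual API -- a plain POSIX thread might perform an operation like this:

(use-modules (ice-9 threads) (ice-9 atomic) (ice-9 match))

(define (perform/pthread op)
  (match op
    (($ $op try block wrap)
     (define mutex (make-mutex))
     (define condvar (make-condition-variable))
     (define results-thunk #f)
     (define (resume thunk)
       ;; Called by whichever party commits the operation, possibly
       ;; from a fiber on another core.
       (with-mutex mutex
         (set! results-thunk thunk)
         (signal-condition-variable condvar)))
     (define (block-and-wait)
       ;; Publish our resume procedure and a fresh opstate, then wait
       ;; on the condition variable until someone completes the op.
       (block resume (make-atomic-box 'W))
       (with-mutex mutex
         (let lp ()
           (unless results-thunk
             (wait-condition-variable condvar mutex)
             (lp))))
       results-thunk)
     (call-with-values (or (try) (block-and-wait)) wrap))))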

In fact, you can actually use CML operations to communicate between fibers and full pthreads. This can be really useful if you need to run some truly blocking operation in a side pthread, but you want most of your program to be in fibers.

a meta-note for a meta-language

This implementation was based on the Parallel CML paper from Reppy et al, describing the protocol implemented in Manticore. Since then there's been a lot of development there; you should check out Manticore! I also hear that Reppy has a new version of his "Concurrent Programming in ML" book coming out soon (not sure though).

This work is in Fibers, a concurrency facility for Guile Scheme, built as a library. Check out the manual for full details. Relative to the Parallel CML paper, this work has a couple of differences beyond the superficial name changes (operation for event, perform for sync).

Most significantly, Reppy's CML operations have three phases: poll, do, and block. Fibers uses just two, as in a concurrent context it doesn't make sense to check-then-do. There is no do, only try :)

Additionally the Fibers channel implementation is lockless, with an atomic sendq and recvq. In contrast, Manticore uses a spinlock and hence needs to mask/unmask interrupts at times.

On the other hand, the Parallel CML paper included some model checking work, which Fibers doesn't have. It would be nice to have some more confidence on correctness!

but what about perf

Performance! Does it scale? Let's poke it. Here I'm going to try to isolate my tests to measure the overhead of communication of channels as implemented in terms of Parallel CML ops. I have more real benchmarks for Fibers on a web server workload where it does well, but here I am really trying to focus on CML.

My test system is a 2 x E5-2620v3, which is two sockets each having 6 2.6GHz cores, hyperthreads off, performance governor on all cores. This is a system we use for Snabb testing, so the first core on each socket handles interrupts and all others are reserved; Linux won't schedule anything on them. When you run a fibers system, it will spawn a thread per available core, then set the thread's affinity to that core. In these tests, I'll give benchmarks progressively more cores and see how they do with the workload.

So this is a benchmark measuring total message sends per second on a chain of fibers communicating over channels. For 0 links, that means that there's just a sender and a receiver and no intermediate links. For 10 links, each message is relayed 10 times, for 11 total sends in the chain and 12 total fibers. For 0 links we expect pretty much no parallel speedup, and no slowdown, and that's what we see; but when we get to more links, we should expect more throughput. The fibers are allocated to cores at random (a randomized round-robin initial scheduling, then after that fibers have core affinity; though there is a limited work-stealing phase).

You would think that the 1-core case would be the same for all of them. Unfortunately it seems that currently there is a fixed cost for bouncing through epoll to pick up new I/O tasks, even though there are no I/O runnables in this test and the timeout is 0, so it will return immediately. It's definitely something to look into as it's a cost that all cores are paying.

Initially I expected a linear speedup but that's not what we're seeing. But then I thought about it and revised my expectations :) As we add more cores, we add more communication; we should see sublinear speedups as we have to do more cross-core wakeups and synchronizations. After all, we aren't measuring a nice parallelizable computational workload: we're measuring overhead.

On the other hand, the diminishing returns effect is pretty bad, and then we hit the NUMA cliff: as we cross from 6 to 7 cores, we start talking to the other CPU socket and everything goes to shit.

But here it's hard to isolate the test from three external factors, whose impact I don't understand: firstly, that Fibers itself has a significant wakeup cost for remote schedulers. I haven't measured contention on scheduler inboxes, but I suspect one issue is that when a remote scheduler has decided it has no runnables, it will sleep in epoll; and to wake it up we need to write on a socketpair. Guile can avoid that when there are lots of runnables and we see the remote scheduler isn't sleeping, but it's not perfect.

Secondly, Guile is a bytecode VM. I measured that Guile retires about 0.4 billion instructions per second per core on the test machine, whereas a 4 IPC native program will retire about 10 billion. There's overhead at various points, some of which will go away with native compilation in Guile but some might not for a while, given that Go (for example) has baked-in support for channels. So to what extent is it the protocol and to what extent the implementation overhead? I don't know.

Finally, and perhaps most importantly, we can't isolate this test from the garbage collector. Guile still uses the Boehm GC, which is just OK I think. It does have a nice parallel mark phase, but it uses POSIX signals to pause program threads instead of having those threads reach safepoints; and it's completely NUMA-unaware.

So, with all of those caveats mentioned, let's see a couple more graphs :) Firstly, similar to the previous one, here's total message send rate for N pairs of fibers that ping-pong their message back and forth. Core allocation was randomized round-robin.

My conclusion here is that when more fibers are runnable per scheduler turn, the overhead of the epoll phase is less.

Here's a test where there's one fiber producer, and N fibers competing to consume the messages sent. Ultimately we expect that the rate will be limited on the producer side, but there's still a nice speedup.

Next is a pretty weak-sauce benchmark where we're computing diagonal lengths on an N-dimensional cube; the squares of the dimensions happen in parallel fibers, then one fiber collects those lengths, sums and makes a square root.

The workload on that one is just very low, and the serial components become a bottleneck quickly. I think I need to rework that test case.

Finally, there's a false sieve of Eratosthenes, in which every time we find a prime, we add another fiber onto the sieve chain that filters out multiples of that prime.

Even though the workload is really small, we still see speedups, which is somewhat satisfying. Still, on all of these, the NUMA cliff is something fierce.

For me what these benchmarks show is that there are still some bottlenecks to work on. We do OK in the handful-of-cores scenario, but the system as a whole doesn't really scale past that. On more real benchmarks with bigger workloads and proportionally much less communication, I get much more satisfactory results; but those tend to be I/O heavy anyway, so the bottleneck is elsewhere.

closing notes

There are other parts to CML events, namely guard functions and withNack functions. My understanding is that these are implementable in terms of this "primitive" CML as described here; that was a result of earlier work by Matthew Fluet. I haven't actually implemented these yet! A to-do item, truly.

There are other event types in CML systems of course! Besides being able to implement operations yourself, there are built-in condition variables (cvars), timeouts, thread join events, and so on. The Fibers manual mentions some of these, but it's an open set.

Finally and perhaps most significantly, Aaron Turon did some work a few years ago on "Reagents", a pattern library for composing parallel and concurrent operations, initially in Scala. It's claimed that Reagents generalizes CML. Is this the case? I am looking forward to finding out.

OK, that's it for this verrrrry long post :) I hope that you found that this made parallel CML seem a bit more approachable and interesting, whether as a language implementor, a library implementor, or a user. Comments and corrections welcome. Check out Fibers and give it a go!

growing fibers
Andy Wingo, wingolog (https://wingolog.org/2017/06/27/growing-fibers), 27 June 2017

Good day, Schemers!

Over the last 12 to 18 months, as we were preparing for the Guile 2.2 release, I was growing increasingly dissatisfied at not having a good concurrency story in Guile.

I wanted to be able to spawn a million threads on a core, to support highly-concurrent I/O servers, and Guile's POSIX threads are just not the answer. I needed something different, and this article is about the search for and the implementation of that thing.

on pthreads

It's worth being specific why POSIX threads are not a great abstraction. One is that they don't compose: two pieces of code that use mutexes won't necessarily compose together. A correct component A that takes locks might call a correct component B that takes locks, and the other way around, and if both happen concurrently you get the classic deadly-embrace deadlock.

POSIX threads are also terribly low-level. Asking someone to build a system with mutexes and cond vars is like building a house with exploding toothpicks.

I want to program network services in a straightforward way, and POSIX threads don't help me here either. I'd like to spawn a million "threads" (scare-quotes!), one for each client, each one just looping: reading a request, computing and writing the response, and so on. POSIX threads aren't the concrete implementation of this abstraction though, as in most systems you can't have more than a few thousand of them.

Finally as a Guile maintainer I have a duty to tell people the good ways to make their programs, but I can't in good conscience recommend POSIX threads to anyone. If someone is a responsible programmer, then yes we can discuss details of POSIX threads. But for a new Schemer? Never. Recommending POSIX threads is malpractice.

on scheme

In Scheme we claim to be minimalists. Whether we actually are that or not is another story, but it's true that we have a culture of trying to grow expressive systems from minimal primitives.

It's sometimes claimed that in Scheme, we don't need threads because we have call-with-current-continuation, an ultrapowerful primitive that lets us implement any kind of control structure we want. (The name screams for an abbreviation, so the alias call/cc is blessed; minimalism is whatever we say it is, right?) Unfortunately it turned out that while call/cc can implement any control abstraction, it can't implement any two. Abstractions built on call/cc don't compose!

Fortunately, there is a way to build powerful control abstractions that do compose. This article covers the first half of composing a concurrency facility out of a set of more basic primitives.

Just to be concrete, I have to start with a simple implementation of an event loop. We're going to build on it later, but for now, here we go:

(define (run sched)
  (match sched
    (($ $sched inbox i/o)
     (define (dequeue-tasks)
       (append (dequeue-all! inbox)
               (poll-for-tasks i/o)))
     (let lp ()
       (for-each (lambda (task) (task))
                 (dequeue-tasks))
       (lp)))))

This is a scheduler that is a record with two fields, inbox and i/o.

The inbox holds a queue of pending tasks, as thunks (procedures of no arguments). When something wants to enqueue a task, it posts a thunk to the inbox.

On the other hand, when a task needs to wait on some external input or output becoming available, it will register an event with i/o. Typically i/o will be a simple combination of an epollfd and a mapping of tasks to enqueue when a file descriptor becomes readable or writable. poll-for-tasks does the underlying epoll_wait call that pulls new I/O events from the kernel.
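
To make the data structures concrete, here is a minimal sketch of the scheduler record and inbox assumed by the run loop above. It is hypothetical and single-threaded (the multi-core version in Fibers is more careful), with the inbox as an atomic box holding a list of thunks.

(use-modules (srfi srfi-9) (ice-9 atomic))

(define-record-type $sched
  (make-sched inbox i/o)
  sched?
  (inbox sched-inbox)   ; atomic box holding a list of pending thunks
  (i/o sched-i/o))      ; epoll state: fd -> tasks to wake

(define (enqueue! inbox thunk)
  ;; Not safe against concurrent posters as written; fine for a
  ;; single-threaded event loop.
  (atomic-box-set! inbox (cons thunk (atomic-box-ref inbox))))

(define (dequeue-all! inbox)
  (let ((tasks (atomic-box-ref inbox)))
    (atomic-box-set! inbox '())
    (reverse tasks)))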

There are some details I'm leaving out, like when to have epoll_wait return directly, and when to have it wait for some time, and how to wake it up if it's sleeping while a task is posted to the scheduler's inbox, but ultimately this is the core of an event loop.

a long digression

Now you might think that I'm getting a little far afield from what my goal was, which was threads or fibers or something. But that's OK, let's go a little farther and talk about "prompts". The term "prompt" comes from the experience you get when you work on the command-line:

/home/wingo% ./prog

I don't know about you all, but I have the feeling that the /home/wingo% has a kind of solid reality, that my screen is not just an array of characters but there is a left-hand-side that belongs to the system, and a right-hand-side that's mine. The two parts are delimited by a prompt. Well prompts in Scheme allow you to provide this abstraction within your program: you can establish a program part that's a "system" facility, for whatever definition of "system" suits your purposes, and a part that's for the "user".

In a way, prompts generalize a pattern of system/user division that has special facilities in other programming languages, such as a try/catch block.

try {
  foo();
} catch (e) {
  bar();
}

Here again the "user" code is the part inside the braces (the calls to foo() and bar()), while the try/catch machinery belongs to the "system". Some other examples of control flow patterns that prompts generalize would be early exit of a subcomputation, coroutines, and nondeterministic choice like SICP's amb operator. Coroutines are obviously where I'm headed here in the context of this article, but still there are some details to go over.

To make a prompt in Guile, you can use the % operator, which is pronounced "prompt":

(use-modules (ice-9 control))

(% expr
   (lambda (k . args) #f))

The name for this operator comes from Dorai Sitaram's 1993 paper, Handling Control; it's actually a pun on the tcsh prompt, if you must know. Anyway the basic idea in this example is that we run expr, but if it aborts we run the lambda handler instead, which just returns #f.

Really % is just syntactic sugar for call-with-prompt though. The previous example desugars to something like this:

(let ((tag (make-prompt-tag)))
  (call-with-prompt tag
    ;; Body:
    (lambda () expr)
    ;; Escape handler:
    (lambda (k . args) #f)))

(It's not quite the same; % uses a "default prompt tag". This is just a detail though.)

You see here that call-with-prompt is really the primitive. It will call the body thunk, but if an abort occurs within the body to the given prompt tag, then the body aborts and the handler is run instead.

So if you want to define a primitive that runs a function but allows early exit, we can do that:

(define-module (my-module)
  #:export (with-return))

(define-syntax-rule (with-return return body ...)
  (let ((t (make-prompt-tag)))
    (define (return . args)
      (apply abort-to-prompt t args))
    (call-with-prompt t
      (lambda () body ...)
      (lambda (k . rvals)
        (apply values rvals)))))

Here we define a module with a little with-return macro. We can use it like this:

(use-modules (my-module))

(with-return return
  (+ 3 (return 42)))
;; => 42

As you can see, calling return within the body will abort the computation and cause the with-return expression to evaluate to the arguments passed to return.

But what's up with the handler? Let's look again at the form of the call-with-prompt invocations we've been making.

(let ((tag (make-prompt-tag)))
  (call-with-prompt tag
    (lambda () ...)
    (lambda (k . args) ...)))

With the with-return macro, the handler took a first k argument, threw it away, and returned the remaining values. But the first argument to the handler is pretty cool: it is the continuation of the computation that was aborted, delimited by the prompt: meaning, it's the part of the computation between the abort-to-prompt and the call-with-prompt, packaged as a function that you can call.

If you call the k, the delimited continuation, you reinstate it:

(define (f)
  (define tag (make-prompt-tag))
  (call-with-prompt tag
   (lambda ()
     (+ 3
        (abort-to-prompt tag)))
   (lambda (k) k)))

(let ((k (f)))
  (k 1))
;; => 4

Here, the abort-to-prompt invocation behaved simply like a "suspend" operation, returning the suspended computation k. Calling that continuation resumes it, supplying the value 1 to the saved continuation (+ 3 []), resulting in 4.

Basically, when a delimited continuation suspends, the first argument to the handler is a function that can resume the continuation.

tasks to fibers

And with that, we just built coroutines in terms of delimited continuations. We can turn our scheduler inside-out, giving the illusion that each task runs in its own isolated fiber.

(define tag (make-prompt-tag))

(define (call/susp thunk)
  (define (handler k on-suspend) (on-suspend k))
  (call-with-prompt tag thunk handler))

(define (suspend on-suspend)
  (abort-to-prompt tag on-suspend))

(define (schedule thunk)
  (match (current-scheduler)
    (($ $sched inbox i/o)
     (enqueue! inbox (lambda () (call/susp thunk))))))

So! Here we have a system that can run a thunk in a scheduler. Fine. No big deal. But if the thunk calls suspend, then it causes an abort back to a prompt. suspend takes a procedure as an argument, the on-suspend procedure, which will be called with one argument: the suspended continuation of the thunk. We've layered coroutines on top of the event loop.

Guile's virtual machine is a normal register virtual machine with a stack composed of function frames. It's not necessary to do full CPS conversion to implement delimited control, but if you don't, then your virtual machine needs primitive support for call-with-prompt, as Guile's VM does. In Guile then, a suspended continuation is an object composed of the slice of the stack captured between the prompt and the abort, and also the slice of the dynamic stack. (Guile keeps a parallel stack for dynamic bindings. Perhaps we should unify these; dunno.) This object is wrapped in a little procedure that uses VM primitives to push those stack frames back on, and continue.

I say all this just to give you a mental idea of what it costs to suspend a fiber. It will allocate storage proportional to the stack depth between the prompt and the abort. Usually this is a few dozen words, if there are 5 or 10 frames on the stack in the fiber.

We've gone from prompts to coroutines, and from here to fibers there's just a little farther to go. First, note that spawning a new fiber is simply scheduling a thunk:

(define (spawn-fiber thunk)
  (schedule thunk))

Many threading libraries provide a "yield" primitive, which simply suspends the current thread, allowing others to run. We can do this for fibers directly:

(define (yield)
  (suspend schedule))

Note that the on-suspend procedure here is just schedule, which re-schedules the continuation (but presumably at the back of the queue).

Similarly if we are reading on a non-blocking file descriptor and detect that we need more input before we can continue, but none is available, we can suspend and arrange for the epollfd to resume us later:

(define (wait-for-readable fd)
  (suspend
   (lambda (k)
     (match (current-scheduler)
       (($ $sched inbox i/o)
        (add-read-fd! i/o fd
                      (lambda () (schedule k))))))))

In Guile you can arrange to install this function as the "current read waiter", causing it to run whenever a port would block. The details are a little gnarly currently; see the Non-blocking I/O manual page for more.
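
For instance, installing it for the extent of some server code might look like this; current-read-waiter is the parameter described in that manual section, and run-server is a hypothetical procedure doing non-blocking port I/O.

(parameterize ((current-read-waiter wait-for-readable))
  (run-server))   ; hypothetical: reads from non-blocking ports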

Anyway the cool thing is that I can run any thunk within a spawn-fiber, without modification, and it will run as if in a new thread of some sort.

solid abstractions?

I admit that although I am very happy with Emacs, I never really took to using the shell from within Emacs. I always have a terminal open with a bunch of tabs. I think the reason for that is that I never quite understood why I could move the cursor over the bash prompt, or into previous expressions or results; it seemed like I was waking up groggily from some kind of dream where nothing was real. I like the terminal, where the only bit that's "mine" is the current command. All the rest is immutable text in the scrollback.

Similarly when you make a UI, you want to design things so that people perceive the screen as being composed of buttons and so on, not just lines. In essence you trick the user, a willing user who is ready to be tricked, into seeing buttons and text and not just weird pixels.

In the same way, with fibers we want to provide the illusion that fibers actually exist. To solidify this illusion, we're still missing a few elements.

One point relates to error handling. As it is, if an error happens in a fiber and the fiber doesn't handle it, the exception propagates out of the fiber, through the scheduler, and might cause the whole program to error out. So we need to wrap fibers in a catch-all.

(define (spawn-fiber thunk)
  (schedule
   (lambda ()
     (catch #t thunk
       (lambda (key . args)
         (print-exception (current-error-port) #f key args))))))

Well, OK. Exceptions won't propagate out of fibers, yay. In fact in Guile we add another catch inside the print-exception, in case the print-exception throws an exception... Anyway. Cool.

Another point relates to fiber-local variables. In an operating system, each process has a number of variables that are local to it, notably in UNIX we have the umask, the current effective user, the current directory, the open files and what file descriptors they are associated with, and so on. In Scheme we have similar facilities in the form of parameters.

Now the usual way that parameters are used is to bind a new value within the extent of some call:

(define (with-output-to-string thunk)
  (let ((p (open-output-string)))
    (parameterize ((current-output-port p))
      (thunk))
    (get-output-string p)))

Here the parameterize invocation established p as the current output port during the call to thunk. Parameters already compose quite well with prompts; Guile, like Racket, implements the protocol described by Kiselyov, Shan, and Sabry in their Delimited Dynamic Binding paper (well worth a read!).

The one missing piece is that parameters in Scheme are mutable (by default). Normally if you call (current-input-port), you just get the current value of the current input port parameter. But if you pass an argument, like (current-input-port p), then you actually set the current input port to that new value. This value will be in place until we leave some parameterize invocation that parameterizes the current input port.
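
A tiny sketch of those two uses of a parameter; p here is just an example parameter made for illustration, not anything from Guile's API.

(define p (make-parameter 42))
(p)       ;; => 42: read the current value
(p 100)   ;; set the current value...
(p)       ;; => 100: ...until we unwind past an enclosing parameterize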

The problem here is that it could be that there's an interesting parameter which some piece of Scheme code will want to just mutate, so that all further Scheme code will use the new value. This is fine if you have no concurrency: there's just one thing running. But when you have many fibers, you want to avoid mutations in one fiber from affecting others. You want some isolation with regards to parameters. In Guile, we do this with the with-dynamic-state facility, which isolates changes to the dynamic state (parameters and so on) within the extent of the with-dynamic-state call.

(define (spawn-fiber thunk)
  (let ((state (current-dynamic-state)))
    (schedule
     (lambda ()
       (catch #t
         (lambda ()
           (with-dynamic-state state thunk))
         (lambda (key . args)
           (print-exception (current-error-port) #f key args)))))))

Interestingly, with-dynamic-state solves another problem as well. You would like for newly spawned fibers to inherit the parameters from the point at which they were spawned.

(parameterize ((current-output-port p))
  (spawn-fiber
   ;; New fiber should inherit current-output-port
   ;; binding as "p"
   (lambda () ...)))

Capturing the (current-dynamic-state) outside the thunk does this for us.

When I made this change in Guile, making sure that with-dynamic-state did not impose a continuation barrier, I ran into a problem. In Guile we implemented exceptions in terms of delimited continuations and dynamic binding. The current stack of exception handlers was a list, and each element included the exceptions handled by that handler, and which prompt to abort to before running the exception handler. See where the problem is? If we ship this exception handler stack over to a new fiber, then an exception propagating out of the new fiber would be looking up handlers from another fiber, for prompts that probably aren't even on the stack any more.

The problem here is that if you store a heap-allocated stack of current exception handlers in a dynamic variable, and that dynamic variable is captured somehow (say, by a delimited continuation), then you capture the whole stack of handlers, not (in the case of delimited continuations) the delimited set of handlers that were active within the prompt. To fix this, we had to change Guile's exceptions to instead make catch just rebind the exception handler parameter to hold the handler installed by the catch. If Guile needs to walk the chain of exception handlers, we introduced a new primitive fluid-ref* to do so, building the chain from the current stack of parameterizations instead of some representation of that stack on the heap. It's O(n), but life is that way sometimes. This way also, delimited continuations capture the right set of exception handlers.

Finally, Guile also supports asynchronous interrupts. We can arrange to interrupt a Guile process (or POSIX thread) every so often, as measured in wall-clock or process time. It used to be that interrupt handlers caused a continuation barrier, but this is no longer the case, so now we can add pre-emption to a fibers system using interrupts.

summary and reflections

In Guile we were able to create a solid-seeming abstraction for fibers by composing other basic building blocks from the Scheme toolkit. Guile users can take an abstraction that's implemented in terms of an event loop (any event loop) and layer fibers on top in a way that feels "real". We were able to do this because we have prompts (delimited continuation) and parameters (dynamic binding), and we were able to compose the two. Actually getting it all to work required fixing a few bugs.

In Fibers, we just use delimited continuations to implement coroutines, and then our fibers are coroutines. If we had coroutines as a primitive, that would work just as well. As it is, each suspension of a fiber will allocate a new continuation. Perhaps this is unimportant, given the average continuation size, but it would be comforting in a way to be able to re-use the allocation from the previous suspension (if any). Other languages with coroutine primitives might have an advantage here, though delimited dynamic binding is still relatively uncommon.

Another point is that because we use prompts to suspend fibers, we effectively are always unwinding and rewinding the dynamic state. In practice this should be transparent to the user and the implementor should make this transparent from a performance perspective, with the exception of dynamic-wind. Basically any fiber suspension will run the "out" guard of any enclosing dynamic-wind, and resumption will run the "in" guard. In practice we find that we defer "finalization" issues to with-throw-handler / catch, which unlike dynamic-wind don't run on every entry or exit of a dynamic extent and rather just run on exceptional exits. We will see over time if this situation is acceptable. It's certainly another nail in the coffin of dynamic-wind though.
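
A small sketch of the dynamic-wind behavior just described, using the yield defined earlier in this article:

(dynamic-wind
  (lambda () (display "in\n"))    ; runs on entry and on every resumption
  (lambda () (yield))             ; suspending the fiber runs the "out" guard
  (lambda () (display "out\n")))  ; runs on exit and on every suspension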

This article started with pthreads malaise, and although we've solved the problem of having a million fibers, we haven't solved the communications problem. How should fibers communicate with each other? This is the topic for my next article. Until then, happy hacking :)