encyclopedia snabb and the case of the foreign drivers

24 February 2017 5:37 PM (snabb | igalia | networking | dpdk | drivers | lca | lwn | conway's law)

Peoples of the blogosphere, welcome back to the solipsism! Happy 2017 and all that. Today's missive is about Snabb (formerly Snabb Switch), a high-speed networking project we've been working on at work for some years now.

What's Snabb all about you say? Good question and I have a nice answer for you in video and third-party textual form! This year I managed to make it to linux.conf.au in lovely Tasmania. Tasmania is amazing, with wild wombats and pademelons and devils and wallabies and all kinds of things, and they let me talk about Snabb.

You can check that video on the youtube if the link above doesn't work; slides here.

Jonathan Corbet from LWN wrote up the talk in an article here, which besides being flattering is a real windfall as I don't have to write it up myself :)

In that talk I mentioned that Snabb uses its own drivers. We were recently approached by a customer with a simple and honest question: does this really make sense? Is it really a win? Why wouldn't we just use the work that the NIC vendors have already put into their drivers for the Data Plane Development Kit (DPDK)? After all, part of the attraction of a switch to open source is that you will be able to take advantage of the work that others have produced.

Our answer is that while it is indeed possible to use drivers from DPDK, there are costs and benefits on both sides and we think that when we weigh it all up, it makes both technical and economic sense for Snabb to have its own driver implementations. It might sound counterintuitive on the face of things, so I wrote this long article to discuss some perhaps under-appreciated points about the tradeoff.

Technically speaking there are generally two ways you can imagine incorporating DPDK drivers into Snabb:

  1. Bundle a snapshot of the DPDK into Snabb itself.

  2. Somehow make it so that Snabb could (perhaps optionally) compile against a built DPDK SDK.

As part of a software-producing organization that ships solutions based on Snabb, I need to be able to ship a "known thing" to customers. When we ship the lwAFTR, we ship it in source and in binary form. For both of those deliverables, we need to know exactly what code we are shipping. We achieve that by having a minimal set of dependencies in Snabb -- only LuaJIT and three Lua libraries (DynASM, ljsyscall, and pflua) -- and we include those dependencies directly in the source tree. This requirement of ours rules out (2), so the option under consideration is only (1): importing the DPDK (or some part of it) directly into Snabb.

So let's start by looking at Snabb and the DPDK from the top down, comparing some metrics, seeing how we could make this combination.

                                    Snabb   DPDK
Code lines                          61K     583K
Contributors (all-time)             60      370
Contributors (since Jan 2016)       32      240
Non-merge commits (since Jan 2016)  1.4K    3.2K

These numbers aren't directly comparable, of course; in Snabb our unit of code change is the merge rather than the commit, and in Snabb we include a number of production-ready applications like the lwAFTR and the NFV, but they are fine enough numbers to start with. What seems clear is that the DPDK project is significantly larger than Snabb, so adding it to Snabb would fundamentally change the nature of the Snabb project.

So depending on the DPDK makes it so that suddenly Snabb jumps from being a project that compiles in a minute to being a much more heavy-weight thing. That could be OK if the benefits were high enough and if there weren't other costs, but there are indeed other costs to including the DPDK:

  • Data-plane control. Right now when I ship a product, I can be responsible for the whole data plane: everything that happens on the CPU when packets are being processed. This includes the driver, naturally; it's part of Snabb and if I need to change it or if I need to understand it in some deep way, I can do that. But if I switch to third-party drivers, this is now out of my domain; there's a wall between me and something that is running on my CPU. And if there is a performance problem, I now have someone to blame that's not myself! From the customer perspective this is terrible, as you want the responsibility for software to rest in one entity.

  • Impedance-matching development costs. Snabb is written in Lua; the DPDK is written in C. I will have to build a bridge, and keep it up to date as both Snabb and the DPDK evolve. This impedance-matching layer is also another source of bugs; either we make a local impedance matcher in C or we bind everything using LuaJIT's FFI. In the former case, it's a lot of duplicate code, and in the latter we lose compile-time type checking, which is a no-go given that the DPDK can and does change API and ABI.

  • Communication costs. The DPDK development list had 3K messages in January. Keeping up with DPDK development would become necessary, as the DPDK is now in your dataplane, but it costs significant amounts of time.

  • Costs relating to mismatched goals. Snabb tries to win development and run-time speed by searching for simple solutions. The DPDK tries to be a showcase for NIC features from vendors, placing less of a priority on simplicity. This is a very real cost in the form of the way network packets are represented in the DPDK, with support for such features as scatter/gather and indirect buffers. In Snabb we were able to do away with this complexity by having simple linear buffers, and our speed did not suffer; adding the DPDK again would either force us to marshal and unmarshal these buffers into and out of the DPDK's format, or otherwise to reintroduce this particular complexity into Snabb.

  • Abstraction costs. A network function written against the DPDK typically uses at least three abstraction layers: the "EAL" environment abstraction layer, the "PMD" poll-mode driver layer, and often an internal hardware abstraction layer from the network card vendor. (And some of those abstraction layers are actually external dependencies of the DPDK, as with Mellanox's ConnectX-4 drivers!) Any discrepancy between the goals and/or implementation of these layers and the goals of a Snabb network function is a cost in developer time and in run-time. Note that those low-level HAL facilities aren't considered acceptable in upstream Linux kernels, for all of these reasons!

  • Stay-on-the-train costs. The DPDK is big and sometimes its abstractions change. As a minor player just riding the DPDK train, we would have to invest a continuous amount of effort into just staying aboard.

  • Fork costs. The Snabb project has a number of contributors but is really run by Luke Gorrie. Because Snabb is so small and understandable, if Luke decided to stop working on Snabb or take it in a radically different direction, I would feel comfortable continuing to maintain (a fork of) Snabb for as long as is necessary. If the DPDK changed goals for whatever reason, I don't think I would want to continue to maintain a stale fork.

  • Overkill costs. Drivers written against the DPDK have many considerations that simply aren't relevant in a Snabb world: kernel drivers (KNI), special NIC features that we don't use in Snabb (RDMA, offload), non-x86 architectures with different barrier semantics, threads, complicated buffer layouts (chained and indirect), interaction with specific kernel modules (uio-pci-generic / igb-uio / ...), and so on. We don't need all of that, but we would have to bring it along for the ride, and any changes we might want to make would have to take these use cases into account so that other users won't get mad.

So there are lots of costs if we were to try to hop on the DPDK train. But what about the benefits? The goal of relying on the DPDK would be that we "automatically" get drivers, and ultimately that a network function would be driver-agnostic. But this is not necessarily the case. Each driver has its own set of quirks and tuning parameters; in order for a software development team to be able to support a new platform, the team would need to validate the platform, discover the right tuning parameters, and modify the software to configure the platform for good performance. Sadly this is not a trivial amount of work.

Furthermore, using a different vendor's driver isn't always easy. Consider Mellanox's DPDK ConnectX-4 / ConnectX-5 support: the "Quick Start" guide has you first install MLNX_OFED in order to build the DPDK drivers. What is this thing exactly? You go to download the tarball and it's 55 megabytes. What's in it? 30 other tarballs! If you build it somehow from source instead of using the vendor binaries, then what do you get? All that code, running as root, with kernel modules, and implementing systemd/sysvinit services!!! And this is just step one!!!! Worse yet, this enormous amount of code powering a DPDK driver is mostly driver-specific; what we hear from colleagues whose organizations decided to bet on the DPDK is that you don't get to amortize much knowledge or validation when you switch between an Intel and a Mellanox card.

In the end when we ship a solution, it's going to be tested against a specific NIC or set of NICs. Each NIC will add to the validation effort. So if we were to rely on the DPDK's drivers, we would have paid all the costs but we wouldn't save very much in the end.

There is another way. Instead of relying on so much third-party code that it is impossible for any one person to grasp the entirety of a network function, much less be responsible for it, we can build systems small enough to understand. In Snabb we just read the data sheet and write a driver. (Of course we also benefit by looking at DPDK and other open source drivers to see how they structure things.) By only including what is needed, Snabb drivers are typically only a thousand or two thousand lines of Lua. With a driver of that size, it's possible for even a small ISV or in-house developer to "own" the entire data plane of whatever network function you need.

Of course Snabb drivers have costs too. What are they? Are customers going to be stuck forever paying for drivers for every new card that comes out? It's a very good question and one that I know is in the minds of many.

Obviously I don't have the whole answer, as my role in this market is a software developer, not an end user. But having talked with other people in the Snabb community, I see it like this: Snabb is still in relatively early days. What we need are about three good drivers. One of them should be for a standard workhorse commodity 10Gbps NIC, which we have in the Intel 82599 driver. That chipset has been out for a while so we probably need to update it to the current commodities being sold. Additionally we need a couple cards that are going to compete in the 100Gbps space. We have the Mellanox ConnectX-4 and presumably ConnectX-5 drivers on the way, but there's room for another one. We've found that it's hard to actually get good performance out of 100Gbps cards, so this is a space in which NIC vendors can differentiate their offerings.

We budget somewhere between 3 and 9 months of developer time to create a completely new Snabb driver. Of course it usually takes less time to develop Snabb support for a NIC that is only incrementally different from others in the same family that already have drivers.

We see this driver development work to be similar to the work needed to validate a new NIC for a network function, with the additional advantage that it gives us up-front knowledge instead of the best-effort testing later in the game that we would get with the DPDK. When you add all the additional costs of riding the DPDK train, we expect that the cost of Snabb-native drivers competes favorably against the cost of relying on third-party DPDK drivers.

In the beginning it's natural that early adopters of Snabb make investments in this base set of Snabb network drivers, as they would to validate a network function on a new platform. However over time as Snabb applications start to be deployed over more ports in the field, network vendors will also see that it's in their interests to have solid Snabb drivers, just as they now see with the Linux kernel and with the DPDK, and given that the investment is relatively low compared to their already existing efforts in Linux and the DPDK, it is quite feasible that we will see the NIC vendors of the world start to value Snabb for the performance that it can squeeze out of their cards.

So in summary, in Snabb we are convinced that writing minimal drivers that are adapted to our needs is an overall win compared to relying on third-party code. It lets us ship solutions that we can feel responsible for: both for their operational characteristics as well as their maintainability over time. Still, we are happy to learn and share with our colleagues all across the open source high-performance networking space, from the DPDK to VPP and beyond.

An incomplete history of language facilities for concurrency

12 October 2016 1:45 PM (pl | concurrency | erlang | go | csp | guile | fibers | callback hell)

I have lately been in the market for better concurrency facilities in Guile. I want to be able to write network servers and peers that can gracefully, elegantly, and efficiently handle many tens of thousands of clients and other connections, but without blowing the complexity budget. It's a hard nut to crack.

Part of the problem is implementation, but a large part is just figuring out what to do. I have often thought that modern musicians must be crushed under the weight of recorded music history, but it turns out in our humble field that's also the case; there are as many concurrency designs as languages, just about. In this regard, what follows is an incomplete, nuanced, somewhat opinionated history of concurrency facilities in programming languages, with an eye towards what I should "buy" for the Fibers library I have been tinkering on for Guile.

* * *

Modern machines have the raw capability to serve hundreds of thousands of simultaneous long-lived connections, but it’s often hard to manage this at the software level. Fibers tries to solve this problem in a nice way. Before discussing the approach taken in Fibers, it’s worth spending some time on history to see how we got here.

One of the most dominant patterns for concurrency these days is “callbacks”, notably in the Twisted library for Python and the Node.js run-time for JavaScript. The basic observation in the callback approach to concurrency is that the efficient way to handle tens of thousands of connections at once is with low-level operating system facilities like poll or epoll. You add all of the file descriptors that you are interested in to a “poll set” and then ask the operating system which ones are readable or writable, as appropriate. Once the operating system says “yes, file descriptor 7145 is readable”, you can do something with that socket; but what? With callbacks, the answer is “call a user-supplied closure”: a callback, representing the continuation of the computation on that socket.
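To make the shape of that concrete, here is a minimal sketch in Go of a callback registry and dispatch loop. It is an illustration under assumptions, not how Twisted or Node.js are built: the names EventLoop, OnReadable, and Dispatch are made up for this example, and instead of calling poll or epoll the loop is simply handed the list of ready descriptors.

```go
package main

import "fmt"

// Callback is the user-supplied continuation to run when a descriptor is ready.
type Callback func(fd int)

// EventLoop maps file descriptors to callbacks. A real loop would learn about
// readiness from poll or epoll; here the caller supplies it (an assumption
// made to keep the sketch portable).
type EventLoop struct {
	onReadable map[int]Callback
}

func NewEventLoop() *EventLoop {
	return &EventLoop{onReadable: make(map[int]Callback)}
}

// OnReadable adds fd to the "poll set" along with its continuation.
func (l *EventLoop) OnReadable(fd int, cb Callback) {
	l.onReadable[fd] = cb
}

// Dispatch runs the callback for each descriptor reported ready and returns
// how many callbacks ran. Callbacks are one-shot: to keep reading from a
// descriptor, the callback has to register again.
func (l *EventLoop) Dispatch(ready []int) int {
	ran := 0
	for _, fd := range ready {
		if cb, ok := l.onReadable[fd]; ok {
			delete(l.onReadable, fd)
			cb(fd)
			ran++
		}
	}
	return ran
}

func main() {
	loop := NewEventLoop()
	loop.OnReadable(7145, func(fd int) {
		fmt.Println("file descriptor", fd, "is readable")
	})
	// Pretend the operating system reported these descriptors as readable.
	loop.Dispatch([]int{7145, 9})
}
```

Note how the "rest of the computation" on descriptor 7145 has to be packaged up as a closure before the loop can do anything with it.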

Building a network service with a callback-oriented concurrency system means breaking the program into little chunks that can run without blocking. Wherever a program could block, instead of just continuing the program, you register a callback. Unfortunately this requirement permeates the program, from top to bottom: you always pay the mental cost of inverting your program’s control flow by turning it into callbacks, and you always incur the run-time cost of closure creation, even when the particular I/O could proceed without blocking. It’s a somewhat galling requirement, given that this contortion is required of the programmer, but could be done by the compiler. We Schemers demand better abstractions than manual, obligatory continuation-passing-style conversion.

Callback-based systems also encourage unstructured concurrency, as in practice callbacks are not the only path for data and control flow in a system: usually there is mutable global state as well. Without strong patterns and conventions, callback-based systems often exhibit bugs caused by concurrent reads and writes to global state.

Some of the problems of callbacks can be mitigated by using “promises” or other library-level abstractions; if you’re a Haskell person, you can think of this as lifting all possibly-blocking operations into a monad. If you’re not a Haskeller, that’s cool, neither am I! But if your typey spidey senses are tingling, it’s for good reason: with promises, your whole program has to be transformed to return promises-for-values instead of values anywhere it would block.

An obvious solution to the control-flow problem of callbacks is to use threads. In the most generic sense, a thread is a language feature which denotes an independent computation. Threads are created by other threads, but fork off and run independently instead of returning to their caller. In a system with threads, there is implicitly a scheduler somewhere that multiplexes the threads so that when one suspends, another can run.

In practice, the concept of threads is often conflated with a particular implementation, kernel threads. Kernel threads are very low-level abstractions that are provided by the operating system. The nice thing about kernel threads is that they can use any CPU that the kernel knows about. That’s an important factor in today’s computing landscape, where Moore’s law seems to be giving us more cores instead of more gigahertz.

However, as a building block for a highly concurrent system, kernel threads have a few important problems.

One is that kernel threads simply aren’t designed to be allocated in huge numbers, and instead are more optimized to run in a one-per-CPU-core fashion. Their memory usage is relatively high for what should be a lightweight abstraction: some 10 kilobytes at least and often some megabytes, in the form of the thread’s stack. There are ongoing efforts to reduce this for some systems but we cannot expect wide deployment in the next 5 years, if ever. Even in the best case, a hundred thousand kernel threads will take at least a gigabyte of memory, which seems a bit excessive for book-keeping overhead.

Kernel threads can be a bit irritating to schedule, too: when one thread suspends, it’s for a reason, and it can be that user-space knows a good next thread that should run. However because kernel threads are scheduled in the kernel, it’s rarely possible for the kernel to make informed decisions. There are some “user-mode scheduling” facilities that are in development for some systems, but again only for some systems.

The other significant problem is that building non-crashy systems on top of kernel threads is hard to do, not to mention “correct” systems. It’s an embarrassing situation. For one thing, the low-level synchronization primitives that are typically provided with kernel threads, mutexes and condition variables, are not composable. Also, as with callback-oriented concurrency, one thread can silently corrupt another via unstructured mutation of shared state. It’s worse with kernel threads, though: a kernel thread can be interrupted at any point, not just at I/O. And though callback-oriented systems can theoretically operate on multiple CPUs at once, in practice they don’t. This restriction is sometimes touted as a benefit by proponents of callback-oriented systems, because in such a system, the callback invocations have a single, sequential order. With multiple CPUs, this is not the case, as multiple threads can run at the same time, in parallel.

Kernel threads can work. The Java virtual machine does at least manage to prevent low-level memory corruption and to do so with high performance, but still, even Java-based systems that aim for maximum concurrency avoid using a thread per connection because threads use too much memory.

In this context it’s no wonder that there’s a third strain of concurrency: shared-nothing message-passing systems like Erlang. Erlang isolates each thread (called processes in the Erlang world), giving each its own heap and “mailbox”. Processes can spawn other processes, and the concurrency primitive is message-passing. A process that tries to receive a message from an empty mailbox will “block”, from its perspective. In the meantime the system will run other processes. Message sends never block, oddly; instead, sending to a process with many messages pending makes it more likely that Erlang will pre-empt the sending process. It’s a strange tradeoff, but it makes sense when you realize that Erlang was designed for network transparency: the same message send/receive interface can be used to send messages to processes on remote machines as well.

No network is truly transparent, however. At the most basic level, network sends will be much slower than local sends. Whereas a message sent to a remote process has to be written out byte-by-byte over the network, there is no need to copy immutable data within the same address space. The complexity of a remote message send is O(n) in the size of the message, whereas a local immutable send is O(1). This suggests that hiding the different complexities behind one operator is the wrong thing to do. And indeed, given byte read and write operators over sockets, it’s possible to implement remote message send and receive as a process that serializes and parses messages between a channel and a byte sink or source. In this way we get cheap local channels, and network shims are under the programmer’s control. This is the approach that the Go language takes, and is the one we use in Fibers.
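Here is a rough sketch of what such a shim could look like in Go. The function names sendShim, recvShim, and roundTrip are invented for this example, the choice of encoding/gob as the wire format is an assumption, and an in-memory io.Pipe stands in for a real socket.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"io"
)

// sendShim drains a local channel and serializes each message to a byte
// sink. This is the O(n) part that a local channel send never pays.
func sendShim(in <-chan int, w io.WriteCloser) {
	enc := gob.NewEncoder(w)
	for msg := range in {
		if err := enc.Encode(msg); err != nil {
			break // sketch only: a real shim would report the error
		}
	}
	w.Close()
}

// recvShim parses messages from a byte source and delivers them on a local
// channel, closing the channel when the source is exhausted.
func recvShim(r io.Reader, out chan<- int) {
	dec := gob.NewDecoder(r)
	for {
		var msg int
		if err := dec.Decode(&msg); err != nil {
			break // io.EOF once the writer closes, or a real error
		}
		out <- msg
	}
	close(out)
}

// roundTrip pushes msgs through both shims and returns what comes out.
func roundTrip(msgs []int) []int {
	pr, pw := io.Pipe() // stands in for a network connection
	local := make(chan int)
	remote := make(chan int)
	go sendShim(local, pw)
	go recvShim(pr, remote)
	go func() {
		for _, m := range msgs {
			local <- m
		}
		close(local)
	}()
	var got []int
	for m := range remote {
		got = append(got, m)
	}
	return got
}

func main() {
	fmt.Println(roundTrip([]int{1, 2, 3}))
}
```

Both ends still see ordinary channels; the serialization cost lives in two small processes that the programmer can see, tune, or replace.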

Structuring a concurrent program as separate threads that communicate over channels is an old idea that goes back to Tony Hoare’s work on “Communicating Sequential Processes” (CSP). CSP is an elegant tower of mathematical abstraction whose layers form a pattern language for building concurrent systems that you can still reason about. Interestingly, it does so without any concept of time at all, instead representing a thread’s behavior as a trace of instantaneous events. Threads themselves are like functions that unfold over the possible events to produce the actual event trace seen at run-time.

This view of events as instantaneous happenings extends to communication as well. In CSP, one communication between two threads is modelled as an instantaneous event, partitioning the traces of the two threads into “before” and “after” segments.

Practically speaking, this has ramifications in the Go language, which was heavily inspired by CSP. You might think that a channel is just an asynchronous queue that blocks when writing to a full queue, or when reading from an empty queue. That’s a bit closer to the Erlang conception of how things should work, though as we mentioned, Erlang simply slows down writes to full mailboxes rather than blocking them entirely. However, that’s not what Go and other systems in the CSP family do; sending a message on a channel will block until there is a receiver available, and vice versa. The threads are said to “rendezvous” at the event.
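A small Go sketch shows the difference. The helper trySend is a name made up for this example; it uses select with a default case to ask whether a send could complete right now.

```go
package main

import "fmt"

// trySend attempts a send without blocking and reports whether it went
// through. On an unbuffered channel that requires a waiting receiver.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan int)        // unbuffered: send and receive must rendezvous
	fmt.Println(trySend(ch, 1)) // false: no receiver is waiting

	done := make(chan int)
	go func() { done <- <-ch }()
	ch <- 42            // blocks until the goroutine arrives at its receive
	fmt.Println(<-done) // 42

	buffered := make(chan int, 1)     // the "asynchronous queue" conception
	fmt.Println(trySend(buffered, 1)) // true: the queue has room
	fmt.Println(trySend(buffered, 2)) // false: the queue is full
}
```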

Unbuffered channels have the interesting property that you can select between sending a message on channel a or channel b, and in the end only one message will be sent; nothing happens until there is a receiver ready to take the message. In this way messages are really owned by threads and never by the channels themselves. You can of course add buffering if you like, simply by making a thread that waits on either sends or receives on a channel, and which buffers sends and makes them available to receives. It’s also possible to add explicit support for buffered channels, as Go, core.async, and many other systems do, which can reduce the number of context switches as there is no explicit buffer thread.
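As a sketch of the "buffer thread" idea in Go: the function below (buffer is a hypothetical name, and the unbounded queue is a simplification) sits between two unbuffered channels and waits on either a receive or a send, whichever is possible first.

```go
package main

import "fmt"

// buffer connects in to out through a FIFO queue owned by one goroutine.
// It stops once in is closed and the queue has been drained.
func buffer(in <-chan int, out chan<- int) {
	var queue []int
	for in != nil || len(queue) > 0 {
		// A nil channel is never ready in a select, so the send case is
		// effectively disabled while the queue is empty.
		var sendTo chan<- int
		var head int
		if len(queue) > 0 {
			sendTo = out
			head = queue[0]
		}
		select {
		case v, ok := <-in:
			if !ok {
				in = nil // input closed: stop receiving, drain the rest
				continue
			}
			queue = append(queue, v)
		case sendTo <- head:
			queue = queue[1:]
		}
	}
	close(out)
}

func main() {
	in := make(chan int)
	out := make(chan int)
	go buffer(in, out)
	// These sends complete even though nobody is reading from out yet.
	for i := 1; i <= 3; i++ {
		in <- i
	}
	close(in)
	for v := range out {
		fmt.Println(v)
	}
}
```

The cost is the extra context switch through the buffer goroutine on every message, which is the overhead that built-in buffered channels avoid.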

Whether to buffer or not to buffer is a tricky choice. It’s possible to implement singly-buffered channels in a system like Erlang via an explicit send/acknowledge protocol, though it seems difficult to implement completely unbuffered channels. As we mentioned, it’s possible to add buffering to an unbuffered system by the introduction of explicit buffer threads. In the end though in Fibers we follow CSP’s lead so that we can implement the nice select behavior that we mentioned above.

As a final point, select is OK but is not a great language abstraction. Say you call a function and it returns some kind of asynchronous result which you then have to select on. It could return this result as a channel, and that would be fine: you can add that channel to the other channels in your select set and you are good. However, what if what the function does is receive a message on a channel, then do something with the message? In that case the function should return a channel, plus a continuation (as a closure or something). If select results in a message being received over that channel, then we call the continuation on the message. Fine. But, what if the function itself wanted to select over some channels? It could return multiple channels and continuations, but that becomes unwieldy.

What we need is an abstraction over asynchronous operations, and that is the main idea of a CSP-derived system called “Concurrent ML” (CML). Originally implemented as a library on top of Standard ML of New Jersey by John Reppy, CML provides this abstraction, which in Fibers is called an operation [1]. Calling send-operation on a channel returns an operation, which is just a value. Operations are like closures in a way; a closure wraps up code in its environment, which can be later called many times or not at all. Operations likewise can be performed [2] many times or not at all; performing an operation is like calling a function. The interesting part is that you can compose operations via the wrap-operation and choice-operation combinators. The former lets you bundle up an operation and a continuation. The latter lets you construct an operation that chooses over a number of operations. Calling perform-operation on a choice operation will perform one and only one of the choices. Performing an operation will call its wrap-operation continuation on the resulting values.

While it’s possible to implement Concurrent ML in terms of Go’s channels and baked-in select statement, it’s more expressive to do it the other way around, as that also lets us implement other operation types besides channel send and receive, for example timeouts and condition variables.

1 CML uses the term event, but I find this to be a confusing name. In this isolated article my terminology probably looks confusing, but in the context of the library I think it can be OK. The jury is out though.

2 In CML, synchronized.

* * *

Well, that's my limited understanding of the crushing weight of history. Note that part of this article is now in the Fibers manual.

Thanks very much to Matthew Flatt, Matthias Felleisen, and Michael Sperber for pushing me towards CML. In the beginning I thought its benefits were small and complication large, but now I see it as being the reverse. Happy hacking :)

is go an acceptable cml?

21 September 2016 9:29 PM (go | concurrency | ml | ignorance | guile | clojure | core.async)

Yesterday I tried to summarize the things I know about Concurrent ML, and I came to the tentative conclusion that Go (and any Go-like system) was an acceptable CML. Turns out I was both wrong and right.

you were wrong when you said everything's gonna be all right

I was wrong, in the sense that programming against the CML abstractions lets you do more things than programming against channels-and-goroutines. Thanks to Sam Tobin-Hochstadt for pointing this out. As an example, consider a little process that tries to receive a message off a channel, and times out otherwise:

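func withTimeout(ch chan int, timeout time.Duration) (result int) {
  // A nil channel blocks forever, so allocate it; the buffer of 1 lets
  // the timer goroutine finish even when ch wins the select.
  timeoutChannel := make(chan int, 1)
  var msg int
  go func() {
    time.Sleep(timeout) // requires: import "time"
    timeoutChannel <- 0
  }()
  select {
    case msg = <-ch: return msg
    case msg = <-timeoutChannel: return 0
  }
}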

I think that's the first Go I've ever written. I don't even know if it's syntactically valid. Anyway, I think we see how it should work. We return the message from the channel, unless the timeout happens before.

But, what if the message is itself a composite message somehow? For example, say we have a transformer that reads a value from a channel and adds 1 to it:

func onePlus(in chan int) (result chan int) {
  // Allocate the channel; sending on a nil channel would block forever.
  out := make(chan int)
  go func () { out <- 1 + <-in }()
  return out
}

What if we do a withTimeout(onePlus(numbers), 0)? Assume the timeout fires first and that's the result that select chooses. There's still that onePlus goroutine out there trying to read from in and at some point probably it will succeed, but nobody will read its value. At that point the number just vanishes into the ether. Maybe that's OK in certain domains, but certainly not in general!
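To watch the number vanish, here is a self-contained sketch. It carries its own copies of withTimeout and onePlus (with channels allocated via make and a time.Duration timeout, both my assumptions), and a hypothetical helper lostMessage that reports what the caller got and whether a later send on numbers was swallowed anyway.

```go
package main

import (
	"fmt"
	"time"
)

// withTimeout returns a message from ch, or 0 if the timeout fires first.
func withTimeout(ch chan int, timeout time.Duration) int {
	timeoutChannel := make(chan int, 1)
	go func() {
		time.Sleep(timeout)
		timeoutChannel <- 0
	}()
	select {
	case msg := <-ch:
		return msg
	case <-timeoutChannel:
		return 0
	}
}

// onePlus returns a channel that will carry one value read from in, plus 1.
func onePlus(in chan int) chan int {
	out := make(chan int)
	go func() { out <- 1 + <-in }()
	return out
}

// lostMessage returns the result of the timed-out receive, and whether a
// number sent afterwards was still consumed by the orphaned goroutine.
func lostMessage() (result int, consumed bool) {
	numbers := make(chan int)
	result = withTimeout(onePlus(numbers), 0)
	// The timeout won, but the onePlus goroutine is still waiting on numbers.
	select {
	case numbers <- 41:
		consumed = true // taken by onePlus; the 42 it computes is never read
	case <-time.After(time.Second):
		consumed = false
	}
	return result, consumed
}

func main() {
	result, consumed := lostMessage()
	fmt.Println("withTimeout returned:", result)
	fmt.Println("41 was consumed anyway:", consumed)
}
```

The orphaned goroutine then blocks forever trying to send on a channel that nobody holds any more.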

What CML gives you is the ability to express an event (which is kinda like a possibility of sending or receiving a message on a channel) in such a way that we don't run into this situation. Specifically with the wrap combinator, we would make an event such that receiving on numbers would run a function on the received message and return that as the message value -- which is of course the same as what we have, except that in CML the select wouldn't actually read the message off unless it select'd that channel for input.

Of course in Go you could just rewrite your program, so that the select statement looks like this:

select {
  case msg = <-ch: return msg + 1;
  case msg = <-timeoutChannel: return 0;
}

But here we're operating at a lower level of abstraction; we were forced to intertwingle our concerns of adding 1 and our concerns of timeout. CML is more expressive than Go.

you were right when you said we're all just bricks in the wall

However! I was right in the sense that you can build a CML system on top of Go-like systems (though possibly not Go in particular). Thanks to Vesa Karvonen for this comment and the link to their proof-of-concept CML implementation in Clojure's core.async. I understand Vesa also has an implementation in F#.

Folks should read Vesa's code, after reading the Reppy papers of course; it's delightfully short and expressive. The basic idea is that event composition operators like choose and wrap build up data structures instead of doing things. The sync operation then grovels through those data structures to collect a list of channels to pass on to core.async's equivalent of select. When select returns, sync determines which event that chosen channel and message corresponds to, and proceeds to "activate" the event (and, as a side effect, possibly issue NACK messages to other channels).

Provided you can map from the chosen select channel/message back to the event, (something that core.async can mostly do, with a caveat; see the code), then you can build CML on top of channels and goroutines.
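To get a feel for the idea, here is a much-reduced sketch in Go. It is not Vesa's code and not CML's API: the names Op, Recv, Wrap, Choice, and Perform are invented here, only receives on chan int are supported, and there is no NACK machinery. The combinators only build data; Perform collects the channels, hands them to reflect.Select (Go's dynamic form of select), and maps the chosen index back to the right continuation.

```go
package main

import (
	"fmt"
	"reflect"
)

// alt is one possible receive together with the continuation to run on it.
type alt struct {
	ch   chan int
	wrap func(int) int
}

// Op is a first-class description of a choice of receives. Building one
// does nothing; only Perform touches the channels.
type Op struct{ alts []alt }

// Recv describes receiving from ch and returning the message unchanged.
func Recv(ch chan int) Op {
	return Op{alts: []alt{{ch: ch, wrap: func(v int) int { return v }}}}
}

// Wrap composes f onto the continuation of every alternative in op.
func Wrap(op Op, f func(int) int) Op {
	out := Op{}
	for _, a := range op.alts {
		inner := a.wrap
		out.alts = append(out.alts, alt{
			ch:   a.ch,
			wrap: func(v int) int { return f(inner(v)) },
		})
	}
	return out
}

// Choice flattens several operations into one set of alternatives.
func Choice(ops ...Op) Op {
	out := Op{}
	for _, op := range ops {
		out.alts = append(out.alts, op.alts...)
	}
	return out
}

// Perform selects over all the channels, then runs only the continuation
// that belongs to the receive that actually happened.
func Perform(op Op) int {
	cases := make([]reflect.SelectCase, len(op.alts))
	for i, a := range op.alts {
		cases[i] = reflect.SelectCase{
			Dir:  reflect.SelectRecv,
			Chan: reflect.ValueOf(a.ch),
		}
	}
	i, v, _ := reflect.Select(cases)
	return op.alts[i].wrap(int(v.Int()))
}

func main() {
	numbers := make(chan int, 1)
	timeout := make(chan int, 1)
	plusOne := func(v int) int { return v + 1 }
	op := Choice(Wrap(Recv(numbers), plusOne), Recv(timeout))

	timeout <- 0
	fmt.Println(Perform(op)) // 0: timeout chosen, nothing read from numbers

	numbers <- 41
	fmt.Println(Perform(op)) // 42: the same operation value, performed again
}
```

Because the "add 1" step is attached to the receive rather than running in its own goroutine, no message is read off numbers unless that branch is the one chosen.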

o/~ yeah you were wrong o/~

On the other hand! One advantage of CML is that its events are not limited to channel sends and receives. I understand that timeouts, thread joins, and maybe some other event types are first-class event kinds in many CML systems. Michael Sperber, current Scheme48 maintainer and functional programmer, tells me that simply wrapping events in channels+goroutines works but can incur a big performance overhead relative to supporting those event types natively, due to the need to make the new goroutine and channel and the scheduling costs. He quotes 10X as the overhead!

So although CML and Go appear to be inter-expressible, maybe a proper solution will base the simple channel send/receive interface on CML rather than the other way around.

Also, since these events are now second-class, it must be OK to lose these events, for the same reason that the naïve withTimeout could lose a message from numbers. This is the case for timeouts usually but maybe you have to think about this more, and possibly provide an infinite stream of the message. (Of course the wrapper goroutine would be collected if the channel becomes unreachable.)

you were right when you said this is the end

I've long wondered how contemporary musicians deal with the enormous, crushing weight of recorded music. I don't really pick any more but hoo am I feeling this now. I think for Guile, I will continue hacking on fibers in a separate library, and I think that things will remain that way for the next couple years and possibly more. We need more experience and more mistakes before blessing and supporting any particular formulation of highly concurrent programming. I will say though that I am delighted that we are able to actually do this experimentation on a library level and I look forward to seeing what works out :)

Thanks again to Vesa, Michael, and Sam for sharing their time and knowledge; all errors are of course mine. Happy hacking!

concurrent ml versus go

20 September 2016 9:33 PM (concurrency | go | ml | ignorance | guile)

Peoples! Lately I've been navigating the guile-ship through waters unknown. This post is something of an echolocation to figure out where the hell this ship is and where it should go.

Concretely, I have been working on getting a nice lightweight concurrency system rolling for Guile. I'll write more about that later, but you can think of it as being modelled on Go, though built as a library. (I had previously described it as "Erlang-like", but that's just not accurate.)

Earlier this year at Curry On this topic was burning in my mind and of course when I saw the language-hacker fam there I had to bend their ears. My targets: Matthew Flatt, the amazing boundary-crossing engineer, hacker, teacher, researcher, and implementor of Racket, and Matthias Felleisen, the godfather of the PLT research family. I saw them sitting together and I thought, you know what, what can they have to say to each other? These people have been talking together for 30 years right? Surely they are actually waiting for some ignorant dude to saunter up to the PL genius bar, right?

So saunter I do, saying, "if someone says to you that they want to build a server that will handle 100K or so simultaneous connections on Racket, what abstraction do you tell them to use? Racket threads?" Apparently: yes. A definitive yes, in the case of Matthias, with a pointer to Robby Findler's paper on kill-safe abstractions; and still a yes from Matthew with the caveat that for the concrete level of concurrency that I described, you'd have to run tests. More fundamentally, I was advised to look at Concurrent ML (on which Racket's concurrency facilities were based), and told that CML was much better put together than many modern variants like Go.

This was very interesting and new to me. As y'all probably know, I don't have a formal background in programming languages, and although I've read a lot of literature, reading things only makes you aware of the growing dimension of the not-yet-read. Concurrent ML was even beyond my not-yet-read horizon.

So I went back and read a bunch of papers. Turns out Concurrent ML is like Lisp in that it has a tribe and a tightly-clutched history and a diaspora that reimplements it in whatever language they happen to be working in at the moment. Kinda cool, and, um... a bit hard to appreciate in the current-day context when the only good references are papers from 10 or 20 years ago.

However, after reading a bunch of John Reppy papers, here is my understanding of what Concurrent ML is. I welcome corrections; surely I am getting this wrong.

1. CML is like Go, composed of channels and goroutines. (Forgive the modern referent; I assume most folks know Go at this point.)

2. Unlike Go, in CML a channel is never buffered. To make a buffered channel in CML, you spawn a thread that manages a buffer between two channels.

3. Message send and receive operations in CML are built on a lower-level primitive called "events". (send ch x) is instead equivalent to (sync (send-event ch x)). It's like an event is the derivative of a message send with respect to time, or something.

4. Events can be combined and transformed using the choose and wrap combinators.

5. Doing a sync on an event created by choose allows a user to build select in "user-space", as a library. Cool stuff. So this is what events are for.

6. There are separate event type implementations for timeouts, channel send/recv blocking operations, file descriptor blocking operations, syscalls, thread joins, and the like. These are supported by the CML implementation.

7. The early implementations of Concurrent ML were concurrent but not parallel; they did not run multiple "goroutines" on separate CPU cores at the same time. It was only in like 2009 that people started to do CML in parallel. I do not know if this late parallelism has a practical impact on the viability of CML.
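
Points 3 to 5 can be sketched in Guile itself. This is a minimal sketch, assuming a release of the Fibers library that provides the (fibers operations) module, where "events" are called operations, sync is perform-operation, choose is choice-operation and wrap is wrap-operation. The names send, recv-either and demo are made up for the illustration.

```scheme
(use-modules (fibers)
             (fibers channels)
             (fibers operations))

;; Point 3: a blocking send is "sync" applied to a send event.
;; (Hypothetical helper; Fibers already ships put-message.)
(define (send ch x)
  (perform-operation (put-operation ch x)))

;; Points 4 and 5: choose + wrap + sync gives a select written as a
;; library function.  Each branch tags the message with its origin.
(define (recv-either ch-a ch-b)
  (perform-operation
   (choice-operation
    (wrap-operation (get-operation ch-a) (lambda (x) (cons 'a x)))
    (wrap-operation (get-operation ch-b) (lambda (x) (cons 'b x))))))

;; Only ch-b ever gets a sender, so the choice can only complete
;; through the second branch.
(define (demo)
  (run-fibers
   (lambda ()
     (let ((ch-a (make-channel))
           (ch-b (make-channel)))
       (spawn-fiber (lambda () (send ch-b 42)))
       (recv-either ch-a ch-b)))))
```

Nothing happens when recv-either builds its choice; the operation is just a value until perform-operation is applied to it, which is the property that lets select live in "user-space".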

ok go

What is the relationship of CML to Go? Specifically, is CML more expressive than Go? (I assume the reverse is not the case, but that would also be an interesting result!)

There are a few languages that only allow you to select over message receives (not sends), but Go's select doesn't have this limitation, so that's not a differentiator.

Some people say that it's nice to have events as the common denominator, but I don't get this argument. If the only event under consideration is message send or receive over a channel, events + choose + sync is the same in expressive power as a built-in select, as far as I can see. If there are other events, then your runtime already has to support them either way, and something like (let ((ch (make-channel))) (spawn-fiber (lambda () (put-message ch exp))) (get-message ch)) should be sufficient for any runtime-supported event in exp, like sleeps or timeouts or thread joins or whatever.
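
To make the wrapping idea concrete, here is a small sketch of that expression as a helper, assuming the Fibers library with its (fibers), (fibers channels) and (fibers timers) modules. The names call-via-channel and demo are hypothetical, and the sleep merely stands in for "any runtime-supported event".

```scheme
(use-modules (fibers)
             (fibers channels)
             (fibers timers))   ; provides a fiber-friendly `sleep'

;; Hypothetical helper: run THUNK in a fresh fiber, ship its result
;; over a fresh channel, and wait for that message.
(define (call-via-channel thunk)
  (let ((ch (make-channel)))
    (spawn-fiber (lambda () (put-message ch (thunk))))
    (get-message ch)))

(define (demo)
  (run-fibers
   (lambda ()
     (call-via-channel
      (lambda ()
        (sleep 0.1)             ; stand-in for a timeout, join, etc.
        'woke-up)))))
```

Because the result arrives as an ordinary channel message, the receive on ch could just as well be one arm of a select over other channels.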

To me it seems like Go has made the right choices here. I do not see the difference, and that's why I wrote all this, is to be shown the error of my ways. Choosing channels, send, receive, and select as the primitives seems to have the same power as SML events.

Let this post be a pentagram on the floor, then, to summon the CML cognoscenti. Well-actuallies are very welcome; hit me up in the comments!

[edit: Sam Tobin-Hochstadt tells me I got it wrong and I believe him :) In the meantime while I work out how I was wrong, examples are welcome!]

a simple (local) solution to the pay gap

24 March 2016 9:49 PM (justice | tech | pay gap | igalia)

International Working Women's Day was earlier this month, a day that reminds the world how far it has yet to go to achieve just treatment of women in the workplace. Obviously there are many fronts on which to fight to dismantle patriarchy, and also cissexism, and also transphobia, and also racism, and sometimes it gets a bit overwhelming just to think of a world where people treat each other right.

Against this backdrop, it's surprising that some policies are rarely mentioned by people working on social change. This article is about one of them -- a simple local change that can eliminate the pay gap across all axes of unfair privilege.


OK here it is: just pay everyone in a company the same hourly wage.

That's it!

on simple, on easy

But, you say, that's impossible!

Rich Hickey has this famous talk where he describes one thing as simple and the other as easy. In his narrative, simple is good but hard, and easy is bad but, you know, easy. I enjoy this talk because it's easy (hah!) to just call one thing simple and the other easy and it's codewords for good and bad, and you come across as having the facile prestidigitatory wisdom of a Malcolm Gladwell.

As far as simple, the substance of equal pay is as simple as it gets. And as far as practical implementation goes, it only needs buy-in from one person: your boss could do it tomorrow.

But, you say, a real business would never do this! This is getting closer to the real issues, but not there yet. There are plenty of instances of real businesses that do this. Incidentally, mine is one of them! I do not intend this to be an advertisement for my company, but I have to mention this early because society does its best to implant inside our brains the ideas that certain ideas are possible and certain others are not.

But, you say, this would be terrible for business! Here I think we are almost there. There's a question underneath, if we can manage to phrase it in a more scientific way -- I mean, the ideal sense in which science is a practice of humankind in which we use our limited powers to seek truth, with hypotheses but without prejudice. It might sound a bit pompous to invoke capital-S Science here, but I think few conversations of this kind try to honestly even consider existence proofs in the form of already-existing histories (like the company I work for), much less an unbiased study of the implications of modelling the future on those histories.

Let's assume that you and I want to work for justice, and in this more perfect world, men and women and nonbinary people will have equal pay for equal work, as will all people that lie on all axes of privilege that currently operate in society. If you are with me up to here: great. If not, we don't share a premise so it's not much use to go farther. You can probably skip to the next article in your reading list.

So, then, the questions: first of all, would a flat equal wage within a company actually help people in marginalized groups? What changes would happen to a company if it enacted a flat wage tomorrow? What are its limitations? How could this change come about?

would it help?

Let's take the most basic question first. How would this measure affect people in marginalized groups?

Let us assume that salaries are distributed inversely: the higher salaries are made by fewer people. A lower salary corresponds to more people. So firstly, we are in a situation where the median salary is less than the mean, which means that if we switched to pay everyone the mean, then most people would see an increase in their salary.

Assuming that marginalized people were evenly placed in a company, that would mean that most would benefit. But we know that is not the case: "marginalized" is the operative term. People are categorized at a lower point than their abilities; people's climb of the organizational hierarchy (and to higher salaries) is hindered by harassment, by undervalued diversity work, and by external structural factors, like institutionalized racism or the burden of having to go through a gender transition. So probably, even if a company touts equal pay within job classifications, the job classifications themselves unfairly put marginalized people lower than white dudes like me. So, proportionally marginalized people would benefit from an equal wage more than most.

Already this plan is looking pretty good: more money going to marginalized people is a necessary step to bootstrap a more just world.

All that said, many (but not most) people from marginalized groups will earn more than the mean. What for them? Some will decide that paying for a more just company as a whole is worth a salary reduction. (Incidentally, this applies to everyone: everyone has their price for justice. It might be 0.1%, it might be 5%, it might be 50%.)

Some, though, will decide it is not worth paying. They will go work elsewhere, probably for even more money (changing jobs being the best general way to advance your salary). I don't blame marginalized folks for getting all they can: more power to them.

From what I can tell, things are looking especially good for marginalized people under a local equal-wage initiative. Not perfect, not in all cases, but generally better.

won't someone think of the dudes

I don't believe in value as a zero-sum proposition: there are many ways in which a more fair world could be more productive, too. But in the short term, a balance sheet must balance. Salary increases in the bottom will come from salary decreases from the top, and the dudebro is top in tech.

We should first note that many and possibly most white men will see their wages increase under a flat-wage scheme, as most people earn below the mean.

Secondly, some men will be willing to pay for justice in the form of equal pay for equal work. An eloquent sales pitch as to what they are buying will help.

Some men would like to pay but have other obligations that a "mean" salary just can't even. Welp, there are lots of jobs out there. We'll give you a glowing recommendation :)

Finally there will be dudes that are fine with the pay gap. Maybe they have some sort of techno-libertarian justification? Oh well! They will find other jobs. As someone who cares about justice, you don't really want to work with these people anyway. Call it "bad culture fit", and treat it as a great policy to improve the composition of your organization.

an aside: what are we here for anyway?

A frequent objection to workplace change comes in the form of a pandering explanation of what companies are for, that corporations are legally obligated to always proceed along the most profitable path.

I always find it extraordinarily ignorant to hear this parroted by people in tech: it's literally part of the CS canon to learn about the limitations of hill-climbing as an optimization strategy. But on the other hand, I do understand; the power of just-so neoliberal narrative is immense, filling your mind with pat explanations, cooling off your brain into a poorly annealed solid mass.

The funny thing about corporate determinism is that it's not even true. Folks who say this have rarely run companies, otherwise they should know better. Loads of corporate decisions are made with a most tenuous link to profitability, and some probably even go against the profit interest. It's always easy to go in a known-profitable direction, but that doesn't mean it's the only way to go, nor that all the profitable directions are known.

Sometimes this question is framed in the language of "what MyDesignCo really cares about is good design; we're worried about how this measure might affect our output". I respect this question more, because it's more materialist (you can actually answer the question!), but I disagree with the premise. I don't think any company really cares about the product in a significant way. Take the design company as an example. What do you want on your tombstone: "She made good advertisements"??? Don't get me wrong, I like my craft, and I enjoy practicing it with my colleagues. But if on my tombstone they wrote "He worked for justice", and also if there were a heaven, I would be p OK with that. What I'm saying is, you start a company, you have an initial idea, you pivot, whatever, it doesn't matter in the end. What matters is your relationship with life on the planet, and that is the criterion you should use to evaluate what you do.

Beyond all that -- it's amazing how much wrong you can wrap up in a snarky hacker news one-liner -- beyond all that, the concern begs the question by assuming that a flat-wage arrangement is less profitable. People will mention any down-side they can but never an up-side.

possible flat-wage up-sides from a corporate perspective

With that in mind, let's consider some ways that a flat wage can actually improve the commercial fate of a company.

A company with a flat wage already has a marketing point that they can use to attract people that care about this sort of thing. It can make your company stand out from the crowd and attract good people.

The people you attract will know you're doing the flat-wage thing, and so will be predisposed to want to work together. This can increase productivity. It also eliminates some material sources of conflict between different roles in an organization. You would still need "human resources" people but they would need to spend less time on mitigating the natural money-based conflicts that exist in other organizations.

Another positive side relates to the ability of the company to make collective sacrifices. For example a company that is going through harder times can collectively decide not to raise wages or even to lower them, rather than fire people. Obviously this outcome depends on the degree to which people feel responsible for the organization, which is incomplete without a feeling of collective self-management as in a cooperative, but even in a hierarchical organization these effects can be felt.

Incidentally a feeling of "investment" in the organization is another plus. When you work in a company in which compensation depends on random factors that you can't see, you always wonder if you're being cheated out of your true value. If everyone is being paid the same you know that everyone's interest in improving company revenue is aligned with their own salary interest -- you can't gain by screwing someone else over.

limitations of a flat wage at improving justice

All that said, paying all workers/partners/employees the same hourly wage is not a panacea for justice. It won't dismantle patriarchy overnight. It won't stop domestic violence, and it won't stop the cops from killing people of color. It won't stop microaggressions or harassment in the workplace, and in some ways if there are feelings of resentment, it could even exacerbate them. It won't arrest attrition of marginalized people from the tech industry, and it won't fix hiring. Enacting the policy in a company won't fix the industry as a whole, even if all companies enacted it, as you would still have different wages at different companies. It won't fix the situation outside of the tech industry; a particularly egregious example being that in almost all places, cleaning staff are hired via subcontracts and not as employees. And finally, it won't resolve class conflict at work: the owner still owns. There are still pressures on the owner to keep the whole balance sheet secret, even if the human resources side of things is transparent.

All that said, these are mainly ways in which an equal wage policy is incomplete. A step in the right direction, on a justice level, but incomplete. In practice though the objections you get will be less related to justice and more commercial in nature. Let's take a look at some of them.

commercial challenges to a flat wage

Having everyone paid the same makes it extraordinarily difficult to hire people that are used to being paid on commission, like sales people. Sales people drive Rolexes and wear Mercedes. It is very, very tough to hire good sales people on salary. At my work we have had some limited success hiring, and some success growing technical folks into sales roles, but this compensation package will hinder your efforts to build and/or keep your sales team.

On the other hand, having the same compensation between sales and engineering does eliminate some of the usual sales-vs-product conflicts of interest.

Another point is that if you institute a flat-wage policy, you should expect to lose some fraction of your highly-skilled workers, as many of these are more highly paid. There are again some mitigations but it's still a reality. Perhaps more perniciously, you will have greater difficulties hiring senior people: you literally can't get into a bidding war with a competitor over a potential hire.

On the flip side, a flat salary can make it difficult to hire more junior positions. There are many theories here but I think that a company is healthy when it has a mix of experiences, that senior folks and junior folks bring different things to the table. But if your flat wage is higher than the standard junior wage, then your potential junior hires are now competing against more senior people -- internally it will be hard to keep a balance between different experiences.

Indeed junior workers that you already have are now competing at their wage level with potential hires that might be more qualified in some way. An unscrupulous management could fire those junior staff members and replace them with more senior candidates. An equal wage policy does not solve internal class conflicts; you need to have equal ownership and some form of workplace democracy for that.

You could sort people into pay grades, but in many ways this would formalize injustice. Marginalized people are by definition not equally distributed across pay grades.

Having a flat wage also removes a standard form of motivation, that your wage is always rising as you get older. It could be that after 5 years in a job, maybe your wages went up because the company's revenues went up, but they're still the same as a new hire's -- how do you feel about that? It's a tough question. I think an ever-rising wage has a lot of negative aspects, including decreasing the employability of older workers, but it's deeply rooted in tech culture at least.

Another point is motivation of people within the same cadre. Some people are motivated by bonuses, by performing relatively well compared to their peers. This wouldn't be an option in an organization with a purely flat wage. Does it matter? I do not know.

work with me tho

As the prophet Pratchett said, "against one perfect moment, the centuries beat in vain". There are some definite advantages to a flat wage within a company: it's concrete, it can be immediately enacted, it solves some immediate problems in a local way. Its commercial impact is unclear, but the force of narrative can bowl over many concerns in that department: what's important is to do the right thing. Everybody knows that!

As far as implementation, I see three-and-a-half ways this could happen in a company.

The first is that equal pay could be a founding principle of the company. This was mostly the case in the company I work for (and operate, and co-own equally with the other 40 or so partners). I wasn't a founder of the company, and the precise set of principles and policies has changed over the 15 years of the company's life, but it's more obvious for this arrangement to continue from a beginning than to change from the normal pay situation.

The second is, the change could come from the top down. Some CEOs get random brain waves and this happens. In this case, the change is super-easy to make: you proclaim the thing and it's done. As a person who has had to deal with cash-flow and payroll and balance sheets, I can tell you that this considerably simplifies HR from a management perspective.

The third is via collective action. This only works if workers are able to organize and can be convinced to be interested in justice in this specific way. In some companies, a worker's body might simply be able to negotiate this with management -- e.g., we try it out for 6 months and see. In most others you'd probably need to unionize and strike.

Finally, if this practice were more widespread in a sector, it could be that it just becomes "best practice" in some way -- that company management could be shamed into doing it, or it could just be the way things are done.


Many of these points are probably best enacted in the context of a worker-owned cooperative, where you can do away with the worker-owner conflict at the same time. But still, they are worth thinking of in a broader context, and worth evaluating in the degree to which they work for (or against) justice in the workplace. But enough blathering from me today :) Happy hacking!

a lambda is not (necessarily) a closure

8 February 2016 10:12 AM (guile | gnu | compilers | igalia | scheme | closure optimization | contification)


Greets, folks! Check it out: Guile had a whole track devoted to it at FOSDEM this year. OK, so it was only half a day, but there were like a dozen talks! And the room was full all the morning! And -- get this -- I had nothing to do with its organization! I think we can credit the Guix project with the recent surge of interest in Guile; fully half the talks were from people excited about using Guix to solve their problems. Thanks very, very much to Pjotr Prins for organizing the lovely event.

I gave a talk on how the Guile 2.2 compiler and virtual machine could change the way people program. Happily, the video recording came out OK! Video below (or here if that doesn't work), and slides here.

The time was super-limited though and I wasn't able to go into the detail that I'd like. So, dear readers, here we are, with a deeper look at lambda representation in Guile.

a lambda is not (necessarily) a closure

What is this?

(lambda (a b) (+ a b))

If you answer, "it's a lambda expression", you're right! You're also right if you say it's a function -- I mean, lambda makes a function, right? There are lots of things that you could say that would be right, including silly things like "twenty-two characters set in an awkward typeface".

But if you said "it's a closure" -- well you're right in general I guess, like on a semantic what-does-it-mean level, but as far as how Guile represents this thing at run-time, hoo boy are there a number of possibilities, and a closure is just one of them. This article dives into the possibilities, with the goal being to help you update your mental model of "how much do things cost".

In Guile, a lambda expression can be one of the following things at run-time:

  1. Gone

  2. Inlined

  3. Contified

  4. Code pointer

  5. Closure

Let's look into these one-by-one.

lambda: gone

If Guile can prove that a lambda expression is never reached, it won't be present at run-time. The main way this happens is via partial evaluation, but later passes can do this too. In the most basic example, consider the lambda bound to f by this let expression.

(let ((f (lambda ()
           (launch-the-missiles!))))
  42)

Guile has an ,optimize command that can be run at the REPL to show the effect of partial evaluation on your code. These days it's a bit out of date in a way -- it can't show what CPS-based optimization will do to your code -- but for our purposes here it will transform the expression to the following code:

(let ((f (lambda ()
           (launch-the-missiles!))))
  42)
=> 42

So the lambda is gone, big whoop. The interesting thing though is that this happens concurrently with other things that partial evaluation does, so the lambda goes away in this expression too:

(let ((launch? #f)
      (f (lambda ()
           (launch-the-missiles!))))
  (if launch? (f) 'just-kidding))
=> 'just-kidding

lambda: inlined

The other trick that partial evaluation can do with lambda expressions is inlining. Re-taking the example above, if we change launch? to #t, the branch folds the other way and the application (f) inlines:

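(let ((launch? #t)
      (f (lambda ()
           (launch-the-missiles!))))
  (if launch? (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     (if #t (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     (f))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     ((lambda () (launch-the-missiles!))))
=> (let ((launch? #t)
         (f (lambda ()
              (launch-the-missiles!))))
     (launch-the-missiles!))
=> (launch-the-missiles!)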

Here again the lambda is gone, but not because it was unreachable, but because it was inlined into its use. I showed some intermediate steps as well, just so you get a feel about how partial evaluation works. The inlining step is illustrated by the fourth transformation, where the lambda application went away, replaced by its body.

Partial evaluation can also unroll many kinds of recursion:

(letrec ((lp (lambda (n)
               (if (zero? n)
                   n
                   (+ n (lp (1- n)))))))
  (lp 5))
=> 15

The partial evaluator in Guile 2.2 is more or less unchanged from the one in Guile 2.0, so you get these benefits on old Guile as well. Building a good intuition as to what the partial evaluator will do is important if you want to get the best performance out of Guile. Use the ,optimize command at the REPL to see the effects of partial evaluation on any given expression.

lambda: contified

So, here we step into the unknown, in the sense that from here on out, these optimizations are new in Guile 2.2. Unfortunately, they can be hard to see as they aren't really representable in terms of source-to-source transformations over Scheme programs. Consider this program:

(define (count-down n)
  (define loop
    (lambda (n out)
      (let ((out (cons n out)))
        (if (zero? n)
            out
            (loop (1- n) out)))))
  (loop n '()))

It's a little loop that builds a list of integers. The lambda in this loop, bound to loop, will be contified into the body of count-down.

To see that this is the case, we have to use a new tool, ,disassemble (abbreviated ,x). This takes a procedure and prints its bytecode. It can be hard to understand, so I'm going to just point out some "shapes" of disassembly that you can recognize.

> ,x count-down
Disassembly of #<procedure count-down (n)> at #x9775a8:

[...]
L1:
  10    (cons 2 1 2)
  11    (br-if-u64-=-scm 0 1 #f 5) ;; -> L2
  14    (sub/immediate 1 1 1)
  15    (br -5)                    ;; -> L1
L2:
[...]

I've snipped the disassembly to the interesting part. The first thing to notice is that there's just one procedure here: only one time that ,x prints "Disassembly of ...". That means that the lambda was eliminated somehow, either because it was dead or inlined, as described above, or because it was contified. It wasn't dead; we can see that from looking at the ,optimize output, which doesn't significantly change the term. It wasn't inlined either; again, ,optimize can show you this, but consider that because partial evaluation can't determine when the loop would terminate, it won't find a point at which it can stop unrolling the loop. (In practice what happens though is that it tries, hits an effort or code growth limit, then aborts the inlining attempt.)

However, what we see in the disassembly is the body of the loop: we cons something onto a list (the cons), check if two numbers are equal (br-if-u64-=-scm), and if they are we jump out of the loop (L2). Otherwise we subtract 1 from a number (sub/immediate) and loop (br to L1). That is the loop. So what happened?

Well, if inlining is copying, then contification is rewiring. Guile's compiler was able to see that although it couldn't inline the loop function, it could see all of loop's callers, and that loop always returned to the same "place". (Another way to say this is that loop is always called with the same continuation.) The compiler was then able to incorporate the body of loop into count-down, rewiring calls to loop to continue to loop's beginning, and rewriting returns from loop to proceed to the continuation of the loop call.

a digression on language

These words like "contification" and "continuation" might be unfamiliar to you, and I sympathize. If you know of a better explanation of contification, I welcome any links you might have. The name itself comes from a particular formulation of the intermediate language used in Guile, the so-called "CPS" language. In this language, you convert a program to make it so it never returns: instead, each sub-expression passes its values to its continuation via a tail call. Each continuation is expressed as a lambda expression. See this article for an intro to CPS and how it relates to things you might know like SSA.

Transforming a program into CPS explodes it into a bunch of little lambdas: every subexpression gets its own. You would think this would be a step backwards, if your goal is to eliminate closures in some way. However it's possible to syntactically distinguish between lambda expressions which are only ever used as continuations and those that are used as values. Let's call the former kind of lambda a cont and the latter a function. A cont-lambda can be represented at run-time as a label -- indeed, the disassembly above shows this. It turns out that all lambda expressions introduced by the CPS transformation are conts. Conts form a first-order flow graph, and are basically the same as SSA basic blocks. If you're interested in this kind of thing, see Andrew Kennedy's great paper, Compiling with Continuations, Continued, and see also CPS soup for more on how this evolved in Guile 2.2.
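
To get a feel for the "bunch of little lambdas", here is a hand-written illustration of the same computation in direct style and in continuation-passing style. It is only a sketch in plain Scheme, not Guile's actual CPS intermediate language; the names hypot2, hypot2/k, mul/k and add/k are made up for the example.

```scheme
;; Direct style: each sub-expression returns its value to its caller.
(define (hypot2 a b)
  (+ (* a a) (* b b)))

;; Continuation-taking stand-ins for the primitives (hypothetical
;; helpers): instead of returning, they tail-call K with the result.
(define (mul/k x y k) (k (* x y)))
(define (add/k x y k) (k (+ x y)))

;; CPS: nothing returns.  Every intermediate value is handed to a
;; lambda that represents "the rest of the computation".
(define (hypot2/k a b k)
  (mul/k a a
         (lambda (a2)
           (mul/k b b
                  (lambda (b2)
                    (add/k a2 b2 k))))))

;; Usage: pass the identity function as the final continuation.
(define result (hypot2/k 3 4 (lambda (v) v)))
```

The two inner lambdas exist only to receive a value and carry on, which is the role that conts play. In this plain-Scheme rendition the k argument is merely a convention; in the CPS language the continuation position is syntactically distinct, which is what lets the compiler tell conts from functions.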

I say all this to give you a vocabulary. Functions that are present in the source program start life as being treated as function-lambdas. Contification takes function-lambda values and turns them into cont-lambda labels, if it can. That's where the name "contification" comes from. For more on contification, see MLton's page on its contification pass, linking to the original paper that introduces the concept.

and we're back

Contification incorporates the body of a function into the flow graph of its caller. Unlike inlining, contification is always an optimization: it never causes code growth, and it enables other optimizations by exposing first-order control flow. (It's easier for the compiler to reason about first-order loops than it is to reason about control flow between higher-order functions.)

Contification is a reliable optimization. If a function's callers are always visible to the compiler, and the function is always called with the same continuation, it will be contified. These are two fairly simple conditions that you can cultivate your instincts to detect and construct.

Contification can also apply to mutually recursive functions, if as a group they are all always called with the same continuation. It's also an iterative process, in the sense that contifying one set of functions can expose enough first-order control flow that more contification opportunities become apparent.
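To make the two conditions concrete, here is a minimal sketch of a pair of mutually recursive local functions that look like candidates for contification. The names even-length?, ev? and od? are made up for illustration, and whether a given Guile version actually contifies them is something to check with ,x rather than take on faith.

```scheme
;; Hypothetical example; the names are illustrative only.
(define (even-length? lst)
  ;; Both helpers are bound locally and never escape, so all of their
  ;; callers are visible to the compiler.
  (define (ev? l) (if (null? l) #t (od? (cdr l))))
  (define (od? l) (if (null? l) #f (ev? (cdr l))))
  ;; The only call from outside the group is in tail position, and the
  ;; helpers only tail-call each other, so as a group they are always
  ;; called with one continuation: that of `even-length?' itself.
  (ev? lst))
```

If the outer function instead called (ev? lst) in two places with different continuations, the second condition would no longer hold for the group.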

It can take a while to get a feel for when this optimization applies. You have to have a feel for what a continuation is, and what it means for a function's callers to all be visible to the compiler. However, once you do internalize these conditions, contification is something you can expect Guile's compiler to do to your code.

lambda: code pointer

The next representation a lambda might have at run-time is as a code pointer. In this case, the function fails the conditions for contification, but we still avoid allocating a closure.

Here's a little example to illustrate the case.

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

In this example, log is called with three different continuations, so it's not eligible for contification. Unfortunately, this example won't illustrate anything for us because the log function is so small that partial evaluation will succeed in inlining it. (You could determine this for yourself by using ,optimize.) So let's make it bigger, to fool the inliner:

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what)
    ;; If `log' is too short, it will be inlined.  Make it bigger.
    (format #t "Did I ever tell you about my chickens\n")
    (format #t "I was going to name one Donkey\n")
    (format #t "I always wanted a donkey\n")
    (format #t "In the end we called her Raveonette\n")
    (format #t "Donkey is not a great name for a chicken\n")
    (newline) (newline) (newline) (newline) (newline))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

Now if we disassemble it, we do get disassembly for two different functions:

,x thing
Disassembly of #<procedure thing ()> at #x97d704:

Disassembly of log at #x97d754:

So, good. We defeated the inliner. Let's look closer at the disassembly of the outer function.

,x thing
Disassembly of #<procedure thing ()> at #x97d704:
  12    (call-label 3 2 8)              ;; log at #x97d754

Here we see that instead of the generic call instruction, we have the specific call-label instruction which calls a procedure whose code is at a known offset from the calling function.

call-label is indeed a cheaper call than the full call instruction that has to check that the callee is actually a function and so on. But that's not the real optimization here. If all callers of a function are known -- and by this time, you're starting to catch the pattern, I think -- if all callers are known, then the procedure does not need to exist as a value at run-time.

This affords a number of optimization opportunities. Theoretically there are many -- all call sites can be specialized to the specific callee. The callee can have an optimized calling convention that doesn't have anything to do with the generic convention. Effect analysis can understand the side effects and dependencies of the callee in a more precise way. The compiler can consider unboxing some arguments and return values, if it finds that useful.

In Guile though, there's only one real optimization that we do, and that is related to free variables. Currently in Guile, all procedures have a uniform calling convention, in which the procedure being called (the callee) is itself passed as the zeroeth argument, and then the arguments follow on the stack. The function being called accesses its free variables through that zeroeth argument. If however there is no need for the procedure to be represented as a value, we are free to specialize that zeroeth argument.

So, consider a well-known procedure like log above. (By "well-known", we mean that all of log's callers are known.) Since log doesn't actually have any lexically bound free variables, we can just pass in anything as argument zero when invoking it. In practice we pass #f, because it happens to be an easy value to make.

(Why isn't format treated as a free variable in log? Because there is special support from the linker for lazily initializing the locations of variables imported from other modules or defined at the top level instead of within a lexical contour. In short: only variables that are (a) used within the lambda and (b) defined within a let or similar count towards a lambda's free variables.)

For a well-known procedure with only one free variable, we can pass in that free variable as the zeroeth argument. Internally to the function, we rewrite references to that free variable to reference argument 0 instead. This is a neat hack because we can have a lambda with a free variable but which results in no allocation at run-time.

Likewise if there are two free variables -- and this is starting to sound like Alice's restaurant, isn't it -- well we do have to pass in their values to the procedure, but we don't have to build an actual closure object with a tag and a code pointer and all. Pairs happen to be small and have some fast paths in Guile, so we use that. References to the free variables get internally rewritten to be car or cdr of argument 0.

For three or more free variables, we do the same, but with a vector.
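The free-variable cases above can be modelled by hand in Scheme. This is only a sketch of the idea, not what the compiler emits: the names transform-pair, scale, scale/lowered and self are invented for the example, and a function as small as scale would in practice be inlined before any of this applied.

```scheme
;; Source form: `scale' is well-known (all of its callers are visible)
;; and has two free variables, `factor' and `offset'.
(define (transform-pair a b factor offset)
  (define (scale x) (+ (* x factor) offset))
  (cons (scale a) (scale b)))

;; Hand-lowered model of the two-free-variable case.  `self' stands in
;; for argument 0: instead of a closure object it is a plain pair, and
;; references to the free variables become `car' and `cdr' of it.
(define (scale/lowered self x)
  (+ (* x (car self)) (cdr self)))

(define (transform-pair/lowered a b factor offset)
  (let ((self (cons factor offset)))   ; a pair, not a tagged closure
    (cons (scale/lowered self a)
          (scale/lowered self b))))
```

Both versions compute the same thing; (transform-pair 1 2 10 3) and (transform-pair/lowered 1 2 10 3) each give (13 . 23). With a single free variable the pair would disappear too, and the value itself would be passed as self.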

For a final trick, a set of mutually recursive procedures whose callers are all known can share the object that collects their free variables. We collect the union of the free variables of all of the procedures, and pack them into a specialized representation as above.

Note that for well-known procedures, all variables that are free in the lambda are also free in the caller; that's why the 1-free-variable substitution works. The lambda is bound in a scope that dominates its callers, but its free variables dominate the lambda so they dominate the callers too. For that reason in this case we could choose to do lambda lifting instead, with no penalty: instead of bundling up the free variables in a heap object, we could pass them as arguments. Dybvig claims this is not a great idea because it increases register pressure. That could be true, but I haven't seen the numbers. Anyway, we do the flat closure thing, so we pack the free vars into data.

All these ideas came pretty much straight from the great Optimizing Closures in O(0) Time by Andrew Keep et al.

lambda: closure

OK! So you have a lambda whose callers are not all visible to the compiler. You need to reify the procedure as a value. That reified procedure-as-value is a closure: an object with a tag, a code pointer, and an array of free variables.

Of course, if the procedure has no free variables, you just have the tag and the code pointer... and because Scheme is semantically squirrely when it comes to the result of (eqv? (lambda () 10) (lambda () 10)) (it's unspecified: lambda expressions don't have identity), we can statically allocate the closure in the binary, as a constant.

Otherwise we do allocate the heap object.

Note however that if a group of mutually recursive procedures has just one entry that is not "well-known", then that procedure clique can share one closure object.
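As a small illustration of the closure case, here are two lambdas that escape from the function that creates them, so their callers cannot all be known. The names make-adder and make-constant-thunk are made up for the example; which representation Guile picks for each is as described above, not something this code can show directly.

```scheme
;; Hypothetical examples; the names are illustrative only.
(define (make-adder n)
  ;; The lambda is returned to an unknown caller, so it has to exist as
  ;; a value.  It has one free variable, `n', so each call would need a
  ;; fresh closure holding that `n'.
  (lambda (x) (+ x n)))

(define (make-constant-thunk)
  ;; This lambda also escapes, but it has no free variables, so it is a
  ;; candidate for the statically allocated case.
  (lambda () 10))
```

For example, ((make-adder 3) 4) evaluates to 7. Comparing two results of make-constant-thunk with eqv? is exactly the unspecified case mentioned above, so a program should not rely on the answer.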

lambda: it's complicated

In summary, a lambda is an abstraction that has many concrete representations. Guile will choose the cheapest representation that it can. If you need to eke out even more performance from your program, having a good mental model of how the abstract maps to the concrete will help you know where to focus your efforts, and what changes might be helpful. Good luck, and happy hacking!

guile compiler tasks

4 February 2016 9:38 PM (guile | compilers | gnu | igalia | linkers)

Hey! We released Guile 2.1.2, including the unboxing work, and we fixed the slow bootstrap problem by shipping pre-built bootstraps in tarballs. A pretty OK solution in my opinion; check it out!

future work

At this point I think I'm happy with Guile's compiler and VM, enough for now. There is a lot more work to do but it's a good point at which to release a stable series. There will probably be a number of additional pre-releases, but not any more significant compiler/VM work that must be done before a release.

However, I was talking with Guilers at FOSDEM last weekend and we realized that although we do a pretty good job at communicating the haps in compiler-land, we don't do a good job at sharing a roadmap or making it possible for other folks to join the hack. And indeed, it's been difficult to do so while things were changing so much: I had to get things right in my head before joining in the confusion of other people's heads.

In that spirit I'd like to share a list of improvements that it would be nice to make at some point. If you take one of these tasks, be my guest: find me on IRC (wingo on freenode) and let me know, and I'll help as I am able. You need to be somewhat independent; I'm not offering a proper mentoring or anything, more like office hours or something, where you come with the problem you are having and I commiserate and give context/background/advice as I am able.

So with that out of the way, here's a huge list of stuff! Following this, more details on each one.

  1. stripping binaries

  2. full source in binaries

  3. cps in binaries

  4. linking multiple modules together

  5. linking a single executable

  6. instruction explosion

  7. elisp optimizations

  8. prompt removal

  9. basic register allocation

  10. optimal register allocation

  11. unboxed record fields

  12. textual CPS

  13. avoiding arity checks

  14. unboxed calls and returns

  15. module-level inlining

  16. cross-module inlining

As a bonus, in the end I'll give some notes on native compilation. But first, the hacks!

stripping binaries

Guile uses ELF as its object file format, and currently includes source location information as DWARF data. On space-constrained devices this might be too much. Your task: add a hack to the linker that can strip existing binaries. Read Ian Lance Taylor's linker articles for more background, if you don't know things about linkers yet.

full source in binaries

Wouldn't it be nice if the ELF files that Guile generates actually included the source as well as the line numbers? We could do that, in a separate strippable ELF section. This point is like the reverse of the previous point :)

cps in binaries

We could also include the CPS IR in ELF files too. This would enable some kinds of link-time optimization and cross-module inlining. You'd need to define a binary format for CPS, like LLVM bitcode or so. Neat stuff :)

linking multiple modules together

Currently in Guile, just about every module is a separate .go file. Loading a module will cause a few stat calls and some seeks and reads and all that. Wouldn't it be nice if you could link together all the .go files that were commonly used into one object? Again this is a linker hack, but it needs support from the run-time as well: when the run-time goes to load a file, it should first check in a registry if that file has been logically provided by some other file. We'd be able to de-duplicate constant data from various modules. However there is an initialization phase when loading a .go file which effectively performs all the relocations needed by constants that need a fix-up at load-time; see the ELF article I linked to above for more. For some uses, it would be OK to produce one relocation/initialization procedure. For others, if you expected to only load a fraction of the modules in a .go file, it would be a lose on startup time, so you would probably need to support lazy relocation when a module is first loaded.

Anyway, your task would be to write a linker hack that loads a bunch of .go files, finds the relocations in them, de-duplicates the constants, and writes out a combined .go file that includes a table of files contained in it. Good luck :) This hack would work great for Emacs, where it's effectively a form of unexec that doesn't actually rely on unexec.

linking a single executable

In the previous task, you could end up with the small guile binary that links to libguile (or your binary linking to libguile), and then a .go file containing all the modules you are interested in. It sure would be nice to be able to link those together into just one binary, or at least to link the .go into the Guile binary. If the Guile is statically linked itself, you would have a statically linked application. If it's dynamically linked, it would remain dynamically linked. Again, a linker hack, but one that could provide a nicer way to distribute Guile binaries.

instruction explosion

Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do this, which would be a lose overall because you have more instruction dispatch.

Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.
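Here is a rough source-level picture of what exploding vector-ref might look like. It is only a model: the real transformation would happen on VM instructions, not in Scheme, and the name vector-ref/exploded is invented. Plain vector-ref stands in for the unchecked memory reference, since Scheme has no portable unchecked variant.

```scheme
;; Hypothetical model of an "exploded" vector-ref.
(define (vector-ref/exploded v i)
  (cond
   ;; Type test on the object (the job of br-if-tc7 in the VM).
   ((not (vector? v))
    (error "not a vector" v))
   ;; The index has to be a non-negative exact integer.
   ((not (and (integer? i) (exact? i) (>= i 0)))
    (error "bad index" i))
   ;; Branching comparison against the vector's length.
   ((not (< i (vector-length v)))
    (error "index out of range" i))
   (else
    ;; Stand-in for the primitive unchecked memory reference.
    (vector-ref v i))))

;; In a loop like this one, the type test is the same on every trip,
;; which is the kind of check an optimizer could hope to hoist.
(define (vector-sum v)
  (let lp ((i 0) (sum 0))
    (if (< i (vector-length v))
        (lp (1+ i) (+ sum (vector-ref/exploded v i)))
        sum)))
```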

elisp optimizations

Guile implements Emacs Lisp, and does so well. However it hasn't been the focus of a lot of optimization. Emacs has a lot of stuff going on on its side, and so have we, so we haven't managed to replace the Elisp interpreter in Emacs with one written in Guile, though Robin Templeton has brought us a long way forward. We need someone to do the integration work and also to poke the compiler and make sure it's a clear win.

prompt removal

It's pretty natural to use delimited continuations when compiling some kind of construct that includes a break statement to Guile, whether that compiler is part of Elisp or just implemented as a Scheme macro. But, many instances of prompts can be contified, resulting in no overhead at run-time. Read up on contification and contify the hell out of some prompts!
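For a feel of the kind of code this task is about, here is a minimal sketch of a "break" built on Guile's prompts. The function name find-first and the tag name are made up; call-with-prompt, abort-to-prompt and make-prompt-tag are Guile's own. Whether the compiler can eliminate this particular prompt is not something the sketch claims.

```scheme
;; Hypothetical example of a loop with an early exit.
(define (find-first pred lst)
  (let ((tag (make-prompt-tag "break")))
    (call-with-prompt tag
      ;; Body: walk the list, and "break" out with the first match.
      (lambda ()
        (for-each (lambda (x)
                    (when (pred x)
                      (abort-to-prompt tag x)))
                  lst)
        #f)                             ; no match found
      ;; Handler: `k' is the captured continuation, which a plain
      ;; break never resumes; `x' is the value passed to the abort.
      (lambda (k x) x))))
```

For example, (find-first even? '(1 3 4 5)) evaluates to 4. Because the handler ignores k, the continuation is never reinstated, which is part of what makes this pattern a plausible target for removal.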

basic register allocation

Guile usually tries its best to be safe-for-space: only the data which might be used in the future of a program is kept alive, and the rest is available for garbage collection. Notably, this applies to function arguments, temporaries, and lexical variables: if a value is dead, the GC can collect it and re-use its space. However this isn't always what you want. Sometimes you might want to have all variables that are in scope to be available, for better debugging. Your task would be to implement a "slot allocator" (which is really register allocation) that keeps values alive in the parts of the programs that they dominate.

optimal register allocation

On the other hand, our slot allocator -- which is basically register allocation, but for stack slots -- isn't so great. It does OK but you can often end up shuffling values in a loop, which is the worst. Your task would be to implement a proper register allocator: puzzle-solving, graph-coloring, iterative coalescing, something that really tries to do a good job. Good luck!

unboxed record fields

Guile's "structs", on which records are implemented, support unboxed values, but these values are untyped, not really integrated with the record layer, and always boxed in the VM. Your task would be to design a language facility that allows us to declare records with typed fields, and to store unboxed values in those fields, and to cause access to their values to emit boxing/unboxing instructions around them. The optimizer will get rid of those boxing/unboxing instructions if it can. Good luck!

textual CPS

The CPS language is key to all compiler work in Guile, but it doesn't have a nice textual form like LLVM IR does. Design one, and implement a parser and an unparser!

avoiding arity checks

If you know the procedure you are calling, like if it's lexically visible, then if you are calling it with the right number of arguments you can skip past the argument check and instead do a call-label directly into the body. Would be pretty neat!

unboxed calls and returns

Likewise if a function's callers are all known, it might be able to unbox its arguments or return value, if that's a good idea. Tricky! You could start with a type inference pass or so, and maybe that could produce some good debugging feedback too.

module-level inlining

Guile currently doesn't inline anything that's not lexically visible. Unfortunately this restriction extends to top-level definitions in a module: they are treated as mutable and so never inlined/optimized/etc. Probably we need to change the semantics here such that a module can be compiled as a unit, and all values which are never mutated can be assumed to be constant. Probably you also want a knob to turn off this behavior, but really you can always re-compile and re-load a module as a whole if re-loading a function at run-time doesn't work because it was inlined. Anyway. Some semantic work here, but some peval work as well. Be careful!
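The reason top-level definitions count as mutable can be seen in a small sketch. The names square and sum-of-squares are made up for the example. The point is only that a later set! has to be visible to earlier callers, which is what rules out naive inlining.

```scheme
;; Hypothetical example; the names are illustrative only.
(define (square x) (* x x))
(define (sum-of-squares a b)
  (+ (square a) (square b)))

(define before (sum-of-squares 3 4))    ; uses the original `square'

;; Top-level bindings can be mutated after the fact...
(set! square (lambda (x) 0))

;; ...and the existing caller has to see the new value, so the compiler
;; could not have copied the old body of `square' into it.
(define after (sum-of-squares 3 4))
```

Here before is 25 and after is 0. A module with no such set! is the case where treating the binding as constant would be safe.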

cross-module inlining

Likewise Guile currently doesn't inline definitions from other modules. However for small functions this really hurts. Guile should probably serialize tree-il for small definitions in .go files, and allow peval to speculatively inline imported definitions. This is related to the previous point and has some semantic implications.

bobobobobobonus! native compilation

Thinking realistically, native compilation is the next step. We have the object file format, cool. We will need the ability to call out from machine code in .go files to run-time functions, so we need to enhance the linker, possibly even with things like PLT/GOT sections to avoid dirtying too many pages. We need to lower the CPS even further, to get closer to some kind of machine model, then go specific, with an assembler for each architecture. The priority in the beginning will be simplicity and minimal complexity; good codegen will come later. This is obviously the most attractive thing but it's also the most tricky, design-wise. I want to do at least part of this, so though you can't have it all, you are welcome to help :)

That's it for now. I'll amend the post with more things as and when I think of them. Comments welcome too, as always. Happy hacking!

talks i would like to give in 2016

21 January 2016 11:59 AM (guile | luajit | compilers | talks | networking)

Every year I feel like I'm trailing things in a way: I hear of an amazing conference with fab speakers, but only after the call for submissions had closed. Or I see an event with exactly the attendees I'd like to schmooze with, but I hadn't planned for it, and hey, maybe I could have even spoken there.

But it's a new year, so let's try some new things. Here's a few talks I would love to give this year.

building languages on luajit

Over the last year or two my colleagues and I have had good experiences compiling in, on, and under LuaJIT, and putting those results into production in high-speed routers. LuaJIT has some really interesting properties as a language substrate: it has a tracing JIT that can punch through abstractions, it has pretty great performance, and it has a couple of amazing escape hatches that let you reach down to the hardware in the form of the FFI and the DynASM assembly generator. There are some challenges too. I can tell you about them :)

try guile for your next project!

This would be a talk describing Guile, what it's like making programs with it, and the kind of performance you can expect out of it. If you're a practicing programmer who likes shipping small programs that work well, are fun to write, and run with pretty good performance, I think Guile can be a great option.

I don't get to do many Guile talks because hey, it's 20 years old, so we don't get the novelty effect. Still, I judge a programming language based on what you can do with it, and recent advances in the Guile implementation have expanded its scope significantly, allowing it to handle many problem sizes that it couldn't before. This talk will be a bit about the language, a bit about the implementation, and a bit about applications or problem domains.

compiling with persistent data structures

As part of Guile's recent compiler improvements, we switched to a somewhat novel intermediate language. It's continuation-passing-style, but based on persistent data structures. Programming with it is interesting and somewhat different than other intermediate languages, and so this would be a talk describing the language and what it's like to work in it. Definitely a talk for compiler people, by a compiler person :)

a high-performance networking with luajit talk

As I mentioned above, my colleagues and I at work have been building really interesting things based on LuaJIT. In particular, using the Snabb Switch networking toolkit has let us build an implementation of a "lightweight address family translation router" -- the internet-facing component of an IPv4-as-a-service architecture, built on an IPv6-only network. Our implementation flies.

It sounds a bit specialized, and it is, but this talk could go two ways.

One version of this talk could be for software people that aren't necessarily networking specialists, describing the domain and how with Snabb Switch, LuaJIT, compilers, and commodity x86 components, we are able to get results that compete well with offerings from traditional networking vendors. Building specialized routers and other network functions in software is an incredible opportunity for compiler folks.

The other version would be more for networking people. We'd explain the domain less and focus more on architecture and results, and look more ahead to challenges of 100Gb/s ports.

let me know!

I'll probably submit some of these to a few conferences, but if you run an event and would like me to come over and give one of these talks, I would be flattered :) Maybe that set of people is empty, but hey, it's worth a shot. Probably contact via the twitters has the most likelihood of response.

There are some things you need to make sure are covered before reaching out, of course. It probably doesn't need repeating in 2016, but make sure that you have a proper code of conduct, and that you'll be able to put in the time to train your event staff to create that safe space that your attendees need. Getting a diverse speaker line-up is important to me too; conferences full of white dudes like me are not only boring but also serve to perpetuate an industry full of white dudes. If you're reaching out, reach out to women and people of color too, and let me know that you're working on it. This old JSConf EU post has some ideas too. Godspeed, and happy planning!

unboxing in guile

19 January 2016 11:57 AM (guile | gnu | compilers | igalia | scalar replacement | unboxing | common subexpression elimination | type inference | flow analysis)

Happy snowy Tuesday, hackfolk! I know I said in my last dispatch that I'd write about Lua soon, but that article is still cooking. In the meantime, a note on Guile and unboxing.

on boxen, on blitzen

Boxing is a way for a programming language implementation to represent a value.

A boxed value is the combination of a value along with a tag providing some information about the value. Both the value and the tag take up some space. The value can be thought to be inside a "box" labelled with the tag and containing the value.

A value's tag can indicate whether the value's bits should be interpreted as an unsigned integer, as a double-precision floating-point number, as an array of words of a particular data type, and so on. A tag can also be used for other purposes, for example to indicate whether a value is a pointer or an "immediate" bit string.

Whether values in a programming language are boxed or not is an implementation consideration. It can be the case in languages with powerful type systems that a compiler knows the representation of all values in all parts of all programs, and so boxing is never needed. However, it's much easier to write a garbage collector if values have a somewhat uniform representation, with tag bits to tell the GC how to trace any pointers that might be contained in the object. Tags can also carry run-time type information needed by a dynamically typed language like Scheme or JavaScript, to allow for polymorphic predicates like number? or pair?.

Boxing all of the values in a program can incur significant overhead in space and in time. For example, one way to implement boxes is to allocate space for the tag and the value on the garbage-collected heap. A boxed value would then be referred to via a pointer to the corresponding heap allocation. However, most memory allocation systems align their heap allocations on word-sized boundaries, for example on 8-byte boundaries. That means that the low 3 bits of the address of a heap allocation will always be zero. If you make a bit string whose low 3 bits are not zero, it cannot possibly be a valid pointer. In that case you can represent some types within the set of bit strings that cannot be valid pointers. These values are called "immediates", as opposed to "heap objects". In Guile, we have immediate representations for characters, booleans, some special values, and a subset of the integers. Alternately, a programming language implementation can represent values as double-precision floating point numbers, and shove pointers into the space of the NaN values. And for heap allocations, some systems can associate one tag with a whole page of values, minimizing per-value boxing overhead.
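The low-bit tagging trick can be modelled with ordinary integers standing in for machine words. This is a sketch under assumptions: the tag value and the two-bit shift below are chosen for illustration and are not a statement of Guile's actual bit layout, and the names tag-fixnum, untag-fixnum and immediate? are invented.

```scheme
;; Hypothetical tagging scheme, for illustration only.
(define fixnum-tag #b10)                ; assumed tag in the low two bits

(define (tag-fixnum n)
  ;; Shift the integer up and put the tag in the freed low bits.
  (logior (ash n 2) fixnum-tag))

(define (untag-fixnum bits)
  ;; An arithmetic shift right drops the tag and keeps the sign.
  (ash bits -2))

(define (immediate? bits)
  ;; With 8-byte alignment, a heap address has its low 3 bits clear,
  ;; so any word with a non-zero low 3 bits cannot be a pointer.
  (not (zero? (logand bits #b111))))
```

For example, (untag-fixnum (tag-fixnum 42)) gives back 42, and (immediate? (tag-fixnum 42)) is true, while an 8-byte-aligned address such as 4096 is not an immediate.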

The goal of these optimizations is to avoid heap allocation for some kinds of boxes. While most language implementations have good garbage collectors that make allocation fairly cheap, the best way to minimize allocation cost is to refrain from it entirely.

In Guile's case, we currently use a combination of low-bit tagging for immediates, including fixnums (a subset of the integers), and tagged boxes on the heap for everything else, including floating-point numbers.

Boxing floating-point numbers obviously incurs huge overhead on floating-point math. You have to consider that each intermediate value produced by a computation will result in the allocation of another 8 bytes for the value and 4 or 8 bytes for the tag. Given that Guile aligns allocations on 8-byte boundaries, the result is a 16-byte allocation in either case. Consider this loop to sum the doubles in a bytevector:

(use-modules (rnrs bytevectors))
(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))

Each trip through the loop is going to allocate not one but two heap floats: one to box the result of bytevector-ieee-double-native-ref (whew, what a mouthful), and one for the sum. If we have a bytevector of 10 million elements, that will be 320 megabytes of allocation. Guile can allocate short-lived 16-byte allocations at about 900 MB/s on my machine, so summing this vector is going to take at least 350ms, just for the allocation. Indeed, without unboxing I measure this loop at 580ms for a 10 million element vector:

> (define v (make-f64vector #e10e6 1.0))
> ,time (f64-sum v)
$1 = 1.0e7
;; 0.580114s real time, 0.764572s run time.  0.268305s spent in GC.

The run time is higher than the real time due to parallel marking. I think in this case, allocation has even higher overhead because it happens outside the bytecode interpreter. The add opcode has a fast path for small integers (fixnums), and if it needs to work on flonums it calls out to a C helper. That C helper doesn't have a pointer to the thread-local freelist so it has to go through a more expensive allocation path.

Anyway, in the time that Guile takes to fetch one f64 value from the vector and add it to the sum, the CPU ticked through some 150 cycles, so surely we can do better than this.
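The back-of-the-envelope numbers above can be checked with a few lines of arithmetic. The inputs are the ones quoted in the text; the variable names are made up, and the clock speed is not given in the text, so the last value only shows what clock the 150-cycle figure would imply.

```scheme
;; Figures taken from the text; names are illustrative only.
(define elements 10000000)              ; 10 million doubles
(define boxes-per-trip 2)               ; one for the ref, one for the sum
(define bytes-per-box 16)               ; 8-byte value plus tag, aligned
(define total-bytes (* elements boxes-per-trip bytes-per-box))

(define allocation-rate 900000000)      ; "about 900 MB/s", in bytes/s
(define min-seconds (/ total-bytes allocation-rate))   ; exact: 16/45

(define measured-seconds 0.580114)      ; from the ,time output
(define ns-per-element (* 1e9 (/ measured-seconds elements)))

;; Assumption: if one element costs about 150 cycles, the clock would
;; be roughly this many GHz.  The text does not state the clock speed.
(define implied-ghz (/ 150 ns-per-element))
```

total-bytes comes to 320000000, min-seconds is a bit over a third of a second, and each element costs about 58 nanoseconds of wall-clock time.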

unboxen, unblitzen

Let's take a look again at the loop to see where the floating-point allocations are produced.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))

It turns out there's no reason for the loquaciously-named bytevector-ieee-double-native-ref to return a boxed number. It's a monomorphic function that is well-known to the Guile compiler and virtual machine, and it even has its own opcode. In Guile 2.0 and until just a couple months ago in Guile 2.2, this function did box its return value, but that was because the virtual machine had no facility for unboxed values of any kind.

To allow bytevector-ieee-double-native-ref to return an unboxed double value, the first item of business was then to support unboxed values in Guile's VM. Looking forward to unboxed doubles, we made a change such that all on-stack values are 64 bits wide, even on 32-bit systems. (For simplicity, all locals in Guile take up the same amount of space. For the same reason, fetches of 32-bit floats also unbox to 64-bit doubles.)

We also made a change to Guile's "stack maps", which are data structures that tell the garbage collector which locals are live in a stack frame. There is a stack map recorded at every call in a procedure, to be used when an activation is pending on the stack. Stack maps are stored in a side table in a separate section of the compiled ELF library. Live values are traced by the garbage collector, and dead values are replaced by a special "undefined" singleton. The change we made was to be able to indicate that live values were boxed or not, and if they were unboxed, what type they were (e.g. unboxed double). Knowing the type of locals helps the debugger to print values correctly. Currently, all unboxed values are immediates, so the GC doesn't need to trace them, but it's conceivable that we could have unboxed pointers at some point. Anyway, instead of just storing one bit (live or dead) per local in the stack map, we store two, and reserve one of the bit patterns to indicate that the local is actually an f64 value.
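A two-bits-per-local stack map can be sketched with an integer as the bit string. The specific bit patterns and the names below are assumptions for illustration; the text only says that there are two bits per local and that one pattern is reserved for f64 values.

```scheme
;; Hypothetical slot codes; the real bit patterns are not given above.
(define slot-dead 0)
(define slot-live-boxed 1)
(define slot-live-f64 2)

(define (pack-stack-map kinds)
  ;; Pack a list of slot codes into one integer, two bits per slot,
  ;; with slot 0 in the lowest two bits.
  (let lp ((kinds kinds) (shift 0) (bits 0))
    (if (null? kinds)
        bits
        (lp (cdr kinds)
            (+ shift 2)
            (logior bits (ash (car kinds) shift))))))

(define (slot-kind bits i)
  ;; Extract the two-bit code for slot `i'.
  (logand (ash bits (* -2 i)) #b11))

(define (gc-must-trace? bits i)
  ;; Only live boxed values are traced; f64 slots hold raw bits.
  (= (slot-kind bits i) slot-live-boxed))
```

For a frame described by (pack-stack-map (list slot-live-boxed slot-live-f64 slot-dead slot-live-boxed)), the collector in this model would trace slots 0 and 3 and leave slots 1 and 2 alone.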

But the changes weren't done then: since we had never had unboxed locals, there were quite a few debugging-related parts of the VM that assumed that we could access the first slot in an activation to see if it was a procedure. This dated from a time in Guile where slot 0 would always be the procedure being called, but the check is bogus ever since Guile 2.2 allowed local value slots corresponding to the closure or procedure arguments to be re-used for other values, if the closure or argument was dead. Another nail in the coffin of procedure-in-slot-0 was driven by closure optimizations, in which closures whose callers are all visible could specialize the representation of their closure in non-standard ways. It took a while, but unboxing f64 values flushed out these bogus uses of slot 0.

The next step was to add boxing and unboxing operations to the VM (f64->scm and scm->f64, respectively). Then we changed bytevector-ieee-double-native-ref to return an unboxed value and then immediately box it via f64->scm. Similarly for bytevector-ieee-double-native-set!, we unbox the value via scm->f64, potentially throwing a type error. Unfortunately our run-time type mismatch errors got worse; although the source location remains the same, scm->f64 doesn't include the reason for the unboxing. Oh well.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let* ((f64 (bytevector-ieee-double-native-ref v i))
                   (boxed (f64->scm f64)))
              (+ sum boxed)))
        sum)))

When we lower Tree-IL to CPS, we insert the needed f64->scm and scm->f64 boxing and unboxing operations around bytevector accesses. Cool. At this point we have a system with unboxed f64 values, but which is slower than the original version because every f64 bytevector access involves two instructions instead of one, although the instructions themselves together did the same amount of work. However, telling the optimizer about these instructions could potentially eliminate some of them. Let's keep going and see where we get.

Let's attack the other source of boxes, the accumulation of the sum. We added some specialized instructions to the virtual machine to support arithmetic over unboxed values. Doing this is potentially a huge win, because not only do you avoid allocating a box for the result, you also avoid the type checks on the incoming values. So we add f64+, f64-, and so on.

Unboxing the + to f64+ is a tricky transformation, and relies on type analysis. Our assumption is that if type analysis indicates that we are in fact able to replace a generic arithmetic instruction with a combination of operand unboxing, unboxed arithmetic, and a boxing operation, then we should do it. Separating out the boxes and the monomorphic arithmetic opens the possibility to remove the resulting box, and possibly remove the unboxing of operands too. In this case, we run an optimization pass and end up with something like:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let* ((f64 (bytevector-ieee-double-native-ref v i))
                   (boxed (f64->scm f64)))
              (f64->scm
               (f64+ (scm->f64 sum)
                     (scm->f64 boxed)))))
        sum)))

Scalar replacement via fabricated expressions will take the definition of boxed as (f64->scm f64) and fabricate a definition of f64 as (scm->f64 boxed), which propagates down to the f64+ so we get:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let* ((f64 (bytevector-ieee-double-native-ref v i))
                   (boxed (f64->scm f64)))
              (f64->scm
               (f64+ (scm->f64 sum)
                     f64))))
        sum)))

Dead code elimination can now kill boxed, so we end up with:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i)))
              (f64->scm
               (f64+ (scm->f64 sum)
                     f64))))
        sum)))

Voilà, we removed one allocation. Yay!

As we can see from the residual code, we're still left with one f64->scm boxing operation. That expression is one of the definitions of sum, one of the loop variables. The other definition is 0.0, the starting value. So, after specializing arithmetic operations, we go through the set of multiply-defined variables ("phi" variables) and see what we can do to unbox them.

A phi variable can be unboxed if all of its definitions are unboxable. It's not always clear that you should unbox, though. For example, maybe you know via looking at the definitions for the value that it can be unboxed as an f64, but all of its uses are boxed. In that case it could be that you throw away the box when unboxing each definition, only to have to re-create them anew when using the variable. You end up allocating twice as much instead of not at all. It's a tricky situation. Currently we assume a variable with multiple definitions should only be unboxed if it has an unboxed use. The initial set of unboxed uses is the set of operands to scm->f64. We iterate this set to a fixed point: unboxing one phi variable could cause others to be unboxed as well. As a heuristic, we only require one unboxed use; it could be there are other uses that are boxed, and we could indeed hit that pessimal double-allocation case. Oh well!

In this case, the intermediate result looks something like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (let ((f64 (bytevector-ieee-double-native-ref v i)))
                (scm->f64
                 (f64->scm
                  (f64+ (scm->f64 sum-box)
                        f64)))))
          sum-box))))

After the scalar replacement and dead code elimination passes, we end up with something more like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (f64+ sum
                    (bytevector-ieee-double-native-ref v i)))
          sum-box))))

Well this is looking pretty good. There's still a box though. Really we should sink this to the exit, but as it happens there's something else that accidentally works in our favor: loop peeling. By peeling the first loop iteration, we create a control-flow join at the loop exit that defines a phi variable. That phi variable is subject to the same optimization, sinking the box down to the join itself. So in reality the result looks like:

(define (f64-sum v)
  (let ((i 0)
        (sum (scm->f64 0.0))
        (len (bytevector-length v)))
     (if (< i len)
         (let ((i (+ i 8))
               (sum (f64+ sum
                          (bytevector-ieee-double-native-ref v i))))
           (let lp ((i i) (sum sum))
             (if (< i len)
                 (lp (+ i 8)
                     (f64+ sum (bytevector-ieee-double-native-ref v i)))
                 (f64->scm sum))))
         (f64->scm sum))))

As you can see, the peeling lifted the length computation up to the top too, which is a bonus. We should probably still implement allocation sinking, especially for loops for which peeling isn't an option, but the current status often works well. Running f64-sum on a 10-million-element packed double array goes down from 580ms to 99ms, or to some 25 or 30 CPU cycles per element, and of course no time in GC. Considering that this loop still has the overhead of bytecode interpretation and cache misses, I think we're doing A O K.
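If you want to try this kind of measurement yourself, something like the following should do. It is a sketch rather than the exact benchmark behind the numbers above: make-packed-doubles and the element count are made up for illustration, and only f64-sum mirrors the source-level loop under discussion.

```scheme
(use-modules (rnrs bytevectors))

;; Source-level loop: sum a bytevector of packed native-endian doubles.
(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))

;; Hypothetical helper: a bytevector holding N copies of the double X.
(define (make-packed-doubles n x)
  (let ((v (make-bytevector (* n 8))))
    (let lp ((i 0))
      (when (< i n)
        (bytevector-ieee-double-native-set! v (* i 8) x)
        (lp (+ i 1))))
    v))

(define v (make-packed-doubles 1000 0.5))
(f64-sum v)  ; 1000 halves sum exactly to 500.0
```

At the REPL, running ,time (f64-sum v) on a much larger vector reports the run time along with the time spent in GC, which is the number to watch here.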


It used to be that using packed bytevectors of doubles was an easy way to make your program slower using types (thanks to Sam Tobin-Hochstadt for that quip). The reason is that although a packed vector of doubles uses less memory, every access to it has to allocate a new boxed number. Compare to "normal" vectors where sure, it uses more memory, but fetching an element fetches an already-boxed value. Now with the unboxing optimization, this situation is properly corrected... in most cases.

The major caveat is that for unboxing to work completely, each use of a potentially-unboxable value has to have an alternate implementation that can work on unboxed values. In our example above, the only use was f64+ (which internally is really called fadd), so we win. Writing an f64 to a bytevector can also be unboxed. Unfortunately, bytevectors and simple arithmetic are currently all of the unboxable operations. We'll implement more over time, but it's a current limitation.

Another point is that we are leaning heavily on the optimizer to remove the boxes when it can. If there's a bug or a limitation in the optimizer, it could be the box stays around needlessly. It happens, hopefully less and less but it does happen. To be sure you get the advantages, you need to time the code and see if it's spending significant time in GC. If it is, then you need to disassemble your code to see where that's happening. It's not a very nice thing, currently. The Scheme-like representations I gave above were written by hand; the CPS intermediate language is much more verbose than that.

Another limitation is that function arguments and return values are always boxed. Of course, the compiler can inline and contify a lot of functions, but that means that to use abstraction, you need to build up a mental model of what the inliner is going to do.

Finally, it's not always obvious to the compiler what the type of a value is, and that necessarily limits unboxing. For example, if we had started off the loop by defining sum to be 0 instead of 0.0, the result of the loop as a whole could be either an exact integer or an inexact real. Of course, loop peeling mitigates this to an extent, unboxing sum within the loop after the first iteration, but it so happens that peeling also prevents the phi join at the loop exit from being unboxed, because the result from the peeled iteration is 0 and not 0.0. In the end, we are unable to remove the equivalent of sum-box, and so we still allocate once per iteration. Here is a clear case where we would indeed need allocation sinking.
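The ambiguity is easy to see at the value level. The sketch below takes the initial accumulator as a parameter (f64-sum-from is a made-up name), which is not how the compiler sees the original loop, but it does show the two result types that the analysis has to account for.

```scheme
(use-modules (rnrs bytevectors))

;; Like f64-sum, but with the initial accumulator passed in.
(define (f64-sum-from init v)
  (let lp ((i 0) (sum init))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))
        sum)))

(define empty (make-bytevector 0))
(define one (make-bytevector 8))
(bytevector-ieee-double-native-set! one 0 1.5)

(exact? (f64-sum-from 0 empty))    ; #t: the exact 0 falls straight through
(exact? (f64-sum-from 0 one))      ; #f: 0 + 1.5 is inexact
(exact? (f64-sum-from 0.0 empty))  ; #f: always a flonum
```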

Also, consider that in other contexts the type of (+ x 1.0) might actually be complex instead of real, which means that depending on the type of x it might not be valid to unbox this addition. Proving that a number is not complex can be non-obvious. That's the second way that fetching a value from a packed vector of doubles or floats is useful: it's one of the rare times that you know that a number is real-valued.
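A tiny illustration of the complex case, with add-one as a made-up name:

```scheme
;; Adding a flonum does not guarantee a real-valued result.
(define (add-one x) (+ x 1.0))

(real? (add-one 2))         ; #t
(real? (add-one 1.0+2.0i))  ; #f: the sum is 2.0+2.0i
```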

on integer, on fixnum

That's all there is to say about floats. However, when doing some benchmarks of the floating-point unboxing, one user couldn't reproduce some of the results: they were seeing huge run-times on a microbenchmark that repeatedly summed the elements of a vector. It turned out that the reason was that they were on a 32-bit machine, and one of the loop variables used in the test was exceeding the fixnum range. Recall that fixnums are the subset of integers that fit in an immediate value, along with their tag. Guile's fixnum tag is 2 bits, and fixnums have a sign bit, so the most positive fixnum on a 32-bit machine is 2^29 - 1, or around 500 million. It sure is a shame not to be able to count up to #xFFFFFFFF without throwing an allocation party!
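The arithmetic, spelled out. max-fixnum is a hypothetical helper written for this note, not something Guile provides under that name.

```scheme
;; Hypothetical helper: the largest fixnum for a given word size and tag
;; width.  One bit goes to the sign, TAG-BITS to the tag, and the rest
;; to the magnitude.
(define (max-fixnum word-bits tag-bits)
  (- (expt 2 (- word-bits tag-bits 1)) 1))

(max-fixnum 32 2)  ; 536870911, i.e. 2^29 - 1
(max-fixnum 64 2)  ; 2305843009213693951, i.e. 2^61 - 1
```

On any given build you can compare against the value of most-positive-fixnum.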

So, we set about seeing if we could unbox integers as well in Guile. Guile's compiler has a lot more visibility as to when something is an integer, compared to real numbers. Anything used as an index into a vector or similar data structure must be an exact integer, and any query as to the length of a vector or a string or whatever is also an integer.

Note that knowing that a value is an exact integer is insufficient to unbox it: you have to also know that it is within the range of your unboxed integer data type. Here we take advantage of the fact that in Guile, type analysis also infers ranges. So, cool. Because the kinds of integers that can be used as indexes and lengths are all non-negative, our first unboxed integer type is u64, the unsigned 64-bit integers.

If Guile did native compilation, it would always be a win to unbox any integer operation, if only because you would avoid polymorphism or any other potential side exit. For bignums that are within the unboxable range, the considerations are similar to the floating-point case: allocation costs dominate, so unboxing is almost always a win, provided that you avoid double-boxing. Eliminating one allocation can pay off a lot of instruction dispatch.

For fixnums, though, things are not so clear. Immediate tagging is such a cheap way of boxing that in an interpreter, the extra instructions you introduce could outweigh any speedup from having faster operations.

In the end, I didn't do science and I decided to just go ahead and unbox if I could. We are headed towards native compilation, this is a necessary step along that path, and what the hell, it seemed like a good idea at the time.

Because there are so many more integers in a typical program than floating-point numbers, we had to provide unboxed integer variants of quite a number of operations. Of course we could unconditionally require unboxed arguments to vector-ref, string-length and so on, but in addition to making u64 variants of arithmetic, we also support bit operations like logand and such. Unlike the current status with floating point numbers, we can do test-and-branch over unboxed u64 comparisons, and we can compare u64 values to boxed SCM values.

In JavaScript, making sure an integer is unboxed is easy: you just do val | 0. The bit operation | truncates the value to a 32-bit two's-complement signed integer (thanks to Slava for the correction). In Guile though, we have arbitrary-precision bit operations, so although (logior val 0) would assert that val is an integer, it wouldn't necessarily mean that it's unboxable.
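To make that concrete, here is what the bit operations do to a value well outside any 64-bit range (big is just a name picked for the example):

```scheme
(define big (expt 2 100))               ; far outside the u64 range

(= (logior big 0) big)                  ; #t: nothing was truncated
(logand big #xffffffffffffffff)         ; 0: a mask does bound the result
(logand (+ big 5) #xffffffffffffffff)   ; 5
```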

Instead, the Guile idiom for making sure you have an unboxed integer in a particular range should go like this:

(define-inlinable (check-uint-range x mask)
  (let ((x* (logand x mask)))
    (unless (= x x*)
      (error "out of range" x))
    x*))

A helper like this is useful to assert that an argument to a function is of a particular type, especially given that arguments to functions are always boxed and treated as being of unknown type. The logand asserts that the value is an integer, and the comparison asserts that it is within range.

For example, if we want to implement a function that does modular 8-bit addition, it can go like:

(define-inlinable (check-uint8 x)
  (check-uint-range x #xff))
(define-inlinable (truncate-uint8 x)
  (logand x #xff))
(define (uint8+ x y)
  (truncate-uint8 (+ (check-uint8 x) (check-uint8 y))))

If we disassemble this function, we get something like:

Disassembly of #<procedure uint8+ (x y)> at #xa8d0f8:

   0    (assert-nargs-ee/locals 3 2)    ;; 5 slots (2 args)
   1    (scm->u64/truncate 4 3)
   2    (load-u64 1 0 255)
   5    (ulogand 4 4 1)
   6    (br-if-u64-=-scm 4 3 #f 17)     ;; -> L1
;; [elided code to throw an error if x is not in range]
  23    (scm->u64/truncate 3 2)
  24    (ulogand 3 3 1)
  25    (br-if-u64-=-scm 3 2 #f 18)     ;; -> L2
;; [elided code to throw an error if y is not in range]
  43    (uadd 4 4 3)
  44    (ulogand 4 4 1)
  45    (u64->scm 3 4)
  46    (return-values 2)               ;; 1 value

The scm->u64/truncate instructions unbox an integer, truncating it to the u64 range. They are used when we know that any additional bits won't be used, as in this case where we immediately do a logand of the unboxed value. All in all it's not a bad code sequence; there are two possible side exits for each argument (not an integer signalled by the unboxing, and out of range signalled by the explicit check), and no other run-time dispatch. For now I think we can be pretty happy with the code.
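Both side exits are easy to poke at from Scheme. The block below repeats the definitions so that it stands alone; value-or-key is a hypothetical helper added for the demonstration, and it returns whatever exception key was thrown rather than asserting which one.

```scheme
(define-inlinable (check-uint-range x mask)
  (let ((x* (logand x mask)))
    (unless (= x x*)
      (error "out of range" x))
    x*))
(define-inlinable (check-uint8 x)
  (check-uint-range x #xff))
(define-inlinable (truncate-uint8 x)
  (logand x #xff))
(define (uint8+ x y)
  (truncate-uint8 (+ (check-uint8 x) (check-uint8 y))))

;; Hypothetical helper: run THUNK, returning its value or the key of
;; whatever exception it threw.
(define (value-or-key thunk)
  (catch #t thunk (lambda (key . args) key)))

(uint8+ 200 100)                           ; 44, as 300 wraps modulo 256
(value-or-key (lambda () (uint8+ 256 1)))  ; a key: 256 fails the range check
(value-or-key (lambda () (uint8+ 1.5 1)))  ; a key: logand rejects a non-integer
```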

That's about it for integer unboxing. We also support unboxed signed 64-bit integers, mostly for use as operands or return values from bytevector-s8-ref and similar unboxed accessors on bytevectors. There are fewer operations that have s64 variants, though, compared to u64 variants.


Up until now in Guile, it could be that you might have to avoid Scheme if you needed to do some kinds of numeric computation. Unboxing floating-point and integer numbers makes it feasible to do more computation in Scheme instead of having to rely on inflexible C interfaces. At the same time, as a Scheme hacker I feel much more free knowing that I can work on 64-bit integers without necessarily allocating bignums. I expect this optimization to have a significant impact on the way I program, and what I program. We'll see where this goes, though. Until next time, happy hacking :)

the half strap: self-hosting and guile

11 January 2016 9:51 PM (compilers | gnu | igalia | guile | scheme | compilers | self-hosting | profiling | interpreters)

or, "why does building guile take so friggin long"

Happy new year's, hackfolk! I don't know about y'all, but I'm feeling pretty good about 2016. Let's make some cool stuff!

Today's article is about Guile and how it builds itself. It's a Scheme implementation mostly written in Scheme, so how it would go about doing that isn't straightforward. And although the performance of Guile is pretty great these days, a user's first experience with it will probably be building it, which is a process that takes approximately forever. Seriously. On this newish laptop with an i7-5600U CPU and four cores it takes like 45 minutes. On older machines it can take even longer. What gives?

Well, fictional reader, it's a good question. I'm glad you asked! Before getting to the heart of the matter, I summarize a bit of background information.

and then nothing turned itself inside out

Guile is mostly written in Scheme. Some parts of it are written in C -- some runtime routines, some supporting libraries (the garbage collector, unicode support, arbitrary precision arithmetic), and the bytecode interpreter. The first phase when building Guile is to take the system's C compiler -- a program that takes C source code and produces native machine code -- and use it to build libguile, the part of Guile that is written in C.

The next phase is to compile the parts of Guile written in Scheme. Currently we compile to bytecode which is then interpreted by libguile, but this discussion would be the same if we compiled Scheme to native code instead of bytecode.

There's a wrinkle, though: the Scheme compiler -- the program that takes a Scheme program and produces bytecode -- is written in Scheme. When we built libguile, we could use the system's C compiler. But the system has no Scheme compiler, so how do we do?

The answer is that in addition to a Scheme compiler, Guile also includes a Scheme interpreter. We use the interpreter to load the Scheme compiler, and then use the compiler to produce bytecode from Scheme.

There's another wrinkle, though, and I bet you can guess what it is :) The Scheme interpreter is also written in Scheme. It used to be that Guile's Scheme interpreter was written in C, but that made it impossible to tail-call between compiled and interpreted code. So some six years ago, I rewrote the interpreter in Scheme.

As I mention in that article, Guile actually has two Scheme interpreters: the one in Scheme and one in C that is only used to compile the one in Scheme, and never used again. The bootstrap interpreter written in C avoids the problem with tail calls to compiled code because when it runs, there is no compiled code.

So in summary, Guile's build has the following general phases:

  1. The system C compiler builds libguile.

  2. The bootstrap C interpreter in libguile loads the Scheme compiler and builds eval.go from eval.scm. (Currently .go is the extension for compiled Guile code. The extension predates the Go language. Probably we switch to .so at some point, though.)

  3. The Scheme interpreter from eval.go loads the Scheme compiler and compiles the rest of the Scheme code in Guile, including the Scheme compiler itself.

In the last step, Guile compiles each file in its own process, allowing for good parallelization. This also means that as the compiler builds, the compiler itself starts running faster because it can use the freshly built .go files instead of having to use the interpreter to load the source .scm files.

so what's slow?

Building libguile is not so slow; it takes about a minute on my laptop. Could be faster, but it's fine.

Building eval.go is slow, but at two and a half minutes it's bearable.

Building the rest of the Scheme code is horribly slow though, and for me takes around 40 or 50 minutes. What is going on?

The crucial difference between building libguile and building the .go files is that when we build libguile, we use the C compiler, which is itself a highly optimized program. When we build .go files, we use the Scheme compiler, which hasn't yet been compiled! Indeed if you rebuild all the Scheme code using a compiled Scheme compiler instead of an interpreted Scheme compiler, you can rebuild all of Guile in about 5 minutes. (Due to the way the Makefile dependencies work, the easiest way to do this if you have a built Guile is rm bootstrap/ice-9/eval.go && make -jN.)

The story is a bit complicated by parallelism, though. Usually if you do a make -j4, you will be able to build 4 things at the same time, taking advantage of 4 cores (if you have them). However Guile's Makefile rules are arranged in such a way that the initial eval.go compile is done serially, when nothing else is running. This is because the bootstrap interpreter written in C uses C stack space as temporary storage. It could be that when compiling bigger files, the C interpreter might run out of stack, and with C it's hard to detect exactly how much stack you have. Indeed, sometimes we get reports of strange bootstrap failures that end up being because Guile was built with -O0 and the compiler decided to use much more stack space than we usually see. We try to fix these, usually by raising the static stack limits that Guile's C interpreter imposes, but we certainly don't want a limitation in the bootstrap interpreter to affect the internal structure of the rest of Guile. The bootstrap interpreter's only job is to load the compiler and build eval.go, and isn't tested in any other way.

So eval.go is built serially. After that, compilation can proceed in parallel, but goes more slowly before speeding up. To explain that, I digress!

a digression on interpreters

When Scheme code is loaded into Guile from source, the process goes like this:

  1. Scheme code is loaded from disk or wherever as a stream of bytes.

  2. The reader parses that byte stream into S-expressions.

  3. The expander runs on the S-expressions, expanding macros and lowering Scheme code to an internal language called "Tree-IL".

Up to here, the pipeline is shared between the interpreter and the compiler. If you're compiling, Guile will take the Tree-IL, run the partial evaluator on it, lower to CPS, optimize that CPS, and then emit bytecode. The next time you load this file, Guile will just mmap in the .go file and skip all of the other steps. Compilation is great!

But if you are interpreting, a few more things happen:

  1. The memoizer does some analysis on the Tree-IL and turns variable references into two-dimensional (depth, offset) references on a chained environment. See the story time article for more; scroll down about halfway for the details. The goal is to do some light compilation on variable access so that the interpreter will have to do less work, and also prevent closures from hanging on to too much data; this is the "flat closure" optimization, for the interpreter.

  2. The interpreter "compiles" the code to a chain of closures. This is like the classic direct-threading optimization, but for a tree-based interpreter.

The closure-chaining strategy of the interpreter is almost exactly as described in SICP's analyze pass. I came up with it independently, but so did Jonathan Rees in 1982 and Marc Feeley in 1986, so I wasn't surprised when I found the prior work!
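For readers who haven't seen the technique, here is a toy version in the SICP style. It is not Guile's eval.scm: the mini-language (numbers, variables, if, + and <) and the association-list environment are simplifications chosen for this sketch, and there is no memoizer producing (depth, offset) references.

```scheme
(use-modules (ice-9 match))

;; Walk the expression once, returning a closure that takes an
;; environment.  All dispatch on syntax happens here, not at run time.
(define (analyze exp)
  (match exp
    ((? number? n)
     (lambda (env) n))
    ((? symbol? name)
     (lambda (env) (cdr (assq name env))))
    (('if c t e)
     (let ((c* (analyze c)) (t* (analyze t)) (e* (analyze e)))
       (lambda (env) (if (c* env) (t* env) (e* env)))))
    (('+ a b)
     (let ((a* (analyze a)) (b* (analyze b)))
       (lambda (env) (+ (a* env) (b* env)))))
    (('< a b)
     (let ((a* (analyze a)) (b* (analyze b)))
       (lambda (env) (< (a* env) (b* env)))))))

(define run (analyze '(if (< x 10) (+ x 1) x)))
(run '((x . 3)))   ; 4
(run '((x . 12)))  ; 12
```

The expression is analyzed once and the resulting chain of closures can be run many times against different environments, which is where the savings come from.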

Back in 2009 when we switched to the eval-in-Scheme, we knew that it would result in a slower interpreter. This is because instead of the interpreter being compiled to native code, it was compiled to bytecode. Also, Guile's Scheme compiler wasn't as good then, so we knew that we were leaving optimizations on the floor. Still, the switch to an evaluator in Scheme enabled integration of the compiler, and we thought that the interpreter speed would improve with time. I just took a look and with this silly loop:

(let lp ((n 0)) (if (< n #e1e7) (lp (1+ n))))

Guile 1.8's interpreter written in C manages to run this in 1.1 seconds. Guile 2.0's interpreter written in Scheme and compiled to the old virtual machine does it in 16.4 seconds. Guile 2.1.1's interpreter, with the closure-chaining optimization, a couple of peephole optimizations in the interpreter, and compiled using the better compiler and VM from Guile 2.2, manages to finish in 2.4 seconds. So we are definitely getting better, and by the time we compile eval.scm to native code I have no doubt that we will be as good as the old C implementation. (Of course, when compiled to Guile 2.2's VM, the loop finishes in 55 milliseconds, but comparing a compiler and an interpreter is no fair.)

The up-shot for bootstrap times is that once the interpreter is compiled, the build currently runs a little slower, because the compiled eval.go interpreter is a bit slower than the bootstrap interpreter in libguile.

bottom up, top down

Well. Clearly I wanted to share a thing with you about interpreters; thank you for following along :) The salient point is that Guile's interpreter is now pretty OK, though of course not as good as the compiler. Still, Guile 2.0 builds in 12 minutes, while Guile 2.2 builds in 40 or 50, and Guile 2.2 has a faster interpreter. What's the deal?

There are a few factors at play but I think the biggest is that Guile 2.2's compiler is simply much more sophisticated than Guile 2.0's compiler. Just loading it up at bootstrap-time takes longer than loading Guile 2.0's compiler, because there's more code using more macro abstractions than in Guile 2.0. The expander has to do more work, and the evaluator has to do more work. A compiler is a program that runs on programs, and interpreting a bigger program is going to be slower than interpreting a smaller program.

It's a somewhat paradoxical result: to make programs run faster, we needed a better compiler, but that better compiler is bigger, and so it bootstraps from source more slowly. Some of the improvements to generated code quality were driven by a desire to have the compiler run faster, but this only had the reverse effect on bootstrap time.

Unfortunately, Guile 2.2's compiler also runs slow when it's fully compiled: compiling one largeish module in Guile 2.2 compared to 2.0 takes 10.7 seconds instead of 1.9. (To reproduce, ,time (compile-file "module/ice-9/psyntax-pp.scm") from a Guile 2.0 or 2.2 REPL.) How can we explain this?

Understanding this question has taken me some time. If you do a normal profile of the code using statprof, you get something like this:

> ,profile (compile-file "module/ice-9/psyntax-pp.scm")
%     cumulative   self             
time   seconds     seconds  procedure
 12.41      1.61      1.61  language/cps/intmap.scm:393:0:intmap-ref
  6.35      1.05      0.82  vector-copy
  5.92     13.09      0.77  language/cps/intset.scm:467:5:visit-branch
  5.05      0.71      0.65  language/cps/intmap.scm:183:0:intmap-add!
  4.62      1.40      0.60  language/cps/intset.scm:381:2:visit-node
  3.61      0.93      0.47  language/cps/intset.scm:268:0:intset-add
  3.46      0.49      0.45  language/cps/intset.scm:203:0:intset-add!
  3.17      1.01      0.41  language/cps/intset.scm:269:2:adjoin
  3.03      1.46      0.39  language/cps/intmap.scm:246:2:adjoin

("Cumulative seconds" can be greater than the total number of seconds for functions that have multiple activations live on the stack.)

These results would seem to unequivocally indicate that the switch to persistent data structures in the new compiler is to blame. This is a somewhat disheartening realization; I love working with the new data structures. They let me write better code and think about bigger things.

Seeing that most of the time is spent in intmap and intset manipulations, I've tried off and on over the last few months to speed them up. I tried at one point replacing hot paths with C -- no speedup, so I threw it away. I tried adding an alternate intmap implementation that, for transient packed maps, would store the map as a single vector; no significant speedup, binned it. I implemented integer unboxing in the hopes that it would speed up the results; more about that in another missive. I stared long and hard at the generated code, looking for opportunities to improve it (and did make some small improvements). Even when writing this article, the results are such a shame that I put the article on hold for a couple weeks while I looked into potential improvements, and managed to squeak out another 10%.

In retrospect, getting no speedup out of C hot paths should have been a hint.

For many years, a flat statistical profile with cumulative/self timings like the one I show above has been my go-to performance diagnostic. Sometimes it does take a bit of machine sympathy to understand, though; when you want to know what's calling a hot function, usually you look farther down the list for functions that don't have much self time but whose cumulative time matches the function you're interested in. But this approach doesn't work for hot functions that are called from many, many places, as is the case with these fundamental data structure operations.

Indeed at one point I built a tool to visualize statistical stack samples, the idea being you often want to see how a program gets to its hot code. This tool was useful but its output could be a bit overwhelming. Sometimes you'd have to tell it to generate PDF instead of PNG files because the height of the image exceeded Cairo's internal limits. The tool also had too many moving pieces to maintain. Still, the core of the idea was a good one, and I incorporated the non-graphical parts of it into Guile proper, where they sat unused for a few years.

Fast-forward to now, where faced with this compiler performance problem, I needed some other tool to help me out. It turns out that in the 2.0 to 2.2 transition, I had to rewrite the profiler's internals anyway to deal with the new VM. The old VM could identify a frame's function by the value in local slot 0; the new one has to look up from instruction pointer values. Because this lookup can be expensive, the new profiler just writes sampled instruction pointer addresses into an array for later offline analysis, eventually distilling them to a flat profile. It turns out that this information is exactly what's needed to do a tree profile like I did in chartprof. I had to add cycle detection to prevent the graphs from being enormous, but cycle detection makes much more sense in a tree output than in a flat profile. The result, distilled a bit:

> ,profile (compile-file "module/ice-9/psyntax-pp.scm") #:display-style tree
100.0% read-and-compile at system/base/compile.scm:208:0
  99.4% compile at system/base/compile.scm:237:0
    99.4% compile-fold at system/base/compile.scm:177:0
      75.3% compile-bytecode at language/cps/compile-bytecode.scm:568:0
        73.8% lower-cps at language/cps/compile-bytecode.scm:556:0
          41.1% optimize-higher-order-cps at language/cps/optimize.scm:86:0
          29.9% optimize-first-order-cps at language/cps/optimize.scm:106:0
          1.5% convert-closures at language/cps/closure-conversion.scm:814:0
      20.5% emit-bytecode at language/cps/compile-bytecode.scm:547:0
        18.5% visit-branch at language/cps/intmap.scm:514:5
          18.5% #x7ff420853318 at language/cps/compile-bytecode.scm:49:15
            18.5% compile-function at language/cps/compile-bytecode.scm:83:0
              18.5% allocate-slots at language/cps/slot-allocation.scm:838:0
      3.6% compile-cps at language/tree-il/compile-cps.scm:1071:0
        2.5% optimize at language/tree-il/optimize.scm:31:0
        0.6% cps-convert/thunk at language/tree-il/compile-cps.scm:924:0
        0.4% fix-letrec at language/tree-il/fix-letrec.scm:213:0
  0.6% compile-fold at system/base/compile.scm:177:0
    0.6% save-module-excursion at ice-9/boot-9.scm:2607:0
      0.6% #x7ff420b95254 at language/scheme/compile-tree-il.scm:29:3

I've uploaded the full file here, for the curious Guile hacker.
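The aggregation behind a tree profile like this is simple enough to sketch. What follows is not statprof's implementation: the tree representation, the procedure names, and the samples are all invented for illustration, frames are plain symbols rather than instruction pointers, and there is no cycle detection.

```scheme
(use-modules (srfi srfi-1))

;; A tree is an alist mapping a frame name to (count . subtree).
;; A stack is a list of frame names, outermost first.
(define (add-stack tree stack)
  (if (null? stack)
      tree
      (let* ((name (car stack))
             (old (or (assq-ref tree name) (cons 0 '())))
             (new (cons (+ 1 (car old))
                        (add-stack (cdr old) (cdr stack)))))
        (cons (cons name new)
              (alist-delete name tree eq?)))))

(define (samples->tree samples)
  (fold (lambda (stack tree) (add-stack tree stack)) '() samples))

;; Print each node as a percentage of TOTAL samples, hottest first.
(define (print-tree tree total indent)
  (for-each
   (lambda (entry)
     (let ((name (car entry))
           (count (cadr entry))
           (children (cddr entry)))
       (format #t "~a~a% ~a\n"
               (make-string indent #\space)
               (/ (round (* 1000.0 (/ count total))) 10)
               name)
       (print-tree children total (+ indent 2))))
   (sort tree (lambda (a b) (> (cadr a) (cadr b))))))

;; Made-up samples, purely for illustration.
(define samples
  '((compile optimize intmap-ref)
    (compile optimize intset-add)
    (compile emit-bytecode intmap-ref)
    (read)))

(define tree (samples->tree samples))
(print-tree tree (length samples) 0)
```

Note how intmap-ref shows up under two different callers. A flat profile would merge those into one line, which is exactly the information that was missing.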

So what does it mean? The high-order bit is that we spend some 70% of the time in the optimizer. Indeed, running the same benchmark but omitting optimizations gets a much more respectable time:

$ time meta/uninstalled-env \
  guild compile -O0 module/ice-9/psyntax-pp.scm -o /tmp/foo.go
wrote `/tmp/foo.go'

real	0m3.050s
user	0m3.404s
sys	0m0.060s

One of the results of this investigation was that we should first compile the compiler with -O0 (no optimizations), then compile the compiler with -O2 (with optimizations). This change made it into the 2.1.1 release a couple months ago.

We also spend around 18.5% of time in slot allocation -- deciding what local variable slots to allocate to CPS variables. This takes time because we do a precise live variable analysis over the CPS, which itself has one variable for every result value and a label for every program point. Then we do register allocation, but in a way that could probably be optimized better. Perhaps with -O0 we should use a different strategy to allocate slots: one which preserves the values of variables that are available but dead. This would actually be an easier allocation task. An additional 1.5% is spent actually assembling the bytecode.
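
To make "precise live variable analysis" a bit more concrete, here is a small Python sketch of the textbook backward dataflow iteration, with one label per program point. It is only an illustration under assumptions: Guile's slot allocator works over its own CPS data structures and is certainly not written like this, and the function name `liveness` and the four-point program are invented for the example.

```python
def liveness(blocks, succs):
    """blocks: label -> (uses, defs); succs: label -> successor labels.
    Returns (live_in, live_out), each mapping label -> set of variables."""
    live_in = {label: set() for label in blocks}
    live_out = {label: set() for label in blocks}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for label, (uses, defs) in blocks.items():
            # A variable is live out if it is live into any successor.
            out = set()
            for s in succs.get(label, []):
                out |= live_in[s]
            # It is live in if used here, or live out and not defined here.
            new_in = set(uses) | (out - set(defs))
            if out != live_out[label] or new_in != live_in[label]:
                live_out[label], live_in[label] = out, new_in
                changed = True
    return live_in, live_out

# Hypothetical straight-line program, one label per program point:
#   k0: a = 1;  k1: b = a + 1;  k2: c = a + b;  k3: return c
blocks = {
    "k0": ([], ["a"]),
    "k1": (["a"], ["b"]),
    "k2": (["a", "b"], ["c"]),
    "k3": (["c"], []),
}
succs = {"k0": ["k1"], "k1": ["k2"], "k2": ["k3"]}

live_in, live_out = liveness(blocks, succs)
# Peak number of simultaneously live variables: a floor on slots needed.
peak = max(len(s) for s in live_in.values())
```

With a set per program point and a variable per result value, it is easy to see how the cost of this kind of analysis grows with the size of the function being compiled.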

Interestingly, partial evaluation, CPS conversion, and a couple of other small optimizations together account for only 3.6% of time; and reading and syntax expansion account for only 0.6% of time. This is good news at least :)

up in the trees, down in the weeds

Looking at the top-down tree profile lets me see that the compiler is spending most of its time doing things that the Guile 2.0 compiler doesn't do: loop optimizations, good slot allocations, and so on. To an extent, then, it's to be expected that the Guile 2.2 compiler is slower. This also explains why the C fast-paths weren't so effective at improving performance: the per-operation costs were already pretty low, and adding C implementations wasn't enough of a speedup to matter. The problem was not that intmap-ref et al were slow, it was that code was calling them a lot.

Improving the optimizer has been a bit challenging, not least due to the many axes of "better". Guile's compiler ran faster before the switch to "CPS soup" and persistent data structures, but it produced code that ran slower because I wasn't able to write the optimizations that I would have liked. Likewise, Guile 2.0's compiler ran faster, because it did a worse job. But before switching to CPS soup, Guile's compiler also used more memory, because per-program-point and per-variable computations were unable to share space with each other.

I think the top-down profiler has given me a better point of view in this instance, as I can reason about what I'm doing on a structural level, which I wasn't able to understand from the flat profile. Still, it's possible to misunderstand the performance impact of leaf functions when they are spread all over a tree, and for that reason I think we probably need both kinds of profilers.

In the case of Guile's compiler I'm not sure that I'll change much at this point. We'll be able to switch to native compilation without a fundamental compiler rewrite. But spending most of the time in functions related to data structures still seems pretty wrong to me on some deep level -- what if the data structures were faster? What if I wrote the code in some other way that didn't need the data structures so much? It gnaws at me. It gnaws and gnaws.

the half strap

Unfortunately, while compiling Scheme to native code will probably speed up the compiler, it won't necessarily speed up the bootstrap. I think the compiler has some 800 KB of source code right now, and let's say that we're able to do native compilation with 1200 KB. So 50% more code, but probably the result is two to ten times faster on average: a win, in terms of compiler speed, when compiled. But for bootstrap time, because in the beginning of the bootstrap most of the compiler isn't compiled, it could well be a slowdown.
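
Here is a toy model of that argument in Python. Everything in it is an assumption made for illustration: the compiler is split into equal-sized modules compiled one after another, compile time is proportional to source size, and the part of the compiler that is already compiled runs some fixed factor faster than the interpreted rest. The function name `bootstrap_time` and the speedup factors (10 and 50) are hypothetical; only the 800 KB and 1200 KB figures come from the guess above.

```python
def bootstrap_time(source_kb, compiled_speedup, modules=10):
    """Toy estimate of bootstrap time, in arbitrary units.

    While compiling module i, a fraction i/modules of the compiler is
    already compiled and runs `compiled_speedup` times faster than the
    interpreted remainder.
    """
    module_kb = source_kb / modules
    total = 0.0
    for i in range(modules):
        compiled = i / modules
        # Relative cost of running the partly-compiled compiler.
        slowdown = compiled / compiled_speedup + (1.0 - compiled)
        total += module_kb * slowdown
    return total

# Hypothetical speedups over the interpreter: 10x for the current
# compiler's output, 50x for a native compiler's output.
current = bootstrap_time(800, compiled_speedup=10)
native = bootstrap_time(1200, compiled_speedup=50)
```

Under these made-up numbers the larger compiler takes about 671 units against about 476 for the current one, even though its output is five times faster. The interpreted portion dominates the sum, so the extra source code costs more than the faster compiled code wins back.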

This is the disadvantage of bootstrapping from an interpreter -- the more compiler you write, the slower your strap.

Note that this is different from the case where you bootstrap from a compiled Scheme compiler. In our case we do a half-bootstrap: first building an interpreter in C, then compiling the interpreter that is written in Scheme, then bootstrapping off that.

It's a common trope in compiler development that the heroic, farsighted compiler hacker refuses to add optimizations unless they make the compiler bootstrap faster. Dybvig says as much in his "History of Chez Scheme" paper. Well, sure -- if you're willing to accept complete responsibility for bootstrapping. From my side, I'm terrified that I could introduce some error in a binary that could reproduce itself worm-like into all my work and make it impossible to change anything. You think I jest, but the Sanely Bootstrappable Common Lisp papers filled me with fear. Want to change your tagging scheme? You can't! Want to experiment with language, start programming using features from your own dialect? You can't! No, thank you. I value my sanity more than that.

Incidentally, this also answers a common question people have: can I use some existing Guile to compile a new Guile? The answer is tricky. You can if the two Guiles implement the same language and virtual machine. Guile-the-language is fairly stable. However, due to the way that the VM and the compiler are co-developed, some of the compiler is generated from data exported by libguile. If that information happens to be the same on your Guile, then yes, it's possible. Otherwise no. For this reason it's not something we describe, besides cross-compilers from the same version. Just half strap: it takes a while but it's fairly fool-proof.

and that's it!

Thanks for reading I guess. Good jobbies! Next time, some words on Lua. Until then, happy strapping!