copying collectors with block-structured heaps are unreliable

Good day, garbage pals! This morning, a quick note on “reliability” and garbage collectors, how a common GC construction is unreliable, and why we choose it anyway.

on reliability

For context, I’m easing back in to Whippet development. One of Whippet’s collectors is a semi-space collector. Semi-space collectors are useful as correctness oracles: they always move objects, so they require their embedder to be able to precisely enumerate all edges of the object graph, to update those edges to point to the relocated objects. A semi-space collector is so simple that if there is a bug, it is probably in the mutator rather than the collector. They also have well-understood performance, and as such are useful when comparing the performance of other collectors.

But one other virtue of the simple semi-space collector is that it is reliable, in the sense that given a particular set of live objects, allocated in any order, there is a single heap size at which the allocation (and collection) will succeed, and below which the program fails (not enough memory). This is because all allocations go in the same linear region, collection itself doesn’t allocate memory, the amount of free space after an object (the fragmentation) does not depend on where it is allocated, and those object extents just add up in a commutative way.

Reliability is a virtue. Sometimes it is a requirement: for example, the Toit language and run-time target embedded microcontrollers, where you have finite resources and either the workload fits or it doesn’t. You can’t really tolerate a result of “it works sometimes”. So, Toit uses a generational semi-space + mark-sweep collector that never makes things worse.

on block-structured heaps

But, threads make reliability tricky. With Whippet I am targeting embedders with multiple mutator threads, and a classic semi-space collector doesn’t scale – you could access the allocation pointer atomically, but that would be a bottleneck, serializing mutators, not to mention the cache contention.

The usual solution for this problem is to arrange the heap in such a way that different threads can allocate in different areas, so they don’t need to share an allocation pointer and so they don’t write to the same cache lines. And, the most common way to do this is to use a block-structured heap; for example you might have a 256 MB heap, but divided into 4096 blocks, each of which is 64 kB. That’s enough granularity to dynamically partition out space between many threads: you keep a list of available blocks and allocator threads compete to grab fresh blocks as needed. There’s central contention on the block list, so you want blocks big enough that you aren’t fetching blocks too often.
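
To make this concrete, here is a minimal sketch of per-thread bump-pointer allocation over shared blocks, written as Scheme for illustration; the <allocator> record and grab-fresh-block! are hypothetical stand-ins for what a real collector does in its own implementation language, with proper synchronization on the shared block list.

(use-modules (srfi srfi-9))

(define block-size (* 64 1024))

;; Each mutator thread owns an allocator: a bump pointer and a limit
;; delimiting its current block.
(define-record-type <allocator>
  (make-allocator hp limit)
  allocator?
  (hp allocator-hp set-allocator-hp!)
  (limit allocator-limit set-allocator-limit!))

;; Stand-in for popping an empty block off the shared list; a real
;; implementation synchronizes here, which is why blocks should be big
;; enough that this path is rare.
(define next-block 0)
(define (grab-fresh-block!)
  (let ((block next-block))
    (set! next-block (+ next-block block-size))
    block))

(define (allocate! a nbytes)
  ;; Bump-pointer allocate; when the object doesn't fit, the space
  ;; between the pointer and the limit is abandoned (fragmentation)
  ;; and we start over in a fresh block.
  (let ((addr (allocator-hp a)))
    (if (<= (+ addr nbytes) (allocator-limit a))
        (begin
          (set-allocator-hp! a (+ addr nbytes))
          addr)
        (let ((block (grab-fresh-block!)))
          (set-allocator-hp! a (+ block nbytes))
          (set-allocator-limit! a (+ block block-size))
          block))))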

To break a heap into blocks requires a large-object space, to allow for allocations that are larger than a block. And actually, as I mentioned in the article about block-structured heaps, usually you choose a threshold for large object allocations that is smaller than the block size, to limit the maximum and expected amount of fragmentation at the end of each block, when an allocation doesn’t fit.

on unreliability

Which brings me to my point: a copying collector with a block-structured heap is unreliable, in the sense that there is no single heap size below which the program fails and above which it succeeds.

Consider a mutator with a single active thread, allocating a range of object sizes, all smaller than the large object threshold. There is a global list of empty blocks available for allocation; the thread grabs a block as needed and bump-pointer allocates into it. The last allocation in each block will fail: that’s what forces the thread to grab a fresh block. The space left at the end of the block is fragmentation.

Assuming that the sequence of allocations performed by the mutator is deterministic, by the time the mutator has forced the first collection, the total amount of fragmentation will also be deterministic, as will the set of live roots at the time of collection. Assume also that there is a single collector thread which evacuates the live objects; this will also produce deterministic fragmentation.

However, there is no guarantee that the post-collection fragmentation is less than the pre-collection fragmentation. Unless objects are copied in a way that preserves allocation order—generally not the case for a semi-space collector, and doing so would negate any advantage of a block-structured heap—a different object order can produce different end-of-block fragmentation.

causes of unreliability

The unreliability discussed above is due to non-commutative evacuation. If your collector marks objects in place, you are not affected. If you don’t commute live objects—if you preserve their allocation order, as Toit’s collector does—then you are not affected. If your evacuation commutes, as in the case of the simple semi-space collector, you are not affected. But if you have a block-structured heap and you evacuate, your collector is probably unreliable.

There are other sources of unreliability in a collector, but to my mind they are not as fundamental as this one.

  • Multiple mutator threads generally lead to a kind of unreliability, because the size of the live object graph is not deterministic at the time of collection: even if all threads have the same allocation trace, they don’t necessarily proceed in lock-step nor stop in the same place.

  • Adding collector threads to evacuate in parallel adds its own form of unreliability: if you have 8 evacuator threads, then there are 8 blocks local to the evacuator threads which also contribute to post-collection wasted space, possibly an entire block per thread.

  • Some collectors will allocate memory during collection, for example to represent a worklist of objects that need tracing. This allocation can fail. Also, annoyingly, collection-time allocation complicates comparison: you can no longer compare two collectors at the same heap size, because one of them cheats.

  • Virtual memory and paging can make you have a bad time. For example, you go to allocate a large object, so you remove some empty blocks from the main space and return them to the OS, providing you enough budget to allocate the new large object. Then the new large object is collected, so you reclaim the pages you returned to the OS, adding them to the available list. But you probably don’t page them back in eagerly, because who wants a syscall? They get paged in lazily when the mutator needs them, but that can fail because of other processes on the system.

embracing unreliability

I think it only makes sense to insist on a reliable collector if your mutator does not have threads; otherwise, the fragmentation-related unreliability pales in comparison.

What’s more, fragmentation-related unreliability can be entirely mitigated by giving the heap more memory: the maximum amount of fragmentation is an object just shy of the large object threshold, per block, so in our case 8 kB per 64 kB. So, just increase your heap by 12.5%. You will certainly not regret increasing your heap by 12.5%.

And happily, increasing heap size also works to mitigate unreliability related to multiple mutator threads. Consider 10 threads, each of which has a local object graph that is usually 10 MB but briefly 100 MB when calculating: usually when GC happens, total live object size is 10 × 10 MB = 100 MB, but sometimes as much as 1 GB; there is a minimum heap size for which the program sometimes works, but also a larger minimum heap size at which it always works. The trouble is, of course, that you generally only find the always-works size by experimentation, and you can never be sure you have actually observed the worst case.

Which brings me to my final point, which is that virtual memory and growable heaps are good. Unless you have a dedicated devops team or you are evaluating a garbage collector, you should not be using a fixed heap size. The ability to just allocate some pages to keep the heap from being too tight buys us a form of soft reliability.

And with that, end of observations. Happy fragmenting, and until next time!

enable persistent history in gdb

Friends. I have been using GDB for more than two decades and have been annoyed by the fact that, unlike the shell, it doesn’t keep a persistent history.

Of course, it has always been able to do that, but history saving is just not on by default. So do yourself a favor and turn it on by pasting this into your terminal:

mkdir -p ~/.config/gdb
mkdir -p ~/.cache/gdb
echo 'set history filename ~/.cache/gdb/history' >> ~/.config/gdb/gdbinit
echo 'set history save on' >> ~/.config/gdb/gdbinit
echo 'set history size unlimited' >> ~/.config/gdb/gdbinit

You are welcome!

cps in hoot

Good morning good morning! Today I have another article on the Hoot Scheme-to-Wasm compiler, this time on Hoot’s use of the continuation-passing-style (CPS) transformation.

calls calls calls

So, just a bit of context to start out: Hoot is a Guile, Guile is a Scheme, Scheme is a Lisp, one with “proper tail calls”: function calls are either in tail position, syntactically, in which case they are tail calls, or they are not in tail position, in which case they are non-tail calls. A non-tail call suspends the calling function, putting the rest of it (the continuation) on some sort of stack, and will resume when the callee returns. Because non-tail calls push their continuation on a stack, we can call them push calls.

(define (f)
  ;; A push call to g, binding its first return value.
  (define x (g))
  ;; A tail call to h.
  (h x))

Usually the problem in implementing Scheme on other language run-times comes in tail calls, but WebAssembly supports them natively (except on JSC / Safari; should be coming at some point though). Hoot’s problem is the reverse: how to implement push calls?

The issue might seem trivial but it is not. Let me illustrate briefly by describing what Guile does natively (not compiled to WebAssembly). Firstly, note that I am discussing residual push calls, by which I mean to say that the optimizer might remove a push call in the source program via inlining: we are looking at those push calls that survive the optimizer. Secondly, note that native Guile manages its own stack instead of using the stack given to it by the OS; this allows for push-call recursion without arbitrary limits. It also lets Guile capture stack slices and rewind them, which is the fundamental building block we use to implement exception handling, Fibers and other forms of lightweight concurrency.

Using straightforward function calls, total recursion depth is artificially limited in most WebAssembly implementations, meaning that many idiomatic uses of Guile will throw exceptions. Unpleasant, but perhaps we could stomach this tradeoff. The greater challenge is how to slice the stack. As far as I am aware, there are three possible implementation strategies.

generic slicing

One possibility is that the platform provides a generic, powerful stack-capture primitive, which is what Guile does. The good news is that one day, the WebAssembly stack-switching proposal should provide this too. And in the meantime, the so-called JS Promise Integration (JSPI) proposal gets close: if you enter Wasm from JS via a function marked as async, and you call out to JavaScript to a function marked as async (i.e. returning a promise), then on that nested Wasm-to-JS call, the engine will suspend the continuation and resume it only when the returned promise settles (i.e. completes with a value or an exception). Each entry from JS to Wasm via an async function allocates a fresh stack, so I understand you can have multiple pending promises, and thus multiple wasm coroutines in progress. It gets a little gnarly if you want to control when you wait, for example if you might want to wait on multiple promises; in that case you might not actually mark promise-returning functions as async, and instead import an async-marked function waitFor(p) { return await p } or so, allowing you to use Promise.race and friends. The main problem though is that JSPI is only for JavaScript. Also, its stack sizes are even smaller than the default stack size.

instrumented slicing

So much for generic solutions. There is another option, to still use push calls from the target machine (WebAssembly), but to transform each function to allow it to suspend and resume. This is what I think of as Joe Marshall’s stack trick (also see §4.2 of the associated paper). The idea is that although there is no primitive to read the whole stack, each frame can access its own state. If you insert a try/catch around each push call, the catch handler can access local state for activations of that function. You can slice a stack by throwing a SaveContinuation exception, in which each frame’s catch handler saves its state and re-throws. And if we want to avoid exceptions, we can use checked returns as Asyncify does.
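
Here is a minimal sketch of the trick in Scheme, with hypothetical names (<capture>, slice!); in a real implementation the compiler inserts the handler around every push call rather than it being written by hand.

(use-modules (srfi srfi-9))

(define-record-type <capture>
  (make-capture frames)
  capture?
  (frames capture-frames))

;; Throwing a capture asks every active frame to save its state.
(define (slice!)
  (raise-exception (make-capture '())))

(define (leaf) (slice!))

;; The transformed version of a function with a push call: on a
;; capture, save this activation's locals (just x here) onto the
;; accumulated list of frames, and re-throw for the next frame out.
(define (middle x)
  (with-exception-handler
   (lambda (exn)
     (if (capture? exn)
         (raise-exception
          (make-capture (cons `((x . ,x)) (capture-frames exn))))
         (raise-exception exn)))
   (lambda () (leaf))
   #:unwind? #f))

;; At the outermost frame, keep the slice instead of re-throwing.
(with-exception-handler
 (lambda (exn)
   (when (capture? exn)
     (format #t "sliced frames: ~s\n" (capture-frames exn))))
 (lambda () (middle 42))
 #:unwind? #t)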

I never understood, though, how you resume a frame. The Generalized Stack Inspection paper would seem to indicate that you need the transformation to introduce a function to run “the rest of the frame” at each push call, which becomes the Invoke virtual method on the reified frame object. To avoid code duplication you would have to make normal execution flow run these Invoke snippets as well, and that might undo much of the advantages. I understand the implementation that Joe Marshall was working on was an interpreter, though, which bounds the number of sites needing such a transformation.

cps transformation

The third option is a continuation-passing-style transformation. A CPS transform results in a program whose procedures “return” by tail-calling their “continuations”, which themselves are procedures. Taking our previous example, a naïve CPS transformation would reify the following program:

(define (f' k)
  (g' (lambda (x) (h' k x))))

Here f' (“f-prime”) receives its continuation as an argument. We call g', for whose continuation argument we pass a closure. That closure is the return continuation of g', binding a name to its result, and then tail-calling h', which is in tail position with respect to f'. We know their continuations are the same because it is the same binding, k.

Unfortunately we can’t really slice arbitrary ranges of a stack with the naïve CPS transformation: we can only capture the entire continuation, and can’t really inspect its structure. There is also no way to compose a captured continuation with the current continuation. And, in a naïve transformation, we would constantly create heap allocations for these continuation closures; a push call effectively pushes a frame onto the heap as a closure, as we did above for g'.

There is also the question of when to perform the CPS transform; most optimizing compilers would like a large first-order graph to work on, which is out of step with the way CPS transformation breaks functions into many parts. Still, there is a nugget of wisdom here. What if we preserve the conventional compiler IR for most of the pipeline, and only perform the CPS transformation at the end? In that way we can have nice SSA-style optimizations. And, for return continuations of push calls, what if instead of allocating a closure, we saved the continuation data on an explicit stack? As Andrew Kennedy notes, closures introduced by the CPS transform follow a stack discipline, so this seems promising; we would have:

(define (f'' k)
  (push! k)
  (g'' (lambda (x)
         (define k (pop!))
         (h'' k x))))

The explicit stack allows for generic slicing, which makes it a win for implementing delimited continuations.

hoot and cps

Hoot takes the CPS transformation approach with stack-allocated return closures. In fact, Hoot goes a little farther, too far probably:

(define (f''')
  (push! (lambda (x)
           (h''' x)))
  (g'''))

Here instead of passing the continuation as an argument, we pass it on the stack of saved values. Returning pops off from that stack; for example, (lambda () 42) would transform as (lambda () ((pop!) 42)). But some day I should go back and fix it to pass the continuation as an argument, to avoid excess stack traffic for leaf function calls.

There are some gnarly details though, which I know you are here for!

splits

For our function f, we had to break it into two pieces: the part before the push-call to g and the part after. If we had two successive push-calls, we would instead split into three parts. In general, each push-call introduces a split; let us use the term tails for the components produced by a split. (You could also call them continuations.) How many tails will a function have? Well, one for the entry, one for each push call, and one for any point where control flow merges between two tails. This is a fixpoint problem, given that the input IR is a graph. (There is also some special logic for call-with-prompt but that is too much detail for even this post.)

where to save the variables

Guile is a dynamically-typed language, having a uniform SCM representation for every value. However in the compiler and run-time we can often unbox some values, generally as u64/s64/f64 values, but also raw pointers of some specific types, some GC-managed and some not. In native Guile, we can just splat all of these data members into 64-bit stack slots and rely on the compiler to emit stack maps to determine whether a given slot is a double or a tagged heap object reference or what. In WebAssembly though there is no sum type, and no place we can put either a u64 or a (ref eq) value. So we have not one stack but three (!) stacks: one for numeric values, implemented using a Wasm memory; one for (ref eq) values, using a table; and one for return continuations, because the func type hierarchy is disjoint from eq. It’s... it’s gross? It’s gross.

what variables to save

Before a push-call, you save any local variables that will be live after the call. This is also a flow analysis problem. You can leave off constants, and instead reify them anew in the tail continuation.

I realized, though, that we have some pessimality related to stacked continuations. Consider:

(define (q x)
  (define y (f))
  (define z (f))
  (+ x y z))

Hoot’s CPS transform produces something like:

(define (q0 x)
  (save! x)
  (save! q1)
  (f))

(define (q1 y)
  (restore! x)
  (save! x)
  (save! y)
  (save! q2)
  (f))

(define (q2 z)
  (restore! y)
  (restore! x)
  ((pop!) (+ x y z)))

So q0 saved x; fine, indeed we need it later. But q1 uselessly restored x, only to save it again on q2’s behalf. Really we should be applying a stack discipline for saved data within a function. Given that the source IR is a graph, this means another flow analysis problem, one that I haven’t thought about how to solve yet. I am not even sure if there is a solution in the literature, given that SSA-like flow graphs plus tail calls / CPS is a somewhat niche combination.

calling conventions

The continuations introduced by CPS transformation have associated calling conventions: return continuations may have the generic varargs type, or the compiler may have concluded they have a fixed arity that doesn’t need checking. In any case, for a return, you call the return continuation with the returned values, and the return point then restores any live-in variables that were previously saved. But for a merge between tails, you can arrange to take the live-in variables directly as parameters; it is a direct call to a known continuation, rather than an indirect call to an unknown call site.
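
As an illustration of the difference, here is what a merge might look like in direct-style Scheme (an invented example, not literal Hoot output): the merge tail is a known procedure taking its live-ins as parameters, and both branches call it directly.

;; Source: (if (< a b) (- b a) (- a b)), with the result consumed by
;; further code at the merge point.
(define (abs-diff-doubled a b)
  ;; k-merge is the merge tail: a known continuation with a fixed
  ;; arity, so both branches compile to direct calls.
  (define (k-merge d)
    (* d 2))
  (if (< a b)
      (k-merge (- b a))
      (k-merge (- a b))))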

cps soup?

Guile’s intermediate representation is called CPS soup, and you might wonder what relationship that CPS has to this CPS. The answer is not much. The continuations in CPS soup are first-order; a term in one function cannot continue to a continuation in another function. (Inlining and contification can merge graphs from different functions, but the principle is the same.)

It might help to explain that it is the same relationship as it would be if Guile represented programs using SSA: the Hoot CPS transform runs at the back-end of Guile’s compilation pipeline, where closure representations have already been made explicit. The IR is still direct-style, just that syntactically speaking, every call in a transformed program is a tail call. We had to introduce save and restore primitives to implement the saved variable stack, and some other tweaks, but generally speaking, the Hoot CPS transform ensures the run-time all-tail-calls property rather than altering the compile-time language; a transformed program is still CPS soup.

fin

Did we actually make the right call in going for a CPS transformation?

I don’t have good performance numbers at the moment, but from what I can see, the CPS transformation can impose significant overhead, even 10x penalties in some cases. But some results are quite good, improving over native Guile, so I can’t be categorical.

But really the question is, is the performance acceptable for the functionality, and there I think the answer is clearer: we have a port of Fibers that I am sure Spritely colleagues will be writing more about soon, we have good integration with JavaScript promises while not relying on JSPI or Asyncify or anything else, and we haven’t had to compromise in significant ways regarding the source language. So, for now, I am satisfied, and looking forward to experimenting with the stack-switching proposal as it becomes available.

Until next time, happy hooting!

hoot's wasm toolkit

Good morning! Today we continue our dive into the Hoot Scheme-to-WebAssembly compiler. Instead of talking about Scheme, let’s focus on WebAssembly, specifically the set of tools that we have built in Hoot to wrangle Wasm. I am peddling a thesis: if you compile to Wasm, probably you should write a low-level Wasm toolchain as well.

(Incidentally, some of this material was taken from a presentation I gave to the Wasm standardization organization back in October, which I think I haven’t shared yet in this space, so if you want some more context, have at it.)

naming things

Compilers are all about names: definitions of globals, types, local variables, and so on. An intermediate representation in a compiler is a graph of definitions and uses in which the edges are names, and the set of possible names is generally unbounded; compilers make more names when they see fit, for example when copying a subgraph via inlining, and remove names if they determine that a control or data-flow edge is not necessary. Having an unlimited set of names facilitates the graph transformation work that is the essence of a compiler.

Machines, though, generally deal with addresses, not names; one of the jobs of the compiler back-end is to tabulate the various names in a compilation unit, assigning them to addresses, for example when laying out an ELF binary. Some uses may refer to names from outside the current compilation unit, as when you use a function from the C library. The linker intervenes at the back-end to splice in definitions for dangling uses and applies the final assignment of names to addresses.
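
As a toy flavor of that tabulation step in Scheme (hypothetical names, not Hoot’s actual API): assign each definition its index in definition order, then resolve named uses when emitting, leaving dangling names for the linker.

;; Imported functions get the low indices, then local definitions.
(define (assign-indices imports definitions)
  (let ((table (make-hash-table)))
    (let lp ((names (append imports definitions)) (idx 0))
      (unless (null? names)
        (hash-set! table (car names) idx)
        (lp (cdr names) (+ idx 1))))
    table))

;; A dangling use is the linker's cue to splice in a definition from
;; elsewhere; here we just error.
(define (resolve table name)
  (or (hash-ref table name)
      (error "dangling reference" name)))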

When targeting Wasm, consider what kinds of graph transformations you would like to make. You would probably like for the compiler to emit calls to functions from a low-level run-time library written in wasm. Those functions are probably going to pull in some additional definitions, such as globals, types, exception tags, and so on. Then once you have your full graph, you might want to lower it, somehow: for example, you choose to use the stringref string representation, but browsers don’t currently support it; you run a post-pass to lower to UTF-8 arrays, but then all your strings are not constant, meaning they can’t be used as global initializers; so you run another post-pass to initialize globals in order from the start function. You might want to make other global optimizations as well, for example to turn references to named locals into unnamed stack operands (not yet working :).

Anyway what I am getting at is that you need a representation for Wasm in your compiler, and that representation needs to be fairly complete. At the very minimum, you need a facility to transform that in-memory representation to the standard WebAssembly text format, which allows you to use a third-party assembler and linker such as Binaryen’s wasm-opt. But since you have to have the in-memory representation for your own back-end purposes, probably you also implement the names-to-addresses mapping that lets you output binary WebAssembly as well. And it could be that Binaryen doesn’t support something you want to do; for example Hoot uses block parameters, which are supported fine in browsers but not in Binaryen.

(I exaggerate a little; Binaryen is a more reasonable choice now than it was before the GC proposal was stabilised. But it has been useful to be able to control Hoot’s output, for example as the exception-handling proposal has evolved.)

one thing leads to another

Once you have a textual and binary writer, and an in-memory representation, perhaps you want to be able to read binaries as well; and perhaps you want to be able to read text. Reading the text format is a little annoying, but I had implemented it already in JavaScript a few years ago; and porting it to Scheme was a no-brainer, allowing me to easily author the run-time Wasm library as text.

And so now you have the beginnings of a full toolchain, built just out of necessity: reading, writing, in-memory construction and transformation. But how are you going to test the output? Are you going to require a browser? That’s gross. Node? Sure, we have to check against production Wasm engines, and that’s probably the easiest path to take; still, would be nice if this were optional. Wasmtime? But that doesn’t do GC.

No, of course not, you are a dirty little compilers developer, you are just going to implement a little wasm interpreter, aren’t you. Of course you are. That way you can build nice debugging tools to help you understand when things go wrong. Hoot’s interpreter doesn’t pretend to be high-performance—it is not—but it is simple and it just works. Massive kudos to Spritely hacker David Thompson for implementing this. I think implementing a Wasm VM also had the pleasant side effect that David is now a Wasm expert; implementation is the best way to learn.

Finally, one more benefit of having a Wasm toolchain as part of the compiler: %inline-wasm. In my example from last time, I had this snippet that makes a new bytevector:

(%inline-wasm
 '(func (param $len i32) (param $init i32)
    (result (ref eq))
    (struct.new
     $mutable-bytevector
     (i32.const 0)
     (array.new $raw-bytevector
                (local.get $init)
                (local.get $len))))
 len init)

%inline-wasm takes a literal as its first argument, which should parse as a Wasm function. Parsing guarantees that the wasm is syntactically valid, and allows the arity of the wasm to become apparent: we just read off the function’s type. Knowing the number of parameters and results is one thing, but we can do better, in that we also know their type, which we use for intentional types, requiring in this case that the parameters be exact integers which get wrapped to the signed i32 range. The resulting term is spliced into the CPS graph, can be analyzed for its side effects, and ultimately when written to the binary we replace each local reference in the Wasm with a reference of the appropriate local variable. All this is possible because we have the tools to work on Wasm itself.

fin

Hoot’s Wasm toolchain is about 10K lines of code, and is fairly complete. I think it pays off for Hoot. If you are building a compiler targeting Wasm, consider budgeting for a 10K SLOC Wasm toolchain; you won’t regret it.

Next time, an article on Hoot’s use of CPS. Until then, happy hacking!

growing a bootie

Following on last week’s egregious discussion of the Hoot Scheme-to-WebAssembly compiler bootie, today I would like to examine another axis of boot, which is a kind of rebased branch of history: not the hack as it happened, but the logic inside the hack, the structure of the built thing, the history as it might have been. Instead of describing the layers of shims and props that we used while discovering what we were building, let’s look at how we would build Hoot again, if we had to.

I think many readers of this blog will have seen Growing a Language, a talk / performance art piece in which Guy L. Steele—I once mentioned to him that Guy L. was one of the back-justifications for the name Guile; he did not take it well—takes the set of monosyllabic words as primitives and builds up a tower of terms on top, bootstrapping a language as he goes. I just watched it again and I think it holds up, probably well enough to forgive the superfluous presence of the gender binary in the intro; ideas were different in the 1900s.

It is in the sense of that talk that I would like to look at growing a Hoot: how Hoot defines nouns and verbs in terms of smaller, more primitive terms: terms in terms of terms.

[Figure: a dependency graph of the 50 or so modules shipped with Hoot, rooted at (hoot primitives). Low-level modules such as (hoot pairs), (hoot numbers), and (hoot errors) sit near the root; the R7RS (scheme ...) libraries build on them; and (guile), (fibers), and (srfi srfi-9) are layered on top.]

If you are reading this on the web, you should see above a graph of dependencies among the 50 or so libraries that are shipped as part of Hoot. (Somehow I doubt that a feed reader will plumb through the inline SVG, but who knows.) It’s a bit of a mess, but still I think it’s a useful illustration of a number of properties of how the Hoot language is grown from small to large. Click on any box to visit the source code for that module.

the root of the boot

Firstly, let us note that the graph is not a forest: it has a single root. There is no module that does not depend (possibly indirectly) on (hoot primitives). This is because there are no capabilities that Hoot libraries can access without importing them, and the only way into the Hootosphere from outside is via the definitions in the primitives module.

So what are these definitions, you might ask? Well, these are the “well-known” bindings, for example + for which the compiler might have some special understanding, the sort of binding that gets translated to a primitive operation at the compiler IR level. They are used in careful ways by the modules that use (hoot primitives) to ensure that their uses are all open-coded by the compiler. (“Open coding” is inlining. But inlining to me implies that the whole implementation is inlined, with no slow-path callouts, whereas open coding implies to me that it’s the compiler that knows what the op does and may or may not inline the actual asm.)

But, (hoot primitives) also exposes some other definitions, for example define and let and lambda and all that. Scheme doesn’t have keywords in the sense that Python has def and with and such: there is no privileged way to associate a name with its meaning. It is in this sense that it is impossible to avoid (hoot primitives): the most simple (define x 42) depends on the lexical meaning of define, which is provided by the primitives module.

Syntax definitions are an expander construct; they are not present at run-time. Using a syntax definition causes the expander to invoke code, and the expander runs on the host system, which is Guile and not WebAssembly. So, syntax definitions belong to the host. This goes also for some first-order definitions such as syntax->datum and so on, which are only used in syntax expanders; these definitions are plumbed through (hoot primitives), but can only ever be used by macro definitions, which run on the meta-level.

(Is this too heavy? Allow me to lighten the mood: when I was 22 or so and working in Namibia, I somehow got an advance copy of Notes from the Metalevel. I was working on algorithmic music synthesis, and my chief strategy was knocking hubris together with itself, as one does. I sent the author a bunch of uninvited corrections to his book. I think it was completely unwelcome! Anyway, moral of the story, at 22 you get a free pass to do whatever you want, and come to think of it, now that I am 44 I think I should get some kind of hubris loyalty award or something.)

powerful primitives

So, there are expand-time primitives and run-time primitives. The expander knows about expand-time primitives and the compiler knows about run-time primitives. One particularly powerful primitive is %inline-wasm, which takes an inline snippet of WebAssembly as an s-expression and applies it to a number of arguments passed at run-time. Consider make-bytevector:

(define* (make-bytevector len #:optional (init 0))
  (%inline-wasm
   '(func (param $len i32) (param $init i32)
      (result (ref eq))
      (struct.new
       $mutable-bytevector
       (i32.const 0)
       (array.new $raw-bytevector
                  (local.get $init)
                  (local.get $len))))
   len init))

We have an inline snippet of wasm that makes a $mutable-bytevector. It passes 0 as the hash field, meaning that the hashq of this value will be lazily initialized, and the contents are a new array of a given size and initial value. Inputs will be unboxed to the appropriate type (two i32s in this case), and likewise with outputs; here we produce the universal (ref eq) representation.

The nice thing about %inline-wasm is that the compiler didn’t have to be taught about make-bytevector: this definition suffices, because %inline-wasm can access a number of lower-level capabilities.

dual denotations

But as we learned in my notes on whole-program compilation, any run-time definition is available at compile-time, if it is reachable from a syntax transformer. So this definition above isn’t quite sufficient; we can’t call make-bytevector as part of a procedural macro, which we might want to do. What we need instead is to provide one definition when residualizing wasm at run-time, and another when loading a module at expand-time.

In Hoot we do this with cond-expand, where we expand to %inline-wasm when targeting Hoot, and... what, precisely, at expand-time? Really we need to make a Guile bytevector, so in this sort of case, we end up having to include a run-time make-bytevector definition in the (hoot primitives) module. This happens wherever we end up using %inline-wasm.
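
Concretely, the pattern looks something like the following sketch; the feature names and the guile-branch shim are illustrative rather than Hoot’s exact source.

(cond-expand
  (hoot
   ;; When residualizing Wasm, open-code the primitive.
   (define* (make-bytevector len #:optional (init 0))
     (%inline-wasm
      '(func (param $len i32) (param $init i32)
         (result (ref eq))
         (struct.new
          $mutable-bytevector
          (i32.const 0)
          (array.new $raw-bytevector
                     (local.get $init)
                     (local.get $len))))
      len init)))
  (guile
   ;; At expand-time we are running on host Guile, so delegate to its
   ;; own make-bytevector.
   (define* (make-bytevector len #:optional (init 0))
     ((@ (guile) make-bytevector) len init))))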

building to guile

Returning to our graph, we see that there is a red-colored block for Hoot modules, a teal-colored layer on top for those modules that are defined by R7RS, a few oddballs, and then (guile) and Fibers built on top. The (guile) module provides a shim that implements Guile’s own default set of bindings, allowing Guile modules to be loaded on a Hoot system. (guile) is layered on top of the low-level Hoot libraries, and out of convenience, on top of the various R7RS libraries as well, because it was easier to remember what was where in R7RS than in our ad-hoc nest of Hoot internal libraries.

Having (guile) lets Guile hackers build on Hoot. It’s still incomplete but I think eventually it will be capital-G Good. Even for a library that needed more porting like Fibers (Hoot has no threads so much of the parallel concurrent ML implementation can be simplified, and we use an event loop from the Wasm run-time instead of an epoll-based scheduler), it was still pleasant to be able to use define-module and keyword arguments and all of that.

next layers

I mentioned that this tower of terms is incomplete, and so that is one of the next work items for Hoot: complete support for Guile’s run-time library. At that point we’d probably want to merge it into Guile, but that is another topic.

But let’s leave that for another day; until then, happy hacking!

on hoot, on boot

I realized recently that I haven’t been writing much about the Hoot Scheme-to-WebAssembly compiler. Upon reflection, I have been too conscious of its limitations to give it verbal tribute, preferring to spend each marginal hour fixing bugs and filling in features rather than publicising progress.

In the last month or so, though, Hoot has gotten to a point that pleases me. Not to the point where I would say “accept no substitutes” by any means, but good already for some things, and worth writing about.

So let’s start today by talking about bootie. Boot, I mean! The boot, the boot, the boot of Hoot.

hoot boot: temporal tunnel

The first axis of boot is time. In the beginning, there was nary a toot, and now, through boot, there is Hoot.

The first boot of Hoot was on paper. Christine Lemmer-Webber had asked me, ages ago, what I thought Guile should do about the web. After thinking a bit, I concluded that it would be best to avoid compromises when building an in-browser Guile: if you have to pollute Guile to match what JavaScript offers, you might as well program in JavaScript. JS is cute of course, but Guile is a bit different in some interesting ways, the most important of which is control: delimited continuations, multiple values, tail calls, dynamic binding, threads, and all that. If Guile’s web bootie doesn’t pack all the funk in its trunk, probably it’s just junk.

So I wrote up a plan, something to which I attributed the name tailification. In retrospect, this is simply a specific flavor of a continuation-passing-style (CPS) transmutation, late in the compiler pipeline. I’ll elocute more in a future dispatch. I did end up writing the tailification pass back then; I could have continued to target JS, but it was sufficiently annoying and I didn’t prosecute. It sat around unused for a few years, until Christine’s irresistible charisma managed to conjure some resources for Hoot.

In the meantime, the GC extension for WebAssembly shipped (woot woot!), and to boot Hoot, I filled in the missing piece: a backend for Guile’s compiler that tailified and then translated primitive operations to snippets of WebAssembly.

It was, well, hirsute, but cute and it did compute, so we continued to boot. From this root we grew a small run-time library, written in raw WebAssembly, used for slow-paths for the various primitive operations that are part of Guile’s compiler back-end. We filled out Guile primcalls, in minute commits, growing the WebAssembly runtime library and toolchain as we went.

Eventually we started constituting facilities defined in terms of those primitives, via a Scheme prelude that was prepended to all programs, within a nested lexical environment. It was never our intention though to drown the user’s programs in a sea of predefined bindings, as if the ultimate program were but a vestigial inhabitant of the lexical lake—don’t dilute the newt!, we would often say [ed: we did not]— so eventually when the prelude became unmanageable, we finally figured out how to do whole-program compilation of a set of modules.

Then followed a long month in which I would uproot the loot from the boot: take each binding from the prelude and reattribute it into an appropriate module. User code could import all the modules that suit, as long as they were known to Hoot, but no others; it was not until we added the ability for users to programmatically constitute an environment from their modules that Hoot became a language implementation of any repute.

Which brings us to the work of the last month, about which I cannot be mute. When you have existing Guile code that you want to distribute via the web, Hoot required you to transmute its module definitions into the more precise R6RS syntax. Precise, meaning that R6RS modules are static, in a way that Guile modules, at least in absolute terms, are not: Guile programs can use first-class accessors on the module systems to pull out bindings. This is yet another example of what I impute as the original sin of 1990s language development, that modules are just mutable hash maps. You see it in Python, for example: because you don’t know for sure to what values global names are bound, it is easy for any discussion of what a particular piece of code means to end in dispute.

The question is, though, are the semantics of name binding in a language fixed and absolute? Once your language is booted, are its aspects definitively attributed? I think some perfection, in the sense of becoming more perfect or more like the thing you should be, is something to salute. Anyway, in Guile it would be coherent with Scheme’s lexical binding heritage to restitute some certainty as to the meanings of names, at least in a default compilation mode. Lexical binding is, after all, the foundation of the Macro Writer’s Statute of Rights. Of course if you are making a build for development purposes, not to distribute, then you might prefer a build that marks all bindings as dynamic. Otherwise I think it’s reasonable to require the user to explicitly indicate which definitions are denotations, and which constitute locations.

Hoot therefore now includes an implementation of the static semantics of Guile’s define-module: it can load Guile modules directly, and as a tribute, it also has an implementation of the ambient (guile) module that constitutes the lexical soup of modules that aren’t #:pure. (I agree, it would be better if all modules were explicit about the language they are written in—their imported bindings and so on—but there is an existing corpus to accommodate; the point is moot.)

The astute reader (whom I salute!) will note that we have a full boot: Hoot is a Guile. Not an implementation to substitute the original, but more of an alternate route to the same destination. So, probably we should scoot the two implementations together, to knock their boots, so to speak, merging the offshoot Hoot into Guile itself.

But do I circumlocute: I can only plead a case of acute Hoot. Tomorrow, we elocute on a second axis of boot. Until then, happy compute!

partitioning pitfalls for generational collectors

You might have heard of long-pole problems: when you are packing a tent, its bag needs to be at least as big as the longest pole. (This is my preferred etymology; there are others.) In garbage collection, the long pole is the longest chain of object references; there is no way you can algorithmically speed up pointer-chasing of (say) a long linked-list.

As a GC author, some of your standard tools can mitigate the long-pole problem, and some don’t apply.

Parallelism doesn’t help: a baby takes 9 months, no matter how many people you assign to the problem. You need to visit each node in the chain, one after the other, and having multiple workers available to process those nodes does not get us there any faster. Parallelism does help in general, of course, but doesn’t help for long poles.

You can apply concurrency: instead of stopping the user’s program (the mutator) to enumerate live objects, you trace while the mutator is running. This can happen on mutator threads, interleaved with allocation, via what is known as incremental tracing. Or it can happen in dedicated tracing threads, which is what people usually mean when they refer to concurrent tracing. Though it does impose some coordination overhead on the mutator, the pause-time benefits are such that most production garbage collectors do trace concurrently with the mutator, if they have the development resources to manage the complexity.

Then there is partitioning: instead of tracing the whole graph all the time, try to just trace part of it and just reclaim memory for that part. This bounds the size of a long pole—it can’t be bigger than the partition you trace—and so tracing a graph subset should reduce total pause time.

The usual way to partition is generational, in which the main program allocates into a nursery, and objects that survive a few collections then get promoted into old space. But there may be other partitions, for example to put “large objects” (for example, bigger than a few virtual memory pages) in their own section, to be managed with their own algorithm.

partitions, take one

And that brings me to today’s article: generational partitioning has a failure mode which manifests itself as spurious out-of-memory. For example, V8, when running this small JavaScript program that repeatedly allocates a 1-megabyte buffer, grows to consume all memory in the system and eventually panics:

while (1) new Uint8Array(1e6);

This is a funny result, to say the least. Let’s dig in a bit to see why.

First off, note that allocating a 1-megabyte Uint8Array makes a large object, because it is bigger than half a V8 page, which is 256 kB on most systems. There is a backing store for the array that will get allocated into the large object space, and then the JS object wrapper that gets allocated in the young generation of the regular object space.

Let’s imagine the heap consists of the nursery, the old space, and a large object space (lospace, for short). (See the next section for a refinement of this model.) In the loop, we cause allocation in the nursery and the lospace. When the nursery starts to get full, we run a minor collection, to trace the newly allocated part of the object graph, possibly promoting some objects to the old generation.

Promoting an object adds to the byte count in the old generation. You could say that promoting causes allocation in the old-gen, though it might happen just on an accounting level if an object is promoted in place. In V8, old-gen allocation is subject to a limit, and that limit is set by a gnarly series of heuristics. Exceeding the limit will cause a major GC, in which the whole heap is traced.
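
As a toy model of that accounting (invented names; not V8’s actual code), consider:

// Sketch: promotion counts as old-generation allocation, even if the
// object is promoted in place and only the accounting changes.
const oldGen = { allocatedBytes: 0, limit: 64 * 1024 * 1024 };

function scheduleMajorGC() { /* trace the whole heap */ }

function promote(objectBytes) {
  oldGen.allocatedBytes += objectBytes;
  // Exceeding the limit is what brings about a major collection.
  if (oldGen.allocatedBytes > oldGen.limit) scheduleMajorGC();
}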

So, minor collections cause major collections via promotion. But what if a minor collection never promotes? This can happen in request-oriented workflows, in which a request comes in, you handle it in JS, write out the result, and then handle the next. If the nursery is at least as large as the memory allocated in one request, no object will ever survive more than one collection. But in our new Uint8Array(1e6) example above, we are allocating to newspace and the lospace. If we never cause promotion, we will always collect newspace but never run the full GC that would also collect the large object space, and that is how we reach the out-of-memory phenomenon.

partitions, take two

In general, the phenomenon we are looking for is nursery allocations that cause non-nursery allocations, where those non-nursery allocations will not themselves bring about a major GC.

In our example above, it was a typed array object with an associated lospace backing store, assuming the lospace allocation wouldn’t eventually trigger GC. But V8’s heap is not exactly like our model, for one because it actually has separate nursery and old-generation lospaces, and for two because allocation to the old-generation lospace does count towards the old-gen allocation limit. And yet, that simple loop does still blow through all memory. So what is the deal?

The answer is that V8’s heap now has around two dozen spaces, and allocations to some of those spaces escape the limits and allocation counters. In this case, V8’s sandbox makes the link between the JS typed array object and its backing store pass through a table of indirections. At each allocation, we make an object in the nursery, allocate a backing store in the nursery lospace, and then allocate a new entry in the external pointer table, storing the index of that entry in the JS object. When the object needs to get its backing store, it dereferences the corresponding entry in the external pointer table.

Our tight loop above would therefore cause an allocation (in the external pointer table) that itself would not hasten a major collection.

Seen this way, one solution immediately presents itself: maybe we should find a way to make external pointer table entry allocations trigger a GC based on some heuristic, perhaps the old-gen allocated bytes counter. But then you have to actually trigger GC, and this is annoying and not always possible, as EPT entries can be allocated in response to a pointer write. V8 hacker Michael Lippautz summed up the issue in a comment that amounts to “it’s gnarly”.

In the end, it would be better if new-space allocations did not cause old-space allocations. We should separate the external pointer table (EPT) into old and new spaces. Because there is at most one “host” object associated with any EPT entry (if there are zero objects, the entry is dead), the heuristics that apply to host objects will do the right thing with regards to EPT entries.
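
Here is a toy model of the idea, again with invented names rather than the actual patch; the point is just that an entry lives in the same generation as its host object, and is evacuated when the host is promoted:

// Each EPT space is just an array of external pointers; a host object
// stores an index into one of them.
const ept = { young: [], old: [] };

// Allocating an entry for a nursery host goes to the young space, so a
// nursery allocation no longer causes an effectively-old allocation.
function allocateYoungEntry(externalPointer) {
  return ept.young.push(externalPointer) - 1;
}

// When a host object is promoted, evacuate its entry to old space and
// store the new index back in the host.
function promoteEntry(youngIndex) {
  const newIndex = ept.old.push(ept.young[youngIndex]) - 1;
  ept.young[youngIndex] = null; // dead; reclaimed by the next minor GC
  return newIndex;
}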

A couple weeks ago, I landed a patch to do this. It was much easier said than done; the patch went through upwards of 40 revisions, as it took me a while to understand the delicate interactions between the concurrent and parallel parts of the collector, which were themselves evolving as I was working on the patch.

The challenge mostly came from the fact that V8 has two nursery implementations that operate in different ways.

The semi-space nursery (the scavenger) is a parallel stop-the-world collector, which is relatively straightforward... except that there can be a concurrent/incremental major trace in progress while the scavenger runs (a so-called interleaved minor GC). External pointer table entries have mark bits, which the scavenger collector will want to use to determine which entries are in use and which can be reclaimed, but the concurrent major marker will also want to use those mark bits. In the end we decided that if a major mark is in progress, an interleaved collection will neither mark nor sweep the external pointer table nursery; a major collection will finish soon, and reclaiming dead EPT entries will be its responsibility.

The minor mark-sweep nursery does not run concurrently with a major concurrent/incremental trace, which is something of a relief, but it does run concurrently with the mutator. When we promote a page, we also need to promote all EPT entries for objects on that page. To keep track of which objects have external pointers, we had to add a new remembered set, built up during a minor trace, and cleared when the page is swept (also lazily / concurrently). Promoting a page iterates through that set, evacuating EPT entries to the old space. This is additional work during the stop-the-world pause, but hopefully it is not too bad.

To be honest I don’t have the performance numbers for this one. It rides the train this week to Chromium 126 though, and hopefully it fixes this problem in a robust way.

partitions, take three

The V8 sandboxing effort has sprouted a number of indirect tables: external pointers, external buffers, trusted objects, and code objects. More recently, to better integrate with compressed pointers, there is also a table for V8-to-managed-C++ (Oilpan) references. I expect the Oilpan reference table will soon grow a nursery along the same lines as the regular external pointer table.

In general, if you have a generational partition of your main object space, it would seem that you almost always need a generational partition of every other space. Otherwise either you cause new allocations to occur in what is effectively an old space, perhaps needlessly hastening a major GC, or you forget to track allocations in that space, leading to a memory leak. If the generational hypothesis applies for a wrapper object, it probably also applies for any auxiliary allocations.

fin

I have a few cleanups remaining in this area but I am relieved to have landed this patch, and pleased to have spent time under the hood of V8’s collector. Many many thanks to Samuel Groß, Michael Lippautz, and Dominik Inführ for their review and patience. Until the next cycle, happy allocating!

hacking v8 with guix, bis

Good day, hackers. Today, a pragmatic note, on hacking on V8 from a Guix system.

I’m going to skip a lot of the background because, as it turns out, I wrote about this already almost a decade ago. But following that piece, I mostly gave up on doing V8 hacking from a Guix machine—it was more important to just go with the flow of the ever-evolving upstream toolchain. In fact, I ended up installing Ubuntu LTS on my main workstations for precisely this reason, which has worked fine; I still get Guix in user-space, which is better than nothing.

Since then, though, Guix has grown to the point that it’s easier to create an environment that can run a complicated upstream source management project like V8’s. This is mainly guix shell in the --container --emulate-fhs mode. This article is a step-by-step for how to get started with V8 hacking using Guix.

get the code

You would think this would be the easy part: just git clone the V8 source. But no, the build wants a number of other Google-hosted dependencies to be vendored into the source tree. To perform the initial fetch for those dependencies and to keep them up to date, you use helpers from the depot_tools project. You also use depot_tools to submit patches to code review.

When you live in the Guix world, you might be tempted to look into what depot_tools actually does, and to replicate its functionality in a more minimal, Guix-like way. Which, sure, perhaps this is a good approach for packaging V8 or Chromium or something, but when you want to work on V8, you need to learn some humility and just go with the flow. (It’s hard for the kind of person that uses Guix. But it’s what you do.)

You can make some small adaptations, though. depot_tools is mostly written in Python, and it actually bundles its own virtualenv support for using a specific python version. This isn’t strictly needed, so we can set the funny environment variable VPYTHON_BYPASS="manually managed python not supported by chrome operations" to just use python from the environment.

Sometimes depot_tools will want to run some prebuilt binaries. Usually on Guix this is anathema—we always build from source—but there’s only so much time in the day and the build system is not our circus, not our monkeys. So we get Guix to set up the environment using a container in --emulate-fhs mode; this lets us run third-party pre-built binaries. Note, these binaries are indeed free software! We can run them just fine if we trust Google, which you have to do when working on V8.

no, really, get the code

Enough with the introduction. The first thing to do is to check out depot_tools.

mkdir src
cd src
git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git

I’m assuming you have git in your Guix environment already.

Then you need to initialize depot_tools. For that you run a python script, which needs to run other binaries – so we need to make a specific environment in which it can run. This starts with a manifest of packages, conventionally placed in a file named manifest.scm in the project’s working directory, though you don’t have one of those yet, so you can just write it into v8.scm or something, anywhere:

(use-modules (guix packages)
             (gnu packages gcc))

(concatenate-manifests
 (list
  (specifications->manifest
   '(
     "bash"
     "binutils"
     "clang-toolchain"
     "coreutils"
     "diffutils"
     "findutils"
     "git"
     "glib"
     "glibc"
     "glibc-locales"
     "grep"
     "less"
     "ld-gold-wrapper"
     "make"
     "nss-certs"
     "nss-mdns"
     "openssh"
     "patch"
     "pkg-config"
     "procps"
     "python"
     "python-google-api-client"
     "python-httplib2"
     "python-pyparsing"
     "python-requests"
     "python-tzdata"
     "sed"
     "tar"
     "wget"
     "which"
     "xz"
     ))
  (packages->manifest
   `((,gcc "lib")))))

Then, you run guix shell -m v8.scm. But actually you do more than that, because we need to set up a container so that we can expose a standard /lib, /bin, and so on:

guix shell --container --network \
  --share=$XDG_RUNTIME_DIR --share=$HOME \
  --preserve=TERM --preserve=SSH_AUTH_SOCK \
  --emulate-fhs \
  --manifest=v8.scm

Let’s go through these options one by one.

  • --container: This is what lets us run pre-built binaries, because it uses Linux namespaces to remap the composed packages to /bin, /lib, and so on.

  • --network: Depot tools are going to want to download things, so we give them net access.

  • --share: By default, the container shares the current working directory with the “host”. But we need not only the checkout for V8 but also the sibling checkout for depot tools (more on this in a minute); let’s just share the whole home directory. Also, we share the /run/user/1000 directory, which is $XDG_RUNTIME_DIR, which lets us access the SSH agent, so we can check out over SSH.

  • --preserve: By default, the container gets a pruned environment. This lets us pass some environment variables through.

  • --emulate-fhs: The crucial piece that lets us bridge the gap between Guix and the world.

  • --manifest: Here we specify the list of packages to use when composing the environment.

We can use short arguments to make this a bit less verbose:

guix shell -CNF --share=$XDG_RUNTIME_DIR --share=$HOME \
  -ETERM -ESSH_AUTH_SOCK -m manifest.scm

I would like it if all of these arguments could somehow be optional, so that a bare guix shell invocation would just apply them when run in this directory. Perhaps some day.

Running guix shell like this drops you into a terminal. So let’s initialize depot tools:

cd $HOME/src
export VPYTHON_BYPASS="manually managed python not supported by chrome operations"
export PATH=$HOME/src/depot_tools:$PATH
export SSL_CERT_DIR=/etc/ssl/certs/
export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
gclient

This should download a bunch of things; I don’t know what. But at this point we’re ready to go:

fetch v8

This checks out V8, which is about 1.3 GB, and then probably about as much again in dependencies.

build v8

You can build V8 directly:

# note caveat below!
cd v8
tools/dev/gm.py x64.release

This will build fine... and then fail to link. The precise reason is obscure to me: it would seem that by default, V8 uses a whole Debian sysroot for Some Noble Purpose, and ends up linking against it. But it compiles against system glibc, which seems to have replaced fcntl64 with a versioned symbol, or some such nonsense. It smells like V8 built against a too-new glibc and then failed trying to link to an old glibc.

To fix this, you need to go into the args.gn that was generated in out/x64.release and then add use_sysroot = false, so that it links to system glibc instead of the downloaded one.

echo 'use_sysroot = false' >> out/x64.release/args.gn
tools/dev/gm.py x64.release

You probably want to put the commands needed to set up your environment into some shell scripts. For Guix you could make guix-env:

#!/bin/sh
guix shell -CNF --share=$XDG_RUNTIME_DIR --share=$HOME \
  -ETERM -ESSH_AUTH_SOCK -m manifest.scm -- "$@"

Then inside the container you need to set the PATH and such, so we could put this into the V8 checkout as env:

#!/bin/sh
# Look for depot_tools in sibling directory.
depot_tools=`cd $(dirname $0)/../depot_tools && pwd`
export PATH=$depot_tools:$PATH
export VPYTHON_BYPASS="manually managed python not supported by chrome operations"
export SSL_CERT_DIR=/etc/ssl/certs/
export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
exec "$@"

This way you can run ./guix-env ./env tools/dev/gm.py x64.release and not have to “enter” the container so much.

notes

This all works fine enough, but I do have some meta-reflections.

I would prefer it if I didn’t have to use containers, for two main reasons. One is that the resulting build artifacts have to be run in the container, because they are dynamically linked to e.g. /lib, at least for the ELF loader. It would be better if I could run them on the host (with the host debugger, for example). Using Guix to make the container is better than e.g. docker, though, because I can ensure that the same tools are available in the guest as I use on the host. But also, I don’t like adding “modes” to my terminals: am I in or out of this or that environment? Being in a container is not like being in a vanilla guix shell, and that’s annoying.

The build process uses many downloaded tools and artifacts, including clang itself. This is a feature, in that I am using the same compiler that colleagues at Google use, which is important. But it’s also annoying and it would be nice if I could choose. (Having the same clang-format though is an absolute requirement.)

There are two tests failing in this configuration, somehow related to time zones. I have no idea why, but I just ignore them.

If the build system were any weirder, I would think harder about maybe using Docker or something like that. Colleagues point to distrobox as being a useful wrapper. It is annoying though, because such a docker image becomes like a little stateful thing to do sysadmin work on, and I would like to avoid that if I can.

Welp, that’s all for today. Hopefully if you are contemplating installing Guix as your operating system (rather than just in user-space), this can give you a bit more information as to what it might mean when working on third-party projects. Happy hacking and until next time!

on the impossibility of composing finalizers and ffi

While poking the other day at making a Guile binding for Harfbuzz, I remembered why I don’t much do this any more: it is impossible to compose GC with explicit ownership.

Allow me to illustrate with an example. Harfbuzz has a concept of blobs, which are refcounted sequences of bytes. It uses these in a number of places, for example when loading OpenType fonts. You can peek at the blob’s contents with hb_blob_get_data, which gives you a pointer and a length.

Say you are in LuaJIT. (To think that for a couple years, I wrote LuaJIT all day long; now I can hardly remember.) You get a blob from somewhere and want to get its data. You define a wrapper for hb_blob_get_data:

local hb = ffi.load("harfbuzz")
ffi.cdef [[
typedef struct hb_blob_t hb_blob_t;

const char *
hb_blob_get_data (hb_blob_t *blob, unsigned int *length);
]]

Presumably you then arrange to release LuaJIT’s reference on the blob when GC collects a Lua wrapper for a blob:

ffi.cdef [[
void hb_blob_destroy (hb_blob_t *blob);
]]

function adopt_blob(ptr)
  return ffi.gc(ptr, hb.hb_blob_destroy)
end

OK, so let’s say we get a blob from somewhere, and want to copy out its contents as a byte string.

function blob_contents(blob)
   local len_out = ffi.new('unsigned int[1]')
   local contents = hb.hb_blob_get_data(blob, len_out)
   local len = len_out[0]
   return ffi.string(contents, len)
end

The thing is, this code is as correct as you can get it, but it’s not correct enough. In between the call to hb_blob_get_data and, well, anything else, GC could run, and if blob is not used in the future of the program execution (the continuation), then it could be collected, causing the hb_blob_destroy finalizer to release the last reference on the blob, freeing contents: we would then be accessing invalid memory.

Among GC implementors, it is a truth universally acknowledged that a program containing finalizers must be in want of a segfault. The semantics of LuaJIT do not prescribe when GC can happen and what values will be live, so the GC and the compiler are not constrained to extend the liveness of blob to, say, the entirety of its lexical scope. It is perfectly valid to collect blob after its last use, and so at some point a GC will evolve to do just that.

I chose LuaJIT not to pick on it, but rather because its FFI is very straightforward. All other languages with GC that I am aware of have this same issue. There are but two work-arounds, and neither is satisfactory: either develop a deep and correct knowledge of what the compiler and run-time will do for a given piece of code, and then pray that knowledge does not go out of date, or attempt to manually extend the lifetime of a finalizable object, and then pray the compiler and GC don’t learn new tricks to invalidate your trick.

This latter strategy takes the form of “remember-this” procedures that are designed to outsmart the compiler. They have mostly worked for the last few decades, but I wouldn’t bet on them in the future.
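
To illustrate the shape of that trick, here is a sketch building on the definitions above; remember is a hypothetical helper, not part of LuaJIT or Harfbuzz, and a sufficiently clever compiler could still defeat it, which is rather the point:

-- Hypothetical "remember-this" hack: store the object somewhere the
-- compiler can't prove is unobserved, extending its liveness past its
-- last real use. Fragile by design.
local graveyard = {}
local function remember(obj)
   graveyard[1] = obj
   graveyard[1] = nil
end

function blob_contents(blob)
   local len_out = ffi.new('unsigned int[1]')
   local contents = hb.hb_blob_get_data(blob, len_out)
   local str = ffi.string(contents, len_out[0])
   remember(blob) -- keep blob, and thus contents, alive until the copy is done
   return str
end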

Another way to look at the problem is that once you have a system working—though, how would you know it’s correct?—then you either never update the compiler and run-time, or you become fast friends with whoever maintains your GC, and probably your compiler too.

For more on this topic, as always Hans Boehm has the first and last word; see for example the 2002 Destructors, finalizers, and synchronization. These considerations don’t really apply to destructors, which are used in languages with ownership and generally run synchronously.

Happy hacking, and be safe out there!

guix on the framework 13 amd

I got a new laptop! It’s a Framework 13 AMD: 8 cores, 2 threads per core, 64 GB RAM, 3:2 2256×1504 matte screen. It kicks my 5-year-old Dell XPS 13 in the pants, and I am so relieved to be back to a matte screen. I just got it up and running with Guix, which though easier than past installation experiences was not without some wrinkles, so here I wanted to share a recipe for what worked for me.

(I swear this isn’t going to become a product review blog, but when I went to post something like this on the Framework forum I got an error saying that new users could only post 2 links. I understand how we got here but hoo, that is a garbage experience!)

The basic deal

Upstream Guix works on the Framework 13 AMD, but only with software rendering and no wifi, and I wasn’t able to install from upstream media. This is mainly because Guix uses a modified kernel and doesn’t include necessary firmware. There is a third-party nonguix repository that defines packages for the vanilla Linux kernel and the linux-firmware collection; we have to use that repo if we want all functionality.

Of course having the firmware be user-hackable would be better, and it would be better still if the Framework laptop used parts with free firmware. Something for a next revision, hopefully.

On firmware

As an aside, I think the official Free Software Foundation position on firmware is bad praxis. To recall, the idea is that if a device has embedded software (firmware) that can be updated, but that software is in a form that users can’t modify, then the system as a whole is not free software. This is technically correct but doesn’t logically imply that the right strategy for advancing free software is to forbid firmware blobs; you have a number of potential policy choices and you have to look at their expected results to evaluate which one is most in line with your goals.

Bright lines are useful, of course; I just think that with respect to free software, drawing that line around firmware is not interesting. To illustrate this point, I believe the current FSF position is that if you can run e.g. a USB ethernet adapter without installing non-free firmware, then it is kosher, otherwise it is haram. However, many of these devices have firmware; it’s just that you aren’t updating it. So for example the USB Ethernet adapter I got with my Dell system many years ago has firmware, therefore it has bugs, but I have never updated that firmware because that’s not how we roll. Or, on my old laptop, I never updated the CPU microcode, despite spectre and meltdown and all the rest.

“Firmware, but never updated” reminds me of the wires around some New York neighborhoods that allow orthodox people to leave the house on Sabbath; useful if you are of a given community and enjoy the feeling of belonging, but I think even the faithful would see it as a hack. It is like how Richard Stallman wouldn’t use travel booking web sites because they had non-free JavaScript, but would happily call someone on the telephone to perform the booking for him, using those same sites. In that case, the net effect on the world of this particular bright line is negative: it does not advance free software in the least and only adds overhead. Privileging principle over praxis is generally a losing strategy.

Installation

Firstly I had to turn off secure boot in the bios settings; it’s in “security”.

I wasn’t expecting wifi to work out of the box, but for some reason the upstream Guix install media was not able to configure the network via the Ethernet expansion card nor via an external USB-C ethernet adapter that I had; it got stuck at the DHCP phase. So my initial installation attempt failed.

Then I realized that the nonguix repository has installation media, which is the same as upstream but with the vanilla kernel and linux-firmware. So on another machine where I had Guix installed, I added the nonguix channel and built the installation media, via guix system image -t iso9660 nongnu/system/install.scm. That gave me a file that I could write to a USB stick.
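
For reference, the channel addition looks something like this, typically in ~/.config/guix/channels.scm; I am quoting from memory of the nonguix README, which is authoritative and also documents a channel introduction for authenticated pulls:

;; Add the nonguix channel to the default channels.
(cons* (channel
        (name 'nonguix)
        (url "https://gitlab.com/nonguix/nonguix"))
       %default-channels)

After a guix pull with that channel in place, the nongnu modules are available.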

Using that installation media, installing was a breeze.

However upon reboot, I found that I had no wifi and I was using software rendering; clearly, installation produced an OS config with the Guix kernel instead of upstream Linux. Happily, at this point the ethernet expansion card was able to work, so: connect to wired ethernet, open /etc/config.scm, add the needed lines as described in the operating-system part of the nonguix README (sketched below), reconfigure, and reboot. Building Linux takes a little less than an hour on this machine.
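
Those needed lines, as best I recall them from the README (the README is authoritative; only the fields to add or change are shown here):

(use-modules (nongnu packages linux)
             (nongnu system linux-initrd))

(operating-system
  (kernel linux)                    ; the vanilla kernel, not linux-libre
  (initrd microcode-initrd)         ; CPU microcode updates
  (firmware (list linux-firmware))  ; wifi and graphics firmware
  ;; ... the rest of the config, as generated by the installer ...
  )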

Fractional scaling

At that point you have wifi and graphics drivers. I use GNOME, and things seem to work. However the screen defaults to 200% resolution, which makes everything really big. Crisp, pretty, but big. Really you would like something in between, or for Framework to ship a higher-resolution screen so that 200% would be a good scaling factor; this was the case with my old Dell XPS 13, and it worked well. Anyway with the Framework laptop, I wanted 150% scaling, and it seems these days that the way you have to do this is to use Wayland, which Guix does not yet enable by default.

So you go into config.scm again, and change where it says %desktop-services to be:

(modify-services %desktop-services
  (gdm-service-type config =>
    (gdm-configuration (inherit config) (wayland? #t))))

Then when you reboot you are in Wayland. Works fine, it seems. But then you have to go and enable an experimental mutter setting; install dconf-editor, run it, search for keys with “mutter” in the name, find the “experimental settings” key, tell it to not use the default setting, then click the box for “scale-monitor-framebuffer”.
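
If you would rather not click through dconf-editor, I believe the equivalent gsettings one-liner is the following; note that it overwrites the whole experimental-features list:

gsettings set org.gnome.mutter experimental-features "['scale-monitor-framebuffer']"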

Then! You can go into GNOME settings and get 125%, 150%, and so on. Great.

HOWEVER, and I hope this is a transient situation, there is a problem: in GNOME, applications that aren’t native Wayland apps don’t scale nicely. It’s like the app gets rendered to a texture at the original resolution, which then gets scaled up in a blurry way. There aren’t so many of these apps these days as most things have been ported to be Wayland-capable, Firefox included, but Emacs is one of them :( However however! If you install the emacs-pgtk package instead of emacs, it looks better. Not perfect, but good enough. So that’s where I am.

Bugs

The laptop hangs on reboot due to this bug, but that seems a minor issue at this point. There is an ongoing tracker discussion on the community forum; like other problems in that thread, I hope that this one resolves itself upstream in Linux over time.

Other things?

I didn’t mention the funniest thing about this laptop: it comes in pieces that you have to put together :) I am not so great with hardware, but I had no problem. The build quality seems pretty good; not a MacBook Air, but then it’s also user-repairable, which is a big strong point. It has these funny extension cards that slot into the chassis, which I have found to be quite amusing.

I haven’t had the machine for long enough but it seems to work fine up to now: suspend, good battery use, not noisy (unless it’s compiling on all 16 threads), graphics, wifi, ethernet, good compilation speed. (I should give compiling LLVM a go; that’s a useful workload.) I don’t have bluetooth or the fingerprint reader working yet; I give it 25% odds that I get around to this during the lifetime of this laptop :)

Until next time, happy hacking!