wingolog

wastrelly wabbits

2026-03-31T20:34:23Z

Good day! Today (tonight), some notes on the last couple months of Wastrel, my ahead-of-time WebAssembly compiler.

Back in the beginning of February, I showed Wastrel running programs that use garbage collection, using an embedded copy of the Whippet collector, specialized to the types present in the Wasm program. But, the two synthetic GC-using programs I tested on were just ported microbenchmarks, and didn’t reflect the output of any real toolchain.

In this cycle I worked on compiling the output from the Hoot Scheme-to-Wasm compiler. There were some interesting challenges!

bignums

When I originally wrote the Hoot compiler, it targeted the browser, which already has a bignum implementation in the form of BigInt, which I worked on back in the day. Hoot-generated Wasm files use host bigints via externref (though wrapped in structs to allow for hashing and identity).

In Wastrel, then, I implemented the imports that implement bignum operations: addition, multiplication, and so on. I did so using mini-gmp, a stripped-down implementation of the workhorse GNU multi-precision library. At some point if bignums become important, this gives me the option to link to the full GMP instead.

Bignums were the first managed data type in Wastrel that wasn’t defined as part of the Wasm module itself, instead hiding behind externref, so I had to add a facility to allocate type codes to these “host” data types. More types will come in time: weak maps, ephemerons, and so on.

I think bignums would be a great proposal for the Wasm standard, similar to stringref ideally (sniff!), possibly in an attenuated form.

exception handling

Hoot used to emit a pre-standardization form of exception handling, and hadn’t gotten around to updating to the newer version that was standardized last July. I updated Hoot to emit the newer kind of exceptions, as it was easier to implement them in Wastrel that way.

Some of the problems Chris Fallin contended with in Wasmtime don’t apply in the Wastrel case: since the set of instances is known at compile-time, we can statically allocate type codes for exception tags. Also, I didn’t really have to do the back-end: I can just use setjmp and longjmp.

This whole paragraph was meant to be a bit of an aside in which I briefly mentioned why just using setjmp was fine. Indeed, because Wastrel never re-uses a temporary, relying entirely on GCC to “re-use” the register / stack slot on our behalf, I had thought that I didn’t need to worry about the “volatile problem”. From the C99 specification:

[...] values of objects of automatic storage duration that are local to the function containing the invocation of the corresponding setjmp macro that do not have volatile-qualified type and have been changed between the setjmp invocation and longjmp call are indeterminate.

My thought was, though I might set a value between setjmp and longjmp, that would only be the case for values whose lifetime did not reach the longjmp (i.e., whose last possible use was before the jump). Wastrel didn’t introduce any such cases, so I was good.

However, I forgot about local.set: mutations of locals (ahem, objects of automatic storage duration) in the source Wasm file could run afoul of this rule. So, because of writing this blog post, I went back and did an analysis pass on each function to determine the set of locals which are mutated inside the body of a try_table. Thank you, rubber duck readers!
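To make that analysis concrete, here is a minimal sketch in Python (not Wastrel's Guile) over a made-up nested-tuple instruction representation. The tuple shapes and the name locals_mutated_in_try are assumptions for illustration only, not Wastrel's actual IR. It collects the locals written by local.set or local.tee anywhere inside a try_table body, which are presumably the ones whose C counterparts need volatile treatment.

```python
def locals_mutated_in_try(body, in_try=False, acc=None):
    """Return the set of local indices written inside any try_table body.

    Hypothetical IR: an instruction is a tuple whose first element is the
    opcode.  ("local.set", idx) and ("local.tee", idx) write a local;
    ("block" | "loop" | "try_table", [instrs]) nest one body;
    ("if", [then_instrs], [else_instrs]) nests two.
    """
    if acc is None:
        acc = set()
    for inst in body:
        op = inst[0]
        if op in ("local.set", "local.tee"):
            if in_try:
                acc.add(inst[1])
        elif op == "try_table":
            # Everything nested below here counts as "inside a try_table".
            locals_mutated_in_try(inst[1], True, acc)
        elif op in ("block", "loop"):
            locals_mutated_in_try(inst[1], in_try, acc)
        elif op == "if":
            locals_mutated_in_try(inst[1], in_try, acc)
            locals_mutated_in_try(inst[2], in_try, acc)
    return acc


func = [
    ("local.set", 0),                      # before the try_table: not collected
    ("try_table", [
        ("local.set", 1),
        ("block", [("local.tee", 2)]),     # nested block, still inside the try
        ("call", 7),
    ]),
    ("local.set", 3),                      # after the try_table: not collected
]
print(sorted(locals_mutated_in_try(func)))  # [1, 2]
```

This is deliberately conservative: it does not check whether the mutated local is actually live at a catch target, only that it is written somewhere inside a try_table.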

bugs

Oh my goodness there were many bugs. Lacunae, if we are being generous; things not implemented quite right, which resulted in errors either when generating C or when compiling the C. The type-preserving translation strategy does seem to have borne fruit, in that I have spent very little time in GDB: once things compile, they work.

coevolution

Sometimes Hoot would use a browser facility where it was convenient, but for which in a better world we would just do our own thing. Such was the case for the number->string operation on floating-point numbers: we did something awful but expedient.

I didn’t have this facility in Wastrel, so instead we moved to do float-to-string conversions in Scheme. This turns out to have been a good test for bignums too; the algorithm we use is a bit dated and relies on bignums to do its thing. The move to Scheme also allows for printing floating-point numbers in other radices.

There are a few more Hoot patches that were inspired by Wastrel, about which more later; it has been good for both to work on the two at the same time.

tail calls

My plan for Wasm’s return_call and friends was to use the new musttail annotation for calls, which has been in clang for a while and was recently added to GCC. I was careful to limit the number of function parameters such that no call should require stack allocation, and therefore a compiler should have no reason to reject any particular tail call.

However, there were bugs. Funny ones, at first: attributes applying to a preceding label instead of the following call, or the need to insert if (1) before the tail call. More dire ones, in which tail callers inlined into their callees would cause the tail calls to fail, worked around with judicious application of noinline. Thanks to GCC’s Andrew Pinski for help debugging these and other issues; with GCC things are fine now.

I did have to change the code I emitted to return “top types only”: if you have a function returning type T, you can tail-call a function returning U if U is a subtype of T, but there is no nice way to encode this into the C type system. Instead, we return the top type of T (or U, it’s the same), e.g. anyref, and insert downcasts at call sites to recover the precise types. Not so nice, but it’s what we got.

Trying tail calls on clang, I ran into a funny restriction: clang not only requires that return types match, but requires that tail caller and tail callee have the same parameters as well. I can see why they did this (it requires no stack shuffling and thus such a tail call is always possible, even with 500 arguments), but it’s not the design point that I need. Fortunately there are discussions about moving to a different constraint.

scale

I spent way more time than I had planned to on improving the speed of Wastrel itself. My initial idea was to just emit one big C file, and that would provide the maximum possibility for GCC to just go and do its thing: it can see everything, everything is static, there are loads of always_inline helpers that should compile away to single instructions, that sort of thing. But, this doesn’t scale, in a few ways.

In the first obvious way, consider whitequark’s llvm.wasm. This is all of LLVM in one 70 megabyte Wasm file. Wastrel made a huuuuuuge C file, then GCC chugged on it forever; 80 minutes at -O1, and I wasn’t aiming for -O1.

I realized that in many ways, GCC wasn’t designed to be a compiler target. The shape of code that one might emit from a Wasm-to-C compiler like Wastrel is different from what one would write by hand. I even ran into a segfault compiling with -Wall, because GCC accidentally recursed instead of iterated in the -Winfinite-recursion pass.

So, I dealt with this in a few ways. After many hours spent pleading and bargaining with different -O options, I bit the bullet and made Wastrel emit multiple C files. It will compute a DAG forest of all the functions in a module, where edges are direct calls, and go through that forest, greedily consuming (and possibly splitting) subtrees until we have “enough” code to split out a partition, as measured by number of Wasm instructions. They say that -flto makes this a fine approach, but one never knows when a translation unit boundary will turn out to be important. I compute needed symbol visibilities as much as I can so as to declare functions that don’t escape their compilation unit as static; who knows if this is of value. Anyway, this partitioning introduced no performance regression in my limited tests so far, and compiles are much much much faster.
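As a rough illustration of the greedy idea, here is a simplified Python sketch. It is not Wastrel's code: the function name partition, the dictionary-based forest, and the budget parameter are all assumptions, and the splitting is cruder than what is described above (it just cuts wherever the running instruction count reaches the budget during a callees-first walk).

```python
def partition(roots, children, size, budget):
    """Greedily group functions into partitions of roughly `budget` instructions.

    roots: functions with no caller in the forest.
    children: dict mapping a function to the callees it owns in the forest.
    size: dict mapping a function to its Wasm instruction count.
    """
    partitions, current, current_size = [], [], 0

    def flush():
        nonlocal current, current_size
        if current:
            partitions.append(current)
            current, current_size = [], 0

    def walk(f):
        nonlocal current_size
        # Visit callees first so that a subtree tends to stay together.
        for callee in children.get(f, ()):
            walk(callee)
        current.append(f)
        current_size += size[f]
        if current_size >= budget:
            flush()  # may cut a large subtree across two partitions

    for root in roots:
        walk(root)
    flush()  # whatever is left over becomes the last partition
    return partitions


children = {"main": ["a", "b"], "a": ["c"]}
size = {"main": 10, "a": 30, "b": 50, "c": 40}
print(partition(["main"], children, size, budget=60))
# [['c', 'a'], ['b', 'main']]
```

Each resulting list would become one C file; functions called only from within their own list are the candidates for static linkage.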

scale, bis

A brief observation: Wastrel used to emit indented code, because it could, and what does it matter, anyway. However, consider Wasm’s br_table: it takes an array of n labels and an integer operand, and will branch to the label that the operand selects, or the last if the operand is out of range. To set up a label in Wasm, you make a block, of which there are a handful of kinds; the label is visible in the block, and for n labels, the br_table will be the most nested expression in the n nested blocks.

Now consider that block indentation is proportional to n. This means, the file size of an indented C file is quadratic in the number of branch targets of the br_table.
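A quick back-of-the-envelope check, as a Python sketch. The two-space indent and the one-line-per-block model are assumptions; the real emitted C has more lines per block, which only makes things worse.

```python
def indentation_bytes(n, width=2):
    """Bytes of leading whitespace when n nested blocks each add `width` spaces.

    Counts a single line per block opening; block k sits at depth k.
    """
    return sum(width * depth for depth in range(n))


for n in (10, 100, 1000, 10000):
    print(n, indentation_bytes(n))
```

Under those assumptions, ten thousand branch targets cost 99,990,000 bytes of spaces alone, which is width * n * (n - 1) / 2.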

Yes, this actually bit me; there are br_table instances with tens of thousands of targets. No, Wastrel does not indent any more.

scale, ter

Right now, the long pole in Wastrel is the compile-to-C phase; the C-to-native phase parallelises very well and is less of an issue. So, one might think: OK, you have partitioned the functions in this Wasm module into a number of files, why not emit the files in parallel?

I gave this a go. It did not speed up C generation. From my cursory investigations, I think this is because the bottleneck is garbage collection in Wastrel itself; Wastrel is written in Guile, and Guile still uses the Boehm-Demers-Weiser collector, which does not parallelize well for multiple mutators. It’s terrible but I ripped out parallelization and things are fine. Someone on Mastodon suggested fork; they’re not wrong, but also not Right either. I’ll just keep this as a nice test case for the Guile-on-Whippet branch I want to poke later this year.

scale, quater

Finally, I had another realization: GCC was having trouble compiling the C that Wastrel emitted, because Hoot had emitted bad WebAssembly. Not bad as in “invalid”; rather, “not good”.

There were two cases in which Hoot emitted ginormous (technical term) functions. One, for an odd debugging feature: Hoot does a CPS transform on its code, and allocates return continuations on a stack. This is a gnarly technique but it gets us delimited continuations and all that goodness even before stack switching has landed, so it’s here for now. It also gives us a reified return stack of funcref values, which lets us print Scheme-level backtraces.

Or it would, if we could associate data with a funcref. Unfortunately func is not a subtype of eq, so we can’t. Unless... we pass the funcref out to the embedder (e.g. JavaScript), and the embedder checks the funcref for equality (e.g. using ===); then we can map a funcref to an index, and use that index to map to other properties.

How to pass that funcref/index map to the host? When I initially wrote Hoot, I didn’t want to just, you know, put the funcrefs of interest into a table and let the index of a function’s slot be the value in the key-value mapping; that would be useless memory usage. Instead, we emitted functions that took an integer, and which would return a funcref. Yes, these used br_table, and yes, there could be tens of thousands of cases, depending on what you were compiling.

Then to map the integer index to, say, a function name, likewise I didn’t want a table; that would force eager allocation of all strings. Instead I emitted a function with a br_table whose branches would return string.const values.

Except, of course, stringref didn’t become a thing, and so instead we would end up lowering to allocate string constants as globals.

Except, of course, Wasm’s idea of what a “constant” is is quite restricted, so we have a pass that moves non-constant global initializers to the “start” function. This results in an enormous start function. The straightforward solution was to partition global initializations into separate functions, called by the start function.

For the funcref debugging, the solution was more intricate: firstly, we represent the funcref-to-index mapping just as a table. It’s fine. Then for the side table mapping indices to function names and sources, we emit DWARF, and attach a special attribute to each “introspectable” function. In this way, reading the DWARF sequentially, we reconstruct a mapping from index to DWARF entry, and thus to a byte range in the Wasm code section, and thus to source information in the .debug_line section. It sounds gnarly but Guile already used DWARF as its own debugging representation; switching to emit it in Hoot was not a huge deal, and as we only need to consume the DWARF that we emit, we only needed some 400 lines of JS for the web/node run-time support code.

This switch to data instead of code removed the last really long pole from the GCC part of Wastrel’s pipeline. What’s more, Wastrel can now implement the code_name and code_source imports for Hoot programs ahead of time: it can parse the DWARF at compile-time, and generate functions that look up functions by address in a sorted array to return their names and source locations. As of today, this works!
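The lookup side of this is just a binary search over a table sorted by start address. A minimal Python sketch, with invented table contents and a hypothetical code_name helper (the real generated functions are C, and their exact shape is not shown here):

```python
import bisect

# Hypothetical table: (start offset in the code section, function name),
# sorted by start offset.  The entries are made up for illustration.
ENTRIES = [(0x10, "main"), (0x80, "fib"), (0x140, "display")]
STARTS = [start for start, _ in ENTRIES]


def code_name(addr):
    """Name of the last function starting at or before addr, or None."""
    i = bisect.bisect_right(STARTS, addr) - 1
    return ENTRIES[i][1] if i >= 0 else None


print(code_name(0x90))   # fib
print(code_name(0x05))   # None
```

This sketch has no end-of-range check, so any address past the last start maps to the last entry; a real table would carry the byte range for each function as well.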

fin

There are still a few things that Hoot wants from a host that Wastrel has stubbed out: weak refs and so on. I’ll get to this soon; my goal is a proper Scheme REPL. Today’s note is a waypoint on the journey. Until next time, happy hacking!

two mechanisms for dynamic type checks

2026-02-18T16:21:10Z

Today, a very quick note on dynamic instance type checks in virtual machines with single inheritance.

The problem is that given an object o whose type is t, you want to check if o actually is of some more specific type u. To my knowledge, there are two sensible ways to implement these type checks.

if the set of types is fixed: dfs numbering

Consider a set of types T := {t, u, ...} and a set of edges S := {<t|ε, u>, ...} indicating that t is the direct supertype of u, or ε if u is a top type. S should not contain cycles and is thus a directed acyclic graph rooted at ε.

First, compute a pre-order and post-order numbering for each t in the graph by doing a depth-first search over S from ε. Something like this:

def visit(t, counter):
    t.pre_order = counter
    counter = counter + 1
    for u in S[t]:
        counter = visit(u, counter)
    t.post_order = counter
    return counter

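Then at run-time, when making an object of type t, you arrange to store the type’s pre-order number (its tag) in the object itself. To test if the object is of type u, you extract the tag from the object and check if (tag - u.pre_order) mod 2ⁿ < u.post_order - u.pre_order, where n is the width of the tag in bits. The wrap-around subtraction lets a single unsigned comparison cover both the lower and the upper bound.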

Two notes, probably obvious but anyway: one, you know the numbering for u at compile-time and so can embed those variables as immediates. Also, if the type has no subtypes, it can be a simple equality check.
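Here is a runnable version of the numbering together with the range check, as a sketch. The type names in the example hierarchy (pair and bignum under struct, and so on) are invented, the key None stands in for ε, and the 32-bit tag width is an assumption.

```python
def number_types(children, root=None):
    """Assign pre-order and post-order numbers to every type under root.

    children maps a type name to the list of its direct subtypes;
    the key None plays the role of ε and lists the top types.
    """
    pre, post = {}, {}

    def visit(t, counter):
        pre[t] = counter
        counter += 1
        for u in children.get(t, ()):
            counter = visit(u, counter)
        post[t] = counter
        return counter

    visit(root, 0)
    return pre, post


def is_subtype(tag, u, pre, post, bits=32):
    """True if an object carrying `tag` is an instance of type u."""
    mask = (1 << bits) - 1
    # Masking emulates unsigned wrap-around: a tag below pre[u] becomes huge.
    return ((tag - pre[u]) & mask) < post[u] - pre[u]


children = {
    None: ["any"],
    "any": ["eq"],
    "eq": ["i31", "struct", "array"],
    "struct": ["pair", "bignum"],
}
pre, post = number_types(children)
print(is_subtype(pre["pair"], "struct", pre, post))   # True
print(is_subtype(pre["array"], "struct", pre, post))  # False
print(is_subtype(pre["i31"], "struct", pre, post))    # False
```

In this hierarchy struct gets pre-order 4 and post-order 7, so its instances are exactly the tags 4, 5 and 6.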

Note that this approach applies only if the set of types T is fixed. This is the case when statically compiling a WebAssembly module in a system that doesn’t allow modules to be instantiated at run-time, like Wastrel. Interestingly, it can also be the case in JIT compilers, when modeling types inside the optimizer.

if the set of types is unbounded: the display hack

If types may be added to a system at run-time, maintaining a sorted set of type tags may be too much to ask. In that case, the standard solution is something I learned of as the display hack, but whose name is apparently ungooglable. It is described in a 4-page technical note by Norman H. Cohen, from 1991: Type-Extension Type Tests Can Be Performed In Constant Time.

The basic idea is that each type t should have an associated sorted array of supertypes, starting with its top type and ending with t itself. Each t also has a depth, indicating the number of edges between it and its top type. A type u is a subtype of t if u[t.depth]=t, provided that u.depth >= t.depth (so that the index is in bounds).

There are some tricks one can do to optimize out the depth check, but it’s probably a wash given the check performs a memory access or two on the way. But the essence of the whole thing is in Cohen’s paper; go take a look!
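A minimal Python sketch of the display test follows. The class name Type and its fields are assumptions for illustration; Cohen’s note is the authoritative description. The depth comparison guards the array access: u must be at least as deep as t for u’s display to have a slot at t.depth.

```python
class Type:
    def __init__(self, name, parent=None):
        self.name = name
        self.depth = 0 if parent is None else parent.depth + 1
        # Supertypes from the top type down to this type itself.
        inherited = [] if parent is None else list(parent.display)
        self.display = inherited + [self]


def is_subtype(u, t):
    """True if u is t or a descendant of t."""
    return u.depth >= t.depth and u.display[t.depth] is t


top = Type("top")
a = Type("a", top)
b = Type("b", a)
c = Type("c", top)
print(is_subtype(b, a))  # True
print(is_subtype(a, b))  # False: a is shallower than b
print(is_subtype(c, a))  # False: same depth, different type
```

New types can be added at any time without renumbering anything, at the cost of one array per type whose length is the type’s depth plus one.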

Jan Vitek notes in a followup paper (Efficient Type Inclusion Tests) that Christian Queinnec discovered the technique around the same time. Vitek also mentions the DFS technique, but as prior art, apparently already deployed in DEC Modula-3 systems. The term “display” was bouncing around in the 80s to describe some uses of arrays; I learned it from Dybvig’s implementation of flat closures, who learned it from Cardelli. I don’t know though where “display hack” comes from.

That’s it! If you know of any other standard techniques for type checks with single-inheritance subtyping, do let me know in the comments. Until next time, happy hacking!

Addendum: Thanks to kind readers, I have some new references! Michael Schinz refers to Yoav Zibin’s PhD thesis as a good overview. Alex Bradbury points to a survey article by Roland Ducournau as describing the DFS technique as “Schubert numbering”. CF Bolz-Tereick unearthed the 1983 Schubert paper, and it is a weird one. Still, I can’t but think that the DFS technique was known earlier; I have a 1979 graph theory book by Shimon Even that describes a test for “separation vertices” that is precisely the same, though it does not mention the application to type tests. Many thanks also to fellow traveller Max Bernstein for related discussions.

cps in hoot

2024-05-27T12:36:44Z

Good morning good morning! Today I have another article on the Hoot Scheme-to-Wasm compiler, this time on Hoot’s use of the continuation-passing-style (CPS) transformation.

calls calls calls

So, just a bit of context to start out: Hoot is a Guile, Guile is a Scheme, Scheme is a Lisp, one with “proper tail calls”: function calls are either in tail position, syntactically, in which case they are tail calls, or they are not in tail position, in which case they are non-tail calls. A non-tail call suspends the calling function, putting the rest of it (the continuation) on some sort of stack, and will resume when the callee returns. Because non-tail calls push their continuation on a stack, we can call them push calls.

(define (f)
  ;; A push call to g, binding its first return value.
  (define x (g))
  ;; A tail call to h.
  (h x))

Usually the problem in implementing Scheme on other language run-times comes in tail calls, but WebAssembly supports them natively (except on JSC / Safari; should be coming at some point though). Hoot’s problem is the reverse: how to implement push calls?

The issue might seem trivial but it is not. Let me illustrate briefly by describing what Guile does natively (not compiled to WebAssembly). Firstly, note that I am discussing residual push calls, by which I mean to say that the optimizer might remove a push call in the source program via inlining: we are looking at those push calls that survive the optimizer. Secondly, note that native Guile manages its own stack instead of using the stack given to it by the OS; this allows for push-call recursion without arbitrary limits. It also lets Guile capture stack slices and rewind them, which is the fundamental building block we use to implement exception handling, Fibers and other forms of lightweight concurrency.

The straightforward function call will have an artificially limited total recursion depth in most WebAssembly implementations, meaning that many idiomatic uses of Guile will throw exceptions. Unpleasant, but perhaps we could stomach this tradeoff. The greater challenge is how to slice the stack. That I am aware of, there are three possible implementation strategies.

generic slicing

One possibility is that the platform provides a generic, powerful stack-capture primitive, which is what Guile does. The good news is that one day, the WebAssembly stack-switching proposal should provide this too. And in the meantime, the so-called JS Promise Integration (JSPI) proposal gets close: if you enter Wasm from JS via a function marked as async, and you call out to JavaScript to a function marked as async (i.e. returning a promise), then on that nested Wasm-to-JS call, the engine will suspend the continuation and resume it only when the returned promise settles (i.e. completes with a value or an exception). Each entry from JS to Wasm via an async function allocates a fresh stack, so I understand you can have multiple pending promises, and thus multiple wasm coroutines in progress. It gets a little gnarly if you want to control when you wait, for example if you might want to wait on multiple promises; in that case you might not actually mark promise-returning functions as async, and instead import an async-marked async function waitFor(p) { return await p} or so, allowing you to use Promise.race and friends. The main problem though is that JSPI is only for JavaScript. Also, its stack sizes are even smaller than the default stack size.

instrumented slicing

So much for generic solutions. There is another option, to still use push calls from the target machine (WebAssembly), but to transform each function to allow it to suspend and resume. This is what I think of as Joe Marshall’s stack trick (also see §4.2 of the associated paper). The idea is that although there is no primitive to read the whole stack, each frame can access its own state. If you insert a try/catch around each push call, the catch handler can access local state for activations of that function. You can slice a stack by throwing a SaveContinuation exception, in which each frame’s catch handler saves its state and re-throws. And if we want to avoid exceptions, we can use checked returns as Asyncify does.

I never understood, though, how you resume a frame. The Generalized Stack Inspection paper would seem to indicate that you need the transformation to introduce a function to run “the rest of the frame” at each push call, which becomes the Invoke virtual method on the reified frame object. To avoid code duplication you would have to make normal execution flow run these Invoke snippets as well, and that might undo many of the advantages. I understand the implementation that Joe Marshall was working on was an interpreter, though, which bounds the number of sites needing such a transformation.

cps transformation

The third option is a continuation-passing-style transformation. A CPS transform results in a program whose procedures “return” by tail-calling their “continuations”, which themselves are procedures. Taking our previous example, a naïve CPS transformation would reify the following program:

(define (f' k)
  (g' (lambda (x) (h' k x))))

Here f' (“f-prime”) receives its continuation as an argument. We call g', for whose continuation argument we pass a closure. That closure is the return continuation of g, binding a name to its result, and then tail-calls h with respect to f. We know their continuations are the same because it is the same binding, k.

Unfortunately we can’t really slice arbitrary ranges of a stack with the naïve CPS transformation: we can only capture the entire continuation, and can’t really inspect its structure. There is also no way to compose a captured continuation with the current continuation. And, in a naïve transformation, we would be constantly creating lots of heap allocation for these continuation closures; a push call effectively pushes a frame onto the heap as a closure, as we did above for g'.

There is also the question of when to perform the CPS transform; most optimizing compilers would like a large first-order graph to work on, which is out of step with the way CPS transformation breaks functions into many parts. Still, there is a nugget of wisdom here. What if we preserve the conventional compiler IR for most of the pipeline, and only perform the CPS transformation at the end? In that way we can have nice SSA-style optimizations. And, for return continuations of push calls, what if instead of allocating a closure, we save the continuation data on an explicit stack? As Andrew Kennedy notes, closures introduced by the CPS transform follow a stack discipline, so this seems promising; we would have:

(define (f'' k)
  (push! k)
  (push! h'')
  (g'' (lambda (x)
         (define h'' (pop!))
         (define k (pop!))
         (h'' k x))))

The explicit stack allows for generic slicing, which makes it a win for implementing delimited continuations.
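For readers who would rather poke at this than squint at primes, here is the same shape as a small Python model. It is only a sketch: Python has no proper tail calls, so each “tail call” is an ordinary return of a call, and the bodies chosen for g2 and h2 are arbitrary stand-ins.

```python
stack = []  # the explicit stack of saved continuation data


def g2(k):
    # g "returns" 20 by calling its continuation.
    return k(20)


def h2(k, x):
    # h "returns" x + 1 to its continuation.
    return k(x + 1)


def f2(k):
    # Save what the return continuation will need, instead of closing over it.
    stack.append(k)
    stack.append(h2)

    def ret(x):
        h = stack.pop()       # popped in reverse order of the pushes
        saved_k = stack.pop()
        return h(saved_k, x)

    return g2(ret)


print(f2(lambda v: v))  # 21
print(stack)            # []
```

Because the saved data lives in an ordinary list rather than inside closures, a slice of it can be copied out and pushed back later, which is the property that matters for delimited continuations.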

hoot and cps

Hoot takes the CPS transformation approach with stack-allocated return closures. In fact, Hoot goes a little farther, too far probably:

(define (f''')
  (push! k)
  (push! h''')
  (push! (lambda (x)
           (define h'' (pop!))
           (define k (pop!))
           (h'' k x)))
  (g'''))

Here instead of passing the continuation as an argument, we pass it on the stack of saved values. Returning pops off from that stack; for example, (lambda () 42) would transform as (lambda () ((pop!) 42)). But some day I should go back and fix it to pass the continuation as an argument, to avoid excess stack traffic for leaf function calls.

There are some gnarly details though, which I know you are here for!

splits

For our function f, we had to break it into two pieces: the part before the push-call to g and the part after. If we had two successive push-calls, we would instead split into three parts. In general, each push-call introduces a split; let us use the term tails for the components produced by a split. (You could also call them continuations.) How many tails will a function have? Well, one for the entry, one for each push call, and one any time control-flow merges between two tails. This is a fixpoint problem, given that the input IR is a graph. (There is also some special logic for call-with-prompt but that is too much detail for even this post.)

where to save the variables

Guile is a dynamically-typed language, having a uniform SCM representation for every value. However in the compiler and run-time we can often unbox some values, generally as u64/s64/f64 values, but also raw pointers of some specific types, some GC-managed and some not. In native Guile, we can just splat all of these data members into 64-bit stack slots and rely on the compiler to emit stack maps to determine whether a given slot is a double or a tagged heap object reference or what. In WebAssembly though there is no sum type, and no place we can put either a u64 or a (ref eq) value. So we have not one stack but three (!) stacks: one for numeric values, implemented using a Wasm memory; one for (ref eq) values, using a table; and one for return continuations, because the func type hierarchy is disjoint from eq. It’s.... it’s gross? It’s gross.

what variables to save

Before a push-call, you save any local variables that will be live after the call. This is also a flow analysis problem. You can leave off constants, and instead reify them anew in the tail continuation.

I realized, though, that we have some pessimality related to stacked continuations. Consider:

(define (q x)
  (define y (f))
  (define z (f))
  (+ x y z))

Hoot’s CPS transform produces something like:

(define (q0 x)
  (save! x)
  (save! q1)
  (f))

(define (q1 y)
  (restore! x)
  (save! x)
  (save! y)
  (save! q2)
  (f))

(define (q2 z)
  (restore! x)
  (restore! y)
  ((pop!) (+ x y z)))

So q0 saved x, fine, indeed we need it later. But q1 didn’t need to restore x uselessly, only to save it again on q2’s behalf. Really we should be applying a stack discipline for saved data within a function. Given that the source IR is a graph, this means another flow analysis problem, one that I haven’t thought about how to solve yet. I am not even sure if there is a solution in the literature, given that the SSA-like flow graphs plus tail calls / CPS is a somewhat niche combination.
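To see the redundant traffic, here is a Python model of q0, q1 and q2 that counts every push and pop. It is a sketch under assumptions: a single list stands in for Hoot’s three stacks, the callee f simply returns 1, and restores pop in strict LIFO order (so q2 pops y before x).

```python
stack = []
traffic = 0  # total number of pushes and pops


def save(v):
    global traffic
    traffic += 1
    stack.append(v)


def restore():
    global traffic
    traffic += 1
    return stack.pop()


def f():
    # Returns 1 to whichever continuation is on top of the stack.
    return restore()(1)


def q0(x):
    save(x)
    save(q1)
    return f()


def q1(y):
    x = restore()   # popped only so that it can be pushed again below
    save(x)
    save(y)
    save(q2)
    return f()


def q2(z):
    y = restore()
    x = restore()
    return restore()(x + y + z)


save(lambda v: v)   # the caller's return continuation
result = q0(10)
print(result, traffic)  # 12 12
```

Of those 12 stack operations, the restore and save of x in q1 are the two that a stack discipline within the function would eliminate.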

calling conventions

The continuations introduced by CPS transformation have associated calling conventions: return continuations may have the generic varargs type, or the compiler may have concluded they have a fixed arity that doesn’t need checking. In any case, for a return, you call the return continuation with the returned values, and the return point then restores any live-in variables that were previously saved. But for a merge between tails, you can arrange to take the live-in variables directly as parameters; it is a direct call to a known continuation, rather than an indirect call to an unknown call site.

cps soup?

Guile’s intermediate representation is called CPS soup, and you might wonder what relationship that CPS has to this CPS. The answer is not much. The continuations in CPS soup are first-order; a term in one function cannot continue to a continuation in another function. (Inlining and contification can merge graphs from different functions, but the principle is the same.)

It might help to explain that it is the same relationship as it would be if Guile represented programs using SSA: the Hoot CPS transform runs at the back-end of Guile’s compilation pipeline, where closure representations have already been made explicit. The IR is still direct-style, just that syntactically speaking, every call in a transformed program is a tail call. We had to introduce save and restore primitives to implement the saved variable stack, and some other tweaks, but generally speaking, the Hoot CPS transform ensures the run-time all-tail-calls property rather than altering the compile-time language; a transformed program is still CPS soup.

fin

Did we actually make the right call in going for a CPS transformation?

I don’t have good performance numbers at the moment, but from what I can see, the overhead introduced by CPS transformation can impose some penalties, even 10x penalties in some cases. But some results are quite good, improving over native Guile, so I can’t be categorical.

But really the question is, is the performance acceptable for the functionality, and there I think the answer is more clear: we have a port of Fibers that I am sure Spritely colleagues will be writing more about soon, we have good integration with JavaScript promises while not relying on JSPI or Asyncify or anything else, and we haven’t had to compromise in significant ways regarding the source language. So, for now, I am satisfied, and looking forward to experimenting with the stack slicing proposal as it becomes available.

Until next time, happy hooting!

hoot's wasm toolkit

2024-05-24T10:37:54Z

Good morning! Today we continue our dive into the Hoot Scheme-to-WebAssembly compiler. Instead of talking about Scheme, let’s focus on WebAssembly, specifically the set of tools that we have built in Hoot to wrangle Wasm. I am peddling a thesis: if you compile to Wasm, probably you should write a low-level Wasm toolchain as well.

(Incidentally, some of this material was taken from a presentation I gave to the Wasm standardization organization back in October, which I think I haven’t shared yet in this space, so if you want some more context, have at it.)

naming things

Compilers are all about names: definitions of globals, types, local variables, and so on. An intermediate representation in a compiler is a graph of definitions and uses in which the edges are names, and the set of possible names is generally unbounded; compilers make more names when they see fit, for example when copying a subgraph via inlining, and remove names if they determine that a control or data-flow edge is not necessary. Having an unlimited set of names facilitates the graph transformation work that is the essence of a compiler.

Machines, though, generally deal with addresses, not names; one of the jobs of the compiler back-end is to tabulate the various names in a compilation unit, assigning them to addresses, for example when laying out an ELF binary. Some uses may refer to names from outside the current compilation unit, as when you use a function from the C library. The linker intervenes at the back-end to splice in definitions for dangling uses and applies the final assignment of names to addresses.

When targeting Wasm, consider what kinds of graph transformations you would like to make. You would probably like for the compiler to emit calls to functions from a low-level run-time library written in wasm. Those functions are probably going to pull in some additional definitions, such as globals, types, exception tags, and so on. Then once you have your full graph, you might want to lower it, somehow: for example, you choose to use the stringref string representation, but browsers don’t currently support it; you run a post-pass to lower to UTF-8 arrays, but then all your strings are not constant, meaning they can’t be used as global initializers; so you run another post-pass to initialize globals in order from the start function. You might want to make other global optimizations as well, for example to turn references to named locals into unnamed stack operands (not yet working :).

Anyway what I am getting at is that you need a representation for Wasm in your compiler, and that representation needs to be fairly complete. At the very minimum, you need a facility to transform that in-memory representation to the standard WebAssembly text format, which allows you to use a third-party assembler and linker such as Binaryen’s wasm-opt. But since you have to have the in-memory representation for your own back-end purposes, probably you also implement the names-to-addresses mapping that will allow you to output binary WebAssembly also. Also it could be that Binaryen doesn’t support something you want to do; for example Hoot uses block parameters, which are supported fine in browsers but not in Binaryen.
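The minimum facility, serializing the in-memory representation to the text format, can be quite small. A Python sketch, assuming (hypothetically) that the representation is just nested lists of atoms; `to_wat` is an illustrative name, and a real writer would also handle string escapes, comments, and pretty-printing:

```python
def to_wat(form):
    """Serialize nested lists of atoms to WebAssembly text-format syntax."""
    if isinstance(form, list):
        return "(" + " ".join(to_wat(item) for item in form) + ")"
    return str(form)

module = ["module",
          ["func", "$answer", ["result", "i32"],
           ["i32.const", 42]],
          ["export", '"answer"', ["func", "$answer"]]]
text = to_wat(module)
```

The binary writer is where the real work is, since that is where names have to be resolved to indices.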

(I exaggerate a little; Binaryen is a more reasonable choice now than it was before the GC proposal was stabilised. But it has been useful to be able to control Hoot’s output, for example as the exception-handling proposal has evolved.)

one thing leads to another

Once you have a textual and binary writer, and an in-memory representation, perhaps you want to be able to read binaries as well; and perhaps you want to be able to read text. Reading the text format is a little annoying, but I had implemented it already in JavaScript a few years ago; and porting it to Scheme was a no-brainer, allowing me to easily author the run-time Wasm library as text.

And so now you have the beginnings of a full toolchain, built just out of necessity: reading, writing, in-memory construction and transformation. But how are you going to test the output? Are you going to require a browser? That’s gross. Node? Sure, we have to check against production Wasm engines, and that’s probably the easiest path to take; still, would be nice if this were optional. Wasmtime? But that doesn’t do GC.

No, of course not, you are a dirty little compilers developer, you are just going to implement a little wasm interpreter, aren’t you. Of course you are. That way you can build nice debugging tools to help you understand when things go wrong. Hoot’s interpreter doesn’t pretend to be high-performance—it is not—but it is simple and it just works. Massive kudos to Spritely hacker David Thompson for implementing this. I think implementing a Wasm VM also had the pleasant side effect that David is now a Wasm expert; implementation is the best way to learn.

Finally, one more benefit of having a Wasm toolchain as part of the compiler: %inline-wasm. In my example from last time, I had this snippet that makes a new bytevector:

(%inline-wasm
 '(func (param $len i32) (param $init i32)
    (result (ref eq))
    (struct.new
     $mutable-bytevector
     (i32.const 0)
     (array.new $raw-bytevector
                (local.get $init)
                (local.get $len))))
 len init)

%inline-wasm takes a literal as its first argument, which should parse as a Wasm function. Parsing guarantees that the Wasm is syntactically valid, and allows the arity of the Wasm to become apparent: we just read off the function’s type. Knowing the number of parameters and results is one thing, but we can do better, in that we also know their type, which we use for intentional types, requiring in this case that the parameters be exact integers which get wrapped to the signed i32 range. The resulting term is spliced into the CPS graph, can be analyzed for its side effects, and ultimately when written to the binary we replace each local reference in the Wasm with a reference to the appropriate local variable. All this is possible because we have the tools to work on Wasm itself.
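Reading the signature off the parsed function is simple once the text is an s-expression. Hoot does this in Scheme; the following is only a Python sketch, in which `parse_sexp` and `signature` are hypothetical helpers and the reader handles just the subset of the text format used by the snippet above.

```python
import re

def parse_sexp(text):
    """Parse one s-expression into nested lists of string atoms."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # skip the closing paren
        return items

    return read()

def signature(func):
    """Return (params, results) for a parsed (func ...) form."""
    assert func[0] == "func"
    params, results = [], []
    for field in func[1:]:
        if isinstance(field, list) and field[0] == "param":
            # Named form only: (param $name type)
            params.append((field[1], field[2]))
        elif isinstance(field, list) and field[0] == "result":
            results.extend(field[1:])
    return params, results

snippet = """
(func (param $len i32) (param $init i32)
  (result (ref eq))
  (struct.new $mutable-bytevector
    (i32.const 0)
    (array.new $raw-bytevector (local.get $init) (local.get $len))))
"""
params, results = signature(parse_sexp(snippet))
arity = (len(params), len(results))
```

With the parameter types in hand, the compiler knows to unbox both arguments to i32 before splicing in the body.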

fin

Hoot’s Wasm toolchain is about 10K lines of code, and is fairly complete. I think it pays off for Hoot. If you are building a compiler targeting Wasm, consider budgeting for a 10K SLOC Wasm toolchain; you won’t regret it.

Next time, an article on Hoot’s use of CPS. Until then, happy hacking!

growing a bootie

2024-05-22T08:16:58Z

Following on last week’s egregious discussion of the Hoot Scheme-to-WebAssembly compiler bootie, today I would like to examine another axis of boot, which is a kind of rebased branch of history: not the hack as it happened, but the logic inside the hack, the structure of the built thing, the history as it might have been. Instead of describing the layers of shims and props that we used while discovering what we were building, let’s look at how we would build Hoot again, if we had to.

I think many readers of this blog will have seen Growing a Language, a talk / performance art piece in which Guy L. Steele takes the set of monosyllabic words as primitives and builds up a tower of terms on top, bootstrapping a language as he goes. (I once mentioned to him that Guy L. was one of the back-justifications for the name Guile; he did not take it well.) I just watched it again and I think it holds up, probably well enough to forgive the superfluous presence of the gender binary in the intro; ideas were different in the 1900s.

It is in the sense of that talk that I would like to look at growing a Hoot: how Hoot defines nouns and verbs in terms of smaller, more primitive terms: terms in terms of terms.

If you are reading this on the web, you should see above a graph of dependencies among the 50 or so libraries that are shipped as part of Hoot. (Somehow I doubt that a feed reader will plumb through the inline SVG, but who knows.) It’s a bit of a mess, but still I think it’s a useful illustration of a number of properties of how the Hoot language is grown from small to large. Click on any box to visit the source code for that module.

the root of the boot

Firstly, let us note that the graph is not a forest: it is a single tree. There is no module that does not depend (possibly indirectly) on (hoot primitives). This is because there are no capabilities that Hoot libraries can access without importing them, and the only way into the Hootosphere from outside is via the definitions in the primitives module.

So what are these definitions, you might ask? Well, these are the “well-known” bindings, for example + for which the compiler might have some special understanding, the sort of binding that gets translated to a primitive operation at the compiler IR level. They are used in careful ways by the modules that use (hoot primitives) to ensure that their uses are all open-coded by the compiler. (“Open coding” is inlining. But inlining to me implies that the whole implementation is inlined, with no slow-path callouts, whereas open coding implies to me that it’s the compiler that knows what the op does and may or may not inline the actual asm.)

But, (hoot primitives) also exposes some other definitions, for example define and let and lambda and all that. Scheme doesn’t have keywords in the sense that Python has def and with and such: there is no privileged way to associate a name with its meaning. It is in this sense that it is impossible to avoid (hoot primitives): the most simple (define x 42) depends on the lexical meaning of define, which is provided by the primitives module.

Syntax definitions are an expander construct; they are not present at run-time. Using a syntax definition causes the expander to invoke code, and the expander runs on the host system, which is Guile and not WebAssembly. So, syntax definitions belong to the host. This goes also for some first-order definitions such as syntax->datum and so on, which are only used in syntax expanders; these definitions are plumbed through (hoot primitives), but can only ever be used by macro definitions, which run on the meta-level.

(Is this too heavy? Allow me to lighten the mood: when I was 22 or so and working in Namibia, I somehow got an advance copy of Notes from the Metalevel. I was working on algorithmic music synthesis, and my chief strategy was knocking hubris together with itself, as one does. I sent the author a bunch of uninvited corrections to his book. I think it was completely unwelcome! Anyway, moral of the story, at 22 you get a free pass to do whatever you want, and come to think of it, now that I am 44 I think I should get some kind of hubris loyalty award or something.)

powerful primitives

So, there are expand-time primitives and run-time primitives. The expander knows about expand-time primitives and the compiler knows about run-time primitives. One particularly powerful primitive is %inline-wasm, which takes an inline snippet of WebAssembly as an s-expression and applies it to a number of arguments passed at run-time. Consider make-bytevector:

(define* (make-bytevector len #:optional (init 0))
  (%inline-wasm
   '(func (param $len i32) (param $init i32)
      (result (ref eq))
      (struct.new
       $mutable-bytevector
       (i32.const 0)
       (array.new $raw-bytevector
                  (local.get $init)
                  (local.get $len))))
   len init))

We have an inline snippet of wasm that makes a $mutable-bytevector. It passes 0 as the hash field, meaning that the hashq of this value will be lazily initialized, and the contents are a new array of a given size and initial value. Inputs will be unboxed to the appropriate type (two i32s in this case), and likewise with outputs; here we produce the universal (ref eq) representation.

The nice thing about %inline-wasm is that the compiler didn’t have to be taught about make-bytevector: this definition suffices, because %inline-wasm can access a number of lower-level capabilities.

dual denotations

But as we learned in my notes on whole-program compilation, any run-time definition is available at compile-time, if it is reachable from a syntax transformer. So this definition above isn’t quite sufficient; we can’t call make-bytevector as part of a procedural macro, which we might want to do. What we need instead is to provide one definition when residualizing wasm at run-time, and another when loading a module at expand-time.

In Hoot we do this with cond-expand, where we expand to %inline-wasm when targeting Hoot, and... what, precisely, at expand-time? Really we need to make a Guile bytevector, so in this sort of case, we end up having to include a run-time make-bytevector definition in the (hoot primitives) module. This happens wherever we end up using %inline-wasm.

building to guile

Returning to our graph, we see that there is a red-colored block for Hoot modules, a teal-colored layer on top for those modules that are defined by R7RS, a few oddballs, and then (guile) and Fibers built on top. The (guile) module provides a shim that implements Guile’s own default set of bindings, allowing Guile modules to be loaded on a Hoot system. (guile) is layered on top of the low-level Hoot libraries, and out of convenience, on top of the various R7RS libraries as well, because it was easier to remember what was where in R7RS than in our ad-hoc nest of Hoot internal libraries.

Having (guile) lets Guile hackers build on Hoot. It’s still incomplete but I think eventually it will be capital-G Good. Even for a library that needed more porting like Fibers (Hoot has no threads so much of the parallel concurrent ML implementation can be simplified, and we use an event loop from the Wasm run-time instead of an epoll-based scheduler), it was still pleasant to be able to use define-module and keyword arguments and all of that.

next layers

I mentioned that this tower of terms is incomplete, and so that is one of the next work items for Hoot: complete support for Guile’s run-time library. At that point we’d probably want to merge it into Guile, but that is another topic.

But let’s leave that for another day; until then, happy hacking!

on hoot, on boot

2024-05-16T20:01:41Z

I realized recently that I haven’t been writing much about the Hoot Scheme-to-WebAssembly compiler. Upon reflection, I have been too conscious of its limitations to give it verbal tribute, preferring to spend each marginal hour fixing bugs and filling in features rather than publicising progress.

In the last month or so, though, Hoot has gotten to a point that pleases me. Not to the point where I would say “accept no substitutes” by any means, but good already for some things, and worth writing about.

So let’s start today by talking about bootie. Boot, I mean! The boot, the boot, the boot of Hoot.

hoot boot: temporal tunnel

The first axis of boot is time. In the beginning, there was nary a toot, and now, through boot, there is Hoot.

The first boot of Hoot was on paper. Christine Lemmer-Webber had asked me, ages ago, what I thought Guile should do about the web. After thinking a bit, I concluded that it would be best to avoid compromises when building an in-browser Guile: if you have to pollute Guile to match what JavaScript offers, you might as well program in JavaScript. JS is cute of course, but Guile is a bit different in some interesting ways, the most important of which is control: delimited continuations, multiple values, tail calls, dynamic binding, threads, and all that. If Guile’s web bootie doesn’t pack all the funk in its trunk, probably it’s just junk.

So I wrote up a plan, something to which I attributed the name tailification. In retrospect, this is simply a specific flavor of a continuation-passing-style (CPS) transmutation, late in the compiler pipeline. I’ll elocute more in a future dispatch. I did end up writing the tailification pass back then; I could have continued to target JS, but it was sufficiently annoying and I didn’t prosecute. It sat around unused for a few years, until Christine’s irresistible charisma managed to conjure some resources for Hoot.

In the meantime, the GC extension for WebAssembly shipped (woot woot!), and to boot Hoot, I filled in the missing piece: a backend for Guile’s compiler that tailified and then translated primitive operations to snippets of WebAssembly.

It was, well, hirsute, but cute and it did compute, so we continued to boot. From this root we grew a small run-time library, written in raw WebAssembly, used for slow-paths for the various primitive operations that are part of Guile’s compiler back-end. We filled out Guile primcalls, in minute commits, growing the WebAssembly runtime library and toolchain as we went.

Eventually we started constituting facilities defined in terms of those primitives, via a Scheme prelude that was prepended to all programs, within a nested lexical environment. It was never our intention though to drown the user’s programs in a sea of predefined bindings, as if the ultimate program were but a vestigial inhabitant of the lexical lake—don’t dilute the newt!, we would often say [ed: we did not]— so eventually when the prelude became unmanageable, we finally figured out how to do whole-program compilation of a set of modules.

Then followed a long month in which I would uproot the loot from the boot: take each binding from the prelude and reattribute it into an appropriate module. User code could import all the modules that suit, as long as they were known to Hoot, but no others; it was not until we added the ability for users to programmatically constitute an environment from their modules that Hoot became a language implementation of any repute.

Which brings us to the work of the last month, about which I cannot be mute. When you have existing Guile code that you want to distribute via the web, Hoot required you transmute its module definitions into the more precise R6RS syntax. Precise, meaning that R6RS modules are static, in a way that Guile modules, at least in absolute terms, are not: Guile programs can use first-class accessors on the module systems to pull out bindings. This is yet another example of what I impute as the original sin of 1990s language development, that modules are just mutable hash maps. You see it in Python, for example: because you don’t know for sure to what values global names are bound, it is easy for any discussion of what a particular piece of code means to end in dispute.
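To see the Python half of that claim concretely, here is a small sketch. The module name `mathlib` and its function are made up for the example; the point is only that a module's namespace is an ordinary mutable dictionary, so the meaning of a call can change after the caller was defined.

```python
import types

# Build a throwaway module instead of importing a real one.
mathlib = types.ModuleType("mathlib")
mathlib.inc = lambda n: n + 1

def program():
    # The name `inc` is looked up in the module's dict at every call.
    return mathlib.inc(42)

before = program()

# Any code holding a reference to the module can rebind its names.
vars(mathlib)["inc"] = lambda n: n - 1

after = program()
```

The same source text for `program` yields 43 the first time and 41 the second, which is why a compiler cannot assume it knows what `mathlib.inc` denotes.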

The question is, though, are the semantics of name binding in a language fixed and absolute? Once your language is booted, are its aspects definitively attributed? I think some perfection, in the sense of becoming more perfect or more like the thing you should be, is something to salute. Anyway, in Guile it would be coherent with Scheme’s lexical binding heritage to restitute some certainty as to the meanings of names, at least in a default compilation mode. Lexical binding is, after all, the foundation of the Macro Writer’s Statute of Rights. Of course if you are making a build for development purposes, not to distribute, then you might prefer a build that marks all bindings as dynamic. Otherwise I think it’s reasonable to require the user to explicitly indicate which definitions are denotations, and which constitute locations.

Hoot therefore now includes an implementation of the static semantics of Guile’s define-module: it can load Guile modules directly, and as a tribute, it also has an implementation of the ambient (guile) module that constitutes the lexical soup of modules that aren’t #:pure. (I agree, it would be better if all modules were explicit about the language they are written in, meaning their imported bindings and so on, but there is an existing corpus to accommodate; the point is moot.)

The astute reader (whom I salute!) will note that we have a full boot: Hoot is a Guile. Not an implementation to substitute the original, but more of an alternate route to the same destination. So, probably we should scoot the two implementations together, to knock their boots, so to speak, merging the offshoot Hoot into Guile itself.

But I do circumlocute: I can only plead a case of acute Hoot. Tomorrow, we elocute on a second axis of boot. Until then, happy compute!

micro macro story time

2024-01-11T14:10:38Z

Today, a tiny tale: about 15 years ago I was working on Guile’s macro expander. Guile inherited this code from an early version of Kent Dybvig’s portable syntax expander. It was... not easy to work with.

Some difficulties were essential. Scope is tricky, after all.

Some difficulties were incidental, but deep. The expander is ultimately a function that translates Scheme-with-macros to Scheme-without-macros. However, it is itself written in Scheme-with-macros, so to load it on a substrate without macros requires a pre-expanded copy of itself, whose data representations need to be compatible with any incremental change, so that you will be able to use the new expander to produce a fresh pre-expansion. This difficulty could have been avoided by incrementally bootstrapping the library. It works once you are used to it, but it’s gnarly.

But then, some difficulties were just superfluously egregious. Dybvig is a totemic developer and researcher, but a generation or two removed from me, and when I was younger, it never occurred to me to just email him to ask why things were this way. (A tip to the reader: if someone is doing work you are interested in, you can just email them. Probably they write you back! If they don’t respond, it’s not you, they’re probably just busy and their inbox leaks.) Anyway in my totally speculatory reconstruction of events, when Dybvig goes to submit his algorithm for publication, he gets annoyed that “expand” doesn’t sound fancy enough. In a way it’s similar to the original SSA developers thinking that “phony functions” wouldn’t get published.

So Dybvig calls the expansion function “χ”, because the Greek chi looks like the X in “expand”. Fine for the paper, whatever paper that might be, but then in psyntax, there are all these functions named chi and chi-lambda and all sorts of nonsense.

In early years I was often confused by these names; I wasn’t in on the pun, and I didn’t feel like I had enough responsibility for this code to think what the name should be. I finally broke down and changed all instances of “chi” to “expand” back in 2011, and never looked back.

Anyway, this is a story with a very specific moral: don’t name your functions chi.

scheme modules vs whole-program compilation: fight

2024-01-05T20:43:49Z

In a recent dispatch, I explained the whole-program compilation strategy used in Whiffle and Hoot. Today’s note explores what a correct solution might look like.

being explicit

Consider a module that exports an increment-this-integer procedure. We’ll use syntax from the R6RS standard:

(library (inc)
  (export inc)
  (import (rnrs))
  (define (inc n) (+ n 1)))

If we then have a program:

(import (rnrs) (inc))
(inc 42)

Then the meaning of this program is clear: it reduces to (+ 42 1), then to 43. Fine enough. But how do we get there? How does the compiler compose the program with the modules that it uses (transitively), to produce a single output?

In Whiffle (and Hoot), the answer is, sloppily. There is a standard prelude that initially has a number of bindings from the host compiler, Guile. One of these is +, exposed under the name %+, where the % in this case is just a warning to the reader that this is a weird primitive binding. Using this primitive, the prelude defines a wrapper:

...
(define (+ x y) (%+ x y))
...

At compilation-time, Guile’s compiler recognizes %+ as special, and therefore compiles the body of + as consisting of a primitive call (primcall), in this case to the addition primitive. The Whiffle (and Hoot, and native Guile) back-ends then avoid referencing an imported binding when compiling %+, and instead produce backend-specific code: %+ disappears. Most uses of the + wrapper get inlined so %+ ends up generating code all over the program.

The prelude is lexically splatted into the compilation unit via a pre-expansion phase, so you end up with something like:

(let () ; establish lexical binding contour
  ...
  (define (+ x y) (%+ x y))
  ...
  (let () ; new nested contour
    (define (inc n) (+ n 1))
    (inc 42)))

This program will probably optimize (via partial evaluation) to just 43. (What about let and define? Well. Perhaps we’ll get to that.)
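As a rough illustration of how the optimizer gets from (inc 42) to 43, the following Python sketch inlines known procedures and folds the %+ primitive on constants. It is a toy, not Guile's partial evaluator: expressions are nested tuples, `DEFS` and `peval` are hypothetical names, and it ignores side effects, recursion, and size limits.

```python
# Procedure definitions from the prelude and the user program:
# name -> (parameter names, body expression).
DEFS = {
    "+": (("x", "y"), ("%+", "x", "y")),
    "inc": (("n",), ("+", "n", 1)),
}

def peval(expr, env=None):
    """Reduce expr as far as possible given the bindings in env."""
    env = env or {}
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env.get(expr, expr)  # substitute a known binding, if any
    op, *args = expr
    args = [peval(arg, env) for arg in args]
    if op == "%+":
        if all(isinstance(arg, int) for arg in args):
            return args[0] + args[1]  # fold the primcall at compile time
        return (op, *args)  # residualize the primcall
    if op in DEFS:
        params, body = DEFS[op]
        # Inline: reduce the body with parameters bound to the arguments.
        return peval(body, dict(zip(params, args)))
    return (op, *args)

folded = peval(("inc", 42))
residual = peval(("inc", "z"))
```

When the argument is not a constant, the wrappers still disappear and only the primitive call is left, which matches the observation that %+ ends up generating code all over the program.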

But, again here I have taken a short-cut, which is about modules. Hoot and Whiffle don’t really do modules, yet anyway. I keep telling Spritely colleagues that it’s complicated, and rightfully they keep asking why, so this article gets into it.

is it really a big `letrec`?

Firstly you have to ask, what is the compilation unit anyway? I mean, given a set of modules A, B, C and so on, you could choose to compile them separately, relying on the dynamic linker to compose them at run-time, or all together, letting the compiler gnaw on them all at once. Or, just A and B, and so on. One good-enough answer to this problem is library-group form, which explicitly defines a set of topologically-sorted modules that should be compiled together. In our case, to treat the (inc) module together with our example program as one compilation unit, we would have:

(library-group
  ;; start with sequence of libraries
  ;; to include in compilation unit...
  (library (inc) ...)

  ;; then the tail is the program that
  ;; might use the libraries
  (import (rnrs) (inc))
  (inc 42))

In this example, the (rnrs) base library is not part of the compilation unit. Presumably it will be linked in, either as a build step or dynamically at run-time. For Hoot we would want the whole prelude to be included, because we don’t want any run-time dependencies. Anyway hopefully this would expand out to something like the set of nested define forms inside nested let lexical contours.

And that was my instinct: somehow we are going to smash all these modules together into a big nested letrec, and the compiler will go to town. And this would work, for a “normal” programming language.

But with Scheme, there is a problem: macros. Scheme is a “programmable programming language” that allows users to extend its syntax as well as its semantics. R6RS defines a procedural syntax transformer (“macro”) facility, in which the user can define functions that run on code at compile-time (specifically, during syntax expansion). Scheme macros manage to compose lexical scope from the macro definition with the scope at the macro instantiation site, by annotating these expressions with source location and scope information, and making syntax transformers mostly preserve those annotations.

“Macros are great!”, you say: well yes, of course. But they are a problem too. Consider this incomplete library:

(library (ctinc)
  (export ctinc)
  (import (rnrs) (inc))
  (define-syntax ctinc
    (lambda (stx)
      ...))) ; ***

The idea is to define a version of inc, but at compile-time: a (ctinc 42) form should expand directly to 43, not a call to inc (or even +, or %+). We define syntax transformers with define-syntax instead of define. The right-hand-side of the definition ((lambda (stx) ...)) should be a procedure of one argument, which returns one value: so far so good. Or is it? How do we actually evaluate what (lambda (stx) ...) means? What should we fill in for ...? When evaluating the transformer value, what definitions are in scope? What does lambda even mean in this context?

Well... here we butt up against the phasing wars of the mid-2000s. R6RS defines a whole system to explicitly declare what bindings are available when, then carves out a huge exception to allow for so-called implicit phasing, in which the compiler figures it out on its own. In this example we imported (rnrs) for the default phase, and this is the module that defines lambda (and indeed define and define-syntax). The standard defines that (rnrs) makes its bindings available both at run-time and expansion-time (compilation-time), so lambda means what we expect that it does. Whew! Let’s just assume implicit phasing, going forward.

The operand to the syntax transformer is a syntax object: an expression annotated with source and scope information. To pick it apart, R6RS defines a pattern-matching helper, syntax-case. In our case ctinc is unary, so we can begin to flesh out the syntax transformer:

(library (ctinc)
  (export ctinc)
  (import (rnrs) (inc))
  (define-syntax ctinc
    (lambda (stx)
      (syntax-case stx ()
        ((ctinc n)
         (inc n)))))) ; ***

But here there’s a detail, which is that when syntax-case destructures stx to its parts, those parts themselves are syntax objects which carry the scope and source location annotations. To strip those annotations, we call the syntax->datum procedure, exported by (rnrs).

(library (ctinc)
  (export ctinc)
  (import (rnrs) (inc))
  (define-syntax ctinc
    (lambda (stx)
      (syntax-case stx ()
        ((ctinc n)
         (inc (syntax->datum #'n)))))))

And with this, voilà our program:

(library-group
  (library (inc) ...)
  (library (ctinc) ...)
  (import (rnrs) (ctinc))
  (ctinc 42))

This program should pre-expand to something like:

(let ()
  (define (inc n) (+ n 1))
  (let ()
    (define-syntax ctinc
      (lambda (stx)
        (syntax-case stx ()
          ((ctinc n)
           (inc (syntax->datum #'n))))))
    (ctinc 42)))

And then expansion should transform (ctinc 42) to 43. However, our naïve pre-expansion is not good enough for this to be possible. If you ran this in Guile you would get an error:

Syntax error:
unknown file:8:12: reference to identifier outside its scope in form inc

Which is to say, inc is not available as a value within the definition of ctinc. ctinc could residualize an expression that refers to inc, but it can’t use it to produce the output.

modules are not expressible with local lexical binding

This brings us to the heart of the issue: with procedural macros, modules impose a phasing discipline on the expansion process. Definitions from any given module must be available both at expand-time and at run-time. In our example, ctinc needs inc at expand-time, which is an early part of the compiler that is unrelated to any later partial evaluation by the optimizer. We can’t make inc available at expand-time just using let / letrec bindings.

This is an annoying result! What do other languages do? Well, mostly they aren’t programmable, in the sense that they don’t have macros. There are some ways to get programmability using e.g. eval in JavaScript, but these systems are not very amenable to “offline” analysis of the kind needed by an ahead-of-time compiler.

For those declarative languages with macros, Scheme included, I understand the state of the art is to expand module-by-module and then stitch together the results of expansion later, using a kind of link-time optimization. You visit a module’s definitions twice: once to evaluate them while expanding, resulting in live definitions that can be used by further syntax expanders, and once to residualize an abstract syntax tree, which will eventually be spliced into the compilation unit.
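The two visits can be sketched in Python. This is a toy model under heavy assumptions, not how Guile or Hoot expand modules: definitions are (name, params, body) tuples over a tiny expression language, macros are Python callables, and `evaluate`, `expand`, and `expand_module` are hypothetical names.

```python
def evaluate(expr, env):
    """Expand-time interpreter for the toy language."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    fn, *args = expr
    return env[fn](*(evaluate(arg, env) for arg in args))

def expand(expr, macros, live):
    """Rewrite macro uses; macros may call live expand-time definitions."""
    if isinstance(expr, tuple):
        head, *args = expr
        if head in macros:
            return expand(macros[head](expr, live), macros, live)
        return (head, *(expand(arg, macros, live) for arg in args))
    return expr

def expand_module(defs, macros, live, residual):
    for name, params, body in defs:
        body = expand(body, macros, live)
        # Visit 1: a live procedure that later syntax transformers can call.
        live[name] = lambda *args, p=params, b=body: evaluate(
            b, {**live, **dict(zip(p, args))})
        # Visit 2: the residual definition, to be spliced into the program.
        residual.append(("define", name, params, body))

live = {"+": lambda x, y: x + y}  # host-provided primitive
residual = []
expand_module([("inc", ("n",), ("+", "n", 1))], {}, live, residual)

# ctinc runs at expand time and uses the live `inc` to compute its output.
macros = {"ctinc": lambda form, live: live["inc"](form[1])}
program = expand(("ctinc", 42), macros, live)
```

The residual list still contains the definition of inc, even though the program itself expanded to a constant; deciding whether to keep it is a job for a later pass.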

Note that in general the expansion-time and the residual definitions don’t need to be the same, and indeed during cross-compilation they are often different. If you are compiling with Guile as host and Hoot as target, you might implement cons one way in Guile and another way in Hoot, choosing between them with cond-expand.

lexical scope regained?

What is to be done? Glad you asked, Vladimir. But, I don’t really know. The compiler wants a big blob of letrec, but the expander wants a pearl-string of modules. Perhaps we try to satisfy them both? The library-group paper suggests that modules should be expanded one by one, then stitched into a letrec by AST transformations. It’s not that lexical scope is incompatible with modules and whole-program compilation; the problems arise when you add in macros. So by expanding first, in units of modules, we reduce high-level Scheme to a lower-level language without syntax transformers, but still on the level of letrec.

I was unreasonably pleased by the effectiveness of the “just splat in a prelude” approach, and I will miss it. I even pled for a kind of stop-gap fat-fingered solution to sloppily parse module forms and keep on splatting things together, but colleagues helpfully talked me away from the edge. So good-bye, sloppy: I repent my ways and will make amends, with 40 hail-maries and an alpha renaming thrice daily and more often if in moral distress. Further bulletins as events warrant. Until then, happy scheming!

tree-shaking, the horticulturally misguided algorithm

2023-11-24T11:41:37Z

Let’s talk about tree-shaking!

looking up from the trough

But first, I need to talk about WebAssembly’s dirty secret: despite the hype, WebAssembly has had limited success on the web.

There is Photoshop, which does appear to be a real success. 5 years ago there was Figma, though they don’t talk much about Wasm these days. There are quite a number of little NPM libraries that use Wasm under the hood, usually compiled from C++ or Rust. I think Blazor probably gets used for a few in-house corporate apps, though I could be fooled by their marketing.

You might recall the hyped demos of 3D first-person-shooter games with Unreal engine again from 5 years ago, but that was the previous major release of Unreal and was always experimental; the current Unreal 5 does not support targeting WebAssembly.

Don’t get me wrong, I think WebAssembly is great. It is having fine success in off-the-web environments, and I think it is going to be a key and growing part of the Web platform. I suspect, though, that we are only just now getting past the trough of disillusionment.

It’s worth reflecting a bit on the nature of web Wasm’s successes and failures. Taking Photoshop as an example, I think we can say that Wasm does very well at bringing large C++ programs to the web. I know that it took quite some work, but I understand the end result to be essentially the same source code, just compiled for a different target.

Similarly for the JavaScript module case, Wasm finds success in getting legacy C++ code to the web, and as a way to write new web-targeting Rust code. These are often tasks that JavaScript doesn’t do very well at, or which need a shared implementation between client and server deployments.

On the other hand, WebAssembly has not been a Web success for DOM-heavy apps. Nobody is talking about rewriting the front-end of wordpress.com in Wasm, for example. Why is that? It may sound like a silly question to you: Wasm just isn’t good at that stuff. But why? If you dig down a bit, I think it’s that the programming models are just too different: the Web’s primary programming model is JavaScript, a language with dynamic typing and managed memory, whereas WebAssembly 1.0 was about static typing and linear memory. Getting to the DOM from Wasm was a hassle that was overcome only by the most ardent of the true Wasm faithful.

Relatedly, Wasm has also not really been a success for languages that aren’t, like, C or Rust. I am guessing that wordpress.com isn’t written mostly in C++. One of the sticking points for this class of language is that C#, for example, will want to ship with a garbage collector, and that it is annoying to have to do this. Check my article from March this year for more details.

Happily, this restriction is going away, as all browsers are going to ship support for reference types and garbage collection within the next months; Chrome and Firefox already ship Wasm GC, and Safari shouldn’t be far behind thanks to the efforts from my colleague Asumu Takikawa. This is an extraordinarily exciting development that I think will kick off a whole ‘nother Gartner hype cycle, as more languages start to update their toolchains to support WebAssembly.

if you don’t like my peaches

Which brings us to the meat of today’s note: web Wasm will win where compilers create compact code. If your language’s compiler toolchain can manage to produce useful Wasm in a file that is less than a handful of over-the-wire kilobytes, you can win. If your compiler can’t do that yet, you will have to instead rely on hype and captured audiences for adoption, which at best results in an unstable equilibrium until you figure out what’s next.

In the JavaScript world, managing bloat and deliverable size is a huge industry. Bundlers like esbuild are a ubiquitous part of the toolchain, compiling down a set of JS modules to a single file that should include only those functions and data types that are used in a program, and additionally applying domain-specific size-squishing strategies such as minification (making monikers more minuscule).

Let’s focus on tree-shaking. The visual metaphor is that you write a bunch of code, and you only need some of it for any given page. So you imagine a tree whose, um, branches are the modules that you use, and whose leaves are the individual definitions in the modules, and you then violently shake the tree, probably killing it and also annoying any nesting birds. The only thing that’s left still attached is what is actually needed.

This isn’t how trees work: holding the trunk doesn’t give you information as to which branches are somehow necessary for the tree’s mission. It also primes your mind to look for the wrong fixed point, removing unneeded code instead of keeping only the necessary code.

But, tree-shaking is an evocative name, and so despite its horticultural and algorithmic inaccuracies, we will stick to it.

The thing is that maximal tree-shaking for languages with a thicker run-time has not been a huge priority. Consider Go: according to the golang wiki, the most trivial program compiled to WebAssembly from Go is 2 megabytes, and adding imports can make this go to 10 megabytes or more. Or look at Pyodide, the Python WebAssembly port: the REPL example downloads about 20 megabytes of data. These are fine sizes for technology demos or, in the limit, very rich applications, but they aren’t winners for web development.

shake a different tree

To be fair, the built-in Wasm support for Go and the Pyodide port of Python both derive from the upstream toolchains, where producing small binaries is nice but not necessary: on a server, who cares how big the app is? And indeed when targetting smaller devices, we tend to see alternate implementations of the toolchain, for example MicroPython or TinyGo. TinyGo has a Wasm back-end that can apparently go down to less than a kilobyte, even!

These alternate toolchains often come with some restrictions or peculiarities, and although we can consider this to be an evil of sorts, it is to be expected that the target platform exhibits some co-design feedback on the language. In particular, running in the sea of the DOM is sufficiently weird that a Wasm-targetting Python program will necessarily be different than a “native” Python program. Still, I think as toolchain authors we aim to provide the same language, albeit possibly with a different implementation of the standard library. I am sure that the ClojureScript developers would prefer to remove their page documenting the differences with Clojure if they could, and perhaps if Wasm becomes a viable target for ClojureScript, they will.

on the algorithm

To recap: now that it supports GC, Wasm could be a winner for web development in Python and other languages. You would need a different toolchain and an effective tree-shaking algorithm, so that user experience does not degrade. So let’s talk about tree shaking!

I work on the Hoot Scheme compiler, which targets Wasm with GC. We manage to get down to 70 kB or so right now, in the minimal “main” compilation unit, and are aiming for lower; auxiliary compilation units that import run-time facilities (the current exception handler and so on) from the main module can be sub-kilobyte. Getting here has been tricky though, and I think it would be even trickier for Python.

Some background: like Whiffle, the Hoot compiler prepends a prelude onto user code. Tree-shaking happens in a number of places:

partial evaluation will evaluate unused bindings for effect, possibly eliding them
fixing letrec will do the same
CPS frequently traverses the program, following only referenced function, value, and control edges, e.g. via renumbering
There is an explicit dead-code elimination pass which tries to elide unused effect-free allocations, a situation that can arise due to other optimizations
Finally there is a standard library written in raw-ish WebAssembly, whose definitions (globals, tables, imports, functions, etc) are included in the residual binary only as needed.

Generally speaking, procedure definitions (functions / closures) are the easy part: you just include only those functions that are referenced by the code. In a language like Scheme, this gets you a long way.
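As a rough illustration of “include only those functions that are referenced”, here is a minimal sketch in C of a worklist that starts from nothing and adds only what the root needs. This is not Hoot’s code: the `struct def` table, the `mark_live` function and the example definitions are all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical program representation: each top-level definition
   lists the indices of the definitions it references. */
enum { MAX_DEFS = 8, MAX_REFS = 4 };

struct def {
  const char *name;
  int nrefs;
  int refs[MAX_REFS];
};

/* Made-up example: main only calls write-string, so display and the
   printers it drags in should not survive. */
static const struct def example_defs[] = {
  { "main",            1, { 1 } },
  { "write-string",    1, { 2 } },
  { "ports",           0, { 0 } },
  { "display",         3, { 1, 4, 5 } },
  { "print-bitvector", 0, { 0 } },
  { "print-pair",      0, { 0 } },
};

/* Start from the empty set and add only what is reachable from root,
   instead of starting from everything and removing the unneeded.
   Fills live[] and returns the number of live definitions. */
static int mark_live(const struct def *defs, int ndefs, int root, bool *live) {
  int worklist[MAX_DEFS];
  int top = 0, count = 0;
  assert(ndefs <= MAX_DEFS);
  for (int i = 0; i < ndefs; i++) live[i] = false;
  live[root] = true;
  worklist[top++] = root;
  while (top > 0) {
    const struct def *d = &defs[worklist[--top]];
    count++;
    for (int i = 0; i < d->nrefs; i++) {
      int r = d->refs[i];
      /* Each definition is pushed at most once, so the worklist
         never needs more than ndefs slots. */
      if (!live[r]) { live[r] = true; worklist[top++] = r; }
    }
  }
  return count;
}
```

Calling `mark_live(example_defs, 6, 0, live)` keeps main, write-string and ports, and leaves display and its printers behind.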

However there are three immediate challenges. One is that the evaluation model for the definitions in the prelude is letrec*: the scope is recursive but ordered. Binding values can call or refer to previously defined values, or capture values defined later. If evaluating the value of a binding requires referring to a value only defined later, then that’s an error. Again, for procedures this is trivially OK, but as soon as you have non-procedure definitions, sometimes the compiler won’t be able to prove this nice “only refers to earlier bindings” property. In that case the fixing letrec (reloaded) algorithm will end up residualizing bindings that are set!, which of all the tree-shaking passes above require the delicate DCE pass to remove them.

Worse, some of those non-procedure definitions are record types, which have vtables that define how to print a record, how to check if a value is an instance of this record, and so on. These vtable callbacks can end up keeping a lot more code alive even if they are never used. We’ll get back to this later.

Similarly, say you print a string via display. Well now not only are you bringing in the whole buffered I/O facility, but you are also calling a highly polymorphic function: display can print anything. There’s a case for bitvectors, so you pull in code for bitvectors. There’s a case for pairs, so you pull in that code too. And so on.

One solution is to instead call write-string, which only writes strings and not general data. You’ll still get the generic buffered I/O facility (ports), though, even if your program only uses one kind of port.

This brings me to my next point, which is that optimal tree-shaking is a flow analysis problem. Consider display: if we know that a program will never have bitvectors, then any code in display that works on bitvectors is dead and we can fold the branches that guard it. But to know this, we have to know what kind of arguments display is called with, and for that we need higher-level flow analysis.
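To make the display example a bit more concrete, here is a toy sketch in C, assuming a flow analysis has already computed, per call site, the set of types that may reach display. The bit-per-type encoding, the case table and the helper names are all hypothetical; a real analysis is considerably more involved than or-ing some bits together.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical abstract domain: one bit per kind of value. */
enum {
  T_STRING    = 1 << 0,
  T_FIXNUM    = 1 << 1,
  T_PAIR      = 1 << 2,
  T_BITVECTOR = 1 << 3
};

/* Each case in display is guarded by a type test and, if it stays
   alive, keeps some other code alive too (names are made up). */
struct display_case {
  unsigned guard;
  const char *pulls_in;
};

static const struct display_case display_cases[] = {
  { T_STRING,    "write-string" },
  { T_FIXNUM,    "number->string" },
  { T_PAIR,      "print-pair" },
  { T_BITVECTOR, "print-bitvector" },
};
enum { N_DISPLAY_CASES = sizeof display_cases / sizeof display_cases[0] };

/* Join the argument types seen over all call sites of display. */
static unsigned join_types(const unsigned *site_types, int nsites) {
  unsigned all = 0;
  for (int i = 0; i < nsites; i++) all |= site_types[i];
  return all;
}

/* A case is live only if some value of its type can reach display;
   otherwise the branch that guards it can be folded away.
   Fills live[] and returns the number of live cases. */
static int fold_dead_cases(unsigned arg_types, bool *live) {
  int count = 0;
  for (int i = 0; i < N_DISPLAY_CASES; i++) {
    live[i] = (display_cases[i].guard & arg_types) != 0;
    if (live[i]) count++;
  }
  return count;
}
```

If the only call sites pass strings and fixnums, the pair and bitvector cases fold away, and with them whatever they would have pulled in.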

The problem is exacerbated for Python in a few ways. One, because object-oriented dispatch is higher-order programming. How do you know what foo.bar actually means? Depends on foo, which means you have to thread around representations of what foo might be everywhere and to everywhere’s caller and everywhere’s caller’s caller and so on.

Secondly, lookup in Python is generally more dynamic than in Scheme: you have __getattr__ methods (is that it?; been a while since I’ve done Python) everywhere and users might indeed use them. Maybe this is not so bad in practice and flow analysis can exclude this kind of dynamic lookup.

Finally, and perhaps relatedly, the object of tree-shaking in Python is a mess of modules, rather than a big term with lexical bindings. This is like JavaScript, but without the established ecosystem of tree-shaking bundlers; Python has its work cut out for some years to go.

in short

With GC, Wasm makes it thinkable to do DOM programming in languages other than JavaScript. It will only be feasible for mass use, though, if the resulting Wasm modules are small, and that means significant investment on each language’s toolchain. Often this will take the form of alternate toolchains that incorporate experimental tree-shaking algorithms, and whose alternate standard libraries facilitate the tree-shaker.

Welp, I’m off to lunch. Happy wassembling, comrades!

a whiff of whiffle

2023-11-16T21:11:43Z

A couple nights ago I wrote about a superfluous Scheme implementation and promised to move on from sheepishly justifying my egregious behavior in my next note, and finally mention some results from this experiment. Well, no: I am back on my bullshit. Tonight I write about a couple of implementation details that discerning readers may find of interest: value representation, the tail call issue, and the standard library.

what is a value?

As a Lisp, Scheme is one of the early “dynamically typed” languages. These days when you say “type”, people immediately think propositions as types, mechanized proof of program properties, and so on. But “type” has another denotation which is all about values and almost not at all about terms: one might say that vector-ref has a type, but it’s not part of a proof; it’s just that if you try to vector-ref a pair instead of a vector, you get a run-time error. You can imagine values as being associated with type tags: annotations that can be inspected at run-time for, for example, the sort of error that vector-ref will throw if you call it on a pair.

Scheme systems usually have a finite set of type tags: there are fixnums, booleans, strings, pairs, symbols, and such, and they all have their own tag. Even a Scheme system that provides facilities for defining new disjoint types (define-record-type et al) will implement these via a secondary type tag layer: for example that all record instances have the same primary tag, and that you have to retrieve their record type descriptor to discriminate instances of different record types.

Anyway. In Whiffle there are immediate types and heap types. All values have a low-bit tag which is zero for heap objects and nonzero for immediates. For heap objects, the first word of the heap object has tagging in the low byte as well. The 3-bit heap tag for pairs is chosen so that pairs can just be two words, with no header word. There is another 3-bit heap tag for forwarded objects, which is used by the GC when evacuating a value. Other objects put their heap tags in the low 8 bits of the first word. Additionally there is a “busy” tag word value, used to prevent races when evacuating from multiple threads.
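For a feel of what this kind of tagging looks like in C, here is a minimal sketch. The specific numbers are assumptions made for the example and are not Whiffle’s actual encoding: a 3-bit value tag, a fixnum tag of 1, and a made-up vector header tag. The sketch also skips the header-less pair trick and the forwarding and busy tags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t value;

/* Hypothetical encoding: the low 3 bits of a value are zero for heap
   pointers and nonzero for immediates. */
enum { TAG_BITS = 3, TAG_MASK = 7, HEAP_TAG = 0, FIXNUM_TAG = 1 };

/* Hypothetical header encoding: the heap tag lives in the low 8 bits
   of the first word, with the length above it. */
enum { HEADER_TAG_BITS = 8, HEADER_TAG_MASK = 0xff, VECTOR_TAG = 0x0d };

static bool is_heap_object(value v) { return (v & TAG_MASK) == HEAP_TAG; }
static bool is_fixnum(value v) { return (v & TAG_MASK) == FIXNUM_TAG; }

static value make_fixnum(intptr_t n) {
  return ((uintptr_t)n << TAG_BITS) | FIXNUM_TAG;
}

static intptr_t fixnum_value(value v) {
  /* Clear the tag, then divide: exact, and avoids right-shifting a
     negative number. */
  return (intptr_t)(v - FIXNUM_TAG) / (1 << TAG_BITS);
}

/* A two-element vector, aligned so that its address has zero low bits. */
struct vector2 {
  _Alignas(8) uintptr_t header;
  value elems[2];
};

static uintptr_t make_vector_header(uintptr_t length) {
  return (length << HEADER_TAG_BITS) | VECTOR_TAG;
}

/* Only valid when is_heap_object(v) holds. */
static unsigned heap_tag(value v) {
  return (unsigned)(*(const uintptr_t *)v & HEADER_TAG_MASK);
}
```

A run-time check such as the one vector-ref needs then becomes: first `is_heap_object(v)`, then `heap_tag(v) == VECTOR_TAG`, and an error otherwise.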

Finally, for generational collection of objects that can be “large” – the definition of large depends on the collector implementation, and is not nicely documented, but is more than, like, 256 bytes – anyway these objects might need to have space for a “remembered” bit in the objects themselves. This is not the case for pairs but is the case for, say, vectors: even though they are prolly smol, they might not be, and they need space for a remembered bit in the header.

tail calls

When I started Whiffle, I thought, let’s just compile each Scheme function to a C function. Since all functions have the same type, clang and gcc will have no problem turning any tail call into a proper tail call.

This intuition was right and wrong: at optimization level -O2, this works great. We don’t even do any kind of loop recognition / contification: loop iterations are tail calls and all is fine. (Not the most optimal implementation technique, but the assumption is that for our test cases, GC costs will dominate.)

However, when something goes wrong, I will need to debug the program to see what’s up, and so you might think to compile at -O0 or -Og. In that case, somehow gcc does not compile to tail calls. One time while debugging a program I was flummoxed at a segfault during the call instruction; turns out it was just stack overflow, and the call was trying to write the return address into an unmapped page. For clang, I could use the musttail attribute; perhaps I should, to allow myself to debug properly.
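Here is a small sketch of the shape of the thing, in C. It is not Whiffle’s generated code: `sum_loop`, `sum_to` and the `TAIL_RETURN` macro are names invented for the example. The macro uses the musttail attribute only where the compiler reports support for it via `__has_attribute`, and otherwise falls back to a plain return and hopes the optimizer does the right thing.

```c
#include <assert.h>
#include <stdint.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Where musttail is available, the compiler must emit a real tail
   call or refuse to compile, whatever the optimization level. */
#if __has_attribute(musttail)
#define TAIL_RETURN __attribute__((musttail)) return
#else
#define TAIL_RETURN return
#endif

/* A loop written as a self tail call: the caller and the callee have
   the same signature, which is what makes the tail call easy. */
static intptr_t sum_loop(intptr_t n, intptr_t acc) {
  if (n == 0)
    return acc;
  TAIL_RETURN sum_loop(n - 1, acc + n);
}

static intptr_t sum_to(intptr_t n) {
  return sum_loop(n, 0);
}
```

With the plain-return fallback and no optimization, each iteration may push a stack frame, so a large enough `n` runs off the end of the stack, which is the failure described above.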

Not being able to debug at -O0 with gcc is annoying. I feel like if GNU were an actual thing, we would have had the equivalent of a musttail attribute 20 years ago already. But it’s not, and we still don’t.

stdlib

So Whiffle makes C, and that C uses some primitives defined as inline functions. Whiffle actually lexically embeds user Scheme code with a prelude, having exposed a set of primitives to that prelude and to user code. The assumption is that the compiler will open-code all primitives, so that the conceit of providing a primitive from the Guile compilation host to the Whiffle guest magically works out, and that any reference to a free variable is an error. This works well enough, and it’s similar to what we currently do in Hoot as well.

This is a quick and dirty strategy but it does let us grow the language to something worth using. I think I’ll come back to this local maximum later if I manage to write about what Hoot does with modules.

coda

So, that’s Whiffle: the Guile compiler front-end for Scheme, applied to an expression that prepends a user’s program with a prelude, in a lexical context of a limited set of primitives, compiling to very simple C, in which tail calls are just return f(...), relying on the C compiler to inline and optimize and all that.

Perhaps next up: some results on using Whiffle to test Whippet. Until then, good night!
