wastrelly wabbits
Good day! Today (tonight), some notes on the last couple months of Wastrel, my ahead-of-time WebAssembly compiler.
Back in the beginning of February, I showed Wastrel running programs that use garbage collection, using an embedded copy of the Whippet collector, specialized to the types present in the Wasm program. But, the two synthetic GC-using programs I tested on were just ported microbenchmarks, and didn’t reflect the output of any real toolchain.
In this cycle I worked on compiling the output from the Hoot Scheme-to-Wasm compiler. There were some interesting challenges!
bignums
When I originally wrote the Hoot compiler, it targeted the browser, which already has a bignum implementation in the form of BigInt, which I worked on back in the day. Hoot-generated Wasm files use host bigints via externref (though wrapped in structs, to allow for hashing and identity).
In Wastrel, then, I implemented the imports that implement bignum operations: addition, multiplication, and so on. I did so using mini-gmp, a stripped-down implementation of the workhorse GNU multi-precision library. At some point if bignums become important, this gives me the option to link to the full GMP instead.
Bignums were the first managed data type in Wastrel that wasn’t defined as part of the Wasm module itself, instead hiding behind externref, so I had to add a facility to allocate type codes to these “host” data types. More types will come in time: weak maps, ephemerons, and so on.
I think bignums would be a great proposal for the Wasm standard, ideally something like stringref (sniff!), possibly in an attenuated form.
exception handling
Hoot used to emit a pre-standardization form of exception handling, and hadn’t gotten around to updating to the newer version that was standardized last July. I updated Hoot to emit the newer kind of exceptions, as it was easier to implement them in Wastrel that way.
Some of the problems Chris Fallin contended with in Wasmtime don’t apply in the Wastrel case: since the set of instances is known at compile-time, we can statically allocate type codes for exception tags. Also, I didn’t really have to do the back-end: I can just use setjmp and longjmp.
This whole paragraph was meant to be a bit of an aside in which I briefly mentioned why just using setjmp was fine. Indeed, because Wastrel never re-uses a temporary, relying entirely on GCC to “re-use” the register / stack slot on our behalf, I had thought that I didn’t need to worry about the “volatile problem”. From the C99 specification:
[...] values of objects of automatic storage duration that are local to the function containing the invocation of the corresponding setjmp macro that do not have volatile-qualified type and have been changed between the setjmp invocation and longjmp call are indeterminate.
My thought was, though I might set a value between setjmp and longjmp, that would only be the case for values whose lifetime did not reach the longjmp (i.e., whose last possible use was before the jump). Wastrel didn’t introduce any such cases, so I was good.
However, I forgot about local.set: mutations of locals (ahem, objects of automatic storage duration) in the source Wasm file could run afoul of this rule. So, because of writing this blog post, I went back and added an analysis pass that determines, for each function, the set of locals mutated inside a try block; those get the volatile qualifier. Thank you, rubber duck readers!
bugs
Oh my goodness there were many bugs. Lacunae, if we are being generous; things not implemented quite right, which resulted in errors either when generating C or when compiling the C. The type-preserving translation strategy does seem to have borne fruit, in that I have spent very little time in GDB: once things compile, they work.
coevolution
Sometimes Hoot would use a browser facility where it was convenient, but for which in a better world we would just do our own thing. Such was the case for the number->string operation on floating-point numbers: we did something awful but expedient.
I didn’t have this facility in Wastrel, so instead we moved to do float-to-string conversions in Scheme. This turns out to have been a good test for bignums too; the algorithm we use is a bit dated and relies on bignums to do its thing. The move to Scheme also allows for printing floating-point numbers in other radices.
There are a few more Hoot patches that were inspired by Wastrel, about which more later; it has been good for both to work on the two at the same time.
tail calls
My plan for Wasm’s return_call and friends was to use the new musttail annotation for calls, which has been in clang for a while and was recently added to GCC. I was careful to limit the number of function parameters such that no call should require stack allocation, and therefore a compiler should have no reason to reject any particular tail call.
However, there were bugs. Funny ones, at first: attributes applying to a preceding label instead of the following call, or the need to insert if (1) before the tail call. More dire ones, in which tail callers inlined into their callees would cause the tail calls to fail, worked around with judicious application of noinline. Thanks to GCC’s Andrew Pinski for help debugging these and other issues; with GCC things are fine now.
I did have to change the code I emitted to return “top types only”: if you have a function returning type T, you can tail-call a function returning U if U is a subtype of T, but there is no nice way to encode this into the C type system. Instead, we return the top type of T (or U, it’s the same), e.g. anyref, and insert downcasts at call sites to recover the precise types. Not so nice, but it’s what we got.
Trying tail calls on clang, I ran into a funny restriction: clang not only requires that return types match, but requires that tail caller and tail callee have the same parameters as well. I can see why they did this (it requires no stack shuffling and thus such a tail call is always possible, even with 500 arguments), but it’s not the design point that I need. Fortunately there are discussions about moving to a different constraint.
scale
I spent way more time than I had planned on improving the speed of Wastrel itself. My initial idea was to just emit one big C file, and that would provide the maximum possibility for GCC to just go and do its thing: it can see everything, everything is static, there are loads of always_inline helpers that should compile away to single instructions, that sort of thing. But, this doesn’t scale, in a few ways.
In the first obvious way, consider whitequark’s llvm.wasm. This is all of LLVM in one 70 megabyte Wasm file. Wastrel made a huuuuuuge C file, then GCC chugged on it forever; 80 minutes at -O1, and I wasn’t aiming for -O1.
I realized that in many ways, GCC wasn’t designed to be a compiler target. The shape of code that a Wasm-to-C compiler like Wastrel emits is different from what one would write by hand. I even ran into a segfault compiling with -Wall, because GCC accidentally recursed instead of iterated in the -Winfinite-recursion pass.
So, I dealt with this in a few ways. After many hours spent pleading and bargaining with different -O options, I bit the bullet and made Wastrel emit multiple C files. It will compute a DAG forest of all the functions in a module, where edges are direct calls, and go through that forest, greedily consuming (and possibly splitting) subtrees until we have “enough” code to split out a partition, as measured by number of Wasm instructions. They say that -flto makes this a fine approach, but one never knows when a translation unit boundary will turn out to be important. I compute needed symbol visibilities as much as I can so as to declare functions that don’t escape their compilation unit as static; who knows if this is of value. Anyway, this partitioning introduced no performance regression in my limited tests so far, and compiles are much much much faster.
scale, bis
A brief observation: Wastrel used to emit indented code, because it could, and what does it matter, anyway. However, consider Wasm’s br_table: it takes an array of n labels and an integer operand, and will branch to the nth label, or the last if the operand is out of range. To set up a label in Wasm, you make a block, of which there are a handful of kinds; the label is visible in the block, and for n labels, the br_table will be the most nested expression in the n nested blocks.
Now consider that block indentation is proportional to n. This means, the file size of an indented C file is quadratic in the number of branch targets of the br_table.
Yes, this actually bit me; there are br_table instances with tens of thousands of targets. No, Wastrel does not indent any more.
scale, ter
Right now, the long pole in Wastrel is the compile-to-C phase; the C-to-native phase parallelises very well and is less of an issue. So, one might think: OK, you have partitioned the functions in this Wasm module into a number of files, why not emit the files in parallel?
I gave this a go. It did not speed up C generation. From my cursory investigations, I think this is because the bottleneck is garbage collection in Wastrel itself; Wastrel is written in Guile, and Guile still uses the Boehm-Demers-Weiser collector, which does not parallelize well for multiple mutators. It’s terrible but I ripped out parallelization and things are fine. Someone on Mastodon suggested fork; they’re not wrong, but also not Right either. I’ll just keep this as a nice test case for the Guile-on-Whippet branch I want to poke later this year.
scale, quater
Finally, I had another realization: GCC was having trouble compiling the C that Wastrel emitted, because Hoot had emitted bad WebAssembly. Not bad as in “invalid”; rather, “not good”.
There were two cases in which Hoot emitted ginormous (technical term) functions. One, for an odd debugging feature: Hoot does a CPS transform on its code, and allocates return continuations on a stack. This is a gnarly technique but it gets us delimited continuations and all that goodness even before stack switching has landed, so it’s here for now. It also gives us a reified return stack of funcref values, which lets us print Scheme-level backtraces.
Or it would, if we could associate data with a funcref. Unfortunately func is not a subtype of eq, so we can’t. Unless... we pass the funcref out to the embedder (e.g. JavaScript), and the embedder checks the funcref for equality (e.g. using ===); then we can map a funcref to an index, and use that index to map to other properties.
How to pass that funcref/index map to the host? When I initially wrote Hoot, I didn’t want to just, you know, put the funcrefs of interest into a table and let the index of a function’s slot be the value in the key-value mapping; that would waste memory. Instead, we emitted functions that took an integer and returned a funcref. Yes, these used br_table, and yes, there could be tens of thousands of cases, depending on what you were compiling.
Then to map the integer index to, say, a function name, likewise I didn’t want a table; that would force eager allocation of all strings. Instead I emitted a function with a br_table whose branches would return string.const values.
Except, of course, stringref didn’t become a thing, and so instead we would end up lowering to allocate string constants as globals.
Except, of course, Wasm’s idea of what counts as a “constant” is quite restricted, so we have a pass that moves non-constant global initializers to the “start” function. This results in an enormous start function. The straightforward solution was to partition global initializations into separate functions, called by the start function.
For the funcref debugging, the solution was more intricate: firstly, we represent the funcref-to-index mapping just as a table. It’s fine. Then for the side table mapping indices to function names and sources, we emit DWARF, and attach a special attribute to each “introspectable” function. In this way, reading the DWARF sequentially, we reconstruct a mapping from index to DWARF entry, and thus to a byte range in the Wasm code section, and thus to source information in the .debug_line section. It sounds gnarly but Guile already used DWARF as its own debugging representation; switching to emit it in Hoot was not a huge deal, and as we only need to consume the DWARF that we emit, we only needed some 400 lines of JS for the web/node run-time support code.
This switch to data instead of code removed the last really long pole from the GCC part of Wastrel’s pipeline. What’s more, Wastrel can now implement the code_name and code_source imports for Hoot programs ahead of time: it can parse the DWARF at compile-time, and generate functions that look up functions by address in a sorted array to return their names and source locations. As of today, this works!
fin
There are still a few things that Hoot wants from a host that Wastrel has stubbed out: weak refs and so on. I’ll get to this soon; my goal is a proper Scheme REPL. Today’s note is a waypoint on the journey. Until next time, happy hacking!