wingolog

cps in hoot

2024-05-27T12:36:44Z

Good morning good morning! Today I have another article on the Hoot Scheme-to-Wasm compiler, this time on Hoot’s use of the continuation-passing-style (CPS) transformation.

calls calls calls

So, just a bit of context to start out: Hoot is a Guile, Guile is a Scheme, Scheme is a Lisp, one with “proper tail calls”: function calls are either in tail position, syntactically, in which case they are tail calls, or they are not in tail position, in which they are non-tail calls. A non-tail call suspends the calling function, putting the rest of it (the continuation) on some sort of stack, and will resume when the callee returns. Because non-tail calls push their continuation on a stack, we can call them push calls.

(define (f)
  ;; A push call to g, binding its first return value.
  (define x (g))
  ;; A tail call to h.
  (h x))

Usually the problem in implementing Scheme on other language run-times comes in tail calls, but WebAssembly supports them natively (except on JSC / Safari; should be coming at some point though). Hoot’s problem is the reverse: how to implement push calls?

The issue might seem trivial but it is not. Let me illustrate briefly by describing what Guile does natively (not compiled to WebAssembly). Firstly, note that I am discussing residual push calls, by which I mean to say that the optimizer might remove a push call in the source program via inlining: we are looking at those push calls that survive the optimizer. Secondly, note that native Guile manages its own stack instead of using the stack given to it by the OS; this allows for push-call recursion without arbitrary limits. It also lets Guile capture stack slices and rewind them, which is the fundamental building block we use to implement exception handling, Fibers and other forms of lightweight concurrency.

The straightforward function call will have an artificially limited total recursion depth in most WebAssembly implementations, meaning that many idiomatic uses of Guile will throw exceptions. Unpleasant, but perhaps we could stomach this tradeoff. The greater challenge is how to slice the stack. That I am aware of, there are three possible implementation strategies.

generic slicing

One possibility is that the platform provides a generic, powerful stack-capture primitive, which is what Guile does. The good news is that one day, the WebAssembly stack-switching proposal should provide this too. And in the meantime, the so-called JS Promise Integration (JSPI) proposal gets close: if you enter Wasm from JS via a function marked as async, and you call out to JavaScript to a function marked as async (i.e. returning a promise), then on that nested Wasm-to-JS call, the engine will suspend the continuation and resume it only when the returned promise settles (i.e. completes with a value or an exception). Each entry from JS to Wasm via an async function allocates a fresh stack, so I understand you can have multiple pending promises, and thus multiple wasm coroutines in progress. It gets a little gnarly if you want to control when you wait, for example if you might want to wait on multiple promises; in that case you might not actually mark promise-returning functions as async, and instead import an async-marked async function waitFor(p) { return await p} or so, allowing you to use Promise.race and friends. The main problem though is that JSPI is only for JavaScript. Also, its stack sizes are even smaller than the the default stack size.

instrumented slicing

So much for generic solutions. There is another option, to still use push calls from the target machine (WebAssembly), but to transform each function to allow it to suspend and resume. This is what I think of as Joe Marshall’s stack trick (also see §4.2 of the associated paper). The idea is that although there is no primitive to read the whole stack, each frame can access its own state. If you insert a try/catch around each push call, the catch handler can access local state for activations of that function. You can slice a stack by throwing a SaveContinuation exception, in which each frame’s catch handler saves its state and re-throws. And if we want to avoid exceptions, we can use checked returns as Asyncify does.

I never understood, though, how you resume a frame. The Generalized Stack Inspection paper would seem to indicate that you need the transformation to introduce a function to run “the rest of the frame” at each push call, which becomes the Invoke virtual method on the reified frame object. To avoid code duplication you would have to make normal execution flow run these Invoke snippets as well, and that might undo much of the advantages. I understand the implementation that Joe Marshall was working on was an interpreter, though, which bounds the number of sites needing such a transformation.

cps transformation

The third option is a continuation-passing-style transformation. A CPS transform results in a program whose procedures “return” by tail-calling their “continuations”, which themselves are procedures. Taking our previous example, a naïve CPS transformation would reify the following program:

(define (f' k)
  (g' (lambda (x) (h' k x))))

Here f' (“f-prime”) receives its continuation as an argument. We call g', for whose continuation argument we pass a closure. That closure is the return continuation of g, binding a name to its result, and then tail-calls h with respect to f. We know their continuations are the same because it is the same binding, k.

Unfortunately we can’t really slice abitrary ranges of a stack with the naïve CPS transformation: we can only capture the entire continuation, and can’t really inspect its structure. There is also no way to compose a captured continuation with the current continuation. And, in a naïve transformation, we would be constantly creating lots of heap allocation for these continuation closures; a push call effectively pushes a frame onto the heap as a closure, as we did above for g'.

There is also the question of when to perform the CPS transform; most optimizing compilers would like a large first-order graph to work on, which is out of step with the way CPS transformation breaks functions into many parts. Still, there is a nugget of wisdom here. What if we preserve the conventional compiler IR for most of the pipeline, and only perform the CPS transformation at the end? In that way we can have nice SSA-style optimizations. And, for return continuations of push calls, what if instead of allocating a closure, we save the continuation data on an explicit stack. As Andrew Kennedy notes, closures introduced by the CPS transform follow a stack discipline, so this seems promising; we would have:

(define (f'' k)
  (push! k)
  (push! h'')
  (g'' (lambda (x)
         (define h'' (pop!))
         (define k (pop!))
         (h'' k x))))

The explicit stack allows for generic slicing, which makes it a win for implementing delimited continuations.

hoot and cps

Hoot takes the CPS transformation approach with stack-allocated return closures. In fact, Hoot goes a little farther, too far probably:

(define (f''')
  (push! k)
  (push! h''')
  (push! (lambda (x)
           (define h'' (pop!))
           (define k (pop!))
           (h'' k x)))
  (g'''))

Here instead of passing the continuation as an argument, we pass it on the stack of saved values. Returning pops off from that stack; for example, (lambda () 42) would transform as (lambda () ((pop!) 42)). But some day I should go back and fix it to pass the continuation as an argument, to avoid excess stack traffic for leaf function calls.

There are some gnarly details though, which I know you are here for!

splits

For our function f, we had to break it into two pieces: the part before the push-call to g and the part after. If we had two successive push-calls, we would instead split into three parts. In general, each push-call introduces a split; let us use the term tails for the components produced by a split. (You could also call them continuations.) How many tails will a function have? Well, one for the entry, one for each push call, and one any time control-flow merges between two tails. This is a fixpoint problem, given that the input IR is a graph. (There is also some special logic for call-with-prompt but that is too much detail for even this post.)

where to save the variables

Guile is a dynamically-typed language, having a uniform SCM representation for every value. However in the compiler and run-time we can often unbox some values, generally as u64/s64/f64 values, but also raw pointers of some specific types, some GC-managed and some not. In native Guile, we can just splat all of these data members into 64-bit stack slots and rely on the compiler to emit stack maps to determine whether a given slot is a double or a tagged heap object reference or what. In WebAssembly though there is no sum type, and no place we can put either a u64 or a (ref eq) value. So we have not one stack but three (!) stacks: one for numeric values, implemented using a Wasm memory; one for (ref eq) values, using a table; and one for return continuations, because the func type hierarchy is disjoin from eq. It’s.... it’s gross? It’s gross.

what variables to save

Before a push-call, you save any local variables that will be live after the call. This is also a flow analysis problem. You can leave off constants, and instead reify them anew in the tail continuation.

I realized, though, that we have some pessimality related to stacked continuations. Consider:

(define (q x)
  (define y (f))
  (define z (f))
  (+ x y z))

Hoot’s CPS transform produces something like:

(define (q0 x)
  (save! x)
  (save! q1)
  (f))

(define (q1 y)
  (restore! x)
  (save! x)
  (save! y)
  (save! q2)
  (f))

(define (q2 z)
  (restore! x)
  (restore! y)
  ((pop!) (+ x y z)))

So q0 saved x, fine, indeed we need it later. But q1 didn’t need to restore x uselessly, only to save it again on q2‘s behalf. Really we should be applying a stack discipline for saved data within a function. Given that the source IR is a graph, this means another flow analysis problem, one that I haven’t thought about how to solve yet. I am not even sure if there is a solution in the literature, given that the SSA-like flow graphs plus tail calls / CPS is a somewhat niche combination.

calling conventions

The continuations introduced by CPS transformation have associated calling conventions: return continuations may have the generic varargs type, or the compiler may have concluded they have a fixed arity that doesn’t need checking. In any case, for a return, you call the return continuation with the returned values, and the return point then restores any live-in variables that were previously saved. But for a merge between tails, you can arrange to take the live-in variables directly as parameters; it is a direct call to a known continuation, rather than an indirect call to an unknown call site.

cps soup?

Guile’s intermediate representation is called CPS soup, and you might wonder what relationship that CPS has to this CPS. The answer is not much. The continuations in CPS soup are first-order; a term in one function cannot continue to a continuation in another function. (Inlining and contification can merge graphs from different functions, but the principle is the same.)

It might help to explain that it is the same relationship as it would be if Guile represented programs using SSA: the Hoot CPS transform runs at the back-end of Guile’s compilation pipeline, where closures representations have already been made explicit. The IR is still direct-style, just that syntactically speaking, every call in a transformed program is a tail call. We had to introduce save and restore primitives to implement the saved variable stack, and some other tweaks, but generally speaking, the Hoot CPS transform ensures the run-time all-tail-calls property rather than altering the compile-time language; a transformed program is still CPS soup.

fin

Did we actually make the right call in going for a CPS transformation?

I don’t have good performance numbers at the moment, but from what I can see, the overhead introduced by CPS transformation can impose some penalties, even 10x penalties in some cases. But some results are quite good, improving over native Guile, so I can’t be categorical.

But really the question is, is the performance acceptable for the functionality, and there I think the answer is more clear: we have a port of Fibers that I am sure Spritely colleagues will be writing more about soon, we have good integration with JavaScript promises while not relying on JSPI or Asyncify or anything else, and we haven’t had to compromise in significant ways regarding the source language. So, for now, I am satisfied, and looking forward to experimenting with the stack slicing proposal as it becomes available.

Until next time, happy hooting!

hoot's wasm toolkit

2024-05-24T10:37:54Z

Good morning! Today we continue our dive into the Hoot Scheme-to-WebAssembly compiler. Instead of talking about Scheme, let’s focus on WebAssembly, specifically the set of tools that we have built in Hoot to wrangle Wasm. I am peddling a thesis: if you compile to Wasm, probably you should write a low-level Wasm toolchain as well.

(Incidentally, some of this material was taken from a presentation I gave to the Wasm standardization organization back in October, which I think I haven’t shared yet in this space, so if you want some more context, have at it.)

naming things

Compilers are all about names: definitions of globals, types, local variables, and so on. An intermediate representation in a compiler is a graph of definitions and uses in which the edges are names, and the set of possible names is generally unbounded; compilers make more names when they see fit, for example when copying a subgraph via inlining, and remove names if they determine that a control or data-flow edge is not necessary. Having an unlimited set of names facilitates the graph transformation work that is the essence of a compiler.

Machines, though, generally deal with addresses, not names; one of the jobs of the compiler back-end is to tabulate the various names in a compilation unit, assigning them to addresses, for example when laying out an ELF binary. Some uses may refer to names from outside the current compilation unit, as when you use a function from the C library. The linker intervenes at the back-end to splice in definitions for dangling uses and applies the final assignment of names to addresses.

When targetting Wasm, consider what kinds of graph transformations you would like to make. You would probably like for the compiler to emit calls to functions from a low-level run-time library written in wasm. Those functions are probably going to pull in some additional definitions, such as globals, types, exception tags, and so on. Then once you have your full graph, you might want to lower it, somehow: for example, you choose to use the stringref string representation, but browsers don’t currently support it; you run a post-pass to lower to UTF-8 arrays, but then all your strings are not constant, meaning they can’t be used as global initializers; so you run another post-pass to initialize globals in order from the start function. You might want to make other global optimizations as well, for example to turn references to named locals into unnamed stack operands (not yet working :).

Anyway what I am getting at is that you need a representation for Wasm in your compiler, and that representation needs to be fairly complete. At the very minimum, you need a facility to transform that in-memory representation to the standard WebAssembly text format, which allows you to use a third-party assembler and linker such as Binaryen’s wasm-opt. But since you have to have the in-memory representation for your own back-end purposes, probably you also implement the names-to-addresses mapping that will allow you to output binary WebAssembly also. Also it could be that Binaryen doesn’t support something you want to do; for example Hoot uses block parameters, which are supported fine in browsers but not in Binaryen.

(I exaggerate a little; Binaryen is a more reasonable choice now than it was before the GC proposal was stabilised. But it has been useful to be able to control Hoot’s output, for example as the exception-handling proposal has evolved.)

one thing leads to another

Once you have a textual and binary writer, and an in-memory representation, perhaps you want to be able to read binaries as well; and perhaps you want to be able to read text. Reading the text format is a little annoying, but I had implemented it already in JavaScript a few years ago; and porting it to Scheme was a no-brainer, allowing me to easily author the run-time Wasm library as text.

And so now you have the beginnings of a full toolchain, built just out of necessity: reading, writing, in-memory construction and transformation. But how are you going to test the output? Are you going to require a browser? That’s gross. Node? Sure, we have to check against production Wasm engines, and that’s probably the easiest path to take; still, would be nice if this were optional. Wasmtime? But that doesn’t do GC.

No, of course not, you are a dirty little compilers developer, you are just going to implement a little wasm interpreter, aren’t you. Of course you are. That way you can build nice debugging tools to help you understand when things go wrong. Hoot’s interpreter doesn’t pretend to be high-performance—it is not—but it is simple and it just works. Massive kudos to Spritely hacker David Thompson for implementing this. I think implementing a Wasm VM also had the pleasant side effect that David is now a Wasm expert; implementation is the best way to learn.

Finally, one more benefit of having a Wasm toolchain as part of the compiler: %inline-wasm. In my example from last time, I had this snippet that makes a new bytevector:

(%inline-wasm
 '(func (param $len i32) (param $init i32)
    (result (ref eq))
    (struct.new
     $mutable-bytevector
     (i32.const 0)
     (array.new $raw-bytevector
                (local.get $init)
                (local.get $len))))
 len init)

%inline-wasm takes a literal as its first argument, which should parse as a Wasm function. Parsing guarantees that the wasm is syntactically valid, and allows the arity of the wasm to become apparent: we just read off the function’s type. Knowing the number of parameters and results is one thing, but we can do better, in that we also know their type, which we use for intentional types, requiring in this case that the parameters be exact integers which get wrapped to the signed i32 range. The resulting term is spliced into the CPS graph, can be analyzed for its side effects, and ultimately when written to the binary we replace each local reference in the Wasm with a reference of the appropriate local variable. All this is possible because we have the tools to work on Wasm itself.

fin

Hoot’s Wasm toolchain is about 10K lines of code, and is fairly complete. I think it pays off for Hoot. If you are building a compiler targetting Wasm, consider budgetting for a 10K SLOC Wasm toolchain; you won’t regret it.

Next time, an article on Hoot’s use of CPS. Until then, happy hacking!

on hoot, on boot

2024-05-16T20:01:41Z

I realized recently that I haven’t been writing much about the Hoot Scheme-to-WebAssembly compiler. Upon reflection, I have been too conscious of its limitations to give it verbal tribute, preferring to spend each marginal hour fixing bugs and filling in features rather than publicising progress.

In the last month or so, though, Hoot has gotten to a point that pleases me. Not to the point where I would say “accept no substitutes” by any means, but good already for some things, and worth writing about.

So let’s start today by talking about bootie. Boot, I mean! The boot, the boot, the boot of Hoot.

hoot boot: temporal tunnel

The first axis of boot is time. In the beginning, there was nary a toot, and now, through boot, there is Hoot.

The first boot of Hoot was on paper. Christine Lemmer-Webber had asked me, ages ago, what I thought Guile should do about the web. After thinking a bit, I concluded that it would be best to avoid compromises when building an in-browser Guile: if you have to pollute Guile to match what JavaScript offers, you might as well program in JavaScript. JS is cute of course, but Guile is a bit different in some interesting ways, the most important of which is control: delimited continuations, multiple values, tail calls, dynamic binding, threads, and all that. If Guile’s web bootie doesn’t pack all the funk in its trunk, probably it’s just junk.

So I wrote up a plan something to which I attributed the name tailification. In retrospect, this is simply a specific flavor of a continuation-passing-style (CPS) transmutation, late in the compiler pipeline. I’ll elocute more in a future dispatch. I did end up writing the tailification pass back then; I could have continued to target JS, but it was sufficiently annoying and I didn’t prosecute. It sat around unused for a few years, until Christine’s irresistable charisma managed to conjure some resources for Hoot.

In the meantime, the GC extension for WebAssembly shipped (woot woot!), and to boot Hoot, I filled in the missing piece: a backend for Guile’s compiler that tailified and then translated primitive operations to snippets of WebAssembly.

It was, well, hirsute, but cute and it did compute, so we continued to boot. From this root we grew a small run-time library, written in raw WebAssembly, used for slow-paths for the various primitive operations that are part of Guile’s compiler back-end. We filled out Guile primcalls, in minute commits, growing the WebAssembly runtime library and toolchain as we went.

Eventually we started constituting facilities defined in terms of those primitives, via a Scheme prelude that was prepended to all programs, within a nested lexical environment. It was never our intention though to drown the user’s programs in a sea of predefined bindings, as if the ultimate program were but a vestigial inhabitant of the lexical lake—don’t dilute the newt!, we would often say [ed: we did not]— so eventually when the prelude became unmanageable, we finally figured out how to do whole-program compilation of a set of modules.

Then followed a long month in which I would uproot the loot from the boot: take each binding from the prelude and reattribute it into an appropriate module. User code could import all the modules that suit, as long as they were known to Hoot, but no others; it was only until we added the ability for users to programmatically consitute an environment from their modules that Hoot became a language implementation of any repute.

Which brings us to the work of the last month, about which I cannot be mute. When you have existing Guile code that you want to distribute via the web, Hoot required you transmute its module definitions into the more precise R6RS syntax. Precise, meaning that R6RS modules are static, in a way that Guile modules, at least in absolute terms, are not: Guile programs can use first-class accessors on the module systems to pull out bindings. This is yet another example of what I impute as the original sin of 1990s language development, that modules are just mutable hash maps. You see it in Python, for example: because you don’t know for sure to what values global names are bound, it is easy for any discussion of what a particular piece of code means to end in dispute.

The question is, though, are the semantics of name binding in a language fixed and absolute? Once your language is booted, are its aspects definitively attributed? I think some perfection, in the sense of becoming more perfect or more like the thing you should be, is something to salute. Anyway, in Guile it would be coherent with Scheme’s lexical binding heritage to restitute some certainty as to the meanings of names, at least in a default compilation node. Lexical binding is, after all, the foundation of the Macro Writer’s Statute of Rights. Of course if you are making a build for development purposes, not to distribute, then you might prefer a build that marks all bindings as dynamic. Otherwise I think it’s reasonable to require the user to explicitly indicate which definitions are denotations, and which constitute locations.

Hoot therefore now includes an implementation of the static semantics of Guile’s define-module: it can load Guile modules directly, and as a tribute, it also has an implementation of the ambient (guile) module that constitutes the lexical soup of modules that aren’t #:pure. (I agree, it would be better if all modules were explicit about the language they are written in—their imported bindings and so on—but there is an existing corpus to accomodate; the point is moot.)

The astute reader (whom I salute!) will note that we have a full boot: Hoot is a Guile. Not an implementation to substitute the original, but more of an alternate route to the same destination. So, probably we should scoot the two implementations together, to knock their boots, so to speak, merging the offshoot Hoot into Guile itself.

But do I circumlocute: I can only plead a case of acute Hoot. Tomorrow, we elocute on a second axis of boot. Until then, happy compute!

missing the point of webassembly

2024-01-08T11:45:39Z

I find most descriptions of WebAssembly to be uninspiring: if you start with a phrase like “assembly-like language” or a “virtual machine”, we have already lost the plot. That’s not to say that these descriptions are incorrect, but it’s like explaining what a dog is by starting with its circulatory system. You’re not wrong, but you should probably lead with the bark.

I have a different preferred starting point which is less descriptive but more operational: WebAssembly is a new fundamental abstraction boundary. WebAssembly is a new way of dividing computing systems into pieces and of composing systems from parts.

This all may sound high-falutin´, but it’s for real: this is the actually interesting thing about Wasm.

fundamental & abstract

It’s probably easiest to explain what I mean by example. Consider the Linux ABI: Linux doesn’t care what code it’s running; Linux just handles system calls and schedules process time. Programs that run against the x86-64 Linux ABI don’t care whether they are in a container or a virtual machine or “bare metal” or whether the processor is AMD or Intel or even a Mac M3 with Docker and Rosetta 2. The Linux ABI interface is fundamental in the sense that either side can implement any logic, subject to the restrictions of the interface, and abstract in the sense that the universe of possible behaviors has been simplified to a limited language, in this case that of system calls.

Or take HTTP: when you visit wingolog.org, you don’t have to know (but surely would be delighted to learn) that it’s Scheme code that handles the request. I don’t have to care if the other side of the line is curl or Firefox or Wolvic. HTTP is such a successful fundamental abstraction boundary that at this point it is the default for network endpoints; whether you are a database or a golang microservice, if you don’t know that you need a custom protocol, you use HTTP.

Or, to rotate our metaphorical compound microscope to high-power magnification, consider the SYS-V amd64 C ABI: almost every programming language supports some form of extern C {} to access external libraries, and the best language implementations can produce artifacts that implement the C ABI as well. The standard C ABI splits programs into parts, and allows works from separate teams to be composed into a whole. Indeed, one litmus test of a fundamental abstraction boundary is, could I reasonably define an interface and have an implementation of it be in Scheme or OCaml or what-not: if the answer is yes, we are in business.

It is in this sense that WebAssembly is a new fundamental abstraction boundary.

WebAssembly shares many of the concrete characteristics of other abstractions. Like the Linux syscall interface, WebAssembly defines an interface language in which programs rely on host capabilities to access system features. Like the C ABI, calling into WebAssembly code has a predictable low cost. Like HTTP, you can arrange for WebAssembly code to have no shared state with its host, by construction.

But WebAssembly is a new point in this space. Unlike the Linux ABI, there is no fixed set of syscalls: WebAssembly imports are named, typed, and without pre-defined meaning, more like the C ABI. Unlike the C ABI, WebAssembly modules have only the shared state that they are given; neither side has a license to access all of the memory in the “process”. And unlike HTTP, WebAssembly modules are “in the room” with their hosts: close enough that hosts can allow themselves the luxury of synchronous function calls, and to allow WebAssembly modules to synchronously call back into their hosts.

applied teleology

At this point, you are probably nodding along, but also asking yourself, what is it for? If you arrive at this question from the “WebAssembly is a virtual machine” perspective, I don’t think you’re well-equipped to answer. But starting as we did by the interface, I think we are better positioned to appreciate how WebAssembly fits into the computing landscape: the narrative is generative, in that you can explore potential niches by identifying existing abstraction boundaries.

Again, let’s take a few examples. Say you ship some “smart cities” IoT device, consisting of a microcontroller that runs some non-Linux operating system. The system doesn’t have an MMU, so you don’t have hardware memory protections, but you would like to be able to enforce some invariants on the software that this device runs; and you would also like to be able to update that software over the air. WebAssembly is getting used in these environments; I wish I had a list of deployments at hand, but perhaps we can at least take this article last year from a WebAssembly IoT vendor as proof of commercial interest.

Or, say you run a function-as-a-service cloud, meaning that you run customer code in response to individual API requests. You need to limit the allowable set of behaviors from the guest code, so you choose some abstraction boundary. You could use virtual machines, but that would be quite expensive in terms of memory. You could use containers, but you would like more control over the guest code. You could have these functions written in JavaScript, but that means that your abstraction is no longer fundamental; you limit your applicability. WebAssembly fills an interesting niche here, and there are a number of products in this space, for example Fastly Compute or Fermyon Spin.

Or to go smaller, consider extensible software, like the GIMP image editor or VS Code: in the past you would use loadable plug-in modules via the C ABI, which can be quite gnarly, or you lean into a particular scripting language, which can be slow, inexpressive, and limit the set of developers that can write extensions. It’s not a silver bullet, but WebAssembly can have a role here. For example, the Harfbuzz text shaping library supports fonts with an embedded (em-behdad?) WebAssembly extension to control how strings of characters are mapped to positioned glyphs.

aside: what boundaries do

They say that good fences make good neighbors, and though I am not quite sure it is true—since my neighbor put up a fence a few months ago, our kids don’t play together any more—boundaries certainly facilitate separation of functionality. Conway’s law is sometimes applied as a descriptive observation—ha-ha, isn’t that funny, they just shipped their org chart—but this again misses the point, in that boundaries facilitate separation, but also composition: if I know that I can fearlessly allow a font to run code because I have an appropriate abstraction boundary between host application and extension, I have gained in power. I no longer need to be responsible for every part of the product, and my software can scale up to solve harder problems by composing work from multiple teams.

There is little point in using WebAssembly if you control both sides of a boundary, just as (unless you have chickens) there is little point in putting up a fence that runs through the middle of your garden. But where you want to compose work from separate teams, the boundaries imposed by WebAssembly can be a useful tool.

narrative generation

WebAssembly is enjoying a tail-wind of hype, so I think it’s fair to say that wherever you find a fundamental abstraction boundary, someone is going to try to implement it with WebAssembly.

Again, some examples: back in 2022 I speculated that someone would “compile” Docker containers to WebAssembly modules, and now that is a thing.

I think at some point someone will attempt to replace eBPF with Wasm in the Linux kernel; eBPF is just not as good a language as Wasm, and the toolchains that produce it are worse. eBPF has clunky calling-conventions about what registers are saved and spilled at call sites, a decision that can be made more efficiently for the program and architecture at hand when register-allocating WebAssembly locals. (Sometimes people lean on the provably-terminating aspect of eBPF as its virtue, but that could apply just as well to Wasm if you prohibit the loop opcode (and the tail-call instructions) at verification-time.) And why don’t people write whole device drivers in eBPF? Or rather, targetting eBPF from C or what-have-you. It’s because eBPF is just not good enough. WebAssembly is, though! Anyway I think Linux people are too chauvinistic to pick this idea up but I bet Microsoft could do it.

I was thinking today, you know, it actually makes sense to run a WebAssembly operating system, one which runs WebAssembly binaries. If the operating system includes the Wasm run-time, it can interpose itself at syscall boundaries, sometimes allowing it to avoid context switches. You could start with something like the Linux ABI, perhaps via WALI, but for a subset of guest processes that conform to particular conventions, you could build purpose-built composition that can allocate multiple WebAssembly modules to a single process, eliding inter-process context switches and data copies for streaming computations. Or, focussing on more restricted use-cases, you could make a microkernel; googling around I found this article from a couple days ago where someone is giving this a go.

wwwhat about the wwweb

But let’s go back to the web, where you are reading this. In one sense, WebAssembly is a massive web success, being deployed to literally billions of user agents. In another, it is marginal: people do not write web front-ends in WebAssembly. Partly this is because the kind of abstraction supported by linear-memory WebAssembly 1.0 isn’t a good match for the garbage-collected DOM API exposed by web browsers. As a corrolary, languages that are most happy targetting this linear-memory model (C, Rust, and so on) aren’t good for writing DOM applications either. WebAssembly is used in auxiliary modules where you want to run legacy C++ code on user devices, or to speed up a hot leaf function, but isn’t a huge success.

This will change with the recent addition of managed data types to WebAssembly, but not necessarily in the way that you might think. Like, now that it will be cheaper and more natural to pass data back and forth with JavaScript, are we likely to see Wasm/GC progressively occupying more space in web applications? For me, I doubt that progressive is the word. In the same way that you wouldn’t run a fence through the middle of your front lawn, you wouldn’t want to divide your front-end team into JavaScript and WebAssembly sub-teams. Instead I think that we will see more phase transitions, in which whole web applications switch from JavaScript to Wasm/GC, compiled from Dart or Elm or what have you. The natural fundamental abstraction boundary in a web browser is between the user agent and the site’s code, not within the site’s code itself.

conclusion

So, friends, if you are talking to a compiler engineer, by all means: keep describing WebAssembly as an virtual machine. It will keep them interested. But for everyone else, the value of WebAssembly is what it does, which is to be a different way of breaking a system into pieces. Armed with this observation, we can look at current WebAssembly uses to understand the nature of those boundaries, and to look at new boundaries to see if WebAssembly can have a niche there. Happy hacking, and may your components always compose!

sir talks-a-lot

2023-12-12T15:18:14Z

I know, dear reader: of course you have already seen all my talks this year. Your attentions are really too kind and I thank you. But those other people, maybe you share one talk with them, and then they ask you for more, and you have to go stalking back through the archives to slake their nerd-thirst. This happens all the time, right?

I was thinking of you this morning and I said to myself, why don’t I put together a post linking to all of my talks in 2023, so that you can just send them a link; here we are. You are very welcome, it is really my pleasure.

2023 talks

Scheme + Wasm + GC = MVP: Hoot Scheme-to-Wasm compiler update. Wasm standards group, Munich, 11 Oct 2023. slides

Scheme to Wasm: Use and misuse of the GC proposal. Wasm GC subgroup, 18 Apr 2023. slides

A world to win: WebAssembly for the rest of us. BOB, Berlin, 17 Mar 2023. blog slides youtube

Cross-platform mobile UI: “Compilers, compilers everywhere”. EOSS, Prague, 27 June 2023. slides youtube blog blog blog blog blog blog

CPS Soup: A functional intermediate language. Spritely, remote, 10 May 2023. blog slides

Whippet: A new GC for Guile. FOSDEM, Brussels, 4 Feb 2023. blog event slides

but wait, there’s more

Still here? The full talks archive will surely fill your cup.

tree-shaking, the horticulturally misguided algorithm

2023-11-24T11:41:37Z

Let’s talk about tree-shaking!

looking up from the trough

But first, I need to talk about WebAssembly’s dirty secret: despite the hype, WebAssembly has had limited success on the web.

There is Photoshop, which does appear to be a real success. 5 years ago there was Figma, though they don’t talk much about Wasm these days. There are quite a number of little NPM libraries that use Wasm under the hood, usually compiled from C++ or Rust. I think Blazor probably gets used for a few in-house corporate apps, though I could be fooled by their marketing.

You might recall the hyped demos of 3D first-person-shooter games with Unreal engine again from 5 years ago, but that was the previous major release of Unreal and was always experimental; the current Unreal 5 does not support targetting WebAssembly.

Don’t get me wrong, I think WebAssembly is great. It is having fine success in off-the-web environments, and I think it is going to be a key and growing part of the Web platform. I suspect, though, that we are only just now getting past the trough of disillusionment.

It’s worth reflecting a bit on the nature of web Wasm’s successes and failures. Taking Photoshop as an example, I think we can say that Wasm does very well at bringing large C++ programs to the web. I know that it took quite some work, but I understand the end result to be essentially the same source code, just compiled for a different target.

Similarly for the JavaScript module case, Wasm finds success in getting legacy C++ code to the web, and as a way to write new web-targetting Rust code. These are often tasks that JavaScript doesn’t do very well at, or which need a shared implementation between client and server deployments.

On the other hand, WebAssembly has not been a Web success for DOM-heavy apps. Nobody is talking about rewriting the front-end of wordpress.com in Wasm, for example. Why is that? It may sound like a silly question to you: Wasm just isn’t good at that stuff. But why? If you dig down a bit, I think it’s that the programming models are just too different: the Web’s primary programming model is JavaScript, a language with dynamic typing and managed memory, whereas WebAssembly 1.0 was about static typing and linear memory. Getting to the DOM from Wasm was a hassle that was overcome only by the most ardent of the true Wasm faithful.

Relatedly, Wasm has also not really been a success for languages that aren’t, like, C or Rust. I am guessing that wordpress.com isn’t written mostly in C++. One of the sticking points for this class of language. is that C#, for example, will want to ship with a garbage collector, and that it is annoying to have to do this. Check my article from March this year for more details.

Happily, this restriction is going away, as all browsers are going to ship support for reference types and garbage collection within the next months; Chrome and Firefox already ship Wasm GC, and Safari shouldn’t be far behind thanks to the efforts from my colleague Asumu Takikawa. This is an extraordinarily exciting development that I think will kick off a whole ‘nother Gartner hype cycle, as more languages start to update their toolchains to support WebAssembly.

if you don’t like my peaches

Which brings us to the meat of today’s note: web Wasm will win where compilers create compact code. If your language’s compiler toolchain can manage to produce useful Wasm in a file that is less than a handful of over-the-wire kilobytes, you can win. If your compiler can’t do that yet, you will have to instead rely on hype and captured audiences for adoption, which at best results in an unstable equilibrium until you figure out what’s next.

In the JavaScript world, managing bloat and deliverable size is a huge industry. Bundlers like esbuild are a ubiquitous part of the toolchain, compiling down a set of JS modules to a single file that should include only those functions and data types that are used in a program, and additionally applying domain-specific size-squishing strategies such as minification (making monikers more minuscule).

Let’s focus on tree-shaking. The visual metaphor is that you write a bunch of code, and you only need some of it for any given page. So you imagine a tree whose, um, branches are the modules that you use, and whose leaves are the individual definitions in the modules, and you then violently shake the tree, probably killing it and also annoying any nesting birds. The only thing that’s left still attached is what is actually needed.

This isn’t how trees work: holding the trunk doesn’t give you information as to which branches are somehow necessary for the tree’s mission. It also primes your mind to look for the wrong fixed point, removing unneeded code instead of keeping only the necessary code.

But, tree-shaking is an evocative name, and so despite its horticultural and algorithmic inaccuracies, we will stick to it.

The thing is that maximal tree-shaking for languages with a thicker run-time has not been a huge priority. Consider Go: according to the golang wiki, the most trivial program compiled to WebAssembly from Go is 2 megabytes, and adding imports can make this go to 10 megabytes or more. Or look at Pyodide, the Python WebAssembly port: the REPL example downloads about 20 megabytes of data. These are fine sizes for technology demos or, in the limit, very rich applications, but they aren’t winners for web development.

shake a different tree

To be fair, both the built-in Wasm support for Go and the Pyodide port of Python both derive from the upstream toolchains, where producing small binaries is nice but not necessary: on a server, who cares how big the app is? And indeed when targetting smaller devices, we tend to see alternate implementations of the toolchain, for example MicroPython or TinyGo. TinyGo has a Wasm back-end that can apparently go down to less than a kilobyte, even!

These alternate toolchains often come with some restrictions or peculiarities, and although we can consider this to be an evil of sorts, it is to be expected that the target platform exhibits some co-design feedback on the language. In particular, running in the sea of the DOM is sufficiently weird that a Wasm-targetting Python program will necessarily be different than a “native” Python program. Still, I think as toolchain authors we aim to provide the same language, albeit possibly with a different implementation of the standard library. I am sure that the ClojureScript developers would prefer to remove their page documenting the differences with Clojure if they could, and perhaps if Wasm becomes a viable target for Clojurescript, they will.

on the algorithm

To recap: now that it supports GC, Wasm could be a winner for web development in Python and other languages. You would need a different toolchain and an effective tree-shaking algorithm, so that user experience does not degrade. So let’s talk about tree shaking!

I work on the Hoot Scheme compiler, which targets Wasm with GC. We manage to get down to 70 kB or so right now, in the minimal “main” compilation unit, and are aiming for lower; auxiliary compilation units that import run-time facilities (the current exception handler and so on) from the main module can be sub-kilobyte. Getting here has been tricky though, and I think it would be even trickier for Python.

Some background: like Whiffle, the Hoot compiler prepends a prelude onto user code. Tree-shakind happens in a number of places:

partial evaluation will evaluate unused bindings for effect, possibly eliding them
fixing letrec will do the same
CPS frequently traverses the program, following only referenced function, value, and control edges, e.g. via renumbering
There is an explicit dead-code elimination pass which tries to elide unused effect-free allocations, a situation that can arise due to other optimizations
Finally there is a standard library written in raw-ish WebAssembly, whose definitions (globals, tables, imports, functions, etc) are included in the residual binary only as neeeded.

Generally speaking, procedure definitions (functions / closures) are the easy part: you just include only those functions that are referenced by the code. In a language like Scheme, this gets you a long way.

However there are three immediate challenges. One is that the evaluation model for the definitions in the prelude is letrec*: the scope is recursive but ordered. Binding values can call or refer to previously defined values, or capture values defined later. If evaluating the value of a binding requires referring to a value only defined later, then that’s an error. Again, for procedures this is trivially OK, but as soon as you have non-procedure definitions, sometimes the compiler won’t be able to prove this nice “only refers to earlier bindings” property. In that case the fixing letrec (reloaded) algorithm will end up residualizing bindings that are set!, which of all the tree-shaking passes above require the delicate DCE pass to remove them.

Worse, some of those non-procedure definitions are record types, which have vtables that define how to print a record, how to check if a value is an instance of this record, and so on. These vtable callbacks can end up keeping a lot more code alive even if they are never used. We’ll get back to this later.

Similarly, say you print a string via display. Well now not only are you bringing in the whole buffered I/O facility, but you are also calling a highly polymorphic function: display can print anything. There’s a case for bitvectors, so you pull in code for bitvectors. There’s a case for pairs, so you pull in that code too. And so on.

One solution is to instead call write-string, which only writes strings and not general data. You’ll still get the generic buffered I/O facility (ports), though, even if your program only uses one kind of port.

This brings me to my next point, which is that optimal tree-shaking is a flow analysis problem. Consider display: if we know that a program will never have bitvectors, then any code in display that works on bitvectors is dead and we can fold the branches that guard it. But to know this, we have to know what kind of arguments display is called with, and for that we need higher-level flow analysis.

The problem is exacerbated for Python in a few ways. One, because object-oriented dispatch is higher-order programming. How do you know what foo.bar actually means? Depends on foo, which means you have to thread around representations of what foo might be everywhere and to everywhere’s caller and everywhere’s caller’s caller and so on.

Secondly, lookup in Python is generally more dynamic than in Scheme: you have __getattr__ methods (is that it?; been a while since I’ve done Python) everywhere and users might indeed use them. Maybe this is not so bad in practice and flow analysis can exclude this kind of dynamic lookup.

Finally, and perhaps relatedly, the object of tree-shaking in Python is a mess of modules, rather than a big term with lexical bindings. This is like JavaScript, but without the established ecosystem of tree-shaking bundlers; Python has its work cut out for some years to go.

in short

With GC, Wasm makes it thinkable to do DOM programming in languages other than JavaScript. It will only be feasible for mass use, though, if the resulting Wasm modules are small, and that means significant investment on each language’s toolchain. Often this will take the form of alternate toolchains that incorporate experimental tree-shaking algorithms, and whose alternate standard libraries facilitate the tree-shaker.

Welp, I’m off to lunch. Happy wassembling, comrades!

requiem for a stringref

2023-10-19T10:33:58Z

Good day, comrades. Today’s missive is about strings!

a problem for java

Imagine you want to compile a program to WebAssembly, with the new GC support for WebAssembly. Your WebAssembly program will run on web browsers and render its contents using the DOM API: Document.createElement, Document.createTextNode, and so on. It will also use DOM interfaces to read parts of the page and read input from the user.

How do you go about representing your program in WebAssembly? The GC support gives you the ability to define a number of different kinds of aggregate data types: structs (records), arrays, and functions-as-values. Earlier versions of WebAssembly gave you 32- and 64-bit integers, floating-point numbers, and opaque references to host values (externref). This is what you have in your toolbox. But what about strings?

WebAssembly’s historical answer has been to throw its hands in the air and punt the problem to its user. This isn’t so bad: the direct user of WebAssembly is a compiler developer and can fend for themself. Using the primitives above, it’s clear we should represent strings as some kind of array.

The source language may impose specific requirements regarding string representations: for example, in Java, you will want to use an (array i16), because Java’s strings are specified as sequences of UTF-16¹ code units, and Java programs are written assuming that random access to a code unit is constant-time.

Let’s roll with the Java example for a while. It so happens that JavaScript, the main language of the web, also specifies strings in terms of 16-bit code units. The DOM interfaces are optimized for JavaScript strings, so at some point, our WebAssembly program is going to need to convert its (array i16) buffer to a JavaScript string. You can imagine that a high-throughput interface between WebAssembly and the DOM is going to involve a significant amount of copying; could there be a way to avoid this?

Similarly, Java is going to need to perform a number of gnarly operations on its strings, for example, locale-specific collation. This is a hard problem whose solution basically amounts to shipping a copy of libICU in their WebAssembly module; that’s a lot of binary size, and it’s not even clear how to compile libICU in such a way that works on GC-managed arrays rather than linear memory.

Thinking about it more, there’s also the problem of regular expressions. A high-performance regular expression engine is a lot of investment, and not really portable from the native world to WebAssembly, as the main techniques require just-in-time code generation, which is unavailable on Wasm.

This is starting to sound like a terrible system: big binaries, lots of copying, suboptimal algorithms, and a likely ongoing functionality gap. What to do?

a solution for java

One observation is that in the specific case of Java, we could just use JavaScript strings in a web browser, instead of implementing our own string library. We may need to make some shims here and there, but the basic functionality from JavaScript gets us what we need: constant-time UTF-16¹ code unit access from within WebAssembly, and efficient access to browser regular expression, internationalization, and DOM capabilities that doesn’t require copying.

A sort of minimum viable product for improving the performance of Java compiled to Wasm/GC would be to represent strings as externref, which is WebAssembly’s way of making an opaque reference to a host value. You would operate on those values by importing the equivalent of String.prototype.charCodeAt and friends; to get the receivers right you’d need to run them through Function.call.bind. It’s a somewhat convoluted system, but a WebAssembly engine could be taught to recognize such a function and compile it specially, using the same code that JavaScript compiles to.

(Does this sound too complicated or too distasteful to implement? Disabuse yourself of the notion: it’s happening already. V8 does this and other JS/Wasm engines will be forced to follow, as users file bug reports that such-and-such an app is slow on e.g. Firefox but fast on Chrome, and so on and so on. It’s the same dynamic that led asm.js adoption.)

Getting properly good performance will require a bit more, though. String literals, for example, would have to be loaded from e.g. UTF-8 in a WebAssembly data section, then transcoded to a JavaScript string. You need a function that can convert UTF-8 to JS string in the first place; let’s call it fromUtf8Array. An engine can now optimize the array.new_data + fromUtf8Array sequence to avoid the intermediate array creation. It would also be nice to tighten up the typing on the WebAssembly side: having everything be externref imposes a dynamic type-check on each operation, which is something that can’t always be elided.

beyond the web?

“JavaScript strings for Java” has two main limitations: JavaScript and Java. On the first side, this MVP doesn’t give you anything if your WebAssembly host doesn’t do JavaScript. Although it’s a bit of a failure for a universal virtual machine, to an extent, the WebAssembly ecosystem is OK with this distinction: there are different compiler and toolchain options when targetting the web versus, say, Fastly’s edge compute platform.

But does that mean you can’t run Java on Fastly’s cloud? Does the Java compiler have to actually implement all of those things that we were trying to avoid? Will Java actually implement those things? I think the answers to all of those questions is “no”, but also that I expect a pretty crappy outcome.

First of all, it’s not technically required that Java implement its own strings in terms of (array i16). A Java-to-Wasm/GC compiler can keep the strings-as-opaque-host-values paradigm, and instead have these string routines provided by an auxiliary WebAssembly module that itself probably uses (array i16), effectively polyfilling what the browser would give you. The effort of creating this module can be shared between e.g. Java and C#, and the run-time costs for instantiating the module can be amortized over a number of Java users within a process.

However, I don’t expect such a module to be of good quality. It doesn’t seem possible to implement a good regular expression engine that way, for example. And, absent a very good run-time system with an adaptive compiler, I don’t expect the low-level per-codepoint operations to be as efficient with a polyfill as they are on the browser.

Instead, I could see non-web WebAssembly hosts being pressured into implementing their own built-in UTF-16¹ module which has accelerated compilation, a native regular expression engine, and so on. It’s nice to have a portable fallback but in the long run, first-class UTF-16¹ will be everywhere.

beyond java?

The other drawback is Java, by which I mean, Java (and JavaScript) is outdated: if you were designing them today, their strings would not be UTF-16¹.

I keep this little “¹” sigil when I mention UTF-16 because Java (and JavaScript) don’t actually use UTF-16 to represent their strings. UTF-16 is standard Unicode encoding form. A Unicode encoding form encodes a sequence of Unicode scalar values (USVs), using one or two 16-bit code units to encode each USV. A USV is a codepoint: an integer in the range [0,0x10FFFF], but excluding surrogate codepoints: codepoints in the range [0xD800,0xDFFF].

Surrogate codepoints are an accident of history, and occur either when accidentally slicing a two-code-unit UTF-16-encoded-USV in the middle, or when treating an arbitrary i16 array as if it were valid UTF-16. They are annoying to detect, but in practice are here to stay: no amount of wishing will make them go away from Java, JavaScript, C#, or other similar languages from those heady days of the mid-90s. Believe me, I have engaged in some serious wishing, but if you, the virtual machine implementor, want to support Java as a source language, your strings have to be accessible as 16-bit code units, which opens the door (eventually) to surrogate codepoints.

So when I say UTF-16¹, I really mean WTF-16: sequences of any 16-bit code units, without the UTF-16 requirement that surrogate code units be properly paired. In this way, WTF-16 encodes a larger language than UTF-16: not just USV codepoints, but also surrogate codepoints.

The existence of WTF-16 is a consequence of a kind of original sin, originating in the choice to expose 16-bit code unit access to the Java programmer, and which everyone agrees should be somehow firewalled off from the rest of the world. The usual way to do this is to prohibit WTF-16 from being transferred over the network or stored to disk: a message sent via an HTTP POST, for example, will never include a surrogate codepoint, and will either replace it with the U+FFFD replacement codepoint or throw an error.

But within a Java program, and indeed within a JavaScript program, there is no attempt to maintain the UTF-16 requirements regarding surrogates, because any change from the current behavior would break programs. (How many? Probably very, very few. But productively deprecating web behavior is hard to do.)

If it were just Java and JavaScript, that would be one thing, but WTF-16 poses challenges for using JS strings from non-Java languages. Consider that any JavaScript string can be invalid UTF-16: if your language defines strings as sequences of USVs, which excludes surrogates, what do you do when you get a fresh string from JS? Passing your string to JS is fine, because WTF-16 encodes a superset of USVs, but when you receive a string, you need to have a plan.

You only have a few options. You can eagerly check that a string is valid UTF-16; this might be a potentially expensive O(n) check, but perhaps this is acceptable. (This check may be faster in the future.) Or, you can replace surrogate codepoints with U+FFFD, when accessing string contents; lossy, but preserves your language’s semantic domain. Or, you can extend your language’s semantics to somehow deal with surrogate codepoints.

My point is that if you want to use JS strings in a non-Java-like language, your language will need to define what to do with invalid UTF-16. Ideally the browser will give you a way to put your policy into practice: replace with U+FFFD, error, or pass through.

beyond java? (reprise) (feat: snakes)

With that detail out of the way, say you are compiling Python to Wasm/GC. Python’s language reference says: “A string is a sequence of values that represent Unicode code points. All the code points in the range U+0000 - U+10FFFF can be represented in a string.” This corresponds to the domain of JavaScript’s strings; great!

On second thought, how do you actually access the contents of the string? Surely not via the equivalent of JavaScript’s String.prototype.charCodeAt; Python strings are sequences of codepoints, not 16-bit code units.

Here we arrive to the second, thornier problem, which is less about domain and more about idiom: in Python, we expect to be able to access strings by codepoint index. This is the case not only to access string contents, but also to refer to positions in strings, for example when extracting a substring. These operations need to be fast (or fast enough anyway; CPython doesn’t have a very high performance baseline to meet).

However, the web platform doesn’t give us O(1) access to string codepoints. Usually a codepoint just takes up one 16-bit code unit, so the (zero-indexed) 5th codepoint of JS string s may indeed be at s.codePointAt(5), but it may also be at offset 6, 7, 8, 9, or 10. You get the point: finding the nth codepoint in a JS string requires a linear scan from the beginning.

More generally, all languages will want to expose O(1) access to some primitive subdivision of strings. For Rust, this is bytes; 8-bit bytes are the code units of UTF-8. For others like Java or C#, it’s 16-bit code units. For Python, it’s codepoints. When targetting JavaScript strings, there may be a performance impedance mismatch between what the platform offers and what the language requires.

Languages also generally offer some kind of string iteration facility, which doesn’t need to correspond to how a JavaScript host sees strings. In the case of Python, one can implement for char in s: print(char) just fine on top of JavaScript strings, by decoding WTF-16 on the fly. Iterators can also map between, say, UTF-8 offsets and WTF-16 offsets, allowing e.g. Rust to preserve its preferred “strings are composed of bytes that are UTF-8 code units” abstraction.

Our O(1) random access problem remains, though. Are we stuck?

what does the good world look like

How should a language represent its strings, anyway? Here we depart from a precise gathering of requirements for WebAssembly strings, but in a useful way, I think: we should build abstractions not only for what is, but also for what should be. We should favor a better future; imagining the ideal helps us design the real.

I keep returning to Henri Sivonen’s authoritative article, It’s Not Wrong that “🤦🏼‍♂️”.length == 7, But It’s Better that “🤦🏼‍♂️”.len() == 17 and Rather Useless that len(“🤦🏼‍♂️”) == 5. It is so good and if you have reached this point, pop it open in a tab and go through it when you can. In it, Sivonen argues (among other things) that random access to codepoints in a string is not actually important; he thinks that if you were designing Python today, you wouldn’t include this interface in its standard library. Users would prefer extended grapheme clusters, which is variable-length anyway and a bit gnarly to compute; storage wants bytes; array-of-codepoints is just a bad place in the middle. Given that UTF-8 is more space-efficient than either UTF-16 or array-of-codepoints, and that it embraces the variable-length nature of encoding, programming languages should just use that.

As a model for how strings are represented, array-of-codepoints is outdated, as indeed is UTF-16. Outdated doesn’t mean irrelevant, of course; there is lots of Python code out there and we have to support it somehow. But, if we are designing for the future, we should nudge our users towards other interfaces.

There is even a case that a JavaScript engine should represent its strings as UTF-8 internally, despite the fact that JS exposes a UTF-16 view on strings in its API. The pitch is that UTF-8 takes less memory, is probably what we get over the network anyway, and is probably what many of the low-level APIs that a browser uses will want; it would be faster and lighter-weight to pass UTF-8 to text shaping libraries, for example, compared to passing UTF-16 or having to copy when going to JS and when going back. JavaScript engines already have a dozen internal string representations or so (narrow or wide, cons or slice or flat, inline or external, interned or not, and the product of many of those); adding another is just a Small Matter Of Programming that could show benefits, even if some strings have to be later transcoded to UTF-16 because JS accesses them in that way. I have talked with JS engine people in all the browsers and everyone thinks that UTF-8 has a chance at being a win; the drawback is that actually implementing it would take a lot of effort for uncertain payoff.

I have two final data-points to indicate that UTF-8 is the way. One is that Swift used to use UTF-16 to represent its strings, but was able to switch to UTF-8. To adapt to the newer performance model of UTF-8, Swift maintainers designed new APIs to allow users to request a view on a string: treat this string as UTF-8, or UTF-16, or a sequence of codepoints, or even a sequence of extended grapheme clusters. Their users appear to be happy, and I expect that many languages will follow Swift’s lead.

Secondly, as a maintainer of the Guile Scheme implementation, I also want to switch to UTF-8. Guile has long used Python’s representation strategy: array of codepoints, with an optimization if all codepoints are “narrow” (less than 256). The Scheme language exposes codepoint-at-offset (string-ref) as one of its fundamental string access primitives, and array-of-codepoints maps well to this idiom. However, we do plan to move to UTF-8, with a Swift-like breadcrumbs strategy for accelerating per-codepoint access. We hope to lower memory consumption, simplify the implementation, and have general (but not uniform) speedups; some things will be slower but most should be faster. Over time, users will learn the performance model and adapt to prefer string builders / iterators (“string ports”) instead of string-ref.

a solution for webassembly in the browser?

Let’s try to summarize: it definitely makes sense for Java to use JavaScript strings when compiled to WebAssembly/GC, when running on the browser. There is an OK-ish compilation strategy for this use case involving externref, String.prototype.charCodeAt imports, and so on, along with some engine heroics to specially recognize these operations. There is an early proposal to sand off some of the rough edges, to make this use-case a bit more predictable. However, there are two limitations:

Focussing on providing JS strings to Wasm/GC is only really good for Java and friends; the cost of mapping charCodeAt semantics to, say, Python’s strings is likely too high.
JS strings are only present on browsers (and Node and such).

I see the outcome being that Java will have to keep its implementation that uses (array i16) when targetting the edge, and use JS strings on the browser. I think that polyfills will not have acceptable performance. On the edge there will be a binary size penalty and a performance and functionality gap, relative to the browser. Some edge Wasm implementations will be pushed to implement fast JS strings by their users, even though they don’t have JS on the host.

If the JS string builtins proposal were a local maximum, I could see putting some energy into it; it does make the Java case a bit better. However I think it’s likely to be an unstable saddle point; if you are going to infect the edge with WTF-16 anyway, you might as well step back and try to solve a problem that is a bit more general than Java on JS.

stringref: a solution for webassembly?

I think WebAssembly should just bite the bullet and try to define a string data type, for languages that use GC. It should support UTF-8 and UTF-16 views, like Swift’s strings, and support some kind of iterator API that decodes codepoints.

It should be abstract as regards the concrete representation of strings, to allow JavaScript strings to stand in for WebAssembly strings, in the context of the browser. JS hosts will use UTF-16 as their internal representation. Non-JS hosts will likely prefer UTF-8, and indeed an abstract API favors migration of JS engines away from UTF-16 over the longer term. And, such an abstraction should give the user control over what to do for surrogates: allow them, throw an error, or replace with U+FFFD.

What I describe is what the stringref proposal gives you. We don’t yet have consensus on this proposal in the Wasm standardization group, and we may never reach there, although I think it’s still possible. As I understand them, the objections are two-fold:

WebAssembly is an instruction set, like AArch64 or x86. Strings are too high-level, and should be built on top, for example with (array i8).
The requirement to support fast WTF-16 code unit access will mean that we are effectively standardizing JavaScript strings.

I think the first objection is a bit easier to overcome. Firstly, WebAssembly now defines quite a number of components that don’t map to machine ISAs: typed and extensible locals, memory.copy, and so on. You could have defined memory.copy in terms of primitive operations, or required that all local variables be represented on an explicit stack or in a fixed set of registers, but WebAssembly defines higher-level interfaces that instead allow for more efficient lowering to machine primitives, in this case SIMD-accelerated copies or machine-specific sets of registers.

Similarly with garbage collection, there was a very interesting “continuation marks” proposal by Ross Tate that would give a low-level primitive on top of which users could implement root-finding of stack values. However when choosing what to include in the standard, the group preferred a more high-level facility in which a Wasm module declares managed data types and allows the WebAssembly implementation to do as it sees fit. This will likely result in more efficient systems, as a Wasm implementation can more easily use concurrency and parallelism in the GC implementation than a guest WebAssembly module could do.

So, the criteria of what to include in the Wasm standard is not “what is the most minimal primitive that can express this abstraction”, or even “what looks like an ARMv8 instruction”, but rather “what makes Wasm a good compilation target”. Wasm is designed for its compiler-users, not for the machines that it runs on, and if we manage to find an abstract definition of strings that works for Wasm-targetting toolchains, we should think about adding it.

The second objection is trickier. When you compile to Wasm, you need a good model of what the performance of the Wasm code that you emit will be. Different Wasm implementations may use different stringref representations; requesting a UTF-16 view on a string that is already UTF-16 will be cheaper than doing so on a string that is UTF-8. In the worst case, requesting a UTF-16 view on a UTF-8 string is a linear operation on one system but constant-time on another, which in a loop over string contents makes the former system quadratic: a real performance failure that we need to design around.

The stringref proposal tries to reify as much of the cost model as possible with its “view” abstraction; the compiler can reason that any cost may incur then rather than when accessing a view. But, this abstraction can leak, from a performance perspective. What to do?

I think that if we look back on what the expected outcome of the JS-strings-for-Java proposal is, I believe that if Wasm succeeds as a target for Java, we will probably already end up with WTF-16 everywhere. We might as well admit this, I think, and if we do, then this objection goes away. Likewise on the Web I see UTF-8 as being potentially advantageous in the medium-long term for JavaScript, and certainly better for other languages, and so I expect JS implementations to also grow support for fast UTF-8.

i’m on a horse

I may be off in some of my predictions about where things will go, so who knows. In the meantime, in the time that it takes other people to reach the same conclusions, stringref is in a kind of hiatus.

The Scheme-to-Wasm compiler that I work on does still emit stringref, but it is purely a toolchain concept now: we have a post-pass that lowers stringref to WTF-8 via (array i8), and which emits calls to host-supplied conversion routines when passing these strings to and from the host. When compiling to Hoot’s built-in Wasm virtual machine, we can leave stringref in, instead of lowering it down, resulting in more efficient interoperation with the host Guile than if we had to bounce through byte arrays.

So, we wait for now. Not such a bad situation, at least we have GC coming soon to all the browsers. appy hacking to all my stringfolk, and until next time!

parallel futures in mobile application development

2023-06-15T14:02:16Z

Good morning, hackers. Today I’d like to pick up my series on mobile application development. To recap, we looked at:

Ionic/Capacitor, which makes mobile app development more like web app development;
React Native, a flavor of React that renders to platform-native UI components rather than the Web, with ahead-of-time compilation of JavaScript;
NativeScript, which exposes all platform capabilities directly to JavaScript and lets users layer their preferred framework on top;
Flutter, which bypasses the platform’s native UI components to render directly using the GPU, and uses Dart instead of JavaScript/TypeScript; and
Ark, which is Flutter-like in its rendering, but programmed via a dialect of TypeScript, with its own multi-tier compilation and distribution pipeline.

Taking a step back, with the exception of Ark which has a special relationship to HarmonyOS and Huawei, these frameworks are all layers on top of what is provided by Android or iOS. Why would you do that? Presumably there are benefits to these interstitial layers; what are they?

Probably the most basic answer is that an app framework layer offers the promise of abstracting over the different platforms. This way you can just have one mobile application development team instead of two or more. In practice you still need to test on iOS and Android at least, but this is cheaper than having fully separate Android and iOS teams.

Given that we are abstracting over platforms, it is natural also to abandon platform-specific languages like Swift or Kotlin. This is the moment in the strategic planning process that unleashes chaos: there is a fundamental element of randomness and risk when choosing a programming language and its community. Languages exist on a hype and adoption cycle; ideally you want to catch one on its way up, and you want it to remain popular over the life of your platform (10 years or so). This is not an easy thing to do and it’s quite possible to bet on the wrong horse. However the communities around popular languages also bring their own risks, in that they have fashions that change over time, and you might have to adapt your platform to the language as fashions come and go, whether or not these fashions actually make better apps.

Choosing JavaScript as your language places more emphasis on the benefits of popularity, and is in turn a promise to adapt to ongoing fads. Choosing a more niche language like Dart places more emphasis on predictability of where the language will go, and ability to shape the language’s future; Flutter is a big fish in a small pond.

There are other language choices, though; if you are building your own thing, you can choose any direction you like. What if you used Rust? What if you doubled down on WebAssembly, somehow? In some ways we’ll never know unless we go down one of these paths; one has to pick a direction and stick to it for long enough to ship something, and endless tergiversations on such basic questions as language are not helpful. But in the early phases of platform design, all is open, and it would be prudent to spend some time thinking about what it might look like in one of these alternate worlds. In that spirit, let us explore these futures to see how they might be.

alternate world: rust

The arc of history bends away from C and C++ and towards Rust. Given that a mobile development platform has to have some low-level code, there are arguments in favor of writing it in Rust already instead of choosing to migrate in the future.

One advantage of Rust is that programs written in it generally have fewer memory-safety bugs than their C and C++ counterparts, which is important in the context of smart phones that handle untrusted third-party data and programs, i.e., web sites.

Also, Rust makes it easy to write parallel programs. For the same implementation effort, we can expect Rust programs to make more efficient use of the hardware than C++ programs.

And relative to JavaScript et al, Rust also has the advantage of predictable performance: it requires quite a good ahead-of-time compiler, but no adaptive optimization at run-time.

These observations are just conversation-starters, though, and when it comes to imagining what a real mobile device would look like with a Rust application development framework, things get more complicated. Firstly, there is the approach to UI: how do you get pixels on the screen and events from the user? The three general solutions are to use a web browser engine, to use platform-native widgets, or to build everything in Rust using low-level graphics primitives.

The first approach is taken by the Tauri framework: an app is broken into two pieces, a Rust server and an HTML/JS/CSS front-end. Running a Tauri app creates a WebView in which to run the front-end, and establishes a bridge between the web client and the Rust server. In many ways the resulting system ends up looking a lot like Ionic/Capacitor, and many of the UI questions are left open to the user: what UI framework to use, all of the JavaScript programming, and so on.

Instead of using a platform’s WebView library, a Rust app could instead ship a WebView. This would of course make the application binary size larger, but tighter coupling between the app and the WebView may allow you to run the UI logic from Rust itself instead of having a large JS component. Notably this would be an interesting opportunity to adopt the Servo web engine, which is itself written in Rust. Servo is a project that in many ways exists in potentia; with more investment it could become a viable alternative to Gecko, Blink, or WebKit, and whoever does the investment would then be in a position of influence in the web platform.

If we look towards the platform-native side, though there are quite a number of Rust libraries that provide wrappers to native widgets, practically all of these primarily target the desktop. Only cacao supports iOS widgets, and there is no equivalent binding for Android, so any NativeScript-like solution in Rust would require a significant amount of work.

In contrast, the ecosystem of Rust UI libraries that are implemented on top of OpenGL and other low-level graphics facilities is much more active and interesting. Probably the best recent overview of this landscape is by Raph Levien, (see the “quick tour of existing architectures” subsection). In summary, everything is still in motion and there is no established consensus as to how to approach the problem of UI development, but there are many interesting experiments in progress. With my engineer hat on, exploring these directions looks like fun. As Raph notes, some degree of exploration seems necessary as well: we will only know if a given approach is a good idea if we spend some time with it.

However if instead we consider the situation from the perspective of someone building a mobile application development framework, Rust seems more of a mid/long-term strategy than a concrete short-term option. Sure, build low-level libraries in Rust, to the extent possible, but there is no compelling-in-and-of-itself story yet that you can sell to potential UI developers, because everything is still so undecided.

Finally, let us consider the question of scripting: sometimes you need to add logic to a program at run-time. It could be because actually most of your app is dynamic and comes from the network; in that case your app is like a little virtual machine. If your app development framework is written in JavaScript, like Ionic/Capacitor, then you have a natural solution: just serve JavaScript. But if your app is written in Rust, what do you do? Waiting until the app store pushes a new version of the app to the user is not an option.

There would appear to be three common solutions to this problem. One is to use JavaScript – that’s what Servo does, for example. As a web engine, Servo doesn’t have much of a choice, but the point stands. Currently Servo embeds a copy of SpiderMonkey, the JS engine from Firefox, and it does make sense for Servo to take advantage of an industrial, complete JS engine. Of course, SpiderMonkey is written in C++; if there were a JS engine written in Rust, probably Rust programmers would prefer it. Also it would be fun to write, or rather, fun to start writing; reaching the level of ECMA-262 conformance of SpiderMonkey is at least a hundred-million-dollar project. Anyway what I am saying is that I understand why Boa was started, and I wish them the many millions of dollars needed to see it through to completion.

You are not obliged to script your app via JavaScript, of course; there are many languages out there that have “extending a low-level core” as one of their core use cases. I think the mitigated success that this approach has had over the years—who embeds Python into an iPhone app?—should probably rule out this strategy as a core part of an application development framework. Still, I should mention one Rust-specific option, Rhai; the pitch is that by being Rust-specific, you get more expressive interoperation between Rhai and Rust than you would between Rust and any other dynamic language. Still, it is not a solution that I would bet on: Rhai internalizes so many Rust concepts (notably around borrowing and lifetimes) that I think you have to know Rust to write effective Rhai, and knowing both is quite rare. Anyone who writes Rhai would probably rather be writing Rust, and that’s not a good equilibrium.

The third option for scripting Rust is WebAssembly. We’ll get to that in a minute.

alternate world: the web of pixels

Let’s return to Flutter for a moment, if you will. Like the more active Rust GUI development projects, Flutter is an all-in-one rendering framework based on low-level primitives; all it needs is Vulkan or Metal or (soon) WebGPU, and it handles the rest, layering on opinionated patterns for how to build user interfaces. It didn’t arrive to this state in a day, though. To hear Eric Seidel tell the story, Flutter began as a kind of “reset” for the Web, a conscious attempt to determine from the pieces that compose the Web rendering stack, which ones enable smooth user interfaces and which ones get in the way. After taking away all of the parts they didn’t need, Flutter wasn’t left with much: just GPU texture layers, a low-level drawing toolkit, and the necessary bindings to input events. Of course what the application programmer sees is much more high-level, but underneath, these are the platform primitives that Flutter uses.

So, imagine you work at Google. You used to work on the web—maybe on WebKit and then Chrome like Eric, maybe on web standards—but you broke with this past to see what Flutter might become. Flutter works: great job everybody! The set of graphical and input primitives that you use is minimal enough that it is abstract by nature; it doesn’t much matter whether you target iOS or Android, because the primitives will be there. But the web is still the web, and it is annoying, aesthetically speaking. Could we Flutter-ize the web? What would that mean?

That’s exactly what former HTML specification editor and now Flutter team member Ian Hixie proposed this January in a brief manifesto, Towards a modern Web stack. The basic idea is that the web and thus the browser is, well, a bit much. Hixie proposed to start over, rebuilding the web on top of WebAssembly (for code), WebGPU (for graphics), WebHID (for input), and ARIA (for accessibility). Technically it’s a very interesting proposition! After all, people that build complex web apps end up having to fight with the platform to get the results they want; if we can reorient them to focus on these primitives, perhaps web apps can compete better with native apps.

However if you game out what is being proposed, I have doubts. The existing web is largely HTML, with JavaScript and CSS as add-ons: a web of structured text. Hixie’s flutterized web proposal, on the other hand, is a web of pixels. This has a number of implications. One is that each app has to ship its own text renderer and internationalization tables, which is a bit silly to say the least. And whereas we take it for granted that we can mouse over a web page and select its text, with a web of pixels it is much less obvious how that would happen. Hixie’s proposal is that apps expose structure via ARIA, but as far as I understand there is no association between pixels and ARIA properties: the pixels themselves really have no built-in structure to speak of.

And of course unlike in the web of structured text, in a web of pixels it would be up each app to actually describe its structure via ARIA: it’s not a built-in part of the system. But if you combine this with the rendering story (here’s WebGPU, now draw the rest of the owl), Hixie’s proposal leaves a void for frameworks to fill between what the app developer wants to write (e.g. Flutter/Dart) and the platform (WebGPU/ARIA/etc).

I said before that I had doubts and indeed I have doubts about my doubts. I am old enough to remember when X11 apps on Unix desktops changed from having fonts rendered on the server (i.e. by the operating system) to having them rendered on the client (i.e. the app), which was associated with a similar kind of anxiety. There were similar factors at play: slow-moving standards (X11) and not knowing at build-time what the platform would actually provide (which X server would be in use, etc). But instead of using the server, you could just ship pixels, and that’s how GNOME got good text rendering, with Pango and FreeType and fontconfig, and eventually HarfBuzz, the text shaper used in Chromium and Flutter and many other places. Client-side fonts not only enabled more complex text shaping but also eliminated some round-trips for text measurement during UI layout, which is a bit of a theme in this article series. So could it be that pixels instead of text does not represent an apocalypse for the web? I don’t know.

Incidentally I cannot move on from this point without pointing out another narrative thread, which is that of continued human effort over time. Raph Levien, who I mentioned above as a Rust UI toolkit developer, actually spent quite some time doing graphics for GNOME in the early 2000s; I remember working with his libart_lgpl. Behdad Esfahbod, author of HarfBuzz, built many parts of the free software text rendering stack before moving on to Chrome and many other things. I think that if you work on this low level where you are constantly translating text to textures, the accessibility and interaction benefits of using a platform-provided text library start to fade: you are the boss of text around here and you can implement the needed functionality yourself. From this perspective, pixels don’t represent risk at all. In the old days of GNOME 2, client-side font rendering didn’t lead to bad UI or poor accessibility. To be fair, there were other factors pushing to keep work in a commons, as the actual text rendering libraries still tended to be shipped with the operating system as shared libraries. Would similar factors prevail in a statically-linked web of pixels?

In a way it’s a moot question for us, because in this series we are focussing on native app development. So, if you ship a platform, should your app development framework look like the web-of-pixels proposal, or something else? To me it is clear that as a platform, you need more. You need a common development story for how to build user-facing apps: something that looks more like Flutter and less like the primitives that Flutter uses. Though you surely will include a web-of-pixels-like low-level layer, because you need it yourself, probably you should also ship shared text rendering libraries, to reduce the install size for each individual app.

And of course, having text as part of the system has the side benefit of making it easier to get users to install OS-level security patches: it is well-known in the industry that users will make time for the update if they get a new goose emoji in exchange.

alternate world: webassembly

Hark! Have you heard the good word? Have you accepted your Lord and savior, WebAssembly, into your heart? I jest; it does sometime feel like messianic narratives surrounding WebAssembly prevent us from considering its concrete aspects. But despite the hype, WebAssembly is clearly a technology that will be a part of the future of computing. So let’s dive in: what would it mean for a mobile app development platform to embrace WebAssembly?

Before answering that question, a brief summary of what WebAssembly is. WebAssembly 1.0 is portable bytecode format that is a good compilation target for C, C++, and Rust. These languages have good compiler toolchains that can produce WebAssembly. The nice thing is that when you instantiate a WebAssembly module, it is completely isolated from its host: it can’t harm the host (approximately speaking). All points of interoperation with the host are via copying data into memory owned by the WebAssembly guest; the compiler toolchains abstract over these copies, allowing a Rust-compiled-to-native host to call into a Rust-compiled-to-WebAssembly module using idiomatic Rust code.

So, WebAssembly 1.0 can be used as a way to script a Rust application. The guest script can be interpreted, compiled just in time, or compiled ahead of time for peak throughput.

Of course, people that would want to script an application probably want a higher-level language than Rust. In a way, WebAssembly is in a similar situation as WebGPU in the web-of-pixels proposal: it is a low-level tool that needs higher-level toolchains and patterns to bridge the gap between developers and primitives.

Indeed, the web-of-pixels proposal specifies WebAssembly as the compute primitive. The idea is that you ship your application as a WebAssembly module, and give that module WebGPU, WebHID, and ARIA capabilities via imports. Such a WebAssembly module doesn’t script an existing application: it is the app. So another way for an app development platform to use WebAssembly would be like how the web-of-pixels proposes to do it: as an interchange format and as a low-level abstraction. As in the scripting case, you can interpret or compile the module. Perhaps an infrequently-run app would just be interpreted, to save on disk space, whereas a more heavily-used app would be optimized ahead of time, or something.

We should mention another interesting benefit of WebAssembly as a distribution format, which is that it abstracts over the specific chipset on the user’s device; it’s the device itself that is responsible for efficiently executing the program, possibly via compilation to specialized machine code. I understand for example that RISC-V people are quite happy about this property because it lowers the barrier to entry for them relative to an ARM monoculture.

WebAssembly does have some limitations, though. One is that if the throughput of data transfer between guest and host is high, performance can be bad due to copying overhead. The nascent memory-control proposal aims to provide an mmap capability, but it is still early days. The need to copy would be a limitation for using WebGPU primitives.

More generally, as an abstraction, WebAssembly may not be able to express programs in the most efficient way for a given host platform. For example, its SIMD operations work on 128-bit vectors, whereas host platforms may have much wider vectors. Any current limitation will recede with time, as WebAssembly gains new features, but every year brings new hardware capabilities (tensor operation accelerator, anyone?), so there will be some impedance-matching to do for the foreseeable future.

The more fundamental limitation of the 1.0 version of WebAssembly is that it’s only a good compilation target for some languages. This is because some of the fundamental parts of WebAssembly that enable isolation between host and guest (structured control flow, opaque stack, no instruction pointer) make it difficult to efficiently implement languages that need garbage collection, such as Java or Go. The coming WebAssembly 2.0 starts to address this need by including low-level managed arrays and records, allowing for reasonable ahead-of-time compilation of languages like Java. Getting a dynamic language like JavaScript to compile to efficient WebAssembly can still be a challenge, though, because many of the just-in-time techniques needed to efficiently implement these languages will still be missing in WebAssembly 2.0.

Before moving on to WebAssembly as part of an app development framework, one other note: currently WebAssembly modules do not compose very well with each other and with the host, requiring extensive toolchain support to enable e.g. the use of any data type that’s not a scalar integer or floating-point value. The component model working group is trying to establish some abstractions and associated tooling, but (again!) it is still early days. Anyone wading into this space needs to be prepared to get their hands dirty.

To return to the question at hand, an app development framework can use WebAssembly for scripting, though the problem of how to compose a host application with a guest script requires good tooling. Or, an app development framework that exposes a web-of-pixels primitive layer can support running WebAssembly apps directly, though again, the set of imports remains to be defined. Either of these two patterns can stick with WebAssembly 1.0 or also allow for garbage collection in WebAssembly 2.0, aiming to capture mindshare among a broader community of potential developers, potentially in a wide range of languages.

As a final observation: WebAssembly is ecumenical, in the sense that it favors no specific church of how to write programs. As a platform, though, you might prefer a state religion, to avoid wasting internal and external efforts on redundant or ill-advised development. After all, if it’s your platform, presumably you know best.

summary

What is to be done?

Probably there are as many answers as people, but since this is my blog, here are mine:

On the shortest time-scale I think that it is entirely reasonable to base a mobile application development framework on JavaScript. I would particularly focus on TypeScript, as late error detection is more annoying in native applications.
I would to build something that looks like Flutter underneath: reactive, based on low-level primitives, with a multithreaded rendering pipeline. Perhaps it makes sense to take some inspiration from WebF.
In the medium-term I am sympathetic to Ark’s desire to extend the language in a more ResultBuilder-like direction, though this is not without risk.
Also in the medium-term I think that modifications to TypeScript to allow for sound typing could provide some of the advantages of Dart’s ahead-of-time compiler to JavaScript developers.
In the long term... well we can do all things with unlimited resources, right? So after solving climate change and homelessness, it makes sense to invest in frameworks that might be usable 3 or 5 years from now. WebAssembly in particular has a chance of sweeping across all platforms, and the primitives for the web-of-pixels will be present everywhere, so if you manage to produce a compelling application development story targetting those primitives, you could eat your competitors’ lunch.

Well, friends, that brings this article series to an end; it has been interesting for me to dive into this space, and if you have read down to here, I can only think that you are a masochist or that you have also found it interesting. In either case, you are very welcome. Until next time, happy hacking.

a world to win: webassembly for the rest of us

2023-03-20T09:06:42Z

Good day, comrades!

Today I’d like to share the good news that WebAssembly is finally coming for the rest of us weirdos.

A world to win

WebAssembly for the rest of us

17 Mar 2023 – BOB 2023

Andy Wingo

Igalia, S.L.

This is a transcript-alike of a talk that I gave last week at BOB 2023, a gathering in Berlin of people that are using “technologies beyond the mainstream” to get things done: Haskell, Clojure, Elixir, and so on. PDF slides here, and I’ll link the video too when it becomes available.

WebAssembly, the story

WebAssembly is an exciting new universal compute platform

WebAssembly: what even is it? Not a programming language that you would write software in, but rather a compilation target: a sort of assembly language, if you will.

WebAssembly, the pitch

Predictable portable performance

Low-level
Within 10% of native

Reliable composition via isolation

Modules share nothing by default
No nasal demons
Memory sandboxing

Compile your code to WebAssembly for easier distribution and composition

If you look at what the characteristics of WebAssembly are as an abstract machine, to me there are two main areas in which it is an advance over the alternatives.

Firstly it’s “close to the metal” – if you compile for example an image-processing library to WebAssembly and run it, you’ll get similar performance when compared to compiling it to x86-64 or ARMv8 or what have you. (For image processing in particular, native still generally wins because the SIMD primitives in WebAssembly are more narrow and because getting the image into and out of WebAssembly may imply a copy, but the general point remains.) WebAssembly’s instruction set covers a broad range of low-level operations that allows compilers to produce efficient code.

The novelty here is that WebAssembly is both portable while also being successful. We language weirdos know that it’s not enough to do something technically better: you have to also succeed in getting traction for your alternative.

The second interesting characteristic is that WebAssembly is (generally speaking) a principle-of-least-authority architecture: a WebAssembly module starts with access to nothing but itself. Any capabilities that an instance of a module has must be explicitly shared with it by the host at instantiation-time. This is unlike DLLs which have access to all of main memory, or JavaScript libraries which can mutate global objects. This characteristic allows WebAssembly modules to be reliably composed into larger systems.

WebAssembly, the hype

It’s in all browsers! Serve your code to anyone in the world!

It’s on the edge! Run code from your web site close to your users!

Compose a library (eg: Expat) into your program (eg: Firefox), without risk!

It’s the new lightweight virtualization: Wasm is what containers were to VMs! Give me that Kubernetes cash!!!

Again, the remarkable thing about WebAssembly is that it is succeeding! It’s on all of your phones, all your desktop web browsers, all of the content distribution networks, and in some cases it seems set to replace containers in the cloud. Launch the rocket emojis!

WebAssembly, the reality

WebAssembly is a weird backend for a C compiler

Only some source languages are having success on WebAssembly

What about Haskell, Ocaml, Scheme, F#, and so on – what about us?

Are we just lazy? (Well...)

So why aren’t we there? Where is Clojure-on-WebAssembly? Where are the F#, the Elixir, the Haskell compilers? Some early efforts exist, but they aren’t really succeeding. Why is that? Are we just not putting in the effort? Why is it that Rust gets to ride on the rocket ship but Scheme does not?

WebAssembly, the reality (2)

WebAssembly (1.0, 2.0) is not well-suited to garbage-collected languages

Let’s look into why

As it turns out, there is a reason that there is no good Scheme implementation on WebAssembly: the initial version of WebAssembly is a terrible target if your language relies on the presence of a garbage collector. There have been some advances but this observation still applies to the current standardized and deployed versions of WebAssembly. To better understand this issue, let’s dig into the guts of the system to see what the limitations are.

GC and WebAssembly 1.0

Where do garbage-collected values live?

For WebAssembly 1.0, only possible answer: linear memory

(module
  (global $hp (mut i32) (i32.const 0))
  (memory $mem 10)) ;; 640 kB

The primitive that WebAssembly 1.0 gives you to represent your data is what is called linear memory: just a buffer of bytes to which you can read and write. It’s pretty much like what you get when compiling natively, except that the memory layout is more simple. You can obtain this memory in units of 64-kilobyte pages. In the example above we’re going to request 10 pages, for 640 kB. Should be enough, right? We’ll just use it all for the garbage collector, with a bump-pointer allocator. The heap pointer / allocation pointer is kept in the mutable global variable $hp.

(func $alloc (param $size i32) (result i32)
  (local $ret i32)
  (loop $retry
    (local.set $ret (global.get $hp))
    (global.set $hp
      (i32.add (local.get $size) (local.get $ret)))

    (br_if 1
      (i32.lt_u (i32.shr_u (global.get $hp) 16)
                (memory.size))
      (local.get $ret))

    (call $gc)
    (br $retry)))

Here’s what an allocation function might look like. The allocation function $alloc is like malloc: it takes a number of bytes and returns a pointer. In WebAssembly, a pointer to memory is just an offset, which is a 32-bit integer (i32). (Having the option of a 64-bit address space is planned but not yet standard.)

If this is your first time seeing the text representation of a WebAssembly function, you’re in for a treat, but that’s not the point of the presentation :) What I’d like to focus on is the (call $gc) – what happens when the allocation pointer reaches the end of the region?

GC and WebAssembly 1.0 (2)

What hides behind (call $gc) ?

Ship a GC over linear memory

Stop-the-world, not parallel, not concurrent

But... roots.

The first thing to note is that you have to provide the $gc yourself. Of course, this is doable – this is what we do when compiling to a native target.

Unfortunately though the multithreading support in WebAssembly is somewhat underpowered; it lets you share memory and use atomic operations but you have to create the threads outside WebAssembly. In practice probably the GC that you ship will not take advantage of threads and so it will be rather primitive, deferring all collection work to a stop-the-world phase.

GC and WebAssembly 1.0 (3)

Live objects are

the roots
any object referenced by a live object

Roots are globals and locals in active stack frames

No way to visit active stack frames

What’s worse though is that you have no access to roots on the stack. A GC has to keep live objects, as defined circularly as any object referenced by a root, or any object referenced by a live object. It starts with the roots: global variables and any GC-managed object referenced by an active stack frame.

But there we run into problems, because in WebAssembly (any version, not just 1.0) you can’t iterate over the stack, so you can’t find active stack frames, so you can’t find the stack roots. (Sometimes people want to support this as a low-level capability but generally speaking the consensus would appear to be that overall performance will be better if the engine is the one that is responsible for implementing the GC; but that is foreshadowing!)

GC and WebAssembly 1.0 (3)

Workarounds

handle stack for precise roots
spill all possibly-pointer values to linear memory and collect conservatively

Handle book-keeping a drag for compiled code

Given the noniterability of the stack, there are basically two work-arounds. One is to have the compiler and run-time maintain an explicit stack of object roots, which the garbage collector can know for sure are pointers. This is nice because it lets you move objects. But, maintaining the stack is overhead; the state of the art solution is rather to create a side table (a “stack map”) associating each potential point at which GC can be called with instructions on how to find the roots.

The other workaround is to spill the whole stack to memory. Or, possibly just pointer-like values; anyway, you conservatively scan all words for things that might be roots. But instead of having access to the memory to which the WebAssembly implementation would spill your stack, you have to do it yourself. This can be OK but it’s sub-optimal; see my recent post on the Whippet garbage collector for a deeper discussion of the implications of conservative root-finding.

GC and WebAssembly 1.0 (4)

Cycles with external objects (e.g. JavaScript) uncollectable

A pointer to a GC-managed object is an offset to linear memory, need capability over linear memory to read/write object from outside world

No way to give back memory to the OS

Gut check: gut says no

If that were all, it would already be not so great, but it gets worse! Another problem with linear-memory GC is that it limits the potential for composing a number of modules and the host together, because the garbage collector that manages JavaScript objects in a web browser knows nothing about your garbage collector over your linear memory. You can easily create memory leaks in a system like that.

Also, it’s pretty gross that a reference to an object in linear memory requires arbitrary read-write access over all of linear memory in order to read or write object fields. How do you build a reliable system without invariants?

Finally, once you collect garbage, and maybe you manage to compact memory, you can’t give anything back to the OS. There are proposals in the works but they are not there yet.

If the BOB audience had to choose between Worse is Better and The Right Thing, I think the BOB audience is much closer to the Right Thing. People like that feel instinctual revulsion to ugly systems and I think GC over linear memory describes an ugly system.

GC and WebAssembly 1.0 (5)

There is already a high-performance concurrent parallel compacting GC in the browser

Halftime: C++ N – Altlangs 0

The kicker is that WebAssembly 1.0 requires you to write and deliver a terrible GC when there is already probably a great GC just sitting there in the host, one that has hundreds of person-years of effort invested in it, one that will surely do a better job than you could ever do. WebAssembly as hosted in a web browser should have access to the browser’s garbage collector!

I have the feeling that while those of us with a soft spot for languages with garbage collection have been standing on the sidelines, Rust and C++ people have been busy on the playing field scoring goals. Tripping over the ball, yes, but eventually they do manage to make within striking distance.

Change is coming!

Support for built-in GC set to ship in Q4 2023

With GC, the material conditions are now in place

Let’s compile our languages to WebAssembly

But to continue the sportsball metaphor, I think in the second half our players will finally be able to get out on the pitch and give it the proverbial 110%. Support for garbage collection is coming to WebAssembly users, and I think even by the end of the year it will be shipping in major browsers. This is going to be big! We have a chance and we need to sieze it.

Scheme to Wasm

Spritely + Igalia working on Scheme to WebAssembly

Avoid truncating language to platform; bring whole self

Value representation
Varargs
Tail calls
Delimited continuations
Numeric tower

Even with GC, though, WebAssembly is still a weird machine. It would help to see the concrete approaches that some languages of interest manage to take when compiling to WebAssembly.

In that spirit, the rest of this article/presentation is a walkthough of the approach that I am taking as I work on a WebAssembly compiler for Scheme. (Thanks to Spritely for supporting this work!)

Before diving in, a meta-note: when you go to compile a language to, say, JavaScript, you are mightily tempted to cut corners. For example you might implement numbers as JavaScript numbers, or you might omit implementing continuations. In this work I am trying to not cut corners, and instead to implement the language faithfully. Sometimes this means I have to work around weirdness in WebAssembly, and that’s OK.

When thinking about Scheme, I’d like to highlight a few specific areas that have interesting translations. We’ll start with value representation, which stays in the GC theme from the introduction.

Scheme to Wasm: Values

;;       any  extern  func
;;        |
;;        eq
;;     /  |   \
;; i31 struct  array

The unitype: (ref eq)

Immediate values in (ref i31)

fixnums with 30-bit range
chars, bools, etc

Explicit nullability: (ref null eq) vs (ref eq)

The GC extensions for WebAssembly are phrased in terms of a type system. Oddly, there are three top types; as far as I understand it, this is the result of a compromise about how WebAssembly engines might want to represent these different kinds of values. For example, an opaque JavaScript value flowing into a WebAssembly program would have type (ref extern). On a system with NaN boxing, you would need 64 bits to represent a JS value. On the other hand a native WebAssembly object would be a subtype of (ref any), and might be representable in 32 bits, either because it’s a 32-bit system or because of pointer compression.

Anyway, three top types. The user can define subtypes of struct and array, instantiate values of those types, and access their fields. The life cycle of reference-typed objects is automatically managed by the run-time, which is just another way of saying they are garbage-collected.

For Scheme, we need a common supertype for all values: the unitype, in Bob Harper’s memorable formulation. We can use (ref any), but actually we’ll use (ref eq) – this is the supertype of values that can be compared by (pointer) identity. So now we can code up eq?:

(func $eq? (param (ref eq) (ref eq))
           (result i32)
  (ref.eq (local.get a) (local.get b)))

Generally speaking in a Scheme implementation there are immediates and heap objects. Immediates can be encoded in the bits of a value, whereas for heap object the bits of a value encode a reference (pointer) to an object on the garbage-collected heap. We usually represent small integers as immediates, as well as booleans and other oddball values.

Happily, WebAssembly gives us an immediate value type, i31. We’ll encode our immediates there, and otherwise represent heap objects as instances of struct subtypes.

Scheme to Wasm: Values (2)

Heap objects subtypes of struct; concretely:

(struct $heap-object
  (struct (field $tag-and-hash i32)))
(struct $pair
  (sub $heap-object
    (struct i32 (ref eq) (ref eq))))

GC proposal allows subtyping on structs, functions, arrays

Structural type equivalance: explicit tag useful

We actually need to have a common struct supertype as well, for two reasons. One is that we need to be able to hash Scheme values by identity, but for this we need an embedded lazily-initialized hash code. It’s a bit annoying to take the per-object memory hit but it’s a reality, and the JVM does it this way, so it must not be so terrible.

The other reason is more subtle: WebAssembly’s type system is built in such a way that types that are “structurally” equivalent are indistinguishable. So a pair has two fields, besides the hash, but there might be a number of other fundamental object types that have the same shape; you can’t fully rely on WebAssembly’s dynamic type checks (ref.test et al) to be able to query the type of a value. Instead we re-use the low bits of the hash word to include a type tag, which might be 1 for pairs, 2 for vectors, 3 for closures, and so on.

Scheme to Wasm: Values (3)

(func $cons (param (ref eq)
                   (ref eq))
            (result (ref $pair))
  (struct.new_canon $pair
    ;; Assume heap tag for pairs is 1.
    (i32.const 1)
    ;; Car and cdr.
    (local.get 0)
    (local.get 1)))

(func $%car (param (ref $pair))
            (result (ref eq))
  (struct.get $pair 1 (local.get 0)))

With this knowledge we can define cons, as a simple call to struct.new_canon pair.

I didn’t have time for this in the talk, but there is a ghost haunting this code: the ghost of nominal typing. See, in a web browser at least, every heap object will have its first word point to its “hidden class” / “structure” / “map” word. If the engine ever needs to check that a value is of a specific shape, it can do a quick check on the map word’s value; if it needs to do deeper introspection, it can dereference that word to get more details.

Under the hood, testing whether a (ref eq) is a pair or not should be a simple check that it’s a (ref struct) (and not a fixnum), and then a comparison of its map word to the run-time type corresponding to $pair. If subtyping of $pair is allowed, we start to want inline caches to handle polymorphism, but the checking the map word is still the basic mechanism.

However, as I mentioned, we only have structural equality of types; two (struct (ref eq)) type definitions will define the same type and have the same map word (run-time type / RTT). Hence the _canon in the name of struct.new_canon $pair: we create an instance of $pair, with the canonical run-time-type for objects having $pair-shape.

In earlier drafts of the WebAssembly GC extensions, users could define their own RTTs, which effectively amounts to nominal typing: not only does this object have the right structure, but was it created with respect to this particular RTT. But, this facility was cut from the first release, and it left ghosts in the form of these _canon suffixes on type constructor instructions.

For the Scheme-to-WebAssembly effort, we effectively add back in a degree of nominal typing via type tags. For better or for worse this results in a so-called “open-world” system: you can instantiate a separately-compiled WebAssembly module that happens to define the same types and use the same type tags and it will be able to happily access the contents of Scheme values from another module. If you were to use nominal types, you would’t be able to do so, unless there were some common base module that defined and exported the types of interests, and which any extension module would need to import.

(func $car (param (ref eq)) (result (ref eq))
  (local (ref $pair))
  (block $not-pair
    (br_if $not-pair
      (i32.eqz (ref.test $pair (local.get 0))))
    (local.set 1 (ref.cast $pair) (local.get 0))
    (br_if $not-pair
      (i32.ne
        (i32.const 1)
        (i32.and
          (i32.const 0xff)
          (struct.get $heap-object 0 (local.get 1)))))
    (return_call $%car (local.get 1)))

  (call $type-error)
  (unreachable))

In the previous example we had $%car, with a funny % in the name, taking a (ref $pair) as an argument. But in the general case (barring compiler heroics) car will take an instance of the unitype (ref eq). To know that it’s actually a pair we have to make two checks: one, that it is a struct and has the $pair shape, and two, that it has the right tag. Oh well!

Scheme to Wasm

Value representation
Varargs
Tail calls
Delimited continuations
Numeric tower

But with all of that I think we have a solid story on how to represent values. I went through all of the basic value types in Guile and checked that they could all be represented using GC types, and it seems that all is good. Now on to the next point: varargs.

Scheme to Wasm: Varargs

(list 'hey)      ;; => (hey)
(list 'hey 'bob) ;; => (hey bob)

Problem: Wasm functions strongly typed

(func $list (param ???) (result (ref eq))
  ???)

Solution: Virtualize calling convention

In WebAssembly, you define functions with a type, and it is impossible to call them in an unsound way. You must call $car exactly 2 arguments or it will not compile, and those arguments have to be of specific types, and so on. But Scheme doesn’t enforce these restrictions on the language level, bless its little miscreant heart. You can call car with 5 arguments, and you’ll get a run-time error. There are some functions that can take a variable number of arguments, doing different things depending on incoming argument count.

How do we square these two approaches to function types?

;; "Registers" for args 0 to 3
(global $arg0 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg1 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg2 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg3 (mut (ref eq)) (i31.new (i32.const 0)))

;; "Memory" for the rest
(type $argv (array (ref eq)))
(global $argN (ref $argv)
        (array.new_canon_default
          $argv (i31.const 42) (i31.new (i32.const 0))))

Uniform function type: argument count as sole parameter

Callee moves args to locals, possibly clearing roots

The approach we are taking is to virtualize the calling convention. In the same way that when calling an x86-64 function, you pass the first argument in $rdi, then $rsi, and eventually if you run out of registers you put arguments in memory, in the same way we’ll pass the first argument in the $arg0 global, then $arg1, and eventually in memory if needed. The function will receive the number of incoming arguments as its sole parameter; in fact, all functions will be of type (func (param i32)).

The expectation is that after checking argument count, the callee will load its arguments from globals / memory to locals, which the compiler can do a better job on than globals. We might not even emit code to null out the argument globals; might leak a little memory but probably would be a win.

You can imagine a world in which $arg0 actually gets globally allocated to $rdi, because it is only live during the call sequence; but I don’t think that world is this one :)

Scheme to Wasm

Value representation
Varargs
Tail calls
Delimited continuations
Numeric tower

Great, two points out of the way! Next up, tail calls.

Scheme to Wasm: Tail calls

;; Call known function
(return_call $f arg ...)

;; Call function by value
(return_call_ref $type callee arg ...)

Friends – I almost cried making this slide. We Schemers are used to working around the lack of tail calls, and I could have done so here, but it’s just such a relief that these functions are just going to be there and I don’t have to think much more about them. Technically speaking the proposal isn’t merged yet; checking the phases document it’s at the last station before headed to the great depot in the sky. But, soon soon it will be present and enabled in all WebAssembly implementations, and we should build systems now that rely on it.

Scheme to Wasm

Value representation
Varargs
Tail calls
Delimited continuations
Numeric tower

Next up, my favorite favorite topic: delimited continuations.

Scheme to Wasm: Prompts (1)

Problem: Lightweight threads/fibers, exceptions

Possible solutions

Eventually, built-in coroutines
binaryen’s asyncify (not yet ready for GC); see Julia
Delimited continuations

“Bring your whole self”

Before diving in though, one might wonder why bother. Delimited continuations are a building-block that one can use to build other, more useful things, notably exceptions and light-weight threading / fibers. Could there be another way of achieving these end goals without having to implement this relatively uncommon primitive?

For fibers, it is possible to implement them in terms of a built-in coroutine facility. The standards body seems willing to include a coroutine primitive, but it seems far off to me; not within the next 3-4 years I would say. So let’s put that to one side.

There is a more near-term solution, to use asyncify to implement coroutines somehow; but my understanding is that asyncify is not ready for GC yet.

For the Guile flavor of Scheme at least, delimited continuations are table stakes of their own right, so given that we will have them on WebAssembly, we might as well use them to implement fibers and exceptions in the same way as we do on native targets. Why compromise if you don’t have to?

Scheme to Wasm: Prompts (2)

Prompts delimit continuations

(define k
  (call-with-prompt ’foo
    ; body
    (lambda ()
      (+ 34 (abort-to-prompt 'foo)))
    ; handler
    (lambda (continuation)
      continuation)))

(k 10)       ;; ⇒ 44
(- (k 10) 2) ;; ⇒ 42

k is the _ in (lambda () (+ 34 _))

There are a few ways to implement delimited continuations, but my usual way of thinking about them is that a delimited continuation is a slice of the stack. One end of the slice is the prompt established by call-with-prompt, and the other by the continuation of the call to abort-to-prompt. Capturing a slice pops it off the stack, copying it out to the heap as a callable function. Calling that function splats the captured slice back on the stack and resumes it where it left off.

Scheme to Wasm: Prompts (3)

Delimited continuations are stack slices

Make stack explicit via minimal continuation-passing-style conversion

Turn all calls into tail calls
Allocate return continuations on explicit stack
Breaks functions into pieces at non-tail calls

This low-level intuition of what a delimited continuation is leads naturally to an implementation; the only problem is that we can’t slice the WebAssembly call stack. The workaround here is similar to the varargs case: we virtualize the stack.

The mechanism to do so is a continuation-passing-style (CPS) transformation of each function. Functions that make no calls, such as leaf functions, don’t need to change at all. The same goes for functions that make only tail calls. For functions that make non-tail calls, we split them into pieces that preserve the only-tail-calls property.

Scheme to Wasm: Prompts (4)

Before a non-tail-call:

Push live-out vars on stacks (one stack per top type)
Push continuation as funcref
Tail-call callee

Return from call via pop and tail call:

(return_call_ref (call $pop-return)
                 (i32.const 0))

After return, continuation pops state from stacks

Consider a simple function:

(define (f x y)
  (+ x (g y))

Before making a non-tail call, a “tailified” function will instead push all live data onto an explicitly-managed stack and tail-call the callee. It also pushes on the return continuation. Returning from the callee pops the return continuation and tail-calls it. The return continuation pops the previously-saved live data and continues.

In this concrete case, tailification would split f into two pieces:

(define (f x y)
  (push! x)
  (push-return! f-return-continuation-0)
  (g y))

(define (f-return-continuation-0 g-of-y)
  (define k (pop-return!))
  (define x (pop! x))
  (k (+ x g-of-y)))

Now there are no non-tail calls, besides calls to run-time routines like push! and + and so on. This transformation is implemented by tailify.scm.

Scheme to Wasm: Prompts (5)

abort-to-prompt:

Pop stack slice to reified continuation object
Tail-call new top of stack: prompt handler

Calling a reified continuation:

Push stack slice
Tail-call new top of stack

No need to wait for effect handlers proposal; you can have it all now!

The salient point is that the stack on which push! operates (in reality, probably four or five stacks: one in linear memory or an array for types like i32 or f64, three for each of the managed top types any, extern, and func, and one for the stack of return continuations) are managed by us, so we can slice them.

Someone asked in the talk about whether the explicit memory traffic and avoiding the return-address-buffer branch prediction is a source of inefficiency in the transformation and I have to say, yes, but I don’t know by how much. I guess we’ll find out soon.

Scheme to Wasm

Value representation
Varargs
Tail calls
Delimited continuations
Numeric tower

Okeydokes, last point!

Scheme to Wasm: Numbers

Numbers can be immediate: fixnums

Or on the heap: bignums, fractions, flonums, complex

Supertype is still ref eq

Consider imports to implement bignums

On web: BigInt
On edge: Wasm support module (mini-gmp?)

Dynamic dispatch for polymorphic ops, as usual

First, I would note that sometimes the compiler can unbox numeric operations. For example if it infers that a result will be an inexact real, it can use unboxed f64 instead of library routines working on heap flonums ((struct i32 f64); the initial i32 is for the hash and tag). But we still need a story for the general case that involves dynamic type checks.

The basic idea is that we get to have fixnums and heap numbers. Fixnums will handle most of the integer arithmetic that we need, and will avoid allocation. We’ll inline most fixnum operations as a fast path and call out to library routines otherwise. Of course fixnum inputs may produce a bignum output as well, so the fast path sometimes includes another slow-path callout.

We want to minimize binary module size. In an ideal compile-to-WebAssembly situation, a small program will have a small module size, down to a minimum of a kilobyte or so; larger programs can be megabytes, if the user experience allows for the download delay. Binary module size will be dominated by code, so that means we need to plan for aggressive dead-code elimination, minimize the size of fast paths, and also minimize the size of the standard library.

For numbers, we try to keep module size down by leaning on the platform. In the case of bignums, we can punt some of this work to the host; on a JavaScript host, we would use BigInt, and on a WASI host we’d compile an external bignum library. So that’s the general story: inlined fixnum fast paths with dynamic checks, and otherwise library routine callouts, combined with aggressive whole-program dead-code elimination.

Scheme to Wasm

Value representation
Varargs
Tail calls
Delimited continuations
Numeric tower

Hey I think we did it! Always before when I thought about compiling Scheme or Guile to the web, I got stuck on some point or another, was tempted down the corner-cutting alleys, and eventually gave up before starting. But finally it would seem that the stars are aligned: we get to have our Scheme and run it too.

Miscellenea

Debugging: The wild west of DWARF; prompts

Strings: stringref host strings spark joy

JS interop: Export accessors; Wasm objects opaque to JS. externref.

JIT: A whole ’nother talk!

AOT: wasm2c

Of course, like I said, WebAssembly is still a weird machine: as a compilation target but also at run-time. Debugging is a right proper mess; perhaps some other article on that some time.

How to represent strings is a surprisingly gnarly question; there is tension within the WebAssembly standards community between those that think that it’s possible for JavaScript and WebAssembly to share an underlying string representation, and those that think that it’s a fool’s errand and that copying is the only way to go. I don’t know which side will prevail; perhaps more on that as well later on.

Similarly the whole interoperation with JavaScript question is very much in its early stages, with the current situation choosing to err on the side of nothing rather than the wrong thing. You can pass a WebAssembly (ref eq) to JavaScript, but JavaScript can’t do anything with it: it has no prototype. The state of the art is to also ship a JS run-time that wraps each wasm object, proxying exported functions from the wasm module as object methods.

Finally, some language implementations really need JIT support, like PyPy. There, that’s a whole ‘nother talk!

WebAssembly for the rest of us

With GC, WebAssembly is now ready for us

Getting our languages on WebAssembly now a S.M.O.P.

Let’s score some goals in the second half!

(visit-links
 "gitlab.com/spritely/guile-hoot-updates"
 "wingolog.org"
 "wingo@igalia.com"
 "igalia.com"
 "mastodon.social/@wingo")

WebAssembly has proven to have some great wins for C, C++, Rust, and so on – but now it’s our turn to get in the game. GC is coming and we as a community need to be getting our compilers and language run-times ready. Let’s put on the coffee and bang some bytes together; it’s still early days and there’s a world to win out there for the language community with the best WebAssembly experience. The game is afoot: happy consing!

pre-initialization of garbage-collected webassembly heaps

2023-03-10T09:20:33Z

Hey comrades, I just had an idea that I won’t be able to work on in the next couple months and wanted to release it into the wild. They say if you love your ideas, you should let them go and see if they come back to you, right? In that spirit I abandon this idea to the woods.

Basically the idea is Wizer-like pre-initialization of WebAssembly modules, but for modules that store their data on the GC-managed heap instead of just in linear memory.

Say you have a WebAssembly module with GC types. It might look like this:

(module
  (type $t0 (struct (ref eq)))
  (type $t1 (struct (ref $t0) i32))
  (type $t2 (array (mut (ref $t1))))
  ...
  (global $g0 (ref null eq)
    (ref.null eq))
  (global $g1 (ref $t1)
    (array.new_canon $t0 (i31.new (i32.const 42))))
  ...
  (function $f0 ...)
  ...)

You define some struct and array types, there are some global variables, and some functions to actually do the work. (There are probably also tables and other things but I am simplifying.)

If you consider the object graph of an instantiated module, you will have some set of roots R that point to GC-managed objects. The live objects in the heap are the roots and any object referenced by a live object.

Let us assume a standalone WebAssembly module. In that case the set of types T of all objects in the heap is closed: it can only be one of the types $t0, $t1, and so on that are defined in the module. These types have a partial order and can thus be sorted from most to least specific. Let’s assume that this sort order is just the reverse of the definition order, for now. Therefore we can write a general type introspection function for any object in the graph:

(func $introspect (param $obj anyref)
  (block $L2 (ref $t2)
    (block $L1 (ref $t1)
      (block $L0 (ref $t0)
        (br_on_cast $L2 $t2 (local.get $obj))
        (br_on_cast $L1 $t1 (local.get $obj))
        (br_on_cast $L0 $t0 (local.get $obj))
        (unreachable))
      ;; Do $t0 things...
      (return))
    ;; Do $t1 things...
    (return))
  ;; Do $t2 things...
  (return))

In particular, given a WebAssembly module, we can generate a function to trace edges in an object graph of its types. Using this, we can identify all live objects, and what’s more, we can take a snapshot of those objects:

(func $snapshot (result (ref (array (mut anyref))))
  ;; Start from roots, use introspect to find concrete types
  ;; and trace edges, use a worklist, return an array of
  ;; all live objects in topological sort order
  )

Having a heap snapshot is interesting for introspection purposes, but my interest is in having fast start-up. Many programs have a kind of “initialization” phase where they get the system up and running, and only then proceed to actually work on the problem at hand. For example, when you run python3 foo.py, Python will first spend some time parsing and byte-compiling foo.py, importing the modules it uses and so on, and then will actually run foo.py‘s code. Wizer lets you snapshot the state of a module after initialization but before the real work begins, which can save on startup time.

For a GC heap, we actually have similar possibilities, but the mechanism is different. Instead of generating an array of all live objects, we could generate a serialized state of the heap as bytecode, and another function to read the bytecode and reload the heap:

(func $pickle (result (ref (array (mut i8))))
  ;; Return an array of bytecode which, when interpreted,
  ;; can reconstruct the object graph and set the roots
  )
(func $unpickle (param (ref (array (mut i8))))
  ;; Interpret the bytecode, building object graph in
  ;; topological order
  )

The unpickler is module-dependent: it will need one case to construct each concrete type $tN in the module. Therefore the bytecode grammar would be module-dependent too.

What you would get with a bytecode-based $pickle/$unpickle pair would be the ability to serialize and reload heap state many times. But for the pre-initialization case, probably that’s not precisely what you want: you want to residualize a new WebAssembly module that, when loaded, will rehydrate the heap. In that case you want a function like:

(func $make-init (result (ref (array (mut i8))))
  ;; Return an array of WebAssembly code which, when
  ;; added to the module as a function and invoked, 
  ;; can reconstruct the object graph and set the roots.
  )

Then you would use binary tools to add that newly generated function to the module.

In short, there is a space open for a tool which takes a WebAssembly+GC module M and produces M’, a module which contains a $make-init function. Then you use a WebAssembly+GC host to load the module and call the $make-init function, resulting in a WebAssembly function $init which you then patch in to the original M to make M’’, which is M pre-initialized for a given task.

Optimizations

Some of the object graph is constant; for example, an instance of a struct type that has no mutable fields. These objects don’t have to be created in the init function; they can be declared as new constant global variables, which an engine may be able to initialize more efficiently.

The pre-initialized module will still have an initialization phase in which it builds the heap. This is a constant function and it would be nice to avoid it. Some WebAssembly hosts will be able to run pre-initialization and then snapshot the GC heap using lower-level facilities (copy-on-write mappings, pointer compression and relocatable cages, pre-initialization on an internal level...). This would potentially decrease latency and may allow for cross-instance memory sharing.

Limitations

There are five preconditions to be able to pickle and unpickle the GC heap:

The set of concrete types in a module must be closed.
The roots of the GC graph must be enumerable.
The object-graph edges from each live object must be enumerable.
To prevent cycles, we have to know when an object has been visited: objects must have identity.
We must be able to create each type in a module.

I think there are three limitations to this pre-initialization idea in practice.

One is externref; these values come from the host and are by definition not introspectable by WebAssembly. Let’s keep the closed-world assumption and consider the case where the set of external reference types is closed also. In that case if a module allows for external references, we can perhaps make its pickling routines call out to the host to (2) provide any external roots (3) identify edges on externref values (4) compare externref values for identity and (5) indicate some imported functions which can be called to re-create exernal objects.

Another limitation is funcref. In practice in the current state of WebAssembly and GC, you will only have a funcref which is created by ref.func, and which (3) therefore has no edges and (5) can be re-created by ref.func. However neither WebAssembly nor the JS API has no way of knowing which function index corresponds to a given funcref. Including function references in the graph would therefore require some sort of host-specific API. Relatedly, function references are not comparable for equality (func is not a subtype of eq), which is a little annoying but not so bad considering that function references can’t participate in a cycle. Perhaps a solution though would be to assume (!) that the host representation of a funcref is constant: the JavaScript (e.g.) representations of (ref.func 0) and (ref.func 0) are the same value (in terms of ===). Then you could compare a given function reference against a set of known values to determine its index. Note, when function references are expanded to include closures, we will have more problems in this area.

Finally, there is the question of roots. Given a module, we can generate a function to read the values of all reference-typed globals and of all entries in all tables. What we can’t get at are any references from the stack, so our object graph may be incomplete. Perhaps this is not a problem though, because when we unpickle the graph we won’t be able to re-create the stack anyway.

OK, that’s my idea. Have at it, hackers!