wingolog

approaching cps soup

20 May 2023 7:10 AM (guile | igalia | spritely | cps soup | cps | compilers | flow analysis)

Good evening, hackers. Today's missive is more of a massive, in the sense that it's another presentation transcript-alike; these things always translate to many vertical pixels.

In my defense, I hardly ever give a presentation twice, so not only do I miss out on the usual per-presentation cost amortization and on the incremental improvements of repetition, the more dire error is that whatever message I might have can only ever reach a subset of those that it might interest; here at least I can be more or less sure that if the presentation would interest someone, that they will find it.

So for the time being I will try to share presentations here, in the spirit of, well, why the hell not.

CPS Soup

A functional intermediate language

10 May 2023 – Spritely

Andy Wingo

Igalia, S.L.

Last week I gave a training talk to Spritely Institute collaborators on the intermediate representation used by Guile's compiler.

CPS Soup

Compiler: Front-end to Middle-end to Back-end

Middle-end spans gap between high-level source code (AST) and low-level machine code

Programs in middle-end expressed in intermediate language

CPS Soup is the language of Guile’s middle-end

An intermediate representation (IR) (or intermediate language, IL) is just another way to express a computer program. Specifically it's the kind of language that is appropriate for the middle-end of a compiler, and by "appropriate" I meant that an IR serves a purpose: there has to be a straightforward transformation to the IR from high-level abstract syntax trees (ASTs) from the front-end, and there has to be a straightforward translation from IR to machine code.

There are also usually a set of necessary source-to-source transformations on IR to "lower" it, meaning to make it closer to the back-end than to the front-end. There are usually a set of optional transformations to the IR to make the program run faster or allocate less memory or be more simple: these are the optimizations.

"CPS soup" is Guile's IR. This talk presents the essentials of CPS soup in the context of more traditional IRs.

How to lower?

High-level:

(+ 1 (if x 42 69))

Low-level:

  cmpi $x, #f
  je L1
  movi $t, 42
  j L2  
L1:
  movi $t, 69
L2:
  addi $t, 1

How to get from here to there?

Before we dive in, consider what we might call the dynamic range of an intermediate representation: we start with what is usually an algebraic formulation of a program and we need to get down to a specific sequence of instructions operating on registers (unlimited in number, at this stage; allocating to a fixed set of registers is a back-end concern), with explicit control flow between them. What kind of a language might be good for this? Let's attempt to answer the question by looking into what the standard solutions are for this problem domain.

1970s

Control-flow graph (CFG)

graph := array<block>
block := tuple<preds, succs, insts>
inst  := goto B
       | if x then BT else BF
       | z = const C
       | z = add x, y
       ...

BB0: if x then BB1 else BB2
BB1: t = const 42; goto BB3
BB2: t = const 69; goto BB3
BB3: t2 = addi t, 1; ret t2

Assignment, not definition

Of course in the early days, there was no intermediate language; compilers translated ASTs directly to machine code. It's been a while since I dove into all this but the milestone I have in my head is that it's the 70s when compiler middle-ends come into their own right, with Fran Allen's work on flow analysis and optimization.

In those days the intermediate representation for a compiler was a graph of basic blocks, but unlike today the paradigm was assignment to locations rather than definition of values. By that I mean that in our example program, we get t assigned to in two places (BB1 and BB2); the actual definition of t is implicit, as a storage location, and our graph consists of assignments to the set of storage locations in the program.

1980s

Static single assignment (SSA) CFG

graph := array<block>
block := tuple<preds, succs, phis, insts>
phi   := z := φ(x, y, ...)
inst  := z := const C
       | z := add x, y
       ...
BB0: if x then BB1 else BB2
BB1: v0 := const 42; goto BB3
BB2: v1 := const 69; goto BB3
BB3: v2 := φ(v0,v1); v3:=addi t,1; ret v3

Phi is phony function: v2 is v0 if coming from first predecessor, or v1 from second predecessor

These days we still live in Fran Allen's world, but with a twist: we no longer model programs as graphs of assignments, but rather graphs of definitions. The introduction in the mid-80s of so-called "static single-assignment" (SSA) form graphs mean that instead of having two assignments to t, we would define two different values v0 and v1. Then later instead of reading the value of the storage location associated with t, we define v2 to be either v0 or v1: the former if we reach the use of t in BB3 from BB1, the latter if we are coming from BB2.

If you think on the machine level, in terms of what the resulting machine code will be, this either function isn't a real operation; probably register allocation will put v0, v1, and v2 in the same place, say $rax. The function linking the definition of v2 to the inputs v0 and v1 is purely notational; in a way, you could say that it is phony, or not real. But when the creators of SSA went to submit this notation for publication they knew that they would need something that sounded more rigorous than "phony function", so they instead called it a "phi" (φ) function. Really.

2003: MLton

Refinement: phi variables are basic block args

graph := array<block>
block := tuple<preds, succs, args, insts>

Inputs of phis implicitly computed from preds

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

SSA is still where it's at, as a conventional solution to the IR problem. There have been some refinements, though. I learned of one of them from MLton; I don't know if they were first but they had the idea of interpreting phi variables as arguments to basic blocks. In this formulation, you don't have explicit phi instructions; rather the "v2 is either v1 or v0" property is expressed by v2 being a parameter of a block which is "called" with either v0 or v1 as an argument. It's the same semantics, but an interesting notational change.

Refinement: Control tail

Often nice to know how a block ends (e.g. to compute phi input vars)

graph := array<block>
block := tuple<preds, succs, args, insts,
               control>
control := if v then L1 else L2
         | L(v, ...)
         | switch(v, L1, L2, ...)
         | ret v

One other refinement to SSA is to note that basic blocks consist of some number of instructions that can define values or have side effects but which otherwise exhibit fall-through control flow, followed by a single instruction that transfers control to another block. We might as well store that control instruction separately; this would let us easily know how a block ends, and in the case of phi block arguments, easily say what values are the inputs of a phi variable. So let's do that.

Refinement: DRY

Block successors directly computable from control

Predecessors graph is inverse of successors graph

graph := array<block>
block := tuple<args, insts, control>

Can we simplify further?

At this point we notice that we are repeating ourselves; the successors of a block can be computed directly from the block's terminal control instruction. Let's drop those as a distinct part of a block, because when you transform a program it's unpleasant to have to needlessly update something in two places.

While we're doing that, we note that the predecessors array is also redundant, as it can be computed from the graph of block successors. Here we start to wonder: am I simpliying or am I removing something that is fundamental to the algorithmic complexity of the various graph transformations that I need to do? We press on, though, hoping we will get somewhere interesting.

Basic blocks are annoying

Ceremony about managing insts; array or doubly-linked list?

Nonuniformity: “local” vs ‘`global’' transformations

Optimizations transform graph A to graph B; mutability complicates this task

  • Desire to keep A in mind while making B
  • Bugs because of spooky action at a distance

Recall that the context for this meander is Guile's compiler, which is written in Scheme. Scheme doesn't have expandable arrays built-in. You can build them, of course, but it is annoying. Also, in Scheme-land, functions with side-effects are conventionally suffixed with an exclamation mark; after too many of them, both the writer and the reader get fatigued. I know it's a silly argument but it's one of the things that made me grumpy about basic blocks.

If you permit me to continue with this introspection, I find there is an uneasy relationship between instructions and locations in an IR that is structured around basic blocks. Do instructions live in a function-level array and a basic block is an array of instruction indices? How do you get from instruction to basic block? How would you hoist an instruction to another basic block, might you need to reallocate the block itself?

And when you go to transform a graph of blocks... well how do you do that? Is it in-place? That would be efficient; but what if you need to refer to the original program during the transformation? Might you risk reading a stale graph?

It seems to me that there are too many concepts, that in the same way that SSA itself moved away from assignment to a more declarative language, that perhaps there is something else here that might be more appropriate to the task of a middle-end.

Basic blocks, phi vars redundant

Blocks: label with args sufficient; “containing” multiple instructions is superfluous

Unify the two ways of naming values: every var is a phi

graph := array<block>
block := tuple<args, inst>
inst  := L(expr)
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

I took a number of tacks here, but the one I ended up on was to declare that basic blocks themselves are redundant. Instead of containing an array of instructions with fallthrough control-flow, why not just make every instruction a control instruction? (Yes, there are arguments against this, but do come along for the ride, we get to a funny place.)

While you are doing that, you might as well unify the two ways in which values are named in a MLton-style compiler: instead of distinguishing between basic block arguments and values defined within a basic block, we might as well make all names into basic block arguments.

Arrays annoying

Array of blocks implicitly associates a label with each block

Optimizations add and remove blocks; annoying to have dead array entries

Keep labels as small integers, but use a map instead of an array

graph := map<label, block>

In the traditional SSA CFG IR, a graph transformation would often not touch the structure of the graph of blocks. But now having given each instruction its own basic block, we find that transformations of the program necessarily change the graph. Consider an instruction that we elide; before, we would just remove it from its basic block, or replace it with a no-op. Now, we have to find its predecessor(s), and forward them to the instruction's successor. It would be useful to have a more capable data structure to represent this graph. We might as well keep labels as being small integers, but allow for sparse maps and growth by using an integer-specialized map instead of an array.

This is CPS soup

graph := map<label, cont>
cont  := tuple<args, term>
term  := continue to L
           with values from expr
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

SSA is CPS

This is exactly what CPS soup is! We came at it "from below", so to speak; instead of the heady fumes of the lambda calculus, we get here from down-to-earth basic blocks. (If you prefer the other way around, you might enjoy this article from a long time ago.) The remainder of this presentation goes deeper into what it is like to work with CPS soup in practice.

Scope and dominators

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

What vars are “in scope” at BB3? a0 and v2.

Not v0; not all paths from BB0 to BB3 define v0.

a0 always defined: its definition dominates all uses.

BB0 dominates BB3: All paths to BB3 go through BB0.

Before moving on, though, we should discuss what it means in an SSA-style IR that variables are defined rather than assigned. If you consider variables as locations to which values can be assigned and which initially hold garbage, you can read them at any point in your program. You might get garbage, though, if the variable wasn't assigned something sensible on the path that led to reading the location's value. It sounds bonkers but it is still the C and C++ semantic model.

If we switch instead to a definition-oriented IR, then a variable never has garbage; the single definition always precedes any uses of the variable. That is to say that all paths from the function entry to the use of a variable must pass through the variable's definition, or, in the jargon, that definitions dominate uses. This is an invariant of an SSA-style IR, that all variable uses be dominated by their associated definition.

You can flip the question around to ask what variables are available for use at a given program point, which might be read equivalently as which variables are in scope; the answer is, all definitions from all program points that dominate the use site. The "CPS" in "CPS soup" stands for continuation-passing style, a dialect of the lambda calculus, which has also has a history of use as a compiler intermediate representation. But it turns out that if we use the lambda calculus in its conventional form, we end up needing to maintain a lexical scope nesting at the same time that we maintain the control-flow graph, and the lexical scope tree can fail to reflect the dominator tree. I go into this topic in more detail in an old article, and if it interests you, please do go deep.

CPS soup in Guile

Compilation unit is intmap of label to cont

cont := $kargs names vars term
      | ...
term := $continue k src expr
      | ...
expr := $const C
      | $primcall ’add #f (a b)
      | ...

Conventionally, entry point is lowest-numbered label

Anyway! In Guile, the concrete form that CPS soup takes is that a program is an intmap of label to cont. A cont is the smallest labellable unit of code. You can call them blocks if that makes you feel better. One kind of cont, $kargs, binds incoming values to variables. It has a list of variables, vars, and also has an associated list of human-readable names, names, for debugging purposes.

A $kargs contains a term, which is like a control instruction. One kind of term is $continue, which passes control to a continuation k. Using our earlier language, this is just goto *k*, with values, as in MLton. (The src is a source location for the term.) The values come from the term's expr, of which there are a dozen kinds or so, for example $const which passes a literal constant, or $primcall, which invokes some kind of primitive operation, which above is add. The primcall may have an immediate operand, in this case #f, and some variables that it uses, in this case a and b. The number and type of the produced values is a property of the primcall; some are just for effect, some produce one value, some more.

CPS soup

term := $continue k src expr
      | $branch kf kt src op param args
      | $switch kf kt* src arg
      | $prompt k kh src escape? tag
      | $throw src op param args

Expressions can have effects, produce values

expr := $const val
      | $primcall name param args
      | $values args
      | $call proc args
      | ...

There are other kinds of terms besides $continue: there is $branch, which proceeds either to the false continuation kf or the true continuation kt depending on the result of performing op on the variables args, with immediate operand param. In our running example, we might have made the initial term via:

(build-term
  ($branch BB1 BB2 'false? #f (a0)))

The definition of build-term (and build-cont and build-exp) is in the (language cps) module.

There is also $switch, which takes an unboxed unsigned integer arg and performs an array dispatch to the continuations in the list kt, or kf otherwise.

There is $prompt which continues to its k, having pushed on a new continuation delimiter associated with the var tag; if code aborts to tag before the prompt exits via an unwind primcall, the stack will be unwound and control passed to the handler continuation kh. If escape? is true, the continuation is escape-only and aborting to the prompt doesn't need to capture the suspended continuation.

Finally there is $throw, which doesn't continue at all, because it causes a non-resumable exception to be thrown. And that's it; it's just a handful of kinds of term, determined by the different shapes of control-flow (how many continuations the term has).

When it comes to values, we have about a dozen expression kinds. We saw $const and $primcall, but I want to explicitly mention $values, which simply passes on some number of values. Often a $values expression corresponds to passing an input to a phi variable, though $kargs vars can get their definitions from any expression that produces the right number of values.

Kinds of continuations

Guile functions untyped, can multiple return values

Error if too few values, possibly truncate too many values, possibly cons as rest arg...

Calling convention: contract between val producer & consumer

  • both on call and return side

Continuation of $call unlike that of $const

When a $continue term continues to a $kargs with a $const 42 expression, there are a number of invariants that the compiler can ensure: that the $kargs continuation is always passed the expected number of values, that the vars that it binds can be allocated to specific locations (e.g. registers), and that because all predecessors of the $kargs are known, that those predecessors can place their values directly into the variable's storage locations. Effectively, the compiler determines a custom calling convention between each $kargs and its predecessors.

Consider the $call expression, though; in general you don't know what the callee will do to produce its values. You don't even generally know that it will produce the right number of values. Therefore $call can't (in general) continue to $kargs; instead it continues to $kreceive, which expects the return values in well-known places. $kreceive will check that it is getting the right number of values and then continue to a $kargs, shuffling those values into place. A standard calling convention defines how functions return values to callers.

The conts

cont := $kfun src meta self ktail kentry
      | $kclause arity kbody kalternate
      | $kargs names syms term
      | $kreceive arity kbody
      | $ktail

$kclause, $kreceive very similar

Continue to $ktail: return

$call and return (and $throw, $prompt) exit first-order flow graph

Of course, a $call expression could be a tail-call, in which case it would continue instead to $ktail, indicating an exit from the first-order function-local control-flow graph.

The calling convention also specifies how to pass arguments to callees, and likewise those continuations have a fixed calling convention; in Guile we start functions with $kfun, which has some metadata attached, and then proceed to $kclause which bridges the boundary between the standard calling convention and the specialized graph of $kargs continuations. (Many details of this could be tweaked, for example that the case-lambda dispatch built-in to $kclause could instead dispatch to distinct functions instead of to different places in the same function; historical accidents abound.)

As a detail, if a function is well-known, in that all its callers are known, then we can lighten the calling convention, moving the argument-count check to callees. In that case $kfun continues directly to $kargs. Similarly for return values, optimizations can make $call continue to $kargs, though there is still some value-shuffling to do.

High and low

CPS bridges AST (Tree-IL) and target code

High-level: vars in outer functions in scope

Closure conversion between high and low

Low-level: Explicit closure representations; access free vars through closure

CPS soup is the bridge between parsed Scheme and machine code. It starts out quite high-level, notably allowing for nested scope, in which expressions can directly refer to free variables. Variables are small integers, and for high-level CPS, variable indices have to be unique across all functions in a program. CPS gets lowered via closure conversion, which chooses specific representations for each closure that remains after optimization. After closure conversion, all variable access is local to the function; free variables are accessed via explicit loads from a function's closure.

Optimizations at all levels

Optimizations before and after lowering

Some exprs only present in one level

Some high-level optimizations can merge functions (higher-order to first-order)

Because of the broad remit of CPS, the language itself has two dialects, high and low. The high level dialect has cross-function variable references, first-class abstract functions (whose representation hasn't been chosen), and recursive function binding. The low-level dialect has only specific ways to refer to functions: labels and specific closure representations. It also includes calls to function labels instead of just function values. But these are minor variations; some optimization and transformation passes can work on either dialect.

Practicalities

Intmap, intset: Clojure-style persistent functional data structures

Program: intmap<label,cont>

Optimization: program→program

Identify functions: (program,label)→intset<label>

Edges: intmap<label,intset<label>>

Compute succs: (program,label)→edges

Compute preds: edges→edges

I mentioned that programs were intmaps, and specifically in Guile they are Clojure/Bagwell-style persistent functional data structures. By functional I mean that intmaps (and intsets) are values that can't be mutated in place (though we do have the transient optimization).

I find that immutability has the effect of deploying a sense of calm to the compiler hacker -- I don't need to worry about data structures changing out from under me; instead I just structure all the transformations that you need to do as functions. An optimization is just a function that takes an intmap and produces another intmap. An analysis associating some data with each program label is just a function that computes an intmap, given a program; that analysis will never be invalidated by subsequent transformations, because the program to which it applies will never be mutated.

This pervasive feeling of calm allows me to tackle problems that I wouldn't have otherwise been able to fit into my head. One example is the novel online CSE pass; one day I'll either wrap that up as a paper or just capitulate and blog it instead.

Flow analysis

A[k] = meet(A[p] for p in preds[k])
         - kill[k] + gen[k]

Compute available values at labels:

  • A: intmap<label,intset<val>>
  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union
  • kill[k]: values invalidated by cont because of side effects
  • gen[k]: values defined at k

But to keep it concrete, let's take the example of flow analysis. For example, you might want to compute "available values" at a given label: these are the values that are candidates for common subexpression elimination. For example if a term is dominated by a car x primcall whose value is bound to v, and there is no path from the definition of V to a subsequent car x primcall, we can replace that second duplicate operation with $values (v) instead.

There is a standard solution for this problem, which is to solve the flow equation above. I wrote about this at length ages ago, but looking back on it, the thing that pleases me is how easy it is to decompose the task of flow analysis into manageable parts, and how the types tell you exactly what you need to do. It's easy to compute an initial analysis A, easy to define your meet function when your maps and sets have built-in intersect and union operators, easy to define what addition and subtraction mean over sets, and so on.

Persistent data structures FTW

  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union

Naïve: O(nconts * nvals)

Structure-sharing: O(nconts * log(nvals))

Computing an analysis isn't free, but it is manageable in cost: the structure-sharing means that meet is usually trivial (for fallthrough control flow) and the cost of + and - is proportional to the log of the problem size.

CPS soup: strengths

Relatively uniform, orthogonal

Facilitates functional transformations and analyses, lowering mental load: “I just have to write a function from foo to bar; I can do that”

Encourages global optimizations

Some kinds of bugs prevented by construction (unintended shared mutable state)

We get the SSA optimization literature

Well, we're getting to the end here, and I want to take a step back. Guile has used CPS soup as its middle-end IR for about 8 years now, enough time to appreciate its fine points while also understanding its weaknesses.

On the plus side, it has what to me is a kind of low cognitive overhead, and I say that not just because I came up with it: Guile's development team is small and not particularly well-resourced, and we can't afford complicated things. The simplicity of CPS soup works well for our development process (flawed though that process may be!).

I also like how by having every variable be potentially a phi, that any optimization that we implement will be global (i.e. not local to a basic block) by default.

Perhaps best of all, we get these benefits while also being able to use the existing SSA transformation literature. Because CPS is SSA, the lessons learned in SSA (e.g. loop peeling) apply directly.

CPS soup: weaknesses

Pointer-chasing, indirection through intmaps

Heavier than basic blocks: more control-flow edges

Names bound at continuation only; phi predecessors share a name

Over-linearizes control, relative to sea-of-nodes

Overhead of re-computation of analyses

CPS soup is not without its drawbacks, though. It's not suitable for JIT compilers, because it imposes some significant constant-factor (and sometimes algorithmic) overheads. You are always indirecting through intmaps and intsets, and these data structures involve significant pointer-chasing.

Also, there are some forms of lightweight flow analysis that can be performed naturally on a graph of basic blocks without looking too much at the contents of the blocks; for example in our available variables analysis you could run it over blocks instead of individual instructions. In these cases, basic blocks themselves are an optimization, as they can reduce the size of the problem space, with corresponding reductions in time and memory use for analyses and transformations. Of course you could overlay a basic block graph on top of CPS soup, but it's not a well-worn path.

There is a little detail that not all phi predecessor values have names, since names are bound at successors (continuations). But this is a detail; if these names are important, little $values trampolines can be inserted.

Probably the main drawback as an IR is that the graph of conts in CPS soup over-linearizes the program. There are other intermediate representations that don't encode ordering constraints where there are none; perhaps it would be useful to marry CPS soup with sea-of-nodes, at least during some transformations.

Finally, CPS soup does not encourage a style of programming where an analysis is incrementally kept up to date as a program is transformed in small ways. The result is that we end up performing much redundant computation within each individual optimization pass.

Recap

CPS soup is SSA, distilled

Labels and vars are small integers

Programs map labels to conts

Conts are the smallest labellable unit of code

Conts can have terms that continue to other conts

Compilation simplifies and lowers programs

Wasm vs VM backend: a question for another day :)

But all in all, CPS soup has been good for Guile. It's just SSA by another name, in a simpler form, with a functional flavor. Or, it's just CPS, but first-order only, without lambda.

In the near future, I am interested in seeing what a new GC will do for CPS soup; will bump-pointer allocation palliate some of the costs of pointer-chasing? We'll see. A tricky thing about CPS soup is that I don't think that anyone else has tried it in other languages, so it's hard to objectively understand its characteristics independent of Guile itself.

Finally, it would be nice to engage in the academic conversation by publishing a paper somewhere; I would like to see interesting criticism, and blog posts don't really participate in the citation graph. But in the limited time available to me, faced with the choice between hacking on something and writing a paper, it's always been hacking, so far :)

Speaking of limited time, I probably need to hit publish on this one and move on. Happy hacking to all, and until next time.

structure and interpretation of ark

2 May 2023 9:13 AM (igalia | compilers | mobile | ionic | capacitor | react | react native | nativescript | flutter | dart | javascript)

Hello, dear readers! Today's article describes Ark, a new JavaScript-based mobile development platform. If you haven't read them yet, you might want to start by having a look at my past articles on Capacitor, React Native, NativeScript, and Flutter; having a common understanding of the design space will help us understand where Ark is similar and where it differs.

Ark, what it is

If I had to bet, I would guess that you have not heard of Ark. (I certainly hadn't either, when commissioned to do this research series.) To a first approximation, Ark—or rather, what I am calling Ark; I don't actually know the name for the whole architecture—is a loosely Flutter-like UI library implemented on top of a dialect of JavaScript, with build-time compilation to bytecode (like Hermes) but also with support for just-in-time and ahead-of-time compilation of bytecode to native code. It is made by Huawei.

At this point if you are already interested in this research series, I am sure this description raises more questions than it answers. Flutter-like? A dialect? Native compilation? Targetting what platforms? From Huawei? We'll get to all of these, but I think we need to start with the last question.

How did we get here?

In my last article on Flutter, I told a kind of just-so history of how Dart and Flutter came to their point in the design space. Thanks to corrections from a kind reader, it happened to also be more or less correct. In this article, though, I haven't talked with Ark developers at all; I don't have the benefit of a true claim on history. And yet, the only way I can understand Ark is by inventing a narrative, so here we go. It might even be true!

Recall that in 2018, Huawei was a dominant presence in the smartphone market. They were shipping excellent hardware at good prices both to the Chinese and to the global markets. Like most non-Apple, non-Google manufacturers, they shipped Android, and like most Android OEMs, they shipped Google's proprietary apps (mail, maps, etc.).

But then, over the next couple years, the US decided that allowing Huawei to continue on as before was, like, against national security interests or something. Huawei was barred from American markets, a number of suppliers were forbidden from selling hardware components to Huawei, and even Google was prohibited from shipping its mobile apps on Huawei devices. The effect on Huawei's market share for mobile devices was enormous: its revenue was cut in half over a period of a couple years.

In this position, as Huawei, what do you do? I can't even imagine, but specifically looking at smartphones, I think I would probably do about what they did. I'd fork Android, for starters, because that's what you already know and ship, and Android is mostly open source. I'd probably plan on continuing to use its lower-level operating system pieces indefinitely (kernel and so on) because that's not a value differentiator. I'd probably ship the same apps on top at first, because otherwise you slip all the release schedules and lose revenue entirely.

But, gosh, there is the risk that your product will be perceived as just a worse version of Android: that's not a good position to be in. You need to be different, and ideally better. That will take time. In the meantime, you claim that you're different, without actually being different yet. It's a somewhat ridiculous position to be in, but I can understand how you get here; Ars Technica published a scathing review poking fun at the situation. But, you are big enough to ride it out, knowing that somehow eventually you will be different.

Up to now, this part of the story is relatively well-known; the part that follows is more speculative on my part. Firstly, I would note that Huawei had been working for a while on a compiler and language run-time called Ark Compiler, with the goal of getting better performance out of Android applications. If I understand correctly, this compiler took the Java / Dalvik / Android Run Time bytecodes as its input, and outputted native binaries along with a new run-time implementation.

As I can attest from personal experience, having a compiler leads to hubris: you start to consider source languages like a hungry person looks at a restaurant menu. "Wouldn't it be nice to ingest that?" That's what we say at restaurants, right, fellow humans? So in 2019 and 2020 when the Android rug was pulled out from underneath Huawei, I think having in-house compiler expertise allowed them to consider whether they wanted to stick with Java at all, or whether it might be better to choose a more fashionable language.

Like black, JavaScript is always in fashion. What would it mean, then, to retool Huawei's operating system -- by then known by the name "HarmonyOS" -- to expose a JavaScript-based API as its primary app development framework? You could use your Ark compiler somehow to implement JavaScript (hubris!) and then you need a UI framework. Having ditched Java, it is now thinkable to ditch all the other Android standard libraries, including the UI toolkit: you start anew, in a way. So are you going to build a Capacitor, a React Native, a NativeScript, a Flutter? Surely not precisely any of these, but what will it be like, and how will it differ?

Incidentally, I don't know the origin story for the name Ark, but to me it brings to mind tragedy and rebuilding: in the midst of being cut off from your rich Android ecosystem, you launch a boat into the sea, holding a promise of a new future built differently. Hope and hubris in one vessel.

Two programming interfaces

In the end, Huawei builds two things: something web-like and something like Flutter. (I don't mean to suggest copying or degeneracy here; it's rather that I can only understand things in relation to other things, and these are my closest points of comparison for what they built.)

The web-like programming interface specifies UIs using an XML dialect, HML, and styles the resulting node tree with CSS. You augment these nodes with JavaScript behavior; the main app is a set of DOM-like event handlers. There is an API to dynamically create DOM nodes, but unlike the other systems we have examined, the HarmonyOS documentation doesn't really sell you on using a high-level framework like Angular.

If this were it, I think Ark would not be so compelling: the programming model is more like what was available back in the DHTML days. I wouldn't expect people to be able to make rich applications that delight users, given these primitives, though CSS animation and the HML loop and conditional rendering from the template system might be just expressive enough for simple applications.

The more interesting side is the so-called "declarative" UI programming model which exposes a Flutter/React-like interface. The programmer describes the "what" of the UI by providing a tree of UI nodes in its build function, and the framework takes care of calling build when necessary and of rendering that tree to the screen.

Here I need to show some example code, because it is... weird. Well, I find it weird, but it's not too far from SwiftUI in flavor. A small example from the fine manual:

@Entry
@Component
struct MyComponent {
  build() {
    Stack() {
        Image($rawfile('Tomato.png'))
        Text('Tomato')
            .fontSize(26)
            .fontWeight(500)
    }
  }
}

The @Entry decorator (*) marks this struct (**) as being the main entry point for the app. @Component marks it as being a component, like a React functional component. Components conform to an interface (***) which defines them as having a build method which takes no arguments and returns no values: it creates the tree in a somewhat imperative way.

But as you see the flavor is somewhat declarative, so how does that work? Also, build() { ... } looks syntactically a lot like Stack() { ... }; what's the deal, are they the same?

Before going on to answer this, note my asterisks above: these are concepts that aren't in JavaScript. Indeed, programs written for HarmonyOS's declarative framework aren't JavaScript; they are in a dialect of TypeScript that Huawei calls ArkTS. In this case, an interface is a TypeScript concept. Decorators would appear to correspond to an experimental TypeScript feature, looking at the source code.

But struct is an ArkTS-specific extension, and Huawei has actually extended the TypeScript compiler to specifically recognize the @Component decorator, such that when you "call" a struct, for example as above in Stack() { ... }, TypeScript will parse that as a new expression type EtsComponentExpression, which may optionally be followed by a block. When Stack() is invoked, its children (instances of Image and Text, in this case) will be populated via running the block.

Now, though TypeScript isn't everyone's bag, it's quite normalized in the JavaScript community and not a hard sell. Language extensions like the handling of @Component pose a more challenging problem. Still, Facebook managed to sell people on JSX, so perhaps Huawei can do the same for their dialect. More on that later.

Under the hood, it would seem that we have a similar architecture to Flutter: invoking the components creates a corresponding tree of elements (as with React Native's shadow tree), which then are lowered to render nodes, which draw themselves onto layers using Skia, in a multi-threaded rendering pipeline. Underneath, the UI code actually re-uses some parts of Flutter, though from what I can tell HarmonyOS developers are replacing those over time.

Restrictions and extensions

So we see that the source language for the declarative UI framework is TypeScript, but with some extensions. It also has its restrictions, and to explain these, we have to talk about implementation.

Of the JavaScript mobile application development frameworks we discussed, Capacitor and NativeScript used "normal" JS engines from web browsers, while React Native built their own Hermes implementation. Hermes is also restricted, in a way, but mostly inasmuch as it lags the browser JS implementations; it relies on source-to-source transpilers to get access to new language features. ArkTS—that's the name of HarmonyOS's "extended TypeScript" implementation—has more fundamental restrictions.

Recall that the Ark compiler was originally built for Android apps. There you don't really have the ability to load new Java or Kotlin source code at run-time; in Java you have class loaders, but those load bytecode. On an Android device, you don't have to deal with the Java source language. If we use a similar architecture for JavaScript, though, what do we do about eval?

ArkTS's answer is: don't. As in, eval is not supported on HarmonyOS. In this way the implementation of ArkTS can be divided into two parts, a frontend that produces bytecode and a runtime that runs the bytecode, and you never have to deal with the source language on the device where the runtime is running. Like Hermes, the developer produces bytecode when building the application and ships it to the device for the runtime to handle.

Incidentally, before we move on to discuss the runtime, there are actually two front-ends that generate ArkTS bytecode: one written in C++ that seems to only handle standard TypeScript and JavaScript, and one written in TypeScript that also handles "extended TypeScript". The former has a test262 runner with about 10k skipped tests, and the latter doesn't appear to have a test262 runner. Note, I haven't actually built either one of these (or any of the other frameworks, for that matter).

The ArkTS runtime is itself built on a non-language-specific common Ark runtime, and the set of supported instructions is the union of the core ISA and the JavaScript-specific instructions. Bytecode can be interpreted, JIT-compiled, or AOT-compiled.

On the side of design documentation, it's somewhat sparse. There are some core design docs; readers may be interested in the rationale to use a bytecode interface for Ark as a whole, or the optimization overview.

Indeed ArkTS as a whole has a surfeit of optimizations, to an extent that makes me wonder which ones are actually needed. There are source-to-source optimizations on bytecode, which I expect are useful if you are generating ArkTS bytecode from JavaScript, where you probably don't have a full compiler implementation. There is a completely separate optimizer in the eTS part of the run-time, based on what would appear to be a novel "circuit-based" IR that bears some similarity to sea-of-nodes. Finally the whole thing appears to bottom out in LLVM, which of course has its own optimizer. I can only assume that this situation is somewhat transitory. Also, ArkTS does appear to generate its own native code sometimes, notably for inline cache stubs.

Of course, when it comes to JavaScript, there are some fundamental language semantics and there is also a large and growing standard library. In the case of ArkTS, this standard library is part of the run-time, like the interpreter, compilers, and the garbage collector (generational concurrent mark-sweep with optional compaction).

All in all, when I step back from it, it's a huge undertaking. Implementing JavaScript is no joke. It appears that ArkTS has done the first 90% though; the proverbial second 90% should only take a few more years :)

Evaluation

If you told a younger me that a major smartphone vendor switched from Java to JavaScript for their UI, you would probably hear me react in terms of the relative virtues of the programming languages in question. At this point in my career, though, the only thing that comes to mind is what an expensive proposition it is to change everything about an application development framework. 200 people over 5 years would be my estimate, though of course teams are variable. So what is it that we can imagine that Huawei bought with a thousand person-years of investment? Towards what other local maximum are we heading?

Startup latency

I didn't mention it before, but it would seem that one of the goals of HarmonyOS is in the name: Huawei wants to harmonize development across the different range of deployment targets. To the extent possible, it would be nice to be able to write the same kinds of programs for IoT devices as you do for feature-rich smartphones and tablets and the like. In that regard one can see through all the source code how there is a culture of doing work ahead-of-time and preventing work at run-time; for example see the design doc for the interpreter, or for the file format, or indeed the lack of JavaScript eval.

Of course, this wide range of targets also means that the HarmonyOS platform bears the burden of a high degree of abstraction; not only can you change the kernel, but also the JavaScript engine (using JerryScript on "lite" targets).

I mention this background because sometimes in news articles and indeed official communication from recent years there would seem to be some confusion that HarmonyOS is just for IoT, or aimed to be super-small, or something. In this evaluation I am mostly focussed on the feature-rich side of things, and there my understanding is that the developer will generate bytecode ahead-of-time. When an app is installed on-device, the AOT compiler will turn it into a single ELF image. This should generally lead to fast start-up.

However it would seem that the rendering library that paints UI nodes into layers and then composits those layers uses Skia in the way that Flutter did pre-Impeller, which to be fair is a quite recent change to Flutter. I expect therefore that Ark (ArkTS + ArkUI) applications also experience shader compilation jank at startup, and that they may be well-served by tesellating their shapes into primitives like Impeller does so that they can precompile a fixed, smaller set of shaders.

Jank

Maybe it's just that apparently I think Flutter is great, but ArkUI's fundamental architectural similarity to Flutter makes me think that jank will not be a big issue. There is a render thread that is separate from the ArkTS thread, so like with Flutter, async communication with main-thread interfaces is the main jank worry. And on the ArkTS side, ArkTS even has a number of extensions to be able to share objects between threads without copying, should that be needed. I am not sure how well-developed and well-supported these extensions are, though.

I am hedging my words, of course, because I am missing a bit of social proof; HarmonyOS is still in infant days, and it doesn't have much in the way of a user base outside China, from what I can tell, and my ability to read Chinese is limited to what Google Translate can do for me :) Unlike other frameworks, therefore, I haven't been as able to catch a feel of the pulse of the ArkUI user community: what people are happy about, what the pain points are.

It's also interesting that unlike iOS or Android, HarmonyOS is only exposing these "web-like" and "declarative" UI frameworks for app development. This makes it so that the same organization is responsible for the software from top to bottom, which can lead to interesting cross-cutting optimizations: functional reactive programming isn't just a developer-experience narrative, but it can directly affect the shape of the rendering pipeline. If there is jank, someone in the building is responsible for it and should be able to fix it, whether it is in the GPU driver, the kernel, the ArkTS compiler, or the application itself.

Peak performance

I don't know how to evaluate ArkTS for peak performance. Although there is a JIT compiler, I don't have the feeling that it is as tuned for adaptive optimization as V8 is.

At the same time, I find it interesting that HarmonyOS has chosen to modify JavaScript. While it is doing that, could they switch to a sound type system, to allow the kinds of AOT optimizations that Dart can do? It would be an interesting experiment.

As it is, though, if I had to guess, I would say that ArkTS is well-positioned for predictably good performance with AOT compilation, although I would be interested in seeing the results of actually running it.

Aside: On the importance of storytelling

In this series I have tried to be charitable towards the frameworks that I review, to give credit to what they are trying to do, even while noting where they aren't currently there yet. That's part of why I need a plausible narrative for how the frameworks got where they are, because that lets me have an idea of where they are going.

In that sense I think that Ark is at an interesting inflection point. When I started reading documentation about ArkUI and HarmonyOS and all that, I bounced out—there were too many architectural box diagrams, too many generic descriptions of components, too many promises with buzzwords. It felt to me like the project was trying to justify itself to a kind of clueless management chain. Was there actually anything here?

But now when I see the adoption of a modern rendering architecture and a bold new implementation of JavaScript, along with the willingness to experiment with the language, I think that there is an interesting story to be told, but this time not to management but to app developers.

Of course you wouldn't want to market to app developers when your system is still a mess because you haven't finished rebuilding an MVP yet. Retaking my charitable approach, then, I can only think that all the architectural box diagrams were a clever blind to avoid piquing outside interest while the app development kit wasn't ready yet :) As and when the system starts working well, presumably over the next year or so, I would expect HarmonyOS to invest much more heavily in marketing and developer advocacy; the story is interesting, but you have to actually tell it.

Aside: O platform, my platform

All of the previous app development frameworks that we looked at were cross-platform; Ark is not. It could be, of course: it does appear to be thoroughly open source. But HarmonyOS devices are the main target. What implications does this have?

A similar question arises in perhaps a more concrete way if we start with the mature Flutter framework: what would it mean to make a Flutter phone?

The first thought that comes to mind is that having a Flutter OS would allow for the potential for more cross-cutting optimizations that cross abstraction layers. But then I think, what does Flutter really need? It has the GPU drivers, and we aren't going to re-implement those. It has the bridge to the platform-native SDK, which is not such a large and important part of the app. You get input from the platform, but that's also not so specific. So maybe optimization is not the answer.

On the other hand, a Flutter OS would not have to solve the make-it-look-native problem; because there would be no other "native" toolkit, your apps won't look out of place. That's nice. It's not something that would make the platform compelling, though.

HarmonyOS does have this embryonic concept of app mobility, where like you could put an app from your phone on your fridge, or something. Clearly I am not doing it justice here, but let's assume it's a compelling use case. In that situation it would be nice for all devices to present similar abstractions, so you could somehow install the same app on two different kinds of devices, and they could communicate to transfer data. As you can see here though, I am straying far from my domain of expertise.

One reasonable way to "move" an app is to have it stay running on your phone and the phone just communicates pixels with your fridge (or whatever); this is the low-level solution. I think HarmonyOS appears to be going for the higher-level solution where the app actually runs logic on the device. In that case it would make sense to ship UI assets and JavaScript / extended TypeScript bytecode to the device, which would run the app with an interpreter (for low-powered devices) or use JIT/AOT compilation. The Ark runtime itself would live on all devices, specialized to their capabilities.

In a way this is the Apple WatchOS solution (as I understand it); developers publish their apps as LLVM bitcode, and Apple compiles it for the specific devices. A FlutterOS with a Flutter run-time on all devices could do something similar. As with WatchOS you wouldn't have to ship the framework itself in the app bundle; it would be on the device already.

Finally, publishing apps as some kind of intermediate representation also has security benefits: as the OS developer, you can ensure some invariants via the toolchain that you control. Of course, you would have to ensure that the Flutter API is sufficiently expressive for high-performance applications, while also not having arbitrary machine code execution vulnerabilities; there is a question of language and framework design as well as toolchain and runtime quality of implementation. HarmonyOS could be headed in this direction.

Conclusion

Ark is a fascinating effort that holds much promise. It's also still in motion; where will it be when it anneals to its own local optimum? It would appear that the system is approaching usability, but I expect a degree of churn in the near-term as Ark designers decide which language abstractions work for them and how to, well, harmonize them with the rest of JavaScript.

For me, the biggest open question is whether developers will love Ark in the way they love, say, React. In a market where Huawei is still a dominant vendor, I think the material conditions are there for a good developer experience: people tend to like Flutter and React, and Ark is similar. Huawei "just" needs to explain their framework well (and where it's hard to explain, to go back and change it so that it is explainable).

But in a more heterogeneous market, to succeed Ark would need to make a cross-platform runtime like the one Flutter has and engage in some serious marketing efforts, so that developers don't have to limit themselves to targetting the currently-marginal HarmonyOS. Selling extensions to JavaScript will be much more difficult in a context where the competition is already established, but perhaps Ark will be able to productively engage with TypeScript maintainers to move the language so it captures some of the benefits of Dart that facilitate ahead-of-time compilation.

Well, that's it for my review round-up; hope you have enjoyed the series. I have one more pending article, speculating about some future technologies. Until then, happy hacking, and see you next time.

structure and interpretation of flutter

26 April 2023 1:50 PM (igalia | compilers | mobile | ionic | capacitor | react | react native | nativescript | flutter | dart | javascript)

Good day, gentle hackfolk. Like an old-time fiddler I would appear to be deep in the groove, playing endless variations on a theme, in this case mobile application frameworks. But one can only recognize novelty in relation to the familiar, and today's note is a departure: we are going to look at Flutter, a UI toolkit based not on JavaScript but on the Dart language.

The present, from the past

Where to start, even? The problem is big enough that I'll approach it from three different angles: from the past, from the top, and from the bottom.

With the other frameworks we looked at, we didn't have to say much about their use of JavaScript. JavaScript is an obvious choice, in 2023 at least: it is ubiquitous, has high quality implementations, and as a language it is quite OK and progressively getting better. Up to now, "always bet on JS" has had an uninterrupted winning streak.

But winning is not the same as unanimity, and Flutter and Dart represent an interesting pole of contestation. To understand how we got here, we have to go back in time. Ten years ago, JavaScript just wasn't a great language: there were no modules, no async functions, no destructuring, no classes, no extensible iteration, no optional arguments to functions. In addition it was hobbled with a significant degree of what can only be called accidental sloppiness: with which can dynamically alter a lexical scope, direct eval that can define new local variables, Function.caller, and so on. Finally, larger teams were starting to feel the need for more rigorous language tooling that could use types to prohibit some classes of invalid programs.

All of these problems in JavaScript have been addressed over the last decade, mostly successfully. But in 2010 or so if you were a virtual machine engineer, you might look at JavaScript and think that in some other world, things could be a lot better. That's effectively what happened: the team that originally built V8 broke off and started to work on what became Dart.

Initially, Dart was targetted for inclusion in the Chrome web browser as an alternate "native" browser language. This didn't work, for various reasons, but since then Dart grew the Flutter UI toolkit, which has breathed new life into the language. And this is a review of Flutter, not a review of Dart, not really anyway; to my eyes, Dart is spiritually another JavaScript, different but in the same family. Dart's implementation has many interesting aspects as well that we'll get into later on, but all of these differences are incidental: they could just as well be implemented on top of JavaScript, TypeScript, or another source language in that family. Even if Flutter isn't strictly part of the JavaScript-based mobile application development frameworks that we are comparing, it is valuable to the extent that it shows what is possible, and in that regard there is much to say.

Flutter, from the top

At its most basic, Flutter is a UI toolkit for Dart. In many ways it is like React. Like React, its interface follows the functional-reactive paradigm: programmers describe the "what", and Flutter takes care of the "how". Also, like the phenomenon in which new developers can learn React without really knowing JavaScript, Flutter is the killer app for Dart: Flutter developers mostly learn Dart at the same time that they pick up Flutter.

In some other ways, Flutter is the logical progression of React, going in the same direction but farther along. Whereas React-on-the-web takes the user's declarative specifications of what the UI should look like and lowers them into DOM trees, and React Native lowers them to platform-native UI widgets, Flutter has its own built-in layout, rasterization, and compositing engine: Flutter draws all the pixels.

This has the predictable challenge that Flutter has to make significant investments so that its applications don't feel out-of-place on their platform, but on the other hand it opens up a huge space for experimentation and potential optimization: Flutter has the potential to beat native at its own game. Recall that with React Native, the result of the render-commit-mount process is a tree of native widgets. The native platform will surely then perform a kind of layout on those widgets, divide them into layers that correspond to GPU textures, paint those layers, then composite them to the screen -- basically, what a web engine will do.

What if we could instead skip the native tree and go directly to the lower GPU layer? That is the promise of Flutter. Flutter has the potential to create much more smooth and complex animations than the other application development frameworks we have mentioned, with lower overhead and energy consumption.

In practice... that's always the question, isn't it? Again, please accept my usual caveat that I am a compilers guy moonlighting in the user interface domain, but my understanding is that Flutter mostly lives up to its promise, but with one significant qualification which we'll get to in a minute. But before that, let's traverse Flutter from the other direction, coming up from Dart.

Dart, from the bottom

To explain some aspects of Dart I'm going to tell a just-so story that may or may not be true. I know and like many of the Dart developers, and we have similar instincts, so it's probably not too far from the truth.

Let's say you are the team that originally developed V8, and you decide to create a new language. You write a new virtual machine that looks like V8, taking Dart source code as input and applying advanced adaptive compilation techniques to get good performance. You can even be faster than JS because your language is just a bit more rigid than JavaScript is: you have traded off expressivity for performance. (Recall from our discussion of NativeScript that expressivity isn't a value judgment: there can be reasons to pay for more "mathematically appealing operational equivalences", in Felleisen's language, in exchange for applying more constraints on a language.)

But, you fail to ship the VM in a browser; what do you do? The project could die; that would be annoying, but you work for Google, so it happens all the time. However, a few interesting things happen around the same time that will cause you to pivot. One is a concurrent experiment by Chrome developers to pare the web platform down to its foundations and rebuild it. This effort will eventually become Flutter; while it was originally based on JS, eventually they will choose to switch to Dart.

The second thing that happens is that recently-invented smart phones become ubiquitous. Most people have one, and the two platforms are iOS and Android. Flutter wants to target them. You are looking for your niche, and you see that mobile application development might be it. As the Flutter people continue to experiment, you start to think about what it would mean to target mobile devices with Dart.

The initial Dart VM was made to JIT, but as we know, Apple doesn't let people do this on iOS. So instead you look to write a quick-and-dirty ahead-of-time compiler, based on your JIT compiler that takes your program as input, parses and baseline-compiles it, and generates an image that can be loaded at runtime. It ships on iOS. Funnily enough, it ships on Android too, because AOT compilation allows you to avoid some startup costs; forced to choose between peak performance via JIT and fast startup via AOT, you choose fast startup.

It's a success, you hit your product objectives, and you start to look further to a proper ahead-of-time compiler native code that can stand alone without the full Dart run-time. After all, if you have to compile at build-time, you might as well take the time to do some proper optimizations. You actually change the language to have a sound typing system so that the compiler can make program transformations that are valid as long as it can rely on the program's types.

Fun fact: I am told that the shift to a sound type system actually started before Flutter and thus before AOT, because of a Dart-to-JavaScript compiler that you inherited from the time in which you thought the web would be the main target. The Dart-to-JS compiler used to be a whole-program compiler; this enabled it to do flow-sensitive type inference, resulting in faster and smaller emitted JavaScript. But whole-program compilation doesn't scale well in terms of compilation time, so Dart-to-JS switched to separate per-module compilation. But then you lose lots of types! The way to recover the fast-and-small-emitted-JS property was through a stronger, sound type system for Dart.

At this point, you still have your virtual machine, plus your ahead-of-time compiler, plus your Dart-to-JS compiler. Such riches, such bounty! It is not a bad situation to be in, in 2023: you can offer a good development experience via the just-in-time compiled virtual machine. Apparently you can even use the JIT on iOS in developer mode, because attaching ptrace to a binary allows for native code generation. Then when you go to deploy, you make a native binary that includes everything.

For the web, you also have your nice story, even nicer than with JavaScript in some ways: because the type checker and ahead-of-time compiler are integrated in Dart, you don't have to worry about WebPack or Vite or minifiers or uglifiers or TypeScript or JSX or Babel or any of the other things that JavaScript people are used to. Granted, the tradeoff is that innovation is mostly centralized with the Dart maintainers, but currently Google seems to be investing enough so that's OK.

Stepping back, this story is not unique to Dart; many of its scenes also played out in the world of JavaScript over the last 5 or 10 years as well. Hermes (and QuickJS, for that matter) does ahead-of-time compilation, albeit only to bytecode, and V8's snapshot facility is a form of native AOT compilation. But the tooling in the JavaScript world is more diffuse than with Dart. With the perspective of developing a new JavaScript-based mobile operating system in mind, the advantages that Dart (and thus Flutter) has won over the years are also on the table for JavaScript to win. Perhaps even TypeScript could eventually migrate to have a sound type system, over time; it would take a significant investment but the JS ecosystem does evolve, if slowly.

(Full disclosure: while the other articles in this series were written without input from the authors of the frameworks under review, through what I can only think was good URL guesswork, a draft copy of this article leaked to Flutter developers. Dart hacker Slava Egorov kindly sent me a mail correcting a number of misconceptions I had about Dart's history. Fair play on whoever guessed the URL, and many thanks to Slava for the corrections; any remaining errors are wholly mine, of course!)

Evaluation

So how do we expect Flutter applications to perform? If we were writing a new mobile OS based on JavaScript, what would it mean in terms of performance to adopt a Flutter-like architecture?

Startup latency

Flutter applications are well-positioned to start fast, with ahead-of-time compilation. However they have had problems realizing this potential, with many users seeing a big stutter when they launch a Flutter app.

To explain this situation, consider the structure of a typical low-end Android mobile device: you have a small number of not-terribly-powerful CPU cores, but attached to the same memory you also have a decent GPU with many cores. For example, the SoC in the low-end Moto E7 Plus has 8 CPU cores and 128 GPU cores (texture shader units). You could paint widget pixels into memory from either the CPU or the GPU, but you'd rather do it in the GPU because it has so many more cores: in the time it takes to compute the color of a single pixel on the CPU, on the GPU you could do, like, 128 times as many, given that the comparison is often between multi-threaded rasterization on the GPU versus single-threaded rasterization on the CPU.

Flutter has always tried to paint on the GPU. Historically it has done so via a GPU back-end to the Skia graphics library, notably used by Chrome among other projects. But, Skia's API is a drawing API, not a GPU API; Skia is the one responsible for configuring the GPU to draw what we want. And here's the problem: configuring the GPU takes time. Skia generates shader code at run-time for rasterizing the specific widgets used by the Flutter programmer. That shader code then needs to be compiled to the language the GPU driver wants, which looks more like Vulkan or Metal. The process of compilation and linking takes time, potentially seconds, even.

The solution to "too much startup shader compilation" is much like the solution to "too much startup JavaScript compilation": move this phase to build time. The new Impeller rendering library does just that. However to do that, it had to change the way that Flutter renders: instead of having Skia generate specialized shaders at run-time, Impeller instead lowers the shapes that it draws to a fixed set of primitives, and then renders those primitives using a smaller, fixed set of shaders. These primitive shaders are pre-compiled at build time and included in the binary. By switching to this new renderer, Flutter should be able to avoid startup jank.

Jank

Of all the application development frameworks we have considered, to my mind Flutter is the best positioned to avoid jank. It has the React-like asynchronous functional layout model, but "closer to the metal"; by skipping the tree of native UI widgets, it can potentially spend less time for each frame render.

When you start up a Flutter app on iOS, the shell of the application is actually written in Objective C++. On Android it's the same, except that it's Java. That shell then creates a FlutterView widget and spawns a new thread to actually run Flutter (and the user's Dart code). Mostly, Flutter runs on its own, rendering frames to the GPU resources backing the FlutterView directly.

If a Flutter app needs to communicate with the platform, it passes messages across an asynchronous channel back to the main thread. Although these messages are asynchronous, this is probably the largest potential source of jank in a Flutter app, outside the initial frame paint: any graphical update which depends on the answer to an asynchronous call may lag.

Peak performance

Dart's type system and ahead-of-time compiler optimize for predictable good performance rather than the more variable but potentially higher peak performance that could be provided by just-in-time compilation.

This story should probably serve as a lesson to any future platform. The people that developed the original Dart virtual machine had a built-in bias towards just-in-time compilation, because it allows the VM to generate code that is specialized not just to the program but also to the problem at hand. A given system with ahead-of-time compilation can always be made to perform better via the addition of a just-in-time compiler, so the initial focus was on JIT compilation. On iOS of course this was not possible, but on Android and other platforms where this was available it was the default deployment model.

However, even Android switched to ahead-of-time compilation instead of the JIT model in order to reduce startup latency: doing any machine code generation at all at program startup was more work than was needed to get to the first frame. One could add JIT back again on top of AOT but it does not appear to be a high priority.

I would expect that Capacitor could beat Dart in some raw throughput benchmarks, given that Capacitor's JavaScript implementation can take advantage of the platform's native JIT capability. Does it matter, though, as long as you are hitting your frame budget? I do not know.

Aside: An escape hatch to the platform

What happens if you want to embed a web view into a Flutter app?

If you think on the problem for a moment I suspect you will arrive at the unsatisfactory answer, which is that for better or for worse, at this point it is too expensive even for Google to make a new web engine. Therefore Flutter will have to embed the native WebView. However Flutter runs on its own threads; the native WebView has its own process and threads but its interface to the app is tied to the main UI thread.

Therefore either you need to make the native WebView (or indeed any other native widget) render itself to (a region of) Flutter's GPU backing buffer, or you need to copy the native widget's pixels into their own texture and then composite them in Flutter-land. It's not so nice! The Android and iOS platform view documentation discuss some of the tradeoffs and mitigations.

Aside: For want of a canvas

There is a very funny situation in the React Native world in which, if the application programmer wants to draw to a canvas, they have to embed a whole WebView into the React Native app and then proxy the canvas calls into the WebView. Flutter is happily able to avoid this problem, because it includes its own drawing library with a canvas-like API. Of course, Flutter also has the luxury of defining its own set of standard libraries instead of necessarily inheriting them from the web, so when and if they want to provide equivalent but differently-shaped interfaces, they can do so.

Flutter manages to be more expressive than React Native in this case, without losing much in the way of understandability. Few people will have to reach to the canvas layer, but it is nice to know it is there.

Conclusion

Dart and Flutter are terribly attractive from an engineering perspective. They offer a delightful API and a high-performance, flexible runtime with a built-in toolchain. Could this experience be brought to a new mobile operating system as its primary programming interface, based on JavaScript? React Native is giving it a try, but I think there may be room to take things further to own the application from the program all the way down to the pixels.

Well, that's all from me on Flutter and Dart for the time being. Next up, a mystery guest; see you then!

structure and interpretation of nativescript

24 April 2023 9:10 AM (igalia | compilers | mobile | ionic | capacitor | react | react native | nativescript | flutter | dart | javascript)

Greetings, hackers tall and hackers small!

We're only a few articles in to this series on mobile application development frameworks, but I feel like we are already well into our journey. We started our trip through the design space with a look at Ionic / Capacitor, which defines its user interface in terms of the web platform, and only calls out to iOS or Android native features as needed. We proceeded on to React Native, which moves closer to native by rendering to platform-provided UI widgets, layering a cross-platform development interface on top.

Today's article takes an in-depth look at NativeScript, whose point in the design space is further on the road towards the platform, unabashedly embracing the specificities of the API available on iOS and Android, exposing these interfaces directly to the application programmer.

In practice what this looks like is that a NativeScript app is a native app which simply happens to call JavaScript on the main UI thread. That JavaScript has access to all native APIs, directly, without the mediation of serialization or message-passing over a bridge or message queue.

The first time I heard this I thought that it couldn't actually be all native APIs. After all, new versions of iOS and Android come out quite frequently, and surely it would take some effort on the part of NativeScript developers to expose the new APIs to JavaScript. But no, it really includes all of the various native APIs: the NativeScript developers wrote a build-time inspector that uses the platform's native reflection capabilities to grovel through all available APIs and to automatically generate JavaScript bindings, with associated TypeScript type definitions so that the developer knows what is available.

Some of these generated files are checked into source, so you can get an idea of the range of interfaces that are accessible to programmers; for example, see the iOS type definitions for x86-64. There are bindings for, like, everything.

Given access to all the native APIs, how do you go about making an app? You could write the same kind of programs that you would in Swift or Kotlin, but in JavaScript. But this would require more than just the ability to access native capabilities when needed: it needs a thorough knowledge of the platform interfaces, plus NativeScript itself on top. Most people don't have this knowledge, and those that do are probably programming directly in Swift or Kotlin already.

On one level, NativeScript's approach is to take refuge in that most ecumenical of adjectives, "unopinionated". Whereas Ionic / Capacitor encourages use of web platform interfaces, and React Native only really supports React as a programming paradigm, NativeScript provides a low-level platform onto which you can layer a number of different high-level frameworks.

Now, most high-level JavaScript application development frameworks are oriented to targetting the web: they take descriptions of user interfaces and translate them to the DOM. When targetting NativeScript, you could make it so that they target native UI widgets instead. However given the baked-in assumptions of how widgets should be laid out (notably via CSS), there is some impedance-matching to do between DOM-like APIs and native toolkits.

NativeScript's answer to this problem is a middle layer: a cross-platform UI library that provides DOM-like abstractions and CSS layout in a way that bridges the gap between web-like and native. You can even define parts of the UI using a NativeScript-specific XML vocabulary, which NativeScript compiles to native UI widget calls at run-time. Of course, there is no CSS engine in UIKit or Android's UI toolkit, so NativeScript includes its own, implemented in JavaScript of course.

You could program directly to this middle layer, but I suspect that its real purpose is in enabling Angular, Vue, Svelte, or the like. The pitch would be that NativeScript lets app developers use pleasant high-level abstractions, but while remaining close to the native APIs; you can always drop down for more power and expressiveness if needed.

Diving back down to the low level, as we mentioned all of the interactions between JavaScript and the native platform APIs happen on the main application UI thread. NativeScript does also allow programmers to create background threads, using an implementation of the Web Worker API. One could even in theory run a React-based UI in a worker thread and proxy native UI updates to the main thread; as an unopinionated platform, NativeScript can support many different frameworks and paradigms.

Finally, there is the question of how NativeScript runs the JavaScript in an application. Recall that Ionic / Capacitor uses the native JS engine, by virtue of using the native WebView, and that React Native used to use JavaScriptCore on both platforms but now uses its own Hermes implementation. NativeScript is another point in the design space, using V8 on both platforms. (They used to use JavaScriptCore on iOS but switched to V8 once V8 was able to run on iOS in "jitless" mode.) Besides the reduced maintenance burden of using a single implementation on all platforms, this also has the advantage of being able to use V8 snapshots to move JavaScript parse-and-compile work to build-time, even on iOS.

Evaluation

NativeScript is fundamentally simple: it's V8 running in an application's main UI thread, with access to all platform native APIs. So how do we expect it to perform?

Startup latency

In theory, applications with a NativeScript-like architecture should have no problem with startup time, because they can pre-compile all of their JavaScript into V8 snapshots. Snapshots are cheap to load up because they are already in a format that V8 is ready to consume.

In practice, it would seem that V8 snapshots do not perform as expected for NativeScript. There are a number of aspects about this situation that I don't understand, which I suspect relate to the state of the tooling around V8 rather than to the fundamental approach of ahead-of-time compilation. V8 is really made for Chrome, and it could be that not enough maintenance resources have been devoted to this snapshot facility.

In the meantime, NativeScript instead uses V8's code cache feature, which caches the result of parsing and compiling JavaScript files on the device. In this way the first time an app is installed or updated, it might start up slowly, but subsequent runs are faster. If you were designing a new operating system, you'd probably want to move this work to app install-time.

As we mentioned above, NativeScript apps have access to all native APIs. That is a lot of APIs, and only some of those interfaces will actually be used by any given app. In an ideal world, we would expect the build process to only include JavaScript code for those APIs that are needed by the application. However in the presence of eval and dynamic property lookup, pruning the native API surface to the precise minimum is a hard problem for a bundler to perform on its own. The solution for the time being is to manually allow and deny subsets of the platform native API. It's not an automatic process though, so it can be error-prone.

Besides the work that the JavaScript engine has to do to load an application's code, the other startup overhead involves whatever work that JavaScript might need to perform before the first frame is shown. In the case of NativeScript, more work is done before the initial layout than one would think: the main UI XML file is parsed by an XML parser written in JavaScript, any needed CSS files are parsed and loaded (again by JavaScript), and the tree of XML elements is translated to a tree of UI elements. The layout of the items in the view tree is then computed (in JavaScript, but calling into native code to measure text and so on), and then the app is ready.

At this point, I am again going to wave my "I am just a compiler engineer" flag: I am not a UI specialist, much less a NativeScript specialist. As in compilers, performance measurement and monitoring are key to UI development, but I suspect that also as in compilers there is a role for gut instinct. Incremental improvements are best driven by metrics, but qualitative leaps are often the result of somewhat ineffable hunches or even guesswork. In that spirit I can only surmise that React Native has an advantage over NativeScript in time-to-first-frame, because its layout is performed in C++ and because its element tree is computed directly from JavaScript instead of having JavaScript interpret XML and CSS files. In any case, I look forward to the forthcoming part 2 of the NativeScript and React Native performance investigations that were started in November 2022.

If I were NativeScript and using NativeScript's UI framework, and if startup latency proves to actually be a problem, I would lean into something in the shape of Angular's ahead-of-time compilation mode, but for the middle NativeScript UI layer.

Jank

On the face of it, NativeScript is the most jank-prone of the three frameworks we have examined, because it runs JavaScript on the main application UI thread, interleaved with UI event handling and painting and all of that. If an app's JavaScript takes too long to run, the app might miss frames or fail to promptly handle an event.

On the other hand, relative to React Native, the user's code is much closer to the application's behavior. There's no asynchrony between the application's logic and its main loop: in NativeScript it is easy to identify the code causing jank and eventually fix it.

The other classic JavaScript-on-the-main-thread worry relates to garbage collection pauses. V8's garbage collector does try to minimize the stop-the-world phase by tracing the heap concurrently and leveraging parallelism during pauses. Also, the user interface of a mobile app runs in an event loop, and typically spends most of its time idle; V8 exposes some API that can take advantage of this idle time to perform housekeeping tasks instead of needing to do them when handling high-priority events.

That said, having looked into the code of both the iOS and Android run-times, NativeScript does not currently take advantage of this facility. I dug deeper and it would seem that V8 itself is in flux, as the IdleNotificationDeadline API is on its way out; is the thought that concurrent tracing is largely sufficient? I would expect that if conservative stack scanning lands, we will see a re-introduction of this kind of API, as it does make sense to synchronize with the event loop when scanning the main thread stack.

Peak performance

As we have seen in our previous evaluations, this question boils down to "is the JavaScript engine state-of-the-art, and can it perform just-in-time compilation". In the case of NativeScript, the answers are yes and maybe, respectively: V8 is state-of-the-art, and it can JIT on Android, but not on iOS.

Perhaps the mitigation here is that the hardware that iOS runs on tends to be significantly more powerful than median Android devices; if you had to pick a subset of users to penalize with an interpreter-only run-time, people with iPhones are the obvious choice, because they can afford it.

Aside: Are markets wise?

Recall that our perspective in this series is that of the designer of a new JavaScript-based mobile development platform. We are trying to answer the question of what would it look like if a new platform offered a NativeScript-like experience. In this regard, only the structure of NativeScript is of interest, and notably its "market success" is not relevant, except perhaps in some Hayekian conception of the world in which markets are necessarily smarter than, well, me, or you, or any one of us.

It must be said, though, that React Native is the 800-pound gorilla of JavaScript mobile application development. The 2022 State of JS survey shows that among survey respondents, more people are aware of React Native than any other mobile framework, and people are generally more positive about React Native than other frameworks. Does NativeScript's mitigated market share indicate something about its architecture, or does it speak speak more to the size of Facebook's budget, both on the developer experience side and on marketing?

Aside: On the expressive power of application framworks

Oddly, I think the answer to the market wisdom question might be found in a 35-year-old computer science paper, "On the expressive power of programming languages" (PDF).

In this paper, Matthias Felleisen considers the notion of what it means for one programming language to be more expressive than another. For example, is a language with just for less expressive than a language with both for and while? Intuitively we would say no, these are similar things; you can make a simple local transformation of while (x) {...} to for (;x;) {...} and you have exactly the same program semantics. On the other hand a language with just for is less expressive than one which also has goto; there is no simple local rewrite that can turn goto into for.

In the same way, we can consider the question of what it would mean for one library to be more expressive than another. After all, the API of a library exposes a language in which its user can write programs; we should be able to reason about these languages. So between React Native and NativeScript, which one is more expressive?

By Felleisen's definitions, NativeScript is clearly the more expressive language: there is no simple local transformation that can turn imperative operations on native UI widgets into equivalent functional-reactive programs. Yes, with enough glue code React Native can reach directly to native APIs in a similar way as NativeScript, but everything that touches the native UI tree is expressly under React Native's control: there is no sanctioned escape hatch.

You might think that "more expressive" is always better, but Felleisen's take is more nuanced than that. Yes, he says, more expressive languages do allow programmers to make more concise programs, because they allow programmers to define abstractions that encapsulate patterns, and this is a good thing. However he also concludes that "an increase in expressive power is related to a decrease of the set of 'natural' (mathematically appealing) operational equivalences." Less expressive programming languages are easier to reason about, in general, and indeed that is one of the recognized strengths of React's programming model: it is easy to compose components and have confidence that the result will work.

Summary

A NativeScript-like architecture offers the possibility of performance: the developer has all the capabilities needed for writing pleasant-to-use applications that blend in with the platform-native experience. It is up to the developers to choose how to use the power at their disposal. In the wild, I expect that the low-level layer of NativeScript's API is used mainly by expert developers, who know how to assemble well-functioning machines from the parts on offer.

As a primary programming interface for a new JavaScript-based mobile platform, though, just providing a low-level API would seem to be not enough. NativeScript rightly promotes the use of more well-known high-level frameworks on top: Angular, Vue, Svelte, and such. Less experienced developers should use an opinionated high-level UI framework; these developers don't have good opinions yet and the API should lead them in the right direction.

That's it for today. Thanks for reading these articles, by the way; I have enjoyed diving into this space. Next up, we'll take a look beyond JavaScript, to Flutter and Dart. Until then, happy hacking!

structure and interpretation of react native

21 April 2023 8:20 AM (igalia | compilers | mobile | ionic | capacitor | react | react native | nativescript | flutter | dart | javascript)

Hey hey! Today's missive continues exploring the space of JavaScript and mobile application development.

Yesterday we looked into Ionic / Capacitor, giving a brief structural overview of what Capacitor apps look like under the hood and how this translates to three aspects of performance: startup latency, jank, and peak performance. Today we'll apply that same approach to another popular development framework, React Native.

Background: React

I don't know about you, but I find that there is so much marketing smoke and lights around the whole phenomenon that is React and React Native that sometimes it's hard to see what's actually there. This is compounded by the fact that the programming paradigm espoused by React (and its "native" cousin that we are looking at here) is so effective at enabling JavaScript UI programmers to focus on the "what" and not the "how" that the machinery supporting React recedes into the background.

At its most basic, React is what they call a functional reactive programming model. It is functional in the sense that the user interface elements render as a function of the global application state. The reactive comes into how user input is handled, but I'm not going to focus on that here.

React's rendering process starts with a root element tree, describing the root node of the user interface. An element is a JavaScript object with a type property. To render an element tree, if the value of the type property is a string, then the element is terminal and doesn't need further lowering, though React will visit any node in the children property of the element to render them as needed.

Otherwise if the type property of an element is a function, then the element node is functional. In that case React invokes the node's render function (the type property), passing the JavaScript element object as the argument. React will then recursively re-render the element tree produced as a result of rendering the component until all nodes are terminal. (Functional element nodes can instead have a class as their type property, but the concerns are pretty much the same.)

(In the language of React Native, a terminal node is a React Host Component, and a functional node is a React Composite Component, and both are React Elements. There are many imprecisely-used terms in React and I will continue this tradition by using the terms I mention above.)

The rendering phase of a React application is thus a function from an element tree to a terminal element tree. Nodes of element trees can be either functional or terminal. Terminal element trees are composed only of terminal elements. Rendering lowers all functional nodes to terminal nodes. This description applies both to React (targetting the web) and React Native (which we are reviewing here).

It's probably useful to go deeper into what React does with a terminal element tree, before building to the more complex pipeline used in React Native, so here we go. The basic idea is that React-on-the-web does impedance matching between the functional description of what the UI should have, as described by a terminal element tree, and the stateful tree of DOM nodes that a web browser uses to actually paint and display the UI. When rendering yields a new terminal element tree, React will compute the difference between the new and old trees. From that difference React then computes the set of imperative actions needed to mutate the DOM tree to correspond to what the new terminal element tree describes, and finally applies those changes.

In this way, small changes to the leaves of a React element tree should correspond to small changes in the DOM. Additionally, since rendering is a pure function of the global application state, we can avoid rendering at all when the application state hasn't changed. We'll dive into performance more deeply later on in the article.

React Native doesn't use a WebView

React Native is similar to React-on-the-web in intent but different in structure. Instead of using a WebView on native platforms, as Ionic / Capacitor does, React Native renders the terminal element tree to platform-native UI widgets.

When a React Native functional element renders to a terminal element, it will create not just a JS object for the terminal node as React-on-the-web does, but also a corresponding C++ shadow object. The fully lowered tree of terminal elements will thus have a corresponding tree of C++ shadow objects. React Native will then calculate the layout for each node in the shadow tree, and then commit the shadow tree: as on the web, React Native computes the set of imperative actions needed to change the current UI so that it corresponds to what the shadow tree describes. These changes are then applied on the main thread of the application.

The twisty path that leads one to implement JavaScript

The description above of React Native's rendering pipeline applies to the so-called "new architecture", which has been in the works for some years and is only now (April 2023) starting to be deployed. The key development that has allowed React Native to move over to this architecture is tighter integration and control over its JavaScript implementation. Instead of using the platform's JavaScript engine (JavaScriptCore on iOS or V8 on Android), Facebook went and made their own whole new JavaScript implementation, Hermes. Let's step back a bit to see if we can imagine why anyone in their right mind would make a new JS implementation.

In the last article, I mentioned that the only way to get peak JS performance on iOS is to use the platform's WkWebView, which enables JIT compilation of JavaScript code. React Native doesn't want a WebView, though. I guess you could create an invisible WebView and just run your JavaScript in it, but the real issue is that the interface to the JavaScript engine is so narrow as to be insufficiently expressive. You can't cheaply synchronously create a shadow tree of layout objects, for example, because every interaction with JavaScript has to cross a process boundary.

So, it may be that JIT is just not worth paying for, if it means having to keep JavaScript at arm's distance from other parts of the application. How do you do JavaScript without a browser on mobile, though? Either you use the platform's JavaScript engine, or you ship your own. It would be nice to use the same engine on iOS and Android, though. When React Native was first made, V8 wasn't able to operate in a mode that didn't JIT, so React Native went with JavaScriptCore on both platforms.

Bundling your own JavaScript engine has the nice effect that you can easily augment it with native extensions, for example to talk to the Swift or Java app that actually runs the main UI. That's what I describe above with the creation of the shadow tree, but that's not quite what the original React Native did; I can only speculate but I suspect that there was a fear that JavaScript rendering work (or garbage collection!) could be heavy enough to cause the main UI to drop frames. Phones were less powerful in 2016, and JavaScript engines were less good. So the original React Native instead ran JavaScript in a separate thread. When a render would complete, the resulting terminal element tree would be serialized as JSON and shipped over to the "native" side of the application, which would actually apply the changes.

This arrangement did work, but it ran into problems whenever the system needed synchronous communication between native and JavaScript subsystems. As I understand it, this was notably the case when React layout would need the dimensions of a native UI widget; to avoid a stall, React would assume something about the dimensions of the native UI, and then asynchronously re-layout once the actual dimensions were known. This was particularly gnarly with regards to text measurements, which depend on low-level platform-specific rendering details.

To recap: React Native had to interpret its JS on iOS and was using a "foreign" JS engine on Android, so they weren't gaining anything by using a platform JS interpreter. They would sometimes have some annoying layout jank when measuring native components. And what's more, React Native apps would still experience the same problem as Ionic / Capacitor apps, in that application startup time was dominated by parsing and compiling the JavaScript source files.

The solution to this problem was partly to switch to the so-called "new architecture", which doesn't serialize and parse so much data in the course of rendering. But the other side of it was to find a way to move parsing and compiling JavaScript to the build phase, instead of having to parse and compile JS every time the app was run. On V8, you would do this by generating a snapshot. On JavaScriptCore, which React Native used, there was no such facility. Faced with this problem and armed with Facebook's bank account, the React Native developers decided that the best solution would be to make a new JavaScript implementation optimized for ahead-of-time compilation.

The result is Hermes. If you are familiar with JavaScript engines, it is what you might expect: a JavaScript parser, originally built to match the behavior of Esprima; an SSA-based intermediate representation; a set of basic optimizations; a custom bytecode format; an interpreter to run that bytecode; a GC to manage JS objects; and so on. Of course, given the presence of eval, Hermes needs to include the parser and compiler as part of the virtual machine, but the hope is that most user code will be parsed and compiled ahead-of-time.

If this were it, I would say that Hermes seems to me to be a dead end. V8 is complete; Hermes is not. For example, Hermes doesn't have with, async function implementation has been lagging, and so on. Why Hermes when you can V8 (with snapshots), now that V8 doesn't require JIT code generation?

I thought about this for a while and in the end, given that V8's main target isn't as an embedded library in a mobile app, perhaps the binary size question is the one differentiating factor (in theory) for Hermes. By focussing on lowering distribution size, perhaps Hermes will be a compelling JS engine in its own right. In any case, Facebook can afford to keep Hermes running for a while, regardless of whether it has a competitive advantage or not.

It sounds like I'm criticising Hermes here but that's not really the point. If you can afford it, it's good to have code you control. For example one benefit that I see React Native getting from Hermes is that they control the threading model; they can mostly execute JS in its own thread, but interrupt that thread and switch to synchronous main-thread execution in response to high-priority events coming from the user. You might be able to do that with V8 at some point but the mobile-apps-with-JS domain is still in flux, so it's nice to have a sandbox that React Native developers can use to explore the system design space.

Evaluation

With that long overview out of the way, let's take a look to what kinds of performance we can expect out of a React Native system.

Startup latency

Because React Native apps have their JavaScript code pre-compiled to Hermes bytecode, we can expect that the latency imposed by JavaScript during application startup is lower than is the case with Ionic / Capacitor, which needs to parse and compile the JavaScript at run-time.

However, it must be said that as a framework, React tends to result in large application sizes and incurs significant work at startup time. One of React's strengths is that it allows development teams inside an organization to compose well: because rendering is a pure function, it's easy to break down the task of making an app into subtasks to be handled by separate groups of people. Could this strength lead to a kind of weakness, in that there is less of a need for overall coordination on the project management level, such that in the end nobody feels responsible for overall application performance? I don't know. I think the concrete differences between React Native and React (the C++ shadow object tree, the multithreading design, precompilation) could mean that React Native is closer to an optimum in the design space than React. It does seem to me though that whether a platform's primary development toolkit shold be React-like remains an open question.

Jank

In theory React Native is well-positioned to avoid jank. JavaScript execution is mostly off the main UI thread. The threading model changes to allow JavaScript rendering to be pre-empted onto the main thread do make me wonder, though: what if that work takes too much time, or what if there is a GC pause during that pre-emption? I would not be surprised to see an article in the next year or two from the Hermes team about efforts to avoid GC during high-priority event processing.

Another question I would have about jank relates to interactivity. Say the user is dragging around a UI element on the screen, and the UI needs to re-layout itself. If rendering is slow, then we might expect to see a lag between UI updates and the dragging motion; the app technically isn't dropping frames, but the render can't complete in the 16 milliseconds needed for a 60 frames-per-second update frequency.

Peak perf

But why might rendering be slow? On the one side, there is the fact that Hermes is not a high-performance JavaScript implementation. It uses a simple bytecode interpreter, and will never be able to meet the performance of V8 with JIT compilation.

However the other side of this is the design of the application framework. In the limit, React suffers from the O(n) problem: any change to the application state requires the whole element tree to be recomputed. Rendering and layout work is proportional to the size of the application, which may have thousands of nodes.

Of course, React tries to minimize this work, by detecting subtrees whose layout does not change, by avoiding re-renders when state doesn't change, by minimizing the set of mutations to the native widget tree. But the native widgets aren't the problem: the programming model is, or it can be anyway.

Aside: As good as native?

Again in theory, React Native can used to write apps that are as good as if they were written directly against platform-native APIs in Kotlin or Swift, because it uses the same platform UI toolkits as native applications. React Native can also do this at the same time as being cross-platform, targetting iOS and Android with the same code. In practice, besides the challenge of designing suitable cross-platform abstractions, React Native has to grapple with potential performance and memory use overheads of JavaScript, but the result has the potential to be quite satisfactory.

Aside: Haven't I seen that rendering model somewhere?

As I mentioned in the last article, I am a compiler engineer, not a UI specialist. In the course of my work I do interact with a number of colleagues working on graphics and user interfaces, notably in the context of browser engines. I was struck when reading about React Native's rendering pipeline about how much it resembled what a browser itself will do as part of the layout, paint, and render pipeline: translate a tree of objects to a tree of immutable layout objects, clip those to the viewport, paint the ones that are dirty, and composite the resulting textures to the screen.

It's funny to think about how many levels we have here: the element tree, the recursively expanded terminal element tree, the shadow object tree, the platform-native widget tree, surely a corresponding platform-native layout tree, and then the GPU backing buffers that are eventually composited together for the user to see. Could we do better? I could certainly imagine any of these mobile application development frameworks switching to their own Metal/Vulkan-based rendering architecture at some point, to flatten out these layers.

Summary

By all accounts, React Native is a real delight to program for; it makes developers happy. The challenge is to make it perform well for users. With its new rendering architecture based on Hermes, React Native may well be on the path to addressing many of these problems. Bytecode pre-compilation should go a long way towards solving startup latency, provided that React's expands-to-fit-all-available-space tendency is kept in check.

If you were designing a new mobile operating system from the ground up, though, I am not sure that you would necessarily end up with React Native as it is. At the very least, you would include Hermes and the base run-time as part of your standard library, so that every app doesn't have to incur the space costs of shipping the run-time. Also, in the same way that Android can ahead-of-time and just-in-time compile its bytecode, I would expect that a mobile operating system based on React Native would extend its compiler with on-device post-install compilation and possibly JIT compilation as well. And at that point, why not switch back to V8?

Well, that's food for thought. Next up, NativeScript. Until then, happy hacking!

structure and interpretation of capacitor programs

20 April 2023 10:20 AM (igalia | compilers | mobile | ionic | capacitor | react | react native | nativescript | flutter | dart | javascript)

Good day, hackers! Today's note is a bit of a departure from compilers internals. A client at work recently asked me to look into cross-platform mobile application development and is happy for the results to be shared publically. This, then, is the first in a series of articles.

Mobile apps and JavaScript: how does it work?

I'll be starting by taking a look at Ionic/Capacitor, React Native, NativeScript, Flutter/Dart, and then a mystery guest. This article will set the stage and then look into Ionic/Capacitor.

The angle I am taking is, if you were designing a new mobile operating system that uses JavaScript as its native application development language, what would it mean to adopt one of these as your primary app development toolkit? It's a broad question but I hope we can come up with some useful conclusions.

I'm going to approach the problem from the perspective of a compiler engineer. Compilers translate high-level languages that people like to write into low-level languages that machines like to run. Compilation is a bridge between developers and users: developers experience the source language interface to the compiler, while users experience the result of compiling those source languages. Note, though, that my expertise is mostly on the language and compiler side of things; though I have worked on a few user-interface libraries in my time, I am certainly not a graphics specialist, nor a user-experience specialist.

I'm explicitly not including the native application development toolkits from Android and iOS, because these are most useful to me as a kind of "control" to the experiment. A cross-platform toolkit that compiles down to use native widgets can only be as good as the official SDK, but it might be worse; and a toolkit that routes around the official widgets by using alternate drawing primitives has the possibility to be better, but at the risk of being inconsistent with other platform apps, along with the more general risk of not being as good as the native UI libraries.

And, of course, there is the irritation that SwiftUI / UIKit / AppKit aren't not open source, and that neither iOS nor Android's native UI library is cross platform. If we are designing a new JavaScript-based mobile operating system—that's the conceit of the original problem statement—then we might as well focus on the existing JS-based toolkits. (Flutter of course is the odd one out, but it's interesting enough to include in the investigation.)

Ionic / Capacitor

Let's begin with the Ionic Framework UI toolkit, based on the Capacitor web run-time.

Overview

Capacitor is like an alternate web browser development kit for phones. It uses the platform's native WebView: WKWebView on iOS or WebView on Android. The JavaScript APIs available within that WebView are extended with Capacitor plugins which can access native APIs; these plugins let JavaScript do the sort of things you wouldn't be able to do directly in a web browser.

(Over time, most of the features provided by Capacitor end up in the web platform in some way. For example, there are web standards to allow JavaScript to detect or lock screen rotation. The Fugu project is particularly avant-garde in pushing these capabilities to the web. But, the wheel of web standards grinds very finely, and for an app to have access to, say, the user's contact list now, it is pragmatic to use an extended-webview solution like Capacitor.)

I call Capacitor a "web browser development kit" because creating a Capacitor project effectively copies the shell of a specialized browser into your source tree, usually with an iOS and Android version. Building the app makes a web browser with your extensions that runs your JS, CSS, image, and other assets. Running the app creates a WebView and invokes your JavaScript; your JS should then create the user interface using the DOM APIs that are standard in a WebView.

As you can see, Capacitor is quite full-featured when it comes to native platform capabilities but is a bit bare-bones when it comes to UI. To provide a richer development environment, the Ionic Framework adds a set of widgets and some application life-cycle abstractions on top of the Capacitor WebView. Ionic is not like a "normal" app framework like Vue or Angular because it takes advantage of Capacitor APIs, and because it tries to mimic the presentation and behavior of the native plaform (iOS or Android, mainly), so that apps built with it don't look out-of-place.

The Ionic Framework itself is built using Web Components, which end up creating styled DOM nodes with JavaScript behavior. These components can compose with other app frameworks such as Vue or Angular, allowing developers to share some of their app code between web and mobile apps---you write the main app in Vue, and just implement the components one way on the web and using something like Ionic Framework on mobile.

To recap, an Ionic app uses Ionic Framework libraries on top of a Capacitor run-time. You could use other libraries on top of Capacitor instead of Ionic. In any case, I'm most interested in Capacitor, so let's continue our focus there.

Evaluation

Performance has a direct effect on the experience that users have when interacting with an application, and here I would like to break down the performance in three ways: one, startup latency; two, jank; and three, peak throughput. As Capacitor is structurally much like a web browser, many of these concerns, pitfalls, and mitigation techniques are similar to those on the web.

Startup latency

Startup latency is how long the app makes you wait after you run it before you can interact with it. The app may hide some of this work behind a splash screen, but if there is too much work to do at startup-time, the risk is the user might get bored or frustrated and switch away. The goal, therefore, it to minimize startup latency. Most people consider time-to-interactive of less than than a second to be sufficient; there's room to improve but for better or for worse, user expectations here are lower than they could be.

In the case of Capacitor, when a user launches an application, the app loads a minimal skeleton program that loads a WebView. On iOS and most Android systems, the rendering and JS parts of the WebView run in a separate process that has access to the app's bundled assets: JavaScript files, images, CSS, and so on. The WebView then executes the JavaScript main function to create the UI.

Capacitor application startup time will be dominated by parsing and compiling the JavaScript source files, as well as any initial work needed to boot the app framework (which could require running a significant amount of JS), and will depend on how fast the user's device is. The most effective performance mitigation here is to reduce the amount of JavaScript loaded, but this can be difficult for complex apps. The main techniques to reduce app size are toolchain-based. Tree-shaking, bundling, and minification reduce the number of JS bytes that the engine needs to parse without changing application logic. Code splitting can defer some work until it is needed, but this might just result in jank later on.

Common Ionic application sizes would seem to be between 500 kB and 5 MB of JavaScript. In 2021, Alex Russell suggested that the soon-to-be standard Android performance baseline should be the Moto E7 Plus which, if its performance relative to the mid-range Moto G4 phone that he measured in 2019 translates to JavaScript engine speed, should be able to parse and compile uncompressed JavaScript at a rate of about 1 MB/s. That's not very much, if you want to get startup latency under a second, and a JS payload size of 5 MB would lead to 5-second startup delays for many users. To my mind, this is the biggest challenge for a Capacitor-based app.

(Calculation detail: The Moto G4 JavaScript parse-and-compile throughput is measured at 170 kB/s, for compressed JS. Assuming a compression ratio of 4 and that the Moto E7 Plus is 50% faster than the Moto G4, that gets us to 1 MB/s for uncompressed JS, for what was projected to be a performance baseline in 2023.)

Jank

Jank is when an application's animation and interaction aren't smooth, because the application somehow missed rendering one or more frames. Technically startup time is a form of jank, but it is useful to treat startup separately.

Generally speaking, the question of whether a Capacitor application will show jank depends on who is doing the rendering: if application JavaScript is driving an animation, then this could cause jank, in the same way as it would on the web. The mitigation is to lean on the web platform so that the WebView is the one in charge of ensuring smooth interaction, for example by using CSS animations or the native scrolling capability.

There may always be an impetus to do some animation in JavaScript, though, for example if the design guidelines require a specific behavior that can't be created using raw CSS.

Peak perf

Peak throughput is how much work an app can do per unit time. For an application written in JavaScript, this is a question of how good can the JavaScript engine make a user's code run fast.

Here we need to make an important aside on the strategies that a JavaScript engine uses to run a user's code. A standard technique that JS engines use to make JavaScript run fast is just-in-time (JIT) compilation, in which the engine emits specialized machine code at run-time and then runs that code. However, on iOS the platform prohibits JIT code generation for most applications. The usual justification for this restriction is that the low-level capability granted to a program that allows it to JIT-compile increases the likelihood of security exploits. The only current exception to the no-JIT policy on iOS is for WebView (including the one in the system web browser), which iOS maintainers consider to be sufficiently sandboxed, so that any security exploit that uses JIT code generation won't be able to access other capabilities possessed by the application process.

So much for the motivation, as I understand it. The upshot is, there is no JIT on iOS outside web browsers or web views. If a cross-platform application development toolkit wants peak JavaScript performance on iOS, it has to run that JS in a WebView. Capacitor does this, so it has fast JS on iOS. Note that these restrictions are not in place on Android; you can have fast JS on Android without using a WebView, by using the system's JavaScript libraries.

Of course, actually getting peak throughput out of JavaScript is an art but the well-known techniques for JavaScript-on-the-web apply here; we won't cover them in this series.

The bridge

All of these performance observations are common to all web browsers, but with Capacitor there is the additional wrinkle that a Capacitor application can access native capabilities, for example access to a user's contacts. When application JavaScript goes to access the contacts API, Capacitor will send a message to the native side of the application over a bridge. The message will be serialized to JSON. Capacitor will include an implementation for the native side of the bridge into your app's source code, written in Swift for iOS or Java for Android. The native end of the bridge parses the JSON message, performs the requested action, and sends a message back with the result. Some messages may need to be proxied to the main application thread, because some native APIs can only be processed there.

As you can imagine, the bridge has an overhead. Apps that have high-bandwidth access to native capabilities will also have JSON encoding overhead, as well general asynchronous coordination overhead. It may even be possible that encoding or decoding a large JSON message causes the WebView to miss a frame, for example when accessing a large file in local storage.

The bridge is a necessary component to the design, though; an iOS WebView can't have direct in-process access to native capabilities. For Android, the WebView APIs do not appear to allow this either, though it is theoretically possible to ship your own WebView that could access native APIs directly. In any case, the Capacitor multi-process solution does allow for some parallelism, and the enforced asynchronous nature of the APIs should lead to less modal application programming.

Aside: Can WebView act like native widgets?

Besides performance, one way that users experience application behavior is by way of expectations: users expect (but don't require) a degree of uniformity between different apps on the same platform, for example the same scrolling behavior or the same kinds of widgets. This aspect isn't the most important one to my investigation, because I'm more concerned with a new operating system that might come with its own design language with its own standard JavaScript-based component library, but it's one to note; an Ionic app is starting at a disadvantage relative to a platform-native app. Sometimes ensuring a platform-native UI is just a question of CSS styling and using the right library (like Ionic Framework), but sometimes it might take significant engineering or possibly even new additions to the web platform.

Aside: Is a WebView as good as native widgets?

As a final observation on user experience, it's worth asking the question of whether you can actually make a user interface using the web platform that is "as good" as native widgets. After all, the API and widget set (DOM) available to a WebView is very different from that available to a native application; is a WebView able to use CPU and GPU resources efficiently to create a smooth interface? Are web standards, CSS, and the DOM API somehow a constraint that prevents efficient user interface construction? Here, somewhat embarrassingly, I can't actually say. I am instead going to just play the "compiler engineer" card, that I'm not a UI specialist; to me it is an open question.

Summary

Capacitor, with or without the Ionic Framework on top, is the always-bet-on-the-web solution to the cross-platform mobile application development problem. It uses stock system WebViews and JavaScript, augmented with native capabilities as needed. Consistency with the UX of the specific platform (iOS or Android) may require significant investment, though.

Performance-wise, my evaluation is that Capacitor apps are well-positioned for peak JS performance due to JIT code generation and low jank due to offloading animations to the web platform, but may have problems with startup latency, as Capacitor needs to parse and compile a possibly significant amount of JavaScript each time it is run.

Next time, a different beast: React Native. Until then, happy hacking!

sticking point

18 April 2023 8:23 PM (meta | hackjam | guile | gc | hegel)

Good evening, gentle readers. A brief note tonight, on a sticky place.

See, I have too many projects right now.

In and of itself this is not so much of a problem as a condition. I know my limits; I keep myself from burning out by shedding load, and there is a kind of priority list of which projects keep adequate service levels.

First come the tiny humans that are in my care who need their butts wiped and bodies translated to and from school or daycare and who -- well you know the old Hegelian trope, that the dialectic crank of history doesn't turn itself, that it takes actions from people to synthesize the thesis and the antithesis, and that even History itself isn't always monotonic; in the same way, bedtime is a reality, there are the material conditions of sleepiness and you're-gonna-be-so-tired-tomorrow but without the World-Historical Actor which whose name is Dada ain't nobody getting into pyjamas, not to mention getting in bed and going to sleep.

Lower in the priority queue there is work, and work is precious: a whole day, a day of just thinking about things and solving problems; yes, I wish I could choose all of the problems that I work on but the simple fact of being able to think and solve is a gift, or rather a reward, time stolen from the arbitrary chaotic preliterate interruptors. I love my kids -- and here I have to choose my conjunction wisely -- and also they love me. Which is nice! They express this through wanting to sit on my lap and demand that I read to them when I am thinking about things. We all have our love languages, I suppose.

And the house? There are 5 of us now, and that's, you know, more than 5 times the work it was before because the 3 wee ones are more chaotic than the adults. There is so much laundry. We are now a cook-the-whole-500g-bag-of-pasta family, and we're far from the teenager stage.

Finally there are the weird side projects, of which there are a few. I have some sporty things I do that I can't avoid because I am socially coerced into doing them; all in all a fine configuration. I have some volunteer work, which has parts I like that take time, which I am fine with, and parts I don't that I neglect. There's the garden, also neglected but nice to muck about in sometimes. Sometimes I see friends? Rarely, so rarely, but sometimes.

But, netfriends, if I find a free couple hours, I hack. Not less than a couple hours, because otherwise I can't get into things. More? Yes please! But usually no, because priorities.

I write this note because I realized that in the last month or so, hain't been no hacktime; there's barely 30 minutes between the end of the evening kitchen-clean and the onset of zzzz. That's fine, really; ultimately I need to carve out work time for this sort of thing, somehow or other.

But I was also realizing that I painted myself into a corner with my GC work. See, I need some new benchmarks, and I can't afford to use Guile itself as my benchmark yet, because I can't afford to fix deep bugs right now. I need something small. Specifically I need a little language implementation with efficient GC integration. I was looking at porting the little Scheme implementation from my Wasm JIT work to C -- because the rest of whippet is in C -- but is that a good enough reason? I got mired in the swamps of precise GC roots in C and let me tell you, it is disgusting. You don't have RAII like you do in C++, without GCC extensions. You don't have stack maps, like you would like. I might be able to push through but it's the sort of project you need a day for, and those days have not been forthcoming recently.

Anyway, just a little hack log tonight, not detailing a success, but rather a condition. Perhaps observing the position of this logjam will change its velocity. Here's hoping, and until next time, happy hacking!

a world to win: webassembly for the rest of us

20 March 2023 9:06 AM (gc | webassembly | wasm | scheme | garbage collection | browsers | js | igalia | spritely | bobkonf | delimited continuations)

Good day, comrades!

Today I'd like to share the good news that WebAssembly is finally coming for the rest of us weirdos.

A world to win

WebAssembly for the rest of us

17 Mar 2023 – BOB 2023

Andy Wingo

Igalia, S.L.

This is a transcript-alike of a talk that I gave last week at BOB 2023, a gathering in Berlin of people that are using "technologies beyond the mainstream" to get things done: Haskell, Clojure, Elixir, and so on. PDF slides here, and I'll link the video too when it becomes available.

WebAssembly, the story

WebAssembly is an exciting new universal compute platform

WebAssembly: what even is it? Not a programming language that you would write software in, but rather a compilation target: a sort of assembly language, if you will.

WebAssembly, the pitch

Predictable portable performance

  • Low-level
  • Within 10% of native

Reliable composition via isolation

  • Modules share nothing by default
  • No nasal demons
  • Memory sandboxing

Compile your code to WebAssembly for easier distribution and composition

If you look at what the characteristics of WebAssembly are as an abstract machine, to me there are two main areas in which it is an advance over the alternatives.

Firstly it's "close to the metal" -- if you compile for example an image-processing library to WebAssembly and run it, you'll get similar performance when compared to compiling it to x86-64 or ARMv8 or what have you. (For image processing in particular, native still generally wins because the SIMD primitives in WebAssembly are more narrow and because getting the image into and out of WebAssembly may imply a copy, but the general point remains.) WebAssembly's instruction set covers a broad range of low-level operations that allows compilers to produce efficient code.

The novelty here is that WebAssembly is both portable while also being successful. We language weirdos know that it's not enough to do something technically better: you have to also succeed in getting traction for your alternative.

The second interesting characteristic is that WebAssembly is (generally speaking) a principle-of-least-authority architecture: a WebAssembly module starts with access to nothing but itself. Any capabilities that an instance of a module has must be explicitly shared with it by the host at instantiation-time. This is unlike DLLs which have access to all of main memory, or JavaScript libraries which can mutate global objects. This characteristic allows WebAssembly modules to be reliably composed into larger systems.

WebAssembly, the hype

It’s in all browsers! Serve your code to anyone in the world!

It’s on the edge! Run code from your web site close to your users!

Compose a library (eg: Expat) into your program (eg: Firefox), without risk!

It’s the new lightweight virtualization: Wasm is what containers were to VMs! Give me that Kubernetes cash!!!

Again, the remarkable thing about WebAssembly is that it is succeeding! It's on all of your phones, all your desktop web browsers, all of the content distribution networks, and in some cases it seems set to replace containers in the cloud. Launch the rocket emojis!

WebAssembly, the reality

WebAssembly is a weird backend for a C compiler

Only some source languages are having success on WebAssembly

What about Haskell, Ocaml, Scheme, F#, and so on – what about us?

Are we just lazy? (Well...)

So why aren't we there? Where is Clojure-on-WebAssembly? Where are the F#, the Elixir, the Haskell compilers? Some early efforts exist, but they aren't really succeeding. Why is that? Are we just not putting in the effort? Why is it that Rust gets to ride on the rocket ship but Scheme does not?

WebAssembly, the reality (2)

WebAssembly (1.0, 2.0) is not well-suited to garbage-collected languages

Let’s look into why

As it turns out, there is a reason that there is no good Scheme implementation on WebAssembly: the initial version of WebAssembly is a terrible target if your language relies on the presence of a garbage collector. There have been some advances but this observation still applies to the current standardized and deployed versions of WebAssembly. To better understand this issue, let's dig into the guts of the system to see what the limitations are.

GC and WebAssembly 1.0

Where do garbage-collected values live?

For WebAssembly 1.0, only possible answer: linear memory

(module
  (global $hp (mut i32) (i32.const 0))
  (memory $mem 10)) ;; 640 kB

The primitive that WebAssembly 1.0 gives you to represent your data is what is called linear memory: just a buffer of bytes to which you can read and write. It's pretty much like what you get when compiling natively, except that the memory layout is more simple. You can obtain this memory in units of 64-kilobyte pages. In the example above we're going to request 10 pages, for 640 kB. Should be enough, right? We'll just use it all for the garbage collector, with a bump-pointer allocator. The heap pointer / allocation pointer is kept in the mutable global variable $hp.

(func $alloc (param $size i32) (result i32)
  (local $ret i32)
  (loop $retry
    (local.set $ret (global.get $hp))
    (global.set $hp
      (i32.add (local.get $size) (local.get $ret)))

    (br_if 1
      (i32.lt_u (i32.shr_u (global.get $hp) 16)
                (memory.size))
      (local.get $ret))

    (call $gc)
    (br $retry)))

Here's what an allocation function might look like. The allocation function $alloc is like malloc: it takes a number of bytes and returns a pointer. In WebAssembly, a pointer to memory is just an offset, which is a 32-bit integer (i32). (Having the option of a 64-bit address space is planned but not yet standard.)

If this is your first time seeing the text representation of a WebAssembly function, you're in for a treat, but that's not the point of the presentation :) What I'd like to focus on is the (call $gc) -- what happens when the allocation pointer reaches the end of the region?

GC and WebAssembly 1.0 (2)

What hides behind (call $gc) ?

Ship a GC over linear memory

Stop-the-world, not parallel, not concurrent

But... roots.

The first thing to note is that you have to provide the $gc yourself. Of course, this is doable -- this is what we do when compiling to a native target.

Unfortunately though the multithreading support in WebAssembly is somewhat underpowered; it lets you share memory and use atomic operations but you have to create the threads outside WebAssembly. In practice probably the GC that you ship will not take advantage of threads and so it will be rather primitive, deferring all collection work to a stop-the-world phase.

GC and WebAssembly 1.0 (3)

Live objects are

  • the roots
  • any object referenced by a live object

Roots are globals and locals in active stack frames

No way to visit active stack frames

What's worse though is that you have no access to roots on the stack. A GC has to keep live objects, as defined circularly as any object referenced by a root, or any object referenced by a live object. It starts with the roots: global variables and any GC-managed object referenced by an active stack frame.

But there we run into problems, because in WebAssembly (any version, not just 1.0) you can't iterate over the stack, so you can't find active stack frames, so you can't find the stack roots. (Sometimes people want to support this as a low-level capability but generally speaking the consensus would appear to be that overall performance will be better if the engine is the one that is responsible for implementing the GC; but that is foreshadowing!)

GC and WebAssembly 1.0 (3)

Workarounds

  • handle stack for precise roots
  • spill all possibly-pointer values to linear memory and collect conservatively

Handle book-keeping a drag for compiled code

Given the noniterability of the stack, there are basically two work-arounds. One is to have the compiler and run-time maintain an explicit stack of object roots, which the garbage collector can know for sure are pointers. This is nice because it lets you move objects. But, maintaining the stack is overhead; the state of the art solution is rather to create a side table (a "stack map") associating each potential point at which GC can be called with instructions on how to find the roots.

The other workaround is to spill the whole stack to memory. Or, possibly just pointer-like values; anyway, you conservatively scan all words for things that might be roots. But instead of having access to the memory to which the WebAssembly implementation would spill your stack, you have to do it yourself. This can be OK but it's sub-optimal; see my recent post on the Whippet garbage collector for a deeper discussion of the implications of conservative root-finding.

GC and WebAssembly 1.0 (4)

Cycles with external objects (e.g. JavaScript) uncollectable

A pointer to a GC-managed object is an offset to linear memory, need capability over linear memory to read/write object from outside world

No way to give back memory to the OS

Gut check: gut says no

If that were all, it would already be not so great, but it gets worse! Another problem with linear-memory GC is that it limits the potential for composing a number of modules and the host together, because the garbage collector that manages JavaScript objects in a web browser knows nothing about your garbage collector over your linear memory. You can easily create memory leaks in a system like that.

Also, it's pretty gross that a reference to an object in linear memory requires arbitrary read-write access over all of linear memory in order to read or write object fields. How do you build a reliable system without invariants?

Finally, once you collect garbage, and maybe you manage to compact memory, you can't give anything back to the OS. There are proposals in the works but they are not there yet.

If the BOB audience had to choose between Worse is Better and The Right Thing, I think the BOB audience is much closer to the Right Thing. People like that feel instinctual revulsion to ugly systems and I think GC over linear memory describes an ugly system.

GC and WebAssembly 1.0 (5)

There is already a high-performance concurrent parallel compacting GC in the browser

Halftime: C++ N – Altlangs 0

The kicker is that WebAssembly 1.0 requires you to write and deliver a terrible GC when there is already probably a great GC just sitting there in the host, one that has hundreds of person-years of effort invested in it, one that will surely do a better job than you could ever do. WebAssembly as hosted in a web browser should have access to the browser's garbage collector!

I have the feeling that while those of us with a soft spot for languages with garbage collection have been standing on the sidelines, Rust and C++ people have been busy on the playing field scoring goals. Tripping over the ball, yes, but eventually they do manage to make within striking distance.

Change is coming!

Support for built-in GC set to ship in Q4 2023

With GC, the material conditions are now in place

Let’s compile our languages to WebAssembly

But to continue the sportsball metaphor, I think in the second half our players will finally be able to get out on the pitch and give it the proverbial 110%. Support for garbage collection is coming to WebAssembly users, and I think even by the end of the year it will be shipping in major browsers. This is going to be big! We have a chance and we need to sieze it.

Scheme to Wasm

Spritely + Igalia working on Scheme to WebAssembly

Avoid truncating language to platform; bring whole self

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Even with GC, though, WebAssembly is still a weird machine. It would help to see the concrete approaches that some languages of interest manage to take when compiling to WebAssembly.

In that spirit, the rest of this article/presentation is a walkthough of the approach that I am taking as I work on a WebAssembly compiler for Scheme. (Thanks to Spritely for supporting this work!)

Before diving in, a meta-note: when you go to compile a language to, say, JavaScript, you are mightily tempted to cut corners. For example you might implement numbers as JavaScript numbers, or you might omit implementing continuations. In this work I am trying to not cut corners, and instead to implement the language faithfully. Sometimes this means I have to work around weirdness in WebAssembly, and that's OK.

When thinking about Scheme, I'd like to highlight a few specific areas that have interesting translations. We'll start with value representation, which stays in the GC theme from the introduction.

Scheme to Wasm: Values

;;       any  extern  func
;;        |
;;        eq
;;     /  |   \
;; i31 struct  array

The unitype: (ref eq)

Immediate values in (ref i31)

  • fixnums with 30-bit range
  • chars, bools, etc

Explicit nullability: (ref null eq) vs (ref eq)

The GC extensions for WebAssembly are phrased in terms of a type system. Oddly, there are three top types; as far as I understand it, this is the result of a compromise about how WebAssembly engines might want to represent these different kinds of values. For example, an opaque JavaScript value flowing into a WebAssembly program would have type (ref extern). On a system with NaN boxing, you would need 64 bits to represent a JS value. On the other hand a native WebAssembly object would be a subtype of (ref any), and might be representable in 32 bits, either because it's a 32-bit system or because of pointer compression.

Anyway, three top types. The user can define subtypes of struct and array, instantiate values of those types, and access their fields. The life cycle of reference-typed objects is automatically managed by the run-time, which is just another way of saying they are garbage-collected.

For Scheme, we need a common supertype for all values: the unitype, in Bob Harper's memorable formulation. We can use (ref any), but actually we'll use (ref eq) -- this is the supertype of values that can be compared by (pointer) identity. So now we can code up eq?:

(func $eq? (param (ref eq) (ref eq))
           (result i32)
  (ref.eq (local.get a) (local.get b)))

Generally speaking in a Scheme implementation there are immediates and heap objects. Immediates can be encoded in the bits of a value, whereas for heap object the bits of a value encode a reference (pointer) to an object on the garbage-collected heap. We usually represent small integers as immediates, as well as booleans and other oddball values.

Happily, WebAssembly gives us an immediate value type, i31. We'll encode our immediates there, and otherwise represent heap objects as instances of struct subtypes.

Scheme to Wasm: Values (2)

Heap objects subtypes of struct; concretely:

(struct $heap-object
  (struct (field $tag-and-hash i32)))
(struct $pair
  (sub $heap-object
    (struct i32 (ref eq) (ref eq))))

GC proposal allows subtyping on structs, functions, arrays

Structural type equivalance: explicit tag useful

We actually need to have a common struct supertype as well, for two reasons. One is that we need to be able to hash Scheme values by identity, but for this we need an embedded lazily-initialized hash code. It's a bit annoying to take the per-object memory hit but it's a reality, and the JVM does it this way, so it must not be so terrible.

The other reason is more subtle: WebAssembly's type system is built in such a way that types that are "structurally" equivalent are indistinguishable. So a pair has two fields, besides the hash, but there might be a number of other fundamental object types that have the same shape; you can't fully rely on WebAssembly's dynamic type checks (ref.test et al) to be able to query the type of a value. Instead we re-use the low bits of the hash word to include a type tag, which might be 1 for pairs, 2 for vectors, 3 for closures, and so on.

Scheme to Wasm: Values (3)

(func $cons (param (ref eq)
                   (ref eq))
            (result (ref $pair))
  (struct.new_canon $pair
    ;; Assume heap tag for pairs is 1.
    (i32.const 1)
    ;; Car and cdr.
    (local.get 0)
    (local.get 1)))

(func $%car (param (ref $pair))
            (result (ref eq))
  (struct.get $pair 1 (local.get 0)))

With this knowledge we can define cons, as a simple call to struct.new_canon pair.

I didn't have time for this in the talk, but there is a ghost haunting this code: the ghost of nominal typing. See, in a web browser at least, every heap object will have its first word point to its "hidden class" / "structure" / "map" word. If the engine ever needs to check that a value is of a specific shape, it can do a quick check on the map word's value; if it needs to do deeper introspection, it can dereference that word to get more details.

Under the hood, testing whether a (ref eq) is a pair or not should be a simple check that it's a (ref struct) (and not a fixnum), and then a comparison of its map word to the run-time type corresponding to $pair. If subtyping of $pair is allowed, we start to want inline caches to handle polymorphism, but the checking the map word is still the basic mechanism.

However, as I mentioned, we only have structural equality of types; two (struct (ref eq)) type definitions will define the same type and have the same map word (run-time type / RTT). Hence the _canon in the name of struct.new_canon $pair: we create an instance of $pair, with the canonical run-time-type for objects having $pair-shape.

In earlier drafts of the WebAssembly GC extensions, users could define their own RTTs, which effectively amounts to nominal typing: not only does this object have the right structure, but was it created with respect to this particular RTT. But, this facility was cut from the first release, and it left ghosts in the form of these _canon suffixes on type constructor instructions.

For the Scheme-to-WebAssembly effort, we effectively add back in a degree of nominal typing via type tags. For better or for worse this results in a so-called "open-world" system: you can instantiate a separately-compiled WebAssembly module that happens to define the same types and use the same type tags and it will be able to happily access the contents of Scheme values from another module. If you were to use nominal types, you would't be able to do so, unless there were some common base module that defined and exported the types of interests, and which any extension module would need to import.

(func $car (param (ref eq)) (result (ref eq))
  (local (ref $pair))
  (block $not-pair
    (br_if $not-pair
      (i32.eqz (ref.test $pair (local.get 0))))
    (local.set 1 (ref.cast $pair) (local.get 0))
    (br_if $not-pair
      (i32.ne
        (i32.const 1)
        (i32.and
          (i32.const 0xff)
          (struct.get $heap-object 0 (local.get 1)))))
    (return_call $%car (local.get 1)))

  (call $type-error)
  (unreachable))

In the previous example we had $%car, with a funny % in the name, taking a (ref $pair) as an argument. But in the general case (barring compiler heroics) car will take an instance of the unitype (ref eq). To know that it's actually a pair we have to make two checks: one, that it is a struct and has the $pair shape, and two, that it has the right tag. Oh well!

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

But with all of that I think we have a solid story on how to represent values. I went through all of the basic value types in Guile and checked that they could all be represented using GC types, and it seems that all is good. Now on to the next point: varargs.

Scheme to Wasm: Varargs

(list 'hey)      ;; => (hey)
(list 'hey 'bob) ;; => (hey bob)

Problem: Wasm functions strongly typed

(func $list (param ???) (result (ref eq))
  ???)

Solution: Virtualize calling convention

In WebAssembly, you define functions with a type, and it is impossible to call them in an unsound way. You must call $car exactly 2 arguments or it will not compile, and those arguments have to be of specific types, and so on. But Scheme doesn't enforce these restrictions on the language level, bless its little miscreant heart. You can call car with 5 arguments, and you'll get a run-time error. There are some functions that can take a variable number of arguments, doing different things depending on incoming argument count.

How do we square these two approaches to function types?

;; "Registers" for args 0 to 3
(global $arg0 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg1 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg2 (mut (ref eq)) (i31.new (i32.const 0)))
(global $arg3 (mut (ref eq)) (i31.new (i32.const 0)))

;; "Memory" for the rest
(type $argv (array (ref eq)))
(global $argN (ref $argv)
        (array.new_canon_default
          $argv (i31.const 42) (i31.new (i32.const 0))))

Uniform function type: argument count as sole parameter

Callee moves args to locals, possibly clearing roots

The approach we are taking is to virtualize the calling convention. In the same way that when calling an x86-64 function, you pass the first argument in $rdi, then $rsi, and eventually if you run out of registers you put arguments in memory, in the same way we'll pass the first argument in the $arg0 global, then $arg1, and eventually in memory if needed. The function will receive the number of incoming arguments as its sole parameter; in fact, all functions will be of type (func (param i32)).

The expectation is that after checking argument count, the callee will load its arguments from globals / memory to locals, which the compiler can do a better job on than globals. We might not even emit code to null out the argument globals; might leak a little memory but probably would be a win.

You can imagine a world in which $arg0 actually gets globally allocated to $rdi, because it is only live during the call sequence; but I don't think that world is this one :)

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Great, two points out of the way! Next up, tail calls.

Scheme to Wasm: Tail calls

;; Call known function
(return_call $f arg ...)

;; Call function by value
(return_call_ref $type callee arg ...)

Friends -- I almost cried making this slide. We Schemers are used to working around the lack of tail calls, and I could have done so here, but it's just such a relief that these functions are just going to be there and I don't have to think much more about them. Technically speaking the proposal isn't merged yet; checking the phases document it's at the last station before headed to the great depot in the sky. But, soon soon it will be present and enabled in all WebAssembly implementations, and we should build systems now that rely on it.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Next up, my favorite favorite topic: delimited continuations.

Scheme to Wasm: Prompts (1)

Problem: Lightweight threads/fibers, exceptions

Possible solutions

  • Eventually, built-in coroutines
  • binaryen’s asyncify (not yet ready for GC); see Julia
  • Delimited continuations

“Bring your whole self”

Before diving in though, one might wonder why bother. Delimited continuations are a building-block that one can use to build other, more useful things, notably exceptions and light-weight threading / fibers. Could there be another way of achieving these end goals without having to implement this relatively uncommon primitive?

For fibers, it is possible to implement them in terms of a built-in coroutine facility. The standards body seems willing to include a coroutine primitive, but it seems far off to me; not within the next 3-4 years I would say. So let's put that to one side.

There is a more near-term solution, to use asyncify to implement coroutines somehow; but my understanding is that asyncify is not ready for GC yet.

For the Guile flavor of Scheme at least, delimited continuations are table stakes of their own right, so given that we will have them on WebAssembly, we might as well use them to implement fibers and exceptions in the same way as we do on native targets. Why compromise if you don't have to?

Scheme to Wasm: Prompts (2)

Prompts delimit continuations

(define k
  (call-with-prompt ’foo
    ; body
    (lambda ()
      (+ 34 (abort-to-prompt 'foo)))
    ; handler
    (lambda (continuation)
      continuation)))

(k 10)       ;; ⇒ 44
(- (k 10) 2) ;; ⇒ 42

k is the _ in (lambda () (+ 34 _))

There are a few ways to implement delimited continuations, but my usual way of thinking about them is that a delimited continuation is a slice of the stack. One end of the slice is the prompt established by call-with-prompt, and the other by the continuation of the call to abort-to-prompt. Capturing a slice pops it off the stack, copying it out to the heap as a callable function. Calling that function splats the captured slice back on the stack and resumes it where it left off.

Scheme to Wasm: Prompts (3)

Delimited continuations are stack slices

Make stack explicit via minimal continuation-passing-style conversion

  • Turn all calls into tail calls
  • Allocate return continuations on explicit stack
  • Breaks functions into pieces at non-tail calls

This low-level intuition of what a delimited continuation is leads naturally to an implementation; the only problem is that we can't slice the WebAssembly call stack. The workaround here is similar to the varargs case: we virtualize the stack.

The mechanism to do so is a continuation-passing-style (CPS) transformation of each function. Functions that make no calls, such as leaf functions, don't need to change at all. The same goes for functions that make only tail calls. For functions that make non-tail calls, we split them into pieces that preserve the only-tail-calls property.

Scheme to Wasm: Prompts (4)

Before a non-tail-call:

  • Push live-out vars on stacks (one stack per top type)
  • Push continuation as funcref
  • Tail-call callee

Return from call via pop and tail call:

(return_call_ref (call $pop-return)
                 (i32.const 0))

After return, continuation pops state from stacks

Consider a simple function:

(define (f x y)
  (+ x (g y))

Before making a non-tail call, a "tailified" function will instead push all live data onto an explicitly-managed stack and tail-call the callee. It also pushes on the return continuation. Returning from the callee pops the return continuation and tail-calls it. The return continuation pops the previously-saved live data and continues.

In this concrete case, tailification would split f into two pieces:

(define (f x y)
  (push! x)
  (push-return! f-return-continuation-0)
  (g y))

(define (f-return-continuation-0 g-of-y)
  (define k (pop-return!))
  (define x (pop! x))
  (k (+ x g-of-y)))

Now there are no non-tail calls, besides calls to run-time routines like push! and + and so on. This transformation is implemented by tailify.scm.

Scheme to Wasm: Prompts (5)

abort-to-prompt:

  • Pop stack slice to reified continuation object
  • Tail-call new top of stack: prompt handler

Calling a reified continuation:

  • Push stack slice
  • Tail-call new top of stack

No need to wait for effect handlers proposal; you can have it all now!

The salient point is that the stack on which push! operates (in reality, probably four or five stacks: one in linear memory or an array for types like i32 or f64, three for each of the managed top types any, extern, and func, and one for the stack of return continuations) are managed by us, so we can slice them.

Someone asked in the talk about whether the explicit memory traffic and avoiding the return-address-buffer branch prediction is a source of inefficiency in the transformation and I have to say, yes, but I don't know by how much. I guess we'll find out soon.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Okeydokes, last point!

Scheme to Wasm: Numbers

Numbers can be immediate: fixnums

Or on the heap: bignums, fractions, flonums, complex

Supertype is still ref eq

Consider imports to implement bignums

  • On web: BigInt
  • On edge: Wasm support module (mini-gmp?)

Dynamic dispatch for polymorphic ops, as usual

First, I would note that sometimes the compiler can unbox numeric operations. For example if it infers that a result will be an inexact real, it can use unboxed f64 instead of library routines working on heap flonums ((struct i32 f64); the initial i32 is for the hash and tag). But we still need a story for the general case that involves dynamic type checks.

The basic idea is that we get to have fixnums and heap numbers. Fixnums will handle most of the integer arithmetic that we need, and will avoid allocation. We'll inline most fixnum operations as a fast path and call out to library routines otherwise. Of course fixnum inputs may produce a bignum output as well, so the fast path sometimes includes another slow-path callout.

We want to minimize binary module size. In an ideal compile-to-WebAssembly situation, a small program will have a small module size, down to a minimum of a kilobyte or so; larger programs can be megabytes, if the user experience allows for the download delay. Binary module size will be dominated by code, so that means we need to plan for aggressive dead-code elimination, minimize the size of fast paths, and also minimize the size of the standard library.

For numbers, we try to keep module size down by leaning on the platform. In the case of bignums, we can punt some of this work to the host; on a JavaScript host, we would use BigInt, and on a WASI host we'd compile an external bignum library. So that's the general story: inlined fixnum fast paths with dynamic checks, and otherwise library routine callouts, combined with aggressive whole-program dead-code elimination.

Scheme to Wasm

  • Value representation
  • Varargs
  • Tail calls
  • Delimited continuations
  • Numeric tower

Hey I think we did it! Always before when I thought about compiling Scheme or Guile to the web, I got stuck on some point or another, was tempted down the corner-cutting alleys, and eventually gave up before starting. But finally it would seem that the stars are aligned: we get to have our Scheme and run it too.

Miscellenea

Debugging: The wild west of DWARF; prompts

Strings: stringref host strings spark joy

JS interop: Export accessors; Wasm objects opaque to JS. externref.

JIT: A whole ’nother talk!

AOT: wasm2c

Of course, like I said, WebAssembly is still a weird machine: as a compilation target but also at run-time. Debugging is a right proper mess; perhaps some other article on that some time.

How to represent strings is a surprisingly gnarly question; there is tension within the WebAssembly standards community between those that think that it's possible for JavaScript and WebAssembly to share an underlying string representation, and those that think that it's a fool's errand and that copying is the only way to go. I don't know which side will prevail; perhaps more on that as well later on.

Similarly the whole interoperation with JavaScript question is very much in its early stages, with the current situation choosing to err on the side of nothing rather than the wrong thing. You can pass a WebAssembly (ref eq) to JavaScript, but JavaScript can't do anything with it: it has no prototype. The state of the art is to also ship a JS run-time that wraps each wasm object, proxying exported functions from the wasm module as object methods.

Finally, some language implementations really need JIT support, like PyPy. There, that's a whole 'nother talk!

WebAssembly for the rest of us

With GC, WebAssembly is now ready for us

Getting our languages on WebAssembly now a S.M.O.P.

Let’s score some goals in the second half!

(visit-links
 "gitlab.com/spritely/guile-hoot-updates"
 "wingolog.org"
 "wingo@igalia.com"
 "igalia.com"
 "mastodon.social/@wingo")

WebAssembly has proven to have some great wins for C, C++, Rust, and so on -- but now it's our turn to get in the game. GC is coming and we as a community need to be getting our compilers and language run-times ready. Let's put on the coffee and bang some bytes together; it's still early days and there's a world to win out there for the language community with the best WebAssembly experience. The game is afoot: happy consing!

pre-initialization of garbage-collected webassembly heaps

10 March 2023 9:20 AM (wasm | webassembly | wizer | gc | pre-initialization | types | igalia)

Hey comrades, I just had an idea that I won't be able to work on in the next couple months and wanted to release it into the wild. They say if you love your ideas, you should let them go and see if they come back to you, right? In that spirit I abandon this idea to the woods.

Basically the idea is Wizer-like pre-initialization of WebAssembly modules, but for modules that store their data on the GC-managed heap instead of just in linear memory.

Say you have a WebAssembly module with GC types. It might look like this:

(module
  (type $t0 (struct (ref eq)))
  (type $t1 (struct (ref $t0) i32))
  (type $t2 (array (mut (ref $t1))))
  ...
  (global $g0 (ref null eq)
    (ref.null eq))
  (global $g1 (ref $t1)
    (array.new_canon $t0 (i31.new (i32.const 42))))
  ...
  (function $f0 ...)
  ...)

You define some struct and array types, there are some global variables, and some functions to actually do the work. (There are probably also tables and other things but I am simplifying.)

If you consider the object graph of an instantiated module, you will have some set of roots R that point to GC-managed objects. The live objects in the heap are the roots and any object referenced by a live object.

Let us assume a standalone WebAssembly module. In that case the set of types T of all objects in the heap is closed: it can only be one of the types $t0, $t1, and so on that are defined in the module. These types have a partial order and can thus be sorted from most to least specific. Let's assume that this sort order is just the reverse of the definition order, for now. Therefore we can write a general type introspection function for any object in the graph:

(func $introspect (param $obj anyref)
  (block $L2 (ref $t2)
    (block $L1 (ref $t1)
      (block $L0 (ref $t0)
        (br_on_cast $L2 $t2 (local.get $obj))
        (br_on_cast $L1 $t1 (local.get $obj))
        (br_on_cast $L0 $t0 (local.get $obj))
        (unreachable))
      ;; Do $t0 things...
      (return))
    ;; Do $t1 things...
    (return))
  ;; Do $t2 things...
  (return))

In particular, given a WebAssembly module, we can generate a function to trace edges in an object graph of its types. Using this, we can identify all live objects, and what's more, we can take a snapshot of those objects:

(func $snapshot (result (ref (array (mut anyref))))
  ;; Start from roots, use introspect to find concrete types
  ;; and trace edges, use a worklist, return an array of
  ;; all live objects in topological sort order
  )

Having a heap snapshot is interesting for introspection purposes, but my interest is in having fast start-up. Many programs have a kind of "initialization" phase where they get the system up and running, and only then proceed to actually work on the problem at hand. For example, when you run python3 foo.py, Python will first spend some time parsing and byte-compiling foo.py, importing the modules it uses and so on, and then will actually run foo.py's code. Wizer lets you snapshot the state of a module after initialization but before the real work begins, which can save on startup time.

For a GC heap, we actually have similar possibilities, but the mechanism is different. Instead of generating an array of all live objects, we could generate a serialized state of the heap as bytecode, and another function to read the bytecode and reload the heap:

(func $pickle (result (ref (array (mut i8))))
  ;; Return an array of bytecode which, when interpreted,
  ;; can reconstruct the object graph and set the roots
  )
(func $unpickle (param (ref (array (mut i8))))
  ;; Interpret the bytecode, building object graph in
  ;; topological order
  )

The unpickler is module-dependent: it will need one case to construct each concrete type $tN in the module. Therefore the bytecode grammar would be module-dependent too.

What you would get with a bytecode-based $pickle/$unpickle pair would be the ability to serialize and reload heap state many times. But for the pre-initialization case, probably that's not precisely what you want: you want to residualize a new WebAssembly module that, when loaded, will rehydrate the heap. In that case you want a function like:

(func $make-init (result (ref (array (mut i8))))
  ;; Return an array of WebAssembly code which, when
  ;; added to the module as a function and invoked, 
  ;; can reconstruct the object graph and set the roots.
  )

Then you would use binary tools to add that newly generated function to the module.

In short, there is a space open for a tool which takes a WebAssembly+GC module M and produces M', a module which contains a $make-init function. Then you use a WebAssembly+GC host to load the module and call the $make-init function, resulting in a WebAssembly function $init which you then patch in to the original M to make M'', which is M pre-initialized for a given task.

Optimizations

Some of the object graph is constant; for example, an instance of a struct type that has no mutable fields. These objects don't have to be created in the init function; they can be declared as new constant global variables, which an engine may be able to initialize more efficiently.

The pre-initialized module will still have an initialization phase in which it builds the heap. This is a constant function and it would be nice to avoid it. Some WebAssembly hosts will be able to run pre-initialization and then snapshot the GC heap using lower-level facilities (copy-on-write mappings, pointer compression and relocatable cages, pre-initialization on an internal level...). This would potentially decrease latency and may allow for cross-instance memory sharing.

Limitations

There are five preconditions to be able to pickle and unpickle the GC heap:

  1. The set of concrete types in a module must be closed.

  2. The roots of the GC graph must be enumerable.

  3. The object-graph edges from each live object must be enumerable.

  4. To prevent cycles, we have to know when an object has been visited: objects must have identity.

  5. We must be able to create each type in a module.

I think there are three limitations to this pre-initialization idea in practice.

One is externref; these values come from the host and are by definition not introspectable by WebAssembly. Let's keep the closed-world assumption and consider the case where the set of external reference types is closed also. In that case if a module allows for external references, we can perhaps make its pickling routines call out to the host to (2) provide any external roots (3) identify edges on externref values (4) compare externref values for identity and (5) indicate some imported functions which can be called to re-create exernal objects.

Another limitation is funcref. In practice in the current state of WebAssembly and GC, you will only have a funcref which is created by ref.func, and which (3) therefore has no edges and (5) can be re-created by ref.func. However neither WebAssembly nor the JS API has no way of knowing which function index corresponds to a given funcref. Including function references in the graph would therefore require some sort of host-specific API. Relatedly, function references are not comparable for equality (func is not a subtype of eq), which is a little annoying but not so bad considering that function references can't participate in a cycle. Perhaps a solution though would be to assume (!) that the host representation of a funcref is constant: the JavaScript (e.g.) representations of (ref.func 0) and (ref.func 0) are the same value (in terms of ===). Then you could compare a given function reference against a set of known values to determine its index. Note, when function references are expanded to include closures, we will have more problems in this area.

Finally, there is the question of roots. Given a module, we can generate a function to read the values of all reference-typed globals and of all entries in all tables. What we can't get at are any references from the stack, so our object graph may be incomplete. Perhaps this is not a problem though, because when we unpickle the graph we won't be able to re-create the stack anyway.

OK, that's my idea. Have at it, hackers!

whippet: towards a new local maximum

7 February 2023 1:14 PM (whippet | garbage collection | gc | immix | guile | bdw | fosdem | igalia | c | aliasing | evacuation)

Friends, you might have noted, but over the last year or so I really caught the GC bug. Today's post sums up that year, in the form of a talk I gave yesterday at FOSDEM. It's long! If you prefer video, you can have a look instead to the at the FOSDEM event page.

Whippet: A New GC for Guile

4 Feb 2023 – FOSDEM

Andy Wingo

Guile is...

Mostly written in Scheme

Also a 30 year old C library

// API
SCM scm_cons (SCM car, SCM cdr);

// Many third-party users
SCM x = scm_cons (a, b);

So the context for the whole effort is that Guile has this part of its implementation which is in C. It also exposes a lot of that implementation to users as an API.

Putting the C into GC

SCM x = scm_cons (a, b);

Live objects: the roots, plus anything a live object refers to

How to include x into roots?

  • Refcounting
  • Register (& later unregister) &x with gc
  • Conservative roots

So what contraints does this kind of API impose on the garbage collector?

Let's start by considering the simple cons call above. In a garbage-collected environment, the GC is responsible for reclaiming unused memory. How does the GC know that the result of a scm_cons call is in use?

Generally speaking there are two main strategies for automatic memory management. One is reference counting: you associate a count with an object, incremented once for each referrer; in this case, the stack would hold a reference to x. When removing the reference, you decrement the count, and if it goes to 0 the object is unused and can be freed.

We GC people used to laugh at reference-counting as a memory management solution because it over-approximates the live object set in the presence of cycles, but it would seem that refcounting is coming back. Anyway, this isn't what Guile does, not right now anyway.

The other strategy we can use is tracing: the garbage collector periodically finds all of the live objects on the system and then recycles the memory for everything else. But how to actually find the first live objects to trace?

One way is to inform the garbage collector of the locations of all roots: references to objects originating from outside the heap. This can be done explicitly, as in V8's Handle<> API, or implicitly, in the form of a side table generated by the compiler associating code locations with root locations. This is called precise rooting: the GC is aware of all root locations at all code positions where GC might happen. Generally speaking you want the side table approach, in which the compiler writes out root locations to stack maps, because it doesn't impose any overhead at run-time to register and unregister locations. However for run-time routines implemented in C or C++, you won't be able to get the C compiler to do this for you, so you need the explicit approach if you want precise roots.

Conservative roots

Treat every word in stack as potential root; over-approximate live object set

1993: Bespoke GC inherited from SCM

2006 (1.8): Added pthreads, bugs

2009 (2.0): Switch to BDW-GC

BDW-GC: Roots also from extern SCM foo;, etc

The other way to find roots is very much not The Right Thing. Call it cheeky, call it sloppy, call it yolo, call it what you like, but in the trade it's known as conservative root-finding. This strategy looks like this:

uintptr_t *limit = stack_base_for_platform();
uintptr_t *sp = __builtin_frame_address();
for (; sp < limit; sp++) {
  void *obj = object_at_address(*sp);
  if (obj) add_to_live_objects(obj);
}

You just look at every word on the stack and pretend it's a pointer. If it happens to point to an object in the heap, we add that object to the live set. Of course this algorithm can find a spicy integer whose value just happens to correspond to an object's address, even if that object wouldn't have been counted as live otherwise. This approach doesn't compute the minimal live set, but rather a conservative over-approximation. Oh well. In practice this doesn't seem to be a big deal?

Guile has used conservative root-finding since its beginnings, 30 years ago and more. We had our own bespoke mark-sweep GC in the beginning, but it's now going on 15 years or so that we switched to the third-party Boehm-Demers-Weiser (BDW) collector. It's been good to us! It's better than what we had, it's mostly just worked, and it works correctly with threads.

Conservative roots

+: Ergonomic, eliminates class of bugs (handle registration), no compiler constraints

-: Potential leakage, no compaction / object motion; no bump-pointer allocation, calcifies GC choice

Conservative root-finding does have advantages. It's quite pleasant to program with, in environments in which the compiler is unable to produce stack maps for you, as it eliminates a set of potential bugs related to explicit handle registration and unregistration. Like stack maps, it also doesn't impose run-time overhead on the user program. And although the compiler isn't constrained to emit code to clear roots, it generally does, and sometimes does so more promptly than would be the case with explicit handle deregistration.

But, there are disadvantages too. The potential for leaks is one, though I have to say that in 20 years of using conservative-roots systems, I have not found this to be a problem. It's a source of anxiety whenever a program has memory consumption issues but I've never identified it as being the culprit.

The more serious disadvantage, though, is that conservative edges prevent objects from being moved by the GC. If you know that a location holds a pointer, you can update that location to point to a new location for an object. But if a location only might be a pointer, you can't do that.

In the end, the ergonomics of conservative collection lead to a kind of calcification in Guile, that we thought that BDW was as good as we could get given the constraints, and that changing to anything else would require precise roots, and thus an API and ABI change, losing users, and so on.

What if I told you

You can find roots conservatively and

  • move objects and compact the heap
  • do fast bump-pointer allocation
  • incrementally migrate to precise roots

BDW is not the local maximum

But it turns out, that's not true! There is a way to have conservative roots and also use more optimal GC algorithms, and one which preserves the ability to incrementally refactor the system to have more precision if that's what you want.

Immix

Fundamental GC algorithms

  • mark-compact
  • mark-sweep
  • evacuation
  • mark-region

Immix is a mark-region collector

Let's back up to a high level. Garbage collector implementations are assembled from instances of algorithms, and there are only so many kinds of algorithms out there.

There's mark-compact, in which the collector traverses the object graph once to find live objects, then once again to slide them down to one end of the space they are in.

There's mark-sweep, where the collector traverses the graph once to find live objects, then traverses the whole heap, sweeping dead objects into free lists to be used for future allocations.

There's evacuation, where the collector does a single pass over the object graph, copying the objects outside their space and leaving a forwarding pointer behind.

The BDW collector used by Guile is a mark-sweep collector, and its use of free lists means that allocation isn't as fast as it could be. We want bump-pointer allocation and all the other algorithms give it to us.

Then in 2008, Stephen Blackburn and Kathryn McKinley put out their Immix paper that identified a new kind of collection algorithm, mark-region. A mark-region collector will mark the object graph and then sweep the whole heap for unmarked regions, which can then be reused for allocating new objects.

Allocate: Bump-pointer into holes in thread-local block, objects can span lines but not blocks

Trace: Mark objects and lines

Sweep: Coarse eager scan over line mark bytes

Blackburn and McKinley's paper also describes a new mark-region GC algorithm, Immix, which is interesting because it gives us bump-pointer allocation without requiring that objects be moveable. The diagram above, from the paper, shows the organization of an Immix heap. Allocating threads (mutators) obtain 64-kilobyte blocks from the heap. Blocks contains 128-byte lines. When Immix traces the object graph, it marks both objects and the line the object is on. (Usually blocks are part of 2MB aligned slabs, with line mark bits/bytes are stored in a packed array at the start of the slab. When marking an object, it's easy to find the associated line mark just with address arithmetic.)

Immix reclaims memory in units of lines. A set of contiguous lines that were not marked in the previous collection form a hole (a region). Allocation proceeds into holes, in the usual bump-pointer fashion, giving us good locality for contemporaneously-allocated objects, unlike freelist allocation. The slow path, if the object doesn't fit in the hole, is to look for the next hole in the block, or if needed to acquire another block, or to stop for collection if there are no more blocks.

Immix: Opportunistic evacuation

Before trace, determine if compaction needed. If not, mark as usual

If so, select candidate blocks and evacuation target blocks. When tracing in that block, try to evacuate, fall back to mark

The neat thing that Immix adds is a way to compact the heap via opportunistic evacuation. As Immix allocates, it can end up skipping over holes and leaving them unpopulated, and as subsequent cycles of GC occur, it could be that a block ends up with many small holes. If that happens to many blocks it could be time to compact.

To fight fragmentation, Immix decides at the beginning of a GC cycle whether to try to compact or not. If things aren't fragmented, Immix marks in place; it's cheaper that way. But if compaction is needed, Immix selects a set of blocks needing evacuation and another set of empty blocks to evacuate into. (Immix has to keep around a couple percent of memory in empty blocks in reserve for this purpose.)

As Immix traverses the object graph, if it finds that an object is in a block that needs evacuation, it will try to evacuate instead of marking. It may or may not succeed, depending on how much space is available to evacuate into. Maybe it will succeed for all objects in that block, and you will be left with an empty block, which might even be given back to the OS.

Immix: Guile

Opportunistic evacuation compatible with conservative roots!

Bump-pointer allocation

Compaction!

1 year ago: start work on WIP GC implementation

Tying this back to Guile, this gives us all of our desiderata: we can evacuate, but we don't have to, allowing us to cause referents of conservative roots to be marked in place instead of moved; we can bump-pointer allocate; and we are back on the train of modern GC implementations. I could no longer restrain myself: I started hacking on a work-in-progress garbage collector workbench about a year ago, and ended up with something that seems to take us in the right direction.

Whippet vs Immix: Tiny lines

Immix: 128B lines + mark bit in object

Whippet: 16B “lines”; mark byte in side table

More size overhead: 1/16 vs 1/128

Less fragmentation (1 live obj = 2 lines retained)

More alloc overhead? More small holes

What I ended up building wasn't quite Immix. Guile's object representation is very thin and doesn't currently have space for a mark bit, for example, so I would have to have a side table of mark bits. (I could have changed Guile's object representation but I didn't want to require it.) I actually chose mark bytes instead of bits because both the Immix line marks and BDW's own side table of marks were bytes, to allow for parallel markers to race when setting marks.

Then, given that you have a contiguous table of mark bytes, why not remove the idea of lines altogether? Or what amounts to the same thing, why not make line size to be 16 bytes and do away with per-object mark bits? You can then bump-pointer into holes in the mark byte array. The only thing you need to do to that is to be able to cheaply find the end of an object, so you can skip to the next hole while sweeping; you don't want to have to chase pointers to do that. But consider, you've already paid the cost of having a mark byte associated with every possible start of an object, so if your basic object alignment is 16 bytes, that's a memory overhead of 1/16, or 6.25%; OK. Let's put that mark byte to work and include an "end" bit, indicating the end of the object. Allocating an object has to store into the mark byte array to initialize this "end" marker, but you need to write the mark byte anyway to allow for conservative roots ("does this address hold an object?"); writing the end at the same time isn't so bad, perhaps.

The expected outcome would be that relative to 128-byte lines, Whippet ends up with more, smaller holes. Such a block would be a prime target for evacuation, of course, but during allocation this is overhead. Or, it could be a source of memory efficiency; who knows. There is some science yet to do to properly compare this tactic to original Immix, but I don't think I will get around to it.

While I am here and I remember these things, I need to mention two more details. If you read the Immix paper, it describes "conservative line marking", which is related to how you find the end of an object; basically Immix always marks the line an object is on and the next one, in case the object spans the line boundary. Only objects larger than a line have to precisely mark the line mark array when they are traced. Whippet doesn't do this because we have the end bit.

The other detail is the overflow allocator; in the original Immix paper, if you allocate an object that's smallish but still larger than a line or two, but there's no hole big enough in the block, Immix keeps around a completely empty block per mutator in which to bump-pointer-allocate these medium-sized objects. Whippet doesn't do that either, instead relying on such failure to allocate in a block to cause fragmentation and thus hurry along the process of compaction.

Whippet vs Immix: Lazy sweeping

Immix: “cheap” eager coarse sweep

Whippet: just-in-time lazy fine-grained sweep

Corrolary: Data computed by sweep available when sweep complete

Live data at previous GC only known before next GC

Empty blocks discovered by sweeping

Having a fine-grained line mark array means that it's no longer a win to do an eager sweep of all blocks after collecting. Instead Whippet applies the classic "lazy sweeping" optimization to make mutators sweep their blocks just before allocating into them. This introduces a delay in the collection algorithm: Whippet doesn't find out about e.g. fragmentation until the whole heap is swept, but by the time we fully sweep the heap, we've exhausted it via allocation. It introduces a different flavor to the GC, not entirely unlike original Immix, but foreign.

Whippet vs BDW

Compaction/defrag/pinning, heap shrinking, sticky-mark generational GC, threads/contention/allocation, ephemerons, precision, tools

Right! With that out of the way, let's talk about what Whippet gives to Guile, relative to BDW-GC.

Whippet vs BDW: Motion

Heap-conservative tracing: no object moveable

Stack-conservative tracing: stack referents pinned, others not

Whippet: If whole-heap fragmentation exceeds threshold, evacuate most-fragmented blocks

Stack roots scanned first; marked instead of evacuated, implicitly pinned

Explicit pinning: bit in mark byte

If all edges in the heap are conservative, then you can't move anything, because you don't know if an edge is a pointer that can be updated or just a spicy integer. But most systems aren't actually like this: you have conservative edges from the stack, but you can precisely enumerate intra-object edges on the heap. In that case, you have a known set of conservative edges, and you can simply visit those edges first, marking their referents in place instead of evacuating. (Marking an object instead of evacuating implicitly pins it for the duration of the current GC cycle.) Then you visit heap edges precisely, possibly evacuating objects.

I should note that Whippet has a bit in the mark byte for use in explicitly pinning an object. I'm not sure how to manage who is responsible for setting that bit, or what the policy will be; the current idea is to set it for any object whose identity-hash value is taken. We'll see.

Whippet vs BDW: Shrinking

Lazy sweeping finds empty blocks: potentially give back to OS

Need empty blocks? Do evacuating collection

Possibility to do http://marisa.moe/balancer.html

With the BDW collector, your heap can only grow; it will never shrink (unless you enable a non-default option and you happen to have verrry low fragmentation). But with Whippet and evacuation, we can rearrange objects so as to produce empty blocks, which can then be returned to the OS if so desired.

In one of my microbenchmarks I have the system allocating long-lived data, interspersed with garbage (objects that are dead after allocation) whose size is in a power-law distribution. This should produce quite some fragmentation, eventually, and it does. But then Whippet decides to defragment, and it works great! Since Whippet doesn't keep a whole 2x reserve like a semi-space collector, it usually takes more than one GC cycle to fully compact the heap; usually about 3 cycles, from what I can see. I should do some more measurements here.

Of course, this is just mechanism; choosing the right heap sizing policy is a different question.

wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm

Card marking barrier (256B); compare to BDW mprotect / SIGSEGV

The Boehm collector also has a non-default mode in which it uses mprotect and a SIGSEGV handler to enable sticky-mark-bit generational collection. I haven't done a serious investigation, but I see it actually increasing run-time by 20% on one of my microbenchmarks that is actually generation-friendly. I know that Azul's C4 collector used to use page protection tricks but I can only assume that BDW's algorithm just doesn't work very well. (BDW's page barriers have another purpose, to enable incremental collection, in which marking is interleaved with allocation, but this mode is off if parallel markers are supported, and I don't know how well it works.)

Anyway, it seems we can do better. The ideal would be a semi-space nursery, which is the usual solution, but because of conservative roots we are limited to the sticky mark-bit algorithm. Some benchmarks aren't very generation-friendly; the first pair of bars in the chart above shows the mt-gcbench microbenchmark running with and without generational collection, and there's no difference. But in the second, for the quads benchmark, we see a 2x speedup or so.

Of course, to get generational collection to work, we require mutators to use write barriers, which are little bits of code that run when an object is mutated that tell the GC where it might find links from old objects to new objects. Right now in Guile we don't do this, but this benchmark shows what can happen if we do.

Whippet vs BDW: Scale

BDW: TLS segregated-size freelists, lock to refill freelists, SIGPWR for stop

Whippet: thread-local block, sweep without contention, wait-free acquisition of next block, safepoints to stop with ragged marking

Both: parallel markers

Another thing Whippet can do better than BDW is performance when there are multiple allocating threads. The Immix heap organization facilitates minimal coordination between mutators, and maximum locality for each mutator. Sweeping is naturally parallelized according to how many threads are allocating. For BDW, on the other hand, every time an mutator needs to refill its thread-local free lists, it grabs a global lock; sweeping is lazy but serial.

Here's a chart showing whippet versus BDW on one microbenchmark. On the X axis I add more mutator threads; each mutator does the same amount of allocation, so I'm increasing the heap size also by the same factor as the number of mutators. For simplicity I'm running both whippet and BDW with a single marker thread, so I expect to see a linear increase in elapsed time as the heap gets larger (as with 4 mutators there are roughly 4 times the number of live objects to trace). This test is run on a Xeon Silver 4114, taskset to free cores on a single socket.

What we see is that as I add workers, elapsed time increases linearly for both collectors, but more steeply for BDW. I think (but am not sure) that this is because whippet effectively parallelizes sweeping and allocation, whereas BDW has to contend over a global lock to sweep and refill free lists. Both have the linear factor of tracing the object graph, but BDW has the additional linear factor of sweeping, whereas whippet scales with mutator count.

Incidentally you might notice that at 4 mutator threads, BDW randomly crashed, when constrained to a fixed heap size. I have noticed that if you fix the heap size, BDW sometimes (and somewhat randomly) fails. I suspect the crash due to fragmentation and inability to compact, but who knows; multiple threads allocating is a source of indeterminism. Usually when you run BDW you let it choose its own heap size, but for these experiments I needed to have a fixed heap size instead.

Another measure of scalability is, how does the collector do as you add marker threads? This chart shows that for both collectors, runtime decreases as you add threads. It also shows that whippet is significantly slower than BDW on this benchmark, which is Very Weird, and I didn't have access to the machine on which these benchmarks were run when preparing the slides in the train... so, let's call this chart a good reminder that Whippet is a WIP :)

While in the train to Brussels I re-ran this test on the 4-core laptop I had on hand, and got the results that I expected: that whippet performed similarly to BDW, and that adding markers improved things, albeit marginally. Perhaps I should look on a different microbenchmark.

Incidentally, when you configure Whippet for parallel marking at build-time, it uses a different implementation of the mark stack when compared to the parallel marker, even when only 1 marker is enabled. Certainly the parallel marker could use some tuning.

Whippet vs BDW: Ephemerons

BDW: No ephemerons

Whippet: Yes

Another deep irritation I have with BDW is that it doesn't support ephemerons. In Guile we have a number of facilities (finalizers, guardians, the symbol table, weak maps, et al) built on what BDW does have (finalizers, weak references), but the implementations of these facilities in Guile are hacky, slow, sometimes buggy, and don't compose (try putting an object in a guardian and giving it a finalizer to see what I mean). It would be much better if the collector API supported ephemerons natively, specifying their relationship to finalizers and other facilities, allowing us to build what we need in terms of those primitives. With our own GC, we can do that, and do it in such a way that it doesn't depend on the details of the specific collection algorithm. The exception of course is that as BDW doesn't support ephemerons per se, what we get is actually a weak-key association instead, whose value can keep the key alive. Oh well, it's no worse than the current situation.

Whippet vs BDW: Precision

BDW: ~Always stack-conservative, often heap-conservative

Whippet: Fully configurable (at compile-time)

Guile in mid/near-term: C stack conservative, Scheme stack precise, heap precise

Possibly fully precise: unlock semi-space nursery

Conservative tracing is a fundamental design feature of the BDW collector, both of roots and of inter-heap edges. You can tell BDW how to trace specific kinds of heap values, but the default is to do a conservative scan, and the stack is always scanned conservatively. In contrast, these tradeoffs are all configurable in Whippet. You can scan the stack and heap precisely, or stack conservatively and heap precisely, or vice versa (though that doesn't make much sense), or both conservatively.

The long-term future in Guile is probably to continue to scan the C stack conservatively, to continue to scan the Scheme stack precisely (even with BDW-GC, the Scheme compiler emits stack maps and installs a custom mark routine), but to scan the heap as precisely as possible. It could be that a user uses some of our hoary ancient APIs to allocate an object that Whippet can't trace precisely; in that case we'd have to disable evacuation / object motion, but we could still trace other objects precisely.

If Guile ever moved to a fully precise world, that would be a boon for performance, in two ways: first that we would get the ability to use a semi-space nursery instead of the sticky-mark-bit algorithm, and relatedly that we wouldn't need to initialize mark bytes when allocating objects. Second, we'd gain the option to use must-move algorithms for the old space as well (mark-compact, semi-space) if we wanted to. But it's just an option, one that that Whippet opens up for us.

Whippet vs BDW: Tools?

Can build heap tracers and profilers moer easily

More hackable

(BDW-GC has as many preprocessor directives as whippet has source lines)

Finally, relative to BDW-GC, whippet has a more intangible advantage: I can actually hack on it. Just as an indication, 15% of BDW source lines are pre-processor directives, and there is one file that has like 150 #ifdef's, not counting #elseif's, many of them nested. I haven't done all that much to BDW itself, but I personally find it excruciating to work on.

Hackability opens up the possibility to build more tools to help us diagnose memory use problems. They aren't in Whippet yet, but there can be!

Engineering Whippet

Embed-only, abstractions, migration, modern; timeline

OK, that rounds out the comparison between BDW and Whippet, at least on a design level. Now I have a few words about how to actually get this new collector into Guile without breaking the bug budget. I try to arrange my work areas on Guile in such a way that I spend a minimum of time on bugs. Part of my strategy is negligence, I will admit, but part also is anticipating problems and avoiding them ahead of time, even if it takes more work up front.

Engineering Whippet: Embed-only

github.com/wingo/whippet-gc/

Semi: 6 kB; Whippet: 22 kB; BDW: 184 kB

Compile-time specialization:

  • for embedder (e.g. how to forward objects)
  • for selected GC algorithm (e.g. semi-space vs whippet)

Built apart, but with LTO to remove library overhead

So the BDW collector is typically shipped as a shared library that you dynamically link to. I should say that we've had an overall good experience with upgrading BDW-GC in the past; its maintainer (Ivan Maidanski) does a great and responsible job on a hard project. It's been many, many years since we had a bug in BDW-GC. But still, BDW is dependency, and all things being equal we prefer to remove moving parts.

The approach that Whippet is taking is to be an embed-only library: it's designed to be compiled into your project. It's not an include-only library; it still has to be compiled, but with link-time-optimization and a judicious selection of fast-path interfaces, Whippet is mostly able to avoid abstractions being a performance barrier.

The result is that Whippet is small, both in source and in binary, which minimizes its maintenance overhead. Taking additional stripped optimized binary size as the metric, by my calculations a semi-space collector (with a large object space and ephemeron support) takes about 6 kB of object file size, whereas Whippet takes 22 and BDW takes 184. Part of how Whippet gets so small is that it is is configured in major ways at compile-time (choice of main GC algorithm), and specialized against the program it's embedding against (e.g. how to patch in a forwarding pointer). Having all API being internal and visible to LTO instead of going through ELF symbol resolution helps in a minor way as well.

Engineering Whippet: Abstract performance

User API abstracts over GC algorithm, e.g. semi-space or whippet

Expose enough info to allow JIT to open-code fast paths

Inspired by mmtk.io

Abstractions permit change: of algorithm, over time

From a composition standpoint, Whippet is actually a few things. Firstly there is an abstract API to make a heap, create per-thread mutators for a heap, and allocate objects for a mutator. There is the aforementioned embedder API, for having the embedding program indicate how to trace objects and install forwarding pointers. Then there is some common code (for example ephemeron support). There are implementations of the different spaces: semi-space, large object, whippet/immix; and finally collector implementations that tie together the spaces into a full implementation of the abstract API. (In practice the more iconic spaces are intertwingled with the collector implementations they define.)

I don't think I would have gone down this route without seeing some prior work, for example libpas, but it was really MMTk that convinced me that it was worth spending a little time thinking about the GC not as a structureless blob but as a system made of parts and exposing a minimal interface. In particular, I was inspired by seeing that MMTk is able to get good performance while also being abstract, exposing representation details such as how to tell a JIT compiler about allocation fast-paths, but in a principled way. So, thanks MMTk people, for this and so many things!

I'm particularly happy that the API is abstract enough that it frees up not only the garbage collector to change implementations, but also Guile and other embedders, in that they don't have to bake in a dependency on specific collectors. The semi-space collector has been particularly useful here in ensuring that the abstractions don't accidentally rely on support for object pinning.

Engineering Whippet: Migration

API implementable by BDW-GC (except ephemerons)

First step for Guile: BDW behind Whippet API

Then switch to whippet/immix (by default)

The collector API can actually be implemented by the BDW collector. Whippet includes a collector that is a thin wrapper around the BDW API, with support for fast-path allocation via thread-local freelists. In this way we can always check the performance of any given collector against an external fixed point (BDW) as well as a theoretically known point (the semi-space collector).

Indeed I think the first step for Guile is precisely this: refactor Guile to allocate through the Whippet API, but using the BDW collector as the implementation. This will ensure that the Whippet API is sufficient, and then allow an incremental switch to other collectors.

Incidentally, when it comes to integrating Whippet, there are some choices to be made. I mentioned that it's quite configurable, and this chart can give you some idea. On the left side is one microbenchmark (mt-gcbench) and on the right is another (quads). The first generates a lot of fragmentation and has a wide range of object sizes, including some very large objects. The second is very uniform and many allocations die young.

(I know these images are small; right-click to open in new tab or pinch to zoom to see more detail.)

Within each set of bars we have 10 different scenarios, corresponding to different Whippet configurations. (All of these tests are run on my old 4-core laptop with 4 markers if parallel marking is supported, and a 2x heap.)

The first bar in each side is serial whippet: one marker. Then we see parallel whippet: four markers. Great. Then there's generational whippet: one marker, but just scanning objects allocated in the current cycle, hoping that produces enough holes. Then generational parallel whippet: the same as before, but with 4 markers.

The next 4 bars are the same: serial, parallel, generational, parallel-generational, but with one difference: the stack is scanned conservatively instead of precisely. You might be surprised but all of these configurations actually perform better than their precise counterparts. I think the reason is that the microbenchmark uses explicit handle registration and deregistration (it's a stack) instead of compiler-generated stack maps in a side table, but I'm not precisely (ahem) sure.

Finally the next 2 bars are serial and parallel collectors, but marking everything conservatively. I have generational measurements for this configuration but it really doesn't make much sense to assume that you can emit write barriers in this context. These runs are slower than the previous configuration, mostly because there are some non-pointer locations that get scanned conservatively that wouldn't get scanned precisely. I think conservative heap scanning is less efficient than precise but I'm honestly not sure, there are some instruction locality arguments in the other direction. For mt-gcbench though there's a big array of floating-point values that a precise scan will omit, which causes significant overhead there. Probably for this configuration to be viable Whippet would need the equivalent of BDW's API to allocate known-pointerless objects.

Engineering Whippet: Modern

stdatomic

constexpr-ish

pthreads (for parallel markers)

No void*; instead struct types: gc_ref, gc_edge, gc_conservative_ref, etc

Embed-only lib avoids any returns-struct-by-value ABI issue

Rust? MMTk; supply chain concerns

Platform abstraction for conservative root finding

I know it's a sin, but Whippet is implemented in C. I know. The thing is, in the Guile context I need to not introduce wild compile-time dependencies, because of bootstrapping. And I know that Rust is a fine language to use for GC implementation, so if that's what you want, please do go take a look at MMTk! It's a fantastic project, written in Rust, and it can just slot into your project, regardless of the language your project is written in.

But if what you're looking for is something in C, well then you have to pick and choose your C. In the case of Whippet I try to use the limited abilities of C to help prevent bugs; for example, I generally avoid void* and instead wrap pointers or addresses into single-field structs that can't be automatically cast, for example to prevent a struct gc_ref that denotes an object reference (or NULL; it's an option type) from being confused with a struct gc_conservative_ref, which might not point to an object at all.

(Of course, by "C" I mean "C as compiled by gcc and clang with -fno-strict-aliasing". I don't know if it's possible to implement even a simple semi-space collector in C without aliasing violations. Can you access a Foo* object within a mmap'd heap through its new address after it has been moved via memcpy? Maybe not, right? Thoughts are welcome.)

As a project written in the 2020s instead of the 1990s, Whippet gets to assume a competent C compiler, for example relying on the compiler to inline and fold branches where appropriate. As in libpas, Whippet liberally passes functions as values to inline functions, and relies on the compiler to boil away function calls. Whippet only uses the C preprocessor when it absolutely has to.

Finally, there is a clean abstraction for anything that's platform-specific, for example finding the current stack bounds. I haven't compiled this code on Windows or MacOS yet, but I am not anticipating too many troubles.

Engineering Whippet: Timeline

As time permits

Whippet TODO: heap growth/shrinking, finalizers, safepoint API

Guile TODO: safepoints; heap-conservative first

Precise heap TODO: gc_trace_object, SMOBs, user structs with raw ptr fields, user gc_malloc usage; 3.2

6 months for 3.1.1; 12 for 3.2.0 ?

So where does this get us? Where are we now?

For Whippet itself, I think it's mostly done -- enough to start shifting focus to some different phase. It's missing some needed features, notably the ability to grow the heap at all, as I've been in fixed-heap-size-only mode during development. It's also missing finalizers. And, something needs to be done to unify Guile's handling of safepoints and processing of asynchronous signals with Whippet's need to stop all mutators. Some details remain.

But, I think we are close to ready to start integrating in Guile. At first this is just porting Guile to use the Whippet API to access BDW instead of using BDW directly. This whole thing is a side project for me that I work on when I can, so it doesn't exactly proceed at full pace. Perhaps this takes 6 months. Then we can cut a new unstable release, and hopefully release 3.2 withe support for the Immix-flavored collector in another 6 or 9 months.

I thought that we would be forced to make ABI changes, if only because of some legacy APIs assume conservative tracing of object contents. But after a discussion at FOSDEM with Carlo Piovesan I realized this isn't true: because the decision to evacuate or not is made on a collection-by-collection basis, I could simply disable evacuation if the user ever uses a facility that might prohibit object motion, for example if they ever define a SMOB type. If the user wants evacuation, they need to be more precise with their data types, but either way Guile is ready.

Whippet: A Better GC?

An Immix-derived GC

github.com/wingo/whippet-gc/

https://wingolog.org/tags/gc/

Guile 3.2 ?

Thanks to MMTk authors for inspiration!

And that's it! Thanks for reading all the way here. Comments are quite welcome.

As I mentioned in the very beginning, this talk was really about Whippet in the context of Guile. There is a different talk to be made about Guile+Whippet versus other language implementations, for example those with concurrent marking or semi-space nurseries or the like. Yet another talk is Whippet in the context of other GC algorithms. But this is a start. It's something I've been working on for a while now already and I'm pleased that it's gotten to a point where it seems to be at least OK, at least an improvement with respect to BDW-GC in some ways.

But before leaving you, another chart, to give a more global idea of the state of things. Here we compare a single mutator thread performing a specific microbenchmark that makes trees and also lots of fragmentation, across three different GC implementations and a range of heap sizes. The heap size multipliers in this and in all the other tests in this post are calculated analytically based on what the test thinks its maximum heap size should be, not by measuring minimum heap sizes that work. This size is surely lower than the actual maximum required heap size due to internal fragmentation, but the tests don't know about this.

The three collectors are BDW, a semi-space collector, and whippet. Semi-space manages to squeeze in less than 2x of a heap multiplier because it has (and whippet has) a separate large object space that isn't ever evacuated.

What we expect is that tighter heaps impose more GC time, and indeed we see that times are higher on the left side than the right.

Whippet is the only implementation that manages to run at a 1.3x heap, but it takes some time. It's slower than BDW at a 1.5x heap but better there on out, until what appears to be a bug or pathology makes it take longer at 5x. Adding memory should always decrease run time.

The semi-space collector starts working at 1.75x and then surpasses all collectors from 2.5x onwards. We expect the semi-space collector to win for big heaps, because its overhead is proportional to live data only, whereas mark-sweep and mark-region collectors have to sweep, which is proportional to heap size, and indeed that's what we see.

I think this chart shows we have some tuning yet to do. The range between 2x and 3x is quite acceptable, but we need to see what's causing Whippet to be slower than BDW at 1.5x. I haven't done as much performance tuning as I would like to but am happy to finally be able to know where we stand.

And that's it! Happy hacking, friends, and may your heap sizes be ever righteous.