design notes on inline caches in guile

7 February 2018 3:14 PM (inline caches | compilers | v8 | igalia | guile | gnu | adaptive optimization | type feedback)

Ahoy, programming-language tinkerfolk! Today's rambling missive chews the gnarly bones of "inline caches", in general but also with particular respect to the Guile implementation of Scheme. First, a little intro.

inline what?

Inline caches are a language implementation technique used to accelerate polymorphic dispatch. Let's dive in to that.

By implementation technique, I mean that the technique applies to the language compiler and runtime, rather than to the semantics of the language itself. The effects on the language do exist though in an indirect way, in the sense that inline caches can make some operations faster and therefore more common. Eventually inline caches can affect what users expect out of a language and what kinds of programs they write.

But I'm getting ahead of myself. Polymorphic dispatch literally means "choosing based on multiple forms". Let's say your language has immutable strings -- like Java, Python, or Javascript. Let's say your language also has operator overloading, and that it uses + to concatenate strings. Well at that point you have a problem -- while you can specify a terse semantics of some core set of operations on strings (win!), you can't choose one representation of strings that will work well for all cases (lose!). If the user has a workload where they regularly build up strings by concatenating them, you will want to store strings as trees of substrings. On the other hand if they want to access codepoints by index, then you want an array. But if the codepoints are all below 256, maybe you should represent them as bytes to save space, whereas otherwise maybe you want 4-byte codepoints? Or maybe even UTF-8 with a codepoint index side table.

The right representation (form) of a string depends on the myriad ways that the string might be used. The string-append operation is polymorphic, in the sense that the precise code for the operator depends on the representation of the operands -- despite the fact that the meaning of string-append is monomorphic!

Anyway, that's the problem. Before inline caches came along, there were two solutions: callouts and open-coding. Both were bad in similar ways. A callout is where the compiler generates a call to a generic runtime routine. The runtime routine will be able to handle all the myriad forms and combination of forms of the operands. This works fine but can be a bit slow, as all callouts for a given operator (e.g. string-append) dispatch to a single routine for the whole program, so they don't get to optimize for any particular call site.

One tempting thing for compiler writers to do is to effectively inline the string-append operation into each of its call sites. This is "open-coding" (in the terminology of the early Lisp implementations like MACLISP). The advantage here is that maybe the compiler knows something about one or more of the operands, so it can eliminate some cases, effectively performing some compile-time specialization. But this is a limited technique; one could argue that the whole point of polymorphism is to allow for generic operations on generic data, so you rarely have compile-time invariants that can allow you to specialize. Open-coding of polymorphic operations instead leads to code bloat, as the string-append operation is just so many copies of the same thing.

Inline caches emerged to solve this problem. They trace their lineage back to Smalltalk 80, gained in complexity and power with Self and finally reached mass consciousness through Javascript. These languages all share the characteristic of being dynamically typed and object-oriented. When a user evaluates a statement like x = y.z, the language implementation needs to figure out where y.z is actually located. This location depends on the representation of y, which is rarely known at compile-time.

However for any given reference y.z in the source code, there is a finite set of concrete representations of y that will actually flow to that call site at run-time. Inline caches allow the language implementation to specialize the y.z access for its particular call site. For example, at some point in the evaluation of a program, y may be seen to have representation R1 or R2. For R1, the z property may be stored at offset 3 within the object's storage, and for R2 it might be at offset 4. The inline cache is a bit of specialized code that compares the type of the object being accessed against R1, in that case returning the value at offset 3, otherwise against R2 and offset 4, and otherwise falling back to a generic routine. If this isn't clear to you, Vyacheslav Egorov wrote a fine article describing and implementing the object representation optimizations enabled by inline caches.
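
To make that concrete, here is a minimal sketch in C of the kind of stub such an inline cache might contain; the shapes R1 and R2, the slot offsets, and the generic fallback are the hypothetical values from the paragraph above, not any particular engine's layout.

#include <stddef.h>

/* Hypothetical representation descriptors standing in for R1 and R2. */
typedef struct shape { const char *name; } shape;
static shape R1 = { "R1" }, R2 = { "R2" };

/* An object is its shape plus some slots; which slot holds which property
   depends on the shape. */
typedef struct object { shape *type; void *slots[8]; } object;

/* Generic slow path: a real runtime would look the property name up in the
   object's shape; here it is only a stub. */
static void *generic_property_lookup(object *obj, const char *name) {
  (void)obj; (void)name;
  return NULL;
}

/* The inline cache for the y.z site, specialized to the two shapes seen. */
static void *ic_get_z(object *obj) {
  if (obj->type == &R1) return obj->slots[3];   /* z at offset 3 for R1 */
  if (obj->type == &R2) return obj->slots[4];   /* z at offset 4 for R2 */
  return generic_property_lookup(obj, "z");     /* unseen form: bail out */
}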

Inline caches also serve as input data to later stages of an adaptive compiler, allowing the compiler to selectively inline (open-code) only those cases that are appropriate to values actually seen at any given call site.

but how?

The classic formulation of inline caches from Self and early V8 actually patched the code being executed. An inline cache might be allocated at address 0xcabba9e5 and the code emitted for its call-site would be jmp 0xcabba9e5. If the inline cache ended up bottoming out to the generic routine, a new inline cache would be generated that added an implementation appropriate to the newly seen "form" of the operands and the call-site. Let's say that new IC (inline cache) would have the address 0x900db334. Early versions of V8 would actually patch the machine code at the call-site to be jmp 0x900db334 instead of jmp 0xcabba9e5.

Patching machine code has a number of disadvantages, though. It is inherently target-specific: you will need different strategies to patch x86-64 and armv7 machine code. It's also expensive: you have to flush the instruction cache after the patch, which slows you down. That is, of course, if you are allowed to patch executable code at all; on many systems that's impossible. And writable machine code is a potential security liability if the system is vulnerable to remote code execution.

Perhaps worst of all, though, patching machine code is not thread-safe. In the case of early Javascript, this perhaps wasn't so important; but as JS implementations gained parallel garbage collectors and JS-level parallelism via "service workers", this becomes less acceptable.

For all of these reasons, the modern take on inline caches is to implement them as a memory location that can be atomically modified. The call site is just jmp *loc, as if it were a virtual method call. Modern CPUs have "branch target buffers" that predict the target of these indirect branches with very high accuracy so that the indirect jump does not become a pipeline stall. (What does this mean in the face of the Spectre v2 vulnerabilities? Sadly, God only knows at this point. Saddest panda.)
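
Sketched in C with function pointers standing in for generated stubs (the names are mine, not any engine's), the idea is roughly this: the call site always branches through loc, and the cache grows by an atomic store to loc rather than by patching code.

#include <stdatomic.h>

/* The IC for one call site is a code stub; here it's modelled as a C
   function pointer.  `site_ic` plays the role of the patchable memory
   location `loc`: the compiled call site is morally `jmp *loc`. */
typedef long (*ic_stub)(long a, long b);

static long generic_dispatch(long a, long b) { return a + b; }  /* handles every form */
static long specialized_stub(long a, long b) { return a + b; }  /* covers only the forms seen so far */

static _Atomic ic_stub site_ic = generic_dispatch;

static long call_site(long a, long b) {
  /* Indirect branch through the IC location; well-predicted by the BTB. */
  ic_stub stub = atomic_load_explicit(&site_ic, memory_order_acquire);
  return stub(a, b);
}

static void extend_ic(void) {
  /* Adding a case is a single atomic pointer store, not a code patch. */
  atomic_store_explicit(&site_ic, specialized_stub, memory_order_release);
}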

cry, the beloved country

I am interested in ICs in the context of the Guile implementation of Scheme, but first I will make a digression. Scheme is a very monomorphic language. Yet, this monomorphism is entirely cultural. It is in no way essential. Lack of ICs in implementations has actually fed back and encouraged this monomorphism.

Let us take as an example the case of property access. If you have a pair in Scheme and you want its first field, you do (car x). But if you have a vector, you do (vector-ref x 0).

What's the reason for this nonuniformity? You could have a generic ref procedure, which when invoked as (ref x 0) would return the field in x associated with 0. Or (ref x 'foo) to return the foo property of x. It would be more orthogonal in some ways, and it's completely valid Scheme.

We don't write Scheme programs this way, though. From what I can tell, it's for two reasons: one good, and one bad.

The good reason is that saying vector-ref means more to the reader. You know more about the complexity of the operation and what side effects it might have. When you call ref, who knows? Using concrete primitives allows for better program analysis and understanding.

The bad reason is that Scheme implementations, Guile included, tend to compile (car x) to much better code than (ref x 0). Scheme implementations in practice aren't well-equipped for polymorphic data access. In fact it is standard Scheme practice to abuse the "macro" facility to manually inline code, so that certain performance-sensitive operations get inlined into a closed graph of monomorphic operators with no callouts. To the extent that this is true, Scheme programmers, Scheme programs, and the Scheme language as a whole are all victims of their implementations. JavaScript, for example, does not have this problem -- yes, to a small extent performance tweaks and tuning are always a thing, but JavaScript implementations' ability to burn away polymorphism and abstraction results in an entirely different character in JS programs versus Scheme programs.

it gets worse

On the most basic level, Scheme is the call-by-value lambda calculus. It's well-studied, well-understood, and eminently flexible. However the way that the syntax maps to the semantics hides a constrictive monomorphism: that the "callee" of a call must refer to a lambda expression.

Concretely, in an expression like (a b), in which a is not a macro, a must evaluate to the result of a lambda expression. Perhaps by reference (e.g. (define a (lambda (x) x))), perhaps directly; but a lambda nonetheless. But what if a is actually a vector? At that point the Scheme language standard would declare that to be an error.

The semantics of Clojure, though, would allow for ((vector 'a 'b 'c) 1) to evaluate to b. Why not in Scheme? There are the same good and bad reasons as with ref. Usually, the concerns of the language implementation dominate, regardless of those of the users who generally want to write terse code. Of course in some cases the implementation concerns should dominate, but not always. Here, Scheme could be more flexible if it wanted to.

what have you done for me lately

Although inline caches are not a miracle cure for performance overheads of polymorphic dispatch, they are a tool in the box. But what, precisely, can they do, both in general and for Scheme?

To my mind, they have five uses. If you can think of more, please let me know in the comments.

Firstly, there are the classic named property access optimizations, as in JavaScript. These apply less to Scheme, as we don't have generic property access. Perhaps this is a deficiency of Scheme, but it's not exactly low-hanging fruit. Perhaps this would be more interesting if Guile had more generic protocols such as Racket's iteration.

Next, there are the arithmetic operators: addition, multiplication, and so on. Scheme's arithmetic is indeed polymorphic; the addition operator + can add any number of complex numbers, with a distinction between exact and inexact values. On a representation level, Guile has fixnums (small exact integers, no heap allocation), bignums (arbitrary-precision heap-allocated exact integers), fractions (exact ratios between integers), flonums (heap-allocated double-precision floating point numbers), and compnums (inexact complex numbers, internally a pair of doubles). Also in Guile, arithmetic operators are "primitive generics", meaning that they can be extended to operate on new types at runtime via GOOPS.

The usual situation though is that any particular instance of an addition operator only sees fixnums. In that case, it makes sense to only emit code for fixnums, instead of the product of all possible numeric representations. This is a clear application where inline caches can be interesting to Guile.
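
As a rough sketch of what that might look like, here is an addition IC specialized to the fixnum/fixnum case, written in C with a made-up tagging scheme (not Guile's actual tags); anything else falls back to the generic adder.

#include <stdint.h>

typedef uintptr_t scm;                  /* a tagged Scheme value */

/* Made-up tagging scheme for the sketch: low two bits == 01 marks a fixnum
   whose value lives in the upper bits.  Guile's real tags differ. */
#define FIXNUM_TAG      1u
#define FIXNUM_TAG_MASK 3u

static int is_fixnum(scm x) { return (x & FIXNUM_TAG_MASK) == FIXNUM_TAG; }

/* Stub for the full polymorphic adder: bignums, fractions, flonums,
   compnums, GOOPS methods...  Here it just returns one argument. */
static scm generic_add(scm a, scm b) { (void)a; return b; }

/* The IC body for an addition site that has only ever seen fixnums. */
static scm add_ic(scm a, scm b) {
  if (is_fixnum(a) && is_fixnum(b)) {
    intptr_t sum;
    /* (4x+1) - 1 + (4y+1) = 4(x+y)+1: the tag arithmetic cancels, and a
       machine-level signed overflow is exactly a fixnum-range overflow. */
    if (!__builtin_add_overflow((intptr_t)(a - FIXNUM_TAG), (intptr_t)b, &sum))
      return (scm)sum;
  }
  return generic_add(a, b);             /* any other form: fall back */
}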

Third, there is a very specific case related to dynamic linking. Did you know that most programs compiled for GNU/Linux and related systems have inline caches in them? It's a bit weird but the "Procedure Linkage Table" (PLT) segment in ELF binaries on Linux systems is set up in a way that when e.g. libfoo.so is loaded, the dynamic linker usually doesn't eagerly resolve all of the external routines that libfoo.so uses. The first time that libfoo.so calls frobulate, it ends up calling a procedure that looks up the location of the frobulate procedure, then patches the binary code in the PLT so that the next time frobulate is called, it dispatches directly. To dynamic language people it's the weirdest thing in the world that the C/C++/everything-static universe has at its cold, cold heart a hash table and a dynamic dispatch system that it doesn't expose to any kind of user for instrumenting or introspection -- any user that's not a malware author, of course.

But I digress! Guile can use ICs to lazily resolve runtime routines used by compiled Scheme code. But perhaps this isn't optimal, as the set of primitive runtime calls that Guile will embed in its output is finite, and so resolving these routines eagerly would probably be sufficient. Guile could use ICs for inter-module references as well, and these should indeed be resolved lazily; but I don't know, perhaps the current strategy of using a call-site cache for inter-module references is sufficient.

Fourthly (are you counting?), there is a general case of the former: when you see a call (a b) and you don't know what a is. If you put an inline cache in the call, instead of having to emit checks that a is a heap object and a procedure and then emit an indirect call to the procedure's code, you might be able to emit simply a check that a is the same as x, the only callee you ever saw at that site, and in that case you can emit a direct branch to the function's code instead of an indirect branch.
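
A sketch of that, in C and with made-up names, might look like the following; the interesting part is that the fast path is one pointer comparison and a call whose target the compiler knows statically.

/* Sketch (names mine) of a call IC specialized to the single callee seen at
   a site: one pointer comparison and a direct call, instead of the generic
   "is it a heap object? is it a procedure?" checks plus an indirect call. */
typedef struct procedure { long (*code)(long); } procedure;

static long x_code(long arg) { return arg + 1; }  /* x's code, known statically */
static procedure x = { x_code };                  /* the only callee ever seen */

static long generic_apply(procedure *p, long arg) {
  /* Slow path: in a real VM, type-check p, then call indirectly. */
  return p->code(arg);
}

static long call_ic(procedure *callee, long arg) {
  if (callee == &x)
    return x_code(arg);                /* direct branch to known code */
  return generic_apply(callee, arg);   /* unknown callee: generic path */
}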

Here I think the argument is less strong. Modern CPUs are already very good at indirect jumps and well-predicted branches. The value of a devirtualization pass in compilers is that it makes the side effects of a virtual method call concrete, allowing for more optimizations; avoiding indirect branches is good but not necessary. On the other hand, Guile does have polymorphic callees (generic functions), and call ICs could help there. Ideally though we would need to extend the language to allow generic functions to feed back to their inline cache handlers.

Finally, ICs could allow for cheap tracepoints and breakpoints. If at every breakable location you included a jmp *loc, and the initial value of *loc was the address of the next instruction, then you could patch individual locations with code to run there. The patched code would be responsible for saving and restoring machine state around the instrumentation.

Honestly I struggle a lot with the idea of debugging native code. GDB does the least-overhead, most-generic thing, which is patching code directly; but it runs from a separate process, and in Guile we need in-process portable debugging. The debugging use case is a clear area where you want adaptive optimization, so that you can omit debugging ceremony from the hottest code, knowing that you can fall back on some earlier tier. Perhaps Guile should bite the bullet and go this way too.

implementation plan

In Guile, monomorphic as it is in most things, probably only arithmetic is worth the trouble of inline caches, at least in the short term.

Another question is how much to specialize the inline caches to their call site. On the extreme side, each call site could have a custom calling convention: if the first operand is in register A and the second is in register B and they are expected to be fixnums, and the result goes in register C, and the continuation is the code at L, well then you generate an inline cache that specializes to all of that. No need to shuffle operands or results, no need to save the continuation (return location) on the stack.

The opposite would be to call ICs as if they were normal procedures: shuffle arguments into fixed operand registers, push a stack frame, and when the IC returns, shuffle the result into place.

Honestly I am leaning mostly towards the simple solution. I am concerned about code and heap bloat if I specialize to every last detail of a call site. Also, maximum speed comes with an adaptive optimizer, and in that case simple lower tiers are best.

sanity check

To compare these impressions, I took a look at V8's current source code to see where they use ICs in practice. When I worked on V8, the compiler was entirely different -- there were two tiers, and both of them generated native code. Inline caches were everywhere, and they were gnarly; every architecture had its own implementation. Now in V8 there are two tiers, not the same as the old ones, and the lowest one is a bytecode interpreter.

As an adaptive optimizer, V8 doesn't need breakpoint ICs. It can always deoptimize back to the interpreter. In actual practice, to debug at a source location, V8 will patch the bytecode to insert a "DebugBreak" instruction, which has its own support in the interpreter. V8 also supports optimized compilation of this operation. So, no ICs needed here.

Likewise for generic type feedback, V8 records types as data rather than in the classic formulation of inline caches as in Self. I think WebKit's JavaScriptCore uses a similar strategy.

V8 does use inline caches for property access (loads and stores). Besides that, there is an inline cache used at call sites, but it only records callee counts and is not used for direct call optimization.

Surprisingly, V8 doesn't even seem to use inline caches for arithmetic (any more?). Fair enough, I guess, given that JavaScript's numbers aren't very polymorphic, and even with a system with fixnums and heap floats like V8, floating-point numbers are rare in cold code.

The dynamic linking and relocation points don't apply to V8 either, as it doesn't receive binary code from the internet; it always starts from source.

twilight of the inline cache

There was a time when inline caches were recommended to solve all your VM problems, but it would seem now that their heyday is past.

ICs are still a win if you have named property access on objects whose shape you don't know at compile-time. But improvements in CPU branch target buffers mean that it's no longer imperative to use ICs to avoid indirect branches (modulo Spectre v2), and creating direct branches via code-patching has gotten more expensive and tricky on today's targets with concurrency and deep cache hierarchies.

Besides that, the type feedback component of inline caches seems to be taken over by explicit data-driven call-site caches, rather than executable inline caches, and the highest-throughput tiers of an adaptive optimizer burn away inline caches anyway. The pressure on an inline cache infrastructure now is towards simplicity and ease of type and call-count profiling, leaving the speed component to those higher tiers.

In Guile the bounded polymorphism on arithmetic combined with the need for ahead-of-time compilation means that ICs are probably a code size and execution time win, but it will take some engineering to prevent the calling convention overhead from dominating cost.

Time to experiment, then -- I'll let y'all know how it goes. Thoughts and feedback welcome from the compilerati. Until then, happy hacking :)

notes from the fosdem 2018 networking devroom

5 February 2018 5:22 PM (fosdem | networking | userspace | snabb | vpp | dpdk | igalia)

Greetings, internet!

I am on my way back from FOSDEM and thought I would share with yall some impressions from talks in the Networking devroom. I didn't get to go to all that many talks -- FOSDEM's hallway track is the hottest of them all -- but I did hit a select few. Thanks to Dave Neary at Red Hat for organizing the room.

Ray Kinsella -- Intel -- The path to data-plane micro-services

The day started with a drum-beating talk that was very light on technical information.

Essentially Ray was arguing for an evolution of network function virtualization -- that instead of running VNFs on bare metal as was done in the days of yore, people started to run them in virtual machines, and now they run them in containers -- what's next? Ray is saying that "cloud-native VNFs" are the next step.

Cloud-native VNFs would move from "greedy" VNFs that take charge of the cores that are available to them, to some kind of resource sharing. "Maybe users value flexibility over performance", says Ray. It's the Care Bears approach to networking: (resource) sharing is caring.

In practice he proposed two ways that VNFs can map to cores and cards.

One was in-process sharing, which if I understood him properly was actually as nodes running within a VPP process. Basically in this case VPP or DPDK is the scheduler and multiplexes two or more network functions in one process.

The other was letting Linux schedule separate processes. In networking, we don't usually do it this way: we run network functions on dedicated cores on which nothing else runs. Ray was suggesting that perhaps network functions could be more like "normal" Linux services. Ray doesn't know if Linux scheduling will work in practice. Also it might mean allowing DPDK to work with 4K pages instead of the 2M hugepages it currently requires. This obviously has the potential for more latency hazards and would need some tighter engineering, and ultimately would have fewer guarantees than the "greedy" approach.

Interesting side things I noticed:

  • All the diagrams show Kubernetes managing CPU node allocation and interface assignment. I guess in marketing diagrams, Kubernetes has completely replaced OpenStack.

  • One slide showed guest VNFs differentiated between "virtual network functions" and "socket-based applications", the latter ones being the legacy services that use kernel APIs. It's a useful terminology difference.

  • The talk identifies user-space networking with DPDK (only!).

Finally, I note that Conway's law is obviously reflected in the performance overheads: because there are organizational isolations between dev teams, vendors, and users, there are big technical barriers between them too. The least-overhead forms of resource sharing are also those with the highest technical consistency and integration (nodes in a single VPP instance).

Magnus Karlsson -- Intel -- AF_XDP

This was a talk about getting good throughput from the NIC to userspace, but by using some kernel facilities. The idea is to get the kernel to set up the NIC and virtualize the transmit and receive ring buffers, but to let the NIC's DMA'd packets go directly to userspace.

The performance goal is 40Gbps for thousand-byte packets, or 25 Gbps for traffic with only the smallest packets (64 bytes). The fast path does "zero copy" on the packets if the hardware has the capability to steer the subset of traffic associated with the AF_XDP socket to that particular process.

The AF_XDP project builds on XDP, a newish thing where a little kind of bytecode can run in the kernel or possibly on the NIC. One of the bytecode commands (REDIRECT) causes packets to be forwarded to user-space instead of handled by the kernel's otherwise heavyweight networking stack. AF_XDP is the bridge between XDP on the kernel side and an interface to user-space using sockets (as opposed to e.g. AF_INET). The performance goal was to be within 10% or so of DPDK's raw user-space-only performance.
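
For flavor, here is a rough sketch (mine, not from the talk, and using libbpf conventions) of an XDP program that steers packets to an AF_XDP socket registered in a map; the map name xsks_map and the program name are assumptions for illustration.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Map from receive queue index to AF_XDP socket; user-space fills it in
   after binding its sockets. */
struct {
  __uint(type, BPF_MAP_TYPE_XSKMAP);
  __uint(max_entries, 64);
  __type(key, __u32);
  __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_redirect_to_socket(struct xdp_md *ctx)
{
  /* Hand the packet to the AF_XDP socket bound to this receive queue,
     bypassing the kernel stack; with flags == 0, packets arriving on
     queues with no bound socket are aborted rather than passed up. */
  return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}

char LICENSE[] SEC("license") = "GPL";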

The benefits of AF_XDP over the current situation would be that you have just one device driver, in the kernel, rather than having to have one driver in the kernel (which you have to have anyway) and one in user-space (for speed). Also, with the kernel involved, there is a possibility for better isolation between different processes or containers, when compared with raw PCI access from user-space.

AF_XDP is what was previously known as AF_PACKET v4, and its numbers are looking somewhat OK. Though it's not upstream yet, it might be interesting to get a Snabb driver here.

I would note that kernel-userspace cooperation is a bit of a theme these days. There are other points of potential cooperation or common domain sharing, storage being an obvious one. However I heard more than once this weekend the kind of "I don't know, that area of the kernel has a different culture" sort of concern as that highlighted by Daniel Vetter in his recent LCA talk.

François-Frédéric Ozog -- Linaro -- Userland Network I/O

This talk is hard to summarize. Like the previous one, it's again about getting packets to userspace with some support from the kernel, but the speaker went really deep and I'm not quite sure what in the talk is new and what is known.

François-Frédéric is working on a new set of abstractions for relating the kernel and user-space. He works on OpenDataPlane (ODP), which is kinda like DPDK in some ways. ARM seems to be a big target for his work; that x86-64 is also a target goes without saying.

His problem statement was, how should we enable fast userland network I/O, without duplicating drivers?

François-Frédéric was a bit negative on AF_XDP because (he says) it is so focused on packets that it neglects other kinds of devices with similar needs, such as crypto accelerators. Apparently the challenge here is accelerating a single large IPsec tunnel -- because the cryptographic operations are serialized, you need good single-core performance, and making use of hardware accelerators seems necessary right now for even a single 10Gbps stream. (If you had many tunnels, you could parallelize, but that's not the case here.)

He was also a bit skeptical about standardizing on the "packet array I/O model" which AF_XDP and most NICs use. What he means here is that most current NICs move packets to and from main memory with the help of a "descriptor array" ring buffer that holds pointers to packets. A transmit array stores packets ready to transmit; a receive array stores maximum-sized packet buffers ready to be filled by the NIC. The packet data itself is somewhere else in memory; the descriptor only points to it. When a new packet is received, the NIC fills the corresponding packet buffer and then updates the "descriptor array" to point to the newly available packet. This requires at least two memory writes from the NIC to memory: at least one to write the packet data (one per 64 bytes of packet data), and one to update the DMA descriptor with the packet length and possibly other metadata.
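
As a much-simplified illustration of that descriptor-array model, a receive ring might be sketched in C like this; the field names and sizes are my assumptions, not any particular card's layout.

#include <stdint.h>

/* One receive descriptor: it only points at a packet buffer that lives
   elsewhere, so the NIC writes the packet data in one place and the
   descriptor update (length, status) in another. */
struct rx_descriptor {
  uint64_t buffer_addr;   /* DMA address of the packet buffer */
  uint16_t length;        /* filled in by the NIC on receive */
  uint16_t status;        /* e.g. a "descriptor done" bit */
  uint32_t metadata;      /* checksum result, RSS hash, etc. */
};

struct rx_ring {
  struct rx_descriptor desc[1024];  /* the descriptor array itself */
  uint32_t next_to_fill;            /* next slot the NIC will complete */
  uint32_t next_to_reap;            /* next slot software will consume */
};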

Although these writes go directly to cache, there's a limit to the number of DMA operations that can happen per second, and with 100Gbps cards, we can't afford to make one such transaction per packet.

François-Frédéric promoted an alternative I/O model for high-throughput use cases: the "tape I/O model", where packets are just written back-to-back in a uniform array of memory. Every so often a block of memory containing some number of packets is made available to user-space. This has the advantage of packing in more packets per memory block, as there's no wasted space between packets. This increases cache density and decreases DMA transaction count for transferring packet data, as we can use each 64-byte DMA write to its fullest. Additionally there's no side table of descriptors to update, saving a DMA write there.
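
A corresponding sketch of the tape model might look like the following, with packets packed head-to-tail in one block and only a fill level published to software; again the layout is an assumption of mine, not Chelsio's or Netcope's actual format.

#include <stdint.h>

/* Packets are written head-to-tail into one block: a small length header,
   then the packet bytes, then the next packet.  There is no per-packet
   descriptor to update; the NIC only publishes how far it has written when
   it hands the block over to software. */
struct tape_block {
  uint32_t bytes_filled;        /* set by the NIC when the block is handed over */
  uint8_t  data[1u << 21];      /* [len][packet][len][packet]... */
};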

Apparently the only cards currently capable of 100 Gbps traffic, the Chelsio and Netcope cards, use the "tape I/O model".

Incidentally, the DMA transfer limit isn't the only constraint. Something I hadn't fully appreciated before was memory write bandwidth. Before, I had thought that because the NIC would transfer in packet data directly to cache, that this wouldn't necessarily cause any write traffic to RAM. Apparently that's not the case. Later over drinks (thanks to Red Hat's networking group for organizing), François-Frédéric asserted that the DMA transfers would eventually use up DDR4 bandwidth as well.

A NIC-to-RAM DMA transaction will write one cache line (usually 64 bytes) to the socket's last-level cache. This write will evict whatever was there before. As far as I can tell, there are three cases of interest here. The best case is where the evicted cache line is from a previous DMA transfer to the same address. In that case it's modified in the cache and not yet flushed to main memory, and we can just update the cache instead of flushing to RAM. (Do I misunderstand the way caches work here? Do let me know.)

However if the evicted cache line is from some other address, we might have to flush to RAM if the cache line is dirty. That causes memory write traffic. But if the cache line is clean, that means it was probably loaded as part of a memory read operation, and then that means we're evicting part of the network function's working set, which will later cause memory read traffic as the data gets loaded in again, and write traffic to flush out the DMA'd packet data cache line.

François-Frédéric simplified the whole thing to equate packet bandwidth with memory write bandwidth, that yes, the packet goes directly to cache but it is also written to RAM. I can't convince myself that that's the case for all packets, but I need to look more into this.

Of course the cache pressure and the memory traffic is worse if the packet data is less compact in memory; and worse still if there is any need to copy data. Ultimately, processing small packets at 100Gbps is still a huge challenge for user-space networking, and it's no wonder that there are only a couple devices on the market that can do it reliably, not that I've seen either of them operate first-hand :)

Talking with Snabb's Luke Gorrie later on, he thought that it could be that we can still stretch the packet array I/O model for a while, given that PCIe gen4 is coming soon, which will increase the DMA transaction rate. So that's a possibility to keep in mind.

At the same time, apparently there are some "coherent interconnects" coming too which will allow the NIC's memory to be mapped into the "normal" address space available to the CPU. In this model, instead of having the NIC transfer packets to the CPU, the NIC's memory will be directly addressable from the CPU, as if it were part of RAM. The latency to pull data in from the NIC to cache is expected to be slightly longer than a RAM access; for comparison, RAM access takes about 70 nanoseconds.

For a user-space networking workload, coherent interconnects don't change much. You still need to get the packet data into cache. True, you do avoid the writeback to main memory, as the packet is already in addressable memory before it's in cache. But, if it's possible to keep the packet on the NIC -- like maybe you are able to add some kind of inline classifier on the NIC that could directly shunt a packet towards an on-board IPSec accelerator -- in that case you could avoid a lot of memory transfer. That appears to be the driving factor for coherent interconnects.

At some point in François-Frédéric's talk, my brain just died. I didn't quite understand all the complexities that he was taking into account. Later, after he kindly took the time to dispel some more of my ignorance, I now understand more of it, though not yet all :) The concrete "deliverable" of the talk was a model for kernel modules and user-space drivers that uses the paradigms he was promoting. It's a work in progress from Linaro's networking group, with some support from NIC vendors and CPU manufacturers.

Luke Gorrie and Asumu Takikawa -- SnabbCo and Igalia -- How to write your own NIC driver, and why

This talk had the most magnificent beginning: a sort of "repent now ye sinners" sermon from Luke Gorrie, a seasoned veteran of software networking. Luke started by describing the path of righteousness leading to "driver heaven", a world in which all vendors have publicly accessible datasheets which parsimoniously describe what you need to get packets flowing. In this blessed land it's easy to write drivers, and for that reason there are many of them. Developers choose a driver based on their needs, or they write one themselves if their needs are quite specific.

But there is another path, says Luke, that of "driver hell": a world of wickedness and proprietary datasheets, where even when you buy the hardware, you can't program it unless you're buying a hundred thousand units, and even then you are smitten with the cursed non-disclosure agreements. In this inferno, only a vendor is practically empowered to write drivers, but their poor driver developers are only incentivized to get the driver out the door deployed on all nine architectural circles of driver hell. So they include some kind of circle-of-hell abstraction layer, resulting in a hundred thousand lines of code like a tangled frozen beard. We all saw the abyss and repented.

Luke described the process that led to Mellanox releasing the specification for its ConnectX line of cards, something that was warmly appreciated by the entire audience, users and driver developers included. Wonderful stuff.

My Igalia colleague Asumu Takikawa took the last half of the presentation, showing some code for the driver for the Intel i210, i350, and 82599 cards. For more on that, I recommend his recent blog post on user-space driver development. It was truly a ray of sunshine in dark, dark Brussels.

Ole Trøan -- Cisco -- Fast dataplanes with VPP

This talk was a delightful introduction to VPP, but without all of the marketing; the sort of talk that makes FOSDEM worthwhile. Usually at more commercial, vendory events, you can't really get close to the technical people unless you have a vendor relationship: they are surrounded by a phalanx of salesfolk. But in FOSDEM it is clear that we are all comrades out on the open source networking front.

The speaker expressed great personal pleasure at having been able to work on open source software; his relief was palpable. A nice moment.

He also had some kind words about Snabb, too, saying at one point that "of course you can do it on snabb as well -- Snabb and VPP are quite similar in their approach to life". He trolled the horrible complexity diagrams of many "NFV" stacks whose components reflect the org charts that produce them more than the needs of the network functions in question (service chaining anyone?).

He did get to drop some numbers as well, which I found interesting. One is that recently they have been working on carrier-grade NAT, aiming for 6 terabits per second. Those are pretty big boxes and I hope they are getting paid appropriately for that :) For context he said that for a 4-unit server, these days you can build one that does a little less than a terabit per second. I assume that's with ten dual-port 40Gbps cards, and I would guess to power that you'd need around 40 cores or so, split between two sockets.

Finally, he finished with a long example on lightweight 4-over-6. Incidentally this is the same network function my group at Igalia has been building in Snabb over the last couple years, so it was interesting to see the comparison. I enjoyed his commentary that although all of these technologies (carrier-grade NAT, MAP, lightweight 4-over-6) have the ostensible goal of keeping IPv4 running, in reality "we're day by day making IPv4 work worse", mainly by breaking the assumption that just because you get traffic from port P on IP M, you can send traffic to M from another port or another protocol and have it reach the target.

All of these technologies also have problems with IPv4 fragmentation. Getting it right is possible but expensive. Instead, Ole mentions that he and a cross-vendor cabal of dataplane people have a "dark RFC" in the works to deprecate IPv4 fragmentation entirely :)

OK that's it. If I get around to writing up the couple of interesting Java talks I went to (I know right?) I'll let yall know. Happy hacking!

instruction explosion in guile

17 January 2018 10:30 AM (guile | compilers | instruction explosion | loop optimizations | cps)

Greetings, fellow Schemers and compiler nerds: I bring fresh nargery!

instruction explosion

A couple years ago I made a list of compiler tasks for Guile. Most of these are still open, but I've been chipping away at the one labeled "instruction explosion":

Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do this, which would be a lose overall because you have more instruction dispatch.

Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.

Now that I'm getting close to finished I wanted to share some thoughts. Previous progress reports on the mailing list.

a simple loop

As an example, consider this loop that sums the 32-bit floats in a bytevector. I've annotated the code with lines and columns so that you can correspond different pieces to the assembly.

   0       8   12     19
1| (use-modules (rnrs bytevectors))
2| (define (f32v-sum bv)
3|   (let lp ((n 0) (sum 0.0))
4|     (if (< n (bytevector-length bv))
5|         (lp (+ n 4)
6|             (+ sum (bytevector-ieee-single-native-ref bv n)))
7|          sum)))

The assembly for the loop before instruction explosion went like this:

  17    (handle-interrupts)     at (unknown file):5:12
  18    (uadd/immediate 0 1 4)
  19    (bv-f32-ref 1 3 1)      at (unknown file):6:19
  20    (fadd 2 2 1)            at (unknown file):6:12
  21    (s64<? 0 4)             at (unknown file):4:8
  22    (jnl 8)                ;; -> L4
  23    (mov 1 0)               at (unknown file):5:8
  24    (j -7)                 ;; -> L1

So, already Guile's compiler has hoisted the (bytevector-length bv) and unboxed the loop index n and accumulator sum. This work aims to simplify further by exploding bv-f32-ref.

exploding the loop

In practice, instruction explosion happens in CPS conversion, as we are converting the Scheme-like Tree-IL language down to the CPS soup language. When we see a Tree-IL primcall (a call to a known primitive), instead of lowering it to a corresponding CPS primcall, we inline a whole blob of code.

In the concrete case of bv-f32-ref, we'd inline it with something like the following:

(unless (and (heap-object? bv)
             (eq? (heap-type-tag bv) %bytevector-tag))
  (error "not a bytevector" bv))
(define len (word-ref bv 1))
(define ptr (word-ref bv 2))
(unless (and (<= 4 len)
             (<= idx (- len 4)))
  (error "out of range" idx))
(f32-ref ptr idx)

As you can see, there are four branches hidden in the bv-f32-ref: two to check that the object is a bytevector, and two to check that the index is within range. In this explanation we assume that the offset idx is already unboxed, but actually unboxing the index ends up being part of this work as well.

One of the goals of instruction explosion was that by breaking the operation into a number of smaller, more orthogonal parts, native code generation would be easier, because the compiler would only have to know about those small bits. However without an optimizing compiler, it would be better to reify a call out to a specialized bv-f32-ref runtime routine instead of inlining all of this code -- probably whatever language you write your runtime routine in (C, rust, whatever) will do a better job optimizing than your compiler will.
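
To make that alternative concrete, here is a rough sketch in C of what such an out-of-line bv-f32-ref runtime routine could look like; the word-1-is-length, word-2-is-pointer layout follows the pseudocode above, and the tag constant and error helper are made up.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BYTEVECTOR_TAG 0x4d   /* made-up tag value for the sketch */

static void runtime_error(const char *msg) {
  /* A real runtime would raise a Scheme exception; the sketch just aborts. */
  (void)msg;
  abort();
}

/* The whole checked operation as a single callout, instead of exploding the
   checks into the caller's control-flow graph. */
float runtime_bv_f32_ref(uintptr_t *bv, size_t idx) {
  if ((bv[0] & 0x7f) != BYTEVECTOR_TAG)          /* type check */
    runtime_error("not a bytevector");
  size_t len = bv[1];                            /* word 1: length in bytes */
  const uint8_t *ptr = (const uint8_t *) bv[2];  /* word 2: data pointer */
  if (len < 4 || idx > len - 4)                  /* range check */
    runtime_error("out of range");
  float f;
  memcpy(&f, ptr + idx, sizeof f);               /* the actual f32-ref */
  return f;
}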

But with an optimizing compiler, there is the possibility of removing possibly everything but the f32-ref. Guile doesn't quite get there, but almost; here's the post-explosion optimized assembly of the inner loop of f32v-sum:

  27    (handle-interrupts)
  28    (tag-fixnum 1 2)
  29    (s64<? 2 4)             at (unknown file):4:8
  30    (jnl 15)               ;; -> L5
  31    (uadd/immediate 0 2 4)  at (unknown file):5:12
  32    (u64<? 2 7)             at (unknown file):6:19
  33    (jnl 5)                ;; -> L2
  34    (f32-ref 2 5 2)
  35    (fadd 3 3 2)            at (unknown file):6:12
  36    (mov 2 0)               at (unknown file):5:8
  37    (j -10)                ;; -> L1

good things

The first thing to note is that unlike the "before" code, there's no instruction in this loop that can throw an exception. Neat.

Next, note that there's no type check on the bytevector; the peeled iteration preceding the loop already proved that the bytevector is a bytevector.

And indeed there's no reference to the bytevector at all in the loop! The value being dereferenced in (f32-ref 2 5 2) is a raw pointer. (Read this instruction as, "sp[2] = *(float*)((byte*)sp[5] + (uintptr_t)sp[2])".) The compiler does something interesting; the f32-ref CPS primcall actually takes three arguments: the garbage-collected object protecting the pointer, the pointer itself, and the offset. The object itself doesn't appear in the residual code, but including it in the f32-ref primcall's inputs keeps it alive as long as the f32-ref itself is alive.

bad things

Then there are the limitations. Firstly, instruction 28 tags the u64 loop index as a fixnum, but never uses the result. Why is this here? Sadly it's because the value is used in the bailout at L2. Recall this pseudocode:

(unless (and (<= 4 len)
             (<= idx (- len 4)))
  (error "out of range" idx))

Here the error ends up lowering to a throw CPS term that the compiler recognizes as a bailout and renders out-of-line; cool. But it uses idx as an argument, as a tagged SCM value. The compiler untags the loop index, but has to keep a tagged version around for the error cases.

The right fix is probably some kind of allocation sinking pass that sinks the tag-fixnum to the bailouts. Oh well.

Additionally, there are two tests in the loop. Are both necessary? Turns out, yes :( Imagine you have a bytevector of length 1025. The loop continues until the last ref at offset 1024, which is within bounds of the bytevector, but only one byte is available at that point, so we need to throw an exception. The compiler did as good a job as we could expect it to do.

is it worth it? where to now?

On the one hand, instruction explosion is a step sideways. The code is more optimal, but it's more instructions. Because Guile currently has a bytecode VM, that means more total interpreter overhead. Testing on a 40-megabyte bytevector of 32-bit floats, the exploded f32v-sum completes in 115 milliseconds compared to around 97 for the earlier version.

On the other hand, it is very easy to imagine how to compile these instructions to native code, either ahead-of-time or via a simple template JIT. You practically just have to look up the instructions in the corresponding ISA reference, is all. The result should perform quite well.

I will probably take a whack at a simple template JIT first that does no register allocation, then ahead-of-time compilation with register allocation. Getting the AOT-compiled artifacts to dynamically link with runtime routines is a sufficient pain in my mind that I will put it off a bit until later. I also need to figure out a good strategy for truly polymorphic operations like general integer addition; probably involving inline caches.

So that's where we're at :) Thanks for reading, and happy hacking in Guile in 2018!

spectre and the end of langsec

11 January 2018 1:44 PM (compilers | security | spectre | meltdown)

I remember in 2008 seeing Gerald Sussman, creator of the Scheme language, resignedly describing a sea change in the MIT computer science curriculum. In response to a question from the audience, he said:

The work of engineers used to be about taking small parts that they understood entirely and using simple techniques to compose them into larger things that do what they want.

But programming now isn't so much like that. Nowadays you muck around with incomprehensible or nonexistent man pages for software you don't know who wrote. You have to do basic science on your libraries to see how they work, trying out different inputs and seeing how the code reacts. This is a fundamentally different job.

Like many I was profoundly saddened by this analysis. I want to believe in constructive correctness, in math and in proofs. And so with the rise of functional programming, I thought that this historical slide from reason towards observation was just that, historical, and that the "safe" languages had a compelling value that would be evident eventually: that "another world is possible".

In particular I found solace in "langsec", an approach to assessing and ensuring system security in terms of constructively correct programs. One obvious application is parsing of untrusted input, and indeed the website appears to emphasize this domain as one in which a programming languages approach can be fruitful. It is, after all, a truth universally acknowledged, that a program with good use of data types, will be free from many common bugs. So far so good, and so far so successful.

The basis of language security is starting from a programming language with a well-defined, easy-to-understand semantics. From there you can prove (formally or informally) interesting security properties about particular programs. For example, if a program has a secret k, but some untrusted subcomponent C of it should not have access to k, one can prove if k can or cannot leak to C. This approach is taken, for example, by Google's Caja compiler to isolate components from each other, even when they run in the context of the same web page.

But the Spectre and Meltdown attacks have seriously set back this endeavor. One manifestation of the Spectre vulnerability is that code running in a process can now read the entirety of its address space, bypassing invariants of the language in which it is written, even if it is written in a "safe" language. This is currently being used by JavaScript programs to exfiltrate passwords from a browser's password manager, or bitcoin wallets.

Mathematically, in terms of the semantics of e.g. JavaScript, these attacks should not be possible. But practically, they work. Spectre shows us that the building blocks provided to us by Intel, ARM, and all the rest are no longer "small parts understood entirely"; that instead now we have to do "basic science" on our CPUs and memory hierarchies to know what they do.

What's worse, we need to do basic science to come up with adequate mitigations to the Spectre vulnerabilities (side-channel exfiltration of results of speculative execution). Retpolines, poisons and masks, et cetera: none of these are proven to work. They are simply observed to be effective on current hardware. Indeed mitigations are anathema to the correctness-by-construction approach: if you can prove that a problem doesn't exist, what is there to mitigate?

Spectre is not the first crack in the edifice of practical program correctness. In particular, timing side channels are rarely captured in language semantics. But I think it's fair to say that Spectre is the most devastating vulnerability in the langsec approach to security that has ever been uncovered.

Where do we go from here? I see but two options. One is to attempt to make the machines targeted by secure language implementations behave rigorously as architecturally specified, and in no other way. This is the approach taken by all of the deployed mitigations (retpolines, poisoned pointers, masked accesses): modify the compiler and runtime to prevent the CPU from speculating through vulnerable indirect branches (prevent speculative execution), or from using fetched values in further speculative fetches (prevent this particular side channel). I think we are missing a model and a proof that these mitigations restore target architectural semantics, though.

However if we did have a model of what a CPU does, we have another opportunity, which is to incorporate that model in a semantics of the target language of a compiler (e.g. micro-x86 versus x86). It could be that this model produces a co-evolution of the target architectures as well, whereby Intel decides to disclose and expose more of its microarchitecture to user code. Cacheing and other microarchitectural side-effects would then become explicit rather than transparent.

Rich Hickey has this thing where he talks about "simple versus easy". Both of them sound good but for him, only "simple" is good whereas "easy" is bad. It's the sort of subjective distinction that can lead to an endless string of Worse Is Better Is Worse Bourbaki papers, according to the perspective of the author. Anyway transparent caching in the CPU has been marvelously easy for most application developers and fantastically beneficial from a performance perspective. People needing constant-time operations have complained, of course, but that kind of person always complains. Could it be, though, that actually there is some other, better-is-better kind of simplicity that should replace the all-pervasive, now-treacherous transparent cacheing?

I don't know. All I will say is that an ad-hoc approach to determining which branches and loads are safe and which are not is not a plan that inspires confidence. Godspeed to the langsec faithful in these dark times.

a new interview question

5 September 2017 1:09 PM (hiring | diversity | inclusion | interview questions)

I have a new interview question, and you can have it too:

"The industry has a gender balance problem. Why is this?" [ed: see postscript]

This question is designed to see how a potential collaborator is going to fit into a diverse team. Are they going to behave well to their coworkers, or will they be a reason why people leave the group or the company?

You then follow up with a question about how you would go about improving gender balance, what the end result would look like, what's doable in what amount of time, and so on. Other versions of the question could talk about the composition of the industry (or of academia) in terms of race, sexual orientation, gender identity, and so on.

I haven't tried this test yet but am looking forward to it. I am hoping that it can shed some light on someone's ability to empathize with people from another group, and in that sense I think it's probably best to ask about a group that the candidate does not belong to (in as much as it's possible to know). The candidate will also show who they listen to and who they trust -- how do they know what they know? Do they repeat stories told about people of that group or by people of that group?

Some candidates will simply be weak in this area, and won't be able to say very much. The person would require more training, and after an eventual hire it would be expected that this person would say more ignorant things. Unfortunately unlike ignorance about compilers, say, ignorance about the lived experience of women compiler writers, say, can lead to hurtful behavior, even if unintentional. For this reason, ignorance is a negative point about a candidate, though not a no-hire signal as such.

Obviously if you discover a person that thinks that gender imbalance is just the way it is and that nothing can or should be done about it, or that women don't program well, or the like, then that's a great result: clear no-hire. This person is likely to make life unpleasant for their female colleagues, and your company just avoided the problem. High fives, interview team!

Alternately, if you find a candidate who deeply understands the experience of a group that they aren't a part of and who can even identify measures to improve those peoples' experience in your company and industry, then you've found a gem. I think it can easily outweigh a less strong technical background, especially if the person has a history of being able to learn and train on the job (or on the open-source software project, or in the university course, etc).

Successfully pulling off this question looks tricky to me but I am hopeful. If you give it a go, let me know! Likewise if you know of other approaches that work at interview-time, they are very welcome :)

Postscript: The question is a template -- ideally you ask about a group the person is not in. A kind commenter correctly pointed out that the article looks like I would only interview men and that definitely wasn't what I was going for!

the hardest thing about hiring is avoiding the fash

4 September 2017 11:25 PM (hiring | fash | icfp | fascism | antifa)

Most of the literature on hiring is about technical ability, and indeed that's important. But no job is solely technical: there are communication and cooperation aspects as well. The output of a group is not simply the sum of its components.

In that regard I have a thesis: the most important thing about hiring is to avoid the fash. Fascist sympathisers are toxic to the collective organism and you do not want them on your team. The tricky thing is honing your fashdar to suit the purpose.

I am currently in Oxford for ICFP, one of the most prestigious conferences of my field (programming languages). I came to be refreshed with new ideas and to maintain contacts, for collaboration and also for future hires. This evening I was out with a group of folks, and the topic of conversation for much of the evening was the abstract value of free speech versus anti-fascist activity.

Unfortunately, one participant was stuck on the free speech side of the debate. All day the conference had been doing logical programs and term equivalence, but somehow this person didn't deduce the result "free speech fundamentalism == fash", at least within the current evaluation context.

In the US we are raised to see speech as an axiom. As such it must be taken on faith, and such faith is only present in people not physically threatened by the axiom. For example, as an immigrant to France, speech against immigrants in France is not simply an abstract topic to me. It means that my existence does not have the same value as someone that was born there. More broadly it also applies to second-generation immigrants, or to people of color, as whiteness is also part of what it means to be a native French person. Treating the speech of all equally is probably a good ideal among speakers of equal power, but in the current context, enshrining "free speech" as a Trump card of social value is fash. Saying "Latino people are lazy" is obviously more threatening to Latinos than "white people are lazy" is to whites (ed: not that these sets are disjoint!); but under the "free speech" evaluator, these terms are equivalent. "Free speech" as an abstract interpreter of value is too abstract.

I thought with this person we had a good discussion trying to get them closer to a justice-affirming perspective, but in the end they turned out to also believe that women were predisposed to not be good computer scientists -- "genetic differences", they said. Never mind that CS was women's work until it became high-status, and this after the whole discussion of the effects of speech; it does not take an abstract interpretation specialist to predict the results of this perspective on the future gender composition of the industry, not to mention the concrete evaluation of any given woman.

I will admit that I was surprised at the extent to which men's-rights-activist and racist talking points had made it into their discourse, and the way the individualist edifice of their world-view enabled their pre-fascist ideas. I use the term advisedly; the effects of advocating these views are prior to fascism.

My conclusion from the interaction is that, now more than ever, when it comes to future collaborators, in any context, it is important for anyone for whom justice matters to probe candidates deeply for fascist tendencies. If they show a sign, pass. Anti-immigrant, anti-woman, anti-queer, and anti-black behaviors and actions often correlate: if you see one sign, often there are more underneath. Even if the person claims to value egalitarianism, the perspectives that guide their actions may not favor justice in your organization. In retrospect I am glad that I had this interaction, simply for the purpose of knowing what to filter out the next time I am in a hiring decision.

a new concurrent ml

29 June 2017 2:37 PM (guile | scheme | gnu | igalia | compilers | delimited continuations | prompts | fibers | concurrency | concurrent ml | curry on)

Good morning all!

In my last article I talked about how we composed a lightweight "fibers" facility in Guile out of lower-level primitives. What we implemented there is enough to be useful, but it is missing an important aspect of concurrency: communication. Sure, being able to spawn off fibers is nice, but you have to be able to actually talk to them.

Fibers had just gotten to the state described above about a year ago as I caught a train from Geneva to Rome for Curry On 2016. Train rides are magnificent for organizing thoughts, and I was in dire need of clarity. I had tentatively settled on Go-style channels by the time I got there, but when I saw that Matthias Felleisen and Matthew Flatt were there, I had to take advantage of the opportunity to ask them what they thought. Would they recommend Racket-like threads and channels? Had that been a good experience?

The answer that I got in return was a "yes, that's what you should do", but also a "you should look at Concurrent ML". Concurrent ML? What's that? I looked and was a bit skeptical. It seemed old and hoary and maybe channels were just as expressive. I looked more deeply into this issue and it seemed CML is a bit more expressive than just channels but damn, it looked complicated to implement.

I was wrong. This article shows that what you need to do to implement multi-core CML is actually the same as what you need to do to implement channels in a multi-core environment. By building CML first and channels and whatever later, you get more power for the same amount of work.

Note that this article has an associated talk! If video is your thing, see my Curry On 2017 talk here:

Or, watch on the youtube if the inline video above doesn't work; slides here as well.

on channels

Let's first have a crack at implementing channels. Before we begin, we should be a bit more explicit about what a channel is. My first hack in this area did the wrong thing: I was used to asynchronous queues, and I thought that's what a channel was. Besides ignorance, apparently that's what Erlang does; a process's inbox is an unbounded queue of messages with only very slight back-pressure.

But an asynchronous queue is not a channel, at least in its classic sense. As they were originally formulated in "Communicating Sequential Processes" by Tony Hoare, adopted into David May's occam, and from there into many other languages, channels are meeting-places. Processes meet at a channel to exchange values; whichever party arrives first has to wait for the other party to show up. The message that is handed off in a channel send/receive operation is never "owned" by the channel; it is either owned by a sender who is waiting at the meeting point for a receiver, or it's accepted by a receiver. After the transaction is complete, both parties continue on.

You'd think this is a fine detail, but meeting-place channels are strictly more expressive than buffered channels. I was actually called out for this because my first implementation of channels for Fibers had effectively a minimum buffer size of 1. In Go, whose channels are unbuffered by default, you can use a channel for RPC:

package main

import "fmt"

func double(ch chan int) {
  for { ch <- (<-ch * 2) }
}

func main() {
  ch := make(chan int)
  go double(ch)
  ch <- 2
  x := <-ch
  fmt.Println(x)
}

Here you see that the main function sent a value on ch, then immediately read a response from the same channel. If the channel were buffered, then we'd probably read the value we sent instead of the doubled value supplied by the double goroutine. I say "probably" because it's not deterministic! Likewise the double routine could read its responses as its inputs.

Anyway, the channels we are looking to build are meeting-place channels. If you are interested in the broader design questions, you might enjoy the incomplete history of language facilities for concurrency article I wrote late last year.

With that prelude out of the way, here's a first draft at the implementation of the "receive" operation on a channel.

(define (recv ch)
  (match ch
    (($ $channel recvq sendq)
     (match (try-dequeue! sendq)
       (#(value resume-sender)
        ;; Optimistic case: a sender was already waiting.
        (resume-sender)
        value)
       (#f
        ;; Pessimistic case: wait for a sender to arrive.
        (suspend
         (lambda (k)
           (define (resume val)
             (schedule (lambda () (k val))))
           (enqueue! recvq resume))))))))

;; Note: this code has a race!  Fixed later.

A channel is a record with two fields, its recvq and sendq. The receive queue (recvq) holds a FIFO queue of continuations that are waiting to receive values, and the send queue holds continuations that are waiting to send values, along with the value that they are sending. Both the recvq and the sendq are lockless queues.

To receive a value from a meeting-place channel, there are two possibilities: either there's a sender already there and waiting, or we have to wait for a sender. Those two cases are handled above, in that order. We use the suspend primitive from the last article to arrange for the fiber to wait; presumably the sender will resume us when they arrive later at the meeting-point.

an aside on lockless data structures

We'll go more deeply into the channel receive mechanics later, but first, a general question: what's the right way to implement a data structure that can be accessed and modified concurrently without locks? Though I am full of hubris, I don't have enough to answer this question definitively. I know many ways, but none that's optimal in all ways.

For what I needed in Fibers, I chose to err on the side of simplicity.

Some data in Fibers is never modified; this immutable data is safe to access concurrently from any code. This is the best, obviously :)

Some mutable data is only ever mutated from an "owner" core; it's safe to read without a lock from that owner core, and in Fibers we do not access this data from other cores. An example of this kind of data structure is the i/o map from file descriptors to continuations; it's core-local. I say "core-local" because in fibers we typically run one scheduler per core, with each core having a pinned POSIX thread; it's really thread-local but I don't want to use the word "thread" too much here as it's confusing.

Some mutable data needs to be read and written from many cores. An example of this is the recvq of a channel; many receivers and senders can be wanting to read and write there at once. The approach we take in Fibers is just to use immutable data stored inside an "atomic box". An atomic box holds a single value, and exposes operations to read, write, swap, and compare-and-swap (CAS) the value. To read a value, just fetch it from the box; you then have immutable data that you can analyze without locks. Having read a value, you can to compute a new state and use CAS on the atomic box to publish that change. If the CAS succeeds, then great; otherwise the state changed in the meantime, so you typically want to loop and try again.
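
To make that concrete, here's the kind of loop I mean, written against Guile's (ice-9 atomic) atomic boxes; the enqueue! below is just a stand-in for illustration, not the queue structure Fibers actually uses.

(use-modules (ice-9 atomic))

;; Push an item onto a list held in an atomic box, retrying on conflict.
(define (enqueue! box item)
  (let retry ()
    (let* ((old (atomic-box-ref box))    ; read the current immutable value
           (new (cons item old)))        ; compute a fresh replacement
      ;; CAS returns the observed value; eq? to old means we won the race.
      (unless (eq? old (atomic-box-compare-and-swap! box old new))
        (retry)))))

(define q (make-atomic-box '()))
(enqueue! q 'hello)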

Single-word CAS suffices for Guile when every value stored into an atomic box will be unique, a property that freshly-allocated objects have and of which GC ensures us an endless supply. Note that for this to work, the values can share structure internally but the outer reference has to be freshly allocated.

The combination of freshly-allocated data structures and atomic variables is a joy to use: no hassles about multi-word compare-and-swap or the ABA problem. Assuming your GC can keep up (Guile's does about 700 MB/s), it can be an effective strategy, and is certainly less error-prone than others.

back at the channel recv ranch

Now, the theme here is "growing a language": taking primitives and using them to compose more expressive abstractions. In that regard, sure, channel send and receive are nice, but what about select, which allows us to wait on any channel in a set of channels? How do we take what we have and build non-determinism on top?

I think we should begin by noting that select in Go for example isn't just about receiving messages. You can select on the first channel that can send, or between send and receive operations.

select {
case c <- x:
  x, y = y, x+y
case <-quit:
  return
}

As you can see, Go provides special syntax for select. Although in Guile we can of course provide macros, usually those macros expand out to a procedure call; the macro is sugar for a function. So we want select as a function. But because we need to be able to select over receiving and sending at the same time, the function needs to take some kind of annotation on what we are going to do with the channels:

(select (recv A) (send B v))

So what we do is to introduce the concept of an operation, which is simply data describing some event which may occur in the future. The arguments to select are now operations.

(select (recv-op A) (send-op B v))

Here recv-op is obviously a constructor for the channel-receive operation, and likewise for send-op. And actually, given that we've just made an abstraction over sending or receiving on a channel, we might as well make an abstraction over choosing the first available op among a set of operations. The implementation of select now creates such a choice-op, then performs it.

(define (select . ops)
  (perform (apply choice-op ops)))

But what we're missing here is the ability to know which operation actually happened. In Go, select's special syntax associates a clause of code with each sub-operation. In Scheme a clause of code is just a function, and so what we want to do is to be able to annotate an operation with a function that will get run if the operation succeeds.

So we define a (wrap-op op k), which makes an operation that itself annotates op, associating it with k. If op occurs, its result values will be passed to k. For example, if we make a fiber that tries to perform this operation:

(perform
 (wrap-op
  (recv-op A)
  (lambda (v)
    (string-append "hello, " v))))

If we send the string "world" on the channel A, then the result of this perform invocation will be "hello, world". Providing "wrapped" operations to select allows us to handle the various cases in separate, appropriate ways.
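
As a hypothetical usage, assuming two channels A and B are bound, selecting over the two with a per-case handler on each would look like this:

(select
 (wrap-op (recv-op A)
          (lambda (v) (list 'from-A v)))
 (wrap-op (recv-op B)
          (lambda (v) (list 'from-B v))))

Whichever of the two receives first determines which wrapper runs; the other operation simply never commits.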

we just made concurrent ml

Hey, we just reinvented Concurrent ML! In his PLDI 1988 paper "Synchronous operations as first-class values", John Reppy proposes just this abstraction. I like to compare it to the relationship between an expression (exp) and wrapping that expression in a lambda ((lambda () exp)); evaluating an expression gives its value, and the expression just goes away, whereas evaluating a lambda gives a procedure that you can call in the future to evaluate the expression. You can call the lambda many times, or no times. In the same way, a channel-receive operation is an abstraction over receiving a value from a channel. You can perform that operation many times, once, or not at all.

Reppy consolidated this work in his PLDI 1991 paper, "CML: A higher-order concurrent language". Note that he uses the term "event" instead of "operation". To me the name "event" to apply to this abstraction never felt quite right; I guess I wrote too much code in the past against event loops. I see "events" as single instances in time and not an abstraction over the possibility of a, well, of an event. Indeed I wish I could refer to an instantiation of an operation as an event, but better not to muddy the waters. Likewise Reppy uses "synchronize" where I use "perform". As you like, really, it's still Concurrent ML; I just prefer to explain to my users using terms that make sense to me.

what's an op?

Let's return to that channel recv implementation. It had basically two parts: an optimistic part, where the operation could complete immediately, and a pessimistic part, where we had to wait for the other party to arrive. However, there was a race condition, as I noted in the comment. If a sender and a receiver concurrently arrive at a channel, it could be that they concurrently do the optimistic check, don't notice that the other is there, then they both suspend, waiting for each other to arrive: deadlock. To fix this for recv, we have to recheck the sendq after publishing our presence to the recvq.

I'll get to the details in a bit for channels, but it turns out that this is a general pattern. All kinds of ops have optimistic and pessimistic behavior.

(define (perform op)
  (match op
    (($ $op try block wrap)
     (define (do-op)
       ;; Return a thunk that has result values.
       (or (try) (pessimistic block)))
     ;; Return values, passed through wrap function.
     (call-with-values (do-op) wrap))))

In the optimistic phase, the calling fiber will try to commit the operation directly. If that succeeds, then the calling fiber resumes any other fibers that are part of the transaction, and the calling fiber continues. In the pessimistic phase, we park the calling fiber, publish the fact that we're ready and waiting for the operation, then to resolve the race condition we have to try again to complete the operation. In either case we pass the result(s) through the wrap function.

Given that the pessimistic phase has to include a re-check for operation completability, the optimistic phase is purely an optimization. It's a good optimization that everyone will want to implement, but it's not strictly necessary. It's OK for a try function to always return #f.

As shown in the above function, an operation is a plain old data structure with three fields: a try, a block, and a wrap function. The optimistic behavior is implemented by the try function; the pessimistic side is partly implemented by perform, which handles the fiber suspension part, and by the operation's block function. The wrap function implements the wrap-op behavior described above, and is applied to the result(s) of a successful operation.
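
To make the data structure concrete, here is one way the op record and the wrap-op combinator could look; this is a sketch consistent with the code in this article, not necessarily the exact definitions in Fibers.

(use-modules (srfi srfi-9) (ice-9 match))

(define-record-type $op
  (make-op try block wrap)
  op?
  (try op-try)      ; () -> thunk-of-results, or #f      (optimistic phase)
  (block op-block)  ; (resume opstate) -> unspecified    (pessimistic phase)
  (wrap op-wrap))   ; results -> results

(define (wrap-op op k)
  ;; Same op, but additionally pass its results through k.
  (match op
    (($ $op try block wrap)
     (make-op try block (compose k wrap)))))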

Now, it was about at this point that I was thinking "jeebs, this CML thing is complicated". I was both wrong and right -- there's some complication inherent in multicore lockless communication, yes, but I believe CML captures something close to the minimum, and certainly it's just as much work as with a direct implementation of channels. In that spirit, I continue on with the implementation of channel operations in Fibers.

channel receive operation

Here's an implementation of a try function for a channel.

(define (try-recv ch)
  (match ch
    (($ $channel recvq sendq)
     (let ((q (atomic-ref sendq)))
       (match q
         (() #f)
         ((head . tail)
          (match head
            (#(val resume-sender state)
             (match (CAS! state 'W 'S)
               ('W                        ; We completed the operation!
                (resume-sender (lambda () (values)))
                (CAS! sendq q tail)       ; *
                (lambda () val))
               (_ #f))))))))))

In Fibers, a try function either succeeds and returns a thunk, or fails and returns #f. For channel receive, we only succeed if there is a sender already in the queue: the sender has arrived, suspended itself, and published its availability. The state variable is an atomic box that holds the operation state, which starts as W and when complete is S. More on that in a minute. If the CAS! compare-and-swap operation managed to change the state from W to S, then the optimistic phase succeeded -- yay! We resume the sender with no values, take the value that the sender gave us, and keep on trucking, returning that value wrapped in a thunk.

Additionally the sender's entry on the sendq is now stale, as the operation is already complete; we try to pop it off the queue at the line indicated with *, but that could fail due to concurrent queue modification. In that case, no biggie, someone else will collect our garbage for us.

The pessimistic case is a bit more involved. It's the last bit of code though; almost done here! I express the pessimistic phase as a function of the operation's block function.

(define (pessimistic block)
  ;; For consistency with the optimistic phase, the result of the
  ;; pessimistic phase is a thunk that "perform" will apply.
  (lambda ()
    ;; 1. Suspend the thread.  Expect to be resumed
    ;; with a thunk, which we arrange to invoke directly.
    ((suspend
      (lambda (k)
        (define (resume values-thunk)
          (schedule (lambda () (k values-thunk))))
        ;; 2. Make a fresh opstate.
        (define state (make-atomic-box 'W))
        ;; 3. Call op's block function.
        (block resume state))))))

So what about that state variable? Well basically, once we publish the fact that we're ready to perform an operation, fibers from other cores might concurrently try to complete our operation. We need for this perform invocation to complete at most once! So we introduce a state variable, the "opstate", held in an atomic box. It has three states:

  • W: "Waiting"; initial state

  • C: "Claimed"; temporary state

  • S: "Synched"; final state

There are four possible state transitions, of two kinds. Firstly there are the "local" transitions W->C, C->W, and C->S. These transitions may only ever occur as part of the "retry" phase of a block function; notably, no remote fiber will cause these transitions on "our" state variable. Remote fibers can only make the W->S transition, committing an operation. The W->S transition can also be made locally of course.

Every time an operation is instantiated via the perform function, we make a new opstate. Operations themselves don't hold any state; only their instantiations do.
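
In the notation of the code in this article (CAS! returns the previous value, atomic-set! stores unconditionally), the allowed transitions look like this; the helper names are mine, just to illustrate:

(define (claim! opstate)          ; W -> C: local, reversible
  (eq? 'W (CAS! opstate 'W 'C)))

(define (roll-back! opstate)      ; C -> W: local, after a conflict
  (atomic-set! opstate 'W))

(define (commit-claimed! opstate) ; C -> S: local, operation complete
  (atomic-set! opstate 'S))

(define (commit! opstate)         ; W -> S: possibly made by a remote fiber
  (eq? 'W (CAS! opstate 'W 'S)))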

The need for the C state wasn't initially obvious to me, but after seeing the recv-op block function below, it will be clear to you I hope.

block functions

The block function itself has two jobs to do. Recall that it's called after the calling fiber was suspended, and is passed two arguments: a procedure that can be called to resume the fiber with some number of values, and the fresh opstate for this instantiation. The block function has two jobs: it needs to publish the resume function and the opstate to the channel's recvq, and then it needs to try again to receive. That's the "retry" phase I was mentioning before.

Retrying the recv can have three possible results:

  1. If the retry succeeds, we resume the sender. We also have to resume the calling fiber, as it has been suspended already. In general, whatever code manages to commit an operation has to resume any fibers that were waiting on it to complete.

  2. If the operation was already in the S state, that means some other party concurrently completed our operation on our behalf. In that case there's nothing to do; the other party resumed us already.

  3. Otherwise if the operation couldn't proceed, then when the other party or parties arrive, they will be responsible for completing the operation and ultimately resuming our fiber in the future.

With that long prelude out of the way, here's the gnarlies!

(define (block-recv ch resume-recv recv-state)
  (match ch
    (($ $channel recvq sendq)
     ;; Publish -- now others can resume us!
     (enqueue! recvq (vector resume-recv recv-state))
     ;; Try again to receive.
     (let retry ()
       (let ((q (atomic-ref sendq)))
         (match q
           ((head . tail)
            (match head
              (#(val resume-send send-state)
               (match (CAS! recv-state 'W 'C)     ; Claim txn.
                 ('W
                  (match (CAS! send-state 'W 'S)
                    ('W                           ; Case (1): yay!
                     (atomic-set! recv-state 'S)
                     (CAS! sendq q tail)          ; Maybe GC.
                     (resume-send (lambda () (values)))
                     (resume-recv (lambda () val)))
                    ('C                           ; Conflict; retry.
                     (atomic-set! recv-state 'W)
                     (retry))
                    ('S                           ; GC and retry.
                     (atomic-set! recv-state 'W)
                     (CAS! sendq q tail)
                     (retry))))
                 ('S #f)))))                      ; Case (2): cool!
           (() #f)))))))                          ; Case (3): we wait.

As we said, first we publish, then we retry. If there is a sender on the queue, we will try to complete their operation, but before we do that we have to prevent other fibers from completing ours; that's the purpose of going into the C state. If we manage to commit the sender's operation, then we commit ours too, going from C to S; otherwise we roll back to W. If the sender itself was in C then we had a conflict, and we spin to retry. We also try to GC off any completed operations from the sendq via unchecked CAS. If there's no sender on the queue, we just wait.

And that's it for the code! Thank you for suffering through this all. I only left off a few details: the try function can loop if the sender is in the C state, and the block function needs to prevent a (choice-op (send-op A v) (recv-op A)) from sending v to itself. But because opstates are fresh allocations, we can know if a sender is actually ourself by comparing its opstate to ours (with eq?).

what about select?

I got into all this "op" business because I needed to annotate the arguments to select. Did I actually get anywhere? Good news, everyone: it turns out that select doesn't have to be a primitive!

Firstly, note that the choice-op try function just needs to run the try functions of its sub-operations (possibly in random order), returning early if one succeeds. Pretty straightforward. And actually the story with the block function is the same: we just run the sub-operation block functions, knowing that the operation will commit at most one time. The only complication is plumbing through the respective wrap functions to all of the sub-operations, but of course that's the point of the wrap facility, so we pay the cost willingly.

(define (choice-op . ops)
  (define (try)
    (or-map
     (match-lambda
       (($ $op sub-try sub-block sub-wrap)
        (define thunk (sub-try))
        (and thunk (compose sub-wrap thunk))))
     ops))
  (define (block resume opstate)
    (for-each
     (match-lambda
       (($ $op sub-try sub-block sub-wrap)
        (define (wrapped-resume results-thunk)
          (resume (compose sub-wrap results-thunk)))
        (sub-block wrapped-resume opstate)))
     ops))
  (define wrap values)
  (make-op try block wrap))

There are optimizations possible, for example randomizing the order in which the sub-operations are visited for less deterministic behavior, but this is really all there is.

concurrent ml is inevitable

As far as I understand things, the protocol to implement CML-style operations on channels in a lock-free environment is exactly the same as what's needed if you write out the recv function by hand, without abstracting it to a recv-op.

You still need the ability to park a fiber in the block function, and you still need to retry the operation after parking. Although try is just an optimization, it's an optimization that you'll want.

So given that the cost of parallel CML is necessary, you might as well get what you pay for and have your language expose the more expressive CML interface in addition to the more "standard" channel operations.

concurrent ml between pthreads and fibers

One really cool aspect about implementing CML is that the bit that suspends the current thread is isolated in the perform function. Of course if you're in a fiber, you suspend the current fiber as we have described above. But what if you're not? What if you want to use CML to communicate between POSIX threads? You can do that, just create a mutex/cond pair and pass a procedure that will signal the cond as the resume argument to the block function. It just works! The channels implementation doesn't need to know anything about pthreads, or even fibers for that matter.
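
Here's a rough sketch of the idea, reusing the op representation and fresh-opstate convention from above; this isn't the Fibers API, just an illustration of parking a POSIX thread on a mutex/cond pair instead of suspending a fiber.

(use-modules (ice-9 match) (ice-9 threads) (ice-9 atomic))

(define (perform/pthread op)
  (match op
    (($ $op try block wrap)
     (define mutex (make-mutex))
     (define condvar (make-condition-variable))
     (define result-thunk #f)
     (define (resume values-thunk)
       ;; Called by whichever party commits the operation -- maybe a fiber.
       (with-mutex mutex
         (set! result-thunk values-thunk)
         (signal-condition-variable condvar)))
     (match (try)
       (#f
        ;; Pessimistic phase: publish ourselves, then sleep until resumed.
        (block resume (make-atomic-box 'W))
        (with-mutex mutex
          (let wait ()
            (unless result-thunk
              (wait-condition-variable condvar mutex)
              (wait))))
        (call-with-values result-thunk wrap))
       (thunk
        ;; Optimistic phase succeeded; no need to block at all.
        (call-with-values thunk wrap))))))

The resume procedure is the only part that knows how the caller parks itself; everything else is the same protocol that the fibers use.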

In fact, you can actually use CML operations to communicate between fibers and full pthreads. This can be really useful if you need to run some truly blocking operation in a side pthread, but you want most of your program to be in fibers.

a meta-note for a meta-language

This implementation was based on the Parallel CML paper from Reppy et al, describing the protocol implemented in Manticore. Since then there's been a lot of development there; you should check out Manticore! I also hear that Reppy has a new version of his "Concurrent Programming in ML" book coming out soon (not sure though).

This work is in Fibers, a concurrency facility for Guile Scheme, built as a library. Check out the manual for full details. Relative to the Parallel CML paper, this work has a couple differences beyond the superficial operation/perform event/sync name change.

Most significantly, Reppy's CML operations have three phases: poll, do, and block. Fibers uses just two, as in a concurrent context it doesn't make sense to check-then-do. There is no do, only try :)

Additionally the Fibers channel implementation is lockless, with an atomic sendq and recvq. In contrast, Manticore uses a spinlock and hence needs to mask/unmask interrupts at times.

On the other hand, the Parallel CML paper included some model checking work, which Fibers doesn't have. It would be nice to have some more confidence on correctness!

but what about perf

Performance! Does it scale? Let's poke it. Here I'm going to try to isolate my tests to measure the overhead of communication of channels as implemented in terms of Parallel CML ops. I have more real benchmarks for Fibers on a web server workload where it does well, but here I am really trying to focus on CML.

My test system is a 2 x E5-2620v3, which is two sockets each having 6 2.6GHz cores, hyperthreads off, performance governor on all cores. This is a system we use for Snabb testing, so the first core on each socket handles interrupts and all others are reserved; Linux won't schedule anything on them. When you run a fibers system, it will spawn a thread per available core, then set the thread's affinity to that core. In these tests, I'll give benchmarks progressively more cores and see how they do with the workload.

So this is a benchmark measuring total message sends per second on a chain of fibers communicating over channels. For 0 links, that means that there's just a sender and a receiver and no intermediate links. For 10 links, each message is relayed 10 times, for 11 total sends in the chain and 12 total fibers. For 0 links we expect pretty much no parallel speedup, and no slowdown, and that's what we see; but when we get to more links, we should expect more throughput. The fibers are allocated to cores at random (a randomized round-robin initial scheduling, then after that fibers have core affinity; though there is a limited work-stealing phase).

You would think that the 1-core case would be the same for all of them. Unfortunately it seems that currently there is a fixed cost for bouncing through epoll to pick up new I/O tasks, even though there are no I/O runnables in this test and the timeout is 0, so it will return immediately. It's definitely something to look into as it's a cost that all cores are paying.

Initially I expected a linear speedup but that's not what we're seeing. But then I thought about it and revised my expectations :) As we add more cores, we add more communication; we should see sublinear speedups as we have to do more cross-core wakeups and synchronizations. After all, we aren't measuring a nice parallelizable computational workload: we're measuring overhead.

On the other hand, the diminishing returns effect is pretty bad, and then we hit the NUMA cliff: as we cross from 6 to 7 cores, we start talking to the other CPU socket and everything goes to shit.

But here it's hard to isolate the test from three external factors, whose impact I don't understand: firstly, that Fibers itself has a significant wakeup cost for remote schedulers. I haven't measured contention on scheduler inboxes, but I suspect one issue is that when a remote scheduler has decided it has no runnables, it will sleep in epoll; and to wake it up we need to write on a socketpair. Guile can avoid that when there are lots of runnables and we see the remote scheduler isn't sleeping, but it's not perfect.

Secondly, Guile is a bytecode VM. I measured that Guile retires about 0.4 billion instructions per second per core on the test machine, whereas a 4 IPC native program will retire about 10 billion. There's overhead at various points, some of which will go away with native compilation in Guile but some might not for a while, given that Go (for example) has baked-in support for channels. So to what extent is it the protocol and to what extent the implementation overhead? I don't know.

Finally, and perhaps most importantly, we can't isolate this test from the garbage collector. Guile still uses the Boehm GC, which is just OK I think. It does have a nice parallel mark phase, but it uses POSIX signals to pause program threads instead of having those threads reach safepoints; and it's completely NUMA-unaware.

So, with all of those caveats mentioned, let's see a couple more graphs :) Firstly, similar to the previous one, here's total message send rate for N pairs of fibers that ping-pong their message back and forth. Core allocation was randomized round-robin.

My conclusion here is that when more fibers are runnable per scheduler turn, the overhead of the epoll phase is less.

Here's a test where there's one fiber producer, and N fibers competing to consume the messages sent. Ultimately we expect that the rate will be limited on the producer side, but there's still a nice speedup.

Next is a pretty weak-sauce benchmark where we're computing diagonal lengths on an N-dimensional cube; the squares of the dimensions happen in parallel fibers, then one fiber collects those lengths, sums and makes a square root.

The workload on that one is just very low, and the serial components become a bottleneck quickly. I think I need to rework that test case.

Finally, there's a false sieve of Eratosthenes, in which every time we find a prime, we add another fiber onto the sieve chain that filters out multiples of that prime.

Even though the workload is really small, we still see speedups, which is somewhat satisfying. Still, on all of these, the NUMA cliff is something fierce.

For me what these benchmarks show is that there are still some bottlenecks to work on. We do OK in the handful-of-cores scenario, but the system as a whole doesn't really scale past that. On more real benchmarks with bigger workloads and proportionally much less communication, I get much more satisfactory results; but those tend to be I/O heavy anyway, so the bottleneck is elsewhere.

closing notes

There are other parts to CML events, namely guard functions and withNack functions. My understanding is that these are implementable in terms of this "primitive" CML as described here; that was a result of earlier work by Matthew Fluet. I haven't actually implemented these yet! A to-do item, truly.

There are other event types in CML systems of course! Besides being able to implement operations yourself, there are built-in condition variables (cvars), timeouts, thread join events, and so on. The Fibers manual mentions some of these, but it's an open set.

Finally and perhaps most significantly, Aaron Turon did some work a few years ago on "Reagents", a pattern library for composing parallel and concurrent operations, initially in Scala. It's claimed that Reagents generalizes CML. Is this the case? I am looking forward to finding out.

OK, that's it for this verrrrry long post :) I hope that you found that this made parallel CML seem a bit more approachable and interesting, whether as a language implementor, a library implementor, or a user. Comments and corrections welcome. Check out Fibers and give it a go!

growing fibers

27 June 2017 10:17 AM (guile | scheme | gnu | igalia | compilers | delimited continuations | prompts | fibers | concurrency)

Good day, Schemers!

Over the last 12 to 18 months, as we were preparing for the Guile 2.2 release, I was growing increasingly dissatisfied at not having a good concurrency story in Guile.

I wanted to be able to spawn a million threads on a core, to support highly-concurrent I/O servers, and Guile's POSIX threads are just not the answer. I needed something different, and this article is about the search for and the implementation of that thing.

on pthreads

It's worth being specific about why POSIX threads are not a great abstraction. One reason is that they don't compose: two pieces of code that use mutexes won't necessarily compose together. A correct component A that takes locks might call a correct component B that takes locks, and the other way around, and if both happen concurrently you get the classic deadly-embrace deadlock.

POSIX threads are also terribly low-level. Asking someone to build a system with mutexes and cond vars is like building a house with exploding toothpicks.

I want to program network services in a straightforward way, and POSIX threads don't help me here either. I'd like to spawn a million "threads" (scare-quotes!), one for each client, each one just looping: reading a request, computing and writing the response, and so on. POSIX threads aren't the concrete implementation of this abstraction though, as in most systems you can't have more than a few thousand of them.

Finally as a Guile maintainer I have a duty to tell people the good ways to make their programs, but I can't in good conscience recommend POSIX threads to anyone. If someone is a responsible programmer, then yes we can discuss details of POSIX threads. But for a new Schemer? Never. Recommending POSIX threads is malpractice.

on scheme

In Scheme we claim to be minimalists. Whether we actually are that or not is another story, but it's true that we have a culture of trying to grow expressive systems from minimal primitives.

It's sometimes claimed that in Scheme, we don't need threads because we have call-with-current-continuation, an ultrapowerful primitive that lets us implement any kind of control structure we want. (The name screams for an abbreviation, so the alias call/cc is blessed; minimalism is whatever we say it is, right?) Unfortunately it turned out that while call/cc can implement any control abstraction, it can't implement any two. Abstractions built on call/cc don't compose!

Fortunately, there is a way to build powerful control abstractions that do compose. This article covers the first half of composing a concurrency facility out of a set of more basic primitives.

Just to be concrete, I have to start with a simple implementation of an event loop. We're going to build on it later, but for now, here we go:

(define (run sched)
  (match sched
    (($ $sched inbox i/o)
     (define (dequeue-tasks)
       (append (dequeue-all! inbox)
               (poll-for-tasks i/o)))
     (let lp ()
       (for-each (lambda (task) (task))
                 (dequeue-tasks))
       (lp)))))
This is a scheduler that is a record with two fields, inbox and i/o.

The inbox holds a queue of pending tasks, as thunks (procedure of no arguments). When something wants to enqueue a task, it posts a thunk to the inbox.

On the other hand, when a task needs to wait on some external input or output becoming available, it will register an event with i/o. Typically i/o will be a simple combination of an epollfd and a mapping of tasks to enqueue when a file descriptor becomes readable or writable. poll-for-tasks does the underlying epoll_wait call that pulls new I/O events from the kernel.

There are some details I'm leaving out, like when to have epoll_wait return directly, and when to have it wait for some time, and how to wake it up if it's sleeping while a task is posted to the scheduler's inbox, but ultimately this is the core of an event loop.
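
If it helps to see the shape of the data, here's one way to flesh out the record and queue helpers used above -- a single-threaded sketch with a mutable queue from (ice-9 q), and with the i/o field left abstract; it's not what Fibers really does, just an illustration.

(use-modules (srfi srfi-9) (ice-9 q))

(define-record-type $sched
  (make-sched inbox i/o)
  sched?
  (inbox sched-inbox)   ; queue of pending thunks
  (i/o sched-i/o))      ; epoll state: fds mapped to tasks to wake up

(define (enqueue! inbox task)
  (enq! inbox task))

(define (dequeue-all! inbox)
  ;; Drain the inbox, oldest task first.
  (let lp ((tasks '()))
    (if (q-empty? inbox)
        (reverse tasks)
        (lp (cons (deq! inbox) tasks)))))

(make-sched (make-q) ...) would build one; a multi-core scheduler replaces the plain queue with something safe to poke from other cores.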

a long digression

Now you might think that I'm getting a little far afield from what my goal was, which was threads or fibers or something. But that's OK, let's go a little farther and talk about "prompts". The term "prompt" comes from the experience you get when you work on the command-line:

/home/wingo% ./prog

I don't know about you all, but I have the feeling that the /home/wingo% has a kind of solid reality, that my screen is not just an array of characters but there is a left-hand-side that belongs to the system, and a right-hand-side that's mine. The two parts are delimited by a prompt. Well prompts in Scheme allow you to provide this abstraction within your program: you can establish a program part that's a "system" facility, for whatever definition of "system" suits your purposes, and a part that's for the "user".

In a way, prompts generalize a pattern of system/user division that has special facilities in other programming languages, such as a try/catch block.

try {
  // ... "user" code goes here ...
} catch (e) {
  // ... error handling provided by the "system" ...
}

Here the body of the try is the "user" code and the catch machinery belongs to the "system". Some other examples of control flow patterns that prompts generalize would be early exit of a subcomputation, coroutines, and nondeterministic choice like SICP's amb operator. Coroutines are obviously where I'm headed here in the context of this article, but still there are some details to go over.

To make a prompt in Guile, you can use the % operator, which is pronounced "prompt":

(use-modules (ice-9 control))

(% expr
   (lambda (k . args) #f))

The name for this operator comes from Dorai Sitaram's 1993 paper, Handling Control; it's actually a pun on the tcsh prompt, if you must know. Anyway the basic idea in this example is that we run expr, but if it aborts we run the lambda handler instead, which just returns #f.

Really % is just syntactic sugar for call-with-prompt though. The previous example desugars to something like this:

(let ((tag (make-prompt-tag)))
  (call-with-prompt tag
    ;; Body:
    (lambda () expr)
    ;; Escape handler:
    (lambda (k . args) #f)))

(It's not quite the same; % uses a "default prompt tag". This is just a detail though.)

You see here that call-with-prompt is really the primitive. It will call the body thunk, but if an abort occurs within the body to the given prompt tag, then the body aborts and the handler is run instead.

So if you want to define a primitive that runs a function but allows early exit, we can do that:

(define-module (my-module)
  #:export (with-return))

(define-syntax-rule (with-return return body ...)
  (let ((t (make-prompt-tag)))
    (define (return . args)
      (apply abort-to-prompt t args))
    (call-with-prompt t
      (lambda () body ...)
      (lambda (k . rvals)
        (apply values rvals)))))

Here we define a module with a little with-return macro. We can use it like this:

(use-modules (my-module))

(with-return return
  (+ 3 (return 42)))
;; => 42

As you can see, calling return within the body will abort the computation and cause the with-return expression to evaluate to the arguments passed to return.

But what's up with the handler? Let's look again at the form of the call-with-prompt invocations we've been making.

(let ((tag (make-prompt-tag)))
  (call-with-prompt tag
    (lambda () ...)
    (lambda (k . args) ...)))

With the with-return macro, the handler took a first k argument, threw it away, and returned the remaining values. But the first argument to the handler is pretty cool: it is the continuation of the computation that was aborted, delimited by the prompt: meaning, it's the part of the computation between the abort-to-prompt and the call-with-prompt, packaged as a function that you can call.

If you call the k, the delimited continuation, you reinstate it:

(define (f)
  (define tag (make-prompt-tag))
  (call-with-prompt tag
   (lambda ()
     (+ 3
        (abort-to-prompt tag)))
   (lambda (k) k)))

(let ((k (f)))
  (k 1))
;; => 4

Here, the abort-to-prompt invocation behaved simply like a "suspend" operation, returning the suspended computation k. Calling that continuation resumes it, supplying the value 1 to the saved continuation (+ 3 []), resulting in 4.

Basically, when a delimited continuation suspends, the first argument to the handler is a function that can resume the continuation.

tasks to fibers

And with that, we just built coroutines in terms of delimited continuations. We can turn our scheduler inside-out, giving the illusion that each task runs in its own isolated fiber.

(define tag (make-prompt-tag))

(define (call/susp thunk)
  (define (handler k on-suspend) (on-suspend k))
  (call-with-prompt tag thunk handler))

(define (suspend on-suspend)
  (abort-to-prompt tag on-suspend))

(define (schedule thunk)
  (match (current-scheduler)
    (($ $sched inbox i/o)
     (enqueue! inbox (lambda () (call/susp thunk))))))

So! Here we have a system that can run a thunk in a scheduler. Fine. No big deal. But if the thunk calls suspend, then it causes an abort back to a prompt. suspend takes a procedure as an argument, the on-suspend procedure, which will be called with one argument: the suspended continuation of the thunk. We've layered coroutines on top of the event loop.

Guile's virtual machine is a normal register virtual machine with a stack composed of function frames. It's not necessary to do full CPS conversion to implement delimited control, but if you don't, then your virtual machine needs primitive support for call-with-prompt, as Guile's VM does. In Guile then, a suspended continuation is an object composed of the slice of the stack captured between the prompt and the abort, and also the slice of the dynamic stack. (Guile keeps a parallel stack for dynamic bindings. Perhaps we should unify these; dunno.) This object is wrapped in a little procedure that uses VM primitives to push those stack frames back on, and continue.

I say all this just to give you a mental idea of what it costs to suspend a fiber. It will allocate storage proportional to the stack depth between the prompt and the abort. Usually this is a few dozen words, if there are 5 or 10 frames on the stack in the fiber.

We've gone from prompts to coroutines, and from here to fibers there's just a little farther to go. First, note that spawning a new fiber is simply scheduling a thunk:

(define (spawn-fiber thunk)
  (schedule thunk))

Many threading libraries provide a "yield" primitive, which simply suspends the current thread, allowing others to run. We can do this for fibers directly:

(define (yield)
  (suspend schedule))

Note that the on-suspend procedure here is just schedule, which re-schedules the continuation (but presumably at the back of the queue).

Similarly if we are reading on a non-blocking file descriptor and detect that we need more input before we can continue, but none is available, we can suspend and arrange for the epollfd to resume us later:

(define (wait-for-readable fd)
  (suspend
   (lambda (k)
     (match (current-scheduler)
       (($ $sched inbox i/o)
        (add-read-fd! i/o fd
                      (lambda () (schedule k))))))))

In Guile you can arrange to install this function as the "current read waiter", causing it to run whenever a port would block. The details are a little gnarly currently; see the Non-blocking I/O manual page for more.
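
Concretely, the hookup looks something like this; treat the module and parameter names as a sketch from memory, and check the Non-blocking I/O manual page for the authoritative interface.

(use-modules (ice-9 suspendable-ports))

;; Use the suspendable Scheme implementations of the port operations.
(install-suspendable-ports!)

;; When a read on a non-blocking port would block, suspend the current
;; fiber until the underlying file descriptor becomes readable.
(current-read-waiter
 (lambda (port)
   (wait-for-readable (fileno port))))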

Anyway the cool thing is that I can run any thunk within a spawn-fiber, without modification, and it will run as if in a new thread of some sort.

solid abstractions?

I admit that although I am very happy with Emacs, I never really took to using the shell from within Emacs. I always have a terminal open with a bunch of tabs. I think the reason for that is that I never quite understood why I could move the cursor over the bash prompt, or into previous expressions or results; it seemed like I was waking up groggily from some kind of dream where nothing was real. I like the terminal, where the only bit that's "mine" is the current command. All the rest is immutable text in the scrollback.

Similarly when you make a UI, you want to design things so that people perceive the screen as being composed of buttons and so on, not just lines. In essence you trick the user, a willing user who is ready to be tricked, into seeing buttons and text and not just weird pixels.

In the same way, with fibers we want to provide the illusion that fibers actually exist. To solidify this illusion, we're still missing a few elements.

One point relates to error handling. As it is, if an error happens in a fiber and the fiber doesn't handle it, the exception propagates out of the fiber, through the scheduler, and might cause the whole program to error out. So we need to wrap fibers in a catch-all.

(define (spawn-fiber thunk)
  (schedule
   (lambda ()
     (catch #t thunk
       (lambda (key . args)
         (print-exception (current-error-port) #f key args))))))

Well, OK. Exceptions won't propagate out of fibers, yay. In fact in Guile we add another catch inside the print-exception, in case the print-exception throws an exception... Anyway. Cool.

Another point relates to fiber-local variables. In an operating system, each process has a number of variables that are local to it, notably in UNIX we have the umask, the current effective user, the current directory, the open files and what file descriptors they are associated with, and so on. In Scheme we have similar facilities in the form of parameters.

Now the usual way that parameters are used is to bind a new value within the extent of some call:

(define (with-output-to-string thunk)
  (let ((p (open-output-string)))
    (parameterize ((current-output-port p))
      (thunk))
    (get-output-string p)))

Here the parameterize invocation established p as the current output port during the call to thunk. Parameters already compose quite well with prompts; Guile, like Racket, implements the protocol described by Kiselyov, Shan, and Sabry in their Delimited Dynamic Binding paper (well worth a read!).

The one missing piece is that parameters in Scheme are mutable (by default). Normally if you call (current-input-port), you just get the current value of the current input port parameter. But if you pass an argument, like (current-input-port p), then you actually set the current input port to that new value. This value will be in place until we leave some parameterize invocation that parameterizes the current input port.
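
A tiny example of that convention, with a fresh parameter rather than one of the port parameters:

(define prompt-string (make-parameter "> "))

(prompt-string)             ;; => "> "
(prompt-string "scheme@ ")  ;; mutates the current binding
(prompt-string)             ;; => "scheme@ "

(parameterize ((prompt-string "debug> "))
  (prompt-string))          ;; => "debug> "

(prompt-string)             ;; => "scheme@ " again, outside the parameterize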

The problem here is that it could be that there's an interesting parameter which some piece of Scheme code will want to just mutate, so that all further Scheme code will use the new value. This is fine if you have no concurrency: there's just one thing running. But when you have many fibers, you want to avoid mutations in one fiber from affecting others. You want some isolation with regards to parameters. In Guile, we do this with the with-dynamic-state facility, which isolates changes to the dynamic state (parameters and so on) within the extent of the with-dynamic-state call.

(define (spawn-fiber thunk)
  (let ((state (current-dynamic-state)))
    (schedule
     (lambda ()
       (catch #t
         (lambda ()
           (with-dynamic-state state thunk))
         (lambda (key . args)
           (print-exception (current-error-port) #f key args)))))))

Interestingly, with-dynamic-state solves another problem as well. You would like for newly spawned fibers to inherit the parameters from the point at which they were spawned.

(parameterize ((current-output-port p))
  (spawn-fiber
   ;; New fiber should inherit current-output-port
   ;; binding as "p"
   (lambda () ...)))

Capturing the (current-dynamic-state) outside the thunk does this for us.

When I made this change in Guile, making sure that with-dynamic-state did not impose a continuation barrier, I ran into a problem. In Guile we implemented exceptions in terms of delimited continuations and dynamic binding. The current stack of exception handlers was a list, and each element included the exceptions handled by that handler, and what prompt to which to abort before running the exception handler. See where the problem is? If we ship this exception handler stack over to a new fiber, then an exception propagating out of the new fiber would be looking up handlers from another fiber, for prompts that probably aren't even on the stack any more.

The problem here is that if you store a heap-allocated stack of current exception handlers in a dynamic variable, and that dynamic variable is captured somehow (say, by a delimited continuation), then you capture the whole stack of handlers, not (in the case of delimited continuations) the delimited set of handlers that were active within the prompt. To fix this, we had to change Guile's exceptions to instead make catch just rebind the exception handler parameter to hold the handler installed by the catch. If Guile needs to walk the chain of exception handlers, we introduced a new primitive fluid-ref* to do so, building the chain from the current stack of parameterizations instead of some representation of that stack on the heap. It's O(n), but life is that way sometimes. This way also, delimited continuations capture the right set of exception handlers.

Finally, Guile also supports asynchronous interrupts. We can arrange to interrupt a Guile process (or POSIX thread) every so often, as measured in wall-clock or process time. It used to be that interrupt handlers caused a continuation barrier, but this is no longer the case, so now we can add pre-emption to a fibers system using interrupts.

summary and reflections

In Guile we were able to create a solid-seeming abstraction for fibers by composing other basic building blocks from the Scheme toolkit. Guile users can take an abstraction that's implemented in terms of an event loop (any event loop) and layer fibers on top in a way that feels "real". We were able to do this because we have prompts (delimited continuations) and parameters (dynamic binding), and we were able to compose the two. Actually getting it all to work required fixing a few bugs.

In Fibers, we just use delimited continuations to implement coroutines, and then our fibers are coroutines. If we had coroutines as a primitive, that would work just as well. As it is, each suspension of a fiber will allocate a new continuation. Perhaps this is unimportant, given the average continuation size, but it would be comforting in a way to be able to re-use the allocation from the previous suspension (if any). Other languages with coroutine primitives might have an advantage here, though delimited dynamic binding is still relatively uncommon.

Another point is that because we use prompts to suspend fibers, we effectively are always unwinding and rewinding the dynamic state. In practice this should be transparent to the user, and the implementor should make it transparent from a performance perspective as well, with the exception of dynamic-wind. Basically any fiber suspension will run the "out" guard of any enclosing dynamic-wind, and resumption will run the "in" guard. In practice we find that we defer "finalization" issues to with-throw-handler / catch, which unlike dynamic-wind don't run on every entry or exit of a dynamic extent and rather just run on exceptional exits. We will see over time if this situation is acceptable. It's certainly another nail in the coffin of dynamic-wind though.

This article started with pthreads malaise, and although we've solved the problem of having a million fibers, we haven't solved the communications problem. How should fibers communicate with each other? This is the topic for my next article. Until then, happy hacking :)

an early look at p4 for software networking

26 June 2017 2:00 PM (igalia | networking | p4 | compilers | snabb | vpp)

Happy midsummer, hackfriends!

As you know at work we have been trying to find ways to apply compilers technology to the networking space. We will compile high-level configurations into low-level network processing graphs, search algorithms into lookup routines optimized for the target data structures, packet filters into code ready to be further trace-compiled, or hash functions into parallel AVX2 code.

On one side, we try to provide fast implementations of existing "languages"; on the other side we can't help but try out new co-designed domain-specific languages that can be expressive and run fast. As an example, with pfmatch we extended pflang, the tcpdump language, with a more match-action kind of semantics. It worked fine but the embedding between pfmatch and the host language could have been more smooth; in the end the abstractions it offers don't really apply to what we have needed to build. For a long time we have been wondering if indeed there is a better domain-specific programming language to apply to the networking domain.

P4 claims to be this language, and I think it's worth a look. P4's goal is to be able to define switches and other networking equipment in software, with the specific aim that P4 programs can be synthesized to ASICs, installed in the FPGA of a "Smart NIC", or compiled to CPUs. It's a wide target domain and the silicon-bakery side of things definitely constrains what is possible. Indeed P4 explicitly disclaims any ambition to be a general-purpose programming language. Still, I think they manage to achieve an admirable balance between declarative programming and transparent low-level compilability.

The best, most current intro to P4 out there is probably Vladimir Gurevich's slides from last month's P4 "developer day" in California. I think it does a good job linking the language's syntax and semantics with how they are intended to be applied to the target domain. For a more PL-friendly and abstract introduction, the P4_16 specification is a true delight.

Like I said, at work we build software switches and other network functions, and our target is commodity hardware. We write most of our work in Snabb, a powerful network toolkit built on LuaJIT, though we are branching out now to VPP as well, just to broaden the offering a bit. Generally we try to build solutions that don't have any dependencies other than a commodity Xeon server and a commodity NIC like Intel's 82599. So how could P4 help us in what we're doing?

My first thought in this regard was that if there is a library of P4 building blocks out there, that it would be really convenient to be able to incorporate a functional block written in P4 within the graph of a Snabb program. For example, if we have an IPFIX collector written in Snabb (and we do!), it would be cool to stick that in the middle of a P4 traffic conditioner.

(Immediately I run into the problem that I am straining my mind to think of a network function that we wouldn't rather just write in Snabb -- something valuable enough that we wouldn't want to "own" it and instead we would import someone else's black box into our data-plane. Maybe this interesting in-network key-value cache counts? But I digress, let's assume that something exists here.)

One question is, why bother doing P4 in software? I can understand that if you have 1Tbps ports that you definitely need custom silicon pushing your packets around. You would like to be able to program that silicon, so P4 looks to be a compelling step forward. But if your needs are satisfied with 40Gbps ports and you have chosen a software networking solution for its low cost, low lock-in, high flexibility, and sufficient performance -- well does P4 buy you something?

Right now it would seem that the answer is "no". A Cisco group wrote a custom P4 compiler to VPP, which is architecturally pretty much the same as Snabb, and they had to do some work to get the performance within a couple percent of the hand-coded application. The only win I can see is if people start building up libraries of reusable P4 components that can be linked together -- but the language itself currently doesn't support any more global composition primitive than #include (yes, it uses CPP :).

Additionally, at least as things are now, it doesn't seem that there's a library of reusable, open source P4 components out there to take advantage of. If this changes, I'll have to have another look. And of course it's worth keeping an eye on what kinds of cool things people are building :)

Thanks to Luke Gorrie for conversations leading to this blog post. All opinions and errors mine, of course!

guile 2.2 omg!!!

15 March 2017 10:56 PM (guile | scheme | gnu | igalia | compilers | academia | relief)

Oh, good evening my hackfriends! I am just chuffed to share a thing with yall: tomorrow we release Guile 2.2.0. Yaaaay!

I know in these days of version number inflation that this seems like a very incremental, point-release kind of a thing, but it's a big deal to me. This is a project I have been working on since soon after the release of Guile 2.0 some 6 years ago. It wasn't always clear that this project would work, but now it's here, going into production.

In that time I have worked on JavaScriptCore and V8 and SpiderMonkey and so I got a feel for what a state-of-the-art programming language implementation looks like. Also in that time I ate and breathed optimizing compilers, and really hit the wall until finally paging in what Fluet and Weeks were saying so many years ago about continuation-passing style and scope, and eventually came through with a solution that was still CPS: CPS soup. At this point Guile's "middle-end" is, I think, totally respectable. The backend targets a quite good virtual machine.

The virtual machine is still a bytecode interpreter for now; native code is a next step. Oddly my journey here has been precisely opposite, in a way, to An incremental approach to compiler construction; incremental, yes, but starting from the other end. But I am very happy with where things are. Guile remains very portable, bootstrappable from C, and the compiler is in a good shape to take us the rest of the way to register allocation and native code generation, and performance is pretty ok, even better than some natively-compiled Schemes.

For a "scripting" language (what does that mean?), I also think that Guile is breaking nice ground by using ELF as its object file format. Very cute. As this seems to be a "Andy mentions things he's proud of" segment, I was also pleased with how we were able to completely remove the stack size restriction.

high fives all around

As is often the case with these things, I got the idea for removing the stack limit after talking with Sam Tobin-Hochstadt from Racket and the PLT group. I admire Racket and its makers very much and look forward to stealing from -- er, working with -- them in the future.

Of course the ideas for the contification and closure optimization passes are in debt to Matthew Fluet and Stephen Weeks for the former, and Andy Keep and Kent Dybvig for the latter. The intmap/intset representation of CPS soup itself is highly indebted to the late Phil Bagwell, to Rich Hickey, and to Clojure folk; persistent data structures were an amazing revelation to me.

Guile's virtual machine itself was initially heavily inspired by JavaScriptCore's VM. Thanks to WebKit folks for writing so much about the early days of Squirrelfish! As far as the actual optimizations in the compiler itself, I was inspired a lot by V8's Crankshaft in a weird way -- it was my first touch with fixed-point flow analysis. As most of yall know, I didn't study CS, for better and for worse; for worse, because I didn't know a lot of this stuff, and for better, as I had the joy of learning it as I needed it. Since starting with flow analysis, Carl Offner's Notes on graph algorithms used in optimizing compilers was invaluable. I still open it up from time to time.

While I'm high-fiving, large ups to two amazing support teams: firstly to my colleagues at Igalia for supporting me on this. Almost the whole time I've been at Igalia, I've been working on this, for about a day or two a week. Sometimes at work we get to take advantage of a Guile thing, but Igalia's Guile investment mainly pays out in the sense of keeping me happy, keeping me up to date with language implementation techniques, and attracting talent. At work we have a lot of language implementation people, in JS engines obviously but also in other niches like the networking group, and it helps to be able to transfer hackers from Scheme to these domains.

The second support team is at home. I put in my own time too, of course; but my time isn't really my own either. My wife Kate has been really supportive and understanding of my not-infrequent impulses to just nerd out and hack a thing. She probably won't read this (though maybe?), but it's important to acknowledge that many of us hackers are only able to do our work because of the support that we get from our families.

a digression on the nature of seeking and knowledge

I am jealous of my colleagues in academia sometimes; of course it must be this way, that we are jealous of each other. Greener grass and all that. But when you go through a doctoral program, you know that you push the boundaries of human knowledge. You know because you are acutely aware of the state of recorded knowledge in your field, and you know that your work expands that record. If you stay in academia, you use your honed skills to continue chipping away at the unknown. The papers that this process reifies have a huge impact on the flow of knowledge in the world. As just one example, I've read all of Dybvig's papers, with delight and pleasure and avarice and jealousy, and learned loads from them. (Incidentally, I am given to understand that all of these are proper academic reactions :)

But in my work on Guile I don't actually know that I've expanded knowledge in any way. I don't actually know that anything I did is new, and I suspect that nothing is. Maybe CPS soup? There have been some similar publications in the last couple of years, but you never know. Maybe some of the multicore Concurrent ML stuff I haven't written about yet. Really not sure. I am starting to see papers these days that are similar to what I do, and I have the feeling that they have a bit more impact than my work because of their medium, and I wonder if I could be putting my work in a more useful form, or orienting it more towards novelty.

I also don't know how important new knowledge is. Simply being able to practice language implementation at a state-of-the-art level is a valuable skill in itself, and releasing a quality, stable free-software language implementation is valuable to the world. So it's not like I'm negative on where I'm at, but I do feel wonderful talking with folks at academic conferences and wonder how to pull some more of that into my life.

In the meantime, I feel like (my part of) Guile 2.2 is my master work in a way -- a savepoint in my hack career. It's fine work; see A Virtual Machine for Guile and Continuation-Passing Style for some high level documentation, or many of these bloggies for the nitties and the gritties. OKitties!

getting the goods

It's been a joy over the last two or three years to see the growth of Guix, a packaging system written in Guile and inspired by GNU stow and Nix. The laptop I'm writing this on runs GuixSD, and Guix is up to some 5000 packages at this point.

I've always wondered what the right solution for packaging Guile and Guile modules was. At one point I thought that we would have a Guile-specific packaging system, but one with stow-like characteristics. We had problems with C extensions though: how do you build one? Where do you get the compilers? Where do you get the libraries?

Guix solves this in a comprehensive way. From the four or five bootstrap binaries, Guix can download and build the world from source, for any of its supported architectures. The result is a farm of weirdly-named files in /gnu/store, but the transitive closure of a store item works on any distribution of that architecture.
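To make that concrete, here is roughly what a package definition looks like: a hypothetical Guile library with a C extension, with a made-up name, URL, and hash, so don't go trying to build this one. The point is just that the compilers and the libraries are themselves packages that you name as inputs.

    (define-module (my packages guile-frobnicate)
      #:use-module (guix packages)
      #:use-module (guix download)
      #:use-module (guix build-system gnu)
      #:use-module ((guix licenses) #:prefix license:)
      #:use-module (gnu packages guile))

    (define-public guile-frobnicate
      (package
        (name "guile-frobnicate")
        (version "0.1.0")
        (source (origin
                  (method url-fetch)
                  (uri (string-append "https://example.org/guile-frobnicate-"
                                      version ".tar.gz"))
                  ;; Placeholder hash; guix download prints the real one.
                  (sha256
                   (base32
                    "0000000000000000000000000000000000000000000000000000"))))
        ;; The usual ./configure && make && make install dance; Guix supplies
        ;; the whole toolchain in the isolated build environment.
        (build-system gnu-build-system)
        ;; Where do you get the libraries?  They are just other packages.
        (inputs `(("guile" ,guile-2.0)))  ; or the 2.2 package, once Guix migrates
        (synopsis "Hypothetical Guile library with a C extension")
        (description "Placeholder description for this example sketch.")
        (home-page "https://example.org/")
        (license license:gpl3+)))

Drop a file like that under a directory on GUIX_PACKAGE_PATH as my/packages/guile-frobnicate.scm and Guix can find it; the build runs in isolation and the result, like everything else, lands in /gnu/store as a self-contained closure.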

This state of affairs was clear from the Guix binary installation instructions that just have you extract a tarball over your current distro, regardless of what's there. The process of building this weird tarball was always a bit ad-hoc though, geared to Guix's installation needs.

It turns out that we can use the same strategy to distribute reproducible binaries for any package that Guix includes. So if you download this tarball, and extract it as root in /, then it will extract some paths in /gnu/store and also add a /opt/guile-2.2.0. Run Guile as /opt/guile-2.2.0/bin/guile and you have Guile 2.2, before any of your friends! That pack was made using guix pack -C lzip -S /opt/guile-2.2.0=/ guile-next glibc-utf8-locales, at Guix git revision 80a725726d3b3a62c69c9f80d35a898dcea8ad90.

(If you run that Guile, it will complain about not being able to install the locale. Guix, like Scheme, is generally a statically scoped system; but locales are dynamically scoped. That is to say, you have to set GUIX_LOCPATH=/opt/guile-2.2.0/lib/locale in the environment, for locales to work. See the GUIX_LOCPATH docs for the gnarlies.)
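If the scoping quip reads as too cute, here's the distinction in Guile terms, using parameters for the dynamic side; the names are invented for the example.

    ;; Static (lexical) scope: which binding `greet' sees is fixed by where
    ;; the code is written, like a store item's baked-in references.
    (define greeting "hello")
    (define (greet) greeting)

    ;; Dynamic scope: the binding in force depends on the context at the time
    ;; of the call, which is how locale lookup behaves -- hence GUIX_LOCPATH.
    (define locale-dir (make-parameter "/run/current-system/locale"))
    (define (lookup-locale-dir) (locale-dir))

    (greet)              ;; => "hello"
    (lookup-locale-dir)  ;; => "/run/current-system/locale"
    (parameterize ((locale-dir "/opt/guile-2.2.0/lib/locale"))
      (lookup-locale-dir)) ;; => "/opt/guile-2.2.0/lib/locale"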

Alternatively, of course, you can install Guix and just guix package -i guile-next. Guix itself will migrate to 2.2 over the next week or so.

Welp, that's all for this evening. I'll be relieved to push the release tag and announcements tomorrow. In the meantime, happy hacking, and yes: this blog is served by Guile 2.2! :)