wingolog

whippet hacklog: adding freelists to the no-freelist space

2025-08-07T15:02:04Z

August greetings, comrades! Today I want to bookend some recent work on my Immix-inspired garbage collector: firstly, an idea with muddled results, then a slog through heuristics.

the big idea

My mostly-marking collector’s main space is called the “nofl space”. Its name comes from its historical evolution from mark-sweep to mark-region: instead of sweeping unused memory to freelists and allocating from those freelists, sweeping is interleaved with allocation; “nofl” means “no free-list”. As it finds holes, the collector bump-pointer allocates into those holes. If an allocation doesn’t fit into the current hole, the collector sweeps some more to find the next hole, possibly fetching another block. Space for holes that are too small is effectively wasted as fragmentation; mutators will try again after the next GC. Blocks with lots of holes will be chosen for opportunistic evacuation, which is the heap defragmentation mechanism.

Hole-too-small fragmentation has bothered me, because it presents a potential pathology. You don’t know how a GC will be used or what the user’s allocation pattern will be; if it is a mix of medium (say, a kilobyte) and small (say, 16 bytes) allocations, one could imagine a medium allocation having to sweep over lots of holes, discarding them in the process, which hastens the next collection. Seems wasteful, especially for non-moving configurations.

So I had a thought: why not collect those holes into a size-segregated freelist? We just cleared the hole, the memory is core-local, and we might as well. Then before fetching a new block, the allocator slow-path can see if it can service an allocation from the second-chance freelist of holes. This decreases locality a bit, but maybe it’s worth it.

Thing is, I implemented it, and I don’t know if it’s worth it! It seems to interfere with evacuation, in that the blocks that would otherwise be most profitable to evacuate, because they contain many holes, are instead filled up with junk due to second-chance allocation from the freelist. I need to do more measurements, but I think my big-brained idea is a bit of a wash, at least if evacuation is enabled.

heap growth

When running the new collector in Guile, we have a performance oracle in the form of BDW: it had better be faster for Guile to compile a Scheme file with the new nofl-based collector than with BDW. In this use case we have an additional degree of freedom, in that unlike the lab tests of nofl vs BDW, we don’t impose a fixed heap size, and instead allow heuristics to determine the growth.

BDW’s built-in heap growth heuristics are very opaque. You give it a heap multiplier, but as a divisor truncated to an integer. It’s very imprecise. Additionally, there are nonlinearities: BDW is relatively more generous for smaller heaps, because attempts to model and amortize tracing cost, and there are some fixed costs (thread sizes, static data sizes) that don’t depend on live data size.

Thing is, BDW’s heuristics work pretty well. For example, I had a process that ended with a heap of about 60M, for a peak live data size of 25M or so. If I ran my collector with a fixed heap multiplier, it wouldn’t do as well as BDW, because it collected much more frequently when the heap was smaller.

I ended up switching from the primitive “size the heap as a multiple of live data” strategy to live data plus a square root factor; this is like what Racket ended up doing in its simple implementation of MemBalancer. (I do have a proper implementation of MemBalancer, with time measurement and shrinking and all, but I haven’t put it through its paces yet.) With this fix I can meet BDW’s performance for my Guile-compiling-Guile-with-growable-heap workload. It would be nice to exceed BDW of course!

parallel worklist tweaks

Previously, in parallel configurations, trace workers would each have a Chase-Lev deque to which they could publish objects needing tracing. Any worker could steal an object from the top of a worker’s public deque. Also, each worker had a local, unsynchronized FIFO worklist, some 1000 entries in length; when this worklist filled up, the worker would publish its contents.

There is a pathology for this kind of setup, in which one worker can end up with a lot of work that it never publishes. For example, if there are 100 long singly-linked lists on the heap, and the worker happens to have them all on its local FIFO, then perhaps they never get published, because the FIFO never overflows; you end up not parallelising. This seems to be the case in one microbenchmark. I switched to not have local worklists at all; perhaps this was not the right thing, but who knows. Will poke in future.

a hilarious bug

Sometimes you need to know whether a given address is in an object managed by the garbage collector. For the nofl space it’s pretty easy, as we have big slabs of memory; bisecting over the array of slabs is fast. But for large objects whose memory comes from the kernel, we don’t have that. (Yes, you can reserve a big ol’ region with PROT_NONE and such, and then allocate into that region; I don’t do that currently.)

Previously I had a splay tree for lookup. Splay trees are great but not so amenable to concurrent access, and parallel marking is one place where we need to do this lookup. So I prepare a sorted array before marking, and then bisect over that array.

Except a funny thing happened: I switched the bisect routine to return the start address if an address is in a region. Suddenly, weird failures started happening randomly. Turns out, in some places I was testing if bisection succeeded with an int; if the region happened to be 32-bit-aligned, then the nonzero 64-bit uintptr_t got truncated to its low 32 bits, which were zero. Yes, crusty reader, Rust would have caught this!

fin

I want this new collector to work. Getting the growth heuristic good enough is a step forward. I am annoyed that second-chance allocation didn’t work out as well as I had hoped; perhaps I will find some time this fall to give a proper evaluation. In any case, thanks for reading, and hack at you later!

guile lab notebook: on the move!

2025-07-08T14:28:45Z

Hey, a quick update, then a little story. The big news is that I got Guile wired to a moving garbage collector!

Specifically, this is the mostly-moving collector with conservative stack scanning. Most collections will be marked in place. When the collector wants to compact, it will scan ambiguous roots in the beginning of the collection cycle, marking objects referenced by such roots in place. Then the collector will select some blocks for evacuation, and when visiting an object in those blocks, it will try to copy the object to one of the evacuation target blocks that are held in reserve. If the collector runs out of space in the evacuation reserve, it falls back to marking in place.

Given that the collector has to cope with failed evacuations, it is easy to give the it the ability to pin any object in place. This proved useful when making the needed modifications to Guile: for example, when we copy a stack slice containing ambiguous references to a heap-allocated continuation, we eagerly traverse that stack to pin the referents of those ambiguous edges. Also, whenever the address of an object is taken and exposed to Scheme, we pin that object. This happens frequently for identity hashes (hashq).

Anyway, the bulk of the work here was a pile of refactors to Guile to allow a centralized scm_trace_object function to be written, exposing some object representation details to the internal object-tracing function definition while not exposing them to the user in the form of API or ABI.

bugs

I found quite a few bugs. Not many of them were in Whippet, but some were, and a few are still there; Guile exercises a GC more than my test workbench is able to. Today I’d like to write about a funny one that I haven’t fixed yet.

So, small objects in this garbage collector are managed by a Nofl space. During a collection, each pointer-containing reachable object is traced by a global user-supplied tracing procedure. That tracing procedure should call a collector-supplied inline function on each of the object’s fields. Obviously the procedure needs a way to distinguish between different kinds of objects, to trace them appropriately; in Guile, we use an the low bits of the initial word of heap objects for this purpose.

Object marks are stored in a side table in associated 4-MB aligned slabs, with one mark byte per granule (16 bytes). 4 MB is 0x400000, so for an object at address A, its slab base is at A & ~0x3fffff, and the mark byte is offset by (A & 0x3fffff) >> 4. When the tracer sees an edge into a block scheduled for evacuation, it first checks the mark byte to see if it’s already marked in place; in that case there’s nothing to do. Otherwise it will try to evacuate the object, which proceeds as follows...

But before you read, consider that there are a number of threads which all try to make progress on the worklist of outstanding objects needing tracing (the grey objects). The mutator threads are paused; though we will probably add concurrent tracing at some point, we are unlikely to implement concurrent evacuation. But it could be that two GC threads try to process two different edges to the same evacuatable object at the same time, and we need to do so correctly!

With that caveat out of the way, the implementation is here. The user has to supply an annoyingly-large state machine to manage the storage for the forwarding word; Guile’s is here. Basically, a thread will try to claim the object by swapping in a busy value (-1) for the initial word. If that worked, it will allocate space for the object. If that failed, it first marks the object in place, then restores the first word. Otherwise it installs a forwarding pointer in the first word of the object’s old location, which has a specific tag in its low 3 bits allowing forwarded objects to be distinguished from other kinds of object.

I don’t know how to prove this kind of operation correct, and probably I should learn how to do so. I think it’s right, though, in the sense that either the object gets marked in place or evacuated, all edges get updated to the tospace locations, and the thread that shades the object grey (and no other thread) will enqueue the object for further tracing (via its new location if it was evacuated).

But there is an invisible bug, and one that is the reason for me writing these words :) Whichever thread manages to shade the object from white to grey will enqueue it on its grey worklist. Let’s say the object is on an block to be evacuated, but evacuation fails, and the object gets marked in place. But concurrently, another thread goes to do the same; it turns out there is a timeline in which the thread A has marked the object, published it to a worklist for tracing, but thread B has briefly swapped out the object’s the first word with the busy value before realizing the object was marked. The object might then be traced with its initial word stompled, which is totally invalid.

What’s the fix? I do not know. Probably I need to manage the state machine within the side array of mark bytes, and not split between the two places (mark byte and in-object). Anyway, I thought that readers of this web log might enjoy a look in the window of this clown car.

next?

The obvious question is, how does it perform? Basically I don’t know yet; I haven’t done enough testing, and some of the heuristics need tweaking. As it is, it appears to be a net improvement over the non-moving configuration and a marginal improvement over BDW, but which currently has more variance. I am deliberately imprecise here because I have been more focused on correctness than performance; measuring properly takes time, and as you can see from the story above, there are still a couple correctness issues. I will be sure to let folks know when I have something. Until then, happy hacking!

whippet in guile hacklog: evacuation

2025-06-11T20:56:14Z

Good evening, hackfolk. A quick note this evening to record a waypoint in my efforts to improve Guile’s memory manager.

So, I got Guile running on top of the Whippet API. This API can be implemented by a number of concrete garbage collector implementations. The implementation backed by the Boehm collector is fine, as expected. The implementation that uses the bump-pointer-allocation-into-holes strategy is less good. The minor reason is heap sizing heuristics; I still get it wrong about when to grow the heap and when not to do so. But the major reason is that non-moving Immix collectors appear to have pathological fragmentation characteristics.

Fragmentation, for our purposes, is memory under the control of the GC which was free after the previous collection, but which the current cycle failed to use for allocation. I have the feeling that for the non-moving Immix-family collector implementations, fragmentation is much higher than for size-segregated freelist-based mark-sweep collectors. For an allocation of, say, 1024 bytes, the collector might have to scan over many smaller holes until you find a hole that is big enough. This wastes free memory. Fragmentation memory is not gone—it is still available for allocation!—but it won’t be allocatable until after the current cycle when we visit all holes again. In Immix, fragmentation wastes allocatable memory during a cycle, hastening collection and causing more frequent whole-heap traversals.

The value proposition of Immix is that if there is too much fragmentation, you can just go into evacuating mode, and probably improve things. I still buy it. However I don’t think that non-moving Immix is a winner. I still need to do more science to know for sure. I need to fix Guile to support the stack-conservative, heap-precise version of the Immix-family collector which will allow for evacuation.

So that’s where I’m at: a load of gnarly Guile refactors to allow for precise tracing of the heap. I probably have another couple weeks left until I can run some tests. Fingers crossed; we’ll see!

whippet lab notebook: guile, heuristics, and heap growth

2025-05-22T10:05:35Z

Greets all! Another brief note today. I have gotten Guile working with one of the Nofl-based collectors, specifically the one that scans all edges conservatively (heap-conservative-mmc / heap-conservative-parallel-mmc). Hurrah!

It was a pleasant surprise how easy it was to switch—from the user’s point of view, you just pass --with-gc=heap-conservative-parallel-mmc to Guile’s build (on the wip-whippet branch); when developing I also pass --with-gc-debug, and I had a couple bugs to fix—but, but, there are still some issues. Today’s note thinks through the ones related to heap sizing heuristics.

growable heaps

Whippet has three heap sizing strategies: fixed, growable, and adaptive (MemBalancer). The adaptive policy is the one I would like in the long term; it will grow the heap for processes with a high allocation rate, and shrink when they go idle. However I won’t really be able to test heap shrinking until I get precise tracing of heap edges, which will allow me to evacuate sparse blocks.

So for now, Guile uses the growable policy, which attempts to size the heap so it is at least as large as the live data size, times some multiplier. The multiplier currently defaults to 1.75×, but can be set on the command line via the GUILE_GC_OPTIONS environment variable. For example to set an initial heap size of 10 megabytes and a 4× multiplier, you would set GUILE_GC_OPTIONS=heap-size-multiplier=4,heap-size=10M.

Anyway, I have run into problems! The fundamental issue is fragmentation. Consider a 10MB growable heap with a 2× multiplier, consisting of a sequence of 16-byte objects followed by 16-byte holes. You go to allocate a 32-byte object. This is a small object (8192 bytes or less), and so it goes in the Nofl space. A Nofl mutator holds on to a block from the list of sweepable blocks, and will sequentially scan that block to find holes. However, each hole is only 16 bytes, so we can’t fit our 32-byte object: we finish with the current block, grab another one, repeat until no blocks are left and we cause GC. GC runs, and after collection we have an opportunity to grow the heap: but the heap size is already twice the live object size, so the heuristics say we’re all good, no resize needed, leading to the same sweep again, leading to a livelock.

I actually ran into this case during Guile’s bootstrap, while allocating a 7072-byte vector. So it’s a thing that needs fixing!

observations

The root of the problem is fragmentation. One way to solve the problem is to remove fragmentation; using a semi-space collector comprehensively resolves the issue, modulo any block-level fragmentation.

However, let’s say you have to live with fragmentation, for example because your heap has ambiguous edges that need to be traced conservatively. What can we do? Raising the heap multiplier is an effective mitigation, as it increases the average hole size, but for it to be a comprehensive solution in e.g. the case of 16-byte live objects equally interspersed with holes, you would need a multiplier of 512× to ensure that the largest 8192-byte “small” objects will find a hole. I could live with 2× or something, but 512× is too much.

We could consider changing the heap organization entirely. For example, most mark-sweep collectors (BDW-GC included) partition the heap into blocks whose allocations are of the same size, so you might have some blocks that only hold 16-byte allocations. It is theoretically possible to run into the same issue, though, if each block only has one live object, and the necessary multiplier that would “allow” for more empty blocks to be allocated is of the same order (256× for 4096-byte blocks each with a single 16-byte allocation, or even 4096× if your blocks are page-sized and you have 64kB pages).

My conclusion is that practically speaking, if you can’t deal with fragmentation, then it is impossible to just rely on a heap multiplier to size your heap. It is certainly an error to live-lock the process, hoping that some other thread mutates the graph in such a way to free up a suitable hole. At the same time, if you have configured your heap to be growable at run-time, it would be bad policy to fail an allocation, just because you calculated that the heap is big enough already.

It’s a shame, because we lose a mooring on reality: “how big will my heap get” becomes an unanswerable question because the heap might grow in response to fragmentation, which is not deterministic if there are threads around, and so we can’t reliably compare performance between different configurations. Ah well. If reliability is a goal, I think one needs to allow for evacuation, one way or another.

for nofl?

In this concrete case, I am still working on a solution. It’s going to be heuristic, which is a bit of a disappointment, but here we are.

My initial thought has two parts. Firstly, if the heap is growable but cannot defragment, then we need to reserve some empty blocks after each collection, even if reserving them would grow the heap beyond the configured heap size multiplier. In that way we will always be able to allocate into the Nofl space after a collection, because there will always be some empty blocks. How many empties? Who knows. Currently Nofl blocks are 64 kB, and the largest “small object” is 8kB. I’ll probably try some constant multiplier of the heap size.

The second thought is that searching through the entire heap for a hole is a silly way for the mutator to spend its time. Immix will reserve a block for overflow allocation: if a medium-sized allocation (more than 256B and less than 8192B) fails because no hole in the current block is big enough—note that Immix’s holes have 128B granularity—then the allocation goes to a dedicated overflow block, which is taken from the empty block set. This reduces fragmentation (holes which were not used for allocation because they were too small).

Nofl should probably do the same, but given its finer granularity, it might be better to sweep over a variable number of blocks, for example based on the logarithm of the allocation size; one could instead sweep over clz(min-size)–clz(size) blocks before taking from the empty block list, which would at least bound the sweeping work of any given allocation.

fin

Welp, just wanted to get this out of my head. So far, my experience with this Nofl-based heap configuration is mostly colored by live-locks, and otherwise its implementation of a growable heap sizing policy seems to be more tight-fisted regarding memory allocation than BDW-GC’s implementation. I am optimistic though that I will be able to get precise tracing sometime soon, as measured in development time; the problem as always is fragmentation, in that I don’t have a hole in my calendar at the moment. Until then, sweep on Wayne, cons on Garth, onwards and upwards!

guile on whippet waypoint: goodbye, bdw-gc?

2025-05-15T14:39:56Z

Hey all, just a lab notebook entry today. I’ve been working on the Whippet GC library for about three years now, learning a lot on the way. The goal has always been to replace Guile’s use of the Boehm-Demers-Weiser collector with something more modern and maintainable. Last year I finally got to the point that I felt Whippet was feature-complete, and taking into account the old adage about long arses and brief videos, I think that wasn’t too far off. I carved out some time this spring and for the last month have been integrating Whippet into Guile in anger, on the wip-whippet branch.

the haps

Well, today I removed the last direct usage of the BDW collector’s API by Guile! Instead, Guile uses Whippet’s API any time it needs to allocate an object, add or remove a thread from the active set, identify the set of roots for a collection, and so on. Most tracing is still conservative, but this will move to be more precise over time. I haven’t had the temerity to actually try one of the Nofl-based collectors yet, but that will come soon.

Code-wise, the initial import of Whippet added some 18K lines to Guile’s repository, as counted by git diff --stat, which includes documentation and other files. There was an unspeakable amount of autotomfoolery to get Whippet in Guile’s ancient build system. Changes to Whippet during the course of integration added another 500 lines or so. Integration of Whippet removed around 3K lines of C from Guile. It’s not a pure experiment, as my branch is also a major version bump and so has the freedom to refactor and simplify some things.

Things are better but not perfect. Notably, I switched to build weak hash tables in terms of buckets and chains where the links are ephemerons, which give me concurrent lock-free reads and writes but not resizable tables. I would like to somehow resize these tables in response to GC, but haven’t wired it up yet.

Anyway, next waypoint will be trying out the version of Whippet’s Nofl-based mostly-marking collector that traces all heap edges conservatively. If that works... well if that works... I don’t dare to hope! We will see what we get when that happens. Until then, happy hacking!

a whippet waypoint

2025-05-09T21:36:03Z

Hey peoples! Tonight, some meta-words. As you know I am fascinated by compilers and language implementations, and I just want to know all the things and implement all the fun stuff: intermediate representations, flow-sensitive source-to-source optimization passes, register allocation, instruction selection, garbage collection, all of that.

It started long ago with a combination of curiosity and a hubris to satisfy that curiosity. The usual way to slake such a thirst is structured higher education followed by industry apprenticeship, but for whatever reason my path sent me through a nuclear engineering bachelor’s program instead of computer science, and continuing that path was so distasteful that I noped out all the way to rural Namibia for a couple years.

Fast-forward, after 20 years in the programming industry, and having picked up some language implementation experience, a few years ago I returned to garbage collection. I have a good level of language implementation chops but never wrote a memory manager, and Guile’s performance was limited by its use of the Boehm collector. I had been on the lookout for something that could help, and when I learned of Immix it seemed to me that the only thing missing was an appropriate implementation for Guile, and hey I could do that!

whippet

I started with the idea of an MMTk-style interface to a memory manager that was abstract enough to be implemented by a variety of different collection algorithms. This kind of abstraction is important, because in this domain it’s easy to convince oneself that a given algorithm is amazing, just based on vibes; to stay grounded, I find I always need to compare what I am doing to some fixed point of reference. This GC implementation effort grew into Whippet, but as it did so a funny thing happened: the mark-sweep collector that I prototyped as a direct replacement for the Boehm collector maintained mark bits in a side table, which I realized was a suitable substrate for Immix-inspired bump-pointer allocation into holes. I ended up building on that to develop an Immix collector, but without lines: instead each granule of allocation (16 bytes for a 64-bit system) is its own line.

regions?

The Immix paper is funny, because it defines itself as a new class of mark-region collector, fundamentally different from the three other fundamental algorithms (mark-sweep, mark-compact, and evacuation). Immix’s regions are blocks (64kB coarse-grained heap divisions) and lines (128B “fine-grained” divisions); the innovation (for me) is the optimistic evacuation discipline by which one can potentially defragment a block without a second pass over the heap, while also allowing for bump-pointer allocation. See the papers for the deets!

However what, really, are the regions referred to by mark-region? If they are blocks, then the concept is trivial: everyone has a block-structured heap these days. If they are spans of lines, well, how does one choose a line size? As I understand it, Immix’s choice of 128 bytes was to be fine-grained enough to not lose too much space to fragmentation, while also being coarse enough to be eagerly swept during the GC pause.

This constraint was odd, to me; all of the mark-sweep systems I have ever dealt with have had lazy or concurrent sweeping, so the lower bound on the line size to me had little meaning. Indeed, as one reads papers in this domain, it is hard to know the real from the rhetorical; the review process prizes novelty over nuance. Anyway. What if we cranked the precision dial to 16 instead, and had a line per granule?

That was the process that led me to Nofl. It is a space in a collector that came from mark-sweep with a side table, but instead uses the side table for bump-pointer allocation. Or you could see it as an Immix whose line size is 16 bytes; it’s certainly easier to explain it that way, and that’s the tack I took in a recent paper submission to ISMM’25.

paper??!?

Wait what! I have a fine job in industry and a blog, why write a paper? Gosh I have meditated on this for a long time and the answers are very silly. Firstly, one of my language communities is Scheme, which was a research hotbed some 20-25 years ago, which means many practitioners—people I would be pleased to call peers—came up through the PhD factories and published many interesting results in academic venues. These are the folks I like to hang out with! This is also what academic conferences are, chances to shoot the shit with far-flung fellows. In Scheme this is fine, my work on Guile is enough to pay the intellectual cover charge, but I need more, and in the field of GC I am not a proven player. So I did an atypical thing, which is to cosplay at being an independent researcher without having first been a dependent researcher, and just solo-submit a paper. Kids: if you see yourself here, just go get a doctorate. It is not easy but I can only think it is a much more direct path to goal.

And the result? Well, friends, it is this blog post :) I got the usual assortment of review feedback, from the very sympathetic to the less so, but ultimately people were confused by leading with a comparison to Immix but ending without an evaluation against Immix. This is fair and the paper does not mention that, you know, I don’t have an Immix lying around. To my eyes it was a good paper, an 80% paper, but, you know, just a try. I’ll try again sometime.

In the meantime, I am driving towards getting Whippet into Guile. I am hoping that sometime next week I will have excised all the uses of the BDW (Boehm GC) API in Guile, which will finally allow for testing Nofl in more than a laboratory environment. Onwards and upwards!

partitioning ambiguous edges in guile

2025-04-25T15:00:26Z

Today, some more words on memory management, on the practicalities of a system with conservatively-traced references.

The context is that I have finally started banging Whippet into Guile, initially in a configuration that continues to use the conservative Boehm-Demers-Weiser (BDW) collector behind the scene. In that way I can incrementally migrate over all of the uses of the BDW API in Guile to use Whippet API instead, and then if all goes well, I should be able to switch Whippet to use another GC algorithm, probably the mostly-marking collector (MMC). MMC scales better than BDW for multithreaded mutators, and it can eliminate fragmentation via Immix-inspired optimistic evacuation.

problem statement: how to manage ambiguous edges

A garbage-collected heap consists of memory, which is a set of addressable locations. An object is a disjoint part of a heap, and is the unit of allocation. A field is memory within an object that may refer to another object by address. Objects are nodes in a directed graph in which each edge is a field containing an object reference. A root is an edge into the heap from outside. Garbage collection reclaims memory from objects that are not reachable from the graph that starts from a set of roots. Reclaimed memory is available for new allocations.

In the course of its work, a collector may want to relocate an object, moving it to a different part of the heap. The collector can do so if it can update all edges that refer to the object to instead refer to its new location. Usually a collector arranges things so all edges have the same representation, for example an aligned word in memory; updating an edge means replacing the word’s value with the new address. Relocating objects can improve locality and reduce fragmentation, so it is a good technique to have available. (Sometimes we say evacuate, move, or compact instead of relocate; it’s all the same.)

Some collectors allow ambiguous edges: words in memory whose value may be the address of an object, or might just be scalar data. Ambiguous edges usually come about if a compiler doesn’t precisely record which stack locations or registers contain GC-managed objects. Such ambiguous edges must be traced conservatively: the collector adds the object to its idea of the set of live objects, as if the edge were a real reference. This tracing mode isn’t supported by all collectors.

Any object that might be the target of an ambiguous edge cannot be relocated by the collector; a collector that allows conservative edges cannot rely on relocation as part of its reclamation strategy. Still, if the collector can know that a given object will not be the referent of an ambiguous edge, relocating it is possible.

How can one know that an object is not the target of an ambiguous edge? We have to partition the heap somehow into possibly-conservatively-referenced and definitely-not-conservatively-referenced. The two ways that I know to do this are spatially and temporally.

Spatial partitioning means that regardless of the set of root and intra-heap edges, there are some objects that will never be conservatively referenced. This might be the case for a type of object that is “internal” to a language implementation; third-party users that may lack the discipline to precisely track roots might not be exposed to objects of a given kind. Still, link-time optimization tends to weather these boundaries, so I don’t see it as being too reliable over time.

Temporal partitioning is more robust: if all ambiguous references come from roots, then if one traces roots before intra-heap edges, then any object not referenced after the roots-tracing phase is available for relocation.

kinds of ambiguous edges in guile

So let’s talk about Guile! Guile uses BDW currently, which considers edges to be ambiguous by default. However, given that objects carry type tags, Guile can, with relatively little effort, switch to precisely tracing most edges. “Most”, however, is not sufficient; to allow for relocation, we need to eliminate intra-heap ambiguous edges, to confine conservative tracing to the roots-tracing phase.

Conservatively tracing references from C stacks or even from static data sections is not a problem: these are roots, so, fine.

Guile currently traces Scheme stacks almost-precisely: its compiler emits stack maps for every call site, which uses liveness analysis to only mark those slots that are Scheme values that will be used in the continuation. However it’s possible that any given frame is marked conservatively. The most common case is when using the BDW collector and a thread is pre-empted by a signal; then its most recent stack frame is likely not at a safepoint and indeed is likely undefined in terms of Guile’s VM. It can also happen if there is a call site within a VM operation, for example to a builtin procedure, if it throws an exception and recurses, or causes GC itself. Also, when per-instruction traps are enabled, we can run Scheme between any two Guile VM operations.

So, Guile could change to trace Scheme stacks fully precisely, but this is a lot of work; in the short term we will probably just trace Scheme stacks as roots instead of during the main trace.

However, there is one more significant source of ambiguous roots, and that is reified continuation objects. Unlike active stacks, these have to be discovered during a trace and cannot be partitioned out to the root phase. For delimited continuations, these consist of a slice of the Scheme stack. Traversing a stack slice precisely is less problematic than for active stacks, because it isn’t in motion, and it is captured at a known point; but we will have to deal with stack frames that are pre-empted in unexpected locations due to exceptions within builtins. If a stack map is missing, probably the solution there is to reconstruct one using local flow analysis over the bytecode of the stack frame’s function; time-consuming, but it should be robust as we do it elsewhere.

Undelimited continuations (those captured by call/cc) contain a slice of the C stack also, for historical reasons, and there we can’t trace it precisely at all. Therefore either we disable relocation if there are any live undelimited continuation objects, or we eagerly pin any object referred to by a freshly captured stack slice.

fin

If you want to follow along with the Whippet-in-Guile work, see the wip-whippet branch in Git. I’ve bumped its version to 4.0 because, well, why the hell not; if it works, it will certainly be worth it. Until next time, happy hacking!

whippet lab notebook: untagged mallocs, bis

2025-03-07T13:47:11Z

Earlier this week I took an inventory of how Guile uses the Boehm-Demers-Weiser (BDW) garbage collector, with the goal of making sure that I had replacements for all uses lined up in Whippet. I categorized the uses into seven broad categories, and I was mostly satisfied that I have replacements for all except the last: I didn’t know what to do with untagged allocations: those that contain arbitrary data, possibly full of pointers to other objects, and which don’t have a header that we can use to inspect on their type.

But now I do! Today’s note is about how we can support untagged allocations of a few different kinds in Whippet’s mostly-marking collector.

inside and outside

Why bother supporting untagged allocations at all? Well, if I had my way, I wouldn’t; I would just slog through Guile and fix all uses to be tagged. There are only a finite number of use sites and I could get to them all in a month or so.

The problem comes for uses of scm_gc_malloc from outside libguile itself, in C extensions and embedding programs. These users are loathe to adapt to any kind of change, and garbage-collection-related changes are the worst. So, somehow, we need to support these users if we are not to break the Guile community.

on intent

The problem with scm_gc_malloc, though, is that it is missing an expression of intent, notably as regards tagging. You can use it to allocate an object that has a tag and thus can be traced precisely, or you can use it to allocate, well, anything else. I think we will have to add an API for the tagged case and assume that anything that goes through scm_gc_malloc is requesting an untagged, conservatively-scanned block of memory. Similarly for scm_gc_malloc_pointerless: you could be allocating a tagged object that happens to not contain pointers, or you could be allocating an untagged array of whatever. A new API is needed there too for pointerless untagged allocations.

on data

Recall that the mostly-marking collector can be built in a number of different ways: it can support conservative and/or precise roots, it can trace the heap precisely or conservatively, it can be generational or not, and the collector can use multiple threads during pauses or not. Consider a basic configuration with precise roots. You can make tagged pointerless allocations just fine: the trace function for that tag is just trivial. You would like to extend the collector with the ability to make untagged pointerless allocations, for raw data. How to do this?

Consider first that when the collector goes to trace an object, it can’t use bits inside the object to discriminate between the tagged and untagged cases. Fortunately though the main space of the mostly-marking collector has one metadata byte for each 16 bytes of payload. Of those 8 bits, 3 are used for the mark (five different states, allowing for future concurrent tracing), two for the precise field-logging write barrier, one to indicate whether the object is pinned or not, and one to indicate the end of the object, so that we can determine object bounds just by scanning the metadata byte array. That leaves 1 bit, and we can use it to indicate untagged pointerless allocations. Hooray!

However there is a wrinkle: when Whippet decides the it should evacuate an object, it tracks the evacuation state in the object itself; the embedder has to provide an implementation of a little state machine, allowing the collector to detect whether an object is forwarded or not, to claim an object for forwarding, to commit a forwarding pointer, and so on. We can’t do that for raw data, because all bit states belong to the object, not the collector or the embedder. So, we have to set the “pinned” bit on the object, indicating that these objects can’t move.

We could in theory manage the forwarding state in the metadata byte, but we don’t have the bits to do that currently; maybe some day. For now, untagged pointerless allocations are pinned.

on slop

You might also want to support untagged allocations that contain pointers to other GC-managed objects. In this case you would want these untagged allocations to be scanned conservatively. We can do this, but if we do, it will pin all objects.

Thing is, conservative stack roots is a kind of a sweet spot in language run-time design. You get to avoid constraining your compiler, you avoid a class of bugs related to rooting, but you can still support compaction of the heap.

How is this, you ask? Well, consider that you can move any object for which we can precisely enumerate the incoming references. This is trivially the case for precise roots and precise tracing. For conservative roots, we don’t know whether a given edge is really an object reference or not, so we have to conservatively avoid moving those objects. But once you are done tracing conservative edges, any live object that hasn’t yet been traced is fair game for evacuation, because none of its predecessors have yet been visited.

But once you add conservatively-traced objects back into the mix, you don’t know when you are done tracing conservative edges; you could always discover another conservatively-traced object later in the trace, so you have to pin everything.

The good news, though, is that we have gained an easier migration path. I can now shove Whippet into Guile and get it running even before I have removed untagged allocations. Once I have done so, I will be able to allow for compaction / evacuation; things only get better from here.

Also as a side benefit, the mostly-marking collector’s heap-conservative configurations are now faster, because we have metadata attached to objects which allows tracing to skip known-pointerless objects. This regains an optimization that BDW has long had via its GC_malloc_atomic, used in Guile since time out of mind.

fin

With support for untagged allocations, I think I am finally ready to start getting Whippet into Guile itself. Happy hacking, and see you on the other side!

whippet lab notebook: on untagged mallocs

2025-03-04T15:42:32Z

Salutations, populations. Today’s note is more of a work-in-progress than usual; I have been finally starting to look at getting Whippet into Guile, and there are some open questions.

inventory

I started by taking a look at how Guile uses the Boehm-Demers-Weiser collector‘s API, to make sure I had all my bases covered for an eventual switch to something that was not BDW. I think I have a good overview now, and have divided the parts of BDW-GC used by Guile into seven categories.

implicit uses

Firstly there are the ways in which Guile’s run-time and compiler depend on BDW-GC’s behavior, without actually using BDW-GC’s API. By this I mean principally that we assume that any reference to a GC-managed object from any thread’s stack will keep that object alive. The same goes for references originating in global variables, or static data segments more generally. Additionally, we rely on GC objects not to move: references to GC-managed objects in registers or stacks are valid across a GC boundary, even if those references are outside the GC-traced graph: all objects are pinned.

Some of these “uses” are internal to Guile’s implementation itself, and thus amenable to being changed, albeit with some effort. However some escape into the wild via Guile’s API, or, as in this case, as implicit behaviors; these are hard to change or evolve, which is why I am putting my hopes on Whippet’s mostly-marking collector, which allows for conservative roots.

defensive uses

Then there are the uses of BDW-GC’s API, not to accomplish a task, but to protect the mutator from the collector: GC_call_with_alloc_lock, explicitly enabling or disabling GC, calls to sigmask that take BDW-GC’s use of POSIX signals into account, and so on. BDW-GC can stop any thread at any time, between any two instructions; for most users is anodyne, but if ever you use weak references, things start to get really gnarly.

Of course a new collector would have its own constraints, but switching to cooperative instead of pre-emptive safepoints would be a welcome relief from this mess. On the other hand, we will require client code to explicitly mark their threads as inactive during calls in more cases, to ensure that all threads can promptly reach safepoints at all times. Swings and roundabouts?

precise tracing

Did you know that the Boehm collector allows for precise tracing? It does! It’s slow and truly gnarly, but when you need precision, precise tracing nice to have. (This is the GC_new_kind interface.) Guile uses it to mark Scheme stacks, allowing it to avoid treating unboxed locals as roots. When it loads compiled files, Guile also adds some sliced of the mapped files to the root set. These interfaces will need to change a bit in a switch to Whippet but are ultimately internal, so that’s fine.

What is not fine is that Guile allows C users to hook into precise tracing, notably via scm_smob_set_mark. This is not only the wrong interface, not allowing for copying collection, but these functions are just truly gnarly. I don’t know know what to do with them yet; are our external users ready to forgo this interface entirely? We have been working on them over time, but I am not sure.

reachability

Weak references, weak maps of various kinds: the implementation of these in terms of BDW’s API is incredibly gnarly and ultimately unsatisfying. We will be able to replace all of these with ephemerons and tables of ephemerons, which are natively supported by Whippet. The same goes with finalizers.

The same goes for constructs built on top of finalizers, such as guardians; we’ll get to reimplement these on top of nice Whippet-supplied primitives. Whippet allows for resuscitation of finalized objects, so all is good here.

misc

There is a long list of miscellanea: the interfaces to explicitly trigger GC, to get statistics, to control the number of marker threads, to initialize the GC; these will change, but all uses are internal, making it not a terribly big deal.

I should mention one API concern, which is that BDW’s state is all implicit. For example, when you go to allocate, you don’t pass the API a handle which you have obtained for your thread, and which might hold some thread-local freelists; BDW will instead load thread-local variables in its API. That’s not as efficient as it could be and Whippet goes the explicit route, so there is some additional plumbing to do.

Finally I should mention the true miscellaneous BDW-GC function: GC_free. Guile exposes it via an API, scm_gc_free. It was already vestigial and we should just remove it, as it has no sensible semantics or implementation.

allocation

That brings me to what I wanted to write about today, but am going to have to finish tomorrow: the actual allocation routines. BDW-GC provides two, essentially: GC_malloc and GC_malloc_atomic. The difference is that “atomic” allocations don’t refer to other GC-managed objects, and as such are well-suited to raw data. Otherwise you can think of atomic allocations as a pure optimization, given that BDW-GC mostly traces conservatively anyway.

From the perspective of a user of BDW-GC looking to switch away, there are two broad categories of allocations, tagged and untagged.

Tagged objects have attached metadata bits allowing their type to be inspected by the user later on. This is the happy path! We’ll be able to write a gc_trace_object function that takes any object, does a switch on, say, some bits in the first word, dispatching to type-specific tracing code. As long as the object is sufficiently initialized by the time the next safepoint comes around, we’re good, and given cooperative safepoints, the compiler should be able to ensure this invariant.

Then there are untagged allocations. Generally speaking, these are of two kinds: temporary and auxiliary. An example of a temporary allocation would be growable storage used by a C run-time routine, perhaps as an unbounded-sized alternative to alloca. Guile uses these a fair amount, as they compose well with non-local control flow as occurring for example in exception handling.

An auxiliary allocation on the other hand might be a data structure only referred to by the internals of a tagged object, but which itself never escapes to Scheme, so you never need to inquire about its type; it’s convenient to have the lifetimes of these values managed by the GC, and when desired to have the GC automatically trace their contents. Some of these should just be folded into the allocations of the tagged objects themselves, to avoid pointer-chasing. Others are harder to change, notably for mutable objects. And the trouble is that for external users of scm_gc_malloc, I fear that we won’t be able to migrate them over, as we don’t know whether they are making tagged mallocs or not.

what is to be done?

One conventional way to handle untagged allocations is to manage to fit your data into other tagged data structures; V8 does this in many places with instances of FixedArray, for example, and Guile should do more of this. Otherwise, you make new tagged data types. In either case, all auxiliary data should be tagged.

I think there may be an alternative, which would be just to support the equivalent of untagged GC_malloc and GC_malloc_atomic; but for that, I am out of time today, so type at y’all tomorrow. Happy hacking!

tracepoints: gnarly but worth it

2025-02-14T13:32:10Z

Hey all, quick post today to mention that I added tracing support to the Whippet GC library. If the support library for LTTng is available when Whippet is compiled, Whippet embedders can visualize the GC process. Like this!

Click above for a full-scale screenshot of the Perfetto trace explorer processing the nboyer microbenchmark with the parallel copying collector on a 2.5x heap. Of course no image will have all the information; the nice thing about trace visualizers like is that you can zoom in to sub-microsecond spans to see exactly what is happening, have nice mouseovers and clicky-clickies. Fun times!

on adding tracepoints

Adding tracepoints to a library is not too hard in the end. You need to pull in the lttng-ust library, which has a pkg-config file. You need to declare your tracepoints in one of your header files. Then you have a minimal C file that includes the header, to generate the code needed to emit tracepoints.

Annoyingly, this header file you write needs to be in one of the -I directories; it can’t be just in the the source directory, because lttng includes it seven times (!!) using computed includes (!!!) and because the LTTng file header that does all the computed including isn’t in your directory, GCC won’t find it. It’s pretty ugly. Ugliest part, I would say. But, grit your teeth, because it’s worth it.

Finally you pepper your source with tracepoints, which probably you wrap in some macro so that you don’t have to require LTTng, and so you can switch to other tracepoint libraries, and so on.

using the thing

I wrote up a little guide for Whippet users about how to actually get traces. It’s not as easy as perf record, which I think is an error. Another ugly point. Buck up, though, you are so close to graphs!

By which I mean, so close to having to write a Python script to make graphs! Because LTTng writes its logs in so-called Common Trace Format, which as you might guess is not very common. I have a colleague who swears by it, that for him it is the lowest-overhead system, and indeed in my case it has no measurable overhead when trace data is not being collected, but his group uses custom scripts to convert the CTF data that he collects to... GTKWave (?!?!?!!).

In my case I wanted to use Perfetto’s UI, so I found a script to convert from CTF to the JSON-based tracing format that Chrome profiling used to use. But, it uses an old version of Babeltrace that wasn’t available on my system, so I had to write a new script (!!?!?!?!!), probably the most Python I have written in the last 20 years.

is it worth it?

Yes. God I love blinkenlights. As long as it’s low-maintenance going forward, I am satisfied with the tradeoffs. Even the fact that I had to write a script to process the logs isn’t so bad, because it let me get nice nested events, which most stock tracing tools don’t allow you to do.

I fixed a small performance bug because of it – a worker thread was spinning waiting for a pool to terminate instead of helping out. A win, and one that never would have shown up on a sampling profiler too. I suspect that as I add more tracepoints, more bugs will be found and fixed.

fin

I think the only thing that would be better is if tracepoints were a part of Linux system ABIs – that there would be header files to emit tracepoint metadata in all binaries, that you wouldn’t have to link to any library, and the actual tracing tools would be intermediated by that ABI in such a way that you wouldn’t depend on those tools at build-time or distribution-time. But until then, I will take what I can get. Happy tracing!