Hello all, a brief note today. For context, my big project over the last year and a half or so is Whippet, a new garbage collector implementation. If everything goes right, Whippet will find a better point on the space/time tradeoff curve than Guile’s current garbage collector, BDW-GC, and will make it into Guile, well, sometime.
But, testing a garbage collector is... hard. Methodologically it’s very tricky, though there has been some recent progress in assessing collectors in a more systematic way. Ideally of course you test against a real application, and barring that, against a macrobenchmark extracted from a real application. But garbage collectors are deeply intertwined with language run-times; to maximize the insight into GC performance, you need to minimize everything else, and to minimize non-collector cost, that means relying on a high-performance run-time (e.g. the JVM), which... is hard! It’s hard enough to get toy programs to work, but testing against a beast like the JVM is not easy.
In the case of Guile, it’s even more complicated, because the BDW-GC is a conservative collector. BDW-GC doesn’t require precise annotation of stack roots, and scans intraheap edges conservatively (unless you go out of your way to do otherwise). How can you test this against a semi-space collector if Guile can’t precisely enumerate all object references? The Immix-derived collector in Whippet can be configured to run with conservative stack roots (and even heap links), but how then to test the performance tradeoff of conservative vs precise tracing?
In my first iterations on Whippet, I hand-coded some benchmarks in C, starting with the classic gcbench. I used stack-allocated linked lists for precise roots, when run in precise-rooting mode. But, this is excruciating and error-prone; trying to write some more tests, I ran up against a wall of gnarly nasty C code. Not fun, and I wasn’t sure that it was representative in terms of workload; did the handle discipline have some kind of particular overhead? The usual way to do things in a high-performance system is to instead have stack maps, where the collector is able to precisely find roots on the stack and registers using a side table.
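To make the handle discipline concrete, here is a minimal sketch of stack-allocated linked-list roots. All names (`struct handle`, `push_handle`, and so on) are hypothetical and are not Whippet's actual API; the point is only to show the bookkeeping that hand-written C has to do around every rooted value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not Whippet's actual API. */
struct handle {
  void *obj;              /* the rooted object reference */
  struct handle *next;    /* next-older handle on this thread */
};

struct mutator {
  struct handle *roots;   /* head of the stack-allocated handle list */
};

/* Register a handle that lives in the caller's C stack frame. */
static void push_handle(struct mutator *mut, struct handle *h, void *obj) {
  h->obj = obj;
  h->next = mut->roots;
  mut->roots = h;
}

/* Unregister the most recent handle; must mirror push_handle exactly. */
static void pop_handle(struct mutator *mut) {
  assert(mut->roots != NULL);
  mut->roots = mut->roots->next;
}

/* What a precise collector would do: visit exactly the registered roots. */
static size_t count_roots(const struct mutator *mut) {
  size_t n = 0;
  for (const struct handle *h = mut->roots; h; h = h->next)
    if (h->obj) n++;
  return n;
}
```

Every push needs a matching pop on every exit path of the function, which is where the "excruciating and error-prone" part comes from.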
Of course the usual solution if something is not fun is to make it into a tools problem, so I wrote a little Scheme-to-C compiler, Whiffle, purpose-built for testing Whippet. It’s a baseline-style compiler that uses the C stack for control (function call and return), and a packed side stack of temporary values. The goal was to be able to generate some C that I could include in the collector’s benchmarks; I’ve gotten close but haven’t done that final step yet.
Anyway, because its temporaries are all in a packed stack, Whippet can always traverse the roots precisely. This means I can write bigger benchmark programs without worry, which will allow me to finish testing the collector. As a nominally separate project, Whiffle also tests that Whippet’s API is good for embedding. (Yes, the names are similar; if I thought that in the future I would need to say much more about Whiffle, I would rename it!)
I was able to translate over the sparse benchmarks that Whippet already had into Scheme, and set about checking that they worked as expected. Which they did... mostly.
The Whippet interface abstracts over different garbage collector implementations; the choice of which collector to use is made at compile-time. There’s the basic semi-space copying collector, with a large object space and ephemerons; there’s the Immix-derived “whippet” collector (yes, it shares the same name); and there’s a shim that provides the Whippet API via the BDW-GC.
The benchmarks are multithreaded, except when using the semi-space collector which only supports one mutator thread. (I should fix that at some point.) I tested the benchmarks with one mutator thread, and they worked with all collectors. Yay!
Then I tested with multiple mutator threads. The Immix-derived collectors worked fine, in both precise and conservative modes. The BDW-GC collector... did not. With just one mutator thread it was fine, but with multiple threads I would get segfaults deep inside libgc.
Over on the fediverse, Daphne Preston-Kendal asks, how does Whiffle deal with tail calls? The answer is, “sloppily”. All of the generated function code has the same prototype, so return-calls from one function to another should just optimize to jumps, and indeed they do: at -O2. For -O0, this doesn’t work, and sometimes you do want to compile in that mode (no optimizations) when investigating. I wish that C had a musttail attribute, and I use GCC too much to rely on the one that only Clang supports.
I found the -foptimize-sibling-calls flag in GCC and somehow convinced myself that it was a sufficient replacement, even when otherwise at -O0. However most of my calls are indirect, and I think this must have fooled GCC. Really not sure. But I was sure, at one point, until a week later I realized that the reason that the call instruction was segfaulting was because it couldn’t store the return address in *$rsp, because I had blown the stack. That was embarrassing, but realizing that and trying again at -O2 didn’t fix my bug.
Finally I recompiled bdw-gc, which I should have done from the beginning. (I use Guix on my test machine, and I wish I could have entered into an environment that had a specific set of dependencies, but with debugging symbols and source code; there are many things about Guix that I would think should help me develop software but which don’t. Probably there is a way to do this that I am unaware of.)
After recompiling, it became clear that BDW-GC was trying to scan a really weird range of addresses for roots, and accessing unmapped memory. You would think that this would be a common failure mode for BDW-GC, but really, this never happens: it’s an astonishingly reliable piece of software for what it does. I have used it for almost 20 years and its problems are elsewhere. Anyway, I determined that the thread that it was scanning was... well it was stuck somewhere in some rr support library, which... why was that anyway?
You see, the problem was that since my strange spooky segfaults on stack overflow, I had been living in rr for a week or so, because I needed to do hardware watchpoints and reverse-continue. But, somehow, strangely, oddly, rr was corrupting my program: it worked fine when not run in rr. Somehow rr caused BDW-GC to grab the wrong $rsp on a remote thread.
I had found the actual bug in the shower, some days before, and fixed it later that evening. Consider that both precise Immix and conservative Immix are working fine. Consider that conservative Immix is, well, stack-conservative just like BDW-GC. Clearly Immix finds the roots correctly, why didn’t BDW-GC? What is the real difference?
Well, friend, you have read the article title, so perhaps you won’t be surprised when I say “safepoints”. A safepoint is a potential stopping place in a program. At a safepoint, the overall state of the program can be inspected by some independent part of the program. In the case of garbage collection, safepoints are places where heap object references in a thread’s stack can be enumerated.
For my Immix-derived collectors, safepoints are cooperative: the only place a thread will stop for collection is in an allocation call, or if the thread has explicitly signalled to the collector that it is doing a blocking operation such as a thread join. Whiffle usually keeps the stack pointer for the side value stack in a register, but for allocating operations that need to go through the slow path, it makes sure to write that stack pointer into a shared data structure, so that other threads can read it if they need to walk its stack.
The BDW collector doesn’t work this way. It can’t rely on cooperation from the host. Instead, it installs a signal handler for the process that can suspend and resume any thread at any time. When BDW-GC needs to trace the heap, it stops all threads, finds the roots for those threads, and then traces the graph.
I had installed a custom BDW-GC mark function for Whiffle stacks so that even though they were in anonymous mmap’d memory that BDW-GC doesn’t usually trace for roots, Whiffle would be sure to let BDW-GC know about them. That mark function used the safepoints that I was already tracking to determine which part of the side stack to mark.
But here’s the thing: cooperative safepoints can be lazy, but preemptive safepoints must be eager. For BDW-GC, it’s not enough to ensure that the thread’s stack is traversable at allocations: it must be traversable at all times, because a signal can stop a thread at any time.
Safepoints also differ on the axis of precision: precise safepoints must include no garbage roots, whereas conservative safepoints can be sloppy. For precise safepoints, if you bump the stack pointer to allocate a new frame, you can’t include the slots in that frame in the root set until they have values. Precision has a cost to the compiler and to the run-time in side table size or shared-data-structure update overhead. As far as I am aware, no production collector uses fully preemptive, precise roots. (Happy to be corrected, if people have counterexamples; I know that this used to be the case, but my understanding is that since multi-threaded mutators have been common, people have mostly backed away from this design.)
Concretely, as I was using a stack discipline to allocate temporaries, I had to shrink the precise safepoint at allocation sites, but I was neglecting to expand the conservative safepoint whenever the stack pointer would be restored.
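The difference between the two disciplines can be sketched as follows. This is a single-threaded illustration with hypothetical names (`side_stack`, `published_top`, `push_lazy`, `push_eager`), not Whiffle's generated code; a real preemptive version would also have to worry about store ordering and compiler reordering, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIDE_STACK_SLOTS 64

/* Hypothetical layout: a side stack that grows upward, plus the
   "published" top that a collector thread is allowed to read. */
struct side_stack {
  uintptr_t slots[SIDE_STACK_SLOTS];
  size_t published_top;   /* what the collector sees */
};

/* Lazy (cooperative) discipline: the real top lives in a local variable
   (a register, in generated code) and is published only at safepoints. */
static void publish_safepoint(struct side_stack *s, size_t top) {
  s->published_top = top;
}

static size_t push_lazy(struct side_stack *s, size_t top, uintptr_t v) {
  assert(top < SIDE_STACK_SLOTS);
  s->slots[top] = v;
  return top + 1;           /* published_top is now stale */
}

/* Eager discipline: every push also updates the published top, so a
   thread stopped at an arbitrary point exposes all of its temporaries. */
static size_t push_eager(struct side_stack *s, size_t top, uintptr_t v) {
  assert(top < SIDE_STACK_SLOTS);
  s->slots[top] = v;
  s->published_top = top + 1;
  return top + 1;
}

/* Number of slots a collector would scan if it stopped the thread now. */
static size_t visible_roots(const struct side_stack *s) {
  return s->published_top;
}
```

With `push_lazy`, a collector that stops the thread between safepoints sees fewer slots than are actually live, which is harmless under cooperative stopping and a bug under signal-based stopping.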
When I was a kid, baseball was not my thing. I failed out at the earlier phase of tee-ball, where the ball isn’t thrown to you by the pitcher but is instead on a kind of rubber stand. As often as not I would fail to hit the ball and instead buckle the tee under it, making an unsatisfactory “flubt” sound, causing the ball to just fall to the ground. I probably would have persevered if I hadn’t also caught a ground ball to the face one day while playing right field, knocking me out and giving me a nosebleed so bad that apparently they called off the game because so many other players were vomiting. I woke up in the entryway of the YMCA and was given a lukewarm melted push-pop.
Anyway, “flubt” is the sound I heard in my mind when I realized that rr had been perturbing my use of BDW-GC, and instead of joy and relief when I finally realized that it worked already, it tasted like melted push-pop. It be that way sometimes. Better luck next try!
Friends, you might have noted, but over the last year or so I really caught the GC bug. Today’s post sums up that year, in the form of a talk I gave yesterday at FOSDEM. It’s long! If you prefer video, you can have a look instead at the FOSDEM event page.
4 Feb 2023 – FOSDEM
Andy Wingo
Mostly written in Scheme
Also a 30 year old C library
// API
SCM scm_cons (SCM car, SCM cdr);

// Many third-party users
SCM x = scm_cons (a, b);
So the context for the whole effort is that Guile has this part of its implementation which is in C. It also exposes a lot of that implementation to users as an API.
SCM x = scm_cons (a, b);
Live objects: the roots, plus anything a live object refers to
How to include x into roots?
So what constraints does this kind of API impose on the garbage collector?
Let’s start by considering the simple cons call above. In a garbage-collected environment, the GC is responsible for reclaiming unused memory. How does the GC know that the result of a scm_cons call is in use?
Generally speaking there are two main strategies for automatic memory management. One is reference counting: you associate a count with an object, incremented once for each referrer; in this case, the stack would hold a reference to x. When removing the reference, you decrement the count, and if it goes to 0 the object is unused and can be freed.
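As a point of comparison, reference counting for a cons-like pair might look like the sketch below. The names (`rc_pair`, `rc_cons`, `rc_retain`, `rc_release`) are made up for illustration and have nothing to do with Guile's API; `NULL` stands in for any non-heap value.

```c
#include <assert.h>
#include <stdlib.h>

struct rc_pair {
  size_t refcount;
  struct rc_pair *car, *cdr;   /* NULL stands in for non-heap values */
};

static size_t live_pairs = 0;  /* bookkeeping for the illustration only */

static struct rc_pair *rc_retain(struct rc_pair *p) {
  if (p) p->refcount++;
  return p;
}

/* Drop one reference; free the pair and release its fields at zero. */
static void rc_release(struct rc_pair *p) {
  if (!p || --p->refcount > 0) return;
  rc_release(p->car);
  rc_release(p->cdr);
  free(p);
  live_pairs--;
}

/* The new pair starts with one reference, owned by the caller. */
static struct rc_pair *rc_cons(struct rc_pair *car, struct rc_pair *cdr) {
  struct rc_pair *p = malloc(sizeof *p);
  if (!p) abort();
  p->refcount = 1;
  p->car = rc_retain(car);
  p->cdr = rc_retain(cdr);
  live_pairs++;
  return p;
}
```

Two pairs that point at each other never reach a count of zero under this scheme, which is the cycle problem mentioned in the text.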
We GC people used to laugh at reference-counting as a memory management solution because it over-approximates the live object set in the presence of cycles, but it would seem that refcounting is coming back. Anyway, this isn’t what Guile does, not right now anyway.
The other strategy we can use is tracing: the garbage collector periodically finds all of the live objects on the system and then recycles the memory for everything else. But how to actually find the first live objects to trace?
One way is to inform the garbage collector of the locations of all roots: references to objects originating from outside the heap. This can be done explicitly, as in V8’s Handle<> API, or implicitly, in the form of a side table generated by the compiler associating code locations with root locations. This is called precise rooting: the GC is aware of all root locations at all code positions where GC might happen. Generally speaking you want the side table approach, in which the compiler writes out root locations to stack maps, because it doesn’t impose any overhead at run-time to register and unregister locations. However for run-time routines implemented in C or C++, you won’t be able to get the C compiler to do this for you, so you need the explicit approach if you want precise roots.
Treat every word in stack as potential root; over-approximate live object set
1993: Bespoke GC inherited from SCM
2006 (1.8): Added pthreads, bugs
2009 (2.0): Switch to BDW-GC
BDW-GC: Roots also from extern SCM foo;, etc
The other way to find roots is very much not The Right Thing. Call it cheeky, call it sloppy, call it yolo, call it what you like, but in the trade it’s known as conservative root-finding. This strategy looks like this:
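// Upper bound of the stack; how to find it is platform-specific.
uintptr_t *limit = stack_base_for_platform();
// Address of the current frame (GCC/Clang builtin; 0 = this frame).
uintptr_t *sp = __builtin_frame_address(0);
for (; sp < limit; sp++) {
  // Treat each stack word as a possible pointer.
  void *obj = object_at_address(*sp);
  if (obj) add_to_live_objects(obj);
}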
You just look at every word on the stack and pretend it’s a pointer. If it happens to point to an object in the heap, we add that object to the live set. Of course this algorithm can find a spicy integer whose value just happens to correspond to an object’s address, even if that object wouldn’t have been counted as live otherwise. This approach doesn’t compute the minimal live set, but rather a conservative over-approximation. Oh well. In practice this doesn’t seem to be a big deal?
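A self-contained toy version of the same idea, which scans an arbitrary array of words instead of the real stack, could look like this. The fixed-size toy heap and the names `object_index` and `scan_conservatively` are assumptions for the sake of the example; a real collector's address-to-object lookup is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_OBJECTS 8
#define OBJECT_SIZE 16

/* Toy heap of fixed-size objects, plus one liveness flag per object. */
static unsigned char heap[HEAP_OBJECTS * OBJECT_SIZE];
static unsigned char live[HEAP_OBJECTS];

/* Map a word to an object index, or -1 if it is not a heap address.
   Interior pointers resolve to the containing object. */
static int object_index(uintptr_t word) {
  uintptr_t base = (uintptr_t)heap;
  if (word < base || word >= base + sizeof heap) return -1;
  return (int)((word - base) / OBJECT_SIZE);
}

/* Treat every word in [lo, hi) as a potential pointer. Returns the
   number of objects newly added to the live set. */
static size_t scan_conservatively(const uintptr_t *lo, const uintptr_t *hi) {
  size_t found = 0;
  for (const uintptr_t *p = lo; p < hi; p++) {
    int idx = object_index(*p);
    if (idx >= 0 && !live[idx]) {
      live[idx] = 1;
      found++;
    }
  }
  return found;
}
```

Note that `object_index` cannot tell a genuine pointer from an integer with the same bit pattern; both keep the object alive.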
Guile has used conservative root-finding since its beginnings, 30 years ago and more. We had our own bespoke mark-sweep GC in the beginning, but it’s now going on 15 years or so that we switched to the third-party Boehm-Demers-Weiser (BDW) collector. It’s been good to us! It’s better than what we had, it’s mostly just worked, and it works correctly with threads.
+: Ergonomic, eliminates class of bugs (handle registration), no compiler constraints
-: Potential leakage, no compaction / object motion; no bump-pointer allocation, calcifies GC choice
Conservative root-finding does have advantages. It’s quite pleasant to program with, in environments in which the compiler is unable to produce stack maps for you, as it eliminates a set of potential bugs related to explicit handle registration and unregistration. Like stack maps, it also doesn’t impose run-time overhead on the user program. And although the compiler isn’t constrained to emit code to clear roots, it generally does, and sometimes does so more promptly than would be the case with explicit handle deregistration.
But, there are disadvantages too. The potential for leaks is one, though I have to say that in 20 years of using conservative-roots systems, I have not found this to be a problem. It’s a source of anxiety whenever a program has memory consumption issues but I’ve never identified it as being the culprit.
The more serious disadvantage, though, is that conservative edges prevent objects from being moved by the GC. If you know that a location holds a pointer, you can update that location to point to a new location for an object. But if a location only might be a pointer, you can’t do that.
In the end, the ergonomics of conservative collection led to a kind of calcification in Guile: we thought that BDW was as good as we could get given the constraints, and that changing to anything else would require precise roots, and thus an API and ABI change, losing users, and so on.
You can find roots conservatively and
BDW is not the local maximum
But it turns out, that’s not true! There is a way to have conservative roots and also use more optimal GC algorithms, and one which preserves the ability to incrementally refactor the system to have more precision if that’s what you want.
Fundamental GC algorithms
Immix is a mark-region collector
Let’s back up to a high level. Garbage collector implementations are assembled from instances of algorithms, and there are only so many kinds of algorithms out there.
There’s mark-compact, in which the collector traverses the object graph once to find live objects, then once again to slide them down to one end of the space they are in.
There’s mark-sweep, where the collector traverses the graph once to find live objects, then traverses the whole heap, sweeping dead objects into free lists to be used for future allocations.
There’s evacuation, where the collector does a single pass over the object graph, copying the objects outside their space and leaving a forwarding pointer behind.
The BDW collector used by Guile is a mark-sweep collector, and its use of free lists means that allocation isn’t as fast as it could be. We want bump-pointer allocation and all the other algorithms give it to us.
Then in 2008, Stephen Blackburn and Kathryn McKinley put out their Immix paper that identified a new kind of collection algorithm, mark-region. A mark-region collector will mark the object graph and then sweep the whole heap for unmarked regions, which can then be reused for allocating new objects.
Allocate: Bump-pointer into holes in thread-local block, objects can span lines but not blocks
Trace: Mark objects and lines
Sweep: Coarse eager scan over line mark bytes
Blackburn and McKinley’s paper also describes a new mark-region GC algorithm, Immix, which is interesting because it gives us bump-pointer allocation without requiring that objects be moveable. The diagram above, from the paper, shows the organization of an Immix heap. Allocating threads (mutators) obtain 64-kilobyte blocks from the heap. Blocks contain 128-byte lines. When Immix traces the object graph, it marks both objects and the line the object is on. (Usually blocks are part of 2MB aligned slabs, with line mark bits/bytes stored in a packed array at the start of the slab. When marking an object, it’s easy to find the associated line mark just with address arithmetic.)
Immix reclaims memory in units of lines. A set of contiguous lines that were not marked in the previous collection form a hole (a region). Allocation proceeds into holes, in the usual bump-pointer fashion, giving us good locality for contemporaneously-allocated objects, unlike freelist allocation. The slow path, if the object doesn’t fit in the hole, is to look for the next hole in the block, or if needed to acquire another block, or to stop for collection if there are no more blocks.
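Here is a minimal sketch of bump-pointer allocation into holes, assuming a tiny 8-line block so that the behavior is easy to follow. The struct layout and the names `hole_allocator`, `find_next_hole`, and `allocate_in_block` are illustrative assumptions, not code from Immix or Whippet.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 128
#define LINES_PER_BLOCK 8   /* tiny, for illustration only */

struct block {
  unsigned char line_marks[LINES_PER_BLOCK];  /* 1 = marked at last GC */
  unsigned char data[LINES_PER_BLOCK * LINE_SIZE];
};

struct hole_allocator {
  struct block *block;
  size_t cursor;  /* byte offset of the next free byte in the hole */
  size_t limit;   /* byte offset one past the end of the hole */
};

/* Advance to the next run of unmarked lines at or after `limit`.
   Returns 0 if the block has no more holes. */
static int find_next_hole(struct hole_allocator *a) {
  size_t line = a->limit / LINE_SIZE;
  while (line < LINES_PER_BLOCK && a->block->line_marks[line]) line++;
  if (line == LINES_PER_BLOCK) return 0;
  size_t end = line;
  while (end < LINES_PER_BLOCK && !a->block->line_marks[end]) end++;
  a->cursor = line * LINE_SIZE;
  a->limit = end * LINE_SIZE;
  return 1;
}

/* Fast path: bump the cursor. Slow path: skip to the next hole.
   NULL means the caller must acquire another block or collect. */
static void *allocate_in_block(struct hole_allocator *a, size_t bytes) {
  for (;;) {
    if (a->limit - a->cursor >= bytes) {
      void *ret = a->block->data + a->cursor;
      a->cursor += bytes;
      return ret;
    }
    if (!find_next_hole(a)) return NULL;
  }
}
```

When an allocation does not fit, the remainder of the current hole is simply skipped; that is one way a block ends up accumulating small unused holes.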
Before trace, determine if compaction needed. If not, mark as usual
If so, select candidate blocks and evacuation target blocks. When tracing in that block, try to evacuate, fall back to mark
The neat thing that Immix adds is a way to compact the heap via opportunistic evacuation. As Immix allocates, it can end up skipping over holes and leaving them unpopulated, and as subsequent cycles of GC occur, it could be that a block ends up with many small holes. If that happens to many blocks it could be time to compact.
To fight fragmentation, Immix decides at the beginning of a GC cycle whether to try to compact or not. If things aren’t fragmented, Immix marks in place; it’s cheaper that way. But if compaction is needed, Immix selects a set of blocks needing evacuation and another set of empty blocks to evacuate into. (Immix has to keep around a couple percent of memory in empty blocks in reserve for this purpose.)
As Immix traverses the object graph, if it finds that an object is in a block that needs evacuation, it will try to evacuate instead of marking. It may or may not succeed, depending on how much space is available to evacuate into. Maybe it will succeed for all objects in that block, and you will be left with an empty block, which might even be given back to the OS.
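The "try to evacuate, fall back to mark" step can be sketched per edge as below. The object header fields and the `evac_target` bump region are hypothetical; real collectors pack this state into header bits and have to handle racing markers, which this sequential sketch does not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct object {
  struct object *forwarded;  /* non-NULL once evacuated */
  int marked;
  int in_candidate_block;    /* set by the collector before tracing */
  size_t size;               /* total size in bytes, header included */
};

struct evac_target {
  unsigned char *cursor, *limit;   /* bump region in reserved empty blocks */
};

/* Visit one edge. Returns the address the edge should hold afterwards. */
static struct object *trace_edge(struct object *obj, struct evac_target *t) {
  if (obj->forwarded) return obj->forwarded;   /* already moved */
  if (obj->marked) return obj;                 /* already marked in place */
  if (obj->in_candidate_block
      && (size_t)(t->limit - t->cursor) >= obj->size) {
    struct object *copy = (struct object *)t->cursor;
    memcpy(copy, obj, obj->size);
    t->cursor += obj->size;
    copy->marked = 1;
    copy->in_candidate_block = 0;
    obj->forwarded = copy;                     /* leave forwarding pointer */
    return copy;
  }
  obj->marked = 1;                             /* fall back: mark in place */
  return obj;
}
```

Because the fallback is always available, running out of evacuation space is not an error; the block just stays partly occupied until a later cycle.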
Opportunistic evacuation compatible with conservative roots!
Bump-pointer allocation
Compaction!
1 year ago: start work on WIP GC implementation
Tying this back to Guile, this gives us all of our desiderata: we can evacuate, but we don’t have to, allowing us to cause referents of conservative roots to be marked in place instead of moved; we can bump-pointer allocate; and we are back on the train of modern GC implementations. I could no longer restrain myself: I started hacking on a work-in-progress garbage collector workbench about a year ago, and ended up with something that seems to take us in the right direction.
Immix: 128B lines + mark bit in object
Whippet: 16B “lines”; mark byte in side table
More size overhead: 1/16 vs 1/128
Less fragmentation (1 live obj = 2 lines retained)
More alloc overhead? More small holes
What I ended up building wasn’t quite Immix. Guile’s object representation is very thin and doesn’t currently have space for a mark bit, for example, so I would have to have a side table of mark bits. (I could have changed Guile’s object representation but I didn’t want to require it.) I actually chose mark bytes instead of bits because both the Immix line marks and BDW’s own side table of marks were bytes, to allow for parallel markers to race when setting marks.
Then, given that you have a contiguous table of mark bytes, why not remove the idea of lines altogether? Or what amounts to the same thing, why not make the line size 16 bytes and do away with per-object mark bits? You can then bump-pointer into holes in the mark byte array. The only thing you need to do that is to be able to cheaply find the end of an object, so you can skip to the next hole while sweeping; you don’t want to have to chase pointers to do that. But consider, you’ve already paid the cost of having a mark byte associated with every possible start of an object, so if your basic object alignment is 16 bytes, that’s a memory overhead of 1/16, or 6.25%; OK. Let’s put that mark byte to work and include an “end” bit, indicating the end of the object. Allocating an object has to store into the mark byte array to initialize this “end” marker, but you need to write the mark byte anyway to allow for conservative roots (“does this address hold an object?”); writing the end at the same time isn’t so bad, perhaps.
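A sketch of how a per-granule metadata byte with an "end" bit might be used follows. The bit assignments (`BYTE_YOUNG`, `BYTE_MARK`, `BYTE_END`) and function names are assumptions made for this example and should not be read as Whippet's actual encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SIZE 16
#define GRANULES 32

/* Hypothetical bit assignments within a metadata byte. */
#define BYTE_YOUNG 0x01   /* an object starts here (freshly allocated) */
#define BYTE_MARK  0x02   /* an object starts here and was marked live */
#define BYTE_END   0x04   /* this granule is the last one of its object */

static uint8_t meta[GRANULES];   /* one byte per 16-byte granule */

/* Record an object of `granules` granules starting at granule `start`. */
static void record_object(size_t start, size_t granules, uint8_t state) {
  assert(granules > 0 && start + granules <= GRANULES);
  meta[start] |= state;
  meta[start + granules - 1] |= BYTE_END;
}

/* Given the first granule of an object, return the first granule after
   it, found by scanning for the end bit instead of reading the object. */
static size_t skip_object(size_t start) {
  size_t g = start;
  while (g < GRANULES - 1 && !(meta[g] & BYTE_END)) g++;
  return g + 1;
}

/* Conservative-root query: does an object start at this granule? */
static int is_object_start(size_t granule) {
  return granule < GRANULES
    && (meta[granule] & (BYTE_YOUNG | BYTE_MARK)) != 0;
}

/* Sweep from `from`: skip marked objects, then measure the following
   hole, clearing stale metadata of dead objects inside it. */
static size_t sweep_next_hole(size_t from, size_t *hole_start) {
  size_t g = from;
  while (g < GRANULES && (meta[g] & BYTE_MARK)) g = skip_object(g);
  *hole_start = g;
  size_t end = g;
  while (end < GRANULES && !(meta[end] & BYTE_MARK)) meta[end++] = 0;
  return end - g;
}
```

The hole size in bytes is the returned granule count times `GRANULE_SIZE`; one metadata byte per 16-byte granule is where the 1/16 overhead comes from.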
The expected outcome would be that relative to 128-byte lines, Whippet ends up with more, smaller holes. Such a block would be a prime target for evacuation, of course, but during allocation this is overhead. Or, it could be a source of memory efficiency; who knows. There is some science yet to do to properly compare this tactic to original Immix, but I don’t think I will get around to it.
While I am here and I remember these things, I need to mention two more details. If you read the Immix paper, it describes “conservative line marking”, which is related to how you find the end of an object; basically Immix always marks the line an object is on and the next one, in case the object spans the line boundary. Only objects larger than a line have to precisely mark the line mark array when they are traced. Whippet doesn’t do this because we have the end bit.
The other detail is the overflow allocator; in the original Immix paper, if you allocate an object that’s smallish but still larger than a line or two, but there’s no hole big enough in the block, Immix keeps around a completely empty block per mutator in which to bump-pointer-allocate these medium-sized objects. Whippet doesn’t do that either, instead relying on such failure to allocate in a block to cause fragmentation and thus hurry along the process of compaction.
Immix: “cheap” eager coarse sweep
Whippet: just-in-time lazy fine-grained sweep
Corollary: Data computed by sweep available when sweep complete
Live data at previous GC only known before next GC
Empty blocks discovered by sweeping
Having a fine-grained line mark array means that it’s no longer a win to do an eager sweep of all blocks after collecting. Instead Whippet applies the classic “lazy sweeping” optimization to make mutators sweep their blocks just before allocating into them. This introduces a delay in the collection algorithm: Whippet doesn’t find out about e.g. fragmentation until the whole heap is swept, but by the time we fully sweep the heap, we’ve exhausted it via allocation. It introduces a different flavor to the GC, not entirely unlike original Immix, but foreign.
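The delay that lazy sweeping introduces can be illustrated with a toy heap in which statistics accumulate only as mutators acquire blocks. The names `lazy_heap`, `finish_collection`, and `acquire_block` are hypothetical, and the sketch ignores threads entirely.

```c
#include <assert.h>
#include <stddef.h>

#define LINES_PER_BLOCK 4
#define BLOCKS 3

struct lazy_block {
  unsigned char marks[LINES_PER_BLOCK];  /* 1 = marked at last GC */
  int swept;
};

struct lazy_heap {
  struct lazy_block blocks[BLOCKS];
  size_t next_unswept;      /* next block to hand to a mutator */
  size_t free_lines_found;  /* only complete once every block is swept */
};

/* Called at the end of a collection: nothing has been swept yet. */
static void finish_collection(struct lazy_heap *h) {
  for (size_t i = 0; i < BLOCKS; i++) h->blocks[i].swept = 0;
  h->next_unswept = 0;
  h->free_lines_found = 0;
}

/* Called by a mutator that needs a block: sweep just that block now.
   Returns NULL when every block has been handed out (time to collect). */
static struct lazy_block *acquire_block(struct lazy_heap *h) {
  if (h->next_unswept == BLOCKS) return NULL;
  struct lazy_block *b = &h->blocks[h->next_unswept++];
  for (size_t i = 0; i < LINES_PER_BLOCK; i++)
    if (!b->marks[i]) h->free_lines_found++;
  b->swept = 1;
  return b;
}
```

In this sketch `free_lines_found` reaches its final value only at the moment `acquire_block` is about to return `NULL`, which mirrors the observation that the data computed by sweeping is complete only when the heap is exhausted.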
Compaction/defrag/pinning, heap shrinking, sticky-mark generational GC, threads/contention/allocation, ephemerons, precision, tools
Right! With that out of the way, let’s talk about what Whippet gives to Guile, relative to BDW-GC.
Heap-conservative tracing: no object moveable
Stack-conservative tracing: stack referents pinned, others not
Whippet: If whole-heap fragmentation exceeds threshold, evacuate most-fragmented blocks
Stack roots scanned first; marked instead of evacuated, implicitly pinned
Explicit pinning: bit in mark byte
If all edges in the heap are conservative, then you can’t move anything, because you don’t know if an edge is a pointer that can be updated or just a spicy integer. But most systems aren’t actually like this: you have conservative edges from the stack, but you can precisely enumerate intra-object edges on the heap. In that case, you have a known set of conservative edges, and you can simply visit those edges first, marking their referents in place instead of evacuating. (Marking an object instead of evacuating implicitly pins it for the duration of the current GC cycle.) Then you visit heap edges precisely, possibly evacuating objects.
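The two-phase ordering can be sketched on a toy object graph where each node has a single outgoing field. The `fate` enum and function names are invented for the example, and the actual copying is elided; only the decision of which objects may move is shown.

```c
#include <assert.h>
#include <stddef.h>

enum fate { UNVISITED, MARKED_IN_PLACE, EVACUATED };

struct node {
  enum fate fate;
  struct node *field;   /* one precise outgoing edge, or NULL */
};

/* Phase 1: conservative roots. The root word might be an integer, so
   it cannot be updated, and the referent must stay where it is. */
static void visit_conservative_root(struct node *n) {
  if (n->fate == UNVISITED) n->fate = MARKED_IN_PLACE;
}

/* Phase 2: precise heap edges reachable from `n`. Anything not already
   pinned by phase 1 may be moved (the copy itself is elided here). */
static void visit_precise_edges(struct node *n) {
  for (struct node *child = n->field; child; child = child->field) {
    if (child->fate != UNVISITED) break;   /* pinned or already visited */
    child->fate = EVACUATED;
  }
}
```

Running phase 1 to completion before phase 2 is what makes the pinning implicit: by the time a precise edge reaches a stack-referenced object, it is already marked and so is left in place.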
I should note that Whippet has a bit in the mark byte for use in explicitly pinning an object. I’m not sure how to manage who is responsible for setting that bit, or what the policy will be; the current idea is to set it for any object whose identity-hash value is taken. We’ll see.
Lazy sweeping finds empty blocks: potentially give back to OS
Need empty blocks? Do evacuating collection
Possibility to do http://marisa.moe/balancer.html
With the BDW collector, your heap can only grow; it will never shrink (unless you enable a non-default option and you happen to have verrry low fragmentation). But with Whippet and evacuation, we can rearrange objects so as to produce empty blocks, which can then be returned to the OS if so desired.
In one of my microbenchmarks I have the system allocating long-lived data, interspersed with garbage (objects that are dead after allocation) whose size is in a power-law distribution. This should produce quite some fragmentation, eventually, and it does. But then Whippet decides to defragment, and it works great! Since Whippet doesn’t keep a whole 2x reserve like a semi-space collector, it usually takes more than one GC cycle to fully compact the heap; usually about 3 cycles, from what I can see. I should do some more measurements here.
Of course, this is just mechanism; choosing the right heap sizing policy is a different question.
wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm
Card marking barrier (256B); compare to BDW mprotect / SIGSEGV
The Boehm collector also has a non-default mode in which it uses mprotect and a SIGSEGV handler to enable sticky-mark-bit generational collection. I haven’t done a serious investigation, but I see it actually increasing run-time by 20% on one of my microbenchmarks that is actually generation-friendly. I know that Azul’s C4 collector used to use page protection tricks but I can only assume that BDW’s algorithm just doesn’t work very well. (BDW’s page barriers have another purpose, to enable incremental collection, in which marking is interleaved with allocation, but this mode is off if parallel markers are supported, and I don’t know how well it works.)
Anyway, it seems we can do better. The ideal would be a semi-space nursery, which is the usual solution, but because of conservative roots we are limited to the sticky mark-bit algorithm. Some benchmarks aren’t very generation-friendly; the first pair of bars in the chart above shows the mt-gcbench microbenchmark running with and without generational collection, and there’s no difference. But in the second, for the quads benchmark, we see a 2x speedup or so.
Of course, to get generational collection to work, we require mutators to use write barriers, which are little bits of code that run when an object is mutated that tell the GC where it might find links from old objects to new objects. Right now in Guile we don’t do this, but this benchmark shows what can happen if we do.
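A card-marking write barrier with 256-byte cards might look like the following sketch. The toy heap, the unconditional marking, and the names `write_field` and `dirty_card_count` are assumptions for illustration; a real barrier is usually inlined and may filter on the age of the object being written to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CARD_SIZE 256
#define HEAP_SIZE 4096
#define CARDS (HEAP_SIZE / CARD_SIZE)

/* Toy heap made of pointer-sized fields, so field stores are aligned. */
static void *toy_heap[HEAP_SIZE / sizeof(void *)];
static uint8_t card_table[CARDS];   /* 1 = card may hold old-to-new edges */

/* Store `value` into `field`, which must lie inside toy_heap, and
   remember which card was written so a minor GC can rescan it. */
static void write_field(void **field, void *value) {
  size_t offset =
    (size_t)((unsigned char *)field - (unsigned char *)toy_heap);
  assert(offset < HEAP_SIZE);
  card_table[offset / CARD_SIZE] = 1;
  *field = value;
}

/* Number of cards a minor collection would need to scan. */
static size_t dirty_card_count(void) {
  size_t n = 0;
  for (size_t i = 0; i < CARDS; i++) n += card_table[i];
  return n;
}
```

A minor collection then only needs to scan the dirty cards for references to new objects, rather than the whole old generation.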
BDW: TLS segregated-size freelists, lock to refill freelists, SIGPWR for stop
Whippet: thread-local block, sweep without contention, wait-free acquisition of next block, safepoints to stop with ragged marking
Both: parallel markers
Another thing Whippet can do better than BDW is performance when there are multiple allocating threads. The Immix heap organization facilitates minimal coordination between mutators, and maximum locality for each mutator. Sweeping is naturally parallelized according to how many threads are allocating. For BDW, on the other hand, every time a mutator needs to refill its thread-local free lists, it grabs a global lock; sweeping is lazy but serial.
Here’s a chart showing whippet versus BDW on one microbenchmark. On the X axis I add more mutator threads; each mutator does the same amount of allocation, so I’m increasing the heap size also by the same factor as the number of mutators. For simplicity I’m running both whippet and BDW with a single marker thread, so I expect to see a linear increase in elapsed time as the heap gets larger (as with 4 mutators there are roughly 4 times the number of live objects to trace). This test is run on a Xeon Silver 4114, taskset to free cores on a single socket.
What we see is that as I add workers, elapsed time increases linearly for both collectors, but more steeply for BDW. I think (but am not sure) that this is because whippet effectively parallelizes sweeping and allocation, whereas BDW has to contend over a global lock to sweep and refill free lists. Both have the linear factor of tracing the object graph, but BDW has the additional linear factor of sweeping, whereas whippet scales with mutator count.
Incidentally you might notice that at 4 mutator threads, BDW randomly crashed, when constrained to a fixed heap size. I have noticed that if you fix the heap size, BDW sometimes (and somewhat randomly) fails. I suspect the crash is due to fragmentation and inability to compact, but who knows; multiple threads allocating is a source of indeterminism. Usually when you run BDW you let it choose its own heap size, but for these experiments I needed to have a fixed heap size instead.
Another measure of scalability is, how does the collector do as you add marker threads? This chart shows that for both collectors, runtime decreases as you add threads. It also shows that whippet is significantly slower than BDW on this benchmark, which is Very Weird, and I didn’t have access to the machine on which these benchmarks were run when preparing the slides in the train... so, let’s call this chart a good reminder that Whippet is a WIP :)
While in the train to Brussels I re-ran this test on the 4-core laptop I had on hand, and got the results that I expected: that whippet performed similarly to BDW, and that adding markers improved things, albeit marginally. Perhaps I should look at a different microbenchmark.
Incidentally, when you configure Whippet for parallel marking at build-time, it uses a different implementation of the mark stack when compared to the serial marker, even when only 1 marker is enabled. Certainly the parallel marker could use some tuning.
BDW: No ephemerons
Whippet: Yes
Another deep irritation I have with BDW is that it doesn’t support ephemerons. In Guile we have a number of facilities (finalizers, guardians, the symbol table, weak maps, et al) built on what BDW does have (finalizers, weak references), but the implementations of these facilities in Guile are hacky, slow, sometimes buggy, and don’t compose (try putting an object in a guardian and giving it a finalizer to see what I mean). It would be much better if the collector API supported ephemerons natively, specifying their relationship to finalizers and other facilities, allowing us to build what we need in terms of those primitives. With our own GC, we can do that, and do it in such a way that it doesn’t depend on the details of the specific collection algorithm. The exception of course is that as BDW doesn’t support ephemerons per se, what we get is actually a weak-key association instead, whose value can keep the key alive. Oh well, it’s no worse than the current situation.
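To make the difference between an ephemeron and a weak-key association concrete, here is a minimal sketch of ephemeron tracing as a fixpoint. It is a toy model, not Whippet's implementation: objects are just indexes into a mark table, they have no outgoing edges of their own, and the names `toy_ephemeron` and `trace_ephemerons` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical toy model: an object is an index into a mark table, and
// objects have no outgoing edges other than through ephemerons.
struct toy_ephemeron {
  size_t key;
  size_t value;
};

// Mark the value of every ephemeron whose key is already marked.  Repeat
// until nothing changes, because marking one value can make another
// ephemeron's key live.
static void trace_ephemerons(uint8_t *marked,
                             const struct toy_ephemeron *ephemerons,
                             size_t count) {
  assert(marked != NULL);
  assert(ephemerons != NULL || count == 0);
  int changed = 1;
  while (changed) {
    changed = 0;
    for (size_t i = 0; i < count; i++) {
      const struct toy_ephemeron *e = &ephemerons[i];
      if (marked[e->key] && !marked[e->value]) {
        marked[e->value] = 1;
        changed = 1;
      }
    }
  }
}
```

In this model an ephemeron whose value is its own key never gets marked unless something else reaches the key, which is exactly the case a weak-key association gets wrong.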
BDW: ~Always stack-conservative, often heap-conservative
Whippet: Fully configurable (at compile-time)
Guile in mid/near-term: C stack conservative, Scheme stack precise, heap precise
Possibly fully precise: unlock semi-space nursery
Conservative tracing is a fundamental design feature of the BDW collector, both of roots and of intra-heap edges. You can tell BDW how to trace specific kinds of heap values, but the default is to do a conservative scan, and the stack is always scanned conservatively. In contrast, these tradeoffs are all configurable in Whippet. You can scan the stack and heap precisely, or stack conservatively and heap precisely, or vice versa (though that doesn’t make much sense), or both conservatively.
The long-term future in Guile is probably to continue to scan the C stack conservatively, to continue to scan the Scheme stack precisely (even with BDW-GC, the Scheme compiler emits stack maps and installs a custom mark routine), but to scan the heap as precisely as possible. It could be that a user uses some of our hoary ancient APIs to allocate an object that Whippet can’t trace precisely; in that case we’d have to disable evacuation / object motion, but we could still trace other objects precisely.
If Guile ever moved to a fully precise world, that would be a boon for performance, in two ways: first that we would get the ability to use a semi-space nursery instead of the sticky-mark-bit algorithm, and relatedly that we wouldn’t need to initialize mark bytes when allocating objects. Second, we’d gain the option to use must-move algorithms for the old space as well (mark-compact, semi-space) if we wanted to. But it’s just an option, one that Whippet opens up for us.
Can build heap tracers and profilers more easily
More hackable
(BDW-GC has as many preprocessor directives as whippet has source lines)
Finally, relative to BDW-GC, whippet has a more intangible advantage: I can actually hack on it. Just as an indication, 15% of BDW source lines are pre-processor directives, and there is one file that has like 150 #ifdef’s, not counting #elif’s, many of them nested. I haven’t done all that much to BDW itself, but I personally find it excruciating to work on.
Hackability opens up the possibility to build more tools to help us diagnose memory use problems. They aren’t in Whippet yet, but there can be!
Embed-only, abstractions, migration, modern; timeline
OK, that rounds out the comparison between BDW and Whippet, at least on a design level. Now I have a few words about how to actually get this new collector into Guile without breaking the bug budget. I try to arrange my work areas on Guile in such a way that I spend a minimum of time on bugs. Part of my strategy is negligence, I will admit, but part also is anticipating problems and avoiding them ahead of time, even if it takes more work up front.
Semi: 6 kB; Whippet: 22 kB; BDW: 184 kB
Compile-time specialization:
Built apart, but with LTO to remove library overhead
So the BDW collector is typically shipped as a shared library that you dynamically link to. I should say that we’ve had an overall good experience with upgrading BDW-GC in the past; its maintainer (Ivan Maidanski) does a great and responsible job on a hard project. It’s been many, many years since we had a bug in BDW-GC. But still, BDW is a dependency, and all things being equal we prefer to remove moving parts.
The approach that Whippet is taking is to be an embed-only library: it’s designed to be compiled into your project. It’s not an include-only library; it still has to be compiled, but with link-time-optimization and a judicious selection of fast-path interfaces, Whippet is mostly able to avoid abstractions being a performance barrier.
The result is that Whippet is small, both in source and in binary, which minimizes its maintenance overhead. Taking additional stripped optimized binary size as the metric, by my calculations a semi-space collector (with a large object space and ephemeron support) takes about 6 kB of object file size, whereas Whippet takes 22 kB and BDW takes 184 kB. Part of how Whippet gets so small is that it is configured in major ways at compile-time (choice of main GC algorithm), and specialized against the program it’s embedded in (e.g. how to patch in a forwarding pointer). Having all of the API be internal and visible to LTO instead of going through ELF symbol resolution helps in a minor way as well.
User API abstracts over GC algorithm, e.g. semi-space or whippet
Expose enough info to allow JIT to open-code fast paths
Inspired by mmtk.io
Abstractions permit change: of algorithm, over time
From a composition standpoint, Whippet is actually a few things. Firstly there is an abstract API to make a heap, create per-thread mutators for a heap, and allocate objects for a mutator. There is the aforementioned embedder API, for having the embedding program indicate how to trace objects and install forwarding pointers. Then there is some common code (for example ephemeron support). There are implementations of the different spaces: semi-space, large object, whippet/immix; and finally collector implementations that tie together the spaces into a full implementation of the abstract API. (In practice the more iconic spaces are intertwingled with the collector implementations they define.)
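As a rough illustration of the heap / mutator / allocate layering, here is a single-threaded sketch with a bump-pointer fast path and an out-of-line slow path. Every name in it (`toy_heap`, `toy_mutator`, `toy_allocate`, `CHUNK_SIZE`) is hypothetical and chosen for this example; it is not Whippet's actual API, and a real slow path would collect rather than just give up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// All names are hypothetical; this is not Whippet's API.
enum { CHUNK_SIZE = 256, ALIGNMENT = 8 };

struct toy_heap {
  uint8_t *next_chunk;  // next unclaimed byte of backing storage
  size_t unclaimed;     // bytes of backing storage not yet handed out
};

struct toy_mutator {
  struct toy_heap *heap;
  uint8_t *hp;          // bump pointer into the mutator's current chunk
  size_t remaining;     // bytes left in the current chunk
};

static void toy_heap_init(struct toy_heap *heap, uint8_t *storage, size_t size) {
  heap->next_chunk = storage;
  heap->unclaimed = size;
}

static void toy_mutator_init(struct toy_mutator *mut, struct toy_heap *heap) {
  mut->heap = heap;
  mut->hp = NULL;
  mut->remaining = 0;
}

// Slow path: claim a fresh chunk from the heap.  A real collector would
// trigger a collection here instead of returning NULL, and would need
// synchronization if several mutators shared the heap.
static void *toy_allocate_slow(struct toy_mutator *mut, size_t size) {
  struct toy_heap *heap = mut->heap;
  if (heap->unclaimed < CHUNK_SIZE) return NULL;
  uint8_t *chunk = heap->next_chunk;
  heap->next_chunk += CHUNK_SIZE;
  heap->unclaimed -= CHUNK_SIZE;
  mut->hp = chunk + size;
  mut->remaining = CHUNK_SIZE - size;
  return chunk;
}

// Fast path: small enough for a compiler (or a JIT) to open-code.
static inline void *toy_allocate(struct toy_mutator *mut, size_t size) {
  assert(size > 0 && size <= CHUNK_SIZE);
  size = (size + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
  if (mut->remaining >= size) {
    void *ret = mut->hp;
    mut->hp += size;
    mut->remaining -= size;
    return ret;
  }
  return toy_allocate_slow(mut, size);
}
```

The point of the split is that the embedder only ever sees `toy_allocate`; what sits behind the slow path can change without the embedder noticing.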
I don’t think I would have gone down this route without seeing some prior work, for example libpas, but it was really MMTk that convinced me that it was worth spending a little time thinking about the GC not as a structureless blob but as a system made of parts and exposing a minimal interface. In particular, I was inspired by seeing that MMTk is able to get good performance while also being abstract, exposing representation details such as how to tell a JIT compiler about allocation fast-paths, but in a principled way. So, thanks MMTk people, for this and so many things!
I’m particularly happy that the API is abstract enough that it frees up not only the garbage collector to change implementations, but also Guile and other embedders, in that they don’t have to bake in a dependency on specific collectors. The semi-space collector has been particularly useful here in ensuring that the abstractions don’t accidentally rely on support for object pinning.
API implementable by BDW-GC (except ephemerons)
First step for Guile: BDW behind Whippet API
Then switch to whippet/immix (by default)
The collector API can actually be implemented by the BDW collector. Whippet includes a collector that is a thin wrapper around the BDW API, with support for fast-path allocation via thread-local freelists. In this way we can always check the performance of any given collector against an external fixed point (BDW) as well as a theoretically known point (the semi-space collector).
Indeed I think the first step for Guile is precisely this: refactor Guile to allocate through the Whippet API, but using the BDW collector as the implementation. This will ensure that the Whippet API is sufficient, and then allow an incremental switch to other collectors.
Incidentally, when it comes to integrating Whippet, there are some choices to be made. I mentioned that it’s quite configurable, and this chart can give you some idea. On the left side is one microbenchmark (mt-gcbench) and on the right is another (quads). The first generates a lot of fragmentation and has a wide range of object sizes, including some very large objects. The second is very uniform and many allocations die young.
(I know these images are small; right-click to open in new tab or pinch to zoom to see more detail.)
Within each set of bars we have 10 different scenarios, corresponding to different Whippet configurations. (All of these tests are run on my old 4-core laptop with 4 markers if parallel marking is supported, and a 2x heap.)
The first bar in each side is serial whippet: one marker. Then we see parallel whippet: four markers. Great. Then there’s generational whippet: one marker, but just scanning objects allocated in the current cycle, hoping that produces enough holes. Then generational parallel whippet: the same as before, but with 4 markers.
The next 4 bars are the same: serial, parallel, generational, parallel-generational, but with one difference: the stack is scanned conservatively instead of precisely. You might be surprised but all of these configurations actually perform better than their precise counterparts. I think the reason is that the microbenchmark uses explicit handle registration and deregistration (it’s a stack) instead of compiler-generated stack maps in a side table, but I’m not precisely (ahem) sure.
Finally the next 2 bars are serial and parallel collectors, but marking everything conservatively. I have generational measurements for this configuration but it really doesn’t make much sense to assume that you can emit write barriers in this context. These runs are slower than the previous configuration, mostly because there are some non-pointer locations that get scanned conservatively that wouldn’t get scanned precisely. I think conservative heap scanning is less efficient than precise but I’m honestly not sure, there are some instruction locality arguments in the other direction. For mt-gcbench though there’s a big array of floating-point values that a precise scan will omit, which causes significant overhead there. Probably for this configuration to be viable Whippet would need the equivalent of BDW’s API to allocate known-pointerless objects.
stdatomic
constexpr-ish
pthreads (for parallel markers)
No void*; instead struct types: gc_ref, gc_edge, gc_conservative_ref, etc
Embed-only lib avoids any returns-struct-by-value ABI issue
Rust? MMTk; supply chain concerns
Platform abstraction for conservative root finding
I know it’s a sin, but Whippet is implemented in C. I know. The thing is, in the Guile context I need to not introduce wild compile-time dependencies, because of bootstrapping. And I know that Rust is a fine language to use for GC implementation, so if that’s what you want, please do go take a look at MMTk! It’s a fantastic project, written in Rust, and it can just slot into your project, regardless of the language your project is written in.
But if what you’re looking for is something in C, well then you have to pick and choose your C. In the case of Whippet I try to use the limited abilities of C to help prevent bugs; for example, I generally avoid void* and instead wrap pointers or addresses into single-field structs that can’t be automatically cast, for example to prevent a struct gc_ref that denotes an object reference (or NULL; it’s an option type) from being confused with a struct gc_conservative_ref, which might not point to an object at all.
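A minimal sketch of the single-field-struct idea follows. The struct names come from the slide above, but the field layout, the helper functions, and the bounds-and-alignment check are my assumptions for illustration, not a copy of Whippet's headers.

```c
#include <assert.h>
#include <stdint.h>

// Assumed layouts for illustration only.
struct gc_ref { uintptr_t value; };               // an object address, or 0
struct gc_conservative_ref { uintptr_t value; };  // a word that might be a pointer

// Hypothetical constructors and predicates.
static inline struct gc_ref gc_ref_from_addr(uintptr_t addr) {
  return (struct gc_ref){ addr };
}
static inline struct gc_ref gc_ref_null(void) {
  return gc_ref_from_addr(0);
}
static inline int gc_ref_is_null(struct gc_ref ref) {
  return ref.value == 0;
}
static inline struct gc_conservative_ref
gc_conservative_ref_from_word(uintptr_t word) {
  return (struct gc_conservative_ref){ word };
}

// The only way from a conservative ref to an object ref is through an
// explicit check.  Here the check is a simple bounds and alignment test
// against a hypothetical heap range; a real collector would also verify
// that the address is the start of an object.
static inline struct gc_ref
resolve_conservative_ref(struct gc_conservative_ref ref,
                         uintptr_t heap_base, uintptr_t heap_limit) {
  assert(heap_base <= heap_limit);
  if (ref.value < heap_base || ref.value >= heap_limit)
    return gc_ref_null();
  if (ref.value % 8 != 0)
    return gc_ref_null();
  return gc_ref_from_addr(ref.value);
}
```

Because the two structs are distinct types, passing a `struct gc_conservative_ref` where a `struct gc_ref` is expected is a compile-time error, which a pair of `void*` or `uintptr_t` values would not give you.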
(Of course, by “C” I mean “C as compiled by gcc and clang with -fno-strict-aliasing”. I don’t know if it’s possible to implement even a simple semi-space collector in C without aliasing violations. Can you access a Foo* object within a mmap‘d heap through its new address after it has been moved via memcpy? Maybe not, right? Thoughts are welcome.)
As a project written in the 2020s instead of the 1990s, Whippet gets to assume a competent C compiler, for example relying on the compiler to inline and fold branches where appropriate. As in libpas, Whippet liberally passes functions as values to inline functions, and relies on the compiler to boil away function calls. Whippet only uses the C preprocessor when it absolutely has to.
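Here is a small sketch of the functions-as-values style, with made-up names (`struct pair`, `trace_pair`, `count_non_null`). When everything is `static inline` and visible in one translation unit, a modern optimizing compiler can typically inline the visitor and remove the indirect call, though the C standard does not guarantee it.

```c
#include <assert.h>
#include <stddef.h>

// A hypothetical embedder object with two traceable fields.
struct pair {
  void *car;
  void *cdr;
};

typedef void (*visit_fn)(void **edge, void *data);

// Generic tracer: takes the visitor as a value instead of being
// specialized by preprocessor macros.
static inline void trace_pair(struct pair *p, visit_fn visit, void *data) {
  assert(p != NULL);
  visit(&p->car, data);
  visit(&p->cdr, data);
}

// One possible visitor: count the non-null edges.
static inline void count_non_null(void **edge, void *data) {
  if (*edge)
    ++*(size_t *)data;
}

static size_t count_pair_edges(struct pair *p) {
  size_t count = 0;
  trace_pair(p, count_non_null, &count);
  return count;
}
```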
Finally, there is a clean abstraction for anything that’s platform-specific, for example finding the current stack bounds. I haven’t compiled this code on Windows or macOS yet, but I am not anticipating too many troubles.
As time permits
Whippet TODO: heap growth/shrinking, finalizers, safepoint API
Guile TODO: safepoints; heap-conservative first
Precise heap TODO: gc_trace_object, SMOBs, user structs with raw ptr fields, user gc_malloc usage; 3.2
6 months for 3.1.1; 12 for 3.2.0 ?
So where does this get us? Where are we now?
For Whippet itself, I think it’s mostly done – enough to start shifting focus to some different phase. It’s missing some needed features, notably the ability to grow the heap at all, as I’ve been in fixed-heap-size-only mode during development. It’s also missing finalizers. And, something needs to be done to unify Guile’s handling of safepoints and processing of asynchronous signals with Whippet’s need to stop all mutators. Some details remain.
But, I think we are close to ready to start integrating in Guile. At first this is just porting Guile to use the Whippet API to access BDW instead of using BDW directly. This whole thing is a side project for me that I work on when I can, so it doesn’t exactly proceed at full pace. Perhaps this takes 6 months. Then we can cut a new unstable release, and hopefully release 3.2 with support for the Immix-flavored collector in another 6 or 9 months.
I thought that we would be forced to make ABI changes, if only because some legacy APIs assume conservative tracing of object contents. But after a discussion at FOSDEM with Carlo Piovesan I realized this isn’t true: because the decision to evacuate or not is made on a collection-by-collection basis, I could simply disable evacuation if the user ever uses a facility that might prohibit object motion, for example if they ever define a SMOB type. If the user wants evacuation, they need to be more precise with their data types, but either way Guile is ready.
An Immix-derived GC
https://wingolog.org/tags/gc/
Guile 3.2 ?
Thanks to MMTk authors for inspiration!
And that’s it! Thanks for reading all the way here. Comments are quite welcome.
As I mentioned in the very beginning, this talk was really about Whippet in the context of Guile. There is a different talk to be made about Guile+Whippet versus other language implementations, for example those with concurrent marking or semi-space nurseries or the like. Yet another talk is Whippet in the context of other GC algorithms. But this is a start. It’s something I’ve been working on for a while now already and I’m pleased that it’s gotten to a point where it seems to be at least OK, at least an improvement with respect to BDW-GC in some ways.
But before leaving you, another chart, to give a more global idea of the state of things. Here we compare a single mutator thread performing a specific microbenchmark that makes trees and also lots of fragmentation, across three different GC implementations and a range of heap sizes. The heap size multipliers in this and in all the other tests in this post are calculated analytically based on what the test thinks its maximum heap size should be, not by measuring minimum heap sizes that work. This size is surely lower than the actual maximum required heap size due to internal fragmentation, but the tests don’t know about this.
The three collectors are BDW, a semi-space collector, and whippet. Semi-space manages to squeeze in less than 2x of a heap multiplier because it has (and whippet has) a separate large object space that isn’t ever evacuated.
What we expect is that tighter heaps impose more GC time, and indeed we see that times are higher on the left side than the right.
Whippet is the only implementation that manages to run at a 1.3x heap, but it takes some time. It’s slower than BDW at a 1.5x heap but better from there on out, until what appears to be a bug or pathology makes it take longer at 5x. Adding memory should always decrease run time.
The semi-space collector starts working at 1.75x and then surpasses all collectors from 2.5x onwards. We expect the semi-space collector to win for big heaps, because its overhead is proportional to live data only, whereas mark-sweep and mark-region collectors have to sweep, which is proportional to heap size, and indeed that’s what we see.
I think this chart shows we have some tuning yet to do. The range between 2x and 3x is quite acceptable, but we need to see what’s causing Whippet to be slower than BDW at 1.5x. I haven’t done as much performance tuning as I would like to but am happy to finally be able to know where we stand.
And that’s it! Happy hacking, friends, and may your heap sizes be ever righteous.
sweeping, coarse and lazy
One of the things that had perplexed me about the Immix collector was how to effectively defragment the heap via evacuation while keeping just 2-3% of space as free blocks for an evacuation reserve. The original Immix paper states:
To evacuate the object, the collector uses the same allocator as the mutator, continuing allocation right where the mutator left off. Once it exhausts any unused recyclable blocks, it uses any completely free blocks. By default, immix sets aside a small number of free blocks that it never returns to the global allocator and only ever uses for evacuating. This headroom eases defragmentation and is counted against immix's overall heap budget. By default immix reserves 2.5% of the heap as compaction headroom, but [...] is fairly insensitive to values ranging between 1 and 3%.
To Immix, a "recyclable" block is partially full: it contains surviving data from a previous collection, but also some holes in which to allocate. But when would you have recyclable blocks at evacuation-time? Evacuation occurs as part of collection. Collection usually occurs when there's no more memory in which to allocate. At that point any recyclable block would have been allocated into already, and won't become recyclable again until the next trace of the heap identifies the block's surviving data. Of course after the next trace they could become "empty", if no object survives, or "full", if all lines have survivor objects.
In general, after a full allocation cycle, you don't know much about the heap. If you could easily know where the live data and the holes were, a garbage collector's job would be much easier :) Any algorithm that starts from the assumption that you know where the holes are can't be used before a heap trace. So, I was not sure what the Immix paper is meaning here about allocating into recyclable blocks.
Thinking on it again, I realized that Immix might trigger collection early sometimes, before it has exhausted the previous cycle's set of blocks in which to allocate. As we discussed earlier, there is a case in which you might want to trigger an early compaction: when a large object allocator runs out of blocks to decommission from the immix space. And if one evacuating collection didn't yield enough free blocks, you might trigger the next one early, reserving some recyclable and empty blocks as evacuation targets.
when do you know what you know: lazy and eager
Consider a basic question, such as "how many bytes in the heap are used by live objects". In general you don't know! Indeed you often never know precisely. For example, concurrent collectors often have some amount of "floating garbage" which is unreachable data but which survives across a collection. And of course you don't know the difference between floating garbage and precious data: if you did, you would have collected the garbage.
Even the idea of "when" is tricky in systems that allow parallel mutator threads. Unless the program has a total ordering of mutations of the object graph, there's no one timeline with respect to which you can measure the heap. Still, Immix is a stop-the-world collector, and since such collectors synchronously trace the heap while mutators are stopped, these are times when you can exactly compute properties about the heap.
Let's retake the question of measuring live bytes. For an evacuating semi-space, knowing the number of live bytes after a collection is trivial: all survivors are packed into to-space. But for a mark-sweep space, you would have to compute this information. You could compute it at mark-time, while tracing the graph, but doing so takes time, which means delaying the time at which mutators can start again.
Alternately, for a mark-sweep collector, you can compute free bytes at sweep-time. This is the phase in which you go through the whole heap and return any space that wasn't marked in the last collection to the allocator, allowing it to be used for fresh allocations. This is the point in the garbage collection cycle in which you can answer questions such as "what is the set of recyclable blocks": you know what is garbage and you know what is not.
Though you could sweep during the stop-the-world pause, you don't have to; sweeping only touches dead objects, so it is correct to allow mutators to continue and then sweep as the mutators run. There are two general strategies: spawn a thread that sweeps as fast as it can (concurrent sweeping), or make mutators sweep as needed, just before they allocate (lazy sweeping). But this introduces a lag between when you know and what you know—your count of total live heap bytes describes a time in the past, not the present, because mutators have moved on since then.
For most collectors with a sweep phase, deciding between eager (during the stop-the-world phase) and deferred (concurrent or lazy) sweeping is very easy. You don't immediately need the information that sweeping allows you to compute; it's quite sufficient to wait until the next cycle. Moving work out of the stop-the-world phase is a win for mutator responsiveness (latency). Usually people implement lazy sweeping, as it is naturally incremental with the mutator, naturally parallel for parallel mutators, and any sweeping overhead due to cache misses can be mitigated by immediately using swept space for allocation. The case for concurrent sweeping is less clear to me, but if you have cores that would otherwise be idle, sure.
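To make lazy sweeping concrete, here is a toy sketch in which the mutator sweeps only as far as it needs to in order to satisfy each allocation. The heap is a handful of fixed-size cells with one mark byte each; `lazy_heap`, `lazy_alloc` and `CELL_COUNT` are names invented for this example, and clearing survivor marks during the sweep is just one possible design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CELL_COUNT = 8 };

// Hypothetical toy heap of fixed-size cells.
struct lazy_heap {
  uint8_t marked[CELL_COUNT];  // set by the last trace
  size_t sweep_cursor;         // cells before the cursor have been swept
};

// Allocate one cell, sweeping forward only until a dead cell is found.
// Returns the cell index, or -1 once the sweep has covered the whole heap,
// which is the point at which a real collector would start a new cycle.
static ptrdiff_t lazy_alloc(struct lazy_heap *heap) {
  assert(heap != NULL);
  while (heap->sweep_cursor < CELL_COUNT) {
    size_t cell = heap->sweep_cursor++;
    if (!heap->marked[cell])
      return (ptrdiff_t)cell;   // dead in the last trace: reuse it now
    heap->marked[cell] = 0;     // survivor: clear its mark and skip it
  }
  return -1;
}
```

Note that at no point before the cursor reaches the end does this heap know how many free cells it has, which is the lag between "when you know" and "what you know" described above.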
eager coarse sweeping
Immix is interesting in that it chooses to sweep eagerly, during the stop-the-world phase. Instead of sweeping irregularly-sized objects, however, it sweeps over its "line mark" array: one byte for each 128-byte "line" in the mark space. For 32 kB blocks, there will be 256 bytes per block, and line mark bytes in each 4 MB slab of the heap are packed contiguously. Therefore you get relatively good locality, but this just mitigates a cost that other collectors don't have to pay. So what does eager sweeping over these coarse 128-byte regions buy Immix?
Firstly, eager sweeping buys you eager identification of empty blocks. If your large object space needs to steal blocks from the mark space, but the mark space doesn't have enough empties, it can just trigger collection and then it knows if enough blocks are available. If no blocks are available, you can grow the heap or signal out-of-memory. If the lospace (large object space) runs out of blocks before the mark space has used all recyclable blocks, that's no problem: evacuation can move the survivors of fragmented blocks into these recyclable blocks, which have also already been identified by the eager coarse sweep.
Without eager empty block identification, if the lospace runs out of blocks, firstly you don't know how many empty blocks the mark space has. Sweeping is a kind of wavefront that moves through the whole heap; empty blocks behind the wavefront will be identified, but those ahead of the wavefront will not. Such a lospace allocation would then have to either wait for a concurrent sweeper to advance, or perform some lazy sweeping work. The expected latency of a lospace allocation would thus be higher, without eager identification of empty blocks.
Secondly, eager sweeping might reduce allocation overhead for mutators. If allocation just has to identify holes and not compute information or decide on what to do with a block, maybe it go brr? Not sure.
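The coarse sweep discussed in this section can be sketched as a pass over one block's line mark bytes that classifies the block as empty, recyclable or full. The sizes are the ones given above (128-byte lines, 32 kB blocks); the function and enum names are hypothetical, and the sketch leaves out details such as Immix's conservative handling of objects that straddle lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
  LINE_SIZE = 128,
  BLOCK_SIZE = 32 * 1024,
  LINES_PER_BLOCK = BLOCK_SIZE / LINE_SIZE  // 256 line mark bytes per block
};

enum block_state { BLOCK_EMPTY, BLOCK_RECYCLABLE, BLOCK_FULL };

// Which line mark byte covers a given byte offset within a block.
static inline size_t line_index(size_t offset_in_block) {
  assert(offset_in_block < BLOCK_SIZE);
  return offset_in_block / LINE_SIZE;
}

// Sweep one block by looking only at its line marks, never at objects.
static enum block_state
sweep_block_coarse(const uint8_t line_marks[LINES_PER_BLOCK],
                   size_t *free_lines) {
  size_t free_count = 0;
  for (size_t i = 0; i < LINES_PER_BLOCK; i++)
    if (!line_marks[i])
      free_count++;
  *free_lines = free_count;
  if (free_count == LINES_PER_BLOCK) return BLOCK_EMPTY;
  if (free_count == 0) return BLOCK_FULL;
  return BLOCK_RECYCLABLE;
}
```

Each block costs 256 byte loads regardless of how many objects it holds, which is why this can plausibly run inside the pause.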
lines, lines, lines
The original Immix paper also notes a relative insensitivity of the collector to line size: 64 or 256 bytes could have worked just as well. This was a somewhat surprising result to me but I think I didn't appreciate all the roles that lines play in Immix.
Obviously line size affects the worst-case fragmentation, though this is mitigated by evacuation (which evacuates objects, not lines). This I got from the paper. In this case, smaller lines are better.
Line size affects allocation-time overhead for mutators, though which way I don't know: scanning for holes will be easier with fewer lines in a block, but smaller lines would contain more free space and thus result in fewer collections. I can only imagine though that with smaller line sizes, average hole size would decrease and thus medium-sized allocations would be harder to service. Something of a wash, perhaps.
However if we ask ourselves the thought experiment, why not just have 16-byte lines? How crazy would that be? I think the impediment to having such a precise line size would mainly be Immix's eager sweep, as a fine-grained traversal of the heap would process much more data and incur possibly-unacceptable pause time overheads. But, in such a design you would do away with some other downsides of coarse-grained lines: a side table of mark bytes would make the line mark table redundant, and you eliminate much possible "dark matter" hidden by internal fragmentation in lines. You'd need to defer sweeping. But then you lose eager identification of empty blocks, and perhaps also the ability to evacuate into recyclable blocks. What would such a system look like?
Readers that have gotten this far will be pleased to hear that I have made some investigations in this area. But, this post is already long, so let's revisit this in another dispatch. Until then, happy allocations in all regions.
Good evening, gentle hackfolk. Last time we talked about heuristics for when you might want to compact a heap. Compacting garbage collection is nice and tidy and appeals to our orderly instincts, and it enables heap shrinking and reallocation of pages to large object spaces and it can reduce fragmentation: all very good things. But evacuation is more expensive than just marking objects in place, and so a production garbage collector will usually just mark objects in place, and only compact or evacuate when needed.
Today's post is more details!
dedication
Just because it's been, oh, a couple decades, I would like to reintroduce a term I learned from Marnanel years ago on advogato, a nerdy group blog kind of a site. As I recall, there is a word that originates in the Oxbridge social environment, "narg", from "Not A Real Gentleman", and which therefore denotes things that not-real-gentlemen do: nerd out about anything that's not, like, fox-hunting or golf; or generally spending time on something not because it will advance you in conventional hierarchies but because you just can't help it, because you love it, because it is just your thing. Anyway, in the spirit of pursuits that are really not What One Does With One's Time, this post is dedicated to the word "nargery".
side note, bis: immix-style evacuation versus mark-compact
In my last post I described Immix-style evacuation, and noted that it might take a few cycles to fully compact the heap, and that it has a few pathologies: the heap might never reach full compaction, and that Immix might run out of free blocks in which to evacuate.
With these disadvantages, why bother? Why not just do a single mark-compact pass and be done? I implicitly asked this question last time but didn't really answer it.
For some people the answer will be, yep, yebo, mark-compact is the right choice. And yet, there are a few reasons that one might choose to evacuate a fraction of the heap instead of compacting it all at once.
The first reason is object pinning. Mark-compact systems assume that all objects can be moved; you can't usefully relax this assumption. Most algorithms "slide" objects down to lower addresses, squeezing out the holes, and therefore every live object's address needs to be available to use when sliding down other objects with higher addresses. And yet, it would be nice sometimes to prevent an object from being moved. This is the case, for example, when you grant a foreign interface (e.g. a C function) access to a buffer: if garbage collection happens while in that foreign interface, it would be nice to be able to prevent garbage collection from moving the object out from under the C function's feet.
Another reason to want to pin an object is because of conservative root-finding. Guile currently uses the Boehm-Demers-Weiser collector, which conservatively scans the stack and data segments for anything that looks like a pointer to the heap. The garbage collector can't update such a global root in response to compaction, because you can't be sure that a given word is a pointer and not just an integer with an inconvenient value. In short, objects referenced by conservative roots need to be pinned. I would like to support precise roots at some point but part of my interest in Immix is to allow Guile to move to a better GC algorithm, without necessarily requiring precise enumeration of GC roots. Optimistic partial evacuation allows for the possibility that any given evacuation might fail, which makes it appropriate for conservative root-finding.
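As a sketch of what pinning from conservative roots might look like, the following scans an array of words (standing in for a stack) and pins whatever granule each plausible pointer lands in. The tiny heap, the 16-byte granule size, the side table of pin bytes and all the names are assumptions made for this example; a real collector would also need to check that the word refers to an actual object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GRANULE_SIZE = 16, HEAP_SIZE = 1024,
       GRANULE_COUNT = HEAP_SIZE / GRANULE_SIZE };

// Hypothetical toy heap: a base address plus a side table of pin bytes.
struct toy_heap {
  uintptr_t base;
  uint8_t pinned[GRANULE_COUNT];  // nonzero: must not be moved this cycle
};

// Treat every word as a possible pointer.  We cannot rewrite the word if
// the object moved, because it might be an integer, so instead we pin.
// Returns the number of granules newly pinned.
static size_t pin_conservative_roots(struct toy_heap *heap,
                                     const uintptr_t *words, size_t count) {
  assert(heap != NULL);
  assert(words != NULL || count == 0);
  size_t newly_pinned = 0;
  for (size_t i = 0; i < count; i++) {
    uintptr_t word = words[i];
    if (word < heap->base) continue;
    uintptr_t offset = word - heap->base;
    if (offset >= HEAP_SIZE) continue;
    size_t granule = offset / GRANULE_SIZE;
    if (!heap->pinned[granule]) {
      heap->pinned[granule] = 1;
      newly_pinned++;
    }
  }
  return newly_pinned;
}
```

An evacuating collector would then consult `pinned` before attempting to move anything, and mark pinned objects in place instead.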
Finally, as moving objects has a cost, it's reasonable to want to only incur that cost for the part of the heap that needs it. In any given heap, there will likely be some data that stays live across a series of collections, and which, once compacted, can't be profitably moved for many cycles. Focussing evacuation on only the part of the heap with the lowest survival rates avoids wasting time on copies that don't result in additional compaction.
(I should admit one thing: sliding mark-compact compaction preserves allocation order, whereas evacuation does not. The memory layout of sliding compaction is more optimal than evacuation.)
multi-cycle evacuation
Say a mutator runs out of memory, and therefore invokes the collector. The collector decides for whatever reason that we should evacuate at least part of the heap instead of marking in place. How much of the heap can we evacuate? The answer depends primarily on how many free blocks you have reserved for evacuation. These are known-empty blocks that haven't been allocated into by the last cycle. If you don't have any, you can't evacuate! So probably you should keep some around, even when performing in-place collections. The Immix papers suggest 2% and that works for me too.
Then you evacuate some blocks. Hopefully the result is that after this collection cycle, you have more free blocks. But you haven't compacted the heap, at least probably not on the first try: not into 2% of total space. Therefore you tell the mutator to put any empty blocks it finds as a result of lazy sweeping during the next cycle onto the evacuation target list, and then the next cycle you have more blocks to evacuate into, and more and more and so on until after some number of cycles you fall below some overall heap fragmentation low-watermark target, at which point you can switch back to marking in place.
I don't know how this works in practice! In my test setup, which triggers compaction at 10% fragmentation and continues until fragmentation drops below 5%, it's rare that it takes more than 3 cycles of evacuation until the heap drops to effectively 0% fragmentation. Of course I had to introduce fragmented allocation patterns into the microbenchmarks to even cause evacuation to happen at all. I look forward to some day soon testing with real applications.
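To make the two watermarks concrete, here is a minimal sketch of the hysteresis, assuming fragmentation is measured as a fraction between 0.0 and 1.0 after each collection. The class and method names are made up for illustration; only the 10% and 5% thresholds come from the test setup described above.

```python
from dataclasses import dataclass


@dataclass
class EvacuationPolicy:
    """Hysteresis between marking in place and evacuating (hypothetical names)."""
    high_watermark: float = 0.10  # start evacuating at or above this fragmentation
    low_watermark: float = 0.05   # return to marking in place below this
    evacuating: bool = False

    def next_cycle_evacuates(self, fragmentation: float) -> bool:
        """Pick the mode of the next collection from the measured fragmentation."""
        if self.evacuating:
            if fragmentation < self.low_watermark:
                self.evacuating = False
        elif fragmentation >= self.high_watermark:
            self.evacuating = True
        return self.evacuating


policy = EvacuationPolicy()
# Fragmentation measured after six successive collections.
modes = [policy.next_cycle_evacuates(f) for f in (0.04, 0.12, 0.08, 0.06, 0.01, 0.07)]
```

Note that the last measurement, 7%, does not restart evacuation: once below the low watermark, the collector keeps marking in place until fragmentation climbs back to 10%.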
concurrency
Just as a terminological note, in the world of garbage collectors, "parallel" refers to multiple threads being used by a garbage collector. Parallelism within a collector is essentially an implementation detail; when the world is stopped for collection, the mutator (the user program) generally doesn't care if the collector uses 1 thread or 15. On the other hand, "concurrent" means the collector and the mutator running at the same time.
Different parts of the collector can be concurrent with the mutator: for example, sweeping, marking, or evacuation. Concurrent sweeping is just a detail, because it just visits dead objects. Concurrent marking is interesting, because it can significantly reduce stop-the-world pauses by performing most of the computation while the mutator is running. It's tricky, as you might imagine; the collector traverses the object graph while the mutator is, you know, mutating it. But there are standard techniques to make this work. Concurrent evacuation is a nightmare. It's not that you can't implement it; you can. But it's very very hard to get an overall performance win from concurrent evacuation/copying.
So if you are looking for a good bargain in the marketplace of garbage collector algorithms, it would seem that you need to avoid concurrent copying/evacuation. It's an expensive product that would seem to not buy you very much.
All that is just a prelude to an observation that there is a funny source of concurrency even in some systems that don't see themselves as concurrent: mutator threads marking their own roots. To recall, when you stop the world for a garbage collection, all mutator threads have to somehow notice the request to stop, reach a safepoint, and then stop. Then the collector traces the roots from all mutators and everything they reference, transitively. Then you let the threads go again. Thing is, once you get more than a thread or four, stopping threads can take time. You'd be tempted to just have threads notice that they need to stop, then traverse their own stacks at their own safepoint to find their roots, then stop. But, this introduces concurrency between root-tracing and other mutators that might not have seen the request to stop. For marking, this concurrency can be fine: you are just setting mark bits, not mutating the roots. You might need to add an additional mark pattern that can be distinguished from marked-last-time and marked-the-time-before-but-dead-now, but that's a detail. Fine.
But if you instead start an evacuating collection, the gates of hell open wide and toothy maws and horns fill your vision. One thread could be stopping and evacuating the objects referenced by its roots, while another hasn't noticed the request to stop and is happily using the same objects: chaos! You are trying to make a minor optimization to move some work out of the stop-the-world phase but instead everything falls apart.
Anyway, this whole article was really to get here and note that you can't do ragged-stops with evacuation without supporting full concurrent evacuation. Otherwise, you need to postpone root traversal until all threads are stopped. Perhaps this is another argument that evacuation is expensive, relative to marking in place. In practice I haven't seen the ragged-stop effect making so much of a difference, but perhaps that is because evacuation is infrequent in my test cases.
Zokay? Zokay. Welp, this evening's nargery was indeed nargy. Happy hacking to all collectors out there, and until next time.
Good morning, mallocators. Last time we talked about how to split available memory between a block-structured main space and a large object space. Given a fixed heap size, making a new large object allocation will steal available pages from the block-structured space by finding empty blocks and temporarily returning them to the operating system.
Today I'd like to talk more about nothing, or rather, why might you want nothing rather than something. Given an Immix heap, why would you want it organized in such a way that live data is packed into some blocks, leaving other blocks completely free? How bad would it be if instead the live data were spread all over the heap? When might it be a good idea to try to compact the heap? Ideally we'd like to be able to translate the answers to these questions into heuristics that can inform the GC when compaction/evacuation would be a good idea.
lospace and the void
Let's start with one of the more obvious points: large object allocation. With a fixed-size heap, you can't allocate new large objects if you don't have empty blocks in your paged space (the Immix space, for example) that you can return to the OS. To obtain these free blocks, you have four options.
1. You can continue lazy sweeping of recycled blocks, to see if you find an empty block. This is a bit time-consuming, though.
2. Otherwise, you can trigger a regular non-moving GC, which might free up blocks in the Immix space but which is also likely to free up large objects, which would result in fresh empty blocks.
3. You can trigger a compacting or evacuating collection. Immix can't actually compact the heap all in one go, so you would preferentially select evacuation-candidate blocks by choosing the blocks with the least live data (as measured at the last GC), hoping that little data will need to be evacuated.
4. Finally, for environments in which the heap is growable, you could just grow the heap instead. In this case you would configure the system to target a heap size multiplier rather than a heap size, which would scale the heap to be e.g. twice the size of the live data, as measured at the last collection.
If you have a growable heap, I think you will rarely choose to compact rather than grow the heap: you will either collect or grow. Under constant allocation rate, the rate of empty blocks being reclaimed from freed lospace objects will be equal to the rate at which they are needed, so if collection doesn't produce any, then that means your live data set is increasing and so growing is a good option. Anyway let's put growable heaps aside, as heap-growth heuristics are a separate gnarly problem.
The question becomes, when should large object allocation force a compaction? Absent growable heaps, the answer is clear: when allocating a large object fails because there are no empty pages, but the statistics show that there is actually ample free memory. Good! We have one heuristic, and one with an optimum: you could compact in other situations but from the point of view of lospace, waiting until allocation failure is the most efficient.
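As a rough sketch of that heuristic, assuming a fixed-size heap and that we are deciding what to do after a regular non-moving collection has already run: the function name, the returned strings, and the 32 kB block size are all assumptions for illustration.

```python
BLOCK_SIZE = 32 * 1024  # bytes; assumed block size


def blocks_needed(object_bytes: int) -> int:
    """Number of whole blocks that must be decommissioned to cover an object."""
    return -(-object_bytes // BLOCK_SIZE)  # ceiling division


def large_alloc_action(object_bytes: int, empty_blocks: int, free_bytes: int) -> str:
    """What to do for a large allocation after a non-moving collection.

    empty_blocks: completely free blocks in the paged space.
    free_bytes: total free memory in the paged space, per collector statistics.
    """
    need = blocks_needed(object_bytes)
    if empty_blocks >= need:
        return "allocate"
    if free_bytes >= need * BLOCK_SIZE:
        # Ample free memory overall, but scattered: this is the case for compaction.
        return "compact"
    return "fail"


enough_blocks = large_alloc_action(100 * 1024, empty_blocks=4, free_bytes=0)
scattered = large_alloc_action(100 * 1024, empty_blocks=1, free_bytes=200 * 1024)
too_full = large_alloc_action(100 * 1024, empty_blocks=1, free_bytes=64 * 1024)
```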
shrinkage
Moving on, another use of empty blocks is when shrinking the heap. The collector might decide that it's a good idea to return some memory to the operating system. For example, I enjoyed this recent paper on heuristics for optimum heap size, which advocates sizing the heap in proportion to the square root of the allocation rate and, as a consequence, promptly returning memory to the OS when/if the application reaches a dormant state.
Here, we have a similar heuristic for when to evacuate: when we would like to release memory to the OS but we have no empty blocks, we should compact. We use the same evacuation candidate selection approach as before, again aiming for maximum empty block yield.
fragmentation
What if you go to allocate a medium object, say 4kB, but there is no hole that's 4kB or larger? In that case, your heap is fragmented. The smaller your heap size, the more likely this is to happen. We should compact the heap to make the maximum hole size larger.
side note: compaction via partial evacuation
The evacuation strategy of Immix is... optimistic. A mark-compact collector will compact the whole heap, but Immix will only be able to evacuate a fraction of it.
It's worth dwelling on this a bit. As described in the paper, Immix reserves around 2-3% of overall space for evacuation overhead. Let's say you decide to evacuate: you start with 2-3% of blocks being empty (the target blocks), and choose a corresponding set of candidate blocks for evacuation (the source blocks). Since Immix is a one-pass collector, it doesn't know how much data is live when it starts collecting. It may not know that the blocks that it is evacuating will fit into the target space. As specified in the original paper, if the target space fills up, Immix will mark in place instead of evacuating; an evacuation candidate block with marked-in-place objects would then be non-empty at the end of collection.
In fact if you choose a set of evacuation candidates hoping to maximize your empty block yield, based on an estimate of live data instead of limiting to only the number of target blocks, I think it's possible to actually fill the targets before the source blocks empty, leaving you with no empty blocks at the end! (This can happen due to inaccurate live data estimations, or via internal fragmentation with the block size.) The only way to avoid this is to never select more evacuation candidate blocks than you have in target blocks. If you are lucky, you won't have to use all of the target blocks, and so at the end you will end up with more free blocks than not, so a subsequent evacuation will be more effective. The defragmentation result in that case would still be pretty good, but the yield in free blocks is not great.
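The difference between the two selection strategies can be sketched as follows. This is an illustration under assumptions, not what any particular Immix implementation does: the function name, the dictionary of per-block live-byte estimates, and the 32 kB block size are all hypothetical.

```python
def select_candidates(live_estimates, target_blocks, block_size, optimistic=False):
    """Pick evacuation source blocks, emptiest first.

    live_estimates: block id -> estimated live bytes, as measured at the last GC.
    Conservative mode never picks more source blocks than there are target blocks.
    Optimistic mode picks blocks until their estimated live data fills the targets.
    """
    ordered = sorted(live_estimates.items(), key=lambda item: (item[1], item[0]))
    budget = target_blocks * block_size
    chosen = []
    for block, live in ordered:
        if optimistic:
            if live > budget:
                break
            budget -= live
        elif len(chosen) >= target_blocks:
            break
        chosen.append(block)
    return chosen


BLOCK_SIZE = 32 * 1024  # assumed block size
estimates = {0: 30000, 1: 4000, 2: 8000, 3: 16000, 4: 2000}
conservative = select_candidates(estimates, target_blocks=1, block_size=BLOCK_SIZE)
optimistic = select_candidates(estimates, target_blocks=1, block_size=BLOCK_SIZE,
                               optimistic=True)
```

With one target block, the conservative choice evacuates only block 4, while the optimistic choice bets that blocks 4, 1, 2 and 3 (30000 estimated live bytes) all fit into the single 32768-byte target. If the estimates are off, or if objects don't pack perfectly, that bet can fail and some objects get marked in place.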
In a production garbage collector I would still be tempted to be optimistic and select more evacuation candidate blocks than available empty target blocks, because it will require fewer rounds to compact the whole heap, if that's what you wanted to do. It would be a relatively rare occurrence to start an evacuation cycle. If you ran out of space while evacuating, in a production GC I would just temporarily commission some overhead blocks for evacuation and release them promptly after evacuation is complete. If you have a small heap multiplier in your Immix space, occasional partial evacuation in a long-running process would probably reach a steady state with blocks being either full or empty. Fragmented blocks would represent newer objects and evacuation would periodically sediment these into longer-lived dense blocks.
mutator throughput
Finally, the shape of the heap has its inverse in the shape of the holes into which the mutator can allocate. It's most efficient for the mutator if the heap has as few holes as possible: ideally just one large hole per block, which is the limit case of an empty block.
The opposite extreme would be having every other "line" (in Immix terms) be used, so that free space is spread across the heap in a vast spray of one-line holes. Even if fragmentation is not a problem, perhaps because the application only allocates objects that pack neatly into lines, having to stutter all the time to look for holes is overhead for the mutator. Also, the result is that contemporaneous allocations are more likely to be placed farther apart in memory, leading to more cache misses when accessing data. Together, allocator overhead and access overhead lead to lower mutator throughput.
When would this situation get so bad as to trigger compaction? Here I have no idea. There is no clear maximum. If compaction were free, we would compact all the time. But it's not; there's a tradeoff between the cost of compaction and mutator throughput.
I think here I would punt. If the heap is being actively resized based on allocation rate, we'll hit the other heuristics first, and so we won't need to trigger evacuation/compaction based on mutator overhead. You could measure this, though, in terms of average or median hole size, or average or maximum number of holes per block. Since evacuation is partial, all you need to do is to identify some "bad" blocks and then perhaps evacuation becomes attractive.
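For what it's worth, those measurements are cheap to compute from a block's line marks. A minimal sketch, assuming each block exposes its line marks as a sequence of booleans; the function names are invented for illustration.

```python
from itertools import groupby


def hole_sizes(line_marks):
    """Lengths, in lines, of each contiguous run of unmarked lines in a block."""
    return [len(list(run)) for marked, run in groupby(line_marks, key=bool) if not marked]


def block_stats(line_marks):
    """Summarize how a block's free space is shaped."""
    sizes = hole_sizes(line_marks)
    return {
        "holes": len(sizes),
        "mean_hole": sum(sizes) / len(sizes) if sizes else 0.0,
        "max_hole": max(sizes, default=0),
    }


# Every other line used: the "vast spray of one-line holes" case.
spray = block_stats([True, False] * 4)
# Same number of lines, but live data packed at the start of the block.
dense = block_stats([True] * 3 + [False] * 5)
```

A block like `spray` would be a natural "bad" block: many holes, each as small as a hole can be.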
gc pause
Welp, that's some thoughts on when to trigger evacuation in Immix. Next time, we'll talk about some engineering aspects of evacuation. Until then, happy consing!
Good day! In a recent dispatch we talked about the fundamental garbage collection algorithms, also introducing the Immix mark-region collector. Immix mostly leaves objects in place but can move objects if it thinks it would be profitable. But when would it decide that this is a good idea? Are there cases in which it is necessary?
I promised to answer those questions in a followup article, but I didn't say which followup :) Before I get there, I want to talk about paged spaces.
enter the multispace
We mentioned that Immix divides the heap into blocks (32kB or so), and that no object can span multiple blocks. "Large" objects -- defined by Immix to be more than 8kB -- go to a separate "large object space", or "lospace" for short.
Though the implementation of a large object space is relatively simple, I found that it has some points that are quite subtle. Probably the most important of these points relates to heap size. Consider that if you just had one space, implemented using mark-compact maybe, then the procedure to allocate a 16 kB object would go:
1. Try to bump the allocation pointer by 16kB. Is it still within range? If so we are done.
2. Otherwise, collect garbage and try again. If after GC there isn't enough space, the allocation fails.
In step (2), collecting garbage could decide to grow or shrink the heap. However when evaluating collector algorithms, you generally want to avoid dynamically-sized heaps.
cheatery
Here is where I need to make an embarrassing admission. In my role as co-maintainer of the Guile programming language implementation, I have long noodled around with benchmarks, comparing Guile to Chez, Chicken, and other implementations. It's good fun. However, I only realized recently that I had a magic knob that I could turn to win more benchmarks: simply make the heap bigger. Make it start bigger, make it grow faster, whatever it takes. For a program that does its work in some fixed amount of total allocation, a bigger heap will require fewer collections, and therefore generally take less time. (Some amount of collection may be good for performance as it improves locality, but this is a marginal factor.)
Of course I didn't really go wild with this knob but it now makes me doubt all benchmarks I have ever seen: are we really using benchmarks to select for fast implementations, or are we in fact selecting for implementations with cheeky heap size heuristics? Consider even any of the common allocation-heavy JavaScript benchmarks, DeltaBlue or Earley or the like; to win these benchmarks, web browsers are incentivised to have large heaps. In the real world, though, a more parsimonious policy might be more appreciated by users.
Java people have known this for quite some time, and are therefore used to fixing the heap size while running benchmarks. For example, people will measure the minimum amount of memory that can allow a benchmark to run, and then configure the heap to be a constant multiplier of this minimum size. The MMTK garbage collector toolkit can't even grow the heap at all currently: it's an important feature for production garbage collectors, but as they are just now migrating out of the research phase, heap growth (and shrinking) hasn't yet been a priority.
lospace
So now consider a garbage collector that has two spaces: an Immix space for allocations of 8kB and below, and a large object space for, well, larger objects. How do you divide the available memory between the two spaces? Could the balance between Immix and lospace change at run-time? If you never had large objects, would you be wasting space at all? Conversely, is there a strategy that can also work for only large objects?
Perhaps the answer is obvious to you, but it wasn't to me. After much reading of the MMTK source code and pondering, here is what I understand the state of the art to be.
1. Arrange for your main space -- Immix, mark-sweep, whatever -- to be block-structured, and able to dynamically decommission or recommission blocks, perhaps via MADV_DONTNEED. This works if the blocks are even multiples of the underlying OS page size.
2. Keep a counter of however many bytes the lospace currently has.
3. When you go to allocate a large object, increment the lospace byte counter, and then round up to the number of blocks to decommission from the main paged space. If this is more than are currently decommissioned, find some empty blocks and decommission them.
4. If no empty blocks were found, collect, and try again. If the second try doesn't work, then the allocation fails.
5. Now that the paged space has shrunk, lospace can allocate. You can use the system malloc, but probably better to use mmap, so that if these objects are collected, you can just MADV_DONTNEED them and keep them around for later re-use.
6. After GC runs, explicitly return the memory for any object in lospace that wasn't visited when the object graph was traversed. Decrement the lospace byte counter and possibly return some empty blocks to the paged space.
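The bookkeeping in the steps above can be sketched as a small model. This is only the accounting, with no real memory involved: the class and its fields are hypothetical, the 32 kB block size is assumed, and the count of known-empty blocks stands in for the "find some empty blocks" step.

```python
class HeapModel:
    """Toy accounting for a block-structured space that lends blocks to lospace."""

    def __init__(self, total_blocks, block_size=32 * 1024):
        self.block_size = block_size
        self.empty_blocks = total_blocks  # blocks known to be empty (assumed known)
        self.decommissioned = 0           # blocks currently lent to lospace
        self.lospace_bytes = 0            # the lospace byte counter

    def _blocks_for(self, nbytes):
        return -(-nbytes // self.block_size)  # ceiling division

    def allocate_large(self, nbytes):
        """Return True on success; on False the caller would collect and retry once."""
        wanted = self._blocks_for(self.lospace_bytes + nbytes)
        extra = wanted - self.decommissioned
        if extra > self.empty_blocks:
            return False
        if extra > 0:
            self.empty_blocks -= extra
            self.decommissioned += extra
        self.lospace_bytes += nbytes
        return True

    def free_large(self, nbytes):
        """Called after GC for each lospace object that was not visited."""
        self.lospace_bytes -= nbytes
        wanted = self._blocks_for(self.lospace_bytes)
        self.empty_blocks += self.decommissioned - wanted
        self.decommissioned = wanted


heap = HeapModel(total_blocks=8)
first = heap.allocate_large(50_000)    # needs 2 blocks
second = heap.allocate_large(10_000)   # 60,000 bytes total still fits in 2 blocks
third = heap.allocate_large(250_000)   # would need 10 blocks in total
after_allocs = (heap.decommissioned, heap.empty_blocks)
heap.free_large(50_000)                # the first object died
after_free = (heap.decommissioned, heap.empty_blocks)
```

Because only the byte counter is rounded up, a second small-ish large object can ride along in blocks that were already decommissioned for the first.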
There are some interesting aspects about this strategy. One is, the memory that you return to the OS doesn't need to be contiguous. When allocating a 50 MB object, you don't have to find 50 MB of contiguous free space, because any set of blocks that adds up to 50 MB will do.
Another aspect is that this adaptive strategy can work for any ratio of large to non-large objects. The user doesn't have to manually set the sizes of the various spaces.
This strategy does assume that address space is larger than heap size, but only by a factor of 2 (modulo fragmentation for the large object space). Therefore our risk of running afoul of user resource limits and kernel overcommit heuristics is low.
The one underspecified part of this algorithm is... did you see it? "Find some empty blocks". If the main paged space does lazy sweeping -- only scanning a block for holes right before the block will be used for allocation -- then after a collection we don't actually know very much about the heap, and notably, we don't know what blocks are empty. (We could know it, of course, but it would take time; you could traverse the line mark arrays for all blocks while the world is stopped, but this increases pause time. The original Immix collector does this, however.) In the system I've been working on, instead I have it so that if a mutator finds an empty block, it puts it on a separate list, and then takes another block, only allocating into empty blocks once all blocks are swept. If the lospace needs blocks, it sweeps eagerly until it finds enough empty blocks, throwing away any nonempty blocks. This causes the next collection to happen sooner, but that's not a terrible thing; this only occurs when rebalancing lospace versus paged-space size, because if you have a constant allocation rate on the lospace side, you will also have a complementary rate of production of empty blocks by GC, as they are recommissioned when lospace objects are reclaimed.
What if your main paged space has ample space for allocating a large object, but there are no empty blocks, because live objects are equally peppered around all blocks? In that case, often the application would be best served by growing the heap, but maybe not. In any case in a strict-heap-size environment, we need a solution.
But for that... let's pick up another day. Until then, happy hacking!
Good morning, hackers! Been a while. It used to be that I had long blocks of uninterrupted time to think and work on projects. Now I have two kids; the longest such time-blocks are on trains (too infrequent, but it happens) and in a less effective but more frequent fashion, after the kids are sleeping. As I start writing this, I'm in an airport waiting for a delayed flight -- my first since the pandemic -- so we can consider this to be the former case.
It is perhaps out of mechanical sympathy that I have been using my reclaimed time to noodle on a garbage collector. Managing space and managing time have similar concerns: how to do much with little, efficiently packing different-sized allocations into a finite resource.
I have been itching to write a GC for years, but the proximate event that pushed me over the edge was reading about the Immix collection algorithm a few months ago.
on fundamentals
Immix is a "mark-region" collection algorithm. I say "algorithm" rather than "collector" because it's more like a strategy or something that you have to put into practice by making a concrete collector, the other fundamental algorithms being copying/evacuation, mark-sweep, and mark-compact.
To build a collector, you might combine a number of spaces that use different strategies. A common choice would be to have a semi-space copying young generation, a mark-sweep old space, and maybe a treadmill large object space (a kind of copying collector, logically; more on that later). Then you have heuristics that determine what object goes where, when.
On the engineering side, there's quite a number of choices to make there too: probably you make some parts of your collector to be parallel, maybe the collector and the mutator (the user program) can run concurrently, and so on. Things get complicated, but the fundamental algorithms are relatively simple, and present interesting fundamental tradeoffs.
For example, mark-compact is most parsimonious regarding space usage -- for a given program, a garbage collector using a mark-compact algorithm will require less memory than one that uses mark-sweep. However, mark-compact algorithms all require at least two passes over the heap: one to identify live objects (mark), and at least one to relocate them (compact). This makes them less efficient in terms of overall program throughput and can also increase latency (GC pause times).
Copying or evacuating spaces can be more CPU-efficient than mark-compact spaces, as reclaiming memory avoids traversing the heap twice; a copying space copies objects as it traverses the live object graph instead of after the traversal (mark phase) is complete. However, a copying space's minimum heap size is quite high, and it only reaches competitive efficiencies at large heap sizes. For example, if your program needs 100 MB of space for its live data, a semi-space copying collector will need at least 200 MB of space in the heap (a 2x multiplier, we say), and will only run efficiently at something more like 4-5x. It's a reasonable tradeoff to make for small spaces such as nurseries, but as a mature space, it's so memory-hungry that users will be unhappy if you make it responsible for a large portion of your memory.
Finally, mark-sweep is quite efficient in terms of program throughput, because like copying it traverses the heap in just one pass, and because it leaves objects in place instead of moving them. But! Unlike the other two fundamental algorithms, mark-sweep leaves the heap in a fragmented state: instead of having all live objects packed into a contiguous block, memory is interspersed with live objects and free space. So the collector can run quickly but the allocator stops and stutters as it accesses disparate regions of memory.
allocators
Collectors are paired with allocators. For mark-compact and copying/evacuation, the allocator consists of a pointer to free space and a limit. Objects are allocated by bumping the allocation pointer, a fast operation that also preserves locality between contemporaneous allocations, improving overall program throughput. But for mark-sweep, we run into a problem: say you go to allocate a 1 kilobyte byte array, do you actually have space for that?
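A bump-pointer allocator really is just those two words of state. Here is a minimal sketch over abstract integer addresses; the class name and the 8-byte alignment are assumptions for illustration.

```python
class BumpAllocator:
    """Allocate by advancing a pointer through one contiguous free region."""

    def __init__(self, start, limit, alignment=8):
        self.ptr = start
        self.limit = limit
        self.alignment = alignment  # must be a power of two

    def allocate(self, size):
        """Return the address of the new object, or None if the region is exhausted."""
        size = (size + self.alignment - 1) & ~(self.alignment - 1)  # round up
        new_ptr = self.ptr + size
        if new_ptr > self.limit:
            return None  # a real system would collect garbage and retry here
        addr = self.ptr
        self.ptr = new_ptr
        return addr


region = BumpAllocator(start=0, limit=64)
a = region.allocate(24)  # occupies [0, 24)
b = region.allocate(10)  # rounded up to 16 bytes, occupies [24, 40)
c = region.allocate(32)  # only 24 bytes remain, so this fails
```

Objects allocated one after the other end up adjacent in memory, which is where the locality benefit comes from.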
Generally speaking, mark-sweep allocators solve this problem via freelist allocation: the allocator has an array of lists of free objects, one for each "size class" (say 2 words, 3 words, and so on up to 16 words, then more sparsely up to the largest allocatable size maybe), and services allocations from their appropriate size class's freelist. This prevents the 1 kB free space that we need from being "used up" by a 16-byte allocation that could just as well have gone elsewhere. However, freelists prevent objects allocated around the same time from being deterministically placed in nearby memory locations. This increases variance and decreases overall throughput, both for the allocation operations themselves and for pointer-chasing in the course of the program's execution.
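Here is a small sketch of size-class freelists, under assumptions: an 8-byte word, exact size classes from 2 to 16 words, and powers of two above that as one possible reading of "more sparsely". The names are hypothetical, and addresses are plain integers.

```python
from collections import defaultdict

WORD = 8  # bytes per word; an assumption


def size_class(nbytes):
    """Size class in words: exact from 2 to 16 words, then the next power of two."""
    words = max(2, -(-nbytes // WORD))
    if words <= 16:
        return words
    return 1 << (words - 1).bit_length()


class FreelistAllocator:
    def __init__(self):
        self.freelists = defaultdict(list)  # size class (words) -> free addresses

    def add_free(self, addr, words):
        """Called by the sweeper when it finds a free chunk of the given size class."""
        self.freelists[words].append(addr)

    def allocate(self, nbytes):
        """Pop a chunk from the matching size class only, or return None."""
        free = self.freelists[size_class(nbytes)]
        return free.pop() if free else None


allocator = FreelistAllocator()
allocator.add_free(0x1000, words=128)  # a free 1 kB chunk
allocator.add_free(0x2000, words=2)    # a free 16-byte chunk
small = allocator.allocate(16)
big = allocator.allocate(1024)
none_left = allocator.allocate(16)
```

The second 16-byte request fails rather than carving up another size class's chunk, and the 1 kB chunk was there when the 1 kB request arrived. Note also that `small` and `big` land at unrelated addresses even though they were allocated back to back.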
Also, in a mark-sweep collector, we can still reach a situation where there is enough space on the heap for an allocation, but that free space is broken up into too many pieces: the heap is fragmented. For this reason, many systems that perform mark-sweep collection can choose to compact, if heuristics show it might be profitable. Because the usual strategy is mark-sweep, though, they still use freelist allocation.
on immix and mark-region
Mark-region collectors are like mark-sweep collectors, except that they do bump-pointer allocation into the holes between survivor objects.
Sounds simple, right? To my mind, though, the fundamental challenge in implementing a mark-region collector is how to handle fragmentation. Let's take a look at how Immix solves this problem.
Firstly, Immix partitions the heap into blocks, which might be 32 kB in size or so. No object can span multiple blocks. Block size should be chosen to be a nice power-of-two multiple of the system page size, not so small that common object allocations wouldn't fit. "Large" objects -- greater than 8 kB, for Immix -- go to a separate space that is managed in a different way.
Within a block, Immix divides space into lines -- maybe 128 bytes long. Objects can span lines. Any line that does not contain (a part of) an object that survived the previous collection is part of a hole. A hole is a contiguous span of free lines in a block.
On the allocation side, Immix does bump-pointer allocation into holes. If a mutator doesn't have a hole currently, it scans the current block (obtaining one if needed) for the next hole, via a side-table of per-line mark bits: one bit per line. Lines without the mark are in holes. Scanning for holes is fairly cheap, because the line size is not too small. Note that there are also per-object mark bits; just because you've marked a line doesn't mean that you've traced all objects on that line.
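The hole scan can be sketched like so, assuming 128-byte lines and a per-block list of line marks where a truthy entry means "marked". The function name is made up; the returned pair is what a mutator would install as its bump pointer and limit, as offsets within the block.

```python
LINE_SIZE = 128  # bytes per line; "maybe 128 bytes" per the description above


def next_hole(line_marks, from_line):
    """Find the next run of unmarked lines at or after from_line.

    Returns (start_offset, limit_offset) in bytes within the block,
    or None if the rest of the block has no free line.
    """
    n = len(line_marks)
    start = from_line
    while start < n and line_marks[start]:
        start += 1
    if start == n:
        return None
    end = start
    while end < n and not line_marks[end]:
        end += 1
    return start * LINE_SIZE, end * LINE_SIZE


marks = [1, 0, 0, 1, 1, 0, 1, 0]  # a tiny 8-line block
first_hole = next_hole(marks, 0)  # lines 1 and 2
second_hole = next_hole(marks, 3) # line 5 only
last_hole = next_hole(marks, 6)   # line 7 only
no_hole = next_hole(marks, 8)     # past the end of the block
```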
Allocating into a hole has good expected performance as well, as it's bump-pointer, and the minimum size isn't tiny. In the worst case of a hole consisting of a single line, you have 128 bytes to work with. This size is large enough for the majority of objects, given that most objects are small.
mitigating fragmentation
Immix still has some challenges regarding fragmentation. There is some loss in which a single (piece of an) object can keep a line marked, wasting any free space on that line. Also, when an object can't fit into a hole, any space left in that hole is lost, at least until the next collection. This loss could also occur for the next hole, and the next and the next and so on until Immix finds a hole that's big enough. In a mark-sweep collector with lazy sweeping, these free extents could instead be placed on freelists and used when needed, but in Immix there is no such facility (by design).
One mitigation for fragmentation risks is "overflow allocation": when allocating an object larger than a line (a medium object), and Immix can't find a hole before the end of the block, Immix allocates into a completely free block. So actually mutator threads allocate into two blocks at a time: one for small objects and medium objects if possible, and the other for medium objects when necessary.
Another mitigation is that large objects are allocated into their own space, so an Immix space will never be used for objects larger than, say, 8kB.
The other mitigation is that Immix can choose to evacuate instead of mark. How does this work? Is it worth it?
stw
This question about the practical tradeoffs involving evacuation is the one I wanted to pose when I started this article; I have gotten to the point of implementing this part of Immix and I have some doubts. But, this article is long enough, and my plane is about to land, so let's revisit this on my return flight. Until then, see you later, allocators!