partitioning pitfalls for generational collectors

You might have heard of long-pole problems: when you are packing a tent, its bag needs to be at least as big as the longest pole. (This is my preferred etymology; there are others.) In garbage collection, the long pole is the longest chain of object references; there is no way you can algorithmically speed up pointer-chasing of (say) a long linked-list.

As a GC author, some of your standard tools can mitigate the long-pole problem, and some don’t apply.

Parallelism doesn’t help: a baby takes 9 months, no matter how many people you assign to the problem. You need to visit each node in the chain, one after the other, and having multiple workers available to process those nodes does not get you there any faster. Parallelism does help in general, of course, but it doesn’t help with long poles.

You can apply concurrency: instead of stopping the user’s program (the mutator) to enumerate live objects, you trace while the mutator is running. This can happen on mutator threads, interleaved with allocation, via what is known as incremental tracing. Or it can happen in dedicated tracing threads, which is what people usually mean when they refer to concurrent tracing. Though it does impose some coordination overhead on the mutator, the pause-time benefits are such that most production garbage collectors do trace concurrently with the mutator, if they have the development resources to manage the complexity.
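
To make that concrete, here is a minimal sketch of incremental tracing, with made-up types rather than any production collector's code: each increment does a bounded amount of marking work, then returns control to the mutator.

#include <cstddef>
#include <vector>

// Illustrative types, not any production collector's code. Tracing is
// a loop over a mark stack; each increment performs at most `budget`
// steps before yielding back to the mutator.
struct Object {
  bool marked = false;
  std::vector<Object*> fields;  // outgoing references
};

// Returns true when the mark stack is exhausted and the trace is done.
bool TraceIncrement(std::vector<Object*>& mark_stack, std::size_t budget) {
  while (budget-- > 0) {
    if (mark_stack.empty()) return true;
    Object* obj = mark_stack.back();
    mark_stack.pop_back();
    for (Object* child : obj->fields) {
      if (child != nullptr && !child->marked) {
        child->marked = true;
        mark_stack.push_back(child);
      }
    }
  }
  return false;  // out of budget; continue in a later increment
}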

Then there is partitioning: instead of tracing the whole graph all the time, try to just trace part of it and just reclaim memory for that part. This bounds the size of a long pole—it can’t be bigger than the partition you trace—and so tracing a graph subset should reduce total pause time.

The usual way to partition is generational, in which the main program allocates into a nursery, and objects that survive a few collections then get promoted into old space. But there may be other partitions; for example, you might put “large objects” (say, anything bigger than a few virtual memory pages) in their own space, to be managed with their own algorithm.

partitions, take one

And that brings me to today’s article: generational partitioning has a failure mode which manifests itself as spurious out-of-memory. For example, in V8, this small JavaScript program, which repeatedly allocates a 1-megabyte buffer, grows to consume all memory in the system and eventually panics:

while (1) new Uint8Array(1e6);

This is a funny result, to say the least. Let’s dig in a bit to see why.

First off, note that allocating a 1-megabyte Uint8Array makes a large object, because it is bigger than half a V8 page, which is 256 kB on most systems. There is a backing store for the array, which gets allocated into the large object space, and then there is the JS object wrapper, which gets allocated in the young generation of the regular object space.
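
To make the two allocations concrete, here is a sketch of the routing decision. The constants follow the page size mentioned above, but the code is a model, not V8's implementation.

#include <cstddef>

// Illustrative constants and routing, not V8's actual implementation.
constexpr std::size_t kPageSize = 256 * 1024;                 // 256 kB V8 page
constexpr std::size_t kLargeObjectThreshold = kPageSize / 2;  // 128 kB

enum class Space { kNursery, kLargeObjectSpace };

// A Uint8Array is really two allocations, which may be routed to
// different spaces: the JS wrapper and the backing store.
Space SpaceForBackingStore(std::size_t byte_length) {
  return byte_length > kLargeObjectThreshold ? Space::kLargeObjectSpace
                                             : Space::kNursery;
}

// For new Uint8Array(1e6): the wrapper goes to the nursery, and since
// 1,000,000 > 131,072, the backing store goes to the large object space.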

Let’s imagine the heap consists of the nursery, the old space, and a large object space (lospace, for short). (See the next section for a refinement of this model.) In the loop, we cause allocation in the nursery and the lospace. When the nursery starts to get full, we run a minor collection, to trace the newly allocated part of the object graph, possibly promoting some objects to the old generation.

Promoting an object adds to the byte count in the old generation. You could say that promoting causes allocation in the old-gen, though it might happen just on an accounting level if an object is promoted in place. In V8, old-gen allocation is subject to a limit, and that limit is set by a gnarly series of heuristics. Exceeding the limit will cause a major GC, in which the whole heap is traced.
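
In simplified form, the accounting might look something like the following sketch; the numbers and names are invented for illustration.

#include <cstddef>

// A simplified model of old-generation accounting, not V8's code; in
// V8 the limit comes from that gnarly series of heuristics.
struct OldGenAccounting {
  std::size_t allocated_bytes = 0;
  std::size_t limit = 64 * 1024 * 1024;  // illustrative value

  // Promotion counts against the old-gen budget even when the object
  // is promoted in place and no bytes actually move.
  void AccountPromotion(std::size_t object_size) {
    allocated_bytes += object_size;
  }

  bool ShouldStartMajorGC() const { return allocated_bytes > limit; }
};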

So, minor collections cause major collections via promotion. But what if a minor collection never promotes? This can happen in request-oriented workflows, in which a request comes in, you handle it in JS, write out the result, and then handle the next. If the nursery is at least as large as the memory allocated in one request, no object will ever survive more than one collection. But in our new Uint8Array(1e6) example above, we are allocating into both the nursery and the lospace. If we never cause promotion, we will always collect the nursery but never trigger the full GC that would also collect the large object space, and so we run out of memory.
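
Putting it together, the hole in the accounting can be expressed as a little model; all types and sizes here are illustrative.

#include <cstddef>

// The failure mode, sketched over the simplified heap model above.
struct SimpleHeap {
  std::size_t nursery_bytes = 0;
  std::size_t nursery_capacity = 16 * 1024 * 1024;
  std::size_t lospace_bytes = 0;  // never reclaimed below

  void AllocateTypedArray(std::size_t backing_size) {
    nursery_bytes += 64;            // small JS wrapper object
    lospace_bytes += backing_size;  // large backing store
    if (nursery_bytes > nursery_capacity) MinorGC();
  }

  void MinorGC() {
    // Nothing survives this workload, so nothing is promoted, the
    // old-gen limit is never approached, no major GC ever runs, and
    // lospace_bytes only ever grows.
    nursery_bytes = 0;
  }
};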

partitions, take two

In general, the phenomenon we are looking for is nursery allocations that cause non-nursery allocations, where those non-nursery allocations will not themselves bring about a major GC.

In our example above, the problematic allocation was a typed array object with an associated lospace backing store, and we assumed the lospace allocation wouldn’t eventually trigger GC. But V8’s heap is not exactly like our model, for one because it actually has separate nursery and old-generation lospaces, and for two because allocation to the old-generation lospace does count towards the old-gen allocation limit. And yet, that simple loop still blows through all memory. So what is the deal?

The answer is that V8’s heap now has around two dozen spaces, and allocations to some of those spaces escape the limits and allocation counters. In this case, V8’s sandbox makes the link between the JS typed array object and its backing store pass through a table of indirections. At each allocation, we make an object in the nursery, allocate a backing store in the nursery lospace, and then allocate a new entry in the external pointer table, storing the index of that entry in the JS object. When the object needs to get its backing store, it dereferences the corresponding entry in the external pointer table.
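
Schematically, the indirection works something like this sketch; the real table is more involved, and this shows only the handle-based lookup.

#include <cstdint>
#include <vector>

// A minimal sketch of the indirection: the JS object stores an index
// (a handle) into a table, rather than a raw pointer to memory outside
// the sandbox.
using ExternalPointerHandle = uint32_t;

class ExternalPointerTable {
 public:
  ExternalPointerHandle Allocate(void* external) {
    entries_.push_back(external);
    return static_cast<ExternalPointerHandle>(entries_.size() - 1);
  }
  void* Get(ExternalPointerHandle handle) const { return entries_[handle]; }

 private:
  std::vector<void*> entries_;
};

// The typed array wrapper holds only the handle; fetching the backing
// store dereferences the corresponding table entry.
struct JSTypedArray {
  ExternalPointerHandle backing_store;
};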

Our tight loop above would therefore cause an allocation (in the external pointer table) that itself would not hasten a major collection.

Seen this way, one solution immediately presents itself: maybe we should find a way to make external pointer table entry allocations trigger a GC based on some heuristic, perhaps the old-gen allocated bytes counter. But then you have to actually trigger GC, and this is annoying and not always possible, as EPT entries can be allocated in response to a pointer write. V8 hacker Michael Lippautz summed up the issue in a comment whose gist is “it’s gnarly”.
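
For illustration, here is roughly the shape of that rejected heuristic, with invented names; note that the sketch stops at requesting a collection, because actually performing one at an arbitrary pointer write is exactly the gnarly part.

#include <cstdint>

// Hypothetical sketch of the rejected idea, not anything in V8.
struct GCState {
  uint64_t old_gen_allocated_bytes = 0;
  uint64_t old_gen_limit = 64 * 1024 * 1024;  // illustrative
  bool major_gc_requested = false;
};

void OnEPTEntryAllocated(GCState& gc) {
  // We can only *request* a GC here; the entry may be being allocated
  // in response to a pointer write, where actually collecting is not
  // always possible.
  if (gc.old_gen_allocated_bytes > gc.old_gen_limit)
    gc.major_gc_requested = true;
}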

In the end, it would be better if new-space allocations did not cause old-space allocations: we should separate the external pointer table (EPT) into old and new spaces. Because there is at most one “host” object associated with any EPT entry (if there are zero objects, the entry is dead), the heuristics that apply to host objects will do the right thing with regard to EPT entries.
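
In sketch form, with hypothetical code rather than the patch itself, entry placement follows the host object:

#include <cstdint>
#include <vector>

// Hypothetical sketch in the spirit of the patch, not its actual code:
// an EPT entry lives in the young or old table space according to
// where its host object lives, and moves when the host is promoted.
enum class EPTSpace { kYoung, kOld };

struct EPTEntry {
  void* external;
  EPTSpace space;
};

class GenerationalEPT {
 public:
  uint32_t Allocate(void* external, bool host_in_nursery) {
    entries_.push_back({external, host_in_nursery ? EPTSpace::kYoung
                                                  : EPTSpace::kOld});
    return static_cast<uint32_t>(entries_.size() - 1);
  }

  // Called when the host object is promoted; afterwards, minor
  // collections need only scan the young entries.
  void Promote(uint32_t handle) { entries_[handle].space = EPTSpace::kOld; }

 private:
  std::vector<EPTEntry> entries_;
};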

A couple weeks ago, I landed a patch to do this. It was much easier said than done; the patch went through upwards of 40 revisions, as it took me a while to understand the delicate interactions between the concurrent and parallel parts of the collector, which were themselves evolving as I was working on the patch.

The challenge mostly came from the fact that V8 has two nursery implementations that operate in different ways.

The semi-space nursery (the scavenger) is a parallel stop-the-world collector, which is relatively straightforward... except that there can be a concurrent/incremental major trace in progress while the scavenger runs (a so-called interleaved minor GC). External pointer table entries have mark bits, which the scavenger collector will want to use to determine which entries are in use and which can be reclaimed, but the concurrent major marker will also want to use those mark bits. In the end we decided that if a major mark is in progress, an interleaved collection will neither mark nor sweep the external pointer table nursery; a major collection will finish soon, and reclaiming dead EPT entries will be its responsibility.
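
That policy can be sketched like so, in illustrative code rather than V8's:

// Mark bits on EPT entries are shared between the minor and major
// collectors, so an interleaved minor GC leaves the EPT nursery alone
// whenever a major mark is in progress.
struct CollectorState {
  bool major_marking_in_progress = false;
};

void ScavengeEPTNursery(const CollectorState& state) {
  if (state.major_marking_in_progress) {
    // The concurrent major marker owns the mark bits right now; the
    // major collection that finishes soon will reclaim dead entries.
    return;
  }
  // Otherwise, mark entries reachable from surviving nursery objects
  // and sweep the rest (details omitted).
}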

The minor mark-sweep nursery does not run concurrently with a major concurrent/incremental trace, which is something of a relief, but it does run concurrently with the mutator. When we promote a page, we also need to promote all EPT entries for objects on that page. To keep track of which objects have external pointers, we had to add a new remembered set, built up during a minor trace, and cleared when the page is swept (also lazily / concurrently). Promoting a page iterates through that set, evacuating EPT entries to the old space. This is additional work during the stop-the-world pause, but hopefully it is not too bad.
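
As a sketch, again with illustrative types rather than V8's, page promotion over the new remembered set looks something like this:

#include <unordered_set>

// During a minor trace we record which objects on a page hold external
// pointers; promoting the page then evacuates their EPT entries from
// the young EPT space to the old one.
struct Object {};  // stands in for a heap object with an EPT handle

struct Page {
  // The new remembered set: built up during a minor trace, cleared
  // when the page is swept (lazily / concurrently).
  std::unordered_set<Object*> objects_with_external_pointers;
};

void EvacuateEPTEntriesToOldSpace(Object* host) {
  // Move the host's EPT entries young -> old (details omitted).
}

// Runs during the stop-the-world pause when a page is promoted.
void PromotePage(Page& page) {
  for (Object* host : page.objects_with_external_pointers)
    EvacuateEPTEntriesToOldSpace(host);
  page.objects_with_external_pointers.clear();
}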

To be honest I don’t have the performance numbers for this one. It rides the train this week to Chromium 126 though, and hopefully it fixes this problem in a robust way.

partitions, take three

The V8 sandboxing effort has sprouted a number of indirect tables: external pointers, external buffers, trusted objects, and code objects. More recently, to better integrate with compressed pointers, there is also a table for V8-to-managed-C++ (Oilpan) references. I expect the Oilpan reference table will soon grow a nursery along the same lines as the regular external pointer table.

In general, if you have a generational partition of your main object space, it would seem that you almost always need a generational partition of every other space. Otherwise either you cause new allocations to occur in what is effectively an old space, perhaps needlessly hastening a major GC, or you forget to track allocations in that space, leading to a memory leak. If the generational hypothesis applies for a wrapper object, it probably also applies for any auxiliary allocations as well.

fin

I have a few cleanups remaining in this area but I am relieved to have landed this patch, and pleased to have spent time under the hood of V8’s collector. Many many thanks to Samuel Groß, Michael Lippautz, and Dominik Inführ for their review and patience. Until the next cycle, happy allocating!
