wingolog

javascript weakmaps should be iterable

2024-08-19T20:13:20Z

Good evening. Tonight, a brief position statement: it is a mistake for JavaScript’s WeakMap to not be iterable, and we should fix it.

story time

A WeakMap associates a key with a value, as long as the key is otherwise reachable in a program. (It is an ephemeron table.)

When WeakMap was added to JavaScript, back in the ES6 times, some implementors thought that it could be reasonable to implement weak maps not as a data structure in its own right, but rather as a kind of property on each object. Under the hood, adding an key→value association to a map M would set key[M] = value. GC would be free to notice dead maps and remove their associations in live objects.

If you implement weak maps like this, or are open to the idea of such an implementation, then you can’t rely on the associations being enumerable from the map itself, as they are instead spread out among all the key objects. So, ES6 specified WeakMap as not being iterable; given a map, you can’t know what’s in it.

As with many things GC-related, non-iterability of weak maps then gained a kind of legendary status: the lore states that non-iterability preserves some key flexibility for JS implementations, and therefore it is good, and you just have to accept it and move on.

dynamics

Time passed, and two things happened.

One was that this distributed WeakMap implementation strategy did not pan out; everyone ended up implementing weak maps as their own kind of object, and people use an algorithm like the one Steve Fink described a couple years ago to compute the map×key⇒value conjunction. The main original motivation for non-iterability was no longer valid.

The second development was WeakRef and FinalizationRegistry, which expose some details of reachability as viewed by the garbage collector to user JS code. With WeakRef (and WeakMap), you can build an iterable WeakMap.

(Full disclosure: I did work on ES6 and had a hand in FinalizationRegistry but don’t do JS language work currently.)

Thing is, your iterable WeakMap is strictly inferior to what the browser can provide: its implementation is extraordinarily gnarly, shipped over the wire instead of already in the browser, uses more memory, is single-threaded and high-latency (because FinalizationRegistry), and non-standard. What if instead as language engineers we just did our jobs and standardized iterability, as we do with literally every other collection in the JS standard?

Just this morning I wrote yet another iterable WeakSet (which has all the same concerns as WeakMap), and while it’s sufficient for my needs, it’s not good (lacking prompt cleanup of dead entries), and by construction can’t be great (because it has to be redundantly implemented on top of WeakSet instead of being there already).

I am sympathetic to deferring language specification decisions to allow the implementation space to be explored, but when the exploration is done and the dust has settled, we shouldn’t hesitate to pick a winner: JS weak maps and sets should be iterable. Godspeed, brave TC39 souls; should you take up this mantle, you are doing the Lord’s work!

Thanks to Philip Chimento for notes on the timeline and Joyee Cheung for notes on the iterable WeakMap implementation in the WeakRef spec. All errors mine, of course!

finalizers, guardians, phantom references, et cetera

2024-07-22T09:27:54Z

Consider guardians. Compared to finalizers, in which the cleanup procedures are run concurrently with the mutator, by the garbage collector, guardians allow the mutator to control concurrency. See Guile’s documentation for more notes. Java’s PhantomReference / ReferenceQueue seems to be similar in spirit, though the details differ.

questions

If we want guardians, how should we implement them in Whippet? How do they relate to ephemerons and finalizers?

It would be a shame if guardians were primitive, as they are a relatively niche language feature. Guile has them, yes, but really what Guile has is bugs: because Guile implements guardians on top of BDW-GC’s finalizers (without topological ordering), all the other things that finalizers might do in Guile (e.g. closing file ports) might run at the same time as the objects protected by guardians. For the specific object being guarded, this isn’t so much of a problem, because when you put an object in the guardian, it arranges to prepend the guardian finalizer before any existing finalizer. But when a whole clique of objects becomes unreachable, objects referred to by the guarded object may be finalized. So the object you get back from the guardian might refer to, say, already-closed file ports.

The guardians-are-primitive solution is to add a special guardian pass to the collector that will identify unreachable guarded objects. In this way, the transitive closure of all guarded objects will be already visited by the time finalizables are computed, protecting them from finalization. This would be sufficient, but is it necessary?

answers?

Thinking more abstractly, perhaps we can solve this issue and others with a set of finalizer priorities: a collector could have, say, 10 finalizer priorities, and run the finalizer fixpoint once per priority, in order. If no finalizer is registered at a given priority, there is no overhead. A given embedder (Guile, say) could use priority 0 for guardians, priority 1 for “normal” finalizers, and ignore the other priorities. I think such a facility could also support other constructs, including Java’s phantom references, weak references, and reference queues, especially when combined with ephemerons.

Anyway, all this is a note for posterity. Are there other interesting mutator-exposed GC constructs that can’t be implemented with a combination of ephemerons and multi-priority finalizers? Do let me know!

are ephemerons primitive?

2022-11-28T21:11:30Z

Good evening :) A quick note, tonight: I’ve long thought that ephemerons are primitive and can’t be implemented with mark functions and/or finalizers, but today I think I have a counterexample.

For context, one of the goals of the GC implementation I have been working on on is to replace Guile‘s current use of the Boehm-Demers-Weiser (BDW) conservative collector. Of course, changing a garbage collector for a production language runtime is risky, and for Guile one of the mitigation strategies for this work is that the new collector is behind an abstract API whose implementation can be chosen at compile-time, without requiring changes to user code. That way we can first switch to BDW-implementing-the-new-GC-API, then switch the implementation behind that API to something else.

Abstracting GC is a tricky problem to get right, and I thank the MMTk project for showing that this is possible – you have user-facing APIs that need to be implemented by concrete collectors, but also extension points so that the user can provide some compile-time configuration too, for example to provide field-tracing visitors that take into account how a user wants to lay out objects.

Anyway. As we discussed last time, ephemerons are usually have explicit support from the GC, so we need an ephemeron abstraction as part of the abstract GC API. The question is, can BDW-GC provide an implementation of this API?

I think the answer is “yes, but it’s very gnarly and will kill performance so bad that you won’t want to do it.”

the contenders

Consider that the primitives that you get with BDW-GC are custom mark functions, run on objects when they are found to be live by the mark workers; disappearing links, a kind of weak reference; and finalizers, which receive the object being finalized, can allocate, and indeed can resurrect the object.

BDW-GC’s finalizers are a powerful primitive, but not one that is useful for implementing the “conjunction” aspect of ephemerons, as they cannot constrain the marker’s idea of graph connectivity: a finalizer can only prolong the life of an object subgraph, not cut it short. So let’s put finalizers aside.

Weak references have a tantalizingly close kind of conjunction property: if the weak reference itself is alive, and the referent is also otherwise reachable, then the weak reference can be dereferenced. However this primitive only involves the two objects E and K; there’s no way to then condition traceability of a third object V to E and K.

We are left with mark functions. These are an extraordinarily powerful interface in BDW-GC, but somewhat expensive also: not inlined, and going against the grain of what BDW-GC is really about (heaps in which the majority of all references are conservative). But, OK. They way they work is, your program allocates a number of GC “kinds”, and associates mark functions with those kinds. Then when you allocate objects, you use those kinds. BDW-GC will call your mark functions when tracing an object of those kinds.

Let’s assume firstly that you have a kind for ephemerons; then when you go to mark an ephemeron E, you mark the value V only if the key K has been marked. Problem solved, right? Only halfway: you also have to handle the case in which E is marked first, then K. So you publish E to a global hash table, and... well. You would mark V when you mark a K for which there is a published E. But, for that you need a hook into marking V, and V can be any object...

So now we assume additionally that all objects are allocated with user-provided custom mark functions, and that all mark functions check if the marked object is in the published table of pending ephemerons, and if so marks values. This is essentially what a proper ephemeron implementation would do, though there are some optimizations one can do to avoid checking the table for each object before the mark stack runs empty for the first time. In this case, yes you can do it! Additionally if you register disappearing links for the K field in each E, you can know if an ephemeron E was marked dead in a previous collection. Add a pre-mark hook (something BDW-GC provides) to clear the pending ephemeron table, and you are in business.

yes, but no

So, it is possible to implement ephemerons with just custom mark functions. I wouldn’t want to do it, though: missing the mostly-avoid-pending-ephemeron-check optimization would be devastating, and really what you want is support in the GC implementation. I think that for the BDW-GC implementation in whippet I’ll just implement weak-key associations, in which the value is always marked strongly unless the key was dead on a previous collection, using disappearing links on the key field. That way a (possibly indirect) reference from a value V to a key K can indeed keep K alive, but oh well: it’s a conservative approximation of what should happen, and not worse than what Guile has currently.

Good night and happy hacking!

ephemerons and finalizers

2022-10-31T12:21:23Z

Good day, hackfolk. Today we continue the series on garbage collection with some notes on ephemerons and finalizers.

conjunctions and disjunctions

First described in a 1997 paper by Barry Hayes, which attributes the invention to George Bosworth, ephemerons are a kind of weak key-value association.

Thinking about the problem abstractly, consider that the garbage collector’s job is to keep live objects and recycle memory for dead objects, making that memory available for future allocations. Formally speaking, we can say:

An object is live if it is in the root set
An object is live it is referenced by any live object.

This circular definition uses the word any, indicating a disjunction: a single incoming reference from a live object is sufficient to mark a referent object as live.

Ephemerons augment this definition with a conjunction:

An object V is live if, for an ephemeron E containing an association betweeen objects K and V, both E and K are live.

This is a more annoying property for a garbage collector to track. If you happen to mark K as live and then you mark E as live, then you can just continue to trace V. But if you see E first and then you mark K, you don’t really have a direct edge to V. (Indeed this is one of the main purposes for ephemerons: associating data with an object, here K, without actually modifying that object.)

During a trace of the object graph, you can know if an object is definitely alive by checking if it was visited already, but if it wasn’t visited yet that doesn’t mean it’s not live: we might just have not gotten to it yet. Therefore one common implementation strategy is to wait until tracing the object graph is done before tracing ephemerons. But then we have another annoying problem, which is that tracing ephemerons can result in finding more live ephemerons, requiring another tracing cycle, and so on. Mozilla’s Steve Fink wrote a nice article on this issue earlier this year, with some mitigations.

finalizers aren’t quite ephemerons

All that is by way of introduction. If you just have an object graph with strong references and ephemerons, our definitions are clear and consistent. However, if we add some more features, we muddy the waters.

Consider finalizers. The basic idea is that you can attach one or a number of finalizers to an object, and that when the object becomes unreachable (not live), the system will invoke a function. One way to imagine this is a global association from finalizable object O to finalizer F.

As it is, this definition is underspecified in a few ways. One, what happens if F references O? It could be a GC-managed closure, after all. Would that prevent O from being collected?

Ephemerons solve this problem, in a way; we could trace the table of finalizers like a table of ephemerons. In that way F would only be traced if O is live already, so that by itself it wouldn’t keep O alive. But then if O becomes dead, you’d want to invoke F, so you’d need it to be live, so reachability of finalizers is not quite the same as ephemeron-reachability: indeed logically all F values in the finalizer table are live, because they all will be invoked at some point.

In the end, if F references O, then F actually keeps O alive. Whether this prevents O from being finalized depends on our definition for finalizability. We could say that an object is finalizable if it is found to be unreachable after a full trace, and the finalizers F are in the root set. Or we could say that an object is finalizable if it is unreachable after a partial trace, in which finalizers are not themselves in the initial root set, and instead we trace them after determining the finalizable set.

Having finalizers in the initial root set is unfortunate: there’s no quick check you can make when adding a finalizer to signal this problem to the user, and it’s very hard to convey to a user exactly how it is that an object is referenced. You’d have to add lots of gnarly documentation on top of the already unavoidable gnarliness that you already had to write. But, perhaps it is a local maximum.

Incidentally, you might think that you can get around these issues by saying “don’t reference objects from their finalizers”, and that’s true in a way. However it’s not uncommon for finalizers to receive the object being finalized as an argument; after all, it’s that object which probably encapsulates the information necessary for its finalization. Of course this can lead to the finalizer prolonging the longevity of an object, perhaps by storing it to a shared data structure. This is a risk for correct program construction (the finalized object might reference live-but-already-finalized objects), but not really a burden for the garbage collector, except in that it’s a serialization point in the collection algorithm: you trace, you compute the finalizable set, then you have to trace the finalizables again.

ephemerons vs finalizers

The gnarliness continues! Imagine that O is associated with a finalizer F, and also, via ephemeron E, some auxiliary data V. Imagine that at the end of the trace, O is unreachable and so will be dead. Imagine that F receives O as an argument, and that F looks up the association for O in E. Is the association to V still there?

Guile’s documentation on guardians, a finalization-like facility, specifies that weak associations (i.e. ephemerons) remain in place when an object becomes collectable, though I think in practice this has been broken since Guile switched to the BDW-GC collector some 20 years ago or so and I would like to fix it.

One nice solution falls out if you prohibit resuscitation by not including finalizer closures in the root set and not passing the finalizable object to the finalizer function. In that way you will never be able to look up E×O⇒V, because you don’t have O. This is the path that JavaScript has taken, for example, with WeakMap and FinalizationRegistry.

However if you allow for resuscitation, for example by passing finalizable objects as an argument to finalizers, I am not sure that there is an optimal answer. Recall that with resuscitation, the trace proceeds in three phases: first trace the graph, then compute and enqueue the finalizables, then trace the finalizables. When do you perform the conjunction for the ephemeron trace? You could do so after the initial trace, which might augment the live set, protecting some objects from finalization, but possibly missing ephemeron associations added in the later trace of finalizable objects. Or you could trace ephemerons at the very end, preserving all associations for finalizable objects (and their referents), which would allow more objects to be finalized at the same time.

Probably if you trace ephemerons early you will also want to trace them later, as you would do so because you think ephemeron associations are important, as you want them to prevent objects from being finalized, and it would be weird if they were not present for finalizable objects. This adds more serialization to the trace algorithm, though:

(Add finalizers to the root set?)
Trace from the roots
Trace ephemerons?
Compute finalizables
Trace finalizables (and finalizer closures if not done in 1)
Trace ephemerons again?

These last few paragraphs are the reason for today’s post. It’s not clear to me that there is an optimal way to compose ephemerons and finalizers in the presence of resuscitation. If you add finalizers to the root set, you might prevent objects from being collected. If you defer them until later, you lose the optimization that you can skip steps 5 and 6 if there are no finalizables. If you trace (not-yet-visited) ephemerons twice, that’s overhead; if you trace them only once, the user could get what they perceive as premature finalization of otherwise reachable objects.

In Guile I think I am going to try to add finalizers to the root set, pass the finalizable to the finalizer as an argument, and trace ephemerons twice if there are finalizable objects. I think this wil minimize incoming bug reports. I am bummed though that I can’t eliminate them by construction.

Until next time, happy hacking!