wingolog: A mostly dorky weblog by Andy Wingo

the sticky mark-bit algorithm
https://wingolog.org/2022/10/22/the-sticky-mark-bit-algorithm
22 October 2022

Good day, hackfolk!

The Sticky Mark-Bit Algorithm

Also an intro to mark-sweep GC

7 Oct 2022 – Igalia

Andy Wingo

A funny post today; I gave an internal presentation at work recently describing the so-called “sticky mark bit” algorithm. I figured I might as well post it here, as a gift to you from your local garbage human.

Automatic Memory Management

“Don’t free, the system will do it for you”

Eliminate a class of bugs: use-after-free

Relative to bare malloc/free, qualitative performance improvements

  • cheap bump-pointer allocation
  • cheap reclamation/recycling
  • better locality

Continuum: bmalloc / tcmalloc grow towards GC

Before diving in though, we start with some broad context about automatic memory management. The term mostly means “garbage collection” these days, but really it describes a component of a system that provides fresh memory for new objects and automatically reclaims memory for objects that won’t be needed in the program’s future. This stands in contrast to manual memory management, which relies on the programmer to free their objects.

Of course, automatic memory management ensures some valuable system-wide properties, like lack of use-after-free vulnerabilities. But also by enlarging the scope of the memory management system to include full object lifetimes, we gain some potential speed benefits, for example eliminating any cost for free, in the case of e.g. a semi-space collector.
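
To make the "cheap bump-pointer allocation" point concrete, here is a minimal sketch in C of the allocation fast path that a semi-space or nursery collector can offer. The region size, the 8-byte alignment, and the helper names are assumptions of the sketch, not details of any particular collector.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bump-pointer allocation: allocating is a pointer increment plus a
   bounds check.  There is no per-object free; a semi-space collector
   "frees" by evacuating survivors elsewhere and resetting hp to base. */
struct bump_region {
  uintptr_t hp;     /* next free byte */
  uintptr_t limit;  /* one past the end of the region */
  uintptr_t base;   /* start of the region */
};

static void region_init(struct bump_region *r, size_t size) {
  r->base = (uintptr_t)malloc(size);
  r->hp = r->base;
  r->limit = r->base + size;
}

static void *bump_alloc(struct bump_region *r, size_t bytes) {
  bytes = (bytes + 7) & ~(size_t)7;   /* round up to 8-byte alignment */
  if (r->hp + bytes > r->limit)
    return NULL;                      /* out of space: time to collect */
  void *obj = (void *)r->hp;
  r->hp += bytes;
  return obj;
}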

Automatic Memory Management

Two strategies to determine live object graph

  • Reference counting
  • Tracing

What to do if you trace

  • Mark, and then sweep or compact
  • Evacuate

Tracing O(n) in live object count

I should mention that reference counting is a form of automatic memory management. It’s not enough on its own; unreachable cycles in the object reference graph have to be detected either by a heap tracer or broken by weak references.

It used to be that we GC nerds made fun of reference counting as being an expensive, half-assed solution that didn’t work very well, but there have been some fundamental advances in the state of the art in the last 10 years or so.

But this talk is more about the other kind of memory management, which involves periodically tracing the graph of objects in the heap. Generally speaking, as you trace you can do one of two things: mark the object, simply setting a bit indicating that an object is live, or evacuate the object to some other location. If you mark, you may choose to then compact by sliding all objects down to lower addresses, squeezing out any holes, or you might sweep all holes into a free list for use by further allocations.

Mark-sweep GC (1/3)

freelist := []

allocate():
  if freelist is empty: collect()
  return freelist.pop()

collect():
  mark()
  sweep()
  if freelist is empty: abort

Concretely, let’s look closer at mark-sweep. Let’s assume for the moment that all objects are the same size. Allocation pops fresh objects off a freelist, and collects if there is none. Collection does a mark and then a sweep, aborting if sweeping yielded no free objects.

Mark-sweep GC (2/3)

mark():
  worklist := []
  for ref in get_roots():
    if mark_one(ref):
      worklist.add(ref)
  while worklist is not empty:
    for ref in trace(worklist.pop()):
      if mark_one(ref):
        worklist.add(ref)

sweep():
  for ref in heap:
    if marked(ref):
      unmark_one(ref)
    else:
      freelist.add(ref)

Going a bit deeper, here we have some basic implementations of mark and sweep. Marking starts with the roots: edges from outside the automatically-managed heap indicating a set of initial live objects. You might get these by maintaining a stack of objects that are currently in use. Then it traces references from these roots to other objects, until there are no more references to trace. It will visit each live object exactly once, and so is O(n) in the number of live objects.

Sweeping requires the ability to iterate the heap. With the precondition here that collect is only ever called with an empty freelist, it will clear the mark bit from each live object it sees, and otherwise add newly-freed objects to the global freelist. Sweep is O(n) in total heap size, but some optimizations can amortize this cost.

Mark-sweep GC (3/3)

marked := 1

get_tag(ref):
  return *(uintptr_t*)ref
set_tag(ref, tag):
  *(uintptr_t*)ref = tag

marked(ref):
  return (get_tag(ref) & 1) == marked
mark_one(ref):
  if marked(ref): return false
  set_tag(ref, (get_tag(ref) & ~1) | marked)
  return true
unmark_one(ref):
  set_tag(ref, (get_tag(ref) ^ 1))

Finally, some details on how you might represent a mark bit. If a ref is a pointer, we could store the mark bit in the first word of the objects, as we do here. You can choose instead to store them in a side table, but it doesn’t matter for today’s example.
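
For the curious, here is what the side-table alternative might look like in C: one mark bit per fixed-size granule of a contiguous heap, indexed by address. The 16-byte granule size and the single contiguous heap are assumptions of the sketch. A nice side effect of side tables is that clearing every mark is a single memset.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Side-table mark bits: one bit per 16-byte granule of the heap
   [heap_base, heap_base + HEAP_SIZE).  Granule size and a single
   contiguous heap are assumptions of this sketch. */
#define GRANULE_SIZE 16
#define HEAP_SIZE    (1u << 20)

static uintptr_t heap_base;
static uint8_t mark_bits[HEAP_SIZE / GRANULE_SIZE / 8];

static size_t granule_index(void *ref) {
  return ((uintptr_t)ref - heap_base) / GRANULE_SIZE;
}

static int marked(void *ref) {
  size_t i = granule_index(ref);
  return (mark_bits[i / 8] >> (i % 8)) & 1;
}

static int mark_one(void *ref) {
  size_t i = granule_index(ref);
  uint8_t bit = 1 << (i % 8);
  if (mark_bits[i / 8] & bit) return 0;   /* already marked */
  mark_bits[i / 8] |= bit;
  return 1;
}

static void clear_all_marks(void) {       /* side-table analogue of unmarking */
  memset(mark_bits, 0, sizeof mark_bits);
}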

Observations

Freelist implementation crucial to allocation speed

Non-contiguous allocation suboptimal for locality

World is stopped during collect(): “GC pause”

mark O(n) in live data, sweep O(n) in total heap size

Touches a lot of memory

The salient point is that these O(n) operations happen when the world is stopped. This can be noticeable, even taking seconds for the largest heap sizes. It sure would be nice to have the benefits of GC, but with lower pause times.

Optimization: rotate mark bit

flip():
  marked ^= 1

collect():
  flip()
  mark()
  sweep()
  if freelist is empty: abort

unmark_one(ref):
  pass

Avoid touching mark bits for live data

Incidentally, before moving on, I should mention an optimization to mark bit representation: instead of clearing the mark bit for live objects during the sweep phase, we could just choose to flip our interpretation of what the mark bit means. This allows unmark_one to become a no-op.

Reducing pause time

Parallel tracing: parallelize mark. Clear improvement, but speedup depends on object graph shape (e.g. linked lists).

Concurrent tracing: mark while your program is running. Tricky, and not always a win (“Retrofitting Parallelism onto OCaml”, ICFP 2020).

Partial tracing: mark only a subgraph. Divide space into regions, record inter-region links, collect one region only. Overhead to keep track of inter-region edges.

Now, let’s revisit the pause time question. What can we do about it? In general there are three strategies.

Generational GC

Partial tracing

Two spaces: nursery and oldgen

Allocations in nursery (usually)

Objects can be promoted/tenured from nursery to oldgen

Minor GC: just trace the nursery

Major GC: trace nursery and oldgen

“Objects tend to die young”

Overhead of old-to-new edges offset by less amortized time spent tracing

Today’s talk is about partial tracing. The basic idea is that instead of tracing the whole graph, just trace a part of it, ideally a small part.

A simple and effective strategy for partitioning a heap into subgraphs is generational garbage collection. The idea is that objects tend to die young, and that therefore it can be profitable to focus attention on collecting objects that were allocated more recently. You therefore partition the heap graph into two parts, young and old, and you generally try to trace just the young generation.

The difficulty with partitioning the heap graph is that you need to maintain a set of inter-partition edges, and you do so by imposing overhead on the user program. But a generational partition minimizes this cost: you never collect only the old generation, so you don't need to remember new-to-old edges, and mutations of old objects (the ones that create old-to-new edges) are less common than mutations of new objects.

Generational GC

Usual implementation: semispace nursery and mark-compact oldgen

Tenuring via evacuation from nursery to oldgen

Excellent locality in nursery

Very cheap allocation (bump-pointer)

But... evacuation requires all incoming edges to an object to be updated to new location

Requires precise enumeration of all edges

Usually the generational partition is reflected in the address space: there is a nursery and it is in these pages and an oldgen in these other pages, and never the twain shall meet. To tenure an object is to actually move it from the nursery to the old generation. But moving objects requires that the collector be able to enumerate all incoming edges to that object, and then to have the collector update them, which can be a bit of a hassle.
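
To see why, here is roughly what tenuring-by-evacuation looks like in C, in the style of a copying collector: copy the object, leave a forwarding pointer behind, and route every traced edge through it. The header layout and helper functions here are assumptions of the sketch; the point is only that every incoming edge has to pass through code like this, so the collector must be able to find and update all of them precisely.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed object layout: a one-word header whose low bit is 1, which
   can be overwritten with an (aligned, low-bit-0) forwarding pointer
   once the object has been copied out of the nursery. */
struct object {
  uintptr_t header;
  /* fields follow */
};

extern void *oldgen_alloc(size_t bytes);     /* assumed: bump-allocate in oldgen  */
extern size_t object_size(struct object *);  /* assumed: size decoded from header */

static struct object *evacuate(struct object *obj) {
  if (!(obj->header & 1))                    /* already moved: follow the forwarder */
    return (struct object *)obj->header;
  size_t size = object_size(obj);
  struct object *copy = oldgen_alloc(size);
  memcpy(copy, obj, size);
  obj->header = (uintptr_t)copy;             /* install forwarding pointer */
  return copy;
}

/* Every edge the collector traces must be rewritten through evacuate():
       *edge = evacuate(*edge);
   which is exactly the precise-enumeration requirement mentioned above. */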

JavaScriptCore

No precise stack roots, neither in generated nor C++ code

Compare to V8’s Handle<> in C++, stack maps in generated code

Stack roots conservative: integers that happen to hold addresses of objects treated as object graph edges

(Cheaper implementation strategy, can eliminate some bugs)

Specifically in JavaScriptCore, the JavaScript engine of WebKit and the Safari browser, we have a problem. JavaScriptCore uses a technique known as “conservative root-finding”: it just iterates over the words in a thread’s stack to see if any of those words might reference an object on the heap. If they do, JSC conservatively assumes that it is indeed a reference, and keeps that object live.

Of course a given word on the stack could just be an integer which happens to be an object’s address. In that case we would hold on to too much data, but that’s not so terrible.
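
In C terms, conservative root-finding looks something like the following sketch. The bounds check, the alignment test, and the looks_like_object filter are stand-ins for JSC's real, more involved validity checks.

#include <stdint.h>

extern uintptr_t heap_base, heap_limit;        /* assumed: bounds of the GC'd heap */
extern int looks_like_object(uintptr_t addr);  /* assumed: cheap "is this a cell?" */
extern void mark_conservative_root(uintptr_t addr);

/* Scan every word between the top and the bottom of a thread's stack.
   Any word that could be a pointer into the heap is treated as a root. */
static void scan_stack_conservatively(uintptr_t *stack_top, uintptr_t *stack_bottom) {
  for (uintptr_t *p = stack_top; p < stack_bottom; p++) {  /* stack grows down */
    uintptr_t word = *p;
    if (word >= heap_base && word < heap_limit   /* within heap bounds?    */
        && (word & 7) == 0                       /* plausibly aligned?     */
        && looks_like_object(word))              /* points at a valid cell */
      mark_conservative_root(word);
  }
}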

Conservative root-finding is again one of those things that GC nerds like to make fun of, but the pendulum seems to be swinging back its way; perhaps another article on that some other day.

JavaScriptCore

Automatic memory management eliminates use-after-free...

...except when combined with manual memory management

Prevent type confusion due to reuse of memory for object of different shape

addrof/fakeobj primitives: phrack.org/issues/70/3.html

Type-segregated heaps

No evacuation: no generational GC?

The other thing about JSC is that it is constantly under attack by malicious web sites, and that any bug in it is a step towards hackers taking over your phone. Besides bugs inside JSC, there are bugs also in the objects exposed to JavaScript from the web UI. Although use-after-free bugs are impossible with a fully traceable object graph, references to and from DOM objects might not be traceable by the collector, instead referencing GC-managed objects by reference counting or weak references or even manual memory management. Bugs in these interfaces are a source of exploitable vulnerabilities.

In brief, there seems to be a decent case for trying to mitigate use-after-free bugs. Beyond the nuclear option of not freeing, one step we could take would be to avoid re-using memory between objects of different shapes. So you have a heap for objects with 3 fields, another for objects with 4 fields, and so on.

But it would seem that this mitigation is at least somewhat incompatible with the usual strategy of generational collection, where we use a semi-space nursery. The nursery memory gets re-used all the time for all kinds of objects. So does that rule out generational collection?

Sticky mark bit algorithm

collect(is_major=false):
  if is_major: flip()
  mark(is_major)
  sweep()
  if freelist is empty:
    if is_major: abort
    collect(true)

mark(is_major):
  worklist := []
  if not is_major:
    worklist += remembered_set
    remembered_set := []
  ...

Turns out, you can generationally partition a mark-sweep heap.

Recall that to visit each live object, you trace the heap, setting mark bits. To visit them all again, you have to clear the mark bit between traces. Our first collect implementation did so in sweep, via unmark_one; then with the optimization we switched to conceptually clearing them all at once, by flipping the sense of the mark bit in flip().

Here, then, the trick is that you just don’t clear the mark bit between traces for a minor collection (tracing just the nursery). In that way all objects that were live at the previous collection are considered the old generation. Marking an object is tenuring, in-place.

There are just two tiny modifications to mark-sweep to implement sticky mark bit collection: one, flip the mark bit only on major collections; and two, include a remembered set in the roots for minor collections.

Sticky mark bit algorithm

Mark bit from previous trace “sticky”: avoid flip for minor collections

Consequence: old objects not traced, as they are already marked

Old-to-young edges: the “remembered set”

Write barrier

write_field(object, offset, value):
  remember(object)
  object[offset] = value

The remembered set is maintained by instrumenting each write that the program makes with a little call out to code from the garbage collector. This code is the write barrier, and here we use it to add to the set of objects that might reference new objects. There are many ways to implement this write barrier but that’s a topic for another day.
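
The slide's write_field remembers the object unconditionally; a practical barrier usually filters a bit, for example only logging objects that are already old, and only once per cycle. Here is one such object-logging barrier sketched in C, with an assumed object layout and remembered-set API -- an illustration, not JSC's actual barrier.

#include <stdbool.h>
#include <stddef.h>

/* Assumed layout for the sketch: each object carries an "old" flag
   (set when it survives a trace) and a "remembered" flag (set when it
   has already been logged this cycle). */
struct obj {
  bool is_old;
  bool remembered;
  struct obj *fields[];
};

extern void remembered_set_add(struct obj *);   /* assumed remembered-set API */

/* The write barrier: run on every field store the mutator performs. */
static void write_field(struct obj *o, size_t offset, struct obj *value) {
  if (o->is_old && !o->remembered) {
    o->remembered = true;
    remembered_set_add(o);          /* o may now point to a young object */
  }
  o->fields[offset] = value;
}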

JavaScriptCore

Parallel GC: Multiple collector threads

Concurrent GC: mark runs while JS program running; “riptide”; interaction with write barriers

Generational GC: in-place, non-moving GC generational via sticky mark bit algorithm

Alan Demers, “Combining generational and conservative garbage collection: framework and implementations”, POPL ’90

So returning to JavaScriptCore and the general techniques for reducing pause times, I can summarize to note that it does them all. It traces both in parallel and concurrently, and it tries to trace just newly-allocated objects using the sticky mark bit algorithm.

Conclusions

A little-used algorithm

Motivation for JSC: conservative roots

Original motivation: conservative roots; write barrier enforced by OS-level page protections

Revived in “Sticky Immix”

Better than nothing, not quite as good as semi-space nursery

I find that people that are interested in generational GC go straight for the semispace nursery. There are some advantages to that approach: allocation is generally cheaper in a semispace than in a mark space, locality among new objects is better, locality after tenuring is better, and you have better access locality during a nursery collection.

But if for some reason you find yourself unable to enumerate all roots, you can still take advantage of generational collection via the sticky mark-bit algorithm. It’s a simple change that improves performance, as long as you are able to insert write barriers on all heap object mutations.

The challenge with a sticky-mark-bit approach to generations is avoiding the O(n) sweep phase. There are a few strategies, but more on that another day perhaps.

And with that, presentation done. Until next time, happy hacking!

understanding webassembly code generation throughput
https://wingolog.org/2020/04/14/understanding-webassembly-code-generation-throughput
14 April 2020

Greets! Today's article looks at browser WebAssembly implementations from a compiler throughput point of view. As I wrote in my article on Firefox's WebAssembly baseline compiler, web browsers have multiple wasm compilers: some that produce code fast, and some that produce fast code. Implementors are willing to pay the cost of having multiple compilers in order to satisfy these conflicting needs. So how well do they do their jobs? Why bother?

In this article, I'm going to take the simple path and just look at code generation throughput on a single chosen WebAssembly module. Think of it as X-ray diffraction to expose aspects of the inner structure of the WebAssembly implementations in SpiderMonkey (Firefox), V8 (Chrome), and JavaScriptCore (Safari).

experimental setup

As a workload, I am going to use a version of the "Zen Garden" demo. This is a 40-megabyte game engine and rendering demo, originally released for other platforms, and compiled to WebAssembly a couple years later. Unfortunately the original URL for the demo was disabled at some point in late 2019, so it no longer has a home on the web. A bit of a weird situation and I am not clear on licensing either. In any case I have a version downloaded, and have hacked out a minimal set of "imports" that the WebAssembly module needs from the host to allow the module to compile and link when run from a JavaScript shell, without requiring WebGL and similar facilities. So the benchmark is just to instantiate a WebAssembly module from the 40-megabyte byte array and see how long it takes. It would be better if I had more test cases (and would be happy to add them to the comparison!) but this is a start.

I start by benchmarking the various WebAssembly implementations, firstly in their standard configuration and then setting special run-time flags to measure the performance of the component compilers. I run these tests on the core-rich machine that I use for browser development (2 Xeon Silver 4114 CPUs for a total of 40 logical cores). The default-configuration numbers are therefore not indicative of performance on a low-end Android phone, but we can use them to extract aspects of the different implementations.

Since I'm interested in compiler throughput, I'm not particularly concerned about how well a compiler will use all 40 cores. Therefore when testing the specific compilers I will set implementation-specific flags to disable parallelism in the compiler and GC: --single-threaded on V8, --no-threads on SpiderMonkey, and --useConcurrentGC=false --useConcurrentJIT=false on JSC. To further restrict any threads that the implementation might decide to spawn, I'll bind these to a single core on my machine using taskset -c 4. Otherwise the machine is in its normal configuration (nothing else significant running, all cores available for scheduling, turbo boost enabled).

I'll express results in nanoseconds per WebAssembly code byte. Of the 40 megabytes or so in the Zen Garden demo, only 23 891 164 bytes are actually function code; the rest is mostly static data (textures and so on). So I'll divide the total time by this code byte count.

I tested V8 at git revision 0961376575206, SpiderMonkey at hg revision 8ec2329bef74, and JavaScriptCore at subversion revision 259633. The benchmarks can be run using just a shell; see the pull request. I timed how long it took to instantiate the Zen Garden demo, ensuring that a basic export was callable. I collected results from 20 separate runs, sleeping a second between them. The bars in the charts below show the median times, with a histogram overlay of all results.

results & analysis

We can see some interesting results in this graph. Note that the Y axis is logarithmic. The "concurrent tiering" results in the graph correspond to the default configurations (no special flags, no taskset, all cores available).

The first interesting conclusions that pop out for me concern JavaScriptCore, which is the only implementation to have a baseline interpreter (run using --useWasmLLInt=true --useBBQJIT=false --useOMGJIT=false). JSC's WebAssembly interpreter is actually structured as a compiler that generates custom WebAssembly-specific bytecode, which is then run by a custom interpreter built using the same infrastructure as JSC's JavaScript interpreter (the LLInt). Directly interpreting WebAssembly might be possible as a low-latency implementation technique, but since you need to validate the WebAssembly anyway and eventually tier up to an optimizing compiler, apparently it made sense to emit fresh bytecode.

The part of JSC that generates baseline interpreter code runs slower than SpiderMonkey's baseline compiler, so one is tempted to wonder why JSC bothers to go the interpreter route; but then we recall that on iOS, we can't generate machine code in some contexts, so the LLInt does appear to address a need.

One interesting feature of the LLInt is that it allows tier-up to the optimizing compiler directly from loops, which neither V8 nor SpiderMonkey support currently. Failure to tier up can be quite confusing for users, so good on JSC hackers for implementing this.

Finally, while baseline interpreter code generation throughput handily beats V8's baseline compiler, it would seem that something in JavaScriptCore is not adequately taking advantage of multiple cores; if one core compiles at 51ns/byte, why do 40 cores only do 41ns/byte? It could be my tests are misconfigured, or it could be that there's a nice speed boost to be found somewhere in JSC.

JavaScriptCore's baseline compiler (run using --useWasmLLInt=false --useBBQJIT=true --useOMGJIT=false) runs much more slowly than SpiderMonkey's or V8's baseline compiler, which I think can be attributed to the fact that it builds a graph of basic blocks instead of doing a one-pass compile. To me these results validate SpiderMonkey's and V8's choices, looking strictly from a latency perspective.

I don't have graphs for code generation throughput of JavaScriptCore's optimizing compiler (run using --useWasmLLInt=false --useBBQJIT=false --useOMGJIT=true); it turns out that JSC wants one of the lower tiers to be present, and will only tier up from the LLInt or from BBQ. Oh well!

V8 and SpiderMonkey, on the other hand, are much of the same shape. Both implement a streaming baseline compiler and an optimizing compiler; for V8, we get these via --liftoff --no-wasm-tier-up or --no-liftoff, respectively, and for SpiderMonkey it's --wasm-compiler=baseline or --wasm-compiler=ion.

Here we should conclude directly that SpiderMonkey generates code around twice as fast as V8 does, in both tiers. SpiderMonkey can generate machine code faster even than JavaScriptCore can generate bytecode, and optimized machine code faster than JSC can make baseline machine code. It's a very impressive result!

Another conclusion concerns the efficacy of tiering: for both V8 and SpiderMonkey, their baseline compilers run more than 10 times as fast as the optimizing compiler, and the same ratio holds between JavaScriptCore's baseline interpreter and compiler.

Finally, it would seem that the current cross-implementation benchmark for lowest-tier code generation throughput on a desktop machine would then be around 50 ns per WebAssembly code byte for a single core, which corresponds to receiving code over the wire at somewhere around 160 megabits per second (Mbps). If we add in concurrency and manage to farm out compilation tasks well, we can obviously double or triple that bitrate. Optimizing compilers run at least an order of magnitude slower. We can conclude that to the desktop end user, WebAssembly compilation time is indistinguishable from download time for the lowest tier. The optimizing tier is noticeably slower though, running more around 10-15 Mbps per core, so time-to-tier-up is still a concern for faster networks.
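
If you want to check that unit conversion, it is just arithmetic: one byte every N nanoseconds is 8000/N megabits per second. A few throwaway lines of C to confirm the 160 Mbps figure:

#include <stdio.h>

int main(void) {
  double ns_per_byte = 50.0;             /* lowest tier, single core */
  double mbps = 8000.0 / ns_per_byte;    /* bits per microsecond == Mbit/s */
  printf("%g ns/byte is %g Mbit/s\n", ns_per_byte, mbps);   /* prints 160 */
  return 0;
}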

Going back to the question posed at the start of the article: yes, tiering shows a clear benefit in terms of WebAssembly compilation latency, letting users interact with web sites sooner. So that's that. Happy hacking and until next time!

ffconf 2014
https://wingolog.org/2014/11/09/ffconf-2014
9 November 2014

Last week I had the great privilege of speaking at ffconf in Brighton, UK. It was lovely. The town put on a full demonstration of its range of November weather patterns, from blue skies to driving rain to hail (!) to sea-spray to drizzle and back again. Good times.

The conference itself was quite pleasant as well, and from the speaker perspective it was amazing. A million thanks to Remy and Julie for making it such a pleasure. ffconf is mostly a front-end development conference, so it's not directly related to the practice of my work on JS implementations -- perhaps you're unaware, but there aren't so many browser implementors that actually develop content for their browsers, and indeed fewer JS implementors that actually write JS. Me, I sling C++ all day long and the most JavaScript I write is for tests. When in the weeds, sometimes we forget we're building an amazing runtime and that people do inspiring things with it, so it's nice to check in with front-end folks at their conferences to see what people are excited about.

My talk was about the part of JavaScript implementations that are written in JavaScript itself. This is an area that isn't so well known, and it has its amusing quirks. I think it can be interesting to a curious JS hacker who wants to spelunk down a bit to see what's going on in their browsers. Intrepid explorers might even find opportunities to contribute patches. Anyway, nerdy stuff, but that's basically how I roll.

The slides are here: without images (350kB PDF) or with images (3MB PDF).

I haven't been to the UK in years, and being in a foreign country where everyone speaks my native language was quite refreshing. At the same time there was an awkward incident in which I was reminded that though close, American and English just aren't the same. I made this silly joke that if you get a polyfill into a JS implementation, then shucks, you have a "gollyfill", 'cause golly it made it in! In the US I think "golly" is just one of those milquetoast profanities, "golly" instead of "god" like saying "shucks" instead of "shit". Well in the UK that's a thing too I think, but there is also another less fortunate connotation, in which "golly" as a noun can be a racial slur. Check the Wikipedia if you're as ignorant as I was. I think everyone present understood that wasn't my intention, but if that is not the case I apologize. With slides though it's less clear, so I've gone ahead and removed the joke from the slides. It's probably not a ball to take and run with.

However, I do have a ball that you can run with! And actually, this was another terrible joke that wasn't bad enough to inflict upon my audience, but that fate now gives me the opportunity to use. So never fear, there are still bad puns in the slides. But you'll have to click through to the PDF for optimal groaning.

Happy hacking, JavaScripters, and until next time.

a register vm for guile
https://wingolog.org/2013/11/26/a-register-vm-for-guile
26 November 2013

Greetings, hacker comrades! Tonight's epistle is gnarly nargery of the best kind. See, we just landed a new virtual machine, compiler, linker, loader, assembler, and debugging infrastructure in Guile, and stories like that don't tell themselves. Oh no. I am a firm believer in Steve Yegge's Big Blog Theory. There are nitties and gritties and they need explication.

a brief brief history

As most of you know, Guile is an implementation of Scheme. It started about 20 years ago as a fork of SCM.

I think this lines-of-code graph pretty much sums up the history:

That's from the Ohloh, in case you were wondering. Anyway the story is that in the beginning it was all C, pretty much: Aubrey Jaffer's SCM, just packaged as a library. And it was C people making it, obviously. But Scheme is a beguiling language, and over time Guile has had a way of turning C hackers into Scheme hackers.

I like to think of this graph as showing my ignorance. I started using Guile about 10 years ago, and hacking on it in 2008 or so. In the beginning I was totally convinced by the "C for speed, Scheme for flexibility" thing -- to the extent that I was willing to write off Scheme as inevitably slow. But that's silly of course, and one needs no more proof than the great performance JavaScript implementations have these days.

In 2009, we merged in a bytecode VM and a compiler written in Scheme itself. All that is pretty nifty stuff. We released that version of Guile as 2.0 in 2011, and that's been good times. But it's time to move onward and upward!

A couple of years ago I wrote an article on JavaScriptCore, and in it I spoke longingly of register machines. I think that's probably when I started to make sketches towards Guile 2.2, after having spent time with JavaScriptCore's bytecode compiler and interpreter.

Well, it took a couple of years, but Guile 2.2 is finally a thing. No, we haven't even made any prereleases yet, but the important bits have landed in master. This is the first article about it.

trashing your code

Before I start trashing Guile 2.0, I think it's important to say what it does well. It has a great inlining pass -- better than any mainstream language, I think. Its startup time is pretty good -- around 13 milliseconds on my machine. It runs faster than other "scripting language" implementations like Python (CPython) or Ruby (MRI). The debugging experience is delightful. You get native POSIX threads. Plus you get all the features of a proper Scheme, like macros and delimited continuations and all of that!

But the Guile 2.0 VM is a stack machine. That means that its instructions usually take their values from the stack, and produce values (if appropriate) by pushing values onto the stack.

The problem with stack machines is that they penalize named values. If I realize that a computation is happening twice and I factor it out to a variable, that means in practice that I allocate a stack frame slot to the value. So far so good. However, to use the value, I have to emit an instruction to fetch the value for use by some other instruction; and to store it, I likewise have to have another instruction to do that.

For example, in Guile 2.0, check out the bytecode produced for this little function:

scheme@(guile-user)> ,disassemble (lambda (x y)
                                    (let ((z (+ x y)))
                                      (* z z)))

   0    (assert-nargs-ee/locals 10)     ;; 2 args, 1 local
   2    (local-ref 0)                   ;; `x'
   4    (local-ref 1)                   ;; `y'
   6    (add)
   7    (local-set 2)                   ;; `z'
   9    (local-ref 2)                   ;; `z'
  11    (local-ref 2)                   ;; `z'
  13    (mul)
  14    (return)

This is silly. There are seven instructions in the body of this procedure, not counting the prologue and epilogue, and only two of them are needed. The cost of interpreting a bytecode is largely dispatch cost, which is linear in the number of instructions executed, and we see here that we could be some 7/2 = 3.5 times as fast if we could somehow make the operations reference their operands by slot directly.

register vm to the rescue

The solution to this problem is to use a "register machine". I use scare quotes because in fact this is a virtual machine, so unlike a CPU, the number of "registers" is unlimited, and in fact they are just stack slots accessed by index.

So in Guile 2.2, our silly procedure produces the following code:

scheme@(guile-user)> ,disassemble (lambda (x y)
                                    (let ((z (+ x y)))
                                      (* z z)))

   0    (assert-nargs-ee/locals 3 1)    ;; 2 args, 1 local
   1    (add 3 1 2)
   2    (mul 3 3 3)
   3    (return 3)

This is optimal! There are four things that need to happen, and there are four opcodes that do them. Receiving operands and sending values is essentially free -- they are indexed accesses off of a pointer stored in a hardware register, into memory that is in cache.
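
In C terms, the body of such an opcode is almost nothing, which is the whole point. Here is a sketch -- not Guile's actual implementation, see the linked source for that:

#include <stdint.h>

typedef uintptr_t scm_word;   /* stand-in for Guile's tagged SCM values */

/* Inside the dispatch loop: fp is the frame pointer, held in a hardware
   register, and dst/a/b are small operand indices decoded from the
   instruction word.  "add dst a b" is just load, load, add, store. */
static inline void op_add(scm_word *fp, unsigned dst, unsigned a, unsigned b) {
  fp[dst] = fp[a] + fp[b];    /* real code untags fixnums and checks for overflow */
}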

This is a silly little example, but especially in loops, Guile 2.2 stomps Guile 2.0. A simple count-up-to-a-billion test runs in 9 seconds on Guile 2.2, compared to 24 seconds in Guile 2.0. Let's make a silly graph!

Of course if we compare to V8 for example we find that V8 does a loop-to-a-billion in about 1 second, or 9 times faster. There is some way to go. There are a couple of ways that I could generate better bytecode for this loop, for another 30% speed boost or so, but ultimately we will have to do native compilation. And we will! But that is another post.

gritties

Here's the VM. It's hairy in the prelude, and the whole thing is #included twice in another C file (for a debugging and a non-debugging mode; terrible), but I think it's OK for being in C. (If it were in C++ it could be nicer in various ways.)

The calling convention for this VM is that when a function is called, it receives its arguments on the stack. The stack frame looks like this:

   /------------------\
   | Local N-1        | <- sp
   | ...              |
   | Local 1          |
   | Local 0          | <- fp
   +==================+
   | Return address   |
   | Dynamic link     |
   +==================+
   :                  :

Local 0 holds the procedure being called. Free variables, if any, are stored inline with the (flat) closure. You know how many arguments you get by the difference between the stack pointer (SP) and the frame pointer (FP). There are a number of opcodes to bind optional arguments, keyword arguments, rest arguments, and to skip to other case-lambda clauses.

After deciding that a given clause applies to the actual arguments, a prelude opcode will reset the SP to have enough space to hold all locals. In this way the SP is only manipulated in function prologues and epilogues, and around calls.

Guile's stack is expandable: it is originally only a page or two, and it expands (via mremap if possible) by a factor of two on every overflow, up to a configurable maximum. At expansion you have to rewrite the saved FP chain, but nothing else points in, so it is safe to move the stack.
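
Concretely, rewriting the saved FP chain after a move might look like the sketch below, assuming the dynamic link is stored as an absolute address at a fixed slot (fp[-1] here; Guile's actual frame layout differs in the details).

#include <stddef.h>
#include <stdint.h>

/* After the stack contents have been copied to a new mapping, each
   frame's saved dynamic link still points into the old mapping and
   must be rebased by delta = new_base - old_base (in bytes). */
static void rebase_fp_chain(uintptr_t *fp, ptrdiff_t delta) {
  while (fp) {
    uintptr_t old_link = fp[-1];          /* caller's fp, as an old-mapping address */
    if (!old_link)
      break;                              /* outermost frame: end of the chain */
    fp[-1] = old_link + delta;            /* point it into the new mapping */
    fp = (uintptr_t *)fp[-1];             /* continue with the caller's frame */
  }
}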

To call a procedure, you put it and its arguments in contiguous slots, with no live values below them, and two empty slots for the saved instruction pointer (IP) and FP. Getting this right requires some compiler sophistication. Then you reset your SP to hold just the arguments. Then you branch to the procedure's entry, potentially bailing out to a helper if it's not a VM procedure.

To return values, a procedure shuffles the return values down to start from slot 1, resets the stack pointer to point to the last return value, and then restores the saved FP and IP. The calling function knows how many values are returned by looking at the SP. There are convenience instructions for returning and receiving a single value. Multiple values can be returned on the stack easily and efficiently.

Each operation in Guile's VM consists of a number of 32-bit words. The lower 8 bits in the first word indicate the opcode. The width and layout of the operands depends on the word. For example, MOV takes two 12-bit operands. Of course, 4096 locals may not be enough. For that reason there is also LONG-MOV which has two words, and takes two 24-bit operands. In LONG-MOV there are 8 bits of wasted space, but I decided to limit the local frame address space to 24 bits.
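
In C, packing and unpacking such an instruction word is just shifts and masks. Here is a sketch of the MOV layout described above; the exact placement of the fields within the word is an assumption, not necessarily Guile's encoding.

#include <stdint.h>

/* One 32-bit instruction word: an 8-bit opcode in the low bits, then
   two 12-bit operands (8 + 12 + 12 = 32). */
static inline uint32_t encode_mov(uint32_t opcode, uint32_t dst, uint32_t src) {
  return (opcode & 0xff) | ((dst & 0xfff) << 8) | ((src & 0xfff) << 20);
}

static inline void decode_mov(uint32_t word, uint32_t *opcode,
                              uint32_t *dst, uint32_t *src) {
  *opcode = word & 0xff;
  *dst = (word >> 8) & 0xfff;
  *src = (word >> 20) & 0xfff;
}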

In general, most operations cannot address the full 24-bit space. For example, there is ADD, which takes two 8-bit operands and one 8-bit destination. The plan is to have the compiler emit some shuffles in this case, but I haven't hit it yet, and it was too tricky to try to get right in the bootstrapping phase.

JavaScriptCore avoids the address space problem by having all operands be one full pointer wide. This wastes a lot of memory, but they lazily compile and can throw away bytecode and reparse from source as needed, neither of which are true for Guile. We aim to do a good ahead-of-time compilation, to enable self-hosting of the compiler.

JSC's pointer-wide operands do provide the benefit of allowing the "opcode" word to actually hold the address of the label, instead of an index to a table of addresses. This is a great trick, but again it's not applicable to Guile as we don't want to relocate bytecode that we load from disk.

Relative jumps in Guile's VM are 24 bits wide, and are measured in 32-bit units, giving us effectively a 26 bit jump space. Relative references -- references to static data, or other procedures -- are 32 bits wide. I certainly hope that four gigabytes in a compilation unit is enough! By the time it is a problem, hopefully we will be doing native compilation.

Well, those are the basics of Guile's VM. There's more to say, but I already linked to the source, so that should be good enough :) In some future dispatch, we'll talk about the other parts of Guile 2.2. Until then!

on generators
https://wingolog.org/2013/02/25/on-generators
25 February 2013

Hello! Is this dog?

Hi dog this is Andy, with some more hacking nargery. Today the topic is generators.

(shift k)

By way of introduction, you might recall (with a shudder?) a couple of articles I wrote in the past on delimited continuations. I even gave a talk on delimited continuations at FrOSCon last year, a talk that was so hot that thieves broke into the FrOSCon video squad's room and stole their hard drives before the edited product could hit the tubes. Not kidding! The FrOSCon folks still have the tapes somewhere, but meanwhile time marches on without the videos; truly, the terrorists have won.

Anyway, the summary of all that is that Scheme's much-praised call-with-current-continuation isn't actually a great building block for composable abstractions, and that delimited continuations have much better properties.

So imagine my surprise when Oleg Kiselyov, in a mail to the Scheme standardization list, suggested a different replacement for call/cc:

My bigger suggestion is to remove call/cc and dynamic-wind from the base library into an optional feature. I'd like to add the note inviting the discussion for more appropriate abstractions to supersede call/cc -- such [as] generators, for example.

He elaborated in another mail:

Generators are a wide term, some versions of generators are equivalent (equi-expressible) to shift/reset. Roshan James and Amr Sabry have written a paper about all kinds of generators, arguing for the generator interface.

And you know, generators are indeed a very convenient abstraction for composing producers and consumers of values -- more so than the lower-level delimited continuations. Guile's failure to standardize a generator interface is an instance of a class of Scheme-related problems, in which generality is treated as more important than utility. It's true that you can do generators in Guile, but we shouldn't make the user build their own.

The rest of the programming world moved on and built generators into their languages. James and Sabry's paper is good because it looks at yield as a kind of mainstream form of delimited continuations, and uses that perspective to place the generators of common languages in a taxonomy. Some of them are as expressive as delimited continuations. There are also two common restrictions on generators that trade off expressive power for ease of implementation:

  • Generators can be restricted to disallow backtracking, as in Python. This allows generators to yield without allocating memory for a new saved execution state, instead updating the stored continuation in-place. This is commonly called the one-shot restriction.

  • Generators can also restrict access to their continuation. Instead of reifying generator state as external iterators, internal iterators call their consumers directly. The first generators developed by Mary Shaw in the Alphard language were like this, as are the generators in Ruby. This can be called the linear-stack restriction, after an implementation technique that they enable.

With that long introduction out of the way, I wanted to take a look at the proposal for generators in the latest ECMAScript draft reports -- to examine their expressiveness, and to see what implementations might look like. ES6 specifies generators with the first kind of restriction -- with external iterators, like Python.

Also, Kiselyov & co. have a hip new paper out that explores the expressiveness of linear-stack generators, Lazy v. Yield: Incremental, Linear Pretty-printing. I was quite skeptical of this paper's claim that internal iterators were significantly more efficient than external iterators. So this article will also look at the performance implications of the various forms of generators, including full resumable generators.

ES6 iterators

As you might know, there's an official ECMA committee beavering about on a new revision of the ECMAScript language, and they would like to include both iterators and generators.

For the purposes of ECMAScript 6 (ES6), an iterator is an object with a next method. Like this one:

var p = console.log;

function integers_from(n) {
  function next() {
    return this.n++;
  }
  return { n: n, next: next };
}

var ints = integers_from(0);
p(ints.next()); // prints 0
p(ints.next()); // prints 1

Here, ints is an iterator that returns successive integers from 0, all the way up to infinity (well, 2^53).

To indicate the end of a sequence, the current ES6 drafts follow Python's example by having the next method throw a special exception. ES6 will probably have an isStopIteration() predicate in the @iter module, but for now I'll just compare directly.

function StopIteration() {}
function isStopIteration(x) { return x === StopIteration; }

function take_n(iter, max) {
  function next() {
    if (this.n++ < max)
      return iter.next();
    else
      throw StopIteration;
  }
  return { n: 0, next: next };
}

function fold(f, iter, seed) {
  var elt;
  while (1) {
    try {
      elt = iter.next();
    } catch (e) {
      if (isStopIteration (e))
        break;
      else
        throw e;
    }
    seed = f(elt, seed);
  }
  return seed;
}

function sum(a, b) { return a + b; }

p(fold(sum, take_n(integers_from(0), 10), 0));
// prints 45

Now, I don't think that we can praise this code for being particularly elegant. The end result is pretty good: we take a limited amount of a stream of integers, and combine them with a fold, without creating intermediate data structures. However the means of achieving the result -- the implementation of the producers and consumers -- are nasty enough to prevent most people from using this idiom.

To remedy this, ES6 does two things. One is some "sugar" over the use of iterators, the for-of form. With this sugar, we can rewrite fold much more succinctly:

function fold(f, iter, seed) {
  for (var x of iter)
    seed = f(x, seed);
  return seed;
}

The other thing offered by ES6 are generators. Generators are functions that can yield. They are defined by function*, and yield their values using iterators:

function* integers_from(n) {
  while (1)
    yield n++;
}

function* take_n(iter, max) {
  var n = 0;
  while (n++ < max)
    yield iter.next();
}

These definitions are mostly equivalent to the previous iterator definitions, and are much nicer to read and write.

I say "mostly equivalent" because generators are closer to coroutines in their semantics because they can also receive values or exceptions from their caller via the send and throw methods. This is very much as in Python. Task.js is a nice example of how a coroutine-like treatment of generators can help avoid the fraction-of-an-action callback mess in frameworks like Twistednode.js.

There is also a yield* operator to delegate to another generator, as in Python's PEP 380.

OK, now that you know all the things, it's time to ask (and hopefully answer) a few questions.

design of ES6 iterators

Firstly, has the ES6 committee made the right decisions regarding the iterators proposal? For me, I think they did well in choosing external iterators over internal iterators, and the choice to duck-type the iterator objects is a good one. In a language like ES6, lacking delimited continuations or cross-method generators, external iterators are more expressive than internal iterators. The for-of sugar is pretty nice, too.

Of course, expressivity isn't the only thing. For iterators to be maximally useful they have to be fast, and here I think they could do better. Terminating iteration with a StopIteration exception is not so great. It will be hard for an optimizing compiler to rewrite the throw/catch loop terminating condition into a simple goto, and with that you lose some visibility about your loop termination condition. Of course you can assume that the throw case is not taken, but you usually don't know for sure that there are no more throw statements that you might have missed.

Dart did a better job here, I think. They split the next method into two: a moveNext method to advance the iterator, and a current getter for the current value. Like this:

function dart_integers_from(n) {
  function moveNext() {
    this.current++;
    return true;
  }
  return { current: n-1, moveNext: moveNext };
}

function dart_take_n(iter, max) {
  function moveNext() {
    if (this.n++ < max && iter.moveNext()) {
      this.current = iter.current;
      return true;
    }
    else
      return false;
  }
  return {
    n: 0,
    current: undefined,
    moveNext: moveNext
  };
}

function dart_fold(f, iter, seed) {
  while (iter.moveNext())
    seed = f(iter.current, seed);
  return seed;
}

I think it might be fair to say that it's easier to implement an ES6 iterator, but it is easier to use a Dart iterator. In this case, dart_fold is much nicer, and almost doesn't need any sugar at all.

Besides the aesthetics, Dart's moveNext is transparent to the compiler. A simple inlining operation makes the termination condition immediately apparent to the compiler, enabling range inference and other optimizations.

iterator performance

To measure the overhead of for-of-style iteration, we can run benchmarks on "desugared" versions of for-of loops and compare them to normal for loops. We'll throw in Dart-style iterators as well, and also a stateful closure iterator designed to simulate generators. (More on that later.)

I put up a test case on jsPerf. If you click through, you can run the benchmarks directly in your browser, and see results from other users. It's imprecise, but I believe it is accurate, more or less.

The summary of the results is that current engines are able to run a basic summing for loop in about five cycles per iteration (again, assuming 2.5GHz processors; the Epiphany 3.6 numbers, for example, are mine). That's suboptimal only by a factor of 2 or so; pretty good. Thigh fives all around!

However, performance with iterators is not so good. I was surprised to find that most of the other tests are suboptimal by about a factor of 20. I expected the engines to do a better job at inlining.

One data point that stands out of my limited data set is the development Epiphany 3.7.90 browser, which uses JavaScriptCore from WebKit trunk. This browser did significantly better with Dart-style iterators, something I suspect due to better inlining, though I haven't verified it.

Better inlining and stack allocation is already a development focus of modern JS engines. Dart-style iterators will get faster without being a specific focus of optimization. It seems like a much better strategy for the ES6 specification to piggy-back off this work and specify iterators in such a way that they will be fast by default, without requiring more work on the exception system, something that is universally reviled among implementors.

what about generators?

It's tough to know how ES6 generators will perform, because they're not widely implemented and don't have a straightforward "desugaring". As far as I know, there is only one practical implementation strategy for yield in JavaScript, and that is to save the function's stack and the set of live registers. Resuming a generator copies the activation back onto the stack and restores the registers. Any other strategy that involves higher-level control-flow rewriting would make debugging too difficult. This is not a terribly onerous requirement for an implementation. They already have to know exactly what is live and what is not, and can reconstruct a stack frame without too much difficulty.

In practice, invoking a generator is not much different from calling a stateful closure. For this reason, I added another test to the jsPerf benchmark I mentioned before that tests stateful closures. It does slightly worse than iterators that keep state in an object, but not much worse -- and there is no essential reason why it should be slow.
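
If "stateful closure" sounds fuzzy, this is all it means, spelled out here as a C sketch: a record of captured state plus a next function. The JS version just has the engine build the record for you.

#include <stdio.h>

/* A "stateful closure" iterator, desugared into C: the captured state
   lives in a struct, and calling the closure is calling next() on it. */
struct int_iter { long n; };

static long next(struct int_iter *it) {
  return it->n++;
}

int main(void) {
  struct int_iter ints = { 0 };     /* integers_from(0) */
  long sum = 0;
  for (int i = 0; i < 10; i++)      /* take_n(..., 10) */
    sum += next(&ints);
  printf("%ld\n", sum);             /* prints 45, like the fold above */
  return 0;
}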

On the other hand, I don't think we can expect compilers to do a good job at inlining generators, especially if they are specified as terminating with a StopIteration. Terminating with an exception has its own quirks, as Chris Leary wrote a few years ago, though I think in his conclusion he is searching for virtues where there are none.

dogs are curious about multi-shot generators

So dog, that's ECMAScript. But since we started with Scheme, I'll close this parenthesis with a note on resumable, multi-shot generators. What is the performance impact of allowing generators to be fully resumable?

Being able to snapshot your program at any time is pretty cool, but it has an allocation cost. That's pretty much the dominating factor. All of the examples that we have seen so far execute (or should execute) without allocating any memory. But if your generators are resumable -- or even if they aren't, but you build them on top of full delimited continuations -- you can't re-use the storage of an old captured continuation when reifying a new one. A test of iterating through resumable generators is a test of short-lived allocations.

I ran this test in my Guile:

(define (stateful-coroutine proc)
  (let ((tag (make-prompt-tag)))
    (define (thunk)
      (proc (lambda (val) (abort-to-prompt tag val))))
    (lambda ()
      (call-with-prompt tag
                        thunk
                        (lambda (cont ret)
                          (set! thunk cont)
                          ret)))))

(define (integers-from n)
  (stateful-coroutine
   (lambda (yield)
     (let lp ((i n))
       (yield i)
       (lp (1+ i))))))

(define (sum-values iter n)
  (let lp ((i 0) (sum 0))
    (if (< i n)
        (lp (1+ i) (+ sum (iter)))
        sum)))

(sum-values (integers-from 0) 1000000)

I get the equivalent of 2 Ops/sec (in terms of the jsPerf results), allocating about 0.5 GB/s (288 bytes/iteration). This is similar to the allocation rate I get for JavaScriptCore, which is also not generational. V8's generational collector lets it sustain about 3.5 GB/s, at least in my tests. Test case here; multiply the 8-element test by 100 to yield approximate MB/s.

Anyway these numbers are much lower than the iterator tests. ES6 did right to specify one-shot generators as a primitive. In Guile we should perhaps consider implementing some form of one-shot generators that don't have an allocation overhead.

Finally, while I enjoyed the paper from Kiselyov and the gang, I don't see the point in praising simple generators when they don't offer a significant performance advantage. I think we will see external iterators in JavaScript achieve near-optimal performance within a couple years without paying the expressivity penalty of internal iterators.

Catch you next time, dog!

inside javascriptcore's low-level interpreter
https://wingolog.org/2012/06/27/inside-javascriptcores-low-level-interpreter
27 June 2012

Good day, hackers! And hello to the rest of you, too, though I fear that this article isn't for you. In the vertical inches that follow, we're going to nerd out with JavaScriptCore's new low-level interpreter. So for those of you that are still with me, refill that coffee cup, and get ready for a look into a lovely hack!

hot corn, cold corn

Earlier this year, JavaScriptCore got a new interpreter, the LLInt. ("Low-level interpreter", you see.) As you probably know, JavaScriptCore is the JavaScript implementation of the WebKit project. It's used in web browsers like Safari and Epiphany, and also offers a stable API for embedding in non-web applications.

In this article, we'll walk through some pieces of the LLInt. But first, we need to describe the problem that the LLInt solves.

So, what is the fundamental problem of implementing JavaScript? Of course there are lots of things you could say here, but I'm going to claim that in the context of web browsers, it's the tension between the need to optimize small amounts of hot code, while minimizing overhead on large amounts of cold code.

Web browsers are like little operating systems unto themselves, into which you as a user install and uninstall hundreds of programs a day. The vast majority of code that a web browser sees is only executed once or twice, so for cold code, the name of the game is to do as little work as possible.

For cold code, a good bytecode interpreter can beat even a very fast native compiler. Bytecode can be more compact than executable code, and quicker to produce.

All of this is a long introduction into what the LLInt actually is: it's a better bytecode interpreter for JSC.

context

Before the introduction of the LLInt, the so-called "classic interpreter" was just a C++ function with a bunch of labels. Every kind of bytecode instruction had a corresponding label in the function.

As a hack, in the classic interpreter, bytecode instructions are actually wide enough to hold a pointer. The opcode itself takes up an entire word, and the arguments also take up entire words. It is strange to have such bloated bytecode, but there is one advantage, in that the opcode word can actually store the address of the label that implements the opcode. This is called direct threading, and presumably its effects on the branch prediction table are sufficiently good so as to be a win.

Anyway, it means that Interpreter.cpp has a method that looks like this:

// vPC is pronounced "virtual program counter".
JSValue Interpreter::execute(void **vPC)
{
  goto *vPC;
//...
  // add dst op1 op2: Add two values together.
op_add:
  {
    int dst = (int)(intptr_t)vPC[1];
    int op1 = (int)(intptr_t)vPC[2];
    int op2 = (int)(intptr_t)vPC[3];

    fp[dst] = add(fp[op1], fp[op2]);

    vPC += 4;
    goto *vPC;
  }

  // jmp offset: Unconditional branch.
op_jmp:
  {
    ptrdiff_t offset = (ptrdiff_t)vPC[1];
    vPC += offset;
    goto *vPC;
  }
//...
}

It's a little different in practice, but the essence is there.

OK, so what's the problem? Well, readers who have been with me for a while might recall my article on static single assignment form, used as an internal representation by compilers like GCC and LLVM. One conclusion that I had was that SSA is a great first-order IR, but that its utility for optimizing higher-order languages is less clear. However, this computed-goto thing that I showed above (goto *vPC) is a form of higher-order programming!

Indeed, as the GCC internals manual notes:

Computed jumps contain edges to all labels in the function referenced from the code. All those edges have EDGE_ABNORMAL flag set. The edges used to represent computed jumps often cause compile time performance problems, since functions consisting of many taken labels and many computed jumps may have very dense flow graphs, so these edges need to be handled with special care.

Basically, GCC doesn't do very well at compiling interpreter loops. At -O3, it does get everything into registers on my machine, but it residualizes 54 KB of assembly, whereas I only have 64 KB of L1 instruction cache. Many other machines just have 32 KB of L1 icache, so for those machines, this interpreter is a lose. If I compile with -Os, I get it down to 32 KB, but that's still a tight fit.

Besides that, implementing an interpreter from C++ means that you can't use the native stack frame to track the state of a computation. Instead, you have two stacks: one for the interpreter, and one for the program being interpreted. Keeping them in sync has an overhead. Consider what GCC does for op_jmp at -O3:

op_jmp:
    mov    %rbp,%rax
    ; int currentOffset = vPC - bytecode + 1
    sub    0x58(%rbx),%rax
    sar    $0x3,%rax
    add    $0x1,%eax
    ; callFrame[CurrentOffset] = currentOffset
    mov    %eax,-0x2c(%r11)

    ; ptrdiff_t offset = (ptrdiff_t)vPC[1]
    movslq 0x8(%rbp),%rax

    ; vPC += offset
    lea    0x0(%rbp,%rax,8),%rbp

    ; goto *vPC
    mov    0x0(%rbp),%rcx
    jmpq   *%rcx

First there is this strange prelude, which effectively stores the current vPC into an address on the stack. To be fair to GCC, this prelude is part of the DEFINE_OPCODE macro, and is not an artifact of compilation. Its purpose is to let other parts of the runtime see where a computation is. I tried, but I can't convince myself that it is necessary to do this at the beginning of every instruction, so maybe this is just some badness that should be fixed, if the classic interpreter is worth keeping.

The rest of the opcode is similar to the version that the LLInt produces, as we will see, but it is less compact.

the compiler to make the code you want

The goal of compilation is to produce good code. It can often be a good technique to start from the code you would like to have, and then go backwards and see how to produce it. It's also a useful explanatory technique: we can take a look at the machine code of the LLInt, and use it to put the LLInt's source in context.

In that light, here is the code corresponding to op_jmp, as part of the LLInt:

llint_op_jmp:
    ; offset += bytecode[offset + 1]
    add    0x8(%r10,%rsi,8),%esi
    ; jmp *bytecode[offset]
    jmpq   *(%r10,%rsi,8)

That's it! Just 9 bytes. There is the slight difference that the LLInt moves around an offset instead of a pointer into the bytecode. The whole LLInt is some 14 KB, which fits quite comfortably into icache, even on mobile devices.

This assembly code is part of the LLInt as compiled for my platform, x86-64. The source of the LLInt is written in a custom low-level assembly language. Here is the source for the jmp opcode:

_llint_op_jmp:
    traceExecution()
    dispatchInt(8[PB, PC, 8])

You can define macros in this domain-specific language. Here's the definition of dispatchInt:

macro dispatchInt(advance)
    addi advance, PC
    jmp [PB, PC, 8]
end

At this point it's pretty clear how the source corresponds to the assembly. The WebKit build system will produce a version of the LLInt customized for the target platform, and link that interpreter into the JavaScriptCore library. There are backend code generators for ARMv7, X86-32, and X86-64. These interpreters mostly share the same source, modulo the representation differences for JSValue on 32-bit and 64-bit systems.

Incidentally, the "offlineasm" compiler that actually produces the native code is written in Ruby. It's clear proof that Ruby can be fast, as long as you don't load it at runtime ;-)

but wait, there's more

We take for granted that low-level programs should be able to determine how their data is represented. C and C++ are low-level languages, to the extent that they offer this kind of control. But what is not often remarked is that for a virtual machine, its code is also data that needs to be examined at runtime.

I have written before about JSC's optimizing DFG compiler. In order to be able to optimize a computation that is in progress, or to abort an optimization because a speculation failed -- in short, to be able to use OSR to tier up or down -- you need to be able to replace the current stack frame. Well, with the classic interpreter written in C++, you can't do that, because you have no idea what's in your stack frame! In contrast, with the LLInt, JSC has precise control over the stack layout, allowing it to tier up and down directly from the interpreter.

This can obviate the need for the baseline JIT. Currently, WebKitGTK+ and Apple builds still include the baseline, quick-and-dirty JIT, as an intermediate tier between the LLInt and the DFG JIT. I don't know the long-term plans for sure, but I would speculate that recent work on the DFG JIT by Filip Pizlo and others has had the goal of obsoleting the baseline JIT.

Note that in order to tier directly to the optimizing compiler, you need type information. Building the LLInt with the DFG optimizer enabled causes the interpreter to be instrumented to record value profiles. These profiles record the types of values seen by instructions that load and store values from memory. Unlike V8, which stores this information in executable code as part of the inline caches, the LLInt keeps these value profiles in non-executable memory.

Spiritually speaking, I am all for run-time code generation. But it must be admitted that web browsers are a juicy target, and while it is unfair to make safe languages pay for bugs in C++ programs, writable text is an attack surface. Indeed in some environments, JIT compilation is prohibited by the operating system -- and in that sort of environment, a fast interpreter like the LLInt is a big benefit.

There are a few more LLInt advantages, like being able to use the CPU's overflow flags, doing better register allocation, and retaining less garbage (JSC scans the stack conservatively). But the last point I want to mention is memory: an interpreter doesn't have to generate executable code just to display a web page. Obviously, you need to complement the LLInt with an optimizing compiler for hot code, but laziness up front can improve real-world page load times.
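Coming back to the overflow-flag point: hand-written assembly can do an integer add and then branch directly on the flag that the add just set, which is awkward to express in portable C++. Here is a rough approximation of that fast path using the __builtin_add_overflow intrinsic available in recent GCC and Clang; the function and its shape are mine, not JSC's.

#include <cstdint>

// Sketch of a JS-style addition fast path: try an int32 add and fall back to
// doubles on overflow.  Hand-written assembly would branch on the CPU's
// overflow flag directly; here a compiler intrinsic gets roughly the same code.
double js_add(int32_t a, int32_t b, bool *stayed_int32)
{
  int32_t sum;
  if (!__builtin_add_overflow(a, b, &sum)) {
    *stayed_int32 = true;
    return sum;                       // fast path: result still fits in int32
  }
  *stayed_int32 = false;
  return (double)a + (double)b;       // slow path: promote to double
}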

in your webs, rendering your kittens

So that's the LLInt: a faster bytecode interpreter for JSC.

I wrote this post because I finally got around to turning on the LLInt in the WebKitGTK+ port a couple weeks ago. That means that in September, when GNOME 3.6 comes out, not only will you get an Epiphany with the one-process-per-tab WebKit2 backend, you also get cheaper and faster JavaScript with the LLInt. Those of you who want the experience in advance can try it out with last Monday's WebKitGTK+ 1.9.4 release.

stranger in these parts, by Andy Wingo (https://wingolog.org/2012/05/16/stranger-in-these-parts), 2012-05-16

My JSConf 2012 video is out! Check it out:

The talk is called "Stranger in these parts: A hired gun in the JS corral", and in it I talk about my experiences as a Schemer in the implementation world, with a focus on JavaScriptCore, the JS implementation of the WebKit project.

If you want, you can fetch the slides or the notes. If you were unable to play the video in the browser, you can download it directly (25 minutes, ~80 MB, CC-BY-SA).

Special thanks to the A/V team for the fine recording. My talk was the first one that used the wireless mics, and it turned out there was some intermittent interference. They corrected this in later talks by double-miking the speakers. In my case, it was fortunate that they recorded the room as well, and with (I would imagine) a fair amount of post-processing the sound is perfectly intelligible. Cheers!

Finally, there were a number of other interesting talks whose recordings are starting to come out. I especially liked David Nolen's funhouse of ClojureScript and logic programming. It was pleasantly strange to see him mention Peter Norvig's 1992 book, Paradigms of Artificial Intelligence Programming, because I did too, and I think someone else did as well. Three people mentioning a somewhat obscure 20-year-old book, what are the odds?

I also liked Vyacheslav's amusing talk on V8's optimizing compiler. He actually showed some assembler! Folks that read this webrag might find it interesting. Dan Ingalls' talk was fun too. The ending scene was pretty surreal; be sure to watch all the way through.

Thanks again to the JSConf organizers for the invitation to speak. It was a pleasure to get to hang around the lively and energetic JS community. Happy hacking, all!

javascript eval considered crazy, by Andy Wingo (https://wingolog.org/2012/01/12/javascript-eval-considered-crazy), 2012-01-12

Peoples. I was hacking recently on JavaScriptCore, and I came to a realization: JavaScript's eval is absolutely crazy.

I mean, I thought I knew this before. But... words fail me, so I'll have to show a few examples.

eval and introduced bindings

This probably isn't worth mentioning, as you probably know it, but eval can introduce lexical bindings:

 > var foo = 10;
 > (function (){ eval('var foo=20;'); return foo; })()
 20
 > foo
 10

I find this to be pretty insane already, but I knew about it. You would think, though, that var x = 10; and eval('var x = 10;'); would be the same, but they're not:

 > (function (){ var x = 10; return delete x; })()
 false
 > (function (){ eval('var x = 10;'); return delete x; })()
 true

eval-introduced bindings do not have the DontDelete property set on them, according to the post-hoc language semantics, so unlike proper lexical variables, they may be deleted.

when is eval really eval?

Imagine you are trying to analyze some JavaScript code. If you see eval(...), is it really eval?

Not necessarily.

eval pretends to be a regular, mutable binding, so it can be rebound:

 > eval = print
 > eval('foo')
 foo // printed

or, shadowed lexically:

 > (function () { var eval = print; eval('foo'); })()
 foo // printed

or, shadowed dynamically:

 > with({'eval': print}) { eval('foo'); }
 foo // printed

You would think that if you can somehow freeze the eval binding on the global object, and verify that there are no with forms, and no statements of the form var eval = ..., that you could guarantee that eval is eval, but that is not the case:

 > Object.freeze(this);
 > (function (x){ return [eval(x), eval(x)]; })('var eval = print; 10')
 var eval = print; 10 // printed, only once!
 10,

(Here the first eval created a local binding for eval, and evaluated to 10. The second eval was actually a print.)

Crazy!!!!

an eval by any other name

So eval is an identifier that can be bound to another value. OK. One would expect to be able to bind another identifier to eval, then. Does that work? It seems to work:

 > var foo = eval;
 > foo('foo') === eval;
 true

But not really:

 > (function (){ var quux = 10; return foo('quux'); } )()
 Exception: ReferenceError: Can't find variable: quux

eval by any other name isn't eval. (More specifically, eval by any other name doesn't have access to lexical scope.)

Note, however, the mere presence of a shadowed declaration of eval doesn't mean that eval isn't eval:

 > var foo = 10
 > (function(x){ var eval = x; var foo = 20; return [x('foo'), eval('foo')] })(eval)
 10,20

Crazy!!!!

strict mode restrictions

ECMAScript 5 introduces "strict mode", which prevents eval from being rebound:

 > (function(){ "use strict"; var eval = print; })
 Exception: SyntaxError: Cannot declare a variable named 'eval' in strict mode.
 > (function(){ "use strict"; eval = print; })
 Exception: SyntaxError: 'eval' cannot be modified in strict mode
 > (function(){ "use strict"; eval('eval = print;'); })()
 Exception: SyntaxError: 'eval' cannot be modified in strict mode
 > (function(x){"use strict"; x.eval = print; return eval('eval');})(this)
 Exception: TypeError: Attempted to assign to readonly property.

But, since strict mode is embedded in "classic mode", it's perfectly fine to mutate eval from outside strict mode, and strict mode has to follow suit:

 > eval = print;
 > (function(){"use strict"; return eval('eval');})()
 eval // printed

The same is true of non-strict lexical bindings for eval:

 > (function(){ var eval = print; (function(){"use strict"; return eval('eval');})();})();
 eval // printed
 > with({'eval':print}) { (function(){ "use strict"; return eval('eval');})() }
 eval // printed

An engine still has to check at run-time that eval is really eval. This crazy behavior will be with us as long as classic mode, which is to say, forever.

Strict-mode eval does have the one saving grace that it cannot introduce lexical bindings, so implementors do get a break there, but it's a poor consolation prize.

in summary

What can an engine do when it sees eval?

Not much. It can't even prove that it is actually eval unless eval is not bound lexically, there is no with, there is no intervening non-strict call to any identifier eval (regardless of whether it is eval or not), and the global object's eval property is bound to the blessed eval function, and is configured as DontDelete and ReadOnly (not the default in web browsers).

But the very fact that an engine sees a call to an identifier eval poisons optimization: because eval can introduce variables, the scope of free variables is no longer lexically apparent, in many cases.

I'll say it again: crazy!!!

JavaScriptCore, the WebKit JS implementation, by Andy Wingo (https://wingolog.org/2011/10/28/javascriptcore-the-webkit-js-implementation), 2011-10-28

My readers will know that I have recently had the pleasure of looking into the V8 JavaScript implementation, from Google. I'm part of a small group in Igalia doing compiler work, and it's clear that in addition to being lots of fun, JavaScript implementations are an important part of the compiler market today.

But V8 is not the only JS implementation in town. Besides Mozilla's SpiderMonkey, which you probably know, there is another major Free Software JS implementation that you might not have even heard of, at least not by its proper name: JavaScriptCore.

jsc: js for webkit

JavaScriptCore (JSC) is the JavaScript implementation of the WebKit project.

In the beginning, JavaScriptCore was a simple tree-based interpreter, as Mozilla's SpiderMonkey was. But then in June of 2008, a few intrepid hackers at Apple wrote a compiler and bytecode interpreter for JSC, threw away the tree-based interpreter, and called the thing SquirrelFish. This was eventually marketed as "Nitro" inside Apple's products[0].

JSC's bytecode interpreter was great, and is still pretty interesting. I'll go into some more details later in this article, because its structure conditions the rest of the implementation.

But let me continue here with this historical sketch by noting that later in 2008, the WebKit folks added inline caches, a regular expression JIT, and a simple method JIT, and then called the thing SquirrelFish Extreme. Marketers called this Nitro Extreme. (Still, the proper name of the engine is JavaScriptCore; Wikipedia currently gets this one wrong.)

One thing to note here is that the JSC folks were doing great, well-factored work. It was so good that SpiderMonkey hackers at Mozilla adopted JSC's regexp JIT compiler and their native-code assembler directly.

As far as I can tell, for JSC, 2009 and 2010 were spent in "consolidation". By that I mean that JSC had a JIT and a bytecode interpreter, and they wanted to maintain them both, and so there was a lot of refactoring and tweaking to make them interoperate. This phase consolidated the SFX gains on x86 architectures, while also adding implementations for ARM and other architectures.

But with the release of V8's Crankshaft in late 2010, the JS performance bar had been lowered again (assuming it is a limbo), and so JSC folks started working on what they call their "DFG JIT" (DFG for "data-flow graph"), which aims to be more like Crankshaft, basically.

It's possible to configure a JSC with all three engines: the interpreter, the simple method JIT, and the DFG JIT. In that case there is tiered compilation between the three forms: initial parsing and compilation produces bytecode, that can be optimized with the method JIT, that can be optimized by the DFG JIT. In practice, though, on most platforms the interpreter is not included, so that all code runs through the method JIT. As far as I can tell, the DFG JIT is shipping in Mac OS X Lion's Safari browser, but it is not currently enabled on any platform other than 64-bit Mac. (I am working on getting that fixed.)

a register vm

The interpreter has a number of interesting pieces, but it is important mostly for defining the format of bytecode. Bytecode is effectively the high-level intermediate representation (IR) of JSC.

To put that into perspective, in V8, the high-level intermediate representation is the JS source code itself. When V8 first sees a piece of code, it pre-parses it to raise early syntax errors. Later when it needs to analyze the source code, either for the full-codegen compiler or for Hydrogen, it re-parses it to an AST, and then works on the AST.

In contrast, in JSC, when code is first seen, it is fully parsed to an AST and then that AST is compiled to bytecode. After producing the bytecode, the source text isn't needed any more, and so it is forgotten. The interpreter interprets the bytecode directly. The simple method JIT compiles the bytecode directly. The DFG JIT has to re-parse the bytecode into an SSA-style IR before optimizing and producing native code, which is a bit more expensive but worth it for hot code.

As you can see, bytecode is the common language spoken by all of JSC's engines, so it's important to understand it.

Before really getting into things, I should make an aside about terminology here. Traditionally, at least in my limited experience, a virtual machine was considered to be a piece of software that interprets sequences of virtual instructions. This would be in contrast to a "real" machine, that interprets sequences of "machine" or "native" instructions in hardware.

But these days things are more complicated. A common statement a few years ago would be, "is JavaScript interpreted or compiled?" This question is nonsensical, because "interpreted" or "compiled" are properties of implementations, not languages. Furthermore the implementation can compile to bytecode, but then interpret that bytecode, as JSC used to do.

And in the end, if you compile all the bytecode that you see, where is the "virtual machine"? V8 hackers still call themselves "virtual machine engineers", even as there is no interpreter in the V8 sources (not counting the ARM simulator; and what of a program run under qemu?).

All in all though, it is still fair to say that JavaScriptCore's high-level intermediate language is a sequence of virtual instructions for an abstract register machine, of which the interpreter and the simple method JIT are implementations.

When I say "register machine", I mean that in contrast to a "stack machine". The difference is that in a register machine, all temporary values have names, and are stored in slots in the stack frame, whereas in a stack machine, temporary results are typically pushed on the stack, and most instructions take their operands by popping values off the stack.

(Incidentally, V8's full-codegen compiler operates on the AST in a stack-machine-like way. Accurately modelling the state of the stack when switching from full-codegen to Crankshaft has been a source of many bugs in V8.)

Let me say that for an interpreter, I am totally convinced that register machines are the way to go. I say this as a co-maintainer of Guile, which has a stack VM. Here are some reasons.

First, stack machines penalize named temporaries. For example, consider the following code:

(lambda (x)
  (* (+ x 2)
     (+ x 2)))

We could do common-subexpression elimination to optimize this:

(lambda (x)
  (let ((y (+ x 2)))
    (* y y)))

But in a stack machine is this really a win? Consider the sequence of instructions in the first case:

; stack machine, unoptimized
0: local-ref 0      ; x
1: make-int8 2
2: add
3: local-ref 0      ; x
4: make-int8 2
5: add
6: mul
7: return

Contrast this to the instructions for the second case:

; stack machine, optimized
0: local-ref 0      ; push x
1: make-int8 2      ; push 2
2: add              ; pop x and 2, add, and push sum
3: local-set 1      ; pop and set y
4: local-ref 1      ; push y
5: local-ref 1      ; push y
6: mul              ; pop y and y, multiply, and push product
7: return           ; pop and return

In this case we really didn't gain anything, because storing values to locals and loading them back onto the stack take up separate instructions, and in general the time spent in a procedure is linear in the number of instructions executed in the procedure.

In a register machine, on the other hand, things are easier, and CSE is definitely a win:

; register machine, optimized
0: add-int8 1 0 2      ; y := x + 2 (the constant is an immediate operand)
1: mul 2 1 1           ; z := y * y
2: return 2            ; return z

In a register machine, there is no penalty to naming a value. Using a register machine reduces the push/pop noise around the instructions that do the real work.

Also, because they include the names (or rather, locations) of their operands within the instruction, register machines take fewer instructions to do the job. This reduces dispatch cost.

In addition, with a register VM, you know the size of a call frame before going into it, so you can avoid overflow checks when pushing values in the function. (Some stack machines also have this property, like the JVM.)

But the big advantage of targeting a register machine is that you can take advantage of traditional compiler optimizations like CSE and register allocation. In this particular example, we have used three virtual registers, but in reality we only need one. The resulting code is also closer to what real machines expect, and so is easier to JIT.

On the down side, instructions for a register machine typically occupy more memory than instructions for a stack machine. This is particularly the case for JSC, in which the opcode and each of the operands takes up an entire machine word. This was done to implement "direct threading", in which the opcodes are not indexes into jump tables, but actually are the addresses of the corresponding labels. This might be an acceptable strategy for an implementation of JS that doesn't serialize bytecode out to disk, but for anything else the relocations are likely to make it a lose. In fact I'm not sure that it's a win for JSC even, and perhaps the bloat was enough of a lose that the interpreter was turned off by default.
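To put a number on that bloat, here is a toy comparison between a word-per-field direct-threaded encoding (the shape described above) and a compact encoding whose opcodes index into a jump table; the struct layouts are illustrative, not JSC's actual instruction format.

#include <cstdint>
#include <cstdio>

// Direct-threaded encoding: the opcode is a code address and every operand is
// a full machine word, so "add dst op1 op2" is four words -- 32 bytes on x86-64.
struct DirectThreadedAdd {
  void *opcode;                 // address of the op_add label
  intptr_t dst, op1, op2;
};

// Compact encoding: the opcode indexes a table of label addresses and the
// operands are bytes, so the same instruction is 4 bytes -- at the cost of an
// extra table lookup on dispatch, but with no relocations if written to disk.
struct CompactAdd {
  uint8_t opcode, dst, op1, op2;
};

int main()
{
  std::printf("direct-threaded add: %zu bytes; compact add: %zu bytes\n",
              sizeof(DirectThreadedAdd), sizeof(CompactAdd));
  return 0;
}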

Stack frames for the interpreter consist of a six-word frame, the arguments to the procedure, and then the locals. Calling a procedure reserves space for a stack frame and then pushes the arguments on the stack -- or rather, sets them to the last n + 6 registers in the stack frame -- then slides up the frame pointer. For some reason in JSC the stack is called the "register file", and the frame pointer is the "register window". Go figure; I suppose the names are as inscrutable as the "activation records" of the stack world.

jit: a jit, a method jit

I mention all these details about the interpreter and the stack (I mean, the register file), because they apply directly to the method JIT. The simple method JIT (which has no name) does the exact same things that the bytecode interpreter does, but it does them via emitted machine instructions instead of interpreting virtual instructions.

There's not much to say here; jitting the code has the result you would expect, reducing dispatch overhead while at the same time allowing some context-specific compilation, like when you add a constant integer to a variable. This JIT is really quick-and-dirty though, so you don't get many of the optimization benefits traditionally associated with method JITs like HotSpot's C1 or C2. Granted, the register VM bytecode does allow for some important optimizations to happen, but JSC currently doesn't do very much in the way of optimizing bytecode, as far as I can tell.

Thinking more on the subject, I suspect that for JavaScript, CSE isn't even possible unless you know the types, as a valueOf() callback could have side effects.

an interlude of snarky footnotes

Hello, reader! This is a long article, and it's a bit dense. I had some snarky footnotes that I enjoyed writing, but it felt wrong to put them at the end, so I thought it better to liven things up in the middle here. The article continues in the next section.

0. In case you didn't know, compilers are approximately 37% composed of marketing, and rebranding is one of the few things you can do to a compiler, marketing-wise, hence the name train: SquirrelFish, Nitro, SFX, Nitro Extreme...[1] As in, in response to "I heard that Nitro is slow.", one hears, "Oh, man they totally fixed that in SquirrelFish Extreme. It's blazingly fast![2]"

1. I don't mean to pick on JSC folks here. V8 definitely has this too, with their "big reveals". Firefox people continue to do this for some reason (SpiderMonkey, TraceMonkey, JaegerMonkey, IonMonkey). I expect that even they have forgotten the reason at this point. In fact the JSC marketing has lately been more notable in its absence, resulting in a dearth of useful communication. At this point, in response to "Oh, man they totally are doing a great job in JavaScriptCore", you're most likely to hear, "JavaScriptCore? Never heard of it. Kids these days hack the darndest things."

2. This is the other implement in the marketer's toolbox: "blazingly fast". It means, "I know that you don't understand anything I'm saying, but I would like for you to repeat this phrase to your colleagues please." As in, "LLVM does advanced TBAA on the SSA IR, allowing CSE and LICM while propagating copies to enable SIMD loop vectorization. It is blazingly fast."

dfg: a new crankshaft for jsc?

JavaScriptCore's data flow graph (DFG) JIT is work by Gavin Barraclough and Filip Pizlo to enable speculative optimizations for JSC. For example, if you see the following code in JS:

a[i++] = 0.7*x;

then a is probably an array of floating-point numbers, and i is probably an integer. To get great performance, you want to use native array and integer operations, so you speculatively compile a version of your code that makes these assumptions. If the assumptions don't work out, then you bail out and try again with the normal method JIT.

The fact that the interpreter and simple method JIT have a clear semantic model in the form of bytecode execution makes it easy to bail out: you just reconstruct the state of the virtual registers and register window, then jump back into the code. (V8 calls this process "deoptimization"; the DFG calls it "speculation failure".)
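Here is a hand-drawn C++ caricature of what speculatively compiled code for that a[i++] = 0.7*x store might look like: guard on the representations you expect, and if a guard fails, punt to a generic path (standing in for the real speculation-failure machinery, which reconstructs the baseline frame and resumes there). All of the types and helpers here are invented for the illustration; none of them are JSC's.

#include <cstdint>
#include <vector>

// Invented stand-ins for a JS value and a JS array, just enough to show the
// shape of speculative code.
struct Value { bool isInt32; bool isDouble; int32_t i; double d; };
struct Array { bool isDoubleArray; std::vector<double> doubles; };

// The fully general path: would handle boxed doubles, holes, getters, string
// indexes, and so on.  Elided.
void generic_store(Array &a, Value i, Value x)
{
  (void)a; (void)i; (void)x;
}

// Speculative fast path, compiled under the assumption that i is an in-bounds
// int32 and x is a double going into an unboxed double array.
void speculative_store(Array &a, Value i, Value x)
{
  // Guards: check the speculated representations.
  if (!i.isInt32 || !x.isDouble || !a.isDoubleArray ||
      (uint32_t)i.i >= a.doubles.size()) {
    // Speculation failure: bail out to the generic path.  In a real engine
    // this is where you would reconstruct the interpreter or baseline-JIT
    // frame and jump back into it.
    generic_store(a, i, x);
    return;
  }
  // Fast path: native double multiply and unboxed array store.
  a.doubles[(uint32_t)i.i] = 0.7 * x.d;
}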

You can go the other way as well, switching from the simple JIT to the optimized DFG JIT, using on-stack replacement. The DFG JIT does do OSR. I hear that it's needed if you want to win Kraken, which puts you in lots of tight loops that you need to optimize without waiting for the next function entry to switch to optimized code.

When the DFG JIT is enabled, the interpreter (if present) and the simple method JIT are augmented with profiling information, to record what types flow through the various parts of the code. If a loop is executed a lot (currently more than 1000 times), or a function is called a lot (currently about 70 times), the DFG JIT kicks in. It parses the bytecode of a function into an SSA-like representation, doing inlining and collecting type feedback along the way. This probably sounds very familiar to my readers.

The difference between JSC and Crankshaft here is that Crankshaft parses out type feedback from the inline caches directly, instead of relying on in-code instrumentation. I think Crankshaft's approach is a bit more elegant, but it is prone to lossage when GC blows the caches away, and in any case either way gets the job done.

I mentioned inlining before, but I want to make sure that you noticed it: the DFG JIT does do inlining, and does so at parse-time, like HotSpot does. The type profiling (they call it "value profiling") combined with some cheap static analysis also allows the DFG to unbox int32 and double-precision values.

One thing that the DFG JIT doesn't do, currently, is much in the way of code motion. It does do some dead-code elimination and common-subexpression elimination, and as I mentioned before, you need the DFG's value profiles in order to be able to do this correctly. But it does not do loop-invariant code motion.

Also, the DFG's register allocator is not as good as Crankshaft's, yet. It is hampered in this regard by the JSC assembler that I praised earlier; while indeed a well-factored, robust piece of code, JSC's assembler has a two-address interface instead of a three-address interface. This means that instead of having methods like add(dest, op1, op2), it has methods like add(op1, op2), where the operation implicitly stores its result in its first operand. Though it does correspond to the x86 instruction set, this sort of interface is not great for systems where you have more registers (like on x86-64), and forces the compiler to shuffle registers around a lot.
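To see why a two-address interface forces shuffling, here is a toy pair of emitters that print the instructions they would emit; the classes and the register set are invented for the example, not JSC's MacroAssembler API. We compile c = a + b, with a in rax, b in rcx, and c wanted in rdx.

#include <cstdio>

enum Reg { RAX, RCX, RDX };
static const char *kNames[] = { "rax", "rcx", "rdx" };

// Two-address interface (x86 style): the destination is also the first source.
struct TwoAddressAssembler {
  void mov(Reg dst, Reg src) { std::printf("mov %s, %s\n", kNames[dst], kNames[src]); }
  void add(Reg dst, Reg src) { std::printf("add %s, %s\n", kNames[dst], kNames[src]); }
};

// Three-address interface (RISC style): dst = src1 + src2.
struct ThreeAddressAssembler {
  void add(Reg dst, Reg src1, Reg src2) {
    std::printf("add %s, %s, %s\n", kNames[dst], kNames[src1], kNames[src2]);
  }
};

int main()
{
  TwoAddressAssembler two;
  ThreeAddressAssembler three;

  two.mov(RDX, RAX);         // two-address: copy a into the destination first...
  two.add(RDX, RCX);         // ...then clobber it with the sum
  three.add(RDX, RAX, RCX);  // three-address: one instruction, no shuffle
  return 0;
}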

The counter-based optimization triggers do require some code to run that isn't strictly necessary for the computation of the results, but this strategy does have the nice property that the DFG performance is fairly predictable, and measurable. Crankshaft, on the other hand, being triggered by a statistical profiler, has statistically variable performance.
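The bookkeeping behind those triggers is simple. A minimal sketch, using the rough thresholds quoted a few paragraphs up; the structure and names are mine, not JSC's.

// Per-function execution counters, bumped on entry and on loop back-edges.
// When a counter crosses its threshold, the function (or the hot loop) becomes
// a candidate for the optimizing tier.
struct TierUpCounters {
  unsigned callCount = 0;
  unsigned loopBackEdgeCount = 0;

  bool shouldOptimizeOnEntry()    { return ++callCount >= 70; }
  bool shouldOptimizeOnBackEdge() { return ++loopBackEdgeCount >= 1000; }
};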

And speaking of performance, AWFY on the mac is really where it's at for JSC right now. Since the DFG is only enabled by default on recent Mac OS 64-bit builds, you need to be sure you're benchmarking the right thing.

Looking at the results, I think we can say that JSC's performance on the V8 benchmark is really good. Also it's interesting to see JSC beat V8 on SunSpider. Of course, there are lots of quibbles to be had as to the validity of the various benchmarks, and it's also clear that V8 is the fastest right now once it has time to warm up. But I think we can say that JSC is doing good work right now, and getting better over time.

future

So that's JavaScriptCore. The team -- three people, really -- is mostly focusing on getting the DFG JIT working well right now, and I suspect they'll be on that for a few months. But we should at least get to the point where the DFG JIT is working and enabled by default on free systems within a week or two.

The one other thing that's in the works for JSC is a new generational garbage collector. This is progressing, but slowly. There are stubs in the code for card-marking write barriers, but currently there is no such GC implementation, as far as I can tell. I suspect that someone has a patch they're cooking in private; we'll see. At least JSC does have a Handle API, unlike SpiderMonkey.

conclusion

So, yes, in summary, JavaScriptCore is a fine JS implementation. Besides being a correct implementation for real-world JS -- something that is already saying quite a lot -- it also has good startup speed, is fairly robust, and is working on getting an optimizing compiler. There's work to do on it, as with all JS implementations, but it's doing fine.

Thanks for reading, if you got this far, though I understand if you skipped some parts in the middle. Comments and corrections are most welcome, from those of you that actually read all the way through, of course :). Happy hacking!