wingolog: a mostly dorky weblog by Andy Wingo

a continuation-passing style intermediate language for guile
12 January 2014
https://wingolog.org/2014/01/12/a-continuation-passing-style-intermediate-language-for-guile

Happy new year's, hackfolk!

A few weeks ago I wrote about the upcoming Guile 2.2 release, and specifically about its new register virtual machine. Today I'd like to burn some electrons on another new part of Guile 2.2: its intermediate language.

To recap, we switched from a stack machine to a register machine because, among other reasons, register machines can consume and produce named intermediate results in fewer instructions than stack machines, and that makes things faster.

To take full advantage of this new capability, it is appropriate to switch at the same time from the direct-style intermediate language (IL) that we had to an IL that names all intermediate values. This lets us effectively reason about each subexpression that goes into a computation, for example in common subexpression elimination.

As far as intermediate languages go, basically there are two choices these days: something SSA-based, or something CPS-based. I wrote an article on SSA, ANF and CPS a few years ago; if you aren't familiar with these or are feeling a little rusty, I suggest you go and take a look.

In Guile I chose a continuation-passing style language. I still don't know if I made the right choice. I'll go ahead and describe Guile's IL and then follow up with some reflections. The description below is abbreviated from a more complete version in Guile's manual.

guile's cps language

Guile's CPS language is composed of terms, expressions, and continuations. It was heavily inspired by Andrew Kennedy's Compiling with Continuations, Continued paper.

A term can either evaluate an expression and pass the resulting values to some continuation, or it can declare local continuations and contain a sub-term in the scope of those continuations.

$continue k src exp
Evaluate the expression exp and pass the resulting values (if any) to the continuation labelled k. The src element is the associated source information.
$letk conts body
Bind conts in the scope of the sub-term body. The continuations are mutually recursive.

Additionally, the early stages of CPS allow a set of mutually recursive functions to be declared via a $letrec term. A later pass will attempt to transform the functions declared in a $letrec into local continuations. Any remaining functions are later lowered to $fun expressions. More on "contification" later.

Here is an inventory of the kinds of expressions in Guile's CPS language. Recall that all expressions are wrapped in a $continue term which specifies their continuation.

$void
Continue with the unspecified value.
$const val
Continue with the constant value val.
$prim name
Continue with the procedure that implements the primitive operation named by name.
$fun src meta free body
Continue with a procedure. body is the $kentry $cont of the procedure entry. free is a list of free variables accessed by the procedure. Early CPS uses an empty list for free; only after closure conversion is it correctly populated.
$call proc args
Call proc with the arguments args, and pass all values to the continuation. proc and the elements of the args list should all be variable names. The continuation identified by the term's k should be a $kreceive or a $ktail instance.
$primcall name args
Perform the primitive operation identified by name, a well-known symbol, passing it the arguments args, and pass all resulting values to the continuation.
$values args
Pass the values named by the list args to the continuation.
$prompt escape? tag handler
Push a prompt, identified by the variable name tag, onto the stack, and continue with zero values. The escape? flag indicates an escape-only prompt. If the body aborts to this prompt, control will proceed at the continuation labelled handler, which should be a $kreceive continuation. Prompts are later popped by pop-prompt primcalls.
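To make these shapes concrete, here is one way the expression (+ 1 2) might render in this language. This is an illustrative sketch rather than Guile's exact concrete syntax: the labels k0 and k1, the variables a1 and b1, and the src placeholder are invented for the example, and ktail stands for the enclosing function's tail continuation (described below).

($letk ((k0 ($kargs (a) (a1)              ; receive the constant 1 as a1
          ($letk ((k1 ($kargs (b) (b1)    ; receive the constant 2 as b1
                    ($continue ktail src
                      ($primcall '+ (a1 b1))))))
            ($continue k1 src ($const 2))))))
  ($continue k0 src ($const 1)))

Note that every intermediate value gets a name (a1, b1) and every control point gets a label (k0, k1); that is the whole point of the exercise.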

The remaining element of the CPS language in Guile is the continuation. In CPS, all continuations have unique labels. Since this aspect is common to all continuation types, all continuations are contained in a $cont instance:

$cont k cont
Declare a continuation labelled k. All references to the continuation will use this label.

The most common kind of continuation binds some number of values, and then evaluates a sub-term. $kargs is this kind of simple lambda.

$kargs names syms body
Bind the incoming values to the variables syms, with original names names, and then evaluate the sub-term body.

Variables (e.g., the syms of a $kargs) should be globally unique. To bind the result of an expression to a variable and then use that variable, you would continue from the expression to a $kargs that declares one variable. The bound value would then be available for use within the body of the $kargs.

$kif kt kf
Receive one value. If it is a true value, branch to the continuation labelled kt, passing no values; otherwise, branch to kf.

Non-tail function calls should continue to a $kreceive continuation in order to adapt the returned values to their uses in the calling function, if any.

$kreceive arity k
Receive values from a function return. Parse them according to arity, and then proceed with the parsed values to the $kargs continuation labelled k.

$arity is a helper data structure used by $kreceive and also by $kclause, described below.

$arity req opt rest kw allow-other-keys?
A data type declaring an arity. See Guile's manual for details.

Additionally, there are three specific kinds of continuations that can only be declared at function entries.

$kentry self tail clauses
Declare a function entry. self is a variable bound to the procedure being called, and which may be used for self-references. tail declares the $cont wrapping the $ktail for this function, corresponding to the function's tail continuation. clauses is a list of $kclause $cont instances.
$ktail
A tail continuation.
$kclause arity cont
A clause of a function with a given arity. Applications of a function with a compatible set of actual arguments will continue to cont, a $kargs $cont instance representing the clause body.
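Putting the entry continuations together, here is a sketch of how a whole function like (lambda (x) x) might look, again in illustrative notation rather than Guile's exact concrete syntax. The labels (kfun, ktail, kclause, kbody) and the variable x1 are invented, and the free list is empty, as in early CPS.

($fun src meta ()                  ; free: () before closure conversion
  ($cont kfun
    ($kentry self
      ($cont ktail ($ktail))
      (($cont kclause
         ($kclause ($arity (x) () #f () #f)
           ($cont kbody
             ($kargs (x) (x1)
               ($continue ktail src ($values (x1)))))))))))

The body simply passes its one argument to the tail continuation, which corresponds to returning it.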

reflections

Before starting Guile's compiler rewrite, I had no real-world experience with CPS-based systems. I had worked with a few SSA-based systems, and a few more direct-style systems. I had most experience with the previous direct-style system that Guile had, but never had to seriously design another kind of IL, so basically I was ignorant. It shows, I think; but time will tell if it came out OK anyway. At this point I am cautiously optimistic.

As far as fitness for purpose goes, the CPS IL works in the sense that it is part of a self-hosting compiler. I'll say no more on that point other than to mention that it has explicit support for a number of Guile semantic features: multiple-value returns; optional, rest, and keyword arguments; cheap delimited continuations; Guile-native constant literals.

Why not ANF instead? If you recall from my SSA and CPS article, I mentioned that ANF is basically CPS with fewer labels. It tries to eliminate "administrative" continuations, whereas Guile's CPS labels everything. There is no short-hand let form.

ANF proponents tout its parsimony as a strength, but I do not understand this argument. I like having labels for everything. In CPS, I have as many labels as there are expressions, plus a few for continuations that don't contain terms. I use them directly in the bytecode compiler; the compiler never has to generate a fresh label, as they are part of the CPS itself.

More importantly, labelling every control-flow point allows me to reason precisely about control flow. For example, if a function is always called with the same continuation, it can be incorporated in the flow graph of the calling function. This is called "contification". It is not the same thing as inlining, as it works for sets of recursive functions as well, and never increases code size. This is a crucial transformation for a Scheme compiler to make, as it turns function calls into gotos, and self-function calls into loop back-edges.
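As a hypothetical example, consider a counting loop written with named let:

(define (count-to n)
  (let lp ((i 0))      ; lp starts life as a real function
    (if (< i n)
        (lp (+ i 1))   ; ...but every call to it is a tail call
        i)))

Every call to lp is in tail position, so every call passes along the same continuation: the return continuation of count-to. A contifier can therefore turn lp into a local continuation, making the self-call a loop back-edge rather than a function call.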

Guile's previous compiler did a weak form of contification, but because we didn't have names for all control points it was gnarly and I was afraid to make it any better. Now its contifier is optimal. See Fluet and Weeks' Contification using Dominators and Kennedy's CWCC for more on contification.

One more point in favor of labelling all continuations. Many transformations are best cast as a two-phase process, in which you first compute a set of transformations to perform, and then you apply them. Dead-code elimination works this way: first you find the least fixed-point of live expressions, and then you residualize only those expressions. Without names, how are you going to refer to an expression in the first phase? With the ubiquitous, thorough labelling that CPS provides, this problem does not arise.
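Here is a minimal sketch of that first phase, assuming each labelled expression can tell you which labels it uses. The uses procedure and the roots list are hypothetical inputs, not Guile's actual API.

;; Phase 1 of dead-code elimination: compute the set of live
;; labels as the least fixed point of "reachable from the roots".
(define (live-labels roots uses)
  (let lp ((worklist roots) (live '()))
    (cond
     ((null? worklist) live)
     ((member (car worklist) live)
      (lp (cdr worklist) live))             ; already marked; skip
     (else
      (lp (append (uses (car worklist)) (cdr worklist))
          (cons (car worklist) live))))))   ; mark live; queue its uses

Phase 2 then walks the term and residualizes only those expressions whose labels appear in the result.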

So I am happy with CPS, relative to ANF. But what about SSA? In my previous article, I asked SSA proponents to imagine returning a block from a block. Of course it doesn't make any sense; SSA is a first-order language. But modern CPS is also first-order, is the thing! Modern CPS distinguishes "continuations" syntactically from functions, which is exactly the same as SSA's distinction between basic blocks and functions. CPS and SSA really are the same on this level.

The fundamental CPS versus SSA difference is, as Stephen Weeks noted a decade ago, one of data structures: do you group your expressions into basic blocks stored in a vector (SSA), or do you nest them into a scope tree (CPS)? It's not clear that I made the correct choice.

In practice with Guile's CPS you end up building graphs on the side that describe some aspect of your term. For example you can build a reverse-post-ordered control flow analysis that linearizes your continuations, and assigns them numbers. Then you can compute a bitvector for each continuation representing each one's reachable continuations. Then you can use this reachability analysis to determine the extent of a prompt's body, for example.
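Here is a sketch of the kind of side analysis I mean: number the continuations in reverse post-order, then iterate to a fixed point, representing each continuation's reachable set as an integer bitmask. This is a simplification, not Guile's actual code; the order list and the succs procedure are assumed inputs.

(use-modules (srfi srfi-1))   ; for fold

;; ORDER: continuation labels in reverse post-order.
;; SUCCS: procedure mapping a label to its successor labels.
;; Returns a vector of bitmasks, one per label; bit j set in
;; entry i means continuation j is reachable from continuation i.
(define (reachability order succs)
  (let ((idx (make-hash-table))
        (bits (make-vector (length order) 0)))
    ;; assign each label its reverse-post-order number
    (fold (lambda (k i) (hash-set! idx k i) (+ i 1)) 0 order)
    (let run ()
      (let ((changed? #f))
        (for-each
         (lambda (k)
           (let* ((i (hash-ref idx k))
                  (old (vector-ref bits i))
                  ;; reachable(i) = union over successors j of {j} + reachable(j)
                  (new (fold (lambda (s acc)
                               (let ((j (hash-ref idx s)))
                                 (logior acc (ash 1 j) (vector-ref bits j))))
                             old (succs k))))
             (unless (= old new)
               (vector-set! bits i new)
               (set! changed? #t))))
         order)
        (when changed? (run))))
    bits))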

But this analysis is all on the side and not really facilitated by the structure of CPS itself; the CPS facilities that it uses are the globally unique continuation and value names of the CPS, and the control-flow links. Once the transformation is made, all of the analysis is thrown away.

Although this seems wasteful, the SSA approach of values having "implicit" names by their positions in a vector (instead of explicit ephemeral name-to-index mappings) is terrifying to me. Any graph transformation could renumber things, or leave holes, or cause vectors to expand... dunno. Perhaps I am too shy of the mutation foot-gun. I find comfort in CPS's minimalism.

One surprise I have found is that I haven't needed to do any dominator-based analysis in any of the paltry CPS optimizations I have made so far. I expect to do so once I start optimizing loops; here we see the cultural difference with SSA I guess, loops being the dominant object of study there. On the other hand I have had to solve flow equations on a few occasions, which was somewhat surprising, though enjoyable.

The optimizations I have currently implemented for CPS are fairly basic. Contification was tricky. One thing I did recently was to make all non-tail $call nodes require $kreceive continuations; if, as in the common case, extra values were unused, that was reflected in an unused rest argument. This required a number of optimizations to clean up and remove the extra rest arguments for other kinds of source expressions: dead-code elimination, the typical beta/eta reduction, and some code generation changes. It was worth it though, and now with the optimization passes things are faster than they were before.

Well, I find that I am rambling now. I know this is a lot of detail, but I hope that it helps some future compiler hacker understand more about intermediate language tradeoffs. I have been happy with CPS, but I'll report back if anything changes :) And if you are actually hacking on Guile, check the in-progress manual for all the specifics.

Happy hacking to all, and to all a good hack!

optimizing let in spidermonkey
18 December 2013
https://wingolog.org/2013/12/18/optimizing-let-in-spidermonkey

Peoples! Firefox now optimizes let-bound variables!

What does this mean, you ask? Well, as you nerdy wingolog readers probably know, the new ECMAScript 6 standard is coming soon. ES6 has new facilities that make JavaScript more expressive. At the same time, though, many ES6 features stress parts of JavaScript engines that are currently ignored.

One of these areas is lexical scope. Until now, with two exceptions, all local variables in JavaScript have been "hoisted" to the function level. (The two exceptions are the exception bound by a catch clause, and the name of a named function expression.) Therefore JS engines have rightly focused on compiling one function at a time, falling back to slower execution strategies when lexical scopes are present.

The presence of let in ES6 changes this. It was a hip saying a couple years ago that "let is the new var", but in reality no one uses let -- not only because engines didn't yet implement let without feature flags, if they implemented it at all, but because let was slow. Because let-bound variables were on the fallback path, they were unoptimized.

But that is changing. Firefox now optimizes many kinds of let-bound variables, making "let is the new var" much closer to being an actual thing. V8 has made some excellent steps in similar directions as well. I'll focus on the Firefox bit here as that's what I've been working on, then go on to mention the work that the excellent V8 hackers have done.

implementing scope in javascript

JavaScript the language is defined in terms of a "scope chain", with respect to which local variables are looked up. Each link in the chain binds some set of variables. Lookup of a name traverses the chain from tip to tail looking for a binding.

Usually it's possible to know at compile-time where a binding will be in the chain. For example, you can know that the "x" reference in the returned function here:

function () {
  var x = 3;
  return function () { return x; };
}

will be one link up in the scope chain, and will be stored in the first named slot in that link. This is a classic implementation of lexical scope, and it plays very well with the semantics of JavaScript as specified. All engines do something like this.

Another thing to note in this case is that we don't need to store the name for "x" anywhere, as we can see through all of its uses at compile-time. In practice however, as JavaScript functions are typically compiled lazily on first call, we do store the association "x -> slot 0" in the outer function's environment.

(Could you just propagate the constant, you ask? Of course you could. But since JS compilation typically occurs one function at a time, lazily, no engine does this initially. Of course later when the optimizing compiler kicks in, it does get propagated, but that usually doesn't avoid the scope chain node allocation.)

Let's take another case.

function foo() { var x = 3; return x; }

In this case we know that the "x" reference can be found in the top link of the chain, in the first slot. Semantically speaking, that is. No JS engine implements things this way. Instead we take advantage of noticing that "x" cannot be captured by any other function. Therefore we are free to assign "x" to a slot in a stack frame, and refer to it by index directly -- without allocating a scope chain node on the heap. All engines do this.

And of course later if this function is hot, the constant exists only in a register, probably inlined to its caller. For this reason scope hasn't had a lot of work put into it. The unit of optimization -- the function -- is the same as the unit of scope. If the implementation details of scope get costly, we take advantage of run-time optimization to speculatively remove allocations, loads, and stores.

OK. What if you have some variables captured, and some not captured?

function foo() { var x = 3; var y = 4; return function () { return x; } }

Here "x" is captured, but "y" is not. Usually engines will allocate "x" on the scope chain, and "y" as a local. (JavaScriptCore is an interesting exception; it uses a strategy I haven't seen elsewhere called "lazy tear-off". Basically all variables are on the stack for the dynamic extent of the scope, whether they are captured or not. If needed, potentially lazily, an object is pushed on the scope chain that aliases the stack slots. If a scope is pushed on the scope chain, when the scope ends the current values are "torn off" the stack and copied to the heap, and the slots pointer of the scope object updated to point to the heap.)

I digress. "x" on the scope chain, "y" not. (I regress: why allocate "y" at all, isn't it dead? Yes it is dead. Optimizing compilers kill it. Baseline compilers don't bother.) So typically access to "x" goes through the scope chain, and access to "y" does not.

Calculating the set of variables that can be captured is one of the few static analyses done by JS baseline compilers. The other ones are whether we are in strict mode or not, whether nested scopes contain "with" or "eval" (thus forcing all locals to be on the scope chain), and whether the "arguments" object is used or not. There may be more, but that is the lowest common denominator of efficiency.

implementing let

let is part of ES6, and all browsers will eventually implement it.

Let me give you an insight as to the mindset of a browser maker. What a browser maker does when they see a feature is first to try to ignore it -- all features have a cost and not all of them pay for themselves. Next you try to do the minimum correct thing -- the thing that passes all the test suites, but imposes a minimal burden on the rest of the system. Usually this means that the feature probably works, except for the corner cases which users will file bugs for, but it is slow. Finally when there is either an internal use the browser maker cares about (Google, Mozilla to an extent) or you want to heat up a benchmark war (everyone else), you start to optimize.

The state of let was until recently between "ignore" and "slow". "Slow" means different things to different browsers.

The engines that have started to implement let are V8 (Chrome) and SpiderMonkey (Firefox). In Chrome, until recently using let in a function was an easy way of preventing that function from ever being optimized by Crankshaft. Good times? When writing this article I was going to claim this was still the case, but I see that in fact this is happily not true. You can Crankshaft a function that has let bindings. I believe, with a cursory glance, that locals that are not captured are still allocated on a scope chain -- but perhaps I am misreading things.

Firefox, on the other hand, would not Ion-compile any function with a let. You would think that this would not be the case, given that Firefox has had let for many years. If you thought this, you misunderstand how browser development works. Basically let me lay it out for you:

class struggle : marxism :: benchmarks : browsers

Get it? Benchmarks are the motor of browser history. If this bloviation has any effect, please let it be to inspire you to go and make benchmarks!

Anyway, Firefox. No Ion for let. The reason why you would avoid optimizing these things is clear -- breaking the "optimization unit == scope unit" equivalence has a cost. However the time has come.

What I have done is to recognize let scopes that have no captured locals. This class of let scope will now be optimized by Ion. So for example in this case:

function sumto(n) {
  let sum = 0;
  for (let i=0; i<n; i++)
    sum += i;
  return sum;
}

Previously this would allocate a block scope around the "for". (Incidentally, Firefox borks block scoping in this case; each iteration should produce a fresh binding. But this is historical, will be fixed, and I digress.) The operations that created and pushed block scopes were not optimized. Avoiding pushing and popping scopes from the scope chain avoids this limitation, and thus we have awesome speed!

The effect of simply turning on Ion cannot be denied. In this case, the runtime of a sum up to 1e9 goes from 8.8 seconds (!) to 0.8 seconds. Obviously it's a little micro-benchmark but that's how I'm rolling today. The biggest optimization is to stop deoptimization. And here's where history cranks up: V8 takes 7.7 seconds on this loop. Gentlefolk, start your engines!
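For the record, here is roughly how one might time it; this is a hypothetical harness, not the exact setup behind the numbers above.

// crude timing of the micro-benchmark above; in a bare JS shell
// you might use print() instead of console.log()
var start = Date.now();
sumto(1e9);
console.log(((Date.now() - start) / 1000) + " seconds");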

At the same time, in Firefox we are still able to reify a parallel chain of "debug scopes" as needed that corresponds to the scope chain as the specification would see it. This was the most challenging part of optimizing block scope in Firefox -- not the actual optimization, which was trivial, but optimization while retaining introspection and debuggability.

future work

Unhappily, scopes with let-bound variables that are captured by nested scopes ("aliased", in spidermonkey parlance) are not yet optimized. So not only do they cause scope chain allocation, but they also don't benefit from Ion. Boo. Bug 942810.

I should also mention that the let supported by Firefox is not the let specified in ES6. In ES6 there is thing called a "temporal dead zone" whereby it is invalid to access the value of a let before its initialization. This is like Scheme's "letrec restriction", and has to be enforced dynamically for similar reasons. V8 does a great job in actually implementing this, and Firefox should do it soon.
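A hypothetical snippet to illustrate the temporal dead zone; under ES6 semantics the read of x below must throw, because it happens before the let initializer runs:

{
  console.log(x);  // ES6: throws ReferenceError; x is in the TDZ
  let x = 1;
}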

Of course, it's not clear to me how let can actually be deployed without breaking the tubes. I think it probably can somehow but I haven't checked the latest TC39 cogitations in that regard.

twisted paths

It's been a super-strange end-of-year. I was sure I would be optimizing SpiderMonkey generators and moving on to other things, but I just got caught up with stuff -- the for-of bits took approximately forever and then the amount of state carried in SpiderMonkey stack frames filled me with despair. So it was that this hack, while assisting me in that goal, was not actually a planned thing.

See, SpiderMonkey used to actually reserve a stack slot for the "block chain". No, they aren't using your browser to mine for Bitcoins, though that would be a neat hack. The "block chain" was a simulation of the spec-mandated scope chain. Of course this might not reflect the actual implemented, optimized behavior, yet one still wants to map the two to each other for debugging purposes. It was a mess.

The resulting changeset to fix this ended up so large that it took weeks to land. Well, live and learn, right? I remember Michael Starzinger telling me the same thing about V8 -- you just have to keep your patches small, as small as possible, and always working. Words to the wise indeed.

happy days

But in the end we at least have some juice from this citric fruit. This has been an Igalia joint. Thanks very much to Mozilla's Luke Wagner for suffering through the reviews.

Thanks especially to Bloomberg for making this work possible; you folks are swell. Forward ES6!