wingologA mostly dorky weblog by Andy Wingo2023-05-20T07:10:03Ztekutihttps://wingolog.org/feed/atomAndy Wingohttps://wingolog.org/approaching cps souphttps://wingolog.org/2023/05/20/approaching-cps-soup2023-05-20T07:10:03Z2023-05-20T07:10:03Z

Good evening, hackers. Today’s missive is more of a massive, in the sense that it’s another presentation transcript-alike; these things always translate to many vertical pixels.

In my defense, I hardly ever give a presentation twice, so not only do I miss out on the usual per-presentation cost amortization and on the incremental improvements of repetition, the more dire error is that whatever message I might have can only ever reach a subset of those that it might interest; here at least I can be more or less sure that if the presentation would interest someone, that they will find it.

So for the time being I will try to share presentations here, in the spirit of, well, why the hell not.

CPS Soup

A functional intermediate language

10 May 2023 – Spritely

Andy Wingo

Igalia, S.L.

Last week I gave a training talk to Spritely Institute collaborators on the intermediate representation used by Guile‘s compiler.

CPS Soup

Compiler: Front-end to Middle-end to Back-end

Middle-end spans gap between high-level source code (AST) and low-level machine code

Programs in middle-end expressed in intermediate language

CPS Soup is the language of Guile’s middle-end

An intermediate representation (IR) (or intermediate language, IL) is just another way to express a computer program. Specifically it’s the kind of language that is appropriate for the middle-end of a compiler, and by “appropriate” I meant that an IR serves a purpose: there has to be a straightforward transformation to the IR from high-level abstract syntax trees (ASTs) from the front-end, and there has to be a straightforward translation from IR to machine code.

There are also usually a set of necessary source-to-source transformations on IR to “lower” it, meaning to make it closer to the back-end than to the front-end. There are usually a set of optional transformations to the IR to make the program run faster or allocate less memory or be more simple: these are the optimizations.

“CPS soup” is Guile’s IR. This talk presents the essentials of CPS soup in the context of more traditional IRs.

How to lower?

High-level:

(+ 1 (if x 42 69))

Low-level:

  cmpi $x, #f
  je L1
  movi $t, 42
  j L2  
L1:
  movi $t, 69
L2:
  addi $t, 1

How to get from here to there?

Before we dive in, consider what we might call the dynamic range of an intermediate representation: we start with what is usually an algebraic formulation of a program and we need to get down to a specific sequence of instructions operating on registers (unlimited in number, at this stage; allocating to a fixed set of registers is a back-end concern), with explicit control flow between them. What kind of a language might be good for this? Let’s attempt to answer the question by looking into what the standard solutions are for this problem domain.

1970s

Control-flow graph (CFG)

graph := array<block>
block := tuple<preds, succs, insts>
inst  := goto B
       | if x then BT else BF
       | z = const C
       | z = add x, y
       ...

BB0: if x then BB1 else BB2
BB1: t = const 42; goto BB3
BB2: t = const 69; goto BB3
BB3: t2 = addi t, 1; ret t2

Assignment, not definition

Of course in the early days, there was no intermediate language; compilers translated ASTs directly to machine code. It’s been a while since I dove into all this but the milestone I have in my head is that it’s the 70s when compiler middle-ends come into their own right, with Fran Allen’s work on flow analysis and optimization.

In those days the intermediate representation for a compiler was a graph of basic blocks, but unlike today the paradigm was assignment to locations rather than definition of values. By that I mean that in our example program, we get t assigned to in two places (BB1 and BB2); the actual definition of t is implicit, as a storage location, and our graph consists of assignments to the set of storage locations in the program.

1980s

Static single assignment (SSA) CFG

graph := array<block>
block := tuple<preds, succs, phis, insts>
phi   := z := φ(x, y, ...)
inst  := z := const C
       | z := add x, y
       ...
BB0: if x then BB1 else BB2
BB1: v0 := const 42; goto BB3
BB2: v1 := const 69; goto BB3
BB3: v2 := φ(v0,v1); v3:=addi t,1; ret v3

Phi is phony function: v2 is v0 if coming from first predecessor, or v1 from second predecessor

These days we still live in Fran Allen’s world, but with a twist: we no longer model programs as graphs of assignments, but rather graphs of definitions. The introduction in the mid-80s of so-called “static single-assignment” (SSA) form graphs mean that instead of having two assignments to t, we would define two different values v0 and v1. Then later instead of reading the value of the storage location associated with t, we define v2 to be either v0 or v1: the former if we reach the use of t in BB3 from BB1, the latter if we are coming from BB2.

If you think on the machine level, in terms of what the resulting machine code will be, this either function isn’t a real operation; probably register allocation will put v0, v1, and v2 in the same place, say $rax. The function linking the definition of v2 to the inputs v0 and v1 is purely notational; in a way, you could say that it is phony, or not real. But when the creators of SSA went to submit this notation for publication they knew that they would need something that sounded more rigorous than “phony function”, so they instead called it a “phi” (φ) function. Really.

2003: MLton

Refinement: phi variables are basic block args

graph := array<block>
block := tuple<preds, succs, args, insts>

Inputs of phis implicitly computed from preds

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

SSA is still where it’s at, as a conventional solution to the IR problem. There have been some refinements, though. I learned of one of them from MLton; I don’t know if they were first but they had the idea of interpreting phi variables as arguments to basic blocks. In this formulation, you don’t have explicit phi instructions; rather the “v2 is either v1 or v0” property is expressed by v2 being a parameter of a block which is “called” with either v0 or v1 as an argument. It’s the same semantics, but an interesting notational change.

Refinement: Control tail

Often nice to know how a block ends (e.g. to compute phi input vars)

graph := array<block>
block := tuple<preds, succs, args, insts,
               control>
control := if v then L1 else L2
         | L(v, ...)
         | switch(v, L1, L2, ...)
         | ret v

One other refinement to SSA is to note that basic blocks consist of some number of instructions that can define values or have side effects but which otherwise exhibit fall-through control flow, followed by a single instruction that transfers control to another block. We might as well store that control instruction separately; this would let us easily know how a block ends, and in the case of phi block arguments, easily say what values are the inputs of a phi variable. So let’s do that.

Refinement: DRY

Block successors directly computable from control

Predecessors graph is inverse of successors graph

graph := array<block>
block := tuple<args, insts, control>

Can we simplify further?

At this point we notice that we are repeating ourselves; the successors of a block can be computed directly from the block’s terminal control instruction. Let’s drop those as a distinct part of a block, because when you transform a program it’s unpleasant to have to needlessly update something in two places.

While we’re doing that, we note that the predecessors array is also redundant, as it can be computed from the graph of block successors. Here we start to wonder: am I simpliying or am I removing something that is fundamental to the algorithmic complexity of the various graph transformations that I need to do? We press on, though, hoping we will get somewhere interesting.

Basic blocks are annoying

Ceremony about managing insts; array or doubly-linked list?

Nonuniformity: “local” vs ‘`global’' transformations

Optimizations transform graph A to graph B; mutability complicates this task

  • Desire to keep A in mind while making B
  • Bugs because of spooky action at a distance

Recall that the context for this meander is Guile’s compiler, which is written in Scheme. Scheme doesn’t have expandable arrays built-in. You can build them, of course, but it is annoying. Also, in Scheme-land, functions with side-effects are conventionally suffixed with an exclamation mark; after too many of them, both the writer and the reader get fatigued. I know it’s a silly argument but it’s one of the things that made me grumpy about basic blocks.

If you permit me to continue with this introspection, I find there is an uneasy relationship between instructions and locations in an IR that is structured around basic blocks. Do instructions live in a function-level array and a basic block is an array of instruction indices? How do you get from instruction to basic block? How would you hoist an instruction to another basic block, might you need to reallocate the block itself?

And when you go to transform a graph of blocks... well how do you do that? Is it in-place? That would be efficient; but what if you need to refer to the original program during the transformation? Might you risk reading a stale graph?

It seems to me that there are too many concepts, that in the same way that SSA itself moved away from assignment to a more declarative language, that perhaps there is something else here that might be more appropriate to the task of a middle-end.

Basic blocks, phi vars redundant

Blocks: label with args sufficient; “containing” multiple instructions is superfluous

Unify the two ways of naming values: every var is a phi

graph := array<block>
block := tuple<args, inst>
inst  := L(expr)
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

I took a number of tacks here, but the one I ended up on was to declare that basic blocks themselves are redundant. Instead of containing an array of instructions with fallthrough control-flow, why not just make every instruction a control instruction? (Yes, there are arguments against this, but do come along for the ride, we get to a funny place.)

While you are doing that, you might as well unify the two ways in which values are named in a MLton-style compiler: instead of distinguishing between basic block arguments and values defined within a basic block, we might as well make all names into basic block arguments.

Arrays annoying

Array of blocks implicitly associates a label with each block

Optimizations add and remove blocks; annoying to have dead array entries

Keep labels as small integers, but use a map instead of an array

graph := map<label, block>

In the traditional SSA CFG IR, a graph transformation would often not touch the structure of the graph of blocks. But now having given each instruction its own basic block, we find that transformations of the program necessarily change the graph. Consider an instruction that we elide; before, we would just remove it from its basic block, or replace it with a no-op. Now, we have to find its predecessor(s), and forward them to the instruction’s successor. It would be useful to have a more capable data structure to represent this graph. We might as well keep labels as being small integers, but allow for sparse maps and growth by using an integer-specialized map instead of an array.

This is CPS soup

graph := map<label, cont>
cont  := tuple<args, term>
term  := continue to L
           with values from expr
       | if v then L1() else L2()
       ...
expr  := const C
       | add x, y
       ...

SSA is CPS

This is exactly what CPS soup is! We came at it “from below”, so to speak; instead of the heady fumes of the lambda calculus, we get here from down-to-earth basic blocks. (If you prefer the other way around, you might enjoy this article from a long time ago.) The remainder of this presentation goes deeper into what it is like to work with CPS soup in practice.

Scope and dominators

BB0(a0): if a0 then BB1() else BB2()
BB1(): v0 := const 42; BB3(v0)
BB2(): v1 := const 69; BB3(v1)
BB3(v2): v3 := addi v2, 1; ret v3

What vars are “in scope” at BB3? a0 and v2.

Not v0; not all paths from BB0 to BB3 define v0.

a0 always defined: its definition dominates all uses.

BB0 dominates BB3: All paths to BB3 go through BB0.

Before moving on, though, we should discuss what it means in an SSA-style IR that variables are defined rather than assigned. If you consider variables as locations to which values can be assigned and which initially hold garbage, you can read them at any point in your program. You might get garbage, though, if the variable wasn’t assigned something sensible on the path that led to reading the location’s value. It sounds bonkers but it is still the C and C++ semantic model.

If we switch instead to a definition-oriented IR, then a variable never has garbage; the single definition always precedes any uses of the variable. That is to say that all paths from the function entry to the use of a variable must pass through the variable’s definition, or, in the jargon, that definitions dominate uses. This is an invariant of an SSA-style IR, that all variable uses be dominated by their associated definition.

You can flip the question around to ask what variables are available for use at a given program point, which might be read equivalently as which variables are in scope; the answer is, all definitions from all program points that dominate the use site. The “CPS” in “CPS soup” stands for continuation-passing style, a dialect of the lambda calculus, which has also has a history of use as a compiler intermediate representation. But it turns out that if we use the lambda calculus in its conventional form, we end up needing to maintain a lexical scope nesting at the same time that we maintain the control-flow graph, and the lexical scope tree can fail to reflect the dominator tree. I go into this topic in more detail in an old article, and if it interests you, please do go deep.

CPS soup in Guile

Compilation unit is intmap of label to cont

cont := $kargs names vars term
      | ...
term := $continue k src expr
      | ...
expr := $const C
      | $primcall ’add #f (a b)
      | ...

Conventionally, entry point is lowest-numbered label

Anyway! In Guile, the concrete form that CPS soup takes is that a program is an intmap of label to cont. A cont is the smallest labellable unit of code. You can call them blocks if that makes you feel better. One kind of cont, $kargs, binds incoming values to variables. It has a list of variables, vars, and also has an associated list of human-readable names, names, for debugging purposes.

A $kargs contains a term, which is like a control instruction. One kind of term is $continue, which passes control to a continuation k. Using our earlier language, this is just goto *k*, with values, as in MLton. (The src is a source location for the term.) The values come from the term’s expr, of which there are a dozen kinds or so, for example $const which passes a literal constant, or $primcall, which invokes some kind of primitive operation, which above is add. The primcall may have an immediate operand, in this case #f, and some variables that it uses, in this case a and b. The number and type of the produced values is a property of the primcall; some are just for effect, some produce one value, some more.

CPS soup

term := $continue k src expr
      | $branch kf kt src op param args
      | $switch kf kt* src arg
      | $prompt k kh src escape? tag
      | $throw src op param args

Expressions can have effects, produce values

expr := $const val
      | $primcall name param args
      | $values args
      | $call proc args
      | ...

There are other kinds of terms besides $continue: there is $branch, which proceeds either to the false continuation kf or the true continuation kt depending on the result of performing op on the variables args, with immediate operand param. In our running example, we might have made the initial term via:

(build-term
  ($branch BB1 BB2 'false? #f (a0)))

The definition of build-term (and build-cont and build-exp) is in the (language cps) module.

There is also $switch, which takes an unboxed unsigned integer arg and performs an array dispatch to the continuations in the list kt, or kf otherwise.

There is $prompt which continues to its k, having pushed on a new continuation delimiter associated with the var tag; if code aborts to tag before the prompt exits via an unwind primcall, the stack will be unwound and control passed to the handler continuation kh. If escape? is true, the continuation is escape-only and aborting to the prompt doesn’t need to capture the suspended continuation.

Finally there is $throw, which doesn’t continue at all, because it causes a non-resumable exception to be thrown. And that’s it; it’s just a handful of kinds of term, determined by the different shapes of control-flow (how many continuations the term has).

When it comes to values, we have about a dozen expression kinds. We saw $const and $primcall, but I want to explicitly mention $values, which simply passes on some number of values. Often a $values expression corresponds to passing an input to a phi variable, though $kargs vars can get their definitions from any expression that produces the right number of values.

Kinds of continuations

Guile functions untyped, can multiple return values

Error if too few values, possibly truncate too many values, possibly cons as rest arg...

Calling convention: contract between val producer & consumer

  • both on call and return side

Continuation of $call unlike that of $const

When a $continue term continues to a $kargs with a $const 42 expression, there are a number of invariants that the compiler can ensure: that the $kargs continuation is always passed the expected number of values, that the vars that it binds can be allocated to specific locations (e.g. registers), and that because all predecessors of the $kargs are known, that those predecessors can place their values directly into the variable’s storage locations. Effectively, the compiler determines a custom calling convention between each $kargs and its predecessors.

Consider the $call expression, though; in general you don’t know what the callee will do to produce its values. You don’t even generally know that it will produce the right number of values. Therefore $call can’t (in general) continue to $kargs; instead it continues to $kreceive, which expects the return values in well-known places. $kreceive will check that it is getting the right number of values and then continue to a $kargs, shuffling those values into place. A standard calling convention defines how functions return values to callers.

The conts

cont := $kfun src meta self ktail kentry
      | $kclause arity kbody kalternate
      | $kargs names syms term
      | $kreceive arity kbody
      | $ktail

$kclause, $kreceive very similar

Continue to $ktail: return

$call and return (and $throw, $prompt) exit first-order flow graph

Of course, a $call expression could be a tail-call, in which case it would continue instead to $ktail, indicating an exit from the first-order function-local control-flow graph.

The calling convention also specifies how to pass arguments to callees, and likewise those continuations have a fixed calling convention; in Guile we start functions with $kfun, which has some metadata attached, and then proceed to $kclause which bridges the boundary between the standard calling convention and the specialized graph of $kargs continuations. (Many details of this could be tweaked, for example that the case-lambda dispatch built-in to $kclause could instead dispatch to distinct functions instead of to different places in the same function; historical accidents abound.)

As a detail, if a function is well-known, in that all its callers are known, then we can lighten the calling convention, moving the argument-count check to callees. In that case $kfun continues directly to $kargs. Similarly for return values, optimizations can make $call continue to $kargs, though there is still some value-shuffling to do.

High and low

CPS bridges AST (Tree-IL) and target code

High-level: vars in outer functions in scope

Closure conversion between high and low

Low-level: Explicit closure representations; access free vars through closure

CPS soup is the bridge between parsed Scheme and machine code. It starts out quite high-level, notably allowing for nested scope, in which expressions can directly refer to free variables. Variables are small integers, and for high-level CPS, variable indices have to be unique across all functions in a program. CPS gets lowered via closure conversion, which chooses specific representations for each closure that remains after optimization. After closure conversion, all variable access is local to the function; free variables are accessed via explicit loads from a function’s closure.

Optimizations at all levels

Optimizations before and after lowering

Some exprs only present in one level

Some high-level optimizations can merge functions (higher-order to first-order)

Because of the broad remit of CPS, the language itself has two dialects, high and low. The high level dialect has cross-function variable references, first-class abstract functions (whose representation hasn’t been chosen), and recursive function binding. The low-level dialect has only specific ways to refer to functions: labels and specific closure representations. It also includes calls to function labels instead of just function values. But these are minor variations; some optimization and transformation passes can work on either dialect.

Practicalities

Intmap, intset: Clojure-style persistent functional data structures

Program: intmap<label,cont>

Optimization: program→program

Identify functions: (program,label)→intset<label>

Edges: intmap<label,intset<label>>

Compute succs: (program,label)→edges

Compute preds: edges→edges

I mentioned that programs were intmaps, and specifically in Guile they are Clojure/Bagwell-style persistent functional data structures. By functional I mean that intmaps (and intsets) are values that can’t be mutated in place (though we do have the transient optimization).

I find that immutability has the effect of deploying a sense of calm to the compiler hacker – I don’t need to worry about data structures changing out from under me; instead I just structure all the transformations that you need to do as functions. An optimization is just a function that takes an intmap and produces another intmap. An analysis associating some data with each program label is just a function that computes an intmap, given a program; that analysis will never be invalidated by subsequent transformations, because the program to which it applies will never be mutated.

This pervasive feeling of calm allows me to tackle problems that I wouldn’t have otherwise been able to fit into my head. One example is the novel online CSE pass; one day I’ll either wrap that up as a paper or just capitulate and blog it instead.

Flow analysis

A[k] = meet(A[p] for p in preds[k])
         - kill[k] + gen[k]

Compute available values at labels:

  • A: intmap<label,intset<val>>
  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union
  • kill[k]: values invalidated by cont because of side effects
  • gen[k]: values defined at k

But to keep it concrete, let’s take the example of flow analysis. For example, you might want to compute “available values” at a given label: these are the values that are candidates for common subexpression elimination. For example if a term is dominated by a car x primcall whose value is bound to v, and there is no path from the definition of V to a subsequent car x primcall, we can replace that second duplicate operation with $values (v) instead.

There is a standard solution for this problem, which is to solve the flow equation above. I wrote about this at length ages ago, but looking back on it, the thing that pleases me is how easy it is to decompose the task of flow analysis into manageable parts, and how the types tell you exactly what you need to do. It’s easy to compute an initial analysis A, easy to define your meet function when your maps and sets have built-in intersect and union operators, easy to define what addition and subtraction mean over sets, and so on.

Persistent data structures FTW

  • meet: intmap-intersect<intset-intersect>
  • -, +: intset-subtract, intset-union

Naïve: O(nconts * nvals)

Structure-sharing: O(nconts * log(nvals))

Computing an analysis isn’t free, but it is manageable in cost: the structure-sharing means that meet is usually trivial (for fallthrough control flow) and the cost of + and - is proportional to the log of the problem size.

CPS soup: strengths

Relatively uniform, orthogonal

Facilitates functional transformations and analyses, lowering mental load: “I just have to write a function from foo to bar; I can do that”

Encourages global optimizations

Some kinds of bugs prevented by construction (unintended shared mutable state)

We get the SSA optimization literature

Well, we’re getting to the end here, and I want to take a step back. Guile has used CPS soup as its middle-end IR for about 8 years now, enough time to appreciate its fine points while also understanding its weaknesses.

On the plus side, it has what to me is a kind of low cognitive overhead, and I say that not just because I came up with it: Guile’s development team is small and not particularly well-resourced, and we can’t afford complicated things. The simplicity of CPS soup works well for our development process (flawed though that process may be!).

I also like how by having every variable be potentially a phi, that any optimization that we implement will be global (i.e. not local to a basic block) by default.

Perhaps best of all, we get these benefits while also being able to use the existing SSA transformation literature. Because CPS is SSA, the lessons learned in SSA (e.g. loop peeling) apply directly.

CPS soup: weaknesses

Pointer-chasing, indirection through intmaps

Heavier than basic blocks: more control-flow edges

Names bound at continuation only; phi predecessors share a name

Over-linearizes control, relative to sea-of-nodes

Overhead of re-computation of analyses

CPS soup is not without its drawbacks, though. It’s not suitable for JIT compilers, because it imposes some significant constant-factor (and sometimes algorithmic) overheads. You are always indirecting through intmaps and intsets, and these data structures involve significant pointer-chasing.

Also, there are some forms of lightweight flow analysis that can be performed naturally on a graph of basic blocks without looking too much at the contents of the blocks; for example in our available variables analysis you could run it over blocks instead of individual instructions. In these cases, basic blocks themselves are an optimization, as they can reduce the size of the problem space, with corresponding reductions in time and memory use for analyses and transformations. Of course you could overlay a basic block graph on top of CPS soup, but it’s not a well-worn path.

There is a little detail that not all phi predecessor values have names, since names are bound at successors (continuations). But this is a detail; if these names are important, little $values trampolines can be inserted.

Probably the main drawback as an IR is that the graph of conts in CPS soup over-linearizes the program. There are other intermediate representations that don’t encode ordering constraints where there are none; perhaps it would be useful to marry CPS soup with sea-of-nodes, at least during some transformations.

Finally, CPS soup does not encourage a style of programming where an analysis is incrementally kept up to date as a program is transformed in small ways. The result is that we end up performing much redundant computation within each individual optimization pass.

Recap

CPS soup is SSA, distilled

Labels and vars are small integers

Programs map labels to conts

Conts are the smallest labellable unit of code

Conts can have terms that continue to other conts

Compilation simplifies and lowers programs

Wasm vs VM backend: a question for another day :)

But all in all, CPS soup has been good for Guile. It’s just SSA by another name, in a simpler form, with a functional flavor. Or, it’s just CPS, but first-order only, without lambda.

In the near future, I am interested in seeing what a new GC will do for CPS soup; will bump-pointer allocation palliate some of the costs of pointer-chasing? We’ll see. A tricky thing about CPS soup is that I don’t think that anyone else has tried it in other languages, so it’s hard to objectively understand its characteristics independent of Guile itself.

Finally, it would be nice to engage in the academic conversation by publishing a paper somewhere; I would like to see interesting criticism, and blog posts don’t really participate in the citation graph. But in the limited time available to me, faced with the choice between hacking on something and writing a paper, it’s always been hacking, so far :)

Speaking of limited time, I probably need to hit publish on this one and move on. Happy hacking to all, and until next time.

Andy Wingohttps://wingolog.org/instruction explosion in guilehttps://wingolog.org/2018/01/17/instruction-explosion-in-guile2018-01-17T10:30:15Z2018-01-17T10:30:15Z

Greetings, fellow Schemers and compiler nerds: I bring fresh nargery!

instruction explosion

A couple years ago I made a list of compiler tasks for Guile. Most of these are still open, but I've been chipping away at the one labeled "instruction explosion":

Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do this, which would be a lose overall because you have more instruction dispatch.

Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.

Now that I'm getting close to finished I wanted to share some thoughts. Previous progress reports on the mailing list.

a simple loop

As an example, consider this loop that sums the 32-bit floats in a bytevector. I've annotated the code with lines and columns so that you can correspond different pieces to the assembly.

   0       8   12     19
 +-v-------v---v------v-
 |
1| (use-modules (rnrs bytevectors))
2| (define (f32v-sum bv)
3|   (let lp ((n 0) (sum 0.0))
4|     (if (< n (bytevector-length bv))
5|         (lp (+ n 4)
6|             (+ sum (bytevector-ieee-single-native-ref bv n)))
7|          sum)))

The assembly for the loop before instruction explosion went like this:

L1:
  17    (handle-interrupts)     at (unknown file):5:12
  18    (uadd/immediate 0 1 4)
  19    (bv-f32-ref 1 3 1)      at (unknown file):6:19
  20    (fadd 2 2 1)            at (unknown file):6:12
  21    (s64<? 0 4)             at (unknown file):4:8
  22    (jnl 8)                ;; -> L4
  23    (mov 1 0)               at (unknown file):5:8
  24    (j -7)                 ;; -> L1

So, already Guile's compiler has hoisted the (bytevector-length bv) and unboxed the loop index n and accumulator sum. This work aims to simplify further by exploding bv-f32-ref.

exploding the loop

In practice, instruction explosion happens in CPS conversion, as we are converting the Scheme-like Tree-IL language down to the CPS soup language. When we see a Tree-Il primcall (a call to a known primitive), instead of lowering it to a corresponding CPS primcall, we inline a whole blob of code.

In the concrete case of bv-f32-ref, we'd inline it with something like the following:

(unless (and (heap-object? bv)
             (eq? (heap-type-tag bv) %bytevector-tag))
  (error "not a bytevector" bv))
(define len (word-ref bv 1))
(define ptr (word-ref bv 2))
(unless (and (<= 4 len)
             (<= idx (- len 4)))
  (error "out of range" idx))
(f32-ref ptr len)

As you can see, there are four branches hidden in the bv-f32-ref: two to check that the object is a bytevector, and two to check that the index is within range. In this explanation we assume that the offset idx is already unboxed, but actually unboxing the index ends up being part of this work as well.

One of the goals of instruction explosion was that by breaking the operation into a number of smaller, more orthogonal parts, native code generation would be easier, because the compiler would only have to know about those small bits. However without an optimizing compiler, it would be better to reify a call out to a specialized bv-f32-ref runtime routine instead of inlining all of this code -- probably whatever language you write your runtime routine in (C, rust, whatever) will do a better job optimizing than your compiler will.

But with an optimizing compiler, there is the possibility of removing possibly everything but the f32-ref. Guile doesn't quite get there, but almost; here's the post-explosion optimized assembly of the inner loop of f32v-sum:

L1:
  27    (handle-interrupts)
  28    (tag-fixnum 1 2)
  29    (s64<? 2 4)             at (unknown file):4:8
  30    (jnl 15)               ;; -> L5
  31    (uadd/immediate 0 2 4)  at (unknown file):5:12
  32    (u64<? 2 7)             at (unknown file):6:19
  33    (jnl 5)                ;; -> L2
  34    (f32-ref 2 5 2)
  35    (fadd 3 3 2)            at (unknown file):6:12
  36    (mov 2 0)               at (unknown file):5:8
  37    (j -10)                ;; -> L1

good things

The first thing to note is that unlike the "before" code, there's no instruction in this loop that can throw an exception. Neat.

Next, note that there's no type check on the bytevector; the peeled iteration preceding the loop already proved that the bytevector is a bytevector.

And indeed there's no reference to the bytevector at all in the loop! The value being dereferenced in (f32-ref 2 5 2) is a raw pointer. (Read this instruction as, "sp[2] = *(float*)((byte*)sp[5] + (uptrdiff_t)sp[2])".) The compiler does something interesting; the f32-ref CPS primcall actually takes three arguments: the garbage-collected object protecting the pointer, the pointer itself, and the offset. The object itself doesn't appear in the residual code, but including it in the f32-ref primcall's inputs keeps it alive as long as the f32-ref itself is alive.

bad things

Then there are the limitations. Firstly, instruction 28 tags the u64 loop index as a fixnum, but never uses the result. Why is this here? Sadly it's because the value is used in the bailout at L2. Recall this pseudocode:

(unless (and (<= 4 len)
             (<= idx (- len 4)))
  (error "out of range" idx))

Here the error ends up lowering to a throw CPS term that the compiler recognizes as a bailout and renders out-of-line; cool. But it uses idx as an argument, as a tagged SCM value. The compiler untags the loop index, but has to keep a tagged version around for the error cases.

The right fix is probably some kind of allocation sinking pass that sinks the tag-fixnum to the bailouts. Oh well.

Additionally, there are two tests in the loop. Are both necessary? Turns out, yes :( Imagine you have a bytevector of length 1025. The loop continues until the last ref at offset 1024, which is within bounds of the bytevector but there's one one byte available at that point, so we need to throw an exception at this point. The compiler did as good a job as we could expect it to do.

is is worth it? where to now?

On the one hand, instruction explosion is a step sideways. The code is more optimal, but it's more instructions. Because Guile currently has a bytecode VM, that means more total interpreter overhead. Testing on a 40-megabyte bytevector of 32-bit floats, the exploded f32v-sum completes in 115 milliseconds compared to around 97 for the earlier version.

On the other hand, it is very easy to imagine how to compile these instructions to native code, either ahead-of-time or via a simple template JIT. You practically just have to look up the instructions in the corresponding ISA reference, is all. The result should perform quite well.

I will probably take a whack at a simple template JIT first that does no register allocation, then ahead-of-time compilation with register allocation. Getting the AOT-compiled artifacts to dynamically link with runtime routines is a sufficient pain in my mind that I will put it off a bit until later. I also need to figure out a good strategy for truly polymorphic operations like general integer addition; probably involving inline caches.

So that's where we're at :) Thanks for reading, and happy hacking in Guile in 2018!

Andy Wingohttps://wingolog.org/loop optimizations in guilehttps://wingolog.org/2015/07/28/loop-optimizations-in-guile2015-07-28T08:10:36Z2015-07-28T08:10:36Z

Sup peeps. So, after the slog to update Guile's intermediate language, I wanted to land some new optimizations before moving on to the next thing. For years I've been meaning to do some loop optimizations, and I was finally able to land a few of them.

loop peeling

For a long time I have wanted to do "loop peeling". Loop peeling means peeling off the first iteration of a loop. If you have a source program that looks like this:

while foo:
  bar()
  baz()

Loop peeling turns it into this:

if foo:
  bar()
  baz()
  while foo:
    bar()
    baz()

You wouldn't think that this is actually an optimization, would you? Well on its own, it's not. But if you combine it with common subexpression elimination, then it means that the loop body is now dominated by all effects and all loop-invariant expressions that must be evaluated for the expression to loop.

In dynamic languages, this is most useful when one source expression expands to a number of low-level steps. So for example if your language runtime implements top-level variable references in three parts, one where it gets a reference to a mutable box, then it checks if the box has a value, and and the third where it unboxes it, then we would have:

if foo:
  bar_location = lookup("bar")
  bar_value = dereference(bar_location)
  if bar_value is null: throw NotFound("bar")
  call(bar_value)

  baz_location = lookup("baz")
  baz_value = dereference(baz_location)
  if baz_value is null: throw NotFound("baz")
  call(baz_value)

  while foo:
    bar_value = dereference(bar_location)
    call(bar_value)

    baz_value = dereference(baz_location)
    call(baz_value)

The result is that we have hoisted the lookups and null checks out of the loop (if a box can never transition from full back to empty). It's a really powerful transformation that can even hoist things that traditional loop-invariant code motion can't, but more on that later.

Now, the problem with loop peeling is that usually values will escape your loop. For example:

while foo:
  x = qux()
  if x then return x
...

In this little example, there is a value x, and the return x statement is actually not in the loop. It's syntactically in the loop, but the underlying representation that the compiler uses looks more like this:

function qux(k):
  label loop_header():
    fetch(foo) -gt; loop_test
  label loop_test(foo_value):
    if foo_value then -> exit else -> body
  label body():
    fetch(x) -gt; have_x
  label have_x(x_value):
    if x_value then -> return_x else -> loop_header
  label return_x():
    values(x) -> k
  label exit():
    ...

This is the "CPS soup" I described in my last post. Only the bold parts are in the loop; notably, the return is outside the loop. Point being, if we peel off the first iteration, then there are two possible values for x that we would return:

if foo:
  x1 = qux()
  if x1 then return x1
  while foo:
    x2 = qux()
    if x2 then return x2
  ...

I have them marked as x1 and x2. But I've also duplicated the return x terms, which is not what we want. We want to peel off the first iteration, which will cause code growth equal to the size of the loop body, but we don't want to have to duplicate everything that's after the loop. What we have to do is re-introduce a join point that defines x:

if foo:
  x1 = qux()
  if x1 then join(x1)
  while foo:
    x2 = qux()
    if x2 then join(x2)
  ...
label join(x)
  return x

Here I'm playing fast and loose with notation because the real terms are too gnarly. What I'm trying to get across is that for each value that flows out of a loop, you need a join point. That's fine, it's a bit more involved, but what if your loop exits to two different points, but one value is live in both of them? A value can only be defined in one place, in CPS or SSA. You could re-place a whole tree of phi variables, in SSA parlance, with join blocks and such, but it's just too hard.

However we can still get the benefits of peeling in most cases if we restrict ourselves to loops that exit to only one continuation. In that case the live variable set is the intersection of all variables defined in the loop that are live at the exit points. Easy enough, and that's what we have in Guile now. Peeling causes some code growth but the loops are smaller so it should still be a win. Check out the source, if that's your thing.

loop-invariant code motion

Usually when people are interested in moving code out of loops they talk about loop-invariant code motion, or LICM. Contrary to what you might think, LICM is complementary to peeling: some things that peeling+CSE can hoist are not hoistable by LICM, and vice versa.

Unlike peeling, LICM does not cause code growth. Instead, for each expression in a loop, LICM tries to hoist it out of the loop if it can. An expression can be hoisted if all of these conditions are true:

  1. It doesn't cause the creation of an observably new object. In Scheme, the definition of "observable" is quite subtle, so in practice in Guile we don't hoist expressions that can cause any allocation. We could use alias analysis to improve this.

  2. The expression cannot throw an exception, or the expression is always evaluated for every loop iteration.

  3. The expression makes no writes to memory, or if it writes to memory, other expressions in the loop cannot possibly read from that memory. We use effects analysis for this.

  4. The expression makes no reads from memory, or if it reads from memory, no other expression in the loop can clobber those reads. Again, effects analysis.

  5. The expression uses only loop-invariant variables.

This definition is inductive, so once an expression is hoisted, the values it defines are then considered loop-invariant, so you might be able to hoist a whole chain of values.

Compared to loop peeling, this has the gnarly aspect of having to explicitly reason about loop invariance and manually move code, which is a pain. (Really LICM would be better named "artisanal code motion".) However it causes no code growth, which is a plus, though like peeling it can increase register pressure. But the big difference is that LICM can hoist effect-free expressions that aren't always executed. Consider:

while foo:
  x = qux() ? "hi" : "ho"

Here for some reason it could be faster to cache "hi" or "ho" in registers, which is what LICM allows:

hi, ho = "hi", "ho"
while foo:
  x = qux() ? hi : ho

On the other hand, LICM alone can't hoist the if baz is null checks in this example from above:

while foo:
  bar()
  baz()

The issue is that the call to bar() might not return, so the error that might be thrown if baz is null shouldn't be observed until bar is called. In general we can't hoist anything that might throw an exception past some non-hoisted code that might throw an exception. This specific situation happens in Guile but there are similar ones in any language, I think.

More formally, LICM will hoist effectful but loop-invariant expressions that postdominate the loop header, whereas peeling hoists those expressions that dominate all back-edges. I think? We'll go with that. Again, the source.

loop inversion

Loop inversion is a little hack to improve code generation, and again it's a little counterintuitive. If you have this loop:

while n < x:
  n++

Loop inversion turns it into:

if n < x:
  do
    n++
  while n < x

The goal is that instead of generating code that looks like this:

header:
  test n, x;
  branch-if-greater-than-or-equal done;
  x = x + 1
  goto header
done:

You make something that looks like this:

  test n, x;
  branch-if-greater-than-or-equal done;
header:
  x = x + 1
  test n, x;
  branch-if-less-than header;
done:

The upshot is that the loop body now contains one branch instead of two. It's mostly helpful for tight loops.

It turns out that you can express this transformation on CPS (or SSA, or whatever), but that like loop peeling the extra branch introduces an extra join point in your program. If your loop exits to more than one label, then we have the same problems as loop peeling. For this reason Guile restricts loop inversion (which it calls "loop rotation" at the moment; I should probably fix that) to loops with only one exit continuation.

Loop inversion has some other caveats, but probably the biggest one is that in Guile it doesn't actually guarantee that each back-edge is a conditional branch. The reason is that usually a loop has some associated loop variables, and it could be that you need to reshuffle those variables when you jump back to the top. Mostly Guile's compiler manages to avoid shuffling, allowing inversion to have the right effect, but it's not guaranteed. Fixing this is not straightforward, since the shuffling of values is associated with the predecessor of the loop header and not the loop header itself. If instead we reshuffled before the header, that might work, but each back-edge might have a different shuffling to make... anyway. In practice inversion seems to work out fine; I haven't yet seen a case where it doesn't work. Source code here.

loop identification

One final note: what is a loop anyway? Turns out this is a somewhat hard problem, especially once you start trying to identify nested loops. Guile currently does the simple thing and just computes strongly-connected components in a function's flow-graph, and says that a loop is a non-trivial SCC with a single predecessor. That won't tease apart loop nests but oh wells! I spent a lot of time last year or maybe two years ago with that "Loop identification via D-J graphs" paper but in the end simple is best, at least for making incremental steps.

Okeysmokes, until next time, loop on!

Andy Wingohttps://wingolog.org/cps souphttps://wingolog.org/2015/07/27/cps-soup2015-07-27T14:43:00Z2015-07-27T14:43:00Z

Hello internets! This blog goes out to my long-time readers who have followed my saga hacking on Guile's compiler. For the rest of you, a little history, then the new thing.

In the olden days, Guile had no compiler, just an interpreter written in C. Around 8 years ago now, we ported Guile to compile to bytecode. That bytecode is what is currently deployed as Guile 2.0. For many reasons we wanted to upgrade our compiler and virtual machine for Guile 2.2, and the result of that was a new continuation-passing-style compiler for Guile. Check that link for all the backstory.

So, I was going to finish documenting this intermediate language about 5 months ago, in preparation for making the first Guile 2.2 prereleases. But something about it made me really unhappy. You can catch some foreshadowing of this in my article from last August on common subexpression elimination; I'll just quote a paragraph here:

In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

This is exactly the same concern that Matthew Fluet and Stephen Weeks had back in 2003:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.

As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y (i.e. without putting y in the scope of x)

[MLton-devel] CPS vs SSA

To be honest I think all this talk about dominators is distracting. Dominators are but a lightweight flow analysis, and I usually find myself using full-on flow analysis to compute the set of optimizations that I can do on a piece of code. In fact the only use I had for dominators in the nested CPS language was to rewrite scope trees! The salient part of Weeks' observation is that nested scope trees are the problem, not that dominators are the solution.

So, after literally years of hemming and hawing about this, I finally decided to remove nested scope trees from Guile's CPS intermediate language. Instead, a function is now a collection of labelled continuations, with one distinguished entry continuation. There is no more $letk term to nest continuations in each other. A program is now represented as a "soup" -- basically a map from labels to continuation bodies, again with a distinguished entry. As an example, consider this expression:

function(x):
  return add(x, 1)

If we rewrote it in continuation-passing style, we'd give the function a name for its "tail continuation", ktail, and annotate each expression with its continuation:

function(ktail, x):
  add(x, 1) -> ktail

Here the -> ktail means that the add expression passes its values to the continuation ktail.

With nested CPS, it could look like:

function(ktail, x):
  letk have_one(one): add(x, one) -> ktail
    load_constant(1) -> have_one

Here the label have_one is in a scope, as is the value one. With "CPS soup", though, it looks more like this:

function(ktail, x):
  label have_one(one): add(x, one) -> ktail
  label main(x): load_constant(1) -> have_one

It's a subtle change, but it took a few months to make so it's worth pointing out what's going on. The difference is that there is no scope tree for labels or variables any more. A variable can be used at a label if it flows to the label, in a flow analysis sense. Indeed, determining the set of variables that can be used at a label requires flow analysis; that's what Weeks was getting at in his 2003 mail about the advantages of SSA, which are really the advantages of an intermediate language without nested scope trees.

The question arises, though, now that we've decided on CPS soup, how should we represent a program as a value? We've gone from a nested term to a graph term, and we need to find a way to represent it somehow that facilitates looking up labels by name, and facilitates tree rewrites.

In Guile's IR, labels and variables are both integers, so happily enough, we have such a data structure: Clojure-style maps specialized for integer keys.

Friends, if there has been one realization or revolution for me in the last year, it has been Clojure-style data structures. Here's why. In compilers, I often have to build up some kind of analysis, then use that analysis to transform data. Often I need to keep the old term around while I build a new one, but it would be nice to share state between old and new terms. With a nested tree, if a leaf changed you'd have to rebuild all surrounding terms, which is gnarly. But with Clojure-style data structures, more and more I find myself computing in terms of values: build up this value, transform this map to that set, fold over this map -- and yes, you can fold over Guile's intmaps -- and so on. By providing an expressive data structure for which I can control performance characteristics by using transients if needed, these data structures make my programs more about data and less about gnarly machinery.

As a concrete example, the old contification pass in Guile, I didn't have the mental capacity to understand all the moving parts in such a way that I could compute an optimal contification from the beginning; instead I had to iterate to a fixed point, as Kennedy did in his "Compiling with Continuations, Continued" paper. With the new CPS soup language and with Clojure-style data structures, I could actually fit more of the algorithm into my head, with the result that Guile now contifies optimally while avoiding the fixed-point transformation. Also, the old pass used hash tables to represent the analysis, which I found incredibly confusing to reason about -- I totally buy Rich Hickey's argument that place-oriented programming is the source of many evils in programs, and hash tables are nothing if not a place party. Using functional maps let me solve harder problems because they are easier for me to reason about.

Contification isn't an isolated case, either. For example, we are able to do the complete set of optimizations from the "Optimizing closures in O(0) time" paper, including closure sharing, which I think makes Guile unique besides Chez Scheme. I wasn't capable of doing it on the old representation because it was just too hard for me to think about, because my data structures weren't right.

This new "CPS soup" language is still a first-order CPS language in that each term specifies its continuation, and that variable names appear in the continuation of a definition, not the definition itself. This effectively makes every variable a phi variable, in the sense of SSA, and you have to do some work to get to a variable's definition. It could be that still this isn't the right number of names; consider this function:

function foo(k, x):
  label have_y(y) bar(y) -> k
  label y_is_two() load_constant(2) -> have_y
  label y_is_one() load_constant(1) -> have_y
  label main(x) if x -> y_is_one else -> y_is_two

Here there is no distinguished name for the value load_constant(1) versus load_constant(2): both are possible values for y. If we ended up giving them names, we'd have to reintroduce actual phi variables for the joins, which would basically complete the transformation to SSA. Until now though I haven't wanted those names, so perhaps I can put this off. On the other hand, every term has a label, which simplifies many things compared to having to contain terms in basic blocks, as is usually done in SSA. Yet another chapter in CPS is SSA is CPS is SSA, it seems.

Welp, that's all the nerdery for right now. Talk at yall later!

Andy Wingohttps://wingolog.org/revisiting common subexpression elimination in guilehttps://wingolog.org/2014/08/25/revisiting-common-subexpression-elimination-in-guile2014-08-25T09:48:48Z2014-08-25T09:48:48Z

A couple years ago I wrote about a common subexpression pass that I implemented in Guile 2.0.

To recap, Guile 2.0 has a global, interprocedural common subexpression elimination (CSE) pass.

In the context of compiler optimizations, "global" means that it works across basic block boundaries. Basic blocks are simple, linear segments of code without control-flow joins or branches. Working only within basic blocks is called "local". Working across basic blocks requires some form of understanding of how values can flow within the blocks, for example flow analysis.

"Interprocedural" means that Guile 2.0's CSE operates across closure boundaries. Guile 2.0's CSE is "context-insensitive", in the sense that any possible effect of a function is considered to occur at all call sites; there are newer CSE passes in the literature that separate effects of different call sites ("context-sensitive"), but that's not a Guile 2.0 thing. Being interprocedural was necessary for Guile 2.0, as its intermediate language could not represent (e.g.) loops directly.

The conclusion of my previous article was that although CSE could do cool things, in Guile 2.0 it was ultimately limited by the language that it operated on. Because the Tree-IL direct-style intermediate language didn't define order of evaluation, didn't give names to intermediate values, didn't have a way of explicitly representing loops and other kinds of first-order control flow, and couldn't precisely specify effects, the results, well, could have been better.

I know you all have been waiting for the last 27 months for an update, probably forgoing meaningful social interaction in the meantime because what if I posted a followup while you were gone? Be at ease, fictitious readers, because that day has finally come.

CSE over CPS

The upcoming Guile 2.2 has a more expressive language for the optimizer to work on, called continuation-passing style (CPS). CPS explicitly names all intermediate values and control-flow points, and can integrate nested functions into first-order control-flow via "contification". At the same time, the Guile 2.2 virtual machine no longer penalizes named values, which was another weak point of CSE in Guile 2.0. Additionally, the CPS intermediate language enables more fined-grained effects analysis.

All of these points mean that CSE has the possibility to work better in Guile 2.2 than in Guile 2.0, and indeed it does. The shape of the algorithm is a bit different, though, and I thought some compiler nerds might be interested in the details. I'll follow up in the next section with some things that new CSE pass can do that the old one couldn't.

So, by way of comparison, the old CSE pass was a once-through depth-first visit of the nested expression tree. As the visit proceeded, the pass built up an "environment" of available expressions -- for example, that (car a) was evaluated and bound to b, and so on. This environment could be consulted to see if a expression was already present in the environment. If so, the environment would be traversed from most-recently-added to the found expression, to see if any intervening expression invalidated the result. Control-flow joins would cause recomputation of the environment, so that it only held valid values.

This simple strategy works for nested expressions without complex control-flow. CPS, on the other hand, can have loops and other control flow that Tree-IL cannot express, so for it to build up a set of "available expressions" requires a full-on flow analysis. So that's what the pass does: a flow analysis over the labelled expressions in a function to compute the set of "available expressions" for each label. A labelled expression a is available at label b if a dominates b, and no intervening expression could have invalidated the results. An expression invalidates a result if it may write to a memory location that the result may have read. The code, such as it is, may be found here.

Once you have the set of available expressions for a function, you can proceed to the elimination phase. First, you start by creating an "eliminated variable" map, which initially maps each variable to itself, and an "equivalent expressions" table, which maps "keys" to a set of labels and bound variables. Then you visit each expression in a function, again in topologically sorted order. For each expression, you compute a "key", which is some unique representation of an expression that can be compared by structural equality. Keys that compare as equal are equivalent, and are subject to elimination.

For example, consider a call to the add primitive with variables labelled b and c as arguments. Imagine that b maps to a in the eliminated variable table. The expression as a whole would then have a key representation as the list (primcall add a c). If this key is present in the equivalent expression table, you check to see if any of the equivalent labels is available at the current label. If so, hurrah! You mark the outputs of the current label as being replaced by the outputs of the equivalent label. Otherwise you add the key to the equivalent table, associated with the current label.

This simple algorithm is enough to recursively eliminate common subexpressions. Sometimes the recursive aspect (i.e. noticing that b should be replaced by a), along with the creation of a common key, causes the technique to be called global value numbering (GVN), but CSE seems a better name to me.

The algorithm as outlined above eliminates expressions that bind values. However not all expressions do that; some are used as control-flow branches. For this reason, Guile also computes a "truthy table" with another flow analysis pass. This table computes a set of which branches have been taken to get to each program point. In the elimination phase, if a branch is reached that is equivalent to a previously taken branch, we consult the truthy table to see which continuation the previous branch may have taken. If it can be proven to have taken just one of the legs, the test is elided and replaced with a direct jump.

A few things to note before moving on. First, the "compute an analysis, then transform the function" sequence is quite common in this sort of problem. It leads to some challenges regarding space for the analysis; my last article deals with these in more detail.

Secondly, the rewriting phase assumes that a value that is available may be substituted, and that the result would be a proper CPS term. This isn't always the case; see the discussion at the end of the article on CSE in Guile 2.0 about CPS, SSA, dominators, and scope. In essence, the scope tree doesn't necessarily reflect the dominator tree, so not all transformations you might like to make are syntactically valid. In Guile 2.2's CSE pass, we work around the issue by concurrently rewriting the scope tree to reflect the dominator tree. It's something I am seeing more and more and it gives me some pause as to the suitability of CPS as an intermediate language.

Also, consider the clobbering part of analysis, where e.g. an expression that writes a value to memory has to invalidate previously read values. Currently this is implemented by traversing all available expressions. This is suboptimal and could be quadratic in the end. A better solution is to compute a dependency graph for expressions, which links together operations on the same regions of memory; see LLVM's memory dependency analysis for an idea of how to do this.

Finally, note that this algorithm is global but intraprocedural, meaning that it doesn't propagate values across closure boundaries. It's possible to extend it to be interprocedural, though it's less necessary in the presence of contification.

scalar replacement via fabricated expressions

Let's say you get to an expression at label L, (cons a b). It binds a result c. You determine you haven't seen it before, so you add (primcall cons a b) → L, c to your equivalent expressions set. Cool. We won't be able to replace a future instance of (cons a b) with c, because that doesn't preserve object identity of the newly allocated memory, but it's definitely a cool fact, yo.

What if we add an additional mapping to the table, (car c) → L, a? That way any expression at which L is available would replace (car c) with a, which would be pretty neat. To do so, you would have to add the &read effect to the cons call's effects analysis, but since the cons wasn't really up for elimination anyway it's all good.

Similarly, for (set-car! c d) we can add a mapping of (car c) → d. Again we have to add the &read effect to the set-car, but that's OK too because the write invalidated previous reads anyway.

The same sort of transformation holds for other kinds of memory that Guile knows how to allocate and mutate. Taken together, they form a sort of store-to-load forwarding and scalar replacement that can entirely eliminate certain allocations, and many accesses as well. To actually eliminate the allocations requires a bit more work, but that will be the subject of the next article.

future work

So, that's CSE in Guile 2.0. It works pretty well. In the future I think it's probably worth considering an abstract heap-style analysis of effects; in the end, the precision of CSE is limited to how precisely we can model the effects of expressions.

The trick of using CSE to implement scalar replacement is something I haven't seen elsewhere, though I doubt that it is novel. To fully remove the intermediate allocations needs a couple more tricks, which I will write about in my next nargy dispatch. Until then, happy hacking!

Andy Wingohttps://wingolog.org/flow analysis in guilehttps://wingolog.org/2014/07/01/flow-analysis-in-guile2014-07-01T08:00:47Z2014-07-01T08:00:47Z

Greets, and welcome back to the solipsism! I've been wandering the wilderness with my Guile hackings lately, but I'm finally ready to come back to civilization. Hopefully you will enjoy my harvest of forest fruit.

Today's article is about flow analysis and data structures. Ready? Let's rock!

flow analysis

Many things that a compiler would like to know can be phrased as a question of the form, "What do I know about the data flowing through this particular program point?" Some things you might want to know are:

  1. The set of variables that must be live.

  2. The set of variables that happen to be live. This is the same as (1) except it includes variables that aren't needed but haven't been clobbered by anything.

  3. The set of expressions whose results are still valid (i.e., haven't been clobbered by anything else).

  4. An upper and lower bound on the range of numeric variables.

Et cetera. I'll talk about specific instances of flow analysis problems in the future, but today's article is a bit more general.

The first thing to note about these questions is that they don't necessarily need or have unique answers. If GCC decides that it can't prove anything about the ranges of integers in your program, it's not the end of the world -- it just won't be able to do some optimizations that it would like to do.

At the same time, there are answers that are better and worse than others, and answers that are just invalid. Consider a function of the form:

int f():
  int a = 1
  int b = 2
  int c = a + b
  int d = b + c
  ...
  int z = x + y
  return z

In this function, there are 27 different expressions, including the return, and 27 different program points. (You can think of a program point as a labelled sub-expression. In this example none of the expressions have sub-expressions.) If we number the program points in order from 0 to 26, we will have a program that first executes expression 0 (int a = 1), then 1, and so on to the end.

Let's plot some possible solutions to the live variable flow-analysis problem for this program.

Here we see two solutions to the problem (in light and dark blue), along with a space of invalid solutions (in red). The Y axis corresponds to the variables of the program, starting with a on the bottom and finishing with z on the top.

For example, consider position 4 in the program, corresponding to int e = c + d. It is marked in the graph with a vertical bar. After position 4, the only values that are used in the rest of the program are d and e. These are the variables that are contained within the light-blue area. It wouldn't be invalid to consider a, b, and c to be live also, but it also wouldn't be as efficient to allocate space and reason about values that won't contribute to the answer. The dark blue space holds those values that may harmlessly be considered to be live, but which actually aren't live.

It would, however, be invalid to consider the variable f to be live after position 4, because it hasn't been defined yet. This area of the variable space is represented in red on the graph.

Of course, the space of all possible solutions isn't possible to represent nicely on a two-dimensional graph; we're only able to show two with colors, and that not very well as they overlap. This difficulty cuts close to the heart of the data-flow problem: that it ultimately requires computing a two-dimensional answer, which necessarily takes time and space O(n2) in program size.

Or does it?

classical flow analysis frameworks

The classical way to do flow analysis is to iterate a set of data-flow equations over an finite lattice until you reach a fixed point.

That's a pithy sentence that deserves some unpacking. If you're already comfortable with what it means, you can skip a couple sections.

Still here? Cool, me too. Let's take a simple example of sign analysis. The problem is to determine, for the integer variables of a program, at every point in the program, which ones may be negative (-), which ones may be zero (0), and which may be positive (+). All of these are conceptually bit-flags.

For example, in this program:

int f(int x):
 L0:  while (x >= 0)
 L1:    int y = x - 1
 L2:    x = y
 L3:  return x

We can assign the flags -0+ to the argument x as the initial state that flows into L0, because we don't know what it is ahead of time, and it is the only variable in scope. We start by representing the initial state of the solution as a set of sets of state values:

state := {L0: {x: -0+}, L1: Ø, L2: Ø, L3: Ø}

In this notation, Ø indicates a program point that hasn't been visited yet.

Now we iterate through all labels in the program, propagating state to their successors. Here is where the specific problem being solved "hooks in" to the generic classical flow analysis framework: before propagating to a successor, a flow equation transforms the state that flows into a program point to a state that flows out, to the particular successor. In this case we could imagine equations like this:

visit_test(expr, in, true_successor, false_successor):
  if expr matches "if var >= 0":
    # On the true branch, var is not negative.
    propagate(in + {var: in[var] - -}, true_successor)
    # On the false branch, var is not zero and not positive.
    propagate(in + {var: in[var] - 0+}, false_successor)
  else if ...

visit_expr(expr, in, successor):
  if expr matches "left = right - 1":
    if in[right] has +:
      if in[right] has 0:
        # Subtracting one from a non-negative arg may be negative.
        propagate(in + {left: in[right] + -}, successor)
      else
        # Subtracting one from a positive arg may be 0.
        propagate(in + {left: in[right] + 0}, successor)
    else:
      # Subtracting one from a nonpositive arg will be negative.
      propagate(in + {left: -}, successor)
  else if expr matches "left = right":
    propagate(in + {left: in[right]}, successor)
  ...

The meat of classical data-flow analysis is the meet operation:

propagate(out, successor):
  if state[successor] is Ø:
    state[successor] = out
  else
    state[successor] = meet(out, state[successor]):

# A version of meet for sign analysis
meet(out, in):
  return intersect_vars_and_union_values(out, in)

Let's run this algorithm by hand over the example program. Starting from the initial state, we propagate the L0→L1 and L0→L3 edges:

visit_test("if x <= 0", {x: -0+}, L1, L3)
→ propagate({x: 0+}, L1)
→ state[L1] = {x: 0+}
→ propagate({x: -}, L3)
→ state[L3] = {x: -}

Neat. Let's keep going. The successor of L1 is L2:

visit_expr("y = x - 1", {x: 0+}, L2)
→ propagate({x: 0+, y: -0+}, L2)
→ state[L2] = {x: 0+, y: -0+}

L2→L0 is a back-edge, returning to the top of the loop:

visit_expr("x = y", {x: 0+, y: -0+}, L0)
→ propagate({x: -0+, y: -0+}, L0)
→ state[L0] = meet({x: -0+, y: -0+}, state[L0])
→ state[L0] = meet({x: -0+, y: -0+}, {x: -0+})
→ state[L0] = {x: 0+}

Finally, L3 has no successors, so we're done with this iteration. The final state is:

{L0: {x: -0+},
 L1: {x: 0+},
 L2: {x: 0+, y: -0+},
 L3: {x: -}}

which indeed corresponds with what we would know intuitively.

fixed points and lattices

Each of the steps in our example flow analysis was deterministic: the result was calculated from the inputs and nothing else. However the backwards branch in the loop, L2→L0, could have changed inputs that were used by the previous L0→L1 and L0→L3 forward edges. So what we really should do is iterate the calculation to a fixed point: start it over again, and run it until the state doesn't change any more.

It's easy to see in this case that running it again won't end up modifying the state. But do we know that in all cases? How do we know that iteration would terminate at all? It turns out that a few simple conditions are sufficient.

The first thing to ensure is that state space being explored is finite. Here we can see this is the case, because there are only so many ways you can combine -, 0, and +. Each one may be present or not, and so we have 2n = 23 = 8 possible states. The elements of the state array will be a set with at most one entry for each variable, so the whole state space is finite. That at least ensures that an answer exists.

Next, the "meet" operation has to be commutative, associative, and idempotent. The above example used intersect_vars_and_union_values. We intersect vars because it only makes sense to talk about a variable at a program point if the variable dominates the program point. It didn't make sense to propagate y on the L2→L0 branch, for example. It's usually a good idea to model a data-flow problem using sets, as set union and intersection operations fulfill these commutative, associative, and distributive requirements.

Finally, the state being modelled should have a partial order, and functions that add information along control-flow edges -- above, visit_test and visit_expr -- should preserve this partial ordering. That is to say, visit_test and visit_expr should be monotonic. This means that no matter on what control paths data propagates, we keep building towards an answer with more information, making forward progress. This condition is also easily fulfilled with sets, or more generally with any lattice. (A lattice is nothing more than a data type that fulfills these conditions.)

Iterating the data-flow equations until the state stops changing will find a fixed point of the lattice. Whether you find the greatest or least fixed point is another question; I can't help linking to Paul Khuong's old article on Québécois student union strikes for a lovely discussion.

Another question is, how many iterations are necessary to reach a fixed point? I would first note that although in our walk-through we iterated in forward order (L0, L1, L2, L3), we could have visited nodes in any order and the answer would be the same. I'll cut to the chase and say that if:

  1. you represent your state with bitvectors

  2. the control-flow graph is reducible (has only natural loops)

  3. the meet operation on values is bitvector union or intersection

  4. you visit the program points in topologically sorted order

If these conditions are fulfilled, then you will reach a fixed point after LC + 2 iterations, where LC is the "loop-connectness number" of your graph. You can ensure (1), (3), and (4) by construction. (Reverse post-order numbering is an easy way to fulfill (4).) (2) can be ensured by using programming languages without goto (a for loop is always a natural loop) but can be violated by optimizing compilers (for example, via contification).

Loop connectedness is roughly equivalent to the maximum nesting level of loops in the program, which has experimentally been determined to rarely exceed 3. Therefore in practice, data-flow analysis requires a number of steps that is O(n * 5) = O(n) in program size.

For more information on data-flow analysis, including full proofs and references, see Carl Offner's excellent, excellent manuscript "Notes on Graph Algorithms used in Optimizing Compilers". I don't know of any better free resource than that. Thanks, Carl!

an aside: the kCFA algorithms

I just finished describing what I called "classical" data-flow analysis. By that I mean to say that people have been doing it since the 1970s, which is classical enough as far as our industry goes. However with the rise of functional languages in the 1980s, it became unclear how to apply classical data-flow analysis on a language like Scheme. Let's hear it from the horse's mouth:

This brings us to the summer of 1984. The mission was to build the world's most highly-optimising Scheme compiler. We wanted to compete with C and Fortran. The new system was T3, and the compiler was to be called Orbit. We all arrived at WRL and split up responsibility for the compiler. Norman was going to do the assembler. Philbin was going to handle the runtime (as I recall). Jonathan was project leader and (I think) wrote the linker. Kranz was to do the back end. Kelsey, the front end. I had passed the previous semester at CMU becoming an expert on data-flow analysis, a topic on which I completely grooved. All hot compilers do DFA. It is necessary for all the really cool optimisations, like loop-invariant hoisting, global register allocation, global common subexpression elimination, copy propagation, induction-variable elimination. I knew that no Scheme or Lisp compiler had ever provided these hot optimisations. I was burning to make it happen. I had been writing 3D graphics code in T, and really wanted my floating-point matrix multiplies to get the full suite of DFA optimisation. Build a DFA module for T, and we would certainly distinguish ourselves from the pack. So when we divided up the compiler, I told everyone else to back off and loudly claimed DFA for my own. Fine, everyone said. You do the DFA module. Lamping signed up to do it with me.

Lamping and I spent the rest of the summer failing. Taking trips to the Stanford library to look up papers. Hashing things out on white boards. Staring into space. Writing little bits of experimental code. Failing. Finding out *why* no one had ever provided DFA optimisation for Scheme. In short, the fundamental item the classical data-flow analysis algorithms need to operate is not available in a Scheme program. It was really depressing. I was making more money than I'd ever made in my life ($600/week). I was working with *great* guys on a cool project. I had never been to California before, so I was discovering San Francisco, my favorite city in the US and second-favorite city in the world. Silicon Valley in 1984 was beautiful, not like the crowded strip-mall/highway hell hole it is today. Every day was perfect and beautiful when I biked into work. I got involved with a gorgeous redhead. And every day, I went in to WRL, failed for 8 hours, then went home.

It was not a good summer.

At the end of the summer, I slunk back to CMU with my tail between my legs, having contributed not one line of code to Orbit.

Olin Shivers, A history of T

It took him another 7 years, but Shivers stuck with it, and in the end came out with the family of algorithms known as k-CFA. Instead of focusing on loops, which Scheme doesn't have syntactically, Shivers used continuation-passing style to ruthlessly simplify Scheme into a dialect consisting of not much more than function calls, and focused his attention on function calls. The resulting family of flow algorithms can solve flow equations even in the presence of higher-order functions -- a contribution to computer science born out of necessity, failure, and stubbornness.

With all those words, you'd think that I'd be itching to use k-CFA in Guile, and I'm not. Unfortunately even the simplest, least expressive version (0-CFA) is O(n2); 1-CFA is exponential. I don't have time for that. Instead, Guile is able to use classical DFA because it syntactically distinguishes labelled continuations and functions, and contifies functions to continuations where possible, which makes the Scheme DFA problem exactly the same as in any other language.

n times what?

Now that we have established that the number of visit operations is O(n), it remains to be seen what the individual complexity of a visit operation is in order to determine the total complexity. The naïve thing is just to use bitvectors, with each of the bitvectors having as many entries as the program has variables, times however many bits we are using.

This leads to O(|L|*|V|) space and time complexity, where |L| is the number of program points (labels) and |V| is the number of variables. As the number of variables is generally proportional to the size of program, we can approximate this as O(n2).

In practice, this means that we can use data-flow analysis to programs up to about 10000 labels in size. Sign analysis on a 10000-label function would require 100002*3/8 = 37.5 MB of memory, which is already a bit hefty. It gets worse if you need to represent more information. I was recently doing some flow-sensitive type and range inference, storing 12 bytes per variable per program point; for a 10000-label function, that's more than a gigabyte of memory. Badness.

shared tails

Although it was the type inference case that motivated this investigation, sign inference is similar and more simple so let's go with that. The visit_expr and visit_test functions above are only ever going to add additional information about the variables that are used in or defined by an expression; in practice this is a small finite number. What if we chose a representation of state that could exploit this fact by only adding O(1) amounts of data, sharing a common tail with preceding expressions?

If we draw a control-flow graph for the sign analysis program, we get something like:

The goal is to create a data structure that looks like the dominator tree. For "normal" control-flow edges -- those whose destination only have one predecessor -- we can avoid the "meet" operations, and just copy the predecessor's out set to the successor's in set. We then define "meet" as an adjoin operation that effectively conses the new information onto a shared tail, if it wasn't there already. The first iteration through the CFG will initialize the shared tail of a given control-flow join to the set of variables flowing into the join's dominator. Subsequent information will adjoin (cons) on new incoming values. In this case the resulting data structure ends up looking like:

Here the italic references like L1 indicate shared structure, and the tuples annotating the edges represent additional information flow, beyond that information that was already present in the successor's dominator.

Of course, you can implement this with linked lists and it will work fine. The problem there will be lookup speed -- when your visit operation (visit_expr or visit_test) goes to look up the sign of a variable, or the same happens via the meet operation, you get O(n) lookup penalties. Anticipating this, I implemented this with a version of Phil Bagwell's vhashes, which promise O(log n) variable lookup. See Guile's documentation, or Bagwell's excellent paper.

Note that you can't remove items from sets once they have been added in a shared-tail flow analysis; to keep the meet function monotonic, you have to instead insert tombstone entries. Not so nice, but it is what it is.

A shared-tail flow analysis consumes only O(1) additional memory per node, leading to O(n) space complexity. I have some measured space and time graphs below that show this experimentally as well.

space and time

Unfortunately, lookup time on branchy vhashes is really terrible: O(log n) in the best case, and O(n) at worst. This is especially acute because there is no easy way to do unions or intersections on vhashes -- you end up having to compute the unshared heads of the two vhashes you are merging, and looking up elements in one in the other... I could go on, but I think you believe me when I say it gets complicated and slow. It's possible to beat a bitvector approach in time for relatively "big" problems like type analysis, but for common subexpression elimination where I was just storing a bit per expression, it was tough to beat the speed of bitvectors.

I started looking for another solution, and in the end came on a compromise that I am much happier with, and again it's Phil Bagwell to the rescue. Instead of relying on vhashes that explicitly share state, I use Clojure-style persistent sparse bit-sets and bit-maps that share state opportunistically.

Guile's intset module implements a bitvector as a functional tree whose branches are vectors and whose leaves are fixnums. Each leaf represents one range of 32 integers, and each branch on top of it increases the range by a factor of 8. Branches can be sparse, so not all integers in the range of an intset need leaves.

As you would expect, adjoining an element onto such a tree is O(log n). Intersecting is much faster than vhashes though, as intsets partition the key space into power-of-two blocks. Intsets try hard to share state, so that if your adjoin would return the same value, the result is the same object, at the same address. This allows sub-trees to be compared for equality via pointer comparison, which is a great fast-path for intersection and union.

Likewise, Guile's new intmap module allow the association of larger values with integer keys.

science! fetch your white coats and lab books!

I had the chance to actually test the system with all three of these data structures, so I compiled one of Guile's bigger files and recorded the memory used and time taken when solving a 1-bit flow analysis problem. This file has around 600 functions, many of them small nested functions, many of them macro-generated, some of them quite loopy, and one big loopless one (6000 labels) to do the initialization.

First, a plot of how many bytes are consumed per label during while solving this 1-bit DFA.

Note that the X axis is on a log scale.

The first thing that pops out at me from these graphs is that the per-label overhead vhash sizes are indeed constant. This is a somewhat surprising result for me; I thought that iterated convergence would make this overhead depend on the size of the program being compiled.

Secondly we see that the bitvector approach, while quadratic in overall program size, is still smaller until we get to about 1000 labels. It's hard to beat the constant factor for packed bitvectors! Note that I restricted the Y range, and the sizes for the bitvector approach are off the charts for N > 1024.

The intset size is, as we expected, asymptotically worse than vhashes, but overall not bad. It stays on the chart at least. Surprisingly, intsets are often better than vhashes for small functions, where we can avoid allocating branches at all -- note the "shelves" in the intset memory usage, at 32 and 256 entries, respectively, corresponding to the sizes that require additional levels in the tree. Things keep on rising with n, but sublinearly (again, recall that the X axis is on a log scale).

Next, a plot of how many nanoseconds it takes per label to solve the DFA equation above.

Here we see, as expected, intsets roundly beating vhashes for all n greater than about 200 or so, and show sublinear dependence on program size.

The good results for vhashes for the largest program are because the largest program in this file doesn't have any loops, and hardly any branching either. It's the best case for vhashes: all appends and no branches. Unfortunately, it's not the normal case.

It's not quite fair to compare intsets to bitvectors, as Guile's bitvectors are implemented in C and intsets are implemented in Scheme, which runs on a bytecode VM right now. But still, the results aren't bad, with intsets even managing to beat bitvectors for the biggest function. The gains there probably pay for the earlier losses.

This is a good result, considering that the goal was to reduce the space complexity of the algorithm. The 1-bit case is also the hardest case; when the state size grows, as in type inference, the gains of using structure-sharing trees grow accordingly.

conclusion

Let's wrap up this word-slog. A few things to note.

Before all this DFA work in Guile, I had very little appreciation of the dangers of N-squared complexity. I mean, sometimes I had to to think about it, but not often, expecially if your constant factors are low, or so I thought. But I got burned by it; hopefully the next time, if any, will be a long time coming.

I was happily, pleasantly surprised at the expressiveness and power of Bagwell/Clojure-style persistent data structures when applied to the kinds of problems that I work on. Space-sharing can make a fundamental difference to the characteristics of an algorithm, and Bagwell's data structures can do that well. Intsets simplified my implementations because I didn't have to reason much about space-sharing on my own -- finding the right shared tail for vhashes is, as I said, an unmitigated mess.

Finally I would close by saying that I was happy to fail in such interesting (to me) ways. It has been a pleasant investigation and I hope I have been able to convey some of the feeling of it. If you want to see what the resulting algorithm looks like in practice, see compute-truthy-expressions.

Until next time, happy hacking!

Andy Wingohttps://wingolog.org/effects analysis in guilehttps://wingolog.org/2014/05/18/effects-analysis-in-guile2014-05-18T19:19:29Z2014-05-18T19:19:29Z

OK kids, so I had a bit of time recently. I've been hacking on Guile's new CPS-based compiler, which should appear in a stable release in a few months. I have a few things to write about, but today's article is on effects analysis.

what it is, yo

The job of the optimization phase of a compiler is to remove, replace and reorder the sub-expressions of a program. The optimizer recognizes ways in which the program can be better, and then needs to check if those transformations are valid. A transformation is valid if a program exhibits the same behavior both before and after the transformation.

Effects analysis is a proof technique that builds a conservative model of how expressions can affect each other. It can be used to prove the validity of some transformations. For example, if we determine that an expression A reads the first field in a vector, and expression B sets the second field in a vector, then we can use effects analysis to show that the expressions don't affect each others' value and can be freely reordered, for example to hoist either one out of a loop.

In effects analysis, we model the program state coarsely and conservatively with a limited set of categories. The precise set of categories depends on the domain. In graphics, for example, you might have a state bit for the current coordinate system, the current color, the current fill mode, etc. In general-purpose compilation, the effects are mostly about memory and (sometimes) exceptions.

In Guile we model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

For the purposes of Guile, a "type check" is the possibility that this expression will throw an exception if its arguments are somehow out of range. For example, cons will never throw an exception, except in out-of-memory situations, but + may throw an exception if its arguments aren't numbers. However if an expression is dominated by a copy of itself -- that is, if we see that (+ a b) (which may throw an exception) is dominated by (+ a b), perhaps after CSE -- then we can determine that only the first will exhibit the type-check effects, and that the second will not.

Allocation indicates that an expression allocates a fresh object. In Scheme (and many other languages), each allocated object has its own identity: (eq? (cons 1 2) (cons 1 2)) must be false. Note that this restriction does not apply to constant literals, in Scheme; (eq? '(1 . 2) '(1 . 2)) may or may not be true. In Guile both results are possible. They are the same object when compiled (and thus deduplicated), but different when interpreted; the objects are just the ones returned from `read' and this are different. Anyway we use this allocation bit to indicate that an expression allocates a fresh object with a fresh identity.

The remaining effect kinds are "read" and "write", which indicate reads or writes to memory. So there are just 4 kinds of effects.

Allocation, read, and write effects are associated at run-time with particular memory locations. At compile-time we cannot in general know which of these locations are distinct, and which are actually the same. To simplify the problem, we simply record the type of the object, knowing that (say) a pair and a vector will never be at the same location. We also record the field in the object in the case of objects with multiple fields. There are special catch-all values to indicate "all memory kinds", when we really don't know what an expression will do (which is the case for all expression kinds without specific support in the effect analyzer), and for "all fields" when we don't know which field we are accessing.

One example of the use of this analysis is in common subexpression elimination (CSE). If at a program point P we have a set of available expressions A, then to determine which members of A are still available after the expression at P, you subtract members of A that are clobbered by P. Computation of A at each P plus value numbering is most of CSE; but more on that in some later dispatch. Anyway here's the definition of effects-clobber?.

(define (effect-clobbers? a b)
  "Return true if A clobbers B.  This is the case
if A might be a write, and B might be a read or a
write to the same location as A."
  (define (locations-same?)
    (let ((a (ash a (- &effect-kind-bits)))
          (b (ash b (- &effect-kind-bits))))
      (or (eqv? &unknown-memory-kinds
                (logand a &memory-kind-mask))
          (eqv? &unknown-memory-kinds
               (logand b &memory-kind-mask))
          (and (eqv? (logand a &memory-kind-mask)
                     (logand b &memory-kind-mask))
               ;; A negative field indicates "the
               ;; whole object".  Non-negative fields
               ;; indicate only part of the object.
               (or (< a 0) (< b 0) (= a b))))))
  (and (not (zero? (logand a &write)))
       (not (zero? (logand b (logior &read &write))))
       (locations-same?)))

This analysis is simple, small, and fast. It's also coarse and imprecise -- if you are reading from and writing to two vectors at once, you're almost sure to miss some optimization opportunities as accesses to all vectors are conflated into one bit. Oh well. If you get into this situation, you'll know it, and be able to invest a bit more time into alias analysis; there's lots of literature out there. A simple extension would be to have alias analysis create another mapping from expression to equivalence class, and to use those equivalence classes in the same-location? check above.

Of course this assumes that expressions just access one object. This is the case for Guile's CPS intermediate language, because in CPS, as in SSA or ANF, expressions don't have subexpressions.

This contrasts with direct-style intermediate languages, in which expressions may have nested subexpressions. If you are doing effects analysis on such a language, it's probably more appropriate to allocate a bit to each kind of effect on each kind of object, so that you can usefully union effects for a tree of expressions. Since you don't have to do this for CPS, we can allocate a fixed bit-budget towards more precision as to which field of an object is being accessed. The inability to be precise as to which field was being accessed due to the direct-style IL was one of the problems in Guile's old CSE pass.

Finally, a note about type checks. Guile includes type checks as part of the effects analysis for two reasons. The first is the obvious case of asking whether an expression is effect-free or not, which can lead to some optimizations in other parts of the compiler. The other is to express the potential for elision of duplicate expressions, if one dominates the other. But it's also possible to remove type checks in more cases than that: one can run a type inference pass to remove type-check effects if we can prove that the arguments to an expression are in range. Obviously this is more profitable for dynamically-typed languages, but the same considerations apply to any language with sum types.

Guile's effects analysis pass is here. V8 seems to have two effects analysis passes; one is in effects.h and typing.cc, and operates over the direct-style AST, and the other is in the value numbering pass (hydrogen-instructions.h and hydrogen-gvn.h; search for GVNFlag).

More on how effects analysis is used in Guile in a future missive. Until then, happy hacking.

Andy Wingohttps://wingolog.org/a continuation-passing style intermediate language for guilehttps://wingolog.org/2014/01/12/a-continuation-passing-style-intermediate-language-for-guile2014-01-12T21:58:00Z2014-01-12T21:58:00Z

Happy new year's, hackfolk!

A few weeks ago I wrote about the upcoming Guile 2.2 release, and specifically about its new register virtual machine. Today I'd like to burn some electrons on another new part in Guile 2.2, its intermediate language.

To recap, we switched from a stack machine to a register machine because, among other reasons, register machines can consume and produce named intermediate results in fewer instructions than stack machines, and that makes things faster.

To take full advantage of this new capability, it is appropriate to switch at the same time from the direct-style intermediate language (IL) that we had to an IL that names all intermediate values. This lets us effectively reason about each subexpression that goes into a computation, for example in common subexpression elimination.

As far as intermediate languages go, basically there are two choices these days: something SSA-based, or something CPS-based. I wrote an article on SSA, ANF and CPS a few years ago; if you aren't familiar with these or are feeling a little rusty, I suggest you go and take a look.

In Guile I chose a continuation-passing style language. I still don't know if I made the right choice. I'll go ahead and describe Guile's IL and then follow up with some reflections. The description below is abbreviated from a more complete version in Guile's manual.

guile's cps language

Guile's CPS language is composed of terms, expressions, and continuations. It was heavily inspired by Andrew Kennedy's Compiling with Continuations, Continued paper.

A term can either evaluate an expression and pass the resulting values to some continuation, or it can declare local continuations and contain a sub-term in the scope of those continuations.

$continue k src exp
Evaluate the expression exp and pass the resulting values (if any) to the continuation labelled k.
$letk conts body
Bind conts in the scope of the sub-term body. The continuations are mutually recursive.

Additionally, the early stages of CPS allow for a set of mutually recursive functions to be declared as a term via a $letrec term. A later pass will attempt to transform the functions declared in a $letrec into local continuations. Any remaining functions are later lowered to $fun expressions. More on "contification" later.

Here is an inventory of the kinds of expressions in Guile's CPS language. Recall that all expressions are wrapped in a $continue term which specifies their continuation.

$void
Continue with the unspecified value.
$const val
Continue with the constant value val.
$prim name
Continue with the procedure that implements the primitive operation named by name.
$fun src meta free body
Continue with a procedure. body is the $kentry $cont of the procedure entry. free is a list of free variables accessed by the procedure. Early CPS uses an empty list for free; only after closure conversion is it correctly populated.
$call proc args
Call proc with the arguments args, and pass all values to the continuation. proc and the elements of the args list should all be variable names. The continuation identified by the term's k should be a $kreceive or a $ktail instance.
$primcall name args
Perform the primitive operation identified by name, a well-known symbol, passing it the arguments args, and pass all resulting values to the continuation.
$values args
Pass the values named by the list args to the continuation.
$prompt escape? tag handler
Push a prompt on the stack identified by the variable name tag and continue with zero values. If the body aborts to this prompt, control will proceed at the continuation labelled handler, which should be a $kreceive continuation. Prompts are later popped by pop-prompt primcalls.

The remaining element of the CPS language in Guile is the continuation. In CPS, all continuations have unique labels. Since this aspect is common to all continuation types, all continuations are contained in a $cont instance:

$cont k cont
Declare a continuation labelled k. All references to the continuation will use this label.

The most common kind of continuation binds some number of values, and then evaluates a sub-term. $kargs is this kind of simple lambda.

$kargs names syms body
Bind the incoming values to the variables syms, with original names names, and then evaluate the sub-term body.

Variables (e.g., the syms of a $kargs) should be globally unique. To bind the result of an expression a variable and then use that variable, you would continue from the expression to a $kargs that declares one variable. The bound value would then be available for use within the body of the $kargs.

$kif kt kf
Receive one value. If it is a true value, branch to the continuation labelled kt, passing no values; otherwise, branch to kf.

Non-tail function calls should continue to a $kreceive continuation in order to adapt the returned values to their uses in the calling function, if any.

$kreceive arity k
Receive values from a function return. Parse them according to arity, and then proceed with the parsed values to the $kargs continuation labelled k.

$arity is a helper data structure used by $kreceive and also by $kclause, described below.

$arity req opt rest kw allow-other-keys?
A data type declaring an arity. See Guile's manual for details.

Additionally, there are three specific kinds of continuations that can only be declared at function entries.

$kentry self tail clauses
Declare a function entry. self is a variable bound to the procedure being called, and which may be used for self-references. tail declares the $cont wrapping the $ktail for this function, corresponding to the function's tail continuation. clauses is a list of $kclause $cont instances.
$ktail
A tail continuation.
$kclause arity cont
A clause of a function with a given arity. Applications of a function with a compatible set of actual arguments will continue to cont, a $kargs $cont instance representing the clause body.

reflections

Before starting Guile's compiler rewrite, I had no real-world experience with CPS-based systems. I had worked with a few SSA-based systems, and a few more direct-style systems. I had most experience with the previous direct-style system that Guile had, but never had to seriously design another kind of IL, so basically I was ignorant. It shows, I think; but time will tell if it came out OK anyway. At this point I am cautiously optimistic.

As far as fitness for purpose goes, the CPS IL works in the sense that it is part of a self-hosting compiler. I'll say no more on that point other than to mention that it has explicit support for a number of Guile semantic features: multiple-value returns; optional, rest, and keyword arguments; cheap delimited continuations; Guile-native constant literals.

Why not ANF instead? If you recall from my SSA and CPS article, I mentioned that ANF is basically CPS with fewer labels. It tries to eliminate "administrative" continuations, whereas Guile's CPS labels everything. There is no short-hand let form.

ANF proponents tout its parsimony as a strength, but I do not understand this argument. I like having labels for everything. In CPS, I have as many labels as there are expressions, plus a few for continuations that don't contain terms. I use them directly in the bytecode compiler; the compiler never has to generate a fresh label, as they are part of the CPS itself.

More importantly, labelling every control-flow point allows me to reason precisely about control flow. For example, if a function is always called with the same continuation, it can be incorporated in the flow graph of the calling function. This is called "contification". It is not the same thing as inlining, as it works for sets of recursive functions as well, and never increases code size. This is a crucial transformation for a Scheme compiler to make, as it turns function calls into gotos, and self-function calls into loop back-edges.

Guile's previous compiler did a weak form of contification, but because we didn't have names for all control points it was gnarly and I was afraid to make it any better. Now its contifier is optimal. See Fluet and Weeks' Contification using Dominators and Kennedy's CWCC, for more on contification.

One more point in favor of labelling all continuations. Many tranformations can be best cast as a two-phase process, in which you first compute a set of transformations to perform, and then you apply them. Dead-code elimination works this way; first you find the least fixed-point of live expressions, and then you residualize only those expressions. Without names, how are you going to refer to an expression in the first phase? The ubiquitous, thorough labelling that CPS provides does not have this problem.

So I am happy with CPS, relative to ANF. But what about SSA? In my previous article, I asked SSA proponents to imagine returning a block from a block. Of course it doesn't make any sense; SSA is a first-order language. But modern CPS is also first-order, is the thing! Modern CPS distinguishes "continuations" syntactically from functions, which is exactly the same as SSA's distinction between basic blocks and functions. CPS and SSA really are the same on this level.

The fundamental CPS versus SSA difference is, as Stephen Weeks noted a decade ago, one of data structures: do you group your expressions into basic blocks stored in a vector (SSA), or do you nest them into a scope tree (CPS)? It's not clear that I made the correct choice.

In practice with Guile's CPS you end up building graphs on the side that describe some aspect of your term. For example you can build a reverse-post-ordered control flow analysis that linearizes your continuations, and assigns them numbers. Then you can compute a bitvector for each continuation representing each one's reachable continuations. Then you can use this reachability analysis to determine the extent of a prompt's body, for example.

But this analysis is all on the side and not really facilitated by the structure of CPS itself; the CPS facilities that it uses are the globally unique continuation and value names of the CPS, and the control-flow links. Once the transformation is made, all of the analysis is thrown away.

Although this seems wasteful, the SSA approach of values having "implicit" names by their positions in a vector (instead of explicit ephemeral name-to-index mappings) is terrifying to me. Any graph transformation could renumber things, or leave holes, or cause vectors to expand... dunno. Perhaps I am too shy of the mutation foot-gun. I find comfort in CPS's minimalism.

One surprise I have found is that I haven't needed to do any dominator-based analysis in any of the paltry CPS optimizations I have made so far. I expect to do so once I start optimizing loops; here we see the cultural difference with SSA I guess, loops being the dominant object of study there. On the other hand I have had to solve flow equations on a few occasions, which was somewhat surprising, though enjoyable.

The optimizations I have currently implemented for CPS are fairly basic. Contification was tricky. One thing I did recently was to make all non-tail $call nodes require $kreceive continuations; if, as in the common case, extra values were unused, that was reflected in an unused rest argument. This required a number of optimizations to clean up and remove the extra rest arguments for other kinds of source expressions: dead-code elimination, the typical beta/eta reduction, and some code generation changes. It was worth it though, and now with the optimization passes things are faster than they were before.

Well, I find that I am rambling now. I know this is a lot of detail, but I hope that it helps some future compiler hacker understand more about intermediate language tradeoffs. I have been happy with CPS, but I'll report back if anything changes :) And if you are actually hacking on Guile, check the in-progress manual for all the specifics.

Happy hacking to all, and to all a good hack!

Andy Wingohttps://wingolog.org/doing it wrong: cse in guilehttps://wingolog.org/2012/05/14/doing-it-wrong-cse-in-guile2012-05-14T17:07:14Z2012-05-14T17:07:14Z

Greetings, readers! It's been a little while, but not because I haven't been doing anything, or nothing worth writing about. No, the reason I haven't written recently is because the perceived range of my ignorance has been growing faster than the domain of my expertise.

Knowledge may well be something one can dominate, but ignorance must forever be a range, stretching off to a hazy horizon. Climbing the hill of JavaScript implementations has let me see farther out on that plain. I think it's only the existence of the unknown unknowns that can let one muster up the arrogance to write at all.

But back to domains and dominators later. To begin, I note that there is very little in the way of correct, current, and universal folk wisdom as to how to implement a programming language. Compiler hackers are priests of their languages, yes, but their temples are in reality more or less isolated cults, in which the values of one's Gods may be unknown or abhorrent to those of others. Witness the attention paid to loop optimizations in some languages, compared to garbage collection in others, or closures in still others.

In my ecumenical capacity as abbot of Guile and adjunct deacon of JavaScriptCore, sometimes I'm tempted to mix rites: to sprinkle the holy water of lexical scope optimizations on JS, and, apropos of this article, to exorcise common subexpressions in Scheme.

As one might well imagine, the rites of one cult must be adapted to circumstances. I implemented CSE for Guile, but I don't know if it was actually a win. In this article I'll go into what CSE is, how it works in Guile, why it probably won't survive in its present form.

cse: common subexpression elimination

I implemented a source-to-source optimization pass in Guile that eliminates common subexpressions. It actually does both more and less than that: it propagates predicates and eliminates effect-free statements as well, and these latter optimizations are why I implemented the pass in the first place.

Let me give an example. Let's imagine we implement a binary tree in Guile, using the records facility.

(use-modules (srfi srfi-9))

(define-record-type btree
  (make-btree elt left right)
  btree?
  (elt btree-elt)
  (left btree-left)
  (right btree-right))

(define *btree-null* #f)

(define (btree-cons head tail)
  (if (btree? tail)
      (let ((elt (btree-elt tail)))
        (if (< elt head)
            (make-btree elt
                        (btree-left tail)
                        (btree-cons head (btree-right tail)))
            (make-btree elt
                        (btree-cons head (btree-left tail))
                        (btree-right tail))))
      (make-btree head
                  *btree-null*
                  *btree-null*)))

That's enough to illustrate my point, I think. We have the data type, the base case, and a constructor. Of course in Haskell or something with better data types it would be much cleaner, but hey, let's roll with this.

If you look at btree-cons, it doesn't seem to be amenable in its current form to classic common subexpression elimination. People don't tend to write duplicated code. You see that I bound the temporary elt instead of typing (btree-elt btree) each time, and that was partly because of typing, and partly out of some inner efficiency puritan, telling me I shouldn't write duplicate expressions. (Cult behavior, again!)

But, note that all of these record abstractions will probably inline, instead of calling out to procedures. (They are implemented as inlinable procedures. Yes, it would be better to have cross-module inlining, but we don't, so this is what we do.) In general, syntactic abstraction in Scheme can lead to duplicate code. Apologies in advance for this eyeball-rending torrent, but here's a listing of what btree-cons reduces to, after expansion and partial evaluation:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (if (eq? (struct-vtable tail) btree)
                     (struct-ref tail 0)
                     (throw 'wrong-type-arg
                            'btree-elt
                            "Wrong type argument: ~S"
                            (list tail)
                            (list tail)))))
        (if (< elt head)
            (let ((left (if (eq? (struct-vtable tail) btree)
                            (struct-ref tail 1)
                            (throw 'wrong-type-arg
                                   'btree-left
                                   "Wrong type argument: ~S"
                                   (list tail)
                                   (list tail))))
                  (right (btree-cons
                           head
                           (if (eq? (struct-vtable tail) btree)
                               (struct-ref tail 2)
                               (throw 'wrong-type-arg
                                      'btree-right
                                      "Wrong type argument: ~S"
                                      (list tail)
                                      (list tail))))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons
                          head
                          (if (eq? (struct-vtable tail) btree)
                              (struct-ref tail 1)
                              (throw 'wrong-type-arg
                                     'btree-left
                                     "Wrong type argument: ~S"
                                     (list tail)
                                     (list tail)))))
                  (right (if (eq? (struct-vtable tail) btree)
                             (struct-ref tail 2)
                             (throw 'wrong-type-arg
                                    'btree-right
                                    "Wrong type argument: ~S"
                                    (list tail)
                                    (list tail)))))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

Again, I'm really sorry about that, and it's not just for your eyes: it's also because that's a crapload of code for what should be a simple operation. It's also redundant! There are 6 checks that btree is in fact a btree, when only one is needed semantically. (Note that the null case is not a btree, of course.)

Furthermore, all of the checks in the first arm of the if are redundant. The code above is what the optimizer produces -- which is, you know, turrible.

So, I thought, we could run a pass over the source that tries to propagate predicates, and then tries to fold predicates whose boolean value we already know.

And that's what I did. Here's what Guile's optimizer does with the function, including the CSE pass:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (struct-ref tail 0)))
        (if (< elt head)
            (let ((left (struct-ref tail 1))
                  (right (btree-cons head (struct-ref tail 2))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons head (struct-ref tail 1)))
                  (right (struct-ref tail 2)))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

This is much better. It's quite close to the source program, except the symbolic references like btree-elt have been replaced with indexed references. The type check in the predicate of the if expression propagated to all the other type checks, causing those nested if expressions to fold.

Of course, CSE can also propagate bound lexicals:

(let ((y (car x)))
  (car x))
=> (let ((y (car x)))
     y)

This is the classic definition of CSE.

but is it a win?

I should be quite pleased with the results, except that CSE makes Guile's compiler approximately twice as slow. Granted, in the long run, this should be acceptable: code is usually run many more times than it is compiled. But this is a fairly expensive pass, and yet at the same time it's not as good as it could be.

In order to get to the heart of the matter, I need to explain a little about the implementation. CSE is a post-pass, that runs after partial evaluation (peval). I tried to make it a part of peval, as the two optimizations are synergistic -- oh yes, let's revel in that word -- are you feeling it? -- but it was too complicated in the end. The reason is that in functions like this:

(define (explode btree)
  (unless (btree? btree)
    (error "not a btree" btree))
  (values (btree-head btree)
          (btree-left btree)
          (btree-right btree)))

Here we have a sequence of two expressions. Since the first one bails out if the predicate is false, we should propagate a true predicate past the first expression. This means that running CSE on an expression returns two values: the rewritten expression, given the predicates already seen; and a new set of predicates that the expression asserts. We should be able to use these new assertions to elide the type checks in the second expression. And indeed, Guile can do this.

Perhaps you can see the size of the problem now. CSE is a pass that runs over all code, building up an ordered set of expressions that were evaluated, and in what context. When it sees a new expression in a test context -- as the predicate in an if -- it checks to see if the set contains that expression (or its negation) already, in test context, and if so tries to fold the expression to true or false. Already doing this set lookup and traversal is expensive -- at least N log N in the total size of the program, with some quadratic components in the case an expression is found, and also with high constant factors due to the need for custom hash and equality predicates.

The quadratic factor comes in when walking the set to see if the elimination is valid. Consider:

(if (car x)
    (if (car x) 10 20)
    30)

Here, we should be able to eliminate the second (car x), folding to (if (car x) 10 30). However, in this one:

(if (car x)
    (begin
      (y)
      (if (car x) 10 20))
    30)

If we don't know what (y) does, then we can't fold the second test, because perhaps (y) will change the contents of the pair, x. The information that allows us to make these decisions is effects analysis. For the purposes of Guile's optimizer, (car x) has two dependencies and can cause two effects: it depends on the contents of a mutable value, and on the value of a toplevel (x), and can cause the effect of an unbound variable error when resolving the toplevel, or a type error when accessing its car. Two expressions commute if neither depends on effects that the other causes.

I stole the idea of doing a coarse effects analysis, and representing it as bits in a small integer, from V8. Guile's version is here: effects.scm. The ordered set is a form of global value numbering. See the CSE pass here: cse.scm.

The commute test is fairly cheap, but the set traversal is currently a bit expensive.

and for what?

As I have indicated, the pass does do something useful on real programs, as in the binary tree example. But it does not do all it could, and it's difficult to fix that, for a few reasons.

Unlike traditional CSE, Guile's version of it is interprocedural. Instead of operating on just one basic block or one function, it operates across nested functions as well. However, only some dependencies can commute across a function boundary. For example:

(lambda (x)
  (if (pair? x)
      (let ((y (car x)))
        (lambda ()
          (and (pair? x) (car x))))))

Can the first pair? test propagate to the second expression? It can, because pair? does not depend on the values of mutable data, or indeed on any effect. If it's true once, it will always be true.

But can we replace the second (car x) with y? No, because (car x) has a dependency on mutable data, and because we don't do escape analysis on the closure, we don't let those dependencies commute across a procedure boundary. (In this case, even if we did escape analysis, we'd have the same conclusion.)

However, not all lambda abstractions are closures. Some of them might end up being compiled to labels in the function. Scheme uses syntactically recursive procedures to implement loops, after all. But Guile's CSE does poorly for lambda expressions that are actually labels. The reason is that lexical scope is not a dominator tree.

MLton hacker Stephen Weeks says it better than I do:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.

As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y. (i.e. without putting y in the scope of x)

[MLton-devel] CPS vs SSA

See my article on SSA and CPS for more context.

So that's one large source of lost CSE opportunities, especially in loops.

Another large source of lost opportunities is that the Tree-IL language, which is basically a macro-expanded Scheme, has the same property that Scheme does, that the order of evaluation of operands is unspecified.

Consider the base-case clause of my btree-cons above:

(let ((left *btree-null*) (right *btree-null*))
  (make-struct/no-tail btree head left right))

Now, *btree-null* is a toplevel lookup, that might have an unbound-variable effect. We should be able to eliminate one of them, though. Why doesn't the CSE pass do it? Because in Tree-IL, the order of evaluation of the right-hand-sides of left and right is unspecified. This gives Guile some useful freedoms, but little information for CSE.

This is an instance of a more general problem, that Tree-IL might be too high-level to be useful for CSE. For example, at runtime, not all lexical references are the same -- some are local, and some reference free variables. For mutated free variables, the variable is itself in a box, so to reference it you would load the box into a local and then dereference the box. CSE should allow you to eliminate duplicate loads of the box, even in the case that it can't eliminate duplicate references into the box.

conclusion

It is nice to be able to eliminate the duplicate checks, but not at any price. Currently the bootstrapping time cost is a bit irritating. I have other ideas on how to fix that, but ultimately we probably need to re-implement CSE at some lower level. More on that in a future post. Until then, happy hacking.

Andy Wingohttps://wingolog.org/static single assignment for functional programmershttps://wingolog.org/2011/07/12/static-single-assignment-for-functional-programmers2011-07-12T16:39:50Z2011-07-12T16:39:50Z

Friends, I have an admission to make: I am a functional programmer.

By that I mean that lambda is my tribe. And you know how tribalism works: when two tribes meet, it's usually to argue and not to communicate.

So it is that I've been well-indoctrinated in the lore of the lambda calculus, continuation-passing style intermediate languages, closure conversion and lambda lifting. But when it comes to ideas from outside our tribe, we in the lambda tribe tune out, generally.

At the last Scheme workshop in Montreal, some poor fellow had the temerity to mention SSA on stage. (SSA is what the "machine tribe" uses as an intermediate langauge in their compilers.) I don't think the "A" was out of his mouth before Olin Shivers' booming drawl started, "d'you mean CPS?" (CPS is what "we" use.) There were titters from the audience, myself included.

But there are valuable lessons to be learned from SSA language and the optimizations that it enables, come though it may from another tribe. In this article I'd like to look at what the essence of SSA is. To do so, I'll start with explaining the functional programming story on intermediate languages, as many of my readers are not of my tribe. Then we'll use that as a fixed point against which SSA may be compared.

the lambda tribe in two sentences

In the beginning was the lambda. God saw it, realized he didn't need anything else, and stopped there.


functional programmers got god's back

Hey, it's true, right? The lambda-calculus is great because of its expressivity and precision. In that sense this evaluation is a utilitarian one: the lambda-calculus allows us to reason about computation with precision, so it is worth keeping around.

I don't think that Church was thinking about digital computers when he came up with the lambda-calculus back in the 1930s, given that digital computers didn't exist yet. Nor was McCarthy thinking about computers when he came up with Lisp in the 1960s. But one of McCarthy's students did hack it up, and that's still where we are now: translating between the language of the lambda-calculus and machine language.

This translation process is compilation, of course. For the first 20 years or so of practicing computer science, compilers (and indeed, languages) were very ad-hoc. In the beginning they didn't exist, and you just wrote machine code directly, using switches on a control panel or other such things, and later, assembly language. But eventually folks figured out parsing, and you get the first compilers for high-level languages.

I've written before about C not being a high-level assembly language, but back then, FORTRAN was indeed such a language. There wasn't much between the parser and the code generator. Everyone knows how good compilers work these days: you parse, you optimize, then you generate code. The medium in which you do your work is your intermediate language. A good intermediate language should be simple, so your optimizer can be simple; expressive, so that you can easily produce it from your source program; and utilitarian, in that its structure enables the kinds of optimizations that you want to make.

The lambda tribe already had a good intermediate language in this regard, in the form of the lambda-calculus itself. In many ways, solving a logic problem in the lambda-calculus is a lot like optimizing a program. Copy propagation is beta-reduction. Inlining is copy propagation extended to lambda expressions. Eta-conversion of continuations eliminates "forwarding blocks" -- basic blocks which have no statements, and just jump to some other continuation. Eta-conversion of functions eliminates functional trampolines.

continuation-passing style

But I'm getting ahead of myself. In the lambda tribe, we don't actually program in the lambda-calculus, you see. If you read any of our papers there's always a section in the beginning that defines the language we're working in, and then defines its semantics as a translation to the lambda-calculus.

This translation is always possible, for any programming language, and indeed Peter Landin did so in 1965 for Algol. Landin's original translations used his "J operator" to capture continuations, allowing a more direct translation of code into the lambda-calculus.

I wrote more on Landin, letrec, and the Y combinator a couple of years ago, but I wanted to mention one recent paper that takes a modern look at J, A Rational Deconstruction of Landin's J Operator. This paper is co-authored by V8 hacker Kevin Millikin, and cites work by V8 hackers Mads Sig Ager and Lasse R. Nielsen. Indeed all three seem to have had the privilege of having Olivier Danvy as PhD advisor. That's my tribe!

Anyway, J was useful in the context of Landin's abstract SECD machine, used to investigate the semantics of programs and programming languages. However it does not help the implementor of a compiler to a normal machine, and intermediate languages are all about utility. The answer to this problem, for functional programmers, was to convert the source program to what is known as continuation-passing style (CPS).

With CPS, the program is turned inside out. So instead of (+ 1 (f (+ 2 3))), you would have:

lambda return
  let c1 = 1
    letcont k1 t1 = _+ return c1 t1
      letcont k2 t2 = f k1 t2
        let c2 = 2, c3 = 3
          _+ k2 c2 c3

Usually the outer lambda is left off, as it is implicit. Every call in a CPS program is a tail call, for the purposes of the lambda calculus. Continuations are explicitly represented as lambda expressions. Every function call or primitive operation takes the continuation as an argument. Papers in this field usually use Church's original lambda-calculus notation instead of the ML-like notation I give here. Continuations introduced by a CPS transformation are usually marked as such, so that they can be efficiently compiled later, without any flow analysis.

Expressing a program in CPS has a number of practical advantages:

  • CPS is capable of expressing higher-order control-flow, for languages in which functions may be passed as values.

  • All temporary values are named. Unreferenced names represent dead code, or code compiled for effect only. Referenced names are the natural input to a register allocator.

  • Continuations correspond to named basic blocks. Their names in the source code correspond to a natural flow analysis simply by tracing the definitions and uses of the names. Flow analysis enables more optimizations, like code motion.

  • Full beta-reduction is sound on terms of this type, even in call-by-value languages.

Depending on how you implement your CPS language, you can also attach notes to different continuations to help your graph reduce further: this continuation is an effect context (because its formal parameter is unreferenced in its body, or because you knew that when you made it), so its caller can be processed for effect and not for value; this one is of variable arity (e.g. can receive one or two values), so we could jump directly to the right handler, depending on what we want; etc. Guile's compiler is not in CPS right now, but I am thinking of rewriting it for this reason, to allow more transparent handling of control flow.

Note that nowhere in there did I mention Scheme's call-with-current-continuation! For me, the utility of CPS is in its explicit naming of temporaries, continuations, and its affordances for optimization. Call/cc is a rare construct in Guile Scheme, that could be optimized better with CPS, but one that I don't care a whole lot about, because it's often impossible to prove that the continuation doesn't escape, and in that case you're on the slow path anyway.

So that's CPS. Continuations compile to jumps within a function, and functions get compiled to closures, or labels for toplevel functions. The best reference I can give on it is Andrew Kennedy's 2007 paper, Compiling With Continuations, Continued. CWCC is a really fantastic paper and I highly recommend it.

a digression: anf

CPS fell out of favor in the nineties, in favor of what became known as Administrative Normal Form, or ANF. ANF is like CPS except instead of naming the continuations, the code is left in so-called "direct-style", in which the continuations are implicit. So my previous example would look like this:

let c2 = 2, c3 = 3
  let t2 = + c2 c3
    let t1 = f t2
      let c1 = 1
        + c1 t1

There are ANF correspondences for CPS reductions, like the beta-rule. See the Essence of Compiling With Continuations paper, which introduced ANF and sparked the decline of the original CPS formulation, for more.

This CPS-vs-ANF discussion still goes on, even now in 2011. In particular, Kennedy's CWCC paper is quite compelling. But the debate has been largely mooted by the advances made by the machine tribe, as enabled by their SSA intermediate language.

the machine tribe in two sentences

In the beginning was the
Segmentation fault (core dumped)

(Just kidding, guys & ladies!)

Instead of compiling abstract ideas of naming and control to existing hardware, as the lambda tribe did, the machine tribe took as a given the hardware available, and tries to expose the capabilities of the machine to the programmer.

The machine tribe doesn't roll with closures, continuations, or tail calls. But they do have loops, and they crunch a lot of data. The most important thing for a compiler of a machine-tribe language like C is to produce efficient machine code for loops.

Clearly, I'm making some simplifications here. But if you look at a machine-tribe language like Java, you will be dealing with many control-flow constructs that are built-in to the language (for, while, etc.) instead of layered on top of recursion like loops in Scheme. What this means is that large, important parts of your program have already collapsed to a first-order control-flow graph problem. Layering other optimizations on top of this like inlining (the mother of all optimizations) only expands this first-order flow graph. More on "first-order" later.

So! After decades of struggling with this problem, after having abstracted away from assembly language to three-address register transfer language, finally the machine folks came up with something truly great: static single-assignment (SSA) form. The arc here is away from the machine, and towards more abstraction, in order to be able to optimize better, and thus generate better code.

It's precisely for this utilitarian reason that SSA was developed. Consider one of the earliest SSA papers, Global Value Numbers and Redundant Comparisons by Rosen, Wegman, and Zadeck. Rosen et al were concerned about being able to move invariant expressions out of loops, extending the "value numbering" technique to operate across basic blocks. But the assignment-oriented intermediate languages that they had been using were getting in the way of code motion.

To fix this issue, Rosen et al switched from the assignment-oriented model of the machine tribe to the binding-oriented model of the lambda tribe.

In SSA, variables are never mutated (assigned); they are bound once and then left alone. Assignment to a source-program variable produces a new binding in the SSA intermediate language.

For the following function:

function clamp (x, lower, upper) {
  if (x < lower)
    x = lower;
  else if (x > upper)
    x = upper;
  return x;
}  

The SSA translation would be:

entry:
  x0, lower0, upper0 = args;
  goto b0;

b0:
  t0 = x0 < lower0;
  goto t0 ? b1 : b2;

b1:
  x1 = lower0;
  goto exit;

b2:
  t1 = x0 > upper0;
  goto t1 ? b3 : exit;
  
b3:
  x2 = upper0;
  goto exit;
  
exit:
  x4 = phi(x0, x1, x2);
  return x4;

SSA form breaks down a procedure into basic blocks, each of which ends with a branch to another block, either conditional or unconditional. Usually temporary values receive their own names as well, as it facilitates optimization.

phony functions

The funny thing about SSA is the last bit, the "phi" function. Phi functions are placed at control-flow joins. In our case, the value of x may be proceed from the argument or from the assignment in the first or second if statement. The phi function indicates that.

But you know, lambda tribe, I didn't really get what this meant. What is a phi function? It doesn't help to consider where the name comes from, that the original IBM compiler hackers put in a "phony" function to merge the various values, but considered that "phi" was a better name if they wanted to be taken seriously by journal editors.

Maybe phi functions are intuitive to the machine tribe; I don't know. I doubt it. But fortunately there is another interpretation: that each basic block is a function, and that a phi function indicates that the basic block has an argument.

Like this:

entry x lower upper =
  letrec b0 = let t0 = x0 < lower
                if t0 then b1() else b2()
         b1 = let x = lower
                exit(x);
         b2 = let t1 = x > upper0
                if t1 then b3() else exit(x)
         b3 = let x = upper
                exit(x);
         exit x = x
    b0()

Here I have represented basic blocks as named functions instead of labels. Instead of phi functions, we allow the blocks to take a number of arguments; the call sites determine the values that the phi function may take on.

Note that all calls to blocks are tail calls. Reminds you of CPS, doesn't it? For more, see Richard Kelsey's classic paper, A Correspondence Between Continuation-Passing Style and Static Single Assignment Form, or my earlier article about Landin, Steele, letrec, and labels.

But for a shorter, readable, entertaining piece, see Appel's SSA is Functional Programming. I agree with Appel that we in the lambda-tribe get too hung up on our formalisms, when sometimes the right thing to do is draw a flow-graph.

so what's the big deal?

If it were only this, what I've said up to now, then SSA would not be as interesting as CPS, or even ANF. But SSA is not just about binding, it is also about control flow. In order to place your phi functions correctly, you need to build what is called a dominator tree. One basic block is said to dominate another if all control paths must pass through the first before reaching the second.

For example, the entry block always dominates the entirety of a function. In our example above, b0 also dominates every other block. However though b1 does branch to exit, it does not dominate it, as exit may be reached on other paths.

It turns out that you need to place phi functions wherever a definition of a variable meets a use of the variable that is not strictly dominated by the definition. In our case, that means we place a phi node on exit. The dominator tree is a precise, efficient control-flow analysis that allows us to answer questions like this one (where do I place a phi node?).

For more on SSA and dominators, see the very readable 1991 paper by Cytron, Ferrante, Rosen, Wegman, and Zadeck, Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.

Typical implementations of SSA embed in each basic block pointers to the predecessors and successors of the blocks, as well as the block's dominators and (sometimes) post-dominators. (A predecessor is a block that precedes the given node in the control-flow graph; a successor succeeds it. A post-dominator is like a dominator, but for the reverse control flow; search the tubes for more.) There are well-known algorithms to calculate these links in linear time, and the SSA community has developed a number of optimizations on top of this cheap flow information.

In contrast, the focus in the lambda tribe has been more on interprocedural control flow, which -- as far as I can tell -- no one does in less than O(N2) time, which is, as my grandmother would say, "just turrible".

I started off with a mention of global value numbering (GVN) on purpose. This is still, 20+ years later, the big algorithm for code motion in JIT compilers. HotSpot C1 and V8 both use it, and it just landed in IonMonkey. GVN is well-known, well-studied, and it works. It results in loop-invariant code motion: if an invariant definition reaches a loop header, it can be hoisted out of the loop. In contrast I don't know of anything from the lambda tribe that really stacks up. There probably is something, but it's certainly not as well-studied.

why not ssa?

Peoples of the machine tribe, could you imagine returning a block as a value? Didn't think so. It doesn't make sense to return a label. But that's exactly what the lambda-calculus is about. One may represent blocks as functions, and use them as such, but one may also pass them as arguments and return them as values. Such blocks are of a higher order than the normal kind of block that is a jump target. Indeed it's the only way to express recursion in the basic lambda calculus.

That's what I mean when I say that CPS is good as a higher-order intermediate language, and when I say that SSA is a good first-order intermediate language.

If you have a fundamentally higher-order language, one in which you need to loop by recursion, then you have two options: do whole-program analysis to aggressively closure-convert your program to be first-order, and then you can use SSA, or use a higher-order IL, and use something more like CPS.

MLton is an example of a compiler that does the former. Actually, MLton's SSA implementation is simply lovely. They do represent blocks as functions with arguments instead of labels and phi functions.

But if you can't do whole-program analysis -- maybe because you want to extend your program at runtime, support separate compilation, or whatever -- then you can't use SSA as a global IL. That's not to say that you shouldn't identify first-order segments of your program and apply SSA-like analysis and optimization on them, of course! That's really where the lambda tribe should go.

finally

I wrote this because I was in the middle of V8's Crankshaft compiler and realized I didn't understand some of the idioms, so I went off to read a bunch of papers. At the same time, I wanted to settle the CPS-versus-ANF question for my Guile work. (Guile currently has a direct-style compiler, for which there are precious few optimizations; this fact is mostly a result of being difficult to work with the IL.)

This post summarizes my findings, but I'm sure I made a mistake somewhere. Please note any corrections in the comments.