## needed-bits optimizations in guile

Hey all, I had a fun bug this week and want to share it with you.

### numbers and representations

First, though, some background. Guile’s numeric operations are defined over the complex numbers, not over e.g. a finite field of integers. This is generally great when writing an algorithm, because you don’t have to think about how the computer will actually represent the numbers you are working on.

In practice, Guile will represent a small exact integer as a
*fixnum*,
which is a machine word with a low-bit tag. If an integer doesn’t fit
in a word (minus space for the tag), it is represented as a
heap-allocated bignum. But sometimes the compiler can realize that
e.g. the operands to a specific bitwise-and operation are within (say)
the 64-bit range of unsigned integers, and so therefore we can use
*unboxed operations* instead of the more generic functions that do
run-time dispatch on the operand types, and which might perform heap
allocation.

Unboxing is important for speed. It’s also tricky: under what
circumstances can we do it? In the example above, there is information
that flows from defs to uses: the operands of `logand` are known to be
exact integers in a certain range and the operation itself is closed over
its domain, so we can unbox.

But there is another case in which we can unbox, in which information
flows backwards, from uses to defs: if we see `(logand n #xff)`, we know:

the result will be in

`[0, 255]`that

*n*will be an exact integer (or an exception will be thrown)we are only interested in a subset of

*n*‘s bits.

Together, these observations let us transform the more general `logand`
to an unboxed operation, having first truncated *n* to a `u64`. And
actually, the information can flow from use to def: if we know that *n*
will be an exact integer but don’t know its range, we can transform the
potentially heap-allocating computation that produces *n* to instead
truncate its result to the `u64` range where it is defined, instead of
just truncating at the use; and potentially this information could
travel farther up the dominator tree, to inputs of the operation that
defines *n*, their inputs, and so on.

### needed-bits: the `|0` of scheme

Let’s say we have a numerical operation that produces an exact integer,
but we don’t know the range. We could truncate the result to a `u64`
and use unboxed operations, if and only if only `u64` bits are used. So
we need to compute, for each variable in a program, what bits are needed
from it.

I think this is generally known a *needed-bits analysis*, though both
Google and my textbooks are failing me at the moment; perhaps this is
because dynamic languages and flow analysis don’t get so much attention
these days. Anyway, the analysis can be local (within a basic block),
global (all blocks in a function), or interprocedural (larger than a
function). Guile’s is global. Each CPS/SSA variable in the function
starts as needing 0 bits. We then compute the fixpoint of visiting each
term in the function; if a term causes a variable to flow out of the
function, for example via return or call, the variable is recorded as
needing all bits, as is also the case if the variable is an operand to
some primcall that doesn’t have a specific needed-bits analyser.

Currently, only `logand` has a needed-bits analyser, and this is because
sometimes you want to do modular arithmetic, for example in a hash
function. Consider Bon Jenkins’ lookup3 string hash
function:

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k)))) #define mix(a,b,c) \ { \ a -= c; a ^= rot(c, 4); c += b; \ b -= a; b ^= rot(a, 6); a += c; \ c -= b; c ^= rot(b, 8); b += a; \ a -= c; a ^= rot(c,16); c += b; \ b -= a; b ^= rot(a,19); a += c; \ c -= b; c ^= rot(b, 4); b += a; \ } ...

If we transcribe this to Scheme, we get something like:

(define (jenkins-lookup3-hashword2 str) (define (u32 x) (logand x #xffffFFFF)) (define (shl x n) (u32 (ash x n))) (define (shr x n) (ash x (- n))) (define (rot x n) (logior (shl x n) (shr x (- 32 n)))) (define (add x y) (u32 (+ x y))) (define (sub x y) (u32 (- x y))) (define (xor x y) (logxor x y)) (define (mix a b c) (let* ((a (sub a c)) (a (xor a (rot c 4))) (c (add c b)) (b (sub b a)) (b (xor b (rot a 6))) (a (add a c)) (c (sub c b)) (c (xor c (rot b 8))) (b (add b a)) ...) ...)) ...

These `u32` calls are like the JavaScript `|0`
idiom,
to tell the compiler that we really just want the low 32 bits of the
number, as an integer. Guile’s compiler will propagate that information
down to uses of the defined values but also back up the dominator tree,
resulting in unboxed arithmetic for all of these operations.

(When writing this, I got all the way here and then realized I had already written quite a bit about this, almost a decade ago ago. Oh well, consider this your lucky day, you get two scoops of prose!)

### the bug

All that was just prelude. So I said that needed-bits is a fixed-point flow analysis problem. In this case, I want to compute, for each variable, what bits are needed for its definition. Because of loops, we need to keep iterating until we have found the fixed point. We use a worklist to represent the conts we need to visit.

Visiting a cont may cause the program to require more bits from the variables that cont uses. Consider:

(define-significant-bits-handler ((logand/immediate label types out res) param a) (let ((sigbits (sigbits-intersect (inferred-sigbits types label a) param (sigbits-ref out res)))) (intmap-add out a sigbits sigbits-union)))

This is the sigbits (needed-bits) handler for `logand` when one of its
operands (*param*) is a constant and the other (*a*) is variable. It
adds an entry for *a* to the analysis *out*, which is an intmap from
variable to a bitmask of needed bits, or `#f` for all bits. If *a*
already has some computed sigbits, we add to that set via
`sigbits-union`. The interesting point comes in the `sigbits-intersect`
call: the bits that we will need from *a* are first the bits that we
infer *a* to have, by forward type-and-range analysis; intersected with
the bits from the immediate *param*; intersected with the needed bits
from the result value *res*.

If the `intmap-add` call is idempotent—i.e., *out* already contains
*sigbits* for *a*—then *out* is returned as-is. So we can check for a
fixed-point by comparing *out* with the resulting analysis, via `eq?`.
If they are not equal, we need to add the cont that defines *a* to the
worklist.

The bug? The bug was that we were not enqueuing the def of *a*, but
rather the predecessors of *label*. This works when there are no
cycles, provided we visit the worklist in post-order; and regardless, it
works for many other analyses in Guile where we compute, for each
labelled cont (basic block), some set of facts about all other labels
or about all other
variables.
In that case, enqueuing a predecessor on the worklist will cause all
nodes up and to including the variable’s definition to be visited,
because each step adds more information (relative to the analysis
computed on the previous visit). But it doesn’t work for this case,
because we aren’t computing a per-label analysis.

The solution was to rewrite that particular fixed-point to enqueue labels that define a variable (possibly multiple defs, because of joins and loop back-edges), instead of just the predecessors of the use.

Et voilà ! If you got this far, bravo. Type at y’all again soon!

I think one reason you’re not seeing many results for “needed bits” is because things like LLVM at least call this analysis “demanded bits”. Thanks for the article! I love to see nice dataflows solving problems.

I love to see nice dataflows solving problems.