wingolog -- A mostly dorky weblog by Andy Wingo

inline cache applications in scheme
2012-05-29
https://wingolog.org/2012/05/29/inline-cache-applications-in-scheme

The inline cache is a dynamic language implementation technique that originated in Smalltalk-80 and Self, and was made well-known by JavaScript implementations. It is fundamental to getting good JavaScript performance.

a cure for acute dynamic dispatch

A short summary of the way inline caches work is that when you see an operation, like x + y, you don't compile in a procedure call to a generic addition subroutine. Instead, you compile a call to a procedure stub: the inline cache (IC). When the IC is first called, it will generate a new procedure specialized to the particular types that flow through that particular call site. On the next call, if the types are the same, control flows directly to the previously computed implementation. Otherwise the process repeats, potentially resulting in a polymorphic inline cache (one with entries for more than one set of types).
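
To make that concrete, here's a minimal sketch of such a stub in Scheme, with the cache modeled as closure state. It's illustrative only: the names are mine, exact-integer? stands in for a cheap fixnum tag check, and a real IC would be runtime-generated native code rather than a closure.

;; A minimal sketch of an inline cache for one + call site.
(define (make-add-ic)
  (let ((test #f)    ; "same types as last time?"
        (impl #f))   ; the cached specialized implementation
    (lambda (x y)
      (if (and test (test x y))
          (impl x y)                        ; hit: straight to the code
          (begin                            ; miss: specialize and cache
            (cond ((and (exact-integer? x) (exact-integer? y))
                   (set! test (lambda (a b)
                                (and (exact-integer? a)
                                     (exact-integer? b))))
                   (set! impl +))           ; stand-in for a fixnum path
                  (else
                   (set! test (lambda (a b) #t))
                   (set! impl +)))          ; generic fallback
            (impl x y))))))

(define add-at-site-1 (make-add-ic))  ; one IC per call site
(add-at-site-1 1 2)   ; miss: specializes for integers, returns 3
(add-at-site-1 3 4)   ; hit: dispatches directly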

An inline cache is called "inline" because it is specific to a particular call site, not to the operation. Also, adaptive optimization can later inline the stub in place of the call site, if that is considered worthwhile.

Inline caches are a win wherever you have dynamic dispatch: named field access in JavaScript, virtual method dispatch in Java, or generic arithmetic -- and here we get to Scheme.

the skeptical schemer

What is the applicability of inline caches to Scheme? The only places you have dynamic dispatch in Scheme are in arithmetic and in ports.

Let's take arithmetic first. Arithmetic operations in Scheme can operate on numbers of a wide array of types: fixnums, bignums, single-, double-, or multi-precision floating point numbers, complex numbers, rational numbers, etc. Scheme systems are typically compiled ahead-of-time, so in the absence of type information, you always want to inline the fixnum case and call out [of line] for other cases. (Which line is this? The line of flow control: the path traced by a program counter.) But if you end up doing a lot of floating-point math, this decision can cost you. So inline caches can be useful here.
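
As a sketch of that ahead-of-time strategy, the fast path has this shape, assuming the fixnum operations from (rnrs arithmetic fixnums). Here generic-add is a stand-in name for the runtime's out-of-line routine, and a real fast path would fall out of line on fixnum overflow rather than erroring as fx+ does.

;; The usual AOT strategy, sketched: inline the fixnum case, call
;; out of line otherwise.
(use-modules (rnrs arithmetic fixnums))

(define generic-add +)   ; stand-in for the out-of-line routine

(define (add x y)
  (if (and (fixnum? x) (fixnum? y))
      (fx+ x y)            ; the inlined line of control flow
      (generic-add x y)))  ; bignums, flonums, rationals, ...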

Similarly, port operations like read-char and write can operate on any kind of port. If you are always writing UTF-8 data to a file port, you might want to be able to inline write for UTF-8 strings and file ports, possibly inlining directly to a syscall. It's probably a very small win in most cases, but a win nonetheless.

These little wins did not convince me that it was worthwhile to use ICs in a Scheme implementation, though. In the context of Guile, they're even less applicable than usual, because Guile is a bytecode-interpreted implementation with a self-hosted compiler. ICs work best when implemented as runtime-generated native code. Although it probably will by the end of the year, Guile doesn't generate native code yet. So I was skeptical.

occam's elf

Somehow, through all of this JavaScript implementation work, I managed to forget the biggest use of inline caches in GNU systems. Can you guess?

The PLT!

You may have heard how this works, but if you haven't, you're in for a treat. When you compile a shared library that has a reference to printf, from the C library, the compiler doesn't know where printf will be at runtime. So even in C, that most static of languages, we have a form of dynamic dispatch: a call to an unknown callee.

When the dynamic linker loads a library at runtime, it could resolve all the dynamic references, but instead of doing that, it does something more clever: it doesn't. Instead, the compiler and linker collude to make the call to printf call a stub -- an inline cache. The first time that stub is called, it will resolve the dynamic reference to printf, and replace the stub with an indirect call to the procedure. In this way we trade faster load times for dynamic libraries against one indirection per call site: the inline cache. This stub, this inline cache, is sometimes called the PLT entry. You might have seen it in a debugger or a disassembler or something.
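
You can model the trick in Scheme itself: route the call through a cell that initially holds a one-shot resolver stub. This is only a model -- lookup-symbol here is hypothetical, standing in for the dynamic linker's symbol resolution, and the real PLT jumps through the GOT rather than mutating closures.

;; A model of a PLT entry.  The first call resolves the symbol and
;; patches the cell; later calls go straight through the cell.
(define (lookup-symbol name)
  (case name
    ((printf) (lambda args (apply format #t args)))
    (else (error "undefined symbol" name))))

(define (make-plt-entry name)
  (define cell #f)
  (set! cell
        (lambda args                     ; the stub: runs only once
          (set! cell (lookup-symbol name))
          (apply cell args)))
  (lambda args (apply cell args)))       ; one indirection per call

(define printf@plt (make-plt-entry 'printf))
(printf@plt "~a, ~a!\n" "Hello" "world")  ; resolves, then calls
(printf@plt "resolved now\n")             ; straight through the cell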

I found this when I was writing an ELF linker for Guile's new virtual machine. More on that at some point in the future. ELF is interesting: I find that if I can't generate good code in the ELF format, I'm generating the wrong kind of code. Its idiosyncrasies remind me of what happens at runtime.

lambda: the ultimate inline cache

So, back to Scheme. Good Scheme implementations are careful to have only one way of calling a procedure. Since the only kind of callable object in the Scheme language is generated by the lambda abstraction, Scheme implementations typically produce uniform code for procedure application: load the procedure, prepare the arguments, and go to the procedure's entry point.

However, if you're already eating the cost of dynamic linking -- perhaps via separately compiled Scheme modules -- you might as well join the operations of "load a dynamically-linked procedure" and "go to the procedure's entry point" into a call to an inline cache, as in C shared libraries. In the cold case, the inline cache resolves the dynamic reference, updates the cache, and proceeds with the call. In the hot case, the cache directly dispatches to the call.

One benefit of this approach is that it now becomes cheap to support other kinds of applicable objects. One can make hash tables applicable, if that makes sense. (Clojure folk seem to think that it does.) Another example would be to more efficiently support dynamic programming idioms, like generic functions. Inline caches in Scheme would allow generic functions to have per-call-site caches instead of per-operation caches, which could be a big win.

It seems to me that this dynamic language implementation technique could allow Guile programmers to write different kinds of programs. The code to generate an inline cache could even itself be controlled by a meta-object protocol, so that the user could precisely control application of her objects. The mind boggles, but pleasantly so!

Thanks to Erik Corry for provoking this thought, via a conversation at JSConf EU last year. All blame to me, of course.

as PLT_HULK would say

NOW THAT'S AN APPLICATION OF AN INLINE CACHE! HA! HA HA!

case-lambda in guile
2009-11-07
https://wingolog.org/2009/11/07/case-lambda-in-guile

Oh man, does the hack proceed apace. I really haven't had time to write about it all, but stories don't tell themselves, so it's back behind the megaphone for me.

Guile is doing well, with the monthly release train still on the roll. Check the latest news entries for the particulars of the past; but here I'd like to write about a couple aspects of the present.

First, case-lambda. The dilly here is that sometimes you want a procedure that can take N or M arguments. For example, Scheme's write can be invoked as:

(write "Hi." (current-output-port))
=| "Hi."

(=| means "prints", in the same way that => means "yields".)

But actually you can omit the second argument, because it defaults to the current output port anyway, and just do:

(write "Hi.")
=| "Hi."

Well hello. So the question: how can one procedure take two different numbers of arguments -- how can it have two different arities?

The standard answer in Scheme is the "rest argument", as in "this procedure has two arguments, and put the rest in the third." The syntax for it is not very elegant, because it introduces improper lists into the code:

(define (foo a b . c)
  (format #t "~a ~a ~a\n" a b c))
(foo 1 2 3 4)
=| 1 2 (3 4)

You see that 1 and 2 are apart, but that 3 and 4 have been consed into a list. Rest args are great when your procedure really does take any number of arguments, but if the true situation is that your procedure simply takes 1 or 2 arguments, you end up with code like this:

(define my-write
  (lambda (obj . rest)
    (let ((port (if (pair? rest)
                    (car rest)
                    (current-output-port))))
      (write obj port))))

It's ugly, and it's not expressive. What's more, there's a bug in the code above: you can give it three arguments and it does not complain. And even more than that, it actually has to allocate memory to store the rest argument, on every function call. (Whole-program analysis can eliminate this allocation, but that is an entirely different kettle of fish.)

The solution to this is case-lambda, which allows you to have one procedure with many different arities.

(define my-write
  (case-lambda
    ((obj port) (write obj port))
    ((obj)      (my-write obj (current-output-port)))))

implementation

You can implement case-lambda in terms of rest arguments, with macros. Guile did so for many years. But you don't get the efficiency benefits that way, and all of your tools still assume functions only have one arity.
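
For illustration, here is roughly the shape of such an expansion for the two-clause my-write above -- a sketch, not Guile's actual historical macro. Note that it pays exactly the costs described earlier: every call conses a rest list, and dispatch goes through length and apply.

;; case-lambda expanded into a rest-args lambda that dispatches on
;; the argument count.
(define my-write
  (lambda args
    (case (length args)
      ((2) (apply (lambda (obj port) (write obj port)) args))
      ((1) (apply (lambda (obj) (my-write obj (current-output-port)))
                  args))
      (else (error "my-write: wrong number of arguments" args)))))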

Probably the first time you make a VM, you encode the arity of a procedure into the procedure itself, in some kind of header. Then the opcodes that do calls or tail-calls or what-have-you check the procedure header against the number of arguments, to make sure that everything is right before transferring control to the new procedure.

Well with case-lambda that's not a good idea. Actually if you think a bit, there are all kinds of things that procedures might want to do with their arguments -- optional and keyword arguments, for example. (I'll discuss those shortly.) Or when you are implementing Elisp, and you have a rest argument, you should make a nil-terminated list instead of a null-terminated list. Et cetera. Many variations, and yet the base case should be fast.

The answer is to make calling a procedure very simple -- just a jump to the new location. Then let the procedure that's being called handle its arguments. If it's a simple procedure, then it's a simple check, or if it's a case-lambda, then you have some dispatch. Indeed in Guile's VM now there are opcodes to branch based on the number of arguments.

So much for the VM; what about the compiler and the toolchain? For the compiler it's got its ups and downs. Instead of a <lambda> that just has its arguments and body, it now has no arguments, and a <lambda-case> as its body. Each lambda-case has an "alternate", the next one in the series. More complicated.

Then you have the debugging information about the arities. The deal here is that there are parts of a procedure that have arities, probably contiguous parts, and there are parts that have no arity at all. For example, program counter 0 in most procedures has no arity -- no bindings have been made from the arguments to local variables -- because the number of arguments hasn't been checked yet. And if that check fails, you'll want to show those arguments on your stacktrace. Complication there too.

And the introspection procedures, like procedure-arguments and such, all need to be updated. On the plus side, and this is a big plus, now there is much more debugging information available. Argument names for the different case-lambda clauses, and whether they are required or rest arguments -- and also optional and keyword arguments. This is nice. So for example my-write prints like this:

#<program my-write (obj port) | (obj)>

So yeah, Guile does efficient multiple-arity dispatch now, and has the toolchain to back it up.

Next up, efficient optional and keyword arguments. Tata for now!

dynamic dispatch: a followup
2008-10-19
https://wingolog.org/2008/10/19/dynamic-dispatch-a-followup

It seems that the 8-hash technique for dynamic dispatch that I mentioned in my last essay actually has a longer pedigree. At least 10 years before GOOPS' implementation, the always-excellent Gregor Kiczales wrote, with Luis H Rodriguez Jr.:

If we increase the size of class wrappers slightly, we can add more hash seeds to each wrapper. If n is the number of hash seeds stored in each wrapper, we can think of each generic function selecting some number x less than n and using the xth hash seed from each wrapper. Currently we store 8 hash seeds in each wrapper, resulting in very low average probe depths.

The additional hash seeds increase the probability that a generic function will be able to have a low average probe depth in its memoization table. If one set of seeds doesn't produce a good distribution, the generic function can select one of the other sets instead. In effect, we are increasing the size of class wrappers in order to decrease the size of generic function memoization tables. This tradeoff is attractive since typical systems seem to have between three and five times as many generic functions as classes.

Efficient method dispatch in PCL

So Mikael Djurfeldt, the GOOPS implementor, appears to have known about CLOS implementation strategies. But it's interesting how this knowledge percolates out -- it's not part of the computer science canon. When you read these papers, it's always "Personal communication from Dave Moon this" and "I know about this Kiczales paper that". (Now you do too.)

Also interesting about the Kiczales paper is the focus on the user, the programmer, in the face of redefinitions -- truly a different culture than the one that is dominant now.

polymorphic inline caches buzz buzz buzz

This reference comes indirectly via Keith Rarick, who writes to mention a beautiful paper by Hölzle, Chambers, and Ungar, introducing polymorphic inline caches, a mechanism to dispatch based on runtime types, as GOOPS does.

PICs take dispatch one step further: instead of indirect table lookups as GOOPS does, a PIC is a runtime-generated procedure that performs the lookups directly in code. This difference between data-driven processing and direct execution is the essence of compilation -- compilation pushes all of the caching and branching logic as close to the metal as possible.
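
As a sketch of the difference, here is a two-entry PIC modeled in Scheme, with the class tests baked directly into the dispatching procedure. class-of is GOOPS'; miss stands in for the slow lookup-and-extend path; and a real PIC would be machine code generated and extended at runtime, not a closure.

;; A sketch of a two-entry PIC: the tests live in code, not in a
;; table that has to be searched.
(use-modules (oop goops))

(define (make-pic class-a method-a class-b method-b miss)
  (lambda (obj . args)
    (let ((class (class-of obj)))
      (cond ((eq? class class-a) (apply method-a obj args))
            ((eq? class class-b) (apply method-b obj args))
            (else (apply miss obj args))))))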

Furthermore, PICs can be a source of data as well as a dispatch mechanism:

The presence of PIC-based type information fundamentally alters the nature of optimization of dynamically-typed object-oriented languages. In “traditional” systems such as the current SELF compiler, type information is scarce, and consequently the compiler is designed to make the best possible use of the type information. This effort is expensive both in terms of compile time and compiled code space, since the heuristics in the compiler are tuned to spend time and space if it helps extract or preserve type information. In contrast, a PIC-based recompiling system has a veritable wealth of type information: every message has a set of likely receiver types associated with it derived from the previously compiled version’s PICs. The compiler’s heuristics and perhaps even its fundamental design should be reconsidered once the information in PICs becomes available [...].

Optimizing Dynamically-Typed Object-Oriented Programming Languages with Polymorphic Inline Caches

The salient point is that in latent-typed languages, all of the static type analysis techniques that we know are insufficient. Only runtime analysis and runtime recompilation can capture the necessary information for efficient compilation.

Read both of these articles! But if you just read one, make it the Ungar/Chambers/Hölzle -- it is well-paced, clearly-written, and illuminating.

Happy hacking!

dispatch strategies in dynamic languages
2008-10-17
https://wingolog.org/2008/10/17/dispatch-strategies-in-dynamic-languages

In a past dispatch, I described in passing how accessors in Scheme can be efficiently dispatched into direct vector accesses. In summary, if I define a <point> type, and make an instance:

(define-class <point> ()
  (x #:init-keyword #:x #:accessor x)
  (y #:init-keyword #:y #:accessor y))

(define p (make <point> #:x 10 #:y 20))

Access to the various bits of what I call the object closure of p occurs via lazy, type-specific compilation, such that:

(y p)

internally dispatches to

(@slot-ref p 1)

So this is all very interesting and such, but how do we actually do the dispatch? It turns out that this is a quite interesting problem, one that the JavaScript people have been taking up recently in other contexts. The basic problem is: given a generic operation and a set of parameters with specific types, how do you determine the exact procedure to apply to the arguments?

dynamic dispatch from a moppy perspective

This problem is known as dynamic dispatch. GOOPS, Guile's object system, approaches it from the perspective of the meta-object protocol, the MOP. A MOP is an extensible, layered protocol that allows the user to replace and extend parts of a system's behaviour, while still allowing the system to maintain performance. (The latter is what I refer to as the "negative specification" -- the set of optimizations that the person writing the specifications doesn't say, but has in the back of her mind.)

The MOP that specifies GOOPS' behavior states that, on an abstract level, the process of applying a generic function to a particular set of arguments is performed by the apply-generic procedure, which itself is a generic function:

(apply-generic (gf <generic>) args)

The default implementation looks something like this:

(define-method (apply-generic (gf <generic>) args)
  (let ((methods (compute-applicable-methods gf args)))
    (if methods
        (apply-methods
         gf (sort-applicable-methods gf methods args) args)
        (no-applicable-method gf args))))

That is to say, first we figure out which methods actually apply to specific arguments, then sort them, then apply them. Of course, there are many ways that you might want the system to apply the methods. For some methods, you might want to invoke all applicable methods, in order from most to least specific; others, the same but in opposite order; or, in the normal case, simply apply the most specific method, and allow that method to explicitly chain up.

In effect, you want to parameterize the operation of the various components of apply-generic, depending on the type of the operation. With a MOP, you effect this parameterization by specifying that compute-applicable-methods, apply-methods, sort-applicable-methods, and no-applicable-method should themselves be generic functions. They should have default implementations that make sense, but those implementations should be replaceable by the user.
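
Concretely, overriding looks like ordinary GOOPS programming. Something like this sketch, assuming apply-generic and friends are exposed as the MOP describes, would trace every dispatch of a generic of the new class:

;; A sketch of MOP-style extension: a generic class whose instances
;; trace their dispatch, then defer to the default protocol.
(use-modules (oop goops))

(define-class <traced-generic> (<generic>))

(define-method (apply-generic (gf <traced-generic>) args)
  (format #t "dispatching ~a on ~s\n"
          (generic-function-name gf) (map class-of args))
  (next-method))  ; fall through to the default protocol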

At this point the specification is awesome in its power -- completely general, completely overridable... but what about efficiency? Do you really have to go through this entire process just to invoke a method?

performance in dynamic languages

There are four general techniques for speeding up an algorithm: caching, compiling, delaying computation, and indexing.

Peter Norvig, A Retrospective on PAIP, lesson 21.

Just because the process has been specified in a particular way at a high level does not mean that the actual implementation has to perform all of the steps all of the time. My previous article showed an example of compilation, illustrating how high-level specifications can compile down to pointer math. In this article, what I'm building to is an exegesis of GOOPS' memoization algorithm, which to me is an amazing piece of work.

The first point to realize is that in the standard case, in which the generic function is an instance of <generic> and not a subclass thereof, application of a generic function to a set of arguments will map to the invocation of a single implementing method.

Furthermore, the particular method to invoke depends entirely on the concrete types of the arguments at the call site. For example, let's define a cartesian distance generic:

(define-method (distance (p1 <point>) (p2 <point>))
  (define (square z)
    (* z z))
  (sqrt (+ (square (- (x p1) (x p2)))
           (square (- (y p1) (y p2))))))

Now, if we invoke this generic on some set of arguments:

(distance a b)

The particular method to be applied can be determined entirely from the types of a and b. Specifically, if both types are subtypes of <point>, the above-defined method will apply, and if not, no method that we know about will apply.

Therefore, whatever the result of the dispatch is, we can cache (or memoize) that result, and use it the next time, avoiding invocation of the entire protocol of methods.

The only conditions that might invalidate this memoization would be adding other methods to the distance generic, or redefining methods in the meta-object protocol itself. Both of these can be detected by the runtime object system, and can then trigger cache invalidation.
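
Sketched at the Scheme level, the memoization amounts to the following, where compute-handler stands in for the full apply-generic protocol; Guile's real cache lives in C, hanging off the generic itself, as we'll see below.

;; A sketch of the memoization: key a cache on the argument classes,
;; fill it on a miss, and give the object system a way to flush it
;; when methods are added or the MOP is redefined.
(use-modules (oop goops))   ; for class-of

(define (memoize-dispatch compute-handler)
  (define cache (make-hash-table))
  (define (dispatch . args)
    (let* ((classes (map class-of args))
           (handler (or (hash-ref cache classes)
                        (let ((h (compute-handler classes)))
                          (hash-set! cache classes h)
                          h))))
      (apply handler args)))
  (define (invalidate!) (hash-clear! cache))
  (values dispatch invalidate!))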

implementation

So, how to implement this memoization, then? The obvious place to store the cached data is on the generic itself, the same object that has a handle on all of the methods anyway. In the end, dynamic dispatch always involves at least one level of indirection -- even C++ people can't get around their vmethod table.

Algorithms in computer science are fundamentally about tradeoffs between space and time. If you can improve in one without affecting the other, assuming correctness, then you can say that one algorithm is better than the other. However, after the initial optimizations are made, you are left with a tradeoff between the two.

This point is particularly salient in the context of generic functions, some of which might have only one implementing method, while others might have hundreds. How to choose an algorithm that makes the right tradeoff for this wide range in input?

Guile does something really interesting in this regard. First, it recognizes that the real input to the algorithm is the set of types being dispatched, not the set of types for which a generic function is specialized.

This is based on the realization that at runtime, a given piece of code will only see a certain, reduced set of types -- perhaps even one set of types.

linear search

In the degenerate but common case in which dispatch only sees one or two sets of types, you can do a simple linear search of a vector containing typeset-method pairs, comparing argument types. If this search works, and the vector is short enough, you can dispatch with a few cmp instructions.

This observation leads to a simple formulation of the method cache: a list of entries, with each entry a list of classes tailed by the particular method implementation.

In Algol-like pseudocode, because this algorithm is indeed implemented in C:

def dispatch(types, cache):
    # Each cache entry is a list of classes tailed by the method,
    # e.g. (class1 class2 method).
    for entry in cache:
        x = types
        while True:
            # Types and the entry's classes ran out together: the
            # remaining element is the method.
            if null(x) and null(cdr(entry)): return car(entry)
            if null(x): break               # arity mismatch
            if car(x) != car(entry): break  # type (or arity) mismatch
            x = cdr(x); entry = cdr(entry)
    return False

(The tradeoff between vectors and lists is odd at short lengths; the former requires contiguous blocks of memory and goes through malloc, whereas the latter can take advantage of the efficient uniform allocation and marking infrastructure provided by GC. In this case Guile actually uses a vector of lists, due to access patterns.)

hash cache

In the general case in which we see more than one or two different sets of types, you really want to avoid the linear search, as this code is in the hot path, called every time you dispatch a generic function.

For this reason, Guile offers a second strategy, optimistic hashing. There is still a linear search through a cache vector, but instead of starting at index 0, we start at an index determined from hashing all of the types being dispatched.

With luck, if we hit the cache, the linear search succeeds after the first check; and if not, the search continues, wrapping around at the end of the vector, and stops with a cache miss if we wrap all the way around.

So, the algorithm is about the same, but rotated:

def dispatch(types, cache):
    # As above, but the search starts at the hash of the types and
    # wraps around the cache vector.
    h = hash(types) % len(cache)
    i = h
    while True:
        entry = cache[i]
        x = types
        while True:
            if null(x) and null(cdr(entry)): return car(entry)
            if null(x): break               # arity mismatch
            if car(x) != car(entry): break  # type (or arity) mismatch
            x = cdr(x); entry = cdr(entry)
        i = (i + 1) % len(cache)
        if i == h: return False             # wrapped around: miss

OK, all good. But: how to compute the hash value?

One straightforward way to do it would be to associate some random-ish value with each type object, given that they are indeed first-class objects, and simply add together all of those values. This strategy would not distinguish hash values on argument order, but otherwise would be pretty OK.

The standard source for this random-ish number would be the address of the class. There are two problems with this strategy: one, the lowest bits are likely to be the same for all classes, as they are aligned objects, but we can get around that one. The more important problem is that this strategy is likely to produce collisions in the cache vector between different sets of argument types, especially if dispatch sees a dozen or more combinations.

Here is where Guile does some wild trickery that I didn't fully appreciate until this morning when following all of the code paths. It associates a vector of eight random integers with each type object, called the "hashset". These values are the possible hash values of the class.

Then, when adding a method to the dispatch cache, Guile first tries rehashing with the first element of each type's hashset. If there are no collisions, the cache is saved, along with the index into the hashset. If there are collisions, it continues on, eventually selecting the cache and hashset index with the fewest collisions.

This necessitates storing the hashset index on the generic itself, so that dispatch knows how the cache has been hashed.
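
Here is a sketch of the cache-building side of that trick. Everything in it is illustrative: class-hashset is a hypothetical accessor standing in for the eight seeds Guile attaches to each class, and the real code is in C.

;; A sketch of seed selection: hash each cached type-set with each of
;; the eight seed indices, and keep the index that collides least.
(use-modules (srfi srfi-1))   ; delete-duplicates

(define hashset-size 8)

(define *hashsets* (make-hash-table))

(define (class-hashset class)
  ;; Hypothetical accessor: stands in for the vector of eight random
  ;; seeds that Guile stores on each class object.
  (or (hashq-ref *hashsets* class)
      (let ((v (make-vector hashset-size)))
        (do ((i 0 (+ i 1))) ((= i hashset-size))
          (vector-set! v i (random 1000000)))
        (hashq-set! *hashsets* class v)
        v)))

(define (hash-types classes i n)
  ;; Sum the i-th seed of each class, modulo the cache size.
  (modulo (apply + (map (lambda (c)
                          (vector-ref (class-hashset c) i))
                        classes))
          n))

(define (pick-hashset-index type-sets n)
  ;; Try each seed index on all the cached type-sets; keep the index
  ;; with the fewest collisions.
  (let loop ((i 0) (best 0) (best-collisions +inf.0))
    (if (= i hashset-size)
        best
        (let* ((buckets (map (lambda (ts) (hash-types ts i n))
                             type-sets))
               (collisions (- (length buckets)
                              (length (delete-duplicates buckets)))))
          (if (< collisions best-collisions)
              (loop (+ i 1) i collisions)
              (loop (+ i 1) best best-collisions))))))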

I think this strategy is really neat! Actually I wrote this whole article just to get to here. So if you missed my point and actually are interested, let me know and I can see about explaining better.

fizzleout

Once you know exactly what types are running through your code, it's not just about dispatch -- you can compile your code given those types, as if you were in a static language but with much less work. That's what Psyco does, and I plan on getting around to the same at some point.

Of course there are many more applications of this in the dynamic languages community, but since the Fortran family of languages has had such a stranglehold on programmers' minds for so many years, the techniques remain obscure to many.

Thankfully, as the years pass, other languages approach Lisp more and more. Here's to 50 years of the Buddhist virus!