wingolog: A mostly dorky weblog by Andy Wingo (https://wingolog.org/)

thoughts on blink
Andy Wingo, 4 April 2013
https://wingolog.org/2013/04/04/thoughts-on-blink

So Chromium forked WebKit again! You've probably seen some articles on it already, but Alex Russell's piece is pretty good.

In retrospect the split was easy to see coming, but I can't help but feel bittersweet about it. The announcement and Alex's article both cite technical reasons for the fork, but to me it seems to be ultimately a human failure. So the Chromium team saw problems in WebKit and agreed on a solution that will help them achieve their goals faster: that's great! But why were they unable to do so within WebKit itself?

There are technical differences, yes, but those could be resolved by human action (as they now will be, by humans, under the Blink aegis). No, to me the fundamental problem was a lack of the conditions for consensus within the WebKit project, and for that there is ample blame to go around.

In the best of events this split will end up with both Blink and WebKit as more technically consistent, nimble projects. I would also hope that this split serves as a wakeup call to some of the decision-making processes within WebKit. However it is easy to be pessimistic in this regard -- the Apple-Google dynamic has been great for WebKit as an open project, opening up discussions and collaboration possibilities, especially among folks that don't work for the two major corporations. Both projects will have to actively make an effort to remain open, because passivity will favor the structural tendency to work in-house.

I should probably mention the obvious thing, that Igalia will work with both projects, of course. We'll get our WebKit patches into Blink, and vice versa. The message for customers doesn't change much, either -- most customers have already chosen a particular WebKit port, and those that used Chromium will now use Chromium on Blink.

For those projects that haven't chosen a WebKit port yet, the fundamental decision has always been WebKit2/JSC versus Chromium/V8, and that hasn't changed much with this announcement. It's true that WebKit2 is a less mature multi-process implementation than Chromium, but it is getting a lot of investment these days, so it's fair to assume that, technically, the two approaches will work just as well within a few months. The same goes for JSC versus V8.

Finally, a meta-note. Sometimes these press releases catch us up in their whirlwind and we're left with a feeling of anxiety: "Google just released AngularJS, now I need to rewrite all my apps!?" Well, maybe ;) But I find a bit of dispassion is appropriate after a piece of news like this. Good luck to both projects, and see you on the mailing lists.

inside javascriptcore's low-level interpreter
Andy Wingo, 27 June 2012
https://wingolog.org/2012/06/27/inside-javascriptcores-low-level-interpreter

Good day, hackers! And hello to the rest of you, too, though I fear that this article isn't for you. In the vertical inches that follow, we're going to nerd out with JavaScriptCore's new low-level interpreter. So for those of you that are still with me, refill that coffee cup, and get ready for a look into a lovely hack!

hot corn, cold corn

Earlier this year, JavaScriptCore got a new interpreter, the LLInt. ("Low-level interpreter", you see.) As you probably know, JavaScriptCore is the JavaScript implementation of the WebKit project. It's used in web browsers like Safari and Epiphany, and also offers a stable API for embedding in non-web applications.

In this article, we'll walk through some pieces of the LLInt. But first, we need to describe the problem that the LLInt solves.

So, what is the fundamental problem of implementing JavaScript? Of course there are lots of things you could say here, but I'm going to claim that in the context of web browsers, it's the tension between the need to optimize small amounts of hot code and the need to minimize overhead on large amounts of cold code.

Web browsers are like little operating systems unto themselves, into which you as a user install and uninstall hundreds of programs a day. The vast majority of code that a web browser sees is only executed once or twice, so for cold code, the name of the game is to do as little work as possible.

For cold code, a good bytecode interpreter can beat even a very fast native compiler. Bytecode can be more compact than executable code, and quicker to produce.

All of this is a long introduction to what the LLInt actually is: it's a better bytecode interpreter for JSC.

context

Before the introduction of the LLInt, the so-called "classic interpreter" was just a C++ function with a bunch of labels. Every kind of bytecode instruction had a corresponding label in the function.

As a hack, in the classic interpreter, bytecode instructions are actually wide enough to hold a pointer. The opcode itself takes up an entire word, and the arguments also take up entire words. It is strange to have such bloated bytecode, but there is one advantage, in that the opcode word can actually store the address of the label that implements the opcode. This is called direct threading, and presumably its effects on the branch prediction table are sufficiently good so as to be a win.

Anyway, it means that Interpreter.cpp has a method that looks like this:

// vPC is pronounced "virtual program counter".
JSValue Interpreter::execute(void **vPC)
{
  // Each opcode word holds the address of the label that implements it,
  // so dispatch is a computed goto through the current instruction word.
  goto **vPC;
//...
  // add dst op1 op2: Add two values together.
op_add:
  {
    int dst = (int)(intptr_t)vPC[1];
    int op1 = (int)(intptr_t)vPC[2];
    int op2 = (int)(intptr_t)vPC[3];

    // fp is the frame's virtual register file.
    fp[dst] = add(fp[op1], fp[op2]);

    vPC += 4;
    goto **vPC;
  }

  // jmp offset: Unconditional branch.
op_jmp:
  {
    ptrdiff_t offset = (ptrdiff_t)vPC[1];
    vPC += offset;
    goto **vPC;
  }
//...
}

It's a little different in practice, but the essence is there.

OK, so what's the problem? Well, readers who have been with me for a while might recall my article on static single assignment form, used as an internal representation by compilers like GCC and LLVM. One conclusion that I had was that SSA is a great first-order IR, but that its utility for optimizing higher-order languages is less clear. However, this computed-goto thing that I showed above (goto **vPC) is a form of higher-order programming!
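If you haven't played with GCC's labels-as-values extension before, here is a tiny, self-contained illustration of the mechanism. It has nothing to do with JSC itself; it is just a toy direct-threaded interpreter that compiles with GCC or Clang:

#include <cstdio>

// A toy direct-threaded interpreter: the "program" is an array of
// words, each holding the address of the label that implements it.
int main()
{
  static void *program[] = { &&push2, &&push3, &&add, &&halt };
  int stack[8];
  int sp = 0;
  void **pc = program;

  goto **pc;

push2:
  stack[sp++] = 2;
  goto **++pc;
push3:
  stack[sp++] = 3;
  goto **++pc;
add:
  --sp;
  stack[sp - 1] += stack[sp];
  goto **++pc;
halt:
  printf("%d\n", stack[sp - 1]);   // prints 5
  return 0;
}

The point is just that the jump target is a run-time value, which is exactly the kind of control flow that first-order compiler IRs have trouble modelling.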

Indeed, as the GCC internals manual notes:

Computed jumps contain edges to all labels in the function referenced from the code. All those edges have EDGE_ABNORMAL flag set. The edges used to represent computed jumps often cause compile time performance problems, since functions consisting of many taken labels and many computed jumps may have very dense flow graphs, so these edges need to be handled with special care.

Basically, GCC doesn't do very well at compiling interpreter loops. At -O3, it does get everything into registers on my machine, but it residualizes 54 KB of assembly, whereas I only have 64 KB of L1 instruction cache. Many other machines just have 32 KB of L1 icache, so for those machines, this interpreter is a lose. If I compile with -Os, I get it down to 32 KB, but that's still a tight fit.

Besides that, implementing an interpreter from C++ means that you can't use the native stack frame to track the state of a computation. Instead, you have two stacks: one for the interpreter, and one for the program being interpreted. Keeping them in sync has an overhead. Consider what GCC does for op_jmp at -O3:

op_jmp:
    mov    %rbp,%rax
    ; int currentOffset = vPC - bytecode + 1
    sub    0x58(%rbx),%rax
    sar    $0x3,%rax
    add    $0x1,%eax
    ; callFrame[CurrentOffset] = currentOffset
    mov    %eax,-0x2c(%r11)

    ; ptrdiff_t offset = (ptrdiff_t)vPC[1]
    movslq 0x8(%rbp),%rax

    ; vPC += offset
    lea    0x0(%rbp,%rax,8),%rbp

    ; goto **vPC
    mov    0x0(%rbp),%rcx
    jmpq   *%rcx

First there is this strange prelude, which effectively stores the current vPC into an address on the stack. To be fair to GCC, this prelude is part of the DEFINE_OPCODE macro, and is not an artifact of compilation. Its purpose is to let other parts of the runtime see where a computation is. I tried, but I can't convince myself that it is necessary to do this at the beginning of every instruction, so maybe this is just some badness that should be fixed, if the classic interpreter is worth keeping.
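Reading those annotations back into C gives something like the following sketch. To be clear, this is a paraphrase for illustration, not the actual DEFINE_OPCODE expansion, and the slot index is made up:

#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the prelude: before running an opcode, store
// the current bytecode offset into a known slot of the call frame, so
// that other parts of the runtime can see where this frame is.
static const ptrdiff_t kCurrentOffsetSlot = -1;   // made-up slot index

static inline void recordOffset(intptr_t *callFrame, void **vPC, void **bytecode)
{
  callFrame[kCurrentOffsetSlot] = (vPC - bytecode) + 1;
}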

The rest of the opcode is similar to the version that the LLInt produces, as we will see, but it is less compact.

getting the compiler to make the code you want

The goal of compilation is to produce good code. It can often be a good technique to start from the code you would like to have, and then go backwards and see how to produce it. It's also a useful explanatory technique: we can take a look at the machine code of the LLInt, and use it to put the LLInt's source in context.

In that light, here is the code corresponding to op_jmp, as part of the LLInt:

llint_op_jmp:
    ; offset += bytecode[offset + 1]
    add    0x8(%r10,%rsi,8),%esi
    ; jmp *bytecode[offset]
    jmpq   *(%r10,%rsi,8)

That's it! Just 9 bytes. There is the slight difference that the LLInt moves around an offset instead of a pointer into the bytecode. The whole LLInt is some 14 KB, which fits quite comfortably into icache, even on mobile devices.

This assembly code is part of the LLInt as compiled for my platform, x86-64. The source of the LLInt is written in a custom low-level assembly language. Here is the source for the jmp opcode:

_llint_op_jmp:
    traceExecution()
    dispatchInt(8[PB, PC, 8])

The operand 8[PB, PC, 8] names the word at PB + PC*8 + 8, just like the x86 addressing mode in the disassembly above. You can define macros in this domain-specific language. Here's the definition of dispatchInt:

macro dispatchInt(advance)
    addi advance, PC
    jmp [PB, PC, 8]
end

At this point it's pretty clear how the source corresponds to the assembly. The WebKit build system will produce a version of the LLInt customized for the target platform, and link that interpreter into the JavaScriptCore library. There are backend code generators for ARMv7, X86-32, and X86-64. These interpreters mostly share the same source, modulo the representation differences for JSValue on 32-bit and 64-bit systems.
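To give a feel for the kind of difference involved -- this is a sketch of the general idea, not JSC's actual encoding -- a 32-bit representation might keep an explicit tag word next to a payload word, while a 64-bit representation packs the tag and the payload into a single word:

#include <cstdint>

// Sketch only; the real JSValue encoding differs in its details.
#if INTPTR_MAX == INT32_MAX
// 32-bit flavor: a tag word plus a payload word.
struct SketchValue {
  uint32_t tag;       // is the payload an int32, a cell pointer, ...?
  uint32_t payload;   // the int32 itself, or the pointer
};
#else
// 64-bit flavor: one word, with the type encoded in otherwise-unused bits.
struct SketchValue {
  uint64_t encoded;   // tag bits and payload share the same word
};
#endif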

Incidentally, the "offlineasm" compiler that actually produces the native code is written in Ruby. It's clear proof that Ruby can be fast, as long as you don't load it at runtime ;-)

but wait, there's more

We take for granted that low-level programs should be able to determine how their data is represented. C and C++ are low-level languages, to the extent that they offer this kind of control. But what is not often remarked is that for a virtual machine, its code is also data that needs to be examined at runtime.

I have written before about JSC's optimizing DFG compiler. In order to be able to optimize a computation that is in progress, or to abort an optimization because a speculation failed -- in short, to be able to use OSR to tier up or down -- you need to be able to replace the current stack frame. Well, with the classic interpreter written in C++, you can't do that, because you have no idea what's in your stack frame! In contrast, with the LLInt, JSC has precise control over the stack layout, allowing it to tier up and down directly from the interpreter.
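To make that concrete, here is a hypothetical sketch -- not JSC's actual layout -- of what "precise control over the stack layout" buys you: when every slot of an interpreter frame lives at a known offset, the runtime can walk a frame, read out its state, and write an equivalent frame for another tier in its place.

#include <cstdint>

// Hypothetical frame layout, for illustration only.  Because every
// field sits at a known offset, OSR code can read a frame's locals and
// bytecode offset, and construct a replacement frame for optimized
// (or deoptimized) code.  A frame laid out by a C++ compiler offers no
// such guarantee.
struct SketchFrame {
  SketchFrame *callerFrame;   // link to the caller's frame
  void *returnPC;             // where to resume in the caller
  void *codeBlock;            // which function this frame is running
  int32_t bytecodeOffset;     // how far execution has gotten
  // ...followed in memory by the virtual registers, each at a known slot.
};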

This can obviate the need for the baseline JIT. Currently, the WebKitGTK+ and Apple builds still include the quick-and-dirty baseline JIT as an intermediate tier between the LLInt and the DFG JIT. I don't have sure details about long-term plans, but I would speculate that recent work on the DFG JIT by Filip Pizlo and others has had the goal of obsoleting the baseline JIT.

Note that in order to tier directly to the optimizing compiler, you need type information. Building the LLInt with the DFG optimizer enabled causes the interpreter to be instrumented to record value profiles. These profiles record the types of values seen by instructions that load and store values from memory. Unlike V8, which stores this information in executable code as part of the inline caches, in the LLInt these value profiles are in non-executable memory.
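The shape of the idea is something like the sketch below; JSC's real profiling structures have more to them, but the point is that a profile is plain data in plain, non-executable memory:

#include <cstdint>

// Hypothetical sketch of a value profile: a small bucket of plain data
// owned by one profiled instruction.  The interpreter stores the values
// it sees; the optimizing compiler later reads the bucket to decide
// which types to speculate on.
struct SketchValueProfile {
  uint64_t lastSeenValue;   // most recent boxed value observed here
  uint32_t observedTypes;   // bit set summarizing the types seen so far
};

static inline void profileValue(SketchValueProfile &profile,
                                uint64_t boxedValue, uint32_t typeBit)
{
  profile.lastSeenValue = boxedValue;
  profile.observedTypes |= typeBit;
}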

Spiritually speaking, I am all for run-time code generation. But it must be admitted that web browsers are a juicy target, and while it is unfair to make safe languages pay for bugs in C++ programs, writable text is an attack surface. Indeed in some environments, JIT compilation is prohibited by the operating system -- and in that sort of environment, a fast interpreter like the LLInt is a big benefit.

There are a few more LLInt advantages, like being able to use the CPU's overflow flags, doing better register allocation, and retaining less garbage (JSC scans the stack conservatively). But the last point I want to mention is memory: an interpreter doesn't have to generate executable code just to display a web page. Obviously, you need to complement the LLInt with an optimizing compiler for hot code, but laziness up front can improve real-world page load times.

in your webs, rendering your kittens

So that's the LLInt: a faster bytecode interpreter for JSC.

I wrote this post because I finally got around to turning on the LLInt in the WebKitGTK+ port a couple weeks ago. That means that in September, when GNOME 3.6 comes out, not only will you get an Epiphany with the one-process-per-tab WebKit2 backend, you will also get cheaper and faster JavaScript with the LLInt. Those of you that want the experience in advance can try it out with last Monday's WebKitGTK+ 1.9.4 release.

stranger in these parts
Andy Wingo, 16 May 2012
https://wingolog.org/2012/05/16/stranger-in-these-parts

My JSConf 2012 video is out! Check it out:

The talk is called "Stranger in these parts: A hired gun in the JS corral", and in it I talk about my experiences as a Schemer in the implementation world, with a focus on JavaScriptCore, the JS implementation of the WebKit project.

If you want, you can fetch the slides or the notes. If you were unable to play the video in the browser, you can download it directly (25 minutes, ~80 MB, CC-BY-SA).

Special thanks to the A/V team for the fine recording. My talk was the first one that used the wireless mics, and it turned out there was some intermittent interference. They corrected this in later talks by double-miking the speakers. In my case, it was fortunate that they recorded the room as well, and with (I would imagine) a fair amount of post-processing the sound is perfectly intelligible. Cheers!

Finally, there were a number of other interesting talks whose recordings are starting to come out. I especially liked David Nolen's funhouse of ClojureScript and logic programming. It was pleasantly strange to see him mention Peter Norvig's 1992 book, Paradigms of Artificial Intelligence Programming, because I did too, and I think someone else did as well. Three people mentioning a somewhat obscure 20-year-old book, what are the odds?

I also liked Vyacheslav's amusing talk on V8's optimizing compiler. He actually showed some assembler! Folks that read this webrag might find it interesting. Dan Ingalls' talk was fun too. The ending scene was pretty surreal; be sure to watch all the way through.

Thanks again to the JSConf organizers for the invitation to speak. It was a pleasure to get to hang around the lively and energetic JS community. Happy hacking, all!

webkittens! lexical scoping is in danger!
Andy Wingo, 2 December 2011
https://wingolog.org/2011/12/02/webkittens-lexical-scoping-is-in-danger

The GTK+ WebKittens are on the loose here in Coruña. There's folks here from Red Hat, Motorola, Collabora, and of course Igalia. It's good times; beyond the obvious platitudes of "um, the web is important and stuff" it's good to be in a group that is creating the web experience of millions of users.

My part in that is very small, adding support for block-scoped let and const to JavaScriptCore.

I've made some progress, but it could be going more smoothly. I have made the parser do the right thing for const, correctly raising errors for duplicate bindings, including nested var declarations that get hoisted. The parser is fine: it maintains an environment like you would expect. But the AST assumes that all locals get hoisted to function scope, so there's no provision for e.g. two distinct local variables with the same name. So there is still some work to do on the AST, and it's a thicket of templates.

Hopefully I'll end up with a prototype by the end of the hackfest (Sunday). Sooner if I find that sword of omens, which I seem to have misplaced. Sight beyond sight!