JIT compilation is coming to Unison: an early progress report

In the past couple weeks, I've been working on a new method of compiling unison programs to native code. Unison already has a way of packaging a standalone program into a binary file that starts up very quickly, but that still uses the same interpreter written in Haskell. While the interpreter is written to be reasonably efficient given optimized intermediate code, unfortunately there hasn't been an opportunity to spend a significant amount of time on optimizations that would realize its full potential. Also, even in the best case, an interpreter will always have significant overhead compared to native code.

Thus, compiling to native code seems like an inevitable step. There are several considerations that would make a target more or less suitable/convenient:

The compilation has to have the ability to work just-in-time. It's okay to be able to precompile binaries from known code, but one of the primary uses we have for the standalone binaries right now is to implement a server that evaluates unison code sent to it from another environment. That code is not known ahead of time; it is serialized and sent from the client, and needs to be deserialized, compiled and executed on the server.
Tail calls need to be handled properly. They’re a useful feature in any language but in Unison they’re also the only mechanism for writing loops.
Ideally, the target would support delimited continuations as an implementation of abilities. There are ways of implementing algebraic effects without delimited continuations, but continuations are a better match (I'll explain further on).
It would be advantageous if the target already had a reasonable implementation of various low level compilation details. Unison isn't really novel in ways that would require a specialized garbage collector or native code generator, so it would be good if we can avoid needing to implement these things from scratch.
The runtime should be capable of doing some relatively sophisticated things. Right now Unison wraps features of GHC's runtime, like lightweight threads with asynchronous I/O, and software transactional memory. Ideally it wouldn't be necessary to completely reimplement these features.

Based on these considerations, we decided that a Chez Scheme backend was a promising direction. Scheme and Lisp in general are well known for supporting dynamic code loading features, and well behaved tail calls are one of the defining features of Scheme. Continuations are also heavily associated with scheme, so it's likely that most implementations would be well equipped in that area, too. Since it's meant to be usable as a functional language in its own right, it has garbage collection and reasonable library capabilities. Moreover it seems to be fairly well equipped as far as generating code for functional languages goes.

Jared Forsyth did some early prototyping of a compilation to Scheme pipeline and was getting some nice results compiling pure code, and that really inspired us to look more closely at Scheme as a serious target. We also knew that the rewrite of the Idris language had initially based its backend on Chez (although I think it has several backends again at this point). But when researching, I also discovered that Racket has been moving toward replacing its backend with Chez as well. The Racket report specifically suggests that Chez could be a better target for functional languages than a lot of other common compiler backends, since many of the latter have limitations based around the needs of more traditional imperative languages.

🤔

What about LLVM?

For a high-level functional language like Unison, LLVM is pretty low-level as a compilation target. It provides something like a platform independent assembly language, saving you from needing to write a register allocator and native code generator. LLVM can also support tail calls provided you use one of the right calling conventions.

It doesn't provide any runtime services out of the box, such as a garbage collector, continuations and/or delimited continuations, lightweight threads, async I/O, etc, which means it can be a considerable amount of work to build a serious LLVM-based JIT with these features.

Probably the most significant gap in the capabilities of the Chez runtime is software transactional memory. Unsurprisingly (since it's very difficult to use correctly without an effect checking system), Chez doesn't implement it. Glancing at the original STM paper for GHC, the implementation they describe doesn't sound too difficult, and looks doable using the concurrency primitives present in Chez.

What's working so far

So far, I've written (part of) a compiler from one of the existing intermediate code representations to Scheme code. The intermediate representation uses A-normal form with nested pattern matching desugared to single level matching and all local function definitions floated to an outer-most recursive let block for each top level definition. This level of normalization is not all completely necessary for compilation to Scheme, but it is the format we're already using to send code between Unison instances, so it would need to be handled anyway.

I started by writing the Scheme emitter in Haskell using the existing data structures there. But after some discussion, Paul and I realized that in order for the dynamic loading of Unison code to work in a program compiled to Scheme, it'd be necessary to have the compilation from Unison to Scheme available in Scheme itself. One option is to just write it directly in Scheme, but another option is to write it in Unison, which is then compiled to Scheme. This is what I've done most recently: I have a version of the intermediate data structures defined as Unison data types, along with ways of parsing the binary format for code interchange into those Unison types, and a compiler that emits Scheme definitions from those. Right now it is rather limited, only supporting pure code and a few arithmetic builtins, but it is possible to compile those Unison definitions to Scheme just by running a Unison program. For instance, if we define:

codeTest : '{IO,Exception} ()
codeTest _ = printScheme (termLink addMain)

And execute run codeTest at the prompt, we get output ending with:

(define-unison
  (ref-3YXYAR7AJAIUWLOQDTCM4VL3FYDCHJ x1)
  (letrec-unison
    ([(x0 x1)
      (let* ([x2 10000000]
             [x3 (identity x2)])
        (ref-M5Y65PHZ3GWHMCUMIJ3LGLBLKPSYMA x3))])
    (let* ([x2 100]
           [x3 (identity x2)]
           [x4 (x0)])
      (ref-ZQRWH24DGQ437B72KCOZXGRZAJJW2U x3 x4))))

(ref-3YXYAR7AJAIUWLOQDTCM4VL3FYDCHJ #f)

The full Unison API for code loading requires a fair bit more than just this sort of code generation, but it's an important component. Unfortunately I haven't yet filled in enough primitive operations to be able to compile the compiler itself to Scheme. The above code comes from a nested loop that prints the sum of the numbers from 0 to 10 million 100 times. Using the compiled scheme code is around 470x faster than our current interpreter, at least for this simple arithmetic loop.

While we could produce much better intermediate code for our interpreter by applying standard optimizations (such as worker/wrapper transformations, inlining, etc), Scheme is providing great performance with our more naive intermediate code, presumably because it already does many of these optimizations itself.

Implementing abilities

One major piece currently missing is the implementation of the primitives for abilities. As alluded to above, the way abilities work in the intermediate code and interpreter is based on delimited continuations. This section walks through some of the technical details of this but feel free to skip ahead.

Scheme is well known for its support of continuations. However, the sort in the (R6RS) Scheme standard are not the ones we want. They are based around the operation call-with-current-continuation, which I'll abbreviate to call/cc going forward. The idea is that if we write:

(call/cc (lambda (k) ...))

Then the lambda expression gets called with a special function that has captured the control flow context leading up to that point in the program. The lambda expression may return normally, in which case its result is supplied to that context. However, calling k also causes the program to act like the argument to k was returned from the lambda expression. This is true even if k escapes the scope of the call/cc block, which means that it has the power to replace the entire remaining control flow context anywhere in the program.

An ability handler clearly involves a continuation as well. A pattern match implementing one looks something like:

{ E x y z -> k } -> ...

k being the continuation again. However, these continuations behave much more like functions. Calling k does not completely replace the execution context, but adds to it like a normal function. It's possible that k would have captured some Exception effects that would jump to a top-level handler, but this is also no different from functions, and those effects can be handled so that they don't escape.

This is where delimited continuations come in. They are instead based around a pair of operations, the simplest of which are shift and reset. shift is similar to call/cc, and a use would look something like:

(shift (lambda (k) ...))

There are three primary differences:

shift only captures into k a portion of the control flow context, limited by the nearest enclosing reset.

k actually behaves like a function. To know what (k x) does, imagine if x were returned from the original (shift ...) expression, and trace the execution until you find the value y that would be returned from the enclosing (reset ...) expression. Then (k x) evaluates to y.

Here is a concrete example.

Consider the expression:

      (+ 1 2
        (reset
          (+ 1
             2
             (shift (lambda (k) (k (k 5))))
             3)))

This evaluates to 20, because the k captured is essentially the inner addition, meaning it's equivalent to:

      (let ((f (lambda (x) (+ 1 2 x 3))))
        (+ 1 2 (f (f 5))))

If the shift is replaced with call/cc (removing the redundant reset), then the result is 14, because it is equivalent to:

      (+ 1 2 (+ 1 2 5 3))

If you want, you can also try using delimited continuations in Unison. shift is the effect part, introduced by the ability:

structural ability Shift r

structural ability Shift r where
  blog.jitAnnounce.Shift.shift : ((a -> r) -> r) ->{Shift r} a

reset is the handler part, definable via:

jitAnnounce.reset : '{e, Shift r} r ->{e} r

jitAnnounce.reset : '{e, Shift r} r ->{e} r
jitAnnounce.reset e =
  h = cases
    { x }            -> x
    { shift e -> k } -> e (x -> (handle k x with h))
  handle e() with h

Then we can try the above example:

use Nat +
v = 1 + 2 + (jitAnnounce.reset do 1 + 2 + shift (k -> k (k 5)) + 3)
v⧨20

One thing to notice is that these operations use a "deep" handler. Shallow handlers actually correspond more closely to another pair of operations known as control (similar to shift) and prompt (similar to reset). A Unison version of these might be defined something like:

structural ability Control r

structural ability Control r where
  blog.jitAnnounce.Control.control :
    ((a ->{Control r} r) -> r) ->{Control r} a

prompt : '{e, Control r} r ->{e} r

prompt : '{e, Control r} r ->{e} r
prompt e =
  handle e()
  with cases
    { x }                      -> x
    { Control.control e -> k } -> e k

Although I haven't experimented much with them, so I'm not sure how precisely they match the theoretical definitions.

shift actually removes the captured portion in k from the control flow context. So, returning directly from the lambda expression yields a value directly to the enclosing reset.

This is essentially how ability handlers work, too. The continuation captured in an ability match only captures a portion of the computation delimited by the associated handle block. So, handle blocks correspond to the reset operation. The big difference is that we also associate the ability matching, and thus the uses of shift with the handle block.

However, the intermediate code actually converts this to something more like the Scheme presentation. Ability matching is compiled to code that uses an operation like shift, so that handled ability requests can actually be implemented by calls to the appropriate generated code. Then references to the implementation are threaded around, and requests are just function calls. So the intermediate code is already ideally suited for a runtime that directly supports delimited continuations.

There are theoretical results about implementing delimited continuations using call/cc and a global mutable variable. I believe the original is in Filinski's Representing Monads. Essentially the Scheme implementation there is:

(define mk (lambda (x) (raise "fell off end")))

(define-syntax reset
  (syntax-rules ()
    [(reset e ...)
     (call/cc
       (lambda (k)
         (let ([ok mk])
           (set! mk (lambda (r) (set! mk ok) (k r)))
           (let ([v (begin e ...)]) (mk v)))))]))

(define-syntax shift
  (syntax-rules ()
    [(shift f)
     (call/cc
       (lambda (sk)
         (let* ([k (lambda (x) (reset (sk x)))]
                [v (f k)])
           (mk v))))]))

However, there seem to be some issues with doing this. Oleg Kiselyov's page on the downsides of call/cc lists some. The most significant problem seems to be that the correctness of the result hinges on every other side effect being subsequently implemented by reduction to shift and reset. However, this ranges from expensive to impossible. Even in Unison, where declared abilities would reduce somehow to the delimited continuations, it is desirable to have some wired in pseudo abilities that implement things more efficiently. Encoding every mutable reference cell using shift and reset, which are themselves encoded using call/cc and a single global variable using closures is a very expensive strategy. And there are also built-in side effects like I/O and threading that are important for us to use.

There may be better 'portable' encodings than the one above. But it seems like what is really needed is a careful, system-specific implementation of delimited continuations that definitely interacts well with other parts of the system. Luckily, it should be possible to get this for Chez. As mentioned, Racket now has a Chez backend, and the former has built-in support for delimited continuations.

Technically, Racket uses its own forked copy of Chez, and the upstream version does not have all the features used for Racket's continuations implementation. There is a pull request against the upstream to add the feature, but it seems to be missing someone to fix some conflicts. I'm uncertain how necessary the feature is just for implementing delimited continuations (Racket uses it for other features as well). We might have to help get the pull request merged, or try using the Racket fork of Chez for our compilation in the end.

Conclusion

There's still quite a bit of work to be done, but I think even some of the early results so far are promising. The arithmetic examples mentioned above are much faster than the existing system, and that is on code actually generated from Unison definitions, not hand-crafted output that we could hypothetically generate given enough optimizations.

I think the data structures I've ported to Unison may also have other uses than this compiler, as well. Paul recently demoed a debugging ability he wrote completely in Unison. While discussing it with him, it occurred to me that a lot of the runtime introspection capabilities I've implemented as part of the JIT pipeline could also be useful for providing an even richer debugging experience, with access to all local variables and other niceties. That's something to investigate another time, though.

Stay tuned for more progress updates on this in the future!