Spark-like datasets in Unison

Spark-like distributed datasets in under 100 lines of Unison

  • Rebecca Mark
  • Paul Chiusano

Hi there! This is the first in a series of articles oncompositional distributed systems,showing various neat things you can do with Unison's distributed programming support. This article presents an API for computations on distributed datasets. It's a bit like Spark, though we'll make different design choices in a few key areas. Our library will be absolutely tiny — the core data type is 3 lines of code, and operations likeSeq.mapare just a handful of lines each.

You can try writing programs with this API today, using theRemote.pure.runlocal interpreter. It's not very fast but that's okay for development and testing. Programs written usingRemotecan run unchanged on actual distributed infrastructure such asUnison Cloud.

Spark is a huge project (over a million lines of code) with lots of functionality, but we're interested in just the core idea of a distributed immutable dataset that can be operated on in parallel.

Here's a quick preview of what we'll get to in this article:

src.ex1 : Seq k Nat ->{Remote} Nat
src.ex1 dseq =
  use Nat + ==
    |> (x -> x + 1)
    |> Seq.filter (x -> Nat.mod x 7 == 0)
    |> Seq.reduce 0 (+)

This code will operate in parallel on a distributed sequence of whatever size, spread across any number of machines. The functions are "moved to the data" so there's little network traffic during execution, and the distributed sequenceSeqis lazy so nothing actually happens until theSeq.reduce.Also, the data structure fuses all operations so they happen in one pass over the data.

Any stage of the computation can be cached via functions likeSeq.memoormemoReduce.These functions cache subtrees of the computation as well, not just the root, so repeated runs of related jobs only have to compute on the diff of what's new. Imagine batch jobs that finish in seconds rather than hours.

Developing distributed programs doesn't require building containers or deploying jar files. Due to Unison's design, dependency conflicts are impossible and there's no serialization code to write. We can focus on the core data structures, algorithms, and business logic.

To test your code, we can use a simple local interpreter such

... or perhaps a more interesting "chaos monkey" local interpreter that injects failures and delays and simulated network partitions. Interpreters are also possible that determine the network usage of your program by local simulation ("Oh no! This sort is shuffling lots of data over the network!") allowing you to diagnose and fix performance problems without having to deploy or run jobs in production.

We get all this from a library that is tiny (the core data type is just 3 lines) and extensible. For instance,Seq.mapis a few lines of straightforward code, and any user is empowered to add new operations. We'll explain this code and how it works next in the tutorial: : (a ->{Exception, Remote} b) -> Seq k a -> Seq Defer b : (a ->{Exception, Remote} b) -> Seq k a -> Seq Defer b f s =
  go = cases
    Two.Empty        -> Two.Empty
    Two.One a        -> Two.One (f a)
    Two.Two mode l r ->
      use Seq map
      Two.Two mode (map f l) (map f r)
  Seq ( go (Seq.unwrap s))

Operations on the distributed data structure end up looking similar to an in-memory version, just with an extraValue.mapto move the computation inside aremotedistributed.Value.Any immutable data structure can be turned into a distributed version by wrapping references in remote values.

Continue on to the tutorial for a detailed explanation.

If you're interested in using Unison at work for this sort of thing, we'd love to chat with you.