Spark-like datasets in Unison

Spark-like distributed datasets in under 100 lines of Unison

by
  • Rebecca Mark
  • Paul Chiusano

Hi there! This is the first in a series of articles on compositional distributed systems, showing various neat things you can do with Unison's distributed programming support. This article presents an API for computations on distributed datasets. It's a bit like Spark, though we'll make different design choices in a few key areas. Our library will be absolutely tiny — the core data type is 3 lines of code, and operations like Seq.map are just a handful of lines each.

🌱
You can try writing programs with this API today, using the local interpreters for Remote. It's not very fast, but that's okay for development and testing. Programs written using Remote can run unchanged on actual distributed infrastructure such as Unison Cloud.

Spark is a huge project (over a million lines of code) with lots of functionality, but we're interested in just the core idea of a distributed immutable dataset that can be operated on in parallel.

Here's a quick preview of what we'll get to in this article:

ex1 : Seq k Nat ->{Remote} Nat
ex1 dseq =
  use Nat + ==
  dseq
    |> Seq.map (x -> x + 1)
    |> Seq.filter (x -> Nat.mod x 7 == 0)
    |> Seq.reduce 0 (+)

This code will operate in parallel on a distributed sequence of any size, spread across any number of machines. The functions are "moved to the data" so there's little network traffic during execution, and the distributed sequence Seq is lazy, so nothing actually happens until the Seq.reduce. Also, the data structure fuses all operations so they happen in one pass over the data.

Any stage of the computation can be cached via functions like Seq.memo or memoReduce. These functions cache subtrees of the computation as well, not just the root, so repeated runs of related jobs only have to compute on the diff of what's new. Imagine batch jobs that finish in seconds rather than hours.
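For example, inserting a caching stage into the earlier pipeline might look like the sketch below. Treat it as a sketch only: the exact signature of Seq.memo is an assumption here (in practice it might take a cache or storage handle as an argument).

-- Sketch only: Seq.memo's real signature may differ.
ex2 : Seq k Nat ->{Remote} Nat
ex2 dseq =
  use Nat + ==
  dseq
    |> Seq.map (x -> x + 1)
    |> Seq.filter (x -> Nat.mod x 7 == 0)
    |> Seq.memo    -- cache this stage (and its subtrees)
    |> Seq.reduce 0 (+)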

Developing distributed programs doesn't require building containers or deploying jar files. Due to Unison's design, dependency conflicts are impossible and there's no serialization code to write. We can focus on the core data structures, algorithms, and business logic.

To test your code, you can use a simple local interpreter such as run. Here's a rough sketch of what that looks like (run's exact signature and the Seq.fromList helper below are illustrative assumptions, not the library's confirmed API):
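-- Sketch: interpret ex1 locally in a watch expression. Assumes
-- run : '{Remote} a -> a and an illustrative Seq.fromList helper
-- for building a small test dataset.
> run '(ex1 (Seq.fromList [1, 2, 3, 4, 5, 6, 7]))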

... or perhaps a more interesting "chaos monkey" local interpreter that injects failures, delays, and simulated network partitions. It's even possible to write interpreters that determine your program's network usage by local simulation ("Oh no! This sort is shuffling lots of data over the network!"), allowing you to diagnose and fix performance problems without having to deploy or run jobs in production.

We get all this from a library that is tiny (the core data type is just 3 lines) and extensible. For instance, Seq.map is a few lines of straightforward code, and any user is empowered to add new operations. We'll explain this code and how it works next in the tutorial:

Seq.map : (a ->{Exception, Remote} b) -> Seq k a -> Seq Defer b
Seq.map f s =
  use Two Empty One Two
  go = cases
    Empty        -> Empty
    One a        -> One (f a)
    Two mode l r -> Two mode (Seq.map f l) (Seq.map f r)
  Seq (Value.map go (Seq.unwrap s))

Operations on the distributed data structure end up looking similar to an in-memory version, just with an extra Value.map to move the computation inside a remote distributed.Value. Any immutable data structure can be turned into a distributed version by wrapping references in remote values.
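As a quick illustration of that last point, here's a hypothetical example (not part of the library): an ordinary cons list becomes a distributed one by wrapping its recursive reference in a remote Value, exactly the way Seq wraps its tree.

-- Hypothetical: each DCons tail is a remote Value, so the nodes
-- of the list can live on different machines.
type DList a = DNil | DCons a (Value (DList a))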

Continue on to the tutorial for a detailed explanation.

If you're interested in using Unison at work for this sort of thing, we'd love to chat with you.