In this post we'll build our own distributed data structure from first principles. The version developed here is a little simpler than the Seq used in the "real" version of the library. We'll get to the real Seq in Part 5 of this article.
First, let's look at a simple in-memory tree data structure and understand how it can be modified to represent data spread across multiple nodes. Here's the in-memory version:
structural type SimpleTree a
  = One a
  | Empty
  | Two (SimpleTree a) (SimpleTree a)
A SimpleTree is either SimpleTree.Empty, a leaf (SimpleTree.One), or a branch (SimpleTree.Two).
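For instance, here's a small in-memory tree of Nats (threeNumbers is a made-up example value, not part of the article's codebase):

threeNumbers : SimpleTree Nat
threeNumbers =
  SimpleTree.Two
    (SimpleTree.One 1)
    (SimpleTree.Two (SimpleTree.One 2) (SimpleTree.One 3))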
All the leaves and subtrees in a SimpleTree are references to values held in local memory on the same machine. If we instead want the tree to contain references to values that may exist at another location, we can write the following:
structural type src.Tree a
  = Tree.Two
      (distributed.Value (src.Tree a))
      (distributed.Value (src.Tree a))
  | Empty
  | Tree.One (distributed.Value a)
All we've done here is wrap each branch and leaf of the tree in a data type, distributed.Value, which represents a lazy value at a possibly remote location.
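To get a feel for the type, here's one hypothetical way (the name localPair is ours, not from the article's codebase) to build a tiny src.Tree whose values all happen to be local, using Value.pure : a -> distributed.Value a, a library function we'll look at more closely below:

localPair : src.Tree Nat
localPair =
  Tree.Two
    (Value.pure (Tree.One (Value.pure 1)))
    (Value.pure (Tree.One (Value.pure 2)))

The shape is the same as a SimpleTree; the only difference is that every leaf and subtree sits behind a distributed.Value, so each could just as well live on another machine.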
With this simple type definition, we can now implement functions like a distributed map, reduce, and so on without needing to write any networking or serialization code.

Let's try it for our src.Tree. This will be instructive!
stub.Tree.map : (a ->{Remote} b) -> src.Tree a ->{Remote} src.Tree b
stub.Tree.map f = cases
  src.Tree.Empty -> src.Tree.Empty
  Tree.One valueA -> todo "🧐 hmmm, apply f to Value"
  Tree.Two leftValue rightValue ->
    todo "something recursive goes here 🤨"
We know we'll need to pattern match on the data constructors of src.Tree and perform a recursive call of some kind, but now that the leaf and branch are both wrapped in distributed.Value, what should we do? There are two ways to do this that typecheck.
The first is not quite what we want, but it's instructive nonetheless. It makes ample use of Value.get : distributed.Value a ->{Remote} a to force the lazy distributed.Value, and Value.pure : a -> distributed.Value a to create an in-memory distributed.Value from a plain value:
eager.Tree.map : (a ->{Remote} b) -> src.Tree a ->{Remote} src.Tree b
eager.Tree.map f = cases
  src.Tree.Empty -> src.Tree.Empty
  Tree.One valueA -> Tree.One (Value.pure (f (Value.get valueA)))
  Tree.Two leftValue rightValue ->
    use Value get pure
    use eager.Tree map
    Tree.Two
      (pure (map f (get leftValue)))
      (pure (map f (get rightValue)))
To see why this isn't what we want, let's look at the documentation of Value.get and Value.pure:
Value.get
  Obtain the a denoted by this distributed.Value. A few properties:

    Value.get (Value.pure a) Universal.== a
    Value.get (Value.at loc a) Universal.== a
    f (Value.get v) Universal.== Value.get (Value.map f v)

Value.pure
  Create an in-memory distributed.Value. Satisfies the property:

    Value.get (Value.pure a) Universal.== a
We're summoning the value from a potentially remote location with Value.get, and then Value.pure creates a value in memory at the location where it's called. Our implementation of eager.Tree.map will thus end up sending the entire tree to the original caller of the function, applying the function there, and storing the resulting tree in memory at that location. That's no good: presumably the whole reason we're using a distributed data structure is that the data is too big to fit in memory on a single machine.
The version of map we want will do its work lazily, when the resulting src.Tree is forced, and it will do the mapping in parallel at the locations where the data lives rather than shipping everything to the caller.
A good rule of thumb when implementing functions like Tree.map is to never call Value.get. Instead we will use Value.map to lazily apply a function at the location where the remote value lives.
src.Tree.map : (a ->{Remote} b) -> src.Tree a -> src.Tree b
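-- One possible body for the signature above (a sketch; we assume
-- Value.map : (a ->{Remote} b) -> distributed.Value a -> distributed.Value b).
-- Note there is no Value.get anywhere: both f and the recursive calls
-- are deferred inside Value.map.
src.Tree.map f = cases
  src.Tree.Empty -> src.Tree.Empty
  Tree.One valueA -> Tree.One (Value.map f valueA)
  Tree.Two leftValue rightValue ->
    Tree.Two
      (Value.map (src.Tree.map f) leftValue)
      (Value.map (src.Tree.map f) rightValue)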
While you can perhaps see how this code typechecks, what this code does is a bit brain-bending. First, because our src.Tree.map function does not call Value.get, src.Tree.map is just the blueprint for the data transformation, not the transformed data itself. Not until the tree is evaluated (by, say, the reduce function we'll cover in Part 3) does any mapping actually happen. This laziness gives us fusion: if we call src.Tree.map multiple times, the functions will get composed together and applied in the same pass over the data as the reduce. Moreover, because we are using Value.map, the function will be applied at the location where the data lives, rather than shipping the entire tree to the caller and applying the transformation function there.
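To make the fusion concrete, here's a hypothetical pipeline (the name pipeline is ours, not from the article's codebase). Calling pipeline computes nothing by itself: the doubling and the rendering to Text are composed inside the deferred Value.map calls and will run in a single pass when the tree is eventually forced:

pipeline : src.Tree Nat -> src.Tree Text
pipeline tree =
  src.Tree.map Nat.toText (src.Tree.map (x -> x * 2) tree)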
See if you can write more functions for src.Tree. We've provided a codebase with the necessary dependencies and stubbed functions. To pull in the codebase, run the following in the UCM:

.> pull git@github.com:unisonweb/website:.articles.distributedDatasets .article1
The src.Tree function stubs are under Tree.
Takeaways
We learned a few things in this part:
- To make a data structure distributed, wrap distributed.Value around the fields it defines in its data constructors. This lets the data structure represent the data being "elsewhere" with a minimum amount of ceremony.
- Use Value.map judiciously to build lazy computations like src.Tree.map where the function is brought to the data.
We also saw how to obtain different runtime behaviors for your distributed programs through implementations of the Tree.map function using Value.map or Value.get.
As a library author, you do have to be explicit when describing how your program should behave, and we consider this a good thing: you can achieve exactly what you want with a tiny amount of code, and you can wrap up common patterns in reusable functions like src.Tree.map that anyone can use without being a distributed systems expert!
If you've been wondering "how do I evaluate this?", we'll cover that next, in Part 3. Again there will be some instructive decisions to make: how we implement functions like reduce will determine whether the work happens in parallel, close to the data, or whether the data gets sent to the caller and reduced there.