In this post we'll build our own distributed data structure from first principles. The version developed here is a little simpler than the Seq used in the "real" version of the library. We'll get to the real Seq in Part 5 of this article.
First, let's look at a simple in-memory tree data structure and understand how it can be modified to represent data spread across multiple nodes. Here's the in-memory version:
structural type SimpleTree a
  = One a
  | Empty
  | Two (SimpleTree a) (SimpleTree a)
A SimpleTree is either SimpleTree.Empty, a leaf (SimpleTree.One), or a branch (SimpleTree.Two).
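For instance, here's a small in-memory tree of Nats (threeNumbers is a made-up example value, not part of the article's codebase):

threeNumbers : SimpleTree Nat
threeNumbers =
  SimpleTree.Two
    (SimpleTree.One 1)
    (SimpleTree.Two (SimpleTree.One 2) (SimpleTree.One 3))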
All the leaves and subtrees in a SimpleTree are references to values held in local memory on the same machine. If we instead want the tree to contain references to values that may exist at another location, we can write the following:
structural type src.Tree a
  = Tree.Two
      (distributed.Value (src.Tree a))
      (distributed.Value (src.Tree a))
  | Empty
  | Tree.One (distributed.Value a)
All we've done here is wrap each branch and leaf of the tree in a data type, distributed.Value, which represents a lazy value at a possibly remote location.
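To get a feel for the type, here's one hypothetical way (the name localPair is ours, not from the article's codebase) to build a tiny src.Tree whose values all happen to be local, using Value.pure : a -> distributed.Value a, a library function we'll look at more closely below:

localPair : src.Tree Nat
localPair =
  Tree.Two
    (Value.pure (Tree.One (Value.pure 1)))
    (Value.pure (Tree.One (Value.pure 2)))

The shape is the same as a SimpleTree; the only difference is that every leaf and subtree sits behind a distributed.Value, so each could just as well live on another machine.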
With this simple type definition, we can now implement functions like a distributed map, reduce, and so on without needing to write any networking or serialization code.

Let's try it for our src.Tree. This will be instructive!
stub.Tree.map : (a ->{Remote} b) -> src.Tree a ->{Remote} src.Tree b
stub.Tree.map f = cases
  src.Tree.Empty -> src.Tree.Empty
  Tree.One valueA -> todo "🧐 hmmm, apply f to Value"
  Tree.Two leftValue rightValue ->
    todo "something recursive goes here 🤨"
We know we'll need to pattern match on the data constructors of src.Tree and perform a recursive call of some kind, but now that the leaf and branch are both wrapped in distributed.Value, what should we do? There are two ways to do this that typecheck.
The first is not quite what we want, but it's instructive nonetheless. It makes ample use of Value.get : distributed.Value a ->{Remote} a to force the lazy distributed.Value, and Value.pure : a -> distributed.Value a to create an in-memory distributed.Value from a plain value:
eager.Tree.map : (a ->{Remote} b) -> src.Tree a ->{Remote} src.Tree b
eager.Tree.map f = cases
  src.Tree.Empty -> src.Tree.Empty
  Tree.One valueA -> Tree.One (Value.pure (f (Value.get valueA)))
  Tree.Two leftValue rightValue ->
    use Value get pure
    use eager.Tree map
    Tree.Two
      (pure (map f (get leftValue)))
      (pure (map f (get rightValue)))
To see why this isn't what we want, let's look at the documentation of Value.get and Value.pure:
Value.get
  Obtain the a denoted by this distributed.Value. A few properties:

    Value.get (Value.pure a) Universal.== a
    Value.get (Value.at loc a) Universal.== a
    f (Value.get v) Universal.== Value.get (Value.map f v)

Value.pure
  Create an in-memory distributed.Value. Satisfies the property:

    Value.get (Value.pure a) Universal.== a
We're summoning the value from a potentially remote location with Value.get, and then Value.pure creates a value in memory at the location where it's called. Our implementation of eager.Tree.map will thus end up sending the entire tree to the original caller of the function, applying the function there, and storing the resulting tree in memory at that location. That's no good: presumably the whole reason we're using a distributed data structure is that the data is too big to fit in memory on a single machine.
The version of map we want will do its work lazily, when the resulting src.Tree is forced, and it will do the mapping in parallel at the locations where the data lives rather than shipping everything to the caller.
A good rule of thumb when implementing functions like Tree.map is to never call Value.get. Instead we will use Value.map to lazily apply a function at the location where the remote value lives.
src.Tree.map : (a ->{Remote} b) -> src.Tree a -> src.Tree b
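-- One possible body for the signature above (a sketch; we assume
-- Value.map : (a ->{Remote} b) -> distributed.Value a -> distributed.Value b).
-- Note there is no Value.get anywhere: both f and the recursive calls
-- are deferred inside Value.map.
src.Tree.map f = cases
  src.Tree.Empty -> src.Tree.Empty
  Tree.One valueA -> Tree.One (Value.map f valueA)
  Tree.Two leftValue rightValue ->
    Tree.Two
      (Value.map (src.Tree.map f) leftValue)
      (Value.map (src.Tree.map f) rightValue)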
While you can perhaps see how this code typechecks, what this code does is a bit brain-bending. First, because our src.Tree.map function does not call Value.get, src.Tree.map is just the blueprint for the data transformation, not the transformed data itself. Not until the tree is evaluated (by, say, the reduce function we'll cover in Part 3) does any mapping actually happen. This laziness gives us fusion: if we call src.Tree.map multiple times, the functions will get composed together and applied in the same pass over the data as the reduce. Moreover, because we are using Value.map, the function will be applied at the location where the data lives, rather than shipping the entire tree to the caller and applying the transformation function there.
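To make the fusion concrete, here's a hypothetical pipeline (the name pipeline is ours, not from the article's codebase). Calling pipeline computes nothing by itself: the doubling and the rendering to Text are composed inside the deferred Value.map calls and will run in a single pass when the tree is eventually forced:

pipeline : src.Tree Nat -> src.Tree Text
pipeline tree =
  src.Tree.map Nat.toText (src.Tree.map (x -> x * 2) tree)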
See if you can write more functions for src.Tree. We've provided a codebase with the necessary dependencies and stubbed functions. To pull in the codebase, run the following in the UCM:

.> pull git@github.com:unisonweb/website:.articles.distributedDatasets .article1
The src.Tree function stubs are under Tree.
Takeaways
We learned a few things in this part:
- To make a data structure distributed, wrap distributed.Value around the fields it defines in its data constructors. This lets the data structure represent the data being "elsewhere" with a minimum amount of ceremony.
- Use Value.map judiciously to build lazy computations like src.Tree.map where the function is brought to the data.
We also saw how to obtain different runtime behaviors for your distributed programs through implementations of the Tree.map function using Value.map or Value.get.
As a library author, you do have to be explicit when describing how your program should behave, and we consider this a good thing: you can achieve exactly what you want with a tiny amount of code, and you can wrap up common patterns in reusable functions like src.Tree.map that anyone can use without being a distributed systems expert!
If you've been wondering "how do I evaluate this?", we'll cover that next, in Part 3. Again there will be some instructive decisions to make: how we implement functions like reduce will determine whether the work happens in parallel, close to the data, or whether the data gets sent to the caller and reduced there.