Hi there! This is the first in a series of articles on compositional distributed systems, showing various neat things you can do with Unison's distributed programming support. This article presents an API for computations on distributed datasets. It's a bit like Spark, though we'll make different design choices in a few key areas. Our library will be absolutely tiny: the core data type is 3 lines of code, and operations like Seq.map are just a handful of lines each.
Spark is a huge project (over a million lines of code) with lots of functionality, but we're interested in just the core idea of a distributed immutable dataset that can be operated on in parallel.
Here's a quick preview of what we'll get to in this article:
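(The preview snippet from the original post isn't reproduced here. The sketch below is a hypothetical reconstruction consistent with the surrounding text, which mentions Seq.map, Seq.reduce, and a run producing Right 735, the sum of the multiples of 7 between 1 and 100. The name Seq.filter and the exact pipeline are assumptions.)

```unison
-- Hypothetical reconstruction of the preview; Seq.filter and the
-- pipeline shape are assumptions, not the article's exact code.
distributed : Seq k Nat ->{Remote} Nat
distributed dseq =
  dseq
    |> Seq.map (x -> x + 1)
    |> Seq.filter (x -> Nat.mod x 7 == 0)
    |> Seq.reduce 0 (+)
```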
This code will operate in parallel on a distributed sequence of any size, spread across any number of machines. The functions are "moved to the data" so there's little network traffic during execution, and the distributed sequence Seq is lazy, so nothing actually happens until the Seq.reduce. Also, the data structure fuses all operations so they happen in one pass over the data.
Any stage of the computation can be cached via functions like a4.distributed.Seq.memoReduce. These functions cache subtrees of the computation as well, not just the root, so repeated runs of related jobs only have to compute on the diff of what's new. Imagine batch jobs that finish in seconds rather than hours.
Developing distributed programs doesn't require building containers or deploying jar files. Due to Unison's design, dependency conflicts are impossible and there's no serialization code to write. We can focus on the core data structures, algorithms, and business logic.
To test our code, we can use a simple local interpreter such as a4.distributed.Remote.pure:

    a4.distributed.Remote.pure.run '(distributed (fromChunkedList 10 (List.range 0 100)))
    ⧨ Right 735
... or perhaps a more interesting "chaos monkey" local interpreter that injects failures, delays, and simulated network partitions. It's also possible to write interpreters that determine the network usage of your program by local simulation ("Oh no! This sort is shuffling lots of data over the network!"), letting you diagnose and fix performance problems without having to deploy or run jobs in production.
We get all this from a library that is tiny (the core data type is just 3 lines) and extensible. For instance, Seq.map is a few lines of straightforward code, and any user is empowered to add new operations. We'll explain this code and how it works next in the tutorial:
Operations on the distributed data structure end up looking similar to an in-memory version, just with an extra Value.map to move the computation inside a remote distributed.Value. Any immutable data structure can be turned into a distributed version by wrapping references in remote values.
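To make that concrete, here's a hypothetical sketch of what such a type and its map could look like, assuming a binary tree whose branches are wrapped in remote Values (the tutorial gives the library's real definitions; the constructor names here are illustrative):

```unison
-- A hypothetical 3-line Seq: a binary tree whose branches are remote,
-- lazy Values. Constructor names are assumptions for illustration.
type Seq k a = Empty | One a | Two (Value (Seq k a)) (Value (Seq k a))

-- Seq.map reads like an in-memory map; Value.map moves the recursive
-- calls inside the remote values rather than forcing them locally.
Seq.map : (a -> b) -> Seq k a -> Seq k b
Seq.map f = cases
  Empty   -> Empty
  One a   -> One (f a)
  Two l r -> Two (Value.map (Seq.map f) l) (Value.map (Seq.map f) r)
```

Because Value.map only describes work to be done where the data lives, mapping never pulls the sequence across the network; forcing happens later, at a reduce.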
Continue on to the tutorial for a detailed explanation.