In [1]:
:load KarpsDisplays KarpsDagDisplay
:extension OverloadedStrings
import Spark.Core.Dataset
import Spark.Core.Context
import Spark.Core.Functions
import Spark.Core.Column
import Spark.Core.ColumnFunctions

Organizing pipelines with scopes and paths

If you have worked with large Spark jobs, you know that workflows can become really large and convoluted. Karps provides some tools to decompose and organize workflows, but still retain the power of lazy execution and full-pipeline optimization.

As a simple example, we will compute the mean of a dataset containing integers. In practice, such an operation that involves computing the mean is just one step of a much larger pipeline. We will see how to build and vizualize such pipelines.

Like Spark, Karps comes with a built-in visualization tool to explore how data is being generated. For any dataset or observable, you can use the showGraph command to see the graph of computations for this node. This graph is displayed thanks to the TensorBoard tool that comes with Google TensorFlow.

In [2]:
set = dataset ([1 ,2, 3, 4]::[Int]) @@ "initial_set"
myCount = count set
showGraph myCount

Organizing computations with Scopes

Here is a function that computes the mean of a dataset of integers. For the sake of convenience, we give some names to the various pieces of the graph:

In [3]:
myMean :: Dataset Int -> LocalData Double
myMean ds' =
  let s2 = asDouble (count ds') @@ "count"
      c1 = asCol ds'
      s1 = asDouble(sumCol c1) @@ "sum"
  in (s2 / s1) @@ "mean"

Then we can use this function on a dataset:

In [4]:
let ds = dataset ([1,2,3] :: [Int]) @@ "data"
m = myMean ds
x = m + 1

Here is the graph of operations. All the operations are flattened into the same scope, so it is hard to see the big picture.

In [5]:
showNameGraph x
In [6]:
let ds = dataset ([1 ,2, 3, 4]::[Int]) @@ "data"
let c = count ds
let c2 = (c + identity c) `logicalParents` [untyped ds] @@ "c2"

Karps provides a way to deal with that by marking a node with what its 'logical' parents are. All the computation nodes between the logical parent(s) and the final node will presented as children of the output node.

In [7]:
:t logicalParents
logicalParents :: forall loc a. ComputeNode loc a -> [UntypedNode] -> ComputeNode loc a

Here is the mean function, this time with some labels for logical parents.

In [8]:
myMean2 :: Dataset Int -> LocalData Int
myMean2 ds' =
  let s2 = count ds' @@ "count"
      c1 = asCol ds'
      s1 = sumCol c1 @@ "sum"
  -- Only one parent (ds')
  -- Also, the name is NOT set. It will be set by the calling node.
  in ((s1 + s2) `logicalParents` [untyped ds'])
In [9]:
ds = dataset ([1,2,3] :: [Int]) @@ "data"
m2 = (myMean2 ds) @@ "my_mean"
x2 = m2 + 1
In [10]:
showGraph x2

Now the content of the function is nicely packed under the scope of the function. Tensorboard lets us recursively expand and explore the content of the blocks:

In [11]:
showGraph x2

Of course, this nesting can occur for arbitrary depths.

Note In this preview, the algorithm that performs the name resolution may fail for complicated cases.

Conclusion

Organizing computations is important for large pipelines in practice. Karps provides some basic tools to organize and compose pipelines of arbitrary complexity.

In [ ]: