:load KarpsDisplays KarpsDagDisplay
:extension DeriveGeneric
:extension FlexibleContexts
:extension OverloadedStrings
:extension GeneralizedNewtypeDeriving
:extension FlexibleInstances
:extension MultiParamTypeClasses
import Spark.Core.Dataset
import Spark.Core.Context
import Spark.Core.Column
import Spark.Core.ColumnFunctions
import Spark.Core.Functions
import Spark.Core.Row
import Spark.Core.Types
import Spark.Core.Try
import qualified Data.Vector as V
import qualified Data.Text as T
import GHC.Generics
It is common to start data exploration with untyped data, to see how the pieces fit together, and then to introduce more and more types as a pipeline moves into production.
Karps offers both a typed and an untyped API, reflecting different tradeoffs. We are going to explore these tradeoffs on a small task of filtering data.
You are given a dataset about a nearby forest that records observations of various trees, and we are going to compute some statistics about these trees.
rawData = [(1, 3, 2)] :: [(Int, Int, Int)]
For the sake of this example, we manually build up the content in a form that is digestible by Karps. In practice, though, this data would come from one of Spark's many input sources.
let fun (id', width, height) = RowArray (V.fromList [IntElement id', IntElement width, IntElement height])
dataCells = fun <$> rawData
dataCells
The first way to build a dataframe is to combine a datatype with a list of cells. We can build the datatype by hand and then call the dataframe function:
-- The API is not stable yet, importing some internals
import Spark.Core.Internal.TypesFunctions
dt = structType [
structField "treeId" intType,
structField "treeWidth" intType,
structField "treeHeight" intType]
-- The 'dataframe' function builds a new dataframe from a type and some content.
treesDF = dataframe dt dataCells
treesDF
As seen above, this operation succeeded and lets us manipulate the data: we can extract columns, combine them, and so on.
-- The '/-' operator extracts a column of data, using the field name as a string.
idCol = treesDF /- "treeId"
idCol
widthCol = treesDF /- "treeWidth"
doubleWidthCol = (widthCol + widthCol) @@ "doubleWidth"
doubleWidthCol
At the end of the day, all the resulting columns can be packed again in a new dataframe:
outputDF = pack' [idCol, doubleWidthCol]
outputDF
Of course, this API does not prevent us from doing nonsensical operations:
-- What does that mean?
weirdCol = idCol + widthCol
weirdCol2 = idCol + 1
And Haskell will not help us if the schema of the dataframe changes:
treesDF /- "missingColumn"
We can enforce some level of typing by making the structure of the data available to Karps. Here is how to assign a data structure to some data represented by Karps. We start with a simple representation that uses raw types:
data Tree = Tree {
treeId :: Int,
treeWidth :: Int,
treeHeight :: Int } deriving (Generic, Show) -- At least these two classes are required
-- Automatically builds some converters between Spark datatypes and the
-- Haskell representation.
instance SQLTypeable Tree
-- Automatically infers some converters between the Spark data formats
-- and the Haskell in-memory representation.
instance ToSQL Tree
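The two instance declarations above have empty bodies: the defaults are filled in from the Generic representation of Tree. As a plain-Haskell illustration (independent of Karps), the generic representation exposes the datatype's metadata, which is what a deriving library traverses behind the scenes:

```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics

-- The same record shape as in the text.
data Tree = Tree
  { treeId     :: Int
  , treeWidth  :: Int
  , treeHeight :: Int
  } deriving (Generic, Show)

-- 'from' exposes the generic representation, including metadata such as
-- the datatype name and the field names; this is what lets an empty
-- instance declaration derive everything by default.
treeTypeName :: String
treeTypeName = datatypeName (from (Tree 1 2 3) :: Rep Tree ())
```

This is why `deriving Generic` is the only thing the Tree definition itself needs.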
-- These accessors must be written by hand for now, but they can be
-- inferred in the future using TemplateHaskell.
treeId' :: StaticColProjection Tree Int
treeId' = unsafeStaticProjection buildType "treeId"
treeWidth' :: StaticColProjection Tree Int
treeWidth' = unsafeStaticProjection buildType "treeWidth"
treeHeight' :: StaticColProjection Tree Int
treeHeight' = unsafeStaticProjection buildType "treeHeight"
instance TupleEquivalence Tree (Int, Int, Int) where
tupleFieldNames = NameTuple ["treeId", "treeWidth", "treeHeight"]
We can now take a dataframe and attempt to cast it to a (typed) dataset. Since this operation can fail, the result is wrapped in a Try.
tTreesDS = asDS treesDF :: Try (Dataset Tree)
tTreesDS
-- This cast fails: outputDF carries the fields 'treeId' and 'doubleWidth',
-- which do not match the Tree schema.
asDS outputDF :: Try (Dataset Tree)
Since we know this is going to work, and since we are doing exploratory analysis, we unwrap the Try and look at the dataset directly. This code throws an exception if the types are not compatible:
-- To import `forceRight`.
import Spark.Core.Internal.Utilities(forceRight)
treesDS = forceRight (asDS treesDF) :: Dataset Tree
treesDS
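In Karps, a Try is essentially an Either with the error on the left, so forceRight behaves like the following plain-Haskell sketch (forceRight' is our own name for illustration, not part of the Karps API):

```haskell
-- A minimal model of 'forceRight', assuming Try a behaves like an Either
-- with an error value on the Left. The real function lives in
-- Spark.Core.Internal.Utilities.
forceRight' :: Show e => Either e a -> a
forceRight' (Right x)  = x
forceRight' (Left err) = error ("forceRight: " ++ show err)
```

This is convenient in an exploratory session; production code should pattern match on the Try instead of forcing it.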
All the operations on the dataframes can now be checked by the compiler:
col1 = treesDS // treeId'
:t col1
col1
We can still do some dynamic matching if we prefer, but then we get a dynamic column instead.
col1' = treesDS /- "treeId"
:t col1'
col1'
After manipulating columns, all the data can be packed as a tuple, or as some other types. The following operations are fully type-checked. Try to change the types to see what happens:
outputDS = pack (col1, treesDS//treeWidth') :: Dataset (Int, Int)
-- Or we can get our trees back
outputDS2 = pack (treesDS//treeId', treesDS//treeWidth', treesDS//treeHeight') :: Dataset Tree
This still lets us do some bogus operations, because we use primitive types to represent the data:
-- Some curious operation
curious = (treesDS//treeWidth') + (treesDS//treeId')
:t curious
curious
Of course, we do not want to mix the different quantities together, just as in regular Haskell code. Using newtype wrappers, we can tell Karps to use distinct types in Haskell while keeping the same underlying datatype representation in Spark.
-- Ids cannot be used in arithmetic operations anymore; they can only be printed (Show)
newtype MyId = MyId Int deriving (Generic, Show)
instance SQLTypeable MyId
instance ToSQL MyId
-- We allow the new Length type to do some operations (Num)
newtype Length = Length Int deriving (Generic, Num, Show)
instance SQLTypeable Length
instance ToSQL Length
typeForLength :: SQLType Length
typeForLength = buildType
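Independently of Karps, the difference between the two wrappers can be checked in plain Haskell: Length derives Num through GeneralizedNewtypeDeriving, so arithmetic works on it, while MyId does not (Eq is added here only so the result can be compared):

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

newtype MyId = MyId Int deriving (Show)

-- 'deriving Num' lifts the Int instance onto Length at no runtime cost.
newtype Length = Length Int deriving (Num, Eq, Show)

total :: Length
total = Length 3 + Length 2    -- fine: Length has a Num instance

-- total' = MyId 3 + MyId 2   -- rejected by the compiler: no Num for MyId
```

At runtime both wrappers are plain Ints, which is why Spark can keep a single integer representation for both.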
Let us define our new, 'safer' tree structure. Because of Haskell's limitations with record field names, and because all the structures live in the same notebook, we have to pick different names for the fields. In practice, these structures would not get mixed up.
data STree = STree {
sTreeId :: MyId,
sTreeWidth :: Length,
sTreeHeight :: Length } deriving (Generic, Show)
instance SQLTypeable STree
instance ToSQL STree
-- These accessors must be written by hand for now, but they can be
-- inferred in the future using TemplateHaskell.
sTreeId' :: StaticColProjection STree MyId
sTreeId' = unsafeStaticProjection buildType "sTreeId"
sTreeWidth' :: StaticColProjection STree Length
sTreeWidth' = unsafeStaticProjection buildType "sTreeWidth"
sTreeHeight' :: StaticColProjection STree Length
sTreeHeight' = unsafeStaticProjection buildType "sTreeHeight"
instance TupleEquivalence STree (MyId, Length, Length) where
tupleFieldNames = NameTuple ["sTreeId", "sTreeWidth", "sTreeHeight"]
Because of the name change, we cannot directly cast our previous dataframe to that dataset: the names of the fields do not match.
NOTE: this behaviour may change in the future, by checking only the types and dropping the name checks.
forceRight (asDS treesDF) :: Dataset STree
We are going to do some gymnastics with the columns. There are two choices: either we build a dataframe first and then type-check it, or we first type-check each column of the dataframe and then combine the checked columns in a safe manner.
Here is the first option:
-- We can build a structure first and convert it to a dataframe:
str = struct' [ (treesDF/-"treeId") @@ "sTreeId",
(treesDF/-"treeWidth") @@ "sTreeWidth",
(treesDF/-"treeHeight") @@ "sTreeHeight"]
treesDF2 = pack' str
treesDS2 = forceRight (asDS treesDF2) :: Dataset STree
:t treesDS2
treesDS2
And here is the second option, using typed columns. The do ... return block wraps all the possible failures when casting the typed columns:
tTreesDS2 = do
idCol <- castCol' (buildType::SQLType MyId) (treesDF/-"treeId")
widthCol <- castCol' (buildType::SQLType Length) (treesDF/-"treeWidth")
heightCol <- castCol' (buildType::SQLType Length) (treesDF/-"treeHeight")
-- This operation is type-safe
let s = pack (idCol, widthCol, heightCol) :: Dataset STree
return s
treesDS2 = forceRight tTreesDS2
:t treesDS2
treesDS2
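The short-circuiting in the do-block above is just the Monad instance of Either: the first failed cast aborts the whole computation. Here is a standalone sketch, where castPositive is a made-up stand-in for a cast that can fail:

```haskell
-- Either's Monad instance stops at the first Left, which is what makes
-- a sequence of fallible casts collapse into a single Try.
castPositive :: Int -> Either String Int
castPositive x
  | x >= 0    = Right x
  | otherwise = Left ("negative value: " ++ show x)

good :: Either String Int
good = do
  a <- castPositive 1
  b <- castPositive 2
  pure (a + b)

bad :: Either String Int
bad = do
  a <- castPositive 1
  b <- castPositive (-2)
  pure (a + b)               -- never reached: short-circuits at the Left
```

The same mechanism means the castCol' do-block returns either all three typed columns or the first error encountered.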
-- A single column can also be cast on its own:
let widthCol' = treesDF /- "treeWidth"
let widthCol = forceRight (castCol' typeForLength widthCol')
Now all the data can be manipulated in a type-safe manner. Under the hood, all these types will be unwrapped to Spark's primitive types.
idCol = treesDS2 // sTreeId'
:t idCol
idCol
-- This will not work anymore:
idCol + idCol
-- But this will still work:
widthCol = treesDS2//sTreeWidth'
heightCol = treesDS2//sTreeHeight'
volumeCol = (widthCol + heightCol) @@ "volume"
:t volumeCol
volumeCol
Potentially illegal casting operations will not work:
pack (idCol, volumeCol) :: Dataset (Int, Int)
pack (idCol, volumeCol) :: Dataset (MyId, Length)
And of course, the final result can always be converted back to a dataframe if it is more convenient:
pack' [untypedCol idCol, untypedCol volumeCol]
To conclude, Karps lets you use Haskell's type checking as an opt-in, compile-time check, and you can still mix and match both styles when that is more convenient.