Spark Introduction

Kevin Mader
22 July 2014

IBM Data Science Connect Event

Paul Scherrer Institut · ETH Zurich · 4Quant

Outline

  • Motivation
  • What is Big Data?
  • Simple Example
  • What is Spark
  • How to get started?
  • Examples

Motivation

  • Data science, machine learning, and image processing are all computationally intensive tasks.
  • More data is available now than ever before
    • X-Ray Imaging: 8GB/s continuous images
    • Genomics: Full genomes can be analyzed for less than $1000 at rapidly increasing rates
    • Facebook, Twitter, Google collect and analyze petabytes of data per day
  • Standard tools on single machines cannot keep up

Big Data: Definition

Velocity, Volume, Variety

When a ton of heterogeneous data is coming in fast.

Performant, scalable, and flexible

When scaling isn't scary

Scaling to 10X, 100X, or 1000X the data takes the same amount of effort
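
To make that concrete, here is a minimal sketch (the application name "MyApp" and the cluster URL are hypothetical) of how the same Spark program moves from a laptop to a cluster by changing only the master URL:

    import org.apache.spark.{SparkConf, SparkContext}

    // Identical application code in both cases; only the master URL differs.
    val laptopConf  = new SparkConf().setAppName("MyApp").setMaster("local[4]")            // 4 local cores
    val clusterConf = new SparkConf().setAppName("MyApp").setMaster("spark://master:7077") // standalone cluster

    val sc = new SparkContext(laptopConf) // swap in clusterConf; the rest of the program is unchanged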

When you are starving for enough data

Michael Franklin, Director of the AMPLab at UC Berkeley, has said that their rate-limiting factor is always having enough interesting data

0 'clicks' per sample

Reality Check

Spark Shortcomings

  • Spark is not maximally performant → dedicated, hand-optimized CPU and GPU codes will perform anywhere from slightly to dramatically better when evaluated in data points per second per unit of processing power
    • these codes will in turn be wildly outperformed by dedicated hardware / FPGA solutions
  • Serialization overhead and network congestion are not negligible for large datasets (one common mitigation is sketched below)
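
As a sketch of that mitigation: the serialization cost can often be reduced by switching to the Kryo serializer via standard Spark configuration keys (the application name below is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Kryo is generally faster and more compact than the default Java serialization.
    val conf = new SparkConf()
      .setAppName("MyApp") // placeholder application name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)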

But

  • Scala / Python in Spark is substantially easier to write and test (see the word-count sketch after this list)
    • Highly optimized codes are very inflexible
    • Human time is 400x more expensive than AWS time
    • Mistakes due to poor testing can be fatal
  • Spark scales smoothly to enormous datasets
    • GPUs rarely have more than a few gigabytes
    • Writing code that pages to disk is painful
  • Spark is hardware agnostic (no drivers or vendor lock-in)
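
To give a sense of how compact Spark code can be, here is a minimal word-count sketch in Scala (the input and output paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // Read a text file, split each line into words, and count occurrences in parallel.
    val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/wordcounts")   // hypothetical output path

The same handful of lines runs unchanged on a single file on a laptop or on terabytes spread across a cluster.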

Spark Error Messages

Even a small mistake in the interactive shell can produce an intimidating stack trace like this one:

Exception in thread "main" java.lang.AssertionError: assertion failed: Tried to find '$line47' in '/var/folders/yq/w_mvh2xj7yzdzb6k4pknwzgc0000gn/T/spark-f8aac450-9c70-4232-a71f-e089e7bdd03b' but it is not a directory
    at scala.reflect.io.AbstractFile.subdirectoryNamed(AbstractFile.scala:254)
    at scala.tools.nsc.backend.jvm.BytecodeWriters$class.getFile(BytecodeWriters.scala:31)
    at scala.tools.nsc.backend.jvm.BytecodeWriters$class.scala$tools$nsc$backend$jvm$BytecodeWriters$$getFile(BytecodeWriters.scala:37)
    at scala.tools.nsc.backend.jvm.BytecodeWriters$ClassBytecodeWriter$class.writeClass(BytecodeWriters.scala:89)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase$$anon$4.writeClass(GenASM.scala:67)
    at scala.tools.nsc.backend.jvm.GenASM$JBuilder.writeIfNotTooBig(GenASM.scala:459)
    at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1413)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:120)
    at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
    at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
    at org.apache.spark.repl.SparkIMain.compileSourcesKeepingRun(SparkIMain.scala:468)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.compileAndSaveRun(SparkIMain.scala:859)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.compile(SparkIMain.scala:815)
    at org.apache.spark.repl.SparkIMain$Request.compile$lzycompute(SparkIMain.scala:1009)
    at org.apache.spark.repl.SparkIMain$Request.compile(SparkIMain.scala:1004)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:644)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
    at org.apache.spark.repl.SparkILoop$$anonfun$replay$1.apply(SparkILoop.scala:634)
    at org.apache.spark.repl.SparkILoop$$anonfun$replay$1.apply(SparkILoop.scala:632)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.repl.SparkILoop.replay(SparkILoop.scala:632)
    at org.apache.spark.repl.SparkILoop$$anonfun$1.applyOrElse(SparkILoop.scala:579)
    at org.apache.spark.repl.SparkILoop$$anonfun$1.applyOrElse(SparkILoop.scala:566)
    at scala.runtime.AbstractPartialFunction$mcZL$sp.apply$mcZL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)

Acknowledgements

  • AIT at PSI
  • TOMCAT Group

We are interested in partnerships and collaborations

Learn more at