Spark Introduction

Kevin Mader
22 July 2014

IBM Data Science Connect Event

Paul Scherrer Institut, ETH Zurich, 4Quant


  • Motivation
  • What is Big Data?
  • Simple Example
  • What is Spark?
  • How to get started?
  • Examples


  • Data science, machine learning, and image processing are all computationally intensive tasks.
  • More data is available now than ever before
    • X-Ray Imaging: 8GB/s continuous images
    • Genomics: Full genomes can be analyzed for less than $1000 at rapidly increasing rates
    • Facebook, Twitter, Google collect and analyze petabytes of data per day
  • Standard tools on single machines cannot keep up

Big Data: Definition

Velocity, Volume, Variety

When a ton of heterogeneous data is coming in fast.

Performant, scalable, and flexible

When scaling isn't scary

Scaling by 10X, 100X, or 1000X takes the same amount of effort

When you are starving for enough data

Michael Franklin, Director of AMPLab, said their rate-limiting factor is always having enough interesting data

0 'clicks' per sample

K-Means Clustering

A common method for automatically partitioning a large dataset into groups (clusters) of similar points.
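
To make the idea concrete, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is an illustration and not code from the talk: the function names, the toy points, and the choice to pass the starting centers explicitly (instead of picking them at random, as is more usual) are all assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centers, max_iterations=20):
    """Alternate assignment and update steps until the centers stop moving.

    Hypothetical helper for illustration; returns (centers, assignment),
    where assignment[i] is the index of the center closest to points[i].
    """
    centers = list(initial_centers)
    k = len(centers)
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        assignment = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_point(members) if members else centers[i])
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, assignment


# Toy data (made up for this sketch): two well-separated blobs in 2D.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# Assumption: start with one center in each blob to keep the run deterministic.
centers, assignment = kmeans(points, initial_centers=[points[0], points[3]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Both steps loop over every point on every iteration, which is why the method becomes expensive on a single machine as the dataset grows, and why it is a natural candidate for a distributed framework.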