Taming the flood: Distributed image processing made easy on large tomographic datasets

K. Mader, R. Mokso, A. Patera, M. Stampanoni
ICTMS 2015, 30 June 2015 (4quant.com/ICTMS2015)

SIL

Paul Scherrer Institut ETH Zurich

Outline

  • Imaging in 2015
  • Changing Computing
  • Small Data vs Big Data
  • The Problem(s)

The Tools

  • Our Image Analysis
  • 3D Imaging
  • Hyperspectral Imaging
  • Interactive Analysis / Streaming

The Science

  • Brain-scale Vasculature
  • Outlook / Developments
  • New Science?
    Internal Structures

Image Science in 2015: More and faster

X-Ray

  • Swiss Light Source (SRXTM): images at >1000fps → 8GB/s; diffraction patterns (cSAXS) at 30GB/s
  • Nanoscopium (Soleil): 10TB/day, 10-500GB file sizes, very heterogeneous data
  • Swiss radiologists will produce 1PB of data per year

Optical

  • Light-sheet microscopy (see talk of Jeremy Freeman) produces images → 500MB/s
  • High-speed confocal images at >200fps → 78Mb/s

Geospatial

  • New satellite projects (Skybox, etc) will measure >10 petabytes of images a year

Personal

  • GoPro 4 Black - 60MB/s (3840 x 2160 x 30fps) for $600
  • fps1000 - 400MB/s (640 x 480 x 840 fps) for $400

Data Flood

Exponential growth in both:

  • the amount of data
  • the analysis time

over the last 16 years, based on the latest-generation detector at the TOMCAT beamline.

The estimate assumes a manual analysis rate of 1Mpx/second.

[Figure: scaling of data volume and analysis time]
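
As a rough worked example of what the 1Mpx/second manual rate implies, the sketch below converts a detector data rate into seconds of manual analysis per second of acquisition. It assumes 16-bit pixels (2 bytes each) and 1GB = 10^9 bytes; neither is stated on the slides, and the object and function names are illustrative only.

```scala
object DataFloodSketch {
  // Assumption (not from the slides): 16-bit pixels, i.e. 2 bytes per pixel
  val bytesPerPixel: Double = 2.0
  // Manual analysis rate quoted on the slide: 1 Mpx/second
  val manualRatePxPerSec: Double = 1e6

  // Seconds of manual analysis needed for each second of acquisition
  def manualSecondsPerAcquiredSecond(bytesPerSec: Double): Double =
    bytesPerSec / bytesPerPixel / manualRatePxPerSec
}
```

Under these assumptions, the 8GB/s quoted for SRXTM gives DataFloodSketch.manualSecondsPerAcquiredSecond(8e9) = 4000.0, so one second of acquisition corresponds to roughly 67 minutes of manual analysis.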

Parallel Tools for Image and Quantitative Analysis

  • val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
  • val avgVol = sqlContext.sql("select SAMPLE,AVG(VOLUME) FROM cells GROUP BY SAMPLE")
  • Collaborators / competitors can verify results and extend the analyses
  • Combine Images with Results
    • avgVol.filter(_._2>1000).map(sampleToPath).joinByKey(bones)
    • See immediately in datasets of terabytes which image had the largest cells
  • New hypotheses and analyses can be done in seconds / minutes

Task                     Single Core Time    Spark Time (40 cores)
Load and Preprocess      360 minutes         10 minutes
Single Column Average    4.6s                400ms
1 K-means Iteration      2 minutes           1s
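
To show what the GROUP BY query and the follow-up filter in the bullets above compute, here is a minimal local sketch using plain Scala collections instead of Spark SQL, so it runs without a cluster. The SAMPLE and VOLUME columns come from the slide's query; the Cell case class, the object name and the helper functions are hypothetical stand-ins, not part of Spark or the Spark Imaging Layer.

```scala
object CellStatsSketch {
  // Hypothetical row type: one segmented cell with its sample name and volume
  final case class Cell(sample: String, volume: Double)

  // Local equivalent of: select SAMPLE, AVG(VOLUME) FROM cells GROUP BY SAMPLE
  def avgVolumeBySample(cells: Seq[Cell]): Map[String, Double] =
    cells.groupBy(_.sample).map { case (s, cs) =>
      s -> cs.map(_.volume).sum / cs.size
    }

  // Local equivalent of the filter(_._2 > 1000) step: samples with a large mean volume
  def largeSamples(avg: Map[String, Double], threshold: Double): Set[String] =
    avg.filter { case (_, v) => v > threshold }.keySet
}
```

For cells ("a", 1000.0), ("a", 3000.0) and ("b", 500.0), the averages are 2000.0 for "a" and 500.0 for "b", so only "a" passes a threshold of 1000.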

A basic image filtering operation

  • Thanks to Spark, it is cached, in memory, approximate, cloud-ready
  • Thanks to Map-Reduce it is fault-tolerant, parallel, distributed
  • Thanks to Java, it is hardware agnostic
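// Spread one voxel's value evenly over its (2*windSize+1) x (2*windSize+1) neighborhood
def spread_voxels(pvec: ((Int, Int), Double), windSize: Int = 1) = {
  val wind = -windSize to windSize
  val pos = pvec._1
  // each neighbor receives an equal share of the original value
  val scalevalue = pvec._2 / (wind.length * wind.length)
  for (x <- wind; y <- wind)
    yield ((pos._1 + x, pos._2 + y), scalevalue)
}

// roiImg holds ((x, y), value) pairs; roiFun is the predicate that keeps the pairs of interest
val filtImg = roiImg.
      flatMap(cvec => spread_voxels(cvec)).
      filter(roiFun).
      reduceByKey(_ + _) // sum all contributions that land on the same position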
  • But it is also not really so readable
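
To make the filter above concrete, here is a minimal local sketch of the same flatMap / filter / reduceByKey pattern on a plain Scala collection, with groupBy plus sum standing in for Spark's reduceByKey. It is not the distributed version: BoxFilterSketch, spreadVoxels, boxFilter and inRoi are illustrative names, not part of Spark or the Spark Imaging Layer.

```scala
object BoxFilterSketch {
  type Voxel = ((Int, Int), Double)

  // Same logic as spread_voxels: share a voxel's value equally across its window
  def spreadVoxels(pvec: Voxel, windSize: Int = 1): Seq[Voxel] = {
    val wind = -windSize to windSize
    val (px, py) = pvec._1
    val share = pvec._2 / (wind.length * wind.length)
    for (x <- wind; y <- wind) yield ((px + x, py + y), share)
  }

  // Local stand-in for roiImg.flatMap(...).filter(roiFun).reduceByKey(_ + _)
  def boxFilter(img: Seq[Voxel], inRoi: Voxel => Boolean): Map[(Int, Int), Double] =
    img.flatMap(v => spreadVoxels(v))
      .filter(inRoi)
      .groupBy(_._1)
      .map { case (pos, vs) => pos -> vs.map(_._2).sum }
}
```

For a single voxel of value 9.0 at (1,1) and a region of interest covering positions 0 to 2 in x and y, each of the nine positions receives 1.0. The operation is a mean (box) filter: the value is averaged over the window.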

Little blocks for big data

Here we use a KNIME-based workflow and our Spark Imaging Layer extensions to build the analysis without any Scala or programming knowledge. The flow from one block to the next is easily visible, and there is no performance overhead from using other tools.

Workflow Blocks

Workflow Settings