From microscopes to satellites: Interactive Image Analysis at Scale Using Spark

Kevin Mader
Swiss Data Science, 12 June 2015

SIL

Paul Scherrer Institut ETH Zurich

Outline

  • Imaging in 2015
  • Changing Computing
  • Small Data vs Big Data
  • The Problem(s)

The Tools

  • Our Image Analysis
  • 3D Imaging
  • Hyperspectral Imaging
  • Interactive Analysis / Streaming

The Science

  • Academic Projects
  • Commercial Projects
  • Outlook / Developments

Internal Structures

Image Science in 2015: More and faster

X-Ray

  • Swiss Light Source (SRXTM) images at (>1000fps) \( \rightarrow \) 8GB/s, diffraction patterns (cSAXS) at 30GB/s
  • Nanoscopium (Soleil), 10TB/day, 10-500GB file sizes, very heterogeneous data
  • Swiss radiologists will produce 1PB of data per year

Optical

  • Light-sheet microscopy (see talk of Jeremy Freeman) produces images \( \rightarrow \) 500MB/s
  • High-speed confocal images at (>200fps) \( \rightarrow \) 78Mb/s

Geospatial

  • New satellite projects (Skybox, etc) will measure >10 petabytes of images a year

Personal

  • GoPro 4 Black - 60MB/s (3840 x 2160 x 30fps) for $600
  • fps1000 - 400MB/s (640 x 480 x 840 fps) for $400

Data Flood

The figure highlights the exponential growth in both:

  • the amount of data
  • analysis time

over the last 16 years based on the latest generation detector at the TOMCAT Beamline.

It assumes a manual analysis rate of 1Mpx/second

(Figure: growth of data volume and analysis time over the last 16 years)

Time Breakdown

The breakdown of science into component tasks has changed over the last years, and we expect it to change further by 2020.

  • Experiment Design (Pink)
  • Data Management (Blue)
  • Measurement Time (Green)
  • Post Processing (Purple)

The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.

(Figure: time breakdown of the component tasks, past years and projected 2020)

How much is a TB, really?

If you looked at one 1000 x 1000 image (for example, a mammography image) every second,

it would take you
139 hours to browse through a terabyte of data.

Year | Time to 1 TB | Manpower to keep up | Salary Costs / Month
2000 | 4096 min     | 2 people            | 25 kCHF
2008 | 1092 min     | 8 people            | 95 kCHF
2014 | 32 min       | 260 people          | 3255 kCHF
2016 | 2 min        | 3906 people         | 48828 kCHF
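
A quick back-of-the-envelope check (assuming 16-bit, i.e. 2-byte, pixels; the slide does not state the bit depth):

\[ \frac{1\,\textrm{TB}}{1000 \times 1000\,\textrm{px} \times 2\,\textrm{B/px}} = 5\times10^{5}\ \textrm{images} \approx 5\times10^{5}\,\textrm{s} \approx 139\ \textrm{hours} \]

The manpower column is then roughly 139 hours divided by the "Time to 1 TB" for that year, e.g. \( 8340\,\textrm{min} / 32\,\textrm{min} \approx 260 \) people for 2014.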

Computing has changed: Parallel

Moore's Law

\[ \textrm{Transistors} \propto 2^{T/(\textrm{18 months})} \]

Based on trends from Wikipedia and Intel, and data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
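
For example, over a 15-year span the formula predicts roughly

\[ 2^{15\,\textrm{yr}/1.5\,\textrm{yr}} = 2^{10} \approx 1000\times \]

more transistors, which, as the list below notes, shows up as more cores rather than higher clock speeds.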

There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?

  • Multiple cores
    • Many machines have multiple cores per processor which can perform tasks independently
  • Multiple CPUs
    • More than one chip is commonly present
  • New modalities
    • GPUs provide many cores which operate at lower clock speeds

Parallel Code is important

Cloud Computing Costs

The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation with utilization shown as the average number of hours the computer is used each week.

(Figure: cloud vs. local workstation cost by weekly utilization)

The figure shows the cost of a cloud-based solution as a percentage of the cost of buying a single machine; values below 1 are labeled with the percentage. The panels distinguish the average time to replacement for the machines, in months.

(Figure: cloud cost as a percentage of a single machine, by replacement interval)
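
The exact numbers are in the figures, but the shape of the comparison is simple enough to sketch in code; the prices below are placeholder assumptions for illustration, not the values behind the plots.

object CloudVsLocal {
  // placeholder assumptions, not the values used in the figures above
  val cloudRatePerHour = 1.0     // cost of a comparable cloud node (CHF / hour)
  val workstationCost  = 5000.0  // purchase price of a local workstation (CHF)

  // cloud cost as a fraction of the workstation cost, for a given weekly
  // utilization and replacement interval (in months) of the local machine
  def cloudCostFraction(hoursPerWeek: Double, replacementMonths: Int): Double = {
    val weeks = replacementMonths * 52.0 / 12.0
    (hoursPerWeek * weeks * cloudRatePerHour) / workstationCost
  }

  def main(args: Array[String]): Unit =
    for (h <- Seq(5.0, 20.0, 40.0, 80.0); m <- Seq(24, 36))
      println(f"$h%5.1f h/week, $m%2d-month replacement: ${cloudCostFraction(h, m)}%.2f x workstation")
}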

Small Data vs Big Data

Big Data is a very popular term today. There is a ton of hype, and money seems to be thrown at any project which includes these words. First, to get over the biggest misconception: big data isn't about how much data you have, it is about how you work with it.

Before we can think about Big Data we need to define what it's replacing, Small Data.

Small Data

The '0815' (run-of-the-mill) approach, using standard tools and lots of clicking

Real Example: Genome-Scale Imaging of Bone

High-throughput \( \rightarrow \) easy measurement

analysis \( \rightarrow \) complex, multistep, and time-consuming

Linkage

uCT Image

Genetic studies require hundreds to thousands of samples, in this case the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.

Genetic LOD Scores

Small Data: Genome-Scale Imaging

  • Hand Identification \( \rightarrow \) 30s / object
  • 30-40k objects per sample
  • One Sample in 6.25 weeks
  • 1300 samples \( \rightarrow \) 120 man-years

Advantages

  • Hand verification of every sample, visual identification of errors and problems

Disadvantages

  • Biased by user
  • Questionable reproducibility
  • Time-consuming
  • Exploration challenging
  • Data versioning is difficult

Medium Data: Genome-scale Imaging

  • ImageJ macro for segmentation
  • Python script for shape analysis
  • ParaView macro for network and connectivity
  • Python script to pool results
  • MySQL Database storing results (5 minutes / query)

1.5 man-years

Advantages

  • Faster than manual methods
  • Less bias
  • More reproducible

Disadvantages

  • Compartmentalized analysis
  • Time-consuming
  • Complex interdependent workflow
  • Fragile (machine crashes, job failures)
  • Expensive - Measure twice, cut once, not exploratory

Compartmentalized

Gantt Flow

Exploratory Image Processing Priorities

Correctness

The most important job for any piece of analysis is to be correct.

  • A powerful testing framework is essential
  • Avoid repetition of code which leads to inconsistencies
  • Use compilers to find mistakes rather than users

Easily understood, changed, and used

Almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets

  • Flexible, modular structure that enables replacing specific pieces

Fast

The last of the major priorities is speed, which covers scalability, raw performance, and development time.

  • Long waits for processing discourage exploration
  • Manual access to data on separate disks is a huge speed barrier
  • Real-time image processing requires millisecond latencies
  • Implementing new ideas can be done quickly

Big Data

Velocity, Volume, Variety

When a ton of heterogeneous data is coming in fast.

Performant, scalable, and flexible

When scaling isn't scary

10X, 100X, 1000X is the same amount of effort

When you are starving for enough data

The director of the AMPLab said their rate-limiting factor is always having enough interesting data

0 'clicks' per sample

The Problem

There is a flood of new data

What took an entire PhD 3-4 years ago, can now be measured in a weekend, or even several seconds. Analysis tools have not kept up, are difficult to customize, and usually highly specific.

Optimized Data-Structures do not fit

Data-structures that were fast and efficient for computers with 640kb of memory do not make sense anymore

Single-core computing is too slow

CPUs are not getting that much faster, but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware

The Framework First

  • Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets
    • choose the framework first
    • then start making the necessary tools.
  • Google, Amazon, Yahoo, and many other companies have made huge in-roads into these problems
  • The real need is a fast, flexible framework for robustly, scalably performing complicated analyses, a sort of Excel for big imaging data.

Apache Spark and Hadoop 2

The two frameworks provide a free out of the box solution for

  • scaling to >10000 computers
  • storing and processing exabytes of data
  • fault tolerance
    • two-thirds of the computers can crash and a request will still finish accurately
  • hardware and software platform independence (Mac, Windows, Linux)

Spark -> Microscopy?

These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image (a rough translation is sketched after the list below).

I want to

  • filter out noise, segment, choose regions of interest
  • contour, component label
  • measure, count, and analyze
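
To make the mapping concrete, here is a rough sketch (not the SIL API itself) of how a few of these imaging verbs translate into plain Spark operations when an image is represented as an RDD of ((x, y), intensity) pairs; the thresholds and helper names are made up for illustration.

import org.apache.spark.rdd.RDD

object ImagingVerbsOnSpark {
  type Pixel = ((Int, Int), Double)

  // "filter out noise": drop pixels below an (assumed) noise floor
  def denoise(img: RDD[Pixel], floor: Double): RDD[Pixel] =
    img.filter { case (_, v) => v > floor }

  // "segment": a global threshold is just a map to foreground/background labels
  def segment(img: RDD[Pixel], threshold: Double): RDD[((Int, Int), Boolean)] =
    img.mapValues(_ > threshold)

  // "measure and count": aggregate over the foreground pixels
  def foregroundArea(seg: RDD[((Int, Int), Boolean)]): Long =
    seg.filter(_._2).count()
}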

Spark Image Layer

  • Developed at 4Quant, ETH Zurich, and Paul Scherrer Institut
  • The Spark Image Layer is a Domain-Specific Language for microscopy built on Spark.
  • It converts common imaging tasks into coarse-grained Spark operations

SIL

Spark Image Layer

We have developed a number of SIL commands for handling standard image processing tasks

SIL Commands

Fully extensible with Spark languages

SIL Commands

Use Case with Spark Image Layer: Genetic Studies

  • The analysis could be completed in several months (instead of 120 man-years); in the cloud it could now be completed in days
  • Data can be freely explored and analyzed
  • Load all of the images
val imageTable = 
    sc.readImage[Float]("s3n://bone-images/f2-study/*/ufilt.tif").cache
  • Filter all of the images
CREATE TABLE FilteredImages AS
SELECT boneId,GaussianFilter(image) FROM ImageTable
  • Segment hard and soft tissues
CREATE TABLE ThresholdImages AS
SELECT boneId,ApplyThreshold(image,OTSU) FROM FilteredImages

Use Case with Spark Image Layer: Genetic Studies

  • Label cells
CREATE TABLE LabelImages AS
SELECT boneId,ComponentLabel(image) FROM PorosityImages
  • Export results
CREATE TABLE CellAnalysis AS
SELECT boneId,AnalyzeShape(CellImages) FROM CellImages GROUP BY boneId
WRITE TABLE CellAnalysis "s3n://bones/cells.csv"

Parallel Tools for Image and Quantitative Analysis

  • val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
  • val avgVol = sqlContext.sql("select SAMPLE,AVG(VOLUME) FROM cells GROUP BY SAMPLE")
  • Collaborators / Competitors can verify results and extend on analyses
  • Combine Images with Results
    • avgVol.filter(_._2>1000).map(sampleToPath).joinByKey(bones)
    • See immediately in datasets of terabytes which image had the largest cells
  • New hypotheses and analyses can be done in seconds / minutes
Task                  | Single Core Time | Spark Time (40 cores)
Load and Preprocess   | 360 minutes      | 10 minutes
Single Column Average | 4.6 s            | 400 ms
1 K-means Iteration   | 2 minutes        | 1 s

Cost Comparison: Why hasn't this been done before?

For many traditional scientific studies the sample count is fairly low (10-100). For such low counts, if we look at the costs (estimated from our experience with cortical bone microstructure and rheology):

  • Development costs are very high for Big Data
  • Traditionally students are much cheaper than the 150kCHF / year used in this model
  • Medium Data is a good compromise

(Figure: cost comparison for small studies)

Scaling Costs

As studies get bigger (>100 samples), we see that these costs begin to shift radically (a rough cost model is sketched after the figure).

  • The high development costs in Medium and Big \( \rightarrow \) 0
  • The high human costs of Small and Medium \( \propto \textrm{Samples} \)

(Figure: cost scaling with sample count)
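
In rough equation form (a simplification of what the figure shows, not a fitted model):

\[ \textrm{Cost}_{\textrm{total}}(N) \approx C_{\textrm{dev}} + c_{\textrm{per sample}} \cdot N \]

Small Data has \( C_{\textrm{dev}} \approx 0 \) but a large, human-driven \( c_{\textrm{per sample}} \); Big Data has a large \( C_{\textrm{dev}} \) but a per-sample cost that is mostly compute, which is why it only pays off once the sample count grows beyond roughly 100.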

Quantitative Search Machine for Images: Maps

PSI

Find all strongly reflective large objects within 1km of the Paul Scherrer Institut, Villigen, CH

\[ \downarrow \textrm{Translate to SQL} \]

SELECT contour FROM (
  SELECT COMPONENT_LABEL(THRESHOLD(tile,200)) FROM esriTiles 
  WHERE DIST(LAT,-47.53000992328762,LONG,8.215198516845703)<1
  ) WHERE area>200

Quantitative Search Machine for Images: Maps

We can then visualize these contours easily

(Figure: the extracted contours plotted by latitude and longitude)

or apply them back to the original map

PSI

Quantitative Search Machine for Images: Pathological Slices

  • Count the number of highly anisotropic nuclei in myeloma patients

\[ \downarrow \textrm{Translate to SQL} \]

SELECT COUNT(*) FROM 
  (SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide)) FROM patients 
  WHERE disease LIKE "myeloma")
  WHERE anisotropy > 0.75

Quantitative Search Machine for Images: Pathological Slices

\[ \downarrow \textrm{Load Myeloma Data Subset} \] Slices

Quantitative Search Machine for Images: Pathological Slices

\[ \downarrow \textrm{Perform analysis on every image} \] Slices

\[ \downarrow \textrm{Filter out the most anisotropic cells} \] Slices

Distribution and Analysis

The query is processed by the SQL parsing engine

  • sent to the database where it can be combined with locally available private data
  • sent to the image layer for all image processing components to be integrated
  • sent to the Spark Core so specific map, reduce, fold operations can be generated
  • sent to the master node where it can be distributed across the entire cluster / cloud

Distribution and Analysis

Many Nodes

Research Outlook: Pathology

Pathology

Adult Zebrafish Imaging

  • In collaboration with K. Cheng (PMC), …
  • Store entire image collection online
  • Manually segment hundreds of cells
  • Determine feature vector for automatic segmentation

Viral Infection

  • In collaboration with M. Gerber (UniZurich)
  • Store all images for a time-series
  • Identify cells, nucleus, and infected regions
  • Track infection progression and colocalize with cells

Nexus Personalized Medicine Platform

  • In collaboration with M. Prummel
  • Images for large studies stored in OpenBIS
  • Run CellProfiler-style analysis on entire collection of images directly in database
  • Use approximate results to rapidly, interactively screen

Mouse Brain Structure

  • In collaboration with (A. Patera, A. Astolfo, M. Schneider, B. Weber)
  • Store all images in one place
  • Iteratively determine image offsets using cross-correlation
  • Stitch images together as needed for analysis / view generation

Research Outlook: Radiology

Clinical Radiology

  • 40-90k patients die from preventable mistakes every year
  • 1PB in Switzerland every year
  • 'Smart' database of all images
    • Test all / new algorithms against all images
  • Chest/Pelvis CT Scans
    • Automatically identify lungs and spine
    • Track anatomic changes for frequent fliers (+20 scans)
    • Determine normals and ranges empirically

Real-time with Spark Streaming: Webcam

In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.

ImageJ Fiji

We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally these analyses can be performed on streaming data.


Streaming Analysis: Real-time Webcam Processing

// set up a webcam receiver and turn the Spark context into a streaming context
val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val imgList = ssc.receiverStream(wr)

Filter images

// apply an ImageJ median filter (radius 3) to every frame in the stream
val filtImgs = imgList.mapValues(_.run("Median...","radius=3"))

Create a background image

// estimate a static background as the average of the incoming images
val totImgs = inImages.count()
val bgImage = inImages.reduce(_ add _).multiply(1.0/totImgs)

Identify Outliers in Streams

Remove the background image and find the mean value

val eventImages = filtImgs.
    transform{
    inImages =>
      // subtract the background from each frame and pair the mean
      // residual intensity with the (time, corrected image) tuple
      val corImages = inImages.map {
        case (inTime,inImage) =>
          val corImage = inImage.subtract(bgImage)
          (corImage.getImageStatistics().mean,
            (inTime,corImage))
      }
      corImages
  }

Show the outliers

// keep only frames whose mean deviation from the background exceeds 20
eventImages.filter(iv => Math.abs(iv._1)>20).
  foreachRDD(showResultsStr("outlier",_))

Streaming Demo with Webcam

We have a cool tool, but what does this mean for me?

A spinoff - 4Quant: From images to insight

  • Quantitative Search Machine for Images
    • Find all patients with livers larger than 10cm diameter
    • Count the number of highly anisotropic nuclei in myeloma patients
  • Custom Analysis Solutions
    • Custom-tailored software to solve your problems
  • One Stop Shop
    • Measurement, analysis, and statistical analysis

Education / Training

Acknowledgements: 4Quant

  • Flavio Trolese


  • Dr. Prat Das Kanungo


  • Dr. Ana Balan


  • Prof. Marco Stampanoni


Acknowledgements: ETH and PSI

  • AIT at PSI and Scientific Computing at ETH
  • TOMCAT Group

We are interested in partnerships and collaborations

Learn more at

Feature Vectors

A pairing between spatial information (position) and some other kind of information (value). \[ \vec{x} \rightarrow \vec{f} \]

We are used to seeing images in a grid format where the position indicates the row and column in the grid and the intensity (absorption, reflection, tip deflection, etc) is shown as a different color

(Figure: the image shown in its grid form)

The alternative form for this image is as a list of positions and a corresponding value

\[ \hat{I} = (\vec{x},\vec{f}) \]

x y Intensity
1 1 30
2 1 90
3 1 13
4 1 5
5 1 64
1 2 86

This representation can be called the feature vector and in this case it only has Intensity
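
As a minimal sketch (plain Scala, assuming the image is simply a 2D array of intensities), the conversion from the grid form to this feature-vector form is a flatten:

// turn a grid image into a list of ((x, y), intensity) feature vectors,
// using 1-based indices to match the table above
def toFeatureVectors(img: Array[Array[Double]]): Seq[((Int, Int), Double)] =
  for {
    (row, y)   <- img.zipWithIndex.toSeq
    (value, x) <- row.zipWithIndex
  } yield ((x + 1, y + 1), value)

// e.g. toFeatureVectors(Array(Array(30.0, 90.0, 13.0))) gives
// Seq(((1,1),30.0), ((2,1),90.0), ((3,1),13.0))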

Why Feature Vectors

If we use feature vectors to describe our image, we no longer have to worry about how the images will be displayed, and can focus on the segmentation/thresholding problem from a classification standpoint rather than an image-processing one.

Example

So we have an image of a cell and we want to identify the membrane (the ring) from the nucleus (the point in the middle).

(Figure: a cell with a membrane ring and a nucleus in the middle)

A simple threshold doesn't work because we identify the point in the middle as well. We could try to use morphological tricks to get rid of the point in the middle, or we could better tune our segmentation to the ring structure.

(Figure: a simple threshold also picks up the point in the middle)

Adding a new feature

In this case we add a very simple feature to the image, the distance from the center of the image (distance).

(Figure: the distance-from-center feature)

x y Intensity Distance
-10 -10 0.9350683 14.14214
-10 -9 0.7957197 13.45362
-10 -8 0.6045178 12.80625
-10 -7 0.3876575 12.20656
-10 -6 0.1692429 11.66190
-10 -5 0.0315481 11.18034

We now have a more complicated image, which we can't as easily visualize, but we can incorporate these two pieces of information together.

(Figure: the intensity and distance features visualized together)

Applying two criteria

Now instead of trying to find the intensity for the ring, we can combine intensity and distance to identify it (a sketch of the combined filter follows the figures below)

\[ \textrm{if}\;(5<\textrm{Distance}<10 \;\&\; 0.5<\textrm{Intensity}<1.0) \]

(Figures: segmentation using both the distance and intensity criteria)
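
As a sketch, once the image is in feature-vector form the combined criterion above becomes a one-line filter; the case class and the cut-off values simply restate the formula and are not part of any library.

// one feature vector per pixel, with the added Distance feature
case class PixelFeatures(x: Int, y: Int, intensity: Double, distance: Double)

// keep only pixels on the ring: far enough from the center and bright enough
def ringPixels(pixels: Seq[PixelFeatures]): Seq[PixelFeatures] =
  pixels.filter(p => p.distance > 5 && p.distance < 10 &&
                     p.intensity > 0.5 && p.intensity < 1.0)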

Common Features

The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image

  • Gaussian Filter (information on the values of the surrounding pixels)
  • Sobel / Canny Edge Detection (information on edges in the vicinity)
  • Entropy (information on variability in the vicinity)
x y Intensity Sobel Gaussian
1 1 0.94 0.91 0.90
1 10 0.48 -0.83 0.51
1 11 0.50 -0.82 0.53
1 12 0.48 -0.83 0.51
1 13 0.43 -0.86 0.46
1 14 0.33 -0.90 0.37

(Figure: the Intensity, Sobel, and Gaussian feature images)

Analyzing the feature vector

The distributions of the features appear very different and can thus likely be used for identifying different parts of the images.

(Figure: distributions of the features)

Combine this with our a priori information (called supervised analysis)

(Figures: supervised analysis combining the features with the a priori information)

Using Machine Learning

Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. It is also much easier to combine with training information.

x y Absorb Scatter Training
700 4 0.3706262 0.9683849 0.0100140
704 4 0.3694059 0.9648784 0.0100140
692 8 0.3706371 0.9047878 0.0183156
696 8 0.3712537 0.9341989 0.0334994
700 8 0.3666887 0.9826912 0.0453049
704 8 0.3686623 0.8728824 0.0453049

Want to predict Training from x, y, Absorb, and Scatter \( \rightarrow \) MLlib: Logistic Regression, Random Forest, K-Nearest Neighbors, …
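
A minimal sketch of that route using the RDD-based MLlib logistic regression API; the imgFeatures RDD, its column layout, and the 0.03 label cut-off are assumptions for illustration.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// imgFeatures: RDD[(Double, Double, Double, Double, Double)] holding
// (x, y, Absorb, Scatter, Training) rows like the table above
val labeled = imgFeatures.map { case (x, y, absorb, scatter, training) =>
  // turn the continuous Training column into a binary label (the cut-off is an assumption)
  LabeledPoint(if (training > 0.03) 1.0 else 0.0,
               Vectors.dense(x, y, absorb, scatter))
}.cache()

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(labeled)
val predictionsAndLabels = labeled.map(lp => (model.predict(lp.features), lp.label))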

(Figure: prediction results)

As a scientist (not a data-scientist)

Apache Spark is a brilliant platform, and utilizing GraphX, MLlib, and other packages there are unlimited possibilities

  • Scala can be a beautiful but not an easy language
  • Python is an easier language
  • Both suffer from
    • Non-obvious workflows
    • Scripts depending on scripts depending on scripts (can be very fragile)
  • Although all analyses can be expressed as a workflow, this is often difficult to see from the code
  • Non-technical persons have little ability to understand or make minor adjustments to analysis
  • Parameters require recompiling to change
  • or GUIs need to be placed on top

Interactive Analysis

Combining many different components together inside the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows

Web and Image Database Examples

A basic image filtering operation

  • Thanks to Spark, it is cached, in memory, approximate, cloud-ready
  • Thanks to Map-Reduce it is fault-tolerant, parallel, distributed
  • Thanks to Java, it is hardware agnostic
// spread the value of a voxel evenly over its (2*windSize+1) x (2*windSize+1) neighborhood
def spread_voxels(pvec: ((Int,Int),Double), windSize: Int = 1) = {
  val wind=(-windSize to windSize)
  val pos=pvec._1
  val scalevalue=pvec._2/(wind.length*wind.length)
  for(x<-wind; y<-wind) 
    yield ((pos._1+x,pos._2+y),scalevalue)
}

val filtImg=roiImg.
      flatMap(cvec => spread_voxels(cvec)).
      filter(roiFun).reduceByKey(_ + _)
  • But it is also not really so readable

Little blocks for big data

Here we use a KNIME-based workflow and our Spark Image Layer extensions to create a workflow without any Scala or programming knowledge, with an easily visible flow from one block to the next, and without the performance overhead of using other tools.

Workflow Blocks

Workflow Settings

Example Projects: Full Brain Imaging

  • Collaboration with A. Astolfo and A. Patera
  • Measure a full mouse brain (1 cm\( ^3 \)) with cellular resolution (1 \( \mu m \))
  • 10 x 10 x 10 scans at 2560 x 2560 x 2160 \( \rightarrow \) 14 TVoxels
  • 0.000004% of the entire dataset
    Single Slice
  • 14TVoxels = 56TB
  • Each scan needs to be registered and aligned together
  • There are no computers with 56TB of memory
  • Even multithreaded approaches are not feasible and require significant logistics
  • Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)

Science Problems: Big Stitching

\[ \textrm{Images}: \textrm{RDD}[((x,y,z),Img[Double])] =\\ \left[(\vec{x},\textrm{Img}),\cdots\right] \]

(Figure: the tiled scan positions)

val dispField = Images.
  cartesian(Images).map{
    case ((xA,imgA), (xB,imgB)) =>
      xcorr(imgA,imgB,in=xB-xA)
  }

(Figure: displacement field from the pairwise cross-correlations)

From Matching to Stitching

From the updated positions provided by the cross correlations, and by applying appropriate smoothing criteria (if necessary), the tiles can be placed into a single consistent coordinate system.

(Figure: tiles after applying the updated positions)

The stitching itself can be done in a lazy fashion as specific regions of the image are read, rather than by rewriting the original data.

def getView(tPos: Pos, tSize: Size) =
  stImgs.
    // keep only the tiles that overlap the requested view
    filter{ case (x,img) => abs(x-tPos)<img.size }.
    map { case (x,img) =>
      val oImg = new Image(tSize)
      oImg.copy(img,x,tPos)   // copy the overlapping region into the output view
    }.addImages(AVG)          // blend overlapping tiles by averaging

This also ensures the original data is left unaltered and all analysis is reversible.

Viewing Regions

getView(Pos(26.5,13),Size(2,2))

(Figure: the 2 x 2 view extracted at Pos(26.5,13))

An oversimplified view of Health-care

The attending physician or the general practitioner is the integrator, responsible for combining and interpreting a number of complex, often contradictory inputs.

A Big Data Vision

The centralized storage and analysis of images provides an alternative, backup pathway, lowering the risk that something is overlooked

Eventually this could lead to the GP getting second opinions automatically from Deep Learning tools or IBM's Watson and even allowing the patient to directly view and interpret some of the results

Algorithms?

Significant energy is spent in many research groups on new algorithms (>1000 new publications in the top 3 journals every year) \( \rightarrow \) hundreds to thousands for every standard problem.

  • As imaging scientists we cannot afford to compete in such an area. Many of these algorithms are good on a small hand-picked set of images but terrible on large-scale problems; 'Big Data' necessitates much more robust algorithms than 'image processing'.
  • One or two steps back from latest generation
  • Well understood failure behavior
  • Predictable noise / artifact behavior