High-throughput, Scalable, Quantitative, Cellular Phenotyping using X-Ray Tomographic Microscopy

Kevin Mader [1,2,3], Leah Rae Donahue [4], Ralph Müller [5], and Marco Stampanoni [1,2]
8 April 2014

  1. Swiss Light Source, Paul Scherrer Institut, 5232 Villigen, Switzerland
  2. Institute for Biomedical Engineering, University and ETH Zurich, 8092 Zurich, Switzerland
  3. 4Quant, Zurich, Switzerland
  4. The Jackson Laboratory, Bar Harbor, ME, United States
  5. Institute for Biomechanics, ETH Zurich, 8093 Zurich, Switzerland


Outline

  • Introduction
  • Motivation

    • Big Data
    • Scalability
  • Scalability

    • Imaging
    • Numerical Analysis
  • Results

  • Performance

  • Beyond

Introduction

  • A phenotype is a very general term for any measurable characteristic resulting from a combination of genetic background and environmental conditions
    • All fields of genetics are ultimately dependent on phenotypes, since they are the observable differences
  • Structural phenotypes are related to position, size, shape, and organization of different cells, tissues, and organs

Bone Structure

Implied

  • Many data points: involves looking at 100s to 10,000s+ of samples
  • Reproducible: every sample needs to be handled identically to avoid biasing the results

Synchrotron-based X-Ray Tomographic Microscopy

The only technique which can do all of the following:

  • peer deep into large samples
  • achieve \( \mathbf{<1\mu m} \) isotropic spatial resolution
    • with 1.8mm field of view
  • achieve >10 Hz temporal resolution

SLS and TOMCAT

[1] Mokso et al., J. Phys. D, 46(49), 2013

Cellular and Canal Structures in Murine Cortical Bone

So we have the imaging modality, now what?

Phenotyping

Phenotype Pipeline

Scaling Acquisition

Automated Sample Exchange

Automated Alignment and Region of Interest Scanning

Acquisition Results

1300 bone samples measured at 0.7 \( \mu m \) resolution in 2 weeks, automatically

Scans

Time Consumption

Just like genetics, the burden of imaging is moving from acquisition to post-processing

Time Consumption

[3] Sboner, A., et al., Genome Biology, 2011

Post-Processing Motivation

There are three different types of problems that we will run into.

Really big data sets

  • Several copies of the dataset need to be in memory for processing
  • Computers with more than 256 GB of memory are expensive and difficult to find
  • Even those machines typically have 16 cores, so there is still only 16 GB per core
  • Drive speed / network file access becomes a limiting factor
  • If the job crashes you lose everything
    • or you have to manually write a bunch of messy check-pointing code

Many datasets

  • For genome-scale studies 1000s of samples need to be analyzed identically
  • Dynamic experiments can have hundreds of measurements
  • Animal phenotyping can produce 1000s of huge (328 GB) datasets

Exploratory Studies

  • Not sure what we are looking for
  • Easy to develop new analyses
  • Quick to test hypotheses

Definition: Big Data

Velocity, Volume, Variety

When a ton of heterogeneous information is coming in fast.

Performant, scalable, and flexible

When scaling isn't scary

10X, 100X, 1000X is the same amount of effort

When you are starving for enough data

Most imaging groups are overwhelmed with data (8GB/s adds up very quickly)

High Levels of Automation = 0 'clicks' per sample

Don't existing tools already do Big Data?

  • Not really.
  • Most scale well only up to 1024 x 1024 x 1024 \( \approx \) 1 GVx
  • They take 12-24 hours for a full structural analysis
  • If one machine crashes, everything crashes
  • High-performance GPU methods are very tedious to scale to large datasets
  • Some tools for filtering \( \rightarrow \) other tools for segmentation \( \rightarrow \) others for visualization \( \rightarrow \) others for statistical analysis
    • You spend a lot of time writing glue
    • Different languages, conventions, and tools
  • None of these approaches are robust or deal with the data flood

Why is scaling up hard?

Parallel

Coordination

Parallel computing requires a significant amount of coordination between computers for tasks that are not easily parallelizable.

Mutability

If you have two cores / computers trying to write the same information at the same time, the result is no longer deterministic (not good).

Blocking

Taking turns and waiting for every independent process to finish completely negates the benefits of parallel computing.

Distributed

Sending Instructions / Data Afar

Fault Tolerance

If you have 1000 computers working on solving a problem and one fails, you do not want your whole job to crash

Data Storage

How can you access and process data from many different computers quickly without very expensive infrastructure?

Why is it even harder?

Each parallel library / tool comes with its own constraints: CUDA, OpenCL, OpenMPI, Matlab Distributed Toolbox, KORUS, .NET Concurrent Framework, Python Multiprocessing, Hadoop, Storm, Impala, Redshift

  • You must learn many very specific code bases and manage the logistics of organizing jobs across them.
  • Almost none are fault-tolerant out of the box.
  • Lock-in: transitioning from one code base to another is at best nightmarish
  • Testing and profiling code locally can be very difficult
  • None of them are flexible enough to let extra processing power 'jump in' when it is needed for intense computations

Solution: Spark / Resilient Distributed Datasets

Technical Specifications

  • Developed by the AMP Lab at UC Berkeley in 2012
  • General tool for all Directed Acyclical Graph (DAG) workflows
  • Coarse-grained processing \( \rightarrow \) simple operations applied to entire sets (see the sketch below)
    • Map, reduce, join, group by, fold, foreach, filter,…
  • In-memory caching
  • Based on Scala, offers interactive REPL

Apache Spark

Zaharia, M., et al. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
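
As a rough illustration of the coarse-grained model, a minimal stand-alone sketch (the toy image and the (x, y, z) voxel keys are illustrative assumptions, not part of the Spark API):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey

object CoarseGrainedDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "coarse-grained-demo")
    // a toy "image": key = (x, y, z) position, value = gray level
    val inImg = sc.parallelize(
      for (x <- 0 until 10; y <- 0 until 10; z <- 0 until 10)
        yield ((x, y, z), math.random))
    val bright = inImg.filter(_._2 > 0.5)            // coarse-grained filter over the whole set
    val perSlice = bright.map { case ((x, y, z), v) => (z, v) }.
      reduceByKey(_ + _)                             // per-slice reduction
    perSlice.cache()                                 // in-memory caching for re-use
    println(perSlice.collect().mkString(", "))
    sc.stop()
  }
}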

Practical Specifications

  • Distributed, parallel computing without logistics, libraries, or compiling
  • Declarative rather than imperative
    • Apply operation \( f \) to each image / block
    • Even scheduling is handled automatically
  • Input and results can be stored in memory, on disk, in the cloud, using Hadoop, redundantly or not
  • Functional, unit, and integration tests can be performed locally (see the sketch below)
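
Because the same code runs locally and on a cluster (only the master URL changes), a pipeline step can be exercised as an ordinary local test; a minimal sketch, with a hypothetical threshold helper and made-up test values:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object LocalPipelineTest {
  // the pipeline step under test: keep voxels above a cutoff
  def threshold(img: RDD[((Int, Int, Int), Double)], cutoff: Double) =
    img.filter(_._2 > cutoff)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "pipeline-test")   // local master, 2 cores
    val testImg = sc.parallelize(Seq(((0, 0, 0), 0.2), ((1, 0, 0), 0.9)))
    val result = threshold(testImg, 0.5).collect()
    assert(result.length == 1 && result.head._2 == 0.9)      // only the bright voxel survives
    println("threshold test passed")
    sc.stop()
  }
}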

Spark: Imaging Layer / TIPL

Developed at ETH Zurich, Paul Scherrer Institut, and 4Quant

  • TIPL is a flexible framework for image processing and analysis

    • modular design
    • verification and testing
    • reproduction of results from parameters
    • cluster integration
  • Report Generation

    • parametric assessment
    • pipelined analysis
    • Rendering subregions for movies
    • Displaying multidimensional data efficiently

TIPL

Image Processing Results

1300 bone samples analyzed using the Merlin Cluster at PSI in 6 weeks

  • Segmentation
    • hard calcified tissue (bone)
    • interior soft tissue (canals and cells)
  • Shape Analysis on all cells
  • Distance Maps for Colocation Analysis
  • Voronoi Tessellation for Density and Neighborhood Metrics

Substructures

Scans

Post-processing: Statistical Analysis

  • Exploring phenotypes of 55 million cells.
  • Our current approach has been either:
    • Group and divide, then average the results
      • Average cell volume, etc.
      • Average density or connectivity
    • Detailed analysis of individual samples (see the sketch below)
      • K-means clustering for different regions in the bone / layers
      • Principal component analysis for phenotypes
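
A minimal MLlib sketch of that per-sample workflow (the file name, column layout, cluster count, and the absence of a header row are all assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object PerSampleAnalysis {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "per-sample-analysis")
    // hypothetical per-cell feature vectors (volume, stretch, oblateness, ...) for one sample
    val cells = sc.textFile("sample_0001.csv").
      map(line => line.split(",").map(_.toDouble)).
      map(vals => Vectors.dense(vals)).
      cache()
    // k-means to separate regions / layers within the bone
    val regions = KMeans.train(cells, 4, 20)
    // principal components of the phenotype vectors
    val pcs = new RowMatrix(cells).computePrincipalComponents(2)
    println(regions.clusterCenters.mkString("\n"))
    println(pcs)
    sc.stop()
  }
}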

If we do that, we miss a lot!

Phenotype                   Within    Between   Ratio (%)
Length                       36.97      4.279       864.1
Width                        27.92      4.734       589.9
Height                       25.15      4.636       542.5
Volume                       67.85     12.479       543.7
Nearest Canal Distance       70.35    333.395        21.1
Density (Lc.Te.V)           144.40     27.658       522.1
Nearest Neighbors (Lc.DN)    31.86      1.835      1736.1
Stretch (Lc.St)              13.98      2.360       592.5
Oblateness (Lc.Ob)          141.27     18.465       765.1

The table shows the within-sample and between-sample variation for selected phenotypes in the first two columns, and the ratio of within to between (as a percentage) in the third.
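
A minimal sketch of how such a within/between comparison could be computed with Spark (the input file, its column layout, and the exact definitions of spread are assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // groupByKey and friends

object WithinBetween {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "within-between")
    // hypothetical input: one line per cell with "sampleId,phenotypeValue"
    val values = sc.textFile("all_cells.csv").map { line =>
      val cols = line.split(",")
      (cols(0), cols(1).toDouble)
    }
    // per-sample mean and standard deviation of the phenotype
    val perSample = values.groupByKey().map { case (id, vs) =>
      val n = vs.size
      val mean = vs.sum / n
      val sd = math.sqrt(vs.map(v => (v - mean) * (v - mean)).sum / n)
      (id, mean, sd)
    }.collect()
    // "within" = average spread inside a sample, "between" = spread of the sample means
    val within = perSample.map(_._3).sum / perSample.length
    val means = perSample.map(_._2)
    val grand = means.sum / means.length
    val between = math.sqrt(means.map(m => (m - grand) * (m - grand)).sum / means.length)
    println("within/between ratio (%): " + 100 * within / between)
    sc.stop()
  }
}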

Visualizing the Variation

How does this result look visually? Each line shows the mean \( \pm \) standard deviation for one sample. The range within a single sample is clearly much larger than the range between samples.

New Phenotypes

Phenotype values at the D9Mit259 marker, located on chromosome 9 in the mouse.

Statistical Comparison

Spark Imaging Performance

Real Performance

Overhead Costs

Spark Numerical Performance

With the ability to scale to millions of samples there is no need to condense. We can analyze the entire dataset in real time and try to identify groups or trends within the whole data instead of just precalculated views of it.

Practical

1276 comma-separated text files with 56 columns of data and 15-60K rows each

Task                     Single-Core Time    Spark Time (40 cores)
Load and Preprocess      360 minutes         10 minutes
Single Column Average    4.6 s               400 ms
1 K-means Iteration      2 minutes           1 s

We can iteratively explore the data and test hypotheses quickly
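
As a rough sketch of what those three tasks could look like in Spark (the file glob, column index, cluster count, and the absence of header rows are all assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // DoubleRDDFunctions such as mean()
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object TableExploration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "table-exploration")
    // load and preprocess all per-sample CSV files (56 numeric columns assumed, no headers)
    val rows = sc.textFile("results/*.csv").
      map(line => line.split(",").map(_.toDouble)).
      cache()
    // single-column average (here column 5 stands in for, e.g., cell volume)
    println("column average: " + rows.map(r => r(5)).mean())
    // a single k-means iteration over the full 56-dimensional feature vectors (as timed above)
    val model = KMeans.train(rows.map(r => Vectors.dense(r)), 5, 1)
    println(model.clusterCenters.mkString("\n"))
    sc.stop()
  }
}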

We found several composite phenotypes which are more consistent within samples than between samples

Rich, heavily developed platform

Available Tools

Tools built for table-like data structures are much better adapted to this kind of analysis.

Commercial Support

Dozens of major companies (Apple, Google, Facebook, Cisco, …) donate over $30M a year to the development of Spark and the Berkeley Data Analytics Stack

  • 2 startups in the last 6 months with seed-funding in excess of $15M each

Academic Support

  • All source code is available on GitHub
    • Elegant (20,000 lines vs my PhD of 75,000+)
  • No patents or restrictions on usage
  • Machine Learning Course in D-INFK next semester based on Spark

Quo Vadis: Synchrotron-based µCT Imaging?

Needs

  • Scale up to 1000 GVx samples (eventually)
  • Analyze standard measurements (14 GVx) routinely within a day
  • Analyze as fast as we usually measure (15 minutes / scan)

Would be nice

  • Analyze scans as fast as we measure now (1 scan / minute)
  • Analyze scans as fast as we could measure with Gigafrost (10 scans / minute)

Tomcat Goals

Beyond: Streaming

Post-processing goals

  • Analysis done in weeks instead of months
  • Some real-time analysis and statistics

Streaming

Spark can handle static data or live data coming in from a 'streaming' device like a camera to do real-time analysis. The exact same code can be used for real-time analysis and for static data.

Scalability

Connect more computers.

Start workers on these computers. That's it.

The source could be the camera server itself or a watched directory on a machine.
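
A minimal Spark Streaming sketch of the watched-directory case (the directory path, the 30-second batch interval, and the line format are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WatchedDirectoryDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("watched-directory")
    val ssc = new StreamingContext(conf, Seconds(30))
    // every new file written into this directory becomes part of the next micro-batch
    val newLines = ssc.textFileStream("/data/tomcat/incoming")
    // the same coarse-grained operations as in the static case
    val brightVoxels = newLines.
      map(line => line.split(",")).             // assumed format: x,y,z,value per line
      filter(cols => cols(3).toDouble > 0.5).
      count()
    brightVoxels.print()                        // report the count for each batch
    ssc.start()
    ssc.awaitTermination()
  }
}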


Beyond: Approximate Results

Projects at AMPLab like Spark and BlinkDB are moving towards approximate results.

  • Instead of mean(volume)
    • mean(volume).within_time(5)
    • mean(volume).within_ci(0.95)

For real-time image processing it might be the only feasible solution and could drastically reduce the amount of time spent on analysis.
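
Spark itself already exposes some approximate actions along these lines; a minimal sketch (the input file is an assumption, and the exact timeout/confidence semantics depend on the Spark version):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // DoubleRDDFunctions: mean(), meanApprox()

object ApproximateMean {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "approximate-mean")
    // hypothetical input: one cell volume per line
    val volumes = sc.textFile("volumes.csv").map(_.toDouble).cache()
    // exact answer: scans everything
    println("exact mean:   " + volumes.mean())
    // answer within a time budget: stop after 5 s and report the current estimate with bounds
    println("approx mean:  " + volumes.meanApprox(5000, 0.95).initialValue)
    // or trade data for time: estimate from a 1% sample
    println("sampled mean: " + volumes.sample(false, 0.01, 42).mean())
    sc.stop()
  }
}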

What sort of projects demand this?

Zebrafish Phenotyping Project

Collaboration with Keith Cheng, Ian Foster, Xuying Xin, Patrick La Raviere, Steve Wang

1000s of samples of full adult animals, imaged at 0.74 \( \mu m \) resolution: Images 11500 x 2800 x 628 \( \longrightarrow \) 20-40GVx / sample

  • Identification of single cells (no down-sampling)
  • Cell networks and connectivity
  • Classification of cell type
  • Registration with histology

Brain Project

Collaboration with Alberto Astolfo, Matthias Schneider, Bruno Weber, Marco Stampanoni

1 \( cm^3 \) scanned at 1 \( \mu m \) resolution: Images \( \longrightarrow \) 1000 GVx / sample

  • Registration of separate scans together
  • Blood vessel structure and networks
  • Registration with fMRI, histology

Spark Preview

Acknowledgements

  • AIT at PSI
  • X-Ray Microscopy Group at the TOMCAT Beamline of the Swiss Light Source

We are interested in partnerships and collaborations

Learn more at

Starting Spark

Spin up your own cluster in an hour

  1. Start a master node: start-master.sh
  2. Start several worker nodes:

start-worker.sh spark://master-ip:7077 -c #CORES -m #MEM

  3. Start the Spark shell: spark-shell.sh
  4. Write code in Scala, Java, Python, or R
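
The snippets on the following slides assume the reconstructed scan is already available as an RDD of ((x, y, z), value) voxels named inImg; a minimal stand-in using a small synthetic volume (the real loading step is omitted):

// in the spark-shell a SparkContext named sc is already provided
// a synthetic 100 x 100 x 100 volume stands in for a reconstructed scan;
// each voxel is keyed by its (x, y, z) position and carries a gray value
val inImg = sc.parallelize(
  for (x <- 0 until 100; y <- 0 until 100; z <- 0 until 100)
    yield ((x, y, z), 255 * math.random))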

Perform a threshold

val threshVal=127
// keep only the voxels whose gray value exceeds the threshold
val labelImg=inImg.filter(_._2>threshVal)
  • Runs on 1 core on your laptop or 1000 cores in the cloud or on Merlin or the beamline.
  • If one computer crashes or disconnects it automatically continues on another one.
  • If one part of the computation is taking too long it will be sent to other computers to finish
  • If a computer runs out of memory it writes the remaining results to disk and continues running (graceful drop-off in performance)

Region of Interest

Take a region of interest between 0 and 100 in X,Y, and Z

def roiFunc(pvec: ((Int,Int,Int),Double)) = 
 {pvec._1._1>=0 & pvec._1._1<100 & // X
  pvec._1._2>=0 & pvec._1._2<100 & //Y
  pvec._1._3>=0 & pvec._1._3<100} // Z

val roiImg=labelImg.filter(roiFunc)

Perform a 3x3x3 box filter

def spread_voxels(pvec: ((Int,Int,Int),Double), windSize: Int = 1) = {
  val wind=(-windSize to windSize)
  val pos=pvec._1
  // each voxel contributes an equal share of its value to every position in the window
  val scalevalue=pvec._2/math.pow(wind.length,3)
  for(x<-wind; y<-wind; z<-wind) 
    yield ((pos._1+x,pos._2+y,pos._3+z),scalevalue)
}

val filtImg=roiImg.
      flatMap(cvec => spread_voxels(cvec)).
      filter(roiFunc).reduceByKey(_ + _)

Perform component labeling

// label-spreading variant of spread_voxels: the value is a (label, isOriginal) pair
def spread_labels(pvec: ((Int,Int,Int),Long), windSize: Int = 1) = {
  val wind=(-windSize to windSize)
  val pos=pvec._1
  for(x<-wind; y<-wind; z<-wind)
    yield ((pos._1+x,pos._2+y,pos._3+z),
           (pvec._2, x==0 & y==0 & z==0))
}
// start from the segmented image with a unique label per voxel, derived from its position
var curLabels=labelImg.map(pvec =>
  (pvec._1, pvec._1._1.toLong+10000L*(pvec._1._2+10000L*pvec._1._3)))
var groupList=Array((0L,0))
var running=true
var iterations=0
while (running) {
  val newLabels=curLabels.
    flatMap(spread_labels(_,1)).
    reduceByKey((a,b) => (math.min(a._1,b._1),(a._2 | b._2))).
    filter(_._2._2). // keep only voxels which contain original pixels
    map(pvec => (pvec._1,pvec._2._1))
  // make a list of each label and how many voxels are in it
  val curGroupList=newLabels.map(pvec => (pvec._2,1)).
    reduceByKey(_ + _).sortByKey(true).collect
  // if the list isn't the same as before, continue running since we need to wait for swaps to stop
  running = (curGroupList.deep!=groupList.deep)
  groupList=curGroupList
  curLabels=newLabels
  iterations+=1
  print("Iter #"+iterations+":"+groupList.mkString(","))
}
groupList

Lazy evaluation

// build the full pipeline lazily: threshold, then component labeling, then crop
val outThresh=inImg.map(threshFunc)
val outLabels=runCL(outThresh)
// only the save below forces the pipeline to execute
outLabels.filter(roiFunc).saveImage("test.tif")
  • No execution starts until you save the file or require output
  • Spark automatically deconstructs the pipeline and optimizes the jobs to run so computation is not wasted outside of the region of interest (even though we did it last)