High-throughput, Scalable, Quantitative, Cellular Phenotyping using X-Ray Tomographic Microscopy

Kevin Mader [1,2,3], Leah Rae Donahue [4], Ralph Müller [5], and Marco Stampanoni [1,2]
8 April 2014

  1. Swiss Light Source, Paul Scherrer Institut, 5232 Villigen, Switzerland
  2. Institute for Biomedical Engineering, University and ETH Zurich, 8092 Zurich, Switzerland
  3. 4Quant, Zurich, Switzerland
  4. The Jackson Laboratory, Bar Harbor, ME, United States
  5. Institute for Biomechanics, ETH Zurich, 8093 Zurich, Switzerland


Outline

  • Introduction
  • Motivation

    • Big Data
    • Scalability
  • Scalability

    • Imaging
    • Numerical Analysis
  • Results

  • Performance

  • Beyond

Introduction

  • A phenotype is a very general term for any measurable characteristic resulting from a combination of genetic background and environmental conditions
    • All fields of genetics are ultimately dependent on phenotypes, since they are the observable differences
  • Structural phenotypes are related to position, size, shape, and organization of different cells, tissues, and organs

Bone Structure

Implied

  • Many data points: involves looking at 100s to 10,000s+ of samples
  • Reproducible: every sample needs to be handled identically to avoid biasing the results

Synchrotron-based X-Ray Tomographic Microscopy

The only technique which can do all of the following:

  • peer deep into large samples
  • achieve \( \mathbf{<1\mu m} \) isotropic spatial resolution
    • with 1.8mm field of view
  • achieve >10 Hz temporal resolution

SLS and TOMCAT

[1] Mokso et al., J. Phys. D, 46(49), 2013

Cellular and Canal Structures in Murine Cortical Bone

So we have the imaging modality, now what?

Phenotyping

Phenotype Pipeline

Scaling Acquisition

Automated Sample Exchange

Automated Alignment and Region of Interest Scanning

Acquisition Results

1300 bone samples measured at 0.7 \( \mu m \) resolution in 2 weeks, automatically

Scans

Time Consumption

Just like genetics, the burden of imaging is moving from acquisition to post-processing

Time Consumption

[3] Sboner, A., et al., Genome Biology, 2011

Post-Processing Motivation

There are three different types of problems that we will run into.

Really big data sets

  • Several copies of the dataset need to be in memory for processing
  • Computers with more than 256 GB of memory are expensive and difficult to find
  • Even those machines typically have 16 cores, so there is still only 16 GB per core
  • Drive speed / network file access becomes a limiting factor
  • If the job crashes you lose everything
    • or you have to manually write a bunch of messy check-pointing code

Many datasets

  • For genome-scale studies 1000s of samples need to be analyzed identically
  • Dynamic experiments can have hundreds of measurements
  • Animal phenotyping can produce 1000s of huge (328 GB) datasets

Exploratory Studies

  • Not sure what we are looking for
  • Easy to develop new analyses
  • Quick to test hypotheses

Definition: Big Data

Velocity, Volume, Variety

When a ton of heterogeneous information is coming in fast.

Performant, scalable, and flexible

When scaling isn't scary

10X, 100X, 1000X is the same amount of effort

When you are starving for enough data

Most imaging groups are overwhelmed with data (8GB/s adds up very quickly)

High Levels of Automation = 0 'clicks' per sample

Don't existing tools already do Big Data?

  • Not really.
  • Most scale well only up to 1024 x 1024 x 1024 \( \approx \) 1 GVx
  • They take 12-24 hours for a full structural analysis
  • If one machine crashes, everything crashes
  • High-performance GPU methods are very tedious to scale to large datasets
  • Some tools for filtering \( \rightarrow \) other tools for segmentation \( \rightarrow \) others for visualization \( \rightarrow \) others for statistical analysis
    • You spend a lot of time writing glue
    • Different languages, conventions, and tools
  • None of these approaches are robust or deal with the data flood

Why is scaling up hard?

Parallel

Coordination

Parallel computing requires a significant amount of coordination between computers for tasks that are not easily parallelizable.

Mutability

If you have two cores / computers trying to write the same information at the same time, the result is no longer deterministic (not good).

Blocking

Taking turns and waiting for every independent process to finish completely negates the benefits of parallel computing.

Distributed

Sending Instructions / Data Afar

Fault Tolerance

If you have 1000 computers working on solving a problem and one fails, you do not want your whole job to crash

Data Storage

How can you access and process data from many different computers quickly without very expensive infrastructure?

Why is it even harder?

Each parallel library / tool comes with its own constraints: CUDA, OpenCL, OpenMPI, Matlab Distributed Toolbox, KORUS, .NET Concurrent Framework, Python Multiprocessing, Hadoop, Storm, Impala, Redshift

  • You must learn many very specific code bases and manage the logistics of organizing jobs across them.
  • Almost none are fault-tolerant out of the box.
  • Lock-in: transitioning from one code base to another is at best nightmarish
  • Testing and profiling code locally can be very difficult
  • None of them are flexible enough to let extra processing power 'jump in' when it is needed for intense computations

Solution: Spark / Resilient Distributed Datasets

Technical Specifications

  • Developed by the AMP Lab at UC Berkeley in 2012
  • General tool for all Directed Acyclical Graph (DAG) workflows
  • Coarse-grained processing \( \rightarrow \) simple operations applied to entire sets (see the sketch below)
    • Map, reduce, join, group by, fold, foreach, filter,…
  • In-memory caching
  • Based on Scala, offers interactive REPL

Apache Spark

Zaharia, M., et al. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
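
As a rough illustration of the coarse-grained model, a minimal stand-alone sketch (the toy image and the (x, y, z) voxel keys are illustrative assumptions, not part of the Spark API):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey

object CoarseGrainedDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "coarse-grained-demo")
    // a toy "image": key = (x, y, z) position, value = gray level
    val inImg = sc.parallelize(
      for (x <- 0 until 10; y <- 0 until 10; z <- 0 until 10)
        yield ((x, y, z), math.random))
    val bright = inImg.filter(_._2 > 0.5)            // coarse-grained filter over the whole set
    val perSlice = bright.map { case ((x, y, z), v) => (z, v) }.
      reduceByKey(_ + _)                             // per-slice reduction
    perSlice.cache()                                 // in-memory caching for re-use
    println(perSlice.collect().mkString(", "))
    sc.stop()
  }
}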

Practical Specifications

  • Distributed, parallel computing without logistics, libraries, or compiling
  • Declarative rather than imperative
    • Apply operation \( f \) to each image / block
    • Even scheduling is handled automatically
  • Input and results can be stored in memory, on disk, in the cloud, using Hadoop, redundantly or not
  • Functional, unit, and integration tests can be performed locally (see the sketch below)
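
Because the same code runs locally and on a cluster (only the master URL changes), a pipeline step can be exercised as an ordinary local test; a minimal sketch, with a hypothetical threshold helper and made-up test values:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object LocalPipelineTest {
  // the pipeline step under test: keep voxels above a cutoff
  def threshold(img: RDD[((Int, Int, Int), Double)], cutoff: Double) =
    img.filter(_._2 > cutoff)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "pipeline-test")   // local master, 2 cores
    val testImg = sc.parallelize(Seq(((0, 0, 0), 0.2), ((1, 0, 0), 0.9)))
    val result = threshold(testImg, 0.5).collect()
    assert(result.length == 1 && result.head._2 == 0.9)      // only the bright voxel survives
    println("threshold test passed")
    sc.stop()
  }
}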

Spark: Imaging Layer / TIPL

Developed at ETH Zurich, Paul Scherrer Institut, and 4Quant

  • TIPL is a flexible framework for image processing and analysis

    • modular design
    • verification and testing
    • reproduction of results from parameters
    • cluster integration
  • Report Generation

    • parametric assessment
    • pipelined analysis
    • Rendering subregions for movies
    • Displaying multidimensional data efficiently

TIPL

Image Processing Results

1300 bone samples analyzed using the Merlin Cluster at PSI in 6 weeks

  • Segmentation
    • hard calcified tissue (bone)
    • interior soft tissue (canals and cells)
  • Shape Analysis on all cells
  • Distance Maps for Colocation Analysis
  • Voronoi Tessellation for Density and Neighborhood Metrics

Substructures

Scans

Post-processing: Statistical Analysis

  • Exploring phenotypes of 55 million cells.
  • Our current approach has been either:
    • Group and divide, then average the results
      • Average cell volume, etc.
      • Average density or connectivity
    • Detailed analysis of individual samples (see the sketch below)
      • K-means clustering for different regions in the bone / layers
      • Principal component analysis for phenotypes
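
A minimal MLlib sketch of that per-sample workflow (the file name, column layout, cluster count, and the absence of a header row are all assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object PerSampleAnalysis {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "per-sample-analysis")
    // hypothetical per-cell feature vectors (volume, stretch, oblateness, ...) for one sample
    val cells = sc.textFile("sample_0001.csv").
      map(line => line.split(",").map(_.toDouble)).
      map(vals => Vectors.dense(vals)).
      cache()
    // k-means to separate regions / layers within the bone
    val regions = KMeans.train(cells, 4, 20)
    // principal components of the phenotype vectors
    val pcs = new RowMatrix(cells).computePrincipalComponents(2)
    println(regions.clusterCenters.mkString("\n"))
    println(pcs)
    sc.stop()
  }
}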

If we do that, we miss a lot!

Phenotype                   Within    Between   Ratio (%)
Length                       36.97      4.279       864.1
Width                        27.92      4.734       589.9
Height                       25.15      4.636       542.5
Volume                       67.85     12.479       543.7
Nearest Canal Distance       70.35    333.395        21.1
Density (Lc.Te.V)           144.40     27.658       522.1
Nearest Neighbors (Lc.DN)    31.86      1.835      1736.1
Stretch (Lc.St)              13.98      2.360       592.5
Oblateness (Lc.Ob)          141.27     18.465       765.1

The table shows the within-sample and between-sample variation for selected phenotypes in the first two columns, and the ratio of within to between (as a percentage) in the third.
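
A minimal sketch of how such a within/between comparison could be computed with Spark (the input file, its column layout, and the exact definitions of spread are assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // groupByKey and friends

object WithinBetween {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "within-between")
    // hypothetical input: one line per cell with "sampleId,phenotypeValue"
    val values = sc.textFile("all_cells.csv").map { line =>
      val cols = line.split(",")
      (cols(0), cols(1).toDouble)
    }
    // per-sample mean and standard deviation of the phenotype
    val perSample = values.groupByKey().map { case (id, vs) =>
      val n = vs.size
      val mean = vs.sum / n
      val sd = math.sqrt(vs.map(v => (v - mean) * (v - mean)).sum / n)
      (id, mean, sd)
    }.collect()
    // "within" = average spread inside a sample, "between" = spread of the sample means
    val within = perSample.map(_._3).sum / perSample.length
    val means = perSample.map(_._2)
    val grand = means.sum / means.length
    val between = math.sqrt(means.map(m => (m - grand) * (m - grand)).sum / means.length)
    println("within/between ratio (%): " + 100 * within / between)
    sc.stop()
  }
}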

Visualizing the Variation

How does this result look visually? Each line shows the mean \( \pm \) standard deviation for one sample. The range within a single sample is clearly much larger than the range between samples.

New Phenotypes

Phenotype values at the D9Mit259 marker, located on chromosome 9 in the mouse.

Statistical Comparison

Spark Imaging Performance

Real Performance

Overhead Costs

Spark Numerical Performance

With the ability to scale to millions of samples there is no need to condense. We can analyze the entire dataset in real time and try to identify groups or trends within the whole data instead of just precalculated views of it.

Practical

1276 comma-separated text files with 56 columns of data and 15-60K rows each

Task                     Single-Core Time    Spark Time (40 cores)
Load and Preprocess      360 minutes         10 minutes
Single Column Average    4.6 s               400 ms
1 K-means Iteration      2 minutes           1 s

We can iteratively explore the data and test hypotheses quickly
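
As a rough sketch of what those three tasks could look like in Spark (the file glob, column index, cluster count, and the absence of header rows are all assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // DoubleRDDFunctions such as mean()
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object TableExploration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "table-exploration")
    // load and preprocess all per-sample CSV files (56 numeric columns assumed, no headers)
    val rows = sc.textFile("results/*.csv").
      map(line => line.split(",").map(_.toDouble)).
      cache()
    // single-column average (here column 5 stands in for, e.g., cell volume)
    println("column average: " + rows.map(r => r(5)).mean())
    // a single k-means iteration over the full 56-dimensional feature vectors (as timed above)
    val model = KMeans.train(rows.map(r => Vectors.dense(r)), 5, 1)
    println(model.clusterCenters.mkString("\n"))
    sc.stop()
  }
}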

We found several composite phenotypes which are more consistent within samples than between samples

Rich, heavily developed platform

Available Tools

Tools built for table-like data structures are much better adapted to this kind of analysis.

Commercial Support

Dozens of major companies (Apple, Google, Facebook, Cisco, …) donate over $30M a year to the development of Spark and the Berkeley Data Analytics Stack

  • 2 startups in the last 6 months with seed-funding in excess of $15M each

Academic Support

  • All source code is available on GitHub
    • Elegant (20,000 lines vs my PhD of 75,000+)
  • No patents or restrictions on usage
  • Machine Learning Course in D-INFK next semester based on Spark

Quo Vadis: Synchrotron-based µCT Imaging?

Needs

  • Scale up to 1000 GVx samples (eventually)
  • Analyze standard measurements (14 GVx) routinely within a day
  • Analyze as fast as we usually measure (15 minutes / scan)

Would be nice

  • Analyze scans as fast as we measure now (1 scan / minute)
  • Analyze scans as fast as we could measure with Gigafrost (10 scans / minute)

Tomcat Goals

Beyond: Streaming

Post-processing goals

  • Analysis done in weeks instead of months
  • Some real-time analysis and statistics

Streaming

Spark can handle static data or live data coming in from a 'streaming' device like a camera to do real-time analysis. The exact same code can be used for real-time analysis and for static data.

Scalability

Connect more computers.

Start workers on these computers. That's it.

The source could be the camera server itself or a watched directory on a machine.
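
A minimal Spark Streaming sketch of the watched-directory case (the directory path, the 30-second batch interval, and the line format are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WatchedDirectoryDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("watched-directory")
    val ssc = new StreamingContext(conf, Seconds(30))
    // every new file written into this directory becomes part of the next micro-batch
    val newLines = ssc.textFileStream("/data/tomcat/incoming")
    // the same coarse-grained operations as in the static case
    val brightVoxels = newLines.
      map(line => line.split(",")).             // assumed format: x,y,z,value per line
      filter(cols => cols(3).toDouble > 0.5).
      count()
    brightVoxels.print()                        // report the count for each batch
    ssc.start()
    ssc.awaitTermination()
  }
}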


Beyond: Approximate Results

Projects at AMPLab like Spark and BlinkDB are moving towards approximate results.

  • Instead of mean(volume)
    • mean(volume).within_time(5)
    • mean(volume).within_ci(0.95)

For real-time image processing it might be the only feasible solution and could drastically reduce the amount of time spent on analysis.
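
Spark itself already exposes some approximate actions along these lines; a minimal sketch (the input file is an assumption, and the exact timeout/confidence semantics depend on the Spark version):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // DoubleRDDFunctions: mean(), meanApprox()

object ApproximateMean {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "approximate-mean")
    // hypothetical input: one cell volume per line
    val volumes = sc.textFile("volumes.csv").map(_.toDouble).cache()
    // exact answer: scans everything
    println("exact mean:   " + volumes.mean())
    // answer within a time budget: stop after 5 s and report the current estimate with bounds
    println("approx mean:  " + volumes.meanApprox(5000, 0.95).initialValue)
    // or trade data for time: estimate from a 1% sample
    println("sampled mean: " + volumes.sample(false, 0.01, 42).mean())
    sc.stop()
  }
}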

What sort of projects demand this?

Zebrafish Phenotyping Project

Collaboration with Keith Cheng, Ian Foster, Xuying Xin, Patrick La Raviere, Steve Wang

1000s of samples of full adult animals, imaged at 0.74 \( \mu m \) resolution: Images 11500 x 2800 x 628 \( \longrightarrow \) 20-40GVx / sample

  • Identification of single cells (no down-sampling)
  • Cell networks and connectivity
  • Classification of cell type
  • Registration with histology

Brain Project

Collaboration with Alberto Astolfo, Matthias Schneider, Bruno Weber, Marco Stampanoni

1 \( cm^3 \) scanned at 1 \( \mu m \) resolution: Images \( \longrightarrow \) 1000 GVx / sample

  • Registration of separate scans together
  • Blood vessel structure and networks
  • Registration with fMRI, histology

Spark Preview

Acknowledgements

  • AIT at PSI
  • X-Ray Microscopy Group at the TOMCAT Beamline of the Swiss Light Source

We are interested in partnerships and collaborations

Learn more at

Starting Spark

Spin up your own cluster in an hour

  1. Start a master node: start-master.sh
  2. Start several worker nodes:

start-worker.sh spark://master-ip:7077 -c #CORES -m #MEM

  3. Start the Spark shell: spark-shell.sh
  4. Write code in Scala, Java, Python, or R
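
The snippets on the following slides assume the reconstructed scan is already available as an RDD of ((x, y, z), value) voxels named inImg; a minimal stand-in using a small synthetic volume (the real loading step is omitted):

// in the spark-shell a SparkContext named sc is already provided
// a synthetic 100 x 100 x 100 volume stands in for a reconstructed scan;
// each voxel is keyed by its (x, y, z) position and carries a gray value
val inImg = sc.parallelize(
  for (x <- 0 until 100; y <- 0 until 100; z <- 0 until 100)
    yield ((x, y, z), 255 * math.random))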

Perform a threshold

val threshVal=127
// keep only the voxels whose gray value exceeds the threshold
val labelImg=inImg.filter(_._2>threshVal)
  • Runs on 1 core on your laptop or 1000 cores in the cloud or on Merlin or the beamline.
  • If one computer crashes or disconnects it automatically continues on another one.
  • If one part of the computation is taking too long it will be sent to other computers to finish
  • If a computer runs out of memory it writes the remaining results to disk and continues running (graceful drop-off in performance)

Region of Interest

Take a region of interest between 0 and 100 in X,Y, and Z

def roiFunc(pvec: ((Int,Int,Int),Double)) = 
 {pvec._1._1>=0 & pvec._1._1<100 & // X
  pvec._1._2>=0 & pvec._1._2<100 & //Y
  pvec._1._3>=0 & pvec._1._3<100} // Z

val roiImg=labelImg.filter(roiFunc)

Perform a 3x3x3 box filter

def spread_voxels(pvec: ((Int,Int,Int),Double), windSize: Int = 1) = {
  val wind=(-windSize to windSize)
  val pos=pvec._1
  // each voxel contributes an equal share of its value to every position in the window
  val scalevalue=pvec._2/math.pow(wind.length,3)
  for(x<-wind; y<-wind; z<-wind) 
    yield ((pos._1+x,pos._2+y,pos._3+z),scalevalue)
}

val filtImg=roiImg.
      flatMap(cvec => spread_voxels(cvec)).
      filter(roiFunc).reduceByKey(_ + _)

Perform component labeling

// label-spreading variant of spread_voxels: the value is a (label, isOriginal) pair
def spread_labels(pvec: ((Int,Int,Int),Long), windSize: Int = 1) = {
  val wind=(-windSize to windSize)
  val pos=pvec._1
  for(x<-wind; y<-wind; z<-wind)
    yield ((pos._1+x,pos._2+y,pos._3+z),
           (pvec._2, x==0 & y==0 & z==0))
}
// start from the segmented image with a unique label per voxel, derived from its position
var curLabels=labelImg.map(pvec =>
  (pvec._1, pvec._1._1.toLong+10000L*(pvec._1._2+10000L*pvec._1._3)))
var groupList=Array((0L,0))
var running=true
var iterations=0
while (running) {
  val newLabels=curLabels.
    flatMap(spread_labels(_,1)).
    reduceByKey((a,b) => (math.min(a._1,b._1),(a._2 | b._2))).
    filter(_._2._2). // keep only voxels which contain original pixels
    map(pvec => (pvec._1,pvec._2._1))
  // make a list of each label and how many voxels are in it
  val curGroupList=newLabels.map(pvec => (pvec._2,1)).
    reduceByKey(_ + _).sortByKey(true).collect
  // if the list isn't the same as before, continue running since we need to wait for swaps to stop
  running = (curGroupList.deep!=groupList.deep)
  groupList=curGroupList
  curLabels=newLabels
  iterations+=1
  print("Iter #"+iterations+":"+groupList.mkString(","))
}
groupList

Lazy evaluation

// build the full pipeline lazily: threshold, then component labeling, then crop
val outThresh=inImg.map(threshFunc)
val outLabels=runCL(outThresh)
// only the save below forces the pipeline to execute
outLabels.filter(roiFunc).saveImage("test.tif")
  • No execution starts until you save the file or require output
  • Spark automatically deconstructs the pipeline and optimizes the jobs to run so computation is not wasted outside of the region of interest (even though we did it last)