Investigating the Microvessel Architecture of the Mouse Brain

An approach for measuring, stitching, and analyzing 50 terabytes of data
K. Mader, A. Patera, A. Astolfo, M. Schneider, B. Weber, M. Stampanoni (4quant.com/SRI2015)


Paul Scherrer Institut ETH Zurich

Outline

  • Imaging in 2015
  • Brain-scale Vasculature
  • Traditional Approaches
  • Our Approach
    • Stitching
    • Searching
    • Streaming
  • Outlook / Developments

Internal Structures

Image Science in 2015: More and faster

X-Ray

  • Swiss Light Source (SRXTM) images at (>1000fps) \( \rightarrow \) 8GB/s
  • Diffraction patterns (cSAXS) at 30GB/s
  • Nanoscopium (Soleil), 10TB/day, 10-500GB file sizes, very heterogeneous data
  • Swiss radiologists will produce 1PB of data per year

Geospatial

  • New satellite projects (Skybox, etc) will measure >10 petabytes of images a year

Personal

  • fps1000 - (640 x 480 x 840 fps) for $400 \( \rightarrow \) 780MB/s \( \rightarrow \) 66TB/day
  • GoPro 4 Black - 3840 x 2160 x 30fps for $600 \( \rightarrow \) 750MB/s \( \rightarrow \) 64TB/day \( \rightarrow \) 23PB/year
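
A quick check on the arrow chain for the GoPro numbers, assuming uncompressed 24-bit color (our assumption, not the camera's actual recording format):

\[ 3840 \times 2160 \times 30\,\textrm{fps} \times 3\,\textrm{B/pixel} \approx 750\,\textrm{MB/s} \approx 64\,\textrm{TB/day} \approx 23\,\textrm{PB/year} \]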

Time Breakdown

How the breakdown of science into component tasks has changed (from a time perspective) over the last years, and how we expect it to change by 2020:

  • Experiment Design (Pink)
  • Data Management (Blue)
  • Measurement Time (Green)
  • Post Processing (Purple)

The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.

(Figure: time breakdown over the years)

Science or IT?

The same task breakdown and color coding as on the previous slide.

(Figure: time breakdown, science vs. IT view)

Mouse Brain: Full Vascular Imaging

  • Measure a full mouse brain (1cm\( ^3 \)) with cellular resolution (1 \( \mu \) m)
  • 10 x 10 x 10 scans at 2560 x 2560 x 2160 \( \rightarrow \) 14 TVoxels
  • A single slice is only 0.000004% of the entire dataset
  • 14TVoxels = 56TB
  • Each scan needs to be registered and aligned together
  • There are no computers with 56TB of memory
  • Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)
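
A quick check on the numbers, assuming 4 bytes per voxel (e.g. 32-bit values):

\[ 10^3 \textrm{ scans} \times (2560 \times 2560 \times 2160) \textrm{ voxels} \approx 1.4\times 10^{13} \approx 14\,\textrm{TVoxels}, \qquad 14\,\textrm{TVoxels} \times 4\,\textrm{B} \approx 56\,\textrm{TB} \]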

The Scanning

Thank you Alessandra and Alberto!

(Figure)

(Figure)

Let's Stitch

The first challenge with these data is stitching them together.

  • Algorithmically it is very simple (a brute-force sketch follows below):
  • \( \textrm{Overlap}_{A \rightarrow B} = \textrm{argmax}_{\vec{x}}\left[\mathcal{FT}^{-1} \left\{\mathcal{FT}\{A\}\mathcal{FT}\{B\}^{*}\right\} \right] \)
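
As a minimal sketch of the idea (a brute-force, spatial-domain search over a limited window rather than the FFT formulation above; tiles are simplified here to plain 2D arrays):

// Brute-force offset search: slide tile B over tile A within a small window
// and keep the offset with the highest correlation score (sum of products).
// A real implementation would also normalise by the overlap area.
def bestOffset(a: Array[Array[Double]], b: Array[Array[Double]],
               maxShift: Int = 20): (Int, Int) = {
  val shifts = for (dx <- -maxShift to maxShift; dy <- -maxShift to maxShift) yield {
    var score = 0.0
    for (x <- a.indices; y <- a(0).indices) {
      val bx = x + dx
      val by = y + dy
      if (bx >= 0 && bx < b.length && by >= 0 && by < b(0).length)
        score += a(x)(y) * b(bx)(by)
    }
    ((dx, dy), score)
  }
  shifts.maxBy(_._2)._1
}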

Computationally

We need to stitch each scan with all of its neighbors

  • \( x=\pm 1 \)
  • \( y=\pm 1 \)
  • \( z=\pm 1 \)

How does that look?

(Figures)

But it gets better

We have a few more parameters (\( \theta \)) and we know the initial positions so we optimize in a limited range rather than the entire image.

  • \( \rightarrow \) we need to add another loop
  • Iterate as long as the offset vectors are changing

Execute

Now that our task is clear we just execute \( \rightarrow \)

Super workstation

  • 160GB RAM
  • awesome-sauce graphics card
  • 40 cores with hyperthreading
  • SSD drives

Matlab + FIJI

  • Premade algorithms
  • Just click and go
  • 10x10x10 scans x 8 neighbors \( \rightarrow \) 8,000 registrations per iteration
  • 20 - 40 minutes per registration
  • 1000-4000 hours \( \rightarrow \) 40-160 days per iteration

A Cluster?

Basic Topology

  • Driver Node

The machine used by the user; it is responsible for creating the jobs that need to run

  • Master Node

Distributes tasks across multiple machines and coordinates communication between them

  • Worker Nodes

The nodes where the computation is actually done

Using a cluster

  • \( \rightarrow \) Tell the driver to load all images
  • The driver determines which data need to be loaded
  • Driver \( \rightarrow \) Master: exactly which data to load

Steps

  • tell machine 1 to load images 0 to 2
  • tell machine 2 to load images 3 to 5

Exchanging Images

Share the images so the overlap can be calculated:

  • \( \rightarrow \) tell machine 1 to send image 0 to machine 2
  • \( \rightarrow \) tell machine 2 to calculate: \( \textrm{Overlap}(A\rightarrow B) \)

Problems

  • Every task on every machine must be explicitly specified
  • Substantial effort in coordination
  • Huge amounts of data traffic between shared storage and machines
  • Not flexible or elastic


Exploratory Image Processing Priorities

Correctness

The most important job for any piece of analysis is to be correct.

  • A powerful testing framework is essential
  • Avoid repetition of code which leads to inconsistencies
  • Use compilers to find mistakes rather than users

Easily understood, changed, and used

Almost all image processing tasks require a number of people to evaluate and implement them, and they are almost always moving targets

  • Flexible, modular structure that enables replacing specific pieces

Fast

The last of the major priorities is speed, which covers scalability, raw performance, and development time.

  • Long waits for processing discourage exploration
  • Manually accessing data on separate disks is a huge speed barrier
  • Real-time image processing requires millisecond latencies
  • Implementing new ideas can be done quickly

Big Data

Velocity, Volume, Variety

When a ton of heterogeneous data is coming in fast.

Performant, scalable, and flexible

When scaling isn't scary

10X, 100X, 1000X is the same amount of effort

When you are starving for enough data

The director of the AMPLab said their rate-limiting factor is always having enough interesting data

0 'clicks' per sample

Question not implementation driven

  • Ask the questions you should, not the questions you can

You conduct the orchestra, you don't spend time telling each person what to do

Big Data

The Framework First

  • Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets
    • choose the framework first
    • then start making the necessary tools.
  • Google, Amazon, Yahoo, and many other companies have made huge inroads into these problems
  • The real need is a fast, flexible framework for robustly and scalably performing complicated analyses, a sort of Excel for big imaging data.

Apache Spark

  • World Record Holder for Terabyte and Petabyte Sort
    • 100TB of data on 206 machines in 23 minutes
    • 1PB of data on 190 machines in <4 hours
  • Scalable
    • >10000 computers
    • storing and processing exabytes of data
    • sustained data rates of 3Gb/s/node (map) and 1.1Gb/s/node (reduce)
  • fault tolerance
    • two-thirds of the computers can crash and a request still finishes correctly
  • hardware and software platform independence (Mac, Windows, Linux)

Spark -> Microscopy?

These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.

I want to

  • filter out noise, segment, choose regions of interest
  • contour, component label
  • measure, count, and analyze
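
To make that mapping concrete, here is a minimal, hypothetical sketch in plain Spark (not the SIL API) in which images are simply arrays of pixel values: noise removal becomes a map, segmentation a map to binary masks, and counting a reduce.

import org.apache.spark.{SparkConf, SparkContext}

object ImagingAsSparkOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("imaging-demo").setMaster("local[*]"))

    // A toy dataset of (id, image) pairs standing in for loaded tiles.
    val images = sc.parallelize(Seq(
      (0, Array(0.1, 0.9, 0.8, 0.2)),
      (1, Array(0.7, 0.6, 0.1, 0.95))
    ))

    // "filter out noise"  -> map: clamp values below a noise floor to zero
    val denoised = images.mapValues(img => img.map(v => if (v < 0.2) 0.0 else v))

    // "segment"           -> map: threshold to a binary mask
    val masks = denoised.mapValues(img => img.map(v => if (v > 0.5) 1.0 else 0.0))

    // "measure, count"    -> reduce: total foreground pixels across all images
    val foreground = masks.map { case (_, m) => m.sum }.reduce(_ + _)
    println(s"Foreground pixels: $foreground")

    sc.stop()
  }
}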

Spark Image Layer

  • Developed at 4Quant, ETH Zurich, and Paul Scherrer Institut
  • The Spark Image Layer is a Domain-Specific Language for microscopy built on Spark.
  • It converts common imaging tasks into coarse-grained Spark operations


Spark Image Layer

We have developed a number of commands in SIL for handling standard image-processing tasks

(Figure: SIL commands)

Fully extensible with Spark languages

(Figure: SIL commands)

How do we get started?

  • First we start our cluster:
    spark-ec2.py -s 50 launch 4quant-image-cluster

  • Load all of the samples (56TB of data)

loadImage(
  "s3:/../brain_*_*_*/rec.tif"
  )

\[ \downarrow \] A Resilient Distributed Dataset (RDD) \[ \textrm{Images}: \textrm{RDD}[((x,y,z),Img[Double])] =\\ \left[(\vec{x},\textrm{Img}),\cdots\right] \]

  • Now start processing: run a Gaussian filter on all the images to reduce the noise
filteredImages = Images.gaussianFilter(1,0.5)
  • Calculate the volume fraction at a threshold of 50%
volFraction = Images.threshold(0.5).
  map{ keyImg =>
    (sum(keyImg.img), keyImg.size)
  }.reduce{ (a,b) => (a._1 + b._1, a._2 + b._2) } // (foreground voxels, total voxels)

Stitching?

We have all of the filtered images and we want to stitch them together in a smart way.

  • Start simple and match the images up with their neighbors (form all image pairs and keep only the proximal ones)
pairImages = Images.
  cartesian(Images).
  filter{ case (im1, im2) => dist(im1.pos, im2.pos) < 1 }
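
The dist helper is part of the (hypothetical) SIL API; assuming tile positions are stored as 3D coordinates in units where neighboring tiles are less than one unit apart, a plausible definition is simply the Euclidean distance:

// Euclidean distance between two tile-centre positions,
// in whatever units the positions are stored in.
def dist(a: (Double, Double, Double), b: (Double, Double, Double)): Double =
  math.sqrt(math.pow(a._1 - b._1, 2) +
            math.pow(a._2 - b._2, 2) +
            math.pow(a._3 - b._3, 2))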

(Figure: cartesian pairing of scans)

Cross Correlating

The cross correlation can then be executed on each pair of images from the new RDD (pairImages) by using the map command

displacementField = pairImages.
  map{ case ((posA, imgA), (posB, imgB)) =>
    xcorr(imgA, imgB, in = posB - posA)
  }

(Figure)

From Matching to Stitching

The tile positions are then refined using the updated information from the cross correlations, applying appropriate smoothing criteria across windows.

smoothField = displacementField.
    window(3,3,3).
    map(gaussianSmoothFunction)
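
Here gaussianSmoothFunction stands for any smoothing of the displacement vectors inside the 3 x 3 x 3 window; a minimal sketch, assuming the window arrives as a sequence of (offset, displacement) pairs, might look like:

// Gaussian-weighted average of the displacement vectors in a window:
// nearby tiles contribute more to the smoothed displacement than distant ones.
def gaussianSmoothFunction(window: Seq[((Int, Int, Int), (Double, Double, Double))],
                           sigma: Double = 1.0): (Double, Double, Double) = {
  val weighted = window.map { case ((dx, dy, dz), (vx, vy, vz)) =>
    val w = math.exp(-(dx * dx + dy * dy + dz * dz) / (2 * sigma * sigma))
    (w, (w * vx, w * vy, w * vz))
  }
  val wSum = weighted.map(_._1).sum
  val (sx, sy, sz) = weighted.map(_._2).foldLeft((0.0, 0.0, 0.0)) {
    case ((ax, ay, az), (bx, by, bz)) => (ax + bx, ay + by, az + bz)
  }
  (sx / wSum, sy / wSum, sz / wSum)
}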

This also ensures the original data is left unaltered and all analysis is reversible.

Viewing Regions

The final stitching can then be done by:

alignImages.
  filter{ case (x, img) => abs(x - tPos) < img.size }.
  map{ case (x, img) =>
    new Image(tSize).copy(img, x, tPos)
  }.combineImages()

(Figure)

Exploring the data

Searching?

New detectors certainly provide a substantial amount of new data, but what about old data?

  • At PSI we have years of tape archives with thousands of interesting experiments.
  • At hospitals and satellite-imaging companies the volumes are even larger

PI:

Look at this new technique from famous journal, let's try it on all of our old data

PhD Student:

  1. Find old data
  2. Copy old data one disk at a time to current machine
  3. Read in old data, Run analysis, Save results
  4. Repeat from 2.

Small Data

2 years later

It seems to give the same results

PI

Well what about this technique?

Repeat ad nauseam

It doesn't need to be like this

  • Amazon provides cheap, universally accessible (but secure) storage
  • Amazon provides cheap, powerful computational resources
  • Spark / Spark Image Layer provides the glue to match the data to the computation to the science
allData = loadPSIImages("s3://psi-data/femur-bones/*.tif") ++
loadAPSImages("s3://aps-data/bone-images/*.hdf5") ++
loadESRFImages("s3://esrf-data/femur-bones/*.tif")
allData.registerImageMetaTable("BoneImages")

Now something interesting?

How many of the bone samples measured had a cortical thickness larger than 30mm? \[ \downarrow \]

SELECT COUNT(*) FROM (
  SELECT THICKNESS(
    SEGMENT_CORTICAL(
      GAUSSFILT(imgData)
    )
  ) AS thk FROM BoneImages
) WHERE thk > 30

Is there a correlation between cortical thickness and mineral content?

\[ \downarrow \]

SELECT THICKNESS(cortical),
  IMG_AVG(cortical) FROM (
  SELECT SEGMENT_CORTICAL(
    GAUSSFILT(imgData)
    ) AS cortical FROM BoneImages
  )

Image Query Engine: A PhD in 9 lines

CREATE TABLE FilteredBones AS
SELECT sampleId, GaussianFilter(image) AS image FROM BoneImages
CREATE TABLE ThresholdImages AS
SELECT sampleId, ApplyThreshold(image, OTSU) AS image FROM FilteredBones
CREATE TABLE CellImages AS
SELECT sampleId, ComponentLabel(image) AS image FROM ThresholdImages
CREATE TABLE CellAnalysis AS
SELECT sampleId, AnalyzeShape(image) FROM CellImages GROUP BY sampleId
WRITE TABLE CellAnalysis "s3n://bones/cells.csv"
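
The image functions used in these queries (GaussianFilter, ApplyThreshold, ComponentLabel, ...) are not part of standard SQL; conceptually they would be registered as user-defined functions on the SQL context. A minimal sketch, with images crudely represented as flat sequences of pixel values and a placeholder filter instead of a true Gaussian:

import org.apache.spark.sql.SQLContext

object ImageUDFs {
  def registerImageUDFs(sqlContext: SQLContext): Unit = {
    // Placeholder "GaussianFilter": a 3-tap running average rather than a true Gaussian.
    sqlContext.udf.register("GaussianFilter", (img: Seq[Double]) =>
      img.indices.map { i =>
        val lo = math.max(0, i - 1)
        val hi = math.min(img.length - 1, i + 1)
        (lo to hi).map(j => img(j)).sum / (hi - lo + 1)
      })
    // Fixed-level threshold standing in for ApplyThreshold(image, OTSU).
    sqlContext.udf.register("ApplyThreshold", (img: Seq[Double], level: Double) =>
      img.map(v => if (v > level) 1.0 else 0.0))
  }
}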

Distribution and Analysis

The query is processed by the SQL parsing engine

  • sent to the database where it can be combined with locally available private data
  • sent to the image layer for all image processing components to be integrated
  • sent to the Spark Core so specific map, reduce, fold operations can be generated
  • sent to the master node where it can be distributed across the entire cluster / cloud

Distribution and Analysis

(Figure: many nodes)

Streaming: Real-time Image Processing

In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.


We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally these analyses can be performed on streaming data.

Streaming Analysis: Real-time Webcam Processing

val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val imgList = ssc.receiverStream(wr)

Filter images

val filtImgs = imgList.mapValues(_.run("Median...","radius=3"))

Create a background image

// inImages is assumed here to be a previously collected (batch) set of images
val totImgs = inImages.count()
val bgImage = inImages.reduce(_ add _).multiply(1.0/totImgs)

Identify Outliers in Streams

Remove the background image and find the mean value

val eventImages = filtImgs.
    transform{
    inImages =>
      val corImage = inImages.map {
        case (inTime,inImage) =>
          val corImage = inImage.subtract(bgImage)
          (corImage.getImageStatistics().mean,
            (inTime,corImage))
      }
      corImage
  }

Show the outliers

eventImages.filter(iv => Math.abs(iv._1)>20).
  foreachRDD(showResultsStr("outlier",_))

Streaming Demo with Webcam

We have a cool tool, but what does this mean for me?

A spinoff - 4Quant: From images to insight

  • Quantitative Search Machine for Images
    • Make analyzing huge sets of image data like a database search
  • Custom Analysis Solutions
    • Custom-tailored software to solve your problems
    • Streaming / Acquisition Solutions
    • Integration with Image Database Systems
    • Cloud Solutions

Education / Training

We support Open, Reproducible Science

While our framework is commercial, we build on top of and integrate with open-source tools so that the research is reproducible by anyone, anywhere, using any machine (it just might take a bit longer)

We also try to publish as much code, educational material, and samples as possible on our github account.


Acknowledgements: 4Quant

  • Flavio Trolese
  • Dr. Prat Das Kanungo
  • Dr. Ana Balan
  • Prof. Marco Stampanoni

Acknowledgements: ETH and PSI

  • AIT at PSI and Scientific Computing at ETH
  • TOMCAT Group

We are interested in partnerships and collaborations

Learn more at

Small Data vs Big Data

Big Data is a very popular term today. There is a ton of hype, and money is seemingly being thrown at any project which includes these words. First, to clear up the biggest misconception: Big Data isn't about how much data you have, it is about how you work with it.

Before we can think about Big Data we need to define what it's replacing, Small Data.

Small Data

The run-of-the-mill ('08/15') approach: standard tools and lots of clicking, one image at a time


Small Data: Single Sample Clicking

  • Hand Identification \( \rightarrow \) 30s / object
  • 30-40k objects per sample
  • One Sample in 6.25 weeks
  • 1300 samples \( \rightarrow \) 120 man-years

Advantages

  • Hand verification of every sample, visual identification of errors and problems

Disadvantages

  • Biased by user
  • Questionable reproducibility
  • Time-consuming
  • Exploration challenging
  • Data versioning is difficult

How much is a TB, really?

If you looked at one 1000 x 1000 sized image (for example a mammography image) every second,

it would take you 139 hours to browse through a terabyte of data.

Year    Time to 1 TB    Manpower to keep up    Salary costs / month
2000    4096 min        2 people               25 kCHF
2008    1092 min        8 people               95 kCHF
2014    32 min          260 people             3255 kCHF
2016    2 min           3906 people            48828 kCHF
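
The manpower column follows roughly from the 139 hours needed to view one terabyte: the number of people required to keep up is the viewing time divided by the time it takes to produce a terabyte, e.g. for 2014

\[ \frac{139\,\textrm{h} \times 60\,\textrm{min/h}}{32\,\textrm{min/TB}} \approx 260 \textrm{ people} \]

(the salary column then assumes roughly 12.5 kCHF per person-month).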

Top Down Science

The standard approach to producing data is top-down.

  • Selecting a model
  • generating hypotheses
  • conducting experiments
  • updating the model.

In a study with 80 samples with a variability of 10% and a 5% difference between them, we can afford only 1 test and maintain the commonly accepted \( p<0.05 \)

We thus need smart people making and choosing the best models so that we can preserve the significance.

Letting Data Drive: From models to probability

  • Rather than impose a preconceived, possibly biased, model on data we should investigate what possible models, interpretations, or correlations are in the data (possibly in the phenomena) that might help us understand it. - Michael Brodie: MIT and Former Chief Scientist at Verizon

The Big Data era suggests a new approach. First, given an abundance of data, we can afford to run many tests.

In a study with 800 samples with a variability of 10% and a 5% difference between them, we can afford over 10M tests and maintain the commonly accepted \( p<0.05 \)
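
A rough sketch of why the number of affordable tests explodes with sample size: with a Bonferroni-style correction each of \( N \) tests must individually pass a stricter threshold, while the effect size measured in standard errors grows with the square root of the per-group sample size \( n \):

\[ \alpha_{\textrm{per test}} = \frac{0.05}{N_{\textrm{tests}}}, \qquad z \approx \frac{\delta}{\sigma}\sqrt{\frac{n}{2}} \]

Going from 80 to 800 samples increases \( z \) by a factor of \( \sqrt{10} \approx 3.2 \), and since the tail probability falls off roughly as \( e^{-z^2/2} \), the number of tests we can afford at a fixed power grows enormously.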

For knowledge creation this means a switch

  • from long, arduous studies involving careful model selection
  • to hypothesis and model creation based on the correlations and trends found in the raw data

Computing has changed: Parallel

Moore's Law

\[ \textrm{Transistors} \propto 2^{T/(\textrm{18 months})} \]

Based on trends from Wikipedia and Intel; data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d

There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?

  • Multiple Core
    • Many machines have multiple cores for each processor which can perform tasks independently
  • Multiple CPUs
    • More than one chip is commonly present
  • New modalities
    • GPUs provide many cores which operate at lower clock speeds

Parallel Code is important
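
A minimal illustration (not from the original slides) of why this matters in practice: with Scala's parallel collections, an embarrassingly parallel per-image operation spreads over all available cores with a one-word change.

// Toy per-image workload: count bright pixels in a batch of random "images".
val images: Seq[Array[Double]] = Seq.fill(200)(Array.fill(256 * 256)(math.random))

// Sequential: a single core does all the work.
val countsSeq = images.map(img => img.count(_ > 0.5))

// Parallel: ".par" runs the same map on all available cores.
val countsPar = images.par.map(img => img.count(_ > 0.5))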

Cloud Computing Costs

The range of cloud costs (determined by peak usage) compared to a local workstation with utilization shown as the average number of hours the computer is used each week.

  • People usually only work ~40 hours a week (in Europe at least)
  • Adding IT costs and management easily shifts the blue curve substantially upward.

(Figure)

Medium Data: Multi-tool Scripting

  • ImageJ macro for segmentation
  • Python script for shape analysis
  • Paraview macro for network and connectivity
  • Python/MPI script to pool results
  • MySQL Database storing results

1.5 man-years

Advantages

  • Faster than manual methods
  • Less bias
  • More reproducible
  • Can be highly optimized / performant

Disadvantages

  • Compartmentalized analysis
  • Time-consuming
  • Complex interdependent workflow
  • Fragile (machine crashes, job failures)
  • Expensive - Measure twice, cut once, not exploratory

Compartmentalized

(Figure: Gantt-style workflow)

Cost Comparison: Why hasn't this been done before?

For many traditional scientific studies the sample count is fairly low (10-100). For such low counts, if we look at the costs (estimated from our experience with cortical bone microstructure and rheology):

  • Development costs are very high for Big Data approaches
  • Traditionally students are much cheaper than the 150kCHF / year used in this model
  • Medium Data is a good compromise

(Figure)

Scaling Costs

As studies get bigger (>100 samples), we see that these costs begin to radically shift.

  • The high development costs of Medium and Big Data \( \rightarrow \) 0
  • The high human costs of Small and Medium Data grow \( \propto \textrm{Samples} \)

(Figure: scaling costs)

Quantitative Search Machine for Images: Pathological Slices

  • Count the number of highly anisotropic nuclei in myeloma patients

\[ \downarrow \textrm{Translate to SQL} \]

SELECT COUNT(*) FROM 
  (SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide)) FROM patients 
  WHERE disease LIKE "myeloma")
  WHERE anisotropy > 0.75

Quantitative Search Machine for Images: Pathological Slices

\[ \downarrow \textrm{Load Myeloma Data Subset} \]

Quantitative Search Machine for Images: Pathological Slices

\[ \downarrow \textrm{Perform the analysis on every image} \]

\[ \downarrow \textrm{Keep only the most anisotropic cells} \]

Quantitative Search Machine for Images: Maps

(Figure: map of the PSI area)

Find all strongly reflective large objects within 1km of the Paul Scherrer Institut, Villigen, CH

\[ \downarrow \textrm{Translate to SQL} \]

SELECT contour FROM (
  SELECT COMPONENT_LABEL(THRESHOLD(tile,200)) FROM esriTiles 
  WHERE DIST(LAT,47.53000992328762,LONG,8.215198516845703)<1
  ) WHERE area>200
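
DIST here is another hypothetical UDF; for latitude/longitude data the natural implementation is a haversine great-circle distance in kilometres, sketched below.

// Great-circle (haversine) distance in kilometres between two lat/long points,
// one plausible implementation behind the DIST(...) < 1 predicate above.
def distKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val earthRadiusKm = 6371.0
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * earthRadiusKm * math.asin(math.sqrt(a))
}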

Quantitative Search Machine for Images: Maps

We can then visualize these contours easily

(Figure: extracted contours on the latitude/longitude grid)

or apply them back to the original map

(Figure: contours overlaid on the original map)

Hadoop Filesystem (HDFS not HDF5)

The bottleneck is the filesystem connection: many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl


One of the central tenets of MapReduce™ is data-centric computation \( \rightarrow \) instead of moving the data to the computation, move the computation to the data.

  • Use fast local storage for storing everything redundantly \( \rightarrow \) less transfer and fault-tolerance
  • Largest file size: 512 yottabytes; Yahoo has a 14 petabyte filesystem in use


As a scientist (not a data-scientist)

Apache Spark is a brilliant platform, and with GraphX, MLlib, and other packages there are nearly unlimited possibilities

  • Scala can be a beautiful but not easy language
  • Python is an easier language
  • Both suffer from
    • Non-obvious workflows
    • Scripts depending on scripts depending on scripts (can be very fragile)
  • Although all analyses can be expressed as a workflow, this is often difficult to see from the code
  • Non-technical persons have little ability to understand or make minor adjustments to analysis
  • Parameters require recompiling to change
  • or GUIs need to be placed on top

Interactive Analysis

Combining many different components inside the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows

Web and Image Database Examples

Parallel Tools for Image and Quantitative Analysis

  • val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
  • val avgVol = sqlContext.sql("select SAMPLE,AVG(VOLUME) FROM cells GROUP BY SAMPLE")
  • Collaborators / Competitors can verify results and extend on analyses
  • Combine Images with Results
    • avgVol.filter(_._2>1000).map(sampleToPath).joinByKey(bones)
    • See immediately in datasets of terabytes which image had the largest cells
  • New hypotheses and analyses can be done in seconds / minutes
Task                     Single Core Time    Spark Time (40 cores)
Load and Preprocess      360 minutes         10 minutes
Single Column Average    4.6 s               400 ms
1 K-means Iteration      2 minutes           1 s

A basic image filtering operation

  • Thanks to Spark, it is cached, in memory, approximate, cloud-ready
  • Thanks to Map-Reduce it is fault-tolerant, parallel, distributed
  • Thanks to Java, it is hardware agnostic
def spread_voxels(pvec: ((Int,Int),Double), windSize: Int = 1) = {
  // spread one voxel's value evenly over its (2*windSize+1)^2 neighborhood
  val wind = (-windSize to windSize)
  val pos = pvec._1
  val scalevalue = pvec._2/(wind.length*wind.length)
  for(x<-wind; y<-wind)
    yield ((pos._1+x,pos._2+y),scalevalue)
}

val filtImg = roiImg.
      flatMap(cvec => spread_voxels(cvec)). // spread each voxel over its neighborhood
      filter(roiFun).                       // keep only positions inside the region of interest
      reduceByKey(_ + _)                    // sum the contributions at each position
  • But it is also not really so readable

Little blocks for big data

Here we use a KNIME-based workflow with our Spark Image Layer extensions to create a pipeline without any Scala or programming knowledge, with an easily visible flow from one block to the next, and without the performance overhead of using other tools.

(Figure: workflow blocks)

(Figure: workflow settings)