From high-throughput on mice to big data on people

K. Mader
2 September 2015

SIL

Paul Scherrer Institut ETH Zurich

Outline

The PhD Project

  • Mouse Genetics
  • Automating Measurement
  • Automating Analysis
  • Combining Results
  • Fragile Approaches

Internal Structures

The Business

  • Imaging in 2015
  • Changing Computing
  • Our Image Analysis
  • Interactive Analysis / Streaming
  • Outlook / Developments

Business

Understanding bone fracture

Fractures

  • Fracture risk is a poorly understood, multifactorial problem [1]
  • Femur neck fractures: over $12,000/ patient / year ⇒ $34.8B worldwide [2]
  • Cortical bone loss increases risk [1]
    • Bone mineral density is a poor indicator for cortical strength [3,4]
    • Cortical structures are deep inside optically opaque bone and small (on the µm scale)
    • Understand the relevant microstructure for bone health and disease

Genome-Scale Imaging of Bone

  • Genetic background can be fully controlled
  • A mouse model allows for inbreeding and controlled environment and diet
  • Longitudinal studies not needed
  • Genetics offers the possibility for simple, cheap diagnostic testing
  • Relevant for humans: 26 of 28 regions of the genome responsible for BMD can be matched [1]
  • Connect phenotypes to genotypes \( \Rightarrow \) identify controlling genes

Linkage

Many Samples

Genetic studies require hundreds to thousands of samples; in this case, the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.

QTL Identification

uCT Image

High-Throughput Measurement

Analysis

Measurement is just the beginning: how can you analyze it?

uCT Image

Small Data: Genome-Scale Imaging

  • Hand Identification \( \rightarrow \) 30s / object
  • 30-40k objects per sample
  • One Sample in 6.25 weeks
  • 1300 samples \( \rightarrow \) 120 man-years

Advantages

  • Hand verification of every sample, visual identification of errors and problems

Disadvantages

  • Biased by user
  • Questionable reproducibility
  • Time-consuming
  • Exploration challenging
  • Data versioning is difficult

Medium Data: Genome-scale Imaging

  • ImageJ macro for segmentation
  • Python script for shape analysis
  • Paraview macro for network and connectivity
  • Python/MPI script to pool results
  • MySQL Database storing results

1.5 man-years

Advantages

  • Faster than manual methods
  • Less bias
  • More reproducible
  • Can be highly optimized / performant

Disadvantages

  • Compartmentalized analysis
  • Time-consuming
  • Complex interdependent workflow
  • Fragile (machine crashes, job failures)
  • Expensive - Measure twice, cut once, not exploratory

Compartmentalized

Gantt Flow

Going faster? GPUs!

Taken from Serban's K-Means CUDA Project (https://github.com/serban/kmeans/blob/master/cuda_kmeans.cu)

checkCuda(cudaMalloc(&deviceObjects, numObjs*numCoords*sizeof(float)));
checkCuda(cudaMalloc(&deviceClusters, numClusters*numCoords*sizeof(float)));
checkCuda(cudaMalloc(&deviceMembership, numObjs*sizeof(int)));
checkCuda(cudaMalloc(&deviceIntermediates, numReductionThreads*sizeof(unsigned int)));

checkCuda(cudaMemcpy(deviceObjects, dimObjects[0],
          numObjs*numCoords*sizeof(float), cudaMemcpyHostToDevice));
checkCuda(cudaMemcpy(deviceMembership, membership,
          numObjs*sizeof(int), cudaMemcpyHostToDevice));

do {
    checkCuda(cudaMemcpy(deviceClusters, dimClusters[0],
              numClusters*numCoords*sizeof(float), cudaMemcpyHostToDevice));

    find_nearest_cluster
        <<< numClusterBlocks, numThreadsPerClusterBlock, clusterBlockSharedDataSize >>>
        (numCoords, numObjs, numClusters,
         deviceObjects, deviceClusters, deviceMembership, deviceIntermediates);
    cudaDeviceSynchronize(); checkLastCudaError();

    compute_delta <<< 1, numReductionThreads, reductionBlockSharedDataSize >>>
        (deviceIntermediates, numClusterBlocks, numReductionThreads);
    cudaDeviceSynchronize(); checkLastCudaError();

    int d;
    checkCuda(cudaMemcpy(&d, deviceIntermediates,
              sizeof(int), cudaMemcpyDeviceToHost));
    delta = (float)d;

    checkCuda(cudaMemcpy(membership, deviceMembership,
              numObjs*sizeof(int), cudaMemcpyDeviceToHost));

    for (i=0; i<numObjs; i++) {
        /* find the array index of the nearest cluster center */
        index = membership[i];

        /* update new cluster centers: sum of objects located within */
        newClusterSize[index]++;
        for (j=0; j<numCoords; j++)
            newClusters[j][index] += objects[i][j];
    }

Setting Priorities: Exploratory Image Processing

Correctness

The most important job for any piece of analysis is to be correct.

  • A powerful testing framework is essential (see the test sketch after this list)
  • Avoid repetition of code which leads to inconsistencies
  • Use compilers to find mistakes rather than users
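To give a flavour of such tests, here is a minimal sketch written with ScalaTest; boxFilter is purely illustrative (a stand-in for whatever operation is under test), not code from the project:

import org.scalatest.FunSuite

// Illustrative only: boxFilter is a stand-in for the real operation under test.
class FilterCorrectnessTest extends FunSuite {
  // A trivial 1D three-point box filter with clamped borders
  def boxFilter(img: Array[Double]): Array[Double] =
    img.indices.map { i =>
      val lo = math.max(i - 1, 0)
      val hi = math.min(i + 1, img.length - 1)
      (lo to hi).map(j => img(j)).sum / (hi - lo + 1)
    }.toArray

  test("filtering a constant image changes nothing") {
    val flat = Array.fill(64)(1.0)
    assert(boxFilter(flat).forall(v => math.abs(v - 1.0) < 1e-9))
  }

  test("filtering never produces values outside the input range") {
    val img = Array.tabulate(64)(i => (i % 7).toDouble)
    assert(boxFilter(img).forall(v => v >= img.min && v <= img.max))
  }
}

Property-style checks like these (constant images pass through unchanged, output values stay in range) catch whole classes of mistakes without hand-crafted reference images.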

Easily understood, changed, and used

Almost all image processing tasks require a number of people to evaluate and implement them, and the requirements are almost always moving targets.

  • Flexible, modular structure that enables replacing specific pieces

Fast

The last of the major priorities is speed, which covers scalability, raw performance, and development time.

  • Long waits for processing discourage exploration
  • Manually accessing data on separate disks is a huge speed barrier
  • Real-time image processing requires millisecond latencies
  • New ideas should be quick to implement

Small Data vs Big Data

Big Data is a very popular term today. There is a ton of hype, and money seems to be thrown at any project that includes these words. The biggest misconception to get past first: Big Data isn't about how much data you have, it is about how you work with it.

Before we can think about Big Data, we need to define what it's replacing: Small Data.

Small Data

The run-of-the-mill ('0815') approach: standard tools and lots of clicking, one image at a time

Small Data

Big Data

Velocity, Volume, Variety

When a ton of heterogeneous data is coming in fast.

Performant, scalable, and flexible

When scaling isn't scary

10X, 100X, 1000X is the same amount of effort

When you are starving for enough data

The director of AMPLab said their rate-limiting factor is always having enough interesting data

0 'clicks' per sample

Question-driven, not implementation-driven

  • Ask the questions you should, not the questions you can

You conduct the orchestra; you don't spend time telling each person what to do

Big Data

The Framework First

  • Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets
    • choose the framework first
    • then start making the necessary tools.
  • Google, Amazon, Yahoo, and many other companies have made huge inroads into these problems
  • The real need is a fast, flexible framework for robustly and scalably performing complicated analyses, a sort of Excel for big imaging data.

Apache Spark

  • World Record Holder for the Terabyte and Petabyte Sort
    • 100TB of data on 206 machines in 23 minutes
    • 1PB of data on 190 machines in <4 hours
  • Scalable
    • >10000 computers
    • storing and processing exabytes of data
    • sustained data rates of 3GB/s/node (map) and 1.1GB/s/node (reduce)
  • Fault tolerant
    • 2/3 of the computers can crash and a request still finishes accurately
  • Hardware and software platform independence (Mac, Windows, Linux)

Spark -> Microscopy?

These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.

I want to

  • filter out noise, segment, choose regions of interest
  • contour, component label
  • measure, count, and analyze

Spark Image Layer

  • Developed at 4Quant, ETH Zurich, and the Paul Scherrer Institut
  • The Spark Image Layer is a Domain Specific Language for Microscopy built on Spark.
  • It converts common imaging tasks into coarse-grained Spark operations, as sketched below
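As a rough idea of what "coarse-grained" means here (a sketch against plain Spark RDDs, not the actual SIL internals): an image can be carried as an RDD of (position, value) pairs, and a smoothing filter becomes a flatMap that spreads each voxel over its neighbourhood followed by a reduceByKey that sums the contributions, the same pattern as the spread_voxels example later in this talk.

import org.apache.spark.rdd.RDD

// Sketch only (not the SIL internals): an image carried as an RDD of
// ((x, y), value) pairs; a 3x3 mean filter expressed with coarse-grained
// Spark operations. Border handling is ignored for brevity.
def meanFilter3x3(img: RDD[((Int, Int), Double)]): RDD[((Int, Int), Double)] =
  img.flatMap { case ((x, y), v) =>
    // spread each voxel's weighted value over its 3x3 neighbourhood
    for (dx <- -1 to 1; dy <- -1 to 1) yield ((x + dx, y + dy), v / 9.0)
  }.reduceByKey(_ + _) // sum the contributions arriving at each position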

SIL

Spark Image Layer

We have developed a number of SIL commands for handling standard image processing tasks

SIL Commands

Fully extensible with Spark languages

SIL Commands

Use Case with Spark Image Layer: Genetic Studies

  • Analysis could be completed in several months instead of 120 man-years (and, using the cloud, in days)
  • Data can be freely explored and analyzed
val imageTable = 
    sc.readImage[Float]("s3n://bone-images/f2-study/*/ufilt.tif").cache
  • Filter all of the images
CREATE TABLE FilteredImages AS
SELECT boneId,GaussianFilter(image) FROM ImageTable
  • Segment hard and soft tissues
CREATE TABLE ThresholdImages AS
SELECT boneId,ApplyThreshold(image,OTSU) FROM FilteredImages

Use Case with Spark Image Layer: Genetic Studies

  • Label cells
CREATE TABLE LabelImages AS
SELECT boneId,ComponentLabel(image) FROM PorosityImages
  • Export results
CREATE TABLE CellAnalysis AS
SELECT boneId,AnalyzeShape(CellImages) FROM CellImages GROUP BY boneId
WRITE TABLE CellAnalysis "s3n://bones/cells.csv"

Transition Point: Pioneer Fellowship

ETH and the Hasler Foundation have a Pioneer Fellowship program for transitioning research projects into commercial ideas.

  • 18 months
  • Office space in Innovation Lab at Hönggerberg or Central
  • Weekly invited speakers on business and marketing topics
  • Monthly discussions of strategy, funding, and planning

Pioneer Slide

Image Science in 2015: More and faster

X-Ray

  • Swiss Light Source (SRXTM) images at (>1000fps) \( \rightarrow \) 8GB/s, diffraction patterns (cSAXS) at 30GB/s
  • Nanoscopium (Soleil), 10TB/day, 10-500GB file sizes, very heterogeneous data
  • Swiss radiologists will produce 1PB of data per year

Optical

  • Light-sheet microscopy (see talk of Jeremy Freeman) produces images \( \rightarrow \) 500MB/s
  • High-speed confocal images at (>200fps) \( \rightarrow \) 78Mb/s

Geospatial

  • New satellite projects (Skybox, etc) will measure >10 petabytes of images a year

Personal

  • GoPro 4 Black - 60MB/s (3840 x 2160 x 30fps) for $600
  • fps1000 - 400MB/s (640 x 480 x 840 fps) for $400

Data Flood

Exponential growth in both:

  • the amount of data
  • analysis time

over the last 16 years, based on the latest-generation detector at the TOMCAT beamline.

The figure assumes a manual analysis rate of 1 Mpx/second.
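To put the mismatch in numbers (a back-of-the-envelope sketch assuming 16-bit pixels):

\[ \frac{8\,\textrm{GB/s}}{2\,\textrm{bytes/px}} = 4\times 10^{9}\,\textrm{px/s acquired} \quad \textrm{vs.} \quad 10^{6}\,\textrm{px/s analyzed manually} \;\Rightarrow\; \textrm{a gap of roughly } 4000\times \]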

[Figure: scaling-figure]

Time Breakdown

How the time spent on the component tasks of a scientific study has changed over the last few years, and how we expect it to change by 2020.

  • Experiment Design (Pink)
  • Data Management (Blue)
  • Measurement Time (Green)
  • Post Processing (Purple)

The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.

[Figure: time-figure]

Science or IT?

How the time spent on the component tasks of a scientific study has changed over the last few years, and how we expect it to change by 2020.

  • Experiment Design (Pink)
  • Data Management (Blue)
  • Measurement Time (Green)
  • Post Processing (Purple)

The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.

[Figure: time-figure-cs]

How much is a TB, really?

If you looked at one 1000 x 1000 image (for example, a mammography image) every second, it would take you 139 hours to browse through a terabyte of data.
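The arithmetic behind that number (a sketch assuming 16-bit pixels, i.e. about 2MB per 1000 x 1000 image):

\[ \frac{10^{12}\,\textrm{bytes}}{2\times 10^{6}\,\textrm{bytes/image}} = 5\times 10^{5}\,\textrm{images} \;\Rightarrow\; 5\times 10^{5}\,\textrm{s} \approx 139\,\textrm{hours} \]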

Year   Time to 1 TB   Manpower to keep up   Salary costs / month
2000   4096 min       2 people              25 kCHF
2008   1092 min       8 people              95 kCHF
2014   32 min         260 people            3255 kCHF
2016   2 min          3906 people           48828 kCHF

Computing has changed: Parallel

Moore's Law

\[ \textrm{Transistors} \propto 2^{T/(\textrm{18 months})} \]

Based on trends from Wikipedia and Intel, and on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d

There are now many more transistors inside a single computer, but the clock speed hasn't increased. How can this be?

  • Multiple cores
    • Many machines have multiple cores per processor which can perform tasks independently
  • Multiple CPUs
    • More than one chip is commonly present
  • New modalities
    • GPUs provide many cores which operate at lower clock speeds

Parallel Code is important
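As a toy illustration of the point (plain Scala parallel collections, nothing specific to Spark or the beamline code; the timings will depend on the machine):

import scala.util.Random

// Toy example: the same embarrassingly parallel workload run sequentially and
// then with Scala's parallel collections (.par), which spreads the map over
// all available cores.
object ParallelToy extends App {
  def slowTask(seed: Int): Double = {
    val rng = new Random(seed)
    (1 to 200000).map(_ => rng.nextDouble()).sum // stand-in for per-image work
  }

  val jobs = (1 to 256).toVector

  val t0 = System.nanoTime
  val seqResult = jobs.map(slowTask).sum
  val t1 = System.nanoTime
  val parResult = jobs.par.map(slowTask).sum
  val t2 = System.nanoTime

  println(f"sequential: ${(t1 - t0) / 1e9}%.1f s, parallel: ${(t2 - t1) / 1e9}%.1f s")
  println(s"results agree: ${math.abs(seqResult - parResult) < 1e-6}")
}

The parallel version needs no restructuring of the logic, only a workload that can be split independently, which is exactly the situation with per-image analysis.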

Cloud Computing Costs

The range of cloud costs (determined by peak usage) compared to a local workstation, with utilization shown as the average number of hours the computer is used each week.

  • People usually only work ~40 hours a week (in Europe at least)
  • Adding IT costs and management easily shifts the blue curve substantially upward.

[Figure: cloud vs. local workstation costs]

Cost Comparison: Why hasn't this been done before?

For many traditional scientific studies, the sample count is fairly low (10-100). For such low counts, if we look at the costs (estimated from our experience with cortical bone microstructure and rheology):

  • Development costs are very high for Big Data
  • Traditionally students are much cheaper than the 150kCHF / year used in this model
  • Medium Data is a good compromise

[Figure: cost comparison for small studies]

Scaling Costs

As studies get bigger (>100 samples), we see that these costs begin to radically shift.

  • The high development costs of Medium and Big Data \( \rightarrow \) 0
  • The high human costs of Small and Medium Data \( \propto \textrm{Samples} \)

[Figure: scaling-costs]

Quantitative Search Machine for Images: Pathological Slices

  • Count the number of highly anisotropic nuclei in myeloma patients

\[ \downarrow \textrm{Translate to SQL} \]

SELECT COUNT(*) FROM 
  (SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide)) FROM patients 
  WHERE disease LIKE "myeloma")
  WHERE anisotropy > 0.75

Quantitative Search Machine for Images: Pathological Slices

\[ \downarrow \textrm{Load Myeloma Data Subset} \] Slices

Quantitative Search Machine for Images: Pathological Slices

\[ \downarrow \textrm{Perform analysis on every image} \] Slices

\[ \downarrow \textrm{Filter out the most anisotropic cells} \] Slices

Distribution and Analysis

The query is processed by the SQL parsing engine

  • sent to the database where it can be combined with locally available private data
  • sent to the image layer for all image processing components to be integrated
  • sent to the Spark Core so specific map, reduce, and fold operations can be generated (see the sketch after this list)
  • sent to the master node where it can be distributed across the entire cluster / cloud
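As a hypothetical illustration of how image operations can be exposed to a SQL layer (this is plain Spark SQL, not the 4Quant engine): register them as user-defined functions, after which Spark parses, plans, and distributes the query itself. The names GaussianFilter and MeanIntensity below are stand-ins, and an "image" here is just an array of doubles.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical wiring: expose image operations as Spark SQL UDFs so queries
// like the ones above can be planned and distributed by Spark.
val sc = new SparkContext(new SparkConf().setAppName("image-udf").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

sqlContext.udf.register("GaussianFilter", (img: Seq[Double]) => img)                 // stand-in filter
sqlContext.udf.register("MeanIntensity", (img: Seq[Double]) => img.sum / img.length) // stand-in measurement

val images = sc.parallelize(Seq(("bone_01", Seq(0.1, 0.4, 0.9)))).toDF("boneId", "image")
images.registerTempTable("ImageTable")

sqlContext.sql("SELECT boneId, MeanIntensity(GaussianFilter(image)) FROM ImageTable").show()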

Distribution and Analysis

Many Nodes

A cool tool, but how do you make a product?

Value Proposition

Creating a product

POI

Realtime MRI Analysis for Guided Treatment

Identifying Lesions in Colons from Capsule Endoscopy

Flood Risk Estimations from Satellite Images

Track Assessment for Railway Quality Control

People Detection in Drone Surveillance

Acknowledgements: 4Quant

4Quant Team

Acknowledgements: ETH and PSI

  • AIT at PSI and Scientific Computing at ETH
  • TOMCAT Group

We are interested in partnerships and collaborations

Learn more at

More Information

A spinoff - 4Quant: From images to insight

  • Quantitative Search Machine for Images
    • Find all patients with livers larger than 10cm diameter
    • Count the number of highly anisotropic nuclei in myeloma patients
  • Custom Analysis Solutions
    • Custom-tailored software to solve your problems
  • One Stop Shop
    • Measurement, analysis, and statistical analysis

Education / Training

K-Means Declarative

  1. Start with a list of points: Points
  2. Take 4 random points from the list \( \rightarrow \) Centers
    • Apply a filter operation with the function (random() < 4/nPoints)
  3. Create a function called nearestCenter which takes a point \( \vec{x} \) and returns the nearest center together with the point itself
nearestCenter(x) = {
  for cCenter in Centers
    pointDist(cCenter) = dist(x,cCenter)
  end
  return (argmin(pointDist),x)
}
  4. Map nearestCenter onto Points as NearestCenterList (a Spark sketch of these steps follows below)
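A minimal sketch of these steps with plain Spark RDDs (illustrative only, written for the spark-shell where sc already exists; takeSample stands in for the random filter of step 2 for brevity):

import scala.util.Random

def dist(a: (Double, Double), b: (Double, Double)): Double =
  math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2)

// 1. the starting list of points (here just random 2D points)
val points = sc.parallelize(Seq.fill(10000)((Random.nextDouble(), Random.nextDouble()))).cache()

// 2. take ~4 random points as the initial centers
var centers = points.takeSample(withReplacement = false, num = 4)

for (iter <- 1 to 10) {
  val currentCenters = centers
  // 3./4. map every point to (nearest center, point) ...
  val assigned = points.map(p => (currentCenters.minBy(c => dist(p, c)), p))
  // ... then recompute each center as the mean of its assigned points
  centers = assigned
    .mapValues(p => (p, 1))
    .reduceByKey { case (((x1, y1), n1), ((x2, y2), n2)) => ((x1 + x2, y1 + y2), n1 + n2) }
    .map { case (_, ((sx, sy), n)) => (sx / n, sy / n) }
    .collect()
}
centers.foreach(println)

A real run would also track the number of reassigned points and stop on convergence, as the delta computation in the CUDA excerpt earlier does.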

Quantitative Search Machine for Images: Maps

PSI

Find all strongly reflective large objects within 1km of the Paul Scherrer Institut, Villigen, CH

\[ \downarrow \textrm{Translate to SQL} \]

SELECT contour FROM (
  SELECT COMPONENT_LABEL(THRESHOLD(tile,200)) FROM esriTiles 
  WHERE DIST(LAT,47.53000992328762,LONG,8.215198516845703)<1
  ) WHERE area>200

Quantitative Search Machine for Images: Maps

We can then visualize these contours easily

[Figure: region-lat-long-image]

or apply them back to the original map

PSI

Top Down Science

The standard approach to producing data is top-down.

  • Selecting a model
  • generating hypotheses
  • conducting experiments
  • updating the model.

In a study with 80 samples, a variability of 10%, and a 5% difference between groups, we can afford only 1 test while maintaining the commonly accepted \( p<0.05 \).

We thus need smart people making and choosing the best models so that we can preserve the significance.

Letting Data Drive: From models to probability

  • Rather than impose a preconceived, possibly biased, model on data we should investigate what possible models, interpretations, or correlations are in the data (possibly in the phenomena) that might help us understand it. - Michael Brodie: MIT and Former Chief Scientist at Verizon

The Big Data era suggests a new approach: given an abundance of data, we can afford to run many tests.

In a study with 800 samples, a variability of 10%, and a 5% difference between groups, we can afford over 10M tests while maintaining the commonly accepted \( p<0.05 \).
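One rough way to see where these test counts come from (a Bonferroni-style back-of-the-envelope, treating each test as a two-group comparison with per-group size \( m = n/2 \), standard deviation \( \sigma = 10\% \), and difference \( \Delta = 5\% \), and ignoring power considerations):

\[ z = \frac{\Delta}{\sigma\sqrt{2/m}}, \qquad N_{\textrm{tests}} \approx \frac{0.05}{2\,\Phi(-z)} \]

\[ n = 80:\; z \approx 2.2,\; N_{\textrm{tests}} \approx 2 \qquad\qquad n = 800:\; z \approx 7.1,\; N_{\textrm{tests}} > 10^{10} \]

which matches the orders of magnitude quoted above.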

For knowledge creation this means a switch

  • from long, arduous studies involving careful model selection
  • to hypothesis and model creation based on the correlations and trends found in the raw data

Real-time with Spark Streaming: Webcam

In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.

ImageJ Fiji

We can integrate this functionality directly into Spark and perform operations on much larger datasets than a single machine could hold in memory. Additionally, these analyses can be performed on streaming data.
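As a rough sketch of the idea (assuming ImageJ is on the classpath of the workers; this is not the actual wrapper used below), a plain ImageJ call can simply be pushed inside a Spark map, so the filtering runs where the data lives and only a small summary comes back:

import ij.IJ
import org.apache.spark.SparkContext

// Open each image on a worker, run a standard ImageJ median filter on it,
// and ship back only the mean intensity of the filtered result.
def medianMeans(sc: SparkContext, paths: Seq[String]): Map[String, Double] =
  sc.parallelize(paths).map { p =>
    val imp = IJ.openImage(p)             // load on the worker, not the driver
    IJ.run(imp, "Median...", "radius=3")  // the same call a Fiji macro would make
    (p, imp.getStatistics.mean)           // small summary back to the driver
  }.collect().toMap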

[Figure: ImageJ/Fiji integration]

Streaming Analysis Real-time Webcam Processing

val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val inImages = ssc.receiverStream(wr)

Filter images

val filtImgs = inImages.mapValues(_.run("Median...","radius=3"))

Create a background image

val totImgs = inImages.count()
val bgImage = inImages.reduce(_ add _).multiply(1.0/totImgs)

Identify Outliers in Streams

Remove the background image and find the mean value

val eventImages = filtImgs.
    transform{
    inImages =>
      val corImages = inImages.map {
        case (inTime,inImage) =>
          val corImage = inImage.subtract(bgImage)
          (corImage.getImageStatistics().mean,
            (inTime,corImage))
      }
      corImages
  }

Show the outliers

eventImages.filter(iv => Math.abs(iv._1)>20).
  foreachRDD(showResultsStr("outlier",_))

Streaming Demo with Webcam

As a scientist (not a data-scientist)

Apache Spark is a brilliant platform, and with GraphX, MLlib, and the other packages there are nearly unlimited possibilities

  • Scala can be a beautiful, but not an easy, language
  • Python is an easier language
  • Both suffer from
    • Non-obvious workflows
    • Scripts depending on scripts depending on scripts (can be very fragile)
  • Although all analyses can be expressed as a workflow, this is often difficult to see from the code
  • Non-technical people have little ability to understand or make minor adjustments to the analysis
  • Changing parameters requires recompiling
  • or GUIs need to be placed on top

Interactive Analysis

Combining many different components inside the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows

Web and Image Database Examples

Parallel Tools for Image and Quantitative Analysis

  • val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
  • val avgVol = sqlContext.sql("select SAMPLE,AVG(VOLUME) FROM cells GROUP BY SAMPLE")
  • Collaborators / competitors can verify results and extend the analyses
  • Combine Images with Results
    • avgVol.filter(_._2>1000).map(sampleToPath).joinByKey(bones)
    • See immediately in datasets of terabytes which image had the largest cells
  • New hypotheses and analyses can be done in seconds / minutes
Task                    Single Core Time   Spark Time (40 cores)
Load and Preprocess     360 minutes        10 minutes
Single Column Average   4.6 s              400 ms
1 K-means Iteration     2 minutes          1 s

A basic image filtering operation

  • Thanks to Spark, it is cached, in memory, approximate, cloud-ready
  • Thanks to Map-Reduce it is fault-tolerant, parallel, distributed
  • Thanks to Java, it is hardware agnostic
def spread_voxels(pvec: ((Int,Int),Double), windSize: Int = 1) = {
  val wind=(-windSize to windSize)
  val pos=pvec._1
  val scalevalue=pvec._2/(wind.length*wind.length)
  for(x<-wind; y<-wind) 
    yield ((pos._1+x,pos._2+y),scalevalue)
}

val filtImg=roiImg.
      flatMap(cvec => spread_voxels(cvec)).
      filter(roiFun).reduceByKey(_ + _)
  • But it is also not very readable

Little blocks for big data

Here we use a KNIME-based workflow and our Spark Image Layer extensions to create a workflow without any Scala or programming knowledge, with an easily visible flow from one block to the next, and without the performance overhead of using other tools.

Workflow Blocks

Workflow Settings