Quantitative Medical Image Analysis in the Cloud using Big Data Approaches

Kevin Mader
Life Science Forum Basel, 18 June 2015


Paul Scherrer Institut / ETH Zurich

Outline

  • Introduction
    • Quantitative?
    • Big?
  • Small Data vs Big Data

The Tools

  • Spark Image Layer
  • Quantitative Search Machine for Images

The Science

  • Academic Projects
  • Commercial Projects
  • Outlook / Developments

Internal Structures

Lung Imaging

Look for potentially cancerous nodules in the following lung image, taken from NPR

(Image: lung scan)


Why quantitative?

The human visual system is imperfect

Which center square seems brighter?

(Figure: brightness illusion)

Are the intensities constant in the image?

(Figure: intensity test image)
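A quick way to convince yourself on the first question: render two identical gray squares on different backgrounds and read back the pixel values. This is a minimal numpy/matplotlib sketch; the talk's actual figure may differ.

    # Two identical gray squares on different backgrounds: the pixel values
    # agree even when your eyes insist they do not.
    import numpy as np
    import matplotlib.pyplot as plt

    img = np.zeros((100, 220))
    img[:, :100] = 0.2            # dark background on the left
    img[:, 120:] = 0.8            # bright background on the right
    img[35:65, 35:65] = 0.5       # left center square
    img[35:65, 155:185] = 0.5     # right center square: the same 0.5!

    plt.imshow(img, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")
    plt.show()

    print(img[50, 50], img[50, 170])   # prints 0.5 0.5 -- your eyes disagree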

Overwhelmed: Bone Physiology

  • There is a complex relationship between macroscopic bone strength and the microscopic cellular networks inside the bone.
  • Examining the individual cells can give us insight into what makes for healthy and pathological bone growth; concretely (see the sketch after the figure):
  1. Count how many cells are in the bone slice
  2. Ignore the ones that are ‘too big’ or shaped ‘strangely’
  3. Are there more on the right side or the left side?
  4. Are the ones on the right or left bigger? On the top or the bottom?

(Image: cells in bone tissue)
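A minimal sketch of these four steps with scikit-image; the size and shape cutoffs are illustrative placeholders, not the values used in the actual study.

    # Sketch: count cells, drop implausible ones, compare left vs right.
    # Thresholds (area < 500 px, eccentricity < 0.95) are made up for illustration.
    import numpy as np
    from skimage import measure

    def analyze_slice(binary_cells):
        labels = measure.label(binary_cells)        # 1. find candidate cells
        regions = measure.regionprops(labels)
        # 2. ignore objects that are 'too big' or 'strangely' shaped
        good = [r for r in regions if r.area < 500 and r.eccentricity < 0.95]
        # 3./4. split by horizontal position, then compare counts and sizes
        mid_x = binary_cells.shape[1] / 2.0
        left = [r.area for r in good if r.centroid[1] < mid_x]
        right = [r.area for r in good if r.centroid[1] >= mid_x]
        return {"count": len(good),
                "left_count": len(left), "right_count": len(right),
                "left_mean_area": np.mean(left) if left else 0.0,
                "right_mean_area": np.mean(right) if right else 0.0}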

More overwhelmed

  • Do it all over again for 96 more samples, this time with 2000 slices instead of just one!

(Image: more samples)

A Genome Level Study

  • Genetic studies require hundreds to thousands of samples

(Image: genetic LOD scores)

  • For this study, the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.

  • Now again with 1090 samples!
(Image: even more samples)

It gets even better

  • Those metrics were quantitative and could be extracted from the images fairly easily, even visually
  • What happens if you have softer metrics, such as alignment?

(Image: groups of cells with differing alignment)

  • How aligned are these cells?
  • Is the group on the left more or less aligned than the one on the right?
  • errr? (one way to make this precise is sketched below)
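One way to turn "alignment" into a number is the 2D nematic order parameter of the cells' long-axis orientations: 1 means perfectly aligned, 0 means random. A sketch assuming the cells are already segmented (this is an illustrative metric choice, not necessarily the one used in the study):

    # Quantify 'alignment' as the 2D nematic order parameter of cell
    # orientations: 1 = perfectly aligned, 0 = random.
    import numpy as np
    from skimage import measure

    def alignment_score(binary_cells):
        labels = measure.label(binary_cells)
        # long-axis orientation of every cell, in radians
        thetas = np.array([r.orientation for r in measure.regionprops(labels)])
        # angle doubling makes the score insensitive to head/tail flips
        return np.abs(np.mean(np.exp(2j * thetas)))

    def compare_halves(binary_cells):
        # answer 'is the left group more or less aligned than the right?'
        mid = binary_cells.shape[1] // 2
        return (alignment_score(binary_cells[:, :mid]),
                alignment_score(binary_cells[:, mid:]))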

Scaling Radiologists

As the amount of data increases,

(Figure: scaling of data volumes)

how do radiologists / experts scale to keep up?

Research Image Science in 2015: More and faster

X-Ray

  • Swiss Light Source (SRXTM) images at (>1000fps) \( \rightarrow \) 8GB/s, diffraction patterns (cSAXS) at 30GB/s
  • Nanoscopium (Soleil), 10TB/day, 10-500GB file sizes, very heterogeneous data

Optical

  • Light-sheet microscopy (see Jeremy Freeman's talk) produces images \( \rightarrow \) 500MB/s
  • High-speed confocal images at (>200fps) \( \rightarrow \) 78Mb/s

Geospatial

  • New satellite projects (Skybox, etc) will measure >10 petabytes of images a year

Personal

  • GoPro 4 Black - 60MB/s (3840 x 2160 x 30fps) for $600
  • fps1000 - 400MB/s (640 x 480 x 840 fps) for $400

Time Breakdown

How the time spent on the component tasks of science has changed over the last years, and how we expect it to change by 2020:

  • Experiment Design (Pink)
  • Data Management (Blue)
  • Measurement Time (Green)
  • Post Processing (Purple)

The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.

(Figure: time breakdown by task)

Science or IT?

The same task breakdown, viewed again: much of the growing data-management and post-processing time is IT work rather than science.

(Figure: time breakdown by task)

How much is a TB, really?

If you looked at one 1000 × 1000 pixel image (e.g. a mammogram) every second, it would take you 139 hours to browse through a terabyte of data.
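The arithmetic behind that number, assuming 16-bit (2 byte) pixels:

\[ \frac{10^{12}\ \textrm{bytes}}{10^{6}\ \textrm{pixels/image} \times 2\ \textrm{bytes/pixel}} = 5 \times 10^{5}\ \textrm{images} \approx 5 \times 10^{5}\ \textrm{s} \approx 139\ \textrm{hours at 1 image/s} \]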

Year   Time to 1 TB   Manpower to keep up   Salary costs / month
2000   4096 min       2 people              25 kCHF
2008   1092 min       8 people              95 kCHF
2014   32 min         260 people            3255 kCHF
2016   2 min          3906 people           48828 kCHF
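The manpower column follows from those same 139 hours (≈ 8340 minutes) of viewing per terabyte, divided by how quickly a terabyte is produced; the salary column assumes roughly 12.5 kCHF per person per month:

\[ \textrm{people needed} \approx \frac{8340\ \textrm{min/TB (viewing)}}{\textrm{min/TB (production)}}, \qquad \textrm{e.g. for 2014: } \frac{8340}{32} \approx 260 \]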

Small Data vs Big Data

Big Data is a very popular term today. There is a ton of hype, and money seems to be thrown at any project which includes these words. First, to clear up the biggest misconception: big data isn't about how much data you have, it is about how you work with it.

Before we can think about Big Data we need to define what it's replacing, Small Data.

Small Data

The run-of-the-mill ("08/15") approach: standard tools and lots of clicking

Small Data: Genome-Scale Imaging

  • Hand Identification \( \rightarrow \) 30s / object
  • 30-40k objects per sample
  • One Sample in 6.25 weeks
  • 1300 samples \( \rightarrow \) 120 man-years

Advantages

  • Hand verification of every sample, visual identification of errors and problems

Disadvantages

  • Biased by user
  • Questionable reproducibility
  • Time-consuming
  • Exploration challenging
  • Data versioning is difficult

Medium Data: Genome-scale Imaging

  • ImageJ macro for segmentation
  • Python script for shape analysis
  • Paraview macro for network and connectivity
  • Python script to pool results
  • MySQL Database storing results (5 minutes / query)

1.5 man-years

Advantages

  • Faster than manual methods
  • Less bias
  • More reproducible

Disadvantages

  • Compartmentalized analysis
  • Time-consuming
  • Complex interdependent workflow
  • Fragile (machine crashes, job failures)
  • Expensive: "measure twice, cut once" planning rather than exploration

Exploratory Image Analysis Priorities

Correctness

The most important job for any piece of analysis is to be correct.

  • A powerful testing framework is essential (a minimal sketch follows this list)
  • Avoid repetition of code which leads to inconsistencies
  • Use compilers to find mistakes rather than users
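A minimal sketch of the testing idea: every operation ships with tests on synthetic images whose correct answer is known by construction. The function names here are placeholders, not the SIL API.

    # Tests against synthetic images with answers known by construction.
    import numpy as np
    from scipy import ndimage

    def count_objects(image, threshold):
        labeled, n_objects = ndimage.label(image > threshold)
        return n_objects

    def test_count_objects_on_synthetic_image():
        img = np.zeros((50, 50))
        img[5:10, 5:10] = 1.0      # two well-separated squares
        img[30:40, 30:40] = 1.0
        assert count_objects(img, 0.5) == 2

    def test_count_objects_on_empty_image():
        assert count_objects(np.zeros((50, 50)), 0.5) == 0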

Easily understood, changed, and used

Almost all image processing tasks require a number of people to evaluate and implement them, and they are almost always moving targets

  • Flexible, modular structure that enables replacing specific pieces

Fast

The last of the major priorities is speed, which covers scalability, raw performance, and development time.

  • Long waits for processing discourage exploration
  • Manual access to data on separate disks is a huge speed barrier
  • Real-time image processing requires millisecond latencies
  • Implementing new ideas can be done quickly

The Framework First

  • Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets:
    • choose the framework first
    • then start making the necessary tools.
  • Google, Amazon, Yahoo, and many other companies have made huge in-roads into these problems
  • The real need is a fast, flexible framework for performing complicated analyses robustly and scalably, a sort of Excel for big imaging data.

Apache Spark and Hadoop 2

Together, the two frameworks provide a free, out-of-the-box solution for

  • scaling to >10000 computers
  • storing and processing exabytes of data
  • fault tolerance
    • two-thirds of the computers can crash and a request still finishes correctly
  • hardware and software platform independence (Mac, Windows, Linux)

Spark -> Microscopy?

These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.

I want to

  • filter out noise, segment, choose regions of interest
  • contour, component label
  • measure, count, and analyze (a bridging sketch follows below)
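To bridge the two vocabularies, here is a minimal PySpark sketch (not the actual SIL API) expressing a typical imaging task, thresholding each image and counting the bright regions, in Spark's own terms:

    # Plain PySpark: 'threshold each image and count bright regions',
    # phrased in Spark's map/filter vocabulary.
    import numpy as np
    from scipy import ndimage
    from pyspark import SparkContext

    sc = SparkContext(appName="imaging-sketch")

    def count_bright_regions(image, threshold=0.5):
        # segment: keep pixels above threshold, then component-label the mask
        labeled, n_regions = ndimage.label(image > threshold)
        return n_regions

    # an RDD of (sample_id, 2D image array) pairs; random data stands in
    # for images loaded from disk
    images = sc.parallelize([("s%d" % i, np.random.rand(64, 64))
                             for i in range(100)])

    region_counts = (images
                     .mapValues(count_bright_regions)  # measure every image
                     .filter(lambda kv: kv[1] > 0))    # keep non-empty samples

    print(region_counts.take(5))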

Spark Image Layer

  • Developed at 4Quant, ETH Zurich, and the Paul Scherrer Institut
  • The Spark Image Layer (SIL) is a domain-specific language for microscopy, built on Spark.
  • It converts common imaging tasks into coarse-grained Spark operations

(Image: Spark Image Layer overview)

Spark Image Layer

We have developed a number of SIL commands covering standard image-processing tasks

(Image: SIL commands)

Fully extensible with Spark languages

(Image: SIL commands, continued)

Quantitative Search Machine for Images

  • Count the number of highly anisotropic nuclei in myeloma patients

  • \( \downarrow \textrm{Translate to SQL} \)

    SELECT COUNT(*) FROM
      (SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide))
       FROM patients
       WHERE disease LIKE 'myeloma')
    WHERE anisotropy > 0.75
    
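For illustration, a hypothetical sketch of how such an image measure could be wired into Spark SQL as a user-defined function. ANISOTROPY, the placeholder function body, and the file layout are all made up; this is not the actual SIL implementation.

    # Hypothetical wiring of an image measure into Spark SQL.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import FloatType

    sc = SparkContext(appName="qsm-sketch")
    sqlContext = SQLContext(sc)

    def anisotropy(slide_path):
        # placeholder: load the slide, segment nuclei, return a score in [0, 1]
        return 0.0

    sqlContext.registerFunction("ANISOTROPY", anisotropy, FloatType())

    # a table of patients with a path to each pathology slide
    patients = sqlContext.read.json("patients.json")
    patients.registerTempTable("patients")

    result = sqlContext.sql(
        "SELECT COUNT(*) FROM patients "
        "WHERE disease LIKE '%myeloma%' AND ANISOTROPY(pathology_slide) > 0.75")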

Quantitative Search Machine for Images

\[ \downarrow \textrm{Load Myeloma Data Subset} \]

(Image: slices)

\[ \downarrow \textrm{Perform analysis on every image} \]

(Image: slices)

\[ \downarrow \textrm{Filter out the most anisotropic cells} \]

(Image: slices)

Distribution and Analysis

The query is processed by the SQL parsing engine

  • sent to the database where it can be combined with locally available private data
  • sent to the image layer for all image processing components to be integrated
  • sent to the Spark Core so specific map, reduce, fold operations can be generated
  • sent to the master node where it can be distributed across the entire cluster / cloud

Distribution and Analysis

Once processed into Spark commands, the query can be executed on multiple machines, using multiple data storage environments in the background

  • Data locality ensures analysis runs efficiently on the same machine the data is stored on
  • Network communication is fully managed
  • Fault tolerance of nodes is ensured

Large Computation on the Cloud

For the same price as a single radiologist, you get the following compute capacity (each blue dot is 20 computers):

(Figure: cloud computing capacity)

Cost Comparison: Why hasn't this been done before?

For many traditional scientific studies the sample count is fairly low (10-100). For such low counts, the costs look as follows (estimated from our experience with cortical bone microstructure and rheology):

  • Development costs are very high for Big Data
  • Traditionally, students are much cheaper than the 150 kCHF/year used in this model
  • Medium Data is a good compromise

(Figure: cost comparison for small studies)

Scaling Costs

As studies get bigger (>100 samples), we see that these costs begin to shift radically (a sketch of the cost model follows the figure).

  • The high development costs of Medium and Big Data \( \rightarrow 0 \) per sample
  • The high human costs of Small and Medium Data grow \( \propto \textrm{Samples} \)

(Figure: scaling of costs with sample count)
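A sketch of the cost model behind these curves (notation mine, not from the talk): the total cost of a study of \( N \) samples is a fixed development cost plus a per-sample human cost,

\[ \textrm{Cost}(N) \approx C_{\textrm{dev}} + c_{\textrm{human}} \cdot N \qquad \Rightarrow \qquad \frac{\textrm{Cost}(N)}{N} = \frac{C_{\textrm{dev}}}{N} + c_{\textrm{human}} \]

For Big Data, \( C_{\textrm{dev}} \) is large but \( c_{\textrm{human}} \approx 0 \), so the per-sample cost vanishes as \( N \) grows; for Small Data it is the reverse.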

The next step: from Big Data to improving health care?

From Research to the Clinic

  • New algorithms can be tested on all patient data just as easily as on 10 patients
  • Analysis can be applied to every new patient as the data arrives, in a streaming manner (a sketch follows below)
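A minimal sketch of the streaming idea with Spark Streaming; the directory layout and analyze_scan are hypothetical wiring, not a deployed system.

    # Watch a directory of manifests; each line names a newly measured scan.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="clinic-stream")
    ssc = StreamingContext(sc, batchDuration=60)   # check for new data every minute

    new_scan_paths = ssc.textFileStream("hdfs:///incoming/manifests")

    def analyze_scan(path):
        # placeholder: run the current best validated algorithm on the scan
        return (path, "analyzed")

    new_scan_paths.map(analyze_scan).pprint()      # report results each batch

    ssc.start()
    ssc.awaitTermination()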

Automatic Analysis

  • Instead of piecemeal, single-purpose algorithms in fancy tools:
  • Test new algorithms against hundreds or even millions of labeled datasets to validate them and find the best ones
  • Apply the best ones to all data, automatically
  • Assist radiologists and physicians by augmenting not replacing them

The vision

Instead of

(Image: chest X-ray)

We have

  • Healthy

(Figure: quantitative readout, healthy case)

  • Sick

(Figure: quantitative readout, sick case)

We have a cool tool, but what does this mean for me?

A spinoff - 4Quant: From images to insight

  • Quantitative Search Machine for Images
    • Find all patients with livers larger than 10cm diameter
    • Count the number of highly anisotropic nuclei in myeloma patients
  • Custom Analysis Solutions
    • Custom-tailored software to solve your problems
  • One Stop Shop
    • Measurement, analysis, and statistical analysis

Education / Training

Acknowledgements: 4Quant

  • Flavio Trolese
  • Dr. Prat Das Kanungo
  • Dr. Ana Balan
  • Prof. Marco Stampanoni

Acknowledgements: ETH and PSI

  • TOMCAT Group
  • Radiology team at the Universitätsspital Basel
  • Keith Cheng from Penn State Medical Center
  • AIT at PSI and Scientific Computing at ETH

We are interested in partnerships and collaborations

Learn more at