Quantitative Big Imaging

Statistics, Prediction, and Reproducibility

ETHZ: 227-0966-00L


Course Outline

Literature / Useful References

Books

Videos / Podcasts

Papers / Sites

Model Evaluation

Iris Dataset

Previously on QBI …

Quantitative “Big” Imaging

The course has covered plenty of imaging and a few quantitative metrics, but the “big” part has not really entered yet.

What does big mean?

So what is “big” imaging?

doing analyses in a disciplined manner

having everything automated

being able to adapt and reuse analyses

Objectives

  1. Scientific studies all try to get to a single number
  2. How do we compare the number from different samples and groups?
  3. How do we compare different processing steps like filter choice, minimum volume, resolution, etc.?
  4. How do we evaluate our parameter selection?
  5. How can we ensure our techniques do what they are supposed to do?
  6. How can we visualize so much data? Are there rules?

Outline

What do we start with?

Going back to our original cell image

  1. We have been able to get rid of the noise in the image and find all the cells (lecture 2-4)
  2. We have analyzed the shape of the cells using the shape tensor (lecture 5)
  3. We even separated cells joined together using Watershed (lecture 6)
  4. We have created even more metrics characterizing the distribution (lecture 7)

We have at least a few samples (or different regions), a large number of metrics, and an almost as large number of parameters to tune.

How do we do something meaningful with it?

Correlation and Causation

One of the most repeated criticisms of scientific work is that correlation and causation are confused.

  1. Correlation
  2. Causation

Controlled and Observational

There are two broad classes of data and scientific studies.

Observational

We examined 100 people and the ones with blue eyes were on average 10cm taller

In 100 cake samples, we found a 0.9 correlation between cooking time and bubble size

Controlled

We examined 50 mice with gene XYZ off and 50 with gene XYZ on and found the foot size increased by 10%

We increased the temperature and the number of pores in the metal increased by 10%

Simple Model: Magic / Weighted Coin


Since most experiments in science are specific, noisy, and often very complicated, they are not usually good teaching examples, so we start with a much simpler model.

Simple Model: Magic / Weighted Coin

  1. Each coin represents a stochastic variable \(\mathcal{X}\) and each flip represents an observation \(\mathcal{X}_i\).
  2. The act of performing a coin flip \(\mathcal{F}\) is an observation \(\mathcal{X}_i = \mathcal{F}(\mathcal{X})\)

We normally assume

  1. A fair coin has an expected value of \(E(\mathcal{X})=0.5 \rightarrow\) 50% Heads, 50% Tails
  2. An unbiased flipper means

\[ P(\mathcal{F}_1(\mathcal{X})*\mathcal{F}_2(\mathcal{X}))= P(\mathcal{F}_1(\mathcal{X}))*P(\mathcal{F}_2(\mathcal{X}))\]

\[ E(\prod_{i=0}^\infty \mathcal{F}_i(\mathcal{X})) = E(\mathcal{X}) \]
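A minimal simulation of this model in R (the bias and the number of flips are arbitrary choices for illustration):

# simulate n independent flips of a possibly weighted coin (TRUE = heads)
flip.coin <- function(n, p.heads = 0.5) runif(n) < p.heads

set.seed(1234)
mean(flip.coin(100))        # empirical estimate of E(X) for a fair coin, near 0.5
mean(flip.coin(100, 0.75))  # a loaded coin from the magic shop, near 0.75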

Simple Model to Reality

Coin Flip

  1. Each flip gives us a small piece of information about the flipper and the coin
  2. More flips provides more information

Real Experiment

  1. Each measurement tells us about our sample, our instrument, and our analysis
  2. More measurements provide more information

Iris: A more complicated model

Coin flips are very simple and probably difficult to match to another experiment. A very popular dataset for learning about such values beyond coin flips is the Iris dataset, which contains a number of measurements from different plants along with the corresponding species.

|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 35  | 4.9          | 3.1         | 1.5          | 0.2         | setosa     |
| 41  | 5.0          | 3.5         | 1.3          | 0.3         | setosa     |
| 57  | 6.3          | 3.3         | 4.7          | 1.6         | versicolor |
| 73  | 6.3          | 2.5         | 4.9          | 1.5         | versicolor |
| 81  | 5.5          | 2.4         | 3.8          | 1.1         | versicolor |

Reproducibility

A very broad topic with plenty of sub-areas and deeper meanings. We mean two things by reproducibility

Analysis

The process of going from images to numbers is described in a clear manner so that anyone, anywhere could follow it and get the same (within some tolerance) numbers from your samples

Measurement

Everything for analysis + taking a measurement several times (noise and exact alignment vary each time) does not change the statistics significantly

Reproducible Analysis

The basis for reproducible analysis is scripts and macros, since we will need to perform the same analysis many times to understand how reproducible it is.

# apply the same threshold and analysis script to any input image
IMAGEFILE=$1
THRESHOLD=130
matlab -r "inImage=imread('$IMAGEFILE'); threshImage=inImage>$THRESHOLD; analysisScript;"

Comparing Groups: Intraclass Correlation Coefficient

The intraclass correlation coefficient basically looks at how similar objects within a group are compared to objects in other groups.

Intraclass Correlation Coefficient Definition

\[ ICC = \frac{S_A^2}{S_A^2+S_W^2} \]

where

  - \(S_A^2\) is the variance among groups or classes
    - Estimate with the standard deviations of the mean values for each group
  - \(S_W^2\) is the variance within groups or classes
    - Estimate with the average of the standard deviations for each group
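A rough sketch of this estimate in R, following the definitions above (the column names value and group are hypothetical; this is the simplified estimate from the slide, not the full ANOVA-based ICC):

# estimate the ICC from a data frame with (hypothetical) columns 'value' and 'group'
est.icc <- function(df) {
  group.means <- tapply(df$value, df$group, mean)
  group.vars  <- tapply(df$value, df$group, var)
  s.a2 <- var(group.means)  # spread among the group means
  s.w2 <- mean(group.vars)  # average spread within each group
  s.a2 / (s.a2 + s.w2)
}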

Intraclass Correlation Coefficient: Values

Intraclass Correlation Coefficient: Values for Coin-Flips

We have one biased coin and try to figure out how many flips we need for the ICC to tell it apart from a normal (fair) coin.

With many thousands of flips we eventually see a very strong difference, but unless the coin is very strongly biased, the ICC is a poor indicator of the difference.

Comparing Groups: Tests

Once the reproducibility has been measured, it is possible to compare groups. The idea is to make a test to assess the likelihood that two groups are the same given the data

  1. List assumptions
  2. Establish a null hypothesis
  3. Calculate the probability of the observations given the truth of the null hypothesis

Loaded Coin

We have 1 coin from a magic shop

  - our assumptions are
    - we flip and observe flips of coins accurately and independently
    - the coin is invariant and always has the same expected value
  - our null hypothesis is that the coin is unbiased, \(E(\mathcal{X})=0.5\)
  - we can calculate the likelihood of a given observation given the number of flips (p-value)

| Number of Flips | Probability of All Heads Given Null Hypothesis (p-value) |
|-----------------|-----------------------------------------------------------|
| 1               | 50 %                                                      |
| 5               | 3.1 %                                                     |
| 10              | 0.1 %                                                     |
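The p-values in the table come directly from the null hypothesis: for a fair coin the chance of \(n\) heads in a row is \(0.5^n\). A quick check in R:

n.flips <- c(1, 5, 10)
p.value <- 0.5^n.flips     # probability of all heads under the null hypothesis
round(100 * p.value, 1)    # 50.0  3.1  0.1 (in percent), matching the table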

How good is good enough?

Comparing Groups: Student’s T Distribution

Since we do not usually know our distribution very well or have enough samples to create a sufficient probability model, we need an estimate that works with only a few observations.

Student T Distribution

We assume the distribution of our stochastic variable is normal (Gaussian) and the t-distribution provides an estimate for the mean of the underlying distribution based on few observations.

Student T-Test

The t-test incorporates this distribution and provides an easy method for assessing the likelihood that two given sets of observations come from the same underlying process (the null hypothesis).
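In R this is a single call to t.test; the two groups here are simulated just for illustration:

set.seed(1234)
group.a <- rnorm(10, mean = 0.50, sd = 0.1)  # e.g. measured porosity of sample A
group.b <- rnorm(10, mean = 0.55, sd = 0.1)  # e.g. measured porosity of sample B
t.test(group.a, group.b)   # reports the p-value for the null hypothesis of equal means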

Multiple Testing Bias

Back to the magic coin: let's assume we are trying to publish a paper and we heard a p-value of < 0.05 (5%) was good enough. That means if we get 5 heads we are good!

| Number of Flips | Probability of All Heads Given Null Hypothesis (p-value) |
|-----------------|-----------------------------------------------------------|
| 1               | 50 %                                                      |
| 4               | 6.2 %                                                     |
| 5               | 3.1 %                                                     |

| Number of Friends Flipping | Probability Someone Flips 5 Heads |
|----------------------------|-----------------------------------|
| 1                          | 3.1 %                             |
| 10                         | 27.2 %                            |
| 20                         | 47 %                              |
| 40                         | 71.9 %                            |
| 80                         | 92.1 %                            |
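The second table follows from the first: if each friend independently has a 3.1% chance of flipping 5 heads, the chance that at least one of \(n\) friends succeeds is \(1-(1-0.031)^n\):

p.five.heads <- 0.5^5                            # 3.1% for one set of 5 flips
n.friends <- c(1, 10, 20, 40, 80)
p.someone <- 1 - (1 - p.five.heads)^n.friends    # at least one friend succeeds
round(100 * p.someone, 1)                        # 3.1 27.2 47.0 71.9 92.1 (percent)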

Clearly this is not the case, otherwise we could keep flipping coins or ask all of our friends to flip until we got 5 heads and publish

The p-value is only meaningful when the experiment matches what we did.

  - We didn't say the chance of getting 5 heads ever was < 5%
  - We said if we have exactly 5 observations and all of them are heads, the likelihood that a fair coin produced that result is < 5%

There are many methods to correct for this; most just involve scaling \(p\) by the number of tests performed (e.g. the Bonferroni correction). The likelihood of a sequence of 5 heads in a row somewhere within 10 flips is roughly 5x higher than within exactly 5 flips.

Multiple Testing Bias: Experiments

This is very bad news for us. We have the ability to quantify all sorts of interesting metrics

  - cell distance to other cells
  - cell oblateness
  - cell distribution oblateness

So let's throw them all into a magical statistics algorithm and push the publish button.

With our p-value of less than 0.05 and a study with 10 samples in each group, how does increasing the number of variables affect our results?

Multiple Testing Bias: Correction

Using the simple correction factor (number of tests performed), we can make the significant findings constant again
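A sketch of this simple scaling correction using R's built-in p.adjust; the data here are simulated noise purely for illustration:

set.seed(1234)
n.tests <- 20
# 20 t-tests on pure noise: every null hypothesis is true by construction
raw.p <- replicate(n.tests, t.test(rnorm(10), rnorm(10))$p.value)
sum(raw.p < 0.05)                                   # uncorrected: false positives can appear
sum(p.adjust(raw.p, method = "bonferroni") < 0.05)  # p scaled by n.tests: usually none remain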

So no harm done there, we just add this correction factor, right? Well, what if we have exactly one variable with a shift of 1.0 standard deviations from the others?

Multiple Testing Bias: Sample Size

Predicting and Validating

Validation Graphic - Borrowed from http://peekaboo-vision.blogspot.ch/2013/01/machine-learning-cheat-sheet-for-scikit.html

Main Categories

Overview

Basically all of these are ultimately functions which map inputs to outputs.

The input could be

The output is

Overfitting

The most serious problem with machine learning and similar approaches is overfitting your model to your data. Particularly as models get increasingly complex (random forests, neural networks, deep learning, …), it becomes more and more difficult to apply common sense or even understand exactly what a model is doing and why a given answer is produced.

magic_classifier = {}
# training
magic_classifier['Dog'] = 'Animal'
magic_classifier['Bob'] = 'Person'
magic_classifier['Fish'] = 'Animal'

Now use this classifier, on the training data it works really well

magic_classifier['Dog'] == 'Animal' # true, 1/1 so far!
magic_classifier['Bob'] == 'Person' # true, 2/2 still perfect!
magic_classifier['Fish'] == 'Animal' # true, 3/3, wow!

On new data it doesn’t work at all, it doesn’t even execute.

magic_classifier['Octopus'] == 'Animal' # exception?! but it was working so well
magic_classifier['Dan'] == 'Person' # exception?! 

The above example appeared to be a perfect classifier for mapping names to animals or people, but it just memorized the inputs and reproduced them as outputs. It didn't actually learn anything; it just copied.

Validation

Validation is relevant for each of the categories, but applied in a slightly different way depending on the group. The idea is to divide the dataset into groups called training and validation, or ideally training, validation, and testing. The analysis is then developed on the training set, tuned against the validation set, and only evaluated once, at the very end, on the testing set.

Concrete Example: Classifying Flowers

Here we return to the iris data set and try to automatically classify flowers

|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 103 | 7.1          | 3.0         | 5.9          | 2.1         | virginica  |
| 57  | 6.3          | 3.3         | 4.7          | 1.6         | versicolor |
| 29  | 5.2          | 3.4         | 1.4          | 0.2         | setosa     |
| 135 | 6.1          | 2.6         | 5.6          | 1.4         | virginica  |
| 37  | 5.5          | 3.5         | 1.3          | 0.2         | setosa     |

Dividing the data

We first decide on a split, in this case 60%, 30%, 10% for training, validation, and testing and randomly divide up the data.

library(dplyr) # for %>%, mutate, select, and sample_n
library(knitr) # for kable
div.iris<-iris %>%
  mutate(
    # generate a random number uniformly between 0 and 1
    rand_value = runif(nrow(iris)),
    # divide the data based on how high this number is into different groups
    data_div = ifelse(rand_value<0.6,"Training",
                      ifelse(rand_value<0.9,"Validation",
                             "Testing")
                      )
  ) %>% select(-rand_value) # we don't need this anymore
div.iris %>% sample_n(4) %>% kable(digits=2)
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    | data_div   |
|-----|--------------|-------------|--------------|-------------|------------|------------|
| 55  | 6.5          | 2.8         | 4.6          | 1.5         | versicolor | Validation |
| 77  | 6.8          | 2.8         | 4.8          | 1.4         | versicolor | Training   |
| 100 | 5.7          | 2.8         | 4.1          | 1.3         | versicolor | Validation |
| 127 | 6.2          | 2.8         | 4.8          | 1.8         | virginica  | Training   |

Using a simple decision tree

Making a decision tree can be done by providing the output (Species) as a function of the input, in this case a combination of sepal length and width (~Sepal.Length+Sepal.Width). From this, a tree is built automatically.

library(rpart)
library(rpart.plot)
training.data <- div.iris %>% subset(data_div == "Training")
dec.tree<-rpart(Species~Sepal.Length+Sepal.Width,data=training.data)

A tree can be visualized graphically as a trunk (the topmost node) dividing progressively into smaller subnodes

or as a list of rules to apply

## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
##    2) Sepal.Length< 5.45 52   7 setosa (0.86538462 0.11538462 0.01923077)  
##      4) Sepal.Width>=2.8 45   1 setosa (0.97777778 0.02222222 0.00000000) *
##      5) Sepal.Width< 2.8 7   2 versicolor (0.14285714 0.71428571 0.14285714) *
##    3) Sepal.Length>=5.45 98  49 virginica (0.05102041 0.44897959 0.50000000)  
##      6) Sepal.Length< 6.15 43  15 versicolor (0.11627907 0.65116279 0.23255814)  
##       12) Sepal.Width>=3.1 7   2 setosa (0.71428571 0.28571429 0.00000000) *
##       13) Sepal.Width< 3.1 36  10 versicolor (0.00000000 0.72222222 0.27777778) *
##      7) Sepal.Length>=6.15 55  16 virginica (0.00000000 0.29090909 0.70909091) *

Overlaying with the prediction data looks good

It struggles more with the validation data since it has never seen it before and it’s not quite the same as the training

The test data (normally we would not look at it at all right now and wait until the very end) looks even worse and an even smaller fraction is correctly matched.

Tricky Concrete Example: Classification

Taking a list of points (feature vectors) where each has an \(x1\) and a \(y1\) coordinate and a classification (Happy or Sad), we can show the data as a table

|     | x1     | y1    | class |
|-----|--------|-------|-------|
| 131 | -30.00 | 0.00  | Sad   |
| 24  | 8.34   | 25.68 | Happy |
| 28  | 15.39  | 17.09 | Happy |
| 26  | 12.50  | 21.65 | Happy |
| 57  | 4.60   | 43.76 | Sad   |

Or graphically

You can play around with neural networks and this data set at TensorFlow Playground

Dividing the data

We first decide on a split, in this case 60%, 30%, 10% for training, validation, and testing and randomly divide up the data.

div.spiral.pts<-spiral.pts %>%
  mutate(
    # generate a random number uniformly between 0 and 1
    rand_value = runif(nrow(spiral.pts)),
    # divide the data based on how high this number is into different groups
    data_div = ifelse(rand_value<0.6,"Training",
                      ifelse(rand_value<0.9,"Validation",
                             "Testing")
                      )
  ) %>% select(-rand_value) # we don't need this anymore
div.spiral.pts %>% sample_n(4) %>% kable(digits=2)
|    | x1    | y1    | class | data_div   |
|----|-------|-------|-------|------------|
| 71 | 30.00 | 0.00  | Sad   | Training   |
| 35 | 15.91 | 1.67  | Happy | Training   |
| 93 | -5.35 | -5.95 | Sad   | Testing    |
| 34 | 16.63 | 3.53  | Happy | Validation |

Using a simple decision tree

Making a decision tree can be done by providing the output (class) as a function of the input, in this case just a combination of x1 and y1 (~x1+y1). From this, a tree is built automatically.

library(rpart)
library(rpart.plot)
training.data <- div.spiral.pts %>% subset(data_div == "Training")
dec.tree<-rpart(class~x1+y1,data=training.data)

A tree can be visualized graphically as a trunk (the topmost node) dividing progressively into smaller subnodes

or as a list of rules to apply

## n= 125 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 125 57 Happy (0.54400000 0.45600000)  
##     2) y1< 33.01294 116 48 Happy (0.58620690 0.41379310)  
##       4) y1>=18.4582 14  0 Happy (1.00000000 0.00000000) *
##       5) y1< 18.4582 102 48 Happy (0.52941176 0.47058824)  
##        10) y1< 5.691917 82 31 Happy (0.62195122 0.37804878)  
##          20) y1>=-31.22271 75 25 Happy (0.66666667 0.33333333)  
##            40) y1>=3.447282 10  0 Happy (1.00000000 0.00000000) *
##            41) y1< 3.447282 65 25 Happy (0.61538462 0.38461538)  
##              82) y1< -18.4582 13  1 Happy (0.92307692 0.07692308) *
##              83) y1>=-18.4582 52 24 Happy (0.53846154 0.46153846)  
##               166) y1>=-10.0245 34 11 Happy (0.67647059 0.32352941) *
##               167) y1< -10.0245 18  5 Sad (0.27777778 0.72222222) *
##          21) y1< -31.22271 7  1 Sad (0.14285714 0.85714286) *
##        11) y1>=5.691917 20  3 Sad (0.15000000 0.85000000) *
##     3) y1>=33.01294 9  0 Sad (0.00000000 1.00000000) *

Overlaying with the prediction data looks good

It struggles more with the validation data since it has never seen it before and it’s not quite the same as the training

The test data (normally we would not look at it at all right now and wait until the very end) looks even worse and an even smaller fraction is correctly matched.

We can choose to make more complicated trees by changing the function to something more detailed like

\[ class = x1+y1+x1^2+y1^2+\sin(x1/5)+\sin(y1/5) \]

dec.tree<-rpart(class~x1+y1+I(x1^2)+I(y1^2)+sin(x1/5)+sin(y1/5),data=training.data)
prp(dec.tree)

Parameters

How does a standard image processing chain look?

The Full Chain

The Full Chain (with Parameters)

Qualitative vs Quantitative

Given the complexity of the tree, we need to do some pruning

Qualitative Assessment

Quantitative Metrics

With a quantitative approach, we can calculate the specific shape or distribution metrics on the sample with each parameter and establish the relationship between parameter and metric.

Parameter Sweep

The way we do this is usually a parameter sweep which means taking one (or more) parameters and varying them between the reasonable bounds (judged qualitatively).

Is it always the same?

Sensitivity

Sensitivity is defined in control system theory as the change in the value of an output against the change in the input. \[ S = \frac{|\Delta \textrm{Metric}|}{|\Delta \textrm{Parameter}|} \]

Such a strict definition is not particularly useful for image processing, since a threshold has units of intensity and a metric might be a volume in \(m^3\), so the sensitivity becomes volume per intensity.

Practical Sensitivity

A more common approach is to estimate the variation in this parameter between images or within a single image (automatic threshold methods can be useful for this) and define the sensitivity based on this variation. It is also common to normalize it with the mean value so the result is a percentage.

\[ S = \frac{max(\textrm{Metric})-min(\textrm{Metric})}{avg(\textrm{Metric})} \]
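A minimal sketch of this calculation in R, assuming a hypothetical parameter sweep stored as a data frame with a threshold parameter and a measured volume metric (the numbers are made up for illustration):

# hypothetical parameter sweep: one metric value (volume) per tested threshold
sweep.results <- data.frame(threshold = seq(90, 130, by = 10),
                            volume    = c(1050, 980, 940, 910, 820))
with(sweep.results, (max(volume) - min(volume)) / mean(volume))  # here about 0.24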

Sensitivity: Real Measurements

In this graph it is the magnitude of the slope: the steeper the slope, the more the metric changes given a small change in the parameter.

Comparing different variables, we see that the parameter value with the best (lowest) sensitivity for the count is the worst (highest) for the volume and anisotropy.

Which metric is more important?

Unit Testing

In computer programming, unit testing is a method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine if they are fit for use.

The first requirement for unit testing to work well is to have your tools divided up into small independent parts (functions)

  - Each part can then be tested independently (unit testing)
  - If the tests are well done, units can be changed and tested independently
  - Makes upgrading or expanding tools easy
  - The entire path can be tested (integration testing)
  - Catches mistakes in integration or glue

Ideally with realistic but simulated test data

  - The utility of the testing is only as good as the tests you make

Example

function vxCnt=countVoxs(inImage)
  % count the number of nonzero (foreground) voxels in the image
  vxCnt = sum(inImage(:)>0);
end

assert(countVoxs(zeros(3,3)) == 0)
assert(countVoxs(zeros(3,3,3)) == 0)
assert(countVoxs(eye(3)) == 3)

Unit Testing: Examples

Unit Testing in ImageJ


Unit Testing in KNIME

The Java-based unit-testing framework (JUnit) can be used before any of the plugins are compiled; additionally, entire workflows can be made to test the objects using special testing nodes like

  - difference node (checks if two values are different)

KNIME Tests


Test Driven Programming

Test-driven programming is a style or approach to programming where the tests are written before the functional code, like very concrete specifications. It is easy to estimate how much time is left since you can automatically see how many of the tests have passed. You and your collaborators are clear on the utility of the system.

  1. shapeAnalysis must give an anisotropy of 0 when we input a sphere
  2. shapeAnalysis must give the center of volume within 0.5 pixels
  3. shapeAnalysis must run on a 1000x1000 image in 30 seconds
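In R, specifications like these could be written down first as tests, for example with the testthat package against the yet-to-be-written shapeAnalysis function. This is only a sketch: make.sphere.image is a hypothetical test-data generator, and shapeAnalysis is assumed to return a list with anisotropy and center fields.

library(testthat)

test_that("shapeAnalysis gives an anisotropy of 0 for a sphere", {
  sphere <- make.sphere.image(radius = 10)   # hypothetical test-data generator
  expect_equal(shapeAnalysis(sphere)$anisotropy, 0, tolerance = 1e-6)
})

test_that("shapeAnalysis finds the center of volume within 0.5 pixels", {
  sphere <- make.sphere.image(radius = 10, center = c(32, 32, 32))
  expect_lt(max(abs(shapeAnalysis(sphere)$center - c(32, 32, 32))), 0.5)
})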

Visualization

One of the biggest problems with big science is trying to visualize a lot of heterogeneous data.

Bad Graphs

There are too many graphs which say

3d Plots Spectroscopy

Linkage

Linkage 2

Key Ideas

  1. What is my message?
  2. Does the graphic communicate it clearly?
  3. Is a graphic representation really necessary?
  4. Does every line / color serve a purpose?

Simple Rules

  1. Never use 3D graphics when it can be avoided (unless you want to be deliberately misleading); our visual system is not well suited for comparing the heights of objects in 3D
  2. Pie charts can also be hard to interpret
  3. Background color should almost always be white (not light gray)
  4. Use color palettes adapted to human visual sensitivity

What is my message

Does my graphic communicate it clearly?

Grammar of Graphics

Grammar Explained

Normally we think of plots in terms of some sort of data which is fed into a plot command that produces a picture

  - In Excel you select a range and plot-type and click “Make”
  - In Matlab you run plot(xdata,ydata,color/shape)

  1. These produce entire graphics (sentences), or at least phrases, in one go and thus abstract away the idea of a grammar.
  2. If you spoke by finding entire sentences in a book it would be very ineffective; it is much better to build up word by word.

Grammar

Separate the graph into its component parts

  1. Data Mapping

Graph Decomposed

  1. Points
  2. Axes / Coordinate System
  3. Labels / Annotation

Construct graphics by focusing on each portion independently.
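As a sketch of building a graphic piece by piece, here is a ggplot2 version using the iris data from earlier; each layer corresponds to one of the components listed above:

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +  # data mapping
  geom_point() +                                                          # points
  coord_cartesian() +                                                     # coordinate system
  labs(x = "Sepal Length (cm)", y = "Petal Length (cm)",
       title = "Iris measurements, built layer by layer")                 # labels / annotation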

Wrapping up