---
title: "Quantitative Big Imaging"
author: "Statistics, Prediction, and Reproducibility"
date: 'ETHZ: 227-0966-00L'
output:
  slidy_presentation:
    footer: "Quantitative Big Imaging 2017"
---
source('../common/schedule.R')
David J.C. MacKay, Bayesian Interpolation (1991) [http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.9072]
The course has covered imaging in depth and introduced a few quantitative metrics, but the "big" part has not really entered yet.
What does big mean?
So what is “big” imaging
Going back to our original cell image
We have at least a few samples (or different regions), a large number of metrics, and an almost as large number of parameters to tune.
One of the most repeated criticisms of scientific work is that correlation and causation are confused.
There are two broad classes of data and scientific studies.
We examined 100 people and the ones with blue eyes were on average 10cm taller
In 100 cake samples, we found a 0.9 correlation between cooking time and bubble size
We examined 50 mice with gene XYZ off and 50 with gene XYZ on, and the foot size increased by 10%
We increased the temperature and the number of pores in the metal increased by 10%
Since most experiments in science are specific, noisy, and often very complicated, they are not usually good teaching examples.
We normally assume
$$ P(\mathcal{F}_1(\mathcal{X}) \cap \mathcal{F}_2(\mathcal{X})) = P(\mathcal{F}_1(\mathcal{X})) \cdot P(\mathcal{F}_2(\mathcal{X})) $$

$$ E\left( \prod_{i=0}^{\infty} \mathcal{F}_i(\mathcal{X}) \right) = E(\mathcal{X}) $$
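As a small illustration of the second assumption (not from the original slides): repeated, independent, zero-mean-noise measurements of the same quantity should average back to the true value.

```r
# a minimal sketch: many independent noisy measurements of the same quantity
set.seed(1234)
true.value <- 10
measurements <- true.value + rnorm(1e4, mean = 0, sd = 2)  # zero-mean noise
mean(measurements)  # close to 10, the expected value of the underlying quantity
```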
Coin flips are very simple but difficult to relate to most real experiments. A very popular dataset for learning about such values beyond coin flips is the Iris dataset, which covers a number of measurements from different plants and the corresponding species.
iris %>% sample_n(5) %>% kable
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 126 | 7.2          | 3.2         | 6.0          | 1.8         | virginica  |
| 13  | 4.8          | 3.0         | 1.4          | 0.1         | setosa     |
| 75  | 6.4          | 2.9         | 4.3          | 1.3         | versicolor |
| 68  | 5.8          | 2.7         | 4.1          | 1.0         | versicolor |
| 39  | 4.4          | 3.0         | 1.3          | 0.2         | setosa     |
# requires dplyr, reshape2 (for melt), and ggplot2
iris %>%
  mutate(plant.id=1:nrow(iris)) %>%
  melt(id.vars=c("Species","plant.id")) -> flat_iris

flat_iris %>%
  merge(flat_iris,by=c("Species","plant.id")) %>%
  ggplot(aes(value.x,value.y,color=Species)) +
  geom_jitter()+
  facet_grid(variable.x~variable.y,scales="free")+
  theme_bw(10)
Reproducibility is a very broad topic with plenty of sub-areas and deeper meanings. We mean two things by reproducibility:
The process of going from images to numbers is detailed in a clear manner that anyone, anywhere could follow and get the exact (within some tolerance) same numbers from your samples
Repeating a measurement several times (noise and exact alignment vary each run) and rerunning the full analysis does not change the statistics significantly.
The basis for reproducible analysis is scripts and macros, since we will need to perform the same analysis many times to understand how reproducible it is.
IMAGEFILE=$1
THRESHOLD=130
matlab -r "inImage=$IMAGEFILE; threshImage=inImage>$THRESHOLD; analysisScript;"
java -jar ij.jar -macro TestMacro.ijm blobs.tif
Rscript -e "library(plyr);..."
The intraclass correlation coefficient (ICC) basically looks at how similar objects within a group are compared to objects in other groups.
ggplot(iris,aes(x=Species,y=Sepal.Width))+
geom_boxplot()+
geom_jitter()+
labs(x="Species",y="Sepal Width",title="Low Group Similarity")+
theme_bw(20)
ggplot(iris,aes(x=Species,y=Petal.Length))+
geom_boxplot()+
geom_jitter()+
labs(x="Species",y="Petal Length",title="High Group Similarity")+
theme_bw(20)
$$ ICC = \frac{S_A^2}{S_A^2 + S_W^2} $$
where

- $S_A^2$ is the variance among groups (estimated from the spread of the group means)
- $S_W^2$ is the variance within groups (estimated with the average of the standard deviations for each group)
- 1 means 100% of the variance is between classes
- 0 means 0% of the variance is between classes
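For intuition, a rough hand computation of this ratio for the iris data (a crude estimate only; a proper estimator such as `ICCbare` below uses ANOVA mean squares):

```r
library(dplyr)
# crude ICC for Sepal.Width grouped by Species, following the formula directly
grp <- iris %>% group_by(Species) %>%
  summarise(grp.mean = mean(Sepal.Width), grp.sd = sd(Sepal.Width))
S.A <- var(grp$grp.mean)   # spread of the group means (among-group variance)
S.W <- mean(grp$grp.sd)^2  # average within-group spread, squared
S.A / (S.A + S.W)          # fraction of the variance that is between groups
```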
library("ICC")
icVal<-ICCbare(Species,Sepal.Width,data=iris)
ggplot(iris,aes(x=Species,y=Sepal.Width))+
geom_boxplot()+
geom_jitter()+
labs(x="Species",y="Sepal Width",title=sprintf("Low Group Similarity\n ICC:%2.2f",icVal))+
theme_bw(20)
icVal<-ICCbare(Species,Petal.Length,data=iris)
ggplot(iris,aes(x=Species,y=Petal.Length))+
geom_boxplot()+
geom_jitter()+
labs(x="Species",y="Sepal Width",title=sprintf("High Group Similarity\n ICC:%2.2f",icVal))+
theme_bw(20)
We have one biased coin and we try to figure out how many flips we need before the ICC can tell it apart from a normal coin.
name.list<-c("Coin A","Coin B")
test.data<-plyr::ldply(c(1:length(name.list)),function(class.id) {
data.frame(name=name.list[class.id],values=runif(100,max=1+2*(class.id-1))>0.5)
})
icVal<-ICCbare(name,values,data=test.data)
ggplot(test.data,aes(x=name,y=values))+
geom_jitter()+
labs(x="Groups",y="Value",title=sprintf("100 flips\n ICC:%2.2f",icVal))+
theme_bw(20)
With many thousands of flips we eventually see a very strong difference, but unless the coin is very strongly biased, the ICC is a poor indicator of the difference.
name.list<-c("Coin A","Coin B")
test.data<-plyr::ldply(c(1:length(name.list)),function(class.id) {
data.frame(name=name.list[class.id],values=runif(20000,max=1+2*(class.id-1))>0.5)
})
icVal<-ICCbare(name,values,data=test.data)
ggplot(test.data,aes(x=name,y=values))+
geom_jitter()+
labs(x="Groups",y="Value",title=sprintf("20,000 flips\n ICC:%2.2f",icVal))+
theme_bw(20)
Once the reproducibility has been measured, it is possible to compare groups. The idea is to construct a test to assess the likelihood that two groups are the same given the data.
We have 1 coin from a magic shop
n.flips<-c(1,5,10)
cf.table<-data.frame(No.Flips=n.flips,PAH=paste(round(1000*0.5^n.flips)/10,"%"))
names(cf.table)<-c("Number of Flips","Probability of All Heads Given Null Hypothesis (p-value)")
kable(cf.table)
| Number of Flips | Probability of All Heads Given Null Hypothesis (p-value) |
|-----------------|-----------------------------------------------------------|
| 1               | 50 %                                                      |
| 5               | 3.1 %                                                     |
| 10              | 0.1 %                                                     |
How good is good enough?
Since we do not usually know our distribution very well, and we rarely have enough samples to create a sufficient probability model:
We assume the distribution of our stochastic variable is normal (Gaussian) and the t-distribution provides an estimate for the mean of the underlying distribution based on few observations.
The t-test incorporates this distribution and provides an easy method for assessing the likelihood that two given sets of observations come from the same underlying process (the null hypothesis).
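As a minimal sketch (not part of the original slides), R's built-in `t.test` compares two sets of observations and reports the p-value under this null hypothesis:

```r
library(dplyr)
# compare sepal widths of two iris species with a two-sample t-test
setosa.width     <- iris %>% filter(Species == "setosa")     %>% pull(Sepal.Width)
versicolor.width <- iris %>% filter(Species == "versicolor") %>% pull(Sepal.Width)
tres <- t.test(setosa.width, versicolor.width)
tres$p.value  # probability of a difference this large if both species shared the same mean
```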
Back to the magic coin: let's assume we are trying to publish a paper and we heard that a p-value of < 0.05 (5%) is good enough. That means if we get 5 heads in a row we are good!
n.flips<-c(1,4,5)
cf.table<-data.frame(No.Flips=n.flips,PAH=paste(round(1000*0.5^n.flips)/10,"%"))
names(cf.table)<-c("Number of Flips","Probability of All Heads Given Null Hypothesis (p-value)")
kable(cf.table)
| Number of Flips | Probability of All Heads Given Null Hypothesis (p-value) |
|-----------------|-----------------------------------------------------------|
| 1               | 50 %                                                      |
| 4               | 6.2 %                                                     |
| 5               | 3.1 %                                                     |
n.friends<-c(1,10,20,40,80)
cfr.table<-data.frame(No.Friends=n.friends,PAH=paste(round((1000*(1-(1-0.5^5)^n.friends)))/10,"%"))
names(cfr.table)<-c("Number of Friends Flipping","Probability Someone Flips 5 heads")
kable(cfr.table)
| Number of Friends Flipping | Probability Someone Flips 5 Heads |
|----------------------------|-----------------------------------|
| 1                          | 3.1 %                             |
| 10                         | 27.2 %                            |
| 20                         | 47 %                              |
| 40                         | 71.9 %                            |
| 80                         | 92.1 %                            |
Clearly this is not the case; otherwise we could keep flipping coins, or ask all of our friends to flip, until we got 5 heads and then publish.
The p-value is only meaningful when the experiment matches what we did.
There are many methods to correct for this, and most just involve scaling the p-value. For example, the likelihood of a run of 5 heads in a row somewhere within 10 flips is roughly 5x higher than within 5 flips.
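A minimal sketch of such a scaling with base R's `p.adjust` (the p-values here are made up for illustration):

```r
# hypothetical p-values from five separate tests
p.vals <- c(0.03, 0.2, 0.01, 0.5, 0.04)
p.adjust(p.vals, method = "bonferroni")  # each p-value scaled by the number of tests (capped at 1)
p.adjust(p.vals, method = "BH")          # a less conservative false-discovery-rate correction
```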
This is very bad news for us. We have the ability to quantify all sorts of interesting metrics
So let's throw them all into a magical statistics algorithm and push the publish button!
With our p-value of less than 0.05 and a study with 10 samples in each group, how does increasing the number of variables affect our results?
# requires the plyr package (ldply, llply)
make.random.data<-function(n.groups=2,n.samples=10,n.vars=1,rand.fun=runif,group.off.fun=function(grp.id) 0) {
ldply(1:n.groups,function(c.group) {
data.frame(group=c.group,
do.call(cbind,llply(1:n.vars,function(c.var) group.off.fun(c.group)+rand.fun(n.samples)))
)
})
}
# only works for two groups
all.t.test<-function(in.data) {
group1<-subset(in.data,group==1)[,-1,drop=F]
group2<-subset(in.data,group==2)[,-1,drop=F]
ldply(1:ncol(group1),function(var.id) {
tres<-t.test(group1[,var.id],group2[,var.id])
data.frame(var.col=var.id,
p.val=tres$p.value,
method=tres$method,
var.count=ncol(group1),
sample.count=nrow(in.data))
}
)
}
# run the entire analysis several times to get an average
test.random.data<-function(n.test=10,...) {
ldply(1:n.test,function(c.test) cbind(test.num=c.test,all.t.test(make.random.data(...))))
}
var.range<-round(seq(1,60,length.out=15))
test.cnt<-80
sim.data<-ldply(var.range,function(n.vars) test.random.data(test.cnt,n.vars=n.vars))
sig.likelihood<-ddply(sim.data,.(var.count),function(c.tests) {
data.frame(sig.vars=nrow(subset(c.tests,p.val<=0.05))/length(unique(c.tests$test.num)))
})
ggplot(sig.likelihood,
aes(x=var.count,y=sig.vars))+
geom_point()+geom_line()+
labs(x="Number of Variables in Study",y="Number of Significant \n (P<0.05) Findings")+
theme_bw(20)
sig.likelihood.corr<-ddply(sim.data,.(var.count),function(c.tests) {
data.frame(sig.vars=nrow(subset(c.tests,p.val<=0.05/var.count))/length(unique(c.tests$test.num)))
})
ggplot(sig.likelihood.corr,
aes(x=var.count,y=sig.vars))+
geom_point()+geom_line(aes(color="Corrected"))+
geom_point(data=sig.likelihood)+
geom_line(data=sig.likelihood,aes(color="Non-Corrected"))+
geom_hline(yintercept=0.05,color="green",alpha=0.4,size=2)+
scale_y_sqrt()+
labs(x="Number of Variables in Study",y="Number of Significant \n (P<0.05) Findings")+
theme_bw(20)
So, no harm done: we just add this correction factor, right? Well, what if we have exactly one variable with a shift of 1.0 standard deviations between the groups?
var.range<-round(seq(10,60,length.out=10))
test.cnt<-100
one.diff.sample<-function(grp.id) ifelse(grp.id==2,.10,0) # group 2 shifted by 0.10 = 1.0 standard deviations (sd=0.1)
sim.data.diff<-ldply(var.range,function(n.samples)
test.random.data(test.cnt,n.samples=n.samples,
rand.fun=function(n.cnt) rnorm(n.cnt,mean=1,sd=0.1),
group.off.fun=one.diff.sample))
ggplot(sim.data.diff,aes(x=sample.count,y=p.val))+
geom_point()+
geom_smooth(aes(color=" 1 Variable"))+
geom_hline(yintercept=0.05,color="green",alpha=0.4,size=2)+
labs(x="Number of Samples in Study",y="P-Value for a 10% Difference")+
theme_bw(20)
var.range<-c(1,5,10,20,100) # variable count
sim.data.psig<-ldply(var.range,function(c.vcnt) {
cbind(var.count=c.vcnt,ddply(sim.data.diff,.(sample.count),function(c.sample)
data.frame(prob.sig=nrow(subset(c.sample,p.val<=0.05/c.vcnt))/nrow(c.sample))
))
})
ggplot(sim.data.psig,aes(x=sample.count,y=100*prob.sig))+
geom_line(aes(color=as.factor(var.count)),size=2)+
ylim(0,100)+
labs(x="Number of Samples in Study",y="Probability of Finding\n Significant Variable (%)",color="Variables")+
theme_bw(20)
Basically all of these are ultimately functions which map inputs to outputs.
The input could be
The output is
The most serious problem with machine learning and similar approaches is overfitting your model to your data. Particularly as models get increasingly complex (random forest, neural networks, deep learning, ...), it becomes more and more difficult to apply common sense or even understand exactly what a model is doing and why a given answer is produced.
magic_classifier = {}
# training
magic_classifier['Dog'] = 'Animal'
magic_classifier['Bob'] = 'Person'
magic_classifier['Fish'] = 'Animal'
Now use this classifier; on the training data it works really well:
magic_classifier['Dog'] == 'Animal' # true, 1/1 so far!
magic_classifier['Bob'] == 'Person' # true, 2/2 still perfect!
magic_classifier['Fish'] == 'Animal' # true, 3/3, wow!
On new data it doesn't work at all; it doesn't even execute.
magic_classifier['Octopus'] == 'Animal' # exception?! but it was working so well
magic_classifier['Dan'] == 'Person' # exception?!
The above example appeared to be a perfect classifier for mapping names to animals or people, but it just memorized the inputs and reproduced them at the output. It didn't actually learn anything; it just copied.
This is relevant for each of the categories, but applied in a slightly different way depending on the group. The idea is to divide the dataset into groups called training and validation, or ideally training, validation, and testing. The analysis is then:
Here we return to the iris data set and try to automatically classify flowers
iris %>% sample_n(5) %>% kable(digits=2)
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 11  | 5.4          | 3.7         | 1.5          | 0.2         | setosa     |
| 105 | 6.5          | 3.0         | 5.8          | 2.2         | virginica  |
| 10  | 4.9          | 3.1         | 1.5          | 0.1         | setosa     |
| 101 | 6.3          | 3.3         | 6.0          | 2.5         | virginica  |
| 72  | 6.1          | 2.8         | 4.0          | 1.3         | versicolor |
We first decide on a split, in this case 60%, 30%, 10% for training, validation, and testing and randomly divide up the data.
div.iris<-iris %>%
mutate(
# generate a random number uniformly between 0 and 1
rand_value = runif(nrow(iris)),
# divide the data based on how high this number is into different groups
data_div = ifelse(rand_value<0.6,"Training",
ifelse(rand_value<0.9,"Validation",
"Testing")
)
) %>% select(-rand_value) # we don't need this anymore
div.iris %>% sample_n(4) %>% kable(digits=2)
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    | data_div   |
|-----|--------------|-------------|--------------|-------------|------------|------------|
| 6   | 5.4          | 3.9         | 1.7          | 0.4         | setosa     | Training   |
| 86  | 6.0          | 3.4         | 4.5          | 1.6         | versicolor | Validation |
| 122 | 5.6          | 2.8         | 4.9          | 2.0         | virginica  | Training   |
| 142 | 6.9          | 3.1         | 5.1          | 2.3         | virginica  | Training   |
ggplot(div.iris,aes(Sepal.Length,Sepal.Width))+
geom_point(aes(shape=data_div,color=Species),size=2)+
labs(shape="Type")+
facet_grid(~data_div)+
coord_equal()+
theme_bw(10)
Here we train a decision tree to predict the output (`Species`) as a function of the input, in this case just a combination of the sepal length and width (`~Sepal.Length+Sepal.Width`). From this a tree of simple decision rules is built.
library(rpart)
library(rpart.plot)
training.data <- div.iris %>% subset(data_div == "Training")
dec.tree<-rpart(Species~Sepal.Length+Sepal.Width,data=iris)
A tree can be visualized graphically as a trunk (the top most node) dividing progressively into smaller subnodes
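The graphical view can be produced with the already-loaded `rpart.plot` package; a minimal call would be:

```r
library(rpart.plot)
rpart.plot(dec.tree)  # draws the trunk-and-branches view of the fitted tree
```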
or as a list of rules to apply:

print(dec.tree)
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
## 2) Sepal.Length< 5.45 52 7 setosa (0.86538462 0.11538462 0.01923077)
## 4) Sepal.Width>=2.8 45 1 setosa (0.97777778 0.02222222 0.00000000) *
## 5) Sepal.Width< 2.8 7 2 versicolor (0.14285714 0.71428571 0.14285714) *
## 3) Sepal.Length>=5.45 98 49 virginica (0.05102041 0.44897959 0.50000000)
## 6) Sepal.Length< 6.15 43 15 versicolor (0.11627907 0.65116279 0.23255814)
## 12) Sepal.Width>=3.1 7 2 setosa (0.71428571 0.28571429 0.00000000) *
## 13) Sepal.Width< 3.1 36 10 versicolor (0.00000000 0.72222222 0.27777778) *
## 7) Sepal.Length>=6.15 55 16 virginica (0.00000000 0.29090909 0.70909091) *
Overlaying with the prediction data looks good
match_range<-function(ivec,n_length=10) seq(from=min(ivec),to=max(ivec),length=n_length)
pred.map<-expand.grid(Sepal.Length = match_range(training.data$Sepal.Length),
Sepal.Width = match_range(training.data$Sepal.Width))
pred.map$pred_class<-predict(dec.tree,pred.map,type="class")
training.data %>%
mutate(pred_class=predict(dec.tree,training.data,type="class"),
class_result=ifelse(as.character(pred_class)==as.character(Species),"Correct","Incorrect")
)->training.data
ggplot(pred.map,aes(Sepal.Length,Sepal.Width))+
geom_tile(aes(fill=pred_class),alpha=0.5)+
geom_point(data=training.data,aes(color=Species,size=class_result))+
labs(fill="Predicted",size = "Incorrectly\nLabeled")+
theme_bw(20)
It struggles more with the validation data since it has never seen it before and it’s not quite the same as the training
valid.data<-div.iris %>% subset(data_div == "Validation")
valid.data %>%
mutate(pred_class=predict(dec.tree,valid.data,type="class"),
class_result=ifelse(as.character(pred_class)==as.character(Species),"Correct","Incorrect")
)->valid.data
ggplot(pred.map,aes(Sepal.Length,Sepal.Width))+
geom_tile(aes(fill=pred_class),alpha=0.5)+
geom_point(data=valid.data,aes(color=Species,size=class_result))+
labs(fill="Predicted",size = "Incorrectly\nLabeled")+
theme_bw(20)
The test data (normally we would not look at it at all right now and wait until the very end) looks even worse and an even smaller fraction is correctly matched.
valid.data<-div.iris %>% subset(data_div == "Testing")
valid.data %>%
mutate(pred_class=predict(dec.tree,valid.data,type="class"),
class_result=ifelse(as.character(pred_class)==as.character(Species),"Correct","Incorrect")
)->valid.data
ggplot(pred.map,aes(Sepal.Length,Sepal.Width))+
geom_tile(aes(fill=pred_class),alpha=0.5)+
geom_point(data=valid.data,aes(color=Species,size=class_result))+
labs(fill="Predicted",size = "Incorrectly\nLabeled")+
theme_bw(20)
Taking a list of points (feature vectors) where each has an x1 and a y1 coordinate and a classification (Happy or Sad), we can show the data as a table
spiral.pts <- expand.grid(x = -50:50, y = -50:50) %>%
subset((x==0) | (y==0)) %>%
mutate(
r = sqrt(x^2+y^2),
th = r/60*2*pi,
x1 = cos(th)*x-sin(th)*y,
y1 = sin(th)*x+cos(th)*y,
class = ifelse(x==0,"Happy","Sad")
) %>%
select(x1,y1,class)
kable(spiral.pts %>% sample_n(5),digits=2)
|     | x1     | y1     | class |
|-----|--------|--------|-------|
| 137 | -29.12 | -21.16 | Sad   |
| 18  | -10.20 | 31.38  | Happy |
| 144 | -8.94  | -42.06 | Sad   |
| 88  | -2.70  | -12.72 | Sad   |
| 66  | 30.31  | 17.50  | Sad   |
Or graphically
You can play around with neural networks and this data set at TensorFlow Playground
We first decide on a split, in this case 60%, 30%, 10% for training, validation, and testing and randomly divide up the data.
div.spiral.pts<-spiral.pts %>%
mutate(
# generate a random number uniformly between 0 and 1
rand_value = runif(nrow(spiral.pts)),
# divide the data based on how high this number is into different groups
data_div = ifelse(rand_value<0.6,"Training",
ifelse(rand_value<0.9,"Validation",
"Testing")
)
) %>% select(-rand_value) # we don't need this anymore
div.spiral.pts %>% sample_n(4) %>% kable(digits=2)
|     | x1     | y1     | class | data_div   |
|-----|--------|--------|-------|------------|
| 75  | 23.75  | -10.58 | Sad   | Training   |
| 94  | -5.20  | -4.68  | Sad   | Validation |
| 125 | -19.42 | 14.11  | Sad   | Training   |
| 63  | 25.43  | 28.24  | Sad   | Training   |
ggplot(div.spiral.pts,aes(x1,y1))+
geom_point(aes(shape=data_div,color=class),size=2)+
labs(shape="Type")+
facet_wrap(~data_div)+
coord_equal()+
theme_bw(20)
Here we train a decision tree to predict the output (`class`) as a function of the input, in this case just a combination of x1 and y1 (`~x1+y1`). From this a tree of simple decision rules is built.
library(rpart)
library(rpart.plot)
training.data <- div.spiral.pts %>% subset(data_div == "Training")
dec.tree<-rpart(class~x1+y1,data=training.data)
A tree can be visualized graphically as a trunk (the top most node) dividing progressively into smaller subnodes
or as a list of rules to apply:

print(dec.tree)
## n= 119
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 119 56 Sad (0.47058824 0.52941176)
## 2) x1< -33.01294 8 0 Happy (1.00000000 0.00000000) *
## 3) x1>=-33.01294 111 48 Sad (0.43243243 0.56756757)
## 6) x1>=-18.4582 94 47 Happy (0.50000000 0.50000000)
## 12) y1< 6.003109 66 28 Happy (0.57575758 0.42424242)
## 24) y1>=-5.691917 32 8 Happy (0.75000000 0.25000000) *
## 25) y1< -5.691917 34 14 Sad (0.41176471 0.58823529)
## 50) y1< -18.36846 16 6 Happy (0.62500000 0.37500000) *
## 51) y1>=-18.36846 18 4 Sad (0.22222222 0.77777778) *
## 13) y1>=6.003109 28 9 Sad (0.32142857 0.67857143)
## 26) y1>=24.87892 12 5 Happy (0.58333333 0.41666667) *
## 27) y1< 24.87892 16 2 Sad (0.12500000 0.87500000) *
## 7) x1< -18.4582 17 1 Sad (0.05882353 0.94117647) *
Overlaying with the prediction data looks good
pred.map<-expand.grid(x1 = -50:50, y1 = -50:50)
pred.map$pred_class<-ifelse(predict(dec.tree,pred.map)[,1]>0.5,"Happy","Sad")
training.data$pred_class<-ifelse(predict(dec.tree,training.data)[,1]>0.5,"Happy","Sad")
ggplot(pred.map,aes(x1,y1))+
geom_tile(aes(fill=pred_class),alpha=0.5)+
geom_point(data=training.data,aes(color=class,size=(pred_class!=class)))+
labs(fill="Predicted",size = "Incorrectly\nLabeled")+
theme_bw(20)
It struggles more with the validation data since it has never seen it before and it’s not quite the same as the training
valid.data<-div.spiral.pts %>% subset(data_div == "Validation")
valid.data$pred_class<-ifelse(predict(dec.tree,valid.data)[,1]>0.5,"Happy","Sad")
ggplot(pred.map,aes(x1,y1))+
geom_tile(aes(fill=pred_class),alpha=0.5)+
geom_point(data=valid.data,aes(color=class,size=(pred_class!=class)))+
labs(fill="Predicted",size = "Incorrectly\nLabeled")+
theme_bw(20)
The test data (normally we would not look at it at all right now and wait until the very end) looks even worse and an even smaller fraction is correctly matched.
valid.data<-div.spiral.pts %>% subset(data_div == "Testing")
valid.data$pred_class<-ifelse(predict(dec.tree,valid.data)[,1]>0.5,"Happy","Sad")
ggplot(pred.map,aes(x1,y1))+
geom_tile(aes(fill=pred_class),alpha=0.5)+
geom_point(data=valid.data,aes(color=class,size=(pred_class!=class)))+
labs(fill="Predicted",size = "Incorrectly\nLabeled")+
theme_bw(20)
We can choose to make more complicated trees by changing the formula to something more detailed, like
$$ \textrm{class} = x_1 + y_1 + x_1^2 + y_1^2 + \sin(x_1/5) + \sin(y_1/5) $$
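The chunk that retrains the tree with this richer formula is not shown; a minimal sketch of what it might look like (using `I()` so the squared terms are not read as formula operators):

```r
library(rpart)
# hypothetical retraining on the same training data with the extended formula
dec.tree <- rpart(class ~ x1 + y1 + I(x1^2) + I(y1^2) + sin(x1/5) + sin(y1/5),
                  data = training.data)
```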
pred.map$pred_class<-ifelse(predict(dec.tree,pred.map)[,1]>0.5,"Happy","Sad")
training.data$pred_class<-ifelse(predict(dec.tree,training.data)[,1]>0.5,"Happy","Sad")
ggplot(pred.map,aes(x1,y1))+
geom_tile(aes(fill=pred_class),alpha=0.5)+
geom_point(data=training.data,aes(color=class,size=(pred_class!=class)))+
labs(fill="Predicted",size = "Incorrectly\nLabeled")+
theme_bw(20)
library(igraph)
make.im.proc.chain<-function(root.node="Raw\nImages",filters=c(),filter.parms=c(),
segmentation=c(),segmentation.parms=c(),
analysis=c(),analysis.parms=c()) {
node.names<-c("Raw\nImages",
filter.parms,filters,
segmentation.parms,segmentation,
analysis.parms,analysis
)
c.mat<-matrix(0,length(node.names),length(node.names))
colnames(c.mat)<-node.names
rownames(c.mat)<-node.names
for(cFilt in filters) {
c.mat["Raw\nImages",cFilt]<-1
for(cParm in filter.parms) c.mat[cParm,cFilt]<-1
for(cSeg in segmentation) {
c.mat[cFilt,cSeg]<-1
for(cParm in segmentation.parms) c.mat[cParm,cSeg]<-1
for(cAnal in analysis) {
c.mat[cSeg,cAnal]<-1
for(cParm in analysis.parms) c.mat[cParm,cAnal]<-1
}
}
}
g<-graph.adjacency(c.mat,mode="directed")
V(g)$degree <- degree(g)
V(g)$label <- V(g)$name
V(g)$color <- "lightblue"
V(g)["Raw\nImages"]$color<-"lightgreen"
for(cAnal in analysis) V(g)[cAnal]$color<-"pink"
V(g)$size<-30
for(cParam in c(filter.parms,segmentation.parms,analysis.parms)) {
V(g)[cParam]$color<-"grey"
V(g)[cParam]$size<-25
}
E(g)$width<-2
g
}
g<-make.im.proc.chain(filters=c("Gaussian\nFilter"),
filter.parms=c("3x3\nNeighbors","0.5 Sigma"),
segmentation=c("Threshold"),
segmentation.parms=c("100"),
analysis=c("Shape\nAnalysis","Thickness\nAnalysis")
)
plot(g)
library(igraph)
g<-make.im.proc.chain(filters=c("Gaussian\nFilter","Median\nFilter","Diffusion\nFilter","No\nFilter",
"Laplacian\nFilter"),
segmentation=c("Threshold","Hysteresis\nThreshold","Automated"),
analysis=c("Shape\nAnalysis","Thickness\nAnalysis","Distribution\nAnalysis",
"Skeleton\nAnalysis","2 Point\nCorr","Curvature")
)
plot(g,layout=layout.reingold.tilford)
g<-make.im.proc.chain(filters=c("Gaussian\nFilter","Median\nFilter","Diffusion\nFilter"),
filter.parms=c("3x3\nNeighbors","5x5\nNeighbors","7x7\nNeighbors",
"0.5 Sigma","1.0 Sigma","1.2 Sigma"),
segmentation=c("Threshold","Hysteresis\nThreshold","Automated"),
segmentation.parms=paste(seq(90,110,length.out=3)),
analysis=c("Shape\nAnalysis","Thickness\nAnalysis","Distribution\nAnalysis","Skeleton\nAnalysis","2 Point\nCorr")
)
plot(g,layout=layout.lgl(g,maxiter=10000,root=1))
Given the complexity of the tree, we need to do some pruning
title.fun<-function(file.name) ""
show.pngs.as.grid(Sys.glob("ext-figures/poros*.png"),title.fun,zoom=0.5)
With a quantitative approach, we can calculate the specific shape or distribution metrics on the sample with each parameter and establish the relationship between parameter and metric.
The way we do this is usually a parameter sweep which means taking one (or more) parameters and varying them between the reasonable bounds (judged qualitatively).
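A generic sketch of such a sweep on a synthetic image (all names and values here are made up for illustration, not taken from the course data):

```r
library(plyr)
# synthetic "image": a noisy 100x100 array
set.seed(2017)
test.image <- matrix(rnorm(100 * 100, mean = 100, sd = 20), 100, 100)
# vary the threshold over a qualitatively reasonable range and record a simple metric
sweep.results <- ldply(seq(60, 140, by = 10), function(th) {
  data.frame(thresh = th,
             voxel.count = sum(test.image > th))  # metric: number of segmented voxels
})
head(sweep.results)
```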
source('../common/shapeAnalysisProcess.R')
source('../common/commonReportFunctions.R')
# read and correct the coordinate system
thresh.fun<-function(x) {
pth<-rev(strsplit(x,"/")[[1]])[2]
t<-strsplit(pth,"_")[[1]][3]
as.numeric(substring(t,2,nchar(t)))
}
readfcn<-function(x) cbind(compare.foam.corrected(x,
checkProj=F
#force.scale=0.011 # force voxel size to be 11um
),
thresh=thresh.fun(x) # how to parse the sample names
)
# Where are the csv files located
rootDir<-"../common/data/mcastudy"
clpor.files<-Sys.glob(paste(rootDir,"/a*/lacun_0.csv",sep="/")) # list all of the files
# Read in all of the files
all.lacun<-ldply(clpor.files,readfcn,.parallel=T)
ggplot(all.lacun,aes(y=VOLUME*1e9,x=thresh))+
geom_jitter(alpha=0.1)+geom_smooth()+
theme_bw(24)+labs(y="Volume (um3)",x="Threshold Value",color="Threshold")+ylim(0,1000)
ggplot(subset(all.lacun,thresh %% 1000==0),aes(y=VOLUME*1e9,x=as.factor(thresh)))+
geom_violin()+
theme_bw(24)+labs(y="Volume (um3)",x="Threshold Value",color="Threshold")+ylim(0,1000)
ggplot(all.lacun,aes(y=PCA1_Z,x=thresh))+
geom_jitter(alpha=0.1)+geom_smooth()+
theme_bw(24)+labs(y="Orientation",x="Threshold Value",color="Threshold")
Sensitivity is defined in control system theory as the change in the value of an output against the change in the input.
$$ S = \frac{|\Delta \textrm{Metric}|}{|\Delta \textrm{Parameter}|} $$
Such a strict definition is not particularly useful for image processing, since a threshold has units of intensity and a metric might be volume (with units of m³), so the sensitivity becomes volume per intensity.
A more common approach is to estimate the variation in this parameter between images or within a single image (automatic threshold methods can be useful for this) and define the sensitivity based on this variation. It is also common to normalize it with the mean value so the result is a percentage.
$$ S = \frac{\max(\textrm{Metric}) - \min(\textrm{Metric})}{\textrm{avg}(\textrm{Metric})} $$
In this graph it is the magnitude of the slope: the steeper the slope, the more the metric changes given a small change in the parameter.
poresum<-function(all.data) ddply(all.data,.(thresh),function(c.sample) {
data.frame(Count=nrow(c.sample),
Volume=mean(c.sample$VOLUME*1e9),
Stretch=mean(c.sample$AISO),
Oblateness=mean(c.sample$OBLATENESS),
#Lacuna_Density_mm=1/mean(c.sample$DENSITY_CNT),
Length=mean(c.sample$PROJ_PCA1*1000),
Width=mean(c.sample$PROJ_PCA2*1000),
Height=mean(c.sample$PROJ_PCA3*1000),
Orientation=mean(abs(c.sample$PCA1_Z)))
})
comb.summary<-cbind(poresum(all.lacun),Phase="Lacuna")
splot<-ggplot(comb.summary,aes(x=thresh))
splot+geom_line(aes(y=Count))+geom_point(aes(y=Count))+scale_y_log10()+
theme_bw(24)+labs(y="Object Count",x="Threshold",color="Phase")
Comparing the different variables, we see that the threshold region where the count sensitivity is best (lowest) is where the volume and anisotropy sensitivities are highest.
calc.sens<-function(in.df) {
data.frame(sens.cnt=100*with(in.df,(max(Count)-min(Count))/mean(Count)),
sens.vol=100*with(in.df,(max(Volume)-min(Volume))/mean(Volume)),
sens.stretch=100*with(in.df,(max(Stretch)-min(Stretch))/mean(Stretch))
)
}
sens.summary<-ddply.cutcols(comb.summary,.(cut_interval(thresh,5)),calc.sens)
ggplot(sens.summary,aes(x=thresh))+
geom_line(aes(y=sens.cnt,color="Count"))+
geom_line(aes(y=sens.vol,color="Volume"))+
geom_line(aes(y=sens.stretch,color="Anisotropy"))+
labs(x="Threshold",y="Sensitivity (%)",color="Metric")+
theme_bw(20)
In computer programming, unit testing is a method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine if they are fit for use.
The first requirement for unit testing to work well is to have your tools divided up into small independent parts (functions).
Ideally with realistic but simulated test data
function vxCnt=countVoxs(inImage)
assert(countVoxs(zeros(3,3)) == 0)
assert(countVoxs(zeros(3,3,3)) == 0)
assert(countVoxs(eye(3)) == 3)
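The same idea in R, sketched with the `testthat` package (`countVoxs` here is a stand-in implementation written only for the example):

```r
library(testthat)
# stand-in function under test: count the nonzero voxels in an array
countVoxs <- function(inImage) sum(inImage != 0)

test_that("countVoxs handles simple known inputs", {
  expect_equal(countVoxs(matrix(0, 3, 3)), 0)       # empty 2D image
  expect_equal(countVoxs(array(0, c(3, 3, 3))), 0)  # empty 3D image
  expect_equal(countVoxs(diag(3)), 3)               # identity matrix has 3 nonzero voxels
})
```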
function shapeTable=shapeAnalysis(inImage)
We should decompose the task into sub-components:

function clImage=componentLabel(inImage)
function objInfo=analyzeObject(inObject)
function vxCnt=countVoxs(inObject)
function covMat=calculateCOV(inObject)
function shapeT=calcShapeT(covMat)
function angle=calcOrientation(shapeT)
function aniso=calcAnisotropy(shapeT)
Read more here. The Java-based unit-testing framework (JUnit) can be used before any of the plugins are compiled; additionally, entire workflows can be made to test the objects using special testing nodes.
Test-driven programming is a style or approach to programming where the tests are written before the functional code, like very concrete specifications. It is easy to estimate how much time is left, since you can automatically see how many of the tests have passed. You and your collaborators are clear on the utility of the system.
One of the biggest problems with big sciences is trying to visualize a lot of heterogeneous data.
There are too many graphs which say
# test.data is assumed to contain numeric columns x and y (created in a chunk not shown)
ggplot(test.data,aes(x,y))+
stat_binhex(bins=20)+
geom_smooth(method="lm",aes(color="Fit"))+
coord_equal()+
theme_bw(20)+guides(color=F)
ggplot(test.data,aes(x,y))+
geom_density2d(aes(color="Contour"))+
geom_smooth(method="lm",aes(color="Linear Fit"))+
coord_equal()+
labs(color="Type")+
theme_bw(20)
If we develop a consistent way of expressing graphics (sentences) in terms of elements (words), we can compose and decompose graphics easily.
The most important modern work in graphical grammars is “The Grammar of Graphics” by Wilkinson, Anand, and Grossman (2005). This work built on earlier work by Bertin (1983) and proposed a grammar that can be used to describe and construct a wide range of statistical graphics.
This can be applied in R using the ggplot2 library (H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.)
Normally we think of plots in terms of some sort of data which is fed into a plot command that produces a picture
plot(xdata,ydata,color/shape)
Separate the graph into its component parts
Construct graphics by focusing on each portion independently.
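A minimal sketch of this component-wise construction in ggplot2, using the built-in iris data; each line adds one independent element (data and mappings, geometry, statistics, facets, theme):

```r
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +  # data and aesthetic mappings
  geom_point() +                # geometric layer
  geom_smooth(method = "lm") +  # statistical layer: per-species linear fit
  facet_wrap(~Species) +        # small multiples
  theme_bw(15)                  # presentation theme
```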
They are rarely bearers of good news
Simulations (even simple ones) are very helpful (see StatisticalSignificanceHunter)