Kevin Mader
Swiss Data Science, 12 June 2015
The figure highlights the exponential growth in both acquisition speed and data volume over the last 16 years, based on the latest-generation detector at the TOMCAT Beamline.
It assumes a manual analysis rate of 1 Mpx/second.
The breakdown of science into its component tasks has changed over the last years, and we expect it to change further by 2020.
The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.
If you looked at one 1000 × 1000 image every second, it would take you 139 hours to browse through a terabyte of data.
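One plausible reading of that number, assuming 16-bit pixels (so roughly 2 MB per image):

\[ \frac{10^{12}\ \textrm{bytes}}{2\ \textrm{bytes/px}\cdot 10^{6}\ \textrm{px/s}} = 5\times 10^{5}\ \textrm{s} \approx 139\ \textrm{hours} \]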
Year | Time to 1 TB | Manpower to keep up | Salary Costs / Month |
---|---|---|---|
2000 | 4096 min | 2 people | 25 kCHF |
2008 | 1092 min | 8 people | 95 kCHF |
2014 | 32 min | 260 people | 3255 kCHF |
2016 | 2 min | 3906 people | 48828 kCHF |
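A rough reconstruction of how the table scales (the slide values may use slightly different rounding): the manpower is the 139-hour browsing time divided by the time to acquire 1 TB, and the salary column assumes roughly 12.5 kCHF per person per month.

\[ \textrm{People} \approx \frac{139\ \textrm{hours}}{\textrm{Time to 1 TB}}, \qquad \textrm{Salary} \approx 12.5\ \textrm{kCHF}\times\textrm{People} \]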
\[ \textrm{Transistors} \propto 2^{T/(\textrm{18 months})} \]
Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
There are now many more transistors inside a single computer, but single-core processing speed has barely increased. How can this be?
The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation, with utilization expressed as the average number of hours the computer is used each week.
The second figure shows the cost of a cloud-based solution as a percentage of the cost of buying a single machine (values below 1 are labeled with their percentage); the panels distinguish the average time to replacement for the machines, in months.
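As a toy illustration of this comparison (the prices, node count, utilization, and replacement period below are all assumptions, not values from the figure):

// Illustrative sketch only: compares renting cloud nodes at peak demand
// against amortizing a locally purchased workstation.
object CloudVsLocal {
  def cloudCostPerMonth(peakNodes: Int, hoursUsedPerWeek: Double,
                        pricePerNodeHour: Double = 0.5): Double =
    peakNodes * hoursUsedPerWeek * 4.33 * pricePerNodeHour   // ~4.33 weeks per month

  def localCostPerMonth(purchasePrice: Double = 5000.0,
                        replacementMonths: Int = 36): Double =
    purchasePrice / replacementMonths

  def main(args: Array[String]): Unit = {
    val ratio = cloudCostPerMonth(peakNodes = 1, hoursUsedPerWeek = 10) /
                localCostPerMonth()
    println(f"cloud / local cost ratio: $ratio%.2f")
  }
}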
Big Data is a very popular term today. There is a ton of hype, and money seemingly gets thrown at any project that includes these words. First, to get over the biggest misconception: Big Data isn't about how much data you have, it is about how you work with it.
Before we can think about Big Data we need to define what it's replacing: Small Data.
The run-of-the-mill ("0815") approach: standard tools and lots of clicking
High-throughput \( \rightarrow \) easy measurement
analysis \( \rightarrow \) complex, multistep, and time-consuming
Genetic studies require hundreds to thousands of samples; in this case the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.
1.5 man-years
The most important job for any piece of analysis is to be correct.
Almost all image-processing tasks require a number of people to evaluate and implement them, and they are almost always moving targets.
The last of the major priorities is speed, which covers scalability, raw performance, and development time.
When a ton of heterogeneous data is coming in fast.
Performant, scalable, and flexible
10X, 100X, 1000X is the same amount of effort
The director of the AMPLab said their rate-limiting factor is always having enough interesting data.
What took an entire PhD 3-4 years ago can now be measured in a weekend, or even in several seconds. Analysis tools have not kept up; they are difficult to customize and usually highly specific.
Data structures that were fast and efficient for computers with 640 kB of memory no longer make sense.
CPUs are not getting much faster, but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware.
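A minimal illustration of the point (array size is arbitrary): the same loop run serially is bound by one core, while Scala's parallel collections spread it over all available cores.

// single-threaded iteration vs. the same work spread over all cores
val data = Array.fill(10000000)(math.random)   // ~80 MB of doubles

val serialSum   = data.sum        // bound by one core's memory bandwidth
val parallelSum = data.par.sum    // parallel collections use every core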
The two frameworks provide a free, out-of-the-box solution for:
These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
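For reference, this is what that vocabulary looks like on ordinary data (a plain word count; the file path is illustrative) — natural for lines and words, much less natural for pixels and structures in an image:

val lines  = sc.textFile("data/log.txt")   // any plain text file
val counts = lines.
  flatMap(_.split("\\s+")).                // explode lines into words
  filter(_.nonEmpty).                      // drop empty tokens
  map(word => (word, 1)).
  reduceByKey(_ + _)                       // groupBy-style aggregation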
I want to
We have developed a number of commands for the Spark Imaging Layer (SIL) to handle standard image-processing tasks
Fully extensible with
// read every ufilt.tif in the f2-study bucket into a cached image RDD (Spark Image Layer API)
val imageTable =
  sc.readImage[Float]("s3n://bone-images/f2-study/*/ufilt.tif").cache
CREATE TABLE FilteredImages AS
SELECT boneId,GaussianFilter(image) FROM ImageTable
CREATE TABLE ThresholdImages AS
SELECT boneId,ApplyThreshold(image,OTSU) FROM FilteredImages
CREATE TABLE LabelImages AS
SELECT boneId,ComponentLabel(image) FROM ThresholdImages
CREATE TABLE CellAnalysis AS
SELECT boneId,AnalyzeShape(image) FROM LabelImages GROUP BY boneId
WRITE TABLE CellAnalysis "s3n://bones/cells.csv"
// load the per-cell shape results (spark-csv) and register them for SQL
val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
cells.registerTempTable("cells")
val avgVol = sqlContext.sql("SELECT SAMPLE, AVG(VOLUME) FROM cells GROUP BY SAMPLE")
// keep samples with a large average volume and join them back to the bone images
avgVol.rdd.filter(_.getDouble(1) > 1000).map(sampleToPath).join(bones)
Task | Single Core Time | Spark Time (40 cores) |
---|---|---|
Load and Preprocess | 360 minutes | 10 minutes |
Single Column Average | 4.6s | 400ms |
1 K-means Iteration | 2 minutes | 1s |
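How the K-means row was run is not shown; a minimal MLlib call on the cell feature vectors would look roughly like this (cellRows, the column choice, and the cluster count are assumptions):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// cellRows: assumed RDD of (x, y, volume) Double tuples parsed from cells.csv
val features = cellRows.map { case (x, y, vol) => Vectors.dense(x, y, vol) }.cache()

// 5 clusters, a single iteration as in the timing row above
val model = KMeans.train(features, 5, 1)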
For many traditional scientific studies the sample count is fairly low (10-100). At such counts we can estimate the costs from our experience with cortical bone microstructure and rheology.
As studies get bigger (>100 samples), these costs begin to shift radically.
Find all strongly reflective large objects within 1 km of the Paul Scherrer Institut, Villigen, CH
\[ \downarrow \textrm{Translate to SQL} \]
SELECT contour FROM (
SELECT COMPONENT_LABEL(THRESHOLD(tile,200)) FROM esriTiles
WHERE DIST(LAT,47.53000992328762,LONG,8.215198516845703)<1
) WHERE area>200
We can then visualize these contours easily
or apply them back to the original map
\[ \downarrow \textrm{Translate to SQL} \]
SELECT COUNT(*) FROM
(SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide)) FROM patients
WHERE disease LIKE "myeloma")
WHERE anisotropy > 0.75
\[ \downarrow \textrm{Translate to SQL} \]
\[ \downarrow \textrm{Load Myeloma Data Subset} \]
\[ \downarrow \textrm{Perform analysis on every image} \]
\[ \downarrow \textrm{Filter out the most anisotropic cells} \]
The query is processed by the SQL parsing engine
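One possible wiring (not the actual 4Quant implementation): the image functions are registered as user-defined functions so the SQL parser can resolve them by name; labelNuclei and shapeAnalysis here are placeholders.

// expose image-processing routines to the SQL engine as UDFs
sqlContext.udf.register("LABEL_NUCLEI",   (img: Array[Byte]) => labelNuclei(img))
sqlContext.udf.register("SHAPE_ANALYSIS", (lbl: Array[Byte]) => shapeAnalysis(lbl))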
In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.
We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally these analyses can be performed on streaming data.
// attach a webcam as a streaming image source (Spark Image Layer API)
val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val imgList = ssc.receiverStream(wr)

// run an ImageJ median filter on every incoming frame
val filtImgs = imgList.mapValues(_.run("Median...", "radius=3"))

// estimate a background image from a previously collected batch of frames (inImages)
val totImgs = inImages.count()
val bgImage = inImages.reduce(_ add _).multiply(1.0 / totImgs)

// subtract the background from each frame and keep its mean intensity as a score
val eventImages = filtImgs.transform { frames =>
  frames.map { case (inTime, inImage) =>
    val corImage = inImage.subtract(bgImage)
    (corImage.getImageStatistics().mean, (inTime, corImage))
  }
}

// frames whose mean deviates strongly from the background are outliers
eventImages.filter(iv => Math.abs(iv._1) > 20).
  foreachRDD(showResultsStr("outlier", _))
A spinoff - 4Quant: From images to insight
We are interested in partnerships and collaborations
A pairing between spatial information (position) and some other kind of information (value). \[ \vec{x} \rightarrow \vec{f} \]
We are used to seeing images in a grid format where the position indicates the row and column in the grid and the intensity (absorption, reflection, tip deflection, etc) is shown as a different color
The alternative form for this image is as a list of positions and a corresponding value
\[ \hat{I} = (\vec{x},\vec{f}) \]
x | y | Intensity |
---|---|---|
1 | 1 | 30 |
2 | 1 | 90 |
3 | 1 | 13 |
4 | 1 | 5 |
5 | 1 | 64 |
1 | 2 | 86 |
This representation can be called a feature vector; in this case it contains only Intensity.
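A minimal sketch of how a small image might be flattened into such ((x, y), intensity) records and distributed (only the first row of values follows the table above; the second is made up):

// a tiny 2-row image as a 2D array of intensities
val img: Array[Array[Double]] = Array(
  Array(30.0, 90.0, 13.0, 5.0, 64.0),
  Array(86.0, 45.0, 22.0, 17.0, 9.0)
)

// flatten into 1-indexed ((x, y), intensity) records
val pixelRecords = for {
  (row, y)   <- img.zipWithIndex
  (value, x) <- row.zipWithIndex
} yield ((x + 1, y + 1), value)

val pixelRDD = sc.parallelize(pixelRecords)   // distribute as an RDD of feature records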
If we use feature vectors to describe our image, we no longer need to worry about how the images will be displayed, and can focus on the segmentation/thresholding problem from a classification standpoint rather than an image-processing one.
So we have an image of a cell and we want to identify the membrane (the ring) from the nucleus (the point in the middle).
A simple threshold doesn't work because we identify the point in the middle as well. We could try to use morphological tricks to get rid of the point in the middle, or we could better tune our segmentation to the ring structure.
In this case we add a very simple feature to the image, the distance from the center of the image (distance).
x | y | Intensity | Distance |
---|---|---|---|
-10 | -10 | 0.9350683 | 14.14214 |
-10 | -9 | 0.7957197 | 13.45362 |
-10 | -8 | 0.6045178 | 12.80625 |
-10 | -7 | 0.3876575 | 12.20656 |
-10 | -6 | 0.1692429 | 11.66190 |
-10 | -5 | 0.0315481 | 11.18034 |
We now have a more complicated image which we cannot visualize as easily, but we can incorporate both pieces of information together.
Now, instead of trying to find an intensity threshold for the ring alone, we can combine intensity and distance to identify it
\[ \textrm{iff}\; (5<\textrm{Distance}<10 \;\;\&\;\; 0.5<\textrm{Intensity}<1.0) \]
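On the feature records this rule is just a filter; a sketch assuming a featureRDD of (x, y, intensity, distance) tuples like the table above:

// keep only the pixels whose features satisfy the ring rule
val ringPixels = featureRDD.filter { case (x, y, intensity, distance) =>
  distance > 5 && distance < 10 &&
  intensity > 0.5 && intensity < 1.0
}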
The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image.
x | y | Intensity | Sobel | Gaussian |
---|---|---|---|---|
1 | 1 | 0.94 | 0.91 | 0.90 |
1 | 10 | 0.48 | -0.83 | 0.51 |
1 | 11 | 0.50 | -0.82 | 0.53 |
1 | 12 | 0.48 | -0.83 | 0.51 |
1 | 13 | 0.43 | -0.86 | 0.46 |
1 | 14 | 0.33 | -0.90 | 0.37 |
The distributions of the features appear very different and can thus likely be used for identifying different parts of the images.
Combine this with our a priori information (called supervised analysis)
Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. It is also much easier to combine with training information.
x | y | Absorb | Scatter | Training |
---|---|---|---|---|
700 | 4 | 0.3706262 | 0.9683849 | 0.0100140 |
704 | 4 | 0.3694059 | 0.9648784 | 0.0100140 |
692 | 8 | 0.3706371 | 0.9047878 | 0.0183156 |
696 | 8 | 0.3712537 | 0.9341989 | 0.0334994 |
700 | 8 | 0.3666887 | 0.9826912 | 0.0453049 |
704 | 8 | 0.3686623 | 0.8728824 | 0.0453049 |
Want to predict Training
from x,y, Absorb, and Scatter
\( \rightarrow \) MLLib: Logistic Regression, Random Forest, K-Nearest Neighbors, …
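A hedged sketch of that prediction step with MLlib's logistic regression (binarizing the continuous Training column, and the 0.03 cut-off, are my assumptions, not part of the talk):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// rows: an RDD of (x, y, absorb, scatter, training) Double tuples from the feature table
val labeled = rows.map { case (x, y, absorb, scatter, training) =>
  LabeledPoint(if (training > 0.03) 1.0 else 0.0,   // illustrative cut-off
               Vectors.dense(x, y, absorb, scatter))
}.cache()

val model = new LogisticRegressionWithLBFGS().run(labeled)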
Apache Spark is a brilliant platform, and with GraphX, MLlib, and other packages the possibilities are nearly unlimited.
Combining many different components inside the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows.
// spread each voxel's value evenly over a (2*windSize+1)^2 neighborhood
def spread_voxels(pvec: ((Int, Int), Double), windSize: Int = 1) = {
  val wind = -windSize to windSize
  val pos = pvec._1
  val scalevalue = pvec._2 / (wind.length * wind.length)
  for (x <- wind; y <- wind)
    yield ((pos._1 + x, pos._2 + y), scalevalue)
}

// apply the spreading as a flatMap, keep only voxels inside the ROI, and sum overlaps
val filtImg = roiImg.
  flatMap(cvec => spread_voxels(cvec)).
  filter(roiFun).
  reduceByKey(_ + _)
Here we use a KNIME-based workflow and our Spark Imaging Layer extensions to create a workflow without any Scala or programming knowledge, with an easily visible flow from one block to the next, and without the performance overhead of using other tools.
\[ \textrm{Images}: \textrm{RDD}[((x,y,z),Img[Double])] =\\ \left[(\vec{x},\textrm{Img}),\cdots\right] \]
// compare every tile with every other tile and cross-correlate them
val dispField = Images.
  cartesian(Images).map {
    case ((xA, imgA), (xB, imgB)) =>
      xcorr(imgA, imgB, in = xB - xA)
  }
The tile positions can then be refined from the updated information provided by the cross-correlations, applying appropriate smoothing criteria if necessary.
The stitching itself, rather than rewriting the original data can be done in a lazy fashion as certain regions of the image are read.
// build the requested view lazily: take the tiles that overlap it and average them
def getView(tPos: Pos, tSize: Size) =
  stImgs.
    filter { case (x, img) => abs(x - tPos) < img.size }.
    map { case (x, img) =>
      val oImg = new Image(tSize)   // empty output tile
      oImg.copy(img, x, tPos)       // copy in the overlapping region
    }.addImages(AVG)
This also ensures the original data is left unaltered and all analysis is reversible.
getView(Pos(26.5,13),Size(2,2))
The attending physician or general practitioner is the integrator, responsible for combining and interpreting a number of complex, often contradictory inputs.
The centralized storage and analysis of images provides an alternative, backup pathway so that the risk of something being overlooked is lower.
Eventually this could lead to the GP automatically getting second opinions from Deep Learning tools or IBM's Watson, and even allow the patient to directly view and interpret some of the results.
Significant energy is spent in many research groups on new algorithms (>1000 new publications in the top 3 journals every year) \( \rightarrow \) hundreds to thousands of algorithms for every standard problem.