Workshop on Large Scale Tomography, 26 January 2016
Kevin Mader, CTO 4Quant / Lecturer ETH Zuerich
(Image panels: Mammography, Satellite, Brain Tumors, Throat Cancer)
Recent improvements allow you to:
[1] Mokso et al., J. Phys. D, 46(49),2013
Courtesy of M. Pistone at U. Bristol
If you looked at one 1000 x 1000 sized image every second, it would take you
139 hours
to browse through a terabyte of data.
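The arithmetic behind that figure, assuming 16-bit pixels (so each 1000 x 1000 image is 2 MB):
\[ \frac{1\,\textrm{TB}}{2\,\textrm{MB/image}} = 5\times10^{5}\ \textrm{images} \approx 139\ \textrm{hours at one image per second} \]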
| Year | Time to 1 TB | Manpower to keep up | Salary Costs / Month |
|---|---|---|---|
| 2000 | 4096 min | 2 people | 25 kCHF |
| 2008 | 1092 min | 8 people | 95 kCHF |
| 2014 | 32 min | 260 people | 3255 kCHF |
| 2016 | 2 min | 3906 people | 48828 kCHF |
The first challenge with these data is to start stitching them together.
Algorithmically it is very simple:
\( \textrm{Overlap}_{A \rightarrow B} = \textrm{argmax}_{\vec{x}}\left[\mathcal{FT}^{-1} \left\{\mathcal{FT}\{A\}\,\mathcal{FT}\{B\}^{*}\right\} \right] \)
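As a sketch in Scala, assuming hypothetical fft2 / ifft2 helpers and a Complex value type (illustration-only names, not any specific library's API):

def findOverlap(a: Array[Array[Double]], b: Array[Array[Double]]): (Int, Int) = {
  // FT{A} and the complex conjugate of FT{B} (fft2/ifft2/Complex are assumed helpers)
  val fa = fft2(a)
  val fb = fft2(b).map(_.map(_.conjugate))
  // pointwise product FT{A} * FT{B}^*
  val cross = fa.zip(fb).map { case (ra, rb) =>
    ra.zip(rb).map { case (x, y) => x * y }
  }
  // inverse transform gives the correlation surface; its argmax is the shift
  val corr = ifft2(cross)
  val peaks = for (i <- corr.indices; j <- corr(i).indices) yield ((i, j), corr(i)(j))
  peaks.maxBy(_._2)._1
}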
We need to stitch each scan with all of its neighbors
We have a few more parameters (\( \theta \)) and we know the initial positions, so we optimize over a limited range rather than the entire image.
Now that our task is clear we just execute \( \rightarrow \)
Too Strict

- Mat: CV_8UC1: 1D array of unsigned bytes
- Img<RealType<T> & NativeType<T>>: recursive generic ND arrays with cursor access
- Image<TPixel, VImageDimension>: templated C++

Too Loose

- imread \( \rightarrow \) could be 2D, maybe 3D, sometimes even 4D
- numpy \( \rightarrow \) problems only surface at runtime

Parallel computing requires a significant amount of coordination between computers for tasks that are not easily parallelizable.
The second major issue is mutability: if you have two cores / computers trying to write the same information at the same time, the result is no longer deterministic (not good).
The simple act of taking turns and waiting for every independent process to take its turn can completely negate the benefits of parallel computing
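A toy illustration of how taking turns erases the speedup, using nothing beyond the Scala standard library:

// eight "parallel" workers increment one shared counter;
// because every increment must take the lock, they effectively run serially
val lock = new Object
var total = 0L
(1 to 8).par.foreach { _ =>
  for (_ <- 1 to 1000000)
    lock.synchronized { total += 1 }
}
println(total) // correct, but hardly faster than a single thread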
Distributed computing inherits all of the problems of parallel programming along with a whole variety of new issues.
If you have 1000 computers working on solving a problem and one fails, you do not want your whole job to crash
How can you access and process data from many different computers quickly without very expensive infrastructure?
Driver: the machine being used by the user, responsible for creating the jobs that need to run.
Scheduler: distributes tasks across multiple machines and coordinates communication between them.
Workers: the nodes where the computation is actually done.
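A minimal sketch of that division of labor in Spark 1.x terms (the standard API of the era; the cluster manager and workers are configured outside this code):

import org.apache.spark.{SparkConf, SparkContext}

object MinimalJob {
  def main(args: Array[String]): Unit = {
    // runs on the driver: builds the context and the task graph
    val conf = new SparkConf().setAppName("MinimalJob")
    val sc = new SparkContext(conf)
    // the scheduler splits this into tasks; the workers execute them
    val total = sc.parallelize(1 to 1000000)
      .map(x => x.toLong * x)
      .reduce(_ + _)
    println(total)
    sc.stop()
  }
}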
tell machine 1 to load images 0 to 2
tell machine 2 to load images 3 to 5
tell the machines to share the images to calculate the overlap
\[ \textrm{Transistors} \propto 2^{T/(\textrm{18 months})} \]
Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?
Elastic, On-Demand
Directly coordinating tasks on a computer.
What took an entire PhD 3-4 years ago can now be measured in a weekend, or even several seconds. Analysis tools have not kept up: they are difficult to customize and usually highly specific.
Data structures that were fast and efficient for computers with 640 KB of memory do not make sense anymore
CPUs are not getting that much faster, but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware.
The most important job for any piece of analysis is to be correct.
Almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets
The last of the major priorities is speed, which covers scalability, raw performance, and development time.
You can analyze streaming tweets during World Cup matches to see goals before you see them on television
You can process a petabyte of clicks to optimize ad-campaigns
Powerful tools
Distributed computation
Interactive computation
Hadoop is the open-source version of MapReduce, developed at Yahoo and released as an Apache project. It provides the underlying infrastructure and filesystem that handles storing and distributing data, so that each machine stores some of the data locally and processing jobs run where the data is stored.
“The Ultimate Scala Collections” - Martin Odersky (EPFL / Creator of Scala)
Developed by the Algorithms, Machines, and People Lab at UC Berkeley in 2012
General tool for all Directed Acyclic Graph (DAG) workflows
Coarse-grained processing \( \rightarrow \) simple operations applied to entire sets
In-memory caching
Zaharia, M., et. al (2012). Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
// allocate the output; Java initializes ints to 0
int[] newImage = new int[inputImage.length];
for (int i = 0; i < inputImage.length; i++) {
  if (inputImage[i] > 0) {
    newImage[i] = inputImage[i];
  } // values <= 0 stay at the default of 0
}
val newImage = inputImage.map(v => if(v>0) v else 0)
It is being performed on a cluster where the computation is not only divided between all available nodes but is also locality-aware, meaning the computation moves to where the data is.
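As a sketch (assuming a running SparkContext named sc), the same one-liner from above only needs the data distributed first; Spark then schedules the map where the pixels already live:

val inputRDD = sc.parallelize(inputImage) // distribute the pixels across the cluster
val newImage = inputRDD.map(v => if (v > 0) v else 0) // identical transformation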
These frameworks are really cool and Apache Spark has a lot of functionality, but I don't want to map-reduce a collection.
I want to
We have developed a number of commands for IQAE that handle standard image processing tasks
Fully extensible with
CREATE TABLE FilteredBones AS
SELECT sampleId,GaussianFilter(image) FROM BoneImages
CREATE TABLE ThresholdImages AS
SELECT boneId,ApplyThreshold(image,OTSU) FROM FilteredBones
CREATE TABLE CellImages AS
SELECT boneId,ComponentLabel(image) FROM ThresholdImages
CREATE TABLE CellAnalysis AS
SELECT boneId,AnalyzeShape(CellImages) FROM CellImages GROUP BY boneId
WRITE TABLE CellAnalysis "s3n://bones/cells.csv"
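One plausible way such commands could be exposed is Spark SQL's standard UDF registration; the sketch below is an assumption for illustration (gaussianFilter is a hypothetical helper), not the actual IQAE wiring:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// make an image-processing function callable from SQL queries
sqlContext.udf.register("GaussianFilter",
  (image: Array[Double]) => gaussianFilter(image, 0.5)) // hypothetical helper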
The query is processed by the SQL parsing engine
First we start our cluster:
spark-ec2.py -s 50 launch 4quant-image-cluster
Load all of the samples (56TB of data)
loadImage(
"s3:/../brain_*_*_*/rec.tif"
)
\[ \downarrow \]
A Resilient Distributed Dataset (RDD)
\[ \textrm{Images}: \textrm{RDD}[((x,y,z),\textrm{Img}[\textrm{Double}])] = \left[(\vec{x},\textrm{Img}),\cdots\right] \]
val filteredImages = Images.gaussianFilter(1, 0.5)
val volFraction = Images.threshold(0.5).
  map { keyImg =>
    (sum(keyImg.img), keyImg.size)
  }.reduce((a, b) => (a._1 + b._1, a._2 + b._2)) // element-wise tuple sum
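From the reduced pair, the volume fraction itself is one more line (a sketch based on the tuple above):

val (aboveThreshold, totalVoxels) = volFraction
val fraction = aboveThreshold / totalVoxels.toDouble // voxels above threshold / total voxels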
We have all of the filtered images and we want to stitch them together in a smart way.
val pairImages = Images.
  cartesian(Images).
  filter { case (im1, im2) => dist(im1.pos, im2.pos) < 1 }
The cross correlation can then be executed on each pair of images from the new RDD (pairImages) by using the map command
val displacementField = pairImages.
  map { case ((posA, imgA), (posB, imgB)) =>
    xcorr(imgA, imgB,
      in = posB - posA)
  }
From the updated information provided by the cross correlations, a smoothed displacement field is computed by applying appropriate smoothing criteria across windows.
val smoothField = displacementField.
  window(3, 3, 3).
  map(gaussianSmoothFunction)
This also ensures the original data is left unaltered and all analysis is reversible.
The final stitching can then be done by
alignImages.
  filter { case (x, img) => abs(x - tPos) < img.size }.
  map { case (x, img) =>
    new Image(tSize).
      copy(img, x, tPos)
  }.combineImages()
Developing in Scala brings additional flexibility through types[1]. In microscopy the standard formats are 2-, 3-, and even 4- or more dimensional arrays or matrices, which can be iterated through quickly using CPU and GPU code. While such arrays are still possible in Scala, there is a great deal more flexibility in data types, allowing anything to be stored as an image and then processed, as long as the basic functions make sense.
[1] Fighting Bit Rot with Types (Experience Report: Scala Collections), M Odersky, FSTTCS 2009, December 2009
A collection of positions and values, maybe more (not an array of double). Arrays are efficient for storing in computer memory, but often a poor way of expressing scientific ideas and analyses.

- combine information from nearby pixels
- determine groups of pixels which are very similar to the desired result
trait BasicMathSupport[T] extends Serializable {
def plus(a: T, b: T): T
def times(a: T, b: T): T
def scale(a: T, b: Double): T
def negate(a: T): T = scale(a,-1)
def invert(a: T): T
def abs(a: T): T
def minus(a: T, b: T): T = plus(a, negate(b))
def divide(a: T, b: T): T = times(a, invert(b))
def compare(a: T, b: T): Int
}
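As a minimal sketch, the instance for plain Double pixels shows how little is needed (only the abstract methods; negate, minus, and divide come for free):

implicit val DoubleBMS = new BasicMathSupport[Double] {
  def plus(a: Double, b: Double) = a + b
  def times(a: Double, b: Double) = a * b
  def scale(a: Double, b: Double) = a * b
  def invert(a: Double) = 1.0 / a
  def abs(a: Double) = math.abs(a)
  def compare(a: Double, b: Double) = java.lang.Double.compare(a, b)
}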
Simple filter implementation
def SimpleFilter[T](inImage: Image[T])
  (implicit wst: BasicMathSupport[T]) = {
  val width: Double = 1
  // weight each value by a Gaussian falling off with distance from the center
  val kernel = (pos: D3int, value: T) =>
    wst.scale(value, math.exp(-math.pow(pos.mag / width, 2)))
  // merge weighted values pairwise by averaging
  val kernelReduce = (ptA: T, ptB: T) => wst.scale(wst.plus(ptA, ptB), 0.5)
  runFilter(inImage, kernel, kernelReduce)
}
Spectra (or 2D diffraction patterns) are supported as types as well
implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
  def plus(a: Array[Double], b: Array[Double]) =
    a.zip(b).map { case (x, y) => x + y }
  ...
  def scale(a: Array[Double], b: Double) =
    a.map(_ * b)
}
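With that instance in scope, the generic SimpleFilter above runs unchanged on spectral images; a usage sketch (loadSpectralImage is a hypothetical loader):

val spectralImage: Image[Array[Double]] = loadSpectralImage("s3://data/spectra")
val smoothed = SimpleFilter(spectralImage) // picks up SpectraBMS implicitly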
The bottleneck is the filesystem connection: many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl.
One of the central tenets of MapReduce™ is data-centric computation \( \rightarrow \) instead of moving data to computation, move the computation to the data.
ETH Spinoff - 4Quant: From images to insight
New detectors certainly provide a substantial amount of new data, but what about old data?
Look at this new technique from a famous journal; let's try it on all of our old data.
It seems to give the same results
Well what about this technique?
val allData = loadPSIImages("s3://psi-data/femur-bones/*.tif") ++
loadAPSImages("s3://aps-data/bone-images/*.hdf5") ++
loadESRFImages("s3://esrf-data/femur-bones/*.tif")
allData.registerImageMetaTable("BoneImages")
\[ \downarrow \textrm{Translate to SQL} \]
SELECT COUNT(*) FROM
(SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide)) FROM patients
WHERE disease LIKE "myeloma")
WHERE anisotropy > 0.75
\[ \downarrow \textrm{Load Myeloma Data Subset} \]
\[ \downarrow \textrm{Perform analysis on every image} \]
\[ \downarrow \textrm{Filter out the most anisotropic cells} \]
Find all strongly reflective large objects within 1 km of the Paul Scherrer Institute, Villigen, CH
\[ \downarrow \textrm{Translate to SQL} \]
SELECT contour FROM (
SELECT COMPONENT_LABEL(THRESHOLD(tile,200)) FROM esriTiles
WHERE DIST(LAT,47.53000992328762,LONG,8.215198516845703)<1
) WHERE area>200
We can then visualize these contours easily
or apply them back to the original map
While our framework is commercial, we build on top and integrate into open-source tools so that the research is reproducible by anyone, anywhere, using any machine (it just might take a bit longer)
We also try to publish as much code, educational material, and samples as possible on our GitHub account.

We are interested in partnerships and collaborations