Kevin Mader
LifeScienceForumBasel, 18 June 2015
Look for potentially cancerous nodules in the following lung image, taken from NPR
Which center square seems brighter?
Are the intensities constant in the image?
Genetic studies require hundreds to thousands of samples
For this study, the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.
What happens if you have softer metrics?
As the amount of data increases
But how do radiologists / experts scale?
The time breakdown of science into its component tasks has changed over the last years, and we expect it to change further by 2020.
The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.
If you looked at one 1000 x 1000 sized image every second, it would take you 139 hours to browse through a terabyte of data.
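The 139-hour figure can be checked with a quick back-of-the-envelope calculation; the pixel depth is not stated on the slide, so the 16-bit (2-byte) assumption below is mine:

```python
# Back-of-the-envelope check of the 139-hour figure.
# Assumptions (not stated on the slide): 16-bit (2-byte) pixels,
# 1 TB = 10**12 bytes, one image viewed per second.
bytes_per_image = 1000 * 1000 * 2          # 1000 x 1000 px, 2 bytes each
images_per_tb = 10**12 // bytes_per_image  # 500,000 images per terabyte
hours = images_per_tb / 3600               # one image per second
print(round(hours))                        # -> 139
```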
| Year | Time to 1 TB | Man power to keep up | Salary costs / month |
|---|---|---|---|
| 2000 | 4096 min | 2 people | 25 kCHF |
| 2008 | 1092 min | 8 people | 95 kCHF |
| 2014 | 32 min | 260 people | 3255 kCHF |
| 2016 | 2 min | 3906 people | 48828 kCHF |
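The "man power" column follows from requiring that viewing keep pace with acquisition; the acquisition times are from the table above, but the head-count formula itself is my reading of the slide, not a stated model:

```python
# Rough model behind the "man power" column: the number of viewers needed
# so that viewing (about 139 hours per TB) keeps pace with acquisition.
# The head-count formula is an assumption on my part.
viewing_min_per_tb = 139 * 60  # ~8340 minutes of viewing per terabyte

for year, acq_min in [(2000, 4096), (2008, 1092), (2014, 32)]:
    people = viewing_min_per_tb / acq_min
    print(year, round(people))  # roughly matches the table's 2 / 8 / 260
```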
Big Data is a very popular term today. There is a ton of hype, and money seems to be thrown at any project that includes these words. First, to clear up the biggest misconception: big data isn't about how much data you have, it is about how you work with it.
Before we can think about Big Data, we need to define what it is replacing: Small Data.
The run-of-the-mill ("08/15") approach, using standard tools and lots of clicking
1.5 man-years
The most important job for any piece of analysis is to be correct.
Almost all image processing tasks require a number of people to evaluate and implement them, and they are almost always moving targets
The last of the major priorities is speed, which covers scalability, raw performance, and development time.
The two frameworks provide a free, out-of-the-box solution for
These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
I want to
We have developed a number of SIL commands for handling standard image processing tasks
Fully extensible with
Count the number of highly anisotropic nuclei in myeloma patients
\downarrow \textrm{Translate to SQL}
SELECT COUNT(*) FROM
  (SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide)) FROM patients
   WHERE disease LIKE 'myeloma')
WHERE anisotropy > 0.75
\downarrow \textrm{Load Myeloma Data Subset}
\downarrow \textrm{Perform analysis on every image}
\downarrow \textrm{Filter out the most anisotropic cells}
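The three arrows above can be sketched as a plain functional pipeline. The SIL internals are not public, so `load_slides`, `label_nuclei`, and `shape_analysis` below are hypothetical stand-ins with toy data, not the real API:

```python
# Minimal sketch of the query plan above with hypothetical stand-ins.

def load_slides(disease):
    # toy metadata store: (patient, disease) pairs
    slides = [("p1", "myeloma"), ("p2", "lymphoma"), ("p3", "myeloma")]
    return [p for p, d in slides if d == disease]

def label_nuclei(slide):
    # stand-in segmentation: a few nuclei per slide with toy anisotropy
    return [(slide, a) for a in (0.2, 0.8, 0.9)]

def shape_analysis(nucleus):
    return nucleus  # already (slide, anisotropy) in this toy example

# Load Myeloma data subset -> perform analysis on every image
nuclei = [shape_analysis(n)
          for s in load_slides("myeloma")
          for n in label_nuclei(s)]

# Filter out the most anisotropic cells, then COUNT(*)
count = sum(1 for _, aniso in nuclei if aniso > 0.75)
print(count)  # -> 4 (2 slides x 2 nuclei above the 0.75 threshold)
```

In Spark terms, the two comprehensions correspond to a flatMap over the slide collection and the count to a filter followed by an aggregate, which is exactly the mapping the SQL layer hides.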
The query is processed by the SQL parsing engine
Once processed into Spark commands, the query can be executed on multiple machines, using multiple data storage environments in the background
For the same price as a single radiologist, you get (each blue dot is 20 computers):
For many traditional scientific studies, the sample count is fairly low (10-100). For such low counts, we can look at the costs (estimated from our experience with cortical bone microstructure and rheology)
As studies get bigger (>100 samples), we see that these costs begin to radically shift.
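The shift can be captured by a toy amortisation model. All numbers below are illustrative assumptions, not the study's actual costs: a one-off analysis-development cost is spread over the sample count, while the per-sample measurement cost stays flat:

```python
# Toy cost model for the fixed-vs-marginal shift described above.
# setup_chf and per_sample_chf are illustrative assumptions.
def cost_per_sample(n_samples, setup_chf=50_000, per_sample_chf=100):
    # one-off development cost amortised over n samples, plus flat
    # per-sample measurement cost
    return setup_chf / n_samples + per_sample_chf

for n in (10, 100, 1000):
    print(n, round(cost_per_sample(n)))  # 5100 -> 600 -> 150 CHF/sample
```

At small n the fixed development cost dominates; past a few hundred samples, per-sample cost flattens toward the marginal measurement cost, which is why the economics change radically for larger studies.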
A spinoff - 4Quant: From images to insight
We are interested in partnerships and collaborations