We will tell you about the ‘best practice’ tools that we use in our daily work as Bioinformaticians
You will (probably) not come away being an expert
We cannot teach you everything about NGS data
plus, it is a fast-moving field
RNA-seq and ChIP-seq only
much of the initial processing is the same for other assays
However, we hope that you will
Understand how your data are processed
Increase confidence with R and Bioconductor
Be able to explore new technologies, methods, tools as they come out
Further disclaimer
“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” R.A. Fisher, 1938
If you haven’t designed your experiment properly, then all the Bioinformatics we teach you won’t help: Consult with your local statistician - preferably not the day before your grant is due!!!!
Experimental Design; despite this fancy new technology, if we don’t design the experiments properly we won’t get meaningful conclusions
Quality assessment; Yes, NGS experiments can still go wrong!
Normalisation; NGS data come with their own set of biases and error that need to be accounted for
Stats; testing for RNA-seq is built upon the knowledge from microarrays
Plenty of tools and workflows have been established.
Don’t forget about arrays; the data are all out there somewhere waiting to be discovered and explored
Reproducibility is key
Two Biostatisticians (later termed ‘Forensic Bioinformaticians’) from M.D. Anderson used R extensively during their re-analysis and investigation of a Clinical Prognostication paper from Duke. The subsequent scandal put Reproducible Research on the map.
Keith Baggerly’s talk from Cambridge in 2010 is highly recommended.
‘…mappings are frozen, as a Dorian Gray-like syndrome: the apparent eternal youth of the mapping does not reflect that somewhere the ’picture of it’ decays’
Sequencing data are ‘future proof’
if a new genome version comes along, just re-align the data!
can grab published data from public repositories and re-align to your own choice of genome / transcripts and aligner
Limited number of novel findings from microarrays
can’t find what you’re not looking for!
Genome coverage
some areas of genome are problematic to design probes for
Maturity of analysis techniques
on the other hand, analysis methods and workflows for microarrays are well-established
until recently…
The cost of sequencing
Reports of the death of microarrays. Greatly exaggerated?
Sequencing produces high-resolution TIFF images; not unlike microarray data
100 tiles per lane, 8 lanes per flow cell, 100 cycles
4 images (A, G, C, T) per tile per cycle = 320,000 images
Each TIFF image ~ 7 MB = 2,240,000 MB of data (2.24 TB)
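The arithmetic above can be checked with a short script. This is only a back-of-envelope sketch in Python: the variable and function names are made up for illustration, the 7 MB per image is the approximate figure quoted above, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
# Back-of-envelope estimate of raw image data for one flow cell.
# All figures come from the bullets above; names are illustrative only.
TILES_PER_LANE = 100
LANES_PER_FLOW_CELL = 8
CYCLES = 100
IMAGES_PER_TILE_PER_CYCLE = 4  # one image per base: A, G, C, T
MB_PER_IMAGE = 7               # approximate size of one TIFF image


def raw_image_estimate(tiles=TILES_PER_LANE,
                       lanes=LANES_PER_FLOW_CELL,
                       cycles=CYCLES,
                       channels=IMAGES_PER_TILE_PER_CYCLE,
                       mb_per_image=MB_PER_IMAGE):
    """Return (number of images, total size in MB) for one run."""
    n_images = tiles * lanes * cycles * channels
    total_mb = n_images * mb_per_image
    return n_images, total_mb


n_images, total_mb = raw_image_estimate()
total_tb = total_mb / 1_000_000  # assumes decimal units: 1 TB = 1,000,000 MB

print(f"{n_images:,} images, {total_mb:,} MB (~{total_tb:.2f} TB)")
# prints: 320,000 images, 2,240,000 MB (~2.24 TB)
```

Changing any one argument (for example, a longer read with more cycles) shows how quickly the raw image volume grows, which is why the images are processed into intensities and base calls rather than kept.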
Image processing
Firecrest
“Uses the raw TIF files to locate clusters on the image, and outputs the cluster intensity, X,Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling.”
The R programming language is now recognised beyond the academic community as an effective solution for data analysis and visualisation. Notable users of R include Facebook, Google, Microsoft (who recently invested in a commercial provider of R), and the New York Times.
Key features
Open-source
Cross-platform
Access to existing visualisation / statistical tools
Flexibility
Visualisation and interactivity
Add-ons for many fields of research
Facilitating Reproducible Research
Support for R
Online forums such as Stack Overflow regularly feature R