This video gives an overview of the ‘sequencing-by-synthesis’ approach used by Illumina. Other companies will have different techniques, but Illumina is probably the most-popular sequencing technology out there. For most of what we will discuss, it won’t really matter how your samples were sequenced.
* Other sequencing technologies are available
Images from:- http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
A library is prepared by breaking the DNA of interest into shorter, single-strand fragments. Adapters are each added to each end of the fragments. A flow-cell has already been prepared to have a “lawn” of sequences that are complementary to the adapters.
Bridge amplification forms clusters of identical fragments on the flow-cell surface. This is required as we’re going to be taking images of the flow-cell and need even copies of each fragment so that we get a decent signal. However, as we’ll see later, this can introduce errors.
The construction of a sequencing reads involves adding a single, terminated, DNA base (each given a distinct flourescent label) one at a time, simulataneously across the whole flow-cell, and taking an image.
So, we try adding an “A” with a red label and take an image
and then a “T” with a green, and take another image
Analysis of the images can determine which base was succesfully incorporated to each fragment.
The process of added each base repeats for the next cycle, and so-on for “N” cycles. e.g. 100 times for a 100-base sequence.
Although this process is high-optimised and refined, the sheer number of reactions being performed means that errors are inevitable. The signal from a particular cluster could be affected by interference from its neighbours
Or sometimes we get a bit over-excited and add too many bases
Therefore the identification of bases comes with some degree of uncertainty which we must capture.
Most of the sequencing we come across in this course is likely to the paired-end, which means that we are sequencing both “ends” of the fragment. This will allow us to achieve better alignment to tricky regions of the genome and detect some kinds of variation that would not be possible with single-end reads alone. It also gives us greater confidence in the alignments and detecting artefacts caused by PCR amplification
We can also add multiple samples into the same sequencing library. The fragments from each sequence can be identified by a barcode (typically a few bases in length) unique to that sample. When the sequence reads are generated, they have the barcode at the beginning. The process of demultiplexing is used to identify to barcodes present for each read and split into separate files for alignment.
Multiplexing is a way of increasing our sample size.
.TIFF
images; not unlike microarray data.TIFF
image ~ 7Mb = 2,240,000 Mb of data (2.24TB)We don’t need any special software to view these, but bear in mind there can be ~ 250 Million reads (sequences) per Hi-Seq lane.
The name of a sequence is unique and can encode some useful information. e.g.
@HWUSI-EAS100R:6:73:941:1973#0/1
However, this depends on instrument setup and processing pipelines. Sometimes the tile and coordinate information is omitted to save space.
As we saw earlier the process of deciding which base is present at each cycle of each fragment comes with some probability (p
) that we make a mistake. The quality score expresses our confidence in a particular base-call; higher quality score, higher confidence
The raw base-calling probabilities are converted to text characters to make it easier to store in a file
N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
First of all, we convert the base-calling probability (p) into a Q
score using the formula
Annoyingly, different sequencing instruments have used different offsets over time. It’s important to check what encoding has been used for your data
Phred+33
This handy graphic from wikipedia compares the different schemes
Given a particular quality string, we have to look-up the ASCII code for each character and subtract the offset to get the Q score. We can then convert to a probability using the formula:-
p=10−Q/10
So for our particular example:
N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
it works out as follows:-
Character Code Minus.Offset..33.. Probability
1 N 78 45 0.00003
2 ? 63 30 0.00100
3 > 62 29 0.00126
4 : 58 25 0.00316
5 < 60 27 0.00200
6 9 57 24 0.00398
7 > 62 29 0.00126
8 > 62 29 0.00126
9 > 62 29 0.00126
10 : 58 25 0.00316
...
...
Character Code Minus.Offset..33.. Probability
58 # 35 2 0.63096
59 # 35 2 0.63096
60 # 35 2 0.63096
61 # 35 2 0.63096
62 # 35 2 0.63096
63 # 35 2 0.63096
64 # 35 2 0.63096
65 # 35 2 0.63096
66 # 35 2 0.63096
67 # 35 2 0.63096
68 # 35 2 0.63096
For the extremely-keen we can do this in R
:-
pq <- "N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############"
code <- as.integer(charToRaw(as.character(pq)))
qs <- code -33
qs
## [1] 45 30 29 25 27 24 29 29 29 25 28 26 29 29 30 27 29 25 31 30 29 26 28
## [24] 28 31 31 31 29 30 28 32 32 32 27 29 28 32 31 30 21 29 19 33 28 27 29
## [47] 29 13 31 29 30 27 31 26 30 2 2 2 2 2 2 2 2 2 2 2 2 2
probs <- 10^(unlist(qs)/-10)
round(probs,5)
## [1] 0.00003 0.00100 0.00126 0.00316 0.00200 0.00398 0.00126 0.00126
## [9] 0.00126 0.00316 0.00158 0.00251 0.00126 0.00126 0.00100 0.00200
## [17] 0.00126 0.00316 0.00079 0.00100 0.00126 0.00251 0.00158 0.00158
## [25] 0.00079 0.00079 0.00079 0.00126 0.00100 0.00158 0.00063 0.00063
## [33] 0.00063 0.00200 0.00126 0.00158 0.00063 0.00079 0.00100 0.00794
## [41] 0.00126 0.01259 0.00050 0.00158 0.00200 0.00126 0.00126 0.05012
## [49] 0.00079 0.00126 0.00100 0.00200 0.00079 0.00251 0.00100 0.63096
## [57] 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096
## [65] 0.63096 0.63096 0.63096 0.63096
Now it’s time to get our hands on some .fastq
data and start to explore it.