Illumina

Single-channel microarrays

50 base-pair ‘probes’ with the same sequence are attached to a bead
Beads that have the same probe sequence are known as a bead-type
Beads are identified by Illumina using a decoding sequence

Randomly-arranged arrays

Arrays of randomly-arranged beads are constructed
The number of replicates of each bead-type follows a normal distribution
Around 20 replicates on most-recent arrays

abead

Different chip types

illumina-chips

‘Refseq’ chips with 24,000 bead-types per array, 8 arrays per chip
WG- chips with 48,000 bead-types per array, 6 arrays per chip
HT-12 chips with 48,000 bead-types per array, 12 arrays per chip
Each chip also has an annotation ‘version’ (v1, v2, v3, v4)

Example Arrangement

Observations for Gene X on Illumina and older array technologies

BeadPositions

Example Consider a technical failure on one of the arrays

Workflow

The analyst may have a choice over what data to take as a starting point

Raw images
Bead-level data - vary number of observations for a bead-type on each array and pre-computed intensities
Bead-summary - One value for each bead-type on each array. i.e. an ‘Expression Matrix’
Starting with raw, or bead-level data gives more flexiblity in analysis
Bead-summary data fit in with standard Bioconductor tools

(adapted from Ritchie et al)

The recommmended approach is to work with the so-called bead-level data as it allows the entire analysis workflow to be performed in R and Bioconductor, with all the benefits of reproducibility and transparency that gives. However, in order to produce data of this type, several modifications need to be made to the manufacturer’s scanning software. Consequently, data of this type are uncommon.

If you have bead-level data, then for each hybridisation you will have;

A TIFF image representation of the array surface

nicearray

A text description of the bead locations, unique to each array. This is required

beadleveltext

Workflow from bead-level data

The scanner software produces a simple tab-delimited file as the arrays are being scanner. This turns-out to be a particularly good early-warning of problems. Specifically it records the 5th and 95th percentile of intensities which can be used as a measure of the signal-to-noise ratio. At some point, Illumina stated that ratio of these two numbers should be at least 10.

ht12metrics <- read.table(system.file("extdata/Chips/Metrics.txt" ,
  package = "BeadArrayUseCases") , sep = "\t" , header = TRUE ,as.is = TRUE )
ht12metrics

##                     Date     Matrix Section RegGrn FocusGrn SatGrn P95Grn
## 1   3/13/2009 6:45:04 PM 4613710017       B   0.13     0.70      0    704
## 2   3/24/2009 5:30:53 PM 4613710052       B   0.14     0.64      0    763
## 3   3/17/2009 5:58:00 PM 4613710054       B   0.10     0.67      0    815
## 4   3/31/2009 4:04:09 PM 4616443079       B   0.10     0.64      0    604
## 5   3/26/2009 1:29:30 PM 4616443093       B   0.12     0.66      0    716
## 6         04/07/09 18:28 4616443115       B   0.10     0.64      0    757
## 7   3/31/2009 6:38:16 PM 4616443081       B   0.12     0.51      0     41
## 8   3/31/2009 7:13:54 PM 4616443081       H   0.94     0.66      0    939
## 9  3/26/2009 12:10:56 PM 4616443092       B   0.12     0.66      0    562
## 10        04/01/09 16:34 4616443107       A   0.12     0.67      0    559
## 11        04/01/09 19:08 4616443136       A   0.15     0.58      0    612
## 12        04/01/09 04:50 4616494005       A   0.13     0.59      0    678
##    P05Grn RegRed FocusRed SatRed P95Red P05Red
## 1      36      0        0      0      0      0
## 2      37      0        0      0      0      0
## 3      36      0        0      0      0      0
## 4      34      0        0      0      0      0
## 5      34      0        0      0      0      0
## 6      37      0        0      0      0      0
## 7      29      0        0      0      0      0
## 8      55      0        0      0      0      0
## 9      34      0        0      0      0      0
## 10     37      0        0      0      0      0
## 11     44      0        0      0      0      0
## 12     38      0        0      0      0      0

ht12snr <- ht12metrics$P95Grn / ht12metrics$P05Grn
plot (1:12 , ht12snr , pch = 19 , ylab = " P95 / P05 " , xlab = " " ,main = " Signal - to - noise ratio for HT12 data ")
axis (2)

The beadarray Bioconductor package can be used to read Illumina bead-level data. It assumes the files are arranged in the same folder structure that is created by the scanner software; one folder for each chip, named according to the name of the chip.

Each chip type has a different set of bead-types, so it is important to tell beadarray what annotation to use. In this case we choose illuminaHumanv3.

library(beadarray)

chipPath <- system.file("extdata/Chips",package="BeadArrayUseCases")
list.files(chipPath)

##  [1] "4613710017"      "4613710052"      "4613710054"     
##  [4] "4616443079"      "4616443081"      "4616443092"     
##  [7] "4616443093"      "4616443107"      "4616443115"     
## [10] "4616443136"      "4616494005"      "Metrics.txt"    
## [13] "sampleSheet.csv"

sampleSheetFile <- paste0(chipPath,"/sampleSheet.csv")
data <- readIllumina(dir=chipPath,sampleSheet=sampleSheetFile,illuminaAnnotation="Humanv3")

## Sample Sheet /home/dunnin01/R/x86_64-pc-linux-gnu-library/3.2/BeadArrayUseCases/extdata/Chips/sampleSheet.csv will be used to read the data
## Processing section 4613710017_B
## Processing section 4613710052_B
## Processing section 4613710054_B
## Processing section 4616443079_B
## Processing section 4616443093_B
## Processing section 4616443115_B
## Processing section 4616443081_B
## Processing section 4616443081_H
## Processing section 4616443092_B
## Processing section 4616443107_A
## Processing section 4616443136_A
## Processing section 4616494005_A

Just like with Affy data, we can visualise the raw images.

imageplot(data, array=6,high="darkgreen",low="lightgreen",zlim=c(4,10))

imageplot(data, array=8,high="darkgreen",low="lightgreen",zlim=c(4,10))

We could use BASH which we don’t cover today

Illumina put a number (~700) so-called negative-controls on each array. The probe sequences used are not supposed to hybridise to any location of the genome so therefore should just be measuring “background”. Additionally, several “housekeeping” genes are used which should be highly-expressed on any tissue-type. We can exploit the values of these controls for QA purposes; the housekeeping controls should always be high intensity and the negative controls should have low signal. There should be a large difference between the two extremes.

combinedControlPlot(data,array=1)
combinedControlPlot(data,array=7)
combinedControlPlot(data,array=8)

Summarisation

Now we can summarise the data into a more-usable form. Recall that there are ~30 measurements for each bead-type and we would prefer to work with a single, reliable, value for each bead-type (gene).

Take a log$_2$ transformation of all the recorded observations.
Remove any “outliers” ( > 3 median-absolute-deviations from the median)
Take the mean and standard deviation of the remaining beads

This procedure is taken care of by the summarize function.

## Annotating control probes using package illuminaHumanv3.db Version:1.26.0

eset.ill

## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 12 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
##   varLabels: Sample_Name Sample_Group Sentrix_ID Sentrix_Position
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
##     total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A

The object should seem familiar from the previous section on Affymetrix

head(exprs(eset.ill))

##              4613710017_B 4613710052_B 4613710054_B 4616443079_B
## ILMN_1802380     8.454468     8.616796     8.523001     8.420796
## ILMN_1893287     5.388161     5.419345     5.162849     5.133287
## ILMN_1736104     5.268626     5.457679     5.012766     4.988511
## ILMN_1792389     6.767519     7.183788     6.947624     7.168571
## ILMN_1854015     5.556947     5.721614     5.595413     5.520391
## ILMN_1904757     5.421553     5.320500     5.522316     5.375212
##              4616443093_B 4616443115_B 4616443081_B 4616443081_H
## ILMN_1802380     8.527748     8.733281     4.970241    10.347973
## ILMN_1893287     5.221987     5.529493     5.113827     7.117936
## ILMN_1736104     5.284026     5.168590     5.281035     6.480966
## ILMN_1792389     7.386435     7.524649     5.032331     8.604813
## ILMN_1854015     5.558717     5.747483     5.101193     7.839441
## ILMN_1904757     5.638935     5.519093     5.018379     6.841015
##              4616443092_B 4616443107_A 4616443136_A 4616494005_A
## ILMN_1802380     9.693461     9.670220     9.743319     9.903972
## ILMN_1893287     5.471102     5.530818     5.816294     5.355552
## ILMN_1736104     5.143453     5.609556     5.603493     5.347251
## ILMN_1792389     8.052951     8.090507     8.106430     8.395739
## ILMN_1854015     5.524689     6.048046     6.138562     5.710613
## ILMN_1904757     5.070717     5.399798     5.622445     5.460538

head(pData(eset.ill))

##              Sample_Name Sample_Group Sentrix_ID Sentrix_Position
## 4613710017_B      UHRR-1         UHRR 4613710017                B
## 4613710052_B      UHRR-2         UHRR 4613710052                B
## 4613710054_B      UHRR-3         UHRR 4613710054                B
## 4616443079_B      UHRR-4         UHRR 4616443079                B
## 4616443093_B      UHRR-5         UHRR 4616443093                B
## 4616443115_B      UHRR-6         UHRR 4616443115                B

Examine the boxplot. Do the arrays seem normalised?

boxplot(exprs(eset.ill),outline=FALSE)

We can normalise using normaliseIllumina

eset.norm <- normaliseIllumina(eset.ill[,c(1:6,9:12)])

Alternative data import

A more-common form for Illumina data to be distributed is the output of Illumina’s GenomeStudio software. Data in this form have already been summarised. i.e. we get one measurement for each bead-type for each array. This format is naturally more-compatible with tools in Bioconductor. However, as we have already mentioned we lose some of the benefits of being able to control the whole analysis in R.

The limma package is one of the most highly-cited pacakges in Bioconductor, and as we will see later has some powerful tools for differential expression analysis. A workflow to import and normalise data that have been exported from GenomeStudio is described in Section 17.3 of the limma users guide.

Data for this section can be downloaded from the WEHI and un-zipped in the data directory.

library(limma)

## 
## Attaching package: 'limma'
## 
## The following object is masked from 'package:beadarray':
## 
##     imageplot
## 
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA

 x <- read.ilmn(files="data/probe profile.txt",ctrlfiles="data/control probe profile.txt",
 other.columns="Detection")

## Reading file data/probe profile.txt ... ...
## Reading file data/control probe profile.txt ... ...

## An object of class "EListRaw"
## $source
## [1] "illumina"
## 
## $E
##                     1        2        3        4        5        6
## ILMN_1762337 52.34406 46.10429 54.01238 47.65873 54.75389 47.43921
## ILMN_2055271 69.91481 73.86729 58.64280 72.36581 77.10600 82.08336
## ILMN_1736007 57.47208 53.70050 53.39117 49.43674 58.56845 59.88695
## ILMN_2383229 53.60817 57.50813 48.22960 48.18757 61.81213 64.53053
## ILMN_1806310 58.11687 55.14536 50.52731 59.97917 64.15617 58.37450
##                     7        8        9       10       11       12
## ILMN_1762337 67.36201 47.92186 40.48154 44.69235 80.57645 42.53519
## ILMN_2055271 69.12629 81.29189 79.11217 82.50546 87.01139 60.91446
## ILMN_1736007 56.42299 51.62467 58.67375 51.74950 58.42390 43.89006
## ILMN_2383229 52.73515 43.47368 65.48235 49.83155 53.89276 39.26819
## ILMN_1806310 58.04995 52.31484 56.60397 55.61397 65.28136 46.38885
## 49582 more rows ...
## 
## $genes
##              SYMBOL  Status TargetID
## ILMN_1762337    7A5 regular     <NA>
## ILMN_2055271   A1BG regular     <NA>
## ILMN_1736007   A1BG regular     <NA>
## ILMN_2383229   A1CF regular     <NA>
## ILMN_1806310   A1CF regular     <NA>
## 49582 more rows ...
## 
## $other
## $Detection
##                       1         2          3           4          5
## ILMN_1762337 0.55849580 0.6754875 0.13698630 0.601388900 0.57762240
## ILMN_2055271 0.03064067 0.0000000 0.04931507 0.002777778 0.03636364
## ILMN_1736007 0.27715880 0.2924791 0.15342470 0.486111100 0.41118880
## ILMN_2383229 0.47353760 0.1866295 0.36575340 0.563888900 0.29510490
## ILMN_1806310 0.26183840 0.2479109 0.25890410 0.127777800 0.21958040
##                      6          7         8          9        10        11
## ILMN_1762337 0.7816412 0.05027933 0.4781382 0.90819210 0.7144847 0.0000000
## ILMN_2055271 0.0000000 0.03910615 0.0000000 0.00000000 0.0000000 0.0000000
## ILMN_1736007 0.2197497 0.23184360 0.3145275 0.15536720 0.3774373 0.2539007
## ILMN_2383229 0.1237830 0.34078210 0.7447109 0.05367232 0.4679666 0.3985816
## ILMN_1806310 0.2642559 0.19553070 0.2919605 0.19632770 0.2381616 0.1219858
##                     12
## ILMN_1762337 0.4599448
## ILMN_2055271 0.0000000
## ILMN_1736007 0.3604972
## ILMN_2383229 0.7472376
## ILMN_1806310 0.2030387
## 49582 more rows ...

 head(x$E)

##                     1        2        3        4        5        6
## ILMN_1762337 52.34406 46.10429 54.01238 47.65873 54.75389 47.43921
## ILMN_2055271 69.91481 73.86729 58.64280 72.36581 77.10600 82.08336
## ILMN_1736007 57.47208 53.70050 53.39117 49.43674 58.56845 59.88695
## ILMN_2383229 53.60817 57.50813 48.22960 48.18757 61.81213 64.53053
## ILMN_1806310 58.11687 55.14536 50.52731 59.97917 64.15617 58.37450
## ILMN_1779670 64.49660 61.22243 52.84443 61.93639 67.88351 59.74094
##                     7        8        9       10       11       12
## ILMN_1762337 67.36201 47.92186 40.48154 44.69235 80.57645 42.53519
## ILMN_2055271 69.12629 81.29189 79.11217 82.50546 87.01139 60.91446
## ILMN_1736007 56.42299 51.62467 58.67375 51.74950 58.42390 43.89006
## ILMN_2383229 52.73515 43.47368 65.48235 49.83155 53.89276 39.26819
## ILMN_1806310 58.04995 52.31484 56.60397 55.61397 65.28136 46.38885
## ILMN_1779670 68.17852 63.10263 65.07272 65.68924 69.60009 52.01776

One downside of limma, is that the data are represented in a slightly different format to the rest of Bioconductor. So the exprs and pData functions cannot be used. Instead to retrieve the expression measurements we have to use $E.

boxplot(log2(x$E),range=0,ylab="log2 intensity")

A popular choice for the normalisation of Illumina data is neqc. This incorporates a background correction step, log$_2$ transformation and quantile normalisation. The background estimation uses the intensities from the negative controls. After the data have been transformed, these controls are then removed from the data.

y <- neqc(x)
head(y$E)

##                     1        2        3        4        5        6
## ILMN_1762337 4.754040 4.731519 4.980562 4.741192 4.755547 4.704071
## ILMN_2055271 5.162217 5.298597 5.139491 5.284588 5.142003 5.383228
## ILMN_1736007 4.849849 4.852764 4.960072 4.769297 4.808160 4.895093
## ILMN_2383229 4.775144 4.923796 4.809635 4.749622 4.855720 4.983020
## ILMN_1806310 4.863640 4.877851 4.871417 4.975431 4.893602 4.867401
## ILMN_1779670 5.017084 5.002160 4.943029 5.019756 4.958687 4.892008
##                     7        8        9       10       11       12
## ILMN_1762337 5.090111 4.779881 4.669165 4.712218 5.226774 4.788520
## ILMN_2055271 5.123613 5.426473 5.410965 5.413067 5.342115 5.568949
## ILMN_1736007 4.891535 4.838546 4.946151 4.818017 4.859202 4.826098
## ILMN_2383229 4.831107 4.710633 5.078065 4.788162 4.797814 4.716899
## ILMN_1806310 4.919383 4.849762 4.911665 4.880178 4.957913 4.907443
## ILMN_1779670 5.105954 5.039321 5.068816 5.058381 5.029809 5.121597

boxplot(y$E,range=0,ylab="log2 intensity")

Another use for the negative controls is to estimate the proportion of genes that are expressed in each sample. There should not be a significant difference in these proportions between samples.

pe <- propexpr(x)
pe

##         1         2         3         4         5         6         7 
## 0.5568000 0.5176149 0.5489187 0.5548044 0.5038857 0.5141279 0.4950729 
##         8         9        10        11        12 
## 0.5175220 0.5285855 0.5137117 0.5351397 0.5167289

barplot(pe)

Illumina

Mark Dunning; mark ‘dot’ dunning ‘at’ cruk.cam.ac.uk, Oscar Rueda; oscar ‘dot’ rueda ‘at’ cruk.cam.ac.uk

Last modified: 11 Feb 2016

Illumina

Single-channel microarrays

Randomly-arranged arrays

Different chip types

Example Arrangement

Workflow

Workflow from bead-level data

Summarisation

Alternative data import