Observations for Gene X on Illumina and older array technologies
Example Consider a technical failure on one of the arrays
The analyst may have a choice over what data to take as a starting point
(adapted from Ritchie et al)
The recommmended approach is to work with the so-called bead-level data as it allows the entire analysis workflow to be performed in R and Bioconductor, with all the benefits of reproducibility and transparency that gives. However, in order to produce data of this type, several modifications need to be made to the manufacturer’s scanning software. Consequently, data of this type are uncommon.
If you have bead-level data, then for each hybridisation you will have;
The scanner software produces a simple tab-delimited file as the arrays are being scanner. This turns-out to be a particularly good early-warning of problems. Specifically it records the 5th and 95th percentile of intensities which can be used as a measure of the signal-to-noise ratio. At some point, Illumina stated that ratio of these two numbers should be at least 10.
ht12metrics <- read.table(system.file("extdata/Chips/Metrics.txt" ,
package = "BeadArrayUseCases") , sep = "\t" , header = TRUE ,as.is = TRUE )
ht12metrics
## Date Matrix Section RegGrn FocusGrn SatGrn P95Grn
## 1 3/13/2009 6:45:04 PM 4613710017 B 0.13 0.70 0 704
## 2 3/24/2009 5:30:53 PM 4613710052 B 0.14 0.64 0 763
## 3 3/17/2009 5:58:00 PM 4613710054 B 0.10 0.67 0 815
## 4 3/31/2009 4:04:09 PM 4616443079 B 0.10 0.64 0 604
## 5 3/26/2009 1:29:30 PM 4616443093 B 0.12 0.66 0 716
## 6 04/07/09 18:28 4616443115 B 0.10 0.64 0 757
## 7 3/31/2009 6:38:16 PM 4616443081 B 0.12 0.51 0 41
## 8 3/31/2009 7:13:54 PM 4616443081 H 0.94 0.66 0 939
## 9 3/26/2009 12:10:56 PM 4616443092 B 0.12 0.66 0 562
## 10 04/01/09 16:34 4616443107 A 0.12 0.67 0 559
## 11 04/01/09 19:08 4616443136 A 0.15 0.58 0 612
## 12 04/01/09 04:50 4616494005 A 0.13 0.59 0 678
## P05Grn RegRed FocusRed SatRed P95Red P05Red
## 1 36 0 0 0 0 0
## 2 37 0 0 0 0 0
## 3 36 0 0 0 0 0
## 4 34 0 0 0 0 0
## 5 34 0 0 0 0 0
## 6 37 0 0 0 0 0
## 7 29 0 0 0 0 0
## 8 55 0 0 0 0 0
## 9 34 0 0 0 0 0
## 10 37 0 0 0 0 0
## 11 44 0 0 0 0 0
## 12 38 0 0 0 0 0
ht12snr <- ht12metrics$P95Grn / ht12metrics$P05Grn
plot (1:12 , ht12snr , pch = 19 , ylab = " P95 / P05 " , xlab = " " ,main = " Signal - to - noise ratio for HT12 data ")
axis (2)
The beadarray
Bioconductor package can be used to read Illumina bead-level data. It assumes the files are arranged in the same folder structure that is created by the scanner software; one folder for each chip, named according to the name of the chip.
Each chip type has a different set of bead-types, so it is important to tell beadarray
what annotation to use. In this case we choose illuminaHumanv3
.
library(beadarray)
chipPath <- system.file("extdata/Chips",package="BeadArrayUseCases")
list.files(chipPath)
## [1] "4613710017" "4613710052" "4613710054"
## [4] "4616443079" "4616443081" "4616443092"
## [7] "4616443093" "4616443107" "4616443115"
## [10] "4616443136" "4616494005" "Metrics.txt"
## [13] "sampleSheet.csv"
sampleSheetFile <- paste0(chipPath,"/sampleSheet.csv")
data <- readIllumina(dir=chipPath,sampleSheet=sampleSheetFile,illuminaAnnotation="Humanv3")
## Sample Sheet /home/dunnin01/R/x86_64-pc-linux-gnu-library/3.2/BeadArrayUseCases/extdata/Chips/sampleSheet.csv will be used to read the data
## Processing section 4613710017_B
## Processing section 4613710052_B
## Processing section 4613710054_B
## Processing section 4616443079_B
## Processing section 4616443093_B
## Processing section 4616443115_B
## Processing section 4616443081_B
## Processing section 4616443081_H
## Processing section 4616443092_B
## Processing section 4616443107_A
## Processing section 4616443136_A
## Processing section 4616494005_A
Just like with Affy data, we can visualise the raw images.
imageplot(data, array=6,high="darkgreen",low="lightgreen",zlim=c(4,10))
imageplot(data, array=8,high="darkgreen",low="lightgreen",zlim=c(4,10))
We could use BASH which we don’t cover today
Illumina put a number (~700) so-called negative-controls on each array. The probe sequences used are not supposed to hybridise to any location of the genome so therefore should just be measuring “background”. Additionally, several “housekeeping” genes are used which should be highly-expressed on any tissue-type. We can exploit the values of these controls for QA purposes; the housekeeping controls should always be high intensity and the negative controls should have low signal. There should be a large difference between the two extremes.
combinedControlPlot(data,array=1)
combinedControlPlot(data,array=7)
combinedControlPlot(data,array=8)
Now we can summarise the data into a more-usable form. Recall that there are ~30 measurements for each bead-type and we would prefer to work with a single, reliable, value for each bead-type (gene).
This procedure is taken care of by the summarize
function.
## Annotating control probes using package illuminaHumanv3.db Version:1.26.0
eset.ill
## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 12 samples
## element names: exprs, se.exprs, nObservations
## protocolData: none
## phenoData
## rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
## varLabels: Sample_Name Sample_Group Sentrix_ID Sentrix_Position
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
## total)
## fvarLabels: ArrayAddressID IlluminaID Status
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3
## QC Information
## Available Slots:
## QC Items: Date, Matrix, ..., SampleGroup, numBeads
## sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A
The object should seem familiar from the previous section on Affymetrix
head(exprs(eset.ill))
## 4613710017_B 4613710052_B 4613710054_B 4616443079_B
## ILMN_1802380 8.454468 8.616796 8.523001 8.420796
## ILMN_1893287 5.388161 5.419345 5.162849 5.133287
## ILMN_1736104 5.268626 5.457679 5.012766 4.988511
## ILMN_1792389 6.767519 7.183788 6.947624 7.168571
## ILMN_1854015 5.556947 5.721614 5.595413 5.520391
## ILMN_1904757 5.421553 5.320500 5.522316 5.375212
## 4616443093_B 4616443115_B 4616443081_B 4616443081_H
## ILMN_1802380 8.527748 8.733281 4.970241 10.347973
## ILMN_1893287 5.221987 5.529493 5.113827 7.117936
## ILMN_1736104 5.284026 5.168590 5.281035 6.480966
## ILMN_1792389 7.386435 7.524649 5.032331 8.604813
## ILMN_1854015 5.558717 5.747483 5.101193 7.839441
## ILMN_1904757 5.638935 5.519093 5.018379 6.841015
## 4616443092_B 4616443107_A 4616443136_A 4616494005_A
## ILMN_1802380 9.693461 9.670220 9.743319 9.903972
## ILMN_1893287 5.471102 5.530818 5.816294 5.355552
## ILMN_1736104 5.143453 5.609556 5.603493 5.347251
## ILMN_1792389 8.052951 8.090507 8.106430 8.395739
## ILMN_1854015 5.524689 6.048046 6.138562 5.710613
## ILMN_1904757 5.070717 5.399798 5.622445 5.460538
head(pData(eset.ill))
## Sample_Name Sample_Group Sentrix_ID Sentrix_Position
## 4613710017_B UHRR-1 UHRR 4613710017 B
## 4613710052_B UHRR-2 UHRR 4613710052 B
## 4613710054_B UHRR-3 UHRR 4613710054 B
## 4616443079_B UHRR-4 UHRR 4616443079 B
## 4616443093_B UHRR-5 UHRR 4616443093 B
## 4616443115_B UHRR-6 UHRR 4616443115 B
Examine the boxplot. Do the arrays seem normalised?
boxplot(exprs(eset.ill),outline=FALSE)
We can normalise using normaliseIllumina
eset.norm <- normaliseIllumina(eset.ill[,c(1:6,9:12)])
A more-common form for Illumina data to be distributed is the output of Illumina’s GenomeStudio software. Data in this form have already been summarised. i.e. we get one measurement for each bead-type for each array. This format is naturally more-compatible with tools in Bioconductor. However, as we have already mentioned we lose some of the benefits of being able to control the whole analysis in R.
The limma package is one of the most highly-cited pacakges in Bioconductor, and as we will see later has some powerful tools for differential expression analysis. A workflow to import and normalise data that have been exported from GenomeStudio is described in Section 17.3 of the limma users guide.
Data for this section can be downloaded from the WEHI and un-zipped in the data
directory.
library(limma)
##
## Attaching package: 'limma'
##
## The following object is masked from 'package:beadarray':
##
## imageplot
##
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
x <- read.ilmn(files="data/probe profile.txt",ctrlfiles="data/control probe profile.txt",
other.columns="Detection")
## Reading file data/probe profile.txt ... ...
## Reading file data/control probe profile.txt ... ...
x
## An object of class "EListRaw"
## $source
## [1] "illumina"
##
## $E
## 1 2 3 4 5 6
## ILMN_1762337 52.34406 46.10429 54.01238 47.65873 54.75389 47.43921
## ILMN_2055271 69.91481 73.86729 58.64280 72.36581 77.10600 82.08336
## ILMN_1736007 57.47208 53.70050 53.39117 49.43674 58.56845 59.88695
## ILMN_2383229 53.60817 57.50813 48.22960 48.18757 61.81213 64.53053
## ILMN_1806310 58.11687 55.14536 50.52731 59.97917 64.15617 58.37450
## 7 8 9 10 11 12
## ILMN_1762337 67.36201 47.92186 40.48154 44.69235 80.57645 42.53519
## ILMN_2055271 69.12629 81.29189 79.11217 82.50546 87.01139 60.91446
## ILMN_1736007 56.42299 51.62467 58.67375 51.74950 58.42390 43.89006
## ILMN_2383229 52.73515 43.47368 65.48235 49.83155 53.89276 39.26819
## ILMN_1806310 58.04995 52.31484 56.60397 55.61397 65.28136 46.38885
## 49582 more rows ...
##
## $genes
## SYMBOL Status TargetID
## ILMN_1762337 7A5 regular <NA>
## ILMN_2055271 A1BG regular <NA>
## ILMN_1736007 A1BG regular <NA>
## ILMN_2383229 A1CF regular <NA>
## ILMN_1806310 A1CF regular <NA>
## 49582 more rows ...
##
## $other
## $Detection
## 1 2 3 4 5
## ILMN_1762337 0.55849580 0.6754875 0.13698630 0.601388900 0.57762240
## ILMN_2055271 0.03064067 0.0000000 0.04931507 0.002777778 0.03636364
## ILMN_1736007 0.27715880 0.2924791 0.15342470 0.486111100 0.41118880
## ILMN_2383229 0.47353760 0.1866295 0.36575340 0.563888900 0.29510490
## ILMN_1806310 0.26183840 0.2479109 0.25890410 0.127777800 0.21958040
## 6 7 8 9 10 11
## ILMN_1762337 0.7816412 0.05027933 0.4781382 0.90819210 0.7144847 0.0000000
## ILMN_2055271 0.0000000 0.03910615 0.0000000 0.00000000 0.0000000 0.0000000
## ILMN_1736007 0.2197497 0.23184360 0.3145275 0.15536720 0.3774373 0.2539007
## ILMN_2383229 0.1237830 0.34078210 0.7447109 0.05367232 0.4679666 0.3985816
## ILMN_1806310 0.2642559 0.19553070 0.2919605 0.19632770 0.2381616 0.1219858
## 12
## ILMN_1762337 0.4599448
## ILMN_2055271 0.0000000
## ILMN_1736007 0.3604972
## ILMN_2383229 0.7472376
## ILMN_1806310 0.2030387
## 49582 more rows ...
head(x$E)
## 1 2 3 4 5 6
## ILMN_1762337 52.34406 46.10429 54.01238 47.65873 54.75389 47.43921
## ILMN_2055271 69.91481 73.86729 58.64280 72.36581 77.10600 82.08336
## ILMN_1736007 57.47208 53.70050 53.39117 49.43674 58.56845 59.88695
## ILMN_2383229 53.60817 57.50813 48.22960 48.18757 61.81213 64.53053
## ILMN_1806310 58.11687 55.14536 50.52731 59.97917 64.15617 58.37450
## ILMN_1779670 64.49660 61.22243 52.84443 61.93639 67.88351 59.74094
## 7 8 9 10 11 12
## ILMN_1762337 67.36201 47.92186 40.48154 44.69235 80.57645 42.53519
## ILMN_2055271 69.12629 81.29189 79.11217 82.50546 87.01139 60.91446
## ILMN_1736007 56.42299 51.62467 58.67375 51.74950 58.42390 43.89006
## ILMN_2383229 52.73515 43.47368 65.48235 49.83155 53.89276 39.26819
## ILMN_1806310 58.04995 52.31484 56.60397 55.61397 65.28136 46.38885
## ILMN_1779670 68.17852 63.10263 65.07272 65.68924 69.60009 52.01776
One downside of limma, is that the data are represented in a slightly different format to the rest of Bioconductor. So the exprs
and pData
functions cannot be used. Instead to retrieve the expression measurements we have to use $E
.
boxplot(log2(x$E),range=0,ylab="log2 intensity")
A popular choice for the normalisation of Illumina data is neqc. This incorporates a background correction step, log\(_2\) transformation and quantile normalisation. The background estimation uses the intensities from the negative controls. After the data have been transformed, these controls are then removed from the data.
y <- neqc(x)
head(y$E)
## 1 2 3 4 5 6
## ILMN_1762337 4.754040 4.731519 4.980562 4.741192 4.755547 4.704071
## ILMN_2055271 5.162217 5.298597 5.139491 5.284588 5.142003 5.383228
## ILMN_1736007 4.849849 4.852764 4.960072 4.769297 4.808160 4.895093
## ILMN_2383229 4.775144 4.923796 4.809635 4.749622 4.855720 4.983020
## ILMN_1806310 4.863640 4.877851 4.871417 4.975431 4.893602 4.867401
## ILMN_1779670 5.017084 5.002160 4.943029 5.019756 4.958687 4.892008
## 7 8 9 10 11 12
## ILMN_1762337 5.090111 4.779881 4.669165 4.712218 5.226774 4.788520
## ILMN_2055271 5.123613 5.426473 5.410965 5.413067 5.342115 5.568949
## ILMN_1736007 4.891535 4.838546 4.946151 4.818017 4.859202 4.826098
## ILMN_2383229 4.831107 4.710633 5.078065 4.788162 4.797814 4.716899
## ILMN_1806310 4.919383 4.849762 4.911665 4.880178 4.957913 4.907443
## ILMN_1779670 5.105954 5.039321 5.068816 5.058381 5.029809 5.121597
boxplot(y$E,range=0,ylab="log2 intensity")
Another use for the negative controls is to estimate the proportion of genes that are expressed in each sample. There should not be a significant difference in these proportions between samples.
pe <- propexpr(x)
pe
## 1 2 3 4 5 6 7
## 0.5568000 0.5176149 0.5489187 0.5548044 0.5038857 0.5141279 0.4950729
## 8 9 10 11 12
## 0.5175220 0.5285855 0.5137117 0.5351397 0.5167289
barplot(pe)