Two-colour microarrays were used to compare two samples (e.g. cancer and normal cells) on the same microarray. The RNA from the two samples is extracted separately and fluorescently labelled with different dyes, usually red and green. Therefore, after hybridisation, each feature is a mixture of red and green fluorescence. A completely red or green feature indicates that a particular gene is expressed in one sample, but not the other.
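One common way to put a number on the red/green mixture of each feature is the log2 ratio of the two channel intensities. The sketch below uses made-up intensities and hypothetical gene names. Which sample is given which dye is also an assumption made for illustration.

```r
# Hypothetical intensities for four features (values invented for illustration).
# Assumption: red = cancer sample, green = normal sample.
red   <- c(geneA = 8000, geneB = 500,  geneC = 2000, geneD = 4000)
green <- c(geneA = 500,  geneB = 8000, geneC = 2000, geneD = 1000)

# Log2 ratio: positive = more red signal, negative = more green signal,
# zero = equal signal in both channels
log_ratio <- log2(red / green)
print(round(log_ratio, 2))
```

A feature that is almost completely red or green gives a large positive or negative log ratio, whereas a feature with equal signal in both channels gives a value of zero.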
Single-channel microarrays can also be produced to measure the absolute expression level of every gene of interest in a given sample. Therefore, the fluorescence of each feature is a measure of the expression level of a particular gene. Arguably the most popular single-channel microarray technology was that of Affymetrix. As we will describe in the next section, these arrays use 25 base-pair probes that are synthesised on the array surface. Each gene of interest is interrogated by a collection of 11 to 20 probe pairs, known as a probe set. The expression level for a gene is then derived by combining all measurements from a particular probe set.
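As a rough sketch of what "combining all measurements from a probe set" can mean, the example below takes the median log2 intensity within each probe set. The probe set names and intensities are hypothetical and only three probes per set are shown for brevity. The median is an illustrative choice here, not the summarisation method used by any particular package.

```r
# Hypothetical intensities for six probes belonging to two probe sets
probes <- data.frame(
  probe_set = c("set1", "set1", "set1", "set2", "set2", "set2"),
  intensity = c(256, 512, 1024, 64, 64, 4096)
)

# Work on the log2 scale, then take the median within each probe set.
# The median limits the influence of a single unusual probe (e.g. 4096 in set2).
probes$log2_intensity <- log2(probes$intensity)
expression <- tapply(probes$log2_intensity, probes$probe_set, median)
print(expression)
```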
Illumina (who are probably best known for their sequencing technologies these days) were also a major player. Until recently, all gene-expression arrays at the CRUK Cambridge Institute were run on Illumina arrays, and we have much experience in conducting large-scale cancer studies and developing software and analysis methods for these arrays.
Despite differences in array construction, there are a few commonalities in the way that raw data from a microarray experiment are processed.
A microarray surface is typically scanned by a laser to produce an image representation of the fluorescence emitted by it. Thus, depending on the resolution of the scanner, each feature will be represented by a number of pixels. These are known as the raw images and are usually in the 16-bit TIFF image format. Therefore, the intensity of each pixel is a value in the range 0 to 2^16 − 1.
These images are usually processed by the manufacturers’ software, which involves locating all the features on the image and then calculating foreground intensities using the pixels that make up each feature. However, the pixel intensities measured on the image may be influenced by factors other than hybridisation, such as optical noise from the scanner or foreign items deposited on the array. Therefore, a background intensity is estimated for each feature to account for such factors. The background and foreground estimates generally act as a starting point for statistical analysis.
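The idea of foreground and background estimates can be sketched with a handful of pixel values. The numbers below are invented and the use of the median followed by a simple subtraction is an assumption made for illustration. Manufacturers' software may estimate these quantities differently.

```r
# Hypothetical 16-bit pixel values for one feature and for the area around it
feature_pixels    <- c(1200, 1350, 1280, 65535, 1310, 1290)  # one outlying pixel
background_pixels <- c(210, 190, 205, 220, 200, 195)

max_16bit <- 2^16 - 1  # largest value a 16-bit pixel can take

# The median is used so that a single outlying pixel has little influence
foreground <- median(feature_pixels)
background <- median(background_pixels)

# Simple background subtraction, floored at zero to avoid negative intensities
corrected <- max(foreground - background, 0)
print(c(foreground = foreground, background = background, corrected = corrected))
```

Note that the outlying pixel sits at the top of the 16-bit range, yet it barely affects the median-based foreground estimate.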
The intensities of the features on a microarray are influenced by many sources of noise, and repeated measurements made on different microarrays may also appear to disagree. Therefore, a number of data-cleaning, or pre-processing, steps must take place before valid biological conclusions can be drawn from a microarray experiment (Quackenbush, 2002; Smyth et al., 2003; Allison et al., 2006).
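To give a flavour of what a pre-processing step looks like, the sketch below log2-transforms a small matrix of made-up intensities and centres each array on its own median. This is only one simple possibility, chosen for illustration. It is not the method recommended in the references above.

```r
# Hypothetical raw intensities: rows are features, columns are arrays.
# array2 is uniformly four times brighter than array1.
raw <- matrix(
  c(100, 200, 400,
    400, 800, 1600),
  nrow = 3,
  dimnames = list(c("f1", "f2", "f3"), c("array1", "array2"))
)

# Log2-transform, then subtract each array's median from its own column
log_int <- log2(raw)
array_medians <- apply(log_int, 2, median)
centred <- sweep(log_int, 2, array_medians)
print(round(centred, 2))
```

After centring, the two arrays agree, because the only difference between them was an overall change in brightness.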
The “death” of microarrays was predicted as early as 2008. In reality, it took quite a lot longer for arrays to become obsolete. We have recently reached the tipping point where RNA-seq has taken over from gene expression arrays.
There is a vast amount of Illumina and Affymetrix data out there waiting to be explored. Studies often use these historical samples to validate computational methods, for example methods for identifying cancer subtypes.
Microarray data are much more manageable in size. We can work with decent-sized experiments (~100s of samples) and learn about high-dimensional analysis techniques that you will encounter in the analysis of newer, sexier, technologies.
Many of the Bioconductor packages are by well-respected authors and get lots of citations.
Each package has its own landing page, e.g. http://bioconductor.org/packages/release/bioc/html/beadarray.html.
A package is loaded with the library function, e.g. library(beadarray)
Recall that data can be read into R using read.csv, read.delim, read.table etc. Several packages provide special modifications of these to read raw data from different manufacturers:
limma for various two-colour platforms
affy for Affymetrix data
beadarray, lumi, limma for Illumina BeadArray data
A dataset may be split into different components.
In Bioconductor we will often put these data in the same object for easy referencing. The Biobase package has all the code to do this.
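The benefit of keeping the components together can be sketched in base R. The example below assumes the components are a matrix of expression values, a table describing the samples and a table describing the features, with invented names and values throughout. A plain list stands in for the dedicated classes that Biobase provides, so this is not the Biobase interface itself.

```r
# Expression values read from tab-delimited text (standing in for a file on disk)
expr_text <- "feature\ts1\ts2\ts3\nf1\t7.1\t7.4\t9.0\nf2\t5.2\t5.0\t5.1"
expr <- read.delim(text = expr_text, row.names = 1)

# Hypothetical sample and feature annotation
samples  <- data.frame(group = c("normal", "normal", "cancer"),
                       row.names = c("s1", "s2", "s3"))
features <- data.frame(symbol = c("GENE1", "GENE2"),
                       row.names = c("f1", "f2"))

# Keep all three components in one object
dataset <- list(expression = as.matrix(expr),
                samples = samples,
                features = features)

# Check that the components line up with each other
stopifnot(identical(colnames(dataset$expression), rownames(dataset$samples)),
          identical(rownames(dataset$expression), rownames(dataset$features)))

# The sample annotation can then be used to pick out columns of the expression data
cancer <- dataset$expression[, dataset$samples$group == "cancer", drop = FALSE]
print(cancer)
```

Because everything lives in one object, a question about the samples (which ones are cancer?) translates directly into a subset of the expression values.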