Historical overview

Two-colour microarrays are used to compare two samples (e.g. cancer and normal cells) on the same microarray. The RNA from the two samples is extracted separately and fluorescently labelled with different dyes, usually red and green. After hybridisation, each feature is therefore a mixture of red and green fluorescence. A completely red or green feature indicates that a particular gene is expressed in one sample but not the other.
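
A common way to summarise the two channels for each feature is the log-ratio of red to green. The sketch below, using invented intensities, computes the usual M (log-ratio) and A (average log-intensity) values:

    # Invented red and green foreground intensities for three features
    red   <- c(1200, 540,  80)
    green <- c( 600, 530, 640)

    # M: log2 ratio between the channels -- the differential-expression signal
    # A: average log2 intensity -- the overall brightness of the feature
    M <- log2(red / green)
    A <- (log2(red) + log2(green)) / 2

    M   # positive = higher in the red-labelled sample, negative = higher in green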

[Figure: probe-and-target]

Single-channel microarrays can also be produced to measure the absolute expression level of every gene of interest in a given sample. The fluorescence of each feature is therefore a measure of the expression level of a particular gene. Arguably the most popular single-channel microarray technology was that of Affymetrix. As we will describe in the next section, these arrays use 25-base probes that are synthesised on the array surface. Each gene of interest is interrogated by a collection of 11-20 probe pairs, known as a probe set. The expression level for a gene is then derived by combining all measurements from a particular probe set.
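
As a toy illustration of the probe-set idea, consider the sketch below. Note this is not Affymetrix's actual summarisation algorithm; real methods such as RMA background-correct and normalise across arrays before summarising, typically with a median-polish fit.

    # Invented intensities for the 11 perfect-match probes of one probe set
    probe_set <- c(230, 310, 275, 410, 298, 350, 267, 390, 305, 280, 320)

    # Naive summary: the median of the log2 intensities across the probe set
    gene_expression <- median(log2(probe_set))
    gene_expression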

Illumina (probably best known these days for its sequencing technologies) was also a major player. Until recently, all gene-expression arrays at the CRUK Cambridge Institute were run on Illumina arrays, and we have much experience in conducting large-scale cancer studies and in developing software and analysis methods for these arrays.

Typical workflow

Despite differences in array construction, there are a few commonalities in the way that raw data from a microarray experiment are processed.

Image processing

A microarray surface is typically scanned by a laser to produce an image representation of the fluorescence emitted by it. Thus, depending on the resolution of the scanner, each feature will be represented by a number of pixels. These are known as the raw images and are usually in the 16-bit TIFF image format. Therefore, the intensity of each pixel is a value in the range 0 to 2^16 − 1.

[Figure: A high-resolution TIFF image is the result of scanning the array surface]

These images are usually processed by the manufacturer's software, which locates all the features on the image and then calculates a foreground intensity for each feature from its constituent pixels. However, the pixel intensities measured on the image may be influenced by factors other than hybridisation, such as optical noise from the scanner or foreign items deposited on the array. Therefore, a background intensity is also estimated for each feature to account for such factors. The background and foreground estimates generally act as the starting point for statistical analysis.
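
The sketch below illustrates the idea with an invented 5x5 block of pixels around a single feature. Vendor software uses more sophisticated segmentation, but the principle is the same: summarise the pixels inside the feature for the foreground, and the surrounding pixels for the background.

    # Invented 5x5 pixel neighbourhood around one feature (16-bit values)
    pixels <- matrix(c( 90,  95, 100,  98,  92,
                        96, 820, 870, 845,  99,
                        94, 860, 910, 880, 101,
                        97, 835, 865, 850,  96,
                        91,  99, 103,  98,  93), nrow = 5, byrow = TRUE)

    # Mask marking which pixels belong to the spotted feature
    feature <- matrix(FALSE, 5, 5)
    feature[2:4, 2:4] <- TRUE

    foreground <- mean(pixels[feature])     # summarise the feature pixels
    background <- median(pixels[!feature])  # summarise the surrounding pixels

    c(foreground = foreground, background = background)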

[Figure: close-up]

Data processing

The intensities of the features on a microarray are influenced by many sources of noise, and repeated measurements made on different microarrays may appear to disagree. Therefore, a number of data-cleaning, or pre-processing, steps must take place before valid biological conclusions can be drawn from a microarray experiment (Quackenbush, 2002; Smyth et al., 2003; Allison et al., 2006).

  • Background correction
    • A separate step from the feature-level background estimation discussed above
    • Microarray probes are affected by cross-hybridisation and other sources of noise, so the baseline measurements from the instrument are never zero
    • We address this by measuring ‘negative control’ probes that we don’t expect to yield any signal
    • Affymetrix and Illumina have different ways of doing this
  • Quality assessment
    • Some chips might be dodgy, or we may have used poor-quality samples
  • Transformation
    • The TIFF images yield values on the scale 0 to 2^16 − 1, which is not convenient for analysis
    • A suitable transformation, such as log2, is often applied so that a change of one unit corresponds to a two-fold change (see the sketch after this list)
  • Normalisation
    • Systematic effects may emerge over time, which we need to calibrate out
    • Experiments should be adequately designed to cope with this
  • Annotation
    • Microarray manufacturers use their own identifier schemes that don’t relate to biology
      • “ILMN_1343291”, “ILMN_1343295”, …
      • “1000_at”, “1001_at”, …
    • We need to map these IDs to gene names, genome positions, etc. (see the annotation example after this list)
    • Sometimes these mappings can be wrong, as has been found for both Affymetrix and Illumina arrays
    • This is one of the main reasons why sequencing is often preferred
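
To make these steps concrete, the sketch below applies background correction, a log2 transformation and quantile normalisation to an invented matrix of intensities. It is an illustration only; in practice these steps are performed by dedicated Bioconductor functions (e.g. in the limma, affy or beadarray packages).

    # Invented foreground intensities: 4 probes x 3 arrays
    fg <- matrix(c( 500,  620,  450,
                   1500, 1700, 1600,
                     80,   95,   70,
                    300,  280,  340), nrow = 4, byrow = TRUE)
    bg <- matrix(60, nrow = 4, ncol = 3)   # a flat background estimate

    # Background correction: subtract, flooring at 1 so that log2 is defined
    corrected <- pmax(fg - bg, 1)

    # Transformation: on the log2 scale a difference of 1 unit
    # corresponds to a two-fold change
    log_expr <- log2(corrected)

    # Quantile normalisation: force every array to share the same
    # distribution (the mean of the per-array sorted values)
    ranks        <- apply(log_expr, 2, rank, ties.method = "first")
    sorted_means <- rowMeans(apply(log_expr, 2, sort))
    normalised   <- apply(ranks, 2, function(r) sorted_means[r])

For the annotation step, Bioconductor provides chip-specific annotation packages. Assuming the array in question matches the illuminaHumanv4.db package, the Illumina IDs above can be mapped to gene symbols like so:

    library(illuminaHumanv4.db)

    # Map manufacturer probe IDs to gene symbols and descriptions
    AnnotationDbi::select(illuminaHumanv4.db,
                          keys    = c("ILMN_1343291", "ILMN_1343295"),
                          columns = c("SYMBOL", "GENENAME"),
                          keytype = "PROBEID")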

Microarrays vs sequencing

  • Probe design issues: probes can cross-hybridise or be incorrectly annotated
  • Limited scope for novel findings: only the sequences the probes were designed against can be measured
  • Genome coverage: anything not targeted by a probe is invisible to the array
  • On the other hand, microarray analysis methods are well understood, and established pipelines can process the data quickly and efficiently
  • (although sequencing, particularly RNA-seq, is catching up)

Are arrays still relevant?

  • Wealth of data available online, e.g. on GEO (the Gene Expression Omnibus)
  • Useful as a validation platform
  • Methods are established and well-understood
  • Arguably cheaper, with easier access to equipment

The “death” of microarrays was predicted as early as 2008. In reality, it took quite a lot longer for arrays to become obsolete. We have only recently reached the tipping point where RNA-seq has taken over from gene-expression arrays.

There is a vast amount of Illumina and Affymetrix data out there waiting to be explored. Studies often use these historical samples to validate computational methods for defining cancer subtypes.

Many of the same issues and techniques apply to NGS data

  • Experimental design; despite this fancy new technology, if we don’t design the experiments properly we won’t get meaningful conclusions
  • Quality assessment; Yes, NGS experiments can still go wrong!
  • Normalisation; NGS data come with their own set of biases and errors that need to be accounted for
  • Stats; testing for RNA-seq is built upon the knowledge from microarrays
    • some of the analysis we do for proteomics uses the linear modelling approach to be described later

Microarray data are much more manageable in size. We can work with decent-sized experiments (hundreds of samples) and learn about high-dimensional analysis techniques that you will encounter in the analysis of newer, sexier technologies.

The Bioconductor project

  • Over 800 packages for analysing all kinds of genomic data
  • Compulsory documentation (vignettes) for each package
  • 6-month release cycle
  • Course Materials
  • Example data and workflows
  • Common, re-usable framework and functionality
  • Available Support
    • Often you will be able to interact with the package maintainers / developers and other power-users of the project software
  • Annual conferences in the US and Europe
    • The 2015 European conference was in Cambridge

Many of the packages are by well-respected authors and get lots of citations.

[Figure: citations]

Downloading a package

Each package has its own landing page, e.g. http://bioconductor.org/packages/release/bioc/html/beadarray.html. Here you’ll find:

  • Installation script (will install all dependencies)
  • Vignettes and manuals
  • Details of package maintainer
  • After installation, you can load the package using the library function, e.g. library(beadarray), as in the example below
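
For example, to install and load beadarray with the current Bioconductor installer (older material used the biocLite() script instead):

    # Install the Bioconductor installer from CRAN, then the package;
    # BiocManager::install() also installs all dependencies
    install.packages("BiocManager")
    BiocManager::install("beadarray")

    # Load the package for the current session
    library(beadarray)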

Reading data using Bioconductor

Recall that data can be read into R using read.csv, read.delim, read.table, etc. Several packages provide specialised versions of these functions to read raw data from different manufacturers (examples below):

  • limma for various two-colour platforms
  • affy for Affymetrix data
  • beadarray, lumi, limma for Illumina BeadArray data
  • A common class is used to represent the data
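
For example (the file names here are invented):

    library(limma)
    # Two-colour data: read a set of GenePix .gpr files
    rg <- read.maimages(c("slide1.gpr", "slide2.gpr"), source = "genepix")

    library(affy)
    # Affymetrix data: read all CEL files in the working directory
    abatch <- ReadAffy()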

A dataset may be split into different components

  • Matrix of expression values
  • Sample information
  • Annotation for the probes

In Bioconductor we will often put these data in the same object for easy referencing. The Biobase package provides the infrastructure for this, notably the ExpressionSet class.
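
For example, a minimal ExpressionSet holding invented data for three probes and two samples:

    library(Biobase)

    # Invented toy data: 3 probes x 2 samples
    exprs_mat <- matrix(rnorm(6), nrow = 3,
                        dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))
    pheno <- data.frame(condition = c("tumour", "normal"),
                        row.names = c("s1", "s2"))
    feats <- data.frame(symbol = c("TP53", "BRCA1", "MYC"),
                        row.names = c("p1", "p2", "p3"))

    eset <- ExpressionSet(assayData   = exprs_mat,
                          phenoData   = AnnotatedDataFrame(pheno),
                          featureData = AnnotatedDataFrame(feats))

    exprs(eset)   # the expression matrix
    pData(eset)   # the sample information
    fData(eset)   # the probe annotation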