Processing aligned reads with R and Bioconductor

We will now look at how we can represent and access genomic intervals in R and Bioconducto which fit best within the framework of the course. There are also tools outside of R that are extremely powerful; such as bedtools which are worth considering if you want to go further with NGS analysis.

Getting started

Working directory

Session -> Set Working Directory -> Choose Directory

/home/participant/Course_Materials/Day1/

Template

/home/participant/Course_Materials/Day1/Session3-template.Rmd

Importing aligned data

We can import reads from a .bam file into Bioconductor. However, to set expectations we are not going to be processing the entire set of reads from a whole-genome sequencing run in this manner. This can be a useful way of diving-into a particular region of interest and exploring the data.

A package that can be used to parse a .bam file is GenomicAlignments. You should notice that although the .bam file is not particularly big (~ 12.5 Million reads), but already takes a little while to read.

library(GenomicAlignments)
mybam <- readGAlignments("paired.bam")
mybam

## GAlignments object with 12581680 alignments and 0 metadata columns:
##              seqnames strand       cigar    qwidth     start       end
##                 <Rle>  <Rle> <character> <integer> <integer> <integer>
##          [1]        1      +       1S67M        68     10003     10069
##          [2]        1      +         68M        68     10036     10103
##          [3]        1      +         68M        68     10036     10103
##          [4]        1      +         68M        68     10041     10108
##          [5]        1      -         68M        68     10041     10108
##          ...      ...    ...         ...       ...       ...       ...
##   [12581676]   hs37d5      -         68M        68  35465836  35465903
##   [12581677]   hs37d5      +         68M        68  35466096  35466163
##   [12581678]   hs37d5      -    47M1D21M        68  35466133  35466201
##   [12581679]   hs37d5      +         68M        68  35466270  35466337
##   [12581680]   hs37d5      -         68M        68  35466357  35466424
##                  width     njunc
##              <integer> <integer>
##          [1]        67         0
##          [2]        68         0
##          [3]        68         0
##          [4]        68         0
##          [5]        68         0
##          ...       ...       ...
##   [12581676]        68         0
##   [12581677]        68         0
##   [12581678]        69         0
##   [12581679]        68         0
##   [12581680]        68         0
##   -------
##   seqinfo: 86 sequences from an unspecified genome

Similar to the importing the .fastq files using ShortRead, readGAlignments has provided us with an object that can be manipulated using the standard vector conventions.

mybam[1:10]

## GAlignments object with 10 alignments and 0 metadata columns:
##        seqnames strand        cigar    qwidth     start       end
##           <Rle>  <Rle>  <character> <integer> <integer> <integer>
##    [1]        1      +        1S67M        68     10003     10069
##    [2]        1      +          68M        68     10036     10103
##    [3]        1      +          68M        68     10036     10103
##    [4]        1      +          68M        68     10041     10108
##    [5]        1      -          68M        68     10041     10108
##    [6]        1      +          68M        68     10041     10108
##    [7]        1      -          68M        68     10046     10113
##    [8]        1      +      6M1I61M        68     10103     10169
##    [9]        1      - 38M1D3M1D27M        68     10108     10177
##   [10]        1      -     36M1D32M        68     10110     10178
##            width     njunc
##        <integer> <integer>
##    [1]        67         0
##    [2]        68         0
##    [3]        68         0
##    [4]        68         0
##    [5]        68         0
##    [6]        68         0
##    [7]        68         0
##    [8]        67         0
##    [9]        70         0
##   [10]        69         0
##   -------
##   seqinfo: 86 sequences from an unspecified genome

There are also a number of accessor functions that can get particular items from the object; cigar to obtain the CIGAR strings, start / end to get the start and end positions, width to get the width of each read.

Exercise

How many reads have an exact match to the genome
- what cigar string would this be?
What proportion of all reads is this?

The object we have created, bam, contains only a small amount of the information available in a .bam file. If we wish we can import extra fields such as the read sequence, mapping quality and flag:-

bam <- readGAlignments("paired.bam",param=ScanBamParam(what=c("seq","mapq","flag")))
bam

## GAlignments object with 3 alignments and 3 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]        1      +       1S67M        68     10003     10069        67
##   [2]        1      +         68M        68     10036     10103        68
##   [3]        1      +         68M        68     10036     10103        68
##           njunc |                     seq      mapq      flag
##       <integer> |          <DNAStringSet> <integer> <integer>
##   [1]         0 | GACCCTGACC...CTAACCCTAA         6       163
##   [2]         0 | CTAAGCCTAA...CCCGAACGCT         5        99
##   [3]         0 | CTAACCCTAA...CCCTAACCCT        14        99
##   -------
##   seqinfo: 86 sequences from an unspecified genome

The command takes longer to run, but we get more detail on each of the reads. The extra fields make up the metadata for each read and can be accessed using the mcols function. If we save this metadata as an object, we can treat it as a data frame and therefore have the usual $ operator to access the columns

meta <- mcols(bam)

The sequences are stored in Biostrings format

meta$seq

##   A DNAStringSet instance of length 12581680
##            width seq
##        [1]    68 GACCCTGACCCTAACCCTGACCCTGACC...TGACCCTAACCCTGACCCTAACCCTAA
##        [2]    68 CTAAGCCTAACCCTAACCCTAACCCTAA...CCTAACCCTCAGCCTAACCCGAACGCT
##        [3]    68 CTAACCCTAACCCTAACCCTAACCCTAA...CCTAACCCTAACCCTAACCCTAACCCT
##        [4]    68 CCTAACCCTAACCCTAACCCTAACCCTA...CCCTAACCCTAACCCTAACCCTAACCC
##        [5]    68 CCTAACCCTAACCCTAACCCTAACCCTA...CCCTAACCCTAACCCTAACCCTAACCC
##        ...   ... ...
## [12581676]    68 ATTCCATTCCCTTCCTTTCTTTCTTCAG...CACCCAGTATGGAGTGAAGTGCAGATT
## [12581677]    68 CACTTCATCACATTCCATTCGATTTCAT...AATCTTTCAATTTCATTCCATTCGATT
## [12581678]    68 TTCGAATCTTTCAATTTCATTCCATTCG...CCATTTTTTCCATTCGATACCATTCGA
## [12581679]    68 ATTCGATTCCTTTTGATTCCATATGATT...TTGGATTTGATTCAAAACCATTTGATT
## [12581680]    68 TCCGTTTGATTCCATTCCATTCGATTCC...CCACTCCGTTCCATTTCATTCCATTTG

The flags should be valid values as explained online

table(meta$flag)

## 
##      83      99     147     163    1107    1123    1171    1187 
## 3074883 3127385 3127385 3074312   43835   45023   45023   43834

Finally the mapping qualities are numeric quantities that will vary according to aligner

summary(meta$mapq)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   30.00   39.00   32.87   40.00   53.00

hist(meta$mapq)

Rather than taking a whole-genome view, we often want to view the reads for a particular gene or region of interest. This we can do using the functions we have already seen.

my.reads <- which(seqnames(mybam)=="17" & start(mybam) > 7577851 & end(mybam) < 7598063)
mybam[my.reads]

## GAlignments object with 224 alignments and 0 metadata columns:
##         seqnames strand       cigar    qwidth     start       end
##            <Rle>  <Rle> <character> <integer> <integer> <integer>
##     [1]       17      +         76M        76   7577918   7577993
##     [2]       17      -         76M        76   7577930   7578005
##     [3]       17      +         76M        76   7578191   7578266
##     [4]       17      +         68M        68   7578195   7578262
##     [5]       17      +         68M        68   7578203   7578270
##     ...      ...    ...         ...       ...       ...       ...
##   [220]       17      -         76M        76   7596411   7596486
##   [221]       17      +         68M        68   7596523   7596590
##   [222]       17      -         68M        68   7596639   7596706
##   [223]       17      +       67M1S        68   7597833   7597899
##   [224]       17      -         68M        68   7597888   7597955
##             width     njunc
##         <integer> <integer>
##     [1]        76         0
##     [2]        76         0
##     [3]        76         0
##     [4]        68         0
##     [5]        68         0
##     ...       ...       ...
##   [220]        76         0
##   [221]        68         0
##   [222]        68         0
##   [223]        67         0
##   [224]        68         0
##   -------
##   seqinfo: 86 sequences from an unspecified genome

However, there are much more efficient ways to do such queries in Bioconductor. First of all, we need to understand a bit more about how the data are being represented.

Introducing ranges objects

GenomicAlignments is part of a family of packages that provide object-types and functionality for dealing with genomic intervals; which are described in a PLoS Computational Biology paper from 2013.

The basic type of interval we can define is an IRanges object

ir <- IRanges(
start = c(7,9,12,14,22:24), 
end=c(15,11,13,18,26,27,28))
ir

## IRanges object with 7 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         7        15         9
##   [2]         9        11         3
##   [3]        12        13         2
##   [4]        14        18         5
##   [5]        22        26         5
##   [6]        23        27         5
##   [7]        24        28         5

There is an extensive list of operations that we can perform on this object

shift(ir, 5)

## IRanges object with 7 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]        12        20         9
##   [2]        14        16         3
##   [3]        17        18         2
##   [4]        19        23         5
##   [5]        27        31         5
##   [6]        28        32         5
##   [7]        29        33         5

resize(ir ,2)

## IRanges object with 7 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         7         8         2
##   [2]         9        10         2
##   [3]        12        13         2
##   [4]        14        15         2
##   [5]        22        23         2
##   [6]        23        24         2
##   [7]        24        25         2

reduce(ir)

## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         7        18        12
##   [2]        22        28         7

coverage(ir)

## integer-Rle of length 28 with 10 runs
##   Lengths: 6 2 7 3 3 1 1 3 1 1
##   Values : 0 1 2 1 0 1 2 3 2 1

To start using these operations in a genome context, we can create a GRanges object. This has all the same information as an IRanges object with extra capability to store the sequence name, strand and other (optional) metadata.

mygene <- GRanges("chr17", ranges=IRanges(7577851, 7598063),strand="+")
mygene

## GRanges object with 1 range and 0 metadata columns:
##       seqnames             ranges strand
##          <Rle>          <IRanges>  <Rle>
##   [1]    chr17 [7577851, 7598063]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

A large set of IRanges functionality is designed to deal with overlapping queries. Such operations are common in NGS analysis so we need to have an efficient way of performing them. You will probably come across such operations without even realising it.

What reads overlap with a particular genomic region?
What genes are affected by this particular copy-number loss / gain?
How many variants are in exonic regions?
For a given set of variants, what genes do they overlap with?
How many variants does my gene of interest have?

The terminology of overlapping defines a query and a subject

The countOverlaps function is be an ideal candidate for this kind of investigation.

countOverlaps(mygene, mybam)

## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)

## [1] 0

What is this this warning message telling us? It’s basically saying that the sequence names are not compatible. A major annonyance of dealing with NGS data is the way that different databases represent a relatively trivial piece of information such as chromosome name.

table(seqnames(mybam))

## 
##          1          2          3          4          5          6 
##    1222361     974687     760716     634872     656322     665565 
##          7          8          9         10         11         12 
##     621993     515407     500207     538097     605012     580508 
##         13         14         15         16         17         18 
##     305082     432772     400797     433895     501464     235618 
##         19         20         21         22          X          Y 
##     422294     267243     139654     201444     314641      63743 
##         MT GL000207.1 GL000226.1 GL000229.1 GL000231.1 GL000210.1 
##      22951         14       8508        208        186         76 
## GL000239.1 GL000235.1 GL000201.1 GL000247.1 GL000245.1 GL000197.1 
##        192        146        200        182         82        144 
## GL000203.1 GL000246.1 GL000249.1 GL000196.1 GL000248.1 GL000244.1 
##         76        130        174        154        100        448 
## GL000238.1 GL000202.1 GL000234.1 GL000232.1 GL000206.1 GL000240.1 
##        300        120        218        184        150        156 
## GL000236.1 GL000241.1 GL000243.1 GL000242.1 GL000230.1 GL000237.1 
##        222        378        300         98         84        804 
## GL000233.1 GL000204.1 GL000198.1 GL000208.1 GL000191.1 GL000227.1 
##        130        226        312       1032        546       1994 
## GL000228.1 GL000214.1 GL000221.1 GL000209.1 GL000218.1 GL000220.1 
##       1988       2344       2215       3734       1148      27398 
## GL000213.1 GL000211.1 GL000199.1 GL000217.1 GL000216.1 GL000215.1 
##        860       1514       2000        666        974        622 
## GL000205.1 GL000219.1 GL000224.1 GL000223.1 GL000195.1 GL000212.1 
##       1628       1066       3120       1140       2108       2030 
## GL000222.1 GL000200.1 GL000193.1 GL000194.1 GL000225.1 GL000192.1 
##        956        898       1162       1756       2133       4022 
##  NC_007605     hs37d5 
##      92718     385861

table(seqnames(mygene))

## 
## chr17 
##     1

So before attempting an overlap between two different set of ranges, we need to make sure that their respective sequences names are compatible. One solution is to be more careful when creating our ranges.

mygene <- GRanges("17", ranges=IRanges(7577851, 7598063))
countOverlaps(mygene, mybam)

## [1] 225

The convenience function seqlevelsStyle has been written to help us understand what covention has been used to name the chromosomes, and can even rename where appropriate.

mygene <- GRanges("chr17", ranges=IRanges(7577851, 7598063))
seqlevelsStyle(mygene)

## [1] "UCSC"

seqlevelsStyle(mybam)

## [1] "NCBI"    "Ensembl"

seqlevelsStyle(mygene) <- "NCBI"
countOverlaps(mygene, mybam)

## [1] 225

If we want the actual reads themselves, we can use a convenient %over% shortcut

mybam[mybam %over% mygene]

## GAlignments object with 225 alignments and 0 metadata columns:
##         seqnames strand       cigar    qwidth     start       end
##            <Rle>  <Rle> <character> <integer> <integer> <integer>
##     [1]       17      +         76M        76   7577918   7577993
##     [2]       17      -         76M        76   7577930   7578005
##     [3]       17      +         76M        76   7578191   7578266
##     [4]       17      +         68M        68   7578195   7578262
##     [5]       17      +         68M        68   7578203   7578270
##     ...      ...    ...         ...       ...       ...       ...
##   [221]       17      +         68M        68   7596523   7596590
##   [222]       17      -         68M        68   7596639   7596706
##   [223]       17      +       67M1S        68   7597833   7597899
##   [224]       17      -         68M        68   7597888   7597955
##   [225]       17      +         68M        68   7598046   7598113
##             width     njunc
##         <integer> <integer>
##     [1]        76         0
##     [2]        76         0
##     [3]        76         0
##     [4]        68         0
##     [5]        68         0
##     ...       ...       ...
##   [221]        68         0
##   [222]        68         0
##   [223]        67         0
##   [224]        68         0
##   [225]        68         0
##   -------
##   seqinfo: 86 sequences from an unspecified genome

Extracting a subset of reads

We might want to focus on a subset of reads from the start, is when we want to analyse a particular gene. Provided that the .bam file has been indexed (creating a .bam.bai file in the same directory), we can very quickly jump to a particular genomic region. Notice that this (should) give the same number of reads as before when we did a countOverlaps operation.

system.time(mygene.reads <- readGAlignments(file="paired.bam",param=ScanBamParam(which=mygene)))

##    user  system elapsed 
##   0.098   0.000   0.144

mygene.reads

## GAlignments object with 225 alignments and 0 metadata columns:
##         seqnames strand       cigar    qwidth     start       end
##            <Rle>  <Rle> <character> <integer> <integer> <integer>
##     [1]       17      +         76M        76   7577918   7577993
##     [2]       17      -         76M        76   7577930   7578005
##     [3]       17      +         76M        76   7578191   7578266
##     [4]       17      +         68M        68   7578195   7578262
##     [5]       17      +         68M        68   7578203   7578270
##     ...      ...    ...         ...       ...       ...       ...
##   [221]       17      +         68M        68   7596523   7596590
##   [222]       17      -         68M        68   7596639   7596706
##   [223]       17      +       67M1S        68   7597833   7597899
##   [224]       17      -         68M        68   7597888   7597955
##   [225]       17      +         68M        68   7598046   7598113
##             width     njunc
##         <integer> <integer>
##     [1]        76         0
##     [2]        76         0
##     [3]        76         0
##     [4]        68         0
##     [5]        68         0
##     ...       ...       ...
##   [221]        68         0
##   [222]        68         0
##   [223]        67         0
##   [224]        68         0
##   [225]        68         0
##   -------
##   seqinfo: 86 sequences from an unspecified genome

The region filer can be used in conjuction with the what argument to ScanBamParam function to provide a detailed picture of the reads for your gene

mygene.reads <- readGAlignments(file="paired.bam",param=ScanBamParam(which=mygene, what=c("seq","mapq","flag","qual","isize")))
mygene.reads

## GAlignments object with 3 alignments and 5 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]       17      +         76M        76   7577918   7577993        76
##   [2]       17      -         76M        76   7577930   7578005        76
##   [3]       17      +         76M        76   7578191   7578266        76
##           njunc |                     seq      mapq      flag
##       <integer> |          <DNAStringSet> <integer> <integer>
##   [1]         0 | TCCGCCTGCC...GGGAGGCCCT        37       163
##   [2]         0 | GGCCTCCCAA...GCCTCTGTAA        31        83
##   [3]         0 | AGGGCACCAC...CCACTCGGAT        40        99
##                          qual     isize
##                <PhredQuality> <integer>
##   [1] H;9476;:9:...@?9<085==J        87
##   [2] ##########...>?><><<<>S       -87
##   [3] S=>>=<<><<...?>==??8??P       202
##   -------
##   seqinfo: 86 sequences from an unspecified genome

Pre-built databases of gene coordinates

Aside from the many useful software packages, Bioconductor also provides numerous annotation resources that we can utilise in our analysis. Firstly, we have a set of organism-level packages that can translate between different types of identifer. The package for humans is called org.Hs.eg.db. The advantage of such a package, rather than services such as biomaRt, is that we can do queries offline. The packages are updated every 6 months, so we can always be sure of what version of the relevant databases are being used.

library(org.Hs.eg.db)
org.Hs.eg.db

## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2016-Mar14
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 20160305
## | GOEGSOURCEDATE: 2016-Mar14
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19
## | GPSOURCEDATE: 2010-Mar22
## | ENSOURCEDATE: 2016-Mar9
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Wed Mar 23 15:50:09 2016

There are several types of “key” we can use to make a query, and we have to specify one of these names.

keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

For the given keytype we have chosen, we can also choose what data we want to retrieve. We can think of these as columns in a table, and the pre-defined values are given by:-

columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

For example, if we want to know the “Entrez” ID for the gene BRCA1, we can do:-

eg <- select(org.Hs.eg.db, keys="BRCA1", keytype = "SYMBOL",columns = "ENTREZID")
eg

##   SYMBOL ENTREZID
## 1  BRCA1      672

But we’re not restricted to just one type of information to return

eg <- select(org.Hs.eg.db, keys=c("BRCA1","PTEN"), keytype = "SYMBOL",columns = c("ENTREZID","ENSEMBL"))

## 'select()' returned 1:many mapping between keys and columns

You should see that the above command prints a message to the screen:- 'select()' returned 1:many mapping between keys and columns. This is not an error message and R has still been able to generate the output requested.

eg

##   SYMBOL ENTREZID         ENSEMBL
## 1  BRCA1      672 ENSG00000012048
## 2   PTEN     5728 ENSG00000171862
## 3   PTEN     5728 ENSG00000283094

In this case, we have “many” (well, two) values of ENSEMBL for the gene PTEN. In practice this means we probably want to think carefully about merging this data with other tables.

You might expect to be able to retrieve information about the coordinates for a particular gene using the same interface. This was supported until recently, but the recommended approach now is to use another class of packages which describe the structure of genes in more detail.

The packages with the prefix TxDb.... represent the structure of all genes for a given organism in an efficient manner. For humans, we can use the package TxDb.Hsapiens.UCSC.hg19.knownGene to tell us about transcripts for the hg19 build of the genome. The package was generated using tables from the UCSC genome browser

As with the org.Hs.eg.db package we can load the package and inspect the kind of mappings available to us.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
columns(txdb)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"

keytypes(txdb)

## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"    
## [7] "TXNAME"

You’ll see that all the mappings are regarding the coordinates and IDs of various genomic features. There is only one type of identifier used, in this case Entrez ID. If we know a genes Entrez ID, we can get the exon coordinates with the following query.

mygene <- select(txdb, keys="672", columns = c("EXONCHROM","EXONSTART","EXONEND","EXONSTRAND"),keytype="GENEID")
mygene

##    GENEID EXONCHROM EXONSTRAND EXONSTART  EXONEND
## 1     672     chr17          -  41276034 41276132
## 2     672     chr17          -  41267743 41267796
## 3     672     chr17          -  41258473 41258550
## 4     672     chr17          -  41256885 41256973
## 5     672     chr17          -  41256139 41256278
## 6     672     chr17          -  41251792 41251897
## 7     672     chr17          -  41249261 41249306
## 8     672     chr17          -  41247863 41247939
## 9     672     chr17          -  41246761 41246877
## 10    672     chr17          -  41242961 41243049
## 11    672     chr17          -  41234421 41234592
## 12    672     chr17          -  41228505 41228628
## 13    672     chr17          -  41226348 41226538
## 14    672     chr17          -  41222945 41223255
## 15    672     chr17          -  41219625 41219712
## 16    672     chr17          -  41215891 41215968
## 17    672     chr17          -  41215350 41215390
## 18    672     chr17          -  41209069 41209152
## 19    672     chr17          -  41203080 41203134
## 20    672     chr17          -  41201138 41201211
## 21    672     chr17          -  41199660 41199720
## 22    672     chr17          -  41196312 41197819
## 23    672     chr17          -  41277199 41277340
## 24    672     chr17          -  41258495 41258550
## 25    672     chr17          -  41251792 41251894
## 26    672     chr17          -  41243452 41246877
## 27    672     chr17          -  41228505 41228631
## 28    672     chr17          -  41277294 41277468
## 29    672     chr17          -  41277288 41277500
## 30    672     chr17          -  41231351 41231416
## 31    672     chr17          -  41202079 41202207
## 32    672     chr17          -  41322143 41322420
## 33    672     chr17          -  41243117 41246877
## 34    672     chr17          -  41262482 41262597

Exercise

Retrieve the exon coordinates for the gene PTEN
- you’ll need the Entrez ID for this gene as retrieved above
How about all transcripts for PTEN and BRCA1?

It is useful to be able to retrive the coordinates in this manner. However, we should now be familiar with the way intervals can be represented using GRanges. We have the ability to create a GRanges object from the result:-

my.gr <- GRanges(mygene$EXONCHROM, IRanges(mygene$EXONSTART,mygene$EXONEND))
my.gr

## GRanges object with 34 ranges and 0 metadata columns:
##        seqnames               ranges strand
##           <Rle>            <IRanges>  <Rle>
##    [1]    chr17 [41276034, 41276132]      *
##    [2]    chr17 [41267743, 41267796]      *
##    [3]    chr17 [41258473, 41258550]      *
##    [4]    chr17 [41256885, 41256973]      *
##    [5]    chr17 [41256139, 41256278]      *
##    ...      ...                  ...    ...
##   [30]    chr17 [41231351, 41231416]      *
##   [31]    chr17 [41202079, 41202207]      *
##   [32]    chr17 [41322143, 41322420]      *
##   [33]    chr17 [41243117, 41246877]      *
##   [34]    chr17 [41262482, 41262597]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

A powerful feature of the transcript database packages is to allow the user to create a GRanges representation of all exons / transcripts for a particular organism. The function to do this is exonsBy (or equivalently transcriptsBy). The result is a list object that can be subset according to Entrez ID.

allex <- exonsBy(txdb, "gene")
mygene <- allex[["672"]]
mygene

## GRanges object with 34 ranges and 2 metadata columns:
##        seqnames               ranges strand |   exon_id   exon_name
##           <Rle>            <IRanges>  <Rle> | <integer> <character>
##    [1]    chr17 [41196312, 41197819]      - |    227457        <NA>
##    [2]    chr17 [41199660, 41199720]      - |    227458        <NA>
##    [3]    chr17 [41201138, 41201211]      - |    227459        <NA>
##    [4]    chr17 [41202079, 41202207]      - |    227460        <NA>
##    [5]    chr17 [41203080, 41203134]      - |    227461        <NA>
##    ...      ...                  ...    ... .       ...         ...
##   [30]    chr17 [41276034, 41276132]      - |    227486        <NA>
##   [31]    chr17 [41277199, 41277340]      - |    227487        <NA>
##   [32]    chr17 [41277288, 41277500]      - |    227488        <NA>
##   [33]    chr17 [41277294, 41277468]      - |    227489        <NA>
##   [34]    chr17 [41322143, 41322420]      - |    227496        <NA>
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

We are almost in a position to overlap our reads with the GRanges representation of our gene. First of all, we need the same trick from before to make sure the chromosome names are compatible

seqlevelsStyle(mybam)

## [1] "NCBI"    "Ensembl"

seqlevelsStyle(mygene)

## [1] "UCSC"

seqlevelsStyle(mygene) <- "Ensembl"
mygene

## GRanges object with 34 ranges and 2 metadata columns:
##        seqnames               ranges strand |   exon_id   exon_name
##           <Rle>            <IRanges>  <Rle> | <integer> <character>
##    [1]       17 [41196312, 41197819]      - |    227457        <NA>
##    [2]       17 [41199660, 41199720]      - |    227458        <NA>
##    [3]       17 [41201138, 41201211]      - |    227459        <NA>
##    [4]       17 [41202079, 41202207]      - |    227460        <NA>
##    [5]       17 [41203080, 41203134]      - |    227461        <NA>
##    ...      ...                  ...    ... .       ...         ...
##   [30]       17 [41276034, 41276132]      - |    227486        <NA>
##   [31]       17 [41277199, 41277340]      - |    227487        <NA>
##   [32]       17 [41277288, 41277500]      - |    227488        <NA>
##   [33]       17 [41277294, 41277468]      - |    227489        <NA>
##   [34]       17 [41322143, 41322420]      - |    227496        <NA>
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Furthermore, since we know we are only interested in the reads from chromosome 17, we could subset the reads accordingly.

mybam.chr17 <- keepSeqlevels(mybam, "17")

mybam.chr17[mybam.chr17 %over% mygene]

## GAlignments object with 345 alignments and 0 metadata columns:
##         seqnames strand       cigar    qwidth     start       end
##            <Rle>  <Rle> <character> <integer> <integer> <integer>
##     [1]       17      -         76M        76  41197373  41197448
##     [2]       17      -         68M        68  41197677  41197744
##     [3]       17      -         68M        68  41197707  41197774
##     [4]       17      -         76M        76  41197734  41197809
##     [5]       17      -         68M        68  41197780  41197847
##     ...      ...    ...         ...       ...       ...       ...
##   [341]       17      -         68M        68  41276093  41276160
##   [342]       17      -         76M        76  41276099  41276174
##   [343]       17      -         68M        68  41277212  41277279
##   [344]       17      -         68M        68  41277341  41277408
##   [345]       17      -         68M        68  41322151  41322218
##             width     njunc
##         <integer> <integer>
##     [1]        76         0
##     [2]        68         0
##     [3]        68         0
##     [4]        76         0
##     [5]        68         0
##     ...       ...       ...
##   [341]        68         0
##   [342]        76         0
##   [343]        68         0
##   [344]        68         0
##   [345]        68         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome

NOTE:- it also seems that the Mitochondial sequence is a different length in the reference that these reads were aligned to, compared to the transcript database. It may seem like a trivial difference, but will cause an error if we don’t perform the subset to just chromosome 17.

Error in mergeNamedAtomicVectors(seqlengths(x), seqlengths(y), what = c("sequence",  : 
  sequence MT has incompatible seqlengths:
  - in 'x': 16569
  - in 'y': 16571

Per-exon counts can be achieved using the countOverlaps function:-

countOverlaps(mygene,mybam.chr17)

##  [1]   8   6   4   0   1  12   3   3   5  24   5   8   8   1  13   2 207
## [18] 204   4   1   2   7   7  18   7   2   2   0   0   4   1   1   1   1

And finally, we should note that we could subset just the reads for this gene with the following set of commands. The only thing we need to know in advance is the Entrez ID of our gene.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
allex <- exonsBy(txdb, "gene")
mygene <- allex[["672"]]
seqlevelsStyle(mygene) <- "Ensembl"
mygene <- keepSeqlevels(mygene,"17")
reads <- readGAlignments("paired.bam",param=ScanBamParam(which=mygene))

A final note is that we can also export any ranges we have created to a file (e.g. a .bed file) for further processing or visualisation in IGV.

library(rtracklayer)
export(mygene, con="mygene.bed")

Exercise

Import all the reads from the file paired.bam that align to the gene PTEN
Obtain counts for the number of reads mapping to each exon of PTEN
How many reads don’t have a perfect match to the reference genome?

Understanding Alignments

Mark Dunning

Last modified: 23 Jul 2016

Outline

Processing aligned reads with R and Bioconductor

Getting started

Working directory

Template

Importing aligned data

Exercise

Introducing ranges objects

Extracting a subset of reads

Pre-built databases of gene coordinates

Exercise

Exercise

Summary