.bam
file using BioconductorWe will now look at how we can represent and access genomic intervals in R and Bioconducto which fit best within the framework of the course. There are also tools outside of R that are extremely powerful; such as bedtools which are worth considering if you want to go further with NGS analysis.
Session -> Set Working Directory -> Choose Directory
/home/participant/Course_Materials/Day1/
/home/participant/Course_Materials/Day1/Session3-template.Rmd
We can import reads from a .bam
file into Bioconductor. However, to set expectations we are not going to be processing the entire set of reads from a whole-genome sequencing run in this manner. This can be a useful way of diving-into a particular region of interest and exploring the data.
A package that can be used to parse a .bam
file is GenomicAlignments
. You should notice that although the .bam
file is not particularly big (~ 12.5 Million reads), but already takes a little while to read.
library(GenomicAlignments)
mybam <- readGAlignments("paired.bam")
mybam
## GAlignments object with 12581680 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 1 + 1S67M 68 10003 10069
## [2] 1 + 68M 68 10036 10103
## [3] 1 + 68M 68 10036 10103
## [4] 1 + 68M 68 10041 10108
## [5] 1 - 68M 68 10041 10108
## ... ... ... ... ... ... ...
## [12581676] hs37d5 - 68M 68 35465836 35465903
## [12581677] hs37d5 + 68M 68 35466096 35466163
## [12581678] hs37d5 - 47M1D21M 68 35466133 35466201
## [12581679] hs37d5 + 68M 68 35466270 35466337
## [12581680] hs37d5 - 68M 68 35466357 35466424
## width njunc
## <integer> <integer>
## [1] 67 0
## [2] 68 0
## [3] 68 0
## [4] 68 0
## [5] 68 0
## ... ... ...
## [12581676] 68 0
## [12581677] 68 0
## [12581678] 69 0
## [12581679] 68 0
## [12581680] 68 0
## -------
## seqinfo: 86 sequences from an unspecified genome
Similar to the importing the .fastq
files using ShortRead
, readGAlignments
has provided us with an object that can be manipulated using the standard vector conventions.
mybam[1:10]
## GAlignments object with 10 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 1 + 1S67M 68 10003 10069
## [2] 1 + 68M 68 10036 10103
## [3] 1 + 68M 68 10036 10103
## [4] 1 + 68M 68 10041 10108
## [5] 1 - 68M 68 10041 10108
## [6] 1 + 68M 68 10041 10108
## [7] 1 - 68M 68 10046 10113
## [8] 1 + 6M1I61M 68 10103 10169
## [9] 1 - 38M1D3M1D27M 68 10108 10177
## [10] 1 - 36M1D32M 68 10110 10178
## width njunc
## <integer> <integer>
## [1] 67 0
## [2] 68 0
## [3] 68 0
## [4] 68 0
## [5] 68 0
## [6] 68 0
## [7] 68 0
## [8] 67 0
## [9] 70 0
## [10] 69 0
## -------
## seqinfo: 86 sequences from an unspecified genome
There are also a number of accessor functions that can get particular items from the object; cigar
to obtain the CIGAR strings, start
/ end
to get the start and end positions, width
to get the width of each read.
The object we have created, bam
, contains only a small amount of the information available in a .bam
file. If we wish we can import extra fields such as the read sequence, mapping quality and flag:-
bam <- readGAlignments("paired.bam",param=ScanBamParam(what=c("seq","mapq","flag")))
bam
## GAlignments object with 3 alignments and 3 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] 1 + 1S67M 68 10003 10069 67
## [2] 1 + 68M 68 10036 10103 68
## [3] 1 + 68M 68 10036 10103 68
## njunc | seq mapq flag
## <integer> | <DNAStringSet> <integer> <integer>
## [1] 0 | GACCCTGACC...CTAACCCTAA 6 163
## [2] 0 | CTAAGCCTAA...CCCGAACGCT 5 99
## [3] 0 | CTAACCCTAA...CCCTAACCCT 14 99
## -------
## seqinfo: 86 sequences from an unspecified genome
The command takes longer to run, but we get more detail on each of the reads. The extra fields make up the metadata for each read and can be accessed using the mcols
function. If we save this metadata as an object, we can treat it as a data frame
and therefore have the usual $
operator to access the columns
meta <- mcols(bam)
The sequences are stored in Biostrings
format
meta$seq
## A DNAStringSet instance of length 12581680
## width seq
## [1] 68 GACCCTGACCCTAACCCTGACCCTGACC...TGACCCTAACCCTGACCCTAACCCTAA
## [2] 68 CTAAGCCTAACCCTAACCCTAACCCTAA...CCTAACCCTCAGCCTAACCCGAACGCT
## [3] 68 CTAACCCTAACCCTAACCCTAACCCTAA...CCTAACCCTAACCCTAACCCTAACCCT
## [4] 68 CCTAACCCTAACCCTAACCCTAACCCTA...CCCTAACCCTAACCCTAACCCTAACCC
## [5] 68 CCTAACCCTAACCCTAACCCTAACCCTA...CCCTAACCCTAACCCTAACCCTAACCC
## ... ... ...
## [12581676] 68 ATTCCATTCCCTTCCTTTCTTTCTTCAG...CACCCAGTATGGAGTGAAGTGCAGATT
## [12581677] 68 CACTTCATCACATTCCATTCGATTTCAT...AATCTTTCAATTTCATTCCATTCGATT
## [12581678] 68 TTCGAATCTTTCAATTTCATTCCATTCG...CCATTTTTTCCATTCGATACCATTCGA
## [12581679] 68 ATTCGATTCCTTTTGATTCCATATGATT...TTGGATTTGATTCAAAACCATTTGATT
## [12581680] 68 TCCGTTTGATTCCATTCCATTCGATTCC...CCACTCCGTTCCATTTCATTCCATTTG
The flags should be valid values as explained online
table(meta$flag)
##
## 83 99 147 163 1107 1123 1171 1187
## 3074883 3127385 3127385 3074312 43835 45023 45023 43834
Finally the mapping qualities are numeric quantities that will vary according to aligner
summary(meta$mapq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 30.00 39.00 32.87 40.00 53.00
hist(meta$mapq)
Rather than taking a whole-genome view, we often want to view the reads for a particular gene or region of interest. This we can do using the functions we have already seen.
my.reads <- which(seqnames(mybam)=="17" & start(mybam) > 7577851 & end(mybam) < 7598063)
mybam[my.reads]
## GAlignments object with 224 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 17 + 76M 76 7577918 7577993
## [2] 17 - 76M 76 7577930 7578005
## [3] 17 + 76M 76 7578191 7578266
## [4] 17 + 68M 68 7578195 7578262
## [5] 17 + 68M 68 7578203 7578270
## ... ... ... ... ... ... ...
## [220] 17 - 76M 76 7596411 7596486
## [221] 17 + 68M 68 7596523 7596590
## [222] 17 - 68M 68 7596639 7596706
## [223] 17 + 67M1S 68 7597833 7597899
## [224] 17 - 68M 68 7597888 7597955
## width njunc
## <integer> <integer>
## [1] 76 0
## [2] 76 0
## [3] 76 0
## [4] 68 0
## [5] 68 0
## ... ... ...
## [220] 76 0
## [221] 68 0
## [222] 68 0
## [223] 67 0
## [224] 68 0
## -------
## seqinfo: 86 sequences from an unspecified genome
However, there are much more efficient ways to do such queries in Bioconductor. First of all, we need to understand a bit more about how the data are being represented.
GenomicAlignments
is part of a family of packages that provide object-types and functionality for dealing with genomic intervals; which are described in a PLoS Computational Biology paper from 2013.
The basic type of interval we can define is an IRanges
object
ir <- IRanges(
start = c(7,9,12,14,22:24),
end=c(15,11,13,18,26,27,28))
ir
## IRanges object with 7 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 7 15 9
## [2] 9 11 3
## [3] 12 13 2
## [4] 14 18 5
## [5] 22 26 5
## [6] 23 27 5
## [7] 24 28 5
There is an extensive list of operations that we can perform on this object
shift(ir, 5)
## IRanges object with 7 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 12 20 9
## [2] 14 16 3
## [3] 17 18 2
## [4] 19 23 5
## [5] 27 31 5
## [6] 28 32 5
## [7] 29 33 5
resize(ir ,2)
## IRanges object with 7 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 7 8 2
## [2] 9 10 2
## [3] 12 13 2
## [4] 14 15 2
## [5] 22 23 2
## [6] 23 24 2
## [7] 24 25 2
reduce(ir)
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 7 18 12
## [2] 22 28 7
coverage(ir)
## integer-Rle of length 28 with 10 runs
## Lengths: 6 2 7 3 3 1 1 3 1 1
## Values : 0 1 2 1 0 1 2 3 2 1
To start using these operations in a genome context, we can create a GRanges
object. This has all the same information as an IRanges
object with extra capability to store the sequence name, strand and other (optional) metadata.
mygene <- GRanges("chr17", ranges=IRanges(7577851, 7598063),strand="+")
mygene
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr17 [7577851, 7598063] +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
A large set of IRanges
functionality is designed to deal with overlapping queries. Such operations are common in NGS analysis so we need to have an efficient way of performing them. You will probably come across such operations without even realising it.
The terminology of overlapping defines a query and a subject
The countOverlaps
function is be an ideal candidate for this kind of investigation.
countOverlaps(mygene, mybam)
## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
## suppressWarnings() to suppress this warning.)
## [1] 0
What is this this warning message telling us? It’s basically saying that the sequence names are not compatible. A major annonyance of dealing with NGS data is the way that different databases represent a relatively trivial piece of information such as chromosome name.
table(seqnames(mybam))
##
## 1 2 3 4 5 6
## 1222361 974687 760716 634872 656322 665565
## 7 8 9 10 11 12
## 621993 515407 500207 538097 605012 580508
## 13 14 15 16 17 18
## 305082 432772 400797 433895 501464 235618
## 19 20 21 22 X Y
## 422294 267243 139654 201444 314641 63743
## MT GL000207.1 GL000226.1 GL000229.1 GL000231.1 GL000210.1
## 22951 14 8508 208 186 76
## GL000239.1 GL000235.1 GL000201.1 GL000247.1 GL000245.1 GL000197.1
## 192 146 200 182 82 144
## GL000203.1 GL000246.1 GL000249.1 GL000196.1 GL000248.1 GL000244.1
## 76 130 174 154 100 448
## GL000238.1 GL000202.1 GL000234.1 GL000232.1 GL000206.1 GL000240.1
## 300 120 218 184 150 156
## GL000236.1 GL000241.1 GL000243.1 GL000242.1 GL000230.1 GL000237.1
## 222 378 300 98 84 804
## GL000233.1 GL000204.1 GL000198.1 GL000208.1 GL000191.1 GL000227.1
## 130 226 312 1032 546 1994
## GL000228.1 GL000214.1 GL000221.1 GL000209.1 GL000218.1 GL000220.1
## 1988 2344 2215 3734 1148 27398
## GL000213.1 GL000211.1 GL000199.1 GL000217.1 GL000216.1 GL000215.1
## 860 1514 2000 666 974 622
## GL000205.1 GL000219.1 GL000224.1 GL000223.1 GL000195.1 GL000212.1
## 1628 1066 3120 1140 2108 2030
## GL000222.1 GL000200.1 GL000193.1 GL000194.1 GL000225.1 GL000192.1
## 956 898 1162 1756 2133 4022
## NC_007605 hs37d5
## 92718 385861
table(seqnames(mygene))
##
## chr17
## 1
So before attempting an overlap between two different set of ranges, we need to make sure that their respective sequences names are compatible. One solution is to be more careful when creating our ranges.
mygene <- GRanges("17", ranges=IRanges(7577851, 7598063))
countOverlaps(mygene, mybam)
## [1] 225
The convenience function seqlevelsStyle
has been written to help us understand what covention has been used to name the chromosomes, and can even rename where appropriate.
mygene <- GRanges("chr17", ranges=IRanges(7577851, 7598063))
seqlevelsStyle(mygene)
## [1] "UCSC"
seqlevelsStyle(mybam)
## [1] "NCBI" "Ensembl"
seqlevelsStyle(mygene) <- "NCBI"
countOverlaps(mygene, mybam)
## [1] 225
If we want the actual reads themselves, we can use a convenient %over%
shortcut
mybam[mybam %over% mygene]
## GAlignments object with 225 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 17 + 76M 76 7577918 7577993
## [2] 17 - 76M 76 7577930 7578005
## [3] 17 + 76M 76 7578191 7578266
## [4] 17 + 68M 68 7578195 7578262
## [5] 17 + 68M 68 7578203 7578270
## ... ... ... ... ... ... ...
## [221] 17 + 68M 68 7596523 7596590
## [222] 17 - 68M 68 7596639 7596706
## [223] 17 + 67M1S 68 7597833 7597899
## [224] 17 - 68M 68 7597888 7597955
## [225] 17 + 68M 68 7598046 7598113
## width njunc
## <integer> <integer>
## [1] 76 0
## [2] 76 0
## [3] 76 0
## [4] 68 0
## [5] 68 0
## ... ... ...
## [221] 68 0
## [222] 68 0
## [223] 67 0
## [224] 68 0
## [225] 68 0
## -------
## seqinfo: 86 sequences from an unspecified genome
We might want to focus on a subset of reads from the start, is when we want to analyse a particular gene. Provided that the .bam
file has been indexed (creating a .bam.bai
file in the same directory), we can very quickly jump to a particular genomic region. Notice that this (should) give the same number of reads as before when we did a countOverlaps
operation.
system.time(mygene.reads <- readGAlignments(file="paired.bam",param=ScanBamParam(which=mygene)))
## user system elapsed
## 0.098 0.000 0.144
mygene.reads
## GAlignments object with 225 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 17 + 76M 76 7577918 7577993
## [2] 17 - 76M 76 7577930 7578005
## [3] 17 + 76M 76 7578191 7578266
## [4] 17 + 68M 68 7578195 7578262
## [5] 17 + 68M 68 7578203 7578270
## ... ... ... ... ... ... ...
## [221] 17 + 68M 68 7596523 7596590
## [222] 17 - 68M 68 7596639 7596706
## [223] 17 + 67M1S 68 7597833 7597899
## [224] 17 - 68M 68 7597888 7597955
## [225] 17 + 68M 68 7598046 7598113
## width njunc
## <integer> <integer>
## [1] 76 0
## [2] 76 0
## [3] 76 0
## [4] 68 0
## [5] 68 0
## ... ... ...
## [221] 68 0
## [222] 68 0
## [223] 67 0
## [224] 68 0
## [225] 68 0
## -------
## seqinfo: 86 sequences from an unspecified genome
The region filer can be used in conjuction with the what
argument to ScanBamParam
function to provide a detailed picture of the reads for your gene
mygene.reads <- readGAlignments(file="paired.bam",param=ScanBamParam(which=mygene, what=c("seq","mapq","flag","qual","isize")))
mygene.reads
## GAlignments object with 3 alignments and 5 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] 17 + 76M 76 7577918 7577993 76
## [2] 17 - 76M 76 7577930 7578005 76
## [3] 17 + 76M 76 7578191 7578266 76
## njunc | seq mapq flag
## <integer> | <DNAStringSet> <integer> <integer>
## [1] 0 | TCCGCCTGCC...GGGAGGCCCT 37 163
## [2] 0 | GGCCTCCCAA...GCCTCTGTAA 31 83
## [3] 0 | AGGGCACCAC...CCACTCGGAT 40 99
## qual isize
## <PhredQuality> <integer>
## [1] H;9476;:9:...@?9<085==J 87
## [2] ##########...>?><><<<>S -87
## [3] S=>>=<<><<...?>==??8??P 202
## -------
## seqinfo: 86 sequences from an unspecified genome
Aside from the many useful software packages, Bioconductor also provides numerous annotation resources that we can utilise in our analysis. Firstly, we have a set of organism-level packages that can translate between different types of identifer. The package for humans is called org.Hs.eg.db
. The advantage of such a package, rather than services such as biomaRt, is that we can do queries offline. The packages are updated every 6 months, so we can always be sure of what version of the relevant databases are being used.
library(org.Hs.eg.db)
org.Hs.eg.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2016-Mar14
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 20160305
## | GOEGSOURCEDATE: 2016-Mar14
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19
## | GPSOURCEDATE: 2010-Mar22
## | ENSOURCEDATE: 2016-Mar9
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Wed Mar 23 15:50:09 2016
There are several types of “key” we can use to make a query, and we have to specify one of these names.
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
For the given keytype we have chosen, we can also choose what data we want to retrieve. We can think of these as columns in a table, and the pre-defined values are given by:-
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
For example, if we want to know the “Entrez” ID for the gene BRCA1, we can do:-
eg <- select(org.Hs.eg.db, keys="BRCA1", keytype = "SYMBOL",columns = "ENTREZID")
eg
## SYMBOL ENTREZID
## 1 BRCA1 672
But we’re not restricted to just one type of information to return
eg <- select(org.Hs.eg.db, keys=c("BRCA1","PTEN"), keytype = "SYMBOL",columns = c("ENTREZID","ENSEMBL"))
## 'select()' returned 1:many mapping between keys and columns
You should see that the above command prints a message to the screen:- 'select()' returned 1:many mapping between keys and columns
. This is not an error message and R has still been able to generate the output requested.
eg
## SYMBOL ENTREZID ENSEMBL
## 1 BRCA1 672 ENSG00000012048
## 2 PTEN 5728 ENSG00000171862
## 3 PTEN 5728 ENSG00000283094
In this case, we have “many” (well, two) values of ENSEMBL
for the gene PTEN
. In practice this means we probably want to think carefully about merging this data with other tables.
You might expect to be able to retrieve information about the coordinates for a particular gene using the same interface. This was supported until recently, but the recommended approach now is to use another class of packages which describe the structure of genes in more detail.
The packages with the prefix TxDb....
represent the structure of all genes for a given organism in an efficient manner. For humans, we can use the package TxDb.Hsapiens.UCSC.hg19.knownGene
to tell us about transcripts for the hg19
build of the genome. The package was generated using tables from the UCSC genome browser
As with the org.Hs.eg.db
package we can load the package and inspect the kind of mappings available to us.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
columns(txdb)
## [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSSTART"
## [6] "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID" "EXONNAME"
## [11] "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID" "TXCHROM"
## [16] "TXEND" "TXID" "TXNAME" "TXSTART" "TXSTRAND"
## [21] "TXTYPE"
keytypes(txdb)
## [1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID"
## [7] "TXNAME"
You’ll see that all the mappings are regarding the coordinates and IDs of various genomic features. There is only one type of identifier used, in this case Entrez ID
. If we know a genes Entrez ID, we can get the exon coordinates with the following query.
mygene <- select(txdb, keys="672", columns = c("EXONCHROM","EXONSTART","EXONEND","EXONSTRAND"),keytype="GENEID")
mygene
## GENEID EXONCHROM EXONSTRAND EXONSTART EXONEND
## 1 672 chr17 - 41276034 41276132
## 2 672 chr17 - 41267743 41267796
## 3 672 chr17 - 41258473 41258550
## 4 672 chr17 - 41256885 41256973
## 5 672 chr17 - 41256139 41256278
## 6 672 chr17 - 41251792 41251897
## 7 672 chr17 - 41249261 41249306
## 8 672 chr17 - 41247863 41247939
## 9 672 chr17 - 41246761 41246877
## 10 672 chr17 - 41242961 41243049
## 11 672 chr17 - 41234421 41234592
## 12 672 chr17 - 41228505 41228628
## 13 672 chr17 - 41226348 41226538
## 14 672 chr17 - 41222945 41223255
## 15 672 chr17 - 41219625 41219712
## 16 672 chr17 - 41215891 41215968
## 17 672 chr17 - 41215350 41215390
## 18 672 chr17 - 41209069 41209152
## 19 672 chr17 - 41203080 41203134
## 20 672 chr17 - 41201138 41201211
## 21 672 chr17 - 41199660 41199720
## 22 672 chr17 - 41196312 41197819
## 23 672 chr17 - 41277199 41277340
## 24 672 chr17 - 41258495 41258550
## 25 672 chr17 - 41251792 41251894
## 26 672 chr17 - 41243452 41246877
## 27 672 chr17 - 41228505 41228631
## 28 672 chr17 - 41277294 41277468
## 29 672 chr17 - 41277288 41277500
## 30 672 chr17 - 41231351 41231416
## 31 672 chr17 - 41202079 41202207
## 32 672 chr17 - 41322143 41322420
## 33 672 chr17 - 41243117 41246877
## 34 672 chr17 - 41262482 41262597
PTEN
It is useful to be able to retrive the coordinates in this manner. However, we should now be familiar with the way intervals can be represented using GRanges
. We have the ability to create a GRanges
object from the result:-
my.gr <- GRanges(mygene$EXONCHROM, IRanges(mygene$EXONSTART,mygene$EXONEND))
my.gr
## GRanges object with 34 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr17 [41276034, 41276132] *
## [2] chr17 [41267743, 41267796] *
## [3] chr17 [41258473, 41258550] *
## [4] chr17 [41256885, 41256973] *
## [5] chr17 [41256139, 41256278] *
## ... ... ... ...
## [30] chr17 [41231351, 41231416] *
## [31] chr17 [41202079, 41202207] *
## [32] chr17 [41322143, 41322420] *
## [33] chr17 [41243117, 41246877] *
## [34] chr17 [41262482, 41262597] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
A powerful feature of the transcript database packages is to allow the user to create a GRanges
representation of all exons / transcripts for a particular organism. The function to do this is exonsBy
(or equivalently transcriptsBy
). The result is a list object that can be subset according to Entrez ID.
allex <- exonsBy(txdb, "gene")
mygene <- allex[["672"]]
mygene
## GRanges object with 34 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr17 [41196312, 41197819] - | 227457 <NA>
## [2] chr17 [41199660, 41199720] - | 227458 <NA>
## [3] chr17 [41201138, 41201211] - | 227459 <NA>
## [4] chr17 [41202079, 41202207] - | 227460 <NA>
## [5] chr17 [41203080, 41203134] - | 227461 <NA>
## ... ... ... ... . ... ...
## [30] chr17 [41276034, 41276132] - | 227486 <NA>
## [31] chr17 [41277199, 41277340] - | 227487 <NA>
## [32] chr17 [41277288, 41277500] - | 227488 <NA>
## [33] chr17 [41277294, 41277468] - | 227489 <NA>
## [34] chr17 [41322143, 41322420] - | 227496 <NA>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
We are almost in a position to overlap our reads with the GRanges
representation of our gene. First of all, we need the same trick from before to make sure the chromosome names are compatible
seqlevelsStyle(mybam)
## [1] "NCBI" "Ensembl"
seqlevelsStyle(mygene)
## [1] "UCSC"
seqlevelsStyle(mygene) <- "Ensembl"
mygene
## GRanges object with 34 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 17 [41196312, 41197819] - | 227457 <NA>
## [2] 17 [41199660, 41199720] - | 227458 <NA>
## [3] 17 [41201138, 41201211] - | 227459 <NA>
## [4] 17 [41202079, 41202207] - | 227460 <NA>
## [5] 17 [41203080, 41203134] - | 227461 <NA>
## ... ... ... ... . ... ...
## [30] 17 [41276034, 41276132] - | 227486 <NA>
## [31] 17 [41277199, 41277340] - | 227487 <NA>
## [32] 17 [41277288, 41277500] - | 227488 <NA>
## [33] 17 [41277294, 41277468] - | 227489 <NA>
## [34] 17 [41322143, 41322420] - | 227496 <NA>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Furthermore, since we know we are only interested in the reads from chromosome 17, we could subset the reads accordingly.
mybam.chr17 <- keepSeqlevels(mybam, "17")
mybam.chr17[mybam.chr17 %over% mygene]
## GAlignments object with 345 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] 17 - 76M 76 41197373 41197448
## [2] 17 - 68M 68 41197677 41197744
## [3] 17 - 68M 68 41197707 41197774
## [4] 17 - 76M 76 41197734 41197809
## [5] 17 - 68M 68 41197780 41197847
## ... ... ... ... ... ... ...
## [341] 17 - 68M 68 41276093 41276160
## [342] 17 - 76M 76 41276099 41276174
## [343] 17 - 68M 68 41277212 41277279
## [344] 17 - 68M 68 41277341 41277408
## [345] 17 - 68M 68 41322151 41322218
## width njunc
## <integer> <integer>
## [1] 76 0
## [2] 68 0
## [3] 68 0
## [4] 76 0
## [5] 68 0
## ... ... ...
## [341] 68 0
## [342] 76 0
## [343] 68 0
## [344] 68 0
## [345] 68 0
## -------
## seqinfo: 1 sequence from an unspecified genome
NOTE:- it also seems that the Mitochondial sequence is a different length in the reference that these reads were aligned to, compared to the transcript database. It may seem like a trivial difference, but will cause an error if we don’t perform the subset to just chromosome 17.
Error in mergeNamedAtomicVectors(seqlengths(x), seqlengths(y), what = c("sequence", :
sequence MT has incompatible seqlengths:
- in 'x': 16569
- in 'y': 16571
Per-exon counts can be achieved using the countOverlaps
function:-
countOverlaps(mygene,mybam.chr17)
## [1] 8 6 4 0 1 12 3 3 5 24 5 8 8 1 13 2 207
## [18] 204 4 1 2 7 7 18 7 2 2 0 0 4 1 1 1 1
And finally, we should note that we could subset just the reads for this gene with the following set of commands. The only thing we need to know in advance is the Entrez ID of our gene.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
allex <- exonsBy(txdb, "gene")
mygene <- allex[["672"]]
seqlevelsStyle(mygene) <- "Ensembl"
mygene <- keepSeqlevels(mygene,"17")
reads <- readGAlignments("paired.bam",param=ScanBamParam(which=mygene))
A final note is that we can also export any ranges we have created to a file (e.g. a .bed
file) for further processing or visualisation in IGV.
library(rtracklayer)
export(mygene, con="mygene.bed")
paired.bam
that align to the gene PTENWe have explored the properties of bam files using Bioconductor. The techniques and types of object we have learnt about will crop-up again-and-again in the course. The vast majority of NGS analysis tools in Bioconductor will use GenomicRanges
and TxDb
objects in some form.
Due to the high-volume of the dataset, some of the tools and pipelines we use will not be in R. However, you will still be able to interrogate the results you obtain and explore them in more detail using R.