Representing sequencing data in R and Bioconductor

Mark Dunning

Last modified: 16 Mar 2016

Overview

Aims

Motivation

Core data-type 1: Genome Intervals

IRanges

IRanges is crucial for many packages

Just some of the packages that depend on IRanges

iranges-depends

IRanges paper

bioc paper

Example

Suppose we want to capture information on the following intervals

Creating the object

library(IRanges)
ir <- IRanges(
start = c(7,9,12,14,22:24), 
end=c(15,11,13,18,26,27,28))
str(ir)
## Formal class 'IRanges' [package "IRanges"] with 6 slots
##   ..@ start          : int [1:7] 7 9 12 14 22 23 24
##   ..@ width          : int [1:7] 9 3 2 5 5 5 5
##   ..@ NAMES          : NULL
##   ..@ elementType    : chr "integer"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

Display the object

ir
## IRanges of length 7
##     start end width
## [1]     7  15     9
## [2]     9  11     3
## [3]    12  13     2
## [4]    14  18     5
## [5]    22  26     5
## [6]    23  27     5
## [7]    24  28     5

Adding metadata

We can give our ranges names

ir <- IRanges(
start = c(7,9,12,14,22:24), 
end=c(15,11,13,18,26,27,28),names=LETTERS[1:7])
ir
## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G

Ranges as vectors

ir[1:2]
## IRanges of length 2
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
ir[c(2,4,6)]
## IRanges of length 3
##     start end width names
## [1]     9  11     3     B
## [2]    14  18     5     D
## [3]    23  27     5     F

Accessing the object

start(ir)
## [1]  7  9 12 14 22 23 24
end(ir)
## [1] 15 11 13 18 26 27 28
width(ir)
## [1] 9 3 2 5 5 5 5

More-complex subsetting

width(ir) == 5
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
ir[width(ir)==5]
## IRanges of length 4
##     start end width names
## [1]    14  18     5     D
## [2]    22  26     5     E
## [3]    23  27     5     F
## [4]    24  28     5     G

More-complex subsetting

start(ir) > 10
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
end(ir) < 27
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
ir[start(ir) > 10]
## IRanges of length 5
##     start end width names
## [1]    12  13     2     C
## [2]    14  18     5     D
## [3]    22  26     5     E
## [4]    23  27     5     F
## [5]    24  28     5     G

More-complex subsetting

ir[end(ir) < 27]
## IRanges of length 5
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
ir[start(ir) > 10 & end(ir) < 27]
## IRanges of length 3
##     start end width names
## [1]    12  13     2     C
## [2]    14  18     5     D
## [3]    22  26     5     E

Manipulating Ranges

Lots of common use-cases are implemented

operations

Shifting

e.g. sliding windows

ir
## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G
shift(ir, 5)
## IRanges of length 7
##     start end width names
## [1]    12  20     9     A
## [2]    14  16     3     B
## [3]    17  18     2     C
## [4]    19  23     5     D
## [5]    27  31     5     E
## [6]    28  32     5     F
## [7]    29  33     5     G

Shifting

Shifting

Size of shift doesn’t need to be constant

ir
## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G
shift(ir, 7:1)
## IRanges of length 7
##     start end width names
## [1]    14  22     9     A
## [2]    15  17     3     B
## [3]    17  18     2     C
## [4]    18  22     5     D
## [5]    25  29     5     E
## [6]    25  29     5     F
## [7]    25  29     5     G

Shifting

Resize

e.g. trimming reads

ir
## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G
resize(ir,3)
## IRanges of length 7
##     start end width names
## [1]     7   9     3     A
## [2]     9  11     3     B
## [3]    12  14     3     C
## [4]    14  16     3     D
## [5]    22  24     3     E
## [6]    23  25     3     F
## [7]    24  26     3     G

Resize

Coverage

coverage returns a Run Length Encoding - an efficient representation of repeated values

cvg <- coverage(ir)
cvg
## integer-Rle of length 28 with 10 runs
##   Lengths: 6 2 7 3 3 1 1 3 1 1
##   Values : 0 1 2 1 0 1 2 3 2 1
as.vector(cvg)
##  [1] 0 0 0 0 0 0 1 1 2 2 2 2 2 2 2 1 1 1 0 0 0 1 2 3 3 3 2 1

Coverage Results

Overlapping

e.g. counting - The terminology of overlapping defines a query and a subject overlaps

Overlaps

ir3 <- IRanges(start = c(1, 14, 27), end = c(13,
    18, 30),names=c("X","Y","Z"))
ir3
## IRanges of length 3
##     start end width names
## [1]     1  13    13     X
## [2]    14  18     5     Y
## [3]    27  30     4     Z

Overlaps

Overlaps

query <- ir
subject <- ir3
ov <- findOverlaps(query, subject)
ov
## Hits object with 7 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         1           2
##   [3]         2           1
##   [4]         3           1
##   [5]         4           2
##   [6]         6           3
##   [7]         7           3
##   -------
##   queryLength: 7
##   subjectLength: 3

queryHits and subjectHits

queryHits(ov)
## [1] 1 1 2 3 4 6 7
subjectHits(ov)
## [1] 1 2 1 1 2 3 3

Overlap example - First hit

query[queryHits(ov)[1]]
## IRanges of length 1
##     start end width names
## [1]     7  15     9     A
subject[subjectHits(ov)[1]]
## IRanges of length 1
##     start end width names
## [1]     1  13    13     X

Overlap example - second hit

query[queryHits(ov)[2]]
## IRanges of length 1
##     start end width names
## [1]     7  15     9     A
subject[subjectHits(ov)[2]]
## IRanges of length 1
##     start end width names
## [1]    14  18     5     Y

Overlap example - Third hit

query[queryHits(ov)[3]]
## IRanges of length 1
##     start end width names
## [1]     9  11     3     B
subject[subjectHits(ov)[3]]
## IRanges of length 1
##     start end width names
## [1]     1  13    13     X

Counting

countOverlaps(query,subject)
## A B C D E F G 
## 2 1 1 1 0 1 1
countOverlaps(subject,query)
## X Y Z 
## 3 2 2

Modify overlap criteria

findOverlaps(query,subject,type="within")
## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         2           1
##   [2]         3           1
##   [3]         4           2
##   -------
##   queryLength: 7
##   subjectLength: 3

More stringent overlap

findOverlaps(query,subject,type="within")
## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         2           1
##   [2]         3           1
##   [3]         4           2
##   -------
##   queryLength: 7
##   subjectLength: 3

More stringent overlap

findOverlaps(query,subject,type="within")
## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         2           1
##   [2]         3           1
##   [3]         4           2
##   -------
##   queryLength: 7
##   subjectLength: 3

Intersection

intersect(ir,ir3)
## IRanges of length 2
##     start end width
## [1]     7  18    12
## [2]    27  28     2

Subtraction

setdiff(ir,ir3)
## IRanges of length 1
##     start end width
## [1]    22  26     5

Core data-type 2: DNA sequences

Biostrings

The Biostrings package is specifically-designed for biological sequences

library(Biostrings)
myseq <- DNAStringSet(randomStrings)
myseq
##   A DNAStringSet instance of length 100
##       width seq
##   [1]    13 ACACAAGTGACCA
##   [2]    13 CCACCCGGTAACA
##   [3]    17 CGGTAAGGTACGGTTAT
##   [4]    11 CATTCGTTTAT
##   [5]    12 TACTGCACCAAG
##   ...   ... ...
##  [96]    13 GCCCTCCACACGC
##  [97]    10 CCCGGTCAGT
##  [98]    18 ACTCTTGACCAACAACTC
##  [99]    19 AAGCGTACGGTTGCAGACG
## [100]    18 TTGCCAAATTCTTGTATG

Object structure

str(myseq)
## Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
##   ..@ pool           :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
##   .. .. ..@ xp_list                    :List of 1
##   .. .. .. ..$ :<externalptr> 
##   .. .. ..@ .link_to_cached_object_list:List of 1
##   .. .. .. ..$ :<environment: 0x7f8d810> 
##   ..@ ranges         :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
##   .. .. ..@ group          : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. ..@ start          : int [1:100] 1 14 27 44 55 67 77 93 105 117 ...
##   .. .. ..@ width          : int [1:100] 13 13 17 11 12 10 16 12 12 18 ...
##   .. .. ..@ NAMES          : NULL
##   .. .. ..@ elementType    : chr "integer"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ elementType    : chr "DNAString"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

Biostrings operations

myseq[1:5]
##   A DNAStringSet instance of length 5
##     width seq
## [1]    13 ACACAAGTGACCA
## [2]    13 CCACCCGGTAACA
## [3]    17 CGGTAAGGTACGGTTAT
## [4]    11 CATTCGTTTAT
## [5]    12 TACTGCACCAAG

Accessor functions

width(myseq)
##   [1] 13 13 17 11 12 10 16 12 12 18 17 15 14 12 10 17 12 20 10 18 12 18 12
##  [24] 17 17 16 17 19 15 12 20 20 12 15 15 12 20 16 16 14 12 19 19 17 15 18
##  [47] 19 11 13 20 18 13 14 20 18 10 19 10 15 17 12 17 12 11 11 16 17 15 17
##  [70] 11 15 19 11 14 12 20 16 19 14 14 19 13 17 12 18 18 15 13 11 11 13 19
##  [93] 15 12 11 13 10 18 19 18
head(as.character(myseq))
## [1] "ACACAAGTGACCA"     "CCACCCGGTAACA"     "CGGTAAGGTACGGTTAT"
## [4] "CATTCGTTTAT"       "TACTGCACCAAG"      "CCTTGCTAGC"

Accessor functions

What does this do?

myseq[width(myseq)>19]
##   A DNAStringSet instance of length 7
##     width seq
## [1]    20 AGGGCGTCTGTCCGACCATA
## [2]    20 CGTCAATGTCATCCGCCCCT
## [3]    20 TAACATCTAACCGGCCATTT
## [4]    20 CGAATTTATGGGAGTCGTGA
## [5]    20 ACCAATAGAACACTGGGCTC
## [6]    20 GAAGAGTCTTAACATAGATC
## [7]    20 GTTTCCATGGCCAGGTGATC

More advanced subsetting

myseq[subseq(myseq,1,3) == "TTC"]
##   A DNAStringSet instance of length 4
##     width seq
## [1]    14 TTCCTTTACCGATA
## [2]    16 TTCAGGGGGGAGAAGA
## [3]    18 TTCACCTTCACTAAAGTG
## [4]    18 TTCATACCCTGAAATTAT

We can also use the matchPattern function + see practical for details

Other useful operations

Some useful string operation functions are provided

af <- alphabetFrequency(myseq, baseOnly=TRUE)
head(af)
##      A C G T other
## [1,] 6 4 2 1     0
## [2,] 4 6 2 1     0
## [3,] 4 2 6 5     0
## [4,] 2 2 1 6     0
## [5,] 4 4 2 2     0
## [6,] 1 4 2 3     0

Letter frequencies

myseq[af[,1] ==0,]
##   A DNAStringSet instance of length 1
##     width seq
## [1]    12 CCTCTCTCCTTT
boxplot(af)

More-specialised features

reverse(myseq)
##   A DNAStringSet instance of length 100
##       width seq
##   [1]    13 ACCAGTGAACACA
##   [2]    13 ACAATGGCCCACC
##   [3]    17 TATTGGCATGGAATGGC
##   [4]    11 TATTTGCTTAC
##   [5]    12 GAACCACGTCAT
##   ...   ... ...
##  [96]    13 CGCACACCTCCCG
##  [97]    10 TGACTGGCCC
##  [98]    18 CTCAACAACCAGTTCTCA
##  [99]    19 GCAGACGTTGGCATGCGAA
## [100]    18 GTATGTTCTTAAACCGTT
reverseComplement(myseq)
##   A DNAStringSet instance of length 100
##       width seq
##   [1]    13 TGGTCACTTGTGT
##   [2]    13 TGTTACCGGGTGG
##   [3]    17 ATAACCGTACCTTACCG
##   [4]    11 ATAAACGAATG
##   [5]    12 CTTGGTGCAGTA
##   ...   ... ...
##  [96]    13 GCGTGTGGAGGGC
##  [97]    10 ACTGACCGGG
##  [98]    18 GAGTTGTTGGTCAAGAGT
##  [99]    19 CGTCTGCAACCGTACGCTT
## [100]    18 CATACAAGAATTTGGCAA
translate(myseq)
##   A AAStringSet instance of length 100
##       width seq
##   [1]     4 TQVT
##   [2]     4 PPGN
##   [3]     5 R*GTV
##   [4]     3 HSF
##   [5]     4 YCTK
##   ...   ... ...
##  [96]     4 ALHT
##  [97]     3 PGQ
##  [98]     6 TLDQQL
##  [99]     6 KRTVAD
## [100]     6 LPNSCM

Fastq recap

Recall that sequence reads are represented in text format

readLines(path.to.my.fastq ,n=10)

It should be possible to represent these as Biostrings objects

The ShortRead package

One of the first NGS packages in Bioconductor

library(ShortRead)
fq <- readFastq(path.to.my.fastq)
fq

Practical application - Representing the genome

The genome as a string - BSgenome

library(BSgenome)
head(available.genomes())
## [1] "BSgenome.Alyrata.JGI.v1"                
## [2] "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.UCSC.apiMel2"       
## [4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [5] "BSgenome.Athaliana.TAIR.04232008"       
## [6] "BSgenome.Athaliana.TAIR.TAIR9"

Various versions of the human genome

ag <- available.genomes()
ag[grep("Hsapiens",ag)]
##  [1] "BSgenome.Hsapiens.1000genomes.hs37d5"
##  [2] "BSgenome.Hsapiens.NCBI.GRCh38"       
##  [3] "BSgenome.Hsapiens.UCSC.hg17"         
##  [4] "BSgenome.Hsapiens.UCSC.hg17.masked"  
##  [5] "BSgenome.Hsapiens.UCSC.hg18"         
##  [6] "BSgenome.Hsapiens.UCSC.hg18.masked"  
##  [7] "BSgenome.Hsapiens.UCSC.hg19"         
##  [8] "BSgenome.Hsapiens.UCSC.hg19.masked"  
##  [9] "BSgenome.Hsapiens.UCSC.hg38"         
## [10] "BSgenome.Hsapiens.UCSC.hg38.masked"

The latest human genome

library(BSgenome.Hsapiens.UCSC.hg19)
hg19 <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
hg19
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Chromosome-level sequence

head(names(hg19))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
chrX <- hg19[["chrX"]]
chrX
##   155270560-letter "DNAString" instance
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
alphabetFrequency(chrX,baseOnly=TRUE)
##        A        C        G        T    other 
## 45648952 29813353 29865831 45772424  4170000

Retrieving sequences

tp53 <- getSeq(hg19, "chr17", 7577851, 7590863)
tp53
##   13013-letter "DNAString" instance
## seq: TTGTATTTTTCAGTAGAGACGGGGTTTCACCGTT...GTCTTGAGCACATGGGAGGGGAAAACCCCAATC
as.character(tp53[1:10])
## [1] "TTGTATTTTT"
alphabetFrequency(tp53,baseOnly=TRUE)
##     A     C     G     T other 
##  3102  3375  3025  3511     0
subseq(tp53, 1000,1010)
##   11-letter "DNAString" instance
## seq: TATAGGTGTGC

Timings

Don’t need to load the whole genome into memory, so reading a particular sequence is fast

system.time(tp53 <- getSeq(hg19, "chr17", 7577851, 7598063))
##    user  system elapsed 
##   0.117   0.000   0.118

Manipulating sequences

We can now use Biostrings operations to manipulate the sequence

translate(subseq(tp53, 1000,1010))
## Warning in .Call2("DNAStringSet_translate", x, skip_code,
## dna_codes[codon_alphabet], : last 2 bases were ignored
##   3-letter "AAString" instance
## seq: YRC
reverseComplement(subseq(tp53, 1000,2000))
##   1001-letter "DNAString" instance
## seq: CCTATGGAAACTGTGAGTGGATCCATTGGAAGGG...AAAATTAGCCAGGCATGGTGGTGCACACCTATA

Introducing GRanges

library(GenomicRanges)
gr <- GRanges(c("A","A","A","B","B","B","B"), ranges=ir)
gr
## GRanges object with 7 ranges and 0 metadata columns:
##     seqnames    ranges strand
##        <Rle> <IRanges>  <Rle>
##   A        A  [ 7, 15]      *
##   B        A  [ 9, 11]      *
##   C        A  [12, 13]      *
##   D        B  [14, 18]      *
##   E        B  [22, 26]      *
##   F        B  [23, 27]      *
##   G        B  [24, 28]      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

GRanges with metadata

We can add extra metadata to these ranges

mcols(gr) <- data.frame(Count = runif(length(gr)), Gene =sample(LETTERS,length(gr)))
gr
## GRanges object with 7 ranges and 2 metadata columns:
##     seqnames    ranges strand |     Count     Gene
##        <Rle> <IRanges>  <Rle> | <numeric> <factor>
##   A        A  [ 7, 15]      * | 0.7864672        J
##   B        A  [ 9, 11]      * | 0.9716962        A
##   C        A  [12, 13]      * | 0.2972768        B
##   D        B  [14, 18]      * | 0.5064363        H
##   E        B  [22, 26]      * | 0.3862037        U
##   F        B  [23, 27]      * | 0.7934070        M
##   G        B  [24, 28]      * | 0.6414842        F
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr[mcols(gr)$Count > 0.5]
## GRanges object with 5 ranges and 2 metadata columns:
##     seqnames    ranges strand |     Count     Gene
##        <Rle> <IRanges>  <Rle> | <numeric> <factor>
##   A        A  [ 7, 15]      * | 0.7864672        J
##   B        A  [ 9, 11]      * | 0.9716962        A
##   D        B  [14, 18]      * | 0.5064363        H
##   F        B  [23, 27]      * | 0.7934070        M
##   G        B  [24, 28]      * | 0.6414842        F
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Representing a gene

mygene <- GRanges("chr17", ranges=IRanges(7577851, 7598063))
myseq <- getSeq(hg19, mygene)
myseq
##   A DNAStringSet instance of length 1
##     width seq
## [1] 20213 TTGTATTTTTCAGTAGAGACGGGGTTTCACC...CTACTTGGGAGGCTGAGGTGGGAGGATCGCT
tp53
##   20213-letter "DNAString" instance
## seq: TTGTATTTTTCAGTAGAGACGGGGTTTCACCGTT...AGCTACTTGGGAGGCTGAGGTGGGAGGATCGCT

Intermission

Work through section 1 of the practical

Practical application - Manipulating Aligned Reads

Dealing with aligned reads

We will assume that the sequencing reads have been aligned and that we are interested in processing the alignments.

library(GenomicAlignments)
bam <- readGAlignments(mybam,use.names = TRUE)
str(bam)
## Formal class 'GAlignments' [package "GenomicAlignments"] with 8 slots
##   ..@ NAMES          : chr [1:175346] "SRR031715.1138209" "SRR031714.776678" "SRR031715.3258011" "SRR031715.4791418" ...
##   ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
##   .. .. ..@ values         : Factor w/ 8 levels "chr2L","chr2R",..: 5
##   .. .. ..@ lengths        : int 175346
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ start          : int [1:175346] 169 184 187 193 326 943 944 946 946 957 ...
##   ..@ cigar          : chr [1:175346] "37M" "37M" "37M" "37M" ...
##   ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
##   .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 1 2 1 2 1 2 1 2 1 2 ...
##   .. .. ..@ lengths        : int [1:37319] 1 2 1 1 3 2 3 10 3 1 ...
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ elementMetadata:Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
##   .. .. ..@ rownames       : NULL
##   .. .. ..@ nrows          : int 175346
##   .. .. ..@ listData       : Named list()
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
##   .. .. ..@ seqnames   : chr [1:8] "chr2L" "chr2R" "chr3L" "chr3R" ...
##   .. .. ..@ seqlengths : int [1:8] 23011544 21146708 24543557 27905053 1351857 19517 22422827 347038
##   .. .. ..@ is_circular: logi [1:8] NA NA NA NA NA NA ...
##   .. .. ..@ genome     : chr [1:8] NA NA NA NA ...
##   ..@ metadata       : list()

Representation of aligned reads

The result looks a lot like a GRanges object. In fact, a lot of the same operations can be used

bam
## GAlignments object with 175346 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031715.1138209     chr4      +         37M        37
##    SRR031714.776678     chr4      -         37M        37
##   SRR031715.3258011     chr4      -         37M        37
##   SRR031715.4791418     chr4      +         37M        37
##   SRR031715.1138209     chr4      -         37M        37
##                 ...      ...    ...         ...       ...
##   SRR031714.1650928     chr4      +         37M        37
##   SRR031714.1650928     chr4      -         37M        37
##   SRR031714.5192891     chr4      +         37M        37
##   SRR031715.2351056     chr4      +         37M        37
##    SRR031714.864195     chr4      +         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031715.1138209       169       205        37         0
##    SRR031714.776678       184       220        37         0
##   SRR031715.3258011       187       223        37         0
##   SRR031715.4791418       193       229        37         0
##   SRR031715.1138209       326       362        37         0
##                 ...       ...       ...       ...       ...
##   SRR031714.1650928   1349708   1349744        37         0
##   SRR031714.1650928   1349838   1349874        37         0
##   SRR031714.5192891   1351640   1351676        37         0
##   SRR031715.2351056   1351640   1351676        37         0
##    SRR031714.864195   1351760   1351796        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Accessing particular reads

length(bam)
## [1] 175346
bam[1:5]
## GAlignments object with 5 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031715.1138209     chr4      +         37M        37
##    SRR031714.776678     chr4      -         37M        37
##   SRR031715.3258011     chr4      -         37M        37
##   SRR031715.4791418     chr4      +         37M        37
##   SRR031715.1138209     chr4      -         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031715.1138209       169       205        37         0
##    SRR031714.776678       184       220        37         0
##   SRR031715.3258011       187       223        37         0
##   SRR031715.4791418       193       229        37         0
##   SRR031715.1138209       326       362        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome
bam[sample(1:length(bam),5)]
## GAlignments object with 5 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031715.1038030     chr4      +         37M        37
##   SRR031714.2298029     chr4      -         37M        37
##   SRR031715.2746340     chr4      +         37M        37
##   SRR031714.3459405     chr4      +         37M        37
##   SRR031714.4323180     chr4      -         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031715.1038030     50440     50476        37         0
##   SRR031714.2298029   1051440   1051476        37         0
##   SRR031715.2746340    947137    947173        37         0
##   SRR031714.3459405    137653    137689        37         0
##   SRR031714.4323180   1037390   1037426        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Querying alignments

table(strand(bam))
## 
##     +     -     * 
## 84871 90475     0
summary(width(bam))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    37.00    37.00    37.00    58.72    37.00 19350.00
range(start(bam))
## [1]     169 1351760
head(cigar(bam))
## [1] "37M" "37M" "37M" "37M" "37M" "37M"

Overlap aligned reads with GRanges

gr <- GRanges("chr4", IRanges(start = 20000, end = 20100))
gr
## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]     chr4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
findOverlaps(gr,bam)
## Hits object with 12 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1        6699
##    [2]         1        6700
##    [3]         1        6701
##    [4]         1        6702
##    [5]         1        6703
##    ...       ...         ...
##    [8]         1        6706
##    [9]         1        6707
##   [10]         1        6708
##   [11]         1        6709
##   [12]         1        6710
##   -------
##   queryLength: 1
##   subjectLength: 175346

Identifying the reads

A shortcut

bam.sub <- bam[bam %over% gr]
bam.sub
## GAlignments object with 12 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031714.4092638     chr4      -         37M        37
##   SRR031714.4275537     chr4      -         37M        37
##   SRR031715.1315719     chr4      -         37M        37
##   SRR031715.1502533     chr4      -         37M        37
##    SRR031714.336402     chr4      -         37M        37
##                 ...      ...    ...         ...       ...
##   SRR031715.3358559     chr4      +         37M        37
##   SRR031715.4831822     chr4      +         37M        37
##   SRR031715.4459351     chr4      +         37M        37
##   SRR031715.2716654     chr4      -         37M        37
##   SRR031715.1552693     chr4      +         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031714.4092638     19968     20004        37         0
##   SRR031714.4275537     19968     20004        37         0
##   SRR031715.1315719     19968     20004        37         0
##   SRR031715.1502533     19968     20004        37         0
##    SRR031714.336402     19971     20007        37         0
##                 ...       ...       ...       ...       ...
##   SRR031715.3358559     19974     20010        37         0
##   SRR031715.4831822     19975     20011        37         0
##   SRR031715.4459351     19981     20017        37         0
##   SRR031715.2716654     19986     20022        37         0
##   SRR031715.1552693     20046     20082        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Chromosome naming conventions

gr <- GRanges("4", IRanges(start = 20000, end = 20100))
gr
## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]        4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
findOverlaps(gr,bam)
## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)
## Hits object with 0 hits and 0 metadata columns:
##    queryHits subjectHits
##    <integer>   <integer>
##   -------
##   queryLength: 1
##   subjectLength: 175346

Solution

gr
## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]        4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr <- renameSeqlevels(gr, c("4"="chr4"))
gr
## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]     chr4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Finer-control over reading

?ScanBamParam

Example: adding mapping quality, base quality and flag

bam.extra <- readGAlignments(file=mybam,param=ScanBamParam(what=c("mapq","qual","flag")))
bam.extra[1:5]
## GAlignments object with 5 alignments and 3 metadata columns:
##       seqnames strand       cigar    qwidth     start
##          <Rle>  <Rle> <character> <integer> <integer>
##   [1]     chr4      +         37M        37       169
##   [2]     chr4      -         37M        37       184
##   [3]     chr4      -         37M        37       187
##   [4]     chr4      +         37M        37       193
##   [5]     chr4      -         37M        37       326
##             end     width     njunc |      mapq
##       <integer> <integer> <integer> | <integer>
##   [1]       205        37         0 |       255
##   [2]       220        37         0 |       255
##   [3]       223        37         0 |       255
##   [4]       229        37         0 |       255
##   [5]       362        37         0 |       255
##                                        qual      flag
##                              <PhredQuality> <integer>
##   [1] IIIIIIIIIIIIIIIIIIIIIIIIII8IIIIIIIGII        99
##   [2] IIIIIIIIEIIIIIIIIIIIIIIIIIIIIIIIIIIII       153
##   [3] II6II7IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        89
##   [4] IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFII3I       137
##   [5] ++I+4-05>*I2GF/II6IIIIIIIIIIIIIIIII<I       147
##   -------
##   seqinfo: 8 sequences from an unspecified genome
table(mcols(bam.extra)$flag)
## 
##    65    73    81    83    89    97    99   113   129   137 
##    29  4891 14737 23037  7769 14781 22791    34    29  4576 
##   145   147   153   161   163   177 
## 14781 22791  7292 14737 23037    34

Example: Dealing with PCR duplicates

dupReads <- readGAlignments(file=mybam,param=ScanBamParam(scanBamFlag(isDuplicate = TRUE)))
nodupReads <- readGAlignments(file=mybam,param=ScanBamParam(scanBamFlag(isDuplicate = FALSE)))
allreads <- readGAlignments(file=mybam,param=ScanBamParam(scanBamFlag(isDuplicate = NA)))
length(dupReads)
## [1] 0
length(nodupReads)
## [1] 175346
length(allreads)
## [1] 175346
length(allreads) - length(dupReads)
## [1] 175346

Reading a particular region

bam.sub2 <-
  readGAlignments(file=mybam,param=ScanBamParam(which=gr),use.names = TRUE)
length(bam.sub2)
## [1] 14
bam.sub2
## GAlignments object with 14 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031714.4100693     chr4      +  31M7704N6M        37
##   SRR031715.5248298     chr4      +  29M7704N8M        37
##   SRR031714.4092638     chr4      -         37M        37
##   SRR031714.4275537     chr4      -         37M        37
##   SRR031715.1315719     chr4      -         37M        37
##                 ...      ...    ...         ...       ...
##   SRR031715.3358559     chr4      +         37M        37
##   SRR031715.4831822     chr4      +         37M        37
##   SRR031715.4459351     chr4      +         37M        37
##   SRR031715.2716654     chr4      -         37M        37
##   SRR031715.1552693     chr4      +         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031714.4100693     13660     21400      7741         1
##   SRR031715.5248298     13662     21402      7741         1
##   SRR031714.4092638     19968     20004        37         0
##   SRR031714.4275537     19968     20004        37         0
##   SRR031715.1315719     19968     20004        37         0
##                 ...       ...       ...       ...       ...
##   SRR031715.3358559     19974     20010        37         0
##   SRR031715.4831822     19975     20011        37         0
##   SRR031715.4459351     19981     20017        37         0
##   SRR031715.2716654     19986     20022        37         0
##   SRR031715.1552693     20046     20082        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Recap

Now, work through Section 2 of the practical