IRanges

A genome is typically represented as a linear sequence
A range is an ordered set of consecutive integers defined by a start and an end position
- start $\le$ end
Ranges are a common scaffold for many genomic analyses
Ranges can be associated with genomic information (e.g. gene name) or data derived from analysis (e.g. counts)
The IRanges package in Bioconductor allows us to work with intervals
- one of the aims of Bioconductor is to encourage core object-types and functions
- IRanges is an example of this

IRanges is crucial for many packages

Just some of the packages that depend on IRanges

[Figure: packages that depend on IRanges]

Example

Suppose we want to capture information on the following intervals

Creating the object

The IRanges function from the IRanges package is used to construct a new object
- think data.frame, vector or matrix
- its structure is quite unlike anything we’ve seen before

library(IRanges)
ir <- IRanges(
start = c(7,9,12,14,22:24), 
end=c(15,11,13,18,26,27,28))
str(ir)

## Formal class 'IRanges' [package "IRanges"] with 6 slots
##   ..@ start          : int [1:7] 7 9 12 14 22 23 24
##   ..@ width          : int [1:7] 9 3 2 5 5 5 5
##   ..@ NAMES          : NULL
##   ..@ elementType    : chr "integer"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()
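
The width slot is not something we supplied; it follows from the start and end positions, because both ends are included in a range. A minimal base-R sketch of that arithmetic (the variable names are illustrative, and this is not how the package computes it internally):

```r
# Start and end positions used to build `ir` above
starts <- c(7, 9, 12, 14, 22:24)
ends   <- c(15, 11, 13, 18, 26, 27, 28)

# Both endpoints belong to the range, hence the + 1
widths <- ends - starts + 1
widths
```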

Display the object

Typing the name of the object will print a summary of the object to the screen
- useful compared to display methods for data frames, which print the whole object
- the square brackets [] should give a hint about how to access the data

ir

## IRanges of length 7
##     start end width
## [1]     7  15     9
## [2]     9  11     3
## [3]    12  13     2
## [4]    14  18     5
## [5]    22  26     5
## [6]    23  27     5
## [7]    24  28     5

Adding metadata

We can give our ranges names

ir <- IRanges(
start = c(7,9,12,14,22:24), 
end=c(15,11,13,18,26,27,28),names=LETTERS[1:7])
ir

## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G

Ranges as vectors

IRanges can be treated as if they were vectors
- no new rules to learn
  - if we can subset vectors, we can subset ranges
- vector operations are efficient
- Remember, we use square brackets [ ] to subset
- Inside the brackets, put a numeric vector to specify the indices that you want values for
  - e.g. get the first two intervals in the object using the : shortcut

ir[1:2]

## IRanges of length 2
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B

ir[c(2,4,6)]

## IRanges of length 3
##     start end width names
## [1]     9  11     3     B
## [2]    14  18     5     D
## [3]    23  27     5     F

Accessing the object

If we want to extract the properties of the object, the package authors have provided some useful functions
- we call these accessor functions
- We don’t need to know the details of how the objects are implemented to access the data
- the authors are free to change the implementation at any time
  - we shouldn’t notice the difference
- the result is a vector with the same length as the number of intervals

start(ir)

## [1]  7  9 12 14 22 23 24

end(ir)

## [1] 15 11 13 18 26 27 28

width(ir)

## [1] 9 3 2 5 5 5 5

More-complex subsetting

Recall that ‘logical’ vectors can be used in subsetting
- i.e. TRUE or FALSE
Such a vector can be derived using a comparison operator
- <, >, ==

width(ir) == 5

## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

ir[width(ir)==5]

## IRanges of length 4
##     start end width names
## [1]    14  18     5     D
## [2]    22  26     5     E
## [3]    23  27     5     F
## [4]    24  28     5     G

More-complex subsetting

start(ir) > 10

## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

end(ir) < 27

## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

ir[start(ir) > 10]

## IRanges of length 5
##     start end width names
## [1]    12  13     2     C
## [2]    14  18     5     D
## [3]    22  26     5     E
## [4]    23  27     5     F
## [5]    24  28     5     G

More-complex subsetting

Multiple logical vectors can be combined using & (and), | (or)
- e.g. intervals that start after 10 and end before 27

ir[end(ir) < 27]

## IRanges of length 5
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E

ir[start(ir) > 10 & end(ir) < 27]

## IRanges of length 3
##     start end width names
## [1]    12  13     2     C
## [2]    14  18     5     D
## [3]    22  26     5     E
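
The slides only show &. As a rough illustration of how |, & and ! behave, here is a base-R sketch on plain named vectors holding the same starts and ends (the vector names are made up for this example; with an IRanges object the logical vectors would come from start(ir) and end(ir) instead):

```r
starts <- c(A = 7, B = 9, C = 12, D = 14, E = 22, F = 23, G = 24)
ends   <- c(A = 15, B = 11, C = 13, D = 18, E = 26, F = 27, G = 28)

# AND: both conditions must hold
keep_and <- starts > 10 & ends < 27
names(starts)[keep_and]    # "C" "D" "E"

# OR: at least one condition must hold
keep_or <- starts > 10 | ends < 12
names(starts)[keep_or]     # "B" "C" "D" "E" "F" "G"

# NOT: flip a logical vector
keep_not <- !(starts > 10)
names(starts)[keep_not]    # "A" "B"
```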

Lots of common use-cases are implemented

[Figure: operations on ranges]

Shifting

e.g. sliding windows

ir

## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G

shift(ir, 5)

## IRanges of length 7
##     start end width names
## [1]    12  20     9     A
## [2]    14  16     3     B
## [3]    17  18     2     C
## [4]    19  23     5     D
## [5]    27  31     5     E
## [6]    28  32     5     F
## [7]    29  33     5     G

Shifting

Size of shift doesn’t need to be constant

ir

## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G

shift(ir, 7:1)

## IRanges of length 7
##     start end width names
## [1]    14  22     9     A
## [2]    15  17     3     B
## [3]    17  18     2     C
## [4]    18  22     5     D
## [5]    25  29     5     E
## [6]    25  29     5     F
## [7]    25  29     5     G
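
Conceptually, a shift adds an offset to both the start and the end, so the width is unchanged. A base-R sketch of the arithmetic behind shift(ir, 7:1), using plain vectors (the names are illustrative; this is not the package's implementation):

```r
starts <- c(7, 9, 12, 14, 22, 23, 24)
ends   <- c(15, 11, 13, 18, 26, 27, 28)
offset <- 7:1   # one offset per range

shifted_starts <- starts + offset   # 14 15 17 18 25 25 25
shifted_ends   <- ends + offset     # 22 17 18 22 29 29 29

# Widths are the same before and after shifting
all((shifted_ends - shifted_starts) == (ends - starts))
```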

Resize

e.g. trimming reads

ir

## IRanges of length 7
##     start end width names
## [1]     7  15     9     A
## [2]     9  11     3     B
## [3]    12  13     2     C
## [4]    14  18     5     D
## [5]    22  26     5     E
## [6]    23  27     5     F
## [7]    24  28     5     G

resize(ir,3)

## IRanges of length 7
##     start end width names
## [1]     7   9     3     A
## [2]     9  11     3     B
## [3]    12  14     3     C
## [4]    14  16     3     D
## [5]    22  24     3     E
## [6]    23  25     3     F
## [7]    24  26     3     G
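
In the output above every start is unchanged and only the end moves. A base-R sketch of that arithmetic, plus what anchoring at the end instead would give (resize has a fix argument for choosing the anchor; the variable names here are illustrative only):

```r
starts    <- c(7, 9, 12, 14, 22, 23, 24)
ends      <- c(15, 11, 13, 18, 26, 27, 28)
new_width <- 3

# Anchored at the start: keep the start, recompute the end
ends_fix_start <- starts + new_width - 1     # 9 11 14 16 24 25 26

# Anchored at the end: keep the end, recompute the start
starts_fix_end <- ends - new_width + 1       # 13 9 11 16 24 25 26
```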

Coverage

Often we want to know how much sequencing we have at particular positions
- i.e. depth of coverage

coverage returns a Run Length Encoding - an efficient representation of repeated values

cvg <- coverage(ir)
cvg

## integer-Rle of length 28 with 10 runs
##   Lengths: 6 2 7 3 3 1 1 3 1 1
##   Values : 0 1 2 1 0 1 2 3 2 1

as.vector(cvg)

##  [1] 0 0 0 0 0 0 1 1 2 2 2 2 2 2 2 1 1 1 0 0 0 1 2 3 3 3 2 1
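
To see where these numbers come from, the same depth vector can be rebuilt in base R by listing every covered position and counting. This is only a sketch for small examples (it expands every range, which would be wasteful on real data), and rle here is base R's run-length encoding rather than the Rle class returned by coverage:

```r
starts <- c(7, 9, 12, 14, 22, 23, 24)
ends   <- c(15, 11, 13, 18, 26, 27, 28)

# Every position covered by every range, repeated where ranges overlap
positions <- unlist(Map(seq, starts, ends))

# Depth at positions 1..max(ends): how many ranges cover each position
depth <- tabulate(positions, nbins = max(ends))

# Run-length encoding of the depth vector
runs <- rle(depth)
runs$lengths   # 6 2 7 3 3 1 1 3 1 1
runs$values    # 0 1 2 1 0 1 2 3 2 1
```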

Overlapping

e.g. counting
- the terminology of overlapping defines a query and a subject

Overlaps

Let’s start by defining a new set of ranges

ir3 <- IRanges(start = c(1, 14, 27), end = c(13,
    18, 30),names=c("X","Y","Z"))
ir3

## IRanges of length 3
##     start end width names
## [1]     1  13    13     X
## [2]    14  18     5     Y
## [3]    27  30     4     Z

Overlaps

The findOverlaps function is used to find overlaps
- the output isn’t immediately obvious
- length of output is the number of hits
  - each hit is defined by a subject and query index
- we need accessor functions to get the data: queryHits and subjectHits

query <- ir
subject <- ir3
ov <- findOverlaps(query, subject)
ov

## Hits object with 7 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         1           2
##   [3]         2           1
##   [4]         3           1
##   [5]         4           2
##   [6]         6           3
##   [7]         7           3
##   -------
##   queryLength: 7
##   subjectLength: 3

queryHits and subjectHits

queryHits returns indices from the query
- each query may overlap with many in the subject

queryHits(ov)

## [1] 1 1 2 3 4 6 7

subjectHits returns indices from the subject
- each subject range may overlap with many in the query

subjectHits(ov)

## [1] 1 2 1 1 2 3 3

e.g. range 1 from the query overlaps with range 1 from the subject

Overlap example - First hit

query[queryHits(ov)[1]]

## IRanges of length 1
##     start end width names
## [1]     7  15     9     A

subject[subjectHits(ov)[1]]

## IRanges of length 1
##     start end width names
## [1]     1  13    13     X

Overlap example - second hit

query[queryHits(ov)[2]]

## IRanges of length 1
##     start end width names
## [1]     7  15     9     A

subject[subjectHits(ov)[2]]

## IRanges of length 1
##     start end width names
## [1]    14  18     5     Y

Overlap example - Third hit

query[queryHits(ov)[3]]

## IRanges of length 1
##     start end width names
## [1]     9  11     3     B

subject[subjectHits(ov)[3]]

## IRanges of length 1
##     start end width names
## [1]     1  13    13     X

Counting

If we just wanted to count the number of overlaps for each range, we can use countOverlaps
- result is a vector with length the number of intervals in query
- e.g. interval 1 in the query overlaps with 2 intervals in the subject

countOverlaps(query,subject)

## A B C D E F G 
## 2 1 1 1 0 1 1

Order of arguments is important

countOverlaps(subject,query)

## X Y Z 
## 3 2 2
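
Both sets of counts can be recovered from the hit indices shown earlier: counting how often each query index appears gives the first result, and counting subject indices gives the second. A base-R sketch using the index vectors copied from the queryHits and subjectHits output:

```r
query_hits   <- c(1, 1, 2, 3, 4, 6, 7)
subject_hits <- c(1, 2, 1, 1, 2, 3, 3)

# How many hits involve each of the 7 query ranges (A..G)
per_query <- tabulate(query_hits, nbins = 7)       # 2 1 1 1 0 1 1

# How many hits involve each of the 3 subject ranges (X, Y, Z)
per_subject <- tabulate(subject_hits, nbins = 3)   # 3 2 2
```

Swapping the arguments swaps which set of ranges is being counted, which is why the two results have different lengths.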

Modify overlap criteria

There are various ways of defining an overlap
We can be more stringent by requiring each query range to lie entirely within a subject range

findOverlaps(query,subject,type="within")

## Hits object with 3 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         2           1
##   [2]         3           1
##   [3]         4           2
##   -------
##   queryLength: 7
##   subjectLength: 3
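
With type="within", a hit is reported only when the query starts at or after the subject start and ends at or before the subject end. A brute-force base-R sketch of that rule (again a definition check on all pairs, not the package's algorithm):

```r
q_start <- c(7, 9, 12, 14, 22, 23, 24)      # query: ir
q_end   <- c(15, 11, 13, 18, 26, 27, 28)
s_start <- c(1, 14, 27)                     # subject: ir3
s_end   <- c(13, 18, 30)

# within[i, j] is TRUE when query i lies entirely inside subject j
within <- outer(q_start, s_start, ">=") & outer(q_end, s_end, "<=")

hits_within <- which(within, arr.ind = TRUE)
hits_within <- hits_within[order(hits_within[, 1], hits_within[, 2]), ]
colnames(hits_within) <- c("queryHits", "subjectHits")
hits_within
```

Range A (7 to 15) drops out because it extends beyond X (1 to 13), and F and G drop out because they start before Z (27 to 30).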

Intersection

Rather than counting, we might want to know which positions are in common

intersect(ir,ir3)

## IRanges of length 2
##     start end width
## [1]     7  18    12
## [2]    27  28     2

Subtraction

Or which positions are missing

setdiff(ir,ir3)

## IRanges of length 1
##     start end width
## [1]    22  26     5
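
Both results are set operations on the positions covered by each object, reported back as ranges. The base-R sketch below expands the ranges to positions, applies the base intersect and setdiff to those integers, and collapses consecutive runs into ranges again. The helper names expand and collapse are made up for this illustration.

```r
# All distinct positions covered by a set of ranges
expand <- function(starts, ends) sort(unique(unlist(Map(seq, starts, ends))))

# Collapse sorted positions into runs of consecutive integers
collapse <- function(pos) {
  run <- cumsum(c(TRUE, diff(pos) != 1))
  data.frame(start = as.vector(tapply(pos, run, min)),
             end   = as.vector(tapply(pos, run, max)))
}

pos_ir  <- expand(c(7, 9, 12, 14, 22, 23, 24), c(15, 11, 13, 18, 26, 27, 28))
pos_ir3 <- expand(c(1, 14, 27), c(13, 18, 30))

common  <- collapse(intersect(pos_ir, pos_ir3))   # positions in both
missing <- collapse(setdiff(pos_ir, pos_ir3))     # positions in ir only
common
missing
```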

Core data-type 2: DNA sequences

Biostrings

The Biostrings package is specifically designed for biological sequences

It introduces a new object type, the DNAStringSet, for storing sequences
We can create an object of this type by using the DNAStringSet function
Typing the name of your new object prints a summary to the screen

library(Biostrings)
# randomStrings: a character vector of DNA sequences (its creation is not shown here)
myseq <- DNAStringSet(randomStrings)
myseq

##   A DNAStringSet instance of length 100
##       width seq
##   [1]    13 ACACAAGTGACCA
##   [2]    13 CCACCCGGTAACA
##   [3]    17 CGGTAAGGTACGGTTAT
##   [4]    11 CATTCGTTTAT
##   [5]    12 TACTGCACCAAG
##   ...   ... ...
##  [96]    13 GCCCTCCACACGC
##  [97]    10 CCCGGTCAGT
##  [98]    18 ACTCTTGACCAACAACTC
##  [99]    19 AAGCGTACGGTTGCAGACG
## [100]    18 TTGCCAAATTCTTGTATG

Object structure

The definition of the object is not for the faint-hearted

str(myseq)

## Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
##   ..@ pool           :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
##   .. .. ..@ xp_list                    :List of 1
##   .. .. .. ..$ :<externalptr> 
##   .. .. ..@ .link_to_cached_object_list:List of 1
##   .. .. .. ..$ :<environment: 0x7f8d810> 
##   ..@ ranges         :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
##   .. .. ..@ group          : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. ..@ start          : int [1:100] 1 14 27 44 55 67 77 93 105 117 ...
##   .. .. ..@ width          : int [1:100] 13 13 17 11 12 10 16 12 12 18 ...
##   .. .. ..@ NAMES          : NULL
##   .. .. ..@ elementType    : chr "integer"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ elementType    : chr "DNAString"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

Biostrings operations

However, we can treat a Biostrings object like a standard vector

myseq[1:5]

##   A DNAStringSet instance of length 5
##     width seq
## [1]    13 ACACAAGTGACCA
## [2]    13 CCACCCGGTAACA
## [3]    17 CGGTAAGGTACGGTTAT
## [4]    11 CATTCGTTTAT
## [5]    12 TACTGCACCAAG

Accessor functions

If we want to do a calculation on the widths or the sequences themselves, we can extract them with width and as.character
- the result is a vector

width(myseq)

##   [1] 13 13 17 11 12 10 16 12 12 18 17 15 14 12 10 17 12 20 10 18 12 18 12
##  [24] 17 17 16 17 19 15 12 20 20 12 15 15 12 20 16 16 14 12 19 19 17 15 18
##  [47] 19 11 13 20 18 13 14 20 18 10 19 10 15 17 12 17 12 11 11 16 17 15 17
##  [70] 11 15 19 11 14 12 20 16 19 14 14 19 13 17 12 18 18 15 13 11 11 13 19
##  [93] 15 12 11 13 10 18 19 18

head(as.character(myseq))

## [1] "ACACAAGTGACCA"     "CCACCCGGTAACA"     "CGGTAAGGTACGGTTAT"
## [4] "CATTCGTTTAT"       "TACTGCACCAAG"      "CCTTGCTAGC"

Accessor functions

What does this do?

myseq[width(myseq)>19]

##   A DNAStringSet instance of length 7
##     width seq
## [1]    20 AGGGCGTCTGTCCGACCATA
## [2]    20 CGTCAATGTCATCCGCCCCT
## [3]    20 TAACATCTAACCGGCCATTT
## [4]    20 CGAATTTATGGGAGTCGTGA
## [5]    20 ACCAATAGAACACTGGGCTC
## [6]    20 GAAGAGTCTTAACATAGATC
## [7]    20 GTTTCCATGGCCAGGTGATC

More advanced subsetting

myseq[subseq(myseq,1,3) == "TTC"]

##   A DNAStringSet instance of length 4
##     width seq
## [1]    14 TTCCTTTACCGATA
## [2]    16 TTCAGGGGGGAGAAGA
## [3]    18 TTCACCTTCACTAAAGTG
## [4]    18 TTCATACCCTGAAATTAT

We can also use the matchPattern function; see the practical for details
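
The subseq comparison above has a direct analogue on ordinary character vectors, which may help to see what is going on: take the first three letters of each sequence, compare them to a pattern, and use the logical result to subset. A base-R sketch on a few sequences copied from the outputs above:

```r
seqs <- c("ACACAAGTGACCA", "TTCCTTTACCGATA", "CATTCGTTTAT", "TTCAGGGGGGAGAAGA")

# First three letters of each sequence, compared to the pattern
starts_ttc <- substr(seqs, 1, 3) == "TTC"
starts_ttc            # FALSE TRUE FALSE TRUE

seqs[starts_ttc]      # the sequences beginning with TTC
nchar(seqs)           # plays the role of width()
```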

Other useful operations

Some useful string operation functions are provided

af <- alphabetFrequency(myseq, baseOnly=TRUE)
head(af)

##      A C G T other
## [1,] 6 4 2 1     0
## [2,] 4 6 2 1     0
## [3,] 4 2 6 5     0
## [4,] 2 2 1 6     0
## [5,] 4 4 2 2     0
## [6,] 1 4 2 3     0
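
Each row is simply a count of the letters in one sequence. A base-R sketch that reproduces the first two rows from the corresponding strings (count_bases is a made-up helper for this illustration; alphabetFrequency is the function to use on real Biostrings objects):

```r
seqs  <- c("ACACAAGTGACCA", "CCACCCGGTAACA")
bases <- c("A", "C", "G", "T")

# Count how often each base occurs in a single string
count_bases <- function(s) {
  chars <- strsplit(s, "")[[1]]
  vapply(bases, function(b) sum(chars == b), integer(1))
}

# One row per sequence, one column per base
af_sketch <- t(sapply(seqs, count_bases))
af_sketch
```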

Letter frequencies

myseq[af[, "A"] == 0]

##   A DNAStringSet instance of length 1
##     width seq
## [1]    12 CCTCTCTCCTTT

boxplot(af)

More-specialised features

reverse(myseq)

##   A DNAStringSet instance of length 100
##       width seq
##   [1]    13 ACCAGTGAACACA
##   [2]    13 ACAATGGCCCACC
##   [3]    17 TATTGGCATGGAATGGC
##   [4]    11 TATTTGCTTAC
##   [5]    12 GAACCACGTCAT
##   ...   ... ...
##  [96]    13 CGCACACCTCCCG
##  [97]    10 TGACTGGCCC
##  [98]    18 CTCAACAACCAGTTCTCA
##  [99]    19 GCAGACGTTGGCATGCGAA
## [100]    18 GTATGTTCTTAAACCGTT

reverseComplement(myseq)

##   A DNAStringSet instance of length 100
##       width seq
##   [1]    13 TGGTCACTTGTGT
##   [2]    13 TGTTACCGGGTGG
##   [3]    17 ATAACCGTACCTTACCG
##   [4]    11 ATAAACGAATG
##   [5]    12 CTTGGTGCAGTA
##   ...   ... ...
##  [96]    13 GCGTGTGGAGGGC
##  [97]    10 ACTGACCGGG
##  [98]    18 GAGTTGTTGGTCAAGAGT
##  [99]    19 CGTCTGCAACCGTACGCTT
## [100]    18 CATACAAGAATTTGGCAA

translate(myseq)

##   A AAStringSet instance of length 100
##       width seq
##   [1]     4 TQVT
##   [2]     4 PPGN
##   [3]     5 R*GTV
##   [4]     3 HSF
##   [5]     4 YCTK
##   ...   ... ...
##  [96]     4 ALHT
##  [97]     3 PGQ
##  [98]     6 TLDQQL
##  [99]     6 KRTVAD
## [100]     6 LPNSCM
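
A reverse complement is the reversed sequence with each base swapped for its partner (A with T, C with G). A base-R sketch on the first sequence, using made-up helper names rev_string and rev_comp; it handles only the four plain bases, unlike the Biostrings functions:

```r
# Reverse the order of the letters in a string
rev_string <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")

# Reverse, then swap each base for its complement
rev_comp <- function(s) chartr("ACGT", "TGCA", rev_string(s))

rev_string("ACACAAGTGACCA")   # "ACCAGTGAACACA"
rev_comp("ACACAAGTGACCA")     # "TGGTCACTTGTGT"
```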

Fastq recap

Recall that sequence reads are represented in text format

readLines(path.to.my.fastq ,n=10)

It should be possible to represent these as Biostrings objects

The `ShortRead` package

One of the first NGS packages in Bioconductor

Has convenient functions for reading fastq files and performing quality assessment
- In practice, we would use other tools for processing fastq files
- e.g. fastqc for quality assessment

library(ShortRead)
fq <- readFastq(path.to.my.fastq)
fq

Practical application - Representing the genome

The genome as a string - `BSgenome`

library(BSgenome)
head(available.genomes())

## [1] "BSgenome.Alyrata.JGI.v1"                
## [2] "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.UCSC.apiMel2"       
## [4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [5] "BSgenome.Athaliana.TAIR.04232008"       
## [6] "BSgenome.Athaliana.TAIR.TAIR9"

Various versions of the human genome

ag <- available.genomes()
ag[grep("Hsapiens",ag)]

##  [1] "BSgenome.Hsapiens.1000genomes.hs37d5"
##  [2] "BSgenome.Hsapiens.NCBI.GRCh38"       
##  [3] "BSgenome.Hsapiens.UCSC.hg17"         
##  [4] "BSgenome.Hsapiens.UCSC.hg17.masked"  
##  [5] "BSgenome.Hsapiens.UCSC.hg18"         
##  [6] "BSgenome.Hsapiens.UCSC.hg18.masked"  
##  [7] "BSgenome.Hsapiens.UCSC.hg19"         
##  [8] "BSgenome.Hsapiens.UCSC.hg19.masked"  
##  [9] "BSgenome.Hsapiens.UCSC.hg38"         
## [10] "BSgenome.Hsapiens.UCSC.hg38.masked"

The human genome (hg19)

library(BSgenome.Hsapiens.UCSC.hg19)
hg19 <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
hg19

## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Chromosome-level sequence

The genome object can be accessed at the chromosome level
The names of the object are chromosome names
- we can use the list accessor [[ ]] to get a chromosome sequence
- result is a DNAString
  - which we have various tools for dealing with

head(names(hg19))

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

chrX <- hg19[["chrX"]]
chrX

##   155270560-letter "DNAString" instance
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

alphabetFrequency(chrX,baseOnly=TRUE)

##        A        C        G        T    other 
## 45648952 29813353 29865831 45772424  4170000

Retrieving sequences

Of course, we might just want the sequence of a particular region (e.g. a gene)
We can use getSeq to do this

tp53 <- getSeq(hg19, "chr17", 7577851, 7590863)
tp53

##   13013-letter "DNAString" instance
## seq: TTGTATTTTTCAGTAGAGACGGGGTTTCACCGTT...GTCTTGAGCACATGGGAGGGGAAAACCCCAATC

as.character(tp53[1:10])

## [1] "TTGTATTTTT"

alphabetFrequency(tp53,baseOnly=TRUE)

##     A     C     G     T other 
##  3102  3375  3025  3511     0

subseq(tp53, 1000,1010)

##   11-letter "DNAString" instance
## seq: TATAGGTGTGC
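
The lengths reported above follow from the coordinates, and positions inside the extracted sequence can be mapped back to the chromosome. A small arithmetic sketch (variable names are illustrative):

```r
region_start <- 7577851
region_end   <- 7590863

# Both endpoints are included, so the region is 13013 letters long
region_width <- region_end - region_start + 1

# subseq(tp53, 1000, 1010) covers 11 letters ...
sub_from  <- 1000
sub_to    <- 1010
sub_width <- sub_to - sub_from + 1

# ... which sit at these positions on chr17
genomic_from <- region_start + sub_from - 1   # 7578850
genomic_to   <- region_start + sub_to - 1     # 7578860
```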

Timings

We don’t need to load the whole genome into memory, so reading a particular sequence is fast

system.time(tp53 <- getSeq(hg19, "chr17", 7577851, 7598063))

##    user  system elapsed 
##   0.117   0.000   0.118

Manipulating sequences

We can now use Biostrings operations to manipulate the sequence

translate(subseq(tp53, 1000,1010))

## Warning in .Call2("DNAStringSet_translate", x, skip_code,
## dna_codes[codon_alphabet], : last 2 bases were ignored

##   3-letter "AAString" instance
## seq: YRC
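
The warning arises because translation reads three bases at a time, and 11 is not a multiple of 3. A base-R sketch that splits the same 11 letters into codons shows the two bases left over (this only does the splitting, not the lookup in the genetic code):

```r
s <- "TATAGGTGTGC"   # subseq(tp53, 1000, 1010) from above

n_codons <- nchar(s) %/% 3   # 3 complete codons
leftover <- nchar(s) %% 3    # 2 bases that cannot form a codon

codon_starts <- seq(1, by = 3, length.out = n_codons)
codons <- substring(s, codon_starts, codon_starts + 2)
codons                       # "TAT" "AGG" "TGT", translated as Y, R, C
```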

reverseComplement(subseq(tp53, 1000,2000))

##   1001-letter "DNAString" instance
## seq: CCTATGGAAACTGTGAGTGGATCCATTGGAAGGG...AAAATTAGCCAGGCATGGTGGTGCACACCTATA

Introducing GRanges

A GRanges object builds on IRanges to represent genomic intervals in an efficient manner
We can define a ‘chromosome’ for each range
- referred to as seqnames
We have the option to define a strand
We need to supply a ranges object, as we saw before
Operations on ranges respect the chromosome labels

library(GenomicRanges)
gr <- GRanges(c("A","A","A","B","B","B","B"), ranges=ir)
gr

## GRanges object with 7 ranges and 0 metadata columns:
##     seqnames    ranges strand
##        <Rle> <IRanges>  <Rle>
##   A        A  [ 7, 15]      *
##   B        A  [ 9, 11]      *
##   C        A  [12, 13]      *
##   D        B  [14, 18]      *
##   E        B  [22, 26]      *
##   F        B  [23, 27]      *
##   G        B  [24, 28]      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

GRanges with metadata

We can add extra metadata to these ranges

mcols can be set to a data frame with one row for each range
- e.g. counts, gene names
- anything you like!

mcols(gr) <- data.frame(Count = runif(length(gr)), Gene =sample(LETTERS,length(gr)))
gr

## GRanges object with 7 ranges and 2 metadata columns:
##     seqnames    ranges strand |     Count     Gene
##        <Rle> <IRanges>  <Rle> | <numeric> <factor>
##   A        A  [ 7, 15]      * | 0.7864672        J
##   B        A  [ 9, 11]      * | 0.9716962        A
##   C        A  [12, 13]      * | 0.2972768        B
##   D        B  [14, 18]      * | 0.5064363        H
##   E        B  [22, 26]      * | 0.3862037        U
##   F        B  [23, 27]      * | 0.7934070        M
##   G        B  [24, 28]      * | 0.6414842        F
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

gr[mcols(gr)$Count > 0.5]

## GRanges object with 5 ranges and 2 metadata columns:
##     seqnames    ranges strand |     Count     Gene
##        <Rle> <IRanges>  <Rle> | <numeric> <factor>
##   A        A  [ 7, 15]      * | 0.7864672        J
##   B        A  [ 9, 11]      * | 0.9716962        A
##   D        B  [14, 18]      * | 0.5064363        H
##   F        B  [23, 27]      * | 0.7934070        M
##   G        B  [24, 28]      * | 0.6414842        F
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Representing a gene

Creating an object to represent a particular gene is easy if we know its coordinates
- we will look at representing the full gene structure tomorrow
  - e.g. exons, introns etc

mygene <- GRanges("chr17", ranges=IRanges(7577851, 7598063))
myseq <- getSeq(hg19, mygene)
myseq

##   A DNAStringSet instance of length 1
##     width seq
## [1] 20213 TTGTATTTTTCAGTAGAGACGGGGTTTCACC...CTACTTGGGAGGCTGAGGTGGGAGGATCGCT

tp53

##   20213-letter "DNAString" instance
## seq: TTGTATTTTTCAGTAGAGACGGGGTTTCACCGTT...AGCTACTTGGGAGGCTGAGGTGGGAGGATCGCT

Intermission

Work through section 1 of the practical

Examples of creating IRanges and GRanges objects
Accessing genome packages
Manipulating genome sequences

Practical application - Manipulating Aligned Reads

Dealing with aligned reads

We will assume that the sequencing reads have been aligned and that we are interested in processing the alignments.

Rsamtools provides an interface for doing this.
However, we will use the readGAlignments tool in GenomicAlignments which extracts the essential information from the bam file.
- don’t even attempt to understand the data structure!

library(GenomicAlignments)
bam <- readGAlignments(mybam,use.names = TRUE)
str(bam)

## Formal class 'GAlignments' [package "GenomicAlignments"] with 8 slots
##   ..@ NAMES          : chr [1:175346] "SRR031715.1138209" "SRR031714.776678" "SRR031715.3258011" "SRR031715.4791418" ...
##   ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
##   .. .. ..@ values         : Factor w/ 8 levels "chr2L","chr2R",..: 5
##   .. .. ..@ lengths        : int 175346
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ start          : int [1:175346] 169 184 187 193 326 943 944 946 946 957 ...
##   ..@ cigar          : chr [1:175346] "37M" "37M" "37M" "37M" ...
##   ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
##   .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 1 2 1 2 1 2 1 2 1 2 ...
##   .. .. ..@ lengths        : int [1:37319] 1 2 1 1 3 2 3 10 3 1 ...
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ elementMetadata:Formal class 'DataFrame' [package "S4Vectors"] with 6 slots
##   .. .. ..@ rownames       : NULL
##   .. .. ..@ nrows          : int 175346
##   .. .. ..@ listData       : Named list()
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
##   .. .. ..@ seqnames   : chr [1:8] "chr2L" "chr2R" "chr3L" "chr3R" ...
##   .. .. ..@ seqlengths : int [1:8] 23011544 21146708 24543557 27905053 1351857 19517 22422827 347038
##   .. .. ..@ is_circular: logi [1:8] NA NA NA NA NA NA ...
##   .. .. ..@ genome     : chr [1:8] NA NA NA NA ...
##   ..@ metadata       : list()

Representation of aligned reads

The result looks a lot like a GRanges object. In fact, a lot of the same operations can be used

bam

## GAlignments object with 175346 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031715.1138209     chr4      +         37M        37
##    SRR031714.776678     chr4      -         37M        37
##   SRR031715.3258011     chr4      -         37M        37
##   SRR031715.4791418     chr4      +         37M        37
##   SRR031715.1138209     chr4      -         37M        37
##                 ...      ...    ...         ...       ...
##   SRR031714.1650928     chr4      +         37M        37
##   SRR031714.1650928     chr4      -         37M        37
##   SRR031714.5192891     chr4      +         37M        37
##   SRR031715.2351056     chr4      +         37M        37
##    SRR031714.864195     chr4      +         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031715.1138209       169       205        37         0
##    SRR031714.776678       184       220        37         0
##   SRR031715.3258011       187       223        37         0
##   SRR031715.4791418       193       229        37         0
##   SRR031715.1138209       326       362        37         0
##                 ...       ...       ...       ...       ...
##   SRR031714.1650928   1349708   1349744        37         0
##   SRR031714.1650928   1349838   1349874        37         0
##   SRR031714.5192891   1351640   1351676        37         0
##   SRR031715.2351056   1351640   1351676        37         0
##    SRR031714.864195   1351760   1351796        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Accessing particular reads

Yet again, we can treat the object as a vector

length(bam)

## [1] 175346

bam[1:5]

## GAlignments object with 5 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031715.1138209     chr4      +         37M        37
##    SRR031714.776678     chr4      -         37M        37
##   SRR031715.3258011     chr4      -         37M        37
##   SRR031715.4791418     chr4      +         37M        37
##   SRR031715.1138209     chr4      -         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031715.1138209       169       205        37         0
##    SRR031714.776678       184       220        37         0
##   SRR031715.3258011       187       223        37         0
##   SRR031715.4791418       193       229        37         0
##   SRR031715.1138209       326       362        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

bam[sample(1:length(bam),5)]

## GAlignments object with 5 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031715.1038030     chr4      +         37M        37
##   SRR031714.2298029     chr4      -         37M        37
##   SRR031715.2746340     chr4      +         37M        37
##   SRR031714.3459405     chr4      +         37M        37
##   SRR031714.4323180     chr4      -         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031715.1038030     50440     50476        37         0
##   SRR031714.2298029   1051440   1051476        37         0
##   SRR031715.2746340    947137    947173        37         0
##   SRR031714.3459405    137653    137689        37         0
##   SRR031714.4323180   1037390   1037426        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Querying alignments

As usual, there are a variety of accessor functions to get data from the object

table(strand(bam))

## 
##     +     -     * 
## 84871 90475     0

summary(width(bam))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    37.00    37.00    37.00    58.72    37.00 19350.00

range(start(bam))

## [1]     169 1351760

head(cigar(bam))

## [1] "37M" "37M" "37M" "37M" "37M" "37M"
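
Every read here is 37 bases long, yet the width summary has a maximum of 19350. One plausible explanation is spliced alignments: in the SAM format the operations M, D, N, = and X consume reference positions, so a long N (skipped region) makes the alignment span far more of the reference than the read length. A base-R sketch of that calculation; cigar_ref_width is a made-up helper, and the second and third CIGAR strings are hypothetical examples, not taken from this bam file:

```r
# Reference width implied by a single CIGAR string
cigar_ref_width <- function(cigar) {
  ops  <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
  lens <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  # Only these operations advance along the reference
  sum(lens[ops %in% c("M", "D", "N", "=", "X")])
}

cigar_ref_width("37M")          # 37: an unspliced read
cigar_ref_width("20M100N17M")   # 137: hypothetical read spanning a 100-base gap
cigar_ref_width("5S30M2I")      # 30: soft clips and insertions do not count

# The end coordinate follows from start and width, e.g. the first read above
169 + cigar_ref_width("37M") - 1   # 205
```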

Overlap aligned reads with GRanges

A GAlignments object can be used in findOverlaps

gr <- GRanges("chr4", IRanges(start = 20000, end = 20100))
gr

## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]     chr4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

findOverlaps(gr,bam)

## Hits object with 12 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1        6699
##    [2]         1        6700
##    [3]         1        6701
##    [4]         1        6702
##    [5]         1        6703
##    ...       ...         ...
##    [8]         1        6706
##    [9]         1        6707
##   [10]         1        6708
##   [11]         1        6709
##   [12]         1        6710
##   -------
##   queryLength: 1
##   subjectLength: 175346

Identifying the reads

A shortcut

bam.sub <- bam[bam %over% gr]
bam.sub

## GAlignments object with 12 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031714.4092638     chr4      -         37M        37
##   SRR031714.4275537     chr4      -         37M        37
##   SRR031715.1315719     chr4      -         37M        37
##   SRR031715.1502533     chr4      -         37M        37
##    SRR031714.336402     chr4      -         37M        37
##                 ...      ...    ...         ...       ...
##   SRR031715.3358559     chr4      +         37M        37
##   SRR031715.4831822     chr4      +         37M        37
##   SRR031715.4459351     chr4      +         37M        37
##   SRR031715.2716654     chr4      -         37M        37
##   SRR031715.1552693     chr4      +         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031714.4092638     19968     20004        37         0
##   SRR031714.4275537     19968     20004        37         0
##   SRR031715.1315719     19968     20004        37         0
##   SRR031715.1502533     19968     20004        37         0
##    SRR031714.336402     19971     20007        37         0
##                 ...       ...       ...       ...       ...
##   SRR031715.3358559     19974     20010        37         0
##   SRR031715.4831822     19975     20011        37         0
##   SRR031715.4459351     19981     20017        37         0
##   SRR031715.2716654     19986     20022        37         0
##   SRR031715.1552693     20046     20082        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

Chromosome naming conventions

Regrettably, there is no agreed convention for naming chromosomes
- e.g. chr1 vs 1 etc
We have to make sure that both objects use the same convention when looking for overlaps
- otherwise no hits are found, and only a warning tells us why

gr <- GRanges("4", IRanges(start = 20000, end = 20100))
gr

## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]        4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

findOverlaps(gr,bam)

## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)

## Hits object with 0 hits and 0 metadata columns:
##    queryHits subjectHits
##    <integer>   <integer>
##   -------
##   queryLength: 1
##   subjectLength: 175346

Solution

gr

## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]        4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr <- renameSeqlevels(gr, c("4"="chr4"))
gr

## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]     chr4 [20000, 20100]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
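
It can help to check for shared sequence names before overlapping, rather than relying on the warning. A minimal sketch, assuming the only difference is a missing "chr" prefix; target_levels is a made-up stand-in for the sequence names of the reads.

```r
library(GenomicRanges)

gr <- GRanges("4", IRanges(start = 20000, end = 20100))

# Made-up stand-in for seqlevels(bam); the real file has its own names
target_levels <- c("chr2L", "chr4", "chrX")

# No names in common: an overlap would return zero hits
intersect(seqlevels(gr), target_levels)

# Add the prefix to every sequence level of the query
seqlevels(gr) <- paste0("chr", seqlevels(gr))

# Now the names agree
intersect(seqlevels(gr), target_levels)
```

For the common conventions, seqlevelsStyle from GenomeInfoDb may be able to do the conversion in one step, but it is worth checking that it recognises the organism in question.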

Finer control over reading

readGAlignments uses the Rsamtools interface, which allows more control over how we import data
- the ScanBamParam function allows the user to customise which fields from the bam file are imported
  - recall yesterday’s discussion about bam file contents

?ScanBamParam
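
As a quick way of seeing which fields can be requested, Rsamtools provides scanBamWhat, and bamWhat reports what a given parameter object will import. A short sketch (the object name param is our own):

```r
library(Rsamtools)

# All field names that can be passed to the 'what' argument
scanBamWhat()

# Build a parameter object asking for three extra fields
param <- ScanBamParam(what = c("mapq", "qual", "flag"))

# Check which fields this parameter object will import
bamWhat(param)
```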

Example: adding mapping quality, base quality and flag

bam.extra <- readGAlignments(file=mybam,param=ScanBamParam(what=c("mapq","qual","flag")))
bam.extra[1:5]

## GAlignments object with 5 alignments and 3 metadata columns:
##       seqnames strand       cigar    qwidth     start
##          <Rle>  <Rle> <character> <integer> <integer>
##   [1]     chr4      +         37M        37       169
##   [2]     chr4      -         37M        37       184
##   [3]     chr4      -         37M        37       187
##   [4]     chr4      +         37M        37       193
##   [5]     chr4      -         37M        37       326
##             end     width     njunc |      mapq
##       <integer> <integer> <integer> | <integer>
##   [1]       205        37         0 |       255
##   [2]       220        37         0 |       255
##   [3]       223        37         0 |       255
##   [4]       229        37         0 |       255
##   [5]       362        37         0 |       255
##                                        qual      flag
##                              <PhredQuality> <integer>
##   [1] IIIIIIIIIIIIIIIIIIIIIIIIII8IIIIIIIGII        99
##   [2] IIIIIIIIEIIIIIIIIIIIIIIIIIIIIIIIIIIII       153
##   [3] II6II7IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        89
##   [4] IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFII3I       137
##   [5] ++I+4-05>*I2GF/II6IIIIIIIIIIIIIIIII<I       147
##   -------
##   seqinfo: 8 sequences from an unspecified genome

table(mcols(bam.extra)$flag)

## 
##    65    73    81    83    89    97    99   113   129   137 
##    29  4891 14737 23037  7769 14781 22791    34    29  4576 
##   145   147   153   161   163   177 
## 14781 22791  7292 14737 23037    34
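
The flag values are sums of individual bits, so a table of raw numbers is hard to read. The sketch below decodes a flag using base R only; the bit values come from the SAM specification, while decode_flag and the labels in sam_bits are hypothetical names introduced for illustration.

```r
# SAM flag bits (values from the SAM specification; labels are our own)
sam_bits <- c(paired = 1L, proper_pair = 2L, unmapped = 4L,
              mate_unmapped = 8L, reverse_strand = 16L,
              mate_reverse_strand = 32L, first_in_pair = 64L,
              second_in_pair = 128L, secondary = 256L,
              qc_fail = 512L, duplicate = 1024L)

# decode_flag is a hypothetical helper, not part of any package.
# It returns a named logical vector: TRUE where the bit is set.
decode_flag <- function(flag) {
  vapply(sam_bits, function(b) bitwAnd(as.integer(flag), b) != 0L, logical(1))
}

# Flags 99 and 147 both appear in the table above
names(which(decode_flag(99)))
names(which(decode_flag(147)))
```

Flag 99 decodes to a properly-paired first read on the forward strand and 147 to a properly-paired second read on the reverse strand, which agrees with the strand column shown for those reads. Rsamtools also has bamFlagAsBitMatrix for decoding many flags at once.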

Example: Dealing with PCR duplicates

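# The flag argument of ScanBamParam filters reads on their SAM flag bits
# isDuplicate: TRUE keeps only marked duplicates, FALSE excludes them, NA ignores the bit
dupReads <- readGAlignments(file=mybam,param=ScanBamParam(flag=scanBamFlag(isDuplicate = TRUE)))
nodupReads <- readGAlignments(file=mybam,param=ScanBamParam(flag=scanBamFlag(isDuplicate = FALSE)))
allreads <- readGAlignments(file=mybam,param=ScanBamParam(flag=scanBamFlag(isDuplicate = NA)))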
length(dupReads)

## [1] 0

length(nodupReads)

## [1] 175346

length(allreads)

## [1] 175346

length(allreads) - length(dupReads)

## [1] 175346

Reading a particular region

This is only possible if the bam file has an accompanying index (.bai) file

bam.sub2 <-
  readGAlignments(file=mybam,param=ScanBamParam(which=gr),use.names = TRUE)
length(bam.sub2)

## [1] 14

bam.sub2

## GAlignments object with 14 alignments and 0 metadata columns:
##                     seqnames strand       cigar    qwidth
##                        <Rle>  <Rle> <character> <integer>
##   SRR031714.4100693     chr4      +  31M7704N6M        37
##   SRR031715.5248298     chr4      +  29M7704N8M        37
##   SRR031714.4092638     chr4      -         37M        37
##   SRR031714.4275537     chr4      -         37M        37
##   SRR031715.1315719     chr4      -         37M        37
##                 ...      ...    ...         ...       ...
##   SRR031715.3358559     chr4      +         37M        37
##   SRR031715.4831822     chr4      +         37M        37
##   SRR031715.4459351     chr4      +         37M        37
##   SRR031715.2716654     chr4      -         37M        37
##   SRR031715.1552693     chr4      +         37M        37
##                         start       end     width     njunc
##                     <integer> <integer> <integer> <integer>
##   SRR031714.4100693     13660     21400      7741         1
##   SRR031715.5248298     13662     21402      7741         1
##   SRR031714.4092638     19968     20004        37         0
##   SRR031714.4275537     19968     20004        37         0
##   SRR031715.1315719     19968     20004        37         0
##                 ...       ...       ...       ...       ...
##   SRR031715.3358559     19974     20010        37         0
##   SRR031715.4831822     19975     20011        37         0
##   SRR031715.4459351     19981     20017        37         0
##   SRR031715.2716654     19986     20022        37         0
##   SRR031715.1552693     20046     20082        37         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome
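
Reading by region returned 14 alignments, whereas the earlier overlap found 12. The two extra reads are spliced (their cigar strings contain N) and the region falls inside the skipped part. A plausible explanation is that overlaps on a GAlignments object only consider the aligned blocks, while the full span of the read is what matters when reading by region. The sketch below illustrates the difference on two made-up alignments modelled on the output above (the names aln, region, ungapped and spliced are our own).

```r
library(GenomicAlignments)

# One ungapped read and one spliced read whose gap covers the region
aln <- GAlignments(
  seqnames = Rle(factor(c("chr4", "chr4"))),
  pos      = c(19968L, 13660L),
  cigar    = c("37M", "31M7704N6M"),
  strand   = Rle(strand(c("-", "+"))),
  names    = c("ungapped", "spliced")
)
region <- GRanges("chr4", IRanges(start = 20000, end = 20100))

# Overlap using the aligned blocks only: the spliced read is not counted
aln %over% region

# Overlap using the full span from start to end: both reads are counted
granges(aln) %over% region
```

Which behaviour is wanted depends on the question; for counting reads that actually cover a feature, the block-aware overlap is usually the more appropriate choice.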
