In this session we:-
We will use the freebayes genotype caller for convenience. The main reasons for choosing this were:-
freebayes
is a command-line tool that you will need to run from the terminal. As with all command-line tools, we first need to make sure that we are located in the correct directory. Remember that we can change directory using the unix command cd
.
freebayes
is included in the CRUK Docker. However, we can make sure that freebayes
is available by typing freebayes
at the terminal. This will give a basic description, a version number, and information about how to cite the package (very important if you end-up using the tool to generate results for your paper). Adding the parameter -h
will display more-detailed help about the tool and some examples of how to use it.
cd /home/participant/Course_Materials/Day2/
freebayes
freebayes -h
We are going to use freebayes
to call SNVs on some 1000 genomes samples. In order to make the SNV-calling run in a reasonable time, we are only considering reads aligned to chromosome 20 in this analysis. In the Appendix you will see the commands used to create the bam files.
The minimal requirements to run freebayes
(and why it is appealing for this practical!) are a reference genome and a .bam
file. The -f
parameter is used to specify the location of a reference genome in .fasta
format.
The reference file is in the folder /home/participant/Course_Materials/ref_data
. However, rather than having the write the full path to the file, we can write the path relative to our current working directory (which is /home/participant/Course_Materials/Day2
). When specifying the location of the files the ../
means to go one directory “up” from the current working directory.
Please don’t run this next command
freebayes -f ../ref_data/Homo_sapiens_assembly19.fasta ../data/hapmap/NA12878.chr20.bam
If you did run that command, you would quickly see that the screen gets filled with lots of text. These are the calls that freebayes
is making being printed to the screen (the standard output for some unix commands). If you find yourself in this situation, a swift press of CTRL + C
should stop the tool from running.
What we need to do is direct the output to a file. We can call the output file anything we like, but it is advisable to make the name relatable to the name of the input. If we are in the situation of calling genotypes on many samples, with many different callers, then we want to be able to identify the processing used on each sample.
Although it is not mandatory, we give the output file the extension .vcf
. The .vcf
format is a commonly-adopted standard for variant calls, which we will look into detail now.
freebayes -f ../ref_data/Homo_sapiens_assembly19.fasta ../data/hapmap/NA12878.chr20.bam > NA12878.chr20.freebayes.vcf
The vcf format was initially developed by the 1000 Genomes Project, and ownership has been subsequently transferred to Global Alliance for Genomics and Health Data Working group file format team. The format can be used to represent information about all kinds of genomic variation. In this session we will just consider SNVs.
We don’t require any specialised software to look at the contents of a vcf file. They can be opened in a bog-standard text editor.For now, navigate to the folder and right-click on NA12878.chr20.freebayes.vcf
. Select the option to open with gedit
(a common text editor for Ubuntu).
In a similar vein to the .bam
and .sam
files we saw earlier, the .vcf
files contains many lines of header information.
##fileformat=VCFv4.1
##fileDate=20160724
##source=freeBayes v1.0.2-33-gdbb6160
##reference=../ref_data/Homo_sapiens_assembly19.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
After many more lines of information, we finally get to the details of the actual calls themsevles. This part of the file is tab-delimited; with 10 columns for every call. The vcf specification page gives details of what should be contained in each column
Shown here is the information about the first five calls
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 61795 . G T 95.0616 . AB=0.454545;ABP=3.20771;AC=1;AF=0.5;AN=2;AO=5;CIGAR=1X;DP=11;DPB=11;DPRA=0;EPP=3.44459;EPPR=4.45795;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=21.8887;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=188;QR=245;RO=6;RPL=1;RPP=6.91895;RPPR=3.0103;RPR=4;RUN=1;SAF=2;SAP=3.44459;SAR=3;SRF=5;SRP=8.80089;SRR=1;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/1:11:11,5:6:245:5:188:-13.9693,0,-19.1141
20 63244 . A C 88.3208 . AB=0.8;ABP=6.91895;AC=1;AF=0.5;AN=2;AO=4;CIGAR=1X;DP=5;DPB=5;DPRA=0;EPP=3.0103;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=3.26311;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=149;QR=38;RO=1;RPL=4;RPP=11.6962;RPPR=5.18177;RPR=0;RUN=1;SAF=2;SAP=3.0103;SAR=2;SRF=1;SRP=5.18177;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/1:5:5,4:1:38:4:149:-12.2609,0,-2.29212
20 65900 . G A 44.2954 . AB=0.3;ABP=6.48466;AC=1;AF=0.5;AN=2;AO=3;CIGAR=1X;DP=10;DPB=10;DPRA=0;EPP=3.73412;EPPR=3.32051;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=10.1994;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=125;QR=250;RO=7;RPL=1;RPP=3.73412;RPPR=3.32051;RPR=2;RUN=1;SAF=1;SAP=3.73412;SAR=2;SRF=4;SRP=3.32051;SRR=3;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/1:10:10,3:7:250:3:125:-8.63855,0,-19.8321
20 66370 . G A 57.531 . AB=0.4;ABP=3.87889;AC=1;AF=0.5;AN=2;AO=4;CIGAR=1X;DP=10;DPB=10;DPRA=0;EPP=11.6962;EPPR=4.45795;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=13.247;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=150;QR=213;RO=6;RPL=3;RPP=5.18177;RPPR=4.45795;RPR=1;RUN=1;SAF=1;SAP=5.18177;SAR=3;SRF=2;SRP=4.45795;SRR=4;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/1:10:10,4:6:213:4:150:-10.8494,0,-16.4929
The first seven columns should look consistent across different genotype callers. In this exercise we have not annotated against known variants, or applied any filtering, so the ID
and FILTER
columns are blank. In some .vcf
files these columns might be populated with dbSNP IDs or flags such as PASS
/ FAIL
respectively.
The contents of the INFO
and FORMAT
columns will depend on what variant caller has been used. The INFO
column contains metrics and other information related to each variant call as a set of KEY;VALUE
pairs. Each pair is separated by a ;
character.
The INFO
for the first variant reads as:-
AB=0.454545;ABP=3.20771;AC=1;AF=0.5;AN=2;AO=5;CIGAR=1X;DP=11;DPB=11;DPRA=0;EPP=3.44459;EPPR=4.45795;GTI=0;.......
which we can interpret as:-
Key | Value | |
---|---|---|
AB=0.454545 | AB | 0.454545 |
ABP=3.20771 | ABP | 3.20771 |
AC=1 | AC | 1 |
AF=0.5 | AF | 0.5 |
AN=2 | AN | 2 |
AO=5 | AO | 5 |
CIGAR=1X | CIGAR | 1X |
DP=11 | DP | 11 |
DPB=11 | DPB | 11 |
DPRA=0 | DPRA | 0 |
The meaning of each KEY
is given in the header for the .vcf
file.
## ##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
## ##INFO=<ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
## ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
## ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
## ##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
## ##INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing. Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
## ##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
## ##INFO=<ID=DPRA,Number=A,Type=Float,Description="Alternate allele depth ratio. Ratio between depth in samples with each called alternate allele and those without.">
The final column in the file describes the calls for the sample NA12878
, which was the only sample that we called variants on. In the sample column (NA12878
) for the first variant we see the entry
(0/1:11:11,5:6:245:5:188:-13.9693,0,-19.1141
).
These are values separated by a :
character and they are interpreted in the same order as dictated by the FORMAT
column which is GT:DP:DPR:RO:QR:AO:QA:GL
Key | Value | Description |
---|---|---|
GT | 0/1 | Genotype |
DP | 11 | Read Depth |
DPR | 11,5 | Number of observation for each allele |
RO | 6 | Reference allele observation count |
QR | 245 | Sum of quality of the reference observations |
AO | 5 | Alternate allele observation count |
QA | 188 | Sum of quality of the alternate observations |
GL | -13.9693,0,-19.1141 | Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy |
So for this particular variant, in the sample NA12878
there is a genotype of 0\1
(heterozygous) and a depth of 11 etc.
To understand the vcf format better, we can use another mode of operation of freebayes
which is to call genotypes on multiple samples simultaneously. We can also specify an exact region of the genome to call genotypes to speed-up the processing.
freebayes -f ../ref_data/Homo_sapiens_assembly19.fasta --region 20:500000-800000 ../data/hapmap/NA12878.chr20.bam ../data/hapmap/NA12873.chr20.bam ../data/hapmap/NA12874.chr20.bam > combined.chr20.subset.freebayes.vcf
As before, we can look at the first five calls.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 NA12873 NA12874
20 502620 . G A 124.93 . AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=5;CIGAR=1X;DP=24;DPB=24;DPRA=0.526316;EPP=3.44459;EPPR=5.8675;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=3;NUMALT=1;ODDS=1.91343;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=206;QR=587;RO=19;RPL=2;RPP=3.44459;RPPR=5.8675;RPR=3;RUN=1;SAF=2;SAP=3.44459;SAR=3;SRF=13;SRP=8.61041;SRR=6;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/0:2:2,0:2:59:0:0:0,-0.60206,-5.60414 1/1:5:5,5:0:0:5:206:-18.9161,-1.50515,0 0/0:17:17,0:17:528:0:0:0,-5.11751,-47.8112
20 502793 . G A 508.485 . AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=18;CIGAR=1X;DP=36;DPB=36;DPRA=2;EPP=3.49285;EPPR=3.49285;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=57;NS=3;NUMALT=1;ODDS=2.14415;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=659;QR=702;RO=18;RPL=10;RPP=3.49285;RPPR=3.0103;RPR=8;RUN=1;SAF=8;SAP=3.49285;SAR=10;SRF=15;SRP=20.3821;SRR=3;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/0:6:6,0:6:223:0:0:0,-1.80618,-20.4049 0/0:12:12,0:12:479:0:0:0,-3.61236,-42.35 1/1:18:18,18:0:0:18:659:-59.6311,-5.41854,0
20 502937 . C T 144.588 . AB=0.5;ABP=3.0103;AC=2;AF=0.333333;AN=6;AO=7;CIGAR=1X;DP=27;DPB=27;DPRA=0.538462;EPP=3.32051;EPPR=3.44459;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=3;NUMALT=1;ODDS=5.35048;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=270;QR=750;RO=20;RPL=3;RPP=3.32051;RPPR=3.44459;RPR=4;RUN=1;SAF=4;SAP=3.32051;SAR=3;SRF=4;SRP=18.6449;SRR=16;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/1:2:2,1:1:38:1:41:-3.49251,0,-3.19521 0/1:12:12,6:6:230:6:229:-17.3628,0,-17.4496 0/0:13:13,0:13:482:0:0:0,-3.91339,-43.7103
20 504482 . G A 41.2659 . AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=3;CIGAR=1X;DP=7;DPB=7;DPRA=1.5;EPP=3.73412;EPPR=3.0103;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=54;NS=3;NUMALT=1;ODDS=1.33886;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=112;QR=133;RO=4;RPL=3;RPP=9.52472;RPPR=5.18177;RPR=0;RUN=1;SAF=1;SAP=3.73412;SAR=2;SRF=3;SRP=5.18177;SRR=1;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL 0/0:2:2,0:2:68:0:0:0,-0.60206,-6.45793 0/0:2:2,0:2:65:0:0:0,-0.60206,-6.06068 1/1:3:3,3:0:0:3:112:-10.4467,-0.90309,0
You should notice that we now have columns NA12878
, NA12873
and NA12874
as we have called variants in three samples.
One easy way of being able to visualise the calls is to use the IGV browser. Before we do this however, we can do some extra processing to make it easier to process the files. This series of commands will compress and index the vcf files (similar to how bam files are indexed to produce a .bai
file). IGV would probably cope fine with reading such relatively-small files, but it is good to get into the habit of processing our files in this manner.
bgzip -c combined.chr20.subset.freebayes.vcf > combined.chr20.subset.freebayes.vcf.gz
tabix -p vcf combined.chr20.subset.freebayes.vcf.gz
bgzip -c NA12878.chr20.freebayes.vcf > NA12878.chr20.freebayes.vcf.gz
tabix -p vcf NA12878.chr20.freebayes.vcf.gz
After running these commands, you should see that the files combined.chr20.subset.freebayes.vcf.gz
, combined.chr20.subset.freebayes.vcf.gz.tbi
, NA12878.chr20.freebayes.vcf.gz
and NA12878.chr20.freebayes.vcf.gz.tbi
have been created in your working directory.
These files can be viewing in IGV and as usual we can zoom-in and scroll along the genome. Each .vcf
introduces two tracks into the IGV data panel; the first gives information about the variant, and the second is for sample-specific information.
NA12878.chr20.freebayes.vcf.gz
and combined.chr20.subset.freebayes.vcf.gz
from the File -> Open
menuchr20:500,900-505,630
.vcf
is shown in IGVTo dig-into these files further, we will use tools within Bioconductor. Our goal will be to find calls that are in common between different samples, and what SNPs occur within coding regions. For completeness, we note that other command-line tools can be used to perform these operations (the bedtools suite for example). However, we should be reasonably comfortable with R by now and the package we use interacts nicely with the GenomicRanges
concepts we discussed yesterday.
Session -> Set Working Directory -> Choose Directory
/home/participant/Course_Materials/Day2/
/home/participant/Course_Materials/Day2/Session1-template.Rmd
The VariantAnnotation
package allows .vcf
files to be imported. The readVcf
function can be used and requires the file of a vcf file. You also need to specify a genome name. As we have seen already with other Bioconductor objects, typing the name of the object will print a summary to the screen. In the case of a “CollapsedVCF
” object this is very detailed
library(VariantAnnotation)
hapmap.calls <- readVcf("combined.chr20.subset.freebayes.vcf","hg19")
hapmap.calls
The “header” information can be extracted using header
. This will contain the definitions of all the per-genotype (INFO
) and per-sample information stored in the file
header(hapmap.calls)
info(header(hapmap.calls))
geno(header(hapmap.calls))
INFO
As described above, the INFO
column in a .vcf
file gives per-variant metadata as a series of key-value pairs. We can interrogate these data using the info
function, which returns a data-frame-like object. Consequently, we can use the $
operator to select a particular column of interest.
The name of each row is derived from the genomic location and base-change for each variant. There are too many columns to go through in detail, so we will just describe a few that are likely to be common to different callers.
info(hapmap.calls)
NS
represents the number of samples with data. We would expect this to be 3 in all cases. However, as we are dealing with low-coverage data there are some sites where data could not be observed for all samples.
DP
gives the depth observed at each position, which is given to us in the form of a vector. By looking at the distribtion of the depths, we see some variants with extremely low coverage. Later-on, we will see how we can develop filters to remove unreliable calls from our dataset. Variants with low depth would seem to be a good candidate for exclusion.
hist(info(hapmap.calls)$DP)
If we take an arbitrary cut-off of 25, we could identify which variants have sufficient depth and then write the result to another .vcf
file.
enoughDepth <- which(info(hapmap.calls)$DP > 25)
writeVcf(hapmap.calls[enoughDepth], filename = "combined.chr20.subset.freebayes.highDepth.vcf")
The called genotypes can be accessed using the geno
function. Recall that in the .vcf
file, we have one column of genotype information for each sample, with each column consisting of key:value
pairs. Using geno
we can access all the values for a particular keys and be able to compare across samples.
The column names of the data frame returned by geno
are the same as the FORMAT
in the .vcf
description. Thus we can use a $
operator to access a particular set of values; GT
in this case
geno(hapmap.calls)
## List of length 10
## names(10): GT GQ GL DP DPR RO QR AO QA MIN_DP
head(geno(hapmap.calls)$GT)
## NA12878 NA12873 NA12874
## 20:502620_G/A "0/0" "1/1" "0/0"
## 20:502793_G/A "0/0" "0/0" "1/1"
## 20:502937_C/T "0/1" "0/1" "0/0"
## 20:504482_G/A "0/0" "0/0" "1/1"
## 20:504789_C/T "1/1" "0/1" "0/0"
## 20:505120_C/T "0/0" "0/1" "0/0"
The output allows us to compare the genotype for a particular variant across all samples; which we could not easily do from the .vcf
.
Usually each entry is 0/0
for a homozygous reference, 0/1
for a heterozygous call and 1/1
for a homozyous alternate allele. An entry of .
indicates a position where no call could be made due to insufficient data. Moreover, we can also find 0/2
and 1/2
in rare cases where a second alternative allele was found.
With table
we can tabulate the calls between one sample and another
table(geno(hapmap.calls)$GT[,1])
##
## . 0/0 0/1 1/1 1/2
## 28 500 220 273 3
table(geno(hapmap.calls)$GT[,2])
##
## . 0/0 0/1 0/2 1/1 1/2
## 12 399 446 2 161 4
table(geno(hapmap.calls)$GT[,1], geno(hapmap.calls)$GT[,2])
##
## . 0/0 0/1 0/2 1/1 1/2
## . 3 6 11 0 8 0
## 0/0 2 231 249 1 16 1
## 0/1 2 110 75 0 33 0
## 1/1 5 52 110 0 104 2
## 1/2 0 0 1 1 0 1
.vcf
file containing just these heterozygous variantsIn this section we will see how we can overlap the calls we have made with other genomic features, such as the pre-built gene models that we saw yesterday.
For this section, we are going to use all calls made on chromosome 20 for a single sample; NA12878
. As we discovered yesterday, it can be problematic trying to perform overlaps when the chromosome naming convention is not consistent between files. Consequently, we are going to rename our variants to the UCSC
style from the start.
NA12878.calls <- readVcf("NA12878.chr20.freebayes.vcf","hg19")
seqlevelsStyle(NA12878.calls) <- "UCSC"
NA12878.calls <- keepSeqlevels(NA12878.calls, "chr20")
The rowRanges
function will retrieve the positions of our variants as a familiar GRanges
object. Along with the usual positional information, we also get extra “metadata” (mcols
) about the base-change, a quality-score from freebayes
and a placeholder for a filter (which has not been applied in this case).
NA12878.calls.ranges <- rowRanges(NA12878.calls)
NA12878.calls.ranges
## GRanges object with 3 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## 20:61795_G/T chr20 [61795, 61795] * | <NA>
## 20:63244_A/C chr20 [63244, 63244] * | <NA>
## 20:65900_G/A chr20 [65900, 65900] * | <NA>
## REF ALT QUAL FILTER
## <DNAStringSet> <DNAStringSetList> <numeric> <character>
## 20:61795_G/T G T 95.0616 .
## 20:63244_A/C A C 88.3208 .
## 20:65900_G/A G A 44.2954 .
## -------
## seqinfo: 1 sequence from hg19 genome
We saw yesterday how to retrieve the coordinates for all genes in a given organism and genome-build. For this example, we can again use the package TxDb.Hsapiens.UCSC.hg19.knownGene
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
Find and count all NA12878
variants that have been called within the gene “PRND”
exonsBy
and %over%
, countOverlaps
functions from yesterday….We are not just restricted to overlapping with these pre-defined gene models. If we have a set of generic regions in .bed
or similar format we can import as a GRanges
object using the rtracklayer
package and call the countOverlaps
and %over%
functions as usual.
library(rtracklayer)
regions <- import("regions.of.interest.bed")
regions
NA12878.calls.ranges[NA12878.calls.ranges %over% regions]
countOverlaps(regions, NA12878.calls.ranges)
With a bit of scripting it would be possible to loop through a series of gene locations and identify which gene each variant is located within. However, there is a convenience function locateVariants
that will do the hard work for us. It has flexibility to allow us to search for variants within coding, intergenic or other regions. For example, adding the CodingVariants()
argument tells the function to look for variants within coding sequences. You can see the help for locateVariants
(?locateVariants
) for information on the various options.
As this function runs you should see some message sbout select()' returned many:1 mapping between keys and columns
. This is because some variants might be located within multiple transcripts for the same gene.
locs <- locateVariants(NA12878.calls.ranges, txdb, CodingVariants())
## 'select()' returned many:1 mapping between keys and columns
locs
## GRanges object with 3 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## 1 chr20 [ 76962, 76962] + | coding 375 375
## 2 chr20 [126310, 126316] + | coding 313 319
## 3 chr20 [168728, 168728] - | coding 81 81
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 11 70477 206101 245938
## 2 51 70478 206103 81623
## 3 165 71525 209098 245939
## FOLLOWID
## <CharacterList>
## 1
## 2
## 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
This annotation step has put Entrez IDs into the locs
object, which is great if we want overlap with transcript databases. However, we are more likely to be familiar with gene symbols. We can do a look-up using the org.Hs.eg.db
package that we saw yesterday and add the identifiers we retrieve to the table.
library(org.Hs.eg.db)
##
symbol <- select(org.Hs.eg.db, keys=mcols(locs)$GENEID, keytype = "ENTREZID", columns="SYMBOL")
## 'select()' returned many:1 mapping between keys and columns
head(symbol)
## ENTREZID SYMBOL
## 1 245938 DEFB125
## 2 81623 DEFB126
## 3 245939 DEFB128
## 4 400830 DEFB132
## 5 85364 ZCCHC3
## 6 57761 TRIB3
mcols(locs)$SYMBOL <- symbol[,2]
locs
## GRanges object with 1133 ranges and 10 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## 1 chr20 [ 76962, 76962] + | coding 375 375
## 2 chr20 [126310, 126316] + | coding 313 319
## 3 chr20 [168728, 168728] - | coding 81 81
## 4 chr20 [238435, 238442] + | coding 16 23
## 5 chr20 [279185, 279185] + | coding 958 958
## ... ... ... ... . ... ... ...
## 1129 chr20 [62597666, 62597666] - | coding 862 862
## 1130 chr20 [62597694, 62597694] - | coding 834 834
## 1131 chr20 [62598815, 62598815] - | coding 183 183
## 1132 chr20 [62608677, 62608677] - | coding 174 174
## 1133 chr20 [62737568, 62737568] - | coding 617 617
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 11 70477 206101 245938
## 2 51 70478 206103 81623
## 3 165 71525 209098 245939
## 4 269 70481 206108 400830
## 5 304 70482 206110 85364
## ... ... ... ... ... ...
## 1129 74408 72503 211710 57473
## 1130 74409 72503 211710 57473
## 1131 74410 72503 211712 57473
## 1132 74415 72504 211717 140700
## 1133 74531 72511 211729 2832
## FOLLOWID SYMBOL
## <CharacterList> <character>
## 1 DEFB125
## 2 DEFB126
## 3 DEFB128
## 4 DEFB132
## 5 ZCCHC3
## ... ... ...
## 1129 ZNF512B
## 1130 ZNF512B
## 1131 ZNF512B
## 1132 SAMD10
## 1133 NPBWR2
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If we want, we can locate all the variants with respect to the nearest feature (in the case of intergenic variants).
all.locs <- locateVariants(NA12878.calls.ranges, txdb, AllVariants())
all.locs
table(all.locs$LOCATION)
## GRanges object with 4 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART
## <Rle> <IRanges> <Rle> | <factor> <integer>
## 20:61795_G/T chr20 [61795, 61795] * | intergenic <NA>
## 20:63244_A/C chr20 [63244, 63244] * | intergenic <NA>
## 20:65900_G/A chr20 [65900, 65900] * | intergenic <NA>
## chr20 [66370, 66370] + | promoter <NA>
## LOCEND QUERYID TXID CDSID GENEID
## <integer> <integer> <character> <IntegerList> <character>
## 20:61795_G/T <NA> 1 <NA> <NA>
## 20:63244_A/C <NA> 2 <NA> <NA>
## 20:65900_G/A <NA> 3 <NA> <NA>
## <NA> 4 70477 245938
## PRECEDEID FOLLOWID
## <CharacterList> <CharacterList>
## 20:61795_G/T 245938,81623,140850,...
## 20:63244_A/C 245938,81623,140850,...
## 20:65900_G/A 245938,81623,140850,...
##
## -------
## seqinfo: 1 sequence from hg19 genome
##
## spliceSite intron fiveUTR threeUTR coding intergenic
## 52 93846 274 1790 1133 41861
## promoter
## 4209
We will see more of this tomorrow, but for now determining whether a change at a particular base will lead to a different amino acid can be performed using one of the many BSgenome...
packages in Bioconductor. These packages provide a pre-built database of genome sequence for various organisms. They allow us to query particular regions of the genome and return the results in a Biostrings
-compatible format. We have already seen that Biostrings
allows us to perform various computations on DNA sequences; including the ability to retrieve the DNA sequence for a particular genomic region and translate DNA to RNA.
library(BSgenome.Hsapiens.UCSC.hg19)
## Loading required package: BSgenome
## Loading required package: rtracklayer
hg19 <-BSgenome.Hsapiens.UCSC.hg19
getSeq(hg19, GRanges("chr20", IRanges(76960,76962)))
## A DNAStringSet instance of length 1
## width seq
## [1] 3 TCT
translate(getSeq(hg19, GRanges("chr20", IRanges(76960,76962))))
## A AAStringSet instance of length 1
## width seq
## [1] 1 S
There is a function, predictCoding
that will do all the predictions for us. In order to make use of this function, we need to have loaded a pre-built genome in Biostrings
format. There are many such packages listed on the bioconductor web page. The one we need is BSgenome.Hsapiens.UCSC.hg19
coding.pred <- predictCoding(NA12878.calls, txdb, seqSource=Hsapiens)
coding.pred
table(coding.pred $CONSEQUENCE)
## GRanges object with 3 ranges and 17 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## 20:76962_T/C chr20 [ 76962, 76962] + | <NA>
## 20:126310_ACCCCCG/ACCCG chr20 [126310, 126316] + | <NA>
## 20:168728_T/A chr20 [168728, 168728] - | <NA>
## REF ALT QUAL
## <DNAStringSet> <DNAStringSetList> <numeric>
## 20:76962_T/C T C 112.0310
## 20:126310_ACCCCCG/ACCCG ACCCCCG ACCCG 15.1176
## 20:168728_T/A T A 237.0390
## FILTER varAllele CDSLOC
## <character> <DNAStringSet> <IRanges>
## 20:76962_T/C . C [375, 375]
## 20:126310_ACCCCCG/ACCCG . ACCCG [313, 319]
## 20:168728_T/A . T [ 81, 81]
## PROTEINLOC QUERYID TXID
## <IntegerList> <integer> <character>
## 20:76962_T/C 125 11 70477
## 20:126310_ACCCCCG/ACCCG 105,107 51 70478
## 20:168728_T/A 27 165 71525
## CDSID GENEID CONSEQUENCE
## <IntegerList> <character> <factor>
## 20:76962_T/C 206101 245938 synonymous
## 20:126310_ACCCCCG/ACCCG 206103 81623 frameshift
## 20:168728_T/A 209098 245939 nonsynonymous
## REFCODON VARCODON REFAA
## <DNAStringSet> <DNAStringSet> <AAStringSet>
## 20:76962_T/C TCT TCC S
## 20:126310_ACCCCCG/ACCCG ACCCCCGTT ACCCGTT
## 20:168728_T/A AAA AAT K
## VARAA
## <AAStringSet>
## 20:76962_T/C S
## 20:126310_ACCCCCG/ACCCG
## 20:168728_T/A N
## -------
## seqinfo: 1 sequence from hg19 genome
##
## frameshift nonsense nonsynonymous synonymous
## 2 2 575 554
freebayes
to call SNVs from a set of healthy individuals
.vcf
format contains a rich description of the called variants.vcf
filesGenomicRanges
to overlap our calls with other genomic features of interest.vcf
would probably involve other non-R tools
To recap, we have information in the following tables
NA12878.calls.ranges
which has locations, details of the base-changes and a quality-score for each variantlocs
info(NA12878.calls)
Our task now is to combine these into one data frame, which we can start by converting each component into a data frame. We don’t need all the columns from the NA12878.calls.ranges
object as we already have the genomic location in the locs.df
data frame. Consequently we can extract the meta data only using the function meta
.
locs.df <- as.data.frame(locs)
dim(locs.df)
## [1] 1133 15
meta <- mcols(NA12878.calls.ranges)
head(meta)
## DataFrame with 6 rows and 5 columns
## paramRangeID REF ALT QUAL FILTER
## <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
## 1 NA G T 95.0616 .
## 2 NA A C 88.3208 .
## 3 NA G A 44.2954 .
## 4 NA G A 57.5310 .
## 5 NA TT TTGGTATCTAGT 44.2540 .
## 6 NA T C 186.8270 .
dim(meta)
## [1] 74702 5
info.df <- info(NA12878.calls)
dim(info.df)
## [1] 74702 44
As not all variants are inside coding regions, the number of rows in locs
is not the same as the total number of variants. However, the column locs$QUERYID
gives an index that we can use to look-up the relevant rows in the complete set of calls. Below we show how the extract data fror the first five coding variants.
locs.df[1:5,]
## seqnames start end width strand LOCATION LOCSTART LOCEND QUERYID
## 1 chr20 76962 76962 1 + coding 375 375 11
## 2 chr20 126310 126316 7 + coding 313 319 51
## 3 chr20 168728 168728 1 - coding 81 81 165
## 4 chr20 238435 238442 8 + coding 16 23 269
## 5 chr20 279185 279185 1 + coding 958 958 304
## TXID CDSID GENEID PRECEDEID FOLLOWID SYMBOL
## 1 70477 206101 245938 DEFB125
## 2 70478 206103 81623 DEFB126
## 3 71525 209098 245939 DEFB128
## 4 70481 206108 400830 DEFB132
## 5 70482 206110 85364 ZCCHC3
locs.df$QUERYID[1:5]
## [1] 11 51 165 269 304
NA12878.calls.ranges[locs.df$QUERYID[1:5],]
## GRanges object with 5 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## 20:76962_T/C chr20 [ 76962, 76962] * | <NA>
## 20:126310_ACCCCCG/ACCCG chr20 [126310, 126316] * | <NA>
## 20:168728_T/A chr20 [168728, 168728] * | <NA>
## 20:238435_CTGGTCTT/CT chr20 [238435, 238442] * | <NA>
## 20:279185_C/T chr20 [279185, 279185] * | <NA>
## REF ALT QUAL
## <DNAStringSet> <DNAStringSetList> <numeric>
## 20:76962_T/C T C 112.0310
## 20:126310_ACCCCCG/ACCCG ACCCCCG ACCCG 15.1176
## 20:168728_T/A T A 237.0390
## 20:238435_CTGGTCTT/CT CTGGTCTT CT 62.9073
## 20:279185_C/T C T 39.4505
## FILTER
## <character>
## 20:76962_T/C .
## 20:126310_ACCCCCG/ACCCG .
## 20:168728_T/A .
## 20:238435_CTGGTCTT/CT .
## 20:279185_C/T .
## -------
## seqinfo: 1 sequence from hg19 genome
Construct a data frame containing:-
locs.df
data frame with details of all the coding variantsmeta
data frameINFO
columns DP
, NS
, RO
and AO
from info.df
## seqnames start end width strand LOCATION LOCSTART LOCEND QUERYID
## 1 chr20 76962 76962 1 + coding 375 375 11
## 2 chr20 126310 126316 7 + coding 313 319 51
## 3 chr20 168728 168728 1 - coding 81 81 165
## 4 chr20 238435 238442 8 + coding 16 23 269
## 5 chr20 279185 279185 1 + coding 958 958 304
## 6 chr20 371972 371972 1 + coding 333 333 454
## TXID CDSID GENEID PRECEDEID FOLLOWID SYMBOL paramRangeID REF
## 1 70477 206101 245938 DEFB125 <NA> T
## 2 70478 206103 81623 DEFB126 <NA> ACCCCCG
## 3 71525 209098 245939 DEFB128 <NA> T
## 4 70481 206108 400830 DEFB132 <NA> CTGGTCTT
## 5 70482 206110 85364 ZCCHC3 <NA> C
## 6 70485 206116 57761 TRIB3 <NA> T
## ALT QUAL FILTER DP NS RO AO
## 1 C 112.0310 . 4 1 0 4
## 2 ACCCG 15.1176 . 4 1 2 2
## 3 A 237.0390 . 7 1 0 7
## 4 CT 62.9073 . 3 1 0 3
## 5 T 39.4505 . 5 1 2 3
## 6 C 11.0642 . 3 1 1 2
Commands used to generate bam files for genotype calling
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/Homo_sapiens_assembly19.fasta.gz -P ../ref_data/
gunzip ../ref_data/Homo_sapiens_assembly19.fasta.gz
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam -O NA12878.chr20.bam
samtools index NA12878.chr20.bam
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/NA12874/alignment/NA12874.chrom20.ILLUMINA.bwa.CEU.low_coverage.20130415.bam -O NA12874.chr20.bam
samtools index NA12874.chr20.bam
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/NA12873/alignment/NA12873.chrom20.ILLUMINA.bwa.CEU.low_coverage.20130415.bam -O NA12873.chr20.bam
samtools index NA12873.chr20.bam
wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta
wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta.fai
wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.dict