Introduction

In this session we:-

Demonstrate how to call germline variants for a bam file containing aligned reads
Explain what a vcf file is
Loading a vcf file into IGV
Interrogating vcf files using Bioconductor
Intersect sets of calls

Calling SNVs

We will use the freebayes genotype caller for convenience. The main reasons for choosing this were:-

freely-available with an open-source licence
easy to run and minimal command-line parameters to specify
it is Bayesian; which we like
gives reasonable results

freebayes is a command-line tool that you will need to run from the terminal. As with all command-line tools, we first need to make sure that we are located in the correct directory. Remember that we can change directory using the unix command cd.

freebayes is included in the CRUK Docker. However, we can make sure that freebayes is available by typing freebayes at the terminal. This will give a basic description, a version number, and information about how to cite the package (very important if you end-up using the tool to generate results for your paper). Adding the parameter -h will display more-detailed help about the tool and some examples of how to use it.

cd /home/participant/Course_Materials/Day2/
freebayes
freebayes -h

We are going to use freebayes to call SNVs on some 1000 genomes samples. In order to make the SNV-calling run in a reasonable time, we are only considering reads aligned to chromosome 20 in this analysis. In the Appendix you will see the commands used to create the bam files.

The minimal requirements to run freebayes (and why it is appealing for this practical!) are a reference genome and a .bam file. The -f parameter is used to specify the location of a reference genome in .fasta format.

The reference file is in the folder /home/participant/Course_Materials/ref_data. However, rather than having the write the full path to the file, we can write the path relative to our current working directory (which is /home/participant/Course_Materials/Day2). When specifying the location of the files the ../ means to go one directory “up” from the current working directory.

Please don’t run this next command

freebayes -f ../ref_data/Homo_sapiens_assembly19.fasta ../data/hapmap/NA12878.chr20.bam

If you did run that command, you would quickly see that the screen gets filled with lots of text. These are the calls that freebayes is making being printed to the screen (the standard output for some unix commands). If you find yourself in this situation, a swift press of CTRL + C should stop the tool from running.

What we need to do is direct the output to a file. We can call the output file anything we like, but it is advisable to make the name relatable to the name of the input. If we are in the situation of calling genotypes on many samples, with many different callers, then we want to be able to identify the processing used on each sample.

Although it is not mandatory, we give the output file the extension .vcf. The .vcf format is a commonly-adopted standard for variant calls, which we will look into detail now.

freebayes -f ../ref_data/Homo_sapiens_assembly19.fasta ../data/hapmap/NA12878.chr20.bam > NA12878.chr20.freebayes.vcf

Understanding the vcf format

The vcf format was initially developed by the 1000 Genomes Project, and ownership has been subsequently transferred to Global Alliance for Genomics and Health Data Working group file format team. The format can be used to represent information about all kinds of genomic variation. In this session we will just consider SNVs.

We don’t require any specialised software to look at the contents of a vcf file. They can be opened in a bog-standard text editor.For now, navigate to the folder and right-click on NA12878.chr20.freebayes.vcf. Select the option to open with gedit (a common text editor for Ubuntu).

In a similar vein to the .bam and .sam files we saw earlier, the .vcf files contains many lines of header information.

##fileformat=VCFv4.1
##fileDate=20160724
##source=freeBayes v1.0.2-33-gdbb6160
##reference=../ref_data/Homo_sapiens_assembly19.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>

After many more lines of information, we finally get to the details of the actual calls themsevles. This part of the file is tab-delimited; with 10 columns for every call. The vcf specification page gives details of what should be contained in each column

CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
NA12878

Shown here is the information about the first five calls

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  61795   .   G   T   95.0616 .   AB=0.454545;ABP=3.20771;AC=1;AF=0.5;AN=2;AO=5;CIGAR=1X;DP=11;DPB=11;DPRA=0;EPP=3.44459;EPPR=4.45795;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=21.8887;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=188;QR=245;RO=6;RPL=1;RPP=6.91895;RPPR=3.0103;RPR=4;RUN=1;SAF=2;SAP=3.44459;SAR=3;SRF=5;SRP=8.80089;SRR=1;TYPE=snp;technology.ILLUMINA=1  GT:DP:DPR:RO:QR:AO:QA:GL    0/1:11:11,5:6:245:5:188:-13.9693,0,-19.1141
20  63244   .   A   C   88.3208 .   AB=0.8;ABP=6.91895;AC=1;AF=0.5;AN=2;AO=4;CIGAR=1X;DP=5;DPB=5;DPRA=0;EPP=3.0103;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=3.26311;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=149;QR=38;RO=1;RPL=4;RPP=11.6962;RPPR=5.18177;RPR=0;RUN=1;SAF=2;SAP=3.0103;SAR=2;SRF=1;SRP=5.18177;SRR=0;TYPE=snp;technology.ILLUMINA=1   GT:DP:DPR:RO:QR:AO:QA:GL    0/1:5:5,4:1:38:4:149:-12.2609,0,-2.29212
20  65900   .   G   A   44.2954 .   AB=0.3;ABP=6.48466;AC=1;AF=0.5;AN=2;AO=3;CIGAR=1X;DP=10;DPB=10;DPRA=0;EPP=3.73412;EPPR=3.32051;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=10.1994;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=125;QR=250;RO=7;RPL=1;RPP=3.73412;RPPR=3.32051;RPR=2;RUN=1;SAF=1;SAP=3.73412;SAR=2;SRF=4;SRP=3.32051;SRR=3;TYPE=snp;technology.ILLUMINA=1  GT:DP:DPR:RO:QR:AO:QA:GL    0/1:10:10,3:7:250:3:125:-8.63855,0,-19.8321
20  66370   .   G   A   57.531  .   AB=0.4;ABP=3.87889;AC=1;AF=0.5;AN=2;AO=4;CIGAR=1X;DP=10;DPB=10;DPRA=0;EPP=11.6962;EPPR=4.45795;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=13.247;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=150;QR=213;RO=6;RPL=3;RPP=5.18177;RPPR=4.45795;RPR=1;RUN=1;SAF=1;SAP=5.18177;SAR=3;SRF=2;SRP=4.45795;SRR=4;TYPE=snp;technology.ILLUMINA=1   GT:DP:DPR:RO:QR:AO:QA:GL    0/1:10:10,4:6:213:4:150:-10.8494,0,-16.4929

The first seven columns should look consistent across different genotype callers. In this exercise we have not annotated against known variants, or applied any filtering, so the ID and FILTER columns are blank. In some .vcf files these columns might be populated with dbSNP IDs or flags such as PASS / FAIL respectively.

The contents of the INFO and FORMAT columns will depend on what variant caller has been used. The INFO column contains metrics and other information related to each variant call as a set of KEY;VALUE pairs. Each pair is separated by a ; character.

The INFO for the first variant reads as:-

AB=0.454545;ABP=3.20771;AC=1;AF=0.5;AN=2;AO=5;CIGAR=1X;DP=11;DPB=11;DPRA=0;EPP=3.44459;EPPR=4.45795;GTI=0;.......

which we can interpret as:-

	Key	Value
AB=0.454545	AB	0.454545
ABP=3.20771	ABP	3.20771
AC=1	AC	1
AF=0.5	AF	0.5
AN=2	AN	2
AO=5	AO	5
CIGAR=1X	CIGAR	1X
DP=11	DP	11
DPB=11	DPB	11
DPRA=0	DPRA	0

The meaning of each KEY is given in the header for the .vcf file.

## ##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">

## ##INFO=<ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">

## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">

## ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">

## ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

## ##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">

## ##INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">

## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">

## ##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">

## ##INFO=<ID=DPRA,Number=A,Type=Float,Description="Alternate allele depth ratio.  Ratio between depth in samples with each called alternate allele and those without.">

The final column in the file describes the calls for the sample NA12878, which was the only sample that we called variants on. In the sample column (NA12878) for the first variant we see the entry

(0/1:11:11,5:6:245:5:188:-13.9693,0,-19.1141).

These are values separated by a : character and they are interpreted in the same order as dictated by the FORMAT column which is GT:DP:DPR:RO:QR:AO:QA:GL

Key	Value	Description
GT	0/1	Genotype
DP	11	Read Depth
DPR	11,5	Number of observation for each allele
RO	6	Reference allele observation count
QR	245	Sum of quality of the reference observations
AO	5	Alternate allele observation count
QA	188	Sum of quality of the alternate observations
GL	-13.9693,0,-19.1141	Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy

So for this particular variant, in the sample NA12878 there is a genotype of 0\1 (heterozygous) and a depth of 11 etc.

To understand the vcf format better, we can use another mode of operation of freebayes which is to call genotypes on multiple samples simultaneously. We can also specify an exact region of the genome to call genotypes to speed-up the processing.

freebayes -f ../ref_data/Homo_sapiens_assembly19.fasta --region 20:500000-800000 ../data/hapmap/NA12878.chr20.bam ../data/hapmap/NA12873.chr20.bam ../data/hapmap/NA12874.chr20.bam > combined.chr20.subset.freebayes.vcf

As before, we can look at the first five calls.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878 NA12873 NA12874
20  502620  .   G   A   124.93  .   AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=5;CIGAR=1X;DP=24;DPB=24;DPRA=0.526316;EPP=3.44459;EPPR=5.8675;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=3;NUMALT=1;ODDS=1.91343;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=206;QR=587;RO=19;RPL=2;RPP=3.44459;RPPR=5.8675;RPR=3;RUN=1;SAF=2;SAP=3.44459;SAR=3;SRF=13;SRP=8.61041;SRR=6;TYPE=snp;technology.ILLUMINA=1  GT:DP:DPR:RO:QR:AO:QA:GL    0/0:2:2,0:2:59:0:0:0,-0.60206,-5.60414  1/1:5:5,5:0:0:5:206:-18.9161,-1.50515,0 0/0:17:17,0:17:528:0:0:0,-5.11751,-47.8112
20  502793  .   G   A   508.485 .   AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=18;CIGAR=1X;DP=36;DPB=36;DPRA=2;EPP=3.49285;EPPR=3.49285;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=57;NS=3;NUMALT=1;ODDS=2.14415;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=659;QR=702;RO=18;RPL=10;RPP=3.49285;RPPR=3.0103;RPR=8;RUN=1;SAF=8;SAP=3.49285;SAR=10;SRF=15;SRP=20.3821;SRR=3;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL    0/0:6:6,0:6:223:0:0:0,-1.80618,-20.4049 0/0:12:12,0:12:479:0:0:0,-3.61236,-42.35    1/1:18:18,18:0:0:18:659:-59.6311,-5.41854,0
20  502937  .   C   T   144.588 .   AB=0.5;ABP=3.0103;AC=2;AF=0.333333;AN=6;AO=7;CIGAR=1X;DP=27;DPB=27;DPRA=0.538462;EPP=3.32051;EPPR=3.44459;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=3;NUMALT=1;ODDS=5.35048;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=270;QR=750;RO=20;RPL=3;RPP=3.32051;RPPR=3.44459;RPR=4;RUN=1;SAF=4;SAP=3.32051;SAR=3;SRF=4;SRP=18.6449;SRR=16;TYPE=snp;technology.ILLUMINA=1 GT:DP:DPR:RO:QR:AO:QA:GL    0/1:2:2,1:1:38:1:41:-3.49251,0,-3.19521 0/1:12:12,6:6:230:6:229:-17.3628,0,-17.4496 0/0:13:13,0:13:482:0:0:0,-3.91339,-43.7103
20  504482  .   G   A   41.2659 .   AB=0;ABP=0;AC=2;AF=0.333333;AN=6;AO=3;CIGAR=1X;DP=7;DPB=7;DPRA=1.5;EPP=3.73412;EPPR=3.0103;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=54;NS=3;NUMALT=1;ODDS=1.33886;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=112;QR=133;RO=4;RPL=3;RPP=9.52472;RPPR=5.18177;RPR=0;RUN=1;SAF=1;SAP=3.73412;SAR=2;SRF=3;SRP=5.18177;SRR=1;TYPE=snp;technology.ILLUMINA=1  GT:DP:DPR:RO:QR:AO:QA:GL    0/0:2:2,0:2:68:0:0:0,-0.60206,-6.45793  0/0:2:2,0:2:65:0:0:0,-0.60206,-6.06068  1/1:3:3,3:0:0:3:112:-10.4467,-0.90309,0

You should notice that we now have columns NA12878, NA12873 and NA12874 as we have called variants in three samples.

One easy way of being able to visualise the calls is to use the IGV browser. Before we do this however, we can do some extra processing to make it easier to process the files. This series of commands will compress and index the vcf files (similar to how bam files are indexed to produce a .bai file). IGV would probably cope fine with reading such relatively-small files, but it is good to get into the habit of processing our files in this manner.


bgzip -c combined.chr20.subset.freebayes.vcf > combined.chr20.subset.freebayes.vcf.gz
tabix -p  vcf combined.chr20.subset.freebayes.vcf.gz

bgzip -c NA12878.chr20.freebayes.vcf > NA12878.chr20.freebayes.vcf.gz
tabix -p vcf NA12878.chr20.freebayes.vcf.gz

After running these commands, you should see that the files combined.chr20.subset.freebayes.vcf.gz, combined.chr20.subset.freebayes.vcf.gz.tbi, NA12878.chr20.freebayes.vcf.gz and NA12878.chr20.freebayes.vcf.gz.tbi have been created in your working directory.

These files can be viewing in IGV and as usual we can zoom-in and scroll along the genome. Each .vcf introduces two tracks into the IGV data panel; the first gives information about the variant, and the second is for sample-specific information.

Exercise

Load IGV
Select the files NA12878.chr20.freebayes.vcf.gz and combined.chr20.subset.freebayes.vcf.gz from the File -> Open menu
Navigate to chr20:500,900-505,630
Verify that the same information from the .vcf is shown in IGV
What do the light and dark blue rectangles in the sample-specific tracks refer to?
What SNPs are shared between the different samples?
- are any unique to a particular sample?

To dig-into these files further, we will use tools within Bioconductor. Our goal will be to find calls that are in common between different samples, and what SNPs occur within coding regions. For completeness, we note that other command-line tools can be used to perform these operations (the bedtools suite for example). However, we should be reasonably comfortable with R by now and the package we use interacts nicely with the GenomicRanges concepts we discussed yesterday.

Importing into Bioconductor

Working directory

Session -> Set Working Directory -> Choose Directory

/home/participant/Course_Materials/Day2/

Template

/home/participant/Course_Materials/Day2/Session1-template.Rmd

Multi-sample example

The VariantAnnotation package allows .vcf files to be imported. The readVcf function can be used and requires the file of a vcf file. You also need to specify a genome name. As we have seen already with other Bioconductor objects, typing the name of the object will print a summary to the screen. In the case of a “CollapsedVCF” object this is very detailed

library(VariantAnnotation)
hapmap.calls <- readVcf("combined.chr20.subset.freebayes.vcf","hg19")

hapmap.calls

The “header” information can be extracted using header. This will contain the definitions of all the per-genotype (INFO) and per-sample information stored in the file

header(hapmap.calls)
info(header(hapmap.calls))
geno(header(hapmap.calls))

`INFO`

As described above, the INFO column in a .vcf file gives per-variant metadata as a series of key-value pairs. We can interrogate these data using the info function, which returns a data-frame-like object. Consequently, we can use the $ operator to select a particular column of interest.

The name of each row is derived from the genomic location and base-change for each variant. There are too many columns to go through in detail, so we will just describe a few that are likely to be common to different callers.

info(hapmap.calls)

NS represents the number of samples with data. We would expect this to be 3 in all cases. However, as we are dealing with low-coverage data there are some sites where data could not be observed for all samples.

DP gives the depth observed at each position, which is given to us in the form of a vector. By looking at the distribtion of the depths, we see some variants with extremely low coverage. Later-on, we will see how we can develop filters to remove unreliable calls from our dataset. Variants with low depth would seem to be a good candidate for exclusion.

hist(info(hapmap.calls)$DP)

If we take an arbitrary cut-off of 25, we could identify which variants have sufficient depth and then write the result to another .vcf file.

enoughDepth <- which(info(hapmap.calls)$DP > 25)
writeVcf(hapmap.calls[enoughDepth], filename = "combined.chr20.subset.freebayes.highDepth.vcf")

Accessing genotypes

The called genotypes can be accessed using the geno function. Recall that in the .vcf file, we have one column of genotype information for each sample, with each column consisting of key:value pairs. Using geno we can access all the values for a particular keys and be able to compare across samples.

The column names of the data frame returned by geno are the same as the FORMAT in the .vcf description. Thus we can use a $ operator to access a particular set of values; GT in this case

geno(hapmap.calls)

## List of length 10
## names(10): GT GQ GL DP DPR RO QR AO QA MIN_DP

head(geno(hapmap.calls)$GT)

##               NA12878 NA12873 NA12874
## 20:502620_G/A "0/0"   "1/1"   "0/0"  
## 20:502793_G/A "0/0"   "0/0"   "1/1"  
## 20:502937_C/T "0/1"   "0/1"   "0/0"  
## 20:504482_G/A "0/0"   "0/0"   "1/1"  
## 20:504789_C/T "1/1"   "0/1"   "0/0"  
## 20:505120_C/T "0/0"   "0/1"   "0/0"

The output allows us to compare the genotype for a particular variant across all samples; which we could not easily do from the .vcf.

Usually each entry is 0/0 for a homozygous reference, 0/1 for a heterozygous call and 1/1 for a homozyous alternate allele. An entry of . indicates a position where no call could be made due to insufficient data. Moreover, we can also find 0/2 and 1/2 in rare cases where a second alternative allele was found.

With table we can tabulate the calls between one sample and another

table(geno(hapmap.calls)$GT[,1])

## 
##   . 0/0 0/1 1/1 1/2 
##  28 500 220 273   3

table(geno(hapmap.calls)$GT[,2])

## 
##   . 0/0 0/1 0/2 1/1 1/2 
##  12 399 446   2 161   4

table(geno(hapmap.calls)$GT[,1], geno(hapmap.calls)$GT[,2])

##      
##         . 0/0 0/1 0/2 1/1 1/2
##   .     3   6  11   0   8   0
##   0/0   2 231 249   1  16   1
##   0/1   2 110  75   0  33   0
##   1/1   5  52 110   0 104   2
##   1/2   0   0   1   1   0   1

Exercise

Find the variants that are called Heterozygous in all three samples
Create a new .vcf file containing just these heterozygous variants

Variant locations

In this section we will see how we can overlap the calls we have made with other genomic features, such as the pre-built gene models that we saw yesterday.

For this section, we are going to use all calls made on chromosome 20 for a single sample; NA12878. As we discovered yesterday, it can be problematic trying to perform overlaps when the chromosome naming convention is not consistent between files. Consequently, we are going to rename our variants to the UCSC style from the start.

NA12878.calls <- readVcf("NA12878.chr20.freebayes.vcf","hg19")
seqlevelsStyle(NA12878.calls) <- "UCSC"
NA12878.calls <- keepSeqlevels(NA12878.calls, "chr20")

The rowRanges function will retrieve the positions of our variants as a familiar GRanges object. Along with the usual positional information, we also get extra “metadata” (mcols) about the base-change, a quality-score from freebayes and a placeholder for a filter (which has not been applied in this case).

NA12878.calls.ranges <- rowRanges(NA12878.calls)

NA12878.calls.ranges

## GRanges object with 3 ranges and 5 metadata columns:
##                seqnames         ranges strand | paramRangeID
##                   <Rle>      <IRanges>  <Rle> |     <factor>
##   20:61795_G/T    chr20 [61795, 61795]      * |         <NA>
##   20:63244_A/C    chr20 [63244, 63244]      * |         <NA>
##   20:65900_G/A    chr20 [65900, 65900]      * |         <NA>
##                           REF                ALT      QUAL      FILTER
##                <DNAStringSet> <DNAStringSetList> <numeric> <character>
##   20:61795_G/T              G                  T   95.0616           .
##   20:63244_A/C              A                  C   88.3208           .
##   20:65900_G/A              G                  A   44.2954           .
##   -------
##   seqinfo: 1 sequence from hg19 genome

Overlapping with genes

We saw yesterday how to retrieve the coordinates for all genes in a given organism and genome-build. For this example, we can again use the package TxDb.Hsapiens.UCSC.hg19.knownGene

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

Exercise

Find and count all NA12878 variants that have been called within the gene “PRND”

this gene has an Entrez ID of “23627”
- recall the exonsBy and %over%, countOverlaps functions from yesterday….
check your answer in IGV

We are not just restricted to overlapping with these pre-defined gene models. If we have a set of generic regions in .bed or similar format we can import as a GRanges object using the rtracklayer package and call the countOverlaps and %over% functions as usual.

library(rtracklayer)
regions <- import("regions.of.interest.bed")
regions
NA12878.calls.ranges[NA12878.calls.ranges %over% regions]
countOverlaps(regions, NA12878.calls.ranges)

More-automated location of variants

With a bit of scripting it would be possible to loop through a series of gene locations and identify which gene each variant is located within. However, there is a convenience function locateVariants that will do the hard work for us. It has flexibility to allow us to search for variants within coding, intergenic or other regions. For example, adding the CodingVariants()argument tells the function to look for variants within coding sequences. You can see the help for locateVariants (?locateVariants) for information on the various options.

As this function runs you should see some message sbout select()' returned many:1 mapping between keys and columns. This is because some variants might be located within multiple transcripts for the same gene.

locs <- locateVariants(NA12878.calls.ranges, txdb, CodingVariants())

## 'select()' returned many:1 mapping between keys and columns

locs

## GRanges object with 3 ranges and 9 metadata columns:
##     seqnames           ranges strand | LOCATION  LOCSTART    LOCEND
##        <Rle>        <IRanges>  <Rle> | <factor> <integer> <integer>
##   1    chr20 [ 76962,  76962]      + |   coding       375       375
##   2    chr20 [126310, 126316]      + |   coding       313       319
##   3    chr20 [168728, 168728]      - |   coding        81        81
##       QUERYID        TXID         CDSID      GENEID       PRECEDEID
##     <integer> <character> <IntegerList> <character> <CharacterList>
##   1        11       70477        206101      245938                
##   2        51       70478        206103       81623                
##   3       165       71525        209098      245939                
##            FOLLOWID
##     <CharacterList>
##   1                
##   2                
##   3                
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

This annotation step has put Entrez IDs into the locs object, which is great if we want overlap with transcript databases. However, we are more likely to be familiar with gene symbols. We can do a look-up using the org.Hs.eg.db package that we saw yesterday and add the identifiers we retrieve to the table.

library(org.Hs.eg.db)

##

symbol <- select(org.Hs.eg.db, keys=mcols(locs)$GENEID, keytype = "ENTREZID", columns="SYMBOL")

## 'select()' returned many:1 mapping between keys and columns

head(symbol)

##   ENTREZID  SYMBOL
## 1   245938 DEFB125
## 2    81623 DEFB126
## 3   245939 DEFB128
## 4   400830 DEFB132
## 5    85364  ZCCHC3
## 6    57761   TRIB3

mcols(locs)$SYMBOL <- symbol[,2]
locs

## GRanges object with 1133 ranges and 10 metadata columns:
##        seqnames               ranges strand | LOCATION  LOCSTART    LOCEND
##           <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer>
##      1    chr20     [ 76962,  76962]      + |   coding       375       375
##      2    chr20     [126310, 126316]      + |   coding       313       319
##      3    chr20     [168728, 168728]      - |   coding        81        81
##      4    chr20     [238435, 238442]      + |   coding        16        23
##      5    chr20     [279185, 279185]      + |   coding       958       958
##    ...      ...                  ...    ... .      ...       ...       ...
##   1129    chr20 [62597666, 62597666]      - |   coding       862       862
##   1130    chr20 [62597694, 62597694]      - |   coding       834       834
##   1131    chr20 [62598815, 62598815]      - |   coding       183       183
##   1132    chr20 [62608677, 62608677]      - |   coding       174       174
##   1133    chr20 [62737568, 62737568]      - |   coding       617       617
##          QUERYID        TXID         CDSID      GENEID       PRECEDEID
##        <integer> <character> <IntegerList> <character> <CharacterList>
##      1        11       70477        206101      245938                
##      2        51       70478        206103       81623                
##      3       165       71525        209098      245939                
##      4       269       70481        206108      400830                
##      5       304       70482        206110       85364                
##    ...       ...         ...           ...         ...             ...
##   1129     74408       72503        211710       57473                
##   1130     74409       72503        211710       57473                
##   1131     74410       72503        211712       57473                
##   1132     74415       72504        211717      140700                
##   1133     74531       72511        211729        2832                
##               FOLLOWID      SYMBOL
##        <CharacterList> <character>
##      1                     DEFB125
##      2                     DEFB126
##      3                     DEFB128
##      4                     DEFB132
##      5                      ZCCHC3
##    ...             ...         ...
##   1129                     ZNF512B
##   1130                     ZNF512B
##   1131                     ZNF512B
##   1132                      SAMD10
##   1133                      NPBWR2
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

If we want, we can locate all the variants with respect to the nearest feature (in the case of intergenic variants).

all.locs <- locateVariants(NA12878.calls.ranges, txdb, AllVariants())

all.locs
table(all.locs$LOCATION)

## GRanges object with 4 ranges and 9 metadata columns:
##                seqnames         ranges strand |   LOCATION  LOCSTART
##                   <Rle>      <IRanges>  <Rle> |   <factor> <integer>
##   20:61795_G/T    chr20 [61795, 61795]      * | intergenic      <NA>
##   20:63244_A/C    chr20 [63244, 63244]      * | intergenic      <NA>
##   20:65900_G/A    chr20 [65900, 65900]      * | intergenic      <NA>
##                   chr20 [66370, 66370]      + |   promoter      <NA>
##                   LOCEND   QUERYID        TXID         CDSID      GENEID
##                <integer> <integer> <character> <IntegerList> <character>
##   20:61795_G/T      <NA>         1        <NA>                      <NA>
##   20:63244_A/C      <NA>         2        <NA>                      <NA>
##   20:65900_G/A      <NA>         3        <NA>                      <NA>
##                     <NA>         4       70477                    245938
##                              PRECEDEID        FOLLOWID
##                        <CharacterList> <CharacterList>
##   20:61795_G/T 245938,81623,140850,...                
##   20:63244_A/C 245938,81623,140850,...                
##   20:65900_G/A 245938,81623,140850,...                
##                                                       
##   -------
##   seqinfo: 1 sequence from hg19 genome

## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic 
##         52      93846        274       1790       1133      41861 
##   promoter 
##       4209

Predicting Amino acid changes

We will see more of this tomorrow, but for now determining whether a change at a particular base will lead to a different amino acid can be performed using one of the many BSgenome... packages in Bioconductor. These packages provide a pre-built database of genome sequence for various organisms. They allow us to query particular regions of the genome and return the results in a Biostrings-compatible format. We have already seen that Biostrings allows us to perform various computations on DNA sequences; including the ability to retrieve the DNA sequence for a particular genomic region and translate DNA to RNA.

library(BSgenome.Hsapiens.UCSC.hg19)

## Loading required package: BSgenome

## Loading required package: rtracklayer

hg19 <-BSgenome.Hsapiens.UCSC.hg19
getSeq(hg19, GRanges("chr20", IRanges(76960,76962)))

##   A DNAStringSet instance of length 1
##     width seq
## [1]     3 TCT

translate(getSeq(hg19, GRanges("chr20", IRanges(76960,76962))))

##   A AAStringSet instance of length 1
##     width seq
## [1]     1 S

There is a function, predictCoding that will do all the predictions for us. In order to make use of this function, we need to have loaded a pre-built genome in Biostrings format. There are many such packages listed on the bioconductor web page. The one we need is BSgenome.Hsapiens.UCSC.hg19

coding.pred <- predictCoding(NA12878.calls, txdb, seqSource=Hsapiens)

coding.pred 
table(coding.pred $CONSEQUENCE)

## GRanges object with 3 ranges and 17 metadata columns:
##                           seqnames           ranges strand | paramRangeID
##                              <Rle>        <IRanges>  <Rle> |     <factor>
##              20:76962_T/C    chr20 [ 76962,  76962]      + |         <NA>
##   20:126310_ACCCCCG/ACCCG    chr20 [126310, 126316]      + |         <NA>
##             20:168728_T/A    chr20 [168728, 168728]      - |         <NA>
##                                      REF                ALT      QUAL
##                           <DNAStringSet> <DNAStringSetList> <numeric>
##              20:76962_T/C              T                  C  112.0310
##   20:126310_ACCCCCG/ACCCG        ACCCCCG              ACCCG   15.1176
##             20:168728_T/A              T                  A  237.0390
##                                FILTER      varAllele     CDSLOC
##                           <character> <DNAStringSet>  <IRanges>
##              20:76962_T/C           .              C [375, 375]
##   20:126310_ACCCCCG/ACCCG           .          ACCCG [313, 319]
##             20:168728_T/A           .              T [ 81,  81]
##                              PROTEINLOC   QUERYID        TXID
##                           <IntegerList> <integer> <character>
##              20:76962_T/C           125        11       70477
##   20:126310_ACCCCCG/ACCCG       105,107        51       70478
##             20:168728_T/A            27       165       71525
##                                   CDSID      GENEID   CONSEQUENCE
##                           <IntegerList> <character>      <factor>
##              20:76962_T/C        206101      245938    synonymous
##   20:126310_ACCCCCG/ACCCG        206103       81623    frameshift
##             20:168728_T/A        209098      245939 nonsynonymous
##                                 REFCODON       VARCODON         REFAA
##                           <DNAStringSet> <DNAStringSet> <AAStringSet>
##              20:76962_T/C            TCT            TCC             S
##   20:126310_ACCCCCG/ACCCG      ACCCCCGTT        ACCCGTT              
##             20:168728_T/A            AAA            AAT             K
##                                   VARAA
##                           <AAStringSet>
##              20:76962_T/C             S
##   20:126310_ACCCCCG/ACCCG              
##             20:168728_T/A             N
##   -------
##   seqinfo: 1 sequence from hg19 genome

## 
##    frameshift      nonsense nonsynonymous    synonymous 
##             2             2           575           554

Summary

We have used a respected genotype caller freebayes to call SNVs from a set of healthy individuals
- there are many paramters that can be tweaked that we haven’t described here
The .vcf format contains a rich description of the called variants
Bioconducor tools can be used to import and parse .vcf files
We can use GenomicRanges to overlap our calls with other genomic features of interest
Production-level manipulation of .vcf would probably involve other non-R tools
- bedtools
- vcftools

EXTRA:- Producing an output table

To recap, we have information in the following tables

NA12878.calls.ranges which has locations, details of the base-changes and a quality-score for each variant
A table of variants which occur in the coding region of genes; locs
meta information about the quality of each call; info(NA12878.calls)

Our task now is to combine these into one data frame, which we can start by converting each component into a data frame. We don’t need all the columns from the NA12878.calls.ranges object as we already have the genomic location in the locs.df data frame. Consequently we can extract the meta data only using the function meta.

locs.df <- as.data.frame(locs)
dim(locs.df)

## [1] 1133   15

meta <- mcols(NA12878.calls.ranges)
head(meta)

## DataFrame with 6 rows and 5 columns
##   paramRangeID            REF                ALT      QUAL      FILTER
##       <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
## 1           NA              G                  T   95.0616           .
## 2           NA              A                  C   88.3208           .
## 3           NA              G                  A   44.2954           .
## 4           NA              G                  A   57.5310           .
## 5           NA             TT       TTGGTATCTAGT   44.2540           .
## 6           NA              T                  C  186.8270           .

dim(meta)

## [1] 74702     5

info.df <- info(NA12878.calls)
dim(info.df)

## [1] 74702    44

As not all variants are inside coding regions, the number of rows in locs is not the same as the total number of variants. However, the column locs$QUERYID gives an index that we can use to look-up the relevant rows in the complete set of calls. Below we show how the extract data fror the first five coding variants.

locs.df[1:5,]

##   seqnames  start    end width strand LOCATION LOCSTART LOCEND QUERYID
## 1    chr20  76962  76962     1      +   coding      375    375      11
## 2    chr20 126310 126316     7      +   coding      313    319      51
## 3    chr20 168728 168728     1      -   coding       81     81     165
## 4    chr20 238435 238442     8      +   coding       16     23     269
## 5    chr20 279185 279185     1      +   coding      958    958     304
##    TXID  CDSID GENEID PRECEDEID FOLLOWID  SYMBOL
## 1 70477 206101 245938                    DEFB125
## 2 70478 206103  81623                    DEFB126
## 3 71525 209098 245939                    DEFB128
## 4 70481 206108 400830                    DEFB132
## 5 70482 206110  85364                     ZCCHC3

locs.df$QUERYID[1:5]

## [1]  11  51 165 269 304

NA12878.calls.ranges[locs.df$QUERYID[1:5],]

## GRanges object with 5 ranges and 5 metadata columns:
##                           seqnames           ranges strand | paramRangeID
##                              <Rle>        <IRanges>  <Rle> |     <factor>
##              20:76962_T/C    chr20 [ 76962,  76962]      * |         <NA>
##   20:126310_ACCCCCG/ACCCG    chr20 [126310, 126316]      * |         <NA>
##             20:168728_T/A    chr20 [168728, 168728]      * |         <NA>
##     20:238435_CTGGTCTT/CT    chr20 [238435, 238442]      * |         <NA>
##             20:279185_C/T    chr20 [279185, 279185]      * |         <NA>
##                                      REF                ALT      QUAL
##                           <DNAStringSet> <DNAStringSetList> <numeric>
##              20:76962_T/C              T                  C  112.0310
##   20:126310_ACCCCCG/ACCCG        ACCCCCG              ACCCG   15.1176
##             20:168728_T/A              T                  A  237.0390
##     20:238435_CTGGTCTT/CT       CTGGTCTT                 CT   62.9073
##             20:279185_C/T              C                  T   39.4505
##                                FILTER
##                           <character>
##              20:76962_T/C           .
##   20:126310_ACCCCCG/ACCCG           .
##             20:168728_T/A           .
##     20:238435_CTGGTCTT/CT           .
##             20:279185_C/T           .
##   -------
##   seqinfo: 1 sequence from hg19 genome

Exercise

Construct a data frame containing:-

The locs.df data frame with details of all the coding variants
The metadata from the ranges object in the meta data frame
INFO columns DP, NS, RO and AO from info.df
- or any other columns you think might be of interest
Write this data frame to a tab-delimited text file

##   seqnames  start    end width strand LOCATION LOCSTART LOCEND QUERYID
## 1    chr20  76962  76962     1      +   coding      375    375      11
## 2    chr20 126310 126316     7      +   coding      313    319      51
## 3    chr20 168728 168728     1      -   coding       81     81     165
## 4    chr20 238435 238442     8      +   coding       16     23     269
## 5    chr20 279185 279185     1      +   coding      958    958     304
## 6    chr20 371972 371972     1      +   coding      333    333     454
##    TXID  CDSID GENEID PRECEDEID FOLLOWID  SYMBOL paramRangeID      REF
## 1 70477 206101 245938                    DEFB125         <NA>        T
## 2 70478 206103  81623                    DEFB126         <NA>  ACCCCCG
## 3 71525 209098 245939                    DEFB128         <NA>        T
## 4 70481 206108 400830                    DEFB132         <NA> CTGGTCTT
## 5 70482 206110  85364                     ZCCHC3         <NA>        C
## 6 70485 206116  57761                      TRIB3         <NA>        T
##     ALT     QUAL FILTER DP NS RO AO
## 1     C 112.0310      .  4  1  0  4
## 2 ACCCG  15.1176      .  4  1  2  2
## 3     A 237.0390      .  7  1  0  7
## 4    CT  62.9073      .  3  1  0  3
## 5     T  39.4505      .  5  1  2  3
## 6     C  11.0642      .  3  1  1  2

Working with VCF files

Mark Dunning

Last modified: 24 Jul 2016

Introduction

Calling SNVs

Understanding the vcf format

Exercise

Importing into Bioconductor

Working directory

Template

Multi-sample example

`INFO`

Accessing genotypes

Exercise

Variant locations

Overlapping with genes

Exercise

More-automated location of variants

Predicting Amino acid changes

Summary

EXTRA:- Producing an output table

Exercise

Appendix

Files used in session

References