Outline

Imagine that we’ve called a set of somatic variants from our paired tumour and normal sample. From the previous section we can define a set of filters that can give us confidence that our variants are legitimate biological differences, and not technical noise. In this session we will go to the next step and see how we can gain biological intuition from our calls.

The calls we are going to be using is HCC1143_vs_HCC1143_BL.flagged.muts.vcf, which should be in the Day3 directory

Working Directory

/home/participant/Course_Materials/Day3

Template

/home/participant/Course_Materials/Day3/annotation-template.Rmd

Why annotate

  • Can have huge list of variants after SNV-calling
    • order of millions even after filtering
  • Not all will have consequences for the individual
  • Need to functionally annotate biological and biochemical function
  • Determine which variants have been previously-identified
    • in previous studies
    • amongst healthy / diseased individuals

Introduction to annovar

There are many annotation resources available, but we have chosen to demonstrate annovar. There is an overhead associated with downloading the required annotation files, and a lot of disk space is consumed. However, you will hopefully agree that once configured it is relatively easy to use. annovar is a suite of perl scripts that can perform a variety of annotation and filtering tasks. It can also collate the results from annotating against different databases into one convenient table.

Important:- annovar license

annovar is free for academic use. However, you need to supply a valid academic email address in order to download the software. If you wish to use annovar after the course. You will have to fill out the form on the annovar website.

annovar provides “gene-based”, region-based" and “filter-based” annotation and relies on pre-downloaded annotation files so all queries are done offline. Files can be downloaded from the UCSC genome browser or from the annovar website. Some of these files can be quite large, but fortunately we only have to download them once. For the course, we have downloaded some of these files to the directory /home/participant/Course_Materials/software/annovar/humandb. Each annotation file has a genome version associated with it. For our purposes we are going to be using hg19.

For your reference the script to download all the resources used in this session is

/home/participant/Course_Materials/software/annovar/annovar_commands.sh

Preparing files for annovar

Some of the commands in annovar require a specific input format. Fortunately, they have provided a script that can convert from popular file formats to their own format.

We convert our .vcf file using a script from annovar convert2annovar.pl

../software/annovar/convert2annovar.pl -format vcf4old  HCC1143_vs_HCC1143_BL.flagged.muts.vcf > HCC1143_vs_HCC1143_BL.flagged.muts.avinput
-format: vcfold, the input is in vcf format (old-style)
-includeinfo: Include the INFO fields from the vcf file
>: redirect to the file of our choosing
1   16103   16103   T   G   unknown .   61
1   17222   17222   A   G   unknown .   38

The first five columns are compulsory and contain

  • Chromosome
  • Start Position
  • End Position
  • Reference base(s)
  • Observed base(s)

All other columns are optional. These columns are not used in the annotation process, but will get carried-along with any annotation and filtering we perform if we wish.

Gene-based Annotation

Gene-based annotation is the default operation in annovar and is used to identify variants that cause protein-coding changes. Various gene-definition systems are supported.

The following command will annotate the file we have just prepared against the reference file located in ../software/annovar/humandb/hg19_refGene.txt. We would normally have to specify an annotation mode and database keyword. However, region-based annotation and refGene are the defaults for the annotation script.

../software/annovar/annotate_variation.pl -buildver hg19 HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb

Two files are created:

-HCC1143_vs_HCC1143_BL.flagged.muts.avinput.variant_function

This file lists all variants in the input file are pre-pends two columns. The first states whether a given variant is within an exonic or not and gives the relevant gene name.

ncRNA_intronic  WASH7P  1   16103   16103   T   G   unknown .   61
ncRNA_intronic  WASH7P  1   17222   17222   A   G   unknown .   38
intergenic  OR4F5(dist=4562),LOC729737(dist=60203)  1   74570   74570   T   A   unknown .   16
intergenic  OR4F5(dist=31437),LOC729737(dist=33328) 1   101445  101445  G   C   unknown .   66
intergenic  OR4F5(dist=31448),LOC729737(dist=33317) 1   101456  101456  A   G   unknown .   67
intergenic  OR4F5(dist=31459),LOC729737(dist=33306) 1   101467  101467  G   A   unknown .   71
intergenic  OR4F5(dist=54575),LOC729737(dist=10190) 1   124583  124583  C   T   unknown .   17
ncRNA_exonic    LOC729737   1   136586  136586  T   G   unknown .   55
ncRNA_exonic    LOC729737   1   136836  136836  G   A   unknown .   65
intergenic  LOC729737(dist=90889),LOC100133331(dist=92437)  1   231455  231455  T   A   unknown .   108

The second file is

  • HCC1143_vs_HCC1143_BL.flagged.muts.avinput.exonic_variant_function
line1389    synonymous SNV  ESPN:NM_031475:exon10:c.C2217T:p.L739L, 1   6512048 6512048 C   T   unknown .   76
line1392    stopgain    TNFRSF25:NM_003790:exon4:c.G370T:p.E124X,TNFRSF25:NM_148967:exon3:c.G235T:p.E79X,TNFRSF25:NM_148965:exon4:c.G370T:p.E124X,TNFRSF25:NM_148966:exon4:c.G370T:p.E124X,TNFRSF25:NM_001039664:exon4:c.G370T:p.E124X, 1   6524705 6524705 C   A   unknown .   64
line1531    unknown UNKNOWN 1   12887606    12887606    C   G   unknown .   66
line1554    nonsynonymous SNV   PRAMEF4:NM_001009611:exon4:c.G1292A:p.S431N,    1   12939510    12939510    C   T   unknown .   58
line1558    nonsynonymous SNV   PRAMEF4:NM_001009611:exon2:c.T278A:p.L93Q,  1   12942938    12942938    A   T   unknown .   63
line1559    nonsynonymous SNV   PRAMEF4:NM_001009611:exon2:c.A275C:p.N92T,  1   12942941    12942941    T   G   unknown .   63
line1673    nonsynonymous SNV   FAM231B:NM_001310155:exon1:c.G113C:p.S38T,  1   16865673    16865673    G   C   unknown .   51
line1682    unknown UNKNOWN 1   16890558    16890558    C   G   unknown .   218
line1683    unknown UNKNOWN 1   16890559    16890559    A   G   unknown .   216
line1687    unknown UNKNOWN 1   16894476    16894476    G   C   unknown .   216

This file lists for the coding variants only whether the stated base-change will lead to a change in protein or not. Full definitions of all the terms is on the annovar website

Region-based annotation

Regioin-based annotation is used when we want to see if our variants fall within pre-defined genomic regions of interest. Here we list some examples along with the “keyword” to be used when downloading and annotating against the database. More comprehensive details on the types of region-based filtering that is supported can be found on the http://annovar.openbioinformatics.org/en/latest/user-guide/region/.

Database Keyword
Cytoband cytoband
microRNA and snoRNA wgRna
Conserved Regions phastConsElements46way
Transcription Factor Binding Sites tfbsConsSites
Published GWAS gwasCatalog
Database of Genomic Variants dgvMerged
Segmental Duplications genomicSuperDups

Annotation against regions identified in ENCODE project is also described in its own section on the annovar website. Alternatively, you can specify your own regions of interest in .bed or .gff format.

The simplest form of region annotation is probably when we want to look at which “cytoband” each variant belongs to. This uses a reference file that contains the coordinates for each cytoband in a given genome version.

To perform the annotation, we use need to specify -regionanno, to perform region-annotation and set the dbtype to cytoBand.

../software/annovar/annotate_variation.pl -regionanno -build hg19 -dbtype cytoBand  HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/

This should result in a file being created with .cytoBand added as a file extension to your input file (e.g. HCC1143_vs_HCC1143_BL.flagged.muts.avinput.cytoBand). An extra column has been added to indicate which chromosome band each variant is located on. In the event of a variant spanning multiple bands (obviously not in this case), the multiple bands would be listed here.

head HCC1143_vs_HCC1143_BL.flagged.muts.avinput.cytoBand
cytoBand    1p36.33 1   16103   16103   T   G   unknown .   61
cytoBand    1p36.33 1   17222   17222   A   G   unknown .   38
cytoBand    1p36.33 1   74570   74570   T   A   unknown .   16
cytoBand    1p36.33 1   101445  101445  G   C   unknown .   66
cytoBand    1p36.33 1   101456  101456  A   G   unknown .   67
cytoBand    1p36.33 1   101467  101467  G   A   unknown .   71
cytoBand    1p36.33 1   124583  124583  C   T   unknown .   17
cytoBand    1p36.33 1   136586  136586  T   G   unknown .   55
cytoBand    1p36.33 1   136836  136836  G   A   unknown .   65
cytoBand    1p36.33 1   231455  231455  T   A   unknown .   108



Exercise

  • Select any of the databases from the table that you think might be relevant to your study, and annotate the example SNV calls
    • all the database files should already be downloaded.
    • try the GWAS catalogue if you are not sure



Filter-based annotation

Unlike the annotation performed in the previous section, which was based purely on the genomic location, filter-based annotation also takes the base-change into account. One can also use the observed frequency in the database population as a means to filter.

As before, many types of database are available. An additional complication compared to the region filters from the previous section is that several version numbers exist for each database. In this table, we list the latest version available as of July 2016.

Database Keyword
dbSNP snp138
1000 Genomes 1000g2015aug
Exome Aggregation Consortium exac03
Exome Sequencing Project esp6500si_all
Complete Genomics Individuals cg69
ClinVar clinvar_20160302
COSMIC cosmic70
ICGC icgc21
NCI60 nci60

As we will see, filter-based annotation is much the same as region- and gene-based. However, when downloading for the first time we have to add an additional -webfrom annovar argument into the command as the filter-based files are provided by the annovar website. For the purposes of this session, all the files have already been provided.

It seems natural to use the latest version of a database for the most-comprehensive annotation. However, the annovar website suggests that we be cautious with regards to dbSNP. As dbSNP has been expanded to include variants from other large-scale projects (such as 1000 genomes) it has become somewhat “contaminated” by variants with very low frequency or with clinical relevance. Thus, dbSNP129 is regarded as being the last clean version of the database.

To filter against dbSNP 129 we select the filter to .annotate_variation.pl with a dbtype of snp129

../software/annovar/annotate_variation.pl -filter -buildver hg19 -dbtype snp129  HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/

This time we get two files created; HCC1143_vs_HCC1143_BL.flagged.muts.avinput.hg19_snp129_dropped and HCC1143_vs_HCC1143_BL.flagged.muts.avinput.hg19_snp129_filtered. The difference being that the .._dropped file contains details of all your variants that are in the database

head HCC1143_vs_HCC1143_BL.flagged.muts.avinput.hg19_snp129_dropped

Thus, the .._filtered file contains all the variants not found in the database. The implication being that we want to remove common variants found amongst healthy individuals from our list. There is nothing to stop us using multiple versions of dbSNP in the filtering as we will see later-on.

Healthy or disease individuals?

How you choose to interpret the filtering results depends on whether you are interested in variants that are exclusive to healthy or diseased individuals.

Both dbSNP and 1000 genomes report frequencies amongst healthy individuals. The “complete genomics” (cg69) set also provides variant frequencies for 69 healthy subjects (including 46 unrelated subjects). Particularly useful for those with exome data is the ESP (Exome Sequencing Project) which aims to identify variants in exonic regions from over 6000 individuals; including however some subjects with disease.

annovar can also annotate against variants that are relevant to diseases such as cancer. For example, there is the COSMIC (Catalogue of Somatic Mutations in Cancer).

../software/annovar/annotate_variation.pl -filter -dbtype cosmic70 -buildver hg19 HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/

Again, the variants found in the database are printed in the _dropped file along with details of the mutation in COSMIC that they match.

cosmic70    ID=COSM28322;OCCURENCE=2(breast)    1   108298069   108298069   G   A   unknown .   74
cosmic70    ID=COSM51138;OCCURENCE=1(breast)    1   147380295   147380295   C   T   unknown .   50
cosmic70    ID=COSM1472637;OCCURENCE=1(breast)  1   152189272   152189272   G   A   unknown .   14
cosmic70    ID=COSM33491;OCCURENCE=1(breast)    1   155253762   155253762   C   T   unknown .   65
cosmic70    ID=COSM33492;OCCURENCE=1(breast)    1   159558247   159558247   G   A   unknown .   47
cosmic70    ID=COSM50598;OCCURENCE=1(breast)    1   176811562   176811562   G   T   unknown .   49
cosmic70    ID=COSM33493;OCCURENCE=1(breast)    1   178414751   178414751   G   C   unknown .   56
cosmic70    ID=COSM25311;OCCURENCE=2(breast)    1   198675866   198675866   A   C   unknown .   29
cosmic70    ID=COSM51126;OCCURENCE=1(breast)    1   204082199   204082199   C   T   unknown .   79
cosmic70    ID=COSM33529;OCCURENCE=1(breast)    1   24687508    24687508    G   A   unknown .   74

This is slightly contrary to the 1000 genomes and dbSNP filtering we did previously where we wanted to retain variants that were not found in a database. Fortunately in the next section we will look at any approach where we can get results for all our variants in a single table that we can develop our own filtering criteria for.

Predicting consequence

annovar also has a wide array of filters that can help us to predict the consequence of a particular mutation. These include predictions from methods such as Sifting Intolerant From Tolerant / SIFT, Polyphen , MutationTaster and FATHMM.

The keyword dbsnfp30a will annotate against all annovars collection of prediction databases in one go. However, for the purposes of the practical we will not be running this as it takes a long time. We will however add predictions from SIFT to our output.

Combining multiple annotations and filters

Up to this point, we have amassed quite a collection of tables of our set of variants annotated / filtered against various databases. A particularly appealing feature of annovar is that we can define protocols to combine multiple filters and collect the results in the same table.

The -protocol argument is used to list what types of annotation / filtering you want to be applied using the same keywrods as previously. The -operation argument is used to specify whether each annotation is a gene (g), region (r) or filter (f) -based annotation. You need to take care to provide the same number of names in both the -protocol and -operation arguments. When you run the following command you should see that each filter is applied in turn.

../software/annovar/table_annovar.pl HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/ -buildver hg19 -out HCC1143_vs_HCC1143_BL.flagged.muts.annovar -remove -protocol refGene,cytoBand,gwasCatalog,genomicSuperDups,dgvMerged,snp129,1000g2015aug_all,esp6500si_all,cosmic70,nci60,ljb23_sift -operation g,r,r,r,r,f,f,f,f,f,f -nastring NA -csvout

The file HCC1143_vs_HCC1143_BL.flagged.muts.annovar.hg19_multianno.csv is created. By specifying the -remove argument in the command, any tables that are created as part of the process are removed. This .csv file can now be viewed in Excel, or LibreOffice in our case (right-click on the file)

Having opened this file into a spreadsheet program we can of course filter and sort the data as we see fit. However, due to the sheer number of columns we have generated this may prove a bit unwieldy. Therefore we are going to introduce an R package that will make the interrogation and comprehension of such a table more intuitive.

Using R to interrogate and filter an annotation table

Introducing dplyr

dplyr is a very powerful package for manipulating data in tables and is part of the tidy-verse collection of packages that provide a seamless and consistent interface to exploring and analysing data.

The process of analysing a dataset is rarely a linear one, and will generate scores of intermediate plots and tables along the way that don’t make their way to a final publication. Packages such as dplyr and ggplot2 (for plotting) provide an interface whereby the effort to produce meaningful summaries and graphics of our data is minimised. This allows us to worry more about the interpretation and what questions we should ask, rather than writing code.

In the context of filtering variants this is useful, as we may have to re-visit our criteria many times before settling on a set of rules.

The cheatsheet is highly recommended and summarises all the functions within the package. We won’t have time to cover everything here today.

For more information on dplyr see our one-day course Data Manipulation and Visualisation using R

The first step is to load the dplyr library. We will also need VariantAnnotation for this workflow.

library(VariantAnnotation)
library(dplyr)

An incredibly useful but simple function is tbl_df which converts the standard R data.frame into a special dplyr one that has a nicer standard display. Rather than filling the whole R console we can see a snapshot of the data frame and summaries of the data contained in each column

anno <- read.csv("HCC1143_vs_HCC1143_BL.flagged.muts.annovar.hg19_multianno.csv")
anno <- tbl_df(anno)
anno
## # A tibble: 92,286 x 20
##       Chr  Start    End    Ref    Alt   Func.refGene
##    <fctr>  <int>  <int> <fctr> <fctr>         <fctr>
## 1       1  16103  16103      T      G ncRNA_intronic
## 2       1  17222  17222      A      G ncRNA_intronic
## 3       1  74570  74570      T      A     intergenic
## 4       1 101445 101445      G      C     intergenic
## 5       1 101456 101456      A      G     intergenic
## 6       1 101467 101467      G      A     intergenic
## 7       1 124583 124583      C      T     intergenic
## 8       1 136586 136586      T      G   ncRNA_exonic
## 9       1 136836 136836      G      A   ncRNA_exonic
## 10      1 231455 231455      T      A     intergenic
## # ... with 92,276 more rows, and 14 more variables: Gene.refGene <fctr>,
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>

dplyr introduces a series of verbs, each of which is capable of performing a specific operation on a data frame in an efficient manner. The syntax of the verbs is very consistent. Not only does this allow us to easily switch between using different operations, but we can chain the operations together as we will see later.

select

select is the first of the dplyr verbs that can be used to pick certain columns from the table

select(anno, Func.refGene)
## # A tibble: 92,286 x 1
##      Func.refGene
##            <fctr>
## 1  ncRNA_intronic
## 2  ncRNA_intronic
## 3      intergenic
## 4      intergenic
## 5      intergenic
## 6      intergenic
## 7      intergenic
## 8    ncRNA_exonic
## 9    ncRNA_exonic
## 10     intergenic
## # ... with 92,276 more rows

No great advance in our capabilities yet, as we already know how to select a variable from a table using the $ operator

anno$Func.refGene

However, the syntax to select multiple columns is a lot more concise. Mutiple columns can be selected by separating their names with a comma

Compare:-

select(anno, Func.refGene,ExonicFunc.refGene)
## # A tibble: 92,286 x 2
##      Func.refGene ExonicFunc.refGene
##            <fctr>             <fctr>
## 1  ncRNA_intronic                 NA
## 2  ncRNA_intronic                 NA
## 3      intergenic                 NA
## 4      intergenic                 NA
## 5      intergenic                 NA
## 6      intergenic                 NA
## 7      intergenic                 NA
## 8    ncRNA_exonic                 NA
## 9    ncRNA_exonic                 NA
## 10     intergenic                 NA
## # ... with 92,276 more rows

to

anno[,c("Func.refGene","ExonicFunc.refGene")]

Non R users probably find these commands a bit obtuse

  • why the [ ]?
  • what is c?
  • need to remember the row and column index
  • [,...] means display all rows

However, select is a lot more intuitive; easier to convert to the sentance “Select the Func.refGene and ExonicFunc.refGene columns from patients”

We can select a range of columns with a :

select(anno, cytoBand:X1000g2015aug_all)
## # A tibble: 92,286 x 6
##    cytoBand gwasCatalog                              genomicSuperDups
##      <fctr>      <fctr>                                        <fctr>
## 1   1p36.33          NA                Score=0.993729;Name=chr9:10843
## 2   1p36.33          NA                Score=0.993729;Name=chr9:10843
## 3   1p36.33          NA               Score=0.994828;Name=chr19:60000
## 4   1p36.33          NA               Score=0.992727;Name=chr1:235525
## 5   1p36.33          NA               Score=0.992727;Name=chr1:235525
## 6   1p36.33          NA               Score=0.992727;Name=chr1:235525
## 7   1p36.33          NA Score=0.99448;Name=chr1:323738,chr5:180750353
## 8   1p36.33          NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 9   1p36.33          NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 10  1p36.33          NA            Score=0.992138;Name=chr16:90251926
## # ... with 92,276 more rows, and 3 more variables: dgvMerged <fctr>,
## #   snp129 <fctr>, X1000g2015aug_all <dbl>

Or suppress columns by putting with a - in front of their name

select(anno, -End)
## # A tibble: 92,286 x 19
##       Chr  Start    Ref    Alt   Func.refGene           Gene.refGene
##    <fctr>  <int> <fctr> <fctr>         <fctr>                 <fctr>
## 1       1  16103      T      G ncRNA_intronic                 WASH7P
## 2       1  17222      A      G ncRNA_intronic                 WASH7P
## 3       1  74570      T      A     intergenic        OR4F5,LOC729737
## 4       1 101445      G      C     intergenic        OR4F5,LOC729737
## 5       1 101456      A      G     intergenic        OR4F5,LOC729737
## 6       1 101467      G      A     intergenic        OR4F5,LOC729737
## 7       1 124583      C      T     intergenic        OR4F5,LOC729737
## 8       1 136586      T      G   ncRNA_exonic              LOC729737
## 9       1 136836      G      A   ncRNA_exonic              LOC729737
## 10      1 231455      T      A     intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 13 more variables:
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>

Combinations of the above are possible

select(anno, cytoBand:X1000g2015aug_all,-snp129,-AAChange.refGene)
## # A tibble: 92,286 x 5
##    cytoBand gwasCatalog                              genomicSuperDups
##      <fctr>      <fctr>                                        <fctr>
## 1   1p36.33          NA                Score=0.993729;Name=chr9:10843
## 2   1p36.33          NA                Score=0.993729;Name=chr9:10843
## 3   1p36.33          NA               Score=0.994828;Name=chr19:60000
## 4   1p36.33          NA               Score=0.992727;Name=chr1:235525
## 5   1p36.33          NA               Score=0.992727;Name=chr1:235525
## 6   1p36.33          NA               Score=0.992727;Name=chr1:235525
## 7   1p36.33          NA Score=0.99448;Name=chr1:323738,chr5:180750353
## 8   1p36.33          NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 9   1p36.33          NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 10  1p36.33          NA            Score=0.992138;Name=chr16:90251926
## # ... with 92,276 more rows, and 2 more variables: dgvMerged <fctr>,
## #   X1000g2015aug_all <dbl>

Some convenient functions are available so we don’t need to specify the column names in full

select(anno, contains("1000g"))
## # A tibble: 92,286 x 1
##    X1000g2015aug_all
##                <dbl>
## 1                 NA
## 2                 NA
## 3                 NA
## 4                 NA
## 5                 NA
## 6                 NA
## 7                 NA
## 8                 NA
## 9                 NA
## 10                NA
## # ... with 92,276 more rows

Exercise

  • Display the columns Func.refGene to AAChange.refGene
    • you could try the convenience functions contains ends_with to do this, or using the : notation
  • Now remove the column GeneDetail.refGene
  • Append the chromosome start and end position columns
## # A tibble: 92,286 x 7
##       Chr  Start    End   Func.refGene           Gene.refGene
##    <fctr>  <int>  <int>         <fctr>                 <fctr>
## 1       1  16103  16103 ncRNA_intronic                 WASH7P
## 2       1  17222  17222 ncRNA_intronic                 WASH7P
## 3       1  74570  74570     intergenic        OR4F5,LOC729737
## 4       1 101445 101445     intergenic        OR4F5,LOC729737
## 5       1 101456 101456     intergenic        OR4F5,LOC729737
## 6       1 101467 101467     intergenic        OR4F5,LOC729737
## 7       1 124583 124583     intergenic        OR4F5,LOC729737
## 8       1 136586 136586   ncRNA_exonic              LOC729737
## 9       1 136836 136836   ncRNA_exonic              LOC729737
## 10      1 231455 231455     intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 2 more variables:
## #   ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>

We should note at this point, that we have not actually changed the data frame at any point; just created different views of the data.

anno
## # A tibble: 92,286 x 20
##       Chr  Start    End    Ref    Alt   Func.refGene
##    <fctr>  <int>  <int> <fctr> <fctr>         <fctr>
## 1       1  16103  16103      T      G ncRNA_intronic
## 2       1  17222  17222      A      G ncRNA_intronic
## 3       1  74570  74570      T      A     intergenic
## 4       1 101445 101445      G      C     intergenic
## 5       1 101456 101456      A      G     intergenic
## 6       1 101467 101467      G      A     intergenic
## 7       1 124583 124583      C      T     intergenic
## 8       1 136586 136586      T      G   ncRNA_exonic
## 9       1 136836 136836      G      A   ncRNA_exonic
## 10      1 231455 231455      T      A     intergenic
## # ... with 92,276 more rows, and 14 more variables: Gene.refGene <fctr>,
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>

filter

A particularly useful operation for us is going to be filter. Again, there is nothing here that we couldn’t do with “base R”, but it will allow us to create more readable code.

The syntax is consistent with select and uses the usual selection techniques in R such as ==.

Selecting variants on chromsome 1

filter(anno, Chr ==1)
## # A tibble: 6,151 x 20
##       Chr  Start    End    Ref    Alt   Func.refGene
##    <fctr>  <int>  <int> <fctr> <fctr>         <fctr>
## 1       1  16103  16103      T      G ncRNA_intronic
## 2       1  17222  17222      A      G ncRNA_intronic
## 3       1  74570  74570      T      A     intergenic
## 4       1 101445 101445      G      C     intergenic
## 5       1 101456 101456      A      G     intergenic
## 6       1 101467 101467      G      A     intergenic
## 7       1 124583 124583      C      T     intergenic
## 8       1 136586 136586      T      G   ncRNA_exonic
## 9       1 136836 136836      G      A   ncRNA_exonic
## 10      1 231455 231455      T      A     intergenic
## # ... with 6,141 more rows, and 14 more variables: Gene.refGene <fctr>,
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>

Selecting exonic variants

filter(anno, Func.refGene == "exonic")
## # A tibble: 642 x 20
##       Chr    Start      End    Ref    Alt Func.refGene Gene.refGene
##    <fctr>    <int>    <int> <fctr> <fctr>       <fctr>       <fctr>
## 1       1  6512048  6512048      C      T       exonic         ESPN
## 2       1  6524705  6524705      C      A       exonic     TNFRSF25
## 3       1 12887606 12887606      C      G       exonic     PRAMEF11
## 4       1 12939510 12939510      C      T       exonic      PRAMEF4
## 5       1 12942938 12942938      A      T       exonic      PRAMEF4
## 6       1 12942941 12942941      T      G       exonic      PRAMEF4
## 7       1 16865673 16865673      G      C       exonic      FAM231B
## 8       1 16890558 16890558      C      G       exonic        NBPF1
## 9       1 16890559 16890559      A      G       exonic        NBPF1
## 10      1 16894476 16894476      G      C       exonic        NBPF1
## # ... with 632 more rows, and 13 more variables:
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>

Multiple conditions can be separated by a ,. This is the same as using & in base R to combine various tests.

Exonic variants on chromosome 1 that are no synonymous or unknown

filter(anno, Func.refGene == "exonic", Chr == 1, ExonicFunc.refGene != "synonymous SNV", ExonicFunc.refGene != "unknown")
## # A tibble: 33 x 20
##       Chr    Start      End    Ref    Alt Func.refGene Gene.refGene
##    <fctr>    <int>    <int> <fctr> <fctr>       <fctr>       <fctr>
## 1       1  6524705  6524705      C      A       exonic     TNFRSF25
## 2       1 12939510 12939510      C      T       exonic      PRAMEF4
## 3       1 12942938 12942938      A      T       exonic      PRAMEF4
## 4       1 12942941 12942941      T      G       exonic      PRAMEF4
## 5       1 16865673 16865673      G      C       exonic      FAM231B
## 6       1 17275337 17275337      C      T       exonic        CROCC
## 7       1 21807427 21807427      G      T       exonic        NBPF3
## 8       1 24687508 24687508      G      A       exonic        STPG1
## 9       1 40535490 40535490      C      G       exonic         CAP1
## 10      1 93701865 93701865      A      G       exonic       CCDC18
## # ... with 23 more rows, and 13 more variables: GeneDetail.refGene <fctr>,
## #   ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## #   gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## #   snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## #   cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>

Partial matches can be performed using grepl. We often use grep to find the indices where a particular string is found amongst a vector. However, in order to do filtering dplyr requires a logical vector the same length as the number of rows.

grep("breast",anno$cosmic70)
##  [1]  1392  1965  2729  2828  4069  4566  4607  4649  4844  4862  5060
## [12]  5151  6056  6066  6794 10270 12129 12200 12537 12718 12961 13105
## [23] 13147 15110 15303 15346 19835 22537 25353 25369 25440 25776 27077
## [34] 27082 27086 27180 27976 28060 36433 36466 36523 39009 46487 47180
## [45] 48900 48971 49285 49681 49682 49907 50679 53326 53430 53492 54277
## [56] 54279 54761 56672 57652 57667 58300 58561 60798 61005 63998 65150
## [67] 65281 70305 70407 71375 71423 73028 73381 73731 75522 76711 78735
## [78] 81454 87125 87214 90498 90578
grepl("breast",anno$cosmic70)[1:10]
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
filter(anno, grepl("breast",cosmic70))
## # A tibble: 82 x 20
##       Chr     Start       End    Ref    Alt Func.refGene Gene.refGene
##    <fctr>     <int>     <int> <fctr> <fctr>       <fctr>       <fctr>
## 1       1   6524705   6524705      C      A       exonic     TNFRSF25
## 2       1  24687508  24687508      G      A       exonic        STPG1
## 3       1  93701865  93701865      A      G       exonic       CCDC18
## 4       1 108298069 108298069      G      A       exonic         VAV3
## 5       1 147380295 147380295      C      T       exonic         GJA8
## 6       1 152189272 152189272      G      A       exonic         HRNR
## 7       1 155253762 155253762      C      T     intronic         HCN3
## 8       1 159558247 159558247      G      A       exonic         APCS
## 9       1 176811562 176811562      G      T       exonic       PAPPA2
## 10      1 178414751 178414751      G      C       exonic       RASAL2
## # ... with 72 more rows, and 13 more variables: GeneDetail.refGene <fctr>,
## #   ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## #   gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## #   snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## #   cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>

During annovar’s filtering we get NA entries whenever a particular variant is not found in a database. To filter for these NA values we use the function is.na.

e.g. Variants that are found in dbSNP

filter(anno, is.na(snp129))
## # A tibble: 82,329 x 20
##       Chr  Start    End    Ref    Alt   Func.refGene
##    <fctr>  <int>  <int> <fctr> <fctr>         <fctr>
## 1       1  16103  16103      T      G ncRNA_intronic
## 2       1  74570  74570      T      A     intergenic
## 3       1 101445 101445      G      C     intergenic
## 4       1 101456 101456      A      G     intergenic
## 5       1 101467 101467      G      A     intergenic
## 6       1 124583 124583      C      T     intergenic
## 7       1 136586 136586      T      G   ncRNA_exonic
## 8       1 136836 136836      G      A   ncRNA_exonic
## 9       1 231455 231455      T      A     intergenic
## 10      1 237545 237545      C      A     intergenic
## # ... with 82,319 more rows, and 14 more variables: Gene.refGene <fctr>,
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>

Variants that are found in GWAS; here we use a ! to reverse the result of is.na; TRUE -> FALSE and FALSE -> TRUE

filter(anno, !is.na(gwasCatalog))
## # A tibble: 2 x 20
##      Chr     Start       End    Ref    Alt   Func.refGene   Gene.refGene
##   <fctr>     <int>     <int> <fctr> <fctr>         <fctr>         <fctr>
## 1      1 221010205 221010205      C      T ncRNA_intronic        HLX-AS1
## 2     14 106232585 106232585      A      G     intergenic ELK2AP,MIR4507
## # ... with 13 more variables: GeneDetail.refGene <fctr>,
## #   ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## #   gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## #   snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## #   cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>

Variants that not found in 1000 genomes, or have a maf frequency less than 0.05. Here we use the operation | to combine different logical tests.

filter(anno, is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05)
## # A tibble: 88,961 x 20
##       Chr  Start    End    Ref    Alt   Func.refGene
##    <fctr>  <int>  <int> <fctr> <fctr>         <fctr>
## 1       1  16103  16103      T      G ncRNA_intronic
## 2       1  17222  17222      A      G ncRNA_intronic
## 3       1  74570  74570      T      A     intergenic
## 4       1 101445 101445      G      C     intergenic
## 5       1 101456 101456      A      G     intergenic
## 6       1 101467 101467      G      A     intergenic
## 7       1 124583 124583      C      T     intergenic
## 8       1 136586 136586      T      G   ncRNA_exonic
## 9       1 136836 136836      G      A   ncRNA_exonic
## 10      1 231455 231455      T      A     intergenic
## # ... with 88,951 more rows, and 14 more variables: Gene.refGene <fctr>,
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>

Exercise

  • Find which variants are not in segmental duplications
  • Find all variants that are in TP53 and exonic
  • Find variants with greater than 0.1 frequency in the Exome Sequencing Project
    • column esp6500si_all

Combining commands with “pipes”

A slighty annoyance with the exercise above is that once we’ve filtered for low frequency esp variants, we don’t actually see the esp6500si_all column appearing in the output:-

filter(anno, esp6500si_all > 0.1)
## # A tibble: 46 x 20
##       Chr    Start      End    Ref    Alt Func.refGene  Gene.refGene
##    <fctr>    <int>    <int> <fctr> <fctr>       <fctr>        <fctr>
## 1       2 89196757 89196757      G      A   intergenic MIR4436A,NONE
## 2       2 89196906 89196906      A      C   intergenic MIR4436A,NONE
## 3       2 89197190 89197190      C      T   intergenic MIR4436A,NONE
## 4       2 89246951 89246951      C      T   intergenic MIR4436A,NONE
## 5       2 89278489 89278489      A      C   intergenic MIR4436A,NONE
## 6       2 89292182 89292182      A      G   intergenic MIR4436A,NONE
## 7       2 89292201 89292201      C      T   intergenic MIR4436A,NONE
## 8       2 89292401 89292401      T      C   intergenic MIR4436A,NONE
## 9       2 89309542 89309542      T      G   intergenic MIR4436A,NONE
## 10      2 89309554 89309554      T      A   intergenic MIR4436A,NONE
## # ... with 36 more rows, and 13 more variables: GeneDetail.refGene <fctr>,
## #   ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## #   gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## #   snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## #   cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>

Given what we have done so far, it would seem natural to follow this filter statement with a select to view the columns we are interested in

The following line of R code is perfectly valid and does exactly what we want. However, reading the code back to ourselves is a bit clunky.

select(filter(anno, esp6500si_all > 0.1), Chr:Alt, esp6500si_all)
## # A tibble: 46 x 6
##       Chr    Start      End    Ref    Alt esp6500si_all
##    <fctr>    <int>    <int> <fctr> <fctr>         <dbl>
## 1       2 89196757 89196757      G      A      0.234492
## 2       2 89196906 89196906      A      C      0.234575
## 3       2 89197190 89197190      C      T      0.235162
## 4       2 89246951 89246951      C      T      0.137349
## 5       2 89278489 89278489      A      C      0.321513
## 6       2 89292182 89292182      A      G      0.265630
## 7       2 89292201 89292201      C      T      0.108410
## 8       2 89292401 89292401      T      C      0.187469
## 9       2 89309542 89309542      T      G      0.173259
## 10      2 89309554 89309554      T      A      0.172195
## # ... with 36 more rows

select from variants with more than 0.1 frequency in the exome sequencing project, the columns Chr to Alt and esp6500si_all

or we can split over multiple lines

select(filter(anno, esp6500si_all > 0.1),
       Chr:Alt, esp6500si_all)
## # A tibble: 46 x 6
##       Chr    Start      End    Ref    Alt esp6500si_all
##    <fctr>    <int>    <int> <fctr> <fctr>         <dbl>
## 1       2 89196757 89196757      G      A      0.234492
## 2       2 89196906 89196906      A      C      0.234575
## 3       2 89197190 89197190      C      T      0.235162
## 4       2 89246951 89246951      C      T      0.137349
## 5       2 89278489 89278489      A      C      0.321513
## 6       2 89292182 89292182      A      G      0.265630
## 7       2 89292201 89292201      C      T      0.108410
## 8       2 89292401 89292401      T      C      0.187469
## 9       2 89309542 89309542      T      G      0.173259
## 10      2 89309554 89309554      T      A      0.172195
## # ... with 36 more rows

We always have to work out what the first statement was, and work forwards from that. - the arguments to a function are far-away from the function call

piping is a familiar concept in unix where the output of one command is used as input to the following line. The dplyr package allows such operations to be performed in R via use of the %>% operation from the magrittr package

  • Shortcut in RStudio: CTRL + SHIFT + M, CMD + SHIFT + M (Mac)

dplyr functions are designed to take a data frame as their first argument, and return a data frame as an output. This makes them ideal to use with a pipe.

The code from before can be rewritten as follows, making it easier to see the order in which the operations were performed.

filter(anno, esp6500si_all > 0.1) %>% 
  select(Chr:Alt, cosmic70)
## # A tibble: 46 x 6
##       Chr    Start      End    Ref    Alt cosmic70
##    <fctr>    <int>    <int> <fctr> <fctr>   <fctr>
## 1       2 89196757 89196757      G      A       NA
## 2       2 89196906 89196906      A      C       NA
## 3       2 89197190 89197190      C      T       NA
## 4       2 89246951 89246951      C      T       NA
## 5       2 89278489 89278489      A      C       NA
## 6       2 89292182 89292182      A      G       NA
## 7       2 89292201 89292201      C      T       NA
## 8       2 89292401 89292401      T      C       NA
## 9       2 89309542 89309542      T      G       NA
## 10      2 89309554 89309554      T      A       NA
## # ... with 36 more rows



Exercise

  • Find which variants are present in the GWAS catalog
  • Use select to print the chromsome, start, end, Gene name and 1000 genomes frequency and GWAS catalogue details for these variants
  • Use a “pipe” command to write your answer
## # A tibble: 2 x 6
##      Chr     Start       End   Gene.refGene X1000g2015aug_all
##   <fctr>     <int>     <int>         <fctr>             <dbl>
## 1      1 221010205 221010205        HLX-AS1          0.228435
## 2     14 106232585 106232585 ELK2AP,MIR4507          0.282748
## # ... with 1 more variables: gwasCatalog <fctr>
  • Explain in words what the following chain of commands is doing
filter(anno, Func.refGene=="exonic") %>% 
  filter(!is.na(genomicSuperDups)) %>% 
  filter(is.na(snp129)) %>% 
  filter(X1000g2015aug_all < 0.05 | is.na(X1000g2015aug_all)) %>% 
  select(Chr:ExonicFunc.refGene,-GeneDetail.refGene,-Func.refGene,X1000g2015aug_all)
## # A tibble: 200 x 8
##       Chr     Start       End    Ref    Alt Gene.refGene
##    <fctr>     <int>     <int> <fctr> <fctr>       <fctr>
## 1       1   6512048   6512048      C      T         ESPN
## 2       1  12939510  12939510      C      T      PRAMEF4
## 3       1  12942938  12942938      A      T      PRAMEF4
## 4       1  12942941  12942941      T      G      PRAMEF4
## 5       1  16890558  16890558      C      G        NBPF1
## 6       1  16890559  16890559      A      G        NBPF1
## 7       1  16894476  16894476      G      C        NBPF1
## 8       1  17275337  17275337      C      T        CROCC
## 9       1  40535490  40535490      C      G         CAP1
## 10      1 117158772 117158772      C      T        IGSF3
## # ... with 190 more rows, and 2 more variables: ExonicFunc.refGene <fctr>,
## #   X1000g2015aug_all <dbl>



Adding extra information to the table

The mutate operation allows us to compute and append one or more new columns to a data frame. We can specify a new column name, or overwrite an existing one. As above, we do not change the data frame so need to create a new variable if we want the changes to persist.

For example, we might want to add the “chr” prefix to all our chromosome names. In order to perform this operation we need to specify a function that will take a vector as an input and return as vector of the same length as an output.

mutate(anno, Chr = paste("Chr", Chr,sep=""))
## # A tibble: 92,286 x 20
##      Chr  Start    End    Ref    Alt   Func.refGene           Gene.refGene
##    <chr>  <int>  <int> <fctr> <fctr>         <fctr>                 <fctr>
## 1   Chr1  16103  16103      T      G ncRNA_intronic                 WASH7P
## 2   Chr1  17222  17222      A      G ncRNA_intronic                 WASH7P
## 3   Chr1  74570  74570      T      A     intergenic        OR4F5,LOC729737
## 4   Chr1 101445 101445      G      C     intergenic        OR4F5,LOC729737
## 5   Chr1 101456 101456      A      G     intergenic        OR4F5,LOC729737
## 6   Chr1 101467 101467      G      A     intergenic        OR4F5,LOC729737
## 7   Chr1 124583 124583      C      T     intergenic        OR4F5,LOC729737
## 8   Chr1 136586 136586      T      G   ncRNA_exonic              LOC729737
## 9   Chr1 136836 136836      G      A   ncRNA_exonic              LOC729737
## 10  Chr1 231455 231455      T      A     intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 13 more variables:
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>
mutate(anno, Chr = paste0("Chr", Chr))
## # A tibble: 92,286 x 20
##      Chr  Start    End    Ref    Alt   Func.refGene           Gene.refGene
##    <chr>  <int>  <int> <fctr> <fctr>         <fctr>                 <fctr>
## 1   Chr1  16103  16103      T      G ncRNA_intronic                 WASH7P
## 2   Chr1  17222  17222      A      G ncRNA_intronic                 WASH7P
## 3   Chr1  74570  74570      T      A     intergenic        OR4F5,LOC729737
## 4   Chr1 101445 101445      G      C     intergenic        OR4F5,LOC729737
## 5   Chr1 101456 101456      A      G     intergenic        OR4F5,LOC729737
## 6   Chr1 101467 101467      G      A     intergenic        OR4F5,LOC729737
## 7   Chr1 124583 124583      C      T     intergenic        OR4F5,LOC729737
## 8   Chr1 136586 136586      T      G   ncRNA_exonic              LOC729737
## 9   Chr1 136836 136836      G      A   ncRNA_exonic              LOC729737
## 10  Chr1 231455 231455      T      A     intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 13 more variables:
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>
anno <- mutate(anno, Chr = paste("Chr", Chr,sep=""))

To aid us selecting variants for follow-up, we need to know which ones have passed filters. We can first import the .vcf file as we saw yesterday.

vcf <- readVcf("HCC1143_vs_HCC1143_BL.flagged.muts.vcf","hg19")

Recall that the FILTER flag is stored in the metadata of the variant ranges object rowRanges. We can add this to our existing annotation using mutate.

anno.extra <- mutate(anno, FILTER = mcols(rowRanges(vcf))$FILTER)  
anno.extra
## # A tibble: 92,286 x 21
##      Chr  Start    End    Ref    Alt   Func.refGene           Gene.refGene
##    <chr>  <int>  <int> <fctr> <fctr>         <fctr>                 <fctr>
## 1   Chr1  16103  16103      T      G ncRNA_intronic                 WASH7P
## 2   Chr1  17222  17222      A      G ncRNA_intronic                 WASH7P
## 3   Chr1  74570  74570      T      A     intergenic        OR4F5,LOC729737
## 4   Chr1 101445 101445      G      C     intergenic        OR4F5,LOC729737
## 5   Chr1 101456 101456      A      G     intergenic        OR4F5,LOC729737
## 6   Chr1 101467 101467      G      A     intergenic        OR4F5,LOC729737
## 7   Chr1 124583 124583      C      T     intergenic        OR4F5,LOC729737
## 8   Chr1 136586 136586      T      G   ncRNA_exonic              LOC729737
## 9   Chr1 136836 136836      G      A   ncRNA_exonic              LOC729737
## 10  Chr1 231455 231455      T      A     intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 14 more variables:
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>, FILTER <chr>

Putting it together..

We now have all the dplyr operations in place to create a filtering pipeline. Firstly, probably want to use only those variants that passed the filters employed by caveman.

filter(anno.extra, FILTER=="PASS")
## # A tibble: 17,698 x 21
##      Chr   Start     End    Ref    Alt Func.refGene
##    <chr>   <int>   <int> <fctr> <fctr>       <fctr>
## 1   Chr1  324486  324486      G      C ncRNA_exonic
## 2   Chr1  847617  847617      G      A   intergenic
## 3   Chr1  882050  882050      C      G     intronic
## 4   Chr1  990394  990394      A      C         UTR3
## 5   Chr1 1114986 1114986      C      G     intronic
## 6   Chr1 1529885 1529885      G      A   intergenic
## 7   Chr1 1583724 1583724      T      C     intronic
## 8   Chr1 1649269 1649269      A      T     intronic
## 9   Chr1 1649276 1649276      C      G     intronic
## 10  Chr1 1651166 1651166      C      T     intronic
## # ... with 17,688 more rows, and 15 more variables: Gene.refGene <fctr>,
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>, FILTER <chr>

Now we select

  • Variants that pass caveman filters
  • Variants not in segmental duplications
  • Variants less than 0.05 frequency in 1000 genomes data
  • Variants less than 0.05 frequencing in Exome Sequencing Project data
  • Exonic variants
  • Variants that cause a change in amino acid
filter(anno.extra, FILTER=="PASS") %>% 
  filter(is.na(genomicSuperDups)) %>% 
  filter(is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05) %>% 
  filter(is.na(esp6500si_all) | esp6500si_all < 0.05) %>% 
  filter(Func.refGene == "exonic") %>% 
  filter(ExonicFunc.refGene != "synonymous SNV")
## # A tibble: 124 x 21
##      Chr     Start       End    Ref    Alt Func.refGene Gene.refGene
##    <chr>     <int>     <int> <fctr> <fctr>       <fctr>       <fctr>
## 1   Chr1   6524705   6524705      C      A       exonic     TNFRSF25
## 2   Chr1  24687508  24687508      G      A       exonic        STPG1
## 3   Chr1  93701865  93701865      A      G       exonic       CCDC18
## 4   Chr1 108298069 108298069      G      A       exonic         VAV3
## 5   Chr1 159558247 159558247      G      A       exonic         APCS
## 6   Chr1 162302880 162302880      A      T       exonic       NOS1AP
## 7   Chr1 176811562 176811562      G      T       exonic       PAPPA2
## 8   Chr1 178414751 178414751      G      C       exonic       RASAL2
## 9   Chr1 198675866 198675866      A      C       exonic        PTPRC
## 10  Chr1 207107897 207107897      G      A       exonic         PIGR
## # ... with 114 more rows, and 14 more variables:
## #   GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## #   AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## #   genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## #   X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## #   nci60 <dbl>, ljb23_sift <fctr>, FILTER <chr>

We can further reduce the set to those that SIFT considers to be deleterious

filter(anno.extra, FILTER=="PASS") %>% 
  filter(is.na(genomicSuperDups)) %>% 
  filter(is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05) %>% 
  filter(is.na(esp6500si_all) | esp6500si_all < 0.05) %>% 
  filter(Func.refGene == "exonic") %>% 
  filter(ExonicFunc.refGene != "synonymous SNV") %>% 
  filter(grepl("D", ljb23_sift)) %>% 
  select(Chr:Alt, Gene.refGene,ExonicFunc.refGene,cosmic70,ljb23_sift) 
## # A tibble: 50 x 9
##      Chr     Start       End    Ref    Alt Gene.refGene ExonicFunc.refGene
##    <chr>     <int>     <int> <fctr> <fctr>       <fctr>             <fctr>
## 1   Chr1   6524705   6524705      C      A     TNFRSF25           stopgain
## 2   Chr1  24687508  24687508      G      A        STPG1  nonsynonymous SNV
## 3   Chr1 178414751 178414751      G      C       RASAL2  nonsynonymous SNV
## 4   Chr1 207107897 207107897      G      A         PIGR  nonsynonymous SNV
## 5   Chr1 247320095 247320095      G      C       ZNF124  nonsynonymous SNV
## 6   Chr2  32361653  32361653      G      T        SPAST  nonsynonymous SNV
## 7   Chr2 204304231 204304231      T      C        RAPH1  nonsynonymous SNV
## 8   Chr3 121340337 121340337      C      T       FBXO40  nonsynonymous SNV
## 9   Chr5 141039358 141039358      G      C        ARAP3  nonsynonymous SNV
## 10  Chr6  47685074  47685074      C      T       ADGRF4  nonsynonymous SNV
## # ... with 40 more rows, and 2 more variables: cosmic70 <fctr>,
## #   ljb23_sift <fctr>

Finally, write out to a .csv file

filter(anno.extra, FILTER=="PASS") %>% 
  filter(is.na(genomicSuperDups)) %>% 
  filter(is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05) %>% 
  filter(is.na(esp6500si_all) | esp6500si_all < 0.05) %>% 
  filter(Func.refGene == "exonic") %>% 
  filter(ExonicFunc.refGene != "synonymous SNV") %>% 
  filter(grepl("D", ljb23_sift)) %>% 
  write.csv("HCC1143_vs_HCC1143_BL.flagged.muts.annovar.hg19_multianno.filtered.csv")

Depending on what your particular study is interested in, you may wish to omit or revise some of these steps.