Imagine that we’ve called a set of somatic variants from our paired tumour and normal sample. From the previous section we can define a set of filters that can give us confidence that our variants are legitimate biological differences, and not technical noise. In this session we will go to the next step and see how we can gain biological intuition from our calls.
annovar
to retrieve and annotate using pre-defined databasesannovar
to create a human-readable and searchable table of annotation resultsdplyr
”) that allow us to filter and subset our data more efficientlyThe calls we are going to be using is HCC1143_vs_HCC1143_BL.flagged.muts.vcf
, which should be in the Day3
directory
/home/participant/Course_Materials/Day3
/home/participant/Course_Materials/Day3/annotation-template.Rmd
There are many annotation resources available, but we have chosen to demonstrate annovar
. There is an overhead associated with downloading the required annotation files, and a lot of disk space is consumed. However, you will hopefully agree that once configured it is relatively easy to use. annovar
is a suite of perl scripts that can perform a variety of annotation and filtering tasks. It can also collate the results from annotating against different databases into one convenient table.
Important:- annovar license
annovar
is free for academic use. However, you need to supply a valid academic email address in order to download the software. If you wish to use annovar
after the course. You will have to fill out the form on the annovar website.
annovar
provides “gene-based”, region-based" and “filter-based” annotation and relies on pre-downloaded annotation files so all queries are done offline. Files can be downloaded from the UCSC genome browser or from the annovar website. Some of these files can be quite large, but fortunately we only have to download them once. For the course, we have downloaded some of these files to the directory /home/participant/Course_Materials/software/annovar/humandb
. Each annotation file has a genome version associated with it. For our purposes we are going to be using hg19
.
For your reference the script to download all the resources used in this session is
/home/participant/Course_Materials/software/annovar/annovar_commands.sh
Some of the commands in annovar
require a specific input format. Fortunately, they have provided a script that can convert from popular file formats to their own format.
We convert our .vcf
file using a script from annovar convert2annovar.pl
../software/annovar/convert2annovar.pl -format vcf4old HCC1143_vs_HCC1143_BL.flagged.muts.vcf > HCC1143_vs_HCC1143_BL.flagged.muts.avinput
-format: vcfold, the input is in vcf format (old-style)
-includeinfo: Include the INFO fields from the vcf file
>: redirect to the file of our choosing
1 16103 16103 T G unknown . 61
1 17222 17222 A G unknown . 38
The first five columns are compulsory and contain
All other columns are optional. These columns are not used in the annotation process, but will get carried-along with any annotation and filtering we perform if we wish.
Gene-based annotation is the default operation in annovar
and is used to identify variants that cause protein-coding changes. Various gene-definition systems are supported.
The following command will annotate the file we have just prepared against the reference file located in ../software/annovar/humandb/hg19_refGene.txt
. We would normally have to specify an annotation mode and database keyword. However, region-based annotation and refGene are the defaults for the annotation script.
../software/annovar/annotate_variation.pl -buildver hg19 HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb
Two files are created:
-HCC1143_vs_HCC1143_BL.flagged.muts.avinput.variant_function
This file lists all variants in the input file are pre-pends two columns. The first states whether a given variant is within an exonic or not and gives the relevant gene name.
ncRNA_intronic WASH7P 1 16103 16103 T G unknown . 61
ncRNA_intronic WASH7P 1 17222 17222 A G unknown . 38
intergenic OR4F5(dist=4562),LOC729737(dist=60203) 1 74570 74570 T A unknown . 16
intergenic OR4F5(dist=31437),LOC729737(dist=33328) 1 101445 101445 G C unknown . 66
intergenic OR4F5(dist=31448),LOC729737(dist=33317) 1 101456 101456 A G unknown . 67
intergenic OR4F5(dist=31459),LOC729737(dist=33306) 1 101467 101467 G A unknown . 71
intergenic OR4F5(dist=54575),LOC729737(dist=10190) 1 124583 124583 C T unknown . 17
ncRNA_exonic LOC729737 1 136586 136586 T G unknown . 55
ncRNA_exonic LOC729737 1 136836 136836 G A unknown . 65
intergenic LOC729737(dist=90889),LOC100133331(dist=92437) 1 231455 231455 T A unknown . 108
The second file is
HCC1143_vs_HCC1143_BL.flagged.muts.avinput.exonic_variant_function
line1389 synonymous SNV ESPN:NM_031475:exon10:c.C2217T:p.L739L, 1 6512048 6512048 C T unknown . 76
line1392 stopgain TNFRSF25:NM_003790:exon4:c.G370T:p.E124X,TNFRSF25:NM_148967:exon3:c.G235T:p.E79X,TNFRSF25:NM_148965:exon4:c.G370T:p.E124X,TNFRSF25:NM_148966:exon4:c.G370T:p.E124X,TNFRSF25:NM_001039664:exon4:c.G370T:p.E124X, 1 6524705 6524705 C A unknown . 64
line1531 unknown UNKNOWN 1 12887606 12887606 C G unknown . 66
line1554 nonsynonymous SNV PRAMEF4:NM_001009611:exon4:c.G1292A:p.S431N, 1 12939510 12939510 C T unknown . 58
line1558 nonsynonymous SNV PRAMEF4:NM_001009611:exon2:c.T278A:p.L93Q, 1 12942938 12942938 A T unknown . 63
line1559 nonsynonymous SNV PRAMEF4:NM_001009611:exon2:c.A275C:p.N92T, 1 12942941 12942941 T G unknown . 63
line1673 nonsynonymous SNV FAM231B:NM_001310155:exon1:c.G113C:p.S38T, 1 16865673 16865673 G C unknown . 51
line1682 unknown UNKNOWN 1 16890558 16890558 C G unknown . 218
line1683 unknown UNKNOWN 1 16890559 16890559 A G unknown . 216
line1687 unknown UNKNOWN 1 16894476 16894476 G C unknown . 216
This file lists for the coding variants only whether the stated base-change will lead to a change in protein or not. Full definitions of all the terms is on the annovar website
Regioin-based annotation is used when we want to see if our variants fall within pre-defined genomic regions of interest. Here we list some examples along with the “keyword” to be used when downloading and annotating against the database. More comprehensive details on the types of region-based filtering that is supported can be found on the http://annovar.openbioinformatics.org/en/latest/user-guide/region/.
Database | Keyword |
---|---|
Cytoband | cytoband |
microRNA and snoRNA | wgRna |
Conserved Regions | phastConsElements46way |
Transcription Factor Binding Sites | tfbsConsSites |
Published GWAS | gwasCatalog |
Database of Genomic Variants | dgvMerged |
Segmental Duplications | genomicSuperDups |
Annotation against regions identified in ENCODE project is also described in its own section on the annovar website. Alternatively, you can specify your own regions of interest in .bed or .gff format.
The simplest form of region annotation is probably when we want to look at which “cytoband” each variant belongs to. This uses a reference file that contains the coordinates for each cytoband in a given genome version.
To perform the annotation, we use need to specify -regionanno
, to perform region-annotation and set the dbtype
to cytoBand
.
../software/annovar/annotate_variation.pl -regionanno -build hg19 -dbtype cytoBand HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/
This should result in a file being created with .cytoBand
added as a file extension to your input file (e.g. HCC1143_vs_HCC1143_BL.flagged.muts.avinput.cytoBand
). An extra column has been added to indicate which chromosome band each variant is located on. In the event of a variant spanning multiple bands (obviously not in this case), the multiple bands would be listed here.
head HCC1143_vs_HCC1143_BL.flagged.muts.avinput.cytoBand
cytoBand 1p36.33 1 16103 16103 T G unknown . 61
cytoBand 1p36.33 1 17222 17222 A G unknown . 38
cytoBand 1p36.33 1 74570 74570 T A unknown . 16
cytoBand 1p36.33 1 101445 101445 G C unknown . 66
cytoBand 1p36.33 1 101456 101456 A G unknown . 67
cytoBand 1p36.33 1 101467 101467 G A unknown . 71
cytoBand 1p36.33 1 124583 124583 C T unknown . 17
cytoBand 1p36.33 1 136586 136586 T G unknown . 55
cytoBand 1p36.33 1 136836 136836 G A unknown . 65
cytoBand 1p36.33 1 231455 231455 T A unknown . 108
Exercise
Unlike the annotation performed in the previous section, which was based purely on the genomic location, filter-based annotation also takes the base-change into account. One can also use the observed frequency in the database population as a means to filter.
As before, many types of database are available. An additional complication compared to the region filters from the previous section is that several version numbers exist for each database. In this table, we list the latest version available as of July 2016.
Database | Keyword |
---|---|
dbSNP | snp138 |
1000 Genomes | 1000g2015aug |
Exome Aggregation Consortium | exac03 |
Exome Sequencing Project | esp6500si_all |
Complete Genomics Individuals | cg69 |
ClinVar | clinvar_20160302 |
COSMIC | cosmic70 |
ICGC | icgc21 |
NCI60 | nci60 |
As we will see, filter-based annotation is much the same as region- and gene-based. However, when downloading for the first time we have to add an additional -webfrom annovar
argument into the command as the filter-based files are provided by the annovar website. For the purposes of this session, all the files have already been provided.
It seems natural to use the latest version of a database for the most-comprehensive annotation. However, the annovar website suggests that we be cautious with regards to dbSNP. As dbSNP has been expanded to include variants from other large-scale projects (such as 1000 genomes) it has become somewhat “contaminated” by variants with very low frequency or with clinical relevance. Thus, dbSNP129 is regarded as being the last clean version of the database.
To filter against dbSNP 129 we select the filter
to .annotate_variation.pl
with a dbtype
of snp129
../software/annovar/annotate_variation.pl -filter -buildver hg19 -dbtype snp129 HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/
This time we get two files created; HCC1143_vs_HCC1143_BL.flagged.muts.avinput.hg19_snp129_dropped
and HCC1143_vs_HCC1143_BL.flagged.muts.avinput.hg19_snp129_filtered
. The difference being that the .._dropped
file contains details of all your variants that are in the database
head HCC1143_vs_HCC1143_BL.flagged.muts.avinput.hg19_snp129_dropped
Thus, the .._filtered
file contains all the variants not found in the database. The implication being that we want to remove common variants found amongst healthy individuals from our list. There is nothing to stop us using multiple versions of dbSNP in the filtering as we will see later-on.
How you choose to interpret the filtering results depends on whether you are interested in variants that are exclusive to healthy or diseased individuals.
Both dbSNP and 1000 genomes report frequencies amongst healthy individuals. The “complete genomics” (cg69
) set also provides variant frequencies for 69 healthy subjects (including 46 unrelated subjects). Particularly useful for those with exome data is the ESP (Exome Sequencing Project) which aims to identify variants in exonic regions from over 6000 individuals; including however some subjects with disease.
annovar
can also annotate against variants that are relevant to diseases such as cancer. For example, there is the COSMIC (Catalogue of Somatic Mutations in Cancer).
../software/annovar/annotate_variation.pl -filter -dbtype cosmic70 -buildver hg19 HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/
Again, the variants found in the database are printed in the _dropped
file along with details of the mutation in COSMIC that they match.
cosmic70 ID=COSM28322;OCCURENCE=2(breast) 1 108298069 108298069 G A unknown . 74
cosmic70 ID=COSM51138;OCCURENCE=1(breast) 1 147380295 147380295 C T unknown . 50
cosmic70 ID=COSM1472637;OCCURENCE=1(breast) 1 152189272 152189272 G A unknown . 14
cosmic70 ID=COSM33491;OCCURENCE=1(breast) 1 155253762 155253762 C T unknown . 65
cosmic70 ID=COSM33492;OCCURENCE=1(breast) 1 159558247 159558247 G A unknown . 47
cosmic70 ID=COSM50598;OCCURENCE=1(breast) 1 176811562 176811562 G T unknown . 49
cosmic70 ID=COSM33493;OCCURENCE=1(breast) 1 178414751 178414751 G C unknown . 56
cosmic70 ID=COSM25311;OCCURENCE=2(breast) 1 198675866 198675866 A C unknown . 29
cosmic70 ID=COSM51126;OCCURENCE=1(breast) 1 204082199 204082199 C T unknown . 79
cosmic70 ID=COSM33529;OCCURENCE=1(breast) 1 24687508 24687508 G A unknown . 74
This is slightly contrary to the 1000 genomes and dbSNP filtering we did previously where we wanted to retain variants that were not found in a database. Fortunately in the next section we will look at any approach where we can get results for all our variants in a single table that we can develop our own filtering criteria for.
annovar
also has a wide array of filters that can help us to predict the consequence of a particular mutation. These include predictions from methods such as Sifting Intolerant From Tolerant / SIFT, Polyphen , MutationTaster and FATHMM.
The keyword dbsnfp30a
will annotate against all annovar
s collection of prediction databases in one go. However, for the purposes of the practical we will not be running this as it takes a long time. We will however add predictions from SIFT to our output.
Up to this point, we have amassed quite a collection of tables of our set of variants annotated / filtered against various databases. A particularly appealing feature of annovar
is that we can define protocols to combine multiple filters and collect the results in the same table.
The -protocol
argument is used to list what types of annotation / filtering you want to be applied using the same keywrods as previously. The -operation
argument is used to specify whether each annotation is a gene (g
), region (r
) or filter (f
) -based annotation. You need to take care to provide the same number of names in both the -protocol
and -operation
arguments. When you run the following command you should see that each filter is applied in turn.
../software/annovar/table_annovar.pl HCC1143_vs_HCC1143_BL.flagged.muts.avinput ../software/annovar/humandb/ -buildver hg19 -out HCC1143_vs_HCC1143_BL.flagged.muts.annovar -remove -protocol refGene,cytoBand,gwasCatalog,genomicSuperDups,dgvMerged,snp129,1000g2015aug_all,esp6500si_all,cosmic70,nci60,ljb23_sift -operation g,r,r,r,r,f,f,f,f,f,f -nastring NA -csvout
The file HCC1143_vs_HCC1143_BL.flagged.muts.annovar.hg19_multianno.csv
is created. By specifying the -remove
argument in the command, any tables that are created as part of the process are removed. This .csv
file can now be viewed in Excel, or LibreOffice in our case (right-click on the file)
Having opened this file into a spreadsheet program we can of course filter and sort the data as we see fit. However, due to the sheer number of columns we have generated this may prove a bit unwieldy. Therefore we are going to introduce an R package that will make the interrogation and comprehension of such a table more intuitive.
dplyr
is a very powerful package for manipulating data in tables and is part of the tidy-verse collection of packages that provide a seamless and consistent interface to exploring and analysing data.
The process of analysing a dataset is rarely a linear one, and will generate scores of intermediate plots and tables along the way that don’t make their way to a final publication. Packages such as dplyr
and ggplot2
(for plotting) provide an interface whereby the effort to produce meaningful summaries and graphics of our data is minimised. This allows us to worry more about the interpretation and what questions we should ask, rather than writing code.
In the context of filtering variants this is useful, as we may have to re-visit our criteria many times before settling on a set of rules.
The cheatsheet is highly recommended and summarises all the functions within the package. We won’t have time to cover everything here today.
For more information on dplyr
see our one-day course Data Manipulation and Visualisation using R
The first step is to load the dplyr
library. We will also need VariantAnnotation
for this workflow.
library(VariantAnnotation)
library(dplyr)
An incredibly useful but simple function is tbl_df
which converts the standard R data.frame
into a special dplyr
one that has a nicer standard display. Rather than filling the whole R console we can see a snapshot of the data frame and summaries of the data contained in each column
anno <- read.csv("HCC1143_vs_HCC1143_BL.flagged.muts.annovar.hg19_multianno.csv")
anno <- tbl_df(anno)
anno
## # A tibble: 92,286 x 20
## Chr Start End Ref Alt Func.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 1 16103 16103 T G ncRNA_intronic
## 2 1 17222 17222 A G ncRNA_intronic
## 3 1 74570 74570 T A intergenic
## 4 1 101445 101445 G C intergenic
## 5 1 101456 101456 A G intergenic
## 6 1 101467 101467 G A intergenic
## 7 1 124583 124583 C T intergenic
## 8 1 136586 136586 T G ncRNA_exonic
## 9 1 136836 136836 G A ncRNA_exonic
## 10 1 231455 231455 T A intergenic
## # ... with 92,276 more rows, and 14 more variables: Gene.refGene <fctr>,
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
dplyr
introduces a series of verbs, each of which is capable of performing a specific operation on a data frame in an efficient manner. The syntax of the verbs is very consistent. Not only does this allow us to easily switch between using different operations, but we can chain the operations together as we will see later.
select
is the first of the dplyr
verbs that can be used to pick certain columns from the table
select(anno, Func.refGene)
## # A tibble: 92,286 x 1
## Func.refGene
## <fctr>
## 1 ncRNA_intronic
## 2 ncRNA_intronic
## 3 intergenic
## 4 intergenic
## 5 intergenic
## 6 intergenic
## 7 intergenic
## 8 ncRNA_exonic
## 9 ncRNA_exonic
## 10 intergenic
## # ... with 92,276 more rows
No great advance in our capabilities yet, as we already know how to select a variable from a table using the $
operator
anno$Func.refGene
However, the syntax to select multiple columns is a lot more concise. Mutiple columns can be selected by separating their names with a comma
Compare:-
select(anno, Func.refGene,ExonicFunc.refGene)
## # A tibble: 92,286 x 2
## Func.refGene ExonicFunc.refGene
## <fctr> <fctr>
## 1 ncRNA_intronic NA
## 2 ncRNA_intronic NA
## 3 intergenic NA
## 4 intergenic NA
## 5 intergenic NA
## 6 intergenic NA
## 7 intergenic NA
## 8 ncRNA_exonic NA
## 9 ncRNA_exonic NA
## 10 intergenic NA
## # ... with 92,276 more rows
to
anno[,c("Func.refGene","ExonicFunc.refGene")]
Non R users probably find these commands a bit obtuse
[ ]
?c
?[,...]
means display all rowsHowever, select
is a lot more intuitive; easier to convert to the sentance “Select the Func.refGene and ExonicFunc.refGene columns from patients”
We can select a range of columns with a :
select(anno, cytoBand:X1000g2015aug_all)
## # A tibble: 92,286 x 6
## cytoBand gwasCatalog genomicSuperDups
## <fctr> <fctr> <fctr>
## 1 1p36.33 NA Score=0.993729;Name=chr9:10843
## 2 1p36.33 NA Score=0.993729;Name=chr9:10843
## 3 1p36.33 NA Score=0.994828;Name=chr19:60000
## 4 1p36.33 NA Score=0.992727;Name=chr1:235525
## 5 1p36.33 NA Score=0.992727;Name=chr1:235525
## 6 1p36.33 NA Score=0.992727;Name=chr1:235525
## 7 1p36.33 NA Score=0.99448;Name=chr1:323738,chr5:180750353
## 8 1p36.33 NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 9 1p36.33 NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 10 1p36.33 NA Score=0.992138;Name=chr16:90251926
## # ... with 92,276 more rows, and 3 more variables: dgvMerged <fctr>,
## # snp129 <fctr>, X1000g2015aug_all <dbl>
Or suppress columns by putting with a -
in front of their name
select(anno, -End)
## # A tibble: 92,286 x 19
## Chr Start Ref Alt Func.refGene Gene.refGene
## <fctr> <int> <fctr> <fctr> <fctr> <fctr>
## 1 1 16103 T G ncRNA_intronic WASH7P
## 2 1 17222 A G ncRNA_intronic WASH7P
## 3 1 74570 T A intergenic OR4F5,LOC729737
## 4 1 101445 G C intergenic OR4F5,LOC729737
## 5 1 101456 A G intergenic OR4F5,LOC729737
## 6 1 101467 G A intergenic OR4F5,LOC729737
## 7 1 124583 C T intergenic OR4F5,LOC729737
## 8 1 136586 T G ncRNA_exonic LOC729737
## 9 1 136836 G A ncRNA_exonic LOC729737
## 10 1 231455 T A intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 13 more variables:
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
Combinations of the above are possible
select(anno, cytoBand:X1000g2015aug_all,-snp129,-AAChange.refGene)
## # A tibble: 92,286 x 5
## cytoBand gwasCatalog genomicSuperDups
## <fctr> <fctr> <fctr>
## 1 1p36.33 NA Score=0.993729;Name=chr9:10843
## 2 1p36.33 NA Score=0.993729;Name=chr9:10843
## 3 1p36.33 NA Score=0.994828;Name=chr19:60000
## 4 1p36.33 NA Score=0.992727;Name=chr1:235525
## 5 1p36.33 NA Score=0.992727;Name=chr1:235525
## 6 1p36.33 NA Score=0.992727;Name=chr1:235525
## 7 1p36.33 NA Score=0.99448;Name=chr1:323738,chr5:180750353
## 8 1p36.33 NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 9 1p36.33 NA Score=0.99448;Name=chr5:180750353,chr1:323738
## 10 1p36.33 NA Score=0.992138;Name=chr16:90251926
## # ... with 92,276 more rows, and 2 more variables: dgvMerged <fctr>,
## # X1000g2015aug_all <dbl>
Some convenient functions are available so we don’t need to specify the column names in full
select(anno, contains("1000g"))
## # A tibble: 92,286 x 1
## X1000g2015aug_all
## <dbl>
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 NA
## 8 NA
## 9 NA
## 10 NA
## # ... with 92,276 more rows
Exercise
Func.refGene
to AAChange.refGene
contains
ends_with
to do this, or using the :
notationGeneDetail.refGene
## # A tibble: 92,286 x 7
## Chr Start End Func.refGene Gene.refGene
## <fctr> <int> <int> <fctr> <fctr>
## 1 1 16103 16103 ncRNA_intronic WASH7P
## 2 1 17222 17222 ncRNA_intronic WASH7P
## 3 1 74570 74570 intergenic OR4F5,LOC729737
## 4 1 101445 101445 intergenic OR4F5,LOC729737
## 5 1 101456 101456 intergenic OR4F5,LOC729737
## 6 1 101467 101467 intergenic OR4F5,LOC729737
## 7 1 124583 124583 intergenic OR4F5,LOC729737
## 8 1 136586 136586 ncRNA_exonic LOC729737
## 9 1 136836 136836 ncRNA_exonic LOC729737
## 10 1 231455 231455 intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 2 more variables:
## # ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>
We should note at this point, that we have not actually changed the data frame at any point; just created different views of the data.
anno
## # A tibble: 92,286 x 20
## Chr Start End Ref Alt Func.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 1 16103 16103 T G ncRNA_intronic
## 2 1 17222 17222 A G ncRNA_intronic
## 3 1 74570 74570 T A intergenic
## 4 1 101445 101445 G C intergenic
## 5 1 101456 101456 A G intergenic
## 6 1 101467 101467 G A intergenic
## 7 1 124583 124583 C T intergenic
## 8 1 136586 136586 T G ncRNA_exonic
## 9 1 136836 136836 G A ncRNA_exonic
## 10 1 231455 231455 T A intergenic
## # ... with 92,276 more rows, and 14 more variables: Gene.refGene <fctr>,
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
filter
A particularly useful operation for us is going to be filter
. Again, there is nothing here that we couldn’t do with “base R”, but it will allow us to create more readable code.
The syntax is consistent with select
and uses the usual selection techniques in R such as ==
.
Selecting variants on chromsome 1
filter(anno, Chr ==1)
## # A tibble: 6,151 x 20
## Chr Start End Ref Alt Func.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 1 16103 16103 T G ncRNA_intronic
## 2 1 17222 17222 A G ncRNA_intronic
## 3 1 74570 74570 T A intergenic
## 4 1 101445 101445 G C intergenic
## 5 1 101456 101456 A G intergenic
## 6 1 101467 101467 G A intergenic
## 7 1 124583 124583 C T intergenic
## 8 1 136586 136586 T G ncRNA_exonic
## 9 1 136836 136836 G A ncRNA_exonic
## 10 1 231455 231455 T A intergenic
## # ... with 6,141 more rows, and 14 more variables: Gene.refGene <fctr>,
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
Selecting exonic variants
filter(anno, Func.refGene == "exonic")
## # A tibble: 642 x 20
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 1 6512048 6512048 C T exonic ESPN
## 2 1 6524705 6524705 C A exonic TNFRSF25
## 3 1 12887606 12887606 C G exonic PRAMEF11
## 4 1 12939510 12939510 C T exonic PRAMEF4
## 5 1 12942938 12942938 A T exonic PRAMEF4
## 6 1 12942941 12942941 T G exonic PRAMEF4
## 7 1 16865673 16865673 G C exonic FAM231B
## 8 1 16890558 16890558 C G exonic NBPF1
## 9 1 16890559 16890559 A G exonic NBPF1
## 10 1 16894476 16894476 G C exonic NBPF1
## # ... with 632 more rows, and 13 more variables:
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
Multiple conditions can be separated by a ,
. This is the same as using &
in base R to combine various tests.
“Exonic variants on chromosome 1 that are no synonymous or unknown”
filter(anno, Func.refGene == "exonic", Chr == 1, ExonicFunc.refGene != "synonymous SNV", ExonicFunc.refGene != "unknown")
## # A tibble: 33 x 20
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 1 6524705 6524705 C A exonic TNFRSF25
## 2 1 12939510 12939510 C T exonic PRAMEF4
## 3 1 12942938 12942938 A T exonic PRAMEF4
## 4 1 12942941 12942941 T G exonic PRAMEF4
## 5 1 16865673 16865673 G C exonic FAM231B
## 6 1 17275337 17275337 C T exonic CROCC
## 7 1 21807427 21807427 G T exonic NBPF3
## 8 1 24687508 24687508 G A exonic STPG1
## 9 1 40535490 40535490 C G exonic CAP1
## 10 1 93701865 93701865 A G exonic CCDC18
## # ... with 23 more rows, and 13 more variables: GeneDetail.refGene <fctr>,
## # ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## # gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## # snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## # cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>
Partial matches can be performed using grepl
. We often use grep
to find the indices where a particular string is found amongst a vector. However, in order to do filtering dplyr
requires a logical vector the same length as the number of rows.
grep("breast",anno$cosmic70)
## [1] 1392 1965 2729 2828 4069 4566 4607 4649 4844 4862 5060
## [12] 5151 6056 6066 6794 10270 12129 12200 12537 12718 12961 13105
## [23] 13147 15110 15303 15346 19835 22537 25353 25369 25440 25776 27077
## [34] 27082 27086 27180 27976 28060 36433 36466 36523 39009 46487 47180
## [45] 48900 48971 49285 49681 49682 49907 50679 53326 53430 53492 54277
## [56] 54279 54761 56672 57652 57667 58300 58561 60798 61005 63998 65150
## [67] 65281 70305 70407 71375 71423 73028 73381 73731 75522 76711 78735
## [78] 81454 87125 87214 90498 90578
grepl("breast",anno$cosmic70)[1:10]
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
filter(anno, grepl("breast",cosmic70))
## # A tibble: 82 x 20
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 1 6524705 6524705 C A exonic TNFRSF25
## 2 1 24687508 24687508 G A exonic STPG1
## 3 1 93701865 93701865 A G exonic CCDC18
## 4 1 108298069 108298069 G A exonic VAV3
## 5 1 147380295 147380295 C T exonic GJA8
## 6 1 152189272 152189272 G A exonic HRNR
## 7 1 155253762 155253762 C T intronic HCN3
## 8 1 159558247 159558247 G A exonic APCS
## 9 1 176811562 176811562 G T exonic PAPPA2
## 10 1 178414751 178414751 G C exonic RASAL2
## # ... with 72 more rows, and 13 more variables: GeneDetail.refGene <fctr>,
## # ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## # gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## # snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## # cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>
During annovar’s filtering we get NA
entries whenever a particular variant is not found in a database. To filter for these NA
values we use the function is.na
.
e.g. Variants that are found in dbSNP
filter(anno, is.na(snp129))
## # A tibble: 82,329 x 20
## Chr Start End Ref Alt Func.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 1 16103 16103 T G ncRNA_intronic
## 2 1 74570 74570 T A intergenic
## 3 1 101445 101445 G C intergenic
## 4 1 101456 101456 A G intergenic
## 5 1 101467 101467 G A intergenic
## 6 1 124583 124583 C T intergenic
## 7 1 136586 136586 T G ncRNA_exonic
## 8 1 136836 136836 G A ncRNA_exonic
## 9 1 231455 231455 T A intergenic
## 10 1 237545 237545 C A intergenic
## # ... with 82,319 more rows, and 14 more variables: Gene.refGene <fctr>,
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
Variants that are found in GWAS; here we use a !
to reverse the result of is.na
; TRUE
-> FALSE
and FALSE
-> TRUE
filter(anno, !is.na(gwasCatalog))
## # A tibble: 2 x 20
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 1 221010205 221010205 C T ncRNA_intronic HLX-AS1
## 2 14 106232585 106232585 A G intergenic ELK2AP,MIR4507
## # ... with 13 more variables: GeneDetail.refGene <fctr>,
## # ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## # gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## # snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## # cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>
Variants that not found in 1000 genomes, or have a maf frequency less than 0.05. Here we use the operation |
to combine different logical tests.
filter(anno, is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05)
## # A tibble: 88,961 x 20
## Chr Start End Ref Alt Func.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 1 16103 16103 T G ncRNA_intronic
## 2 1 17222 17222 A G ncRNA_intronic
## 3 1 74570 74570 T A intergenic
## 4 1 101445 101445 G C intergenic
## 5 1 101456 101456 A G intergenic
## 6 1 101467 101467 G A intergenic
## 7 1 124583 124583 C T intergenic
## 8 1 136586 136586 T G ncRNA_exonic
## 9 1 136836 136836 G A ncRNA_exonic
## 10 1 231455 231455 T A intergenic
## # ... with 88,951 more rows, and 14 more variables: Gene.refGene <fctr>,
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
Exercise
esp6500si_all
A slighty annoyance with the exercise above is that once we’ve filtered for low frequency esp variants, we don’t actually see the esp6500si_all
column appearing in the output:-
filter(anno, esp6500si_all > 0.1)
## # A tibble: 46 x 20
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 2 89196757 89196757 G A intergenic MIR4436A,NONE
## 2 2 89196906 89196906 A C intergenic MIR4436A,NONE
## 3 2 89197190 89197190 C T intergenic MIR4436A,NONE
## 4 2 89246951 89246951 C T intergenic MIR4436A,NONE
## 5 2 89278489 89278489 A C intergenic MIR4436A,NONE
## 6 2 89292182 89292182 A G intergenic MIR4436A,NONE
## 7 2 89292201 89292201 C T intergenic MIR4436A,NONE
## 8 2 89292401 89292401 T C intergenic MIR4436A,NONE
## 9 2 89309542 89309542 T G intergenic MIR4436A,NONE
## 10 2 89309554 89309554 T A intergenic MIR4436A,NONE
## # ... with 36 more rows, and 13 more variables: GeneDetail.refGene <fctr>,
## # ExonicFunc.refGene <fctr>, AAChange.refGene <fctr>, cytoBand <fctr>,
## # gwasCatalog <fctr>, genomicSuperDups <fctr>, dgvMerged <fctr>,
## # snp129 <fctr>, X1000g2015aug_all <dbl>, esp6500si_all <dbl>,
## # cosmic70 <fctr>, nci60 <dbl>, ljb23_sift <fctr>
Given what we have done so far, it would seem natural to follow this filter
statement with a select
to view the columns we are interested in
The following line of R code is perfectly valid and does exactly what we want. However, reading the code back to ourselves is a bit clunky.
select(filter(anno, esp6500si_all > 0.1), Chr:Alt, esp6500si_all)
## # A tibble: 46 x 6
## Chr Start End Ref Alt esp6500si_all
## <fctr> <int> <int> <fctr> <fctr> <dbl>
## 1 2 89196757 89196757 G A 0.234492
## 2 2 89196906 89196906 A C 0.234575
## 3 2 89197190 89197190 C T 0.235162
## 4 2 89246951 89246951 C T 0.137349
## 5 2 89278489 89278489 A C 0.321513
## 6 2 89292182 89292182 A G 0.265630
## 7 2 89292201 89292201 C T 0.108410
## 8 2 89292401 89292401 T C 0.187469
## 9 2 89309542 89309542 T G 0.173259
## 10 2 89309554 89309554 T A 0.172195
## # ... with 36 more rows
select from variants with more than 0.1 frequency in the exome sequencing project, the columns Chr to Alt and esp6500si_all
or we can split over multiple lines
select(filter(anno, esp6500si_all > 0.1),
Chr:Alt, esp6500si_all)
## # A tibble: 46 x 6
## Chr Start End Ref Alt esp6500si_all
## <fctr> <int> <int> <fctr> <fctr> <dbl>
## 1 2 89196757 89196757 G A 0.234492
## 2 2 89196906 89196906 A C 0.234575
## 3 2 89197190 89197190 C T 0.235162
## 4 2 89246951 89246951 C T 0.137349
## 5 2 89278489 89278489 A C 0.321513
## 6 2 89292182 89292182 A G 0.265630
## 7 2 89292201 89292201 C T 0.108410
## 8 2 89292401 89292401 T C 0.187469
## 9 2 89309542 89309542 T G 0.173259
## 10 2 89309554 89309554 T A 0.172195
## # ... with 36 more rows
We always have to work out what the first statement was, and work forwards from that. - the arguments to a function are far-away from the function call
piping is a familiar concept in unix where the output of one command is used as input to the following line. The dplyr
package allows such operations to be performed in R via use of the %>%
operation from the magrittr
package
dplyr
functions are designed to take a data frame as their first argument, and return a data frame as an output. This makes them ideal to use with a pipe.
The code from before can be rewritten as follows, making it easier to see the order in which the operations were performed.
filter(anno, esp6500si_all > 0.1) %>%
select(Chr:Alt, cosmic70)
## # A tibble: 46 x 6
## Chr Start End Ref Alt cosmic70
## <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 2 89196757 89196757 G A NA
## 2 2 89196906 89196906 A C NA
## 3 2 89197190 89197190 C T NA
## 4 2 89246951 89246951 C T NA
## 5 2 89278489 89278489 A C NA
## 6 2 89292182 89292182 A G NA
## 7 2 89292201 89292201 C T NA
## 8 2 89292401 89292401 T C NA
## 9 2 89309542 89309542 T G NA
## 10 2 89309554 89309554 T A NA
## # ... with 36 more rows
Exercise
select
to print the chromsome, start, end, Gene name and 1000 genomes frequency and GWAS catalogue details for these variants## # A tibble: 2 x 6
## Chr Start End Gene.refGene X1000g2015aug_all
## <fctr> <int> <int> <fctr> <dbl>
## 1 1 221010205 221010205 HLX-AS1 0.228435
## 2 14 106232585 106232585 ELK2AP,MIR4507 0.282748
## # ... with 1 more variables: gwasCatalog <fctr>
filter(anno, Func.refGene=="exonic") %>%
filter(!is.na(genomicSuperDups)) %>%
filter(is.na(snp129)) %>%
filter(X1000g2015aug_all < 0.05 | is.na(X1000g2015aug_all)) %>%
select(Chr:ExonicFunc.refGene,-GeneDetail.refGene,-Func.refGene,X1000g2015aug_all)
## # A tibble: 200 x 8
## Chr Start End Ref Alt Gene.refGene
## <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 1 6512048 6512048 C T ESPN
## 2 1 12939510 12939510 C T PRAMEF4
## 3 1 12942938 12942938 A T PRAMEF4
## 4 1 12942941 12942941 T G PRAMEF4
## 5 1 16890558 16890558 C G NBPF1
## 6 1 16890559 16890559 A G NBPF1
## 7 1 16894476 16894476 G C NBPF1
## 8 1 17275337 17275337 C T CROCC
## 9 1 40535490 40535490 C G CAP1
## 10 1 117158772 117158772 C T IGSF3
## # ... with 190 more rows, and 2 more variables: ExonicFunc.refGene <fctr>,
## # X1000g2015aug_all <dbl>
The mutate
operation allows us to compute and append one or more new columns to a data frame. We can specify a new column name, or overwrite an existing one. As above, we do not change the data frame so need to create a new variable if we want the changes to persist.
For example, we might want to add the “chr” prefix to all our chromosome names. In order to perform this operation we need to specify a function that will take a vector as an input and return as vector of the same length as an output.
mutate(anno, Chr = paste("Chr", Chr,sep=""))
## # A tibble: 92,286 x 20
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <chr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 Chr1 16103 16103 T G ncRNA_intronic WASH7P
## 2 Chr1 17222 17222 A G ncRNA_intronic WASH7P
## 3 Chr1 74570 74570 T A intergenic OR4F5,LOC729737
## 4 Chr1 101445 101445 G C intergenic OR4F5,LOC729737
## 5 Chr1 101456 101456 A G intergenic OR4F5,LOC729737
## 6 Chr1 101467 101467 G A intergenic OR4F5,LOC729737
## 7 Chr1 124583 124583 C T intergenic OR4F5,LOC729737
## 8 Chr1 136586 136586 T G ncRNA_exonic LOC729737
## 9 Chr1 136836 136836 G A ncRNA_exonic LOC729737
## 10 Chr1 231455 231455 T A intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 13 more variables:
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
mutate(anno, Chr = paste0("Chr", Chr))
## # A tibble: 92,286 x 20
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <chr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 Chr1 16103 16103 T G ncRNA_intronic WASH7P
## 2 Chr1 17222 17222 A G ncRNA_intronic WASH7P
## 3 Chr1 74570 74570 T A intergenic OR4F5,LOC729737
## 4 Chr1 101445 101445 G C intergenic OR4F5,LOC729737
## 5 Chr1 101456 101456 A G intergenic OR4F5,LOC729737
## 6 Chr1 101467 101467 G A intergenic OR4F5,LOC729737
## 7 Chr1 124583 124583 C T intergenic OR4F5,LOC729737
## 8 Chr1 136586 136586 T G ncRNA_exonic LOC729737
## 9 Chr1 136836 136836 G A ncRNA_exonic LOC729737
## 10 Chr1 231455 231455 T A intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 13 more variables:
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>
anno <- mutate(anno, Chr = paste("Chr", Chr,sep=""))
To aid us selecting variants for follow-up, we need to know which ones have passed filters. We can first import the .vcf
file as we saw yesterday.
vcf <- readVcf("HCC1143_vs_HCC1143_BL.flagged.muts.vcf","hg19")
Recall that the FILTER
flag is stored in the metadata of the variant ranges object rowRanges
. We can add this to our existing annotation using mutate
.
anno.extra <- mutate(anno, FILTER = mcols(rowRanges(vcf))$FILTER)
anno.extra
## # A tibble: 92,286 x 21
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <chr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 Chr1 16103 16103 T G ncRNA_intronic WASH7P
## 2 Chr1 17222 17222 A G ncRNA_intronic WASH7P
## 3 Chr1 74570 74570 T A intergenic OR4F5,LOC729737
## 4 Chr1 101445 101445 G C intergenic OR4F5,LOC729737
## 5 Chr1 101456 101456 A G intergenic OR4F5,LOC729737
## 6 Chr1 101467 101467 G A intergenic OR4F5,LOC729737
## 7 Chr1 124583 124583 C T intergenic OR4F5,LOC729737
## 8 Chr1 136586 136586 T G ncRNA_exonic LOC729737
## 9 Chr1 136836 136836 G A ncRNA_exonic LOC729737
## 10 Chr1 231455 231455 T A intergenic LOC729737,LOC100133331
## # ... with 92,276 more rows, and 14 more variables:
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>, FILTER <chr>
We now have all the dplyr
operations in place to create a filtering pipeline. Firstly, probably want to use only those variants that passed the filters employed by caveman.
filter(anno.extra, FILTER=="PASS")
## # A tibble: 17,698 x 21
## Chr Start End Ref Alt Func.refGene
## <chr> <int> <int> <fctr> <fctr> <fctr>
## 1 Chr1 324486 324486 G C ncRNA_exonic
## 2 Chr1 847617 847617 G A intergenic
## 3 Chr1 882050 882050 C G intronic
## 4 Chr1 990394 990394 A C UTR3
## 5 Chr1 1114986 1114986 C G intronic
## 6 Chr1 1529885 1529885 G A intergenic
## 7 Chr1 1583724 1583724 T C intronic
## 8 Chr1 1649269 1649269 A T intronic
## 9 Chr1 1649276 1649276 C G intronic
## 10 Chr1 1651166 1651166 C T intronic
## # ... with 17,688 more rows, and 15 more variables: Gene.refGene <fctr>,
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>, FILTER <chr>
Now we select
filter(anno.extra, FILTER=="PASS") %>%
filter(is.na(genomicSuperDups)) %>%
filter(is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05) %>%
filter(is.na(esp6500si_all) | esp6500si_all < 0.05) %>%
filter(Func.refGene == "exonic") %>%
filter(ExonicFunc.refGene != "synonymous SNV")
## # A tibble: 124 x 21
## Chr Start End Ref Alt Func.refGene Gene.refGene
## <chr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 Chr1 6524705 6524705 C A exonic TNFRSF25
## 2 Chr1 24687508 24687508 G A exonic STPG1
## 3 Chr1 93701865 93701865 A G exonic CCDC18
## 4 Chr1 108298069 108298069 G A exonic VAV3
## 5 Chr1 159558247 159558247 G A exonic APCS
## 6 Chr1 162302880 162302880 A T exonic NOS1AP
## 7 Chr1 176811562 176811562 G T exonic PAPPA2
## 8 Chr1 178414751 178414751 G C exonic RASAL2
## 9 Chr1 198675866 198675866 A C exonic PTPRC
## 10 Chr1 207107897 207107897 G A exonic PIGR
## # ... with 114 more rows, and 14 more variables:
## # GeneDetail.refGene <fctr>, ExonicFunc.refGene <fctr>,
## # AAChange.refGene <fctr>, cytoBand <fctr>, gwasCatalog <fctr>,
## # genomicSuperDups <fctr>, dgvMerged <fctr>, snp129 <fctr>,
## # X1000g2015aug_all <dbl>, esp6500si_all <dbl>, cosmic70 <fctr>,
## # nci60 <dbl>, ljb23_sift <fctr>, FILTER <chr>
We can further reduce the set to those that SIFT considers to be deleterious
filter(anno.extra, FILTER=="PASS") %>%
filter(is.na(genomicSuperDups)) %>%
filter(is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05) %>%
filter(is.na(esp6500si_all) | esp6500si_all < 0.05) %>%
filter(Func.refGene == "exonic") %>%
filter(ExonicFunc.refGene != "synonymous SNV") %>%
filter(grepl("D", ljb23_sift)) %>%
select(Chr:Alt, Gene.refGene,ExonicFunc.refGene,cosmic70,ljb23_sift)
## # A tibble: 50 x 9
## Chr Start End Ref Alt Gene.refGene ExonicFunc.refGene
## <chr> <int> <int> <fctr> <fctr> <fctr> <fctr>
## 1 Chr1 6524705 6524705 C A TNFRSF25 stopgain
## 2 Chr1 24687508 24687508 G A STPG1 nonsynonymous SNV
## 3 Chr1 178414751 178414751 G C RASAL2 nonsynonymous SNV
## 4 Chr1 207107897 207107897 G A PIGR nonsynonymous SNV
## 5 Chr1 247320095 247320095 G C ZNF124 nonsynonymous SNV
## 6 Chr2 32361653 32361653 G T SPAST nonsynonymous SNV
## 7 Chr2 204304231 204304231 T C RAPH1 nonsynonymous SNV
## 8 Chr3 121340337 121340337 C T FBXO40 nonsynonymous SNV
## 9 Chr5 141039358 141039358 G C ARAP3 nonsynonymous SNV
## 10 Chr6 47685074 47685074 C T ADGRF4 nonsynonymous SNV
## # ... with 40 more rows, and 2 more variables: cosmic70 <fctr>,
## # ljb23_sift <fctr>
Finally, write out to a .csv
file
filter(anno.extra, FILTER=="PASS") %>%
filter(is.na(genomicSuperDups)) %>%
filter(is.na(X1000g2015aug_all) | X1000g2015aug_all < 0.05) %>%
filter(is.na(esp6500si_all) | esp6500si_all < 0.05) %>%
filter(Func.refGene == "exonic") %>%
filter(ExonicFunc.refGene != "synonymous SNV") %>%
filter(grepl("D", ljb23_sift)) %>%
write.csv("HCC1143_vs_HCC1143_BL.flagged.muts.annovar.hg19_multianno.filtered.csv")
Depending on what your particular study is interested in, you may wish to omit or revise some of these steps.