---
title: "CRUK Summer School 2016 - Day 4"
author: "Geoff Macintyre"
date: "20/07/2016"
output: html_document
---

##The biology underlying somatic structural variation
[Lecture 1](lecture1.pdf)

```{r echo=FALSE}
show_answers<-FALSE
```


###Exercise 1: Understanding the relationship between SVs and short read sequencing
1. How might the following double-stranded break repair mechanisms manifest in short read sequence data aligned to reference genome at the breakpoint?
    + Homologous recombination
    + Non-homologous end-joining
    + Microhomology mediated end-joining

2. How would the following structural variants manifest in short read sequence data aligned to reference genome? 
    + Deletion
    + Tandem duplication
    + Inversion
    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: How many breaks are there? Consider both reads that overlap the break and those that span the break.)*

*Advanced: Can you sketch how reads might overlap translocations?*

******

##Methods for calling SVs
[Lecture 2](lecture2.pdf)

###Exercise 2 - Calling and visualising your SVs
0. Run BRASS on the downsampled test tumour
    a. Click on the CGP Docker link on the desktop
    b. Navigate to the directory /datastore
    c. make a new directory called brass_downsampled
    d. run brass with no parameters (brass.pl) and look carefully at the required parameters
    e. build a commandline for running brass on the downsampled bam file and hit run

1. Open the test tumour and normal genomes in IGV, navigate to the following locations and record the number of reads supporting each breakpoint.
    + chr1:150447295-150447299
    + chr4:92896530-92896534
    + chr6:24910138-24910139

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: switch "Show soft-clipped bases" on under Tracks -> Preferences -> Alignments)*

2. Load the deep whole-genome cell-line BRASS output (SV bedpe format) into R. 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: Use the read.table function with the parameters: header, sep, stringsAsFactors, skip and comment.char. You will also need to reformat the chromosome co-ordinates e.g. change 1 to chr1.)*

```{r echo=show_answers}
SVbedpe<-read.table("HCC1143_vs_HCC1143_BL.annot.bedpe", 
                    header = T, sep="\t",stringsAsFactors = F, skip=69, comment.char = "")
colnames(SVbedpe)[1]<-"chr1"
SVbedpe$chr1<-paste0("chr",SVbedpe$chr1)
SVbedpe$chr2<-paste0("chr",SVbedpe$chr2)
```

3. Generate a plot showing a summary of the number, type and location of SVs.
```{r echo=show_answers}
suppressMessages(library(ggplot2))
suppressMessages(library(RColorBrewer))
#generate a vector of distinct colours for plotting
n <- 30
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
#plot
ggplot(SVbedpe,aes(svclass,fill=factor(chr1)))+geom_bar()+scale_fill_manual(values=col_vector)+coord_flip()
```


4. Generate a plot showing the size distributions of the different SV classes.

```{r echo=show_answers, warning=F}
ggplot(SVbedpe[SVbedpe$bkdist>0,],aes(bkdist))+geom_density()+facet_grid(svclass ~ .,scales = "free_y")+xlim(0,1e6)
```


5. Using the circlize package generate a circos plot with SVs as links (use [this](https://cran.r-project.org/web/packages/circlize/vignettes/genomic_plot.pdf) documentation to assist you) 

```{r echo=show_answers}
suppressMessages(library(circlize))
par(mar = c(1, 1, 1, 1))
circos.initializeWithIdeogram(plotType = c('axis', 'labels'))
bed1<-SVbedpe[,1:3]
bed2<-SVbedpe[,4:6]
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
colours<-cbbPalette[1:length(unique(SVbedpe$svclass))]
names(colours)<-unique(SVbedpe$svclass)
circos.genomicLink(bed1, bed2,col=colours[SVbedpe$svclass])
legend("topright", legend = names(colours),fill=colours,cex=0.35)
```

******

##Improving the quality of SV calls
[Lecture 3](lecture3.pdf)

###Exercise 3: Filtering your SV calls
1. Load the ASCAT copy-number calls into R (for convenience these are stored in HCC1143.Rdata). Find the data object which contains segmented copy numbers and output the first 10 lines.
```{r echo=show_answers}
load("HCC1143.Rdata")
head(ascat.output$segments)
```

2. Using these segments, add the copy-number calls to your circos plot.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: use the genomicTrackPlotRegion function)*
```{r echo=show_answers}
totalcn<-ascat.output$segments[,c("chr","startpos","endpos")]
totalcn<-cbind(totalcn,value=ascat.output$segments$nMajor+ascat.output$segments$nMinor)
totalcn$chr<-paste0("chr",totalcn$chr)
totalcn$value[totalcn$value>10]<-10
circos.initializeWithIdeogram(plotType = c('axis', 'labels'))
circos.genomicTrackPlotRegion(totalcn,ylim=c(0,10),
                              panel.fun=function(region,value,...){
                                i=getI(...)
                 circos.genomicLines(region,value,type="segment",lwd=3,col="blue",...)})
circos.genomicLink(bed1, bed2,col=colours[SVbedpe$svclass])
legend("topright", legend = names(colours),fill=colours,cex=0.35)
```

3. How many copy-number breakpoints can be explained by SV breakpoints?

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: Firstly, convert the copy-number breakpoints (not segments!) to a GRanges object. Then convert the bedpe table to an [InteractionSet](http://bioconductor.org/packages/release/bioc/vignettes/InteractionSet/inst/doc/interactions.html) object. You can use the countOverlaps function to test for concordance within a specified distance.)*

```{r echo=show_answers}
suppressMessages(library(GenomicRanges))
suppressMessages(library(InteractionSet))
cnbp<-sortSeqlevels(GRanges(seqnames = c(totalcn$chr,totalcn$chr),ranges = IRanges(c(totalcn$startpos,totalcn$endpos),c(totalcn$startpos,totalcn$endpos)+1)))
svbp<-GInteractions(GRanges(bed1$chr1,IRanges(bed1$start1,bed1$end1)),GRanges(bed2$chr2,IRanges(bed2$start2,bed2$end2)))
```
Within 10,000bp (count and fraction)?
```{r echo=show_answers}
hit<-sum(countOverlaps(cnbp,svbp,maxgap = 10000)>0)
#count
hit
#fraction
hit/length(cnbp)
```

Within 100,000bp (count and fraction)?
```{r echo=show_answers}
hit<-sum(countOverlaps(cnbp,svbp,maxgap = 100000)>0)
#count
hit
#fraction
hit/length(cnbp)
```

You may think that this isn't particularly good. Perhaps BRASS is missing some calls? Perhaps the copy-number is noisy? Can we improve this?

*(Advanced: explore the events that don't have matching breakpoints. Can you work out what might be going wrong?)*

8. Look at the metadata provided by brass and determine which variables might be useful for filtering.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: use the View() function to look at the complete table os SVs. Use the documentation located [here](https://github.com/cancerit/BRASS/blob/dev/perl/docs.tar.gz) to help decipher the column descriptors.*

9. Decide on a filtering strategy and generate a final SV callset.
```{r echo=show_answers}
SVbedpe_filtered<-SVbedpe[SVbedpe$sample=="HCC1143",] #keep only variants appearing in the tumour
SVbedpe_filtered<-SVbedpe_filtered[!(SVbedpe_filtered$svclass=="deletion"&SVbedpe_filtered$copynumber_flag==0),] #remove dels not near CN changepoint
SVbedpe_filtered<-SVbedpe_filtered[!(SVbedpe_filtered$svclass=="tandem-duplication"&SVbedpe_filtered$copynumber_flag==0),] #remove dups not near CN changepoint
```

*Advanced: In the Day 4 directory you will also find SV calls from another caller, MANTA (somaticSV.bedpe). Can you compute the overlap between callers? Which is better?*

******

##The functional consequences of SVs
[Lecture 4](lecture4.pdf)

###Exercise 4: Annotating your SVs
1. Manually (using pen and paper) reconstruct the following set of breakpoints into a single rearrangement event:
    + 2:47676199	+	2:50629634	+
    + 2:49437116	-	2:47678469	-
    + 2:50632712	-	2:49431472	+


2. What might the functional consequence of this rearrangement be?
```{r echo=FALSE}
if(show_answers==TRUE)
{
"http://www.nature.com/ncomms/2015/150401/ncomms7605/extref/ncomms7605-s1.pdf"
}
```

3. Look at the GRASS annotations in the BRASS bedpe output for the cell line. Extract the events which involve two different genes. Which of these will NOT give rise to a viable fusion protein?  
```{r echo=show_answers}
putative_fusions<-SVbedpe_filtered[SVbedpe_filtered$gene1!="_"&SVbedpe_filtered$gene2!="_",] #keep only events with two annotations
putative_fusions<-putative_fusions[putative_fusions$gene1!=putative_fusions$gene2,] #keep only events joining two different genes
non_fusions<-putative_fusions[(putative_fusions$svclass%in%c("deletion","tandem-duplication")&putative_fusions$strand1.1!=putative_fusions$strand2.1)|
                              (putative_fusions$svclass=="inversion"&putative_fusions$strand1.1==putative_fusions$strand2.1),]
non_fusions[,c(12,28,31,37,40)]
```

4. Which are most likely to give rise to a viable fusion product with at least 30% of whole exons from both genes?
```{r echo=show_answers}
putative_fusions<-putative_fusions[!putative_fusions$id.name%in%non_fusions$id.name,] # remove non-fusions
putative_fusions<-putative_fusions[putative_fusions$region1=="intron"&putative_fusions$region2=="intron",] #keep events lying  only in introns
putative_fusions<-putative_fusions[(as.numeric(putative_fusions$region_number1)/as.numeric(putative_fusions$total_region_count1)>0.3)&
                                   (as.numeric(putative_fusions$region_number2)/as.numeric(putative_fusions$total_region_count2)>0.3),]
putative_fusions[,c(12,28,31,34,35,37,40,43,44)]

```

*Advanced: What protein domains are included in these fusions?*


******

##Complex rearrangements
[Lecture 5](lecture5.pdf)

###Exercise 5: Characterising and annotating the HCC1143 double minute chromosome
1. This sample contains a double minute chromosome. Which chromosomes are likely to make up this event? Plot a circos plot of the SVs explaining the double minute. 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: use the [SKY karyotyping](http://www.pawefish.path.cam.ac.uk/BreastCellLineDescriptions/HCC1143.html) of this cell-line to assist in your search.)*

```{r echo=show_answers}
chroms<-c("chr1","chr6","chr11") #create a list of anchor chroms
dminsvbedpe<-SVbedpe_filtered[(SVbedpe_filtered$chr1%in%chroms|SVbedpe_filtered$chr2%in%chroms)&
                             SVbedpe_filtered$chr1!=SVbedpe_filtered$chr2,]#keep only translocations involving these chroms
bed1_dmin<-dminsvbedpe[,1:3] #create bed
bed2_dmin<-dminsvbedpe[,4:6] #create bed
totalcn_dmin<-totalcn[totalcn$chr%in%unique(c(bed1_dmin$chr1,bed2_dmin$chr1,bed1_dmin$chr2,bed2_dmin$chr2)),] #keep only CNs on these chroms
circos.initializeWithIdeogram(chromosome.index = unique(c(bed1_dmin$chr,bed2_dmin$chr)),plotType = c('axis', 'labels'))
circos.genomicTrackPlotRegion(totalcn,ylim=c(0,10),
                               panel.fun=function(region,value,...){
                                 i=getI(...)
                                 circos.genomicLines(region,value,type="segment",lwd=3,col="blue",...)})
circos.genomicLink(bed1_dmin, bed2_dmin,col=colours[dminsvbedpe$svclass])
```

2. What are the genes contained in this double-minute chromosome?

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*(Hint: use the boundingBox function to get the genomic coordinates spanning the double-minute. Use the list of cancer genes found in the COSMIC.67 package to annotate these regions - data(cgc_67, package = "COSMIC.67"))*
```{r echo=show_answers, warnings=F}
dminsvbp<-sort(swapAnchors(GInteractions(
  GRanges(dminsvbedpe$chr1,IRanges(dminsvbedpe$start1,dminsvbedpe$end1),seqinfo = seqinfo(svbp)),
  GRanges(dminsvbedpe$chr2,IRanges(dminsvbedpe$start2,dminsvbedpe$end2),seqinfo = seqinfo(svbp)))))#create an Interaction set object
all.chrs <- as.character(seqnames(regions(dminsvbp)))
f <- paste0(all.chrs[anchors(dminsvbp, type="first", id=TRUE)], ".",
            all.chrs[anchors(dminsvbp, type="second", id=TRUE)])
dminbb<-boundingBox(dminsvbp,f)

suppressMessages(library("TxDb.Hsapiens.UCSC.hg19.knownGene"))
data(cgc_67, package = "COSMIC.67")
allg<-select(TxDb.Hsapiens.UCSC.hg19.knownGene,keys=cgc_67$ENTREZID,keytype = "GENEID",columns=c("TXCHROM","TXSTART","TXEND"))
allg<-allg[!duplicated(allg$GENEID),]
allg<-merge(cgc_67,allg,by.x=2,by.y=1)
allg<-allg[!is.na(allg$TXCHROM),]
cgcranges<-GRanges(allg$TXCHROM,IRanges(allg$TXSTART,allg$TXEND),SYMBOL=allg[,2])
seqlevels(cgcranges)<-sort(seqlevels(cgcranges))
cgcranges<-sort(cgcranges)
genes_in_dmin<-subsetByOverlaps(cgcranges,dminbb)
unique(mcols(genes_in_dmin)[,1])
```

*Advanced: Can you find evidence of chromothripsis in the SVs or copy-number profiles? Use the criteria on the lecture slides to assist*