For conversion from .bam format to .rda we can use the bamToR()
function. This function uses a Rsamtools package from the Bioconductor, so we have to load this package before.
library(Rsamtools)
library(directRNAExplorer)
bamData <- bamToR("P13.2016-11-17T16_40_36.1234062861011.fc1.ch1.sorted.bam")
We have to remember that we should set path to a directory containing this bam file.
Using frame created above we can find out, for each gene, how many nucleotides were recorded during the sequencing. For genes aggrergation we use the TAIR10_genes
data. In my examples I use data for first chromosome.
datawt <- dataChromosome1
databrm <- brmDataChromosome1
head(datawt)
#> DataFrame with 6 rows and 13 columns
#> qname flag rname strand pos qwidth mapq
#> <character> <integer> <factor> <factor> <integer> <integer> <integer>
#> 1 70661909252085 16 1 - 2001464 21 255
#> 2 70646155445083 16 1 - 2001594 23 255
#> 3 70657379404766 16 1 - 2002661 29 255
#> 4 70643789860258 16 1 - 2002663 26 255
#> 5 70652883110414 16 1 - 2002664 25 255
#> 6 70657379411167 16 1 - 2002666 24 255
#> cigar mrnm mpos isize seq
#> <character> <factor> <integer> <integer> <DNAStringSet>
#> 1 1M1D13M1D7M NA 0 0 GTTATCTTTTTTTTTTTATCT
#> 2 6M1D1M2D16M NA 0 0 TTTCTGTTTTTTCTCGCTGTTGC
#> 3 1M1D28M NA 0 0 TTTATGTGCT...GATTGGTTTC
#> 4 16M2D5M1D4M1D1M NA 0 0 TTATGTGCTT...GATTGTTTCC
#> 5 1M1D24M NA 0 0 TTGTGCTTGA...TGATTGGTTT
#> 6 24M NA 0 0 TGTGCTTGAG...TGATTGGTTT
#> qual
#> <PhredQuality>
#> 1 .....................
#> 2 .......................
#> 3 .......................
#> 4 .......................
#> 5 .......................
#> 6 .......................
genesSummary <- genesSummarize(datawt, chromosome = 1)
genesSummary <- dplyr::arrange(genesSummary, desc(counts))
head(genesSummary)
#> name counts uniqueCounts length
#> 1 AT1G01010 0 0 2268
#> 2 AT1G01020 0 0 2809
#> 3 AT1G01030 0 0 2065
#> 4 AT1G01040 0 0 8081
#> 5 AT1G01046 0 0 206
#> 6 AT1G01050 0 0 1983
We can compare the distribution of counts between two genes or part of the gene (for example three prime UTR, exons). We also can compare distributions between positive or negative strands. We can use for this comparision for example the two-sample Kolmogorov-Smirnov test. In next example I will use the data for gene “AT1G06680”.
ksDistributionTest(datawt, databrm, "AT1G06760")
#>
#> Two-sample Kolmogorov-Smirnov test
#>
#> data: dt_neg[dt_neg$condition == "type1", "position"] and dt_neg[dt_neg$condition == "type2", "position"]
#> D = 0.079927, p-value = 0.9829
#> alternative hypothesis: two-sided
For selected parts of chromosomes in two groups we can compute the table with p-values for chosen test about their distributions.
table <- pvalTable(datawt, databrm, chromosome="1", type ="KS")
head(dplyr::arrange(table, pval),6)
#> gene pval statistic
#> 1 AT1G06720 0.02736573 0.45174825
#> 2 AT1G06630 0.07844748 0.40079365
#> 3 AT1G06590 0.08502377 0.50000000
#> 4 AT1G06730 0.16083539 0.50000000
#> 5 AT1G06690 0.17340708 0.19909688
#> 6 AT1G06680 0.24452875 0.03018483
For chosen gene we can draw a coverage plot for datawt
and databrm
. First we have plots for the three prime UTR.
plotGeneDistribution(gene="AT1G06760", bamDataFrame=datawt)
plotGeneDistribution(gene="AT1G06760", bamDataFrame=databrm)
And plots for whole gene.
plotGeneDistribution(gene="AT1G06760", bamDataFrame=datawt)
plotGeneDistribution(gene="AT1G06760", bamDataFrame=databrm)
We can also see the distribution of these two samples on one plot.
{r, fig.width=7} # plotGeneComparisonDistribution("AT1G06760", datawt, databrm) #
In this case red color lines correspond to data from first dataset and blue - to second.
We can also see the ecdf of these two samples on one plot.
plotGeneComparisonDistribution("AT1G06760", datawt, databrm,stat="ecdf")
For two types of individuals we can compute negative binomial test to find significantly changed genes between one and the other type.
Firstly, we have to compute, from sequencing data, the table of counts for each gene from Arabidopsis. We can create this table using countsForAllGenes
function.
counts_genes_wt1 <- countsForAllGenes(dataChromosome1, geneData = directRNAExplorer::Araport11, genePart="gene")
To test differences between types of individuals we must compute at least two counts tables. One for the first type, and then for the second type. I’ve create a table contains inforamation for two types - wild type and a mutant type (brm). We have seven samples in our test.
allCounts <- directRNAExplorer::countsAll
We want to test these genes which could be significant. Firstly, we should select genes which have “good number” of counts. We assumed that the counts are “good” if more than at least 2 samples in the wt case and at least 3 samples in brm case we have more than 9 counts for gene. In addition, we are looking for such genes that have many counts in one of the subtypes, and almost zero in the second one.
genesTest <-c()
countsWithCondition <- allCounts
for(i in 2:ncol(allCounts)){
countsWithCondition[,i] <- ifelse(allCounts[,i]>=9, 1, 0)
}
allCounts_2 <- countsWithCondition
genes <- allCounts_2[,1]
allCountsTransposed <- t(allCounts_2[,-1])
allCountsTransposed <- data.frame(allCountsTransposed)
colnames(allCountsTransposed)<- genes
allCountsTransposed$type <- c(rep("wt", 3), rep("c", 4))
allCountsTransposed <- allCountsTransposed[,c(ncol(allCountsTransposed), 1:(ncol(allCountsTransposed)-1))]
sums <- aggregate(. ~ type, data=allCountsTransposed, FUN=sum)
for(i in 2:ncol(allCountsTransposed)){
if(((sums[1,i]>=3) & (sums[2,i]>=2)) || ((sums[1,i]>=3) & (sums[2,i]==0))||((sums[1,i]==0) & (sums[2,i]>=2))){
genesTest[i] <- colnames(allCountsTransposed)[i]
}else{
genesTest[i] <- NA
}
}
genesTest2<- na.omit(genesTest)
countsTest <- allCounts
countsTest <- countsTest[which(countsTest$name %in% genesTest2),]
countsTest2 <- data.frame(t(countsTest[,-1]))
colnames(countsTest2) <- countsTest$name
For selected genes we perform a binomial negative test, in order to find genes for which the number of readings has changed significantly divided into two groups - wt and brm.
cond <- c(rep("wt", 3), rep("c", 4))
test<-nbinomTest(countsTest2, condition = cond)
To find the most changed genes, we will focus only on those that have a small pvalue and fold greater than 1.5.
library(dplyr)
testFiltered <- filter(test, pvalue<0.05)
testFiltered <- filter(testFiltered, abs(log2FoldChange)>0.6)
testFiltered <- arrange(testFiltered, pvalue)
testFiltered <- arrange(testFiltered, desc(log2FoldChange))
head(testFiltered)
#> baseMean log2FoldChange lfcSE stat pvalue padj
#> 1 21.51619 3.938911 0.9028028 4.362981 1.283023e-05 3.754278e-03
#> 2 14.83630 3.896429 1.9463092 2.001958 4.528924e-02 3.653090e-01
#> 3 19.98301 3.878431 0.8595935 4.511936 6.423847e-06 2.267819e-03
#> 4 19.65402 3.867319 0.9135222 4.233416 2.301684e-05 5.532367e-03
#> 5 63.20134 3.775733 0.5291750 7.135131 9.669483e-13 1.820602e-09
#> 6 19.25211 3.535125 0.8319170 4.249373 2.143700e-05 5.264648e-03
#> id
#> 1 AT5G57250
#> 2 AT4G11543
#> 3 AT5G04890
#> 4 AT5G24790
#> 5 AT3G01345
#> 6 AT4G39952