Welcome to an introduction to the ksRepo
package. ksRepo
enables investigators to mix and match various types of case/control gene lists with any gene::drug interaction database to predict repositioning oportunities. This document will introduce you to the kinds of data you’ll need to use ksRepo, and walk you through some techniques for obtaining it.
More information about ksRepo
is available in open-access form at BMC Bioinformatics.
In order to use ksRepo
, you’ll need to first install and load it into your R environment. You can either do so by downloading/forking the GitHub Repository and installing it manually (for experienced users) or by using the Devtools
package:
# Install devtools if necessary
install.packages('devtools')
# Load the package
require(devtools)
# Download and install ksRepo from GitHub directly
install_github('adam-sam-brown/ksRepo')
library(ksRepo)
We will also install some additional packages to be used over the course of this introduction:
# Get Bioconductor to enable package installation
install.packages('BiocInstaller')
require('BiocInstaller')
# Install packages
biocLite(c('rmeta', 'GEOquery', 'Biobase')
# Load into memory
require(GEOquery)
require(rmeta)
require(Biobase)
ksRepo
requires as input a list of genes ordered by statistical significance. Statistical significance can be defined by, for example, differential expression between normal and disease samples, as in this introduction. For this analysis, we’re going to use publically available data from the Gene Expression Omnibus. The analysis we’ll be performing is based off of GEO’s built-in GEO2R
tool.
The dataset we’ll be using is GSE39645
, a series of microarray experiments on normal human nerve tissue, and tumor tissue from vestibular schwannoma (VS) patients. VS is a rare tumor syndrome characterized by schwann cell tumors in the inner ear, which lead to dizziness and sensorineural deficits.
To begin, we’ll download the dataset directly from GEO using the GEOquery
package:
# Load series and platform data from GEO
gset <- getGEO("GSE39645", GSEMatrix =TRUE, AnnotGPL=TRUE)
gset <- gset[[1]]
fvarLabels(gset) <- make.names(fvarLabels(gset))
Once we have the data, we can perform simple differential gene expression analysis on the probe-level using limma
:
# Set normal/tumor tissue identifiers
gsms <- "0000000001111111111111111111111111111111"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
sml <- paste("G", sml, sep="") # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
# Perform simple limma analysis
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
# Get relevant information for further analysis
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=17587)
tT <- subset(tT, select=c("ID","t","Gene.symbol"))
tT$Gene.symbol <- as.character(tT$Gene.symbol)
Now we have a list of differentially expressed probes, but not genes, so we need to combine them together. To do this, we’ll perform a fixed effects meta analysis to consolidate probes to genes. First, we’ll define some useful functions
# Back out Cohen's d-statistics for meta-analysis
dfromt <- function(t, n1, n2) {
df <- n1 + n2 - 2
d <- (2*t) / sqrt(df)
return(d)
}
sefromt <- function(t, n1, n2) {
d <- dfromt(t, n1, n2)
var <- ( (n1 + n2)/(n1*n2) + d^2/(2*(n1+n2-2)) ) * ((n1 + n2)/(n1 + n2 - 2))
se <- sqrt(var)
return(se)
}
# Perform gene-level fixed-effects meta analysis
meta <- function(gFr, nSamp) {
# Parse Gene Names
genel <- strsplit(gFr$Gene.symbol, split='///')
names(genel) <- gFr$ID
genel <- lapply(genel, unique)
# Annotate with IDs
gene <- setNames(unlist(genel, use.names=F),rep(names(genel), lengths(genel)))
gene <- data.frame(gene=gene, probe=names(gene), stringsAsFactors = F)
# Output Init
omni <- data.frame(gene = unique(unlist(genel)), stringsAsFactors = F)
omni$d <- rep(NA, nrow(omni))
omni$se <- rep(NA, nrow(omni))
omni$p <- rep(NA, nrow(omni))
# Output loop
for (i in 1:nrow(omni)) {
probes <- subset(gene, gene==omni$gene[i])$probe
slice <- subset(gFr, ID %in% probes)
metasum <- meta.summaries(slice$d,slice$se) # Fixed-effects meta-analysis of probe values
omni$d[i] <-metasum$summary
omni$se[i] <- metasum$se.summary
omni$p[i] <- 2*pnorm(-abs(omni$d[i]*sqrt(nSamp))) # Two-sided p-value from d-statistic
}
omni <- omni[order(omni$p),]
}
With these functions defined, we can easily convert our probe list into a gene list:
# Remove unmapped probes
tT <- subset(tT, Gene.symbol != '')
# Effect sizes and standard errors
tT$d <- dfromt(tT$t, 9, 31) # 9 normal and 31 tumor samples
tT$se <- sefromt(tT$t, 9, 31)
# Convert probe- to gene-level differential expression
dge <- meta(tT, 40)
genelist <- dge$gene
We’re now all set to perform our ksRepo
analysis.
With our sorted gene list, we can move on to ksRepo
analysis. We’ll be using the built in Comparative Toxicogenomics Database (CTD) dataset, which contains a set of expert-curated gene::drug interactions. ksRepo
automatically selects a subset of the CTD that overlaps with our gene list, making running ksRepo as easy as a single line of code:
# Fetch the CTD data
data(CTD)
# Run in a single line (on most computers should take < 1 hour)
results <- ksRepo(genelist, CTD) # Beyond the default options, you can change the number of resamples performed
# As well as the type of resamples (resampling compound lists or your input)
Once you’ve run a ksRepo analysis, it’s time to look at the results. ksRepo
outputs a data.frame
with columns for: 1. compound name, which exactly corresponds to an entry in drugbank 2. the number of genes interacting with a compound in the CTD (or your own database) 3. enrichment score (ks, for Kolmogorov-Smirnov enrichment) 4. a bootstrapped p-value (# of times bootstrapped ks exceeds actual ks) 5. an FDR q-value
A good rule of thumb for ksRepo
analysis is to look first at your FDR significant results and subset from there:
# Significant
subset(results, boot.fdr < 0.05)
## compound n.genes ks boot.p boot.fdr
## 20 ACETAMINOPHEN 4394 0.03217755 0e+00 0.00000000
## 99 POTASSIUM 2301 0.05441585 0e+00 0.00000000
## 126 AZATADINE 1 0.99923022 5e-04 0.03714706
## 210 CARBAMAZEPINE 1903 0.04256119 4e-04 0.03714706
## 286 CLEMASTINE 1 0.99923022 5e-04 0.03714706
## 341 DECITABINE 1717 0.05636499 1e-04 0.01148182
## 368 CISPLATIN 2268 0.04768448 0e+00 0.00000000
## 390 DIMETHINDENE 1 0.99923022 5e-04 0.03714706
## 397 DIPHENYLPYRALINE 1 0.99923022 5e-04 0.03714706
## 560 GLIMEPIRIDE 11 0.53153351 6e-04 0.04210000
## 566 GOLD SODIUM THIOMALATE 85 0.25324100 0e+00 0.00000000
## 747 METHYLPREDNISOLONE 95 0.21307238 1e-04 0.01148182
## 862 OLOPATADINE 1 0.99923022 5e-04 0.03714706
## 890 PANOBINOSTAT 1065 0.09396930 0e+00 0.00000000
## 983 PROGESTERONE 1727 0.06108625 0e+00 0.00000000
## 1194 TRETINOIN 3971 0.05360247 0e+00 0.00000000
## 1199 TRICLOSAN 53 0.26199411 7e-04 0.04653158
## 1219 VALPROIC ACID 9673 0.04251244 0e+00 0.00000000
## 1244 VORINOSTAT 887 0.07939923 0e+00 0.00000000
# Significant and highly enriched
subset(results, boot.fdr < 0.05 & ks > 0.2)
## compound n.genes ks boot.p boot.fdr
## 126 AZATADINE 1 0.9992302 5e-04 0.03714706
## 286 CLEMASTINE 1 0.9992302 5e-04 0.03714706
## 390 DIMETHINDENE 1 0.9992302 5e-04 0.03714706
## 397 DIPHENYLPYRALINE 1 0.9992302 5e-04 0.03714706
## 560 GLIMEPIRIDE 11 0.5315335 6e-04 0.04210000
## 566 GOLD SODIUM THIOMALATE 85 0.2532410 0e+00 0.00000000
## 747 METHYLPREDNISOLONE 95 0.2130724 1e-04 0.01148182
## 862 OLOPATADINE 1 0.9992302 5e-04 0.03714706
## 1199 TRICLOSAN 53 0.2619941 7e-04 0.04653158
# Significant, highly enriched, with known interactions with multiple target genes
subset(results, boot.fdr < 0.05 & ks > 0.2 & n.genes > 1)
## compound n.genes ks boot.p boot.fdr
## 560 GLIMEPIRIDE 11 0.5315335 6e-04 0.04210000
## 566 GOLD SODIUM THIOMALATE 85 0.2532410 0e+00 0.00000000
## 747 METHYLPREDNISOLONE 95 0.2130724 1e-04 0.01148182
## 1199 TRICLOSAN 53 0.2619941 7e-04 0.04653158
In this introduction, we’re worked through an example of repositioning using ksRepo
. We first downloaded freely accessible gene expression data and pre-processed it for ksRepo
analysis. We then performed ksRepo
analysis in a single line of code and examined the results.
ksRepo
provides an easy and accessible way to perform drug repositioning with any data, and without a lot of the hassle of other methods.
If you’re interested in learning more about ksRepo
or licensing our software, please email Adam Brown.