Welcome to an introduction to the ksRepo package. ksRepo enables investigators to mix and match various types of case/control gene lists with any gene::drug interaction database to predict repositioning oportunities. This document will introduce you to the kinds of data you’ll need to use ksRepo, and walk you through some techniques for obtaining it.

More information about ksRepo is available in open-access form at BMC Bioinformatics.

Getting Started

In order to use ksRepo, you’ll need to first install and load it into your R environment. You can either do so by downloading/forking the GitHub Repository and installing it manually (for experienced users) or by using the Devtools package:

# Install devtools if necessary
install.packages('devtools')

# Load the package
require(devtools)

# Download and install ksRepo from GitHub directly
install_github('adam-sam-brown/ksRepo')
library(ksRepo)

We will also install some additional packages to be used over the course of this introduction:

# Get Bioconductor to enable package installation
install.packages('BiocInstaller')
require('BiocInstaller')

# Install packages
biocLite(c('rmeta', 'GEOquery', 'Biobase')

# Load into memory
require(GEOquery)
require(rmeta)
require(Biobase)

Preparing your gene list

ksRepo requires as input a list of genes ordered by statistical significance. Statistical significance can be defined by, for example, differential expression between normal and disease samples, as in this introduction. For this analysis, we’re going to use publically available data from the Gene Expression Omnibus. The analysis we’ll be performing is based off of GEO’s built-in GEO2R tool.

The dataset we’ll be using is GSE39645, a series of microarray experiments on normal human nerve tissue, and tumor tissue from vestibular schwannoma (VS) patients. VS is a rare tumor syndrome characterized by schwann cell tumors in the inner ear, which lead to dizziness and sensorineural deficits.

To begin, we’ll download the dataset directly from GEO using the GEOquery package:

# Load series and platform data from GEO
gset <- getGEO("GSE39645", GSEMatrix =TRUE, AnnotGPL=TRUE)
gset <- gset[[1]]
fvarLabels(gset) <- make.names(fvarLabels(gset))

Once we have the data, we can perform simple differential gene expression analysis on the probe-level using limma:

# Set normal/tumor tissue identifiers
gsms <- "0000000001111111111111111111111111111111"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)

# Perform simple limma analysis
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)

# Get relevant information for further analysis
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=17587)
tT <- subset(tT, select=c("ID","t","Gene.symbol"))
tT$Gene.symbol <- as.character(tT$Gene.symbol)

Now we have a list of differentially expressed probes, but not genes, so we need to combine them together. To do this, we’ll perform a fixed effects meta analysis to consolidate probes to genes. First, we’ll define some useful functions

# Back out Cohen's d-statistics for meta-analysis
dfromt <- function(t, n1, n2) {
    df <- n1 + n2 - 2
    d <- (2*t) / sqrt(df)
    return(d)
}

sefromt <- function(t, n1, n2) {
    d <- dfromt(t, n1, n2)
    var <- ( (n1 + n2)/(n1*n2) + d^2/(2*(n1+n2-2)) ) * ((n1 + n2)/(n1 + n2 - 2))
    se <- sqrt(var)
    return(se)
}

# Perform gene-level fixed-effects meta analysis
meta <- function(gFr, nSamp) {
    # Parse Gene Names
    genel <- strsplit(gFr$Gene.symbol, split='///')
    names(genel) <- gFr$ID
    genel <- lapply(genel, unique)

    # Annotate with IDs
    gene <- setNames(unlist(genel, use.names=F),rep(names(genel), lengths(genel)))
    gene <- data.frame(gene=gene, probe=names(gene), stringsAsFactors = F)

    # Output Init
    omni <- data.frame(gene = unique(unlist(genel)), stringsAsFactors = F)
    omni$d <- rep(NA, nrow(omni))
    omni$se <- rep(NA, nrow(omni))
    omni$p <- rep(NA, nrow(omni))

    # Output loop
    for (i in 1:nrow(omni)) {
        probes <- subset(gene, gene==omni$gene[i])$probe
        slice <- subset(gFr, ID %in% probes)
        metasum <- meta.summaries(slice$d,slice$se) # Fixed-effects meta-analysis of probe values
        omni$d[i] <-metasum$summary
        omni$se[i] <- metasum$se.summary
        omni$p[i] <- 2*pnorm(-abs(omni$d[i]*sqrt(nSamp))) # Two-sided p-value from d-statistic
    }
    omni <- omni[order(omni$p),]
}

With these functions defined, we can easily convert our probe list into a gene list:

# Remove unmapped probes
tT <- subset(tT, Gene.symbol != '')

# Effect sizes and standard errors
tT$d <- dfromt(tT$t, 9, 31) # 9 normal and 31 tumor samples
tT$se <- sefromt(tT$t, 9, 31)

# Convert probe- to gene-level differential expression
dge <- meta(tT, 40)
genelist <- dge$gene

We’re now all set to perform our ksRepo analysis.

ksRepo analysis and interpretation

With our sorted gene list, we can move on to ksRepo analysis. We’ll be using the built in Comparative Toxicogenomics Database (CTD) dataset, which contains a set of expert-curated gene::drug interactions. ksRepo automatically selects a subset of the CTD that overlaps with our gene list, making running ksRepo as easy as a single line of code:

# Fetch the CTD data
data(CTD)

# Run in a single line (on most computers should take < 1 hour)
results <- ksRepo(genelist, CTD) # Beyond the default options, you can change the number of resamples performed
                                 # As well as the type of resamples (resampling compound lists or your input)

Once you’ve run a ksRepo analysis, it’s time to look at the results. ksRepo outputs a data.frame with columns for: 1. compound name, which exactly corresponds to an entry in drugbank 2. the number of genes interacting with a compound in the CTD (or your own database) 3. enrichment score (ks, for Kolmogorov-Smirnov enrichment) 4. a bootstrapped p-value (# of times bootstrapped ks exceeds actual ks) 5. an FDR q-value

A good rule of thumb for ksRepo analysis is to look first at your FDR significant results and subset from there:

# Significant
subset(results, boot.fdr < 0.05)
##                    compound n.genes         ks boot.p   boot.fdr
## 20            ACETAMINOPHEN    4394 0.03217755  0e+00 0.00000000
## 99                POTASSIUM    2301 0.05441585  0e+00 0.00000000
## 126               AZATADINE       1 0.99923022  5e-04 0.03714706
## 210           CARBAMAZEPINE    1903 0.04256119  4e-04 0.03714706
## 286              CLEMASTINE       1 0.99923022  5e-04 0.03714706
## 341              DECITABINE    1717 0.05636499  1e-04 0.01148182
## 368               CISPLATIN    2268 0.04768448  0e+00 0.00000000
## 390            DIMETHINDENE       1 0.99923022  5e-04 0.03714706
## 397        DIPHENYLPYRALINE       1 0.99923022  5e-04 0.03714706
## 560             GLIMEPIRIDE      11 0.53153351  6e-04 0.04210000
## 566  GOLD SODIUM THIOMALATE      85 0.25324100  0e+00 0.00000000
## 747      METHYLPREDNISOLONE      95 0.21307238  1e-04 0.01148182
## 862             OLOPATADINE       1 0.99923022  5e-04 0.03714706
## 890            PANOBINOSTAT    1065 0.09396930  0e+00 0.00000000
## 983            PROGESTERONE    1727 0.06108625  0e+00 0.00000000
## 1194              TRETINOIN    3971 0.05360247  0e+00 0.00000000
## 1199              TRICLOSAN      53 0.26199411  7e-04 0.04653158
## 1219          VALPROIC ACID    9673 0.04251244  0e+00 0.00000000
## 1244             VORINOSTAT     887 0.07939923  0e+00 0.00000000
# Significant and highly enriched
subset(results, boot.fdr < 0.05 & ks > 0.2)
##                    compound n.genes        ks boot.p   boot.fdr
## 126               AZATADINE       1 0.9992302  5e-04 0.03714706
## 286              CLEMASTINE       1 0.9992302  5e-04 0.03714706
## 390            DIMETHINDENE       1 0.9992302  5e-04 0.03714706
## 397        DIPHENYLPYRALINE       1 0.9992302  5e-04 0.03714706
## 560             GLIMEPIRIDE      11 0.5315335  6e-04 0.04210000
## 566  GOLD SODIUM THIOMALATE      85 0.2532410  0e+00 0.00000000
## 747      METHYLPREDNISOLONE      95 0.2130724  1e-04 0.01148182
## 862             OLOPATADINE       1 0.9992302  5e-04 0.03714706
## 1199              TRICLOSAN      53 0.2619941  7e-04 0.04653158
# Significant, highly enriched, with known interactions with multiple target genes
subset(results, boot.fdr < 0.05 & ks > 0.2 & n.genes > 1)
##                    compound n.genes        ks boot.p   boot.fdr
## 560             GLIMEPIRIDE      11 0.5315335  6e-04 0.04210000
## 566  GOLD SODIUM THIOMALATE      85 0.2532410  0e+00 0.00000000
## 747      METHYLPREDNISOLONE      95 0.2130724  1e-04 0.01148182
## 1199              TRICLOSAN      53 0.2619941  7e-04 0.04653158

Wrapping Up

In this introduction, we’re worked through an example of repositioning using ksRepo. We first downloaded freely accessible gene expression data and pre-processed it for ksRepo analysis. We then performed ksRepo analysis in a single line of code and examined the results.

ksRepo provides an easy and accessible way to perform drug repositioning with any data, and without a lot of the hassle of other methods.

Contact

If you’re interested in learning more about ksRepo or licensing our software, please email Adam Brown.