xGRviaGenomicAnnoAdv | R Documentation |
xGRviaGenomicAnnoAdv
conducts region-based enrichment analysis for the input genomic region
data (genome build hg19), using genomic annotations (e.g. active
chromatin, transcription factor binding sites/motifs, conserved sites).
Enrichment is assessed by comparing the observed overlaps against the
expected overlaps, which are estimated from a null distribution. The null
distribution is generated via sampling, that is, by randomly drawing
samples for the data genomic regions from the background genomic regions.
Background genomic regions can be provided by the user; by default, the
annotatable genomic regions are used.
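The sampling idea described above can be illustrated with a small base-R sketch. It is a deliberate simplification and not the package's implementation: it samples individual bases on a toy coordinate axis rather than genomic regions, all coordinates are hypothetical, and the empirical p-value shown is an assumption (how the function derives its p-value is not stated here).

```r
# Toy illustration of sampling-based enrichment (base R only).
# All coordinates are hypothetical; real analyses work on genomic regions.
set.seed(1)
background <- 1:10000                   # toy background bases
annotation <- 2001:3000                 # toy bases covered by one annotation
data_bases <- c(2501:2600, 7001:7050)   # toy input bases (100 fall in the annotation)

# Observed overlap between input and annotation
observed <- length(intersect(data_bases, annotation))

# Null distribution: draw as many bases as the input from the background
num_samples <- 1000
null_overlaps <- replicate(num_samples, {
  sampled <- sample(background, length(data_bases))
  sum(sampled %in% annotation)
})

# Summary statistics analogous to the 'fc', 'zscore' and 'pvalue' columns
expected <- mean(null_overlaps)
fc <- observed / expected
zscore <- (observed - expected) / stats::sd(null_overlaps)
pvalue <- (sum(null_overlaps >= observed) + 1) / (num_samples + 1)  # assumed empirical form

print(c(observed = observed, expected = expected, fc = fc, zscore = zscore, pvalue = pvalue))
```

Increasing `num_samples` gives a smoother null distribution at the cost of run time, which is the same trade-off the 'num.samples' argument controls.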
xGRviaGenomicAnnoAdv(
data.file,
annotation.file = NULL,
background.file = NULL,
format.file = c("data.frame", "bed", "chr:start-end", "GRanges"),
build.conversion = c(NA, "hg38.to.hg19", "hg18.to.hg19"),
background.annotatable.only = F,
num.samples = 1000,
gap.max = 50000,
max.distance = NULL,
p.adjust.method = c("BH", "BY", "bonferroni", "holm", "hochberg",
"hommel"),
GR.annotation = NA,
parallel = TRUE,
multicores = NULL,
verbose = T,
RData.location = "http://galahad.well.ox.ac.uk/bigdata",
guid = NULL
)
data.file |
an input data file containing a list of genomic regions to test. If the input file is formatted as a 'data.frame' (specified by the parameter 'format.file' below), the first three columns correspond to the chromosome (1st column), the start position (2nd column), and the end position (3rd column). If the format is 'bed' (browser extensible data), the layout is the same as the 'data.frame' format but positions are 0-based offsets from the chromosome position. If the genomic regions are single positions rather than ranges, the end position (3rd column) may be omitted. If the format is "chr:start-end", only the first column is used and processed, instead of the first three columns. Any additional columns in the file are ignored. Alternatively, the input can be the content itself, assuming the file has already been read. Note: the file should use the tab character as the field separator between columns. |
annotation.file |
an input annotation file containing genomic annotations for genomic regions. If the input file is formatted as a 'data.frame', the first four columns correspond to the chromosome (1st column), the start position (2nd column), the end position (3rd column), and the genomic annotations (e.g. transcription factors and histones; 4th column). If the format is 'bed', the layout is the same as the 'data.frame' format but positions are 0-based offsets from the chromosome position. If the format is "chr:start-end", the first two columns correspond to the chromosome:start-end (1st column) and the genomic annotations (e.g. transcription factors and histones; 2nd column). Any additional columns in the file are ignored. Alternatively, the input can be the content itself, assuming the file has already been read. Note: the file should use the tab character as the field separator between columns. |
background.file |
an input background file containing a list of genomic regions as the test background. The file format is the same as 'data.file'. By default, it is NULL, meaning all annotatable bases (i.e. non-redundant bases covered by 'annotation.file') are used as background. However, if only one annotation (e.g. only a transcription factor) is provided in 'annotation.file', the background must be provided. |
format.file |
the format for input files. It can be one of "data.frame", "chr:start-end", "bed" and "GRanges" |
build.conversion |
the conversion from one genome build to another. The conversions supported are "hg38.to.hg19" and "hg18.to.hg19". By default, it is NA (no conversion is performed). |
background.annotatable.only |
logical to indicate whether the background is further restricted to annotatable bases (covered by 'annotation.file'). In other words, if the background is provided, the background bases are those that also overlap annotatable bases. Notably, if only one annotation (e.g. only a transcription factor) is provided in 'annotation.file', it should be false. |
num.samples |
the number of samples randomly generated |
gap.max |
the maximum distance between background islands and data regions for an island to be considered. Only background islands no farther than this distance from data regions are considered. For example, a value of 0 means that only background islands overlapping the data regions are considered. By default, it is 50000 |
max.distance |
the maximum distance away from data regions that is allowed when generating random samples. By default, it is NULL, meaning no such restriction |
p.adjust.method |
the method used to adjust p-values. It can be one of "BH", "BY", "bonferroni", "holm", "hochberg" and "hommel". The first two methods, "BH" (widely used) and "BY", control the false discovery rate (FDR: the expected proportion of false discoveries amongst the rejected hypotheses); the last four methods, "bonferroni", "holm", "hochberg" and "hommel", are designed to give strong control of the family-wise error rate (FWER). Note: FDR is a less stringent condition than FWER |
GR.annotation |
the genomic regions of annotation data. By
default, it is 'NA' to disable this option. Pre-built genomic
annotation data are detailed in |
parallel |
logical to indicate whether parallel computation with multiple cores is used. By default, it is set to true, but parallel computation is only used if the two packages "foreach" and "doParallel" have been installed |
multicores |
an integer to specify how many cores will be registered as the multicore parallel backend to the 'foreach' package. If NULL, half of the cores available on the user's computer are used. This option only works when parallel computation is enabled |
verbose |
logical to indicate whether messages are displayed on the screen. By default, it is set to true, so messages are displayed |
RData.location |
the characters to tell the location of built-in
RData files. See |
guid |
a valid (5-character) Global Unique IDentifier for an OSF
project. See |
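As a rough illustration of the input layouts accepted through 'format.file', the base-R sketch below builds the same kind of table from "chr:start-end" strings, shifts a BED-style 0-based start to a 1-based start, and fills in a missing end position. The regions are hypothetical, and the conversion steps are assumptions based on the standard BED convention (0-based start, exclusive end); they are not taken from the package's source.

```r
# Hypothetical regions written in "chr:start-end" form
regions <- c("chr1:1000-2000", "chr2:500-800")

# Split each string on ':' and '-' into chromosome, start and end
parts <- strsplit(regions, "[:-]")
df <- data.frame(
  chr   = vapply(parts, function(x) x[1], character(1)),
  start = as.integer(vapply(parts, function(x) x[2], character(1))),
  end   = as.integer(vapply(parts, function(x) x[3], character(1))),
  stringsAsFactors = FALSE
)

# BED-style input: the start is 0-based, so add 1 to obtain a 1-based start
# (assumed handling; the end coordinate is left unchanged)
bed <- data.frame(chr = "chr1", start = 999L, end = 2000L, stringsAsFactors = FALSE)
bed_as_1based <- transform(bed, start = start + 1L)

# Single positions: when the 3rd column is absent, reuse the start as the end
# (an assumption about how a missing end could be interpreted)
single <- data.frame(chr = "chr3", start = 42L, stringsAsFactors = FALSE)
if (ncol(single) < 3) single$end <- single$start

print(df)
print(bed_as_1based)
print(single)
```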
a data frame with 9 columns:
name: the annotation name
nAnno: the number of bases covered by that annotation. If a background is provided, the count is restricted to bases within the background
nOverlap: the number of bases overlapped between input regions and annotation regions. If a background is provided, the count is restricted to bases within the background
fc: fold change
zscore: z-score
pvalue: p-value
adjp: adjusted p-value, that is, the p-value after adjustment for multiple comparisons
nData: the number of bases covered by input regions
nBG: the number of bases covered by background regions
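The relationship between the 'pvalue' and 'adjp' columns can be illustrated with `stats::p.adjust`, which accepts the same method names listed for 'p.adjust.method'. The p-values below are hypothetical, and whether the function calls `p.adjust` internally is an assumption; the sketch only contrasts an FDR method with an FWER method.

```r
# Hypothetical p-values, one per annotation
pvalue <- c(0.001, 0.01, 0.02, 0.04, 0.2)

# FDR control (Benjamini-Hochberg) versus FWER control (Bonferroni)
adjp_bh   <- stats::p.adjust(pvalue, method = "BH")
adjp_bonf <- stats::p.adjust(pvalue, method = "bonferroni")

# Side-by-side comparison; BH is never larger than Bonferroni
print(data.frame(pvalue = pvalue, adjp_bh = adjp_bh, adjp_bonf = adjp_bonf))
```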
Pre-built genomic annotation data are detailed in xDefineGenomicAnno.
xDefineGenomicAnno
## Not run:
# Load the XGR package and specify the location of built-in data
library(XGR)
RData.location <- "http://galahad.well.ox.ac.uk/bigdata"
# Enrichment analysis for GWAS SNPs from ImmunoBase
## a) provide input data
data.file <- "http://galahad.well.ox.ac.uk/bigdata/ImmunoBase_GWAS.bed"
## b) perform enrichment analysis using FANTOM expressed enhancers
eTerm <- xGRviaGenomicAnnoAdv(data.file=data.file, format.file="bed",
    GR.annotation="FANTOM5_Enhancer_Cell", num.samples=1000, gap.max=50000,
    RData.location=RData.location)
## c) view enrichment results for the top significant terms
xEnrichViewer(eTerm)
## d) barplot of enriched terms
bp <- xEnrichBarplot(eTerm, top_num='auto', displayBy="fdr")
bp
## e) save enrichment results to the file called 'Regions_enrichments.txt'
output <- xEnrichViewer(eTerm, top_num=length(eTerm$adjp),
    sortBy="adjp", details=TRUE)
utils::write.table(output, file="Regions_enrichments.txt", sep="\t",
    row.names=FALSE)
## End(Not run)
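Before running the example with 'parallel = TRUE', it can be useful to check whether the optional packages are installed and how many cores half of the machine amounts to. The sketch below uses only base R and the bundled 'parallel' package; the variable names are hypothetical and the rounding rule is an assumption, since the text above only states that half of the available cores are used when 'multicores' is NULL.

```r
# Check whether the optional parallel backend packages are installed
has_backend <- requireNamespace("foreach", quietly = TRUE) &&
  requireNamespace("doParallel", quietly = TRUE)

# detectCores() may return NA on some platforms, so fall back to 1
n_detected <- parallel::detectCores()
if (is.na(n_detected)) n_detected <- 1L

# Half of the detected cores, rounded down, but never fewer than one
half_cores <- max(1L, as.integer(floor(n_detected / 2)))

message("Parallel backend available: ", has_backend,
        "; cores that would be requested: ", half_cores)
```

The resulting value could then be passed explicitly as 'multicores' when a different share of the machine is wanted.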