CompGO: an R package for comparing and visualizing Gene Ontology enrichment differences between DNA binding experiments
Gene ontology (GO) enrichment is commonly used for inferring biological meaning from systems biology experiments. However, determining differential GO and pathway enrichment between DNA-binding experiments or using the GO structure to classify experiments has received little attention.
Herein, we present a bioinformatics tool, CompGO, for identifying Differentially Enriched Gene Ontologies, called DiEGOs, and pathways, through the use of a z-score derivation of log odds ratios, and visualizing these differences at GO and pathway level. Through public experimental data focused on the cardiac transcription factor NKX2-5, we illustrate the problems associated with comparing GO enrichments between experiments using a simple overlap approach.
We have developed an R/Bioconductor package, CompGO, which applies a statistic normally used in epidemiological studies to comparative GO analyses and visualizes the comparisons, taking either .BED data containing genomic coordinates or gene lists as input. We justify the statistic using experimental data and compare it to the commonly used overlap method. CompGO is freely available as an R/Bioconductor package, enabling easy integration into existing pipelines, at: http://www.bioconductor.org/packages/release/bioc/html/CompGO.html
Keywords: Gene Ontology; Serum Response Factor; Jaccard Coefficient; Enrich Gene Ontology; Hard Thresholding
DiEGOs: Differentially Enriched Gene Ontologies
DAVID: The Database for Annotation, Visualization and Integrated Discovery
YRD: NKX2-5 tyrosine-rich domain
Gaining biological insight from high-throughput data underpins systems biology. However, determining biological “function” or indeed “relevance” from lists of genes or DNA regions (loci) remains problematic. Ashburner et al. proposed a structured Gene Ontology (GO) approach for grouping genes into conceptual “ontologies” based on their annotated or predicted biological functions. GOs are organized into a hierarchical network where broad functionality sits at the top (e.g. cell) and fine functionality at the bottom (e.g. calcium ion binding). Individual genes can have multiple GOs. The accumulation of gene annotations and the subsequent classification of thousands of ontologies have led to the development of a number of tools that use a range of statistical approaches to identify “semantic” patterns, or GO enrichment, within a given list of genes. GO enrichment is typically determined using a hypergeometric test (or a modified version of it) or a similar over-representation test, based on gene sets alone or, for example, on signatures derived from the correlation of gene expression profiles [3, 4, 5].
However, few methods have been developed to determine how similar or different experiments are using a GO approach; most focus on different visualization methods and are not adaptable to existing pipelines, requiring users to reformat and manually input data into third-party web services. For instance, WebGestalt and GOEAST are web servers that visualize multiple gene list inputs by overlaying their individual statistics onto a GO directed acyclic graph. Enrichment maps visualize GO enrichment from multiple gene lists as a network, with edges derived from the Jaccard coefficient (JC) of GO gene set overlap. However, enrichment maps are difficult to resolve when more than two experiments are compared and do not indicate overall differences between experiments. Comparative GO, a web-based GO tool, uses the Kolmogorov-Smirnov statistic to compare observed GOs to an expected GO distribution, but it is limited to bacterial gene lists and to visualization of pairwise comparisons.
Motivated by our interest in DNA binding experiments (e.g. ChIP-seq or DamID) and their similarities and differences, we developed a tool that enables rapid comparison of multiple experiments, unconstrained by input type (gene list or loci) or species, and that takes advantage of existing unsupervised clustering and dimensionality reduction methods implemented in R (e.g. hierarchical clustering and principal component analysis) to classify experiments based on GO. We present an open-source implementation of a comparative GO approach, CompGO, which is readily adaptable to existing analysis pipelines for performing these functions and which implements a log odds ratio [10, 11], normally applied to epidemiological studies, for comparing GO enrichment directly. We justify the use of this statistic for direct comparisons by assessing recently published experimental data.
Differential GO enrichment
p-values are not derived from log odds ratios, but 95 % confidence intervals could be assigned to enrichment scores as $z_i \pm 1.96\,\mathrm{SE}(\delta_i)$. The greater the absolute $z_i$, the greater the odds that a term was enriched beyond what would be expected by chance alone.
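As a rough illustration of the enrichment score discussed above, the following Python sketch computes a log odds ratio, its standard error and the resulting z-score for a single GO term. It is not CompGO's R code. The 2×2 table layout (genes in the list versus the background, annotated versus not annotated to the term), the function name `log_odds_enrichment` and the example counts are all assumptions made for illustration. The interval shown is the usual Wald interval on the log-odds scale.

```python
import math

def log_odds_enrichment(a, b, c, d):
    """Log odds ratio, standard error, z-score and 95% Wald interval.

    Hypothetical 2x2 layout (an assumption, not taken from the paper):
      a = genes in the list annotated to the term
      b = genes in the list not annotated to the term
      c = background genes annotated to the term
      d = background genes not annotated to the term
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    delta = math.log((a / b) / (c / d))          # log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of delta
    z = delta / se                                # standardized score
    ci = (delta - 1.96 * se, delta + 1.96 * se)   # 95% interval on log-odds scale
    return delta, se, z, ci

# Illustrative counts only; they do not come from the data sets in this paper.
delta, se, z, ci = log_odds_enrichment(30, 470, 200, 19800)
print(f"delta={delta:.3f} SE={se:.3f} z={z:.2f} CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

A positive z indicates over-representation of the term relative to the background and a negative z indicates under-representation, which is why both tails can be examined.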
Scoring of Differentially Enriched Gene Ontologies (DiEGOs) can then be inferred from their z-scores. The greater the absolute $z_k$, the greater the odds that a term was differentially enriched beyond what would be expected by chance alone. p-values can be inferred using R assuming normal approximations, and several methods are available for correcting for multiple hypothesis testing.
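The scoring described above can be sketched as follows, again in Python rather than with CompGO's own R functions. The differential z-score is written here in the standard form for comparing two independent log odds ratios, $(\delta_1 - \delta_2) / \sqrt{\mathrm{SE}_1^2 + \mathrm{SE}_2^2}$, which is an assumption about the exact formula. The two-sided p-value uses the normal approximation, and Benjamini-Hochberg is shown as one possible multiple-testing correction. All function names and input values are hypothetical.

```python
import math

def differential_z(delta_1, se_1, delta_2, se_2):
    """z-score for the difference between two independent log odds ratios."""
    return (delta_1 - delta_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)

def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical (delta, SE) pairs for three terms in two experiments.
terms = {
    "term_1": (1.80, 0.20, 0.10, 0.25),
    "term_2": (0.40, 0.30, 0.35, 0.30),
    "term_3": (0.50, 0.25, -0.90, 0.30),
}
names = list(terms)
z_scores = [differential_z(*terms[name]) for name in names]
p_raw = [two_sided_p(z) for z in z_scores]
p_adj = benjamini_hochberg(p_raw)
for name, z, p, q in zip(names, z_scores, p_raw, p_adj):
    print(f"{name}: z={z:.2f} p={p:.4g} adjusted p={q:.4g}")
```

In R, the equivalent two-sided p-value would typically be obtained with `2 * pnorm(-abs(z))` and the correction with `p.adjust`.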
Overlap of genes between GOs
Example of CompGO Code
Results and discussion
To determine the utility of the methods proposed in CompGO, we downloaded DNA targeted regions (peaks) for a number of wild-type (WT) and mutated cardiac transcription factors (TFs) identified by Bouveret et al. using the DamID method, and compared the outcomes with those of a simple overlap approach. Bouveret et al. surveyed DNA binding regions for the WT NKX2-5 cardiac transcription factor twice (independent experiments with 3–4 replicates each, performed 2 years apart; data sets hereafter named NKX2-51 and NKX2-52) and in addition surveyed three NKX2-5 mutants: NKX2-5Y191C is a congenital heart disease-causing mutation [20, 21], while NKX2-5ΔHD and NKX2-5YRDY-A are synthetic mutations with a disrupted homeodomain (involved in both DNA-binding and cofactor interactions) and Tyrosine-Rich Domain (YRD; cofactor interactions), respectively. DNA binding regions of the muscle-enriched TF serum response factor (SRF) and the ubiquitously expressed ETS-domain TFs ELK1 and ELK4 were also considered.
These results suggest that ELK TFs regulate distinct, although overlapping, sets of biological processes compared to NKX2-5. Furthermore, while SRF and the mutation NKX2-5YRDY-A largely target genes with GO terms similar to those of WT NKX2-5, the mutations NKX2-5ΔHD and NKX2-5Y191C, predicted to be the more severe mutations among those studied here, targeted sets of genes representing distinct biological processes. Notably, the average JC, a metric representing overall concordance of genes belonging to the same GO term, varied, indicating that distinct sets of target genes could belong to the same GO term. Of the DiEGOs from the NKX2-51 versus ELK4 comparison, those unique to ELK4 included metabolic and generic GO terms such as GO:0006396 ~ RNA processing (z-scores: 0.13 vs. 5.41; p-value: 0.001) and GO:0034470 ~ ncRNA processing (z-scores: -0.09 vs. 3.60; p-value: 0.028), whereas those for NKX2-51 included muscle-related terms such as GO:0043292 ~ contractile fiber (z-scores: 6.50 vs. 1.70; p-value: 0.035) and GO:0048514 ~ blood vessel morphogenesis (z-scores: 4.00 vs. 0.26; p-value: 0.043). This is consistent with the known roles for NKX2-5 in muscle and vasculature development and the ubiquitous expression of ELK TFs.
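To make the JC metric mentioned above concrete, the following Python sketch computes the Jaccard coefficient between the gene sets that two experiments assign to the same GO term, and then averages it over the terms the experiments share. It is an illustration under assumptions, not CompGO's implementation: the function names, the dictionary layout and the placeholder gene names are hypothetical. Only the GO identifiers are taken from the text, and the gene memberships are invented.

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

def mean_jaccard(experiment_1, experiment_2):
    """Average Jaccard coefficient over GO terms present in both experiments.

    Each experiment is a dict mapping a GO term to the set of target genes
    annotated to that term (a hypothetical layout).
    """
    shared_terms = experiment_1.keys() & experiment_2.keys()
    if not shared_terms:
        return 0.0
    total = sum(jaccard(experiment_1[t], experiment_2[t]) for t in shared_terms)
    return total / len(shared_terms)

# Placeholder gene names; memberships are invented for illustration.
exp_1 = {"GO:0043292": {"geneA", "geneB", "geneC"}, "GO:0006396": {"geneD"}}
exp_2 = {"GO:0043292": {"geneB", "geneC", "geneE"}, "GO:0006396": {"geneF"}}
print(mean_jaccard(exp_1, exp_2))
```

A low average JC alongside similar term-level enrichment is the situation described in the text: the same GO term is enriched in both experiments, but through largely different target genes.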
To better illustrate the differences, we compared the overlap method to the log odds ratio method by directly comparing the differential of p-values returned by DAVID (scored as the difference between –log10 transformed p-values, or simply ‘Δ –log10 p-value’) to the log odds ratio returned from direct comparison using CompGO, for NKX2-51 versus NKX2-52 (Fig. 4c) and NKX2-51 versus ELK4 (Fig. 4d). For NKX2-51 versus NKX2-52, this illustrated that GO terms reported by the overlap method did not approximate to the tails of the distribution, where differences would be expected to occur if compared directly as per the log odds ratio in Eq. 3. When comparing NKX2-51 to ELK4 some concordance was observed, but there was still a large number of differentially enriched GO terms identified using CompGO that were 1) not detected using the overlap method; and 2) not approximating to the tails of the log-odds distribution (likely to be false positives) (Fig. 4d). In addition to hard thresholding, DiEGOs identified by CompGO and not detected using the overlap method arose as a result of “under-representation”. This is because the log odds ratio (Eq. 3) considers both tails of the distribution, in contrast to the single-tailed modified Fisher's exact test implemented in DAVID, which only considers over-representation. For example, DAVID returned p-values of 0.54 and 1.00 for GO:0006811 ~ ion transport, indicating that this GO term was not significantly over-represented in either set; however, CompGO returned a p-value of 0.0003, which reflected an under-representation of this term for ELK4 targets (z-scores: 1.57 vs. -3.23). Therefore, the approach of hard thresholding individual GO statistical results from each experiment and then performing overlaps introduces many false positives as well as missing potential differences.
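The hard-thresholding problem described above can be shown with a small Python sketch. It is only a toy model of the overlap method, with hypothetical function names, an assumed cut-off of 0.05 and invented p-values. A term is called “unique” to one experiment when it passes the cut-off there but not in the other, and the Δ –log10 p-value is computed alongside for contrast.

```python
import math

def unique_by_overlap(p_a, p_b, alpha=0.05):
    """Terms passing the threshold in experiment A but not in experiment B.

    Terms absent from p_b are treated as not enriched (p = 1.0), which is an
    assumption of this sketch.
    """
    return sorted(t for t in p_a if p_a[t] < alpha and p_b.get(t, 1.0) >= alpha)

def delta_neg_log10_p(p_a, p_b):
    """Difference of -log10 p-values for the terms found in both experiments."""
    shared = p_a.keys() & p_b.keys()
    return {t: -math.log10(p_a[t]) + math.log10(p_b[t]) for t in shared}

# Invented enrichment p-values for three terms in two experiments.
p_a = {"term_1": 0.049, "term_2": 1e-8, "term_3": 0.20}
p_b = {"term_1": 0.051, "term_2": 0.30, "term_3": 0.25}

print(unique_by_overlap(p_a, p_b))
for term, delta in sorted(delta_neg_log10_p(p_a, p_b).items()):
    print(f"{term}: delta -log10 p = {delta:.3f}")
```

Here `term_1` is reported as unique to experiment A although its two p-values are nearly identical, while `term_2` is reported with the same status despite a far larger difference. This is the sense in which thresholding each experiment separately can produce false positives.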
This illustrates how CompGO overcomes the issue of hard thresholding implicit in the overlap method by directly computing differential enrichment via a log odds ratio, thereby reducing the number of false positive results.
CompGO enables rapid identification, comparison and visualization of differentially enriched GO terms calculated from multiple lists of genetic loci. Through experimental data we illustrate the problems associated with comparing GO enrichment between experiments using a simple overlap method in contrast to the proposed log odds ratio. CompGO provides methods to address the questions of “how significant are GO enrichment differences?” and “how similar are multiple experiments based on GO enrichments”. Input data can be .BED files or gene identifiers. CompGO is applicable to any species where a reference genome assembly is available. As CompGO is implemented in R, it is accessible to a broad range of users and can readily be incorporated into existing pipelines. CompGO is an easy and fast comparative package for GO enrichments from experimentally identified DNA regions or genes.
Project name: CompGO
Project home page: http://www.bioconductor.org/packages/release/bioc/html/CompGO.html
Operating system(s): Platform independent
Programming language: R
Other requirements: BioC 2.14 (R-3.1)
The authors would like to thank Marc Carlson for helpful comments when building and submitting CompGO to Bioconductor.
This work was funded by grants from the National Health and Medical Research Council, Australia (NHMRC; 573705, 573703, 1061539), the Australian Research Council Strategic Initiative in Stem Cell Science (Stem Cells Australia; 110001002), the Australia-India Strategic Research Fund (BF020084) and Foundation Leducq.
- 12. Bouveret R, Waardenberg AJ, Schonrock N, Ramialison M, Doan T, Jong D, et al. NKX2-5 mutations causative for congenital heart disease retain functionality and are directed to hundreds of targets. eLife. 2015. http://elifesciences.org/content/early/2015/07/06/eLife.06942
- 14. Ihaka R, Gentleman R. R: A Language for Data Analysis and Graphics. J Comput Graph Stat. 1996;5(3):299–314.
- 17. Carlson M. KEGG.db: A set of annotation maps for KEGG. R package version 3.1.2. http://www.bioconductor.org/packages/release/data/annotation/html/KEGG.db.html
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.