Background

The availability of functional genomics data has increased dramatically over the last decade, mostly due to the development of high-throughput microarray-based technologies such as expression profiling. Automatic mining of these data for meaningful biological signals requires systematic annotation of genomic elements at different levels. The Gene Ontology (GO) project [1] is a collaborative effort aimed at providing a controlled vocabulary to describe gene product attributes in all organisms. GO consists of three hierarchically structured vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions. The building blocks of GO are terms, the relationship between which can be described by a directed acyclic graph (DAG), a hierarchy in which each gene product may be annotated to one or more terms in each ontology.

Since its inception, many tools have been developed to explore, filter and search the GO database. A comprehensive list of available tools is provided at the Gene Ontology web site http://www.geneontology.org. One of the most common applications of the GO vocabulary is enrichment analysis – the identification of GO terms that are significantly overrepresented in a given set of genes [2]. Enrichment may suggest possible functional characteristics of the given set. For example, enriched GO terms in a set of genes that are significantly over-expressed in a specific condition may suggest possible mechanisms of regulation that are put into play, or functional pathways that are activated in that condition.

A large repertoire of tools for enrichment analysis has been developed in recent years, including GoMiner [3], FatiGO [4], BiNGO [5], GOAT [6], DAVID [7] and others. In general, these tools accept as input a target set of genes that is compared to a given background set of genes, or to a default "complete" background set. Some subset of GO terms from one or more of the three ontologies is scanned for enrichment in the target set relative to the background set, and terms for which significant enrichment is discovered are reported. The statistical test used for enrichment analysis is typically based on a hypergeometric or binomial model.

The most common form of output is a list of enriched terms. This simple approach allows the user to identify terms that are most significantly enriched but may lose substantial information regarding the relations between these terms. A more informative approach is to present the enrichment results in the context of the DAG structure of the respective ontology. In a typical case, the list of significantly enriched GO terms may include several related terms at varying significance levels. Identifying the clusters of enriched terms in the GO hierarchy becomes much simpler if the DAG structure is made available. A few tools visualize the results of enrichment analysis in the DAG structure, including the downloadable version of GoMiner [3], the CytoScape plug-in BiNGO [5], GOLEM [8], GOEAST [9] and GOTM [10]. A particularly friendly and useful GO enrichment analysis tool is GO::TermFinder which is provided at the Saccharomyces Genome Database (SGD, [11]). This tool provides a color-coded map of the enriched GO terms. It is, however, limited only to analysis of S. cerevisiae genes and requires specifying an explicit target set.

In many practical cases, functional genomic information used as the input for the GO enrichment analysis may be naturally represented as a ranked list. For most applications, the requirement for an input target gene set forces the user to set some arbitrary threshold and define the target set as all genes with ranks above (or below) the threshold. For example, genes may be naturally scored and ranked according to their differential expression between two conditions. However, defining the specific set of genes that are differentially expressed requires setting an arbitrary threshold. Unfortunately, the results of the enrichment analysis may often depend on the specific threshold that is set. Tools that use the simple hypergeometric distribution require setting such a fixed threshold.

A few tools have been developed that use a threshold free approach including GSEA [12], FatiScan [13], GO-stat [14], GeneTrail [15] and iGA [16]. The widely used GSEA tool uses a statistic that is similar to Kolmogorov-Smirnov but assigns different weights to the occurrences of genes in different ranks in the list. The tool is not specifically aimed at GO enrichment, and therefore does not offer visualization in terms of the GO DAG structure. In addition GSEA does not provide an exact p-value and estimates the p-value using permutations. The p-values assigned by GSEA are therefore limited by the number of permutations performed. FatiScan is another threshold free tool by the creators of FatiGO. It tests a number of thresholds determined by the user (the default is 30 thresholds) and then corrects for multiple testing using FDR. Again, this tool does not provide an exact p-value. The iGA method uses an iterative approach that circumvents the need for a fixed cutoff by computing the hypergeometric score at all possible cutoffs. iGA does not produce an exact p-value as well. In [17] the authors study the advantages of sample re-sampling. The sample permutation approach is applicable for the analysis of differential gene expression data but not in other applications of gene set enrichment where ranking is inferred otherwise.

As a matter of practicality it is also important for GO enrichment tools to perform in an interactive manner. Many of the existing tools, in particular those based on flexible thresholds that require time consuming simulations, fall short on this desired property.

In this note we describe a web-server interactive software tool, called GOrilla, that enables GO enrichment analysis in ranked lists of genes. It is based on previous work [18], in which we describe a statistical framework, called mHG, for enrichment analysis in ranked lists. The method identifies, independently for each GO term, the threshold at which the most significant enrichment is obtained. The significance score is accurately and tightly corrected for threshold multiple testing without the need for time consuming simulations. Consequentially GOrilla performs the enrichment analysis on thousands of genes and thousands of GO terms in a few seconds. In the Results section we demonstrate how GOrilla can capture relevant biological processes and visualize the results with an easy to use graphical representation of the GO hierarchy, emphasizing on the enriched nodes.

Implementation

Enrichment Analysis

A standard approach for identifying enriched GO terms uses the hypergeometric distribution (e.g. [8]). Given a total number of genes N, with B of these genes associated with a particular GO term and n of these genes in the target set, then the probability that b or more genes from the target set are associated with the given GO term is given by the hypergeometric tail:

Prob ( X b ) = HGT ( b ; N , B , n ) = i = b min ( n , B ) ( n i ) ( N n B i ) ( N B ) , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeeiuaaLaeeOCaiNaee4Ba8MaeeOyaiMaeiikaGIaemiwaGLaeyyzImRaemOyaiMaeiykaKIaeyypa0JaeeisaGKaee4raCKaeeivaqLaeiikaGIaemOyaiMaei4oaSJaemOta4KaeiilaWIaemOqaiKaeiilaWIaemOBa4MaeiykaKIaeyypa0ZaaabCaeaajuaGdaWcaaqaamaabmaabaqbaeqabiqaaaqaaiabd6gaUbqaaiabdMgaPbaaaiaawIcacaGLPaaadaqadaqaauaabeqaceaaaeaacqWGobGtcqGHsislcqWGUbGBaeaacqWGcbGqcqGHsislcqWGPbqAaaaacaGLOaGaayzkaaaabaWaaeWaaeaafaqabeGabaaabaGaemOta4eabaGaemOqaieaaaGaayjkaiaawMcaaaaakiabcYcaSaWcbaGaemyAaKMaeyypa0JaemOyaigabaGagiyBa0MaeiyAaKMaeiOBa4MaeiikaGIaemOBa4MaeiilaWIaemOqaiKaeiykaKcaniabggHiLdaaaa@673C@
(1)

If a ranked gene list: g1,...,g N is provided in place of a target set, we define a label vector λ = λ1,...,λ N ∈ {0, 1}Naccording to the association of the ranked genes to the given GO term, λ i = 1 iff g i is associated with the term. The minimum hypergeometric (mHG) score is then defined as:

m H G ( λ ) = min 1 n < N H G T ( b n ( λ ) ; N , B , n ) , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyBa0MaemisaGKaem4raCKaeiikaGIaeq4UdWMaeiykaKIaeyypa0ZaaCbeaeaacyGGTbqBcqGGPbqAcqGGUbGBaSqaaiabigdaXiabgsMiJkabd6gaUjabgYda8iabd6eaobqabaGccqWGibascqWGhbWrcqWGubavcqGGOaakcqWGIbGydaWgaaWcbaGaemOBa4gabeaakiabcIcaOiabeU7aSjabcMcaPiabcUda7iabd6eaojabcYcaSiabdkeacjabcYcaSiabd6gaUjabcMcaPiabcYcaSaaa@5150@
(2)

where b n ( λ ) = i = 1 n λ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOyai2aaSbaaSqaaiabd6gaUbqabaGccqGGOaakcqaH7oaBcqGGPaqkcqGH9aqpdaaeWaqaaiabeU7aSnaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6gaUbqdcqGHris5aaaa@3D11@ . In words, the mHG score is the optimal HGT probability that is found over all possible partitions induced by the gene ranking. As such, this score must be corrected for multiple testing. In previous work we describe a dynamic programming algorithm for computing the exact p-value of a given mHG score [18]. More specifically, given a ranked list of genes, a GO term associated with some of these genes and a corresponding mHG score s, the mHG p-value tells us the exact probability of observing an mHG score s's under the null assumption that all GO term occurrence configurations in the ranked list are equiprobable. We describe various considerations for efficient implementation of the mHG p-value algorithm elsewhere [19].

Description of the Tool

GOrilla is publicly available as a web-based application at: http://cbl-gorilla.cs.technion.ac.il. The application has two modes of operation:

  1. 1.

    Discovery of enriched GO terms at the top of a ranked list of genes using the mHG statistics (as explained in this paper).

  2. 2.

    Discovery of enriched GO terms in a target set versus a background set and using a hypergeometric model (as commonly done in other applications). In this case the ranking inside these lists is ignored.

The web interface of GOrilla is shown in Figure 1. The input to the application is a ranked list or two sets (depending on the mode of operation) that consist of gene names in the following formats: gene symbol, protein RefSeq, Uniprot, Unigene or Ensembl. Gene duplication often occurs in such lists, which can lead to statistically biased results. GOrilla automatically removes duplicates keeping the highest ranking occurrence. This includes dealing with duplicates that hide behind different nomenclatures. GOrilla currently supports the following organisms: human, mouse, rat, yeast, D. melanogaster, C. elegans and Arabidopsis thaliana.

Figure 1
figure 1

How to use the GOrilla web user interface. To use the GOrilla web interface, the user is required to perform the following four simple steps: (i) choose an organism; (ii) choose a running mode (either flexible threshold or fixed threshold mode) (iii) copy and paste a list (or upload a file) of genes in the case of a flexible threshold or two lists of genes – a target and a background – in the case of a fixed cutoff; (iv) choose an ontology.

The output consists of a color-coded trimmed DAG of all significantly enriched GO terms. The output also includes a table consisting of the enriched GO terms (with web links to additional information), the enrichment p-values (computed under the null model as described above) and the relevant annotated genes. The nodes of the resulting DAG, which represent GO terms, are color-coded according to the significance of the detected enrichment and clickable, leading to the relevant entry in the above table. Results can be exported to Excel. Figure 2 depicts an example of the result of enrichment analysis, where the relations between enriched terms can be observed. To generate the GO DAG visualization, GOrilla employs AT&T's GraphViz tool [20].

Figure 2
figure 2

An example of the GOrilla analysis output. 14,565 genes from the van't Veer dataset were ranked according to their differential expression and given as input to GOrilla. The resulting enriched GO terms are visualized using a DAG graphical representation with color coding reflecting their degree of enrichment. Nodes in the graph are clickable and give additional information on the GO terms and genes attributing to the enrichment. N is the total number of genes; B is the total number of genes associated with a specific GO term; n is the flexible cutoff, i.e. the automatically determined number of genes in the 'target set' and b is the number of genes in the 'target set' that are associated with a specific GO term. Enrichment is defined as (b/n)/(B/N).

Results and discussion

To test the performance of GOrilla we used the van't Veer et al. breast cancer dataset [21], which is a landmark study in clinical use of gene expression data. This dataset consists of expression profiles containing 14,565 genes measured on 77 breast cancer patients. Different patients showed various prognostic characteristics, 44 with more than 5 years survival versus 33 patients with less than 5 years survival. All genes were ranked according to how well they differentiate between the two groups using a simple t-test. The top of the list contained the genes that were the best separators between the two groups. The ranked list was given as input to GOrilla with default running parameters. The result is shown in Fig 2 and highlights a unique set of enriched GO terms that were identified at different cutoffs. The enriched GO terms include mitosis (p < 10-12, top 717 genes), cell cycle (p < 10-13, top 692 genes) and microtubule cytoskeleton organization and biogenesis (p < 10-8, top 927 genes). These enriched GO terms are attributed to genes that were over expressed in patients with bad prognosis and under-expressed in patients with good prognosis, which is in accordance with biological common sense and supports their relevance. The total analysis running time was less than 10 seconds.

A comparison to other web-based tools was performed on the same dataset using each software's default parameters. Web-based tools have several advantages over standalone tools [2] and therefore we only compare GOrilla to other web-based tools. The flexible threshold tools Fatiscan [13] and GO-stat [14] were given the ranked list as input while the fixed threshold methods GOEAST and DAVID were given a target set containing all the genes with t-test p < 10-3, which includes the top 124 genes, and a background set containing the rest of the genes. We note that the p < 10-3 choice of fixed threshold for cutting the data, which is often used for this type of tasks, is inherently arbitrary.

A summary of the comparison between the different tools is given in Table 1. DAVID and GOEAST which use a fixed threshold based approach, yielded a similar set of enriched GO terms to the ones yielded by GOrilla. Most of these terms were identified with less significant enrichment scores presumably because these methods use an arbitrary cutoff. The flexible cutoff based methods Fatiscan did not identify any enriched GO term while the GO-stat flexible method identified GO terms similar to the ones obtained by GOrilla. Running time of different tools varied considerably ranging from a few minutes up to 30 minutes per single analysis, which may be partially due to Monte-Carlo simulations employed by some of the methods for assessing the p-value.

Table 1 A comparison of web-based GO enrichment tools.

Conclusion

The main contribution of the application presented herein are:

  1. 1.

    Most other approaches to GO enrichment analysis assume that a target set and a background set are given or evident to the process. It is often the case, however, that genomic data which is the subject of the enrichment analysis step, is naturally represented as a ranked list of elements. GOrilla uses a statistical model that supports the discovery of GO terms that are enriched at the top of a ranked list, enabling a threshold to be determined in a data driven manner. It also provides an exact p-value for the observed event, which is not the case for most other available flexible threshold tools.

  2. 2.

    GOrilla provides a simple and informative graphical representation of the significantly enriched terms, in the context of the complete DAG representation of the ontology used in the analysis. This graphical representation is color coded based on the p-value attained for each GO term.

  3. 3.

    The application is highly interactive with running time of a few seconds per analysis. This is achieved by using an efficient algorithm for computing the exact mHG p-value, which circumvents the need for simulations, and an efficient software implementation.

Availability and requirements

Project name: GOrilla

Project home page: http://cbl-gorilla.cs.technion.ac.il

Operating system(s): (Platform independent) web-based application

Programming language: Java