In genome-wide microarray experiments, it is possible to analyse the relevance of many different categories of genes, obtained from prior knowledge in the form of database annotations or from other experiments. These gene annotation analyses can unravel new information about pathways and cellular functions responsible for different phenotypes. Computational tools aiding in this process have recently been developed [18], most notably for annotations based on the Gene Ontology (GO) [9]. Generally, category relevance is calculated as the p-value of a score, thus being dependent on both the choice of score and the choice of null hypothesis.

In microarray analyses such as clustering, which provide defined subsets of genes with no internal ranking, it is natural to base the score on the number of category genes in the relevant subset. However, ranking of genes appear in many techniques for microarray analysis, such as correlation of gene expression to target profiles [10] and scoring of genes by their ability to discriminate between experimental conditions [1113]. A separation of relevant and irrelevant genes can easily be constructed from ranked gene lists by introducing a cutoff, but the choice of cutoff becomes somewhat arbitrary and information in the list is lost. Tools addressing this problem, by using rank-based scores that are independent of a rank cutoff, have adopted the Kolmogorov-Smirnov score [1417], and a minimized cutoff-based p-value [7, 8], which optimizes the cutoff for each category. The Wilcoxon rank sum [18], investigated here, serves the same purpose.

To calculate a p-value for the assigned score, a set of gene lists, ranked according to a chosen null hypothesis, are needed. The simplest choice of null hypothesis is just random gene permutations, and for some rank-based scores, the p-value can then be calculated analytically, without explicitly performing the permutations. However, the random gene permutations null hypothesis assumes independence of gene expression over biological samples, and the p-value is thus a combination of the p-value of how important the category is and the p-value for the genes of the category being coexpressed. When category genes behave similarly over a wide range of experimental conditions, the coexpression does not indicate relevance of the category for the question under study. In many analyses, a more appropriate null hypothesis is therefore sample label permutations, in which a set of ranked gene lists are generated based on the gene expression correlations to randomly permuted target values of the samples. This approach accounts for correlations between category genes and gives p-values that are bounded from below by the number of possible permutations of the samples in the data set. The latter is particularly important in data sets with few samples. Despite this, publicly available tools for gene annotation analysis are restricted to gene permutations [18].

We present a program, Catmap, for gene category analysis based on ranked gene lists. The program uses either the number of genes above a cutoff or the Wilcoxon rank sum as score, and the significance of the score can be calculated from a user supplied set of ranked lists, thus allowing for sample label permutations. Furthermore, the program calculates corrections for multiple category testing, using permutation results to assess an effective number of independent categories, which enables Catmap to estimate very small multiple category p-values, that would otherwise have been computationally infeasible. The input to the program is two files and some arguments. The first file contains the biologically relevant ranked list of genes and, if needed, additional ranked gene lists drawn from the null hypothesis. The second file contains the categories and their corresponding genes. The input arguments can either be specified on the command line or in a settings file, and are as follows: 1) a choice between the cutoff score the Wilcoxon rank sum score; 2) a choice of null hypothesis, which can be either the above mentioned user-supplied ranked lists or random gene permutations; 3) the number of permutations used in multiple category testing. If zero, no multiple category testing is performed.

The output of Catmap is two files. The main output file contains all the categories, one on each line ordered according to their significance. The line of a category contains the p-value, the multiple comparison p-value, the false discovery rate, the ROC area (which is a normalized way to represent the Wilcoxon rank sum), the number of genes in the category, and the 25th, 50th, and 75th percentiles of the ranks. The other output file, the companion file, contains all the categories, with all the genes and their ranks listed below. Each line contains a gene and its rank. The program can be downloaded at [19], where file format specification and example files are accessible as well.

Results and discussion

Comparing cutoff independent and cutoff-based score functions

We analysed the breast cancer data set of van 't Veer et al. [13] with a cutoff-based score function, using different cutoffs. Table 1 presents results for 15 categories with low p-values from cutoff independent scoring, showing that the p-value depends strongly on the choice of cutoff. This is further illustrated by the very different cutoffs at which the minimized cutoff-based p-value was obtained. A table with all categories is provided as a supplement [see Additional file 2].

Table 1 Category p-values for cutoff-based and cutoff independent score functions. The 15 GO categories with the lowest Wilcoxon rank sum p-values from the ranked gene list, based on the data set of van't Veer et al., which comprises 5224 genes in total. Three columns show p-values for cutoff based score functions, with cutoffs at position 100, 300 and 600 in the list. The columns "min p" and "cutoff" give the minimal cutoff based p-value and the cutoff where this minimum was attained. The column "WRS" gives the p-value calculated with the Wilcoxon rank sum as score function and random permutation of genes as null hypothesis, and the column marked #genes indicates the total number of genes in the category. A full table (sorted by WRS p-value) is given as supplementary information [see Additional file 2]. The supplementary table also contains the ranking of each category using the different methods and the 25th, 50th and 75th percentiles of those genes in the ranked list.

Compared to the variations between the cutoff-based alternatives, the results shown in Table 1 are in reasonable agreement for two cutoff independent p-values, using the Wilcoxon rank sum and the minimized cutoff-based p-value, respectively. The p-value based on the Wilcoxon rank sum was most often larger than the minimal cutoff-based p-value. Since the latter is biased by a minimization process, it must be interpreted as a score, rather than a p-value, thus requiring additional analyses to find statistical significance [7, 8].

Comparing null hypotheses

Using the Wilcoxon rank sum, we compared the results of different null hypotheses. Three publicly available data sets were examined [11, 13, 20]. As can be seen in Figure 1, p-values based on gene permutations tend to be lower than those based on sample label permutations. For categories with small p-values, there are remarkable differences, in particular for large categories with more than 20 genes. Since the gene permutation null hypothesis assumes independent genes, we expect a GO category whose genes are uncorrelated to have roughly the same p-value under the two different null hypotheses, whereas a significant category whose genes are highly correlated will get a lower p-value using the gene permutation null hypothesis. To illustrate this coexpression effect, we picked two large categories, "carboxylic acid metabolism" and "M phase", which are encircled in Figure 1. In the data set of van 't Veer et al. [13], "carboxylic acid metabolism" has similar p-values for the two null hypotheses, while "M phase" has a p-value of 10-7using gene permutations but the much higher p-value of 3 · 10-2 using sample label permutations. As seen in Figure 2, the most highly ranked genes of "M phase" are indeed more coexpressed than the most highly ranked genes of "carboxylic acid metabolism".

Figure 1
figure 1

Comparing null hypotheses. Comparison of p-values obtained by sample label permutations and gene permutations, using the data set of van 't Veer et al. [13] (left), Golub et al. [11] (middle), and Alon et al. [20] (right). Sample label permutation results were obtained with 100.000 permutations for the van 't Veer et al. data set and with 10.000 permutations for the other data sets. Gene permutation results were calculated as described in Methods. Red, green and blue colours represent categories with 1 to 5, 6 to 20, and over 20 genes, respectively. Encircled boxes in the left figure represent the categories "M phase" and "carboxylic acid metabolism", which are further discussed in the text.

Figure 2
figure 2

Effects of coexpression on different null hypotheses. Expression profiles, over the 97 samples in van 't Veer et al. [13], of the 12 most highly ranked genes in the "M phase" category (left) and 13 most highly ranked genes in the "carboxylic acid metabolism" category (right), respectively. Some genes were inverted since the ranking was based on absolute correlation values to metastasis class. The metastasis free samples are to the left of the vertical line, and within each metastasis class, samples are ordered in increasing average expression of the examined genes. The expressions of each gene was normalized to zero average across samples. The narrower band of expressions in the left figure illustrates the higher Pearson correlation of M phase genes. Average absolute Pearson correlation between gene expressions was 0.74, with standard deviation of 0.16, for the M phase genes, and 0.44, with standard deviation of 0.27, for carboxylic acid metabolism genes.

In Table 2, the ranks of categories for the different null hypotheses are compared. There are distinct differences, with only a small overlap among top ten categories. One can clearly see the tendency for the gene permutation null hypothesis to find categories with very many genes, as discussed above. A table with all categories is provided in the supplement [see Additional file 3].

Table 2 Comparison of different cutoff independent approaches. The top ten categories and their corresponding ranks for each of the the four methods: Wilcoxon rank sum (WRS) with sample label permutation null hypothesis (s.l.p.), WRS with gene permutation null hypothesis (g.p.), the Kolmogorov-Smirnov score (K-S) as used in GSEA [17], and the minimal cutoff-based p-value (min-p) [7, 8]. Percentile columns indicate the position of the 25th, 50th and 75th percentile in the ranked gene list which comprises 5224 genes and is based on the data set of van't Veer et al.. The last column indicates the number of genes in each category. The full table is available as a supplementary file [see Additional file 3].

Table 2 also shows category ranks obtained with two alternative cutoff independent score functions: the Kolmogorov-Smirnov score as used in GSEA [17] and the minimal cutoff-based p-value used in FuncAssociate [7] and iGA [8]. These two alternatives do not calculate individual p-values for categories, but ranks categories based on the chosen score. Nevertheless, they give results similar to those obtained with the Wilcoxon rank sum and gene permutation. This is expected, since the minimized p-value is calculated with gene permutations, and the score adopted in GSEA [17] ranks categories similarly to what a Kolmogorov-Smirnov p-value, based on gene permutations, would do. It should be noted that GSEA, FuncAssociate, and iGA calculate multiple hypotheses corrected p-values, but these do not change the ranking of categories.

There is a possible difference (which does not reveal itself in Table 2) between the Kolmogorov-Smirnov score and minimized p-value score on one hand, and the Wilcoxon rank sum on the other, in the treatment of categories for which only a subset of genes have expressions correlating significantly with the question under study. The important genes being in the top of the ranked list will give the category a good score with all three score functions, provided the remaining, seemingly insignificant, genes are distributed in the ranked list as expected by random. However, if these less important genes lie higher in the list than expected by random (though not high enough to affect the Kolmogorov-Smirnov or min-p scores), the category will be considered more important by the Wilcoxon rank sum. Reversely, if the less important category genes prevail in the bottom of the list, the Wilcoxon rank sum score function will deem the category as unimportant, while the other two scores will give the category a high significance, based on the top ranked genes alone. Whether seemingly insignificant genes being ranked better or poorer than explainable by random expectations should be observed or ignored is of course a matter of taste, and a possibility is to use several score functions, that may complement each other. The differences are, however, much smaller than those related to choice of null hypothesis, as revealed in Table 2.

Multiple category testing

The more categories that are being tested, the more likely it is that at least one category gets a very small p-value by chance. To better evaluate the statistical significance of the best scoring categories, we used Catmap to calculate false discovery rates and family-wise error rates by permutation tests. This also gave us an effective number of independent categories, Neff, as described in Methods.

The GO contains many small categories which would be reasonable to ignore in a study aiming at biological conclusions, and they were included in Figure 1 mainly to highlight the differences between the null hypotheses. When performing multiple category testing, we restricted the study to large categories, containing more than 20 genes. We tested the 3 sub-ontologies (biological process, molecular function, and cellular component) both separately and together.

As expected from the discussion above, several categories with coexpressed genes got small pmultiple and small false discovery rates with random gene permutations. In contrast, when using sample label permutations, the smallest pmultiple was obtained in the data set of van 't Veer et al. [13] for the biological process category "organic acid metabolism", which contained 83 genes and had p = 3 · 10-4 and pmultiple = 0.02. Interestingly, organic acid metabolism is known in the literature to be relevant for breast cancer [21, 22]. For this data set and the biological process categories, there was a 38% false discovery rate among the top 15 categories.

For all 3 sub-ontologies, the effective number of categories, Neff, was around half of the full number of categories, N. In the data set of van 't Veer et al. [13] the numbers were Neff = 83 versus N = 166 for biological process, Neff = 69 versus N = 119 for molecular function, and Neff = 22 versus N = 42 for cellular component. For all categories together the real number of large categories was N = 327 whereas Neff = 152. Using random gene permutations for the same data set and categories, we got Neff = 170. The fact that Neff for the two null hypotheses are so close is a general phenomena that we see in all our examples (data not shown). Furthermore, for all data sets and ontologies studied, Neff was approximately half of the total number of categories. If this is a general feature for GO categories, the simple Bonferroni correction would not be totally unreasonable for small p-values.

Figure 3 shows that the fit with an effective number of categories was good; in the range where permutations results were available it did not deviate more than a factor of two. The example in Figure 3 was obtained with 100.000 sample label permutations, and minimal p-values were found for 1000 random gene lists.

Figure 3
figure 3

Fitting an effective number of independent categories. The multiple category p-value, pmultiple, versus p-value for the data set of van 't Veer et al. [13], using 327 large Gene Ontology categories with more than 20 genes. The yellow band shows 95% confidence interval of sample label permutation results, based on 1000 random lists, and the blue curves show the results of Equation (1), with the fitted Neff = 152 (solid line), the total number of categories N = 327 (dashed line), and also the Bonferroni correction (dotted line).

It should be noted that whenever several ranked lists are examined as part of a project, this additional source of multiple hypotheses testing should also be corrected for. An example of such a correction, for cutoff-based score functions, is presented by Corà et al. [23].


We developed a computer program for calculating the significance of gene categories in a ranked list of genes. Corrections for multiple category testing can be performed by the program. To investigate the properties of different scores and null hypotheses, we analyzed three publicly available data sets [11, 13, 20]. Commonly [16], a subset of relevant genes is defined from a ranked gene list by introducing a cutoff in the list. Our results show that the obtained p-values of biologically relevant categories depend strongly on the choice of cutoff. The cutoff independent Wilcoxon rank sum score overcomes the problem, representing an alternative to the Kolmogorov-Smirnov score [1417] and the minimized cutoff-based p-value [7, 8]. The ranking of categories for the three cutoff independent scores are very similar.

Though sample label permutations in many situations represent a better null hypothesis than gene permutations, available gene annotation analysis tools are restricted to the latter. Our implementation allows for both null hypotheses, and we find that both the p-values and the ranking of categories depend strongly on the choice of null hypothesis. Compared to sample label permutations, gene permutations gave much smaller p-values for large categories with many coexpressed genes.



The implemented algorithm treats the categories sequentially and independently. As score function for category relevance, the program uses either the Wilcoxon rank sum or the number of genes above a given cutoff in the ranked list. The latter is implemented for method comparison and for the case of a defined subset of relevant genes, without internal ranking.

For the case of the Wilcoxon rank sum, the user can supply a set of ranked lists distributed according to an appropriate null hypothesis, or request random permutations of genes as the null hypothesis. In the latter case, the significance of the score is calculated analytically by the program, using either an exact calculation by an iterative method, a Gaussian approximation, or a continuous volume approximation. The program chooses method based on a balance between accuracy and computation time. Details are presented in supplementary information [see Additional file 1].

For the case of the cutoff-based score function, the p-value of category relevance is determined with Fisher's exact test [24], corresponding to randomly permuted genes as null hypothesis.

When N independent categories are tested simultaneously, family-wise error rate simply means calculating the probability,

pmultiple (q) = 1 - (1 - q)N,     (1)

that at least one category has a p-value below any given number q by chance. For correlated categories, we make the assumption that the same functional form describes pmultiple (q), with N replaced by an effective number of independent categories Neff. We find Neff by generating a number, K, of ordered lists under the null hypothesis and calculating the p-values of all categories. We fit Neff using the maximum likelihood estimation

where p k is the minimal p-value for the k'th ordered list.

The false discovery rate for the j highest ranked categories is found by counting the number of p-values from K permuted lists lower than the p-value of the j:th category and divide this number with K · j. For the case of sample label permutations, when a user supplied set of ranked gene lists are used to represent the null hypothesis, the first K lists are used to find Neff and false discovery rates, and the remaining lists are used to calculate p-values for each of the K lists.


The algorithm is implemented in the Perl program and is released under the GNU General Public License (GPL)., together with user instructions, is available for download at [19].

Public data sets

Using Catmap, we analysed three publicly available data sets with gene annotations from the Gene Ontology.

The data set of van 't Veer et al. [13] consists of 97 patients with primary sporadic breast cancer, of which 46 had metastases within five years following treatment. Quality filtering was performed as described in [13], and rendered about 5,000 genes which were ranked according to their absolute Pearson correlation to metastasis class. A Gene Ontology analysis of the data set has previously been performed with the 231 top genes as the subset of important genes and random gene permutations [25].

The data set of Golub et al. [11] consists of bone marrow samples from leukemia patients, 27 with AML and 11 with ALL. The published data contains expression levels for 5000 genes, which after removal of genes with no variance across samples rendered 4812 genes which were ranked according to their absolute Pearson correlation to leukemia type.

The data set of Alon et al. [20] consists of 40 tumour and 22 normal colon tissue samples. The 2000 genes in the published data set were ranked according to their absolute Pearson correlation to tissue type.

Gene ontology associations

All genes were first mapped to corresponding UniGene clusters [26]. For the data set of Golub et al. [11] this mapping was given from chip annotation files provided by Affymetrix, whereas for the other data sets [13, 20], the mapping was done via GenBank accession numbers. GO annotations for UniGene clusters were extracted with ACID [27], and completed by back propagating all lower level associations on the GO graph.