Background

Genome sequencing, gene identification and the functional annotation of gene products are the basic first steps towards understanding the wide spectrum of biological processes taking place in a living organism. Although each of these steps presents its own difficulties, the experimental determination of protein function is probably the most challenging [1]. It is practically impossible to perform functional assays for all the uncharacterized proteins provided by the hundreds of genome sequencing projects that are currently underway. Computational tools are clearly necessary to assist in a task of such great magnitude [2]. In fact, the functional descriptions associated with the vast majority of genes in newly released genomes are not experimentally determined but are simply inferred from similarity to previously characterized sequences [3–5]. The basic assumption of this strategy, often (and misleadingly) referred to as "annotation transfer by homology" [6], is that sequence similarity implies functional similarity. However, the transfer of function based on sequence similarity is complicated by a technical issue: the lack of consistent annotation strategies, and by two other main factors: functional divergence and the domain organization of proteins.

Functional divergence of highly similar sequences has been detected in many protein families [7, 8]. In these cases, the use of permissive criteria to assess the significance of the similarity between proteins can easily lead to wrong annotations. For example, detailed biochemical function is not completely conserved between similar proteins even at 60% [9] to 70% pairwise sequence identity [10]; however, much lower thresholds have been used in the functional annotation of some genomes [11]. On the other hand, the modularity of proteins and the fact that different domains of the same protein may have different functions [12] can also lead to wrong functional assignments, for example, when the domain structure of a best database hit is overlooked [13]. These two factors associated with functional annotation based on sequence similarity represent the most important sources of error in genome annotation [13–15]. The consequences of these misannotations are far-reaching because they propagate in public databases [15], leading to their systematic deterioration, a process termed error percolation [16]. Genome reannotation, i.e. the annotation of a previously annotated genome using better bioinformatics algorithms and more complete databases [3], provides more accurate and up-to-date functional information and can mitigate the effects of error percolation when the higher quality annotations reach the databases [16]. Genome reannotation projects can provide improved gene structure, refinement of function annotation, benchmarking or comparison of different annotation strategies, and evaluation of annotation reproducibility [3]. In this spirit, we focus here on the reannotation of biochemical function as a more quantifiable aspect of this general problem.

The biological function of a protein can be defined in a physiological, developmental, cellular or biochemical context, among others [5]. From a biochemical point of view, the most important group of proteins is constituted by enzymes. Enzymes are responsible for the majority of biochemical functions, catalyzing the chemical reactions involved in the metabolism of all living organisms, and represent a significant fraction of a proteome [17]. Enzymes are organized according to the Enzyme Commission (EC) system, a hierarchical classification that assigns unique four-field numbers to different enzymatic activities [18]. The first field of an EC number indicates the general class of catalyzed reaction: 1. oxidoreductases, 2. transferases, 3. hydrolases, 4. lyases, 5. isomerases and 6. ligases. The second and third fields depend on different criteria related to the chemical features of the substrate and the product of the reaction, and the fourth field is a sequential number without any special meaning. An EC number is assigned solely based on the global reaction that is catalyzed and does not provide information about a specific catalytic mechanism, evolutionary family or structural fold associated with the classified enzyme [19]. New schemes that overcome these problems of the EC system are under development [20, 21]; however, their impact on the biological community is very low compared to the widespread recognition and the universal use of the EC classification. For example, all the main databases focused on enzymes (e.g. BRENDA [22] and ENZYME [23]) or metabolic pathways (e.g. KEGG, the Kyoto Encyclopedia of Genes and Genomes [24] and MetaCyc [25]) rely on the EC classification.
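The four-field structure of EC numbers described above can be illustrated with a minimal sketch. The helper names (`parse_ec`, `ec_class`, `matching_fields`) are hypothetical and are not part of EFICAz; the class table simply restates the six classes listed in the text, and a dash is assumed to mark an unspecified field, as in partial EC numbers such as 3.4.-.-.

```python
from typing import List, Optional

# First-field classes as listed in the text above.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def parse_ec(ec: str) -> List[Optional[int]]:
    """Split an EC number such as '3.6.1.10' or '3.4.-.-' into four fields.

    Unspecified fields (written '-') are returned as None.
    """
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields, got {ec!r}")
    return [None if p == "-" else int(p) for p in parts]


def ec_class(ec: str) -> str:
    """Return the general reaction class given by the first EC field."""
    return EC_CLASSES[parse_ec(ec)[0]]


def matching_fields(ec_a: str, ec_b: str) -> int:
    """Count the leading fields that are specified in both numbers and equal."""
    count = 0
    for a, b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if a is None or b is None or a != b:
            break
        count += 1
    return count
```

Because the classification is hierarchical, only leading fields are compared: two EC numbers that differ in the second field are treated as matching in one field even if their third fields coincide.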

In our previous work [26], we presented EFICAz (Enzyme Function Inference by a Combined Approach), an engine for large-scale enzyme function inference that addresses the limitations of annotation approaches based on sequence similarity alone. EFICAz combines different methods based on family-dependent sequence similarity thresholds, the presence of patterns in functionally relevant domains, and the identification of functionally discriminating residues, all carefully optimized to generate highly precise predictions (see Methods and our previous article [26] for a detailed description of EFICAz). In this work, we present the results of a multi-genome scale reannotation of enzyme function, using an updated version of EFICAz.

Many genome reannotation efforts have been dedicated to individual species [3, 27–30]; here, we investigate 245 genomes, in a very consistent way, and obtain EC number annotations for more than 200,000 coding sequences predicted to be enzymes by EFICAz; more than 14,000 of these are novel functional predictions.

Freilich and collaborators have recently conducted a survey and analysis of enzymes in 85 genomes [17]; however, they have inferred enzyme function using strategies based on sequence similarity alone, which suffer from the problems mentioned above. In contrast, in this study, we employ EFICAz, a method that we specifically designed and optimized to generate high quality predictions [26]. The ultimate purpose of our multi-genome reannotation project is to provide detailed enzyme function assignments, i.e. four-field EC numbers when possible or at least three-field EC numbers, that permit the reconstruction of metabolic pathways. Accordingly, we have generated a detailed, precise and standardized biochemical function annotation of genome sequences that satisfies the stringent requirements of automated methods for metabolic pathway reconstruction [31]. In fact, based on the results of the present reannotation study, we have initiated our own multi-genome scale metabolic pathway reconstruction project, where we demonstrate that novel EFICAz assignments permit the connection of a significant number of disjointed subpathways that occur systematically in certain groups of bacterial species (manuscript in preparation). However, we believe that the up-to-date enzyme function annotations obtained from this large-scale analysis, all available on our website [32], will also be of great utility to researchers interested in comparative genome analysis or the general understanding of biochemical processes occurring in particular species.

This manuscript is organized as follows: in the Results and Discussion section, we first present a reassessment of enzymatic content in organisms from the three domains of life. Second, we compare our predictions with enzyme function annotations from two releases of KEGG. Third, we estimate the precision of our novel assignments by comparing EFICAz predicted and experimentally derived biochemical functions of 64 previously hypothetical proteins. Fourth, we provide examples that highlight the potential of novel EFICAz predictions as a source of biologically relevant hypotheses. In the Conclusions section, we summarize the present work, stress its significance, and discuss its limitations. Finally, in the Methods section, we briefly describe EFICAz, introduce the data sources of our analysis, present the results of an extensive benchmark in a scenario of periodic updates, and describe the procedure we followed to identify recently characterized hypothetical proteins.

Results and discussion

Enzyme content assessed by EFICAz

We applied our enzyme function prediction method (EFICAz version 5.0) to the 245 genomes available in the Genes database Release 33.0+/03–05 of KEGG [24]. In Table 1, we show taxonomic information, scientific names and KEGG genome abbreviations for all the species analyzed in this study, which include representatives from the three domains of life [33]: 21 archaeal species, 204 bacterial species and 20 eukaryotic species. EFICAz assigned four-field (three-field) EC numbers to 158,941 (221,999) of the 866,142 coding sequences found in the set of analyzed genomes. All the EFICAz predictions are available on our website [32], where they can be downloaded and browsed by various criteria (see next section). The multi-genome scale application of EFICAz not only provides a highly precise annotation of enzyme function, it also gives us the opportunity to reassess the enzyme content throughout the different domains of life in a very consistent way.

Table 1 Species and taxonomic groups represented in the genome sequence dataset. The taxonomic information is from the NCBI Taxonomy database [85] and the three-letter code for the organisms is from KEGG [24]

In Figure 1, we show the number of enzymes per genome (estimated by the number of sequences annotated with three-field EC numbers by EFICAz) as a function of proteome size, for archaeal (Fig. 1A), bacterial (Fig. 1B) and eukaryotic (Fig. 1C) species. As reported before based on analyses of a smaller set of genomes [17, 34], we observe a linear relationship between the number of enzymes and proteome size that is characteristic of genomes in each domain of life, although we note that those of archaeal and eukaryotic species are very similar (correlation coefficient R² = 0.85, 0.95 and 0.93, regression coefficient or slope b = 0.179, 0.242 and 0.178, standard error of regression coefficient se b = 0.017, 0.004 and 0.012, intercept a = 149.24, 118.56 and 42.04 for archaeal, bacterial and eukaryotic genomes, respectively). Only two bacterial organisms, Rhodopirellula baltica and Leptospira interrogans serovar Lai, show a significant deviation from the linear relationship, both having fewer enzymes than expected for their proteome size (Fig. 1B). The main feature shared by these two species is an elevated number of regulatory proteins as a consequence of adaptation to changing environments. Rhodopirellula baltica, the only planctomycete among the analyzed genomes, has acquired a high proportion of two-component systems and Extra Cytoplasmic Function sigma factors to adapt to the changing conditions of free-living in marine, fresh water and terrestrial environments [35, 36]. Interestingly, the enzyme content of Rhodopirellula baltica agrees very well with the linear relationship between the number of enzymes and eukaryotic proteome size (Fig. 1C), yet another eukaryotic-like feature of planctomycetes in addition to the lack of peptidoglycan in their cell walls, unique cell compartmentalization, and presence of a condensed fibrillar nucleoid [35].
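As an illustration of how the fitted lines reported above can be used, the following sketch evaluates the expected number of enzymes for a given proteome size. The slopes and intercepts are those quoted in the text; the function name `expected_enzymes` and the proteome size of 4,000 sequences are hypothetical and serve only as an example.

```python
# (slope b, intercept a) per domain of life, as quoted in the text.
REGRESSION = {
    "archaea": (0.179, 149.24),
    "bacteria": (0.242, 118.56),
    "eukarya": (0.178, 42.04),
}


def expected_enzymes(proteome_size: int, domain: str) -> float:
    """Expected number of enzymes from the linear fit: a + b * proteome size."""
    slope, intercept = REGRESSION[domain]
    return intercept + slope * proteome_size


# Hypothetical proteome of 4,000 coding sequences (illustrative value only).
size = 4000
bacterial_line = expected_enzymes(size, "bacteria")   # about 1087 enzymes
eukaryotic_line = expected_enzymes(size, "eukarya")   # about 754 enzymes
print(round(bacterial_line), round(eukaryotic_line))
```

For the same proteome size the eukaryotic line predicts markedly fewer enzymes than the bacterial one, which is the sense in which a bacterial genome with a low enzyme content can fall close to the eukaryotic relationship.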

Figure 1

Enzyme content in organisms from the three domains of life. Number of enzymes as a function of the proteome size for archaeal (A), bacterial (B) and eukaryotic (C) genomes. The gray, magenta and green lines represent: regression line, 95% and 99% prediction intervals, respectively. (D) Distribution of the fraction of enzymes in archaeal, bacterial and eukaryotic genomes. The statistics represented in the box-and-whisker plots are: outliers below the 10th percentile (circles, bottom), 10th percentile (whisker, bottom), 25th percentile (box, bottom), median (thick line), 75th percentile (box, top), 90th percentile (whisker, top) and outliers above 90th percentile (circles, top).

The spirochaete Leptospira interrogans serovar Lai, a pathogenic non-obligate parasitic bacterium that can survive as a saprophyte or as a facultative parasite of mammals, has developed a vast regulatory system to interpret the signals from these distinct environments [37, 38]. The other serotype of Leptospira interrogans analyzed in our set, serovar Copenhageni [39], would also show fewer enzymes than expected from a linear relationship if a minimum open reading frame (ORF) size less restrictive than 50 amino acids had been used for ORF detection (less than 4% of the analyzed genomes show a minimum ORF size so high). The low enzyme content of Rhodopirellula baltica and Leptospira interrogans compared with other bacterial organisms of similar proteome size is thus consistent with the correlation that has been reported between the low fraction of enzymes and the massive recruitment of regulatory proteins [17, 40]. Figure 1D shows the distribution of the fraction of enzymes characteristic of each domain of life, whose median and mean ± standard deviation values are: 0.24 and 0.25 ± 0.04 for Archaea, 0.29 and 0.30 ± 0.05 for Bacteria, and 0.17 and 0.18 ± 0.05 for Eukarya. Freilich and collaborators have recently reported higher estimates for the fraction of enzymes in the three domains of life [17]; however, the sets they analyzed included enzymes and some non-enzymes, because their definition of putative enzyme was much more permissive than the one used in our analysis. Thus, we can consider theirs and ours as upper-bound and lower-bound estimates of the true values, respectively. See additional file 1: Enzyme_content.xls for a list of the estimated fraction of enzymes for each of the analyzed genomes.

Comparison of EFICAz predictions with KEGG annotations

To evaluate the level of agreement of EFICAz predictions with other sources of annotation, we compared our enzyme function assignments to those available in the Genes database of KEGG. In general, the quality and completeness of the functional annotation of genomes tend to continuously improve due to the incessant flow of new experimental results and the correction of systematic errors in annotation transfer [13]. To account for the dynamic nature of the functional annotation process, we compare our predictions with annotations from two different releases of the Genes database: (i) 33.0+/03–05 of March 5, 2005, which is contemporary with the sources we employed for training the version of EFICAz used for our multi-genome scale enzyme annotation effort (Fig. 2A, B), and (ii) 37.0+/03–07, released a year later (Fig. 2C, D). We compare the enzyme function annotations at the level of four-field EC numbers (Fig. 2A, C) and three-field EC numbers (Fig. 2B, D); in the latter case, we compare only the first three fields of the annotated EC numbers, whether the fourth field is known or unknown. Besides our EFICAz predictions, the set of KEGG annotations as of 2006 is also available on our website [32], where the assignments made by EFICAz and/or KEGG can be browsed. The annotations can also be easily selected and retrieved according to species name, level of detail of the enzyme function prediction (four-field or three-field EC numbers), consistency or inconsistency between EFICAz and KEGG assignments, presence of the keywords "hypothetical" or "unknown" in KEGG assignments as of 2005, EC number and gene name.

Figure 2

Comparison of EFICAz predictions with KEGG annotations. Comparison of EFICAz predictions with KEGG annotations from the Genes database of March 5, 2005, Release 33.0+/03–05 (A-B) and of March 7, 2006, Release 37.0+/03–07 (C-D). We analyze two levels of enzyme function description: four-field EC numbers (A, C) and three-field EC numbers (B, D). For all genomes together and for archaeal, bacterial and eukaryotic genomes separately, we plot the average percentage of enzymatic proteins per genome whose EFICAz-inferred and KEGG-provided annotations at the specified level of detail agree (green columns) or disagree (red columns), and whose enzyme function annotation at the specified level of detail is only provided by EFICAz (blue columns) or by KEGG (yellow columns). The numeric values inserted in each stacked column are the corresponding average percentage of enzymatic proteins per genome ± the standard deviation.

The functional annotations in the Genes database of KEGG are obtained from various sources: descriptions of gene functions in the GenBank [41] database, on-line genome databases which are generally more up-to-date, the Swiss-Prot [42] database, and additional annotations by KEGG based on ortholog identification and pathway reconstruction [43]. Since the algorithms for enzyme function annotation employed by EFICAz and KEGG are different, and their sources only partially overlap, we expect to find some sequences for which both methods make functional assignments (which may agree or disagree), and other sequences for which only one of the methods is capable of making an EC number assignment. In Figure 2, we plot the average percentage of sequences per genome for which EFICAz enzyme function predictions and KEGG annotations agree (green) or disagree (red) at the four- or three-field EC number level, and the average percentage of sequences per genome for which only EFICAz (blue) or KEGG (yellow) provide enzyme function information at the specified level of detail. The statistical significance of the differences observed after one year in the mean percentage of annotations corresponding to each group of EC number assignments was evaluated by correlated two-tailed t-tests at a critical alpha level of 10⁻³.
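The four groups plotted in Figure 2 can be sketched as a simple classification of each sequence at a chosen level of detail. This is a minimal illustration rather than the procedure actually used: the function names, the gene identifiers g1 to g4 and their EC numbers are hypothetical, and each sequence is assumed to carry at most one EC number per source.

```python
from collections import Counter
from typing import Dict, Optional


def prefix(ec: Optional[str], level: int) -> Optional[str]:
    """Return the first `level` fields of an EC number, or None if the
    annotation is missing or not specified to that level of detail."""
    if ec is None:
        return None
    fields = ec.split(".")[:level]
    if len(fields) < level or "-" in fields:
        return None
    return ".".join(fields)


def compare(eficaz: Dict[str, str], kegg: Dict[str, str], level: int) -> Counter:
    """Count sequences whose annotations agree, disagree, or are unique
    to one source at the given number of EC fields (3 or 4)."""
    counts: Counter = Counter()
    for gene in set(eficaz) | set(kegg):
        e = prefix(eficaz.get(gene), level)
        k = prefix(kegg.get(gene), level)
        if e and k:
            counts["agree" if e == k else "disagree"] += 1
        elif e:
            counts["only_eficaz"] += 1
        elif k:
            counts["only_kegg"] += 1
    return counts


# Hypothetical toy annotations (illustrative only).
eficaz = {"g1": "3.6.1.10", "g2": "4.2.2.3", "g3": "3.1.4.-"}
kegg = {"g1": "3.6.1.11", "g2": "4.2.2.3", "g4": "2.4.2.30"}

four = compare(eficaz, kegg, level=4)   # 1 agree, 1 disagree, 1 only_kegg
three = compare(eficaz, kegg, level=3)  # 2 agree, 1 only_eficaz, 1 only_kegg
```

In the toy data, g1 disagrees at the four-field level but agrees at the three-field level, and g3 contributes only to the three-field comparison because its fourth field is unspecified. Dividing each count by the proteome size would give per-genome percentages of the kind averaged in Figure 2.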

Most newly added KEGG enzyme function annotations agree with earlier EFICAz predictions

We first analyze the degree of agreement of the enzyme function assignments for sequences that both EFICAz and KEGG annotate as enzymes. As of 2005, we observe that, on average, EFICAz and KEGG assign the same four- and three-field EC numbers to 14.2% (Fig. 2A) and 18.2% (Fig. 2B) of the sequences in a proteome, respectively. Only an average of 0.9% (Fig. 2A) and 1.0% (Fig. 2B) of the sequences in a proteome show disagreement in their four-field and three-field EC number assignments, respectively.

When KEGG annotations as of 2006 are considered, the agreement increases from 14.2% (Fig. 2A) to 15.5% for four-field EC number assignments (Fig. 2C) and from 18.2% (Fig. 2B) to 20.1% for three-field EC number assignments (Fig. 2D). In contrast, there is no significant change in the level of disagreement. After one year, the average percentage of sequences in a proteome with four-field (three-field) EC numbers assigned by KEGG, including agreeing, disagreeing and unique annotations, grew from 18.4% (21.6%) to 20.3% (23.8%), with 67% (80%) of that growth corresponding to agreeing annotations and only 2% (1%) corresponding to disagreeing annotations. Thus, most of the newly added enzyme function annotations in KEGG agree with predictions made by EFICAz a year before. About 31% (19%) of the growth corresponds to unique four-field (three-field) EC number annotations made by KEGG, which are analyzed in the next section.
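The shares of growth quoted above follow from simple differences between the two releases. The sketch below recomputes the four-field shares from the rounded percentages given in the text; because the inputs are rounded, the results only approximate the published figures, which were presumably derived from unrounded values. The function name `share_of_growth` is hypothetical.

```python
def share_of_growth(before: float, after: float,
                    total_before: float, total_after: float) -> float:
    """Fraction of the total growth that is contributed by one category."""
    return (after - before) / (total_after - total_before)


# Four-field EC numbers, rounded percentages quoted in the text.
total_2005, total_2006 = 18.4, 20.3   # all KEGG four-field annotations
agree_2005, agree_2006 = 14.2, 15.5   # agreeing with EFICAz
unique_2005, unique_2006 = 3.3, 3.9   # annotated only by KEGG

agree_share = share_of_growth(agree_2005, agree_2006, total_2005, total_2006)
unique_share = share_of_growth(unique_2005, unique_2006, total_2005, total_2006)

# Roughly 0.68 and 0.32, close to the reported 67% and 31%.
print(f"{agree_share:.2f} {unique_share:.2f}")
```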

EFICAz predictions have higher coverage than KEGG annotations, especially for eukaryotes

Figure 2 shows that the average fraction of sequences per proteome that is only annotated by KEGG with four-field (three-field) EC numbers increased from 3.3% to 3.9% (2.3% to 2.7%) after one year. Still, unique EFICAz predictions are more numerous than unique KEGG annotations, even when KEGG annotations as of 2006 are considered. The average fraction of sequences per proteome with three-field (four-field) EC number assignments made only by EFICAz is 4.0 (1.7) times higher than the fraction corresponding to KEGG as of 2005, and 2.7 (1.1) times higher than the fraction corresponding to KEGG as of 2006.

When we analyze the unique predictions in genomes from different domains of life, the most extreme difference between the number of EFICAz-based and KEGG-based unique assignments corresponds to Eukarya. For eukaryotic genomes, the average fraction of unique EFICAz predictions ranges from 4.0 (Fig. 2C) to 10.9 (Fig. 2B) times higher than the average fraction of unique KEGG annotations. Before suggesting an explanation for this discrepancy, we should mention that one of the principles used by KEGG curators for enzyme function annotation is the transfer of annotation between orthologs, which are identified by sequence similarity with consideration of the positional coupling of genes on the chromosome [43]. Thus, a probable reason for the low number of KEGG-based unique assignments in eukaryotes is that in these organisms, with only a few exceptions [44], genes do not appear to be organized in operons, preventing KEGG annotators from making use of the conservation of local genomic context (such as gene order or gene neighboring) to validate orthology-based annotations of enzyme function. This observation raises the question as to how much the recall of EFICAz would improve if we account for the conservation of local genomic context. We would expect an increased coverage for archaeal and bacterial genomes; although, evidently, this component of the method would not be relevant for enzyme function inference of single sequences.

EFICAz predictions for recently characterized hypothetical proteins

In the previous section, we have shown that a considerable fraction of an average proteome is annotated with at least three-field EC numbers only by EFICAz. An average of 36%, 25% and 12% of the three-field EC number annotations uniquely provided by EFICAz in the archaeal, bacterial and eukaryotic proteomes, respectively, correspond to proteins annotated as hypothetical in KEGG as of 2005. In this section, we assess the EFICAz predictions for a subset of hypothetical proteins for which experimentally-derived enzyme function annotation has recently become available. More precisely, we compare the EFICAz-predicted and the experimentally-derived EC numbers of 64 proteins annotated as hypothetical in KEGG whose enzyme functions we could confidently retrieve from the literature (see Methods for details). For this evaluation, we assume that the true EC number associated with an enzyme is the one derived from the referred experimental results. To exclude cases in which functional annotation could generally be transferred successfully by simple methods based on sequence similarity, we only consider hypothetical proteins whose maximal sequence identity to any of the enzymes we used to train EFICAz is less than 60%. We have previously shown that below this threshold of sequence identity the conservation of enzyme function is on average poor [9]. From the histogram shown in Figure 3, we can observe that the median value of the maximal sequence identity to training enzymes is only 25%.

Figure 3

Similarity of 64 previously hypothetical proteins to EFICAz training enzymes. Number of previously hypothetical proteins predicted to be enzymes by EFICAz at different intervals of maximal sequence identity to enzymes included in the EFICAz version 5.0 training set. The true enzyme function of these 64 previously hypothetical proteins has been recently determined; therefore, we could assess the precision of our predictions. Dark green, light green and red bars represent four-field EC number predictions with four, three or less than three correct EC fields, respectively. Yellow and orange bars represent three-field EC number predictions with three or less than three correct EC fields, respectively. The median of the distribution (24.8%) is indicated by the broken line.

EFICAz correctly predicts the enzyme function of most of the recently characterized hypothetical proteins

EFICAz could predict four-field EC numbers for 37 of the 64 previously hypothetical proteins analyzed. We further divided these 37 proteins into two groups: one group of 25 proteins for which the number of matching leading fields between the EFICAz-predicted and the true EC numbers can be unambiguously determined (Table 2), and another group of 12 proteins for which the number of matching fields could be either three or four (Table 3). We observe that the four fields of the predicted and the true EC numbers agree for 21 of the 25 proteins listed in Table 2, indicating a precision of 84% for EFICAz four-field EC number prediction applied to this set of hypothetical proteins. The three-field precision of EFICAz four-field EC number predictions is 92%, since 34 out of 37 proteins listed in Tables 2 and 3 show agreement in at least the first three fields of the predicted and true EC numbers. Table 4 lists 27 of the 64 previously hypothetical proteins analyzed, for which EFICAz could only predict three-field EC numbers. In this case, 26 out of 27 proteins show agreement in the first three fields of the predicted and the true EC numbers, indicating a precision of 96% for EFICAz three-field EC number prediction applied to these hypothetical proteins.
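The precision values quoted above are the fraction of predictions that match the experimentally derived EC number at the stated level of detail. A minimal sketch of the arithmetic, using only the counts given in the text (the function name `precision` is hypothetical):

```python
def precision(correct: int, predicted: int) -> float:
    """Fraction of predictions that match the experimentally derived function."""
    if predicted <= 0:
        raise ValueError("at least one prediction is required")
    return correct / predicted


# Counts taken from the text (Tables 2 to 4).
four_field = precision(21, 25)            # four-field predictions, Table 2
three_of_four_field = precision(34, 37)   # first three fields, Tables 2 and 3
three_field = precision(26, 27)           # three-field predictions, Table 4

for label, value in [("four-field", four_field),
                     ("three-field (four-field predictions)", three_of_four_field),
                     ("three-field", three_field)]:
    print(f"{label}: {100 * value:.0f}%")
```

Rounded to whole percentages, these ratios reproduce the 84%, 92% and 96% reported in the text.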

Table 2 Four-field EC number validation of EFICAz-predicted enzyme functions for 25 previously hypothetical proteins
Table 3 Three-field EC number validation of enzyme functions predicted by EFICAz with four-field EC numbers for 12 previously hypothetical proteins
Table 4 Three-field EC number validation of enzyme functions predicted by EFICAz with three-field EC numbers for 27 previously hypothetical proteins

In agreement with the results of the benchmark described in Methods, there is no significant correlation between the precision of the EFICAz predictions for this set of hypothetical proteins and their sequence similarity to the enzymes included in the EFICAz training set (Figure 3). Also, the precision agrees reasonably well with the average precision derived from the benchmark test, especially considering the small size of the analyzed sample (64 proteins), and the fact that hypothetical proteins that are the subject of recent publications often belong to novel families. For example, three of our five wrong predictions correspond to enzymes that are the first studied member of a new family, with no significant sequence similarity to other functionally equivalent proteins: (i) the product of gene MJ0044 of Methanococcus jannaschii (Table 2), an isopentenyl-phosphate kinase that has not yet been assigned an EC number by the Enzyme Commission [18], (ii) the product of gene MJ0936 of Methanococcus jannaschii (Table 2), a new cAMP phosphodiesterase, and (iii) the product of gene Ta1419 of Thermoplasma acidophilum, a novel bifunctional phosphoglucose/phosphomannose isomerase (Table 3). It is well known that Archaea have unique enzymes that are optimized for extreme environments [45]; therefore, it is not surprising that these three misclassified proteins belong to archaeal organisms.

The annotation lag in databases can be longer than two years

Interestingly, for some of the 64 previously hypothetical proteins analyzed, the experimental evidence to support a specific enzyme function has been available for quite a long time in the literature; however, the corresponding functional annotation is not acknowledged in current databases. One of the 21 successfully predicted enzymes listed in Table 2, the product of the PA1167 gene from Pseudomonas aeruginosa, constitutes an example of this problem, known as annotation lag [46]. An article available as early as May 10, 2004 describes the biochemical characterization of PA1167 and demonstrates that it is a new alginate lyase (EC 4.2.2.3), an alginate biofilm degrading enzyme [47]. However, as of June 27, 2006, PA1167 was still annotated as a hypothetical protein in all the relevant databases we checked, from very general ones such as Swiss-Prot (Accession number: Q9I4H0) [42] and Entrez Gene (GeneID: 878215) [48], to those that are genome-oriented such as KEGG (Entry: PA1167 of Pseudomonas aeruginosa) [24] and TIGR-CMR, The Institute for Genomic Research Comprehensive Microbial Resource (TIGR Locus: NT03PA1297) [49], and even a database exclusively dedicated to Pseudomonas aeruginosa, Pseudomonas Genome Database v2 (Locus ID: PA1167) [50]. We think this specific example is worth mentioning, given the direct involvement of alginate biofilm in the pathogenicity of this bacterial species, and the recently suggested therapeutic possibilities of alginate lyase in the treatment of Pseudomonas aeruginosa infection of the respiratory tract in cystic fibrosis patients [51]. Similarly, experimental evidence supporting the ADP-ribose diphosphatase activity (EC 3.6.1.13) of the product of gene Rv1700 of Mycobacterium tuberculosis H37Rv (Table 2) has been available since August, 2003 [52, 53]; however, it is currently annotated as a hypothetical protein in all major databases. We believe that more elaborate approaches for detecting these ignored but highly confident functional assignments (e.g. methods based on natural-language processing of full text journal articles [54]) would extract considerably more annotations than our simple keyword-based PubMed search (see Methods for details).

Utility of novel predictions made by EFICAz

The results of the thorough benchmark described in Methods, the agreement between newly added enzyme function annotations in KEGG and EFICAz predictions made a year earlier, and the precision of EFICAz predictions for recently characterized hypothetical proteins, suggest that novel predictions made by EFICAz are of high confidence and can provide interesting leads for investigation in many biological fields. Below, we present two interesting cases that exemplify the utility of EFICAz predictions for hypothetical proteins. We believe that experts in different fields of biology will be capable of formulating other interesting hypotheses based on the mining of our numerous novel predictions.

EFICAz predictions as hypothesis generators for understanding functional roles of hypothetical proteins

Although not biochemically characterized even in the most recent literature, some of the hypothetical proteins that EFICAz predicts to be enzymes are known to be directly or indirectly involved in specific biological processes. In these cases, the enzyme function predicted by EFICAz can help to form new hypotheses about the functional role of a hypothetical protein in the particular biological process with which it has been associated. To illustrate this situation, we selected the product of the human gene FLJ11151. As of June 27, 2006, FLJ11151 was annotated as a hypothetical protein in Swiss-Prot (Accession number: Q9BRF8) [42], Entrez Gene (GeneID: 55313) [48] and KEGG (Entry: 55313 of Homo sapiens genome) [24], and lacked any kind of functional description in the Ensembl v39 database (Vega Gene ID: OTTHUMG00000073008) [55].

The enzyme function of FLJ11151 predicted by EFICAz is endopolyphosphatase (EC 3.6.1.10). Endopolyphosphatases catalyze the non-processive internal cleavage of polyphosphate (a chain of tens to hundreds of phosphate residues linked by phosphoanhydride bonds [56]) to release polyphosphate chains of shorter size [57]. Endopolyphosphatase activity has been detected in all eukaryotes tested to date, from unicellular organisms like Saccharomyces cerevisiae to mammals [57]; however, no human gene has been shown to be associated with this enzymatic activity or proposed as a putative endopolyphosphatase before this work. Although it was recently shown that the terminal cleavage products of the Saccharomyces cerevisiae endopolyphosphatase Ppn1 are inorganic phosphate and triphosphate, the Km value of Ppn1 for polyphosphate chains of 45 to 20 phosphate residues is much higher than its Km for long chains [58]. Therefore, under physiological conditions, Ppn1 probably degrades the long-chain polymer to short-chain polyphosphate of more than 20 phosphate residues, which is known to be required for the growth of yeast in minimal medium [59].

The physiological role of short-chain polyphosphate in mammals is unclear; however, in vitro experiments have demonstrated that polyphosphate chains of 15 to 750 residues strongly activate the serine/threonine kinase mTOR (mammalian Target Of Rapamycin) [60]. Activation of mTOR kinase, a central regulator that integrates growth factor and nutrient signals, enhances tumor growth and neoplastic proliferation [61]. Consequently, its inhibition is a cancer therapeutic strategy that is being vigorously investigated [62]. In vivo experiments have shown that the activation of mTOR by polyphosphate can be suppressed in human carcinoma cell lines by the expression of a highly processive exopolyphosphatase of yeast (EC 3.6.1.11) that degrades the polymer to inorganic phosphate [63], resulting in a dramatic reduction of cell proliferation [60].

Interestingly, the FLJ11151 transcript has been found to be expressed at high tag count in four Serial Analysis of Gene Expression (SAGE) libraries of primary melanomas in the vertical or metastatic growth phase, indicating that the hypothetical protein FLJ11151 may play an important role in advanced stages of cancer [64]. The predicted endopolyphosphatase activity of FLJ11151 suggests that the product of this gene may be involved in tumorigenesis via an activation of mTOR. We propose that the activation is due to an increased level of short-chain polyphosphate produced by the cleavage of longer molecules of the polymer. It is important to emphasize that we have arrived at this hypothesis in a semi-automatic way, by correlating the results of a PubMed [65] search for a given gene with its EFICAz-predicted enzyme function.

Candidate virulence factors predicted by EFICAz

Most of the virulence factors detected in pathogenic organisms exhibit some kind of enzymatic activity, e.g. many exotoxins are pentosyltransferases (EC 2.4.2.-), serine endopeptidases (EC 3.4.21.-) or metalloendopeptidases (EC 3.4.24.-) [66]. Furthermore, some carboxylic ester hydrolases (EC 3.1.1.-) and phosphoric diester hydrolases (EC 3.1.4.-) are involved in invasion or host cell penetration [67], and several peptidases (EC 3.4.-.-) are implicated in anti-immune strategies to evade the host defenses [68]. Even some housekeeping enzymes that perform essential metabolic functions can play a role in enhancing virulence in many pathogens [69]. Thus, hypothetical proteins whose EFICAz-predicted enzyme functions are known to be associated with pathogenicity can be considered as putative virulence factors. The product of the gene MW0119 of Staphylococcus aureus strain MW2 and its ortholog SA0140 in strain N315, both annotated as sphingomyelin phosphodiesterases by EFICAz, constitute a good example of this type of novel prediction. N315 and MW2 are methicillin-resistant S. aureus (MRSA) strains that were isolated from hospital-acquired [70] and community-acquired infections [71], respectively. The treatment of patients infected by MRSA has become increasingly difficult because MRSA strains are beginning to develop resistance to vancomycin, the antibiotic traditionally used to treat MRSA infections [72].
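The screening idea stated in this paragraph, i.e. flagging hypothetical proteins whose predicted EC numbers fall in enzyme classes associated with pathogenicity, can be illustrated with a minimal sketch. The EC classes are those listed above; the function names, the dictionary layout and the short labels are assumptions introduced here for illustration and do not describe the procedure actually used in this work.

```python
from typing import List

# EC classes named in the text as associated with virulence.
# The short labels are illustrative only.
VIRULENCE_EC_CLASSES = {
    "2.4.2.-": "pentosyltransferases (exotoxins)",
    "3.4.21.-": "serine endopeptidases (exotoxins)",
    "3.4.24.-": "metalloendopeptidases (exotoxins)",
    "3.1.1.-": "carboxylic ester hydrolases (invasion)",
    "3.1.4.-": "phosphoric diester hydrolases (invasion)",
    "3.4.-.-": "peptidases (immune evasion)",
}


def ec_matches(ec: str, pattern: str) -> bool:
    """Return True if an EC number falls within a class pattern.

    A '-' field in the pattern acts as a wildcard for that field.
    """
    return all(p == "-" or p == e
               for e, p in zip(ec.split("."), pattern.split(".")))


def virulence_classes(ec: str) -> List[str]:
    """List the virulence-associated EC classes that contain `ec`."""
    return [p for p in VIRULENCE_EC_CLASSES if ec_matches(ec, p)]


# Sphingomyelin phosphodiesterase, the activity predicted for MW0119 and SA0140.
print(virulence_classes("3.1.4.12"))
```

A match only marks a protein as a candidate; as in the example discussed here, the prediction still has to be weighed against what is known about the organism.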

S. aureus is the human pathogen that displays the widest assortment of virulence factors [73]. Beta-hemolysin, beta-toxin or sphingomyelinase C, one of the many exotoxins secreted by S. aureus, is a sphingomyelin phosphodiesterase (EC 3.1.4.12) that disrupts the membranes of erythrocytes and other mammalian cells [74]. In humans, beta-hemolysin has been shown to selectively kill monocytes, which then release cytokines that are important for the initiation and progression of S. aureus infection [75]. As of June 27, 2006, only truncated beta-hemolysins were annotated in the S. aureus MW2 genome: the products of the genes MW1881 (TIGR Locus: NT03SA2038, Swiss-Prot accession number: Q99QS0, Entrez GeneID: 1003995, KEGG entry: MW1881 of S. aureus MW2 genome) and MW1940 (TIGR Locus: NT03SA2101, Swiss-Prot accession number: Q8NVM0, Entrez GeneID: 1004054, KEGG entry: MW1940 of S. aureus MW2 genome). Similarly, the only beta-hemolysins annotated in the genome of strain N315 were truncated: the products of the genes SA1752 and SA1811, orthologs of MW1881 and MW1940, respectively. The inactivation of the beta-hemolysin genes in MW2 and N315 strains is caused by the insertion of bacteriophages [76].

The apparent absence of active beta-hemolysins in the MW2 and N315 strains opens the possibility that other genes with sphingomyelinase activity could serve as their functional substitutes. MW0119, one of our predicted sphingomyelin phosphodiesterases, was annotated as a "hypothetical protein, similar to lactococcal phosphatase homologue" in TIGR-CMR (TIGR Locus: NT03SA0129) [49], lacked a functional annotation in Swiss-Prot (Accession number: Q8NYQ6) [42] and Entrez Gene (GeneID: 1004871) [41], and was annotated as a hypothetical protein in KEGG (Entry: MW0119 of S. aureus MW2 genome) [24]. In all these databases, the annotations for SA0140, the ortholog of MW0119 in strain N315, were identical to those of MW0119. Based on the enzymatic activity assigned by EFICAz to these gene products, we suggest that the hypothetical proteins MW0119 and SA0140 may act as beta-hemolysins in the MW2 and N315 strains of S. aureus. We believe that the EFICAz-based strategy of detecting putative virulence factors described here can generate leads for the development of new antibacterial agents, which are urgently needed given the increasing magnitude of the public health problem that multiresistance to antibiotics constitutes.

Conclusion

The reannotation effort presented in this work provides up-to-date enzyme function information corresponding to 245 genomes. Based on the fact that more than double the number of genomes considered in previous analyses are now available [17, 34], and using EFICAz, our highly precise approach for enzyme function prediction, we have confirmed the existence of a linear relationship between the number of enzymes and proteome size and provided up-to-date estimates of the fraction of enzymes in genomes from each domain of life (Figure 1).

Precision was the highest priority of our analysis; accordingly, our results suggest that by using EFICAz [26], we have generated annotations of good quality. First, the comprehensive series of benchmarks of EFICAz show that we can expect a mean precision of 94% regardless of the sequence similarity between testing and training enzymes (Figure 4A–C). Second, by comparing our predictions with KEGG annotations available a year later (which can take advantage of updated databases and new experimental results available in the literature), we find that most of the newly added KEGG enzyme function annotations agreed with our earlier EFICAz predictions (Figure 2). Third, by way of illustration, we identified a set of 64 previously hypothetical proteins whose biochemical functions have been recently characterized and found that in 96% of the cases, we correctly identified their three-field EC numbers, and in 84% of the cases, we could provide their fully detailed enzymatic activities (Tables 1, 2, 3). Achieving this level of precision is not trivial, considering that: (i) hypothetical proteins are the most difficult targets for automated function prediction [77], and (ii) the maximal sequence identity between the 64 hypothetical proteins and the EFICAz training enzymes has a median value of 25% (Figure 3). We were surprised to find a few cases among this set of 64 hypothetical proteins, where the annotation lag in databases was more than two years. It is difficult to estimate the full dimension of this problem; nevertheless, a systematic rescue of those annotations lost in the literature is very much needed, given the low number of experimentally verified functional assignments in the current databases [78].

Figure 4

Benchmark test of updated versions of EFICAz. Precision (A-C), recall (D-F) and number of enzyme types described by four-field EC numbers (G-I) for different versions of EFICAz, at different levels of maximal testing to training sequence identity, averaged per enzyme type. Curves in red correspond to enzyme types for which at least 10 training sequences were available; curves in blue correspond to all enzyme types. The training of versions 2.0, 3.0 and 4.0 of EFICAz is based on the Releases 2.0, 3.0 and 4.0 of UniProt, respectively. The new Swiss-Prot sequences added to UniProt 5.0 since the release of UniProt 2.0, 3.0 and 4.0 constitute the test sequences for versions 2.0, 3.0 and 4.0 of EFICAz. See Methods for a full description of the benchmark procedure.

There always exists a trade-off between precision and recall in the implementation of a predictive method. A consequence of our prioritization of precision over recall is that the enzyme contents calculated based on our EFICAz predictions are lower-bound estimates (Figure 1D). However, EFICAz is still sensitive enough to generate thousands of novel annotations. We believe that our novel predictions can be mined and correlated with other information sources to generate biologically significant hypotheses. As a proof of principle of this strategy, we have presented two examples, selected because of their potential impact on human health. Using the EFICAz-based database on our website [32], we are confident that experts in different fields of biology will be able to discover many more such cases. To facilitate this task, the enzyme function assignments can be browsed on our website [32] according to species name, gene name, level of detail of the enzyme function prediction and EC number. Agreeing, disagreeing, or unique KEGG and EFICAz annotations, as well as EFICAz assignments for hypothetical proteins, can also be selected and retrieved.

The main drawback of our analysis is the fact that we can only predict biochemical functions that are represented in our set of training enzymes by at least one sequence. Because of this requirement, the prediction of orphan enzymes is beyond the capabilities of both our approach and of all the current computational approaches for enzyme function inference. Orphan enzymes are defined as enzymatic activities that have been experimentally measured, but not yet mapped to a gene product, i.e., EC numbers without known associated sequences [79, 80]. According to a recent survey, the number of different orphan enzymes exceeds fifteen hundred, i.e. more than 39% of the known enzymatic activities [81].

We plan to periodically repeat the reannotation of all available genomes using updated versions of EFICAz and maintain all the annotations in a web-accessible database. By using the same version of EFICAz to reannotate all the available genomes simultaneously, rather than only the newly released genomes, we will keep the annotations consistent between genomes. This feature, together with the full standardization of our annotations (EFICAz always reports EC numbers rather than enzyme names, although the latter are also provided), will be very useful for comparative genome analysis and automated metabolic pathway reconstruction, and will also facilitate the incorporation of EFICAz predictions into other functional databases.

Methods

EFICAz: Enzyme Function Inference by a Combined Approach

EFICAz is a combined approach designed specifically for high precision enzyme function inference [26]. It integrates the predictions of four independent methods: (i) CHIEFc family based FDR recognition: detection of Functionally Discriminating Residues (FDRs) in enzyme families obtained by a Conservation-controlled HMM Iterative procedure for Enzyme Family classification (CHIEFc), (ii) CHIEFc family specific SIT evaluation: pairwise sequence comparison using a CHIEFc family specific Sequence Identity Threshold (SIT), (iii) High specificity multiple Prosite pattern recognition: detection of multiple Prosite [82] patterns of high specificity, and (iv) Multiple Pfam family based FDR recognition: detection of FDRs in Multiple Pfam [83] enzyme families. In EFICAz, an enzyme family is defined as a group of proteins that are evolutionarily related and share the full four or the first three fields of their EC numbers. Each of the four methods is highly precise and able to generate unique assignments that are not detected by the other three components. Therefore, EFICAz makes an inference when one or more of the four component methods predict a particular enzyme function. The primary goal of EFICAz is predicting four-field EC numbers; however, when the highest level of detail for the enzyme function description cannot be confidently determined, EFICAz can provide three-field EC numbers. EFICAz and its components are fully described in our previous article [26].
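The "one or more of the four component methods" rule described above can be sketched as a union over the outputs of the components. This is an illustrative sketch only: the function name `combine_predictions`, the method labels and the dictionary layout are assumptions introduced here, and the actual decision logic of EFICAz is the one described in [26].

```python
from typing import Dict, Set

# Hypothetical labels for the four component methods (chosen for illustration).
METHODS = ("chiefc_fdr", "chiefc_sit", "prosite_multi", "pfam_fdr")


def combine_predictions(per_method: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each predicted EC number to the set of methods that support it.

    An EC number is reported as soon as at least one method predicts it.
    """
    support: Dict[str, Set[str]] = {}
    for method in METHODS:
        for ec in per_method.get(method, set()):
            support.setdefault(ec, set()).add(method)
    return support


# Toy input: two of the four methods agree on one enzyme function.
example = {
    "chiefc_fdr": {"3.1.4.12"},
    "chiefc_sit": set(),
    "prosite_multi": set(),
    "pfam_fdr": {"3.1.4.12"},
}
print(combine_predictions(example))
```

Keeping track of which methods support each EC number, rather than returning a bare set, makes it easy to see which assignments are unique to a single component.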

Training of different EFICAz versions

The source of annotated protein sequences for EFICAz is the UniProt Knowledgebase database (or UniProt for short) [42]. The UniProtKB/Swiss-Prot (or Swiss-Prot for short) component of UniProt is the source of the training enzyme sequences, which we require to be fully annotated with four-field EC numbers. A combination of the Swiss-Prot and the TrEMBL components of UniProt provides the source of sequences to prepare the heterofunctional multiple sequence alignments that are required for FDR selection by the Evolutionary Footprinting method [26]. For training of the Multiple Pfam family based FDR recognition component of EFICAz, we use the Pfam database [83].

We prepared three versions of EFICAz (2.0, 3.0 and 4.0) to benchmark the performance of our enzyme function prediction method in a situation mimicking periodic updates, and one version (5.0) to carry out enzyme function prediction on a multi-genome scale. The only differences among the various EFICAz versions are the releases of the different databases used for the training process. The sources of annotated protein sequences for versions 2.0, 3.0, 4.0 and 5.0 of EFICAz are the Releases 2.0, 3.0, 4.0 and 5.0 of UniProt, respectively. Table 5 shows the relevant statistics of the sequence data sources for the different versions of EFICAz. For training of the Multiple Pfam family based FDR recognition component of EFICAz versions 2.0, 3.0 and 4.0, we use the following Pfam database Releases: 15.0 of August, 2004 (based on UniProt 2.0), 16.0 of October, 2004 (based on UniProt 3.0) and 17.0 of March, 2005 (based on UniProt 4.0), respectively. For EFICAz version 5.0 we also use the Release 17.0 of Pfam. A detailed description of EFICAz training procedures can be found in our previous work [26]. See additional file 2: EFICAz_v5_enzymes.xls for a list of 2,061 enzyme types with four-field EC numbers and 203 enzyme types with three-field EC numbers recognized by EFICAz version 5.0.

Table 5 Source of Sequence Data for EFICAz training. The fifth column shows the number of enzymes annotated in Swiss-Prot with four-field EC numbers, which constitute the primary source for the training of EFICAz.

Benchmarking of EFICAz using annotated Swiss-Prot sequences

The results of the jackknife test presented in our previous work [26] showed that the original version of EFICAz generates highly precise enzyme function predictions. To corroborate that the precision of newer versions of EFICAz is comparable to that of the rigorously tested original version, we performed a benchmark in a scenario of periodic updates. Briefly, we select all the newly added Swiss-Prot sequences in the Release 5.0 of UniProt, i.e. not included in a given previous release of this database, and compare their functional annotations in UniProt 5.0 with our functional predictions using a version of EFICAz trained with the given previous release of UniProt. We tested the new sequences added to UniProt 5.0 since the release of UniProt 2.0 (33,475 sequences), UniProt 3.0 (18,325 sequences) and UniProt 4.0 (10,495 sequences), using the versions 2.0, 3.0 and 4.0 of EFICAz, respectively.

For a given enzyme function described by a four-field EC number, we calculate: precision = (true positives)/(true positives + false positives), and recall = (true positives)/(true positives + false negatives), where (i) true positives is the number of new sequences predicted by EFICAz as having the given enzyme function and annotated in UniProt 5.0 with that same function, (ii) false positives is the number of new sequences predicted by EFICAz as having the given enzyme function, but annotated in UniProt 5.0 with a different function, and (iii) false negatives is the number of new sequences annotated in UniProt 5.0 with the given function, but predicted by EFICAz as having a different enzymatic function or no enzymatic function at all. The enzyme sequences in UniProt are not evenly distributed over the different EC classes, i.e., some enzyme functions might be overrepresented. To reduce the bias towards the most populated enzyme functions, we first evaluate precision and recall for each individual enzyme type, and then average them across all types. On the other hand, because some newly added sequences are very similar to training enzymes (e.g., more than 90% sequence identity), they are much easier to predict than others. To reduce this second source of bias, we evaluate the performance of EFICAz according to different levels of maximal sequence identity of the test sequences to the training enzymes. Thus, for each enzyme type, we first select the test sequences whose sequence identities to any member of their corresponding training sets are not higher than a given value. Then, based on the selected testing sequences, we calculate the precision and recall of EFICAz for each of those enzyme types. Finally, for each version of EFICAz, we report the average precision and recall at different levels of maximal testing to training sequence identity.
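The evaluation scheme just described (per-enzyme-type precision and recall, restricted to test sequences below a maximal identity to the training enzymes, then averaged over enzyme types) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark code used in this work: the record layout and the function name are hypothetical, each test sequence carries a single precomputed maximal identity to the training enzymes, and enzyme types with an empty denominator are left out of the corresponding average.

```python
from typing import List, Optional, Tuple

# One test sequence: (EC annotated in UniProt 5.0,
#                     EC predicted by EFICAz or None,
#                     maximal identity to the training enzymes, in [0, 1])
Record = Tuple[str, Optional[str], float]


def averaged_precision_recall(records: List[Record],
                              max_identity: float) -> Tuple[float, float]:
    """Precision and recall averaged per enzyme type.

    Only test sequences whose identity to the training enzymes does not
    exceed `max_identity` are considered.
    """
    selected = [r for r in records if r[2] <= max_identity]
    ec_types = {ann for ann, _, _ in selected}
    ec_types |= {pred for _, pred, _ in selected if pred is not None}

    precisions, recalls = [], []
    for ec in ec_types:
        tp = sum(1 for ann, pred, _ in selected if pred == ec and ann == ec)
        fp = sum(1 for ann, pred, _ in selected if pred == ec and ann != ec)
        fn = sum(1 for ann, pred, _ in selected if ann == ec and pred != ec)
        if tp + fp > 0:  # precision is defined only if something was predicted
            precisions.append(tp / (tp + fp))
        if tp + fn > 0:  # recall is defined only if something was annotated
            recalls.append(tp / (tp + fn))

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else float("nan")

    return mean(precisions), mean(recalls)


# Toy data, not taken from the benchmark.
toy = [
    ("1.1.1.1", "1.1.1.1", 0.35),
    ("1.1.1.1", None, 0.30),
    ("2.7.1.1", "1.1.1.1", 0.38),
    ("2.7.1.1", "2.7.1.1", 0.90),
]
print(averaged_precision_recall(toy, 0.40))
print(averaged_precision_recall(toy, 1.00))
```

Averaging per enzyme type gives a sparsely populated EC number the same weight as a heavily populated one, which is the bias correction motivated in the text.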

Figure 4 shows the average precision (Fig. 4A–C), average recall (Fig. 4D–F) and number of predicted enzyme types (Fig. 4G–I), when EFICAz versions 2.0, 3.0 and 4.0 are applied to the sequences in UniProt 5.0 that were added since the release of UniProt 2.0, 3.0 and 4.0, respectively. Besides the results corresponding to all the observed enzyme types (blue curves in Figure 4), we also show those corresponding to enzyme types for which 10 or more training sequences were available (red curves in Figure 4). The average precision of any version of EFICAz is never below 94% (with standard deviations that never exceed 20%), irrespective of whether all enzyme types or only those with 10 or more training sequences are considered, and regardless of the sequence identity interval analyzed (Fig. 4A–C).

The average recall of EFICAz depends on the specific maximal testing to training sequence identity interval. Thus, when all enzyme types are considered, the recall ranges from 95% to 97% if no testing to training sequence identity restrictions are applied (100% sequence identity interval), but decreases to 69–74% at 40% sequence identity (Fig. 4D–F, blue curves). When only enzyme types with 10 or more training sequences are considered, the recall significantly improves, e.g. it ranges from 82% to 85% at 40% sequence identity (Fig. 4D–F, red curves). All the results shown correspond to four-field EC number predictions; the three-field EC number predictions follow the same trends, with slightly higher precision and recall (not shown). In general, these benchmark results clearly show that updated versions of our enzyme function inference method are very likely to perform as well as the original version of EFICAz.

Genome sequence dataset

Using EFICAz version 5.0, we analyzed the protein sequences of all the genomes available in the Genes database Release 33.0+/03–05 (of March 5, 2005), a component of KEGG. The dataset comprises 866,142 coding sequences corresponding to 245 genomes. The whole dataset was processed in approximately 19.5 days, using 50 of the 1,000 nodes in our IBM e1350 cluster, powered by two 2.0 GHz dual core AMD Opteron 270 processors per node, i.e. the average running time of EFICAz in a single 2.0 GHz core was 3.24 minutes per genomic sequence. For comparison, we collected the enzyme function annotation available for these sequences in the Genes database. We extracted the EC numbers (described at least at the level of the first three fields) from the DEFINITION line in the corresponding gene entries of the Release 33.0+/03–05 and the Release 37.0+/03–07 (of March 7, 2006) of Genes. Table 1 includes the scientific names and taxonomic classification of all the organisms analyzed in this study.
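The extraction of EC numbers from DEFINITION lines can be illustrated with a short sketch. It assumes that EC numbers appear in a bracketed block of the form `[EC:1.1.1.1 1.1.1.2]`; this format, the function name and the example lines are assumptions made for illustration and should be checked against the Genes release actually parsed.

```python
import re
from typing import List

# Bracketed block such as "[EC:1.1.1.1 1.1.1.2]" (assumed format).
_EC_BLOCK = re.compile(r"\[EC:([^\]]+)\]")
# EC numbers with the first three fields specified; the fourth may be "-".
_EC_THREE_PLUS = re.compile(r"^\d+\.\d+\.\d+\.(?:\d+|-)$")


def extract_ec_numbers(definition: str) -> List[str]:
    """Return the EC numbers in a DEFINITION line that are described
    at least at the level of the first three fields."""
    ecs = []
    for block in _EC_BLOCK.findall(definition):
        for token in block.split():
            if _EC_THREE_PLUS.match(token):
                ecs.append(token)
    return ecs


# Invented example lines.
print(extract_ec_numbers("alcohol dehydrogenase [EC:1.1.1.1]"))
print(extract_ec_numbers("putative kinase [EC:2.7.1.- 2.7.-.-]"))
print(extract_ec_numbers("hypothetical protein"))
```

Less detailed assignments such as `2.7.-.-` are discarded, in line with the three-field minimum stated above.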

Search of hypothetical proteins annotated by EFICAz and recently characterized by experiments

To estimate the validity of our novel predictions, we first collected all the protein products predicted to be enzymes by EFICAz version 5.0 and defined as hypothetical or unknown in the Release 33.0+/03–05 of the Genes database (14,177 coding sequences). Predicting the function of unannotated proteins with high sequence similarity to enzymes that we used to train EFICAz can be considered a trivial exercise; e.g. EFICAz training enzymes may include homologs with very high sequence identity to a given protein labeled as hypothetical in databases due to the annotation lag problem [46]. Therefore, to make our test more demanding, we excluded from our list 254 hypothetical or unknown proteins exhibiting more than 60% sequence identity to any enzyme in the EFICAz version 5.0 training set. Then, for the remaining 13,921 proteins, we searched the PubMed database of May 26, 2006 [65] using their corresponding gene entry ids and names as a set of synonym query terms, resulting in 544 sequences linked to at least one article published in the last five years. To carry out the PubMed search, we used the Entrez Programming Utilities (eUtils) [84] from the National Center for Biotechnology Information (NCBI). After manual inspection to eliminate irrelevant abstracts, we obtained a set of 64 proteins whose biochemical functions have been experimentally determined and described with at least three-field EC numbers. Finally, we compared the experimentally-derived annotations of the 64 proteins with their EFICAz-predicted enzyme functions.
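The literature search step can be sketched as the construction of an ESearch request to the NCBI eUtils service, with the gene identifiers and names combined as synonym terms. The exact queries used in this work are not given in the text, so the function below is an assumption-laden illustration: the function name, the quoting of terms, the `retmax` value and the approximation of five years as 5 × 365 days are all choices made here. The `db`, `term`, `datetype`, `reldate` and `retmax` parameters belong to the ESearch interface.

```python
from typing import Iterable
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(synonyms: Iterable[str],
                           years: int = 5,
                           retmax: int = 100) -> str:
    """Build an ESearch URL that looks for any of the synonyms in PubMed,
    restricted to articles published in the last `years` years."""
    term = " OR ".join(f'"{s}"' for s in synonyms)
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",       # filter on publication date
        "reldate": years * 365,   # window expressed in days (approximation)
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Identifiers taken from the FLJ11151 example discussed earlier.
print(build_pubmed_query_url(["FLJ11151", "Q9BRF8"]))
```

The sketch only builds the URL; sending the request and parsing the returned identifiers are left out, and the abstracts retrieved still require the manual inspection described above.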