Background

Of the known biological sequences in the post-genomic era, the vast majority have not been, and cannot be, characterized by experimentation or manual annotation [1]. For example, Swiss-Prot, a protein database with manually curated functional annotations, had only 547,085 entries as of December 2014, whereas a comprehensive protein database such as UniProt-TrEMBL, which contains high-quality, computationally derived functional annotations and covers most known protein sequences, holds tens of millions of entries. Therefore, automated annotation is necessary to assign functions to uncharacterized sequences. Enzymes are of special importance owing to their central roles in metabolism and their potential uses in biotechnology [2]. Hence, a greater ability to predict enzyme functions will not only give biologists deeper insight into metabolism but also expand the toolkit available to bioengineers.

Many novel bioinformatics tools with different bases, such as protein structure [3], functional clustering [4], evolutionary relationships [5] and biological systems networks [6], have been developed for enzyme or protein functional annotation. Many of them perform better [7,8] than conventional approaches such as BLAST, which assigns functions to new genes based on pairwise sequence similarity [9]. However, BLAST remains the main approach used in functional annotation [10], whereas many recently developed tools are rarely applied in research projects [7]. Additionally, BLAST-based functional annotation performs poorly when only distantly related homologs with similarities of <30% can be found [11,12]. Furthermore, many proteins recently discovered using metagenomics approaches lack homologs with amino acid sequence identities high enough for reliable functional annotation. For example, a benchmark metagenomics study of cow rumen–derived biomass-degrading enzymes found that only 12% of the 27,755 assembled carbohydrate-active genes had >75% amino acid identity to genes deposited in NCBI-nr, whereas 43% had <50% identity to any known protein in NCBI-nr, NCBI-env and CAZy [13]. This raises the question: if novel and combinatorial approaches are used, to what extent can the coverage of protein annotation be improved with acceptable precision? For enzymes, there is a well-established system, the Enzyme Commission (EC) number [14], which describes catalytic functions hierarchically using four digits. As far as we know, although many EC number prediction tools are available, most have been evaluated only on small datasets, and none has been used to systematically address the comprehensiveness of enzyme functional annotation in public protein databases. Thus, a more specific question, “To what extent can we improve, with acceptable precision, the coverage of enzyme annotations using EC numbers?”, is worth addressing by illustrating the power of approaches that go beyond BLAST. The insight obtained can also be generalized to protein annotation for other functional attributes.

Thus, novel approaches with high coverage rates that maintain acceptable precision are of special interest. Hierarchical, or top-down, algorithms with a layer-by-layer logic satisfy these requirements [15,16]. Such approaches assign functions only at the level that can be inferred with high confidence. Hence, in many cases, general rather than specific functions (for example, only the top level of an EC number) are assigned, avoiding the overprediction of protein functions, such as annotation below the trusted cutoff or inference from a superfamily alone, a main problem of current database annotations [17]. Furthermore, owing to their hierarchical structure, this approach suits widely accepted protein function definition systems, such as EC or Gene Ontology (GO), both of which are widely applied to describe the functions of gene products consistently [18].

Domains are conserved parts of a protein’s amino acid sequence and structure that can evolve, function and exist independently of the rest of the amino acid chain. Thus, it has been hypothesized that machine learning with domains as input labels might serve as a powerful approach to predict protein functions [19]. For example, the dcGO database associates SCOP domains or domain combinations with the GO terms of their protein products to infer which domains or domain combinations are responsible for particular GO terms [20]. A domain architecture–based approach might thus be a powerful tool for predicting enzymatic functions. Here, we report on “DomSign”, a top-down enzyme function (EC number) annotation pipeline based on machine learning over domain signatures. We must emphasize, based on the belief that any reliable protein function prediction tool should depend on multiple lines of evidence [21], that our purpose here is not merely to present another function prediction tool but rather to address the question of to what extent the coverage of enzyme annotations by EC numbers can be improved, with acceptable accuracy, by methods beyond simple BLAST.

To test the reliability of DomSign, we compared it with several benchmark enzyme annotation methods. After exhaustive testing against reliable datasets, such as Swiss-Prot enzymes, the performance of DomSign was comparable or superior to all of them, suggesting that DomSign is a highly reliable enzyme annotation tool that can identify more enzymes in the protein universe. Furthermore, to expand the number of enzymes retrieved from large datasets, we compared our predictions with the proteins already assigned EC numbers in the original datasets; DomSign predicted many additional ‘hidden enzymes’. Thus, DomSign, with the >90% accuracy suggested by these tests, can be used to predict a large number of enzymes by assigning EC numbers to proteins in both UniProt-TrEMBL [22] and the bacterial subsection of the Kyoto Encyclopedia of Genes and Genomes (KEGG) [23], which represent, respectively, the most complete protein database and the best collection of metabolic pathway information. DomSign can also be applied to metagenomic samples, as exemplified by the Human Microbiome Project (HMP) dataset [24], a comprehensive and well-analyzed metagenomic gene dataset focused on parsing the interactions between the commensal microorganisms of humans (the human microbiome) and human health. In this case, DomSign not only significantly increased the number of EC-labeled enzymes but also helped to clarify the metabolic capacity of the sample by recovering new EC numbers beyond the official annotation. These results highlight the need to develop enzyme EC number prediction projects or, more generally, protein annotation projects with novel approaches akin to DomSign to extract more biological information from the available sequencing data.

Methods

Definition of a domain signature

Pfam is a protein domain collection with ~80% coverage of the current protein universe [25], and its Pfam-A subsection is highly reliable owing to its manually curated seed alignments. For our purposes, the string of non-duplicated Pfam-A domains belonging to a protein was defined as its domain signature (DS) and used to predict its function(s). Although some research has suggested a potential advantage of incorporating domain recurrence and order into protein GO assignments [26], our results showed that this simpler DS definition provided higher coverage for proteins identified in metagenomics studies: when Swiss-Prot protein DSs were used to retrieve HMP phase I non-redundant proteins, the coverage was 74.7% when considering domain recurrence and order versus 77.1% with the simpler definition. Unlike the GO term assignment used previously [26], incorporating recurrence did not lead to a significant difference, as indicated by reconstructing the EC number machine-learning prediction model used in this work (Additional file 1), whose method is presented below. Thus, because the main aim of our study was to improve enzyme annotation coverage, the simpler DS definition was applied.
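For concreteness, the following minimal Python sketch (ours, not the published pipeline; the sorting step and the ‘|’ separator are assumptions used to normalize the signature) derives such DSs from per-protein Pfam-A hits:

```python
from collections import defaultdict

def domain_signatures(hits):
    """Collapse (protein_id, pfam_accession) pairs into one DS per protein,
    ignoring domain recurrence and order as in the simpler definition."""
    domains = defaultdict(set)
    for protein_id, pfam_acc in hits:
        domains[protein_id].add(pfam_acc)  # duplicates collapse automatically
    # sorting normalizes order so identical domain sets yield one DS string
    return {pid: "|".join(sorted(accs)) for pid, accs in domains.items()}

# toy example: a repeated PF00107 domain counts only once in the DS
hits = [("P1", "PF08240"), ("P1", "PF00107"), ("P1", "PF00107")]
print(domain_signatures(hits))  # {'P1': 'PF00107|PF08240'}
```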

Preparation of the dataset

Swiss-Prot and TrEMBL datasets were downloaded on November 2, 2013, from the Pfam FTP site (version 27.0), from which Pfam-A domains were extracted. The Pfam-A Hidden Markov Model file (PfamA.hmm) for hmmsearch (version 3.1b1) [27] was obtained from the same site. The HMP phase I non-redundant protein dataset (95% identity cutoff; 15,006,602 entries from 690 samples) [24] was collected from the HMP data processing center (http://www.hmpdacc.org/). A benchmark dataset for unbiased tests was taken from [15] (Supplementary Data 2). The gene ID and sequence files (FASTA format) from KEGG were downloaded on March 6, 2014. The EC2GO mapping file [28] was downloaded on June 20, 2014, from the GO homepage. All of these files were further processed as described below.

“Sprot enzyme” dataset

The Swiss-Prot dataset is a protein collection with exhaustive, manually curated, and thus reliable, functional annotations. In this context, it was a good choice as the training set for comparing prediction model performance by cross-validation. The subset of Swiss-Prot enzymes with both single EC numbers and Pfam-A domains was termed “sprot enzyme”; it encompasses 228,710 entries and 4,216 distinct DSs. This set was used to construct the “Specific enzyme domain signature” dataset described below and also as the training dataset for building the prediction model used for enzyme mining in several general protein databases (TrEMBL, KEGG and HMP).

“Sprot protein” dataset

Another subset of Swiss-Prot, containing all Pfam-A proteins with single or no EC numbers, was named “sprot protein”; it comprises 46.8% enzymes (with single EC numbers) and 53.2% non-enzymes (without EC numbers) and covers 99.0% of the Swiss-Prot proteins with Pfam-A domains. This dataset was used for model parameter optimization and for performance comparisons against the BLAST and FS models (see the description of the FS model below) [19].

“Specific enzyme domain signature” dataset

To identify enzymes from the protein pool, we further constructed a “Specific enzyme domain signature” dataset. The fundamental idea was to remove non-enzyme-derived DSs from the 4,216 distinct DSs belonging to “sprot enzyme”. Because EC numbers do not cover all enzymes, however, a more reliable non-enzyme dataset than simply all proteins without EC numbers had to be constructed. Briefly, for the proteins without EC numbers in Swiss-Prot, the raw annotation lines (‘KW’, ‘DR’ and ‘DE’) were filtered for terms implying catalytic activity or functional uncertainty (‘iron sulfur’, ‘uncharacterized’, ‘biosynthesis’, ‘ferredoxin’, ‘ase’, ‘enzyme’, ‘hypothetic’, ‘putative’ and ‘predicted’) to reliably extract non-enzymes. By this means, we collected 2,901 unique DSs from 157,240 non-enzymes carrying Pfam-A domains. After removing these DSs from the “sprot enzyme” DS set, 3,949 specific enzyme DSs remained, covering 95.4% of “sprot enzyme”. This dataset was used to select enzyme candidates from a protein pool in both the benchmark comparisons and the enzyme mining process.
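The keyword filter can be illustrated with a short sketch (the function name is hypothetical, and parsing of the Swiss-Prot flat files is elided here):

```python
# Terms whose presence in a protein's 'KW', 'DR' or 'DE' lines makes its
# non-enzyme status uncertain (the list given in the text).
UNCERTAINTY_TERMS = ("iron sulfur", "uncharacterized", "biosynthesis",
                     "ferredoxin", "ase", "enzyme", "hypothetic",
                     "putative", "predicted")

def is_reliable_non_enzyme(annotation_lines):
    """annotation_lines: the 'KW', 'DR' and 'DE' lines of one EC-less
    Swiss-Prot entry. Returns True only when no uncertainty term occurs."""
    text = " ".join(annotation_lines).lower()
    return not any(term in text for term in UNCERTAINTY_TERMS)

print(is_reliable_non_enzyme(["DE   RecName: Full=Histone H4;"]))  # True
print(is_reliable_non_enzyme(["DE   Putative oxidoreductase;"]))   # False
```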

“SVMHL unbiased” dataset

To compare the performance of our approach with the SVMHL pipeline (see the description of the SVMHL model below) [29], the aforementioned unbiased dataset was further processed to remove, as described in their paper, enzyme sub-subfamilies with fewer than 50 members.

“TrEMBL enzyme” and “HMP enzyme” datasets

The TrEMBL raw dataset was filtered to extract enzymes with single EC numbers and Pfam-A domains, producing “TrEMBL enzyme”. Likewise, “HMP enzyme” was constructed from the HMP non-redundant protein set; here, Pfam-A domains were retrieved by hmmsearch against PfamA.hmm using the trusted cutoff (--cut_tc), with all other parameters left at their defaults. These two datasets were used as gold standards to test the reliability of the DomSign-based enzyme EC number annotation prior to the actual enzyme mining of the TrEMBL and HMP original datasets. The statistics and usage of the datasets constructed in this work are presented in Additional file 2.
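A domain search of this kind might be wrapped as in the following sketch (file names are placeholders; --cut_tc and --domtblout are standard HMMER3 hmmsearch options):

```python
import subprocess

def run_hmmsearch(hmm_db, fasta, out_tbl):
    """Run HMMER3 hmmsearch with Pfam's per-family trusted cutoffs
    (--cut_tc), writing per-domain hits to a parseable table."""
    subprocess.run(
        ["hmmsearch", "--cut_tc",      # use each model's trusted cutoff
         "--domtblout", out_tbl,       # tabular per-domain output
         hmm_db, fasta],
        check=True)

# placeholder file names; PfamA.hmm is the model file named in the text
run_hmmsearch("PfamA.hmm", "hmp_proteins.fasta", "hmp_vs_pfamA.domtbl")
```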

Prediction model description

Our prediction model consists of two separate steps: enzyme differentiation from the protein pool, and EC number annotation based on machine learning. In the first step, proteins in query datasets are recognized as potential enzyme candidates if their DSs are among the aforementioned “Specific enzyme domain signature” set. In the second step, a top-down machine-learning model predicts EC numbers.

First, we converted the training dataset into a list in which every protein had one DS and one EC number (Figure 1(1)). Subsequently, the proteins were categorized based on their DSs, yielding a series of protein groups in which all members share the same DS; we define the number of member proteins in group $i$ as $N_{DS_i}$. The member proteins in each group were then further divided into subgroups based on their EC numbers, giving subgroups of size $N_{DS_i\text{-}EC_j}$, where $N_{DS_i} = \sum_j N_{DS_i\text{-}EC_j}$ (Figure 1(2)). Next, the abundance of every subgroup within its group was calculated as $A_{DS_i\text{-}EC_j} = N_{DS_i\text{-}EC_j} / N_{DS_i}$ (Figure 1(3)). In each group, there is at least one dominant subgroup with the highest abundance. The EC number of this subgroup is associated with the relevant DS, and its abundance is defined as the “specificity” of this DS-EC pair, which acts as the fundamental parameter of the machine-learning model (Figure 1(4)). We constructed four prediction models, one for each level of the complete EC hierarchy. For each model, at the first step (Figure 1(1)), only the corresponding fraction of the EC number was extracted; for instance, for the model focusing on the second EC hierarchy level, EC = x.x.-.- is extracted. All further steps were the same for all four models. Thus, this machine-learning approach makes it possible to annotate the EC hierarchy from general to specific, where the “specificity” of DS-EC pairs can be used to balance the tradeoff between recall and precision, depending on the particular purpose.
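The grouping and abundance computation for a single hierarchy level can be sketched as follows (our illustrative code, not the package in Additional file 3; variable names are hypothetical):

```python
from collections import Counter, defaultdict

def build_level_model(training, level):
    """training: iterable of (ds, full_ec) pairs, e.g. ('PF00107|PF08240',
    '1.1.1.1'); level: EC hierarchy depth 1-4. Returns {ds: (ec_prefix,
    specificity)} where specificity is the abundance A of the dominant
    EC subgroup within the DS group."""
    counts = defaultdict(Counter)
    for ds, ec in training:
        prefix = ".".join(ec.split(".")[:level])  # e.g. level 2 -> '1.1'
        counts[ds][prefix] += 1                   # N_{DS_i-EC_j}
    model = {}
    for ds, ec_counter in counts.items():
        n_ds = sum(ec_counter.values())           # N_{DS_i}
        ec, n = ec_counter.most_common(1)[0]      # dominant subgroup
        model[ds] = (ec, n / n_ds)                # (EC prefix, specificity)
    return model

train = [("DS1", "1.1.1.1"), ("DS1", "1.1.1.1"), ("DS1", "1.1.1.2")]
print(build_level_model(train, 4))  # {'DS1': ('1.1.1.1', 0.666...)}
```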

Figure 1

Construction of the machine-learning model to predict EC numbers. (1) Training dataset: the DS and EC number of every enzyme were extracted from an original dataset, such as Swiss-Prot. (2) These proteins were categorized into groups based on common DSs, and the groups were subsequently divided into subgroups based on the corresponding EC numbers. The number in each cell represents the number of proteins in each subgroup, and the total number of members of each group is given in the last row. The counts of the dominant subgroup within each group are colored red. (3) The abundance of each subgroup within its parent group (same DS) was calculated; the abundance of the dominant subgroup of each group is colored red. (4) Prediction model: every DS was associated with the dominant EC number within its protein group. The abundance of the dominant EC subgroup was extracted and set as the “specificity” of this DS-EC pair.

From model to prediction engine

First, the training dataset was used to construct four prediction models, one per EC hierarchy level, and the DSs of query proteins were computed by hmmsearch using the trusted cutoff (--cut_tc) with all other parameters left at their defaults. The specific enzyme DS dataset was then used to select potential enzyme candidates from the query proteins. The four prediction models were then applied one by one to annotate the EC digits, assigning the EC number associated with the query DS. In this process, a specificity threshold is applied to balance precision and recall: when the “specificity” of the DS-EC pair is less than the threshold, the procedure halts and only the previously annotated EC digits form the output (Figure 2). In this way, precision can be increased by making the specificity threshold stricter at the cost of recall, or vice versa. Additionally, although not statistically rigorous, the specificity of a particular DS-EC pair can be used as a confidence score for the reliability of each DomSign prediction. For example, if DomSign assigns an enzyme the EC number 1.1.1.- and the specificities of the DS-EC pairs at the first three hierarchical levels are 0.9, 0.88 and 0.85, these three values can simply be taken as the confidence scores for the first three EC digits, respectively. The script package for this tool is provided as Additional file 3.
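The level-by-level annotation loop might then look like this sketch (assuming the four level models built above; the halt-and-pad behavior follows Figure 2):

```python
def predict_ec(ds, level_models, threshold=0.80):
    """level_models: [model_level1, ..., model_level4] as built above.
    Annotates EC digits level by level; halts when the DS-EC pair's
    specificity drops below the threshold and pads the rest with '-'."""
    digits = []
    for model in level_models:
        if ds not in model:
            break
        ec_prefix, specificity = model[ds]
        if specificity < threshold:
            break                       # keep only previously accepted digits
        digits = ec_prefix.split(".")   # accept this level's annotation
    # None means no EC digit passed the threshold (no EC assigned)
    return ".".join(digits + ["-"] * (4 - len(digits))) if digits else None

# e.g. per-level specificities (0.9, 0.88, 0.85, 0.6) with threshold 0.8
# yield '1.1.1.-', matching the worked example in the text
```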

Figure 2

Schematic representation of the DomSign pipeline. The pipeline is divided into two parts: enzyme candidate selection and EC number annotation. In the first step, the specific enzyme DSs are utilized, and all proteins with DSs within this dataset are selected as potential enzyme candidates. Simultaneously, four annotation references for the EC digits at the four levels are constructed as described in Figure 1. At every level, if the “specificity” of the corresponding DS-EC pair in the annotation reference is less than the user-defined threshold, the pipeline halts and the previously annotated EC digits form the output. Otherwise, the pipeline continues until the fourth EC digit has been annotated. An example of the DomSign procedure for annotating a protein is shown here: because the specificity threshold exceeds the specificity of the DS-EC pair at the last level, only the first three EC digits are predicted, leading to the final result EC = 1.1.1.-.

Performance evaluation statistics

Owing to the top-down nature of our approach, we designed a new evaluation system to use instead of the widely used recall-precision curve [19]; it differentiates annotation results at different levels, providing better resolution. Briefly, the predicted EC number (PE) is compared with the true EC number (TE), and the result is classified into the following groups (Figure 3A, right): E, Equality: PE is the same as TE (“PE: EC = 1.2.3.4” vs. “TE: EC = 1.2.3.4”); OP, Overprediction: at least one EC digit in PE is incorrectly assigned compared with TE (“PE: EC = 1.2.1.1” vs. “TE: EC = 1.2.3.4”); IA, Insufficient Annotation: PE is correct but incomplete compared with TE (“PE: EC = 1.2.-.-” vs. “TE: EC = 1.2.3.4”); and IM, Improvement: TE is the parent family of PE (“PE: EC = 1.2.3.4” vs. “TE: EC = 1.2.3.-”). When TE is “Non-enzyme”, the comparison result is “Equality” if PE is also “Non-enzyme” and “Overprediction” otherwise. Additionally, if PE is “Non-enzyme” and TE is not, the comparison result is IA. IA deserves specific mention: although it denotes incomplete annotation, the annotation is correct and does not increase the error rate, so IA provides better annotation coverage while maintaining high precision. The evaluation metrics defined here differ from traditional ones [19]. However, compared with previous precision-recall curves, which treat the different EC hierarchy levels equally, this system covers all possible situations and gives an intuitive, higher-resolution view of performance at different annotation levels, making it especially suitable for evaluating annotation results against hierarchically structured metrics.
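These classification rules can be summarized in a short sketch (hypothetical function; EC strings use ‘-’ for unannotated digits and ‘Non-enzyme’ for proteins without an EC):

```python
def classify(pe, te):
    """Compare predicted (pe) and true (te) EC strings such as '1.2.-.-'.
    Returns one of the four attributes E, OP, IA or IM defined above."""
    if te == "Non-enzyme":
        return "E" if pe == "Non-enzyme" else "OP"
    if pe == "Non-enzyme":
        return "IA"
    for pd, td in zip(pe.split("."), te.split(".")):
        if pd == "-" and td != "-":
            return "IA"   # correct but incomplete
        if td == "-" and pd != "-":
            return "IM"   # prediction deeper than the recorded truth
        if pd != td:
            return "OP"   # at least one wrong digit
    return "E"

print(classify("1.2.-.-", "1.2.3.4"))  # IA
print(classify("1.2.1.1", "1.2.3.4"))  # OP
```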

Figure 3

DomSign performance comparison with the BLAST and FS models by 1,000-fold cross-validation of “sprot protein”. Three levels of 1,000-fold cross-validation were conducted for each method: homologs of a query above a given threshold (“identity ≤ 100%”, “identity ≤ 60%” and “identity ≤ 30%”, described in Methods) were removed from the reference dataset, so that each reference dataset kept only sequences below the given threshold. In this test, an 80% specificity threshold (DomSign), a 10^−3 E-value (BLASTP) and default parameters (FS model) were applied. The relative standard errors were not significant (<1%) and are therefore not shown. (A) Evaluation results for the three methods. As shown on the right, four attributes are defined to evaluate the annotation results against the “true EC number” (see Methods for details). (B) The EC hierarchy level distribution in the annotation results of the three methods. Seven attributes are defined here to describe the annotation results. Among them, “No best hit” is specific to BLASTP. “More than one EC” is specific to the FS model because this dataset encompasses only enzymes with single EC numbers or non-enzymes; this attribute is regarded as “OP” in Panel A. We merged the annotation results “Non-enzyme” and “EC = −.-.-.-” (as shown in Figure 2) into one unified group, “Non-enzyme”, because the latter has no EC number assigned and occupies only a small fraction of the annotation results (the “EC = −.-.-.-” subclass is only 1.4% of the “identity ≤ 100%” group for DomSign).

Performance test

Four benchmark methods, BLASTP (2.2.28+), FS [19], SVMHL [29] and EnzML [30], were selected to test the performance of DomSign.

Comparison with BLASTP and FS by cross-validation

For the FS model, the script package from Forslund et al. [19] was run on our system to calculate the GO terms derived from the DS as defined in their work. Subsequently, we used the EC2GO mapping file to convert the FS model’s predicted GO terms to EC numbers; if multiple EC numbers existed for one particular GO term, we assigned the protein all of the associated EC numbers. The three pipelines were tested by 1,000-fold cross-validation on the “sprot protein” dataset. Because this dataset contains only enzymes with single EC numbers or non-enzymes, if the FS model predicted more than one EC number for a query, the result was counted as “OP”. Furthermore, to simulate the situation in which no sequences in the database have high similarity to the query protein, two additional rounds of cross-validation against “sprot protein” were executed. Briefly, sequences in the training set with similarities above threshold I (60% identity, 80% query coverage) or threshold II (30% identity, 80% query coverage) to any query sequence were removed using BLASTP, so that no sequence in the training set was more similar to any query sequence than the defined threshold. The common cross-validation and these two additional rounds were termed “identity ≤ 100%”, “identity ≤ 60%” and “identity ≤ 30%”, respectively. For BLASTP, a 10^−3 E-value and default parameters were applied; for the FS model, default parameters were used.
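The homolog-removal step might be implemented along these lines (a sketch assuming BLASTP tabular output with the qseqid, sseqid, pident and qcovs columns; the exact invocation used in this work is not shown):

```python
def overly_similar(blast_tab, max_identity=30.0, min_coverage=80.0):
    """blast_tab: BLASTP results of queries vs. the training set in tabular
    format, e.g. -outfmt "6 qseqid sseqid pident qcovs" (our assumed
    invocation). Returns training-set IDs that exceed the threshold
    against any query and must therefore be dropped."""
    drop = set()
    with open(blast_tab) as fh:
        for line in fh:
            _qseqid, sseqid, pident, qcovs = line.split()[:4]
            if float(pident) > max_identity and float(qcovs) >= min_coverage:
                drop.add(sseqid)
    return drop

# training_set = {sid: seq for sid, seq in training_set.items()
#                 if sid not in overly_similar("query_vs_train.tab")}
```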

Comparison with the SVMHL model by cross-validation

Because the source code of SVMHL is not available, we compared DomSign with SVMHL using the same test as stated in [29], and the raw data from that work were used for the performance comparison. Briefly, a 10-fold cross-validation was conducted with DomSign on the “SVMHL unbiased dataset”, and prediction accuracy [29] was used to evaluate the results. Here, accuracy is defined as the percentage of completely correct annotations: a predicted EC number at a specific hierarchy level (e.g., an EC number consisting of three digits when the EC hierarchy level is three) is counted as ‘correct’ only when all of its component digits are correct. Because SVMHL does not have an enzyme/non-enzyme differentiation step, we included only the proteins predicted to be enzymes by DomSign in the results comparison, which covered 85.2% ± 0.4% of the query proteins on average during the cross-validation.
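This accuracy definition amounts to the following sketch (hypothetical helper):

```python
def level_accuracy(pairs, level):
    """pairs: (predicted_ec, true_ec) full EC strings. A prediction is
    'correct' at the given level only if its first `level` digits all match."""
    correct = sum(p.split(".")[:level] == t.split(".")[:level]
                  for p, t in pairs)
    return correct / len(pairs)

pairs = [("1.1.1.1", "1.1.1.2"), ("2.3.1.5", "2.3.1.5")]
print(level_accuracy(pairs, 3))  # 1.0 (both correct to three digits)
print(level_accuracy(pairs, 4))  # 0.5 (the first pair differs at digit four)
```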

Comparison with EnzML

Like the SVMHL model, EnzML could not be run on our system, so we compared the performance of EnzML and DomSign using the same test stated in [30], with the data published in that paper used as the benchmark. The “Swiss-Prot&KEGG” set and the less redundant “UniRef50 Swiss-Prot&KEGG” set were constructed according to the description in the EnzML paper [30], and a 10-fold cross-validation was conducted. Example-based precision and recall were applied for the performance evaluation. Briefly, these two metrics measure how many correct EC predictions are assigned to each individual protein example on average [31]. For each protein, the true (TE) and predicted (PE) EC number sets at every hierarchical level (EC = 1.1.-.- is decomposed to EC = 1, EC = 1.1, EC = 1.1.- and EC = 1.1.-.-) are extracted and compared with each other. The example-based precision and recall can then be defined by the two equations shown below:

$$ Precision = \frac{1}{m} \sum_{i=1}^{m} \frac{\left| TE_i \cap PE_i \right|}{\left| PE_i \right|} $$
$$ Recall = \frac{1}{m} \sum_{i=1}^{m} \frac{\left| TE_i \cap PE_i \right|}{\left| TE_i \right|} $$

Here, $m$ refers to the total number of proteins, and $TE_i$ and $PE_i$ refer to the sets of annotated EC numbers at the four hierarchical levels (or ‘Non-enzyme’) for each protein, with $|\cdot|$ denoting set size.
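Under the literal decomposition given above (dash-padded levels included, which is our reading of the description), these metrics can be computed as in this sketch:

```python
def decompose(ec):
    """'1.1.-.-' -> {'1', '1.1', '1.1.-', '1.1.-.-'}: one representation per
    hierarchical level, as described above; 'Non-enzyme' stays atomic."""
    if ec == "Non-enzyme":
        return {"Non-enzyme"}
    digits = ec.split(".")
    return {".".join(digits[:k]) for k in range(1, 5)}

def example_based(pairs):
    """pairs: (true_ec, predicted_ec) per protein; returns the mean
    per-protein precision and recall defined by the equations above."""
    m = len(pairs)
    prec = sum(len(decompose(t) & decompose(p)) / len(decompose(p))
               for t, p in pairs) / m
    rec = sum(len(decompose(t) & decompose(p)) / len(decompose(t))
              for t, p in pairs) / m
    return prec, rec

# a correct-but-incomplete prediction matches only its annotated levels
print(example_based([("1.1.1.1", "1.1.-.-")]))  # (0.5, 0.5)
```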

Enzyme predictions from large-scale datasets

“Sprot enzyme” was used as the training dataset, and the “Specific enzyme domain signature” set was used to select enzyme candidates. “TrEMBL enzyme” and “HMP enzyme”, combined with their original annotations, were used to evaluate the reliability of DomSign for expanding the enzyme space. All TrEMBL and HMP proteins were then annotated by DomSign to determine the extent of the enzyme expansion. Further, to show the significance of the enzyme expansion in KEGG, the novel enzymes predicted in TrEMBL were extracted for the 2,584 bacterial genomes in KEGG. Owing to subtle differences between the KEGG and TrEMBL annotations, a few ‘novel’ enzymes in TrEMBL already have EC numbers in KEGG; these were removed to obtain the exact number of novel enzymes in KEGG, and the relevant statistics were calculated.
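The bookkeeping for extracting truly novel KEGG enzymes reduces to set operations, as in this sketch (hypothetical inputs keyed by gene ID):

```python
def novel_kegg_enzymes(domsign_pred, trembl_ec, kegg_ec, kegg_bacterial_genes):
    """domsign_pred: {gene_id: predicted_ec}; trembl_ec / kegg_ec: gene IDs
    already EC-tagged in each database; kegg_bacterial_genes: gene IDs of
    the 2,584 KEGG bacterial genomes. A prediction counts as novel only if
    the gene lacks an EC number in *both* TrEMBL and KEGG."""
    return {g: ec for g, ec in domsign_pred.items()
            if g in kegg_bacterial_genes
            and g not in trembl_ec and g not in kegg_ec}
```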

Results

Optimization of the DomSign specificity threshold

We tested the reliability of DomSign as an EC number prediction tool. Because DomSign includes a “specificity threshold” parameter (Methods) to balance the tradeoff between precision and recall (Figure 2), three rounds of 1,000-fold cross-validation (the “identity ≤ 100%”, “identity ≤ 60%” and “identity ≤ 30%” cutoffs described in Methods) were performed on the “sprot protein” dataset using DomSign with 99%, 90%, 80% and 70% specificity thresholds to optimize this parameter (Additional file 4). Among the 99%, 90% and 80% specificity thresholds, 80% gave the best coverage (IA, E and IM) with only a slightly increased error rate (OP). However, further reducing the specificity threshold to 70% yielded a much smaller increase in coverage accompanied by a relatively severe OP ratio, especially for the “identity ≤ 30%” group, indicating that 80% might be the optimal specificity threshold for DomSign. We therefore applied this value in all further analyses.

Comparisons among DomSign, BLAST and FS models

BLAST was selected as a benchmark because of its wide application in research; we used the best BLAST hit to assign EC numbers. The FS model applies a similar DS definition, with no consideration of recurrence or order; however, it considers the contributions of every subset of a DS rather than treating the DS as an intact label. Briefly, this model uses Bayesian statistics to evaluate the probability of a particular GO annotation term being inferred from each subset of the DS; by averaging the contributions of all subsets, the probability that a protein carries this annotation term can be calculated. There are three reasons for the comparison with the FS model. First, it utilizes domain information to assign GO terms and can thus act as a good benchmark among domain architecture–based methods. Second, it yields reliable GO assignments even when UniRef50 is used for cross-validation, indicating stable performance under unbiased conditions. Finally, the FS model provides a very user-friendly package for command-line usage. Here, we converted GO terms to EC numbers using the EC2GO mapping file provided by the GO Consortium [28].

As in the previous section, to compare the performance of the DomSign, BLAST and FS models, especially when the database contains no sequences with high similarity to the query protein, three rounds of 1,000-fold cross-validation (“identity ≤ 100%”, “identity ≤ 60%” and “identity ≤ 30%”, as described in Methods) were conducted on the “sprot protein” dataset using DomSign with an 80% specificity threshold, BLASTP with a 10^−3 E-value and the FS model with default parameter settings. Performance tests under this scenario are important because BLAST itself annotates enzyme functions well (above 90% precision and recall in some situations) when homologs with similarities above a particular threshold are available [12]; there is thus limited room for further improvement in that regime, whereas there is ample need for improvement when homologs are unavailable. With the accumulation of novel sequences, this issue is expected to become more important, so the development of a new generation of computational approaches should pay more attention to the “homolog-unavailable scenario”. As shown in Figure 3, the machine learning–based methods, DomSign and the FS model, are much more robust than BLAST when homolog availability is reduced. With the significant increase in “No best hit” (Figure 3B), the coverage of BLAST decreases dramatically. Hence, in contrast to the nearly perfect performance of BLAST in the “identity ≤ 100%” group, DomSign achieved an overall performance superior to BLAST in the “identity ≤ 30%” case, producing a comparable OP ratio but much higher coverage. Meanwhile, the FS model tended to have a very high OP ratio in all three tests, partly because of multiple EC number predictions (Figure 3A) in this single-EC-enzyme plus non-enzyme dataset (Additional file 2) and partly because of incorrect EC assignments (each contributing roughly half of the high OP level in the FS model; Figure 3A, B). Therefore, DomSign has the potential to partly replace BLAST as a functional annotation tool for novel proteins that have no homologs in the database.

Comparison with SVMHL using an unbiased dataset

To further test the effectiveness of DomSign while avoiding potential bias towards abundant enzyme families [32], the “SVMHL unbiased dataset” was subjected to a 10-fold cross-validation; in this dataset, any two sequences share <50% identity, and the enzymes were manually selected to cover most enzyme families without bias. The SVMHL model [29] served as the benchmark; it annotates the EC hierarchy by combining two main components, namely the abundance of every possible tripeptide within a polypeptide [33] and a protein structure–based enzymatic function prediction model. The annotation accuracy of DomSign and SVMHL at the second and third EC hierarchy levels is shown in Additional file 5. Although the accuracy of the SVMHL model at the second hierarchy level was slightly greater than that of DomSign, at the third hierarchy level DomSign outperformed SVMHL for most enzyme families. Because Wang et al. [29] did not present results at the fourth level, only the DomSign results at this level are shown (Additional file 5). Based on this comparison, DomSign works well in the unbiased situation compared with the benchmark method.

Comparison with EnzML

The EnzML model is a multi-label classification method that uses Binary Relevance k-Nearest Neighbors (BR-kNN) to predict EC numbers [30]. Briefly, this model uses a more general protein signature set, InterPro [34], rather than Pfam as the input label. The binary relevance methodology can be intuitively considered as a combination of multiple binary classifiers, one per label (‘yes’ or ‘no’ for one particular EC digit), and the k parameter, the number of neighbors considered during prediction, was optimized to 1. Notably, Mulan [35], an open-source software library for multi-label learning, was used for evaluation and prediction in that work. This model is presently the best available benchmark, having been shown to be superior to other widely used tools such as ModEnzA [36] and EFICAz2 [37]. The “Swiss-Prot&KEGG” and less redundant “UniRef50 Swiss-Prot&KEGG” [30] datasets were used for the 10-fold cross-validation (Figure 4A, B). Although the differences were not significant, we observed that EnzML performed better than DomSign in terms of example-based precision and recall. To clarify the source of these differences, we excluded from our evaluation the real enzymes that were incorrectly predicted as non-enzymes by DomSign (Figure 4C, D); thereafter, DomSign’s performance became comparable to that of EnzML. Hence, the main loss of precision and recall in DomSign stems from its overly strict differentiation of enzyme candidates from the protein pool: more enzymes are mistakenly categorized into the non-enzyme group, reducing coverage. Although this problem decreases the “example-based precision” defined here, it does not cause errors such as predicting a wrong EC number or mistakenly identifying a non-enzyme as an enzyme. Considering that the EnzML model is difficult to implement, we posit that DomSign is the more facile choice for expanding the enzyme space of a large-scale dataset, as discussed in the next section.

Figure 4

Comparison between DomSign and EnzML using the Swiss-Prot&KEGG and UniRef50-extracted Swiss-Prot&KEGG datasets. The bar plots show the values calculated for DomSign (white) and EnzML (gray). In contrast to panels (A) and (B), enzymes incorrectly annotated as non-enzymes by DomSign are excluded from the evaluation in panels (C) and (D). “Coverage” in panels (C) and (D) is the percentage of proteins remaining after removal of real enzymes that were incorrectly predicted to be non-enzymes. ‘Example-based precision’ and ‘example-based recall’ are used to evaluate the results, as stated in Methods.

Enzyme prediction in UniProt-TrEMBL and KEGG

Having demonstrated the reliability of DomSign, we annotated the whole protein space to determine whether we could improve the coverage of enzyme predictions with EC numbers. UniProt-TrEMBL was used in this scenario owing to its exhaustive coverage of the known protein universe.

To test the precision of this enzyme prediction model, we ran the DomSign annotation against the “TrEMBL enzyme” set, which contains the TrEMBL enzymes with single EC numbers (Additional file 6). DomSign with an 80% specificity threshold yielded a 6.6% OP ratio while assigning EC numbers to ~90% of the enzymes. This OP ratio, higher than in the previous validations, may be due to the greater degree of error in the TrEMBL annotation [17]. This result, combined with the performance tests, demonstrated that the enzyme space expansion described below is highly reliable.

Thus, we extended our data mining to predict enzymes with EC numbers from all TrEMBL proteins; the annotation results are presented in Additional file 7. Approximately 3.9 million proteins lacking an EC number could be annotated with one, and the majority of these belong to the three- or four-EC-digit groups (Figure 5A). Even with a specificity threshold of 99%, the number of predicted novel enzymes was still around 3.6 million (Additional file 8), further indicating the reliability of this method. By this means, we raised the EC-tagged enzyme ratio in TrEMBL from the original 12% to ~30% (Figure 5A) with high precision. To further illustrate the significance of this EC resource expansion, the increase in the EC-tagged enzyme ratio for every bacterial genome in KEGG was calculated and is presented in Figure 5B (see Additional file 9 for detailed bacterial EC number annotations in KEGG). Remarkably, we raised the average EC-tagged enzyme ratio of the 2,584 bacterial genomes in KEGG from 26.0% to 33.2%, implying that the DomSign enzyme prediction method can provide deeper insight into the metabolism of many sequenced but insufficiently characterized organisms. Taken together, the DomSign enzyme predictions in TrEMBL and KEGG increased the number of EC-labeled enzymes with high precision and confirmed the hypothesized gap between the real enzyme space and the current functional annotation.

Figure 5

Expansion of enzyme space in UniProt-TrEMBL and KEGG by DomSign (specificity threshold = 80%). (A) Expansion of enzyme space in UniProt-TrEMBL. The circles illustrate the distribution of three kinds of proteins in the TrEMBL database. Blue: enzymes already tagged with EC numbers in TrEMBL; red: novel enzymes exclusively predicted by DomSign; light orange: other proteins without EC numbers. The column on the right represents the ratio of EC hierarchy levels among the novel enzymes predicted by DomSign. Straight line: predicted enzymes annotated as EC = x.-.-.-; blank: annotated as EC = x.x.-.-; dot: annotated as EC = x.x.x.-; slash: annotated as EC = x.x.x.x. (B) Expansion of enzyme space in KEGG. Each blue dot represents the original enzyme ratio for one particular bacterial genome in KEGG; each red dot represents the total enzyme ratio for that genome after DomSign annotation. In total, 2,584 bacterial genomes were tested.

Enzyme predictions in metagenomic samples

Although millions of proteins have been discovered by the biological community, our knowledge of the protein world is still far from complete, and new metagenomic data provide new resources to explore [13]. Thus, we chose the HMP dataset as a test set for expanding the enzyme space of proteins identified in metagenomic datasets using DomSign. Additionally, the combinational annotation pipeline used in HMP, based on BLAST, TIGRFAM and Pfam-A [24], was expected to be a good benchmark against which to compare DomSign for the functional annotation of metagenomic sequences.

As with TrEMBL, we first applied the DomSign enzyme prediction to the “HMP enzyme” set to assess DomSign’s ability to predict enzymes. Compared with the previous tests, a much higher OP ratio (9.2%) was observed for DomSign with an 80% specificity threshold (Additional file 10). Although we could not directly evaluate the reliability of the HMP annotations in this analysis, the quality of the automatic HMP annotations is probably not as high as that of a manually curated set like Swiss-Prot, similar to the high error rates in other automatically annotated protein datasets such as TrEMBL [17]. Thus, HMP annotation errors partly explain this abnormally high OP ratio, which is strongly supported by the fact that the OP ratio still reached 5.4% for DomSign with a 99% specificity threshold. These results support the conclusion that the reliability of the DomSign-based enzyme space expansion in the HMP metagenomic dataset is acceptable.

DomSign recovered many more enzymes from this metagenomic dataset (Figure 6 and Additional file 11). Approximately one million new enzymes could be annotated with EC numbers exclusively by DomSign (around 7% of the proteins in the HMP set; Additional file 12), and 84% of them carry at least three EC digits. DomSign and the HMP pipeline also appear to be highly complementary, because half of their identified enzymes do not overlap. This is probably owing to the low Pfam-A coverage (45.7%) of HMP proteins and the appearance of many novel DSs in metagenomic sequences. This complementarity also suggests that DomSign can detect many different catalytic functions and may thus provide further insight into the metabolic capacity of the human microbiome. To test this hypothesis, we compared the unique four-digit EC numbers retrieved by the two approaches, using the results for DomSign with a 99% specificity threshold to increase the reliability of the EC number assignments. As an example, 81 novel EC numbers exclusively detected by DomSign were discovered in the human gut microbiome (stool samples; Additional file 13), a potentially significant biological finding. These EC numbers may reflect important components that complement the known metabolism of the human microbiome.

Figure 6

Expansion of enzyme space in HMP non-redundant proteins by DomSign (specificity threshold = 80%). The circles illustrate the distribution of four kinds of proteins in the HMP non-redundant dataset. Red: enzymes with EC numbers annotated exclusively by HMP; blue: novel enzymes exclusively predicted by DomSign; green: enzymes identified by both HMP and DomSign; purple: all remaining proteins. The column on the right represents the ratio of EC hierarchy levels among the novel enzymes predicted by DomSign, as described for Figure 5.

Discussion

Limitations of DomSign

In this preliminary trial, our method performed well under diverse conditions, including having only distantly related sequences in the reference database (“sprot protein”, identity ≤ 30%) and a query set without bias towards rich enzyme families (“SVMHL unbiased dataset”), indicating its potential to predict enzyme EC numbers in large-scale datasets. However, the precision and recall of this method are still not perfect.

First, even DomSign with a 99% specificity threshold yields a 3.6% OP ratio in the “identity ≤ 30%” 1,000-fold cross-validation. This is mainly because the domain architecture cannot fully encode enzymatic activity, especially substrate specificity [38,39]. Substrate specificity determination is complex [40], especially for superfamilies with diverse catalytic functions [41], and much effort has been devoted to this task using pioneering methods such as determining key functional residues in enzymes [42], key-residue 3D templates [43] and de novo substrate docking [44]. Future work will likely include the integration of these methodologies into our pipeline to predict the substrate specificity–determining fourth EC digit more precisely. With the development of DS databases, we can further increase the resolution of our method by incorporating more unique protein signatures, such as those from InterPro [34]. By this means, further performance gains can be expected without changing the basic workflow of our method.

The comparison with SVMHL revealed variability in EC number prediction performance among different enzyme families. This corroborates a previous report that the worst results were obtained for oxidoreductases, as we also observed with DomSign [30]. A possible solution is a combinational approach, because different methodologies have different strengths for annotating specific enzyme families; SVMHL, for example, captures the sequence-function relationship of oxidoreductases quite well using tripeptide abundance and structure [29]. Finally, as suggested by the comparison with EnzML, DomSign tends to have a high IA rate because it incorrectly predicts enzymes as non-enzymes. Considering that DomSign uses a strict “yes or no” rule to separate enzymes from non-enzymes in the first step of the pipeline, it could be improved by applying a probabilistic approach, such as the “specificity” used in the later EC number prediction steps of DomSign.

Perspective expansion of enzyme space

To our knowledge, our present study represents the first systematic attempt to determine the extent to which the coverage of enzyme annotation by EC numbers can be improved, with acceptable precision, by methods beyond simple BLAST. By trying to close the gap between the EC-tagged enzymes in current databases and the real number of enzymes working in organisms, we showed that the quantity of EC-tagged enzymes can be significantly increased with high precision using relatively simple but reliable tools, such as DomSign, whether the sample is genomic or metagenomic. A series of assessments was performed to test the ability of DomSign to expand the enzyme space in large-scale protein datasets, including performance comparisons with other benchmark enzyme annotation methods (Figures 3 and 4; Additional file 5) and prediction and result comparisons using large-scale protein sets whose members had already been assigned EC numbers, such as TrEMBL (Additional file 6) and HMP (Additional file 10). Under all conditions, the precision rate was >90% and the recall was quite remarkable.

The results of the first large-scale critical assessment of protein function annotation (CAFA) were recently published [7]. One of its main conclusions was that many advanced methods for protein function annotation are superior to first-generation methods, such as BLAST; most of the top-ranked methods in CAFA used machine learning–based computational approaches. As noted by Furnham et al. [10], however, first-generation annotation methods are still used in most research. For instance, in a previous version of SEED, an intensively used comparative genomics environment, homology-based functional transfer was the main annotation method; this is also true for UniProt. In recent releases, UniProt has incorporated the HAMAP system [45], and SEED complements its annotation strategy with a k-mer-based subsystem and FIGfam recognition approach [46]; still, these approaches depend on sequence similarity–based function transfer, such as functionally homologous family profiles. The situation is essentially the same for benchmark metagenomic projects such as HMP [24,47]. With the development of metagenomics, many more sequences will be derived from environmental samples and will be novel with respect to current databases; in such cases, as shown in our work and that of many others [11,13], similarity-based function transfer will struggle to achieve the desired performance.

As our work demonstrates, there is still a need to improve the ability to predict more enzymes using in silico methods. Only 12% of the proteins in UniProt have EC numbers; in the HMP phase I 95% non-redundant set, this value is 13% (Figure 6). Both values are far below the average enzyme ratio of 30% observed in nine intensively studied organisms [48]. We believe that a richer annotated sequence resource will result once this gap is closed using hierarchical, top-down machine-learning methods. This will allow researchers not only to study many important biological questions, such as orphan enzyme gene identification [49] and metabolic network reconstruction [50], but also to improve strategies used in biotechnology, including secondary metabolism gene cluster identification [51], artificial biosynthetic pathway design [52], novel enzyme mining [53] and metabolic engineering [54].

Conclusions

In this work, we developed a novel enzyme EC number prediction tool, DomSign, which is superior to conventional BLAST in the homolog-unavailable scenario. In addition, other novel and outstanding enzyme functional annotation tools were selected as benchmarks and compared against DomSign, confirming DomSign’s superior or competitive ability in enzyme functional annotation. The DomSign method requires only amino acid sequences, without the need for existing annotations or structures. Based on the test results, the performance of DomSign could be further improved by incorporating more exhaustive protein signatures, such as substrate specificity–determining residues, and by revising the pipeline to select enzyme candidates using a probabilistic approach.

Using DomSign, we asked whether a large number of ‘hidden enzymes’ without EC number annotations exist in current protein databases, such as TrEMBL and KEGG, and in metagenomic sets such as HMP. Our results preliminarily confirmed this hypothesis by significantly improving the ratio of EC-tagged enzymes in these databases. The identification and annotation of these enzymes should significantly deepen our understanding of the metabolism of diverse organisms and consortia, and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the need to involve tools more advanced than BLAST in protein database annotation, thereby extracting more biological information from the available biological sequences.