Background

Owing to extensive efforts over the past decades, the genome sequences of many species have been completed. However, researchers have realized that genome sequence information is of limited use without accompanying functional information. For this reason, assigning functions to unknown proteins has become one of the most important topics in functional genomics. Although experimental methods have identified a number of protein functions with high accuracy, these approaches require extensive resources in terms of both time and labour. Therefore, computational methods are needed to overcome the limitations of these experiments. To this end, computational approaches were first proposed that use protein sequences to identify homologous proteins and then infer specific protein functions [1, 2]. Determining protein structures has also revealed interactions between proteins, which subsequently enabled the prediction of protein functions [3]; see the review [4] for computational methods that predict protein functions based on protein structures and protein-protein interactions. In addition, genomic data sets such as protein domain data, protein-protein interactions, gene expression data, phylogenetic profiles, phenotype ontology, disease data, and protein complex data have been used extensively to predict protein functions.

Many researchers have also attempted to develop integration models based on various machine learning algorithms and statistical approaches [5-19]. For example, Deng et al. [6, 7] proposed a Markov random field method for predicting protein function by integrating protein-protein interaction and protein domain data sets, kernel-based integrative models were introduced in [8, 9], and heterogeneous data sources were integrated in a Bayesian framework for function prediction [10, 11]. It should be noted that these methods have outperformed previous approaches that used only one data source [20, 21], and that their coverage has also been extended because different data sources can compensate for the lack of information in others. However, Lu et al. and other groups [20, 22] showed that integrating all available data sources is not always the most effective way to increase prediction quality. For a given protein function, some data sources contribute greatly to prediction performance while others rarely do. Therefore, selecting the most relevant set of data sources, rather than attempting to use all available data sources, is very important for precisely predicting protein function.

Feature selection methods have also been applied to many other computational biology problems. For example, to predict protein-protein interactions using protein sequence order information and protein properties, Liu et al. [23] selected important features by combining filter-based and wrapper-based feature selection methods. Feature selection approaches such as correlation-based feature subset selection (CFSS) have also been applied to protein structure prediction and protein folding rate prediction to reduce the dimensionality of the protein sequences [24, 25]. All of these studies have demonstrated the importance of feature selection. In this paper, we further posit that selecting different genomic data sets depending on the GO term is important for accurate protein function prediction.

The majority of protein function prediction methods have focused on non-mammalian model organisms, and it has remained unclear how accurately functions can be predicted for mammalian organisms. To investigate this problem, nine bioinformatics teams carried out an evaluation of diverse computational approaches for predicting protein function in Mus musculus based on integrated data sources [26]. They confirmed that computational approaches integrating various genomic data sets are quite promising for protein function prediction in mammals.

In this study, we investigate the relationship between genomic data sources and protein functions in order to select a group of highly contributing data sources for function prediction in Mus musculus and to increase the prediction quality. Here, Gene Ontology (GO) terms are used as the protein function labels, and the ten genomic data sources in Mus musculus cover six data types: protein domains, protein-protein interactions, gene expression, phenotype ontology, phylogenetic profiles, and diseases. We then apply two feature selection approaches to improve the prediction quality. In the first approach, the contribution of each data source is measured by its prediction accuracy in terms of the AUC (the area under the curve of sensitivity versus false positive rate) using a kernel-based logistic regression (KLR) method [5]. We repeat the prediction of 1,726 Gene Ontology Biological Process (GO-BP) terms ten times, where each prediction is performed using a different data source, and then select the data source that provides the highest prediction quality for each GO term; we refer to this method as the exhaustive search feature selection approach. As a systematic feature selection method, we next introduce a kernel-based L1-norm regularized logistic regression (KL1LR) method. In this method, the coefficients of non-contributing features shrink to zero, so we can evaluate the importance of each data source for each protein function. KL1LR outperforms the exhaustive search approach, especially for specific GO terms covered by a small number of proteins. We subsequently analyze the contributing data sources obtained using these two approaches. In general, the protein domain data set is an important data source for a large number of GO functions, though the protein-protein interaction and phenotype data sets are also important features for many GO terms. We also investigate whether there is agreement between parents and offspring in the GO hierarchy with respect to the most contributing data sets, and find many GO terms for which most offspring of the given GO term are predicted accurately using the same data set.

Finally, we apply a filter-based feature selection method, Relief [27], and compare its performance with that of the approaches proposed in this paper. The proposed methods are found to outperform the Relief method. In addition, consistent results are observed when these methods are applied to protein function prediction in yeast.

The remainder of this paper is organized as follows. In the Methods section, we first describe the various data sources and define the kernels used to measure similarities between two proteins. We then introduce two different feature selection approaches before describing an enrichment test for detecting the contributing genomic data sets for each GO function. In the Results section, we compare the prediction quality of the proposed methods and then show the results of grouping GO terms according to the contribution of each data source. Finally, we summarize the contribution of our approach and discuss its limitations.

Methods

Data sources

Throughout this study, ten data sources across six different genomic data types of Mus musculus were used: three gene expression data sets, two protein annotation data sets, one protein-protein interaction data set, one phenotype ontology data set, two phylogenetic profile data sets, and one disease association data set. All data sources were collected in a previous study [26].

1. Gene expression

Three different gene expression data sources were used, and all of them contain only data with probes/tags mapping to Mouse Genome Informatics (MGI) genes. The mapping from probes/tags to MGI genes can be either one-to-one or many-to-one. The Zhang et al. [28] expression data source includes normalized, median-subtracted, arcsinh intensity values for 13,566 genes across 55 mouse tissues. The Su et al. [29] expression data source includes normalized, gcRMA-condensed values from Affymetrix arrays for 18,208 genes across 61 mouse tissues. In the Zhang et al. and Su et al. expression data, multiple rows for the same gene were averaged. The SAGE data source contains tag counts (99% quality cut) from 139 SAGE libraries for 16,726 genes [30]; in the SAGE data source, we used the average tag count for tags mapped to the same gene.

2. Protein annotations

The Pfam protein-domain [31] and Interpro protein-domain [32] data sets were used. The Pfam data set contains 15,569 genes with 3,133 Pfam-A protein families (release 19), and the Interpro data source contains 16,965 genes with 5,404 sequence patterns (release 12.1) [32]. Both data sources are represented as binary annotation patterns.

3. Protein-protein interactions

The OPHID protein-protein interaction (PPI) data set obtained by orthology (provided by MGI) was used [33]. It contains interactions among 7,125 genes (downloaded in April 2006).

4. Phenotype ontology

The phenotype annotation data set containing 3,439 genes with 33 phenotypes was used [34]. This data source was obtained from MGI and is represented as binary annotation patterns (downloaded in Feb 2006 from [35]).

5. Phylogenetic profile

Conservation patterns indicating whether 15,939 genes have putative orthologues in 18 different species (from yeast to human) were used [36]; this data source was provided by BioMart (v38). The Inparanoid phylogenetic profile data set (v4.0) contains conservation patterns across 21 different species for 15,703 genes [37]. Both are represented as binary conservation patterns.

6. Disease associations

Disease association data from the Online Mendelian Inheritance in Man (OMIM) database for 1,938 genes with 2,488 diseases were used [38, 39]. This data source contains binary annotation patterns (downloaded in Jun 2006 from [40]).

To associate these ten biological data sources with protein functional annotations, Pena-Castillo et al. [26] mapped the gene identifiers of each data source to MGI gene identifiers (obtained from MGI in Feb 2006). They used only annotations that were not inferred from electronic annotation (IEA), because most IEA annotations are computational predictions that have not been reviewed by curators. From a biological point of view, specific GO terms are interesting, but if GO terms are too specific there are a number of inherent difficulties in making accurate predictions due to the limited number of positive training examples; therefore, GO-BP terms with {3-300} genes annotated in the database were used. In this study, we only used GO terms from the biological process category, so the final data collection (with GO annotations obtained in Feb 2006) encompassed 21,135 MGI genes, of which 7,557 were associated with at least one of the 1,726 GO terms that we investigated.

In addition, we collected MGI gene identifiers in Aug 2009 and found 2,808 Mus musculus genes newly annotated to 1,051 GO terms since Feb 2006. We predicted these newly annotated genes using the proposed methods to evaluate their prediction quality. Then, to evaluate the performance of our approaches in another model organism, we also used eight genomic data sets and GO terms for yeast obtained from [22]. These data sets consist of a protein-protein interaction data set, four gene expression data sets, a protein-domain annotation data set, a gene knock-out phenotype data set, and a protein localization data set. The GO terms are from the Jun 2006 version of the yeast SGD database; among them, we used 1,246 GO terms covered by {3-300} genes.

Definition of kernel

We measure the similarity between two genes based on a kernel approach, as we did previously in [5]. In the case of the Pfam and Interpro domain data, we define v_{ik} = 1 if the i-th protein has the k-th domain, and 0 otherwise. Then, the kernel between the i-th and j-th proteins is defined as

K_{\mathrm{domain}}(i, j) = \frac{1}{n} \sum_{k=1}^{n} v_{ik}\, v_{jk},     (1)

where n is the number of domains. The kernel is defined similarly for the other data sets in which a protein is represented by discrete values, such as the phenotype ontology, phylogenetic profile, and disease data sets. On the other hand, kernels for the gene expression data sets represented by continuous values, namely the Zhang et al., Su et al., and SAGE gene expression data sources, are defined as K_expression(i, j) = PCC(i, j), the Pearson correlation coefficient between proteins i and j; these values are then discretized to 0 and 1 using a threshold of 0.3.
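To make these definitions concrete, the following R sketch (not the authors' code; the matrix names are hypothetical) computes a kernel matrix for a binary annotation data source, assuming the normalized shared-annotation form reconstructed in Eq. (1), and for a continuous expression data source using the Pearson correlation with the 0.3 threshold described above.

```r
# Sketch: kernels for binary annotation data and continuous expression data.
# 'annot' is a hypothetical proteins x domains binary matrix; 'expr' is a
# hypothetical proteins x tissues expression matrix.

kernel_binary <- function(annot) {
  n <- ncol(annot)                 # number of domains (features)
  tcrossprod(annot) / n            # K(i, j) = (1/n) * sum_k v_ik * v_jk
}

kernel_expression <- function(expr, threshold = 0.3) {
  pcc <- cor(t(expr))              # Pearson correlation between proteins (rows)
  (pcc >= threshold) * 1           # discretize to 0/1 at the 0.3 threshold
}
```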

For the protein-protein interaction data set, the following diffusion kernel matrix [5] is used:

K_{\mathrm{PPI}} = \exp(\tau H),     (2)

where

H_{ij} = \begin{cases} 1 & \text{if proteins } i \text{ and } j \text{ interact,} \\ -d_i & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases}

where d_i is the degree of the i-th protein in the protein interaction network, τ is the diffusion constant, and the diffusion kernel matrix calculates the similarity between all pairs of proteins in the protein interaction network. Here, four different values (0.1, 0.5, 1, and 3) were tested as diffusion constants, and 0.1 was subsequently selected as the optimal diffusion constant (i.e., the diffusion constant 0.1 gives the best prediction accuracy). Note that in the computed diffusion matrix, values less than the cut-off of 0.001 were considered as 0.
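As an illustration only, a minimal R sketch of this computation, assuming the standard matrix-exponential form of Eq. (2) and using the expm package (the authors' actual implementation may differ):

```r
# Sketch of the diffusion kernel for the PPI network (K = exp(tau * H)).
library(expm)   # provides expm() for the matrix exponential

diffusion_kernel <- function(adj, tau = 0.1, cutoff = 0.001) {
  # adj: symmetric 0/1 adjacency matrix of the protein interaction network
  H <- adj
  diag(H) <- -rowSums(adj)        # H_ij = 1 if i~j, -d_i on the diagonal
  K <- expm(tau * H)              # diffusion kernel matrix
  K[K < cutoff] <- 0              # values below the 0.001 cut-off treated as 0
  K
}
```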

Standardization of data source

Before the data sources could be integrated, we first standardized each data source, since the data sources were each presented on a different scale. The standardized data were defined as

x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},     (3)

where

\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}, \qquad \sigma_j = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( x_{ij} - \mu_j \right)^2}, \qquad j = 1, \ldots, n,

where m is the number of proteins and n is the number of features. As a preprocessing step, the standardization results in a zero mean and unit variance for each data feature.
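In R, this column-wise standardization can be performed with the built-in scale() function; a one-line sketch with a hypothetical feature matrix:

```r
# Sketch: column-wise standardization to zero mean and unit variance (Eq. 3);
# 'feature_matrix' is a hypothetical m x n proteins-by-features matrix.
# Note that scale() uses the sample SD (denominator m - 1) rather than m.
standardized <- scale(feature_matrix, center = TRUE, scale = TRUE)
```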

Kernel-based logistic regression prediction model

In order to integrate the ten genomic data sources using feature selection, we propose two approaches based on the kernel-based logistic regression (KLR) model introduced in [5]. In brief, X_i = 1 if the i-th protein has the function, and X_i = 0 otherwise. Let us first explain the case in which a single data set is used. For a given function of interest, a single data set generates two features: 1) the average of similarities between protein i and the proteins having the function, and 2) the average of similarities between protein i and the proteins not having the function. Let the corresponding coefficients be θ = {δ, ρ}. Here, the similarity between two proteins i and j is calculated using the kernel K(i, j). We then use the following logistic regression model:

\log \frac{\Pr(X_i = 1 \mid X_{[-i]})}{\Pr(X_i = 0 \mid X_{[-i]})} = \delta\, S_1(i) + \rho\, S_0(i),     (4)

where X_{[-i]} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_N), N is the number of proteins, and

S_1(i) = \frac{1}{|\{j \neq i : X_j = 1\}|} \sum_{j \neq i,\, X_j = 1} K(i, j), \qquad S_0(i) = \frac{1}{|\{j \neq i : X_j = 0\}|} \sum_{j \neq i,\, X_j = 0} K(i, j).

This model can be extended for multiple data sets, such that

\log \frac{\Pr(X_i = 1 \mid X_{[-i]})}{\Pr(X_i = 0 \mid X_{[-i]})} = \sum_{d=1}^{D} \left( \delta_d\, S_1^{(d)}(i) + \rho_d\, S_0^{(d)}(i) \right),     (5)

where D is the number of data sets, and S_1^{(d)}(i) and S_0^{(d)}(i) can be obtained as in Eq. (4) for the d-th data set. In this study, we used the glm function (stats package in R) to implement the KLR method.
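A minimal R sketch of this model (hypothetical function and variable names, not the authors' code): the two average-similarity features of Eq. (4) are built from a kernel matrix K and a 0/1 label vector X, and the model is fitted with glm() as described above. For Eq. (5), the feature pairs from each of the D data sets would simply be concatenated into one data frame before calling glm().

```r
# Sketch: kernel-based logistic regression (KLR) for a single data source.

klr_features <- function(K, X) {
  n <- length(X)
  s1 <- s0 <- numeric(n)
  for (i in seq_len(n)) {
    pos <- setdiff(which(X == 1), i)   # proteins with the function, excluding i
    neg <- setdiff(which(X == 0), i)   # proteins without the function, excluding i
    s1[i] <- mean(K[i, pos])           # average similarity to positives
    s0[i] <- mean(K[i, neg])           # average similarity to negatives
  }
  data.frame(S1 = s1, S0 = s0)
}

fit_klr <- function(K, X) {
  dat <- cbind(klr_features(K, X), X = X)
  glm(X ~ S1 + S0, data = dat, family = binomial)   # logistic regression on the two features
}
```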

Feature selection approaches

In order to investigate the factors that contribute most to protein function prediction and to improve the prediction quality, we introduce two feature selection approaches. Both approaches consider one protein function at a time.

1. Exhaustive search

For each data set, we estimate the probability of each protein having the function of interest using Eq. (4). We repeat this process for each of the ten data sets, thus obtaining the prediction accuracy of all ten data sets for each GO function.
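As a sketch of this procedure (again with hypothetical names), the loop below fits the single-source KLR model for each kernel in a list and records a rank-based AUC; fit_klr() refers to the sketch in the previous subsection.

```r
# Sketch of the exhaustive search: one KLR fit per data source, keep the best.

auc_score <- function(score, label) {            # rank-based (Mann-Whitney) AUC
  r <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

exhaustive_search <- function(kernels, X) {
  # kernels: named list of protein-by-protein kernel matrices; X: 0/1 labels
  aucs <- sapply(kernels, function(K) {
    fit <- fit_klr(K, X)
    auc_score(fitted(fit), X)                    # AUC of this data source
  })
  list(aucs = aucs, best = names(which.max(aucs)))
}
```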

2. Kernel based L1-norm regularized logistic regression (KL1LR)

The concept of L1-norm regularization for feature selection has been widely used in a number of contexts because it readily generates an explainable model of the significant features by shrinking some coefficients to zero [41]. For this regularization, let the set of parameters be θ = {α_1, β_1, ..., α_D, β_D} and rewrite Eq. (5) as follows:

\Pr(X_i = 1 \mid X_{[-i]}; \theta) = \frac{1}{1 + \exp\!\left( -\sum_{d=1}^{D} \left( \alpha_d\, S_1^{(d)}(i) + \beta_d\, S_0^{(d)}(i) \right) \right)}.     (6)

Note that the log likelihood function for the observed data i = 1, ..., N, where N is the number of observed samples, is given as

l(\theta) = \sum_{i=1}^{N} \left[ X_i \log \Pr(X_i = 1 \mid X_{[-i]}; \theta) + (1 - X_i) \log\!\left( 1 - \Pr(X_i = 1 \mid X_{[-i]}; \theta) \right) \right].

By employing L1-norm regularized logistic regression, we can thus estimate the coefficients θ by maximizing the log likelihood function while simultaneously penalizing the coefficients of features having small contributions to the prediction of the function:

\hat{\theta} = \arg\max_{\theta} \left\{ l(\theta) - \lambda \sum_{d=1}^{D} \left( |\alpha_d| + |\beta_d| \right) \right\},     (7)

where the regularization parameter λ (> 0) controls the cardinality (the number of nonzero components) of θ. In general, a larger λ tends to yield a smaller cardinality of θ, though it also gives a lower prediction accuracy. In practice, various values of λ should be tested to determine a suitable trade-off between accuracy and sparsity. In this study, we tested two sets of λ values, each consisting of ten values uniformly spaced on a logarithmic scale over the intervals [0.1 λmax, λmax] and [0.01 λmax, λmax], respectively. Here, λmax was calculated directly from the feature data. For the implementation of KL1LR we used an interior-point method [42], since it can be applied effectively to problems of all sizes and has low time complexity.
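Since the interior-point solver of [42] is not shown here, the following R sketch illustrates the same idea using glmnet's lasso-penalized logistic regression as a stand-in; the λmax formula is a common heuristic included only for illustration, not necessarily the one used in this study, and the variable names are hypothetical.

```r
# Sketch of KL1LR with glmnet: 'features' is the N x 2D matrix of
# average-similarity features (two columns per data source), X the 0/1 labels.
library(glmnet)

Z <- scale(features)                                   # standardized features
lambda_max <- max(abs(crossprod(Z, X - mean(X)))) / nrow(Z)   # heuristic lambda_max
lambdas <- exp(seq(log(lambda_max), log(0.01 * lambda_max), length.out = 10))

fit <- glmnet(Z, X, family = "binomial", alpha = 1, lambda = lambdas)
coef(fit, s = 0.01 * lambda_max)     # non-zero entries = selected features
```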

Measurement of prediction performance for each method

To evaluate prediction performance, we use two-fold cross-validation for GO terms covered by {3-10} genes and five-fold cross-validation for GO terms covered by {10-300} genes. For a given GO term, the contribution of each feature is measured on the training set, and performance is then measured by predicting the function on the test set using the selected features. More specifically, for the exhaustive search feature selection approach, the AUC value of each data set is measured on the training set, and the data set with the highest AUC value is selected as the feature for testing. For the KL1LR approach, the coefficients of the features are first estimated on the training set, and the features with non-zero coefficients are selected as the features for testing. Finally, the AUC and precision at 20% recall (P20R) values are calculated on the test set.
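A short R sketch of the P20R measure described above (our own illustration; the variable names are hypothetical); the AUC can be computed with the rank-based auc_score() helper sketched earlier.

```r
# Sketch: precision at a target recall (P20R for recall = 0.2).
p_at_recall <- function(score, label, recall = 0.2) {
  ord <- order(score, decreasing = TRUE)
  label <- label[ord]
  tp <- cumsum(label == 1)
  rec <- tp / sum(label == 1)           # recall at each cutoff
  prec <- tp / seq_along(label)         # precision at each cutoff
  k <- which(rec >= recall)[1]          # smallest cutoff reaching the target recall
  prec[k]
}
```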

Measurement of contributions of each data source

We measured and analyzed the contribution of the data sources for each GO term. In the exhaustive search approach, data sources providing a high AUC value (≥ 0.75) and a high P20R value (≥ 0.2) on the test set were selected as highly contributing factors for each GO-BP term. In the case of KL1LR, when the P20R value was greater than or equal to 0.2, we considered features with regression coefficients located outside one standard deviation (SD) as highly contributing factors for protein function prediction.

Next, an enrichment test was employed to obtain a general view of the GO terms that were well predicted by a specific data source. Because GO is a directed acyclic graph, we could derive the properties of specific groups of GO terms by backtracking to their common ancestor. In this way, we investigated whether there is agreement between parents and offspring in the GO hierarchy with respect to the most contributing data sets. To test the statistical significance, we performed a hypergeometric test comparing the number of descendent GO terms that were well predicted using a given data source with the total number of descendent GO terms.

For this task, we denoted the group G_d as the GO terms having high predictive power based on data source d. For each GO term in the group G_d, let m, n, k, and x be the number of descendent GO terms, the number of non-descendent GO terms, the number of GO terms in the group G_d, and the number of descendent GO terms in the group G_d, respectively. Then, the p-value is given by

p = \sum_{i=x}^{\min(m, k)} \frac{\binom{m}{i} \binom{n}{k - i}}{\binom{m + n}{k}}.     (8)

Through this statistical approach, we could measure how much the contributing data sources of each GO term agreed with those of its descendent GO terms, thereby allowing recognition of the overall tendencies of the GO terms.
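Equation (8) corresponds to the upper tail of the hypergeometric distribution and can be computed directly with R's phyper(), as in the following sketch (function name hypothetical).

```r
# Sketch: hypergeometric enrichment p-value of Eq. (8);
# m, n, k, x follow the definitions given in the text.
enrichment_pvalue <- function(m, n, k, x) {
  phyper(x - 1, m, n, k, lower.tail = FALSE)   # P(X >= x)
}

# Equivalent explicit sum:
# sum(dhyper(x:min(m, k), m, n, k))
```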

Results

Performance evaluation

We first show the significance of feature selection when integrating multiple data sources. Table 1 and Figure 1 show the average AUC values for four categories of GO terms predicted by the KLR method with Pfam or Interpro data only, the KLR method with all data sources, the KLR method with a data source selected by exhaustive search, and the KL1LR method with all data sources. When integrating the data sets, the AUC values are reported for both standardized and non-standardized data. For KL1LR, it can be seen that standardization of the integrated data improved the prediction quality and that the AUC is higher with the relative regularization parameter λ = 0.01 than with λ = 0.1. On the other hand, the performance of the KLR method integrating all data sources is similar with and without standardization. Hence, we used the KL1LR results with standardization and λ = 0.01 throughout the rest of this paper.

Figure 1

The AUC values of function predictions based on different prediction approaches. The AUC values of KLR using Pfam, KLR using all data sets (standardized), KLR using a data source selected by exhaustive search, KL1LR (λ = 0.01, standardized), and KLR using data sources selected by the Relief method are shown.

Table 1 Prediction quality of different prediction approaches

Based on the prediction of protein functions using each data set individually, we found that the domain data sets, including Interpro and Pfam, were the most informative among the six different types of data sets (see Additional File 1 for more information). Surprisingly, the AUC values for the Interpro data set in the {3-10} and {11-30} categories were higher than the AUC values when all the data sets were integrated. This result is due to the fact that some data sets actually act as noise during modelling, which confirms the importance of accurate feature selection. Prediction accuracy could therefore be increased by using the two feature selection approaches, exhaustive search feature selection and KL1LR. The average AUC values of the exhaustive search were 0.55, 0.82, 0.84, and 0.83 for the four GO term categories. KL1LR significantly outperformed the exhaustive search feature selection for the {3-10} category (0.75) and had comparable prediction accuracies for the other categories (0.88, 0.88, and 0.86). As expected, prediction accuracy was generally lower in the {3-10} category due to the small number of positives compared to negatives. Note that the AUC values of all GO terms for KLR with each data source, KLR with all data sources, and KL1LR are available in Additional File 1. Precision at various recall values for KLR with each data source and KL1LR (λ = 0.01, standardized data) is available in Additional File 2.

Both the exhaustive search feature selection and KL1LR methods are wrapper-based feature selection methods. Therefore, to compare the performance of different types of feature selection approaches on the genomic data sets, we also applied a filter-based feature selection method, Relief [27]. Relief is usually used as a pre-processor to remove irrelevant attributes from data sets before learning. In Relief, the weight of each feature is calculated by estimating how well the feature distinguishes an instance from its nearest neighbours in the same class and in a different class: features whose values are close for nearest neighbours of the same class and far for nearest neighbours of a different class receive high weights. Usually, normalized weights in the interval [-1, 1] are used. This procedure is repeated n times, where n is a user parameter determining the number of sampled training instances; in our experiment, the sampling rate was set to 25. We then calculated the weights of all 20 features obtained from the kernels of the 10 data sets. Finally, we selected as important features those with weights greater than 0.05. The sampling rate and threshold were determined through experimentation.
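For reference, the sketch below shows the basic binary-class Relief weight update (our own illustration, not the exact implementation used in this study); features are assumed to be scaled to [0, 1] so that the weights lie in [-1, 1], and the variable names are hypothetical.

```r
# Sketch of the basic Relief weight update for a two-class problem.
relief_weights <- function(feats, label, n_sample = 25) {
  w <- numeric(ncol(feats))
  idx <- sample(nrow(feats), n_sample)                 # sampled training instances
  for (i in idx) {
    d <- rowSums(abs(sweep(feats, 2, feats[i, ])))     # L1 distances to instance i
    d[i] <- Inf                                        # exclude the instance itself
    hit  <- which.min(ifelse(label == label[i], d, Inf))   # nearest same-class instance
    miss <- which.min(ifelse(label != label[i], d, Inf))   # nearest other-class instance
    w <- w - abs(feats[i, ] - feats[hit, ]) / n_sample +
             abs(feats[i, ] - feats[miss, ]) / n_sample
  }
  w
}
```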

After selecting features using the Relief method, we applied the KLR method for classification. The AUC values of the four categories were 0.58, 0.69, 0.64, and 0.61, which indicates low prediction accuracy (Table 1 and Figure 1). Because the mouse function data set contains a small number of positive instances compared to negative instances, this filter-based feature selection technique seems unable to achieve a high prediction accuracy; prediction accuracies remained similar even when the sampling rate was set higher than 25. For all 1,726 GO terms, the prediction accuracies (AUC values) and the Relief weights of the 20 features are available in Additional File 3.

Contributions of each data source for protein function predictions

Using the feature selection methods, we improved the prediction performance compared to the method integrating all the data sets. In addition, we were able to determine which data sources are informative for predicting each GO term; details for each GO term are presented in Additional File 2. Here, to summarize the importance of the genomic data sets for protein function prediction, we investigated how many GO terms are predicted with high accuracy using each data source.

During this investigation, we selected data sources with high prediction accuracy for each GO term. For the KLR method with exhaustive search, we selected data sets with AUC ≥ 0.75 and P20R ≥ 0.2 on the test set (see Additional File 2 for the data). For the KL1LR method, we selected data sources having a high regression coefficient (outside of 1 SD) and a high P20R value (≥ 0.2). These criteria were adjusted so that the exhaustive search feature selection method and the KL1LR method selected a similar total number of GO terms. Table 2 shows the number of GO terms selected for each data source and the average prediction accuracy with the respective data source (see Additional File 4 for the lists of GO terms selected for each data source). In this analysis, we observed that the domain data sources are informative for the largest number of protein functions.

Table 2 Contributions of genomic data sources

Specifically, using the exhaustive search, 40% of the GO terms (697 of 1,726 terms) have a high prediction accuracy with domain information; this is also the case for the KL1LR method. Also, the protein interaction and the phenotype data sets turned out to be the informative data sources for many of the GO terms. Conversely, phylogenetic profile, disease, and gene expression data sets were only informative for a small fraction of the GO terms; 5% (95/1,726), 2% (41/1,726), and 3% (55/1,726), respectively.

For the exhaustive search feature selection approach, the informative data sources were selected independently of the other data sources. The KL1LR method, however, tends to shrink the coefficients of similar features in order to select the more relevant ones when several redundant data sources are available; hence, the contribution of each data set might not be detected accurately. The last column in Table 2 indicates the number of GO terms common to the two feature selection approaches. As can be seen, the fractions of overlap are relatively high considering that the features selected by the KL1LR method are affected by redundant data sources. Among the GO terms selected using the exhaustive search approach, 51% (266/522), 43% (83/192), and 61% (129/213) of the terms were also selected using the KL1LR approach for the Interpro domain data, protein-protein interaction data, and phenotype data, respectively. These observations confirm that the general importance of the genomic data types was consistent across the two feature selection approaches.

To better understand the relationship between protein functions and genomic data sets, we also investigated the group of GO terms that have a high prediction quality based on only one data set. For this task, using the exhaustive search, we selected for each data type a set of GO terms for which the given data source had the highest AUC value, with AUC ≥ 0.75, P20R ≥ 0.2, and a difference of more than 0.1 between the highest AUC value and the second highest AUC value from the other data sources. Note that if the data sets with the highest and second highest accuracy were of the same data type, we compared the highest prediction accuracy with the third highest. Here, the two protein domain data sets, the three gene expression data sets, and the two phylogenetic profile data sets are each considered the same type of data. Table 3 shows part of the results, with the entire table available in Additional File 5. Additional File 6 presents the data values of the most contributing data set for each GO term. In the following section, we give examples of GO terms together with their most contributing data types.

Table 3 GO terms giving a high prediction quality using only one data source

1) Domain

Interpro is the most informative data set for predicting proteins involved in 'Cation transport' (GO:0006812), as shown in Figure 2. Many genes with this function contain domains related to ion transport and cation channels, such as the K+ channel domain, which serve as hallmarks for detecting 'Cation transport' genes; this is supported by the manual assignment of GO terms to each Interpro domain provided by Interpro [32].

Figure 2

Illustration of genomic data sources for the genes with the 'Cation transport' function. 176 genes have the 'Cation transport' (GO:0006812) function. The relationships between these genes and the five different genomic data sources are illustrated. For each data source, the AUC value and the P20R value obtained with the KLR method are also shown. (a) Protein domains belonging to the genes with the given GO term are coloured blue in the matrix. Domains appearing in more than 10 genes are boxed in red in the matrix, and their names and identifiers are listed below. (b) Expression levels of genes with the given GO term are presented. Genes and tissues are grouped based on hierarchical bi-clustering of the expression levels. Tissues commonly over-expressed in several genes are circled in blue and their names are listed below. (c) MGI phenotypes belonging to the genes with the given GO term are coloured blue in the matrix. Phenotypes appearing in more than 15 genes are boxed in red in the matrix, and their names and identifiers are listed below. (d) Protein-protein interaction network of the genes with the given GO term. Red (black) lines indicate direct (indirect) interactions. Highly connected proteins based on direct interactions are listed with their identifiers. (e) OMIM disease annotations are presented in the same manner as the domains and phenotypes.

The prediction power for 'Cation transport' using the Interpro protein-domain information is considerably higher than that using the other data sets. In contrast, the prediction quality for 'Positive regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism' (GO:0045935) in Figure 3 is high using both the protein-protein interaction and the Interpro domain data sets. Many genes in this GO category commonly contain domains related to DNA binding, such as helix-loop-helix DNA binding (IPR011598, 001092), winged helix repressor DNA-binding (IPR011991), homeobox (IPR009057, 001356, 012287) and zinc finger C2H2 (IPR007087), which might increase the prediction quality using the Interpro data set. Concurrently, genes in this category such as Sp1, Ep300, Smad2, Ncoa1, Rxra, and Crebbp belong to transcription factor complexes, transcriptional regulation, or signal transduction, so that they work together with other proteins in the same GO process by physically interacting with each other [43-45], which might increase the prediction quality using the protein-protein interaction data set.

Figure 3

Illustration of genomic data sources for the genes having a 'Positive regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism'. 161 genes have the 'positive regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism' (GO:0045935) function. (a) - (e) as described in Figure 2 (a) - (e).

2) Phenotype

Table 3 and Figure 4 show that the MGI phenotype is the most relevant data source for predicting proteins with 'Reproduction' (GO:0000003). Most of the genes in this category belong to the 'reproductive system phenotype' (0005389), which is the most powerful indicator for predicting this category. Also, more than 20 genes belong to the same phenotype in 9 phenotype categories, which helps to predict their GO functions. Note that the genes in this GO term rarely interact with each other and only a few genes share the same domain. (Additional Files 7, 8, and 9 contain the data sets for Figures 2, 3, and 4.)

Figure 4

Illustration of genomic data sources for the genes having a 'Reproduction' function. 152 genes have the 'Reproduction' (GO: 0000003) function. (a) - (e) as described in Figure 2 (a) - (e).

3) Gene expression

The prediction accuracy for 'Cartilage condensation' (GO:0001502) using the Zhang et al. [28] gene expression data is very high, with an AUC value of 0.85, a P20R value of 0.225, and a difference of 0.15 from the second highest AUC value. Genes with the cartilage condensation function are commonly highly expressed in E14.5 head, femur, snout, teeth, thyroid, and trachea tissues (see Additional File 6), most of which are tissues related to cartilage. Cartilage is a tissue found in organs such as the femur, snout, and thyroid gland, and the trachea is held open by 15-20 C-shaped rings of cartilage [46]. Genes belonging to this GO term, such as CTGF, Col11a1, Ror2, and Thra, are known to be expressed in skeletal cells [47], are essential for skeletal morphogenesis and skeletal system development [48, 49], and are related to bone formation [50].

An enrichment test for an informative data source in a GO hierarchy

Table 3 and Additional File 6 show that the GO terms predicted with high accuracy using one data source are sometimes in a parent-offspring relationship. To obtain a general view of the relationship between GO terms and genomic data sources, we performed an enrichment test for an informative data source in the GO hierarchy, asking whether the majority of the offspring GO terms of a given GO term were predicted with high accuracy using the same data type. For this task, we used the hypergeometric test described in the Methods section. Table 4 summarizes part of the results based on the exhaustive search feature selection approach. Because the informative data sources for each GO term might differ between the exhaustive search feature selection method and the KL1LR method, the GO terms found significant in the enrichment test with the KL1LR method are highlighted in bold and italic type. It should be noted that for some GO terms, the data source was informative for the majority of the GO term's offspring but not for the given GO term itself; in this case, the informative data source for the GO term itself from both approaches is underlined. The entire table is available in Additional File 10.

Table 4 Enrichment test for an informative data source in the GO hierarchy

From Table 4, we observed that the informative data sources for the GO terms are not independent of each other. For example, 'Ion transport' (GO:0006811, 232 genes) in Figure 5 is a highly significant term in the enrichment test with the Interpro domain data set. Among the 23 offspring GO terms of 'Ion transport', 18 have high prediction accuracy when the Interpro data set is used, giving a p-value of 0.000000268. 'Cation transport' (GO:0006812, 176 genes) in Figure 2, which has a high AUC value when using only the Interpro data set, is one of the offspring of 'Ion transport'. Similarly, 'Reproduction' (GO:0000003, 152 genes) was found to be a significant term with the MGI phenotype data; the GO term hierarchy of 'Reproduction' (GO:0000003) is available in Additional File 11. In total, 17 of its 47 descendent GO terms could be well predicted using the MGI phenotype data.

Figure 5

The hierarchy of 'Ion transport' (GO:0006811). Coloured GO terms have high AUC values based on the Interpro domain data set. Among them, the red boxes represent GO terms having a high AUC value for that data source only. For each GO term, the parentheses give the number of genes with the given GO term (along with the number of those genes present in the Interpro data source), the AUC value, and the P20R value obtained with the Interpro data source.

Performance of our prediction approach for newly annotated genes in Mus musculus and yeast

As one of the performance evaluations of our feature selection methods, we collected MGI gene identifiers in Aug 2009 and predicted the 2,808 genes newly annotated since Feb 2006. We predicted these genes using both the KLR method with exhaustive search feature selection and the KL1LR method, based on the features selected and parameters trained on the data sets collected in Feb 2006. Among the 1,726 GO terms, 1,051 had newly annotated genes.

Table 5 shows the prediction results for the newly annotated genes. The prediction accuracies for the newly annotated genes using the KL1LR method are quite high; the AUC values are 0.79, 0.81, 0.82, and 0.80 for the four categories, respectively. Surprisingly, the AUC values in the {3-10} category are also high and comparable to the other categories. Because some inaccurate annotations are discarded from older annotation data, the frequency of false positives and false negatives may decrease, which supports the robustness of our approaches. The full prediction results and precision at various recall values, including the number of newly annotated genes for each GO term, are available in Additional Files 12 and 13.

Table 5 Prediction quality of newly annotated genes

To evaluate the performance of our approaches for other model organisms, the eight genomic data sets from yeast were integrated using the proposed feature selection methods. The data sets consist of a protein-protein interaction data set, four gene expression data sets, a protein-domain annotation data set, a gene knock-out phenotype data set, and a protein localization data set [22]. Table 6 shows the average AUC values for the four GO term categories predicted using KLR with all data sets, KLR with a data source selected by exhaustive search, the KL1LR method, and the Relief method; when the different data sets were integrated, both the standardized and non-standardized data were compared. Compared to the prediction results for the Mus musculus data in Table 1, the AUC values decreased. However, the integration approaches based on the features selected by the exhaustive search and the KL1LR method consistently outperformed the other approaches, such as KLR with all genomic data sets and the Relief feature selection method. From these results, we confirm that the proposed approaches are also suitable for other organisms. The full set of prediction results and precision at various recall values for each GO term are available in Additional Files 14 and 15.

Table 6 Performance of GO term prediction of yeast genes

Consistency of informative features in newly annotated genes in Mus musculus and yeast genes

We then investigated whether the genomic data types found informative with the Feb 2006 data were also informative for the newly annotated genes. Among the 1,051 GO terms with newly annotated genes, we selected informative features (AUC ≥ 0.75) for the newly annotated genes using the exhaustive search feature selection and compared them with those from the Feb 2006 data sets (Additional File 2). Table 7 shows that 506 GO terms have the same informative feature types for both data sets; full information for Table 7 is available in Additional File 16.

Table 7 Consistencies of informative genomic data types between old and new annotation data in Mus musculus and between Mus musculus and yeast

To investigate the consistency of the informative genomic data sets in yeast, we analyzed the 622 GO terms (out of 1,246) that intersected with Mus musculus. Here, the informative features were defined as data sets with AUC ≥ 0.75. For the analysis, only the genomic data types common to Mus musculus and yeast were considered, i.e., a protein-domain annotation data set, gene expression data sets, a protein-protein interaction data set, and a phenotype data set. Table 7 shows that 186 GO terms have the same informative feature types for both data sets; full information, including the gene counts in the yeast data, for Table 7 is available in Additional File 17.

Discussion

The contribution of this paper compared to the study by Pena-Castillo et al. (2006) [26] is that the methods proposed here can systematically select important genomic data types for function prediction and thereby improve the prediction quality through feature selection. Pena-Castillo et al. (2006) [26] presented two GO term examples with different contributions of data types. However, that information was used to explain the nature of the supporting genomic data types in the prediction results and was not incorporated into the prediction methods; in other words, those examples are independent of the nine methods used in [26]. Our previous work in [26], as one of the nine bioinformatics teams evaluating their methods for mouse function prediction, incorporated informative genomic data types into the KLR method for each GO term. That method, which is similar to the exhaustive search feature selection approach in this paper, was among the best prediction groups in the general categories {31-100} and {101-300}, though it showed relatively poor prediction performance in the specific category {3-10}. The KL1LR method introduced in this paper significantly improves the prediction quality in the {3-10} category. As a result, the prediction performance of the KL1LR method is comparable to the best approaches in [26] using the same data sets, although the performance of the KL1LR method is still lower in the {3-10} category. It is difficult to directly compare the AUC values of these methods because the performance evaluation in this paper is based on five-fold cross validation, whereas [26] is based on a held-out blind test set. However, the overall performance can be roughly compared because they are assessed on the same data sets. One of the best prediction groups in the {3-10}, {11-30}, {31-100}, and {101-300} categories of biological process had AUC values of around 0.873, 0.872, 0.881, and 0.84, respectively (Group C in Figure 2(a) of [26]) for the held-out blind test; those of the KL1LR method were 0.75, 0.88, 0.88, and 0.86 for the five-fold cross validation.

In the Results section, we presented the prediction accuracy using the AUC value; in the Additional Files, precision at various recall values is also provided for each GO term and data source. It should be noted, however, that the AUC value may be misleading when the number of positive genes is much smaller than the number of negative genes, though this can be compensated for by using the precision at various recall values. Hence, we used the P20R value as one of the thresholds to select informative features, as shown in Tables 2, 3, and 4.

Furthermore, we analyzed the informative features of each GO term and the tendencies of the informative genomic data types in parent-offspring relationships in the GO term hierarchy. In the enrichment test in Table 4, we observed that the informative data sources for the GO terms are not independent of each other. This observation can therefore be used to infer the informative genomic data sets even for GO terms whose relationships with genomic data sets have not previously been analyzed. To make this information easily accessible to experimental and computational biologists who collect data sources and predict protein functions, we have constructed a website [51]. For an input GO term, the website lists the AUC values and the precision at various recall values based on the exhaustive search. Using this website, biologists planning to predict the functions of new GO terms can learn which data sources need to be collected first; users can also explore the parent GO terms. This development helps to determine the informative data sources even for GO terms that have not yet been analyzed.

Conclusions

To assess the contributions of genomic data sources to protein function prediction, we proposed the KLR method with an exhaustive search feature selection approach and the KL1LR method. The protein function prediction is successful because the contribution of each data source, which directly influences the prediction performance, is estimated. Our approach is able to uncover the relationships between protein functions and genomic data types. Moreover, it can be applied to other bioinformatics problems that use high-dimensional data sources, and can therefore serve as a valuable methodology for analyzing future genomic data.