Background

As an important research field in molecular cell biology and proteomics, protein subcellular localization is closely related to protein function, metabolic pathways, signal transduction and biological processes, and plays an important role in drug discovery, drug design, basic biological research and biomedical research. Experimental determination of subcellular localization is time-consuming and laborious, and in some cases it is hard to resolve certain subcellular compartments by fluorescence microscopy imaging. Computational methods can help biologists select target proteins and design experiments.

Recent years have witnessed much progress in protein subcellular localization prediction [1–35]. Machine learning methods for predicting protein subcellular localization involve two major aspects: one is to derive protein features and the other is to design the predictive model. State-of-the-art feature extraction methods are data- and model-dependent. The features should not only capture rich biological information but also be discriminative enough to construct an effective classifier. On the one hand, high-throughput sequencing techniques make protein sequences cheaply available, and many computational models in computational proteomics are based on protein primary sequences only. On the other hand, data integration has become a popular way to combine diverse biological data, including non-sequence information such as GO annotation, protein-protein interaction networks, protein structural information, cell image features, etc.

Many effective protein features have been designed specifically for protein subcellular localization prediction. Amino acid composition (AA) is closely related to protein subcellular localization [16] and is the most frequently used feature. PseAA [4, 10, 12, 13, 17–32] encodes the pair-wise correlation of two amino acids at λ intervals using amino acid physicochemical properties. Sliding-window based k-mer feature representations, such as gapAA, di-AA and motif kernels [35, 36], are often used to capture the contextual information of amino acids and conserved motif information. Since the dimensionality of the k-mer feature space ($20^n$ for 20 amino acids) expands exponentially with the window size n, some studies [37, 38] compress the 20 amino acids into 7 groups according to amino acid physicochemical properties. Sorting signals and anchoring signals are important information for protein subcellular localization [39, 40], but they have the disadvantages that the cleavage sites vary substantially among proteins and the signal peptides may be missing.
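To make the composition features concrete, the following is a minimal sketch of how AA (1-mer) and di-AA (2-mer) frequency vectors could be computed with a sliding window. The function name `kmer_composition` and the toy sequence are assumptions introduced for illustration; they are not taken from the original work.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(seq, k, alphabet=AMINO_ACIDS):
    """Normalized frequencies of all len(alphabet)**k k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k along the sequence and count each k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing non-standard letters are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


toy_seq = "MKVLAAGIVGLLLA"  # hypothetical sequence, for illustration only
aa = kmer_composition(toy_seq, 1)      # 20-dimensional AA vector
di_aa = kmer_composition(toy_seq, 2)   # 400-dimensional di-AA vector
print(len(aa), len(di_aa))
```

The vector length grows as 20 to the power k, which is why larger windows quickly become impractical without a reduced alphabet.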

Sequence profile is also important information for protein subcellular localization. Marcotte E et al. (2000) [41] revealed the relation between phylogenetic profile distribution and protein subcellular localization pattern. A sequence profile approximates the true amino acid distribution at each residue position along the protein sequence, and thus can naturally be used as an evolutionary distance between amino acids for measuring the similarity between two protein sequences. Through deliberate design, the similarity between two protein profile distributions can lead to a valid Mercer kernel [14, 15, 42–46]. Mak M et al. (2008) [42] derived the alignment score between two protein profile distributions using dynamic programming, and from it derived a valid profile alignment kernel. Profile kernels [43, 44] used PSSM and PSFM to derive the similarity score between any two k-mers, and from it measured the similarity between two protein sequences. Kuang R et al. (2005) [44] designed a profile kernel, a variant of the mismatch kernel [45], which allowed a length-k fragment to match its corresponding k-mer if the fragment fell within the positional mutation neighbourhood defined by k-mer self-entropy. Kuang R et al. (2009) [46] extended the profile kernel by simple kernel fusion for prediction of malaria degradomes. The spectrum kernel [47] is based on exact k-mer matches, while the (k, l) mismatch kernel [45] allows l mismatches within each k-mer; both are based on protein sequence only, without profile incorporation. In fact, we can derive multiple kernels from multi-aspect knowledge about proteins and then combine the kernels for a more accurate definition of protein similarity. Zien A et al. [36] used semi-infinite linear programming to derive the optimal kernel weights for combining motif kernels. Mei S et al. (2010) [48] derived multiple motif kernels from diverse physicochemical constraints on amino acid substitution and combined the kernels for protein subnuclear localization. The kernel method is a good approach for heterogeneous data integration in computational biology.
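As an illustration of the exact-match idea behind the spectrum kernel, the sketch below computes the inner product of k-mer count vectors of two sequences. It is a simplified, unnormalized version written for this explanation; the helper names are assumptions and the profile and mismatch variants are not covered.

```python
from collections import Counter


def spectrum_features(seq, k):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k):
    """Inner product of the two k-mer count vectors (exact k-mer matches only)."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * ft[kmer] for kmer, count in fs.items())


# Toy strings: "ABAB" has AB x2 and BA x1; "ABA" has AB x1 and BA x1.
value = spectrum_kernel("ABAB", "ABA", 2)
print(value)
```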

Although the protein sequence contains all the information needed for proteins to be transported to their proper compartments, to fold correctly, to form the proper 3-D structural conformation and to function properly, quality feature extraction from protein sequence is still a challenging problem, because there is no general law or complete knowledge for effective feature extraction from protein sequence. However, a large number of biological experiments and computational inferences have accumulated reliable multi-aspect local knowledge about genes and gene products, which has been well organized in the biological knowledge base Gene Ontology (GO). Gene Ontology is a controlled vocabulary that describes biomolecules and gene products in terms of biological process, molecular function and cellular component. With the rapid progress of experimental and electronic annotation, gene ontology has become a general feature in proteomics that can be used to boost the predictive performance of protein subcellular localization [49–60]. In what follows, we briefly review the GO-based predictive models for protein subcellular localization from three viewpoints.

(1) From the viewpoint of GO term extraction, the previous models can be classified into three categories. The first type of method directly uses the protein accession number to query GO terms against the GOA database [61]; this is fast but not applicable to novel proteins [4–12, 49–53]. The second type of method uses PSI-Blast to transfer the GO terms of homologous proteins to the target protein [54, 55]. The third type of method uses InterProScan [56] to transfer the GO terms of signature proteins to the target proteins [57, 58], which may be more reliable than the PSI-Blast transfer. Tung T et al. (2009) enlarged the GO term coverage by transferring to the target protein the GO terms of physically interacting partners in the yeast interaction network [59].

(2) From the viewpoint of GO feature construction, the previous models can also be classified into three categories. The first way of GO feature construction is to simply turn all GOA GO terms into a flat binary feature vector to represent proteins [49–53, 57–60]. This method has large GO term coverage but introduces many GO terms irrelevant to the problem concerned. The second type of method uses a genetic algorithm to select the most informative GO component terms in order to minimize the irrelevant GO terms [54, 55], but the resulting low GO term coverage is highly likely to turn test proteins into null feature vectors, so that the effect of PSI-Blast GO term transfer would be counteracted. The third type of method does not use an explicit GO feature representation but designs an implicit kernel function to measure the semantic similarity between two GO terms [62]. In fact, the three aspects of gene ontology have different discriminative abilities, but the aforesaid three types of method assume equal feature weights.

(3) From the viewpoint of data integration, the previous models can be classified into two categories. The first type of method uses ensemble learning to combine protein sequence with gene ontology, such as k-NN ensemble [52], fuzzy k-NN [59] and SVM ensemble [62]. The second type of method concatenates all the heterogeneous feature spaces (e.g. AA, di-AA, gap-AA, chem-AA, GO, PPI, etc.) into a highly sparse high-dimensional feature space [60].

In this paper, we design an explicitly weighted kernel learning system, called Gene Ontology Based Transfer Learning Model (GO-TLM), to transfer known knowledge in the form of GO terms from related homologs to the target problem, for the purpose of sharing knowledge between evolutionarily close protein families and achieving better model performance for protein subcellular localization. We use InterProScan to conduct multiple homologous-signature based queries against the InterPro database, and then transfer the homologous GO terms to the target protein. The transferred GO terms are potentially prone to errors, partly because of possibly noisy annotations from fluorescence microscopy experiments, electronic annotations using text mining, computational inference, etc. [49], and partly because of outliers from homology transfer, that is, homologous proteins that actually have distinct function, process and subcellular localization patterns due to evolutionary divergence. Therefore, we should further construct a learning system that is trained on the transferred GO terms for reliable prediction. Such a scenario of borrowing knowledge in the form of GO terms from homologous proteins for further learning can be viewed as a case of transfer learning [63–66], where knowledge is transferred between well-correlated domains for better learning in the target domain. Dai W et al. (2007) [63] proposed an instance-based knowledge transfer learning method, where auxiliary data were drawn in to augment the target training set, using an AdaBoost weighting scheme to reduce the unfavourable impact of auxiliary data that follow a different distribution. Dai W et al. (2008) [64] proposed a feature-based translated transfer learning method, where a translator was constructed between the text feature space and the image feature space for knowledge transfer from text data to image data. Yang Q et al. (2009) [65] proposed a parameter-based knowledge transfer learning method, where the knowledge contained in annotated images from the heterogeneous social web was transferred for target image clustering. Pan S et al. (2010) [66] reviewed the recent progress in transfer learning modelling. Because of the unbalanced knowledge about proteins, the three aspects of gene ontology may have distinct discriminative abilities. For this reason, we derive GO process features, GO function features and GO component features individually, and then derive three individual GO kernels from the three types of GO feature representation. Besides the three GO kernels, we further derive another two sequence kernels from amino acid composition (AA) and dipeptide composition (di-AA), which are in effect spectrum kernels. These heterogeneous feature representations are then merged into one kernel using linear kernel combination, a classical scenario of multiple kernel learning [36, 67]. To reduce the computational cost of parameter optimization for multiple kernel learning, we use simple non-parametric cross validation to estimate the kernel weights instead. The model GO-TLM is evaluated against three baseline models on three eukaryotic benchmark datasets using cross validation and independent test.

Methods

GO feature construction

The InterPro database [68] integrates into a single source the most frequently accessed signature databases, including PROSITE [69], PRINTS [70], PFAM [71], ProDom [72], SMART [73] and TIGRFAMs [74]. PROSITE uses regular expressions to represent significant amino acid patterns or uses profiles (weight matrices) to detect structural and functional domains; PRINTS collects protein family fingerprints (motifs); PFAM is a database of protein domain families that contains curated multiple sequence alignments for each family and the corresponding profile hidden Markov models (HMMs); ProDom provides automatic domain queries based on recursive use of PSI-BLAST homology search; SMART collects domains that are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues; TIGRFAMs is a collection of protein families characterized by curated multiple sequence alignments, hidden Markov models (HMMs) and associated information supporting functional identification of proteins by sequence homology. InterProScan [61] combines different protein signature recognition methods into one resource and provides a uniform web service interface to query signatures against the integrative InterPro database. InterProScan provides an option "--goterms" that enables GO term queries using protein sequence only, which can collect more reliable GO terms than Blast transfer [54, 55]. Parallel access and fast B-tree indexing make InterProScan practicable for large problems. For this reason, we use the Perl script InterProScan.pl as the GO term extraction tool. The GO term set consists of three subsets: process, function and component. The three GO term subsets are organized as three individual binary feature vectors: $(x_{P,1}, x_{P,2}, \ldots, x_{P,l})$; $(x_{F,1}, x_{F,2}, \ldots, x_{F,m})$; $(x_{C,1}, x_{C,2}, \ldots, x_{C,n})$. It should be noted that InterProScan can overcome the problem of data unavailability only to a certain degree: if we set a high threshold to retrieve more reliable GO terms with higher confidence, or if the homologs are themselves unannotated, InterProScan cannot transfer any GO terms to the target proteins.
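The construction of the three binary GO vectors can be sketched as follows. This is an illustrative sketch only: the per-protein annotations are a hand-written toy input rather than parsed InterProScan output, and the names `build_go_vocab` and `encode` are assumptions. The vocabulary is built from training proteins only, so a protein without transferred terms maps to all-zero vectors.

```python
ASPECTS = ("P", "F", "C")  # process, function, component


def build_go_vocab(train_annotations):
    """Collect, per aspect, the sorted GO terms seen in the training proteins."""
    vocab = {}
    for aspect in ASPECTS:
        terms = set()
        for annotation in train_annotations:
            terms |= annotation.get(aspect, set())
        vocab[aspect] = sorted(terms)
    return vocab


def encode(annotation, vocab):
    """Map one protein's GO terms to three binary vectors (one per aspect)."""
    return {
        aspect: [1 if term in annotation.get(aspect, set()) else 0 for term in terms]
        for aspect, terms in vocab.items()
    }


# Toy annotations (illustrative; not real InterProScan results).
train = [
    {"P": {"GO:0006355"}, "F": {"GO:0003677"}, "C": {"GO:0005634"}},
    {"P": {"GO:0006412"}, "F": {"GO:0005524"}, "C": {"GO:0005737"}},
]
vocab = build_go_vocab(train)
x_first = encode(train[0], vocab)
x_empty = encode({}, vocab)  # protein with no transferred GO terms
print(x_first["C"], x_empty["C"])
```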

Kernel weight

K-mer occurrence patterns can reveal conserved sub-sequences (e.g. motifs), and the k-spectrum kernel can be used to define the similarity between protein sequences. Since the size of the feature space, $|\Sigma|^k$, expands exponentially with the window size $k$, we only use 1-mer (AA) and 2-mer (di-AA) as protein sequence feature representations, from which we derive the kernels $K_{AA}$ and $K_{diAA}$. Based on the GO feature representation, we define the GO process kernel $K_P$, the GO function kernel $K_F$ and the GO component kernel $K_C$. The 5 kernels are fused into a single kernel for a more accurate definition of protein similarity. Kernel fusion is equivalent to the kernel that is computed in the concatenated feature space, but kernel fusion has the advantage of explicitly weighing the importance of feature subsets. The information content transferred from GO kernels to sequence kernels is measured by the GO kernel weights. The weights of feature subsets vary with problems and should be derived from data. The final kernel is defined as the following linear combination of sub-kernels:

$$K_{GO\text{-}TLM} = \sum_{e \in \{P, F, C, AA, diAA\}} w_e K_e \tag{1}$$
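The linear combination in formula (1) can be sketched with plain nested lists as Gram matrices. The function name `combine_kernels`, the two toy kernels and the weights below are hypothetical values chosen for illustration, not results from the paper.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of same-sized Gram matrices: K = sum_e w_e * K_e."""
    names = list(kernels)
    n = len(kernels[names[0]])
    fused = [[0.0] * n for _ in range(n)]
    for e in names:
        w, K = weights[e], kernels[e]
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * K[i][j]
    return fused


# Two toy 2x2 Gram matrices with hypothetical weights that sum to 1.
toy_kernels = {
    "C": [[1.0, 0.5], [0.5, 1.0]],
    "AA": [[1.0, 0.0], [0.0, 1.0]],
}
toy_weights = {"C": 0.75, "AA": 0.25}
fused = combine_kernels(toy_kernels, toy_weights)
print(fused)
```

A fused matrix of this kind could then be passed to an SVM implementation that accepts precomputed kernels.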

Lanckriet G et al. (2004) [75] used semi-definite programming to solve the problem, and Zien A et al. (2007) [36] used semi-infinite linear programming to derive the optimal weights. Both methods have rather large time and space complexity. Here, we use simple non-parametric cross validation to derive the kernel weights $w_e$, $e \in \{P, F, C, AA, diAA\}$. Given training data X, we derive the kernels $K_{AA}$, $K_{diAA}$, $K_P$, $K_F$, $K_C$, split X into K folds and conduct K-fold cross validation, from which we can estimate the recall rate or sensitivity (SE) of each kernel. Sensitivity reflects the discriminative ability of a kernel or feature subset, but it is highly biased towards the predominant class in the case of unbalanced data, so we also include the Matthews correlation coefficient (MCC) in the performance measure to estimate the kernel weights objectively:

$$w_e = \frac{SE_e \cdot MCC_e}{\sum_{c \in \{AA, diAA, P, F, C\}} SE_c \cdot MCC_c} \tag{2}$$
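As a small worked example of formula (2), the sketch below normalizes the product SE times MCC over the five kernels. The SE and MCC values are hypothetical placeholders, not measurements from the paper, and the sketch assumes all products are positive.

```python
def kernel_weights(se, mcc):
    """w_e = SE_e * MCC_e / sum_c (SE_c * MCC_c), following formula (2)."""
    scores = {e: se[e] * mcc[e] for e in se}
    total = sum(scores.values())
    return {e: score / total for e, score in scores.items()}


# Hypothetical cross-validation estimates for each kernel.
toy_se = {"P": 0.70, "F": 0.60, "C": 0.90, "AA": 0.50, "diAA": 0.50}
toy_mcc = {"P": 0.60, "F": 0.50, "C": 0.85, "AA": 0.30, "diAA": 0.35}
weights = kernel_weights(toy_se, toy_mcc)
print(weights)
```

With these toy inputs the component kernel receives the largest weight, because both its SE and its MCC are the highest.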

For notational simplicity, the subscript e is omitted. Let M be the confusion matrix for one of the kernels ($K_{AA}$, $K_{diAA}$, $K_P$, $K_F$, $K_C$), where $M_{i,j}$ records the number of proteins of class i that are classified to class j. The following variables can be derived from the confusion matrix M:

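$$p_l = M_{l,l}, \quad q_l = \sum_{i=1, i \neq l}^{L} \sum_{j=1, j \neq l}^{L} M_{i,j}, \quad r_l = \sum_{i=1, i \neq l}^{L} M_{i,l}, \quad s_l = \sum_{j=1, j \neq l}^{L} M_{l,j},$$
$$p = \sum_{l=1}^{L} p_l, \quad q = \sum_{l=1}^{L} q_l, \quad r = \sum_{l=1}^{L} r_l, \quad s = \sum_{l=1}^{L} s_l \tag{3}$$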

We can derive the kernel's SE and MCC measure as follows:

$$SE = \frac{\sum_{l=1}^{L} M_{l,l}}{\sum_{i=1}^{L} \sum_{j=1}^{L} M_{i,j}}, \quad MCC = \frac{pq - rs}{\sqrt{(p + r)(p + s)(q + r)(q + s)}} \tag{4}$$

where L denotes the number of subcellular locations.
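The quantities in formulas (3) and (4) can be computed directly from a confusion matrix, as in the following minimal sketch. The function name `se_and_mcc` and the 2x2 toy matrix are assumptions for illustration; returning 0.0 for a zero denominator is a convention chosen here, not something stated in the text.

```python
import math


def se_and_mcc(M):
    """Overall SE and MCC from an L x L confusion matrix (rows: true, cols: predicted)."""
    L = len(M)
    total = sum(sum(row) for row in M)
    p = q = r = s = 0
    for l in range(L):
        p_l = M[l][l]                                      # correctly assigned to class l
        r_l = sum(M[i][l] for i in range(L) if i != l)     # other classes predicted as l
        s_l = sum(M[l][j] for j in range(L) if j != l)     # class l predicted as others
        q_l = total - p_l - r_l - s_l                      # entries with i != l and j != l
        p, q, r, s = p + p_l, q + q_l, r + r_l, s + s_l
    se = p / total
    denom = math.sqrt((p + r) * (p + s) * (q + r) * (q + s))
    mcc = (p * q - r * s) / denom if denom else 0.0
    return se, mcc


# Toy matrix: p = 9, q = 9, r = 3, s = 3, so SE = 9/12 and MCC = 72/144.
se, mcc = se_and_mcc([[5, 1], [2, 4]])
print(se, mcc)
```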

For each $K_e$, $e \in \{AA, diAA, P, F, C\}$, the Gaussian kernel is used:

$$K_e(x, y) = \exp\left(-\gamma \|x - y\|^2\right) \tag{5}$$

The parameter γ should be tuned experimentally.
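A minimal sketch of the Gaussian kernel in formula (5), applied to binary feature vectors, is given below. The helper names and the toy vectors are assumptions for illustration.

```python
import math


def gaussian_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(vectors, gamma):
    """Pairwise Gaussian kernel values for a list of feature vectors."""
    return [[gaussian_kernel(x, y, gamma) for y in vectors] for x in vectors]


# Toy binary GO vectors; the two vectors differ in two positions.
toy_vectors = [[1, 0, 0], [0, 1, 0]]
K = gram_matrix(toy_vectors, gamma=2 ** -2)
print(K)
```

For binary vectors the squared distance equals the number of differing positions, so the kernel value decays with the number of GO terms on which two proteins disagree.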

Results

Dataset description

We choose three highly unbalanced eukaryotic benchmark datasets to evaluate GO-TLM performance. The first dataset, MultiLoc, collects 5859 proteins that are unevenly distributed over 10 subcellular locations for Plant data and 9 subcellular locations for Fungi data and Animal data [58]. The second dataset, BaCelLoc, originally from the work [76], collects 491 proteins for Plant, 1198 proteins for Fungi and 2597 proteins for Animal, unevenly located in 5 subcellular locations for Plant and 4 subcellular locations for Fungi and Animal [58, 77]. The third dataset, Euk-mPLoc, collects 5618 proteins that are unevenly located in 22 subcellular locations, the largest dataset so far in terms of number of subcellular locations [50]. To avoid overestimating model performance, a cut-off threshold of 25% sequence similarity is generally accepted in current research [5–7, 13, 15, 33, 34]. In this paper, to allow more training data and to follow the baseline models, a 30% threshold of sequence similarity is adopted on all the benchmark datasets, except for a 40% threshold on the MultiLoc plant dataset and a 25% threshold on the Euk-mPLoc dataset.

Model evaluation and model selection

Among the independent dataset test, the sub-sampling test (e.g. 5-fold or 10-fold cross validation) and the jackknife test (leave-one-out cross validation), the jackknife test is deemed the most objective model evaluation method, as elucidated in [13, 15]. Therefore, the jackknife test has been increasingly adopted and widely recognized by investigators to test the power of various prediction methods [1–34]. 5-fold cross validation is a commonly accepted model evaluation approach in computational biology for large datasets or complex learning models, whereas leave-one-out cross validation (LOOCV, i.e. the jackknife test) is a better choice for small datasets or simple computational models. We use 5-fold cross validation to evaluate GO-TLM on MultiLoc, BaCelLoc and Euk-mPLoc, and also evaluate GO-TLM on the BaCelLoc independent test as MultiLoc-GO did. For 5-fold cross validation, the protein dataset is randomly split into five disjoint parts of equal size. The last part may have 1-4 more examples than the other four parts so that each example is evaluated by the model. One part of the dataset is used as the test set and the remaining parts are jointly used as the training set. The procedure iterates five times, and each time a different part is chosen as the test set. The independent test is in fact a hold-out test that randomly partitions the dataset into a training set and a test set. As a performance measure, the hold-out test is not as objective as cross validation because it does not ensure that each data point is tested. For the sake of comparison, we also conduct performance evaluation on the BaCelLoc independent sets.
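The splitting scheme described above, with four equal parts and a last part that absorbs the remainder, can be sketched as follows. The function names and the fixed random seed are assumptions for illustration; 491 is the size of the BaCelLoc plant dataset mentioned earlier.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split into k disjoint parts; the last takes the remainder."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    size = n // k
    folds = [indices[i * size:(i + 1) * size] for i in range(k - 1)]
    folds.append(indices[(k - 1) * size:])  # last part may hold up to k-1 extra examples
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for t, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != t for i in fold]
        yield train, test


folds = k_fold_indices(491)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
```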

For the cross validation used in kernel weight estimation, we select the number of folds cvK from {3, 5, 10} that achieves the best overall accuracy. We use four commonly adopted measures: Sensitivity (SE), Specificity (SP), Matthews correlation coefficient (MCC) and Overall Accuracy. MCC is often used to evaluate the performance balance of model prediction. As compared to MCC, Overall Accuracy is a better candidate performance measure for model selection, because it has taken MCC into account. The overall MCC is not given, since we pay more attention to comparing the bias between sub-categories. LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) is used together with the model GO-TLM. The regularization parameter C is selected within $\{2^{1}, 2^{2}, \ldots, 2^{11}\}$ and the kernel parameter γ is selected within $\{2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}\}$. We adopt the combination of cvK, γ and C that achieves the best overall accuracy.
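The parameter search described above amounts to an exhaustive grid search over cvK, γ and C. The sketch below enumerates that grid; the scoring function `toy_accuracy` is a hypothetical stand-in for the cross-validated overall accuracy, which in practice would come from training and testing the SVM for each combination.

```python
import math
from itertools import product

CVK_GRID = [3, 5, 10]
GAMMA_GRID = [2 ** -p for p in range(1, 5)]   # 2^-1 ... 2^-4
C_GRID = [2 ** p for p in range(1, 12)]       # 2^1 ... 2^11


def select_parameters(evaluate):
    """Return the (cvK, gamma, C) combination with the highest score."""
    return max(product(CVK_GRID, GAMMA_GRID, C_GRID), key=lambda combo: evaluate(*combo))


def toy_accuracy(cvk, gamma, c):
    """Hypothetical score peaking at cvK = 5, gamma = 2^-2, C = 2^8 (illustration only)."""
    return -abs(cvk - 5) - abs(math.log2(gamma) + 2) - abs(math.log2(c) - 8)


grid_size = len(CVK_GRID) * len(GAMMA_GRID) * len(C_GRID)
best = select_parameters(toy_accuracy)
print(grid_size, best)
```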

Comparison with baseline model

We choose MultiLoc-GO and Euk-mPLoc as baseline models for performance comparison. Both the baseline models incorporated gene ontology information to boost the model's predictive performance. MultiLoc-GO used InterProScan to draw in GO terms while Euk-mPLoc used protein accession to directly query GO terms against GOA database. We use Specificity (SP), Sensitivity (SE), MCC and Overall Accuracy as performance measures.

The baseline model MultiLoc-GO reported only the overall accuracy of cross validation on the MultiLoc and BaCelLoc datasets, without detailed SP, SE and MCC. For an intuitive illustration of the eight comparison experiments between GO-TLM and MultiLoc-GO, we give the performance comparison in a separate chart (Figure 1). As can be seen from Figure 1, GO-TLM significantly outperforms MultiLoc-GO on all benchmark datasets. GO-TLM achieves quite satisfactory performance for cross validation but shows a significant decrease on the BaCelLoc independent sets. The accuracy decrease may be caused by the subjective partition into training set and test set. From Figure 1 we can also see that GO-TLM demonstrates more stable performance than MultiLoc-GO. Detailed performance measures for GO-TLM are given in Tables 1, 2 and 3.

Figure 1. Performance comparison between MultiLoc-GO and GO-TLM.

Table 1 Performance comparison on 5859 MultiLoc protein dataset
Table 2 Performance comparison on BaCelLoc protein dataset
Table 3 Performance comparison on 5618 Euk-mPLoc protein dataset

On the MultiLoc plant dataset with 10 subcellular compartments, the best parameter combination is cvK = 5, $\gamma = 2^{-2}$, $C = 2^{8}$ and the best Overall Accuracy is 96.55%, a 7.05% increase from the 89.60% of MultiLoc-GO [58] and a 22.05% sharp increase from the 74.60% of MultiLoc [76]. As can be seen from Table 1, GO-TLM demonstrates quite satisfactory performance on all the subcellular locations, with SP, SE and MCC all greater than 90%, far better than the sequence-based MultiLoc. MultiLoc-GO gave no detailed cross validation performance measures for each subcellular location. The performance measures SP, SE and MCC demonstrate that GO-TLM shows no bias towards large subcellular locations, e.g. for the smallest location, vacuole, SP: 0.9355, SE: 0.9206, MCC: 0.9273 on MultiLoc plant. Similar conclusions can be drawn on MultiLoc animal, for which the best parameter combination is cvK = 5, $\gamma = 2^{-2}$, $C = 2^{8}$. The MultiLoc fungi dataset shares most proteins with MultiLoc plant, only without the chloroplast compartment, so we do not give results on the MultiLoc fungi dataset.

We conduct two sets of experiments on the second dataset, BaCelLoc. As can be seen from Table 2, the cross validation experiments show that GO-TLM achieves best overall accuracies of 97.14%, 95.90% and 96.85% on BaCelLoc plant, BaCelLoc fungi and BaCelLoc animal, respectively, against 83.70%, 90.10% and 85.70% for MultiLoc-GO, i.e. accuracy increases of 13.44%, 5.80% and 11.15%, respectively. The performance measures SP, SE and MCC demonstrate that GO-TLM shows no bias towards large subcellular locations, e.g. for the smallest location, extracellular, SP: 0.9762, SE: 1.0000, MCC: 0.9869 on BaCelLoc plant; for the smallest location, extracellular, SP: 0.9130, SE: 0.8957, MCC: 0.8849 on BaCelLoc fungi; and for the smallest location, mitochondria, SP: 0.9783, SE: 0.9574, MCC: 0.9653 on BaCelLoc animal. The best parameter combination is cvK = 5, $\gamma = 2^{-2}$, $C = 2^{7}$ for BaCelLoc plant; cvK = 5, $\gamma = 2^{-2}$, $C = 2^{7}$ for BaCelLoc fungi; and cvK = 5, $\gamma = 2^{-2}$, $C = 2^{6}$ for BaCelLoc animal. MultiLoc-GO gave no detailed SP, SE and MCC performance.

As can be seen in Table 2, the independent test on the BaCelLoc datasets shows that GO-TLM achieves 81.25%, 80.45% and 79.46% on plant, fungi and animal, respectively, as compared against 76.00%, 60.00% and 73.00% for MultiLoc-GO, i.e. accuracy increases of 5.25%, 20.45% and 6.46%, respectively. As can be seen from the MCC performance, GO-TLM generally shows less bias towards large subcellular locations than MultiLoc-GO, e.g. Cytoplasm (0.8590 vs. 0.38) and Extracellular (0.8100 vs. 0.58) on plant; Nucleus (0.7246 vs. 0.36) and Cytoplasm (0.6311 vs. 0.27) on fungi; and Nucleus (0.7876 vs. 0.57) and Cytoplasm (0.7648 vs. 0.43) on animal. The improvement in the MCC measure may indicate the significance of incorporating MCC into the GO-TLM kernel weight estimation, as given in formula (2). At the same time, GO-TLM also shows a slight performance decrease on several measure values (in bold italic).

On the Euk-mPLoc data with 22 subcellular compartments, the best parameter combination is cvK = 5, $\gamma = 2^{-3}$, $C = 2^{7}$ and the best Overall Accuracy is 80.38%, a substantial 12.98% increase from the 67.40% of Euk-mPLoc [50] and a sharp 18.13% increase from the 62.25% of Fuzzy K-NN [59]. Fuzzy K-NN was evaluated on the old version of Euk-mPLoc with 22 subcellular locations and 4708 proteins. The multi-location proteins are excluded and only its single-location Measure I is taken as the comparative baseline here. Euk-mPLoc and Fuzzy K-NN gave no detailed performance. As can be seen from Table 3, GO-TLM shows quite satisfactory MCC performance on most subcellular locations, including most small compartments such as Acrosome (0.8764), Microsome (0.8923) and Hydrogenosome (0.7747). Two small compartments achieve poor MCC performance: Cytoskeleton (MCC: 0.1431) and Melanosome (MCC: 0.5523). As compared to the previous models, GO-TLM can help reduce the bias towards the subcellular locations with a larger number of training proteins.

Kernel weight distribution

The weights for the kernels $K_{AA}$, $K_{diAA}$, $K_P$, $K_F$, $K_C$ on the benchmark datasets are illustrated in Figure 2. For each fold of cross validation, the training set is further subjected to cvK-fold cross validation to estimate the five kernels' performance measures (SP, SE and MCC), from which the kernels' weights are estimated using formula (2). Experiments show that the kernel weights vary only slightly across the folds of 5-fold cross validation (take the Euk-mPLoc dataset for instance, see Figure 3). As can be seen from Figure 2, GO-TLM demonstrates a similar kernel weight distribution on all the benchmark datasets. GO features show much stronger discriminative ability than sequence features, and the GO component terms from signature proteins contribute most to the predictive performance, with GO process terms second and GO function terms third. The results may imply that GO component terms are more directly indicative of subcellular location than GO function terms and GO process terms, or that the training proteins have a lower missing rate for component terms than for function and process terms. Take the Euk-mPLoc dataset for example: there are 658 proteins without GO process terms, a missing rate of 11.71%; 755 proteins without GO function terms, a missing rate of 13.44%; and 31 proteins without GO component terms, a missing rate of 0.56%, far lower than that of function terms and process terms. On the other hand, the weights for $K_{AA}$ and $K_{diAA}$ vary little across datasets, while the weights for $K_P$, $K_F$ and $K_C$ vary widely: the higher the $K_C$ weight, the lower the $K_P$ and $K_F$ weights. GO-TLM achieves the highest $K_C$ weight on Euk-mPLoc and the lowest $K_C$ weight on BaCelLoc-fungi. The result may also be explained by the missing rate of GO terms, e.g. a zero missing rate for BaCelLoc-fungi component terms versus a 0.56% missing rate for Euk-mPLoc component terms. BaCelLoc-fungi has a lower missing rate of process terms and function terms, and its process weight and function weight are slightly increased. We can see that the unbalanced GO term distribution contributes much to the variation of GO kernel weights.

Figure 2. Kernel weight estimation using 5-fold cross validation.

Figure 3. Kernel weights estimation on Euk-mPLoc dataset using 5-fold cross validation.

Since the $K_C$ weight is much higher than the other kernel weights, it is worth studying the predictive performance of the model trained on all the kernels except $K_C$, referred to as GO-TLM(~$K_C$). The performance comparison between GO-TLM and GO-TLM(~$K_C$) is illustrated in Figure 4, which shows that the removal of kernel $K_C$ leads to a substantial performance decrease of 14.67%~26.12%. The result demonstrates that the GO component terms play a critical role in protein subcellular localization. However, the model GO-TLM(~$K_C$) still achieves over 80% overall accuracy on the datasets MultiLoc-plant, MultiLoc-animal, BaCelLoc-fungi and BaCelLoc-animal, which demonstrates that the other 4 kernels also benefit protein subcellular localization prediction. Lu Z et al. (2005) have shown that GO function terms are good indicators of protein subcellular localization [78].

Figure 4. Performance comparison between GO-TLM and GO-TLM(~Kc).

Discussion

Traditionally, knowledge in the form of GO terms about homologs can be directly transferred to the target proteins based on signature or homology search. Such knowledge transfer generally benefits research on unknown domains, species or families in biology. However, this process is prone to introducing noise and outliers, partly because sequence similarity does not necessarily imply a similar subcellular localization pattern, molecular function or biological process, and partly because the annotations themselves may be noisy. For this reason, we design a transfer learning system, called Gene Ontology Based Transfer Learning Model (GO-TLM), to share knowledge between homologs for reliable protein subcellular localization. GO-TLM collects GO terms based on signature or homology search against the integrative database InterPro, and then transfers the GO terms to the target proteins for further learning. All the transferred GO terms are used to train a kernel-based SVM classifier, which can effectively reduce the risk of outliers by allowing larger training error in order to achieve the maximum margin between the two-class separating hyperplanes. Thus, quite different GO terms (e.g. an extracellular GO term transferred to nuclear proteins) would be treated as outliers after SVM training. Such a way of constructing a learning system based on knowledge transferred between related domains or data may benefit computational biology in many aspects. As compared to concatenation of heterogeneous feature subspaces, multiple kernel learning has the advantage of explicitly weighing the contribution of each feature subset or kernel to the classification task. GO-TLM uses simple non-parametric cross validation to estimate the kernel weights, serially with one kernel in memory at a time, so that it requires much less time and space than the more complicated semi-definite programming or semi-infinite linear programming. Meanwhile, the kernel weight estimation takes into account both sensitivity and the balance measure MCC, so that GO-TLM would work better in the scenario of an unbalanced training dataset. Experiments reveal that GO component features play a more important role than GO process features and GO function features. With a lower missing rate, GO function terms and GO process terms would further increase the predictive performance.

GO-TLM only uses those GO terms that belong to the problem concerned, so no irrelevant GO term enters the GO feature vector. However, this method of GO feature construction may cause low GO term coverage, that is, a test GO term (a GO term that belongs to a test protein) may find no match in the training GO term set. In such a scenario, we would have to include the test GO term in the training GO term set and re-train the already trained learning system. Re-training is generally time-consuming for large data and complex model selection. To avoid re-training, a statistically correlated GO term could be used to replace a GO term that does not hit the training GO terms; more statistically correlated GO terms could likewise be pulled in for those proteins with very little evidence. Lastly, there is still a large chance for InterProScan to miss GO terms from homologs because of the uneven distribution of GO terms. In such a scenario, we can lower the threshold for InterProScan to draw in GO terms from remote homologs. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors [79], we shall make efforts in our future work to provide a web-server for the method presented in this paper.

Conclusions

In this paper, we design an explicitly weighted kernel learning system, called Gene Ontology Based Transfer Learning Model (GO-TLM), to transfer known knowledge in the form of GO terms from related homologs to the target problem, so as to reduce the risk of outliers and achieve better model performance. On the one hand, homology or signature based GO term transfer enables reliable knowledge sharing between homologs, protein subfamilies or protein families. On the other hand, GO-TLM uses simple and effective non-parametric cross validation to explicitly weigh the contribution of the three aspects of gene ontology. The explicitly weighted kernel combination can better cope with the different missing rates and different discriminative abilities of the three aspects of gene ontology. The kernel weight estimation takes the MCC measure into account, so that GO-TLM can perform better in the scenario of unbalanced data distribution among subcellular locations. Experiments on three benchmark datasets show that GO-TLM significantly outperforms the previous models.