Background

In recent years the number of newly sequenced genomes and proteins has increased dramatically, so reliable and efficient sequence analysis tools are urgently needed. The native subcellular localization of a protein is important for understanding gene and protein function. Aberrant subcellular localization of proteins has been observed in the cells of several diseases, such as cancer and Alzheimer's disease [1]. Knowing a protein's localization is therefore an important step toward identifying its function. Even when a protein's function is already known, its localization may provide insight into the specific enzyme pathway involved [2-5]. Experimental annotations of subcellular localization are often based on operational, biochemical definitions that can be error-prone [1]. For these reasons, predicting subcellular localization has become one of the central problems in bioinformatics.

Several methods have been developed to predict the subcellular localization of proteins quickly. Most fall into two classes: those based on N-terminal sorting signals [5] and those based on amino acid composition. One advantage of the former is its clear biological interpretation [6]. However, in large genome analysis projects, genes are usually assigned automatically, and these assignments are often unreliable for the 5'-regions [7]. Leader sequences can therefore be missed or only partially included, which causes problems for prediction algorithms that depend on them. For this reason, most methods rely on amino acid composition rather than on N-terminal sorting signals alone. Our method is also based on amino acid composition.

Nakashima and Nishikawa [8] showed that intracellular and extracellular proteins differ significantly in their amino acid composition. Several algorithms based on amino acid composition already exist, such as the least Mahalanobis distance [9-11], neural networks [7], the covariant discriminant algorithm [12, 13], Markov chains [14], and support vector machines [15, 16]. Some researchers have combined other features with amino acid composition. Feng and Zhang [17, 18] proposed two methods: one incorporated hydrophobic information, and the other incorporated Zp parameters. More recently, many methods have been developed around new features. Gardy et al. developed PSORT-B, a tool that combines several methods to predict protein subcellular localization in Gram-negative bacteria [19]. Rajesh and Burkhard developed LOC3D, a tool that predicts subcellular localization for eukaryotic proteins of known three-dimensional (3D) structure [1]. Chou introduced pseudo-amino-acid composition for predicting protein subcellular localization [20], and Cai, Zhou and Chou subsequently developed several methods based on this feature [13, 21, 22]. Functional domain composition was used by Chou and Cai [16, 21], who also presented a method incorporating gene ontology [22]. The results of these papers indicate that some of the new features can improve prediction accuracy markedly. Their major shortcoming is that features such as functional domain composition and gene ontology are difficult to obtain for new sequences.

We noted that most of these methods represent a protein by its traditional amino acid composition, which discards all sequence-order information. Chou first introduced a set of sequence-order-coupling numbers, based on the physicochemical distance between amino acids, to reflect the sequence-order effect [23]; strictly speaking, this is a quasi-sequence-order effect. This paper takes sequence-order information into account by a different method. We also consider the 1-v-1 multi-class SVM preferable to the 1-v-r SVM, so the 1-v-1 SVM is used here. On Reinhardt's dataset (three localizations for prokaryotic proteins and four localizations for eukaryotic proteins), the total prediction accuracies of both tests are 100% by the self-consistency test, and 92.9% and 84.14%, respectively, by the jackknife test, an improvement over earlier methods evaluated on the same dataset. We also developed a tool, Esub8, to predict eight subcellular localizations for eukaryotic proteins; its total accuracy is 100% by the self-consistency test and 87% by the jackknife test. These results indicate that Esub8 is a useful tool.

Results

Prediction accuracy

The prediction accuracies of subcellular localization for prokaryotic sequences on Reinhardt's dataset are shown in Table 1. The total accuracy reaches 100% by the self-consistency test and 92.9% by the jackknife test. The prediction accuracies of subcellular localization for eukaryotic proteins on Reinhardt's dataset are shown in Table 2. The total accuracy reaches 100% by the self-consistency test and 84.14% by the jackknife test.

Table 1 Prediction accuracies of traditional subcellular localization for prokaryotic sequences with RBF kernel function
Table 2 Prediction accuracies of traditional subcellular localization for eukaryotic sequences with RBF kernel function

The prediction accuracies of Esub8 are shown in Table 3. The total accuracy reaches 100% by the self-consistency test and 87% by the jackknife test. All kernel functions are Radial Basis Functions (RBF). In cross-validation tests, Esub8 and the two traditional prediction programs all achieved their best results with the same parameters: C = 500, γ = 50.

Table 3 Prediction accuracies of Esub8 with RBF kernel function

Comparison with other methods

In this section, our traditional localization results are compared with results obtained by other methods: Reinhardt and Hubbard's method using neural networks [7], Chou and Elrod's method using a covariant discriminant algorithm [12], Yuan's method based on Markov chains [14], Hua and Sun's method using a 1-v-r SVM [15], and Feng and Zhang's two methods using a Bayesian discriminant function [17, 18]. All of these methods were evaluated on Reinhardt and Hubbard's dataset [7], so the comparison uses an identical dataset. Their input vectors are based on amino acid composition alone, except for Feng and Zhang's two methods, which combine amino acid composition with other features. As shown in Table 4, for prokaryotic sequences, our total accuracy by the self-consistency test is about 10% higher than that of method 3 and about 2.3% higher than that of method 6. Our total accuracy by the jackknife test is about 11.8% higher than that of method 2, 5.9% higher than that of method 3, 3.8% higher than that of method 4, 3.3% higher than that of method 7, 2.5% higher than that of method 6 and 1.5% higher than that of method 5. For eukaryotic sequences, the other methods did not match our result on the self-consistency test; our total accuracy by the jackknife test is about 18.14% higher than that of method 2, 11.14% higher than that of method 4 and 4.74% higher than that of method 5. These comparisons indicate that our method is a different approach to predicting protein subcellular localization that achieves satisfactory results.

Table 4 Comparison of total accuracies with six other methods. Methods 1 to 7 are, in order, our method, Reinhardt and Hubbard's method, Chou and Elrod's method, Yuan's method, Hua and Sun's method, and Feng and Zhang's methods 1 and 2.

Esub8 uses the same method to predict a finer-grained set of localizations (eight classes) for eukaryotic proteins. The data in Table 3 indicate that Esub8 is a satisfactory tool. The Institute of Bioinformatics, Tsinghua University also provides a web server, http://bioinfo.tsinghua.edu.cn/CoupleLoc/eu8.html, for predicting eight localizations of eukaryotic proteins, but its accuracies are unpublished.

Discussion

Knowing the subcellular localization of a new protein sequence is very important for understanding its function, and predicting subcellular localization has become one of the central problems in bioinformatics. In this paper, we have developed a novel tool for predicting eight subcellular localizations of proteins. We also tested our method on Reinhardt's dataset. The proposed method differs from existing methods in its use of the 1-v-1 SVM and of the order information of the protein sequence. The experimental results show that our method is a different and satisfactory approach to predicting protein subcellular localization. Furthermore, our method shares an advantage of other methods based on amino acid composition: it is robust to errors in gene 5'-region annotation. We believe that Esub8 is a useful and efficient tool for protein localization prediction and an important auxiliary tool for protein function prediction.

We have also found that the parameters of the SVM play an important role in the prediction results. For this problem, the RBF kernel function performs better than the linear and polynomial kernel functions. In the cross-validation experiments, we obtained the best results with C = 500, γ = 50, both for Esub8 and for the traditional subcellular localizations. We expect that SVMs with more advanced kernel functions could achieve better results, and that combining our method with a method based on N-terminal sorting signals could improve the results further. Looking for better features is also very important. As described above, several new features have been used, such as hydrophobic information [17], Zp parameters [18], pseudo-amino-acid composition [20], and functional domain composition [16, 21], and some of these methods achieved satisfactory results. More recently, Keun-Joon and Minoru presented an SVM-based method that uses amino acid pairs as features [25]. Finally, the method can be applied to new datasets as they become available.

Conclusions

In this paper, we proposed a novel tool to predict subcellular localizations for eukaryotic proteins based on amino acid composition alone. The total prediction accuracies of the two traditional tests are both 100% by the self-consistency test, and 92.9% and 84.14%, respectively, by the jackknife test. Esub8 also obtains excellent results: its total prediction accuracy is 100% by the self-consistency test and 87% by the jackknife test. As described above, our method is a different approach to predicting protein subcellular localization that achieves satisfactory results. We believe that Esub8 is a useful and efficient tool for protein localization prediction and an important auxiliary tool for protein function prediction.

Methods

Materials

The training dataset used in Esub8 was downloaded from http://bioinfo.tsinghua.edu.cn/CoupleLoc, Institute of Bioinformatics, Tsinghua University. The dataset used to test our method on the three traditional subcellular localizations for prokaryotic proteins and the four subcellular localizations for eukaryotic proteins is the same as that used by Reinhardt and Hubbard [7]. For more details, please contact the above authors. Table 5 shows the dataset used in Esub8, which includes 8305 eukaryotic sequences classified into eight localization classes (chloroplast, cytoplasm, extracellular, Golgi apparatus, lysosome, mitochondria, nucleus and peroxisome). Table 6 shows Reinhardt and Hubbard's dataset, which includes 997 prokaryotic sequences classified into three localization classes (extracellular, periplasmic and cytoplasmic) and 2427 eukaryotic sequences classified into four localization classes (extracellular, mitochondrial, cytoplasmic and nuclear).

Table 5 The dataset used in Esub8.
Table 6 The final sequences in each location class of the dataset

Feature vector

In many methods, the feature used to classify protein subcellular localizations is mainly amino acid composition [7-9, 12, 14, 15, 17, 18]. In these papers, no matter how long the protein sequence is, the input vector is twenty-dimensional because there are twenty kinds of amino acid in biological proteins. Each element of the feature vector denotes the frequency of occurrence (or tendency) of one amino acid, so a feature vector is an element of $\mathbb{R}^{20}$. One drawback of this representation is that it neglects the order information of the protein sequence: no amino acid order can be recovered from the feature vector. This order information may play an important role in protein subcellular localization.

In this paper, we present a novel approach to capturing sequence-order information: a protein sequence is divided into two halves of equal length. For the first half, we compute the amino acid composition to construct a 20D feature vector, and we do the same for the second half. A forty-dimensional vector is then constructed by concatenating the first 20D feature vector with the second, so the new feature vector is an element of $\mathbb{R}^{40}$. The results indicate that the new 40D feature vector performs better than the 20D feature vector, which in turn suggests that amino acid order information plays an important role in protein subcellular localization.
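The half-split feature construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Matlab implementation: the function names are hypothetical, and two details the text does not specify are assumptions here (for odd-length sequences the extra residue goes to the second half, and non-standard residue letters are ignored).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition(segment):
    """Return the 20-dimensional amino acid frequency vector of a segment."""
    counts = Counter(segment)
    # Assumption: letters outside the 20 standard residues are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def half_split_features(sequence):
    """Concatenate the compositions of the two halves into a 40D vector."""
    sequence = sequence.upper()
    # Assumption: for odd lengths the extra residue falls in the second half.
    mid = len(sequence) // 2
    return composition(sequence[:mid]) + composition(sequence[mid:])


# Toy sequence, invented for illustration only.
features = half_split_features("MKKLAVGGWW")
print(len(features))  # 40 values: 20 for each half
```

Two sequences with the same overall composition but different residue placement, for example "AAKK" and "KKAA", receive different 40D vectors, which is exactly the order information the 20D representation loses.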

Multi-Class SVM

SVM was introduced by Vapnik [26], and has been applied in many classification and regression problems. The standard SVM [26] was originally developed for dichotomic classification problems (binary classification). A classification problem usually involves training data and testing data that consist of some data instances. Each instance in training data contains one class label and one feature vector. The goal of SVM is to construct a classifier that classifies the data instances in the testing data.

For a binary classification problem, assume that we have feature vectors $x_i$ and class labels $y_i$ ($i = 1, 2, \ldots, N$, where $N$ is the number of samples), with $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. For protein subcellular localization, the input vector dimension is 40, as described in the previous section. The SVM requires the solution of the following optimization problem:

$$\min_{w,\, b,\, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i$$

$$\text{subject to} \quad y_i \left( w^T \varphi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0. \qquad (1)$$

Here, the feature vectors $x_i$ are mapped into a higher-dimensional space $H$ by the function $\varphi$, and the SVM then constructs an Optimal Separating Hyperplane (OSH) that maximizes the margin in that space. $C > 0$ is the penalty factor of the error term. Furthermore, $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ is called the kernel function. There are several typical kernel functions:

Polynomial kernel function: $K(x_i, x_j) = (x_i^T x_j + 1)^d$,    (2)

Radial Basis Function (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$,    (3)

Sigmoid function: $K(x_i, x_j) = \tanh(\gamma \, x_i^T x_j + c)$.    (4)

Here, $d$ (the polynomial degree, not the input dimension), $\gamma$ and $c$ are kernel parameters.
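To make the three kernels concrete, the following pure-Python sketch evaluates each of them on small vectors. The function names and the default parameter values are illustrative assumptions, except that the RBF default uses the value γ = 50 reported in this paper; the sigmoid kernel is written in its standard form with the inner product of both inputs.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, d=2):
    """Polynomial kernel (x . y + 1)^d; d = 2 is an illustrative default."""
    return (dot(x, y) + 1.0) ** d


def rbf_kernel(x, y, gamma=50.0):
    """RBF kernel exp(-gamma * ||x - y||^2), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, c=0.0):
    """Sigmoid kernel tanh(gamma * x . y + c); defaults are illustrative."""
    return math.tanh(gamma * dot(x, y) + c)


# Toy vectors, invented for illustration only.
u = [0.2, 0.3, 0.5]
v = [0.3, 0.3, 0.4]
print(rbf_kernel(u, u))  # identical vectors give the maximum value 1.0
print(rbf_kernel(u, v))  # decays as the squared distance grows
```

Because composition vectors contain frequencies between 0 and 1, squared distances between them are small, so a comparatively large γ is needed before the RBF kernel distinguishes nearby proteins sharply.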

When binary SVMs are applied, a multi-class classification problem is commonly solved by a decompose-and-reconstruct procedure. Protein subcellular localization is a multi-class problem, so we decompose it into several binary classifications and then combine their outputs. In this paper, we use the 1-v-1 SVM. For the 1-v-1 multi-class SVM, the decomposition constructs all possible binary machines from the K-class training samples, that is, K(K - 1)/2 machines, each trained on only two of the K classes. The usual reconstruction method is a parallel structure: when a new entry is presented, each trained binary machine provides one output concerning the two classes it was trained on; an algorithm then interprets these two-class outputs to determine the label assigned to the input. Several algorithms exist for combining the outputs. Voting schemes are used in this paper: the output scale of an SVM is not robust, since it depends only on the support vectors, so voting is more practical than comparing raw outputs [27].
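The 1-v-1 decomposition and voting step can be sketched as follows. This is a hedged illustration of the voting scheme only: the binary classifiers here are toy nearest-centroid rules on one-dimensional inputs standing in for trained SVMs, the centroid values are invented, and the tie-breaking rule (first class in list order) is an assumption the paper does not specify.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, binary_classifiers):
    """Majority vote over all pairwise classifiers.

    binary_classifiers maps a class pair (a, b) to a function that
    returns either a or b for the input x.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](x)] += 1
    # Assumption: ties are broken in favour of the earlier class in `classes`.
    return max(classes, key=lambda label: votes[label])


# Toy stand-in for trained binary SVMs (hypothetical values).
centroids = {"cytoplasmic": 0.0, "periplasmic": 1.0, "extracellular": 2.0}
classes = list(centroids)


def make_binary(a, b):
    """Build a two-class rule that picks the class with the nearer centroid."""
    def decide(x):
        return a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b
    return decide


classifiers = {(a, b): make_binary(a, b) for a, b in combinations(classes, 2)}
prediction = one_vs_one_predict(1.9, classes, classifiers)
print(len(classifiers), prediction)
```

With three classes the decomposition yields three binary machines; with the eight classes of Esub8 it yields 8 × 7 / 2 = 28.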

Implementation of the prediction system

In this paper, the 1-v-1 SVM was used to construct a protein subcellular localization system, Esub8, based on a 40D amino acid input vector. Esub8 classifies a protein sequence into one of eight classes. We also tested our method on three traditional localizations for prokaryotic proteins and four localizations for eukaryotic proteins. Esub8 and the other programs were all written in Matlab using the software package Osusvm, which was developed by Junshui Ma and Yi Zhao et al. based on SVMlight [28]. Our hardware platform is a PC running at 2.4 GHz. For the traditional localizations, the self-consistency test finishes within one minute; the jackknife test takes about two hours for all eukaryotic sequences and about 10 minutes for all prokaryotic sequences. For the eight localizations, the self-consistency test finishes within several minutes; the jackknife test takes about four days for all 8305 eukaryotic sequences. Predicting the subcellular localization of an unknown sequence with Esub8 takes several seconds; hence, Esub8 is an efficient subcellular localization prediction tool.

Self-consistency test and Jackknife test

Prediction results are usually evaluated by the self-consistency and jackknife tests. Although the sub-sampling test is still widely used in the biological literature, the self-consistency and jackknife tests are more objective and rigorous; see Chou and Zhang's paper for a comprehensive discussion [29]. The former reflects the consistency of the prediction system, and the latter reflects the extrapolating effectiveness of the algorithm. In the self-consistency test, the subcellular localization of each protein in the dataset is identified in turn using the rule parameters derived from the same training dataset, which therefore already includes the query protein. Because the same proteins are used both to train the predictive system and to test it, the error is underestimated and the success rate is inflated, so a more reliable and rigorous method, the jackknife test, is also needed. Nevertheless, the self-consistency test remains necessary because it reflects the self-consistency of the predictive system [30, 31].

The jackknife test is considered one of the most effective and objective test methods in statistical prediction. In the jackknife test, each protein in the dataset is singled out in turn as an independent test sample, and all the parameters of the SVM are derived by training on all the remaining proteins. Over the course of the jackknife test, each protein is the test sample exactly once and belongs to the training dataset in every other round.
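The two test procedures can be sketched generically, independent of the classifier. In the illustration below, `train` and `predict` are hypothetical callables, and a one-nearest-neighbour rule on invented one-dimensional data stands in for the SVM; it is chosen only because it makes the optimism of the self-consistency test easy to see.

```python
def self_consistency_accuracy(samples, labels, train, predict):
    """Train on the whole dataset, then test on the same samples."""
    model = train(samples, labels)
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


def jackknife_accuracy(samples, labels, train, predict):
    """Leave each sample out in turn, train on the rest, test on it."""
    hits = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        hits += predict(model, samples[i]) == labels[i]
    return hits / len(samples)


# Stand-in classifier (NOT an SVM): one-nearest-neighbour on scalars.
def train_1nn(samples, labels):
    return list(zip(samples, labels))


def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


# Toy data, invented for illustration only.
samples = [0.0, 0.1, 0.2, 1.0, 1.1, 0.55]
labels = ["a", "a", "a", "b", "b", "b"]

print(self_consistency_accuracy(samples, labels, train_1nn, predict_1nn))
print(jackknife_accuracy(samples, labels, train_1nn, predict_1nn))
```

On this toy data the self-consistency accuracy is perfect because every sample is its own nearest neighbour, while the jackknife test misclassifies the sample at 0.55 once it is withheld. The jackknife loop retrains the model N times, which is consistent with the long running times reported for the 8305-sequence dataset.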

Prediction system assessment

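The total prediction accuracy and the accuracy for each localization are given by the following equations:

$$\text{Total accuracy} = \frac{\sum_{i=1}^{k} p(i)}{N}$$

$$\text{Accuracy}(i) = \frac{p(i)}{\text{obs}(i)}$$

As described by Hua and Sun [15], $N$ is the total number of sequences, $k$ is the number of classes, $\text{obs}(i)$ is the number of sequences observed in localization $i$, and $p(i)$ is the number of correctly predicted sequences of localization $i$.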
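A short sketch of the assessment follows. It assumes the usual reading of these definitions, namely that the total accuracy is the sum of p(i) over all k localizations divided by N and that the accuracy for localization i is p(i) / obs(i). The function name and the label lists are invented for illustration and are not results from this paper.

```python
from collections import Counter


def accuracies(observed, predicted):
    """Return (total accuracy, per-localization accuracy) for two label lists."""
    obs = Counter(observed)  # obs(i): sequences observed in localization i
    # p(i): correctly predicted sequences of localization i
    p = Counter(o for o, q in zip(observed, predicted) if o == q)
    total = sum(p.values()) / len(observed)  # sum of p(i) divided by N
    per_class = {label: p[label] / obs[label] for label in obs}
    return total, per_class


# Toy labels, invented for illustration only.
observed = ["nuc", "nuc", "cyt", "cyt", "cyt", "ext"]
predicted = ["nuc", "cyt", "cyt", "cyt", "ext", "ext"]

total, per_class = accuracies(observed, predicted)
print(total, per_class)
```

The per-localization accuracy is normalized by the observed class size, so a small class such as "ext" in this toy example can score perfectly even when the total accuracy is lower.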