Background

Protein palmitoylation is a reversible lipid modification that plays important roles in cell signaling associated with cellular dynamics and plasticity. However, very little is known about the molecular mechanisms that underlie this modification and its regulation in cells. Palmitoylation, also known as S-acylation, is one of the most ubiquitous post-translational modifications (PTMs). It reversibly attaches palmitate, a 16-carbon saturated fatty acid (C16:0), to cysteine residues in protein substrates through a thioester linkage [1–6]. Biochemically, palmitoylation increases the hydrophobicity of proteins and thereby promotes protein-membrane association [1–6]. Palmitoylation also modifies numerous proteins to control protein-protein interactions [7–9], intracellular trafficking [10, 11], lipid raft targeting [12, 13], and protein activity [8, 14]. Moreover, palmitoylation has been implicated in a variety of biological and physiological processes, including signal transduction [14, 15], mitosis [16], neuronal development [3, 6], and apoptosis [17]. Although protein palmitoylation has attracted extensive attention, its molecular mechanisms remain elusive.

Identification of palmitoylation sites is essential for a better understanding of the molecular regulation of the palmitoylation process. To date, only a few palmitoylation sites have been experimentally identified. Although several efficient techniques, such as mass spectrometry (MS), have been employed recently, most of the known palmitoylation sites were mapped by mutagenesis of candidate cysteine residues with conventional biochemical methods. The features that determine substrate specificity for palmitoylation are still unclear, and most previous studies have proposed that there is no common, canonical consensus sequence/motif for palmitoylation [1, 3–5].

Moreover, only a few palmitoyltransferases have been identified, although palmitoylation of proteins has been known for many years [2, 4, 18, 19]. Palmitoylation of proteins can be carried out in both enzyme-dependent and enzyme-independent manners [5, 18–20]. These intrinsic but diverse characteristics of palmitoylation make it very difficult to choose appropriate candidate cysteine residues in the substrates for further experimental manipulation. Thus, in silico prediction of palmitoylation sites with a suitable algorithm is urgently needed and can inform further experimental design.

Previously, we developed a computational program named CSS-Palm, based on a Clustering and Scoring Strategy [21]. In that work, the training data set was curated from the scientific literature (PubMed) and contained 210 experimentally verified palmitoylation sites from 83 distinct proteins (referred to as the old data set). Owing to the fast pace of research in this area, more palmitoylation sites have been identified since the publication of CSS-Palm. After surveying recent progress and eliminating redundancy, the final data set includes 245 non-homologous sites from 105 proteins (referred to as the new data set; see Table 1). We then employed several machine learning algorithms, including Naïve Bayes [22], Support Vector Machines (SVMs) [23] and RBF Networks [24], for palmitoylation site prediction. The window length for a potential palmitoylated peptide was also optimized. The prediction accuracy ranges from 82% to 86%. By comparison, the Naïve Bayes approach achieves the best accuracy: 85.79% for 3-fold cross-validation, 86.72% for 8-fold cross-validation and 86.74% for Jack-Knife validation, with a window length of six. We therefore constructed a computational web service, NBA-Palm (prediction of palmitoylation sites implemented in the Naïve Bayes algorithm). Its prediction performance is comparable with that of our previous work, CSS-Palm.

Table 1 The detailed description of data set.

Results & discussions

Functional analysis of Palmitoylated Proteins

To explore the molecular determinants responsible for protein palmitoylation, we downloaded the GO annotation files for UniProt from EBI-GOA [25] for processing. In our non-redundant data set of 105 palmitoylated proteins, we observed 455 distinct GO categories. Table 2 shows the top five Gene Ontology (GO) entries for the biological processes, molecular functions and cellular components of palmitoylated proteins.

Table 2 Top five Gene Ontology (GO) groups of biological processes, molecular functions and cellular components in palmitoylated proteins.

The most abundant GO item of biological process in which palmitoylated proteins are implicated is "signal transduction" (26 proteins). The other four biological processes are "G-protein coupled receptor protein signaling pathway" (21 proteins), "transport" (16 proteins), "ion transport" (7 proteins) and "cell adhesion" (7 proteins). The most enriched GO group of molecular function is "protein binding" (41 proteins), while the other four highly abundant molecular functions are "receptor activity" (27 proteins), "signal transducer activity" (25 proteins), "G-protein coupled receptor activity" (15 proteins) and "rhodopsin-like receptor activity" (14 proteins). Finally, the most frequent GO entry of cellular component is "membrane" (70 proteins), and the other four highly frequent cellular components are "integral to membrane" (54 proteins), "plasma membrane" (24 proteins), "integral to plasma membrane" (19 proteins) and "endoplasmic reticulum" (9 proteins).

Taken together, the computational analyses support the notion that palmitoylated proteins carry out diverse cellular functions. This result points to two conclusions. First, the data set is general enough to serve as a training data set for our prediction work. Second, computational tools that can accelerate research on palmitoylation function are valuable.

Performance of NBA-Palm

We carried out 3-fold cross-validation, 8-fold cross-validation and Jack-Knife validation to evaluate the performance of NBA-Palm (Table 3 and Table 4). On the old data set, NBA-Palm achieves its best average MCC of 0.594 with a window length of six. On the new data set, the best average MCC is 0.548 with the same optimized window length of six. The prediction performances on the two data sets are thus similar, although performance on the new data set is slightly lower. To find the reason for this decrease, we built sequence logos [26] for the old and new data sets (Figure 1a and Figure 1b). Both logos show a Leucine/Cysteine-rich region around palmitoylation sites. Comparing the two logos shows that the pattern in the old data set is slightly stronger than that in the new data set, which may explain why NBA-Palm performs slightly worse on the new data set.

Table 3 Comparison of the prediction performance for three machine learning algorithms on old data set.
Table 4 Comparison of the prediction performance for three machine learning algorithms on new data set.
Figure 1 The sequence logos of palmitoylation sites. Both logos show a Leucine/Cysteine-rich region around palmitoylation sites. A taller letter indicates that the residue occurs more frequently at that position. (a) Old data set; (b) new data set.

Comparison of Prediction Performance with several machine learning algorithms

Besides Naïve Bayes, we also applied two additional machine learning algorithms, RBF networks and support vector machines (SVMs), to predict palmitoylation sites. Table 3 and Table 4 show the detailed performances of the three algorithms on the old and new data sets, respectively. Several conclusions can be drawn. Firstly, despite its simple structure, Naïve Bayes is overall the best algorithm, although its performance is only slightly better than that of the other two. Secondly, the best window lengths for the three algorithms are not identical: on the new data set, for example, they are 6 for Naïve Bayes, 8 for RBF networks and 7 for SVMs, according to the average MCC over 3-fold cross-validation, 8-fold cross-validation and Jack-Knife validation. Thirdly, performance in Jack-Knife tests is often better than in 3-fold and 8-fold cross-validation because more data are used for training and fewer for testing. Among the three algorithms, SVM shows the largest differences in MCC between 3-fold cross-validation, 8-fold cross-validation and the Jack-Knife test, while Naïve Bayes shows the smallest. This suggests that Naïve Bayes may be the most robust of the three when the amounts of training and test data change. The window length of the Naïve Bayes algorithm was set to six by comparing the average MCC. Hence, Naïve Bayes is a simply structured algorithm with high performance and robustness, which makes it well suited to biological classification problems such as this one.

Comparison with the previously described method CSS-Palm

We compared the performance of NBA-Palm with that of the previously established method CSS-Palm [21] on the same old data set. Details are shown in Table 5. In the Jack-Knife validation, NBA-Palm performs comparably with CSS-Palm in all metrics. In 3-fold cross-validation, however, NBA-Palm achieves much higher MCCs. This is probably due to the volatility of 3-fold cross-validation, which trains on less data (2/3 of the whole data set) and makes predictions on more test data (1/3 of the whole data set), whereas Jack-Knife validation uses all data but one sample for training. The result suggests that the robustness of the Naïve Bayes method probably stems from its probabilistic nature. This is consistent with the conclusion reached above in the comparison of Naïve Bayes and SVM. In contrast, CSS-Palm is based on sequence/peptide homology scoring and clustering, so the absence of a key sequence/peptide from the training data might cause large changes in the clustering results. CSS-Palm therefore depends heavily on the training data and is less robust.

Table 5 Comparison of prediction performances between NBA-Palm and CSS-Palm.

Perspective of Future work

Our work points to several paths for further research. Firstly, as proteomic techniques continue to improve, more and more palmitoylation sites will be identified, and we expect that prediction accuracy will improve further with more training data. Secondly, other machine learning methods could be applied, e.g., decision trees [24] and hidden Markov models [27]. These approaches could be used separately or in combination to build potentially better models. Thirdly, evolutionary information, for example phylogenetic conservation between human and mouse, could also be integrated into the prediction system to improve its accuracy.

Conclusion

In this work, we present a new method for protein palmitoylation site prediction based on Naïve Bayes, which achieves accuracies of about 86%. A comparison between Naïve Bayes, RBF networks and SVMs showed that Naïve Bayes performs slightly better than the other two methods. We also compared NBA-Palm with our previously established method CSS-Palm. The comparison indicates that NBA-Palm offers better computational efficiency than CSS-Palm with comparable prediction accuracy. These results indicate that Naïve Bayes is an effective classification algorithm for biological problems of this kind. In addition, with its high specificity and sensitivity, NBA-Palm could be a valuable computational tool for functional proteomics.

Methods

Data Preparation

Here we define the cysteine (C) residues that undergo palmitoylation as positive data (+), while non-palmitoylated cysteine residues are regarded as negative data (-). Previously, we collected 210 experimentally verified palmitoylation sites from 83 proteins [21]. Since palmitoylation-related research advances rapidly, more and more palmitoylated sites have been identified and reported. We searched PubMed with the keyword "palmitoylation" to collect new palmitoylation sites. The updated new data set contains 266 sites from 111 proteins (published before March 31, 2006). We then retrieved the primary sequences of these proteins from the Swiss-Prot/TrEMBL database [28]. The final curated data set is available upon request.

The positive data (+) set for training might contain several homologous sites from homologous proteins. If the training data are highly redundant, with too many homologous sites, the prediction accuracy will be overestimated. To avoid this overestimation, we clustered the protein sequences from the positive data (+) set with a threshold of 30% identity using BLASTCLUST [29], a program that clusters highly homologous sequences into distinct groups. If two proteins were similar with ≥30% identity, we re-aligned the proteins with BL2SEQ, a program in the BLAST package [29], and checked the results manually. If two palmitoylation sites from two homologous proteins were at the same position after sequence alignment, only one was retained and the other was discarded. Thus, we obtained a high-quality, non-redundant positive data (+) set with 245 palmitoylation sites from 105 proteins.

As previously described [30, 31], the negative (-) sites were composed of non-annotated cysteine residues in the same proteins from which the positive (+) sites were taken, instead of proteins randomly picked from the Swiss-Prot/TrEMBL database. Thus, both (+) and (-) sites are extracted from the same protein sequences, which makes our test stricter. Obviously, the (-) set may contain some false negatives, that is, cysteine residues that in fact undergo palmitoylation but have not been characterized so far. Correct predictions at such sites are counted as false positives, so the false positive rate of any computational approach will be overestimated. However, without a high-quality gold-standard (-) set, this overestimation is inevitable.
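The labeling rule above can be sketched as follows. This is only a minimal illustration, assuming a protein is given as a plain sequence string and its annotated palmitoylation sites as a set of 0-based positions; the function name `label_cysteines`, the input format and the toy sequence are hypothetical and are not taken from the original pipeline.

```python
def label_cysteines(sequence, palmitoylated_positions):
    """Split every cysteine of one protein into (+) and (-) sites.

    sequence: protein sequence in one-letter code.
    palmitoylated_positions: 0-based indices of annotated sites
    (an assumed input format, not the authors' file layout).
    """
    positives, negatives = [], []
    for i, residue in enumerate(sequence):
        if residue != "C":
            continue
        if i in palmitoylated_positions:
            positives.append(i)
        else:
            # Non-annotated cysteine from the same protein: treated as (-),
            # although it may be a true site that is not yet characterized.
            negatives.append(i)
    return positives, negatives


# Toy example (made-up sequence, not from the data set).
pos, neg = label_cysteines("MGCLGNSKCEDC", {2})
print(pos, neg)
```

Because every non-annotated cysteine becomes a (-) site, the (-) set is usually much larger than the (+) set, which is one reason a balanced measure such as MCC is reported alongside accuracy.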

To compare the prediction performance of NBA-Palm with that of our previous tool CSS-Palm [21], we used both the old data set from CSS-Palm and the new updated data set. Detailed information on the data sets is listed in Table 1.

Algorithm design and validation

Sequence coding

We employed a traditional sliding window strategy to represent a potentially palmitoylated peptide (PPP). Given the window length n, a fragment of 2n + 1 residues centered on the candidate cysteine was adopted to represent a PPP. Since there is always a C in the middle of a PPP, we did not include the center site in the encoded fragment, which leaves 2n flanking residues. We chose an orthogonal binary coding scheme to transform protein sequences into numeric vectors. For example, Glycine was designated as 00000000000000000001, Alanine as 00000000000000000010, and so on. The length of the final feature vector representing a candidate site is therefore n × 2 × 20. Values of n from 3 to 8 were tested to determine the optimal window length.
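The coding scheme can be illustrated with a short sketch. Several details here are assumptions rather than facts from the text: the residue ordering in `AMINO_ACIDS` is arbitrary (so the bit position of each residue differs from the Glycine/Alanine example above), and positions that fall outside the protein or hold non-standard residues are encoded as all zeros, since the padding rule is not described. The function name `encode_site` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # arbitrary ordering chosen for this sketch


def encode_site(sequence, site, n):
    """Return the 2 * n * 20 binary feature vector for the cysteine at `site`.

    Positions beyond the protein termini (or non-standard residues)
    are encoded as all zeros; this padding rule is an assumption.
    """
    if sequence[site] != "C":
        raise ValueError("site must point at a cysteine")
    features = []
    # n residues upstream and n residues downstream; the central C is skipped.
    for offset in list(range(-n, 0)) + list(range(1, n + 1)):
        pos = site + offset
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AMINO_ACIDS.find(sequence[pos])
            if idx >= 0:
                one_hot[idx] = 1
        features.extend(one_hot)
    return features


# Toy example: window length n = 3 around the cysteine at index 2.
vec = encode_site("MGCLGNSKCEDC", site=2, n=3)
print(len(vec), sum(vec))
```

With n = 6, the window length reported as optimal for Naïve Bayes, each candidate site would be represented by 6 × 2 × 20 = 240 binary features.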

The Machine Learning Algorithms

Naïve Bayes is a classification model based on Bayes' theorem [22]. Naïve Bayes classifiers assume that the effect of a variable value on a given class is independent of the values of the other variables. This assumption is called class conditional independence. It is made to simplify the computation, and in this sense the method is considered "naïve". Given a potential palmitoylation site X, described by the 0–1 feature vector $(x_1, x_2, \ldots, x_n)$ from the section above (here n denotes the length of the feature vector, not the window length), we look for the class C that maximizes the likelihood $P(X \mid C) = P(x_1, x_2, \ldots, x_n \mid C)$, where C can be "palmitoylation" or "non-palmitoylation". The assumption of class conditional independence allows us to decompose the likelihood into a product of simpler probabilities: $P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$. Despite its simple structure and ease of implementation, Naïve Bayes often performs comparably with other algorithms, such as SVMs and neural networks.
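The decomposition of the likelihood into per-feature probabilities can be made concrete with a small sketch. It is not the WEKA implementation used in this work: the Laplace smoothing constant `alpha`, the inclusion of the class prior P(C) in the score (the usual Naïve Bayes decision rule), and all function names are assumptions made for illustration, and the toy data are invented.

```python
import math


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Estimate P(C) and P(x_i = 1 | C) from binary feature vectors.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    model = {}
    n_features = len(vectors[0])
    for c in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / len(vectors)
        p_one = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
        model[c] = (prior, p_one)
    return model


def log_score(model, x, c):
    """log P(C) + sum_i log P(x_i | C), using class conditional independence."""
    prior, p_one = model[c]
    total = math.log(prior)
    for xi, p in zip(x, p_one):
        total += math.log(p if xi == 1 else 1.0 - p)
    return total


def predict(model, x):
    """Pick the class with the largest score."""
    return max(model, key=lambda c: log_score(model, x, c))


# Toy data: 3 binary features, two classes (made-up values).
X = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]]
y = ["palmitoylation", "palmitoylation",
     "non-palmitoylation", "non-palmitoylation"]
model = train_naive_bayes(X, y)
print(predict(model, [1, 0, 1]))
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when the feature vector has a few hundred entries.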

The support vector machine (SVM) is a relatively new machine learning method that has been applied to many kinds of pattern recognition problems. The principle of the SVM method is to map the samples into a high-dimensional Hilbert space and to seek a separating hyperplane in that space. The separating hyperplane, called the optimal separating hyperplane, is chosen so as to maximize its distance from the closest training samples. As a supervised machine learning technique, SVM is well founded theoretically on Statistical Learning Theory [23]. Recently, SVM has been successfully adopted to solve many biological problems, such as predicting protein subcellular locations [32], protein secondary structures [32, 33], tumor classification [34] and phosphorylation sites [30, 31]. In the present work, the feature vector of each potential palmitoylation site was mapped into a higher-dimensional space through a polynomial kernel function.
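As a rough illustration of the kernel step, the sketch below evaluates a polynomial kernel of the common form K(x, y) = (x · y + c)^d on two binary vectors. The degree and offset used in this work are not reported here, so `degree=2` and `coef0=1.0` are purely illustrative defaults, and the function name is hypothetical.

```python
def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """K(x, y) = (x . y + coef0) ** degree.

    degree and coef0 are illustrative defaults; the source does not
    report the values that were used.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


# Two toy peptides of two positions over a 3-letter alphabet (made-up).
u = [1, 0, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0, 1]
print(polynomial_kernel(u, v))
```

For orthogonally coded peptides, the dot product simply counts the flanking positions at which the two peptides carry the same residue, so the kernel value grows with the number of shared positions.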

The RBF network is a kind of multi-layer, feed-forward artificial neural network [24]. An RBF network consists of three layers: the input layer, the hidden layer and the output layer. The input layer broadcasts the coordinates of the input vector to each of the nodes in the hidden layer. Each node in the hidden layer then produces an activation based on its associated radial basis function. Finally, each node in the output layer computes a linear combination of the activations of the hidden nodes. How an RBF network reacts to a given input is completely determined by the activation functions associated with the hidden nodes and the weights associated with the links between the hidden layer and the output layer. In our model, after the feature vectors were fed into the input layer, the weights of the links between nodes were iteratively updated until convergence. The output layer finally produced the decision of "palmitoylation" or "non-palmitoylation".
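The forward pass described above can be sketched in a few lines. This is a generic illustration that assumes Gaussian basis functions and a single output node; the centers, widths and weights below are invented, training is not shown, and the details of the WEKA implementation used in this work may differ.

```python
import math


def rbf_forward(x, centers, widths, weights, bias=0.0):
    """Forward pass of a one-output RBF network with Gaussian hidden units."""
    activations = []
    for center, width in zip(centers, widths):
        # Hidden node: activation depends only on the distance to its center.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
        activations.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    # Output node: linear combination of the hidden activations.
    return bias + sum(w * a for w, a in zip(weights, activations))


# Toy network with two hidden nodes (all parameters are made up).
out = rbf_forward(
    x=[1, 0],
    centers=[[1, 0], [0, 1]],
    widths=[1.0, 1.0],
    weights=[1.0, -1.0],
)
print(out)
```

A decision such as "palmitoylation" versus "non-palmitoylation" would then follow from thresholding the output, with the threshold being another free choice of the sketch.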

The Jack-Knife validation and n-fold cross-validation

The prediction performance of NBA-Palm was evaluated by 3-fold cross-validation, 8-fold cross-validation and Jack-Knife validation, for convenient comparison with the previous method CSS-Palm. In the Jack-Knife validation, also called "leave-one-out" cross-validation, each sample in the data set is singled out in turn as an independent test sample, and all the remaining samples are used as training data. This process is repeated until every sample has been used as the test sample once. In n-fold cross-validation, all the (+) sites and (-) sites were combined and then divided equally into n parts, keeping the same distribution of (+) and (-) sites in each part. Then n-1 parts were merged into a training data set, while the remaining part was taken as the test data set, and this was repeated for each part in turn. The average accuracy over the n folds was used to estimate the performance. All models were implemented in the WEKA software package [35].
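The two validation schemes can be sketched as index-splitting routines. This is an illustrative sketch rather than the WEKA procedure: the round-robin assignment used to keep the (+)/(-) ratio in each part, the random seed, and the function names are assumptions, and the toy labels are invented.

```python
import random


def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample index to one of n_folds parts, class by class,
    so every part keeps roughly the same (+)/(-) ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            folds[k % n_folds].append(i)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices); each part is the test set once."""
    for k, test in enumerate(folds):
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test


def jackknife_splits(n_samples):
    """Leave-one-out: each sample is the single test sample exactly once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


labels = [1] * 6 + [0] * 12  # toy labels, not the real data set
folds = stratified_folds(labels, n_folds=3)
print([len(f) for f in folds])
```

In 3-fold cross-validation each model sees two thirds of the data, whereas a Jack-Knife model sees all samples but one, which is the difference in training-set size discussed in the comparison of validation results.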

Performance measurements

We adopted four frequently used measurements: accuracy (Ac), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC). Accuracy (Ac) is the proportion of correct predictions over both the positive (+) and negative (-) data sets, while sensitivity (Sn) and specificity (Sp) are the proportions of correct predictions in the positive (+) and negative (-) data sets, respectively. However, when the numbers of positive and negative data differ greatly, the Matthews correlation coefficient (MCC) should be included to evaluate the prediction performance. The value of MCC ranges from -1 to 1, and a larger MCC value indicates better prediction performance.

Among the data predicted as positive by NBA-Palm, the real positives are defined as true positives (TP), while the others are defined as false positives (FP). Among the data predicted as negative by NBA-Palm, the real positives are defined as false negatives (FN), while the others are defined as true negatives (TN). Sensitivity (Sn), specificity (Sp), accuracy (Ac) and the Matthews correlation coefficient (MCC) are defined as follows:

$$Sn = \frac{TP}{TP + FN} \tag{1}$$

$$Sp = \frac{TN}{TN + FP}$$

$$Ac = \frac{TP + TN}{TP + FP + TN + FN} \tag{2}$$

$$MCC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}} \tag{3}$$
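The four measurements defined above can be computed directly from the confusion-matrix counts, as in the following sketch. The function name `performance` and the example counts are invented for illustration, and returning an MCC of 0 when a marginal sum is zero is a convention assumed here, not one stated in the text.

```python
import math


def performance(tp, fp, tn, fn):
    """Return (Sn, Sp, Ac, MCC) from the four confusion-matrix counts.

    Assumes at least one real positive and one real negative sample.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when a marginal sum is zero.
    mcc = ((tp * tn) - (fn * fp)) / denom if denom else 0.0
    return sn, sp, ac, mcc


# Made-up counts for illustration only.
sn, sp, ac, mcc = performance(tp=40, fp=10, tn=90, fn=10)
print(sn, sp, ac, mcc)
```

With these toy counts the accuracy is about 0.87 while the MCC is 0.7, which shows how MCC is more conservative than accuracy when the negative set is larger than the positive set.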

ROC curves

The prediction performance of the Naïve Bayes algorithm with a window length of six is very similar to that with a window length of seven. To compare their performance in detail, ROC curves were used to visualize prediction performance (see Figure 2). ROC curves plot the true positive rate as a function of the false positive rate, which is equal to 1 - specificity. The area under the ROC curve (the ROC score) is the average sensitivity over all possible specificity values and can be used as a measure of prediction performance over different thresholds. The ROC curve of a random predictor lies around the diagonal from bottom left to top right, with a score of about 0.5, while a perfect predictor produces a curve along the left and top boundaries of the square and receives a score of one.
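The construction of a ROC curve and its score can be sketched as follows. This is a generic illustration, assuming each candidate site has a real-valued prediction score (for example, a class probability); the function names and the six toy scores are invented, and the trapezoidal rule is one common way to compute the area.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    labels: 1 for (+) sites, 0 for (-) sites. Tied scores are grouped so
    that they move the curve in a single step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up scores for six sites.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

In this toy example eight of the nine (+)/(-) pairs are ranked correctly, so the score is 8/9, between the 0.5 of a random predictor and the 1.0 of a perfect one.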

Figure 2 The ROC curves for potential palmitoylated peptides with a window length of six. "3 fold CV" stands for 3-fold cross-validation, "8 fold CV" for 8-fold cross-validation and "Jack-Knife" for the Jack-Knife validation. "AUC" stands for the Area Under Curve score. (a) ROC curves on the old data set; (b) ROC curves on the new data set.