Background

Protein phosphorylation, as one of the most common post-translational modifications (PTM), is reversibly and transiently performed by protein kinases (PKs). It plays crucial regulatory roles in a variety of biological cellular processes, including transcription [1], translation [2], mitosis/cell cycle [3], neurite outgrowth [4, 5] and signal transductions [6], etc. Many previous researches have contributed to increase our knowledge on phosphorylation. However, the intrinsic features of phosphorylation dynamics are still cryptic and remain to be dissected. Biochemically, the catalytic site of a PK hydrolyzes adenosine triphosphate (ATP) and transfers a phosphate moiety to the acceptor residue (S/T, Y in eukaryotes) in the substrate. Each PK only modifies a defined subset of substrates specifically to ensure signaling fidelity, and defects of PK function often induce a variety of diseases and cancers [7].

There is an extensively-adopted hypothesis that PKs phosphorylate their substrates at the specific sites (consensus sequence) flanking with canonical motif [810]. To date, the consensus motifs of ~30 PKs have been reported [11]. However, there is still a large number of PKs with their specific target motifs remained to be identified. Therefore, elucidating PK-specific phosphorylation sites on the substrates is the foundation of understanding the molecular mechanism of substrates specificity and important for the biomedical drug design. However, it has been described that only consensus motif is not enough for providing the specificity of PK recognition in vivo [12]. There are numerous mechanisms have been proposed to contribute specificity for PKs, such as co-complex of PKs with their substrates, subcellular co-localization, interacting through modular docking sites, phosphopeptide-binding mechanisms, etc [1217]. In a cell, protein kinase usually forms a tight complex with its target either through a third scaffold protein, or by recognizing and binding short sequence of the substrate, known as a docking site [12, 18]. Moreover, phosphopeptide-binding domains (PBDs) are also important to achieve substrate specificity. Numerous PBDs (PTB, WW, SH2, SH3, FHA, MH2, WD40, Polo-box, and 14-3-3, etc) bind the phosphorylated forms of specific proteins, with recognizing distinct peptides surrounding the phosphorylated sites (pS/T, or pY) [1417, 19]. However, how these mechanisms achieve the additional specificity for PKs beyond phosphorylated motifs is still elusive, and there are very few computational studies published on this area [13, 16, 19]. In addition, many docking sites and PBDs still remain to be dissected. Thus, in this work, we focus on the prediction of PK-specific phosphorylation sites based on profiles/features of the surrounding primary sequences, as previously described [810].

Conventional experimental identifications of PK-specific phosphorylation sites on substrates in vivo and in vitro have provided the foundation of understanding the mechanisms of phosphorylation dynamics. However, these experiments are often time-consuming and expensive. And the enzymatic activity of the PKs are usually diminished or impeded in vitro, hampering on the studies of phosphorylation greatly. Recently, phospho-proteomic studies with mass spectrometry (MS) approaches have generated numerous data in yeast [20], mouse [21], and human [8], etc. But in these cases, it's still difficult to distinguish the PK-specific sites on the substrates. With regard of this, it is of note that the in silico prediction of PK-specific phosphorylation sites is in urgent need for the further experimental manipulation. To address this question, several excellent predictors have been implemented and reported [13, 2225]. For example, NetPhos has employed the consensus-motif-based approaches implemented in the artificial neural networks (ANNs) algorithm [22]. The enhanced version, NetPhosK can predict PK-specific phosphorylation sites for ~17 PKs [23]. Another online tool Scansite [13] has constructed the motif profiles of phosphorylation sites for ~20 PKs, and could predict their target sites, respectively. Previously, we have reported a web server GPS, which has been implemented in GPS (Group-based phosphorylation Predicting and Scoring) algorithm [26, 27]. GPS could predict ~70 kinds of PK-specific phosphorylation sites, and gain excellent performance on several PK groups, especially for kinase Aurora-B. Recently, a novel and excellent web tool of KinasePhos has been incorporated with HMM (Hidden Markov Models) algorithm and constructed for phosphorylation sites predicting of 18 PK-specific groups [24, 25].

In this study, we present a novel, convenient and comprehensive program, PPSP (Prediction of PK-specific Phosphorylation site), implemented in an algorithm of Bayesian decision theory (BDT). An online PPSP web service has been also constructed, accurately predicting PK-specific phosphorylation sites for 68 PK groups. The prediction performances of PPSP are satisfactory on several well-studied PKs and comparable with the other existing tools NetPhosK, Scansite, KinasePhos and GPS. Moreover, PPSP also provides the accurate prediction for many novel PKs, such as TRK, mTOR, SyK, and MET/RON, etc. Obviously, PPSP is more accurate and powerful. Therefore, we propose that PPSP could be useful and insightful for further experimental design. In addition, the prediction results of PPSP combined with delicate experiments verifications will propel our understanding of the mechanisms of phosphorylation into a new phase.

Implementation

Preparation of training data set

Firstly, we obtained the data set of phosphorylation sites from Phospho.ELM (Ver 2.0, Sep. 2004) [28] and filtered the phosphorylation sites without information of PKs. There were ~1,400 sites preserved. We also manually curated the recent literature and acquired ~660 items (Before Nov. 2004). These newly curated data has been submitted to Phospho.ELM for further integration. The two data sets were integrated, and the redundant items were removed if two items exactly pinned point to the same phosphorylation site from one protein sequence. Then the total training data set contained >2,000 non-redundant positive data with very few homologous sites (see additional file 1).

Since there were several PKs with too few known phosphorylation sites, we clustered them into distinct sub-groups based on sequence homology. For example, eight ribosomal protein S6 kinases (RSK1, Q15418; RSK3, Q15349; RSK2, P51812; MSK1, O75676; MSK1, O75582; RSK4, Q9UK32; S6K1, P23443; STK14B, Q9UBS0) are homologous with high similarity, so we clustered these PKs into a unique PK group of S6K (Ribosomal protein S6 kinase, or RSK). In total, we have enabled 68 PK grouped.

Although Swiss-Prot also curates a huge amount of phosphorylation sites, we have found ~69% of the annotation to be ambiguous (7,924 of 11,520) (see additional file 2). There are only 842 items to be kinase-specific sites, and only 18 PKs with not less than ten sites (see additional file 3). Phospho.ELM has been constructed based on the rationale of allowing both experimentalists and bioinformatists to easily access extensive information of phosphoproteins with their sites, i.e., tracking the primary reference to find whether the site is really phosphorylated, identified in vivo or in vitro, and the relationship between the phosphorylation with physiological response [28]. And these data has been collected from literature manually with high quality. Taken together, although other resources also have collected some phosphorylation sites, we chose Phospho.ELM for its comprehensiveness.

Positive & negative control for evaluation

The sequence information of these phosphorylation substrates was retrieved from ExPASy. As previously described [11], we adopted the experimental phosphorylation sites as the positive control, while all other residues (S/T or Y) in the phosphorylation substrates were regarded as the negative control. The detailed statistics of the positive and negative data sets categorized by PK groups is available (see additional file 4).

Bayesian Decision Theory (BDT)

Supposed that we have an unclassified data x that belongs to one of two certain categories: C1 (defined as phosphorylated sites in this work) and C2 (defined as non-phosphorylated sites). In addition, suppose the posterior probability of x for these two categories can be denoted as: p(C1|x) and p(C2|x). Then the probability of wrong prediction is:

P ( e r r o r | x ) = p ( C 1 | x ) , i f x C 2 p ( C 2 | x ) , i f x C 1 ( 1 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGLbqzcqWGYbGCcqWGYbGCcqWGVbWBcqWGYbGCcqGG8baFcqWG4baEcqGGPaqkcqGH9aqpfaqabeGabaaabaGaemiCaaNaeiikaGIaem4qam0aaSbaaSqaaiabigdaXaqabaGccqGG8baFcqWG4baEcqGGPaqkcqGGSaalcqWGPbqAcqWGMbGzcqqGGaaicqWG4baEcqGHiiIZcqWGdbWqdaWgaaWcbaGaeGOmaidabeaaaOqaaiabdchaWjabcIcaOiabdoeadnaaBaaaleaacqaIYaGmaeqaaOGaeiiFaWNaemiEaGNaeiykaKIaeiilaWIaemyAaKMaemOzayMaeeiiaaIaemiEaGNaeyicI4Saem4qam0aaSbaaSqaaiabigdaXaqabaaaaOGaaCzcaiaaxMaadaqadaqaaiabigdaXaGaayjkaiaawMcaaaaa@61F9@

To minimize the expectation of error probability that is defined as [29]:

P(error) = ∫P(error|x)p(x)dx    (2)

It is obvious that one should choose the more probable category as the prediction result, which can be formulated by the Bayesian Decision Rule [29]:

p r e d i c t x a s { C 1 , i f P ( C 1 | x ) > P ( C 2 | x ) C 2 , o t h e r w i s e ( 3 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqWGYbGCcqWGLbqzcqWGKbazcqWGPbqAcqWGJbWycqWG0baDcaaMc8UaemiEaGNaaGPaVlabdggaHjabdohaZnaaceaabaqbaeaabiGaaaqaaiabdoeadnaaBaaaleaacqaIXaqmaeqaaOGaeiilaWcabaGaemyAaKMaemOzayMaaGPaVlabdcfaqjabcIcaOiabdoeadnaaBaaaleaacqaIXaqmaeqaaOGaeiiFaWNaemiEaGNaeiykaKIaeyOpa4JaemiuaaLaeiikaGIaem4qam0aaSbaaSqaaiabikdaYaqabaGccqGG8baFcqWG4baEcqGGPaqkaeaacqWGdbWqdaWgaaWcbaGaeGOmaidabeaakiabcYcaSaqaaiabd+gaVjabdsha0jabdIgaOjabdwgaLjabdkhaYjabdEha3jabdMgaPjabdohaZjabdwgaLbaaaiaawUhaaiaaxMaacaWLjaWaaeWaaeaacqaIZaWmaiaawIcacaGLPaaaaaa@6A7C@

Furthermore, by definition we can introduce the loss function λ(α i |C j ), where α i ,i = 1,2 is the finite set of possible solution. Thus the expected loss (risk) of taking action α i is:

R ( α i | x ) = l = 1 2 λ ( α i | C l ) P ( C l | x ) ( 4 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGucqGGOaakcqaHXoqydaWgaaWcbaGaemyAaKgabeaakiabcYha8jabdIha4jabcMcaPiabg2da9maaqahabaGaeq4UdWMaeiikaGIaeqySde2aaSbaaSqaaiabdMgaPbqabaGccqGG8baFcqWGdbWqdaWgaaWcbaGaemiBaWgabeaakiabcMcaPiabdcfaqjabcIcaOiabdoeadnaaBaaaleaacqWGSbaBaeqaaOGaeiiFaWNaemiEaGNaeiykaKcaleaacqWGSbaBcqGH9aqpcqaIXaqmaeaacqaIYaGma0GaeyyeIuoakiaaxMaacaWLjaWaaeWaaeaacqaI0aanaiaawIcacaGLPaaaaaa@5448@

In this condition, the goal of optimization becomes to minimize the overall risk for every x. Similar to the rationale of Bayesian Decision Rule, we can obtain the best performance by computing R i |x) for each solution α i and choose that for which has the minimal overall risk (also named as Bayes Risk) [29].

Training and prediction procedure

In this study, a local ennea-peptide (9aa) is deployed to define a candidate phosphorylation site, which has 4 upstream and 4 downstream residues of the potential phosphorylation site and can be denoted as x MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWG4baEgaWcaaaa@2E37@ = (x1,x2,...,x9)'. Given some positive (training) data, there are many ways to estimate R i |x) (where α1 and α2 denote different prediction results: true and false phosphorylation sites, respectively). One simple way is to assume that all flanking residues are mutual independent, and then the Bayes Risk can be formulated as:

R ( α i | x ) = j = 1 9 R ( α i | x j ) ( 5 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGucqGGOaakcqaHXoqydaWgaaWcbaGaemyAaKgabeaakiabcYha8jqbdIha4zaalaGaeiykaKIaeyypa0ZaaabCaeaacqWGsbGucqGGOaakcqaHXoqydaWgaaWcbaGaemyAaKgabeaakiabcYha8jabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeiykaKcaleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqaI5aqoa0GaeyyeIuoakiaaxMaacaWLjaWaaeWaaeaacqaI1aqnaiaawIcacaGLPaaaaaa@4BCB@

R ( α i | x j ) = E ( λ | x j , α i ) = l = 1 2 k = 1 20 λ ( j , k | α i , C l ) p ( C l | x j ) ( 6 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGucqGGOaakcqaHXoqydaWgaaWcbaGaemyAaKgabeaakiabcYha8jabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaeyypa0JaemyrauKaeiikaGIaeq4UdWMaeiiFaWNaemiEaG3aaSbaaSqaaiabdQgaQbqabaGccqGGSaalcqaHXoqydaWgaaWcbaGaemyAaKgabeaakiabcMcaPiabg2da9maaqahabaWaaabCaeaacqaH7oaBcqGGOaakcqWGQbGAcqGGSaalcqWGRbWAcqGG8baFcqaHXoqydaWgaaWcbaGaemyAaKgabeaakiabcYcaSiabdoeadnaaBaaaleaacqWGSbaBaeqaaOGaeiykaKIaemiCaaNaeiikaGIaem4qam0aaSbaaSqaaiabdYgaSbqabaGccqGG8baFcqWG4baEdaWgaaWcbaGaemOAaOgabeaakiabcMcaPaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaeGOmaiJaeGimaadaniabggHiLdaaleaacqWGSbaBcqGH9aqpcqaIXaqmaeaacqaIYaGma0GaeyyeIuoakiaaxMaacaWLjaWaaeWaaeaacqaI2aGnaiaawIcacaGLPaaaaaa@71C0@

Here p(C l |x j ) is the posterior probability of x j belonging to category C l and can be further described by the Bayesian formula:

p ( C l | x j ) = p ( x j | C l ) p ( C l ) p ( x j ) = p ( x j | C l ) p ( C l ) l = 1 2 p ( x j | C l ) p ( C l ) , l = 1 , 2 ( 7 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakcqWGdbWqdaWgaaWcbaGaemiBaWgabeaakiabcYha8jabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaeyypa0ZaaSaaaeaacqWGWbaCcqGGOaakcqWG4baEdaWgaaWcbaGaemOAaOgabeaakiabcYha8jabdoeadnaaBaaaleaacqWGSbaBaeqaaOGaeiykaKIaemiCaaNaeiikaGIaem4qam0aaSbaaSqaaiabdYgaSbqabaGccqGGPaqkaeaacqWGWbaCcqGGOaakcqWG4baEdaWgaaWcbaGaemOAaOgabeaakiabcMcaPaaacqGH9aqpdaWcaaqaaiabdchaWjabcIcaOiabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeiiFaWNaem4qam0aaSbaaSqaaiabdYgaSbqabaGccqGGPaqkcqWGWbaCcqGGOaakcqWGdbWqdaWgaaWcbaGaemiBaWgabeaakiabcMcaPaqaamaaqahabaGaemiCaahaleaacqWGSbaBcqGH9aqpcqaIXaqmaeaacqaIYaGma0GaeyyeIuoakiabcIcaOiabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeiiFaWNaem4qam0aaSbaaSqaaiabdYgaSbqabaGccqGGPaqkcqWGWbaCcqGGOaakcqWGdbWqdaWgaaWcbaGaemiBaWgabeaakiabcMcaPaaacqGGSaalcqWGSbaBcqGH9aqpcqaIXaqmcqGGSaalcqaIYaGmcaWLjaGaaCzcamaabmaabaGaeG4naCdacaGLOaGaayzkaaaaaa@7FDD@

Here p(C l ) is the prior probability of category C l and p(x j |C l ) can be estimated by observing the occurrence of each residue in training data given the hypothesis of equation (5). Although there are much more false phosphorylation site in data set, we give equal prior probability for each category (no prior information), which can avoid bias prediction results. The loss function we construct is based on BLOSUM62 matrix [30] by considering the biochemical difference of residues, which can be formulated as:

λ ( j , k | α i , C l ) = { B L O S U M 62 ( j , k ) , i f α i C l 0 , i f α i = C l ( 8 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqaH7oaBcqGGOaakcqWGQbGAcqGGSaalcqWGRbWAcqGG8baFcqaHXoqydaWgaaWcbaGaemyAaKgabeaakiabcYcaSiabdoeadnaaBaaaleaacqWGSbaBaeqaaOGaeiykaKIaeyypa0ZaaiqaaeaafaqabeGacaaabaGaeyOeI0IaemOqaiKaemitaWKaem4ta8Kaem4uamLaemyvauLaemyta0KaeGOnayJaeGOmaiJaeiikaGIaemOAaOMaeiilaWIaem4AaSMaeiykaKIaeiilaWcabaGaemyAaKMaemOzayMaaGPaVlabeg7aHnaaBaaaleaacqWGPbqAaeqaaOGaeyiyIKRaem4qam0aaSbaaSqaaiabdYgaSbqabaaakeaacqaIWaamcqGGSaalaeaacqWGPbqAcqWGMbGzcaaMc8UaeqySde2aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqWGdbWqdaWgaaWcbaGaemiBaWgabeaaaaaakiaawUhaaiaaxMaacaWLjaWaaeWaaeaacqaI4aaoaiaawIcacaGLPaaaaaa@6A7D@

Although other matrices could be also adopted, the BLOSUM62 matrix is chosen in this work. Moreover, we introduce a trade-off threshold b as the only parameter in this method to control the performance for different categories. Thus the final Discriminant function for prediction is:

p r e d i c t x a s { C 1 , i f R ( α 2 | x ) R ( α 1 | x ) > b C 2 , o t h e r w i s e ( 9 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqWGYbGCcqWGLbqzcqWGKbazcqWGPbqAcqWGJbWycqWG0baDcaaMc8UafmiEaGNbaSaacaaMc8UaemyyaeMaem4Cam3aaiqaaeaafaqaaeGacaaabaGaem4qam0aaSbaaSqaaiabigdaXaqabaGccqGGSaalaeaacqWGPbqAcqWGMbGzcaaMc8UaemOuaiLaeiikaGIaeqySde2aaSbaaSqaaiabikdaYaqabaGccqGG8baFcuWG4baEgaWcaiabcMcaPiabgkHiTiabdkfasjabcIcaOiabeg7aHnaaBaaaleaacqaIXaqmaeqaaOGaeiiFaWNafmiEaGNbaSaacqGGPaqkcqGH+aGpcqWGIbGyaeaacqWGdbWqdaWgaaWcbaGaeGOmaidabeaakiabcYcaSaqaaiabd+gaVjabdsha0jabdIgaOjabdwgaLjabdkhaYjabdEha3jabdMgaPjabdohaZjabdwgaLbaaaiaawUhaaiaaxMaacaWLjaWaaeWaaeaacqaI5aqoaiaawIcacaGLPaaaaaa@6E20@

The outline of the training and procedure in this work is illustrated in Figure 1. We first estimate the probability distribution of each residue of the true/false ennea-peptide within the training data. Then the Bayes risk for either potential solution (i.e true or false phosphorylation site) is calculated, respectively. To implement the final differential function in equation (9) effectively, we built a difference profile of Bayesian decision risk for each PK family/group in prediction. In this way, a candidate site for a given protein kinase is assessed in the profile and the outcome for each residue is summed up. If the difference of risks (false prediction minus true prediction) is greater than the threshold b, it will be predicted by PPSP as a negative site that can not be phosphorylated by this PK. Otherwise, PPSP will infer the site is as a potential phosphorylation site. In this work, the threshold for each PK has been optimized automatically.

Figure 1
figure 1

The outline of the training and procedure of PPSP.

Results and discussions

Prediction performance of PPSP

Three measurements, i.e., Sensitivity (Sn), Specificity (Sp), and Mathew correlation coefficient (MCC) are widely employed to evaluate the performance of prediction with definitions as below:

S n = T P T P + F N , S p = T N T N + F P , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqabeqacaaabaGaem4uamLaemOBa4Maeyypa0ZaaSaaaeaacqWGubavcqWGqbauaeaacqWGubavcqWGqbaucqGHRaWkcqWGgbGrcqWGobGtaaGaeiilaWcabaGaem4uamLaemiCaaNaeyypa0ZaaSaaaeaacqWGubavcqWGobGtaeaacqWGubavcqWGobGtcqGHRaWkcqWGgbGrcqWGqbauaaaaaiabcYcaSaaa@456D@

and

M C C = ( T P × T N ) ( F N × F P ) ( T P + F N ) × ( T N + F P ) × ( T P + F P ) × ( T N + F N ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGnbqtcqWGdbWqcqWGdbWqcqGH9aqpdaWcaaqaaiabcIcaOiabdsfaujabdcfaqjabgEna0kabdsfaujabd6eaojabcMcaPiabgkHiTiabcIcaOiabdAeagjabd6eaojabgEna0kabdAeagjabdcfaqjabcMcaPaqaamaakaaabaGaeiikaGIaemivaqLaemiuaaLaey4kaSIaemOrayKaemOta4KaeiykaKIaey41aqRaeiikaGIaemivaqLaemOta4Kaey4kaSIaemOrayKaemiuaaLaeiykaKIaey41aqRaeiikaGIaemivaqLaemiuaaLaey4kaSIaemOrayKaemiuaaLaeiykaKIaey41aqRaeiikaGIaemivaqLaemOta4Kaey4kaSIaemOrayKaemOta4KaeiykaKcaleqaaaaakiabc6caUaaa@6698@

Among the data with positive predictions by PPSP, the real positives are regarded as true positives (TP), while the others are defined as false positives (FP). Among the data with negative predictions by PPSP, the real positives are regarded as false negatives (FN), while the others are defined as true negatives (TN). The Sensitivity (Sn) and Specificity (Sp) illustrate the correct prediction ratios of positive and negative data sets respectively. But when the number of positive data and negative data differ too much from each other, the Mathew correlation coefficient (MCC) should be calculated to assess the prediction performance. The value of MCC ranges from -1 to 1, and bigger MCC stands for better prediction performance.

To assess whether PPSP is unbiased and robust for prediction, we adopt the standard method "Jack-Knife" validation. We perform the Jack-Knife validation for these PKs by removing one real site from the training data set at a time and re-calculating the Sn &Sp, respectively. The final results are the average of the all Sn &Sp of the Jack-Knife validation. Although "Jack-Knife" validation does make sense when the size of the data set is small (i.e., N < 30), we have also taken an additional test with n-fold (4-, 6-, 8- and 10-fold in this work) cross-validation for 22 PK groups with larger positive data set (N ≥ 30). As previously proposed [25], the tests are repeated for 20 times and the Sn &Sp is computed each time. Then the average Sn & Sp are calculated as the final results (see additional file 5).

In table 1, we list the prediction performances for four most well-studied PKs of PKA (Protein kinase A), CK2 (Casein Kinase II), ATM (Ataxia telangiectasia mutated) and S6K (Ribosomal protein S6 kinase, or RSK). The prediction performances of self-consistency, Jack-knife validation and n-fold cross-validation has been provided. For PKA, CK2, ATM and S6K, the Sn &Sp of the self-consistency is 92.31% & 97.40%, 93.33% & 96.46%, 92.59% & 91.98%, and 89.47% & 95.90%, while the Jack-Knife validation is 90.11% & 90.46%, 83.21% & 88.44%, 86.05% & 91.89%, 92.86% & 91.05%, respectively. Interestingly, the performances of n-fold cross-validation are very similar and consistent with the results of the Jack-Knife validation. So the PPSP is quite robust and unbiased for these well-studied PKs. Moreover, PPSP could predict for several novel PKs (>30, see additional file 4). In Table 2, we choose four PKs, including TRK (Neurotrophic tyrosine kinase receptor), mTOR (Mammalian target of rapamycin), SyK (Spleen tyrosine kinase), and MET/RON (Hepatocyte growth factor receptor/Macrophage-stimulating protein receptor), which predictors for them are not available previously. Interestingly, the prediction performance of PPSP is also satisfying. And the Jack-knife validation proposes that the PPSP approach is also robust and unbiased for these novel PKs. The full content of the prediction performance of PPSP is available from PPSP website.

Table 1 The performances of self-consistency, Jack-knife validation and n -fold (4-, 6-, 8-, 10-fold in this work) cross-validation for four well-studied PKs of PKA, CK2, ATM and S6K. The n- fold cross-validation has been performed for the large data sets (N ≥ 30).
Table 2 The self-consistency performance and Jack-knife validation for four novel PKs of TRK, mTOR, SyK and MET/RON.

To evaluate the performance of PPSP on the signal to noise for phosphorylation sites retrieval, we also perform two additional evaluations. Firstly, we randomly generate 10, 000 serine (S) and threonine (T) ennea-peptides for serine/threonine kinases (STKs), with 10, 000 tyrosine (Y) nona-peptides for tyrosine kinases (TKs). In addition, to determine the ability of the PPSP to retrieve potential real phosphorylation sites from the full proteome, we have downloaded the protein sequences of human proteome from public database ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.fasta.gz. Again, we randomly retrieve 10, 000 S & T and Y ennea-peptides for STKs and TKs from the human proteome, respectively. Then we compute the Risk Difference (RD) of each ennea-peptide. Under the default threshold of PPSP, the percentile of the sites predicted to be potential true positive hits is listed (see in Table 3). The prediction results of random and human proteome data set are very similar. And the distribution of Risk Difference of random and human proteome data set of PKA-specific site prediction is diagramed in Figure 2. In this work, the default threshold of PKA is 3.5, and predicted Risk Differences of the most of the ennea-peptides from the two data sets are smaller than this cut-off. Based on these analyses, we propose that PPSP could efficiently predict the potential real sites with very low false positive hits. The ratio of Serine and Threonine is not exactly equal. However, we and others are unable to explain this question [25].

Table 3 With the default cut-off of PPSP, the percentile of the sites predicted to be potential true positive hits is listed. Both random ennea-peptides and data sets from human proteome have been computed, separately.
Figure 2
figure 2

the distribution of risk difference of random and human proteome data set of PKA-specific site prediction is diagramed in Figure 2. A. Distribution of Risk Difference of random data set (serine) of PKA-specific site prediction. B. Distribution of Risk Difference of random data set (threonine) of PKA-specific site prediction. C. Distribution of Risk Difference of human proteome data set (serine) of PKA-specific site prediction. D. Distribution of Risk Difference of Human proteome data set (threonine) of PKA-specific site prediction.

Comparison of PPSP with Scansite, NetPhosK, KinasePhos and GPS

With four well-studied PKs of PKA, CK2, ATM and S6K as model kinases, we compare PPSP with four previous online prediction systems: Scansite, NetPhosK, KinasePhos and GPS. In Table 4, we list the prediction performances of Scansite, NetPhosK, KinasePhos and GPS for PKA, CK2, ATM and S6K, respectively. Since we can't re-perform the Jack-knife validation for the predictors, we submit the substrate sequence into these tools for prediction. And the self-consistency performance of PPSP is adopted here for comparison. Scansite has three thresholds for prediction, including high, medium and low stringency, while KinasePhos has paid attention to prediction specificity with three cut-off values, as 90%, 95% and 100%. And the default parameter is adopted for GPS. We calculate the prediction performances of Scansite and KinasePhos at three distinct thresholds, separately. As for NetPhosK, we only adopt the default cut-off value with 0.5, in mode of Prediction without filtering. Obviously, PPSP, NetPhosK, KinasePhos and GPS are better than Scansite. For PKA, the prediction performance of PPSP is 90.11% (Sn) and 91.70% (Sp), and outperforms to NetPhosK (Sn 79.12% &Sp 90.65%) with about 10% higher sensitivity and similar specificity. And for CK2, the performance of PPSP is 83.21% (Sn) and 90.01% (Sp), slightly higher than NetPhosK (Sn 82.48% &Sp 89.43%). The prediction performance of KinasePhos is similar with PPSP on PKA and CK2. However, for ATM, the NetPhosK is 86.01% (Sn) and 98.51% (Sp), whereas PPSP is 93.02% (Sn) and 94.06% (Sp). Although PPSP has a lower specificity than NetPhosK with ~4%, the sensitivity is high with ~7% enhanced. Finally, for S6K (also called as RSK in NetPhosK), although the specificity of PPSP (97.97%) and NetPhosK (97.14%) is quite similar, PPSP outperforms than NetPhosK with ~10% higher in sensitivity. With regard of this, we propose the prediction performance of PPSP could be at least comparable with the existing systems.

Table 4 The prediction performance of Scansite, NetPhosK, KinasePhos and GPS for four well-studied PKs of PKA, CK2, ATM and S6K.

However, the analysis and comparison above are only in theoretical and not intuitive. Furthermore, we browse the recent literature from PubMed and randomly choose some instances for comparison. One example is Bluetongue virus (BTV) nonstructural protein 2 (NS2, P23065), a substrate of CK2 [31]. As a nonspecific single-stranded RNA (ssRNA)-binding protein, NS2 accumulates in BTV-infected cells, and is functional in viral replication and morphogenesis [3134]. NS2 could hydrolyze both ATP and GTP with high affinity, showing strong enzymatic activity [32]. Using mutagenesis analysis, CK2 was demonstrated to phosphorylate NS2 in two serine sites S249 and S259, probably modulating its RNA binding properties, enzymatic activity or influencing its ability to interact with other viral proteins [31]. For CK2-specific phosphorylation sites prediction, all of the four programs can detect them successfully (see in Table 5). In this case, the Scansite with medium stringency get the best hits. PPSP predict three sites as positive hits (T247, S249, and S259), but NetPhosK provide too much results with seven positive hits. Two additional instances are also provided in Table 5. One is Drosophila transcription factor protein GAGA (Q08605), regulating gene transcription and chromatin remodeling, etc [35]. The other is human Calmodulin protein (P62158) [36]. The prediction results of the four programs are shown in Table 5. And the online prediction of PPSP is diagramed in Figure 3. Obviously, for the well-studied PKs, i.e. CK2, PPSP is accurate and comparable with the existing tools.

Table 5 The experimental verified vs. predicted CK2-specific phosphorylation sites of Bluetongue virus (BTV) nonstructural protein 2 (NS2), Drosophila transcription factor GAGA and human Calmodulin protein.
Figure 3
figure 3

The prediction results of Bluetongue virus (BTV) nonstructural protein 2 (NS2), Drosophila transcription factor GAGA and human Calmodulin protein with PPSP. Figure 3A – prediction results of NS2; Figure 3B – prediction results of GAGA; Figure 3C – prediction results of Calmodulin.

Application of PPSP to the novel PKs

For application of PPSP to the novel PKs, here we employ PPSP to predict the phosphorylation sites of TRK. TRK is a sub-family of receptor tyrosine kinases (RTK), consisting three highly similar homologs, TRK-A, -B, and -C [37]. TRK-A, -B, and -C could be activated specifically by nerve growth factor (NGF), brain-derived neurotrophic factor (BDNF) and NT-4/-5, and NT-3, respectively. Under activated state, TRK could regulate a variety of biological processes including cell survival, embryo, differentiation, proliferation, axon and dendrite growth and patterning, and apoptosis, etc [37].

Recently, protein Ras guanine-releasing factor 1 (RasGrf1, Q13972), a GTPase of the Ras and Rho family, has been proposed to be phosphorylated and interact with TRK-A, -B, -C in co-transfection experiments [5]. However, the exact phosphorylation sites of RasGrf1 by TRK remain to be identified. PPSP has predicted that there are totally two potential phosphorylation sites on RasGrf1 (Y94 & Y1209) (see in Figure 4). Moreover, the human tumorous imaginal disc 1 (TID1, Q96EY1) was verified as a substrate of TRK with co-immunoprecipitation (Co-IP) [4] and the phosphorylation sites were not elucidated. PPSP could predict that there are three candidate sites with Y94, Y95 and Y173 (see in Figure 4). These prediction results would be very useful for the further experimentation and elucidation phospho-regulation underlying cellular dynamics.

Figure 4
figure 4

The diagram of potential phosphorylation sites of human RasGrf1 (Q13972) and TID1 (Q96EY1) by TRK.

Conclusion

In this work, we present a novel computational program–PPSP (prediction of PK-specific phosphorylation sites) based on Bayesian decision theory (BDT). We model a candidate phosphorylation motif as an ennea/nona-peptide (9aa) flanking with 4 upstream and 4 downstream residues of a potential phosphorylation site (S/T, or Y). With the BDT algorithm, we estimate the probability distributions of true and false phosphorylation sites and make prediction based on a loss function constructed with BLOSUM62 matrix [30]. We have evaluated the sensitivity and specificity of PPSP by "Jack-knife" validation. An online PPSP web service has been also constructed, accurately predicting PK-specific phosphorylation sites for 68 PK groups. For comparison with four reported systems Scansite, NetPhosK, KinasePhos and GPS, we take four well-studied PKs of PKA, CK2, ATM and S6K as model kinases. The prediction performances of PPSP are satisfactory judged using these well-studied PKs and comparable with the other existing tools. Moreover, PPSP also provides the accurate prediction for many novel PKs, such as TRK, mTOR, SyK, and MET/RON, etc. Thus, comparison with the previous work, PPSP provides more accurate and powerful ability. Moreover, the BDT approach could also be an extensive method for PTMs prediction, such as sumoylation and ubiquitination, etc. In addition, although many phospho-proteomic researches have generated numerous data [8, 20, 21], however, the up-regulated PKs still remain to be dissected. Despite the demonstration of phosphor-regulation of protein kinases and their respective substrates, the exact phosphorylation sites are unclear [4, 5]. Taken together, the prediction results of PPSP should be insightful and important for further experiments. The combination of computational and experimental identifications will propel our understanding of phosphorylation dynamics into a new phase.

Availability and requirements

PPSP has been implemented in Linux + Apache + PHP, and is freely available at: http://bioinformatics.lcd-ustc.org/PPSP. A latest web browser (eg. Internet Explorer, Netscape, or Mozilla, etc) is required.