1 Introduction

The human genome harbors 20,687 protein-coding genes and encodes a much larger number of proteins with the help of alternative splicing [1]. After translation from the mature mRNA, a protein is dynamically modified through various mechanisms and exerts its functions in these dynamically changing modified forms. A posttranslational modification (PTM) of a protein usually attaches a biochemical group to a specific residue, and there are more than 300 types of PTMs [2], e.g., phosphorylation and SUMOylation. Phosphorylation is the major mechanism for delivering signals between the extra- and intracellular systems [3], and SUMOylation ensures the stability of the modified proteins [4]. Malfunction of PTMs is known to be associated with various human diseases, including cancer and cardiovascular diseases [5]. A number of PTM types have therefore been extensively studied for their roles in the initiation and development of human diseases.

The PTM residues of proteins may be detected using two major classes of techniques. The first class is experimental: both gel- and mass spectrometry-based techniques are widely used to detect the mass change of a peptide after the attachment of the PTM-specific biochemical group, e.g., the 80-Da phosphate group from phosphorylation [6]. Due to the limited availability of catalytic enzymes and the low sensitivity of these techniques, the experimental characterization of PTM residues is still very costly and labor intensive for proteome-wide studies. The alternative strategy is to computationally screen a query protein for residues whose flanking peptides are highly similar to those of experimentally verified PTM residues. The current literature supports the assumption that two residues with the same or highly similar flanking peptides tend to have similar probabilities of being modified by the same PTM type [7]. Various scoring strategies and machine learning algorithms have been applied to computationally detect PTM residues from protein sequences [8, 9].

This study proposes a novel position-dependent scoring strategy, the Echo algorithm, to measure the similarity between two peptides. The position-dependent vector of weights for the positions flanking the PTM residues is optimized by an evolutionary algorithm, which simulates nature's selection process of random mutation and fitness evaluation. Even a simple nearest neighbor classification strategy based on Echo outperforms similar computer programs for three phosphoserine/threonine kinases, three phosphotyrosine kinases and three other PTM types. A computer program, jEcho, is implemented so that biologists can easily use these optimized PTM prediction models.

2 Materials and Methods

2.1 Data Sources

Experimentally verified phosphorylated residues were collected from Phospho.ELM [10], the most comprehensive phosphorylation database. The latest version of the database, version 9.0, was retrieved on July 31, 2012. This study chooses three phosphoserine/threonine kinases (MAPK3, MAPK8 and CDK5) and three phosphotyrosine kinases (EGFR, Met and SYK) as examples to demonstrate how the evolutionary optimization algorithm contributes to PTM residue predictions. Phospho.ELM version 9.0 contains 91, 33 and 24 phosphorylated residues for MAPK3, MAPK8 and CDK5, respectively, and 55, 49 and 26 phosphorylated residues for EGFR, Met and SYK, respectively.

Besides phosphorylation, we also tested our system on three other PTM types, i.e., SUMOylation, Nitrated tyrosine and S nitrosylation. The data for these three PTM types were retrieved from the database dbPTM version 3.0 [11] on November 23, 2012. In total, 1051, 96 and 3289 modified residues were collected for the modification types SUMOylation, Nitrated tyrosine and S nitrosylation, respectively.

2.2 PTM Prediction Problem

This study investigates the PTM prediction problem, which is defined as follows. Firstly, for a given PTM type, the modification alphabet is defined to be the amino acid(s) that may be modified by this PTM type. That is to say, {S, T} and {Y} are the modification alphabets for phosphoserine/threonine and phosphotyrosine kinases, respectively. SUMOylation, Nitrated tyrosine and S nitrosylation have the modification alphabets {K}, {Y} and {C}, respectively. The experimentally verified PTM residues of the given PTM type constitute the positive dataset \(P=\{P_{1}, P_{2}, \ldots , P_{G}\}\). A positive data entry is a peptide consisting of the \(a\) upstream amino acids, the modified residue and the \(b\) downstream amino acids of the given PTM residue, defined as PSP(ab) [7]. The negative dataset \(N=\{N_{1}, N_{2}, \ldots , N_{H}\}\) consists of the PSP(ab) peptides of all the other residues belonging to the modification alphabet in the proteins with positive residues, as similarly defined in the other PTM prediction programs [7]. In order to conduct a consistent performance comparison with the program GPS [7], this study uses the same parameters (a = 7 and b = 7) for all the PTM types except SUMOylation. The prediction performance of Echo on SUMOylation is compared with the program SUMOsp [12], so Echo uses the same parameters (a = 5 and b = 5) as SUMOsp.
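The construction of the positive and negative datasets can be illustrated with a minimal Python sketch. The function names `psp_peptide` and `build_datasets` are hypothetical, and the padding symbol `-` for residues closer than a or b positions to a protein terminus is an assumption, since the text does not specify how such residues are handled.

```python
from typing import List, Set, Tuple

PAD = "-"  # assumed padding symbol for positions beyond the protein termini


def psp_peptide(sequence: str, index: int, a: int = 7, b: int = 7) -> str:
    """Return the PSP(ab) peptide centred on the 0-based position `index`."""
    left = sequence[max(0, index - a):index].rjust(a, PAD)
    right = sequence[index + 1:index + 1 + b].ljust(b, PAD)
    return left + sequence[index] + right


def build_datasets(
    sequence: str,
    positive_sites: Set[int],
    alphabet: Set[str],
    a: int = 7,
    b: int = 7,
) -> Tuple[List[str], List[str]]:
    """Split all residues of the modification alphabet into P and N peptides.

    `positive_sites` holds the 0-based positions of the verified PTM residues;
    every other residue of the alphabet in the same protein becomes a negative.
    """
    positives: List[str] = []
    negatives: List[str] = []
    for i, residue in enumerate(sequence):
        if residue not in alphabet:
            continue
        peptide = psp_peptide(sequence, i, a, b)
        if i in positive_sites:
            positives.append(peptide)
        else:
            negatives.append(peptide)
    return positives, negatives
```

For a phosphoserine/threonine kinase, `alphabet` would be `{"S", "T"}`, and the function would be applied to every protein that carries at least one verified residue.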

Echo uses the simple nearest neighbor algorithm for the PTM prediction problem. The similarity between two PSP(ab) peptides A and B is defined as \(Score(A,\;B) = \left\{ \sum \nolimits _{i\in [1,\;a+1+b],\;i\ne a+1} \left( w_i \times BLOSUM62(A_i, B_i) \right) \right\} /(a+b),\) where \(w_{i}\) is the weight for the position i, and BLOSUM62\((A_{i}, B_{i})\) is the similarity score in the matrix BLOSUM62 [13] between the two amino acids \(A_{i}\) and \(B_{i}\). The central position \(a+1\) holds the candidate residue itself and is excluded from the sum. Given the two datasets P and N, a query peptide Q is assigned to the dataset that contains its nearest neighbor, i.e., the peptide with the highest similarity score to Q. The weight vector \(W=\{w_{1},w_{2}, \ldots , w_{a+1+b}\}\) is optimized by an evolutionary algorithm described in the next section.
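The score and the nearest neighbor rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the jEcho implementation: the names `echo_score` and `predict` are hypothetical, `toy_substitution` is a placeholder that stands in for a real BLOSUM62 lookup, and the handling of ties (assigned to the negative dataset) is an assumption because the text does not cover it.

```python
from typing import Callable, Sequence

Substitution = Callable[[str, str], float]


def toy_substitution(x: str, y: str) -> float:
    """Placeholder for BLOSUM62: +4 for identical residues, -1 otherwise."""
    return 4.0 if x == y else -1.0


def echo_score(
    p: str,
    q: str,
    weights: Sequence[float],
    a: int,
    b: int,
    substitution: Substitution = toy_substitution,
) -> float:
    """Weighted similarity of two PSP(ab) peptides, skipping the central residue."""
    if not (len(p) == len(q) == len(weights) == a + 1 + b):
        raise ValueError("peptides and weights must all have length a + 1 + b")
    total = 0.0
    for i in range(a + 1 + b):
        if i == a:  # 0-based index of position a+1, the candidate residue
            continue
        total += weights[i] * substitution(p[i], q[i])
    return total / (a + b)


def predict(
    query: str,
    positives: Sequence[str],
    negatives: Sequence[str],
    weights: Sequence[float],
    a: int,
    b: int,
    substitution: Substitution = toy_substitution,
) -> bool:
    """Return True if the nearest neighbor of `query` is a positive peptide."""
    best_pos = max(echo_score(query, p, weights, a, b, substitution) for p in positives)
    best_neg = max(echo_score(query, n, weights, a, b, substitution) for n in negatives)
    return best_pos > best_neg  # assumption: a tie is resolved as negative
```

Passing the substitution score as a function keeps the sketch self-contained; a real BLOSUM62 table can be supplied through the `substitution` argument without changing the rest of the code.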

2.3 Evaluation Measurements and Evolutionary Algorithm

This study evaluates the performance of a PTM prediction algorithm by its sensitivity (Sn), specificity (Sp), accuracy (Ac) and Matthews correlation coefficient (MCC) [7, 14]. For the positive and negative datasets P and N, a positive data entry is a true positive if it is predicted to be positive and a false negative if it is predicted to be negative. A negative data entry is a true negative if it is predicted to be negative and a false positive if it is predicted to be positive. The numbers of these four classes of data entries are abbreviated as TP, FN, TN and FP, respectively. The prediction performance measurements are defined as \({\rm Sn}={\rm TP}/({\rm TP}+{\rm FN}),\; {\rm Sp}={\rm TN}/({\rm TN}+{\rm FP}),\; {\rm Ac}=({\rm Sn}+{\rm Sp})/2,\) and \({\rm MCC}=({\rm TP} \times {\rm TN}-{\rm FP} \times {\rm FN})/{\rm sqrt}(({\rm TP}+{\rm FP}) \times ({\rm TP}+{\rm FN}) \times ({\rm TN}+{\rm FP}) \times ({\rm TN}+{\rm FN})),\) where sqrt(X) is the square root of X.
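The four measurements follow directly from the counts TP, FN, TN and FP. A minimal Python sketch is given below; the function name `performance` is hypothetical, and returning 0 for MCC when its denominator is zero is an assumed convention that the text does not state.

```python
import math
from typing import Dict


def performance(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute Sn, Sp, Ac and MCC from the four prediction counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (sn + sp) / 2  # Ac is the mean of Sn and Sp, as defined in the text
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}
```

Because Ac averages Sn and Sp, it is not dominated by the much larger negative dataset, which matters here since negatives far outnumber positives for most PTM types.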

An evolutionary algorithm simulates nature's process of random mutation and competitive selection, and works well on some optimization problems for which no clues about the optimal patterns are available [15, 16]. In this work, the weight vector \(W=\{w_{1}, w_{2}, \ldots , w_{a+1+b}\}\) plays the role of the molecule that receives the random mutations, and the selection/optimization goal is to maximize the accuracy Ac. Each generation consists of 100 individuals, i.e., weight vectors. After the random mutations, 300 pairs of parent individuals are randomly chosen, and each pair randomly exchanges half of the positions of their weight vectors. Only the individuals with the top 95 Ac values survive into the next generation. In order to avoid a decrease in Ac in the next generation, the best five individuals are also kept intact for the next generation. All the 9 PTM types reach their best Ac values after 1000 generations of optimization. The optimized weight vectors are listed in Supplementary Table S1.
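The optimization loop can be sketched in Python as below. This is one possible reading of the description, not the authors' code: the function name `evolve`, the Gaussian mutation of a single random position, its step size `sigma`, the uniform initialization in [0, 1) and the fixed random seed are all assumptions. The sketch interprets the selection step as 5 untouched elite individuals plus the 95 best mutated or recombined candidates, which keeps the population at 100.

```python
import random
from typing import Callable, List, Tuple

Weights = List[float]


def evolve(
    fitness: Callable[[Weights], float],
    length: int,
    generations: int = 1000,
    pop_size: int = 100,
    n_pairs: int = 300,
    n_elite: int = 5,
    sigma: float = 0.1,
    seed: int = 0,
) -> Tuple[Weights, List[float]]:
    """Maximize `fitness` over weight vectors; return the best one and the
    best fitness value recorded after each generation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(length)] for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):
        # Elitism: copy the best individuals unchanged into the next generation.
        ranked = sorted(population, key=fitness, reverse=True)
        elite = [individual[:] for individual in ranked[:n_elite]]
        # Random mutation (assumed form): perturb one random position.
        mutants = []
        for individual in population:
            child = individual[:]
            child[rng.randrange(length)] += rng.gauss(0.0, sigma)
            mutants.append(child)
        # Crossover: each chosen pair exchanges half of its positions.
        offspring = []
        for _ in range(n_pairs):
            first, second = rng.sample(mutants, 2)
            c1, c2 = first[:], second[:]
            for pos in rng.sample(range(length), length // 2):
                c1[pos], c2[pos] = c2[pos], c1[pos]
            offspring.extend([c1, c2])
        # Selection: the top candidates fill the remaining slots.
        candidates = sorted(mutants + offspring, key=fitness, reverse=True)
        population = elite + candidates[:pop_size - n_elite]
        history.append(max(fitness(individual) for individual in population))
    return max(population, key=fitness), history
```

In the setting of this study, `fitness` would return the Ac value obtained with a given weight vector and `length` would be a + 1 + b. Because the elite individuals are copied unchanged, the best fitness in `history` can never decrease from one generation to the next.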

3 Results and Discussion

3.1 Comparison of Leave-One-Out Performance

Firstly, we compare Echo's prediction accuracy on the three phosphoserine/threonine kinases and the three phosphotyrosine kinases with that of the computer program GPS version 2.1, using the same Jack-Knife validation [14]. The Jack-Knife validation is also called the leave-one-out (LOO) validation, which predicts the modification status of each data entry using all the other data entries as the training dataset [17]. Echo outperforms the GPS 2.1 algorithm in all four prediction performance measurements at the corresponding cutoff levels for all six kinases, as shown in Table 1. Echo improves the overall accuracy Ac by more than 10 % for the phosphoserine/threonine kinase CDK5 and the phosphotyrosine kinase Met, compared with the low cutoff values of the algorithm GPS 2.1. A gain of more than 0.20 in the Matthews correlation coefficient (MCC) for CDK5, EGFR and Met also suggests that Echo performs consistently well on both the positive and negative datasets for these kinases. For example, Echo achieves 100 % sensitivity (Sn), i.e., accuracy on the positive dataset, and more than 95 % specificity for the kinases CDK5 and Met, as shown in Table 1. We further evaluate Echo's performance in identifying the phosphorylation residues of six more common kinases, PKA_alpha, MAPK1, Abl, PKG, Aurora_A and ATM. Echo outperforms GPS 2.1 in all cases at all threshold values. The maximum improvement in accuracy, 14.04 %, is achieved by Echo at the low threshold value for the kinase Abl.
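The leave-one-out procedure described above can be sketched in Python as follows. The names `leave_one_out` and `identity_nn` are hypothetical, and `identity_nn` is only a toy nearest neighbor rule (it counts identical positions and resolves ties as negative) that stands in for the weighted Echo score.

```python
from typing import Callable, Dict, List

Classifier = Callable[[str, List[str], List[str]], bool]


def leave_one_out(
    positives: List[str], negatives: List[str], classify: Classifier
) -> Dict[str, int]:
    """Predict every entry from all the other entries and count the outcomes."""
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for i, peptide in enumerate(positives):
        rest = positives[:i] + positives[i + 1:]  # hold out the query itself
        counts["TP" if classify(peptide, rest, negatives) else "FN"] += 1
    for i, peptide in enumerate(negatives):
        rest = negatives[:i] + negatives[i + 1:]
        counts["FP" if classify(peptide, positives, rest) else "TN"] += 1
    return counts


def identity_nn(query: str, positives: List[str], negatives: List[str]) -> bool:
    """Toy nearest neighbor rule: similarity is the number of identical positions."""
    def similarity(x: str, y: str) -> int:
        return sum(c1 == c2 for c1, c2 in zip(x, y))

    best_pos = max((similarity(query, p) for p in positives), default=-1)
    best_neg = max((similarity(query, n) for n in negatives), default=-1)
    return best_pos > best_neg  # assumption: a tie is resolved as negative
```

The resulting counts TP, FN, TN and FP are the inputs to the Sn, Sp, Ac and MCC definitions of Sect. 2.3.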

Table 1 Leave-one-out prediction performances of the Echo algorithm compared with the other alternatives

Echo also outperforms the alternative algorithms in all the performance measurements for the other three PTM types, i.e., SUMOylation, Nitrated tyrosine and S nitrosylation, as shown in Table 1. A significant improvement is achieved for the prediction of S nitrosylation residues. The improvements of 10.89 % in Ac and 0.2976 in MCC at the high cutoff level of S nitrosylation suggest that Echo performs more consistently in both Sn and Sp. Echo improves the overall accuracy Ac by more than 5 % for both SUMOylation and Nitrated tyrosine, and even improves the MCC by 0.4024 at the high cutoff level of SUMOylation. The performance of \({\rm Sn}=90.15\) % and \({\rm Sp}=99.65\) % for SUMOylation suggests that the annotations of Echo may be reasonably applied to the large-scale characterization of cellular SUMOylation dynamics.

3.2 Fourfold Cross-Validation Performance of jEcho

Reasonable detection performance is also achieved by Echo on all the 15 PTM types using the fourfold cross-validation, as shown in Table 2. As expected, the fourfold cross-validation performance of Echo is slightly lower than the leave-one-out performance reported in the above section. Nevertheless, Echo achieves over 90 % accuracy for most PTM types. Echo performs best on the detection of SUMOylation residues, with 99.06 % in the overall accuracy and 0.8857 in MCC, which is even better than the leave-one-out validation of both Echo and GPS on SUMOylation.

Table 2 Fourfold cross-validation performance for all the 15 PTM types

Fig. 1 User interface of jEcho version 1.0. The tree box on the left gives the hierarchical list of PTM types. The top right box takes the query sequences in FASTA format. The parameters may be tuned in the middle right box. The results are shown in the bottom right table. The illustrated data are the predictions for the example proteins, obtained by clicking the button “Example”

3.3 Prediction and Visualization of PTM Residues

The Echo algorithm is implemented as an easy-to-use PTM prediction software, jEcho v1, using the Java programming language, as shown in Fig. 1 and Supplementary Figure S1. Firstly, jEcho may be used on any operating system with a Java runtime environment. jEcho is packaged as a JAR file that contains all the required external libraries, so a user may run it directly after downloading it. Secondly, jEcho has an all-in-one user interface (UI), so that a user may get all the information from the UI, as is standard for a PTM prediction server/program [8]. Thirdly, after a user generates the PTM predictions for a specific catalytic enzyme, the distribution of all the predicted PTM residues in the current protein may be visualized by clicking the prediction in the bottom right result area, as in Supplementary Figure S1 (d) and (e). Lastly, the predicted results may be exported as a text file or an image file by clicking a button in the top right area of Fig. 1.