jEcho: an Evolved weight vector to CHaracterize the protein’s posttranslational modification mOtifs

Abstract

Posttranslational modification (PTM) of proteins is a major mechanism for dynamically regulating protein function after polypeptide chains are translated from mRNA molecules. Because wet-laboratory characterization of PTMs is costly and labor-intensive, computational detection of PTM residues has become a major complementary technique in recent years. Previous studies demonstrated that the positions flanking a PTM residue contribute differently to its computational detection, but did not translate this observation directly into in silico PTM prediction. We propose a weight vector to represent the varying contributions of the PTM-flanking positions and use an evolutionary algorithm to optimize the vector. Even a simple nearest neighbor algorithm that incorporates the optimized weight vector outperforms the currently available algorithms. The algorithm is implemented as an easy-to-use computer program, jEcho version 1.0. The implementation language, Java, makes jEcho platform-independent and visually interactive. The predicted results may be directly exported as publication-quality images or text files. jEcho may be downloaded from http://www.healthinformaticslab.org/supp/.

Introduction

The human genome harbors 20,687 protein-coding genes and encodes a much larger number of proteins with the help of alternative splicing [1]. After translation from the mature mRNA, a protein is dynamically modified through various mechanisms and exerts its functions in these dynamically changing modified forms. The posttranslational modification (PTM) of a protein usually introduces a biochemical group to a specific residue, and there are more than 300 types of PTMs [2], e.g., phosphorylation and SUMOylation. Phosphorylation is the major mechanism for delivering signals between the extra- and intracellular systems [3], and SUMOylation ensures the stability of the modified proteins [4]. Malfunction of PTMs is known to be associated with various human diseases, including cancer and cardiovascular diseases [5]. A number of PTM types have therefore been extensively studied for their roles in the initiation and development of human diseases.

The PTM residues of proteins may be detected using two major classes of techniques. The first class is experimental: gel- and mass spectrometry-based techniques are widely used to detect the mass change of a peptide after its attachment to the PTM-specific biochemical group, e.g., the 80-Da phosphate group from phosphorylation [6]. Due to the limited availability of catalytic enzymes and low sensitivity, the experimental characterization of PTM residues is still very costly and labor-intensive for proteome-wide studies. The alternative strategy is to computationally screen a query protein for residues whose flanking peptides are highly similar to those of experimentally verified PTM residues. The current literature supports the assumption that two residues with the same or highly similar flanking peptides tend to have a similar probability of being modified by the same PTM type [7]. Various scoring strategies and machine learning algorithms have been applied to computationally detect PTM residues from protein sequences [8, 9].

This study proposes a novel position-dependent scoring strategy, the Echo algorithm, to measure the similarity between two peptides. The position-dependent vector of weights for the positions flanking the PTM residues is optimized by an evolutionary algorithm, which simulates nature's selection process of random mutation and fitness evaluation. Even the simple nearest neighbor classification strategy based on Echo outperforms similar computer programs for three phosphoserine/threonine kinases, three phosphotyrosine kinases and three other PTM types. A computer program, jEcho, is implemented to help biologists use these optimized PTM prediction models easily.

Materials and Methods

Data Sources

Experimentally verified phosphorylated residues were collected from the most comprehensive phosphorylation database, Phospho.ELM [10]. The database's latest version, 9.0, was retrieved on July 31, 2012. This study chooses three phosphoserine/threonine kinases (MAPK3, MAPK8 and CDK5) and three phosphotyrosine kinases (EGFR, Met and SYK) as examples to demonstrate how the evolutionary optimization algorithm contributes to PTM residue predictions. In Phospho.ELM version 9.0, there are 91, 33 and 24 phosphorylated residues for MAPK3, MAPK8 and CDK5, respectively. For EGFR, Met and SYK, 55, 49 and 26 phosphorylated residues were collected, respectively.

Besides phosphorylation, we also tested our system on three other PTM types, i.e., SUMOylation, Nitrated tyrosine and S nitrosylation. The data for these three PTM types were retrieved from the database dbPTM version 3.0 [11] on November 23, 2012. In total, 1051, 96 and 3289 modified residues were collected for the modification types SUMOylation, Nitrated tyrosine and S nitrosylation, respectively.

PTM Prediction Problem

This study investigates the PTM prediction problem, which is defined as follows. Firstly, for a given PTM type, the modification alphabet is defined to be the amino acid(s) that may be modified by this PTM type. That is to say, {S, T} and {Y} are the modification alphabets for phosphoserine/threonine and phosphotyrosine kinases, respectively. SUMOylation, Nitrated tyrosine and S nitrosylation have the modification alphabets {K}, {Y} and {C}, respectively. The experimentally verified PTM residues of this given PTM type constitute the positive dataset \(P=\{P_{1}, P_{2}, \ldots , P_{G}\}\). A positive data entry is a peptide consisting of the a amino acids upstream of the given PTM residue, the modified residue itself and the b amino acids downstream of it, defined as PSP(ab) [7]. The negative dataset \(N=\{N_{1}, N_{2}, {\ldots }, N_{H}\}\) consists of the PSP(ab) peptides of all the other residues belonging to the modification alphabet in the proteins with positive residues, as similarly defined in the other PTM prediction programs [7]. In order to conduct a consistent performance comparison with the program GPS [7], this study uses the same parameters (a = 7 and b = 7) for all the PTM types except SUMOylation. The prediction performance of Echo on SUMOylation is compared with the program SUMOsp [12], so Echo uses the same parameters (a = 5 and b = 5) as SUMOsp.
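The construction of the PSP(ab) datasets can be sketched as follows. This is a minimal illustration rather than the jEcho implementation: the function names, the 1-based site positions and the padding of protein termini with `*` are assumptions, since the text does not state how residues near the termini are handled.

```python
def psp_peptides(sequence, alphabet, a=7, b=7, pad="*"):
    """Yield (position, peptide) for every residue in the modification alphabet.

    Each peptide holds a upstream residues, the candidate residue and
    b downstream residues. Padding the termini with `pad` is an assumption.
    """
    padded = pad * a + sequence + pad * b
    for i, residue in enumerate(sequence):
        if residue in alphabet:
            # Position i in `sequence` sits at index i + a in `padded`,
            # so the window starts at index i and has length a + 1 + b.
            yield i + 1, padded[i:i + a + 1 + b]


def build_datasets(sequence, modified_positions, alphabet, a=7, b=7):
    """Split the PSP(ab) peptides of one protein into positives and negatives."""
    positives, negatives = [], []
    for position, peptide in psp_peptides(sequence, alphabet, a, b):
        if position in modified_positions:
            positives.append(peptide)
        else:
            negatives.append(peptide)
    return positives, negatives
```

For a hypothetical protein `MKSAYTPLS` with a verified phosphoserine at position 3, `build_datasets("MKSAYTPLS", {3}, {"S", "T"}, a=2, b=2)` puts the peptide around S3 into the positive dataset and the peptides around T6 and S9 into the negative dataset.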

Echo chooses the simple nearest neighbor algorithm for the PTM prediction problem. The similarity between two PSP(ab) peptides A and B is defined as \(Score(A,\;B) = \Big \{ \sum \nolimits _{i\in [1,\;a+1+b],\;i\ne a+1} (w_i \times BLOSUM62(A_i, B_i)) \Big \}/(a+b),\) where \(w_{i}\) is a predefined weight for the position i, and BLOSUM62\((A_{i}, B_{i})\) is the similarity score in the matrix BLOSUM62 [13] between the two amino acids \(A_{i}\) and \(B_{i}\). The central position \(i = a+1\), i.e., the candidate residue itself, is excluded from the sum. Given the two datasets P and N, a query peptide Q is assigned to the dataset that contains its nearest neighbor, i.e., the peptide with the highest similarity score to Q. The weight vector \(W=\{w_{1},w_{2}, {\ldots }, w_{a+1+b}\}\) is optimized by an evolutionary algorithm described in the next section.
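The score and the nearest neighbor rule can be sketched in a few lines. To keep the example self-contained, the BLOSUM62 lookup is replaced by a hypothetical stand-in, `toy_substitution`, which is not the real matrix; any function returning a substitution score for two residues can be passed as `sub`. The function names and the tie-breaking rule (ties go to the negative class) are also assumptions.

```python
def toy_substitution(x, y):
    """Stand-in for BLOSUM62(x, y): +4 for identical residues, -1 otherwise.

    This is NOT the real BLOSUM62 matrix; it only keeps the sketch runnable.
    """
    return 4 if x == y else -1


def echo_score(pep_a, pep_b, weights, a, b, sub=toy_substitution):
    """Weighted similarity of two PSP(ab) peptides, skipping the central residue."""
    center = a  # 0-based index of position a + 1 in the formula
    total = 0.0
    for i in range(a + 1 + b):
        if i == center:
            continue
        total += weights[i] * sub(pep_a[i], pep_b[i])
    return total / (a + b)


def predict(query, positives, negatives, weights, a, b, sub=toy_substitution):
    """Return True if the nearest neighbor of `query` is a positive peptide."""
    best_pos = max(echo_score(query, p, weights, a, b, sub) for p in positives)
    best_neg = max(echo_score(query, n, weights, a, b, sub) for n in negatives)
    return best_pos > best_neg  # ties are assigned to the negative class (assumption)
```

With uniform weights the score reduces to an average substitution score over the flanking positions; the optimized weight vector changes how much each flanking position contributes.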

Evaluation Measurements and Evolutionary Algorithm

This study evaluates a PTM prediction algorithm's performance by its sensitivity (Sn), specificity (Sp), accuracy (Ac) and Matthews correlation coefficient (MCC) [7, 14]. For the positive and negative datasets P and N, a true positive is a positive data entry predicted to be positive, whereas a positive data entry is a false negative if it is predicted to be negative. A negative data entry is defined to be a true negative or a false positive if it is predicted correctly or incorrectly, respectively. The numbers of data entries in these four classes are abbreviated as TP, FN, TN and FP, respectively. The algorithm's prediction performance measurements are defined as \({\rm Sn}={\rm TP}/({\rm TP}+{\rm FN}),\; {\rm Sp}={\rm TN}/({\rm TN}+{\rm FP}),\; {\rm Ac}=({\rm Sn}+{\rm Sp})/2,\) and \({\rm MCC}=({\rm TP} \times {\rm TN}-{\rm FP} \times {\rm FN})/{\rm sqrt}(({\rm TP}+{\rm FP}) \times ({\rm TP}+{\rm FN}) \times ({\rm TN}+{\rm FP}) \times ({\rm TN}+{\rm FN})),\) where sqrt(X) is the square root of X.
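The four measurements follow directly from the confusion counts, as the short sketch below shows. The function name `performance` and the convention of returning MCC = 0 when the denominator vanishes are assumptions; both datasets are assumed to be non-empty.

```python
import math


def performance(tp, fn, tn, fp):
    """Compute Sn, Sp, Ac and MCC from the confusion counts.

    Ac is the mean of Sn and Sp, as defined in the text. Returning
    MCC = 0.0 for a zero denominator is a convention chosen for this sketch.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (sn + sp) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}
```

Because Ac averages Sn and Sp rather than counting correct predictions over all entries, it is not dominated by the much larger negative dataset.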

An evolutionary algorithm simulates nature's random mutation and competitive selection process, and works well on some optimization problems for which there are no clues about the optimal patterns [15, 16]. In this work, the weight vector \(W=\{w_{1}, w_{2}, \ldots , w_{a+1+b}\}\) is defined to be the molecule that receives the random mutations, and the selection/optimization goal is to maximize the accuracy Ac. Each generation consists of 100 individuals, i.e., weight vectors. After the random mutations, 300 pairs of parent individuals are randomly chosen to randomly exchange half of the positions of their weight vectors. Only the individuals with the top 95 Ac values survive into the next generation. To avoid a decrease in Ac in the next generation, the best five individuals of the current generation are also kept intact for the next generation. All nine PTM types reach their best Ac values after 1000 generations of optimization. The optimized weight vectors are listed in Supplementary Table S1.
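One possible reading of this optimization loop is sketched below. The population size (100), the number of crossover pairs (300), the five elite individuals and the 95 survivors come from the text; the mutation rate, the Gaussian mutation step, the initial weight range and the function name `evolve` are assumptions, because the text does not specify them. The `fitness` argument is any function that maps a weight vector to a score; in Echo it would be the accuracy Ac of the nearest neighbor prediction.

```python
import random


def evolve(fitness, length, generations=1000, pop_size=100, n_pairs=300,
           n_elite=5, mutation_rate=0.1, rng=None):
    """Evolve a weight vector of the given length that maximizes `fitness`."""
    if rng is None:
        rng = random.Random(0)
    # Initial weights drawn uniformly from [0, 1) (assumption).
    population = [[rng.random() for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep unmodified copies of the best individuals (elitism).
        ranked = sorted(population, key=fitness, reverse=True)
        elite = [individual[:] for individual in ranked[:n_elite]]
        # Random mutation: perturb each weight with a small probability.
        mutants = [[w + rng.gauss(0, 0.1) if rng.random() < mutation_rate else w
                    for w in individual] for individual in population]
        # Crossover: each random pair exchanges half of its positions.
        offspring = []
        for _ in range(n_pairs):
            p1, p2 = rng.sample(mutants, 2)
            swap = set(rng.sample(range(length), length // 2))
            offspring.append([p2[i] if i in swap else p1[i] for i in range(length)])
            offspring.append([p1[i] if i in swap else p2[i] for i in range(length)])
        # Selection: the top candidates survive and join the elite.
        candidates = sorted(mutants + offspring, key=fitness, reverse=True)
        population = elite + candidates[:pop_size - n_elite]
    return max(population, key=fitness)
```

Because the elite individuals are copied unchanged, the best fitness in the population cannot decrease from one generation to the next, which is the property the text attributes to keeping the best five individuals intact.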

Results and Discussion

Comparison of Leave-One-Out Performance

Firstly, we compare Echo's prediction accuracy on the three phosphoserine/threonine kinases and three phosphotyrosine kinases with the computer program GPS version 2.1 using the same Jack-Knife validation [14]. The Jack-Knife validation is also called the leave-one-out (LOO) validation, which predicts each data entry's modification status using all the other data entries as the training dataset [17]. Echo outperforms the GPS 2.1 algorithm in all four prediction performance measurements at the corresponding cutoff levels for all six kinases, as shown in Table 1. Echo improves the overall accuracy Ac by more than 10 % for the phosphoserine/threonine kinase CDK5 and the phosphotyrosine kinase Met, compared with the low cutoff values of the algorithm GPS 2.1. A gain of more than 0.20 in the Matthews correlation coefficient (MCC) for CDK5, EGFR and Met also suggests that Echo performs consistently well on both the positive and negative datasets for these kinases. For example, Echo achieves 100 % sensitivity (Sn), i.e., accuracy on the positive dataset, and more than 95 % specificity for the kinases CDK5 and Met, as shown in Table 1. We further evaluate Echo's performance on identifying phosphorylation residues of six more common kinases, PKA_alpha, MAPK1, Abl, PKG, Aurora_A and ATM. Echo outperforms GPS 2.1 in all cases and at all threshold values. The maximum improvement in accuracy, 14.04 %, is achieved by Echo at the low threshold value of the kinase Abl.
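The leave-one-out procedure itself is independent of the classifier and can be sketched as follows. The `classify` callback and the toy identity-count classifier are illustrative assumptions and stand in for the weighted nearest neighbor prediction; the sketch also assumes a fixed weight vector and does not cover how the cutoff levels in Table 1 are applied.

```python
def toy_classify(query, positives, negatives):
    """Toy nearest neighbor rule: similarity = number of identical positions."""
    def similarity(x, y):
        return sum(c1 == c2 for c1, c2 in zip(x, y))
    best_pos = max(similarity(query, p) for p in positives)
    best_neg = max(similarity(query, n) for n in negatives)
    return best_pos > best_neg


def leave_one_out(positives, negatives, classify=toy_classify):
    """Predict each peptide from all the others and return (TP, FN, TN, FP)."""
    tp = fn = tn = fp = 0
    for i, peptide in enumerate(positives):
        rest = positives[:i] + positives[i + 1:]  # hold out the query itself
        if classify(peptide, rest, negatives):
            tp += 1
        else:
            fn += 1
    for i, peptide in enumerate(negatives):
        rest = negatives[:i] + negatives[i + 1:]
        if classify(peptide, positives, rest):
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp
```

Holding out the query is essential for a nearest neighbor classifier, since otherwise every peptide would simply match itself.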

Table 1 Leave-one-out prediction performances of the Echo algorithm compared with the other alternatives

Echo also outperforms the alternative algorithms in all performance measurements for the other three PTM types, i.e., SUMOylation, Nitrated tyrosine and S nitrosylation, as shown in Table 1. A significant improvement has been achieved for S nitrosylation residue predictions. The improvements of 10.89 % in Ac and 0.2976 in MCC at the high cutoff level of S nitrosylation suggest that Echo performs more consistently in both Sn and Sp. Echo improves the overall accuracy Ac by more than 5 % for both SUMOylation and Nitrated tyrosine, and even improves the MCC by 0.4024 at the high cutoff level of SUMOylation. The performance of \({ \rm Sn}=90.15\) % and \({\rm Sp}=99.65\) % for SUMOylation suggests that the annotations of Echo may be reasonably applied to the large-scale characterization of cellular SUMOylation dynamics.

Fourfold Cross-Validation Performance of jEcho

Reasonable detection performance is also achieved by Echo on all the 15 PTM types using the fourfold cross-validation, as shown in Table 2. As expected, the performance values of Echo under the fourfold cross-validation are slightly lower than those under the leave-one-out validation in the above section. Nevertheless, Echo achieves over 90 % accuracy for most PTM types. Echo performs best on the detection of SUMOylation residues, with 99.06 % overall accuracy and an MCC of 0.8857, which is even better than the leave-one-out validation results of both Echo and GPS on SUMOylation.
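A fourfold cross-validation can be sketched as below. The text does not describe how the folds were formed, so the random shuffling, the separate splitting of the positive and negative datasets, the function names and the toy classifier are all assumptions made only to keep the example runnable.

```python
import random


def toy_classify(query, positives, negatives):
    """Toy nearest neighbor rule: similarity = number of identical positions."""
    def similarity(x, y):
        return sum(c1 == c2 for c1, c2 in zip(x, y))
    return (max(similarity(query, p) for p in positives)
            > max(similarity(query, n) for n in negatives))


def k_fold_indices(n, k=4, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(positives, negatives, classify=toy_classify, k=4, seed=0):
    """Return (TP, FN, TN, FP) accumulated over k train/test splits."""
    tp = fn = tn = fp = 0
    pos_folds = k_fold_indices(len(positives), k, seed)
    neg_folds = k_fold_indices(len(negatives), k, seed + 1)
    for pos_test, neg_test in zip(pos_folds, neg_folds):
        held_pos, held_neg = set(pos_test), set(neg_test)
        train_pos = [p for i, p in enumerate(positives) if i not in held_pos]
        train_neg = [n for i, n in enumerate(negatives) if i not in held_neg]
        for i in pos_test:
            if classify(positives[i], train_pos, train_neg):
                tp += 1
            else:
                fn += 1
        for i in neg_test:
            if classify(negatives[i], train_pos, train_neg):
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp
```

Each peptide is tested exactly once, and with k = 4 every prediction relies on about three quarters of the data, which is consistent with the slightly lower performance relative to the leave-one-out validation.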

Table 2 Fourfold cross-validation performance of Echo for all the 15 PTM types

Fig. 1 User interface of jEcho version 1.0. The tree box on the left gives the hierarchical list of PTM types. The top right box accepts the query sequences in FASTA format. The parameters may be tuned in the middle right box. The results are shown in the bottom right table. The illustrated data are the predictions for the example proteins, obtained by clicking the button “Example”

Prediction and Visualization of PTM Residues

The Echo algorithm is implemented as an easy-to-use PTM prediction program, jEcho version 1.0, using the Java programming language, as shown in Fig. 1 and Supplementary Figure S1. Firstly, jEcho may be used on any operating system with a Java runtime environment. jEcho is packaged as a JAR file that contains all the required external libraries, so a user may run jEcho directly after downloading it. Secondly, jEcho has an all-in-one user interface (UI), so that a user may get all the information from the UI, following the standard for PTM prediction servers/programs [8]. Thirdly, after a user generates the PTM predictions for a specific catalytic enzyme, the distribution of all the predicted PTM residues in the current protein may be visualized by clicking the prediction in the bottom right result area, as in Supplementary Figure S1 (d) and (e). Lastly, the predicted results may be exported as a text file or an image file by clicking a button in the top right area of Fig. 1.

References

  1. Pennisi E (2012) Genomics. ENCODE project writes eulogy for junk DNA. Science 337(6099):1159–1161
  2. Witze ES, Old WM, Resing KA, Ahn NG (2007) Mapping protein post-translational modifications with mass spectrometry. Nat Methods 4(10):798–806
  3. Mowen KA, David M (2014) Unconventional post-translational modifications in immunological signaling. Nat Immunol 15(6):512–520
  4. Li Z, Hu Q, Zhou M, Vandenbrink J, Li D, Menchyk N, Reighard S, Norris A, Liu H, Sun D et al (2013) Heterologous expression of OsSIZ1, a rice SUMO E3 ligase, enhances broad abiotic stress tolerance in transgenic creeping bentgrass. Plant Biotechnol J 11(4):432–445
  5. Kamath KS, Vasavada MS, Srivastava S (2011) Proteomic databases and tools to decipher post-translational modifications. J Proteomics 75(1):127–144
  6. Loughrey Chen S, Huddleston MJ, Shou W, Deshaies RJ, Annan RS, Carr SA (2002) Mass spectrometry-based methods for phosphorylation site mapping of hyperphosphorylated proteins applied to Net1, a regulator of exit from mitosis in yeast. Mol Cell Proteomics 1(3):186–196
  7. Zhou FF, Xue Y, Chen GL, Yao X (2004) GPS: a novel group-based phosphorylation predicting and scoring method. Biochem Biophys Res Commun 325(4):1443–1448
  8. Zhou F, Xue Y, Yao X, Xu Y (2006) A general user interface for prediction servers of proteins’ post-translational modification sites. Nat Protoc 1(3):1318–1321
  9. Trost B, Kusalik A (2011) Computational prediction of eukaryotic phosphorylation sites. Bioinformatics 27(21):2927–2935
  10. Diella F, Gould CM, Chica C, Via A, Gibson TJ (2008) Phospho.ELM: a database of phosphorylation sites-update 2008. Nucleic Acids Res 36(Database issue):D240–D244
  11. Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH (2006) dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res 34(Database issue):D622–D627
  12. Xue Y, Zhou F, Fu C, Xu Y, Yao X (2006) SUMOsp: a web server for sumoylation site prediction. Nucleic Acids Res 34(Web Server issue):W254–W257
  13. Mount DW (2008) Using BLOSUM in sequence alignments. CSH Protoc 2008:39
  14. Xue Y, Liu Z, Cao J, Ma Q, Gao X, Wang Q, Jin C, Zhou Y, Wen L, Ren J (2011) GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length selection. Protein Eng Des Sel 24(3):255–260
  15. Falkenauer E, Delchambre A (1992) A genetic algorithm for bin packing and line balancing. In: Proceedings of the 1992 IEEE international conference on robotics and automation. IEEE, pp 1186–1192
  16. Van Soest A, Casius L (2003) The merits of a parallel genetic algorithm in solving hard optimization problems. J Biomech Eng 125(1):141–146
  17. Yan C, Honavar V, Dobbs D (2004) Identification of interface residues in protease-inhibitor and antigen-antibody complexes: a support vector machine approach. Neural Comput Appl 13(2):123–129

Acknowledgments

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400), Shenzhen Peacock Plan (KQCX20130628112914301), Shenzhen Research Grants (ZDSY20120617113021359, CXB201104220026A and JCYJ20130401170306884) and Key Laboratory of Human-Machine-Intelligence Synergic Systems, Chinese Academy of Sciences, China 973 program (2010CB732606), the MOE Humanities Social Sciences Fund (No. 13YJC790105) and Doctoral Research Fund of HBUT (No. BSQD13050). Computing resources were partly provided by the Dawning supercomputing clusters at SIAT CAS. Constructive comments from the anonymous reviewers are appreciated.

Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Author information

Corresponding author

Correspondence to Fengfeng Zhou.

Additional information

Miaomiao Zhao and Zhao Zhang contributed equally to this work.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 655 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Cite this article

Zhao, M., Zhang, Z., Mai, G. et al. jEcho: an Evolved weight vector to CHaracterize the protein’s posttranslational modification mOtifs. Interdiscip Sci Comput Life Sci 7, 194–199 (2015). https://doi.org/10.1007/s12539-015-0260-2

Keywords

  • jEcho
  • Evolutionary algorithm
  • Posttranslational modification (PTM)
  • Motif
  • Phosphorylation