Predicting Post-Translational Modifications from Local Sequence Fragments Using Machine Learning Algorithms: Overview and Best Practices

  • Marcin Tatjewski
  • Marcin Kierczak
  • Dariusz Plewczynski
Part of the Methods in Molecular Biology book series (MIMB, volume 1484)


Here, we present two perspectives on the task of predicting post-translational modifications (PTMs) from local sequence fragments using machine learning algorithms. The first describes the fundamental steps required to construct a PTM predictor from scratch. These steps include data gathering, feature extraction, and machine-learning classifier selection. The second discusses in detail the more advanced problems encountered in PTM prediction. The most challenging issues covered here are probably: (1) how to address class imbalance in the training data (we also present statistics describing the problem); (2) how to set up cross-validation folds properly so that they take into account the homology of protein data records, a problem we address with our folds-over-clusters algorithm; and (3) how to efficiently find new sources of learning features. The techniques and notes presented here result from intensive studies in the field, performed by our group and others, and can be useful both for researchers beginning in PTM prediction and for those who want to extend their repertoire of research techniques.
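To make point (2) concrete, the following is a minimal sketch of homology-aware fold assignment. It is not the folds-over-clusters algorithm from the chapter, whose details are not given in this abstract. It only illustrates the general idea that all records from one homology cluster should land in the same fold. The function names, the greedy balancing rule, and the toy cluster labels are assumptions made for illustration; in practice the cluster labels would come from a sequence-clustering tool.

```python
from collections import defaultdict


def cluster_folds(cluster_ids, n_folds):
    """Assign each record to a fold so that no cluster spans two folds.

    cluster_ids: one cluster label per record (hypothetical input, e.g. the
    output of a sequence-clustering tool).
    Returns a list with one fold index per record.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    fold_sizes = [0] * n_folds
    fold_of = [None] * len(cluster_ids)
    # Greedy balancing (an assumption, not the chapter's rule): place the
    # largest clusters first, each into the currently smallest fold.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = min(range(n_folds), key=lambda f: fold_sizes[f])
        for idx in members[cid]:
            fold_of[idx] = target
        fold_sizes[target] += len(members[cid])
    return fold_of


def train_test_splits(fold_of, n_folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for f in range(n_folds):
        test = [i for i, k in enumerate(fold_of) if k == f]
        train = [i for i, k in enumerate(fold_of) if k != f]
        yield train, test


# Hypothetical toy data: 8 sequence fragments in 4 homology clusters.
clusters = ["A", "A", "A", "B", "B", "C", "D", "D"]
folds = cluster_folds(clusters, n_folds=2)
```

Because whole clusters are moved together, a test fold never contains a close homolog of a training record, which is the source of the optimistic bias that plain random folds can introduce.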

Key words

Phosphorylation; Feature extraction; Feature selection; Class imbalance; Cross-validation



Acknowledgments

Marcin Tatjewski was supported by the European Union from resources of the European Social Fund, Project PO KL “Information technologies: Research and their interdisciplinary applications”, Agreement UDA-POKL.04.01.01-00-051/10-00. Marcin Tatjewski and Dariusz Plewczynski were supported by the Polish National Science Centre (grant numbers: 2015/16/T/ST6/00493, 2014/15/B/ST6/05082, and 2013/09/B/NZ2/00121) and by the EU COST actions BM1405 and BM1408. Marcin Kierczak was supported by the Swedish Foundation for Strategic Research and the Swedish Research Council.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Marcin Tatjewski (1, 2)
  • Marcin Kierczak (3)
  • Dariusz Plewczynski (4, corresponding author)
  1. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
  2. Centre of New Technologies, University of Warsaw, Warsaw, Poland
  3. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
  4. Centre of New Technologies, University of Warsaw, Warsaw, Poland
