A Text Mining and Machine Learning Protocol for Extracting Posttranslational Modifications of Proteins from PubMed: A Special Focus on Glycosylation, Acetylation, Methylation, Hydroxylation, and Ubiquitination

  • Protocol
Biomedical Text Mining

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2496))

Abstract

Posttranslational modifications (PTMs) of proteins play a significant role in human cellular functions ranging from localization to signal transduction. Hundreds of PTMs act in a human cell, but only a select few are well established and documented. PubMed includes thousands of papers on these PTMs, and assimilating the useful information manually is a challenge for biomedical researchers. Text mining approaches and machine learning algorithms offer an alternative by extracting the relevant information from PubMed automatically. Protein phosphorylation is a well-established PTM that remains under active study, and many systems already exist for extracting protein phosphorylation information. A recent hybrid approach combines text mining and machine learning to extract protein phosphorylation information from PubMed. Other common PTMs involve the same kinds of entities in the PTM process, namely the substrate, the enzymes, and the amino acid residues; these include glycosylation, acetylation, methylation, hydroxylation, and ubiquitination. This similarity motivated us to repurpose and extend the text mining protocol and machine learning information extraction methodology developed for protein phosphorylation to these PTMs. This chapter briefly outlines the chemistry behind each of these PTMs and explains how the text mining protocol and machine learning algorithm are adapted for them.
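The abstract does not spell out the protocol itself, so the following is only a minimal sketch of one plausible first step: flagging sentences that mention one of the five PTMs and picking out candidate amino acid residues. The trigger stems, the residue pattern, the function name `find_ptm_mentions`, and the example sentences are all assumptions made for illustration; they are not the chapter's actual rules, and a real system would add protein and enzyme tagging plus a trained classifier on top.

```python
import re

# Hypothetical trigger stems for the five PTMs named in the abstract.
# The optional prefix catches forms such as "trimethylated" or
# "polyubiquitinated"; the leading \b rejects "deacetylation" and similar.
_PREFIX = r"\b(?:mono|di|tri|poly)?"
PTM_PATTERNS = {
    "glycosylation": re.compile(_PREFIX + r"glycosylat\w*", re.IGNORECASE),
    "acetylation": re.compile(_PREFIX + r"acetylat\w*", re.IGNORECASE),
    "methylation": re.compile(_PREFIX + r"methylat\w*", re.IGNORECASE),
    "hydroxylation": re.compile(_PREFIX + r"hydroxylat\w*", re.IGNORECASE),
    "ubiquitination": re.compile(
        _PREFIX + r"ubiquit(?:in|yl)at\w*", re.IGNORECASE
    ),
}

# Assumed residue notation: a three-letter code followed by a position
# ("Lys382", "Asn-297") or a lower-case full name plus a position
# ("lysine 382"). Only a few residues are listed to keep the sketch short.
RESIDUE = re.compile(
    r"\b(?:Lys|Arg|Ser|Thr|Asn|Pro)-?\d+\b"
    r"|\b(?:lysine|arginine|serine|threonine|asparagine|proline)\s+\d+\b"
)


def find_ptm_mentions(sentence):
    """Return (PTM types, residue mentions) found in one sentence."""
    ptms = [name for name, pat in PTM_PATTERNS.items() if pat.search(sentence)]
    residues = RESIDUE.findall(sentence)
    return ptms, residues


if __name__ == "__main__":
    # Illustrative sentences, not taken from the chapter's corpus.
    examples = [
        "p300 acetylates p53 at Lys382.",
        "HDAC1 mediates deacetylation of histone H3.",
    ]
    for text in examples:
        print(text, "->", find_ptm_mentions(text))
```

Sentences that pass a filter of this kind would then go to the entity tagging and classification stages; keeping the trigger list in a dictionary makes it straightforward to add further PTM types without touching the matching logic.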

Copyright information

© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Arumugam, K., Sellappan, M., Anand, D., Anand, S., Radhakrishnan, S.V. (2022). A Text Mining and Machine Learning Protocol for Extracting Posttranslational Modifications of Proteins from PubMed: A Special Focus on Glycosylation, Acetylation, Methylation, Hydroxylation, and Ubiquitination. In: Raja, K. (eds) Biomedical Text Mining. Methods in Molecular Biology, vol 2496. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2305-3_10

  • DOI: https://doi.org/10.1007/978-1-0716-2305-3_10

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-2304-6

  • Online ISBN: 978-1-0716-2305-3

  • eBook Packages: Springer Protocols
