Bioinformatics pp 139-158

Metabolic Pathway Mining

  • Jan M. Czarnecki
  • Adrian J. Shepherd
Part of the Methods in Molecular Biology book series (MIMB, volume 1526)


The study of metabolic pathways is one of the most important fields of bioscience in the post-genomic era, but curating metabolic pathways requires considerable manpower. As a result, databases contain few reliable, experimentally verified metabolic pathways and are forced to predict all but the most immediately useful ones.

Text-mining has the potential to solve this problem. However, while sophisticated text-mining methods have been developed to assist the curation of many types of biomedical networks, such as protein–protein interaction networks, the mining of metabolic pathways from the literature has been largely neglected by the text-mining community. In this chapter we describe a pipeline for extracting metabolic pathways from the literature; the pipeline is built on freely available open-source components and a heuristic metabolic reaction extraction algorithm.

Key words

Metabolic pathway; Metabolic interaction extraction; Text-mining; Natural language processing; Named entity recognition; Information extraction



Abbreviations

NER: Named entity recognition
NLP: Natural language processing
PPI: Protein–protein interaction



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. School of Biosciences, Birkbeck, University of London, London, UK
  2. Department of Biological Sciences and Institute of Structural and Molecular Biology, Birkbeck, University of London, London, UK
