A Guide to Dictionary-Based Text Mining

  • Helen V. Cook
  • Lars Juhl Jensen
Part of the Methods in Molecular Biology book series (MIMB, volume 1939)


PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read the literature of any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in a given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Data in a structured format is far easier for computational methods to access, cross-reference, and mine. This is increasingly useful as biological research becomes more focused on systems and multi-omics integration. This chapter provides an overview of the steps required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then details how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
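As a rough illustration of what tokenization, dictionary-based named entity recognition, and normalization can look like in practice, the following minimal Python sketch matches token windows against a small name dictionary and maps each match to an identifier. It is not the JensenLab tagger and does not reflect its input formats or matching rules; the dictionary entries, the identifiers (`GENE:0001`, `DIS:0042`), and the function names `tokenize` and `tag` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical toy dictionary: lowercased name -> (entity type, identifier).
# The identifiers are illustrative placeholders, not real database accessions.
DICTIONARY = {
    "tp53": ("gene", "GENE:0001"),
    "p53": ("gene", "GENE:0001"),
    "tumor protein p53": ("gene", "GENE:0001"),
    "breast cancer": ("disease", "DIS:0042"),
}

# Longest dictionary name, measured in tokens; bounds the matching window.
MAX_TOKENS = max(len(name.split()) for name in DICTIONARY)


def tokenize(text):
    """Split text into (token, start, end) triples of word-character runs."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def tag(text):
    """Greedy longest-match lookup of token windows in the dictionary.

    Returns (matched text, start, end, entity type, identifier) tuples.
    Synonyms normalize to the same identifier via the dictionary.
    """
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so multi-word names win over substrings.
        for n in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            start, end = tokens[i][1], tokens[i + n - 1][2]
            key = " ".join(tok.lower() for tok, _, _ in tokens[i:i + n])
            if key in DICTIONARY:
                entity_type, identifier = DICTIONARY[key]
                matches.append((text[start:end], start, end, entity_type, identifier))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no name starts at this token
    return matches


if __name__ == "__main__":
    sentence = "Mutations in tumor protein p53 (TP53) are common in breast cancer."
    for match in tag(sentence):
        print(match)
```

In this sketch, the two gene synonyms in the example sentence resolve to the same placeholder identifier, which is the essence of normalization. A real dictionary would also need to handle ambiguous names, spelling variants, and stop words.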

Key words

Automated text processing; Dictionary-based approach; Named entity recognition; PubMed; Structured information; Text mining; Text normalization



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Clinical Medicine, University of Cambridge, Cambridge, UK
  2. Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
