Mapping of Biomedical Text to Concepts of Lexicons, Terminologies, and Ontologies

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 1159)

Abstract

Concept mapping is a fundamental task in biomedical text mining in which textual mentions of concepts of interest are annotated with specific entries of the lexicons, terminologies, ontologies, or databases that represent these concepts. Despite a significant amount of research, there are still few practical, publicly available tools that perform concept mapping of user-specified biomedical text as an independent task. This chapter first presents several tools that can automatically map biomedical text to concepts from a wide range of terminological resources, then those that map to more restricted sets of these resources. The presentation is intended to guide researchers without a background in biomedical concept mapping in selecting an appropriate tool based on usability, scalability, configurability, the balance between precision and recall, and the desired set of terminological resources with which to annotate the text. Only with effective automatic concept-mapping tools will systems be able to scalably analyze the biomedical literature and other large sets of documents as a fundamental part of more complex text-mining tasks such as information extraction and hypothesis evaluation and generation.

Key words

Concept mapping; Concept recognition; Concept normalization; Annotation; Terminologies; Vocabularies; Ontologies

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Computational Bioscience Program, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, USA
