Discovering Biomedical Knowledge from the Literature

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 484)

Abstract

Biomedical knowledge is to a very large extent represented only in textual form. To make this knowledge accessible to humans and to further automatic processing, text mining applications have been developed. At the end of this chapter we present an overview of the most important open access applications and their functionality. The main part of the chapter is devoted to the major problems that all such applications must deal with. The first problem is terminology processing, i.e., recognizing biomedical terms and identifying their meanings, at least to a certain degree. The second problem is bringing together information units that are distributed over more than one sentence: the task of coreference resolution consists of identifying the entities to which the text refers in different sentences and in different ways. The third problem we discuss is information extraction, in particular the extraction of relational information. The representation of domain knowledge is an indispensable component of any text mining application. We discuss different types and depths of ontological modeling and how this knowledge helps to accomplish the tasks described above. An overview of ontological resources is given at the end of the chapter.

Key words

Natural language processing, text mining, information extraction, named entity recognition, terminology processing, ontologies, taxonomies, ambiguity, coreference

Copyright information

© Humana Press, Totowa, NJ 2008

Authors and Affiliations

  1. Boehringer Ingelheim Pharma GmbH & Co., Biberach, Germany
  2. EML Research gGmbH, Heidelberg, Germany
  3. Institute for Computational Linguistics, University of Stuttgart, Stuttgart, Germany
