Functional Proteomics, pp. 415–433
Discovering Biomedical Knowledge from the Literature
Abstract
Biomedical knowledge is, to a very large extent, represented only in textual form. Text mining applications have been developed to make this knowledge accessible to humans and to further automatic processing. At the end of this chapter we present an overview of the most important open access applications and their functionality. The main part of the chapter is devoted to the major problems with which all such applications have to deal. The first problem is terminology processing, i.e., recognizing biomedical terms and identifying their meanings, at least to a certain degree. The second problem is to bring together information units that are distributed over more than one sentence: the task of coreference resolution consists of identifying the entities to which the text refers in different sentences and in different ways. The third problem we discuss is information extraction, in particular the extraction of relational information. The representation of domain knowledge is an indispensable component of any text mining application. We discuss different types and depths of ontological modeling and how this knowledge helps to accomplish the tasks described above. An overview of ontological resources is given at the end of the chapter.
Key words
Natural language processing; text mining; information extraction; named entity recognition; terminology processing; ontologies; taxonomies; ambiguity; coreference