GeneTUC, GENIA and Google: Natural Language Understanding in Molecular Biology Literature

Sætre, Rune; Søvik, Harald; Amble, Tore; Tsuruoka, Yoshimasa

doi:10.1007/11790105_6

Part of the book series: Lecture Notes in Computer Science (TCSB, volume 4070)

Abstract

With the increasing amount of biomedical literature, there is a need for automatic extraction of information to support biomedical researchers. GeneTUC has been developed to read biological texts and answer questions about them afterwards. The knowledge base of the system is constructed by parsing MEDLINE abstracts or other online text strings retrieved by the Google API. When the system encounters words that are not in the dictionary, the Google API can be used to automatically determine the semantic class of the word and add it to the dictionary. The performance of the GeneTUC parser was tested against the manually tagged GENIA corpus with EvalB, giving bracketing precision and recall scores of 70.6% and 53.9%, respectively. GeneTUC was able to parse 60.2% of the sentences, and the POS-tagging accuracy was 86.0%. These figures are lower than those of the best available taggers and parsers, but GeneTUC is also capable of deep reasoning, such as anaphora resolution and question answering, which state-of-the-art parsers do not provide.
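
To make the bracketing scores concrete, the following is a minimal sketch of how labeled bracketing precision and recall are commonly computed when a parser's output is compared with a gold-standard treebank. It is not the EvalB tool or the authors' evaluation code; the function names, the `(label, start, end)` span representation, and the toy spans are assumptions made purely for illustration, and EvalB applies further conventions (such as ignoring certain labels and punctuation) that are not modelled here.

```python
from collections import Counter


def bracket_scores(gold, predicted):
    """Return (precision, recall) for two lists of (label, start, end) spans.

    Hypothetical helper for illustration only; it counts a predicted
    bracket as correct when an identical labeled span occurs in the gold
    analysis, respecting duplicates via multiset intersection.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example: invented spans over a five-token sentence.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

p, r = bracket_scores(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1(p, r):.3f}")

# Harmonic mean of the precision and recall reported in the abstract.
# This derived value is not stated in the abstract itself.
print(f"F1 for 70.6% / 53.9%: {f1(0.706, 0.539):.3f}")
```

In the toy example three of the five predicted brackets match the four gold brackets, so precision is lower than recall. The abstract reports the opposite pattern (precision well above recall), which is what one would expect when a parser produces no brackets at all for the sentences it fails to parse.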

Author information

Authors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Sælandsv. 7-9, NO-7491, Trondheim, Norway
Rune Sætre, Harald Søvik & Tore Amble
Department of Computer Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, 113-0033, Japan
Yoshimasa Tsuruoka

Editor information

Editors and Affiliations

The Microsoft Research - Centre for Computational and Systems Biology, University of Trento, Piazza Manci, 17, 38050, Povo (TN), Italy
Corrado Priami
College of Computer and Information Engineering, Henan University, Henan, China
Xiaohua Hu
Georgia State University, Dept. of CS, 30302, Atlanta, GA, USA
Yi Pan
Department of Computer Science, San Jose State University, CA 95192, San Jose, USA
Tsau Young Lin

About this paper

Cite this paper

Sætre, R., Søvik, H., Amble, T., Tsuruoka, Y. (2006). GeneTUC, GENIA and Google: Natural Language Understanding in Molecular Biology Literature. In: Priami, C., Hu, X., Pan, Y., Lin, T.Y. (eds) Transactions on Computational Systems Biology V. Lecture Notes in Computer Science, vol 4070. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11790105_6

DOI: https://doi.org/10.1007/11790105_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36048-3
Online ISBN: 978-3-540-36049-0
eBook Packages: Computer Science (R0)
