Multilevel Annotation for Information Extraction

Kim, Jin-Dong; Ohta, Tomoko; Tsujii, Jun’ichi

doi:10.1007/978-90-481-3331-4_7

Part of the book series: Text, Speech and Language Technology (TLTB, volume 41)

Abstract

Information Extraction (IE) is the broad task of detecting and extracting specific structured information from unstructured natural language text. IE typically requires analysis to determine the linguistic structure of text and semantic processing to map linguistic structures to semantic ones. For real-world applications, this processing often needs to be performed at various levels, determining, for example, parts of speech, syntactic structure, named entities, and events. Multilevel annotations made to a corpus are a necessary resource for the development of multilevel text processing tools and eventually automatic IE systems, providing both reference and training material for method development and benchmark data sets. This chapter introduces the GENIA corpus and various annotations made to it as an example of multilevel annotation made for IE, and discusses general issues in multilevel annotation.

Author information

Authors and Affiliations

University of Tokyo, Tokyo, Japan
Jin-Dong Kim, Tomoko Ohta & Jun’ichi Tsujii
University of Manchester, Manchester, UK
Jun’ichi Tsujii

Corresponding author

Correspondence to Jin-Dong Kim.

Editor information

Editors and Affiliations

Institut für Deutsche Sprache (IDS), Mannheim, 68161, Germany
Andreas Witt
Fak. Linguistik und, Universität Bielefeld, Universitätsstraße, Bielefeld, 33615, Germany
Dieter Metzing

About this chapter

Cite this chapter

Kim, JD., Ohta, T., Tsujii, J. (2010). Multilevel Annotation for Information Extraction. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_7

DOI: https://doi.org/10.1007/978-90-481-3331-4_7
Published: 09 November 2009
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-3330-7
Online ISBN: 978-90-481-3331-4
eBook Packages: Computer Science, Computer Science (R0)
