The GENIA Corpus: Annotation Levels and Applications

Thompson, Paul; Ananiadou, Sophia; Tsujii, Jun’ichi

doi:10.1007/978-94-024-0881-2_54

Abstract

The GENIA project was created with the aim of supporting the development and evaluation of information extraction and text mining systems in molecular biology. One of the main outcomes of the project has been the GENIA corpus, consisting of 1,999 MEDLINE abstracts. Over the course of several years, the corpus has been continually enriched with various levels of syntactic, semantic and discourse-level annotation, making it suitable for training various types of systems. The GENIA corpus has been widely used by the NLP community for the development of several semantic search systems, and motivated the establishment of the BioNLP shared task series of challenges. These challenges have been instrumental in pushing forward research into event extraction systems in the biomedical domain, and have also resulted in the development of a range of associated corpora in various biomedical sub-domains, annotated according to the GENIA guidelines.

Acknowledgements

This work has been supported by the BBSRC-funded EMPATHY project (Grant No. BB/M006891/1) and by the EPSRC and MRC-funded MMPathIC project (Grant No. MR/N00583X/1).

Author information

Authors and Affiliations

National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Paul Thompson, Sophia Ananiadou & Jun’ichi Tsujii
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
Jun’ichi Tsujii

Corresponding author

Correspondence to Sophia Ananiadou.

Editor information

Editors and Affiliations

Department of Computer Science, Vassar College, Poughkeepsie, New York, USA
Nancy Ide
Department of Computer Science, Volen Center for Complex Systems, Brandeis University, Waltham, Massachusetts, USA
James Pustejovsky

About this chapter

Cite this chapter

Thompson, P., Ananiadou, S., Tsujii, J. (2017). The GENIA Corpus: Annotation Levels and Applications. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_54

DOI: https://doi.org/10.1007/978-94-024-0881-2_54
Published: 17 June 2017
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)
