Abstract
The GENIA project was created to support the development and evaluation of information extraction and text mining systems in molecular biology. One of its main outcomes is the GENIA corpus, which consists of 1,999 MEDLINE abstracts. Over the course of several years, the corpus has been continually enriched with multiple levels of syntactic, semantic and discourse-level annotation, making it suitable for training a range of systems. The GENIA corpus has been widely used by the NLP community to develop several semantic search systems, and it has motivated the establishment of the BioNLP shared task series of challenges. These challenges have been instrumental in advancing research into event extraction systems in the biomedical domain, and have also led to the development of a range of associated corpora in various biomedical sub-domains, annotated according to the GENIA guidelines.
Acknowledgements
This work has been supported by the BBSRC-funded EMPATHY project (Grant No. BB/M006891/1) and by the EPSRC and MRC-funded MMPathIC project (Grant No. MR/N00583X/1).
Copyright information
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Thompson, P., Ananiadou, S., Tsujii, J. (2017). The GENIA Corpus: Annotation Levels and Applications. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_54
DOI: https://doi.org/10.1007/978-94-024-0881-2_54
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)