Skip to main content

The GENIA Corpus: Annotation Levels and Applications

  • Chapter
  • First Online:
Handbook of Linguistic Annotation

Abstract

The GENIA project was created with the aim of supporting the development and evaluation of information extraction and text mining systems in molecular biology. One of the main outcomes of the project has been the GENIA corpus, consisting of 1,999 MEDLINE abstracts. Over the course of several years, the corpus has been continually enriched with various levels of syntactic, semantic and discourse-level annotation, making it suitable for training various types of systems. The GENIA corpus has been widely used by the NLP community for the development of several semantic search systems, and motivated the establishment of the BioNLP shared task series of challenges. These challenges have been instrumental in pushing forward research into event extraction systems in the biomedical domain, and have also resulted in the development of a range of associated corpora in various biomedical sub-domains, annotated according to the GENIA guidelines.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 349.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 449.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 449.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    http://nlp.i2r.a-star.edu.sg/medco.html.

  2. 2.

    http://www.nactem.ac.uk/genia/tools/xconc.

  3. 3.

    http://www.nactem.ac.uk/medie/.

  4. 4.

    http://www.nactem.ac.uk/facta/.

  5. 5.

    http://www.nactem.ac.uk/facta-visualizer/.

  6. 6.

    http://www.nactem.ac.uk/pathtext2/demo/.

  7. 7.

    http://labs.europepmc.org/evf.

References

  1. Ananiadou, S., Pyysalo, S., Tsujii, J., Kell, D.B.: Event extraction for systems biology by text mining the literature. Trends Biotechnol. 28(7), 381–390 (2010)

    Article  Google Scholar 

  2. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., et al.: Gene Ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000)

    Article  Google Scholar 

  3. Batista-Navarro, R.T., Ananiadou, S.: Building a coreference-annotated corpus from the domain of biochemistry. In: Proceedings of BioNLP 2011 Workshop, pp. 83–91. Association for Computational Linguistics (2011)

    Google Scholar 

  4. Bies, A., Ferguson, M., Katz, K., MacIntyre, R., Tredinnick, V., Kim, G., et al.: Bracketing guidelines for Treebank II style Penn Treebank project. University of Pennsylvania (1995)

    Google Scholar 

  5. Bjorne, J., Salakoski, T.: Generalizing biomedical event extraction. In: Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 183–191 (2011)

    Google Scholar 

  6. Björne, J., Heimonen, J., Ginter, F., Airola, A., Pahikkala, T., Salakoski, T.: Extracting Complex Biological Events with Rich Graph-Based Feature Sets. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pp. 10–18 (2009)

    Google Scholar 

  7. Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pp. 132–139. Association for Computational Linguistics (2000)

    Google Scholar 

  8. Cohen, K.B., Ogren, P.V., Fox, L., Hunter, L.: Corpus design for biomedical natural language processing. In: Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, pp. 38–45. Association for Computational Linguistics (2005)

    Google Scholar 

  9. de Waard, A., Shum, B., Carusi, A., Park, J., Samwald, M., Sándor, Á.: Hypotheses, evidence and relationships: The HypER approach for representing scientific knowledge claims. In: Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse (2009)

    Google Scholar 

  10. Funahashi, A., Morohashi, M., Kitano, H., Tanimura, N.: Cell Designer: a process diagram editor for gene-regulatory and biochemical networks. Biosilico 1(5), 159–162 (2003)

    Article  Google Scholar 

  11. Goulart, R.R.V., de Lima, V.L., c.S., Xavier, C.C.: A systematic review of named entity recognition in biomedical texts. J. Braz. Comput. Soc. 17(2), 103–116 (2011)

    Google Scholar 

  12. Hara, T., Miyao, Y., Tsujii, J.: Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. In: Proceedings of IJCNLP, pp. 199–210 (2005)

    Google Scholar 

  13. Hasida, K.: GDA: annotated document as intelligent content. In: Proceedings of COLING Workshop on Semantic Annotation and Intelligent Content, pp. 333–340 (2000)

    Google Scholar 

  14. Hucka, M., Finney, A., Sauro, H.M., Bolouri, H., Doyle, J.C., Kitano, H., et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4), 524–531 (2003)

    Article  Google Scholar 

  15. Karp, P.D.: An ontology for biological function based on molecular interactions. Bioinformatics 16(3), 269–285 (2000)

    Article  Google Scholar 

  16. Kazama, J., Miyao, Y., Tsujii, J.: A maximum entropy tagger with unsupervised hidden markov models. In: Proceedings of the 6th NLPRS, 2001, pp. 333–340 (2001)

    Google Scholar 

  17. Kim, J.-D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus - a semantically annotated corpus for bio-text mining. Bioinformatics 19(Suppl. 1), i180–i182 (2003)

    Article  Google Scholar 

  18. Kim, J.-D., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N.: Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), pp. 70–75 (2004)

    Google Scholar 

  19. Kim, J.D., Ohta, T., Pyysalo, S., Kano, Y., Tsujii, J.: Extracting bio-molecular events from literature - the BioNLP’09 shared task. Comput. Intell. 27(4), 513–540 (2011)

    Google Scholar 

  20. Kim, J.-D., Nguyen, N., Wang, Y., Tsujii, J.i., Takagi, T., Yonezawa, A.: The genia event and protein coreference tasks of the BioNLP shared task 2011. BMC Bioinform. 13(Suppl 11), S1 (2012)

    Google Scholar 

  21. Kim, Y., Riloff, E., Gilbert, N.: The taming of Reconcile as a biomedical coreference resolver. In: Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 89–93. Association for Computational Linguistics (2011)

    Google Scholar 

  22. Knight, J.: Negative results: null and void. Nature 422(6932), 554–555 (2003)

    Article  Google Scholar 

  23. Koike, A., Takagi, T.: Gene/protein/family name recognition in biomedical literature. In: Proceedings of BioLINK 2004: Linking Biological Literature, Ontologies, and Databases, pp. 9–16 (2004)

    Google Scholar 

  24. Koike, A., Niwa, Y., Takagi, T.: Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 21(7), 1227–1236 (2005)

    Article  Google Scholar 

  25. Kulick, S., Bies, A., Liberman, M., Mandel, M., McDonald, R., Palmer, M., et al.: Integrated annotation for biomedical information extraction. In: Proceedings of the Human Language Technology Conference and the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 61–68 (2004)

    Google Scholar 

  26. Lease, M., Charniak, E.: Parsing biomedical literature. In: Proceedings of IJCNLP 2005, pp. 58–69. Springer, Berlin (2005)

    Google Scholar 

  27. Liakata, M., Saha, S., Dobnik, S., Batchelor, C., Rebholz-Schuhmann, D.: Automatic recognition of conceptualisation zones in scientific articles and two life science applications. Bioinformatics 28(7), (2012)

    Google Scholar 

  28. Lipscomb, C.E.: Medical subject headings (MeSH). Bull. Med. Libr. Assoc. 88(3), 265 (2000)

    Google Scholar 

  29. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1994)

    Google Scholar 

  30. McClosky, D., Riedel, S., Surdeanu, M., McCallum, A., Manning, C.: Combining joint models for biomedical event extraction. BMC Bioinform. 13(Suppl 11), S9 (2012)

    Article  Google Scholar 

  31. Miwa, M., Saetre, R., Kim, J.D., Tsujii, J.: Event extraction with complex event classification using rich features. J. Bioinform. Comput. Biol. 8(1), 131–146 (2010)

    Article  Google Scholar 

  32. Miwa, M., Thompson, P., Ananiadou, S.: Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics 28(13), 1759–1765 (2012)

    Google Scholar 

  33. Miwa, M., Thompson, P., McNaught, J., Kell, D.B., Ananiadou, S.: Extracting semantically enriched events from biomedical literature. BMC Bioinform. 13(1), 108 (2012)

    Google Scholar 

  34. Miwa, M., Ohta, T., Rak, R., Rowley, A., Kell, D.B., Pyysalo, S., et al.: A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text. Bioinformatics 29(13), i44–i52 (2013)

    Article  Google Scholar 

  35. Miyao, Y., Tsujii, J.: Probabilistic disambiguation models for wide-coverage HPSG parsing. In: Proccedings of ACL, pp. 83–90 (2005)

    Google Scholar 

  36. Miyao, Y., Ninomiya, T., Tsujii, J.: Corpus-oriented grammar development for acquiring a Head-driven phrase structure Grammar from the Penn Treebank. In: Proceedings of IJCNLP, pp. 684–693 (2004)

    Google Scholar 

  37. Miyao, Y., Ohta, T., Masuda, K., Tsuruoka, Y., Yoshida, K., Ninomiya, T., et al.: Semantic retrieval for the accurate identification of relational concepts in massive textbases. Annu. Meet. Assoc. Comput. Linguist. 2, 1017–1024 (2006)

    Google Scholar 

  38. Miyao, Y., Sætre, R., Sagae, K., Matsuzaki, T., Tsujii, J.: Task-oriented evaluation of syntactic parsers and their representations. In: Proceedings of ACL-08: HLT, pp. 46–54. Association for Computational Linguistics (2008)

    Google Scholar 

  39. Mizuta, Y., Korhonen, A., Mullen, T., Collier, N.: Zone analysis in biology articles as a basis for information extraction. Int. J. Med. Inform. 75(6), 468–487 (2006)

    Article  Google Scholar 

  40. Muller, C., Strube, M.: Multi-level annotation of linguistic data with MMAX2. Corpus Technol. Lang. Pedagog. New Res. New tools New Methods 3, 197–214 (2006)

    Google Scholar 

  41. Narayanaswamy, M., Ravikumar, K.E., Vijay-Shanker, K.: Beyond the clause: extraction of phosphorylation information from medline abstracts. Bioinformatics 21(Suppl 1) (2005)

    Google Scholar 

  42. Nawaz, R., Thompson, P., Ananiadou, S.: Identification of manner in bio-events. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 3505–3510 (2012)

    Google Scholar 

  43. Nawaz, R., Thompson, P., Ananiadou, S.: Negated bio-events: analysis and identification. BMC Bioinformatics 14(1), (2013)

    Google Scholar 

  44. Nedellec, C., Bossy, R., Kim, J.-D., Kim, J.-j., Ohta, T., Pyysalo, S., et al.: Overview of BioNLP shared task 2013. In: Proceedings of the BioNLP Shared Task 2013 Workshop, pp. 1–7 (2013)

    Google Scholar 

  45. Nguyen, N., Kim, J.-D., Tsujii, J.: Overview of the protein coreference task in BioNLP shared task 2011. In: Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 74–82. Association for Computational Linguistics (2001)

    Google Scholar 

  46. Nobata, C., Cotter, P., Okazaki, N., Rea, B., Sasaki, Y., Tsuruoka, Y., et al.: Kleio: a knowledge-enriched information retrieval system for biology. In: Proceedings of the 31st Annual International ACM SIGIR Singapore, pp. 787–788 (2008)

    Google Scholar 

  47. Oda, K., Kim, J.-D., Ohta, T., Okanohara, D., Matsuzaki, T., Tateisi, Y., et al.: New challenges for text mining: mapping between text and manually curated pathways. BMC Bioinform. 9(Suppl 3), S5 (2008)

    Article  Google Scholar 

  48. Ohta, T., Tateisi, Y., Mima, H., Tsujii, J.: GENIA corpus: an annotated research abstract corpus in molecular biology domain. In: Proceedings of the Human Language Technology Conference (HLT 2002), pp. 73–77 (2002)

    Google Scholar 

  49. Ohta, T., Pyysalo, S., Kim, J.-D., Tsujii, J., i.: A re-evaluation of biomedical named entity-term relations. J. Bioinform. Comput. Biol. 8(05), 917–928 (2010)

    Google Scholar 

  50. Palmer, M., Gildea, D., Kingsbury, P.: The proposition bank: an annotated corpus of semantic roles. Comput. Linguist. 31(1), 71–106 (2005)

    Article  Google Scholar 

  51. Passonneau, R.: Computing reliability for coreference annotation. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2004) (2004)

    Google Scholar 

  52. Pustejovsky, J., Castano, J.M., Ingria, R., Sauri, R., Gaizauskas, R.J., Setzer, A., et al.: TimeML: robust specification of event and temporal expressions in text. New Dir. Quest. Answ. 3, 28–34 (2003)

    Google Scholar 

  53. Pyysalo, S., Ginter, F., Heimonen, J., Bjorne, J., Boberg, J., Jarvinen, J., et al.: BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinform. 8, 50 (2007)

    Article  Google Scholar 

  54. Pyysalo, S., Ohta, T., Kim, J.-D., Tsujii, J.: Static relations: a piece in the biomedical information extraction puzzle. In: Proceedings of the BioNLP 2009 Workshop, pp. 1–9. Association for Computational Linguistics (2009)

    Google Scholar 

  55. Pyysalo, S., Ohta, T., Rak, R., Sullivan, D., Mao, C., Wang, C., et al.: Overview of the ID, EPI and REL tasks of BioNLP shared task 2011. BMC Bioinform. 13(Suppl 11), S2 (2012)

    Article  Google Scholar 

  56. Ruppenhofer, J., Ellsworth, M., Petruck, M., Johnson, C., Scheffczyk, J.: FrameNet II: extended theory and practice (2010). http://framenet.icsi.berkeley.edu/

  57. Santorini, B.: Part-of-speech tagging guidelines for the Penn Treebank Project (D. o. C. a. I. Science, Trans.). University of Pennsylvania (1990)

    Google Scholar 

  58. Sasaki, Y., Tsuruoka, Y., McNaught, J., Ananiadou, S.: How to make the most of named entity dictionaries in statistical NER. BMC Bioinform. 9(Suppl 11), S5 (2008)

    Google Scholar 

  59. Schulze-Kremer, S.: Ontologies for molecular biology. In: Pac Symp Biocomput, vol. 3, pp. 695–706 (1998)

    Google Scholar 

  60. Schuyler, P.L., Hole, W.T., Tuttle, M.S., Sherertz, D.D.: The UMLS metathesaurus: representing different views of biomedical concepts. Bull. Med. Lib. Assoc. 81(2), 217 (1993)

    Google Scholar 

  61. Su, J., Yang, X., Hong, H., Tateisi, Y., Tsujii, J.: Coreference resolution in biomedical texts: a machine learning approach. Ontol. Text Min. Life Sci. 8 (2008)

    Google Scholar 

  62. Tanabe, L., Xie, N., Thom, L., Matten, W., Wilbur, W.J.: GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinform. 6(Suppl 1), S3 (2005)

    Article  Google Scholar 

  63. Tateisi, Y., Tsujii, J.: Part-of-speech annotation of biology research abstracts. In: Proceedings of LREC, 2004 (2004)

    Google Scholar 

  64. Tateisi, Y., Yakushiji, A., Ohta, T., Tsujii, J.i.: Syntax Annotation for the GENIA corpus. In: Proceedings of IJCNLP, pp. 222–227 (2005)

    Google Scholar 

  65. Thompson, P., Iqbal, S., McNaught, J., Ananiadou, S.: Construction of an annotated corpus to support biomedical information extraction. BMC Bioinform. 10(1), 349 (2009)

    Article  Google Scholar 

  66. Thompson, P., McNaught, J., Montemagni, S., Calzolari, N., Del Gratta, R., Lee, V., et al.: The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinform. 12(1), 397–397 (2011)

    Google Scholar 

  67. Thompson, P., Nawaz, R., McNaught, J., Ananiadou, S.: Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinform. 12, 393 (2011)

    Google Scholar 

  68. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-vol. 1, pp. 173–180. Association for Computational Linguistics (2003)

    Google Scholar 

  69. Tsuruoka, Y., Tsujii, J.: Improving the performance of dictionary-based approaches in protein name recognition. J. Biomed. Inform. 37(6), 461–470 (2004)

    Article  Google Scholar 

  70. Tsuruoka, Y., Tsujii, J.: Bidirectional inference with the easiest-first strategy for tagging sequence data. In: Proceedings of HLT/EMNLP 2005, pp. 467–474 (2005)

    Google Scholar 

  71. Tsuruoka, Y., Tateishi, Y., Kim, J.D., Ohta, T., McNaught, J., Ananiadou, S., et al.: Developing a robust part-of-speech tagger for biomedical text. In: Lecture Notes in Computer Science - Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382–392 (2005)

    Google Scholar 

  72. Tsuruoka, Y., Tsujii, J., Ananiadou, S.: FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics 24(21), 2559–2560 (2008)

    Article  Google Scholar 

  73. Tsuruoka, Y., Miwa, M., Hamamoto, K., Tsujii, J.i., Ananiadou, S.: Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics 27(13), i111–i119 (2011)

    Google Scholar 

  74. Vincze, V., Szarvas, G., Farkas, R., Mora, G., Csirik, J.: The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinform. 9(Suppl 11), S9 (2008)

    Article  Google Scholar 

  75. Wattarujeekrit, T., Shah, P.K., Collier, N.: PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinform. 5, 155 (2004)

    Article  Google Scholar 

  76. Wilbur, W.J., Rzhetsky, A., Shatkay, H.: New directions in biomedical text annotations: definitions, guidelines and corpus construction. BMC Bioinform. 7, 356 (2006)

    Article  Google Scholar 

  77. Winston, M.E., Chaffin, R., Herrmann, D.: A taxonomy of part-whole relations. Cogn. Sci. 11(4), 417–444 (1987)

    Article  Google Scholar 

  78. Yang, L., Zhou, Y.: Two-phase biomedical named entity recognition based on semi-CRFs. In: Proceedings of the 2010 IEEE Fifth International Conference on Bio-Inspired Computing: Theories and Applications, pp. 1061–1065. IEEE (2010)

    Google Scholar 

  79. Yang, X., Su, J., Zhou, G., Tan, C.L.: An NP-cluster based approach to coreference resolution. In: Proceedings of the 20th international conference on Computational Linguistics, pp. 226. Association for Computational Linguistics (2004)

    Google Scholar 

  80. Yang, X., Zhou, G., Su, J., Tan, C.L.: Improving noun phrase coreference resolution by matching strings. In: Proceedings of IJCNLP 2004, pp. 22–31. Springer, Berlin (2005)

    Google Scholar 

  81. Yeh, A.S., Hirschman, L., Morgan, A.A.: Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics 19(Suppl 1), i331–i339 (2003)

    Article  Google Scholar 

  82. Zhao, S.: Named entity recognition in biomedical texts using an HMM model. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 2004, pp. 84–87. Association for Computational Linguistics (2004)

    Google Scholar 

Download references

Acknowledgements

This work has been supported by the BBSRC-funded EMPATHY project (Grant No. BB/M006891/1) and by the EPSRC and MRC-funded MMPathIC project (Grant No. MR/N00583X/1).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sophia Ananiadou .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Thompson, P., Ananiadou, S., Tsujii, J. (2017). The GENIA Corpus: Annotation Levels and Applications. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_54

Download citation

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_54

  • Published:

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

  • eBook Packages: Social SciencesSocial Sciences (R0)

Publish with us

Policies and ethics