Coreference Annotation Schema for an Inflectional Language

  • Maciej Ogrodniczuk
  • Magdalena Zawisławska
  • Katarzyna Głowińska
  • Agata Savary
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7816)

Abstract

Creating a coreference corpus for an inflectional and free-word-order language is a challenging task due to specific syntactic features largely ignored by existing annotation guidelines, such as the absence of definite/indefinite articles (making quasi-anaphoricity very common), frequent use of zero subjects or discrepancies between syntactic and semantic heads. This paper comments on the experience gained in preparation of such a resource for an ongoing project (CORE), aiming at creating tools for coreference resolution.

Starting with a clarification of the relation between noun groups and mentions, through definition of the annotation scope and strategies, up to actual decisions for borderline cases, we present the process of building the first, to our best knowledge, corpus of general coreference of Polish.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Ogrodniczuk, M., Kopeć, M.: End-to-end coreference resolution baseline system for Polish. In: Vetulani, Z. (ed.) Proceedings of the Fifth Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznań, Poland, pp. 167–171 (2011)Google Scholar
  2. 2.
    Kopeć, M., Ogrodniczuk, M.: Creating a Coreference Resolution System for Polish. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pp. 192–195. ELRA, Istanbul (2012)Google Scholar
  3. 3.
    Mitkov, R.: Anaphora Resolution. In: Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics. Oxford University Press (2003)Google Scholar
  4. 4.
    Recasens, M.: Coreference: Theory, Annotation, Resolution and Evaluation. PhD thesis, Department of Linguistics, University of Barcelona, Barcelona, Spain (2010)Google Scholar
  5. 5.
    Haghighi, A., Klein, D.: Simple coreference resolution with rich syntactic and semantic features. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, vol. 3, pp. 1152–1161. Association for Computational Linguistics, Stroudsburg (2009)Google Scholar
  6. 6.
    Recasens, M., Hovy, E., Martí, M.A.: A Typology of Near-Identity Relations for Coreference (NIDENT). In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pp. 149–156 (2010)Google Scholar
  7. 7.
    Recasens, M., Hovy, E., Martí, M.A.: Identity, non-identity, and near-identity: Addressing the complexity of coreference. Lingua 121(6) (2011)Google Scholar
  8. 8.
    Przepiórkowski, A., Bańko, M., Górski, R.L., Lewandowska-Tomaszczyk, B. (eds.): Narodowy Korpus Języka Polskiego (Eng.: National Corpus of Polish). Wydawnictwo Naukowe PWN, Warsaw (2012)Google Scholar
  9. 9.
    Przepiórkowski, A., Buczyński, A.: Spejd: Shallow Parsing and Disambiguation Engine. In: Vetulani, Z. (ed.) Proceedings of the 3rd Language & Technology Conference, Poznań, Poland, pp. 340–344 (2007)Google Scholar
  10. 10.
    Acedański, S.: A Morphosyntactic Brill Tagger for Inflectional Languages. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) IceTAL 2010. LNCS, vol. 6233, pp. 3–14. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  11. 11.
    Waszczuk, J., Głowińska, K., Savary, A., Przepiórkowski, A., Lenart, M.: Annotation Tools for Syntax and Named Entities in the National Corpus of Polish. International Journal of Data Mining, Modelling and Management (to appear)Google Scholar
  12. 12.
    Müller, C., Strube, M.: Multi-level annotation of linguistic data with MMAX2. In: Braun, S., Kohn, K., Mukherjee, J. (eds.) Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pp. 197–214. Peter Lang, Frankfurt a.M., Germany (2006)Google Scholar
  13. 13.
    Osenova, P., Simov, K.: BTB-TR05: BulTreeBank Stylebook. BulTreeBank Version 1.0. Technical Report BTB-TR05, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria (2004)Google Scholar
  14. 14.
    Poesio, M., Artstein, R.: Anaphoric Annotation in the ARRAU Corpus. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2008). European Language Resources Association, Marrakech (2008)Google Scholar
  15. 15.
    Nedoluzhko, A., Mírovský, J., Ocelák, R., Pergler, J.: Extended Coreferential Relations and Bridging Anaphora in the Prague Dependency Treebank. In: Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2009), Goa, India. AU-KBC Research Centre, Anna University, Chennai, pp. 1–16 (2009)Google Scholar
  16. 16.
    Recasens, M., Martí, M.A.: AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation 44(4), 315–345 (2010)CrossRefGoogle Scholar
  17. 17.
    Korzen, I., Buch-Kromann, M.: Anaphoric relations in the Copenhagen Dependency Treebanks. In: Proceedings of DGfS Workshop, Göttingen, Germany, pp. 83–98 (2011)Google Scholar
  18. 18.
    Rahman, A., Ng, V.: Translation-Based Projection for Multilingual Coreference Resolution. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 720–730. Association for Computational Linguistics, Montréal (2012)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Maciej Ogrodniczuk
    • 1
  • Magdalena Zawisławska
    • 2
  • Katarzyna Głowińska
    • 3
  • Agata Savary
    • 4
  1. 1.Institute of Computer SciencePolish Academy of SciencesPoland
  2. 2.Institute of Polish LanguageWarsaw UniversityPoland
  3. 3.LingventaPoland
  4. 4.Laboratoire d’informatiqueFrançois Rabelais University ToursFrance

Personalised recommendations