Polish Coreference Corpus

  • Maciej Ogrodniczuk
  • Katarzyna Głowińska
  • Mateusz Kopeć
  • Agata Savary
  • Magdalena Zawisławska
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9561)

Abstract

The Polish Coreference Corpus (PCC) is a large corpus of Polish general nominal coreference built upon the National Corpus of Polish. With 1900 documents from 14 text genres, containing about 540,000 tokens, 180,000 mentions and 128,000 coreference clusters, the PCC is among the largest coreference corpora available internationally. It has some novel features, such as the annotation of the quasi-identity relation, inspired by Recasens’ near-identity, as well as the mark-up of semantic heads and dominant expressions. It shows good inter-annotator agreement and is distributed in three formats under an open license. Its by-products include freely available annotation tools with custom features such as file distribution management and annotation adjudication.

Keywords

Corpus, Coreference, Mention detection, Anaphora

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Maciej Ogrodniczuk (1)
  • Katarzyna Głowińska (2)
  • Mateusz Kopeć (1)
  • Agata Savary (3)
  • Magdalena Zawisławska (4)

  1. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
  2. Lingventa, Warsaw, Poland
  3. Laboratoire d’informatique, François Rabelais University Tours, Blois, France
  4. Institute of Polish Language, Warsaw University, Warsaw, Poland
