AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan

Abstract

This article describes the enrichment of the AnCora corpora of Spanish and Catalan (400k words each) with coreference links between pronouns (including elliptical subjects and clitics), full noun phrases (including proper nouns), and discourse segments. The coding scheme distinguishes between identity links, predicative relations, and discourse deixis. Inter-annotator agreement on the link types is 85–89% above chance, and we provide an analysis of the sources of disagreement. The resulting corpora make it possible to train and test learning-based algorithms for automatic coreference resolution, as well as to carry out bottom-up linguistic descriptions of coreference relations as they occur in real data.

Fig. 1
Fig. 2
Fig. 3

Notes

  1.

    All the examples throughout the article have been extracted from the AnCora-CO corpora. Those preceded by (Cat.) come from Catalan and those by (Sp.) from Spanish.

  2.

    Following the terminology of the Automatic Content Extraction (ACE) program (Doddington et al. 2004), a mention is defined as an instance of reference to an object, and an entity is the collection of mentions referring to the same object in a document.

  3.

    To obtain anaphoric coreference pronouns from AnCora-CO, one just needs to extract the pronouns that are included in an entity. By convention, we can assume that their antecedent corresponds to the previous mention in the same entity.
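    The extraction procedure this note describes can be sketched in a few lines. The mention layout below is a simplification for illustration only, not the actual AnCora-CO XML format: each mention is assumed to carry its part of speech and, if it is coreferent, the identifier of the entity it belongs to.

```python
# Illustrative sketch (simplified data layout, not the AnCora-CO format):
# anaphoric pronouns are the pronouns included in an entity, and by
# convention their antecedent is the previous mention of the same entity.

def pronoun_antecedent_pairs(mentions):
    """mentions: list of dicts in document order, each with keys
    'text', 'pos' (part of speech) and 'entity' (entity id or None)."""
    last_mention = {}  # entity id -> most recent mention of that entity
    pairs = []
    for m in mentions:
        ent = m["entity"]
        if ent is None:
            continue  # mention not included in any entity
        if m["pos"] == "pronoun" and ent in last_mention:
            pairs.append((m["text"], last_mention[ent]["text"]))
        last_mention[ent] = m
    return pairs

mentions = [
    {"text": "Argentina", "pos": "noun", "entity": "e1"},
    {"text": "el ministro", "pos": "noun", "entity": "e2"},
    {"text": "ayer", "pos": "adverb", "entity": None},
    {"text": "ella", "pos": "pronoun", "entity": "e1"},
]
print(pronoun_antecedent_pairs(mentions))  # [('ella', 'Argentina')]
```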

  4.

    ACE-2004 entity types include: person, organization, geo-political entity, location, facility, vehicle and weapon.

  5.

    http://projects.ldc.upenn.edu/ace/docs/Spanish-Entities-Guidelines_v1.6.pdf.

  6.

    http://www.efe.es

  7.

    http://www.acn.cat.

  8.

    At present, a total of 300,000 words for each AnCora-CO corpus are freely downloadable from the Web. An additional subset of 100,000 words is being kept for test purposes in future evaluation programs.

  9.

    Elliptical subject pronouns are marked with ⊘ and with the corresponding pronoun in brackets in the English translation.

  10.

    Two guiding principles in the morphological annotation of AnCora were (a) to preserve the original text intact, and (b) to assign standard categories to tokens, so that a category such as “verb-pronoun” for verbs with incorporated clitics was ruled out.

  11.

    Possessive determiners are not considered NPs according to the syntactic annotation scheme.

  12.

    The fact that Argentina is marked as NE-organization provides a clue for the annotators to apply the maximal NP principle. This principle, however, turned out to be a source of inter-annotator disagreement (see Sect. 7.2).

  13.

    Metonymy is the use of a word for an entity which is associated with the entity originally denoted by the word, e.g., dish for the food on the dish.

  14.

    Given the length of some discourse segments, in the examples of discourse deixis coreferent mentions are underlined in order to distinguish them clearly from their antecedent.

  15.

    We are replacing cada una ‘each’ with the coreferent candidate EDF y Mitsubishi ‘EDF and Mitsubishi.’ In the English translation, an inversion of verb-subject order is required.

  16.

    http://clic.ub.edu/ancora/lng/en/coreference.pdf.

  17.

    The POS of possessive determiners and pronouns contains the entity corresponding to the possessor, while the entire NP contains the entity corresponding to the thing(s) possessed.

  18.

    Verb nodes can only be mentions if they contain an incorporated clitic: annotating the verb is in fact a way of annotating the reference of the clitic. This applies to Spanish only.

  19.

    Strong NEs correspond strictly to the POS level (nouns, e.g., Los Angeles).

  20.

    The original version with the inherent clitic is untranslatable into English.

  21.

    The transitivity test extends to all the mentions in the same entity so that if mention A corefers with mention B, and mention B corefers with mention C, then it is possible to replace mention C by mention A with no change in meaning, and vice versa.
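    The transitivity test amounts to treating coreference as an equivalence relation: entities are the classes obtained by closing the pairwise links under symmetry and transitivity. A minimal union-find sketch of this closure (mention names purely illustrative):

```python
# Sketch of the transitivity principle in note 21: pairwise coreference
# links are closed into entities (equivalence classes) with union-find.

def entities_from_links(mentions, links):
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for m in mentions:
        classes.setdefault(find(m), set()).add(m)
    return sorted(sorted(c) for c in classes.values())

# A-B and B-C imply A-C by transitivity; D stays a singleton.
print(entities_from_links(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
# [['A', 'B', 'C'], ['D']]
```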

  22.

    In Spanish and Catalan, unlike English, equative appositive and copular phrases often omit the definite article.

  23.

    It is common practice among researchers in Computational Linguistics to consider 0.8 the absolute minimum value of α acceptable for any serious purpose (Artstein and Poesio 2008).
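    For reference, Krippendorff's α is defined as 1 − D_o/D_e, where D_o is the observed and D_e the expected disagreement. The following is a minimal sketch for the simplest case (two coders, nominal categories, no missing values); it illustrates the coefficient itself, not the exact computation used in the reliability study.

```python
from collections import Counter

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha: two coders, nominal data, no missing values."""
    assert len(coder1) == len(coder2)
    n_items = len(coder1)
    values = list(coder1) + list(coder2)
    n = len(values)                # 2 * n_items pairable values
    freq = Counter(values)         # pooled category frequencies
    # observed disagreement: proportion of items on which the coders differ
    d_o = sum(a != b for a, b in zip(coder1, coder2)) / n_items
    # expected disagreement from the pooled category distribution
    d_e = sum(freq[c] * freq[k]
              for c in freq for k in freq if c != k) / (n * (n - 1))
    return 1 - d_o / d_e

print(krippendorff_alpha_nominal("aabb", "aaba"))  # 8/15 ≈ 0.533
```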

  24.

    Files 11177-20000817 and 16468-20000521.

  25.

    Files 17704-20000522 (Text 1, 62 coreferent mentions) and 17124-20001122 (Text 2, 88 coreferent mentions).

  26.

    At the time of the experiment, AnCoraPipe (the annotation tool used for the actual annotation) was not yet ready.

  27.

    Discourse-deictic relations were left out of the quantitative study since coders received only the set of NPs as possible mentions and were free to choose discourse-segment antecedents themselves. For a qualitative analysis in this respect, see Sect. 7.2 below.

References

  1. Ariel, M. (1988). Referring and accessibility. Journal of Linguistics 24(1), 65–87.

  2. Artstein, R., & Poesio, M. (2005). Bias decreases in proportion to the number of annotators. In Proceedings of FG-MoL 2005 (pp. 141–150). Edinburgh.

  3. Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596.

  4. Baldwin, B. (1997). CogNIAC: High precision coreference with limited knowledge and linguistic resources. In Proceedings of the ACL-EACL 1997 workshop on operational factors in practical, robust anaphor resolution for unrestricted texts (pp. 38–45). Madrid.

  5. Bertran, M., Borrega, O., Recasens, M., & Soriano, B. (2008). AnCoraPipe: A tool for multilevel annotation. Procesamiento del Lenguaje Natural, 41, 291–292.

  6. Blackwell, S. (2003). Implicatures in discourse: The case of Spanish NP anaphora. Amsterdam: John Benjamins.

  7. Borrega, O., Taulé, M., & Martí, M. A. (2007). What do we mean when we talk about named entities?. In Proceedings of the 4th corpus linguistics conference, Birmingham.

  8. Bosque, I., & Demonte, V. (Eds.) (1999). Gramática descriptiva de la lengua española. Madrid: Real Academia Española/Espasa Calpe.

  9. Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249–254.

  10. Clark, H. H. (1977). Bridging. In P. Johnson-Laird, & P. C. Wason (Eds.), Thinking: Readings in cognitive science (pp. 411–420). Cambridge: Cambridge University Press.

  11. Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., & Weischedel, R. (2004). The Automatic Content Extraction (ACE) program—Tasks, data, and evaluation. In Proceedings of LREC 2004 (pp. 837–840). Lisbon.

  12. Eckert, M., & Strube, M. (2000). Dialogue acts, synchronising units and anaphora resolution. Journal of Semantics, 17(1), 51–89.

  13. Fraurud, K. (1990). Definiteness and the processing of NPs in natural discourse. Journal of Semantics, 7, 395–433.

  14. Gundel, J., Hedberg, N., & Zacharski, R. (1993). Cognitive status and the form of referring expressions in discourse. Language, 69(2), 274–307.

  15. Halliday, M. A., & Hasan, R. (1976). Cohesion in English. London: Longman.

  16. Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89.

  17. Hinrichs, E., Kübler, S., Naumann, K., Telljohann, H., & Trushkina, J. (2004). Recent developments in linguistic annotations of the TüBa-D/Z treebank. In Proceedings of TLT 2004, Tübingen.

  18. Hirschman, L., & Chinchor, N. (1997). MUC-7 coreference task definition—Version 3.0.

  19. Hobbs, J. R. (1978). Resolving pronoun references. Lingua, 44, 311–338.

  20. Hoste, V. (2005). Optimization issues in machine learning of coreference resolution. PhD Thesis, University of Antwerp.

  21. Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of HLT-NAACL 2006 (pp. 57–60). New York.

  22. Ide, N. (2000). Searching annotated language resources in XML: A statement of the problem. In Proceedings of the ACM SIGIR 2000 workshop on XML and information retrieval, Athens.

  23. Kilgarriff, A. (1999). 95% replicability for manual word sense tagging. In Proceedings of EACL 1999 (pp. 277–278). Bergen.

  24. Kripke, S. (1977). Speaker’s reference and semantic reference. Midwest Studies in Philosophy, 2, 255–276.

  25. Krippendorff, K. (2004 [1980]). Content Analysis: An Introduction to its Methodology (2nd ed.). Thousand Oaks, CA: Sage. Chapter 11.

  26. Kučová, L., & Hajičová, E. (2004). Coreferential relations in the Prague dependency treebank. In Proceedings of DAARC 2004 (pp. 97–102). San Miguel, Azores.

  27. Lappin, S., & Leass, H. J. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4), 535–561.

  28. Luo, X., Ittycheriah, A., Jing, H., Kambhatla, N., & Roukos, S. (2004). A mention-synchronous coreference resolution algorithm based on the Bell tree. In Proceedings of ACL 2004 (pp. 21–26). Barcelona.

  29. McCarthy, J. F., & Lehnert, W. G. (1995). Using decision trees for coreference resolution. In Proceedings of IJCAI 1995 (pp. 1050–1055). Montréal.

  30. Mengel, A., Dybkjaer, L., Garrido, J. M., Heid, U., Klein, M., Pirrelli, V., et al. (2000). MATE deliverable D2.1 – MATE dialogue annotation guidelines. http://www.ims.uni-stuttgart.de/projekte/mate/mdag.

  31. Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Proceedings of COLING-ACL 1998 (pp. 869–875). Montréal.

  32. Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones, L., & Sotirova, V. (2000). Coreference and anaphora: developing annotating tools, annotated resources and annotation strategies. In Proceedings of DAARC 2000 (pp. 49–58). Lancaster.

  33. Morton, T. S. (1999). Using coreference in question answering. In Proceedings of TREC-8 (pp. 85–89). Gaithersburg, MD.

  34. Müller, C., & Strube, M. (2006). Multi-level annotation of linguistic data with MMAX2. In S. Braun, K. Kohn, & J. Mukherjee (Eds.), Corpus technology and language pedagogy: New resources, new tools, new methods (pp. 197–214). Frankfurt: Peter Lang.

  35. Navarretta, C. (2007). A contrastive analysis of abstract anaphora in Danish, English and Italian. In Proceedings of DAARC 2007 (pp. 103–109). Lagos.

  36. Ng, V., & Cardie, C. (2002). Improving machine learning approaches to coreference resolution. In Proceedings of ACL 2002 (pp. 104–111). Philadelphia.

  37. Orasan, C. (2003). PALinkA: A highly customisable tool for discourse annotation. In Proceedings of the 4th SIGdial workshop on discourse and dialogue (pp. 39–43). Sapporo.

  38. Orasan, C., Cristea, D., Mitkov, R., & Branco, A. (2008). Anaphora resolution exercise: An overview. In Proceedings of LREC 2008, Marrakech.

  39. Passonneau, R. (2004). Computing reliability for coreference annotation. In Proceedings of LREC 2004 (pp. 1503–1506). Lisbon.

  40. Passonneau, R. (2006). Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of LREC 2006 (pp. 831–836). Genoa.

  41. Poesio, M. (2004a). Discourse annotation and semantic annotation in the GNOME corpus. In Proceedings of the ACL 2004 workshop on discourse annotation (pp. 72–79). Barcelona.

  42. Poesio, M. (2004b). The MATE/GNOME proposals for anaphoric annotation, revisited. In Proceedings of the 5th SIGdial workshop at HLT-NAACL 2004 (pp. 154–162). Boston.

  43. Poesio, M., & Artstein, R. (2008). Anaphoric annotation in the ARRAU corpus. In Proceedings of LREC 2008, Marrakech.

  44. Poesio, M., & Vieira, R. (1998). A corpus-based investigation of definite description use. Computational Linguistics 24(2), 183–216.

  45. Pradhan, S. S., Ramshaw, L., Weischedel, R., MacBride, J., & Micciulla, L. (2007). Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of ICSC 2007 (pp. 446–453). Irvine, CA.

  46. Recasens, M., Martí, M. A., & Taulé, M. (2009a). First-mention definites: More than exceptional cases. In S. Featherston, & S. Winkler (Eds.), The fruits of empirical linguistics (pp. 169–189). Berlin: De Gruyter.

  47. Recasens, M., Martí, M. A., Taulé, M., Màrquez, L., & Sapena, E. (2009b). SemEval-2010 Task 1: Coreference resolution in multiple languages. In Proceedings of the NAACL 2009 workshop on semantic evaluations: Recent achievements and future directions (pp. 70–75). Boulder, CO.

  48. Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences, Chap. 9.8 (2nd ed.). New York: McGraw Hill.

  49. Solà, J. (Ed.). (2002). Gramàtica del català contemporani. Barcelona: Empúries.

  50. Soon, W. M., Ng, H. T., & Lim, D. C. Y. (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4), 521–544.

  51. Stede, M. (2004). The potsdam commentary corpus. In Proceedings of the ACL 2004 workshop on discourse annotation (pp. 96–102). Barcelona.

  52. Steinberger, J., Poesio, M., Kabadjov, M. A., & Ježek, K. (2007). Two uses of anaphora resolution in summarization. Information Processing and Management: an International Journal, 43(6), 1663–1680.

  53. Taboada, M. (2008). Reference, centers and transitions in spoken Spanish. In J. Gundel & N. Hedberg (Eds.), Reference and reference processing (pp. 176–215). Oxford: Oxford University Press.

  54. Taulé, M., Martí, M. A., & Recasens, M. (2008). AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of LREC 2008, Marrakech.

  55. van Deemter, K., & Kibble, R. (2000). On coreferring: Coreference in MUC and related annotation schemes. Computational Linguistics, 26(4), 629–637.

  56. Webber, B. L. (1979). A formal approach to discourse anaphora. New York: Garland Press.

  57. Webber, B. L. (1988). Discourse deixis: Reference to discourse segments. In Proceedings of ACL 1988 (pp. 113–122). Buffalo, New York.

  58. Zaenen, A. (2006). Mark-up barking up the wrong tree. Computational Linguistics, 32(4), 577–580.

Acknowledgments

This work was supported by the FPU Grant (AP2006-00994) from the Spanish Ministry of Education and Science, and the Lang2World (TIN2006-15265-C06-06) and Ancora-Nom (FFI2008-02691-E/FILO) projects. Special thanks to Mariona Taulé for her invaluable advice, Manuel Bertran for customising the AnCoraPipe annotation tool, and the annotators who participated in the development of AnCora-CO and the reliability study: Oriol Borrega, Isabel Briz, Irene Carbó, Sandra García, Iago González, Esther López, Jesús Martínez, Laura Muñoz, Montse Nofre, Lourdes Puiggròs, Lente Van Leeuwen, and Rita Zaragoza. We are indebted to three anonymous reviewers for their comments on earlier versions of this work.

Author information

Correspondence to Marta Recasens.

About this article

Recasens, M., Martí, M.A. AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Lang Resources & Evaluation 44, 315–345 (2010). https://doi.org/10.1007/s10579-009-9108-x

Keywords

  • Coreference
  • Anaphora
  • Corpus annotation
  • Annotation scheme
  • Reliability study