AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan


This article describes the enrichment of the AnCora corpora of Spanish and Catalan (400k words each) with coreference links between pronouns (including elliptical subjects and clitics), full noun phrases (including proper nouns), and discourse segments. The coding scheme distinguishes between identity links, predicative relations, and discourse deixis. Inter-annotator agreement on the link types is 85–89% above chance, and we provide an analysis of the sources of disagreement. The resulting corpora make it possible to train and test learning-based algorithms for automatic coreference resolution, as well as to carry out bottom-up linguistic descriptions of coreference relations as they occur in real data.

  1.

    All the examples throughout the article have been extracted from the AnCora-CO corpora. Those preceded by (Cat.) come from Catalan and those by (Sp.) from Spanish.

  2.

    Following the terminology of the Automatic Content Extraction (ACE) program (Doddington et al. 2004), a mention is defined as an instance of reference to an object, and an entity is the collection of mentions referring to the same object in a document.

  3.

    To obtain the coreferent anaphoric pronouns from AnCora-CO, one only needs to extract the pronouns that are included in an entity. By convention, the antecedent of each pronoun can be assumed to be the previous mention in the same entity.

  4.

    ACE-2004 entity types include: person, organization, geo-political entity, location, facility, vehicle and weapon.

  5.

  6.

  7.

  8.

    At present, a total of 300,000 words for each AnCora-CO corpus are freely downloadable from the Web. An additional subset of 100,000 words is being kept for test purposes in future evaluation programs.

  9.

    Elliptical subject pronouns are marked with ⊘ and with the corresponding pronoun in brackets in the English translation.

  10.

    Two guiding principles in the morphological annotation of AnCora were (a) to preserve the original text intact, and (b) to assign standard categories to tokens, so that a category such as “verb-pronoun” for verbs with incorporated clitics was ruled out.

  11.

    Possessive determiners are not considered NPs according to the syntactic annotation scheme.

  12.

    The fact that Argentina is marked as NE-organization provides a clue for the annotators to apply the maximal NP principle. This principle, however, turned out to be a source of inter-annotator disagreement (see Sect. 7.2).

  13.

    Metonymy is the use of a word for an entity which is associated with the entity originally denoted by the word, e.g., dish for the food on the dish.

  14.

    Given the length of some discourse segments, in the examples of discourse deixis, coreferent mentions are underlined in order to distinguish them clearly from their antecedent.

  15.

    We are replacing cada una ‘each’ with the coreferent candidate EDF y Mitsubishi ‘EDF and Mitsubishi.’ In the English translation, an inversion of verb-subject order is required.

  16.

  17.

    The POS of possessive determiners and pronouns contains the entity corresponding to the possessor; the entire NP contains the entity corresponding to the thing(s) possessed.

  18.

    Verb nodes can only be mentions if they contain an incorporated clitic. The intention in annotating the verb is actually to annotate the reference of the clitic, and this applies in Spanish only.

  19.

    Strong NEs correspond strictly to the POS level (nouns, e.g., Los Angeles).

  20.

    The original version with the inherent clitic is untranslatable into English.

  21.

    The transitivity test extends to all the mentions in the same entity so that if mention A corefers with mention B, and mention B corefers with mention C, then it is possible to replace mention C by mention A with no change in meaning, and vice versa.

  22.

    In Spanish and Catalan, unlike English, equative appositive and copular phrases often omit the definite article.

  23.

    It is common practice among researchers in Computational Linguistics to consider 0.8 the absolute minimum value of α to accept for any serious purpose (Artstein and Poesio 2008).

  24.

    Files 11177-20000817 and 16468-20000521.

  25.

    Files 17704-20000522 (Text 1, 62 coreferent mentions) and 17124-20001122 (Text 2, 88 coreferent mentions).

  26.

    At the time of the experiment, AnCoraPipe (the annotation tool that was used for the actual annotation) was not ready yet.

  27.

    Discourse-deictic relations were left out of the quantitative study since coders only received the set of NPs as possible mentions. They had free choice to select the discourse segment antecedents. For the qualitative analysis in this respect, see Sect. 7.2 below.
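Note 3 describes how pronoun-antecedent pairs can be read off the entity annotation: take every pronoun that belongs to an entity and pair it with the previous mention of that entity. The following is a minimal sketch of that convention. The `Mention` record, its field names, and the function name are hypothetical and do not reflect the actual AnCora-CO file format, which would first have to be parsed into a structure like this one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Mention:
    """Hypothetical in-memory record for one mention (not the corpus format)."""
    text: str
    position: int            # document order of the mention
    entity: Optional[str]    # entity id, or None if the mention is not coreferent
    is_pronoun: bool


def pronoun_antecedent_pairs(mentions: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """Pair each pronoun in an entity with the previous mention of that entity."""
    last_in_entity: Dict[str, Mention] = {}
    pairs: List[Tuple[Mention, Mention]] = []
    for mention in sorted(mentions, key=lambda m: m.position):
        if mention.entity is None:
            continue  # not part of any entity, so not coreferent
        previous = last_in_entity.get(mention.entity)
        # A pronoun that opens its entity has no preceding mention and is skipped.
        if mention.is_pronoun and previous is not None:
            pairs.append((mention, previous))
        last_in_entity[mention.entity] = mention
    return pairs
```

A pronoun that is the first mention of its entity yields no pair under this convention, since there is no earlier mention to serve as antecedent.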
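The transitivity test in note 21 implies that coreference behaves as an equivalence relation: if A corefers with B and B with C, all three belong to the same entity. One way to illustrate this, under the assumption that annotation is available as pairwise links, is to merge the links with a small union-find structure. The function name and the input shape are illustrative only and are not part of the AnCora-CO tooling.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def build_entities(links: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group mentions into entities by closing pairwise links under transitivity."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Register unseen mentions as their own representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups: Dict[Hashable, Set[Hashable]] = {}
    for mention in list(parent):
        groups.setdefault(find(mention), set()).add(mention)
    return list(groups.values())
```

Given the links A-B and B-C, the sketch places A, B, and C in one entity, which is the situation in which the note allows C to be replaced by A.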
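Note 23 cites 0.8 as the minimum acceptable value of α. Assuming that α denotes Krippendorff's α as discussed by Artstein and Poesio (2008), the sketch below shows how the coefficient can be computed for nominal labels when every item is labelled by at least two coders. The function name and the example labels are hypothetical, and the agreement computation reported in the article may differ in its details.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, List, Sequence


def krippendorff_alpha_nominal(units: List[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data.

    `units` holds one sequence per annotated item, containing the label
    assigned by each coder. Items with fewer than two labels are ignored.
    """
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement, both scaled by n so the factor cancels.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when fewer than two categories occur")
    return 1 - observed / expected
```

For instance, two coders who assign the same link type to every item obtain α = 1, while agreement at chance level gives a value near 0.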


