Corpus annotation with paraphrase types: new annotation scheme and inter-annotator agreement measures


Paraphrase corpora annotated with the types of paraphrases they contain constitute an essential resource for understanding the phenomenon of paraphrasing and for improving paraphrase-related systems in natural language processing. In this article, a new scheme for paraphrase-type annotation is set out, together with newly created measures for computing inter-annotator agreement. Three corpora, different in nature and covering two languages, have been annotated using this infrastructure. The annotation results and the inter-annotator agreement scores for these corpora support the adequacy and robustness of our proposal.



  1. See Madnani and Dorr (2010), Section 5, for a discussion on this topic.

  2. The readme of the corpus contains a discussion on when a pair of sentences should be considered a paraphrase and when it should not, according to their approach.

  3.

  4.

  5. See Vila et al. (2013) for a more general state of the art on paraphrase corpora. See Vila et al. (2014) for a state of the art on paraphrase typologies: “paraphrase typology” does not equal “paraphrase-type annotation scheme”, but typologies are the linguistic knowledge on which annotation schemes may be based. In this section, and in this article in general, we focus on the latter.

  6.

  7. Although the SemEval organisers distinguish between semantic textual similarity and paraphrasing, the former being a sort of graded paraphrasing, this distinction is not relevant here.

  8. Annotation guidelines are available at

  9. All the examples in this article are extracted from the three annotated corpora, namely P4P, MSRP-A, and WRPA-A. Typos in the original corpora have not been corrected.

  10. It should be taken into account that the corpora we annotate consist of positive cases of paraphrasing; therefore, non-paraphrases or non-paraphrase fragments are a minority.

  11. See Vila et al. (2014) for a more detailed presentation of our paraphrase typology and Barrón-Cedeño et al. (2013) for a more detailed description of the types. In this article, we set out short definitions of the types for clarification purposes when required.

  12. We refer to the tags in small capital letters, sometimes using short names, e.g., synthetic/analytic for synthetic/analytic substitutions.

  13.

  14. We use the subindex \(w\) (words) instead of \(t\) (tokens) in order to avoid confusion with the superindex \(t\) (type) that will appear in what follows.

  15. See also Dale and Kilgarriff (2011) and Dale and Narroway (2012).

  16. The \(\pi \) and \(\kappa \) factors can be omitted from the calculation (i.e., they can be set to 1) if they are not relevant, as in Barrón-Cedeño et al. (2013).

  17. Annotated corpora are available at as a downloadable package and as a search interface.

  18.

  19.

  20. The translation is ours.

  21.

  22. Strong punctuation marks are full stops, semi-colons, question marks, exclamation marks, and other punctuation marks that can divide autonomous text fragments (in general, sentences or clauses), such as parentheses, hyphens, or colons.

  23. For reasons of space, we do not include the per-type scores of inter-annotator agreement. Instead, we point out the most relevant issues in this respect.

  24. The agreement value of Dolan and Brockett (2005) and ours are not directly comparable, as they represent different measures in diverging tasks with different degrees of complexity. Nevertheless, we consider that obtaining a value in line with that of Dolan and Brockett (2005), whose task is simpler, shows that ours can be considered a satisfactory result.


We are grateful to the people who participated in the annotation of the corpora: Rita Zaragoza, Montse Nofre, Patricia Fernández, and Oriol Borrega. We would also like to thank Alberto Barrón-Cedeño for his help in shaping the formulae of the inter-annotator agreement measures. This work is supported by the Spanish government through the projects DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01) from Ministerio de Ciencia e Innovación, as well as an FPU grant (AP2008-02185) from Ministerio de Educación, Cultura y Deporte.

