
Corpus annotation with paraphrase types: new annotation scheme and inter-annotator agreement measures


Paraphrase corpora annotated with the types of paraphrases they contain constitute an essential resource for understanding the phenomenon of paraphrasing and for improving paraphrase-related systems in natural language processing. In this article, a new annotation scheme for paraphrase-type annotation is set out, together with newly created measures for the computation of inter-annotator agreement. Three corpora, different in nature and in two languages, have been annotated using this infrastructure. The annotation results and the inter-annotator agreement scores for these corpora demonstrate the adequacy and robustness of our proposal.


Fig. 1
Fig. 2
Fig. 3


  1.

    See Madnani and Dorr (2010), Section 5 for a discussion on this topic.

  2. The readme of the corpus contains a discussion of when a pair of sentences should, and should not, be considered a paraphrase according to their approach.

  3.

  4.

  5.

    See Vila et al. (2013) for a more general state of the art on paraphrase corpora. See Vila et al. (2014) for a state of the art on paraphrase typologies: “paraphrase typology” does not equal “paraphrase-type annotation scheme”, but typologies are the linguistic knowledge on which annotation schemes may be based. In this section, and in this article in general, we focus on the latter.

  6.

  7. Although SemEval organisers distinguish between semantic textual similarity and paraphrasing, the former being a sort of graded paraphrasing, this distinction is not relevant here.

  8.

    Annotation guidelines are available at

  9.

    All the examples in this article are extracted from the three annotated corpora, namely P4P, MSRP-A, and WRPA-A. Typos in the original corpora have not been corrected.

  10.

    It should be taken into account that the corpora we annotate consist of positive cases of paraphrasing; non-paraphrases or non-paraphrase fragments are therefore a minority.

  11.

    See Vila et al. (2014) for a more detailed presentation of our paraphrase typology and Barrón-Cedeño et al. (2013) for a more detailed description of the types. In this article, we set out short definitions of the types for clarification purposes when required.

  12.

    We refer to the tags in small capitals and sometimes use short names, e.g., synthetic/analytic for synthetic/analytic substitutions.

  13.

  14.

    We use the subindex \(w\) (words) instead of \(t\) (tokens) in order to avoid confusion with the superindex \(t\) (type) that will appear in what follows.

  15. See also Dale and Kilgarriff (2011) and Dale and Narroway (2012).

  16.

    The \(\pi \) and \(\kappa \) factors can be omitted from the calculation (i.e., they can be set to 1) if they are not relevant, as in Barrón-Cedeño et al. (2013).

  17.

    The annotated corpora are available at both as a downloadable package and through a search interface.

  18.

  19.

  20.

    The translation is ours.

  21.

  22.

    Strong punctuation marks are full stops, semi-colons, question marks, exclamation marks, and other punctuation marks that can divide autonomous text fragments (in general, sentences or clauses), such as parentheses, hyphens, or colons.

  23.

    For reasons of space, we do not include the per-type scores of inter-annotator agreement. Instead, we point out the most relevant issues in this respect.

  24.

    Dolan and Brockett’s (2005) agreement value and ours are not directly comparable, as they represent different measures in diverging tasks with different degrees of complexity. Nevertheless, we consider that obtaining a value in line with that of Dolan and Brockett’s (2005) simpler task shows that ours can be considered a satisfactory result.
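The article's own inter-annotator agreement measures are not reproduced in this excerpt; footnote 16 only mentions optional \(\pi \) and \(\kappa \) chance-correction factors. As a baseline point of reference, the classical Cohen's (1960) \(\kappa \) cited in the references can be sketched as follows. This is a minimal illustration of the standard coefficient for two annotators, not the authors' measure; the example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's (1960) kappa: chance-corrected agreement for two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(dist_a[t] * dist_b.get(t, 0) for t in dist_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two annotators tagging five paraphrase fragments.
a = ["same-polarity", "synthetic/analytic", "same-polarity", "addition", "same-polarity"]
b = ["same-polarity", "synthetic/analytic", "addition", "addition", "same-polarity"]
print(cohens_kappa(a, b))
```

Here the observed agreement is 4/5 = 0.8 and the chance agreement is 0.36, giving \(\kappa = (0.8 - 0.36)/(1 - 0.36) = 0.6875\).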


  1. Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the 1st joint conference on lexical and computational semantics (*SEM 2012) (pp. 385–393). Montréal.

  2. Amigó, E., Giménez, J., Gonzalo, J., & Màrquez, L. (2006). MT evaluation: Human-like vs. human acceptable. In Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics (COLING/ACL 2006) (pp. 17–24). Sydney.

  3. Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Boston: Addison-Wesley Longman Publishing Co.


  4. Barrón-Cedeño, A., Vila, M., Martí, M. A., & Rosso, P. (2013). Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4), 917–947.


  5. Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the 39th annual meeting of the association for computational linguistics (ACL 2001) (pp. 50–57). Toulouse.

  6. Bès, G. G., & Fuchs, C. (1988). Introduction. In Lexique et paraphrase (pp. 7–11). Presses Universitaires de Lille.

  7. Bhagat, R. (2009). Learning paraphrases from text. Ph.D. thesis, University of Southern California, Los Angeles.

  8. Chen, D. L., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT 2011) (Vol 1, pp. 190–200). Portland.

  9. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.


  10. Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4), 597–614.


  11. Dale, R., & Kilgarriff, A. (2011). Helping our own: The HOO 2011 pilot shared task. In Proceedings of the 13th European workshop on natural language generation (ENLG 2011) (pp. 242–249). Nancy.

  12. Dale, R., & Narroway, G. (2011). The HOO pilot data set: Notes on release 2.0. Resource document. Accessed 8 February 2013.

  13. Dale, R., & Narroway, G. (2012). A framework for evaluating text correction. In Proceedings of the 8th international conference on language resources and evaluation (LREC 2012) (pp. 3015–3018). Istanbul.

  14. Dolan, W. B., & Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd international workshop on paraphrasing (IWP 2005) (pp. 9–16). Jeju Island.

  15. Dutrey, C., Bernhard, D., Bouamor, H., & Max, A. (2011). Local modifications and paraphrases in Wikipedia’s revision history. Procesamiento del Lenguaje Natural, 46, 51–58.


  16. España-Bonet, C., Vila, M., Rodríguez, H., & Martí, M. A. (2009). CoCo, a web interface for corpora compilation. Procesamiento del Lenguaje Natural, 43, 367–368.


  17. Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.


  18. Fuchs, C. (1988). Paraphrases prédicatives et contraintes énonciatives. In: Bès G., & Fuchs C. (Eds.), Lexique et Paraphrase, no. 6 in Lexique, Presses Universitaires de Lille, Villeneuve d’Ascq (pp. 157–171).

  19. Hovy, E., Lin, C. Y., Zhou, L., & Fukumoto, J. (2006). Automated summarization evaluation with basic elements. In Proceedings of the 5th international conference on language resources and evaluation (LREC 2006) (pp. 899–902). Genoa.

  20. Kupper, L. L., & Hafner, K. B. (1989). On assessing interrater agreement for multiple attribute responses. Biometrics, 45(3), 957–967.


  21. Lin, C. Y., & Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 4th annual meeting of the north american chapter of the association for computational linguistics: Human language technologies (NAACL/HLT 2003), Edmonton (Vol. 1, pp. 71–78).

  22. Lin, C. Y., & Och, F. J. (2004). ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In Proceedings of the 20th international conference on computational linguistics (COLING 2004), Geneva.

  23. Liu, C., Dahlmeier, D., & Ng, H. T. (2010) PEM: A paraphrase evaluation metric exploiting parallel texts. In Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP 2010), Cambridge (pp. 923–932).

  24. Madnani, N., & Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3), 341–387.


  25. Max, A., & Wisniewski, G. (2010). Mining naturally-occurring corrections and paraphrases from Wikipedia’s revision history. In Proceedings of the 7th international conference on language resources and evaluation (LREC 2010), Valletta (pp. 3143–3148).

  26. Milićević, J. (2007). La paraphrase. Modélisation de la paraphrase langagière. Bern: Peter Lang.


  27. Nenkova, A., & Passonneau, R. (2004). Evaluating content selection in summarization: the pyramid method. In Proceedings of the 5th annual meeting of the North American chapter of the association for computational linguistics: human language technologies (NAACL/HLT 2004), Boston (pp 145–152).

  28. Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. In Proceedings of the 23rd international conference on computational linguistics (COLING 2010), Beijing (pp. 997–1005).

  29. Recasens, M., & Vila, M. (2010). On paraphrase and coreference. Computational Linguistics, 36(4), 639–647.


  30. Romano, L., Kouylekov, M., Szpektor, I., Dagan, I., & Lavelli, A. (2006). Investigating a generic paraphrase-based approach for relations extraction. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL 2006), Trento (pp. 409–416).

  31. Vila, M., & Dras, M. (2012). Tree edit distance as a baseline approach for paraphrase representation. Procesamiento del Lenguaje Natural, 48, 89–95.


  32. Vila, M., Rodríguez, H., & Martí, M. A. (2013). Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus. Natural Language Engineering. doi:10.1017/S1351324913000235.

  33. Vila, M., Martí, M. A., & Rodríguez, H. (2014). Is this a paraphrase? What kind? Paraphrase boundaries and typology. Open Journal of Modern Linguistics, 4, 205–218.


  34. Zaenen, A. (2006). Mark-up barking up the wrong tree. Computational Linguistics, 32(4), 577–580.




We are grateful to the people who participated in the annotation of the corpora: Rita Zaragoza, Montse Nofre, Patricia Fernández, and Oriol Borrega. We would also like to thank Alberto Barrón-Cedeño for his help in shaping the inter-annotator agreement formulae. This work is supported by the Spanish government through the projects DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01) from Ministerio de Ciencia e Innovación, as well as an FPU Grant (AP2008-02185) from Ministerio de Educación, Cultura y Deporte.

Author information



Corresponding author

Correspondence to Marta Vila.


About this article


Cite this article

Vila, M., Bertran, M., Martí, M.A. et al. Corpus annotation with paraphrase types: new annotation scheme and inter-annotator agreement measures. Lang Resources & Evaluation 49, 77–105 (2015).



  • Paraphrasing
  • Paraphrase typology
  • Corpus annotation
  • Inter-annotator agreement