International Journal on Digital Libraries, Volume 19, Issue 2–3, pp 113–126

Reuse and plagiarism in Speech and Natural Language Processing publications

  • Joseph Mariani
  • Gil Francopoulo
  • Patrick Paroubek


This experiment presents a simple way to compare fragments of text in order to detect the (supposed) results of copy-and-paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The search space for the comparisons is a corpus called NLP4NLP, which covers 34 different conferences and journals and gathers a large part of NLP activity over the past 50 years. The study considers the similarity between the papers of each individual event and the complete set of papers in the whole corpus, according to four types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper of the corpus (which we call the source paper) or, in the reverse direction, fragments of text from the source paper being borrowed and inserted in another paper of the corpus. The results show that self-reuse is a rather common practice, that plagiarism seems to be very unusual, and that both stay within legal and ethical limits.


Plagiarism Detection, Text reuse, Natural Language Processing, Speech Processing, Scientometrics, Informetrics



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. LIMSI, CNRS, Université Paris-Saclay, Orsay, France
  2. Tagmatica, Paris, France
