On the use of character n-grams as the only intrinsic evidence of plagiarism

  • Imene Bensalem
  • Paolo Rosso
  • Salim Chikhi
Original Paper

Abstract

When a shift in writing style is noticed in a document, doubts arise about its originality. Based on this clue to plagiarism, the intrinsic approach to plagiarism detection identifies the stolen passages by analysing the writing style of the suspicious document, without comparing it to textual resources that may have served as sources for the plagiarist. Character n-grams are recognised as a successful approach to modelling text for writing style analysis. Although prior studies have investigated best practices for using character n-grams in authorship attribution and other problems, such investigations are still needed in the context of intrinsic plagiarism detection. Moreover, previous works have assumed that the ways of using character n-grams in authorship attribution carry over unchanged to intrinsic plagiarism detection. In this paper, we study the effect of character n-gram frequency and length on the performance of intrinsic plagiarism detection. Our experiments utilise two state-of-the-art methods and five large document collections from the PAN labs, written in English and Arabic. We demonstrate empirically that low- and high-frequency n-grams are not equally relevant for intrinsic plagiarism detection, and that their performance depends on the way they are exploited.
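
To make the notions of character n-gram frequency and length concrete, the following is a minimal Python sketch. It is not either of the two methods evaluated in the paper: the function names, the frequency threshold, and the "share of rare n-grams" score are all assumptions chosen for illustration. It counts the character n-grams of a whole document, partitions them into low- and high-frequency sets, and measures how much of a given segment consists of n-grams that are rare in the document as a whole.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def split_by_frequency(doc_counts: Counter, threshold: int = 2):
    """Partition a document's n-grams into low- and high-frequency sets.

    The threshold is a hypothetical illustration value, not taken from the paper.
    """
    low = {g for g, c in doc_counts.items() if c < threshold}
    high = {g for g, c in doc_counts.items() if c >= threshold}
    return low, high


def low_frequency_share(segment: str, low: set, n: int = 3) -> float:
    """Fraction of a segment's n-gram occurrences that are rare in the document."""
    seg_counts = char_ngrams(segment, n)
    total = sum(seg_counts.values())
    if total == 0:
        return 0.0  # segment shorter than n characters
    return sum(c for g, c in seg_counts.items() if g in low) / total


# Toy document: a repetitive "style" followed by an unusual passage.
document = "abababab" + "xyz"
low, high = split_by_frequency(char_ngrams(document))
print(low_frequency_share("xyz", low))   # segment made only of rare n-grams
print(low_frequency_share("abab", low))  # segment made only of frequent n-grams
```

In this toy setting, a segment whose n-grams are mostly rare in the rest of the document stands out stylistically. Varying `n` and `threshold` corresponds loosely to the two factors the paper studies, n-gram length and frequency.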

Keywords

  • Intrinsic plagiarism detection
  • Character n-grams
  • Stylistic features
  • Writing style analysis

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. MISC Laboratory, Constantine 2 University, Constantine, Algeria
  2. PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
