On the use of character n-grams as the only intrinsic evidence of plagiarism

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

When a shift in writing style is noticed in a document, doubts arise about its originality. Based on this clue to plagiarism, the intrinsic approach to plagiarism detection identifies the stolen passages by analysing the writing style of the suspicious document without comparing it to textual resources that may serve as sources for the plagiarist. Character n-grams are recognised as a successful approach to modelling text for writing style analysis. Although prior studies have investigated the best practice of using character n-grams in authorship attribution and other problems, there is still a need for such investigations in the context of intrinsic plagiarism detection. Moreover, previous works have assumed that the ways of using character n-grams in authorship attribution remain the same for intrinsic plagiarism detection. In this paper, we study the effect of character n-gram frequency and length on the performance of intrinsic plagiarism detection. Our experiments utilise two state-of-the-art methods and five large document collections from the PAN labs, written in English and Arabic. We demonstrate empirically that low- and high-frequency n-grams are not equally relevant for intrinsic plagiarism detection, and that their performance depends on the way they are exploited.
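
As an illustration of the two properties studied here (n-gram length and frequency), the following is a minimal Python sketch of a character n-gram profile with normalised frequencies. It is not the implementation used in the paper: the function names, the lowercasing step, and the single frequency threshold are assumptions made for the example. Dropping n-grams that contain digits follows the remark in the notes that numerals are not considered.

```python
from collections import Counter


def char_ngram_profile(text, n=3, skip_digits=True):
    """Return the normalised frequency of each character n-gram in `text`.

    Hypothetical helper: lowercasing is an assumption of this sketch.
    """
    text = text.lower()
    # Slide a window of length n over the text, one character at a time.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if skip_digits:
        # Discard any n-gram that contains a numeral.
        grams = [g for g in grams if not any(ch.isdigit() for ch in g)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.items()}


def split_by_frequency(profile, threshold):
    """Split a profile into low- and high-frequency n-gram sets.

    The single threshold is an illustrative choice, not the paper's scheme.
    """
    low = {g for g, f in profile.items() if f < threshold}
    high = set(profile) - low
    return low, high


profile = char_ngram_profile("abab", n=2)
low, high = split_by_frequency(profile, threshold=0.5)
```

Varying `n` changes the n-gram length, and varying `threshold` changes which n-grams count as low- or high-frequency, which are the two dimensions the study examines.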

Notes

  1. Oxford Dictionary.

  2. See Heather (2010) and Gillam et al. (2011) for further information on this kind of cheating.

  3. For example, it might be known that each paragraph is written by one author and there would be no need to look for style shift at sentence level.

  4. A shared task named Author Diarization was organised at the PAN16 lab (http://pan.webis.de/clef16/pan16-web/author-identification.html). It involves three subtasks: traditional intrinsic plagiarism detection, diarization with a given number of authors, and diarization with an unknown number of authors.

  5. The frequency of n-grams in this method is normalised.

  6. The table lists only the methods that provide information on the used character n-grams.

  7. Numerals have not been considered when extracting n-grams.

  8. We used the Naïve Bayes implementation provided by the WEKA software (Hall et al. 2009). We also trained and tested other classification algorithms implemented in WEKA, and the best results were obtained with Naïve Bayes.

  9. http://pan.webis.de.

  10. The corpora can be downloaded from: https://webis.de/data/data.html#pan-corpora.

  11. http://misc-umc.org/AraPlagDet.

  12. Another performance measure for plagiarism detection is granularity. This measure does not gauge how well a method spots plagiarism but rather its ability to merge overlapping and adjacent detections into one segment. We did not use this measure in this paper because it is rather sensitive to the post-processing methods used to merge the identified plagiarism cases, which is outside the scope of our experiments.

  13. The results of Stamatatos’ method on the PAN-PC-11 corpus are available in (Potthast et al. 2011).

  14. We evaluated Stamatatos’ method on InAra-Test ourselves, using the original implementation of the method.

  15. In the AraPlagDet competition, participants were more interested in the external plagiarism detection approach.

  16. As detailed in previous paragraphs, in this context, short n-grams means n ≤ 3 or n ≤ 4 for Arabic and English, respectively. The rest are called long n-grams.

  17. There is an exception with features computed by classifying n-grams into 2 classes in English where peak performance has been reached with 4-grams.

  18. Some non-alphabetic n-grams such as n-grams of numerals are discarded.

  19. We are very grateful to Efstathios Stamatatos, the author of the method, for sending us his code.

  20. We also adjusted another parameter of the method called Real window length threshold to 2250 instead of 1500 to make it appropriate to the new window size.
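
To make the granularity measure mentioned in note 12 concrete, here is a minimal Python sketch based on its usual definition in the PAN evaluation framework (Potthast et al. 2010): the average, over the plagiarism cases that were detected at all, of the number of detections overlapping each case. The paper does not use this measure; the function name, the representation of passages as (start, end) character offsets, and the value returned when nothing is detected are assumptions of the sketch.

```python
def granularity(cases, detections):
    """Average number of detections per detected plagiarism case.

    `cases` and `detections` are lists of (start, end) character offsets,
    with `end` exclusive. A value of 1.0 means every detected case was
    reported as a single segment.
    """
    def overlaps(a, b):
        # Two half-open intervals overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    # Count, for each true case, how many detections touch it.
    per_case = [sum(overlaps(c, d) for d in detections) for c in cases]
    detected = [k for k in per_case if k > 0]
    if not detected:
        # Assumption: report the ideal value when no case is detected.
        return 1.0
    return sum(detected) / len(detected)


fragmented = granularity([(0, 100)], [(0, 40), (50, 90)])
merged = granularity([(0, 100)], [(0, 90)])
```

In this example the same case is covered by two separate detections in the first call and by one merged detection in the second, which shows why the measure depends on the post-processing used to merge detections.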

References

  • Akiva, N. (2012). Authorship and plagiarism detection using binary BOW features. In CLEF 2012 evaluation labs and workshop – Working notes papers, 17–20 September, Rome, Italy.

  • Akiva, N., & Koppel, M. (2013). A generic unsupervised method for decomposing multi-author documents. Journal of the American Society for Information Science and Technology, 64(11), 2256–2264. https://doi.org/10.1002/asi.22924.

  • Aldebei, K., He, X., Jia, W., & Yang, J. (2016). Unsupervised multi-author document decomposition based on hidden Markov model. In Proceedings of the 54th annual meeting of the association for computational linguistics (ACL 2016) (pp. 706–714).

  • Aldebei, K., He, X., & Yang, J. (2015). Unsupervised decomposition of a multi-author document based on Naive-Bayesian Model. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: short papers) (pp. 501–505).

  • Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech and Language Processing, 20(2), 356–370.

  • Bensalem, I., Boukhalfa, I., Rosso, P., Abouenour, L., Darwish, K., & Chikhi, S. (2015). Overview of the AraPlagDet PAN@FIRE2015 Shared task on Arabic Plagiarism Detection. In P. Majumder, M. Mitra, M. Agrawal, & P. Mitra (Eds.), Post proceedings of the workshops at the 7th forum for information retrieval evaluation (FIRE 2015), Gandhinagar, India (pp. 111–122). CEUR-WS.org.

  • Bensalem, I., Rosso, P., & Chikhi, S. (2013a). A new corpus for the evaluation of Arabic intrinsic plagiarism detection. In P. Forner, H. Müller, R. Paredes, P. Rosso, & B. Stein (Eds.), CLEF 2013, LNCS, vol. 8138 (pp. 53–58). Heidelberg: Springer. https://doi.org/10.1007/978-3-642-40802-1_6.

  • Bensalem, I., Rosso, P., & Chikhi, S. (2013b). Building Arabic corpora from Wikisource. In 2013 ACS international conference on computer systems and applications (AICCSA), Fes/Ifran, Morocco (pp. 1–2). IEEE. https://doi.org/10.1109/aiccsa.2013.6616474.

  • Bensalem, I., Rosso, P., & Chikhi, S. (2014). Intrinsic plagiarism detection using n-gram classes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, October 25–29 (pp. 1459–1464). Association for Computational Linguistics.

  • Brocardo, M. L., Traore, I., Saad, S., & Woungang, I. (2013). Authorship verification for short messages using stylometry. In 2013 International conference on computer, information and telecommunication systems (CITS 2013) (pp. 1–6). IEEE. https://doi.org/10.1109/cits.2013.6705711.

  • Brooke, J., & Hirst, G. (2012). Paragraph clustering for intrinsic plagiarism detection using a stylistic vector-space model with extrinsic features: Notebook for PAN at CLEF 2012. In CLEF 2012 evaluation labs and workshop – Working notes papers, 17–20 September, Rome, Italy.

  • Burn-Thornton, K., & Burman, T. (2015). A novel approach for analysis of ‘real world’ data: A data mining engine for identification of multi-author student document submission. In M. Abou-Nasr, S. Lessmann, R. Stahlbock, & G. M. Weiss (Eds.), Real world data mining applications (Vol. 17, pp. 203–219). Springer International Publishing. https://doi.org/10.1007/978-3-319-07812-0_11.

  • Giannella, C. (2016). An improved algorithm for unsupervised decomposition of a multi author document. Journal of the Association for Information Science and Technology, 67(2), 400–411.

  • Gillam, L., Marinuzzi, J., & Ioannou, P. (2011). TurnItOff-defeating plagiarism detection systems. In Proceedings of the 11th higher education academy-ics annual conference. Higher Education Academy.

  • Gipp, B., Meuschke, N., & Beel, J. (2011). Comparative evaluation of text- and citation-based plagiarism detection approaches using GuttenPlag. In Proceeding of the 11th annual international ACM/IEEE joint conference on Digital libraries (pp. 255–258).

  • Glover, A., & Hirst, G. (1996). Detecting stylistic inconsistencies in collaborative writing. In M. Sharples & T. van der Geest (Eds.), The new writing environment (pp. 147–168). London: Springer. https://doi.org/10.1007/978-1-4471-1482-6_12.

  • Graham, N., Hirst, G., & Marthi, B. (2005). Segmenting documents by stylistic character. Natural Language Engineering, 11(04), 397–415. https://doi.org/10.1017/S1351324905003694.

  • Grozea, C., & Popescu, M. (2010). Who’s the thief? Automatic detection of the direction of plagiarism. In CICLing 2010, Iaşi, Romania, March 21–27, LNCS, vol. 6008 (pp. 700–710). Springer, Berlin. https://doi.org/10.1007/978-3-642-12116-6_59.

  • Guthrie, D., Guthrie, L., Allison, B., & Wilks, Y. (2007). Unsupervised anomaly detection. In IJCAI international joint conference on artificial intelligence (pp. 1624–1628). Morgan Kaufmann Publishers, Burlington.

  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18. https://doi.org/10.1145/1656274.1656278.

  • Heather, J. (2010). Turnitoff: Identifying and fixing a hole in current plagiarism detection software. Assessment & Evaluation in Higher Education, 35(6), 647–660. https://doi.org/10.1080/02602938.2010.486471.

  • Houvardas, J., & Stamatatos, E. (2006). N-gram feature selection for authorship identification. In International conference on artificial intelligence: Methodology, systems, and applications (pp. 77–86).

  • Jankowska, M., Milios, E., & Kešelj, V. (2014). Author verification using common n-gram profiles of text documents. In Proceedings of COLING 2014, the 25th international conference on computational linguistics (pp. 387–397).

  • Kasprzak, J., & Brandejs, M. (2010). Improving the reliability of the plagiarism detection system: Lab report for PAN at CLEF 2010. In Notebook papers of CLEF 2010 LABs and workshops, September 22–23, Padua, Italy.

  • Keogh, E., Chu, S., Hart, D., & Pazzani, M. (2004). Segmenting time series: A survey and novel approach. In H. Bunke (Ed.), Data mining in time series databases (pp. 1–15). Singapore: World Scientific Publishing.

  • Kern, R., & Granitzer, M. (2009). Efficient linear text segmentation based on information retrieval techniques. In Proceedings of the international conference on management of emergent digital ecosystems – MEDES’09. ACM Press. https://doi.org/10.1145/1643823.1643854.

  • Kern, R., Klampfl, S., & Zechner, M. (2012). Vote/veto classification, ensemble clustering and sequence classification for author identification: Notebook of PAN at CLEF 2012. In Working notes papers of the CLEF 2012 evaluation labs (pp. 1–15).

  • Kešelj, V., Peng, F., Cercone, N., & Thomas, C. (2003). N-gram-based author profiles for authorship attribution. In Proceedings of the conference pacific association for computational linguistics, PACLING’03 (pp. 255–264).

  • Kestemont, M., Luyckx, K., & Daelemans, W. (2011). Intrinsic plagiarism detection using character trigram distance scores: Notebook for PAN at CLEF 2011. In Notebook papers of CLEF 2011 LABs and workshops, September 19–22, Amsterdam, The Netherlands.

  • Koppel, M., Akiva, N., Dershowitz, I., & Dershowitz, N. (2011). Unsupervised decomposition of a document into authorial components. In Proceedings of the 49th annual meeting of the association for computational linguistics (pp. 1356–1364). Association for Computational Linguistics.

  • Kuta, M., & Kitowski, J. (2014). Optimisation of character n-gram profiles method for intrinsic plagiarism Detection. In L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, & J. M. Zurada (Eds.), ICAISC 2014, Part II, LNAI, vol. 8468 (pp. 500–511). Springer. https://doi.org/10.1007/978-3-319-07176-3_44.

  • Kuznetsov, M., Motrenko, A., Kuznetsova, R., & Strijov, V. (2016). Methods for intrinsic plagiarism detection and author diarization: Notebook for PAN at CLEF 2016. In Working notes of CLEF 2016 – Conference and labs of the evaluation forum, Évora, Portugal, 5–8 September, 2016 (pp. 912–919). CEUR-WS.org.

  • Mahgoub, A. Y., Magooda, A., Rashwan, M., Fayek, M. B., & Raafat, H. (2015). RDI system for intrinsic plagiarism detection (RDI_RID): Working notes for PAN-AraPlagDet at FIRE 2015. In Workshops proceedings of the seventh international forum for information retrieval evaluation (FIRE 2015), Gandhinagar, India (pp. 129–130). CEUR-WS.org.

  • Meyer zu Eißen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. In R. Decker & H.-J. Lenz (Eds.), Advances in data analysis, selected papers from the 30th annual conference of the German classification society (GfKl), Berlin (pp. 359–366). Heidelberg: Springer. https://doi.org/10.1007/978-3-540-70981-7_40.

  • Muhr, M., Kern, R., Zechner, M., & Granitzer, M. (2010). External and intrinsic plagiarism detection using a cross-lingual retrieval and segmentation system: Lab report for PAN at CLEF 2010. In Notebook papers of CLEF 2010 LABs and workshops, September 22–23, Padua, Italy.

  • Oberreuter, G., & Velásquez, J. D. (2013). Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), 3756–3763. https://doi.org/10.1016/j.eswa.2012.12.082.

  • Pertile, S. D. L., Moreira, V. P., & Rosso, P. (2015). Comparing and combining content- and citation-based approaches for plagiarism detection. Journal of the Association for Information Science and Technology, 67(10), 2511–2526. https://doi.org/10.1002/asi.23593.

  • Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., & Rosso, P. (2010). Overview of the 2nd international competition on plagiarism detection. In M. Braschler & D. Harman (Eds.), Notebook papers of CLEF 2010 LABs and workshops, September 22–23, Padua, Italy.

  • Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., & Rosso, P. (2011). Overview of the 3rd international competition on plagiarism detection. In V. Petras, P. Forner, & P. Clough (Eds.), Notebook papers of CLEF 2011 LABs and workshops, September 19–22, Amsterdam, The Netherlands.

  • Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism Detection. In C.-R. Huang & D. Jurafsky (Eds.), Proceedings of the 23rd international conference on computational linguistics (COLING 2010) (pp. 997–1005). Stroudsburg, USA: Association for Computational Linguistics.

  • Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., & Rosso, P. (2009). Overview of the 1st international competition on plagiarism detection. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Proceedings of the SEPLN’09 workshop on uncovering plagiarism, authorship and social software misuse (PAN 09) (pp. 1–9). CEUR-WS.org.

  • Rao, S., Gupta, P., Singhal, K., & Majumder, P. (2011). External & intrinsic plagiarism detection: VSM & discourse markers based approach: Notebook for PAN at CLEF 2011. In Notebook papers of CLEF 2011 LABs and workshops, September 19–22, Amsterdam, The Netherlands (pp. 2–6).

  • Rosso, P., Rangel, F., Potthast, M., Stamatatos, E., Tschuggnall, M., & Stein, B. (2016). Overview of PAN’16: New challenges for authorship analysis: Cross-Genre profiling, clustering, Diarization, and Obfuscation. In N. Fuhr, P. Quaresma, T. Gonçalves, B. Larsen, K. Balog, C. Macdonald, et al. (Eds.), CLEF 2016, LNCS 9822 (pp. 332–350). Springer. https://doi.org/10.1007/978-3-319-44564-9_28.

  • Sapkota, U., Bethard, S., Montes-y-Gómez, M., & Solorio, T. (2015). Not all character n-grams are created equal: A study in authorship attribution. In 2015 conference of the North American chapter of the association for computational linguistics – Human language technologies (NAACL HLT 2015) (pp. 93–102). https://doi.org/10.3115/v1/n15-1010.

  • Shrestha, P., & Solorio, T. (2015). Identification of original document by using textual similarities. In A. Gelbukh (Ed.), CICLing 2015, Part II, LNCS 9042 (pp. 643–654). Springer. https://doi.org/10.1007/978-3-319-18117-2_48.

  • Stamatatos, E. (2009a). Intrinsic plagiarism detection using character n-gram profiles. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Proceedings of the SEPLN’09 workshop on uncovering plagiarism, authorship and social software misuse (PAN 09) (pp. 38–46). CEUR-WS.org.

  • Stamatatos, E. (2009b). A survey of modern authorship attribution methods. Journal of the American Society for Information Science, 60(3), 538–556. https://doi.org/10.1002/asi.21001.

  • Stamatatos, E. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law & Policy, 21(2), 421–439.

  • Stamatatos, E. (2016). Universality of stylistic traits in texts. In M. D. Esposti, E. G. Altmann, & F. Pachet (Eds.), Creativity and universality in language (pp. 143–155). Springer. https://doi.org/10.1007/978-3-319-24403-7_9.

  • Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., & Potthast, M. (2016). Clustering by authorship within and across documents. In Working notes of CLEF 2016 – Conference and labs of the evaluation forum, Évora, Portugal, 5–8 September, 2016 (pp. 691–715). CEUR-WS.org.

  • Stein, B., Lipka, N., & Prettenhofer, P. (2011). Intrinsic plagiarism analysis. Language Resources and Evaluation, 45(1), 63–82. https://doi.org/10.1007/s10579-010-9115-y.

  • Suárez, P., González, J. C., & Villena-Román, J. (2010). A plagiarism detector for intrinsic plagiarism: Lab report for PAN at CLEF 2010. In Notebook papers of CLEF 2010 LABs and workshops, September 22–23, Padua, Italy.

  • Suchomel, Š., Kasprzak, J., & Brandejs, M. (2012). Three way search engine queries with multi-feature document comparison for plagiarism detection: Notebook for PAN at CLEF 2012. In CLEF 2012 evaluation labs and workshop – Working notes papers, 17–20 September, Rome, Italy.

  • Tschuggnall, M., & Specht, G. (2014). Automatic decomposition of multi-author documents using grammar analysis. In Proceedings of the 26th GI-workshop on foundations of databases (Grundlagen von Datenbanken) (pp. 17–22). CEUR-WS.org.

  • Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., & Potthast, M. (2017). Overview of the author identification task at PAN-2017: Style breach detection and author clustering. In L. Cappellato, N. Ferro, L. Goeuriot, & T. Mandl (Eds.), Working notes papers of the CLEF 2017 evaluation labs volume 1866 of CEUR workshop proceedings, September 2017. CLEF and CEUR-WS.org.

  • van Halteren, H. (2003). Detection of plagiarism in student essays. In Computational linguistics in the Netherlands 2003: Selected papers from the fourteenth CLIN meeting (pp. 157–169).

  • van Halteren, H. (2004). Linguistic profiling for author recognition and verification. In Proceedings of the 42nd annual meeting on association for computational linguistics (p. Article No. 199). Association for Computational Linguistics. https://doi.org/10.3115/1218955.1218981.

  • Zečević, A. (2011). N-gram based text classification according to authorship. In Proceedings of the student research workshop associated with RANLP 2011 (pp. 145–149). Hissar, Bulgaria: Association for Computational Linguistics.

  • Zechner, M., Muhr, M., Kern, R., & Granitzer, M. (2009). External and intrinsic plagiarism detection using vector space models. In B. Stein, P. Rosso, E. Stamatatos, M. Koppel, & E. Agirre (Eds.), Proceedings of the SEPLN’09 workshop on uncovering plagiarism, authorship and social software misuse (PAN 09) (pp. 47–55). CEUR-WS.org.

Author information

Corresponding author

Correspondence to Imene Bensalem.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We are very grateful to the anonymous reviewers for their insightful suggestions and constructive comments that greatly improved the paper. This work has been partially supported by the École Supérieure de Comptabilité et de Finances de Constantine. The work of Paolo Rosso has been partially funded by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER). The work of Salim Chikhi has been partially funded by CNEPRU/DGRSDT/B*07120140018 research project.

About this article

Cite this article

Bensalem, I., Rosso, P. & Chikhi, S. On the use of character n-grams as the only intrinsic evidence of plagiarism. Lang Resources & Evaluation 53, 363–396 (2019). https://doi.org/10.1007/s10579-019-09444-w

  • DOI: https://doi.org/10.1007/s10579-019-09444-w
