Knowledge Graphs as Context Models: Improving the Detection of Cross-Language Plagiarism with Paraphrasing

  • Marc Franco-Salvador
  • Parth Gupta
  • Paolo Rosso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8173)

Abstract

Cross-language plagiarism detection attempts to automatically identify and extract plagiarism among documents in different languages. Plagiarized fragments can be translated verbatim copies, or their structure may be altered to hide the copying; this is known as paraphrasing and is more difficult to detect. To improve paraphrase detection, we use a knowledge graph-based approach to obtain and compare context models of document fragments in different languages. Experimental results in German-English and Spanish-English cross-language plagiarism detection indicate that our knowledge graph-based approach outperforms other state-of-the-art models.
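The general idea of comparing fragments through language-independent concept graphs can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `LEXICON` and `RELATIONS` tables stand in for a multilingual semantic network such as BabelNet, and the Dice-based `graph_similarity` score is a simple substitute, not the similarity measure used in the paper.

```python
from itertools import combinations

# Hypothetical toy lexicon: (language, word) -> language-independent concept ID.
# A real system would query a multilingual semantic network instead.
LEXICON = {
    ("en", "car"): "c:car",
    ("en", "automobile"): "c:car",
    ("en", "driver"): "c:driver",
    ("en", "road"): "c:road",
    ("es", "coche"): "c:car",
    ("es", "conductor"): "c:driver",
    ("es", "carretera"): "c:road",
    ("es", "manzana"): "c:apple",
    ("es", "fruta"): "c:fruit",
}

# Hypothetical undirected semantic relations between concepts.
RELATIONS = {
    frozenset(("c:car", "c:driver")),
    frozenset(("c:car", "c:road")),
    frozenset(("c:apple", "c:fruit")),
}


def build_graph(words, lang):
    """Map the words of a fragment to concepts and keep the known relations among them."""
    nodes = {LEXICON[(lang, w)] for w in words if (lang, w) in LEXICON}
    edges = {
        frozenset(pair)
        for pair in combinations(sorted(nodes), 2)
        if frozenset(pair) in RELATIONS
    }
    return nodes, edges


def dice(a, b):
    """Dice coefficient of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def graph_similarity(g1, g2):
    """Average of node overlap and edge overlap (an illustrative score only)."""
    return 0.5 * dice(g1[0], g2[0]) + 0.5 * dice(g1[1], g2[1])


english = build_graph(["the", "automobile", "driver", "road"], "en")
paraphrase = build_graph(["el", "conductor", "coche", "carretera"], "es")
unrelated = build_graph(["manzana", "fruta"], "es")

print(graph_similarity(english, paraphrase))
print(graph_similarity(english, unrelated))
```

In this toy setting the reordered Spanish fragment maps to the same concepts and relations as the English one and scores 1.0, while the unrelated fragment scores 0.0. Because the comparison works on concepts rather than surface words, synonym substitution and word reordering do not lower the score.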

Keywords

Cross-language plagiarism detection; textual similarity; paraphrasing; knowledge graphs; BabelNet



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Marc Franco-Salvador (1, 2)
  • Parth Gupta (1)
  • Paolo Rosso (1)
  1. Natural Language Engineering Lab - ELiRF, DSIC, Universitat Politècnica de València, Valencia, Spain
  2. Linguistic Computing Laboratory (LCL), Sapienza Università di Roma, Roma, Italy
