Plagiarism Detection Based on Singular Value Decomposition

  • Zdenek Ceska
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5221)

Abstract

Plagiarism is a widespread problem that currently attracts considerable attention. In this paper, we propose a new method that identifies associations between phrases contained in text documents. This method, called SVDPlag, employs Singular Value Decomposition (SVD) for this purpose. We also discuss other approaches to plagiarism detection and compare them with our method. To evaluate the plagiarism detection methods, we used an experimental corpus of 950 text documents about politics, created from the standard CTK corpus. The experiments indicate that our approach significantly improves the accuracy of plagiarism detection and outperforms the other methods.
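The abstract does not describe how SVDPlag builds or weights its matrix, so the following is only a minimal sketch of the general idea behind SVD-based comparison of phrases, not the author's implementation. The function names `word_ngrams` and `lsa_similarity`, the use of word bigrams as phrases, the binary weighting, the retained rank, and the toy sentences are all assumptions made for illustration.

```python
import numpy as np

def word_ngrams(text, n=2):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lsa_similarity(docs, n=2, rank=2):
    """Pairwise cosine similarity of documents in a rank-reduced phrase space.

    Assumption: binary phrase-by-document matrix; the paper may weight
    phrases differently.
    """
    grams = [word_ngrams(d, n) for d in docs]
    vocab = sorted(set().union(*grams))
    index = {g: i for i, g in enumerate(vocab)}

    # A[i, j] = 1 if phrase i occurs in document j.
    A = np.zeros((len(vocab), len(docs)))
    for j, gs in enumerate(grams):
        for g in gs:
            A[index[g], j] = 1.0

    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(rank, len(s))

    # Document coordinates in the k-dimensional latent space.
    D = (s[:k, None] * Vt[:k, :]).T
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    D = D / norms
    return D @ D.T

# Hypothetical toy documents: the first two share most of their bigrams.
docs = [
    "the government approved the new budget on monday",
    "on monday the government approved the new budget proposal",
    "the football team won the final match yesterday",
]
sim = lsa_similarity(docs, n=2, rank=2)
print(np.round(sim, 3))
```

In this sketch, a high similarity between two documents in the reduced space would flag them as candidates for closer inspection; a real system would also need normalization of word forms and a threshold tuned on labeled data.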

Keywords

Plagiarism, Copy Detection, Natural Language Processing, Phrases, N-grams, Singular Value Decomposition, Latent Semantic Analysis

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Zdenek Ceska
    • Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
