Fast Plagiarism Detection by Sentence Hashing

  • Dariusz Ceglarek
  • Konstanty Haniewicz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7268)

Abstract

This work presents the Sentence Hashing Algorithm for Plagiarism Detection (SHAPD). To give the user the best results, the algorithm exploits a particular trait of written texts, their natural fragmentation into sentences, and then applies a set of dedicated techniques for text representation. The results obtained demonstrate that the algorithm delivers a solution faster than the alternatives. Its algorithmic complexity is logarithmic, so it performs better than most dynamic-programming algorithms used to find the longest common subsequence.
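The abstract does not spell out how SHAPD represents or compares text, so the following is only a minimal sketch of the general idea of sentence-level hashing, not the authors' algorithm. All function names, the naive sentence splitter, the normalization step, and the choice of SHA-1 are assumptions made for illustration.

```python
import hashlib
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    """Lowercase and keep only word tokens, so case and punctuation are ignored."""
    return " ".join(re.findall(r"\w+", sentence.lower()))


def sentence_hashes(text):
    """Return the set of hashes of the normalized sentences in `text`."""
    hashes = set()
    for sentence in split_sentences(text):
        norm = normalize(sentence)
        if norm:
            hashes.add(hashlib.sha1(norm.encode("utf-8")).hexdigest())
    return hashes


def shared_sentence_ratio(suspect, source):
    """Fraction of distinct sentences in `suspect` that also occur in `source`."""
    suspect_hashes = sentence_hashes(suspect)
    if not suspect_hashes:
        return 0.0
    common = suspect_hashes & sentence_hashes(source)
    return len(common) / len(suspect_hashes)


if __name__ == "__main__":
    source = "The cat sat on the mat. Dogs bark loudly! Is it raining?"
    suspect = "the cat sat on the mat. Birds sing."
    print(shared_sentence_ratio(suspect, source))
```

Because each sentence is reduced to a single hash, comparing two documents becomes a set intersection rather than a character-by-character alignment such as the longest common subsequence. This sketch only catches sentences that match exactly after normalization; the text-representation techniques the abstract alludes to are not reproduced here.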

Keywords

plagiarism; plagiarism detection; longest common subsequence; semantic compression; SEIPro2S

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Dariusz Ceglarek (1)
  • Konstanty Haniewicz (2)

  1. Poznan School of Banking, Poland
  2. Poznan University of Economics, Poland
