Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance

  • Alberto Barrón-Cedeño
  • Paolo Rosso
  • José-Miguel Benedí
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5449)


Automatic plagiarism detection with a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential sources. Publications on this task often assume that the search space (the set of reference documents) is small enough that any search strategy will produce good output in a short time. However, this is not always true. Reference corpora often comprise a large set of original documents, for which a simple exhaustive search strategy becomes impractical.

Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a preliminary search space reduction stage, based on the symmetric Kullback-Leibler distance, dramatically reduces the search time. Additionally, it improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams.
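The two-stage idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the distance formula, the term selection, or the smoothing, so the sketch assumes the commonly used symmetric form, the sum over terms of (P(x) - Q(x)) log(P(x) / Q(x)), with whitespace tokenisation, the suspicious document's vocabulary as the feature set, and a small additive constant `epsilon` for unseen terms. All function names and parameters (`term_distribution`, `reduce_search_space`, `containment`, `k`, `n`) are hypothetical.

```python
import math
from collections import Counter


def term_distribution(text, vocabulary, epsilon=1e-6):
    """Smoothed relative term frequencies over a shared vocabulary.

    The additive `epsilon` smoothing is an assumption made so that the
    logarithm below is always defined.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[t] for t in vocabulary)
    denom = total + epsilon * len(vocabulary)
    return {t: (counts[t] + epsilon) / denom for t in vocabulary}


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler distance: sum of (p - q) * log(p / q)."""
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)


def reduce_search_space(suspicious, references, k=2):
    """Return the names of the k reference documents closest to `suspicious`."""
    vocabulary = set(suspicious.lower().split())
    p = term_distribution(suspicious, vocabulary)
    scored = []
    for name, text in references.items():
        q = term_distribution(text, vocabulary)
        scored.append((symmetric_kl(p, q), name))
    scored.sort()  # smallest distance first
    return [name for _, name in scored[:k]]


def word_ngrams(text, n=3):
    """Set of word n-grams (as tuples) in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, reference, n=3):
    """Fraction of the suspicious text's word n-grams found in the reference."""
    s = word_ngrams(suspicious, n)
    if not s:
        return 0.0
    return len(s & word_ngrams(reference, n)) / len(s)
```

In use, `reduce_search_space` would be called once per suspicious document, and the more expensive `containment` comparison would then run only against the returned candidates rather than against the whole reference corpus.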


Keywords: Search Space; Exhaustive Search; Feature Selection Technique; Reference Document; Search Space Reduction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Alberto Barrón-Cedeño (1)
  • Paolo Rosso (1)
  • José-Miguel Benedí (1)
  1. Department of Information Systems and Computation, Universidad Politécnica de Valencia, Valencia, Spain
