Comparison of Overlap Detection Techniques

  • Krisztián Monostori
  • Raphael Finkel
  • Arkady Zaslavsky
  • Gábor Hodász
  • Máté Pataki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2329)


Easy access to the World Wide Web has raised concerns about copyright issues and plagiarism. It is easy to copy someone else’s work and submit it as one’s own. Many systems have targeted this problem, and they use very similar approaches. This paper compares these approaches and suggests when each strategy is more applicable than the others. It also proposes some alternative approaches that perform better than previously presented methods. The previous methods share two common stages: chunking of documents and selection of representative chunks. We study both stages and propose alternatives that are better in terms of accuracy and space requirements. The applications of these methods are not limited to plagiarism detection; they may also target other copy-detection problems. Finally, we propose a third stage for the comparison, which uses suffix trees and suffix vectors to identify the overlapping chunks.
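The two shared stages named above, chunking and selection of representative chunks, can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the chunk type (overlapping runs of `size` words), the MD5-based hash, the "keep hashes divisible by `keep_every`" selection rule, and all function names are assumptions chosen only for illustration. The proposed third stage, based on suffix trees and suffix vectors, is not shown.

```python
import hashlib
import re


def words(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunks(text, size=3):
    """Stage 1 (assumed strategy): overlapping chunks of `size` consecutive words."""
    tokens = words(text)
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def chunk_hash(chunk):
    """Map a chunk to a 64-bit integer; MD5 serves only as a stable hash here."""
    digest = hashlib.md5(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text, size=3, keep_every=1):
    """Stage 2 (assumed rule): keep hashes divisible by `keep_every`.

    keep_every=1 keeps every chunk; larger values store fewer chunks,
    trading accuracy for space.
    """
    return {h for h in map(chunk_hash, chunks(text, size)) if h % keep_every == 0}


def overlap(doc_a, doc_b, size=3, keep_every=1):
    """Fraction of doc_a's selected chunks that also occur in doc_b."""
    fa = fingerprint(doc_a, size, keep_every)
    fb = fingerprint(doc_b, size, keep_every)
    return len(fa & fb) / len(fa) if fa else 0.0
```

For example, `overlap("the quick brown fox jumps", "a quick brown fox sleeps")` compares three chunks per document, of which only "quick brown fox" is shared. Raising `keep_every` shrinks the stored fingerprint, which is the kind of trade-off between accuracy and space that the abstract refers to.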


Digital Library, Suffix Tree, Chunk Size, Plagiarism Detection, Chunk Method
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Krisztián Monostori (1)
  • Raphael Finkel (2)
  • Arkady Zaslavsky (1)
  • Gábor Hodász (3)
  • Máté Pataki (3)
  1. School of Computer Science and Software Engineering, Monash University, Caulfield East, Australia
  2. Computer Science, University of Kentucky, Lexington, USA
  3. Department of Automation and Applied Informatics, Budapest University of Technology and Economic Sciences, 1111 Budapest, Hungary
