Near Similarity Search and Plagiarism Analysis

  • Benno Stein
  • Sven Meyer zu Eissen
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Existing methods for text plagiarism analysis are mainly based on “chunking”, a process of grouping a text into meaningful units, each of which is encoded as an integer. Together these numbers form a document’s signature or fingerprint. An overlap of two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which entails two problems: (i) it is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which further increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage.
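The chunk-and-hash scheme described above can be illustrated with a minimal sketch. It assumes overlapping word n-grams as chunks, the first 8 bytes of each MD5 digest as the integer code, and a simple set-overlap ratio as the comparison; the function names `chunk_words`, `fingerprint`, and `overlap` and the sample sentences are hypothetical and not taken from the paper.

```python
import hashlib
import re


def chunk_words(text, chunk_size=3):
    """Split text into overlapping chunks of `chunk_size` consecutive words."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < chunk_size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + chunk_size])
            for i in range(len(words) - chunk_size + 1)]


def fingerprint(text, chunk_size=3):
    """Encode every chunk as an integer taken from its MD5 digest (assumed scheme)."""
    return {
        int.from_bytes(hashlib.md5(chunk.encode("utf-8")).digest()[:8], "big")
        for chunk in chunk_words(text, chunk_size)
    }


def overlap(fp_a, fp_b):
    """Fraction of the smaller fingerprint that also occurs in the other one."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


original = "the quick brown fox jumps over the lazy dog"
suspect = "yesterday the quick brown fox jumps over a fence"
unrelated = "fingerprints are compared by set intersection"

# A shared passage yields shared chunk hashes; unrelated text yields none.
print(overlap(fingerprint(original), fingerprint(suspect)))
print(overlap(fingerprint(original), fingerprint(unrelated)))
```

In this sketch, changing a single word alters every chunk that contains it, which is why a small `chunk_size` is needed to catch modified passages and why the number of hashes to compute, compare, and store grows accordingly.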

This paper proposes a new class of fingerprints that can be considered an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints leads to a speed-up by a factor of five or higher, without compromising recall performance.
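For orientation, the classical vector space model that these fingerprints abstract from can be sketched as cosine similarity between term-frequency vectors. This is only the baseline notion of similarity, under the assumption of raw term counts without weighting; it is not the fingerprint construction proposed in the paper, which the abstract does not detail. The names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lower-cased word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    shared_terms = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


passage = term_vector("plagiarism analysis based on fingerprints")
reordered = term_vector("fingerprints based on plagiarism analysis")
other = term_vector("suffix trees for string matching")

print(cosine_similarity(passage, reordered))  # same words, different order
print(cosine_similarity(passage, other))      # no words in common
```

Unlike exact chunk hashes, this measure tolerates reordering and small edits, but it requires a pairwise comparison of full vectors, which is the cost a compact fingerprint is meant to avoid.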

Keywords

Cosine Similarity · Vector Space Model · Runtime Performance · Chunk Size · Global Similarity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Berlin · Heidelberg 2006

Authors and Affiliations

  • Benno Stein (1)
  • Sven Meyer zu Eissen (2)
  1. Faculty of Media, Media Systems, Bauhaus University Weimar, Weimar, Germany
  2. Faculty of Computer Science, Paderborn University, Paderborn, Germany
