Near Similarity Search and Plagiarism Analysis
Existing methods for text plagiarism analysis are mainly based on “chunking”, a process that groups a text into meaningful units, each of which is encoded as an integer. Together these numbers form a document’s signature or fingerprint. An overlap between two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which entails two problems: (i) the hashing is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which further increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage.
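The chunking scheme described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper: the function names `chunk_fingerprint` and `fingerprint_overlap`, the use of overlapping word windows as chunks, and the default chunk size of three words are all assumptions made for the example.

```python
import hashlib


def chunk_fingerprint(text, chunk_size=3):
    """Return the set of integers obtained by MD5-hashing every window
    of `chunk_size` consecutive words (hypothetical chunking strategy)."""
    words = text.lower().split()
    fingerprint = set()
    # A text shorter than one chunk yields an empty fingerprint.
    for i in range(len(words) - chunk_size + 1):
        chunk = " ".join(words[i:i + chunk_size])
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        fingerprint.add(int(digest, 16))  # encode the chunk as an integer
    return fingerprint


def fingerprint_overlap(text_a, text_b, chunk_size=3):
    """Return the hashes shared by both fingerprints; a non-empty result
    marks a candidate for a copied passage."""
    return chunk_fingerprint(text_a, chunk_size) & chunk_fingerprint(text_b, chunk_size)
```

Because every window is hashed and stored, a smaller `chunk_size` produces more hashes per document, which is the cost in computation, comparison, and storage that the abstract points to.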
This paper proposes a new class of fingerprints that can be regarded as an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and make it possible to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints achieves a speed-up by a factor of five or more, without compromising recall.
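For orientation, the classical vector space model that these fingerprints abstract from can be sketched with plain term-frequency vectors and cosine similarity. This shows only the baseline similarity measure, not the proposed fingerprint construction, which the abstract does not specify; the helper names `term_vector` and `cosine_similarity` and the whitespace tokenization are assumptions for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to its term-frequency vector (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    va, vb = term_vector(text_a), term_vector(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Two passages that differ in only a few words score close to 1 under this measure, whereas their MD5 chunk hashes may not match at all; this is the sense in which "near similarity" differs from exact chunk matching.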
Keywords: Cosine Similarity, Vector Space Model, Runtime Performance, Chunk Size, Global Similarity