Plagiarism Detection Without Reference Collections
Current research on automatic plagiarism detection for text documents focuses on algorithms that compare suspicious documents against potential original documents. Although recent approaches perform well in identifying copied or even modified passages (Brin et al. (1995), Stein (2005)), they assume a closed world in which a reference collection is given (Finkel (2002)). A human reader, by contrast, can identify suspicious passages within a document without having a library of potential original documents in mind.
This raises the question of whether plagiarized passages within a document can be detected automatically if no reference is given, e.g. if the plagiarized passages stem from a book that is not available in digital form. This paper addresses that question: it proposes a method to identify potentially plagiarized passages by analyzing a single document with respect to changes in writing style. Such passages can then serve as a starting point for an Internet search for potential sources, or they can be preselected for inspection by a human referee. Among other things, we present new style features that can be computed efficiently and that provide highly discriminative information. Our experiments, which are based on a test corpus that will be published, show encouraging results.
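The idea of analyzing a single document for changes in writing style can be illustrated with a minimal Python sketch. It is not the method or the feature set proposed in the paper: the three features used here (average word length, average sentence length, type-token ratio) are generic stand-ins, and the function names, the fixed passage size, and the z-score threshold are assumptions chosen only for illustration.

```python
import re
import statistics

SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_passages(text, sentences_per_passage=3):
    """Split text into passages of a fixed number of sentences (naive splitter)."""
    sentences = [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_passage])
        for i in range(0, len(sentences), sentences_per_passage)
    ]


def style_vector(passage):
    """Return (avg word length, avg sentence length, type-token ratio).

    These are illustrative stand-in features, not the paper's features.
    """
    words = re.findall(r"[A-Za-z']+", passage)
    sentences = [s for s in SENTENCE_BOUNDARY.split(passage) if s.strip()]
    if not words:
        return (0.0, 0.0, 0.0)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / max(len(sentences), 1)
    type_token = len({w.lower() for w in words}) / len(words)
    return (avg_word_len, avg_sent_len, type_token)


def flag_style_outliers(text, sentences_per_passage=3, threshold=1.5):
    """Flag passages whose style deviates from the document-wide average.

    A passage is flagged if any feature lies at least `threshold` population
    standard deviations away from the mean over all passages (assumed rule).
    Returns a list of (passage_index, passage_text) pairs.
    """
    passages = split_passages(text, sentences_per_passage)
    vectors = [style_vector(p) for p in passages]
    if len(vectors) < 3:
        return []  # too few passages to estimate a document-wide style
    n_features = len(vectors[0])
    means = [statistics.mean(v[k] for v in vectors) for k in range(n_features)]
    stdevs = [statistics.pstdev(v[k] for v in vectors) for k in range(n_features)]
    flagged = []
    for i, v in enumerate(vectors):
        z_scores = [
            abs(v[k] - means[k]) / stdevs[k] if stdevs[k] > 0 else 0.0
            for k in range(n_features)
        ]
        if max(z_scores) >= threshold:
            flagged.append((i, passages[i]))
    return flagged
```

Calling `flag_style_outliers(document_text)` returns candidate passages that could then be passed to an Internet search or to a human referee, as described above. Comparing each passage against the whole document keeps the sketch free of any reference collection, which is the point of the approach.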
- BRIN, S., DAVIS, J. and GARCIA-MOLINA, H. (1995): Copy Detection Mechanisms for Digital Documents. In: Proceedings of SIGMOD ’95.
- DALE, E. and CHALL, J.S. (1948): A Formula for Predicting Readability. Educ. Res. Bull., 27.
- GARSIDE, R., LEECH, G. and MCENERY, A. (1997): Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman.
- HONORE, A. (1979): Some Simple Measures of Richness of Vocabulary. Association for Literary and Linguistic Computing Bulletin, 7, 2, 172–177.
- KINCAID, J., FISHBURNE, R.P., ROGERS, R.L. and CHISSOM, B.S. (1975): Derivation of New Readability Formulas for Navy Enlisted Personnel. Research Branch Report 85, US Naval Air Station.
- KOPPEL, M. and SCHLER, J. (2004): Authorship Verification as a One-class Classification Problem. In: Proceedings of ICML 04, Banff, Canada. ACM Press.
- MCCABE, D. (2005): Research Report of the Center for Academic Integrity. http://www.academicintegrity.org.
- MEYER ZU EISSEN, S. and STEIN, B. (2004): Genre Classification of Web Pages: User Study and Feasibility Analysis. In: KI 2004, LNAI. Springer.
- SORENSEN, J. (2005): A Competitive Analysis of Automated Authorship Attribution Techniques. http://hbar.net/thesis.pdf.
- STEIN, B. (2005): Fuzzy-Fingerprints for Text-Based Information Retrieval. In: Proceedings of I-KNOW 05, Graz, J.UCS, 572–579. Know-Center.
- STEIN, B. and MEYER ZU EISSEN, S. (2006): Near Similarity Search and Plagiarism Analysis. In: Proc. 29th Annual Conference of the GfKl, Springer, Berlin.
- TWEEDIE, F.J. and BAAYEN, R.H. (1997): Lexical “Constants” in Stylometry and Authorship Studies. In: Proceedings of ACH-ALLC ’97.
- UNIVERSITY OF LEIPZIG (1995): Wortschatz. http://wortschatz.unileipzig.de.
- YULE, G. (1944): The Statistical Study of Literary Vocabulary. Cambridge University Press.