Plagiarism Detection Without Reference Collections

  • Sven Meyer zu Eissen
  • Benno Stein
  • Marion Kulig
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Current research in the field of automatic plagiarism detection for text documents focuses on the development of algorithms that compare suspicious documents against potential original documents. Although recent approaches perform well in identifying copied or even modified passages (Brin et al. (1995), Stein (2005)), they assume a closed world in which a reference collection must be given (Finkel (2002)). A human reader, by contrast, can identify suspicious passages within a document without having a library of potential original documents in mind.

This raises the question of whether plagiarized passages within a document can be detected automatically if no reference is given, e.g., if the plagiarized passages stem from a book that is not available in digital form. This paper addresses exactly this question: it proposes a method to identify potentially plagiarized passages by analyzing a single document with respect to changes in writing style. Such passages can then serve as a starting point for an Internet search for potential sources, or they can be preselected for inspection by a human referee. Among other things, we present new style features that can be computed efficiently and that provide highly discriminative information. Our experiments, which are based on a test corpus that will be published, show encouraging results.
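
The idea of spotting style changes within a single document can be illustrated with a minimal sketch. It is not the method of the paper: the two features used here (average sentence length and average word length), the function names `style_features` and `flag_style_outliers`, and the threshold of 1.5 standard deviations are all assumptions chosen for illustration. The sketch computes a small style vector per passage and flags passages that deviate strongly from the document-wide mean.

```python
import re
import statistics


def style_features(text):
    """Return (mean sentence length in words, mean word length in characters).

    These two features are illustrative stand-ins, not the paper's features.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return (0.0, 0.0)
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return (avg_sentence_len, avg_word_len)


def flag_style_outliers(passages, threshold=1.5):
    """Return indices of passages whose style deviates from the document mean.

    A passage is flagged if, for any feature, its value lies more than
    `threshold` population standard deviations from the mean over all
    passages. The threshold value is an assumption, not a tuned parameter.
    """
    vectors = [style_features(p) for p in passages]
    if not vectors:
        return []
    flagged = set()
    for dim in range(len(vectors[0])):
        values = [v[dim] for v in vectors]
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)
        if stdev == 0:
            continue  # no variation in this feature, so nothing stands out
        for i, value in enumerate(values):
            if abs(value - mean) / stdev > threshold:
                flagged.add(i)
    return sorted(flagged)
```

In this sketch the document has to be split into passages beforehand, for example by paragraph. A flagged passage is only a candidate for closer inspection or for a source search, not evidence of plagiarism.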

References

  1. BRIN, S., DAVIS, J. and GARCIA-MOLINA, H. (1995): Copy Detection Mechanisms for Digital Documents. In: Proceedings of SIGMOD ’95.
  2. DALE, E. and CHALL, J.S. (1948): A Formula for Predicting Readability. Educ. Res. Bull., 27.
  3. FLESCH, R. (1948): A New Readability Yardstick. Journal of Applied Psychology, 32, 221–233.
  4. GARSIDE, R., LEECH, G. and MCENERY, A. (1997): Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman.
  5. HOAD, T.C. and ZOBEL, J. (2003): Methods for Identifying Versioned and Plagiarised Documents. JASIST, 54(3), 203–215.
  6. HONORE, A. (1979): Some Simple Measures of Richness of Vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2), 172–177.
  7. KINCAID, J., FISHBURNE, R.P., ROGERS, R.L. and CHISSOM, B.S. (1975): Derivation of New Readability Formulas for Navy Enlisted Personnel. Research Branch Report 85, US Naval Air Station.
  8. KOPPEL, M. and SCHLER, J. (2004): Authorship Verification as a One-class Classification Problem. In: Proceedings of ICML ’04, Banff, Canada. ACM Press.
  9. MCCABE, D. (2005): Research Report of the Center for Academic Integrity. http://www.academicintegrity.org.
  10. MEYER ZU EISSEN, S. and STEIN, B. (2004): Genre Classification of Web Pages: User Study and Feasibility Analysis. In: KI 2004, LNAI. Springer.
  11. SORENSEN, J. (2005): A Competitive Analysis of Automated Authorship Attribution Techniques. http://hbar.net/thesis.pdf.
  12. STAMATATOS, E., FAKOTAKIS, N. and KOKKINAKIS, G. (2001): Computer-based Authorship Attribution without Lexical Measures. Computers and the Humanities, 35, 193–214.
  13. STEIN, B. (2005): Fuzzy-Fingerprints for Text-Based Information Retrieval. In: Proceedings of I-KNOW ’05, Graz, J.UCS, 572–579. Know-Center.
  14. STEIN, B. and MEYER ZU EISSEN, S. (2006): Near Similarity Search and Plagiarism Analysis. In: Proc. 29th Annual Conference of the GfKl, Springer, Berlin.
  15. TWEEDIE, F.J. and BAAYEN, R.H. (1997): Lexical “Constants” in Stylometry and Authorship Studies. In: Proceedings of ACH-ALLC ’97.
  16. UNIVERSITY OF LEIPZIG (1995): Wortschatz. http://wortschatz.unileipzig.de.
  17. YULE, G. (1944): The Statistical Study of Literary Vocabulary. Cambridge University Press.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Sven Meyer zu Eissen (1)
  • Benno Stein (1)
  • Marion Kulig (1)
  1. Faculty of Media, Media Systems, Bauhaus University Weimar, Weimar, Germany
