Plagiarism Detection Without Reference Collections

  • Conference paper

Abstract

Current research in the field of automatic plagiarism detection for text documents focuses on the development of algorithms that compare suspicious documents against potential original documents. Although recent approaches perform well in identifying copied or even modified passages (Brin et al. (1995), Stein (2005)), they assume a closed world in which a reference collection must be given (Finkel (2002)). A human reader, by contrast, can identify suspicious passages within a document without having a library of potential original documents in mind.

This raises the question of whether plagiarized passages within a document can be detected automatically if no reference is given, e.g., if the plagiarized passages stem from a book that is not available in digital form. This paper addresses that question: it proposes a method to identify potentially plagiarized passages by analyzing a single document with respect to changes in writing style. Such passages can then be used as a starting point for an Internet search for potential sources, or they can be preselected for inspection by a human referee. Among other contributions, we present new style features that can be computed efficiently and that provide highly discriminative information. Our experiments, which are based on a test corpus that will be published, show encouraging results.
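The idea of analyzing a single document for changes in writing style can be illustrated with a minimal sketch. The abstract does not name the style features used in the paper, so the three features below (average word length, average sentence length, type-token ratio) are stand-ins chosen for illustration, and the names `style_features` and `flag_style_outliers` as well as the threshold value are assumptions, not the authors' method. Each chunk of the document is compared with the document as a whole, and chunks whose feature values deviate strongly are flagged.

```python
import re
from statistics import mean, pstdev


def style_features(text):
    """Return an illustrative style vector for one chunk of text.

    The features are placeholders (assumption): average word length,
    average sentence length in words, and type-token ratio.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0, 0.0)
    avg_word_len = mean(len(w) for w in words)
    avg_sent_len = len(words) / len(sentences)
    type_token_ratio = len(set(words)) / len(words)
    return (avg_word_len, avg_sent_len, type_token_ratio)


def flag_style_outliers(chunks, threshold=1.5):
    """Return indices of chunks whose style deviates from the document.

    For every feature, the mean and population standard deviation over
    all chunks serve as the document-level reference. A chunk is flagged
    if any feature lies more than `threshold` standard deviations from
    that mean (threshold is a hypothetical tuning parameter).
    """
    vectors = [style_features(c) for c in chunks]
    flagged = set()
    for dim in range(3):
        values = [v[dim] for v in vectors]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # feature is constant, so it carries no signal
        for i, x in enumerate(values):
            if abs(x - mu) / sigma > threshold:
                flagged.add(i)
    return sorted(flagged)


plain = "The cat sat. The dog ran."
ornate = ("Notwithstanding considerable methodological heterogeneity, "
          "interdisciplinary investigations predominantly substantiate "
          "aforementioned hypotheses.")
chunks = [plain, plain, ornate, plain, plain]
suspicious = flag_style_outliers(chunks)
```

In this toy input the third chunk differs from the rest in all three features and is the only one flagged. Flagged chunks would then be candidates for an Internet search or for inspection by a human referee, as described above; a realistic setting would need longer chunks and more robust features.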

References

  • BRIN, S., DAVIS, J. and GARCIA-MOLINA, H. (1995): Copy Detection Mechanisms for Digital Documents. In: Proceedings of SIGMOD’ 95.

  • DALE, E. and CHALL, J.S. (1948): A Formula for Predicting Readability. Educ. Res. Bull., 27.

  • FLESCH, R. (1948): A New Readability Yardstick. Journal of Applied Psychology, 32, 221–233.

  • GARSIDE, R., LEECH, G. and MCENERY, A. (1997): Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman.

  • HOAD, T.C. and ZOBEL, J. (2003): Methods for Identifying Versioned and Plagiarised Documents. JASIST, 54,3, 203–215.

  • HONORE, A. (1979): Some Simple Measures of Richness of Vocabulary. Association for Literary and Linguistic Computing Bulletin, 7,2, 172–177.

  • KINCAID, J., FISHBURNE, R.P., ROGERS, R.L. and CHISSOM, B.S. (1975): Derivation of New Readability Formulas for Navy Enlisted Personnel. Research Branch Report 85, US Naval Air Station.

  • KOPPEL, M. and SCHLER, J. (2004): Authorship Verification as a One-class Classification Problem. In Proceedings of ICML 04, Banff, Canada. ACM Press.

  • MCCABE, D. (2005): Research Report of the Center for Academic Integrity. http://www.academicintegrity.org.

  • MEYER ZU EISSEN, S. and STEIN, B. (2004): Genre Classification of Web Pages: User Study and Feasibility Analysis. In: KI 2004, LNAI. Springer.

  • SORENSEN, J. (2005): A Competitive Analysis of Automated Authorship Attribution Techniques. http://hbar.net/thesis.pdf.

  • STAMATATOS, E., FAKOTAKIS, N. and KOKKINAKIS, G. (2001): Computer-based Authorship Attribution without Lexical Measures. Computers and the Humanities, 35, 193–214.

  • STEIN, B. (2005): Fuzzy-Fingerprints for Text-Based Information Retrieval. In the Proceedings I-KNOW 05, Graz, J.UCS, 572–579. Know-Center.

  • STEIN, B. and MEYER ZU EISSEN, S. (2006): Near Similarity Search and Plagiarism Analysis. In Proc. 29th Annual Conference of the GfKl, Springer, Berlin.

  • TWEEDIE, F.J. and BAAYEN, R.H. (1997): Lexical “Constants” in Stylometry and Authorship Studies. In Proceedings of ACH-ALLC’ 97.

  • UNIVERSITY OF LEIPZIG (1995): Wortschatz. http://wortschatz.unileipzig.de.

  • YULE, G. (1944): The Statistical Study of Literary Vocabulary. Cambridge University Press.

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Meyer zu Eissen, S., Stein, B., Kulig, M. (2007). Plagiarism Detection Without Reference Collections. In: Decker, R., Lenz, H.J. (eds) Advances in Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70981-7_40
