Automatic Detection of Local Reuse

  • Arno Mittelbach
  • Lasse Lehmann
  • Christoph Rensing
  • Ralf Steinmetz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6383)

Abstract

Local reuse detection is a prerequisite for a multitude of tasks ranging from document management and information retrieval to web search and plagiarism detection. Its results can support authors in creating new learning resources, or learners in finding existing ones, by providing accurate suggestions for related documents. While the detection of local text reuse, i.e. reuse of parts of documents, is covered by various approaches, reuse detection for object-based documents has hardly been considered so far. In this paper we propose a new fingerprinting technique for local reuse detection in both text-based and object-based documents, which is based on the contiguity of documents. This additional information, which is generally disregarded by existing approaches, allows the creation of shorter and more flexible fingerprints. Evaluations performed on different corpora have shown that the technique performs better than existing approaches while consuming significantly less storage.
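
To make the notion of fingerprint-based local reuse detection concrete, the following is a minimal sketch of a generic word k-gram (shingle) fingerprint with a containment score. It is not the contiguity-based technique proposed in the paper, whose details are not given in the abstract; the function names `shingles` and `containment`, the window size `k=3`, and the sample strings are assumptions chosen for illustration only.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of hashed word k-grams (shingles) of a text.

    Illustrative only: a generic shingle fingerprint, not the paper's method.
    """
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}


def containment(part, whole, k=3):
    """Fraction of the shingles of `part` that also occur in `whole`."""
    a, b = shingles(part, k), shingles(whole, k)
    return len(a & b) / len(a) if a else 0.0


# Hypothetical sample documents.
source = "local reuse detection is a prerequisite for a multitude of tasks"
reused = "we note that reuse detection is a prerequisite for many systems"
unrelated = "the weather in darmstadt was pleasant throughout the conference"

print(containment(reused, source))     # partial overlap: between 0 and 1
print(containment(unrelated, source))  # no shared shingles
```

A score strictly between 0 and 1 signals that only part of one document reappears in the other, which is the local reuse case. Storing one hash per shingle is also what makes such fingerprints large, the cost that the abstract says the proposed technique reduces.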

Keywords

Local Reuse Detection, Fingerprinting, Overlap Detection

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Arno Mittelbach (1)
  • Lasse Lehmann (1)
  • Christoph Rensing (1)
  • Ralf Steinmetz (1)
  1. KOM - Multimedia Communications Lab, Technische Universität Darmstadt, Darmstadt
