A Scalable System for Identifying Co-derivative Documents

  • Yaniv Bernstein
  • Justin Zobel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3246)

Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype system that makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
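
The general idea of document fingerprinting described above, matching documents through the hash values of their chunks, can be illustrated with a minimal sketch. This is not spex or deco, whose details are not given in the abstract. It assumes that a chunk is a run of three consecutive words and that every chunk is selected, and the function names `chunks`, `fingerprint`, and `shared_chunk_counts` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every overlapping chunk of `size` consecutive words (assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def fingerprint(text, size=3):
    """Return the set of hash values of all chunks in the document."""
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks(text, size)}


def shared_chunk_counts(docs, size=3):
    """Map each pair of document ids to the number of chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for h in fingerprint(text, size):
            postings[h].add(doc_id)
    counts = defaultdict(int)
    for ids in postings.values():
        # Every pair of documents listed under the same hash shares that chunk.
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Toy collection, invented purely for illustration.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an entirely unrelated sentence about databases",
}
pairs = shared_chunk_counts(docs)
print(pairs)
```

In this toy collection only documents "a" and "b" share chunks, so they are the only candidate co-derivative pair. A practical system would select a subset of chunks rather than hashing all of them, which is where the choice of selection strategy matters.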

Keywords

Average Precision; Document Collection; Scalable System; Discovery Problem; Chunk Size
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Yaniv Bernstein (1)
  • Justin Zobel (1)
  1. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
