SPIRE 2004: String Processing and Information Retrieval, pp. 55-67
A Scalable System for Identifying Co-derivative Documents
Abstract
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype system that makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
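The general idea of fingerprinting described above, matching documents on the hash values of their chunks, can be illustrated with a minimal sketch. This is not the spex algorithm or the deco system, whose details the abstract does not give. The chunk definition (overlapping runs of three words), the use of truncated MD5, and the names `chunks`, `chunk_hash`, and `shared_chunk_counts` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every overlapping run of `size` words (an assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Hash a chunk to a short hex digest; MD5 is an arbitrary choice here."""
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunk_counts(docs, size=3):
    """Map each document pair to the number of distinct chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            postings[chunk_hash(chunk)].add(doc_id)

    counts = defaultdict(int)
    for doc_ids in postings.values():
        # Only hashes that occur in two or more documents produce pairs.
        for pair in combinations(sorted(doc_ids), 2):
            counts[pair] += 1
    return dict(counts)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps over a sleeping cat",
    "c": "completely unrelated text about string processing",
}
print(shared_chunk_counts(docs))  # {('a', 'b'): 3}
```

The sketch keeps every overlapping chunk, whereas a practical fingerprinting scheme would typically retain only a selected subset to limit index size. Pairs with a high shared-chunk count are candidates for co-derivation; how such counts are turned into clusters is left open here.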
Keywords
Average Precision, Document Collection, Scalable System, Discovery Problem, Chunk Size