APWeb 2007, WAIM 2007: Advances in Data and Web Management pp 586-593
Density Analysis of Winnowing on Non-uniform Distributions
Abstract
The growing number of copies of digital documents makes duplicate detection an important problem. Among the techniques proposed so far, the Winnowing fingerprinting algorithm [5] is one of the most efficient. However, the previous density analysis leaves the performance of Winnowing unguaranteed in real systems, because its assumption of uniformly distributed k-grams is far from true in practice. In this paper, an improved density analysis method is introduced. Compared with the previous analysis, our method requires only that the k-grams be identically distributed to obtain its prediction. This means our theoretical result can be safely applied to the highly non-uniformly distributed data that are common in real systems. Extensive experiments are performed on both artificial and real data. The experimental results agree well with the theoretical predictions.
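For readers unfamiliar with the quantity being analysed, the following is a minimal sketch of basic winnowing as defined in [5] (select the minimum hash in every window of w consecutive k-gram hashes, breaking ties towards the rightmost occurrence), together with an empirical density measurement, i.e. the fraction of k-gram hashes selected as fingerprints. It is an illustration only and is not taken from this paper: the function names, the truncated-MD5 hash, and the toy Zipf-like text generator `skewed_text` are all assumptions. Under the independent, uniformly distributed hash assumption of [5], the expected density is 2/(w+1); the sketch lets one compare that figure against text whose k-grams repeat heavily.

```python
import hashlib
import random


def kgram_hashes(text, k):
    """Hash every k-gram of `text` to a 64-bit integer (truncated MD5, an assumption)."""
    return [
        int.from_bytes(hashlib.md5(text[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes, w):
    """Return the sorted positions selected by basic winnowing with window size w.

    In each window of w consecutive hashes the minimum is selected;
    ties are broken in favour of the rightmost occurrence.
    """
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum inside the window.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add(start + offset)
    return sorted(selected)


def density(hashes, w):
    """Fraction of hash positions selected as fingerprints."""
    if not hashes:
        return 0.0
    return len(winnow(hashes, w)) / len(hashes)


def skewed_text(n_words, seed=0):
    """Hypothetical toy generator: words drawn with Zipf-like (1/rank) weights."""
    rng = random.Random(seed)
    vocab = ["the", "of", "data", "web", "copy", "hash", "window", "gram"]
    weights = [1 / rank for rank in range(1, len(vocab) + 1)]
    return " ".join(rng.choices(vocab, weights=weights, k=n_words))
```

Calling `density(kgram_hashes(skewed_text(2000), 5), 4)` and comparing the result with `2 / (4 + 1)` gives a rough feel for how far repeated k-grams can move the measured density away from the uniform-case figure; the toy vocabulary is far smaller than any real corpus, so the numbers are indicative only.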
Keywords
Window Size, Density Analysis, Digital Document, Topic Detection, English Document
References
- 1. Brin, S., Davis, J., Garcia-Molina, H.: Copy Detection Mechanisms for Digital Documents. In: Proceedings of ACM SIGMOD 1995, pp. 398–409 (1995)
- 2. Manber, U.: Finding Similar Files in a Large File System. In: Proceedings of Winter USENIX Conference 1994, pp. 1–10 (1994)
- 3. Forman, G., Eshghi, K., Chiocchetti, S.: Finding Similar Files in Large Document Repositories. In: Proceedings of ACM SIGKDD 2005, pp. 394–400 (2005)
- 4. Heintze, N.: Scalable Document Fingerprinting. In: Proceedings of the Second USENIX Workshop on Electronic Commerce, pp. 191–200 (1996)
- 5. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: Local Algorithms for Document Fingerprinting. In: Proceedings of ACM SIGMOD 2003, pp. 76–85 (2003)
- 6. Singh, S., Estan, C., Varghese, G., Savage, S.: The EarlyBird System for Real-time Detection of Unknown Worms. Technical Report CS2003-0761, University of California, San Diego (2003)
- 7. Zipf, G.K.: The Psychobiology of Language. Houghton Mifflin, Boston (1935)
- 8. Yi, G., Xiaolong, W., Kai, Z.: The Frequency-Rank Relation of Language Units in Modern Chinese Computational Language Model. Journal of Chinese Information Processing 13(2), 8–15 (1999)
- 9. Miller, G.A., Newman, E.B.: Tests of a Statistical Explanation of the Rank-Frequency Relation for Words in Written English. The American Journal of Psychology 71(1), 209–218 (1958)
- 10. RFC 1321: The MD5 Message-Digest Algorithm
- 11. Rabin, M.O.: Fingerprinting by Random Polynomials. Technical Report TR-15-81, Harvard Aiken Computation Laboratory (1981)
- 12. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)
- 13.