APWeb 2007, WAIM 2007: Advances in Data and Web Management, pp. 586–593

Density Analysis of Winnowing on Non-uniform Distributions

  • Xiaoming Yu
  • Yue Liu
  • Hongbo Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4505)

Abstract

The growing number of copies of digital documents makes duplicate detection an important problem. Among the techniques proposed so far, the Winnowing fingerprinting algorithm [5] is one of the most efficient. However, the previous density analysis leaves the performance of Winnowing without guarantees in real systems, because its assumption of uniformly distributed k-grams is far from true in practice. In this paper, an improved density analysis method is introduced. Compared with the previous analysis, our method requires only that the k-grams be identically distributed to obtain its prediction. This means our theoretical result can be safely applied to highly non-uniformly distributed data, which are common in real systems. Extensive experiments are performed on both artificial and real data. The experimental results agree well with the theoretical predictions.
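For readers unfamiliar with the algorithm under analysis, the following is a minimal Python sketch of Winnowing as described in [5], together with an empirical density measurement. It is an illustration under stated assumptions, not the authors' implementation: the function names `winnow` and `density` are hypothetical, the input is assumed to be a list of already computed k-gram hash values, and density is taken here as the fraction of hash positions selected as fingerprints. For reference, the estimate given in [5] under the uniform assumption is 2/(w+1) for window size w.

```python
def winnow(hashes, w):
    """Return the sorted positions selected by winnowing with window size w.

    In every window of w consecutive hash values the minimum is selected;
    ties are broken by taking the rightmost occurrence, as in [5].
    Each position is recorded only once, however many windows select it.
    """
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Offset of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add(start + offset)
    return sorted(selected)


def density(hashes, w):
    """Empirical density: selected positions divided by total hash count."""
    return len(winnow(hashes, w)) / len(hashes)
```

Running `density` on hash sequences drawn from a uniform source and from a skewed source (for example, hashes of k-grams over natural-language text) gives a simple way to compare measured density against a theoretical prediction for a chosen window size.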

Keywords

Window Size; Density Analysis; Digital Document; Topic Detection; English Document
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Brin, S., Davis, J., Garcia-Molina, H.: Copy Detection Mechanisms for Digital Documents. In: Proceedings of ACM SIGMOD 1995, pp. 398–409 (1995)
  2. Manber, U.: Finding Similar Files in a Large File System. In: Proceedings of Winter USENIX Conference 1994, pp. 1–10 (1994)
  3. Forman, G., Eshghi, K., Chiocchetti, S.: Finding Similar Files in Large Document Repositories. In: Proceedings of ACM SIGKDD 2005, pp. 394–400 (2005)
  4. Heintze, N.: Scalable Document Fingerprinting. In: Proceedings of the Second USENIX Workshop on Electronic Commerce, pp. 191–200 (1996)
  5. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: Local Algorithms for Document Fingerprinting. In: Proceedings of ACM SIGMOD 2003, pp. 76–85 (2003)
  6. Singh, S., Estan, C., Varghese, G., Savage, S.: The EarlyBird System for Real-time Detection of Unknown Worms. Technical Report CS2003-0761, University of California, San Diego (2003)
  7. Zipf, G.K.: The Psychobiology of Language. Houghton Mifflin, Boston (1935)
  8. Yi, G., Xiaolong, W., Kai, Z.: The Frequency-Rank Relation of Language Units in Modern Chinese Computational Language Model. Journal of Chinese Information Processing 13(2), 8–15 (1999)
  9. Miller, G.A., Newman, E.B.: Tests of Statistical Explanation of the Rank-Frequency Relation for Words in Written English. The American Journal of Psychology 71(1), 209–218 (1958)
  10. RFC 1321: The MD5 Message-Digest Algorithm
  11. Rabin, M.O.: Fingerprinting by Random Polynomials. Technical Report TR-15-81, Harvard Aiken Computation Laboratory (1981)
  12. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Xiaoming Yu (1, 2)
  • Yue Liu (1)
  • Hongbo Xu (1)

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P.R. China
  2. Graduate School, Chinese Academy of Sciences, Beijing 100039, P.R. China
