Enhancing Duplicate Collection Detection Through Replica Boundary Discovery

Zhang, Zhigang; Jia, Weijia; Li, Xiaoming

doi:10.1007/11731139_42

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3918)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Web documents are widely replicated on the Internet, and these replicas cause problems for Web-based information systems. Replica detection on the Web is therefore an indispensable task. The challenge is to find duplicated collections in a very large data set with limited hardware resources and in acceptable time. In this paper, we first introduce the notion of a replica boundary, which roughly reflects the state of the replicas; we then propose an effective and efficient approach to discover that boundary. The proposed approach has two advantages. First, it dramatically reduces pair-wise document similarity computation, making it much faster than traditional replicated document detection approaches. Second, it identifies the boundary of the replicated collections accurately, showing to what extent two collections are replicated. We evaluated the accuracy of the approach on two Web page sets containing 24 million and 30 million pages, respectively.
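To make concrete the pair-wise similarity cost that the abstract refers to, here is a minimal sketch of a naive baseline. It assumes word-shingle resemblance (Jaccard similarity over sets of k-word windows), a common measure in near-duplicate detection; it is not the replica boundary method of this paper. The function names, the shingle size `k`, the threshold, and the example URLs are all hypothetical.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def all_pairs(docs: dict, threshold: float = 0.5) -> list:
    """Naive baseline: compare every pair, i.e. n*(n-1)/2 similarity computations."""
    return [
        (x, y)
        for x, y in combinations(sorted(docs), 2)
        if resemblance(docs[x], docs[y]) >= threshold
    ]


# Hypothetical toy corpus keyed by URL.
docs = {
    "a.example/index": "the quick brown fox jumps over the lazy dog",
    "b.example/index": "the quick brown fox jumps over the lazy dog",
    "c.example/about": "an unrelated page about something else entirely",
}
print(all_pairs(docs))
```

Because the number of comparisons grows quadratically with the number of documents, a baseline of this kind becomes impractical at the scale of tens of millions of pages, which is the cost the abstract says the proposed approach reduces.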

Author information

Authors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
Zhigang Zhang & Weijia Jia
Institute of Network Computing and Information Systems, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Xiaoming Li

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore
Wee-Keong Ng
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Computer Engineering, Nanyang Technological University, 639798, Singapore, Singapore
Kuiyu Chang

About this paper

Cite this paper

Zhang, Z., Jia, W., Li, X. (2006). Enhancing Duplicate Collection Detection Through Replica Boundary Discovery. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_42

DOI: https://doi.org/10.1007/11731139_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science, Computer Science (R0)