Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System

Xu, Xiao; Zhang, Weizhe; Zhang, Hongli; Fang, Binxing

doi:10.1007/978-3-642-15672-4_9


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6289)

Included in the following conference series:

IFIP International Conference on Network and Parallel Computing

Abstract

Large-scale distributed Web crawling systems that use voluntarily contributed personal computing resources allow small companies to build their own search engines at very low cost. The biggest challenge for such a system is to implement functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such function is incremental crawling, which requires recrawling each Web site according to the update frequency of its content. However, recrawl intervals calculated solely from the change frequency of Web sites may not match the system's real-time capacity, which leads to inefficient utilization of resources. Building on our previous work on a DHT-based Web crawling system, in this paper we propose two scale-adaptable recrawl strategies to address this issue. The proposed methods are evaluated through simulations based on real Web datasets and show satisfactory results.
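The mismatch described in the abstract can be illustrated with a minimal sketch. It is not one of the two strategies proposed in the paper, which the abstract does not detail. It assumes that each site recrawl costs one unit of work, that capacity is expressed in recrawls per second, and that all intervals are rescaled by one common factor. The function name `scale_intervals`, its parameters, and the example site names are hypothetical.

```python
def scale_intervals(intervals_s, capacity_per_s):
    """Rescale per-site recrawl intervals so total demand equals capacity.

    intervals_s: dict mapping site -> recrawl interval in seconds, derived
                 from change frequency alone (assumed input).
    capacity_per_s: recrawls per second the system can currently sustain
                    (assumed to be measured elsewhere).
    """
    if capacity_per_s <= 0:
        raise ValueError("capacity_per_s must be positive")
    if any(t <= 0 for t in intervals_s.values()):
        raise ValueError("all intervals must be positive")

    # Demand implied by the change-frequency intervals, in recrawls per second.
    demand = sum(1.0 / t for t in intervals_s.values())

    # factor > 1 stretches intervals (system overloaded);
    # factor < 1 shrinks them (spare capacity would otherwise sit idle).
    factor = demand / capacity_per_s
    return {site: t * factor for site, t in intervals_s.items()}


if __name__ == "__main__":
    ideal = {"a.example": 10.0, "b.example": 20.0, "c.example": 40.0}
    # Capacity is half the ideal demand, so every interval roughly doubles.
    for site, interval in scale_intervals(ideal, 0.0875).items():
        print(site, round(interval, 3))
```

Uniform scaling keeps the relative ordering of sites by change frequency while tying the total recrawl rate to what the crawler population can sustain at that moment, which is the kind of scale adaptation the abstract motivates.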


Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Xiao Xu, Weizhe Zhang, Hongli Zhang & Binxing Fang

Editor information

Editors and Affiliations

University of Rochester, P.O. Box 270226, 14627, Rochester, NY, USA
Chen Ding
School of Computer Science and Technology, Huazhong University of Science and Technology, 430074, Wuhan, China
Zhiyuan Shao
School of Computer Science and Technology, Services Computing Technology and Huazhong University of Science and Technology, 430074, Wuhan, China
Ran Zheng

About this paper

Cite this paper

Xu, X., Zhang, W., Zhang, H., Fang, B. (2010). Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System. In: Ding, C., Shao, Z., Zheng, R. (eds) Network and Parallel Computing. NPC 2010. Lecture Notes in Computer Science, vol 6289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15672-4_9

DOI: https://doi.org/10.1007/978-3-642-15672-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15671-7
Online ISBN: 978-3-642-15672-4
eBook Packages: Computer Science, Computer Science (R0)
