Abstract
Large-scale distributed Web crawling systems that use voluntarily contributed personal computing resources allow small companies to build their own search engines at very low cost. The biggest challenge for such a system is implementing functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such function is incremental crawling, which requires recrawling each Web site according to the update frequency of its content. However, recrawl intervals calculated solely from the change frequency of Web sites may not match the system’s real-time capacity, which leads to inefficient use of resources. Building on our previous work on a DHT-based Web crawling system, in this paper we propose two scale-adaptable recrawl strategies that address this issue. The proposed methods are evaluated through simulations based on real Web datasets and show satisfactory results.
Copyright information
© 2010 IFIP International Federation for Information Processing
About this paper
Cite this paper
Xu, X., Zhang, W., Zhang, H., Fang, B. (2010). Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System. In: Ding, C., Shao, Z., Zheng, R. (eds) Network and Parallel Computing. NPC 2010. Lecture Notes in Computer Science, vol 6289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15672-4_9
DOI: https://doi.org/10.1007/978-3-642-15672-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15671-7
Online ISBN: 978-3-642-15672-4
eBook Packages: Computer Science, Computer Science (R0)