A genetic programming framework to schedule webpage updates
The quality of a Web search engine is influenced by several factors, including the coverage and the freshness of the content gathered by the web crawler. Focusing on freshness, one key challenge is to estimate the likelihood that a previously crawled webpage has been modified. Such estimates are used to define the order in which those pages should be revisited, and can thus be exploited to reduce the cost of keeping crawled webpages up to date. We present a Genetic Programming framework, called \( GP4C \) (Genetic Programming for Crawling), that generates score functions producing accurate rankings of pages according to their probability of having been modified. We compare \( GP4C \) with state-of-the-art methods using a large dataset of webpages crawled from the Brazilian Web. Our evaluation includes multiple performance metrics and several variations of our framework, built by exploring different sets of terminals and fitness functions. In particular, we evaluate \( GP4C \) using the ChangeRate and Normalized Discounted Cumulative Gain (NDCG) metrics as both objective function and evaluation metric. We show that, compared with ChangeRate, NDCG evaluates the effectiveness of scheduling strategies better, since it takes the ranking produced by the scheduler into account.
Keywords: Web crawling, Scheduling functions, Genetic Programming
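The contrast the abstract draws between ChangeRate and NDCG can be illustrated with a minimal sketch. It assumes that ChangeRate is the fraction of the first k scheduled pages that had actually changed, and that NDCG uses binary gains (1 for a changed page, 0 otherwise) with the usual logarithmic position discount; the exact formulations used in the paper may differ. The function names and the two example schedules are hypothetical.

```python
import math

def change_rate(changed_flags, k):
    """Fraction of the first k scheduled pages that had changed (assumed definition)."""
    top = changed_flags[:k]
    return sum(top) / len(top)

def ndcg(changed_flags, k):
    """NDCG@k with binary gains: 1 if the page changed, 0 otherwise (assumed definition)."""
    def dcg(flags):
        # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
        return sum(f / math.log2(pos + 2) for pos, f in enumerate(flags[:k]))
    ideal = dcg(sorted(changed_flags, reverse=True))
    return dcg(changed_flags) / ideal if ideal > 0 else 0.0

# Two hypothetical schedules over the same five pages, listed in visit order.
# Both place the same two changed pages in the top 4, but in different positions.
schedule_a = [1, 1, 0, 0, 0]  # changed pages visited first
schedule_b = [0, 0, 1, 1, 0]  # changed pages visited third and fourth

k = 4
print("ChangeRate:", change_rate(schedule_a, k), change_rate(schedule_b, k))
print("NDCG:      ", ndcg(schedule_a, k), ndcg(schedule_b, k))
```

Under these assumptions both schedules obtain the same ChangeRate at k = 4, because the metric only counts how many changed pages fall within the cutoff, while NDCG scores the first schedule higher because it visits the changed pages earlier.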
We acknowledge the partial support provided by the Brazilian National Institute of Science and Technology for the Web (Grant MCT-CNPq 573871/2008-6), Project MinGroup (Grant CNPq-CT-Amazônia 575553/2008-1), and the authors' individual grants and scholarships from CNPq and FAPEMIG.