The quality of a Web search engine is influenced by several factors, including the coverage and the freshness of the content gathered by the web crawler. Focusing on freshness, one key challenge is to estimate the likelihood that a previously crawled webpage has been modified. Such estimates are used to define the order in which those pages should be revisited, and can thus be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We present a Genetic Programming framework, called \( GP4C \) (Genetic Programming for Crawling), to generate score functions that produce accurate rankings of pages according to their probabilities of having been modified. We compare \( GP4C \) with state-of-the-art methods using a large dataset of webpages crawled from the Brazilian Web. Our evaluation includes multiple performance metrics and several variations of our framework, built by exploring different sets of terminals and fitness functions. In particular, we evaluate \( GP4C \) using the ChangeRate and Normalized Discounted Cumulative Gain (NDCG) metrics as both objective function and evaluation metric. We show that, in comparison with ChangeRate, NDCG better evaluates the effectiveness of scheduling strategies, since it takes the ranking produced by the scheduling into account.
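The difference between the two metrics can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the article: ChangeRate is read as the fraction of the top-k scheduled pages that had actually changed, and NDCG uses a binary gain (1 if the page changed, 0 otherwise) with the usual log2 position discount. The function names and the toy page identifiers are hypothetical.

```python
import math

def change_rate(ranking, changed, k):
    """Assumed definition: fraction of the top-k scheduled pages that changed."""
    top = ranking[:k]
    return sum(1 for page in top if page in changed) / k

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranking, changed, k):
    """NDCG@k with binary gain: 1.0 for a changed page, 0.0 otherwise."""
    all_gains = [1.0 if page in changed else 0.0 for page in ranking]
    ideal = sorted(all_gains, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(all_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: pages "a" and "b" changed since the last crawl.
changed = {"a", "b"}
schedule_1 = ["a", "b", "c", "d"]  # changed pages visited first
schedule_2 = ["c", "d", "a", "b"]  # changed pages visited last

cr_1 = change_rate(schedule_1, changed, k=4)
cr_2 = change_rate(schedule_2, changed, k=4)
ndcg_1 = ndcg(schedule_1, changed, k=4)
ndcg_2 = ndcg(schedule_2, changed, k=4)
```

Under these assumptions both schedules get the same ChangeRate, because they contain the same set of pages in the top 4, while NDCG scores the first schedule higher because it places the changed pages earlier in the visiting order.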
We note that in our preliminary version of this work (Santos et al. 2013), only \(n, X\) and \(t\) were used as terminals.
The BRDC’12 dataset is publicly available at http://www.latin.dcc.ufmg.br/brdc12.html.
One interesting experiment would be to use inaccurate statistics about the pages to produce the schedules. We note that such inaccuracies would affect all methods, including the baselines. Thus, we conjecture that our main conclusions would remain the same, although a careful investigation must be conducted to support this claim. Such a study is left for future work.
Those peaks can also be noted in Fig. 3.
Cho, J., & Garcia-Molina, H. (2000). Synchronizing a database to improve freshness. SIGMOD Record, 29(2), 117–128.
Cho, J., & Garcia-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3, 256–290.
Cho, J., & Ntoulas, A. (2002). Effective change detection using sampling. In 28th international conference on very large data bases, pp. 514–525.
Coffman, E. G., Liu, Z., & Weber, R. R. (1998). Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1), 15–29.
da Costa Carvalho, A. L., Rossi, C., de Moura, E. S., da Silva, A. S., & Fernandes, D. (2012). Lepref: Learn to precompute evidence fusion for efficient query evaluation. Journal of the American Society for Information Science and Technology, 63(7), 1383–1397.
de Almeida, H. M., Gonçalves, M. A., Cristo, M., & Calado, P. (2007). A combined component approach for finding collection-adapted ranking functions based on genetic programming. In 30th international ACM SIGIR conference on research and development in information retrieval, pp. 399–406.
Douglis, F., Feldmann, A., Krishnamurthy, B., & Mogul, J. (1997). Rate of change and other metrics: A live study of the world wide web. In USENIX symposium on internet technologies and systems, p. 14.
Fan, W., Fox, E. A., Pathak, P., & Wu, H. (2004a). The effects of fitness functions on genetic programming-based ranking discovery for web search. Journal of the American Society for Information Science and Technology, 55(7), 628–636.
Fan, W., Gordon, M., Pathak, P., Xi, W., & Fox, E. (2004b). Ranking function optimization for effective web search by genetic programming: An empirical study. In 37th Hawaii international conference on system sciences, pp. 105–112.
Fan, W., Gordon, M. D., & Pathak, P. (2004c). Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4), 523–527.
Fetterly, D., Craswell, N., & Vinay, V. (2009). The impact of crawl policy on web search effectiveness. In 32nd international ACM SIGIR conference on research and development in information retrieval, pp. 580–587.
Henrique, W. F., Ziviani, N., de Cristo, M. A. P., de Moura, E. S., da Silva, A. S., & Carvalho, C. (2011). A new approach for verifying URL uniqueness in web crawlers. In 18th international symposium on string processing and information retrieval, pp. 237–248.
Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. London: Wiley-Interscience.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.
Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge: MIT Press.
Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3), 175–246.
Radinsky, K., & Bennett, P. (2013). Predicting content change on the web. In 6th ACM international conference on web search and data mining, pp. 415–424.
Santos, A. S. R., Ziviani, N., Almeida, J. M., Carvalho, C., de Moura, E. S., & da Silva, A. S. (2013). Learning to schedule webpage updates using genetic programming. In 20th international symposium on string processing and information retrieval, pp. 271–278.
Silva, T. P. C., de Moura, E. S., Cavalcanti, J. M. B., da Silva, A. S., de Carvalho, M. G., & Gonçalves, M. A. (2009). An evolutionary approach for combining different sources of evidence in search engines. Information Systems, 34, 276–289.
Tan, Q., & Mitra, P. (2010). Clustering-based incremental web crawling. ACM Transactions on Information Systems, 28, 17:1–17:27.
Trotman, A. (2005). Learning to rank. Information Retrieval, 8(3), 359–381.
We acknowledge the partial support of the Brazilian National Institute of Science and Technology for the Web (Grant MCT-CNPq 573871/2008-6), Project MinGroup (Grant CNPq-CT-Amazônia 575553/2008-1), and the authors’ individual grants and scholarships from CNPq and FAPEMIG.
Santos, A.S.R., de Carvalho, C.R., Almeida, J.M. et al. A genetic programming framework to schedule webpage updates. Inf Retrieval J 18, 73–94 (2015). https://doi.org/10.1007/s10791-014-9248-5
- Web crawling
- Scheduling functions
- Genetic Programming