A genetic programming framework to schedule webpage updates

Abstract

The quality of a Web search engine is influenced by several factors, including the coverage and freshness of the content gathered by its web crawler. Focusing on freshness, one key challenge is to estimate the likelihood that a previously crawled webpage has been modified. Such estimates are used to define the order in which pages should be revisited, and can thus be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We present a Genetic Programming framework, called \( GP4C \) (Genetic Programming for Crawling), that generates score functions producing accurate rankings of pages by their probability of having been modified. We compare \( GP4C \) with state-of-the-art methods using a large dataset of webpages crawled from the Brazilian Web. Our evaluation includes multiple performance metrics and several variations of our framework, built by exploring different sets of terminals and fitness functions. In particular, we evaluate \( GP4C \) using the ChangeRate and Normalized Discounted Cumulative Gain (NDCG) metrics as both objective function and evaluation metric. We show that, compared with ChangeRate, NDCG is better able to evaluate the effectiveness of scheduling strategies, since it takes the ranking produced by the scheduling into account.
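
The contrast between the two metrics can be illustrated with a minimal sketch. It assumes that ChangeRate is the fraction of the top-k scheduled pages that had actually changed, and that NDCG uses binary gains (1 = changed, 0 = unchanged) with the common log2(rank + 1) discount; the exact definitions used in the paper may differ. The function names and the two toy rankings are hypothetical.

```python
import math


def change_rate(changed_flags, k):
    """Fraction of the top-k scheduled pages that had changed (assumed definition)."""
    top = changed_flags[:k]
    return sum(top) / len(top) if top else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks starting at 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(changed_flags, k):
    """DCG of the top-k pages, normalized by the DCG of the best possible top-k."""
    ideal = sorted(changed_flags, reverse=True)[:k]
    best = dcg(ideal)
    return dcg(changed_flags[:k]) / best if best > 0 else 0.0


# Two hypothetical schedules over the same six pages (1 = page had changed).
# Both place two changed pages in the top 4, but at different positions.
ranking_a = [1, 1, 0, 0, 1, 0]
ranking_b = [0, 0, 1, 1, 1, 0]

k = 4
print(change_rate(ranking_a, k), change_rate(ranking_b, k))
print(ndcg(ranking_a, k), ndcg(ranking_b, k))
```

Under these assumptions, both schedules obtain the same ChangeRate at k = 4, while NDCG rewards the schedule that places the changed pages earlier.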

Notes

  1. We note that in our preliminary version of this work (Santos et al. 2013), only \(n, X\) and \(t\) were used as terminals.

  2. The BRDC’12 dataset is publicly available at http://www.latin.dcc.ufmg.br/brdc12.html.

  3. http://www.alexa.com/topsites/countries/BR.

  4. One interesting experiment consists of using inaccurate statistics about the pages to produce the schedules. We note that such inaccuracies would affect all methods, including the baselines. Thus, we conjecture that our main conclusions would remain the same, although a careful investigation must be conducted to support this claim. Such a study is left for future work.

  5. Those peaks can also be noted in Fig. 3.

References

  1. Carvalho, A., Rossi, C., de Moura, E. S., Fernandes, D., & da Silva, A. S. (2012). LePrEF: Learn to pre-compute evidence fusion for efficient query evaluation. Journal of the American Society for Information Science and Technology, 55(92), 1–28.

  2. Cho, J., & Garcia-Molina, H. (2000). Synchronizing a database to improve freshness. SIGMOD Record, 29(2), 117–128.

  3. Cho, J., & Garcia-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3, 256–290.

  4. Cho, J., & Ntoulas, A. (2002). Effective change detection using sampling. In 28th international conference on very large data bases, pp. 514–525.

  5. Coffman, E. G., Liu, Z., & Weber, R. R. (1998). Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1), 15–29.

  6. da Costa Carvalho, A. L., Rossi, C., de Moura, E. S., da Silva, A. S., & Fernandes, D. (2012). LePrEF: Learn to precompute evidence fusion for efficient query evaluation. Journal of the American Society for Information Science and Technology, 63(7), 1383–1397.

  7. de Almeida, H. M., Gonçalves, M. A., Cristo, M., & Calado, P. (2007). A combined component approach for finding collection-adapted ranking functions based on genetic programming. In 30th international ACM SIGIR conference on research and development in information retrieval, pp. 399–406.

  8. Douglis, F., Feldmann, A., Krishnamurthy, B., & Mogul, J. (1997). Rate of change and other metrics: A live study of the world wide web. In USENIX symposium on internet technologies and systems, pp. 14–14.

  9. Fan, W., Fox, E. A., Pathak, P., & Wu, H. (2004a). The effects of fitness functions on genetic programming-based ranking discovery for web search. Journal of the American Society for Information Science and Technology, 55(7), 628–636.

  10. Fan, W., Gordon, M., Pathak, P., Xi, W., & Fox, E. (2004b). Ranking function optimization for effective web search by genetic programming: An empirical study. In 37th Hawaii international conference on system sciences, pp. 105–112.

  11. Fan, W., Gordon, M. D., & Pathak, P. (2004c). Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4), 523–527.

  12. Fetterly, D., Craswell, N., & Vinay, V. (2009). The impact of crawl policy on web search effectiveness. In 32nd international ACM SIGIR conference on research and development in information retrieval, pp. 580–587.

  13. Henrique, W. F., Ziviani, N., de Cristo, M. A. P., de Moura, E. S., da Silva, A. S., & Carvalho, C. (2011). A new approach for verifying URL uniqueness in web crawlers. In 18th international symposium on string processing and information retrieval, pp. 237–248.

  14. Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. London: Wiley-Interscience.

  15. Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.

  16. Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge: MIT Press.

  17. Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3), 175–246.

  18. Radinsky, K., & Bennett, P. (2013). Predicting content change on the web. In 6th ACM international conference on web search and data mining, pp. 415–424.

  19. Santos, A. S. R., Ziviani, N., Almeida, J. M., Carvalho, C., de Moura, E. S., & da Silva, A. S. (2013). Learning to schedule webpage updates using genetic programming. In 20th international symposium on string processing and information retrieval, pp. 271–278.

  20. Silva, T. P. C., de Moura, E. S., Cavalcanti, J. M. B., da Silva, A. S., de Carvalho, M. G., & Gonçalves, M. A. (2009). An evolutionary approach for combining different sources of evidence in search engines. Information Systems, 34, 276–289.

  21. Tan, Q., & Mitra, P. (2010). Clustering-based incremental web crawling. ACM Transactions on Information Systems, 28, 17:1–17:27.

  22. Trotman, A. (2005). Learning to rank. Information Retrieval, 8(3), 359–381.

Acknowledgments

We acknowledge the partial support of the Brazilian National Institute of Science and Technology for the Web (Grant MCT-CNPq 573871/2008-6), Project MinGroup (Grant CNPq-CT-Amazônia 575553/2008-1), and the authors’ individual grants and scholarships from CNPq and FAPEMIG.

Corresponding author

Correspondence to Jussara M. Almeida.

Cite this article

Santos, A.S.R., de Carvalho, C.R., Almeida, J.M. et al. A genetic programming framework to schedule webpage updates. Inf Retrieval J 18, 73–94 (2015). https://doi.org/10.1007/s10791-014-9248-5

Keywords

  • Web crawling
  • Scheduling functions
  • Genetic Programming