Information Retrieval Journal, Volume 18, Issue 1, pp 73–94

A genetic programming framework to schedule webpage updates

  • Aécio S. R. Santos
  • Cristiano R. de Carvalho
  • Jussara M. Almeida
  • Edleno S. de Moura
  • Altigran S. da Silva
  • Nivio Ziviani

Abstract

The quality of a Web search engine is influenced by several factors, including the coverage and freshness of the content gathered by its web crawler. Focusing on freshness, one key challenge is to estimate the likelihood that a previously crawled webpage has been modified. Such estimates are used to define the order in which those pages should be revisited, and can thus be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We present a Genetic Programming framework, called \( GP4C \) (Genetic Programming for Crawling), that generates score functions which produce accurate rankings of pages according to their probabilities of having been modified. We compare \( GP4C \) with state-of-the-art methods using a large dataset of webpages crawled from the Brazilian Web. Our evaluation includes multiple performance metrics and several variations of our framework, built by exploring different sets of terminals and fitness functions. In particular, we evaluate \( GP4C \) using the ChangeRate and Normalized Discounted Cumulative Gain (NDCG) metrics as both objective function and evaluation metric. We show that, compared with ChangeRate, NDCG evaluates the effectiveness of scheduling strategies better, since it takes the ranking produced by the scheduling into account.
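The difference between the two metrics can be illustrated with a minimal sketch. It assumes that ChangeRate is the fraction of the top-k scheduled pages that had actually changed, and that NDCG uses a binary gain (1 for a changed page, 0 otherwise) with the usual logarithmic discount. These definitions, the function names, and the toy page identifiers are illustrative assumptions, not the exact formulation used in the article.

```python
import math

def change_rate(ranking, changed, k):
    """Assumed definition: fraction of the top-k scheduled pages that changed."""
    top = ranking[:k]
    return sum(1 for page in top if page in changed) / k

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranking, changed, k):
    """NDCG@k with binary gain (an assumption): 1 if the page changed, else 0."""
    all_gains = [1.0 if page in changed else 0.0 for page in ranking]
    ideal = sorted(all_gains, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(all_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy data: pages "a" and "b" changed since the last crawl.
changed = {"a", "b"}
ranking_good = ["a", "b", "c", "d"]  # changed pages scheduled first
ranking_bad = ["c", "d", "a", "b"]   # changed pages scheduled last

for name, ranking in [("good", ranking_good), ("bad", ranking_bad)]:
    print(name, change_rate(ranking, changed, 4), round(ndcg(ranking, changed, 4), 2))
```

With k = 4, both orderings download the same set of pages and therefore get the same ChangeRate (0.5), whereas NDCG drops from 1.0 to about 0.57 when the changed pages are placed last. Under these assumptions, only NDCG is sensitive to the order within the scheduled set.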

Keywords

Web crawling, Scheduling functions, Genetic Programming

Notes

Acknowledgments

We acknowledge the partial support of the Brazilian National Institute of Science and Technology for the Web (Grant MCT-CNPq 573871/2008-6), Project MinGroup (Grant CNPq-CT-Amazônia 575553/2008-1), and the authors’ individual grants and scholarships from CNPq and FAPEMIG.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Aécio S. R. Santos (1, 3)
  • Cristiano R. de Carvalho (1)
  • Jussara M. Almeida (1)
  • Edleno S. de Moura (2)
  • Altigran S. da Silva (2)
  • Nivio Ziviani (1, 3)

  1. Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
  2. Institute of Computing, Federal University of Amazonas, Manaus, Brazil
  3. Zunnit Technologies, Belo Horizonte, Brazil
