Learning to Schedule Webpage Updates Using Genetic Programming

  • Aécio S. R. Santos
  • Nivio Ziviani
  • Jussara Almeida
  • Cristiano R. Carvalho
  • Edleno Silva de Moura
  • Altigran Soares da Silva
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8214)

Abstract

A key challenge in designing a freshness-oriented scheduling policy is estimating the likelihood that a previously crawled webpage has been modified on the web. This estimate determines the order in which pages should be revisited, and can be exploited to reduce the cost of monitoring crawled webpages to keep their stored versions up to date. We present a novel approach for generating score functions that accurately rank pages by their probability of having been modified relative to their previously crawled versions. Specifically, we propose a flexible framework that uses genetic programming to evolve these score functions. We present a thorough experimental evaluation of the benefits of our framework over five state-of-the-art baselines.
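
As an illustration only, the idea of a score function that ranks pages by likelihood of change can be sketched as an expression tree, the representation commonly used in genetic programming. The feature names (`changes`, `visits`), the operator set, the example URLs, and the candidate tree below are assumptions made for this sketch; the abstract does not specify them. The evolutionary search itself (selection, crossover, mutation) is omitted.

```python
import operator
from typing import Callable, Dict, List, Tuple, Union

# A candidate score function is an expression tree: a leaf is a feature
# name or a numeric constant; an inner node is (operator, left, right).
Tree = Union[str, float, Tuple[str, "Tree", "Tree"]]


def protected_div(a: float, b: float) -> float:
    """Division that returns 0.0 instead of raising when the divisor is 0."""
    return a / b if b != 0 else 0.0


# Hypothetical operator set; a real system may use a different one.
OPS: Dict[str, Callable[[float, float], float]] = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}


def evaluate(tree: Tree, features: Dict[str, float]) -> float:
    """Recursively evaluate an expression tree on one page's features."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, features), evaluate(right, features))
    if isinstance(tree, str):
        return features[tree]  # leaf: look up a feature value
    return float(tree)         # leaf: numeric constant


def rank_pages(tree: Tree, pages: Dict[str, Dict[str, float]]) -> List[str]:
    """Order page URLs from highest to lowest score (revisit order)."""
    return sorted(pages, key=lambda url: evaluate(tree, pages[url]), reverse=True)


# Hypothetical individual: the observed change rate, changes / visits.
candidate: Tree = ("/", "changes", "visits")

# Made-up crawl history for three pages.
pages = {
    "a.example/news": {"changes": 9.0, "visits": 10.0},
    "b.example/about": {"changes": 1.0, "visits": 10.0},
    "c.example/blog": {"changes": 4.0, "visits": 8.0},
}

order = rank_pages(candidate, pages)
print(order)
```

In a genetic programming setting, many such trees would be generated and recombined, and each would be judged by how well its ranking matches the changes actually observed in training data.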

Keywords

Assure Extractor 

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Aécio S. R. Santos (1)
  • Nivio Ziviani (1)
  • Jussara Almeida (1)
  • Cristiano R. Carvalho (1)
  • Edleno Silva de Moura (2)
  • Altigran Soares da Silva (2)

  1. Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  2. Institute of Computing, Universidade Federal do Amazonas, Manaus, Brazil
