Crawling Policies Based on Web Page Popularity Prediction

  • Liudmila Ostroumova
  • Ivan Bogatyy
  • Arseniy Chelnokov
  • Alexey Tikhonov
  • Gleb Gusev
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)

Abstract

In this paper, we focus on crawling strategies for newly discovered URLs. Since it is impossible to crawl all new pages right after they appear, the most important (or popular) pages should be crawled with higher priority. One natural measure of page importance is the number of user visits. However, the popularity of a newly discovered URL cannot be known in advance and must therefore be predicted from the URL's features. We evaluate several methods for predicting new page popularity against previously investigated crawler performance measurements, and propose a novel measurement setup that aims to evaluate crawler performance more realistically. In particular, we compare the short-term and long-term popularity of new ephemeral URLs by estimating the rate of popularity decay. Our experiments show that information about popularity decay can be used effectively to optimize the ordering policies of crawlers, but further research is required to predict it accurately enough.
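
To make the role of popularity decay in crawl ordering concrete, the following is a minimal illustrative sketch, not the method evaluated in the paper. It assumes each new URL comes with a predicted total number of visits and a predicted decay rate, and that visits fall off exponentially over time. The exponential model and the names `expected_gain`, `crawl_order`, and `delay` are assumptions introduced only for this illustration.

```python
import heapq
import math


def expected_gain(total_visits, decay_rate, delay):
    """Visits still expected after `delay` time units.

    Assumption (not from the paper): visits arrive with intensity
    total_visits * decay_rate * exp(-decay_rate * t), so the share
    remaining after `delay` is exp(-decay_rate * delay).
    """
    return total_visits * math.exp(-decay_rate * delay)


def crawl_order(candidates, delay=1.0):
    """Order URLs by the visits a crawl at time `delay` could still serve.

    `candidates` is an iterable of
    (url, predicted_total_visits, predicted_decay_rate) tuples.
    """
    # heapq is a min-heap, so gains are negated to pop the largest first.
    heap = [(-expected_gain(v, lam, delay), url) for url, v, lam in candidates]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order


# Hypothetical URLs: "a" is popular but short-lived, "b" and "c" decay slowly.
candidates = [("a", 100, 2.0), ("b", 60, 0.1), ("c", 10, 0.1)]
print(crawl_order(candidates, delay=1.0))
print(crawl_order(candidates, delay=0.0))
```

With `delay=0.0` the ordering reduces to ranking by predicted total popularity alone. With a positive delay, a fast-decaying page can drop below a less popular but longer-lived one, which is the kind of effect a decay-aware ordering policy is meant to capture.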

Keywords

crawling policies, new web pages, popularity prediction

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Liudmila Ostroumova
    • 1
  • Ivan Bogatyy
    • 1
  • Arseniy Chelnokov
    • 1
  • Alexey Tikhonov
    • 1
  • Gleb Gusev
    • 1
  1. Yandex, Moscow, Russia
