Crawling the Infinite Web: Five Levels Are Enough

  • Ricardo Baeza-Yates
  • Carlos Castillo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3243)


A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites that can create arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
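
The depth estimate described above can be illustrated with a toy calculation. The abstract does not specify the models, so the sketch below is not one of the article's models: it assumes, purely for illustration, that a user at any level follows a link to the next level with a constant probability `q`. Under that assumption the share of page views at depth `d` or shallower is `1 - q**(d + 1)`. The function names and the parameter `q` are hypothetical.

```python
def coverage_at_depth(q: float, depth: int) -> float:
    """Fraction of page views at depth <= `depth` (start page is depth 0).

    Assumption (not from the article): a user at any level continues to
    the next level with constant probability q, so views at depth k are
    proportional to q**k.
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - q ** (depth + 1)


def min_depth_for_coverage(q: float, target: float = 0.9) -> int:
    """Smallest depth whose cumulative share of page views reaches `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    depth = 0
    while coverage_at_depth(q, depth) < target:
        depth += 1
    return depth


if __name__ == "__main__":
    # With q = 0.5 the cumulative shares are 0.5, 0.75, 0.875, 0.9375,
    # so 90% of page views are first covered at depth 3.
    print(min_depth_for_coverage(0.5, 0.9))  # 3
```

In this toy model the required depth grows only logarithmically as the target coverage approaches 1, which is one way to see why a handful of levels could account for most visited pages. The actual thresholds depend on the models and usage data in the article.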




Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Ricardo Baeza-Yates
    • 1
  • Carlos Castillo
    • 1
  1. Center for Web Research, DCC, Universidad de Chile, Chile
