International Workshop on Algorithms and Models for the Web-Graph

WAW 2004: Algorithms and Models for the Web-Graph pp 156-167

Crawling the Infinite Web: Five Levels Are Enough

  • Ricardo Baeza-Yates
  • Carlos Castillo
Conference paper

DOI: 10.1007/978-3-540-30216-2_13

Volume 3243 of the book series Lecture Notes in Computer Science (LNCS)
Cite this paper as:
Baeza-Yates R., Castillo C. (2004) Crawling the Infinite Web: Five Levels Are Enough. In: Leonardi S. (eds) Algorithms and Models for the Web-Graph. WAW 2004. Lecture Notes in Computer Science, vol 3243. Springer, Berlin, Heidelberg

Abstract

A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites that can create arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
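To make the depth estimate concrete, here is a minimal sketch under a deliberately simple assumption that is not taken from the paper: a user who has reached level d follows a link to level d + 1 with a fixed probability q, so page views at level d are proportional to q^d. The parameter `q` and the functions `coverage` and `min_depth` are hypothetical names chosen for illustration. The paper's own models may differ.

```python
def coverage(q: float, depth: int) -> float:
    """Fraction of page views at levels 0..depth.

    Assumes (illustrative only) that views at level d are proportional
    to q**d, so the normalized sum over 0..depth is 1 - q**(depth + 1).
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - q ** (depth + 1)


def min_depth(q: float, target: float = 0.9) -> int:
    """Smallest depth whose coverage reaches the target fraction."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    depth = 0
    while coverage(q, depth) < target:
        depth += 1
    return depth


if __name__ == "__main__":
    # Hypothetical continuation probabilities, not measured values.
    for q in (0.5, 0.6, 0.65):
        print(q, min_depth(q))
```

Under this toy assumption, continuation probabilities between 0.5 and 0.65 place the 90% threshold at depths 3 to 5, which matches the order of magnitude stated in the abstract. The agreement illustrates the argument and does not reproduce the paper's validation.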

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Ricardo Baeza-Yates (1)
  • Carlos Castillo (1)

  1. Center for Web Research, DCC, Universidad de Chile, Chile