High-Performance Web Crawling
High-performance web crawlers are an important component of many web services. For example, search services use web crawlers to populate their indices, comparison shopping engines use them to collect product and pricing information from online vendors, and the Internet Archive uses them to record a history of the Internet. The design of a high-performance crawler poses many challenges, both technical and social, primarily due to the large scale of the web. The web crawler must be able to download pages at a very high rate, yet it must not overwhelm any particular web server. Moreover, it must maintain data structures far too large to fit in main memory, yet it must be able to access and update them efficiently. This chapter describes our experience building and operating such a high-performance crawler.
Keywords: Web crawling, Internet archive, Search engines, Java, HTTP, Checkpointing, Link extractor, Breadth-first traversal, Name resolution, Fingerprinting, Mercator.
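The abstract names two competing requirements: a very high aggregate download rate, and politeness toward any individual web server. One common way to reconcile them is a per-host scheduler that spaces successive requests to the same host by a minimum delay. The sketch below illustrates that idea in Java (the language the chapter's crawler is written in); the class and method names are illustrative assumptions, not the chapter's actual API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * A minimal sketch of a per-host politeness policy: the crawler may
 * fetch at a high aggregate rate across many hosts, while requests to
 * any single host are spaced by a fixed minimum delay. Names here are
 * hypothetical, not taken from the chapter.
 */
public class PolitenessScheduler {
    private final long delayMillis;                       // minimum gap between fetches to one host
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public PolitenessScheduler(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Returns true and reserves the next slot if the host may be fetched at time {@code now}. */
    public synchronized boolean tryAcquire(String host, long now) {
        long earliest = nextAllowed.getOrDefault(host, 0L);
        if (now < earliest) {
            return false;                                 // too soon; requeue the URL for later
        }
        nextAllowed.put(host, now + delayMillis);         // reserve the following slot
        return true;
    }
}
```

With a 1-second delay, a fetch of `example.com` at time 0 succeeds, a second attempt at time 500 ms is refused, while a fetch of a different host at the same moment proceeds; this is how aggregate throughput stays high without overwhelming any one server.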