High-Performance Web Crawling

  • Marc Najork
  • Allan Heydon
Part of the Massive Computing book series (MACO, volume 4)

Abstract

High-performance web crawlers are an important component of many web services. For example, search services use web crawlers to populate their indices, comparison shopping engines use them to collect product and pricing information from online vendors, and the Internet Archive uses them to record a history of the Internet. The design of a high-performance crawler poses many challenges, both technical and social, primarily due to the large scale of the web. The web crawler must be able to download pages at a very high rate, yet it must not overwhelm any particular web server. Moreover, it must maintain data structures far too large to fit in main memory, yet it must be able to access and update them efficiently. This chapter describes our experience building and operating such a high-performance crawler.
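The politeness requirement described above — downloading at a high aggregate rate without overwhelming any single server — is typically met by partitioning the URL frontier into per-host queues and issuing at most one request per host at a time. The following is a minimal, illustrative sketch of that idea (not the Mercator implementation itself; the class and method names are invented for this example):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/**
 * Illustrative sketch of a polite breadth-first URL frontier:
 * URLs are partitioned into per-host FIFO queues, and hosts are
 * served round-robin, so no single server sees back-to-back requests.
 */
public class PoliteFrontier {
    // One FIFO queue per host preserves breadth-first order within a host.
    private final Map<String, Queue<String>> perHost = new HashMap<>();
    // Hosts that are currently eligible for their next download.
    private final Queue<String> readyHosts = new ArrayDeque<>();

    private static String hostOf(String url) {
        // Crude host extraction for the sketch; real code would use java.net.URI.
        String s = url.replaceFirst("^https?://", "");
        int slash = s.indexOf('/');
        return slash < 0 ? s : s.substring(0, slash);
    }

    public void add(String url) {
        String host = hostOf(url);
        Queue<String> q = perHost.get(host);
        if (q == null) {
            q = new ArrayDeque<>();
            perHost.put(host, q);
            readyHosts.add(host); // host becomes eligible for a download
        }
        q.add(url);
    }

    /** Returns the next URL to download, rotating over hosts, or null if empty. */
    public String next() {
        String host = readyHosts.poll();
        if (host == null) return null;
        Queue<String> q = perHost.get(host);
        String url = q.poll();
        if (q.isEmpty()) perHost.remove(host);
        else readyHosts.add(host); // re-queue host behind the others
        return url;
    }
}
```

A production frontier would additionally enforce a minimum delay between requests to the same host and keep most of its queues on disk, since — as the abstract notes — the data structures are far too large for main memory.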

Keywords

Web crawling · Internet Archive · Search engines · Java · HTTP · Checkpointing · Link extractor · Breadth-first traversal · Name resolution · Fingerprinting · Mercator

Copyright information

© Springer Science+Business Media Dordrecht 2002

Authors and Affiliations

  • Marc Najork¹
  • Allan Heydon²
  1. Compaq Computer Corporation, Systems Research Center, Palo Alto, USA
  2. Model N, Inc., South San Francisco, USA