World Wide Web, Volume 2, Issue 4, pp 219–229

Mercator: A scalable, extensible Web crawler

  • Allan Heydon
  • Marc Najork

Abstract

This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important component of many Web services, but their design is not well‐documented in the literature. We enumerate the major components of any scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator's support for extensibility and customizability. Finally, we comment on Mercator's performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published.

Copyright information

© Kluwer Academic Publishers 1999

Authors and Affiliations

  • Allan Heydon (1)
  • Marc Najork (1)
  1. Compaq Systems Research Center, Palo Alto, USA
