Focused Crawls, Tunneling, and Digital Libraries

  • Donna Bergmark
  • Carl Lagoze
  • Alex Sbityakov
Conference paper

DOI: 10.1007/3-540-45747-X_7

Part of the Lecture Notes in Computer Science book series (LNCS, volume 2458)
Cite this paper as:
Bergmark D., Lagoze C., Sbityakov A. (2002) Focused Crawls, Tunneling, and Digital Libraries. In: Agosti M., Thanos C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2002. Lecture Notes in Computer Science, vol 2458. Springer, Berlin, Heidelberg

Abstract

Crawling the Web to build collections of documents related to pre-specified topics became an active area of research during the late 1990s, after crawler technology had been developed for use by search engines. Web crawling is now being seriously considered as an important strategy for building large-scale digital libraries. This paper covers some of the crawl technologies that might be exploited for collection building. For example, focused crawling was developed to make such collection-building crawls more effective; its goal is a “best-first” crawl of the Web. We use powerful crawler software to implement a focused crawl, and we use tunneling to overcome some of the limitations of a pure best-first approach. Tunneling has been described by others as not only prioritizing the links from a page according to that page’s relevance score, but also estimating the value of each individual link and prioritizing the links accordingly. We add to this mix by devising a tunneling focused-crawling strategy that evaluates the current crawl direction on the fly to determine when to terminate a tunneling activity. Results indicate that a combination of focused crawling and tunneling could be an effective tool for building digital libraries.
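
To make the best-first and tunneling ideas in the abstract concrete, the following is a minimal sketch in Python. It is not the authors' crawler or their termination rule. It assumes a toy in-memory link graph instead of live HTTP fetching, a caller-supplied relevance function, and a fixed limit on consecutive off-topic pages as a stand-in for the on-the-fly evaluation of crawl direction. The names `focused_crawl`, `links`, `score`, `threshold`, and `max_tunnel` are all hypothetical. Per-link value estimation is also omitted: each outgoing link simply inherits the relevance of the page it was found on.

```python
import heapq
from typing import Callable, Dict, List, Set


def focused_crawl(
    seeds: List[str],
    links: Dict[str, List[str]],
    score: Callable[[str], float],
    threshold: float = 0.5,
    max_tunnel: int = 2,
) -> List[str]:
    """Best-first crawl over an in-memory graph with a simple tunneling cutoff.

    Returns the on-topic pages (score >= threshold) in the order visited.
    All parameters are illustrative assumptions, not taken from the paper.
    """
    # Heap entries are (-priority, url, tunnel_depth). heapq is a min-heap,
    # so priorities are negated to pop the most promising link first.
    frontier = [(-1.0, url, 0) for url in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    collected: List[str] = []

    while frontier:
        _, url, depth = heapq.heappop(frontier)
        relevance = score(url)
        if relevance >= threshold:
            collected.append(url)
            depth = 0  # back on topic: the tunnel ends here
        else:
            depth += 1  # one more consecutive off-topic page
            if depth > max_tunnel:
                continue  # tunnel is too long: stop following this path
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                # Links inherit the relevance of the page that contains them.
                heapq.heappush(frontier, (-relevance, target, depth))
    return collected
```

With `max_tunnel=0` the sketch behaves like a pure best-first crawl that never expands an off-topic page; raising `max_tunnel` lets the crawl pass through a few off-topic pages to reach on-topic pages beyond them, at the cost of visiting more irrelevant material.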

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Donna Bergmark (1)
  • Carl Lagoze (1)
  • Alex Sbityakov (1)

  1. Cornell Digital Library Research Group, USA
