Finding seeds to bootstrap focused crawlers


Focused crawlers are effective tools for applications that require a large number of pages on a specific topic. Several strategies for implementing these crawlers have been proposed in the literature; they aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and we propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that crawlers given a large and varied seed set obtain not only higher harvest rates but also improved topic coverage.




  2. No smoothing is required here since we only use terms that occur in some document.


This research was partially sponsored by projects TTDSW (PRONEM/FAPEAM/CNPq), e-vox pesquisa (FAPEAM), e-spot (CNPq Universal), and by individual CNPq fellowship grants to Edleno Moura (308130/2014-6) and Altigran da Silva (311433/2014-6). This material is based on research sponsored by DARPA under agreement number FA8750-14-2-0236. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

  • Web crawling
  • Focused crawling
  • Relevance feedback