Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context

  • Tao Peng
  • Fengling He
  • Wanli Zuo
  • Changli Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4293)


Topical web crawling technology is important for domain-specific resource discovery. Topical crawlers yield good recall as well as good precision by restricting themselves to web pages from a specific domain. There is an intuition that the text surrounding a link, that is, the link-context on the HTML page, is a good summary of the target page. Motivated by that, this paper investigates some alternative methods and advocates that the link-context derived from the referring page’s HTML tag tree can provide a wealth of guidance for steering the crawler to stay on a domain-specific topic. So that the crawler can acquire enough guidance from link-context, we first look for referring pages by traversing backward from the seed URLs, and then build an initial term-based feature set by parsing the link-contexts extracted from those referring pages. The feature set, which is used to measure the similarity of the link-contexts on crawled pages, can be adaptively trained during crawling with link-contexts that point to relevant pages. This paper also presents some important metrics and an evaluation function for ranking URLs by page relevance. A comprehensive experiment has been conducted; the results show that this approach outperforms the Best-First and Breadth-First algorithms in both harvest rate and efficiency.
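The pipeline outlined in the abstract (a term-based feature set built from seed link-contexts, similarity scoring of new link-contexts, adaptive training, and URL ranking) might be sketched as follows. This is only an illustrative sketch under stated assumptions: the abstract does not give the paper's metrics or evaluation function, so plain term counts and cosine similarity are swapped in here, and the names `tokenize`, `cosine`, and `LinkContextFrontier` are hypothetical.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens; a stand-in for real term extraction."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LinkContextFrontier:
    """URL frontier ranked by link-context similarity (assumed scoring)."""

    def __init__(self, seed_contexts):
        # Initial feature set: term counts over the seed link-contexts.
        self.features = Counter()
        for context in seed_contexts:
            self.features.update(tokenize(context))
        self._heap = []
        self._seen = set()

    def score(self, link_context):
        """Similarity of one link-context to the current feature set."""
        return cosine(Counter(tokenize(link_context)), self.features)

    def push(self, url, link_context):
        """Queue a URL once; its score is fixed at push time."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-self.score(link_context), url))

    def pop(self):
        """Return (url, score) for the best-scoring queued URL."""
        negated, url = heapq.heappop(self._heap)
        return url, -negated

    def reinforce(self, link_context):
        """Adaptive step: fold a relevant link-context into the features."""
        self.features.update(tokenize(link_context))


frontier = LinkContextFrontier(["topical web crawler focused crawling"])
frontier.push("http://example.org/a", "focused crawling survey")
frontier.push("http://example.org/b", "cheap flights and hotels")
best_url, best_score = frontier.pop()
```

In this sketch, URLs already in the queue are not re-scored after `reinforce` is called; a fuller crawler would probably re-rank the frontier periodically, and would extract each link-context from the HTML tag tree rather than receive it as a plain string.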


Anchor Text, Site Anchor, Relevant Page, Aggregation Node, Target Page
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Tao Peng (1)
  • Fengling He (1)
  • Wanli Zuo (1)
  • Changli Zhang (1)
  1. College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun, China
