An Extended Method for Finding Related Web Pages with Focused Crawling Techniques

  • Kazutaka Furuse
  • Hiroaki Ohmura
  • Hanxiong Chen
  • Hiroyuki Kitagawa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6882)

Abstract

This paper proposes an extended mechanism for efficiently finding related web pages, built by incorporating focused crawling techniques.

One successful method for finding related web pages is Kleinberg’s HITS algorithm, which determines the web pages related to a given set of web pages by calculating hub and authority scores. Although this method is effective at extracting closely related web pages, it is limited in that its score calculation considers only the web pages directly connected to the given web pages.
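The hub and authority calculation mentioned above can be illustrated with a minimal sketch of the standard HITS iteration. This is not the authors' implementation: the function names `hits` and `normalize`, the edge-list input, and the fixed iteration count are assumptions made for illustration only.

```python
import math


def normalize(scores):
    """Scale a score dict to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph.

    `edges` is an iterable of (source, destination) link pairs; this
    input format is an assumption of the sketch.
    """
    edges = list(edges)
    nodes = {page for edge in edges for page in edge}
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to each page.
        new_auth = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the updated authority scores of the pages each page links to.
        new_hub = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

In this sketch a page that is linked from many good hubs receives a high authority score, and a page that links to many good authorities receives a high hub score. The scores depend entirely on which pages and links are placed in the input graph.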

The proposed method extends the HITS algorithm by enlarging the neighborhood graph used for the score calculation. By navigating links forward and backward, pages that are not directly connected to the given web pages are included in the neighborhood graph. Because the navigation uses focused crawling techniques, the proposed method effectively collects promising pages that help improve the accuracy of the scores. Moreover, unrelated pages are filtered out to avoid topic drift in the course of the navigation. Consequently, the proposed method successfully finds related pages, since scores are calculated with adequately extended neighborhood graphs. The effectiveness and efficiency of the proposed method are confirmed by the results of experiments performed with real data sets.
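The abstract does not specify how pages are prioritized or filtered, so the following is only a rough sketch of the general idea: a best-first expansion of the neighborhood graph that follows links in both directions and drops pages below a relevance threshold. The function `expand_neighborhood`, the in-memory `out_links` and `in_links` dictionaries, the `relevance` scoring function, and the `threshold` and `max_pages` parameters are all hypothetical and are not taken from the paper.

```python
import heapq


def expand_neighborhood(seeds, out_links, in_links, relevance,
                        threshold=0.5, max_pages=100):
    """Grow a neighborhood graph from seed pages, most relevant pages first.

    out_links / in_links: dict mapping a page to the pages it links to /
        is linked from (a hypothetical stand-in for real link lookups).
    relevance: function mapping a page to a score in [0, 1]
        (a hypothetical topic classifier).
    Returns (collected_pages, edges).
    """
    collected = set(seeds)
    edges = set()
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, seed) for seed in seeds]
    heapq.heapify(frontier)
    while frontier:
        _, page = heapq.heappop(frontier)
        forward = [(page, target) for target in out_links.get(page, [])]
        backward = [(source, page) for source in in_links.get(page, [])]
        for src, dst in forward + backward:
            neighbor = dst if src == page else src
            if neighbor not in collected:
                score = relevance(neighbor)
                # Skip unrelated pages (to limit topic drift) and respect the page budget.
                if score < threshold or len(collected) >= max_pages:
                    continue
                collected.add(neighbor)
                heapq.heappush(frontier, (-score, neighbor))
            edges.add((src, dst))
    return collected, edges
```

The resulting edge set could then be passed to a hub and authority calculation. How the paper actually computes priority scores and decides which pages to discard is not stated in the abstract.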

Keywords

Authority Score, Priority Queue, Neighborhood Graph, Find Relate, Priority Score
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


References

  1. The Open Directory Project, http://www.dmoz.org/
  2. Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. In: Proceedings of the Eighth International Conference on World Wide Web, WWW 1999, pp. 1623–1640. Elsevier North-Holland, Inc., New York (1999)
  3. Chakrabarti, S., Dom, B., Raghavan, P., Rajagopalan, S., Gibson, D., Kleinberg, J.: Automatic resource compilation by analyzing hyperlink structure and associated text. In: Proceedings of the Seventh International Conference on World Wide Web 7, WWW 7, pp. 65–74. Elsevier Science Publishers B. V., Amsterdam (1998)
  4. Järvelin, K., Kekäläinen, J.: IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2000, pp. 41–48 (2000)
  5. Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46, 604–632 (1999)
  6. Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, Heidelberg (2007)
  7. Micarelli, A., Gasparetti, F.: Adaptive focused crawling. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 231–262. Springer, Heidelberg (2007)
  8. Olston, C., Najork, M.: Web crawling. Foundations and Trends in Information Retrieval 4, 175–246 (2010)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Kazutaka Furuse (1)
  • Hiroaki Ohmura (1)
  • Hanxiong Chen (1)
  • Hiroyuki Kitagawa (1)
  1. Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, Japan
