A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures

  • Xiaoyan Shen
  • Junliang Chen
  • Xiangwu Meng
  • Yujie Zhang
  • Chuanchang Liu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5476)

Abstract

This paper proposes a simple but powerful algorithm, the block co-citation algorithm, which automatically finds related pages for a given web page by using HTML segmentation and parallel hyperlink structure analysis. First, all hyperlinks in a web page are segmented into several blocks according to the page's HTML structure and text style information. Second, for each page, the similarity between every two hyperlinks in the same block is computed from several kinds of information; the total similarity from one page to another is obtained once all web pages have been processed. For a given page u, the pages with the highest total similarity to u are selected as its related pages. Finally, the block co-citation algorithm is implemented in parallel and used to analyze a corpus of 37,482,913 pages sampled from a commercial search engine, which demonstrates its feasibility and efficiency.
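
The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which kinds of information feed the per-block similarity, so the sketch assumes a uniform weight of 1.0 for every two links that share a block. The function names (`emit_block_pairs`, `total_similarity`, `related_pages`) and the toy page data are hypothetical, and the input is assumed to be already segmented into blocks of hyperlinks. The split into an emit step and a summing step only hints at how the work might be spread across machines; here everything runs sequentially.

```python
from collections import defaultdict
from itertools import combinations


def emit_block_pairs(blocks):
    """Yield (u, v, weight) for every two distinct links sharing a block."""
    for block in blocks:
        links = sorted(set(block))  # de-duplicate links within one block
        for u, v in combinations(links, 2):
            # Assumption: uniform weight; the paper combines several
            # kinds of information that the abstract does not list.
            yield u, v, 1.0
            yield v, u, 1.0


def total_similarity(pages):
    """Sum the per-block pair weights over all citing pages."""
    totals = defaultdict(lambda: defaultdict(float))
    for blocks in pages.values():
        for u, v, weight in emit_block_pairs(blocks):
            totals[u][v] += weight
    return totals


def related_pages(totals, u, k=3):
    """Return up to k pages with the highest total similarity to u."""
    ranked = sorted(totals.get(u, {}).items(),
                    key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:k]]


# Hypothetical toy corpus: each citing page is a list of link blocks.
pages = {
    "p1": [["a", "b", "c"], ["x", "y"]],
    "p2": [["a", "b"], ["c", "x"]],
    "p3": [["a", "c"]],
}
totals = total_similarity(pages)
print(related_pages(totals, "a", k=2))
```

In this toy corpus, links that merely appear on the same page but in different blocks (for example "a" and "x" on page p1) contribute nothing to each other's similarity, which is the point of segmenting links before counting co-citations.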

Keywords

Related pages, Co-citation algorithm, HTML segmentation, Parallel, Scalable

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Xiaoyan Shen (1)
  • Junliang Chen (1)
  • Xiangwu Meng (1)
  • Yujie Zhang (1)
  • Chuanchang Liu (1)
  1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
