A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
- Cite this paper as:
- Shen X., Chen J., Meng X., Zhang Y., Liu C. (2009) A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures. In: Theeramunkong T., Kijsirikul B., Cercone N., Ho TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg
This paper proposes a simple but powerful algorithm, the block co-citation algorithm, which automatically finds related pages for a given web page by combining HTML segmentation with parallel hyperlink structure analysis. First, all hyperlinks in a web page are segmented into several blocks according to the HTML structure and text style information. Second, for each page, the similarity between every two hyperlinks in the same block of the page is computed from several kinds of information; the total similarity from one page to another is then obtained once all web pages have been processed. For a given page u, the pages with the highest total similarity to u are selected as its related pages. Finally, the block co-citation algorithm is implemented in parallel and used to analyze a corpus of 37,482,913 pages sampled from a commercial search engine, demonstrating its feasibility and efficiency.
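The steps described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not specify the segmentation rules, the similarity formula, or the parallelization scheme, so the sketch assumes that pages arrive already segmented into blocks, that each pair of links sharing a block contributes a constant weight of 1.0 (a hypothetical placeholder for the paper's "several kinds of information"), and that similarity is symmetric. The names `block_cocitation`, `related_pages`, and `weight` are illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def block_cocitation(pages, weight=lambda a, b, block: 1.0):
    """Accumulate pairwise similarity between links that share a block.

    pages:  mapping of page id -> list of blocks, where each block is a
            list of the URLs linked from that block (segmentation is
            assumed to have been done already).
    weight: hypothetical per-pair similarity function; the constant 1.0
            default is an assumption, not the paper's formula.
    """
    total = defaultdict(lambda: defaultdict(float))
    for blocks in pages.values():
        for block in blocks:
            # Drop duplicate links within a block, keeping first-seen order.
            links = list(dict.fromkeys(block))
            for a, b in combinations(links, 2):
                w = weight(a, b, block)
                # Assumption: similarity is treated as symmetric here.
                total[a][b] += w
                total[b][a] += w
    return total


def related_pages(total, u, k=10):
    """Return up to k pages with the highest total similarity to u."""
    scores = total.get(u, {})
    # Sort by descending score, breaking ties by URL for determinism.
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


if __name__ == "__main__":
    # Toy corpus: two pages, each already split into two link blocks.
    pages = {
        "p1": [["a", "b", "c"], ["d", "e"]],
        "p2": [["a", "b"], ["c", "d"]],
    }
    total = block_cocitation(pages)
    print(related_pages(total, "a", k=2))
```

Because each page only adds to the running totals, the per-page work in this sketch could in principle be split across workers and the partial totals summed afterwards; whether the paper parallelizes the computation this way is not stated in the abstract.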
Keywords: Related pages, Co-citation algorithm, HTML Segmentation, Parallel, Scalable