A Method for Pinpoint Clustering of Web Pages with Pseudo-Clique Search

  • Makoto Haraguchi
  • Yoshiaki Okubo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3847)

Abstract

This paper presents a method for Pinpoint Clustering of web pages. We aim to find useful clusters of web pages that are significant in the sense that their contents are similar to those of higher-ranked pages. Because users usually pay little attention to lower-ranked pages, such pages are discarded unconditionally, even when their contents are similar to those of highly ranked pages. Our method extracts these hidden pages, together with the significant higher-ranked pages, as a cluster. As a result, our clusters can provide users with new and valuable information.

To obtain such clusters, we first extract semantic correlations among terms by applying Singular Value Decomposition (SVD) to the term-document matrix generated from a corpus. Based on these correlations, we evaluate potential similarities among the web pages to be clustered. The set of web pages is then represented as a weighted graph G built from these similarities and the page ranks. Our clusters are found as pseudo-cliques in G, and we present an algorithm for finding the Top-N weighted pseudo-cliques. Our experimental result shows that a highly valuable cluster can indeed be extracted by our method.
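
As a rough illustration of the first steps of this pipeline, the following Python sketch applies a truncated SVD to a toy term-document matrix, computes cosine similarities between pages in the reduced space, and thresholds them into a graph. Everything concrete here is an assumption made for illustration: the toy matrix, the rank `k`, the similarity threshold, and the use of edge density as a stand-in for a pseudo-clique test (the abstract does not state the paper's own definition). Rank-based weights and the Top-N search itself are not reproduced.

```python
import numpy as np

def lsi_document_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row j of the result is document j expressed in the top-k latent dimensions.
    return (s[:k, None] * vt[:k, :]).T

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty documents
    unit = vectors / norms
    return unit @ unit.T

def similarity_graph(sim, threshold):
    """Boolean adjacency matrix: an edge joins pages whose similarity >= threshold."""
    adj = sim >= threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

def edge_density(adj, nodes):
    """Fraction of possible edges present among `nodes` (1.0 means a clique)."""
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        return 1.0
    sub = adj[np.ix_(nodes, nodes)]
    return sub.sum() / (n * (n - 1))  # each undirected edge is counted twice

# Hypothetical toy corpus: rows are terms, columns are pages d0..d4.
# Terms: graph, clique, search, recipe, cooking
term_doc = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

doc_vectors = lsi_document_vectors(term_doc, k=2)   # k=2 is an assumed rank
sim = cosine_similarity_matrix(doc_vectors)
adj = similarity_graph(sim, threshold=0.9)          # 0.9 is an assumed cutoff

# Pages d0 and d2 share only one raw term (raw cosine 0.2), yet they become
# close in the latent space, which is the kind of potential similarity meant here.
print(np.round(sim, 2))
print("density of {d0, d1, d2}:", edge_density(adj, [0, 1, 2]))
print("density of {d0, d1, d3}:", edge_density(adj, [0, 1, 3]))
```

A candidate vertex set could then be accepted as a pseudo-clique when its density exceeds some chosen bound; that criterion is a common simplification, not necessarily the one used in the paper.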

We also discuss an idea for making the meanings of clusters clearer. With the help of Formal Concept Analysis, our clusters, called FC-based clusters, can be given clear meanings. Our preliminary experiments suggest that the extended method is a promising approach to finding meaningful clusters.
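
To make the notion of a formal concept concrete, here is a minimal Python sketch that enumerates the formal concepts of a tiny page-term context. A formal concept is a pair (extent, intent) in which the extent is exactly the set of pages sharing every term of the intent, and the intent is exactly the set of terms common to every page of the extent. The context, the page names, and the brute-force enumeration are hypothetical and only illustrate the definition; the abstract does not describe how FC-based clusters are actually constructed.

```python
from itertools import combinations

def common_terms(context, pages):
    """Intent: the terms shared by every page in `pages`."""
    if not pages:
        # By convention, the empty page set shares all terms.
        return set().union(*context.values())
    return set.intersection(*(context[p] for p in pages))

def pages_having(context, terms):
    """Extent: the pages that contain every term in `terms`."""
    return {p for p, page_terms in context.items() if terms <= page_terms}

def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every page subset.

    This is exponential in the number of pages and suitable for toy inputs only.
    """
    concepts = set()
    pages = sorted(context)
    for size in range(len(pages) + 1):
        for subset in combinations(pages, size):
            intent = common_terms(context, set(subset))
            extent = pages_having(context, intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

# Hypothetical context: which terms occur on which page.
context = {
    "p1": {"graph", "clique"},
    "p2": {"graph", "clique", "search"},
    "p3": {"graph", "search"},
}

concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), "share", sorted(intent))
```

In this reading, the intent acts as a human-readable description of the group of pages in the extent, which is one way a cluster could be given a clear meaning.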

Keywords

  • Singular Value Decomposition
  • Semantic Similarity
  • Formal Concept
  • Maximal Clique
  • Information Retrieval System

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Makoto Haraguchi (1)
  • Yoshiaki Okubo (1)
  1. Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan