Natural Document Clustering by Clique Percolation in Random Graphs

  • Wei Gao
  • Kam-Fai Wong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4182)

Abstract

Document clustering techniques mostly depend on models that impose explicit and/or implicit a priori assumptions about the number, size, and disjointness of clusters, and/or the probability distribution of the clustered data. As a result, the clustering results tend to be unnatural and stray, to a greater or lesser degree, from the intrinsic grouping of the documents in a corpus. We propose a novel graph-theoretic technique called Clique Percolation Clustering (CPC). It models clustering as a process of enumerating adjacent maximal cliques in a random graph that unveils the inherent structure of the underlying data, and it relaxes the commonly imposed constraints in order to discover natural overlapping clusters. Experiments show that CPC can outperform some typical algorithms on benchmark data sets and sheds light on natural document clustering.
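The abstract does not give the algorithmic details of CPC, so the following is only a minimal sketch of the general clique-percolation idea it names, not the authors' implementation. It assumes that documents are nodes, that an edge joins two documents whose similarity reaches a threshold `t`, and that two maximal cliques count as adjacent when they share at least `k - 1` nodes. The threshold, the parameter `k`, and all function names are hypothetical choices made for illustration.

```python
from itertools import combinations


def threshold_graph(sim, t):
    """Build an adjacency dict from a symmetric similarity matrix (assumed input)."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if sim[i][j] >= t:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k=3):
    """Merge maximal cliques of size >= k that share at least k - 1 nodes.

    Clusters may overlap: a node can belong to several merged groups.
    Nodes that lie in no clique of size >= k are left unclustered.
    """
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Toy graph: two triangles sharing an edge, plus a third triangle
    # that touches them only at node 3; node 6 is isolated.
    adj = {
        0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3},
        3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4}, 6: set(),
    }
    print(clique_percolation(adj, k=3))  # -> [[0, 1, 2, 3], [3, 4, 5]]
```

In the toy graph, node 3 appears in both clusters, which illustrates the overlapping behaviour the abstract describes. Neither the number of clusters nor their sizes are fixed in advance; they follow from the graph structure and the chosen `k`.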

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Wei Gao (1)
  • Kam-Fai Wong (1)
  1. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
