Natural Document Clustering by Clique Percolation in Random Graphs

  • Conference paper
Information Retrieval Technology (AIRS 2006)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4182))

Abstract

Most document clustering techniques depend on models that impose explicit or implicit a priori assumptions about the number, size, and disjointness of clusters, or about the probability distribution of the clustered data. As a result, the clustering results tend to be unnatural and stray, to a greater or lesser degree, from the intrinsic grouping of the documents in a corpus. We propose a novel graph-theoretic technique called Clique Percolation Clustering (CPC). It models clustering as a process of enumerating adjacent maximal cliques in a random graph that unveils the inherent structure of the underlying data, and it relaxes the commonly imposed constraints in order to discover natural overlapping clusters. Experiments show that CPC can outperform some typical algorithms on benchmark data sets and sheds light on natural document clustering.
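The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea of grouping adjacent maximal cliques, not the authors' implementation. It assumes that documents are graph nodes, that an edge joins two documents judged similar (how edges are derived is not specified here), and that two maximal cliques count as adjacent when they share at least k-1 nodes, the usual clique percolation convention. The names `build_graph`, `maximal_cliques`, `clique_percolation`, and the parameter `k` are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already-tried nodes
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k=3):
    """Merge maximal cliques of size >= k that share at least k-1 nodes.

    Assumption: this adjacency rule is the standard clique percolation
    convention; the paper's exact rule may differ.
    """
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    clusters = {}
    for i, c in enumerate(cliques):
        clusters.setdefault(find(i), set()).update(c)
    return list(clusters.values())


# Toy similarity graph over seven documents (illustrative data only).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (3, 5), (4, 5), (5, 6)]
clusters = clique_percolation(build_graph(edges), k=3)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2, 3], [3, 4, 5]]
```

In this toy graph, document 3 belongs to both clusters, which illustrates the overlapping behaviour the abstract refers to, and document 6 is left unclustered because it sits in no clique of size 3. The number of clusters is not fixed in advance; it falls out of the graph structure and the choice of `k`.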

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gao, W., Wong, KF. (2006). Natural Document Clustering by Clique Percolation in Random Graphs. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_10

  • DOI: https://doi.org/10.1007/11880592_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45780-0

  • Online ISBN: 978-3-540-46237-8

  • eBook Packages: Computer Science, Computer Science (R0)
