Knowledge and Information Systems, Volume 39, Issue 1, pp 61–88

High-dimensional clustering: a clique-based hypergraph partitioning framework

  • Tianming Hu
  • Chuanren Liu
  • Yong Tang
  • Jing Sun
  • Hui Xiong
  • Sam Yuan Sung
Regular Paper

Abstract

Hypergraph partitioning has been considered a promising method to address the challenges of high-dimensional clustering. With objects modeled as vertices and the relationships among objects captured by hyperedges, the goal of graph partitioning is to minimize the edge cut. Therefore, the definition of hyperedges is vital to the clustering performance. While several definitions of hyperedges have been proposed, a systematic understanding of the desired characteristics of hyperedges is still missing. To that end, in this paper, we first provide a unified clique perspective on the definition of hyperedges, which serves as a guide for defining them. With this perspective, based on the concepts of shared (reverse) nearest neighbors, we propose two new types of clique hyperedges and analyze their properties with respect to purity and size. Finally, we present an extensive evaluation using real-world document datasets. The experimental results show that, with shared (reverse) nearest neighbor-based hyperedges, the clustering performance can be improved significantly in terms of various external validation measures without the need for fine-tuning of parameters.
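
The notions used in the abstract (nearest neighbors, shared and reverse nearest neighbors, hyperedges, and the edge cut of a partition) can be illustrated with a minimal sketch. It is not the construction proposed in the paper: the cosine similarity, the toy vectors, the choice of one hyperedge per object made of that object and its k nearest neighbors, and all function names are assumptions made only for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_lists(vectors, k):
    """For each object index, return the set of its k most similar other objects."""
    n = len(vectors)
    result = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(vectors[i], vectors[j]),
            reverse=True,
        )
        result.append(set(others[:k]))
    return result


def shared_nearest_neighbors(knn, i, j):
    """Objects that appear in the k-nearest-neighbor lists of both i and j."""
    return knn[i] & knn[j]


def reverse_nearest_neighbors(knn, i):
    """Objects that have i in their own k-nearest-neighbor list."""
    return {j for j, nbrs in enumerate(knn) if i in nbrs}


def hyperedge_cut(hyperedges, labels):
    """Number of hyperedges whose vertices fall into more than one part."""
    return sum(1 for e in hyperedges if len({labels[v] for v in e}) > 1)


# Hypothetical toy data: two well-separated groups of three "documents".
docs = [
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0],
]
knn = knn_lists(docs, k=2)

# Assumed hyperedge definition for this sketch: each object plus its k nearest neighbors.
hyperedges = [frozenset({i} | knn[i]) for i in range(len(docs))]

good = [0, 0, 0, 1, 1, 1]  # partition that follows the two groups
bad = [0, 1, 0, 1, 0, 1]   # partition that splits both groups

print(hyperedge_cut(hyperedges, good), hyperedge_cut(hyperedges, bad))
```

On this toy input, the partition that follows the two groups cuts no hyperedge, while the alternating partition cuts all of them, which is the sense in which a smaller edge cut corresponds to a better clustering.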

Keywords

Clique, Shared nearest neighbor, Hypergraph partitioning, High-dimensional clustering


Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Tianming Hu (1)
  • Chuanren Liu (2)
  • Yong Tang (3)
  • Jing Sun (4)
  • Hui Xiong (2)
  • Sam Yuan Sung (5)

  1. Dongguan University of Technology, Dongguan, China
  2. Department of Management Science and Information Systems, Rutgers, The State University of New Jersey, Newark, USA
  3. South China Normal University, Guangzhou, China
  4. University of Auckland, Auckland, New Zealand
  5. South Texas College, McAllen, USA
