Abstract
Cluster validity indexes are widely used to evaluate the fitness of partitions produced by clustering algorithms. This paper presents a new validity index for data clustering, called the Vapnik–Chervonenkis-bound (VB) index. It is estimated according to the structural risk minimization (SRM) principle, which optimizes the bound simultaneously over the distortion function (empirical risk) and the VC-dimension (model complexity). The cluster number at which the bound on the guaranteed risk is smallest is taken as the best description of the data structure. We use the deterministic annealing (DA) algorithm as the underlying clustering technique to produce the partitions. Five numerical examples and two real data sets illustrate the use of VB as a validity index, and its effectiveness is compared with that of several popular cluster-validity indexes. The comparative study shows that the proposed VB index estimates the cluster number well and, in addition, provides a new approach to cluster validity from the viewpoint of statistical learning theory.
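The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the VB index itself, whose exact form is given in the full paper and not reproduced here. As a stand-in, the sketch uses the classical VC confidence term for losses bounded in [0, 1], and it assumes a hypothetical VC-dimension proxy h = k(d + 1) for k clusters in d dimensions. The function names, the proxy, and the distortion values are all illustrative assumptions; in practice the distortions would come from a clustering run such as DA.

```python
import math


def vc_confidence(n, h, eta=0.05):
    """Classical VC confidence term for n samples and VC-dimension h.

    Holds with probability 1 - eta for losses bounded in [0, 1]. It is used
    here only as a stand-in for the bound derived in the paper.
    """
    if not 0 < h < n:
        raise ValueError("need 0 < h < n")
    complexity = h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)
    return math.sqrt(complexity / n)


def guaranteed_risk(distortion, n, h, eta=0.05):
    """Empirical risk (distortion) plus the complexity penalty."""
    return distortion + vc_confidence(n, h, eta)


def select_cluster_number(distortions, n, dim, eta=0.05):
    """Return the k with the smallest bound, plus all bounds.

    distortions maps a candidate cluster number k to its empirical
    distortion, assumed to be rescaled to [0, 1].
    """
    bounds = {}
    for k, emp in distortions.items():
        h = k * (dim + 1)  # ASSUMPTION: hypothetical VC-dimension proxy
        bounds[k] = guaranteed_risk(emp, n, h, eta)
    best_k = min(bounds, key=bounds.get)
    return best_k, bounds


if __name__ == "__main__":
    # Made-up distortions for illustration only: the distortion keeps
    # falling as k grows, while the penalty keeps rising.
    example = {1: 0.50, 2: 0.20, 3: 0.05, 4: 0.045, 5: 0.04}
    best_k, bounds = select_cluster_number(example, n=1000, dim=2)
    for k in sorted(bounds):
        print(k, round(bounds[k], 4))
    print("selected k:", best_k)
```

With these made-up values the distortion drops sharply up to k = 3 and only marginally afterwards, so the growing penalty places the minimum of the bound at k = 3. This is the trade-off between empirical risk and model complexity that the SRM principle formalizes.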
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Yang, X., Cao, A. & Song, Q. A New Cluster Validity for Data Clustering. Neural Process Lett 23, 325–344 (2006). https://doi.org/10.1007/s11063-006-9005-x