Abstract
The ability of a clustering algorithm to deal with overlapping clusters is a major indicator of its efficiency. However, the phenomenon of cluster overlapping is still not mathematically well characterized, especially in multivariate cases. In this paper, we are interested in the overlap phenomenon between Gaussian clusters, since the Gaussian mixture is a fundamental data distribution model suitable for many clustering algorithms. We introduce the novel concept of the ridge curve and establish a theory on the degree of overlap between two components. Based on this theory, we develop an algorithm for calculating the overlap rate. As an example, we use this algorithm to calculate the overlap rates between the classes in the IRIS data set and clear up some of the confusion as to the true number of classes in the data set. We investigate factors that affect the value of the overlap rate, and show how the theory can be used to generate “truthed data” as well as to measure the overlap rate between a given pair of clusters or components in a mixture. Finally, we show an example of application of the theory to evaluate the well known clustering algorithms.
Similar content being viewed by others
References
Aitnouri E, Dubeau V, Wang S, Ziou D (2002) Controlling mixture component overlap for clustering algorithms evaluation. J Pattern Recog Image Anal 12(4): 331–346
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New York
Bouguessa M, Wang S, Sun H (2006) An objective approach to cluster validation. Pattern Recogn Lett 27(13): 1419–1430
Chan H, Chung A, Yu A.N.S, Wells W (2003) Clustering web content for efficient replication. In: 2003 Conference on computer vision and pattern recognition (CVPR ’03), vol II
Day N (1969) Estimating the components of a mixture of two normal distributions. Biometrics 56: 463–474
Do M, Vetterliyx M (2000) Texture similarity measurement using Kullback-Leibler distance on wavelet subbands. In: 2000 international conference on image processing (ICIP00), vol 3, pp 730–733
Fraley C (1998) Algorithm for model-based Gaussian hierachical clustering. SIAM J Sci Comput 20(1): 270–281
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic-Press, New York
Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11(7): 773–781
Halgamuge S, Glesner M (1994) Neural networks in designing fuzzy systems for real world applications. Fuzzy Sets and Syst 65(1): 1–12
Hsu T-H (2000) An application of fuzzy clustering group-positioning analysis. Proc Natl Sci Counc ROC(C) 10: 157–167
Kullback S (1959) Information theory and statistics. Wiley, New York
McLachlan G, Basford K (1988) Mixture models inference and applications to clustering. Marcel Dekker, New York
Milligan G (1980) An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrica 45(3): 325–342
Nicholls K, Tudorancea C (2001) Application of fuzzy cluster analysis to Lake Simcoe crustacean zooplankton community structure. Can J Fish Aquat Sci 58(2): 231–240
Pal N, Bezdek J (1995) On cluster validity for the fuzzy C-means Model. IEEE Trans Fuzzy Syst 3(3): 370–390
Ramos V, Muge F (2000) Map segmentation by colour cube genetic k-mean clustering. In: ECDL 2000, vol 1923, Lisbon, Portugal, pp 319–323
Salvi G (2003) Accent clustering in Swedish using the Bhattacharyya distance. In: 15th (ICPhS) International congress of phonetic sciences, pp 1149–1152
Sun H, Wang S (2004) Distinguishing between overlapping components in mixture models. In: Proceedings of the 2nd IASTED international conference on neural networks and computational intelligence, Switzerland, pp 102–108
Sun H, Wang S, Jiang Q (2004) FCM-based model selection algorithm for determining the number of clusters. Pattern Recogn 37(10): 2027–2037
Tabbone S (1994) Edge detection, subpixel and junctions using multiple scales. PhD thesis, Institut National Polytechnique de Lorraine, France (in French)
Zhang H, Liu X (2003) The comparison of clustering methods in data mining. Comput Appl Soft 2: 7–8
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Charu Aggarwal.
The expression “simulation data” is used in this paper to designate a data set with known membership of data points w.r.t. each cluster.
Rights and permissions
About this article
Cite this article
Sun, H., Wang, S. Measuring the component overlapping in the Gaussian mixture model. Data Min Knowl Disc 23, 479–502 (2011). https://doi.org/10.1007/s10618-011-0212-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-011-0212-3