Measuring the component overlapping in the Gaussian mixture model

Data Mining and Knowledge Discovery

Abstract

The ability of a clustering algorithm to handle overlapping clusters is a major indicator of its efficiency. However, cluster overlap is still not well characterized mathematically, especially in the multivariate case. In this paper, we study the overlap between Gaussian clusters, since the Gaussian mixture is a fundamental data distribution model suitable for many clustering algorithms. We introduce the novel concept of the ridge curve and establish a theory of the degree of overlap between two components. Based on this theory, we develop an algorithm for calculating the overlap rate. As an example, we use this algorithm to calculate the overlap rates between the classes in the IRIS data set and clear up some of the confusion about the true number of classes in the data set. We investigate the factors that affect the value of the overlap rate, and show how the theory can be used to generate “truthed data” as well as to measure the overlap rate between a given pair of clusters or components in a mixture. Finally, we give an example application of the theory to evaluating well-known clustering algorithms.
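
The abstract does not define the overlap rate or the ridge curve, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a one-dimensional, two-component Gaussian mixture and uses a hypothetical proxy: the density at the valley between the two peaks divided by the density at the lower peak, with a value of 1 when the mixture has a single peak. The function names `gaussian_pdf`, `mixture_pdf`, and `overlap_proxy` and the grid-search approach are assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, stds):
    """Density of a univariate Gaussian mixture at x."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, stds))


def overlap_proxy(weights, means, stds, n_grid=20001):
    """Hypothetical 1-D overlap proxy (an assumption, not the paper's definition).

    Returns valley density / lower-peak density when the mixture density
    has two peaks on the grid, and 1.0 when it has a single peak.
    """
    # Evaluate the mixture density on a grid covering both components.
    lo = min(m - 6.0 * s for m, s in zip(means, stds))
    hi = max(m + 6.0 * s for m, s in zip(means, stds))
    xs = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    ps = [mixture_pdf(x, weights, means, stds) for x in xs]

    # Interior grid points that are local maxima of the density.
    peaks = [i for i in range(1, n_grid - 1)
             if ps[i] > ps[i - 1] and ps[i] >= ps[i + 1]]
    if len(peaks) < 2:
        return 1.0  # one peak: the components are not visually separable

    first, last = peaks[0], peaks[-1]
    valley = min(ps[first:last + 1])        # lowest density between the peaks
    lower_peak = min(ps[first], ps[last])   # density at the smaller peak
    return valley / lower_peak


if __name__ == "__main__":
    for gap in (1.0, 3.0, 10.0):
        rate = overlap_proxy([0.5, 0.5], [0.0, gap], [1.0, 1.0])
        print(f"mean gap {gap}: overlap proxy {rate:.6f}")
```

Under this proxy, values near 0 indicate well-separated components and values near 1 indicate components that merge into a single mode. Extending the idea to the multivariate case requires locating the valley along a path between the component means, which is the role the abstract assigns to the ridge curve.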

Author information

Corresponding author

Correspondence to Haojun Sun.

Additional information

Responsible editor: Charu Aggarwal.

The expression “simulation data” is used in this paper to designate a data set with known membership of data points w.r.t. each cluster.

About this article

Cite this article

Sun, H., Wang, S. Measuring the component overlapping in the Gaussian mixture model. Data Min Knowl Disc 23, 479–502 (2011). https://doi.org/10.1007/s10618-011-0212-3

  • DOI: https://doi.org/10.1007/s10618-011-0212-3
