Skip to main content
Log in

Gaussian mixture modeling and model-based clustering under measurement inconsistency

  • Regular Article
  • Published:
Advances in Data Analysis and Classification Aims and scope Submit manuscript

Abstract

Finite mixtures present a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering. It assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship can be broken due to the presence of observations from the same class recorded in different ways. This effect can occur because of recording inconsistencies due to the use of different scales, operator errors, or simply various recording styles. The idea presented in this paper aims to alleviate this issue through modifications incorporated into mixture models. While the proposed methodology is applicable to a broad class of mixture models, in this paper it is illustrated on Gaussian mixtures. Several simulation studies and an application to a real-life data set are considered, yielding promising results.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Similar content being viewed by others

References

  • Alimoglu F, Alpaydin E (1996) Methods of combining multiple classifiers based on different representations for pen-based handwriting recognition. In: Proceedings of the fifth Turkish artificial intelligence and artificial neural networks symposium (TAINN 96)

  • Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

    Article  MathSciNet  MATH  Google Scholar 

  • Baudry J-P, Raftery A, Celeux G, Lo K, Gottardo R (2010) Combining mixture components for clustering. J Comput Graph Stat 19:332–353

    Article  MathSciNet  Google Scholar 

  • Bunke H, Sanfeliu A (1990) Syntactic and structural pattern recognition: theory and applications, vol 7. World Scientific, Singapore

    Book  MATH  Google Scholar 

  • Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332

    Article  MathSciNet  MATH  Google Scholar 

  • Celeux G Govaert (1995) Gaussian parsimonious clustering models. Comput Stat Data Anal 2:781–93

    Google Scholar 

  • Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of the IEEE symposium on foundations of computer science, New York, pp 633–644

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood for incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38

    MATH  Google Scholar 

  • Di Zio M, Guarnera U, Rocci R (2007) A mixture of mixture models for a classification problem: the unity measure error. Comput Stat Data Anal 51(5):2573–2585

    Article  MathSciNet  MATH  Google Scholar 

  • Eden M (1961) On the formalization of handwriting. In: Structure of language and its mathematical aspect

  • Fisher P (1999) Models of uncertainty in spatial data. Geogr Inf Syst 1:191–205

    Google Scholar 

  • Fop M, Murphy TB, Hanlon L (2017) Model-based clustering of data with measurement errors. In: CLADAG, 2017

  • Gormley IC, Murphy TB (2010) A mixture of experts latent position cluster model for social network data. Stat Methodol 7:385–405

    Article  MathSciNet  MATH  Google Scholar 

  • Govindan V, Shivaprasad A (1990) Character recognition—a review. Pattern Recognit 23:671–683

    Article  Google Scholar 

  • Han J, Kamber M, Pei J (eds) (2012) Data mining: concepts and techniques, 3rd edn. Elsevier, Amsterdam

    MATH  Google Scholar 

  • Hennig C (2010) Methods for merging Gaussian mixture components. Adv Data Anal Classif 4:3–34

    Article  MathSciNet  MATH  Google Scholar 

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218

    Article  MATH  Google Scholar 

  • Ikeda K, Yamamura T, Mitamura Y, Fujiwara S, Tominaga Y, Kiyono T (1981) On-line recognition of hand-written characters utilizing positional and stroke vector sequences. Pattern Recognit 13:191–206

    Article  Google Scholar 

  • Just BH, Marc D, Munns M, Sandefer R (2016) Why patient matching is a challenge: research on master patient index (MPI) data discrepancies in key identifying fields. Perspect Health Inf Manag 13:1e

    Google Scholar 

  • Kaufman L, Rousseuw PJ (1990) Finding groups in data. Wiley, New York

    Book  Google Scholar 

  • Kumar M, Patel N (2007) Clustering data with measurement errors. Comput Stat Data Anal 51(12):6084–6101

    Article  MathSciNet  MATH  Google Scholar 

  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium. vol 1, pp 281–297

  • McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York

    Book  MATH  Google Scholar 

  • Melnykov V (2013) Finite mixture modelling in mass spectrometry analysis. J R Stat Soc Ser C 62:573–592

    Article  MathSciNet  Google Scholar 

  • Melnykov V (2016) Merging mixture components for clustering through pairwise overlap. J Comput Graph Stat 25:66–90

    Article  MathSciNet  Google Scholar 

  • Melnykov V, Chen W-C, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25

    Article  Google Scholar 

  • Pankove JI (2012) Optical processes in semiconductors. Courier Corporation, Chelmsford

    Google Scholar 

  • Pearson K (1894) Contribution to the mathematical theory of evolution. Philos Trans R Soc 185:71–110

    MATH  Google Scholar 

  • Rahm E, Do HH (2000) Data cleaning: problems and current approaches. IEEE Data Eng Bull 23(4):3–13

    Google Scholar 

  • Schlattmann P (2009) Medical applications of finite mixture models. Springer, Berlin

    MATH  Google Scholar 

  • Schwarz G (1978) Estimating the dimensions of a model. Ann Stat 6:461–464

    Article  MathSciNet  MATH  Google Scholar 

  • Sethi IK, Chatterjee B (1977) Machine recognition of constrained hand printed Devanagari. Pattern Recognit 9:69–75

    Article  Google Scholar 

  • Sneath P (1957) The application of computers to taxonomy. J Gen Microbiol 17:201–226

    Article  Google Scholar 

  • Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38:1409–1438

    Google Scholar 

  • Thomas H, Lohaus A, Brainerd C (1993) Modeling growth and individual differences in spatial tasks. Monogr Soc Res Child Devd 58:1–190

    Article  Google Scholar 

  • Tjaden B (2006) An approach for clustering gene expression data with error information. BMC Bioinform 7(1):17

    Article  Google Scholar 

  • Ullrich B, Antillòn A, Bhowmick M, Wang J, Xi H (2014) Atomic transition region at the crossover between quantum dots to molecules. Phys Scr 89(2):025801

    Article  Google Scholar 

  • Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244

    Article  MathSciNet  Google Scholar 

  • Young WC, Raftery AE, Yeung KY (2016) Model-based clustering with data correction for removing artifacts in gene expression data. Ann Appl Stat 11:1998

    Article  MathSciNet  MATH  Google Scholar 

  • Zhu X, Melnykov V (2018) Manly transformation in finite mixture modeling. Comput Stat Data Anal 121:190–208

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Volodymyr Melnykov.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 52 KB)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sarkar, S., Melnykov, V. & Zheng, R. Gaussian mixture modeling and model-based clustering under measurement inconsistency. Adv Data Anal Classif 14, 379–413 (2020). https://doi.org/10.1007/s11634-020-00393-9

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11634-020-00393-9

Keywords

Mathematics Subject Classification

Navigation