Abstract
Finite mixtures are a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering, which assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship breaks down because observations from the same class are recorded in different ways. Such inconsistencies can arise from the use of different scales, operator errors, or simply different recording styles. The idea presented in this paper aims to alleviate this issue through modifications incorporated into mixture models. While the proposed methodology is applicable to a broad class of mixture models, in this paper it is illustrated on Gaussian mixtures. Several simulation studies and an application to a real-life data set are considered, yielding promising results.
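To make the one-to-one assumption concrete, the following minimal sketch assigns an observation to the mixture component with the largest posterior probability and shows how a change of recording scale can move the same quantity into a different component. It is an illustration only and not the methodology proposed in the paper: the two-component univariate Gaussian mixture, its parameter values, the kilogram/pound scenario, and all function names are assumptions made for this example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, weights, means, sds):
    """Posterior probability of each mixture component given observation x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

def map_component(x, weights, means, sds):
    """Index of the component with the largest posterior probability."""
    post = posteriors(x, weights, means, sds)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-component mixture on a kilogram scale (all values assumed).
weights = [0.5, 0.5]
means = [60.0, 90.0]
sds = [5.0, 5.0]

LB_PER_KG = 2.20462  # pounds per kilogram

x_kg = 62.0               # a measurement recorded in kilograms
x_lb = 62.0 * LB_PER_KG   # the same quantity recorded in pounds

# The same quantity is assigned to different components depending on
# the recording scale, so components no longer match classes.
label_kg = map_component(x_kg, weights, means, sds)
label_lb = map_component(x_lb, weights, means, sds)

# Undoing the (here known) scale change restores the original assignment.
label_lb_rescaled = map_component(x_lb / LB_PER_KG, weights, means, sds)

print(label_kg, label_lb, label_lb_rescaled)
```

In this toy setting the scale factor is known, so the inconsistency can be undone by hand. In practice it is usually not known which observations were recorded differently, which is the situation the abstract describes.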
Cite this article
Sarkar, S., Melnykov, V. & Zheng, R. Gaussian mixture modeling and model-based clustering under measurement inconsistency. Adv Data Anal Classif 14, 379–413 (2020). https://doi.org/10.1007/s11634-020-00393-9
DOI: https://doi.org/10.1007/s11634-020-00393-9