Abstract
Model-based clustering tackles the task of uncovering heterogeneity in a data set to extract valuable insights. Given the common presence of outliers in practice, robust methods for model-based clustering have been proposed. However, many methods in this area are severely limited in applications where partially observed records are common, since their existing frameworks often assume complete data. Here, a mixture of multiple scaled contaminated normal (MSCN) distributions is extended using the expectation-conditional maximization (ECM) algorithm to accommodate data sets with values missing at random. The newly proposed extension preserves the mixture’s capability of yielding robust parameter estimates and performing automatic outlier detection separately for each principal component. In this fitting framework, the MSCN marginal density is approximated using the inversion formula for the characteristic function. Extensive simulation studies involving incomplete data sets with outliers are conducted to evaluate parameter estimates and to compare the clustering and outlier-detection performance of our model to those of other mixtures.
Data Availability
The data that support the findings of this study are available from the corresponding author upon request.
Code Availability
The code can be found on GitHub at https://github.com/cristinatortora/MSCN_missing.
References
Aitken, A. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh, 45(1), 14–22.
Aitkin, M., & Wilson, G. T. (1980). Mixture models, outliers, and the EM algorithm. Technometrics, 22(3), 325–331.
Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In: E. Parzen, K. Tanabe, & G. Kitagawa (Eds.), Selected Papers of Hirotugu Akaike (pp. 199–213). Springer New York, New York, NY
Akogul, S., & Erisoglu, M. (2016). A comparison of information criteria in clustering based on mixture of multivariate normal distributions. Mathematical and Computational Applications, 21(3), 34.
Andrews, J. L., & McNicholas, P. D. (2012). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing, 22(5), 1021–1029.
Bagnato, L., & Punzo, A. (2021). Unconstrained representation of orthogonal matrices with application to common principal components. Computational Statistics, 36(2), 1177–1195.
Bagnato, L., Punzo, A., & Zoia, M. G. (2017). The multivariate leptokurtic-normal distribution and its application in model-based clustering. Canadian Journal of Statistics, 45(1), 95–119.
Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3), 803.
Berntsen, J., Espelid, T. O., & Genz, A. (1991). An adaptive algorithm for the approximate calculation of multiple integrals. ACM Transactions on Mathematical Software, 17(4), 437–451.
Biernacki, C., & Govaert, G. (1997). Using the classification likelihood to choose the number of clusters. Computing Science and Statistics, (pp. 451–457)
Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719–725.
Bozdogan, H. (1993). Choosing the number of component clusters in the mixture-model using a new informational complexity criterion of the inverse-Fisher information matrix. In: O. Opitz, B. Lausen, & R. Klar (Eds.), Information and Classification (pp. 40–54). Berlin, Heidelberg. Springer Berlin Heidelberg
Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3), 345–370.
Browne, R. P., & McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2), 176–198.
Broyden, C. (1970). The convergence of a class of double-rank minimization algorithms. Journal of the Institute of Mathematics and its Applications, 6(2), 76–90.
Buck, S. F. (1960). A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society. Series B (Methodological), 22(2), 302–306.
van Buuren, S. (2021). Flexible imputation of missing data. Chapman & Hall/CRC Interdisciplinary Statistics Series. Chapman & Hall/CRC, Boca Raton, 2nd ed.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
Cavanaugh, J. E. (1999). A large-sample model selection criterion based on Kullback’s symmetric divergence. Statistics & Probability Letters, 42(4), 333–343.
Celeux, G., & Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.
Coretto, P., & Hennig, C. (2016). Robust improper maximum likelihood: Tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, 111(516), 1648–1659.
Cuesta-Albertos, J., Matrán, C., & Mayo-Iscar, A. (2008). Robust estimation in the normal mixture model based on robust clustering. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(4), 779–802.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.
Dooren, P. V., & Ridder, L. D. (1976). An adaptive algorithm for numerical integration over an n-dimensional cube. Journal of Computational and Applied Mathematics, 2(3), 207–217.
Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322.
Forbes, F., & Wraith, D. (2014). A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: Application to robust clustering. Statistics and Computing, 24(6), 971–984.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611–631.
Franczak, B. C., Browne, R. P., & McNicholas, P. D. (2014). Mixtures of shifted asymmetric Laplace distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6), 1149–1157.
Franczak, B. C., Tortora, C., Browne, R. P., & McNicholas, P. D. (2015). Unsupervised learning via mixtures of skewed distributions with hypercube contours. Pattern Recognition Letters, 58, 69–76.
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer Series in Statistics. Springer New York
Gallegos, M. T., & Ritter, G. (2005). A robust method for cluster analysis. The Annals of Statistics, 33(1), 347–380.
Gallegos, M. T., & Ritter, G. (2009). Trimmed ML estimation of contaminated mixtures. Sankhyā: The Indian Journal of Statistics, Series A (2008-), 71(2), 164–220.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., & Hothorn, T. (2021). mvtnorm: Multivariate normal and t distributions. R package version 1.1-3.
Ghahramani, Z., & Jordan, M. I. (1994). Learning from incomplete data. Technical report, Defense Technical Information Center, Fort Belvoir, VA
Goldfarb, D. (1970). A family of variable metric methods derived by variational means. Mathematics of Computation, 24(109), 23–26.
Goren, E. M., & Maitra, R. (2022). Fast model-based clustering of partial records. Stat, 11(1), e416.
Greco, L., & Agostinelli, C. (2020). Weighted likelihood mixture modeling and model-based clustering. Statistics and Computing, 30(2), 255–277.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2), 297–307.
Johnson, R. A., & Wichern, D. W. (2007). Applied multivariate statistical analysis. Pearson Prentice Hall, Upper Saddle River, NJ, 6th ed.
Karlis, D., & Santourian, A. (2009). Model-based clustering with non-elliptically contoured distributions. Statistics and Computing, 19(1), 73–83.
Karlis, D., & Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis, 41, 577–590.
Kaufman, L., & Rousseeuw, P. J. (Eds.). (1990). Finding groups in data: An introduction to cluster analysis. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, NJ, USA.
Lin, T. I. (2009). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis, 100(2), 257–265.
Lin, T.-I. (2014). Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Computational Statistics & Data Analysis, 71, 183–195.
Little, R. J. A., & Rubin, D. B. (2020). Statistical analysis with missing data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 3rd ed.
Liu, C., & Rubin, D. B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika, 81(4), 633–648.
Maitra, R., & Melnykov, V. (2010). Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2), 354–376.
McLachlan, G. J., & Krishnan, T. (2008). The EM algorithm and extensions. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, N.J.
McLachlan, G., & Peel, D. (2000). Finite mixture models. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, NJ, USA.
McNicholas, P. D. (2016). Model-based clustering. Journal of Classification, 33(3), 331–373.
McNicholas, P., Murphy, T., McDaid, A., & Frost, D. (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Second Special Issue on Statistical Algorithms and Software, 54(3), 711–723.
Melnykov, V., Chen, W.-C., & Maitra, R. (2012). MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51(12).
Melnykov, V. (2013). Challenges in model-based clustering. Wiley Interdisciplinary Reviews: Computational Statistics, 5(2), 135–148.
Melnykov, V., & Maitra, R. (2010). Finite mixture models and model-based clustering. Statistics Surveys, 4, 80–116.
Meng, X.-L., & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2), 267–278.
Michael, S., & Melnykov, V. (2016). An effective strategy for initializing the EM algorithm in finite mixture models. Advances in Data Analysis and Classification, 10, 563–583.
Morris, K., Punzo, A., Blostein, M., & McNicholas, P. D. (2019). Asymmetric clusters and outliers: Mixtures of multivariate contaminated shifted asymmetric laplace distributions. Computational Statistics and Data Analysis, 132, 145–166.
Narasimhan, B., Johnson, S. G., Hahn, T., Bouvier, A., & Kiêu, K. (2022). cubature: Adaptive multivariate integration over hypercubes.
Novi Inverardi, P. L., & Taufer, E. (2020). Outlier detection through mixtures with an improper component. Electronic Journal of Applied Statistical Analysis, 13(1), 146–163.
Peel, D., & McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10(4), 339–348.
Punzo, A., Mazza, A., & McNicholas, P. D. (2018). ContaminatedMixt: An R package for fitting parsimonious mixtures of multivariate contaminated normal distributions. Journal of Statistical Software, 85(10), 1–25.
Punzo, A., & McNicholas, P. D. (2016). Parsimonious mixtures of multivariate contaminated normal distributions. Biometrical Journal, 58(6), 1506–1537.
Punzo, A., & Tortora, C. (2021). Multiple scaled contaminated normal distribution and its application in clustering. Statistical Modelling, 21(4), 332–358.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Ritter, G. (2014). Robust cluster analysis and variable selection. Chapman and Hall/CRC, 1st ed.
Rubin, D. B. (Ed.). (1987). Multiple imputation for nonresponse in surveys. Wiley Series in Probability and Statistics. John Wiley & Sons Inc., Hoboken, NJ, USA
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Sachs, J. D., Layard, R., Helliwell, J. F., et al. (2018). World happiness report 2018. Technical report.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2).
Seghouane, A., & Bekara, M. (2004). A small sample model selection criterion based on Kullback’s symmetric divergence. IEEE Transactions on Signal Processing, 52(12), 3314–3323.
Serafini, A., Murphy, T. B., & Scrucca, L. (2020). Handling missing data in model-based clustering. arXiv preprint arXiv:2006.02954
Shanno, D. (1970). Conditioning of quasi-newton methods for function minimization. Mathematics of Computation, 24(111), 647–656.
Shireman, E., Steinley, D., & Brusco, M. J. (2017). Examining the effect of initialization strategies on the performance of Gaussian mixture modeling. Behavior Research Methods, 49(1), 282–293.
Soetaert, K. (2009). rootSolve: Nonlinear root finding, equilibrium and steady-state analysis of ordinary differential equations. R package 1.6.
Soetaert, K., & Herman, P. M. (2009). A practical guide to ecological modelling. Using R as a Simulation Platform. Springer. ISBN 978-1-4020-8623-6
Steinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3), 386–396.
Sugasawa, S., & Kobayashi, G. (2022). Robust fitting of mixture models using weighted complete estimating equations. Computational Statistics & Data Analysis, 174, 107526.
Tong, H., & Tortora, C. (2022). MixtureMissing: Robust model-based clustering for data sets with missing values at random. R package version 1.0.2.
Tong, H., & Tortora, C. (2022). Model-based clustering and outlier detection with missing data. Advances in Data Analysis and Classification, 16(1), 5–30.
Tortora, C., Punzo, A., & Tran, L. (2023). MSclust: Multiple-scaled clustering. R package version 1.0.3.
Tortora, C., Franczak, B. C., Browne, R. P., & McNicholas, P. D. (2019). A mixture of coalesced generalized hyperbolic distributions. Journal of Classification, 36(1), 26–57.
Tran, L., & Tortora, C. (2021). How many clusters are best? Investigating model selection in robust clustering. In JSM Proceedings, Statistical Learning and Data Science Section (pp. 1159–1180). Alexandria, VA: American Statistical Association.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In: I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and statistics: Essays in Honor of Harold Hotelling (pp. 448–485). Stanford University Press, Stanford, CA
Wang, W.-L., & Lin, T.-I. (2015). Robust model-based clustering via mixtures of skew-t distributions with missing information. Advances in Data Analysis and Classification, 9(4), 423–445.
Wang, H., Zhang, Q., Luo, B., & Wei, S. (2004). Robust mixture modelling using multivariate t-distribution with missing information. Pattern Recognition Letters, 25(6), 701–710.
Wei, Y., Tang, Y., & McNicholas, P. D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics & Data Analysis, 130, 18–41.
Wilks, S. S. (1932). Moments and distributions of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics, 3, 163–195.
Wolfe, J. H. (1965). A computer program for the maximum likelihood analysis of types. USNPRA Technical Bulletin 65-15, U.S. Naval Personnel Research Activity, San Diego, USA.
You, J., Li, Z., & Du, J. (2023). A new iterative initialization of EM algorithm for Gaussian mixture models. PLoS ONE, 18(4), e0284114.
Funding
This material is based upon work supported by the National Science Foundation under Grant No. 2209974
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Ethical Approval
The authors agree to follow Springer ethical conduct.
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendices
Appendix A: Characteristic Functions and the Inversion Formula
In statistics, characteristic functions provide a powerful tool for deriving probability density functions by means of Fourier transformations. One major advantage of this approach is that a characteristic function always exists and uniquely determines its probability distribution.
Definition A.1
Let \(\varvec{X}= \left( X_1, \dots , X_p \right) ^\top \in \mathbb {R}^p\) be a p-variate random vector, \(\varvec{t}= \left( t_1, \dots , t_p \right) ^\top \in \mathbb {R}^p\), and i be an imaginary unit. The function
$$\phi _{\varvec{X}} (\varvec{t}) = \mathbb {E} \left[ \exp \left( i \varvec{t}^\top \varvec{X}\right) \right] $$
is called the characteristic function of \(\varvec{X}\).
From a characteristic function, the associated probability density function can be obtained using the inversion formula.
Theorem A.1
(Inversion Formula) Let \(\varvec{X}= \left( X_1, \dots , X_p \right) ^\top \in \mathbb {R}^p\) be a p-variate random vector, \(\phi _{\varvec{X}} (\varvec{t})\) be the characteristic function of \(\varvec{X}\) with \(\varvec{t}= \left( t_1, \dots , t_p \right) ^\top \in \mathbb {R}^p\), and i be an imaginary unit. The probability density function of \(\varvec{X}\) can be obtained by
$$f_{\varvec{X}} (\varvec{x}) = \frac{1}{(2 \pi )^p} \int _{\mathbb {R}^p} \exp \left( - i \varvec{t}^\top \varvec{x}\right) \phi _{\varvec{X}} (\varvec{t}) \, d \varvec{t}.$$
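The inversion formula can be verified numerically in the univariate case. The sketch below (in Python, though the authors' own implementation is in R) recovers the standard normal density at a point by discretizing the inversion integral of \(\phi (t) = \exp (-t^2/2)\) with the trapezoidal rule; since the characteristic function decays rapidly, truncating the integral to a finite interval introduces negligible error.

```python
import numpy as np

def phi_std_normal(t):
    """Characteristic function of N(0, 1): phi(t) = exp(-t^2 / 2)."""
    return np.exp(-0.5 * t**2)

def density_via_inversion(x, t_max=30.0, n=20001):
    """Approximate f(x) = (1 / 2pi) * int exp(-i t x) phi(t) dt on [-t_max, t_max]."""
    t = np.linspace(-t_max, t_max, n)
    integrand = np.exp(-1j * t * x) * phi_std_normal(t)
    dt = t[1] - t[0]
    # Trapezoidal rule; the imaginary part vanishes up to round-off.
    val = (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])) * dt
    return val.real / (2 * np.pi)

x = 1.0
exact = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # closed-form N(0, 1) density
print(abs(density_via_inversion(x) - exact) < 1e-6)
```

The same discretization idea underlies the paper's approximation of the MSCN marginal density, where no closed form is available.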
To obtain the marginals of the MSCN distribution, the propositions describing the characteristic functions of the MN and MCN distributions are needed. The marginals of the MCN and MSCN distributions are outlined in the methodology under Sect. 3.
Proposition A.1
The characteristic function of a p-variate random vector \(\varvec{X}\!=\! \left( X_1, \dots , X_p \right) ^\top \) \(\in \mathbb {R}^p\) that follows a multivariate normal distribution with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\) is
$$\phi _{\varvec{X}} (\varvec{t}) = \exp \left( i \varvec{t}^\top \varvec{\mu }- \frac{1}{2} \varvec{t}^\top \varvec{\Sigma }\varvec{t}\right) ,$$
where \(\varvec{t}= \left( t_1, \dots , t_p \right) ^\top \in \mathbb {R}^p\) and i is an imaginary unit.
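As a quick sanity check of Proposition A.1 (a Python sketch outside the paper; the parameter values below are arbitrary), the closed-form MN characteristic function can be compared against a Monte Carlo estimate of \(\mathbb {E} [ \exp ( i \varvec{t}^\top \varvec{X}) ]\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                    # arbitrary mean vector
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # arbitrary covariance matrix
t = np.array([0.3, -0.2])                     # evaluation point

# Closed form from Proposition A.1: phi(t) = exp(i t'mu - 0.5 t' Sigma t).
phi_exact = np.exp(1j * t @ mu - 0.5 * t @ Sigma @ t)

# Monte Carlo estimate: average exp(i t'X) over draws X ~ N(mu, Sigma).
X = rng.multivariate_normal(mu, Sigma, size=200_000)
phi_mc = np.exp(1j * (X @ t)).mean()

print(abs(phi_exact - phi_mc) < 0.02)
```

The agreement improves at the usual \(O(n^{-1/2})\) Monte Carlo rate, since \(|\exp (i \varvec{t}^\top \varvec{X})| = 1\) bounds the variance of each draw.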
Appendix B: Proofs
Proposition 3.1
Proof
For data generation purposes, the MCN random variable \(\varvec{X}\) can be represented as
$$\varvec{X}= \varvec{\mu }+ \eta ^{(1 - V)/2} \left( \varvec{Y}- \varvec{\mu }\right) ,$$
where V follows a Bernoulli distribution such that \(V = 1\) with probability \(\alpha \in (0.5, 1)\) and \(V = 0\) with probability \(1 - \alpha \); \(\eta > 1\) is the degree of contamination; and \(\varvec{Y}\) follows an MN distribution with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\). By Definition A.1 and the law of total expectation, we can establish the following:
$$\phi _{\varvec{X}} (\varvec{t}) = \mathbb {E} \left[ \mathbb {E} \left[ \exp \left( i \varvec{t}^\top \varvec{X}\right) { \; \mid \; }V \right] \right] = \alpha \exp \left( i \varvec{t}^\top \varvec{\mu }- \frac{1}{2} \varvec{t}^\top \varvec{\Sigma }\varvec{t}\right) + (1 - \alpha ) \exp \left( i \varvec{t}^\top \varvec{\mu }- \frac{1}{2} \eta \, \varvec{t}^\top \varvec{\Sigma }\varvec{t}\right) .$$
\(\square \)
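The two-component structure used in this proof — a Bernoulli indicator V selecting between a good normal draw and one with inflated variance — is easy to check by simulation. The univariate Python sketch below (illustrative only; all parameter values are arbitrary) generates contaminated normal data and compares its empirical variance to the mixture variance \(\alpha \sigma ^2 + (1 - \alpha ) \eta \sigma ^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, eta = 0.9, 9.0    # proportion of good points; variance inflation of bad points
mu, sigma2 = 0.0, 1.0

n = 500_000
V = rng.random(n) < alpha                   # V = 1: good point, V = 0: outlier
scale = np.where(V, sigma2, eta * sigma2)   # outliers get eta-inflated variance
X = mu + rng.standard_normal(n) * np.sqrt(scale)

# Mixture variance: alpha * sigma2 + (1 - alpha) * eta * sigma2 = 1.8 here.
var_theory = alpha * sigma2 + (1 - alpha) * eta * sigma2
print(abs(X.var() - var_theory) < 0.05)
```

The requirement \(\alpha > 0.5\) in the proposition ensures the good component dominates, which is what makes the V indicators usable for automatic outlier detection.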
Proposition 3.2
Proof
From Definition A.1 and the fact that \(\tilde{\varvec{Y}}\) contains p independent univariate contaminated normal random variables, the characteristic function of the marginal variable \(\varvec{X}_1\) is given by
where, from Proposition 3.1, for \(h = 1, \dots , p\), we know that
\(\square \)
Proposition 3.6
Proof
From Definition A.1 and the fact that we are dealing with linear combinations of independent random variables, the characteristic function of \(\varvec{X}_1 { \; \mid \; }V_r = v_r, r \in \mathcal {A}\) is
Herein, for \(r \in \mathcal {A}\), \(\tilde{Y}_r\) follows a univariate normal distribution with mean 0 and variance 1 if \(v_r = 1\) or variance \(\eta _r\) if \(v_r = 0\). Thus,
On the other hand, for \(s \in \mathcal {B}\), \(\tilde{Y}_s\) follows a univariate contaminated normal distribution with mean 0, variance 1, proportion of good observations \(\alpha _s\), and degree of contamination \(\eta _s\). As a result,
\(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tong, H., Tortora, C. Missing Values and Directional Outlier Detection in Model-Based Clustering. J Classif (2023). https://doi.org/10.1007/s00357-023-09450-2
DOI: https://doi.org/10.1007/s00357-023-09450-2