Abstract
Mixtures of t factor analyzers (MtFA) are powerful and widely used tools for robust clustering of high-dimensional data in the presence of outliers. However, missing values can make fitting the MtFA model analytically intractable and computationally demanding. We explicitly derive the score vector and Hessian matrix of the MtFA model with incomplete data, which are used to approximate the information matrix. On this basis, some asymptotic properties can be established under certain regularity conditions. Three expectation-maximization-based algorithms are developed for maximum likelihood estimation of the MtFA model when values may be missing at random. Practical issues related to the recovery of missing values and the clustering of partially observed samples are also investigated. The utility of our methodology is illustrated through the analysis of simulated and real data sets.
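To make the clustering and missing-value recovery steps concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the MtFA parameters are already known (for example, from a fitted model) and that each component has scale matrix B Bᵀ + D with diagonal D. It relies on two standard facts: the observed sub-vector of a multivariate t vector is again multivariate t with the same degrees of freedom, and the conditional mean of the missing part given the observed part is linear in the observed part. The function names `t_logpdf` and `cluster_and_impute` and all argument names are hypothetical.

```python
import math
import numpy as np

def t_logpdf(y, mu, sigma, nu):
    """Log-density of a multivariate t with location mu, scale matrix sigma, dof nu."""
    p = y.shape[0]
    chol = np.linalg.cholesky(sigma)
    z = np.linalg.solve(chol, y - mu)
    maha = float(z @ z)                      # squared Mahalanobis distance
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return (math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * p * math.log(nu * math.pi) - 0.5 * logdet
            - 0.5 * (nu + p) * math.log1p(maha / nu))

def cluster_and_impute(y, weights, mus, loadings, uniquenesses, nus):
    """Posterior component probabilities and conditional-mean imputation.

    y is a 1-D array in which np.nan marks missing entries; at least one
    entry must be observed. Parameters are assumed known (hypothetical inputs).
    """
    obs = ~np.isnan(y)
    mis = ~obs
    log_post = np.empty(len(weights))
    cond_means = []
    for i, (w, mu, B, d, nu) in enumerate(
            zip(weights, mus, loadings, uniquenesses, nus)):
        sigma = B @ B.T + np.diag(d)         # factor-analytic scale matrix
        s_oo = sigma[np.ix_(obs, obs)]
        s_mo = sigma[np.ix_(mis, obs)]
        # Marginal of the observed part is t with the same dof.
        log_post[i] = math.log(w) + t_logpdf(y[obs], mu[obs], s_oo, nu)
        # Component-wise conditional mean of the missing part.
        cond_means.append(
            mu[mis] + s_mo @ np.linalg.solve(s_oo, y[obs] - mu[obs]))
    log_post -= log_post.max()               # stabilise before exponentiating
    post = np.exp(log_post)
    post /= post.sum()
    y_filled = y.copy()
    y_filled[mis] = sum(r * m for r, m in zip(post, cond_means))
    return post, y_filled

# Toy usage with two well-separated components (illustrative values only).
B = 0.5 * np.ones((3, 1))
post, y_filled = cluster_and_impute(
    np.array([0.2, np.nan, -0.1]),
    weights=[0.5, 0.5],
    mus=[np.zeros(3), 10.0 * np.ones(3)],
    loadings=[B, B],
    uniquenesses=[np.ones(3), np.ones(3)],
    nus=[5.0, 5.0],
)
```

The partially observed sample is assigned to the component with the largest entry of `post`, and the missing coordinate is filled with the posterior-weighted average of the component-wise conditional means. Working with the observed sub-vector only avoids any need to impute before clustering.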
Acknowledgements
The authors gratefully acknowledge the Coordinating Editor, Maurizio Vichi, the Associate Editor and two anonymous referees for their comments and suggestions, which greatly improved this paper. We are also grateful to Mr. Meng-Chih Liu for providing some initial input. W.L. Wang and T.I. Lin would like to acknowledge the support of the Ministry of Science and Technology of Taiwan under Grant Nos. MOST 107-2628-M-035-001-MY3 and MOST 109-2118-M-005-005-MY3.
Cite this article
Wang, WL., Lin, TI. Robust clustering via mixtures of t factor analyzers with incomplete data. Adv Data Anal Classif 16, 659–690 (2022). https://doi.org/10.1007/s11634-021-00453-8
DOI: https://doi.org/10.1007/s11634-021-00453-8
Keywords
- Data reduction
- Factor analyzer
- Information matrix
- Mixture models
- Multivariate t distribution
- Missing data