Eigenvalues and constraints in mixture modeling: geometric and computational issues

Abstract

This paper reviews the use of eigenvalue restrictions for constrained parameter estimation in mixtures of elliptical distributions under the likelihood approach. The restrictions serve a twofold purpose: they prevent convergence to degenerate solutions and they reduce the onset of uninteresting (spurious) local maximizers arising from complex likelihood surfaces. The paper shows how the constraints may play a key role in the theory of Euclidean data clustering. The aim here is to provide a reasoned survey of the constraints and their applications, considering the contributions of many authors and spanning the literature of the last 30 years.
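As a minimal sketch of the kind of eigenvalue-ratio restriction the paper surveys, the following hypothetical function truncates the eigenvalues of a set of covariance matrices into an interval \([m, c\,m]\). The truncation level m used here is an illustrative choice, not the optimal one derived in the constrained-estimation literature reviewed below.

```python
import numpy as np

def truncate_eigenvalues(covariances, c=10.0):
    """Force the eigenvalues of a set of covariance matrices into an
    interval [m, c*m], so that the ratio between the largest and the
    smallest eigenvalue across all components is at most c.

    The truncation level m is an illustrative choice (the geometric
    midpoint of the observed eigenvalue range); the optimal m solves a
    one-dimensional minimization in the cited literature.
    """
    spectra = [np.linalg.eigh(S) for S in covariances]
    all_w = np.concatenate([w for w, _ in spectra])
    m = np.sqrt(all_w.min() * all_w.max())   # illustrative level only
    constrained = []
    for w, v in spectra:
        w_trunc = np.clip(w, m, c * m)       # eigenvalues forced into [m, c*m]
        constrained.append(v @ np.diag(w_trunc) @ v.T)
    return constrained
```

With c = 1 the truncation forces spherical, equal-scatter components; larger values of c progressively relax the constraint toward the unconstrained model.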

References

  1. Andrews JL, McNicholas PD (2011) Extending mixtures of multivariate \(t\)-factor analyzers. Stat Comput 21:361–373

  2. Andrews JL, McNicholas PD, Subedi S (2011) Model-based classification via mixtures of multivariate \(t\)-distributions. Comput Stat Data Anal 55:520–529

  3. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

  4. Baudry J-P, Celeux G (2015) EM for mixtures: initialization requires special care. Stat Comput 25(4):713–726

  5. Biernacki C (2004) Initializing EM using the properties of its trajectories in Gaussian mixtures. Stat Comput 14:267–279

  6. Biernacki C, Chrétien S (2003) Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with the EM. Stat Probab Lett 61:373–382

  7. Biernacki C, Celeux G, Govaert G (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal 41(3–4):561–575

  8. Boyles RA (1983) On the convergence of the EM algorithm. J R Stat Soc B 45:47–50

  9. Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28:781–793

  10. Cerioli A, García-Escudero LA, Mayo-Iscar A, Riani M (2016) Finding the number of groups in model-based clustering via constrained likelihoods. http://uvadoc.uva.es/handle/10324/18093

  11. Chanda KC (1954) A note on the consistency and maxima of the roots of likelihood equations. Biometrika 41:56–61

  12. Ciuperca G, Ridolfi A, Idier J (2003) Penalized maximum likelihood estimator for normal mixtures. Scand J Stat 30:45–59

  13. Cramér H (1946) Mathematical methods of statistics. Princeton University Press, Princeton

  14. Day N (1969) Estimating the components of a mixture of normal distributions. Biometrika 56(3):463–474

  15. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38

  16. Dennis JE (1981) Algorithms for nonlinear fitting. In: Proceedings of the NATO advanced research symposium, Cambridge, England. Cambridge University

  17. Dykstra RL (1983) An algorithm for restricted least squares regression. J Am Stat Assoc 78(384):837–842

  18. Fang K, Anderson T (1990) Statistical inference in elliptically contoured and related distributions. Allerton Press, New York

  19. Fraley C, Raftery AE (2006) Mclust version 3: an R package for normal mixture modeling and model-based clustering. Technical report, DTIC Document

  20. Fraley C, Raftery A (2007) Bayesian regularization for normal mixture estimation and model-based clustering. J Classif 24(2):155–181

  21. Fraley C, Raftery A, Murphy T, Scrucca L (2012) mclust version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. University of Washington, Seattle

  22. Fritz H, García-Escudero LA, Mayo-Iscar A (2012) tclust: an R package for a trimming approach to cluster analysis. J Stat Softw 47(12):1–26

  23. Fritz H, García-Escudero LA, Mayo-Iscar A (2013) A fast algorithm for robust constrained clustering. Comput Stat Data Anal 61:124–136

  24. Gallegos MT (2002) Maximum likelihood clustering with outliers. In: Jajuga K, Sokolowski A, Bock H-H (eds) Classification, clustering, and data analysis: recent advances and applications. Springer, Berlin, pp 247–255

  25. Gallegos M, Ritter G (2009) Trimmed ML estimation of contaminated mixtures. Sankhya (Ser A) 71:164–220

  26. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. Ann Stat 36(3):1324–1345

  27. García-Escudero LA, Gordaliza A, Mayo-Iscar A (2014) A constrained robust proposal for mixture modeling avoiding spurious solutions. Adv Data Anal Classif 8(1):27–43

  28. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2015) Avoiding spurious local maximizers in mixture modelling. Stat Comput 25:619–633

  29. García-Escudero LA, Gordaliza A, Greselin F, Ingrassia S, Mayo-Iscar A (2016) The joint role of trimming and constraints in robust estimation for mixtures of Gaussian factor analyzers. Comput Stat Data Anal 99:131–147

  30. García-Escudero LA, Gordaliza A, Greselin F, Ingrassia S, Mayo-Iscar A (2017) Eigenvalues in robust approaches to mixture modeling: a review. Technical report, in preparation

  31. Ghahramani Z, Hinton G (1997) The EM algorithm for factor analyzers. Technical report CRG-TR-96-1, University of Toronto

  32. Greselin F, Ingrassia S (2010) Constrained monotone em algorithms for mixtures of multivariate \(t\)-distributions. Stat Comput 20:9–22

  33. Greselin F, Ingrassia S (2015) Maximum likelihood estimation in constrained parameter spaces for mixtures of factor analyzers. Stat Comput 25(2):215–226

  34. Greselin F, Ingrassia S, Punzo A (2011) Assessing the pattern of covariance matrices via an augmentation multiple testing procedure. Stat Methods Appl 20:141–170

  35. Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13(2):795–800

  36. Hathaway RJ (1986) A constrained EM algorithm for univariate normal mixtures. J Stat Comput Simul 23:211–230

  37. Hennig C (2004) Breakdown points for maximum likelihood estimators of location-scale mixtures. Ann Stat 32:1313–1340

  38. Ingrassia S (1992) A comparison between the simulated annealing and the EM algorithms in normal mixture decompositions. Stat Comput 2:203–211

  39. Ingrassia S (2004) A likelihood-based constrained algorithm for multivariate normal mixture models. Stat Methods Appl 13(4):151–166

  40. Ingrassia S, Rocci R (2007) Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Comput Stat Data Anal 51(11):5339–5351

  41. Ingrassia S, Rocci R (2011) Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Comput Stat Data Anal 55(4):1715–1725

  42. Kiefer J, Wolfowitz J (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann Math Stat 27(4):887–906

  43. Kiefer NM (1978) Discrete parameter variation: efficient estimation of a switching regression model. Econometrica 46(2):427–434

  44. Lindsay BG (1995) Mixture models: theory, geometry and applications. NSF-CBMS regional conference series in probability and statistics, vol 5. Institute of Mathematical Statistics, Hayward, CA

  45. McLachlan G, Krishnan T (2008a) The EM algorithm and extensions, 2nd edn, vol 589. Wiley, New York

  46. McLachlan GJ, Krishnan T (2008b) The EM algorithm and its extensions, 2nd edn. Wiley, New York

  47. McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York

  48. McNicholas PD, Murphy TB (2008) Parsimonious Gaussian mixture models. Stat Comput 18:285–296

  49. Meng X-L (1994) On the rate of convergence of the ECM algorithm. Ann Stat 22(1):326–339

  50. Meng X-L, van Dyk D (1997) The EM algorithm: an old folk song sung to a fast new tune. J R Stat Soc B 59(3):511–567

  51. Nettleton D (1999) Convergence properties of the EM algorithm in constrained spaces. Canad J Stat 27(3):639–644

  52. O’Hagan A, White A (2016) Improved model-based clustering performance using Bayes initialization averaging. Technical report, arXiv:1504.06870v4

  53. O’Hagan A, Murphy TB, Gormley C (2013) Computational aspects of fitting mixture models via the expectation-maximisation algorithm. Comput Stat Data Anal 56(12):3843–3864

  54. Puntanen S, Styan GP, Isotalo J (2011) Matrix tricks for linear statistical models. Springer, Berlin

  55. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26(2):195–239

  56. Ritter G (2014) Cluster analysis and variable selection. CRC Press, Boca Raton

  57. Rocci R, Gattone SA, Di Mari R (2016) A data driven equivariant approach to constrained Gaussian mixture modeling. Adv Data Anal Classif. doi:10.1007/s11634-016-0279-1

  58. Rousseeuw PJ, Leroy AM (2005) Robust regression and outlier detection. Wiley, New York

  59. Rudin W (1976) Principles of mathematical analysis, 3rd edn. McGraw-Hill, Tokyo

  60. Scrucca L, Fop M, Murphy TB, Raftery A (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289–317

  61. Seo B, Kim D (2012) Root selection in normal mixture models. Comput Stat Data Anal 56:2454–2470

  62. Subedi S, Punzo A, Ingrassia S, McNicholas PD (2013) Clustering and classification via cluster-weighted factor analyzers. Adv Data Anal Classif 7:5–40

  63. Subedi S, Punzo A, Ingrassia S, McNicholas PD (2015) Cluster-weighted \(t\)-factor analyzers for robust model-based clustering and dimension reduction. Stat Methods Appl 24(4):623–649

  64. Tanaka K (2009) Strong consistency of the maximum likelihood estimator for finite mixtures of location-scale distributions when penalty is imposed on the ratios of the scale parameters. Scand J Stat 36(1):171–184

  65. Tanaka K, Takemura A (2006) Strong consistency of the maximum likelihood estimator for finite mixtures of location-scale distributions when the scale parameters are exponentially small. Bernoulli 12(6):1003–1017

  66. Tarone RD, Gruenhage G (1975) A note on the uniqueness of the roots of the likelihood equations for vector-valued parameters. J Am Stat Assoc 70(352):903–904

  67. Tarone RD, Gruenhage G (1979) Corrigenda: a note on the uniqueness of the roots of the likelihood equations for vector-valued parameters. J Am Stat Assoc 74(367):744

  68. Theobald C (1975) An inequality with application to multivariate analysis. Biometrika 62(2):461–466

  69. Theobald C (1976) Corrections and amendments: an inequality with application to multivariate analysis. Biometrika 63(3):685

  70. Tipping M, Bishop CM (1999) Mixtures of probabilistic principal component analysers. Neural Comput 11(2):443–482

  71. Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New York

  72. van Laarhoven PJM, Aarts EHL (1988) Simulated annealing: theory and practice. D. Reidel, Dordrecht

  73. Wu CFJ (1983) On convergence properties of the EM algorithm. Ann Stat 11:95–103

Author information

Corresponding author

Correspondence to Salvatore Ingrassia.

Appendix: Proof of Theorem 4

To maximize \(\mathcal {L}(\mathbf {\psi })\) means to jointly maximize \(|\varvec{\varSigma }_g|^{-1/2}\) and minimize the argument of the exponential, i.e., the quadratic form \((\mathbf {x}_n- \varvec{\mu }_g)' \varvec{\varSigma }_g^{-1}(\mathbf {x}_n- \varvec{\mu }_g)\), for each \(g=1,\ldots ,G\). Hence, we first show that, for a given \(\varvec{\varSigma }_g\), the mean vector \(\varvec{\mu }_g\) has to lie in a compact subset of \(\mathbb {R}^d\). Let \(C(\mathcal {X})\) be the convex hull of \(\mathcal {X}\), i.e., the intersection of all convex sets containing the N points, given by

$$\begin{aligned} C(\mathcal {X})=\left\{ \sum _{n=1}^N u_n \mathbf {x}_n \; | \; \sum _{n=1}^N u_n=1, u_n \ge 0\right\} . \end{aligned}$$
(53)
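The relevance of the convex hull comes from the fact that likelihood-based mean estimates are convex combinations of the data: in the EM algorithm each component mean is updated as a weighted average of the observations with nonnegative weights summing to one, which is exactly the form in (53). A minimal numerical check, with hypothetical data and posterior weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # N = 20 data points in R^3 (hypothetical)
tau = rng.uniform(size=20)        # posterior weights of one mixture component

# EM mean update: a convex combination of the data points, hence an
# element of the convex hull C(X) defined in (53).
u = tau / tau.sum()               # u_n >= 0 and sum_n u_n = 1
mu = u @ X

# A necessary consequence of hull membership: mu lies inside the
# coordinate-wise bounding box of the data.
assert np.all(u >= 0) and np.isclose(u.sum(), 1.0)
assert np.all(mu >= X.min(axis=0)) and np.all(mu <= X.max(axis=0))
```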

Suppose now that \(\bar{\mathbf {\psi }} \in \mathbf {\Psi }_{a,b}\) satisfies \(\bar{\varvec{\mu }}_g \notin C(\mathcal {X})\). Then \(\mathcal {L}(\bar{\mathbf {\psi }}) \le \mathcal {L}(\mathbf {\psi }^*)\) where \(\mathbf {\psi }^* \in \mathbf {\Psi }_{a,b}\) is obtained from \(\bar{\mathbf {\psi }}\) by changing the gth mean component to \(\varvec{\mu }_g=\alpha \bar{\varvec{\mu }}_g\) for some \(\alpha \in (0,1)\) (i.e., along the line joining \(\mathbf{0}\) and \(\bar{\varvec{\mu }}_g\)) such that \(\varvec{\mu }_g \in C(\mathcal {X})\).
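A small numerical sketch of this step, for a single spherical Gaussian component with assumed data near the origin: shrinking a mean placed outside the hull along the segment towards \(\mathbf{0}\) increases the likelihood, matching the choice of \(\alpha \in (0,1)\) above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))          # hypothetical data, hull near the origin

def gauss_loglik(X, mu):
    # Log-likelihood of a spherical Gaussian (up to an additive constant).
    return -0.5 * np.sum((X - mu) ** 2)

mu_far = np.array([10.0, 10.0])       # a mean far outside the convex hull
for alpha in (0.5, 0.1):              # shrink along the segment towards 0
    assert gauss_loglik(X, alpha * mu_far) > gauss_loglik(X, mu_far)
```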

Let us set \(S= \{ \mathbf {\psi }\in \mathbf {\Psi }_{a,b} \, | \, \varvec{\mu }_g\in C(\mathcal {X}); 0< a \le \lambda _i(\varvec{\varSigma }_g)\le b < + \infty \quad g=1, \ldots , G\}\). Then, it follows that

$$\begin{aligned} \sup _{\mathbf {\psi }\in \mathbf {\Psi }_{a,b}} \mathcal {L}(\mathbf {\psi }) = \sup _{\mathbf {\psi }\in S} \mathcal {L}(\mathbf {\psi }) . \end{aligned}$$

By the compactness of S and the continuity of \(\mathcal {L}(\mathbf {\psi })\), there exists a parameter \(\hat{\mathbf {\psi }}\in \mathbf {\Psi }_{a,b}\) satisfying

$$\begin{aligned} \mathcal {L}(\hat{\mathbf {\psi }}) =\sup _{\mathbf {\psi }\in \mathbf {\Psi }_{a,b}} \mathcal {L}(\mathbf {\psi })= \sup _{\mathbf {\psi }\in S} \mathcal {L}(\mathbf {\psi }) \end{aligned}$$

by Weierstrass’ theorem; see, e.g., Theorem 4.16 in Rudin (1976). \(\square \)
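The role of the eigenvalue bounds in making this supremum finite can be illustrated numerically in the univariate case, where variances play the role of eigenvalues: without a lower bound, centering one component on a data point and shrinking its variance drives the likelihood to infinity, whereas the bound \(a \le \lambda\) cuts the degenerate path off. A toy sketch (sample and parameter values are arbitrary):

```python
import numpy as np

def mixture_loglik(x, mu1, s1, mu2, s2, p=0.5):
    """Log-likelihood of a two-component univariate normal mixture."""
    def phi(t, m, s):
        return np.exp(-0.5 * ((t - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return np.sum(np.log(p * phi(x, mu1, s1) + (1 - p) * phi(x, mu2, s2)))

x = np.array([-1.0, 0.0, 1.0, 2.0])   # toy sample

# Degenerate path: center component 1 on the data point 0 and let its
# standard deviation shrink; the log-likelihood grows without bound.
lls = [mixture_loglik(x, mu1=0.0, s1=s, mu2=1.0, s2=1.0)
       for s in (1.0, 1e-2, 1e-4)]
assert lls[0] < lls[1] < lls[2]

# A lower eigenvalue bound a (here a variance bound) stops the path at
# s = sqrt(a), so the constrained supremum is finite.
a = 0.25
ll_constrained = mixture_loglik(x, 0.0, np.sqrt(a), 1.0, 1.0)
assert np.isfinite(ll_constrained)
```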

About this article

Cite this article

García-Escudero, L.A., Gordaliza, A., Greselin, F. et al. Eigenvalues and constraints in mixture modeling: geometric and computational issues. Adv Data Anal Classif 12, 203–233 (2018). https://doi.org/10.1007/s11634-017-0293-y

Keywords

  • Mixture model
  • EM algorithm
  • Eigenvalues
  • Model-based clustering

Mathematics Subject Classification

  • 62F10 Point estimation
  • 62F12 Asymptotic properties of estimators
  • 62F30 Inference under constraints
  • 62F35 Robustness and adaptive procedures