Maximum likelihood estimation of Gaussian mixture models without matrix operations

Abstract

The Gaussian mixture model (GMM) is a popular tool for multivariate analysis, in particular, cluster analysis. The expectation–maximization (EM) algorithm is generally used to perform maximum likelihood (ML) estimation for GMMs, because its M-step exists in closed form and it possesses desirable numerical properties, such as monotonicity. However, the EM algorithm has been criticized as being slow to converge, and thus computationally expensive, in some situations. In this article, we introduce the linear regression characterization (LRC) of the GMM. We show that the parameters of an LRC of the GMM can be mapped back to the natural parameters, and that a minorization–maximization (MM) algorithm can be constructed which retains the desirable numerical properties of the EM algorithm without the use of matrix operations. We prove that the ML estimators of the LRC parameters are consistent and asymptotically normal, like their natural counterparts. Furthermore, we show that the LRC allows for simple handling of singularities in the ML estimation of GMMs. Using numerical simulations in the R programming environment, we then demonstrate that the MM algorithm can be faster than the EM algorithm in various large-data situations, with sample sizes ranging from the tens to the hundreds of thousands, for models with up to 16 mixture components fitted to multivariate data with up to 16 variables.

Fig. 1
Fig. 2
Fig. 3

References

  1. Amemiya T (1985) Advanced econometrics. Harvard University Press, Cambridge

  2. Anderson TW (2003) An introduction to multivariate statistical analysis. Wiley, New York

  3. Andrews JL, McNicholas PD (2013) Using evolutionary algorithms for model-based clustering. Pattern Recognit Lett 34:987–992

  4. Atienza N, Garcia-Heras J, Munoz-Pichardo JM, Villa R (2007) On the consistency of MLE in finite mixture models of exponential families. J Stat Plan Inference 137:496–505

  5. Becker MP, Yang I, Lange K (1997) EM algorithms without missing data. Stat Methods Med Res 6:38–54

  6. Bishop CM (2006) Pattern recognition and machine learning. Springer, New York

  7. Botev Z, Kroese DP (2004) Global likelihood optimization via the cross-entropy method with an application to mixture models. In: Proceedings of the 36th conference on winter simulation

  8. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge

  9. Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332

  10. Clarke B, Fokoue E, Zhang HH (2009) Principles and theory for data mining and machine learning. Springer, New York

  11. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38

  12. Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York

  13. Ganesalingam S, McLachlan GJ (1980) A comparison of the mixture and classification approaches to cluster analysis. Commun Stat Theory Methods 9:923–933

  14. Greselin F, Ingrassia S (2008) A note on constrained EM algorithms for mixtures of elliptical distributions. In: Advances in data analysis, data handling and business intelligence: proceedings of the 32nd annual conference of the German Classification Society, vol 53

  15. Hartigan JA (1985) Statistical theory in clustering. J Classif 2:63–76

  16. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New York

  17. Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800

  18. Hunter DR, Lange K (2004) A tutorial on MM algorithms. Am Stat 58:30–37

  19. Ingrassia S (1991) Mixture decomposition via the simulated annealing algorithm. Appl Stoch Models Data Anal 7:317–325

  20. Ingrassia S (2004) A likelihood-based constrained algorithm for multivariate normal mixture models. Stat Methods Appl 13:151–166

  21. Ingrassia S, Rocci R (2007) Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Comput Stat Data Anal 51:5339–5351

  22. Ingrassia S, Rocci R (2011) Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Comput Stat Data Anal 55:1714–1725

  23. Ingrassia S, Minotti SC, Vittadini G (2012) Local statistical modeling via a cluster-weighted approach with elliptical distributions. J Classif 29:363–401

  24. Ingrassia S, Minotti SC, Punzo A (2014) Model-based clustering via linear cluster-weighted models. Comput Stat Data Anal 71:159–182

  25. Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31:651–666

  26. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:264–323

  27. Jennrich RI (1969) Asymptotic properties of non-linear least squares estimators. Ann Math Stat 40:633–643

  28. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, pp 281–297

  29. McLachlan GJ (1982) The classification and mixture maximum likelihood approaches to cluster analysis. In: Krishnaiah PR, Kanal L (eds) Handbook of statistics, vol 2. North-Holland, Amsterdam

  30. McLachlan GJ, Basford KE (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York

  31. McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York

  32. McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions. Wiley, New York

  33. Pernkopf F, Bouchaffra D (2005) Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Trans Pattern Anal Mach Intell 27:1344–1348

  34. R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  35. Razaviyayn M, Hong M, Luo ZQ (2013) A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J Optim 23:1126–1153

  36. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239

  37. Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge

  38. Seber GAF (2008) A matrix handbook for statisticians. Wiley, New York

  39. Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New York

  40. Zhou H, Lange K (2010) MM algorithms for some discrete multivariate distributions. J Comput Graph Stat 19:645–665

Author information

Correspondence to Hien D. Nguyen.

Appendix

Proof of Theorem 1

We shall show the result by construction. Firstly, set

$$\begin{aligned} \beta _{1,0}=\mu _{1}\quad \hbox {and}\quad \sigma _{1}^{2}=\varSigma _{1,1}, \end{aligned}$$
(22)

followed by

$$\begin{aligned}&\beta _{k,0}=\mu _{k}-{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1}{\varvec{\mu }}_{1:k-1},\end{aligned}$$
(23)
$$\begin{aligned}&(\beta _{k,1},\ldots ,\beta _{k,k-1})={\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1}, \end{aligned}$$
(24)

and

$$\begin{aligned} \sigma _{k}^{2}=\varSigma _{k,k}-{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1} {\varvec{\varSigma }}_{k,1:k-1}^{T}, \end{aligned}$$
(25)

for each \(k=2,\ldots ,d\), in order, to get

$$\begin{aligned} {\varvec{\beta }}_{k}^{T}\tilde{{\varvec{x}}}_{k}= & {} \beta _{k,0}+(\beta _{k,1},\ldots ,\beta _{k,k-1}){\varvec{x}}_{1:k-1}\\= & {} \mu _{k}+{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1}({\varvec{x}}_{1:k-1}- {\varvec{\mu }}_{1:k-1})\\= & {} \mu _{k|1:k-1}({\varvec{x}}_{1:k-1}), \end{aligned}$$

and \(\sigma _{k}^{2}=\varSigma _{k|1:k-1}\).

Now, by Lemma 1, and by definition of conditional densities,

$$\begin{aligned} \phi _{1}(x_{1};\mu _{1},\varSigma _{1,1}) \prod _{k=2}^{d}\phi _{1}(x_{k};\mu _{k|1:k-1}( {\varvec{x}}_{1:k-1}),\varSigma _{k|1:k-1})= \phi _{d}({\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }}), \end{aligned}$$

for all \({\varvec{x}}\in \mathbb {R}^{d}\), which implies \(\lambda ({\varvec{x}}; {\varvec{\gamma }}, {\varvec{\sigma }}^{2})=\phi _{d}({\varvec{x}}; {\varvec{\mu }}, {\varvec{\varSigma }})\) upon applying the mappings (22)–(25). Note that \(({\varvec{\mu }},\hbox {vech}({\varvec{\varSigma }}))\) and \(({\varvec{\gamma }},{\varvec{\sigma }}^{2})\) have equal numbers of elements, and that (22)–(25) are uniquely determined for each k. Thus, the mapping from the natural parameters to the LRC parameters is injective. The inverse mapping can also be constructed by setting

$$\begin{aligned} \mu _{1}=\beta _{1,0}\quad \hbox {and}\quad {\varSigma _{1,1}=\sigma _{1}^{2}}, \end{aligned}$$
(26)

followed by

$$\begin{aligned}&{\varvec{\varSigma }}_{k,1:k-1}=(\beta _{k,1},\ldots , \beta _{k,k-1}){\varvec{\varSigma }}_{1:k-1,1:k-1},\end{aligned}$$
(27)
$$\begin{aligned}&\varSigma _{k,k}=\sigma _{k}^{2}+{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1} {\varvec{\varSigma }}_{k,1:k-1}^{T}, \end{aligned}$$
(28)

and

$$\begin{aligned} \mu _{k}=\beta _{k,0}+{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1} {\varvec{\mu }}_{1:k-1}, \end{aligned}$$
(29)

for each \(k=2,\ldots ,d\), in order. The mappings (26)–(29) are likewise uniquely determined for each k, and thus constitute a surjection; together with injectivity, this establishes a bijection between the LRC and the natural parameters.
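
As a numerical sanity check (not part of the proof), the forward mapping (22)–(25) can be verified with a short script: for a random mean vector and positive-definite covariance matrix, the product of conditional univariate Gaussian densities built from the LRC parameters must equal the joint multivariate Gaussian density. The following is an illustrative NumPy sketch; the variable names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)        # random positive-definite covariance
mu = rng.standard_normal(d)
x = rng.standard_normal(d)             # evaluation point

def log_phi1(x, m, s2):
    """Log density of a univariate Gaussian with mean m and variance s2."""
    return -0.5 * (np.log(2 * np.pi * s2) + (x - m) ** 2 / s2)

# Forward mapping (22)-(25): sigma_1^2 = Sigma_{1,1}, then per-coordinate
# regression coefficients and conditional variances for k = 2, ..., d.
log_lrc = log_phi1(x[0], mu[0], Sigma[0, 0])
for k in range(1, d):
    coef = Sigma[k, :k] @ np.linalg.inv(Sigma[:k, :k])   # (beta_{k,1},...,beta_{k,k-1})
    beta0 = mu[k] - coef @ mu[:k]                        # beta_{k,0}
    sigma2_k = Sigma[k, k] - coef @ Sigma[k, :k]         # sigma_k^2 (Schur complement)
    log_lrc += log_phi1(x[k], beta0 + coef @ x[:k], sigma2_k)

# Joint multivariate Gaussian log density, for comparison.
resid = x - mu
sign, logdet = np.linalg.slogdet(Sigma)
log_joint = -0.5 * (d * np.log(2 * np.pi) + logdet
                    + resid @ np.linalg.solve(Sigma, resid))

print(np.isclose(log_lrc, log_joint))  # the two log densities agree
```

The agreement holds for any evaluation point, reflecting that the mapping is exact rather than approximate.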

Proof of Theorem 2

The first and last inequalities of (19) and (20) are due to the definition of minorization [i.e. (11) and (14) are of forms (9) and (10), respectively]. The middle inequality of (19) is due to the concavity of \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\). This can be shown by firstly noting that

$$\begin{aligned} \sum _{i=1}^{g-1}\sum _{j=1}^{n}\tau _{i} ({\varvec{x}}_{j};{\varvec{\psi }}^{(m)}) \log (\pi _{i})+\sum _{j=1}^{n}\tau _{g}({\varvec{x}}_{j}; {\varvec{\psi }}^{(m)})\log \left( 1-\sum _{i=1}^{g-1}\exp [\log (\pi _{i})]\right) \end{aligned}$$

is concave in \(\log (\pi _{i})\) since \(1-\sum _{i=1}^{g-1}\exp [\log (\pi _{i})]\) is concave and \(\log \) is an increasing concave function. Secondly, note that

$$\begin{aligned} \frac{\partial ^{2}Q^{\prime }({\varvec{\psi }}_{1}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)})}{\partial \beta _{i,k,l}^{2}}=-\frac{1}{2\sigma _{i,k}^{(m)2}} \sum _{j=1}^{n}\frac{x_{j,l}^{2}\tau _{i}( {\varvec{x}}_{j};{\varvec{\psi }}^{(m)})}{\alpha _{j,l}} \end{aligned}$$

is negative, and thus \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is concave with respect to each \(\beta _{i,k,l}\) for each i, k and \(l=0,\ldots ,k-1\). Thus, \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is the additive composition of concave functions and is therefore concave with respect to a bijection of \({\varvec{\psi }}_{1}\). Furthermore, the system of equations

$$\begin{aligned} \frac{\partial Q^{\prime }({\varvec{\psi }}_{1}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)})}{\partial \log (\pi _{i})}=0, \end{aligned}$$

for \(i=1,\ldots ,g-1\), has a unique root that is equivalent to update (16), which always satisfies the positivity restrictions on each \(\pi _{i}\).
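
The concavity of the mixing-proportion term in the variables \(\log (\pi _{i})\) can also be checked numerically: writing \(u_{i}=\log (\pi _{i})\), the function \(\sum _{i}a_{i}u_{i}+b\log (1-\sum _{i}\exp (u_{i}))\) should have a negative-semidefinite Hessian whenever \(b>0\) and \(\sum _{i}\exp (u_{i})<1\). The sketch below uses arbitrary illustrative weights \(a_{i}\) and \(b\) standing in for the responsibility sums; it is a finite-difference check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
g = 5                                   # number of mixture components
a = rng.uniform(1.0, 10.0, size=g - 1)  # stand-ins for sum_j tau_i(x_j)
b = rng.uniform(1.0, 10.0)              # stand-in for sum_j tau_g(x_j)

def f(u):
    """Mixing-proportion part of Q' in the variables u_i = log(pi_i)."""
    return a @ u + b * np.log1p(-np.exp(u).sum())

u0 = np.log(np.full(g - 1, 1.0 / g))    # interior point: pi_i = 1/g
eps = 1e-5

# Central finite-difference Hessian of f at u0.
H = np.empty((g - 1, g - 1))
for i in range(g - 1):
    for j in range(g - 1):
        ei = eps * np.eye(g - 1)[i]
        ej = eps * np.eye(g - 1)[j]
        H[i, j] = (f(u0 + ei + ej) - f(u0 + ei - ej)
                   - f(u0 - ei + ej) + f(u0 - ei - ej)) / (4 * eps ** 2)

eigs = np.linalg.eigvalsh(0.5 * (H + H.T))
print(eigs.max() <= 1e-6)               # all eigenvalues non-positive: concave at u0
```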

The middle inequality of (20) is due to the concavity of \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\). This can be shown by noting that

$$\begin{aligned} -\frac{1}{2}\log \sigma _{i,k}^{2}\sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})- \frac{1}{2\exp [\log \sigma _{i,k}^{2}]} \sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j}; {\varvec{\psi }}^{(m)})Q_{2,i,j,k} ({\varvec{\beta }}_{i,k};{\varvec{\beta }}_{i,k}^{(m)}) \end{aligned}$$

is concave in \(\log \sigma _{i,k}^{2}\) for each i and k, since \(1/\exp (x)=\exp (-x)\) is convex and enters with a negative coefficient. Thus, \(Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2};{\varvec{\psi }}^{(m)})\) is concave with respect to a bijection of \({\varvec{\psi }}_{2}.\) Furthermore, the system of equations

$$\begin{aligned} \frac{\partial Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2};{\varvec{\psi }}^{(m)})}{\partial \log \sigma _{i,k}^{2}}=0 \end{aligned}$$

has a unique root that is equivalent to update (18).
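
The uniqueness of this root can be made concrete. Abbreviating \(a=\sum _{j}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})\) and \(b=\sum _{j}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})Q_{2,i,j,k}({\varvec{\beta }}_{i,k};{\varvec{\beta }}_{i,k}^{(m)})\), the displayed function is \(h(u)=-(a/2)u-(b/2)e^{-u}\) in \(u=\log \sigma _{i,k}^{2}\), with unique critical point \(u^{*}=\log (b/a)\), i.e. \(\sigma _{i,k}^{2}=b/a\), the weighted-average closed form that the text identifies with update (18). A quick check with illustrative values of a and b (ours, chosen arbitrarily):

```python
import numpy as np

a, b = 7.3, 2.6          # illustrative stand-ins for the two weighted sums

def h(u):
    """h(u) = -(a/2) u - (b/2) exp(-u): the log-variance objective."""
    return -0.5 * a * u - 0.5 * b * np.exp(-u)

u_star = np.log(b / a)   # analytic critical point, i.e. sigma^2 = b / a

# h'(u) = -a/2 + (b/2) e^{-u}; it vanishes at u_star, and h is strictly
# smaller at every other point (strict concavity => unique maximizer).
grad = -0.5 * a + 0.5 * b * np.exp(-u_star)
offsets = np.linspace(0.01, 3.0, 300)
unique_max = (np.all(h(u_star + offsets) < h(u_star))
              and np.all(h(u_star - offsets) < h(u_star)))
print(abs(grad) < 1e-12, unique_max)
```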

Proof of Theorem 3

This result follows from part (a) of Theorem 2 from Razaviyayn et al. (2013), which assumes that \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) and \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\) both satisfy the definition of a minorizer, and are quasi-concave and have unique critical points, with respect to the parameters \({\varvec{\psi }}_{1}\) and \({\varvec{\psi }}_{2}\), respectively.

Firstly, the definition of a minorizer is satisfied via construction [i.e. (11) and (14) are of forms (9) and (10), respectively]. Secondly, from the proof of Theorem 2, both functions are concave with respect to some bijective mappings, and are therefore quasi-concave under said mappings [see Section 3.4 of Boyd and Vandenberghe (2004) regarding quasi-concavity]. Finally, since both functions are concave with respect to some bijective mapping, the critical points obtained must be unique.

Proof of Theorem 4

We show this result via induction. Firstly, using (26), we see that \(\sigma _{1}^{2}=\det (\varSigma _{1,1})>0\) is the first leading principal minor of \({\varvec{\varSigma }}\), and it is positive. Now, by the definition in (28), \(\sigma _{2}^{2}\) is the Schur complement of \({\varvec{\varSigma }}_{1:k-1,1:k-1}\) in \({\varvec{\varSigma }}_{1:k,1:k}\), for \(k=2\), where

$$\begin{aligned} {\varvec{\varSigma }}_{1:k,1:k}=\left[ \begin{array}{c@{\quad }c} {\varvec{\varSigma }}_{1:k-1,1:k-1} &{} {\varvec{\varSigma }}_{k,1:k-1}^{T}\\ {\varvec{\varSigma }}_{k,1:k-1} &{} \varSigma _{k,k} \end{array}\right] . \end{aligned}$$
(30)

Since \(\sigma _{2}^{2}\) is positive and \(\varSigma _{1,1}\) is positive definite, we have the result that

$$\begin{aligned} \det ({\varvec{\varSigma }}_{1:2,1:2})=\det (\varSigma _{1,1}) \sigma _{2}^{2}>0 \end{aligned}$$

via the partitioning of the determinant. Thus, \({\varvec{\varSigma }}_{1:2,1:2}\) is also positive definite because both the first and second leading principal minors are positive.

Now, for each \(k=3,\ldots ,d\), we assume that \({\varvec{\varSigma }}_{1:k-1,1:k-1}\) is positive-definite. Since \(\sigma _{k}^{2}>0\) is the Schur complement of \({\varvec{\varSigma }}_{1:k-1,1:k-1}\) in the partitioning (30), we have the result that

$$\begin{aligned} \det ({\varvec{\varSigma }}_{1:k,1:k})=\det ({\varvec{\varSigma }}_{1:k-1,1:k-1}) \sigma _{k}^{2}>0. \end{aligned}$$

Thus, the \(k\hbox {th}\) leading principal minor is positive, for all k, and the result follows by Sylvester's criterion: a symmetric matrix is positive-definite if and only if all of its leading principal minors are positive. See Chapters 10 and 14 of Seber (2008) for all relevant matrix results.
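
The determinant recursion underlying this induction is easy to check numerically: for a random positive-definite \({\varvec{\varSigma }}\), each Schur complement \(\sigma _{k}^{2}\) is positive, and multiplying the Schur complements reproduces the leading principal minors. The sketch below (our variable names) verifies both claims with NumPy.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)   # random positive-definite matrix

dets = [Sigma[0, 0]]              # leading principal minors det(Sigma_{1:k,1:k})
for k in range(1, d):
    # Schur complement of Sigma_{1:k,1:k} in Sigma_{1:k+1,1:k+1}: sigma_{k+1}^2.
    s = Sigma[k, :k]
    sigma2 = Sigma[k, k] - s @ np.linalg.solve(Sigma[:k, :k], s)
    assert sigma2 > 0
    # det(Sigma_{1:k+1,1:k+1}) = det(Sigma_{1:k,1:k}) * sigma_{k+1}^2.
    dets.append(dets[-1] * sigma2)

# Each recursion value matches the directly computed leading principal minor.
direct = [np.linalg.det(Sigma[:k, :k]) for k in range(1, d + 1)]
print(np.allclose(dets, direct))
```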

Proof of Theorem 5

Theorem 5 can be established from Theorem 4.1.2 of Amemiya (1985), which requires validation of the following assumptions:

  1. A1

    The parameter space \(\varPsi \) is an open subset of some Euclidean space.

  2. A2

    The log-likelihood \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\) is a measurable function for all \({\varvec{\psi }}\in \varPsi \), and \(\partial (\log \mathcal {L}_{R,n}({\varvec{\psi }}))/\partial {\varvec{\psi }}\) exists and is continuous in an open neighborhood \(N_{1}({\varvec{\psi }}^{0})\) of \({\varvec{\psi }}^{0}\).

  3. A3

    There exists an open neighborhood \(N_{2}({\varvec{\psi }}^{0})\) of \({\varvec{\psi }}^{0}\), where \(n^{-1}\log \mathcal {L}_{R,n}({\varvec{\psi }})\) converges to \(\mathbb {E}[\log f_{R}({\varvec{X}}; {\varvec{\psi }})]\) in probability uniformly in \({\varvec{\psi }}\) in any compact subset of \(N_{2}({\varvec{\psi }}^{0})\).

Assumptions A1 and A2 are fulfilled by noting that the parameter space \(\varPsi =(0,1)^{g-1}\times \mathbb {R}^{g(d^{2}+d)/2+gd}\) is an open subset of \(\mathbb {R}^{(g-1)+g(d^{2}+d)/2+gd}\), and that \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\) is smooth with respect to the parameters \({\varvec{\psi }}\). Using Theorem 2 of Jennrich (1969), we can show that A3 holds by verifying that

$$\begin{aligned} \mathbb {E}\sup _{{\varvec{\psi }}\in \bar{N}}|\log f_{R}({\varvec{X}}; {\varvec{\psi }})|<\infty , \end{aligned}$$
(31)

where \(\bar{N}\) is a compact subset of \(N_{2}({\varvec{\psi }}^{0})\). Since \(f_{R}({\varvec{X}}; {\varvec{\psi }})\) is smooth, this is equivalent to showing that \(\mathbb {E}|\log f_{R}({\varvec{X}}; {\varvec{\psi }})|<\infty \), for any fixed \({\varvec{\psi }}\in \bar{N}\). This is achieved by noting that

$$\begin{aligned} \mathbb {E}|\log f_{R}({\varvec{X}}; {\varvec{\psi }})|= & {} \mathbb {E}\left| \log \sum _{i=1}^{g}\pi _{i}\lambda ({\varvec{X}}; {\varvec{\gamma }}_{i}, {\varvec{\sigma }}_{i}^{2})\right| \nonumber \\\le & {} \sum _{i=1}^{g}\mathbb {E}|\log \lambda ({\varvec{X}}; {\varvec{\gamma }}_{i},{\varvec{\sigma }}_{i}^{2})|\nonumber \\= & {} \sum _{i=1}^{g}\mathbb {E}\left| \sum _{k=1}^{d}\log \phi _{1}(X_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{X}}}_{k},\sigma _{i,k}^{2})\right| \nonumber \\\le & {} \sum _{i=1}^{g}\sum _{k=1}^{d}\mathbb {E} |\log \phi _{1}(X_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{X}}}_{k},\sigma _{i,k}^{2})|. \end{aligned}$$
(32)

The first inequality in (32) is due to Lemma 1 of Atienza et al. (2007), and the second is due to the triangle inequality. Considering that \(\log \phi _{1}(X_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{X}}}_{k},\sigma _{i,k}^{2})\) is a polynomial (quadratic) function of Gaussian random variables, we have \(\mathbb {E}|\log \phi _{1}(X_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{X}}}_{k},\sigma _{i,k}^{2})|<\infty \) for each i and k. The result then follows.

Proof of Theorem 6

Theorem 6 can be established from Theorem 4.2.4 of Amemiya (1985), which requires validation of the following assumptions:

  1. B1

    The Hessian \(\partial ^{2}(\log \mathcal {L}_{R,n}({\varvec{\psi }}))/ \partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}\) exists and is continuous in an open neighborhood of \({\varvec{\psi }}^{0}\).

  2. B2

    The equations

    $$\begin{aligned} \int \frac{\partial f_{R}({\varvec{x}}; {\varvec{\psi }})}{\partial {\varvec{\psi }}}\hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$

    and

    $$\begin{aligned} \int \frac{\partial ^{2}f_{R}({\varvec{x}}; {\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}} \hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$

    hold, for any \({\varvec{\psi }}\in \varPsi \).

  3. B3

    The averaged Hessian satisfies

    $$\begin{aligned} \frac{1}{n}\frac{\partial ^{2}\log \mathcal {L}_{R,n} ({\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}}\overset{P}{\rightarrow }\mathbb {E} \left[ \frac{\partial ^{2}\log f_{R}({\varvec{X}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\right] , \end{aligned}$$

    uniformly in \({\varvec{\psi }}\), in all compact subsets of an open neighborhood of \({\varvec{\psi }}^{0}\).

  4. B4

    The Fisher information

    $$\begin{aligned} -\mathbb {E}\left[ \frac{\partial ^{2}\log f_{R} ({\varvec{X}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\bigg |_{{\varvec{\psi }}= {\varvec{\psi }}^{0}}\right] , \end{aligned}$$

    is positive-definite.

Assumption B1 is validated via the smoothness of \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\), and it is mechanical to check the validity of B2. Assumption B3 can be shown via Theorem 2 of Jennrich (1969). Unlike the others, B4 must be taken as given.

About this article

Cite this article

Nguyen, H.D., McLachlan, G.J. Maximum likelihood estimation of Gaussian mixture models without matrix operations. Adv Data Anal Classif 9, 371–394 (2015). https://doi.org/10.1007/s11634-015-0209-7

Keywords

  • Gaussian mixture model
  • Minorization–maximization algorithm
  • Matrix operation-free
  • Linear regression

Mathematics Subject Classification

  • 65C60
  • 62E10