Abstract
If one is to judge by counts of citations of the fundamental paper (Dempster et al. in JRSSB 39: 1–38, 1977), EM algorithms are a runaway success. But it is surprisingly easy to find published applications of EM that are unnecessary, in the sense that there are simpler methods available that will solve the relevant estimation problems. In particular, such problems can often be solved by the simple expedient of submitting the observed-data likelihood (or log-likelihood) to a general-purpose routine for unconstrained optimization. This dispenses with the need to derive and code (or modify) the E and M steps, a process which can sometimes be laborious or error-prone. Here, I discuss six such applications of EM in some detail, and in an appendix describe briefly some others that have already appeared in the literature. Whether these are atypical of applications of EM seems an open question, although one that may be difficult to answer; this question is of relevance to current practice, but may also be of historical interest. But it is clear that there are problems traditionally solved by EM (e.g. the fitting of finite mixtures of distributions) that can also be solved by other means. It is suggested that, before going to the effort of devising an EM algorithm to use on a new problem, the researcher should consider whether other methods (e.g. direct numerical maximization or an MM algorithm of some other kind) may be either simpler to implement or more efficient.
Notes
I have known a reviewer to react with surprise to the claim that one needs only to solve a quadratic equation. I have seen Mathematica invoked to solve the likelihood equation. And there is in the discussion following the paper of Dempster et al. the correct but strange suggestion that one can approximate the likelihood equation by a certain linear equation—in order to render it “easily solvable”!
An anonymous reviewer of one of my papers suggested bluntly, albeit necessarily without evidence, that the reason certain authors choose to use EM is so that they can get a paper published.
References
Altham, P.M.E.: Two generalizations of the binomial distribution. J. R. Stat. Soc. Ser. C 27(2), 162–167 (1978)
Amis, K.: Lucky Jim. Victor Gollancz, London (1954)
Azzalini, A., Bowman, A.W.: A look at some data on the Old Faithful geyser. J. R. Stat. Soc. Ser. C (Applied Statistics) 39, 357–365 (1990)
Balakrishnan, N., Mitra, D.: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. 48, 125–171 (2014)
Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)
Brown, G.O., Buckley, W.S.: Experience rating with Poisson mixtures. Ann. Actuar. Sci. 9(2), 304–321 (2015)
Davison, A.C.: Statistical Models. Cambridge University Press, Cambridge (2003)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B 39, 1–38 (1977)
Diaconis, P.: Some things we’ve learned (about Markov chain Monte Carlo). Bernoulli 19(4), 1294–1305 (2013). https://doi.org/10.3150/12-BEJSP09
Finney, D.J.: The estimation from individual records of the relationship between dose and quantal response. Biometrika 34(3/4), 320–334 (1947)
Fisher, R.A., Balmukand, B.: The estimation of linkage from the offspring of selfed heterozygotes. J. Genet. 20, 79–92 (1928)
Gould, S.J.: The Lying Stones of Marrakech: Penultimate Reflections in Natural History. Belknap Press, Cambridge, MA (2011)
He, Y., Liu, C.: The dynamic expectation-conditional maximization either algorithm. J. R. Stat. Soc. Ser. B 74(2), 313–336 (2012). https://doi.org/10.1111/j.1467-9868.2011.01013.x
Jamshidian, M., Jennrich, R.: Acceleration of the EM algorithm by using quasi-Newton methods. J. R. Stat. Soc. Ser. B 59(3), 569–587 (1997). https://doi.org/10.1111/1467-9868.00083
Kim, D.K., Taylor, J.M.G.: The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters. J. Am. Stat. Ass. 90(430), 708–716 (1995)
Kundu, D., Dey, A.K.: Estimating the parameters of the Marshall-Olkin bivariate Weibull distribution by EM algorithm. Comput. Stat. Data Anal. 53, 956–965 (2009)
Lange, K.: A quasi-Newton acceleration of the EM algorithm. Stat. Sin. 5, 1–18 (1995a)
Lange, K.: A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B 57(2), 425–437 (1995b)
Lange, K.: Mathematical and Statistical Methods for Genetic Analysis, 2nd edn. Springer, New York (2002)
Lange, K.: Numerical Analysis for Statisticians, 2nd edn. Springer, New York (2010)
Lange, K.: MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA (2016)
Langrock, R.: Some applications of nonlinear and non-Gaussian state-space modelling by means of hidden Markov models. J. Appl. Stat. 38(12), 2955–2970 (2011)
Langrock, R., MacDonald, I.L., Zucchini, W.: Some nonstandard stochastic volatility models and their estimation using structured hidden Markov models. J. Empir. Financ. 19, 147–161 (2012)
Leask, K.: Wadley’s problem with overdispersion. PhD thesis, University of KwaZulu–Natal (2009)
Leask, K.L., Haines, L.M.: The Altham-Poisson distribution. Stat. Model. 15(5), 476–497 (2015). https://doi.org/10.1177/1471082X15571161
Lee, W., Pawitan, Y.: Direct calculation of the variance of maximum penalized likelihood estimates via EM algorithm. Am. Stat. 68(2), 93–97 (2014)
Lewandowski, A., Liu, C., Vander Wiel, S.: Parameter expansion and efficient inference. Stat. Sci. 25(4), 533–544 (2010)
Little, R.J.A., Rubin, D.B.: Statistical Analysis of Missing Data, 2nd edn. Wiley, Hoboken, NJ (2002)
Liu, C., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81(4), 633–648 (1994)
Liu, S., Wu, H., Meeker, W.Q.: Understanding and addressing the unbounded likelihood problem. Am. Stat. 69(3), 191–200 (2015)
Liu, C., Rubin, D.B., Wu, Y.N.: Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85(4), 755–770 (1998)
MacDonald, I.L., Korula, F.: Maximum-likelihood estimation for multivariate distributions of Marshall–Olkin type: Two routes simpler than EM, submitted (2020)
MacDonald, I.L.: Does Newton-Raphson really fail? Stat. Methods Med. Res. 23(3), 308–311 (2014a). https://doi.org/10.1177/0962280213497329
MacDonald, I.L.: Numerical maximisation of likelihood: a neglected alternative to EM? Int. Stat. Rev. 82(2), 296–308 (2014b)
MacDonald, I.L.: Fitting truncated normal distributions. Stat. Methods Med. Res. 27(12), 3835–3838 (2018). https://doi.org/10.1177/0962280217712089
MacDonald, I.L., Lapham, B.M.: Even more direct calculation of the variance of a maximum penalized-likelihood estimator. Am. Stat. 70(1), 114–118 (2016)
MacDonald, I.L., Nkalashe, P.: A simple route to maximum-likelihood estimates of two-locus recombination fractions under inequality restrictions. J. Genet. 94, 479–481 (2015)
McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
Meng, X.L., van Dyk, D.: The EM algorithm: An old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B 59(3), 511–540 (1997). https://doi.org/10.1111/1467-9868.00082
Meng, X.L.: The EM algorithm and medical studies: a historical link. Stat. Methods Med. Res. 6, 3–23 (1997)
Meng, X.L.: Response: Did Newton-Raphson really fail? Stat. Methods Med. Res. 23(3), 312–314 (2014)
Morton, N.E.: Genetic studies of Northeastern Brazil. Cold Spring Harbor Symp. Quant. Biol. 29, 69–79 (1964)
Mulinacci, S.: Archimedean-based Marshall-Olkin distributions and related dependence structures. Methodol. Comput. Appl. Probab. 20(1), 205–236 (2018)
Ng, H.K.T., Ye, Z.: Comments: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. 48, 177–180 (2014)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–231 (2014). https://doi.org/10.1561/2400000003
Pawitan, Y.: In All Likelihood. Oxford University Press, Oxford (2001)
Polson, N.G., Scott, J.G., Willard, B.T.: Proximal algorithms in statistics and machine learning. Stat. Sci. 30(4), 559–581 (2015). https://doi.org/10.1214/15-STS530
R Core Team.: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org (2017)
Rao, C.R.: Linear Statistical Inference and its Applications, 2nd edn. Wiley, New York (1973)
Reilly, M., Lawlor, E.: A likelihood-based method of identifying contaminated lots of blood product. Int. J. Epidemiol. 28(4), 787–792 (1999). https://doi.org/10.1093/ije/28.4.787
Shi, N.Z., Zheng, S.R., Guo, J.: The restricted EM algorithm under inequality restrictions on the parameters. J. Multivar. Anal. 92(1), 53–76 (2005). https://doi.org/10.1016/S0047-259X(03)00134-9
Speed, T.P.: Terence’s stuff: my favourite algorithm. Inst. Math. Stat. Bull. 37, 14 (2008)
Springer, T., Urban, K.: Comparison of the EM algorithm and alternatives. Num. Algorithms 67(2), 335–364 (2014). https://doi.org/10.1007/s11075-013-9794-8
Thompson, E.A.: Statistical Inferences from Genetic Data on Pedigrees (NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 6). Institute of Mathematical Statistics, Beachwood, OH (2000)
Tian, G.L., Ju, D., Yuen, K.C., Zhang, C.: New expectation-maximization-type algorithms via stochastic representation for the analysis of truncated normal data with applications in biomedicine. Stat. Methods Med. Res. 27(8), 2459–2477 (2018). https://doi.org/10.1177/0962280216681598
van Dyk, D., Tang, R.: The one-step-late PXEM algorithm. Stat. Comput. 13(2), 137–152 (2003). https://doi.org/10.1023/A:1023256509116
Wu, T.T., Lange, K.: The MM alternative to EM. Stat. Sci. 25, 492–505 (2010)
Yasuda, N.: Estimation of inbreeding coefficient from phenotype frequencies by a method of maximum likelihood scoring. Biometrics 24(4), 915–935 (1968). https://doi.org/10.2307/2528880
Zhou, Y., Shi, N.Z., Fung, W.K., Guo, J.: Maximum likelihood estimates of two-locus recombination fractions under some natural inequality restrictions. BMC Genet. 9(1), 1 (2008)
Zhou, H., Lange, K.: Rating movies and rating the raters who rate them. Am. Stat. 63(4), 297–307 (2009)
Zucchini, W., MacDonald, I.L., Langrock, R.: Hidden Markov Models for Time Series: An Introduction Using R, 2nd edn. Chapman & Hall/CRC Press, Boca Raton, FL (2016)
Acknowledgements
The author thanks the Editor-in-Chief, Associate Editor and reviewers for their helpful and encouraging comments and suggestions. In addition, Dr. Etienne Pienaar is thanked for his many helpful suggestions.
Appendix
A Other published examples
Here, I discuss briefly some previously published examples in which it is apparently simpler not to use EM, plus the applications chapters of Zucchini et al. (2016), all of which use DNM but not EM.
A.1 EM modified to allow for constraints
It is sometimes stated that EM allows automatically for constraints on parameters. For instance, Lange (2010, p. 223) writes that “[...] the EM algorithm handles parameter constraints gracefully. Constraint satisfaction is by definition built into the solution of the M step.” This could, however, be misunderstood to mean all constraints on parameters. EM does indeed handle many constraints automatically, e.g. the nonnegativity and unit-sum constraints on the transition probabilities in a hidden Markov model (Zucchini et al. 2016, p. 72). But see (e.g.) Kim and Taylor (1995) and Shi et al. (2005), the very purpose of which is to modify EM in order to incorporate (respectively) certain linear equality or inequality constraints on parameters.
There has been one very determined but apparently unnecessary attempt to modify EM in order to allow for the linear inequality constraints arising very naturally in one particular problem in genetics, even splitting the M step into seven cases in order to do so (Zhou et al. 2008). There the problem is just one of maximizing a (nonlinear) function of three variables subject to four linear inequality constraints and is easily solved by using the constrained optimizer constrOptim provided by R. EM makes this problem harder than it need be. For further details, see MacDonald and Nkalashe (2015).
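To make this route concrete: the paper uses R's constrOptim, but the same idea can be sketched in Python. The objective and constraints below are invented for illustration (this is not the genetics problem of Zhou et al. 2008); the point is only that a general-purpose constrained optimizer handles a smooth objective of three variables under four linear inequality constraints directly, with no case splitting.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

# Toy stand-in for "maximize a nonlinear function of three variables
# subject to four linear inequality constraints". Objective assumed
# for illustration: maximize log x1 + log x2 + log x3.
def neg_obj(x):
    return -np.sum(np.log(x))

# Four linear inequalities: x_i >= 0.01 (i = 1, 2, 3) and x1 + x2 + x3 <= 1.
A = np.vstack([np.eye(3), np.ones((1, 3))])
lc = LinearConstraint(A, lb=[0.01, 0.01, 0.01, -np.inf],
                      ub=[np.inf, np.inf, np.inf, 1.0])

res = minimize(neg_obj, x0=np.full(3, 0.2), method="trust-constr",
               constraints=[lc])
```

By symmetry and concavity the maximum of this toy objective lies at x = (1/3, 1/3, 1/3), which the optimizer recovers; no E or M step was derived along the way.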
A.2 The Altham–Poisson distribution
Leask and Haines (2015) and Leask (2009) considered very fully the possible use of EM for parameter estimation in the case of an Altham–Poisson distribution, that is, a “multiplicative binomial” distribution (Altham 1978) for which n, the number of trials, is taken to have a Poisson distribution. A multiplicative binomial has probability mass function of the form
$$\Pr (X=x) = \binom{n}{x} p^x (1-p)^{n-x}\, \theta ^{x(n-x)} \Big / \sum _{j=0}^{n} \binom{n}{j} p^j (1-p)^{n-j}\, \theta ^{j(n-j)}, \quad x=0,1,\ldots ,n,$$
with \(p \in [0,1]\) and \(\theta >0\). They concluded that the formulation of EM for such a distribution is “subtle and somewhat complicated” and, because EM was slow to converge, chose instead to maximize the log-likelihood directly.
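The direct-maximization route is short even in code. The sketch below (Python rather than the R used by those authors, with invented toy frequencies, and with n known rather than Poisson-distributed, so it fits Altham's multiplicative binomial rather than the full Altham–Poisson model) maximizes the grouped-data log-likelihood numerically, using a logit transform for p and a log transform for θ so that the optimization is unconstrained.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import comb

n = 5
counts = np.array([2, 7, 13, 18, 25, 35])   # hypothetical frequencies of x = 0..5

def log_pmf(p, theta):
    # multiplicative-binomial log-probabilities, normalized over x = 0..n
    x = np.arange(n + 1)
    w = comb(n, x) * p**x * (1 - p)**(n - x) * theta**(x * (n - x))
    return np.log(w) - np.log(w.sum())

def nll(eta):
    p = 1.0 / (1.0 + np.exp(-eta[0]))        # logit keeps p in (0, 1)
    theta = np.exp(eta[1])                   # log keeps theta > 0
    return -np.sum(counts * log_pmf(p, theta))

res = minimize(nll, x0=np.zeros(2), method="BFGS")
```

No E step, M step, or missing-data formulation is needed; the two transformed parameters are simply handed to an unconstrained optimizer.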
A.3 Fitting truncated normal distributions
It is straightforward to evaluate the (density-approximated) likelihood of a sample from a truncated normal distribution and maximize it numerically. Nevertheless, Tian et al. (2018) go to considerable lengths to develop an algorithm of EM type to fit truncated normal distributions. Their two examples of model fitting can easily be carried out instead by DNM, although in one case it is clear that their truncated normal is much inferior to a log-normal model. This problem has been more fully discussed by MacDonald (2018).
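As a hedged illustration of how little is involved, here is a Python sketch on simulated data (the left-truncation point a = 0 and the generating parameters are assumptions of the example): the truncated-normal log-likelihood is written down and handed to an unconstrained optimizer, with σ log-transformed to keep it positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
a = 0.0                          # known left-truncation point (assumed)
# simulate truncated-normal data by rejection from N(2, 1.5^2)
raw = rng.normal(2.0, 1.5, size=5000)
x = raw[raw > a][:1000]

def nll(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)    # positivity imposed by reparameterization
    # log-density of N(mu, sigma^2) truncated to (a, infinity)
    return -np.sum(norm.logpdf(x, mu, sigma) - norm.logsf(a, mu, sigma))

res = minimize(nll, x0=np.array([np.mean(x), np.log(np.std(x))]),
               method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

With a two-sided truncation one would subtract log(Φ((b−μ)/σ) − Φ((a−μ)/σ)) instead of the log survivor function; nothing else changes.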
A.4 Maximization of penalized likelihood
Lee and Pawitan (2014) describe how to estimate the variances of estimators based on the maximization of penalized likelihood; they assume that these estimators have been found by EM. MacDonald and Lapham (2016) describe how to accomplish the same and more without using EM, by instead maximizing the penalized likelihood directly. For the two examples of Lee and Pawitan they find the MLEs, their standard errors, and confidence (or credibility) intervals of two types: those of Wald type and those based directly on penalized likelihood, which can differ considerably from those of Wald type.
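A minimal sketch of the direct route, on an invented toy problem rather than either of Lee and Pawitan's examples: a ridge-penalized normal-mean model, with the penalized log-likelihood maximized numerically and a Wald-type standard error read off from the optimizer's approximate inverse Hessian.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.normal(1.5, 1.0, size=50)
lam = 2.0                        # fixed penalty weight (assumed known)

def neg_pen_loglik(theta):
    mu = theta[0]
    # N(mu, 1) negative log-likelihood plus quadratic penalty on mu
    return 0.5 * np.sum((y - mu) ** 2) + 0.5 * lam * mu ** 2

res = minimize(neg_pen_loglik, x0=np.array([0.0]), method="BFGS")
mu_hat = res.x[0]
# Wald-type standard error from BFGS's approximate inverse Hessian
se = np.sqrt(res.hess_inv[0, 0])
```

For this toy model the maximizer is available in closed form, namely sum(y)/(n + λ), which the numerical route reproduces; in realistic problems only the numerical route is available, and it delivers the variance estimate as a by-product.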
A.5 Fitting a zero-truncated Poisson
Meng (1997) describes how to use (inter alia) EM to fit a Poisson distribution to count data from which the number of zeros observed is missing. Among the methods he describes is Newton–Raphson. MacDonald (2014a) claims that the apparent failure of Newton–Raphson for certain starting-values is due only to the fact that the obvious positivity constraint on the Poisson mean has been ignored. Meng (2014) has replied, but does not agree entirely.
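The positivity point can be made concrete in a few lines of Python (toy counts invented for illustration): write the zero-truncated Poisson log-likelihood in terms of η = log λ, so the positivity constraint is built in and any unconstrained optimizer, or indeed Newton–Raphson, can safely be let loose.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# hypothetical zero-truncated Poisson counts (no zeros observable)
x = np.array([1, 1, 1, 2, 2, 3, 1, 4, 2, 1, 1, 2, 3, 1, 2])

def nll(eta):
    lam = np.exp(eta)            # positivity of the Poisson mean built in
    # zero-truncated Poisson negative log-likelihood, constants dropped
    return -(np.sum(x) * eta - len(x) * lam
             - len(x) * np.log1p(-np.exp(-lam)))

res = minimize_scalar(nll)       # unconstrained in eta = log(lambda)
lam_hat = np.exp(res.x)
```

At the maximum the likelihood equation requires the sample mean to equal λ/(1 − e^(−λ)), which provides a direct check on the fitted value.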
A.6 The examples of MacDonald (2014b)
MacDonald (2014b) describes a range of examples in which EM has apparently been used unnecessarily and compares EM with DNM. Included are the analysis of ABO blood-group data, and the fitting of Dirichlet distributions. There has been little or no response published which disagrees materially with the conclusions and recommendations of that paper.
A.7 The applications of Zucchini et al. (2016)
The applications chapters of Zucchini et al. (2016, Chapters 15–24) present a wide variety of models of hidden Markov type or similar, some simple, some complex, but all fitted by direct numerical maximization of likelihood. There is little if any mention of EM in those chapters, as it usually seemed redundant or over-complicated—in spite of the strong historical connection between hidden Markov models and EM, and the traditional use of the Baum–Welch algorithm (i.e. EM) in such models. This seems to support the argument that, in some contexts where EM is often used, it is inessential.
B A simple optimization problem with implicit constraint
Boyd and Vandenberghe (2004, Exercise 9.10b) present a very simple minimization problem for which they correctly state that the “pure” (i.e. undamped) Newton method can diverge. The problem is to minimize \(f(x) = x-\log x\). It is straightforward to establish (analytically) that f has a unique minimum at \(x=1\), but if pure Newton is started from \(x_0=3\) (for instance), the algorithm does indeed diverge.
But the very nature of f is such that there is the implicit constraint \(x>0\). One can (and should) therefore replace the problem by the unconstrained problem of minimizing \(g(y) = \exp (y)-y\)—or else impose the positivity constraint in some other way. If pure Newton, without any embellishment whatsoever, is then started from \(y_0=\log 3\), it converges very fast to \(y=0\), as it should. Not surprisingly, the unconstrained minimizer nlm, applied to g and starting from \(y_0=\log 3\), also converges very fast to \(y=0\).
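Both behaviours can be verified by hand, in Python here rather than R, with the Newton iterations written out explicitly.

```python
import math

def newton_step_f(x):
    # one pure Newton step for f(x) = x - log x:
    # x_new = x - f'(x)/f''(x) = x - (1 - 1/x) / (1/x^2) = 2x - x^2
    return 2.0 * x - x * x

x = 3.0
for _ in range(3):
    x = newton_step_f(x)         # 3 -> -3 -> -15 -> -255: leaves the domain x > 0

# reparameterize: x = exp(y), and minimize g(y) = exp(y) - y instead;
# g'(y) = exp(y) - 1, g''(y) = exp(y)
y = math.log(3.0)
for _ in range(10):
    y = y - (math.exp(y) - 1.0) / math.exp(y)   # converges rapidly to y = 0
```

Three pure Newton steps on f already take the iterate to −255, far outside the domain, whereas on g the iteration converges quadratically to y = 0, i.e. x = 1.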
If, however, I ignore my own advice, put f (unconstrained) into nlm, and start from \(x_0=3\), with a few warnings nlm converges in ten iterations to \(x=0.9999995\). nlm appears to be sufficiently robust to withstand some rough treatment. Nevertheless, if there are constraints it is unwise to ignore them. And, whatever method one chooses to attempt an optimization problem, there is always the possibility that sufficiently extreme starting values will cause under- or overflow in the objective, and thereby cause the method to fail.
Cite this article
MacDonald, I.L. Is EM really necessary here? Examples where it seems simpler not to use EM. AStA Adv Stat Anal 105, 629–647 (2021). https://doi.org/10.1007/s10182-021-00392-x