
Is EM really necessary here? Examples where it seems simpler not to use EM


Abstract

If one is to judge by counts of citations of the fundamental paper (Dempster et al. in J. R. Stat. Soc. Ser. B 39:1–38, 1977), EM algorithms are a runaway success. But it is surprisingly easy to find published applications of EM that are unnecessary, in the sense that there are simpler methods available that will solve the relevant estimation problems. In particular, such problems can often be solved by the simple expedient of submitting the observed-data likelihood (or log-likelihood) to a general-purpose routine for unconstrained optimization. This can dispense with the need to derive and code (or modify) the E and M steps, a process which can sometimes be laborious or error-prone. Here, I discuss six such applications of EM in some detail, and in an appendix describe briefly some others that have already appeared in the literature. Whether these are atypical of applications of EM seems an open question, although one that may be difficult to answer; this question is of relevance to current practice, but may also be of historical interest. But it is clear that there are problems traditionally solved by EM (e.g. the fitting of finite mixtures of distributions) that can also be solved by other means. It is suggested that, before going to the effort of devising an EM algorithm to use on a new problem, the researcher should consider whether other methods (e.g. direct numerical maximization or an MM algorithm of some other kind) may be either simpler to implement or more efficient.


Notes

  1. I have known a reviewer to react with surprise to the claim that one needs only to solve a quadratic equation. I have seen Mathematica invoked to solve the likelihood equation. And there is in the discussion following the paper of Dempster et al. the correct but strange suggestion that one can approximate the likelihood equation by a certain linear equation—in order to render it “easily solvable”!

  2. An anonymous reviewer of one of my papers suggested bluntly, albeit necessarily without evidence, that the reason certain authors choose to use EM is so that they can get a paper published.

References

  • Altham, P.M.E.: Two generalizations of the binomial distribution. J. R. Stat. Soc. Ser. C 27(2), 162–167 (1978)

  • Amis, K.: Lucky Jim. Victor Gollancz, London (1954)

  • Azzalini, A., Bowman, A.W.: A look at some data on the Old Faithful geyser. J. R. Stat. Soc. Ser. C 39, 357–365 (1990)

  • Balakrishnan, N., Mitra, D.: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. 48, 125–171 (2014)

  • Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)

  • Brown, G.O., Buckley, W.S.: Experience rating with Poisson mixtures. Ann. Actuar. Sci. 9(2), 304–321 (2015)

  • Davison, A.C.: Statistical Models. Cambridge University Press, Cambridge (2003)

  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B 39, 1–38 (1977)

  • Diaconis, P.: Some things we’ve learned (about Markov chain Monte Carlo). Bernoulli 19(4), 1294–1305 (2013). https://doi.org/10.3150/12-BEJSP09

  • Finney, D.J.: The estimation from individual records of the relationship between dose and quantal response. Biometrika 34(3/4), 320–334 (1947)

  • Fisher, R.A., Balmukand, B.: The estimation of linkage from the offspring of selfed heterozygotes. J. Genet. 20, 79–92 (1928)

  • Gould, S.J.: The Lying Stones of Marrakech: Penultimate Reflections in Natural History. Belknap Press, Cambridge, MA (2011)

  • He, Y., Liu, C.: The dynamic expectation–conditional maximization either algorithm. J. R. Stat. Soc. Ser. B 74(2), 313–336 (2012). https://doi.org/10.1111/j.1467-9868.2011.01013.x

  • Jamshidian, M., Jennrich, R.: Acceleration of the EM algorithm by using quasi-Newton methods. J. R. Stat. Soc. Ser. B 59(3), 569–587 (1997). https://doi.org/10.1111/1467-9868.00083

  • Kim, D.K., Taylor, J.M.G.: The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters. J. Am. Stat. Assoc. 90(430), 708–716 (1995)

  • Kundu, D., Dey, A.K.: Estimating the parameters of the Marshall–Olkin bivariate Weibull distribution by EM algorithm. Comput. Stat. Data Anal. 53, 956–965 (2009)

  • Lange, K.: A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B 57(2), 425–437 (1995b)

  • Lange, K.: A quasi-Newton acceleration of the EM algorithm. Stat. Sin. 5, 1–18 (1995a)

  • Lange, K.: Mathematical and Statistical Methods for Genetic Analysis, 2nd edn. Springer, New York (2002)

  • Lange, K.: Numerical Analysis for Statisticians, 2nd edn. Springer, New York (2010)

  • Lange, K.: MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA (2016)

  • Langrock, R.: Some applications of nonlinear and non-Gaussian state-space modelling by means of hidden Markov models. J. Appl. Stat. 38(12), 2955–2970 (2011)

  • Langrock, R., MacDonald, I.L., Zucchini, W.: Some nonstandard stochastic volatility models and their estimation using structured hidden Markov models. J. Empir. Financ. 19, 147–161 (2012)

  • Leask, K.: Wadley’s problem with overdispersion. PhD thesis, University of KwaZulu–Natal (2009)

  • Leask, K.L., Haines, L.M.: The Altham–Poisson distribution. Stat. Model. 15(5), 476–497 (2015). https://doi.org/10.1177/1471082X15571161

  • Lee, W., Pawitan, Y.: Direct calculation of the variance of maximum penalized likelihood estimates via EM algorithm. Am. Stat. 68(2), 93–97 (2014)

  • Lewandowski, A., Liu, C., Vander Wiel, S.: Parameter expansion and efficient inference. Stat. Sci. 25(4), 533–544 (2010)

  • Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, 2nd edn. Wiley, Hoboken, NJ (2002)

  • Liu, C., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81(4), 633–648 (1994)

  • Liu, S., Wu, H., Meeker, W.Q.: Understanding and addressing the unbounded likelihood problem. Am. Stat. 69(3), 191–200 (2015)

  • Liu, C., Rubin, D.B., Wu, Y.N.: Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85(4), 755–770 (1998)

  • MacDonald, I.L., Korula, F.: Maximum-likelihood estimation for multivariate distributions of Marshall–Olkin type: two routes simpler than EM, submitted (2020)

  • MacDonald, I.L.: Does Newton–Raphson really fail? Stat. Methods Med. Res. 23(3), 308–311 (2014a). https://doi.org/10.1177/0962280213497329

  • MacDonald, I.L.: Numerical maximisation of likelihood: a neglected alternative to EM? Int. Stat. Rev. 82(2), 296–308 (2014b)

  • MacDonald, I.L.: Fitting truncated normal distributions. Stat. Methods Med. Res. 27(12), 3835–3838 (2018). https://doi.org/10.1177/0962280217712089

  • MacDonald, I.L., Lapham, B.M.: Even more direct calculation of the variance of a maximum penalized-likelihood estimator. Am. Stat. 70(1), 114–118 (2016)

  • MacDonald, I.L., Nkalashe, P.: A simple route to maximum-likelihood estimates of two-locus recombination fractions under inequality restrictions. J. Genet. 94, 479–481 (2015)

  • McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)

  • Meng, X.L., van Dyk, D.: The EM algorithm: an old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B 59(3), 511–540 (1997). https://doi.org/10.1111/1467-9868.00082

  • Meng, X.L.: The EM algorithm and medical studies: a historical link. Stat. Methods Med. Res. 6, 3–23 (1997)

  • Meng, X.L.: Response: Did Newton–Raphson really fail? Stat. Methods Med. Res. 23(3), 312–314 (2014)

  • Morton, N.E.: Genetic studies of Northeastern Brazil. Cold Spring Harb. Symp. Quant. Biol. 29, 69–79 (1964)

  • Mulinacci, S.: Archimedean-based Marshall–Olkin distributions and related dependence structures. Methodol. Comput. Appl. Probab. 20(1), 205–236 (2018)

  • Ng, H.K.T., Ye, Z.: Comments: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. 48, 177–180 (2014)

  • Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–231 (2014). https://doi.org/10.1561/2400000003

  • Pawitan, Y.: In All Likelihood. Oxford University Press, Oxford (2001)

  • Polson, N.G., Scott, J.G., Willard, B.T.: Proximal algorithms in statistics and machine learning. Stat. Sci. 30(4), 559–581 (2015). https://doi.org/10.1214/15-STS530

  • R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org (2017)

  • Rao, C.R.: Linear Statistical Inference and its Applications, 2nd edn. Wiley, New York (1973)

  • Reilly, M., Lawlor, E.: A likelihood-based method of identifying contaminated lots of blood product. Int. J. Epidemiol. 28(4), 787–792 (1999). https://doi.org/10.1093/ije/28.4.787

  • Shi, N.Z., Zheng, S.R., Guo, J.: The restricted EM algorithm under inequality restrictions on the parameters. J. Multivar. Anal. 92(1), 53–76 (2005). https://doi.org/10.1016/S0047-259X(03)00134-9

  • Speed, T.P.: Terence’s stuff: my favourite algorithm. Inst. Math. Stat. Bull. 37, 14 (2008)

  • Springer, T., Urban, K.: Comparison of the EM algorithm and alternatives. Numer. Algorithms 67(2), 335–364 (2014). https://doi.org/10.1007/s11075-013-9794-8

  • Thompson, E.A.: Statistical Inferences from Genetic Data on Pedigrees (NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 6). Institute of Mathematical Statistics, Beachwood, OH (2000)

  • Tian, G.L., Ju, D., Yuen, K.C., Zhang, C.: New expectation-maximization-type algorithms via stochastic representation for the analysis of truncated normal data with applications in biomedicine. Stat. Methods Med. Res. 27(8), 2459–2477 (2018). https://doi.org/10.1177/0962280216681598

  • van Dyk, D., Tang, R.: The one-step-late PXEM algorithm. Stat. Comput. 13(2), 137–152 (2003). https://doi.org/10.1023/A:1023256509116

  • Wu, T.T., Lange, K.: The MM alternative to EM. Stat. Sci. 25, 492–505 (2010)

  • Yasuda, N.: Estimation of inbreeding coefficient from phenotype frequencies by a method of maximum likelihood scoring. Biometrics 24(4), 915–935 (1968). https://doi.org/10.2307/2528880

  • Zhou, Y., Shi, N.Z., Fung, W.K., Guo, J.: Maximum likelihood estimates of two-locus recombination fractions under some natural inequality restrictions. BMC Genet. 9(1), 1 (2008)

  • Zhou, H., Lange, K.: Rating movies and rating the raters who rate them. Am. Stat. 63(4), 297–307 (2009)

  • Zucchini, W., MacDonald, I.L., Langrock, R.: Hidden Markov Models for Time Series: An Introduction Using R, 2nd edn. Chapman & Hall/CRC Press, Boca Raton, FL (2016)


Acknowledgements

The author thanks the Editor-in-Chief, Associate Editor and reviewers for their helpful and encouraging comments and suggestions. In addition, Dr. Etienne Pienaar is thanked for his many helpful suggestions.

Author information

Correspondence to Iain L. MacDonald.


Appendix

A Other published examples

Here, I discuss briefly some previously published examples in which it is apparently simpler not to use EM, plus the applications chapters of Zucchini et al. (2016), all of which use direct numerical maximization (DNM) of the likelihood but not EM.

A.1 EM modified to allow for constraints

It is sometimes stated that EM allows automatically for constraints on parameters. For instance, Lange (2010, p. 223) writes that “[...] the EM algorithm handles parameter constraints gracefully. Constraint satisfaction is by definition built into the solution of the M step.” This could, however, be misunderstood to mean all constraints on parameters. EM does indeed handle many constraints automatically, e.g. the nonnegativity and unit-sum constraints on the transition probabilities in a hidden Markov model (Zucchini et al. 2016, p. 72). But see (e.g.) Kim and Taylor (1995) and Shi et al. (2005), the very purpose of which is to modify EM in order to incorporate (respectively) certain linear equality or inequality constraints on parameters.

There has been one very determined but apparently unnecessary attempt to modify EM in order to allow for the linear inequality constraints arising very naturally in one particular problem in genetics, even splitting the M step into seven cases in order to do so (Zhou et al. 2008). There the problem is just one of maximizing a (nonlinear) function of three variables subject to four linear inequality constraints and is easily solved by using the constrained optimizer constrOptim provided by R. EM makes this problem harder than it need be. For further details, see MacDonald and Nkalashe (2015).
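For concreteness, here is a minimal sketch of the constrOptim route, under loudly flagged assumptions: negLogLik below is a hypothetical stand-in for the observed-data negative log-likelihood (the actual objective of Zhou et al. (2008) is not reproduced here), and the four constraints shown merely illustrate the form that constrOptim expects.

```r
## A hypothetical three-parameter objective standing in for the observed-data
## negative log-likelihood; the real objective of Zhou et al. (2008) is not
## reproduced here.
negLogLik <- function(theta) {
  sum((theta - c(0.1, 0.2, 0.3))^2) - log(1 - sum(theta))
}

## Four linear inequality constraints, in the form ui %*% theta - ci >= 0
## required by constrOptim: theta_i >= 0 for each i, and
## theta_1 + theta_2 + theta_3 <= 0.9 (an illustrative bound).
ui <- rbind(diag(3), -rep(1, 3))
ci <- c(0, 0, 0, -0.9)

## Start from a point in the interior of the feasible region; with
## grad = NULL, constrOptim uses Nelder-Mead for its inner iterations.
fit <- constrOptim(theta = c(0.2, 0.2, 0.2), f = negLogLik, grad = NULL,
                   ui = ui, ci = ci)
fit$par
```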

A.2 The Altham–Poisson distribution

Leask and Haines (2015) and Leask (2009) considered very fully the possible use of EM for parameter estimation in the case of an Altham–Poisson distribution, that is, a “multiplicative binomial” distribution (Altham 1978) for which n, the number of trials, is taken to have a Poisson distribution. A multiplicative binomial has probability mass function of the form

$$\Pr(Y=y) \propto \binom{n}{y}\, p^y (1-p)^{n-y}\, \theta^{y(n-y)}, \quad y=0,1,\ldots,n,$$

with \(p \in [0,1]\) and \(\theta >0\). They concluded that the formulation of EM for such a distribution is “subtle and somewhat complicated” and, because EM was slow to converge, chose instead to maximize the log-likelihood directly.
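To indicate how little code the direct route requires, the following sketch (mine, not taken from Leask and Haines) evaluates the log-likelihood of a multiplicative binomial with n fixed and known; the full Altham–Poisson likelihood would in addition average over a Poisson-distributed n. The working parameters are \(\mathrm{logit}(p)\) and \(\log \theta\), so that nlm can be applied without constraints.

```r
## Log-likelihood of the multiplicative binomial (Altham 1978) with n fixed
## and known; y is a vector of observed counts in {0, ..., n}. The
## normalizing constant is obtained by summation over the support.
mbLogLik <- function(par, y, n) {
  p     <- plogis(par[1])   # working parameter 1 is logit(p)
  theta <- exp(par[2])      # working parameter 2 is log(theta)
  k     <- 0:n
  logw  <- lchoose(n, k) + k * log(p) + (n - k) * log(1 - p) +
             k * (n - k) * log(theta)
  logc  <- max(logw) + log(sum(exp(logw - max(logw))))  # log normalizing const.
  sum(logw[y + 1]) - length(y) * logc
}

## e.g. fit <- nlm(function(par) -mbLogLik(par, y, n), p = c(0, 0))
```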

A.3 Fitting truncated normal distributions

It is straightforward to evaluate the (density-approximated) likelihood of a sample from a truncated normal distribution and maximize it numerically. Nevertheless, Tian et al. (2018) go to considerable lengths to develop an algorithm of EM type to fit truncated normal distributions. Their two examples of model fitting can easily be carried out instead by DNM, although in one case it is clear that their truncated normal is much inferior to a log-normal model. This problem has been more fully discussed by MacDonald (2018).
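A minimal sketch of the DNM route, assuming the truncation points a and b are known: the truncated-normal log-likelihood is written down directly from dnorm and pnorm, with \(\log \sigma\) as a working parameter so that nlm runs unconstrained.

```r
## Negative log-likelihood of a normal distribution truncated to (a, b),
## with a and b known; par = (mu, log(sigma)).
tnNegLogLik <- function(par, x, a, b) {
  mu    <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(x, mu, sigma, log = TRUE)) +
    length(x) * log(pnorm(b, mu, sigma) - pnorm(a, mu, sigma))
}

## e.g., with x a sample lying in (a, b):
## fit <- nlm(tnNegLogLik, p = c(mean(x), log(sd(x))), x = x, a = a, b = b,
##            hessian = TRUE)
```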

A.4 Maximization of penalized likelihood

Lee and Pawitan (2014) describe how to estimate the variances of estimators based on the maximization of penalized likelihood; they assume that these estimators have been found by EM. MacDonald and Lapham (2016) describe how to accomplish the same and more without using EM, by instead maximizing the penalized likelihood directly. For the two examples of Lee and Pawitan they find the MLEs, their standard errors, and confidence (or credibility) intervals of two types: those of Wald type and those based directly on penalized likelihood, which can differ considerably from those of Wald type.
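The essence of the direct route fits in a few lines. The model below, a ridge-penalized Poisson regression, is a hypothetical illustration (not one of the examples of Lee and Pawitan): one maximizes the penalized log-likelihood numerically and reads Wald-type standard errors off the Hessian that nlm returns.

```r
## Penalized negative log-likelihood of a (hypothetical) Poisson regression
## with a ridge penalty of strength lambda on the coefficients beta.
penNegLogLik <- function(beta, y, X, lambda) {
  eta <- as.vector(X %*% beta)
  -sum(dpois(y, exp(eta), log = TRUE)) + (lambda / 2) * sum(beta^2)
}

## e.g. fit <- nlm(penNegLogLik, p = rep(0, ncol(X)), y = y, X = X,
##                 lambda = 1, hessian = TRUE)
##      se  <- sqrt(diag(solve(fit$hessian)))  # Wald-type standard errors
```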

A.5 Fitting a zero-truncated Poisson

Meng (1997) describes how to use (inter alia) EM to fit a Poisson distribution to count data from which the number of zeros observed is missing. Among the methods he describes is Newton–Raphson. MacDonald (2014a) claims that the apparent failure of Newton–Raphson for certain starting values is due only to the fact that the obvious positivity constraint on the Poisson mean has been ignored. Meng (2014) has replied, but does not agree entirely.
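MacDonald's point is easily made concrete: set \(\lambda = \exp(\eta)\) and the positivity constraint looks after itself. A minimal sketch, with y a vector of strictly positive counts:

```r
## Negative log-likelihood of the zero-truncated Poisson, parameterized by
## eta = log(lambda) so that positivity of the Poisson mean is automatic.
ztpNegLogLik <- function(eta, y) {
  lambda <- exp(eta)
  -sum(dpois(y, lambda, log = TRUE) - log(1 - exp(-lambda)))
}

## e.g. fit        <- nlm(ztpNegLogLik, p = log(mean(y)), y = y)
##      lambda.hat <- exp(fit$estimate)
```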

A.6 The examples of MacDonald (2014b)

MacDonald (2014b) describes a range of examples in which EM has apparently been used unnecessarily and compares EM with DNM. Included are the analysis of ABO blood-group data, and the fitting of Dirichlet distributions. There has been little or no response published which disagrees materially with the conclusions and recommendations of that paper.

A.7 The applications of Zucchini et al. (2016)

The applications chapters of Zucchini et al. (2016, Chapters 15–24) present a wide variety of models of hidden Markov type or similar, some simple, some complex, but all fitted by direct numerical maximization of likelihood. There is little if any mention of EM in those chapters, as it usually seemed redundant or over-complicated—in spite of the strong historical connection between hidden Markov models and EM, and the traditional use of the Baum–Welch algorithm (i.e. EM) in such models. This seems to support the argument that, in some contexts where EM is often used, it is inessential.
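To indicate how compact the DNM route can be even for a hidden Markov model, here is a sketch in the spirit of Zucchini et al. (2016), though not copied from that book: the log-likelihood of a stationary two-state Poisson–HMM, evaluated by the scaled forward recursion and ready to be handed to nlm.

```r
## Negative log-likelihood of a stationary two-state Poisson-HMM, computed
## by the scaled forward recursion. Working parameters: the logs of the two
## Poisson means and the logits of the two off-diagonal transition
## probabilities.
hmmNegLogLik <- function(par, y) {
  lambda <- exp(par[1:2])
  gamma  <- diag(2)
  gamma[1, 2] <- plogis(par[3]); gamma[1, 1] <- 1 - gamma[1, 2]
  gamma[2, 1] <- plogis(par[4]); gamma[2, 2] <- 1 - gamma[2, 1]
  delta <- solve(t(diag(2) - gamma + 1), rep(1, 2))  # stationary distribution
  phi <- delta * dpois(y[1], lambda)
  ll  <- log(sum(phi)); phi <- phi / sum(phi)
  for (t in 2:length(y)) {
    phi <- (phi %*% gamma) * dpois(y[t], lambda)
    ll  <- ll + log(sum(phi)); phi <- phi / sum(phi)
  }
  -ll
}

## e.g. fit <- nlm(hmmNegLogLik, p = c(log(2), log(7), 0, 0), y = y)
```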

B A simple optimization problem with implicit constraint

Boyd and Vandenberghe (2004, Exercise 9.10b) present a very simple minimization problem for which they correctly state that the “pure” (i.e. undamped) Newton method can diverge. The problem is to minimize \(f(x) = x-\log x\). It is straightforward to establish (analytically) that f has a unique minimum at \(x=1\), but if pure Newton is started from \(x_0=3\) (for instance), the algorithm does indeed diverge.

But the very nature of f is such that there is the implicit constraint \(x>0\). One can (and should) therefore replace the problem by the unconstrained problem of minimizing \(g(y) = \exp (y)-y\), or else impose the positivity constraint in some other way. If pure Newton, without any embellishment whatsoever, is then started from \(y_0=\log 3\), it converges very fast to \(y=0\), as it should. Not surprisingly, the unconstrained minimizer nlm, applied to g and started from \(y_0=\log 3\), also converges rapidly to \(y=0\).

If, however, I ignore my own advice, put f (unconstrained) into nlm and start from \(x_0=3\), then nlm emits a few warnings but converges in ten iterations to \(x=0.9999995\); nlm appears to be robust enough to withstand some rough treatment. Nevertheless, if there are constraints, it is unwise to ignore them. And, whatever method one chooses to attempt an optimization problem, there is always the possibility that sufficiently extreme starting values will cause under- or overflow in the objective, and thereby cause the method to fail.
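The two nlm runs just described are easily reproduced; the sketch below assumes nothing beyond base R.

```r
f <- function(x) x - log(x)   # implicit constraint x > 0
g <- function(y) exp(y) - y   # f after the substitution x = exp(y)

nlm(g, p = log(3))  # converges quickly to y = 0, i.e. x = 1
nlm(f, p = 3)       # also reaches x close to 1, after a few warnings
```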


Cite this article

MacDonald, I.L. Is EM really necessary here? Examples where it seems simpler not to use EM. AStA Adv Stat Anal 105, 629–647 (2021). https://doi.org/10.1007/s10182-021-00392-x
