## Abstract

If one is to judge by counts of citations of the fundamental paper (Dempster et al. in J. R. Stat. Soc. Ser. B 39:1–38, 1977), EM algorithms are a runaway success. But it is surprisingly easy to find published applications of EM that are unnecessary, in the sense that there are simpler methods available that will solve the relevant estimation problems. In particular, such problems can often be solved by the simple expedient of submitting the observed-data likelihood (or log-likelihood) to a general-purpose routine for unconstrained optimization. This can dispense with the need to derive and code (or modify) the E and M steps, a process which can sometimes be laborious or error-prone. Here, I discuss six such applications of EM in some detail, and in an appendix describe briefly some others that have already appeared in the literature. Whether these are atypical of applications of EM seems an open question, although one that may be difficult to answer; this question is of relevance to current practice, but may also be of historical interest. But it is clear that there are problems traditionally solved by EM (e.g. the fitting of finite mixtures of distributions) that can also be solved by other means. It is suggested that, before going to the effort of devising an EM algorithm to use on a new problem, the researcher should consider whether other methods (e.g. direct numerical maximization or an MM algorithm of some other kind) may be either simpler to implement or more efficient.


## Notes

I have known a reviewer to react with surprise to the claim that one needs only to solve a quadratic equation. I have seen Mathematica invoked to solve the likelihood equation. And in the discussion following the paper of Dempster et al. (1977) there is the correct but strange suggestion that one can approximate the likelihood equation by a certain linear equation, in order to render it “easily solvable”!

An anonymous reviewer of one of my papers suggested bluntly, albeit necessarily without evidence, that the reason certain authors choose to use EM is so that they can get a paper published.

## References

Altham, P.M.E.: Two generalizations of the binomial distribution. J. R. Stat. Soc. Ser. C **27**(2), 162–167 (1978)

Amis, K.: Lucky Jim. Victor Gollancz, London (1954)

Azzalini, A., Bowman, A.W.: A look at some data on the Old Faithful geyser. J. R. Stat. Soc. Ser. C **39**, 357–365 (1990)

Balakrishnan, N., Mitra, D.: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. **48**, 125–171 (2014)

Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)

Brown, G.O., Buckley, W.S.: Experience rating with Poisson mixtures. Ann. Actuar. Sci. **9**(2), 304–321 (2015)

Davison, A.C.: Statistical Models. Cambridge University Press, Cambridge (2003)

Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B **39**, 1–38 (1977)

Diaconis, P.: Some things we’ve learned (about Markov chain Monte Carlo). Bernoulli **19**(4), 1294–1305 (2013). https://doi.org/10.3150/12-BEJSP09

Finney, D.J.: The estimation from individual records of the relationship between dose and quantal response. Biometrika **34**(3/4), 320–334 (1947)

Fisher, R.A., Balmukand, B.: The estimation of linkage from the offspring of selfed heterozygotes. J. Genet. **20**, 79–92 (1928)

Gould, S.J.: The Lying Stones of Marrakech: Penultimate Reflections in Natural History. Belknap Press, Cambridge, MA (2011)

He, Y., Liu, C.: The dynamic expectation-conditional maximization either algorithm. J. R. Stat. Soc. Ser. B **74**(2), 313–336 (2012). https://doi.org/10.1111/j.1467-9868.2011.01013.x

Jamshidian, M., Jennrich, R.: Acceleration of the EM algorithm by using quasi-Newton methods. J. R. Stat. Soc. Ser. B **59**(3), 569–587 (1997). https://doi.org/10.1111/1467-9868.00083

Kim, D.K., Taylor, J.M.G.: The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters. J. Am. Stat. Assoc. **90**(430), 708–716 (1995)

Kundu, D., Dey, A.K.: Estimating the parameters of the Marshall–Olkin bivariate Weibull distribution by EM algorithm. Comput. Stat. Data Anal. **53**, 956–965 (2009)

Lange, K.: A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B **57**(2), 425–437 (1995b)

Lange, K.: A quasi-Newton acceleration of the EM algorithm. Stat. Sin. **5**, 1–18 (1995a)

Lange, K.: Mathematical and Statistical Methods for Genetic Analysis, 2nd edn. Springer, New York (2002)

Lange, K.: Numerical Analysis for Statisticians, 2nd edn. Springer, New York (2010)

Lange, K.: MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA (2016)

Langrock, R.: Some applications of nonlinear and non-Gaussian state-space modelling by means of hidden Markov models. J. Appl. Stat. **38**(12), 2955–2970 (2011)

Langrock, R., MacDonald, I.L., Zucchini, W.: Some nonstandard stochastic volatility models and their estimation using structured hidden Markov models. J. Empir. Financ. **19**, 147–161 (2012)

Leask, K.: Wadley’s problem with overdispersion. PhD thesis, University of KwaZulu-Natal (2009)

Leask, K.L., Haines, L.M.: The Altham–Poisson distribution. Stat. Model. **15**(5), 476–497 (2015). https://doi.org/10.1177/1471082X15571161

Lee, W., Pawitan, Y.: Direct calculation of the variance of maximum penalized likelihood estimates via EM algorithm. Am. Stat. **68**(2), 93–97 (2014)

Lewandowski, A., Liu, C., Vander Wiel, S.: Parameter expansion and efficient inference. Stat. Sci. **25**(4), 533–544 (2010)

Little, R.J.A., Rubin, D.B.: Statistical Analysis of Missing Data, 2nd edn. Wiley, Hoboken, NJ (2002)

Liu, C., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika **81**(4), 633–648 (1994)

Liu, S., Wu, H., Meeker, W.Q.: Understanding and addressing the unbounded likelihood problem. Am. Stat. **69**(3), 191–200 (2015)

Liu, C., Rubin, D.B., Wu, Y.N.: Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika **85**(4), 755–770 (1998)

MacDonald, I.L., Korula, F.: Maximum-likelihood estimation for multivariate distributions of Marshall–Olkin type: two routes simpler than EM. Submitted (2020)

MacDonald, I.L.: Does Newton-Raphson really fail? Stat. Methods Med. Res. **23**(3), 308–311 (2014a). https://doi.org/10.1177/0962280213497329

MacDonald, I.L.: Numerical maximisation of likelihood: a neglected alternative to EM? Int. Stat. Rev. **82**(2), 296–308 (2014b)

MacDonald, I.L.: Fitting truncated normal distributions. Stat. Methods Med. Res. **27**(12), 3835–3838 (2018). https://doi.org/10.1177/0962280217712089

MacDonald, I.L., Lapham, B.M.: Even more direct calculation of the variance of a maximum penalized-likelihood estimator. Am. Stat. **70**(1), 114–118 (2016)

MacDonald, I.L., Nkalashe, P.: A simple route to maximum-likelihood estimates of two-locus recombination fractions under inequality restrictions. J. Genet. **94**, 479–481 (2015)

McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)

Meng, X.L., van Dyk, D.: The EM algorithm: an old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B **59**(3), 511–540 (1997). https://doi.org/10.1111/1467-9868.00082

Meng, X.L.: The EM algorithm and medical studies: a historical link. Stat. Methods Med. Res. **6**, 3–23 (1997)

Meng, X.L.: Response: Did Newton-Raphson really fail? Stat. Methods Med. Res. **23**(3), 312–314 (2014)

Morton, N.E.: Genetic studies of Northeastern Brazil. Cold Spring Harbor Symp. Quant. Biol. **29**, 69–79 (1964)

Mulinacci, S.: Archimedean-based Marshall–Olkin distributions and related dependence structures. Methodol. Comput. Appl. Probab. **20**(1), 205–236 (2018)

Ng, H.K.T., Ye, Z.: Comments: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. **48**, 177–180 (2014)

Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. **1**(3), 127–231 (2014). https://doi.org/10.1561/2400000003

Pawitan, Y.: In All Likelihood. Oxford University Press, Oxford (2001)

Polson, N.G., Scott, J.G., Willard, B.T.: Proximal algorithms in statistics and machine learning. Stat. Sci. **30**(4), 559–581 (2015). https://doi.org/10.1214/15-STS530

R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org (2017)

Rao, C.R.: Linear Statistical Inference and its Applications, 2nd edn. Wiley, New York (1973)

Reilly, M., Lawlor, E.: A likelihood-based method of identifying contaminated lots of blood product. Int. J. Epidemiol. **28**(4), 787–792 (1999). https://doi.org/10.1093/ije/28.4.787

Shi, N.Z., Zheng, S.R., Guo, J.: The restricted EM algorithm under inequality restrictions on the parameters. J. Multivar. Anal. **92**(1), 53–76 (2005). https://doi.org/10.1016/S0047-259X(03)00134-9

Speed, T.P.: Terence’s stuff: my favourite algorithm. Inst. Math. Stat. Bull. **37**, 14 (2008)

Springer, T., Urban, K.: Comparison of the EM algorithm and alternatives. Numer. Algorithms **67**(2), 335–364 (2014). https://doi.org/10.1007/s11075-013-9794-8

Thompson, E.A.: Statistical Inference from Genetic Data on Pedigrees (NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 6). Institute of Mathematical Statistics, Beachwood, OH (2000)

Tian, G.L., Ju, D., Yuen, K.C., Zhang, C.: New expectation-maximization-type algorithms via stochastic representation for the analysis of truncated normal data with applications in biomedicine. Stat. Methods Med. Res. **27**(8), 2459–2477 (2018). https://doi.org/10.1177/0962280216681598

van Dyk, D., Tang, R.: The one-step-late PXEM algorithm. Stat. Comput. **13**(2), 137–152 (2003). https://doi.org/10.1023/A:1023256509116

Wu, T.T., Lange, K.: The MM alternative to EM. Stat. Sci. **25**, 492–505 (2010)

Yasuda, N.: Estimation of inbreeding coefficient from phenotype frequencies by a method of maximum likelihood scoring. Biometrics **24**(4), 915–935 (1968). https://doi.org/10.2307/2528880

Zhou, Y., Shi, N.Z., Fung, W.K., Guo, J.: Maximum likelihood estimates of two-locus recombination fractions under some natural inequality restrictions. BMC Genet. **9**(1), 1 (2008)

Zhou, H., Lange, K.: Rating movies and rating the raters who rate them. Am. Stat. **63**(4), 297–307 (2009)

Zucchini, W., MacDonald, I.L., Langrock, R.: Hidden Markov Models for Time Series: An Introduction Using R, 2nd edn. Chapman & Hall/CRC Press, Boca Raton, FL (2016)

## Acknowledgements

The author thanks the Editor-in-Chief, Associate Editor and reviewers for their helpful and encouraging comments and suggestions. In addition, Dr. Etienne Pienaar is thanked for his many helpful suggestions.



## Appendix


### A Other published examples

Here, I discuss briefly some previously published examples in which it is apparently simpler not to use EM, plus the applications chapters of Zucchini et al. (2016), all of which use DNM but not EM.

#### A.1 EM modified to allow for constraints

It is sometimes stated that EM allows automatically for constraints on parameters. For instance, Lange (2010, p. 223) writes that “[...] the EM algorithm handles parameter constraints gracefully. Constraint satisfaction is by definition built into the solution of the M step.” This could, however, be misunderstood to mean *all* constraints on parameters. EM does indeed handle many constraints automatically, e.g. the nonnegativity and unit-sum constraints on the transition probabilities in a hidden Markov model (Zucchini et al. 2016, p. 72). But see (e.g.) Kim and Taylor (1995) and Shi et al. (2005), the very purpose of which is to modify EM in order to incorporate (respectively) certain linear equality or inequality constraints on parameters.

There has been one very determined but apparently unnecessary attempt to modify EM in order to allow for the linear inequality constraints arising very naturally in one particular problem in genetics; the M step was even split into seven cases in order to do so (Zhou et al. 2008). The problem is just that of maximizing a (nonlinear) function of three variables subject to four linear inequality constraints, and it is easily solved by the constrained optimizer constrOptim provided by R, as sketched below. EM makes this problem harder than it need be. For further details, see MacDonald and Nkalashe (2015).
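By way of illustration, here is a minimal sketch of the constrOptim route. The objective is a placeholder standing in for the (negated) log-likelihood of Zhou et al. (2008), and the constraint matrices are illustrative rather than taken from that paper:

```r
## Maximize a smooth function of theta = (r1, r2, r12) subject to four
## linear inequality constraints ui %*% theta >= ci. The objective below
## is a placeholder, not the actual likelihood of Zhou et al. (2008).
negloglik <- function(theta) sum((theta - c(0.1, 0.2, 0.05))^2)

## Illustrative constraints: r12 >= 0, r12 <= r1, r12 <= r2, r1 + r2 - r12 <= 1.
ui <- rbind(c( 0,  0,  1),
            c( 1,  0, -1),
            c( 0,  1, -1),
            c(-1, -1,  1))
ci <- c(0, 0, 0, -1)

## constrOptim minimizes, so we pass the negated log-likelihood; the
## starting value must satisfy the constraints strictly.
fit <- constrOptim(theta = c(0.2, 0.2, 0.1), f = negloglik, grad = NULL,
                   ui = ui, ci = ci)
fit$par
```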

#### A.2 The Altham–Poisson distribution

Leask and Haines (2015) and Leask (2009) considered very fully the possible use of EM for parameter estimation in the case of an Altham–Poisson distribution, that is, a “multiplicative binomial” distribution (Altham 1978) for which *n*, the number of trials, is taken to have a Poisson distribution. A multiplicative binomial has probability mass function of the form

$$\Pr(X = x) \propto \binom{n}{x}\, p^x (1-p)^{n-x}\, \theta^{x(n-x)}, \qquad x = 0, 1, \ldots, n,$$

with \(p \in [0,1]\) and \(\theta >0\), the constant of proportionality being such that the probabilities sum to one. They concluded that the formulation of EM for such a distribution is “subtle and somewhat complicated” and, because EM was slow to converge, chose instead to maximize the log-likelihood directly.
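As an indication of how simple the direct route can be, here is a minimal sketch for the multiplicative binomial with *n* fixed (the Altham–Poisson case adds a sum over the Poisson-distributed *n*). The function names and the toy data are mine; the transforms plogis and exp render the optimization unconstrained:

```r
## pmf of Altham's multiplicative binomial, normalized numerically
dmultbinom <- function(x, n, p, theta) {
  k <- 0:n
  w <- choose(n, k) * p^k * (1 - p)^(n - k) * theta^(k * (n - k))
  w[x + 1] / sum(w)   # normalize so the probabilities sum to one
}

## negative log-likelihood on the unconstrained working scale
negloglik <- function(par, x, n) {
  p <- plogis(par[1]); theta <- exp(par[2])
  -sum(log(dmultbinom(x, n, p, theta)))
}

set.seed(1)
x <- rbinom(200, size = 10, prob = 0.3)   # toy data (the theta = 1 case)
fit <- nlm(negloglik, p = c(0, 0), x = x, n = 10)
c(p = plogis(fit$estimate[1]), theta = exp(fit$estimate[2]))
```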

#### A.3 Fitting truncated normal distributions

It is straightforward to evaluate the (density-approximated) likelihood of a sample from a truncated normal distribution and maximize it numerically. Nevertheless, Tian et al. (2018) go to considerable lengths to develop an algorithm of EM type to fit truncated normal distributions. Their two examples of model fitting can easily be carried out instead by DNM, although in one case it is clear that their truncated normal is much inferior to a log-normal model. This problem has been more fully discussed by MacDonald (2018).
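A minimal sketch of that direct route, on toy data and with \(\sigma\) kept positive via the log transform:

```r
## Direct numerical maximization for a normal distribution truncated
## to (a, b): subtract from each log-density term the log-probability
## of the truncation interval.
negloglik <- function(par, x, a, b) {
  mu <- par[1]; sigma <- exp(par[2])   # sigma > 0 via log transform
  -sum(dnorm(x, mu, sigma, log = TRUE)) +
    length(x) * log(pnorm(b, mu, sigma) - pnorm(a, mu, sigma))
}

set.seed(2)
y <- rnorm(500, mean = 1, sd = 2)
x <- y[y > 0 & y < 5]                  # toy data truncated to (0, 5)
fit <- nlm(negloglik, p = c(mean(x), log(sd(x))), x = x, a = 0, b = 5)
c(mu = fit$estimate[1], sigma = exp(fit$estimate[2]))
```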

#### A.4 Maximization of penalized likelihood

Lee and Pawitan (2014) describe how to estimate the variances of estimators based on the maximization of penalized likelihood; they assume that these estimators have been found by EM. MacDonald and Lapham (2016) describe how to accomplish the same, and more, without using EM, by instead maximizing the penalized likelihood directly. For the two examples of Lee and Pawitan they find the MLEs, their standard errors, and confidence (or credibility) intervals of two types: those of Wald type, and those based directly on the penalized likelihood; the latter can differ considerably from the former.
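A minimal sketch of the direct route, with Wald-type standard errors read off from the Hessian that nlm returns; the normal model and the ridge-type penalty here are illustrative choices of mine, not the penalties of Lee and Pawitan (2014):

```r
## Maximize a penalized log-likelihood directly; nlm's Hessian of the
## negative penalized log-likelihood gives Wald-type standard errors.
negpenloglik <- function(par, x, lambda) {
  mu <- par[1]; sigma <- exp(par[2])   # sigma > 0 via log transform
  -sum(dnorm(x, mu, sigma, log = TRUE)) + lambda * mu^2
}

set.seed(3)
x <- rnorm(100, mean = 2, sd = 1)
fit <- nlm(negpenloglik, p = c(0, 0), x = x, lambda = 1, hessian = TRUE)
## standard errors on the working scale (mu, log sigma)
se <- sqrt(diag(solve(fit$hessian)))
rbind(estimate = fit$estimate, std.error = se)
```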

#### A.5 Fitting a zero-truncated Poisson

Meng (1997) describes how to use (*inter alia*) EM to fit a Poisson distribution to count data from which the number of zeros observed is missing. Among the methods he describes is Newton–Raphson. MacDonald (2014a) claims that the apparent failure of Newton–Raphson for certain starting-values is due only to the fact that the obvious positivity constraint on the Poisson mean has been ignored. Meng (2014) has replied, but does not agree entirely.
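A minimal sketch of DNM for this model, with the positivity constraint imposed by the log transform (the point at issue in MacDonald 2014a); the toy data are mine:

```r
## Fit a zero-truncated Poisson by DNM; writing lambda = exp(eta)
## imposes the positivity constraint on the Poisson mean. The
## conditional pmf is dpois(x, lambda) / (1 - exp(-lambda)) for x >= 1.
negloglik <- function(eta, x) {
  lambda <- exp(eta)
  -sum(dpois(x, lambda, log = TRUE)) + length(x) * log(1 - exp(-lambda))
}

x <- rep(1:4, times = c(32, 16, 6, 1))   # toy counts, zeros unobserved
fit <- nlm(negloglik, p = log(mean(x)), x = x)
exp(fit$estimate)   # MLE of the Poisson mean
```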

#### A.6 The examples of MacDonald (2014b)

MacDonald (2014b) describes a range of examples in which EM has apparently been used unnecessarily, and compares EM with DNM. Included are the analysis of ABO blood-group data and the fitting of Dirichlet distributions. Little if any published response has disagreed materially with the conclusions and recommendations of that paper.

#### A.7 The applications of Zucchini et al. (2016)

The applications chapters of Zucchini et al. (2016, Chapters 15–24) present a wide variety of models of hidden Markov type or similar, some simple, some complex, but all fitted by direct numerical maximization of likelihood. There is little if any mention of EM in those chapters, as it usually seemed redundant or over-complicated; this in spite of the strong historical connection between hidden Markov models and EM, and the traditional use of the Baum–Welch algorithm (i.e. EM) to fit such models. This seems to support the argument that, in some contexts where EM is routinely used, it is inessential; a sketch of the direct route appears below.
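As an indication of what such direct maximization involves, here is a minimal sketch for a two-state Poisson hidden Markov model, along the lines of Zucchini et al. (2016) but not their code: the log-likelihood is computed by the scaled forward recursion and handed to nlm, with all parameters transformed so that the optimization is unconstrained.

```r
## Negative log-likelihood of a 2-state Poisson HMM via the scaled
## forward recursion; all parameters unconstrained on the working scale.
negloglik <- function(par, x) {
  lambda <- exp(par[1:2])                        # state-dependent means
  g12 <- plogis(par[3]); g21 <- plogis(par[4])   # off-diagonal transition probs
  Gamma <- matrix(c(1 - g12, g12,
                    g21, 1 - g21), 2, 2, byrow = TRUE)
  delta <- solve(t(diag(2) - Gamma + 1), rep(1, 2))  # stationary distribution
  phi <- delta * dpois(x[1], lambda)
  ll <- log(sum(phi)); phi <- phi / sum(phi)
  for (i in 2:length(x)) {
    phi <- as.vector(phi %*% Gamma) * dpois(x[i], lambda)
    ll <- ll + log(sum(phi)); phi <- phi / sum(phi)
  }
  -ll
}

set.seed(4)
x <- rpois(500, rep(c(1, 8), each = 50, times = 5))  # toy data, alternating regimes
fit <- nlm(negloglik, p = c(log(2), log(6), 0, 0), x = x)
exp(fit$estimate[1:2])   # estimated state-dependent means
```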

### B A simple optimization problem with implicit constraint

Boyd and Vandenberghe (2004, Exercise 9.10b) present a very simple minimization problem for which they correctly state that the “pure” (i.e. undamped) Newton method can diverge. The problem is to minimize \(f(x) = x-\log x\). It is straightforward to establish (analytically) that *f* has a unique minimum at \(x=1\), but if pure Newton is started from \(x_0=3\) (for instance), the algorithm does indeed diverge.

But the very nature of *f* is such that there is the implicit constraint \(x>0\). One can (and should) therefore remove the constraint by the substitution \(x=\exp(y)\), which replaces the problem by the unconstrained minimization of \(g(y) = f(\exp(y)) = \exp (y)-y\); alternatively, one can impose the positivity constraint in some other way. If pure Newton, without any embellishment whatsoever, is then started from \(y_0=\log 3\), it converges very fast to \(y=0\), as it should. Not surprisingly, the unconstrained minimizer nlm, applied to *g* and starting from \(y_0=\log 3\), also converges very fast to \(y=0\).
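The two Newton iterations are easily checked; here is a minimal sketch in R (the helper function newton is mine):

```r
## Pure (undamped) Newton for one-dimensional minimization.
newton <- function(start, d1, d2, n) {
  x <- start
  for (i in 1:n) x <- x - d1(x) / d2(x)
  x
}

## On f(x) = x - log(x): f'(x) = 1 - 1/x, f''(x) = 1/x^2.
## From x0 = 3 the iterates are -3, -15, ... : divergence.
newton(3, function(x) 1 - 1/x, function(x) 1/x^2, n = 3)

## On g(y) = exp(y) - y: g'(y) = exp(y) - 1, g''(y) = exp(y).
## From y0 = log(3) the iterates converge very fast to 0.
newton(log(3), function(y) exp(y) - 1, function(y) exp(y), n = 10)
```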

If, however, I ignore my own advice, put *f* (unconstrained) into nlm, and start from \(x_0=3\), nlm converges, with a few warnings, in ten iterations to \(x=0.9999995\); nlm appears to be sufficiently robust to withstand some rough treatment. Nevertheless, if there are constraints it is unwise to ignore them. And, whatever method one chooses for an optimization problem, there is always the possibility that sufficiently extreme starting values will cause under- or overflow in the objective, and thereby cause the method to fail.


## About this article

### Cite this article

MacDonald, I.L. Is EM really necessary here? Examples where it seems simpler not to use EM.
*AStA Adv Stat Anal* **105**, 629–647 (2021). https://doi.org/10.1007/s10182-021-00392-x
