Abstract
If one is to judge by counts of citations of the fundamental paper (Dempster et al. in JRSSB 39: 1–38, 1977), EM algorithms are a runaway success. But it is surprisingly easy to find published applications of EM that are unnecessary, in the sense that there are simpler methods available that will solve the relevant estimation problems. In particular, such problems can often be solved by the simple expedient of submitting the observed-data likelihood (or log-likelihood) to a general-purpose routine for unconstrained optimization. This dispenses with the need to derive and code (or modify) the E and M steps, a process which can sometimes be laborious or error-prone. Here, I discuss six such applications of EM in some detail, and in an appendix describe briefly some others that have already appeared in the literature. Whether these are atypical of applications of EM seems an open question, although one that may be difficult to answer; this question is of relevance to current practice, but may also be of historical interest. But it is clear that there are problems traditionally solved by EM (e.g. the fitting of finite mixtures of distributions) that can also be solved by other means. It is suggested that, before going to the effort of devising an EM algorithm to use on a new problem, the researcher should consider whether other methods (e.g. direct numerical maximization or an MM algorithm of some other kind) may be either simpler to implement or more efficient.
Notes
I have known a reviewer to react with surprise to the claim that one needs only to solve a quadratic equation. I have seen Mathematica invoked to solve the likelihood equation. And there is in the discussion following the paper of Dempster et al. the correct but strange suggestion that one can approximate the likelihood equation by a certain linear equation—in order to render it “easily solvable”!
An anonymous reviewer of one of my papers suggested bluntly, albeit necessarily without evidence, that the reason certain authors choose to use EM is so that they can get a paper published.
References
Altham, P.M.E.: Two generalizations of the binomial distribution. J. R. Stat. Soc. Ser. C 27(2), 162–167 (1978)
Amis, K.: Lucky Jim. Victor Gollancz, London (1954)
Azzalini, A., Bowman, A.W.: A look at some data on the Old Faithful geyser. J. R. Stat. Soc. Ser. C (Applied Statistics) 39, 357–365 (1990)
Balakrishnan, N., Mitra, D.: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. 48, 125–171 (2014)
Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)
Brown, G.O., Buckley, W.S.: Experience rating with Poisson mixtures. Ann. Actuar. Sci. 9(2), 304–321 (2015)
Davison, A.C.: Statistical Models. Cambridge University Press, Cambridge (2003)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B 39, 1–38 (1977)
Diaconis, P.: Some things we’ve learned (about Markov chain Monte Carlo). Bernoulli 19(4), 1294–1305 (2013). https://doi.org/10.3150/12-BEJSP09
Finney, D.J.: The estimation from individual records of the relationship between dose and quantal response. Biometrika 34(3/4), 320–334 (1947)
Fisher, R.A., Balmukand, B.: The estimation of linkage from the offspring of selfed heterozygotes. J. Genet. 20, 79–92 (1928)
Gould, S.J.: The Lying Stones of Marrakech: Penultimate Reflections in Natural History. Belknap Press, Cambridge, MA (2011)
He, Y., Liu, C.: The dynamic expectation-conditional maximization either algorithm. J. R. Stat. Soc. Ser. B 74(2), 313–336 (2012). https://doi.org/10.1111/j.1467-9868.2011.01013.x
Jamshidian, M., Jennrich, R.: Acceleration of the EM algorithm by using quasi-Newton methods. J. R. Stat. Soc. Ser. B 59(3), 569–587 (1997). https://doi.org/10.1111/1467-9868.00083
Kim, D.K., Taylor, J.M.G.: The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters. J. Am. Stat. Ass. 90(430), 708–716 (1995)
Kundu, D., Dey, A.K.: Estimating the parameters of the Marshall-Olkin bivariate Weibull distribution by EM algorithm. Comput. Stat. Data Anal. 53, 956–965 (2009)
Lange, K.: A quasi-Newton acceleration of the EM algorithm. Stat. Sin. 5, 1–18 (1995a)
Lange, K.: A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B 57(2), 425–437 (1995b)
Lange, K.: Mathematical and Statistical Methods for Genetic Analysis, 2nd edn. Springer, New York (2002)
Lange, K.: Numerical Analysis for Statisticians, 2nd edn. Springer, New York (2010)
Lange, K.: MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA (2016)
Langrock, R.: Some applications of nonlinear and non-Gaussian state-space modelling by means of hidden Markov models. J. Appl. Stat. 38(12), 2955–2970 (2011)
Langrock, R., MacDonald, I.L., Zucchini, W.: Some nonstandard stochastic volatility models and their estimation using structured hidden Markov models. J. Empir. Financ. 19, 147–161 (2012)
Leask, K.: Wadley’s problem with overdispersion. PhD thesis, University of KwaZulu–Natal (2009)
Leask, K.L., Haines, L.M.: The Altham-Poisson distribution. Stat. Model. 15(5), 476–497 (2015). https://doi.org/10.1177/1471082X15571161
Lee, W., Pawitan, Y.: Direct calculation of the variance of maximum penalized likelihood estimates via EM algorithm. Am. Stat. 68(2), 93–97 (2014)
Lewandowski, A., Liu, C., Vander Wiel, S.: Parameter expansion and efficient inference. Stat. Sci. 25(4), 533–544 (2010)
Little, R.J.A., Rubin, D.B.: Statistical Analysis of Missing Data, 2nd edn. Wiley, Hoboken, NJ (2002)
Liu, C., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81(4), 633–648 (1994)
Liu, S., Wu, H., Meeker, W.Q.: Understanding and addressing the unbounded likelihood problem. Am. Stat. 69(3), 191–200 (2015)
Liu, C., Rubin, D.B., Wu, Y.N.: Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85(4), 755–770 (1998)
MacDonald, I.L., Korula, F.: Maximum-likelihood estimation for multivariate distributions of Marshall–Olkin type: Two routes simpler than EM, submitted (2020)
MacDonald, I.L.: Does Newton-Raphson really fail? Stat. Methods Med. Res. 23(3), 308–311 (2014a). https://doi.org/10.1177/0962280213497329
MacDonald, I.L.: Numerical maximisation of likelihood: a neglected alternative to EM? Int. Stat. Rev. 82(2), 296–308 (2014b)
MacDonald, I.L.: Fitting truncated normal distributions. Stat. Methods Med. Res. 27(12), 3835–3838 (2018). https://doi.org/10.1177/0962280217712089
MacDonald, I.L., Lapham, B.M.: Even more direct calculation of the variance of a maximum penalized-likelihood estimator. Am. Stat. 70(1), 114–118 (2016)
MacDonald, I.L., Nkalashe, P.: A simple route to maximum-likelihood estimates of two-locus recombination fractions under inequality restrictions. J. Genet. 94, 479–481 (2015)
McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
Meng, X.L., van Dyk, D.: The EM algorithm: An old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B 59(3), 511–540 (1997). https://doi.org/10.1111/1467-9868.00082
Meng, X.L.: The EM algorithm and medical studies: a historical link. Stat. Methods Med. Res. 6, 3–23 (1997)
Meng, X.L.: Response: Did Newton-Raphson really fail? Stat. Methods Med. Res. 23(3), 312–314 (2014)
Morton, N.E.: Genetic studies of Northeastern Brazil. Cold Spring Harbor Symp. Quant. Biol. 29, 69–79 (1964)
Mulinacci, S.: Archimedean-based Marshall-Olkin distributions and related dependence structures. Methodol. Comput. Appl. Probab. 20(1), 205–236 (2018)
Ng, H.K.T., Ye, Z.: Comments: EM-based likelihood inference for some lifetime distributions based on left truncated and right censored data and associated model discrimination. South Afr. Stat. J. 48, 177–180 (2014)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–231 (2014). https://doi.org/10.1561/2400000003
Pawitan, Y.: In All Likelihood. Oxford University Press, Oxford (2001)
Polson, N.G., Scott, J.G., Willard, B.T.: Proximal algorithms in statistics and machine learning. Stat. Sci. 30(4), 559–581 (2015). https://doi.org/10.1214/15-STS530
R Core Team.: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org (2017)
Rao, C.R.: Linear Statistical Inference and its Applications, 2nd edn. Wiley, New York (1973)
Reilly, M., Lawlor, E.: A likelihood-based method of identifying contaminated lots of blood product. Int. J. Epidemiol. 28(4), 787–792 (1999). https://doi.org/10.1093/ije/28.4.787
Shi, N.Z., Zheng, S.R., Guo, J.: The restricted EM algorithm under inequality restrictions on the parameters. J. Multivar. Anal. 92(1), 53–76 (2005). https://doi.org/10.1016/S0047-259X(03)00134-9
Speed, T.P.: Terence’s stuff: my favourite algorithm. Inst. Math. Stat. Bull. 37, 14 (2008)
Springer, T., Urban, K.: Comparison of the EM algorithm and alternatives. Num. Algorithms 67(2), 335–364 (2014). https://doi.org/10.1007/s11075-013-9794-8
Thompson, E.A.: Statistical Inferences from Genetic Data on Pedigrees (NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 6). Institute of Mathematical Statistics, Beachwood, OH (2000)
Tian, G.L., Ju, D., Yuen, K.C., Zhang, C.: New expectation-maximization-type algorithms via stochastic representation for the analysis of truncated normal data with applications in biomedicine. Stat. Methods Med. Res. 27(8), 2459–2477 (2018). https://doi.org/10.1177/0962280216681598
van Dyk, D., Tang, R.: The one-step-late PXEM algorithm. Stat. Comput. 13(2), 137–152 (2003). https://doi.org/10.1023/A:1023256509116
Wu, T.T., Lange, K.: The MM alternative to EM. Stat. Sci. 25, 492–505 (2010)
Yasuda, N.: Estimation of inbreeding coefficient from phenotype frequencies by a method of maximum likelihood scoring. Biometrics 24(4), 915–935 (1968). https://doi.org/10.2307/2528880
Zhou, Y., Shi, N.Z., Fung, W.K., Guo, J.: Maximum likelihood estimates of two-locus recombination fractions under some natural inequality restrictions. BMC Genet. 9(1), 1 (2008)
Zhou, H., Lange, K.: Rating movies and rating the raters who rate them. Am. Stat. 63(4), 297–307 (2009)
Zucchini, W., MacDonald, I.L., Langrock, R.: Hidden Markov Models for Time Series: An Introduction Using R, 2nd edn. Chapman & Hall/CRC Press, Boca Raton, FL (2016)
Acknowledgements
The author thanks the Editor-in-Chief, Associate Editor and reviewers for their helpful and encouraging comments and suggestions. In addition, Dr. Etienne Pienaar is thanked for his many helpful suggestions.
Appendix
A Other published examples
Here, I discuss briefly some previously published examples in which it is apparently simpler not to use EM, plus the applications chapters of Zucchini et al. (2016), all of which use DNM but not EM.
A.1 EM modified to allow for constraints
It is sometimes stated that EM allows automatically for constraints on parameters. For instance, Lange (2010, p. 223) writes that “[...] the EM algorithm handles parameter constraints gracefully. Constraint satisfaction is by definition built into the solution of the M step.” This could, however, be misunderstood to mean all constraints on parameters. EM does indeed handle many constraints automatically, e.g. the nonnegativity and unit-sum constraints on the transition probabilities in a hidden Markov model (Zucchini et al. 2016, p. 72). But see (e.g.) Kim and Taylor (1995) and Shi et al. (2005), the very purpose of which is to modify EM in order to incorporate (respectively) certain linear equality or inequality constraints on parameters.
There has been one very determined but apparently unnecessary attempt to modify EM in order to allow for the linear inequality constraints arising very naturally in one particular problem in genetics, even splitting the M step into seven cases in order to do so (Zhou et al. 2008). There the problem is just one of maximizing a (nonlinear) function of three variables subject to four linear inequality constraints and is easily solved by using the constrained optimizer constrOptim provided by R. EM makes this problem harder than it need be. For further details, see MacDonald and Nkalashe (2015).
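To make this route concrete: the paper uses R's constrOptim, but the same idea can be sketched in Python. The objective and constraints below are invented for illustration (this is not the genetics problem of Zhou et al. 2008); the point is only that a general-purpose constrained optimizer handles a smooth objective of three variables under four linear inequality constraints directly, with no case splitting.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

# Toy stand-in for "maximize a nonlinear function of three variables
# subject to four linear inequality constraints". Objective assumed
# for illustration: maximize log x1 + log x2 + log x3.
def neg_obj(x):
    return -np.sum(np.log(x))

# Four linear inequalities: x_i >= 0.01 (i = 1, 2, 3) and x1 + x2 + x3 <= 1.
A = np.vstack([np.eye(3), np.ones((1, 3))])
lc = LinearConstraint(A, lb=[0.01, 0.01, 0.01, -np.inf],
                      ub=[np.inf, np.inf, np.inf, 1.0])

res = minimize(neg_obj, x0=np.full(3, 0.2), method="trust-constr",
               constraints=[lc])
```

By symmetry and concavity the maximum of this toy objective lies at x = (1/3, 1/3, 1/3), which the optimizer recovers; no E or M step was derived along the way.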
A.2 The Altham–Poisson distribution
Leask and Haines (2015) and Leask (2009) considered very fully the possible use of EM for parameter estimation in the case of an Altham–Poisson distribution, that is, a “multiplicative binomial” distribution (Altham 1978) for which n, the number of trials, is taken to have a Poisson distribution. A multiplicative binomial has probability mass function of the form
$$\Pr (X=x) = \binom{n}{x} p^x (1-p)^{n-x}\, \theta ^{x(n-x)} \Big / \sum _{j=0}^{n} \binom{n}{j} p^j (1-p)^{n-j}\, \theta ^{j(n-j)}, \quad x=0,1,\ldots ,n,$$
with \(p \in [0,1]\) and \(\theta >0\). They concluded that the formulation of EM for such a distribution is “subtle and somewhat complicated” and, because EM was slow to converge, chose instead to maximize the log-likelihood directly.
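The direct-maximization route is short even in code. The sketch below (Python rather than the R used by those authors, with invented toy frequencies, and with n known rather than Poisson-distributed, so it fits Altham's multiplicative binomial rather than the full Altham–Poisson model) maximizes the grouped-data log-likelihood numerically, using a logit transform for p and a log transform for θ so that the optimization is unconstrained.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import comb

n = 5
counts = np.array([2, 7, 13, 18, 25, 35])   # hypothetical frequencies of x = 0..5

def log_pmf(p, theta):
    # multiplicative-binomial log-probabilities, normalized over x = 0..n
    x = np.arange(n + 1)
    w = comb(n, x) * p**x * (1 - p)**(n - x) * theta**(x * (n - x))
    return np.log(w) - np.log(w.sum())

def nll(eta):
    p = 1.0 / (1.0 + np.exp(-eta[0]))        # logit keeps p in (0, 1)
    theta = np.exp(eta[1])                   # log keeps theta > 0
    return -np.sum(counts * log_pmf(p, theta))

res = minimize(nll, x0=np.zeros(2), method="BFGS")
```

No E step, M step, or missing-data formulation is needed; the two transformed parameters are simply handed to an unconstrained optimizer.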
A.3 Fitting truncated normal distributions
It is straightforward to evaluate the (density-approximated) likelihood of a sample from a truncated normal distribution and maximize it numerically. Nevertheless, Tian et al. (2018) go to considerable lengths to develop an algorithm of EM type to fit truncated normal distributions. Their two examples of model fitting can easily be carried out instead by DNM, although in one case it is clear that their truncated normal is much inferior to a log-normal model. This problem has been more fully discussed by MacDonald (2018).
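As a hedged illustration of how little is involved, here is a Python sketch on simulated data (the left-truncation point a = 0 and the generating parameters are assumptions of the example): the truncated-normal log-likelihood is written down and handed to an unconstrained optimizer, with σ log-transformed to keep it positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
a = 0.0                          # known left-truncation point (assumed)
# simulate truncated-normal data by rejection from N(2, 1.5^2)
raw = rng.normal(2.0, 1.5, size=5000)
x = raw[raw > a][:1000]

def nll(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)    # positivity imposed by reparameterization
    # log-density of N(mu, sigma^2) truncated to (a, infinity)
    return -np.sum(norm.logpdf(x, mu, sigma) - norm.logsf(a, mu, sigma))

res = minimize(nll, x0=np.array([np.mean(x), np.log(np.std(x))]),
               method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

With a two-sided truncation one would subtract log(Φ((b−μ)/σ) − Φ((a−μ)/σ)) instead of the log survivor function; nothing else changes.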
A.4 Maximization of penalized likelihood
Lee and Pawitan (2014) describe how to estimate the variances of estimators based on the maximization of penalized likelihood; they assume that these estimators have been found by EM. MacDonald and Lapham (2016) describe how to accomplish the same and more without using EM, by instead maximizing the penalized likelihood directly. For the two examples of Lee and Pawitan they find the MLEs, their standard errors, and confidence (or credibility) intervals of two types: those of Wald type and those based directly on penalized likelihood, which can differ considerably from those of Wald type.
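A minimal sketch of the direct route, on an invented toy problem rather than either of Lee and Pawitan's examples: a ridge-penalized normal-mean model, with the penalized log-likelihood maximized numerically and a Wald-type standard error read off from the optimizer's approximate inverse Hessian.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.normal(1.5, 1.0, size=50)
lam = 2.0                        # fixed penalty weight (assumed known)

def neg_pen_loglik(theta):
    mu = theta[0]
    # N(mu, 1) negative log-likelihood plus quadratic penalty on mu
    return 0.5 * np.sum((y - mu) ** 2) + 0.5 * lam * mu ** 2

res = minimize(neg_pen_loglik, x0=np.array([0.0]), method="BFGS")
mu_hat = res.x[0]
# Wald-type standard error from BFGS's approximate inverse Hessian
se = np.sqrt(res.hess_inv[0, 0])
```

For this toy model the maximizer is available in closed form, namely sum(y)/(n + λ), which the numerical route reproduces; in realistic problems only the numerical route is available, and it delivers the variance estimate as a by-product.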
A.5 Fitting a zero-truncated Poisson
Meng (1997) describes how to use (inter alia) EM to fit a Poisson distribution to count data from which the number of zeros observed is missing. Among the methods he describes is Newton–Raphson. MacDonald (2014a) claims that the apparent failure of Newton–Raphson for certain starting-values is due only to the fact that the obvious positivity constraint on the Poisson mean has been ignored. Meng (2014) has replied, but does not agree entirely.
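The positivity point can be made concrete in a few lines of Python (toy counts invented for illustration): write the zero-truncated Poisson log-likelihood in terms of η = log λ, so the positivity constraint is built in and any unconstrained optimizer, or indeed Newton–Raphson, can safely be let loose.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# hypothetical zero-truncated Poisson counts (no zeros observable)
x = np.array([1, 1, 1, 2, 2, 3, 1, 4, 2, 1, 1, 2, 3, 1, 2])

def nll(eta):
    lam = np.exp(eta)            # positivity of the Poisson mean built in
    # zero-truncated Poisson negative log-likelihood, constants dropped
    return -(np.sum(x) * eta - len(x) * lam
             - len(x) * np.log1p(-np.exp(-lam)))

res = minimize_scalar(nll)       # unconstrained in eta = log(lambda)
lam_hat = np.exp(res.x)
```

At the maximum the likelihood equation requires the sample mean to equal λ/(1 − e^(−λ)), which provides a direct check on the fitted value.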
A.6 The examples of MacDonald (2014b)
MacDonald (2014b) describes a range of examples in which EM has apparently been used unnecessarily and compares EM with DNM. Included are the analysis of ABO blood-group data, and the fitting of Dirichlet distributions. There has been little or no response published which disagrees materially with the conclusions and recommendations of that paper.
A.7 The applications of Zucchini et al. (2016)
The applications chapters of Zucchini et al. (2016, Chapters 15–24) present a wide variety of models of hidden Markov type or similar, some simple, some complex, but all fitted by direct numerical maximization of likelihood. There is little if any mention of EM in those chapters, as it usually seemed redundant or over-complicated—in spite of the strong historical connection between hidden Markov models and EM, and the traditional use of the Baum–Welch algorithm (i.e. EM) in such models. This seems to support the argument that, in some contexts where EM is often used, it is inessential.
B A simple optimization problem with implicit constraint
Boyd and Vandenberghe (2004, Exercise 9.10b) present a very simple minimization problem for which they correctly state that the “pure” (i.e. undamped) Newton method can diverge. The problem is to minimize \(f(x) = x-\log x\). It is straightforward to establish (analytically) that f has a unique minimum at \(x=1\), but if pure Newton is started from \(x_0=3\) (for instance), the algorithm does indeed diverge.
But the very nature of f is such that there is the implicit constraint \(x>0\). One can (and should) therefore replace the problem by the unconstrained problem of minimizing \(g(y) = \exp (y)-y\)—or else impose the positivity constraint in some other way. If pure Newton, without any embellishment whatsoever, is then started from \(y_0=\log 3\), it converges very fast to \(y=0\), as it should. Not surprisingly, the unconstrained minimizer nlm, applied to g and starting from \(y_0=\log 3\), also converges very fast to \(y=0\).
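Both behaviours can be verified by hand, in Python here rather than R, with the Newton iterations written out explicitly.

```python
import math

def newton_step_f(x):
    # one pure Newton step for f(x) = x - log x:
    # x_new = x - f'(x)/f''(x) = x - (1 - 1/x) / (1/x^2) = 2x - x^2
    return 2.0 * x - x * x

x = 3.0
for _ in range(3):
    x = newton_step_f(x)         # 3 -> -3 -> -15 -> -255: leaves the domain x > 0

# reparameterize: x = exp(y), and minimize g(y) = exp(y) - y instead;
# g'(y) = exp(y) - 1, g''(y) = exp(y)
y = math.log(3.0)
for _ in range(10):
    y = y - (math.exp(y) - 1.0) / math.exp(y)   # converges rapidly to y = 0
```

Three pure Newton steps on f already take the iterate to −255, far outside the domain, whereas on g the iteration converges quadratically to y = 0, i.e. x = 1.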
If, however, I ignore my own advice, put f (unconstrained) into nlm, and start from \(x_0=3\), with a few warnings nlm converges in ten iterations to \(x=0.9999995\). nlm appears to be sufficiently robust to withstand some rough treatment. Nevertheless, if there are constraints it is unwise to ignore them. And, whatever method one chooses to attempt an optimization problem, there is always the possibility that sufficiently extreme starting values will cause under- or overflow in the objective, and thereby cause the method to fail.
Cite this article
MacDonald, I.L. Is EM really necessary here? Examples where it seems simpler not to use EM. AStA Adv Stat Anal 105, 629–647 (2021). https://doi.org/10.1007/s10182-021-00392-x