Abstract
Fast incremental expectation maximization (FIEM) is a version of the EM framework designed for large datasets. In this paper, we first recast FIEM and other incremental EM-type algorithms in the Stochastic Approximation within EM framework. We then provide nonasymptotic bounds on the convergence in expectation as a function of the number of examples n and of the maximal number of iterations \(K_\mathrm {max}\). We propose two strategies for achieving an \(\epsilon \)-approximate stationary point, with \(K_\mathrm {max}= O(n^{2/3}/\epsilon )\) and \(K_\mathrm {max}= O(\sqrt{n}/\epsilon ^{3/2})\) respectively, both relying on a random termination rule before \(K_\mathrm {max}\) and on a constant step size in the Stochastic Approximation step. Our bounds improve on the literature in two ways. First, they allow \(K_\mathrm {max}\) to scale as \(\sqrt{n}\), which improves on \(n^{2/3}\), the best rate obtained so far; this comes at the cost of a stronger dependence on the tolerance \(\epsilon \), making this control relevant for small to medium accuracy relative to the number of examples n. Second, for the \(n^{2/3}\)-rate, numerical illustrations show that, thanks to an optimized choice of the step size and to bounds expressed in terms of quantities characterizing the optimization problem at hand, our results yield a less conservative choice of the step size and a better control of the convergence in expectation.
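The ingredients named in the abstract — a per-example memory of sufficient statistics, a variance-reduced drift, a constant step size in the Stochastic Approximation step, and a random termination rule before \(K_\mathrm {max}\) — can be sketched as follows. This is a minimal Python illustration on a toy two-component Gaussian mixture (equal weights, unit variances); all function and variable names are ours, and the recursion is a simplified reading of a FIEM-style update, not the authors' MATLAB implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two-component 1-D Gaussian mixture, equal weights, unit variances.
n = 500
y = np.concatenate([rng.normal(-2.0, 1.0, n // 2), rng.normal(2.0, 1.0, n // 2)])

def suff_stat(theta, idx):
    """Per-example E-step statistic s_i(theta) = (r, r*y_i, (1-r)*y_i),
    where r is the responsibility of component 1."""
    mu1, mu2 = theta
    yi = y[idx]
    log_odds = -0.5 * (yi - mu1) ** 2 + 0.5 * (yi - mu2) ** 2
    r = 0.5 * (1.0 + np.tanh(0.5 * log_odds))   # numerically stable sigmoid
    return np.array([r, r * yi, (1.0 - r) * yi])

def m_step(S):
    """Map the averaged statistic to new means (weights and variances held fixed)."""
    r_bar, ry_bar, qy_bar = S
    return np.array([ry_bar / max(r_bar, 1e-12), qy_bar / max(1.0 - r_bar, 1e-12)])

def fiem(theta0, K_max, gamma):
    theta = np.asarray(theta0, dtype=float)
    mem = np.stack([suff_stat(theta, i) for i in range(n)])  # per-example memory
    S_tilde = mem.mean(axis=0)    # running average of the memory
    S_hat = S_tilde.copy()        # statistic driven by the SA recursion
    R = rng.integers(K_max)       # random termination index, uniform on {0,...,K_max-1}
    theta_R = theta.copy()
    for k in range(K_max):
        if k == R:
            theta_R = theta.copy()                     # iterate reported on termination
        i, j = rng.integers(n), rng.integers(n)        # two independent indices
        fresh_i = suff_stat(theta, i)                  # refresh the memory at index i
        S_tilde = S_tilde + (fresh_i - mem[i]) / n
        mem[i] = fresh_i
        drift = suff_stat(theta, j) - mem[j] + S_tilde # variance-reduced estimate
        S_hat = S_hat + gamma * (drift - S_hat)        # SA step, constant step size
        theta = m_step(S_hat)
    return theta_R, theta

theta_R, theta_final = fiem([-0.5, 0.5], K_max=20000, gamma=0.05)
```

With a constant step size \(\gamma\), the recursion drives \(\widehat{S}\) toward the full-sample mean statistic while the memory keeps the per-iteration noise of the drift term small; the paper's bounds concern the iterate `theta_R` returned at the random index, not the last iterate.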
Notes
The numerical applications were developed in MATLAB by the first author of the paper. The code files are publicly available at https://github.com/gfort-lab/OpSiMorE/tree/master/FIEM.
Available at http://yann.lecun.com/exdb/mnist/
This work is partially supported by the Fondation Simone et Cino Del Duca through the project OpSiMorE; by the French Agence Nationale de la Recherche (ANR), project under reference ANR-PRC-CE23 MASDOL and Chair ANR of research and teaching in artificial intelligence - SCAI Statistics and Computation for AI; and by the Russian Academic Excellence Project ‘5-100’.
Cite this article
Fort, G., Gach, P. & Moulines, E. Fast incremental expectation maximization for finite-sum optimization: nonasymptotic convergence. Stat Comput 31, 48 (2021). https://doi.org/10.1007/s11222-021-10023-9
Keywords
- Computational statistical learning
- Large scale learning
- Incremental expectation maximization algorithm
- Momentum stochastic approximation
- Finite-sum optimization