Machine Learning

Volume 45, Issue 3, pp 279–299

Accelerating EM for Large Databases

  • Bo Thiesson
  • Christopher Meek
  • David Heckerman


The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps, for which we can use the results of Neal and Hinton (1998) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described by Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block strongly affects the efficiency of the algorithm, and we provide a method for selecting a near-optimal block size. The second approach, which we call lazy EM, evaluates the significance of each data case at scheduled iterations and then proceeds for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases.
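As a rough illustration of the block-wise partial E-step idea, the following sketch fits a one-dimensional Gaussian mixture by updating the expected sufficient statistics of one block at a time and running an M-step after each block. It is not the authors' implementation: the function names (`block_stats`, `m_step`, `incremental_em`), the choice of a 1-D Gaussian mixture, the unit initial variances, and the fixed `block_size` are all assumptions made for this example, and the block-size selection method from the paper is not reproduced.

```python
import math
import random


def block_stats(block, weights, means, variances):
    """Partial E-step: expected sufficient statistics for one block of cases."""
    k = len(weights)
    s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
    for x in block:
        # Unnormalized posterior probability of each component for case x.
        p = [
            weights[j]
            * math.exp(-0.5 * (x - means[j]) ** 2 / variances[j])
            / math.sqrt(2.0 * math.pi * variances[j])
            for j in range(k)
        ]
        total = sum(p) or 1e-300  # guard against numerical underflow
        for j in range(k):
            r = p[j] / total  # responsibility of component j for case x
            s0[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    return [s0, s1, s2]


def m_step(total, n, min_var=1e-6):
    """M-step: re-estimate parameters from the global sufficient statistics."""
    s0 = [max(a, 1e-12) for a in total[0]]
    weights = [a / n for a in s0]
    means = [b / a for a, b in zip(s0, total[1])]
    variances = [
        max(c / a - m * m, min_var) for a, c, m in zip(s0, total[2], means)
    ]
    return weights, means, variances


def incremental_em(data, means, block_size, passes=10):
    """Block-incremental EM for a 1-D Gaussian mixture (illustrative sketch)."""
    k = len(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k  # assumed starting point for this example
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # One full E-step so that every block has stored statistics.
    stats = [block_stats(b, weights, means, variances) for b in blocks]
    total = [[sum(s[t][j] for s in stats) for j in range(k)] for t in range(3)]
    weights, means, variances = m_step(total, len(data))
    for _ in range(passes):
        for i, block in enumerate(blocks):
            # Refresh only this block's statistics, then update parameters.
            new = block_stats(block, weights, means, variances)
            for t in range(3):
                for j in range(k):
                    total[t][j] += new[t][j] - stats[i][t][j]
            stats[i] = new
            weights, means, variances = m_step(total, len(data))
    return weights, means, variances


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(-5.0, 1.0) for _ in range(300)]
    sample += [rng.gauss(5.0, 1.0) for _ in range(300)]
    rng.shuffle(sample)
    print(incremental_em(sample, [-1.0, 1.0], block_size=50, passes=5))
```

In this sketch a block size equal to the full data set reduces to standard EM, while a block size of one updates the parameters after every case; the trade-off between these extremes is what the block-size selection discussed in the abstract addresses.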

Expectation Maximization algorithm, incremental EM, lazy EM, online EM, data blocking, mixture models, clustering


  1. Agresti, A. (1990). Categorical Data Analysis. New York: John Wiley and Sons.
  2. Bradley, P., Fayyad, U., & Reina, C. (1998). Scaling EM (Expectation Maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research.
  3. Cheeseman, P. & Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 153–180). Menlo Park, CA: AAAI Press.
  4. Chickering, D. M. & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–212.
  5. Chickering, D. M. & Heckerman, D. (1999). Fast learning from sparse data. In K. B. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 109–115). San Mateo, CA: Morgan Kaufmann Publishers.
  6. Dempster, A. P., Laird, N., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.
  7. Green, P. J. (1990). On use of the EM algorithm for penalized likelihood estimation. Journal of the Royal Statistical Society, Series B, 52, 443–452.
  8. Huang, X., Acero, A., Alleva, F., Hwang, M.-Y., Jiang, L., & Mahajan, M. (1995). Microsoft Windows highly intelligent speech recognizer: Whisper. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95 (Vol. 1, pp. 93–96).
  9. Jamshidian, M. & Jennrich, R. I. (1993). Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association, 88(421), 221–228.
  10. Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44(2), 226–233.
  11. McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In R. Ramakrishnan & S. Stolfo (Eds.), Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 169–178). New York: ACM.
  12. Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society, Series B, 51(1), 127–138.
  13. Meng, X.-L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2), 267–278.
  14. Meng, X.-L. & van Dyk, D. (1997). The EM algorithm—an old folksong sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B, 59, 511–567.
  15. Moore, A. (1999). Very fast EM-based mixture model clustering using multiresolution kd-trees. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in Neural Information Processing Systems. Proceedings of the 1998 Conference (Vol. 11, pp. 543–549). Cambridge, MA: MIT Press.
  16. Neal, R. & Hinton, G. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan (Ed.), Learning in Graphical Models (pp. 355–371). The Netherlands: Kluwer Academic Publishers.
  17. Nowlan, S. J. (1991). Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh.
  18. Sato, M. (1999). Fast learning of on-line EM algorithm. Technical Report, ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan.
  19. Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2), 407–432.
  20. Thiesson, B. (1995). Accelerated quantification of Bayesian networks with incomplete data. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings of First International Conference on Knowledge Discovery and Data Mining (pp. 306–311). Menlo Park, CA: AAAI Press.
  21. Thiesson, B., Meek, C., Chickering, D., & Heckerman, D. (1999). Computationally efficient methods for selecting among mixtures of graphical models, with discussion. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian Statistics: Proceedings of the Sixth Valencia International Meeting (Vol. 6, pp. 631–656). Oxford: Oxford University Press.
  22. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). Birch: An efficient data clustering method for very large databases. In Proceedings of the Fifteenth ACM SIGMOD International Conference on Management of Data and Symposium on Principles of Database Systems (pp. 103–114). New York: ACM.

Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Bo Thiesson (1)
  • Christopher Meek (1)
  • David Heckerman (1)

  1. Microsoft Research, One Microsoft Way, Redmond, USA
