Natural Conjugate Gradient in Variational Inference

  • Antti Honkela
  • Matti Tornio
  • Tapani Raiko
  • Juha Karhunen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4985)


Variational methods for approximate inference in machine learning often adapt a parametric probability distribution to optimize a given objective function. This view is especially useful when applying variational Bayes (VB) to models outside the conjugate-exponential family. For such models, variational Bayesian expectation maximization (VB EM) algorithms are not readily available, and gradient-based methods are often used as alternatives. Traditional natural gradient methods use the Riemannian structure (or geometry) of the predictive distribution to speed up maximum likelihood estimation. We propose using the geometry of the variational approximating distribution instead to speed up a conjugate gradient method for variational learning and inference. The computational overhead is small due to the simplicity of the approximating distribution. Experiments with real-world speech data show significant speedups over alternative learning algorithms.
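The idea sketched in the abstract — preconditioning gradients with the Fisher information of the approximating distribution, which is cheap when that distribution is simple — can be illustrated as follows. This is a hedged sketch under the assumption of a fully factorized Gaussian approximation parametrized by means and variances, with a simplified Polak–Ribière rule; it is not the paper's exact algorithm, and the function names are illustrative.

```python
import numpy as np

def natural_gradient(grad_mu, grad_var, var):
    """Natural gradient for an assumed factorized Gaussian q = N(mu, diag(var)).

    The Fisher information matrix of such a q is diagonal: 1/var for the
    means and 1/(2 var^2) for the variances. Multiplying the plain gradient
    by the inverse Fisher matrix is therefore an elementwise operation,
    which is why the computational overhead remains small.
    """
    return var * grad_mu, 2.0 * var ** 2 * grad_var

def polak_ribiere_beta(nat_g, g, nat_g_prev, g_prev):
    """Polak-Ribiere coefficient using Fisher-metric inner products.

    For natural gradients g_hat = F^{-1} g, the Riemannian inner product
    <g_hat, h_hat>_F = g_hat^T F h_hat reduces to the Euclidean dot product
    g_hat . h. Vector transport between iterates is ignored here as a
    simplifying assumption.
    """
    return float(nat_g @ (g - g_prev)) / float(nat_g_prev @ g_prev)
```

In a conjugate gradient loop, the search direction would be updated as `d = nat_g + beta * d_prev`, with the natural gradient replacing the plain gradient in both the direction and the coefficient.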


Keywords: Conjugate Gradient · Conjugate Gradient Method · Fisher Information Matrix · Riemannian Structure · Natural Gradient
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Antti Honkela¹
  • Matti Tornio¹
  • Tapani Raiko¹
  • Juha Karhunen¹

  1. Adaptive Informatics Research Centre, Helsinki University of Technology, Finland
