Natural Conjugate Gradient in Variational Inference
Variational methods for approximate inference in machine learning adapt a parametric probability distribution to optimize a given objective function. This view is especially useful when applying variational Bayes (VB) to models outside the conjugate-exponential family. For such models, variational Bayesian expectation maximization (VB EM) algorithms are not readily available, and gradient-based methods are often used instead. Traditional natural gradient methods use the Riemannian structure (geometry) of the predictive distribution to speed up maximum likelihood estimation. We propose instead using the geometry of the variational approximating distribution to speed up a conjugate gradient method for variational learning and inference. The computational overhead is small owing to the simplicity of the approximating distribution. Experiments with real-world speech data show significant speedups over alternative learning algorithms.
Keywords: Conjugate Gradient · Conjugate Gradient Method · Fisher Information Matrix · Riemannian Structure · Natural Gradient
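As a hedged illustration of the idea in the abstract (not the paper's exact algorithm): for a diagonal Gaussian approximating distribution q(θ) = N(μ, diag(σ²)), the Fisher information matrix with respect to the mean μ is diag(1/σ²), so the natural gradient is the ordinary gradient premultiplied by the inverse Fisher matrix, i.e. scaled elementwise by the variances. The function name below is hypothetical.

```python
import numpy as np

def natural_gradient_mu(grad_mu, sigma2):
    """Natural gradient w.r.t. the mean of q = N(mu, diag(sigma2)).

    The Fisher information for mu is F = diag(1 / sigma2), so the
    natural gradient F^{-1} g reduces to elementwise scaling by sigma2.
    """
    return sigma2 * grad_mu

# Directions with large posterior variance get larger steps,
# reflecting the geometry of the approximating distribution.
grad = np.array([1.0, -2.0, 0.5])
sigma2 = np.array([0.1, 1.0, 4.0])
print(natural_gradient_mu(grad, sigma2))  # [ 0.1 -2.   2. ]
```

This scaling is what makes the overhead small when the approximation is simple: for a diagonal Gaussian, no matrix inversion is needed.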