Natural Langevin Dynamics for Neural Networks

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10589)

Abstract

One way to avoid overfitting in machine learning is to use model parameters distributed according to a Bayesian posterior given the data, rather than the maximum likelihood estimator. Stochastic gradient Langevin dynamics (SGLD) is one algorithm that approximates such Bayesian posteriors for large models and datasets. SGLD is standard stochastic gradient descent with a controlled amount of added noise, scaled so that the parameter converges in law to the posterior distribution [WT11, TTV16]. The posterior predictive distribution can then be approximated by an ensemble of samples from the trajectory.
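To make the update rule concrete, here is a minimal sketch of one SGLD step in the spirit of [WT11]. The function and argument names (sgld_step, grad_log_prior, grad_log_lik, n_total, n_batch) are illustrative, not from the paper:

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik, n_total, n_batch,
              step_size, rng):
    """One SGLD update: a gradient-ascent step on an unbiased estimate
    of the log-posterior gradient, plus Gaussian noise whose variance
    equals the step size."""
    # Minibatch gradient of the log-likelihood, rescaled by
    # n_total / n_batch to estimate the full-data gradient.
    drift = grad_log_prior(theta) + (n_total / n_batch) * grad_log_lik(theta)
    noise = rng.normal(size=theta.shape) * np.sqrt(step_size)
    return theta + 0.5 * step_size * drift + noise
```

With step sizes decreasing to zero the iterates converge in law to the posterior; in practice a small constant step size is common, at the cost of a bias analyzed in [TTV16].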

The choice of the noise variance is known to affect the practical behavior of SGLD: for instance, the noise should be smaller along sensitive parameter directions. Theoretically, it has been suggested to use the inverse Fisher information matrix of the model as the covariance of the noise, since it is also the asymptotic covariance of the Bayesian posterior [PT13, AKW12, GC11]. But the Fisher matrix is costly to compute for high-dimensional models.
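As an illustration of what such preconditioning looks like, here is a sketch of a Langevin step with a fixed preconditioner approximating the inverse Fisher matrix, in the spirit of the Riemannian Langevin updates of [GC11, PT13]. The position-dependent correction term (involving the derivative of the preconditioner with respect to the parameter) is omitted for brevity, and all names are illustrative:

```python
import numpy as np

def preconditioned_langevin_step(theta, grad_log_post, fisher_inv,
                                 step_size, rng):
    """Langevin step preconditioned by an inverse-Fisher estimate:
    the same matrix rescales the gradient and sets the noise covariance."""
    drift = fisher_inv @ grad_log_post(theta)
    # Sample noise with covariance step_size * fisher_inv via a Cholesky
    # factor: directions with large Fisher information (sensitive
    # directions) receive less noise.
    chol = np.linalg.cholesky(fisher_inv)
    noise = np.sqrt(step_size) * (chol @ rng.normal(size=theta.shape))
    return theta + 0.5 * step_size * drift + noise
```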

Here we use the easily computed Fisher matrix approximations for deep neural networks from [MO16, Oll15]. The resulting natural Langevin dynamics combines the advantages of Amari's natural gradient descent [Ama98] and Fisher-preconditioned Langevin dynamics for large neural networks.
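The paper relies on the quasi-diagonal Fisher approximations of [MO16, Oll15]; as a simpler stand-in, the following sketch uses a plain diagonal Fisher estimate built from per-example gradients, with elementwise preconditioning of both the drift and the noise. The names and the damping constant are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def natural_langevin_step_diag(theta, grad_log_post, per_example_grads,
                               step_size, rng, damping=1e-4):
    """Langevin step with a cheap diagonal Fisher preconditioner
    (a simplified stand-in for the quasi-diagonal approximations
    of [MO16, Oll15]).

    grad_log_post: precomputed stochastic estimate of the
    log-posterior gradient, shape (dim,).
    per_example_grads: per-example log-likelihood gradients,
    shape (n_batch, dim).
    """
    # Diagonal Fisher estimate: mean squared per-example gradient,
    # damped so that its inverse stays well-defined.
    diag_fisher = np.mean(per_example_grads ** 2, axis=0) + damping
    inv_f = 1.0 / diag_fisher
    drift = inv_f * grad_log_post
    # Elementwise noise with variance step_size * inv_f.
    noise = rng.normal(size=theta.shape) * np.sqrt(step_size * inv_f)
    return theta + 0.5 * step_size * drift + noise
```

A diagonal preconditioner costs only one extra vector per step, which is what makes Fisher-preconditioned Langevin dynamics tractable at neural-network scale.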

Small-scale experiments on MNIST show that Fisher matrix preconditioning brings SGLD close to the performance of dropout as a regularization technique.

References

  1. [AKW12]
    Ahn, S., Korattikara, A., Welling, M.: Bayesian posterior sampling via stochastic gradient Fisher scoring. In: ICML (2012)
  2. [Ama98]
    Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10, 251–276 (1998)
  3. [Bis06]
    Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
  4. [BL03]
    Bottou, L., LeCun, Y.: Large scale online learning. In: NIPS, vol. 30, p. 77 (2003)
  5. [Bot10]
    Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT 2010, pp. 177–186. Springer, Heidelberg (2010)
  6. [CDC15]
    Chen, C., Ding, N., Carin, L.: On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In: Advances in Neural Information Processing Systems, pp. 2278–2286 (2015)
  7. [DM16]
    Durmus, A., Moulines, E.: High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559 (2016)
  8. [GBC16]
    Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
  9. [GC11]
    Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. Roy. Stat. Soc. Series B (Statistical Methodology) 73(2), 123–214 (2011)
  10. [LCCC16]
    Li, C., Chen, C., Carlson, D.E., Carin, L.: Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In: Schuurmans, D., Wellman, M.P. (eds.) Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 12–17 February 2016, Phoenix, Arizona, USA, pp. 1788–1794. AAAI Press (2016)
  11. [Mac92]
    MacKay, D.J.C.: A practical Bayesian framework for backpropagation networks. Neural Comput. 4(3), 448–472 (1992)
  12. [Mac03]
    MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
  13. [MDM17]
    Majewski, S., Durmus, A., Miasojedow, B.: (2017)
  14. [MO16]
    Marceau-Caron, G., Ollivier, Y.: Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007 (2016)
  15. [Nea96]
    Neal, R.M.: Bayesian Learning for Neural Networks. Springer, New York (1996)
  16. [Oll15]
    Ollivier, Y.: Riemannian metrics for neural networks I: feedforward networks. Inf. Infer. 4(2), 108–153 (2015)
  17. [PB13]
    Pascanu, R., Bengio, Y.: Natural gradient revisited. arXiv preprint arXiv:1301.3584 (2013)
  18. [PT13]
    Patterson, S., Teh, Y.W.: Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In: Advances in Neural Information Processing Systems, pp. 3102–3110 (2013)
  19. [SHK+14]
    Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
  20. [TTV16]
    Teh, Y.W., Thiery, A.H., Vollmer, S.J.: Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 17(7), 1–33 (2016)
  21. [vdV00]
    van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (2000)
  22. [WT11]
    Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688 (2011)
  23. [XSL+14]
    Xifara, T., Sherlock, C., Livingstone, S., Byrne, S., Girolami, M.: Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 91, 14–19 (2014)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. MILA, Université de Montréal, Montréal, Canada
  2. CNRS, Université Paris-Saclay, Paris, France