Restricted Boltzmann Machines: Introduction and Review

Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 252)


The restricted Boltzmann machine is a network of stochastic units with undirected interactions between pairs of visible and hidden units. This model was popularized as a building block of deep learning architectures and has continued to play an important role in applied and theoretical machine learning. Restricted Boltzmann machines carry a rich structure, with connections to geometry, applied algebra, probability, statistics, machine learning, and other areas. The analysis of these models is attractive in its own right and also as a platform to combine and generalize mathematical tools for graphical models with hidden variables. This article gives an introduction to the mathematical analysis of restricted Boltzmann machines, reviews recent results on the geometry of the sets of probability distributions representable by these models, and suggests a few directions for further investigation.
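The bipartite structure described above — stochastic binary units with interactions only between visible and hidden layers — can be illustrated with a minimal sketch. The code below is not from the article; the parameter names (`W` for pairwise interactions, `b`, `c` for biases) and the block Gibbs sampler are a standard formulation, given here only to make the model concrete.

```python
import numpy as np

# Minimal RBM sketch with binary units. All names and sizes are
# illustrative assumptions, not taken from the article.
rng = np.random.default_rng(0)

n_visible, n_hidden = 4, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # visible-hidden interactions
b = np.zeros(n_visible)  # visible biases
c = np.zeros(n_hidden)   # hidden biases

def energy(v, h):
    """Energy of a joint state; p(v, h) is proportional to exp(-energy(v, h))."""
    return -(v @ W @ h + b @ v + c @ h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    """One block Gibbs step: sample h given v, then v given h.
    The 'restricted' (bipartite) structure makes units within a layer
    conditionally independent, so each layer is sampled in one shot."""
    p_h = sigmoid(c + v @ W)                       # p(h_j = 1 | v)
    h = (rng.random(n_hidden) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)                       # p(v_i = 1 | h)
    v_new = (rng.random(n_visible) < p_v).astype(float)
    return v_new, h

v = rng.integers(0, 2, n_visible).astype(float)
for _ in range(10):
    v, h = gibbs_step(v)
```

Marginalizing the hidden units of this joint distribution yields the set of visible distributions whose geometry the article studies.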


Keywords: Hierarchical model · Latent variable model · Exponential family · Mixture model · Hadamard product · Non-negative tensor rank · Expected dimension · Universal approximation · Kullback–Leibler divergence · Divergence maximization



I thank Shun-ichi Amari for inspiring discussions over the years. This review article originated at the IGAIA IV conference in 2016 dedicated to his 80th birthday. I am grateful to Nihat Ay, Johannes Rauh, Jason Morton, and more recently Anna Seigal for our collaborations. I thank Fero Matúš for discussions on the divergence maximization for hierarchical models, lastly at the MFO Algebraic Statistics meeting in 2017. I thank Bernd Sturmfels for many fruitful discussions, and Dave Ackley for insightful discussions at the Santa Fe Institute in 2016. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 757983).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Mathematics and Department of Statistics, University of California, Los Angeles, Los Angeles, USA
  2. Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
