
Thermodynamics of Restricted Boltzmann Machines and Related Learning Dynamics

Published in: Journal of Statistical Physics

Abstract

We investigate the thermodynamic properties of a restricted Boltzmann machine (RBM), a simple energy-based generative model used in the context of unsupervised learning. Assuming the information content of this model to be mainly reflected by the spectral properties of its weight matrix W, we aim at a realistic analysis by averaging over an appropriate statistical ensemble of RBMs. First, a phase diagram is derived. Otherwise similar to that of the Sherrington–Kirkpatrick (SK) model with ferromagnetic couplings, the RBM’s phase diagram presents a ferromagnetic phase which may or may not be of compositional type, depending on the kurtosis of the distribution of the components of the singular vectors of W. Subsequently, the learning dynamics of the RBM is studied in the thermodynamic limit. A “typical” learning trajectory is shown to solve an effective dynamical equation, based on the aforementioned ensemble average and explicitly involving order parameters obtained from the thermodynamic analysis. In particular, this lets us show how the evolution of the dominant singular values of W, and thus of the unstable modes, is driven by the input data. At the beginning of training, where the RBM is found to operate in the linear regime, the unstable modes reflect the dominant covariance modes of the data. In the non-linear regime, instead, the selected modes interact and eventually impose a matching of the order parameters to their empirical counterparts estimated from the data. Finally, we illustrate our considerations by performing experiments on both artificial and real data, showing in particular how the RBM operates in the ferromagnetic compositional phase.




Notes

  1. A somewhat different form of the TAP equations.

  2. Note that in [17] a dependence \(\sqrt{\kappa (1-\kappa )}\) \(\left( \sqrt{\alpha (1-\alpha )} \text {in their notation} \right) \) is found. This dependence is hidden in our definition of \(\sigma ^2\) giving \(L=\sqrt{N_v N_h}\) times the variance of \(r_{ij}\) instead of \(N_v+N_h\) as in their case.

References

  1. Smolensky, P.: Information processing in dynamical systems: foundations of harmony theory, chapter 6. In: Rumelhart, D., McLelland, J. (eds.) Parallel Distributed Processing, pp. 194–281. MIT Press, Cambridge (1986)

  2. Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: Artificial Intelligence and Statistics, pp. 448–455 (2009)

  3. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  4. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14, 1771–1800 (2002)

  5. Tieleman, T.: Training restricted Boltzmann machines using approximations to the likelihood gradient. In: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 1064–1071. ACM, New York (2008)

  6. Hinton, G.E.: A Practical Guide to Training Restricted Boltzmann Machines. Springer, Berlin (2012)

  7. Salazar, D.S.P.: Nonequilibrium thermodynamics of restricted Boltzmann machines. Phys. Rev. E 96, 022131 (2017)

  8. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79(8), 2554–2558 (1982)

  9. Amit, D.J., Gutfreund, H., Sompolinsky, H.: Statistical mechanics of neural networks near saturation. Ann. Phys. 173(1), 30–67 (1987)

  10. Gardner, E.: Maximum storage capacity in neural networks. Europhys. Lett. 4(4), 481 (1987)

  11. Gardner, E., Derrida, B.: Optimal storage properties of neural network models. J. Phys. A 21(1), 271 (1988)

  12. Barra, A., Bernacchia, A., Santucci, E., Contucci, P.: On the equivalence of Hopfield networks and Boltzmann machines. Neural Netw. 34, 1–9 (2012)

  13. Gabrié, M., Tramel, E.W., Krzakala, F.: Training restricted Boltzmann machines via the Thouless–Anderson–Palmer free energy. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pp. 640–648 (2015)

  14. Huang, H., Toyoizumi, T.: Advanced mean-field theory of the restricted Boltzmann machine. Phys. Rev. E 91(5), 050101 (2015)

  15. Takahashi, C., Yasuda, M.: Mean-field inference in Gaussian restricted Boltzmann machine. J. Phys. Soc. Jpn. 85(3), 034001 (2016)

  16. Furtlehner, C., Lasgouttes, J.-M., Auger, A.: Learning multiple belief propagation fixed points for real time inference. Phys. A 389(1), 149–163 (2010)

  17. Barra, A., Genovese, G., Sollich, P., Tantari, D.: Phase diagram of restricted Boltzmann machines and generalized Hopfield networks with arbitrary priors. Phys. Rev. E 97, 022310 (2018)

  18. Huang, H.: Statistical mechanics of unsupervised feature learning in a restricted Boltzmann machine with binary synapses. J. Stat. Mech. 2017(5), 053302 (2017)

  19. Agliari, E., Barra, A., Galluzzi, A., Guerra, F., Moauro, F.: Multitasking associative networks. Phys. Rev. Lett. 109, 268101 (2012)

  20. Monasson, R., Tubiana, J.: Emergence of compositional representations in restricted Boltzmann machines. Phys. Rev. Lett. 118, 138301 (2017)

  21. Zdeborová, L., Krzakala, F.: Statistical physics of inference: thresholds and algorithms. Adv. Phys. 65(5), 453–552 (2016)

  22. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Comput. 11(2), 443–482 (1999)

  23. Bourlard, H., Kamp, Y.: Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59(4), 291–294 (1988)

  24. Saxe, A. M., McClelland, J. L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks (2014). arXiv:1312.6120

  25. Decelle, A., Fissore, G., Furtlehner, C.: Spectral dynamics of learning in restricted Boltzmann machines. EPL 119(6), 60001 (2017)

  26. Tramel, E.W., Gabrié, M., Manoel, A., Caltagirone, F., Krzakala, F.: A deterministic and generalized framework for unsupervised learning with restricted Boltzmann machines (2017). arXiv:1702.03260

  27. Marčenko, V.A., Pastur, L.A.: Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sbornik 1(4), 457 (1967)

  28. Mézard, M.: Mean-field message-passing equations in the Hopfield model and its generalizations. Phys. Rev. E 95, 022117 (2017)

  29. Parisi, G., Potters, M.: Mean-field equations for spin models with orthogonal interaction matrices. J. Phys. A 28(18), 5267 (1995)

  30. Opper, M., Winther, O.: Adaptive and self-averaging Thouless–Anderson–Palmer mean field theory for probabilistic modeling. Phys. Rev. E 64, 056131 (2001)

  31. Amit, D.J., Gutfreund, H., Sompolinsky, H.: Spin-glass models of neural networks. Phys. Rev. A 32, 1007–1018 (1985)

  32. Mézard, M., Parisi, G., Virasoro, M.A.: Spin Glass Theory and Beyond. World Scientific, Singapore (1987)

  33. de Almeida, J.R.L., Thouless, D.J.: Stability of the Sherrington–Kirkpatrick solution of a spin glass model. J. Phys. A 11(5), 983–990 (1978)

  34. Hohenberg, P.C., Cross, M.C.: An introduction to pattern formation in nonequilibrium systems, pp. 55–92. Springer, Berlin (1987)

  35. Mastromatteo, I., Marsili, M.: On the criticality of inferred models. J. Stat. Mech. 2011(10), P10012 (2011)

Corresponding author

Correspondence to C. Furtlehner.

Appendices

Appendix A: AT Line

The stability of the RS solution to the mean-field equations is studied along the lines of [33] by looking at the Hessian of the replicated version of the free energy and identifying eigenmodes from symmetry arguments. Before taking the limit \(p\rightarrow 0\) the free energy reads

$$\begin{aligned} f[m,\bar{m},Q,\bar{Q}] = \sum _{a,\alpha }w_\alpha m_\alpha ^a\bar{m}_\alpha ^a +\frac{\sigma ^2}{2}\sum _{a\ne b} Q_{ab}\bar{Q}_{ab} -\frac{1}{\sqrt{\kappa }} A_p[m,Q]-\sqrt{\kappa }B_p[\bar{m},\bar{Q}], \end{aligned}$$

with \(A_p\) and \(B_p\) given in (10,11). Assuming the small perturbations

$$\begin{aligned} m_\alpha ^a = m_\alpha +\epsilon _\alpha ^a\qquad \qquad \bar{m}_\alpha ^a = \bar{m}_\alpha +\bar{\epsilon }_\alpha ^a\\ Q_{ab} = q +\eta _{ab}\qquad \qquad \bar{Q}_{ab} = \bar{q} + \bar{\eta }_{ab}, \end{aligned}$$

around the saddle point \((m_\alpha ,\bar{m}_\alpha ,q,\bar{q})\), the perturbed free energy reads

$$\begin{aligned} \Delta f =&\sum _{a,\alpha }w_\alpha \bar{\epsilon }_\alpha ^a\epsilon _\alpha ^a+\frac{\sigma ^2}{2}\sum _{a\ne b}\bar{\eta }_{ab}\eta _{ab} +\sum _{a,b,\alpha ,\beta }\bigl [\bigl (\delta _{ab}\bar{A}_{\alpha \beta }+\bar{\delta }_{ab}\bar{B}_{\alpha \beta }\bigr )\epsilon _\alpha ^a\epsilon _\beta ^b +CT\bigr ]\\&+\sum _{a\ne b,c,\alpha }\bigl [\bigl ((\delta _{ab}+\delta _{ac})\bar{C}_{\alpha }+(1-\delta _{ac}-\delta _{bc})\bar{D}_{\alpha }\bigr )\epsilon _\alpha ^c\eta _{ab} +CT\bigr ]\\&+\sum _{a\ne b,c\ne d}\bigl [\bigl (\delta _{(ab)(cd)}\bar{E}_0+\mathbb {1}_{\{a\in (cd)\oplus b\in (cd)\}}\bar{E}_1+\mathbb {1}_{\{(ab)\cap (cd)=\emptyset \}}\bar{E}_2\bigr )\eta _{ab}\eta _{cd} +CT\bigr ], \end{aligned}$$

where CT denotes the conjugate term obtained by the exchanges \(\epsilon \leftrightarrow \bar{\epsilon }\), \(A_{\alpha \beta } \leftrightarrow \bar{A}_{\alpha \beta }\), etc., \(\bar{\delta }_{ab} {\mathop {=}\limits ^{\text{ def }}}1-\delta _{ab}\), and the operators are given by

$$\begin{aligned} A_{\alpha \beta }&{\mathop {=}\limits ^{\text{ def }}}(\delta _{\alpha \beta }-m_\alpha m_\beta )w_\alpha w_\beta \qquad \qquad B_{\alpha \beta } {\mathop {=}\limits ^{\text{ def }}}\Bigl (\mathsf {E}_{x,v}\bigl (v^\alpha v^\beta \tanh ^2(\bar{h}(x,v))\bigr )-m_\alpha m_\beta \Bigr )w_\alpha w_\beta \\ C_\alpha&{\mathop {=}\limits ^{\text{ def }}}\frac{\kappa ^{1/4}\sigma ^2}{2}m_\alpha (1-q)w_\alpha \qquad \qquad D_\alpha {\mathop {=}\limits ^{\text{ def }}}\frac{\kappa ^{1/4}\sigma ^2}{2} \Bigl (\mathsf {E}_{x,v}\bigl (v^\alpha \tanh ^3(\bar{h}(x,v))\bigr )-m_\alpha q\Bigr )w_\alpha \\ E_0&{\mathop {=}\limits ^{\text{ def }}}\frac{\sqrt{\kappa }\sigma ^4}{4}(1-q^2)\qquad E_1 {\mathop {=}\limits ^{\text{ def }}}\frac{\sqrt{\kappa }\sigma ^4}{4}q(1-q)\qquad E_2 {\mathop {=}\limits ^{\text{ def }}}\frac{\sqrt{\kappa }\sigma ^4}{4}\Bigl (\mathsf {E}_{x,v}\bigl (\tanh ^4(\bar{h}(x,v))\bigr )-q^2\Bigr ) \end{aligned}$$

with

$$\begin{aligned} h(x,u) {\mathop {=}\limits ^{\text{ def }}}\kappa ^{1/4}\left( \sqrt{q}\sigma x + \sum _\alpha (m_\alpha w_\alpha - \eta _\alpha )u^\alpha \right) . \end{aligned}$$

Conjugate quantities are obtained by replacing \(m_\alpha \) by \(\bar{m}_\alpha \), q by \(\bar{q}\), \(u^\alpha \) by \(v^\alpha \), \(\eta _\alpha \) by \(\theta _\alpha \) and \(\kappa \) by \(1/\kappa \). As for the SK model, the \(2Kp\times 2Kp\) Hessian thereby defined can be diagonalized with the help of three similar sets of eigenmodes corresponding to different permutation symmetries in replica space.

The first set corresponds to \(2K+2\) replica symmetric modes, defined by \(\epsilon _\alpha ^a = \epsilon _\alpha \), \(\bar{\epsilon }_\alpha ^a = \bar{\epsilon }_\alpha \) and \(\eta _{ab} = \eta \), \(\bar{\eta }_{ab} = \bar{\eta }\), solving the linear system

$$\begin{aligned}&\left( \frac{w_\alpha }{2}-\lambda \right) \bar{\epsilon }_\alpha -\frac{1}{2}\bar{A}_{\alpha \alpha }\epsilon _\alpha +\sum _\beta \bigl (\bar{A}_{\alpha \beta } +(p-1)\bar{B}_{\alpha \beta }\bigr )\epsilon _\beta \\&\quad +\left( (p-1)\bar{C}_\alpha +\frac{(p-1)(p-2)}{2}\bar{D}_\alpha \right) \eta = 0\\&\left( \frac{w_\alpha }{2}-\lambda \right) \epsilon _\alpha -\frac{1}{2}A_{\alpha \alpha }\bar{\epsilon }_\alpha +\sum _\beta \bigl (A_{\alpha \beta } +(p-1)B_{\alpha \beta }\bigr )\bar{\epsilon }_\beta \\&\quad +\left( (p-1)C_\alpha +\frac{(p-1)(p-2)}{2}D_\alpha \right) \bar{\eta }= 0\\&\left( \frac{\sigma ^2}{2}-\lambda \right) \bar{\eta }+\sum _\alpha \left( \bar{C}_\alpha +\frac{p-2}{2}\bar{D}_\alpha \right) \epsilon _\alpha +2\left( \bar{E}_0+2(p-2)\bar{E}_1+\frac{(p-2)(p-3)}{2}\bar{E}_2\right) \eta = 0\\&\left( \frac{\sigma ^2}{2}-\lambda \right) \eta +\sum _\alpha \left( C_\alpha +\frac{p-2}{2}D_\alpha \right) \bar{\epsilon }_\alpha +2\left( E_0+2(p-2)E_1+\frac{(p-2)(p-3)}{2}E_2\right) \bar{\eta }= 0 \end{aligned}$$

with eigenvalue \(\lambda \) solving a polynomial equation of degree \(2K+2\) corresponding to a vanishing determinant in the above system.

The second set corresponds to a broken replica symmetry where one replica \(a_0\) is different from the others

$$\begin{aligned} (\epsilon _\alpha ^a,\bar{\epsilon }_\alpha ^a)= & {} {\left\{ \begin{array}{ll} (\epsilon _\alpha ,\bar{\epsilon }_\alpha )\qquad \text {for}\ a\ne a_0\\ (1-p)(\epsilon _\alpha ,\bar{\epsilon }_\alpha )\qquad \text {for}\ a=a_0 \end{array}\right. } \\ (\eta _{ab},\bar{\eta }_{ab})= & {} {\left\{ \begin{array}{ll} (\eta ,\bar{\eta })\qquad \text {for}\ a,b\ne a_0\\ (1-\frac{p}{2})(\eta ,\bar{\eta })\qquad \text {for}\ a=a_0\ or\ b=a_0 \end{array}\right. } \end{aligned}$$

This set has dimension \((2K+2)(p-1)\). Its parameterization is obtained by imposing orthogonality with the previous one. The corresponding system reads

$$\begin{aligned}&\left( \frac{w_\alpha }{2}-\lambda \right) \bar{\epsilon }_\alpha -\frac{1}{2}\bar{A}_{\alpha \alpha }\epsilon _\alpha +\sum _\beta (\bar{A}_{\alpha \beta }-\bar{B}_{\alpha \beta })\epsilon _\beta +\frac{p-2}{2}\bigl (\bar{C}_\alpha -\bar{D}_\alpha \bigr )\eta = 0\\&\left( \frac{w_\alpha }{2}-\lambda \right) \epsilon _\alpha -\frac{1}{2}A_{\alpha \alpha }\bar{\epsilon }_\alpha +\sum _\beta (A_{\alpha \beta }-B_{\alpha \beta })\bar{\epsilon }_\beta +\frac{p-2}{2}\bigl (C_\alpha -D_\alpha \bigr )\bar{\eta }= 0\\&\left( \frac{\sigma ^2}{2}-\lambda \right) \bar{\eta }+\sum _\alpha (\bar{C}_\alpha -\bar{D}_\alpha )\epsilon _\alpha +2\bigl (\bar{E}_0+(p-4)\bar{E}_1-(p-3)\bar{E}_2\bigr )\eta = 0\\&\left( \frac{\sigma ^2}{2}-\lambda \right) \eta +\sum _\alpha (C_\alpha -D_\alpha )\bar{\epsilon }_\alpha +2\bigl (E_0+(p-4)E_1-(p-3)E_2\bigr )\bar{\eta }= 0 \end{aligned}$$

Finally the eigenmodes of the Hessian are made complete by considering a broken symmetry where two replicas \(a_0\) and \(a_1\) are different from the others, with the following parameterization dictated again by orthogonality constraints with the previous sets:

$$\begin{aligned} (\epsilon _\alpha ^a,\bar{\epsilon }_\alpha ^a) = 0, \qquad (\eta _{ab},\bar{\eta }_{ab}) = {\left\{ \begin{array}{ll} (\eta ,\bar{\eta })\qquad \text {for}\ a,b\notin \{a_0,a_1\}\\ \frac{3-p}{2}(\eta ,\bar{\eta })\qquad \text {for}\ a\in \{a_0,a_1\}\ \text {or}\ b\in \{a_0,a_1\}\\ \frac{(p-2)(p-3)}{2}(\eta ,\bar{\eta })\qquad \text {for}\ (a,b)=(a_0,a_1). \end{array}\right. } \end{aligned}$$

The dimension of this set is now \(p(p-3)\), and its elements are eigenvectors if and only if the following system of equations is satisfied

$$\begin{aligned}&\left( \frac{\sigma ^2}{2}-\lambda \right) \bar{\eta }+2(\bar{E}_0-2\bar{E}_1+\bar{E}_2)\eta = 0\\&\left( \frac{\sigma ^2}{2}-\lambda \right) \eta +2(E_0-2E_1+E_2)\bar{\eta }= 0 \end{aligned}$$

The corresponding eigenvalues read

$$\begin{aligned} \lambda = \frac{\sigma ^2}{2}\pm 2\sqrt{(\bar{E}_0-2\bar{E}_1+\bar{E}_2)(E_0-2E_1+E_2)}, \end{aligned}$$

with degeneracy \(p(p-3)/2\). Observing that \(q=\mathsf {E}_{x,v}\bigl (\tanh ^2(\bar{h}(x,v))\bigr )\) yields

$$\begin{aligned} E_0-2E_1+E_2 = \frac{\sqrt{\kappa }\sigma ^4}{4}\Bigl (1-2q+\mathsf {E}_{x,v}\bigl (\tanh ^4(\bar{h}(x,v))\bigr )\Bigr ) = \frac{\sqrt{\kappa }\sigma ^4}{4}\mathsf {E}_{x,v}\Bigl ({\text {sech}}^4\bigl (\bar{h}(x,v)\bigr )\Bigr ), \end{aligned}$$

and similarly for \(\bar{E}_0-2\bar{E}_1+\bar{E}_2\) with prefactor \(\sigma ^4/(4\sqrt{\kappa })\), so that requiring the smaller of these eigenvalues to remain positive gives the RS stability condition

$$\begin{aligned} \frac{1}{\sigma ^2} > \sqrt{\mathsf {E}_{x,u}\Bigl ({\text {sech}}^4\bigl (h(x,u)\bigr )\Bigr )\mathsf {E}_{x,v}\Bigl ({\text {sech}}^4\bigl (\bar{h}(x,v)\bigr )\Bigr )}, \end{aligned}$$

which reduces to the same form as the AT line of the SK model when \(\kappa =1\), apart from the u and v averages, which are specific to our model. As seen in Fig. 2, the influence of \(\kappa \) is very limited.
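As a sanity check, this condition can be evaluated numerically. The sketch below is not from the paper; it assumes the simplest symmetric setting, \(\kappa = 1\) with vanishing magnetizations (\(m_\alpha = 0\), \(\eta _\alpha = 0\)), in which both averages coincide, \(h(x,u) = \sqrt{q}\,\sigma x\), and \(q\) solves the scalar fixed-point equation \(q = \mathsf {E}_x\bigl (\tanh ^2(\sqrt{q}\sigma x)\bigr )\). Both sides of the condition are then estimated by Monte Carlo over the Gaussian field:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)   # samples of the Gaussian field x

def solve_q(sigma, iters=300):
    """Fixed point of q = E_x[tanh^2(sqrt(q) * sigma * x)] by iteration."""
    q = 0.5
    for _ in range(iters):
        q = float(np.mean(np.tanh(np.sqrt(q) * sigma * x) ** 2))
        if q < 1e-12:              # collapsed onto the paramagnetic solution q = 0
            return 0.0
    return q

def rs_stable(sigma):
    """Check 1/sigma^2 > E_x[sech^4(sqrt(q) * sigma * x)]; in this
    symmetric setting the u and v averages coincide."""
    q = solve_q(sigma)
    rhs = float(np.mean(np.cosh(np.sqrt(q) * sigma * x) ** -4))
    return bool(1.0 / sigma**2 > rhs)

# RS is stable in the paramagnetic region (small sigma), and the condition
# fails once a nonzero q develops, as for the SK model in zero field.
print(rs_stable(0.8), rs_stable(1.5))   # -> True False
```

At \(\sigma = 0.8\) the fixed point is \(q = 0\), so the right-hand side equals 1 and RS is stable; at \(\sigma = 1.5\) one finds \(q \approx 0.3\) and the condition fails, consistent with the zero-field instability of the SK model below its critical temperature.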

Appendix B: Synthetic Dataset

The multimodal distribution modeling the N-dimensional synthetic data is

$$\begin{aligned} P(s) = \sum _{c=1}^C p_c\prod _{i=1}^N \frac{e^{h_i^c s_i}}{2\cosh (h_i^c)}, \end{aligned}$$
(45)

where \(C\) is the number of clusters, \(p_c\) is the weight of cluster \(c\) and \({\varvec{h}}^c\) its hidden field. The weights \(p_c\) are drawn at random and normalized, while the fields \(h_i^c\) are specified through the magnetizations \(m_i^c = \tanh (h_i^c)\). Expanding over the spectral modes, we set an effective dimension \(d\) by restricting the sum to the range \(\alpha = 1, \dots , d \)

$$\begin{aligned} m_i^c = \sum _{\alpha = 1}^d m_{\alpha }^c u_i^\alpha \end{aligned}$$
(46)

Clusters’ magnetizations \(m_{\alpha }^c\) are drawn uniformly at random in \([-1, 1]\) and normalized with the factor

$$\begin{aligned} Z = \sqrt{\frac{\sum _{\alpha } (m_{\alpha }^c)^2}{d \cdot r}}, \quad r = \tanh (\eta ) \end{aligned}$$
(47)

where \(r\) is introduced to decrease the clusters’ polarizations (in our simulations, we used \(\eta = 0.3\)). The spectral basis \( u_i^\alpha \) is obtained by drawing \(d\) N-dimensional vectors at random and applying the Gram–Schmidt process (which is numerically safe since N is assumed to be large, so that the initial random vectors are already nearly orthogonal). The hidden fields are then obtained from the magnetizations

$$\begin{aligned} h_i^c = \tanh ^{-1}(m_i^c) \end{aligned}$$
(48)

and the samples are generated by choosing a cluster according to \(p_c\) and setting the visible variables to \( \pm 1\) according to

$$\begin{aligned} p(s_i = 1) = \frac{1}{1 + e^{-2 h_i^c}} \end{aligned}$$
(49)
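The full generative pipeline of Eqs. (45)–(49) can be sketched as follows. This is a minimal NumPy illustration, not the authors’ code: the function name `make_dataset` and the default values of N, C and d are placeholders chosen for the example (only \(\eta = 0.3\) matches the text), and the Gram–Schmidt step is performed via a QR decomposition, which is equivalent here.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_samples, N=1000, C=5, d=15, eta=0.3):
    """Sample n_samples spin configurations from the multimodal
    distribution of Eq. (45)."""
    # Cluster weights p_c: drawn at random, then normalized.
    p = rng.random(C)
    p /= p.sum()

    # Spectral basis u^alpha: d random N-dimensional vectors,
    # orthonormalized (QR on a random matrix plays the role of Gram-Schmidt).
    u = np.linalg.qr(rng.standard_normal((N, d)))[0]        # shape (N, d)

    # Cluster magnetizations on the modes, rescaled so that
    # sum_alpha (m_alpha^c)^2 = d * tanh(eta) for each cluster (Eq. 47).
    m_spec = rng.uniform(-1.0, 1.0, size=(C, d))
    Z = np.sqrt((m_spec**2).sum(axis=1, keepdims=True) / (d * np.tanh(eta)))
    m_spec /= Z

    # Site magnetizations (Eq. 46) and hidden fields (Eq. 48).
    m = m_spec @ u.T                                        # shape (C, N)
    h = np.arctanh(m)

    # Pick a cluster per sample, then draw spins s_i = +/-1 (Eq. 49).
    c = rng.choice(C, size=n_samples, p=p)
    prob_up = 1.0 / (1.0 + np.exp(-2.0 * h[c]))             # P(s_i = 1)
    s = np.where(rng.random((n_samples, N)) < prob_up, 1, -1)
    return s, c

samples, labels = make_dataset(100)
```

For large N the site magnetizations \(m_i^c\) stay well inside \((-1,1)\) (their scale is \(\sqrt{d\,r/N}\)), so the \(\tanh ^{-1}\) of Eq. (48) is well defined.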

Cite this article

Decelle, A., Fissore, G. & Furtlehner, C. Thermodynamics of Restricted Boltzmann Machines and Related Learning Dynamics. J Stat Phys 172, 1576–1608 (2018). https://doi.org/10.1007/s10955-018-2105-y
