Machine Learning, Volume 93, Issue 1, pp 53–69

The flip-the-state transition operator for restricted Boltzmann machines

Abstract

Most learning and sampling algorithms for restricted Boltzmann machines (RBMs) rely on Markov chain Monte Carlo (MCMC) methods using Gibbs sampling. The most prominent examples are Contrastive Divergence (CD) learning and its variants as well as Parallel Tempering (PT). The performance of these methods depends strongly on the mixing properties of the Gibbs chain. We propose a Metropolis-type MCMC algorithm relying on a transition operator that maximizes the probability of state changes. It is shown that the operator induces an irreducible, aperiodic, and hence properly converging Markov chain, even for the periodic update schemes typically used in practice. The transition operator can replace Gibbs sampling in RBM learning algorithms without computational overhead. It is shown empirically that this leads to faster mixing and, in turn, to more accurate learning.
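
The abstract only sketches the idea, so the following is a minimal illustration (our own sketch, not the authors' code) of a Metropolis-type "flip" update for the hidden layer of an RBM. It assumes the standard conditional P(h_j = 1 | v) = sigmoid(vᵀW + c)_j and proposes deterministically flipping each unit, accepting with probability min(1, p_flip / p_current); for a binary variable this acceptance rule satisfies detailed balance with respect to the conditional while maximizing the probability of a state change. The function and variable names (flip_the_state_update, W, c) are our own.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def flip_the_state_update(h, v, W, c, rng):
        # Illustrative flip-the-state style update of the hidden layer h
        # given the visible layer v (a sketch, not the authors' code).
        # Hidden units are conditionally independent given v, so all
        # units can be updated in parallel, as in blocked Gibbs sampling.
        q = sigmoid(v @ W + c)                 # P(h_j = 1 | v)
        p_cur = np.where(h == 1, q, 1.0 - q)   # probability of current state
        p_flip = 1.0 - p_cur                   # probability of flipped state
        # Propose the flip deterministically; accept with the Metropolis
        # ratio min(1, p_flip / p_cur). Detailed balance w.r.t. the
        # conditional holds: p_cur * min(1, p_flip/p_cur) = min(p_cur, p_flip).
        accept = rng.random(h.shape) < np.minimum(1.0, p_flip / p_cur)
        return np.where(accept, 1 - h, h)

    # Tiny usage example with random parameters.
    rng = np.random.default_rng(0)
    n_visible, n_hidden = 6, 4
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    c = np.zeros(n_hidden)
    v = rng.integers(0, 2, size=n_visible)
    h = rng.integers(0, 2, size=n_hidden)
    h = flip_the_state_update(h, v, W, c, rng)

The same update applies symmetrically to the visible layer given h, so such an operator can be swapped in wherever a learning algorithm such as CD or PT would perform a Gibbs step.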

Keywords

Restricted Boltzmann machine · Markov chain Monte Carlo · Gibbs sampling · Mixing rate · Contrastive divergence learning · Parallel tempering

Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Department of Computer Science, University of Helsinki, Helsinki, Finland
  2. Helsinki Institute for Information Technology HIIT, Helsinki, Finland
  3. Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany
  4. Department of Computer Science, University of Copenhagen, Copenhagen, Denmark