On the Equivalence between Deep NADE and Generative Stochastic Networks

  • Li Yao
  • Sherjil Ozair
  • Kyunghyun Cho
  • Yoshua Bengio
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8726)


Neural Autoregressive Distribution Estimators (NADEs) have recently been shown to be successful alternatives for modeling high-dimensional multimodal distributions. One issue with NADEs is that they rely on a particular order of factorization for P(x). This issue has recently been addressed by a variant called Orderless NADE and its deeper version, Deep Orderless NADE. Orderless NADEs are trained with a criterion that stochastically maximizes P(x) over all possible orders of factorization. Unfortunately, ancestral sampling from a deep NADE is very expensive: it requires a separate pass through a neural network to predict each visible variable given some of the others. This work makes a connection between this criterion and the training criterion for Generative Stochastic Networks (GSNs). It shows that training NADEs in this way also trains a GSN, which defines a Markov chain associated with the NADE model. Based on this connection, we show an alternative way to sample from a trained Orderless NADE that allows one to trade off computing time against sample quality: a 3- to 10-fold speedup (taking into account the waste due to correlations between consecutive samples of the chain) can be obtained without noticeably reducing the quality of the samples. This is achieved using a novel sampling procedure for GSNs called annealed GSN sampling, which, like tempering methods, combines fast mixing (obtained thanks to steps at high noise levels) with accurate samples (obtained thanks to steps at low noise levels).
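To make the idea of annealed sampling concrete, here is a minimal sketch in pure Python, not the authors' implementation. It runs a Markov chain over binary vectors in which each step picks a random subset of variables, resamples them from conditionals computed in a single model call, and lowers the resampling probability over time. Everything named here is an assumption for illustration: `toy_conditionals` is a hypothetical stand-in for a trained Orderless NADE, and the linear schedule with parameters `p_max`, `p_min` and `alpha` is one plausible way to move from high to low noise.

```python
import random


def toy_conditionals(x, observed):
    """Hypothetical stand-in for one forward pass of a trained Orderless NADE.

    x is a list of 0/1 values; observed[i] is True where x[i] is kept.
    Returns a list with P(x_i = 1 | kept bits) for every position i.
    This toy rule simply pulls each bit toward the mean of the kept bits.
    """
    kept = [xi for xi, o in zip(x, observed) if o]
    mean = sum(kept) / len(kept) if kept else 0.5
    return [0.1 + 0.8 * mean] * len(x)


def noise_schedule(n_steps, p_max=0.9, p_min=0.1, alpha=0.05):
    """Assumed linear annealing: start at p_max, drop by alpha per step,
    and never go below p_min."""
    return [max(p_min, p_max - t * alpha) for t in range(n_steps)]


def annealed_chain(x0, n_steps, rng, conditionals=toy_conditionals, **schedule):
    """Run a masked-resampling Markov chain with an annealed noise level."""
    x = list(x0)
    for p in noise_schedule(n_steps, **schedule):
        # Corrupt: each bit is selected for resampling with probability p.
        resample = [rng.random() < p for _ in x]
        # One model call gives conditionals for all selected bits at once.
        probs = conditionals(x, [not r for r in resample])
        # Selected bits are redrawn independently; the rest are kept.
        x = [int(rng.random() < q) if r else xi
             for xi, q, r in zip(x, probs, resample)]
    return x


rng = random.Random(0)
sample = annealed_chain([0] * 8, n_steps=20, rng=rng)
```

Early steps with a large `p` replace most of the vector and so move quickly between regions of the space, while late steps with a small `p` change only a few bits conditioned on the rest. Each step costs one model call, whereas ancestral sampling needs one call per variable.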


Keywords: Markov Chain; Markov Chain Monte Carlo; Training Procedure; Deep Learning; Neural Information Processing System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Li Yao (1)
  • Sherjil Ozair (1)
  • Kyunghyun Cho (1)
  • Yoshua Bengio (1)
  1. Département d’Informatique et de Recherche Opérationnelle, Université de Montréal, Canada
