
Deep Networks with Stochastic Depth

  • Gao Huang
  • Yu Sun
  • Zhuang Liu
  • Daniel Sedra
  • Kilian Q. Weinberger
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9908)

Abstract

Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91 % on CIFAR-10).
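
To make the training procedure concrete, the following is a minimal sketch of a single residual block trained with stochastic depth, written in PyTorch-style Python for illustration. The class name StochasticDepthBlock, the two-convolution residual branch, and the survival_prob value are assumptions chosen for this example, not the authors' implementation.

```python
# Minimal sketch (assumptions noted above): one residual block that is randomly
# bypassed during training and kept, with expected scaling, at test time.
import torch
import torch.nn as nn


class StochasticDepthBlock(nn.Module):
    """Residual block that may be skipped for an entire mini-batch."""

    def __init__(self, channels: int, survival_prob: float = 0.8):
        super().__init__()
        self.survival_prob = survival_prob  # probability of keeping this block
        # Illustrative residual branch: two 3x3 convolutions with batch norm.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Training: with probability 1 - survival_prob, drop the whole
            # residual branch for this mini-batch and pass the input through
            # the identity shortcut unchanged.
            if torch.rand(1).item() < self.survival_prob:
                return self.relu(x + self.residual(x))
            return x
        # Test time: keep every block, scaling the residual branch by its
        # survival probability to match its expected contribution in training.
        return self.relu(x + self.survival_prob * self.residual(x))
```

In the paper, each block receives its own survival probability, with earlier blocks kept more often than later ones; the fixed value 0.8 above is purely illustrative.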

Keywords

Training Time · Test Error · Constant Depth · Validation Error · Early Layer

Acknowledgements

We thank the anonymous reviewers for their kind suggestions. Kilian Weinberger is supported by NSF grants IIS-1550179, IIS-1525919 and EFRI-1137211. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No. 20150015). Yu Sun is supported by the Cornell University Office of Undergraduate Research. We also thank our labmates Matthew Kusner and Shuang Li for useful and interesting discussions.

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Gao Huang (1) (corresponding author)
  • Yu Sun (1)
  • Zhuang Liu (2)
  • Daniel Sedra (1)
  • Kilian Q. Weinberger (1)

  1. Cornell University, Ithaca, USA
  2. Tsinghua University, Beijing, China
