Dropout Rademacher complexity of deep neural networks

Abstract

Deep neural networks have achieved great success in a variety of real-world applications. Many algorithmic and implementation techniques have been developed; however, the theoretical understanding of many aspects of deep neural networks remains far from clear. A particularly interesting issue is the usefulness of dropout, which was motivated by the intuition of preventing complex co-adaptation of feature detectors. In this paper, we study the Rademacher complexity of different types of dropout. Our theoretical results disclose that for shallow neural networks (with one hidden layer or none) dropout reduces the Rademacher complexity polynomially, whereas for deep neural networks it can, remarkably, lead to an exponential reduction.
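To make the two objects studied above concrete, here is a minimal NumPy sketch; the helper names `dropout` and `empirical_rademacher` are illustrative and not from the paper. Inverted dropout zeroes each unit independently with probability p (rescaling survivors so expectations are unchanged), and the empirical Rademacher complexity of a finite function class can be estimated by Monte Carlo averaging over random sign vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Inverted dropout: zero each unit with probability p and rescale
    survivors by 1/(1-p), so the expected activation is unchanged."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p          # keep each unit w.p. 1-p
    return x * mask / (1.0 - p)

def empirical_rademacher(outputs, n_trials=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity
    E_sigma [ sup_f (1/m) sum_i sigma_i f(x_i) ] for a finite class,
    where `outputs` is an (n_functions, m) array with outputs[f, i] = f(x_i)."""
    m = outputs.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher sign vector
        total += np.max(outputs @ sigma) / m     # supremum over the class
    return total / n_trials
```

Applying `dropout` to the hidden activations of a network and re-evaluating `empirical_rademacher` on the resulting (smaller) function class is one way to probe empirically the polynomial-versus-exponential reduction the paper proves.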



Author information


Correspondence to Zhi-Hua Zhou.



Cite this article

Gao, W., Zhou, ZH. Dropout Rademacher complexity of deep neural networks. Sci. China Inf. Sci. 59, 072104 (2016). https://doi.org/10.1007/s11432-015-5470-z


Keywords

  • artificial intelligence
  • machine learning
  • deep learning
  • dropout
  • Rademacher complexity