Dropout Rademacher complexity of deep neural networks

Abstract

Deep neural networks have achieved great success in a variety of real-world applications, and many algorithmic and implementation techniques have been developed for them; the theoretical understanding of many aspects of deep neural networks, however, remains far from clear. A particularly interesting issue is the usefulness of dropout, which was motivated by the intuition of preventing complex co-adaptation of feature detectors. In this paper, we study the Rademacher complexity of different types of dropout. Our theoretical results show that for shallow neural networks (with one hidden layer or none) dropout reduces the Rademacher complexity polynomially, whereas for deep neural networks it can, remarkably, lead to an exponential reduction.
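
As a reference point for the quantity named in the abstract, the standard definition of the empirical Rademacher complexity of a hypothesis class $\mathcal{F}$ over a sample $S = \{x_1, \dots, x_m\}$ is given below; the paper's dropout-specific variant of this quantity may differ in its details.

$$
\hat{\mathfrak{R}}_S(\mathcal{F}) \;=\; \mathbb{E}_{\sigma}\!\left[\, \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \right],
$$

where the $\sigma_i$ are i.i.d. Rademacher variables, each taking the values $+1$ and $-1$ with probability $1/2$. Smaller Rademacher complexity of the (dropout-restricted) hypothesis class translates into tighter generalization bounds.

The abstract also refers to dropout only informally, so the following is a minimal sketch of standard (inverted) dropout applied to one hidden layer at training time. The function name `dropout_forward` and the toy network are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def dropout_forward(h, p=0.5, training=True, rng=None):
    """Inverted dropout on a layer activation h.

    h: activations of shape (batch, units)
    p: probability of *keeping* each unit (0 < p <= 1)
    """
    if not training or p >= 1.0:
        return h
    rng = np.random.default_rng() if rng is None else rng
    # Bernoulli mask: each unit is kept independently with probability p;
    # surviving activations are rescaled by 1/p so the expected activation
    # matches the no-dropout forward pass used at test time.
    mask = rng.binomial(1, p, size=h.shape)
    return h * mask / p

# Example: apply dropout to one hidden layer of a toy network.
x = np.random.randn(4, 8)            # mini-batch of 4 inputs
W = np.random.randn(8, 16) * 0.1     # hypothetical weight matrix
hidden = np.maximum(0, x @ W)        # ReLU hidden layer
hidden = dropout_forward(hidden, p=0.5, training=True)
```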

Author information

Correspondence to Zhi-Hua Zhou.

About this article

Cite this article

Gao, W., Zhou, Z. Dropout Rademacher complexity of deep neural networks. Sci. China Inf. Sci. 59, 072104 (2016). https://doi.org/10.1007/s11432-015-5470-z

Keywords

  • artificial intelligence
  • machine learning
  • deep learning
  • dropout
  • Rademacher complexity