
Drop-Activation: Implicit Parameter Reduction and Harmonious Regularization

Original Paper
Communications on Applied Mathematics and Computation

Abstract

Overfitting frequently occurs in deep learning. In this paper, we propose a novel regularization method called drop-activation to reduce overfitting and improve generalization. The key idea is to drop nonlinear activation functions by setting them to be identity functions randomly during training time. During testing, we use a deterministic network with a new activation function to encode the average effect of dropping activations randomly. Our theoretical analyses support the regularization effect of drop-activation as implicit parameter reduction and verify its capability to be used together with batch normalization (Ioffe and Szegedy in Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015). The experimental results on CIFAR10, CIFAR100, SVHN, EMNIST, and ImageNet show that drop-activation generally improves the performance of popular neural network architectures for the image classification task. Furthermore, as a regularizer drop-activation can be used in harmony with standard training and regularization techniques such as batch normalization and AutoAugment (Cubuk et al. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123, 2019). The code is available at https://github.com/LeungSamWai/Drop-Activation.
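To make the training/testing rule above concrete, here is a minimal PyTorch-style sketch. This is our illustration, not the authors' reference implementation (which is available at the repository linked above); the module name DropActivation and the convention that p is the probability of keeping the nonlinearity are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropActivation(nn.Module):
    """Illustrative sketch of drop-activation.

    During training, each nonlinearity is kept (ReLU applied) with
    probability p and dropped (replaced by the identity) with probability
    1 - p. At test time the stochastic layer is replaced by its average,
    p * relu(x) + (1 - p) * x, a deterministic leaky-ReLU-like function.
    """

    def __init__(self, p=0.95):
        super().__init__()
        self.p = p  # probability of keeping the nonlinearity

    def forward(self, x):
        if self.training:
            # Bernoulli mask per element: 1 -> apply ReLU, 0 -> identity.
            mask = torch.bernoulli(torch.full_like(x, self.p))
            return mask * F.relu(x) + (1.0 - mask) * x
        # Deterministic test-time activation encoding the average effect.
        return self.p * F.relu(x) + (1.0 - self.p) * x
```

In this form, the method can be tried by simply replacing the nn.ReLU modules of an existing network with DropActivation.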



References

  1. Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: an extension of MNIST to handwritten letters (2017). arXiv:1702.05373

  2. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation strategies from data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123 (2019)

  3. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout (2017). arXiv:1708.04552

  4. Gastaldi, X.: Shake-shake regularization (2017). arXiv:1705.07485

  5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  6. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer, Cham (2016)

  7. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  8. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  9. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: European Conference on Computer Vision, pp. 646–661. Springer, Cham (2016)

  10. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift (2015). arXiv:1502.03167

  11. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, Toronto (2009)

  12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  13. Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N.R., Goyal, A., Bengio, Y., Courville, A., Pal, C.: Zoneout: regularizing RNNs by randomly preserving hidden activations (2016). arXiv:1606.01305

  14. Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., Pennington, J.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Advances in Neural Information Processing Systems, pp. 8570–8581 (2019)

  15. Li, X., Chen, S., Hu, X., Yang, J.: Understanding the disharmony between dropout and batch normalization by variance shift. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690 (2019)

  16. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain (2011)

  17. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Li, F.F.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)


  18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556

  19. Singh, S., Hoiem, D., Forsyth, D.: Swapout: learning an ensemble of deep architectures. In: Advances in Neural Information Processing Systems, pp. 28–36 (2016)

  20. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)


  21. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147 (2013)

  22. Wager, S., Wang, S., Liang, P.S.: Dropout training as adaptive regularization. In: Advances in Neural Information Processing Systems, pp. 351–359 (2013)

  23. Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., Fergus, R.: Regularization of neural networks using dropconnect. In: International Conference on Machine Learning, pp. 1058–1066 (2013)

  24. Xie, L., Wang, J., Wei, Z., Wang, M., Tian, Q.: DisturbLabel: regularizing CNN on the loss layer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4753–4762 (2016)

  25. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)

  26. Xu, B., Wang, N., Chen, T., Li, M.L.: Empirical evaluation of rectified activations in convolutional network (2015). arXiv:1505.00853

  27. Yamada, Y., Iwamura, M., Akiba, T., Kise, K.: Shakedrop regularization for deep residual learning (2018). arXiv:1802.02375

  28. Zagoruyko, S., Komodakis, N.: Wide residual networks (2016). arXiv:1605.07146

  29. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks (2013). arXiv:1301.3557

  30. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833. Springer, Cham (2014)

  31. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization (2017). arXiv:1710.09412


Acknowledgements

S. Liang and H. Yang gratefully acknowledge the support of the National Supercomputing Centre (NSCC) Singapore (https://www.nscc.sg), on whose resources part of the computational work for this article was performed, and of High-Performance Computing (HPC) at the National University of Singapore for providing computational resources, as well as the support of NVIDIA Corporation through the donation of the Titan Xp GPU used for this research. H. Yang was partially supported by the National Science Foundation under grant award 1945029.

Author information

Correspondence to Haizhao Yang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Appendix A

A1 The Simple Model for Finding the Best Parameter p

To find the best parameter for drop-activation, we perform a grid search on a simple model. The network stacks three blocks, each consisting of a convolution with \(3\times 3\) filters, BN, ReLU, and average pooling, followed by two fully connected layers, as shown in Fig. A1. The numbers of \(3\times 3\) filters in \(\text {Block}_1\), \(\text {Block}_2\), and \(\text {Block}_3\) are 32, 64, and 128, respectively, and the widths of the fully connected layers are 1 000 and 10.

Fig. A1 The model for finding the best parameter for drop-activation
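A minimal sketch of this model is given below, under assumptions not stated in the text: CIFAR-style three-channel \(32\times 32\) inputs, padding 1 in the convolutions, \(2\times 2\) average pooling, and no activation between the two fully connected layers. The grid search would swap the block activation between ReLU and a drop-activation layer such as the one sketched after the abstract.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, act):
    """One block from Fig. A1: 3x3 convolution -> BN -> activation -> average pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        act,
        nn.AvgPool2d(2),
    )

class SimpleModel(nn.Module):
    """Sketch of the grid-search model: three blocks with 32/64/128 filters,
    followed by fully connected layers of width 1 000 and 10."""

    def __init__(self, activation=nn.ReLU):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32, activation()),    # 32x32 -> 16x16
            conv_block(32, 64, activation()),   # 16x16 -> 8x8
            conv_block(64, 128, activation()),  # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 1000),
            nn.Linear(1000, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

For the grid search over p, one would instantiate, e.g., SimpleModel(activation=lambda: DropActivation(p=p)) for each candidate value of p.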

A2 Introduction of Datasets

We use the following datasets in our numerical experiments.

CIFAR Both CIFAR10 and CIFAR100 contain 60 k color natural images of size 32 by 32, with 50 k images for training and 10 k for testing. CIFAR10 has ten object classes with 6 k images per class. CIFAR100 is similar to CIFAR10, except that it has 100 classes with 600 images per class. Normalization and standard data augmentation (random cropping and horizontal flipping) are applied to the training data as in [5]; a sketch of this preprocessing is given below.
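The following torchvision sketch shows this standard preprocessing. The 4-pixel padding and the normalization statistics are the commonly used CIFAR10 values and are assumptions for illustration rather than values quoted from the paper.

```python
from torchvision import transforms

# Commonly used CIFAR10 per-channel statistics (assumed, not taken from the paper).
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random cropping after 4-pixel padding
    transforms.RandomHorizontalFlip(),      # random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
```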

SVHN The street view house numbers (SVHN) dataset contains ten classes of color digit images of size 32 by 32. There are about 73 k training images, 26 k testing images, and an additional 531 k images. The training and additional images are used together for training, giving over 600 k training images in total. An image in SVHN may contain more than one digit, and the task is to recognize the digit at the center of the image. We preprocess the images following [28]: the pixel values are rescaled to [0, 1], and no data augmentation is applied.

EMNIST EMNIST is a set of \(28\times 28\) grayscale images of handwritten English characters and digits. The dataset has six different splits, and we use the "Balanced" split, which contains 131 600 images in total: 112 800 for training and 18 800 for testing.

ImageNet 2012 The ImageNet 2012 dataset consists of 1.28 million training images and 50 k validation images from 1 000 classes. The models are evaluated on the validation set. We train the models for 120 epochs with an initial learning rate of 0.1.
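Only the epoch count and the initial learning rate are stated here (the full settings are listed in Table A2). The sketch below fills in an assumed SGD-with-momentum optimizer and a typical step-decay schedule purely for illustration.

```python
import torch

# Only the epoch count (120) and initial learning rate (0.1) come from the text;
# momentum, weight decay, and the decay milestones are typical ImageNet choices
# assumed here for illustration (the settings actually used are in Table A2).
def make_optimizer(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1)
    return optimizer, scheduler

# Skeleton of the 120-epoch training loop (one scheduler step per epoch):
# optimizer, scheduler = make_optimizer(model)
# for epoch in range(120):
#     train_one_epoch(model, optimizer)   # hypothetical helper
#     scheduler.step()
```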

A3 Implementation Details

The hyper-parameters for the different networks are listed in Tables A1, A2, and A3, and the hyper-parameter names are explained in Table A4.

Table A1 Hyper-parameter setting for training models on CIFAR10/100 and EMNIST
Table A2 Hyper-parameter setting for training models on ImageNet
Table A3 Hyper-parameter setting for training models on SVHN
Table A4 The explanation of hyper-parameter names


Cite this article

Liang, S., Khoo, Y. & Yang, H. Drop-Activation: Implicit Parameter Reduction and Harmonious Regularization. Commun. Appl. Math. Comput. 3, 293–311 (2021). https://doi.org/10.1007/s42967-020-00085-3
