
Drop-Activation: Implicit Parameter Reduction and Harmonious Regularization

Original Paper
Communications on Applied Mathematics and Computation

Abstract

Overfitting frequently occurs in deep learning. In this paper, we propose a novel regularization method called drop-activation to reduce overfitting and improve generalization. The key idea is to drop nonlinear activation functions by setting them to be identity functions randomly during training time. During testing, we use a deterministic network with a new activation function to encode the average effect of dropping activations randomly. Our theoretical analyses support the regularization effect of drop-activation as implicit parameter reduction and verify its capability to be used together with batch normalization (Ioffe and Szegedy in Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015). The experimental results on CIFAR10, CIFAR100, SVHN, EMNIST, and ImageNet show that drop-activation generally improves the performance of popular neural network architectures for the image classification task. Furthermore, as a regularizer drop-activation can be used in harmony with standard training and regularization techniques such as batch normalization and AutoAugment (Cubuk et al. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123, 2019). The code is available at https://github.com/LeungSamWai/Drop-Activation.
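To make the training/testing rule above concrete, here is a minimal PyTorch-style sketch. This is our illustration, not the authors' reference implementation (which is available at the repository linked above); the module name DropActivation and the convention that p is the probability of keeping the nonlinearity are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropActivation(nn.Module):
    """Illustrative sketch of drop-activation.

    During training, each nonlinearity is kept (ReLU applied) with
    probability p and dropped (replaced by the identity) with probability
    1 - p. At test time the stochastic layer is replaced by its average,
    p * relu(x) + (1 - p) * x, a deterministic leaky-ReLU-like function.
    """

    def __init__(self, p=0.95):
        super().__init__()
        self.p = p  # probability of keeping the nonlinearity

    def forward(self, x):
        if self.training:
            # Bernoulli mask per element: 1 -> apply ReLU, 0 -> identity.
            mask = torch.bernoulli(torch.full_like(x, self.p))
            return mask * F.relu(x) + (1.0 - mask) * x
        # Deterministic test-time activation encoding the average effect.
        return self.p * F.relu(x) + (1.0 - self.p) * x
```

In this form, the method can be tried by simply replacing the nn.ReLU modules of an existing network with DropActivation.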



References

  1. Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: an extension of MNIST to handwritten letters (2017). arXiv:1702.05373

  2. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation strategies from data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123 (2019)

  3. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout (2017). arXiv:1708.04552

  4. Gastaldi, X.: Shake-shake regularization (2017). arXiv:1705.07485

  5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  6. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer, Cham (2016)

  7. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  8. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  9. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: European Conference on Computer Vision, pp. 646–661. Springer, Cham (2016)

  10. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift (2015). arXiv:1502.03167

  11. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, Toronto (2009)

  12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  13. Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N.R., Goyal, A., Bengio, Y., Courville, A., Pal, C.: Zoneout: regularizing RNNs by randomly preserving hidden activations (2016). arXiv:1606.01305

  14. Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., Pennington, J.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Advances in Neural Information Processing Systems, pp. 8570–8581 (2019)

  15. Li, X., Chen, S., Hu, X., Yang, J.: Understanding the disharmony between dropout and batch normalization by variance shift. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690 (2019)

  16. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain (2011)

  17. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Li, F.F.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)


  18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556

  19. Singh, S., Hoiem, D., Forsyth, D.: Swapout: learning an ensemble of deep architectures. In: Advances in Neural Information Processing Systems, pp. 28–36 (2016)

  20. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)


  21. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147 (2013)

  22. Wager, S., Wang, S., Liang, P.S.: Dropout training as adaptive regularization. In: Advances in Neural Information Processing Systems, pp. 351–359 (2013)

  23. Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., Fergus, R.: Regularization of neural networks using dropconnect. In: International Conference on Machine Learning, pp. 1058–1066 (2013)

  24. Xie, L., Wang, J., Wei, Z., Wang, M., Tian, Q.: DisturbLabel: regularizing CNN on the loss layer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4753–4762 (2016)

  25. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)

  26. Xu, B., Wang, N., Chen, T., Li, M.L.: Empirical evaluation of rectified activations in convolutional network (2015). arXiv:1505.00853

  27. Yamada, Y., Iwamura, M., Akiba, T., Kise, K.: Shakedrop regularization for deep residual learning (2018). arXiv:1802.02375

  28. Zagoruyko, S., Komodakis, N.: Wide residual networks (2016). arXiv:1605.07146

  29. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks (2013). arXiv:1301.3557

  30. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833. Springer, Cham (2014)

  31. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization (2017). arXiv:1710.09412


Acknowledgements

S. Liang and H. Yang gratefully acknowledge the support of the National Supercomputing Centre (NSCC) Singapore (https://www.nscc.sg), on whose resources part of the computational work for this article was performed, and of High-Performance Computing (HPC) at the National University of Singapore for providing computational resources, as well as the support of NVIDIA Corporation through the donation of the Titan Xp GPU used for this research. H. Yang was partially supported by the National Science Foundation under grant award 1945029.

Author information

Correspondence to Haizhao Yang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Appendix A

A1 The Simple Model for Finding the Best Parameter p

To find the best parameter for drop-activation, we perform a grid search on a simple model. The network stacks three blocks, each consisting of a convolution with \(3\times 3\) filters, BN, ReLU, and average pooling, followed by two fully connected layers, as shown in Fig. A1. The numbers of \(3\times 3\) filters in \(\text {Block}_1\), \(\text {Block}_2\), and \(\text {Block}_3\) are 32, 64, and 128, respectively, and the widths of the fully connected layers are 1 000 and 10.

Fig. A1 The model for finding the best parameter for drop-activation
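A minimal sketch of this model is given below, under assumptions not stated in the text: CIFAR-style three-channel \(32\times 32\) inputs, padding 1 in the convolutions, \(2\times 2\) average pooling, and no activation between the two fully connected layers. The grid search would swap the block activation between ReLU and a drop-activation layer such as the one sketched after the abstract.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, act):
    """One block from Fig. A1: 3x3 convolution -> BN -> activation -> average pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        act,
        nn.AvgPool2d(2),
    )

class SimpleModel(nn.Module):
    """Sketch of the grid-search model: three blocks with 32/64/128 filters,
    followed by fully connected layers of width 1 000 and 10."""

    def __init__(self, activation=nn.ReLU):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32, activation()),    # 32x32 -> 16x16
            conv_block(32, 64, activation()),   # 16x16 -> 8x8
            conv_block(64, 128, activation()),  # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 1000),
            nn.Linear(1000, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

For the grid search over p, one would instantiate, e.g., SimpleModel(activation=lambda: DropActivation(p=p)) for each candidate value of p.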

A2 Introduction of Datasets

We use the following datasets in our numerical experiments.

CIFAR Both CIFAR10 and CIFAR100 contain 60 k color natural images of size 32 by 32, with 50 k images for training and 10 k for testing. CIFAR10 has ten object classes with 6 k images per class. CIFAR100 is similar to CIFAR10, except that it has 100 classes with 600 images per class. Normalization and standard data augmentation (random cropping and horizontal flipping) are applied to the training data as in [5]; a sketch of this preprocessing is given below.
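The following torchvision sketch shows this standard preprocessing. The 4-pixel padding and the normalization statistics are the commonly used CIFAR10 values and are assumptions for illustration rather than values quoted from the paper.

```python
from torchvision import transforms

# Commonly used CIFAR10 per-channel statistics (assumed, not taken from the paper).
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random cropping after 4-pixel padding
    transforms.RandomHorizontalFlip(),      # random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
```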

SVHN The street view house numbers (SVHN) dataset contains ten classes of color digit images of size 32 by 32. There are about 73 k training images, 26 k testing images, and an additional 531 k images. The training and additional images are used together for training, giving over 600 k training images in total. An image in SVHN may contain more than one digit, and the task is to recognize the digit at the center of the image. We preprocess the images following [28]: the pixel values are rescaled to [0, 1], and no data augmentation is applied.

EMNIST EMNIST is a set of \(28\times 28\) grayscale images of handwritten English characters and digits. The dataset has six different splits, and we use the "Balanced" split, which contains 131 600 images in total: 112 800 for training and 18 800 for testing.

ImageNet 2012 The ImageNet 2012 dataset consists of 1.28 million training images and 50 k validation images from 1 000 classes. The models are evaluated on the validation set. We train the models for 120 epochs with an initial learning rate of 0.1.
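Only the epoch count and the initial learning rate are stated here (the full settings are listed in Table A2). The sketch below fills in an assumed SGD-with-momentum optimizer and a typical step-decay schedule purely for illustration.

```python
import torch

# Only the epoch count (120) and initial learning rate (0.1) come from the text;
# momentum, weight decay, and the decay milestones are typical ImageNet choices
# assumed here for illustration (the settings actually used are in Table A2).
def make_optimizer(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1)
    return optimizer, scheduler

# Skeleton of the 120-epoch training loop (one scheduler step per epoch):
# optimizer, scheduler = make_optimizer(model)
# for epoch in range(120):
#     train_one_epoch(model, optimizer)   # hypothetical helper
#     scheduler.step()
```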

A3 Implementation Details

The hyper-parameters for the different networks are listed in Tables A1, A2, and A3, and the hyper-parameter names are explained in Table A4.

Table A1 Hyper-parameter setting for training models on CIFAR10/100 and EMNIST
Table A2 Hyper-parameter setting for training models on ImageNet
Table A3 Hyper-parameter setting for training models on SVHN
Table A4 The explanation of hyper-parameter names


Cite this article

Liang, S., Khoo, Y. & Yang, H. Drop-Activation: Implicit Parameter Reduction and Harmonious Regularization. Commun. Appl. Math. Comput. 3, 293–311 (2021). https://doi.org/10.1007/s42967-020-00085-3
