Make \(\ell _1\) regularization effective in training sparse CNN

Abstract

Compressed sensing using \(\ell _1\) regularization is among the most powerful and popular sparsification techniques in many applications, so why has it not been used to obtain sparse deep learning models such as convolutional neural networks (CNNs)? This paper aims to answer this question and to show how to make it work. Following Xiao (J Mach Learn Res 11(Oct):2543–2596, 2010), we first demonstrate that the commonly used stochastic gradient descent training algorithm and its variants are not an appropriate match for \(\ell _1\) regularization, and we then replace them with a different training algorithm based on the regularized dual averaging (RDA) method. The RDA method of Xiao (J Mach Learn Res 11(Oct):2543–2596, 2010) was originally designed specifically for convex problems, but with new theoretical insight and algorithmic modifications (using proper initialization and adaptivity), we have made it an effective match for \(\ell _1\) regularization, achieving state-of-the-art sparsity for highly non-convex CNNs compared to other weight pruning methods without compromising accuracy (for example, 95% sparsity for ResNet-18 on CIFAR-10).
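
To make the RDA idea concrete, the following is a minimal sketch of the basic \(\ell _1\)-RDA update from Xiao (2010) for the convex setting, not the modified algorithm of this paper (the proper initialization and adaptivity mentioned above are omitted). It assumes the auxiliary function \(\frac{1}{2}\Vert w\Vert _2^2\) with weight \(\gamma \sqrt{t}\), for which the update has the closed form \(w_{t+1} = -\frac{\sqrt{t}}{\gamma }\,\mathrm{shrink}(\bar{g}_t, \lambda )\), where \(\bar{g}_t\) is the running average of all past gradients. The function names, the toy quadratic objective, and the parameter values are illustrative assumptions.

```python
import numpy as np

def l1_rda_step(g_bar, t, lam, gamma):
    """Closed-form l1-RDA update given the average gradient g_bar after t steps.

    Assumes the auxiliary function 0.5 * ||w||^2 weighted by gamma * sqrt(t).
    """
    # Soft-threshold the AVERAGE gradient with the fixed threshold lam:
    # coordinates with |g_bar_i| <= lam are set exactly to zero.
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk

def l1_rda(grad_fn, dim, steps, lam, gamma):
    """Run l1-RDA from w = 0; grad_fn(w) returns a (sub)gradient of the loss."""
    w = np.zeros(dim)
    g_bar = np.zeros(dim)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        g_bar += (g - g_bar) / t          # running average of gradients
        w = l1_rda_step(g_bar, t, lam, gamma)
    return w

# Toy convex problem (an assumption for illustration):
# minimize 0.5 * ||w - c||^2 + lam * ||w||_1 using exact gradients w - c.
c = np.array([1.0, 0.05, -2.0])
w_final = l1_rda(lambda w: w - c, dim=3, steps=2500, lam=0.1, gamma=1.0)
# The middle coordinate stays exactly zero because |c[1]| <= lam.
print(w_final)
```

The threshold in this update is the fixed value \(\lambda \) applied to the averaged gradient, so small coordinates are set exactly to zero at every step. In a proximal stochastic gradient step, by contrast, the threshold is scaled by the step size and shrinks as the step size decays.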

Notes

  1. In the original paper [38], RDA is proposed as an online learning algorithm, which processes one input at a time.

References

  1. Alvarez, J.M., Salzmann, M.: Learning the number of neurons in deep networks. In: Advances in Neural Information Processing Systems, pp. 2270–2278 (2016)

  2. Bertsekas, D.P.: Incremental proximal methods for large scale convex optimization. Math. Program. 129, 163 (2011)

  3. Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)

  4. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks (2017). arXiv preprint arXiv:1710.09282

  5. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)

  6. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10(Dec), 2899–2934 (2009)

  7. Eldar, Y.C., Kutyniok, G.: Compressed Sensing: Theory and Applications. Cambridge University Press, Cambridge (2012)

  8. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)

  9. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding (2015). arXiv preprint arXiv:1510.00149

  10. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015)

  11. Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: optimal brain surgeon. In: Advances in Neural Information Processing Systems, pp. 164–171 (1993)

  12. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)

  13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  14. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision (ICCV), vol. 2 (2017)

  15. Hu, H., Peng, R., Tai, Y.-W., Tang, C.-K.: Network trimming: a data-driven neuron pruning approach towards efficient deep architectures (2016). arXiv preprint arXiv:1607.03250

  16. Huang, Z., Wang, N.: Data-driven sparse structure selection for deep neural networks (2017). arXiv preprint arXiv:1707.01213

  17. Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. J. Mach. Learn. Res. 10(2), 777–801 (2009)

  18. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)

  19. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural Information Processing Systems, pp. 598–605 (1990)

  20. LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient backprop. In: Montavon, G., Orr, G., Müller, K.R. (eds.) Neural Networks: Tricks of the Trade, pp. 9–48. Springer, Berlin (2012)

  21. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets (2016). arXiv preprint arXiv:1608.08710

  22. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)

  23. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763. IEEE (2017)

  24. Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T.: Rethinking the value of network pruning (2018). arXiv preprint arXiv:1810.05270

  25. Luo, J.-H., Wu, J., Lin, W.: Thinet: a filter level pruning method for deep neural network compression (2017). arXiv preprint arXiv:1707.06342

  26. Lustig, M., Donoho, D., Pauly, J.M.: Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn. Reson. Med. Off. J. Int. Soc. Magn. Reson. Med. 58(6), 1182–1195 (2007)

  27. McMahan, B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and l1 regularization. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 525–533 (2011)

  28. McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(1), 3117–3166 (2017)

  29. Mine, H., Fukushima, M.: A minimization method for the sum of a convex function and a continuously differentiable function. J. Optim. Theory Appl. 33(1), 9–23 (1981)

  30. Mittal, D., Bhardwaj, S., Khapra, M.M., Ravindran, B.: Recovering from random pruning: on the plasticity of deep convolutional neural networks (2018)

  31. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  32. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)

  33. Pascanu, R., Mikolov, T., Bengio, Y.: Understanding the exploding gradient problem (2012). CoRR arXiv:abs/1211.5063

  34. Pratt, L.Y.: Comparing biases for minimal network construction with back-propagation. In: International Conference on Neural Information Processing Systems, pp. 177–185 (1988)

  35. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)

  36. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2074–2082 (2016)

  37. Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, pp. 2116–2124 (2009)

  38. Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11(Oct), 2543–2596 (2010)

  39. Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model compression (2017). arXiv preprint arXiv:1710.01878

Acknowledgements

This work was partially supported by the Penn State and Peking University Joint Center for Computational Mathematics and Applications, the Beijing International Center for Mathematical Research from Peking University, and the Verne M. William Professorship Fund from Penn State University. The research of L. Zhao and L. Zhang was also supported by the China Scholarship Council (for visiting Penn State) and by HKUST16301218 Hong Kong RGC Competitive Earmarked Research Grant (for visiting Penn State), respectively. The authors wish to thank Drs. Lin Xiao and Liang Yang for helpful suggestions and discussions.

Author information

Corresponding author

Correspondence to Jinchao Xu.

About this article

Cite this article

He, J., Jia, X., Xu, J. et al. Make \(\ell _1\) regularization effective in training sparse CNN. Comput Optim Appl 77, 163–182 (2020). https://doi.org/10.1007/s10589-020-00202-1

Keywords

  • Sparse optimization
  • \(\ell _1\) regularization
  • Dual averaging
  • CNN