
Regularization-based pruning of irrelevant weights in deep neural architectures

Abstract

Deep neural networks with millions of parameters are currently the norm. This is a potential issue because of the large number of computations needed for training and because overparameterized networks may lose generalization performance. In this paper we propose a method for learning sparse neural topologies via a regularization approach that identifies irrelevant weights in any type of layer (i.e., convolutional, fully connected, attention and embedding ones) and selectively shrinks their norm, while relevant weights receive a standard back-propagation update. This technique, an improvement over classical weight decay, is based on a regularization term that can be added to any loss function regardless of its form, resulting in a unified, general framework exploitable in many different contexts. The actual elimination of the parameters identified as irrelevant is handled by an iterative pruning algorithm.
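For illustration only, the snippet below gives a minimal PyTorch sketch of the general recipe described above: a weight-decay-like penalty applied only to the weights currently deemed irrelevant, combined with an iterative pruning step that zeroes them. The relevance criterion used here (a plain magnitude threshold), the function names and all hyperparameters are assumptions of this sketch, not the paper's actual formulation.

```python
import torch
import torch.nn as nn

def selective_decay(model, lambda_reg=1e-4, threshold=1e-2):
    """Penalize only 'irrelevant' weights (here: magnitude below `threshold`);
    relevant weights receive no extra penalty, only the usual gradient update."""
    reg = torch.zeros((), device=next(model.parameters()).device)
    for p in model.parameters():
        if p.requires_grad:
            mask = (p.detach().abs() < threshold).float()  # 1 for weights flagged irrelevant
            reg = reg + (mask * p).pow(2).sum()
    return lambda_reg * reg

def prune_step(model, threshold=1e-2):
    """Iterative pruning: zero out weights that stayed below the threshold
    (a persistent mask would be needed to keep them at zero; omitted for brevity)."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((p.abs() >= threshold).float())

# Toy usage on random data: train with the selective penalty, prune periodically.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))

for epoch in range(30):
    opt.zero_grad()
    loss = loss_fn(model(x), y) + selective_decay(model)
    loss.backward()
    opt.step()
    if (epoch + 1) % 10 == 0:
        prune_step(model)

total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.1%}")
```

In the paper's framework it is the relevance criterion and the exact form of the regularization term that distinguish the method from plain weight decay; the sketch only conveys the overall structure of selective shrinking followed by iterative pruning.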

To explore the possibility of interdisciplinary use of the proposed technique, we test it on six different image classification and natural language generation tasks, four of which are based on real datasets. We reach state-of-the-art performance on one of the four imaging tasks and obtain results better than competitors' on the remaining imaging tasks and on one of the two considered language generation tasks, in terms of both compression and quality metrics.

Acknowledgements

This activity was partially carried out in the context of the Visiting Professor Program of the Gruppo Nazionale per il Calcolo Scientifico (GNCS) of the Italian Istituto Nazionale di Alta Matematica (INdAM).

Author information

Corresponding author

Correspondence to Giovanni Bonetta.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bonetta, G., Ribero, M. & Cancelliere, R. Regularization-based pruning of irrelevant weights in deep neural architectures. Appl Intell 53, 17429–17443 (2023). https://doi.org/10.1007/s10489-022-04353-y
