Abstract
Deploying convolutional neural networks on resource-constrained hardware platforms is challenging for ubiquitous AI applications. In latency-sensitive scenarios, real-time inference requires model compression techniques such as network pruning to accelerate inference. However, much prior research focuses on hardware-independent filter pruning methods, which cannot balance the latency contribution of a pruned structure against the resulting accuracy drop. Although some pruning methods introduce latency constraints into the pruning process, most rely on look-up tables, which omit the key step of hardware optimization and therefore produce significant deviations in latency estimation. In this paper, we propose a novel latency-constrained pruning method, named hardware-in-loop pruning (HILP). It is based on a fast search for the optimal pruning rate within each layer combined with layer-wise hybrid pruning, which prioritizes removing less important layers with considerable latency contributions. The proposed hardware-in-loop pipeline integrates the hardware optimization module into the entire framework: during pruning, each intermediate network architecture is automatically transformed into a deployable model for accurate latency measurement, and the latency-optimized intermediate architecture is then selected by traversing all layers for the next progressive step. HILP is applicable to any platform that provides a hardware optimization toolchain, such as NVIDIA GPUs and Cambricon NPUs. We evaluate HILP on image classification (ResNet50 on ImageNet) and object detection (YOLOv3 on COCO); HILP reduces the inference latency of these two networks to 60% and 75% of the original, respectively, while keeping the accuracy variation within 0.6%. Extensive experimental results show that HILP achieves a significant latency-accuracy advantage over state-of-the-art methods.
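The progressive, measurement-driven search the abstract describes can be sketched as a toy loop. Everything below is an illustrative stand-in: the quadratic latency model, the importance proxy, and the fixed 25% per-step pruning rate are assumptions for demonstration only. In HILP proper, each candidate architecture would be compiled by the vendor toolchain (e.g. TensorRT) and timed on the real device rather than scored by a formula.

```python
def measure_latency(widths):
    # Toy latency model: in HILP this would deploy the candidate model
    # through the hardware optimization toolchain and time real inference.
    return sum(0.01 * w * w for w in widths)

def layer_importance(widths, layer):
    # Hypothetical accuracy-sensitivity proxy (purely illustrative).
    return widths[layer] * (layer + 1)

def hilp_step(widths, prune_frac=0.25):
    """One progressive step: prune each layer in turn, 'deploy and measure'
    every candidate, keep the one with best latency saved per importance lost."""
    base_lat = measure_latency(widths)
    best, best_score = widths, -1.0
    for i, w in enumerate(widths):
        pruned = max(1, int(w * (1 - prune_frac)))
        cand = widths[:i] + [pruned] + widths[i + 1:]
        saved = base_lat - measure_latency(cand)  # real HW measurement in HILP
        score = saved / layer_importance(widths, i)
        if score > best_score:
            best, best_score = cand, score
    return best

def hilp(widths, target_ratio=0.6):
    """Progressively prune until latency drops to target_ratio of baseline."""
    base = measure_latency(widths)
    while measure_latency(widths) > target_ratio * base:
        nxt = hilp_step(widths)
        if nxt == widths:  # no candidate improves latency; stop
            break
        widths = nxt
    return widths

print(hilp([64, 128, 256, 512]))
```

Note the design choice the abstract emphasizes: the selection criterion divides measured latency savings by an importance cost, so a less important layer with a large latency contribution is pruned first.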
Data Availability
The datasets generated during and/or analysed during the current study are available in the ImageNet (https://image-net.org/download.php) and the COCO (https://cocodataset.org/#download) repository. All other data generated or analysed during this study are included in this article.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, D., Ye, Q., Guo, X. et al. HILP: hardware-in-loop pruning of convolutional neural networks towards inference acceleration. Neural Comput & Applic 36, 8825–8842 (2024). https://doi.org/10.1007/s00521-024-09539-8