Abstract
In recent years, accelerating the inference stage of deep convolutional neural networks (DNNs) has become increasingly important for their efficient deployment. However, with the proliferation of heterogeneous computing devices, today's popular deep learning inference tools support only specific devices and therefore cannot effectively exploit different GPUs to accelerate DNN inference. To address this issue, we propose an OpenCL-based parallel inference algorithm for deep convolutional neural networks. First, we design and implement parallel OpenCL kernel code to accelerate depthwise separable convolution, and implement parallel matrix multiplication combined with clBLAS to accelerate conventional convolution; we also design OpenCL kernels for the remaining operations in the inference stage. Second, we further improve inference performance through kernel fusion and by increasing the workload per core. Finally, we run the OpenCL implementations of the MobileNet v1 network and a 21-layer residual network on an AMD Radeon Vega Frontier GPU and an Nvidia GeForce GTX 1070 GPU. Compared with the Caffe implementation, speedups of 40.16x and 1.67x are achieved on the AMD GPU, and 14.95x and 1.11x on the Nvidia GPU.
References
Guo, P.: Multi-institutional collaborations for improving deep learning-based magnetic resonance image reconstruction using federated learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. IEEE, Piscataway, NJ (2021)
Wang, J.: End-to-end object detection with fully convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 15849–15858. IEEE, Piscataway, NJ (2021)
Das, A.: Enabling on-device smartphone GPU based training: lessons learned. In: 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), pp. 533–538. IEEE, Piscataway, NJ (2022)
Kim, S.: Performance evaluation of INT8 quantized inference on mobile GPUs. IEEE Access 9, 164245–164255 (2021)
Wai, Y.J.: Fixed point implementation of Tiny-Yolo-v2 using OpenCL on FPGA. Int. J. Adv. Comput. Sci. Appl. 9(10), 506–512 (2018)
Mu, J.: Optimizing OpenCL-based CNN design on FPGA with comprehensive design space exploration and collaborative performance modeling. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 13(3), 1–28 (2020)
Koo, Y., Kim, S., Ha, Y.-G.: OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework. World Wide Web 24(4), 1299–1319 (2020). https://doi.org/10.1007/s11280-020-00778-y
Dagli, R., Eken, S.: Deploying a smart queuing system on edge with Intel OpenVINO toolkit. Soft. Comput. 25(15), 10103–10115 (2021). https://doi.org/10.1007/s00500-021-05891-2
Marco, V.S.: Optimizing deep learning inference on embedded systems through adaptive model selection. ACM Trans. Embed. Comput. Syst. 19(1), 1–28 (2020)
Dua, A.: Systolic-CNN: an OpenCL-defined scalable run-time-flexible FPGA accelerator architecture for accelerating convolutional neural network inference in cloud/edge computing. In: Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), p. 231. IEEE, Piscataway, NJ (2020)
Lin, D.L.: Accelerating large sparse neural network inference using GPU task graph parallelism. IEEE Trans. Parallel Distrib. Syst. 33(11), 3041–3052 (2021)
He, S.: An efficient GPU-accelerated inference engine for binary neural network on mobile phones. J. Syst. Architect. 117, 102156 (2021)
Chen, J.: Split convolutional neural networks for distributed inference on concurrent IoT sensors. In: International Conference on Parallel and Distributed Systems (ICPADS), pp. 66–73. IEEE, Piscataway, NJ (2021)
Howard, A.G.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
He, K.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE, Piscataway, NJ (2016)
Qin, Z.: Diagonalwise refactorization: an efficient training method for depthwise convolutions. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 770–778. IEEE, Piscataway, NJ (2018)
Funding
This work is funded in part by the Key Research and Development Program of Shaanxi (Program No. 2022ZDLGY01-09), GHfund A (No. 202107014474), GHfund C (No. 202202036165), the Wuhu and Xidian University special fund for industry-university-research cooperation (Project No. XWYCXY-012021013), and the Cloud Computing Key Laboratory of Gansu Province.
Copyright information
© 2022 IFIP International Federation for Information Processing
Cite this paper
Wu, Y., Zhu, H., Zhang, L., Hou, B., Jiao, L. (2022). Accelerating Deep Convolutional Neural Network Inference Based on OpenCL. In: Shi, Z., Jin, Y., Zhang, X. (eds) Intelligence Science IV. ICIS 2022. IFIP Advances in Information and Communication Technology, vol 659. Springer, Cham. https://doi.org/10.1007/978-3-031-14903-0_11
DOI: https://doi.org/10.1007/978-3-031-14903-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14902-3
Online ISBN: 978-3-031-14903-0
eBook Packages: Computer Science, Computer Science (R0)