Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

  • Conference paper
  • In: Euro-Par 2020: Parallel Processing (Euro-Par 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12247)


Abstract

Accelerating deep learning inference is critical for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies data reuse analysis and access optimization across the levels of the memory hierarchy. To balance computation and memory access, we explore fusion opportunities in the CNN computation graph and propose three fusion modes: straight, merge, and split. We then design an approach for generating efficient fused code that exploits multi-level memory usage for cross-layer data reuse. The effectiveness of our method is evaluated on network layers from state-of-the-art CNNs on two GPU platforms, NVIDIA TITAN Xp and Tesla P4. The experiments show an average speedup of 2.02× on representative CNN structures and 1.57× on end-to-end inference of SqueezeNet.
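
To make the cross-layer reuse idea concrete, the CUDA sketch below illustrates a "straight"-mode fusion of two single-channel 3×3 convolutions. It is not the authors' generated code: the tile size, halo handling, and the names TILE, HALO, and fused_conv3x3_pair are illustrative assumptions. The block stages an input tile in shared memory, computes the first layer's intermediate tile there, and consumes it directly for the second layer, so the intermediate feature map never round-trips through global memory.

    // A minimal sketch, assuming single-channel feature maps, 3x3 kernels, and
    // zero padding; TILE, HALO, and fused_conv3x3_pair are hypothetical names,
    // not the paper's generated code.
    #include <cuda_runtime.h>

    #define TILE 16                       // output tile edge computed per block
    #define HALO 2                        // two stacked 3x3 layers need a 2-pixel halo
    #define IN_TILE  (TILE + 2 * HALO)    // input patch staged per block (20x20)
    #define MID_TILE (TILE + 2)           // intermediate patch layer 2 reads (18x18)

    __global__ void fused_conv3x3_pair(const float* in, float* out,
                                       const float* w1, const float* w2,
                                       int H, int W) {
        __shared__ float s_in[IN_TILE][IN_TILE];    // staged input tile (with halo)
        __shared__ float s_mid[MID_TILE][MID_TILE]; // layer-1 output, reused by layer 2

        const int ox = blockIdx.x * TILE;  // top-left corner of this block's output tile
        const int oy = blockIdx.y * TILE;

        // Stage the input patch (including halo) into shared memory,
        // zero-padding out-of-bounds pixels.
        for (int i = threadIdx.y; i < IN_TILE; i += blockDim.y)
            for (int j = threadIdx.x; j < IN_TILE; j += blockDim.x) {
                int y = oy + i - HALO, x = ox + j - HALO;
                s_in[i][j] = (y >= 0 && y < H && x >= 0 && x < W) ? in[y * W + x] : 0.0f;
            }
        __syncthreads();

        // Layer 1: produce the intermediate tile in shared memory only. The extra
        // one-pixel border is the redundant halo computation that fusion trades
        // for the avoided global-memory round trip.
        for (int i = threadIdx.y; i < MID_TILE; i += blockDim.y)
            for (int j = threadIdx.x; j < MID_TILE; j += blockDim.x) {
                float acc = 0.0f;
                for (int ky = 0; ky < 3; ++ky)
                    for (int kx = 0; kx < 3; ++kx)
                        acc += w1[ky * 3 + kx] * s_in[i + ky][j + kx];
                s_mid[i][j] = acc;
            }
        __syncthreads();

        // Layer 2: consume the intermediate tile straight from shared memory and
        // write only the final result to global memory.
        int ty = threadIdx.y, tx = threadIdx.x;
        if (ty < TILE && tx < TILE && oy + ty < H && ox + tx < W) {
            float acc = 0.0f;
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    acc += w2[ky * 3 + kx] * s_mid[ty + ky][tx + kx];
            out[(oy + ty) * W + (ox + tx)] = acc;
        }
    }

A matching launch under these assumptions would be fused_conv3x3_pair<<<dim3((W + 15) / 16, (H + 15) / 16), dim3(16, 16)>>>(d_in, d_out, d_w1, d_w2, H, W). The saving is the skipped global-memory traffic for the intermediate feature map, paid for with redundant halo computation; balancing these two costs is what the paper's fusion modes and multi-level memory analysis address.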

References

  1. Alwani, M., Chen, H., Ferdman, M., Milder, P.: Fused-layer CNN accelerators. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, October 2016. https://doi.org/10.1109/micro.2016.7783725

  2. Chen, T., et al.: TVM: an automated end-to-end optimizing compiler for deep learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pp. 579–594. USENIX Association (2018)

  3. Cheng, J., Grossman, M., McKercher, T.: Professional CUDA C Programming. Wiley, Hoboken (2014)

  4. Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)

  5. Fang, M., Fang, J., Zhang, W., Zhou, H., Liao, J., Wang, Y.: Benchmarking the GPU memory at the warp level. Parallel Comput. 71, 23–41 (2018). https://doi.org/10.1016/j.parco.2017.11.003

  6. Filipovič, J., Madzin, M., Fousek, J., Matyska, L.: Optimizing CUDA code by kernel fusion: application on BLAS. J. Supercomput. 71(10), 3934–3957 (2015). https://doi.org/10.1007/s11227-015-1483-z

  7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2016. https://doi.org/10.1109/cvpr.2016.90

  8. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)

  9. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)

  10. Li, C., Yang, Y., Feng, M., Chakradhar, S., Zhou, H.: Optimizing memory efficiency for deep convolutional neural networks on GPUs. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, November 2016. https://doi.org/10.1109/sc.2016.53

  11. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017)

  12. Mazaheri, A., Schulte, J., Moskewicz, M.W., Wolf, F., Jannesari, A.: Enhancing the programmability and performance portability of GPU tensor operations. In: Yahyapour, R. (ed.) Euro-Par 2019. LNCS, vol. 11725, pp. 213–226. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29400-7_16

  13. NVIDIA: CUDA Compute Unified Device Architecture Programming Guide (2007)

  14. Qiao, B., Reiche, O., Hannig, F., Teich, J.: From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, February 2019. https://doi.org/10.1109/cgo.2019.8661176

  15. Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not. 48(6), 519–530 (2013). https://doi.org/10.1145/2499370.2462176

  16. Szegedy, C., et al.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2015. https://doi.org/10.1109/cvpr.2015.7298594

  17. Wahib, M., Maruyama, N.: Scalable kernel fusion for memory-bound GPU applications. In: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, November 2014. https://doi.org/10.1109/sc.2014.21

  18. Wang, X.: Artifact and instructions to generate experimental results for conference proceeding 2020 paper: accelerating deep learning inference with cross-layer data reuse on GPUs, July 2020. https://doi.org/10.6084/m9.figshare.12571928. https://springernature.figshare.com/articles/software/Artifact_and_instructions_to_generate_experimental_results_for_conference_proceeding_2020_paper_Accelerating_Deep_Learning_Inference_with_Cross-Layer_Data_Reuse_on_GPUs/12571928/1

  19. Wu, H., Diamos, G., Wang, J., Cadambi, S., Yalamanchili, S., Chakradhar, S.: Optimizing data warehousing applications for GPUs using kernel fusion/fission. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, May 2012. https://doi.org/10.1109/ipdpsw.2012.300

  20. Zhou, X., Giacalone, J.P., Garzarán, M.J., Kuhn, R.H., Ni, Y., Padua, D.: Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM Press (2012). https://doi.org/10.1145/2259016.2259044

Acknowledgements and Data Availability Statement

The authors thank Zhen Zhang for helpful discussion. This work is supported by the National Key R&D Program of China under Grant No. 2017YFB1003103, and the Science Fund for Creative Research Groups of the National Natural Science Foundation of China under Grant No. 61521092.

The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12571928 [18].

Author information

Corresponding author: Lei Liu.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, X., Li, G., Dong, X., Li, J., Liu, L., Feng, X. (2020). Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol. 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_14

  • DOI: https://doi.org/10.1007/978-3-030-57675-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57674-5

  • Online ISBN: 978-3-030-57675-2

  • eBook Packages: Computer Science, Computer Science (R0)
