Abstract
Accelerating deep learning inference is critical for real-time applications. In this paper, we propose a novel method for fusing the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies data-reuse analysis and access optimization at different levels of the memory hierarchy. To balance computation and memory access, we explore fusion opportunities in the CNN computation graph and propose three fusion modes for convolutional neural networks: straight, merge, and split. We then design an approach for generating efficient fused code that exploits multi-level memory usage for cross-layer data reuse. We evaluate our method on network layers from state-of-the-art CNNs on two GPU platforms, NVIDIA TITAN Xp and Tesla P4. The experiments show an average speedup of 2.02\(\times\) on representative CNN structures and of 1.57\(\times\) on end-to-end inference of SqueezeNet.
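To make the fusion idea concrete, here is a minimal CUDA sketch of the straight fusion mode under strong simplifying assumptions (single input/output channel, unit stride, 3x3 kernels, replicate padding at the image borders, a fixed tile size). It is an illustration, not the code produced by the paper's generator, and every name in it is hypothetical. The key point is that the output tile of the first convolution stays in shared memory and is consumed directly by the second convolution, so the intermediate feature map never travels through global memory.

// Straight-mode fusion of two 3x3 convolutions (hypothetical sketch).
// Each block computes a TILE x TILE output tile; the intermediate
// tile produced by layer 1 lives only in shared memory.
#define TILE 14                // output tile edge per block
#define R    1                 // radius of a 3x3 kernel
#define MID  (TILE + 2 * R)    // intermediate tile edge (layer-1 output)
#define IN   (MID + 2 * R)     // input tile edge, including double halo

__global__ void fused_conv3x3_conv3x3(const float* __restrict__ in,
                                      const float* __restrict__ w1,
                                      const float* __restrict__ w2,
                                      float* __restrict__ out,
                                      int width, int height) {
    __shared__ float s_in[IN][IN];     // staged input tile (with halo)
    __shared__ float s_mid[MID][MID];  // layer-1 output, reused by layer 2

    const int bx = blockIdx.x * TILE, by = blockIdx.y * TILE;
    const int tx = threadIdx.x, ty = threadIdx.y;  // blockDim = (MID, MID)

    // Stage the input tile into shared memory; border reads are clamped
    // to the image edge (replicate padding).
    for (int y = ty; y < IN; y += blockDim.y)
        for (int x = tx; x < IN; x += blockDim.x) {
            int gx = min(max(bx + x - 2 * R, 0), width - 1);
            int gy = min(max(by + y - 2 * R, 0), height - 1);
            s_in[y][x] = in[gy * width + gx];
        }
    __syncthreads();

    // Layer 1 (conv + ReLU): each thread produces one intermediate
    // element, which stays on chip instead of going to global memory.
    float acc1 = 0.0f;
    for (int ky = 0; ky < 3; ++ky)
        for (int kx = 0; kx < 3; ++kx)
            acc1 += w1[ky * 3 + kx] * s_in[ty + ky][tx + kx];
    s_mid[ty][tx] = fmaxf(acc1, 0.0f);
    __syncthreads();

    // Layer 2 (conv + ReLU): only the inner TILE x TILE threads write
    // final results; the rest exist to compute the halo of layer 1.
    if (tx < TILE && ty < TILE) {
        float acc2 = 0.0f;
        for (int ky = 0; ky < 3; ++ky)
            for (int kx = 0; kx < 3; ++kx)
                acc2 += w2[ky * 3 + kx] * s_mid[ty + ky][tx + kx];
        int gx = bx + tx, gy = by + ty;
        if (gx < width && gy < height)
            out[gy * width + gx] = fmaxf(acc2, 0.0f);
    }
}

Launched as fused_conv3x3_conv3x3<<<dim3((width+TILE-1)/TILE, (height+TILE-1)/TILE), dim3(MID, MID)>>>(in, w1, w2, out, width, height), the kernel covers the whole image. Compared with two separate kernels, fusion removes the global-memory round trip of the intermediate feature map but recomputes the overlapping halo regions at tile borders; weighing this extra computation against the saved memory traffic is the computation/memory balance the abstract refers to.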
References
Alwani, M., Chen, H., Ferdman, M., Milder, P.: Fused-layer CNN accelerators. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, October 2016. https://doi.org/10.1109/micro.2016.7783725
Chen, T., et al.: TVM: an automated end-to-end optimizing compiler for deep learning. In: Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI 2018), pp. 579–594. USENIX Association (2018)
Cheng, J., Grossman, M., McKercher, T.: Professional CUDA C Programming. Wiley, Hoboken (2014)
Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)
Fang, M., Fang, J., Zhang, W., Zhou, H., Liao, J., Wang, Y.: Benchmarking the GPU memory at the warp level. Parallel Comput. 71, 23–41 (2018). https://doi.org/10.1016/j.parco.2017.11.003
Filipovič, J., Madzin, M., Fousek, J., Matyska, L.: Optimizing CUDA code by kernel fusion: application on BLAS. J. Supercomput. 71(10), 3934–3957 (2015). https://doi.org/10.1007/s11227-015-1483-z
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2016. https://doi.org/10.1109/cvpr.2016.90
Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)
Li, C., Yang, Y., Feng, M., Chakradhar, S., Zhou, H.: Optimizing memory efficiency for deep convolutional neural networks on GPUs. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, November 2016. https://doi.org/10.1109/sc.2016.53
Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017)
Mazaheri, A., Schulte, J., Moskewicz, M.W., Wolf, F., Jannesari, A.: Enhancing the programmability and performance portability of GPU tensor operations. In: Yahyapour, R. (ed.) Euro-Par 2019. LNCS, vol. 11725, pp. 213–226. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29400-7_16
NVIDIA: CUDA Compute Unified Device Architecture Programming Guide (2007)
Qiao, B., Reiche, O., Hannig, F., Teich, J.: From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, February 2019. https://doi.org/10.1109/cgo.2019.8661176
Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not. 48(6), 519–530 (2013). https://doi.org/10.1145/2499370.2462176
Szegedy, C., et al.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2015. https://doi.org/10.1109/cvpr.2015.7298594
Wahib, M., Maruyama, N.: Scalable kernel fusion for memory-bound GPU applications. In: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, November 2014. https://doi.org/10.1109/sc.2014.21
Wang, X.: Artifact and instructions to generate experimental results for conference proceeding 2020 paper: accelerating deep learning inference with cross-layer data reuse on GPUs, July 2020. Figshare. https://doi.org/10.6084/m9.figshare.12571928
Wu, H., Diamos, G., Wang, J., Cadambi, S., Yalamanchili, S., Chakradhar, S.: Optimizing data warehousing applications for GPUs using kernel fusion/fission. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, May 2012. https://doi.org/10.1109/ipdpsw.2012.300
Zhou, X., Giacalone, J.P., Garzarán, M.J., Kuhn, R.H., Ni, Y., Padua, D.: Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM Press (2012). https://doi.org/10.1145/2259016.2259044
Acknowledgements and Data Availability Statement
The authors thank Zhen Zhang for helpful discussions. This work is supported by the National Key R&D Program of China under Grant No. 2017YFB1003103 and by the Science Fund for Creative Research Groups of the National Natural Science Foundation of China under Grant No. 61521092.
The datasets and code generated and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12571928 [18].
Cite this paper
Wang, X., Li, G., Dong, X., Li, J., Liu, L., Feng, X.: Accelerating deep learning inference with cross-layer data reuse on GPUs. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. LNCS, vol. 12247. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57675-2_14
Print ISBN: 978-3-030-57674-5. Online ISBN: 978-3-030-57675-2.