Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

  • Conference paper
  • In: Euro-Par 2020: Parallel Processing (Euro-Par 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12247)


Abstract

Accelerating deep learning inference is critical for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies data reuse analysis and access optimization across the levels of the memory hierarchy. To balance computation and memory access, we explore fusion opportunities in the CNN computation graph and propose three fusion modes: straight, merge, and split. We then design an approach for generating efficient fused code that exploits multi-level memory usage for cross-layer data reuse. The effectiveness of our method is evaluated on network layers from state-of-the-art CNNs on two GPU platforms, NVIDIA TITAN Xp and Tesla P4. The experiments show an average speedup of 2.02× on representative CNN structures and 1.57× on end-to-end inference of SqueezeNet.
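
To make the cross-layer reuse idea concrete, the CUDA sketch below illustrates a "straight"-mode fusion of two single-channel 3×3 convolutions. It is not the authors' generated code: the tile size, halo handling, and the names TILE, HALO, and fused_conv3x3_pair are illustrative assumptions. The block stages an input tile in shared memory, computes the first layer's intermediate tile there, and consumes it directly for the second layer, so the intermediate feature map never round-trips through global memory.

    // A minimal sketch, assuming single-channel feature maps, 3x3 kernels, and
    // zero padding; TILE, HALO, and fused_conv3x3_pair are hypothetical names,
    // not the paper's generated code.
    #include <cuda_runtime.h>

    #define TILE 16                       // output tile edge computed per block
    #define HALO 2                        // two stacked 3x3 layers need a 2-pixel halo
    #define IN_TILE  (TILE + 2 * HALO)    // input patch staged per block (20x20)
    #define MID_TILE (TILE + 2)           // intermediate patch layer 2 reads (18x18)

    __global__ void fused_conv3x3_pair(const float* in, float* out,
                                       const float* w1, const float* w2,
                                       int H, int W) {
        __shared__ float s_in[IN_TILE][IN_TILE];    // staged input tile (with halo)
        __shared__ float s_mid[MID_TILE][MID_TILE]; // layer-1 output, reused by layer 2

        const int ox = blockIdx.x * TILE;  // top-left corner of this block's output tile
        const int oy = blockIdx.y * TILE;

        // Stage the input patch (including halo) into shared memory,
        // zero-padding out-of-bounds pixels.
        for (int i = threadIdx.y; i < IN_TILE; i += blockDim.y)
            for (int j = threadIdx.x; j < IN_TILE; j += blockDim.x) {
                int y = oy + i - HALO, x = ox + j - HALO;
                s_in[i][j] = (y >= 0 && y < H && x >= 0 && x < W) ? in[y * W + x] : 0.0f;
            }
        __syncthreads();

        // Layer 1: produce the intermediate tile in shared memory only. The extra
        // one-pixel border is the redundant halo computation that fusion trades
        // for the avoided global-memory round trip.
        for (int i = threadIdx.y; i < MID_TILE; i += blockDim.y)
            for (int j = threadIdx.x; j < MID_TILE; j += blockDim.x) {
                float acc = 0.0f;
                for (int ky = 0; ky < 3; ++ky)
                    for (int kx = 0; kx < 3; ++kx)
                        acc += w1[ky * 3 + kx] * s_in[i + ky][j + kx];
                s_mid[i][j] = acc;
            }
        __syncthreads();

        // Layer 2: consume the intermediate tile straight from shared memory and
        // write only the final result to global memory.
        int ty = threadIdx.y, tx = threadIdx.x;
        if (ty < TILE && tx < TILE && oy + ty < H && ox + tx < W) {
            float acc = 0.0f;
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    acc += w2[ky * 3 + kx] * s_mid[ty + ky][tx + kx];
            out[(oy + ty) * W + (ox + tx)] = acc;
        }
    }

A matching launch under these assumptions would be fused_conv3x3_pair<<<dim3((W + 15) / 16, (H + 15) / 16), dim3(16, 16)>>>(d_in, d_out, d_w1, d_w2, H, W). The saving is the skipped global-memory traffic for the intermediate feature map, paid for with redundant halo computation; balancing these two costs is what the paper's fusion modes and multi-level memory analysis address.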

References

  1. Alwani, M., Chen, H., Ferdman, M., Milder, P.: Fused-layer CNN accelerators. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, October 2016. https://doi.org/10.1109/micro.2016.7783725

  2. Chen, T., et al.: TVM: an automated end-to-end optimizing compiler for deep learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pp. 579–594. USENIX Association (2018)

  3. Cheng, J., Grossman, M., McKercher, T.: Professional CUDA C Programming. Wiley, Hoboken (2014)

  4. Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)

  5. Fang, M., Fang, J., Zhang, W., Zhou, H., Liao, J., Wang, Y.: Benchmarking the GPU memory at the warp level. Parallel Comput. 71, 23–41 (2018). https://doi.org/10.1016/j.parco.2017.11.003

  6. Filipovič, J., Madzin, M., Fousek, J., Matyska, L.: Optimizing CUDA code by kernel fusion: application on BLAS. J. Supercomput. 71(10), 3934–3957 (2015). https://doi.org/10.1007/s11227-015-1483-z

  7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2016. https://doi.org/10.1109/cvpr.2016.90

  8. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)

  9. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)

  10. Li, C., Yang, Y., Feng, M., Chakradhar, S., Zhou, H.: Optimizing memory efficiency for deep convolutional neural networks on GPUs. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, November 2016. https://doi.org/10.1109/sc.2016.53

  11. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017)

  12. Mazaheri, A., Schulte, J., Moskewicz, M.W., Wolf, F., Jannesari, A.: Enhancing the programmability and performance portability of GPU tensor operations. In: Yahyapour, R. (ed.) Euro-Par 2019. LNCS, vol. 11725, pp. 213–226. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29400-7_16

  13. NVIDIA: CUDA Compute Unified Device Architecture Programming Guide (2007)

  14. Qiao, B., Reiche, O., Hannig, F., Teich, J.: From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, February 2019. https://doi.org/10.1109/cgo.2019.8661176

  15. Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not. 48(6), 519–530 (2013). https://doi.org/10.1145/2499370.2462176

  16. Szegedy, C., et al.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2015. https://doi.org/10.1109/cvpr.2015.7298594

  17. Wahib, M., Maruyama, N.: Scalable kernel fusion for memory-bound GPU applications. In: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, November 2014. https://doi.org/10.1109/sc.2014.21

  18. Wang, X.: Artifact and instructions to generate experimental results for conference proceeding 2020 paper: accelerating deep learning inference with cross-layer data reuse on GPUs, July 2020. https://doi.org/10.6084/m9.figshare.12571928. https://springernature.figshare.com/articles/software/Artifact_and_instructions_to_generate_experimental_results_for_conference_proceeding_2020_paper_Accelerating_Deep_Learning_Inference_with_Cross-Layer_Data_Reuse_on_GPUs/12571928/1

  19. Wu, H., Diamos, G., Wang, J., Cadambi, S., Yalamanchili, S., Chakradhar, S.: Optimizing data warehousing applications for GPUs using kernel fusion/fission. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, May 2012. https://doi.org/10.1109/ipdpsw.2012.300

  20. Zhou, X., Giacalone, J.P., Garzarán, M.J., Kuhn, R.H., Ni, Y., Padua, D.: Hierarchical overlapped tiling. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM Press (2012). https://doi.org/10.1145/2259016.2259044

Acknowledgements and Data Availability Statement

The authors thank Zhen Zhang for helpful discussion. This work is supported by the National Key R&D Program of China under Grant No. 2017YFB1003103, and the Science Fund for Creative Research Groups of the National Natural Science Foundation of China under Grant No. 61521092.

The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12571928 [18].

Author information

Corresponding author: Lei Liu.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, X., Li, G., Dong, X., Li, J., Liu, L., Feng, X. (2020). Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol. 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_14

  • DOI: https://doi.org/10.1007/978-3-030-57675-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57674-5

  • Online ISBN: 978-3-030-57675-2

  • eBook Packages: Computer Science, Computer Science (R0)
