
A Hybrid Piece-Wise Slowdown Model for Concurrent Kernel Execution on GPU

  • Conference paper
  • First Online:
Euro-Par 2022: Parallel Processing (Euro-Par 2022)

Abstract

Concurrent execution of kernels on GPUs improves the use of hardware resources and reduces the total execution time of co-executed kernels. In addition, it enables efficient kernel-oriented scheduling policies that pursue criteria such as fairness or Quality of Service. However, the achieved co-execution performance strongly depends on how GPU resources are partitioned among kernels. Thus, precise slowdown models that accurately predict co-execution performance are required to fulfill scheduling policy requirements. Most recent slowdown models target Spatial Multitasking (SMT) partitioning, where Streaming Multiprocessors (SMs) are distributed among tasks. In this work, we show that Simultaneous Multikernel (SMK) partitioning, where kernels share the SMs, achieves better performance. However, kernel interference in SMK occurs not only in global memory, as in the SMT case, but also within each SM, leading to high prediction errors. We propose a modification of a previous state-of-the-art slowdown model that reduces the median prediction error from 27.92% to 9.50%. Moreover, we use this new slowdown model to implement a scheduling policy that improves fairness by 1.41x on average compared to even partitioning, whereas previous models reach only 1.21x on average.
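The fairness figure quoted above follows the standard multiprogram fairness metric of Eyerman and Eeckhout: the ratio of the minimum to the maximum normalized progress across co-running kernels, where a kernel's normalized progress is the inverse of its slowdown relative to isolated execution. As a minimal illustrative sketch (the function name and inputs are hypothetical, not the paper's implementation), a scheduler could score candidate resource partitions from model-predicted slowdowns like this:

```python
def fairness(slowdowns):
    """Fairness metric (Eyerman & Eeckhout): min/max ratio of normalized
    progress, where progress = 1 / slowdown. Returns 1.0 when all kernels
    are slowed equally (perfectly fair); values near 0 indicate one kernel
    is starved relative to another."""
    progress = [1.0 / s for s in slowdowns]
    return min(progress) / max(progress)

# Two co-executed kernels whose predicted slowdowns are 1.2x and 2.4x:
# the second kernel makes half the progress of the first.
print(round(fairness([1.2, 2.4]), 2))  # 0.5
```

A fairness-oriented scheduler would evaluate this score for each feasible intra-SM partition using the slowdown model's predictions and pick the partition with the highest value, which is why low model prediction error translates directly into better fairness.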




Acknowledgment

This work has been supported by the Junta de Andalucía of Spain (P18-FR-3130 and UMA20-FEDERJA-059) and the Ministry of Education of Spain (PID2019-105396RB-I00).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Nicolás Guil.

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

López-Albelda, B., Castro, F.M., González-Linares, J.M., Guil, N. (2022). A Hybrid Piece-Wise Slowdown Model for Concurrent Kernel Execution on GPU. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham. https://doi.org/10.1007/978-3-031-12597-3_23


  • DOI: https://doi.org/10.1007/978-3-031-12597-3_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12596-6

  • Online ISBN: 978-3-031-12597-3

  • eBook Packages: Computer Science, Computer Science (R0)
