Abstract
Concurrent execution of kernels on GPUs can improve hardware resource utilization and reduce the execution time of co-executed kernels. In addition, efficient kernel-oriented scheduling policies pursuing criteria such as fairness or Quality of Service can be implemented. However, the achieved co-execution performance strongly depends on how GPU resources are partitioned among kernels. Thus, precise slowdown models that accurately predict co-execution performance are required to fulfill scheduling policy requirements. Most recent slowdown models target Spatial Multitask (SMT) partitioning, where Streaming Multiprocessors (SMs) are distributed among tasks. In this work, we show that Simultaneous Multikernel (SMK) partitioning, where kernels share the SMs, achieves better performance. However, kernel interference in SMK occurs not only in global memory, as in the SMT case, but also within each SM, leading to high prediction errors. Here, we propose a modification of a previous state-of-the-art slowdown model that reduces the median prediction error from 27.92% to 9.50%. Moreover, this new slowdown model is used to implement a scheduling policy that improves fairness by 1.41x on average compared to even partitioning, whereas previous models reach only 1.21x on average.
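As context for the fairness numbers above, a common fairness metric for multiprogram workloads is the ratio of the smallest to the largest per-kernel slowdown, where each slowdown is co-executed runtime divided by isolated runtime. The sketch below is illustrative only and not the paper's implementation; the function name and inputs are assumptions.

```python
def fairness(isolated_times, coexec_times):
    """Fairness for co-executed kernels: min/max of per-kernel slowdowns.

    A value of 1.0 means all kernels are slowed down equally;
    values near 0 mean one kernel is starved relative to another.
    """
    # Slowdown of each kernel = co-executed runtime / isolated runtime
    slowdowns = [c / i for i, c in zip(isolated_times, coexec_times)]
    return min(slowdowns) / max(slowdowns)

# Example: kernel A slowed 1.2x, kernel B slowed 2.4x under some partitioning
print(fairness([10.0, 20.0], [12.0, 48.0]))  # 0.5
```

Under this metric, a scheduler guided by an accurate slowdown model can steer the intra-SM resource partition toward the point where per-kernel slowdowns are balanced, which is what the reported 1.41x average fairness improvement over even partitioning measures.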
Acknowledgment
This work has been supported by the Junta de Andalucía of Spain (P18-FR-3130 and UMA20-FEDERJA-059) and the Ministry of Education of Spain (PID2019-105396RB-I00).
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
López-Albelda, B., Castro, F.M., González-Linares, J.M., Guil, N. (2022). A Hybrid Piece-Wise Slowdown Model for Concurrent Kernel Execution on GPU. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham. https://doi.org/10.1007/978-3-031-12597-3_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12596-6
Online ISBN: 978-3-031-12597-3