Abstract
Concurrent execution of kernels on GPUs can improve hardware resource utilization and reduce the execution time of co-executed kernels. In addition, efficient kernel-oriented scheduling policies pursuing criteria such as fairness or Quality of Service can be implemented. However, the achieved co-execution performance strongly depends on how GPU resources are partitioned among kernels. Thus, precise slowdown models that accurately predict co-execution performance are required to fulfill scheduling policy requirements. Most recent slowdown models target Spatial Multitask (SMT) partitioning, where Streaming Multiprocessors (SMs) are distributed among tasks. In this work, we show that Simultaneous Multikernel (SMK) partitioning, where kernels share the SMs, achieves better performance. However, kernel interference in SMK occurs not only in global memory, as in the SMT case, but also within each SM, leading to high prediction errors. Here, we propose a modification of a previous state-of-the-art slowdown model that reduces the median prediction error from 27.92% to 9.50%. Moreover, this new slowdown model is used to implement a scheduling policy that improves fairness by 1.41x on average compared to even partitioning, whereas previous models reach only 1.21x on average.
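As context for the fairness numbers above, a common fairness metric for multiprogram workloads is the ratio of the smallest to the largest per-kernel slowdown, where each slowdown is co-executed runtime divided by isolated runtime. The sketch below is illustrative only and not the paper's implementation; the function name and inputs are assumptions.

```python
def fairness(isolated_times, coexec_times):
    """Fairness for co-executed kernels: min/max of per-kernel slowdowns.

    A value of 1.0 means all kernels are slowed down equally;
    values near 0 mean one kernel is starved relative to another.
    """
    # Slowdown of each kernel = co-executed runtime / isolated runtime
    slowdowns = [c / i for i, c in zip(isolated_times, coexec_times)]
    return min(slowdowns) / max(slowdowns)

# Example: kernel A slowed 1.2x, kernel B slowed 2.4x under some partitioning
print(fairness([10.0, 20.0], [12.0, 48.0]))  # 0.5
```

Under this metric, a scheduler guided by an accurate slowdown model can steer the intra-SM resource partition toward the point where per-kernel slowdowns are balanced, which is what the reported 1.41x average fairness improvement over even partitioning measures.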
Acknowledgment
This work has been supported by the Junta de Andalucía of Spain (P18-FR-3130 and UMA20-FEDERJA-059) and the Ministry of Education of Spain (PID2019-105396RB-I00).
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
López-Albelda, B., Castro, F.M., González-Linares, J.M., Guil, N. (2022). A Hybrid Piece-Wise Slowdown Model for Concurrent Kernel Execution on GPU. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham. https://doi.org/10.1007/978-3-031-12597-3_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12596-6
Online ISBN: 978-3-031-12597-3