Abstract
Data centers and cloud environments have recently begun offering graphics processing unit (GPU)-based infrastructure services. In practice, general-purpose GPU (GPGPU) applications exhibit low GPU utilization, unlike GPU-friendly workloads. To improve GPU resource utilization, different applications must execute concurrently while sharing the resources of a streaming multiprocessor (SM). However, intra-SM multitasking causes resource contention, which makes application performance difficult to predict. Moreover, among many candidate applications, it is crucial to find the resource partitioning and the set of co-executing applications that yield the best performance. To address this, the current paper proposes K-Scheduler, a multitasking placement scheduler based on the intra-SM resource-use characteristics of applications. First, the resource-use and multitasking characteristics of applications are analyzed according to their classification and their individual execution behavior. Rules for concurrent execution are then derived from these observations, and scheduling is performed according to those rules. The results show that K-Scheduler improves total workload execution performance by 18% and individual execution performance by 32% compared to previous studies.
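The scheduling idea summarized above, classifying applications by their intra-SM resource-use profiles and co-locating complementary ones, can be sketched in a few lines. This is an illustrative toy, not the paper's K-Scheduler implementation; the `compute_util`/`memory_util` profile fields, the 0.6 thresholds, and the pairing rule are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class AppProfile:
    name: str
    compute_util: float  # fraction of SM compute pipelines kept busy (0..1)
    memory_util: float   # fraction of memory bandwidth consumed (0..1)

def classify(app: AppProfile) -> str:
    # Label an application by its dominant intra-SM resource demand.
    if app.compute_util >= 0.6 and app.memory_util < 0.6:
        return "compute-bound"
    if app.memory_util >= 0.6 and app.compute_util < 0.6:
        return "memory-bound"
    return "balanced"

def pair_for_multitasking(apps):
    # Rule: co-schedule a compute-bound app with a memory-bound one, so the
    # two kernels contend for different SM resources; leftovers run alone.
    compute = [a for a in apps if classify(a) == "compute-bound"]
    memory = [a for a in apps if classify(a) == "memory-bound"]
    pairs = list(zip(compute, memory))
    paired = {a.name for c, m in pairs for a in (c, m)}
    solo = [a for a in apps if a.name not in paired]
    return pairs, solo
```

With profiles such as a compute-heavy matrix multiply and a bandwidth-heavy streaming kernel, the rule pairs the two on one SM while a balanced kernel is left to run by itself; the actual scheduler additionally searches over resource partitionings, which this sketch omits.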
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C1003379).
Ethics declarations
Informed consent
Written informed consent for publication of this paper was obtained from Sookmyung Women’s University and all authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Kim, S., Kim, Y. K-Scheduler: dynamic intra-SM multitasking management with execution profiles on GPUs. Cluster Comput 25, 597–617 (2022). https://doi.org/10.1007/s10586-021-03429-7