Abstract
This paper proposes a deep Q-network (DQN)-based method for the workload partition problem in OpenCL. The DQN, a reinforcement learning algorithm, optimizes the workload partition for each processing unit through self-training on performance data accumulated in the computing environment. Our experiments show that, in JPEG decoding, the DQN-based partition improves performance by up to 62.2% and 6.9% compared with the LuxMark-based and target-based partitions, respectively. The DQN captures low-level contention in slave devices, such as in caches and memory, as well as the communication bottleneck between devices, and reflects both in the workload partition ratio.
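The abstract does not state how states, actions, or rewards are defined, so the following is only an illustrative sketch of the general idea of learning a partition ratio from observed execution times. It substitutes tabular Q-learning for the deep Q-network, and every name and number in it (`simulated_time`, the device speeds, the transfer cost, the 0.1 ratio step) is a hypothetical assumption rather than the authors' design.

```python
import random

# Assumption: the CPU share of the workload is discretized in steps of 0.1.
RATIOS = [i / 10 for i in range(11)]
# Assumption: an action moves the share one step down, keeps it, or moves it up.
ACTIONS = (-1, 0, 1)


def simulated_time(cpu_share, cpu_speed=1.0, gpu_speed=3.0, transfer_cost=0.2):
    """Hypothetical cost model standing in for measured kernel execution time.

    Both devices run in parallel; work sent to the GPU also pays a transfer cost.
    """
    gpu_share = 1.0 - cpu_share
    cpu_time = cpu_share / cpu_speed
    gpu_time = gpu_share / gpu_speed + transfer_cost * gpu_share
    return max(cpu_time, gpu_time)


def step(state, action_index):
    """Apply an action to a ratio index, clamping to the valid range."""
    return min(max(state + ACTIONS[action_index], 0), len(RATIOS) - 1)


def train(episodes=2000, steps=20, alpha=0.2, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning over partition ratios (a stand-in for a DQN)."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in RATIOS]
    for _ in range(episodes):
        state = rng.randrange(len(RATIOS))
        for _ in range(steps):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            nxt = step(state, a)
            # Reward is the negative execution time: faster is better.
            reward = -simulated_time(RATIOS[nxt])
            q[state][a] += alpha * (reward + gamma * max(q[nxt]) - q[state][a])
            state = nxt
    return q


def greedy_ratio(q, start=5, steps=20):
    """Follow the learned greedy policy and return the ratio it settles on."""
    state = start
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state = step(state, a)
    return RATIOS[state]


if __name__ == "__main__":
    q_table = train()
    best = greedy_ratio(q_table)
    print(f"learned CPU share: {best}, simulated time: {simulated_time(best):.4f}")
```

In a real setting the reward would come from timing the OpenCL kernels on each device rather than from a closed-form model, which is how effects such as cache contention and inter-device communication cost could show up in the learned ratio.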
Notes
Object-oriented RPC framework from Google.
Acknowledgements
This work was partially supported by the National Research Foundation of Korea under Grant NRF-2017R1D1A1B03028926.
Cite this article
Park, S., Suh, T. DQN-based OpenCL workload partition for performance optimization. J Supercomput 75, 4875–4893 (2019). https://doi.org/10.1007/s11227-019-02766-0
DOI: https://doi.org/10.1007/s11227-019-02766-0
Keywords
- OpenCL
- DQN
- Workload partition