DQN-based OpenCL workload partition for performance optimization

Abstract

This paper proposes a deep Q network (DQN)-based method for the workload partition problem in OpenCL. The DQN, a reinforcement learning algorithm, optimizes the workload partition for each processing unit through self-training on performance data accumulated in the computing environment. Our experiments show that, in JPEG decoding, the DQN-based partition improves performance by up to 62.2% and 6.9% compared to the LuxMark-based and target-based partitions, respectively. The DQN is able to capture low-level contention in slave devices, such as in caches and memory, as well as the communication bottleneck between devices, and to reflect both in the workload partition ratio.
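
The idea of learning a partition ratio from observed run times can be illustrated with a small sketch. This is not the authors' implementation: it substitutes tabular Q-learning for a deep Q network, and the device speeds, transfer cost, state and action encoding, and reward (negative simulated run time) are all hypothetical values chosen only for illustration.

```python
import random

# Hypothetical device model (not from the paper): relative throughput of each
# device and a transfer cost that grows with the share sent to the GPU.
CPU_SPEED, GPU_SPEED, TRANSFER_COST = 1.0, 5.0, 0.05
RATIOS = [i / 10 for i in range(11)]   # fraction of the workload given to the GPU
ACTIONS = (-1, 0, 1)                   # lower, keep, or raise the GPU share by one step


def run_time(gpu_share):
    """Simulated run time: devices work in parallel, so the slower one dominates."""
    cpu_time = (1.0 - gpu_share) / CPU_SPEED
    gpu_time = gpu_share / GPU_SPEED + TRANSFER_COST * gpu_share
    return max(cpu_time, gpu_time)


def step(state, action):
    """Apply an action index to a ratio index, clamping to the valid range."""
    return min(max(state + ACTIONS[action], 0), len(RATIOS) - 1)


def train(episodes=3000, steps=30, alpha=0.2, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning over partition ratios; reward is the negative run time."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in RATIOS]
    for _ in range(episodes):
        state = rng.randrange(len(RATIOS))
        for _ in range(steps):
            if rng.random() < epsilon:          # explore
                action = rng.randrange(len(ACTIONS))
            else:                               # exploit the current estimate
                action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
            nxt = step(state, action)
            reward = -run_time(RATIOS[nxt])
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_ratio(q, start=0, steps=30):
    """Follow the learned greedy policy from a starting ratio and report where it ends."""
    state = start
    for _ in range(steps):
        action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
        state = step(state, action)
    return RATIOS[state]
```

Calling `greedy_ratio(train())` returns the GPU share that the learned policy settles on under the simulated cost model. A DQN would replace the table `q` with a neural network so that richer state, such as measured contention or transfer times, could be used as input.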

Notes

  1. Object oriented RPC framework of Google.

Acknowledgements

This work was partially supported by the National Research Foundation of Korea under Grant NRF-2017R1D1A1B03028926.

Author information

Corresponding author

Correspondence to Taeweon Suh.

About this article

Cite this article

Park, S., Suh, T. DQN-based OpenCL workload partition for performance optimization. J Supercomput 75, 4875–4893 (2019). https://doi.org/10.1007/s11227-019-02766-0

Keywords

  • OpenCL
  • DQN
  • Workload partition