Abstract
We propose a GPU-based asynchronous distributed proximal policy optimization (DPPO) training framework, GAPPO, in which sampling and network updates are assigned to two separate threads that exchange data through a buffer. By coordinating and synchronizing the cycles of the two threads, the training speed increases by about 1.5 times on average compared with the original DPPO algorithm. We apply the framework to five Atari games from the Arcade Learning Environment. The results indicate that GAPPO reaches the same score as DPPO in a shorter time and reaches higher scores than GA3C in four of the games.
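The two-thread arrangement described in the abstract can be illustrated with a minimal producer/consumer sketch. This is not the authors' implementation: the function name `run_pipeline`, the bounded `queue.Queue` used as the buffer, the `None` sentinel, and the integer "transitions" are all assumptions made for illustration. A real system would replace the sampler body with environment rollouts and the updater body with a PPO gradient step on the GPU.

```python
import queue
import threading


def run_pipeline(num_batches=5, batch_size=4, buffer_capacity=2):
    """Toy sampler/updater pipeline (hypothetical, for illustration only)."""
    # Bounded buffer: the sampler blocks when it is full, which keeps the
    # two threads from drifting arbitrarily far apart.
    buffer = queue.Queue(maxsize=buffer_capacity)
    consumed = []

    def sampler():
        # Stand-in for environment rollouts: each transition is an integer.
        for b in range(num_batches):
            batch = [b * batch_size + i for i in range(batch_size)]
            buffer.put(batch)
        buffer.put(None)  # sentinel telling the updater to stop

    def updater():
        while True:
            batch = buffer.get()  # blocks until the sampler provides data
            if batch is None:
                break
            # Stand-in for a network update: record the batch sum.
            consumed.append(sum(batch))

    threads = [threading.Thread(target=sampler),
               threading.Thread(target=updater)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


if __name__ == "__main__":
    print(run_pipeline(num_batches=3, batch_size=2, buffer_capacity=1))
```

The buffer capacity is the knob that couples the two cycles in this sketch: a small capacity forces the sampler to wait for the updater, while a larger one lets sampling run further ahead at the cost of training on older data.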
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Youlve, C., Kaiyun, B., Zhaoyang, L. (2022). Asynchronous Distributed Proximal Policy Optimization Training Framework Based on GPU. In: Deng, Z. (eds) Proceedings of 2021 Chinese Intelligent Automation Conference. Lecture Notes in Electrical Engineering, vol 801. Springer, Singapore. https://doi.org/10.1007/978-981-16-6372-7_67
DOI: https://doi.org/10.1007/978-981-16-6372-7_67
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6371-0
Online ISBN: 978-981-16-6372-7
eBook Packages: Intelligent Technologies and Robotics (R0)