Abstract
Overestimation, the bias between an estimated value and the true value, is a fundamental problem in reinforcement learning: it leads to incorrect action decisions and therefore a lower total reward. To reduce its impact, we propose a double Actor-Critic learning system embedding an improved Monte Carlo tree search (DAC-IMCTS). The proposed learning system consists of a reference module, a simulation module, and an outcome module. The reference module and the simulation module compute the upper and lower bounds, respectively, of the agent's expected reward, while the outcome module learns the agent's control policies. The reference module, built on the Actor-Critic framework, provides an upper confidence bound of the expected reward. Unlike the classic Actor-Critic learning system, our system introduces a simulation module that estimates the lower confidence bound of the expected reward; within this module, we propose an improved MCTS that samples the policy distribution more efficiently. Based on the lower and upper confidence bounds, we propose a confidence interval weighted estimation (CIWE) algorithm in the outcome module for generating the target expected reward. We then prove that the target expected reward generated by our method has zero expectation bias, which reduces the overestimation present in the classic Actor-Critic learning system. We evaluate our learning system on OpenAI Gym tasks, and the experimental results show that the proposed model and algorithm outperform state-of-the-art learning systems.
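The core idea of combining an upper bound (from the Actor-Critic reference module) and a lower bound (from the MCTS simulation module) into a single target can be sketched as follows. This is a minimal illustration, not the paper's exact CIWE formula: the inverse-variance weighting rule and the function name `ciwe_target` are assumptions introduced here to show one standard way a weighted interior estimate of two bounds is formed.

```python
def ciwe_target(upper: float, lower: float,
                sigma_u: float, sigma_l: float) -> float:
    """Confidence-interval weighted estimate (illustrative sketch).

    upper   : upper confidence bound of the expected reward
              (e.g. from the Actor-Critic reference module)
    lower   : lower confidence bound of the expected reward
              (e.g. from the MCTS simulation module)
    sigma_u : estimated noise (std. dev.) of the upper bound
    sigma_l : estimated noise (std. dev.) of the lower bound

    Weights each bound by the inverse of its variance, so the
    less noisy bound contributes more to the target. The result
    always lies between `lower` and `upper`.
    """
    w = sigma_l ** 2 / (sigma_u ** 2 + sigma_l ** 2)
    return w * upper + (1.0 - w) * lower


# Equally noisy bounds are averaged; a noisier upper bound
# pulls the target toward the lower bound.
target = ciwe_target(upper=10.0, lower=6.0, sigma_u=1.0, sigma_l=1.0)
```

With equal noise the weight is 0.5 and the target is the midpoint (8.0 above); as one bound's variance grows, the estimate smoothly shifts toward the other bound, which is the intuition behind using the interval's two endpoints to cancel the one-sided bias of either estimator alone.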
Availability of data and materials
Data will be made available on reasonable request.
Acknowledgements
The authors would like to thank the editors and the anonymous reviewers for their constructive comments and suggestions. The work was supported by the National Natural Science Foundation of China [grant number 71771096].
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Zhu, H., Xie, Y. & Zheng, S. A double Actor-Critic learning system embedding improved Monte Carlo tree search. Neural Comput & Applic 36, 8485–8500 (2024). https://doi.org/10.1007/s00521-024-09513-4