
A double Actor-Critic learning system embedding improved Monte Carlo tree search

  • Original Article
  • Neural Computing and Applications

Abstract

Overestimation, the bias between the estimated value and the true value, is a fundamental problem in reinforcement learning: it leads to incorrect action decisions and therefore a lower total reward. To reduce the impact of overestimation, we propose a double Actor-Critic learning system embedding an improved Monte Carlo tree search (DAC-IMCTS). The proposed learning system consists of a reference module, a simulation module and an outcome module. The reference module and the simulation module compute the upper and lower bounds of the agent's expected reward, respectively, while the outcome module learns the agent's control policies. The reference module, built on the Actor-Critic framework, provides an upper confidence bound of the expected reward. Unlike the classic Actor-Critic learning system, we introduce a simulation module that estimates the lower confidence bound of the expected reward; within this module, we propose an improved MCTS that samples the policy distribution more efficiently. Based on the lower and upper confidence bounds, we propose a confidence interval weighted estimation (CIWE) algorithm in the outcome module to generate the target expected reward. We then prove that the target expected reward generated by our method has zero expectation bias, which reduces the overestimation present in the classic Actor-Critic learning system. We evaluate our learning system on OpenAI Gym tasks, and the experimental results show that the proposed model and algorithm outperform state-of-the-art learning systems.
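To make the idea of combining the two bounds concrete, the following is a minimal sketch of how a confidence-interval-weighted target could be formed and plugged into a one-step temporal-difference update. It is an illustration under stated assumptions only: the function name `ciwe_target`, the mixing weight `beta`, and the simple convex-combination form are hypothetical and do not reproduce the paper's actual CIWE algorithm.

```python
import numpy as np

def ciwe_target(q_upper: np.ndarray, q_lower: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """Illustrative confidence-interval-weighted value estimate.

    q_upper: upper confidence bound of the expected reward
             (assumed here to come from the Actor-Critic reference module).
    q_lower: lower confidence bound of the expected reward
             (assumed here to come from the improved-MCTS simulation module).
    beta:    hypothetical mixing weight; the paper's CIWE algorithm defines
             its own weighting, which is not reproduced here.
    """
    # A value between the pessimistic and optimistic bounds tempers the
    # systematic overestimation of a single critic.
    return beta * q_lower + (1.0 - beta) * q_upper

# Hypothetical usage inside a one-step temporal-difference target:
rewards   = np.array([1.0, 0.0])    # immediate rewards r_t
q_up_next = np.array([10.0, 5.0])   # upper bounds for the next states
q_lo_next = np.array([8.0, 3.0])    # lower bounds for the next states
gamma     = 0.99                    # discount factor
td_target = rewards + gamma * ciwe_target(q_up_next, q_lo_next, beta=0.5)
print(td_target)                    # [9.91, 3.96]
```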



Availability of data and materials

Data will be made available on reasonable request.


Acknowledgements

The authors would like to thank the editors and the anonymous reviewers for their constructive comments and suggestions. The work was supported by the National Natural Science Foundation of China [grant number 71771096].

Author information


Corresponding author

Correspondence to Yong Xie.

Ethics declarations

Conflict of interest

All authors disclosed no relevant relationships.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhu, H., Xie, Y. & Zheng, S. A double Actor-Critic learning system embedding improved Monte Carlo tree search. Neural Comput & Applic 36, 8485–8500 (2024). https://doi.org/10.1007/s00521-024-09513-4

