
Assessing Policy, Loss and Planning Combinations in Reinforcement Learning Using a New Modular Architecture

  • Conference paper

Progress in Artificial Intelligence (EPIA 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13566)

Abstract

The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are complex and involve multiple components, which creates challenges for the research and development of new models. In this work, we propose a new modular software architecture suited to this type of agent, together with a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include search algorithms, policies, and loss functions (code available at https://github.com/GaspTO/Modular_MBRL).
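
To make the composition concrete, the following minimal Python sketch shows how such building blocks might be assembled into an agent. The interfaces and names (SearchAlgorithm, Policy, Agent, plan, select) are our own illustrative choices, not the API of the linked repository.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class SearchAlgorithm(Protocol):
    """Planning building block: runs search from a state."""
    def plan(self, state) -> dict:
        ...  # returns per-action statistics (e.g., visit counts, values)


class Policy(Protocol):
    """Policy building block: turns search statistics into an action."""
    def select(self, action_stats: dict) -> int:
        ...


@dataclass
class Agent:
    """An agent is just a particular combination of building blocks."""
    search: SearchAlgorithm
    policy: Policy
    loss_fn: Callable  # loss building block: (predictions, targets) -> scalar

    def act(self, state) -> int:
        stats = self.search.plan(state)   # planning step
        return self.policy.select(stats)  # action-selection step
```

Under this reading, testing a new combination amounts to instantiating Agent with a different search, policy, or loss object, with no change to the rest of the training loop.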

We illustrate the use of this architecture by combining several of these building blocks to implement and test agents optimized for three different test environments: Cartpole, Minigrid, and Tictactoe. One search algorithm made available in our implementation, which we call averaged minimax and which had not previously been used in reinforcement learning, achieved good results in all three environments. Experiments performed with our implementation showed that the best combination of search, policy, and loss algorithms is heavily problem-dependent.
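
The abstract does not detail averaged minimax, so the sketch below is only one plausible reading of the name: a depth-limited negamax in which interior nodes back up the average of child values where classical minimax would take the strict max (or min). All names here (model, value_net, step, legal_actions, is_terminal) are hypothetical and are not taken from the linked repository.

```python
def averaged_minimax(node, depth, model, value_net):
    """Depth-limited search; leaves are scored by a value network.

    ASSUMPTION: the backup averages child values instead of taking the
    strict max/min of classical minimax. This is our reading of the
    name, not necessarily the paper's exact formulation.
    """
    if depth == 0 or model.is_terminal(node):
        return value_net(node)  # leaf: fall back on the learned estimate
    child_values = [
        # Negamax convention for alternating two-player games such as
        # Tictactoe: a child's value is seen from the opponent's side,
        # hence the sign flip. For single-agent environments such as
        # Cartpole or Minigrid, the flip would simply be dropped.
        -averaged_minimax(model.step(node, a), depth - 1, model, value_net)
        for a in model.legal_actions(node)
    ]
    # Classical minimax would return max(child_values); averaging
    # instead smooths the backup against value-estimation error.
    return sum(child_values) / len(child_values)
```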



Acknowledgments

This work was supported by the Portuguese Science Foundation, under projects PRELUNA PTDC/CCI-INF/4703/2021 and UIDB/50021/2020.

Author information

Correspondence to Tiago Gaspar Oliveira.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Oliveira, T.G., Oliveira, A.L. (2022). Assessing Policy, Loss and Planning Combinations in Reinforcement Learning Using a New Modular Architecture. In: Marreiros, G., Martins, B., Paiva, A., Ribeiro, B., Sardinha, A. (eds.) Progress in Artificial Intelligence. EPIA 2022. Lecture Notes in Computer Science, vol. 13566. Springer, Cham. https://doi.org/10.1007/978-3-031-16474-3_35

  • DOI: https://doi.org/10.1007/978-3-031-16474-3_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16473-6

  • Online ISBN: 978-3-031-16474-3

  • eBook Packages: Computer Science, Computer Science (R0)
