Abstract
In reinforcement learning, an agent improves its skill according to a reward, the feedback it receives from the environment. For practical applications, reinforcement learning faces several important challenges. First, reinforcement learning algorithms often rely on assumptions about the environment, such as that it forms a Markov decision process; however, real-world environments often cannot be represented under these assumptions. In particular, we focus on environments with non-Markovian rewards, in which the reward may depend on past experiences. To handle non-Markovian rewards, researchers have used reward machines, which decompose the original task into sub-tasks; in those works, each sub-task is usually assumed to be representable as a Markov decision process. Second, safety is another major challenge in reinforcement learning. G-CoMDS is a safe reinforcement learning algorithm based on the CoMirror algorithm, a method for constrained optimization problems; we developed G-CoMDS to learn safely in environments that are not Markov decision processes. A promising approach in complex situations would therefore be to decompose the original task, as a reward machine does, and then solve the sub-tasks with G-CoMDS. In this paper, as a preliminary step toward combining G-CoMDS with a reward machine, we provide additional experimental results and discussion of G-CoMDS. We evaluate G-CoMDS and an existing safe reinforcement learning algorithm in a mobile robot simulation with a kind of non-Markovian reward. The experimental results show that G-CoMDS suppresses cost spikes and slightly outperforms the existing safe reinforcement learning algorithm.
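The two components the abstract builds on can be illustrated concretely. A reward machine is a finite-state machine whose transitions fire on high-level events and emit the reward, so a reward that depends on past experiences becomes Markovian once the machine state is tracked alongside the environment state. The sketch below is a minimal illustration only; the states, events, and rewards are hypothetical, not taken from the paper.

```python
# Hedged sketch of a reward machine: a finite-state machine whose
# transitions fire on high-level events and emit the reward.
# States, events, and rewards here are illustrative, not from the paper.
class RewardMachine:
    def __init__(self, transitions, initial_state):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.state = initial_state

    def step(self, event):
        # Unlisted events leave the machine state unchanged, reward 0.
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

# Example: reward 1 only after visiting A and then B. The reward is
# non-Markovian in the environment state alone, but Markovian in the
# pair (environment state, machine state).
rm = RewardMachine({("u0", "at_A"): ("u1", 0.0),
                    ("u1", "at_B"): ("u0", 1.0)}, "u0")
```

The CoMirror algorithm, on which G-CoMDS is based, minimizes an objective f(x) subject to a constraint g(x) <= 0 by switching at each iteration between a mirror descent step on the objective (when the current point is feasible enough) and a step on the constraint (when it is not). The sketch below is a schematic of that switching rule under a Euclidean mirror map, with parameter names of our own choosing; it is not the G-CoMDS implementation.

```python
import numpy as np

def comirror(f_grad, g, g_grad, x0, step, n_iters=200, tol=1e-3):
    """Minimal CoMirror sketch with a Euclidean mirror map.

    Minimizes f(x) subject to g(x) <= 0 by switching between
    subgradient steps on the objective and on the constraint.
    """
    x = np.asarray(x0, dtype=float)
    best = x.copy()
    for t in range(n_iters):
        if g(x) <= tol:
            d = f_grad(x)    # feasible enough: descend on the objective
            best = x.copy()  # remember the last (near-)feasible iterate
        else:
            d = g_grad(x)    # infeasible: descend on the constraint
        # Euclidean mirror step; other mirror maps change this update.
        x = x - step(t) * d
    return best

# Toy usage: minimize f(x) = ||x||^2 subject to g(x) = 1 - x[0] <= 0,
# i.e. x[0] >= 1; the minimizer is (1, 0).
sol = comirror(f_grad=lambda x: 2.0 * x,
               g=lambda x: 1.0 - x[0],
               g_grad=lambda x: np.array([-1.0, 0.0]),
               x0=[2.0, 2.0],
               step=lambda t: 0.5 / np.sqrt(t + 1))
```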
Acknowledgement
This work was partially supported by JSPS KAKENHI (Grant numbers JP26120005, 17KK0064, JP18K19732, and JP17K12737).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Miyashita, M., Yano, S., Kondo, T. (2023). Evaluation of Safe Reinforcement Learning with CoMirror Algorithm in a Non-Markovian Reward Problem. In: Petrovic, I., Menegatti, E., Marković, I. (eds) Intelligent Autonomous Systems 17. IAS 2022. Lecture Notes in Networks and Systems, vol 577. Springer, Cham. https://doi.org/10.1007/978-3-031-22216-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22215-3
Online ISBN: 978-3-031-22216-0
eBook Packages: Intelligent Technologies and Robotics (R0)