Abstract
In reinforcement learning, an agent improves its skill according to a reward, the feedback it receives from the environment. For practical applications, reinforcement learning faces several important challenges. First, reinforcement learning algorithms often rely on assumptions about the environment, such as that it forms a Markov decision process; however, real-world environments often cannot be represented under these assumptions. In particular, we focus on environments with non-Markovian rewards, in which the reward may depend on past experiences. To handle non-Markovian rewards, researchers have used reward machines, which decompose the original task into sub-tasks; in those works, each sub-task is usually assumed to be representable as a Markov decision process. Second, safety is another major challenge in reinforcement learning. G-CoMDS is a safe reinforcement learning algorithm based on the CoMirror algorithm, a method for constrained optimization problems; we developed G-CoMDS to learn safely in environments that are not Markov decision processes. A promising approach in complex situations would therefore be to decompose the original task, as a reward machine does, and then solve the sub-tasks with G-CoMDS. In this paper, as a preliminary step toward combining G-CoMDS with a reward machine, we provide additional experimental results and discussion of G-CoMDS. We evaluate G-CoMDS and an existing safe reinforcement learning algorithm in a mobile robot simulation with a kind of non-Markovian reward. The experimental results show that G-CoMDS suppresses cost spikes and slightly outperforms the existing safe reinforcement learning algorithm.
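The two components the abstract builds on can be illustrated concretely. A reward machine is a finite-state machine whose transitions fire on high-level events and emit the reward, so a reward that depends on past experiences becomes Markovian once the machine state is tracked alongside the environment state. The sketch below is a minimal illustration only; the states, events, and rewards are hypothetical, not taken from the paper.

```python
# Hedged sketch of a reward machine: a finite-state machine whose
# transitions fire on high-level events and emit the reward.
# States, events, and rewards here are illustrative, not from the paper.
class RewardMachine:
    def __init__(self, transitions, initial_state):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions = transitions
        self.state = initial_state

    def step(self, event):
        # Unlisted events leave the machine state unchanged, reward 0.
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

# Example: reward 1 only after visiting A and then B. The reward is
# non-Markovian in the environment state alone, but Markovian in the
# pair (environment state, machine state).
rm = RewardMachine({("u0", "at_A"): ("u1", 0.0),
                    ("u1", "at_B"): ("u0", 1.0)}, "u0")
```

The CoMirror algorithm, on which G-CoMDS is based, minimizes an objective f(x) subject to a constraint g(x) <= 0 by switching at each iteration between a mirror descent step on the objective (when the current point is feasible enough) and a step on the constraint (when it is not). The sketch below is a schematic of that switching rule under a Euclidean mirror map, with parameter names of our own choosing; it is not the G-CoMDS implementation.

```python
import numpy as np

def comirror(f_grad, g, g_grad, x0, step, n_iters=200, tol=1e-3):
    """Minimal CoMirror sketch with a Euclidean mirror map.

    Minimizes f(x) subject to g(x) <= 0 by switching between
    subgradient steps on the objective and on the constraint.
    """
    x = np.asarray(x0, dtype=float)
    best = x.copy()
    for t in range(n_iters):
        if g(x) <= tol:
            d = f_grad(x)    # feasible enough: descend on the objective
            best = x.copy()  # remember the last (near-)feasible iterate
        else:
            d = g_grad(x)    # infeasible: descend on the constraint
        # Euclidean mirror step; other mirror maps change this update.
        x = x - step(t) * d
    return best

# Toy usage: minimize f(x) = ||x||^2 subject to g(x) = 1 - x[0] <= 0,
# i.e. x[0] >= 1; the minimizer is (1, 0).
sol = comirror(f_grad=lambda x: 2.0 * x,
               g=lambda x: 1.0 - x[0],
               g_grad=lambda x: np.array([-1.0, 0.0]),
               x0=[2.0, 2.0],
               step=lambda t: 0.5 / np.sqrt(t + 1))
```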
Acknowledgement
This work was partially supported by JSPS KAKENHI (Grant numbers JP26120005, 17KK0064, JP18K19732, and JP17K12737).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Miyashita, M., Yano, S., Kondo, T. (2023). Evaluation of Safe Reinforcement Learning with CoMirror Algorithm in a Non-Markovian Reward Problem. In: Petrovic, I., Menegatti, E., Marković, I. (eds) Intelligent Autonomous Systems 17. IAS 2022. Lecture Notes in Networks and Systems, vol 577. Springer, Cham. https://doi.org/10.1007/978-3-031-22216-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22215-3
Online ISBN: 978-3-031-22216-0
eBook Packages: Intelligent Technologies and Robotics (R0)