
Evaluation of Safe Reinforcement Learning with CoMirror Algorithm in a Non-Markovian Reward Problem

  • Conference paper
  • In: Intelligent Autonomous Systems 17 (IAS 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 577)


Abstract

In reinforcement learning, an agent improves its skill based on a reward, the feedback it receives from the environment. In practical applications, reinforcement learning faces several important challenges. First, reinforcement learning algorithms often rely on assumptions about the environment, such as the Markov decision process; however, real-world environments often cannot be represented under these assumptions. In particular, we focus on environments with non-Markovian rewards, in which the reward may depend on past experiences. To handle non-Markovian rewards, researchers have used a reward machine, which decomposes the original task into sub-tasks. In those works, the sub-tasks are usually assumed to be represented by Markov decision processes. Second, safety is another challenge in reinforcement learning. G-CoMDS is a safe reinforcement learning algorithm based on the CoMirror algorithm, an algorithm for constrained optimization problems. We have developed G-CoMDS to learn safely in environments that cannot be modeled as Markov decision processes. Therefore, a promising approach in complex situations would be to decompose the original task, as a reward machine does, and then solve the sub-tasks with G-CoMDS. In this paper, we provide additional experimental results and discussion of G-CoMDS as a preliminary step toward combining G-CoMDS with a reward machine. We evaluate G-CoMDS and an existing safe reinforcement learning algorithm in a mobile robot simulation with a kind of non-Markovian reward. The experimental results show that G-CoMDS suppresses cost spikes and slightly exceeds the performance of the existing safe reinforcement learning algorithm.
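
The abstract describes decomposing a non-Markovian task with a reward machine, a finite-state machine over high-level events introduced by Icarte et al. (2018). The following is a minimal sketch of that idea only, not the authors' implementation or the G-CoMDS algorithm itself; the machine states, events, and reward values are illustrative assumptions.

```python
# Minimal reward-machine sketch (illustrative only, not the authors' code).
# A reward machine is a finite-state machine whose transitions fire on
# high-level events and emit rewards; adding the machine state to the
# observation makes each sub-task between machine states Markovian again.

class RewardMachine:
    def __init__(self, initial_state, transitions):
        # transitions: {(machine_state, event): (next_machine_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state

    def step(self, event):
        """Advance on one high-level event and return the emitted reward."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward


# Example: the non-Markovian task "reach A, then reach B" split into two
# sub-tasks; the task reward is given only when B is reached after A.
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "at_A"): ("u1", 0.0),  # sub-task 1: reach A
        ("u1", "at_B"): ("u2", 1.0),  # sub-task 2: reach B after A
    },
)
```

Under such a decomposition, each sub-task is an ordinary Markovian learning problem, which is where a safe learner such as G-CoMDS would be plugged in, as the abstract suggests.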


Acknowledgement

This work was partially supported by JSPS KAKENHI (Grant numbers JP26120005, 17KK0064, JP18K19732, and JP17K12737).

Author information

Corresponding author

Correspondence to Megumi Miyashita.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Miyashita, M., Yano, S., Kondo, T. (2023). Evaluation of Safe Reinforcement Learning with CoMirror Algorithm in a Non-Markovian Reward Problem. In: Petrovic, I., Menegatti, E., Marković, I. (eds) Intelligent Autonomous Systems 17. IAS 2022. Lecture Notes in Networks and Systems, vol 577. Springer, Cham. https://doi.org/10.1007/978-3-031-22216-0_5
