
Safe Policy Improvement in Constrained Markov Decision Processes

  • Conference paper
Leveraging Applications of Formal Methods, Verification and Validation. Verification Principles (ISoLA 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13701))

Abstract

The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The synthesis algorithm has to balance target, safety, and comfort requirements in a single objective and to guarantee that the policy improvement does not increase the number of safety-requirements violations, especially for safety-critical applications. In this work, we present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update. For the first, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification. For the second, we introduce an algorithm ensuring that the policy is improved in a safe fashion, with high-confidence guarantees. We also discuss the adoption of a model-based RL algorithm to efficiently use the collected data and train a model-free agent on the predicted trajectories, where the safety violation does not have the same impact as in the real world. Finally, we demonstrate in standard control benchmarks that the resulting learning procedure is effective and robust even under heavy perturbations of the hyperparameters.
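The idea of a safe policy update with a high-confidence guarantee can be illustrated with a small sketch. It is not the algorithm from the paper: it only shows one common way to build such a check, using a one-sided Hoeffding bound on the violation rate of a candidate policy. The function names, the `baseline_rate` parameter, and the assumption that each evaluation episode yields an independent 0/1 violation indicator are all hypothetical choices made for this illustration.

```python
import math
from typing import Sequence


def hoeffding_upper_bound(samples: Sequence[float], delta: float) -> float:
    """Upper bound on the true mean of i.i.d. samples in [0, 1].

    The bound holds with probability at least 1 - delta
    (one-sided Hoeffding inequality).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    mean = sum(samples) / n
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def accept_candidate(
    candidate_violations: Sequence[float],
    baseline_rate: float,
    delta: float = 0.05,
) -> bool:
    """Accept the candidate policy only if, with confidence 1 - delta,
    its violation rate does not exceed the baseline rate.

    candidate_violations: hypothetical 0/1 indicators, one per evaluation
        episode, where 1 means a safety requirement was violated.
    baseline_rate: violation rate of the current policy, assumed known here.
    """
    return hoeffding_upper_bound(candidate_violations, delta) <= baseline_rate


if __name__ == "__main__":
    # Many violation-free episodes: the bound is tight enough to accept.
    print(accept_candidate([0.0] * 1000, baseline_rate=0.05))
    # Few violation-free episodes: too little evidence, so the update is rejected.
    print(accept_candidate([0.0] * 10, baseline_rate=0.05))
```

The check is conservative by design: with little data the confidence interval is wide and the current policy is kept, which mirrors the requirement that an update must not increase safety violations unless the evidence supports it.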

Acknowledgement

Luigi Berducci is supported by the Doctoral College Resilient Embedded Systems. This work has received funding from the Austrian FFG-ICT project ADEX.

Author information

Corresponding author

Correspondence to Luigi Berducci.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Berducci, L., Grosu, R. (2022). Safe Policy Improvement in Constrained Markov Decision Processes. In: Margaria, T., Steffen, B. (eds) Leveraging Applications of Formal Methods, Verification and Validation. Verification Principles. ISoLA 2022. Lecture Notes in Computer Science, vol 13701. Springer, Cham. https://doi.org/10.1007/978-3-031-19849-6_21

  • DOI: https://doi.org/10.1007/978-3-031-19849-6_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19848-9

  • Online ISBN: 978-3-031-19849-6

  • eBook Packages: Computer Science, Computer Science (R0)
