Abstract
Neural networks (NN) are gaining importance in sequential decision-making. Deep reinforcement learning (DRL), in particular, is extremely successful in learning action policies in complex and dynamic environments. Despite this success, however, DRL technology is not without its failures, especially in safety-critical applications: (i) the training objective maximizes average rewards, which may disregard rare but critical situations and hence lack local robustness; (ii) optimization objectives targeting safety typically yield degenerated reward structures, which for DRL to work must be replaced with proxy objectives. Here we introduce a methodology that can help to address both deficiencies. We incorporate evaluation stages (ES) into DRL, leveraging recent work on deep statistical model checking (DSMC), which verifies NN policies in MDPs. Our ES apply DSMC at regular intervals to determine state space regions with weak performance. We adapt the subsequent DRL training priorities based on the outcome, (i) focusing DRL on critical situations, and (ii) allowing arbitrary objectives to be fostered. We run case studies in Racetrack, an abstraction of autonomous driving that requires navigating a map without crashing into a wall. Our results show that DSMC-based ES can significantly improve both (i) and (ii).
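To make the training scheme described above concrete, here is a minimal sketch (not the authors' implementation) of how DSMC evaluation stages could be interleaved with DRL training. The helpers train_drl_episode, dsmc_goal_probability, and sample_initial_state are hypothetical stand-ins for, respectively, a DRL update on one episode, a statistical-model-checking estimate of the policy's goal probability on a state-space region, and drawing a start state from a region.

```python
# Minimal sketch only (not the authors' implementation): interleave DRL training
# rounds with DSMC-based evaluation stages. train_drl_episode,
# dsmc_goal_probability and sample_initial_state are hypothetical stand-ins.
import random

def train_with_evaluation_stages(policy, regions, num_stages, episodes_per_stage):
    """Alternate DRL training rounds with DSMC evaluation stages.

    After each evaluation stage, regions where the policy performs poorly
    receive a higher priority as episode start regions for the next round.
    """
    priorities = {r: 1.0 for r in regions}  # start with uniform priorities

    for _ in range(num_stages):
        # (1) DRL training round: draw episode start states according to priorities.
        weights = [priorities[r] for r in regions]
        for _ in range(episodes_per_stage):
            region = random.choices(regions, weights=weights)[0]
            start_state = sample_initial_state(region)   # hypothetical helper
            train_drl_episode(policy, start_state)       # hypothetical helper

        # (2) Evaluation stage: statistically estimate, per region, the goal
        #     probability of the Markov chain induced by the current policy.
        for r in regions:
            p_goal = dsmc_goal_probability(policy, r)    # hypothetical helper
            # (3) Re-prioritize: weak regions are trained more in the next round.
            priorities[r] = (1.0 - p_goal) + 1e-3

    return policy
```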
Authors are listed alphabetically. This work was partially supported by the German Research Foundation (DFG) under grant No. 389792660, as part of TRR 248, see https://perspicuous-computing.science, and by the European Regional Development Fund (ERDF).
Change history
19 August 2021
In an older version of this paper, there was a mistake in line 12 of the algorithm on page 206. This was corrected.
Notes
- 1. One can combine such a proxy with the goal probability objective, though multiple objectives are difficult to achieve with a one-dimensional reward signal and standard backpropagation algorithms for neural nets [26]; in any case, the training objective and the ideal objective still do not coincide here. Reward shaping is an alternative option that can in principle preserve the optimal policy [31], but this is not always possible, and manual work is needed for individual learning tasks (substantial work sometimes, see e.g. [46]).
- 2. The benefit of our proposed ES thus hinges, in particular, on how meaningful these representative states are for policy performance. While this is a limitation, partitioning by physical location, as in Racetrack, may be a canonical choice in many scenarios.
- 3. In our Racetrack case studies, we use the map cells as the basis of \(\mathbb{P}\), i.e., states sharing the same physical location. We believe that this partitioning method may work for many application scenarios involving physical space. Alternatively, one may, for example, partition state-variable ranges into intervals; a small sketch of both options follows below.
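As an illustration of the two partitioning options mentioned in note 3 (a sketch under assumptions, not the paper's code): suppose a Racetrack-like state is a tuple (x, y, dx, dy) of position and velocity.

```python
# Illustration only, not the paper's code: two ways to partition states,
# assuming a Racetrack-like state of the form (x, y, dx, dy).

def cell_partition(state):
    """Map a state to its map cell: all states sharing a physical location
    fall into the same class, regardless of velocity."""
    x, y, _dx, _dy = state
    return (x, y)

def interval_partition(state, width=1.0):
    """Alternative for numeric state variables: split each variable's range
    into intervals of the given width."""
    return tuple(int(v // width) for v in state)
```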
References
Agostinelli, F., McAleer, S., Shmakov, A., Baldi, P.: Solving the Rubik’s cube with deep reinforcement learning and search. Nat. Mach. Intell. 1, 356–363 (2019)
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Amit, R., Meir, R., Ciosek, K.: Discount factor as a regularizer in reinforcement learning. In: International Conference on Machine Learning, pp. 269–278. PMLR (2020)
Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36
Baier, C., et al.: Lab conditions for research on explainable automated decisions. In: Heintz, F., Milano, M., O’Sullivan, B. (eds.) TAILOR 2020. LNCS (LNAI), vol. 12641, pp. 83–90. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73959-1_8
Barto, A.G., Bradtke, S.J., Singh, S.P.: Learning to act using real-time dynamic programming. Artif. Intell. 72(1–2), 81–138 (1995)
Bogdoll, J., Hartmanns, A., Hermanns, H.: Simulation and statistical model checking for modestly nondeterministic models. In: Schmitt, J.B. (ed.) MMB&DFT 2012. LNCS, vol. 7201, pp. 249–252. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28540-0_20
Bonet, B., Geffner, H.: GPT: a tool for planning with uncertainty and partial information. In: Proceedings of the IJCAI Workshop on Planning with Uncertainty and Incomplete Information, pp. 82–87 (2001)
Bonet, B., Geffner, H.: Labeled RTDP: improving the convergence of real-time dynamic programming. In: Proceedings of the International Conference on Automated Planning and Scheduling, pp. 12–21 (2003)
Budde, C.E., D’Argenio, P.R., Hartmanns, A., Sedwards, S.: A statistical model checker for nondeterminism and rare events. In: Beyer, D., Huisman, M. (eds.) TACAS 2018. LNCS, vol. 10806, pp. 340–358. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89963-3_20
Ciosek, K., Whiteson, S.: Offer: off-environment reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
Frank, J., Mannor, S., Precup, D.: Reinforcement learning in the presence of rare events. In: Proceedings of the 25th International Conference on Machine Learning, pp. 336–343 (2008)
Fujita, Y., Nagarajan, P., Kataoka, T., Ishikawa, T.: ChainerRL: a deep reinforcement learning library. J. Mach. Learn. Res. 22(77), 1–14 (2021)
Gros, T.P., Groß, D., Gumhold, S., Hoffmann, J., Klauck, M., Steinmetz, M.: TraceVis: towards visualization for deep statistical model checking. In: Proceedings of the 9th International Symposium on Leveraging Applications of Formal Methods, Verification and Validation. From Verification to Explanation (2020)
Gros, T.P., Hermanns, H., Hoffmann, J., Klauck, M., Steinmetz, M.: Deep statistical model checking. In: Proceedings of the 40th International Conference on Formal Techniques for Distributed Objects, Components, and Systems (FORTE 2020) (2020). https://doi.org/10.1007/978-3-030-50086-3_6
Gros, T.P., Höller, D., Hoffmann, J., Wolf, V.: Tracking the race between deep reinforcement learning and imitation learning. In: Gribaudo, M., Jansen, D.N., Remke, A. (eds.) QEST 2020. LNCS, vol. 12289, pp. 11–17. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59854-9_2
Gu, S., Holly, E., Lillicrap, T., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. IEEE (2017)
Hare, J.: Dealing with sparse rewards in reinforcement learning. arXiv preprint arXiv:1910.09281 (2019)
Hartmanns, A., Hermanns, H.: The Modest Toolset: an integrated environment for quantitative modelling and verification. In: Ábrahám, E., Havelund, K. (eds.) TACAS 2014. LNCS, vol. 8413, pp. 593–598. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54862-8_51
Hasanbeig, M., Abate, A., Kroening, D.: Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099 (2018)
Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields (2020)
Junges, S., Jansen, N., Dehnert, C., Topcu, U., Katoen, J.-P.: Safety-constrained reinforcement learning for MDPs. In: Chechik, M., Raskin, J.-F. (eds.) TACAS 2016. LNCS, vol. 9636, pp. 130–146. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49674-9_8
Knox, W.B., Stone, P.: Reinforcement learning from human reward: discounting in episodic tasks. In: 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, pp. 878–885 (2012). https://doi.org/10.1109/ROMAN.2012.6343862
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
Liu, C., Xu, X., Hu, D.: Multiobjective reinforcement learning: a comprehensive overview. IEEE Trans. Syst. Man Cybern. Syst. 45(3), 385–398 (2014)
McMahan, H.B., Gordon, G.J.: Fast exact planning in Markov decision processes. In: Proceedings of the International Conference on Automated Planning and Scheduling, pp. 151–160 (2005)
Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
Nazari, M., Oroojlooy, A., Snyder, L., Takac, M.: Reinforcement learning for solving the vehicle routing problem. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 9839–9849. Curran Associates, Inc. (2018)
Ng, A.Y., Harada, D., Russell, S.J.: Policy invariance under reward transformations: theory and application to reward shaping. In: Proceedings of the 16th International Conference on Machine Learning (ICML 1999), pp. 278–287 (1999)
Pineda, L.E., Lu, Y., Zilberstein, S., Goldman, C.V.: Fault-tolerant planning under uncertainty. In: Twenty-Third International Joint Conference on Artificial Intelligence, pp. 2350–2356 (2013)
Pineda, L.E., Zilberstein, S.: Planning under uncertainty using reduced models: revisiting determinization. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 24 (2014)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn. Wiley, New York (1994)
Riedmiller, M., et al.: Learning by playing - solving sparse reward tasks from scratch. In: International Conference on Machine Learning, pp. 4344–4353. PMLR (2018)
Sallab, A.E., Abdou, M., Perot, E., Yogamani, S.: Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017(19), 70–76 (2017)
Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. In: Bengio, Y., LeCun, Y. (eds.) 4th International Conference on Learning Representations, ICLR (2016)
Schwartz, A.: A reinforcement learning method for maximizing undiscounted rewards. In: Proceedings of the Tenth International Conference on Machine Learning, vol. 298, pp. 298–305 (1993)
Sen, K., Viswanathan, M., Agha, G.: On statistical model checking of stochastic systems. In: International Conference on Computer Aided Verification, pp. 266–280 (2005)
Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018)
Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)
Stooke, A., Abbeel, P.: rlpyt: a research code base for deep reinforcement learning in PyTorch. arXiv preprint arXiv:1909.01500 (2019)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, 2nd edn. The MIT Press, Cambridge (2018)
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)
Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019)
Younes, H.L.S., Kwiatkowska, M., Norman, G., Parker, D.: Numerical vs. statistical probabilistic model checking: an empirical study. In: Jensen, K., Podelski, A. (eds.) TACAS 2004. LNCS, vol. 2988, pp. 46–60. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24730-2_4
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Gros, T.P., Höller, D., Hoffmann, J., Klauck, M., Meerkamp, H., Wolf, V. (2021). DSMC Evaluation Stages: Fostering Robust and Safe Behavior in Deep Reinforcement Learning. In: Abate, A., Marin, A. (eds.) Quantitative Evaluation of Systems. QEST 2021. Lecture Notes in Computer Science, vol. 12846. Springer, Cham. https://doi.org/10.1007/978-3-030-85172-9_11
DOI: https://doi.org/10.1007/978-3-030-85172-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85171-2
Online ISBN: 978-3-030-85172-9