Abstract
The combination of data-driven learning methods with formal reasoning has seen a surge of interest, as each area has the potential to bolster the other. For instance, formal methods promise to expand the use of state-of-the-art learning approaches towards certification and sample efficiency. In this work, we propose a deep Reinforcement Learning (RL) method for policy synthesis in continuous-state/action unknown environments, under requirements expressed in Linear Temporal Logic (LTL). We show that this combination lifts the applicability of deep RL to complex temporal and memory-dependent policy synthesis goals. We express an LTL specification as a Limit Deterministic Büchi Automaton (LDBA) and synchronise it on-the-fly with the agent/environment. The LDBA in practice monitors the environment, acting as a modular reward machine for the agent: accordingly, a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy that maximises the probability of satisfying the given LTL formula. We evaluate our framework on a cart-pole example and on a Mars rover experiment, where we achieve near-perfect success rates, while baselines based on standard RL are shown to fail in practice.
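The reward-machine role of the LDBA described above can be sketched with a toy example. This is an illustrative sketch, not the authors' implementation: the class `Ldba`, the helper `synchronous_reward`, and the reward constant `R_P` are hypothetical names, and the automaton shown is a hand-written toy rather than one translated from an LTL formula.

```python
# Illustrative sketch: an LDBA stepped alongside the environment acts as a
# reward machine, granting a positive reward whenever an accepting state is
# visited. All names here (Ldba, synchronous_reward, R_P) are hypothetical.

class Ldba:
    """A tiny deterministic automaton: transition map, current state, accepting set."""
    def __init__(self, delta, q0, accepting):
        self.delta, self.q, self.accepting = delta, q0, accepting

    def step(self, symbol):
        """Advance on one label read from the environment; report acceptance."""
        self.q = self.delta[(self.q, symbol)]
        return self.q in self.accepting

R_P = 1.0  # positive reward granted on visits to accepting states

def synchronous_reward(ldba, symbol):
    """Reward the agent only when the monitoring LDBA hits an accepting state."""
    return R_P if ldba.step(symbol) else 0.0

# Toy automaton rewarding visits to q1, reached whenever "a" is observed:
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "q0"}
ldba = Ldba(delta, "q0", accepting={"q1"})
trace = ["b", "a", "a", "b", "a"]          # labels observed along an episode
rewards = [synchronous_reward(ldba, s) for s in trace]
```

In the full framework this sparse, automaton-driven reward replaces hand-crafted reward shaping, and a separate DDPG module is associated with the task decomposition induced by the automaton.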
Notes
1. One-shot means that there is no need to master easy tasks first and then compose them together to accomplish a more complex task.
2. On-the-fly means that the algorithm tracks (or executes) the state of an underlying structure (or a function) without explicitly constructing it.
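The on-the-fly idea can be made concrete with a small sketch: the product of the environment and the automaton is never enumerated; instead the pair (environment state, automaton state) is advanced incrementally at each step. The function and variable names below (`product_step`, `env_step`, `label`) are hypothetical, and the integer-line environment is a toy stand-in for an unknown MDP.

```python
# Hypothetical sketch of on-the-fly product tracking: the product state (s, q)
# is maintained incrementally, without ever constructing S x Q explicitly.

def product_step(env_step, delta, label, s, q, a):
    """Advance the environment and the monitoring automaton together."""
    s_next = env_step(s, a)              # unknown MDP: one sampled transition
    q_next = delta[(q, label(s_next))]   # automaton reads the new state's label
    return s_next, q_next

# Toy environment: integer line, with the label "goal" holding at state 3.
env_step = lambda s, a: s + a
label = lambda s: "goal" if s == 3 else "none"
delta = {("q0", "none"): "q0", ("q0", "goal"): "q1",
         ("q1", "none"): "q1", ("q1", "goal"): "q1"}

s, q = 0, "q0"
for a in [1, 1, 1]:                      # three unit moves to the right
    s, q = product_step(env_step, delta, label, s, q, a)
```

Only the reachable product states ever materialise, which is what makes the construction viable over continuous state spaces, where the product could not be built upfront.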
Acknowledgements
The authors would like to thank Lim Zun Yuan for valuable discussions and technical support, and the anonymous reviewers for feedback on previous drafts of this manuscript.
Appendix: Proof of Theorem 1
Theorem 1. Let \(\varphi \) be a given LTL formula and \(\mathfrak {M}_\mathfrak {A}\) be the product MDP constructed by synchronising the MDP \(\mathfrak {M}\) with the LDBA \(\mathfrak {A}\) associated with \(\varphi \). Then the optimal stationary Markov policy on \(\mathfrak {M}_\mathfrak {A}\) that maximises the expected return also maximises the probability of satisfying \(\varphi \), and induces a finite-memory policy on the MDP \(\mathfrak {M}\).
Proof. Assume that the optimal Markov policy on \(\mathfrak {M}_\mathfrak {A}\) is \({\pi ^\otimes }^*\), namely at each state \(s^\otimes \) in \(\mathfrak {M}_\mathfrak {A}\) we have

\({\pi ^\otimes }^* = \arg \max _{\pi ^\otimes \in \mathcal {D}^\otimes } \mathbb {E}^{\pi ^\otimes }\Big [\sum _{n=0}^{\infty } \gamma ^n\, R(s_n^\otimes , a_n)\Big ], \qquad (15)\)
where \(\mathcal {D}^\otimes \) is the set of stationary deterministic policies over the state space \(\mathcal {S}^\otimes \), \(\mathbb {E}^{\pi ^\otimes } [\cdot ]\) denotes the expectation given that the agent follows policy \(\pi ^\otimes \), and \(s_0^\otimes ,a_0,s_1^\otimes ,a_1,\ldots \) is a generic path generated by the product MDP under policy \(\pi ^\otimes \).
Recall that an infinite word \(w \in {\Sigma }^\omega ,~\Sigma =2^\mathcal {AP}\) is accepted by the LDBA \(\mathfrak {A}=(\mathcal {Q},q_0,\Sigma , \mathcal {F}, \varDelta )\) if there exists an infinite run \(\theta \in \mathcal {Q}^\omega \) starting from \(q_0\) where \(\theta [i+1] \in \varDelta (\theta [i],w[i]),~i \ge 0\) and, for each \(F_j \in \mathcal {F}\), \( inf (\theta ) \cap F_j \ne \emptyset ,\) where \( inf (\theta )\) is the set of states that are visited infinitely often in the sequence \(\theta \). From Definition 8, the associated run \(\theta \) of an infinite path in the product MDP \(\rho = s^\otimes _0 \xrightarrow {a_0} s^\otimes _1 \xrightarrow {a_1} ...\) is \(\theta = L^\otimes (s^\otimes _0)L^\otimes (s^\otimes _1)...\). From Definition 9 and (10), and since for an accepting run \( inf (\theta ) \,\cap \, F_j \ne \emptyset ,~\forall F_j \in \mathcal {F}\), all accepting paths starting from \(s_0^\otimes \) accumulate an infinite number of positive rewards \(r_p\) (see Remark 2).
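For a run of lasso shape, i.e. a finite prefix followed by a cycle repeated forever, \( inf (\theta )\) is exactly the set of states on the cycle, so the generalized Büchi condition can be checked directly. The following is an illustrative helper (hypothetical, not from the paper) making that check explicit:

```python
# Illustrative check of the generalized Buchi condition on a lasso-shaped run
# theta = prefix . cycle^omega. Since only the cycle repeats forever,
# inf(theta) = set(cycle); the prefix is visited finitely often.
# The helper name accepts_lasso is hypothetical.

def accepts_lasso(prefix, cycle, accepting_sets):
    """True iff every accepting set F_j intersects inf(theta) = set(cycle)."""
    inf_theta = set(cycle)  # states visited infinitely often
    return all(inf_theta & F for F in accepting_sets)

# A run cycling through q1 and q2 meets both accepting sets below:
run_ok = accepts_lasso(prefix=["q0"], cycle=["q1", "q2"],
                       accepting_sets=[{"q1"}, {"q2", "q3"}])
# A run cycling only through q1 misses the second accepting set:
run_bad = accepts_lasso(prefix=["q0"], cycle=["q1"],
                        accepting_sets=[{"q1"}, {"q2"}])
```

This mirrors why accepting paths accumulate infinitely many positive rewards: every accepting set is revisited on each traversal of the cycle.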
In the following we show, by contradiction, that any optimal policy \({\pi ^\otimes }^*\) satisfies the property with the maximum possible probability. Assume that there exists a stationary deterministic Markov policy \({\pi ^\otimes }^+\ne {\pi ^\otimes }^*\) over the state space \(\mathcal {S}^\otimes \) such that the probability of satisfying \(\varphi \) under \({\pi ^\otimes }^+\) is maximum.
This means that, in the product MDP \(\mathfrak {M}_\mathfrak {A}\), by following \({\pi ^\otimes }^+\) the expectation of reaching the point where \( inf (\theta ) \cap F_j \ne \emptyset ,~\forall F_j \in \mathcal {F}\) holds, and a positive reward is received ever after, is higher than under any other policy, including \({\pi ^\otimes }^*\). With a tuned discount factor \(\gamma \), e.g. (1),

\(\mathbb {E}^{{\pi ^\otimes }^+}\Big [\sum _{n=0}^{\infty } \gamma ^n\, R(s_n^\otimes , a_n)\Big ] > \mathbb {E}^{{\pi ^\otimes }^*}\Big [\sum _{n=0}^{\infty } \gamma ^n\, R(s_n^\otimes , a_n)\Big ].\)

This contradicts the optimality of \({\pi ^\otimes }^*\) (15), and we conclude that \({\pi ^\otimes }^*={\pi ^\otimes }^+\). Namely, an optimal policy that maximises the expected return also maximises the probability of satisfying the LTL property \(\varphi \). It is easy to see that the projection of the policy \({\pi ^\otimes }^*\) onto the MDP \(\mathfrak {M}\) is a finite-memory policy \(\pi ^*\). \(\Box \)
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Hasanbeig, M., Kroening, D., Abate, A. (2020). Deep Reinforcement Learning with Temporal Logics. In: Bertrand, N., Jansen, N. (eds) Formal Modeling and Analysis of Timed Systems. FORMATS 2020. Lecture Notes in Computer Science(), vol 12288. Springer, Cham. https://doi.org/10.1007/978-3-030-57628-8_1
Print ISBN: 978-3-030-57627-1
Online ISBN: 978-3-030-57628-8