OmegaRegular Objectives in ModelFree Reinforcement Learning
 5.9k Downloads
Abstract
We provide the first solution for modelfree reinforcement learning of \(\omega \)regular objectives for Markov decision processes (MDPs). We present a constructive reduction from the almostsure satisfaction of \(\omega \)regular objectives to an almostsure reachability problem, and extend this technique to learning how to control an unknown model so that the chance of satisfying the objective is maximized. We compile \(\omega \)regular properties into limitdeterministic Büchi automata instead of the traditional Rabin automata; this choice sidesteps difficulties that have marred previous proposals. Our approach allows us to apply modelfree, offtheshelf reinforcement learning algorithms to compute optimal strategies from the observations of the MDP. We present an experimental evaluation of our technique on benchmark learning problems.
1 Introduction
Reinforcement learning (RL) [3, 37, 40] is an approach to sequential decision making in which agents rely on reward signals to choose actions aimed at achieving prescribed objectives. Modelfree RL refers to a class of techniques that are asymptotically spaceefficient [36] because they do not construct a full model of the environment. These techniques include classic algorithms like Qlearning [37] as well as their extensions to deep neural networks [14, 31]. Some objectives, like running a maze, are naturally expressed in terms of scalar rewards; in other cases the translation is less obvious. We solve the problem of \(\omega \)regular rewards, that is, the problem of defining scalar rewards for the transitions of a Markov decision process (MDP) so that strategies that maximize the probability to satisfy an \(\omega \)regular objective may be computed by offtheshelf, modelfree RL algorithms.
Omegaregular languages [28, 38] provide a rich formalism to unambiguously express qualitative safety and progress requirements of MDPs [2]. A common way to describe an \(\omega \)regular language is via a formula in Linear Time Logic (LTL); other specification mechanisms include extensions of LTL, various types of automata, and monadic secondorder logic. A typical requirement that is naturally expressed as an \(\omega \)regular objective prescribes that the agent should eventually control the MDP to stay within a given set of states, while at all times avoiding another set of states. In LTL this would be written Open image in new window , where \(\mathtt {goal}\) and \(\mathtt {trap}\) are labels attached to the appropriate states, Open image in new window stands for “finally,” and Open image in new window stands for “globally.”
For verification or synthesis, an \(\omega \)regular objective is usually translated into an automaton that monitors the traces of the MDP [10]. Successful executions cause the automaton to take certain (accepting) transitions infinitely often, and ultimately avoid certain (rejecting) transitions. That is, \(\omega \)regular objectives are about the longterm behavior of an MDP; the frequency of reward collected is not what matters. A policy that guarantees no rejecting transitions and an accepting transition every ten steps, is better than a policy that promises an accepting transition at each step, but with probability 0.5 does not accept at all.
The problem of \(\omega \)regular rewards in the context of modelfree RL was first tackled in 2014 by translating the objective into a deterministic Rabin automaton and deriving positive and negative rewards directly from the acceptance condition of the automaton [32]. In Sect. 3 we show that their algorithm, and the extension of [18] may fail to find optimal strategies, and may underestimate the probability of satisfaction of the objective. In [16, 17] the use of limitdeterministic Büchi automata avoids the problems connected with the use of a Rabin acceptance condition. However, as shown in Sect. 3, that approach may still produce incorrect results.
We avoid the problems inherent in the use of deterministic Rabin automata for modelfree RL by resorting to limitdeterministic Büchi automata, which, under mild restrictions, were shown by [8, 15, 33] to be suitable for both qualitative and quantitative analysis of MDPs under all \(\omega \)regular objectives. The Büchi acceptance condition, which, unlike the Rabin condition, does not use rejecting transitions, allows us to constructively reduce the almostsure satisfaction of \(\omega \)regular objectives to an almostsure reachability problem. It is also suitable for quantitative analysis: the value of a state converges to the maximum probability of satisfaction of the objective from that state as a parameter approaches 1.
We concentrate on modelfree approaches and infinitary behaviors for finite MDPs. Related problems include modelbased RL [13], RL for finitehorizon objectives [22, 23], and learning for efficient verification [4].
This paper is organized as follows. Section 2 recalls definitions and notations. Section 3 shows the problems that arise when the reward of the RL algorithm is derived from the acceptance condition of a deterministic Rabin automaton. In Sect. 4 we prove the main results. Finally, Sect. 5 discusses our experiments.
2 Preliminaries
2.1 Markov Decision Processes
Let \(\mathcal{D}(S)\) be the set of distributions over S. A Markov decision process \(\mathcal {M}\) is a tuple (S, A, T, AP, L) where S is a finite set of states, A is a finite set of actions, Open image in new window is the probabilistic transition (partial) function, AP is the set of atomic propositions, and \(L:S \rightarrow 2^{AP}\) is the proposition labeling function.
For any state \(s \in S\), we let A(s) denote the set of actions that can be selected in state s. For states \(s, s' \in S\) and \(a \in A(s)\), \(T(s, a)(s')\) equals \(p(s'  s, a)\). A run of \(\mathcal {M}\) is an \(\omega \)word \(\langle s_0, a_1, s_1, \ldots \rangle \in S \times (A \times S)^\omega \) such that \(p(s_{i+1}  s_{i}, a_{i+1}) {>} 0\) for all \(i \ge 0\). A finite run is a finite such sequence. For a run \(r = \langle s_0, a_1, s_1, \ldots \rangle \) we define the corresponding labeled run as \(L(r) = \langle L(s_0), L(s_1), \ldots \rangle \in (2^{AP})^\omega \). We write \( Runs ^\mathcal {M}( FRuns ^\mathcal {M})\) for the set of runs (finite runs) of the MDP \(\mathcal {M}\) and \( Runs {}^\mathcal {M}(s) ( FRuns {}^\mathcal {M}(s))\) for the set of runs (finite runs) of the MDP \(\mathcal {M}\) starting from state s. We write \( last (r)\) for the last state of a finite run r.
A strategy in \(\mathcal {M}\) is a function \(\sigma : FRuns \rightarrow \mathcal{D}(A)\) such that \( supp (\sigma (r)) \subseteq A( last (r))\), where \( supp (d)\) denotes the support of the distribution d. Let \( Runs ^\mathcal {M}_\sigma (s)\) denote the subset of runs \( Runs ^\mathcal {M}(s)\) that correspond to strategy \(\sigma \) with initial state s. Let \(\varSigma _\mathcal {M}\) be the set of all strategies. A strategy \(\sigma \) is pure if \(\sigma (r)\) is a point distribution for all runs \(r \in FRuns ^\mathcal {M}\) and we say that \(\sigma \) is stationary if \( last (r) = last (r')\) implies \(\sigma (r) = \sigma (r')\) for all runs \(r, r' \in FRuns ^\mathcal {M}\). A strategy that is not pure is mixed. A strategy is positional if it is both pure and stationary.
The behavior of an MDP \(\mathcal {M}\) under a strategy \(\sigma \) is defined on a probability space \(( Runs ^\mathcal {M}_\sigma (s), \mathcal {F}_{ Runs ^\mathcal {M}_\sigma (s)}, \Pr ^\mathcal {M}_\sigma (s))\) over the set of infinite runs of \(\sigma \) with starting state s. Given a realvalued random variable over the set of infinite runs \(f : Runs ^\mathcal {M}\rightarrow \mathbb R\), we denote by \(\mathbb E^\mathcal {M}_\sigma (s) \left\{ f \right\} \) the expectation of f over the runs of \(\mathcal {M}\) originating at s that follow strategy \(\sigma \).
Given an MDP \(\mathcal {M}= (S, A, T, AP, L)\), we define its directed underlying graph \(\mathcal {G}_\mathcal {M}= (V, E)\) where \(V = S\) and \(E \subseteq S \times S\) is such that \((s, s') \in E\) if \(T(s, a)(s') > 0\) for some \(a\in A(s)\). A subMDP of \(\mathcal {M}\) is an MDP \(\mathcal {M}' = (S', A', T', AP, L')\), where \(S' \subset S\), \(A' \subseteq A\) is such that \(A'(s) \subseteq A(s)\) for every \(s \in S'\), and \(T'\) and \(L'\) are analogous to T and L when restricted to \(S'\) and \(A'\). In particular, \(\mathcal {M}'\) is closed under probabilistic transitions, i.e. for all \(s \in S'\) and \(a \in A'\) we have that \(T(s, a)(s') > 0\) implies that \(s' \in S'\). An endcomponent [10] of an MDP \(\mathcal {M}\) is a subMDP \(\mathcal {M}'\) of \(\mathcal {M}\) such that \(\mathcal {G}_{\mathcal {M}'}\) is strongly connected.
Theorem 1
(EndComponent Properties [10]). Once an endcomponent C of an MDP is entered, there is a strategy that visits every stateaction combination in C with probability 1 and stays in C forever. Moreover, for every strategy the union of the endcomponents is visited with probability 1.
A Markov chain is an MDP whose set of actions is singleton. A bottom strongly connected component (BSCC) of a Markov chain is any of its endcomponents. A BSCC is accepting if it contains an accepting transition (see below) and otherwise it is rejecting. For any MDP \(\mathcal {M}\) and positional strategy \(\sigma \), let \(\mathcal {M}_\sigma \) be the Markov chain resulting from resolving the nondeterminism in \(\mathcal {M}\) using \(\sigma \).
A rewardful MDP is a pair \((\mathcal {M}, \rho )\), where \(\mathcal {M}\) is an MDP and \(\rho :S \times A \rightarrow \mathbb R\) is a reward function assigning utility to stateaction pairs. A rewardful MDP \((\mathcal {M}, \rho )\) under a strategy \(\sigma \) determines a sequence of random rewards \({\rho (X_{i1}, Y_i)}_{i \ge 1}\), where \(X_i\) and \(Y_i\) are the random variables denoting the ith state and action, respectively. Depending upon the problem of interest, different performance objectives may be of interest. The reachability probability objective \( Reach (T)^\mathcal {M}_\sigma (s)\) (with \(T \subseteq S\)) is defined as Open image in new window . For a given discount factor \(\lambda \in [0, 1[\), the discounted reward objective \(\mathcal { Disct }(\lambda )^\mathcal {M}_\sigma (s)\) is defined as Open image in new window , while the average reward \(\mathcal { Avg }^\mathcal {M}_\sigma (s)\) is defined as Open image in new window . For an objective \( Reward ^\mathcal {M}{\in } \{ Reach (T)^\mathcal {M}, \mathcal { Disct }(\lambda )^\mathcal {M}, \mathcal { Avg }^\mathcal {M}\}\) and an initial state s, we define the optimal reward \( Reward ^\mathcal {M}_*(s)\) as \(\sup _{\sigma \in \varSigma _\mathcal {M}} Reward ^\mathcal {M}_\sigma (s)\). A strategy \(\sigma {\in } \varSigma _\mathcal {M}\) is optimal for \( Reward ^\mathcal {M}\) if \( Reward ^\mathcal {M}_\sigma (s) {=} Reward ^\mathcal {M}_*(s)\) for all \(s {\in } S\).
2.2 \(\omega \)Regular Performance Objectives
A nondeterministic \(\omega \)automaton is a tuple \(\mathcal{A} = (\varSigma ,Q,q_0,\delta ,\mathsf {Acc})\), where \(\varSigma \) is a finite alphabet, Q is a finite set of states, \(q_0 \in Q\) is the initial state, \(\delta :Q \times \varSigma \rightarrow 2^Q\) is the transition function, and \(\mathsf {Acc}\) is the acceptance condition. A run r of \(\mathcal{A}\) on \(w \in \varSigma ^\omega \) is an \(\omega \)word \(r_0, w_0, r_1, w_1, \ldots \) in \((Q \cup \varSigma )^\omega \) such that \(r_0 = q_0\) and, for \(i > 0\), \(r_i \in \delta (r_{i1},w_{i1})\). Each triple \((r_{i1},w_{i1},r_i)\) is a transition of \(\mathcal{A}\).
We consider Büchi and Rabin acceptance conditions, which depend on the transitions that occur infinitely often in a run of an automaton. We write \(\mathsf {inf}(r)\) for the set of transitions that appear infinitely often in the run r. The Büchi acceptance condition defined by \(F \subseteq Q \times \varSigma \times Q\) is the set of runs \(\{ r \in (Q \cup \varSigma )^\omega :\mathsf {inf}(r) \cap F \ne \emptyset \}\). A Rabin acceptance condition is defined in terms of k pairs of subsets of \(Q \times \varSigma \times Q\), \((B_0,G_0), \ldots , (B_{k1},G_{k1})\), as the set Open image in new window . The index of a Rabin condition is its number of pairs.
A run r of \(\mathcal{A}\) is accepting if \(r \in \mathsf {Acc}\). The language, \(L_\mathcal {A}\), of \(\mathcal {A}\) (or, accepted by \(\mathcal{A}\)) is the subset of words in \(\varSigma ^\omega \) that have accepting runs in \(\mathcal{A}\). A language is \(\omega \)regular if it is accepted by an \(\omega \)automaton.
Given an MDP \(\mathcal {M}\) and an \(\omega \)regular objective \(\varphi \) given as an \(\omega \)automaton \(\mathcal {A}_\varphi = (\varSigma ,Q,q_0,\delta ,\mathsf {Acc})\), we are interested in computing an optimal strategy satisfying the objective. We define the satisfaction probability of a strategy \(\sigma \) from initial state s as: Open image in new window .
The optimal satisfaction probability \(\Pr ^\mathcal {M}_*(s\models \varphi )\) is defined as \(\sup _{\sigma \in \varSigma _\mathcal {M}} \Pr ^\mathcal {M}_\sigma (s \models \varphi )\) and we say that \(\sigma \in \varSigma _\mathcal {M}\) is an optimal strategy for \(\varphi \) if \(\Pr ^\mathcal {M}_*(s\models \varphi ) = \Pr ^\mathcal {M}_\sigma (s \models \varphi )\).
An automaton \(\mathcal{A} = (\varSigma ,Q,q_0,\delta ,\mathsf {Acc})\) is deterministic if \(\delta (q,\sigma ) \le 1\) for all \(q \in Q\) and all \(\sigma \in \varSigma \). \(\mathcal{A}\) is complete if \(\delta (q,\sigma ) \ge 1\). A word in \(\varSigma ^\omega \) has exactly one run in a deterministic, complete automaton. We use common threeletter abbreviations to distinguish types of automata. The first (D or N) tells whether the automaton is deterministic; the second denotes the acceptance condition (B for Büchi and R for Rabin). The third letter (W) says that the automaton reads \(\omega \)words. For example, an NBW is a nondeterministic Büchi automaton, and a DRW is a deterministic Rabin automaton.
Every \(\omega \)regular language is accepted by some DRW and by some NBW. In contrast, there are \(\omega \)regular languages that are not accepted by any DBW. The Rabin index of a Rabin automaton [6, 20] is the index of its acceptance condition. The Rabin index of an \(\omega \)regular language \(\mathcal {L}\) is the minimum index among those of the DRWs that accept \(\mathcal {L}\). For each \(n \in \mathbb {N}\) there exist \(\omega \)regular languages of Rabin index n. The languages accepted by DBWs, however, form a proper subset of the languages of index 1.
2.3 The Product MDP
Given an MDP \(\mathcal {M}= (S, A, T, AP, L)\) with a designated initial state \(s_0 \in S\), and a deterministic \(\omega \)automaton \(\mathcal {A} = (2^{AP}, Q, q_0, \delta , \mathsf {Acc})\), the product \(\mathcal {M}\times \mathcal {A}\) is the tuple \((S \times Q, (s_0,q_0), A, T^\times , \mathsf {Acc}^\times )\). The probabilistic transition function Open image in new window is such that \(T^\times ((s,q),a)((\hat{s},\hat{q})) = T(s,a)(\hat{s})\) if \(\{\hat{q}\} = \delta (q,L(s))\) and is 0 otherwise. If \(\mathcal {A}\) is a DBW, \(\mathsf {Acc}\) is defined by \(F \subseteq Q \times 2^{AP} \times Q\); then \(F^\times \subseteq (S \times Q) \times A \times (S \times Q)\) defines \(\mathsf {Acc}^\times \) as follows: \(((s,q),a,(s',q')) \in F^\times \) if and only if \((q,L(s),q') \in F\) and \(T(s,a)(s') \ne 0\). If \(\mathcal {A}\) is a DRW of index k, \(\mathsf {Acc}^\times = \{(B^\times _0,G^\times _0), \ldots , (B^\times _{k1},G^\times _{k1})\}\). To set \(B_i\) of \(\mathsf {Acc}\), there corresponds \(B^\times _i\) of \(\mathsf {Acc}^\times \) such that \(((s,q),a,(s',q')) \in B^\times _i\) if and only if \((q,L(s),q') \in B_i\) and \(T(s,a)(s') \ne 0\). Likewise for \(G^\times _i\).
If \(\mathcal {A}\) is a nondeterministic automaton, the actions in the product are enriched to identify both the actions of the original MDP and the choice of the successor state of the nondeterministic automaton.
Endcomponents and runs are defined for products just like for MDPs. A run of \(\mathcal {M}\times \mathcal {A}\) is accepting if it satisfies the product’s acceptance condition. An accepting endcomponent of \(\mathcal {M}\times \mathcal {A}\) is an endcomponent such that every run of the product MDP that eventually dwells in it is accepting.
In view of Theorem 1, satisfaction of an \(\omega \)regular objective \(\varphi \) by an MDP \(\mathcal {M}\) can be formulated in terms of the accepting endcomponents of the product \(\mathcal {M}\times \mathcal {A}_\varphi \), where \(\mathcal {A}_\varphi \) is an automaton accepting \(\varphi \). The maximum probability of satisfaction of \(\varphi \) by \(\mathcal {M}\) is the maximum probability, over all strategies, that a run of the product \(\mathcal {M}\times \mathcal {A}_\varphi \) eventually dwells in one of its accepting endcomponents.
It is customary to use DRWs instead of DBWs in the construction of the product, because the latter cannot express all \(\omega \)regular objectives. On the other hand, general NBWs are not used since causal strategies cannot optimally resolve nondeterministic choices because that requires access to future events [39].
2.4 LimitDeterministic Büchi Automata

\(Q_i \cap Q_f = \emptyset \), \(F \subseteq Q_f \times \varSigma \times Q_f\);

\(\delta (q,\sigma )\cap Q_i \le 1\) for all \(q \in Q_i\) and \(\sigma \in \varSigma \);

\(\delta (q,\sigma ) \le 1\) for all \(q \in Q_f\) and \(\sigma \in \varSigma \);

\(\delta (q,\sigma ) \subseteq Q_f\) for all \(q \in Q_f\) and \(\sigma \in \varSigma \).
LDBWs are as expressive as general NBWs. Moreover, NBWs can be translated into LDBWs that can be used for the qualitative and quantitative analysis of MDPs [8, 15, 33, 39]. We use the translation from [15], which uses LDBWs that consist of two parts: an initial deterministic automaton (without accepting transitions) obtained by a subset construction; and a final part produced by a breakpoint construction. They are connected by a single “guess”, where the automaton guesses a singleton subset of the reachable states to start the breakpoint construction. Like in other constructions (e.g. [33]), one can compose the resulting automata with an MDP, such that the optimal control of the product defines a control on the MDP that maximizes the probability of obtaining a word from the language of the automaton. We refer to LDBWs with this property as suitable limitdeterministic automata (SLDBWs).
Definition 1
(Suitable LDBW). An SLDBW \(\mathcal {A}\) for property \(\varphi \) is an LDBW that recognizes \(\varphi \) and such that, for every finite MDP \(\mathcal {M}\), there exists a positional strategy \(\sigma \in \varSigma _{\mathcal {M}\times \mathcal {A}}\) such that the probability of satisfying the Büchi condition in the Markov chain \((\mathcal {M}\times \mathcal {A})_\sigma \) is \(\Pr ^\mathcal {M}_*(s \models \varphi )\).
Theorem 2
[8, 15, 33, 39]. Suitable limitdeterministic Büchi automata exist for all \(\omega \)regular languages.
SLDBWs—and their properties described in Definition 1—are used in the qualitative and quantitative model checking algorithms in [8, 15, 33, 39]. The accepting endcomponents of the product MDPs are all using only states from the final part of the SLDBW. Büchi acceptance then allows for using memoryless almost sure winning strategies in the accepting endcomponents, while outside of accepting endcomponents a memoryless strategy that maximizes the chance of reaching such an endcomponent can be used. The distinguishing property is the guarantee that they provide the correct probability, while using a product with a general NBW would only provide a value that cannot exceed it.
2.5 Linear Time Logic Objectives
LTL (Linear Time Logic) is a temporal logic whose formulae describe a subset of the \(\omega \)regular languages, which is often used to specify objectives in humanreadable form. Translations exist from LTL to various forms of automata, including NBW, DRW, and SLDBW. Given a set of atomic propositions AP, a is an LTL formula for each \(a \in AP\). Moreover, if \(\varphi \) and \(\psi \) are LTL formulae, so are Open image in new window . Additional operators are defined as abbreviations: Open image in new window ; Open image in new window ; Open image in new window ; Open image in new window ; Open image in new window ; and Open image in new window . We write \(w \models \varphi \) if \(\omega \)word w over \(2^{AP}\) satisfies LTL formula \(\varphi \). The satisfaction relation is defined inductively [2, 24].
2.6 Reinforcement Learning
For an MDP \(\mathcal {M}\) and an objectives \( Reward ^\mathcal {M}\in \{ Reach (T)^\mathcal {M}, \mathcal { Disct }(\lambda )^\mathcal {M}, \mathcal { Avg }^\mathcal {M}\}\), the optimal reward and an optimal strategy can be computed using value iteration, policy iteration, or, in polynomial time, using linear programming [12, 30]. On the other hand, for \(\omega \)regular objectives (given as DRW, SLDBW, or LTL formulae) optimal satisfaction probabilities and strategies can be computed using graphtheoretic techniques (computing accepting endcomponent and then maximizing the probability to reach states in such components) over the product structure. However, when the MDP transition/reward structure is unknown, such techniques are not applicable.
3 Problem Statement and Motivation
Given MDP \(\mathcal {M}\) with unknown transition structure and \(\omega \)regular objective \(\varphi \), we seek a strategy that maximizes the probability that \(\mathcal {M}\) satisfies \(\varphi \).
To apply modelfree RL algorithms to this task, one needs to define rewards that depend on the observations of the MDP and reflect the satisfaction of the objective. It is natural to use the product of the MDP and an automaton monitoring the satisfaction of the objective to assign suitable rewards to various actions chosen by the learning algorithm.
Sadigh et al. [32] were the first to apply modelfree RL to a qualitativeversion of this problem, i.e., to learn a strategy that satisfies the property with probability 1. For an MDP \(\mathcal {M}\) and a DRW \(\mathcal {A}_\varphi \) of index k, they formed the product MDP \(\mathcal {M}\times \mathcal {A}_\varphi \) with k different “Rabin” reward functions \(\rho _1, \ldots , \rho _k\). The function \(\rho _i\) corresponds to the Rabin pair \((B^\times _i, G^\times _i)\): it assigns a fixed negative reward \(R_ < 0\) to all edges in \(B^\times _i\) and a fixed positive reward \(R_+ >0\) to all edges in \(G^\times _i\). [32] claimed that if there exists a strategy satisfying an \(\omega \)regular objective \(\varphi \) with probability 1, then there exists a Rabin pair i, discount factor \(\lambda _* \in [0, 1[\), and suitably high ratio \(R_*\), such that for all \(\lambda \in [\lambda _*, 1[\) and \(R_{}/R_{+} \ge R_*\), any strategy maximizing \(\lambda \)discounted reward for the MDP \((\mathcal {M}{\times } \mathcal {A}_\varphi , \rho _i)\) also satisfies the \(\omega \)regular objective \(\varphi \) with probability 1. Using Blackwelloptimality theorem [19], a paraphrase of this claim is that if there exists a strategy satisfying an \(\omega \)regular objective \(\varphi \) with probability 1, then there exists a Rabin pair i and suitably high ratio \(R_*\), such that for all \(R_{}/R_{+} \ge R_*\), any strategy maximizing expected average reward for the MDP \((\mathcal {M}{\times } \mathcal {A}_\varphi , \rho _i)\) also satisfies the \(\omega \)regular objective \(\varphi \) with probability 1. This approach has two faults, the second of which also affects approaches that replace DRWs with LDBWs [16, 17].
 1.
We provide in Example 1 an MDP and an \(\omega \)regular objective \(\varphi \) with Rabin index 2, such that, although there is a strategy that satisfies the property with probability 1, optimal average strategies from any Rabin reward do not satisfy the objective with probability 1.
 2.
Even for an \(\omega \)regular objective with one Rabin pair (B, G) and \(B {=} \emptyset \)—i.e., one that can be specified by a DBW—we demonstrate in Example 2 that the problem of finding a strategy that satisfies the property with probability 1 may not be reduced to finding optimal average strategies.
Example 1
(Two Rabin Pairs). Consider the MDP given as a simple gridworld example shown in Fig. 2. Each cell (state) of the MDP is labeled with the atomic propositions that are true there. In each cell, there is a choice between two actions \(\mathtt {rest}\) and \(\mathtt {go}\). With action \(\mathtt {rest}\) the state of the MDP does not change. However, with action \(\mathtt {go}\) the MDP moves to the other cell in the same row with probability p, or to the other cell in the same column with probability \(1{}p\). The initial cell is (0, 0).
The specification is given by LTL formula Open image in new window . A DRW that accepts \(\varphi \) is shown in Fig. 2. The DRW has two accepting pairs: \((B_0,G_0)\) and \((B_1,G_1)\). The table besides the automaton gives, for each transition, its label and the B and G sets to which it belongs.
The optimal strategy that satisfies the objective \(\varphi \) with probability 1 chooses \(\mathtt {go}\) in Cell (0, 0) and chooses \(\mathtt {rest}\) subsequently. However, for both Rabin pairs, the optimal strategy for expected average reward is to maximize the probability of reaching one of the \((0, 1), \texttt {safe}\) or \((1, 0), \texttt {safe}\) states of the product and stay there forever. For the first accepting pair the maximum probability of satisfaction is \(\frac{1}{2p}\), while for the second pair it is \(\frac{1}{1+p}\).
Example 2
(DBW to Expected Average Reward Reduction). This counterexample demonstrates that even for deterministic Büchi objectives, the problem of finding an optimal strategy satisfying an objective may not be reduced to the problem of finding an optimal average strategy. Consider the simple gridworld example of Fig. 3 with the specification Open image in new window , where atomic proposition \(\mathtt {b}\) (blue) labels Cell 1 and atomic proposition \(\mathtt {g}\) (green) labels Cells 2 and 3. Actions enabled in various cells and their probabilities are depicted in the figure.
The strategy from Cell 0 that chooses Action a guarantees satisfaction of \(\varphi \) with probability 1. An automaton with accepting transitions for \(\varphi \) is shown in Fig. 3; it is a DBW (or equivalently a DRW with one pair (B, G) and \(B = \emptyset \)).
The product MDP is shown at the bottom of Fig. 3. All states whose second component is \(\mathtt {trap}\) have been merged. Notice that there is no negative reward since the set B is empty. If reward is positive and equal for all accepting transitions, and 0 for all other transitions, then when \(p > 1/2\), the strategy that maximizes expected average reward chooses Action b in the initial state and Action e from State \((2,\mathtt {safe})\). Note that, for large values of \(\lambda \), the optimal expected average reward strategies are also optimal strategies for the \(\lambda \)discounted reward objective. However, these strategies are not optimal for \(\omega \)regular objectives.
Example 1 shows that one cannot select a pair from a Rabin acceptance condition ahead of time. This problem can be avoided by the use of Büchi acceptance conditions. While DBWs are not sufficiently expressive, SLDBWs express all \(\omega \)regular properties and are suitable for probabilistic model checking. In the next section, we show that they are also “the ticket” for modelfree reinforcement learning, because they allow us to maximize the probability of satisfying an \(\omega \)regular specification by solving a reachability probability problem that can be solved efficiently by offtheshelf RL algorithms.
4 ModelFree RL from OmegaRegular Rewards
We now reduce the model checking problem for a given MDP and SLDBW to a reachability problem by slightly changing the structure of the product: We add a target state t that can be reached with a given probability \(1  \zeta \) whenever visiting an accepting transition of the original product MDP.
Our reduction avoids the identification of winning endcomponents and thus allows a natural integration to a wide range of modelfree RL approaches. Thus, while the proofs do lean on standard model checking properties that are based on identifying winning endcomponents, they serve as a justification not to consider them when running the learning algorithm. In the rest of this section, we fix an MDP \(\mathcal {M}\) and an SLDBW \(\mathcal {A}\) for the \(\omega \)regular property \(\varphi \).
Definition 2
(Augmented Product). For any \(\zeta \in ]0,1[\), the augmented MDP \(\mathcal {M}^\zeta \) is an MDP obtained from \(\mathcal {M}\!\times \!\mathcal {A}\) by adding a new state t with a selfloop to the set of states of \(\mathcal {M}\!\times \!\mathcal {A}\), and by making t a destination of each accepting transition \(\tau \) of \(\mathcal {M}\!\times \!\mathcal {A}\) with probability \(1\zeta \). The original probabilities of all other destinations of an accepting transition \(\tau \) are multiplied by \(\zeta \).
An example of an augmented MDP is shown in Fig. 4. With a slight abuse of notation, if \(\sigma \) is a strategy on the augmented MDP \(\mathcal {M}^\zeta \), we denote by \(\sigma \) also the strategy on \(\mathcal {M}\!\times \!\mathcal {A}\) obtained by removing t from the domain of \(\sigma \).
We let \(p^\sigma _s(\zeta )\) denote the probability of reaching t in \(\mathcal {M}^\zeta _\sigma \) when starting at state s. Notice that we can encode this value as the expected average reward in the following rewardful MDP \((\mathcal {M}^\zeta , \rho )\), where we set the reward function \(\rho (t,a) = 1\) for all \(a \in A\) and \(\rho (s,a) = 0\) otherwise. For any strategy \(\sigma \), the probability \(p^\sigma _s(\zeta )\) and the reward of \(\sigma \) from s in \((\mathcal {M}^\zeta , \rho )\) are the same. We also let \(a^\sigma _s\) be the probability that a run that starts from s in \((\mathcal {M}\times \mathcal {A})_\sigma \) is accepting.
Lemma 1
 1.
if the state s is in a rejecting BSCC of \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \), then \(p^\sigma _s(\zeta ) = 0\);
 2.
if the state s is in an accepting BSCC of \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \), then \(p^\sigma _s(\zeta ) = 1\);
 3.
the probability \(p^\sigma _s(\zeta )\) of reaching t is greater than \(a^\sigma _s\); and
 4.
if \(p^\sigma _s(\zeta ) {=} 1\) then no rejecting BSCC is reachable from s in \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \) and \(a^\sigma _s = 1\).
Proof
(1) holds as there are no accepting transition in a rejecting BSCC of \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \), and so t cannot be reached when starting at s in \(\mathcal {M}^\zeta _\sigma \). (2) holds because t (with its selfloop) is the only BSCC reachable from s in \(\mathcal {M}^\zeta \). In other words, t (with its selfloop) and the rejecting BSCCs of \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \) are the only BSCCs in \(\mathcal {M}^\zeta _\sigma \). (3) then follows, because the same paths lead to a rejecting BSCCs in \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \) and \(\mathcal {M}^\zeta _\sigma \), while the probability of each such a path is no larger—and strictly smaller iff it contains an accepting transition—than in \(\mathcal {M}^\zeta _\sigma \). (4) holds because, if \(p^\sigma _s(\zeta ) = 1\), then t (with its selfloop) is the only BSCC reachable from s in \(\mathcal {M}^\zeta _\sigma \). Thus, there is no path to a rejecting BSCC in \(\mathcal {M}^\zeta _\sigma \), and therefore no path to a rejecting BSCC in \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \). \(\square \)
Lemma 2
Let \(\sigma \) be a positional strategy on \(\mathcal {M}^\zeta \). For every state s of \(\mathcal {M}^\zeta \), we have that \(\lim _{\zeta \uparrow 1} p^\sigma _s(\zeta ) = a^\sigma _s\).
Proof
As shown in Lemma 1(3) for all \(\zeta \), we have \(p^\sigma _s(\zeta ) \ge a^\sigma _s\). For a coarse approximation of their difference, we recall that \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \) is a finite Markov chain. The expected number of transitions taken before reaching a BSCC from s in \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \) is therefore a finite number. Let us refer to the—no larger—expected number of accepting transitions taken before reaching a BSCC when starting at s in \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \) as \(f^\sigma _s\). We claim that \(a^\sigma _s \ge p^\sigma _s(\zeta )(1\zeta ) \cdot f^\sigma _s\). This is because the probability of reaching a rejecting BSCC in \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \) is at most the probability of reaching a rejecting BSCC in \(\mathcal {M}^\zeta _\sigma \), which is at most \(1p^\sigma _s(\zeta )\), plus the probability of moving on to t from a state that is not in any BSCC in \((\mathcal {M}\!\times \!\mathcal {A})_\sigma \), which we are going to show next is at most \(f^\sigma _s \cdot (1\zeta )\).
This provides us with our main theorem.
Theorem 3
There exists a threshold \(\zeta ' \in ]0,1[\) such that, for all \(\zeta > \zeta '\) and every state s, any strategy \(\sigma \) that maximizes the probability \(p^{\sigma }_s(\zeta )\) of reaching the sink in \(\mathcal {M}^\zeta \) is (1) an optimal strategy in \(\mathcal {M}\times \mathcal {A}\) from s and (2) induces an optimal strategy for the original MDP \(\mathcal {M}\) from s with objective \(\varphi \).
Proof
We use the fact that it suffices to study positional strategies, and there are only finitely many of them. Let \(\sigma _1\) be an optimal strategy of \(\mathcal {M}\!\times \!\mathcal {A}\), and let \(\sigma _2\) be a strategy that has the highest likelihood of creating an accepting run among all nonoptimal memoryless ones. (If \(\sigma _2\) does not exist, then all strategies are equally good, and it does not matter which one is chosen.) Let \(\delta = a^{\sigma _1}_s  a^{\sigma _2}_s\).
Now, suppose that \(\sigma \) is a positional strategy that is optimal in \(\mathcal {M}\!\times \!\mathcal {A}\). Then the probability of satisfying \(\varphi \) in \(\mathcal {M}\) when starting at s is at least^{1} \(a^\sigma _s\). At the same time, if there was a strategy for which the probability of satisfying \(\varphi \) in \(\mathcal {M}\) is \( > a^\sigma _s\), then the property of \(\mathcal A\) to be an SLDBW (Definition 1) would guarantee the existence of strategy \(\sigma '\) for which \(a^{\sigma '}_s > a^\sigma _s\); a contradiction with the assumption that \(\sigma \) is optimal. Therefore any positional strategy that is optimal in \(\mathcal {M}\!\times \!\mathcal {A}\) induces an optimal strategy in \(\mathcal {M}\) with objective \(\varphi \). \(\square \)
Corollary 1
Due to Lemma 1(4), \(\mathcal {M}\) satisfies \(\varphi \) almost surely if and only if the sink is almost surely reachable in \(\mathcal {M}^\zeta \) for all \(0< \zeta < 1\).
Theorem 3 leads to a very simple modelfree RL algorithm. The augmented product is not built by the RL algorithm, which does not know the transition structure of the environment MDP. Instead, the observations of the MDP are used by an interpreter process to compute a run of the objective automaton. The interpreter also extracts the set of actions for the learner to choose from. If the automaton is not deterministic and it has not taken the one nondeterministic transition it needs to take yet, the set of actions the interpreter provides to the learner includes the choice of special “jump” actions that instruct the automaton to move to a chosen accepting component.
When the automaton reports an accepting transition, the interpreter gives the learner a positive reward with probability \(1\zeta \). When the learner actually receives a reward, the training episode terminates. Any RL algorithm that maximizes this probabilistic reward is guaranteed to converge to a policy that maximizes the probability of satisfaction of the \(\omega \)regular objective.
5 Experimental Results
We implemented the construction described in the previous sections in a tool named Mungojerrie [11], which reads MDPs described in the PRISM language [21], and \(\omega \)regular automata written in the HOA format [1, 9]. Mungojerrie builds the augmented product \(\mathcal {M}^\zeta \), provides an interface for RL algorithms akin to that of [5] and supports probabilistic model checking. Our algorithm computes, for each pair (s, a) of state and action, the maximum probability of satisfying the given objective after choosing action a from state s by using offtheshelf, temporal difference algorithms. Not all actions with maximum probability are part of positional optimal strategies—consider a product MDP with one state and two actions, a and b, such that a enables an accepting selfloop, and b enables a nonaccepting one: both state/action pairs are assigned probability 1. In b’s case, because choosing b once—or a finite number of times—does not preclude acceptance. Since the probability values alone do not identify a pure optimal strategy, Mungojerrie computes an optimal mixed strategy, uniformly choosing all maximum probability actions from a state.
Qlearning results. The default values of the learner hyperparameters are: \(\zeta = 0.99\), \(\epsilon =0.1\), \(\alpha =0.1\), tol\(=0.01\), epl\(=30\), and epn\(=20000\). Times are in seconds.
Name  states  aut.  prod.  prob.  est.  time  \(\zeta \)  \(\varepsilon \)  \(\alpha \)  tol  epl  epn 

twoPairs  4  4  16  1  1  0.26  
riskReward  4  2  8  1  1  1.47  
deferred  41  1  41  1  1  1.01  
grid5x5  25  3  75  1  1  10.82  0.01  0.2  400  30k  
trafficNtk  122  13  462  1  1  2.89  
windy  123  2  240  1  1  12.35  0.95  0.001  0.05  0  900  200k 
windyKing  130  2  256  1  1  14.34  0.95  0.02  0.2  0  300  120k 
windyStoch  130  2  260  1  1  47.70  0.95  0.02  0.2  0  300  200k 
frozenSmall  16  3  48  0.823  0.83  0.51  0.05  0  200  
frozenLarge  64  3  192  1  1  1.81  0.05  0  700  
othergrid6  36  25  352  1  1  10.80  0  300  75k  
othergrid20  400  25  3601  1  1  78.00  0.9999  0.07  0.2  0  5k  
othergrid40  1600  25  14401  1  0.99  87.90  0.9999  0.05  0.2  0  14k  25k 
doublegrid8  4096  3  12287  1  1  45.50  0  3k  100k  
doublegrid12  20736  3  62207  1  1  717.6  0  20k  300k  
slalom  36  5  84  1  1  0.98  
rps1  121  2  130  0.768  0.76  5.21  0.12  0.006  0  500k  
dpenny  52  2  65  0.5  0.5  1.99  0.001  0.2  0  50  120k  
devious  11  1  11  1  1  0.81  
arbiter2  32  3  72  1  1  5.16  0.5  0.02  200  
knuthYao  13  3  39  1  1  0.31  100  
threeWayDuel  10  3  13  0.397  0.42  0.08  
mutual414  27600  128  384386  1  1  2.74  
mutual415  27600  527  780504  1  1  3.61 
Figure 5 illustrates how increasing the parameter \(\zeta \) makes the RL algorithm less sensitive to the presence of transient (not in an endcomponent) accepting transitions. Model deferred consists of two chains of states: one, which the agent choses with action a, has accepting transitions throughout, but leads to an endcomponent that is not accepting. The other chain, selected with action b, leads to an accepting endcomponent, but has no other accepting transitions. There are no other decisions in the model; hence only two strategies are possible, which we denote by a and b, depending on the action chosen.
The curve labeled \(p_a\) in Fig. 5 gives the probability of satisfaction under strategy a of the MDP’s objective as a function of \(\zeta \) as computed by Qlearning. The number of episodes is kept fixed at 20, 000 and each episode has length 80. Each data point is the average of five experiments for the same value of \(\zeta \).
For values of \(\zeta \) close to 0, the chance is high that the sink is reached directly from a transient state. Consequently, Qlearning considers strategies a and b equally good. For this reason, the probability of satisfaction of the objective, \(p_\varphi \), according to the strategy that mixes a and b, is computed by Mungojerrie’s model checker as 0.5. As \(\zeta \) approaches 1, the importance of transient accepting transitions decreases, until the probability computed for strategy a is no longer considered to be approximately the same as the probability of strategy b. When that happens, \(p_\varphi \) abruptly goes from 0.5 to its true value of 1, because the pure strategy b is selected. The value of \(p_a\) continues to decline for larger values of \(\zeta \) until it reaches its true value of 0 for \(\zeta = 0.9999\). Probability \(p_b\), not shown in the graph, is 1 throughout.
The change in value of \(p_\varphi \) does not contradict Theorem 3, which says that \(p_b=1 > p_a\) for all values of \(\zeta \). In practice a high value of \(\zeta \) may be needed to reliably distinguish between transient and recurrent accepting transitions in numerical computation. Besides, Theorem 3 suggests that even in the almostsure case there is a meaningful path to the target strategy where the likelihood of satisfying \(\varphi \) can be expected to grow. This is important, as it comes with the promise of a generally increasing quality of intermediate strategies.
6 Conclusion
We have reduced the problem of maximizing the satisfaction of an \(\omega \)regular objective in a MDP to reachability on a product graph augmented with a sink state. This change is so simple and elegant that it may surprise that it has not been used before. But the reason for this is equally simple: it does not help in a model checking context, as it does not remove any analysis step there. In a reinforcement learning context, however, it simplifies our task significantly. In previous attempts to use suitable LDBW [4], the complex part of the model checking problem—identifying the accepting endcomponents—is still present. Only after this step, which is expensive and requires knowledge of the structure of the underlying MDP, can these methods reduce the search for optimal satisfaction to the problem of maximizing the chance to reach those components. Our reduction avoids the identification of accepting endcomponents entirely and thus allows a natural integration with a wide range of modelfree RL approaches.
Footnotes
 1.
This holds for all nondeterministic automata that recognize the models of \(\varphi \): an accepting run establishes that the path was a model of \(\varphi \).
References
 1.Babiak, T., et al.: The Hanoi omegaautomata format. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 479–486. Springer, Cham (2015). https://doi.org/10.1007/9783319216904_31CrossRefGoogle Scholar
 2.Baier, C., Katoen, J.P.: Principles of Model Checking. MIT Press, Cambridge (2008)zbMATHGoogle Scholar
 3.Bertsekas, D.P., Tsitsiklis, J.N.: NeuroDynamic Programming. Athena Scientific, Belmont (1996)zbMATHGoogle Scholar
 4.Brázdil, T., et al.: Verification of Markov decision processes using learning algorithms. In: Cassez, F., Raskin, J.F. (eds.) ATVA 2014. LNCS, vol. 8837, pp. 98–114. Springer, Cham (2014). https://doi.org/10.1007/9783319119366_8CrossRefGoogle Scholar
 5.Brockman, G., et al.: OpenAI Gym. CoRR, abs/1606.01540 (2016)Google Scholar
 6.Carton, O., Maceiras, R.: Computing the Rabin index of a parity automaton. Theoret. Inf. Appl. 33, 495–505 (1999)MathSciNetCrossRefGoogle Scholar
 7.Chatterjee, K., Gaiser, A., Křetínský, J.: Automata with generalized Rabin pairs for probabilistic model checking and LTL synthesis. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 559–575. Springer, Heidelberg (2013). https://doi.org/10.1007/9783642397998_37CrossRefGoogle Scholar
 8.Courcoubetis, C., Yannakakis, M.: The complexity of probabilistic verification. J. ACM 42(4), 857–907 (1995)MathSciNetCrossRefGoogle Scholar
 9.cpphoafparser (2016). https://automata.tools/hoa/cpphoafparser. Accessesd 05 Sept 2018
 10.de Alfaro, L.: Formal Verification of Probabilistic Systems. Ph.D. thesis, Stanford University (1998)Google Scholar
 11.Eliot, T.S.: Old Possum’s Book of Practical Cats. Harcourt Brace Jovanovich, San Diego (1939)Google Scholar
 12.Feinberg, E.A., Shwartz, A. (eds.): Handbook of Markov Decision Processes. Springer, New York (2002). https://doi.org/10.1007/9781461508052CrossRefzbMATHGoogle Scholar
 13.Fu, J., Topcu, U.: Probably approximately correct MDP learning and control with temporal logic constraints. In: Robotics: Science and Systems, July 2014Google Scholar
 14.Guez, A., et al.: An investigation of modelfree planning. CoRR, abs/1901.03559 (2019)Google Scholar
 15.Hahn, E.M., Li, G., Schewe, S., Turrini, A., Zhang, L.: Lazy probabilistic model checking without determinisation. In: Concurrency Theory (CONCUR), pp. 354–367 (2015)Google Scholar
 16.Hasanbeig, M., Abate, A., Kroening, D.: Logicallycorrect reinforcement learning. CoRR, abs/1801.08099v1, January 2018Google Scholar
 17.Hasanbeig, M., Abate, A., Kroening, D.: Certified reinforcement learning with logic guidance. arXiv eprints, arXiv:1902.00778, February 2019
 18.Hiromoto, M., Ushio, T.: Learning an optimal control policy for a Markov decision process under linear temporal logic specifications. In: Symposium Series on Computational Intelligence, pp. 548–555, December 2015Google Scholar
 19.Hordijk, A., Yushkevich, A.A.: Blackwell optimality. In: Feinberg, E.A., Shwartz, A. (eds.) Handbook of Markov Decision Processes: Methods and Applications, pp. 231–267. Springer, Boston (2002). https://doi.org/10.1007/9781461508052_8CrossRefGoogle Scholar
 20.Krishnan, S.C., Puri, A., Brayton, R.K., Varaiya, P.P.: The Rabin index and chain automata, with applications to automata and games. In: Wolper, P. (ed.) CAV 1995. LNCS, vol. 939, pp. 253–266. Springer, Heidelberg (1995). https://doi.org/10.1007/3540600450_55CrossRefGoogle Scholar
 21.Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic realtime systems. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 585–591. Springer, Heidelberg (2011). https://doi.org/10.1007/9783642221101_47CrossRefGoogle Scholar
 22.Lahijanian, M., Andersson, S.B., Belta, C.: Temporal logic motion planning and control with probabilistic satisfaction guarantees. IEEE Trans. Robot. 28(2), 396–409 (2012)CrossRefGoogle Scholar
 23.Li, X., Vasile, C.I., Belta, C.: Reinforcement learning with temporal logic rewards. In: International Conference on Intelligent Robots and Systesm (IROS), pp. 3834–3839 (2017)Google Scholar
 24.Manna, Z., Pnueli, A.: The Temporal Logic of Reactive and Concurrent Systems *Specification*. Springer, New York (1991). https://doi.org/10.1007/9781461209317CrossRefzbMATHGoogle Scholar
 25.Mnih, V., et al.: Humanlevel control through reinforcement learning. Nature 518, 529–533 (2015)CrossRefGoogle Scholar
 26.Mungojerrie\(\omega \)regular reinforcement learning benchmarks (2019). https://plv.colorado.edu/omegaregularrlbenchmarks2019
 27.OpenAI Gym (2018). https://gym.openai.com. Accessed 05 Sept 2018
 28.Perrin, D., Pin, J.É.: Infinite Words: Automata, Semigroups, Logic and Games. Elsevier, Amsterdam (2004)zbMATHGoogle Scholar
 29.Pnueli, A., Zuck, L.: Verification of multiprocess probabilistic protocols. Distrib. Comput. 1, 53–72 (1986)CrossRefGoogle Scholar
 30.Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (1994)CrossRefGoogle Scholar
 31.Riedmiller, M.: Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_32CrossRefGoogle Scholar
 32.Sadigh, D., Kim, E., Coogan, S., Sastry, S.S., Seshia, S.A.: A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. In: IEEE Conference on Decision and Control (CDC), pp. 1091–1096, December 2014Google Scholar
 33.Sickert, S., Esparza, J., Jaax, S., Křetínský, J.: Limitdeterministic Büchi automata for linear temporal logic. In: Chaudhuri, S., Farzan, A. (eds.) CAV 2016. LNCS, vol. 9780, pp. 312–332. Springer, Cham (2016). https://doi.org/10.1007/9783319415406_17CrossRefGoogle Scholar
 34.Sickert, S., Křetínský, J.: MoChiBA: probabilistic LTL model checking using limitdeterministic Büchi automata. In: Artho, C., Legay, A., Peled, D. (eds.) ATVA 2016. LNCS, vol. 9938, pp. 130–137. Springer, Cham (2016). https://doi.org/10.1007/9783319465203_9CrossRefGoogle Scholar
 35.Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016)CrossRefGoogle Scholar
 36.Strehl, A.L., Li, L., Wiewiora, E., Langford, J., Littman, M.L.: PAC modelfree reinforcement learning. In: International Conference on Machine Learning ICML, pp. 881–888 (2006)Google Scholar
 37.Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)zbMATHGoogle Scholar
 38.Thomas, W.: Automata on infinite objects. In: Handbook of Theoretical Computer Science, pp. 133–191. The MIT Press/Elsevier, Cambridge (1990)Google Scholar
 39.Vardi, M.Y.: Automatic verification of probabilistic concurrent finite state programs. In: Foundations of Computer Science, pp. 327–338 (1985)Google Scholar
 40.Wiering, M., van Otterlo, M. (eds.): Reinforcement Learning: State of the Art. Springer, Heidelberg (2012). https://doi.org/10.1007/9783642276453CrossRefGoogle Scholar
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.