Omega-Regular Objectives in Model-Free Reinforcement Learning

We provide the first solution for model-free reinforcement learning of ω-regular objectives for Markov decision processes (MDPs). We present a constructive reduction from the almost-sure satisfaction of ω-regular objectives to an almost-sure reachability problem and extend this technique to learning how to control an unknown model so that the chance of satisfying the objective is maximized. A key feature of our technique is the compilation of ω-regular properties into limit-deterministic Büchi automata instead of the traditional Rabin automata; this choice sidesteps difficulties that have marred previous proposals. Our approach allows us to apply model-free, off-the-shelf reinforcement learning algorithms to compute optimal strategies from the observations of the MDP. We present an experimental evaluation of our technique on benchmark learning problems.


Introduction
Reinforcement learning (RL) [20] is an approach to sequential decision making in which agents rely on reward signals to choose actions aimed at achieving prescribed objectives. Some objectives, like running a maze, are naturally expressed in terms of scalar rewards; in other cases the translation is less obvious. In this paper we solve the problem of ω-regular rewards, that is, the problem of defining scalar rewards for the transitions of a Markov decision process (MDP) so that strategies that maximize the probability of satisfying an ω-regular objective may be computed by off-the-shelf, model-free (a.k.a. direct) RL algorithms.
Omega-regular languages provide a rich formalism to unambiguously express qualitative safety and progress requirements of MDPs [2]. The most common way to describe an ω-regular language is via a formula in Linear Time Logic (LTL); other specification mechanisms include extensions of LTL, various types of automata, and monadic second-order logic. A typical requirement that is naturally expressed as an ω-regular objective prescribes that the agent should eventually control the MDP to stay within a given set of states, while at all times avoiding another set of states. In LTL this would be written (F G goal) ∧ (G ¬trap), where goal and trap are labels attached to the appropriate states, F stands for "finally," and G stands for "globally." For verification or synthesis, an ω-regular objective is usually translated into an automaton that monitors the traces of execution of the MDP [7]. Successful executions cause the automaton to take certain (accepting) transitions infinitely often, and ultimately avoid certain (rejecting) transitions. That is, ω-regular objectives are about the long-term behavior of an MDP; the frequency of reward collected is not what matters. A policy that guarantees no rejecting transitions and an accepting transition every ten steps is better than a policy that promises an accepting transition at each step, but with probability 0.5 does not accept at all.
The problem of ω-regular rewards in the context of model-free RL was first tackled in [18] by translating the objective into a deterministic Rabin automaton and deriving positive and negative rewards directly from the acceptance condition of the automaton. In Section 3 we show that their algorithm and the extension of [11] may fail to find optimal strategies and may underestimate the probability of satisfaction of the objective.
We avoid the problems inherent in the use of deterministic Rabin automata for model-free RL by resorting to limit-deterministic Büchi automata, which, under mild restrictions, were shown by [10] to be suitable for both qualitative and quantitative analysis of MDPs under all ω-regular objectives. The Büchi acceptance condition, which, unlike the Rabin condition, does not involve rejecting transitions, allows us to constructively reduce the almost-sure satisfaction of ω-regular objectives to an almost-sure reachability problem. In addition, it is also suitable for quantitative analysis: the value of a state approximates the maximum probability of satisfaction of the objective from that state as a probability parameter approaches 1.
In this paper we concentrate on model-free approaches and infinitary behaviors for finite MDPs. Related problems include model-based RL [9], RL for finite-horizon objectives [14], and learning for efficient verification [3]. This paper is organized as follows. In Section 2 we introduce definitions and notations. Section 3 motivates our approach by showing the problems that arise when the reward of the RL algorithm is derived from the acceptance condition of a deterministic Rabin automaton. In Section 4 we prove the results on which our approach is based. Finally, Section 5 discusses our experiments.

Preliminaries
An ω-word w on an alphabet Σ is a function w : ℕ → Σ. We abbreviate w(i) by w_i. The set of ω-words on Σ is written Σ^ω, and a subset of Σ^ω is an ω-language on Σ.
A probability distribution over a finite set S is a function d : S → [0, 1] such that Σ_{s∈S} d(s) = 1. We write D(S) for the set of probability distributions over S, and supp(d) = {s ∈ S : d(s) > 0} for the support of d.

Markov Decision Processes
A Markov decision process M is a tuple (S, A, T, AP, L) where S is a finite set of states, A is a finite set of actions, T : S × A ⇀ D(S) is the probabilistic transition (partial) function, AP is the set of atomic propositions, and L : S → 2^AP is the proposition labeling function.
For any state s ∈ S, we let A(s) denote the set of actions available in s. For states s, s′ ∈ S and a ∈ A(s), we write p(s′|s, a) for T(s, a)(s′). A run of M is an ω-word s_0, a_1, s_1, . . . ∈ S × (A × S)^ω such that p(s_{i+1}|s_i, a_{i+1}) > 0 for all i ≥ 0. A finite run is a finite such sequence. For a run r = s_0, a_1, s_1, . . . we define the corresponding labeled run as L(r) = L(s_0), L(s_1), . . . ∈ (2^AP)^ω. We write Runs^M (Runs^M_f) for the set of runs (finite runs) of M and Runs^M(s) (Runs^M_f(s)) for the set of runs (finite runs) of M starting from state s. For a finite run r we write last(r) for the last state of the sequence.
A strategy in M is a function σ : Runs^M_f → D(A) such that supp(σ(r)) ⊆ A(last(r)). Let Runs^M_σ(s) denote the subset of runs in Runs^M(s) that correspond to strategy σ with initial state s. Let Σ_M be the set of all strategies. We say that a strategy σ is pure if σ(r) is a point distribution for all runs r ∈ Runs^M_f, and that σ is stationary if last(r) = last(r′) implies σ(r) = σ(r′) for all finite runs r, r′ ∈ Runs^M_f. A strategy that is not pure is mixed. A strategy is positional if it is both pure and stationary.
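As a concrete illustration of these definitions, an MDP and a positional strategy can be encoded directly; the following Python sketch (all state, action, and label names are invented for illustration) samples a finite run and its labeled run:

```python
import random

# A tiny MDP in the style above: T maps (state, action) to a successor
# distribution; L is the proposition labeling function.
T = {
    ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
    ("s0", "rest"): {"s0": 1.0},
    ("s1", "rest"): {"s1": 1.0},
}
L = {"s0": set(), "s1": {"goal"}}

def available_actions(s):
    # A(s): the actions enabled in state s
    return [a for (t, a) in T if t == s]

def sample_run(sigma, s, steps):
    """Sample a finite run s0 a1 s1 ... under positional strategy sigma."""
    run = [s]
    for _ in range(steps):
        a = sigma[run[-1]]                 # positional: depends on last state only
        dist = T[(run[-1], a)]
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        run += [a, nxt]
    return run

sigma = {"s0": "go", "s1": "rest"}         # a pure, stationary (positional) strategy
run = sample_run(sigma, "s0", 20)
labeled = [sorted(L[s]) for s in run[::2]] # the labeled run L(s0) L(s1) ...
```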
The behavior of an MDP M under a strategy σ is defined on a probability space (Runs^M_σ(s), F_{Runs^M_σ(s)}, Pr^σ_s) over the set of infinite runs of σ with starting state s. Given a real-valued random variable over the set of infinite runs f : Runs^M → ℝ, we denote by E^σ_s{f} the expectation of f over the runs of M originating at s that follow strategy σ.
A rewardful MDP is a pair (M, ρ), where M is an MDP and ρ : S × A → ℝ is a reward function assigning utility to state-action pairs. A rewardful MDP (M, ρ) under a strategy σ determines a sequence of random rewards ρ(X_{i−1}, Y_i), i ≥ 1, where X_i and Y_i are the random variables denoting the i-th state and action, respectively. Depending upon the problem of interest, a number of performance objectives have been proposed in the literature:

Reachability Probability. For a target set T ⊆ S, PReach_T(s, σ) = Pr^σ_s{r ∈ Runs^M_σ(s) : r visits T}.
Discounted Reward. For a discount factor λ ∈ [0, 1[, EDisct_λ(s, σ) = E^σ_s{lim_{N→∞} Σ_{1≤i≤N} λ^i ρ(X_{i−1}, Y_i)}.
Average Reward. EAvg(s, σ) = E^σ_s{limsup_{N→∞} (1/N) Σ_{1≤i≤N} ρ(X_{i−1}, Y_i)}.
For an objective Cost ∈ {PReach_T, EDisct_λ, EAvg} and an initial state s we define the optimal cost Cost^*(s) as sup_{σ∈Σ_M} Cost(s, σ). A strategy σ of M is optimal for the objective Cost if Cost(s, σ) = Cost^*(s) for all s ∈ S. For an MDP the optimal cost and optimal strategies can be computed in polynomial time [17].
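When the transition function is known, the optimal discounted cost can be computed by standard dynamic programming, e.g., value iteration; a minimal sketch using a dictionary encoding of T (names and parameters are illustrative):

```python
def value_iteration(T, reward, gamma=0.99, iters=1000):
    """Lambda-discounted optimal values by value iteration; T maps
    (state, action) to a successor distribution."""
    states = {s for (s, _) in T}
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Bellman optimality update: max over enabled actions
        V = {s: max(reward(s, a) + gamma * sum(p * V[s2] for s2, p in dist.items())
                    for (s1, a), dist in T.items() if s1 == s)
             for s in states}
    return V

# A one-state MDP with a rewarding self-loop: the optimal discounted
# value approaches 1 / (1 - gamma) = 100.
T = {("x", "loop"): {"x": 1.0}}
V = value_iteration(T, lambda s, a: 1.0)
```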

ω-Regular Performance Objectives
A nondeterministic ω-regular automaton is a tuple A = (Σ, Q, q_0, δ, Acc), where Σ is a finite alphabet, Q is a finite set of states, q_0 ∈ Q is the initial state, δ : Q × Σ → 2^Q is the transition function, and Acc is the acceptance condition, to be discussed presently.
A run r of A on w ∈ Σ^ω is an ω-word r_0, w_0, r_1, w_1, . . . in (Q ∪ Σ)^ω such that r_0 = q_0 and r_{i+1} ∈ δ(r_i, w_i) for all i ≥ 0. We consider two types of acceptance conditions, Büchi and Rabin, that depend on the transitions that occur infinitely often in a run of an automaton. A Büchi (Rabin) automaton is an ω-automaton equipped with a Büchi (Rabin) acceptance condition.
We write inf(r) for the set of transitions that appear infinitely often in the run r. A Büchi acceptance condition is given by a set F ⊆ Q × Σ × Q of accepting transitions and defines the set of accepting runs Acc = {r ∈ (Q ∪ Σ)^ω : inf(r) ∩ F ≠ ∅}. A Rabin acceptance condition is given by pairs of transition sets (B_1, G_1), . . . , (B_k, G_k) and defines Acc = {r ∈ (Q ∪ Σ)^ω : ∃ 1 ≤ i ≤ k . inf(r) ∩ B_i = ∅ and inf(r) ∩ G_i ≠ ∅}. The index of a Rabin condition is its number of pairs.
A run r of A is accepting if r ∈ Acc. The language of A (or, accepted by A) is the subset of words in Σ ω that have accepting runs in A. A language is ω-regular if it is accepted by an ω-regular automaton.
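Both acceptance conditions reduce to simple checks on inf(r) once it is known; a small sketch, with transitions encoded as (source, letter, target) triples (the particular sets below are illustrative):

```python
def buchi_accepts(inf_r, F):
    """Buchi condition: inf(r) contains some accepting transition in F."""
    return bool(inf_r & F)

def rabin_accepts(inf_r, pairs):
    """Rabin condition: some pair (B_i, G_i) has inf(r) disjoint from
    B_i and intersecting G_i."""
    return any(not (inf_r & B) and bool(inf_r & G) for B, G in pairs)

# A run looping forever on an accepting self-loop of state q0:
inf_r = {("q0", "a", "q0")}
```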
Given an MDP M and an ω-regular objective ϕ given as an ω-regular automaton A_ϕ = (Σ, Q, q_0, δ, Acc), we are interested in computing an optimal strategy satisfying the objective. We define the satisfaction probability of a strategy σ from initial state s as PSat^ϕ(s, σ) = Pr^σ_s{r ∈ Runs^M_σ(s) : L(r) is accepted by A_ϕ}. The optimal probability of satisfaction PSat^* and the corresponding optimal strategies are defined in a manner analogous to the other performance objectives.

Deterministic Rabin and Büchi Automata
A word in Σ ω has exactly one run in a deterministic, complete automaton. We use common three-letter abbreviations to distinguish types of automata. The first (D or N) tells whether the automaton is deterministic; the second denotes the acceptance condition (B for Büchi and R for Rabin). The third letter refers to the type of objects read by the automaton (here, we only use W for ω-words). For example, an NBW is a nondeterministic Büchi automaton, and a DRW is a deterministic Rabin automaton.
Every ω-regular language is accepted by some DRW and by some NBW. In contrast, there are ω-regular languages that are not accepted by any DBW. The Rabin index of a Rabin automaton is the index of its acceptance condition. The Rabin index of an ω-regular language L is the minimum index among those of the DRWs that accept L. For each n ∈ ℕ there exist ω-regular languages of Rabin index n. The languages accepted by DBWs, however, form a proper subset of the languages of index 1.
Given an MDP M = (S, A, T, AP, L) with a designated initial state s_0 ∈ S, and a deterministic ω-automaton A = (2^AP, Q, q_0, δ, Acc), the product M × A is the tuple (S × Q, (s_0, q_0), A, T^×, Acc^×). The probabilistic transition function T^× : (S × Q) × A ⇀ D(S × Q) is such that T^×((s, q), a)((s′, q′)) equals T(s, a)(s′) if q′ = δ(q, L(s′)) and 0 otherwise; the acceptance condition Acc^× is obtained by lifting Acc from the transitions of A to the corresponding transitions of the product. A sub-MDP of M is an MDP M′ = (S′, A′, T′, AP, L′), where S′ ⊆ S, A′ ⊆ A is such that A′(s) ⊆ A(s) for every s ∈ S′, and T′ and L′ are analogous to T and L when restricted to S′ and A′. In particular, M′ is closed under probabilistic transitions, i.e., for all s ∈ S′ and a ∈ A′ we have that T(s, a)(s′) > 0 implies that s′ ∈ S′. An end-component [7] of an MDP M is a sub-MDP M′ of M such that its underlying graph G_{M′} is strongly connected. A maximal end-component is an end-component that is maximal under set inclusion. Every state s of an MDP M belongs to at most one maximal end-component of M.
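For a deterministic automaton, the product construction pairs MDP and automaton states and advances the automaton on the label of the successor MDP state; a sketch under a dictionary encoding (δ is assumed total here, and all names are illustrative):

```python
def product(T, L, delta, s0, q0):
    """Product M x A for a deterministic automaton A: states are
    (s, q) pairs; on reading successor state s2, the automaton moves to
    delta[(q, frozenset(L[s2]))]. Only reachable pairs are built."""
    Tx = {}
    frontier, seen = [(s0, q0)], {(s0, q0)}
    while frontier:
        s, q = frontier.pop()
        for (s1, a), dist in T.items():
            if s1 != s:
                continue
            succ = {}
            for s2, p in dist.items():
                q2 = delta[(q, frozenset(L[s2]))]
                succ[(s2, q2)] = succ.get((s2, q2), 0.0) + p
                if (s2, q2) not in seen:
                    seen.add((s2, q2))
                    frontier.append((s2, q2))
            Tx[((s, q), a)] = succ
    return Tx

# Illustrative two-state MDP and a two-state automaton tracking label g.
T = {("s0", "go"): {"s1": 1.0}, ("s1", "rest"): {"s1": 1.0}}
L = {"s0": set(), "s1": {"g"}}
delta = {("q0", frozenset()): "q0", ("q0", frozenset({"g"})): "q1",
         ("q1", frozenset()): "q1", ("q1", frozenset({"g"})): "q1"}
Tx = product(T, L, delta, "s0", "q0")
```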
Theorem 1 (End-Component Properties [7]). Once an end-component C of an MDP is entered, there is a strategy that visits every state-action combination in C with probability 1 and stays in C forever. Moreover, for every strategy an end-component is visited with probability 1.
End components and runs are defined for products just like for MDPs. A run of M × A is accepting if it satisfies the product's acceptance condition. An accepting end component of M × A is an end component such that every run of the product MDP that eventually dwells in it is accepting.
In view of Theorem 1, satisfaction of an ω-regular objective ϕ by an MDP M can be formulated in terms of the accepting end components of the product M×A ϕ , where A ϕ is an automaton accepting ϕ. The maximum probability of satisfaction of ϕ by M is the maximum probability, over all strategies, that a run of the product M × A ϕ eventually dwells in one of its accepting end components.
It is customary to use DRWs instead of DBWs in the construction of the product, because the latter cannot express all ω-regular objectives. On the other hand, general NBWs are not used since causal strategies cannot optimally resolve nondeterministic choices because that requires access to future events [21].

Limit-Deterministic Büchi Automata
In spite of the large gap between DRWs and DBWs in terms of indices, even a very restricted form of nondeterminism is sufficient to make Büchi automata as expressive as DRWs: in a limit-deterministic Büchi automaton (LDBW), broadly speaking, the automaton behaves deterministically once it has seen an accepting transition.
LDBWs are as expressive as general NBWs. Moreover, NBWs can be translated into LDBWs that can be used for the qualitative and quantitative analysis of MDPs [21,5,10,19]. We use the translation from [10], which produces LDBWs that consist of two parts: an initial deterministic automaton (without accepting transitions) obtained by a subset construction, and a final part produced by a breakpoint construction. They are connected by a single "guess", where the automaton guesses a subset of the reachable states to start the breakpoint construction. As with other constructions (e.g., [19]), the resulting automata can be composed with an MDP, such that optimal control of the product defines a control of the MDP that maximizes the probability of obtaining a word from the language of the automaton. We refer to LDBWs with this property as suitable limit-deterministic automata (SLDBWs) (cf. Theorem 2 for details).

Linear Time Logic Objectives
LTL (Linear Time Logic) is a temporal logic whose formulae describe a subset of the ω-regular languages; it is often used to specify objectives in human-readable form. Translations exist from LTL to various forms of automata, including NBW, DRW, and SLDBW. Given a set of atomic propositions AP, a is an LTL formula for each a ∈ AP. Moreover, if ϕ and ψ are LTL formulae, so are ¬ϕ, ϕ ∨ ψ, X ϕ, and ψ U ϕ.
Additional operators are defined as abbreviations: ϕ ∧ ψ = ¬(¬ϕ ∨ ¬ψ), F ϕ = true U ϕ, and G ϕ = ¬F ¬ϕ. We write w |= ϕ if the ω-word w over 2^AP satisfies the LTL formula ϕ. The satisfaction relation is defined inductively. Let w^j be the ω-word defined by w^j_i = w_{i+j}; let a be an atomic proposition; and let ϕ, ψ be LTL formulae. Then
• w |= a if and only if a ∈ w_0;
• w |= ¬ϕ if and only if w ⊭ ϕ;
• w |= ϕ ∨ ψ if and only if w |= ϕ or w |= ψ;
• w |= X ϕ if and only if w^1 |= ϕ;
• w |= ψ U ϕ if and only if there exists i ≥ 0 such that w^i |= ϕ and, for 0 ≤ j < i, w^j |= ψ.
Further details may be found in [15].
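The semantics above can be checked mechanically on ultimately periodic words u·v^ω, since such a word has only finitely many distinct suffixes; the following sketch (formula encoding and helper names are our own) evaluates, among others, the introduction's example objective (F G goal) ∧ (G ¬trap):

```python
# LTL formulas as tagged tuples; letters of the word are sets of propositions.
def Ap(a):     return ("ap", a)
def Not(p):    return ("not", p)
def Or(p, q):  return ("or", p, q)
def And(p, q): return Not(Or(Not(p), Not(q)))
def X(p):      return ("X", p)
def U(p, q):   return ("U", p, q)   # p U q
TRUE = Or(Ap("_t"), Not(Ap("_t")))
def F(p):      return U(TRUE, p)    # F p = true U p
def G(p):      return Not(F(Not(p)))

def models(u, v, phi):
    """Decide u . v^omega |= phi. Positions 0 .. len(u)+len(v)-1 cover all
    distinct suffixes; succ wraps the last position back into the loop."""
    n = len(u) + len(v)
    word = u + v
    succ = lambda i: i + 1 if i + 1 < n else len(u)
    def sat(f, i):
        op = f[0]
        if op == "ap":  return f[1] in word[i]
        if op == "not": return not sat(f[1], i)
        if op == "or":  return sat(f[1], i) or sat(f[2], i)
        if op == "X":   return sat(f[1], succ(i))
        # ("U", psi, phi): scan the eventually-cyclic chain of suffixes
        j, visited = i, set()
        while j not in visited:
            visited.add(j)
            if sat(f[2], j):
                return True
            if not sat(f[1], j):
                return False
            j = succ(j)
        return False    # cycled through all suffixes without seeing phi
    return sat(phi, 0)

# The introduction's objective: eventually always goal, while never trap.
spec = And(F(G(Ap("goal"))), G(Not(Ap("trap"))))
```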

Reinforcement Learning
For a given MDP and a performance objective Cost ∈ {PReach_T, EDisct_λ, EAvg}, the optimal cost and an optimal strategy can be computed in polynomial time using value iteration, policy iteration, or linear programming [17]. For ω-regular objectives (given as DRW, SLDBW, or LTL formulae), optimal satisfaction probabilities and strategies can be computed using graph-theoretic techniques (computing accepting end-components and then maximizing the probability of reaching states in such components) over the product structure. However, when the MDP transition/reward structure is unknown, such techniques are not applicable. For MDPs with unknown transition/reward structure, reinforcement learning [20] provides a framework to learn optimal strategies from repeated interactions with the environment. There are two main approaches to reinforcement learning in MDPs: model-free (direct) approaches and model-based (indirect) approaches. In a model-based approach, the learner interacts with the system to first estimate the transition probabilities and corresponding rewards, and then uses standard MDP algorithms to compute the optimal cost and strategies. On the other hand, in a model-free approach, such as Q-learning, the learner computes optimal strategies without explicitly estimating the transition probabilities and rewards. We focus on model-free RL to learn a strategy that maximizes the probability of satisfying a given ω-regular objective.

Problem Statement and Motivation
The problem we address in this paper is the following: Given an MDP M with unknown transition structure and an ω-regular objective ϕ, compute a strategy that maximizes the probability of M satisfying ϕ.
To apply model-free RL algorithms to this task, one needs to define rewards that depend on the observations of the MDP and reflect the satisfaction of the objective. It is natural to use the product of the MDP and an automaton monitoring the satisfaction of the objective to assign suitable rewards to various actions chosen by the learning algorithm.
Sadigh et al. [18] were the first to consider model-free RL for a qualitative version of this problem, i.e., to learn a strategy that satisfies the property with probability 1. For an MDP M and a DRW A_ϕ of index k, they formed the product MDP M × A_ϕ with k different "Rabin" reward functions ρ_1, . . . , ρ_k. The function ρ_i corresponds to the Rabin pair (B^×_i, G^×_i) and is defined, for R^+, R^− ∈ ℝ_{≥0}, to assign reward −R^− to transitions in B^×_i, reward R^+ to transitions in G^×_i, and reward 0 to all other transitions. [18] claimed that if there exists a strategy satisfying an ω-regular objective ϕ with probability 1, then there exist a Rabin pair i, a discount factor λ^* ∈ [0, 1[, and a suitably high ratio R^*, such that for all λ ∈ [λ^*, 1[ and R^−/R^+ ≥ R^*, any strategy maximizing λ-discounted reward for the MDP (M × A_ϕ, ρ_i) also satisfies the ω-regular objective ϕ with probability 1. By the Blackwell-optimality theorem [12], a paraphrase of this claim is that if there exists a strategy satisfying an ω-regular objective ϕ with probability 1, then there exist a Rabin pair i and a suitably high ratio R^*, such that for all R^−/R^+ ≥ R^*, any strategy maximizing expected average reward for the MDP (M × A_ϕ, ρ_i) also satisfies ϕ with probability 1. There are two flaws with this approach.
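In code, the Rabin reward scheme recalled above amounts to a case split on product transitions; a sketch (the treatment of a transition lying in both sets is our own choice, not prescribed by the source):

```python
def rabin_reward(transition, B_i, G_i, R_plus, R_minus):
    """Reward scheme of [18] for Rabin pair (B_i, G_i): penalize
    transitions in B_i, reward transitions in G_i, 0 elsewhere.
    B_i is checked first here; the tie-break for a transition in
    both sets is an assumption of this sketch."""
    if transition in B_i:
        return -R_minus
    if transition in G_i:
        return R_plus
    return 0.0
```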
1. We provide in Example 1 an MDP and an ω-regular objective ϕ with Rabin index 2, such that, although there is a strategy that satisfies the property with probability 1, optimal average strategies corresponding to any Rabin reward do not satisfy the objective with probability 1.

2. Even for an ω-regular objective with one Rabin pair (B, G) and B = ∅, i.e., one that can be specified by a DBW, we demonstrate in Example 2 that the problem of finding a strategy that satisfies the property with probability 1 may not be reduced to finding optimal expected average strategies.
Example 1 (Two Rabin Pairs). Consider the MDP given by the simple grid-world example shown in Figure 1. Each cell (state) of the MDP is labeled with the atomic propositions that are true there. In each cell, there is a choice between two actions rest and go.
With action rest the state of the MDP does not change. However, with action go the MDP moves to the other cell in the same row with probability p, or to the other cell in the same column with probability 1−p. The initial cell is (0, 0). The specification is given by an LTL formula ϕ; a DRW that accepts ϕ is shown in Figure 1. The DRW has two accepting pairs: (B_0, G_0) and (B_1, G_1). The table beside the automaton gives, for each transition, its label and the B and G sets to which it belongs. The optimal strategy that satisfies the objective ϕ with probability 1 is to choose go in the first cell (0, 0) and to choose rest subsequently, no matter which state is reached. However, notice that for both Rabin pairs, the optimal strategy for expected average reward (or, analogously, optimal discounted strategies for sufficiently large discount factors λ) is to maximize the probability of reaching one of the states ((0, 1), safe) or ((1, 0), safe) of the product and stay there forever. For the first accepting pair the maximum probability of satisfaction is 1/(2−p), while for the second pair it is 1/(1+p).

Example 2 (DBW to Expected Average Reward Reduction). This counterexample demonstrates that even for deterministic Büchi objectives, the problem of finding an optimal strategy satisfying an objective may not be reduced to the problem of finding an optimal average strategy. Consider the simple grid-world example of Figure 2 with the specification ϕ = (G ¬b) ∧ (G F g), where atomic proposition b (blue) labels Cell 1 and atomic proposition g (green) labels Cells 2 and 3. Actions enabled in various cells and their probabilities are depicted in the figure.
The strategy from Cell 0 that chooses Action a guarantees satisfaction of ϕ with probability 1. An automaton with accepting transitions for ϕ is shown in Figure 2; it is a DBW (or equivalently a DRW with one pair (B, G) and B = ∅).
The product MDP is shown at the bottom of Figure 2. All states whose second component is trap have been merged. Notice that there is no negative reward since the set B is empty. If reward is positive and equal for all accepting transitions, and 0 for all other transitions, then when p > 1/2, the strategy that maximizes expected average reward chooses Action b in the initial state and Action e from State (2, safe). Note that for large values of λ the optimal expected average reward strategies are also optimal strategies for the λ-discounted reward objective. However, these strategies are not optimal for ω-regular objectives.
Example 1 shows that one cannot select a pair from a Rabin acceptance condition ahead of time. This problem can be avoided by the use of Büchi acceptance conditions. While DBWs are not sufficiently expressive, SLDBWs express all ω-regular properties and are suitable for probabilistic model checking. In the next section, we show that they are also "the ticket" for model-free reinforcement learning, because they allow us to maximize the probability of satisfying an ω-regular specification by solving a reachability problem, which off-the-shelf RL algorithms handle efficiently.

Learning from Omega-Regular Rewards
Throughout this section, let us fix an MDP M, an ω-regular property ϕ, and a corresponding SLDBW A.
A Markov chain is an MDP whose set of actions is a singleton. A bottom strongly connected component (BSCC) of a Markov chain is any of its end-components. A BSCC is accepting if it contains an accepting transition and otherwise it is a rejecting BSCC. For any MDP M and positional strategy σ, let M σ be the Markov chain resulting from resolving the nondeterminism in M using σ.
We start by recalling two key properties of SLDBWs.

Theorem 2 (SLDBW properties [10]). (1) For every strategy σ of M × A, the probability that a run of (M × A)_σ is accepting is at most the probability that the corresponding run of M satisfies ϕ. (2) There exists a strategy of M × A under which the probability that a run of the product is accepting equals the maximal probability of satisfying ϕ in M.

Similar observations can also be found in other work, e.g., [21,5,19]. While the proof of (2) is involved, (1) follows immediately from the fact that any run in M induced by an accepting run of M × A satisfies ϕ (while the converse does not necessarily hold); this holds for all nondeterministic automata, regardless of the acceptance mechanism.

For a parameter ζ ∈ ]0, 1[, the augmented MDP M^ζ is obtained from M × A by adding a fresh absorbing target state t and redirecting every accepting transition of the product to t with probability 1−ζ; with the remaining probability ζ the transition behaves as in M × A. With a slight abuse of notation, if σ is a strategy on the augmented MDP M^ζ, we denote by σ also the strategy on M × A obtained by removing t from the domain of σ.
We let p^σ_s(ζ) denote the probability of reaching t in M^ζ_σ when starting at s. Notice that we can encode this value as the expected average reward in the rewardful MDP (M^ζ, ρ) where we set ρ(t, a) = 1 for all a ∈ A and ρ(s, a) = 0 otherwise. For any strategy σ, p^σ_s(ζ) and the average reward of σ from s in (M^ζ, ρ) coincide.
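The augmentation and its reachability reward can be sketched as follows, reusing a dictionary encoding of the product (acc holds the accepting transitions; all names are illustrative):

```python
def augment(Tx, acc, zeta):
    """Build M^zeta from the product MDP: each accepting transition
    (s, a, s') is redirected to a fresh absorbing target t with
    probability 1 - zeta, and taken as usual with probability zeta."""
    Tz = {}
    for (s, a), dist in Tx.items():
        succ = {}
        for s2, p in dist.items():
            if (s, a, s2) in acc:
                succ["t"] = succ.get("t", 0.0) + p * (1 - zeta)
                succ[s2] = succ.get(s2, 0.0) + p * zeta
            else:
                succ[s2] = succ.get(s2, 0.0) + p
        Tz[(s, a)] = succ
    Tz[("t", "loop")] = {"t": 1.0}   # t is absorbing
    return Tz

def rho(s, a):
    # Average-reward encoding of reaching t: reward 1 only in t.
    return 1.0 if s == "t" else 0.0

# One product state with an accepting self-loop, zeta = 0.9.
Tz = augment({("x", "a"): {"x": 1.0}}, {("x", "a", "x")}, 0.9)
```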
We also let a^σ_s be the probability that a run starting from s in (M × A)_σ is accepting. We now show the following basic properties of these values.

Lemma 1. If σ is a positional strategy on M^ζ, then, for every state s of (M × A)_σ, the following holds:
1. If s is in a rejecting BSCC of (M × A)_σ, then p^σ_s(ζ) = 0.
2. If s is in an accepting BSCC of (M × A)_σ, then p^σ_s(ζ) = 1.
3. a^σ_s ≤ p^σ_s(ζ).

Lemma 2. Let σ be a positional strategy on M^ζ and let f^σ_s be the expected number of accepting transitions taken in M^ζ_σ from s before t or a BSCC of (M × A)_σ is reached. Then a^σ_s ≥ p^σ_s(ζ) − (1 − ζ) · f^σ_s.

Proof. The probability of reaching a rejecting BSCC in (M × A)_σ is at most the probability of reaching a rejecting BSCC in M^ζ_σ, which is at most 1 − p^σ_s(ζ), plus the probability of moving on to t from a state that is not in any BSCC of (M × A)_σ, which we are going to show next is at most f^σ_s · (1 − ζ). First, a proof by induction shows that 1 − ζ^k ≤ k(1 − ζ) for all k ≥ 0. Let P^σ_s(ζ, k) be the probability of generating a path from s with k accepting transitions before t or a state in some BSCC of (M × A)_σ is reached in M^ζ_σ. The probability of seeing k accepting transitions and not reaching t is at least ζ^k. Therefore, the probability of moving to t from a state not in any BSCC is at most Σ_k P^σ_s(ζ, k) · (1 − ζ^k) ≤ (1 − ζ) · Σ_k k · P^σ_s(ζ, k) = (1 − ζ) · f^σ_s. ∎

This provides us with our main theorem.
Theorem 3. There exists a threshold ζ′ ∈ ]0, 1[ such that, for all ζ > ζ′ and every state s, any strategy σ that maximizes p^σ_s(ζ) in M^ζ is (1) an optimal strategy in M × A from s and (2) induces an optimal strategy for the original MDP M from s with objective ϕ.
Proof. We use the fact that it suffices to study positional strategies, of which there are only finitely many. Let σ_1 be an optimal strategy of M × A, and let σ_2 be a strategy that has the highest probability of producing an accepting run among all non-optimal positional strategies. (If σ_2 does not exist, then all strategies are equally good, and it does not matter which one is chosen.) Let δ = a^{σ_1}_s − a^{σ_2}_s. Let f_max = max_σ max_s f^σ_s, where σ ranges over positional strategies only and f^σ_s is defined as in Lemma 2. We claim that it suffices to pick ζ′ ∈ ]0, 1[ such that (1 − ζ′) · f_max < δ. Suppose that σ is a positional strategy that is optimal in M^ζ for ζ > ζ′, but is not optimal in M × A. We then have a^σ_s ≤ p^σ_s(ζ) ≤ a^σ_s + (1 − ζ)f^σ_s < a^σ_s + δ ≤ a^{σ_1}_s ≤ p^{σ_1}_s(ζ), where these inequalities follow, respectively, from: Lemma 1(3); Lemma 2; the definition of ζ′; the definition of δ together with the assumption that σ is not optimal; and again Lemma 1(3). This shows that p^σ_s(ζ) < p^{σ_1}_s(ζ), i.e., σ is not optimal in M^ζ; a contradiction. Therefore, any positional strategy that is optimal in M^ζ for ζ > ζ′ is also optimal in M × A. Now, suppose that σ is a positional strategy that is optimal in M × A. Theorem 2(1) then implies that the probability of satisfying ϕ in M when starting at s is at least a^σ_s. At the same time, if there were a strategy for which the probability of satisfying ϕ in M exceeded a^σ_s, then Theorem 2(2) would guarantee the existence of a strategy σ′ with a^{σ′}_s > a^σ_s, contradicting the assumption that σ is optimal. Therefore, any positional strategy that is optimal in M × A induces an optimal strategy in M with objective ϕ. ∎

Experimental Results
We implemented the construction described in the previous sections in a tool named MUNGOJERRIE [8], which reads MDPs described in the PRISM language [13] and ω-regular automata written in the HOA format [1,6]. MUNGOJERRIE builds the product M^ζ, provides an interface for RL algorithms akin to that of [4], and supports probabilistic model checking. Our algorithm computes, for each pair (s, a) of state and action, the maximum probability of satisfying the given objective after choosing action a from state s, by using off-the-shelf temporal-difference algorithms. Not all actions with maximum probability are part of positional optimal strategies: consider a product MDP with one state and two actions, a and b, such that a enables an accepting self-loop and b enables a non-accepting one; both state/action pairs are assigned probability 1.
Since the probability values alone do not identify a pure optimal strategy, MUNGOJERRIE computes an optimal mixed strategy, choosing uniformly at random among all maximum-probability actions from a state.
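The learning step itself can be any off-the-shelf temporal-difference method; the following is a minimal tabular Q-learning sketch over the augmented product (hyperparameters, the simulator encoding, and the toy example are all illustrative assumptions, not MUNGOJERRIE's actual implementation):

```python
import random

def q_learning(Tz, rho, states, actions, episodes=2000, alpha=0.1,
               gamma=0.99, eps=0.1, horizon=100):
    """Tabular Q-learning over the augmented product M^zeta.
    Tz maps (state, action) to a successor distribution; actions(s)
    lists the enabled actions."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):
            acts = actions(s)
            if random.random() < eps:            # epsilon-greedy exploration
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda b: Q[(s, b)])
            dist = Tz[(s, a)]
            s2 = random.choices(list(dist), weights=list(dist.values()))[0]
            best = max(Q[(s2, b)] for b in actions(s2))
            Q[(s, a)] += alpha * (rho(s, a) + gamma * best - Q[(s, a)])
            s = s2
    return Q

def mixed_strategy(Q, s, actions, tol=1e-6):
    """Mix uniformly over all maximizing actions, as described above."""
    best = max(Q[(s, a)] for a in actions(s))
    return sorted(a for a in actions(s) if Q[(s, a)] >= best - tol)

# Toy augmented product: from x, action a reaches the target t surely,
# while action b self-loops forever; the learner should prefer a.
Tz = {("x", "a"): {"t": 1.0}, ("x", "b"): {"x": 1.0}, ("t", "loop"): {"t": 1.0}}
acts = {"x": ["a", "b"], "t": ["loop"]}
Q = q_learning(Tz, lambda s, a: 1.0 if s == "t" else 0.0,
               ["x", "t"], lambda s: acts[s])
```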
The MDPs on which we tested our algorithms are listed in Table 1. For each model, the number of states in the MDP and in the automaton is given, together with the maximum probability of satisfaction of the objective, as computed by the RL algorithm and confirmed by the model checker (which has full access to the MDP). Models twoPairs and riskReward are from Examples 1 and 2, respectively. Models grid5x5 and trafficNtk are from [18]. The three "windy" MDPs are taken from [20]. The "frozen" examples are from [16]. Some ω-regular objectives are simple reachability requirements (e.g., frozenSmall and frozenLarge). The objective for anothergrid is to collect three types of coupons, while incurring at most one of two types of penalties. The objective for slalom is given by the LTL formula G(p → X G ¬q) ∧ G(q → X G ¬p).

Figure 3 illustrates how increasing the parameter ζ makes the RL algorithm less sensitive to the presence of transient accepting transitions. Model deferred consists of two chains of states: one, which the agent chooses with action a, has accepting transitions throughout, but leads to an end component that is not accepting; the other, selected with action b, has no accepting transitions outside of the accepting end component to which it leads. As ζ increases, the learned strategies tend toward those for which the likelihood of satisfying ϕ can be expected to grow. This is important, as it comes with the promise of a generally increasing quality of intermediate strategies.