1 Introduction

We consider discrete-time Markov decision processes (MDPs) with finite state and action spaces, and two different types of objectives for the decision maker: reachability objectives and safety objectives. The decision maker is said to have a reachability objective, if his goal is to reach a specific state of the MDP with the highest possible probability, and a safety objective, if his goal is the opposite: to avoid a specific state of the MDP with the highest possible probability. Both objectives are standard and have been analyzed extensively in the literature, but they are quite different in nature (see, e.g., [1,2,3]).

An important question is on which time horizon the decision maker evaluates his strategies. On any given finite horizon, backward induction guarantees that the decision maker has a pure optimal strategy. This optimal strategy can depend heavily on the horizon, and generally there is no strategy that is optimal on all finite horizons. On the infinite horizon, the decision maker has a pure stationary optimal strategy (cf. [4, 5]).

In this paper, instead of considering a fixed horizon, we propose to evaluate strategies by how they perform on all long but finite horizons. In particular, such an evaluation can be meaningful, if the decision maker knows that the decision process will last for many periods, but he has no information on its exact length. In the case of reachability objectives, such an evaluation may also reflect the attitude of a decision maker who is patient and can wait for many periods to reach the target state.

More precisely, when the decision maker has a reachability objective with target state \(s^*\), we say that a strategy \(\sigma \) overtakes another strategy \(\sigma '\), if there exists \(T\in {\mathbb {N}}\) such that, on all finite horizons \(t\ge T\), the probability of having visited the state \(s^*\) within horizon t is strictly larger under \(\sigma \) than under \(\sigma '\). Thus, conditionally on the MDP lasting at least T periods, \(\sigma \) performs better than \(\sigma '\) regardless of the horizon, and consequently the decision maker should prefer \(\sigma \) to \(\sigma '\). When the decision maker has a safety objective and wants to avoid a state \(s^*\), we say that a strategy \(\sigma \) overtakes another strategy \(\sigma '\), if there exists \(T\in {\mathbb {N}}\) such that, on all finite horizons \(t\ge T\), the probability of having visited the state \(s^*\) within horizon t is strictly smaller under \(\sigma \) than under \(\sigma '\).

We also define a more permissive version of the aforementioned relations between strategies. For reachability objectives, we say that a strategy \(\sigma \) weakly overtakes another strategy \(\sigma '\), if there exists \(T\in {\mathbb {N}}\) such that, on all finite horizons \(t\ge T\), the probability of having visited the target state \(s^*\) within horizon t under \(\sigma \) is at least as much as that under \(\sigma '\), but strictly more for infinitely many horizons t. The definition is analogous for safety objectives.

Under these comparisons of strategies, we call a strategy overtaking optimal, if it is not overtaken by any other strategy and call it strongly overtaking optimal, if it is not weakly overtaken by any other strategy. Strong overtaking optimality is a strict refinement of overtaking optimality, and as an appealing property, they are both strict refinements of optimality on the infinite horizon.

Our contribution. For reachability objectives, we obtain the following results, sorted by the attributes of the MDP. (I.1) We prove that if the MDP is such that each action can lead to at most one non-target state with a positive probability, then there exists a pure stationary strategy that is not weakly overtaken by any pure strategy. This is Theorem 4.1. We show with Example 4.2 that such a statement does not hold for all MDPs. (I.2) We prove by means of Example 4.1 that an overtaking optimal strategy does not always exist. This MDP is however constructed in a very specific way and is non-generic. (I.3) We consider MDPs that are generic, in the sense that the transition probabilities are randomized using any non-trivial joint density function. We show for these MDPs that there exists a pure stationary strategy that overtakes each other stationary strategy. This is Theorem 5.1. (I.4) We present sufficient conditions in Theorem 6.1 for the existence of a stationary strategy that is strongly overtaking optimal. For safety objectives, we argue that the same results hold.

Proof techniques. We use quite different proof techniques to obtain our results. For proving result (I.1), we transform the MDP with the reachability objective into a regular MDP, by assigning payoffs to actions based on the immediate transition probabilities to the target state. In this new MDP, we invoke some results in [6] to derive a specific pure stationary strategy. We show that this strategy is exactly the desired strategy in the original MDP with the reachability objective. This proof technique is suitable for pure strategies, but probably also limited to them, as the relation between the two MDPs is much weaker for non-pure strategies. When considering generic MDPs in result (I.3), we rely on techniques from linear algebra. The overtaking comparison between two stationary strategies can be reduced to the comparison of the spectral gaps of the transition matrices that these strategies induce. The spectral gap of a transition matrix refers to the difference between the largest eigenvalue, which is equal to 1, and the modulus of the second eigenvalue, which can be a complex number. To obtain result (I.3), we need to compare the spectral gaps of transition matrices induced by stationary and pure stationary strategies. Result (I.4) is proven in a constructive way. The mixed actions of the desired stationary strategy can be derived from the conditions of Theorem 6.1. The results for safety objectives are proven similarly.

Related literature. Reachability and safety problems have been studied both in the MDP framework and in the context of two-player zero-sum games; for an overview, we refer to [1], and to [2] and [3], respectively. An important distinction is made between the qualitative and the quantitative approaches. The qualitative approach is interested in the probability with which the decision maker succeeds in meeting his objective. For the quantitative approach, however, it also matters how quickly the target state is reached in the case of a reachability objective, or how long the bad state has been avoided in the case of a safety objective. Our overtaking approach could thus be classified as a quantitative approach on the infinite horizon. For other quantitative approaches, we refer to [7, 8] and the references therein, and to [9].

In the literature, various definitions of overtaking optimality have been proposed. They all serve as a refinement of optimality on the infinite horizon, based on the performance of strategies on the finite horizons. For an overview, we refer to [10,11,12,13,14,15,16].

A well-established definition of overtaking optimality is given in Section 5.4.2 in [11] for MDPs in which the decision maker receives a payoff at each period, depending on the state and the chosen action. According to this definition, a strategy \(\sigma ^*\) is overtaking optimal if, for each strategy \(\sigma \) and for each error-term \(\delta >0\) the following holds: for all large horizons N, the expected sum of the payoffs during the first N periods under \(\sigma ^*\) is at least as much as that under \(\sigma \) minus \(\delta \). In our framework, there are no immediate payoffs, hence this definition does not apply. However, we show in Example 3.1 that, if we take the natural assignment of payoff 0 to each non-target state and payoff 1 to the target state, then our definition of overtaking optimality can lead to different strategies than Puterman’s definition, as well as variants of Puterman’s definition defined therein.

Our definition of overtaking optimality is a relatively direct translation of the definitions of sporadic overtaking optimality in [6, 10] and repeated optimality in [16], into the context of MDPs with reachability and safety objectives.

One important feature of our definition is that it does not require the strategy to outperform all other strategies on long but finite horizons. It only requires that the strategy is not outperformed by any other strategy. Our definition is therefore weaker than overtaking optimality and uniform overtaking optimality as in [10], and weaker than strong overtaking optimality as in [17] or [18]. See also [16], who delineates “not-outperformed” definitions from “outperform-all” definitions of optimality.

Organization of the paper. Section 2 details the model. Then, we start by analyzing reachability objectives. Section 3 provides an example, which highlights different aspects of the concept of overtaking optimality by comparing it with other optimality notions. Sections 4 and 5 present the results for piecewise deterministic MDPs and generic MDPs, respectively. Section 6 provides sufficient conditions that ensure that a stationary strategy is strongly overtaking optimal. In Sect. 7, we turn to safety objectives. Section 8 concludes.

2 The Model

2.1 MDPs with Reachability Objective

The model. An MDP is given by (i) a nonempty, finite set S of states, (ii) for each state \(s\in S\), a nonempty, finite set A(s) of actions, and (iii) for each state \(s\in S\) and action \(a\in A(s)\), a probability distribution \(p(s,a) = (p(z \mid s,a))_{z \in S}\) on the set S of states. The MDP is played at periods in \({\mathbb {N}}=\{1,2,\ldots \}\) as follows: The initial state \(s_1\) is given. At each period t, the decision maker chooses an action \(a_t\in A(s_t)\), which leads to a state \(s_{t+1} \in S\) drawn according to the distribution \(p(s_t,a_t)\). An MDP with reachability objective is an MDP together with a specific state \(s^*\in S\), called the target state, which is not the initial state \(s_1\).

Histories. Let \(H_{\infty }\) be the set of all infinite histories, i.e., the set of sequences \((s_1,a_1,s_2,a_2,\dots )\) such that \(s_i \in S\), \(a_i \in A(s_i)\), and \(p(s_{i+1}\mid s_i,a_i) > 0\) for each \(i \in {\mathbb {N}}\). A history at period t is a prefix \((s_1,a_1,\ldots ,s_{t-1},a_{t-1},s_t)\) of an infinite history. Denote by \(H_t\) the set of all histories at period t, by \(H=\cup _{t\in {\mathbb {N}}}H_t\) the set of all histories, and by s(h) the final state of each history \(h\in H\).

Strategies. A mixed action in a state \(s\in S\) is a probability distribution on A(s). The set of mixed actions in state s is denoted by \(\Delta (A(s))\). A strategy \(\sigma \) is a map that, to each history \(h\in H\), assigns a mixed action \(\sigma (h)\in \Delta (A(s(h)))\). The interpretation is that, if history h arises, \(\sigma \) chooses an action according to the probabilities given by the mixed action \(\sigma (h)\). A strategy \(\sigma \) is called pure, if \(\sigma (h)\) places probability 1 on one action, for each history h. A strategy \(\sigma \) is called stationary, if the recommendation of the action only depends on the current state, i.e., \(\sigma (h)=\sigma (h')\) whenever \(s(h)=s(h')\). Note that a pure stationary strategy can be seen as an element of \(\times _{s\in S}A(s)\). Every initial state s and strategy \(\sigma \) induce a probability measure \({\mathbb {P}}_{s\sigma }\) on \(H_{\infty }\), where \(H_{\infty }\) is endowed with the sigma–algebra generated by the cylinder sets. We denote the corresponding expectation operator by \({\mathbb {E}}_{s\sigma }\).

Value and optimality. Let \(t^*\) denote the first period when state \(s^*\) is reached; if \(s^*\) is not reached then \(t^*=\infty \). The value at the initial state s is the maximal probability that state \(s^*\) can be reached: \(v(s)=\sup _{\sigma }{\mathbb {P}}_{s\sigma }(t^*<\infty )\). A strategy \(\sigma \) is called optimal at the initial state s if \({\mathbb {P}}_{s\sigma }(t^*<\infty )=v(s)\). It is known that the decision maker always has a pure stationary strategy that is optimal at all initial states (cf. [4, 5]).
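For illustration, the reachability value can be approximated by iterating the one-step Bellman operator. The following sketch is not part of the original analysis; the encoding of the MDP, the helper reachability_value, and the toy instance are our own and only indicate one way such a computation might look.

```python
# Minimal sketch (not from the paper): approximate the reachability value
# v(s) = sup_sigma P_{s,sigma}(t* < infinity) by value iteration, assuming the
# MDP is encoded as p[s][a] = {successor: probability} with a target state.

def reachability_value(states, actions, p, target, iterations=10_000):
    v = {s: (1.0 if s == target else 0.0) for s in states}
    for _ in range(iterations):
        new_v = {}
        for s in states:
            if s == target:
                new_v[s] = 1.0
            else:
                # One-step Bellman operator: best expected continuation value.
                new_v[s] = max(sum(q * v[z] for z, q in p[s][a].items())
                               for a in actions[s])
        v = new_v
    return v

# Toy instance: from x, action 'a' reaches the target with probability 0.3 and
# otherwise stays in x; action 'b' moves to an absorbing non-target state y.
states = ["x", "y", "t"]
actions = {"x": ["a", "b"], "y": ["c"], "t": ["stay"]}
p = {
    "x": {"a": {"t": 0.3, "x": 0.7}, "b": {"y": 1.0}},
    "y": {"c": {"y": 1.0}},
    "t": {"stay": {"t": 1.0}},
}
print(reachability_value(states, actions, p, "t"))  # v(x) ~ 1.0, v(y) = 0.0
```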

Overtaking optimality. We say that a strategy \(\sigma \) overtakes a strategy \(\sigma '\) at the initial state s if there is \(T\in {\mathbb {N}}\) such that for all periods \(t\ge T\) we have \({\mathbb {P}}_{s\sigma }(t^*\le t)\,>\, {\mathbb {P}}_{s\sigma '}(t^*\le t)\). This means that, for all periods \(t\ge T\), the probability under \(\sigma \) to reach \(s^*\) within the first t periods is strictly larger than that under \(\sigma '\). If the decision maker is sufficiently patient with regard to his goal to reach the target state, then he strictly prefers \(\sigma \) to \(\sigma '\).

Note that two strategies can be incomparable in the sense that neither of them overtakes the other one. Consider the following example. The state space is \(\{x,s^*\}\). In state x, the decision maker has three actions: \(a_0,a_{1/2},a_{7/8}\). For \(z \in \{0,1/2, 7/8\}\), under action \(a_{z}\), the play moves to state \(s^*\) with probability z and remains in state x with probability \(1-z\). Now suppose that \(\sigma \) recommends to always play action \(a_{1/2}\) and \(\sigma '\) recommends to play the sequence of actions \(a_0,a_{7/8},a_0,a_0,a_{7/8},a_0,\ldots \) as long as the play is in state x. Then, at periods \(t=3k+1\), where \(k\in {\mathbb {N}}\), we have \({\mathbb {P}}_{s\sigma }(t^*\le t)\,=\, {\mathbb {P}}_{s\sigma '}(t^*\le t)\,=\,1-(1/8)^k\). At periods \(t=3k+2\), we have \({\mathbb {P}}_{s\sigma }(t^*\le t)\,>\, {\mathbb {P}}_{s\sigma '}(t^*\le t)\). At periods \(t=3k\), we have \({\mathbb {P}}_{s\sigma }(t^*\le t)\,<\, {\mathbb {P}}_{s\sigma '}(t^*\le t)\). So, \(\sigma \) and \(\sigma '\) are incomparable.
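The incomparability above is easy to check numerically. The following sketch is our own illustration (not from the paper); the helper reach_probs and the horizon are chosen only for demonstration.

```python
# Quick check (not from the paper) of the example above: sigma always plays
# a_{1/2}; sigma' cycles through a_0, a_{7/8}, a_0 as long as the play is in x.

def reach_probs(jump_probs, horizon):
    """P(t* <= t) for t = 1..horizon, where jump_probs[k] is the probability of
    moving to s* when the action of period k+1 is played."""
    not_reached, out = 1.0, []
    for t in range(1, horizon + 1):
        out.append(1.0 - not_reached)           # reached within the first t periods
        not_reached *= 1.0 - jump_probs[t - 1]  # action played at period t
    return out

H = 13
sigma = reach_probs([0.5] * H, H)
cycle = [0.0, 7 / 8, 0.0]
sigma_prime = reach_probs([cycle[k % 3] for k in range(H)], H)

for t in range(1, H + 1):
    diff = sigma[t - 1] - sigma_prime[t - 1]
    label = "equal" if abs(diff) < 1e-12 else ("sigma ahead" if diff > 0 else "sigma' ahead")
    print(t, label)
# Periods 3k+1: equal; periods 3k+2: sigma ahead; periods 3k: sigma' ahead.
```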

A strategy \(\sigma \) is called overtaking optimal at the initial state s, if there is no strategy that overtakes \(\sigma \) at that initial state. That is, \(\sigma \) is maximal with respect to the relation of “overtakes” between strategies.

Note that any optimal strategy overtakes any strategy that is not optimal. Indeed, if strategy \(\sigma \) is optimal at the initial state s but strategy \(\sigma '\) is not, then \({\mathbb {P}}_{s\sigma }(t^*<\infty )\,=v(s)\,>\, {\mathbb {P}}_{s\sigma '}(t^*<\infty )\), and hence \({\mathbb {P}}_{s\sigma }(t^*\le t)\,>\, {\mathbb {P}}_{s\sigma '}(t^*\le t)\) for all sufficiently large t. Consequently, an overtaking optimal strategy at the initial state s is also optimal at that initial state. As Example 3.1 will show, the converse is not true: there exist optimal strategies at an initial state that are not overtaking optimal at that initial state. Thus, overtaking optimality is a strict refinement of optimality.

Strong overtaking optimality. We say that a strategy \(\sigma \) weakly overtakes a strategy \(\sigma '\) at the initial state s if there is \(T\in {\mathbb {N}}\) such that for all periods \(t\ge T\) we have \({\mathbb {P}}_{s\sigma }(t^*\le t)\,\ge \, {\mathbb {P}}_{s\sigma '}(t^*\le t)\) with strict inequality for infinitely many t. Note that if \(\sigma \) overtakes \(\sigma '\) at the initial state s then \(\sigma \) also weakly overtakes \(\sigma '\) at that initial state.

A strategy \(\sigma \) is called strongly overtaking optimal at the initial state s, if no strategy weakly overtakes \(\sigma \) at that initial state. A strongly overtaking optimal strategy at an initial state is also overtaking optimal at that initial state.

2.2 Discounted and Average Payoff MDPs

We will also consider MDPs with the discounted payoff or with the average payoff, but only as auxiliary models. A discounted MDP is an MDP together with a discount factor \(\beta \in ]0,1[\) and a payoff function, namely a function \((s,a) \mapsto u(s,a) \in {\mathbb {R}}\) that assigns a payoff to each state \(s\in S\) and action \(a\in A(s)\). For initial state \(s\in S\), the \(\beta \)-discounted value is defined as

$$\begin{aligned} v_\beta (s)\,=\,\sup _{\sigma }\,{\mathbb {E}}_{s\sigma }\left[ (1-\beta ) \cdot \sum _{t=1}^\infty \,\beta ^{t-1}\cdot u(s_t,a_t)\right] . \end{aligned}$$
(1)

A strategy \(\sigma \) is called \(\beta \)-discounted optimal at the initial state s, if the supremum in (1) is attained at \(\sigma \). A strategy \(\sigma ^*\) is called Blackwell optimal, if there is \(B\in ]0,1[\) such that \(\sigma ^*\) is \(\beta \)-discounted optimal at all initial states for all discount factors \(\beta \in [B,1[\).

By the results of [19] and [5], it is known that for each discount factor \(\beta \in ]0,1[\), the decision maker has a pure stationary strategy that is \(\beta \)-discounted optimal at all initial states, and that he has a Blackwell optimal strategy too.

An average payoff MDP is similar to a discounted MDP, except that the decision maker’s goal is to maximize the expectation of the average payoff \(\liminf _{T\rightarrow \infty }\ \frac{1}{T}\sum _{t=1}^T u(s_t,a_t)\). The average value and average optimality are defined analogously to the corresponding definitions for reachability objective. By [4, 5], it is known that the decision maker has a pure stationary strategy that is average optimal at all initial states. Moreover, each Blackwell optimal strategy is average optimal at all initial states.

3 Reachability Objectives: An Illustrative Example

In this section, we discuss a specific MDP with reachability objective, which demonstrates four properties of overtaking optimality. First, there are optimal strategies that are not overtaking optimal. That is, overtaking optimality is a strict refinement of optimality. Second, finding overtaking optimal strategies cannot be done by simply solving a related discounted MDP. Third, the strategy that minimizes the expected time of reaching the target state \(s^*\) can be different from the overtaking optimal strategy, even when the latter is unique. Fourth, for a related MDP, overtaking optimality is not the same as the concept of overtaking optimality as defined in [11], or the related concepts of cumulative overtaking optimality and average overtaking optimality that were defined therein.

Example 3.1

Consider an MDP that has state space \(S=\{x,y,z,s^*\}\) such that:

  • State x is the initial state. In this state, the decision maker has two actions: a and b. Action a leads to state \(s^*\) with probability q and to state y with probability \(1-q\). Action b leads to state \(s^*\) with probability \(\frac{1}{2}\) and to state z with probability \(\frac{1}{2}\).

  • In state y, there is only one action, denoted by c, which leads to state \(s^*\) with probability q and to state y with probability \(1-q\).

  • In state z, there is only one action, denoted by d, which leads to state \(s^*\) with probability p and to state z with probability \(1-p\).

  • State \(s^*\) is absorbing.

The probabilities p and q are such that \(0<p<q<\frac{2p}{2p+1}\). For example, \(p=0.1\) and \(q=0.11\). Note that \(p<\frac{2p}{2p+1}\) implies \(p<1/2\), and therefore \(p< q<\frac{2p}{2p+1}\) implies \(q<1/2\) as well.

The MDP is depicted in Fig. 1. In this figure, the state \(s^*\) is omitted for simplicity, states x, y, and z are denoted by circles, and actions are denoted by arcs together with the name of the action and the corresponding probability of moving to state \(s^*\). For example, the arrow from state x to state y denoted by a : q indicates that this action is called a and that, when played at state x, it leads to state \(s^*\) with probability q and to state y with probability \(1-q\).

Fig. 1 The MDP in Example 3.1

Suppose that the initial state is x. Since in states y and z there is a single action, a strategy is characterized by the action it selects in state x. Choosing the action a at state x leads to the following sequence of probabilities of moving to state \(s^*\): \((q,q,q,\ldots )\). Choosing the action b at state x leads to the following sequence of probabilities of moving to state \(s^*\): \((1/2,p,p,\ldots )\). The state \(s^*\) is eventually reached with probability 1 under both actions a and b. Below we argue that while b (and not a) is optimal according to many optimality concepts, a (and not b) is overtaking optimal under reachability objective.

Claim 1

In Example 3.1, action a is overtaking optimal under reachability objective, and action b is not.

Proof

Take a period \(t\ge 2\). Under a the probability of reaching the target state \(s^*\) within the first t periods is \({\mathbb {P}}_{xa}(t^*\le t)\;=\;1-(1-q)^{t-1}\), whereas under b this probability is \({\mathbb {P}}_{xb}(t^*\le t)\;=\;\frac{1}{2}+\frac{1}{2}\cdot \left( 1-(1-p)^{t-2}\right) \). Thus,

$$\begin{aligned} {\mathbb {P}}_{xa}(t^*\le t)-{\mathbb {P}}_{xb}(t^*\le t)&\;=\;-(1-q)^{t-1}+\frac{1}{2}\cdot (1-p)^{t-2}\\&\;=\;(1-p)^{t-2}\cdot \left[ -\left( \frac{1-q}{1-p}\right) ^{t-2} \cdot (1-q)+\frac{1}{2}\right] . \end{aligned}$$

Since \(p<q\) by assumption, \((1-q)/(1-p)<1\). So, \({\mathbb {P}}_{xa}(t^*\le t)-{\mathbb {P}}_{xb}(t^*\le t)\) is positive for large \(t\in {\mathbb {N}}\), and hence a overtakes b. Moreover, since every strategy is characterized by the (possibly mixed) action it selects in state x at period 1, the reach probability of any strategy is a convex combination of those under a and b, and therefore no strategy overtakes a. This completes the proof. \(\square \)
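As a numerical complement (our own check, not part of the paper), the two reach probabilities can be evaluated directly for p = 0.1 and q = 0.11; the difference is negative for small horizons and positive for all large horizons.

```python
# Numerical check (not from the paper) of Claim 1 with p = 0.1, q = 0.11.

p, q = 0.1, 0.11
assert 0 < p < q < 2 * p / (2 * p + 1)

def reach_a(t):  # P_{x,a}(t* <= t) for t >= 2
    return 1 - (1 - q) ** (t - 1)

def reach_b(t):  # P_{x,b}(t* <= t) for t >= 2
    return 0.5 + 0.5 * (1 - (1 - p) ** (t - 2))

for t in (2, 10, 50, 80, 200):
    print(t, reach_a(t) - reach_b(t))
# The difference is negative for small t but positive for all large t,
# in line with Claim 1: a overtakes b.
```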

Claim 2

Consider the discounted MDP that has payoff equal to 1 in state \(s^*\) and payoff 0 in states x, y and z. Strategy b is Blackwell optimal, while strategy a is not Blackwell optimal.

Proof

In the discounted MDP, if state \(s^*\) is reached at some period t, then the \(\beta \)-discounted payoff is equal to \((1-\beta )\cdot (\beta ^{t-1}+\beta ^t+\cdots )=\beta ^{t-1}.\) Thus, action a leads to the expected discounted payoff \(D_a(\beta )\;=\;q\beta \frac{1}{1-(1-q)\beta }\), whereas action b leads to the expected discounted payoff: \(D_b(\beta )\;=\;\frac{1}{2}\beta +\frac{1}{2}p\beta ^2\frac{1}{1-(1-p)\beta }\). As one can verify, we have

$$\begin{aligned} D_b(\beta )-D_a(\beta )\;=\;\frac{\frac{1}{2}\beta (1-\beta ) \cdot \bigl (1-(1-2p)(1-q)\beta -2q\bigr )}{(1-(1-p)\beta )\cdot (1-(1-q)\beta )}. \end{aligned}$$

The denominator of the fraction above is positive for all \(\beta \in ]0,1[\). As \(q<\frac{2p}{2p+1}\) by assumption, the expression \(1-(1-2p)(1-q)-2q\) is positive. Hence, the numerator of the fraction above is positive for large \(\beta \in ]0,1[\). Thus, we find \(D_b(\beta )>D_a(\beta )\) for large \(\beta \in ]0,1[\), and the claim follows. \(\square \)
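Again as a numerical complement (our own check, not from the paper), the closed-form expressions for \(D_a(\beta )\) and \(D_b(\beta )\) can be evaluated for p = 0.1 and q = 0.11.

```python
# Check (not from the paper) of Claim 2 with p = 0.1, q = 0.11.

p, q = 0.1, 0.11

def D_a(beta):
    return q * beta / (1 - (1 - q) * beta)

def D_b(beta):
    return 0.5 * beta + 0.5 * p * beta ** 2 / (1 - (1 - p) * beta)

for beta in (0.5, 0.9, 0.99, 0.999):
    print(beta, D_b(beta) - D_a(beta))
# D_b(beta) - D_a(beta) > 0 for beta close to 1 (with these parameters it is
# positive at every beta shown), so b is beta-discounted optimal for large beta.
```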

Claim 3

In Example 3.1, the expectation of the period \(t^*\) when reaching the state \(s^*\) is smaller under b than under a: \({\mathbb {E}}_{xb}(t^*)\,<\,{\mathbb {E}}_{xa}(t^*)\).

Proof

We have \({\mathbb {E}}_{xa}(t^*)\;=\;2q+3(1-q)q+4(1-q)^2q+\cdots \;=\;\frac{1}{q}+1\) and \({\mathbb {E}}_{xb}(t^*)\;=\;2\cdot \frac{1}{2}+\frac{1}{2}\cdot [3p+4(1-p)p+5(1-p)^2p+\cdots ]\;=\;\frac{1}{2p}+2\). Since \(q<\frac{2p}{2p+1}\) by assumption, the claim follows. \(\square \)
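The closed forms of the two expectations can be double-checked by truncated summation (our own check, not part of the paper).

```python
# Arithmetic check (not from the paper) of Claim 3 with p = 0.1, q = 0.11.

p, q = 0.1, 0.11

exp_a = sum(t * q * (1 - q) ** (t - 2) for t in range(2, 5000))
exp_b = 2 * 0.5 + 0.5 * sum(t * p * (1 - p) ** (t - 3) for t in range(3, 5000))

print(exp_a, 1 / q + 1)        # truncated sum vs closed form 1/q + 1
print(exp_b, 1 / (2 * p) + 2)  # truncated sum vs closed form 1/(2p) + 2
print(exp_b < exp_a)           # True: b reaches s* sooner in expectation
```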

Claim 4

Consider the MDP with payoff 1 in state \(s^*\) and payoff 0 in states x, y, and z. The strategy b is both overtaking optimal and cumulative overtaking optimal according to the definitions in Section 5.4.2 in [11], and strategy a is neither of them.

Proof

For each strategy \(\sigma \) and period N, define \(R(\sigma ,N)\) to be the expected sum of the payoffs (which is also the sum of the expected payoffs) up to period N. Since the initial state x is not the target state, \(R(a,1)=R(b,1)=0\) holds trivially. We claim that \(R(a,N)<R(b,N)\) for every \(N\ge 2\), and, moreover, \(\lim _{N\rightarrow \infty } R(b,N)-R(a,N)=-\frac{1}{2p}+\frac{1-q}{q}>0\).

Since \({\mathbb {P}}_{xa}(t^*\le t)\;=\;1-(1-q)^{t-1}\), for every \(N \ge 2\) we have

$$\begin{aligned} R(a,N)\,=\,\sum _{t=2}^N \left( 1-(1-q)^{t-1}\right) \,=\,(N-1)-(1-q)\cdot \frac{1-(1-q)^{N-1}}{1-(1-q)}, \end{aligned}$$

and since \({\mathbb {P}}_{xb}(t^*\le t)\;=\;\frac{1}{2}+\frac{1}{2}\cdot \left( 1-(1-p)^{t-2}\right) \), we have

$$\begin{aligned} R(b,N)\,=\,\sum _{t=2}^N \left( \frac{1}{2}+\frac{1}{2}\cdot (1-(1-p)^{t-2})\right) \,=\,(N-1) -\frac{1}{2}\cdot \frac{1-(1-p)^{N-1}}{1-(1-p)}. \end{aligned}$$

Since \(p<q\) and \(q<\frac{2p}{2p+1}\) by assumption, \(1-(1-q)^{N-1}>1-(1-p)^{N-1}\) and \(\frac{1-q}{q}>\frac{1}{2p}\), and therefore \((1-q)\cdot \frac{1-(1-q)^{N-1}}{1-(1-q)}\,>\,\frac{1}{2}\cdot \frac{1-(1-p)^{N-1}}{1-(1-p)}\). Thus, \(R(a,N)<R(b,N)\), and therefore b (and not a) is cumulative overtaking optimal according to Puterman’s definition. Moreover,

$$\begin{aligned} \lim _{N\rightarrow \infty } R(b,N)-R(a,N)\,=-\frac{1}{2p}+\frac{1-q}{q}\,>\,0, \end{aligned}$$

so b (and not a) is overtaking optimal according to Puterman’s definition. \(\square \)
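The partial sums R(a, N) and R(b, N) can likewise be evaluated numerically (our own check, not part of the paper); the difference stays positive and converges to the stated limit.

```python
# Check (not from the paper) of Claim 4 with p = 0.1, q = 0.11.

p, q = 0.1, 0.11

def R_a(N):
    return sum(1 - (1 - q) ** (t - 1) for t in range(2, N + 1))

def R_b(N):
    return sum(0.5 + 0.5 * (1 - (1 - p) ** (t - 2)) for t in range(2, N + 1))

for N in (5, 50, 500, 5000):
    print(N, R_b(N) - R_a(N))          # positive for every N >= 2
print((1 - q) / q - 1 / (2 * p))       # the limit of R(b,N) - R(a,N), positive
```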

4 Reachability Objectives: Piecewise Deterministic MDPs

A piecewise deterministic Markov process [20] is a process whose behavior is governed by random jumps at points in time, but whose evolution is deterministically governed by an ordinary differential equation between those times. These processes have been shown to be useful in a wide range of applications, including queueing theory, ruin problems, biochemistry and geology. In this section, we study an analogous concept when the state space is finite and when the jumps lead to the target state.

We call an MDP with reachability objective piecewise deterministic if for each state \(s\ne s^*\) and each action \(a\in A(s)\) there is a state \(\omega (s,a) \in S\) such that \(p(\{\omega (s,a),s^*\}\mid s,a)=1\). That is, for any state and action, the play moves to the target state \(s^*\) or to a specific state that is not \(s^*\).

A special case of piecewise deterministic MDPs is that of deterministic MDPs (see Section 3.3 in [11]), that is, when there is no randomness in the transitions: for every state \(s\in S\) and action \(a\in A(s)\) there is a unique state \(w(s,a) \in S\) such that \(p(w(s,a)\mid s,a)=1\).

The following theorem states that, in piecewise deterministic MDPs with reachability objective, there always exists a pure stationary strategy that is at least as good as any pure strategy in the overtaking sense. The main idea of the proof is to transform the MDP with reachability objective into an average payoff MDP. The payoffs that we assign to actions are related to the probabilities that these actions lead to the target state.

The condition that the MDP is piecewise deterministic plays an important role. Indeed, Example 4.2 will show that the result is not true in general for MDPs that are not piecewise deterministic.

Theorem 4.1

In every piecewise deterministic MDP with reachability objective, there exists a pure stationary strategy that is not weakly overtaken by any other pure strategy.

To prove Theorem 4.1, we need the following result. The proof is provided in the Appendix.

Theorem 4.2

Consider a deterministic discounted MDP and let \(\sigma \) be a Blackwell optimal strategy. There exists no pure strategy \(\sigma '\) and no initial state \(s \in S\) with the following properties:

  • Property I: there is \(M\in {\mathbb {N}}\) such that for all periods \(t\ge M\) we have \(u_t(s,\sigma )\,\le \,u_t(s,\sigma ')\), where \(u_t(s,\sigma )\) and \(u_t(s,\sigma ')\) are the expected average payoffs up to period t at initial state s under \(\sigma \) and under \(\sigma '\), respectively,

  • Property II: \(u_t(s,\sigma )\,<\,u_t(s,\sigma ')\) holds for infinitely many t.

Proof of Theorem 4.1

Consider a piecewise deterministic MDP \({\mathcal {M}}\) with reachability objective. We may assume that there is no state \(s\ne s^*\) and action \(a\in A(s)\) such that \(p(s^*\mid s,a)=1\). As the MDP \({\mathcal {M}}\) is piecewise deterministic, this implies that \(p(\omega (s,a)\mid s,a)>0\), for every \(s \ne s^*\) and \(a \in A(s)\).

We define an auxiliary average payoff deterministic MDP \({\mathcal {M}}'\) as follows: (i) The state space is \(S'=S-\{s^*\}\). (ii) For each state \(s\in S'\), the action space is the same as in the MDP \({\mathcal {M}}\) with reachability objective: \(A'(s)=A(s)\). (iii) For each state \(s\in S'\) and action \(a\in A(s)\), the transition and the payoff in \({\mathcal {M}}'\) are defined as follows: \(p'(\omega (s,a) \mid s,a)=1\) and \(u'(s,a)=-\log (p(\omega (s,a) \mid s,a))\).

Intuitively, the MDP \({\mathcal {M}}'\) represents what happens if in the MDP \({\mathcal {M}}\) the decision maker is unlucky at all periods and the process never reaches state \(s^*\). Let \(\sigma \) be a pure Blackwell optimal stationary strategy in \({\mathcal {M}}'\). We show that \(\sigma \) is not weakly overtaken by any pure strategy, thereby proving the theorem.

Consider any other pure strategy \(\rho \) in \({\mathcal {M}}'\). Since the MDP \({\mathcal {M}}'\) is deterministic, each of the pure strategies \(\sigma \) and \(\rho \) induces a specific infinite history in \({\mathcal {M}}'\) with probability 1. Let \((s_t,a_t)_{t\in {\mathbb {N}}}\) denote the infinite history induced by \(\sigma \) and \((z_t,b_t)_{t\in {\mathbb {N}}}\) denote the infinite history induced by \(\rho \).

In the original MDP \({\mathcal {M}}\), the probability under \(\sigma \) that state \(s^*\) is not reached within the first t periods is \(1-{\mathbb {P}}_{s_1,\sigma }(t^*\le t)\). This probability is related to the payoffs in the average payoff MDP \({\mathcal {M}}'\). Indeed, for each period \(t\ge 2\) we have

$$\begin{aligned} \log \left( 1-{\mathbb {P}}_{s_1,\sigma }(t^*\le t)\right)&\;=\; \log \left( \prod _{k=1}^{t-1}p(s_{k+1} \mid s_k,a_k) \right) \;=\; \sum _{k=1}^{t-1}\log \left( p(s_{k+1} \mid s_k,a_k)\right) \\&\;=\; -\,\sum _{k=1}^{t-1}u'(s_k,a_k) \;=\; -(t-1)\cdot u_{t-1}(s_1,\sigma ). \end{aligned}$$

Similarly, \(\log \left( 1-{\mathbb {P}}_{s_1,\rho }(t^*\le t)\right) \;=\; -\,\sum _{k=1}^{t-1}u'(z_k,b_k)\;=\;-(t-1)\cdot u_{t-1}(s_1,\rho )\).

By the choice of \(\sigma \) in \({\mathcal {M}}'\) (cf. Theorem 4.2), one of the following holds: (i) There is \(M\in {\mathbb {N}}\) such that for all periods \(t\ge M\) we have \(u_t(s_1,\sigma )\,=\,u_t(s_1,\rho )\). (ii) There is a strictly increasing sequence \((t_k)_{k\in {\mathbb {N}}}\) of periods such that for each \(k\in {\mathbb {N}}\) we have \(u_{t_k}(s_1,\sigma )\,>\,u_{t_k}(s_1,\rho )\).

If (i) holds, then we have \({\mathbb {P}}_{s_1,\sigma }(t^*\le t) \,=\, {\mathbb {P}}_{s_1,\rho }(t^*\le t)\) for all \(t\ge M+1\), and hence \(\rho \) does not weakly overtake \(\sigma \) in the original MDP \({\mathcal {M}}\). If (ii) holds, then \({\mathbb {P}}_{s_1,\sigma }(t^*\le t_k) \,>\, {\mathbb {P}}_{s_1,\rho }(t^*\le t_k)\) for each \(k\in {\mathbb {N}}\), and hence \(\rho \) does not weakly overtake \(\sigma \) in the original MDP \({\mathcal {M}}\) in this case either. \(\square \)
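The transformation used in this proof is easy to trace on a small instance. In the sketch below (our own illustration; the dictionary succ, the policy, and the probabilities are hypothetical), a pure stationary strategy in a piecewise deterministic MDP induces a deterministic sequence of non-target states, and the identity \(\log (1-{\mathbb {P}}(t^*\le t))=-(t-1)\cdot u_{t-1}\) is verified along the way.

```python
import math

# Sketch (not from the paper) of the log transformation: in a piecewise
# deterministic MDP, a pure strategy yields a deterministic sequence of
# non-target states, and the log of the survival probability equals minus the
# accumulated auxiliary payoff u'(s, a) = -log p(omega(s, a) | s, a).

# Hypothetical instance: succ[s][a] = (non-target successor, probability of
# moving there); the remaining probability mass goes to the target state s*.
succ = {
    "x": {"a": ("y", 0.6), "b": ("x", 0.3)},
    "y": {"c": ("x", 0.8)},
}
policy = {"x": "a", "y": "c"}              # a pure stationary strategy

state, survival, total_payoff = "x", 1.0, 0.0
for period in range(1, 11):
    action = policy[state]
    nxt, stay_prob = succ[state][action]
    survival *= stay_prob                  # P(s* still not reached)
    total_payoff += -math.log(stay_prob)   # payoff in the auxiliary MDP M'
    state = nxt
    # Identity from the proof: log(1 - P(t* <= t)) = -(t-1) * u_{t-1}, where
    # total_payoff equals (t-1) * u_{t-1}.
    assert abs(math.log(survival) + total_payoff) < 1e-12
print("survival probability after 10 transitions:", survival)
```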

The following example demonstrates that, even if the MDP with reachability objective is piecewise deterministic, an overtaking optimal strategy may fail to exist. In the example, each pure strategy is equally good in the overtaking sense, but each pure strategy is overtaken by any strategy that uses randomization at every period.

Example 4.1

Consider the MDP with reachability objective given in Fig. 2, with a notation similar to that of Example 3.1. The initial state is state x.

Fig. 2 The MDP in Example 4.1

Since in states y and z there is a single action, a pure strategy is characterized by the period in which the action b is first played in state x. When playing action b and subsequently action d, the total probability during these two periods of reaching the target state is \(\frac{3}{4}\). Playing action a twice (or action e twice) leads to the same total probability, as \(\frac{1}{2}+\frac{1}{2}\cdot \frac{1}{2}=\frac{3}{4}\). It follows that \({\mathbb {P}}_{x\sigma }(t^* \le t) = 1-\tfrac{1}{2^{t-1}}\), for all pure strategies \(\sigma \), except for the strategy \(\sigma ' = a^{t-2}b\) that plays the action a in the first \(t-2\) periods and the action b in period \(t-1\), for which \({\mathbb {P}}_{x\sigma '}(t^* \le t) > 1 - \tfrac{1}{2^{t-1}}\). This implies that strategies that use randomization at state x at every period do better in the overtaking sense than all pure strategies. Indeed, given any \(t\ge 2\), when calculating the probability of reaching the target state within the first t periods, there is a positive probability that action b is played exactly at period \(t-1\) (and thus there is an additional chance of reaching the target state exactly at period t), while the consequences of playing action d do not have to be taken into account yet.

Claim 1

Consider Example 4.1. For each pure strategy \(\sigma \), it holds for sufficiently large periods t that the probability of reaching the target state within the first t periods is \({\mathbb {P}}_{x\sigma }(t^*\le t)\;=\;1-\frac{1}{2^{t-1}}\). In particular, no pure strategy is overtaken by another pure strategy.

Proof

For the pure strategy \(a^\infty \) that plays a at all periods, we have for all periods t that \({\mathbb {P}}_{xa^\infty }(t^*\le t)\;=\;1-\frac{1}{2^{t-1}}\). For any other pure strategy \(a^{n-1}b\) that plays a at the first \(n-1\) periods and plays b at period n, we have for all periods \(t\ge n+2\) that \({\mathbb {P}}_{x,a^{n-1}b}(t^*\le t)\;=\;1-\frac{1}{2^{t-1}}\). \(\square \)

Claim 2

Consider Example 4.1. Take two strategies \(\sigma \) and \(\sigma '\). Consider a period \(t\ge 2\). Then, \({\mathbb {P}}_{x\sigma }(t^*\le t)>{\mathbb {P}}_{x\sigma '}(t^*\le t)\) holds if and only if the probability under \(\sigma \) of being in state x and playing action b at period \(t-1\) is strictly larger than that under \(\sigma '\), i.e., \({\mathbb {P}}_{x\sigma }(a_{t-1}=b)> {\mathbb {P}}_{x\sigma '}(a_{t-1}=b)\). Consequently, if the condition \({\mathbb {P}}_{x\sigma }(a_{t-1}=b)> {\mathbb {P}}_{x\sigma '}(a_{t-1}=b)\) holds for all sufficiently large periods t, then \(\sigma \) overtakes \(\sigma '\).

Proof

Suppose that when playing two strategies \(\sigma \) and \(\sigma '\), it holds that for some period \(t\ge 2\) we have \({\mathbb {P}}_{x\sigma }(a_{t-1}=b)> {\mathbb {P}}_{x\sigma '}(a_{t-1}=b)\).

On the finite horizon up to period t, the set of pure strategies is the finite set \(W^t=\{a^{t},\ b,\ ab,\ a^2b,\ldots ,\ a^{t-1}b\}\). Under \(a^{t-2}b\), the probability of reaching the target state within the first t periods is \({\mathbb {P}}_{x,a^{t-2}b}(t^*\le t)\;=\;1-\frac{1}{2^{t-2}}\cdot \frac{1}{4}\;=\;1-\frac{1}{2^t}\), while under each other pure strategy \(\tau \ne a^{t-2}b\), this is \({\mathbb {P}}_{x\tau }(t^*\le t)\;=\;1-\frac{1}{2^{t-1}}\).

The strategy \(\sigma \) induces in a natural way a probability distribution on the finite set \(W^t\) of pure strategies. Indeed, denote by \(((x,a)^{k-1},x)\) the history at period k in which the play remained through action a in state x until period k, and by \(\sigma (h)(a)\) the probability to select action a after history h. Then, we have: (i) \({\mathbb {P}}_{x\sigma }(b)=\sigma (x)(b)\), since (x) is the history at period 1. (ii) \({\mathbb {P}}_{x\sigma }(a^t)=\sigma (x)(a)\cdot \sigma (x,a,x)(a)\cdots \sigma ((x,a)^{t-1},x)(a)\). (iii) For \(k=1,\ldots ,t-1\), we have \({\mathbb {P}}_{x\sigma }(a^k b)=\sigma (x)(a)\cdots \sigma ((x,a)^{k-1},x)(a)\cdot \sigma ((x,a)^k,x)(b)\). Similarly, the strategy \(\sigma '\) also induces a probability distribution on \(W^t\). It follows on the finite horizon t that

$$\begin{aligned} {\mathbb {P}}_{x\sigma }(t^*\le t)&\;=\;\sum _{\tau \in W^t}{\mathbb {P}}_{x\sigma }(\tau )\cdot {\mathbb {P}}_{x\tau }(t^*\le t)\\&\;=\;{\mathbb {P}}_{x\sigma }(a^{t-2}b)\cdot \Big (1-\frac{1}{2^t}\Big )+\;(1-{\mathbb {P}}_{x\sigma }(a^{t-2}b))\cdot \Big (1-\frac{1}{2^{t-1}}\Big ), \end{aligned}$$

and similarly for the strategy \(\sigma '\).

Thus, \({\mathbb {P}}_{x\sigma }(t^*\le t)>{\mathbb {P}}_{x\sigma '}(t^*\le t)\) if and only if \({\mathbb {P}}_{x\sigma }(a^{t-2}b)>{\mathbb {P}}_{x\sigma '}(a^{t-2}b)\), if and only if \({\mathbb {P}}_{x\sigma }(a_{t-1}=b)> {\mathbb {P}}_{x\sigma '}(a_{t-1}=b)\). The proof is complete. \(\square \)
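The decomposition in this proof can be evaluated directly: writing w for the probability that the strategy's own randomization produces the pure strategy \(a^{t-2}b\), the reach probability on horizon t equals \(w\cdot (1-\frac{1}{2^t})+(1-w)\cdot (1-\frac{1}{2^{t-1}})\). The sketch below is our own illustration (not from the paper) of this formula for the stationary strategy \((\frac{1}{2},\frac{1}{2})^\infty \) and for pure strategies.

```python
# Sketch (not from the paper) of the decomposition in the proof of Claim 2.

def reach(w, t):
    """P_x(t* <= t) for a strategy whose own randomization selects the pure
    strategy a^(t-2) b with probability w."""
    return w * (1 - 0.5 ** t) + (1 - w) * (1 - 0.5 ** (t - 1))

def weight_stationary(xi, t):
    """Probability that the stationary strategy playing b with probability xi
    (whenever in state x) selects exactly a^(t-2) b."""
    return (1 - xi) ** (t - 2) * xi

for t in (5, 10, 20):
    pure = reach(0.0, t)                         # any pure strategy other than a^(t-2) b
    mixed = reach(weight_stationary(0.5, t), t)  # the stationary (1/2, 1/2) strategy
    print(t, mixed > pure)                       # True: the mixed strategy is ahead
```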

By Claim 2, the stationary strategy \((\frac{1}{2},\frac{1}{2})^\infty \) that always chooses action a and action b each with probability \(\frac{1}{2}\) overtakes every pure strategy. Also, the stationary strategy \((p,1-p)^\infty \) overtakes the stationary strategy \((q,1-q)^\infty \) if \(q<p<1\), as for large periods t we have

$$\begin{aligned} {\mathbb {P}}_{x,(p,1-p)^\infty }(a_{t-1}=b)\;&=\;p^{t-2}\cdot \Big (\frac{1}{2}\Big )^{t-2}\cdot (1-p) >\;\;q^{t-2}\cdot \Big (\frac{1}{2}\Big )^{t-2}\cdot (1-q)\\&=\; {\mathbb {P}}_{x,(q,1-q)^\infty }(a_{t-1}=b). \end{aligned}$$

This means that in Example 4.1 there is no stationary overtaking optimal strategy. We now show that there is no overtaking optimal strategy at all.

Claim 3

The MDP in Example 4.1 admits no overtaking optimal strategy.

Proof

Consider any strategy \(\sigma \). We construct a strategy that overtakes \(\sigma \). The strategy \(\sigma \) can be seen as a sequence \((\xi _n)_{n=1}^\infty \) where \(\xi _n\) denotes the probability that \(\sigma \) assigns to action b when being in state x at period n. We distinguish two cases.

  • Case 1 Assume that either \(\xi _n=0\) for all periods n or \(\xi _n=1\) for some period n. In this case, \({\mathbb {P}}_{x\sigma }(a_n=b)=0\) at large periods n. Hence, by Claim 2, the stationary strategy \((\frac{1}{2},\frac{1}{2})^\infty \) overtakes \(\sigma \).

  • Case 2 Assume that \(\xi _m>0\) for some period m and \(\xi _n<1\) for all periods n. We can choose a sequence \((\xi '_n)_{n=1}^\infty \) such that (i) for all periods \(n=1,\ldots ,m-1\) we have \(\xi '_n=\xi _n\), (ii) for period m we have \(\xi '_m<\xi _m\), (iii) for all periods \(n>m\) we have \(\xi _n<\xi '_n<1\), and (iv) \(\prod _{n=1}^\infty (1-\xi '_n)=\prod _{n=1}^\infty (1-\xi _n)\). The idea is to slightly reduce the probability \(\xi _m\) at period m and slightly increase all probabilities \(\xi _n\), \(n>m\), so that (iv) holds, i.e., so that the probability that the strategy's own randomization never selects b (if the play were to remain in state x at all periods) is the same under \((\xi '_n)_{n=1}^\infty \) as under \((\xi _n)_{n=1}^\infty \).

Let \(\sigma '\) be the strategy corresponding to \((\xi '_n)_{n=1}^\infty \). Consider a period \(t>m\). By (iii), we have \(\prod _{n=t}^\infty (1-\xi '_n)\le \prod _{n=t}^\infty (1-\xi _n)\). Hence, by (iv), we obtain \(\prod _{n=1}^{t-1} (1-\xi '_n)\ge \prod _{n=1}^{t-1} (1-\xi _n)\). This means that the probability of being in state x at period t is at least as large under \(\sigma '\) as under \(\sigma \). Thus, by (iii), we obtain \({\mathbb {P}}_{x\sigma '}(a_t=b)> {\mathbb {P}}_{x\sigma }(a_t=b)\). Since this is true for all periods \(t>m\), in view of Claim 2, \(\sigma '\) overtakes \(\sigma \). \(\square \)
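One concrete way to realize the sequence \((\xi '_n)_{n=1}^\infty \) of Case 2 (our own illustration, not taken from the paper) is to reduce \(\xi _m\) by some \(\delta \) and to spread the compensating factor \(r=(1-\xi _m)/(1-\xi '_m)\) over the later periods as \(r^{2^{-(n-m)}}\), so that the corrections multiply to r.

```python
import math

# Our own illustration (not from the paper) of the construction in Case 2.

def perturb(xi, m, delta):
    """xi: list of b-probabilities with xi[n] < 1 for all n and xi[m] > 0."""
    assert 0 < delta < xi[m]
    new = list(xi)
    new[m] = xi[m] - delta
    r = (1 - xi[m]) / (1 - new[m])          # lies in (0, 1)
    for n in range(m + 1, len(xi)):
        # Multiply the survival factor 1 - xi_n by r ** (1 / 2**(n - m)); the
        # corrections over all n > m multiply to (almost exactly) r.
        new[n] = 1 - (1 - xi[n]) * r ** (2.0 ** -(n - m))
    return new

xi = [0.3 * 0.9 ** n for n in range(30)]    # a sequence with xi_n in (0, 1)
xi2 = perturb(xi, m=3, delta=0.05)

prod = lambda seq: math.prod(1 - x for x in seq)
print(prod(xi), prod(xi2))                        # truncated products nearly equal
print(all(xi2[n] > xi[n] for n in range(4, 30)))  # later probabilities increased
```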

The following example, which is an adaptation of Example 4.1, shows that if the MDP with reachability objective is not piecewise deterministic, then it can happen that each pure strategy is overtaken by another pure strategy. As a consequence, Theorem 4.1 cannot be extended to all MDPs.

Example 4.2

Consider the MDP with initial state x and reachability objective that is depicted in Fig. 3. In this MDP, the only choice of the decision maker is when to play action c, if at all, and action c can be played at most once.

In this MDP, action c leads to the target state \(s^*\) with probability 5/8, to state y with probability 1/8 and to state \(x'\) with probability 1/4. It will be easier to think about action c in the following way, which gives the same transition probabilities: After playing action c, a lottery is executed: (1) with probability 1/2 the play follows the upper part of the arrow, and thus the play moves to state \(s^*\) with probability 3/4 and to state y with probability 1/4, and (2) with probability 1/2 the play follows the bottom part of the arrow, and thus the play moves to state \(s^*\) with probability 1/2 and to state \(x'\) with probability 1/2. Action \(c'\) in state \(x'\) has a similar interpretation. Note that this MDP is not piecewise deterministic, as each of the actions c and \(c'\) leads to two non-target states with a positive probability.

The pure strategies in this MDP are \(a^\infty \), c, ac, \(a^2c\), \(\ldots \). The strategy \(a^\infty \) corresponds to strategy \(a^\infty \) in Example 4.1, and the strategy \(a^tc\) corresponds to the mixed strategy in Example 4.1 that, in state x, recommends action a up to period t and the mixed action \((\frac{1}{2},\frac{1}{2})\) at all periods after t. The reader can verify that \(a^\infty \) is overtaken by c, and each \(a^tc\) is overtaken by \(a^{t+1}c\). That is, each pure strategy is overtaken by another pure strategy.

Fig. 3 The MDP in Example 4.2

5 Reachability Objectives: Generic MDPs

As shown in Sect. 4, an overtaking optimal strategy may fail to exist. In this section, we show that this is due to non-genericity of the transition function: when transitions are generic, an overtaking optimal strategy always exists. Generic transitions are natural in various applications, where transitions are affected by random noise.

We call an MDP with reachability objective generic if (a) all transitions from states that are not the target state are positive: \(p(z \mid s,a) > 0\) for every \(s \ne s^*\), \(a \in A(s)\), and \(z \in S\), and (b) the second largest eigenvalues of \(A_1, A_2, \ldots , A_K\) are all different, where K is the number of pure stationary strategies, and \(A_j\) is the transition matrix of the Markov chain on the state space S induced by the j-th pure stationary strategy, for every \(j \in \{1,2,\ldots ,K\}\). These requirements involve only finitely many polynomial equalities and inequalities. It follows that if one randomly chooses the transition function of the MDP from \((\Delta (S))^{\sum _{s \ne s^*} |A(s)|}\) according to some probability distribution that is absolutely continuous w.r.t. the Lebesgue measure, then with probability 1 the transition function is generic.

The main result of this section is the following theorem.

Theorem 5.1

In generic MDPs with reachability objective, there is a pure stationary strategy that overtakes each other stationary strategy at each initial state.

The idea of the proof of Theorem 5.1 is as follows. Fix a state space \(S=\{1,\ldots ,n\}\), where \(n\ge 2\), and let the target state be state \(s^*=n\). As above, we assume without loss of generality that state n is absorbing. Every stationary strategy \(\sigma \) defines a transition matrix \(A_\sigma \). Under the strategy \(\sigma \), the rate of absorption to state n is exactly \(\lambda _2(A_\sigma )\), the second largest eigenvalue of \(A_\sigma \). Thus, if \(\lambda _2(A_\sigma ) < \lambda _2(A_{\sigma '})\), then \(\sigma \) overtakes \(\sigma '\) for the reachability objective. As before, let \(A_1, A_2, \ldots , A_K\) be all transition matrices that are induced by pure stationary strategies. Since the MDP is generic, the second largest eigenvalues of these matrices differ, and therefore there is one of them, say, \(A_1\), whose second largest eigenvalue is minimal. The matrix \(A_\sigma \) is in the convex hull of the matrices \(A_1, A_2, \ldots , A_K\), and we will prove that if \(A_\sigma \ne A_1\) then \(\lambda _2(A_\sigma ) > \lambda _2(A_1)\). This will imply that the pure stationary strategy that corresponds to the matrix \(A_1\) overtakes each other stationary strategy at each initial state. The proof of Theorem 5.1 consists of four steps.

  • Step 1 Proving that the second largest eigenvalue determines the overtaking relation between stationary strategies: When comparing two stationary strategies \(\sigma _A\) and \(\sigma _B\) generating the respective transition matrices A and B, \(\lambda _2(A)<\lambda _2(B)\) implies that \(\sigma _A\) overtakes \(\sigma _B\) at each initial state.

  • Step 2 Proving that the second largest eigenvalue of a transition matrix A, corresponding to a stationary strategy, is equal to the largest eigenvalue of the \((n-1)\times (n-1)\) submatrix \(A'\) that remains when we remove from A the column and the row associated with the target state: \(\lambda _2(A) = \lambda _1(A')\).

  • Step 3 Proving that for two positive square matrices A and B that differ only in one row, the largest eigenvalue of any convex combination of them cannot be lower than the minimum of the largest eigenvalues of the two matrices: for every \(\alpha \in ]0,1[\) we have \(\lambda _1(\alpha A +(1-\alpha )B)\ge \text {min}\left\{ \lambda _1(A), \lambda _1(B)\right\} \), and if \(\lambda _1(A)\ne \lambda _1(B)\) then the inequality is strict.

  • Step 4 Proving that it suffices to consider only matrices that differ in one row.

Steps 1–4 imply that the pure stationary strategy that corresponds to the transition matrix with minimal second largest eigenvalue overtakes each other stationary strategy, at each initial state.

Proof of Step 1:

This is a slightly stronger version of Theorem 3 in [21]. \(\square \)

Proof of Step 2:

Let \(\sigma \) be a stationary strategy with an \(n\times n\) transition matrix A, with entry (ij) being the probability under \(\sigma \) of moving from state i to state j. Since the target state \(s^*=n\) is absorbing and since the sum of entries in each row is equal to 1, the largest eigenvalue of A is \(\lambda _1(A)=1\), with left eigenvector \((0,0,\ldots ,0,1)\).

Consider the submatrix \(A'\) of A that arises when we delete the last column and the last row (which correspond to the target state). Since the MDP is generic, the matrix \(A'\) is positive, hence by the Perron–Frobenius Theorem, the largest eigenvalue \(\lambda _1(A')\) of the matrix \(A'\) is a real number. As the sum of entries in each row of \(A'\) is strictly less than 1, we have \(\lambda _1(A')<1\).

The proof that \(\lambda _1(A')\) is the second largest eigenvalue of the matrix A follows from the following two observations:

  1. (i)

    Any eigenvalue of \(A'\) is also an eigenvalue of A. Indeed, let \(\mu \) be an eigenvalue of \(A'\) with right eigenvector \((y_1,\ldots ,y_{n-1})\). Then \(\mu \) is an eigenvalue of A with right eigenvector \((y_1,\ldots ,y_{n-1},0)\).

  2. (ii)

    If \(\mu \ne 1\) is an eigenvalue of A, then \(\mu \) is also an eigenvalue of \(A'\). Indeed, let \(y=(y_1,\ldots ,y_n)\) be a right eigenvector of A corresponding to \(\mu \). Then, \(Ay=\mu y\). This implies \(y_n=\mu \cdot y_n\), which is only possible if \(y_n=0\). Hence, \(\mu \) is an eigenvalue of \(A'\) with eigenvector \((y_1,\ldots ,y_{n-1})\).

\(\square \)

Proof of Step 3:

The statement of Step 3 follows from the following theorem. \(\square \)

Theorem 5.2

Let A and B be two positive (all elements are positive) square matrices of the same size that differ only in the first row. For every \(\alpha \in ]0,1[\) define \(M_\alpha := \alpha A + (1-\alpha )B\). Then, \(\lambda _{1}(M_{\alpha })\ge \text {min}\left\{ \lambda _1(A),\lambda _1(B)\right\} \), and, if \(\lambda _1(A)\ne \lambda _1(B)\), then \(\lambda _{1}(M_{\alpha })> \text {min}\left\{ \lambda _1(A),\lambda _1(B)\right\} \).

Proof of Step 4:

Let \(A_1, \ldots , A_K\) be all transition matrices that are induced by pure stationary strategies. Since the MDP is generic, the second largest eigenvalues of these matrices are all different. Assume \(\lambda _2(A_1)<\lambda _2(A_i)\) for all \(i=2,\ldots ,K\). Let \(\sigma \) be the pure stationary strategy corresponding to \(A_1\).

Let \(\tau \) be a stationary strategy, and let A be the transition matrix corresponding to \(\tau \). For every \(r=0,1,\ldots ,n-1\) let \({\mathcal {B}}_r\) be the collection of all matrices that coincide with A in the first r rows and coincide with one of the matrices \(A_1, A_2, \ldots , A_K\) in the other \(n-r\) rows. Note that \({\mathcal {B}}_0 = \{A_1, A_2, \ldots , A_K\}\). Using Step 2 together with Step 3 inductively, we obtain that

$$\begin{aligned} \lambda _2(A) \ge \min _{B \in {\mathcal {B}}_{n-1}} \lambda _2(B) \ge \min _{B \in {\mathcal {B}}_{n-2}} \lambda _2(B) \ge \cdots \ge \min _{B \in {\mathcal {B}}_{0}} \lambda _2(B). \end{aligned}$$
(2)

Moreover, if \(A \notin {\mathcal {B}}_{0}\), then at least one of the inequalities in Eq. (2) is strict.

Now assume that \(\tau \ne \sigma \). If \(A \notin {\mathcal {B}}_{0}\) then \(\lambda _2(A)>\min _{B \in {\mathcal {B}}_{0}} \lambda _2(B)=\lambda _2(A_1)\), whereas if \(A \in {\mathcal {B}}_{0}\) then \(\lambda _2(A)>\lambda _2(A_1)\) by the choice of \(A_1\). Thus, by Step 1, \(\sigma \) overtakes \(\tau \) at each initial state. \(\square \)
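The argument can be illustrated numerically. The sketch below is our own (the random instance, the seed, and the helper lambda2 are not from the paper): it draws a random MDP with positive transitions, computes the second largest eigenvalue of each pure stationary strategy as the Perron root of the submatrix from Step 2, and checks that no stationary mixture attains a smaller value than the best pure stationary strategy.

```python
import itertools
import numpy as np

# Numerical illustration (not from the paper) of Steps 1-4 for a random MDP
# with states 0,...,n-2 and an absorbing target state n-1.

rng = np.random.default_rng(0)
n, n_actions = 4, 2
rows = rng.dirichlet(np.ones(n), size=(n - 1, n_actions))  # rows[s][a] in Delta(S)

def lambda2(choice_weights):
    """Second largest eigenvalue of the transition matrix induced by the
    stationary strategy choice_weights[s] = distribution over actions in s,
    computed (as in Step 2) as the Perron root of the submatrix without the
    target row and column."""
    A = np.vstack([w @ rows[s] for s, w in enumerate(choice_weights)]
                  + [np.eye(n)[-1]])        # target state is absorbing
    sub = A[:-1, :-1]
    return max(abs(np.linalg.eigvals(sub)))

pure = [lambda2([np.eye(n_actions)[a] for a in combo])
        for combo in itertools.product(range(n_actions), repeat=n - 1)]
best = min(pure)

mixed = [lambda2([rng.dirichlet(np.ones(n_actions)) for _ in range(n - 1)])
         for _ in range(200)]
print(best, min(mixed))
print(all(m >= best - 1e-9 for m in mixed))  # True: no mixture beats the best pure strategy
```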

Remark 5.1

Let \(\sigma \) be a strategy as in Theorem 5.1, \(\sigma '\ne \sigma \) be a stationary strategy and s be the initial state. One can compute a horizon T such that \(\sigma \) outperforms \(\sigma '\) beyond T, i.e., \({\mathbb {P}}_{s\sigma }(t^*\le t)> {\mathbb {P}}_{s\sigma '}(t^*\le t)\) for all \(t\ge T\), by using inequalities (A1) and (A2) in [21].

6 Sufficient Conditions for Strong Overtaking Optimality

In a discounted MDP, if for every \(s\in S\) the strategy \(\sigma _s\) is an optimal strategy at the initial state s, then the stationary strategy that plays at each state s the mixed action that \(\sigma _s\) plays at the initial period is also optimal. The next theorem aims at developing the analogous result for strongly overtaking optimal strategies in MDPs with reachability objective.

Theorem 6.1

Consider an MDP with reachability objective. Suppose that, for every initial state \(s\in S\), there is a strategy \(\sigma _s\) with the following properties:

  1. (i)

    The strategy \(\sigma _s\) is strongly overtaking optimal at the initial state s.

  2. (ii)

    Denote by \(\alpha _s\) the mixed action that \(\sigma _s\) uses at period 1 in state s. The strategy \(\sigma _s\) weakly overtakes each strategy that uses a mixed action different from \(\alpha _s\) at period 1 in the initial state s.

Let \(\alpha \) be the stationary strategy that uses the mixed action \(\alpha _s\) at state s, for all \(s\in S\). Then, \(\alpha \) is strongly overtaking optimal at each initial state.

Proof

We will use a dynamic programming argument. Fix an initial state \(s \in S\). For each state \(z\in S\), let \(H_s(z)\) denote the set of histories h such that (1) h has a positive probability under \(\sigma _s\), and (2) h ends in state z.

We show that for each \(h\in H_s(z)\) we have \(\sigma _s(h)=\alpha _z\). Let \(h\in H_s(z)\). Suppose by way of contradiction that \(\sigma _s(h)\ne \alpha _z\). Let \(\sigma '_s\) be the strategy such that (a) \(\sigma '_s\) follows \(\sigma _s\) outside the subgame that starts at h, and (b) in the subgame that starts at h, the continuation strategy \(\sigma _s[h]\) of \(\sigma _s\) is replaced by \(\sigma _z\). Then, for each period t that is larger than the last period in the history h we have \({\mathbb {P}}_{\sigma '_s}(t^*\le t)-{\mathbb {P}}_{\sigma _s}(t^*\le t)\,=\,{\mathbb {P}}_{\sigma _s}(h)\cdot \left[ {\mathbb {P}}_{z,\sigma _z}(t^*\le t)-{\mathbb {P}}_{z,\sigma _s[h]}(t^*\le t)\right] \). By property (ii), \(\sigma _z\) weakly overtakes \(\sigma _s[h]\) for initial state z. Therefore, the quantity \({\mathbb {P}}_{z,\sigma _z}(t^*\le t)-{\mathbb {P}}_{z,\sigma _s[h]}(t^*\le t)\) is non-negative for all large t and strictly positive for infinitely many t. Thus, the same holds for \({\mathbb {P}}_{\sigma '_s}(t^*\le t)-{\mathbb {P}}_{\sigma _s}(t^*\le t)\), and hence \(\sigma _s\) is weakly overtaken by \(\sigma '_s\). This contradicts property (i).

Hence, each history h has the same probability under \(\sigma _s\) and under \(\alpha \). As \(\sigma _s\) is strongly overtaking optimal at the initial state s, so is the strategy \(\alpha \). \(\square \)

7 Safety Objectives

The model of MDPs with safety objective is similar to the model of MDPs with reachability objective, except that the decision maker’s objective is to reach the state \(s^*\) with as low a probability as possible.

Overtaking optimality. A strategy \(\sigma \) overtakes a strategy \(\sigma '\) at the initial state s if there is \(T\in {\mathbb {N}}\) such that \({\mathbb {P}}_{s\sigma }(t^*\le t)\,<\,{\mathbb {P}}_{s\sigma '}(t^*\le t)\) for all \(t\ge T\). A strategy \(\sigma \) is overtaking optimal at the initial state s if there is no strategy that overtakes \(\sigma \) at that initial state.

Strong overtaking optimality. A strategy \(\sigma \) weakly overtakes a strategy \(\sigma '\) at the initial state s if there is \(T\in {\mathbb {N}}\) such that for all \(t\ge T\) we have \({\mathbb {P}}_{s\sigma }(t^*\le t)\,\le \, {\mathbb {P}}_{s\sigma '}(t^*\le t)\) with strict inequality for infinitely many t. If \(\sigma \) overtakes \(\sigma '\) at the initial state s then \(\sigma \) also weakly overtakes \(\sigma '\) at that initial state. A strategy \(\sigma \) is strongly overtaking optimal at the initial state s if no strategy weakly overtakes \(\sigma \) at that initial state. A strongly overtaking optimal strategy at the initial state s is also overtaking optimal at that state.

Results. Theorem 4.1 remains valid for safety objectives. The proof requires the following changes. (1) We can still assume that the MDP \({\mathcal {M}}\) has no state \(s\ne s^*\) and action \(a\in A(s)\) with \(p(s^*\mid s,a)=1\). Indeed, such an action can be deleted, and if all actions in a state s are deleted, then we can delete the state s and replace each transition to s by a transition to \(s^*\). (2) Because now the decision maker prefers low probabilities of reaching state \(s^*\), the payoffs in the auxiliary MDP \({\mathcal {M}}'\) are defined to be the opposite: \(u'(s,a)=\log (p(\omega (s,a) \mid s,a))\).

Theorems 5.1 and 6.1 remain valid for safety objectives, with analogous proofs. Similarly to Example 4.1, the following MDP with safety objective has no overtaking optimal strategy: Take the MDP in Example 4.1 and replace \(b:\frac{3}{4}\) with b : 0 and d : 0 with \(d:\frac{3}{4}\). In this MDP, b still carries a lower immediate risk than d, so the decision maker again benefits from playing b just before the horizon, and an argument analogous to that of Claim 3 applies.

8 Conclusions

It remains an open problem whether generic MDPs with reachability objective admit a pure stationary strategy that is strongly overtaking optimal (cf. Theorem 5.1). The difficulty is that, once non-stationary strategies are allowed, the transition probabilities generally cannot be described by a single transition matrix.

When using a stationary strategy, it is sometimes important to study the probability distribution of the current state at any period t, conditional on the state \(s^*\) not having been reached yet. This conditional distribution converges, under some conditions, to a limit called a quasi-stationary distribution. This convergence and its speed are a subject of study in the literature; see, e.g., [22].
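For a fixed stationary strategy, this conditional distribution can be computed directly: it converges to the normalized left Perron eigenvector of the submatrix obtained by deleting the row and column of \(s^*\). The sketch below is our own illustration (the matrix A_sub is a hypothetical example, not taken from the paper or from [22]).

```python
import numpy as np

# Illustration (not from the paper): the distribution of the state at period t,
# conditional on s* not having been reached, converges to the quasi-stationary
# distribution, i.e., the normalized left Perron eigenvector of the submatrix.

A_sub = np.array([[0.50, 0.30, 0.10],   # rows: non-target states; the missing
                  [0.20, 0.60, 0.10],   # mass in each row is the probability
                  [0.25, 0.25, 0.40]])  # of jumping to the target state s*

dist = np.array([1.0, 0.0, 0.0])        # start in the first state
for _ in range(200):
    dist = dist @ A_sub                 # unnormalized: P(state = ., t* > t)
conditional = dist / dist.sum()

# Left Perron eigenvector of A_sub, normalized to a probability vector.
eigvals, eigvecs = np.linalg.eig(A_sub.T)
perron = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
perron = np.abs(perron) / np.abs(perron).sum()

print(conditional)  # conditional distribution after many periods
print(perron)       # quasi-stationary distribution; the two nearly coincide
```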