1 Introduction

This paper presents an axiomatic approach to finite Markov decision processes (MDPs) where the discount rate is zero. MDPs comprise a broad class of stochastic dynamic decision problems that have been studied extensively over the past several decades. To keep the discussion as elementary as possible, we will work within the framework of Blackwell’s (1962) classic paper. For extensions of this framework and discussion of its many uses, the reader is referred to Arapostathis et al. (1993), Hernández-Lerma and Vega-Amaya (1998), Rosenberg et al. (2002) and the books by Feinberg and Shwartz (2002), Piunovskiy (2013) and Puterman (1994).

In its simplest form, an MDP has the following ingredients: a state space \({{\mathscr {S}}}\), an action space \({{\mathscr {A}}}\), a transition probability function \(p_a(s' \vert s)\) on \({{\mathscr {S}}}\) for each \(a \in {{\mathscr {A}}}\), and a real-valued function \(r(s, a)\) on \({{\mathscr {S}}}\times {{\mathscr {A}}}\). Here \({{\mathscr {S}}}\) represents possible states of a system (a manufacturing chain, a biological system, a natural resource, etc.) and \({{\mathscr {A}}}\) represents choices available to an agent (the decision maker). Unless stated otherwise, \({{\mathscr {S}}}\) and \({{\mathscr {A}}}\) are finite sets. At discrete times \(t=1, 2, 3, \ldots \), the agent observes the state and selects an element from \({{\mathscr {A}}}\). If the system is in \(s \in {{\mathscr {S}}}\) and \(a \in {{\mathscr {A}}}\) is chosen, then a reward of \(r(s, a)\) is received and the system moves to \(s'\) with probability \(p_a(s' \vert s)\). Rewards are discounted so that a reward of one unit at time t has present value \(\beta ^t\), where \(0<\beta \le 1\). The problem is to choose a policy (i.e., a rule for selecting actions at all future times) that maximizes the expected net present value of all future rewards.
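To fix ideas, the following is a minimal computational sketch (not part of the paper) of these ingredients: one transition matrix per action and a reward function on state–action pairs. All numerical values are hypothetical and serve only to illustrate the data format.

```python
# Sketch of the MDP ingredients above (hypothetical data, for illustration only):
# a transition matrix p_a(s'|s) for each action a and a reward function r(s, a).
import numpy as np

n_states, n_actions = 2, 2
# p[a, s, s'] = p_a(s' | s); every row p[a, s, :] is a probability distribution.
p = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [1.0, 0.0]]])
# r[s, a] = reward received when action a is chosen in state s.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

assert p.shape == (n_actions, n_states, n_states)
assert np.allclose(p.sum(axis=2), 1.0)   # each p_a(. | s) sums to one
```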

This problem is particularly difficult when \(\beta =1\). To begin with, it is not clear what it means to maximize net present value in this case. The difficulty is that the total value of a policy is typically infinite if \(\beta =1\). There is a natural sense in which a policy is maximal if it generates a sequence of cumulative expected rewards that eventually dominates that of any other policy. This leads to the intuitive notion of overtaking optimality (formally defined in Sect. 3). It is well known, however, that an overtaking optimal policy need not exist. A less selective criterion is based on the expected long-run average reward of a policy. But this criterion does not differentiate between streams of expected rewards which might have very different appeal to the decision maker.

Blackwell (1962) introduced the 1-optimality criterion, which evaluates streams of expected rewards on the basis of their Abel means. He also established the existence of 1-optimal policies that are stationary (i.e., for which the action chosen at time t depends only on the state of the system at time t). Subsequently, Veinott (1966) introduced what is often referred to as the average overtaking criterion, in which Cesàro means take the place of Abel means. The Blackwell–Veinott criteria are able to select between policies that the average reward criterion does not distinguish. However, the literature has not addressed the following questions:

Q1. Are the Blackwell–Veinott criteria the only selective criteria which admit optimal policies in the no discounting case?

Q2. How can these criteria be described axiomatically?

Q3. Under which assumptions on a decision maker’s preferences do optimal policies exist?

Our main results are summarized in Theorems 1, 2 and 3. Theorem 1 shows that, subject to certain constraints, Q1 has an affirmative answer. Theorems 2 and 3 provide two sets of axioms that characterize the average overtaking and 1-optimality criteria on the reward streams generated by stationary policies. The second of these two results complements a theorem of Jonsson and Voorneveld (2018) and uses the compensation principle as a key axiom. Finally, we obtain a partial answer to Q3 as a corollary of these results.

2 Preliminaries

Our finite MDP has state space \({{\mathscr {S}}}\) and action space \({{\mathscr {A}}}\). At times \(t=1, 2, 3, \ldots \), the agent observes the state of the system and chooses an element a from \({{\mathscr {A}}}\). We assume that this choice depends on the history of the system only through its present state. Thus, the action chosen at time t is an element of F, the set of all functions from \({{\mathscr {S}}}\) to \({{\mathscr {A}}}\). Each \(f\in F\) has a corresponding transition matrix, \({{\textbf{Q}}}(f)\), and reward vector, \({{\textbf{r}}}(f)\). With the notation from the introduction, if the system is in \(s \in {{\mathscr {S}}}\) and f is used, then a reward of \({{\textbf{r}}}(f)_s=r(s, f(s))\) is received and the system moves to \(s'\) with probability \({{\textbf{Q}}}(f)_{s, s'}=p_{f(s)}(s' \vert s)\). Rewards may be interpreted, for example, as payouts of a single good received by an infinitely lived consumer, or as the utilities of future generations.

A policy is a sequence \((f_1, f_2, f_3, \dots )\) in F. Using policy \(\pi =(f_1, f_2, f_3, \dots )\) means that, for each \(t=1, 2, 3, \ldots \), \(f_t(s)\) is selected from \({{\mathscr {A}}}\) if the system is in state s. A policy is stationary if using it implies that the action chosen at time t depends on the state of the system at time t, but not on t itself. Formally, a stationary policy can be written \((f, f, f, \ldots )\) for some \(f \in F\). We denote the set of all policies by \(\varPi \) and the set of all stationary policies by \({\varPi _F}\).

Given an initial state \(s\in {{\mathscr {S}}}\), the sequence of expected rewards that \(\pi \in \varPi \) generates is denoted \(u(s, \pi )\). If \(\pi =(f_1, f_2, f_3, \dots )\) and \(u=(u_1, u_2, u_3, \ldots )=u(s, \pi )\), then

$$\begin{aligned} u_1&=[{{\textbf{r}}}(f_1)]_s, \nonumber \\ u_t&=[{{\textbf{Q}}}(f_1) \cdot \ldots \cdot {{\textbf{Q}}}(f_{t-1})\cdot {{\textbf{r}}}(f_t)]_s, \; t\ge 2. \end{aligned}$$
(1)

Let \({{\mathscr {U}}_F}\) be the set of sequences generated by stationary policies. That is, \(u \in {{\mathscr {U}}_F}\) if and only if \(u=u(s, \pi )\) for some \(s\in {{\mathscr {S}}}\) and \(\pi \in {\varPi _F}\).
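The sketch below (a hypothetical helper, not from the paper) computes the first T terms of the expected-reward sequence in (1) for a stationary policy \((f, f, f, \ldots )\), given its transition matrix \({{\textbf{Q}}}(f)\) and reward vector \({{\textbf{r}}}(f)\) as NumPy arrays.

```python
# Sketch: expected rewards (1) for a stationary policy (f, f, f, ...).
# Q is the transition matrix Q(f), r_f the reward vector r(f); both are assumed inputs.
import numpy as np

def stationary_rewards(Q, r_f, s, T):
    """Return (u_1, ..., u_T) with u_1 = r(f)_s and u_t = [Q(f)^(t-1) r(f)]_s for t >= 2."""
    dist = np.zeros(len(r_f))
    dist[s] = 1.0                    # state distribution at time t, starting from s
    u = []
    for _ in range(T):
        u.append(float(dist @ r_f))  # expected reward received at time t
        dist = dist @ Q              # one transition step under Q(f)
    return u
```

For a nonstationary policy \((f_1, f_2, f_3, \ldots )\), the same loop applies with \({{\textbf{Q}}}(f_t)\) used at step t, which is exactly the recursion behind (1).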

The agent needs to compare \(u(s, \pi )\) and \(u(s, \pi ')\) for different \(s \in {{\mathscr {S}}}\) and \(\pi , \pi ' \in \varPi \). For convenience, we consider (incomplete) preferences on the set of all bounded sequences, which is denoted by \({{\mathscr {U}}}\). We reserve the notation \(\succsim \) for a preorder on \({{\mathscr {U}}}\) (i.e., a reflexive and transitive binary relation), where \(u \succsim v\) means that u is at least as good as v. We say that \(\succsim \) compares u and v if either \(u \succsim v\) or \(v \succsim u\), and we write \(\lnot u \succsim v\) to indicate that u is not at least as good as v. As usual, \(u \succ v\) denotes strict preference (\(u \succsim v\), but \(\lnot v \succsim u\)) and \(u \sim v\) denotes indifference (\(u \succsim v\) and \(v \succsim u\)).

In this framework, preferences are thus defined over sequences of expected rewards. That is, it is assumed that preferences over random rewards can be reduced to preferences over expected rewards. The framework therefore cannot capture risk-averse preferences. For risk measures and risk-sensitive control of Markov processes, see Bäuerle and Rieder (2014), Ruszczyński (2010) and the references cited there.

3 A motivating example

For background, we begin by reviewing how different ways of comparing reward streams may or may not yield optimal policies. The comparisons often involve sums over a finite horizon. For \(u \in {{\mathscr {U}}}\) and \(T \in {\mathbb {N}}\), we let

$$\begin{aligned} \sigma _T(u)=\sum _{t=1}^Tu_t, \quad \sigma (u)=(\sigma _1(u), \sigma _2(u), \sigma _3(u), \ldots ). \end{aligned}$$
(2)

A policy \(\pi ^*\in \varPi \) is overtaking optimal if, for every \(\pi \in \varPi \),

$$\begin{aligned} u(s, \pi ^*) \succsim _{\text {O}} u(s, \pi ) \text { for every }s\in {{\mathscr {S}}}, \end{aligned}$$
(3)

where

$$\begin{aligned} u \succsim _{\text {O}} v \; \Longleftrightarrow \; \liminf _{T\rightarrow \infty } \sigma _T(u-v) \ge 0. \end{aligned}$$
(4)

This criterion has the advantage of being intuitively plausible. It is also the strongest among the most commonly discussed criteria for undiscounted MDPs. Its drawback is that an optimal policy need not exist (Brown 1965; Gale 1967). The following is a variation of an example from Denardo and Miller (1968). We return to this example in Sect. 6.

Example 1

Figure 1 displays the transition graph of a deterministic MDP with \({{\mathscr {A}}}=\{a_1, a_2\}\) and \({{\mathscr {S}}}=\{s_1, s_2, s_3\}\). If the system starts in state \(s_1\) and \(a_1\) is chosen, then the system moves to \(s_2\) and a reward of 2 is received; if \(a_2\) is chosen, the system moves to \(s_3\) and a reward of \(c \in {\mathbb {R}}\) is received. Once the system reaches \(s_2\) or \(s_3\), it starts to alternate between these two states, and it does not matter how the agent acts. A reward of 0 is received when the system goes from \(s_2\) to \(s_3\), and a reward of 2 is received when it goes from \(s_3\) to \(s_2\).

Suppose that the system starts in \(s_1\). Let u be the reward stream that is generated if \(a_1\) is chosen, and let v be the stream that obtains if \(a_2\) is chosen. Then

$$\begin{aligned} u=(2, 0, 2, 0, 2, \dots ) \quad \text { and } \quad v=(c, 2, 0, 2, 0, 2, \dots ). \end{aligned}$$

We have \(\sigma _T(u-v)=2-c\) if T is odd and \(\sigma _T(u-v)=-c\) if T is even. Hence, if \(0<c<2\), then \(\lnot u \succsim _{\text {O}} v\) and \(\lnot v \succsim _{\text {O}} u\). This means that there is no overtaking-optimal policy if \(0<c<2\). \(\square \)

Fig. 1: A deterministic MDP where no overtaking-optimal policy exists
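A quick numerical check of Example 1 (a sketch, not part of the paper): for \(0<c<2\) the partial sums \(\sigma _T(u-v)\) oscillate between \(2-c\) and \(-c\), so neither stream overtakes the other.

```python
# Verify the partial sums in Example 1 for a value of c with 0 < c < 2.
c, T = 1.0, 12
u = [2, 0] * (T // 2)               # (2, 0, 2, 0, ...)
v = ([c] + [2, 0] * (T // 2))[:T]   # (c, 2, 0, 2, 0, ...)
sigma, acc = [], 0.0
for ut, vt in zip(u, v):
    acc += ut - vt
    sigma.append(acc)
print(sigma)  # alternates between 2 - c and -c: [1.0, -1.0, 1.0, -1.0, ...] for c = 1
```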

Note that the MDP in Example 1 still does not admit an overtaking optimal policy if attention is restricted to stationary policies. We remark that this limitation of overtaking optimality is not confined to deterministic MDPs. There are, indeed, ergodic MDPs where no overtaking-optimal policy exists within the class of stationary policies (Nowak and Vega-Amaya 1999).

Let us also note that optimal policies often do exist if we adopt an alternative definition of overtaking optimality, according to which \(\pi ^*\in \varPi \) is optimal if there is no \(\pi \in \varPi \) such that

$$\begin{aligned} u(s, \pi ) \succ _{\text {O}} u(s, \pi ^*) \text { for every }s\in {{\mathscr {S}}}. \end{aligned}$$

(In Example 1, all policies are optimal in this sense if \(0<c<2\).) This weaker form of overtaking optimality has been used frequently in studies of optimal economic growth (Brock 1970b; Brock and Mirman 1973; Basu and Mitra 2007). It is closely related to the notion of sporadic overtaking optimality studied in the operations research literature (Stern 1984; Flesch et al. 2017). Here we have adopted the definition of overtaking optimality that this literature most frequently employs.

Generalizing the definition of overtaking optimality in (3)–(4) to an arbitrary preorder \(\succsim \), let us say that \(\pi ^*\in \varPi \) is \(\succsim \)-optimal or optimal with respect to \(\succsim \) if, for every \(\pi \in \varPi \),

$$\begin{aligned} u(s, \pi ^*) \succsim u(s, \pi ) \text { for every }s\in {{\mathscr {S}}}. \end{aligned}$$
(5)

The preorders associated with average reward optimality, average overtaking optimality and 1-optimality are defined as follows:

$$\begin{aligned}&{{\textbf {(average reward)}}} \quad u \succsim _{\text {AR}} v \Longleftrightarrow \liminf _{T\rightarrow \infty }\frac{1}{T}\sigma _T(u-v) \ge 0 \end{aligned}$$
(6)
$$\begin{aligned}&{{\textbf {(average overtaking)}}} \quad u \succsim _{\text {AO}} v \Longleftrightarrow \liminf _{T\rightarrow \infty } \frac{1}{T}\sum _{t=1}^T\sigma _t(u-v) \ge 0 \end{aligned}$$
(7)
$$\begin{aligned}&{{\textbf {(1-optimality)}}} \quad u \succsim _1 v \Longleftrightarrow \liminf _{\delta \rightarrow 1^{-}} \sum _{t=1}^\infty \delta ^{t}\cdot (u_t-v_t)\ge 0. \end{aligned}$$
(8)

The average reward criterion is the most studied criterion for undiscounted MDPs. The standard criticism of this criterion is that it ignores improvements in any finite number of time periods. In Example 1, for instance, it is average reward-optimal to choose \(a_1\) in state \(s_1\) even if the value of c is very large.

If u and v are the streams in Example 1, the Cesàro sum of \(\sum _{t=1}^\infty (u_t-v_t)\) is \(1-c\). Hence, it is average overtaking-optimal to choose \(a_1\) if and only if \(c\le 1\). It is well known that average overtaking optimality is equivalent to 1-optimality in finite MDPs (Lippman 1969). In general, any average overtaking-optimal policy is 1-optimal, but a 1-optimal policy need not be average overtaking optimal (see, e.g., Bishop et al. 2014).
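As a rough numerical illustration (a truncated-horizon sketch, so only indicative), the quantities in (6)–(8) can be approximated for the Example 1 streams. With \(c=1.5\), the average-reward difference vanishes while both the Cesàro mean of the partial sums and the Abel sum come out close to \(1-c=-0.5\).

```python
# Approximate the criteria (6)-(8) for the Example 1 streams with c = 1.5.
c, T = 1.5, 200_000
u = [2.0, 0.0] * (T // 2)
v = ([c] + [2.0, 0.0] * (T // 2))[:T]
d = [ut - vt for ut, vt in zip(u, v)]

sigma, acc = [], 0.0
for dt in d:
    acc += dt
    sigma.append(acc)

avg_reward     = sigma[-1] / T             # (1/T) * sigma_T(u - v)        -> 0
avg_overtaking = sum(sigma) / T            # (1/T) * sum_t sigma_t(u - v)  -> 1 - c
delta = 1 - 1e-4
abel = sum(delta**t * dt for t, dt in enumerate(d, start=1))   # -> 1 - c as delta -> 1
print(round(avg_reward, 4), round(avg_overtaking, 4), round(abel, 4))
```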

To sum up, while the average reward criterion is unselective, the overtaking criterion is overselective. One way to formulate the first question (Q1) from the introduction is to ask if the average overtaking criterion is the least selective criterion that admits optimal policies. To state this question in a precise way, we will formulate a set of conditions which we can plausibly require of a selective criterion.

4 Axioms

This section provides five conditions (called axioms) on preorders that are known from the literature. The five conditions are satisfied by the preorders associated with the overtaking criterion, the average overtaking criterion and the 1-optimality criterion (see Jonsson and Voorneveld 2018, p. 28). They may be viewed as conditions that can be plausibly required of a selective criterion.

The first axiom, A1, is a standard monotonicity requirement. It asserts that preferences are positively sensitive to improvements in each time period. Preorders that meet this requirement avoid the standard criticism of the average reward criterion.

A1. For all \(u,v \in {{\mathscr {U}}}\), if \(u_t\ge v_t\) for all t and \(u_t>v_t\) for some t, then \(u \succ v\).

This axiom says, in particular, that the agent prefers a certain reward of 2 units to a certain reward of 1 unit. In the present framework, it also says that the agent prefers a lottery that pays a reward of 1 or 4 units with equal probabilities (and hence has an expected reward of 2.5 units) to a certain reward of 2 units. As indicated in Sect. 2, such assumptions are inappropriate for risk-averse agents.

The second axiom, A2, formalizes the assumption that a reward of one unit at time \(t>1\) is worth the same as a reward of one unit at \(t=1\) (i.e., that \(\beta =1\)). In the case when rewards represent utilities (or consumption) of future generations, A2 is the axiom of anonymity, which ensures the equal treatment of generations.

A2. For all \(u,v \in {{\mathscr {U}}}\), if u can be obtained from v by interchanging two entries of v, then \(u \sim v\).

The next axiom is a relaxation of the consistency requirement used in Brock’s (1970a) characterization of the overtaking criterion. For \(n \ge 1\) and \(u \in {{\mathscr {U}}}\), let \(u_{[n]}\) denote the sequence obtained from u by replacing \(u_t\) with 0 for all \(t>n\). Our third axiom can then be stated as follows.

A3. For all \(u,v \in {{\mathscr {U}}}\), if there exists \(N > 1\) such that \(u_{[n]} \succ v_{[n]}\) for all \(n\ge N\), then \(u \succsim v\).

That the average reward criterion satisfies A3 is a trivial consequence of the fact that \(u_{[n]} \sim _{\text {AR}} v_{[n]}\) for all \(u, v \in {{\mathscr {U}}}\) and every \(n\ge 1\). The preorders in (4), (7) and (8) have the stronger property that u is at least as good as v if \(u_{[n]}\) is merely at least as good as \(v_{[n]}\) for all sufficiently large n; this property does not hold for the average reward criterion.

The fourth axiom asserts that for reward streams \(u, v \in {{\mathscr {U}}}\), if both streams are postponed one period and an arbitrary reward of \(c \in {\mathbb {R}}\) is assigned to the first period, then the resulting streams, \((c, u)=(c, u_1, u_2, u_3, \ldots )\) and \((c, v)=(c, v_1, v_2, v_3, \ldots )\), should be ranked in the same way as u and v.

A4. For all \(u,v \in {{\mathscr {U}}}\) and \(c \in {\mathbb {R}}\), \((c, u) \succsim (c, v)\) if and only if \(u \succsim v\).

This axiom was proposed as a fundamental condition by Koopmans (1960) in his pioneering work on intertemporal choice. It is usually referred to as stationarity (Asheim et al. 2010; Bleichrodt et al. 2008) or independent future (Fleurbaey and Michel 2003; Mitra 2018).

Our last axiom is an adaptation of the standard assumption of interpersonal comparability from social choice theory (see, e.g., d’Aspremont and Gevers 1977). In the intertemporal setting, it asserts that preferences are invariant to changes in the origins of the utility indices used in different periods. This condition has been referred to as zero independence (Moulin 1988) and translation scale invariance (Asheim et al. 2010).

A5. For all \(u,v, \alpha \in {{\mathscr {U}}}\), if \(u \succsim v\), then \(u+\alpha \succsim v+\alpha \).

Note that a preorder \(\succsim \) which satisfies A5 has the property that if \(u, v, u', v' \in {{\mathscr {U}}}\) are such that \(u-v=u'-v'\), then \(u \succsim v\) if and only if \(u' \succsim v'\). (The converse is also true.) This fact will be used repeatedly below.

5 A rigidity result

If we view the axioms from the previous section as conditions which we expect a selective criterion to satisfy, then the first question from the introduction can be stated as follows: If \(\succsim \) satisfies A1–A5, is every \(\succsim \)-optimal policy average overtaking-optimal (and hence 1-optimal)? Theorem 1 shows that this question has an affirmative answer if attention is restricted to stationary policies. This restriction does not trivialize any of the questions (Q1–Q3) from the introduction. In fact, replacing \(\varPi \) with \({\varPi _F}\) in the preceding discussion would not affect what has been said so far in an essential way.

Theorem 1

Suppose that \(\succsim \) satisfies A1–A5. If a policy is \(\succsim \)-optimal within \({\varPi _F}\), then it is average overtaking-optimal within \({\varPi _F}\).

Proof

The proof exploits the fact that under certain conditions on \(u \in {{\mathscr {U}}}\), if a preorder \(\succsim \) satisfies A1–A5, then

$$\begin{aligned} u \succsim (0, u) \text { implies } {\bar{u}} \ge 0, \end{aligned}$$
(9)

where

$$\begin{aligned} {\bar{u}}\equiv \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{t=1}^nu_t \end{aligned}$$
(10)

is the average of u. The usefulness of (9) is explained by the fact that if \(\succsim \) satisfies A5 and \(u, v \in {{\mathscr {U}}}\) are such that \(\sigma \equiv \sigma (u-v)\) is bounded, then

$$\begin{aligned} u \succsim v\text { if and only if }\sigma \succsim (0, \sigma ). \end{aligned}$$
(11)

This is because \(u-v=\sigma -(0, \sigma )\). Applying (9) with \(\sigma \) in the role of u, we see that \(u \succsim v\) implies \({\bar{\sigma }}\ge 0\). Since \({\bar{\sigma }}\) is the Cesàro sum of \(\sum _{t=1}^\infty (u_t-v_t)\), this means that \(u \succsim v\) implies \(u \succsim _{\text {AO}} v\).

The conditions on \(u \in {{\mathscr {U}}}\) which ensure (9) are that (i) the limit (10) exists and (ii) for every \(\varepsilon >0\) there exists an N such that the average of any \(n\ge N\) consecutive coordinates of u differs from \({\bar{u}}\) by at most \(\varepsilon \)—that is,

$$\begin{aligned} \left| \frac{1}{n}\sum _{t=t_0}^{t_0+n-1}u_t-{\bar{u}}\right| <\varepsilon \text { for every }t_0 \in {\mathbb {N}}. \end{aligned}$$

We say that \(u \in {{\mathscr {U}}}\) is regular if the two conditions are met.

Lemma 1

(Jonsson and Voorneveld 2018, Proposition 1) Suppose that \(\succsim \) satisfies A1–A5. If \(u \in {{\mathscr {U}}}\) is regular and \(c \in {\mathbb {R}}\), then

$$\begin{aligned} (c, u) \succsim u \text { implies }c\ge {\bar{u}} \end{aligned}$$

and

$$\begin{aligned} u \succsim (c, u) \text { implies }c\le {\bar{u}}. \end{aligned}$$

Now, for every \(\pi \in {\varPi _F}\), \(u(s, \pi )\) is regular for each \(s \in {{\mathscr {S}}}\). This follows from the well-known fact that the reward stream generated by a stationary policy can be written as the sum of a periodic sequence and a summable sequence. (The stream generated by \((f, f, f, \ldots )\) is defined by powers of \({{\textbf{Q}}}(f)\) acting on \({{\textbf{r}}}(f)\)—see (1). By the Perron–Frobenius theorem for non-negative matrices, the sequence \({{\textbf{Q}}}(f) \cdot {{\textbf{r}}}(f), {{\textbf{Q}}}(f)^2\cdot {{\textbf{r}}}(f), {{\textbf{Q}}}(f)^3\cdot {{\textbf{r}}}(f), \ldots \) approaches a periodic orbit at an exponential rate.) To apply the arguments preceding Lemma 1, we need to know that \(\sigma (u-v)\) is bounded and regular if u and v are generated by stationary policies. We have the following result.
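For a numerical illustration of this decomposition (a sketch with a hypothetical three-state chain, not part of the proof), the vectors \({{\textbf{Q}}}(f)^t\cdot {{\textbf{r}}}(f)\) quickly settle into a periodic orbit:

```python
# Hypothetical chain: a transient first state feeding a two-cycle between states 2 and 3.
import numpy as np

Q = np.array([[0.0, 0.5, 0.5],   # from state 1, move to state 2 or 3 with prob. 1/2
              [0.0, 0.0, 1.0],   # state 2 -> state 3
              [0.0, 1.0, 0.0]])  # state 3 -> state 2
r_f = np.array([2.0, 0.0, 2.0])

orbit, x = [r_f], r_f
for _ in range(12):
    x = Q @ x
    orbit.append(x)
# The gap between vectors two steps apart dies out (here after a single transient step),
# so each coordinate stream is a periodic sequence plus a summable one.
print([float(np.max(np.abs(orbit[t + 2] - orbit[t]))) for t in range(10)])
```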

Lemma 2

Suppose that u and v are generated by stationary policies, and let \(\sigma \equiv \sigma (u-v)\) be defined as in (2). If \({\bar{u}}={\bar{v}}\), then \(\sigma \in {{\mathscr {U}}}\) is regular.

Proof

Write

$$\begin{aligned} u=x^{(u)}+y^{(u)}, \quad v=x^{(v)}+y^{(v)}, \end{aligned}$$
(12)

where \(x^{(u)}\) and \(x^{(v)}\) are periodic and where \(y^{(u)}\) and \(y^{(v)}\) are summable. Let p be the product of the periods of \(x^{(u)}\) and \(x^{(v)}\). Then \({\bar{u}}={\bar{x}}^{(u)}=\sigma _p(x^{(u)})/p\) and \({\bar{v}}={\bar{x}}^{(v)}=\sigma _p(x^{(v)})/p\). So, if \({\bar{u}}={\bar{v}}\), then \(\sigma _p(x^{(u)}-x^{(v)})=0\). This means that \(\sigma (x^{(u)}-x^{(v)})\) is periodic. The sequence \(\sigma (y^{(u)}-y^{(v)})\) is convergent by our choice of \(y^{(u)}\) and \(y^{(v)}\). Hence, \(\sigma =\sigma (u-v)\) is the sum of a periodic sequence and a convergent sequence. This means that \(\sigma \in {{\mathscr {U}}}\) is regular. \(\square \)

To complete the proof of Theorem 1, let \(\succsim \) be a preorder that satisfies A1–A5, and suppose that \(\pi ^*\) is \(\succsim \)-optimal within \({\varPi _F}\). Let \(u=u(s,\pi ^*)\) and \(v=u(s,\pi )\), where \(\pi \in {\varPi _F}\) and \(s \in {{\mathscr {S}}}\) are arbitrary, and let \(\sigma \equiv \sigma (u-v)\) be defined as in (2). Since \(\pi ^*\) is \(\succsim \)-optimal within \({\varPi _F}\), \(u \succsim v\). We need to show that \(u \succsim _{\text {AO}} v\). If \({\bar{u}}={\bar{v}}\), then this follows from Lemmas 1 and 2 and the remarks preceding Lemma 1. It remains to show that \(u \succsim _{\text {AO}} v\) if \({\bar{u}} \ne {\bar{v}}\). It is enough to show that \({\bar{u}} > {\bar{v}}\), since this clearly implies \(u \succ _{\text {AO}} v\). Given any preorder \(\succsim '\) that satisfies A1–A5, if \(x \in {{\mathscr {U}}}\) and \(y\in {{\mathscr {U}}}\) are such that \({\bar{x}} > {\bar{y}}\), then \(x \succ ' y\) (see Basu and Mitra 2007 or Jonsson and Voorneveld 2015). Thus, if \({\bar{u}} \ne {\bar{v}}\), then we must have \({\bar{u}} > {\bar{v}}\). (If it were the case that \({\bar{v}} > {\bar{u}}\), then we would have \(v \succ u\), which contradicts the assumption that \(u \succsim v\).) We can therefore conclude that \(u \succ _{\text {AO}} v\), and the proof of Theorem 1 is thereby complete. \(\square \)

6 Characterizations

One goal of this paper is to provide a preference foundation for finite MDPs. In the case of a positive discount rate, the well known preference foundation of Koopmans (1960, 1972) is easily adapted to the present setting. The literature provides characterizations of two criteria for the no discounting case: the overtaking criterion (Asheim and Tungodden 2004; Basu and Mitra 2007; Brock 1970a) and the average reward criterion (Kothiyal et al. 2014; Marinacci 1998; Khan and Stinchcombe 2018; Pivato 2022). The overtaking criterion is characterized by axioms that are similar to those in Sect. 4. The characterizations of the average reward criterion, which does not satisfy A1, involve further conditions of permutability and numeric representability. These conditions are well known to be incompatible with A1 in the no discounting case (Basu and Mitra 2003; Fleurbaey and Michel 2003).

In this section, we axiomatize the preorders associated with the average overtaking criterion and the 1-optimality criterion. As in the previous section, we restrict attention to stationary policies.

6.1 First characterization

The axioms from Sect. 4 do not characterize \(\succsim _{\text {AO}}\). Indeed, the preorder associated with the overtaking criterion satisfies A1–A5, and \(\succsim _{\text {O}}\) does not agree with \(\succsim _{\text {AO}}\) on \({{\mathscr {U}}_F}\). As illustrated in Example 1, for \(\succsim _{\text {AO}}\)-optimality to imply \(\succsim \)-optimality, it is necessary that \(\succsim \) compares at least some pairs of streams that \(\succsim _{\text {O}}\) does not compare.

Insisting that all pairs \(u, v \in {{\mathscr {U}}}\) be comparable has unwanted consequences. In fact, it is not possible to give an explicit definition of a preorder, satisfying A1 and A2, that compares all pairs of sequences of 0s and 1s (Lauwers 2010). On the other hand, \(\succsim _{\text {AO}}\) compares each pair \(u, v \in {{\mathscr {U}}_F}\) and coincides with \(\succsim _{1}\) on this domain. Thus, the following condition is compatible with A1–A5:

A6. For all \(u, v \in {{\mathscr {U}}_F}\), \(\succsim \) compares u and v.

If \(\succsim \) satisfies A1–A6 and \(u, v \in {{\mathscr {U}}_F}\), then \(u \succ v\) if and only if \(u \succ _{\text {AO}} v\). To conclude that the symmetric parts of \(\succsim \) and \(\succsim _{\text {AO}}\) agree, further assumptions are needed. A sufficient condition asserts that, for all \(u, v \in {{\mathscr {U}}}\), if \((\varepsilon +u_1, u_2, u_3, \ldots ) \succsim v\) for every \(\varepsilon >0\), then \(u \succsim v\). This condition can be formalized by defining a metric on \({{\mathscr {U}}}\) and demanding that \(\{v \in {{\mathscr {U}}}:u \succsim v\}\) be a closed subset of \({{\mathscr {U}}}\) for every \(u \in {{\mathscr {U}}}\). Almost any metric from the literature will do (e.g., Banerjee and Mitra 2008, p. 5). For example, let \(d(u, v)=\min \{1, \sum _{i=1}^\infty \vert u_i-v_i \vert \}\). The continuity requirement can then be stated as follows.

A7. For every \(u \in {{\mathscr {U}}}\), \(\{v \in {{\mathscr {U}}}:u \succsim v\}\) is a closed subset of \({{\mathscr {U}}}\).

Theorem 2

If \(\succsim \) satisfies A1–A7, then \(\succsim \) and \(\succsim _{\text {AO}}\) coincide on \({{\mathscr {U}}_F}\).

Proof

Let \(\succsim \) satisfy A1–A7, and let \(u, v \in {{\mathscr {U}}_F}\). We know that \(u \succsim _{\text {AO}} v\) if \(u \succsim v\) (Theorem 1). So it is enough to show that \(u \succsim _{\text {AO}} v\) implies \(u \succsim v\).

If \(u \succ _{\text {AO}} v\), then either (i) \({\bar{u}}>{\bar{v}}\) or (ii) \({\bar{u}}={\bar{v}}\) and \({\bar{\sigma }}>0\), where \(\sigma =\sigma (u-v)\). In case (i), we get \(u \succ v\) as a consequence of the fact that \(\succsim \) satisfies A1–A5. In case (ii), \(\lnot (0, \sigma ) \succsim \sigma \) by Lemma 1, so \(\lnot v \succsim u\) by A5. By A6, \(u \succ v\). Conclude that \(u \succ _{\text {AO}} v\) implies \(u \succ v\).

Now suppose that \(u \sim _{\text {AO}} v\). Let \(u^{(\varepsilon )}=(\varepsilon +u_1, u_2, u_3, \ldots )\). Then \(u^{(\varepsilon )} \succ _{\text {AO}} v\) for every \(\varepsilon >0\), so (by the above conclusion) \(u^{(\varepsilon )} \succ v\) for every \(\varepsilon >0\). By A7, \(u \succsim v\). The same argument shows that \(v \succsim u\). \(\square \)

6.2 Second characterization

Axioms A6 and A7 were motivated by necessity rather than some normative or economic reason. In our second characterization, these axioms are replaced by the compensation principle.

As an illustration of this principle, imagine that the decision maker is faced with two options. The first option yields some sequence of expected rewards \(u\in {{\mathscr {U}}}\). The second option is to obtain a one-period postponement of u and a compensation of \(c \in {\mathbb {R}}\) in the first period. Which value of c should make the agent indifferent?

In some cases, this value will be zero. This is the case if u has at most finitely many nonzero entries: then (0, u) can be obtained from u by finitely many transpositions of entries, so the two streams are equally good by A2 and transitivity. However, the agent will not always be indifferent if \(c=0\). For instance, if \(u=(r, r, r, \ldots )\) is constant and c is less than \(r>0\), then \((c, u)\) is worse than u by A1. The compensation principle says that u and \((c, u)\) are equally good if \(c={\bar{u}}\) (compare Lemma 1). Its precise statement is as follows:

A8. For every \(u\in {{\mathscr {U}}}\), if \({\bar{u}}\) is well defined, then \(({\bar{u}},u) \sim u\).

As an instance of the two options described above, consider again the system in Fig. 1, and suppose that the system starts in \(s_1\). The agent then has two options. If \(a_1\) is chosen, then \(u=(2, 0, 2, 0, 2, \dots )\) obtains. If \(a_2\) is chosen, then u is delayed one period, and a reward of c is obtained in the first period. Thus, the two feasible alternatives are u and \(v=(c, u)\). Since \({\bar{u}}=1\), A8 says that u and v are equally good if \(c=1\).

Example 1 illustrates the fact that \(\succsim _{\text {O}}\) violates A8. It is easy to check that \(\succsim _{\text {AO}}\) satisfies A8, and the same is true of \(\succsim _1\) (Jonsson and Voorneveld 2018). To see that the average reward criterion also satisfies A8, note that if \(d=(c, u)-u\), then we have \(\sigma _T(d)=c-u_T\) and therefore \(\liminf _{T\rightarrow \infty }\frac{1}{T}\sigma _T(d)=\liminf _{T\rightarrow \infty }\frac{1}{T}\sigma _T(-d)=0\). It follows that \((c, u) \sim _{\text {AR}} u\) for every \(c \in {\mathbb {R}}\) and \(u\in {{\mathscr {U}}}\).
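A numerical sanity check of A8 for \(\succsim _{\text {AO}}\) (a truncated-horizon sketch, not part of the paper): with \(u=(2, 0, 2, 0, \ldots )\) and \(c={\bar{u}}=1\), the Cesàro mean of \(\sigma _t((c, u)-u)\) is close to zero, and its sign follows that of \(c-{\bar{u}}\).

```python
# Compensation principle under the average overtaking preorder.
T = 100_000
u = [2.0, 0.0] * (T // 2)
c = 1.0                              # c = u_bar
cu = ([c] + u)[:T]                   # the postponed stream (c, u), truncated to T terms
sigma, acc = [], 0.0
for a, b in zip(cu, u):
    acc += a - b
    sigma.append(acc)
print(sum(sigma) / T)                # approx. c - u_bar = 0
```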

Like (9), the usefulness of A8 stems from the fact that if \(\succsim \) satisfies A5 and \(u, v \in {{\mathscr {U}}}\) are such that \(\sigma \equiv \sigma (u-v)\) is bounded, then \(u \succsim v\) if and only if \(\sigma \succsim (0, \sigma )\). Thus, if \(\succsim \) satisfies A1, A5 and A8, then \(u \succsim v\) if and only if \({\bar{\sigma }} \ge 0\). In Jonsson and Voorneveld (2018), this observation is used to characterize \(\succsim _1\) on the set of streams that are summable or eventually periodic. Theorem 3 extends this result to streams that can be decomposed according to (12).

Theorem 3

If \(\succsim \) satisfies A1, A5 and A8, then \(\succsim \) and \(\succsim _{\text {AO}}\) coincide on \({{\mathscr {U}}_F}\).

Proof

Let \(\succsim \) be a preorder that satisfies A1, A5 and A8. For \(u, v \in {{\mathscr {U}}_F}\), let \(\sigma =\sigma (u-v)\). Suppose that \({\bar{u}}={\bar{v}}\). Then \(\sigma \in {{\mathscr {U}}}\) is regular (Lemma 2), which means that \({\bar{\sigma }}\) is well defined. By A1 and A8, \(\sigma \succsim (0, \sigma )\) if and only if \({\bar{\sigma }} \ge 0\). By A5, \(u \succsim v\) if and only if \(\sigma \succsim (0, \sigma )\). Hence, \(u \succsim v\) if and only if \({\bar{\sigma }} \ge 0\). Since \({\bar{\sigma }}\) is the Cesàro sum of \(\sum _{t=1}^\infty (u_t-v_t)\), we see that \(u \succsim v\) if and only if \(u \succsim _{\text {AO}} v\).

Now suppose (without loss of generality) that \({\bar{u}}>{\bar{v}}\). Then \(u \succ _{\text {AO}} v\). We show that \(u \succ v\). Let \(c={\bar{u}}-{\bar{v}}>0\) and, for \(T>1\), define \(z \in {{\mathscr {U}}}\) by setting \(z_t = u_t\) for \(t \le T\) and \(z_t = u_t-c\) for \(t > T\). Then z is the sum of a periodic sequence and a summable sequence, and \(u \succ z\) by A1. Since \({\bar{u}}>{\bar{v}}\), we can choose T so that \(\sigma _t(z-v)\ge 0\) for all \(t \ge T\). Since \({\bar{z}}={\bar{v}}\), the preceding argument gives that \(z \succsim v\), so \(u \succ v\) by transitivity. \(\square \)

We can obtain a characterization of average overtaking optimality in general discrete time MDPs by generalizing A8. This result, which concerns optimality within the class of all policies, is provided in the appendix. There we also verify that the axioms in Theorem 3 are logically independent.

Theorems 2 and 3 provide two axiom sets that characterize \(\succsim _{\text {AO}}\) on \({{\mathscr {U}}_F}\). As a corollary of these results, we obtain a partial answer to the third question (Q3) from the introduction: If \(\succsim \) satisfies the axioms in any one of these axiom sets, then a policy is \(\succsim \)-optimal within \({\varPi _F}\) if and only if it is \(\succsim _{\text {AO}}\)-optimal within \({\varPi _F}\). In particular, a \(\succsim \)-optimal policy exists within \({\varPi _F}\).