Abstract
Under the expected total reward criterion, the optimal value of a finite-horizon Markov decision process can be determined by solving the Bellman equations. The equations were extended by White to processes with vector rewards. Using a counterexample, we show that the assumptions underlying this extension fail to guarantee its validity. Analysis of the counterexample enables us to articulate a sufficient condition for White’s functional equations to be valid. The condition is shown to be true when the policy space has been refined to include a special class of non-Markovian policies, when the dynamics of the model are deterministic, and when the decision making horizon does not exceed two time steps. The paper demonstrates that in general, the solutions to White’s equations are sets of Pareto efficient policy returns over the refined policy space. Our results are illustrated with an example.
1 Introduction
In a seminal work on vector-valued Markov decision processes, Douglas White (1982) presents an inductive scheme for determining the set of Pareto efficient returns that can be accrued over an n-step planning horizon given some initial state. The scheme is intended as a generalization of the value iteration algorithm (backward induction) for scalar Markov decision processes. Accordingly, it is based on equations reminiscent of Bellman’s (Puterman, 2014, Section 4.3), though the unknowns in White’s case are set-valued rather than real-valued functions. There is abundant reference to White’s equations in the technical literature (Hayes et al., 2007; Mannor & Shimkin, 2022; Roijers, Röpke, Nowé, & Rădulescu, 2004; Ruiz-Montiel et al., 2021; Van Moffaert & Nowé, 2014; Wiering & De Jong, 2017), and their continuing relevance is evidenced by the recent publication of a paper (Mandow, Pérez-de-la Cruz, & Pozas, 2022) in which they are used as a point of departure for the development of a multi-objective optimization algorithm. In fact, the equations are invalid under White’s assumptions. By means of a counterexample, we shall demonstrate that their solution does not always yield the desired efficient sets, contrary to White’s chief claim (see (D. White, 1982, Theorem 2), reproduced below).
It is also the purpose of this paper to develop conditions for the equations’ applicability. In particular, it shall be shown that the equations hold when the decision making horizon spans one or two control intervals (Sect. 3), when the dynamics of the model are deterministic (Sect. 4), and when the policy space is augmented with a class of non-Markovian policies (Sect. 5). By analyzing the latter case, we discover that the actual solutions to White’s equations are sets of Pareto efficient policy returns over the augmented policy space (Theorem 5, Sect. 5). To our knowledge, this paper carries the first such analysis, and is the first to disprove the said “theorem".
To make for a relatively self-contained paper, we restate White’s assumptions. Let \(I = \{1,..., N\}\) be a finite set of states. For each state \(i \in I\), let \(K_i\) be the set of actions that can be taken in \(i\), and suppose \(K_i\) is compact. Let \(K = \cup _{i \in I}K_i\). A decision maker observes the process over discrete control periods. If at the beginning of a period (an epoch) the process occupies a state i and the decision maker selects an action \(k \in K_i\), a vector reward \(f_i^{k} \in {\mathbb {R}}^{m}\) is generated, the components of which are continuous on \(K_i\). Let \(\rho \in [0, 1)\) be a reward discount factor. Call \(p_{ij}^{k}\) the conditional probability that the system will occupy state \(j\) at an epoch given that state i was observed at the preceding epoch and action \(k \in K_i\) was selected; suppose it is continuous on \(K_i\). A Markovian decision rule \(\delta \) is a mapping from I to K that dictates the action that should be selected in each state at a particular epoch. The set of all decision rules, \(\Delta \), is compact. No decision is made at process termination. A policy specifies the decision rules to be used throughout the lifetime of the process. If \(n \ge 1\) control periods are remaining, a policy shall be identified with its corresponding sequence of decision rules \((\delta _n, \delta _{n-1},..., \delta _{1})\), where \(\delta _s\) represents the rule prescribed by this policy when \(s \le n\) periods are left. Let \(\Pi \) denote the set of all policies.
For any policy \(\pi = (\delta _n,..., \delta _{1}) \in \Pi \), let

$$\begin{aligned} v_n^{\pi }(i) = {\mathbb {E}}\left[ \sum _{s = 1}^{n} \rho ^{\,n-s} f_{X_s}^{\delta _{s}(X_s)} \,\Big |\, X_n = i \right] \end{aligned}$$

be the expected total reward generated by \(\pi \) if \(n = 1, 2,...\) periods are remaining and the current state is \(i \in I\), where \(X_t\) denotes the (random) state of the process with t epochs left. It follows from rudimentary probability operations that

$$\begin{aligned} v_n^{\pi }(i) = f_i^{\delta _{n}(i)} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{\delta _{n}(i)}v_{n-1}^{\pi }(j) \end{aligned}$$
(1)

by letting

$$\begin{aligned} v_0^{\pi }(i) = 0_{{\mathbb {R}}^m} \end{aligned}$$
(2)
for each \(i \in I\) and \(\pi \in \Pi \). For all \(n = 0, 1,...\) and \(i \in I\), write \(V_n(i) = \bigcup _{\pi \in \Pi } \{ v_n^{\pi }(i) \}\). The generic expression “policy return", where time and state are omitted for brevity, shall refer to any member of \(V_n(i)\) for some \(n = 0, 1,...\) and \(i \in I\).
Define the Pareto efficient subset of a set \(X \subseteq {\mathbb {R}}^{m}\) as

$$\begin{aligned} {\mathscr {E}}(X) = \{\, x \in X \mid \forall y \in X,\ y \ge x \implies y = x \,\}, \end{aligned}$$

where \(\ge \) denotes the usual componentwise order on \({\mathbb {R}}^m\), i.e. \(x \ge y \iff \forall i \in \llbracket 1, m \rrbracket ,\, x_i \ge y_i\) for all \(x, y \in {\mathbb {R}}^m\). Thus, \(x \in X\) is said to be efficient if it is maximal in X with respect to \(\ge \). Geoffrion (1968) points out that such an x is sometimes termed “admissible", “noninferior" or “Pareto optimal".
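For a finite X, this definition translates directly into a small routine. The following Python sketch (the function name `pareto_efficient` is ours) computes \({\mathscr {E}}(X)\) by discarding every vector dominated by another member of X:

```python
def pareto_efficient(X):
    """Return the Pareto efficient subset E(X) of a finite set X of m-vectors.

    x is efficient iff it is maximal under the componentwise order, i.e.
    no y in X satisfies y >= x componentwise with y != x.
    """
    def dominates(y, x):
        return y != x and all(yc >= xc for yc, xc in zip(y, x))

    return [x for x in X if not any(dominates(y, x) for y in X)]

# (1, 1) is dominated by (2, 1); (2, 1) and (1, 2) are incomparable.
print(pareto_efficient([(2, 1), (1, 2), (1, 1)]))  # -> [(2, 1), (1, 2)]
```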
White concerns himself with the “vector value method of successive approximations" for the calculation of \({\mathscr {E}}(V_n(i))\) assuming knowledge of the \({\mathscr {E}}(V_{n-1}(j))\)’s, \(j \in I\). The subject of this paper is the theorem on which this method is predicated:
Theorem
(D. White (1982), Theorem 2) For all \(n = 0, 1,...\) and \(i \in I\), \({\mathscr {E}}(V_n(i))\) is the unique solution \(W_n(i)\) to one of the following equations:

$$\begin{aligned} W_n(i) = {\mathscr {E}}(F_n(i)), \qquad F_n(i) = \bigcup _{k \in K_i}\Bigl ( \{f_i^{k}\} \bigoplus \rho \cdot \Bigl ( \bigoplus _{j = 1}^{N} p_{ij}^{k}\cdot W_{n-1}(j) \Bigr ) \Bigr ), \qquad n \ge 1, \end{aligned}$$
(3)

$$\begin{aligned} W_0(i) = \{0_{{\mathbb {R}}^m}\}, \end{aligned}$$
(4)

where for any nonempty sets A and B, for any scalar c, \(A \bigoplus B = \{\, a + b: a \in A, b \in B\,\}\) and \(c\cdot A = \{\, c\cdot a: a \in A\,\}\).
When \(m = 1\), Equations (3) and (4) reduce to the more familiar Bellman equations, with “max" substituted for “\({\mathscr {E}}\)", and the theorem reduces to the correct claim that \(\max _{\pi \in \Pi }v_n^{\pi }(i)\), which exists under the aforementioned assumptions (Puterman, 2014, Proposition 4.4.3.), is the unique solution for all \(n = 0, 1,...\) and \(i \in I\). When \(m > 1\), the theorem no longer holds, as the next section will show.
2 A counterexample
Consider the vector-valued Markov decision process defined by \(I = \{1, 2\};\, K_{1} = \{a, b\};\, K_{2} = \{a\};\, p_{11}^{a} = \frac{3}{4};\, p_{12}^{a} = \frac{1}{4};\, p_{11}^{b} = p_{12}^{b} = \frac{1}{2};\, p_{21}^{a} = 1;\, p_{22}^{a} = 0;\, f_1^{a} = (1, 0);\, f_1^{b} = (0, 1);\, f_2^{a} = 0_{{\mathbb {R}}^2};\, \rho = 0.9\).
Noting that the only action available in state 2 is action a, there are two decision rules in this model: the one that prescribes a in state 1 (\(\delta ^{a}\)), and the one that prescribes b (\(\delta ^{b}\)). That is, \(\delta ^{a}(i) = a\) for \(i = 1, 2\); \(\delta ^{b}(1) = b\) and \(\delta ^{b}(2) = a\). If there are three control periods remaining before the process terminates, this gives rise to a total of eight policies, to wit \((\delta ^{a}, \delta ^{a}, \delta ^{a})\), \((\delta ^{a}, \delta ^{a}, \delta ^{b})\), \((\delta ^{a}, \delta ^{b}, \delta ^{a})\), \((\delta ^{b}, \delta ^{a}, \delta ^{a})\), \((\delta ^{a}, \delta ^{b}, \delta ^{b})\), \((\delta ^{b}, \delta ^{a}, \delta ^{b})\), \((\delta ^{b}, \delta ^{b}, \delta ^{a})\), and \((\delta ^{b}, \delta ^{b}, \delta ^{b})\). When the process is initially in state 1, these policies generate the following returns:

$$\begin{aligned} \begin{array}{ll} v_3^{(\delta ^{a}, \delta ^{a}, \delta ^{a})}(1) = (2.333125, 0), &v_3^{(\delta ^{a}, \delta ^{a}, \delta ^{b})}(1) = (1.675, 0.658125),\\ v_3^{(\delta ^{a}, \delta ^{b}, \delta ^{a})}(1) = (1.50625, 0.675), &v_3^{(\delta ^{b}, \delta ^{a}, \delta ^{a})}(1) = (1.15875, 1),\\ v_3^{(\delta ^{a}, \delta ^{b}, \delta ^{b})}(1) = (1, 1.18125), &v_3^{(\delta ^{b}, \delta ^{a}, \delta ^{b})}(1) = (0.45, 1.70875),\\ v_3^{(\delta ^{b}, \delta ^{b}, \delta ^{a})}(1) = (0.6075, 1.45), &v_3^{(\delta ^{b}, \delta ^{b}, \delta ^{b})}(1) = (0, 2.0575). \end{array} \end{aligned}$$

The reader can verify that all these returns are efficient, i.e. \({\mathscr {E}}(V_3(1)) = V_3(1)\).
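These returns can be reproduced mechanically by enumerating the eight policies and evaluating recursion (1) under the parameters of the counterexample; a sketch:

```python
from itertools import product

rho = 0.9
# Transition probabilities p[i][k][j] and rewards f[i][k] of the counterexample.
p = {1: {'a': {1: 0.75, 2: 0.25}, 'b': {1: 0.5, 2: 0.5}},
     2: {'a': {1: 1.0, 2: 0.0}}}
f = {1: {'a': (1.0, 0.0), 'b': (0.0, 1.0)},
     2: {'a': (0.0, 0.0)}}
# The two decision rules: delta^a and delta^b (state 2 always takes a).
rules = {'a': {1: 'a', 2: 'a'}, 'b': {1: 'b', 2: 'a'}}

def v(policy, i):
    """Expected total reward of policy = (delta_n, ..., delta_1) from state i."""
    if not policy:
        return (0.0, 0.0)                # v_0 = 0
    k = rules[policy[0]][i]
    tail = {j: v(policy[1:], j) for j in (1, 2)}
    return tuple(f[i][k][c] + rho * sum(p[i][k][j] * tail[j][c] for j in (1, 2))
                 for c in (0, 1))

V3 = {pi: v(pi, 1) for pi in product('ab', repeat=3)}
for pi, ret in V3.items():
    print(pi, ret)
```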
On the other hand, solution of Equations (3) and (4) produces, successively,
\({\left\{ \begin{array}{ll} W_1(1) = \bigl \{(1, 0), (0, 1)\bigr \},\\ W_1(2) = \bigl \{0_{{\mathbb {R}}^2}\bigr \}; \end{array}\right. }\)
\({\left\{ \begin{array}{ll} W_2(1) = \bigl \{(1.675, 0), (1, 0.675), (0.45, 1), (0, 1.45)\bigr \},\\ W_2(2) = \bigl \{(0.9, 0), (0, 0.9)\bigr \}; \end{array}\right. }\)
whence

$$\begin{aligned} W_3(1) = \bigl \{&(2.333125, 0), (2.130625, 0.2025), (1.8775, 0.455625), (1.675, 0.658125),\\ &(1.50625, 0.675), (1.30375, 0.8775), (1.2025, 0.97875), (1.15875, 1), (1, 1.18125),\\ &(0.855, 1.30375), (0.75375, 1.405), (0.6075, 1.45), (0.45, 1.70875), (0.2025, 1.855),\\ &(0, 2.0575)\bigr \}. \end{aligned}$$
Clearly, \(W_3(1) \ne {\mathscr {E}}(V_3(1))\). This concludes the counterexample.
The problem is that some of the vectors in the \(W_n(i)\)’s may be infeasible: we may have \(w \in W_n(i)\) and yet \(w \notin V_n(i)\). If \(W_n(i) \not \subseteq V_n(i)\) for some n and \(i \in I\), one obviously cannot have \(W_n(i) = {\mathscr {E}}(V_n(i))\), and the theorem fails. Consider, for instance, \(w = (1.8775, 0.455625) \in W_3(1)\) in the example above. This vector is anomalous in that it does not belong to \(V_3(1)\). A closer look at its construction reveals that \(w = f_{1}^{a} + \rho \cdot (p_{11}^{a}w^1 + p_{12}^{a}w^2)\) traces back to \(w^1 = (1, 0.675) \in W_2(1)\) and \(w^2 = (0.9, 0) \in W_2(2)\). Were w in \(V_3(1)\), with \(w = v_{3}^{\pi }(1)\) for some \(\pi \), one would expect to have \(w^1 = v_{2}^{\pi }(1)\) and \(w^2 = v_{2}^{\pi }(2)\), but in fact this is not so. Indeed, a simple calculation shows that when two control periods are remaining, \(w^1\) equals the return of a policy (beginning in state 1) that uses \(\delta ^{a}\) then \(\delta ^{b}\), whereas \(w^2\) equals the return of a policy (beginning in state 2) that prescribes \(\delta ^{a}\) for both periods or uses \(\delta ^{b}\) then \(\delta ^{a}\).
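The divergence between \(W_3(1)\) and \(V_3(1)\) can be checked numerically by iterating Equations (3) and (4) on the counterexample; a sketch (coordinates are rounded to guard against floating-point noise):

```python
rho = 0.9
p = {1: {'a': {1: 0.75, 2: 0.25}, 'b': {1: 0.5, 2: 0.5}},
     2: {'a': {1: 1.0, 2: 0.0}}}
f = {1: {'a': (1.0, 0.0), 'b': (0.0, 1.0)},
     2: {'a': (0.0, 0.0)}}

def pareto(X):
    dom = lambda y, x: y != x and all(a >= b for a, b in zip(y, x))
    return [x for x in X if not any(dom(y, x) for y in X)]

def white_step(W_prev):
    """One application of Equation (3): W_n(i) = E(F_n(i))."""
    W = {}
    for i in p:
        F = []
        for k in p[i]:
            # F_n(i): one candidate per choice of v_j in W_{n-1}(j), j = 1, 2
            for v1 in W_prev[1]:
                for v2 in W_prev[2]:
                    vs = {1: v1, 2: v2}
                    F.append(tuple(round(f[i][k][c]
                             + rho * sum(p[i][k][j] * vs[j][c] for j in (1, 2)), 9)
                             for c in (0, 1)))
        W[i] = pareto(F)
    return W

W = {1: [(0.0, 0.0)], 2: [(0.0, 0.0)]}   # W_0, Equation (4)
for _ in range(3):
    W = white_step(W)
print(sorted(W[1]))                       # the vectors of W_3(1)
print((1.8775, 0.455625) in W[1])         # the anomalous vector is among them
```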
At the root of the problem is the possibility that at the n-th iteration, \(n \ge 1\), there is a state i for which

$$\begin{aligned} F_n(i) \not \subseteq V_n(i). \end{aligned}$$
White’s proof overlooks this possibility. Instead, it asserts that if \(w = f_i^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}v_j\), with \(i \in I\), \(k \in K_i\), and each \(v_j\) in \(W_{n-1}(j)\), then we must have \(w \in V_n(i)\) (D. White, 1982, Equation 11, p. 7). This, of course, need not be true, as we have just illustrated with \(w = f_{1}^{a} + \rho \cdot (p_{11}^{a}w^1 + p_{12}^{a}w^2) \notin V_3(1)\).
However, that is essentially the only issue with the proof, for it becomes correct if \(F_n(i)\) is guaranteed to be a subset of \(V_n(i)\) for all \(n \ge 1\) and \(i \in I\). This inclusion condition, namely

$$\begin{aligned} \forall n \ge 1,\ \forall i \in I: \quad F_n(i) \subseteq V_n(i), \end{aligned}$$

shall henceforth be known as property (P). This property not only ensures that \(W_n(i) \subseteq V_n(i)\), but it also implies that the White equations are valid, i.e. \(W_n(i) = {\mathscr {E}}(V_n(i))\) for all \(n = 0, 1,...\) and \(i \in I\). This crucial observation is established formally in Sect. 3.
It is worth noting that (P) can only be violated when \(m > 1\). Recall our earlier comment that White’s equations coincide with Bellman’s when \(m = 1\), since in that case

$$\begin{aligned} {\mathscr {E}}(V_n(i)) = \Bigl \{ \max _{\pi \in \Pi } v_n^{\pi }(i) \Bigr \} \end{aligned}$$

for all \(n \ge 0\) and \(i \in I\), so that we may write \(W_n(i) = \{w_n(i)\} \subseteq {\mathbb {R}}\), where

$$\begin{aligned} w_n(i) = \max _{k \in K_i} \Bigl ( f_i^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}w_{n-1}(j) \Bigr ) \end{aligned}$$
(5)

for all \(i \in I\) if \(n \ge 1\), and

$$\begin{aligned} w_0(i) = 0 \end{aligned}$$
(6)

for all \(i \in I\). Now, for fixed \(i \in I\), \(n \ge 1\) and \(k \in K_i\), there exists a policy \(\pi (i, k, n)\) such that

$$\begin{aligned} f_{i}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}w_{n-1}(j) = v_n^{\pi (i, k, n)}(i), \end{aligned}$$

which ensures \(F_n(i) \subseteq V_n(i)\) and therefore (P). Such a policy can be constructed in two stages. First, using Equations (5) and (6), construct a policy \(\pi ^{*} = (\delta _{n-1}^{*},..., \delta _{1}^{*})\) by calculating \(w_1(j),..., w_{n-1}(j)\) with the corresponding maximizing actions \(k^{*}_{1}(j),..., k^{*}_{n-1}(j)\) for each \(j \in I\), then setting \(\delta _s^{*}(j) = k^{*}_{s}(j)\) for each j and \(s < n\). The resulting policy satisfies \(w_{n-1}(j) = v_{n-1}^{\pi ^{*}}(j)\) for all \(j \in I\). Second, choose \(\pi (i, k, n)\) to be any policy that prescribes k in i with n epochs remaining then follows \(\pi ^{*}\) at all future epochs. That is, \(\pi (i, k, n)\) is any policy \((\delta _n,..., \delta _{1})\) of the form \(\delta _{n}(i) = k\) and

$$\begin{aligned} \delta _{s}(j) = \delta _{s}^{*}(j), \qquad s = 1,..., n-1, \end{aligned}$$

for all states j. For such a policy, we have \(f_{i}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}w_{n-1}(j) = v_n^{\pi (i, k, n)}(i) \in V_n(i)\), hence property (P), hence the validity of the Bellman equations.
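In code, the two-stage construction amounts to ordinary scalar backward induction with the maximizing actions recorded along the way. A sketch on a hypothetical scalar (\(m = 1\)) model whose rewards are ours, chosen only for illustration:

```python
rho = 0.9
# Hypothetical scalar model reusing the transition structure of Sect. 2.
p = {1: {'a': {1: 0.75, 2: 0.25}, 'b': {1: 0.5, 2: 0.5}},
     2: {'a': {1: 1.0, 2: 0.0}}}
f = {1: {'a': 1.0, 'b': 0.5}, 2: {'a': 0.0}}

def backward_induction(n):
    """Compute w_s(i) for s = 1..n and record maximizing actions k*_s(i)."""
    w = {i: 0.0 for i in p}              # w_0(i) = 0, Equation (6)
    best = {}                            # best[s][i] = k*_s(i)
    for s in range(1, n + 1):
        vals = {i: {k: f[i][k] + rho * sum(p[i][k][j] * w[j] for j in p)
                    for k in p[i]} for i in p}          # Equation (5)
        best[s] = {i: max(vals[i], key=vals[i].get) for i in p}
        w = {i: vals[i][best[s][i]] for i in p}
    return w, best

w3, best = backward_induction(3)
# pi* uses the rules delta*_s(j) = best[s][j] for s < n.
print(w3[1], best)
```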
In summary, then, (P) is a property of scalar Markov decision processes, but, as our counterexample makes clear, it is not a general property of vector-valued Markov decision processes. In the sequel, we shall develop two special, overlapping classes of models in which (P) is satisfied. These are:
-
(1)
Deterministic dynamic programs: vector-valued Markov decision processes where action choice, conditional on the present state, determines the next state with certainty.
-
(2)
Vector-valued Markov decision processes where the definition of “decision rule" has been refined to include a broader range of rules than that considered by White.
Before treating these cases (Sects. 4 and 5, respectively), we shall first justify our claim that (P) is sufficient to ensure the validity of the White equations (Sect. 3). We shall adopt the same method of proof as White on (D. White, 1982, p. 7), namely induction on n and appeal to a lemma that will be introduced in due course. Note that since we shall be dealing with a different notion of policy in our treatment of the second case, (P) shall have to be recast accordingly, though this is a minor change, and the property’s tenor as well as implications for the White equations shall be untouched. All assumptions made in the succeeding sections supplement those set forth in the Introduction, unless it is stated explicitly that a new assumption repeals an old one. Finally, it should by now be clear that in carrying out this investigation, we have an \(m > 1\) in mind, though all the results presented in the following also hold when \(m = 1\).
3 On (P) as a sufficient condition for the validity of the White equations
As a prelude to showing that (P) implies \(W_n(i) = {\mathscr {E}}(V_n(i))\) for all \(i \in I\) and \(n \ge 0\), where \(W_n(i)\) is given by Equations (3) and (4), we borrow the following lemma from D. White (1982):
Lemma 1
(D. White, 1982, Lemma 2) Let \(n = 0, 1,...\) and \(i \in I\). For each \(u \in V_n(i)\), there exists \(v \in {\mathscr {E}}(V_n(i))\) such that \(v \ge u\).
This lemma generalizes the fact that in a partially ordered set \((X, \succeq )\) that admits a maximum, we have \(\max (X) \succeq x\) for all \(x \in X\). An equivalent statement is that “for all \(n = 0, 1,...\), \(i \in I\), \({\mathscr {E}}(V_n(i))\) is the kernel of \(V_n(i)\) with respect to \(\ge \)", “kernel" being the decision-theoretic term for the unique antichain K in a partially ordered set \((X, \succeq )\) with the property that there is for all \(x \in X\) a \(y \in K\) with \(y \succeq x\) (D. White, 1977).
The next lemma will also be instrumental in proving our result. Because it can be inferred directly from the definition of \(\ge \), we state it without proof:
Lemma 2
Let \(p_1,..., p_k\) be \(k \ge 1\) nonnegative reals. Let x, y, z be vectors in \({\mathbb {R}}^m\). Then
-
(1)
If \(x_j \ge y_j\) for all \(j = 1,..., k\), we have \(\sum _{j = 1}^{k}p_jx_j \ge \sum _{j = 1}^{k}p_jy_j\);
-
(2)
\(x \ge y\) implies \(z + x \ge z + y\).
We now state our result.
Proposition 1
(P) implies \(W_n(i) = {\mathscr {E}}(V_n(i))\) for all \(i \in I\) and \(n = 0, 1,...\).
Proof
Suppose (P) is true. Proceed by induction on n. For \(n = 0\), we have, independently of (P), \(V_0(i) = \{0\} = W_0(i)\), and thus \(W_0(i) = {\mathscr {E}}(V_0(i))\) for all \(i \in I\). Suppose \(W_{n-1}(i) = {\mathscr {E}}(V_{n-1}(i))\) for all \(i \in I\), for some \(n \ge 1\).
Let \(i \in I\) and \(v \in {\mathscr {E}}(V_n(i))\). We shall prove that \(v \in W_n(i)\). Since \(v \in V_n(i)\), there exists a policy \(\pi = (\delta _n,..., \delta _{1})\) such that \(v = f_i^{\delta _{n}(i)} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{\delta _{n}(i)}v_{n-1}^{\pi }(j)\). By Lemma 1, there is, for all \(j \in I\), a policy \(\pi _j\) such that \(v_{n-1}^{\pi _j}(j) \ge v_{n-1}^{\pi }(j)\) and \(v_{n-1}^{\pi _j}(j) \in {\mathscr {E}}(V_{n-1}(j))\). Let \(w = f_i^{\delta _{n}(i)} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{\delta _{n}(i)}v_{n-1}^{\pi _j}(j)\). By the induction hypothesis, \(v_{n-1}^{\pi _j}(j) \in W_{n-1}(j)\) for all \(j \in I\), hence \(w \in F_n(i)\). Furthermore, \(w \ge v\) according to Lemma 2, and \(w \in V_n(i)\) by (P). But \(v \in {\mathscr {E}}(V_n(i))\), thus \(v = w\). Therefore, \(v \in F_n(i)\). Ergo, because \(v \in {\mathscr {E}}(V_n(i))\) and \(F_n(i) \subseteq V_n(i)\), we have \(v \in {\mathscr {E}}(F_n(i)) = W_n(i)\). This proves that \({\mathscr {E}}(V_n(i)) \subseteq W_n(i)\).
To show the converse inclusion, let \(u \in W_n(i)\). We may write \(u = f_{i}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}v_j\) for some \(k \in K\) and \(v_j \in W_{n-1}(j)\) for all \(j \in I\). From (P), \(u \in V_n(i)\). Suppose for the sake of contradiction that \(u \notin {\mathscr {E}}(V_n(i))\). Then there exists, applying Lemma 1, \(v \in {\mathscr {E}}(V_n(i)) \subseteq W_n(i)\) such that \(v \ge u\) and \(v \ne u\). This contradicts the fact that \(u \in W_n(i)\), and shows that \(u \in {\mathscr {E}}(V_n(i))\). Consequently, \(W_n(i) \subseteq {\mathscr {E}}(V_n(i))\).
We have established that \(W_n(i) = {\mathscr {E}}(V_n(i))\) for any \(i \in I\) and \(n = 0, 1,...\). \(\square \)
Notice that the base case of this induction did not require (P). In fact, whether or not (P) is satisfied, the equality \(W_n(i) = {\mathscr {E}}(V_n(i))\) holds for all \(i \in I\) up to \(n = 2\).
Proposition 2
For \(n = 0, 1, 2\), we have that \(W_n(i) = {\mathscr {E}}(V_n(i))\) for all \(i \in I\).
Proof
The case where \(n = 0\) was treated in the preceding proof. Let \(i \in I\) be a state. For \(n = 1\), \(w \in F_1(i)\) if and only if there exists an action \(k \in K_i\) such that \(w = f_i^{k}\). If \(w = f_i^{k}\), then \(w = v_1^{\pi }(i)\) for any policy \(\pi \) that prescribes k in i with one period remaining. Conversely, if \(w = v_1^{\pi }(i)\) for some \(\pi \), then \(w = f_i^{\delta _{1}(i)} \in F_1(i)\). Thus, \(w \in F_1(i)\) if and only if \(w \in V_1(i)\). As a result, \(W_1(i) = {\mathscr {E}}(F_1(i)) = {\mathscr {E}}(V_1(i))\).
Suppose now \(n = 2\). We shall find it convenient to start by proving that \(F_2(i) \subseteq V_2(i)\). By definition, \(w \in F_2(i)\) if and only if there exist \(k \in K_i\) and \(v_1 \in W_1(1),..., v_N \in W_1(N)\) such that \(w = f_i^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}v_j\). Recalling the fact (just established) that \(W_1(j) = {\mathscr {E}}(V_1(j))\) for all states j, we have that \(w \in F_2(i)\) if and only if \(w = f_i^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}v_1^{\pi _j}(j)\) for a set of N policies \(\pi _1,..., \pi _N\) and a \(k \in K_i\). If \(w = f_i^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}v_1^{\pi _j}(j)\), then by writing \(\pi _j = (\delta ^{j}_1)\) for all \(j \in I\), we see that any policy \(\pi = (\delta _2, \delta _1)\) given by \(\delta _{2}(i) = k\) and \(\delta _{1}(j) = \delta ^{j}_{1}(j)\) for all j satisfies \(w = v_2^{\pi }(i)\). Consequently, \(w \in F_2(i)\) implies that \(w \in V_2(i)\), from which it follows that \(F_2(i) \subseteq V_2(i)\).
If \(v \in {\mathscr {E}}(V_2(i))\), then there is a policy \(\pi = (\delta _{2}, \delta _{1})\) such that \(v = v_2^{\pi }(i) = f_i^{\delta _{2}(i)} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{\delta _{2}(i)}v_1^{\pi }(j)\). Lemma 1 assures us that for each \(j \in I\) there is a \(v_j \in {\mathscr {E}}(V_1(j)) = W_1(j)\) satisfying \(v_j \ge v_1^{\pi }(j)\). Let \(w = f_i^{\delta _{2}(i)} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{\delta _{2}(i)}v_j\). By construction, w is in \(F_2(i) \subseteq V_2(i)\), and by Lemma 2, \(w \ge v\). But v is efficient in \(V_2(i)\), which implies that \(w = v\). Consequently \(v \in F_2(i)\). It remains to show that v is efficient in \(F_2(i)\). This follows at once from the fact that v is efficient in \(V_2(i)\), and that \(F_2(i)\) is a subset of \(V_2(i)\). \(\square \)
Thus, in the special case where one or two control periods are remaining, solving the White equations gives the desired efficient sets.
4 Deterministic dynamic programs
Having established that property (P) ensures the validity of the White equations, we may now use it to show that these equations are applicable to vector-valued Markov decision processes with deterministic dynamics. By “deterministic dynamics" we mean specifically that
Assumption 1
For each \(i \in I\) and \(k \in K_i\), there exists a state \(i^{+} \in I\) such that \(p_{ii^{+}}^{k} = 1\).
This assumption posits that the choice of an action and the state in which that choice is made determine uniquely the next state of the process.
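Under Assumption 1 the sum over successor states collapses to a single term, so each candidate in \(F_n(i)\) has the form \(f_i^{k} + \rho \cdot v\) with \(v \in W_{n-1}(i^{+})\). A sketch on a hypothetical deterministic model (the successor map and rewards here are ours, for illustration only):

```python
rho = 0.9
# Hypothetical deterministic model: succ[(i, k)] is the unique next state.
succ = {(1, 'a'): 2, (1, 'b'): 1, (2, 'a'): 1}
f = {(1, 'a'): (1.0, 0.0), (1, 'b'): (0.0, 1.0), (2, 'a'): (0.0, 0.5)}

def pareto(X):
    dom = lambda y, x: y != x and all(a >= b for a, b in zip(y, x))
    return [x for x in X if not any(dom(y, x) for y in X)]

def white_step_det(W_prev):
    """Equation (3) when each (i, k) leads to the single state succ[(i, k)]."""
    W = {}
    for i in (1, 2):
        F = [tuple(f[i, k][c] + rho * w[c] for c in (0, 1))
             for (j, k) in succ if j == i
             for w in W_prev[succ[i, k]]]
        W[i] = pareto(F)
    return W

W = {1: [(0.0, 0.0)], 2: [(0.0, 0.0)]}   # W_0
for _ in range(2):
    W = white_step_det(W)
# Every vector of W_2(1) is the return of an actual two-period policy.
print(W[1])
```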
Theorem 1
Under Assumption 1, (P) is true, that is,

$$\begin{aligned} \forall n \ge 1,\ \forall i \in I: \quad F_n(i) \subseteq V_n(i). \end{aligned}$$
Proof
We proceed by induction on n. For \(n = 1\), for any \(i \in I\) and \(k \in K_i\), by letting \(f = f_{i}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}v_j\) where \(v_j \in W_{0}(j) = \{0\}\) for all \(j \in I\), we have \(f = f_{i}^{k}\), and thus \(f = v_{1}^{\pi }(i)\) for any policy \(\pi \) that prescribes k for i when only one decision is left. As a result, \(F_{1}(i) \subseteq V_{1}(i)\) for all \(i \in I\).
Suppose that for all \(i \in I\), \(F_{n-1}(i) \subseteq V_{n-1}(i)\) for some \(n \ge 2\). Let \(i \in I\), and let \(f = f_{i}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}v_j\) where \(v_j \in W_{n-1}(j) = {\mathscr {E}}(F_{n-1}(j))\) for all \(j \in I\). Following Assumption 1, we may write \(f = f_{i}^{k} + \rho \cdot v_{i^{+}}\), where \(i^{+} \in I\) satisfies \(p_{ii^{+}}^{k} = 1\). By the induction hypothesis, \(v_{i^{+}} \in V_{n-1}(i^{+})\), hence the existence of a policy \(\pi \) such that \(f = f_{i}^{k} + \rho \cdot v_{n-1}^{\pi }(i^{+})\). Let \(\pi '\) be any policy that chooses k in i with n epochs remaining then uses \(\pi \) in the future. We have \(f = v_{n}^{\pi '}(i)\), so \(f \in V_n(i)\). Consequently, \(F_n(i) \subseteq V_n(i)\) for all \(i \in I\).
In summary, we have \(F_n(i) \subseteq V_n(i)\) for all \(n = 1, 2,...\) and \(i \in I\). \(\square \)
From Proposition 1 follows the validity of the White equations under Assumption 1.
Corollary 1
Under Assumption 1, the White equations are valid, i.e \(W_n(i) = {\mathscr {E}}(V_n(i))\) for all \(n = 0, 1,...\) and \(i \in I\).
5 The White equations and non-Markovian policies
Before turning to another class of models in which (P) holds, we start with some useful background. Recall that we were led to this property by the observation that, for some \(n \ge 1\) and \(i \in I\), vectors in \(F_n(i)\) may have no corresponding policies in \(\Pi \). This allows ultimately for the possibility that \(W_n(i) = {\mathscr {E}}(F_n(i)) \not \subseteq V_n(i)\). Suppose now that the theorem was true under some special set of assumptions, and that we were to show this by proving (P). Furthermore, suppose we were to proceed by induction on n, having noticed that \(F_1(i) \subseteq V_1(i)\) for all i, since for any \(w = f_{i}^{k} \in F_{1}(i)\) we have \(w = v_{1}^{\pi }(i) \in V_{1}(i)\), \(\pi \) being a policy that prescribes k in i with one epoch remaining.
To carry out the induction, assume that for some \(n \ge 2\),

$$\begin{aligned} F_{n-1}(j) \subseteq V_{n-1}(j) \quad \text { for all } j \in I. \end{aligned}$$

Consider a state i and a \(w \in F_n(i)\). Our task is to show that \(w \in V_n(i)\). By definition, there is an action \(k \in K_i\) and there are, by the induction hypothesis, N policies \(\pi _1,..., \pi _{N}\) such that

$$\begin{aligned} w = f_i^{k} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{k}\,v_{n-1}^{\pi _j}(j). \end{aligned}$$
If n periods are remaining and the state is i, a policy of the form “take action k, then, if the next state is j, take the action prescribed by \(\pi _j\), and continue using \(\pi _j\) until the process terminates" should in principle accrue an expected total reward equal to w. Such a policy, however, cannot be formulated within the present framework. Indeed, our model presupposes that the only information relevant to decision making at a particular epoch is the state at that time, whereas if we were to implement the policy just described, we would also need to know, at all future epochs, which \(j \in I\) was observed when \(n-1\) periods were left. This leads naturally to a concept of policy where decision rules are allowed to depend on past states.
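As a concrete data representation, a history can be a tuple of visited states and a decision rule a function on such tuples; the composite policy just described then looks as follows (a sketch; the names `composite_rule` and `branch` are ours):

```python
def composite_rule(k, branch):
    """Decision rule for the policy: take k now; afterwards, defer to the
    policy branch[j], where j is the state observed one epoch later.

    A history is a tuple (i_n, ..., i_s) of visited states; branch maps each
    state j to a function from histories to actions.
    """
    def delta(history):
        if len(history) == 1:        # only the initial state seen: first epoch
            return k
        j = history[1]               # state observed with n - 1 periods left
        return branch[j](history)
    return delta

# Toy usage: after the first step, observing state 1 triggers the policy
# "always a", observing state 2 triggers the policy "always b".
delta = composite_rule('a', {1: lambda h: 'a', 2: lambda h: 'b'})
print(delta((1,)), delta((1, 2)), delta((1, 1, 2)))  # -> a b a
```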
To enable the definition of such rules, it shall henceforth be assumed that the number of control intervals is a fixed \(n \ge 1\). For convenience, we introduce a time variable s that records the number of remaining intervals (or equivalently, the number of remaining epochs). If the present time is \(s \le n\), we shall refer to the trajectory of states observed prior to and including the present epoch as a history, a realization of the random process defined by

$$\begin{aligned} Z_s = (Z_{s+1}, X_s) \end{aligned}$$
if \(s < n\), and \(Z_n = X_n\). As before, \(X_k\) denotes the state of the process with k periods to go. Let \(H_s = I^{n-s+1}\) denote the set of all histories at time s, and notice that \(H_n = I\). For each \(s = 1,..., n\), the phrase “s-decision rule" shall refer to any mapping from \(H_s\) to K. Let \(\Delta _s\) be the set of all s-decision rules, assumed to be compact, and \(\Delta = \cup _{1 \le s \le n}\Delta _s\) be the overall set of decision rules. A policy \(\pi = (\delta _n,..., \delta _{1})\) is a sequence of decision rules where each \(\delta _s\) is an s-decision rule. Let \(\Pi '\) denote the set of all policies.
A similar, if distinct, category of decision rules are Puterman’s history-dependent deterministic decision rules (Puterman, 2014). The difference is that Puterman’s histories record past actions too. Thus, whereas the domain of an s-decision rule is \(I^{n-s+1}\), the domain of a history-dependent one is \(I \times K \times I \times ... \times K \times I\), with \(n-s+1\) copies of I and \(n-s\) copies of K. We may nevertheless construe each \(\Delta _s\) as the subset of history-dependent decision rules available at time s which ignore actions taken up to that time. Similarly, White’s decision rules can be viewed as the subset of \(\Delta \) for which past states are irrelevant. That is, if \(h_s = (i_{n},..., i_{s})\) is a history at time s, and \(\delta _s\) is an s-decision rule such that \(\delta _{s}(h_s)\) depends on \(h_s\) only through \(i_s\), then \(\delta _s\) is a decision rule in White’s sense. We may conclude that \(\Pi \subset \Pi ' \subset \Pi _{HD}\), where \(\Pi _{HD}\) is the set of all policies consisting of history-dependent and deterministic decision rules.
Proposition 3
\(\Pi \subset \Pi '\).
Now that we have refined our concepts of decision rule and policy, we may define the expected total reward accrued for implementing a policy \(\pi = (\delta _n,..., \delta _{1}) \in \Pi '\) if there are \(s = 1,..., n\) periods remaining and the current history is \(h_s = (i_n,..., i_s)\):

$$\begin{aligned} v_s^{\pi }(h_s) = {\mathbb {E}}\left[ \sum _{t = 1}^{s} \rho ^{\,s-t} f_{X_t}^{\delta _{t}(Z_t)} \,\Big |\, Z_s = h_s \right] . \end{aligned}$$

Since \(Z_{s} = h_s\), we obtain a recursion analogous to Equation (1):

$$\begin{aligned} v_s^{\pi }(h_s) = f_{i_s}^{\delta _{s}(h_s)} + \rho \cdot \sum _{j = 1}^{N}p_{i_sj}^{\delta _{s}(h_s)}\,v_{s-1}^{\pi }((h_s, j)), \end{aligned}$$

where we let \((h_s, j):= (i_n,..., i_{s}, j)\) for all states \(j \in I\), and

$$\begin{aligned} v_0^{\pi }(h_0) = 0_{{\mathbb {R}}^m} \end{aligned}$$
for all complete histories \(h_0 = (i_n,..., i_0) \in H_0\), \(i_0\) being the state obtained consequent to applying the final decision rule \(\delta _1\) in state \(i_1\). Write \(V'_s(h_s) = \bigcup _{\pi \in \Pi '} \{ v_s^{\pi }(h_s) \}\) for all \(s = 0,..., n\).
Given this new framework, we are in a position to construct policies of the form described earlier in this section, that is, policies which prescribe some action at time s, then pursue from time \(s+1\) onward a policy \(\pi _j\) dependent upon the state j that obtained at \(s+1\). As noted, the significance of such policies lies in their ability to achieve expected total rewards of the form

$$\begin{aligned} w = f_{i_s}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{i_sj}^{k}\,v_{s-1}^{\pi _j}((h_s, j)), \end{aligned}$$

where \(\pi _1,..., \pi _{N} \in \Pi '\) are arbitrary policies, \(h_s = (i_n,..., i_s) \in H_s\) and \(k \in K_{i_s}\). We justify this assertion in Proposition 4. This enables us to prove a particular restatement of property (P), namely

$$\begin{aligned} \forall s = 1,..., n,\ \forall h_s \in H_s: \quad F'_s(h_s) \subseteq V'_s(h_s), \end{aligned}$$

where for all \(s = 1,..., n\), for all \(h_s = (i_n,..., i_s) \in H_s\),

$$\begin{aligned} F'_s(h_s) = \bigcup _{k \in K_{i_s}}\Bigl ( \{f_{i_s}^{k}\} \bigoplus \rho \cdot \Bigl ( \bigoplus _{j = 1}^{N} p_{i_sj}^{k}\cdot W'_{s-1}((h_s, j)) \Bigr ) \Bigr ) \end{aligned}$$

and

$$\begin{aligned} W'_s(h_s) = {\mathscr {E}}(F'_s(h_s)), \end{aligned}$$

subject to the boundary condition

$$\begin{aligned} W'_0(h_0) = \{0_{{\mathbb {R}}^m}\} \end{aligned}$$
for all \(h_0 = (i_n,..., i_0) \in H_0\). It is clear that this revised property, which we shall refer to as (P’), is fundamentally no different from (P); informally, we may say that it is (P), but enunciated in a different framework. In fact, (P’) provides a sufficient condition for the White equations to admit the \({\mathscr {E}}(V'_s(h_s))\), \(s \le n\), \(h_s \in H_s\), as solutions, meaning it plays the same role as (P) in the original framework (Propositions 5 and 6). More importantly, we shall see that the refined policy space is such that (P’) must be true (Theorem 3), and therefore that the solutions to the White equations are the \({\mathscr {E}}(V'_s(h_s))\)’s (Theorem 5).
Proposition 4
Let \(s = 1,..., n\) and \(h_s = (i_n,..., i_s) \in H_s\). Let \(\pi _1,..., \pi _{N} \in \Pi '\) be N policies and \(k \in K_{i_s}\) be an action. For each \(\pi _j\), write \(\pi _j = (\delta ^j_{n},..., \delta ^j_{1})\). Let

$$\begin{aligned} w = f_{i_s}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{i_sj}^{k}\,v_{s-1}^{\pi _j}((h_s, j)). \end{aligned}$$

Let
-
(1)
\(\delta _s: H_s \rightarrow K\) be any s-decision rule satisfying \(\delta _s(h_s) = k\);
-
(2)
\(\delta _{s-1}: H_{s-1} \rightarrow K\) be any \((s-1)\)-decision rule such that for all \(j \in I\), \(\delta _{s-1}(h_s, j) = \delta ^j_{s-1}(h_s, j)\);
-
(3)
for each time \(k = 1,..., s-1\), \(\delta _k: H_k \rightarrow K\) be any k-decision rule such that for any \((s-k)\) states \(j_{s-1},..., j_k \in I\) we have
$$\begin{aligned}\delta _{k}(h_s, j_{s-1},..., j_k) = \delta ^{j_{s-1}}_{k}(h_s, j_{s-1},..., j_k).\end{aligned}$$
Finally, let \(\pi \) be any element of \(\Delta _n \times ... \times \Delta _{1}\) with \((\pi )_k = \delta _k\) for each \(k = 1,..., s\). Then:
-
(a)
\(\pi \) is a policy, i.e \(\pi \in \Pi '\);
-
(b)
\(v_s^{\pi }(h_s) = w\).
Proof
Part (a) follows from the definition of \(\Pi '\) and the well-definedness of the \(\delta _k\)’s. We shall prove part (b). Using the recursion for \(v_s^{\pi }\) and the fact that \(\delta _{s}(h_s) = k\), we have \(v_s^{\pi }(h_s) = f_{i_s}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{i_sj}^{k}\,v_{s-1}^{\pi }((h_s, j))\). Moreover, along any history extending \((h_s, j)\), the rules \(\delta _{s-1},..., \delta _{1}\) prescribe, by construction, the same actions as the corresponding rules of \(\pi _j\); hence \(v_{s-1}^{\pi }((h_s, j)) = v_{s-1}^{\pi _j}((h_s, j))\) for all \(j \in I\), and therefore \(v_s^{\pi }(h_s) = w\).
\(\square \)
The same method that was used to prove Proposition 1 can be applied here to show that (P’) yields \(W'_s(h_s) = {\mathscr {E}}(V'_s(h_s))\) for all \(s \le n\) and \(h_s \in H_s\). To do this, we need the analogue of Lemma 1 for the \(V'_s(h_s)\)’s:
Lemma 3
Let \(s \le n\) and \(h_s \in H_s\). For each \(u \in V'_s(h_s)\), there is a \(v \in {\mathscr {E}}(V'_s(h_s))\) such that \(v \ge u\).
An essential component of our proof of Lemma 3 is a theorem on efficient sets that appears in (D. White, 1977), rephrased for our purposes:
Theorem 2
(D. White, 1977) Let R be a transitive relation on a set X. Define the efficient subset of X as \({\mathscr {E}}(X) = \{x \in X \mid \, \forall y \in X,\ yRx \implies y = x\}\). If X is compact and \(S(x) = \{y \in X \mid \, yRx\}\) is closed for all \(x \in X\), then, for all \(x \in X\), there exists \(y \in {\mathscr {E}}(X)\) such that yRx.
This theorem implies that we need only show that \(V'_s(h_s)\) is compact and that \(S(u) = \{v \in V'_s(h_s) \mid \, v \ge u\}\) is closed for all \(s \le n\), \(h_s \in H_s\), \(u \in V'_s(h_s)\). By the Heine-Borel theorem, \(V'_s(h_s) \subseteq {\mathbb {R}}^m\) is compact if and only if it is bounded and closed. In this connection, we must introduce a topology with respect to which convergence is defined in the \(V'_s(h_s)\)’s. This will be the topology of pointwise convergence. Convergence in the \(\Delta _s\)’s is also with respect to this topology.
Our proof is by induction on s. Boundedness is fairly simple to establish. To demonstrate the closedness of \(V'_s(h_s)\) for some \(s = 1,..., n\) and \(h_s \in H_s\), assuming that \(V'_{s-1}(h_{s-1})\) is compact for all \(h_{s-1} \in H_{s-1}\), we invoke a useful lemma concerning the weighted sum of k sequences, \(k \ge 1\), where each sequence takes its values in a compact subset of \({\mathbb {R}}^m\):
Lemma 4
Equip \({\mathbb {R}}^m\) with the topology of pointwise convergence. Let \(k \ge 1\) and let \(U_1,..., U_k\) be k nonempty compact subsets of \({\mathbb {R}}^m\). If \((p_{1, \alpha })_{\alpha \ge 0}\),..., \((p_{k, \alpha })_{\alpha \ge 0}\) are k sequences with values in [0, 1] and \((x_{1, \alpha })_{\alpha \ge 0}\),..., \((x_{k, \alpha })_{\alpha \ge 0}\) sequences with values in \(U_1,..., U_k\) respectively, then there exists a strictly increasing map \(\phi : {\mathbb {N}} \rightarrow {\mathbb {N}}\) such that
-
(a)
for all \(j = 1,..., k\), \(p_{j, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} p_j^{*}\) for some \(p_j^{*} \in [0, 1]\);
-
(b)
for all \(j = 1,..., k\), \(x_{j, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} x_j^{*}\) for some \(x_j^{*} \in U_j\);
-
(c)
and \(\sum _{j = 1}^{k} p_{j, \phi (\alpha )}x_{j, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} \sum _{j = 1}^{k} p_j^{*}x_j^{*}\).
Proof
Let us first verify parts (a) and (b) for \(k = 1\). Let \(U_1\) be a nonempty compact subset of \({\mathbb {R}}^m\), \((p_{1, \alpha })_{\alpha \ge 0}\) a sequence with values in [0, 1] and \((x_{1, \alpha })_{\alpha \ge 0}\) a sequence with values in \(U_1\). Since [0, 1] is compact, there exist a strictly increasing map \(\beta _{1}: {\mathbb {N}} \rightarrow {\mathbb {N}}\) and a \(p_1^{*} \in [0, 1]\) such that \(p_{1, \beta _{1}(\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} p_1^{*}\). Consider now the subsequence of \((x_{1, \alpha })_{\alpha \ge 0}\) indexed by \(\beta _1\), namely \((x_{1, \beta _{1}(\alpha )})_{\alpha \ge 0}\). Because this subsequence also takes values in the compact set \(U_1\), there exist a strictly increasing map \(\beta _{2}: {\mathbb {N}} \rightarrow {\mathbb {N}}\) and an \(x_1^{*} \in U_1\) such that \(x_{1, \beta _{1}(\beta _{2}(\alpha ))} \xrightarrow[\alpha \rightarrow \infty]{} x_1^{*}\). Moreover, \((p_{1, \beta _{1}(\beta _{2}(\alpha ))})_{\alpha \ge 0}\) is a subsequence of \((p_{1, \beta _{1}(\alpha )})_{\alpha \ge 0}\), and therefore \(p_{1, \beta _{1}(\beta _{2}(\alpha ))} \xrightarrow[\alpha \rightarrow \infty]{} p_1^{*}\). Let \(\phi = \beta _{1} \circ \beta _{2}\), which is a strictly increasing map from \({\mathbb {N}}\) to \({\mathbb {N}}\). Then \(p_{1, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} p_1^{*}\) and \(x_{1, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} x_1^{*}\), hence parts (a) and (b), from which part (c) follows.
Now let \(k \ge 1\), and assume that the lemma holds for any k nonempty compact subsets of \({\mathbb {R}}^m\). Let \(U_1,..., U_{k+1}\) be \((k+1)\) nonempty compact subsets of \({\mathbb {R}}^m\), \((p_{1, \alpha })_{\alpha \ge 0}\),..., \((p_{k+1, \alpha })_{\alpha \ge 0}\) sequences with values in [0, 1], and \((x_{1, \alpha })_{\alpha \ge 0}\),..., \((x_{k+1, \alpha })_{\alpha \ge 0}\) sequences with values in \(U_1,..., U_{k+1}\) respectively. By the induction hypothesis, there exist \(p_1^{*}\),..., \(p_k^{*}\) in [0, 1], \(x_1^{*}\),..., \(x_k^{*}\) in \(U_1,..., U_k\) respectively, and a strictly increasing map \(\gamma : {\mathbb {N}} \rightarrow {\mathbb {N}}\) such that \(p_{j, \gamma (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} p_j^{*}\) and \(x_{j, \gamma (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} x_j^{*}\) for every \(j = 1,..., k\). Focus now on \((p_{k+1, \gamma (\alpha )})_{\alpha \ge 0}\). By the compactness of [0, 1], \((p_{k+1, \gamma (\alpha )})_{\alpha \ge 0}\) admits a subsequence \((p_{k+1, \gamma (\beta (\alpha ))})_{\alpha \ge 0}\), where \(\beta : {\mathbb {N}} \rightarrow {\mathbb {N}}\) is strictly increasing, such that \(p_{k+1, \gamma (\beta (\alpha ))} \xrightarrow[\alpha \rightarrow \infty]{} p_{k+1}^{*}\) for some \(p_{k+1}^{*} \in [0, 1]\). Also, \(p_{j, \gamma (\beta (\alpha ))} \xrightarrow[\alpha \rightarrow \infty]{} p_j^{*}\) and \(x_{j, \gamma (\beta (\alpha ))} \xrightarrow[\alpha \rightarrow \infty]{} x_j^{*}\) for every \(j = 1,..., k\). Letting \(\phi = \gamma \circ \beta \), we see that for every \(j = 1,..., k+1\), \(p_{j, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} p_j^{*}\) and \(x_{j, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} x_j^{*}\). Thus \(\sum _{j = 1}^{k+1} p_{j, \phi (\alpha )}x_{j, \phi (\alpha )} \xrightarrow[\alpha \rightarrow \infty]{} \sum _{j = 1}^{k+1} p_j^{*}x_j^{*}\). The lemma follows by induction. \(\square \)
Theorem 2 and Lemma 4 supply enough background material for proving Lemma 3. We shall now proceed to do so.
Proof of Lemma 3
We begin by demonstrating the compactness of \(V'_s(h_s)\) for all \(s = 0,..., n\) and \(h_s \in H_s\). The proof is by induction on s. For \(s = 0\), let \(h_0 = (i_n,..., i_0) \in H_0\). \(V'_0(h_0) = \{0\}\) is compact because it is finite. We then have that for all \(h_0 \in H_0\), \(V'_0(h_0)\) is compact. Suppose \(V'_{s-1}(h_{s-1})\) is compact for all \(h_{s-1} \in H_{s-1}\), for some \(s = 1,..., n\).
Let \(h_s = (i_n,..., i_s) \in H_s\) and \(v = v_{s}^{\pi }(h_{s}) \in V'_{s}(h_s)\) with \(\pi = (\delta _n,..., \delta _{1})\). Then \(v = f_{i_{s}}^{\delta _{s}(h_s)} + \rho \cdot \sum _{j = 1}^{N}p_{i_{s}j}^{\delta _{s}(h_s)}v_{s-1}^{\pi }(h_s, j)\). By the induction hypothesis, there exists, for each \(j \in I\), an \(M_j \ge 0\) such that \(\Vert v_{s-1}^{\pi }(h_s, j)\Vert _{\infty } \le M_j\). Applying the triangle inequality twice on \(\Vert v\Vert _{\infty }\), we obtain \(\Vert v\Vert _{\infty } \le \Vert f_{i_{s}}^{\delta _{s}(h_s)}\Vert _{\infty } + \sum _{j = 1}^{N} M_j\), whence
by noticing that for each \(l = 1,..., m\), \(\max _{k \in K_{i_s}}\, |f_{i_{s}, l}^{k}|\) exists because \(K_{i_s}\) is compact and \(k \mapsto |f_{i_{s}, l}^{k}|\) is continuous on \(K_{i_s}\).
Thus, \(V'_s(h_s)\) is bounded. In order to demonstrate its closedness, let \((v^\alpha )_{\alpha \ge 0}\) be a sequence of vectors in \(V'_s(h_s)\) that converges to a \(v \in {\mathbb {R}}^m\). We may write
for some \(\pi _\alpha = (\delta _n^{\pi _\alpha },..., \delta _{1}^{\pi _\alpha }) \in \Pi '\), for all \(\alpha \ge 0\). We endeavor to show that \(v \in V'_s(h_s)\). By the hypothesized compactness of the \(V'_{s-1}(h_s, j)\)’s, \(j \in I\), there exist, according to Lemma 4, a strictly increasing map \(\phi : {\mathbb {N}} \rightarrow {\mathbb {N}}\), and \(p_1^{*}\),..., \(p_{N}^{*} \in [0, 1]\), and \(x_1^{*},..., x_{N}^{*}\) in \(V'_{s-1}(h_s, 1)\),..., \(V'_{s-1}(h_s, N)\) respectively, such that
for all \(j \in I\), and therefore \(\sum _{j = 1}^{N}p_{i_{s}j}^{\delta _{s}^{\pi _{\phi (\alpha )}}(h_s)}v_{s-1}^{\pi _{\phi (\alpha )}}(h_s, j) \xrightarrow[\alpha \rightarrow \infty]{} \sum _{j = 1}^{N}p_j^{*}x_j^{*}\).
Focusing now on the \(p_j^{*}\)’s, observe that since \(\Delta _s\) is compact, there exist a \(\delta ^{*} \in \Delta _s\) and a strictly increasing \(\beta : {\mathbb {N}} \rightarrow {\mathbb {N}}\) such that \(\delta _s^{\pi _{\phi (\beta (\alpha ))}} \xrightarrow[\alpha \rightarrow \infty]{} \delta ^{*}\) and, in particular, in view of the adopted topology, \(\delta _s^{\pi _{\phi (\beta (\alpha ))}}(h_s) \xrightarrow[\alpha \rightarrow \infty]{} \delta ^{*}(h_s)\). By the assumed continuity of transition probabilities on \(K_{i_s}\), it follows that
for all \(j \in I\). Moreover, for all \(j \in I\), the sequence \((p_{i_{s}j}^{\delta _s^{\pi _{\phi (\beta (\alpha ))}}(h_s)})_{\alpha \ge 0}\) is a subsequence of \((p_{i_{s}j}^{\delta _s^{\pi _{\phi (\alpha )}}(h_s)})_{\alpha \ge 0}\), from which it follows that \(p_j^{*} = p_{i_{s}j}^{\delta ^{*}(h_s)}\). To recapitulate, then, we have established that
By the assumed continuity of rewards on \(K_{i_{s}}\), we also have that \(f_{i_{s}}^{\delta _s^{\pi _{\phi (\beta (\alpha ))}}(h_s)} \xrightarrow[\alpha \rightarrow \infty]{} f_{i_{s}}^{\delta ^{*}(h_s)}\). This means that
We thus have a subsequence of \((v^\alpha )_{\alpha \ge 0}\) that converges to \(f_{i_{s}}^{\delta ^{*}(h_s)} + \rho \cdot \sum _{j = 1}^{N}p_{i_{s}j}^{\delta ^{*}(h_s)}x_j^{*}\). Hence \(v = f_{i_{s}}^{\delta ^{*}(h_s)} + \rho \cdot \sum _{j = 1}^{N}p_{i_{s}j}^{\delta ^{*}(h_s)}x_j^{*}\). Recalling that each \(x_j^{*}\) is in \(V'_{s-1}(h_s, j)\), we know from Proposition 4 that there is a \(\pi \in \Pi '\) such that \(v = v_s^{\pi }(h_s)\). Thus \(v \in V'_s(h_s)\). In conclusion, \(V'_s(h_s)\) is closed and bounded, and therefore compact.
Finally, that \(S(u) = \{v \in V'_s(h_s) \mid \, v \ge u\}\) is closed for all \(s = 0,..., n\), \(h_s \in H_s\) and \(u \in V'_s(h_s)\) follows directly from the closedness of \(V'_s(h_s)\) and the fact that if a convergent sequence \((v^\alpha )_{\alpha \ge 0}\) in \({\mathbb {R}}^m\) satisfies \(v^\alpha \ge u\) for all \(\alpha \ge 0\), then its limit v also satisfies \(v \ge u\).
By Theorem 2, Lemma 3 is true. \(\square \)
Proposition 5
(P’) implies \(W'_s(h_s) = {\mathscr {E}}(V'_s(h_s))\) for all \(s = 0,..., n\) and \(h_s \in H_s\).
Proof
Suppose (P’) is true. For \(s = 0\), for all \(h_0 = (i_n,..., i_0) \in H_0\), \(W'_0(h_0) = \{0\} = V'_0(h_0)\); thus, \(W'_0(h_0) = {\mathscr {E}}(V'_0(h_0))\). Let \(s = 1,..., n\) and assume \(W'_{s-1}(h_{s-1}) = {\mathscr {E}}(V'_{s-1}(h_{s-1}))\) for all \(h_{s-1} \in H_{s-1}\).
Let \(h_s = (i_n,..., i_s) \in H_s\) and \(v \in {\mathscr {E}}(V'_s(h_s))\). There exists a \(\pi = (\delta _n,..., \delta _{1}) \in \Pi '\) such that \(v = f_{i_{s}}^{\delta _{s}(h_s)} + \rho \cdot \sum _{j = 1}^{N}p_{i_{s}j}^{\delta _{s}(h_s)}v_{s-1}^{\pi }(h_s, j)\). For all states \(j \in I\), \((h_s, j) \in H_{s-1}\), and by Lemma 3 there exists a \(v_j \in {\mathscr {E}}(V'_{s-1}(h_s, j))\) such that \(v_j \ge v_{s-1}^{\pi }(h_s, j)\). Let \(w = f_{i_{s}}^{\delta _{s}(h_s)} + \rho \cdot \sum _{j = 1}^{N}p_{i_{s}j}^{\delta _{s}(h_s)}v_j\). By Lemma 2, \(w \ge v\). Combining the induction hypothesis with (P’) yields \(w \in V'_s(h_s)\). But since v is efficient in \(V'_s(h_s)\) and \(w \ge v\), we must have \(v = w\). Hence, again by the induction hypothesis, \(v \in F'_s(h_s)\). By (P’), \(F'_s(h_s) \subseteq V'_s(h_s)\), and thus v is also efficient in \(F'_s(h_s)\), i.e. \(v \in W'_s(h_s)\). From this it follows that \({\mathscr {E}}(V'_s(h_s)) \subseteq W'_s(h_s)\).
Now we shall demonstrate the converse inclusion. Let \(w \in W'_s(h_s)\). There exists an action \(k \in K_{i_{s}}\) for which \(w = f_{i_s}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{i_{s}j}^{k}v_j\) for some \(v_j \in W'_{s-1}(h_s, j)\) for all \(j \in I\). By (P’), \(w \in V'_s(h_s)\). Assuming towards a contradiction that \(w \notin {\mathscr {E}}(V'_s(h_s))\), there exists, invoking Lemma 1, a \(v \in {\mathscr {E}}(V'_s(h_s)) \subseteq W'_s(h_s)\) such that \(v \ge w\) and \(v \ne w\). This contradicts the fact that \(w \in W'_s(h_s)\), because \(W'_s(h_s)\) is an efficient set, and shows that \(w \in {\mathscr {E}}(V'_s(h_s))\). Consequently, \(W'_s(h_s) \subseteq {\mathscr {E}}(V'_s(h_s))\).
Finally, for all \(s = 0,..., n\), for all \(h_s \in H_s\), \(W'_s(h_s) = {\mathscr {E}}(V'_s(h_s))\).
\(\square \)
We are now able to substantiate (P’).
Theorem 3
(P’) is true, i.e.
Proof
We proceed by induction on s. For the base case \(s = 1\), let \(h_{1} = (i_n,..., i_{1}) \in H_{1}\), and let \(w = f_{i_{1}}^{k} \in F'_{1}(h_{1})\) for some \(k \in K_{i_{1}}\). Then \(w = v_{1}^{\pi }(h_{1}) \in V'_{1}(h_{1})\) for any policy \(\pi \in \Pi '\) that dictates action k when one control period is left and the history at that time is \(h_{1}\). Consequently, \(F'_{1}(h_{1}) \subseteq V'_{1}(h_{1})\) for all \(h_{1} \in H_{1}\).
Let us now assume that for some \(s = 1,..., n\) we have \(F'_{s-1}(h_{s-1}) \subseteq V'_{s-1}(h_{s-1})\) for all \(h_{s-1} \in H_{s-1}\). Let \(h_s = (i_n,..., i_s) \in H_s\), and let \(w \in F'_s(h_s)\). There exist a \(k \in K_{i_s}\) and, for each \(j \in I\), a \(v_j \in W'_{s-1}(h_s, j)\) satisfying \(w = f_{i_{s}}^{k} + \rho \cdot \sum _{j = 1}^{N}p_{i_{s}j}^{k}v_j\). By our induction hypothesis, \(W'_{s-1}(h_s, j) = {\mathscr {E}}(F'_{s-1}(h_s, j)) \subseteq V'_{s-1}(h_s, j)\) for all \(j \in I\), so that each \(v_j\) can be written as \(v_j = v_{s-1}^{\pi _j}(h_s, j)\) for a policy \(\pi _j \in \Pi '\). Thus,
By virtue of Proposition 4, there exists a policy \(\pi \in \Pi '\) such that \(w = v_s^{\pi }(h_s)\). Thus, \(w \in V'_s(h_s)\). As a result, \(F'_s(h_s) \subseteq V'_s(h_s)\). In sum, we have shown by induction that for all \(s = 1,..., n\), \(F'_s(h_s) \subseteq V'_s(h_s)\) for all \(h_s \in H_s\). \(\square \)
By combining Theorem 3 with Proposition 5, we obtain the following relation between the solutions to Equations (9) and (10) and the efficient sets of policy returns over \(\Pi '\), the \({\mathscr {E}}(V'_s(h_s))\)’s:
Theorem 4
For all \(s = 0,..., n\) and \(h_s = (i_n,..., i_s) \in H_s\), \({\mathscr {E}}(V'_s(h_s))\) is the unique solution \(W'_s(h_s)\) to either of the following equations:
The equations can be simplified by noticing that \(W'_s(h_s)\) depends on \(h_s\) only through \(i_{s}\). Because \(W'_0(h_0) = \{0\} = W_0(i_{0})\) for all \(h_0 = (i_n,..., i_0) \in H_0\), this has the important implication that \(W'_s(h_s)\) is none other than \(W_s(i_{s})\), with \(W_s(i_{s})\) defined by Equations (4) and (5).
Proposition 6
For all \(s = 0,..., n\), for all \(h_s = (i_n,..., i_s) \in H_s\), \(W'_s(h_s) = W_s(i_{s})\).
Proof
The property evidently holds for \(s = 0\). Assume that for some \(s = 1,..., n\), \(W'_{s-1}(h_{s-1}) = W_{s-1}(i_{s-1})\) for all \(h_{s-1} = (i_n,..., i_{s-1}) \in H_{s-1}\). For all \(h_s = (i_n,..., i_s) \in H_s\), we have by definition,
hence
applying the induction hypothesis. Thus, \(W'_s(h_s) = W_s(i_{s})\). The proposition follows. \(\square \)
By connecting Proposition 6 with Theorem 4, we discover that the solutions to White’s equations are the \({\mathscr {E}}(V'_s(h_s))\)’s.
Theorem 5
For all \(s = 0,..., n\), for all \(h_s = (i_n,..., i_s) \in H_s\), \(W_s(i_s) = {\mathscr {E}}(V'_s(h_s))\).
An interesting byproduct of Theorem 5 and Corollary 1 is that under the determinism assumption introduced in Sect. 3, all efficient policy returns over \(\Pi '\) are attained by policies in \(\Pi \subset \Pi '\). That is, given the same initial state, an optimal policy in \(\Pi \) can accrue as “large” an expected reward as an optimal policy in the whole of \(\Pi '\). In this case the decision maker, rather than considering the full range of policies in \(\Pi '\), is justified in focusing exclusively on those that are Markovian, which are easier to implement and evaluate.
Corollary 2
Suppose Assumption 1 is satisfied. Then, for all initial states \(i_n \in I\),
6 An example
In this section we report the results of experimental tests of Corollary 1 (Sect. 4), Theorem 5 (Sect. 5) and Corollary 2 (Sect. 5). Recall that the model in Sect. 2 furnished a counterexample to White’s theorem, since
and \(W_3(1) \ne {\mathscr {E}}(V_3(1))\). Setting \(n = 3\), we now show that, as Theorem 5 assures us, \(W_3(1)\) coincides exactly with \({\mathscr {E}}(V'_3(1))\). An exhaustive search of \(\Pi '\) yielded
Thus \(W_3(1) = {\mathscr {E}}(V'_3(1))\), a fact in accord with Theorem 5.
As a test of Corollary 1, we solved the White equations under Assumption 1, setting \(p_{11}^{a} = p_{12}^{b} = 1\). The rewards were unchanged from Sect. 2. By evaluating Equations (3) and (4), we obtained
and a full search over the eight policies in \(\Pi \) returned \({\mathscr {E}}(V_3(1)) = W_3(1)\).
To support Corollary 2, we calculated \({\mathscr {E}}(V'_1(1))\) in the same deterministic setting with a view to comparing it with \({\mathscr {E}}(V_1(1))\). Through a comprehensive search, we found
which corroborates Corollary 2.
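The exhaustive searches reported above can be reproduced in outline. The sketch below is not the model of Sect. 2 (whose parameters are given there) but an invented deterministic two-state, two-action model; it enumerates all Markovian policies, evaluates each return by accumulating discounted rewards along the (deterministic) trajectory, and extracts the Pareto-efficient returns:

```python
from itertools import product

def evaluate(policy, i0, f, nxt, rho):
    """Return vector of a Markovian policy (delta_n, ..., delta_1) from
       state i0 in a deterministic model; delta_n is applied first."""
    v, i, disc = [0.0, 0.0], i0, 1.0
    for delta in policy:
        k = delta[i]
        v = [vl + disc * fl for vl, fl in zip(v, f[i][k])]
        i, disc = nxt[i][k], disc * rho
    return tuple(v)

def efficient(returns):
    dom = lambda y, x: all(a >= b for a, b in zip(y, x)) and y != x
    return {x for x in returns if not any(dom(y, x) for y in returns)}

# Invented deterministic model; these numbers are NOT those of Sect. 2.
f = {1: {'a': (2.0, 0.0), 'b': (0.0, 2.0)},
     2: {'a': (1.0, 1.0), 'b': (0.0, 0.0)}}
nxt = {1: {'a': 2, 'b': 1}, 2: {'a': 1, 'b': 2}}
rho, n, states, actions = 0.5, 3, (1, 2), ('a', 'b')

# A decision rule maps states to actions; a Markovian policy is one rule
# per remaining epoch.  Here there are 4^3 = 64 policies in total.
rules = [dict(zip(states, ks)) for ks in product(actions, repeat=len(states))]
returns = {evaluate(pi, 1, f, nxt, rho) for pi in product(rules, repeat=n)}
print(sorted(efficient(returns)))
# -> [(0.0, 3.5), (0.5, 3.0), (1.25, 2.25), (2.5, 1.0), (3.0, 0.5)]
```

Because the model is deterministic, each policy's return depends only on the actions taken along the single trajectory it generates, which is why the 64 policies collapse to just eight distinct return vectors here.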
7 Generalization to other vector-valued Markov decision processes
Our findings in Sects. 3, 4 and 5 generalize beyond Markov decision processes with componentwise-ordered rewards. The only properties of the componentwise order we have relied upon are its transitivity, the implications in Lemma 2, and the fact that inequalities are preserved under limits. By the latter we mean that whenever a sequence of vectors \((v^{\alpha })_{\alpha \ge 0}\) converges to a \(v \in {\mathbb {R}}^m\) while satisfying \(v^{\alpha } \ge u\) for some \(u \in {\mathbb {R}}^m\) and for all \(\alpha = 0, 1,...\), then \(v \ge u\). Transitivity and Lemma 2 were used on several occasions to justify return vector comparisons; inequality preservation helped establish the conditions for applying Theorem 2 to prove Lemmas 1 and 3. Among the arguments adduced in our proofs of Propositions 1 and 5, Lemmas 1, 2 and 3 were the only ones that bear on the order relation. On the other hand, Lemma 4, Propositions 3, 4 and 6, and Theorems 1, 2 and 3 are independent of the order relation. Corollary 1 was a consequence of Theorem 1 and Proposition 1; Theorem 4 a consequence of Theorem 3 and Proposition 5; Theorem 5 a consequence of Theorem 4 and Proposition 6; and Corollary 2 a consequence of Theorem 5 and Corollary 1. Thus, we may specify a broad class of relations, of which the usual componentwise order is one instance, to which these results carry over.
Proposition 7
Let \(\succeq \) be a transitive relation on \({\mathbb {R}}^m\). For any \(X \subseteq {\mathbb {R}}^m\), let \({\mathscr {E}}(X) = \bigl \{\, x \in X \mid \, \forall y \in X,\, y \succeq x \implies y = x \,\bigr \}\) denote the elements of X that are maximal (efficient) relative to \(\succeq \). Suppose \(\succeq \) meets the following conditions:
-
(1)
for all \(x, y, z \in {\mathbb {R}}^m\), \(x \succeq y\) implies \((x+z) \succeq (y+z)\);
-
(2)
for all \(x, y \in {\mathbb {R}}^m\) and all nonnegative reals a, \(x \succeq y\) implies \(ax \succeq ay\);
-
(3)
for all \(n = 0, 1,...\) and \(i \in I\), the set \(S(u) = \{v \in V_n(i) \mid \, v \succeq u\}\) is closed for all \(u \in V_n(i)\);
-
(4)
for all \(n = 0, 1,...\), \(s = 0,..., n\), and \(h_s \in H_s\), the set \(S'(u) = \{v \in V'_s(h_s) \mid \, v \succeq u\}\) is closed for all \(u \in V'_s(h_s)\).
Then Lemmas 1–4, Propositions 1–6, Theorems 1–5 and Corollaries 1 and 2 are true with respect to \(\succeq \).
Proof
As noted above, Lemma 4, Propositions 3, 4 and 6, and Theorems 1–3 are true irrespective of \(\succeq \). Lemma 2 is implied by conditions (1) and (2). Theorem 2 and conditions (3) and (4) assure us that in order for Lemmas 1 and 3 to follow, we need only show that \(V_n(i)\) and \(V'_s(h_s)\) are compact for all \(n = 0, 1,...\), \(i \in I\), \(s = 0,..., n\) and \(h_s \in H_s\). The reader can verify that the parts involving \(\ge \) in our demonstration of the compactness of \(V'_s(h_s)\) on page 14 only make use of Lemmas 2 and 4, both of which are true here, and that an analogous proof can be given for \(V_n(i)\). Lemmas 1 and 3 are therefore true. Thus, from the discussion above, Propositions 1 and 5 hold. Corollary 1 follows from Theorem 1 and Proposition 1, Theorem 4 from Theorem 3 and Proposition 5, Theorem 5 from Theorem 4 and Proposition 6, and Corollary 2 from Theorem 5 and Corollary 1. \(\square \)
If \(\succeq \) is not merely a transitive relation but a partial order (reflexive, antisymmetric and transitive), conditions (3) and (4) can be replaced with the assumption that for each \(n = 0, 1,...\), \(s = 0,..., n\), \(i \in I\) and \(h_s \in H_s\), every totally ordered subset (chain) of \(V_n(i)\) and of \(V'_s(h_s)\) is bounded above with respect to \(\succeq \). In this case Theorem 2 becomes superfluous, for Lemmas 1 and 3 can be derived directly from Zorn’s lemma (Zorn, 1935). To illustrate this, consider just Lemma 1. Let \(u \in V_n(i)\) and write \(K(u) = \{v \in V_n(i) \mid \, v \succeq u\}\). We wish to show that \(K(u) \cap {\mathscr {E}}(V_n(i))\) is nonempty. To this end, observe that there is at least one chain in \(K(u) \cap V_n(i) = K(u)\): since \(\succeq \) is reflexive, we have \(u \succeq u\), so that K(u) is nonempty. Let C be such a chain. Clearly, C is also a chain in \(V_n(i)\), and is thus, by our assumption, bounded above by an \(x \in V_n(i)\). By the transitivity of \(\succeq \) and the definition of K(u), x must be in K(u). Therefore, C admits an upper bound in K(u). This being true for any chain in K(u), the requirements for Zorn’s lemma are satisfied, and K(u) has at least one maximal element with respect to \(\succeq \); that is, \({\mathscr {E}}(K(u)) = {\mathscr {E}}(K(u) \cap V_n(i)) \ne \emptyset \). Now, as can easily be checked, \({\mathscr {E}}(K(u))\) is a subset of \(K(u) \cap {\mathscr {E}}(V_n(i))\). Consequently \(K(u) \cap {\mathscr {E}}(V_n(i))\) is nonempty, hence Lemma 1. Lemma 3 can be established in exactly the same fashion, and the rest of the propositions, theorems and corollaries follow as above.
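The genericity discussed in this section is easy to realize concretely: the efficient-subset operator only needs the relation as a parameter. The Python sketch below (invented points; the componentwise and lexicographic orders are merely two admissible examples of transitive relations) illustrates this:

```python
def efficient(points, succeq):
    """E(X) relative to a transitive relation `succeq`: the elements x of X
       such that y ≽ x implies y = x for every y in X."""
    return [x for x in points
            if all(not succeq(y, x) or y == x for y in points)]

# Two transitive relations on R^2.
componentwise = lambda y, x: all(a >= b for a, b in zip(y, x))
lexicographic = lambda y, x: y >= x  # Python tuple comparison is lexicographic

X = [(1, 3), (2, 2), (3, 1), (2, 3)]
print(efficient(X, componentwise))   # -> [(3, 1), (2, 3)]
print(efficient(X, lexicographic))   # -> [(3, 1)]  (a total order: one maximum)
```

Any relation satisfying the conditions of Proposition 7 could be passed in the same way; the backward recursion itself is unchanged.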
8 Summary and discussion
The aims of this paper were (a) to show that the hypotheses underlying the vector extension of the Bellman equations of a Markov decision process do not ensure the extension’s validity, and (b) to propose alternative hypotheses that do. A counterexample to the theorem on which this extension rests was provided, and an explanation as to why the theorem failed was advanced. It was found that the theorem holds (1) when the decision making horizon spans one or two control intervals, (2) when action choice, conditional on the present state, determines the ensuing state with certainty, and (3) when the policy space is enlarged from just Markovian policies to include a class of policies dependent upon the state’s historical progression. These results extend to problems where relations other than the usual componentwise order are used for comparing rewards, provided they satisfy the hypotheses of Proposition 7.
We close with some comments on the theorem that formed the basis of this work. Recall that at each state i and with \(n = 1, 2,...\) epochs remaining until process termination, the corresponding White equation is given by
Recall also that the theorem says that \(W_n(i) = {\mathscr {E}}(V_n(i))\).
First, although the original statement of the theorem indicates that this equation is in “\(\bigoplus \) sum-set form" (D. White, 1982, p. 6), nowhere in that paper, as a reviewer points out, is the summation symbol \(\sum \) defined. We surmised, having read White’s Generalized efficient solutions for sums of sets (D.J. White, 1980), which was a correction and an extension of a paper dealing with efficiency in sum sets (Moskowitz, 1975), that the symbol denoted a sum set. Indeed, in Moskowitz (1975), given t nonempty sets \(X_1\),..., \(X_t\), the notation \(\sum _{l = 1}^{t}X_l\) refers to the set \(\{\sum _{l = 1}^{t}x_l: x_l \in X_l, \forall l\}\), that is, the sum set \(X_1 \bigoplus ... \bigoplus X_t\). Since Generalized efficient solutions for sums of sets appeared before the publication of D. White (1982), and since White adopts in it the exact same notation as Moskowitz, including the notation above, and since we have found no writing of his where he introduces a definition of \(\sum \) at variance with Moskowitz’s, we were compelled to assume that his use of the symbol in D. White (1982) was consistent with the sum set definition. One might note that the same view is held in published applications of the White equations; see, for example, Mandow et al. (2022).
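Under this sum-set reading, one step of White's recursion is straightforward to spell out. The Python sketch below implements the sum set \(\bigoplus \) and one backward update \(W_n(i) = {\mathscr {E}}\bigl (\bigcup _k \{f_i^k\} \oplus \rho \sum _j p_{ij}^k W_{n-1}(j)\bigr )\); the two-state data at the bottom are invented for illustration and are not the model of Sect. 2:

```python
from itertools import product

def sum_sets(*sets):
    """Sum set X1 ⊕ ... ⊕ Xt = { x1 + ... + xt : xl in Xl for all l }."""
    return {tuple(map(sum, zip(*combo))) for combo in product(*sets)}

def scale(c, X):
    """Scalar multiple c·X = { c·x : x in X }."""
    return {tuple(c * xi for xi in x) for x in X}

def efficient(X):
    dom = lambda y, x: all(a >= b for a, b in zip(y, x)) and y != x
    return {x for x in X if not any(dom(y, x) for y in X)}

def white_step(i, W_prev, f, p, rho, actions):
    """One White update: W_n(i) = E( U_k {f_i^k} ⊕ rho·sum_j p_ij^k·W_{n-1}(j) )."""
    F = set()
    for k in actions[i]:
        F |= sum_sets({f[i][k]},
                      *(scale(rho * p[i][k][j], W_prev[j]) for j in sorted(W_prev)))
    return efficient(F)

# Invented two-state, two-action data (NOT the model of Sect. 2).
f = {1: {'a': (1.0, 0.0), 'b': (0.0, 1.0)},
     2: {'a': (0.5, 0.5), 'b': (0.0, 0.0)}}
p = {1: {'a': {1: 1.0, 2: 0.0}, 'b': {1: 0.0, 2: 1.0}},
     2: {'a': {1: 1.0, 2: 0.0}, 'b': {1: 0.0, 2: 1.0}}}
actions = {1: ['a', 'b'], 2: ['a', 'b']}

W = {1: {(0.0, 0.0)}, 2: {(0.0, 0.0)}}           # W_0(i) = {0}
for _ in range(2):                               # two backward steps
    W = {i: white_step(i, W, f, p, 0.5, actions) for i in (1, 2)}
print(sorted(W[1]))  # -> [(0.25, 1.25), (1.0, 0.5), (1.5, 0.0)]
```

As the paper shows, iterating this recursion yields \({\mathscr {E}}(V'_s(h_s))\) over the refined policy space \(\Pi '\), which need not coincide with \({\mathscr {E}}(V_s(i_s))\) over \(\Pi \) in general.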
Secondly, consider the theorem’s proof, which begins on page 6 of D. White (1982) and proceeds by induction on n. For \(n = 0\), it is readily verifiable (see our proof of Proposition 1) that \(W_0(i) = {\mathscr {E}}(V_0(i))\) for each \(i \in I\). Now, assuming that the theorem holds for \(n-1\), for some \(n \geqq 1\), the inductive step has two parts. In the first part, White endeavors to show that \({\mathscr {E}}(V_n(i)) \subseteq W_n(i)\) for all \(i \in I\). The argument is as follows. If \(v \in {\mathscr {E}}(V_n(i))\), then \(v = v_n^{\pi }(i) = f_{i}^{\delta _{n}(i)} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{\delta _{n}(i)}v_{n-1}^{\pi }(j)\) for some policy \(\pi = (\delta _{n},..., \delta _{1}) \in \Pi \). For each state \(j \in I\), there exists, according to Lemma 1, a policy \(\pi _j\) such that \(v_{n-1}^{\pi _j}(j) \in {\mathscr {E}}(V_{n-1}(j))\) and \(v_{n-1}^{\pi _j}(j) \geqq v_{n-1}^{\pi }(j)\). By the induction hypothesis, each \(v_{n-1}^{\pi _j}(j)\), \(j \in I\), lies in \(W_{n-1}(j)\). Let \(w = f_{i}^{\delta _{n}(i)} + \rho \cdot \sum _{j = 1}^{N}p_{ij}^{\delta _{n}(i)}v_{n-1}^{\pi _j}(j)\). Then, \(w \in W_n(i)\) and \(w \geqq v\). At this point White makes the claim that \(w \in V_n(i)\), an assertion which, when combined with the previous sentence, implies \(v = w\) and hence \(v \in W_n(i)\). But as our counterexample demonstrates, the assertion is unwarranted, and herein lies the error in the proof. It might be noted, incidentally, that w does belong to \(V'_n(i)\) as a consequence of Proposition 4.
Another issue with the proof concerns its second part. Under the induction hypothesis and on the assumption that the first inclusion has been established, White attempts to show the converse inclusion. The argument begins with the next two sentences, reproduced in our notation: “Now let \(w \in W_n(i)\). Then there is a \(\pi \in \Pi \) such that \(v = v_n^{\pi }(i)\), and if \(\delta _n\) is the first decision rule in \(\pi \), we have, again, [Equation (1)], with \(v_{n-1}^{\pi }(j) \in W_{n-1}(j), \forall j \in I\)”. In the first place, as a reviewer indicates, it is unclear what v, which appears only once in the whole of the argument, represents in this passage. Quite probably a misprint occurred and w was the intended letter, especially considering that the relevant sentence draws a direct conclusion from the fact that \(w \in W_n(i)\). In the second place, even if v did represent w, so that \(w = v_n^{\pi }(i)\), then the identical error that marred White’s treatment of the first part is repeated here, for, as we have shown in this paper, a point in \(W_n(i)\) need not lie in \(V_n(i)\), unless property (P) or some equivalent property is satisfied.
Notes
The conclusions of this paper also hold for \(\rho = 1\). We assume \(\rho < 1\) merely for consistency with (D. White, 1982).
An antichain is a subset of a partially ordered set in which no distinct elements are comparable. Formally, \(K \subseteq (X, \succeq )\) is an antichain in X if for all \(x, y \in K\), \(x \ne y\) implies \(x \not \succeq y\) and \(y \not \succeq x\).
References
Geoffrion, A. M. (1968). Proper efficiency and the theory of vector maximization. Journal of Mathematical Analysis and Applications, 22(3), 618–630.
Hayes, C. F., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., et al. (2022). A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1), 26.
Mandow, L., Pérez-de-la Cruz, J.-L., & Pozas, N. (2022). Multi-objective dynamic programming with limited precision. Journal of Global Optimization, 82(3), 595–614.
Mannor, S., & Shimkin, N. (2004). A geometric approach to multi-criterion reinforcement learning. The Journal of Machine Learning Research, 5, 325–360.
Moskowitz, H. (1975). A recursion algorithm for finding pure admissible decision functions in statistical decisions. Operations Research, 23(5), 1037–1042.
Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. Wiley.
Roijers, D. M., Röpke, W., Nowé, A., & Rădulescu, R. (2021). On following Pareto-optimal policies in multi-objective planning and reinforcement learning. In Proceedings of the multi-objective decision making (MODeM) workshop.
Ruiz-Montiel, M., Mandow, L., & Pérez-de-la Cruz, J.-L. (2017). A temporal difference method for multi-objective reinforcement learning. Neurocomputing, 263, 15–25.
Van Moffaert, K., & Nowé, A. (2014). Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research, 15(1), 3483–3512.
White, D. (1977). Kernels of preference structures. Econometrica: Journal of the Econometric Society, 45, 91–100.
White, D. (1982). Multi-objective infinite-horizon discounted Markov decision processes. Journal of Mathematical Analysis and Applications, 89(2), 639–647.
White, D. J. (1980). Generalized efficient solutions for sums of sets. Operations Research, 28(3), 844–846.
Wiering, M. A., & De Jong, E. D. (2007). Computing optimal stationary policies for multi-objective Markov decision processes. In 2007 IEEE international symposium on approximate dynamic programming and reinforcement learning (pp. 158–165).
Zorn, M. (1935). A remark on method in transfinite algebra. Bulletin of the American Mathematical Society, 41(10), 667–670.
Acknowledgements
The author would like to thank the Editor and the reviewers for the meticulousness with which they pored over the paper. He is indebted to Prof. Dominikus Noll for stimulating conversations that provided the impetus for Sect. 4. He also acknowledges the assistance of Dr. Slim Kammoun with certain aspects of Sect. 5.
Funding
Open access funding provided by Université Toulouse III - Paul Sabatier.
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Mifrani, A. A counterexample and a corrective to the vector extension of the Bellman equations of a Markov decision process. Ann Oper Res 345, 351–369 (2025). https://doi.org/10.1007/s10479-024-06439-x