Abstract
In 1976 I was looking for a suitable subject for my PhD thesis. My thesis advisor Arie Hordijk and I found a lot of inspiration in Derman’s book (Finite state Markovian decision processes, Academic Press, New York, 1970). Since that time I have been interested in linear programming methods for Markov decision processes. In this article I will describe some results in this area on the following topics: (1) MDPs with the average reward criterion; (2) additional constraints; (3) applications. These topics are the main elements of Derman’s book.
Introduction
When Arie Hordijk was appointed at Leiden University in 1976, I became his first PhD student in Leiden. Hordijk was the successor of Guus Zoutendijk, who had chosen to leave the university for a position as chairman of the executive board of the Delta Lloyd Group. Zoutendijk was the supervisor of my master’s thesis and a leading expert in linear and nonlinear programming. When I was looking for a PhD project, Hordijk suggested linear programming (for short, LP) for the solution of Markov Decision Processes (for short, MDPs). LP for MDPs was introduced by D’Epenoux (1960) for the discounted case. De Ghellinck (1960) as well as Manne (1960) obtained LP formulations for the average reward criterion in the irreducible case. The first analysis of LP for the multichain case was given by Denardo and Fox (1968). Our interest was raised by Derman’s remark (Derman 1970, p. 84): “No satisfactory treatment of the dual program for the multiple class case has been published”.
We started to work on this subject and succeeded in presenting a satisfactory treatment of the dual program for multichained MDPs. We proved a theorem from which a simple algorithm follows for the determination of an optimal deterministic policy (Hordijk and Kallenberg 1979). In Sect. 3 we describe this approach. Furthermore, we present in Sect. 3 some examples which show the essential difference between irreducible, unichained and multichained MDPs. These examples show for general MDPs:

1.
An extreme optimal solution of the dual program may have in some state more than one positive variable and consequently an extreme feasible solution of the dual program may correspond to a nondeterministic policy (Example 2).

2.
Two different solutions may correspond to the same deterministic policy (Example 3).

3.
A nonoptimal solution of the dual program may correspond to an optimal deterministic policy (Example 4).

4.
The results of the unichain case cannot be generalized to the general single chain case (Example 5).
The second topic of this article concerns additional constraints. Chapter 7 of Derman’s book deals with this subject and has as title “State-action frequencies and problems with constraints”. This chapter may be considered as the starting point for the study of MDPs with additional constraints.
For unichained MDPs with additional constraints, Derman has shown that an optimal policy can be found in the class of stationary policies. We have generalized these results in the sense that for multichained MDPs stationary policies are not sufficient; however, in that case there exists an optimal policy in the class of Markov policies. This subject is presented in Sect. 4.
Derman’s book also deals with some applications, for instance optimal stopping and replacement problems. In the last part, Sect. 5, of this paper we will discuss LP methods for the following applications:

1.
Optimal stopping problems.

2.
Replacement problems:

(a)
General replacement problems;

(b)
Replacement problems with increasing deterioration;

(c)
Skip to the right problems with failure;

(d)
Separable replacement problems.

3.
Multiarmed bandit problems.

4.
Separable problems with both the discounted and the average reward criterion.
Notations and definitions
Let S be the finite state space and A(i) the finite action set in state i∈S. If in state i action a∈A(i) is chosen, then a reward r _{ i }(a) is earned and p _{ ij }(a) is the transition probability that the next state is state j.
A policy R is a sequence of decision rules: R=(π ^{1},π ^{2},…,π ^{t},…), where π ^{t} is the decision rule at time point t, t=1,2,…. The decision rule π ^{t} at time point t may depend on all available information on the system until time t, i.e., on the states at the time points 1,2,…,t and the actions at the time points 1,2,…,t−1.
Let C denote the set of all policies. A policy is said to be memoryless if the decision rules π ^{t} are independent of the history; they depend only on the state at time t. We call C(M) the set of the memoryless policies. Memoryless policies are also called Markov policies.
If a policy is memoryless and the decision rules are independent of the time point t, then the policy is called stationary. Hence, a stationary policy is determined by a nonnegative function π on S×A, where S×A={(i,a)∣i∈S, a∈A(i)}, such that ∑_{ a } π _{ ia }=1 for every i∈S. The stationary policy R=(π,π,…) is denoted by π ^{∞}. The set of stationary policies is denoted by C(S).
If the decision rule π of a stationary policy is nonrandomized, i.e., for every i∈S, we have π _{ ia }=1 for exactly one action a, then the policy is called deterministic. A deterministic policy can be described by a function f on S, where f(i) is the chosen action in state i. A deterministic policy is denoted by f ^{∞} and the set of deterministic policies by C(D).
A matrix P=(p _{ ij }) is a transition matrix if p _{ ij }≥0 for all (i,j) and ∑_{ j } p _{ ij }=1 for all i. Notice that P defines a stationary Markov chain. For a Markov policy R=(π ^{1},π ^{2},…) the transition matrix P(π ^{t}) is defined by
$$\{P(\pi^{t})\}_{ij} = \sum_{a} p_{ij}(a) \pi^{t}_{ia},\quad i,j \in S,$$
and the vector r(π ^{t}), defined by
$$\{r(\pi^{t})\}_{i} = \sum_{a} \pi^{t}_{ia} r_{i}(a),\quad i \in S,$$
is called the reward vector.
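The construction of P(π) and r(π) can be illustrated with a minimal sketch. It assumes the usual definitions {P(π)}_{ij} = ∑_{a} p_{ij}(a)π_{ia} and {r(π)}_{i} = ∑_{a} π_{ia} r_{i}(a), uses the data of Example 1 from Sect. 3, and applies a randomized decision rule that is chosen here purely for illustration. The dictionary layout and the function name `induced_chain` are assumptions of this sketch, not notation from the paper.

```python
from fractions import Fraction

# Data of Example 1: p[(i, a)] maps a next state j to p_ij(a); r[(i, a)] is r_i(a).
states = [1, 2, 3]
p = {(1, 1): {2: 1}, (1, 2): {1: 1},
     (2, 1): {3: 1}, (2, 2): {1: 1}, (2, 3): {2: 1},
     (3, 1): {2: 1}, (3, 2): {3: 1}}
r = {(1, 1): 0, (1, 2): 2, (2, 1): 1, (2, 2): 1, (2, 3): 3,
     (3, 1): 2, (3, 2): 4}


def induced_chain(pi):
    """Return (P, r_pi) for a decision rule pi, where pi[i][a] = pi_ia."""
    P = {i: {j: Fraction(0) for j in states} for i in states}
    r_pi = {i: Fraction(0) for i in states}
    for i in states:
        for a, weight in pi[i].items():
            r_pi[i] += weight * r[(i, a)]
            for j, prob in p[(i, a)].items():
                P[i][j] += weight * prob
    return P, r_pi


# Illustrative rule: randomize in state 1, act deterministically elsewhere.
half = Fraction(1, 2)
pi = {1: {1: half, 2: half}, 2: {1: Fraction(1)}, 3: {2: Fraction(1)}}
P, r_pi = induced_chain(pi)
print(P[1], r_pi)
```

Exact fractions are used so that row sums and rewards can be compared without rounding.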
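Let the random variables X _{ t } and Y _{ t } denote the state and action at time t. Given starting state i, policy R and a discount factor α∈(0,1), the discounted reward and the average reward are denoted by \(v^{\alpha}_{i}(R)\) and ϕ _{ i }(R), respectively, and defined by
$$v^{\alpha}_{i}(R) = \sum_{t=1}^{\infty} \alpha^{t-1} \sum_{j,a} \mathbb{P}_{R}\{X_{t} = j, Y_{t} = a \mid X_{1} = i\} \cdot r_{j}(a)$$
and
$$\phi_{i}(R) = \liminf_{T \rightarrow\infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{j,a} \mathbb{P}_{R}\{X_{t} = j, Y_{t} = a \mid X_{1} = i\} \cdot r_{j}(a),$$
respectively.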
The value vectors v ^{α} and ϕ for discounted and average rewards are defined by \(v^{\alpha}_{i} = \sup_{R}v^{\alpha}_{i}(R)\), i∈S, and ϕ _{ i }=sup_{ R } ϕ _{ i }(R), i∈S, respectively.
A policy R ^{∗} is a discounted optimal policy if \(v^{\alpha}_{i}(R^{*})= v^{\alpha}_{i}\), i∈S; similarly, R ^{∗} is an average optimal policy if ϕ _{ i }(R ^{∗})=ϕ _{ i }, i∈S. It is well known that, for both discounted and average rewards, an optimal policy exists and can be found within C(D), the class of deterministic policies.
An MDP is called irreducible if, for all deterministic decision rules f, in the Markov chain P(f) all states belong to a single ergodic class.
An MDP is called unichained if, for all deterministic decision rules f, in the Markov chain P(f) all states belong to a single ergodic class plus a (perhaps empty and decision rule dependent) set of transient states. In the weak unichain case every optimal deterministic policy f ^{∞} has a unichain Markov chain P(f); in the general single chain case at least one optimal deterministic policy f ^{∞} has a unichain Markov chain P(f).
An MDP is called multichained if there may be several ergodic classes and some transient states; these classes may vary from policy to policy.
An MDP is communicating if for every i,j∈S there exists a deterministic policy f ^{∞}, which may depend on i and j, such that in the Markov chain P(f) state j is accessible from state i.
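The communicating property can be checked with a simple graph search. The sketch below relies on the observation that state j is accessible from state i under some deterministic policy exactly when j can be reached from i in the directed graph that has an arc i→j whenever p_{ij}(a)>0 for some action a (a simple path visits each state at most once, so one decision rule can realize it). The function names are assumptions of this sketch; the two data sets are those of Example 1 and Example 4 from Sect. 3.

```python
def reachable(start, succ):
    """Set of states reachable from `start` in the graph `succ` (depth-first)."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j in succ[i]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def is_communicating(states, p):
    """p[(i, a)] maps next state j to p_ij(a)."""
    succ = {i: set() for i in states}
    for (i, _a), row in p.items():
        succ[i].update(j for j, prob in row.items() if prob > 0)
    return all(reachable(i, succ) == set(states) for i in states)


# Example 1: every state can reach every other state.
p_ex1 = {(1, 1): {2: 1}, (1, 2): {1: 1},
         (2, 1): {3: 1}, (2, 2): {1: 1}, (2, 3): {2: 1},
         (3, 1): {2: 1}, (3, 2): {3: 1}}
# Example 4: state 3 is absorbing under its only action.
p_ex4 = {(1, 1): {2: 1}, (1, 2): {3: 1},
         (2, 1): {1: 1}, (2, 2): {2: 1},
         (3, 1): {3: 1}}

ex1_communicating = is_communicating([1, 2, 3], p_ex1)
ex4_communicating = is_communicating([1, 2, 3], p_ex4)
print(ex1_communicating, ex4_communicating)
```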
It is well known that for irreducible, unichained and communicating MDPs the value vector has identical components. Hence, in these cases one uses, instead of a vector, a scalar ϕ for the value.
LP for MDPs with the average reward criterion
The irreducible case
In Chap. 6, pp. 78–80, of Derman’s book the following result can be found, which originates from Manne (1960).
Theorem 1
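Let (v ^{∗},u ^{∗}) and x ^{∗} be optimal solutions of (1) and (2), respectively, where
$$\min\left\{ v \left\vert\ v + \sum_{j} \{\delta_{ij} - p_{ij}(a)\}u_{j} \geq r_{i}(a),\ (i,a) \in S \times A \right\}\right.$$(1)
and
$$\max\left\{\sum_{i,a} r_i(a) x_i(a)\left\vert \begin{array}{l@{\quad}l}\sum_{i,a} \{\delta_{ij} - p_{ij}(a)\}x_i(a) = 0,& j \in S\\ [0.1cm]\sum_{i,a} x_i(a) = 1 \\ [0.1cm]x_i(a) \geq0, &(i,a) \in S \times A\end{array}\right\}.\right.$$(2)
Let \(f_{*}^{\infty}\) be such that \(x_{i}^{*}(f_{*}(i)) > 0\), i∈S. Then, \(f_{*}^{\infty}\) is well defined and an average optimal policy. Furthermore, v ^{∗}=ϕ, the value.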
The unichain case
Theorem 2
Let (v ^{∗},u ^{∗}) and x ^{∗} be optimal solutions of (1) and (2), respectively. Let \(S_{*} = \{i\mid\sum_{a} x^{*}_{i}(a) >0\}\). Choose \(f_{*}^{\infty}\) such that \(x_{i}^{*}(f_{*}(i)) > 0\) if i∈S _{∗} and choose f _{∗}(i) arbitrarily if i∉S _{∗}. Then, \(f_{*}^{\infty}\) is an average optimal policy. Furthermore, v ^{∗}=ϕ, the value.
This linear programming result for unichained MDPs was derived by Denardo (1970). I suppose that Derman was also aware of this result, although it was not explicitly mentioned in his book. Theorem 2 on p. 75 and the subsequent text on p. 76 are the reason for my supposition. The result of Theorem 2, but with a different proof, is part of my thesis (Kallenberg 1980), which was also published in Kallenberg (1983).
The communicating case
Since the value vector ϕ is constant in communicating MDPs, the value ϕ is the unique v ^{∗}-part of an optimal solution (v ^{∗},u ^{∗}) of the linear program (1). One would expect that an optimal policy could also be obtained from the dual program (2). The next example shows that, in contrast with the irreducible and the unichain case, in the communicating case the optimal solution of the dual program does not provide an optimal policy, in general.
Example 1
S={1,2,3}; A(1)={1,2}, A(2)={1,2,3}, A(3)={1,2}. r _{1}(1)=0, r _{1}(2)=2; r _{2}(1)=1, r _{2}(2)=1, r _{2}(3)=3; r _{3}(1)=2; r _{3}(2)=4. p _{12}(1)=p _{11}(2)=p _{23}(1)=p _{21}(2)=p _{22}(3)=p _{32}(1)=p _{33}(2)=1 (other transitions are 0). This is a multichain and communicating model. The value is 4 and \(f_{*}^{\infty}\) with f _{∗}(1)=f _{∗}(2)=1, f _{∗}(3)=2 is the unique optimal deterministic policy.
The primal linear program (1) becomes for this model
with optimal solution v ^{∗}=4; \(u^{*}_{1} = 0\), \(u^{*}_{2} = 3\), \(u^{*}_{3} = 5\) (v ^{∗} is unique; u ^{∗} is not unique).
The dual linear program is
For the optimal solution x ^{∗}, we obtain: \(x^{*}_{1}(1) = x^{*}_{1}(2) =x^{*}_{2}(1) = x^{*}_{2}(2) = x^{*}_{2}(3) = x^{*}_{3}(1) = 0;~x^{*}_{3}(2) = 1\) (this solution is unique).
Proceeding as if this were a unichain model, we choose arbitrary actions in the states 1 and 2. Clearly, this approach may generate a nonoptimal policy.
So, we are not able—in general—to derive an optimal policy from the dual program (2). However, it is possible to find an optimal policy with some additional work. In Example 1 we have seen that the optimal solution x ^{∗} provides an optimal action in state 3, which is the only state of \(S_{*} = \{i\mid\sum_{a} x^{*}_{i}(a) > 0\}\). The next theorem shows that the states of S _{∗} always provide optimal actions. For the proof we refer to Kallenberg (2010).
Theorem 3
Let x ^{∗} be an extreme optimal solution of (2). Take any policy \(f_{*}^{\infty}\) such that \(x^{*}_{i}(f_{*}(i)) > 0\), i∈S _{∗}. Then, \(\phi_{j}(f_{*}^{\infty}) = \phi\), j∈S _{∗}.
Note that \(S_{*} \not= \emptyset\) (because \(\sum_{i,a} x^{*}_{i}(a) = 1\)) and that we can find, by Theorem 3, optimal actions f _{∗}(i) for all i∈S _{∗}. Furthermore, one can easily show that S _{∗} is closed in the Markov chain P(f _{∗}).
Since we have a communicating MDP, one can find for each i∉S _{∗} an action f _{∗}(i) such that in the Markov chain P(f _{∗}) the set S _{∗} is reached from state i with a strictly positive probability after one or more transitions. So, the set S\S _{∗} is transient in the Markov chain P(f _{∗}). Therefore, the following search procedure provides the remaining optimal actions for the states S\S _{∗}.
Search procedure

1.
If S _{∗}=S: stop;
Otherwise go to step 2.

2.
Pick a triple (i,a,j) with i∈S\S _{∗}, a∈A(i), j∈S _{∗} and p _{ ij }(a)>0.

3.
f _{∗}(i):=a, S _{∗}:=S _{∗}∪{i} and go to step 1.
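The three steps above can be sketched as follows, applied to the data of Example 1 with S _{∗}={3} and f _{∗}(3)=2 as starting point. The function name and the rule of picking the first admissible triple in sorted order are assumptions of this sketch; the procedure itself allows any admissible triple.

```python
def search_procedure(states, p, f_star, s_star):
    """Extend f_star from S_* to all of S; p[(i, a)] maps j to p_ij(a)."""
    f, closed = dict(f_star), set(s_star)
    while closed != set(states):                       # step 1
        triple = next(
            ((i, a, j)
             for (i, a), row in sorted(p.items())
             for j, prob in sorted(row.items())
             if i not in closed and j in closed and prob > 0),
            None)                                      # step 2
        if triple is None:
            raise ValueError("no transition into S_*; is the MDP communicating?")
        i, a, _j = triple
        f[i] = a                                       # step 3
        closed.add(i)
    return f


states = [1, 2, 3]
p = {(1, 1): {2: 1}, (1, 2): {1: 1},
     (2, 1): {3: 1}, (2, 2): {1: 1}, (2, 3): {2: 1},
     (3, 1): {2: 1}, (3, 2): {3: 1}}

policy = search_procedure(states, p, f_star={3: 2}, s_star={3})
print(policy)
```

With this tie-breaking rule the procedure first adds state 2 with action 1 and then state 1 with action 1.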
A second way to find an optimal policy for communicating MDPs is based on the following theorem which is due to Filar and Schultz (1988).
Theorem 4
An MDP is communicating if and only if for every b∈ℝ^{S} such that ∑_{ j } b _{ j }=0 there exists a \(y \in{\mathbb{R}}_{+}^{S \times A}\) such that ∑_{ i,a }{δ _{ ij }−p _{ ij }(a)}y _{ i }(a)=b _{ j } for all j∈S.
The following procedure also yields an optimal deterministic policy. This is based on results for multichained MDPs which are discussed in Sect. 3.4.
Determination of the y-variables

1.
Choose β∈ℝ^{S} such that β _{ j }>0, j∈S and ∑_{ j } β _{ j }=1.

2.
Let \(b_{j} = \beta_{j} - \sum_{a} x^{*}_{j}(a)\), j∈S.

3.
Determine \(y^{*} \in{\mathbb{R}}_{+}^{S \times A}\) such that \(\sum_{i,a}\{\delta_{ij} - p_{ij}(a)\}y^{*}_{i}(a) = b_{j}\), j∈S.

4.
Choose f _{∗}(i) such that \(y^{*}_{i}(f_{*}(i)) > 0\) for all i∈S\S _{∗}.
Example 1
(continued)

Search procedure:

S _{∗}={3}.

i=2; a=1; j=3; f _{∗}(2)=1; S _{∗}={2,3}.

i=1; a=1; j=2; f _{∗}(1)=1; S _{∗}={1,2,3}.

Determination of the y-variables:

Choose \(\beta_{1} = \beta_{2} = \beta_{3} = \frac{1}{3}\).

Let \(b_{1} = \frac{1}{3}\), \(b_{2} = \frac{1}{3}\), \(b_{3} = -\frac{2}{3}\).

The system ∑_{ i,a }{δ _{ ij }−p _{ ij }(a)}y _{ i }(a)=b _{ j }, j∈S becomes:
$$\begin{array}{lllllllllr}& y_1(1) & & & - & y_2(2) & & & = & \frac{1}{3} \\[0.12cm] - & y_1(1) & + & y_2(1) & + & y_2(2) & - & y_3(1) & = & \frac{1}{3} \\[0.12cm]& & - & y_2(1) & & & + & y_3(1) & = & -\frac{2}{3} \\[0.12cm]\end{array}$$
with a nonnegative solution \(y^{*}_{1}(1) = \frac{1}{3}\), \(y^{*}_{2}(1) =\frac{2}{3}\), \(y^{*}_{2}(2) = y^{*}_{3}(1) = 0\) (this solution is not unique). Choose f _{∗}(1)=f _{∗}(2)=1.
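The numbers in this continuation of Example 1 can be checked with exact arithmetic. The sketch below recomputes b _{ j }=β _{ j }−∑_{ a } x ^{∗}_{ j }(a), evaluates ∑_{ i,a }{δ _{ ij }−p _{ ij }(a)}y _{ i }(a) for the stated y ^{∗}, and reads off the actions with a positive y-variable in the states outside S _{∗}. Variable names are assumptions of this sketch.

```python
from fractions import Fraction as F

states = [1, 2, 3]
p = {(1, 1): {2: 1}, (1, 2): {1: 1},
     (2, 1): {3: 1}, (2, 2): {1: 1}, (2, 3): {2: 1},
     (3, 1): {2: 1}, (3, 2): {3: 1}}

beta = {j: F(1, 3) for j in states}
x_star = {(3, 2): F(1)}                   # only positive x-variable
y_star = {(1, 1): F(1, 3), (2, 1): F(2, 3)}  # only positive y-variables

# Step 2: b_j = beta_j - sum_a x*_j(a)
b = {j: beta[j] - sum(v for (i, _a), v in x_star.items() if i == j)
     for j in states}


def lhs(j):
    """sum_{i,a} (delta_ij - p_ij(a)) * y_i(a) for the stated y*."""
    return sum(((i == j) - p[(i, a)].get(j, 0)) * v
               for (i, a), v in y_star.items())


residuals = {j: lhs(j) - b[j] for j in states}

# Step 4: in states outside S_* = {3}, pick an action with positive y.
f_star = {i: a for (i, a), v in y_star.items() if v > 0}
f_star[3] = 2
print(b, residuals, f_star)
```

All residuals vanish, so the stated y ^{∗} solves the system, and the resulting policy coincides with the one found by the search procedure.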
Remarks

1.
The verification of an irreducible or communicating MDP is computationally easy (see Kallenberg 2002); generally, the verification of a unichain MDP is \(\mathcal{NP}\)-complete as shown by Tsitsiklis (2007).

2.
It turns out that the approach with the search procedure can also be used for the weak unichain case.
The multichain case
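For multichained MDPs the programs (1) and (2) are not sufficient. For general MDPs the following dual pair of linear programs was proposed by Denardo and Fox (1968):
$$\min\left\{\sum_{j} \beta_j v_j\left\vert \begin{array}{l@{\quad}l}\sum_{j} \{\delta_{ij} - p_{ij}(a)\}v_j \geq 0,& (i,a) \in S \times A\\ [0.1cm]v_i + \sum_{j} \{\delta_{ij} - p_{ij}(a)\}u_j \geq r_i(a), &(i,a) \in S \times A\end{array}\right\}\right.$$(3)
and
$$\max\left\{\sum_{i,a} r_i(a) x_i(a)\left\vert \begin{array}{l@{\quad}l}\sum_{i,a} \{\delta_{ij} - p_{ij}(a)\}x_i(a) = 0,& j \in S\\ [0.1cm]\sum_{a} x_j(a) + \sum_{i,a} \{\delta_{ij} - p_{ij}(a)\}y_i(a) = \beta_j,& j \in S\\ [0.1cm]x_i(a), y_i(a) \geq0, &(i,a) \in S \times A\end{array}\right\},\right.$$(4)
where β _{ j }>0 for all j∈S.
In Denardo and Fox (1968) it was shown that if (v ^{∗},u ^{∗}) is an optimal solution of the primal problem (3), then v ^{∗}=ϕ, the value vector.
Notice that if the value vector ϕ is constant, i.e., ϕ has identical components, then \(\sum_{j} \{ \delta_{ij} - p_{ij}(a)\}v^{*}_{j} = \sum_{j} \{ \delta_{ij} - p_{ij}(a) \}\phi= \{1 - 1\}\phi = 0\). Hence, the first set of inequalities of (3) is superfluous and (3) can be simplified to (1) with as dual program (2).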
Furthermore, Denardo and Fox have derived the following result (see pp. 73–75 in Derman 1970).
Lemma 1
Let \(f_{*}^{\infty}\in C(D)\) be an optimal policy and let (v ^{∗}=ϕ,u ^{∗}) be an optimal solution of the primal program (3). Then,
$$\left\{\begin{array}{l@{\quad}l}\sum_j \{ \delta_{ij} - p_{ij}(f_*) \}\phi_j = 0 ,&i \in S\\\phi_i + \sum_j \{ \delta_{ij} - p_{ij}(f_*) \}u^*_j = r_i(f_*) ,&i\in R(f_*)\end{array}\right.$$
where R(f _{∗})={i∣i is recurrent in the Markov chain P(f _{∗})}.
Lemma 1 asserts that in any optimal solution of the primal program (3) one can always select actions f _{∗}(i) such that ∑_{ j }{δ _{ ij }−p _{ ij }(f _{∗})}ϕ _{ j }=0, i∈S, and \(\phi_{i} + \sum_{j} \{ \delta_{ij} - p_{ij}(f_{*}) \}u^{*}_{j} =r_{i}(f_{*})\) for all i in a nonempty subset S(f _{∗}) of S. Furthermore, the following result holds, given such a policy \(f_{*}^{\infty}\) and a companion S(f _{∗}) (see pp. 75–76 in Derman 1970).
Lemma 2
If all states of S\S(f _{∗}) are transient in the Markov chain P(f _{∗}), then policy \(f_{*}^{\infty}\) is an average optimal policy.
If we are fortunate in our selection of \(f_{*}^{\infty}\), then the states of S\S(f _{∗}) are transient in the Markov chain P(f _{∗}) and policy \(f_{*}^{\infty}\) is an average optimal policy. However, we may not be so fortunate in our selection of \(f_{*}^{\infty}\). In that case, Derman suggests the following approach to find an optimal policy (see pp. 76–78 in Derman’s book 1970). Let S _{1} be defined by
By Lemma 1, S\S _{1} must consist entirely of transient states under every optimal policy. Let S _{2} be defined by
Also by Lemma 1, the states of S _{1}\S _{2} must be transient under at least one optimal policy \(f_{*}^{\infty}\). Let S _{3} and A _{3}(i), i∈S _{3} be defined as
Consider the following linear program
where \(s_{i}(a) = r_{i}(a) - \sum_{j \notin S_{3}} \{ \delta_{ij} - p_{ij}(a) \}u_{j}^{*} - \phi_{i}\).
Theorem 5

(1)
The linear program (8) has a finite optimal solution.

(2)
Let w ^{∗} be an optimal solution of (8). Then, for each i∈S _{3} there exists at least one action f _{∗}(i) satisfying \(\sum_{j \in S_{3}} \{\delta_{ij} - p_{ij}(f_{*})\}w^{*}_{j} = s_{i}(f_{*})\).

(3)
Let \(f_{*}^{\infty}\) be such that
$$\left\{\begin{array}{l@{\quad}l}\sum_j \{ \delta_{ij} - p_{ij}(f_*) \}\phi_j = 0 ,&i \in S_2\\\phi_i + \sum_j \{ \delta_{ij} - p_{ij}(f_*) \}u^*_j = r_i(f_*) ,&i\in S_2\end{array}\right.$$
and \(\sum_{j \in S_{3}} \{\delta_{ij} - p_{ij}(f_{*})\}w^{*}_{j} =s_{i}(f_{*})\), i∈S _{3}. Then, \(f_{*}^{\infty}\) is an average optimal policy.
Hence, in order to find an optimal policy in the multichain case, by the results of Denardo and Fox (1968) and Derman (1970), one has to execute the following procedure:

1.
Determine an optimal solution (v ^{∗},u ^{∗}) of the linear program (3) to find the value vector ϕ=v ^{∗}.

2.
Determine, by (5), (6) and (7), the sets S _{1},S _{2},S _{3} and A _{3}(i), i∈S _{3}.

3.
Compute \(s_{i}(a) = r_{i}(a) - \sum_{j \notin S_{3}} \{\delta_{ij} - p_{ij}(a) \}u_{j}^{*} - \phi_{i}\), i∈S _{3}, a∈A _{3}(i).

4.
Determine an optimal solution w ^{∗} of the linear program (8).

5.
Determine an optimal policy \(f_{*}^{\infty}\) as described in Theorem 5.
This rather complicated approach elicited from Derman the remark (see Derman 1970, p. 84): “No satisfactory treatment of the dual program for the multiple class case has been published”, which was for Hordijk and myself the reason to start research on this topic. In Hordijk and Kallenberg (1979) the following result was proved.
Theorem 6
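Let (x ^{∗},y ^{∗}) be an extreme optimal solution of the dual program (4). Then, any stationary deterministic policy \(f_{*}^{\infty}\) such that
$$\left\{\begin{array}{l@{\quad}l}x^*_i(f_*(i)) > 0 & \mbox{if}\ \sum_a x^*_i(a) > 0\\ [0.1cm]y^*_i(f_*(i)) > 0 & \mbox{if}\ \sum_a x^*_i(a) = 0\end{array}\right.$$
is well defined and is an average optimal policy.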
This result is based on the following propositions, where:

Proposition 1 is related to Lemma 1;

Proposition 2 is related to the definition of S _{2};

Proposition 3 is related to Lemma 2; it also uses the property that the columns of positive variables of an extreme optimal solution are linearly independent.
Proposition 1
Let (v ^{∗}=ϕ,u ^{∗}) be an optimal solution of program (3). Then,
Proposition 2
The subset S _{∗} of S is closed in the Markov chain P(f _{∗}).
Proposition 3
The states of S\S _{∗} are transient in the Markov chain P(f _{∗}).
The correspondence between feasible solutions (x,y) of (4) and randomized stationary policies π ^{∞} is given by the following mappings. For a feasible solution (x,y) the corresponding policy π ^{∞}(x,y) is defined by
Conversely, for a stationary policy π ^{∞}, we define a feasible solution (x ^{π},y ^{π}) of the dual program (4) by
where P ^{∗}(π) and D(π) are the stationary and the deviation matrix of the transition matrix P(π); γ _{ j }=0 on the transient states and constant on each recurrent class under P(π) (for the precise definition of γ see Hordijk and Kallenberg 1979).
Now, we will present some examples which show the essential difference between irreducible, unichained and multichained MDPs.
Example 2
It is well known that in the irreducible case each extreme optimal solution has in each state exactly one positive x-variable. It is also well known that in other cases some states can have no positive x-variables, i.e., S _{∗} is a proper subset of S.
This example shows an MDP with an extreme optimal solution which has two positive x-variables for some state. Hence, the two corresponding deterministic policies, which can be constructed via Theorem 6, are both optimal.
Furthermore, this extreme feasible solution is mapped on a nondeterministic policy. Let S={1,2,3}; A(1)={1}, A(2)={1}, A(3)={1,2}; r _{1}(1)=1, r _{2}(1)=2, r _{3}(1)=4, r _{3}(2)=3; p _{13}(1)=p _{23}(1)=p _{31}(1)=p _{32}(2)=1 (other transitions are 0).
The dual program (4) of this MDP is (take \(\beta_{1} =\beta_{2} = \frac{1}{4}, \beta_{3} = \frac{1}{2}\)):
The feasible solution (x,y), where \(x_{1}(1) = x_{2}(1) = x_{3}(1) =x_{3}(2) = \frac{1}{4}\), y _{1}(1)=y _{2}(1)=y _{3}(1)=y _{3}(2)=0, is an extreme optimal solution. Observe that state 3 has two positive x-variables.
Example 3
This example shows that the mapping (9) is not a bijective mapping. Let S={1,2,3,4}; A(1)={1}, A(2)={1,2}, A(3)={1,2}, A(4)={1}; p _{12}(1)=p _{23}(1)=p _{24}(2)=p _{33}(1)=p _{31}(2)=p _{44}(1)=1 (other transitions are 0). Since the rewards are not important for this property, we have omitted these numbers.
The constraints of the dual program are (take \(\beta_{j} = \frac{1}{4}\), 1≤j≤4):
First, consider the feasible solution (x ^{1},y ^{1}) with \(x^{1}_{1}(1) =x^{1}_{2}(1) = \frac{1}{4}\), \(x^{1}_{2}(2) = x^{1}_{3}(1) = 0\), \(x^{1}_{3}(2) = x^{1}_{4}(1) = \frac{1}{4}\); \(y^{1}_{1}(1) = y^{1}_{2}(1) = y^{1}_{2}(2) = y^{1}_{3}(2) = 0\). This feasible solution is mapped on the deterministic policy \(f_{1}^{\infty}\) with f _{1}(1)=f _{1}(2)=1, f _{1}(3)=2, f _{1}(4)=1.
Then, consider the feasible solution (x ^{2},y ^{2}) with \(x^{2}_{1}(1) =x^{2}_{2}(1) = \frac{1}{6}\), \(x^{2}_{2}(2) = x^{2}_{3}(1) = 0\), \(x^{2}_{3}(2) = \frac{1}{6}\), \(x^{2}_{4}(1)= \frac{1}{2}\), \(y^{2}_{1}(1) = \frac{1}{6}\), \(y^{2}_{2}(1) = 0\), \(y^{2}_{2}(2) =\frac{1}{4}\), \(y^{2}_{3}(2) = \frac{1}{12}\). This feasible solution is mapped on the deterministic policy \(f_{2}^{\infty}\) with f _{2}(1)=f _{2}(2)=1, f _{2}(3)=2, f _{2}(4)=1. Notice that \((x^{1},y^{1}) \not= (x^{2},y^{2})\) and \(f_{1}^{\infty}= f_{2}^{\infty}\).
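Both solutions of Example 3 can be checked mechanically. The sketch below assumes that the constraints of the dual program (4) are ∑_{ i,a }{δ _{ ij }−p _{ ij }(a)}x _{ i }(a)=0 and ∑_{ a } x _{ j }(a)+∑_{ i,a }{δ _{ ij }−p _{ ij }(a)}y _{ i }(a)=β _{ j } for all j∈S, together with nonnegativity. Since every state has a positive x-variable in both solutions, only the x-part is needed to read off the corresponding policy. Variables not listed in the example are taken to be 0, and the helper names are assumptions of this sketch.

```python
from fractions import Fraction as F

states = [1, 2, 3, 4]
p = {(1, 1): {2: 1}, (2, 1): {3: 1}, (2, 2): {4: 1},
     (3, 1): {3: 1}, (3, 2): {1: 1}, (4, 1): {4: 1}}
beta = {j: F(1, 4) for j in states}


def flow(z, j):
    """sum_{i,a} (delta_ij - p_ij(a)) * z_i(a)."""
    return sum(((i == j) - p[(i, a)].get(j, 0)) * v for (i, a), v in z.items())


def is_feasible(x, y):
    nonneg = all(v >= 0 for v in list(x.values()) + list(y.values()))
    balance = all(flow(x, j) == 0 for j in states)
    start = all(sum(v for (i, _a), v in x.items() if i == j) + flow(y, j)
                == beta[j] for j in states)
    return nonneg and balance and start


def positive_actions(x):
    """Actions with a positive x-variable, per state."""
    return {i: [a for (k, a), v in sorted(x.items()) if k == i and v > 0]
            for i in states}


x1 = {(1, 1): F(1, 4), (2, 1): F(1, 4), (3, 2): F(1, 4), (4, 1): F(1, 4)}
y1 = {}
x2 = {(1, 1): F(1, 6), (2, 1): F(1, 6), (3, 2): F(1, 6), (4, 1): F(1, 2)}
y2 = {(1, 1): F(1, 6), (2, 2): F(1, 4), (3, 2): F(1, 12)}

feasible1, feasible2 = is_feasible(x1, y1), is_feasible(x2, y2)
policy1, policy2 = positive_actions(x1), positive_actions(x2)
print(feasible1, feasible2, policy1 == policy2)
```

Under these assumptions both points are feasible and select the same single action in every state, which is the non-injectivity the example describes.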
Example 4
This example shows that a feasible nonoptimal solution can be mapped on an optimal policy. Let S={1,2,3}; A(1)=A(2)={1,2}, A(3)={1}; p _{12}(1)=p _{13}(2)=p _{21}(1)=p _{22}(2)=p _{33}(1)=1 (other transitions are 0); r _{1}(1)=1, r _{1}(2)=r _{2}(1)=r _{2}(2)=r _{3}(1)=0.
The dual program for this model is (take \(\beta_{1} = \beta_{2} = \beta _{3} =\frac{1}{3}\)):
The solution (x,y) given by \(x_{1}(1) = \frac{1}{6}\), x _{1}(2)=0, \(x_{2}(1) = \frac{1}{6}\), x _{2}(2)=0, \(x_{3}(1) = \frac{2}{3}\), y _{1}(1)=0, \(y_{1}(2) = \frac{1}{3}\), \(y_{2}(1) = \frac{1}{6}\) is a feasible solution, but not an optimal solution. Notice that \(x^{*}_{1}(1) = x^{*}_{2}(1)= x^{*}_{3}(1) = \frac{1}{3}\) and all other variables 0 is an optimal solution and that the x-part of the optimal solution is unique. However, the policy f ^{∞} which corresponds to (x,y) has f(1)=f(2)=f(3)=1 and is an optimal policy.
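The gap between the objective value of (x,y) and the reward of the corresponding policy in Example 4 can be made explicit. The sketch below compares the dual objective ∑_{ i,a } r _{ i }(a)x _{ i }(a) at both solutions with the β-weighted average reward of the policy f(1)=f(2)=f(3)=1. Because all transitions in this example are deterministic, the average reward from a state is simply the mean reward over the cycle that the trajectory eventually enters; this shortcut and the helper names are assumptions of the sketch and do not apply to general transition probabilities.

```python
from fractions import Fraction as F

states = [1, 2, 3]
nxt = {(1, 1): 2, (1, 2): 3, (2, 1): 1, (2, 2): 2, (3, 1): 3}  # successor state
r = {(1, 1): 1, (1, 2): 0, (2, 1): 0, (2, 2): 0, (3, 1): 0}
beta = {j: F(1, 3) for j in states}


def objective(x):
    return sum(r[key] * v for key, v in x.items())


def cycle_average(start, f):
    """Average reward from `start` under deterministic rule f (0/1 transitions)."""
    path, i = [], start
    while i not in path:
        path.append(i)
        i = nxt[(i, f[i])]
    cycle = path[path.index(i):]          # states visited forever
    return F(sum(r[(s, f[s])] for s in cycle), len(cycle))


x = {(1, 1): F(1, 6), (2, 1): F(1, 6), (3, 1): F(2, 3)}      # feasible, nonoptimal
x_opt = {(1, 1): F(1, 3), (2, 1): F(1, 3), (3, 1): F(1, 3)}  # optimal x-part
f = {1: 1, 2: 1, 3: 1}

obj_xy, obj_opt = objective(x), objective(x_opt)
policy_value = sum(beta[j] * cycle_average(j, f) for j in states)
print(obj_xy, obj_opt, policy_value)
```

The policy earns the optimal value although the solution it was read from has a strictly smaller objective.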
Example 5
In this last example, we show that the general single chain case needs an approach different from the unichain case; even the additional search procedure is not sufficient. In the general single chain case the value vector is a constant vector and the linear programs (1) and (2) may be considered. Let S={1,2,3}; A(1)={1}, A(2)=A(3)={1,2}; r _{1}(1)=r _{2}(1)=0, r _{2}(2)=r _{3}(1)=1, r _{3}(2)=0; p _{12}(1)=p _{21}(1)=p _{22}(2)=p _{33}(1)=p _{32}(2)=1 (other transitions are 0). This is a general single chain MDP, because the policy f ^{∞} with f(1)=1, f(2)=f(3)=2 is an optimal policy and has a single chain structure. The dual program (2) of this model is:
x given by x _{1}(1)=x _{2}(1)=x _{2}(2)=x _{3}(2)=0, x _{3}(1)=1 is an extreme optimal solution. In state 3, the policy corresponding to x chooses action 1. The choice in state 2 for an optimal policy has to be action 2. Since the set of the states 1 and 2 is closed under any policy, it is impossible to search for actions in these states with transitions to state 3.
State-action frequencies and problems with constraints
Introduction
“State-action frequencies and problems with constraints” is the title of chapter 7 of Derman’s book. This chapter may be considered as the starting point for the study of MDPs with additional constraints. In such problems it is not obvious that optimal policies exist. It is also not necessarily true that optimal policies, if they exist, belong to the class C(D) or C(S).
MDPs with additional constraints occur in a natural way in all kinds of applications. An example is inventory management, where one wants to minimize the total costs under the constraint that the shortage is bounded by a given number.
In general, for MDPs with additional constraints, a policy which is optimal simultaneously for all starting states does not exist. Therefore, we consider problems with a given initial distribution β, i.e., β _{ j } is a given probability that state j is the starting state. A special case is β _{ j }=1 for j=i and β _{ j }=0 for \(j \not= i\), i.e., that state i is the (fixed) starting state.
In many cases reward and cost functions are specified in terms of expectations of some function of the state-action frequencies. Given the initial distribution β, we define for any policy R, any time point t and any state-action pair (i,a)∈S×A, the state-action frequency \(x_{ia}^{R}(t)\) by
$$x_{ia}^{R}(t) = \sum_{j \in S} \beta_{j} \cdot \mathbb{P}_{R}\{X_{t} = i, Y_{t} = a \mid X_{1} = j\}.$$
For the additional constraints we assume that, besides the immediate rewards r _{ i }(a), there are also certain immediate costs \(c^{k}_{i}(a)\), i∈S, a∈A(i) for k=1,2,…,m.
Let β be an arbitrary initial distribution. For any policy R, let the average reward and the kth average cost function with respect to the initial distribution β be defined by
and
A policy R is a feasible policy for a constrained Markov decision problem, shortly CMDP, if the kth cost function is bounded by a given number b _{ k } for k=1,2,…,m, i.e., if c ^{k}(β,R)≤b _{ k }, k=1,2,…,m.
An optimal policy R ^{∗} for this criterion is a feasible policy that maximizes ϕ(β,R), i.e.,
For any policy R and any T∈ℕ, we denote the average expected stateaction frequencies in the first T periods by
By X(R) we denote the limit points of the vectors {x ^{T}(R), T=1,2,…}. For any T∈ℕ, x ^{T}(R) satisfies \(\sum_{(i,a)}x_{ia}^{T}(R) =1\); so also ∑_{(i,a)} x _{ ia }(R)=1 for all x(R)∈X(R).
Since \(\mathbb{P}_{\pi^{\infty}}\{X_{t} = i, Y_{t} = a \mid X_{1} = j\} =\{P^{t-1}(\pi)\}_{ji} \cdot\pi_{ia}\), (i,a)∈S×A for all π ^{∞}∈C(S), we have \(\lim_{T \rightarrow\infty}x_{ia}^{T}(\pi^{\infty})=\sum_{j \in S} \beta_{j} \{P^{*}(\pi)\}_{ji}\cdot \pi_{ia}\), i.e., X(π ^{∞}) consists of only one element, namely the vector x(π), where x _{ ia }(π)={β ^{T} P ^{∗}(π)}_{ i }⋅π _{ ia }, (i,a)∈S×A.
Let the policy set C _{1} be the set of convergent policies, defined by
Hence, C(S)⊆C _{1}. Furthermore, define the vector sets L, L(M), L(C), L(S) and L(D) by
The following result is due to Derman (1970, pp. 93–94).
Theorem 7
\(L = L(M) = \overline{L(S)} = \overline{L(D)}\), where \(\overline{L(S)}\) and \(\overline{L(D)}\) are the closed convex hull of the sets L(S) and L(D), respectively.
The unichain case
Derman has also shown (Derman 1970, pp. 95–96) that in the unichain case a feasible CMDP has an optimal stationary policy. He showed that L(S)=X, where
Since X is a closed convex set, this result also implies that \(L(S) =\overline{L(S)}\). Hence, the CMDP (14) can be solved by the following algorithm.
Algorithm 1

1.
Determine an optimal solution x ^{∗} of the linear program
$$ \max\left\{\sum_{i,a} r_i(a) x_i(a)\left\vert \begin{array}{l@{\quad}l}\sum_{i,a} \{\delta_{ij} - p_{ij}(a)\}x_i(a) = 0,& j \in S\\ [0.1cm]\sum_{i,a} x_i(a) = 1 \\ [0.1cm]\sum_{i,a} c^k_i(a) x_i(a) \leq b_k, &k = 1,2,\dots,m \\ [0.1cm]x_i(a) \geq0, &(i,a) \in S \times A\end{array}\right\}.\right.$$(18)
(if (18) is infeasible, then problem (14) is also infeasible).

2.
Take
$$\pi^*_{ia} = \left\{\begin{array}{l@{\quad}l}x^*_i(a)/x^*_i, & a \in A(i), i \in S_* \\ [0.1cm]\mbox{arbitrary} & \mbox{otherwise},\end{array}\right.$$
where \(x^{*}_{i} = \sum_{a} x^{*}_{i}(a)\) and \(S_{*} = \{i\mid x^{*}_{i} > 0\}\).
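Step 2 of Algorithm 1 amounts to normalizing the optimal frequencies per state. A minimal sketch follows; the vector `x_star` below is a hypothetical LP solution invented for illustration (it does not come from the paper), and the "arbitrary" choice outside S _{∗} is implemented, as an assumption of this sketch, by taking the first listed action.

```python
from fractions import Fraction as F


def policy_from_frequencies(x_star, actions):
    """pi[i][a] = x*_i(a) / x*_i on S_*, first listed action elsewhere."""
    pi = {}
    for i, acts in actions.items():
        total = sum(x_star.get((i, a), 0) for a in acts)
        if total > 0:                       # i belongs to S_*
            pi[i] = {a: x_star.get((i, a), 0) / total for a in acts}
        else:                               # arbitrary choice
            pi[i] = {acts[0]: F(1)}
    return pi


# Hypothetical data: two actions in state 1, one in state 2, two in state 3.
actions = {1: [1, 2], 2: [1], 3: [1, 2]}
x_star = {(1, 1): F(1, 8), (1, 2): F(3, 8), (3, 2): F(1, 2)}

pi_star = policy_from_frequencies(x_star, actions)
print(pi_star)
```

A state with two positive variables, such as state 1 here, yields a randomized decision rule, which is why constrained problems generally need C(S) rather than C(D).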
The multichain case
The multichain case was solved by Hordijk and Kallenberg (see Kallenberg 1980, 1983 and Hordijk and Kallenberg 1984). First, they generalized Theorem 7 in the following way.
Theorem 8
\(L = L(M) = L(C) = \overline{L(S)} = \overline{L(D)}\).
Then, they showed that L=XY, where
From the above results it follows that any extreme point of XY is an element of L(D). The next example shows the converse statement is not true, in general.
Example 6
Take the MDP with S={1,2,3}; A(1)={1,2}, A(2)={1,2}, A(3)={1}; p _{12}(1)=p _{13}(2)=p _{22}(1)=p _{21}(2)=p _{33}(1)=1 (other transitions are 0). Since the rewards are not important for this property, we have omitted these numbers. Let \(\beta_{1} = \beta_{2} = \beta_{3} = \frac{1}{3}\). Consider \(f_{1}^{\infty}\), \(f_{2}^{\infty}\), \(f_{3}^{\infty}\), where f _{1}(1)=2, f _{1}(2)=1, f _{1}(3)=1; f _{2}(1)=2, f _{2}(2)=2, f _{2}(3)=1; f _{3}(1)=1, f _{3}(2)=1, f _{3}(3)=1.
For these policies one easily verifies that:
Since \(x(f_{1}^{\infty}) = \frac{1}{2}x(f_{2}^{\infty}) +\frac{1}{2}x(f_{3}^{\infty})\), \(x(f_{1}^{\infty})\) is not an extreme point of XY.
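The convex-combination identity of Example 6 can be verified directly from x _{ ia }(f)={β ^{T} P ^{∗}(f)}_{ i } for a=f(i). All transitions in this example are deterministic, so row j of P ^{∗}(f) is the uniform distribution on the cycle reached from j; the sketch below relies on that special structure (an assumption that does not hold for general transition probabilities), and its helper names are illustrative.

```python
from fractions import Fraction as F

nxt = {(1, 1): 2, (1, 2): 3, (2, 1): 2, (2, 2): 1, (3, 1): 3}  # successor state
beta = {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)}


def frequencies(f):
    """x(f) for a deterministic rule f when all transitions are 0/1."""
    x = {}
    for j, weight in beta.items():
        path, i = [], j
        while i not in path:
            path.append(i)
            i = nxt[(i, f[i])]
        cycle = path[path.index(i):]        # recurrent states reached from j
        for s in cycle:
            key = (s, f[s])
            x[key] = x.get(key, 0) + weight / len(cycle)
    return x


f1 = {1: 2, 2: 1, 3: 1}
f2 = {1: 2, 2: 2, 3: 1}
f3 = {1: 1, 2: 1, 3: 1}
x1, x2, x3 = frequencies(f1), frequencies(f2), frequencies(f3)

mix = {k: (x2.get(k, 0) + x3.get(k, 0)) / 2 for k in set(x2) | set(x3)}
print(x1, x2, x3, mix == x1)
```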
In order to solve the CMDP (14) we consider the linear program
The next theorem shows how an optimal policy for the CMDP (14) can be computed. This policy may lie outside the set of stationary policies.
Theorem 9

(1)
Problem (14) is feasible if and only if problem (20) is feasible.
 (2)

(3)
If R is optimal for problem (14), then x(R) is optimal for (20).

(4)
Let (x,y) be an optimal solution of problem (20) and let \(x = \sum_{k=1}^{n} p_{k}x(f_{k}^{\infty})\), where p _{ k }≥0 and \(\sum_{k=1}^{n} p_{k} = 1\) and \(C(D) =\{f^{\infty}_{1},f^{\infty}_{2},\dots,f^{\infty}_{n}\}\). Let R∈C(M) be such that \(\sum_{j} \beta_{j} \cdot\mathbb{P}_{R}\{X_{t} = i, Y_{t} = a \mid X_{1} = j\} = \sum_{j} \beta_{j} \cdot \sum_{k} p_{k} \cdot\mathbb{P}_{f^{\infty}_{k}}\{X_{t} = i, Y_{t} = a \mid X_{1} = j\}\) for all (i,a)∈S×A and all t∈ℕ. Then, R is an optimal solution of problem (14).
To compute an optimal policy from an optimal solution (x,y) of the linear program (20), we first have to express x as \(x =\sum_{k=1}^{n} p_{k}x(f_{k}^{\infty})\), where p _{ k }≥0 and \(\sum_{k=1}^{n}p_{k} = 1\). Next, we have to determine the policy R=(π ^{1},π ^{2},…)∈C(M) such that R satisfies \(\sum_{j} \beta_{j} \cdot\mathbb{P}_{R} \{X_{t} = i,Y_{t} = a \mid X_{1} = j\} = \sum_{j} \beta_{j} \cdot \sum_{k}p_{k} \cdot\mathbb{P}_{f^{\infty}_{k}}\{X_{t} = i,Y_{t} = a \mid X_{1} = j\}\) for all (i,a)∈S×A and all t∈ℕ. The decision rules π ^{t},t∈ℕ, can be determined by
Hence, the following algorithm constructs a policy R∈C(M)∩C _{1} which is optimal for CMDP problem (14).
Algorithm 2

1.
Determine an optimal solution (x ^{∗},y ^{∗}) of linear program (20) (if (20) is infeasible, then problem (14) is also infeasible).

2.

(a)
Let \(C(D) = \{f^{\infty}_{1},f^{\infty}_{2},\ldots,f^{\infty}_{n}\}\) and compute P ^{∗}(f _{ k }) for k=1,2,…,n.

(b)
Take
$$x^k_{ia} = \left\{\begin{array}{l@{\quad}l}\sum_j \beta_j \cdot\{P^*(f_k)\}_{ji} & a = f_k(i)\\ [0.06cm]0 & a \not= f_k(i)\end{array}, \right.\quad i \in S,\ k = 1,2,\dots,n.$$

3.
Determine p _{ k }, k=1,2,…,n as feasible solution of the linear system
$$\left\{\begin{array}{l@{\quad}l}\sum_{k=1}^n p_k x^k_{ia} = x^*_{ia}, & a \in A(i),i \in S\\ [0.06cm]\sum_{k=1}^n p_k = 1 \\ [0.06cm]p_k \geq0 & k = 1,2,\dots,n\end{array}\right.$$ 
4.
R=(π ^{1},π ^{2},…), defined by
$$\pi^t_{ia} = \left\{\begin{array}{l@{\quad}l}\frac{\sum_j \beta_j \cdot\sum_k p_k \{P^{t-1}(f_k)\}_{ji} \cdot \delta_{af_k(i)}}{\sum_j \beta_j \cdot\sum_k p_k \{P^{t-1}(f_k)\}_{ji}} & \mbox{if}\ \sum_j \beta_j \cdot\sum_k p_k \{P^{t-1}(f_k)\}_{ji} \not= 0\\ [0.06cm]\mbox{arbitrary} & \mbox{if}\ \sum_j \beta_j \cdot\sum_k p_k \{P^{t-1}(f_k)\}_{ji} = 0\end{array}\right.$$
is an optimal policy for problem (14).
In the next example Algorithm 2 is applied to a CMDP.
Example 7
Let S={1,2,3}; A(1)={1,2}, A(2)={1}, A(3)={1,2}; p _{12}(1)=p _{13}(2)=p _{22}(1)=p _{33}(1)=p _{32}(2)=1 (other transitions are 0); r _{1}(1)=0, r _{1}(2)=0, r _{2}(1)=1, r _{3}(1)=r _{3}(2)=0; \(\beta_{1} = \frac{1}{4}\), \(\beta_{2} = \frac{3}{16}\), \(\beta_{3} =\frac{9}{16}\). As constraints we have bounds for the value \(x_{21}(R):\frac{1}{4} \leq x_{21}(R) \leq\frac{1}{2}\). If we apply Algorithm 2 we obtain the following.
with optimal solution: \(x^{*}_{1}(1) = 0\), \(x^{*}_{1}(2) = 0\), \(x^{*}_{2}(1) =\frac{1}{2}\), \(x^{*}_{3}(1) = \frac{1}{2}\), \(x^{*}_{3}(2) = 0\); \(y^{*}_{1}(1) = 0\), \(y^{*}_{1}(2) = \frac{1}{4}\), \(y^{*}_{3}(2) =\frac{5}{16}\).
There are four deterministic policies:
The corresponding vectors x ^{1}, x ^{2}, x ^{3}, x ^{4} are:
For the numbers p _{1},p _{2},p _{3},p _{4}≥0 such that p _{1} x ^{1}+p _{2} x ^{2}+p _{3} x ^{3}+p _{4} x ^{4}=x ^{∗} and \(\sum_{k=1}^{4} p_{k} = 1\), we obtain: \(p_{1} = \frac{8}{9}\), \(p_{2} = \frac{1}{9}\), p _{3}=0, p _{4}=0.
Since
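we obtain R=(π ^{1},π ^{2},…) with \(\pi^{t}_{11} = 1\), t∈ℕ; \(\pi^{t}_{21} = 1\), t∈ℕ; \(\pi^{1}_{31} = \frac{8}{9}\), \(\pi^{1}_{32} = \frac{1}{9}\); \(\pi^{t}_{31} = 1\), \(\pi^{t}_{32} = 0\), t≥2.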
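The formula of step 4 of Algorithm 2 can be evaluated mechanically for the data of Example 7. The following is a minimal sketch, not the author's implementation. It assumes that f _{1} and f _{2} are the two deterministic policies that choose action 1 in state 1 and action 1, respectively action 2, in state 3 (this ordering is an assumption, chosen to be consistent with the weights \(p_{1} = \frac{8}{9}\), \(p_{2} = \frac{1}{9}\)), and it exploits the fact that all transitions in this example are deterministic. The names `succ`, `state_distribution` and `decision_rule` are illustrative.

```python
from fractions import Fraction as F

# Data of Example 7; succ[(i, a)] is the (deterministic) successor state.
succ = {(1, 1): 2, (1, 2): 3, (2, 1): 2, (3, 1): 3, (3, 2): 2}
beta = {1: F(1, 4), 2: F(3, 16), 3: F(9, 16)}
states = [1, 2, 3]
actions = {1: [1, 2], 2: [1], 3: [1, 2]}

# Assumed ordering of the two deterministic policies with positive weight.
policies = [{1: 1, 2: 1, 3: 1}, {1: 1, 2: 1, 3: 2}]
weights = [F(8, 9), F(1, 9)]


def state_distribution(f, t):
    """Distribution of X_t under f^infinity when X_1 is drawn from beta."""
    dist = dict(beta)
    for _ in range(t - 1):
        nxt = {i: F(0) for i in states}
        for i, mass in dist.items():
            nxt[succ[(i, f[i])]] += mass
        dist = nxt
    return dist


def decision_rule(t):
    """Decision rule pi^t of step 4; None marks the 'arbitrary' case."""
    dists = [state_distribution(f, t) for f in policies]
    rule = {}
    for i in states:
        denom = sum(p * d[i] for p, d in zip(weights, dists))
        for a in actions[i]:
            if denom == 0:
                rule[(i, a)] = None  # state i is unreachable at time t
            else:
                num = sum(p * d[i]
                          for p, d, f in zip(weights, dists, policies)
                          if f[i] == a)
                rule[(i, a)] = num / denom
    return rule
```

For instance, `decision_rule(1)[(3, 1)]` evaluates to 8/9, while from t=2 onwards state 3 is only occupied under f _{1}, so the rule puts all mass on action 1 there. State 1 is unreachable for t≥2, which is the "arbitrary" branch of the formula.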
Remark
Algorithm 2 is unattractive for practical problems. The number of calculations is prohibitive. Moreover, the use of Markov policies is inefficient in practice. Therefore, we also analyze the problem of finding an optimal stationary policy, if one exists.
For any feasible solution (x,y) of (20) we define a stationary policy π ^{∞}(x,y) in a slightly different way than in (9). The difference is caused by the fact that for constrained MDPs β _{ j } can be equal to zero in one or more states j, while in unconstrained MDPs we take β _{ j }>0 for all states j.
In Kallenberg (1983) the following lemmata can be found.
Lemma 3
If (x ^{∗},y ^{∗}) is an optimal solution of problem (20) and the Markov chain P(π(x ^{∗},y ^{∗})) has one ergodic set plus a (perhaps empty) set of transient states, then π ^{∞}(x ^{∗},y ^{∗}) is an optimal policy for problem (14).
Lemma 4
If (x ^{∗},y ^{∗}) is an optimal solution of problem (20) and x ^{∗} satisfies \(x^{*}_{i}(a) = \pi_{ia}(x^{*},y^{*}) \cdot\{\beta^{T} P^{*}(\pi(x^{*},y^{*}) )\}_{i}\) for all (i,a)∈S×A, then π ^{∞}(x ^{∗},y ^{∗}) is an optimal policy for problem (14).
Lemma 5
If (x ^{∗},y ^{∗}) is an optimal solution of problem (20) and furthermore \(x^{*}_{i}(a)/x^{*}_{i} = y^{*}_{i}(a)/y^{*}_{i}\) for all pairs (i,a) with i∈S _{+}, a∈A(i), where \(x^{*}_{i} = \sum_{a} x^{*}_{ia}\), \(y^{*}_{i}= \sum_{a} y^{*}_{ia}\) and \(S_{+} = \{i\mid x^{*}_{i} > 0, y^{*}_{i} > 0\}\), then the stationary policy π ^{∞}(x ^{∗},y ^{∗}) is an optimal policy for problem (14).
The next example shows that for an optimal solution (x ^{∗},y ^{∗}) of (20), the policy π ^{∞}(x ^{∗},y ^{∗}) need not be an optimal policy for (14), even if (14) has a stationary optimal policy.
Example 7
(continued)
Consider the MDP model of Example 7, but with the constraint \(x_{21}(R) \leq\frac{1}{4}\). The linear program (20) for this constrained problem is:
with optimal solution \(x^{*}_{1}(1) = 0\), \(x^{*}_{1}(2) = 0\), \(x^{*}_{2}(1) =\frac{1}{4}\), \(x^{*}_{3}(1) = \frac{3}{4}\), \(x^{*}_{3}(2) = 0\); \(y^{*}_{1}(1) =0\), \(y^{*}_{1}(2) = \frac{1}{4}\), \(y^{*}_{3}(2) = \frac{1}{16}\) and with optimum value \(\frac{1}{4}\). The corresponding stationary policy π ^{∞}(x ^{∗},y ^{∗}) gives π _{12}=π _{21}=π _{31}=1, so this policy is in fact deterministic. This policy is not optimal, because \(\phi (\pi^{\infty}(x^{*},y^{*}) ) = \frac{3}{16} < \frac{1}{4}\), the optimum of the linear program. Consider the stationary policy π ^{∞} with \(\pi_{11} = \frac{1}{4}\), \(\pi_{12} =\frac{3}{4}\), π _{21}=π _{31}=1. For this policy we obtain \(x_{21}(\pi^{\infty}) = \frac{1}{4}\) and \(\phi(\pi^{\infty}) =\frac{1}{4}\), the optimum value of the linear program. So, this policy is feasible and optimal.
If the conditions of Lemma 5 are not satisfied, we can try to find for the same x ^{∗} another y ^{∗}, say \(\overline{y}\), such that \((x^{*},\overline{y})\) is feasible for (20), and consequently also optimal, and satisfies the conditions of Lemma 5. To achieve this, we need \(\overline{y}_{i}(a)/\overline{y}_{i} = \pi_{ia}\), a∈A(i), \(i \in\{j\mid x^{*}_{j} > 0, \overline{y}_{j} > 0\}\), which is equivalent to \(\overline{y}_{i}(a) = \overline{y}_{i} \cdot\pi_{ia}\), a∈A(i), \(i \in\{j\mid x^{*}_{j} > 0\}\). Hence, \(\overline{y}\) has to satisfy the following linear system in the y-variables (x ^{∗} is fixed)

Example 7 (continued)

The optimal solution (x ^{∗},y ^{∗}) with \(x^{*}_{1}(1) = 0\), \(x^{*}_{1}(2) = 0\), \(x^{*}_{2}(1) = \frac{1}{4}\), \(x^{*}_{3}(1) = \frac{3}{4}\), \(x^{*}_{3}(2) = 0\); \(y^{*}_{1}(1) = 0\), \(y^{*}_{1}(2) =\frac{1}{4}\), \(y^{*}_{3}(2) = \frac{1}{16}\) does not satisfy \(x^{*}_{i}(a)/x^{*}_{i} = y^{*}_{i}(a)/y^{*}_{i}\) for all a∈A(i), i∈S _{+}, because S _{+}={3} and \(x^{*}_{3}(2)/x^{*}_{3} = 0\) and \(y^{*}_{3}(2)/y^{*}_{3} = 1\). The system (22) becomes \(\overline{y}_{1}(1) + \overline{y}_{1}(2) = \frac{4}{16}\); \(\overline{y}_{1}(1) = \frac{1}{16}\); \(\overline{y}_{1}(2) = \frac{3}{16}\); \(\overline{y}_{1}(1), \overline{y}_{1}(2) \geq0\). This system has the solution \(\overline{y}_{1}(1) = \frac{1}{16}\), \(\overline{y}_{1}(2) = \frac{3}{16}\). The stationary policy π ^{∞} with \(\pi_{11} = \frac{1}{4}\), \(\pi_{12} = \frac{3}{4}\), π _{21}=π _{31}=1 is optimal for problem (14).
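The ratio condition of Lemma 5 is easy to test numerically. The sketch below checks \(x_{i}(a)/x_{i} = y_{i}(a)/y_{i}\) on S _{+} by cross-multiplication, using exact fractions. The function name `lemma5_condition_holds` is illustrative, and variables that are not listed in the example (such as \(y^{*}_{3}(1)\)) are assumed to be zero.

```python
from fractions import Fraction as F


def lemma5_condition_holds(x, y):
    """x, y: dicts {(state, action): value}; missing entries count as 0.

    Returns True iff x_i(a)/x_i == y_i(a)/y_i for all a and all states i
    with x_i > 0 and y_i > 0 (the set S_+ of Lemma 5).
    """
    keys = list(x) + list(y)
    for i in {s for (s, _) in keys}:
        x_i = sum(v for (s, _), v in x.items() if s == i)
        y_i = sum(v for (s, _), v in y.items() if s == i)
        if x_i > 0 and y_i > 0:  # i belongs to S_+
            for a in {b for (s, b) in keys if s == i}:
                # Cross-multiplied form of x_i(a)/x_i == y_i(a)/y_i.
                if x.get((i, a), 0) * y_i != y.get((i, a), 0) * x_i:
                    return False
    return True


# Optimal solution of Example 7 (continued); unlisted y-values assumed 0.
x_star = {(1, 1): F(0), (1, 2): F(0), (2, 1): F(1, 4),
          (3, 1): F(3, 4), (3, 2): F(0)}
y_star = {(1, 1): F(0), (1, 2): F(1, 4), (3, 2): F(1, 16)}
```

With these data the check fails in state 3, in line with the example: S _{+}={3}, and the x-part puts all mass on action 1 while the y-part puts all mass on action 2.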
Remark
If the x-part of problem (20) is unique and (22) is infeasible, then problem (14) has no optimal stationary policy. If the x-part of problem (20) is not unique and (22) is infeasible, then it is still possible that there exists an optimal stationary policy. In that case we can compute every extreme optimal solution of the linear program (20), and for each of these extreme optimal solutions we can perform the above analysis in order to search for an optimal stationary policy. We show an example of this approach.
Example 8
Take the MDP with S={1,2,3}; A(1)={1,2}, A(2)={1,2}, A(3)={1}; p _{12}(1)=p _{13}(2)=p _{22}(1)=p _{21}(2)=p _{33}(1)=1 (other transitions are 0). r _{1}(1)=r _{1}(2)=0, r _{2}(1)=1, r _{2}(2)=0, r _{3}(1)=1. Let \(\beta_{1} = \beta_{2} = \beta_{3} = \frac{1}{3}\). Add as only constraint \(x_{21}(R) \geq\frac{1}{9}\). The formulation of the linear program (20) becomes:
with extreme optimal solution \(x^{*}_{1}(1) = 0\), \(x^{*}_{1}(2) = 0\), \(x^{*}_{2}(1)= \frac{1}{9}\), \(x^{*}_{2}(2) = 0\), \(x^{*}_{3}(1) = \frac{8}{9}\); \(y^{*}_{1}(1) =0\), \(y^{*}_{1}(2) = \frac{5}{9}\), \(y^{*}_{2}(2) = \frac{2}{9}\) and with optimum value 1. The x-part of this problem is not unique. It can easily be verified that \(\hat{x}_{1}(1) = 0\), \(\hat{x}_{1}(2) = 0\), \(\hat{x}_{2}(1) =\frac{2}{3}\), \(\hat{x}_{2}(2) = 0\), \(\hat{x}_{3}(1) =\frac{1}{3}\); \(\hat{y}_{1}(1) = \frac{1}{3}\), \(\hat{y}_{1}(2) = 0\), \(\hat{y}_{2}(2) = 0\) is also an extreme optimal solution. For the first extreme optimal solution (x ^{∗},y ^{∗}) system (22) becomes
This system is obviously infeasible.
For the second extreme optimal solution \((\hat{x},\hat{y})\) we can apply Lemma 5, which gives that the deterministic policy \(f_{*}^{\infty}\) with f _{∗}(1)=f _{∗}(2)=f _{∗}(3)=1 is an optimal solution.
Remarks

1. Discounted MDPs with additional constraints
These problems always have a stationary optimal policy. The analysis of this kind of problem is much easier than for MDPs with the average reward as optimality criterion (see Kallenberg 2010).

2. Multiple objectives
Some problems may have several kinds of rewards or costs, which cannot be optimized simultaneously. Assume that we want to maximize some utility for an m-tuple of immediate rewards, say utilities u ^{k}(R) and immediate rewards \(r_{i}^{k}(a)\), (i,a)∈S×A, for k=1,2,…,m. For each k one can find an optimal policy R _{ k }, i.e., \(u_{i}^{k}(R_{k})\geq u_{i}^{k}(R)\), i∈S, for all policies R. However, in general, \(R_{k} \not= R_{l}\) if \(k \not=l\), and there does not exist one policy which is optimal for all m rewards simultaneously for all starting states. Therefore, we consider the utility function with respect to a given initial distribution β. Given this initial distribution β and a policy R, we denote the utilities by u ^{k}(β,R). The goal in multiobjective optimization is to find a β-efficient solution, i.e., a policy R _{∗} such that there exists no other policy R satisfying u ^{k}(β,R)≥u ^{k}(β,R _{∗}) for all k and u ^{k}(β,R)>u ^{k}(β,R _{∗}) for at least one k. These problems can be solved, for both discounted rewards and average rewards, by CMDPs (for more details, see Kallenberg 2010).
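The definition of β-efficiency can be illustrated on a finite list of candidate policies. The sketch below only filters a given list by the dominance relation in the definition; it does not search over all policies and is not the CMDP-based method referred to above. The policy names and utility vectors are hypothetical.

```python
def dominates(u, v):
    """True if u >= v in every component and u > v in at least one."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))


def efficient_candidates(utilities):
    """utilities: {policy name: (u^1(beta,R), ..., u^m(beta,R))}.

    Returns the names of the candidates not dominated by another candidate.
    """
    return [name for name, u in utilities.items()
            if not any(dominates(v, u) for v in utilities.values())]


# Hypothetical utility vectors for m = 2 objectives.
candidates = {"R1": (3, 1), "R2": (1, 3), "R3": (1, 1), "R4": (2, 2)}
```

Here `efficient_candidates(candidates)` keeps R1, R2 and R4; the candidate R3 is dropped because R4 is at least as good in both objectives and strictly better in one.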
Applications
Optimal stopping problems
In Chap. 8 of Derman’s book (Derman 1970) optimal stopping of a Markov chain is discussed. Derman considers the following model. Let {X _{ t },t=1,2,…} be a finite Markov chain with state space S and stationary transition probabilities p _{ ij }. Let us suppose there exists an absorbing state 0, i.e., p _{00}=1, such that ℙ{X _{ t }=0 for some t≥1∣X _{1}=i}=1 for every i∈S. Let r _{ i }, i∈S, denote nonnegative values.
When the chain is absorbed at state 0, we can think of the process as having been stopped at that point in time and we receive the value r _{0}. However, we can also think of stopping the process at any point in time prior to absorption and receiving the value r _{ i } if i is the state of the chain when the process is stopped. If our aim is to receive the highest possible value and if r _{0}<max_{ i∈S } r _{ i }, then clearly we would not necessarily wait for absorption before stopping the process.
By a stopping time τ, we mean a rule that prescribes the time to stop the process. Optimal stopping of a Markov chain is the problem of determining a stopping time τ such that \(\mathbb{E}\{r_{X_{\tau}}\mid X_{1} = i\}\) is maximized for all i∈S. Let \(M_{i} = \max_{\tau}\mathbb{E}\{r_{X_{\tau}}\mid X_{1} = i\}\), i∈S. Derman has shown the following result.
Theorem 10
If v ^{∗} is an optimal solution of the linear program
then \(M_{i} = v^{*}_{i}\), i∈S.
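For small chains the values M _{ i } can also be approximated without a linear programming solver. The sketch below uses successive approximation of the relation \(M_{i} = \max\{r_{i}, \sum_{j} p_{ij}M_{j}\}\), started from v=r; this is plain value iteration and not the linear program of Theorem 10. The three-state chain is hypothetical, with state 0 absorbing.

```python
def stopping_values(p, r, tol=1e-12, max_iter=10_000):
    """Approximate M_i = max over stopping times of E[r_{X_tau} | X_1 = i].

    p[i][j]: transition probabilities, r[i]: value received when stopping
    in state i. Iterates v <- max(r, P v) starting from v = r.
    """
    n = len(r)
    v = list(r)
    for _ in range(max_iter):
        new = [max(r[i], sum(p[i][j] * v[j] for j in range(n)))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


# Hypothetical data: state 0 is absorbing with r_0 = 0.
p = [[1.0, 0.0, 0.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
r = [0.0, 1.0, 3.0]
M = stopping_values(p, r)
```

In this instance it is better to continue in state 1 (expected value 1.5 instead of the immediate value 1) and to stop in state 2.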
In Kallenberg (1983) this approach is generalized in the following way:

the assumption r _{ i }≥0, i∈S, is omitted;

if we continue in state i, a cost c _{ i } is incurred for all i∈S;

we can determine not only M _{ i }, i∈S, but also the states S _{0} in which it is optimal to stop.
The results are based on properties for convergent MDPs with as optimality criterion the total expected reward over an infinite horizon. The following theorem shows the result.
Theorem 11
Let v ^{∗} and (x ^{∗},y ^{∗}) be optimal solutions of the following dual pair of linear programs
and
Then, \(M_{i} = v^{*}_{i}\), i∈S and \(S_{0} = \{i \in S\mid x^{*}_{i} > 0\}\).
Furthermore, we have the following result for monotone optimal stopping problems, i.e., problems that satisfy p _{ ij }=0 for all i∈S _{1}, j∉S _{1}, where S _{1}={i∈S∣r _{ i }≥−c _{ i }+∑_{ j } p _{ ij } r _{ j }}. So, S _{1} is the set of states in which immediate stopping is not worse than continuing for one period and then choosing to stop. The set S _{1} follows directly from the data of the model.
Theorem 12
In a monotone optimal stopping problem a one-step look-ahead policy, i.e., a policy that stops in the states of S _{1} and continues outside S _{1}, is an optimal policy.
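Both the set S _{1} and the monotonicity condition can be read off from the model data. A minimal sketch, with hypothetical data and illustrative function names:

```python
def one_step_look_ahead_set(p, r, c):
    """S_1 = {i : r_i >= -c_i + sum_j p_ij r_j}."""
    n = len(r)
    return {i for i in range(n)
            if r[i] >= -c[i] + sum(p[i][j] * r[j] for j in range(n))}


def is_monotone(p, s1):
    """True if p_ij = 0 for all i in S_1 and all j outside S_1."""
    n = len(p)
    return all(p[i][j] == 0
               for i in s1 for j in range(n) if j not in s1)


# Hypothetical three-state problem.
p = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
r = [0.0, 2.5, 3.0]   # final rewards
c = [0.5, 0.5, 0.5]   # continuation costs
S1 = one_step_look_ahead_set(p, r, c)
```

For these data S _{1} consists of states 1 and 2, and no transition leads from S _{1} back to state 0, so the problem is monotone and Theorem 12 applies.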
Replacement problems
General replacement problem
In a general replacement model we have state space S={0,1,…,N}, where state 0 corresponds to a new item, and action sets A(0)={1} and A(i)={0,1}, \(i \not= 0\), where action 0 means replacing the ‘old’ item by a new item. We consider in this model costs instead of rewards. Let c be the cost of a new item.
Furthermore, assume that an item of state i has trade-in value s _{ i } and maintenance costs c _{ i }. If in state i action 0 is chosen, then c _{ i }(0)=c−s _{ i }+c _{0} and p _{ ij }(0)=p _{0j }, j∈S; for action 1, we have c _{ i }(1)=c _{ i } and p _{ ij }(1)=p _{ ij }, j∈S. In contrast with other replacement models, where the state is determined by the age of the item, we allow that the state of the item may change to any other state.
In this case the optimal replacement policy is in general not a control-limit rule. As optimality criterion we consider the discounted reward. For this model the primal linear program is:
where β _{ j }>0, j∈S. Because there is only one action in state 0, namely action 1, we have \(v^{\alpha}_{0} = -c_{0} + \alpha \sum^{N}_{j=0} p_{0j}v^{\alpha}_{j}\).
Hence, instead of \(v_{i} - \alpha\sum^{N}_{j=0} p_{0j}v_{j} =\sum^{N}_{j=0} (\delta_{ij} - \alpha p_{0j})v_{j} \geq -c + s_{i} - c_{0}\), we can write v _{ i }−v _{0}≥−c+s _{ i }, obtaining the equivalent linear program
where r _{ i }=−c+s _{ i }, i∈S. The dual linear program of (27) is:
For this linear program the following result can be shown. For the proof we refer to Kallenberg (2010).
Theorem 13
There is a one-to-one correspondence between the extreme solutions of (28) and the set of deterministic policies.
Consider the simplex method to solve (28) and start with the basic solution that corresponds to the policy which chooses action 1 (no replacement) in all states. Hence, in the first simplex tableau y _{ j }, 0≤j≤N, are the basic variables and x _{ i }, 1≤i≤N, the nonbasic variables. Take the usual version of the simplex method in which the column with the most negative cost is chosen as pivot column. It turns out, see Theorem 14, that this choice gives the optimal action for that state, i.e., in that state action 0, the replacement action, is optimal. Hence, after interchanging x _{ i } and y _{ i }, the column of y _{ i } can be deleted. Consequently, we obtain the following greedy simplex algorithm.
Algorithm 3
(Greedy simplex algorithm)

1. Start with the basic solution corresponding to the nonreplacing actions.

2. If the reduced costs are nonnegative: the corresponding policy is optimal (STOP).
Otherwise:
(a) Choose the column with the most negative reduced cost as pivot column.
(b) Execute the usual simplex transformation and delete the pivot column.

3. If all columns are removed: replacement in all states is the optimal policy (STOP).
Otherwise: return to step 2.
Theorem 14
The greedy simplex algorithm is correct and has complexity \(\mathcal{O}(N^{3})\).
Remark 1
For the proof of Theorem 14 we also refer to Kallenberg (2010). The linear programming approach, as discussed in this section, is related to a paper by Gal (1984), in which the method of policy iteration was considered.
Remark 2
An optimal stopping problem may be considered as a special case of a replacement problem with as optimality criterion the total expected reward, i.e., α=1. In an optimal stopping problem there are two actions in each state. The first action is the stopping action and the second action corresponds to continue. If the stopping action is chosen in state i, then a final reward r _{ i } is earned and the process terminates. If the second action is chosen, then a cost c _{ i } is incurred and the transition probability of being in state j at the next decision time point is p _{ ij }, j∈S. This optimal stopping problem is a special case of the replacement problem with p _{0j }=0 for all j∈S, c _{ i }(0)=−r _{ i } and c _{ i }(1)=c _{ i } for all i∈S. Hence, also for the optimal stopping problem, the linear programming approach of this section can be used and the complexity is also \(\mathcal{O}(N^{3})\).
Remark 3
With a similar approach, the average reward criterion for an irreducible general replacement problem can be treated.
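For small instances, the general replacement model of this section can be set up and solved numerically as a cross-check. The sketch below builds the costs c _{ i }(0)=c−s _{ i }+c _{0} and c _{ i }(1)=c _{ i } as defined above and minimizes the discounted cost by value iteration; it is not the greedy simplex algorithm of Algorithm 3. The function name and all numerical data are hypothetical.

```python
def solve_replacement(p, maint, trade_in, new_cost, alpha,
                      tol=1e-10, max_iter=100_000):
    """Minimize discounted cost in the general replacement model.

    p[i][j]   : transition probabilities when the item is kept (action 1)
    maint[i]  : maintenance cost c_i
    trade_in[i]: trade-in value s_i
    new_cost  : cost c of a new item
    Returns (v, f) with f[i] = 0 (replace) or 1 (keep); f[0] is always 1.
    """
    n = len(p)

    def q(i, a, v):
        if a == 1:  # keep the item
            return maint[i] + alpha * sum(p[i][j] * v[j] for j in range(n))
        # replace: pay c - s_i + c_0 and move like a new item
        return (new_cost - trade_in[i] + maint[0]
                + alpha * sum(p[0][j] * v[j] for j in range(n)))

    def acts(i):
        return [1] if i == 0 else [0, 1]

    v = [0.0] * n
    for _ in range(max_iter):
        new = [min(q(i, a, v) for a in acts(i)) for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, v)) < tol
        v = new
        if converged:
            break
    f = [min(acts(i), key=lambda a: q(i, a, v)) for i in range(n)]
    return v, f


# Hypothetical instance with states 0, 1, 2.
p = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
v, f = solve_replacement(p, maint=[1, 2, 8], trade_in=[0, 3, 1],
                         new_cost=10, alpha=0.9)
```

In this instance the computed policy keeps the item in states 0 and 1 and replaces it in state 2.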
Replacement problem with increasing deterioration
Consider a replacement model with state space S={0,1,…,N+1}. An item is in state 0 if and only if it is new; an item is in state N+1 if and only if it is inoperative. In states 1,2,…,N there are two actions: action 0 is to replace the item by a new one and action 1 is not to replace the item. In the states 0 and N+1 only one action is possible (no replacement and replacement by a new item, respectively); we call these actions 1 and 0, respectively. The transition probabilities are:
We assume two types of cost, the cost c _{0}≥0 to replace an operative item by a new one and the cost c _{0}+c _{1}, where c _{1}≥0, to replace an inoperative item by a new one. Thus, c _{1} is the additional cost incurred if the item becomes inoperative before being replaced. Hence, the costs c are:
We state the following assumptions, which turn out to be equivalent (see Lemma 6).
Assumption 1
The transition probabilities are such that for every nondecreasing function x _{ j }, j∈S, the function \(F(i) = \sum_{j=0}^{N+1}p_{ij}x_{j}\) is nondecreasing in i.
Assumption 2
The transition probabilities are such that for every k∈S, the function \(G_{k}(i) = \sum_{j=k}^{N+1} p_{ij}\) is nondecreasing in i.
Lemma 6
The Assumptions 1 and 2 are equivalent.
The significance of Lemma 6 is that Assumption 1 can be checked via Assumption 2, which can be verified using only the data of the model. Assumption 2 means that this replacement model has increasing deterioration.
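Assumption 2 translates directly into a check on the transition matrix: for every k, the tail sums \(G_{k}(i) = \sum_{j \geq k} p_{ij}\) must be nondecreasing in i. A minimal sketch, with states indexed from 0 and two hypothetical matrices:

```python
def has_increasing_deterioration(p, eps=1e-12):
    """Check Assumption 2: for every k, sum_{j >= k} p[i][j] is
    nondecreasing in i (eps absorbs floating-point rounding)."""
    n = len(p)
    for k in range(n):
        tails = [sum(row[k:]) for row in p]
        if any(tails[i + 1] < tails[i] - eps for i in range(n - 1)):
            return False
    return True


# Hypothetical transition matrices.
p_good = [[0.5, 0.5, 0.0],
          [0.2, 0.5, 0.3],
          [0.0, 0.2, 0.8]]
p_bad = [[0.0, 0.0, 1.0],
         [1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0]]
```

The first matrix satisfies the assumption; the second does not, because from state 0 the chain jumps to the worst state with certainty while from state 1 it returns to state 0.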
We first consider the criterion of discounted costs. For this criterion the following result can be shown, which is based on the property that the value vector \(v^{\alpha}_{i}\), 0≤i≤N+1, is nondecreasing in the states i.
Theorem 15
Suppose that Assumption 1 (or 2) holds and let the state i _{∗} be such that \(i_{*} = \max\{i\mid\alpha \sum_{j}p_{ij}v_{j}^{\alpha}\leq c_{0} + \alpha\sum_{j} p_{0j}v_{j}^{\alpha}\}\). Then, the control-limit policy \(f^{\infty}_{*}\) which replaces in the states i>i _{∗} is a discounted optimal policy.
Theorem 15 implies that the next algorithm computes an optimal control-limit policy for this model. Similar to Algorithm 3 it can be shown that the complexity of Algorithm 4 is \(\mathcal {O}(N^{3})\).
Algorithm 4
(Computation of an optimal control-limit policy)

1. (a) Start with the basic solution corresponding to the nonreplacing actions in the states i=1,2,…,N and to the only action in the states 0 and N+1.
(b) Let k=N (the number of nonbasic variables corresponding to the replacing actions in the states i=1,2,…,N).

2. If the reduced costs are nonnegative: the corresponding policy is optimal (STOP).
Otherwise:
(a) Choose the column corresponding to state k as pivot column.
(b) Execute the usual simplex transformation.
(c) Delete the pivot column.

3. If all columns are removed: replacement in all states is the optimal policy (STOP).
Otherwise: return to step 2.
Next, we consider the criterion of average cost. By Theorem 15, for each α∈(0,1) there exists a control-limit policy \(f^{\infty}_{\alpha}\) that is α-discounted optimal. Let {α _{ k },k=1,2,…} be any sequence of discount factors such that lim_{ k→∞} α _{ k }=1.
Since there are only a finite number of different control-limit policies, there is a subsequence along which one and the same policy is optimal. Therefore, we may assume that \(f^{\infty}_{\alpha_{k}} = f^{\infty}_{0}\) for all k. Let f ^{∞} be any policy in C(D). Since \(f^{\infty}_{0} =f^{\infty}_{\alpha_{k}}\) is optimal for all k, we have
Letting k→∞, we obtain for every f ^{∞}∈C(D),
Therefore, the following result holds.
Theorem 16
If Assumption 1 (or 2) holds, then there exists a control-limit policy \(f^{\infty}_{*}\) such that \(\phi(f^{\infty}_{*}) \leq\phi(f^{\infty})\) for all policies f ^{∞}∈C(D).
Remark
The results of this section, with the exception of Algorithm 4, have been developed by Derman (1963).
Skip-to-the-right model with failure
This model is slightly different from the previous one, replacement with increasing deterioration. Let the state space S={0,1,…,N+1}, where state 0 corresponds to a new item and state N+1 to failure. The states i, 0≤i≤N, may be interpreted as the age of the item. The system has in state i (0≤i≤N) a failure probability p _{ i } during the next period. When failure occurs in state i, which is modeled as being transferred to state N+1, there is an additional cost f _{ i }. In state N+1 the item has to be replaced by a new one. In the states 1≤i≤N there are two actions. Action 0 replaces the item immediately by a new one, so it has the same transitions as state 0; the replacement cost is c. By action 1 the system moves, when there is no failure, from state i to the next state i+1: the system skips to the right, i.e., the age of the item increases. Furthermore, in state i there is a maintenance cost c _{ i }.
The action sets, the cost of a new item, the maintenance costs and the transition probabilities are as follows.
We impose the following assumptions:

(A1) c≥0; c _{ i }≥0, f _{ i }≥0, 0≤i≤N.

(A2) p _{0}≤p _{1}≤⋯≤p _{ N }, i.e., older items have greater failure probability.

(A3) c _{0}+p _{0} f _{0}≤c _{1}+p _{1} f _{1}≤⋯≤c _{ N }+p _{ N } f _{ N }, i.e., the expected maintenance and failure costs grow with the age of the item.
Take any k∈S. Since
this summation is, by assumption (A2), nondecreasing in i. Hence, Assumption 2, and consequently also Assumption 1 of the previous section, is satisfied. This enables us to treat this model in a similar way as the model with increasing deterioration. In this way we can derive the following result.
Theorem 17
Let the assumptions (A1), (A2) and (A3) hold, and let \(i_{*} =\max\{i\mid c_{i} + p_{i}f_{i} + \alpha\sum_{j} p_{ij}(1)v_{j}^{\alpha}\leq c+c_{0} +p_{0}f_{0} + \alpha\sum_{j} p_{0j}(1)v_{j}^{\alpha}\}\). Then, the control-limit policy \(f^{\infty}_{*}\) which replaces in the states i>i _{∗} is an optimal policy.
Remarks

1. For the proof of Theorem 17 we refer to Kallenberg (1994).

2. Algorithm 4 is also applicable to this model.

3. As in the previous section, it can be shown that for the average cost criterion there also exists a control-limit optimal policy.

4. In Derman (1970, pp. 125–130) a surveillance-maintenance-replacement model is discussed. This model is solved in the following way:
(a) A fractional linear programming formulation is developed from which an optimal policy can be derived.
(b) This fractional linear program can be transformed into a normal linear program. This transformation is due to Derman and his student Klein (see Derman 1962 and Klein 1962). See Charnes and Cooper (1962) and Wagner and Yuan (1968) for a more general treatment of linear fractional programming.
Separable replacement problem
Suppose that the MDP has the following structure: S={0,1,2,…,N}; A(i)={1,2,…,M}, i∈S; p _{ ij }(a)=p _{ j }(a), i,j∈S, a∈A(i), i.e., the transitions are state independent; r _{ i }(a)=s _{ i }+t(a), i∈S, a∈A(i), i.e., the rewards are separable.
As an example, consider the problem of periodically replacing a car. The age of a car can be 0,1,…,N. When a car is replaced, it can be replaced not only by a new one (state 0), but also by a car in an arbitrary state a, 1≤a≤N. Let s _{ i } be the trade-in value of a car of state i, t(a) the costs of a car of state a. Then, r _{ i }(a)=s _{ i }−t(a) and p _{ ij }(a)=p _{ j }(a), where p _{ j }(a) is the probability that a car of state a is in state j at the next decision time point.
The next theorems show that a one-step look-ahead policy is optimal for both discounted and undiscounted rewards.
Theorem 18
The policy \(f^{\infty}_{1}\), defined by f _{1}(i)=a _{1} for all i, where a _{1} is such that −t(a _{1})+α∑_{ j } p _{ j }(a _{1})s _{ j }=max_{1≤a≤M }{−t(a)+α∑_{ j } p _{ j }(a)s _{ j }}, is an α-discounted optimal policy.
Theorem 19
The policy \(f^{\infty}_{2}\), defined by f _{2}(i)=a _{2} for all i, where a _{2} is such that −t(a _{2})+∑_{ j } p _{ j }(a _{2})s _{ j }=max_{1≤a≤M }{−t(a)+∑_{ j } p _{ j }(a)s _{ j }}, is an average optimal policy.
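Theorems 18 and 19 reduce the separable replacement problem to a single maximization over the actions. The sketch below evaluates −t(a)+α∑_{ j } p _{ j }(a)s _{ j } for each action and returns a maximizer; α=1 gives the average reward rule of Theorem 19. Actions are indexed from 0 here, and the prices, trade-in values and transition probabilities are hypothetical.

```python
def best_action(t, p, s, alpha=1.0):
    """Return an action a maximizing -t[a] + alpha * sum_j p[a][j] * s[j].

    t[a]   : cost of action a (e.g. price of a car of state a)
    p[a][j]: probability of being in state j at the next decision epoch
    s[j]   : state-dependent part of the reward (e.g. trade-in value)
    """
    def score(a):
        return -t[a] + alpha * sum(pj * sj for pj, sj in zip(p[a], s))
    return max(range(len(t)), key=score)


# Hypothetical car replacement data with three purchase options.
t = [9, 7, 4]
s = [10, 8, 2]
p = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 1]]
```

With these numbers the undiscounted rule selects action 0, whereas for α=0.5 the cheapest option, action 2, becomes the maximizer; the discount factor can therefore change the optimal purchase.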
Multiarmed bandit problems
Introduction
The multiarmed bandit problem is a model for dynamic allocation of a resource to one of n independent alternative projects. Any project may be in one of a finite number of states, say project j in the set S _{ j }, j=1,2,…,n. Hence, the state space S is the Cartesian product S=S _{1}×S _{2}×⋯×S _{ n }. Each state i=(i _{1},i _{2},…,i _{ n }) has the same action set A={1,2,…,n}, where action k means that project k is chosen, k=1,2,…,n. So, at each stage one can be working on exactly one of the projects.
When project k is chosen in state i—the chosen project is called the active project—the immediate reward and the transition probabilities only depend on the active project, whereas the states of the remaining projects are frozen. Let \(r_{i_{k}}\) and \(p_{i_{k}j}\), j∈S _{ k } denote these quantities when action k is chosen. The total discounted reward criterion is chosen.
It was shown by Gittins and Jones (1974, 1979) that an optimal policy is the policy that selects project k in state i=(i _{1},i _{2},…,i _{ n }), where k satisfies
for certain numbers G _{ j }(i _{ j }), i _{ j }∈S _{ j }, 1≤j≤n. Such a policy is called an index policy. Surprisingly, the number G _{ j }(i _{ j }) only depends on project j and not on the other projects. These indices are called the Gittins indices.
As a consequence, the multiarmed bandit problem can be solved by solving a sequence of n one-armed bandit problems. This is a decomposition result by which the dimensionality of the problem is reduced considerably. Algorithms with complexity \(\mathcal{O}(\sum_{j=1}^{n} n_{j}^{3})\), where n _{ j }=|S _{ j }|, 1≤j≤n, do exist for the computation of all indices.
A single project with a terminal reward
Consider the one-armed bandit problem with stopping option, i.e., in each state there are two options: action 1 is the stopping option, which earns a terminal reward M, and action 2 continues the process, with an immediate reward r _{ i } in state i and transition probabilities p _{ ij }. Let v ^{α}(M) be the value vector of this optimal stopping problem. Then, v ^{α}(M) is the unique solution of the optimality equation
and of the linear program
Furthermore, we have the following results.
Theorem 20
Let (x,y) be an extreme optimal solution of the dual program of (30), i.e.,
Then, the policy f ^{∞} such that
is an optimal policy.
Lemma 7
\(v_{i}^{\alpha}(M) - M\) is a nonnegative continuous nonincreasing function in M, for all i∈S.
Define the indices G _{ i }, i∈S, by \(G_{i} = \min\{M\mid v^{\alpha}_{i}(M) =M\}\). Hence, \(v^{\alpha}_{i}(G_{i}) = G_{i}\) and, by Lemma 7, \(v^{\alpha}_{i}(M) = M\) for all M≥G _{ i }. For these indices one can show the following theorem.
Theorem 21
For any M, the policy f ^{∞}∈C(D) which chooses the stopping action in state i if and only if M≥G _{ i } is optimal.
For M=G _{ i } both actions (stop or continue) are optimal. Hence, an interpretation of the Gittins index G _{ i } is that it is the terminal reward under which in state i both actions are optimal. Therefore, this number is also called the indifference value.
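The characterization \(G_{i} = \min\{M\mid v^{\alpha}_{i}(M) = M\}\), together with the monotonicity in Lemma 7, suggests a simple numerical procedure: bisection on M. The sketch below is an illustration only and is much less efficient than the methods discussed later. It assumes the optimality equation has the usual form \(v_{i} = \max\{M, r_{i} + \alpha\sum_{j} p_{ij}v_{j}\}\), solves it by value iteration for each trial value of M, and uses the brackets min_{ i } r _{ i }/(1−α) and max_{ i } r _{ i }/(1−α). The function names and the two-state project are hypothetical.

```python
def stopping_value(p, r, alpha, M, tol=1e-12, max_iter=100_000):
    """Value vector v^alpha(M) of the one-armed bandit with terminal reward M."""
    n = len(r)
    v = [M] * n
    for _ in range(max_iter):
        new = [max(M, r[i] + alpha * sum(p[i][j] * v[j] for j in range(n)))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


def indifference_values(p, r, alpha, iters=60, slack=1e-9):
    """Approximate G_i = min{M : v_i(M) = M} by bisection for every state."""
    lo0, hi0 = min(r) / (1 - alpha), max(r) / (1 - alpha)
    G = []
    for i in range(len(r)):
        lo, hi = lo0, hi0
        for _ in range(iters):
            mid = (lo + hi) / 2
            # v_i(M) - M is nonincreasing in M (Lemma 7), so this is monotone.
            if stopping_value(p, r, alpha, mid)[i] - mid <= slack:
                hi = mid
            else:
                lo = mid
        G.append(hi)
    return G


# Hypothetical project: state 0 earns nothing and moves to state 1,
# which earns 1 per period forever.
p = [[0, 1], [0, 1]]
r = [0, 1]
G = indifference_values(p, r, alpha=0.5)
```

For this project the indices can be computed by hand: continuing in state 1 is worth 1/(1−α)=2, and continuing in state 0 is worth α·2=1, which is what the bisection returns up to the tolerance.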
Multiarmed bandits
Consider the multiarmed bandit model with an additional option (action 0) in each state. Action 0 is a stopping option and then one earns a terminal reward M. One can show the following result.
Theorem 22
For any state i=(i _{1},i _{2},…,i _{ n }) and any terminal reward M, the policy that takes the stopping action if \(M \geq G_{i_{j}}\) for all j=1,2,…,n and continues with project k if \(G_{i_{k}} = \max_{j}G_{i_{j}} > M\), is an optimal policy.
The preceding theorem shows that the optimal policy in the multi-project case can be determined by an analysis of the n single-project problems, with the optimal decision in state i=(i _{1},i _{2},…,i _{ n }) being to operate on that project k having the largest \(G_{i_{k}}\) if this value is greater than M and to stop otherwise.
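Once the indices of the individual projects are known, the rule of Theorem 22 is a table lookup. A minimal sketch, with projects and states indexed from 0 and hypothetical index values:

```python
def choose_project(indices, state, M):
    """Index rule of Theorem 22.

    indices[j][s]: index of project j in its state s
    state        : tuple (i_1, ..., i_n) of current project states
    M            : terminal reward of the stopping option
    Returns the project to continue with, or None if stopping is optimal.
    """
    best = max(range(len(state)), key=lambda j: indices[j][state[j]])
    return best if indices[best][state[best]] > M else None


# Hypothetical indices for two projects with two states each.
indices = [[1.0, 2.0],
           [1.5, 0.5]]
```

For example, in state (0, 0) with M=1.2 the rule continues with project 1 (index 1.5), whereas in state (0, 1) no index exceeds M and the rule stops.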
Several methods have been proposed for the computation of the Gittins indices. We mention the contributions of Katehakis and Veinott (the restart-in-state method, see Katehakis and Veinott 1987), Varaiya, Walrand and Buyukkoc (the largest-remaining-index method, see Varaiya et al. 1985), and Chen and Katehakis (the linear programming method, see Chen and Katehakis 1986). In this article we present the parametric linear programming method proposed in Kallenberg (1986). This method has for a project with N states complexity \(\mathcal{O}(N^{3})\).
The parametric linear programming method
We have already seen that for a single project with terminal reward M the solution can be obtained from a linear programming problem, namely program (31). For M big enough, e.g., for M≥C=(1−α)^{−1}⋅max_{ i } r _{ i }, we know that \(v^{\alpha}_{i}(M) = M\) for all states i. Furthermore, we have seen that the Gittins index \(G_{i} =\min\{M\mid v^{\alpha}_{i}(M) = M\}\).
One can solve program (31) as a parametric linear programming problem with parameter M. Starting with M=C one can decrease M and find for each state i the largest M for which it is optimal to keep working on the project, which is in fact \(\min\{M\mid v^{\alpha}_{i}(M)= M\} = G_{i}\), in the order of decreasing M-values.
One can start with the simplex tableau in which all y-variables are in the basis and in which the x-variables are the nonbasic variables. This tableau is optimal for M≥C. Decrease M until we meet a basis change, say the basic variable y _{ i } will be exchanged with the nonbasic variable x _{ i }. Then, we know the M-value which is equal to G _{ i }. In this way we continue and repeat the procedure N times, where N is the number of states in the current project. The pivot row and column used do not influence any further pivoting step, so we can delete this row and column from the simplex tableau.
We can easily determine the computational complexity. Each update of an element in a simplex tableau needs at most two arithmetic operations (multiplication and divisions as well as additions and subtractions). Hence, the total number of arithmetic operations in this method for a project with N states, is at most \(2 \cdot\sum_{k=1}^{N} k^{2} =\frac{1}{3}N(N+1)(2N + 1) = \mathcal{O}(N^{3})\).
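The closed form of the operation count can be checked directly. The snippet below compares \(2\sum_{k=1}^{N} k^{2}\) with \(\frac{1}{3}N(N+1)(2N+1)\) in integer arithmetic; the function name is illustrative.

```python
def operation_bound(N):
    """Upper bound 2 * sum_{k=1}^{N} k^2 on the number of arithmetic
    operations for a project with N states."""
    return 2 * sum(k * k for k in range(1, N + 1))


def closed_form_matches(N):
    # Multiply by 3 to stay in integer arithmetic.
    return 3 * operation_bound(N) == N * (N + 1) * (2 * N + 1)
```

For N=10, for instance, the bound equals 770 operations.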
Remark
The problem of assigning one of several treatments in clinical trials can be formulated as a multiarmed bandit problem. Derman and Katehakis (1987) have used the characterization of the Gittins index as a restart-in-state problem (see Katehakis and Veinott 1987) to calculate efficiently the Gittins values for clinical trials. The characterization of the Gittins index as a restart-in-state problem is related to a general replacement problem as treated by Derman in his book (Derman 1970, pp. 121–125).
Separable problems
Introduction
Separable MDPs have the property that for certain pairs (i,a)∈S×A:

(1)
the immediate reward is the sum of two terms, one depends only on the current state and the other depends only on the chosen action: r _{ i }(a)=s _{ i }+t _{ a }.

(2)
the transition probabilities depend only on the action and not on the state from which the transition occurs: p _{ ij }(a)=p _{ j }(a), j∈S.
Let S _{1}×A _{1} be the subset of S×A for which the pairs (i,a) satisfy (1) and (2). We also assume that the action sets of A _{1} are nested: let S _{1}={1,2,…,m}, then \(A_{1}(1)\supseteq A_{1}(2) \supseteq\cdots\supseteq A_{1}(m) \not= \emptyset\). Let S _{2}=S\S _{1}, A _{2}(i)=A(i)\A _{1}(i), 1≤i≤m and A _{2}(i)=A(i), m+1≤i≤N. We also introduce the notation B(i)=A _{1}(i)\A _{1}(i+1), 1≤i≤m−1 and B(m)=A _{1}(m). Then, \(A_{1}(i) = \bigcup_{j=i}^{m} B(j)\) and the sets B(j) are disjoint. We allow that S _{2}, A _{2} or B(i), 1≤i≤m−1, are empty sets.
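The decomposition of the nested sets A _{1}(i) into the disjoint layers B(i) can be made concrete with ordinary set operations. A minimal sketch; the function name and the example action sets are hypothetical, and list position i−1 holds the set for state i.

```python
def layers(A1):
    """Given nested sets A_1(1) >= A_1(2) >= ... >= A_1(m) (nonempty),
    return [B(1), ..., B(m)] with B(i) = A_1(i) minus A_1(i+1), B(m) = A_1(m)."""
    m = len(A1)
    if not A1[-1] or any(not A1[i] >= A1[i + 1] for i in range(m - 1)):
        raise ValueError("action sets must be nested and A_1(m) nonempty")
    return [A1[i] - A1[i + 1] for i in range(m - 1)] + [set(A1[-1])]


# Hypothetical nested action sets for m = 4 separable states.
A1 = [{1, 2, 3, 4}, {2, 3}, {2, 3}, {3}]
B = layers(A1)
```

In this example B(2) is empty, which the text explicitly allows, and each A _{1}(i) is recovered as the union of B(i),…,B(m).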
If the system is observed in state i∈S _{1} and the decision maker chooses an action from A _{1}(i), then the decision process can be considered as follows. First, a reward s _{ i } is earned and the system makes a zero-time transition to an additional state N+i. In this additional state there are two options: either to take an action a∈B(i) or to take an action a∈A _{1}(i)\B(i)=A _{1}(i+1). In the first case the reward t _{ a } is earned and the process moves to state j with probability p _{ j }(a), j∈S; in the second case we are in the same situation as in state N+i, but now in N+i+1, i.e., a zero-time transition is made from state N+i to state N+i+1.
A lot of dynamic decision problems are separable, e.g., the automobile replacement problem which was first considered by Howard (see Howard 1960).
Discounted rewards
The description in the introduction as a problem with zero-time and one-time transitions gives rise to the transformed model with N+m states and to the following linear program for the computation of the value vector v ^{α}:
The first set of inequalities corresponds to the nonseparable set S×A _{2} with one-time transitions; the second set of inequalities to the zero-time transitions from the state i to N+i, 1≤i≤m; the third set of inequalities to the set S _{1}×B with one-time transitions and the last set of inequalities corresponds to the zero-time transitions from the state N+i to N+i+1, 1≤i≤m−1.
The dual of program (32), where the dual variables x _{ i }(a), λ _{ i }, w _{ i }(a), ρ _{ i } correspond to the four sets of constraints in (32), is:
subject to the constraints
Without using the transformed problem, the linear program to compute the value vector v ^{α} is:
The following result can be shown.
Lemma 8
Let the vector v be feasible for (34) and define the vector y by \(y_{i} = \max_{a \in A_{1}(i)} \{t_{a} + \alpha\sum_{j=1}^{N} p_{j}(a)v_{j}\}\), 1≤i≤m. Then,
(1) (v,y) is a feasible solution of (32).
(2) \(\sum_{i=1}^{N} v_{i} + \sum_{i=1}^{m} y_{i} \geq\sum_{i=1}^{N}v^{\alpha}_{i} + \sum_{i=1}^{m} \max_{a \in A_{1}(i)} \{t_{a} + \alpha\sum_{j=1}^{N}p_{j}(a)v^{\alpha}_{j}\}\).
Since v ^{α} is the unique optimal solution of (34), we have shown that (v ^{α},y ^{α}), with \(y^{\alpha}_{i} = \max_{a\in A_{1}(i)} \{t_{a} + \alpha\sum_{j=1}^{N} p_{j}(a)v^{\alpha}_{j}\}\), 1≤i≤m, is the unique optimal solution of (32). The next theorem shows how an optimal policy can be found from an optimal solution of problem (33).
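The relation between v ^{α} and y ^{α} can be checked numerically on a toy instance. The sketch below is an illustration under assumptions, not the computation used in the paper: the three-state data are hypothetical, v ^{α} is approximated by plain value iteration rather than by linear programming, and the four inequality families are written in the form suggested by the verbal description of program (32).

```python
ALPHA = 0.9
N, M = 3, 2                       # states 0..2; states 0 and 1 are separable

s = [1.0, 0.5]                    # s_i for the separable states
t = {"a": 2.0, "b": 0.0}          # t_a for the separable actions
p = {"a": [0.0, 0.0, 1.0],        # p_j(a): independent of the current state
     "b": [0.5, 0.5, 0.0]}
A1 = [["a", "b"], ["b"]]          # nested sets A_1(i)
B = [["a"], ["b"]]                # layers B(i)
A2 = [(2, 0.0, [1.0, 0.0, 0.0])]  # nonseparable pairs: (state, reward, row)


def dot(row, v):
    return sum(q * x for q, x in zip(row, v))


def score(a, v):
    """t_a + alpha * sum_j p_j(a) v_j."""
    return t[a] + ALPHA * dot(p[a], v)


def bellman(v):
    """One value-iteration step on the original (untransformed) model."""
    new = [float("-inf")] * N
    for i in range(M):
        new[i] = s[i] + max(score(a, v) for a in A1[i])
    for i, r, row in A2:
        new[i] = max(new[i], r + ALPHA * dot(row, v))
    return new


v = [0.0] * N
for _ in range(2000):             # contraction factor 0.9, so this converges
    v = bellman(v)

# y_i = max over A_1(i) of t_a + alpha * sum_j p_j(a) v_j, as in Lemma 8.
y = [max(score(a, v) for a in A1[i]) for i in range(M)]

TOL = 1e-9
feasible = (
    all(v[i] >= r + ALPHA * dot(row, v) - TOL for i, r, row in A2)
    and all(v[i] >= s[i] + y[i] - TOL for i in range(M))
    and all(y[i] >= score(a, v) - TOL for i in range(M) for a in B[i])
    and all(y[i] >= y[i + 1] - TOL for i in range(M - 1))
)
```

On this instance `feasible` evaluates to true, in line with part (1) of Lemma 8 applied to v=v ^{α}.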
Theorem 23
Let (x ^{∗},λ ^{∗},w ^{∗},ρ ^{∗}) be an optimal solution of (33). Define \(S_{*} = \{j\mid\sum_{a \in A_{2}(j)}x^{*}_{j}(a) > 0\}\) and \(k_{j} = \min\{k \geq j\mid\sum_{a \in B(k)}w^{*}_{k}(a) > 0\}\), j∈S\S _{∗}. Take any policy \(f^{\infty}_{*} \in C(D)\) such that \(x^{*}_{j}(f_{*}(j)) > 0\) if j∈S _{∗} and \(w^{*}_{k_{j}}(f_{*}(j)) > 0\) if j∈S\S _{∗}. Then, \(f^{\infty}_{*}\) is well-defined and is a discounted optimal policy.
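The selection rule of Theorem 23 is straightforward to mechanize once an optimal dual solution is available. The following Python sketch assumes the solution is stored in dictionaries keyed by (state, action); the function name `extract_policy`, the tolerance `EPS` and the sample numbers are hypothetical and are not taken from the paper.

```python
EPS = 1e-12  # numerical threshold standing in for "> 0"


def extract_policy(states, m, x, w, B):
    """Pick f(j) following the rule of Theorem 23.

    states : state labels 1..N
    m      : number of separable states (labelled 1..m)
    x      : {(j, a): value} for the nonseparable pairs, a in A_2(j)
    w      : {(k, a): value} for 1 <= k <= m, a in B(k)
    B      : {k: list of actions in B(k)}
    """
    policy = {}
    for j in states:
        positive = [a for (i, a), val in x.items() if i == j and val > EPS]
        if positive:                      # j belongs to S_*
            policy[j] = positive[0]
            continue
        # j is outside S_*: search k_j = min{k >= j : sum_{a in B(k)} w_k(a) > 0}
        for k in range(j, m + 1):
            positive = [a for a in B[k] if w.get((k, a), 0.0) > EPS]
            if positive:
                policy[j] = positive[0]   # this action lies in B(k_j), a subset of A_1(j)
                break
        else:
            raise ValueError("no k_j for state %r; input is not optimal" % j)
    return policy


# Hypothetical dual values for N = 3 states, of which m = 2 are separable.
B = {1: ["a"], 2: ["b"]}
x = {(3, "c"): 1.2}
w = {(1, "a"): 0.0, (2, "b"): 0.7}
policy = extract_policy([1, 2, 3], 2, x, w, B)  # {1: "b", 2: "b", 3: "c"}
```

Because the sets A _{1}(i) are nested, an action found in B(k _{ j }) with k _{ j }≥j is always admissible in state j. The error branch is reached only for input that is not an optimal solution, since the theorem guarantees that k _{ j } exists.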
Average rewards—unichain case
Consider the problem again in the transformed model with N+m states and with zero-time and one-time transitions. This interpretation gives rise to the following linear program for the computation of the value ϕ.
The dual of program (35), where the dual variables x _{ i }(a), λ _{ i }, w _{ i }(a), ρ _{ i } correspond to the four sets of constraints in (35), is:
subject to the constraints
Without using the transformed problem, the linear program to compute the value ϕ is:
Lemma 9
Let (x,y) be feasible for problem (37) and define the vector z by \(z_{i} = \max_{a \in A_{1}(i)} \{t_{a} + \sum_{j=1}^{N}p_{j}(a)y_{j}\} - x\), 1≤i≤m. Then, (x,y,z) is a feasible solution of (35) and x≥ϕ.
Since any optimal solution (x ^{∗},y ^{∗}) of problem (37) satisfies x ^{∗}=ϕ, the optimum value of (35) is also ϕ. Furthermore, (x ^{∗}=ϕ,y ^{∗},z ^{∗}) is an optimal solution of program (35), where \(z^{*}_{i} = \max_{a \in A_{1}(i)}\{t_{a} +\sum_{j=1}^{N} p_{j}(a)y^{*}_{j}\} - \phi\) for i=1,2,…,m. The next theorem shows how an optimal policy can be found from an optimal solution of problem (36).
Theorem 24
Let (x ^{∗},λ ^{∗},w ^{∗},ρ ^{∗}) be an optimal solution of (36). Define \(S_{*} = \{j\mid\sum_{a \in A_{2}(j)} x^{*}_{j}(a) >0\}\) and \(k_{j} = \min\{k \geq j\mid\sum_{a \in B(k)}w^{*}_{k}(a) > 0\}\), \(j \in S_{w^{*}}\), where \(S_{w^{*}} = \{j \in S \backslash S_{*}\mid \sum_{a \in A_{1}(j)} w^{*}_{j}(a) > 0\}\). Take any policy \(f^{\infty}_{*}\in C(D)\) such that \(x^{*}_{j}(f_{*}(j)) > 0\) if j∈S _{∗}, \(w^{*}_{k_{j}}(f_{*}(j)) > 0\) if \(j \in S_{w^{*}}\) and f _{∗}(j) arbitrarily chosen if \(j \notin S_{*} \cup S_{w^{*}}\). Then, \(f^{\infty}_{*}\) is an average optimal policy.
Average rewards—general case
Again, the interpretation of the transformed model leads us to consider the following linear program in order to compute the value vector ϕ.
The dual of program (38), where the dual variables y _{ i }(a), μ _{ i }, z _{ i }(a), σ _{ i }, x _{ i }(a), λ _{ i }, w _{ i }(a), ρ _{ i } correspond to the eight sets of constraints in (38), is:
subject to the constraints
for all i and a.
Without using the transformed problem, the linear program to compute the value ϕ is:
Theorem 25
Let (x ^{∗},w ^{∗},y ^{∗},z ^{∗}) and (y ^{∗},μ ^{∗},z ^{∗},σ ^{∗},x ^{∗},λ ^{∗},w ^{∗},ρ ^{∗}) be optimal solutions of the problems (38) and (39), respectively. Let m _{ i } and n _{ i } be defined by \(m_{i} = \min\{j \geq i\mid\sum_{a \in B(j)} w^{*}_{j}(a) > 0\}\) and \(n_{i} = \min\{j \geq i\mid\sum_{a \in B(j)} \{w^{*}_{j}(a) + z^{*}_{j}(a)\} >0\}\). Take any policy \(f^{\infty}_{*} \in C(D)\) such that
Then, (1) x ^{∗}=ϕ; (2) \(f^{\infty}_{*}\) is well-defined and is an average optimal policy.
Remark
De Ghellinck and Eppen (1967) examined separable MDPs with discounted rewards as the optimality criterion. Denardo (1968) introduced the notion of zero-time transitions and showed that the discounted and the average reward versions (the latter for the unichain case) yield special linear programming formulations. In the discounted case, the linear program is identical to that of De Ghellinck and Eppen. Kallenberg (1992) has shown that, for the average reward criterion, a special linear program can be used to solve the original problem in the multichain case as well.
References
Charnes, A., & Cooper, W. W. (1962). Programming with linear fractional functions. Naval Research Logistics Quarterly, 9, 181–186.
Chen, Y. R., & Katehakis, M. N. (1986). Linear programming for finite state bandit problems. Mathematics of Operations Research, 11, 180–183.
De Ghellinck, G. T. (1960). Les problèmes de décisions sequentielles. Cahiers du Centre d’Etudes de Recherche Opérationelle, 2, 161–179.
De Ghellinck, G. T., & Eppen, G. D. (1967). Linear programming solutions for separable Markovian decision problems. Management Science, 13, 371–394.
Denardo, E. V. (1968). Separable Markov decision problem. Management Science, 14, 451–462.
Denardo, E. V. (1970). On linear programming in a Markov decision problem. Management Science, 16, 281–288.
Denardo, E. V., & Fox, B. L. (1968). Multichain Markov renewal programs. SIAM Journal on Applied Mathematics, 16, 468–487.
D’Epenoux, F. (1960). Sur un problème de production et de stockage dans l’aléatoire. Revue Française de Recherche Opérationelle, 14, 3–16.
Derman, C. (1962). On sequential decisions and Markov chains. Management Science, 9, 16–24.
Derman, C. (1963). On optimal replacement rules when changes of state are Markovian. In R. Bellman (Ed.), Mathematical optimization techniques (pp. 201–210). Berkeley: University of California Press.
Derman, C. (1970). Finite state Markovian decision processes. New York: Academic Press.
Derman, C., & Katehakis, M. N. (1987). Computing optimal sequential allocation rules in clinical trials. In J. Van Ryzin (Ed.), I.M.S. lecture notes—monograph series: Vol. 8. Adaptive statistical procedures and related topics (pp. 29–39).
Filar, J. A., & Schultz, T. (1988). Communicating MDPs: Equivalence and LP properties. Operations Research Letters, 7, 303–307.
Gal, S. (1984). An \(\mathcal{O}(N^{3})\) algorithm for optimal replacement problems. SIAM Journal on Control and Optimization, 22, 902–910.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society Series B, 41, 148–177.
Gittins, J. C., & Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In J. Gani (Ed.), Progress in statistics (pp. 241–266). Amsterdam: North Holland.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.
Hordijk, A., & Kallenberg, L. C. M. (1979). Linear programming and Markov decision chains. Management Science, 25, 352–362.
Hordijk, A., & Kallenberg, L. C. M. (1984). Constrained undiscounted stochastic dynamic programming. Mathematics of Operations Research, 9, 276–289.
Kallenberg, L. C. M. (1980). Linear programming and finite Markovian control problems. PhD Thesis, University of Leiden.
Kallenberg, L. C. M. (1983). Linear programming and finite Markovian control problems. Mathematical Centre Tract no. 148, Amsterdam.
Kallenberg, L. C. M. (1986). A note on Katehakis and Chen’s computation of the Gittins index. Mathematics of Operations Research, 11, 184–186.
Kallenberg, L. C. M. (1992). Separable Markov decision problems. OR Spektrum, 14, 43–52.
Kallenberg, L. C. M. (1994). Survey of linear programming for standard and nonstandard Markovian control problems. Part II: Applications. Mathematical Methods of Operations Research, 40, 127–143.
Kallenberg, L. C. M. (2002). Classification problems in MDPs. In Z. Hou, J. A. Filar, & A. Chen (Eds.), Markov processes and controlled Markov chains (pp. 151–165). Boston: Kluwer.
Kallenberg, L. C. M. (2010). Markov decision processes. Lecture notes, University of Leiden (available at http://www.math.leidenuniv.nl/~kallenberg/LecturenotesMDP.pdf).
Katehakis, M. N., & Veinott, A. F. Jr. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 262–268.
Klein, M. (1962). Inspection-maintenance-replacement schedules under Markovian deterioration. Management Science, 9, 25–32.
Manne, A. S. (1960). Linear programming and sequential decisions. Management Science, 6, 259–267.
Tsitsiklis, J. N. (2007). NP-hardness of checking the unichain condition in average cost MDPs. Operations Research Letters, 35, 319–323.
Varaiya, P. P., Walrand, J. C., & Buyukkoc, C. (1985). Extensions of the multiarmed bandit problem: the discounted case. IEEE Transactions on Automatic Control, 30, 426–439.
Wagner, H. M., & Yuan, J. S. C. (1968). Algorithmic equivalence in linear fractional programming. Management Science, 14, 301–306.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Kallenberg, L. Derman’s book as inspiration: some results on LP for MDPs. Ann Oper Res 208, 63–94 (2013). https://doi.org/10.1007/s10479-011-1047-4
Keywords
 Optimal Policy
 Average Reward
 Dual Program
 Discount Reward
 Deterministic Policy