1 Introduction

Stochastic optimization is concerned with the solution of optimization problems that involve random quantities as data. Consequently, the decisions \(x(\xi )\) depend on the values of a random process \(\xi \), making stochastic optimization a problem in function spaces. Mirroring the situation in deterministic optimization, only a few stochastic optimization problems lend themselves to analytical treatment and allow for closed-form solutions. In the following, we therefore focus on discrete time problems that are solved numerically.

The theory of stochastic optimization as well as the development of solution methods have made great advances in recent decades. In particular, there exists a sound theory for two-stage stochastic optimization problems, i.e., problems with only one decision stage in the future (see [7, 53] for an overview). Consequently, two-stage stochastic optimization is nowadays routinely applied by researchers and industry practitioners alike. State-of-the-art methods are based on discrete representations of the (possibly continuous) source of randomness in the form of a finite set of samples or scenarios. This can be achieved either by sample average approaches (see [53] for an introduction) or by explicitly choosing representative scenarios. In this paper, we focus on the latter.

Despite the abovementioned successes, it became clear quite early that the effort required to solve stochastic optimization problems does not scale well with problem size. More specifically, it has been shown that stochastic optimization problems exhibit a non-polynomial increase in complexity as the number of random variables increases [21]. The difficulty underlying this behavior is the numerical evaluation of high-dimensional integrals, which is in turn related to the problem of optimally discretizing probability distributions.

The situation is even more complicated for multi-stage problems, where we deal with random processes resulting in additional random variables in every stage and the issue of finding discretizations for conditional distributions. Consequently, it was observed in [50, 51] that solving multi-stage stochastic optimization problems is often practically intractable.

Notwithstanding these problems, there is a rich literature on multi-stage stochastic optimization. The majority of authors use scenario trees as representation of discrete stochastic processes (see the left panel in Fig. 1 for an illustration). In a scenario tree, nodes represent possible states of the world and are assigned to a point in time. All nodes at the same point in time are usually depicted at the same level of the tree. Possible transitions between nodes in consecutive stages are represented by probability weighted arcs connecting the nodes. Consequently, the collection of transition probabilities between a node and the nodes of the next stage connected by arcs describes the distribution of the random process conditional on that node. Note that the requirement that the resulting graph is a tree implies that every node is allowed to have exactly one predecessor in the previous stage.
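The node-and-arc structure described above can be sketched in a few lines of code (a minimal illustration of our own, not from the paper; the class and branching factors are assumptions). Each node stores its stage and value, and each arc a conditional transition probability; binary branching over four stages already reproduces the 31 nodes and 16 scenarios of the tree in the left panel of Fig. 1.

```python
# A minimal sketch (our own illustration, not from the paper) of the
# node-and-arc structure of a scenario tree: nodes carry a stage and a
# value, arcs carry conditional transition probabilities.
class TreeNode:
    def __init__(self, stage, value):
        self.stage = stage
        self.value = value
        self.children = []          # list of (probability, TreeNode) pairs

    def add_child(self, prob, value):
        child = TreeNode(self.stage + 1, value)
        self.children.append((prob, child))
        return child

def count_nodes(node):
    return 1 + sum(count_nodes(child) for _, child in node.children)

# Binary branching over four stages: 31 nodes, 16 scenarios (cf. Fig. 1).
root = TreeNode(0, 100.0)
frontier = [root]
for _ in range(4):
    frontier = [node.add_child(p, node.value * f)
                for node in frontier for p, f in [(0.5, 1.1), (0.5, 0.9)]]
```

The exponential growth in the number of nodes as stages are added is immediate from this construction: each additional stage doubles the frontier.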

Fig. 1

A scenario tree with 31 nodes representing 16 scenarios on the left and a scenario lattice with 15 nodes representing 120 scenarios on the right. The transition probabilities on the arcs are not depicted to keep the picture legible

There are various ways to construct scenario trees for multi-stage stochastic programs (see [16, 31] for surveys). In [28, 29], a recursive application of moment matching is presented. The approach is easy to understand and apply, but suffers from an exponential explosion of the number of nodes in the resulting trees as the number of stages increases. Furthermore, the method offers no theoretical insight into the discretization error made when replacing the original process with the generated tree.

The papers [37, 38] propose a method for the construction of scenario trees that is based on integration quadratures and ensures that the approximated problems based on scenario trees epi-converge to the true infinite-dimensional problem, yielding convergence in optimal values as well as in optimal decisions. However, the results are asymptotic in nature, i.e., the approximation scheme does not offer guarantees for any given discrete approximation.

Another approach is based on the principle of bound-based constructions, see [10, 18, 20, 32]. The idea is to construct two discrete stochastic programs which provide upper and lower bounds on the optimal value of the original problem.

The results in this paper extend a stream of literature that uses probability metrics to define notions of distance for stochastic processes and allows inference about the accuracy of approximating trees, see [17, 22, 23, 40,41,42]. The authors in [17, 22] consider a distance between discrete stochastic processes and assume that both processes are defined on the same probability space. This assumption is relaxed in [41, 42], where a nested distance between value-and-information structures is developed, which can be applied to continuous processes. [24, 25] prove stability results using the sum of an \(L_r\)-distance and a filtration distance to bound the objective values of a certain class of stochastic optimization problems.

Scenario trees are discrete approximations of general processes and therefore lend themselves to the construction of a general theory of stochastic optimization. However, the requirement that every node has only one predecessor makes it hard to construct scenario trees with many stages that model the conditional distributions well, i.e., to ensure that every node has a sufficient number of successors while at the same time avoiding exponential growth of the number of nodes.

A possible way out of this dilemma is to restrict the class of stochastic optimization problems to those with a Markovian structure, where the random processes in the problem formulation are Markov processes [34] or, even more commonly, independent [39]. In this setting, the history of random variables and decisions is condensed in the state variables of the problem, and there is no need to remember the whole history of the randomness and the decisions. This paves the way for leaner discretizations, which we call scenario lattices in this paper and which are similar to the stochastic meshes used in option pricing [9]. In particular, a scenario lattice consists of the same building blocks as a scenario tree, but relaxes the requirement that every node has only one predecessor and therefore solves the problem of exponential explosion of the number of nodes as the number of stages grows (see the right panel in Fig. 1). In the same way that a scenario tree is a natural representation of a general discrete stochastic process, a scenario lattice is a natural representation of a discrete Markov process.
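The lattice structure can be sketched analogously (again an illustrative construction of our own, not the paper's code): each stage holds a fixed set of nodes, and a transition matrix between consecutive stages lets every node have several predecessors. With stage widths 1 through 5 and full connectivity, one obtains the 15 nodes and \(1\cdot 2\cdot 3\cdot 4\cdot 5 = 120\) scenarios of the lattice in the right panel of Fig. 1.

```python
import numpy as np

# A minimal sketch (our own illustration, not the paper's code) of a
# scenario lattice: a fixed set of nodes per stage plus a transition
# matrix between consecutive stages, so nodes may share predecessors.
widths = [1, 2, 3, 4, 5]                      # stage widths as in Fig. 1
values = [np.linspace(90.0, 110.0, w) for w in widths]
# transitions[t][i, j] = P(node j at stage t+1 | node i at stage t);
# uniform rows serve as a placeholder for estimated kernels.
transitions = [np.full((widths[t], widths[t + 1]), 1.0 / widths[t + 1])
               for t in range(len(widths) - 1)]

n_nodes = sum(widths)                         # 15 nodes in total
n_paths = int(np.prod(widths))                # 1*2*3*4*5 = 120 scenarios
```

Note that the number of scenarios grows multiplicatively in the stage widths while the number of nodes grows only additively, which is exactly the economy the lattice representation buys.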

Even though the abovementioned problem class is quite popular, there are hardly any theoretical results on how to construct optimal scenario lattices. Exceptions are [2, 3], who design an algorithm for the construction of scenario lattices for Brownian motions based on ideas from optimal quantization.

We mention that there is a large and well-developed theory on the approximation of Markov decision processes (MDPs) that is concerned with questions similar to those in this article. Typical formulations of MDP problems feature finite state and action spaces as well as a stationary Markov process describing the randomness, which is potentially influenced by the actions taken by the decision maker.

The setting as well as the solution methods differ from our paper in several important ways. Firstly, methods for solving MDPs are almost exclusively based on the discretization of the whole state space, leading to the well-known curse of dimensionality as the dimension of the state space grows. Consequently, methods to approximate MDPs either assume a finite or countable state and action space to start with [4, 19, 33, 46, 57, 58] or discretize the state space to be able to solve the problem.

Furthermore, much of the work on approximations of MDPs deals with infinite horizon problems relying on the fact that optimal value functions are fixed points of the Bellman operator [11, 26, 46, 49, 57, 58].

Papers that deal with continuous state spaces usually impose (Lipschitz) continuity conditions on the probability transition kernels [4, 13,14,15, 26, 36, 49], which we do not require.

The difference between our approach and the MDP literature is thus threefold: firstly, we keep the resource state continuous in order to be able to solve the problems on the nodes of the scenario lattice by linear optimization. This avoids at least part of the curse of dimensionality usually encountered in dynamic programming. Secondly, unlike most of the literature on approximation of MDPs, we deal with finite horizon problems. Lastly, we do not assume any Lipschitz continuity of the Markov kernel.

With this paper we contribute to the development of a theory for discrete approximations of Markov processes to be used in stochastic programming. In particular, we propose a class of problem-specific semi-distances for Markov processes and show that the objective value of a certain class of linear stochastic optimization problems is Lipschitz continuous with respect to these distances. This lays the foundations for constructing scenario lattices approximating general Markov processes that in turn can be used to formulate approximating optimization problems. In particular, the results in this paper can be used to control the error that results from replacing a stochastic optimization problem formulated using a complex (possibly continuous) Markov process by another, simpler problem using a compact scenario lattice instead of the original process. Furthermore, we discuss an LP formulation of our distance for discrete Markov processes, i.e., scenario lattices. We consider a multi-stage version of the well-known newsvendor problem to demonstrate how to use our results in practical problems.

Our approach is inspired by [41], who work on optimal scenario trees and general stochastic optimization problems. In contrast to [41], our approach is specialized to linear stochastic programs with a Markovian structure, which results in tighter bounds for this problem class and additionally allows for problems where the randomness does not only affect the objective function but also the feasible set. The latter makes it necessary to adopt a different technique of proof based on stability results for linear programs rather than the idea of transporting solutions from one problem to the other. While in the MDP literature there are papers that model differences in feasible sets in terms of the Hausdorff distance [26], to the best of our knowledge, we are the first to propose stability results based on transportation distances that allow for problems where the feasible set depends on randomness through inequality constraints: [17, 22, 41, 42] require the feasible set to be independent of randomness, while in [24, 25] the constraints involving random parameters are required to be equality constraints. Furthermore, we demonstrate that our distance yields tighter bounds than [41] for problems where the constraints do not depend on the random process.

This paper is structured as follows: In Sect. 2, we introduce some notation and discuss the problem setup. In Sect. 3, we define the problem-dependent lattice distance and establish some of its key properties. Section 4 contains the main results of the paper, which connect the lattice distance to optimal values of linear stochastic programming problems, while Sect. 5 is devoted to the case of discretely supported processes representable by lattices and a numerical example. Section 6 concludes the paper.

2 Problem description

We consider a class of discrete time, finite horizon, linear stochastic dynamic programming problems depending on a Markov process. The time periods in our problem are indexed by \(t\in \mathbb {T}=\{ 0, 1, \ldots , T\}\), where \(t=0\) represents the deterministic start state of the problem. We partition the state space into an environmental state \(\xi \) and a resource state S. The former is governed by a (possibly inhomogeneous) Markov process \(\xi = (\xi _0, \xi _1, \ldots , \xi _T)\), \(\xi _t: \varOmega _t \rightarrow \mathbb {R}^{n_t}\), which is assumed to be independent of the decisions. Examples are prices, demand for a product, or weather-related variables such as temperature. The resource state \(S_t\), on the other hand, describes the part of the state space that is influenced by the decision maker. Examples include inventory levels, states of machinery, and contractual obligations.

We equip the probability space \(\varOmega _t\) with the \(\sigma \)-algebra \(\sigma _t = \sigma (\xi _t)\) generated by the random variable \(\xi _t\) and define the path space \(\varOmega =\varOmega _0\times \cdots \times \varOmega _T\) and a corresponding \(\sigma \)-algebra \(\mathcal {F}= \sigma _0\otimes \cdots \otimes \sigma _T\). Note that we base our \(\sigma \)-algebras only on the random variables \(\xi _t\) and not on the whole history of random variables up to t, as is usually done when working with scenario trees. Consequently, \(\sigma _0,\sigma _1,\ldots ,\sigma _T\) is not a filtration.

Furthermore, we define the paths for which the event \(H \in \sigma _t\) occurs as

$$\begin{aligned} H_{t}^\varOmega :=\varOmega _0\times \varOmega _1\times \cdots \times \varOmega _{t-1}\times H\times \varOmega _{t+1}\times \cdots \times \varOmega _T \end{aligned}$$

and the corresponding \(\sigma \)-algebra as

$$\begin{aligned} \sigma _t^{\varOmega }=\{\varOmega _0\times \cdots \times \varOmega _{t-1}\times H\times \varOmega _{t+1}\times \cdots \times \varOmega _T: H\in \sigma _t\}=\left\{ H_t^{\varOmega }: H\in \sigma _t\right\} . \end{aligned}$$

The distribution of \(\xi \) is described by a sequence of Markov kernels and we write \(P_t^{\omega _{t-1}}\) for the distribution of \(\xi _t\) given \(\omega _{t-1} \in \varOmega _{t-1}\). The kernel as a function from \(\varOmega _{t-1}\) to the set of probability measures on \(\varOmega _t\) is \(\sigma _{t-1}\)-measurable [45]. For a given sequence of Markov kernels, we denote \(\omega =(\omega _0,\ldots ,\omega _T)\) and define the distribution on \(\varOmega \) as

$$\begin{aligned} P(H):=\int \limits _{\varOmega _0}{\ldots \int \limits _{\varOmega _T}{\mathbb {1}_H (\omega )P_T^{\omega _{T-1}}(d\omega _T)}\ldots P_1^{\omega _{0}}(d\omega _1)}P_0(d\omega _0) \end{aligned}$$

for every \(H\in \mathcal {F}\).
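For a finitely supported process, the iterated integral above reduces to nested sums: marginal distributions are obtained by propagating the initial distribution through the kernels via matrix products, and the probability of an individual path is the product of the corresponding transition probabilities. A small sketch with made-up kernels:

```python
import numpy as np

# For a finitely supported process the iterated integral defining P
# reduces to nested sums (a sketch with invented kernels): marginals are
# matrix products of the initial distribution with the kernels, and a
# path probability is the product of its transition probabilities.
p0 = np.array([1.0])                     # deterministic start, P_0
P1 = np.array([[0.3, 0.7]])              # kernel from stage 0 to stage 1
P2 = np.array([[0.5, 0.5],
               [0.2, 0.8]])              # kernel from stage 1 to stage 2

marginal2 = p0 @ P1 @ P2                 # distribution of xi_2
path_prob = p0[0] * P1[0, 1] * P2[1, 0]  # P(path (0, 1, 0)) = 0.7 * 0.2
```

Summing path probabilities over all paths ending in a fixed node recovers exactly the corresponding entry of the propagated marginal, mirroring the definition of P above.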

We consider stochastic optimization problems that can be written as

$$\begin{aligned} V_0(S_0, \xi _0) = \left\{ \begin{array}{ll} \max \limits _{x,S} &{} \mathbb {E}\left( \sum \limits _{t=0}^T c_t(\xi _t)^\top x_t \right) \\ {\text {s.t.}} &{} (x_t,S_{t+1}) \in \mathcal {X}_t(S_t, \xi _t) \;\quad \forall t \in \mathbb {T}\end{array} \right. \end{aligned}$$
(1)

with \(x = (x_0, \ldots , x_T)\), \(S=(S_1, \ldots , S_{T+1})\), \(S_t \in \mathbb {R}^{k_t}\) and feasible sets

$$\begin{aligned} \mathcal {X}_t(S_t, \xi _t) = \left\{ (x_t,S_{t+1}): \begin{array}{l} A_{1,t} x_t \le b_{1,t}(\xi _t)+C_{1,t} S_t\\ A_{2,t} x_t = S_{t+1}\\ A_{2,t} x_t \le b_{2,t+1}\\ x_t,S_{t+1} \ge 0 \end{array} \right\} , \end{aligned}$$
(2)

which we assume to be compact. Note that the data of the problem depends on the stochastic process \(\xi \) via the functions \(\xi _t \mapsto c_t(\xi _t)\) and \(\xi _t \mapsto b_{1,t}(\xi _t)\), which we assume to be continuous.

We assume that for planning in stage t, the decision maker knows \(S_t\), i.e., the system’s resource state at the beginning of the period, as well as \(\xi _t\), i.e., the realization of the Markov process in period t. Given this information, the feasible set for the decision \(x_t\) as well as the definition of \(S_{t+1}\) can be expressed using linear inequality constraints. The decisions \(x_t\) are auxiliary decision variables in stage t that are not part of the resource state. Note that in order for the problem to be feasible, \(b_{2,t+1}\ge 0\) has to hold. The combination of constraints in (2) ensures that

$$\begin{aligned} 0 \le S_{t+1} = A_{2,t} x_t \le b_{2,t+1}, \end{aligned}$$

i.e., that the feasible region for \(S_{t+1}\) is box-constrained and therefore compact.

Remark 1

Usually we would expect a state transition equation of the form \(S_{t+1} = S_t+ Ax_t\). However, since we want to make the proposed distance independent of the resource state, we formulate the state transition using \(x_t\). More specifically, we assign \(S_t\) to a subset of variables in \(x_t\) in the first constraint. The state transition is subsequently modelled in the equality constraint using those variables instead of \(S_t\). Alternatively, we could assign \(S_{t+1}\) to variables in \(x_t\) in the equality constraint and then formulate the state transition using the first inequality constraint. We refer to the example in Sect. 5 for an illustration of this principle.

Because of its recursive structure, problem (1) can be equivalently written in terms of its dynamic programming equations using value functions, i.e.,

$$\begin{aligned} V_t(S_t, \xi _t) = \left\{ \begin{array}{ll} \max \limits _{x_t,S_{t+1}} &{} c_t(\xi _t)^\top x_t + \mathbb {E}\left( V_{t+1}(S_{t+1}, \xi _{t+1}) | \xi _t \right) \\ {\text {s.t.}} &{} (x_t,S_{t+1}) \in \mathcal {X}_t(S_t, \xi _t) \end{array} \right. \quad \forall t \in \mathbb {T} \end{aligned}$$
(3)

and \(V_{T+1}(S_{T+1}, \xi _{T+1}) \equiv 0\) or, more generally, a known piecewise linear concave function. Since \(\xi \) is a Markov process and \(V_t\) as well as the decisions \((x_t, S_{t+1})\) only depend on the current state \((S_t, \xi _t)\), we call the problem a stochastic optimization problem with Markovian structure.

If we are dealing with discrete Markov processes, the expectations of the value functions \(V_t\), which are concave functions of the resource state, can be written as a minimum of finitely many affine functions. We formalize this well-known fact in the following lemma, whose proof can be found, for example, in [34, 44, 52].

Lemma 1

If \(\xi \) is finitely supported, then for every realization of \(\xi _t\), \(S_{t+1} \mapsto \mathbb {E}\left( V_{t+1}(S_{t+1}, \xi _{t+1}) | \xi _t \right) \) is a concave, polyhedral function. In particular, there are coefficients \(b_{3,t+1}^i(\xi _t) \in \mathbb {R}\) and row vectors \(C_{3,t+1}^i(\xi _t) \in \mathbb {R}^k\) for \(i=1,\ldots ,m_{t+1}(\xi _t)\) such that

$$\begin{aligned} \mathbb {E}\left( V_{t+1}(S_{t+1}, \xi _{t+1}) | \xi _t \right) = \min _{i=1,\ldots ,m_{t+1}(\xi _t)} b_{3,t+1}^i(\xi _t) + C_{3,t+1}^i(\xi _t) S_{t+1}, \end{aligned}$$

where \(m_{t+1}(\xi _t)\) is the number of affine functions required to model

$$\begin{aligned} \mathbb {E}\left( V_{t+1}(S_{t+1}, \xi _{t+1}) | \xi _t \right) . \end{aligned}$$
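Lemma 1 can be illustrated with a toy example (the cut coefficients below are invented for illustration): the expected value function is evaluated as the pointwise minimum of the affine cuts \(b_{3,t+1}^i + C_{3,t+1}^i S_{t+1}\), which is automatically concave and polyhedral.

```python
import numpy as np

# Illustration of Lemma 1 with invented cut coefficients: the expected
# value function is the pointwise minimum of affine cuts b_i + C_i * S,
# hence a concave, polyhedral function of the (scalar) resource state S.
cuts = [(0.0, 2.0), (3.0, 1.0), (8.0, 0.0)]   # pairs (b_3^i, C_3^i)

def expected_value(S):
    return min(b + C * S for b, C in cuts)

# Evaluate on a grid; midpoint concavity can be checked numerically.
S_grid = np.linspace(0.0, 10.0, 101)
vals = np.array([expected_value(S) for S in S_grid])
```

This min-of-affine representation is exactly what the constraint \(\mathbb {1}_m y \le b_3 + C_3 z\) encodes in problem (6) below, with the auxiliary variable y pushed up to the minimum of the cuts by the maximization.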

3 A distance for Markov processes

In order to introduce the concept of a distance between Markov processes, we first recall the Wasserstein or Kantorovich distance for distributions [30, 54]. Loosely speaking, the Wasserstein distance is the minimal total cost of passing from a given distribution to a desired one by moving probability mass accordingly.

Definition 1

Let \(\xi : (\varOmega , \mathcal {A}) \rightarrow \mathbb {R}^n\) and \({\tilde{\xi }}: ({\tilde{\varOmega }}, {\tilde{\mathcal {A}}}) \rightarrow \mathbb {R}^n\) be two random vectors with distributions P and \({\tilde{P}}\), respectively. The Wasserstein distance of order r \((r\ge 1)\) between \(\xi \) and \({\tilde{\xi }}\) is defined as

$$\begin{aligned} W_r(\xi ,{\tilde{\xi }}) = \left\{ \begin{array}{ll} \inf \limits _{\pi } &{} \left( \; \displaystyle \int \limits _{\varOmega \times {\tilde{\varOmega }}}{ \Vert \xi (\omega ) - {\tilde{\xi }}({\tilde{\omega }}) \Vert _r^r \; \pi \left( d\omega ,d{\tilde{\omega }}\right) }\right) ^{\frac{1}{r}}\\ \text {s.t.} &{} \pi (H\times {\tilde{\varOmega }}) = P(H)\quad \forall H \in \mathcal {A},\\ &{} \pi (\varOmega \times {\tilde{H}}) = {\tilde{P}}({\tilde{H}})\quad \forall {\tilde{H}}\in {\tilde{\mathcal {A}}}, \end{array} \right. \end{aligned}$$
(4)

where the infimum is taken over all probability measures \(\pi \) on \((\varOmega \times {\tilde{\varOmega }}, \mathcal {A}\otimes {\tilde{\mathcal {A}}})\).
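For finitely supported random vectors, (4) is a standard transportation LP and can be solved directly. The following sketch (using scipy.optimize.linprog; the helper name is ours) computes \(W_1\) between two discrete scalar distributions by imposing the two marginal constraints on the transport plan \(\pi \).

```python
import numpy as np
from scipy.optimize import linprog

# The discrete case of (4) with r = 1 as a transportation LP (a sketch;
# the helper name is ours): pi[i, j] moves mass from atom xi_i of the
# first distribution to atom xit_j of the second, subject to the two
# marginal constraints from (4).
def wasserstein1(xi, p, xit, q):
    n, m = len(xi), len(xit)
    cost = np.abs(xi[:, None] - xit[None, :]).ravel()   # |xi_i - xit_j|
    A_eq, b_eq = [], []
    for i in range(n):                    # row sums equal P
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(p[i])
    for j in range(m):                    # column sums equal P-tilde
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(q[j])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None))
    return res.fun

# Two atoms at 0 and 1 versus one atom at 0.5: all mass moves distance 0.5.
d = wasserstein1(np.array([0.0, 1.0]), np.array([0.5, 0.5]),
                 np.array([0.5]), np.array([1.0]))
```

The same LP structure underlies the lattice distance discussed in Sect. 5, with the cost vector adapted to the problem-specific semi-distance.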

Remark 2

Note that, following [41], we define \(W_r\) as a distance between two random vectors \(\xi : \varOmega \rightarrow \mathbb {R}^n\) and \({\tilde{\xi }}: {\tilde{\varOmega }} \rightarrow \mathbb {R}^n\) instead of between two distributions P and \({\tilde{P}}\). However, in order for \(W_r\) to be well defined, information on the probability measures P and \({\tilde{P}}\) on \(\varOmega \) and \({\tilde{\varOmega }}\) is required, as can be seen from (4).

In particular, when changing P and \({\tilde{P}}\) while holding \(\xi \) and \({\tilde{\xi }}\) constant, the image measure of \(\xi \) and \({\tilde{\xi }}\) and therefore also \(W_r\) changes. By a slight abuse of notation, we consider \(\xi \) and \({\tilde{\xi }}\) to contain the information on the probability spaces \((\varOmega , P)\) and \(({\tilde{\varOmega }}, {\tilde{P}})\), i.e., as mappings \(\xi : (\varOmega , P) \rightarrow \mathbb {R}^n\) and \({\tilde{\xi }}: ({\tilde{\varOmega }}, {\tilde{P}}) \rightarrow \mathbb {R}^n\) in the same way that [41] do, when defining nested distributions.

Remark 3

The above problem is bounded, and an optimal transportation measure \(\pi \) exists due to the weak compactness of the set of transportation plans (see [54], Lemma 4.4). Furthermore, according to the famous Kantorovich-Rubinstein theorem, for \(r=1\), the dual of (4) can be written as the following maximization problem

$$\begin{aligned} W_1(\xi ,{\tilde{\xi }}) = \left\{ \begin{array}{ll} \sup \limits _{f} &{} \left( \displaystyle \int {f dP} - \displaystyle \int {f d{\tilde{P}}}\right) \\ \text {s.t.} &{} {\text {Lip}}(f) \le 1, \end{array} \right. \end{aligned}$$

where Lip(f) is the Lipschitz constant of f.

Clearly, for a two-stage stochastic optimization problem

$$\begin{aligned} v(P) = \left\{ \begin{array}{lll} \inf \limits _{x} &{} f(x) + \mathbb {E}_P( Q(x,\xi ) )\\ \text {s.t.} &{} x \in \mathcal {X}\end{array} \right. , \quad Q(x,\xi )= \left\{ \begin{array}{lll} \inf \limits _{y} &{} g(y,\xi )\\ \text {s.t.} &{} y \in \mathcal {Y}(x) \end{array} \right. \end{aligned}$$

with

$$\begin{aligned} |Q\left( x, \xi \right) - Q(x, {\tilde{\xi }}) |\le L \; \Vert \xi - {\tilde{\xi }}\Vert _1 \quad \forall x\in \mathcal {X}, \end{aligned}$$
(5)

we have

$$\begin{aligned} v(P) - v({\tilde{P}}) \le \mathbb {E}_P (Q({\tilde{x}}^*, \xi )) - \mathbb {E}_{{\tilde{P}}} (Q({\tilde{x}}^*, {\tilde{\xi }})) \le L \; W_1(\xi ,{\tilde{\xi }}) \end{aligned}$$

where \({\tilde{x}}^*\) is an optimal solution of \(v({\tilde{P}})\). By symmetry, it follows that

$$\begin{aligned} |v(P) - v({\tilde{P}})| \le L \; W_1(\xi ,{\tilde{\xi }}), \end{aligned}$$

i.e., that the objective value of the two-stage stochastic program is Lipschitz continuous with respect to \(W_1\), as long as the cost-to-go function Q is Lipschitz in \(\xi \). This was first recognized in [48].
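This estimate is easy to verify numerically on a toy newsvendor instance (an illustration of ours, not from the paper): with \(Q(x,\xi ) = -p\min (x,\xi )\), the Lipschitz property (5) holds with \(L = p\), and for equally weighted scalar atoms \(W_1\) is the mean absolute difference of the sorted atoms.

```python
import numpy as np

# A toy check of |v(P) - v(P~)| <= L * W_1 (our illustration): a
# newsvendor with order cost c and sale price p, where the recourse
# Q(x, xi) = -p * min(x, xi) satisfies (5) with L = p.
c, p = 1.0, 3.0
xi  = np.array([2.0, 6.0])   # demand atoms under P  (weights 1/2, 1/2)
xit = np.array([3.0, 5.0])   # demand atoms under P~ (weights 1/2, 1/2)

def v(atoms):
    xs = np.linspace(0.0, 10.0, 1001)   # grid over the compact feasible set
    return min(c * x - p * np.mean(np.minimum(x, atoms)) for x in xs)

# For equally weighted scalar atoms, W_1 is the mean absolute difference
# of the sorted atoms.
W1 = np.mean(np.abs(np.sort(xi) - np.sort(xit)))
gap = abs(v(xi) - v(xit))
```

On this instance the actual gap in optimal values stays well below the bound \(L\, W_1\), as the inequality predicts.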

The authors in [41, 42] generalize these ideas to a multi-stage setting using the notion of nested distributions which correspond to generalized scenario trees. Based on a modified transportation problem and an assumption similar to the uniform Lipschitz property in (5), they obtain a distance with respect to which the objective value of a general multi-stage problem is Hölder continuous, see Sect. 4 for more details.

We aim for a similar result for scenario lattices and problems of the form (1). Additionally, we relax one major assumption in the abovementioned approaches, namely that randomness enters the problem only in the objective function. Observe that the argument above hinges on the fact that the set \(\mathcal {Y}\) does not depend on \(\xi \). The same restriction applies to the results on multi-period problems in [41, 42].

We begin by analyzing the following simple deterministic linear optimization problem, which has a structure similar to (3): the second-to-last inequality constraint, together with the second term y in the objective function, models the piecewise linear value function (see Lemma 1)

$$\begin{aligned} \max \limits _{x\in \mathbb {R}^n, y\in \mathbb {R},z\in \mathbb {R}^k} \left\{ c_1^\top x + y :\begin{array}{l} A_1 x \le b_1\\ A_2 x = z\\ A_2 x \le b_2\\ \mathbb {1}_m y \le b_3 + C_3 z \\ x,z\ge 0 \end{array} \right\} . \end{aligned}$$
(6)

Here, \(\mathbb {1}_m \in \mathbb {R}^m\) denotes the column vector of ones, \(C_3\) is assumed to have m rows and k columns, and the other matrices and vectors are assumed to be of fitting dimensions.
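A concrete instance of (6) can be assembled and solved with off-the-shelf LP software. The sketch below (with invented data, n = k = 1 and m = 2 cuts \(y \le b_3^i + C_3^i z\)) negates the objective because scipy.optimize.linprog minimizes.

```python
import numpy as np
from scipy.optimize import linprog

# A concrete instance of (6) with invented data (n = k = 1, m = 2 cuts
# y <= b3_i + C3_i * z), solved with scipy; the objective is negated
# because linprog minimizes. Variable order: (x, y, z).
c1, A1, b1 = 1.0, 1.0, 2.0
A2, b2 = 1.0, 3.0
b3 = np.array([5.0, 8.0])
C3 = np.array([[1.0], [0.0]])

obj = [-c1, -1.0, 0.0]
A_ub = [[A1, 0.0, 0.0],                  # A1 x <= b1
        [A2, 0.0, 0.0],                  # A2 x <= b2
        [0.0, 1.0, -C3[0, 0]],           # y <= b3_1 + C3_1 z
        [0.0, 1.0, -C3[1, 0]]]           # y <= b3_2 + C3_2 z
b_ub = [b1, b2, b3[0], b3[1]]
A_eq, b_eq = [[A2, 0.0, -1.0]], [0.0]    # A2 x = z
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None), (None, None), (0, None)])
V = -res.fun                             # optimal value of (6)
```

Re-solving for perturbed right-hand sides \(b_1\) gives a direct feel for the Lipschitz dependence of \(V(b_1)\) on \(b_1\) established in Lemma 2 below.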

First, we prove the following result, which is motivated by Hoffman’s lemma [27] and in particular its discussion in [53], Theorem 7.11 and Theorem 7.12. For what follows, we adopt the notational convention that the addition of a vector \(x = (x_1, \ldots , x_n)\) and a scalar \(y \in \mathbb {R}\) is to be interpreted pointwise, i.e., results in the vector \((x_1 + y, \ldots , x_n + y)\); similarly, inequalities of the form \(x \le y\) are interpreted pointwise as well.

Lemma 2

Let \(V(b_1)\) be the optimal value of problem (6) dependent on the parameter \(b_1\) and assume that there is a \(\kappa \ge 0\) with

$$\begin{aligned} \left\Vert C_3^\top \lambda \right\Vert _\infty \le \kappa \end{aligned}$$

for all \(\mathbb {R}^m \ni \lambda \ge 0\) with \(|\mathbb {1}_m^\top \lambda | \le 2\). Then, for any \(b_1, b'_1\) for which (6) is feasible,

$$\begin{aligned} \left| V(b_1)-V(b'_1)\right| \le \gamma (A_1,A_2,\kappa ,c_1) \left\Vert b_1-b'_1\right\Vert _1, \end{aligned}$$
(7)

where \(\gamma (A_1,A_2,\kappa ,c_1) = \max _{\lambda \in {\text {ext}}(\varGamma )} ||\lambda _2||_\infty < \infty \) and \({\text {ext}}(\varGamma )\) are the vertices of the polyhedron

$$\begin{aligned} \varGamma = \left\{ (\lambda _2,\lambda _3,\lambda _4,\lambda _6,\lambda _7): \begin{array}{l} \Vert A_1^\top \lambda _2+A_2^\top (\lambda _3+\lambda _4)-\lambda _6\Vert _\infty \le 1+\left\Vert c_1\right\Vert _\infty \\ \Vert \lambda _3-\lambda _7\Vert _\infty \le 1+\kappa \\ \lambda _2,\lambda _4,\lambda _6,\lambda _7\ge 0 \end{array} \right\} . \end{aligned}$$

Proof

We start by rewriting (6) as

$$\begin{aligned} \max \limits _{t\in \mathbb {R},x\in \mathbb {R}^n, y\in \mathbb {R},z\in \mathbb {R}^k} \left\{ t: \begin{array}{l} t -c_1^\top x-y\le 0\\ A_1 x \le b_1\\ A_2 x = z\\ A_2 x \le b_2\\ \mathbb {1}_m y \le b_3 + C_3 z\\ x,z\ge 0 \end{array} \right\} . \end{aligned}$$
(8)

Denote by \(\mathcal {M}(b_1)\) the set of feasible points of problem (8) and consider a point \(\alpha = (x,y,z,t)\in \mathcal {M}(b_1)\). Note that for any \(a\in \mathbb {R}^n\), \(||a||_1=\sup \nolimits _{||u||_\infty \le 1} u^\top a\) and define \(u = (u_1, u_2, u_3, u_4)\) with \(u_i\) corresponding to the respective entries in \(\alpha \), i.e., \(u_1 \in \mathbb {R}^n\), \(u_2 \in \mathbb {R}\) and so on. Therefore, we have

$$\begin{aligned} {\text {dist}}(\alpha ,\mathcal {M}(b'_1))&=\inf \limits _{\alpha ' \in \mathcal {M}(b'_1)} || \alpha - \alpha '||_1=\inf \limits _{\alpha ' \in \mathcal {M}(b'_1)} \sup \limits _{||u||_\infty \le 1} u^\top (\alpha -\alpha ')\\&=\sup \limits _{||u||_\infty \le 1} \inf \limits _{\alpha ' \in \mathcal {M}(b'_1)} u^\top (\alpha -\alpha '). \end{aligned}$$

By a change of variables defining \(w = (w_1, w_2, w_3, w_4) = \alpha - \alpha '\) and using linear optimization duality with \(\lambda =(\lambda _1,\lambda _2,\lambda _3,\lambda _4,\lambda _5,\lambda _6,\lambda _7)\), we have

$$\begin{aligned}&\inf \limits _{\alpha '\in \mathcal {M}(b'_1)} u^\top (\alpha -\alpha ') = \inf \limits _{w\in \tilde{\mathcal {M}}(b_1')} u^\top w = \sup \limits _{\lambda \in \tilde{\mathcal {M}}^*(u)}\lambda _1^\top (t -c_1^\top x- y)\\&\quad +\lambda _2^\top (A_1 x- b'_1)+\lambda _4^\top (A_2x-b_2)+\lambda _5^\top (\mathbb {1}_m y - b_3- C_3 z)+\lambda _6^\top (-x)+\lambda _7^\top (-z), \end{aligned}$$

where

$$\begin{aligned} \tilde{\mathcal {M}}(b'_1)= \left\{ w: \begin{array}{l}t- c_1^\top x -y\le w_4 -c_1^\top w_1 -w_2\\ A_1x-b'_1\le A_1w_1\\ A_2w_1- w_3=0\\ A_2x-b_2\le A_2w_1,\\ \mathbb {1}_m y-b_3-C_3z\le \mathbb {1}_m w_2-C_3w_3\\ -x\le -w_1\\ -z\le -w_3 \end{array} \right\} \end{aligned}$$

and

$$\begin{aligned} \tilde{\mathcal {M}}^*(u)= \left\{ \lambda : \begin{array}{l} -c_1 \lambda _1+A_1^\top \lambda _2+A_2^\top \lambda _3+A_2^\top \lambda _4-\lambda _6=u_1\\ -\lambda _1+\mathbb {1}_m^\top \lambda _5=u_2\\ \lambda _3-C_3^\top \lambda _5-\lambda _7 = u_3\\ \lambda _1 = u_4\\ \lambda _1,\lambda _2,\lambda _4,\lambda _5,\lambda _6,\lambda _7\ge 0 \end{array} \right\} . \end{aligned}$$

Consequently we obtain that

$$\begin{aligned} {\text {dist}}(\alpha ,\mathcal {M}(b'_1))&=\sup \limits _{||u||_\infty \le 1,\lambda \in \tilde{\mathcal {M}}^*(u)}\lambda _1^\top (t -c_1^\top x- y)+\lambda _2^\top (A_1 x- b'_1)\nonumber \\&\quad +\lambda _4^\top (A_2x-b_2)+\lambda _5^\top (\mathbb {1}_m y - b_3- C_3 z)+\lambda _6^\top (-x)+\lambda _7^\top (-z). \end{aligned}$$
(9)

The right-hand side of (9) has a finite optimal value (since the left-hand side of (9) is finite) and, hence, has an optimal solution \(({\hat{u}}, {\hat{\lambda }})\). It follows that

$$\begin{aligned} {\text {dist}}(\alpha ,\mathcal {M}(b'_1))&=\hat{\lambda }_1^\top ( t - c_1^\top x-y)+\hat{\lambda }_2^\top (A_1 x -b'_1)+\hat{\lambda }_4^\top (A_2x-b_2)\\&\quad +\hat{\lambda }_5^\top (\mathbb {1}_m y - b_3- C_3 z)+\hat{\lambda }_6^\top (-x)+\hat{\lambda }_7^\top (-z). \end{aligned}$$

Since \(\alpha \in \mathcal {M}(b_1)\) and \(\hat{\lambda }_1,\hat{\lambda }_2,\hat{\lambda }_4,\hat{\lambda }_5,\hat{\lambda }_6,\hat{\lambda }_7\ge 0\), we have

$$\begin{aligned} {\text {dist}}(\alpha ,\mathcal {M}(b'_1))&\le \hat{\lambda }_2^\top (A_1 x -b'_1)= \hat{\lambda }_2^\top (A_1x-b_1)+\hat{\lambda }_2^\top (b_1-b'_1)\\&\le \hat{\lambda }_2^\top (b_1-b'_1)\le ||\hat{\lambda }_2||_\infty ||b_1-b'_1||_1. \end{aligned}$$

To find a bound for \(||\hat{\lambda }_2||_\infty \), we analyze the extreme points of the feasible set

$$\begin{aligned} \varGamma '= \left\{ \lambda : \begin{array}{l} \Vert -c_1 \lambda _1+A_1^\top \lambda _2+A_2^\top \lambda _3+A_2^\top \lambda _4-\lambda _6\Vert _\infty \le 1\\ \Vert -\lambda _1+\mathbb {1}_m^\top \lambda _5\Vert _\infty \le 1\\ \Vert \lambda _3-C_3^\top \lambda _5-\lambda _7 \Vert _\infty \le 1\\ \Vert \lambda _1\Vert _\infty \le 1\\ \lambda _1,\lambda _2,\lambda _4,\lambda _5,\lambda _6,\lambda _7\ge 0 \end{array} \right\} . \end{aligned}$$

Since we know that \(||\lambda _1||_\infty \le 1\), we can replace the constraint

$$\begin{aligned} \left\Vert -\lambda _1+\mathbb {1}_m^\top \lambda _5\right\Vert _\infty \le 1 \end{aligned}$$

with

$$\begin{aligned} |\mathbb {1}_m^\top \lambda _5| \le 2 \end{aligned}$$

and the constraint

$$\begin{aligned} \left\Vert -c_1 \lambda _1+A_1^\top \lambda _2+A_2^\top \lambda _3+A_2^\top \lambda _4-\lambda _6\right\Vert _\infty \le 1 \end{aligned}$$

with

$$\begin{aligned} \left\Vert A_1^\top \lambda _2+A_2^\top (\lambda _3+\lambda _4)-\lambda _6\right\Vert _\infty \le 1+\left\Vert c_1\right\Vert _\infty . \end{aligned}$$

Then using the assumption that \(\left\Vert C_3^\top \lambda _5\right\Vert _\infty \le \kappa \) we can substitute

$$\begin{aligned} \left\Vert \lambda _3-C_3^\top \lambda _5-\lambda _7\right\Vert _\infty \le 1 \end{aligned}$$

with the constraint

$$\begin{aligned} ||\lambda _3-\lambda _7 ||_\infty \le 1+\kappa \end{aligned}$$

to increase the feasible set of problem (9), and hence increase its optimal value. Consequently,

$$\begin{aligned} \max _{\lambda \in \varGamma '} \left\Vert \lambda _2\right\Vert _\infty \le \max _{\lambda \in \varGamma } \left\Vert \lambda _2\right\Vert _\infty \end{aligned}$$

with

$$\begin{aligned} \varGamma = \left\{ (\lambda _2,\lambda _3,\lambda _4,\lambda _6,\lambda _7): \begin{array}{l} \Vert A_1^\top \lambda _2+A_2^\top (\lambda _3+\lambda _4)-\lambda _6\Vert _\infty \le 1+\left\Vert c_1\right\Vert _\infty \\ \Vert \lambda _3-\lambda _7\Vert _\infty \le 1+\kappa \\ \lambda _2,\lambda _4,\lambda _6,\lambda _7\ge 0 \end{array} \right\} . \end{aligned}$$

Note that the optimal value remains bounded when replacing \(\varGamma '\) with \(\varGamma \), since if there were a ray

$$\begin{aligned} R = \left\{ \lambda (\alpha )= \lambda ^0 + \alpha \lambda ^1: \alpha \in [0,\infty ) \right\} \end{aligned}$$

in \(\varGamma \) such that \(||\lambda _2(\alpha )||_\infty \xrightarrow {\alpha \rightarrow \infty } \infty \) and at the same time

$$\begin{aligned} \left\Vert A_1^\top \lambda _2(\alpha )+A_2^\top (\lambda _3(\alpha )+\lambda _4(\alpha ))-\lambda _6(\alpha )\right\Vert _\infty \le 1+||c_1||_\infty , \quad \forall \alpha > 0, \end{aligned}$$

this would imply that

$$\begin{aligned}&\alpha || (A_1^\top \lambda _2^1+A_2^\top (\lambda _3^1+\lambda _4^1)-\lambda _6^1)||_\infty - ||A_1^\top \lambda _2^0+A_2^\top (\lambda _3^0+\lambda _4^0)-\lambda _6^0||_\infty \\&\quad \le \left\Vert A_1^\top \lambda _2(\alpha )+A_2^\top (\lambda _3(\alpha )+\lambda _4(\alpha ))-\lambda _6(\alpha )\right\Vert _\infty \\&\quad \le 1+\left\Vert c_1\right\Vert _\infty , \quad \forall \alpha > 0 \end{aligned}$$

and therefore

$$\begin{aligned} ||A_1^\top \lambda _2^1+A_2^\top (\lambda _3^1+\lambda _4^1)-\lambda _6^1||_\infty = 0. \end{aligned}$$
(10)

In this case, we can define \(\lambda ^{1\prime } = (0, \lambda _2^1, \lambda _3^1, \lambda _4^1, 0, \lambda _6^1, \lambda _7^1)\) and a ray

$$\begin{aligned} R' = \left\{ 0 + \alpha \lambda ^{1\prime }: \alpha \in [0, \infty ) \right\} . \end{aligned}$$

Clearly, points in \(R'\) fulfill the first constraint of \(\varGamma '\) by (10), the second one since the first and the fifth component of \(\lambda ^{1\prime }\) are zero, and the third since \(\lambda _3^1 = \lambda _7^1\) has to hold for \(R\) to be in \(\varGamma \). This means that \(R'\) is contained in \(\varGamma '\), contradicting the boundedness of the original problem. Hence, the modified problem remains bounded, and therefore the maximum is attained at a vertex of the polyhedron \(\varGamma \).

The polyhedral set \(\varGamma \) has a finite number of extreme points. Hence, \(||\hat{\lambda }_2||_\infty \) can be bounded by a constant \(\gamma (A_1,A_2,\kappa ,c_1)\), which depends only on \(A_1\), \(A_2\), \(\kappa \), and \(c_1\), and

$$\begin{aligned} {\text {dist}}(\alpha ,\mathcal {M}(b'_1))\le \Vert \hat{\lambda }_2\Vert _\infty \left\Vert b_1-b'_1\right\Vert _1\le \gamma (A_1,A_2,\kappa ,c_1)\left\Vert b_1-b'_1\right\Vert _1. \end{aligned}$$
(11)

Assume that \(\alpha =(x,y,z,t)\) is an optimal solution of problem (8) with \(t=V(b_1)\). Let further \(\alpha ' \in \mathcal {M}(b'_1)\) be a point minimizing the distance \({\text {dist}}(\alpha , \mathcal {M}(b'_1))\). Then (11) implies

$$\begin{aligned} \left| t-t'\right| \le \gamma (A_1,A_2,\kappa ,c_1)\left\Vert b_1-b'_1\right\Vert _1 \end{aligned}$$

and we obtain

$$\begin{aligned} V(b_1)-V(b'_1)&\le V(b_1)- t'=t- t'\le \gamma (A_1,A_2,\kappa ,c_1)\left\Vert b_1-b'_1\right\Vert _1. \end{aligned}$$

Analogously, we get

$$\begin{aligned} V(b'_1)-V(b_1)\le \gamma (A_1,A_2,\kappa ,c_1)\left\Vert b_1-b'_1\right\Vert _1 \end{aligned}$$

and finally (7).\(\square \)

Remark 4

The matrix \(C_3\) represents the slopes of the linear functions modeling the value function for discrete distributions (see Lemma 1). When applying Lemma 2 to problem (3), \(C_3\) may differ depending on the stage t and the state of the random process \(\xi _t\). Therefore, we write \(C_{3,t}(\xi _{t-1})\) for the matrix of slopes of the linear functions used in the representation of \(\mathbb {E}\left( V_t(S_t,\xi _t)|\xi _{t-1}\right) \) and choose \(\kappa _t(\xi _{t-1})\) as follows

$$\begin{aligned} \kappa _t(\xi _{t-1})&= \max \left\{ \left\Vert C_{3,t}(\xi _{t-1})^\top \lambda \right\Vert _\infty : \lambda \ge 0, \; | \mathbb {1}_{m_t(\xi _{t-1})}^\top \lambda | \le 2 \right\} \\&=2\max \limits _{i,j}{|C^{ij}_{3,t}(\xi _{t-1})|} \end{aligned}$$

where \(C_{3,t}^{ij}\) is the entry in the \(i\)th row and \(j\)th column of the matrix \(C_{3,t}\).
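The closed form in the remark above can be checked numerically: since \(\lambda \mapsto \Vert C^\top \lambda \Vert _\infty \) is convex, its maximum over \(\{\lambda \ge 0,\; \mathbb {1}^\top \lambda \le 2\}\) is attained at a vertex of that polytope, i.e., at \(0\) or at \(2e_j\). The following Python sketch (purely illustrative and not part of the paper; the function names are ours) compares vertex enumeration with the formula \(2\max _{i,j}|C^{ij}_{3,t}|\).

```python
import numpy as np

def kappa_closed_form(C):
    # Remark 4: kappa = 2 * max_{i,j} |C_{ij}|
    return 2.0 * np.abs(C).max()

def kappa_vertex_enum(C):
    # max ||C^T lam||_inf over {lam >= 0, 1^T lam <= 2}: the objective
    # is convex, so the maximum is attained at a vertex of the polytope,
    # i.e. at lam = 0 or lam = 2 * e_j.
    m = C.shape[0]
    vertices = [np.zeros(m)] + [2.0 * np.eye(m)[j] for j in range(m)]
    return max(np.abs(C.T @ lam).max() for lam in vertices)

C = np.random.default_rng(0).normal(size=(4, 3))  # a random slope matrix
assert np.isclose(kappa_closed_form(C), kappa_vertex_enum(C))
```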

Remark 5

For continuous distributions the matrix \(C_3\) does not exist. However, in our formulation of the problem, \(S_t\mapsto \mathbb {E}\left( V_t(S_t,\xi _t)|\xi _{t-1}\right) \) is a continuous function on the compact set of permissible \(S_t\) for every \(\xi _{t-1}\); hence it is Lipschitz continuous with some Lipschitz constant \(L_t(\xi _{t-1})\) on this set. Therefore, in this case we use \(\kappa _t(\xi _{t-1})=2L_t(\xi _{t-1})\) in the definition of the distance below.

An alternative proof of the above lemma could be based on the Lipschitz continuity of the feasible set with respect to the Hausdorff metric, as shown in [47, 56]. However, the aforementioned papers do not provide any instructions for calculating the Lipschitz constant, which makes it difficult to apply their results to concrete optimization problems. Our approach does not suffer from this drawback, since it allows us to explicitly bound the variation in the objective as a function of the right hand side data of problem (6). Next, we prove a similar result bounding the objective value when the objective coefficient \(c_1\) changes.

Lemma 3

If \(V(c_1)\) is the optimal value of problem (6) as a function of the objective coefficient \(c_1\), then

$$\begin{aligned} |V(c_1)-V({\tilde{c}}_1)|\le \phi (A_1,b_1,A_2,b_2)||c_1-{\tilde{c}}_1||_1 \end{aligned}$$

with \(\phi (A_1,b_1,A_2,b_2) = \max _{x \in {\text {ext}}(\varPhi )} \; ||x||_\infty \) and

$$\begin{aligned} \varPhi = \{x\in \mathbb {R}^n: A_1 x \le b_1, A_2 x \le b_2, A_2 x\ge 0,x\ge 0\}. \end{aligned}$$

Proof

Let \((x^{*},y^{*})\) be an optimal solution to \(V(c_1)\), then we have

$$\begin{aligned} V(c_1)&= c_1^\top x^{*} + y^{*} = c_1^\top x^{*} + y^{*} - {\tilde{c}}_1^\top x^{*} + {\tilde{c}}_1^\top x^{*}\\&=(c_1-{\tilde{c}}_1)^\top x^{*} + {\tilde{c}}_1^\top x^{*} + y^{*} \le ||c_1-{\tilde{c}}_1||_1 ||x^{*}||_\infty + V({\tilde{c}}_1). \end{aligned}$$

By symmetry this implies

$$\begin{aligned} |V(c_1)-V({\tilde{c}}_1)|\le \max (||x^{*}||_\infty ,||{\tilde{x}}^{*}||_\infty )||c_1-{\tilde{c}}_1||_1 \end{aligned}$$

for an optimal solution \(({\tilde{x}}^{*},{\tilde{y}}^{*})\) to \(V({\tilde{c}}_1)\). Notice that the set of feasible points is invariant with respect to the parameter \(c_1\). Hence, \(x^*\) and \({\tilde{x}}^*\) can be selected as extreme points of the same polyhedral set

$$\begin{aligned} \varPhi = \{x\in \mathbb {R}^n: A_1 x \le b_1, A_2 x \le b_2, A_2 x\ge 0,x\ge 0\}. \end{aligned}$$

\(\varPhi \) depends on \(A_1\), \(b_1\), \(A_2\), \(b_2\) and has a finite number of vertices. Therefore \(||x^{*}||_\infty \) and \(||{\tilde{x}}^{*}||_\infty \) can be bounded by a constant \(\phi (A_1,b_1,A_2,b_2)\) for which

$$\begin{aligned} |V(c_1)-V({\tilde{c}}_1)|\le \phi (A_1,b_1,A_2,b_2)||c_1-{\tilde{c}}_1||_1 \end{aligned}$$

finishing the proof.\(\square \)
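To illustrate Lemma 3 concretely, the following Python sketch verifies the bound on a toy instance in which \(\varPhi \) is a box, so that both \(V(c_1)\) and \(\phi \) can be computed by enumerating the vertices. All data are invented for illustration and do not correspond to any concrete problem (6).

```python
import itertools

import numpy as np

# Toy feasible set: Phi = {x : 0 <= x <= u}, a box, whose extreme
# points are the 2^3 corners.  V(c) = max_x c^T x over Phi.
u = np.array([3.0, 1.0, 2.0])
vertices = [np.array(v, dtype=float) * u
            for v in itertools.product([0, 1], repeat=3)]

def V(c):
    # linear objective over a polytope: the optimum is attained at a vertex
    return max(float(c @ x) for x in vertices)

# phi = max ||x||_inf over the extreme points of Phi
phi = max(float(np.abs(x).max()) for x in vertices)

c1 = np.array([1.0, -2.0, 0.5])
c1_tilde = np.array([0.5, -1.0, 1.0])
lhs = abs(V(c1) - V(c1_tilde))
rhs = phi * float(np.abs(c1 - c1_tilde).sum())  # phi * ||c1 - c1_tilde||_1
assert lhs <= rhs  # the Lipschitz bound of Lemma 3 on this instance
```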

Remark 6

When applying the above lemma to problem (3), \(b_{1,t}(\xi _t)+C_{1,t}S_t\) corresponds to the second parameter of \(\phi \). Since we would like to avoid a dependence of our distance on the resource state, we note that \(\phi \) is increasing with respect to this parameter and replace \(b_{1,t}(\xi _t)+C_{1,t}S_t\) by \(b_{1,t}(\xi _t)+C_{1,t}^+b_{2,t}\), where \(C_{1,t}^+ = (\max (c_{i,j}, 0))_{i,j}\) and \(c_{i,j}\) are the entries of the matrix \(C_{1,t}\). Since \(S_t \ge 0\) and \(b_{2,t} \ge 0\), we thereby enlarge the polyhedron \(\varPhi \) and thus make the bound slightly looser but independent of \(S_t\).

Note that the problems in (3) fulfill the assumptions of Lemma 2 and Lemma 3. Equipped with these results, we define a transportation distance between two Markov processes. The distance is defined for a given problem of the form (1), i.e., we do not propose one distance but a whole family of problem-specific distances, which differ in the matrices and vectors used to define the constants \(\gamma \) and \(\phi \) in Lemma 2 and Lemma 3. To avoid cluttered notation, we write

$$\begin{aligned} \gamma _t(\xi _t,{\tilde{\xi }}_t) = \gamma (A_{1,t},A_{2,t},\min \{\kappa _{t+1}(\xi _t),{\tilde{\kappa }}_{t+1}({\tilde{\xi }}_t)\},c_t(\xi _t)) \end{aligned}$$

and

$$\begin{aligned} \phi _t\left( \xi _t\right) = \phi (A_{1,t},b_{1,t}(\xi _t)+C_{1,t}^+b_{2,t},A_{2,t},b_{2,t+1}). \end{aligned}$$

Furthermore, we omit the explicit dependence of \(\xi \) on \(\omega \) wherever no confusion can arise, i.e., write \(\xi \) instead of \(\xi (\omega )\).

Remark 7

Note that to ensure measurability of \(\phi _t\) and \(\gamma _t\) we have to use the universal sigma algebra, which is a natural extension of the Borel sigma algebra fitting for dynamic programming. See [6], Chapter 7 for an in-depth treatment of the subject and [5], Appendix C for a short primer.

In particular, we mention that the vertices of the polyhedra in the proofs of Lemma 2 and Lemma 3 change continuously with the right hand sides of the linear inequality constraints almost everywhere. The functions \(\gamma _t\) and \(\phi _t\) are therefore Borel measurable due to the Borel measurability of the functions \(c_t\) and \(b_t\).

Furthermore, standard arguments yield that, by Borel measurability of the Markov kernel, the functions

$$\begin{aligned} (S_t, \xi _{t-1}) \mapsto \mathbb {E}(V_{t}(S_t, \xi _t) | \xi _{t-1}) \end{aligned}$$

are lower semi-analytic. Hence, the function

$$\begin{aligned} f(S_t, S_t', \xi _{t-1}) = \frac{\mathbb {E}(V_t(S_t,\xi _t) - V_t(S_t', \xi _t)|\xi _{t-1})}{\Vert S_t - S_t'\Vert } \end{aligned}$$

is lower semi-analytic on \(\mathcal {Y}\times \mathbb {R}^{n_{t-1}}\) with \(\mathcal {Y}= \left\{ (x,y) \in \mathbb {R}^{k_t} \times \mathbb {R}^{k_t}: x\ne y \right\} \). It follows from [5], Proposition C.1 that

$$\begin{aligned} \xi _{t-1} \mapsto \kappa _t(\xi _{t-1}) = \sup _{S_t \ne S_t'} f(S_t, S_t', \xi _{t-1}) \end{aligned}$$

is lower semi-analytic and therefore universally measurable.

Consequently, we interpret all integrals as integrals with respect to the unique extensions of the measures to the universal sigma algebra (see [6]).

Definition 2

Let \(\xi \) and \({\tilde{\xi }}\) be two Markov processes defined on probability spaces \(\varOmega \) and \({\tilde{\varOmega }}\), respectively, with P and \({\tilde{P}}\) the corresponding probability measures on \(\varOmega \) and \({\tilde{\varOmega }}\). We define a lattice distance for problem (1) as

$$\begin{aligned} D_L(\xi , {\tilde{\xi }}) = \left\{ \begin{array}{ll} \inf \limits _{\pi } &{} \displaystyle \int \limits _{\varOmega \times {\tilde{\varOmega }}}{d(\xi (\omega ), {\tilde{\xi }}({\tilde{\omega }})) \pi (d\omega , d{\tilde{\omega }})}\\ \text {s.t.} &{} \pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(H_t \times {\tilde{\varOmega }}_t) = P_t^{\omega _{t-1}} (H_t), \quad (t\in \mathbb {T}\backslash \{0\})\\ &{} \pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}( \varOmega _t \times {\tilde{H}}_t) = {\tilde{P}}_t^{{\tilde{\omega }}_{t-1}} ({\tilde{H}}_t), \quad (t\in \mathbb {T}\backslash \{0\}) \end{array} \right. \end{aligned}$$
(12)

taking the infimum over all Markov probability measures \(\pi \) defined on \(\mathcal {F}\otimes {\tilde{\mathcal {F}}}\). We assume that the constraints hold for almost all \((\omega _{t-1},{\tilde{\omega }}_{t-1}) \in \varOmega _{t-1}\times {\tilde{\varOmega }}_{t-1}\), as well as all \(H_t\times {\tilde{H}}_t \in \sigma _{t} \otimes {\tilde{\sigma }}_{t}\) and define

$$\begin{aligned} d(\xi ,{\tilde{\xi }}):=\sum \limits _{t=0}^T{\min \left\{ d_t(\xi _t, {\tilde{\xi }}_t),d_t({\tilde{\xi }}_t,\xi _t) \right\} }, \end{aligned}$$
(13)

and

$$\begin{aligned} d_t(\xi _t, {\tilde{\xi }}_t) := \gamma _t(\xi _{t},{\tilde{\xi }}_t)\Vert b_{1,t}(\xi _t) - b_{1,t}({\tilde{\xi }}_t)\Vert _1 + \phi _t({\tilde{\xi }}_t) \Vert c_t\left( \xi _t\right) - c_t({\tilde{\xi }}_t)\Vert _1. \end{aligned}$$
(14)
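The stage-wise costs (13) and (14) are cheap to evaluate once \(\gamma _t\) and \(\phi _t\) are available. As a minimal illustration (the maps \(b_1\) and \(c\) and the constants below are invented; in the paper, \(\gamma _t\) and \(\phi _t\) come from Lemma 2 and Lemma 3), the symmetrized cost \(\min \{d_t(\xi _t,{\tilde{\xi }}_t), d_t({\tilde{\xi }}_t,\xi _t)\}\) could be computed as follows.

```python
import numpy as np

# Stage-wise cost (14); gamma_t and phi_t are treated as given callables.
def d_t(xi, xi_tilde, b1, c, gamma_t, phi_t):
    return (gamma_t(xi, xi_tilde) * np.abs(b1(xi) - b1(xi_tilde)).sum()
            + phi_t(xi_tilde) * np.abs(c(xi) - c(xi_tilde)).sum())

# The minimum in (13) makes the cost symmetric in its two arguments.
def theta_t(xi, xi_tilde, b1, c, gamma_t, phi_t):
    return min(d_t(xi, xi_tilde, b1, c, gamma_t, phi_t),
               d_t(xi_tilde, xi, b1, c, gamma_t, phi_t))

# Illustrative data: scalar states, affine b1 and c, constant gamma/phi.
b1 = lambda x: np.array([2.0 * x])
c = lambda x: np.array([x - 8.0, 0.0])
gamma_t = lambda x, y: 1.5
phi_t = lambda x: 10.0

val = theta_t(3.0, 5.0, b1, c, gamma_t, phi_t)
assert np.isclose(val, theta_t(5.0, 3.0, b1, c, gamma_t, phi_t))  # symmetry
```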

Remark 8

Note that similar to the convention discussed in Remark 2, we require the information on the measures \({\tilde{P}}\) and P on the underlying probability spaces to calculate the distance between the two Markov processes.

Remark 9

As will become clear in the proof of Theorem 3, both \(d_t(\xi _t, {\tilde{\xi }}_t)\) and \(d_t({\tilde{\xi }}_t, \xi _t)\) can be used to construct bounds for the difference in stochastic optimization problems. We therefore use the minimum in (13) to improve the bounds and ensure symmetry of \(D_L\).

Note that the objective function in (12) is defined in terms of the unconditional transport plan \(\pi \) between the joint distributions P and \({\tilde{P}}\) while the constraints rely on the corresponding disintegration in the form of Markov kernels \(\pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}\), which are guaranteed to exist [45] and relate to \(\pi \) via

$$\begin{aligned} \pi (H \times {\tilde{H}}) = \int \limits _{\varOmega \times {\tilde{\varOmega }}}\mathbb {1}_{H \times {\tilde{H}}} (\omega , {\tilde{\omega }})\; \ldots \pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(d\omega _t, d{\tilde{\omega }}_t) \ldots \pi _0(d\omega _0,d{\tilde{\omega }}_0) \end{aligned}$$

for \(H \times {\tilde{H}} \in \mathcal {F}\otimes {\tilde{\mathcal {F}}}\). However, since the disintegration of \(\pi \) into Markov kernels is only \(\pi \)-almost surely unique, the constraints in (12) have to be fulfilled \(\pi _{t-1}\) almost surely, where \(\pi _{t-1}\) is the unconditional marginal of \(\pi \) in stage \(t-1\).

Remark 10

Analogously to Remark 3 and [41, 42], the infimum in the above definition is attained due to the weak compactness of the set of transportation plans.

Next we show that there is always at least one feasible transport plan between any two Markov processes, i.e., there are no processes with infinite distance.

Proposition 1

The defining optimization problem of \(D_L\) is always feasible. In particular, the product measure \(\pi :=P\otimes {\tilde{P}}\) is always part of the feasible set.

Proof

Let \(A\in \sigma _{t+1}\) and \(B\in {\tilde{\sigma }}_{t+1}\) for given t and \(C\in \mathcal {F}\) and \(D\in {\tilde{\mathcal {F}}}\). We have

$$\begin{aligned}&\int \limits _{C\times D}{P_{t+1}^{\omega _t}(A)\cdot {\tilde{P}}_{t+1}^{{\tilde{\omega }}_t}(B) \pi (d\omega ,d{\tilde{\omega }})}=\int \limits _{C}{P_{t+1}^{\omega _t}(A)P(d\omega )}\cdot \int \limits _{D}{{\tilde{P}}_{t+1}^{{\tilde{\omega }}_t}(B){\tilde{P}}(d{\tilde{\omega }})} \\&\quad =P(C\cap A_{t+1}^{\varOmega })\cdot {\tilde{P}}(D\cap B_{t+1}^{{\tilde{\varOmega }}})=\pi ((C\cap A_{t+1}^{\varOmega })\times (D\cap B_{t+1}^{{\tilde{\varOmega }}}))\\&\quad =\pi ((C\times D)\cap (A_{t+1}^{\varOmega }\times B_{t+1}^{{\tilde{\varOmega }}}))=\int \limits _{C\times D}{\pi _{t+1}^{\omega _t, {\tilde{\omega }}_t}(A\times B) \pi (d\omega , d{\tilde{\omega }})} \end{aligned}$$

where the first equality follows from the properties of the product measure. Since the sets A, B, C, and D are chosen arbitrarily and \(P_{t+1}^{\omega _t}(A)\cdot {\tilde{P}}_{t+1}^{{\tilde{\omega }}_t}(B)\) as well as \(\pi _{t+1}^{\omega _t, {\tilde{\omega }}_t}(A\times B)\) are \(\sigma _{t}\otimes {\tilde{\sigma }}_{t}\) measurable, it follows that they coincide \(\pi \)-almost everywhere, i.e.,

$$\begin{aligned} P_{t+1}^{\omega _t}(A)\cdot {\tilde{P}}_{t+1}^{{\tilde{\omega }}_t}(B)=\pi _{t+1}^{\omega _t, {\tilde{\omega }}_t}(A\times B). \end{aligned}$$

For the particular choices \(A=\varOmega _{t+1}\) or \(B={\tilde{\varOmega }}_{t+1}\), we get the conditions in problem (12). \(\square \)

Next we show that \(D_L\) is a semi-metric, i.e., that it is non-negative and symmetric. Example 1 demonstrates that it does not fulfill the triangle inequality.

Proposition 2

If either \(c_t\) or \(b_{1,t}\) has a continuous inverse, \(D_L\) is a semi-metric on the equivalence classes of Markov processes that have the same distribution.

Proof

From the non-negativity of the norms and the constants \(\phi _t\) and \(\gamma _t\), we obtain that \(D_L \ge 0\). Clearly, \(d(\xi , {\tilde{\xi }})=d({\tilde{\xi }}, \xi )\). Moreover, if \(\pi ^*\) is an optimal transportation plan for \(D_L(\xi , {\tilde{\xi }})\), then \(\tilde{\pi }^*({\tilde{\omega }}, \omega )=\pi ^*(\omega , {\tilde{\omega }})\) is an optimal transportation plan for \(D_L({\tilde{\xi }},\xi )\). Therefore, \(D_L(\xi ,{\tilde{\xi }})=D_L({\tilde{\xi }},\xi )\).

To show

$$\begin{aligned} D_L(\xi , {\tilde{\xi }}) = 0 \Leftrightarrow \xi = {\tilde{\xi }} \text { in distribution}, \end{aligned}$$

we note that one direction is trivial, since \(\xi = {\tilde{\xi }}\) in distribution implies that \(D_L(\xi , {\tilde{\xi }}) = 0\).

If \(c_t\) or \(b_{1,t}\) have continuous inverses, then \(b_{1,t}(\xi _t) \ne b_{1,t}({\tilde{\xi }}_t)\) or \(c_{t}(\xi _t) \ne c_{t}({\tilde{\xi }}_t)\) in distribution for any two processes \(\xi \) and \({\tilde{\xi }}\) that do not have the same distribution.

Under these circumstances, if \(D_L(\xi , {\tilde{\xi }}) = 0\), similar to [55], we can without loss of generality assume that \(\varOmega = {\tilde{\varOmega }}\) and find a measure \(\pi \) whose image measure under \((\xi , {\tilde{\xi }})\) is almost surely concentrated on the diagonal. This implies that \(\xi \) and \({\tilde{\xi }}\) have the same distribution. \(\square \)

Example 1

In the following example we demonstrate that the triangle inequality does not hold in general for \(D_L\). To that end, consider a simple two-stage problem with the objective function in period t defined by

$$\begin{aligned} c_t(\xi _t)^\top x_t = (\xi _t-8,0)x_t \end{aligned}$$

where \(\xi _t\) is a one-dimensional random variable. The constraints in the form of (2) are described by

$$\begin{aligned} A_1 = (\begin{array}{rr} 1&0 \end{array}), \quad b_1= 0, \quad C_1 = 1, \quad A_2 = (\begin{array}{rr} 0&1 \end{array}), \quad b_2 = 10. \end{aligned}$$
Fig. 2

The three processes used in Example 1 to show that the triangle inequality of \(D_L\) does not hold

Considering the three random processes presented in Fig. 2 and using definition (12), we obtain the following values of the lattice distance between every pair of processes

$$\begin{aligned} D_L(\xi ^{(1)}, \xi ^{(2)})=59.8, \quad D_L(\xi ^{(2)},\xi ^{(3)})=19.8, \quad D_L(\xi ^{(1)},\xi ^{(3)})=88. \end{aligned}$$

We refer to Sect. 5 for a detailed description of how to calculate the distances. Hence, we have

$$\begin{aligned} D_L(\xi ^{(1)},\xi ^{(2)})+D_L(\xi ^{(2)},\xi ^{(3)})=59.8+19.8=79.6<88=D_L(\xi ^{(1)},\xi ^{(3)}) \end{aligned}$$

confirming that the triangle inequality does not hold.
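The distances above are obtained by solving transport problems of the form (12). As a minimal illustration of the stage-wise coupling constraints, the following Python sketch solves the smallest nontrivial instance, a single stage with two-point marginals, where the coupling matrix has a single free entry and the linear objective is minimized at an endpoint of its feasible interval. The marginals and costs below are invented and do not reproduce the processes of Fig. 2.

```python
import numpy as np

# One stage of the transport problem in (12) for two two-point marginals
# p = (p1, 1-p1) and q = (q1, 1-q1): the coupling pi has a single degree
# of freedom s = pi[0,0], and the linear objective attains its minimum at
# an endpoint of the feasible interval for s.
def stage_distance(p1, q1, cost):
    lo = max(0.0, p1 + q1 - 1.0)  # Frechet lower bound on pi[0,0]
    hi = min(p1, q1)              # Frechet upper bound on pi[0,0]

    def objective(s):
        pi = np.array([[s, p1 - s],
                       [q1 - s, 1.0 - p1 - q1 + s]])
        return float((pi * cost).sum())

    return min(objective(lo), objective(hi))

# theta-values theta_t(xi_i, xi_tilde_j) collected in a cost matrix
cost = np.array([[0.0, 4.0],
                 [3.0, 1.0]])
d = stage_distance(p1=0.5, q1=0.7, cost=cost)
assert d <= stage_distance(0.5, 0.7, cost + 1.0)  # monotone in the cost
```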

4 Bounding linear Markov decision problems

In this section, we show how the lattice distance \(D_L\) can be used to approximate linear stochastic programming problems with a Markovian structure as defined in (1). We start by showing that every Markov process can be approximated to an arbitrary precision by a discrete process in Theorem 1. We proceed by proving Theorem 3, in which we show that optimal values of problems in (1) are Lipschitz continuous with respect to \(D_L\). These two results in combination imply that \(D_L\) can, in theory, be used to find discrete Markov processes (scenario lattices) that, when used in optimization problems, lead to an arbitrarily close approximation of the objective values.

In order to show Theorem 1, we require the following result demonstrating that distances between any pair of Markov processes can be approximated to an arbitrary precision by distances where one of the processes is replaced by a discrete approximation. For what follows, we denote by \(\mathcal {L}^p(\varOmega \times {\tilde{\varOmega }}, \pi )\) the Lebesgue space of p-integrable functions.

Lemma 4

Let

$$\begin{aligned} \theta _t(\xi _t, {\tilde{\xi }}_t) = \min \{d_t(\xi _t, {\tilde{\xi }}_t), d_t({\tilde{\xi }}_t, \xi _t)\} \end{aligned}$$
(15)

and \(\pi \) be a transportation plan that minimizes \(D_L(\xi , {\tilde{\xi }})\) for two given processes \(\xi \) and \({\tilde{\xi }}\). If for all \(0\le t \le T\), \(\theta _t(\xi _t, {\tilde{\xi }}_t) \in \mathcal {L}^p(\varOmega _t \times {\tilde{\varOmega }}_t, \pi _t)\) for some \(p>1\) and there is an \(x_{0t} \in \mathbb {R}^{n_t}\) such that

$$\begin{aligned} \int \limits _{\varOmega \times {\tilde{\varOmega }}} \theta _t(\xi _t, x_{0t}) \; \nu (d\omega , d{\tilde{\omega }}) < \infty \end{aligned}$$
(16)

for all feasible transportation plans \(\nu \), then there is a sequence of discrete approximations \(({\tilde{\xi }}^k)_{k\in \mathbb {N}}\) such that \(D_L(\xi , {\tilde{\xi }}^k) \xrightarrow {k \rightarrow \infty } D_L(\xi , {\tilde{\xi }})\).

Note that the condition \(p>1\) ensures that the space \(\mathcal {L}^p(\varOmega \times {\tilde{\varOmega }}, \pi )\) is reflexive; this fact is used in the proof of Lemma 5 below, which in turn is required for the proof of Lemma 4.

Theorem 1

Every Markov process \(\xi \) for which (16) holds can be approximated arbitrarily well in terms of \(D_L\) by a discrete process, i.e., there are discrete Markov processes \((\xi ^k)_{k \in \mathbb {N}}\) such that \(D_L(\xi , \xi ^k) \xrightarrow {k \rightarrow \infty } 0\).

Proof

Use \(\xi \) instead of \({\tilde{\xi }}\) in Lemma 4 and note that \(\theta (\xi _t, \xi _t) = 0\) for the transportation plan that does not transport anything. Therefore the conditions of Lemma 4 are fulfilled and \(D_L(\xi , \xi ^k) \xrightarrow {k \rightarrow \infty } 0\) follows.\(\square \)

Note that this result is purely theoretical, showing that, loosely speaking, discrete Markov processes are dense with respect to \(D_L\). In particular, the crude discretization used below to show Lemma 4 does not yield efficient approximations of Markov processes.

Remark 11

We note that similar to the tree distance proposed in [41], the empirical distribution does not converge to the true distribution in \(D_L\). This follows essentially by the same argument that is given in [43] in Proposition 1. Modifications of the distance based on non-parametric estimates addressing this issue as in [43] would be in principle possible but are out of the scope of this paper.

To prove Lemma 4, we define discrete approximations \({\tilde{\xi }}^k\) of \({\tilde{\xi }}\). We start by noting that since \(\theta _t\) is continuous, it is uniformly continuous on \(B_t^k:=B_t(0, k) \times B_t(0,k)\), where \(B_t(0,k)\) is the ball of radius k around 0 in \(\mathbb {R}^{n_t}\). Now, for each k define a discrete random variable \({\tilde{\xi }}^k_t: {\tilde{\varOmega }}_t \rightarrow \mathbb {R}^{n_t}\) with atoms \({\tilde{\xi }}_{t,m}^k\) and

$$\begin{aligned} E_{t,m}^k = \left\{ {\tilde{\omega }}_t \in {\tilde{\varOmega }}_t: {\tilde{\xi }}_t^k({\tilde{\omega }}_t) = {\tilde{\xi }}_{t,m}^k \right\} \end{aligned}$$

such that

$$\begin{aligned} |\theta _t(\xi _t(\omega _t), {\tilde{\xi }}_t({\tilde{\omega }}_t)) - \theta _t(\xi _t(\omega _t), {\tilde{\xi }}^k_t({\tilde{\omega }}_t))| \le k^{-1}, \; \forall \omega _t \; \forall {\tilde{\omega }}_t: {\tilde{\xi }}_t({\tilde{\omega }}_t) \in B_t(0,k) \end{aligned}$$

and \({\tilde{\xi }}^k_t ({\tilde{\omega }}_t) = x_{0t}\) for all \({\tilde{\omega }}_t\) such that \({\tilde{\xi }}_t({\tilde{\omega }}_t) \notin B_t(0,k)\). Furthermore, define corresponding Markov kernels as

$$\begin{aligned} {\tilde{P}}_{t,k}^{{\tilde{\xi }}_{t-1,m}^k} ({\tilde{\xi }}_{t,j}^k) = \int \limits _{E_{t-1, m}^k} {\tilde{P}}_t^{{\tilde{\omega }}_{t-1}}(E_{t,j}^k) \; {\tilde{P}}_{t-1}(d{\tilde{\omega }}_{t-1}) \end{aligned}$$

and the functions

$$\begin{aligned} f_k^t(\nu ) = \int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, {\tilde{\xi }}_t^k) \; \nu _t(d\omega _t, d{\tilde{\omega }}_t), \quad f_0^t(\nu ) = \int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, {\tilde{\xi }}_t) \; \nu _t(d\omega _t,d{\tilde{\omega }}_t) \end{aligned}$$

for \(\nu _t \in \mathcal {L}^q(\varOmega _t \times {\tilde{\varOmega }}_t, \pi _t)\) with \(q^{-1} + p^{-1} = 1\), where \(\nu _t\) denotes the unconditional distribution of the transportation plan \(\nu \) in stage t.

In Lemma 5, we will show that the approximations defined above epi-converge to the objective function of the optimization problem defining the lattice distance. Epi-convergence is the weakest notion of convergence of functions that allows one to conclude that convergence of the objective functions implies convergence of the optimal solutions; it is defined as follows.

Definition 3

(epi-convergence) A sequence of functions \(f_n: X \rightarrow \mathbb {R}\) defined on a metric space X epi-converges to a function \(f: X \rightarrow \mathbb {R}\), if for each \(x \in X\)

$$\begin{aligned} \underset{n \rightarrow \infty }{\lim \inf } f_n(x_n)&\ge f(x) \quad \text {for every } x_n \rightarrow x \text { and}\\ \underset{n \rightarrow \infty }{\lim \sup } f_n(x_n)&\le f(x) \quad \text {for some } x_n \rightarrow x. \end{aligned}$$

We write \(f_n \xrightarrow {epi} f\).

We will additionally require the notion of barrelled spaces, which are exactly the spaces in which the uniform boundedness principle is valid; we will use this principle in the proof of Lemma 5.

Definition 4

(barrel, barrelled space) A closed set \(B\subseteq X\) in a real topological vector space X is a barrel, if and only if the following conditions hold

  1. B is absolutely convex, i.e.,

     $$\begin{aligned} x_1, x_2 \in B \Rightarrow \lambda _1 x_1 + \lambda _2 x_2 \in B \end{aligned}$$

     for \(|\lambda _1| + |\lambda _2| = 1\).

  2. B is absorbing, i.e., for every \(x \in X\) there is an \(\alpha > 0\) with \(x \in \alpha B\).

A locally convex vector space is called barrelled, if and only if every barrel is a neighborhood of zero.

Theorem 2

(Uniform boundedness principle, Theorem III.2.1 in [8]) Let X be a barrelled locally convex vector space and Y be an arbitrary locally convex vector space. A collection \(\mathcal {F}\) of continuous linear functions \(f: X \rightarrow Y\) is bounded pointwise, i.e.,

$$\begin{aligned} \left\{ f(x): f \in \mathcal {F} \right\} \subseteq Y \end{aligned}$$

is bounded for all \(x \in X\), if and only if the functions are equi-continuous, i.e., for every neighborhood \(V \subseteq Y\) of zero there is a neighborhood of zero \(U \subseteq X\), such that

$$\begin{aligned} f(U) \subseteq V, \quad \forall f \in \mathcal {F}. \end{aligned}$$

Lemma 5

If the integrability conditions (16) hold for \(\xi \) and \({\tilde{\xi }}\), then

$$\begin{aligned} \sum \limits _{t = 0}^T f_k^t \xrightarrow {epi} \sum \limits _{t = 0}^T f_0^t \quad \text {as } k \rightarrow \infty . \end{aligned}$$

Proof

Define

$$\begin{aligned} f_{kn}^t(\nu )&= \int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, {\tilde{\xi }}_t^k) \mathbb {1}_{B_t^n}(\xi _t,{\tilde{\xi }}_t) \; \nu _t(d\omega _t,d{\tilde{\omega }}_t) \\ f_{0n}^t(\nu )&= \int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, {\tilde{\xi }}_t) \mathbb {1}_{B_t^n}(\xi _t,{\tilde{\xi }}_t) \; \nu _t(d\omega _t, d{\tilde{\omega }}_t). \end{aligned}$$

Fix \(\epsilon > 0\). By integrability of \(\theta _t\) with respect to \(\nu _t\) and an application of the dominated convergence theorem, it follows that there is a compact set \(K_t\subset \mathbb {R}^{n_t} \times \mathbb {R}^{n_t}\) for every \(t=0,\ldots ,T\) such that

$$\begin{aligned}&\int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, x_{0t}) \mathbb {1}_{K_t^c}(\xi _t,{\tilde{\xi }}_t) \; \nu _t(d\omega _t, d{\tilde{\omega }}_t)< \epsilon , \\&\int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, {\tilde{\xi }}_t) \mathbb {1}_{K_t^c}(\xi _t,{\tilde{\xi }}_t) \; \nu _t(d\omega _t, d{\tilde{\omega }}_t) < \epsilon . \end{aligned}$$

Now choose \(k \in \mathbb {N}\) such that \(K_t \subseteq B_t^k\) and \(k>\epsilon ^{-1}\) and note that

$$\begin{aligned} | f_{kn}^t(\nu ) - f_{0n}^t(\nu ) |&\le \int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} |\theta _t(\xi _t, {\tilde{\xi }}_t^k) - \theta _t(\xi _t, {\tilde{\xi }}_t)| \mathbb {1}_{B_t^k}(\xi _t,{\tilde{\xi }}_t) \; \nu _t(d\omega _t, d{\tilde{\omega }}_t) \\&\quad + \int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, {\tilde{\xi }}_t) \mathbb {1}_{B_t^n{\setminus } B_t^k}(\xi _t,{\tilde{\xi }}_t) \; \nu _t(d\omega _t, d{\tilde{\omega }}_t) \\&\quad + \int \limits _{\varOmega _t \times {\tilde{\varOmega }}_t} \theta _t(\xi _t, x_{0t}) \mathbb {1}_{B_t^n{\setminus } B_t^k}(\xi _t,{\tilde{\xi }}_t) \; \nu _t(d\omega _t, d{\tilde{\omega }}_t) \le 3\epsilon , \end{aligned}$$

i.e., \(f^t_{kn} \rightarrow f^t_{0n}\) uniformly for all n. Note further that

$$\begin{aligned} f_0^t = \lim _n f_{0n}^t = \lim _n \lim _k f_{kn}^t = \lim _k \lim _n f_{kn}^t = \lim _k f_k^t \end{aligned}$$

where the two limits can be exchanged because of the uniform convergence shown above and the first equality follows by the monotone convergence theorem. As the convergence holds for every \(t=0,\ldots ,T\), we obtain that

$$\begin{aligned} \sum \limits _{t=0}^{T}f_0^t = \lim _k \sum \limits _{t=0}^{T}f_k^t. \end{aligned}$$

\(\mathcal {L}^p(\varOmega \times {\tilde{\varOmega }}, \pi )\) is reflexive and therefore the weak topology is barrelled (see [35], Theorem 23.22). Since \(\sum \nolimits _{t=0}^{T} f^t_k \rightarrow \sum \limits _{t=0}^{T} f^t_0\) weakly, the set \(\left\{ \sum \nolimits _{t=0}^{T}f_k^t, \sum \nolimits _{t=0}^{T}f_0^t\right\} \) is weakly bounded and therefore weakly equi-continuous by the uniform boundedness principle. Since \(\left\{ \sum \nolimits _{t=0}^{T}f^t_{kn}: n\in \mathbb {N}_0\right\} \) is equi-continuous, it is equi–lower semi-continuous and \(\sum \nolimits _{t=0}^{T}f_k^t \xrightarrow {epi} \sum \nolimits _{t=0}^{T}f_0^t\) (see [12], Theorem 2.18). \(\square \)

Proof

(Lemma 4) Because of the epi-convergence proved in Lemma 5, we obtain (see [1], Theorem 2.5)

$$\begin{aligned} D_L(\xi , {\tilde{\xi }}^k) = \min _{\nu \in \varUpsilon } \sum _{t = 0}^T f_{k}^t(\nu ) \rightarrow \min _{\nu \in \varUpsilon } \sum _{t = 0}^T f^{t}_0(\nu ) = D_L(\xi , {\tilde{\xi }}). \end{aligned}$$

Note that the feasible set \(\varUpsilon \) can w.l.o.g. be assumed to be the feasible set of \(D_L(\xi , {\tilde{\xi }})\), since for every feasible transportation plan for \(D_L(\xi , {\tilde{\xi }}^k)\) there exists a plan that is feasible for \(D_L(\xi , {\tilde{\xi }})\) yielding the same objective. \(\square \)

Next, we prove the main result of the paper, establishing that the optimal value of the stochastic optimization problem associated with \(D_L\) is Lipschitz continuous with respect to \(D_L\). We first note the following useful lemma, where \(i_t: \varOmega _t\times {\tilde{\varOmega }}_t\rightarrow \varOmega _t\) and \({\tilde{i}}_t: \varOmega _t\times {\tilde{\varOmega }}_t\rightarrow {\tilde{\varOmega }}_t\) denote the natural projections for \(t=0,\ldots ,T\).

Lemma 6

For a measurable function \(f: \varOmega _t \rightarrow \mathbb {R}\) and measures \(P_t\), \({\tilde{P}}_t\), \(\pi _t\) that fulfill the conditions in (12), we have

$$\begin{aligned} \mathbb {E}_{\pi _t} (f \circ i_t) = \mathbb {E}_{P_t}(f). \end{aligned}$$

Proof

The result clearly holds for functions \(f = \mathbb {1}_{A}\) with \(A \in \sigma _t\) and therefore, by the usual argument, also for general measurable functions.\(\square \)

Theorem 3

Let \(\xi \) and \({\tilde{\xi }}\) be Markov processes and \(V_0\) be the value function for a stochastic optimization problem of the form (1), then

$$\begin{aligned} |V_0(S_0, \xi _0) - {\tilde{V}}_0(S_0, {\tilde{\xi }}_0)| \le D_L(\xi , {\tilde{\xi }}). \end{aligned}$$

Proof

We start by choosing an arbitrary \(\epsilon >0\). If the process \(\xi \) is continuous, we define an \(\epsilon \)-exact approximation of the value functions. To this end, we note that since for every \(\xi _{t-1}\), \(S_t\mapsto \mathbb {E}( V_t(S_t,\xi _t)|\xi _{t-1})\) is a continuous function on the compact set of permissible decisions \(S_t\), it is Lipschitz continuous with Lipschitz constant \(L_t(\xi _{t-1})\). By concavity of \(S_t\mapsto \mathbb {E}( V_t(S_t,\xi _t)|\xi _{t-1})\) there exists a supergradient \(C_{3,t}^{S_t}(\xi _{t-1})\) and by continuity there is an open neighborhood \(\mathcal {U}(S_t)\) of \(S_t\) such that

$$\begin{aligned} |\mathbb {E}(V_t(S,\xi _t)|\xi _{t-1})- b_{3,t}^{S_t}(\xi _{t-1})- C_{3,t}^{S_t}(\xi _{t-1})S|\le \epsilon , \quad \forall S \in \mathcal {U}(S_t). \end{aligned}$$

with \(b_{3,t}^{S_t}(\xi _{t-1}) = \mathbb {E}( V_t(S_t,\xi _t)|\xi _{t-1})\).

By compactness, the set of feasible \(S_t\) can be covered by a finite open cover \(\mathcal {U}^{i} = \mathcal {U}(S^{i}_t)\) with corresponding \(b_{3,t}^{i}(\xi _{t-1})\) and \( C_{3,t}^{i}(\xi _{t-1})\) for \(i=1,\ldots ,m_t(\xi _{t-1})\) such that

$$\begin{aligned} |\mathbb {E}(V_t(S,\xi _t)|\xi _{t-1})-\min _{i}{ b_{3,t}^{i}(\xi _{t-1})+ C_{3,t}^{i}(\xi _{t-1})S} |\le \epsilon , \quad \forall \text { feasible } S. \end{aligned}$$

Clearly, it follows that

$$\begin{aligned} {\hat{\kappa }}_t(\xi _{t-1}) := \max \limits _{i,j} | C_{3,t}^{i,j}(\xi _{t-1})| \le 2 L_t( \xi _{t-1}) \end{aligned}$$
(17)

and therefore \({\hat{\kappa }}_t(\xi _{t-1}) \le \kappa _t(\xi _{t-1}) = 2L_t( \xi _{t-1})\). An analogous argument holds for process \({\tilde{\xi }}\). Note that if \(\xi \) or \({\tilde{\xi }}\) are discrete, we can choose \(\epsilon = 0\) and \({\hat{\kappa }}_t = \kappa _t\) or \(\hat{{\tilde{\kappa }}}_t = {\tilde{\kappa }}_t\), since the value function approximation constructed above can be made exact due to Lemma 1.

Defining

$$\begin{aligned} \delta _t^1(\xi _t, {\tilde{\xi }}_t)&= \gamma (A_{1,t},A_{2,t},\min \{{\hat{\kappa }}_{t+1}(\xi _t),\hat{{\tilde{\kappa }}}_{t+1}({\tilde{\xi }}_t)\},c_t(\xi _t))\Vert b_{1,t}(\xi _t) - b_{1,t}({\tilde{\xi }}_t)\Vert _1 \\ \delta _t^2(\xi _t, {\tilde{\xi }}_t)&= \phi (A_{1,t}, b_{1,t}({\tilde{\xi }}_t)+C_{1,t}^+b_{2,t},A_{2,t},b_{2,t+1})\Vert c_t(\xi _t)-c_t({\tilde{\xi }}_t)\Vert _1 \end{aligned}$$

as well as \(\delta _t(\xi _t, {\tilde{\xi }}_t) = \delta _t^1(\xi _t, {\tilde{\xi }}_t) +\delta _t^2(\xi _t, {\tilde{\xi }}_t)\), we note that

$$\begin{aligned} V_T(S_T, \xi _T)&= \max \left\{ c_T(\xi _T)^\top x_T : (x_T,S_{T+1})\in \mathcal {X}_T(S_T, \xi _T) \right\} \nonumber \\&\ge \max \left\{ c_T(\xi _T)^\top x_T : (x_T,S_{T+1}) \in \mathcal {X}_T(S_T, {\tilde{\xi }}_T) \right\} - \delta ^1_T(\xi _T, {\tilde{\xi }}_T) \nonumber \\&\ge \max \left\{ c_T({\tilde{\xi }}_T)^\top x_T : (x_T,S_{T+1}) \in \mathcal {X}_T(S_T, {\tilde{\xi }}_T) \right\} - \delta _T(\xi _T, {\tilde{\xi }}_T) \nonumber \\&={\tilde{V}}_T(S_T, {\tilde{\xi }}_T) - \delta _T(\xi _T,{\tilde{\xi }}_T), \end{aligned}$$
(18)

where the first inequality follows from Lemma 2 and the second from Lemma 3 and Remark 6. Note that, since \(V_{T+1} \equiv 0\), we have \(\kappa _{T+1}(\xi _T) = {\tilde{\kappa }}_{T+1}({\tilde{\xi }}_T) = 0\). Exchanging the order in which Lemma 2 and Lemma 3 are applied yields

$$\begin{aligned} {\tilde{V}}_T(S_T, {\tilde{\xi }}_T) - \delta _T({\tilde{\xi }}_T,\xi _T)\le V_T(S_T,\xi _T) \end{aligned}$$

and exchanging the roles of \(V_T\) and \({\tilde{V}}_T\) finally results in

$$\begin{aligned} |{\tilde{V}}_T(S_T, {\tilde{\xi }}_T)-V_T(S_T,\xi _T)|&\le \min \left\{ \delta _T(\xi _T,{\tilde{\xi }}_T),\delta _T({\tilde{\xi }}_T,\xi _T) \right\} =:\varDelta _T(\xi _T,{\tilde{\xi }}_T). \end{aligned}$$

Proceeding to the next stage, we assume w.l.o.g. that

$$\begin{aligned} \min \{{\hat{\kappa }}_{T}(\xi _{T-1}),\hat{{\tilde{\kappa }}}_{T}({\tilde{\xi }}_{T-1})\} = \hat{{\tilde{\kappa }}}_{T}({\tilde{\xi }}_{T-1}). \end{aligned}$$

Then for all \(\xi _{T-1}\in \varOmega _{T-1},{\tilde{\xi }}_{T-1}\in {\tilde{\varOmega }}_{T-1}\) we have

$$\begin{aligned}&V_{T-1}\left( S_{T-1}, \xi _{T-1}\right) = \left\{ \begin{array}{ll} \max &{} c_{T-1}\left( \xi _{T-1}\right) ^\top x_{T-1} + \mathbb {E}_{P_T}\left( V_T\left( S_T, \xi _T\right) \left| \xi _{T-1} \right. \right) \\ {\text {s.t.}} &{} \left( x_{T-1},S_T\right) \in \mathcal {X}_{T-1}\left( S_{T-1}, \xi _{T-1}\right) \end{array} \right. \\&\quad = \left\{ \begin{array}{ll} \max &{} c_{T-1}\left( \xi _{T-1}\right) ^\top x_{T-1} + \mathbb {E}_{\pi _T}( V_T\left( S_T, \xi _T\right) \circ i_T|\xi _{T-1}, {\tilde{\xi }}_{T-1}) \\ {\text {s.t.}} &{} \left( x_{T-1},S_T\right) \in \mathcal {X}_{T-1}\left( S_{T-1}, \xi _{T-1}\right) \end{array} \right. \\&\quad \ge \left\{ \begin{array}{ll} \max &{} c_{T-1}\left( \xi _{T-1}\right) ^\top x_{T-1} + \mathbb {E}_{\pi _T}( {\tilde{V}}_T(S_T, {\tilde{\xi }}_T) \circ {\tilde{i}}_T - \varDelta _T| \xi _{T-1}, {\tilde{\xi }}_{T-1}) \\ {\text {s.t.}} &{} (x_{T-1},S_T)\in \mathcal {X}_{T-1}\left( S_{T-1}, \xi _{T-1}\right) \end{array} \right. \\&\quad = \left\{ \begin{array}{ll} \max &{} c_{T-1}\left( \xi _{T-1}\right) ^\top x_{T-1} + \mathbb {E}_{{\tilde{P}}_T} ( {\tilde{V}}_T(S_T, {\tilde{\xi }}_T) | {\tilde{\xi }}_{T-1}) \\ {\text {s.t.}} &{} \left( x_{T-1},S_T\right) \in \mathcal {X}_{T-1}\left( S_{T-1}, \xi _{T-1}\right) \end{array} \right. \\&\qquad - \mathbb {E}_{\pi _T}(\varDelta _T| \xi _{T-1}, {\tilde{\xi }}_{T-1})\\&\quad \ge \left\{ \begin{array}{ll} \max &{} c_{T-1}\left( \xi _{T-1}\right) ^\top x_{T-1} + {\tilde{\gamma }} -\epsilon \\ {\text {s.t.}} &{} \left( x_{T-1},S_T\right) \in \mathcal {X}_{T-1}\left( S_{T-1}, \xi _{T-1}\right) \\ &{} \mathbb {1}_{m_{T}({\tilde{\xi }}_{T-1})}{\tilde{\gamma }} \le {\tilde{b}}_{3,T}({\tilde{\xi }}_{T-1}) + {\tilde{C}}_{3,T}({\tilde{\xi }}_{T-1}) S_{T} \end{array} \right. 
\\&\qquad - \mathbb {E}_{\pi _T}(\varDelta _T| \xi _{T-1}, {\tilde{\xi }}_{T-1})\\&\quad \ge \left\{ \begin{array}{ll} \max &{} c_{T-1}\left( \xi _{T-1}\right) ^\top x_{T-1} + {\tilde{\gamma }} \\ {\text {s.t.}} &{} \left( x_{T-1},S_T\right) \in \mathcal {X}_{T-1}(S_{T-1}, {\tilde{\xi }}_{T-1}) \\ &{} \mathbb {1}_{m_{T}({\tilde{\xi }}_{T-1})}{\tilde{\gamma }} \le {\tilde{b}}_{3,T}({\tilde{\xi }}_{T-1}) + {\tilde{C}}_{3,T}({\tilde{\xi }}_{T-1}) S_{T} \end{array} \right. \\&\qquad -\epsilon - \mathbb {E}_{\pi _T}(\varDelta _T|\xi _{T-1}, {\tilde{\xi }}_{T-1}) - \delta _{T-1}^1(\xi _{T-1}, {\tilde{\xi }}_{T-1}) \\&\quad \ge \left\{ \begin{array}{ll} \max &{} c_{T-1}({\tilde{\xi }}_{T-1})^\top x_{T-1} + {\tilde{\gamma }} \\ {\text {s.t.}} &{} \left( x_{T-1},S_T\right) \in \mathcal {X}_{T-1}(S_{T-1}, {\tilde{\xi }}_{T-1}) \\ &{} \mathbb {1}_{m_{T}({\tilde{\xi }}_{T-1})}{\tilde{\gamma }} \le {\tilde{b}}_{3,T}({\tilde{\xi }}_{T-1}) + {\tilde{C}}_{3,T}({\tilde{\xi }}_{T-1}) S_{T} \end{array} \right. \\&\qquad -\epsilon -\mathbb {E}_{\pi _T}(\varDelta _T| \xi _{T-1}, {\tilde{\xi }}_{T-1}) - \delta _{T-1}(\xi _{T-1}, {\tilde{\xi }}_{T-1}) \\&\quad \ge \left\{ \begin{array}{ll} \max &{} c_{T-1}({\tilde{\xi }}_{T-1})^\top x_{T-1} + {\tilde{\gamma }} +\epsilon \\ {\text {s.t.}} &{} \left( x_{T-1},S_T\right) \in \mathcal {X}_{T-1}(S_{T-1}, {\tilde{\xi }}_{T-1}) \\ &{} \mathbb {1}_{m_{T}({\tilde{\xi }}_{T-1})}{\tilde{\gamma }} \le {\tilde{b}}_{3,T}({\tilde{\xi }}_{T-1}) + {\tilde{C}}_{3,T}({\tilde{\xi }}_{T-1}) S_{T} \end{array} \right. \\&\qquad -2\epsilon -\mathbb {E}_{\pi _T}(\varDelta _T| \xi _{T-1}, {\tilde{\xi }}_{T-1}) - \delta _{T-1}(\xi _{T-1}, {\tilde{\xi }}_{T-1}) \\&\quad \ge {\tilde{V}}_{T-1}(S_{T-1}, {\tilde{\xi }}_{T-1})-2\epsilon -\mathbb {E}_{\pi _T}(\varDelta _T|\xi _{T-1}, {\tilde{\xi }}_{T-1}) - \delta _{T-1}(\xi _{T-1}, {\tilde{\xi }}_{T-1}) \end{aligned}$$

where the second equality follows from Lemma 6, the first inequality from (18), the subsequent equality again from Lemma 6, and the remaining inequalities from Lemma 2, Lemma 3, and (17). As in the derivation of (18), we can exchange the order in which Lemma 2 and Lemma 3 are applied to obtain the above inequality with \(\delta _{T-1}(\xi _{T-1}, {\tilde{\xi }}_{T-1})\) replaced by \(\delta _{T-1}({\tilde{\xi }}_{T-1}, \xi _{T-1})\). Exchanging the roles of \(V_{T-1}\) and \({\tilde{V}}_{T-1}\), we obtain

$$\begin{aligned} |{\tilde{V}}_{T-1}(S_{T-1}, {\tilde{\xi }}_{T-1})- V_{T-1}(S_{T-1}, \xi _{T-1})|&\le \mathbb {E}_{\pi _T}(\varDelta _T(\xi _{T}, {\tilde{\xi }}_{T})|\xi _{T-1}, {\tilde{\xi }}_{T-1}) \\&\quad + \varDelta _{T-1}(\xi _{T-1}, {\tilde{\xi }}_{T-1}) +2\epsilon . \end{aligned}$$

Proceeding by backward induction, and noting that the distance \(D_L\) is non-decreasing when replacing \({\hat{\kappa }}_t(\xi _{t-1})\) by \(\kappa _t(\xi _{t-1})\) and \(\hat{{\tilde{\kappa }}}_t({\tilde{\xi }}_{t-1})\) by \({\tilde{\kappa }}_t({\tilde{\xi }}_{t-1})\), we arrive at

$$\begin{aligned} |{\tilde{V}}_0(S_0, {\tilde{\xi }}_0) - V_0(S_0, \xi _0)| \le D_L(\xi , {\tilde{\xi }})+2T\epsilon \end{aligned}$$

and since \(\epsilon >0\) was arbitrary, the result follows.\(\square \)

Remark 12

Linear stochastic optimization problems without randomness in the constraints are special cases of the problems for which [41] provide stability results analogous to Theorem 3. Hence, a comparison of the two types of results for this problem class is of interest.

The authors in [41] show that for their nested distance \(D_T\), a convex set \(\mathbb {X}\), and a general objective function \(h: \mathbb {X}\times \varOmega \rightarrow \mathbb {R}\)

$$\begin{aligned} | \min _{x\in \mathbb {X}} \mathbb {E}(h(x, \xi )) - \min _{x\in \mathbb {X}} \mathbb {E}(h(x, {\tilde{\xi }}))| \le L \; D_T(\xi , {\tilde{\xi }}) \end{aligned}$$

assuming that there is a constant L such that

$$\begin{aligned} |h(x, \xi ) - h(x, {\tilde{\xi }})| \le L \; \Vert \xi - {\tilde{\xi }}\Vert _1, \quad \forall x \in \mathbb {X}, \quad \forall \left( \omega , {\tilde{\omega }}\right) \in \varOmega \times {\tilde{\varOmega }}. \end{aligned}$$

Defining \(\mathcal {G}_t = \sigma (\xi _0, \ldots , \xi _t)\), \({\tilde{\mathcal {G}}}_t = \sigma ({\tilde{\xi }}_0, \ldots , {\tilde{\xi }}_t)\) as the \(\sigma \)-algebras generated by the history of the processes, the distance \(D_T\) for arbitrary stochastic processes is defined as

$$\begin{aligned} D_T(\xi , {\tilde{\xi }}) = \left\{ \begin{array}{ll} \inf \limits _{\pi } &{} \displaystyle \int \limits _{\varOmega \times {\tilde{\varOmega }}} \Vert \xi - {\tilde{\xi }}\Vert _1 \; \pi \left( d\omega ,d{\tilde{\omega }}\right) \\ \text {s.t.} &{} \pi (A \times {\tilde{\varOmega }} | \mathcal {G}_t \otimes {\tilde{\mathcal {G}}}_t) = P\left( A\left| \mathcal {G}_t\right. \right) , \quad \forall A \in \mathcal {G}_T \\ &{} \pi (\varOmega \times {\tilde{A}} | \mathcal {G}_t \otimes {\tilde{\mathcal {G}}}_t) = {\tilde{P}}({\tilde{A}}| {\tilde{\mathcal {G}}}_t), \quad \forall {\tilde{A}} \in {\tilde{\mathcal {G}}}_T. \end{array} \right. \end{aligned}$$
(19)

In this paper, we treat the special case \(h\left( x,\xi \right) = \sum \limits _{t=0}^T c_t\left( \xi _t\right) ^\top x_t\) for which L can be calculated as \(L = \max \limits _t L_{c_t} \phi _t\) assuming that the functions \(c_t\) are Lipschitz with constants \(L_{c_t}\) and \(\phi _t\) is the function calculated in Lemma 3. Note that \(\phi _t\) is deterministic in the case of a deterministic feasible set.

It is easy to see that for two Markov processes the sets of permissible transportation plans \(\pi \) for \(D_T\) and for \(D_L\) coincide. Assume that \(\pi ^*\) is an optimal transportation plan for \(D_T\); then we have

$$\begin{aligned} D_L(\xi , {\tilde{\xi }})&\le \int \limits _{\varOmega \times {\tilde{\varOmega }}} \sum _{t=0}^T \phi _t \Vert c_t\left( \xi _t\right) - c_t({\tilde{\xi }}_t)\Vert _1 \; \pi ^* \left( d\omega , d{\tilde{\omega }}\right) \\&\le \int \limits _{\varOmega \times {\tilde{\varOmega }}} \sum _{t=0}^T \phi _t L_{c_t} \Vert \xi _t - {\tilde{\xi }}_t \Vert _1 \; \pi ^* \left( d\omega , d{\tilde{\omega }}\right) \\&\le L \int \limits _{\varOmega \times {\tilde{\varOmega }}} \Vert \xi - {\tilde{\xi }} \Vert _1 \; \pi ^* \left( d\omega , d{\tilde{\omega }}\right) = L \; D_T(\xi , {\tilde{\xi }}). \end{aligned}$$

The above calculations show that our bound is tighter than the one based on \(D_T\) for problems where both bounds are applicable, i.e., linear stochastic optimization problems with a deterministic feasible set \(\mathbb {X}\).

5 Implementation for finite scenario lattices

In this section, we focus on the computation of \(D_L\) for two finitely supported Markov processes. In Sect. 5.1, we detail all necessary steps to compute \(D_L\), provide a formal algorithm for the computation, and discuss computational issues. In Sect. 5.2, we discuss a simple example demonstrating the bounding property of \(D_L\) and provide a comparison to the tree distance of [41].

5.1 Computation of \(D_L\)

In this section, we show that, similar to the case of the classical Wasserstein distance and [41], the distance can be computed by solving a linear optimization problem to find the optimal transport plan \(\pi \).

We represent two discrete Markov processes \(\xi \) and \({\tilde{\xi }}\) by scenario lattices. To that end, at every stage \(t\in \mathbb {T}\) we define the probability spaces

$$\begin{aligned} \varOmega _t = \left\{ i \in \mathbb {N}: 1\le i \le N_t \right\} , \; {\tilde{\varOmega }}_t = \left\{ {\tilde{i}} \in \mathbb {N}: 1\le {\tilde{i}} \le M_t \right\} \end{aligned}$$

where \(N_t\) and \(M_t\) are the number of atoms of the unconditional distributions \(P_t\) and \({\tilde{P}}_t\), respectively. The conditional transition from a given state i (\({\tilde{i}}\)) at time \((t-1)\) to a state j (\({\tilde{j}}\)) at time t is described by a conditional probability \(P_t^i(j)\) and \({\tilde{P}}_t^{{\tilde{i}}}({\tilde{j}})\), respectively.

The optimal transport plan \(\pi \) is a Markov process on \(\varOmega \times {\tilde{\varOmega }}\) which is fully described by the conditional transition probabilities \(\pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(i,{\tilde{i}})\) of moving to state \((i,{\tilde{i}})\) at time t given the state \((\omega _{t-1}, {\tilde{\omega }}_{t-1})\) at time \((t-1)\).

The measure \(\pi \) can therefore be represented by a set of non-negative matrices \(\pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}} \in \mathbb {R}^{|\varOmega _t| \times |{\tilde{\varOmega }}_t|}\) with \(\pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(i,{\tilde{i}})\) the element in row i and column \({\tilde{i}}\) for \((i,{\tilde{i}}) \in \varOmega _t \times {\tilde{\varOmega }}_t\) and

$$\begin{aligned} \sum _{(i,{\tilde{i}}) \in \varOmega _t \times {\tilde{\varOmega }}_t} \pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(i,{\tilde{i}}) = 1. \end{aligned}$$

We furthermore denote by \(\pi _t\) the unconditional distributions at time t.

To be able to compute the lattice distance as a linear program, we define

$$\begin{aligned} \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(i,{\tilde{i}}) = \pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(i,{\tilde{i}}) \pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1}), \quad \forall (i, {\tilde{i}}) \in \varOmega _t \times {\tilde{\varOmega }}_t \end{aligned}$$

as well as \(\pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1})\) as decision variables. For given \((\omega _{t-1}, {\tilde{\omega }}_{t-1})\) and \((i, {\tilde{i}})\), the constraints in the definition of \(D_L\) can therefore be written as linear constraints in these variables as

$$\begin{aligned} \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(\{i\} \times {\tilde{\varOmega }}_t)&= P_t^{\omega _{t-1}}(i)\; \pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1}),\\ \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(\varOmega _t \times \{{\tilde{i}}\})&= {\tilde{P}}_t^{{\tilde{\omega }}_{t-1}}({\tilde{i}})\; \pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1}) \end{aligned}$$

where \(\tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}} (\{i\} \times {\tilde{\varOmega }}_t) = \sum _{{\tilde{\omega }}_t \in {\tilde{\varOmega }}_t} \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}} (i, {\tilde{\omega }}_t)\) and \(\tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(\varOmega _t \times \{{\tilde{i}}\})\) is defined analogously.

Hence, given two discrete processes \(\xi \) and \({\tilde{\xi }}\) as well as \(\theta _t(\xi _t(\omega _t), {\tilde{\xi }}_t({\tilde{\omega }}_t))\), the distance \(D_L(\xi , {\tilde{\xi }})\) can be computed by solving the following linear optimization problem in the variables \(\tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(i,{\tilde{i}})\) and \(\pi _t(\omega _{t}, {\tilde{\omega }}_{t})\):

$$\begin{aligned} D_L(\xi , {\tilde{\xi }}) = \left\{ \begin{array}{ll} \min &{} \displaystyle \sum \limits _{t=1}^T \displaystyle \sum \limits _{\omega _t,{\tilde{\omega }}_t} \theta _t(\xi _t(\omega _t),{\tilde{\xi }}_t ({\tilde{\omega }}_t)) \; \pi _t(\omega _t,{\tilde{\omega }}_t)\\ \text {s.t.} &{} \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(\{i\} \times {\tilde{\varOmega }}_t) = P_t^{\omega _{t-1}}(i)\pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1}) \\ &{} \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(\varOmega _t \times \{{\tilde{i}}\}) = {\tilde{P}}_t^{{\tilde{\omega }}_{t-1}}({\tilde{i}})\pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1}) \\ &{} \pi _t(\omega _t, {\tilde{\omega }}_t) = \sum _{\omega _{t-1}, {\tilde{\omega }}_{t-1}} \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}} (\omega _t, {\tilde{\omega }}_t)\\ &{} \pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1}) = \sum _{\omega _t, {\tilde{\omega }}_t} \tau _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}} (\omega _t, {\tilde{\omega }}_t) \end{array} \right. \end{aligned}$$
(20)

where the constraints hold for all \((\omega _{t-1},{\tilde{\omega }}_{t-1}) \in \varOmega _{t-1} \times {\tilde{\varOmega }}_{t-1}\) and for all \((i, {\tilde{i}}) \in \varOmega _t \times {\tilde{\varOmega }}_t\) for all \(t \in \mathbb {T}{\setminus } \left\{ 0 \right\} \) and \(\pi _{0}(1,1) := 1\). Note that the third set of constraints ensures that the unconditional probabilities in \(\pi _t\) sum to one, while the last set of constraints ensures that the probability mass of \(\pi _{t-1}(\omega _{t-1}, {\tilde{\omega }}_{t-1})\) is distributed amongst the successors of \((\omega _{t-1}, {\tilde{\omega }}_{t-1})\), i.e., that the stages are properly connected.
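To illustrate the transport structure of (20), the following sketch solves the stage-1 block in the simplest case of a single root node, where \(\pi_0(1,1)=1\) and the block reduces to a classical optimal transport problem between the conditional distributions \(P_1\) and \({\tilde{P}}_1\). The marginals and the costs \(\theta_1\) below are made-up toy values, and SciPy's `linprog` stands in for an arbitrary LP solver; this is a minimal sketch, not the full multi-stage program.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical stage-1 data: with a single root node, pi_0(1,1) = 1 and the
# stage-1 block of (20) is a classical transport problem between P_1 and P~_1.
P1 = np.array([0.5, 0.3, 0.2])        # conditional probabilities P_1^1(i)
P1t = np.array([0.6, 0.4])            # conditional probabilities P~_1^1(i~)
theta = np.array([[0.0, 2.0],
                  [1.0, 1.0],
                  [3.0, 0.5]])        # assumed costs theta_1(xi_1(i), xi~_1(i~))

N, M = theta.shape
c = theta.ravel()                     # objective: sum_ij theta(i,i~) tau(i,i~)

# Row marginals: sum over i~ of tau(i, i~) = P1(i)
A_rows = np.kron(np.eye(N), np.ones(M))
# Column marginals: sum over i of tau(i, i~) = P1t(i~)
A_cols = np.kron(np.ones(N), np.eye(M))
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([P1, P1t])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
tau = res.x.reshape(N, M)             # optimal transport plan for this stage
print(res.fun)                        # stage-1 contribution to D_L
```

In the full problem (20), one such block appears for every pair \((\omega_{t-1}, {\tilde{\omega}}_{t-1})\) and every stage, with the blocks coupled through the variables \(\pi_{t-1}\).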

Note that, since we model the conditional probabilities \(\pi _t^{\omega _{t-1}, {\tilde{\omega }}_{t-1}}(i,{\tilde{i}})\) as depending only on the state of the process at \((t-1)\), the feasible measures \(\pi \) are automatically Markov.

Since (20) is a linear program, it can be solved efficiently. However, in order to do so, the constants \(\theta _t(\xi _t, {\tilde{\xi }}_t)\) have to be computed first. Since the \(\theta _t(\xi _t, {\tilde{\xi }}_t)\) depend only on the values of the two processes \(\xi \) and \({\tilde{\xi }}\) and are thus independent of the probabilities \(\pi \), they can be computed offline.

In order to compute \(\theta _t(\xi _t, {\tilde{\xi }}_t)\), the constants \(\gamma _t(\xi _t(\omega _t), {\tilde{\xi }}_t({\tilde{\omega }}_t))\) as well as \(\phi _t(\xi _t(\omega _t))\) and \(\phi _t({\tilde{\xi }}_t({\tilde{\omega }}_t))\) are required. These quantities are maxima of \(||\cdot ||_\infty \) over the vertices of the polyhedra \(\varGamma _t\) and \(\varPhi _t\) defined in Lemma 2 and Lemma 3, which depend on the constant problem data \(A_{1,t}\), \(A_{2,t}\), \(b_{2,t+1}\), \(C_{1,t}\) as well as the random data \(b_{1,t}\), \(c_t\), and \(\kappa _t\).

Candidates \(x^+\) for vertices of a polyhedron \(\varLambda = \left\{ x \in \mathbb {R}^k: Ax \le b \right\} \) with \(A \in \mathbb {R}^{m \times k}\) can be found by choosing a subset \(I \subseteq \left\{ 1, \ldots , m \right\} \) with \(|I| = k\) and solving \(A^Ix^+=b^I\), where \(A^I \in \mathbb {R}^{k \times k}\) and \(b^I\) are the submatrices of A and b with rows \(i \in I\), respectively. \(x^+\) is a vertex of \(\varLambda \) if it fulfills \(Ax^+\le b\).
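The enumeration just described can be sketched as follows; the example polyhedron (a unit square) and the tolerance handling are illustrative choices, not part of the paper's algorithm.

```python
import itertools
import numpy as np

def vertex_max_inf_norm(A, b, tol=1e-9):
    """Enumerate vertices of {x : A x <= b} by solving all k x k subsystems
    A^I x = b^I and return the maximal infinity norm over the vertices found.
    Illustrative sketch; exponential in the number of constraints m."""
    m, k = A.shape
    best = -np.inf
    for I in itertools.combinations(range(m), k):
        AI, bI = A[list(I), :], b[list(I)]
        if abs(np.linalg.det(AI)) < tol:
            continue                      # selected rows not linearly independent
        x = np.linalg.solve(AI, bI)       # candidate vertex
        if np.all(A @ x <= b + tol):      # keep only feasible candidates
            best = max(best, np.abs(x).max())
    return best

# Unit square {0 <= x <= 1}^2: the vertices are the four corners.
A = np.array([[1.0, 0], [0, 1], [-1, 0], [0, -1]])
b = np.array([1.0, 1.0, 0.0, 0.0])
print(vertex_max_inf_norm(A, b))          # -> 1.0
```

For the small nodal problems described below, this brute-force enumeration is adequate; for larger systems one would switch to dedicated vertex enumeration methods.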

The number of vertices grows exponentially with the number of constraints of the linear problems on the nodes. However, the problems that are solved using the decomposition approaches described in Sect. 2 usually have a large number of stages but rather small nodal problems. Furthermore, in most problems the left hand side data \(A_{1,t}\), \(A_{2,t}\), \(b_{2,t+1}\), \(C_{1,t}\) does not vary with the stage or the randomness, and some of the right hand sides remain constant as well. Hence, one can precompute the value of \(||x^+||_\infty \) for all vertices where the right hand side does not change and store the factorization of the left hand side matrix for all vertices where the right hand side is random in order to efficiently compute \(x^+\) for varying b. This, together with the limited problem size on the nodes, makes the computation of \(\gamma _t\) and \(\phi _t\) relatively cheap even for larger scenario lattices.

[Algorithm 1: Computation of \(D_L\)]

We provide pseudocode for the calculation of \(D_L\) in Algorithm 1. The algorithm loops over the stages t of the problem and iteratively computes the constants \(\gamma _t\) and \(\phi _t\).

In line 2, we write the polyhedra defined in Lemma 2 and Lemma 3 as systems of linear inequalities with single matrices and vectors \(A_\gamma \), \(b_\gamma \), \(A_\phi \), and \(b_\phi \). This is merely for notational convenience in the rest of the algorithm. We assume that in total \(N_{\gamma ,t}\) and \(N_{\phi ,t}\) inequalities define \(\varGamma _t\) and \(\varPhi _t\), respectively.

We define \(K_{\gamma ,t}\) and \(K_{\phi ,t}\) as the dimensions of \(\varGamma _t\) and \(\varPhi _t\). In lines 4–12 and 13–21, we iterate over all sets of \(K_{\gamma ,t}\) and \(K_{\phi ,t}\) linear inequalities defining the polyhedra. The solution of the corresponding system of linear equalities defines a vertex if it fulfills all remaining constraints. We evaluate the norm of those vertices that do not depend on random data and keep track of the maximum, while we store the LU factorizations of the systems whose right hand sides are random. Note that for the computation of \(\gamma _t\), we only require the norm of the components that correspond to \(\lambda _2\) in Lemma 2, which we denote by \(x_{\gamma ,2}^I\) for a specific set of inequality constraints I. We also remark that for \(\gamma _t\) all vertices except the origin depend on the randomness, unless either \(\kappa _t\) is independent of the randomness (stagewise independence) or the objective is deterministic.

In lines 23–29, we compute \(\phi _t(\xi _t)\) by solving the linear systems \(\mathcal {I}_\phi \) for all possible realizations of \(\xi _t\) using the stored LU factorizations. In lines 37–44, we compute \(\phi _t({\tilde{\xi }}_t)\) for the realizations of \({\tilde{\xi }}_t\) and additionally compute \(\gamma _t(\xi _t, {\tilde{\xi }}_t)\).

Given these quantities, we easily obtain \(\theta _t(\xi _t, {\tilde{\xi }}_t)\) in line 43 and finally \(D_L\) in line 47. Note that if either \(\varGamma _t\) or \(\varPhi _t\) is independent of the stage, or at least identical in some stages, the algorithm can be modified by changing the outer loop in line 1 in an obvious way to avoid repetitive computations.

5.2 The flowergirl problem

As a demonstration, we consider a multi-stage extension of the classical newsvendor problem: the problem of a flowergirl who sells flowers, faces a random demand and a random sales price, and has the possibility to store excess flowers for later periods. The problem has \((T+1)\) stages, with stage \(t=0\) being the deterministic start state. In every stage t, we start with the inventory level \(S_t\), limited by the storage capacity \({\bar{S}}_t\). After the demand \(\xi ^1_t\) and the price \(\xi ^2_t\) become known in stage t, the flowergirl sells \(x_{t}^2\) flowers and places an order \(x_{t}^1\) for flowers to be delivered from a wholesaler at a price p on the next day. If the available quantity exceeds the demand, the flowergirl adds the excess to her inventory for sale in \((t+1)\). Due to the perishable nature of flowers, a fraction \(k \in (0,1)\) of the stored flowers spoils by the next day. The order in stage t has to be placed without knowing the random demand \(\xi ^1_{t+1}\). On the next day, the flowers can be sold at a market price \(\xi _{t+1}^2\) not known on day t. The flowergirl starts in period \(t=0\) with no stock and no demand, i.e., \(S_0 = 0\) and \(\xi ^1_0 = 0\).

The decisions in every stage consist of the number of flowers to order for the next stage \(x_{t}^1\), the number of flowers to sell \(x_t^2\), and the inventory level of the next day \(x_t^3\). Note that, as described in Remark 1, the environmental state variable \(S_{t+1}\) is represented by \(x_t^3\) so as to make the feasible set fit (2).

The storage equation consequently is

$$\begin{aligned} x_t^3 = (1-k) \cdot (S_t - x_t^2) + x_t^1, \quad \forall t = 0, \ldots , T. \end{aligned}$$

The sales decisions are constrained by the random demand as well as the storage level, i.e.,

$$\begin{aligned} x_t^2 \le \min \left\{ \xi ^1_t, S_t\right\} , \quad \forall t = 0, \ldots , T, \quad a.s. \end{aligned}$$

Furthermore, we impose the following constraints

$$\begin{aligned} x_t^3 = S_{t+1}, \quad x_t^3 \le {\bar{S}}_{t+1},\quad \; x_t^1, \, x_t^2, \; x_{t}^3, S_{t+1} \ge 0, \quad \forall t = 0, \ldots , T. \end{aligned}$$

The flowergirl maximizes her expected profit, which is given by

$$\begin{aligned} \mathbb {E}\left( \sum _{t=0}^T \xi _t^2 x_t^2 -p x_t^1 \right) . \end{aligned}$$

For our numerical example, we consider the three-stage version of the problem, i.e., the problem with \(T = 2\). Further, we choose \(k = 0.1\), \(p=5\), and the vector of storage capacities \({\bar{S}} = ({\bar{S}}_0, {\bar{S}}_1, {\bar{S}}_2, {\bar{S}}_3 ) =(0,11,9,0)\). To rewrite the problem in the form of (1), we define \(c_t(\xi _t) = (-p, \xi _t^2,0)^\top \) and the vectors and matrices appearing in the constraints as

$$\begin{aligned}&A_{1,t} = \left( \begin{array}{ccc} 0 &{} 1 &{} 0 \\ 0 &{} 1 &{} 0\\ -1 &{} (1-k) &{} 1\\ 1 &{} -(1-k) &{} -1 \end{array}\right) ,\quad b_{1,t} (\xi _t) = \left( \begin{array}{c} \xi _t^1 \\ 0\\ 0\\ 0 \end{array}\right) , \quad C_{1,t} =\left( \begin{array}{c} 0\\ 1\\ (1-k) \\ -(1-k) \end{array}\right) ,\\&A_{2,t} = \left( \begin{array}{ccc} 0&0&1 \end{array}\right) ,\quad b_{2,t} = (\begin{array}{c} {\bar{S}}_t \end{array}). \end{aligned}$$

As the function \(c_t\) and the matrices \(A_{1,t}\), \(A_{2,t}\) and \(C_{1,t}\) have the same form for all stages, we can ignore the index t.
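As a sanity check on this encoding, the following sketch builds \(A_1\), \(b_1\), and \(C_1\) for made-up values of \(k\), \(S_t\), and the demand, and verifies that a decision satisfying the storage equation is feasible for \(A_1 x_t \le b_{1,t}(\xi_t) + C_1 S_t\); rows 3 and 4 encode the storage equation as a pair of opposite inequalities and therefore must hold with equality.

```python
import numpy as np

# Flowergirl constraint data for one stage; k, S_t, demand are example values.
k, S_t, demand = 0.1, 4.0, 6.0
A1 = np.array([[0, 1, 0],
               [0, 1, 0],
               [-1, 1 - k, 1],
               [1, -(1 - k), -1]], dtype=float)
b1 = np.array([demand, 0.0, 0.0, 0.0])
C1 = np.array([0.0, 1.0, 1 - k, -(1 - k)])

# Candidate decision: order x1, sell x2 <= min(demand, S_t), and store
# x3 according to the storage equation x3 = (1-k)(S_t - x2) + x1.
x1, x2 = 3.0, 4.0
x3 = (1 - k) * (S_t - x2) + x1
x = np.array([x1, x2, x3])

lhs = A1 @ x
rhs = b1 + C1 * S_t
print(np.all(lhs <= rhs + 1e-12))   # feasibility of the candidate decision
```

Rows 1 and 2 bound the sales by the demand and the inventory, while the equality in rows 3 and 4 confirms that the storage dynamics are captured by purely linear inequalities, as required by the form (2).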

Next, we find the constants \(\kappa _t(\xi _{t-1})\), which depend on the slopes of the value functions. Note that, in every period t, the flowergirl can either sell all flowers at the price \(\xi _t^2\) or hold them for sale in future periods, in which case part of the flowers will perish. In the last period, \(\kappa _{T+1}(\xi _T) = 0\), since the flowers are worthless at the end of the planning horizon, while in period \((T-1)\), stored flowers can still be sold in period T, i.e., \(\kappa _T(\xi _{T-1}) = 2\mathbb {E}(\xi _{T}^2|\xi _{T-1})\). In periods \(t < (T-1)\), flowers can either be sold in period \(t+1\) or carried over to period \(t+2\), in which case they have to be evaluated using the respective value function. This yields the approximation

$$\begin{aligned} \kappa _t(\xi _{t-1}) = 2\mathbb {E}( \max \{\xi _{t}^2,(1-k) \kappa _{t+1}(\xi _t)\}|\xi _{t-1}). \end{aligned}$$

This logic can be recursively applied to find all the constants \(\kappa _t\).
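The backward recursion for the constants \(\kappa_t\) can be sketched on a small lattice as follows; the prices and transition matrices below are hypothetical toy data, not the processes from Fig. 3.

```python
import numpy as np

# Hypothetical price lattice: prices[t][i] is xi_t^2 in node i at stage t,
# and P[t] is the transition matrix from stage t-1 nodes to stage t nodes.
k = 0.1
prices = [np.array([0.0]),             # t = 0 (deterministic start)
          np.array([4.0, 6.0]),        # t = 1
          np.array([3.0, 5.0, 7.0])]   # t = 2 (= T)
P = [None,
     np.array([[0.5, 0.5]]),                         # stage 0 -> 1
     np.array([[0.3, 0.4, 0.3], [0.2, 0.3, 0.5]])]   # stage 1 -> 2

T = 2
# kappa_t(xi_{t-1}) has one value per node at stage t-1; kappa_{T+1} = 0.
kappa_next = np.zeros(len(prices[T]))                # kappa_{T+1}(xi_T) = 0
for t in range(T, 0, -1):
    # kappa_t = 2 E( max{xi_t^2, (1-k) kappa_{t+1}(xi_t)} | xi_{t-1} )
    inner = np.maximum(prices[t], (1 - k) * kappa_next)
    kappa_next = 2.0 * P[t] @ inner                  # conditional expectation
print(kappa_next)                                    # kappa_1(xi_0) per root node
```

At \(t=T\) the recursion reproduces \(\kappa_T(\xi_{T-1}) = 2\mathbb{E}(\xi_T^2|\xi_{T-1})\), since \(\kappa_{T+1}=0\) makes the maximum collapse to the price.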

Putting everything together, the problem can be formulated as

$$\begin{aligned} V_t\left( S_{t}, \xi _t\right)&= \left\{ \begin{array}{ll} \max \limits _{x_t,S_{t+1}} &{} c(\xi _t)^\top x_t + \mathbb {E}\left( V_{t+1}\left( S_{t+1}, \xi _{t+1}\right) \left| \xi _t \right. \right) \\ {\text {s.t.}} &{} A_{1} x_t \le b_{1,t}(\xi _t)+C_{1}S_t\\ &{} A_{2}x_t = S_{t+1}\\ &{} A_{2}x_t \le b_{2,t+1}\\ &{} x_t,S_{t+1} \ge 0 \end{array} \right. \\&= \left\{ \begin{array}{ll} \max \limits _{x_t,S_{t+1}} &{} c(\xi _t)^\top x_t + \gamma \\ {\text {s.t.}} &{} A_{1} x_t \le b_{1,t}(\xi _t)+C_{1}S_t\\ &{} A_{2}x_t = S_{t+1}\\ &{} A_{2}x_t \le b_{2,t+1}\\ &{} \mathbb {1}_{m_{t+1}(\xi _t)}\gamma \le b_{3,t+1}(\xi _{t})+C_{3,t+1}(\xi _{t})S_{t+1}\\ &{} x_t,S_{t+1} \ge 0. \end{array} \right. \end{aligned}$$

We consider the two Markov processes \(\xi \) and \({\tilde{\xi }}\) presented in Fig. 3a, b with transition probabilities

$$\begin{aligned}&P_1 = \left( \begin{array}{ccc} 0.5523&0.0871&0.3605 \end{array}\right) , P_2 = \left( \begin{array}{cccc} 0.5489 &{} 0.0005 &{} 0.2901 &{} 0.1606\\ 0.4576 &{} 0.0004 &{} 0.2067 &{} 0.3353\\ 0.3953 &{} 0.0403 &{} 0.2681 &{} 0.2962 \end{array}\right) ,\\&{\tilde{P}}_1 = \left( \begin{array}{cc} 0.6374&0.3626 \end{array}\right) , {\tilde{P}}_2 = \left( \begin{array}{ccc} 0.5529 &{} 0.2855 &{} 0.1626\\ 0.4364 &{} 0.2838 &{} 0.2797 \end{array}\right) . \end{aligned}$$
Fig. 3: Depiction of the two Markov processes used for the numerical calculation of the flowergirl example

To bound the difference in the optimal values, we calculate \(D_L(\xi , {\tilde{\xi }})\). As detailed in Algorithm 1, the constant \(\gamma _t(\xi _t,{\tilde{\xi }}_t)\) can be obtained by maximizing \(\left\Vert \lambda _2\right\Vert _\infty \) over the extreme points of the polyhedron

$$\begin{aligned} \varGamma = \left\{ (\lambda _2,\lambda _3,\lambda _4,\lambda _6,\lambda _7): \begin{array}{l} \left\Vert A_1^\top \lambda _2+A_2^\top (\lambda _3+\lambda _4)-\lambda _6\right\Vert _\infty \le 1+\left\Vert c_t(\xi _t)\right\Vert _\infty \\ \left\Vert \lambda _3-\lambda _7 \right\Vert _\infty \le 1+\min \left\{ \kappa _{t+1}(\xi _t),{\tilde{\kappa }}_{t+1}({\tilde{\xi }}_t) \right\} \\ \lambda _2,\lambda _4,\lambda _6,\lambda _7\ge 0 \end{array} \right\} . \end{aligned}$$

Similarly, the constant \(\phi _t(\xi _t)=\phi (A_{1},b_{1,t}(\xi _t)+C_{1}^+b_{2,t},A_{2},b_{2,t+1})\) can be found by maximizing \(\left\Vert x\right\Vert _\infty \) over the extreme points of the polyhedron

$$\begin{aligned} \varPhi = \left\{ x: A_{1}x\le b_{1,t}(\xi _t)+C_{1}^+b_{2,t},A_2x\le b_{2,t+1}, A_{2}x\ge 0, x\ge 0 \right\} . \end{aligned}$$

Having calculated \(\gamma _t\) and \(\phi _t\), we proceed by computing \(\theta _t(\xi _t, {\tilde{\xi }}_t)\) using (13) and (14). Then we can determine the joint distribution \(\pi \) that minimizes the distance between processes by solving the linear optimization problem (20).

The resulting optimal transportation plan yields a distance of \(D_L(\xi , {\tilde{\xi }}) = 7.03\). The optimal value of our problem for \(\xi \) is 126.59, while for \({\tilde{\xi }}\) it equals 129.16, resulting in a difference of 2.58. Hence, our bound overestimates the difference in the optimal values by 4.45.

Lastly, we compare the performance of \(D_L\) to the performance of the nested distance defined in [41, 42]. For this calculation, it is necessary to simplify the problem to make the constraints independent of the randomness. To this end, we fix the demand at each stage. In particular, we assume that the demand is equal to 0, 11, and 9 in stages \(t=0\), 1, and 2, respectively. For this simplified setup, we obtain \(D_L(\xi , {\tilde{\xi }}) = 3.94\) and \(D_T(\xi , {\tilde{\xi }}) = 4.31\), demonstrating that, for our problem, \(D_L\) provides a tighter bound than \(D_T\) (see also Remark 12).

6 Conclusions

Stochastic optimization problems with a Markovian structure strike a good balance between the complexity of the underlying randomness and the expressiveness of the corresponding problem class. In particular, since scenario lattices offer leaner discretization structures than scenario trees, the unfavorable computational properties of general stochastic optimization problems can be, in part, mitigated.

In this paper, we define a family of problem-dependent semi-distances for linear stochastic optimization problems with Markovian structure that can be used to bound differences in objective values. We also show that every Markov process can, in theory, be approximated to arbitrary precision in terms of the defined distances. Therefore, the concepts in this paper can be used to find arbitrarily precise discrete approximations of complicated problems, possibly with continuous state spaces.

Furthermore, we contribute to the literature on transportation distances with an approach that is capable of dealing with randomness in the constraints. This necessitates a different proof technique, since the transport of solutions between problems becomes impossible in this framework. We therefore base our results on stability results for linear programs.

In this paper, we laid the foundations for a theory-driven method to generate scenario lattices. Further research is required to find computationally efficient ways to do so and to evaluate the outcomes on real-world problems.