A stability result for linear Markovian stochastic optimization problems

In this paper, we propose a semi-metric for Markov processes that allows us to bound optimal values of linear Markovian stochastic optimization problems. Similar to existing notions of distance for general stochastic processes, our distance is based on transportation metrics. As opposed to the extant literature, the proposed distance is problem specific, i.e., dependent on the data of the problem whose objective value we want to bound. As a result, we are able to consider problems with randomness in the constraints as well as in the objective function and therefore relax an assumption in the extant literature. We derive several properties of the proposed semi-metric and demonstrate its use in a stylized numerical example.


Introduction
Stochastic optimization is concerned with the solution of optimization problems that involve random quantities as data. Consequently, the decisions x(ξ) depend on values of a random process ξ, making stochastic optimization a problem in function spaces. Mirroring the situation in deterministic optimization, only few stochastic optimization problems lend themselves to analytical treatment and allow for closed-form solutions. In the following, we therefore focus on discrete time problems that are solved numerically.
The theory of stochastic optimization as well as the development of solution methods made great advances in the last decades. In particular, there exists a sound theory for two-stage stochastic optimization problems, i.e., problems with only one decision stage in the future (see [7,53] for an overview). Consequently, two-stage stochastic optimization is nowadays routinely applied by researchers and industry practitioners alike. State-of-the-art methods are based on discrete representations of the, possibly continuous, source of randomness in the form of a finite set of samples or scenarios. This can either be achieved by sample average approaches (see [53] for an introduction) or by explicitly choosing representative scenarios. In this paper, we will focus on the latter.
Despite the abovementioned successes, it became clear quite early that the effort required to solve stochastic optimization problems does not scale well with the problem's size. More specifically, it has been shown that stochastic optimization problems exhibit a non-polynomial increase in complexity as the number of random variables increases [21]. The problem underlying these difficulties is the numerical evaluation of high dimensional integrals, which is in turn related to the problem of optimal discretization of probability distributions.
The situation is even more complicated for multi-stage problems, where we deal with random processes resulting in additional random variables in every stage and the issue of finding discretizations for conditional distributions. Consequently, it was observed in [50,51] that solving multi-stage stochastic optimization problems is often practically intractable.
Notwithstanding these problems, there is a rich literature on multi-stage stochastic optimization. The majority of authors use scenario trees as representation of discrete stochastic processes (see the left panel in Fig. 1 for an illustration). In a scenario tree, nodes represent possible states of the world and are assigned to a point in time. All nodes at the same point in time are usually depicted at the same level of the tree. Possible transitions between nodes in consecutive stages are represented by probability weighted arcs connecting the nodes. Consequently, the collection of transition probabilities between a node and the nodes of the next stage connected by arcs describes the distribution of the random process conditional on that node. Note that the requirement that the resulting graph is a tree implies that every node is allowed to have exactly one predecessor in the previous stage.
There are various ways to construct scenario trees for multi-stage stochastic programs (see [16,31] for surveys). In [28,29], a recursive application of moment matching is presented. The approach is easy to understand and apply, but suffers from an exponential explosion of nodes in the resulting trees as the number of stages increases. Furthermore, the method offers no theoretical insight regarding the discretization error made when replacing the original process with the generated tree.
The papers [37,38] propose a method for the construction of scenario trees that is based on integration quadratures and ensures that the approximated problems based on scenario trees epi-converge to the true infinite-dimensional problem, yielding convergence in optimal value as well as in optimal decisions. However, the results are asymptotic in nature, i.e., the approximation scheme does not offer guarantees for any given discrete approximation.
Another approach is based on the principle of bound-based constructions, see [10,18,20,32]. The idea is to construct two discrete stochastic programs which provide upper and lower bounds on the optimal value of the original problem. The results in this paper extend a stream of literature that uses probability metrics to define notions of distance for stochastic processes and allows inference about the accuracy of approximating trees, see [17,22,23,40,41,42]. The authors in [17,22] consider a distance between discrete stochastic processes and assume that both processes are defined on the same probability space. This assumption is relaxed in [41,42] where a nested distance between value-and-information structures is developed, which can be applied to continuous processes. [24,25] prove stability results using the sum of an $L_r$-distance and a filtration distance to bound objective values of a certain class of stochastic optimization problems.
Scenario trees are discrete approximations of general processes and therefore lend themselves to the construction of a general theory of stochastic optimization. However, the requirement that every node has only one predecessor makes it hard to construct scenario trees with many stages that model the conditional distributions well, i.e., ensure that every node has a sufficient number of successors and at the same time avoid exponential growth of the number of nodes.
A possible way out of this dilemma is to restrict the type of the stochastic optimization to problems with a Markovian structure where the random processes in the problem formulation are Markov processes [34] or, even more common, independent [39]. In this setting the history of random variables and decisions is condensed in the state variables of the problem and there is no need to remember the whole history of the randomness and the decisions. This paves the way for leaner discretizations, which we call scenario lattices in this paper and which are similar to stochastic meshes used in option pricing [9]. In particular, a scenario lattice consists of the same building blocks as a scenario tree, but relaxes the requirement that every node has only one predecessor and therefore solves the problem of exponential explosion of the number of nodes as the number of stages grows (see the right panel in Fig. 1). In the same way that a scenario tree is a natural representation for a general discrete stochastic process, a scenario lattice is a natural representation of a discrete Markov process.
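The growth argument above can be made concrete with a small count. The following Python sketch is only an illustration under simplifying assumptions that are not part of the paper: every tree node has the same number of successors, and every lattice stage after the root has the same number of nodes. The function names `tree_nodes` and `lattice_nodes` are hypothetical.

```python
def tree_nodes(branching: int, stages: int) -> int:
    """Number of nodes in a scenario tree with a single root at t = 0
    and `branching` successors per node (assumed constant)."""
    return sum(branching ** t for t in range(stages + 1))


def lattice_nodes(nodes_per_stage: int, stages: int) -> int:
    """Number of nodes in a scenario lattice with a single root at t = 0
    and `nodes_per_stage` nodes in every later stage (assumed constant)."""
    return 1 + nodes_per_stage * stages


if __name__ == "__main__":
    # Compare both structures for three successors / three nodes per stage.
    for T in (2, 5, 10):
        print(T, tree_nodes(3, T), lattice_nodes(3, T))
```

With these assumptions the tree grows geometrically in the number of stages, while the lattice grows linearly, because lattice nodes may share predecessors.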
Even though the abovementioned problem class is quite popular, there are no theoretical results on how to construct optimal scenario lattices. An exception is [2,3], where an algorithm for the construction of scenario lattices for Brownian motions is designed based on ideas of optimal quantization. We mention that there is a large and well developed theory on the approximation of Markov decision processes (MDPs) that is concerned with similar questions as this article. Typical formulations of MDP problems feature finite state and action spaces as well as a stationary Markov process describing the randomness, which is potentially influenced by the actions taken by the decision maker.
The setting as well as the solution methods differ from our paper in several important ways. Firstly, methods for solving MDPs are almost exclusively based on the discretization of the whole state space, leading to the well known curse of dimensionality as the dimension of the state space grows. Consequently, methods to approximate MDPs either assume a finite or countable state and action space to start with [4,19,33,46,57,58] or discretize the state space to be able to solve the problem.
Furthermore, much of the work on approximations of MDPs deals with infinite horizon problems relying on the fact that optimal value functions are fixed points of the Bellman operator [11,26,46,49,57,58].
The difference between our approach and the MDP literature is thus threefold: First, we keep the resource state continuous in order to be able to solve the problems on the nodes of the scenario lattice by linear optimization. This avoids at least part of the curse of dimensionality usually encountered in dynamic programming. Second, unlike most of the literature on approximation of MDPs, we deal with finite horizon problems. Third, we do not assume any Lipschitz continuity of the Markov kernel.
With this paper we contribute to the development of a theory for discrete approximations of Markov processes to be used in stochastic programming. In particular, we propose a class of problem-specific semi-distances for Markov processes and show that the objective value of a certain class of linear stochastic optimization problems is Lipschitz continuous with respect to these distances. This lays the foundations for constructing scenario lattices approximating general Markov processes that in turn can be used to formulate approximating optimization problems. In particular, the results in this paper can be used to control the error that results from replacing a stochastic optimization problem that is formulated using a complex (possibly continuous) Markov process by another, simpler problem using a compact scenario lattice instead of the original process. Furthermore, we discuss an LP formulation of our distance for discrete Markov processes, i.e., scenario lattices. We consider a multi-stage version of the well known newsvendor problem to demonstrate how to use our results in practical problems.
Our approach is inspired by [41] who work on optimal scenario trees and general stochastic optimization problems. In contrast to [41], our approach is specialized to linear stochastic programs with a Markovian structure, which results in tighter bounds for this problem class and additionally allows for problems where the randomness does not only affect the objective function but also the feasible set. The latter makes it necessary to adopt a different technique of proof based on stability results for linear programs rather than the idea of transporting solutions from one problem to the other. While in the MDP literature there are papers that model differences in feasible sets in terms of the Hausdorff distance [26], to the best of our knowledge, we are the first to propose stability results based on transportation distances that allow for problems where the feasible set depends on randomness in inequality constraints: [17,22,41,42] require the feasible set to be independent of randomness, while in [24,25] the constraints involving random parameters are required to be equality constraints. Furthermore, we demonstrate that our distance yields tighter bounds than [41] for problems where the constraints do not depend on the random process.
This paper is structured as follows: In Sect. 2, we introduce some notation and discuss the problem setup. In Sect. 3, we define the problem dependent lattice distance and establish some of its key properties. Section 4 contains the main results of the paper, which allow us to connect the lattice distance to optimal values of linear stochastic programming problems, while Sect. 5 is devoted to the case of discretely supported processes representable by lattices and a numerical example. Section 6 concludes the paper.

Problem description
We consider a class of discrete time, finite horizon, linear stochastic dynamic programming problems depending on a Markov process. The time periods in our problem are indexed by $t \in \mathbb{T} = \{0, 1, \dots, T\}$, where the values at $t = 0$ represent the deterministic start state of the problem. We partition the state space into an environmental state $\xi$ and a resource state $S$. The former is governed by a (possibly inhomogeneous) Markov process $\xi = (\xi_0, \xi_1, \dots, \xi_T)$, $\xi_t : \Omega_t \to \mathbb{R}^{n_t}$, which is assumed to be independent of the decisions. Examples are prices, demand for a product, or weather related variables such as temperature. The resource state $S_t$, on the other hand, describes the part of the state space that is influenced by the decision maker. Examples include inventory levels, states of machinery, and contractual obligations.
We equip the probability space $\Omega_t$ with the σ-algebra $\sigma_t = \sigma(\xi_t)$ generated by the random variable $\xi_t$ and define the path space $\Omega = \Omega_0 \times \dots \times \Omega_T$ and a corresponding σ-algebra $\mathcal{F} = \sigma_0 \otimes \dots \otimes \sigma_T$. Note that we base our σ-algebras only on the random variables $\xi_t$ and not on the whole history of random variables until $t$, as is usually done when working with scenario trees. Consequently, $\sigma_0, \sigma_1, \dots, \sigma_T$ is not a filtration.
Furthermore, we define the paths for which the event H ∈ σ t occurs as and the corresponding σ -algebra as The distribution of ξ is described by a sequence of Markov kernels and we write P ω t−1 t for the distribution of ξ t given ω t−1 ∈ Ω t−1 . The kernel as a function from Ω t−1 to the set of probability measures on Ω t is σ t−1 -measurable [45]. For a given sequence of Markov kernels, we denote ω = (ω 0 , . . . , ω T ) and define the distribution on Ω as for every H ∈ F. We consider stochastic optimization problems that can be written as which we assume to be compact. Note that the data of the problem depends on the stochastic process ξ via the functions ξ t → c t (ξ t ) and ξ t → b 1,t (ξ t ), which we assume to be continuous. We assume that for planning in stage t, the decision maker knows S t , i.e., the system's resource state at the beginning of the period as well as ξ t , i.e., the realization of the Markov process in period t. Given this information the feasible set for the decision x t as well as the definition of S t+1 can be expressed using linear inequality constraints. The decisions x t are auxiliary decision variables in stage t that are not part of the resource state. Note that in order for the problem to be feasible b 2,t+1 ≥ 0 has to hold. The combination of constraints in (1) ensures that i.e., that the feasible region for S t+1 is box-constrained and therefore compact.
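For a finitely supported process, the construction of the path distribution from a sequence of Markov kernels can be sketched in a few lines. The Python example below is a minimal illustration, not the paper's implementation: it assumes that each kernel is stored as a row-stochastic nested list `kernels[t-1][i][j]` giving the probability of moving from node `i` in stage `t-1` to node `j` in stage `t`, and that stage 0 consists of a single deterministic node. The names `path_probability` and `stage_marginals` are hypothetical.

```python
def path_probability(kernels, path):
    """Probability of a path (i_0, ..., i_T) as the product of the
    one-step transition probabilities along the path."""
    prob = 1.0
    for t in range(1, len(path)):
        prob *= kernels[t - 1][path[t - 1]][path[t]]
    return prob


def stage_marginals(kernels):
    """Unconditional distribution of every stage, starting from a single
    deterministic root node in stage 0."""
    marginals = [[1.0]]
    for kernel in kernels:
        prev = marginals[-1]
        n_next = len(kernel[0])
        marginals.append(
            [sum(prev[i] * kernel[i][j] for i in range(len(prev)))
             for j in range(n_next)]
        )
    return marginals


# A small two-stage lattice: the root splits into two nodes, which then
# transition to two nodes in the last stage (kernels may differ by stage).
kernels = [
    [[0.5, 0.5]],
    [[0.8, 0.2], [0.3, 0.7]],
]

if __name__ == "__main__":
    print(path_probability(kernels, (0, 0, 1)))
    print(stage_marginals(kernels))
```

Because only the current node enters each transition, the storage requirement is one matrix per stage rather than one distribution per history.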

Remark 1
Usually we would expect a state transition equation of the form S t+1 = S t + Ax t . However, since we want to make the proposed distance independent of the resource state, we formulate the state transition using x t . More specifically, we assign S t to a subset of variables in x t in the first constraint. The state transition is subsequently modelled in the equality constraint using those variables instead of S t . Alternatively, we could assign S t+1 to variables in x t in the equality constraint and then formulate the state transition using the first inequality constraint. We refer to the example in Sect. 5 for an illustration of this principle.
Because of its recursive structure, problem (1) can be equivalently written in terms of its dynamic programming equations using value functions, i.e., and V T +1 (S T +1 , ξ T +1 ) ≡ 0 or, more generally, a known piecewise linear concave function. Since ξ is a Markov process and V t as well as the decisions (x t , S t+1 ) only depend on the current state (S t , ξ t ), we call the problem a stochastic optimization problem with Markovian structure. If we are dealing with discrete Markov processes, the expectations of the value functions V t , which are concave functions of the resource state, can be written as a minimum of finitely many affine functions. We formalize this well known fact in the following lemma whose proof can be found for example in [34,44,52].

Lemma 1 If ξ is finitely supported, then for every realization of
where m t+1 (ξ t ) is the number of affine functions required to model
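The representation in Lemma 1 can be illustrated numerically. The following Python sketch assumes, purely for illustration, a scalar resource state and hypothetical slopes and intercepts; it evaluates a concave piecewise linear function as the minimum of finitely many affine functions.

```python
def concave_pwl(s, slopes, intercepts):
    """Evaluate a concave piecewise linear function given as the pointwise
    minimum of the affine functions slopes[i] * s + intercepts[i]."""
    return min(a * s + b for a, b in zip(slopes, intercepts))


# Hypothetical data: three affine pieces with decreasing slopes.
slopes = [2.0, 1.0, 0.0]
intercepts = [0.0, 1.0, 3.0]

if __name__ == "__main__":
    for s in (0.0, 1.0, 2.0, 3.0):
        print(s, concave_pwl(s, slopes, intercepts))
```

In a linear program, such a function is typically modeled with one auxiliary variable that is bounded above by each affine piece, so the number of pieces determines the number of additional constraints.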

A distance for Markov processes
In order to introduce the concept of a distance between Markov processes, we first recall the Wasserstein or Kantorovich distance for distributions [30,54]. Loosely speaking, the Wasserstein distance is defined as the total cost of passing from a given distribution to a desired one by moving probability mass accordingly.
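Definition 1 Let $\xi : (\Omega, \mathcal{A}) \to \mathbb{R}^n$ and $\tilde\xi : (\tilde\Omega, \tilde{\mathcal{A}}) \to \mathbb{R}^n$ be two random vectors with distributions $P$ and $\tilde P$, respectively. The Wasserstein distance of order $r$ ($r \ge 1$) between $\xi$ and $\tilde\xi$ is defined as
$$W_r(\xi, \tilde\xi) = \left( \inf_{\pi} \int_{\Omega \times \tilde\Omega} \|\xi(\omega) - \tilde\xi(\tilde\omega)\|^r \, \pi(d\omega, d\tilde\omega) \right)^{1/r} \quad \text{s.t.} \quad \pi(A \times \tilde\Omega) = P(A), \ \pi(\Omega \times B) = \tilde P(B) \ \text{for all } A \in \mathcal{A}, \ B \in \tilde{\mathcal{A}}, \tag{4}$$
where the infimum is taken over all probability measures $\pi$ on $(\Omega \times \tilde\Omega, \mathcal{A} \otimes \tilde{\mathcal{A}})$.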
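For intuition, the Wasserstein distance of order 1 between two finitely supported distributions on the real line can be computed without solving a transportation problem, since in one dimension it equals the area between the two cumulative distribution functions. The Python sketch below uses this well known one-dimensional identity; it is an illustration only and does not cover the multivariate case or orders r > 1. The function name `wasserstein_1d` is hypothetical.

```python
def wasserstein_1d(atoms_p, probs_p, atoms_q, probs_q):
    """Wasserstein distance of order 1 between two discrete distributions
    on the real line, computed as the integral of |F_P - F_Q|."""
    mass_p, mass_q = {}, {}
    for x, w in zip(atoms_p, probs_p):
        mass_p[x] = mass_p.get(x, 0.0) + w
    for x, w in zip(atoms_q, probs_q):
        mass_q[x] = mass_q.get(x, 0.0) + w

    points = sorted(set(mass_p) | set(mass_q))
    cdf_gap = 0.0  # F_P(x) - F_Q(x) on the current interval
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap += mass_p.get(left, 0.0) - mass_q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total


if __name__ == "__main__":
    # Half of the mass at 0 and half at 2 versus all mass at 1:
    # every unit of mass has to be moved by a distance of 1.
    print(wasserstein_1d([0.0, 2.0], [0.5, 0.5], [1.0], [1.0]))
```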

Remark 2
Note that, following [41], we define $W_r$ as a distance between two random vectors $\xi : \Omega \to \mathbb{R}^n$ and $\tilde\xi : \tilde\Omega \to \mathbb{R}^n$ instead of between two distributions $P$ and $\tilde P$. However, in order for $W_r$ to be well defined, information on the probability measures $P$ and $\tilde P$ on $\Omega$ and $\tilde\Omega$ is required, as can be seen from (4). In particular, when changing $P$ and $\tilde P$ while holding $\xi$ and $\tilde\xi$ constant, the image measures of $\xi$ and $\tilde\xi$ and therefore also $W_r$ change. By a slight abuse of notation, we consider $\xi$ and $\tilde\xi$ to contain the information on the probability spaces $(\Omega, P)$ and $(\tilde\Omega, \tilde P)$, i.e., as mappings $\xi : (\Omega, P) \to \mathbb{R}^n$ and $\tilde\xi : (\tilde\Omega, \tilde P) \to \mathbb{R}^n$, in the same way as [41] do when defining nested distributions.

Remark 3
The above problem is bounded and an optimal transportation measure π exists, due to weak-compactness of the set of transportation plans (see [54], Lemma 4.4). Furthermore, according to the famous Kantorovich-Rubinstein Theorem, for r = 1, the dual of (4) can be written as the following maximization problem Clearly, for a two-stage stochastic optimization problem wherex * is the optimal solution for v(P). By symmetry it follows that i.e., that the objective value of the two-stage stochastic program is Lipschitz continuous with respect to W 1 , as long as the cost-to-go function Q is Lipschitz in ξ . This was first recognized in [48]. The authors in [41,42] generalize these ideas to a multi-stage setting using the notion of nested distributions which correspond to generalized scenario trees. Based on a modified transportation problem and an assumption similar to the uniform Lipschitz property in (5), they obtain a distance with respect to which the objective value of a general multi-stage problem is Hölder continuous, see Sect. 4 for more details.
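The Lipschitz property of the two-stage objective value with respect to $W_1$ can be checked numerically on a toy example. The sketch below is an assumption-laden illustration, not the example from the paper: it uses a single-period newsvendor with hypothetical selling price and unit cost, scalar demand, and finitely supported demand distributions. The profit is Lipschitz in the demand with constant equal to the price, so the difference in optimal values should not exceed the price times $W_1$.

```python
def w1(atoms_p, probs_p, atoms_q, probs_q):
    """Order-1 Wasserstein distance on the real line via the area
    between the two cumulative distribution functions."""
    mass_p, mass_q = {}, {}
    for x, w in zip(atoms_p, probs_p):
        mass_p[x] = mass_p.get(x, 0.0) + w
    for x, w in zip(atoms_q, probs_q):
        mass_q[x] = mass_q.get(x, 0.0) + w
    points = sorted(set(mass_p) | set(mass_q))
    gap, total = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        gap += mass_p.get(left, 0.0) - mass_q.get(left, 0.0)
        total += abs(gap) * (right - left)
    return total


def newsvendor_value(atoms, probs, price, cost):
    """Optimal expected profit max_x E[price * min(x, d) - cost * x].
    For discrete demand the expected profit is concave and piecewise
    linear in x, so it suffices to check x = 0 and the demand atoms."""
    candidates = [0.0] + list(atoms)
    return max(
        sum(w * (price * min(x, d) - cost * x) for d, w in zip(atoms, probs))
        for x in candidates
    )


# Hypothetical data for the illustration.
PRICE, COST = 5.0, 3.0
P_ATOMS, P_PROBS = [10.0, 20.0, 30.0], [0.3, 0.4, 0.3]
Q_ATOMS, Q_PROBS = [12.0, 28.0], [0.5, 0.5]

if __name__ == "__main__":
    v_p = newsvendor_value(P_ATOMS, P_PROBS, PRICE, COST)
    v_q = newsvendor_value(Q_ATOMS, Q_PROBS, PRICE, COST)
    bound = PRICE * w1(P_ATOMS, P_PROBS, Q_ATOMS, Q_PROBS)
    print(abs(v_p - v_q), "<=", bound)
```

Note that the order quantity is not restricted by the demand, i.e., the feasible set does not depend on the randomness, which is exactly the setting in which this argument applies.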
We aim for a similar result for scenario lattices and problems of the form (1). Additionally, we relax one major assumption in the abovementioned approaches, namely that randomness enters the problem only in the objective function. Observe that the argument above hinges on the fact that the set Y does not depend on ξ . The same restriction applies to the results on multi-period problems in [41,42].
We begin with analyzing the following simple deterministic linear optimization problem, which is of a similar structure as (3), with the second last inequality constraint and the second term in the objective function, y, modeling the piecewise linear value function (see Lemma 1) Furthermore, we define 1 m ∈ R m as the column vector of ones, assume that C 3 has m rows and k columns, and assume that the other matrices and vectors are of fitting dimension. First we prove the following result which is motivated by Hoffman's lemma [27] and in particular its discussion in [53], Theorem 7.11 and Theorem 7.12. For what follows, we adopt the notational convention that the addition of a vector x = (x 1 , . . . , x n ) and a scalar y ∈ R is to be interpreted pointwise, i.e., results in the vector (x 1 +y, . . . , x n +y) and, similarly, inequalities of the form x ≤ y are interpreted pointwise as well.
Lemma 2 Let $V(b_1)$ be the optimal value of problem (6) dependent on the parameter $b_1$ and assume that there is a $\kappa \ge 0$ with
Proof We start by rewriting (6) as Denote by $M(b_1)$ the set of feasible points of problem (8) and consider a point $\alpha = (x, y, z, t) \in M(b_1)$. Note that for any $a \in \mathbb{R}^n$, $\|a\|_1 = \sup_{\|u\|_\infty \le 1} u^\top a$, and define $u = (u_1, u_2, u_3, u_4)$ with $u_i$ corresponding to the respective entries in $\alpha$, i.e., $u_1 \in \mathbb{R}^n$, $u_2 \in \mathbb{R}$ and so on. Therefore, we have

By a change of variables defining
Consequently we obtain that The right-hand side of (9) has a finite optimal value (since the left-hand side of (9) is finite) and, hence, has an optimal solution (û,λ). It follows that To find a bound for ||λ 2 || ∞ , we analyze the extreme points of the feasible set Since we know that ||λ 1 || ∞ ≤ 1, we can replace the constraint Then using the assumption that C 3 λ 5 ∞ ≤ κ we can substitute to increase the feasible set of problem (9), and hence increase its optimal value. Consequently, Note that the optimal value remains bounded when replacing Γ with Γ , since if there would be a ray and therefore In this case, we can define λ 1 = (0, λ 1 2 , λ 1 3 , λ 1 4 , 0, λ 1 6 , λ 1 7 ) and a ray Clearly, points in R fulfill the first constraint of Γ by (10), the second one since the first and the fifth component of λ 1 are zero, and the third since λ 1 3 = λ 1 7 has to hold for R to be in Γ . This means that R is contained in Γ , contradicting the boundedness of the original problem. Hence, the modified problems remains bounded and therefore the maximum is taken at a vertex of the polyhedron Γ .
The polyhedral set Γ has a finite number of extreme points. Hence, ||λ 2 || ∞ can be bounded by γ (A 1 , A 2 Analogously, we get and finally (7).

Remark 4
The matrix C 3 represents slopes of the linear functions modeling value function for discrete distributions (see Lemma 1). Applying Lemma 2 to the problem (3), C 3 may differ depending on the stage t and the state of the random process ξ t . Therefore, we write C 3,t (ξ t−1 ) for the matrix of slopes of the linear functions used in the representation of E (V t (S t , ξ t )|ξ t−1 ) and choose κ t (ξ t−1 ) as follows ,t is the entry in the i th row and j th column of the matrix C 3,t .

Remark 5
For continuous distributions the matrix $C_3$ does not exist. However, in our formulation of the problem, $S_t \mapsto \mathbb{E}(V_t(S_t, \xi_t) \mid \xi_{t-1})$ is a continuous function on the compact set of permissible $S_t$ for every $\xi_{t-1}$, hence it is Lipschitz continuous with Lipschitz constant $L_t(\xi_{t-1})$ on this set. Therefore, in this case we use $\kappa_t(\xi_{t-1}) = 2 L_t(\xi_{t-1})$ in the definition of the distance below.
An alternative proof of the above lemma could be based on the Lipschitz continuity of the feasible set with respect to the Hausdorff metric as shown in [47,56]. However, the aforementioned papers do not provide any instructions for calculating the Lipschitz constant, which makes it difficult to apply their results in concrete optimization problems. Our approach does not suffer from this problem, since it allows us to explicitly bound the variation in the objective as a function of the right-hand side data of problem (6). Next, we will prove a similar result to bound the objective value when the objective coefficient $c_1$ changes.
Lemma 3 If $V(c_1)$ is the optimal value of problem (6) in dependence on the objective value coefficient $c_1$, then
Proof Let $(x^*, y^*)$ be an optimal solution to $V(c_1)$, then we have

By symmetry this implies
for an optimal solution $(\tilde x^*, \tilde y^*)$ to $V(\tilde c_1)$. Notice that the set of feasible points is invariant with respect to the parameter $c_1$. Hence, $x^*$ and $\tilde x^*$ can be selected as extreme points of the same polyhedral set $\Phi$, which depends on $A_1$, $b_1$, $A_2$, $b_2$ and has a finite number of vertices. Therefore, $\|x^*\|_\infty$ and $\|\tilde x^*\|_\infty$ can be bounded by a constant $\phi(A_1, b_1, A_2, b_2)$, finishing the proof.

Remark 6
When applying the above lemma to the problem (3), b 1,t (ξ t ) +C 1,t S t corresponds to the second parameter of φ. Since we would like to avoid a dependence of our distance on the resource state, we note that φ is increasing with respect to this parameter and replace b 1,t (c i, j , 0)) i, j and c i, j are the entries in the matrix C 1,t . Since S t ≥ 0 and b 2,t ≥ 0, we thereby increase the size of the polyhedron Γ and thus make the bound slightly looser but independent of S t .
Note that the problems in (3) fulfill the assumptions of Lemma 2 and Lemma 3. Equipped with these results, we define a transportation distance between two Markov processes. The distance is defined for a given problem of the form (1), i.e., we do not propose one distance but a whole family of problem specific distances, which differ in the matrices and vectors used to define the constants γ and φ in Lemma 2 and Lemma 3. To avoid cluttered notation, we write Furthermore, we omit the explicit dependence of ξ on ω wherever no confusion can arise, i.e., write ξ instead of ξ(ω).

Remark 7
Note that to ensure measurability of φ t and γ t we have to use the universal sigma algebra, which is a natural extension of the Borel sigma algebra fitting for dynamic programming. See [6], Chapter 7 for an in-depth treatment of the subject and [5], Appendix C for a short primer.
In particular, we mention that the vertices of the polyhedra in the proofs of Lemma 2 and Lemma 3 change continuously with the right hand sides of the linear inequality constraints almost everywhere. The functions γ t and φ t are therefore Borel measurable due to the Borel measurability of the functions c t and b t .
Furthermore, standard arguments yield that, by Borel measurability of the Markov kernel, the functions are lower semi-analytic. Hence, the function is lower semi-analytic and therefore universally measurable. Consequently, we interpret all integrals as integrals with respect to the unique extensions of measures with respect to the universal sigma algebra [see [6]].

Definition 2
Let $\xi$ and $\tilde\xi$ be two Markov processes defined on probability spaces $\Omega$ and $\tilde\Omega$, respectively, and $P$ and $\tilde P$ the corresponding probability measures on $\Omega$ and $\tilde\Omega$.
We define a lattice distance for the problem (1) as taking the infimum over all Markov probability measures $\pi$ defined on $\mathcal{F} \otimes \tilde{\mathcal{F}}$. We assume that the constraints hold for almost all $(\omega_{t-1}, \tilde\omega_{t-1}) \in \Omega_{t-1} \times \tilde\Omega_{t-1}$, as well as all $H_t \times \tilde H_t \in \sigma_t \otimes \tilde\sigma_t$ and define and

Remark 8
Note that, similar to the convention discussed in Remark 2, we require the information on the measures $\tilde P$ and $P$ on the underlying probability spaces to calculate the distance between the two Markov processes.

Remark 9
As will become clear in the proof of Theorem 3, both d t (ξ t ,ξ t ) and d t (ξ t , ξ t ) can be used to construct bounds for the difference in stochastic optimization problems. We therefore use the minimum in (13) to improve the bounds and ensure symmetry of D L .
Note that the objective function in (12) is defined in terms of the unconditional transport plan π between the joint distributions P andP while the constraints rely on the corresponding disintegration in the form of Markov kernels π ω t−1 ,ω t−1 t , which are guaranteed to exist [45] and relate to π via for H ×H ∈ F ⊗F. However, since the disintegration of π into Markov kernels is only π -almost surely unique, the constraints in (12) have to be fulfilled π t−1 almost surely, where π t−1 is the unconditional marginal of π in stage t − 1.

Remark 10
Analogously to Remark 3 and [41,42], the infimum in the above definition is attained due to weak-compactness of the set of transportation plans.
Next we show that there is always at least one feasible transport plan between any two Markov processes, i.e., there are no processes with infinite distance.

Proposition 1
The defining optimization problem of $D_L$ is always feasible. In particular, the product measure $\pi := P \otimes \tilde P$ is always part of the feasible set.
Proof Let A ∈ σ t+1 and B ∈σ t+1 for given t and C ∈ F and D ∈F. We have where the first equality follows from the properties of the product measure. Since the sets A, B, C, and D are chosen arbitrarily and P ω t t+1 (A)·Pω t t+1 (B) as well as π ω t ,ω t t+1 (A× B) are σ t ⊗σ t measurable, it follows that they coincide π -almost everywhere, i.e., For the particular choices A = Ω t+1 or B =Ω t+1 , we get the conditions in problem (12).
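For finitely supported processes, the statement of Proposition 1 can be verified directly. The following Python sketch is an illustration under the assumption that the conditional distributions of the two processes at a given pair of predecessor nodes are available as probability vectors (hypothetical data below). It builds the conditional product plan and checks that its marginals reproduce the two conditional distributions, which is the discrete analogue of the marginal conditions used in the proof.

```python
def conditional_product_plan(kernel_row, kernel_row_tilde):
    """Product coupling of two conditional distributions: entry (i, j) is
    the probability of moving jointly to node i and node j."""
    return [[p * q for q in kernel_row_tilde] for p in kernel_row]


def marginals(plan):
    """Row sums and column sums of a transport plan."""
    first = [sum(row) for row in plan]
    second = [sum(col) for col in zip(*plan)]
    return first, second


# Hypothetical conditional distributions at one pair of predecessor nodes.
P_ROW = [0.2, 0.5, 0.3]
Q_ROW = [0.6, 0.4]

if __name__ == "__main__":
    plan = conditional_product_plan(P_ROW, Q_ROW)
    first, second = marginals(plan)
    print(first, second)
```

The product plan is feasible but in general far from optimal, since it ignores how close the values of the two processes are.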
Next we show that D L is a semi-metric, i.e., that it is non-negative and symmetric. Example 1 demonstrates that it does not fulfill the triangle inequality.

Proposition 2 If either c t or b 1,t have a continuous inverse, D L is a semi-metric on the equivalence classes of Markov processes that have the same distribution.
Proof From the non-negativity of the norms and the constants $\phi_t$ and $\gamma_t$, we obtain that $D_L \ge 0$. Clearly, $d(\xi, \tilde\xi) = d(\tilde\xi, \xi)$. If $\pi^*$ is the optimal transportation plan for $D_L(\xi, \tilde\xi)$, then $\tilde\pi^*(\tilde\omega, \omega) = \pi^*(\omega, \tilde\omega)$ is the optimal transportation plan for $D_L(\tilde\xi, \xi)$. Therefore we have $D_L(\xi, \tilde\xi) = D_L(\tilde\xi, \xi)$.
To show that $D_L(\xi, \tilde\xi) = 0$ if and only if $\xi = \tilde\xi$ in distribution, we note that one direction is trivial, since $\xi = \tilde\xi$ in distribution implies that $D_L(\xi, \tilde\xi) = 0$. If $c_t$ or $b_{1,t}$ have continuous inverses, then $b_{1,t}(\xi_t) \neq b_{1,t}(\tilde\xi_t)$ or $c_t(\xi_t) \neq c_t(\tilde\xi_t)$ in distribution for any two processes $\xi$ and $\tilde\xi$ that do not have the same distribution.
Under these circumstances, if $D_L(\xi, \tilde\xi) = 0$, similar to [55], we can without loss of generality assume that $\Omega = \tilde\Omega$ and find a measure $\pi$ whose image measure on $\times_{t=1}^{T} \mathbb{R}^{n_t}$ is almost surely concentrated on the diagonal. This implies that $\xi$ and $\tilde\xi$ have the same distribution.

Bounding linear Markov decision problems
In this section, we show how the lattice distance D L can be used to approximate linear stochastic programming problems with a Markovian structure as defined in (1). We start by showing that every Markov process can be approximated to an arbitrary precision by a discrete process in Theorem 1. We proceed by proving Theorem 3 in which we show that optimal values of problems in (1) are Lipschitz continuous with respect to D L . These two results in combination imply that D L can, in theory, be used to find discrete Markov processes (scenario lattices) that, when used in optimization problems, lead to an arbitrary close approximation of the objective values.
In order to show Theorem 1, we require the following result demonstrating that distances between any pair of Markov processes can be approximated to an arbitrary precision by distances where one of the processes is replaced by a discrete approximation. For what follows, we denote by $L^p(\Omega \times \tilde\Omega, \pi)$ the Lebesgue space of $p$-integrable functions.

Lemma 4 Let (15) and $\pi$ be a transportation plan that minimizes $D_L(\xi, \tilde\xi)$ for two given processes $\xi$ and $\tilde\xi$. If for all feasible transportation plans $\nu$, then there is a sequence of discrete approximations

Note that the condition $p > 1$ ensures that the space $L^p(\Omega \times \tilde\Omega, \pi)$ is reflexive, which is used for the proof of Lemma 5 below, which in turn is required for the proof of Lemma 4. Note that this result is purely theoretical, showing that, loosely speaking, discrete Markov processes are dense with respect to $D_L$. In particular, the crude discretization used below to show Lemma 4 does not yield efficient approximations of Markov processes.

Remark 11
We note that similar to the tree distance proposed in [41], the empirical distribution does not converge to the true distribution in D L . This follows essentially by the same argument that is given in [43] in Proposition 1. Modifications of the distance based on non-parametric estimates addressing this issue as in [43] would be in principle possible but are out of the scope of this paper.
To prove Lemma 4, we define discrete approximationsξ k ofξ . We start by noting that since θ t is continuous, it is uniformly continuous on B k t : where B t (0, k) is the ball of radius k around 0 in R n t . Now, for each k define a discrete random variableξ k t :Ω t → R n t with atomsξ k t,m and 0, k). Furthermore, define corresponding Markov kernels as and the functions for ν t ∈ L q (Ω t ×Ω t , π t ) with q −1 + p −1 = 1 the unconditional distributions of the transportation plan ν in stage t.
In Lemma 5, we show that the approximations defined above epi-converge to the objective function of the optimization problem defining the lattice distance. Epi-convergence is the weakest notion of convergence of functions under which convergence of the objective functions implies convergence of the optimal solutions. It is defined as follows.
Definition 3 (epi-convergence) A sequence of functions $f_n : X \to \mathbb{R}$ defined on a metric space $X$ epi-converges to a function $f : X \to \mathbb{R}$ if, for every $x \in X$,
1. $\liminf_{n \to \infty} f_n(x_n) \geq f(x)$ for every sequence $x_n \to x$, and
2. there is a sequence $x_n \to x$ with $\limsup_{n \to \infty} f_n(x_n) \leq f(x)$.
We write $f_n \xrightarrow{\text{epi}} f$.

We will additionally require the notion of barrelled spaces, which are exactly the spaces in which the uniform boundedness principle is valid; we will use this principle in the proof of Lemma 4.

Definition 4 (barrel, barrelled space) A closed set $B \subseteq X$ in a real topological vector space $X$ is a barrel if and only if the following conditions hold:
1. $B$ is absolutely convex, i.e., $\lambda_1 x + \lambda_2 y \in B$ for all $x, y \in B$ and $|\lambda_1| + |\lambda_2| = 1$.
2. $B$ is absorbing, i.e., for every $x \in X$ there is an $\alpha > 0$ with $x \in \alpha B$.
A locally convex vector space is called barrelled if and only if every barrel is a neighborhood of zero.

Lemma 5 If the integrability conditions (16) hold for $\xi$ and $\tilde\xi$, then $\sum_{t=0}^T f^t_k \xrightarrow{\text{epi}} \sum_{t=0}^T f^t_0$.

Proof Fix $\varepsilon > 0$. By integrability of $\theta_t$ with respect to $\nu_t$ and an application of the dominated convergence theorem, it follows that there is a compact set $K_t \subset \mathbb{R}^{n_t} \times \mathbb{R}^{n_t}$ for every $t = 0, \ldots, T$ such that

Now choose $k \in \mathbb{N}$ such that $K_t \subseteq B^k_t$ and $k > \varepsilon^{-1}$ and note that

i.e., $f^t_{kn} \to f^t_{0n}$ uniformly for all $n$. Note further that

where the two limits can be exchanged because of the uniform convergence shown above and the first equality follows by the monotone convergence theorem. As the convergence holds for every $t = 0, \ldots, T$, we obtain that

The space $L^p(\Omega \times \tilde\Omega, \pi)$ is reflexive and therefore the weak topology is barrelled (see [35], Theorem 23.22). Since $\sum_{t=0}^T f^t_k \to \sum_{t=0}^T f^t_0$ weakly, the set $\{\sum_{t=0}^T f^t_k, \sum_{t=0}^T f^t_0\}$ is weakly bounded and therefore weakly equi-continuous by the uniform boundedness principle. Since $\{\sum_{t=0}^T f^t_{kn} : n \in \mathbb{N}_0\}$ is equi-continuous, it is equi-lower semi-continuous and $\sum_{t=0}^T f^t_k \xrightarrow{\text{epi}} \sum_{t=0}^T f^t_0$ (see [12], Theorem 2.18).

Proof (Lemma 4) Because of the epi-convergence proved in Lemma 5, we obtain (see [1], Theorem 2.5)

Note that the feasible set $\Upsilon$ can w.l.o.g. be assumed to be the feasible set of $D_L(\xi, \tilde\xi)$, since for every feasible transportation plan for $D_L(\xi, \tilde\xi^k)$ there exists a plan that is feasible for $D_L(\xi, \tilde\xi)$ and yields the same objective.
Next, we prove the main result of the paper, which establishes that the optimal value of the stochastic optimization problem associated with $D_L$ is Lipschitz continuous with respect to $D_L$. We first note the following useful lemma, in which $i_t : \Omega_t \times \tilde\Omega_t \to \Omega_t$ and $\tilde i_t : \Omega_t \times \tilde\Omega_t \to \tilde\Omega_t$ denote the natural projections for $t = 0, \ldots, T$.

Lemma 6
For a measurable function $f : \Omega_t \to \mathbb{R}$ and measures $P_t$, $\tilde P_t$, $\pi_t$ that fulfill the conditions in (12), we have

Proof The result clearly holds for indicator functions $f = \mathbb{1}_A$ of measurable sets $A \subseteq \Omega_t$ and therefore, by the usual argument, also for general measurable functions.
Theorem 3 Let $\xi$ and $\tilde\xi$ be Markov processes and $V_0$ be the value function for a stochastic optimization problem of the form (1), then

Proof We start by choosing $\varepsilon > 0$ arbitrarily. If the process $\xi$ is continuous, we define an $\varepsilon$-exact approximation of the value functions. To this end, we note that since for

is a continuous function on the compact set of permissible decisions $S_t$, it is Lipschitz continuous with Lipschitz constant $L_t(\xi_{t-1})$.

By compactness, the set of feasible $S_t$ can be covered by a finite open cover

Clearly, it follows that

An analogous argument holds for the process $\tilde\xi$. Note that if $\xi$ or $\tilde\xi$ is discrete, we can choose $\varepsilon = 0$ and take the approximating kernel equal to $\kappa_t$ or $\tilde\kappa_t$, respectively, since the value function approximation constructed above can be made exact due to Lemma 1. Defining

where the first inequality follows from Lemma 2 and the second from Lemma 3 and Remark 6.

Exchanging the order in which Lemma 2 and Lemma 3 are applied yields

and exchanging the roles of $V_T$ and $\tilde V_T$ finally results in

Proceeding to the next stage, we assume w.l.o.g. that

Then for all $\xi_{T-1} \in \Omega_{T-1}$, $\tilde\xi_{T-1} \in \tilde\Omega_{T-1}$ we have

where the second equality follows by Lemma 6, the first inequality from (18), the following equality again from Lemma 6, and the subsequent inequalities from Lemma 2, Lemma 3, and (17). As in the derivation of (18), we can exchange the order in which Lemma 2 and Lemma 3 are applied to obtain the corresponding variant of the above inequality. Exchanging the roles of $V_{T-1}$ and $\tilde V_{T-1}$ we obtain

Proceeding by backward induction, and noting that the distance $D_L$ is nondecreasing when the approximating kernels are replaced by $\kappa_t(\xi_{t-1})$ and $\tilde\kappa_t(\tilde\xi_{t-1})$, respectively, we arrive at

and since $\varepsilon > 0$ was arbitrary, the result follows.

Remark 12
Linear stochastic optimization problems without randomness in the constraints are special cases of the problems for which [41] provide stability results analogous to Theorem 3. Hence, a comparison of the two types of results for this problem class is of interest.
The authors in [41] show that for their nested distance $D_T$, a convex set $X$, and a general objective function $h$

assuming that there is a constant $L$ such that

Defining $\mathcal{G}_t = \sigma(\xi_0, \ldots, \xi_t)$ and $\tilde{\mathcal{G}}_t = \sigma(\tilde\xi_0, \ldots, \tilde\xi_t)$ as the $\sigma$-algebras generated by the histories of the processes, the distance $D_T$ for arbitrary stochastic processes is defined as

In this paper, we treat the special case $h(x, \xi) = \sum_{t=0}^T c_t(\xi_t) x_t$, for which $L$ can be calculated as $L = \max_t L_{c_t} \phi_t$, assuming that the functions $c_t$ are Lipschitz with constants $L_{c_t}$ and $\phi_t$ is the function calculated in Lemma 3. Note that $\phi_t$ is deterministic in the case of a deterministic feasible set.
It is easy to see that for two Markov processes the permissible transportation plans $\pi$ for $D_T$ and for $D_L$ are equivalent. Assume that $\pi^*$ is an optimal transportation plan for $D_T$, then we have

The above calculations show that our bound is tighter than $D_T$ for problems where both bounds are applicable, i.e., linear stochastic optimization problems with deterministic feasible set $X$.

Implementation for finite scenario lattices
In this section, we focus on the computation of $D_L$ for two finitely supported Markov processes. In Sect. 5.1, we detail all steps necessary to compute $D_L$, provide a formal algorithm for the computation, and discuss computational issues. In Sect. 5.2, we discuss a simple example demonstrating the bounding property of $D_L$ and provide a comparison to the tree distance of [41].

Computation of D L
In this section, we show that, as for the classical Wasserstein distance and the distance of [41], $D_L$ can be computed by solving a linear optimization problem to find the optimal transport plan $\pi$.
We represent two discrete Markov processes $\xi$ and $\tilde\xi$ by scenario lattices. To that end, at every stage $t \in \mathbb{T}$ we define the probability spaces

where $N_t$ and $M_t$ are the numbers of atoms of the unconditional distributions $P_t$ and $\tilde P_t$, respectively. The conditional transition from a given state $i$ ($\tilde i$) at time $(t-1)$ to a state $j$ ($\tilde j$) at time $t$ is described by a conditional probability $P^i_t(j)$ and $\tilde P^{\tilde i}_t(\tilde j)$, respectively. The optimal transport plan $\pi$ is a Markov process on $\Omega$ which is fully described by the conditional probabilities

The measure $\pi$ can therefore be represented by a set of non-negative matrices $\pi$

We furthermore denote by $\pi_t$ the unconditional distributions at time $t$.
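To make the lattice representation concrete, the following minimal sketch stores a discrete Markov process as a list of row-stochastic transition matrices and recovers the unconditional distributions $P_t$ by forward propagation. The function name `unconditional_distributions` and all numbers are illustrative assumptions and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def unconditional_distributions(p0: List[float], kernels: List[Matrix]) -> List[List[float]]:
    """Propagate the initial distribution through the transition matrices.

    kernels[t-1][i][j] plays the role of the conditional probability P_t^i(j)
    of moving from state i at time t-1 to state j at time t.
    """
    dists = [list(p0)]
    for kernel in kernels:
        prev = dists[-1]
        if len(kernel) != len(prev):
            raise ValueError("kernel needs one row per state of the previous stage")
        for row in kernel:
            if abs(sum(row) - 1.0) > 1e-9 or min(row) < 0.0:
                raise ValueError("each row must be a probability vector")
        n_next = len(kernel[0])
        # P_t(j) = sum_i P_{t-1}(i) * P_t^i(j)
        dists.append([sum(prev[i] * kernel[i][j] for i in range(len(prev)))
                      for j in range(n_next)])
    return dists


# Hypothetical lattice: one root node, two nodes in stage 1, three in stage 2.
example_kernels = [
    [[0.4, 0.6]],
    [[0.5, 0.5, 0.0],
     [0.0, 0.25, 0.75]],
]
example_dists = unconditional_distributions([1.0], example_kernels)
```

Because a lattice stores only one transition matrix per stage, the number of nodes grows with the number of states per stage rather than with the number of scenario paths.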
To be able to compute the lattice distance as a linear program, we define

as well as $\pi_{t-1}(\omega_{t-1}, \tilde\omega_{t-1})$ as decision variables. For given $(\omega_{t-1}, \tilde\omega_{t-1})$ and $(i, \tilde i)$, the constraints in the definition of $D_L$ can therefore be written as linear constraints in these variables as

Hence, given two discrete processes $\xi$ and $\tilde\xi$ as well as $\theta_t(\xi_t(\omega_t), \tilde\xi_t(\tilde\omega_t))$, $D_L(\xi, \tilde\xi)$ can be computed as the following linear optimization problem in the variables $\tau$ (20)

Algorithm 1 Computation of $D_L$
Require: Data $A_{1,t}$, $A_{2,t}$, $b_{2,t}$, $C_{1,t}$, functions $c_t$, $b_{1,t}$, $\kappa_t$ for all $t \in \mathbb{T}$
1: for $t \in \mathbb{T}$ do
2: for $I \subset \{1, \ldots, N_{\gamma,t}\}$ with $|I| = K_{\gamma,t}$ do ▷ deterministic vertices, LU for $\gamma$
5: else ▷ rhs stochastic
9: Store LU factorization of $A^I_\gamma$ in $(P^I_\gamma, L^I_\gamma, U^I_\gamma)$
10: end if
12: end for
13: for $I \subset \{1, \ldots, N_{\phi,t}\}$ with $|I| = K_{\phi,t}$ do ▷ deterministic vertices and LU for $\phi$
14: for
25: for $I \in \mathcal{I}_\phi$ do
26: Use
end for
29: end for
30:
31: for $\tilde\omega_t \in \tilde\Omega_t$ do
32: end for
37: for $\omega_t \in \Omega_t$ do
38: end for
43: Compute $\theta_t(\xi_t(\omega_t), \tilde\xi_t(\tilde\omega_t))$ according to (14) and (…)

We provide pseudocode for the calculation of $D_L$ in Algorithm 1. The algorithm loops over the stages $t$ of the problem and iteratively computes the constants $\gamma_t$ and $\phi_t$.
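As a one-step illustration of the linear structure, the sketch below checks whether a candidate matrix is a feasible transport plan between two conditional distributions and evaluates the linear objective for given costs $\theta$. It assumes the usual coupling constraints (prescribed row and column sums, non-negativity); the exact constraint set of (20) should be taken from the paper. The helper names `is_coupling` and `transport_cost` and all data are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def is_coupling(plan: Matrix, p: List[float], p_tilde: List[float], tol: float = 1e-9) -> bool:
    """Check the linear constraints of a one-step transport plan.

    Row sums must reproduce p, column sums must reproduce p_tilde,
    and all entries must be non-negative.
    """
    if len(plan) != len(p) or any(len(row) != len(p_tilde) for row in plan):
        return False
    if any(entry < -tol for row in plan for entry in row):
        return False
    rows_ok = all(abs(sum(plan[i]) - p[i]) <= tol for i in range(len(p)))
    cols_ok = all(abs(sum(plan[i][j] for i in range(len(p))) - p_tilde[j]) <= tol
                  for j in range(len(p_tilde)))
    return rows_ok and cols_ok


def transport_cost(plan: Matrix, theta: Matrix) -> float:
    """Linear objective: plan mass times cost theta, summed over all pairs."""
    return sum(plan[i][j] * theta[i][j]
               for i in range(len(plan)) for j in range(len(plan[i])))


# Hypothetical conditional distributions and costs (not from the paper).
p = [0.5, 0.5]
p_tilde = [0.25, 0.75]
theta = [[0.0, 1.0],
         [2.0, 0.5]]
independent = [[p[i] * p_tilde[j] for j in range(2)] for i in range(2)]
better = [[0.25, 0.25],
          [0.0, 0.5]]
```

Both `independent` and `better` satisfy the constraints, but `better` has the lower cost. A linear programming solver searches over all such feasible plans for the one with minimal cost.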
In line 2, we write the polyhedra defined in Lemma 2 and Lemma 3 as systems of linear inequalities with matrices $A_\gamma$, $A_\phi$ and vectors $b_\gamma$, $b_\phi$. This is merely for notational convenience in the rest of the algorithm. We assume that there are in total $N_{\gamma,t}$ and $N_{\phi,t}$ inequalities defining $\Gamma_t$ and $\Phi_t$, respectively.
We define $K_{\gamma,t}$ and $K_{\phi,t}$ as the dimensions of $\Gamma_t$ and $\Phi_t$. In lines 4-12 and 13-21, we iterate over all sets of size $K_{\gamma,t}$ and $K_{\phi,t}$ of linear inequalities defining the polyhedra. The solution of the corresponding system of linear equalities defines a vertex if it satisfies all remaining constraints. We evaluate the norm of those vertices that do not depend on random data and keep track of the maximum, while we store the LU factorizations of the systems whose right-hand sides are random. Note that for the computation of $\gamma_t$, we only require the norm of the components that correspond to $\lambda_2$ in Lemma 2, which we denote by $x^I_{\gamma,2}$ for a specific set of inequality constraints $I$. We also remark that for $\gamma_t$ all vertices except the origin depend on the randomness unless either $\kappa_t$ is independent of the randomness (stagewise independence) or the objective is deterministic.
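The vertex enumeration described above can be sketched as follows for a bounded polyhedron $\{x : Ax \leq b\}$ with deterministic data. This is a simplified illustration and not the authors' implementation: it solves each square system by exact Gauss-Jordan elimination instead of storing LU factorizations, it takes the infinity norm of the whole vertex rather than of a sub-block, and the names `solve_square` and `max_vertex_norm` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Optional, Sequence


def solve_square(a: Sequence[Sequence[Fraction]],
                 b: Sequence[Fraction]) -> Optional[List[Fraction]]:
    """Solve a x = b by Gauss-Jordan elimination; return None if a is singular."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def max_vertex_norm(a, b) -> Optional[Fraction]:
    """Largest infinity norm over the vertices of {x : a x <= b}."""
    a = [[Fraction(v) for v in row] for row in a]
    b = [Fraction(v) for v in b]
    dim = len(a[0])
    best = None
    # Every vertex is the solution of `dim` inequalities holding with equality.
    for idx in combinations(range(len(a)), dim):
        x = solve_square([a[i] for i in idx], [b[i] for i in idx])
        if x is None:
            continue  # the selected inequalities are linearly dependent
        feasible = all(sum(ai * xi for ai, xi in zip(row, x)) <= rhs
                       for row, rhs in zip(a, b))
        if feasible:
            norm = max(abs(xi) for xi in x)
            best = norm if best is None else max(best, norm)
    return best


# Hypothetical triangle: x >= 0, y >= 0, x + 2y <= 4.
triangle_norm = max_vertex_norm([[-1, 0], [0, -1], [1, 2]], [0, 0, 4])
```

The number of candidate subsets grows combinatorially with the number of inequalities, so this brute-force approach is only practical for small polyhedra.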
In lines 23-29, we compute $\phi_t(\xi_t)$ by solving the linear systems in $\mathcal{I}_\phi$ for all possible realizations of $\xi_t$ using the stored LU factorizations. In lines 37-44, we compute $\phi_t(\tilde\xi_t)$ for the realizations of $\tilde\xi_t$ and additionally compute $\gamma_t(\xi_t, \tilde\xi_t)$.
Given these quantities, we easily obtain $\theta_t(\xi_t, \tilde\xi_t)$ in line 43 and finally $D_L$ in line 47. Note that if either $\Gamma_t$ or $\Phi_t$ is independent of the stage, or at least identical in some stages, the algorithm can be modified by changing the outer loop in line 1 in the obvious way to avoid repeated computations.

The flowergirl problem
As a demonstration, we consider a multi-stage extension of the classical newsvendor problem: the problem of a flowergirl who sells flowers, faces a random demand and a random sales price, and can store excess flowers for the following periods. The problem has $(T + 1)$ stages, with stage $t = 0$ being the deterministic start state. In every stage $t$, we start with the inventory level $S_t$, which is limited by the storage capacity $\bar S_t$. After the demand $\xi^1_t$ and the price $\xi^2_t$ become known in stage $t$, the flowergirl sells $x^2_t$ flowers and places an order $x^1_t$ for flowers to be delivered by a wholesaler at a price $p$ on the next day. If the available quantity exceeds the demand, the flowergirl adds the excess to her inventory for sale in $(t + 1)$. Due to the perishable nature of flowers, a fraction $k \in (0, 1)$ of the stored flowers spoils by the next day. The order in stage $t$ has to be placed without knowing the random demand $\xi^1_{t+1}$. On the next day, the flowers can be sold at a market price $\xi^2_{t+1}$ that is not known on day $t$. The flowergirl starts in period $t = 0$ without any stock and without demand, i.e., $S_0 = 0$ and $\xi^1_0 = 0$. The decisions in every stage consist of the number of flowers to order for the next stage $x^1_t$, the number of flowers to sell $x^2_t$, and the inventory level of the next day $x^3_t$. Note that, as described in Remark 1, the environmental state variable $S_{t+1}$ is represented by $x^3_t$ so as to make the feasible set fit (2). Putting everything together, the problem can be formulated as

We consider the two Markov processes $\xi$ and $\tilde\xi$ presented in Fig. 3a, b with transition probabilities

To bound the difference in the optimal values, we calculate $D_L(\xi, \tilde\xi)$.
As detailed in Algorithm 1, the constant $\gamma_t(\xi_t, \tilde\xi_t)$ can be obtained by maximizing $\|\lambda_2\|_\infty$ over the extreme points of the polyhedron $\Gamma$ in the variables $(\lambda_2, \lambda_3, \lambda_4, \lambda_6, \lambda_7)$. Similarly, the constant $\phi_t(\xi_t) = \phi(A_1, b_{1,t}(\xi_t) + C^+_1 b_{2,t}, A_2, b_{2,t+1})$ can be found by maximizing $\|x\|_\infty$ over the extreme points of the polyhedron

Having calculated $\gamma_t$ and $\phi_t$, we proceed by computing $\theta_t(\xi_t, \tilde\xi_t)$ using (13) and (14). Then we can determine the joint distribution $\pi$ that minimizes the distance between the processes by solving the linear optimization problem (20). The resulting optimal transportation plan yields a distance of $D_L(\xi, \tilde\xi) = 7.03$. The optimal value of our problem for $\xi$ is equal to 126.59 and for $\tilde\xi$ the optimal value equals 129.16, resulting in a difference of 2.58. Hence, our bound overestimates the difference in the optimal values by 4.45.
Lastly, we compare the performance of $D_L$ to the performance of the nested distance defined in [41,42]. For this calculation, it is necessary to simplify the problem to make the constraints independent of the randomness. To this end, we fix the demand at each stage. In particular, we assume that the demand is equal to 0, 11, and 9 in the stages $t = 0$, 1, and 2, respectively. For this simplified setup, we obtain $D_L(\xi, \tilde\xi) = 3.94$ and $D_T(\xi, \tilde\xi) = 4.31$, demonstrating that for our problem $D_L$ provides a tighter bound than $D_T$ (see also Remark 12).
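The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are illustrative. Note that the rounded optimal values give a gap of 2.57, which agrees with the reported 2.58 up to rounding of the underlying values.

```python
# Figures reported in the text for the flowergirl example.
d_l_full = 7.03          # D_L for the original problem
value_xi = 126.59        # optimal value for the process xi
value_xi_tilde = 129.16  # optimal value for the process xi-tilde
reported_gap = 2.58      # difference in optimal values as reported

# The rounded optimal values reproduce the reported gap up to rounding.
gap_from_rounded_values = abs(value_xi_tilde - value_xi)
consistent = abs(gap_from_rounded_values - reported_gap) <= 0.011

# Bounding property: the gap in optimal values must not exceed D_L.
bound_holds = reported_gap <= d_l_full
slack = round(d_l_full - reported_gap, 2)

# Simplified problem with deterministic demand: D_L versus the nested distance D_T.
d_l_simple, d_t_simple = 3.94, 4.31
d_l_is_tighter = d_l_simple < d_t_simple
```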

Conclusions
Stochastic optimization problems with a Markovian structure strike a good balance between the complexity of the underlying randomness and the expressiveness of the corresponding problem class. In particular, since scenario lattices offer leaner discretization structures than scenario trees, the unfavorable computational properties of general stochastic optimization problems can be, in part, mitigated.
In this paper, we define a family of problem-dependent semi-distances for linear stochastic optimization problems with Markovian structure that can be used to bound objective values. We also show that every Markov process can, in theory, be approximated to arbitrary precision in terms of the defined distances. Therefore, the concepts in this paper can be used to find arbitrarily precise discrete approximations of complicated problems, possibly with continuous state spaces.
Furthermore, we contribute to the literature on transportation distances by an approach that is capable of dealing with randomness in the constraints. This necessitates a different technique of proof, since the transport of solutions between problems becomes impossible in this framework. We therefore base our results on stability results for linear programs.
In this paper, we laid the foundations for a theory-driven method to generate scenario lattices. Further research is required to find computationally efficient ways to do so and to evaluate the outcomes on real-world problems.
Funding Open Access funding enabled and organized by Projekt DEAL.