Partition-based Distributionally Robust Optimization via Optimal Transport with Order Cone Constraints

In this paper we wish to tackle stochastic programs affected by ambiguity about the probability law that governs their uncertain parameters. Using optimal transport theory, we construct an ambiguity set that exploits the knowledge about the distribution of the uncertain parameters, which is provided by: (1) sample data and (2) a-priori information on the order among the probabilities that the true data-generating distribution assigns to some regions of its support set. This type of order is enforced by means of order cone constraints and can encode a wide range of information on the shape of the probability distribution of the uncertain parameters such as information related to monotonicity or multi-modality. We seek decisions that are distributionally robust. In a number of practical cases, the resulting distributionally robust optimization (DRO) problem can be reformulated as a finite convex problem where the a-priori information translates into linear constraints. In addition, our method inherits the finite-sample performance guarantees of the Wasserstein-metric-based DRO approach proposed by Mohajerin Esfahani and Kuhn (Math Program 171(1-2):115-166. https://doi.org/10.1007/s10107-017-1172-1, 2018), while generalizing this and other popular DRO approaches. Finally, we have designed numerical experiments to analyze the performance of our approach with the newsvendor problem and the problem of a strategic firm competing \`a la Cournot in a market.

Mathematics Subject Classification (2010) 90C15 Stochastic programming · 90C47 Minimax problems 1 Introduction Distributionally robust optimization (DRO) is a powerful modeling framework for optimization under uncertainty that emerges from considering that the probability distribution of the problem's uncertain parameters is, in itself, also uncertain. This gives rise to the notion of the ambiguity set, that is, a set where the modeler assumes that the true distribution of the problem's uncertain parameters is contained. The goal of DRO is therefore to find the decision maker's choice that is optimal against the worst-case probability distribution within the prescribed ambiguity set. Hence, DRO can be seen as a marriage between stochastic programming and robust optimization, working with probability distributions as the former does, while hedging the decisionmaker against the worst case as the latter typically aims to do. Since the work of [54], many DRO models have been proposed and studied in the technical literature, especially over the last decade, in which DRO has attracted a lot of attention and become very popular in the field of optimization under uncertainty as an alternative to other paradigms. We refer the reader to [32,50] for recent surveys on DRO and optimization under uncertainty. Naturally, the construction of the ambiguity set is key to the practical performance of DRO. It is no wonder, therefore, that much effort has been applied to this issue, resulting in several ways to specify and characterize the ambiguity set, namely: 1. Moment-based approach: The ambiguity set is defined as the set of all probability distributions whose moments satisfy certain constraints; see, [20,27,37,36,41,43,63,67], to name a few. 2. Dissimilarity-based approach: The ambiguity set is defined as the set of all probability distributions whose dissimilarity to a prescribed distribution (often referred to as the nominal distribution) is lower than or equal to a given value. Within this category, the choice of the dissimilarity function leads to a wealth of distinct variants: (a) Optimal-transport-based (OTP) approach: Here, we include the work in [11,10,26,42,55], among many others, all of which use, as the dissimilarity function, the well-known Wasserstein distance, which exhibits some nice statistical convergence properties. Our work is also based on optimal mass transportation and consequently, it falls within this category. (b) φ -divergences-based approach: This class comprises all those approaches, which use φ -divergences (such as the Kullback-Leibler divergence), for instance, [4,5,44]. We also include in this group the likelihood-based approaches, proposed by [21] and [61]. (c) Other measures of dissimilarity: This category includes all other dissimilaritybased procedures for constructing ambiguity sets than those already mentioned, such as those that utilize the family of ζ -structure probability metrics (for example, the total variation metric, the Bounded Lipschitz metric ...), see, for example, the work in [49] and [66], and the Prokhorov metric [22].
of occurrence of certain events. For instance, in the multi-item newsvendor problem, the experienced decision maker may state that high demand values for a certain item are more likely to occur than low ones. This can occur, for example, when the true data-generating probability distribution is known or believed to be a mixture of distributions. In this case, determining the number of partitions in our DRO approach would be equivalent to estimating the number of components of the mixture. Indeed, one should use a number of partitions close to the number of distributions in the mixture. The task of inferring that number and the contribution of each component to the mixture is a relevant and well-known problem in statistics, which falls within the so-called realm of finite mixture models (see, Chapter 6 of [40]). In our approach, however, we assume that part of this inference task has already been done and so some of the inference results are available to the decision maker. Our aim is to exploit this type of qualitative information in the construction of the ambiguity set. Most importantly, our DRO approach protects the decision maker against the ambiguity in this inference process. For this purpose, we propose partitioning the support of the random parameter vector and bestow a partial order on (some of) the probability masses of the resulting subregions. This partial order can be described by a graph, which, in turn, can be associated with a convex cone. Consequently, the partial order can be embedded into the formulation of the ambiguity set in the form of conic constraints. The use of these types of cones is well known in the field of statistical inference with order restrictions (see, [45,57]). 2. As shown in the numerical tests, this partial order can be leveraged, among other things, to easily encode multi-modality using linear constraints, as opposed to other approaches based on semidefinite programming (see, for example, the work in [30]), with the consequent benefit in terms of computational complexity. The recent papers [15,34,35] consider ambiguity sets with moment and generalized unimodal constraints. Our approach, however, can practically model a wider range of "shapes" beyond unimodality (see Subsection 2.3 for more details). 3. In addition to the order cone constraints on the probability masses linked to the different subregions of the partitioned sample space, these probability masses can also be treated as random, with their probability distribution belonging to a certain ambiguity set. This way, our modeling framework extends the two popular DRO paradigms proposed by [42], and [4,5], respectively. Indeed, -If we consider one partition only, that is, the entire sample space itself, there is no uncertainty about the associated probability mass (which is, evidently, equal to one) and no partial order can be established. If we now use a distance as the transportation cost function, our DRO framework reduces to that of [42]. -On the contrary, in order to get the DRO framework of [4,5], we just need to i) consider a number of partitions such that every partition contains a single data point from the sample, ii) assume that the distribution of their probability masses belongs to a φ -divergence-based ambiguity set and iii) ignore any other information on the true probability distribution of the problem's uncertain parameters (namely, partial order and ambiguity in the conditional distributions).
For their part, the authors in [16] have proposed a different ambiguity set that also covers these two DRO approaches as special cases. However, their ambiguity set does not include the DRO framework we propose, as we note later. 4. Under mild assumptions, we provide a tractable reformulation of our proposed DRO framework and show that it enjoys finite sample and asymptotic consistency guarantees. 5. Finally, we numerically illustrate the benefits in having a-priori information by comparing our DRO framework with the well-known sample average approximation (SAA) solution and the Wasserstein metric-based approach of [42]. To this end, we consider the single and multi-item newsvendor problems and the problem of a strategic firm competingà la Cournot in a market.
The rest of the paper is organized as follows. Section 2 includes some preliminaries to the optimal transport problem, we formulate the proposed DRO approach and present tractable reformulations. Convergence properties and performance guarantees are theoretically discussed in Section 3. Section 4 provides the results from numerical experiments. Finally, Section 5 concludes the paper.
Notation. We use R to denote the extended real line, and adopt the conventions of its associated arithmetic. Furthermore, R + denotes the set of non-negative real numbers. We employ lower-case bold face letters to represent vectors and bold face capital letters for matrices. We use diag(a 1 , . . . , a m ) for a diagonal matrix of size m × m whose diagonal elements are equal to a 1 , . . . , a m . Moreover, given a matrix M, its transpose matrix will be written as M T . We define e as the array with all its components equal to 1. The inner product of two vectors u, v (in a certain space) is denoted u, v = u T v. Given any norm · in the Euclidean space (of a given dimension d), the dual norm is defined as u It is well known that if f is a proper function, then f * is also a proper function. Given a set A ⊆ R d , we denote its relative interior as relint(A). Similarly, we refer to its interior as int(A). The support function of set A, S A , is defined as S A (b) := sup a∈A b, a . The dual cone C * of a cone C is given by C * := {y / y, x 0, ∀x ∈ C }. We use the symbol δ ξ to represent the Dirac distribution supported on ξ . In addition, we reserve the symbol " " for objects which are dependent on the sample data. The symbols E and P denote, respectively, "expectation" and "probability." Finally, for the rest of the paper we assume that we always have measurability for those objects, whose expected values we consider.
2 Data-driven distributionally robust optimization model First, we briefly introduce some concepts from the optimal transport problem (also known as the mass transportation problem) that are at the core of the development of our DRO framework.
Intuitively speaking, the optimal transport problem (OTP) centers on the question of how to move masses between two probability distributions in such a way that the transportation cost is minimal. Let P and Q be two probability distributions in a Polish space S such that P is the distribution of mass seen as the origin (i.e. the source) and Q is the distribution of mass seen as the destination (i.e., the sink), and let c be a measurable cost function with c(x, y) representing the cost of moving a unit of mass from location x to location y. The OTP can be stated as follows C(P, Q) = inf Π c(x, y)Π (dx, dy) :Π is a joint distribution with marginals P and Q, respectively We assume that the cost function c is a non-negative jointly convex lower semicontinuous function such that if x = y , then c(x, y) = 0. In the remainder of the paper we assume that we have existence and uniqueness of the OTP (see, for example, Theorem 4.1 in [59]).
For more technical details about the assumptions on the cost function in the OTP, we refer to [59] and [53]. Note that if we choose a distance on S as the cost function (for example, a p-norm, with p ≥ 1, if S is the Euclidean space R n ), we get the socalled Wasserstein metric of order 1, which we represent as W (P, Q) and which is also known as the Kantorovich metric.
It is well known that this probability distance metrizes the weak convergence property. Furthermore, convergence with respect to the Wasserstein metric of order 1 is equivalent to weak convergence plus convergence of the first moment. Wherever the Wasserstein metric of order 1 is used in this paper, we implicitly consider the set of all probability distributions with finite moment of order 1. Likewise, we refer to the Wasserstein ball of radius r 0 centered at a certain nominal probability distribution P 0 , which we denote by B r (P 0 ), as the set of all probability distributions whose Wasserstein metric of order 1 to P 0 is at most r.

Formulation of the proposed model
Problem (P) below formulates the data-driven distributionally robust optimization (DDRO) framework we propose.
where X ⊆ R n is the set of feasible decisions, ξ : Ω → Ξ ⊆ R d is a random vector defined on the measurable space (Ω , F ) with σ -algebra F , and Q is the set of all probability distributions over the measurable space (Ω , F ). Moreover, for each i ∈ I , Q i is the conditional distribution of Q given ξ ∈ Ξ i , that is Q i = Q(ξ / ξ ∈ Ξ i ) ∈ Q i , with Q i being the set of all conditional probability distributions of Q given ξ ∈ Ξ i . In this setting, I is the set of regions Ξ i with pairwise disjoint interiors into which the support set Ξ is partitioned, that is, i∈I (Ξ i ) = Ξ and int(Ξ i ) int(Ξ j ) = / 0, ∀i, j ∈ I , i = j. Furthermore, we assume that Q * (Ξ i ∩ Ξ j ) = 0, ∀i, j ∈ I , i = j, where Q * is the true data-generating distribution. This is equivalent to stating that {Ξ i } i∈I constitutes a Q * -packing (see a formal definition of this concept in page 50 of [28]) and will allow us to unequivocally assign samples from Q * to the partitions Ξ i , i ∈ I . Finally, constraint (1c) defines the set of all probability vectors p that differ from the nominal empirical probability vector p in at most ρ according to the cost function c. This is a function that quantifies how dissimilar two probability vectors p and q are. For this purpose, we require that c be a non-negative jointly convex lower semicontinuous function such that if p = q, then c(p, q) = 0. As mentioned further on, function c could, for example, take the form of a norm or a φ -divergence. To ease the notation and the formulation, we use ξ to represent either the random vector ξ (ω), with ω ∈ Ω or an element of R d . Note that we can consider the probability measure induced by the random vector ξ , if we choose the corresponding Borel σalgebra B on Ξ . Thus, we can see Q as a set of probability measures defined over (Ξ , B), so we write Q = Q(Ξ ). We define the uncertainty set P for the probability vector p ∈ R |I | , with |I | being the number of partitions, as the intersection of Θ and the set defined by constraint (1c). The support set Θ , which includes the order cone constraints on the probability masses p, is given by: where C is a proper (convex, closed, full and pointed) cone. Hence, Θ is a convex compact set.
In problem (P), ρ and ε are non-negative parameters, to be tuned by the decision maker, which control the size of the ambiguity set defined by equations (1b)-(1f). We represent this set as U ρ,ε ( Q), where Q is a nominal distribution expressed in terms of p and Q i as where and Additionally, I = {i ∈ I such that partition i does not contain any data from the sample}, . . , ξ i N i } and N i is the number of atoms in region Ξ i . Here we set N i = 1 and ξ i 1 := arg sup ξ ∈Ξ i f (x, ξ ) for those i ∈ I . Implicitly, we assume that this supremum is attained. We remark that this modeling choice protects the decision maker in those cases where there is a total absence of information on the conditional distributions Q i , i ∈ I . Indeed, by introducing the "artificial" data point ξ i 1 := arg sup ξ ∈Ξ i f (x, ξ ) in a partition Ξ i with no samples, we are considering the worst-case form that the true conditional distribution Q i could possibly take, that is, a Dirac distribution supported on ξ i 1 . Finally, we note that the ambiguity set defined by constraints (1b)-(1f) is unequivocally determined by specifying the partitions Ξ i , i ∈ I , the nominal distribution Q, the budgets ρ and ε, and the order cone constraints p ∈ C in (2). In fact, if these constraints are removed and we set ρ = ε = 0, then we have p i = p i and Q i = Q i , ∀i, and therefore, Q = Q.
The following theorem shows that problem (P) can be reformulated as a singlelevel problem.
Theorem 1 (Reformulation based on strong duality) For any non-negative values of parameters ε, ρ, problem (P) is equivalent to the following: is the convex conjugate function of c(·, p), with p fixed, and 1 Proof Recall that we have assumed that regions Ξ i are disjoint. Thus, using the law of total probability, we can rewrite problem (P) as follows: where we have considered the subproblem (SP): The probability distribution Q i is defined as Note that the structure of problem (7) does not fit in the general ambiguity set proposed in [16].
Equivalently, we can recast the subproblem (SP) as is a joint distribution of ξ and ξ with marginals Q i and Q i , respectively where reformulation (10) follows on from the fact that the marginal distribution of ξ is the discrete uniform distribution supported on points ξ [42]. The mathematical program (10) constitutes a generalized moment problem over the normalized measures Q i j , for which strong duality holds (see, for example, [56]). We can, therefore, dualize the ε-budget constraint on the transport cost, thus obtaining: where the second equality derives from the fact that we can choose a Dirac distribution supported on Ξ i as Q i j . Now, dualizing the ρ-budget constraint on the transport cost in the inner supremum of problem (7), we obtain: Thus, Since function θ ε +∑ i∈I is upper semicontinuous and concave in p on the compact convex set Θ (recall that c is nonegative, lower semicontinuous, and convex in p), and linear in θ and t i, j on the convex set defined by θ 0 and (17), we can apply Sion's min-max theorem ( [58]) and in this way, interchange the innest infimum with the outer supremum. Then, by merging the two infima, we arrive at We focus now on the inner supremum, where we have written ∑ i∈I . This is a concave maximization problem (be aware that p, H(x) − λ c(p, p) is a concave function with respect to p and Θ is a convex compact set; furthermore, notice that we have in our particular case). Consequently, strong duality holds if a Slater condition is satisfied, that is, if there exists a point p * ∈ relint(R |I | + ) such that e, p * = 1, and p * ∈ int(C ) (see, for example, [13]). Using a standard duality argument, we dualize the constraints p ∈ R |I | + , e, p = 1 and p ∈ C , with associated multipliers µ ∈ R |I | + , η ∈ R and p ∈ C * , respectively. Thus, we obtain the following problem: is the convex conjugate function of c(·, p), with p fixed. Therefore, problem (7) can be equivalently reformulated as follows: Moreover, in the case that the cost function c(·, ·) is given by a norm, we have c p (p) = p − p . The next corollary deals with this particular case.
Corollary 1 If the cost functions c(·, ·) and c(·, ·) are given by norms, then for any non-negative values of parameters ε, ρ, the problem (P) is equivalent to the following problem Proof We use the following Lemma to put problem (P0) in a better shape.
Lemma 1 Let c p (p) = p − p , where p ∈ R |I | is a fixed vector and · a norm in R |I | . Then, it holds that the convex conjugate function of c p (p) is as follows Proof The claim of the Lemma follows from Proposition 5.1.4. (vii) and Example 5.1.2 (b) of [39].
Therefore, problem (P0) reduces to Remarks. Our data-driven DRO framework (P) can be easily understood as a generalization of other popular DRO approaches. To see this, first we need to remove the order cone constraints on the probabilities associated with each subregion into which the support Ξ has been partitioned, that is, the condition p ∈ C , and then proceed as indicated below: 1. If we set ε = 0, |I | = N, with every partition containing a single and different data point from the sample, and use a φ -divergence to build the cost function, i.e., c p (p) = ∑ i∈I p i φ p i p i and hence, c * p (s) = ∑ i∈I p i φ * (s i ), then our data-driven DRO approach boils down to that of [5] and [4]. 2. On the contrary, if we set |I | = 1, c is given by a norm and take c p (p) = p − p (hence c * p (s) = ∑ i∈I p i s i if s * 1), we get the model of [42]. Finally, we remark that constraint (6) for each i ∈ I is equivalent (under the assumptions we make on the transportation cost function) to t i,1 sup ξ ∈Ξ i f (x, ξ ).

Tractable reformulations
In this section we provide nice reformulations of our DRO model (P) under mild assumptions. For this purpose, we make use of the theoretical foundations laid out in [42]. Likewise, some extensions to our model, such as the extension to two-stage stochastic programming problems, are omitted here for brevity and because they can be easily derived in a similar way as found in [42] for the data-driven DRO approach they develop.
We start our theoretical development with the following assumption.
Assumption 1 We consider that Ξ i , for each i ∈ I , is a closed convex set, and that f (x, ξ ) := max k K g k (x, ξ ), with g k , for each k K, being a proper, concave and upper semicontinuous function with respect to ξ (for any fixed value of x ∈ X) and not identically ∞ on Ξ i .
Theorem 2 below provides a tractable reformulation of problem (P1) as a finite convex problem. For ease of notation, we suppress the dependence on the variable x (bearing in mind that this dependence occurs through functions g k , k K).
Theorem 2 If Assumption 1 holds and if we choose a norm (in R d ) as the transportation cost function c, then for any values of ρ and ε, problem (P1) is equivalent to the following finite convex problem: Proof In essence, the complexity of problem (P1) depends on our ability to reformulate the supremum in constraint (23) in a tractable manner. This is possible under Asummption 1, following similar steps to those in the proof of Theorem 4.2 in [42], to which we refer.
We note that Asummption 1 covers the particular case where functions g k , k K, are affine and, as a result, f is convex piecewise linear. The single-item newsvendor problem, which we illustrate in the first part of Section 4, constitutes a popular example of this case.

Separable objective function
Now we extend the results presented above to a class of objective functions which are additively separable with respect to the dimension d. We assume here that ξ = (ξ 1 , . . . , ξ d ), where ξ l ∈ R p , for each l = 1, . . . , d. Furthermore, we consider the separable norm ξ d := ∑ d l=1 ξ l associated with the base norm · (on R p ). Finally, we assume that the function f is given as follows: In this case, the complexity of the resulting DRO problem is linear with respect to the number N of samples. The multi-item newsvendor problem, which we illustrate in the second half of Section 4, constitutes a popular example of this case.
, {g lk } k K satisfy Assumption 1 for all l d, and Ξ i , for each i ∈ I , is given by the Cartesian product of closed convex sets (that is, Ξ i := ∏ d l=1 D i l , with D i l a closed convex set), and if we choose the norm · d as the transportation cost function c, then for any values of ρ and ε, problem (P) is equivalent to the following finite convex problem: Proof The proof runs in a similar way to that of Theorem 6.1 in [42].
Remarks. If the transportation cost function c is not a norm, there are still some cases where the constraint (6) can be reformulated in a tractable way. In general, equation (6) can be seen as the robust counterpart of a constraint affected by the random parameter vector ξ , with Ξ i playing the role of the so-called uncertainty set. In our case, the tractability of (6) depends on the nature of each set Ξ i and each . Indeed, suppose that every Ξ i is a closed convex set, then: -If the function f is concave in ξ , so is each function α i j (ξ ) (recall that the transportation cost function c is assumed to be convex and that θ is non-negative). As [51] points out, this is a tractable instance and tractable reformulations of constraint (6) can be obtained using Fenchel duality following the guidelines in [6]. -In contrast, the case in which some α i j (ξ ) are convex is much more challenging and may call for approximation methods such as the one proposed by [51].
In any case, we need to compute convex conjugate functions, which is, in itself, a complicated problem in general. For assistance in this regard, one may resort to symbolic computation in order to get closed formulas for convex conjugate functions (see, for example, [12]).

Order cone constraints
To account for a-priori knowledge about the probability distribution of the random parameter vector ξ (for example, the decision maker may have some information about the shape of this distribution), we propose to convey this knowledge using order constraints on the probability masses p i associated with each subregion Ξ i into which the support Ξ of ξ is partitioned. These order constraints are based on order cones, which, in turn, can be represented in the form of graphs.
We can build order cones from graphs that allow for the comparison of all probabilities p i . In that case, we say that the graph, and the associated cone, establish a total order. If, on the contrary, the graph only allows some of those probabilities to be compared, we talk about partial order. For more details about order cones we refer the reader to [45].
Below, we present some common choices of order cones.
-Simple order cone (monotonicity): -Tree order cone: -Star-shaped cone (decrease on average): -Umbrella cone (unimodality): An order cone is a polyhedral convex cone and as such, can be algebraically expressed in the form C = {p ∈ R |I | : Ap 0}, with A being a matrix of appropriate dimensions. Its dual C * can, therefore, be easily computed as C * = { p = A T ν : ν 0} (see, for instance, Corollary 3.12.9 in [57]). Notwithstanding, our DRO approach can be equally applied under other types of support sets, as long as the problem (22) admits a strong dual (we refer the interested reader to [5] for a list of types of support sets under which strong duality holds).
As compared to other approaches available in the technical literature, order cones provide a straightforward way of encoding modality information in the ambiguity set of the DRO problem. For instance, [30] indirectly introduces multi-modality information by imposing first and second moment conditions on the different ambiguous components of a mixture with known weights. Their approach, however, results in a semidefinite program. Unlike [30], the authors in [35] explicitly incorporate modality information into their ambiguity set through moment and generalized unimodal constraints. Nonetheless, they still need to solve a semidefinite program and their DRO approach overlooks the data-driven nature of those constraints. In [15], they construct an ambiguity set made up of those absolutely continuous probability distributions whose density function is bounded by some bands with a certain confidence level. Their approach can be used to impose monotonicity or unimodality of the probability distributions, but can only be applied to the univariate case.
Beyond modality, the order cone constraints on the partition probabilities that characterize our DRO approach equip the decision maker with a versatile and intuitive framework to exploit information on the shape of the ambiguous probability distribution. For example, as we do in the numerical experiments in Section 4, we can construct an order cone that constrains the ratios among the partition probabilities, which can be seen as a discrete approximation of encoding "derivative" information on the ambiguous probability distribution (if this admits a density function). Likewise, other order cones could be used to bestow some sense of "convexity" on this distribution.

On convergence and out-of-sample performance guarantees
In this section, we show that our DRO approach (P) naturally inherits the convergence and performance guarantees of that introduced in [42]. For this purpose, we first need to recall some terminology and concepts from this paper to which we will resort later on. Throughout this section, we denote the training data sample (that is, the sample path sequence) as Following [42], Ξ N can be seen as a random vector governed by the probability distribution P N := Q * × · · · × Q * (N times) supported on Ξ N (with the respective product σ -algebra).
In the remainder of this paper, we will denote the optimization problem associated with problem (P) under the true probability distribution Q * as (P * ) (that is, the problem defined as J * := inf x∈X E Q * [ f (x, ξ )]). We then say that a data-driven solution for problem (P * ) is a feasible solution x N ∈ X which is constructed from the sample data. Furthermore, the out-of-sample performance of a data-driven solution In line with [42], given a data-driven solution x N , a finite sample guarantee is a relation in the form where J N is a certificate for the out-of-sample performance of x N (i.e., an upper bound that is generally contingent on the training dataset), β ∈ (0, 1) is a significance parameter with respect to the distribution P N , on which both x N and J N depend. Moreover, we refer to the probability on the left-hand side of (32) as the reliability of ( x N , J N ). Ideally, we strive to develop a method capable of identifying a highly reliable data-driven solution with a certificate as low as possible.
The data-driven DRO approach that we propose in this paper to address problem (P * ) accounts for the uncertainty about the true data-generating distribution Q * , while taking advantage of some a-priori order information that the decision maker may have on some probabilities induced by Q * over a partition of the support set Ξ . Below, we claim that the pair ( x N , J N ) provided by our distributionally robust optimization problem (P) features performance guarantees in line with those discussed in [42]. More specifically, for a suitable choice of the ambiguity set, the optimal value J N of problem (P) constitutes a certificate of the type (32) providing a confidence level 1 − β on the out-of-sample performance of the data-driven solution x N . This can be formally stated under some assumptions about the underlying true conditional probability distributions.
To this end, we first provide probabilistic guarantees on the partition probabilities p i , ∀i |I |. In this vein, note that the empirical probability p i , defined as in Equation (4), can be modeled as a binomial distribution with success probability p * i , divided by the total number of trials. Consequently, by the Strong Law of Large Numbers (SLLN), p i converges to p * i almost surely. Now suppose that we choose a φ -divergence as c, where φ is a twice continuously differentiable function around 1 with φ (1) > 0. Then, take β p > 0. If we choose as ρ the value we get a confidence set of level 1 − β p on the true partition probabilities p * (see [5] and [4]).
If, alternatively, we choose the total variation distance as c, we can use Equation (19) in [29] to take ρ as and obtain a confidence set of level 1 − β p on p * . Next we establish a concentration tail inequality of the probability weighted Wasserstein metric of order 1 between each conditional distribution and its respective true conditional distribution. For this purpose, we first need to make the following assumption: Assumption 2 (Light-tailed Conditional Distributions) For each i ∈ I , there exist a i , γ i ∈ R, with a i > 1 and γ i > 0 such that The following theorem provides a tail concentration inequality for the weighted sum of the Wasserstein metrics of order 1 between the true and empirical conditional distributions.
Theorem 4 (Concentration Inequality for the Conditional Distributions ) If Assumption 2 holds, for each i ∈ I , given β i ∈ (0, 1] we have that ∀N i 1, dim(ξ ) = 2 and for all ε > ∑ i∈I p i ε N i (β i ), for any values p i , i ∈ I such that p i 0 and ∑ i∈I p i = 1, the following holds Proof. Given Assumption 3, for all i ∈ I , we deduce from Theorem 2 in [23] that Thus, we have that Theorem 4 sets the probabilistic bound ∑ i∈I p i ε N i (β i ) on the weighted Wasserstein metric of order 1 between each conditional distribution and its respective true conditional distribution, with at least confidence level 1 − ∑ i∈I β i . We remark that, if the partitions are compact, stronger results like those in Theorem 2 of [31] could be used to choose the radii of the Wasserstein balls. More specifically, the result in Theorem 2 of [31] depends on the diameter of the compact support set (i.e., the maximum distance between two elements of that set). The result stated in our theorem, in contrast, is valid for unbounded partitions, as it only requires the true conditional distribution associated with each partition be light-tailed. The next theorem states the finite-sample guarantee performance of the proposed DRO method we develop in this paper: Theorem 5 (Finite sample guarantee) Suppose that Assumption 2 holds and that we have chosen as ρ the value given by Equation (33) or (34). Then, the finite sample guarantee (32) holds with at least confidence level (1 − β p )(1 − ∑ i∈I β i ).
Proof. The claim follows from Theorem 4 and Equations (33) and (34), which imply that P(Q * ∈ U ρ,ε ( Q)) (1 − β p )(1 − ∑ i∈I β i ). Hence, Remarks. In practice, proper values for ε and ρ are set by way of data-driven procedures like bootstrapping or cross-validation, as we illustrate in the numerical experiments in Section 4.2 (see also [14], [18], [42], [55], and [62] for more examples). These procedures allow the decision maker to tune those parameters as a function of sample size N in order to get reliable decisions without giving up too much on out-of-sample performance. Following this line, and as noted in Remark 5 in [33], the requirement to include the true distribution inside the ambiguity set is only a sufficient, but not necessary condition to ensure a finite sample guarantee. Indeed, this guarantee can be sustained even if the parameters of the ambiguity set are reduced below the lowest values for which the ambiguity set represents a confidence set for the true distribution.
Furthermore, recall that the partition probabilities p belong to the support set Θ defined by the order cone constraints. Since we assume that these constraints are coherent with the true distribution Q * , we do not need to explore those probability measures Q in the Wasserstein ball B ρ N (β ) that do not comply with them. Consider, for example, the case in which the worst-case distribution in the ball B ρ N (β ) does not satisfy the order cone constraints. One could expect, therefore, that, in practice, our approach could benefit from this fact to produce a data-driven solution x N as reliable as that given by the method of [42], but with a tighter certificate J N . This is precisely what we observe in the numerical experiments that we present below.
We conclude this section with some remarks on the convergence and asymptotic consistency of our DRO approach: We have that, as the number N of samples grows to infinity, where x * (resp. J * ) is an optimizer (resp. the optimal solution value) of problem (P * ). Indeed, assume that Theorem 3.6 in [42] holds, then take a confidence level 1−β , and choose ε and ρ by way of Theorem 4 and Equations (33) (or (34)), respectively. When N grows to infinity, we have, on the one hand, that the conditional distributions converge (in the Wasserstein metric) to their respective true conditional distributions and the probability weights converge a.s. by the SLLN to their respective true values. Therefore, both ε and ρ tend to zero as N increases to infinity. Consequently, our ambiguity set only contains the empirical distribution Q N , which converges almost surely to the true distribution Q * .

Numerical experiments
The following simulation experiments are designed to provide additional insights into the performance guarantees of our proposed distributionally robust optimization scheme with order cone constraints. For this purpose, we consider two test instances: the (single and multi-item) newsvendor problem and the problem of a strategic firm competingà la Cournot in a market. These two problems have been intentionally selected, because they are qualitatively different when addressed by the standard Wasserstein-metric-based DRO approach proposed in [42]. In effect, the former features an objective function f (x, ξ ) whose Lipschitz constant with respect to ξ is independent of the decision x. Consequently, as per Remark 6.7 in [42], the standard Wasserstein-metric-based DRO approach renders the same minimizer for this problem as the sample average approximation, whenever the support of the uncertainty ξ is unbounded. This is, in contrast, not true for the problem of a strategic firm competingà la Cournot in a market, which is characterized by an objective function with a Lipschitz constant over ξ that is a function of x. This allows us to highlight the differences of our approach with regard to [42] in two distinct settings.
All the numerical experiments have been implemented in Python. The optimization problems have been built using Pyomo [2] and solved with CPLEX 12.10 [1] on a PC with Windows 10 and a CPU Intel (R) Core i7-8550U clocking at 1.80 GHz and with 8 GB of RAM. The statistical methods that have been used for the numerical experiments have been coded by means of the module Scikit-learn (see [48]). In what follows we provide some implementation details regarding the proposed model. The numerical experiments have been designed under the following assumptions: 1. A-priori information. Given a fixed and known partition of the sample space Ξ , we can construct an order cone that is consistent with the true probability distribution. That is, the probability masses that the true distribution assigns to each partition verify the order cone constraints. In practice, this a-priori information is determined by the nature of the problem and the random phenomena, and is assumed to be known by the decision maker based on experience and expert knowledge. Furthermore, in the case that the decision maker has no full certainty about the a-priori information, s/he may resort to statistical hypothesis testing to assess the confidence that the partition probabilities belong to a given order cone (see, for instance, [9] and references therein). In our numerical experiments, we specifically apply the following approach: Given a fixed number of partitions (later we explain how the partition set is obtained), we consider that the decision maker knows a total order between the probability masses associated with each of the regions into which the sample space Ξ is split. Furthermore, s/he also knows their ratios approximately, within a certain tolerance (which, in the subsequent experiments, we set to 0.1). For instance, suppose we have three partitions with (true) probability masses of p * 1 = 0.6, p * 2 = 0.3 and p * 3 = 0.1. The decision maker only knows their relative ratios with a tolerance error of 0.1, that is: This way, we get the following order cone constraints: The support set is the Cartesian product of closed intervals (that is, an hypercube, whose size is indicated in each example) and, therefore, is a closed convex set. 3. True distribution. For simulation and analysis, the data-generating distribution is approximated by a certain number of data points (15 000 in the newsvendor setting and 10 000 in the problem of the Cournot producer) drawn from a mixture of three normal distributions, whose characteristics are specified in each of the two examples we consider in the following subsections. Furthermore, those data points that fall outside the support set Ξ are discarded. 4. Construction of partitions Ξ i , i = 1, . . . , |I |: In order to construct the partitions, we proceed as follows: (a) Clustering phase: Firstly, we employ the K-means clustering technique to group the total number of data points that approximate the true data distribution into K clusters. The number K of clusters is decided upon using the well-known Elbow's method (see, for example, [19]). It is based on the value of the average distortion produced by different values of K. If K increases, the average distortion will decrease and the improvement in average distortion will diminish. The value of K at which the improvement in distortion decreases the most is called the elbow. At this value of K, we should stop dividing the data into further clusters and choose this value as the number of clusters. In addition, we assign a label to identify each of the K clusters. In all the numerical experiments that are presented next, the true data-generating distribution is constructed as a mixture of three (univariate or multivariate) normal distributions. We assume that the decision maker has a good estimate of the number of components of this mixture and thus, we consider, for example, four clusters, i.e., K = 4. (b) Decision-tree classifier phase: Once all the clusters have been labelled, we use the aforementioned total number of data points to train a decision-tree multi-classifier with a maximum number of leafs equal to K. The tree will be then used to allocate new data points into one of the K clusters, which, in effect, is equivalent to having a partition of the support set in K disjoint regions. 5. Comparative analysis: We compare three different data-driven approaches to address the solution to problem inf x∈X E Q * [ f (x, ξ )], namely, our approach (DROC), the one of [42] (DROW) and the sample average approximation (SAA). Recall that we denote x * ∈ arg min x∈X E Q * [ f (x, ξ )] and J * = E Q * [ f (x * , ξ )], which, in practice, are unknown to the decision maker, but, for analysis purposes, we estimate using the total number of data points that approximate the true datagenerating distribution. Moreover, in all numerical experiments, we consider the 1-norm as the functions c and c. To compare the three data-driven approaches we consider, we use two performance metrics, specifically, the out-of-sample performance of the data-driven solution (which we also refer to as its actual expected cost) and its out-of-sample disappointment. The former is given by , while the latter is calculated as J * − J m N , where m = {DROC, DROW, SAA} and J m N is the objective function value yielded by the data-driven optimization problem solved by method m. We stress that a negative out-of-sample disappointment represents a favourable outcome. As E Q * [ f ( x m N , ξ )] and J m N are random variables (they are direct functions of the sample data), we conduct a certain number of runs, each with an independent sample of size N. This way we can provide (visual) estimates of the expected value and variability of the out-of-sample performance and disappointment for several values of the sample size N. These estimates are illustrated in the form of box plots in a series of figures. In these figures, the dotted black horizontal line corresponds to either solution x * or to its associated optimal cost J * with complete information (i.e., without ambiguity about the true data distribution). For the sole purpose of conducting a comparison as fairly as possible, parameters ε and ρ in both DROC and DROW are tuned so that the underlying true distribution of the data belongs to the corresponding ambiguity set with, at least, a prefixed confidence level of probability. In the case of the newsvendor examples, we guarantee this by trial and error for simplicity. In practice, however, these parameters should be calibrated by way of a (statistical) procedure that uses the data available to the decision maker, for example, through cross-validation or bootstrapping. We follow this approach in the problem of the Cournot producer. Finally, we stress that, in our approach, caution should be exercised when selecting ε and ρ, as they should be such that problem (P) has at least one feasible solution. This is not guaranteed in the case that the empirical distribution Q does not satisfy the order cone constraints on the probability masses associated with each subregion Ξ i of the support set Ξ . Intuitively, in this case, optimization problem (P) must have enough "budget" (i.e., ε and ρ must be high enough) to "transport" the empirical distribution to another one that complies with the a-priori information. In other words, the ambiguity set of problem (P) must be sufficiently large to contain at least one probability distribution that assigns probability masses verifying the order cone constraints to the partitions.

Newsvendor problems
In this section, we illustrate the theoretical results of our paper on the popular newsvendor problem (also known as the newsboy problem). Many extensions and variants of this problem have been considered since it was first posed in the 50s (see, for example, the work in [25], [17], [3], [47], and references therein). According to [46], The newsboy problem is probably the most studied stochastic inventory model in inventory control theory and the one with most extensions in recent years. This problem reflects many real-life situations and is often used to aid decision making in both manufacturing and retailing. It is particularly important for items with significant demand uncertainty and large over-stocking and under-stocking costs.

The single-item newsvendor problem
In the single-item newsvendor model, the decision maker has to plan the inventory level for a certain product before the random demand ξ for that product is realized, facing both holding and backorder costs. The newsvendor problem can be formulated as where x is the order quantity, and b, h > 0 are the unit holding cost and the unit backorder cost, respectively. Here we have assumed that h = 4 and b = 2.
The demand for the item (unknown to the decision maker) is assumed to follow a mixture (with weights ω 1 = 0.1, ω 2 = 0.35 and ω 3 = 0.55) of the three normal distributions N 1 (0.2, 0.05), N 1 (0.5, 0.1), and N 1 (0.8, 0.05) , truncated over the unit interval [0, 1]. Figure 1a provides a visual illustration of the resulting mixture. Recall that, in the numerical experiments that follow, we have used 15 000 samples drawn from this mixture of Gaussian distributions to approximate the true distribution of the item demand and to partition its support set [0, 1] into four regions, based on the twophase procedure we have previously described. In fact, what we show in Figure 1a is the histogram of those 15 000 data points and its corresponding kernel density estimate.
For the sole purpose of conducting a comparison as fairly as possible, parameters ε and ρ in both DROC and DROW are tuned so that the underlying true distribution of the data belongs to the corresponding ambiguity set with at least 95% of probability. We check whether this condition holds or not a posteriori (by trial and error), by counting the number of runs (out of the one thousand we perform) for which the out-of-sample disappointment is negative.
The values we have used for the parameters ε and ρ in DROC and DROW are collated in Table 1. We insist that these parameters have been adjusted so that at most 50 out of the 1000 runs we have conducted for each sample size N deliver a positive out-of-sample disappointment (that is, to achieve and maintain a similar level of reliability for the data-driven solutions given by DROC and DROW). As expected, therefore, the values of both ε and ρ decrease as the sample size N grows.   1b, 1c, and 1d show the box plots corresponding to the order quantity, the out-of-sample disappointment and the actual expected cost delivered by each of the considered data-driven approaches for various sample sizes. The shaded areas have been obtained by joining the whiskers of the box plots, while the associated solid lines link their medians. Interestingly, whereas the medians of the order quantity estimators provided by SAA are very close to the optimal one x * , their high variability results in (large) disappointment with very high probability. On the contrary, the median of the order quantity delivered by DROW is significantly far from the optimal one (with complete information) for small sample sizes, but it manages to keep the out-of-sample disappointment below zero in return. To do so, however, DROW tends to produce costly (overconservative) solutions on average, as inferred from their actual expected cost in Figure 1d. In plain words, DROW pays quite a lot to ensure a highly reliable/robust order quantity. The proposed approach DDRO, however, is able to leverage the a-priori information on the partition probabilities (p i ) |I | i=1 to substantially reduce the cost to pay for reliable data-driven solutions, especially for small sample sizes. Intuitively, this information enables DROC to identify highly reliable solutions that are myopically deemed as non-reliable and, therefore, discarded by DROW. Logically, this is contingent on the quality of the a-priori information that is supplied to DROC in the form of order cone constraints on (p i )

The multi-item newsvendor problem
In this section, we carry out an analysis similar to that of Subsection 4.1.1, but for the multi-item newsvendor problem, which can be formulated as follows: where x l is the order quantity for the l-th item, Q is the joint probability distribution governing the demands for the d items, and b l , h l > 0 are the unit holding cost and the unit backorder cost for the l-th item, respectively.
To illustrate our approach in a higher dimensional setting, we consider twenty items, i.e., d = 20.
We consider the following parameters: h 1 = . . . The values we have used for the parameters ε and ρ in DROC and DROW are collated in Table 2. Again, for a meaningful and fair comparison, these parameters have been tuned by trial and error in such a way that at most 50 out of the 1000 runs we have carried out for each sample size N yield a positive out-of-sample disappointment. The values for the parameters, which we need to this end, diminish as we gain more information (i.e., as the sample size N grows). Note that, for small sample sizes, for which the available data provide very little information about their true distribution, a great deal of robustness is required to produce highly reliable data-driven solutions. Consequently, it is little wonder that the selected values for ρ in DROC are equal to two, which is the maximum value that the total variation distance between P and P can take on. In the same fashion as in the case of the previous example of the  Figures 2a and 2b show, for various sample sizes, the box plots pertaining to the out-of-sample disappointment and the actual expected cost associated with each of the considered data-driven approaches, in that order. The results conveyed by these figures confirm our initial conclusions: The ability of our approach DROC to exploit a-priori knowledge of the order among some partition probabilities permits identifying solutions that perform noticeably better out of sample with the same level of confidence. We underline that, in terms of the out-ofsample disappointment, the decision maker seeks a data-driven method m that renders an estimate J m N that results in a positive surprise (i.e., negative disappointment) with a high probability, but that is as close as possible to the cost with full information J * . Consequently, the large negative out-of-sample disappointment that the solutions given by DROW feature can be attributed to its over-conservativeness.
In terms of computational time, solving DROC for this instance of the multi-item newsvendor problem, with 20 items, four partitions and a sample size of 200, takes less than a second with CPLEX 12.10 running on a Windows 10 PC with a CPU Intel (R) Core i7-8550U clocking at 1.80 GHz and 8 GB of RAM.

The problem of a strategic firm competingà la Cournot in a market
Next we consider the problem of a strategic firm competingà la Cournot in a market for an undifferentiated product. This could be the case of, for instance, the electricity market (see, e.g., [24,Ch. 3] and [52]). Suppose the firm can produce up to one per-unit amount of product at a cost given by a 2 x 2 + a 1 x + a 0 , where x is the per-unit amount of product eventually produced and a 0 , a 1 and a 2 are known parameters taking values in R + . Furthermore, assume an inverse residual demand function in the form λ = α − β x, where λ is the market clearing price for the product, and α, β ∈ R + are unknown and uncertain parameters. The firm seeks, therefore, to minimize its cost (a 2 x 2 + a 1 x + a 0 ) − λ x subject to x ∈ [0, 1]. After some basic manipulation, the problem of the firm can be posed as The most interesting feature of this example is that, unlike in the aforementioned newsvendor problems, the Lipschitz constant of the objective function f (x, ξ ) := (−x)ξ + x 2 with respect to ξ is dependent on the decision variable x.
We consider that ξ follows a (true) probability distribution given by 10 000 points sampled from a mixture of three Gaussian distributions with variances all equal to 0.3 and means µ 1 = 0, µ 2 = 1.2 and µ 3 = 2.5. The weights of the mixture are ω 1 = 0.5, ω 2 = 0.2 and ω = 30.3. Furthermore, the mixture has been truncated (over the interval [−1.8, 3]. Figure 3a plots the kernel estimate of the data-generating distribution. As in the previous experiments, we have divided the support [−1.8, 3] into four partitions, using the procedure described at the beginning of Section 4. However, in a different way to what we did in the newsvendor examples, here we select parameters ε and ρ following a procedure that solely relies on the available data, similarly to what is done in [42]. Essentially, given a desired confidence level (1 − β ) for the finitesample guarantee (set to 0.85 in our numerical experiments), we need to estimate, using the data sample available only, the parameters ε and ρ that deliver, at least, this confidence level while yielding the best out-of-sample performance. To this end, we use bootstrapping. The estimator of those parameters is denoted as param m N (β ), underlining that the number and type of parameters to be estimated depend on the method m. The estimation procedure is carried out as follows for each sample of size N (in this experiment, we consider 300 independent data samples for each size N): 1. We construct kboot resamples of size N (with replacement), each playing the role of a different training dataset. Moreover, take those data points that have not been resampled to form a validation dataset (one per resample of size N). In our experiments below, we have considered kboot = 50. 2. For each resample k = 1, . . . , kboot and each candidate value for param, get a DRO solution from method j with parameter (or pair of paramaters) param on the k-th resample. The resulting optimal decision is denoted as x j,k N (param) and its associated objective value as J j,k N (param). Subsequently, we compute the outof-sample performance J( x j,k N (param)) of the data-driven solution x j,k N (param) over the k-th validation dataset. 3. From among the candidate values for param such that J j,k N (param) exceeds the value J( x j,k N (param)) in at least (1 − β ) × kboot different resamples, take the one with the lowest  Figures 3b, 3c, and 3d show, for various sample sizes, the box plots pertaining to the optimal decision, the out-of-sample disappoint-  Fig. 3: Strategic firm problem: (Approximate) true data-generating distribution, optimal solution and performance metrics ment and the actual expected cost associated with each of the considered data-driven approaches, in that order. Once again, the results conveyed by these figures confirm our previous conclusions: Our approach DROC is able to leverage a-priori knowledge of the order among some partition probabilities to deliver solutions that perform significantly better out of sample for the same level of confidence. Furthermore, we see that the decision computed by the proposed method DROC converges to the true optimal solution (with complete information) faster than the solutions provided by the other methods.

Conclusions
In this paper, we have presented a novel framework for data-driven distributionally robust optimization (DRO) based on optimal transport theory in combination with order cone constraints to leverage a-priori information on the true data-generating distribution. Motivated by the reported over-conservativeness of the traditional DRO approach based on the Wasserstein metric, we have formulated an ambiguity set able to incorporate information about the order among the probabilities that the true distribution of the problem's uncertain parameters assigns to some subregions of its support set. Our approach can accomodate a wide range of shape information (such as that related to monotonicity or multi-modality) in a practical and intuitive way. Moreover, under mild assumptions, the resulting distributionally robust optimization problem can be, in fact, reformulated as a finite convex problem where the a-priori information (expressed through the order cone constraints) are cast as linear constraints as opposed to the more computationally challenging formulations that exist in the literature. Furthermore, our approach is supported by theoretical performance guarantees and is capable of turning the provided information into solutions with increased reliability and improved performance, as illustrated by the numerical experiments we have prepared based on the well-known newsvendor problem and the problem of a strategic firm competingá la Cournot in a market for a homogeneous product.