Abstract
We consider stochastic programs where the distribution of the uncertain parameters is only observable through a finite training dataset. Using the Wasserstein metric, we construct a ball in the space of (multivariate and nondiscrete) probability distributions centered at the uniform distribution on the training samples, and we seek decisions that perform best in view of the worst-case distribution within this Wasserstein ball. The state-of-the-art methods for solving the resulting distributionally robust optimization problems rely on global optimization techniques, which quickly become computationally excruciating. In this paper we demonstrate that, under mild assumptions, the distributionally robust optimization problems over Wasserstein balls can in fact be reformulated as finite convex programs—in many interesting cases even as tractable linear programs. Leveraging recent measure concentration results, we also show that their solutions enjoy powerful finite-sample performance guarantees. Our theoretical results are exemplified in mean-risk portfolio optimization as well as uncertainty quantification.
Introduction
Stochastic programming is a powerful modeling paradigm for optimization under uncertainty. The goal of a generic single-stage stochastic program is to find a decision \(x\in \mathbb {R}^n\) that minimizes an expected cost \(\mathbb {E}^\mathbb {P}[h(x,\xi )]\), where the expectation is taken with respect to the distribution \(\mathbb {P}\) of the continuous random vector \(\xi \in \mathbb {R}^m\). However, classical stochastic programming is challenged by the large-scale decision problems encountered in today’s increasingly interconnected world. First, the distribution \(\mathbb {P}\) is never observable but must be inferred from data. However, if we calibrate a stochastic program to a given dataset and evaluate its optimal decision on a different dataset, then the resulting out-of-sample performance is often disappointing—even if the two datasets are generated from the same distribution. This phenomenon is termed the optimizer’s curse and is reminiscent of overfitting effects in statistics [48]. Second, in order to evaluate the objective function of a stochastic program for a fixed decision x, we need to compute a multivariate integral, which is #P-hard even if \(h(x,\xi )\) constitutes the positive part of an affine function, while \(\xi \) is uniformly distributed on the unit hypercube [24, Corollary 1].
Distributionally robust optimization is an alternative modeling paradigm, where the objective is to find a decision x that minimizes the worst-case expected cost \(\sup _{{{\mathbb {Q}}} \in \mathcal {P}} \mathbb {E}^{{\mathbb {Q}}} [ h(x,\xi )]\). Here, the worst case is taken over an ambiguity set \({\mathcal {P}}\), that is, a family of distributions characterized through certain known properties of the unknown data-generating distribution \(\mathbb {P}\). Distributionally robust optimization problems have been studied since Scarf’s [43] seminal treatise on the ambiguity-averse newsvendor problem in 1958, but the field has gained thrust only with the advent of modern robust optimization techniques in the last decade [3, 9]. Distributionally robust optimization has the following striking benefits. First, adopting a worst-case approach regularizes the optimization problem and thereby mitigates the optimizer’s curse characteristic for stochastic programming. Second, distributionally robust models are often tractable even though the corresponding stochastic models with the true data-generating distribution (which is generically continuous) are \(\#P\)-hard. So even if the data-generating distribution were known, the corresponding stochastic program could not be solved efficiently.
The ambiguity set \({\mathcal {P}}\) is a key ingredient of any distributionally robust optimization model. A good ambiguity set should be rich enough to contain the true data-generating distribution with high confidence. On the other hand, the ambiguity set should be small enough to exclude pathological distributions, which would incentivize overly conservative decisions. The ambiguity set should also be easy to parameterize from data, and—ideally—it should facilitate a tractable reformulation of the distributionally robust optimization problem as a structured mathematical program that can be solved with off-the-shelf optimization software.
Distributionally robust optimization models where \(\xi \) has finitely many realizations are reviewed in [2, 7, 39]. This paper focuses on situations where \(\xi \) can have a continuum of realizations. In this setting, the existing literature has studied three types of ambiguity sets. Moment ambiguity sets contain all distributions that satisfy certain moment constraints, see for example [18, 22, 51] or the references therein. An attractive alternative is to define the ambiguity set as a ball in the space of probability distributions by using a probability distance function such as the Prohorov metric [20], the Kullback–Leibler divergence [25, 27], or the Wasserstein metric [38, 52]. Such metric-based ambiguity sets contain all distributions that are close to a nominal or most likely distribution with respect to the prescribed probability metric. By adjusting the radius of the ambiguity set, the modeler can thus control the degree of conservatism of the underlying optimization problem. If the radius drops to zero, then the ambiguity set shrinks to a singleton that contains only the nominal distribution, in which case the distributionally robust problem reduces to an ambiguity-free stochastic program. In addition, ambiguity sets can also be defined as confidence regions of goodness-of-fit tests [7].
In this paper we study distributionally robust optimization problems with a Wasserstein ambiguity set centered at the uniform distribution \(\widehat{\mathbb {P}}_N\) on N independent and identically distributed training samples. The Wasserstein distance of two distributions \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\) can be viewed as the minimum transportation cost for moving the probability mass from \(\mathbb {Q}_1\) to \(\mathbb {Q}_2\), and the Wasserstein ambiguity set contains all (continuous or discrete) distributions that are sufficiently close to the (discrete) empirical distribution \(\widehat{\mathbb {P}}_N\) with respect to the Wasserstein metric. Modern measure concentration results from statistics guarantee that the unknown data-generating distribution \(\mathbb {P}\) belongs to the Wasserstein ambiguity set around \(\widehat{\mathbb {P}}_N\) with confidence \(1-\beta \) if its radius is a sublinearly growing function of \(\log (1/\beta )/N\) [11, 21]. The optimal value of the distributionally robust problem thus provides an upper confidence bound on the achievable out-of-sample cost.
While Wasserstein ambiguity sets offer powerful out-of-sample performance guarantees and enable the decision maker to control the model’s conservativeness, moment-based ambiguity sets appear to display better tractability properties. Specifically, there is growing evidence that distributionally robust models with moment ambiguity sets are more tractable than the corresponding stochastic models because the intractable high-dimensional integrals in the objective function are replaced with tractable (generalized) moment problems [18, 22, 51]. In contrast, distributionally robust models with Wasserstein ambiguity sets are believed to be harder than their stochastic counterparts [36]. Indeed, the state-of-the-art method for computing the worst-case expectation over a Wasserstein ambiguity set \({\mathcal {P}}\) relies on global optimization techniques. Exploiting the fact that the extreme points of \({\mathcal {P}}\) are discrete distributions with a fixed number of atoms [52], one may reformulate the original worst-case expectation problem as a finite-dimensional nonconvex program, which can be solved via “difference of convex programming” methods, see [52] or [36, Section 7.1]. However, the computational effort is reported to be considerable, and there is no guarantee of finding the global optimum. Nevertheless, tractability results are available for special cases. Specifically, the worst case of a convex law-invariant risk measure with respect to a Wasserstein ambiguity set \({\mathcal {P}}\) reduces to the sum of the nominal risk and a regularization term whenever \(h(x,\xi )\) is affine in \(\xi \) and \({\mathcal {P}}\) does not include any support constraints [53]. Moreover, while this paper was under review we became aware of the PhD thesis [54], which reformulates a distributionally robust two-stage unit commitment problem over a Wasserstein ambiguity set as a semi-infinite linear program, which is subsequently solved using a Benders decomposition algorithm.
The main contribution of this paper is to demonstrate that the worst-case expectation over a Wasserstein ambiguity set can in fact be computed efficiently via convex optimization techniques for numerous loss functions of practical interest. Furthermore, we propose an efficient procedure for constructing an extremal distribution that attains the worst-case expectation—provided that such a distribution exists. Otherwise, we construct a sequence of distributions that attain the worst-case expectation asymptotically. As a byproduct, our analysis shows that many interesting distributionally robust optimization problems with Wasserstein ambiguity sets can be solved in polynomial time. We also investigate the out-of-sample performance of the resulting optimal decisions—both theoretically and experimentally—and analyze its dependence on the number of training samples. We highlight the following main contributions of this paper.

We prove that the worst-case expectation of an uncertain loss \(\ell (\xi )\) over a Wasserstein ambiguity set coincides with the optimal value of a finite-dimensional convex program if \(\ell (\xi )\) constitutes a pointwise maximum of finitely many concave functions. Generalizations to convex functions or to sums of maxima of concave functions are also discussed. We conclude that worst-case expectations can be computed efficiently to high precision via modern convex optimization algorithms.

We describe a supplementary finite-dimensional convex program whose optimal (near-optimal) solutions can be used to construct exact (approximate) extremal distributions for the infinite-dimensional worst-case expectation problem.

We show that the worst-case expectation reduces to the optimal value of an explicit linear program if the 1-norm or the \(\infty \)-norm is used in the definition of the Wasserstein metric and if \(\ell (\xi )\) belongs to any of the following function classes: (1) a pointwise maximum or minimum of affine functions; (2) the indicator function of a closed polytope or the indicator function of the complement of an open polytope; (3) the optimal value of a parametric linear program whose cost or right-hand side coefficients depend linearly on \(\xi \).

Using recent measure concentration results from statistics, we demonstrate that the optimal value of a distributionally robust optimization problem over a Wasserstein ambiguity set provides an upper confidence bound on the out-of-sample cost of the worst-case optimal decision. We validate this theoretical performance guarantee in numerical tests.
If the uncertain parameter vector \(\xi \) is confined to a fixed finite subset of \(\mathbb {R}^m\), then the worst-case expectation problems over Wasserstein ambiguity sets simplify substantially and can often be reformulated as tractable conic programs by leveraging ideas from robust optimization. An elegant second-order conic reformulation has been discovered, for instance, in the context of distributionally robust regression analysis [32], and a comprehensive list of tractable reformulations of distributionally robust risk constraints for various risk measures is provided in [39]. Our paper extends these tractability results to the practically relevant case where \(\xi \) has uncountably many possible realizations—without resorting to space tessellation or discretization techniques that are prone to the curse of dimensionality.
When \(\ell (\xi )\) is linear and the distribution of \(\xi \) ranges over a Wasserstein ambiguity set without support constraints, one can derive a concise closed-form expression for the worst-case risk of \(\ell (\xi )\) for various convex risk measures [53]. However, these analytical solutions come at the expense of a loss of generality. We believe that the results of this paper may pave the way towards an efficient computational procedure for evaluating the worst-case risk of \(\ell (\xi )\) in more general settings where the loss function may be nonlinear and \(\xi \) may be subject to support constraints.
Among all metric-based ambiguity sets studied to date, the Kullback–Leibler ambiguity set has attracted the most attention from the robust optimization community. It was first used in financial portfolio optimization to capture the distributional uncertainty of asset returns with a Gaussian nominal distribution [19]. Subsequent work has focused on Kullback–Leibler ambiguity sets for discrete distributions with a fixed support, which offer additional modeling flexibility without sacrificing computational tractability [2, 14]. It is also known that distributionally robust chance constraints involving a generic Kullback–Leibler ambiguity set are equivalent to the respective classical chance constraints under the nominal distribution but with a rescaled violation probability [26, 27]. Moreover, closed-form counterparts of distributionally robust expectation constraints with Kullback–Leibler ambiguity sets have been derived in [25].
However, Kullback–Leibler ambiguity sets typically fail to represent confidence sets for the unknown distribution \(\mathbb {P}\). To see this, assume that \(\mathbb {P}\) is absolutely continuous with respect to the Lebesgue measure and that the ambiguity set is centered at the discrete empirical distribution \(\widehat{\mathbb {P}}_N\). Then, any distribution in a Kullback–Leibler ambiguity set around \(\widehat{\mathbb {P}}_N\) must assign positive probability mass to each training sample. As \(\mathbb {P}\) has a density function, it must therefore reside outside of the Kullback–Leibler ambiguity set irrespective of the training samples. Thus, Kullback–Leibler ambiguity sets around \(\widehat{\mathbb {P}}_N\) contain \(\mathbb {P}\) with probability 0. In contrast, Wasserstein ambiguity sets centered at \(\widehat{\mathbb {P}}_N\) contain discrete as well as continuous distributions and, if properly calibrated, represent meaningful confidence sets for \(\mathbb {P}\). We will exploit this property in Sect. 3 to derive finite-sample guarantees. A comparison and critical assessment of various metric-based ambiguity sets is provided in [45]. Specifically, it is shown that worst-case expectations over Kullback–Leibler and other divergence-based ambiguity sets are law invariant. In contrast, worst-case expectations over Wasserstein ambiguity sets are not. The law invariance can be exploited to evaluate worst-case expectations via the sample average approximation.
The models proposed in this paper fall within the scope of data-driven distributionally robust optimization [7, 16, 20, 23]. Closest in spirit to our work is the robust sample average approximation [7], which seeks decisions that are robust with respect to the ambiguity set of all distributions that pass a prescribed statistical hypothesis test. Indeed, the distributions within the Wasserstein ambiguity set could be viewed as those that pass a multivariate goodness-of-fit test in light of the available training samples. This amounts to interpreting the Wasserstein distance between the empirical distribution \(\widehat{\mathbb {P}}_N\) and a given hypothesis \(\mathbb {Q}\) as a test statistic and the radius of the Wasserstein ambiguity set as a threshold that needs to be chosen in view of the test’s desired significance level \(\beta \). The Wasserstein distance has already been used in tests for normality [17] and to devise nonparametric homogeneity tests [40].
The rest of the paper proceeds as follows. Section 2 sketches a generic framework for data-driven distributionally robust optimization, while Sect. 3 introduces our specific approach based on Wasserstein ambiguity sets and establishes its out-of-sample performance guarantees. In Sect. 4 we demonstrate that many worst-case expectation problems over Wasserstein ambiguity sets can be reduced to finite-dimensional convex programs, and we develop a systematic procedure for constructing worst-case distributions. Explicit linear programming reformulations of distributionally robust single- and two-stage stochastic programs as well as uncertainty quantification problems are derived in Sect. 5. Section 6 extends the scope of the basic approach to broader classes of objective functions, and Sect. 7 reports on numerical results.
Notation
We denote by \(\mathbb {R}_+\) the nonnegative and by \(\overline{\mathbb {R}}{:=}\mathbb {R}\cup \{-\infty ,\infty \}\) the extended reals. Throughout this paper, we adopt the conventions of extended arithmetics, whereby \(\infty \cdot 0 = 0\cdot \infty = {0 / 0 } = 0\) and \(\infty - \infty = \infty + \infty = 1/0 = \infty \). The inner product of two vectors \(a,b \in \mathbb {R}^m\) is denoted by \(\big \langle a, b \big \rangle {:=}a^\intercal b\). Given a norm \(\Vert \cdot \Vert \) on \(\mathbb {R}^m\), the dual norm is defined through \(\Vert z\Vert _* {:=}\sup _{\Vert \xi \Vert \le 1} \big \langle z, \xi \big \rangle \). A function \(f:\mathbb {R}^m\rightarrow \overline{\mathbb {R}}\) is proper if \(f(\xi )<+\infty \) for at least one \(\xi \) and \(f(\xi )>-\infty \) for every \(\xi \) in \(\mathbb {R}^m\). The conjugate of f is defined as \(f^*(z) {:=}\sup _{\xi \in \mathbb {R}^m} \big \langle z, \xi \big \rangle - f(\xi )\). Note that conjugacy preserves properness. For a set \(\Xi \subseteq \mathbb {R}^m\), the indicator function \(\mathbbm {1}_{\Xi }\) is defined through \(\mathbbm {1}_{\Xi }(\xi )=1\) if \(\xi \in \Xi \); \(=0\) otherwise. Similarly, the characteristic function \(\chi _\Xi \) is defined via \(\chi _\Xi (\xi )=0\) if \(\xi \in \Xi \); \(=\infty \) otherwise. The support function of \(\Xi \) is defined as \(\sigma _{\Xi }(z) {:=}\sup _{\xi \in \Xi } \big \langle z, \xi \big \rangle \). It coincides with the conjugate of \(\chi _\Xi \). We denote by \(\delta _{\xi }\) the Dirac distribution concentrating unit mass at \(\xi \in \mathbb {R}^m\). The product of two probability distributions \(\mathbb {P}_1\) and \(\mathbb {P}_2\) on \(\Xi _1\) and \(\Xi _2\), respectively, is the distribution \(\mathbb {P}_1\otimes \mathbb {P}_2 \) on \(\Xi _1\times \Xi _2\). The N-fold product of a distribution \(\mathbb {P}\) on \(\Xi \) is denoted by \(\mathbb {P}^N\), which represents a distribution on the Cartesian product space \(\Xi ^N\). 
Finally, we set the expectation of \(\ell :\Xi \rightarrow \overline{\mathbb {R}}\) under \(\mathbb {P}\) to \(\mathbb {E}^\mathbb {P}[\ell (\xi )] = \mathbb {E}^\mathbb {P}\big [\max \{\ell (\xi ),0\}\big ] + \mathbb {E}^\mathbb {P}\big [\min \{\ell (\xi ),0\}\big ]\), which is well-defined by the conventions of extended arithmetics.
Datadriven stochastic programming
Consider the stochastic program
with feasible set \(\mathbb {X}\subseteq \mathbb {R}^n\), uncertainty set \(\Xi \subseteq \mathbb {R}^m\) and loss function \(h : \mathbb {R}^n \times \mathbb {R}^m \rightarrow \overline{\mathbb {R}}\). The loss function depends both on the decision vector \(x\in \mathbb {R}^n\) and the random vector \(\xi \in \mathbb {R}^m\), whose distribution \(\mathbb {P}\) is supported on \(\Xi \). Problem (1) can be viewed as the first-stage problem of a two-stage stochastic program, where \(h(x,\xi )\) represents the optimal value of a subordinate second-stage problem [46]. Alternatively, problem (1) may also be interpreted as a generic learning problem in the spirit of [49].
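For ease of reference, the display omitted above, problem (1), can be filled in from the surrounding text; the following is a reconstruction consistent with the paper's notation, not a verbatim copy:

```latex
J^\star \,{:=}\, \inf_{x \in \mathbb{X}}
\Big\{ \mathbb{E}^{\mathbb{P}}\big[ h(x,\xi) \big]
  = \int_{\Xi} h(x,\xi)\, \mathbb{P}(\mathrm{d}\xi) \Big\}
\tag{1}
```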
Unfortunately, in most situations of practical interest, the distribution \(\mathbb {P}\) is not precisely known, and therefore we miss essential information to solve problem (1) exactly. However, \(\mathbb {P}\) is often partially observable through a finite set of N independent samples, e.g., past realizations of the random vector \(\xi \). We denote the training dataset comprising these samples by \(\widehat{\Xi }_N{:=}\{\widehat{\xi }_i\}_{i\le N} \subseteq \Xi \). We emphasize that—before its revelation—the dataset \(\widehat{\Xi }_N\) can be viewed as a random object governed by the distribution \(\mathbb {P}^N\) supported on \(\Xi ^N\).
A data-driven solution for problem (1) is a feasible decision \(\widehat{x}_N\in \mathbb {X}\) that is constructed from the training dataset \(\widehat{\Xi }_N\). Throughout this paper, we notationally suppress the dependence of \(\widehat{x}_N\) on the training samples in order to avoid clutter. Instead, we reserve the superscript ‘\(\,{\widehat{~}}\) ’ for objects that depend on the training data and thus constitute random objects governed by the product distribution \(\mathbb {P}^N\). The out-of-sample performance of \(\widehat{x}_N\) is defined as \(\mathbb {E}^\mathbb {P}\big [ h(\widehat{x}_N,\xi ) \big ]\) and can thus be viewed as the expected cost of \(\widehat{x}_N\) under a new sample \(\xi \) that is independent of the training dataset. As \(\mathbb {P}\) is unknown, however, the exact out-of-sample performance cannot be evaluated in practice, and the best we can hope for is to establish performance guarantees in the form of tight bounds. The feasibility of \(\widehat{x}_N\) in (1) implies \(J^\star \le \mathbb {E}^\mathbb {P}\big [ h(\widehat{x}_N,\xi ) \big ]\), but this lower bound is again of limited use as \(J^\star \) is unknown and as our primary concern is to bound the costs from above. Thus, we seek data-driven solutions \(\widehat{x}_N\) with performance guarantees of the type
where \(\widehat{J}_N\) constitutes an upper bound that may depend on the training dataset, and \(\beta \in (0,1)\) is a significance parameter with respect to the distribution \(\mathbb {P}^N\), which governs both \(\widehat{x}_N\) and \(\widehat{J}_N\). Hereafter we refer to \(\widehat{J}_N\) as a certificate for the out-of-sample performance of \(\widehat{x}_N\) and to the probability on the left-hand side of (2) as its reliability. Our ideal goal is to find a data-driven solution with the lowest possible out-of-sample performance. This is impossible, however, as \(\mathbb {P}\) is unknown, and the out-of-sample performance cannot be computed. We thus pursue the more modest but achievable goal to find a data-driven solution with a low certificate and a high reliability.
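In symbols, a performance guarantee of the type (2) presumably reads, reconstructed from the discussion around it:

```latex
\mathbb{P}^N \Big\{ \mathbb{E}^{\mathbb{P}}\big[ h(\widehat{x}_N,\xi) \big]
  \le \widehat{J}_N \Big\} \ge 1-\beta
\tag{2}
```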
A natural approach to generate data-driven solutions \(\widehat{x}_N\) is to approximate \(\mathbb {P}\) with the discrete empirical probability distribution
that is, the uniform distribution on \(\widehat{\Xi }_N\). This amounts to approximating the original stochastic program (1) with the sample-average approximation (SAA) problem
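The omitted displays (3) and (4) can be filled in from context: the empirical distribution places mass 1/N on each training sample, and the SAA problem replaces \(\mathbb {P}\) with it (a reconstruction, not a verbatim copy):

```latex
\widehat{\mathbb{P}}_N \,{:=}\, \frac{1}{N} \sum_{i=1}^{N} \delta_{\widehat{\xi}_i}
\tag{3}

\qquad

\widehat{J}_{\mathrm{SAA}} \,{:=}\,
\inf_{x \in \mathbb{X}} \mathbb{E}^{\widehat{\mathbb{P}}_N}\big[ h(x,\xi) \big]
  = \inf_{x \in \mathbb{X}} \frac{1}{N} \sum_{i=1}^{N} h\big(x, \widehat{\xi}_i\big)
\tag{4}
```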
If the feasible set \(\mathbb {X}\) is compact and the loss function is uniformly continuous in x across all \(\xi \in \Xi \), then the optimal value and optimal solutions of the SAA problem (4) converge almost surely to their counterparts of the true problem (1) as N tends to infinity [46, Theorem 5.3]. Even though finite sample performance guarantees of the type (2) can be obtained under additional assumptions such as Lipschitz continuity of the loss function (see e.g., [47, Theorem 1]), the SAA problem has been conceived primarily for situations where the distribution \(\mathbb {P}\) is known and additional samples can be acquired cheaply via random number generation. However, the optimal solutions of the SAA problem tend to display a poor out-of-sample performance in situations where N is small and where the acquisition of additional samples would be costly.
In this paper we address problem (1) with an alternative approach that explicitly accounts for our ignorance of the true data-generating distribution \(\mathbb {P}\), and that offers attractive performance guarantees even when the acquisition of additional samples from \(\mathbb {P}\) is impossible or expensive. Specifically, we use \(\widehat{\Xi }_N\) to design an ambiguity set \(\widehat{\mathcal {P}}_N\) containing all distributions that could have generated the training samples with high confidence. This ambiguity set enables us to define the certificate \(\widehat{J}_N\) as the optimal value of a distributionally robust optimization problem that minimizes the worst-case expected cost.
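Spelled out, the distributionally robust problem (5) referenced here is presumably of the min-max form (reconstructed from the surrounding text):

```latex
\widehat{J}_N \,{:=}\, \inf_{x \in \mathbb{X}}
  \sup_{\mathbb{Q} \in \widehat{\mathcal{P}}_N}
  \mathbb{E}^{\mathbb{Q}}\big[ h(x,\xi) \big]
\tag{5}
```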
Following [38], we construct \(\widehat{\mathcal {P}}_N\) as a ball around the empirical distribution (3) with respect to the Wasserstein metric. In the remainder of the paper we will demonstrate that the optimal value \(\widehat{J}_N\) as well as any optimal solution \(\widehat{x}_N\) (if it exists) of the distributionally robust problem (5) satisfy the following conditions.

(i)
Finite sample guarantee: For a carefully chosen size of the ambiguity set, the certificate \(\widehat{J}_N\) provides a \(1-\beta \) confidence bound of the type (2) on the out-of-sample performance of \(\widehat{x}_N\).

(ii)
Asymptotic consistency: As N tends to infinity, the certificate \(\widehat{J}_N\) and the data-driven solution \(\widehat{x}_N\) converge—in a sense to be made precise below—to the optimal value \(J^\star \) and an optimizer \(x^\star \) of the stochastic program (1), respectively.

(iii)
Tractability: For many loss functions \(h(x,\xi )\) and sets \(\mathbb {X}\), the distributionally robust problem (5) is computationally tractable and admits a reformulation reminiscent of the SAA problem (4).
Conditions (i–iii) have been identified in [7] as desirable properties of data-driven solutions for stochastic programs. Precise statements of these conditions will be provided in the remainder. In Sect. 3 we will use the Wasserstein metric to construct ambiguity sets of the type \(\widehat{\mathcal {P}}_N\) satisfying the conditions (i) and (ii). In Sect. 4, we will demonstrate that these ambiguity sets also fulfill the tractability condition (iii). We see this last result as the main contribution of this paper because the state-of-the-art method for solving distributionally robust problems over Wasserstein ambiguity sets relies on global optimization algorithms [36].
Wasserstein metric and measure concentration
Probability metrics represent distance functions on the space of probability distributions. One of the most widely used examples is the Wasserstein metric, which is defined on the space \(\mathcal {M}(\Xi )\) of all probability distributions \(\mathbb {Q}\) supported on \(\Xi \) with \(\mathbb {E}^\mathbb {Q}\big [\Vert \xi \Vert \big ] = \int _\Xi \Vert \xi \Vert \,\mathbb {Q}(\mathrm {d}\xi )<\infty \).
Definition 3.1
(Wasserstein metric [29]) The Wasserstein metric \(d_\mathrm{W} : \mathcal {M}(\Xi )\times \mathcal {M}(\Xi )\rightarrow \mathbb {R}_+\) is defined via
for all distributions \(\mathbb {Q}_1,\mathbb {Q}_2\in \mathcal {M}(\Xi )\), where \(\Vert \cdot \Vert \) represents an arbitrary norm on \(\mathbb {R}^m\).
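The defining display of the metric, omitted between the two preceding lines, can be reconstructed from the transportation-plan discussion that follows (a reconstruction consistent with the paper's notation):

```latex
d_{\mathrm{W}}(\mathbb{Q}_1,\mathbb{Q}_2) \,{:=}\,
\inf_{\Pi} \left\{ \int_{\Xi^2} \Vert \xi_1 - \xi_2 \Vert \,
  \Pi(\mathrm{d}\xi_1, \mathrm{d}\xi_2) \;:\;
\begin{array}{l}
\Pi \text{ is a joint distribution of } \xi_1 \text{ and } \xi_2 \\
\text{with marginals } \mathbb{Q}_1 \text{ and } \mathbb{Q}_2 \text{, respectively}
\end{array}
\right\}
```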
The decision variable \(\Pi \) can be viewed as a transportation plan for moving a mass distribution described by \(\mathbb {Q}_1\) to another one described by \(\mathbb {Q}_2\). Thus, the Wasserstein distance between \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\) represents the cost of an optimal mass transportation plan, where the norm \(\Vert \cdot \Vert \) encodes the transportation costs. We remark that a generalized p-Wasserstein metric for \(p\ge 1\) is obtained by setting the transportation cost between \(\xi _1\) and \(\xi _2\) to \(\Vert \xi _1-\xi _2\Vert ^p\). In this paper, however, we focus exclusively on the 1-Wasserstein metric of Definition 3.1, which is sometimes also referred to as the Kantorovich metric.
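As an aside not found in the paper, the transportation-plan characterization can be checked numerically for two small discrete distributions on the real line by solving the transportation linear program directly. The sketch below assumes SciPy is available; the helper name `wasserstein_lp` is ours, and the result is cross-checked against `scipy.stats.wasserstein_distance`, which computes the same 1-Wasserstein distance in one dimension.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

def wasserstein_lp(x1, p1, x2, p2):
    """1-Wasserstein distance between two discrete distributions on the real
    line, computed by solving the transportation LP of Definition 3.1.
    x1, x2: support points; p1, p2: probability masses."""
    n, k = len(x1), len(x2)
    # cost[i, j] = |x1[i] - x2[j]|, i.e. the norm in the definition
    cost = np.abs(np.subtract.outer(x1, x2)).ravel()
    # marginal constraints: rows of the plan sum to p1, columns to p2
    A_eq = []
    for i in range(n):
        row = np.zeros((n, k)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(k):
        col = np.zeros((n, k)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([p1, p2])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.fun

x1, p1 = np.array([0.0, 1.0]), np.array([0.5, 0.5])
x2, p2 = np.array([0.0, 2.0]), np.array([0.5, 0.5])
d = wasserstein_lp(x1, p1, x2, p2)
# SciPy's built-in routine agrees on the real line
d_ref = wasserstein_distance(x1, x2, u_weights=p1, v_weights=p2)
print(d, d_ref)
```

For these two-point distributions the optimal plan keeps the mass at 0 in place and transports mass 0.5 from 1 to 2, so both routines return 0.5.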
We will sometimes also need the following dual representation of the Wasserstein metric.
Theorem 3.2
(Kantorovich–Rubinstein [29]) For any distributions \(\mathbb {Q}_1, \mathbb {Q}_2\in {\mathcal {M}}(\Xi )\) we have
where \(\mathcal {L}\) denotes the space of all Lipschitz functions with \(f(\xi )-f(\xi ')\le \Vert \xi -\xi '\Vert \) for all \(\xi ,\xi '\in \Xi \).
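The dual display of Theorem 3.2, omitted above, presumably reads (reconstructed from the statement and the discussion that follows):

```latex
d_{\mathrm{W}}(\mathbb{Q}_1,\mathbb{Q}_2)
  = \sup_{f \in \mathcal{L}} \left\{
    \int_{\Xi} f(\xi)\, \mathbb{Q}_1(\mathrm{d}\xi)
  - \int_{\Xi} f(\xi)\, \mathbb{Q}_2(\mathrm{d}\xi) \right\}
```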
Kantorovich and Rubinstein [29] originally established this result for distributions with bounded support. A modern proof for unbounded distributions is due to Villani [50, Remark 6.5, p. 107]. The optimization problems in Definition 3.1 and Theorem 3.2, which provide two equivalent characterizations of the Wasserstein metric, constitute a primal-dual pair of infinite-dimensional linear programs. The dual representation implies that two distributions \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\) are close to each other with respect to the Wasserstein metric if and only if all functions with uniformly bounded slopes have similar integrals under \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\). Theorem 3.2 also demonstrates that the Wasserstein metric is a special instance of an integral probability metric (see e.g. [33]) and that its generating function class coincides with a family of Lipschitz continuous functions.
In the remainder we will examine the ambiguity set
which can be viewed as the Wasserstein ball of radius \(\varepsilon \) centered at the empirical distribution \(\widehat{\mathbb {P}}_N\). Under a common light-tail assumption on the unknown data-generating distribution \(\mathbb {P}\), this ambiguity set offers attractive performance guarantees in the spirit of Sect. 2.
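Written out, the Wasserstein ball described here is

```latex
\mathbb{B}_{\varepsilon}\big(\widehat{\mathbb{P}}_N\big) \,{:=}\,
\big\{ \mathbb{Q} \in \mathcal{M}(\Xi) \;:\;
  d_{\mathrm{W}}\big(\widehat{\mathbb{P}}_N, \mathbb{Q}\big) \le \varepsilon \big\}
```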
Assumption 3.3
(Light-tailed distribution) There exists an exponent \(a > 1\) such that
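The display condition of Assumption 3.3 is missing here; judging from the constants a and A that Theorem 3.4 refers to, it presumably requires

```latex
A \,{:=}\, \mathbb{E}^{\mathbb{P}}\big[ \exp\big( \Vert \xi \Vert^{a} \big) \big]
  = \int_{\Xi} \exp\big( \Vert \xi \Vert^{a} \big)\, \mathbb{P}(\mathrm{d}\xi)
  < \infty
```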
Assumption 3.3 essentially requires the tail of the distribution \(\mathbb {P}\) to decay at an exponential rate. Note that this assumption trivially holds if \(\Xi \) is compact. Heavy-tailed distributions that fail to meet Assumption 3.3 are difficult to handle even in the context of the classical sample average approximation. Indeed, under a heavy-tailed distribution the sample average of the loss corresponding to any fixed decision \(x \in \mathbb {X}\) may not even converge to the expected loss; see e.g. [13, 15]. The following modern measure concentration result provides the basis for establishing powerful finite sample guarantees.
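To illustrate the concentration phenomenon underlying this discussion (an illustration of ours, not from the paper), one can sample training datasets of growing size N from a one-dimensional standard normal stand-in for \(\mathbb {P}\) and watch the empirical 1-Wasserstein distance to a large-sample proxy of \(\mathbb {P}\) shrink; SciPy's `scipy.stats.wasserstein_distance` is assumed available.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# large-sample proxy for the true distribution P (standard normal)
ref = rng.standard_normal(100_000)

def avg_dist_to_P(N, trials=20):
    """Average 1-Wasserstein distance between the empirical distribution
    on N training samples and the proxy for P, over independent datasets."""
    return float(np.mean([wasserstein_distance(rng.standard_normal(N), ref)
                          for _ in range(trials)]))

d_small = avg_dist_to_P(50)
d_large = avg_dist_to_P(2000)
print(d_small, d_large)  # the distance decays as N grows
```

This is consistent with the radius of the smallest Wasserstein ball containing \(\mathbb {P}\) shrinking to zero as N grows, which is what allows the ambiguity set to be tightened with more data.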
Theorem 3.4
(Measure concentration [21, Theorem 2]) If Assumption 3.3 holds, we have
for all \(N \ge 1\), \(m \ne 2\), and \(\varepsilon >0\), where \(c_1, c_2\) are positive constants that only depend on a, A, and m.^{Footnote 1}
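The concentration inequality (7) itself is omitted above; as stated in [21, Theorem 2] it takes a two-regime form, reconstructed here with the constants \(c_1, c_2\) of the theorem:

```latex
\mathbb{P}^N \Big\{ d_{\mathrm{W}}\big( \mathbb{P}, \widehat{\mathbb{P}}_N \big)
  \ge \varepsilon \Big\}
\le
\begin{cases}
c_1 \exp\big( -c_2 N \varepsilon^{\max\{m,2\}} \big) & \text{if } \varepsilon \le 1, \\
c_1 \exp\big( -c_2 N \varepsilon^{a} \big) & \text{if } \varepsilon > 1,
\end{cases}
\tag{7}
```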
Theorem 3.4 provides an a priori estimate of the probability that the unknown data-generating distribution \(\mathbb {P}\) resides outside of the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\). Thus, we can use Theorem 3.4 to estimate the radius of the smallest Wasserstein ball that contains \(\mathbb {P}\) with confidence \(1-\beta \) for some prescribed \(\beta \in (0,1)\). Indeed, equating the right-hand side of (7) to \(\beta \) and solving for \(\varepsilon \) yields
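Carrying out this calculation, the omitted radius formula (8) presumably takes the form

```latex
\varepsilon_N(\beta) \,{:=}\,
\begin{cases}
\Big( \dfrac{\log(c_1 \beta^{-1})}{c_2 N} \Big)^{1/\max\{m,2\}}
  & \text{if } N \ge \dfrac{\log(c_1 \beta^{-1})}{c_2}, \\[2ex]
\Big( \dfrac{\log(c_1 \beta^{-1})}{c_2 N} \Big)^{1/a}
  & \text{if } N < \dfrac{\log(c_1 \beta^{-1})}{c_2}.
\end{cases}
\tag{8}
```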
Note that the Wasserstein ball with radius \(\varepsilon _N(\beta )\) can thus be viewed as a confidence set for the unknown true distribution as in statistical testing; see also [7].
Theorem 3.5
(Finite sample guarantee) Suppose that Assumption 3.3 holds and that \(\beta \in (0,1)\). Assume also that \(\widehat{J}_N\) and \(\widehat{x}_N\) represent the optimal value and an optimizer of the distributionally robust program (5) with ambiguity set \(\widehat{\mathcal {P}}_N = \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\). Then, the finite sample guarantee (2) holds.
Proof
The claim follows immediately from Theorem 3.4, which ensures via the definition of \(\varepsilon _N(\beta )\) in (8) that \(\mathbb {P}^N \{ \mathbb {P}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N) \} \ge 1-\beta \). Thus, \(\mathbb {E}^\mathbb {P}[ h(\widehat{x}_N,\xi )] \le \sup _{\mathbb {Q}\in \widehat{\mathcal {P}}_N}\mathbb {E}^\mathbb {Q}[ h(\widehat{x}_N,\xi )] = \widehat{J}_N\) with probability \(1-\beta \). \(\square \)
It is clear from (8) that for any fixed \(\beta >0\), the radius \( \varepsilon _N(\beta )\) tends to 0 as N increases. Moreover, one can show that if \(\beta _N\) converges to zero at a carefully chosen rate, then the solution of the distributionally robust optimization problem (5) with ambiguity set \(\widehat{\mathcal {P}}_N = \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\) converges to the solution of the original stochastic program (1) as N tends to infinity. The following theorem formalizes this statement.
Theorem 3.6
(Asymptotic consistency) Suppose that Assumption 3.3 holds and that \(\beta _N\in (0,1)\), \(N \in \mathbb {N}\), satisfies \(\sum _{N=1}^\infty \beta _N<\infty \) and \(\lim _{N\rightarrow \infty }\varepsilon _N(\beta _N)=0\).^{Footnote 2} Assume also that \(\widehat{J}_N\) and \(\widehat{x}_N\) represent the optimal value and an optimizer of the distributionally robust program (5) with ambiguity set \(\widehat{\mathcal {P}}_N = \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\), \(N\in \mathbb {N}\).

(i)
If \(h(x,\xi )\) is upper semicontinuous in \(\xi \) and there exists \(L\ge 0\) with \(h(x,\xi )\le L(1+\Vert \xi \Vert )\) for all \(x\in \mathbb {X}\) and \(\xi \in \Xi \), then \(\mathbb {P}^\infty \)-almost surely we have \(\widehat{J}_N\downarrow J^\star \) as \(N \rightarrow \infty \), where \(J^\star \) is the optimal value of (1).

(ii)
If the assumptions of assertion (i) hold, \(\mathbb {X}\) is closed, and \(h(x,\xi )\) is lower semicontinuous in x for every \(\xi \in \Xi \), then any accumulation point of \(\{\widehat{x}_N\}_{N \in \mathbb {N}}\) is \(\mathbb {P}^\infty \)-almost surely an optimal solution for (1).
The proof of Theorem 3.6 will rely on the following technical lemma.
Lemma 3.7
(Convergence of distributions) If Assumption 3.3 holds and \(\beta _N\in (0,1)\), \(N \in \mathbb {N}\), satisfies \(\sum _{N=1}^\infty \beta _N<\infty \) and \(\lim _{N\rightarrow \infty }\varepsilon _N(\beta _N)=0\), then, any sequence \({\widehat{\mathbb {Q}}}_N \in \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\), \(N\in \mathbb {N}\), where \({\widehat{\mathbb {Q}}}_N\) may depend on the training data, converges under the Wasserstein metric (and thus weakly) to \(\mathbb {P}\) almost surely with respect to \(\mathbb {P}^\infty \), that is,
Proof
As \({\widehat{\mathbb {Q}}}_N \in \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\), the triangle inequality for the Wasserstein metric ensures that
Moreover, Theorem 3.4 implies that \(\mathbb {P}^N \{ d_{\mathrm W}\big (\mathbb {P},\widehat{\mathbb {P}}_N\big ) \le \varepsilon _N(\beta _N)\}\ge 1-\beta _N\), and thus we have \(\mathbb {P}^N \{ d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N \big ) \le 2\varepsilon _N(\beta _N) \} \ge 1-\beta _N\). As \(\sum _{N=1}^\infty \beta _N<\infty \), the Borel–Cantelli Lemma [28, Theorem 2.18] further implies that
Finally, as \(\lim _{N \rightarrow \infty }\varepsilon _N(\beta _N)=0\), we conclude that \(\lim _{N \rightarrow \infty }d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N\big ) =0\) almost surely. Note that convergence with respect to the Wasserstein metric implies weak convergence [10]. \(\square \)
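Although Lemma 3.7 is asymptotic, the underlying convergence \(d_{\mathrm W}(\mathbb {P},\widehat{\mathbb {P}}_N)\rightarrow 0\) is easy to observe numerically. The following sketch (an illustration only, assuming a uniform data-generating distribution on [0, 1]) exploits the one-dimensional identity that the type-1 Wasserstein distance equals the integrated absolute difference between the two distribution functions:

```python
import numpy as np

def w1_to_uniform(samples, grid_size=20001):
    """Approximate d_W between the empirical distribution of `samples`
    and the uniform distribution on [0, 1] via the 1-d identity
    d_W = integral over [0, 1] of |F_N(t) - t| dt."""
    t = np.linspace(0.0, 1.0, grid_size)
    emp_cdf = np.searchsorted(np.sort(samples), t, side="right") / len(samples)
    return np.abs(emp_cdf - t).mean()  # uniform grid on an interval of length 1

rng = np.random.default_rng(0)
dists = {N: w1_to_uniform(rng.uniform(size=N)) for N in (100, 10_000)}
# The distance to the true distribution shrinks markedly as N grows.
```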
Proof of Theorem 3.6
As \({\widehat{x}}_N\in \mathbb {X}\), we have \(J^\star \le \mathbb {E}^\mathbb {P}[h({\widehat{x}}_N,\xi )]\). Moreover, Theorem 3.5 implies that
for all \(N \in \mathbb {N}\). As \(\sum _{N=1}^\infty \beta _N<\infty \), the Borel–Cantelli Lemma further implies that
To prove assertion (i), it thus remains to be shown that \(\limsup _{N \rightarrow \infty }\widehat{J}_N\le J^\star \) with probability 1. As \(h(x,\xi )\) is upper semicontinuous and grows at most linearly in \(\xi \), there exists a nonincreasing sequence of functions \(h_k(x,\xi )\), \(k\in \mathbb {N}\), such that \(h(x,\xi )=\lim _{k\rightarrow \infty } h_k(x,\xi )\), and \(h_k(x,\xi )\) is Lipschitz continuous in \(\xi \) for any fixed \(x\in \mathbb {X}\) and \(k\in \mathbb {N}\) with Lipschitz constant \(L_k\ge 0\); see Lemma A.1 in the appendix. Next, choose any \(\delta >0\), fix a \(\delta \)-optimal decision \(x_\delta \in \mathbb {X}\) for (1) with \(\mathbb {E}^\mathbb {P}[h(x_\delta ,\xi )]\le J^\star +\delta \), and for every \(N\in \mathbb {N}\) let \({\widehat{\mathbb {Q}}}_N \in \widehat{\mathcal {P}}_N\) be a \(\delta \)-optimal distribution corresponding to \(x_\delta \) with
Then, we have
where the second inequality holds because \(h_k(x,\xi )\) converges from above to \(h(x,\xi )\), and the third inequality follows from Theorem 3.2. Moreover, the almost sure equality holds due to Lemma 3.7, and the last equality follows from the Monotone Convergence Theorem [30, Theorem 5.5], which applies because \(\mathbb {E}^{\mathbb {P}}[h_k(x_\delta ,\xi )] < \infty \). Indeed, recall that \(\mathbb {P}\) has an exponentially decaying tail due to Assumption 3.3 and that \(h_k(x_\delta ,\xi )\) is Lipschitz continuous in \(\xi \). As \(\delta >0\) was chosen arbitrarily, we thus conclude that \(\limsup _{N \rightarrow \infty }\widehat{J}_N\le J^\star \).
To prove assertion (ii), fix an arbitrary realization of the stochastic process \(\{\widehat{\xi }_N\}_{N \in \mathbb {N}}\) such that \(J^\star = \lim _{N \rightarrow \infty } \widehat{J}_N\) and \(J^\star \le \mathbb {E}^{\mathbb {P}}[h(\widehat{x}_N,\xi )] \le \widehat{J}_N\) for all sufficiently large N. From the proof of assertion (i) we know that these two conditions are satisfied \(\mathbb {P}^\infty \)-almost surely. Using these assumptions, one easily verifies that
Next, let \(x^\star \) be an accumulation point of the sequence \(\{\widehat{x}_N\}_{N \in \mathbb {N}}\), and note that \(x^\star \in \mathbb {X}\) as \(\mathbb {X}\) is closed. By passing to a subsequence, if necessary, we may assume without loss of generality that \(x^\star = \lim _{N\rightarrow \infty }\widehat{x}_N\). Thus,
where the first inequality exploits that \(x^\star \in \mathbb {X}\), the second inequality follows from the lower semicontinuity of \(h(x,\xi )\) in x, the third inequality holds due to Fatou’s lemma (which applies because \(h(x,\xi )\) grows at most linearly in \(\xi \)), and the last inequality follows from (9). Therefore, we have \(\mathbb {E}^{\mathbb {P}}[h(x^\star ,\xi )] = J^\star \). \(\square \)
In the following we show that all assumptions of Theorem 3.6 are necessary for asymptotic convergence, that is, relaxing any of these conditions can invalidate the convergence result.
Example 1
(Necessity of regularity conditions)

(1)
Upper semicontinuity of \(\xi \mapsto h(x,\xi )\) in Theorem 3.6 (i):
Set \(\Xi = [0,1]\), \(\mathbb {P}= \delta _{0}\) and \(h(x,\xi ) = \mathbbm {1}_{(0,1]}(\xi )\), whereby \(J^\star = 0\). As \(\mathbb {P}\) concentrates unit mass at 0, we have \(\widehat{\mathbb {P}}_N=\delta _{0}=\mathbb {P}\) irrespective of \(N\in \mathbb {N}\). For any \(\varepsilon > 0\), the Dirac distribution \(\delta _{\varepsilon }\) thus resides within the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\). Hence, \(\widehat{J}_N\) fails to converge to \(J^\star \) for \(\varepsilon \rightarrow 0\) because
$$\begin{aligned} \widehat{J}_N\ge \mathbb {E}^{\delta _{\varepsilon }} [h(x,\xi )] = h(x, \varepsilon ) = 1,\quad \forall \varepsilon >0. \end{aligned}$$ 
(2)
Linear growth of \(\xi \mapsto h(x,\xi )\) in Theorem 3.6 (i):
Set \(\Xi = \mathbb {R}\), \(\mathbb {P}= \delta _{0}\) and \(h(x,\xi ) = \xi ^2\), which implies that \(J^\star =0\). Note that for any \(\rho >\varepsilon \), the two-point distribution \(\mathbb {Q}_\rho = (1-\tfrac{\varepsilon }{\rho })\delta _{0}+\tfrac{\varepsilon }{\rho }\delta _{\rho }\) is contained in the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) of radius \(\varepsilon >0\). Hence, \(\widehat{J}_N\) fails to converge to \(J^\star \) for \(\varepsilon \rightarrow 0\) because
$$\begin{aligned} \widehat{J}_N\ge \, \sup _{\rho> \varepsilon } \,\mathbb {E}^{\mathbb {Q}_\rho } [h(x,\xi )] = \sup _{\rho> \varepsilon } \, \varepsilon \rho = \infty , \quad \forall \varepsilon >0. \end{aligned}$$ 
(3)
Lower semicontinuity of \(x \mapsto h(x,\xi )\) in Theorem 3.6 (ii):
Set \(\mathbb {X}= [0,1]\) and \(h(x,\xi ) = \mathbbm {1}_{[0.5,1]}(x)\), whereby \(J^\star =0\) irrespective of \(\mathbb {P}\). As the objective is independent of \(\xi \), the distributionally robust optimization problem (5) is equivalent to (1). Then, \({\widehat{x}}_N = \tfrac{N-1}{2N}\) is a sequence of minimizers for (5) whose accumulation point \(x^\star = \tfrac{1}{2}\) fails to be optimal in (1).
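The blow-up in item (2) can be traced numerically: each two-point distribution \(\mathbb {Q}_\rho \) transports mass \(\varepsilon /\rho \) over the distance \(\rho \) and thus sits exactly on the boundary of the Wasserstein ball, while its expected loss \(\varepsilon \rho \) diverges as \(\rho \) grows. A minimal sketch:

```python
eps = 0.1  # radius of the Wasserstein ball around delta_0

def two_point(rho, eps):
    """Q_rho = (1 - eps/rho) * delta_0 + (eps/rho) * delta_rho."""
    return [0.0, rho], [1.0 - eps / rho, eps / rho]

for rho in (10.0, 100.0, 1000.0):
    points, weights = two_point(rho, eps)
    transport_cost = weights[1] * points[1]       # mass times distance
    second_moment = sum(w * p ** 2 for p, w in zip(points, weights))
    assert abs(transport_cost - eps) < 1e-12      # always on the ball's boundary
    assert abs(second_moment - eps * rho) < 1e-9  # diverges linearly in rho
```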
A convergence result akin to Theorem 3.6 for goodness-of-fit-based ambiguity sets is discussed in [7, Section 4]. This result is complementary to Theorem 3.6. Indeed, Theorem 3.6(i) requires \(h(x,\xi )\) to be upper semicontinuous in \(\xi \), a condition that is necessary in our setting (see Example 1) but absent in [7]. Moreover, Theorem 3.6(ii) only requires \(h(x,\xi )\) to be lower semicontinuous in x, while [7] asks for equicontinuity of this mapping. In return, this stronger requirement yields a stronger result, namely the almost sure convergence of \(\sup _{\mathbb {Q}\in \widehat{\mathcal {P}}_N} \mathbb {E}^\mathbb {Q}[h(x,\xi )]\) to \(\mathbb {E}^\mathbb {P}[h(x,\xi )]\) uniformly in x on any compact subset of \(\mathbb {X}\).
Theorems 3.5 and 3.6 indicate that a careful a priori design of the Wasserstein ball results in attractive finite sample and asymptotic guarantees for the distributionally robust solutions. In practice, however, setting the Wasserstein radius to \(\varepsilon _N(\beta )\) yields overly conservative solutions for the following reasons:

Even though the constants \(c_1\) and \(c_2\) in (8) can be computed based on the proof of [21, Theorem 2], the resulting Wasserstein ball is larger than necessary, i.e., \(\mathbb {P}\notin \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\) with probability \(\ll \beta \).

Even if \(\mathbb {P}\notin \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\), the optimal value \(\widehat{J}_N\) of (5) may still provide an upper bound on \(J^\star \).

The formula for \(\varepsilon _N(\beta )\) in (8) is independent of the training data. Allowing for random Wasserstein radii, however, results in a more efficient use of the available training data.
While Theorems 3.5 and 3.6 provide strong theoretical justification for using Wasserstein ambiguity sets, in practice, it is prudent to calibrate the Wasserstein radius via bootstrapping or cross-validation instead of using the conservative a priori bound \(\varepsilon _N(\beta )\); see Sect. 7.2 for further details. A similar approach has been advocated in [7] to determine the sizes of ambiguity sets that are constructed via goodness-of-fit tests.
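To make the calibration idea concrete, the following sketch illustrates a simplified stand-in for such a procedure (it is not the method of Sect. 7.2). It considers the affine loss \(\ell (\xi )=\xi \) on \(\Xi =\mathbb {R}\), for which the worst-case expectation over the Wasserstein ball admits the closed form \(\frac{1}{N}\sum _{i\le N}\widehat{\xi }_i+\varepsilon \) (a consequence of the reformulation results of Sect. 4, used here as an assumed shortcut), and selects the smallest radius whose certificate covers the true expected loss in a \(1-\beta \) fraction of simulated training sets:

```python
import numpy as np

rng = np.random.default_rng(1)
N, true_mean, beta, trials = 50, 1.0, 0.05, 2000

# Closed-form certificate for the affine loss l(xi) = xi on an unbounded
# support: empirical mean plus the Wasserstein radius.
radii = np.linspace(0.0, 1.0, 101)
means = rng.normal(true_mean, 1.0, size=(trials, N)).mean(axis=1)
coverage = [(means + eps >= true_mean).mean() for eps in radii]

# Smallest radius on the grid achieving the 1 - beta coverage target.
eps_star = radii[next(i for i, c in enumerate(coverage) if c >= 1 - beta)]
```

For these illustrative parameters, `eps_star` settles near the \(1-\beta \) quantile of the empirical mean's deviation (about \(1.645/\sqrt{N}\approx 0.23\) here), typically far below a conservative a priori radius.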
So far we have seen that the Wasserstein metric allows us to construct ambiguity sets with favorable asymptotic and finite sample guarantees. In the remainder of the paper we will further demonstrate that the distributionally robust optimization problem (5) with a Wasserstein ambiguity set (6) is not significantly harder to solve than the corresponding SAA problem (4).
Solving worstcase expectation problems
We now demonstrate that the inner worst-case expectation problem in (5) over the Wasserstein ambiguity set (6) can be reformulated as a finite convex program for many loss functions \(h(x,\xi )\) of practical interest. For ease of notation, throughout this section we suppress the dependence on the decision variable x. Thus, we examine a generic worst-case expectation problem
involving a decision-independent loss function \(\ell (\xi ) {:=}\max _{k \le K}\ell _k(\xi )\), which is defined as the pointwise maximum of more elementary measurable functions \(\ell _k:\mathbb {R}^m \rightarrow \overline{\mathbb {R}}\), \(k\le K\). The focus on loss functions representable as pointwise maxima is nonrestrictive unless we impose some structure on the functions \(\ell _k\). Many tractability results in the remainder of this paper are predicated on the following convexity assumption.
Assumption 4.1
(Convexity) The uncertainty set \(\Xi \subseteq \mathbb {R}^m\) is convex and closed, and the negative constituent functions \(-\ell _k\) are proper, convex, and lower semicontinuous for all \(k\le K\). Moreover, we assume that \(\ell _k\) is not identically \(-\infty \) on \(\Xi \) for all \(k\le K\).
Assumption 4.1 essentially stipulates that \(\ell (\xi )\) can be written as a maximum of concave functions. As we will showcase in Sect. 5, this mild restriction does not sacrifice much modeling power. Moreover, generalizations of this setting will be discussed in Sect. 6. We proceed as follows. Sect. 4.1 addresses the reduction of (10) to a finite convex program, while Sect. 4.2 describes a technique for constructing worst-case distributions.
Reduction to a finite convex program
The worst-case expectation problem (10) constitutes an infinite-dimensional optimization problem over probability distributions and thus appears to be intractable. However, we will now demonstrate that (10) can be re-expressed as a finite-dimensional convex program by leveraging tools from robust optimization.
Theorem 4.2
(Convex reduction) If the convexity Assumption 4.1 holds, then for any \(\varepsilon \ge 0 \) the worst-case expectation (10) equals the optimal value of the finite convex program
Recall that \([-\ell _k]^*(z_{ik} - \nu _{ik})\) denotes the conjugate of \(-\ell _k\) evaluated at \(z_{ik} - \nu _{ik}\) and \(\Vert z_{ik}\Vert _*\) the dual norm of \(z_{ik}\). Moreover, \(\chi _\Xi \) represents the characteristic function of \(\Xi \) and \(\sigma _\Xi \) its conjugate, that is, the support function of \(\Xi \).
Proof of Theorem 4.2
By using Definition 3.1 we can re-express the worst-case expectation (10) as
The second equality follows from the law of total probability, which asserts that any joint probability distribution \(\Pi \) of \(\xi \) and \(\xi '\) can be constructed from the marginal distribution \(\widehat{\mathbb {P}}_N\) of \(\xi '\) and the conditional distributions \(\mathbb {Q}_i\) of \(\xi \) given \(\xi '=\widehat{\xi }_i\), \(i\le N\), that is, we may write \(\Pi = {1 \over N}\sum _{i = 1}^{N} \delta _{\widehat{\xi }_i}\otimes \mathbb {Q}_i\). The resulting optimization problem represents a generalized moment problem in the distributions \(\mathbb {Q}_i\), \(i\le N\). Using a standard duality argument, we obtain
where (12a) follows from the max-min inequality, and (12b) follows from the fact that \(\mathcal {M}(\Xi )\) contains all the Dirac distributions supported on \(\Xi \). Introducing epigraphical auxiliary variables \(s_i\), \(i\le N\), allows us to reformulate (12b) as
Equality (12d) exploits the definition of the dual norm and the decomposability of \(\ell (\xi )\) into its constituents \(\ell _k(\xi )\), \(k\le K\). Interchanging the maximization over \(z_{ik}\) with the minus sign (thereby converting the maximization to a minimization) and then with the maximization over \(\xi \) leads to a restriction of the feasible set of (12d). The resulting upper bound (12e) can be re-expressed as
where (12f) follows from the definition of conjugacy, our conventions of extended arithmetic, and the substitution of \(z_{ik}\) with \(-z_{ik}\). Note that (12f) is already a finite convex program.
Next, we show that Assumption 4.1 reduces the inequalities (12a) and (12e) to equalities. Under Assumption 4.1, the inequality (12a) is in fact an equality for any \(\varepsilon > 0\) by virtue of an extended version of a well-known strong duality result for moment problems [44, Proposition 3.4]. One can show that (12a) continues to hold as an equality even for \(\varepsilon = 0\), in which case the Wasserstein ambiguity set (6) reduces to the singleton \(\{\widehat{\mathbb {P}}_N\}\), while (10) reduces to the sample average \(\frac{1}{N}\sum _{i=1}^N \ell (\widehat{\xi }_i)\). Indeed, for \(\varepsilon =0\) the variable \(\lambda \) in (12b) can be increased indefinitely at no penalty. As \(\ell (\xi )\) constitutes a pointwise maximum of upper semicontinuous concave functions, an elementary but tedious argument shows that (12b) converges to the sample average \(\frac{1}{N}\sum _{i=1}^N \ell (\widehat{\xi }_i)\) as \(\lambda \) tends to infinity.
The inequality (12e) also reduces to an equality under Assumption 4.1 thanks to the classical minimax theorem [4, Proposition 5.5.4], which applies because the set \(\{z_{ik} \in \mathbb {R}^m : \Vert z_{ik}\Vert _* \le \lambda \}\) is compact for any finite \(\lambda \ge 0\). Thus, the optimal values of (10) and (12f) coincide.
Assumption 4.1 further implies that the function \(-\ell _k+\chi _{\Xi }\) is proper, convex, and lower semicontinuous. Properness holds because \(\ell _k\) is not identically \(-\infty \) on \(\Xi \). By Rockafellar and Wets [42, Theorem 11.23(a), p. 493], its conjugate essentially coincides with the epi-addition (also known as inf-convolution) of the conjugates \([-\ell _k]^*\) and \(\sigma _{\Xi }\). Thus,
where \({{\mathrm{cl}}}[\cdot ]\) denotes the closure operator that maps any function to its largest lower semicontinuous minorant. As \({{\mathrm{cl}}}[f(\xi )]\le 0\) if and only if \(f(\xi )\le 0\) for any function f, we may conclude that (12f) is indeed equivalent to (11) under Assumption 4.1. \(\square \)
Note that the semi-infinite inequality in (12c) generalizes the nonlinear uncertain constraints studied in [1] because it involves an additional norm term and because the loss function \(\ell (\xi )\) is not necessarily concave under Assumption 4.1. As in [1], however, the semi-infinite constraint admits a robust counterpart that involves the conjugate of the loss function and the support function of the uncertainty set.
From the proof of Theorem 4.2 it is immediately clear that the worst-case expectation (10) is conservatively approximated by the optimal value of the finite convex program (12f) even if Assumption 4.1 fails to hold. In this case the sum \(-\ell _k + \chi _{\Xi }\) in (12f) must be evaluated under our conventions of extended arithmetics, whereby \(\infty - \infty = \infty \). These observations are formalized in the following corollary.
Corollary 4.3
(Approximate convex reduction) For any \(\varepsilon \ge 0\), the worst-case expectation (10) is less than or equal to the optimal value of the finite convex program (12f).
Extremal distributions
Stress test experiments are instrumental in assessing the quality of candidate decisions in stochastic optimization. Meaningful stress tests require a good understanding of the extremal distributions from within the Wasserstein ball that achieve the worst-case expectation (10) for various loss functions. We now show that such extremal distributions can be constructed systematically from the solution of a convex program akin to (11).
Theorem 4.4
(Worst-case distributions) If Assumption 4.1 holds, then the worst-case expectation (10) coincides with the optimal value of the finite convex program
irrespective of \(\varepsilon \ge 0\). Let \(\big \{\alpha _{ik}(r), q_{ik}(r)\big \}_{r \in \mathbb {N}}\) be a sequence of feasible decisions whose objective values converge to the supremum of (13). Then, the discrete probability distributions
belong to the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) and attain the supremum of (10) asymptotically, i.e.,
We highlight that all fractions in (13) must again be evaluated under our conventions of extended arithmetics. Specifically, if \(\alpha _{ik}=0\) and \(q_{ik}\ne 0\), then \(q_{ik}/\alpha _{ik}\) has at least one component equal to \(+\infty \) or \(-\infty \), which implies that \(\widehat{\xi }_i - q_{ik}/\alpha _{ik}\notin \Xi \). In contrast, if \(\alpha _{ik}=0\) and \(q_{ik}= 0\), then \(\widehat{\xi }_i - q_{ik} / \alpha _{ik}=\widehat{\xi }_i \in \Xi \). Moreover, the \(ik\)-th component in the objective function of (13) evaluates to 0 whenever \(\alpha _{ik} =0\) regardless of \(q_{ik}\).
The proof of Theorem 4.4 is based on the following technical lemma.
Lemma 4.5
Define \(F: \mathbb {R}^m \times \mathbb {R}_{+} \rightarrow \overline{\mathbb {R}}\) through \(F(q,\alpha ) = \inf _{z \in \mathbb {R}^m} \big \langle z, q - \alpha {\widehat{\xi }} \big \rangle + \alpha f^*(z)\) for some proper, convex, and lower semicontinuous function \(f:\mathbb {R}^m\rightarrow \overline{\mathbb {R}}\) and reference point \({\widehat{\xi }}\in \mathbb {R}^m\). Then, F coincides with the (extended) perspective function of the mapping \(q \mapsto -f({\widehat{\xi }} - q)\), that is,
Proof
By construction, we have \(F(q,0) = \inf _{z \in \mathbb {R}^m} \big \langle z, q \big \rangle = -\chi _{\{0\}}(q)\). For \(\alpha > 0\), on the other hand, the definition of conjugacy implies that
The claim then follows because \([f^*]^* = f\) for any proper, convex, and lower semicontinuous function f [4, Proposition 1.6.1(c)]. Additional information on perspective functions can be found in [12, Section 2.2.3, p. 39]. \(\square \)
Proof of Theorem 4.4
By Theorem 4.2, which applies under Assumption 4.1, the worst-case expectation (10) coincides with the optimal value of the convex program (11). From the proof of Theorem 4.2 we know that (11) is equivalent to (12f). The Lagrangian dual of (12f) is given by
where the products of dual variables and constraint functions in the objective are evaluated under the standard convention \(0 \cdot \infty = 0\). Strong duality holds since the function \([-\ell _k+\chi _{\Xi }]^*\) is proper, convex, and lower semicontinuous under Assumption 4.1 and because this function appears in a constraint of (12f) whose right-hand side is a free decision variable. By explicitly carrying out the minimization over \(\lambda \) and \(s_i\), one can show that the above dual problem is equivalent to
By using the definition of the dual norm, (14a) can be re-expressed as
where (14c) follows from the classical minimax theorem and the fact that the \(q_{ik}\) variables range over a nonempty and compact feasible set for any finite \(\varepsilon \); see [4, Proposition 5.5.4]. Eliminating the \(\beta _{ik}\) variables and using Lemma 4.5 allows us to reformulate (14c) as
Our conventions of extended arithmetics imply that the \(ik\)-th term in the objective function of problem (14e) simplifies to
Indeed, for \(\alpha _{ik}>0\), this identity trivially holds. For \(\alpha _{ik}=0\), on the other hand, the \(ik\)-th objective term in (14e) reduces to \(-\chi _{\{0\}}(q_{ik})\). Moreover, the first term in (14f) vanishes whenever \(\alpha _{ik} = 0\) regardless of \(q_{ik}\), and the second term in (14f) evaluates to 0 if \(q_{ik}=0\) (as \(0/0=0\) and \(\widehat{\xi }_i \in \Xi \)) and to \(-\infty \) if \(q_{ik}\ne 0\) (as \(q_{ik}/0\) has at least one infinite component, implying that \(\widehat{\xi }_i - q_{ik}/0\notin \Xi \)). Therefore, (14f) also reduces to \(-\chi _{\{0\}}(q_{ik})\) when \(\alpha _{ik}=0\). This proves that the \(ik\)-th objective term in (14e) coincides with (14f). Substituting (14f) into (14e) and re-expressing \(-\chi _{\Xi }\big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\big )\) in terms of an explicit hard constraint yields
Finally, replacing \(\big \{\alpha _{ik}, q_{ik}\big \}\) with \({1 \over N}\big \{\alpha _{ik}, q_{ik}\big \}\) shows that (14g) is equivalent to (13). This completes the first part of the proof.
As for the second claim, let \(\{\alpha _{ik}(r), q_{ik}(r)\}_{r \in \mathbb {N}}\) be a sequence of feasible solutions that attains the supremum in (13), and set \(\xi _{ik}(r)\,{:=}\,\widehat{\xi }_i - {q_{ik}(r) \over \alpha _{ik}(r)}\in \Xi \). Then, the discrete distribution
has the distribution \(\mathbb {Q}_r\) defined in the theorem statement and the empirical distribution \(\widehat{\mathbb {P}}_N\) as marginals. By the definition of the Wasserstein metric, \(\Pi _r\) represents a feasible mass transportation plan that provides an upper bound on the distance between \(\widehat{\mathbb {P}}_N\) and \(\mathbb {Q}_r\); see Definition 3.1. Thus, we have
where the last inequality follows readily from the feasibility of \(q_{ik}(r)\) in (13). We conclude that
where the first inequality holds as \(\mathbb {Q}_r \in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) for all \(r \in \mathbb {N}\), and the second inequality uses the trivial estimate \(\ell \ge \ell _k\) for all \(k\le K\). The last equality follows from the construction of \(\alpha _{ik}(r)\) and \(\xi _{ik}(r)\) and the fact that (13) coincides with the worst-case expectation (10). \(\square \)
In the rest of this section we discuss some notable properties of the convex program (13).
In the ambiguity-free limit, that is, when the radius of the Wasserstein ball is set to zero, the optimal value of the convex program (13) reduces to the expected loss under the empirical distribution. Indeed, for \(\varepsilon = 0\) all \(q_{ik}\) variables are forced to zero, and \(\alpha _{ik}\) enters the objective only through \(\sum _{k=1}^K \alpha _{ik}={1\over N}\). Thus, the objective function of (13) simplifies to \(\mathbb {E}^{\widehat{\mathbb {P}}_N}[\ell (\xi )]\).
We further emphasize that it is not possible to guarantee the existence of a worst-case distribution that attains the supremum in (10). In general, as shown in Theorem 4.4, we can only construct a sequence of distributions that attains the supremum asymptotically. The following example discusses an instance of (10) that admits no worst-case distribution.
Example 2
(Nonexistence of a worst-case distribution) Assume that \(\Xi = \mathbb {R}\), \(N = 1\), \(\widehat{\xi }_1 = 0\), \(K = 2\), \(\ell _1(\xi ) =0\) and \(\ell _2(\xi ) = \xi - 1\). In this case we have \(\widehat{\mathbb {P}}_N=\delta _{0}\), and problem (13) reduces to
The supremum on the right-hand side amounts to \(\varepsilon \) and is attained, for instance, by the sequence \(\alpha _{11}(r) = 1 - {1 \over r}\), \(\alpha _{12}(r) = {1 \over r}\), \(q_{11}(r) = 0\), \(q_{12}(r) = -\varepsilon \) for \(r\in {\mathbb {N}}\). Define
with \(\xi _{11}(r) = \widehat{\xi }_1 - {q_{11}(r) \over \alpha _{11}(r)}=0\) and \(\xi _{12}(r) = \widehat{\xi }_1 - {q_{12}(r) \over \alpha _{12}(r)}=\varepsilon r\). By Theorem 4.4, the two-point distributions \(\mathbb {Q}_r\) reside within the Wasserstein ball of radius \(\varepsilon \) around \(\delta _{0}\) and asymptotically attain the supremum in the worst-case expectation problem. However, this sequence has no weak limit as \(\xi _{12}(r) = \varepsilon r\) tends to infinity; see Fig. 1. In fact, no single distribution can attain the worst-case expectation. Assume for the sake of contradiction that there exists \(\mathbb {Q}^\star \in \mathbb {B}_{\varepsilon }(\delta _{0})\) with \(\mathbb {E}^{\mathbb {Q}^\star }[\ell (\xi )]=\varepsilon \). Then, we find \(\varepsilon = \mathbb {E}^{\mathbb {Q}^\star }[\ell (\xi )]< \mathbb {E}^{\mathbb {Q}^\star }[\xi ]\le \varepsilon \), where the strict inequality follows from the relation \(\ell (\xi )<\xi \) for all \(\xi \ne 0\) and the observation that \(\mathbb {Q}^\star \ne \delta _{0}\), while the second inequality follows from Theorem 3.2. Thus, \(\mathbb {Q}^\star \) does not exist.
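The escaping sequence of this example can be traced numerically: every \(\mathbb {Q}_r\) uses its full transport budget \(\varepsilon \) and achieves the expected loss \(\varepsilon -1/r\), so the supremum \(\varepsilon \) is approached but never attained. A minimal sketch:

```python
eps = 0.5  # radius of the Wasserstein ball around delta_0

def loss(xi):
    """l(xi) = max(l1, l2) with l1(xi) = 0 and l2(xi) = xi - 1."""
    return max(0.0, xi - 1.0)

for r in (2, 10, 100, 1000):
    atoms = [0.0, eps * r]                    # xi_11(r) = 0, xi_12(r) = eps * r
    weights = [1.0 - 1.0 / r, 1.0 / r]        # alpha_11(r), alpha_12(r)
    expected_loss = sum(w * loss(a) for a, w in zip(atoms, weights))
    transport_cost = weights[1] * atoms[1]    # mass 1/r moved a distance eps * r
    assert abs(transport_cost - eps) < 1e-12  # full budget used by every Q_r
    assert abs(expected_loss - (eps - 1.0 / r)) < 1e-12
# expected_loss tends to eps while the far atom eps * r escapes to infinity.
```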
The existence of a worst-case distribution can, however, be guaranteed in some special cases.
Corollary 4.6
(Existence of a worst-case distribution) Suppose that Assumption 4.1 holds. If the uncertainty set \(\Xi \) is compact or the loss function is concave (i.e., \(K=1\)), then the sequence \(\{\alpha _{ik}(r), \xi _{ik}(r)\}_{r \in \mathbb {N}}\) constructed in Theorem 4.4 has an accumulation point \(\{\alpha ^\star _{ik}, \xi ^\star _{ik}\}\), and
$$\begin{aligned} \mathbb {Q}^\star \,{:=}\, \sum _{i=1}^N\sum _{k=1}^K \alpha ^\star _{ik}\, \delta _{\xi ^\star _{ik}} \end{aligned}$$
is a worst-case distribution achieving the supremum in (10).
Proof
If \(\Xi \) is compact, then the sequence \(\{\alpha _{ik}(r), \xi _{ik}(r)\}_{r \in \mathbb {N}}\) has a converging subsequence with limit \(\{\alpha ^\star _{ik},\xi ^\star _{ik}\}\). Similarly, if \(K = 1\), then \(\alpha _{i1} = {1\over N}\) for all \(i\le N\), in which case (13) reduces to a convex optimization problem with an upper semicontinuous objective function over a compact feasible set. Hence, its supremum is attained at a point \(\{\alpha ^\star _{ik},\xi ^\star _{ik}\}\). In both cases, Theorem 4.4 guarantees that the distribution \(\mathbb {Q}^\star \) implied by \(\{\alpha ^\star _{ik},\xi ^\star _{ik}\}\) achieves the supremum in (10). \(\square \)
The worst-case distribution of Corollary 4.6 is discrete, and its atoms \(\xi ^\star _{ik}\) reside in the neighborhood of the given data points \(\widehat{\xi }_i\). By the constraints of problem (13), the probability-weighted cumulative distance between the atoms and the respective data points amounts to
which is bounded above by the radius of the Wasserstein ball. The fact that the worst-case distribution \(\mathbb {Q}^\star \) (if it exists) is supported outside of \(\widehat{\Xi }_N\) is a key feature distinguishing the Wasserstein ball from the ambiguity sets induced by other probability metrics such as the total variation distance or the Kullback–Leibler divergence; see Fig. 2. Thus, the worst-case expectation criterion based on Wasserstein balls advocated in this paper should appeal to decision makers who wish to immunize their optimization problems against perturbations of the data points.
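These structural properties can be observed on a toy instance. For the loss \(\ell (\xi )=\max \{\xi ,-\xi \}=|\xi |\) on \(\Xi =[-1,1]\) with training samples \(\{0, 0.5\}\) and radius \(\varepsilon =0.2\), a brute-force search over perturbations that move each data point to a single new atom (a simplification; in general the worst case may split each atom across up to K locations) recovers a worst-case distribution whose atoms stay near the data and whose probability-weighted displacement exhausts the budget:

```python
import numpy as np

samples = np.array([0.0, 0.5])
eps, w = 0.2, 0.5                      # radius and the weight 1/N of each sample

best_value, best_atoms = -np.inf, None
grid = np.linspace(-1.0, 1.0, 401)     # candidate atom locations in Xi
for z0 in grid:
    for z1 in grid:
        cost = w * abs(z0 - samples[0]) + w * abs(z1 - samples[1])
        if cost <= eps + 1e-12:        # stay inside the Wasserstein ball
            value = w * abs(z0) + w * abs(z1)
            if value > best_value:
                best_value, best_atoms = value, (z0, z1)
# The optimal value 0.45 exceeds the empirical mean 0.25 by exactly eps,
# and the optimizer spends the whole transport budget.
```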
Remark 4.7
(Weak coupling) We highlight that the convex program (13) is amenable to decomposition and parallelization techniques as the decision variables associated with different sample points are only coupled through the norm constraint. We expect the resulting scenario decomposition to offer a substantial speedup of the solution times for problems involving large datasets. Efficient decomposition algorithms that could be used for solving the convex program (13) are described, for example, in [35] and [5, Chapter 4].
Special loss functions
We now demonstrate that the convex optimization problems (11) and (13) reduce to computationally tractable conic programs for several loss functions of practical interest.
Piecewise affine loss functions
We first investigate the worst-case expectations of convex and concave piecewise affine loss functions, which arise, for example, in option pricing [8], risk management [34] and in generic two-stage stochastic programming [6]. Moreover, piecewise affine functions frequently serve as approximations of smooth convex or concave loss functions.
Corollary 5.1
(Piecewise affine loss functions) Suppose that the uncertainty set is a polytope, that is, \(\Xi = \{ \xi \in \mathbb {R}^m : C \xi \le d \}\) where C is a matrix and d a vector of appropriate dimensions. Moreover, consider the affine functions \(a_k(\xi ) {:=}\big \langle a_{k}, \xi \big \rangle + b_{k}\) for all \(k\le K\).

(i)
If \(\ell (\xi )= \max _{k\le K}a_k(\xi )\), then the worst-case expectation (10) evaluates to
$$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} b_k +\big \langle a_k, \widehat{\xi }_i \big \rangle + \big \langle \gamma _{ik}, d-C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N, &{} \forall k \le K\\ &{} \Vert C^\intercal \gamma _{ik} - a_{k}\Vert _* \le \lambda &{} \quad \forall i \le N, &{} \forall k \le K \\ &{} \gamma _{ik} \ge 0&{} \quad \forall i \le N, &{} \forall k \le K . \end{array}\right. \end{aligned}$$(15a) 
(ii)
If \(\ell (\xi )= \min _{k\le K}a_k(\xi )\), then the worst-case expectation (10) evaluates to
$$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{i},\theta _{i}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{}\big \langle \theta _i, b+ A\widehat{\xi }_i \big \rangle +\big \langle \gamma _{i}, d - C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N\\ &{} \Vert C^\intercal \gamma _i - A^\intercal \theta _i\Vert _* \le \lambda &{} \quad \forall i \le N \\ &{} \big \langle \theta _{i}, e \big \rangle = 1 &{} \quad \forall i \le N\\ &{} \gamma _{i}\ge 0&{} \quad \forall i \le N\\ &{} \theta _{i} \ge 0&{} \quad \forall i \le N, \end{array}\right. \end{aligned}$$(15b)where A is the matrix with rows \(a^\intercal _k\), \(k\le K\), b is the column vector with entries \(b_k\), \(k\le K\), and \(e\) is the vector of all ones.
Proof
Assertion (i) is an immediate consequence of Theorem 4.2, which applies because \(\ell (\xi )\) is the pointwise maximum of the affine functions \(\ell _k(\xi )= a_k(\xi )\), \(k\le K\), and thus Assumption 4.1 holds for \(J= K\). By definition of the conjugacy operator, we have
and
where the last equality follows from strong duality, which holds as the uncertainty set is nonempty. Assertion (i) then follows by substituting the above expressions into (11).
Assertion (ii) also follows directly from Theorem 4.2 because \(\ell (\xi )=\ell _1(\xi )= \min _{k\le K}a_k(\xi )\) is concave and thus satisfies Assumption 4.1 for \(J=1\). In this setting, we find
where the last equality follows again from strong linear programming duality, which holds since the primal maximization problem is feasible. Assertion (ii) then follows by substituting \([\ell ]^*\) as well as the formula for \(\sigma _\Xi \) from the proof of assertion (i) into (11). \(\square \)
As a consistency check, we ascertain that in the ambiguity-free limit, the optimal value of (15a) reduces to the expectation of \(\max _{k\le K}a_k(\xi )\) under the empirical distribution. Indeed, for \(\varepsilon = 0\), the variable \(\lambda \) can be set to any positive value at no penalty. For this reason and because all training samples must belong to the uncertainty set (i.e., \(d-C\widehat{\xi }_i\ge 0\) for all \(i\le N\)), it is optimal to set \(\gamma _{ik}=0\). This in turn implies that \(s_i= \max _{k\le K}a_k(\widehat{\xi }_i)\) at optimality, in which case \(\frac{1}{N}\sum _{i=1}^Ns_i\) represents the sample average of the convex loss function at hand.
An analogous argument shows that, for \(\varepsilon =0\), the optimal value of (15b) reduces to the expectation of \(\min _{k\le K}a_k(\xi )\) under the empirical distribution. As before, \(\lambda \) can be increased at no penalty. Thus, we conclude that \(\gamma _i=0\) and
at optimality, in which case \(\frac{1}{N}\sum _{i=1}^Ns_i\) is the sample average of the given concave loss function.
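These ambiguity-free limits are easy to check numerically. The following sketch (our own illustration, not part of the original development; the function name and toy data are our choices) assembles the linear program (15a) for the loss \(\ell (\xi )=\max \{\xi ,-\xi \}=|\xi |\) on \(\Xi =[-1,1]\), with the Wasserstein metric induced by the 1-norm so that the dual norm on \(\mathbb {R}\) is the absolute value, and solves it with SciPy.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation_15a(samples, eps):
    """Assemble and solve the LP (15a) for the loss l(xi) = max(xi, -xi) = |xi|
    on the polytope Xi = [-1, 1], i.e. C = [[1], [-1]], d = (1, 1).
    The Wasserstein metric uses the 1-norm, so the dual norm on R is |.|."""
    a, b = [1.0, -1.0], [0.0, 0.0]                 # affine pieces a_k*xi + b_k
    C, d = np.array([[1.0], [-1.0]]), np.array([1.0, 1.0])
    N, K, R = len(samples), 2, 2                   # samples, pieces, rows of C
    nv = 1 + N + N * K * R                         # lambda, s_i, gamma_ik in R^R
    g = lambda i, k: 1 + N + (i * K + k) * R       # start index of gamma_ik
    A_ub, b_ub = [], []
    for i, xi in enumerate(samples):
        slack = d - C @ [xi]                       # d - C*xi_hat_i >= 0
        for k in range(K):
            # b_k + a_k*xi_hat_i + <gamma_ik, d - C*xi_hat_i> <= s_i
            row = np.zeros(nv)
            row[1 + i] = -1.0
            row[g(i, k):g(i, k) + R] = slack
            A_ub.append(row); b_ub.append(-(b[k] + a[k] * xi))
            # |C^T gamma_ik - a_k| <= lambda, written as two inequalities
            for sgn in (1.0, -1.0):
                row = np.zeros(nv)
                row[0] = -1.0                      # -lambda
                row[g(i, k)] = sgn                 # C^T gamma = gamma_1 - gamma_2
                row[g(i, k) + 1] = -sgn
                A_ub.append(row); b_ub.append(sgn * a[k])
    c = np.zeros(nv)
    c[0], c[1:1 + N] = eps, 1.0 / N                # lambda*eps + (1/N) sum s_i
    bounds = [(0, None)] + [(None, None)] * N + [(0, None)] * (N * K * R)
    return linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds).fun
```

For \(\varepsilon = 0\) the optimal value equals the sample average of \(|\widehat{\xi }_i|\), and for small \(\varepsilon > 0\) it grows at unit rate, matching the Lipschitz modulus of \(|\cdot |\).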
Uncertainty quantification
A problem of great practical interest is to ascertain whether a physical, economic or engineering system with an uncertain state \(\xi \) satisfies a number of safety constraints with high probability. In the following we denote by \(\mathbb {A}\) the set of states in which the system is safe. Our goal is to quantify the probability of the event \(\xi \in \mathbb {A}\) (respectively, \(\xi \notin \mathbb {A}\)) under an ambiguous state distribution that is only indirectly observable through a finite training dataset. More precisely, we aim to calculate the worst-case probability of the system being unsafe, i.e.,
as well as the best-case probability of the system being safe, that is,
Remark 5.2
(Data-dependent sets) The set \(\mathbb {A}\) may even depend on the samples \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\), in which case \(\mathbb {A}\) is renamed as \({\widehat{\mathbb {A}}}\). If the Wasserstein radius \(\varepsilon \) is set to \(\varepsilon _N(\beta )\), then we have \(\mathbb {P}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) with probability \(1-\beta \), implying that (16a) and (16b) still provide \(1-\beta \) confidence bounds on \(\mathbb {P}[\xi \notin {\widehat{\mathbb {A}}}]\) and \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\), respectively.
Corollary 5.3
(Uncertainty quantification) Suppose that the uncertainty set is a polytope of the form \(\Xi = \{ \xi \in \mathbb {R}^m : C \xi \le d \}\) as in Corollary 5.1.

(i)
If \(\mathbb {A} = \{\xi \in \mathbb {R}^m: A\xi < b\}\) is an open polytope and the halfspace \(\big \{\xi :\big \langle a_k, \xi \big \rangle \ge b_k \big \}\) has a nonempty intersection with \(\Xi \) for any \(k\le K\), where \(a_k\) is the kth row of the matrix A and \(b_k\) is the kth entry of the vector b, then the worst-case probability (16a) is given by
$$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{ik},\theta _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{}1-\theta _{ik}\big (b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle \big ) +\big \langle \gamma _{ik}, d-C\widehat{\xi }_i \big \rangle \le s_i &{}\quad \forall i \le N, &{} \forall k \le K\\ &{} \Vert a_k\theta _{ik}-C^\intercal \gamma _{ik}\Vert _* \le \lambda &{}\quad \forall i \le N, &{} \forall k \le K \\ &{} \gamma _{ik}\ge 0&{}\quad \forall i \le N, &{} \forall k \le K\\ &{} \theta _{ik} \ge 0&{}\quad \forall i \le N, &{} \forall k \le K\\ &{} s_i \ge 0 &{} \quad \forall i \le N. \end{array}\right. \end{aligned}$$(17a) 
(ii)
If \(\mathbb {A} = \{\xi \in \mathbb {R}^m : A\xi \le b\}\) is a closed polytope that has a nonempty intersection with \(\Xi \), then the best-case probability (16b) is given by
$$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _i, \theta _i} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} 1+\big \langle \theta _i, b - A\widehat{\xi }_i \big \rangle + \big \langle \gamma _{i}, d - C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N \\ &{} \Vert A^\intercal \theta _i+C^\intercal \gamma _{i}\Vert _* \le \lambda &{} \quad \forall i \le N \\ &{} \gamma _i \ge 0 &{}\quad \forall i \le N\\ &{} \theta _{i} \ge 0 &{}\quad \forall i \le N\\ &{} s_i\ge 0 &{}\quad \forall i \le N. \end{array}\right. \end{aligned}$$(17b)
Proof
The uncertainty quantification problems (16a) and (16b) can be interpreted as instances of (10) with loss functions \(\ell = 1 - \mathbbm {1}_{\mathbb {A}}\) and \(\ell = \mathbbm {1}_{\mathbb {A}}\), respectively. In order to be able to apply Theorem 4.2, we should represent these loss functions as finite maxima of concave functions as shown in Fig. 3.
Formally, assertion (i) follows from Theorem 4.2 for a loss function with \(K+1\) pieces if we use the following definitions. For every \(k\le K\) we define
Moreover, we define \(\ell _{K+1}(\xi ) = 0\). As illustrated in Fig. 3a, we thus have \(\ell (\xi )=\max _{k\le K+1} \ell _k(\xi )= 1 - \mathbbm {1}_{\mathbb {A}}(\xi )\) and
Assumption 4.1 holds due to the postulated properties of \(\mathbb {A}\) and \(\Xi \). In order to apply Theorem 4.2, we must determine the support function \(\sigma _\Xi \), which is already known from Corollary 5.1, as well as the conjugate functions of \(\ell _k\), \(k\le K+1\). A standard duality argument yields
for all \(k\le K\). Moreover, we have \([\ell _{K+1}]^*(z) = 0\) if \(z=0\) and \([\ell _{K+1}]^*(z)=\infty \) otherwise. Assertion (i) then follows by substituting the formulas for \([\ell _k]^*\), \(k\le K+1\), and \(\sigma _\Xi \) into (11).
Assertion (ii) follows from Theorem 4.2 by setting \(J= 2\), \(\ell _1(\xi ) = 1-\chi _{\mathbb {A}}(\xi )\) and \(\ell _2(\xi ) = 0\). As illustrated in Fig. 3b, this implies that \(\ell (\xi )=\max \{\ell _1(\xi ),\ell _2(\xi )\}=\mathbbm {1}_{\mathbb {A}}(\xi )\) and
Assumption 4.1 holds by our assumptions on \(\mathbb {A}\) and \(\Xi \). In order to apply Theorem 4.2, we thus have to determine the support function \(\sigma _\Xi \), which was already calculated in Corollary 5.1, and the conjugate functions of \(\ell _1\) and \(\ell _2\). By the definition of the conjugacy operator, we find
where the last equality follows from strong linear programming duality, which holds as the safe set is nonempty. Similarly, we find \([\ell _{2}]^*(z) = 0\) if \(z=0\) and \([\ell _{2}]^*(z)=\infty \) otherwise. Assertion (ii) then follows by substituting the above expressions into (11). \(\square \)
In the ambiguity-free limit (i.e., for \(\varepsilon = 0\)) the optimal value of (17a) reduces to the fraction of training samples residing outside of the open polytope \(\mathbb {A}=\{\xi :A\xi <b\}\). Indeed, in this case the variable \(\lambda \) can be set to any positive value at no penalty. For this reason and because all training samples belong to the uncertainty set (i.e., \(d-C\widehat{\xi }_i\ge 0\) for all \(i\le N\)), it is optimal to set \(\gamma _{ik}=0\). If the ith training sample belongs to \(\mathbb {A}\) (i.e., \(b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle > 0\) for all \(k\le K\)), then \(\theta _{ik}\ge 1/(b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle )\) for all \(k\le K\) and \(s_i=0\) at optimality. Conversely, if the ith training sample belongs to the complement of \(\mathbb {A}\) (i.e., \(b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle \le 0\) for some \(k\le K\)), then \(\theta _{ik}=0\) for some \(k\le K\) and \(s_i=1\) at optimality. Thus, \(\sum _{i=1}^Ns_i\) coincides with the number of training samples outside of \(\mathbb {A}\) at optimality. An analogous argument shows that, for \(\varepsilon =0\), the optimal value of (17b) reduces to the fraction of training samples residing inside of the closed polytope \(\mathbb {A}=\{\xi :A\xi \le b\}\).
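As a concrete illustration of the worst-case probability (17a), the following sketch (ours; the function name and toy data are hypothetical) instantiates the linear program in one dimension with \(\Xi =[-1,1]\) and the open safe set \(\mathbb {A}=\{\xi : \xi < 0.5\}\), i.e., a single row \(K=1\), and solves it with SciPy.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_unsafe_prob(samples, eps, b_safe=0.5):
    """LP (17a) in one dimension: safe set A = {xi : xi < b_safe} (K = 1),
    support Xi = [-1, 1], Wasserstein metric induced by the 1-norm on R."""
    a_row = 1.0                                  # A = {xi : a_row*xi < b_safe}
    C, d = np.array([[1.0], [-1.0]]), np.array([1.0, 1.0])
    N, R = len(samples), 2
    # variable layout: [lambda, s_1..s_N, gamma_i (R each), theta_i (1 each)]
    nv = 1 + N + N * R + N
    g = lambda i: 1 + N + i * R
    t = lambda i: 1 + N + N * R + i
    A_ub, b_ub = [], []
    for i, xi in enumerate(samples):
        slack = d - C @ [xi]                     # d - C*xi_hat_i >= 0
        # 1 - theta_i*(b_safe - a_row*xi) + <gamma_i, d - C*xi> <= s_i
        row = np.zeros(nv)
        row[1 + i] = -1.0
        row[t(i)] = -(b_safe - a_row * xi)
        row[g(i):g(i) + R] = slack
        A_ub.append(row); b_ub.append(-1.0)
        # |a_row*theta_i - C^T gamma_i| <= lambda (two inequalities)
        for sgn in (1.0, -1.0):
            row = np.zeros(nv)
            row[0] = -1.0
            row[t(i)] = sgn * a_row
            row[g(i)] = -sgn                     # C^T gamma = gamma_1 - gamma_2
            row[g(i) + 1] = sgn
            A_ub.append(row); b_ub.append(0.0)
    c = np.zeros(nv)
    c[0], c[1:1 + N] = eps, 1.0 / N
    return linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                   bounds=[(0, None)] * nv).fun
```

At \(\varepsilon =0\) the optimal value is the empirical fraction of unsafe samples; as \(\varepsilon \) grows, the worst-case distribution transports the nearest training samples to the boundary of the safe set.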
Two-stage stochastic programming
A major challenge in linear two-stage stochastic programming is to evaluate the expected recourse costs, which are only implicitly defined as the optimal value of a linear program whose coefficients depend linearly on the uncertain problem parameters [46, Section 2.1]. The following corollary shows how we can evaluate the worst-case expectation of the recourse costs with respect to an ambiguous parameter distribution that is only observable through a finite training dataset. For ease of notation and without loss of generality, we suppress here any dependence on the first-stage decisions.
Corollary 5.4
(Twostage stochastic programming) Suppose that the uncertainty set is a polytope of the form \(\Xi = \{ \xi \in \mathbb {R}^m : C \xi \le d \}\) as in Corollaries 5.1 and 5.3.

(i)
If \(\ell (\xi ) =\inf _{y} \left\{ \big \langle y, Q\xi \big \rangle : Wy\ge h \right\} \) is the optimal value of a parametric linear program with objective uncertainty, and if the feasible set \(\{y:Wy\ge h\}\) is nonempty and compact, then the worst-case expectation (10) is given by
$$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _i, y_i} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} \big \langle y_i, Q\widehat{\xi }_i \big \rangle + \big \langle \gamma _{i}, d - C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N \\ &{} Wy_i\ge h &{}\quad \forall i \le N\\ &{} \Vert Q^\intercal y_i-C^\intercal \gamma _{i}\Vert _* \le \lambda &{} \quad \forall i \le N \\ &{} \gamma _i \ge 0 &{} \quad \forall i \le N. \end{array}\right. \end{aligned}$$(18a) 
(ii)
If \(\ell (\xi ) =\inf _{y} \left\{ \big \langle q, y \big \rangle : Wy \ge H\xi + h \right\} \) is the optimal value of a parametric linear program with right-hand side uncertainty, and if the dual feasible set \(\{\theta \ge 0:W^\intercal \theta =q\}\) is nonempty and compact with vertices \(v_k\), \(k\le K\), then the worst-case expectation (10) is given by
$$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} \big \langle v_k, h \big \rangle + \big \langle H^\intercal v_k, \widehat{\xi }_i \big \rangle + \big \langle \gamma _{ik}, d-C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N, &{} \forall k \le K\\ &{} \Vert C^\intercal \gamma _{ik}-H^\intercal v_k\Vert _* \le \lambda &{} \quad \forall i \le N, &{} \forall k \le K \\ &{} \gamma _{ik} \ge 0&{}\quad \forall i \le N, &{} \forall k \le K. \end{array}\right. \end{aligned}$$(18b)
Proof
Assertion (i) follows directly from Theorem 4.2 because \(\ell (\xi )\) is concave as an infimum of linear functions in \(\xi \). Indeed, the compactness of the feasible set \(\{y: Wy\ge h\}\) ensures that Assumption 4.1 holds for \(J=1\). In this setting, we find
where the second equality follows from the classical minimax theorem [4, Proposition 5.5.4], which applies because \(\{y: Wy\ge h\}\) is compact. Assertion (i) then follows by substituting \([\ell ]^*\) as well as the formula for \(\sigma _\Xi \) from Corollary 5.1 into (11).
Assertion (ii) relies on the following reformulation of the loss function,
where the first equality holds due to strong linear programming duality, which applies as the dual feasible set is nonempty. The second equality exploits the elementary observation that the optimal value of a linear program with a nonempty, compact feasible set is always attained at a vertex. As we managed to express \(\ell (\xi )\) as a pointwise maximum of linear functions, assertion (ii) follows immediately from Corollary 5.1 (i). \(\square \)
As expected, in the ambiguity-free limit, problem (18a) reduces to a standard SAA problem. Indeed, for \(\varepsilon =0\), the variable \(\lambda \) can be made large at no penalty, and thus \(\gamma _i=0\) and \(s_i=\big \langle y_i, Q\widehat{\xi }_i \big \rangle \) at optimality. In this case, problem (18a) is equivalent to
Similarly, one can verify that for \(\varepsilon =0\), (18b) reduces to the SAA problem
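To make the two-stage reduction (18a) concrete, the following sketch (our illustration; names and data are hypothetical) builds the linear program for a scalar recourse problem with \(Q=1\) and feasible set \(\{y : 0\le y\le 1\}\), so that \(\ell (\xi )=\min \{0,\xi \}\), on \(\Xi =[-1,1]\) with the 1-norm metric.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_recourse_18a(samples, eps):
    """LP (18a) for the recourse cost l(xi) = min{y*xi : 0 <= y <= 1}
    = min(0, xi), with Q = 1 and support Xi = [-1, 1] (C = [[1], [-1]],
    d = (1, 1)); the dual norm on R is the absolute value."""
    C, d = np.array([[1.0], [-1.0]]), np.array([1.0, 1.0])
    N, R = len(samples), 2
    # variable layout: [lambda, s_1..s_N, gamma_i (R each), y_i (1 each)]
    nv = 1 + N + N * R + N
    g = lambda i: 1 + N + i * R
    y = lambda i: 1 + N + N * R + i
    A_ub, b_ub = [], []
    for i, xi in enumerate(samples):
        slack = d - C @ [xi]                     # d - C*xi_hat_i >= 0
        # <y_i, Q*xi_hat_i> + <gamma_i, d - C*xi_hat_i> <= s_i
        row = np.zeros(nv)
        row[1 + i] = -1.0
        row[y(i)] = xi
        row[g(i):g(i) + R] = slack
        A_ub.append(row); b_ub.append(0.0)
        # |Q^T y_i - C^T gamma_i| <= lambda (two inequalities)
        for sgn in (1.0, -1.0):
            row = np.zeros(nv)
            row[0] = -1.0
            row[y(i)] = sgn
            row[g(i)] = -sgn
            row[g(i) + 1] = sgn
            A_ub.append(row); b_ub.append(0.0)
    c = np.zeros(nv)
    c[0], c[1:1 + N] = eps, 1.0 / N
    bounds = ([(0, None)] + [(None, None)] * N    # lambda >= 0, s_i free
              + [(0, None)] * (N * R)             # gamma_i >= 0
              + [(0.0, 1.0)] * N)                 # feasible set {0 <= y <= 1}
    return linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds).fun
```

For \(\varepsilon =0\) the optimal value is the sample average of \(\min \{0,\widehat{\xi }_i\}\); for \(\varepsilon >0\) the adversary transports mass toward \(\xi \ge 0\), raising the expectation at unit rate.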
We close this section with a remark on the computational complexity of all the convex optimization problems derived in this section.
Remark 5.5
(Computational tractability)

If the Wasserstein metric is defined in terms of the 1-norm (i.e., \(\Vert \xi \Vert =\sum _{k=1}^m|\xi _k|\)) or the \(\infty \)-norm (i.e., \(\Vert \xi \Vert =\max _{k\le m}|\xi _k|\)), then the optimization problems (15a), (15b), (17a), (17b), (18a) and (18b) all reduce to linear programs whose sizes scale with the number N of data points and the number J of affine pieces of the underlying loss functions.

Except for the two-stage stochastic program with right-hand side uncertainty in (18b), the resulting linear programs scale polynomially in the problem description and are therefore computationally tractable. As the number of vertices \(v_k\), \(k\le K\), of the polytope \(\{\theta \ge 0:W^\intercal \theta =q\}\) may be exponential in the number of its facets, however, the linear program (18b) has generically exponential size.

Inspecting (15a), one easily verifies that the distributionally robust optimization problem (5) reduces to a finite convex program if \(\mathbb {X}\) is convex and \(h(x,\xi )= \max _{k\le K} \big \langle a_{k}(x), \xi \big \rangle + b_{k}(x)\), while the gradients \(a_{k}(x)\) and the intercepts \(b_{k}(x)\) depend linearly on x. Similarly, (5) can be reformulated as a finite convex program if \(\mathbb {X}\) is convex and \(h(x,\xi )=\inf _{y} \left\{ \big \langle y, Q\xi \big \rangle : Wy\ge h(x) \right\} \) or \(h(x,\xi )=\inf _{y} \left\{ \big \langle q, y \big \rangle : Wy \ge H(x)\xi + h(x) \right\} \), while the right-hand side coefficients h(x) and H(x) depend linearly on x; see (18a) and (18b), respectively. In contrast, problems (15b), (17a) and (17b) result in nonconvex optimization problems when their data depends on x.

We emphasize that the computational complexity of all convex programs examined in this section is independent of the radius \(\varepsilon \) of the Wasserstein ball.
Tractable extensions
We now demonstrate that, through minor modifications of the proofs, Theorems 4.2 and 4.4 extend to worst-case expectation problems involving even richer classes of loss functions. First, we investigate problems where the uncertainty can be viewed as a stochastic process and where the loss function is additively separable. Next, we study problems whose loss functions are convex in the uncertain variables and are therefore not necessarily representable as finite maxima of concave functions as postulated by Assumption 4.1.
Stochastic processes with a separable cost
Consider a variant of the worst-case expectation problem (10), where the uncertain parameters can be interpreted as a stochastic process \(\xi = \big (\xi _1,\ldots ,\xi _T\big )\), and assume that \(\xi _t \in \Xi _t\), where \( \Xi _t \subseteq \mathbb {R}^m\) is nonempty and closed for any \(t\le T\). Moreover, assume that the loss function is additively separable with respect to the temporal structure of \(\xi \), that is,
where \(\ell _{tk}:\mathbb {R}^m\rightarrow \overline{\mathbb {R}}\) is a measurable function for any \(k\le K\) and \(t\le T\). Such loss functions appear, for instance, in open-loop stochastic optimal control or in multi-item newsvendor problems. Consider a process norm \(\Vert \xi \Vert _{\mathrm{T}} = \sum _{t = 1}^{T} \Vert \xi _t\Vert \) associated with the base norm \(\Vert \cdot \Vert \) on \(\mathbb {R}^m\), and assume that its induced metric is the one used in the definition of the Wasserstein distance. Note that if \(\Vert \cdot \Vert \) is the 1-norm on \(\mathbb {R}^m\), then \(\Vert \cdot \Vert _{\mathrm{T}}\) reduces to the 1-norm on \(\mathbb {R}^{mT}\).
By interchanging summation and maximization, the loss function (19) can be re-expressed as
where the maximum runs over all \(K^T\) combinations of \(k_1,\ldots , k_T\le K\). Under this representation, Theorem 4.2 remains applicable. However, the resulting convex optimization problem would involve \(\mathcal O(K^T)\) decision variables and constraints, indicating that an efficient solution may not be available. Fortunately, this deficiency can be overcome by modifying Theorem 4.2.
Theorem 6.1
(Convex reduction for separable loss functions) Assume that the loss function \(\ell \) is of the form (19), and the Wasserstein ball is defined through the process norm \(\Vert \cdot \Vert _{\mathrm{T}}\). Then, for any \(\varepsilon \ge 0 \), the worst-case expectation (10) is less than or equal to the optimal value of the finite convex program
If \(\Xi _t\) and \(\{\ell _{tk}\}_{k\le K}\) satisfy the convexity Assumption 4.1 for every \(t\le T\), then the worst-case expectation (10) coincides exactly with the optimal value of problem (20).
Proof
Up until equation (12d), the proof of Theorem 6.1 parallels that of Theorem 4.2. Starting from (12d), we then have
where the interchange of the summation and the maximization is facilitated by the separability of the overall loss function. Introducing epigraphical auxiliary variables yields
where the inequality is justified in a similar manner as the one in (12e), and it holds as an equality provided that \(\Xi _t\) and \(\{\ell _{tk}\}_{k\le K}\) satisfy Assumption 4.1 for all \(t \le T\). Finally, by Rockafellar and Wets [42, Theorem 11.23(a),p. 493], the conjugate of \(\ell _{tk} + \chi _{\Xi _t}\) can be replaced by the infconvolution of the conjugates of \(\ell _{tk}\) and \(\chi _{\Xi _t}\). This completes the proof. \(\square \)
Note that the convex program (20) involves only \(\mathcal {O}(NKT)\) decision variables and constraints. Moreover, if \(\ell _{tk}\) is affine for every \(t\le T\) and \(k\le K\), while \(\Vert \cdot \Vert \) represents the 1-norm or the \(\infty \)-norm on \(\mathbb {R}^m\), then (20) reduces to a tractable linear program (see also Remark 5.5). A natural generalization of Theorem 4.4 further allows us to characterize the extremal distributions of the worst-case expectation problem (10) with a separable loss function of the form (19).
Theorem 6.2
(Worst-case distributions for separable loss functions) Assume that the loss function \(\ell \) is of the form (19), and the Wasserstein ball is defined through the process norm \(\Vert \cdot \Vert _{\mathrm{T}}\). If \(\Xi _t\) and \(\{\ell _{tk}\}_{k\le K}\) satisfy Assumption 4.1 for all \(t \le T\), then the worst-case expectation (10) coincides with the optimal value of the finite convex program
irrespective of \(\varepsilon \ge 0 \). Let \(\big \{\alpha _{tik}(r), q_{tik}(r)\big \}_{r \in \mathbb {N}}\) be a sequence of feasible decisions whose objective values converge to the supremum of (21). Then, the discrete (product) probability distributions
belong to the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) and attain the supremum of (10) asymptotically, i.e.,
Proof
As in the proof of Theorem 4.4, the claim follows by dualizing the convex program (20). Details are omitted for brevity of exposition. \(\square \)
We emphasize that the distributions \(\mathbb {Q}_r\) from Theorem 6.2 can be constructed efficiently by solving a convex program of polynomial size even though they have \(NK^T\) discretization points.
Convex loss functions
Consider now another variant of the worst-case expectation problem (10), where the loss function \(\ell \) is proper, convex and lower semicontinuous. Unless \(\ell \) is piecewise affine, we cannot represent such a loss function as a pointwise maximum of finitely many concave functions, and thus Theorem 4.2 may only provide a loose upper bound on the worst-case expectation (10). The following theorem provides an alternative upper bound that admits new insights into distributionally robust optimization with Wasserstein balls and becomes exact for \(\Xi =\mathbb {R}^m\).
Theorem 6.3
(Convex reduction for convex loss functions) Assume that the loss function \(\ell \) is proper, convex, and lower semicontinuous, and define \(\kappa {:=}\sup \big \{ \Vert \theta \Vert _* : \ell ^*(\theta ) < \infty \big \}\). Then, for any \(\varepsilon \ge 0 \), the worst-case expectation (10) is less than or equal to
If \(\Xi =\mathbb {R}^m\), then the worst-case expectation (10) coincides exactly with (22).
Remark 6.4
(Radius of effective domain) The parameter \(\kappa \) can be viewed as the radius of the smallest ball containing the effective domain of the conjugate function \(\ell ^*\) in terms of the dual norm. By the standard conventions of extended arithmetic, the term \(\kappa \varepsilon \) in (22) is interpreted as 0 if \(\kappa =\infty \) and \(\varepsilon =0\).
Proof
Equation (12b) in the proof of Theorem 4.2 implies that
for every \(\varepsilon > 0\). As \(\ell \) is proper, convex, and lower semicontinuous, it coincides with its biconjugate function \(\ell ^{**}\), see e.g. [4, Proposition 1.6.1(c)]. Thus, we may write
where \(\Theta {:=}\{\theta \in \mathbb {R}^m : \ell ^*(\theta ) < \infty \}\) denotes the effective domain of the conjugate function \(\ell ^*\). Using this dual representation of \(\ell \) in conjunction with the definition of the dual norm, we find
The classical minimax theorem [4, Proposition 5.5.4] then allows us to interchange the maximization over \(\xi \) with the maximization over \(\theta \) and the minimization over z to obtain
Recall that \(\sigma _\Xi \) denotes the support function of \(\Xi \). It seems that there is no simple exact reformulation of (24) for arbitrary convex uncertainty sets \(\Xi \). Interchanging the maximization over \(\theta \) with the minimization over z in (24) would lead to the conservative upper bound of Corollary 4.3. Here, however, we employ an alternative approximation. By definition of the support function, we have \(\sigma _\Xi \le \sigma _{\mathbb {R}^m} = \chi _{\{0\}}\). Replacing \(\sigma _\Xi \) with \( \chi _{\{0\}}\) in (24) thus results in the conservative approximation
The inequality (22) then follows readily by substituting (25) into (23) and using the definition of \(\kappa \) in the theorem statement. For \(\Xi =\mathbb {R}^m\) we have \(\sigma _\Xi = \chi _{\{0\}}\), and thus the upper bound (22) becomes exact. Finally, if \(\varepsilon =0\), then (10) trivially coincides with (22) under our conventions of extended arithmetic. Thus, the claim follows. \(\square \)
Theorem 6.3 asserts that for \(\Xi =\mathbb {R}^m\), the worst-case expectation (10) of a convex loss function reduces to the sample average of the loss adjusted by the simple correction term \(\kappa \varepsilon \). The following proposition highlights that \(\kappa \) can be interpreted as a measure of the maximum steepness of the loss function. This interpretation has intuitive appeal in view of Definition 3.1.
Proposition 6.5
(Steepness of the loss function) Let \(\kappa \) be defined as in Theorem 6.3.

(i)
If \(\ell \) is \({\overline{L}}\)-Lipschitz continuous, i.e., if there exists \(\xi ' \in \mathbb {R}^m\) such that \(\ell (\xi ) - \ell (\xi ') \le {\overline{L}}\Vert \xi -\xi '\Vert \) for all \(\xi \in \mathbb {R}^m\), then \(\kappa \le {\overline{L}}\).

(ii)
If \(\ell \) majorizes an affine function, i.e., if there exists \(\theta \in \mathbb {R}^m\) with \(\Vert \theta \Vert _*=:{\underline{L}}\) and \(\xi ' \in \mathbb {R}^m\) such that \(\ell (\xi ) - \ell (\xi ') \ge \big \langle \theta , \xi -\xi ' \big \rangle \) for all \(\xi \in \mathbb {R}^m\), then \(\kappa \ge {\underline{L}} \).
Proof
The proof follows directly from the definition of conjugacy. As for (i), we have
where the last equality follows from the definition of the dual norm. Applying the minimax theorem [4, Proposition 5.5.4] and explicitly carrying out the maximization over \(\xi \) yields
Consequently, \(\ell ^*(\theta )\) is infinite for all \(\theta \) with \(\Vert \theta \Vert _*> {\overline{L}}\), which readily implies that the \(\Vert \cdot \Vert _*\)ball of radius \({\overline{L}}\) contains the effective domain of \(\ell ^*\). Thus, \(\kappa \le {\overline{L}}\).
As for (ii), we have
which implies that \(\ell ^*(\theta ) \le \big \langle \theta , \xi ' \big \rangle - \ell (\xi ') < \infty \). Thus, \(\theta \) belongs to the effective domain of \(\ell ^*\). We then conclude that \(\kappa \ge \Vert \theta \Vert _* = {\underline{L}}\). \(\square \)
Remark 6.6
(Consistent formulations) If \(\Xi =\mathbb {R}^m\) and the loss function is given by \(\ell (\xi ) = \max _{k \le K}\{\big \langle a_{k}, \xi \big \rangle + b_{k}\}\), then both Corollary 5.1 and Theorem 6.3 offer an exact reformulation of the worst-case expectation (10) in terms of a finite-dimensional convex program. On the one hand, Corollary 5.1 implies that (10) is equivalent to
which is obtained by setting \(C=0\) and \(d=0\) in (15a). At optimality we have \(\lambda ^\star =\max _{k\le K} \Vert a_k\Vert _*\), which corresponds to the (best) Lipschitz constant of \(\ell (\xi )\) with respect to the norm \(\Vert \cdot \Vert \). On the other hand, Theorem 6.3 implies that (10) is equivalent to (22) with \(\kappa =\lambda ^\star \). Thus, Corollary 5.1 and Theorem 6.3 are consistent.
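In this unconstrained setting the optimal value can even be evaluated in closed form, which makes the consistency check executable. The following sketch (ours, for scalar \(\xi \); the function name is hypothetical) uses that with \(C=0\) and \(d=0\) the dual-norm constraints in (15a) reduce to \(\Vert a_k\Vert _*\le \lambda \), \(\gamma _{ik}=0\) is optimal, and the optimal value equals the sample average of the loss plus \(\lambda ^\star \varepsilon \).

```python
import numpy as np

def wc_expectation_unbounded(samples, eps, pieces):
    """Optimal value of (15a) with C = 0 and d = 0 (i.e. Xi = R):
    gamma_ik = 0 is optimal, lambda* = max_k |a_k| (the dual norm on R),
    and s_i = max_k (a_k*xi_i + b_k), so the value is the sample average
    of the piecewise affine loss plus lambda* times the Wasserstein radius."""
    lam = max(abs(a) for a, b in pieces)                 # lambda* = max_k |a_k|
    s = [max(a * xi + b for a, b in pieces) for xi in samples]
    return lam * eps + float(np.mean(s))
```

For instance, with the pieces \((a_1,b_1)=(1,0)\) and \((a_2,b_2)=(-1,0)\), i.e., \(\ell (\xi )=|\xi |\), the value is the empirical mean of \(|\widehat{\xi }_i|\) plus \(\varepsilon \), in agreement with Theorem 6.3 for \(\kappa =1\).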
Remark 6.7
(\(\varepsilon \)-insensitive optimizers) Consider a loss function \(h(x,\xi )\) that is convex in \(\xi \), and assume that \(\Xi =\mathbb {R}^m\). In this case Theorem 6.3 remains valid, but the steepness parameter \(\kappa (x)\) may depend on x. For loss functions whose Lipschitz modulus with respect to \(\xi \) is independent of x (e.g., the newsvendor loss), however, \(\kappa (x)\) is constant. In this case the distributionally robust optimization problem (5) and the SAA problem (4) share the same minimizers irrespective of the Wasserstein radius \(\varepsilon \). This phenomenon could explain why the SAA solutions tend to display a surprisingly strong out-of-sample performance in these problems.
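As a worked instance of this remark (our example, for scalar \(\xi \)), consider a newsvendor-type loss with holding cost \(a>0\) and backlog cost \(b>0\),
$$\begin{aligned} h(x,\xi ) = \max \big \{ a(x-\xi ),\, b(\xi -x) \big \}, \qquad \kappa (x) = \max \{a,b\} =: \kappa , \end{aligned}$$
whose slopes in \(\xi \) are \(-a\) and \(b\) regardless of the order quantity x, so the effective domain of the conjugate is \([-a,b]\) for every x. Theorem 6.3 with \(\Xi =\mathbb {R}\) then yields
$$\begin{aligned} \sup _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^{\mathbb {Q}}[h(x,\xi )] = \frac{1}{N}\sum _{i=1}^N h(x,\widehat{\xi }_i) + \kappa \varepsilon . \end{aligned}$$
Since the additive offset \(\kappa \varepsilon \) does not depend on x, minimizing the distributionally robust objective over \(x\in \mathbb {X}\) is equivalent to minimizing the SAA objective, for every \(\varepsilon \ge 0\).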
Numerical results
We validate the theoretical results of this paper in the context of a stylized portfolio selection problem. The subsequent simulation experiments are designed to provide additional insights into the performance guarantees of the proposed distributionally robust optimization scheme.
Mean-risk portfolio optimization
Consider a capital market consisting of m assets whose yearly returns are captured by the random vector \(\xi = [\xi _1, \ldots , \xi _m]^\intercal \). If short selling is forbidden, a portfolio is encoded by a vector of percentage weights \(x=[x_1,\ldots ,x_m]^\intercal \) ranging over the probability simplex \(\mathbb {X}=\{x\in {\mathbb {R}}^m_+: \sum _{i=1}^{m}x_i = 1\}\). As portfolio x invests a percentage \(x_i\) of the available capital in asset i for each \(i=1,\ldots ,m\), its return amounts to \(\big \langle x, \xi \big \rangle \). In the remainder we aim to solve the single-stage stochastic program
which minimizes a weighted sum of the mean and the conditional value-at-risk (CVaR) of the portfolio loss \(-\big \langle x, \xi \big \rangle \), where \(\alpha \in (0,1]\) is referred to as the confidence level of the CVaR, and \(\rho \in \mathbb {R}_+\) quantifies the investor’s risk aversion. Intuitively, the CVaR at level \(\alpha \) represents the average of the \(\alpha \times 100{\%}\) worst (highest) portfolio losses under the distribution \(\mathbb {P}\). Replacing the CVaR in the above expression with its formal definition [41], we obtain
where \(K=2\), \(a_1= -1\), \(a_2= -1-\frac{\rho }{\alpha }\), \(b_1=\rho \) and \(b_2= \rho (1-\frac{1}{\alpha })\). An investor who is unaware of the distribution \(\mathbb {P}\) but has observed a dataset \(\widehat{\Xi }_N\) of N historical samples from \(\mathbb {P}\) and knows that the support of \(\mathbb {P}\) is contained in \(\Xi =\{\xi \in \mathbb {R}^m:C\xi \le d\}\) might solve the distributionally robust counterpart of (26) with respect to the Wasserstein ambiguity set \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\), that is,
where we make the dependence on the Wasserstein radius \(\varepsilon \) explicit. By Corollary 5.1 we know that
Before proceeding with the numerical analysis of this problem, we provide some analytical insights into its optimal solutions under significant ambiguity. In what follows we keep the training dataset fixed and let \(\widehat{x}_N(\varepsilon )\) be an optimal distributionally robust portfolio corresponding to the Wasserstein ambiguity set of radius \(\varepsilon \). We will now show that, for natural choices of the uncertainty set, \(\widehat{x}_N(\varepsilon )\) converges to the equally weighted portfolio \(\frac{1}{m}e\) as \(\varepsilon \) tends to infinity, where \(e {:=}(1,\ldots ,1)^\intercal \). The optimality of the equally weighted portfolio under high ambiguity was first demonstrated in [37] using analytical methods. We identify this result here as an immediate consequence of Theorem 4.2, which is primarily a computational result.
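For completeness, we note how the piecewise affine coefficients stated above arise (a worked derivation of ours) from the Rockafellar–Uryasev representation \(\mathbb {P}\text {-CVaR}_\alpha (L) = \inf _{\tau \in \mathbb {R}} \big \{\tau + \frac{1}{\alpha }\mathbb {E}^{\mathbb {P}}[\max \{L-\tau ,0\}]\big \}\) applied to the loss \(L=-\big \langle x, \xi \big \rangle \):
$$\begin{aligned} \mathbb {E}^{\mathbb {P}}\big [-\big \langle x, \xi \big \rangle \big ] + \rho \inf _{\tau \in \mathbb {R}} \mathbb {E}^{\mathbb {P}}\Big [\tau + \tfrac{1}{\alpha }\max \big \{-\big \langle x, \xi \big \rangle -\tau , 0\big \}\Big ] = \inf _{\tau \in \mathbb {R}} \mathbb {E}^{\mathbb {P}}\Big [\max \Big \{ -\big \langle x, \xi \big \rangle + \rho \tau ,\; -\big (1+\tfrac{\rho }{\alpha }\big )\big \langle x, \xi \big \rangle + \rho \big (1-\tfrac{1}{\alpha }\big )\tau \Big \}\Big ], \end{aligned}$$
so that \(a_1=-1\), \(b_1=\rho \), \(a_2=-1-\frac{\rho }{\alpha }\) and \(b_2=\rho (1-\frac{1}{\alpha })\).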
For any nonempty set \(S\subseteq \mathbb {R}^m\) we denote by \(\text{ recc }(S) {:=}\{y\in \mathbb {R}^m:x+\lambda y\in S~\forall x\in S, ~\forall \lambda \ge 0\}\) the recession cone and by \(S^\circ {:=}\{y\in \mathbb {R}^m:\big \langle y, x \big \rangle \le 0~\forall x\in S\}\) the polar cone of S.
Lemma 7.1
If \(\{\varepsilon _k\}_{k\in {\mathbb {N}}}\subset \mathbb {R}_+\) tends to infinity, then any accumulation point \(x^\star \) of \(\big \{\widehat{x}_N(\varepsilon _k)\big \}_{k\in {\mathbb {N}}}\) is a portfolio that has minimum distance to \((\text{ recc }(\Xi ))^\circ \) with respect to \(\Vert \cdot \Vert _*\).
Proof
Note first that \(\widehat{x}_N(\varepsilon _k)\), \(k\in {\mathbb {N}}\), and \(x^\star \) exist because \(\mathbb {X}\) is compact. For large Wasserstein radii \(\varepsilon \), the term \(\lambda \varepsilon \) dominates the objective function of problem (27). Using standard epiconvergence results [42, Section 7.E], one can thus show that
where the first equality follows from the fact that \(a_k<0\) for all \(k\le K\), the second equality uses the substitution \(\gamma \rightarrow -\gamma a_k\), and the last equality holds because the set of minimizers of an optimization problem is not affected by a positive scaling of the objective function. Thus, \(x^\star \) is the portfolio nearest to the cone \({\mathcal {C}}=\{C^\intercal \gamma :\gamma \ge 0\}\). The claim now follows as the polar cone
is readily recognized as the recession cone of \(\Xi \) and as \({\mathcal {C}}=({\mathcal {C}}^\circ )^\circ \). \(\square \)
Proposition 7.2
(Equally weighted portfolio) Assume that the Wasserstein metric is defined in terms of the p-norm in the uncertainty space for some \(p\in [1,\infty )\). If \(\{\varepsilon _k\}_{k\in {\mathbb {N}}}\subset \mathbb {R}_+\) tends to infinity, then \(\big \{\widehat{x}_N(\varepsilon _k)\big \}_{k\in {\mathbb {N}}}\) converges to the equally weighted portfolio \(x^\star =\frac{1}{m}e\) provided that the uncertainty set is given by

(i)
the entire space, i.e., \(\Xi =\mathbb {R}^m\), or

(ii)
the nonnegative orthant shifted by \(-e\), i.e., \(\Xi =\{\xi \in \mathbb {R}^m:\xi \ge -e\}\), which captures the idea that no asset can lose more than \(100\%\) of its value.
Proof
(i) One easily verifies from the definitions that \((\text{ recc }(\Xi ))^\circ =\{0\}\). Moreover, we have \(\Vert \cdot \Vert _*=\Vert \cdot \Vert _q\) where \(\frac{1}{p}+\frac{1}{q}=1\). As \(p\in [1,\infty )\), we conclude that \(q\in (1,\infty ]\), and thus the unique nearest portfolio to \((\text{ recc }(\Xi ))^\circ \) with respect to \(\Vert \cdot \Vert _*\) is \(x^\star =\frac{1}{m}e\). The claim then follows from Lemma 7.1. Assertion (ii) follows in a similar manner from the observation that \((\text{ recc }(\Xi ))^\circ \) is now the nonpositive orthant. \(\square \)
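As a numerical sanity check of this claim (our own illustration, not part of the proof), one can verify that the equally weighted portfolio has minimal \(q\)-norm distance to the origin over the probability simplex, e.g., for \(q=2\) and \(q=\infty \):

```python
import numpy as np

m = 10
uniform = np.full(m, 1.0 / m)  # equally weighted portfolio x* = e/m

# Sample random portfolios from the probability simplex via a flat Dirichlet.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(m), size=10_000)

for q in (2.0, np.inf):
    # Distance to (recc(Xi))^o = {0}, measured in the q-norm.
    norms = np.linalg.norm(candidates, ord=q, axis=1)
    assert np.linalg.norm(uniform, ord=q) <= norms.min() + 1e-12
```

The \(\infty \)-norm case is immediate: any portfolio summing to 1 has a largest weight of at least \(\frac{1}{m}\), attained exactly by the uniform portfolio.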
With some extra effort one can show that for every \(p\in [1,\infty )\) there is a threshold \({\bar{\varepsilon }}>0\) with \(\widehat{x}_N(\varepsilon )=x^\star \) for all \(\varepsilon \ge {\bar{\varepsilon }}\), see [37, Proposition 3]. Moreover, for \(p\in \{1,2\}\) the threshold \({\bar{\varepsilon }}\) is known analytically.
Simulation results: portfolio optimization
Our experiments are based on a market with \(m=10\) assets considered in [7, Section 7.5]. In view of the capital asset pricing model we may assume that the return \(\xi _i\) is decomposable into a systematic risk factor \(\psi \sim {\mathcal {N}}(0,2\%)\) common to all assets and an unsystematic or idiosyncratic risk factor \(\zeta _i\sim {\mathcal {N}}(i\times 3\%, i\times 2.5\%)\) specific to asset i. Thus, we set \(\xi _i=\psi +\zeta _i\), where \(\psi \) and the idiosyncratic risk factors \(\zeta _i\), \(i=1,\ldots ,m\), constitute independent normal random variables. By construction, assets with higher indices promise higher mean returns at a higher risk. Note that the given moments of the risk factors completely determine the distribution \(\mathbb {P}\) of \(\xi \). This distribution has support \(\Xi =\mathbb {R}^m\) and satisfies Assumption 3.3 for the tail exponent \(a=1\), say. We also set \(\alpha =20\%\) and \(\rho =10\) in all numerical experiments, and we use the 1-norm to measure distances in the uncertainty space. Thus, \(\Vert \cdot \Vert _*\) is the \(\infty \)-norm, whereby (27) reduces to a linear program.
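A minimal sketch of this factor model (the function name is ours, and we read the second argument of \({\mathcal {N}}(\cdot ,\cdot )\) as the standard deviation, which is an assumption):

```python
import numpy as np

def simulate_returns(N, m=10, seed=0):
    """Sample N return vectors xi = psi + zeta from the factor model."""
    rng = np.random.default_rng(seed)
    psi = rng.normal(0.0, 0.02, size=(N, 1))             # systematic risk factor
    i = np.arange(1, m + 1)                              # asset indices 1,...,m
    zeta = rng.normal(i * 0.03, i * 0.025, size=(N, m))  # idiosyncratic factors
    return psi + zeta

xi = simulate_returns(300)
# Assets with higher indices promise higher mean returns at a higher risk.
assert xi.mean(axis=0)[-1] > xi.mean(axis=0)[0]
```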
Impact of the Wasserstein radius
In the first experiment we investigate the impact of the Wasserstein radius \(\varepsilon \) on the optimal distributionally robust portfolios and their out-of-sample performance. We solve problem (27) using training datasets of cardinality \(N \in \{30, 300, 3000\}\). Figure 4 visualizes the corresponding optimal portfolio weights \(\widehat{x}_N(\varepsilon )\) as a function of \(\varepsilon \), averaged over 200 independent simulation runs. Our numerical results confirm the theoretical insight of Proposition 7.2 that the optimal distributionally robust portfolios converge to the equally weighted portfolio as the Wasserstein radius \(\varepsilon \) increases; see also [37].
The out-of-sample performance
of any fixed distributionally robust portfolio \(\widehat{x}_N(\varepsilon )\) can be computed analytically as \(\mathbb {P}\) constitutes a normal distribution by design, see, e.g., [41, p. 29]. Figure 5 shows the tubes between the 20 and 80% quantiles (shaded areas) and the means (solid lines) of the out-of-sample performance \(J\big (\widehat{x}_N(\varepsilon )\big )\) as a function of \(\varepsilon \), estimated using 200 independent simulation runs. We observe that the out-of-sample performance improves (decreases) up to a critical Wasserstein radius \(\varepsilon _\mathrm{crit}\) and then deteriorates (increases). This stylized fact was observed consistently across all simulations and provides an empirical justification for adopting a distributionally robust approach.
Figure 5 also visualizes the reliability of the performance guarantees offered by our distributionally robust portfolio model. Specifically, the dashed lines represent the empirical probability of the event \(J\big (\widehat{x}_N(\varepsilon )\big ) \le \widehat{J}_N(\varepsilon )\) with respect to 200 independent training datasets. We find that the reliability is nondecreasing in \(\varepsilon \). This observation has intuitive appeal because \(\widehat{J}_N(\varepsilon ) \ge J(\widehat{x}_N(\varepsilon ))\) whenever \(\mathbb {P}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\), and the latter event becomes increasingly likely as \(\varepsilon \) grows. Figure 5 also indicates that the reliability of the certificate rises sharply towards 1 near the critical Wasserstein radius \(\varepsilon _\mathrm{crit}\). Hence, the out-of-sample performance of the distributionally robust portfolios improves as long as the reliability of the performance guarantee is noticeably smaller than 1 and deteriorates when it saturates at 1. Even though this observation was made consistently across all simulations, we were unable to validate it theoretically.
Portfolios driven by out-of-sample performance
Different Wasserstein radii \(\varepsilon \) may result in robust portfolios \(\widehat{x}_N(\varepsilon )\) with vastly different out-of-sample performance \(J(\widehat{x}_N(\varepsilon ))\). Ideally, one should select the radius \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) that minimizes \(J(\widehat{x}_N(\varepsilon ))\) over all \(\varepsilon \ge 0\); note that \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) inherits the dependence on the training data from \(J(\widehat{x}_N(\varepsilon ))\). As the true distribution \(\mathbb {P}\) is unknown, however, it is impossible to evaluate and minimize \(J(\widehat{x}_N(\varepsilon ))\). In practice, the best we can hope for is to approximate \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) using the training data. Statistics offers several methods to accomplish this goal:

Holdout method: Partition \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\) into a training dataset of size \(N_T\) and a validation dataset of size \(N_V=N-N_T\). Using only the training dataset, solve (27) for a large but finite number of candidate radii \(\varepsilon \) to obtain \({\widehat{x}}_{N_T}(\varepsilon )\). Use the validation dataset to estimate the out-of-sample performance of \({\widehat{x}}_{N_T}(\varepsilon )\) via the sample average approximation. Set \({\widehat{\varepsilon }}_N^\mathrm{\; hm}\) to any \(\varepsilon \) that minimizes this quantity. Report \(\widehat{x}_N^\mathrm{\; hm}={\widehat{x}}_{N_T}({\widehat{\varepsilon }}_N^\mathrm{\; hm})\) as the data-driven solution and \(\widehat{J}_N^\mathrm{\; hm}={\widehat{J}}_{N_T}({\widehat{\varepsilon }}_N^\mathrm{\; hm})\) as the corresponding certificate.

k-fold cross validation: Partition \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\) into k subsets, and run the holdout method k times. In each run, use exactly one subset as the validation dataset and merge the remaining \(k-1\) subsets into a training dataset. Set \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) to the average of the Wasserstein radii obtained from the k holdout runs. Re-solve (27) with \(\varepsilon ={\widehat{\varepsilon }}_N^\mathrm{\; cv}\) using all N samples, and report \(\widehat{x}_N^\mathrm{\; cv}=\widehat{x}_N({\widehat{\varepsilon }}_N^\mathrm{\; cv})\) as the data-driven solution and \(\widehat{J}_N^\mathrm{\; cv}=\widehat{J}_N({\widehat{\varepsilon }}_N^\mathrm{\; cv})\) as the corresponding certificate.
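The two selection schemes can be sketched as follows. Here `solve_dro` stands in for solving (27), returning a portfolio and its certificate, and `estimate_performance` for the sample average approximation on held-out data; both interfaces are our own placeholders, not part of the paper.

```python
import numpy as np

def holdout_radius(data, radii, solve_dro, estimate_performance, train_frac=0.8):
    """Holdout method: return the radius whose DRO portfolio performs best
    on the validation split (sample average approximation)."""
    N_T = int(train_frac * len(data))
    train, val = data[:N_T], data[N_T:]
    scores = {eps: estimate_performance(solve_dro(train, eps)[0], val)
              for eps in radii}
    return min(scores, key=scores.get)

def cross_validation_radius(data, radii, solve_dro, estimate_performance, k=5):
    """k-fold cross validation: average the k holdout radii; the caller then
    re-solves (27) on all N samples with the averaged radius."""
    folds = np.array_split(data, k)
    chosen = []
    for kappa in range(k):
        val = folds[kappa]
        train = np.concatenate([folds[j] for j in range(k) if j != kappa])
        scores = {eps: estimate_performance(solve_dro(train, eps)[0], val)
                  for eps in radii}
        chosen.append(min(scores, key=scores.get))
    return float(np.mean(chosen))
```

Any solver with the `solve_dro(train, eps) -> (portfolio, certificate)` interface can be plugged in, so the same skeleton serves both the Wasserstein and the LCX calibration discussed below.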
The holdout method is computationally cheaper, but cross validation has superior statistical properties. There are several other methods to estimate the best Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\). By construction, however, no method can provide a radius \({\widehat{\varepsilon }}_N\) such that \(\widehat{x}_N({\widehat{\varepsilon }}_N)\) has a better out-of-sample performance than \(\widehat{x}_N({\widehat{\varepsilon }}_N^\mathrm{\; opt})\).
In all experiments we compare the distributionally robust approach based on the Wasserstein ambiguity set with the classical sample average approximation (SAA) and with a state-of-the-art data-driven distributionally robust approach, where the ambiguity set is defined via a linear-convex ordering (LCX)-based goodness-of-fit test [7, Section 3.3.2]. The size of the LCX ambiguity set is determined by a single parameter, which should be tuned to optimize the out-of-sample performance. While the best parameter value is unavailable, it can again be estimated using the holdout method or via cross validation. To the best of our knowledge, the LCX approach represents the only existing data-driven distributionally robust approach for continuous uncertainty spaces that enjoys strong finite-sample guarantees, asymptotic consistency as well as computational tractability.^{Footnote 4}
To keep the computational burden manageable, in all experiments we select the Wasserstein radius as well as the LCX size parameter from within the discrete set \({\mathcal {E}}=\{\varepsilon =b\cdot 10^c:b\in \{0,\ldots ,9\},\; c\in \{-3,-2,-1\}\}\) instead of \({\mathbb {R}}_+\). We have verified that refining or extending \(\mathcal E\) has only a marginal impact on our results, which indicates that \({\mathcal {E}}\) provides a sufficiently rich approximation of \(\mathbb {R}_+\).
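With the exponents read as \(c\in \{-3,-2,-1\}\), the grid \({\mathcal {E}}\) contains 28 distinct radii (the choice \(b=0\) contributes the single radius \(\varepsilon =0\)) and can be generated as follows:

```python
# Candidate radii b * 10^c for b in {0,...,9} and c in {-3,-2,-1};
# the set comprehension deduplicates the three copies of 0.
candidate_radii = sorted({b * 10.0 ** c for b in range(10) for c in (-3, -2, -1)})

assert len(candidate_radii) == 28       # 0 plus 9 values per exponent
assert candidate_radii[0] == 0.0
assert abs(candidate_radii[-1] - 0.9) < 1e-12
```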
In Fig. 6a–c the sizes of the (LCX and Wasserstein) ambiguity sets are determined via the holdout method, where \(80\%\) of the data are used for training and \(20\%\) for validation. Figure 6a visualizes the tube between the 20 and \(80\%\) quantiles (shaded areas) as well as the mean value (solid lines) of the out-of-sample performance \(J(\widehat{x}_N)\) as a function of the sample size N and based on 200 independent simulation runs, where \(\widehat{x}_N\) is set to the minimizer of the SAA (blue), LCX (purple) and Wasserstein (green) problems, respectively. The constant dashed line represents the optimal value \(J^\star \) of the original stochastic program (1), which is computed through an SAA problem with \(N = 10^6\) samples. We observe that the Wasserstein solutions tend to be superior to the SAA and LCX solutions in terms of out-of-sample performance.
Figure 6b shows the optimal values \(\widehat{J}_N\) of the SAA, LCX and Wasserstein problems, where the sizes of the ambiguity sets are chosen via the holdout method. Unlike Fig. 6a, Fig. 6b thus reports in-sample estimates of the achievable portfolio performance. As expected, the SAA approach is overoptimistic due to the optimizer's curse, while the LCX and Wasserstein approaches err on the side of caution. All three methods are known to enjoy asymptotic consistency, which is in agreement with all in-sample and out-of-sample results.
Figure 6c visualizes the reliability of the different performance certificates, that is, the empirical probability of the event \(J(\widehat{x}_N) \le \widehat{J}_N\) evaluated over 200 independent simulation runs. Here, \(\widehat{x}_N\) represents either an optimal portfolio of the SAA, LCX or Wasserstein problems, while \(\widehat{J}_N\) denotes the corresponding optimal value. The optimal SAA portfolios display a disappointing out-of-sample performance relative to the optimistically biased minimum of the SAA problem, particularly when the training data is scarce. In contrast, the out-of-sample performance of the optimal LCX and Wasserstein portfolios often undershoots \(\widehat{J}_N\).
Figure 6d–f show the same graphs as Fig. 6a–c, but now the sizes of the ambiguity sets are determined via k-fold cross validation with \(k=5\). In this case, the out-of-sample performance of both distributionally robust methods improves slightly, while the corresponding certificates and their reliabilities increase significantly with respect to the naïve holdout method. However, these improvements come at the expense of a k-fold increase in the computational cost.
One could think of numerous other statistical methods to select the size of the Wasserstein ambiguity set. As discussed above, however, if the ultimate goal is to minimize the out-of-sample performance of \(\widehat{x}_N(\varepsilon )\), then the best possible choice is \(\varepsilon ={\widehat{\varepsilon }}_N^\mathrm{\; opt}\). Similarly, one can construct a size parameter for the LCX ambiguity set that leads to the best possible out-of-sample performance of any LCX solution. We emphasize that these optimal Wasserstein radii and LCX size parameters are not available in practice because computing \(J(\widehat{x}_N(\varepsilon ))\) requires knowledge of the data-generating distribution. In our experiments we evaluate \(J(\widehat{x}_N(\varepsilon ))\) to high accuracy for every fixed \(\varepsilon \in \mathcal {E}\) using \(2\cdot 10^5\) validation samples, which are independent of the (far fewer) training samples used to compute \(\widehat{x}_N(\varepsilon )\). Figure 6g–i show the same graphs as Fig. 6a–c for optimally sized ambiguity sets. By construction, no method for sizing the Wasserstein or LCX ambiguity sets can result in a better out-of-sample performance, respectively. In this sense, the graphs in Fig. 6g capture the fundamental limitations of the different distributionally robust schemes.
Portfolios driven by reliability
In Sect. 7.2.2 the Wasserstein radii and LCX size parameters were calibrated with the goal to achieve the best out-of-sample performance. Figure 6c, f, i reveal, however, that by optimizing the out-of-sample performance one may sacrifice reliability. An alternative objective more in line with the general philosophy of Sect. 2 would be to choose Wasserstein radii that guarantee a prescribed reliability level. Thus, for a given \(\beta \in [0,1]\) we should find the smallest Wasserstein radius \(\varepsilon \ge 0\) for which the optimal value \(\widehat{J}_N(\varepsilon )\) of (27) provides an upper \(1-\beta \) confidence bound on the out-of-sample performance \(J(\widehat{x}_N(\varepsilon ))\) of its optimal solution. As the true distribution \(\mathbb {P}\) is unknown, however, the optimal Wasserstein radius corresponding to a given \(\beta \) cannot be computed exactly. Instead, we must derive an estimator \({\widehat{\varepsilon }}_N^{\; \beta }\) that depends on the training data. We construct \({\widehat{\varepsilon }}_N^{\; \beta }\) and the corresponding reliability-driven portfolio via bootstrapping as follows:

(1)
Construct k resamples of size N (with replacement) from the original training dataset. It is well known that, as N grows, the probability that any fixed training data point appears in a particular resample converges to \(\frac{e-1}{e}\approx \frac{2}{3}\). Thus, about \(\frac{N}{3}\) training samples are absent from any resample. We collect all unused samples in a validation dataset.

(2)
For each resample \(\kappa =1,\ldots , k\) and \(\varepsilon \ge 0\), solve problem (27) using the Wasserstein ball of radius \(\varepsilon \) around the empirical distribution \(\widehat{\mathbb {P}}_N^\kappa \) on the \(\kappa \)th resample. The resulting optimal decision and optimal value are denoted as \({\widehat{x}}_N^\kappa (\varepsilon )\) and \({\widehat{J}}_N^\kappa (\varepsilon )\), respectively. Next, estimate the out-of-sample performance \(J(\widehat{x}_N^\kappa (\varepsilon ))\) of \(\widehat{x}_N^\kappa (\varepsilon )\) using the sample average over the \(\kappa \)th validation dataset.

(3)
Set \({\widehat{\varepsilon }}_N^{\; \beta }\) to the smallest \(\varepsilon \ge 0\) so that the certificate \({\widehat{J}}_N^\kappa (\varepsilon )\) exceeds the estimate of \(J({\widehat{x}}_N^\kappa (\varepsilon ))\) in at least \((1-\beta )\times k\) different resamples.

(4)
Compute the data-driven portfolio \(\widehat{x}_N=\widehat{x}_N({\widehat{\varepsilon }}_N^{\; \beta })\) and the corresponding certificate \(\widehat{J}_N={\widehat{J}}_N({\widehat{\varepsilon }}_N^{\; \beta })\) using the original training dataset.
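Steps (1)–(3) can be sketched as follows; as before, `solve_dro` and `estimate_performance` are our own placeholder interfaces for problem (27) and the sample average approximation. Step (4) then re-solves (27) on the original training dataset with the returned radius.

```python
import numpy as np

def bootstrap_radius(data, radii, solve_dro, estimate_performance,
                     k=50, beta=0.1, seed=0):
    """Return the smallest candidate radius whose certificate covers the
    estimated out-of-sample performance in at least (1 - beta) * k resamples."""
    rng = np.random.default_rng(seed)
    N = len(data)
    splits = []
    for _ in range(k):
        idx = rng.integers(0, N, size=N)          # resample with replacement
        unused = np.setdiff1d(np.arange(N), idx)  # about N/3 samples are absent
        splits.append((data[idx], data[unused]))
    for eps in sorted(radii):
        covered = 0
        for train, val in splits:
            x, certificate = solve_dro(train, eps)
            if certificate >= estimate_performance(x, val):
                covered += 1
        if covered >= (1 - beta) * k:
            return eps
    return max(radii)
```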
As in Sect. 7.2.2, we compare the Wasserstein approach with the LCX and SAA approaches. Specifically, by using bootstrapping, we calibrate the size of the LCX ambiguity set so as to guarantee a desired reliability level \(1-\beta \). The SAA problem, on the other hand, has no free parameter that can be tuned to meet a prescribed reliability target. Nevertheless, we can construct a meaningful certificate of the form \(\widehat{J}_N(\Delta ):=\widehat{J}_{\mathrm{SAA}}+\Delta \) for the SAA portfolio by adding a nonnegative constant to the optimal value of the SAA problem. Our aim is to find the smallest offset \(\Delta \ge 0\) with the property that \(\widehat{J}_N(\Delta )\) provides an upper \(1-\beta \) confidence bound on the out-of-sample performance \(J(\widehat{x}_{\mathrm{SAA}})\) of the optimal SAA portfolio \(\widehat{x}_{\mathrm{SAA}}\). The optimal offset corresponding to a given \(\beta \) cannot be computed exactly. Instead, we must derive an estimator \({\widehat{\Delta }}_N^{\; \beta }\) that depends on the training data. Such an estimator can be found through a simple variant of the above bootstrapping procedure.
In all experiments we set the number of resamples to \(k=50\). Figure 7a–c visualize the out-of-sample performance, the certificate and the empirical reliability of the reliability-driven portfolios obtained with the SAA, LCX and Wasserstein approaches, respectively, for the reliability target \(1-\beta =90\%\) and based on 200 independent simulation runs. Figure 7d–f show the same graphs as Fig. 7a–c but for the reliability target \(1-\beta =75\%\). We observe that the new SAA certificate now overestimates the true optimal value of the portfolio problem. Moreover, while the empirical reliability of the SAA solution now closely matches the desired reliability target, the empirical reliabilities of the LCX and Wasserstein solutions are similar but noticeably exceed the prescribed reliability threshold. A possible explanation for this phenomenon is that the k resamples generated by the bootstrapping algorithm are not independent, which may give rise to a systematic bias in estimating the Wasserstein radii required for the desired reliability levels.
Impact of the sample size on the Wasserstein radius
It is instructive to analyze the dependence of the Wasserstein radii on the sample size N for different data-driven schemes. As for the performance-driven portfolios from Sect. 7.2.2, Fig. 8 depicts the best possible Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) as well as the Wasserstein radii \({\widehat{\varepsilon }}_N^\mathrm{\; hm}\) and \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) obtained by the holdout method and via k-fold cross validation, respectively. As for the reliability-driven portfolios from Sect. 7.2.3, Fig. 8 further depicts the Wasserstein radii \({\widehat{\varepsilon }}_N^{\beta }\), for \(\beta \in \{10\%,25\%\}\), obtained by bootstrapping. All results are averaged across 200 independent simulation runs. As expected from Theorem 3.6, all Wasserstein radii tend to zero as N increases. Moreover, the convergence rate is approximately equal to \(N^{-\frac{1}{2}}\). This rate is likely to be optimal. Indeed, if \(\mathbb {X}\) is a singleton, then every quantile of the sample average estimator \(\widehat{J}_{\mathrm{SAA}}\) converges to \(J^\star \) at rate \(N^{-\frac{1}{2}}\) due to the central limit theorem. Thus, if \({\widehat{\varepsilon }}_N= o(N^{-\frac{1}{2}})\), then \(\widehat{J}_N\) also converges to \(J^\star \) at leading order \(N^{-\frac{1}{2}}\) by Theorem 6.3, which applies as the loss function is convex. This indicates that the a priori rate \(N^{-\frac{1}{m}}\) suggested by Theorem 3.4 is too pessimistic in practice.
Simulation results: uncertainty quantification
Investors often wish to determine the probability that a given portfolio will outperform various benchmark indices or assets. Our results on uncertainty quantification developed in Sect. 5.2 enable us to compute this probability in a meaningful way—solely on the basis of the training dataset.
Assume for example that we wish to quantify the probability that any data-driven portfolio \(\widehat{x}_N\) outperforms the three most risky assets in the market jointly. Thus, we should compute the probability of the closed polytope
As the true distribution \(\mathbb {P}\) is unknown, the probability \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\) cannot be evaluated exactly. Note that \({\widehat{\mathbb {A}}}\) as well as \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\) constitute random objects that depend on \(\widehat{x}_N\) and thus on the training data. Using the same training dataset that was used to compute \(\widehat{x}_N\), however, we may estimate \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\) from above and below by
respectively. Indeed, recall that the true data-generating probability distribution resides in the Wasserstein ball of radius \(\varepsilon _N(\beta )\) defined in (8) with probability \(1-\beta \). Therefore, we have
where \(\mathfrak {B}(\Xi )\) denotes the set of all Borel subsets of \(\Xi \). The data-dependent set \({\widehat{\mathbb {A}}}_N\) can now be viewed as a (measurable) mapping from \(\widehat{\Xi }_N\) to the subsets in \(\mathfrak {B}(\Xi )\). The above inequality then implies
Thus, \( \sup \{\mathbb {Q}[{{\widehat{\mathbb {A}}}_N}]:\mathbb {Q}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\}\) indeed provides an upper bound on \(\mathbb {P}[{{\widehat{\mathbb {A}}}_N}]\) with confidence \(1-\beta \). Similarly, one can show that \( \inf \{\mathbb {Q}[{{\widehat{\mathbb {A}}}_N}]: \mathbb {Q}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\}\) provides a lower confidence bound on \(\mathbb {P}[{{\widehat{\mathbb {A}}}_N}]\).
The upper confidence bound can be computed by solving the linear program (17a). Replacing \({\widehat{\mathbb {A}}}\) with its interior in the lower confidence bound leads to another (potentially weaker) lower bound that can be computed by solving the linear program (17b). We denote these computable bounds by \(\widehat{J}_N^+(\varepsilon )\) and \(\widehat{J}_N^-(\varepsilon )\), respectively. In all subsequent experiments \(\widehat{x}_N\) is set to a solution of the distributionally robust program (27) calibrated via k-fold cross validation as described in Sect. 7.2.2.
Impact of the Wasserstein radius
As \(\widehat{J}_N^+(\varepsilon )\) and \(\widehat{J}_N^-(\varepsilon )\) estimate a random target \(\mathbb {P}[{\widehat{\mathbb {A}}}]\), it makes sense to filter out the randomness of the target and to study only the differences \(\widehat{J}_N^+(\varepsilon )-\mathbb {P}[{\widehat{\mathbb {A}}}]\) and \(\widehat{J}_N^-(\varepsilon )-\mathbb {P}[{\widehat{\mathbb {A}}}]\). Figure 9a, b visualize the empirical mean (solid lines) as well as the tube between the empirical 20 and 80% quantiles (shaded areas) of these differences as a function of the Wasserstein radius \(\varepsilon \), based on 200 training datasets of cardinality \(N = 30\) and \(N=300\), respectively. Figure 9 also shows the empirical reliability of the bounds (dashed lines), that is, the empirical probability of the event \(\widehat{J}_N^-(\varepsilon ) \le \mathbb {P}[{\widehat{\mathbb {A}}}] \le \widehat{J}_N^+(\varepsilon )\). Note that the reliability drops to 0 for \(\varepsilon =0\), in which case both \(\widehat{J}_N^+(0)\) and \(\widehat{J}_N^-(0)\) coincide with the SAA estimator for \(\mathbb {P}[{\widehat{\mathbb {A}}}]\). Moreover, at \(\varepsilon =0\) the set \({\widehat{\mathbb {A}}}\) is constructed from the SAA portfolio \(\widehat{x}_N\), whose performance is overestimated on the training dataset. Thus, the SAA estimator for \(\mathbb {P}[{\widehat{\mathbb {A}}}]\), which is evaluated using the same training dataset, is positively biased. For \(\varepsilon >0\), finally, the reliability increases as the shaded confidence intervals move away from 0.
Impact of the sample size
We propose a variant of the k-fold cross validation procedure for selecting \(\varepsilon \) in uncertainty quantification. Partition \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\) into k subsets and repeat the following holdout method k times. Select one of the subsets as the validation set of size \(N_V\) and merge the remaining \(k-1\) subsets into a training dataset of size \(N_T=N-N_V\). Use the validation set to compute the SAA estimator of \(\mathbb {P}[{\widehat{\mathbb {A}}}]\), and use the training dataset to compute \({\widehat{J}}_{N_T}^+(\varepsilon )\) for a large but finite number of candidate radii \(\varepsilon \). Set \({\widehat{\varepsilon }}_N^{\; \mathrm{hm}}\) to the smallest candidate radius for which the SAA estimator of \(\mathbb {P}[{\widehat{\mathbb {A}}}]\) is not larger than \({\widehat{J}}_{N_T}^+(\varepsilon )\). Next, set \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) to the average of the Wasserstein radii obtained from the k holdout runs, and report \(\widehat{J}_N^+={\widehat{J}}_{N}^+({\widehat{\varepsilon }}_N^\mathrm{\; cv})\) as the data-driven upper bound on \(\mathbb {P}[{\widehat{\mathbb {A}}}]\). The data-driven lower bound \(\widehat{J}_N^-\) is constructed analogously.
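A sketch of this selection procedure, with `upper_bound` standing in for the linear program (17a) and `saa_probability` for the sample average estimator of \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\) (both placeholder interfaces of our own):

```python
import numpy as np

def uq_cv_radius(data, radii, upper_bound, saa_probability, k=5):
    """For each fold, pick the smallest radius whose bound J^+ covers the SAA
    estimate computed on the validation fold; return the average radius."""
    folds = np.array_split(data, k)
    chosen = []
    for kappa in range(k):
        val = folds[kappa]
        train = np.concatenate([folds[j] for j in range(k) if j != kappa])
        target = saa_probability(val)
        feasible = [eps for eps in sorted(radii)
                    if upper_bound(train, eps) >= target]
        chosen.append(feasible[0] if feasible else max(radii))
    return float(np.mean(chosen))
```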
Figure 10a visualizes the empirical means (solid lines) as well as the tubes between the empirical 20 and 80% quantiles (shaded areas) of \(\widehat{J}_N^+-\mathbb {P}[{\widehat{\mathbb {A}}}]\) and \(\widehat{J}_N^--\mathbb {P}[{\widehat{\mathbb {A}}}]\) as a function of the sample size N, based on 300 independent training datasets. As expected, the confidence intervals shrink and converge to 0 as N increases. We emphasize that \(\widehat{J}_N^+\) and \(\widehat{J}_N^-\) are computed solely on the basis of N training samples, whereas the computation of \(\mathbb {P}[{\widehat{\mathbb {A}}}]\) necessitates a much larger dataset, particularly if \({\widehat{\mathbb {A}}}\) constitutes a rare event.
Figure 10b shows the Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) obtained via k-fold cross validation (both for \(\widehat{J}_N^+\) and \(\widehat{J}_N^-\)). As usual, all results are averaged across 300 independent simulation runs. A comparison with Fig. 8 reveals that the data-driven Wasserstein radii in uncertainty quantification display a polynomial decay similar to, but faster than, that in portfolio optimization. We conjecture that this is due to the absence of decisions, which implies that uncertainty quantification is less susceptible to the optimizer's curse. Thus, nature (i.e., the fictitious adversary choosing the distribution in the ambiguity set) only has to compensate for noise but not for bias. A smaller Wasserstein radius seems to be sufficient for this purpose.
Notes
 1.
A similar but slightly more complicated inequality also holds for the special case \(m = 2\); see [21, Theorem 2] for details.
 2.
A possible choice is \(\beta _N = \exp (-\sqrt{N})\).
 3.
We are indebted to Vishal Gupta, who brought this interesting observation to our attention.
 4.
Much like worst-case expectations over Wasserstein balls, worst-case expectations over LCX ambiguity sets can be reformulated as finite convex programs whenever the underlying loss function represents a pointwise maximum of K concave component functions. Unlike problem (11) in Theorem 4.2, however, the resulting convex program scales exponentially with K.
References
 1.
Ben-Tal, A., den Hertog, D., Vial, J.P.: Deriving robust counterparts of nonlinear uncertain inequalities. Math. Program. 149, 265–299 (2015)
 2.
Ben-Tal, A., den Hertog, D., Waegenaere, A.D., Melenberg, B., Rennen, G.: Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 59, 341–357 (2013)
 3.
Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)
 4.
Bertsekas, D.P.: Convex Optimization Theory. Athena Scientific, Belmont (2009)
 5.
Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)
 6.
Bertsimas, D., Doan, X.V., Natarajan, K., Teo, C.P.: Models for minimax stochastic linear optimization problems with risk aversion. Math. Oper. Res. 35, 580–602 (2010)
 7.
Bertsimas, D., Gupta, V., Kallus, N.: Robust SAA. Available at arXiv:1408.4445 (2014)
 8.
Bertsimas, D., Popescu, I.: On the relation between option and stock prices: a convex optimization approach. Oper. Res. 50, 358–374 (2002)
 9.
Bertsimas, D., Sim, M.: The price of robustness. Oper. Res. 52, 35–53 (2004)
 10.
Boissard, E.: Simple bounds for convergence of empirical and occupation measures in 1-Wasserstein distance. Electron. J. Probab. 16, 2296–2333 (2011)
 11.
Bolley, F., Guillin, A., Villani, C.: Quantitative concentration inequalities for empirical measures on noncompact spaces. Probab. Theory Relat. Fields 137, 541–593 (2007)
 12.
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2009)
 13.
Brownlees, C., Joly, E., Lugosi, G.: Empirical risk minimization for heavytailed losses. Ann. Stat. 43, 2507–2536 (2015)
 14.
Calafiore, G.C.: Ambiguous risk measures and optimal robust portfolios. SIAM J. Optim. 18, 853–877 (2007)
 15.
Catoni, O.: Challenging the empirical mean and empirical variance: a deviation study. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 48, 1148–1185 (2012)
 16.
Chehrazi, N., Weber, T.A.: Monotone approximation of decision problems. Oper. Res. 58, 1158–1177 (2010)
 17.
del Barrio, E., Cuesta-Albertos, J.A., Matrán, C., et al.: Tests of goodness of fit based on the \(l_2\)-Wasserstein distance. Ann. Stat. 27, 1230–1239 (1999)
 18.
Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58, 595–612 (2010)
 19.
El Ghaoui, L., Oks, M., Oustry, F.: Worst-case value-at-risk and robust portfolio optimization: a conic programming approach. Oper. Res. 51, 543–556 (2003)
 20.
Erdoğan, E., Iyengar, G.: Ambiguous chance constrained problems and robust optimization. Math. Program. 107, 37–61 (2006)
 21.
Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 162, 1–32 (2014)
 22.
Goh, J., Sim, M.: Distributionally robust optimization and its tractable approximations. Oper. Res. 58, 902–917 (2010)
 23.
Hanasusanto, G.A., Kuhn, D.: Robust data-driven dynamic programming. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, pp. 827–835. Curran Associates, Inc. (2013)
 24.
Hanasusanto, G.A., Kuhn, D., Wiesemann, W.: A comment on computational complexity of stochastic programming problems. Math. Program. 159, 557–569 (2016)
 25.
Hu, Z., Hong, L.J.: Kullback–Leibler divergence constrained distributionally robust optimization. Available at Optimization Online (2013)
 26.
Hu, Z., Hong, L.J., So, A.M.C.: Ambiguous probabilistic programs. Available at Optimization Online (2013)
 27.
Jiang, R., Guan, Y.: Data-driven chance constrained stochastic program. Math. Program. 158, 291–327 (2016)
 28.
Kallenberg, O.: Foundations of Modern Probability, Probability and its Applications. Springer, New York (1997)
 29.
Kantorovich, L.V., Rubinshtein, G.S.: On a space of totally additive functions. Vestn. Leningr. Univ. 13, 52–59 (1958)
 30.
Lang, S.: Real and Functional Analysis, 3rd edn. Springer, Berlin (1993)
 31.
Mashreghi, J.: Representation Theorems in Hardy Spaces. Cambridge University Press, Cambridge (2009)
 32.
Mehrotra, S., Zhang, H.: Models and algorithms for distributionally robust least squares problems. Math. Program. 146, 123–141 (2014)
 33.
Müller, A.: Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 29, 429–443 (1997)
 34.
Natarajan, K., Sim, M., Uichanco, J.: Tractable robust expected utility and risk models for portfolio optimization. Math. Financ. 20, 695–731 (2010)
 35.
Parikh, N., Boyd, S.: Block splitting for distributed optimization. Math. Program. Comput. 6, 77–102 (2014)
 36.
Pflug, G.C., Pichler, A.: Multistage Stochastic Optimization. Springer, Berlin (2014)
 37.
Pflug, G.C., Pichler, A., Wozabal, D.: The 1/N investment strategy is optimal under high model ambiguity. J. Bank. Financ. 36, 410–417 (2012)
 38.
Pflug, G.C., Wozabal, D.: Ambiguity in portfolio selection. Quant. Financ. 7, 435–442 (2007)
 39.
Postek, K., den Hertog, D., Melenberg, B.: Computationally tractable counterparts of distributionally robust constraints on risk measures. SIAM 58(4), 603–650 (2016). doi:10.1137/151005221
 40.
Ramdas, A., Garcia, N., Cuturi, M.: On Wasserstein two sample testing and related families of nonparametric tests. Available at arXiv:1509.02237 (2015)
 41.
Rockafellar, R.T., Uryasev, S.: Optimization of conditional valueatrisk. J. Risk 2, 21–42 (2000)
 42.
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (2010)
 43.
Scarf, H.E.: Studies in the mathematical theory of inventory and production. In: Arrow, K.J., Karlin, S., Scarf, H.E. (eds.) A Min–Max Solution of an Inventory Problem, pp. 201–209. Stanford University Press, Stanford (1958)
 44.
Shapiro, A.: On duality theory of conic linear problems. In: Goberna, M.A., López, M.A. (eds.) SemiInfinite Programming, pp. 135–165. Kluwer, Boston (2001)
 45.
Shapiro, A.: Distributionally robust stochastic programming. Available at Optimization Online (2015)
 46.
Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming, 2nd edn, SIAM (2014)
 47.
Shapiro, A., Nemirovski, A.: On complexity of stochastic programming problems. In: Jeyakumar, V., Rubinov, A. (eds.) Continuous Optimization, pp. 111–146. Springer, New York (2005)
 48.
Smith, J.E., Winkler, R.L.: The optimizers curse: Skepticism and postdecision surprise in decision analysis. Manag. Sci. 52, 311–322 (2006)
 49.
Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
 50.
Villani, C.: Topics in Optimal Transportation. American Mathematical Society, Providence (2003)
 51.
Wiesemann, W., Kuhn, D., Sim, M.: Distributionally robust convex optimization. Oper. Res. 62, 1358–1376 (2014)
 52.
Wozabal, D.: A framework for optimization under ambiguity. Ann. Oper. Res. 193, 21–47 (2012)
 53.
Wozabal, D.: Robustifying convex risk measures for linear portfolios: a nonparametric approach. Oper. Res. 62, 1302–1315 (2014)
 54.
Zhao, C.: DataDriven RiskAverse Stochastic Program and Renewable Energy Integration. PhD thesis, University of Florida (2014)
Acknowledgements
We thank Soroosh Shafieezadeh Abadeh for helping us with the numerical experiments. The authors are grateful to Vishal Gupta, Ruiwei Jiang and Nathan Kallus for their valuable comments. This research was supported by the Swiss National Science Foundation under Grant BSCGI0_157733.
Appendix A
The following technical lemma on the pointwise approximation of an upper semicontinuous function by a nonincreasing sequence of Lipschitz continuous majorants strengthens [31, Theorem 4.2], which focuses on bounded domains and continuous (but not necessarily Lipschitz continuous) majorants.
Lemma A.1
If \(h:\Xi \rightarrow \mathbb {R}\) is upper semicontinuous and satisfies \(h(\xi ) \le L(1+\Vert \xi \Vert )\) for some \(L\ge 0\), then there exists a nonincreasing sequence of Lipschitz continuous functions that converge pointwise to h on \(\Xi \).
Proof
The proof is constructive. For every \(k\in \mathbb {N}\) define the functions
$$ h_k(\xi ) \;:=\; \sup _{\xi '\in \Xi }\; h(\xi ') - kL\,\Vert \xi - \xi '\Vert , $$
where L is the linear growth rate of h. Note that by construction \(h_k(\xi )\le L(1+\Vert \xi \Vert )\). As \(\xi '=\xi \) is feasible in the maximization problem defining \(h_k(\xi )\), we have \(h_k(\xi ) \ge h(\xi )\) for all \(\xi \in \Xi \) and \(k\in \mathbb {N}\). Moreover, \(h_k(\xi )\) is Lipschitz continuous with Lipschitz constant kL (as \(h_k(\xi )\) constitutes a supremum of functions with this property), and the sequence \(\{h_k\}_{k\in \mathbb {N}}\) is nonincreasing because the penalty \(kL\Vert \xi -\xi '\Vert \) grows with k. Given any \(\xi \in \Xi \), it remains to be shown that \(\lim _{k\rightarrow \infty }h_k(\xi ) = h(\xi )\). Thus, choose \(\xi '_k \in \Xi \) with
$$ h(\xi '_k) - kL\,\Vert \xi - \xi '_k\Vert \;\ge \; h_k(\xi ) - \tfrac{1}{k}. $$
We first show that \(\xi '_k\) converges to \(\xi \) as k tends to \(\infty \). Indeed, we have
$$ h(\xi ) \;\le \; h_k(\xi ) \;\le \; h(\xi '_k) - kL\,\Vert \xi - \xi '_k\Vert + \tfrac{1}{k} \;\le \; L\bigl (1+\Vert \xi \Vert \bigr ) + (1-k)L\,\Vert \xi - \xi '_k\Vert + \tfrac{1}{k}, $$
where the last estimate follows from the growth condition \(h(\xi '_k)\le L(1+\Vert \xi '_k\Vert )\) and the triangle inequality. This implies
$$ (k-1)L\,\Vert \xi - \xi '_k\Vert \;\le \; L\bigl (1+\Vert \xi \Vert \bigr ) - h(\xi ) + \tfrac{1}{k}, $$
that is, \(\Vert \xi - \xi '_k\Vert \rightarrow 0\) as \(k \rightarrow \infty \). Therefore, we find
$$ \lim _{k\rightarrow \infty } h_k(\xi ) \;\le \; \limsup _{k\rightarrow \infty }\; h(\xi '_k) - kL\,\Vert \xi - \xi '_k\Vert + \tfrac{1}{k} \;\le \; \limsup _{k\rightarrow \infty } h(\xi '_k) \;\le \; h(\xi ), $$
where the last inequality is due to the upper semicontinuity of h. This concludes the proof. \(\square \)
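The construction in the proof can be checked numerically. The following sketch (not part of the paper; the test function h and the grid standing in for \(\Xi \) are our own hypothetical choices) builds the Lipschitz majorants \(h_k(\xi ) = \sup _{\xi '\in \Xi } h(\xi ') - kL\Vert \xi -\xi '\Vert \) for an upper semicontinuous indicator function and verifies that they majorize h, decrease in k, and approach h pointwise as k grows:

```python
import numpy as np

# Illustration of the Lipschitz regularization from the proof of Lemma A.1:
#   h_k(xi) = max over xi' in Xi of h(xi') - k * L * |xi - xi'| .
# Xi is a finite grid and h is a hypothetical u.s.c. test function
# (the indicator of the closed interval [-0.5, 0.5]).

L = 1.0                            # growth rate: h(xi) <= L * (1 + |xi|)
Xi = np.linspace(-2.0, 2.0, 801)   # discretized support (step 0.005)

def h(xi):
    """Upper semicontinuous test function: 1 on [-0.5, 0.5], 0 elsewhere."""
    return np.where(np.abs(xi) <= 0.5, 1.0, 0.0)

def h_k(xi, k):
    """Lipschitz majorant: sup-convolution of h with a -k*L*|.| penalty."""
    return float(np.max(h(Xi) - k * L * np.abs(xi - Xi)))

# h_k majorizes h (xi' = xi is feasible) and is nonincreasing in k.
for xi in (-1.0, 0.0, 0.25, 0.6, 1.5):
    values = [h_k(xi, k) for k in (1, 2, 4, 8, 50)]
    assert values[0] >= float(h(np.array(xi)))
    assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))

# Pointwise convergence at xi = 0.6: h_1(0.6) = 0.9, while h_50(0.6) = 0 = h(0.6).
print(h_k(0.6, 1), h_k(0.6, 50))
```

Away from the discontinuity at \(\pm 0.5\), even moderate k reproduces h exactly; right next to the jump the majorants decay to h only as k tends to infinity, mirroring the pointwise (rather than uniform) convergence established in the lemma.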
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Cite this article
Mohajerin Esfahani, P., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171, 115–166 (2018). https://doi.org/10.1007/s10107-017-1172-1
Mathematics Subject Classification
 90C15 Stochastic programming
 90C25 Convex programming
 90C47 Minimax problems