1 Introduction

Stochastic programming is a powerful modeling paradigm for optimization under uncertainty. The goal of a generic single-stage stochastic program is to find a decision \(x\in \mathbb {R}^n\) that minimizes an expected cost \(\mathbb {E}^\mathbb {P}[h(x,\xi )]\), where the expectation is taken with respect to the distribution \(\mathbb {P}\) of the continuous random vector \(\xi \in \mathbb {R}^m\). However, classical stochastic programming is challenged by the large-scale decision problems encountered in today’s increasingly interconnected world. First, the distribution \(\mathbb {P}\) is never observable but must be inferred from data. However, if we calibrate a stochastic program to a given dataset and evaluate its optimal decision on a different dataset, then the resulting out-of-sample performance is often disappointing—even if the two datasets are generated from the same distribution. This phenomenon is termed the optimizer’s curse and is reminiscent of overfitting effects in statistics [48]. Second, in order to evaluate the objective function of a stochastic program for a fixed decision x, we need to compute a multivariate integral, which is #P-hard even if \(h(x,\xi )\) constitutes the positive part of an affine function, while \(\xi \) is uniformly distributed on the unit hypercube [24, Corollary 1].

Distributionally robust optimization is an alternative modeling paradigm, where the objective is to find a decision x that minimizes the worst-case expected cost \(\sup _{{{\mathbb {Q}}} \in \mathcal {P}} \mathbb {E}^{{\mathbb {Q}}} [ h(x,\xi )]\). Here, the worst case is taken over an ambiguity set \({\mathcal {P}}\), that is, a family of distributions characterized through certain known properties of the unknown data-generating distribution \(\mathbb {P}\). Distributionally robust optimization problems have been studied since Scarf’s [43] seminal treatise on the ambiguity-averse newsvendor problem in 1958, but the field has gained momentum only with the advent of modern robust optimization techniques in the last decade [3, 9]. Distributionally robust optimization has the following striking benefits. First, adopting a worst-case approach regularizes the optimization problem and thereby mitigates the optimizer’s curse characteristic of stochastic programming. Second, distributionally robust models are often tractable even though the corresponding stochastic models with the true data-generating distribution (which is generically continuous) are \(\#P\)-hard. So even if the data-generating distribution were known, the corresponding stochastic program could not be solved efficiently.

The ambiguity set \({\mathcal {P}}\) is a key ingredient of any distributionally robust optimization model. A good ambiguity set should be rich enough to contain the true data-generating distribution with high confidence. On the other hand, the ambiguity set should be small enough to exclude pathological distributions, which would incentivize overly conservative decisions. The ambiguity set should also be easy to parameterize from data, and—ideally—it should facilitate a tractable reformulation of the distributionally robust optimization problem as a structured mathematical program that can be solved with off-the-shelf optimization software.

Distributionally robust optimization models where \(\xi \) has finitely many realizations are reviewed in [2, 7, 39]. This paper focuses on situations where \(\xi \) can have a continuum of realizations. In this setting, the existing literature has studied three types of ambiguity sets. Moment ambiguity sets contain all distributions that satisfy certain moment constraints; see for example [18, 22, 51] or the references therein. An attractive alternative is to define the ambiguity set as a ball in the space of probability distributions by using a probability distance function such as the Prohorov metric [20], the Kullback–Leibler divergence [25, 27], or the Wasserstein metric [38, 52]. Such metric-based ambiguity sets contain all distributions that are close to a nominal or most likely distribution with respect to the prescribed probability metric. By adjusting the radius of the ambiguity set, the modeler can thus control the degree of conservatism of the underlying optimization problem. If the radius drops to zero, then the ambiguity set shrinks to a singleton that contains only the nominal distribution, in which case the distributionally robust problem reduces to an ambiguity-free stochastic program. Finally, ambiguity sets can also be defined as confidence regions of goodness-of-fit tests [7].

In this paper we study distributionally robust optimization problems with a Wasserstein ambiguity set centered at the uniform distribution \(\widehat{\mathbb {P}}_N\) on N independent and identically distributed training samples. The Wasserstein distance of two distributions \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\) can be viewed as the minimum transportation cost for moving the probability mass from \(\mathbb {Q}_1\) to \(\mathbb {Q}_2\), and the Wasserstein ambiguity set contains all (continuous or discrete) distributions that are sufficiently close to the (discrete) empirical distribution \(\widehat{\mathbb {P}}_N\) with respect to the Wasserstein metric. Modern measure concentration results from statistics guarantee that the unknown data-generating distribution \(\mathbb {P}\) belongs to the Wasserstein ambiguity set around \(\widehat{\mathbb {P}}_N\) with confidence \(1-\beta \) if its radius is a sublinearly growing function of \(\log (1/\beta )/N\) [11, 21]. The optimal value of the distributionally robust problem thus provides an upper confidence bound on the achievable out-of-sample cost.

While Wasserstein ambiguity sets offer powerful out-of-sample performance guarantees and enable the decision maker to control the model’s conservativeness, moment-based ambiguity sets appear to display better tractability properties. Specifically, there is growing evidence that distributionally robust models with moment ambiguity sets are more tractable than the corresponding stochastic models because the intractable high-dimensional integrals in the objective function are replaced with tractable (generalized) moment problems [18, 22, 51]. In contrast, distributionally robust models with Wasserstein ambiguity sets are believed to be harder than their stochastic counterparts [36]. Indeed, the state-of-the-art method for computing the worst-case expectation over a Wasserstein ambiguity set \({\mathcal {P}}\) relies on global optimization techniques. Exploiting the fact that the extreme points of \({\mathcal {P}}\) are discrete distributions with a fixed number of atoms [52], one may reformulate the original worst-case expectation problem as a finite-dimensional non-convex program, which can be solved via “difference of convex programming” methods, see [52] or [36, Section 7.1]. However, the computational effort is reported to be considerable, and there is no guarantee to find the global optimum. Nevertheless, tractability results are available for special cases. Specifically, the worst case of a convex law-invariant risk measure with respect to a Wasserstein ambiguity set \({\mathcal {P}}\) reduces to the sum of the nominal risk and a regularization term whenever \(h(x,\xi )\) is affine in \(\xi \) and \({\mathcal {P}}\) does not include any support constraints [53]. Moreover, while this paper was under review we became aware of the PhD thesis [54], which reformulates a distributionally robust two-stage unit commitment problem over a Wasserstein ambiguity set as a semi-infinite linear program, which is subsequently solved using a Benders decomposition algorithm.

The main contribution of this paper is to demonstrate that the worst-case expectation over a Wasserstein ambiguity set can in fact be computed efficiently via convex optimization techniques for numerous loss functions of practical interest. Furthermore, we propose an efficient procedure for constructing an extremal distribution that attains the worst-case expectation—provided that such a distribution exists. Otherwise, we construct a sequence of distributions that attain the worst-case expectation asymptotically. As a by-product, our analysis shows that many interesting distributionally robust optimization problems with Wasserstein ambiguity sets can be solved in polynomial time. We also investigate the out-of-sample performance of the resulting optimal decisions—both theoretically and experimentally—and analyze its dependence on the number of training samples. We highlight the following main contributions of this paper.

  • We prove that the worst-case expectation of an uncertain loss \(\ell (\xi )\) over a Wasserstein ambiguity set coincides with the optimal value of a finite-dimensional convex program if \(\ell (\xi )\) constitutes a pointwise maximum of finitely many concave functions. Generalizations to convex functions or to sums of maxima of concave functions are also discussed. We conclude that worst-case expectations can be computed efficiently to high precision via modern convex optimization algorithms.

  • We describe a supplementary finite-dimensional convex program whose optimal (near-optimal) solutions can be used to construct exact (approximate) extremal distributions for the infinite-dimensional worst-case expectation problem.

  • We show that the worst-case expectation reduces to the optimal value of an explicit linear program if the 1-norm or the \(\infty \)-norm is used in the definition of the Wasserstein metric and if \(\ell (\xi )\) belongs to any of the following function classes: (1) a pointwise maximum or minimum of affine functions; (2) the indicator function of a closed polytope or the indicator function of the complement of an open polytope; (3) the optimal value of a parametric linear program whose cost or right-hand side coefficients depend linearly on \(\xi \).

  • Using recent measure concentration results from statistics, we demonstrate that the optimal value of a distributionally robust optimization problem over a Wasserstein ambiguity set provides an upper confidence bound on the out-of-sample cost of the worst-case optimal decision. We validate this theoretical performance guarantee in numerical tests.

If the uncertain parameter vector \(\xi \) is confined to a fixed finite subset of \(\mathbb {R}^m\), then the worst-case expectation problems over Wasserstein ambiguity sets simplify substantially and can often be reformulated as tractable conic programs by leveraging ideas from robust optimization. An elegant second-order conic reformulation has been discovered, for instance, in the context of distributionally robust regression analysis [32], and a comprehensive list of tractable reformulations of distributionally robust risk constraints for various risk measures is provided in [39]. Our paper extends these tractability results to the practically relevant case where \(\xi \) has uncountably many possible realizations—without resorting to space tessellation or discretization techniques that are prone to the curse of dimensionality.

When \(\ell (\xi )\) is linear and the distribution of \(\xi \) ranges over a Wasserstein ambiguity set without support constraints, one can derive a concise closed-form expression for the worst-case risk of \(\ell (\xi )\) for various convex risk measures [53]. However, these analytical solutions come at the expense of a loss of generality. We believe that the results of this paper may pave the way towards an efficient computational procedure for evaluating the worst-case risk of \(\ell (\xi )\) in more general settings where the loss function may be non-linear and \(\xi \) may be subject to support constraints.

Among all metric-based ambiguity sets studied to date, the Kullback–Leibler ambiguity set has attracted the most attention from the robust optimization community. It was first used in financial portfolio optimization to capture the distributional uncertainty of asset returns with a Gaussian nominal distribution [19]. Subsequent work has focused on Kullback–Leibler ambiguity sets for discrete distributions with a fixed support, which offer additional modeling flexibility without sacrificing computational tractability [2, 14]. It is also known that distributionally robust chance constraints involving a generic Kullback–Leibler ambiguity set are equivalent to the respective classical chance constraints under the nominal distribution but with a rescaled violation probability [26, 27]. Moreover, closed-form counterparts of distributionally robust expectation constraints with Kullback–Leibler ambiguity sets have been derived in [25].

However, Kullback–Leibler ambiguity sets typically fail to represent confidence sets for the unknown distribution \(\mathbb {P}\). To see this, assume that \(\mathbb {P}\) is absolutely continuous with respect to the Lebesgue measure and that the ambiguity set is centered at the discrete empirical distribution \(\widehat{\mathbb {P}}_N\). Then, any distribution in a Kullback–Leibler ambiguity set around \(\widehat{\mathbb {P}}_N\) must assign positive probability mass to each training sample. As \(\mathbb {P}\) has a density function, it must therefore reside outside of the Kullback–Leibler ambiguity set irrespective of the training samples. Thus, Kullback–Leibler ambiguity sets around \(\widehat{\mathbb {P}}_N\) contain \(\mathbb {P}\) with probability 0. In contrast, Wasserstein ambiguity sets centered at \(\widehat{\mathbb {P}}_N\) contain discrete as well as continuous distributions and, if properly calibrated, represent meaningful confidence sets for \(\mathbb {P}\). We will exploit this property in Sect. 3 to derive finite-sample guarantees. A comparison and critical assessment of various metric-based ambiguity sets is provided in [45]. Specifically, it is shown that worst-case expectations over Kullback–Leibler and other divergence-based ambiguity sets are law invariant. In contrast, worst-case expectations over Wasserstein ambiguity sets are not. The law invariance can be exploited to evaluate worst-case expectations via the sample average approximation.

The models proposed in this paper fall within the scope of data-driven distributionally robust optimization [7, 16, 20, 23]. Closest in spirit to our work is the robust sample average approximation [7], which seeks decisions that are robust with respect to the ambiguity set of all distributions that pass a prescribed statistical hypothesis test. Indeed, the distributions within the Wasserstein ambiguity set could be viewed as those that pass a multivariate goodness-of-fit test in light of the available training samples. This amounts to interpreting the Wasserstein distance between the empirical distribution \(\widehat{\mathbb {P}}_N\) and a given hypothesis \(\mathbb {Q}\) as a test statistic and the radius of the Wasserstein ambiguity set as a threshold that needs to be chosen in view of the test’s desired significance level \(\beta \). The Wasserstein distance has already been used in tests for normality [17] and to devise nonparametric homogeneity tests [40].

The rest of the paper proceeds as follows. Section 2 sketches a generic framework for data-driven distributionally robust optimization, while Sect. 3 introduces our specific approach based on Wasserstein ambiguity sets and establishes its out-of-sample performance guarantees. In Sect. 4 we demonstrate that many worst-case expectation problems over Wasserstein ambiguity sets can be reduced to finite-dimensional convex programs, and we develop a systematic procedure for constructing worst-case distributions. Explicit linear programming reformulations of distributionally robust single and two-stage stochastic programs as well as uncertainty quantification problems are derived in Sect. 5. Section 6 extends the scope of the basic approach to broader classes of objective functions, and Sect. 7 reports on numerical results.

Notation

We denote by \(\mathbb {R}_+\) the non-negative and by \(\overline{\mathbb {R}}{:=}\mathbb {R}\cup \{-\infty ,\infty \}\) the extended reals. Throughout this paper, we adopt the conventions of extended arithmetics, whereby \(\infty \cdot 0 = 0\cdot \infty = {0 / 0 } = 0\) and \(\infty - \infty = -\infty + \infty = 1/0 = \infty \). The inner product of two vectors \(a,b \in \mathbb {R}^m\) is denoted by \(\big \langle a, b \big \rangle {:=}a^\intercal b\). Given a norm \(\Vert \cdot \Vert \) on \(\mathbb {R}^m\), the dual norm is defined through \(\Vert z\Vert _* {:=}\sup _{\Vert \xi \Vert \le 1} \big \langle z, \xi \big \rangle \). A function \(f:\mathbb {R}^m\rightarrow \overline{\mathbb {R}}\) is proper if \(f(\xi )<+\infty \) for at least one \(\xi \) and \(f(\xi )>-\infty \) for every \(\xi \) in \(\mathbb {R}^m\). The conjugate of f is defined as \(f^*(z) {:=}\sup _{\xi \in \mathbb {R}^m} \big \langle z, \xi \big \rangle - f(\xi )\). Note that conjugacy preserves properness. For a set \(\Xi \subseteq \mathbb {R}^m\), the indicator function \(\mathbbm {1}_{\Xi }\) is defined through \(\mathbbm {1}_{\Xi }(\xi )=1\) if \(\xi \in \Xi \); \(=0\) otherwise. Similarly, the characteristic function \(\chi _\Xi \) is defined via \(\chi _\Xi (\xi )=0\) if \(\xi \in \Xi \); \(=\infty \) otherwise. The support function of \(\Xi \) is defined as \(\sigma _{\Xi }(z) {:=}\sup _{\xi \in \Xi } \big \langle z, \xi \big \rangle \). It coincides with the conjugate of \(\chi _\Xi \). We denote by \(\delta _{\xi }\) the Dirac distribution concentrating unit mass at \(\xi \in \mathbb {R}^m\). The product of two probability distributions \(\mathbb {P}_1\) and \(\mathbb {P}_2\) on \(\Xi _1\) and \(\Xi _2\), respectively, is the distribution \(\mathbb {P}_1\otimes \mathbb {P}_2 \) on \(\Xi _1\times \Xi _2\). The N-fold product of a distribution \(\mathbb {P}\) on \(\Xi \) is denoted by \(\mathbb {P}^N\), which represents a distribution on the Cartesian product space \(\Xi ^N\). Finally, we set the expectation of \(\ell :\Xi \rightarrow \overline{\mathbb {R}}\) under \(\mathbb {P}\) to \(\mathbb {E}^\mathbb {P}[\ell (\xi )] = \mathbb {E}^\mathbb {P}\big [\max \{\ell (\xi ),0\}\big ] + \mathbb {E}^\mathbb {P}\big [\min \{\ell (\xi ),0\}\big ]\), which is well-defined by the conventions of extended arithmetics.
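
For concreteness, the following closed-form examples (included here purely as an illustration) show how these objects look for the 1-norm on \(\mathbb {R}^m\), the box \(\Xi = \{\xi \in \mathbb {R}^m : \Vert \xi \Vert _\infty \le 1\}\), and an affine function:

$$\begin{aligned} \Vert z\Vert _* = \sup _{\Vert \xi \Vert _1\le 1} \big \langle z, \xi \big \rangle = \Vert z\Vert _\infty , \qquad \sigma _{\Xi }(z) = \sup _{\Vert \xi \Vert _\infty \le 1} \big \langle z, \xi \big \rangle = \Vert z\Vert _1, \qquad f(\xi ) = \big \langle a, \xi \big \rangle + b \;\Longrightarrow \; f^*(z) = \left\{ \begin{array}{ll} -b &{} \quad \text {if } z = a, \\ \infty &{} \quad \text {otherwise.} \end{array}\right. \end{aligned}$$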

2 Data-driven stochastic programming

Consider the stochastic program

$$\begin{aligned} J^\star {:=}\inf _{x \in \mathbb {X}} \left\{ {\mathbb {E}^{\mathbb {P}}} \big [ h(x,\xi ) \big ] = \int _{\Xi } h(x,\xi )\, \mathbb {P}(\mathrm {d}\xi )\right\} \end{aligned}$$
(1)

with feasible set \(\mathbb {X}\subseteq \mathbb {R}^n\), uncertainty set \(\Xi \subseteq \mathbb {R}^m\) and loss function \(h : \mathbb {R}^n \times \mathbb {R}^m \rightarrow \overline{\mathbb {R}}\). The loss function depends both on the decision vector \(x\in \mathbb {R}^n\) and the random vector \(\xi \in \mathbb {R}^m\), whose distribution \(\mathbb {P}\) is supported on \(\Xi \). Problem (1) can be viewed as the first-stage problem of a two-stage stochastic program, where \(h(x,\xi )\) represents the optimal value of a subordinate second-stage problem [46]. Alternatively, problem (1) may also be interpreted as a generic learning problem in the spirit of [49].

Unfortunately, in most situations of practical interest, the distribution \(\mathbb {P}\) is not precisely known, and therefore we miss essential information to solve problem (1) exactly. However, \(\mathbb {P}\) is often partially observable through a finite set of N independent samples, e.g., past realizations of the random vector \(\xi \). We denote the training dataset comprising these samples by \(\widehat{\Xi }_N{:=}\{\widehat{\xi }_i\}_{i\le N} \subseteq \Xi \). We emphasize that—before its revelation—the dataset \(\widehat{\Xi }_N\) can be viewed as a random object governed by the distribution \(\mathbb {P}^N\) supported on \(\Xi ^N\).

A data-driven solution for problem (1) is a feasible decision \(\widehat{x}_N\in \mathbb {X}\) that is constructed from the training dataset \(\widehat{\Xi }_N\). Throughout this paper, we notationally suppress the dependence of \(\widehat{x}_N\) on the training samples in order to avoid clutter. Instead, we reserve the superscript ‘\(\,{\widehat{~}}\) ’ for objects that depend on the training data and thus constitute random objects governed by the product distribution \(\mathbb {P}^N\). The out-of-sample performance of \(\widehat{x}_N\) is defined as \(\mathbb {E}^\mathbb {P}\big [ h(\widehat{x}_N,\xi ) \big ]\) and can thus be viewed as the expected cost of \(\widehat{x}_N\) under a new sample \(\xi \) that is independent of the training dataset. As \(\mathbb {P}\) is unknown, however, the exact out-of-sample performance cannot be evaluated in practice, and the best we can hope for is to establish performance guarantees in the form of tight bounds. The feasibility of \(\widehat{x}_N\) in (1) implies \(J^\star \le \mathbb {E}^\mathbb {P}\big [ h(\widehat{x}_N,\xi ) \big ]\), but this lower bound is again of limited use as \(J^\star \) is unknown and as our primary concern is to bound the costs from above. Thus, we seek data-driven solutions \(\widehat{x}_N\) with performance guarantees of the type

$$\begin{aligned} \mathbb {P}^N\Big \{ \widehat{\Xi }_N~:~ \mathbb {E}^\mathbb {P}\big [ h(\widehat{x}_N,\xi ) \big ] \le \widehat{J}_N\Big \}\ge 1-\beta , \end{aligned}$$
(2)

where \(\widehat{J}_N\) constitutes an upper bound that may depend on the training dataset, and \(\beta \in (0,1)\) is a significance parameter with respect to the distribution \(\mathbb {P}^N\), which governs both \(\widehat{x}_N\) and \(\widehat{J}_N\). Hereafter we refer to \(\widehat{J}_N\) as a certificate for the out-of-sample performance of \(\widehat{x}_N\) and to the probability on the left-hand side of (2) as its reliability. Our ideal goal is to find a data-driven solution with the lowest possible out-of-sample performance. This is impossible, however, as \(\mathbb {P}\) is unknown, and the out-of-sample performance cannot be computed. We thus pursue the more modest but achievable goal to find a data-driven solution with a low certificate and a high reliability.

A natural approach to generate data-driven solutions \(\widehat{x}_N\) is to approximate \(\mathbb {P}\) with the discrete empirical probability distribution

$$\begin{aligned} \widehat{\mathbb {P}}_N{:=}{1 \over N} \sum _{i = 1}^{N} \delta _{\widehat{\xi }_i}, \end{aligned}$$
(3)

that is, the uniform distribution on \(\widehat{\Xi }_N\). This amounts to approximating the original stochastic program (1) with the sample-average approximation (SAA) problem

$$\begin{aligned} \widehat{J}_{\mathrm{SAA}}{:=}\inf _{x \in \mathbb {X}} \left\{ \mathbb {E}^{\widehat{\mathbb {P}}_N} \big [ h(x,\xi ) \big ] = {1 \over N} \sum _{i = 1}^{N} h(x, \widehat{\xi }_i)\right\} . \end{aligned}$$
(4)

If the feasible set \(\mathbb {X}\) is compact and the loss function is uniformly continuous in x across all \(\xi \in \Xi \), then the optimal value and optimal solutions of the SAA problem (4) converge almost surely to their counterparts of the true problem (1) as N tends to infinity [46, Theorem 5.3]. Even though finite sample performance guarantees of the type (2) can be obtained under additional assumptions such as Lipschitz continuity of the loss function (see e.g., [47, Theorem 1]), the SAA problem has been conceived primarily for situations where the distribution \(\mathbb {P}\) is known and additional samples can be acquired cheaply via random number generation. However, the optimal solutions of the SAA problem tend to display a poor out-of-sample performance in situations where N is small and where the acquisition of additional samples would be costly.
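
As an illustration (not part of the original text), the following minimal Python sketch sets up the SAA problem (4) for a hypothetical newsvendor-type loss \(h(x,\xi ) = \max \{c_o(x-\xi ), c_u(\xi -x)\}\) with a scalar order quantity x and scalar demand \(\xi \); it assumes the cvxpy modeling library and synthetic training samples.

```python
import cvxpy as cp
import numpy as np

# Synthetic training samples standing in for the dataset {xi_hat_i}.
rng = np.random.default_rng(0)
xi_hat = rng.exponential(scale=10.0, size=50)

c_o, c_u = 1.0, 3.0  # assumed overage / underage costs

# SAA problem (4) with h(x, xi) = max{c_o*(x - xi), c_u*(xi - x)}.
x = cp.Variable(nonneg=True)
saa_objective = cp.sum(cp.maximum(c_o * (x - xi_hat), c_u * (xi_hat - x))) / xi_hat.size
J_saa = cp.Problem(cp.Minimize(saa_objective)).solve()
print(f"SAA optimal value {J_saa:.3f} attained at x = {x.value:.3f}")
```

Because the loss is a pointwise maximum of affine functions, the SAA problem is a small linear program; the same structure reappears in the worst-case expectation problems of Sect. 4.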

In this paper we address problem (1) with an alternative approach that explicitly accounts for our ignorance of the true data-generating distribution \(\mathbb {P}\), and that offers attractive performance guarantees even when the acquisition of additional samples from \(\mathbb {P}\) is impossible or expensive. Specifically, we use \(\widehat{\Xi }_N\) to design an ambiguity set \(\widehat{\mathcal {P}}_N\) containing all distributions that could have generated the training samples with high confidence. This ambiguity set enables us to define the certificate \(\widehat{J}_N\) as the optimal value of a distributionally robust optimization problem that minimizes the worst-case expected cost:

$$\begin{aligned} \widehat{J}_N{:=}\inf \limits _{x \in \mathbb {X}} \sup \limits _{\mathbb {Q}\in \widehat{\mathcal {P}}_N} \mathbb {E}^\mathbb {Q}\big [ h(x,\xi ) \big ] \end{aligned}$$
(5)

Following [38], we construct \(\widehat{\mathcal {P}}_N\) as a ball around the empirical distribution (3) with respect to the Wasserstein metric. In the remainder of the paper we will demonstrate that the optimal value \(\widehat{J}_N\) as well as any optimal solution \(\widehat{x}_N\) (if it exists) of the distributionally robust problem (5) satisfy the following conditions.

  (i) Finite sample guarantee: For a carefully chosen size of the ambiguity set, the certificate \(\widehat{J}_N\) provides a \(1-\beta \) confidence bound of the type (2) on the out-of-sample performance of \(\widehat{x}_N\).

  (ii) Asymptotic consistency: As N tends to infinity, the certificate \(\widehat{J}_N\) and the data-driven solution \(\widehat{x}_N\) converge—in a sense to be made precise below—to the optimal value \(J^\star \) and an optimizer \(x^\star \) of the stochastic program (1), respectively.

  (iii) Tractability: For many loss functions \(h(x,\xi )\) and sets \(\mathbb {X}\), the distributionally robust problem (5) is computationally tractable and admits a reformulation reminiscent of the SAA problem (4).

Conditions (i–iii) have been identified in [7] as desirable properties of data-driven solutions for stochastic programs. Precise statements of these conditions will be provided in the remainder. In Sect. 3 we will use the Wasserstein metric to construct ambiguity sets of the type \(\widehat{\mathcal {P}}_N\) satisfying the conditions (i) and (ii). In Sect. 4, we will demonstrate that these ambiguity sets also fulfill the tractability condition (iii). We see this last result as the main contribution of this paper because the state-of-the-art method for solving distributionally robust problems over Wasserstein ambiguity sets relies on global optimization algorithms [36].

3 Wasserstein metric and measure concentration

Probability metrics represent distance functions on the space of probability distributions. One of the most widely used examples is the Wasserstein metric, which is defined on the space \(\mathcal {M}(\Xi )\) of all probability distributions \(\mathbb {Q}\) supported on \(\Xi \) with \(\mathbb {E}^\mathbb {Q}\big [\Vert \xi \Vert \big ] = \int _\Xi \Vert \xi \Vert \,\mathbb {Q}(\mathrm {d}\xi )<\infty \).

Definition 3.1

(Wasserstein metric [29]) The Wasserstein metric \(d_\mathrm{W} : \mathcal {M}(\Xi )\times \mathcal {M}(\Xi )\rightarrow \mathbb {R}_+\) is defined via

$$\begin{aligned} d_{\mathrm W}\big (\mathbb {Q}_1,\mathbb {Q}_2\big ) {:=}\inf \left\{ \int _{\Xi ^2} \Vert \xi _1 - \xi _2 \Vert \, \Pi (\mathrm {d}\xi _1, \mathrm {d}\xi _2) ~: \begin{array}{l} \Pi \textit{ is a joint distribution of } \xi _1 \textit{ and } \xi _2 \\ \textit{with marginals } \mathbb {Q}_1 \textit{ and } \mathbb {Q}_2, \textit{ respectively} \end{array}\right\} \end{aligned}$$

for all distributions \(\mathbb {Q}_1,\mathbb {Q}_2\in \mathcal {M}(\Xi )\), where \(\Vert \cdot \Vert \) represents an arbitrary norm on \(\mathbb {R}^m\).

The decision variable \(\Pi \) can be viewed as a transportation plan for moving a mass distribution described by \(\mathbb {Q}_1\) to another one described by \(\mathbb {Q}_2\). Thus, the Wasserstein distance between \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\) represents the cost of an optimal mass transportation plan, where the norm \(\Vert \cdot \Vert \) encodes the transportation costs. We remark that a generalized p-Wasserstein metric for \(p\ge 1\) is obtained by setting the transportation cost between \(\xi _1\) and \(\xi _2\) to \(\Vert \xi _1-\xi _2\Vert ^p\). In this paper, however, we focus exclusively on the 1-Wasserstein metric of Definition 3.1, which is sometimes also referred to as the Kantorovich metric.
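
As a purely illustrative aside, for univariate empirical distributions the optimal transportation cost in Definition 3.1 can be evaluated off the shelf, e.g. with SciPy (an assumption of this sketch; the paper itself only relies on the abstract definition above):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two empirical (discrete) distributions on the real line.
rng = np.random.default_rng(1)
samples_q1 = rng.normal(loc=0.0, scale=1.0, size=200)
samples_q2 = rng.normal(loc=0.5, scale=1.0, size=200)

# 1-Wasserstein (Kantorovich) distance between the two empirical distributions.
print(f"Empirical 1-Wasserstein distance: {wasserstein_distance(samples_q1, samples_q2):.3f}")
```

For the multivariate discrete distributions encountered later, the same quantity is the optimal value of a finite transportation linear program.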

We will sometimes also need the following dual representation of the Wasserstein metric.

Theorem 3.2

(Kantorovich–Rubinstein [29]) For any distributions \(\mathbb {Q}_1, \mathbb {Q}_2\in {\mathcal {M}}(\Xi )\) we have

$$\begin{aligned} d_{\mathrm W}\big (\mathbb {Q}_1,\mathbb {Q}_2\big ) = \sup _{f \in \mathcal {L}} \Big \{ \int _{\Xi } f(\xi ) \,\mathbb {Q}_1(\mathrm {d}\xi ) - \int _{\Xi } f(\xi )\, \mathbb {Q}_2(\mathrm {d}\xi )\Big \}, \end{aligned}$$

where \(\mathcal {L}\) denotes the space of all Lipschitz functions with \(|f(\xi )-f(\xi ')|\le \Vert \xi -\xi '\Vert \) for all \(\xi ,\xi '\in \Xi \).

Kantorovich and Rubinstein [29] originally established this result for distributions with bounded support. A modern proof for unbounded distributions is due to Villani [50, Remark 6.5, p. 107]. The optimization problems in Definition 3.1 and Theorem 3.2, which provide two equivalent characterizations of the Wasserstein metric, constitute a primal-dual pair of infinite-dimensional linear programs. The dual representation implies that two distributions \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\) are close to each other with respect to the Wasserstein metric if and only if all functions with uniformly bounded slopes have similar integrals under \(\mathbb {Q}_1\) and \(\mathbb {Q}_2\). Theorem 3.2 also demonstrates that the Wasserstein metric is a special instance of an integral probability metric (see e.g. [33]) and that its generating function class coincides with a family of Lipschitz continuous functions.
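
As a simple sanity check of both characterizations (added here for illustration), consider two Dirac distributions \(\delta _{\xi _1}\) and \(\delta _{\xi _2}\): the only feasible transportation plan in Definition 3.1 is \(\Pi = \delta _{\xi _1}\otimes \delta _{\xi _2}\), while the 1-Lipschitz test function \(f(\xi ) = \Vert \xi - \xi _2\Vert \) attains the supremum in Theorem 3.2, and both characterizations yield

$$\begin{aligned} d_{\mathrm W}\big (\delta _{\xi _1},\delta _{\xi _2}\big ) = \Vert \xi _1 - \xi _2\Vert . \end{aligned}$$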

In the remainder we will examine the ambiguity set

$$\begin{aligned} \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N) {:=}\left\{ \mathbb {Q}\in \mathcal {M}(\Xi ) ~:~ d_{\mathrm W}\big (\widehat{\mathbb {P}}_N,\mathbb {Q}\big ) \le \varepsilon \right\} , \end{aligned}$$
(6)

which can be viewed as the Wasserstein ball of radius \(\varepsilon \) centered at the empirical distribution \(\widehat{\mathbb {P}}_N\). Under a common light tail assumption on the unknown data-generating distribution \(\mathbb {P}\), this ambiguity set offers attractive performance guarantees in the spirit of Sect. 2.

Assumption 3.3

(Light-tailed distribution) There exists an exponent \(a > 1\) such that

$$\begin{aligned} A {:=}\mathbb {E}^\mathbb {P}\big [ \exp (\Vert \xi \Vert ^a) \big ] = \int _{\Xi } \exp (\Vert \xi \Vert ^a)\,\mathbb {P}(\mathrm {d}\xi ) < \infty . \end{aligned}$$

Assumption 3.3 essentially requires the tail of the distribution \(\mathbb {P}\) to decay at an exponential rate. Note that this assumption trivially holds if \(\Xi \) is compact. Heavy-tailed distributions that fail to meet Assumption 3.3 are difficult to handle even in the context of the classical sample average approximation. Indeed, under a heavy-tailed distribution the sample average of the loss corresponding to any fixed decision \(x \in \mathbb {X}\) may not even converge to the expected loss; see e.g. [13, 15]. The following modern measure concentration result provides the basis for establishing powerful finite sample guarantees.

Theorem 3.4

(Measure concentration [21, Theorem 2]) If Assumption 3.3 holds, we have

$$\begin{aligned} \mathbb {P}^N \Big \{ d_{\mathrm W}\big (\mathbb {P},\widehat{\mathbb {P}}_N\big ) \ge \varepsilon \Big \} \le \left\{ \begin{array}{ll} c_1 \exp \big ({-c_2N\varepsilon ^{\max \{m,2\}}}\big ) &{} \quad \text {if } \varepsilon \le 1, \\ c_1 \exp \big ({-c_2N\varepsilon ^a}\big ) &{} \quad \text {if } \varepsilon > 1,\end{array}\right. \end{aligned}$$
(7)

for all \(N \ge 1\), \(m \ne 2\), and \(\varepsilon >0\), where \(c_1, c_2\) are positive constants that only depend on a, A, and m.

Theorem 3.4 provides an a priori estimate of the probability that the unknown data-generating distribution \(\mathbb {P}\) resides outside of the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\). Thus, we can use Theorem 3.4 to estimate the radius of the smallest Wasserstein ball that contains \(\mathbb {P}\) with confidence \(1-\beta \) for some prescribed \(\beta \in (0,1)\). Indeed, equating the right-hand side of (7) to \(\beta \) and solving for \(\varepsilon \) yields

$$\begin{aligned} \varepsilon _N(\beta ) {:=}\left\{ \begin{array}{ll} \Big ({\log (c_1\beta ^{-1}) \over c_2N} \Big )^{1/{\max \{m,2\}}} &{} \quad \text {if } N \ge {\log (c_1\beta ^{-1}) \over c_2}, \\ \Big ({\log (c_1\beta ^{-1}) \over c_2N} \Big )^{1/a} &{} \quad \text {if } N < {\log (c_1\beta ^{-1}) \over c_2}. \end{array}\right. \end{aligned}$$
(8)

Note that the Wasserstein ball with radius \(\varepsilon _N(\beta )\) can thus be viewed as a confidence set for the unknown true distribution as in statistical testing; see also [7].
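
For completeness, formula (8) can be transcribed directly; the Python sketch below does so, treating the constants \(c_1, c_2\) and the tail exponent a from Assumption 3.3 as given inputs (the example uses hypothetical values, since these constants are rarely computed in practice; see also the calibration discussion at the end of this section).

```python
import numpy as np

def wasserstein_radius(beta, N, m, a, c1, c2):
    """Radius eps_N(beta) from (8): the Wasserstein ball of this radius around the
    empirical distribution contains the unknown distribution P with confidence at least 1 - beta."""
    log_term = np.log(c1 / beta)
    if N >= log_term / c2:
        return (log_term / (c2 * N)) ** (1.0 / max(m, 2))
    return (log_term / (c2 * N)) ** (1.0 / a)

# The radius shrinks as the sample size N grows (hypothetical constants).
for N in (10, 100, 1000):
    print(N, wasserstein_radius(beta=0.05, N=N, m=2, a=2.0, c1=3.0, c2=1.0))
```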

Theorem 3.5

(Finite sample guarantee) Suppose that Assumption 3.3 holds and that \(\beta \in (0,1)\). Assume also that \(\widehat{J}_N\) and \(\widehat{x}_N\) represent the optimal value and an optimizer of the distributionally robust program (5) with ambiguity set \(\widehat{\mathcal {P}}_N = \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\). Then, the finite sample guarantee (2) holds.

Proof

The claim follows immediately from Theorem 3.4, which ensures via the definition of \(\varepsilon _N(\beta )\) in (8) that \(\mathbb {P}^N \{ \mathbb {P}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N) \} \ge 1-\beta \). Thus, \(\mathbb {E}^\mathbb {P}[ h(\widehat{x}_N,\xi )] \le \sup _{\mathbb {Q}\in \widehat{\mathcal {P}}_N}\mathbb {E}^\mathbb {Q}[ h(\widehat{x}_N,\xi )] = \widehat{J}_N\) with probability at least \(1-\beta \). \(\square \)

It is clear from (8) that for any fixed \(\beta >0\), the radius \( \varepsilon _N(\beta )\) tends to 0 as N increases. Moreover, one can show that if \(\beta _N\) converges to zero at a carefully chosen rate, then the solution of the distributionally robust optimization problem (5) with ambiguity set \(\widehat{\mathcal {P}}_N = \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\) converges to the solution of the original stochastic program (1) as N tends to infinity. The following theorem formalizes this statement.

Theorem 3.6

(Asymptotic consistency) Suppose that Assumption 3.3 holds and that \(\beta _N\in (0,1)\), \(N \in \mathbb {N}\), satisfies \(\sum _{N=1}^\infty \beta _N<\infty \) and \(\lim _{N\rightarrow \infty }\varepsilon _N(\beta _N)=0\). Assume also that \(\widehat{J}_N\) and \(\widehat{x}_N\) represent the optimal value and an optimizer of the distributionally robust program (5) with ambiguity set \(\widehat{\mathcal {P}}_N = \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\), \(N\in \mathbb {N}\).

  (i) If \(h(x,\xi )\) is upper semicontinuous in \(\xi \) and there exists \(L\ge 0\) with \(|h(x,\xi )|\le L(1+\Vert \xi \Vert )\) for all \(x\in \mathbb {X}\) and \(\xi \in \Xi \), then \(\mathbb {P}^\infty \)-almost surely we have \(\widehat{J}_N\downarrow J^\star \) as \(N \rightarrow \infty \) where \(J^\star \) is the optimal value of (1).

  (ii) If the assumptions of assertion (i) hold, \(\mathbb {X}\) is closed, and \(h(x,\xi )\) is lower semicontinuous in x for every \(\xi \in \Xi \), then any accumulation point of \(\{\widehat{x}_N\}_{N \in \mathbb {N}}\) is \(\mathbb {P}^\infty \)-almost surely an optimal solution for (1).

The proof of Theorem 3.6 will rely on the following technical lemma.

Lemma 3.7

(Convergence of distributions) If Assumption 3.3 holds and \(\beta _N\in (0,1)\), \(N \in \mathbb {N}\), satisfies \(\sum _{N=1}^\infty \beta _N<\infty \) and \(\lim _{N\rightarrow \infty }\varepsilon _N(\beta _N)=0\), then, any sequence \({\widehat{\mathbb {Q}}}_N \in \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\), \(N\in \mathbb {N}\), where \({\widehat{\mathbb {Q}}}_N\) may depend on the training data, converges under the Wasserstein metric (and thus weakly) to \(\mathbb {P}\) almost surely with respect to \(\mathbb {P}^\infty \), that is,

$$\begin{aligned} \mathbb {P}^{\infty } \left\{ \lim _{N \rightarrow \infty } d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N\big ) = 0 \right\} = 1. \end{aligned}$$

Proof

As \({\widehat{\mathbb {Q}}}_N \in \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N)\), the triangle inequality for the Wasserstein metric ensures that

$$\begin{aligned} d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N \big ) \le d_{\mathrm W}\big (\mathbb {P},\widehat{\mathbb {P}}_N\big ) + d_{\mathrm W}\big (\widehat{\mathbb {P}}_N,{\widehat{\mathbb {Q}}}_N\big ) \le d_{\mathrm W}\big (\mathbb {P},\widehat{\mathbb {P}}_N\big ) + \varepsilon _N(\beta _N). \end{aligned}$$

Moreover, Theorem 3.4 implies that \(\mathbb {P}^N \{ d_{\mathrm W}\big (\mathbb {P},\widehat{\mathbb {P}}_N\big ) \le \varepsilon _N(\beta _N)\}\ge 1-\beta _N\), and thus we have \(\mathbb {P}^N \{ d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N \big ) \le 2\varepsilon _N(\beta _N) \} \ge 1-\beta _N\). As \(\sum _{N=1}^\infty \beta _N<\infty \), the Borel–Cantelli Lemma [28, Theorem 2.18] further implies that

$$\begin{aligned} \mathbb {P}^{\infty } \left\{ d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N\big ) \le 2\varepsilon _N(\beta _N) ~ \text {for all sufficiently large } N \right\} = 1. \end{aligned}$$

Finally, as \(\lim _{N \rightarrow \infty }\varepsilon _N(\beta _N)=0\), we conclude that \(\lim _{N \rightarrow \infty }d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N\big ) =0\) almost surely. Note that convergence with respect to the Wasserstein metric implies weak convergence [10]. \(\square \)

Proof of Theorem 3.6

As \({\widehat{x}}_N\in \mathbb {X}\), we have \(J^\star \le \mathbb {E}^\mathbb {P}[h({\widehat{x}}_N,\xi )]\). Moreover, Theorem 3.5 implies that

$$\begin{aligned} \mathbb {P}^N \left\{ J^\star \le \mathbb {E}^\mathbb {P}[h(\widehat{x}_N,\xi )] \le \widehat{J}_N\right\} \ge \mathbb {P}^N \left\{ \mathbb {P}\in \mathbb {B}_{\varepsilon _N(\beta _N)}(\widehat{\mathbb {P}}_N) \right\} \ge 1-\beta _N, \end{aligned}$$

for all \(N \in \mathbb {N}\). As \(\sum _{N=1}^\infty \beta _N<\infty \), the Borel–Cantelli Lemma further implies that

$$\begin{aligned} \mathbb {P}^{\infty } \left\{ J^\star \le \mathbb {E}^{\mathbb {P}}[h(\widehat{x}_N,\xi )] \le \widehat{J}_N~ \text {for all sufficiently large }N \right\} = 1. \end{aligned}$$

To prove assertion (i), it thus remains to be shown that \(\limsup _{N \rightarrow \infty }\widehat{J}_N\le J^\star \) with probability 1. As \(h(x,\xi )\) is upper semicontinuous and grows at most linearly in \(\xi \), there exists a non-increasing sequence of functions \(h_k(x,\xi )\), \(k\in \mathbb {N}\), such that \(h(x,\xi )=\lim _{k\rightarrow \infty } h_k(x,\xi )\), and \(h_k(x,\xi )\) is Lipschitz continuous in \(\xi \) for any fixed \(x\in \mathbb {X}\) and \(k\in \mathbb {N}\) with Lipschitz constant \(L_k\ge 0\); see Lemma A.1 in the appendix. Next, choose any \(\delta >0\), fix a \(\delta \)-optimal decision \(x_\delta \in \mathbb {X}\) for (1) with \(\mathbb {E}^\mathbb {P}[h(x_\delta ,\xi )]\le J^\star +\delta \), and for every \(N\in \mathbb {N}\) let \({\widehat{\mathbb {Q}}}_N \in \widehat{\mathcal {P}}_N\) be a \(\delta \)-optimal distribution corresponding to \(x_\delta \) with

$$\begin{aligned} \sup _{\mathbb {Q}\in \widehat{\mathcal {P}}_N}\mathbb {E}^{\mathbb {Q}}[h(x_\delta ,\xi )] \le \mathbb {E}^{{\widehat{\mathbb {Q}}}_N}[h(x_\delta ,\xi )] + \delta . \end{aligned}$$

Then, we have

$$\begin{aligned} \limsup _{N\rightarrow \infty }\widehat{J}_N&\le \limsup _{N \rightarrow \infty } \sup _{\mathbb {Q}\in \widehat{\mathcal {P}}_N}\mathbb {E}^{\mathbb {Q}} [h(x_\delta ,\xi )] \\&\le \limsup _{N \rightarrow \infty } \mathbb {E}^{{\widehat{\mathbb {Q}}}_N} [h(x_\delta ,\xi )] + \delta \\&\le \lim _{k \rightarrow \infty } \limsup _{N \rightarrow \infty } \mathbb {E}^{{\widehat{\mathbb {Q}}}_N}[h_k(x_\delta ,\xi )] + \delta \\&\le \lim _{k \rightarrow \infty } \limsup _{N \rightarrow \infty } \left( \mathbb {E}^{\mathbb {P}}[h_k(x_\delta ,\xi )] + L_k\, d_{\mathrm W}\big (\mathbb {P},{\widehat{\mathbb {Q}}}_N\big ) \right) +\delta \\&= \lim _{k \rightarrow \infty } \mathbb {E}^{\mathbb {P}}[h_k(x_\delta ,\xi )] + \delta , \quad \mathbb {P}^\infty \text {-almost surely}\\&\qquad = \mathbb {E}^{\mathbb {P}}[h(x_\delta ,\xi )] + \delta \le J^\star +2\delta , \end{aligned}$$

where the third inequality holds because \(h_k(x,\xi )\) converges from above to \(h(x,\xi )\), and the fourth inequality follows from Theorem 3.2. Moreover, the almost sure equality holds due to Lemma 3.7, and the last equality follows from the Monotone Convergence Theorem [30, Theorem 5.5], which applies because \(|\mathbb {E}^{\mathbb {P}}[h_k(x_\delta ,\xi )]| < \infty \). Indeed, recall that \(\mathbb {P}\) has an exponentially decaying tail due to Assumption 3.3 and that \(h_k(x_\delta ,\xi )\) is Lipschitz continuous in \(\xi \). As \(\delta >0\) was chosen arbitrarily, we thus conclude that \(\limsup _{N \rightarrow \infty }\widehat{J}_N\le J^\star \).

To prove assertion (ii), fix an arbitrary realization of the stochastic process \(\{\widehat{\xi }_N\}_{N \in \mathbb {N}}\) such that \(J^\star = \lim _{N \rightarrow \infty } \widehat{J}_N\) and \(J^\star \le \mathbb {E}^{\mathbb {P}}[h(\widehat{x}_N,\xi )] \le \widehat{J}_N\) for all sufficiently large N. From the proof of assertion (i) we know that these two conditions are satisfied \(\mathbb {P}^\infty \)-almost surely. Using these assumptions, one easily verifies that

$$\begin{aligned} \liminf _{N \rightarrow \infty } \mathbb {E}^{\mathbb {P}}[h({\widehat{x}}_{N},\xi )] \le \lim _{N \rightarrow \infty } \widehat{J}_N= J^\star . \end{aligned}$$
(9)

Next, let \(x^\star \) be an accumulation point of the sequence \(\{\widehat{x}_N\}_{N \in \mathbb {N}}\), and note that \(x^\star \in \mathbb {X}\) as \(\mathbb {X}\) is closed. By passing to a subsequence, if necessary, we may assume without loss of generality that \(x^\star = \lim _{N\rightarrow \infty }\widehat{x}_N\). Thus,

$$\begin{aligned} J^\star \le \mathbb {E}^{\mathbb {P}}[h(x^\star ,\xi )]&\le \mathbb {E}^{\mathbb {P}}[\liminf _{N \rightarrow \infty } h({\widehat{x}}_{N},\xi )] \le \liminf _{N \rightarrow \infty } \mathbb {E}^{\mathbb {P}}[h({\widehat{x}}_{N},\xi )] \le J^\star , \end{aligned}$$

where the first inequality exploits that \(x^\star \in \mathbb {X}\), the second inequality follows from the lower semicontinuity of \(h(x,\xi )\) in x, the third inequality holds due to Fatou’s lemma (which applies because \(h(x,\xi )\) grows at most linearly in \(\xi \)), and the last inequality follows from (9). Therefore, we have \(\mathbb {E}^{\mathbb {P}}[h(x^\star ,\xi )] = J^\star \). \(\square \)

In the following we show that all assumptions of Theorem 3.6 are necessary for asymptotic convergence, that is, relaxing any of these conditions can invalidate the convergence result.

Example 1

(Necessity of regularity conditions)

  (1) Upper semicontinuity of \(\xi \mapsto h(x,\xi )\) in Theorem 3.6 (i):

    Set \(\Xi = [0,1]\), \(\mathbb {P}= \delta _{0}\) and \(h(x,\xi ) = \mathbbm {1}_{(0,1]}(\xi )\), whereby \(J^\star = 0\). As \(\mathbb {P}\) concentrates unit mass at 0, we have \(\widehat{\mathbb {P}}_N=\delta _{0}=\mathbb {P}\) irrespective of \(N\in \mathbb {N}\). For any \(\varepsilon > 0\), the Dirac distribution \(\delta _{\varepsilon }\) thus resides within the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\). Hence, \(\widehat{J}_N\) fails to converge to \(J^\star \) for \(\varepsilon \rightarrow 0\) because

    $$\begin{aligned} \widehat{J}_N\ge \mathbb {E}^{\delta _{\varepsilon }} [h(x,\xi )] = h(x, \varepsilon ) = 1,\quad \forall \varepsilon >0. \end{aligned}$$
  (2) Linear growth of \(\xi \mapsto h(x,\xi )\) in Theorem 3.6 (i):

    Set \(\Xi = \mathbb {R}\), \(\mathbb {P}= \delta _{0}\) and \(h(x,\xi ) = \xi ^2\), which implies that \(J^\star =0\). Note that for any \(\rho >\varepsilon \), the two-point distribution \(\mathbb {Q}_\rho = (1-\tfrac{\varepsilon }{\rho })\delta _{0}+\tfrac{\varepsilon }{\rho }\delta _{\rho }\) is contained in the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) of radius \(\varepsilon >0\). Hence, \(\widehat{J}_N\) fails to converge to \(J^\star \) for \(\varepsilon \rightarrow 0\) because

    $$\begin{aligned} \widehat{J}_N\ge \, \sup _{\rho> \varepsilon } \,\mathbb {E}^{\mathbb {Q}_\rho } [h(x,\xi )] = \sup _{\rho> \varepsilon } \, \varepsilon \rho = \infty , \quad \forall \varepsilon >0. \end{aligned}$$
  (3) Lower semicontinuity of \(x \mapsto h(x,\xi )\) in Theorem 3.6 (ii):

    Set \(\mathbb {X}= [0,1]\) and \(h(x,\xi ) = \mathbbm {1}_{[0.5,1]}(x)\), whereby \(J^\star =0\) irrespective of \(\mathbb {P}\). As the objective is independent of \(\xi \), the distributionally robust optimization problem (5) is equivalent to (1). Then, \({\widehat{x}}_N = \tfrac{N-1}{2N}\) is a sequence of minimizers for (5) whose accumulation point \(x^\star = \tfrac{1}{2}\) fails to be optimal in (1).

A convergence result akin to Theorem 3.6 for goodness-of-fit-based ambiguity sets is discussed in [7, Section 4]. This result is complementary to Theorem 3.6. Indeed, Theorem 3.6(i) requires \(h(x,\xi )\) to be upper semicontinuous in \(\xi \), which is a necessary condition in our setting (see Example 1) that is absent in [7]. Moreover, Theorem 3.6(ii) only requires \(h(x,\xi )\) to be lower semicontinuous in x, while [7] asks for equicontinuity of this mapping. This stronger requirement yields a stronger result, namely the almost sure convergence of \(\sup _{\mathbb {Q}\in \widehat{\mathcal {P}}_N} \mathbb {E}^\mathbb {Q}[h(x,\xi )]\) to \(\mathbb {E}^\mathbb {P}[h(x,\xi )]\) uniformly in x on any compact subset of \(\mathbb {X}\).

Theorems 3.5 and 3.6 indicate that a careful a priori design of the Wasserstein ball results in attractive finite sample and asymptotic guarantees for the distributionally robust solutions. In practice, however, setting the Wasserstein radius to \(\varepsilon _N(\beta )\) yields over-conservative solutions for the following reasons:

  • Even though the constants \(c_1\) and \(c_2\) in (8) can be computed based on the proof of [21, Theorem 2], the resulting Wasserstein ball is larger than necessary, i.e., \(\mathbb {P}\notin \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\) with probability \(\ll \beta \).

  • Even if \(\mathbb {P}\notin \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\), the optimal value \(\widehat{J}_N\) of (5) may still provide an upper bound on \(J^\star \).

  • The formula for \(\varepsilon _N(\beta )\) in (8) is independent of the training data. Allowing for random Wasserstein radii, however, results in a more efficient use of the available training data.

While Theorems 3.5 and 3.6 provide strong theoretical justification for using Wasserstein ambiguity sets, in practice, it is prudent to calibrate the Wasserstein radius via bootstrapping or cross-validation instead of using the conservative a priori bound \(\varepsilon _N(\beta )\); see Sect. 7.2 for further details. A similar approach has been advocated in [7] to determine the sizes of ambiguity sets that are constructed via goodness-of-fit tests.
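
Purely as an illustration of this calibration idea (the actual procedure is described in Sect. 7.2), the following sketch performs k-fold cross-validation over a grid of candidate radii; `solve_dro` and `out_of_sample_cost` are hypothetical placeholders for a solver of (5) and an empirical cost evaluator, and are not objects defined in this paper.

```python
import numpy as np

def calibrate_radius(xi_hat, radii, solve_dro, out_of_sample_cost, n_folds=5, seed=0):
    """k-fold cross-validation over candidate Wasserstein radii (illustrative sketch).

    solve_dro(train_samples, eps) is assumed to return a data-driven decision, and
    out_of_sample_cost(x, validation_samples) its empirical cost on held-out data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(xi_hat)), n_folds)
    avg_costs = []
    for eps in radii:
        costs = []
        for j in range(n_folds):
            train_idx = np.concatenate([folds[l] for l in range(n_folds) if l != j])
            x_hat = solve_dro(xi_hat[train_idx], eps)
            costs.append(out_of_sample_cost(x_hat, xi_hat[folds[j]]))
        avg_costs.append(np.mean(costs))
    return radii[int(np.argmin(avg_costs))]
```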

So far we have seen that the Wasserstein metric allows us to construct ambiguity sets with favorable asymptotic and finite sample guarantees. In the remainder of the paper we will further demonstrate that the distributionally robust optimization problem (5) with a Wasserstein ambiguity set (6) is not significantly harder to solve than the corresponding SAA problem (4).

4 Solving worst-case expectation problems

We now demonstrate that the inner worst-case expectation problem in (5) over the Wasserstein ambiguity set (6) can be reformulated as a finite convex program for many loss functions \(h(x,\xi )\) of practical interest. For ease of notation, throughout this section we suppress the dependence on the decision variable x. Thus, we examine a generic worst-case expectation problem

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ] \end{aligned}$$
(10)

involving a decision-independent loss function \(\ell (\xi ) {:=}\max _{k \le K}\ell _k(\xi )\), which is defined as the pointwise maximum of more elementary measurable functions \(\ell _k:\mathbb {R}^m \rightarrow \overline{\mathbb {R}}\), \(k\le K\). The focus on loss functions representable as pointwise maxima is non-restrictive unless we impose some structure on the functions \(\ell _k\). Many tractability results in the remainder of this paper are predicated on the following convexity assumption.

Assumption 4.1

(Convexity) The uncertainty set \(\Xi \subseteq \mathbb {R}^m\) is convex and closed, and the negative constituent functions \(-\ell _k\) are proper, convex, and lower semicontinuous for all \(k\le K\). Moreover, we assume that \(\ell _k\) is not identically \(-\infty \) on \(\Xi \) for all \(k\le K\).

Assumption 4.1 essentially stipulates that \(\ell (\xi )\) can be written as a maximum of concave functions. As we will showcase in Sect. 5, this mild restriction does not sacrifice much modeling power. Moreover, generalizations of this setting will be discussed in Sect. 6. We proceed as follows. Sect. 4.1 addresses the reduction of (10) to a finite convex program, while Sect. 4.2 describes a technique for constructing worst-case distributions.
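
For concreteness (an example added here for illustration), a simple newsvendor-type loss with a fixed order quantity x fits this setting with K = 2 affine constituents,

$$\begin{aligned} \ell (\xi ) = \max \big \{ \ell _1(\xi ), \ell _2(\xi ) \big \}, \qquad \ell _1(\xi ) = c_o(x - \xi ), \quad \ell _2(\xi ) = c_u(\xi - x), \end{aligned}$$

since both \(-\ell _1\) and \(-\ell _2\) are affine and hence proper, convex, and lower semicontinuous, so Assumption 4.1 holds for any closed convex \(\Xi \subseteq \mathbb {R}\).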

4.1 Reduction to a finite convex program

The worst-case expectation problem (10) constitutes an infinite-dimensional optimization problem over probability distributions and thus appears to be intractable. However, we will now demonstrate that (10) can be re-expressed as a finite-dimensional convex program by leveraging tools from robust optimization.

Theorem 4.2

(Convex reduction) If the convexity Assumption 4.1 holds, then for any \(\varepsilon \ge 0 \) the worst-case expectation (10) equals the optimal value of the finite convex program

$$\begin{aligned} \left\{ \begin{array}{llll} \inf \limits _{\lambda ,s_i, z_{ik},\nu _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i &{}&{} \\ \text {s.t.}&{} [-\ell _k]^*(z_{ik} - \nu _{ik}) + \sigma _{\Xi }(\nu _{ik}) - \big \langle z_{ik}, \widehat{\xi }_i \big \rangle \le s_i &{} \forall i \le N, &{} \quad \forall k \le K \\ &{} \Vert z_{ik}\Vert _* \le \lambda &{}\forall i \le N, &{} \quad \forall k \le K. \end{array} \right. \end{aligned}$$
(11)

Recall that \([-\ell _k]^*(z_{ik} - \nu _{ik})\) denotes the conjugate of \(-\ell _k\) evaluated at \(z_{ik} - \nu _{ik}\) and \(\Vert z_{ik}\Vert _*\) the dual norm of \(z_{ik}\). Moreover, \(\chi _\Xi \) represents the characteristic function of \(\Xi \) and \(\sigma _\Xi \) its conjugate, that is, the support function of \(\Xi \).
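
To make the reformulation concrete, the following sketch (an illustration under additional assumptions, not part of the theorem) implements (11) with the cvxpy modeling library for the special case where every constituent \(\ell _k(\xi ) = \big \langle a_k, \xi \big \rangle + b_k\) is affine, \(\Xi = \{\xi : C\xi \le d\}\) is a nonempty polytope, and the transportation cost is the Euclidean norm. For affine \(\ell _k\) the conjugate \([-\ell _k]^*\) is finite only at \(-a_k\), which forces \(\nu _{ik} = z_{ik} + a_k\), and the support function of the polytope is expressed through dual multipliers \(\gamma _{ik}\ge 0\) with \(C^\intercal \gamma _{ik} = z_{ik} + a_k\).

```python
import cvxpy as cp
import numpy as np

def worst_case_expectation(xi_hat, a, b, C, d, eps):
    """Program (11) specialized to l_k(xi) = <a_k, xi> + b_k and Xi = {xi : C xi <= d},
    with the Euclidean norm as transportation cost (its dual norm is again Euclidean).
    Returns the optimal value, which bounds (10) from above and is exact under Assumption 4.1."""
    N, m = xi_hat.shape
    K = len(b)
    lam = cp.Variable(nonneg=True)
    s = cp.Variable(N)
    constraints = []
    for i in range(N):
        for k in range(K):
            gamma = cp.Variable(C.shape[0], nonneg=True)  # multipliers encoding sigma_Xi
            z = C.T @ gamma - a[k]                        # z_ik implied by nu_ik = z_ik + a_k
            constraints += [
                b[k] + a[k] @ xi_hat[i] + gamma @ (d - C @ xi_hat[i]) <= s[i],
                cp.norm(z, 2) <= lam,
            ]
    problem = cp.Problem(cp.Minimize(lam * eps + cp.sum(s) / N), constraints)
    return problem.solve()
```

Setting eps to zero recovers the sample average \(\frac{1}{N}\sum _{i=1}^N \ell (\widehat{\xi }_i)\), consistent with the discussion of the case \(\varepsilon =0\) in the proof below; with the 1-norm or \(\infty \)-norm as transportation cost, the dual-norm constraint becomes polyhedral and the whole program reduces to a linear program, as announced in the introduction.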

Proof of Theorem 4.2

By using Definition 3.1 we can re-express the worst-case expectation (10) as

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ]&= \left\{ \begin{array}{cl} \sup \limits _{\Pi ,\mathbb {Q}} &{} \int _{\Xi } \ell (\xi ) \, \mathbb {Q}(\mathrm {d}\xi ) \\ \text {s.t.}&{} \int _{\Xi ^2} \Vert \xi -\xi '\Vert \, \Pi (\mathrm {d}\xi , \mathrm {d}\xi ') \le \varepsilon \\[1ex] &{} \left\{ \begin{array}{l} \Pi \text{ is } \text{ a } \text{ joint } \text{ distribution } \text{ of } \xi \text{ and } \xi '\\ \text{ with } \text{ marginals } \mathbb {Q} \text{ and } \widehat{\mathbb {P}}_N\text{, } \text{ respectively } \end{array}\right. \end{array} \right. \\&= \left\{ \begin{array}{cl} \sup \limits _{\mathbb {Q}_i \in \mathcal {M}(\Xi )} &{} {1 \over N}\sum \limits _{i = 1}^{N} \int _{\Xi } \ell (\xi ) \, \mathbb {Q}_i(\mathrm {d}\xi ) \\ \text {s.t.}&{} {1 \over N}\sum \limits _{i = 1}^{N} \int _{\Xi } \Vert \xi -\widehat{\xi }_i\Vert \, \mathbb {Q}_i(\mathrm {d}\xi ) \le \varepsilon . \end{array} \right. \end{aligned}$$

The second equality follows from the law of total probability, which asserts that any joint probability distribution \(\Pi \) of \(\xi \) and \(\xi '\) can be constructed from the marginal distribution \(\widehat{\mathbb {P}}_N\) of \(\xi '\) and the conditional distributions \(\mathbb {Q}_i\) of \(\xi \) given \(\xi '=\widehat{\xi }_i\), \(i\le N\), that is, we may write \(\Pi = {1 \over N}\sum _{i = 1}^{N} \delta _{\widehat{\xi }_i}\otimes \mathbb {Q}_i\). The resulting optimization problem represents a generalized moment problem in the distributions \(\mathbb {Q}_i\), \(i\le N\). Using a standard duality argument, we obtain

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ]= & {} \sup \limits _{\mathbb {Q}_i \in \mathcal {M}(\Xi )} \inf \limits _{\lambda \ge 0} {1 \over N}\sum \limits _{i = 1}^{N} \int _{\Xi } \ell (\xi )\, \mathbb {Q}_i(\mathrm {d}\xi )\nonumber \\&\qquad \qquad \qquad \quad + \lambda \Big ( \varepsilon - {1 \over N}\sum \limits _{i = 1}^{N} \int _{\Xi } \Vert \xi -\widehat{\xi }_i\Vert \, \mathbb {Q}_i(\mathrm {d}\xi ) \Big ) \nonumber \\\le & {} \inf \limits _{\lambda \ge 0} \sup \limits _{\mathbb {Q}_i \in \mathcal {M}(\Xi )} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \int _{\Xi } \left( \ell (\xi ) - \lambda \Vert \xi -\widehat{\xi }_i\Vert \right) \mathbb {Q}_i(\mathrm {d}\xi )\nonumber \\ \end{aligned}$$
(12a)
$$\begin{aligned}= & {} \inf \limits _{\lambda \ge 0} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \sup _{\xi \in \Xi } \left( \ell (\xi ) - \lambda \Vert \xi - \widehat{\xi }_i\Vert \right) , \end{aligned}$$
(12b)

where (12a) follows from the max-min inequality, and (12b) follows from the fact that \(\mathcal {M}(\Xi )\) contains all the Dirac distributions supported on \(\Xi \). Introducing epigraphical auxiliary variables \(s_i\), \(i\le N\), allows us to reformulate (12b) as

$$\begin{aligned}&\left\{ \begin{array}{clll} \inf \limits _{\lambda , s_i} &{} \lambda \varepsilon + {1 \over N} \sum \limits _{i = 1}^{N} s_i &{} &{} \\ \text {s.t.}&{} \sup \limits _{\xi \in \Xi } \bigg (\ell (\xi ) - \lambda \Vert \xi - \widehat{\xi }_i\Vert \bigg )\le s_i &{} \quad \forall i\le N &{} \\ &{} \lambda \ge 0 &{}~&{} \\ \end{array}\right. \end{aligned}$$
(12c)
$$\begin{aligned}&\quad = \left\{ \begin{array}{clll} \inf \limits _{\lambda , s_i} &{} \lambda \varepsilon + {1 \over N} \sum \limits _{i = 1}^{N} s_i &{}~&{} \\ \text {s.t.}&{} \sup \limits _{\xi \in \Xi } \bigg (\ell _k(\xi ) - \max \limits _{\Vert z_{ik}\Vert _* \le \lambda } \big \langle z_{ik}, \xi - \widehat{\xi }_i \big \rangle \bigg )\le s_i &{}\quad \forall i\le N,&{} \;\forall k\le K \\ &{} \lambda \ge 0 &{}~&{} \\ \end{array}\right. \qquad \end{aligned}$$
(12d)
$$\begin{aligned}&\quad \le \left\{ \begin{array}{clll} \inf \limits _{\lambda , s_i} &{} \lambda \varepsilon + {1 \over N} \sum \limits _{i = 1}^{N} s_i &{}~&{}\\ \text {s.t.}&{} \min \limits _{\Vert z_{ik}\Vert _* \le \lambda } \sup \limits _{\xi \in \Xi } \bigg (\ell _k(\xi ) - \big \langle z_{ik}, \xi - \widehat{\xi }_i \big \rangle \bigg )\le s_i &{}\quad \forall i\le N,&{} \;\forall k\le K \\ &{} \lambda \ge 0. &{}~&{} \\ \end{array}\right. \qquad \end{aligned}$$
(12e)

Equality (12d) exploits the definition of the dual norm and the decomposability of \(\ell (\xi )\) into its constituents \(\ell _k(\xi )\), \(k\le K\). Interchanging the maximization over \(z_{ik}\) with the minus sign (thereby converting the maximization to a minimization) and then with the maximization over \(\xi \) leads to a restriction of the feasible set of (12d). The resulting upper bound (12e) can be re-expressed as

$$\begin{aligned}&\left\{ \begin{array}{clll} \inf \limits _{\lambda , s_i,z_{ik}} &{} \lambda \varepsilon + {1 \over N} \sum \limits _{i = 1}^{N} s_i &{}&{}\\ \text {s.t.}&{} \sup \limits _{\xi \in \Xi } \Big (\ell _k(\xi ) - \big \langle z_{ik}, \xi \big \rangle \Big ) + \big \langle z_{ik}, \widehat{\xi }_i \big \rangle \le s_i &{}\quad \forall i\le N,&{} \;\forall k\le K \\ &{} \Vert z_{ik}\Vert _* \le \lambda &{} \quad \forall i\le N, &{} \;\forall k\le K \end{array}\right. \nonumber \\ = ~&\left\{ \begin{array}{clll} \inf \limits _{\lambda , s_i,z_{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i &{}&{}\\ \text {s.t.}&{} [-\ell _k + \chi _{\Xi }]^*(z_{ik}) - \big \langle z_{ik}, \widehat{\xi }_i \big \rangle \le s_i &{}\quad \forall i\le N,&{} \;\forall k\le K \\ &{} \Vert z_{ik}\Vert _* \le \lambda &{}\quad \forall i\le N, &{} \;\forall k\le K, \end{array} \right. \end{aligned}$$
(12f)

where (12f) follows from the definition of conjugacy, our conventions of extended arithmetic, and the substitution of \(z_{ik}\) with \(-z_{ik}\). Note that (12f) is already a finite convex program.

Next, we show that Assumption 4.1 reduces the inequalities (12a) and (12e) to equalities. Under Assumption 4.1, the inequality (12a) is in fact an equality for any \(\varepsilon > 0\) by virtue of an extended version of a well-known strong duality result for moment problems [44, Proposition 3.4]. One can show that (12a) continues to hold as an equality even for \(\varepsilon = 0\), in which case the Wasserstein ambiguity set (6) reduces to the singleton \(\{\widehat{\mathbb {P}}_N\}\), while (10) reduces to the sample average \(\frac{1}{N}\sum _{i=1}^N \ell (\widehat{\xi }_i)\). Indeed, for \(\varepsilon =0\) the variable \(\lambda \) in (12b) can be increased indefinitely at no penalty. As \(\ell (\xi )\) constitutes a pointwise maximum of upper semicontinuous concave functions, an elementary but tedious argument shows that (12b) converges to the sample average \(\frac{1}{N}\sum _{i=1}^N \ell (\widehat{\xi }_i)\) as \(\lambda \) tends to infinity.

The inequality (12e) also reduces to an equality under Assumption 4.1 thanks to the classical minimax theorem [4, Proposition 5.5.4], which applies because the set \(\{z_{ik} \in \mathbb {R}^m : \Vert z_{ik}\Vert _* \le \lambda \}\) is compact for any finite \(\lambda \ge 0\). Thus, the optimal values of (10) and (12f) coincide.

Assumption 4.1 further implies that the function \(-\ell _k+\chi _{\Xi }\) is proper, convex and lower semicontinuous. Properness holds because \(\ell _k\) is not identically \(-\infty \) on \(\Xi \). By Rockafellar and Wets [42, Theorem 11.23(a), p. 493], its conjugate essentially coincides with the epi-addition (also known as inf-convolution) of the conjugates of the functions \(-\ell _k\) and \(\chi _{\Xi }\), where the conjugate of \(\chi _{\Xi }\) is the support function \(\sigma _{\Xi }\). Thus,

$$\begin{aligned} {[-\ell _k + \chi _{\Xi }]^*(z_{ik})}&= \inf _{\nu _{ik}} \Big ([-\ell _k]^*(z_{ik} - \nu _{ik}) + [\chi _{\Xi }]^*(\nu _{ik}) \Big ) \\&= {{\mathrm{cl}}}\Big [\inf _{\nu _{ik}} \Big ([-\ell _k]^*(z_{ik} - \nu _{ik}) + \sigma _{\Xi }(\nu _{ik}) \Big )\Big ], \end{aligned}$$

where \({{\mathrm{cl}}}[\cdot ]\) denotes the closure operator that maps any function to its largest lower semicontinuous minorant. Dropping the closure operator can only tighten the respective constraint; however, as \(z_{ik}\), \(s_i\) and \(\lambda \) are free decision variables and \({{\mathrm{cl}}}[f]\) is approximated arbitrarily well by f at nearby points, this tightening does not affect the infimum. We may thus conclude that (12f) is indeed equivalent to (11) under Assumption 4.1. \(\square \)

Note that the semi-infinite inequality in (12c) generalizes the nonlinear uncertain constraints studied in [1] because it involves an additional norm term and because the loss function \(\ell (\xi )\) is not necessarily concave under Assumption 4.1. As in [1], however, the semi-infinite constraint admits a robust counterpart that involves the conjugate of the loss function and the support function of the uncertainty set.

From the proof of Theorem 4.2 it is immediately clear that the worst-case expectation (10) is conservatively approximated by the optimal value of the finite convex program (12f) even if Assumption 4.1 fails to hold. In this case the sum \(-\ell _k + \chi _{\Xi }\) in (12f) must be evaluated under our conventions of extended arithmetic, whereby \(\infty - \infty = \infty \). These observations are formalized in the following corollary.

Corollary 4.3

[Approximate convex reduction] For any \(\varepsilon \ge 0\), the worst-case expectation (10) is less than or equal to the optimal value of the finite convex program (12f).

4.2 Extremal distributions

Stress test experiments are instrumental in assessing the quality of candidate decisions in stochastic optimization. Meaningful stress tests require a good understanding of the extremal distributions from within the Wasserstein ball that achieve the worst-case expectation (10) for various loss functions. We now show that such extremal distributions can be constructed systematically from the solution of a convex program akin to (11).

Theorem 4.4

(Worst-case distributions) If Assumption 4.1 holds, then the worst-case expectation (10) coincides with the optimal value of the finite convex program

$$\begin{aligned} \left\{ \begin{array}{clll} \mathop {\sup }\limits _{\alpha _{ik}, q_{ik}} &{} {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K} \alpha _{ik}\ell _k \big ( \widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\big ) \\ \text {s.t.}&{} {1 \over N}\sum \limits _{i =1}^{N} \sum \limits _{k =1}^{K} \Vert q_{ik}\Vert \le \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{ik} = 1 &{}\forall i \le N\\ &{} \alpha _{ik} \ge 0 &{}\forall i \le N, \quad \forall k\le K \\ &{} \widehat{\xi }_i - {q_{ik} \over \alpha _{ik}} \in \Xi &{}\forall i \le N, \quad \forall k\le K \end{array} \right. \end{aligned}$$
(13)

irrespective of \(\varepsilon \ge 0\). Let \(\big \{\alpha _{ik}(r), q_{ik}(r)\big \}_{r \in \mathbb {N}}\) be a sequence of feasible decisions whose objective values converge to the supremum of (13). Then, the discrete probability distributions

$$\begin{aligned} \mathbb {Q}_r {:=}{1 \over N}\sum _{i = 1}^{N}\sum _{k = 1}^{K} \alpha _{ik}(r)\delta _{\xi _{ik}(r)} \quad \text{ with }\quad \xi _{ik}(r) \,{:=}\,\widehat{\xi }_i - {q_{ik}(r) \over \alpha _{ik}(r)} \end{aligned}$$

belong to the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) and attain the supremum of (10) asymptotically, i.e.,

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ] = \lim \limits _{r \rightarrow \infty } \mathbb {E}^{\mathbb {Q}_r} \big [ \ell (\xi ) \big ] = \lim \limits _{r \rightarrow \infty } {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K} \alpha _{ik}(r)\ell \big (\xi _{ik}(r)\big ). \end{aligned}$$

We highlight that all fractions in (13) must again be evaluated under our conventions of extended arithmetic. Specifically, if \(\alpha _{ik}=0\) and \(q_{ik}\ne 0\), then \(q_{ik}/\alpha _{ik}\) has at least one component equal to \(+\infty \) or \(-\infty \), which implies that \(\widehat{\xi }_i - q_{ik}/\alpha _{ik}\notin \Xi \). In contrast, if \(\alpha _{ik}=0\) and \(q_{ik}= 0\), then \(\widehat{\xi }_i - q_{ik} / \alpha _{ik}=\widehat{\xi }_i \in \Xi \). Moreover, the ik-th term in the objective function of (13) evaluates to 0 whenever \(\alpha _{ik} =0\) regardless of \(q_{ik}\).

The proof of Theorem 4.4 is based on the following technical lemma.

Lemma 4.5

Define \(F: \mathbb {R}^m \times \mathbb {R}_{+} \rightarrow \overline{\mathbb {R}}\) through \(F(q,\alpha ) = \inf _{z \in \mathbb {R}^m} \big \langle z, q - \alpha {\widehat{\xi }} \big \rangle + \alpha f^*(z)\) for some proper, convex, and lower semicontinuous function \(f:\mathbb {R}^m\rightarrow \overline{\mathbb {R}}\) and reference point \({\widehat{\xi }}\in \mathbb {R}^m\). Then, F coincides with the (extended) perspective function of the mapping \(q \mapsto -f({\widehat{\xi }} - q)\), that is,

$$\begin{aligned} F(q, \alpha ) = \left\{ \begin{array}{cc} - \alpha f \big ({\widehat{\xi }} - q /\alpha \big )&{} \quad \text {if }\alpha > 0, \\ -\chi _{\{0\}}(q) &{} \quad \text {if }\alpha = 0. \end{array} \right. \end{aligned}$$

Proof

By construction, we have \(F(q,0) = \inf _{z \in \mathbb {R}^m} \big \langle z, q \big \rangle = - \chi _{\{0\}}(q)\). For \(\alpha > 0\), on the other hand, the definition of conjugacy implies that

$$\begin{aligned} F(q,\alpha ) = -[\alpha f^*]^*(\alpha {\widehat{\xi }} - q) = - \alpha [f^*]^* \big ({\widehat{\xi }} - {q / \alpha }\big ). \end{aligned}$$

The claim then follows because \([f^*]^* = f\) for any proper, convex, and lower semicontinuous function f [4, Proposition 1.6.1(c)]. Additional information on perspective functions can be found in [12, Section 2.2.3, p. 39]. \(\square \)
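As a quick illustration of Lemma 4.5 (not needed in the subsequent proofs), consider the quadratic function \(f(\xi )={1 \over 2}\Vert \xi \Vert _2^2\) with conjugate \(f^*(z)={1 \over 2}\Vert z\Vert _2^2\). For \(\alpha >0\) the infimum defining F is attained at \(z=-(q-\alpha {\widehat{\xi }})/\alpha \), whence

$$\begin{aligned} F(q,\alpha ) = \inf _{z \in \mathbb {R}^m} \big \langle z, q - \alpha {\widehat{\xi }} \big \rangle + {\alpha \over 2}\Vert z\Vert _2^2 = -{1 \over 2\alpha }\big \Vert q - \alpha {\widehat{\xi }}\big \Vert _2^2 = -\alpha f\big ({\widehat{\xi }} - q/\alpha \big ), \end{aligned}$$

while for \(\alpha =0\) the quadratic term vanishes and \(F(q,0)=\inf _{z}\big \langle z, q \big \rangle =-\chi _{\{0\}}(q)\), in accordance with the perspective formula.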

Proof of Theorem 4.4

By Theorem 4.2, which applies under Assumption 4.1, the worst-case expectation (10) coincides with the optimal value of the convex program (11). From the proof of Theorem 4.2 we know that (11) is equivalent to (12f). The Lagrangian dual of (12f) is given by

$$\begin{aligned} \left\{ \begin{array}{clll} \mathop {\sup }\limits _{\beta _{ik}, \alpha _{ik}} &{}\mathop {\inf }\limits _{\lambda , s_i, z_{ik}} \lambda \varepsilon + \sum \limits _{i = 1}^{N} \Big [ {s_i \over N} + &{} \sum \limits _{k = 1}^{K} \big [\beta _{ik} \big (\Vert z_{ik}\Vert _* -\lambda \big ) + \alpha _{ik}\big ( [-\ell _k + \chi _{\Xi }]^*(z_{ik}) - \big \langle z_{ik}, \widehat{\xi }_i \big \rangle - s_i\big )\big ]\Big ] \\ \text {s.t.}&{} \alpha _{ik} \ge 0&{} \forall i \le N, \quad \forall k\le K \\ &{} \beta _{ik} \ge 0 &{} \forall i \le N, \quad \forall k\le K, \end{array} \right. \end{aligned}$$

where the products of dual variables and constraint functions in the objective are evaluated under the standard convention \(0 \cdot \infty = 0\). Strong duality holds since the function \([-\ell _k+\chi _{\Xi }]^*\) is proper, convex, and lower semicontinuous under Assumption 4.1 and because this function appears in a constraint of (12f) whose right-hand side is a free decision variable. By explicitly carrying out the minimization over \(\lambda \) and \(s_i\), one can show that the above dual problem is equivalent to

$$\begin{aligned} \left\{ \begin{array}{clll} \mathop {\sup }\limits _{\beta _{ik}, \alpha _{ik}} &{} \mathop {\inf }\limits _{z_{ik}} ~ \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K} \beta _{ik} \Vert z_{ik}\Vert _* + &{}\alpha _{ik}[-\ell _k+ \chi _{\Xi }]^*(z_{ik}) - \alpha _{ik}\big \langle z_{ik}, \widehat{\xi }_i \big \rangle \\ \text {s.t.}&{} \sum \limits _{i =1}^{N} \sum \limits _{k = 1}^{K} \beta _{ik} = \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{ik} = {1 \over N} &{}\forall i \le N\\ &{} \alpha _{ik} \ge 0 &{} \forall i \le N, \quad \forall k\le K \\ &{} \beta _{ik} \ge 0 &{} \forall i \le N, \quad \forall k\le K. \end{array} \right. \end{aligned}$$
(14a)

By using the definition of the dual norm, (14a) can be re-expressed as

$$\begin{aligned}&\left\{ \begin{array}{clll} \mathop {\sup }\limits _{\beta _{ik}, \alpha _{ik}}&{} \mathop {\inf }\limits _{z_{ik}} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K} \mathop {\max }\limits _{\Vert q_{ik}\Vert \le \beta _{ik}} \big \langle z_{ik}, q_{ik} \big \rangle + &{} \alpha _{ik}[-\ell _k + \chi _{\Xi }]^*(z_{ik}) - \alpha _{ik} \big \langle z_{ik}, \widehat{\xi }_i \big \rangle \\ \text {s.t.}&{} \sum \limits _{i =1}^{N} \sum \limits _{k =1}^{K}\beta _{ik} = \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{ik} = {1 \over N} &{}\forall i \le N\\ &{} \alpha _{ik} \ge 0 &{} \forall i \le N, \quad \forall k\le K \\ &{} \beta _{ik} \ge 0 &{} \forall i \le N, \quad \forall k\le K \end{array} \right. \end{aligned}$$
(14b)
$$\begin{aligned}&= \left\{ \begin{array}{clll} \mathop {\sup }\limits _{\beta _{ik},\alpha _{ik}}&{} \mathop {\max }\limits _{\Vert q_{ik}\Vert \le \beta _{ik}} \mathop {\inf }\limits _{z_{ik}} \sum \limits _{i = 1}^{N} \sum \limits _{k =1}^{K} \big \langle z_{ik}, q_{ik} \big \rangle + &{}\alpha _{ik}[-\ell _k+ \chi _{\Xi }]^*(z_{ik}) - \alpha _{ik}\big \langle z_{ik}, \widehat{\xi }_i \big \rangle \\ \text {s.t.}&{} \sum \limits _{i =1}^{N} \sum \limits _{k =1}^{K} \beta _{ik} = \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{ik} = {1 \over N} &{}\forall i \le N\\ &{} \alpha _{ik} \ge 0 &{} \forall i \le N, \quad \forall k\le K \\ &{} \beta _{ik} \ge 0 &{} \forall i \le N, \quad \forall k\le K, \end{array} \right. \end{aligned}$$
(14c)

where (14c) follows from the classical minimax theorem and the fact that the \(q_{ik}\) variables range over a non-empty and compact feasible set for any finite \(\varepsilon \); see [4, Proposition 5.5.4]. Eliminating the \(\beta _{ik}\) variables and using Lemma 4.5 allows us to reformulate (14c) as

$$\begin{aligned}&\left\{ \begin{array}{clll} \mathop {\sup }\limits _{\alpha _{ik}, q_{ik}} &{} \mathop {\inf }\limits _{z_{ik}}~ \sum \limits _{i = 1}^{N} \sum \limits _{k =1}^{K} \big \langle z_{ik}, q_{ik} - \alpha _{ik}\widehat{\xi }_i \big \rangle + &{}\alpha _{ik}[-\ell _k+ \chi _{\Xi }]^*(z_{ik}) \\ \text {s.t.}&{} \sum \limits _{i =1}^{N} \sum \limits _{k =1}^{K} \Vert q_{ik}\Vert \le \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{ik} = {1 \over N} &{}\forall i \le N\\ &{} \alpha _{ik} \ge 0 &{}\forall i \le N,\quad \forall k\le K \end{array} \right. \end{aligned}$$
(14d)
$$\begin{aligned} =~&\left\{ \begin{array}{clll} \mathop {\sup }\limits _{\alpha _{ik}, q_{ik}} &{} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K} - \alpha _{ik} \Big (-\ell _k\big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\big ) + &{} \chi _{\Xi } \big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\big ) \Big ) \mathbbm {1}_{\{\alpha _{ik}>0\}} - \chi _{\{0\}}(q_{ik})\mathbbm {1}_{\{\alpha _{ik} = 0\}} \\ \text {s.t.}&{} \sum \limits _{i =1}^{N} \sum \limits _{k =1}^{K} \Vert q_{ik}\Vert \le \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{ik} = {1 \over N} &{}\forall i \le N\\ &{} \alpha _{ik} \ge 0 &{}\forall i \le N, \quad \forall k\le K. \end{array} \right. \end{aligned}$$
(14e)

Our conventions of extended arithmetic imply that the ik-th term in the objective function of problem (14e) simplifies to

$$\begin{aligned} \alpha _{ik} \ell _k\Big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\Big ) - \chi _{\Xi }\Big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\Big ). \end{aligned}$$
(14f)

Indeed, for \(\alpha _{ik}>0\), this identity trivially holds. For \(\alpha _{ik}=0\), on the other hand, the ik-th objective term in (14e) reduces to \(- \chi _{\{0\}}(q_{ik})\). Moreover, the first term in (14f) vanishes whenever \(\alpha _{ik} = 0\) regardless of \(q_{ik}\), and the second term in (14f) evaluates to 0 if \(q_{ik}=0\) (as \(0/0=0\) and \(\widehat{\xi }_i \in \Xi \)) and to \(-\infty \) if \(q_{ik}\ne 0\) (as \(q_{ik}/0\) has at least one infinite component, implying that \(\widehat{\xi }_i-q_{ik}/0\notin \Xi \)). Therefore, (14f) also reduces to \(- \chi _{\{0\}}(q_{ik})\) when \(\alpha _{ik}=0\). This proves that the ik-th objective term in (14e) coincides with (14f). Substituting (14f) into (14e) and re-expressing \(- \chi _{\Xi }\big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\big )\) in terms of an explicit hard constraint yields

$$\begin{aligned} \left\{ \begin{array}{clll} \mathop {\sup }\limits _{\alpha _{ik}, q_{ik}} &{} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K} \alpha _{ik} \ell _k\big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\big ) \\ \text {s.t.}&{} \sum \limits _{i =1}^{N} \sum \limits _{k =1}^{K} \Vert q_{ik}\Vert \le \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{ik} = {1 \over N} &{}\forall i \le N\\ &{} \alpha _{ik} \ge 0 &{}\forall i \le N, \quad \forall k\le K \\ &{} \widehat{\xi }_i - {q_{ik} \over \alpha _{ik}} \in \Xi &{}\forall i \le N, \quad \forall k\le K. \end{array}\right. \end{aligned}$$
(14g)

Finally, replacing \(\big \{\alpha _{ik}, q_{ik}\big \}\) with \({1 \over N}\big \{\alpha _{ik}, q_{ik}\big \}\) shows that (14g) is equivalent to (13). This completes the first part of the proof.

As for the second claim, let \(\{\alpha _{ik}(r), q_{ik}(r)\}_{r \in \mathbb {N}}\) be a sequence of feasible solutions whose objective values converge to the supremum of (13), and set \(\xi _{ik}(r)\,{:=}\,\widehat{\xi }_i - {q_{ik}(r) \over \alpha _{ik}(r)}\in \Xi \). Then, the discrete distribution

$$\begin{aligned} \Pi _r {:=}{1 \over N}\sum _{i = 1}^{N}\sum _{k = 1}^{K} \alpha _{ik}(r)\delta _{\big (\xi _{ik}(r), \widehat{\xi }_i\big )} \end{aligned}$$

has the distribution \(\mathbb {Q}_r\) defined in the theorem statement and the empirical distribution \(\widehat{\mathbb {P}}_N\) as marginals. By the definition of the Wasserstein metric, \(\Pi _r\) represents a feasible mass transportation plan that provides an upper bound on the distance between \(\widehat{\mathbb {P}}_N\) and \(\mathbb {Q}_r\); see Definition 3.1. Thus, we have

$$\begin{aligned} d_{\mathrm W}\big (\mathbb {Q}_r,\widehat{\mathbb {P}}_N\big )&\le \int _{\Xi ^2} \Vert \xi - \xi '\Vert \, \Pi _r(\mathrm {d}\xi , \mathrm {d}\xi ') = {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k= 1}^{K} \alpha _{ik}(r) \big \Vert \xi _{ik}(r) - \widehat{\xi }_i \big \Vert \\&= {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k= 1}^{K} \big \Vert q_{ik}(r)\big \Vert \le \varepsilon , \end{aligned}$$

where the last inequality follows readily from the feasibility of \(q_{ik}(r)\) in (13). We conclude that

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)}\mathbb {E}^{\mathbb {Q}}\big [\ell (\xi )\big ]&\ge \limsup _{r \rightarrow \infty } \mathbb {E}^{\mathbb {Q}_r}\big [\ell (\xi )\big ] = \limsup _{r \rightarrow \infty } {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K}\alpha _{ik}(r) \ell \big (\xi _{ik}(r) \big )\\&\ge \limsup _{r \rightarrow \infty } {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K}\alpha _{ik}(r) \ell _k\big (\xi _{ik}(r) \big ) = \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)}\mathbb {E}^{\mathbb {Q}}\big [\ell (\xi )\big ], \end{aligned}$$

where the first inequality holds as \(\mathbb {Q}_r \in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) for all \(r \in \mathbb {N}\), and the second inequality uses the trivial estimate \(\ell \ge \ell _k\) for all \(k\le K\). The last equality follows from the construction of \(\alpha _{ik}(r)\) and \(\xi _{ik}(r)\) and the fact that the optimal value of (13) coincides with the worst-case expectation (10). \(\square \)

In the rest of this section we discuss some notable properties of the convex program (13).

In the ambiguity-free limit, that is, when the radius of the Wasserstein ball is set to zero, the optimal value of the convex program (13) reduces to the expected loss under the empirical distribution. Indeed, for \(\varepsilon = 0\) all \(q_{ik}\) variables are forced to zero, whence the objective function of (13) reduces to \({1 \over N}\sum _{i=1}^N\sum _{k=1}^K \alpha _{ik}\,\ell _k(\widehat{\xi }_i)\). Maximizing over the weights \(\alpha _{ik}\ge 0\) with \(\sum _{k=1}^K \alpha _{ik}=1\) then assigns all weight to a largest constituent function, and thus the optimal value of (13) amounts to \({1 \over N}\sum _{i=1}^N \max _{k\le K}\ell _k(\widehat{\xi }_i)= \mathbb {E}^{\widehat{\mathbb {P}}_N}[\ell (\xi )]\).

We further emphasize that it is not possible to guarantee the existence of a worst-case distribution that attains the supremum in (10). In general, as shown in Theorem 4.4, we can only construct a sequence of distributions that attains the supremum asymptotically. The following example discusses an instance of (10) that admits no worst-case distribution.

Example 2

(Non-existence of a worst-case distribution) Assume that \(\Xi = \mathbb {R}\), \(N = 1\), \(\widehat{\xi }_1 = 0\), \(K = 2\), \(\ell _1(\xi ) =0\) and \(\ell _2(\xi ) = \xi - 1\). In this case we have \(\widehat{\mathbb {P}}_N=\delta _{\{0\}}\), and problem (13) reduces to

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\delta _{0})} \mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ] = \left\{ \begin{array}{clll} \mathop {\sup }\limits _{\alpha _{1j}, q_{1j}} &{} - q_{12} - \alpha _{12} \\ \text {s.t.}&{} |q_{11}| + |q_{12}| \le \varepsilon \\ &{} \alpha _{11} + \alpha _{12} = 1 \\ &{} {\alpha _{11} \ge 0, \quad \alpha _{12} \ge 0.} \end{array} \right. \end{aligned}$$

The supremum on the right-hand side amounts to \(\varepsilon \) and is asymptotically attained, for instance, by the sequence \(\alpha _{11}(r) = 1 - {1 \over r}\), \(\alpha _{12}(r) = {1 \over r}\), \(q_{11}(r) = 0\), \(q_{12}(r) = - \varepsilon \) for \(r\in {\mathbb {N}}\). Define

$$\begin{aligned} \mathbb {Q}_r =\alpha _{11}(r)\, \delta _{\xi _{11}(r)}+\alpha _{12}(r)\, \delta _{\xi _{12}(r)}, \end{aligned}$$

with \(\xi _{11}(r) = \widehat{\xi }_1 - {q_{11}(r) \over \alpha _{11}(r)}=0,\) and \(\xi _{12}(r) = \widehat{\xi }_1 - {q_{12}(r) \over \alpha _{12}(r)}=\varepsilon r\). By Theorem 4.4, the two-point distributions \(\mathbb {Q}_r\) reside within the Wasserstein ball of radius \(\varepsilon \) around \(\delta _{0}\) and asymptotically attain the supremum in the worst-case expectation problem. However, this sequence has no weak limit as \(\xi _{12}(r) = \varepsilon r\) tends to infinity, see Fig. 1. In fact, for \(\varepsilon >0\) no single distribution attains the worst-case expectation. Assume for the sake of contradiction that there exists \(\mathbb {Q}^\star \in \mathbb {B}_{\varepsilon }(\delta _{0})\) with \(\mathbb {E}^{\mathbb {Q}^\star }[\ell (\xi )]=\varepsilon \). Then, we find \(\varepsilon = \mathbb {E}^{\mathbb {Q}^\star }[\ell (\xi )]< \mathbb {E}^{\mathbb {Q}^\star }[|\xi |]\le \varepsilon \), where the strict inequality follows from the relation \(\ell (\xi )<|\xi |\) for all \(\xi \ne 0\) and the observation that \(\mathbb {Q}^\star \ne \delta _{0}\) (as \(\mathbb {E}^{\delta _0}[\ell (\xi )]=0<\varepsilon \)), while the second inequality follows from Theorem 3.2. Thus, \(\mathbb {Q}^\star \) does not exist.

Fig. 1 Example of a worst-case expectation problem without a worst-case distribution
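A minimal Python sketch (with a hypothetical radius \(\varepsilon =0.5\), which is not part of the example above) illustrates this behavior numerically: the expected loss under \(\mathbb {Q}_r\) approaches \(\varepsilon \) from below, while the cost of transporting \(\mathbb {Q}_r\) back to \(\delta _0\) equals \(\varepsilon \) for every r, so that \(\mathbb {Q}_r\) never leaves the Wasserstein ball.

```python
# Numerical illustration of Example 2 (hypothetical radius eps = 0.5; base norm |.| on R).
# The two-point distributions Q_r place mass 1 - 1/r at 0 and mass 1/r at eps * r.
import numpy as np

eps = 0.5
ell = lambda xi: np.maximum(0.0, xi - 1.0)   # ell(xi) = max{ell_1(xi), ell_2(xi)} = max{0, xi - 1}

for r in [1, 10, 100, 1000]:
    alpha_11, alpha_12 = 1.0 - 1.0 / r, 1.0 / r
    xi_11, xi_12 = 0.0, eps * r                                      # atoms of Q_r
    expected_loss = alpha_11 * ell(xi_11) + alpha_12 * ell(xi_12)    # tends to eps as r grows
    transport_cost = alpha_11 * abs(xi_11) + alpha_12 * abs(xi_12)   # equals eps for every r
    print(f"r = {r:5d}   E[loss] = {expected_loss:.4f}   transport cost to delta_0 = {transport_cost:.4f}")
```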

The existence of a worst-case distribution can, however, be guaranteed in some special cases.

Corollary 4.6

(Existence of a worst-case distribution) Suppose that Assumption 4.1 holds. If the uncertainty set \(\Xi \) is compact or the loss function is concave (i.e., \(K=1\)), then the sequence \(\{\alpha _{ik}(r), \xi _{ik}(r)\}_{r \in \mathbb {N}}\) constructed in Theorem 4.4 has an accumulation point \(\{\alpha ^\star _{ik}, \xi ^\star _{ik}\}\), and

$$\begin{aligned} \mathbb {Q}^\star {:=}{1 \over N}\sum _{i = 1}^{N}\sum _{k = 1}^{K} \alpha ^\star _{ik}\delta _{\xi ^\star _{ik}} \end{aligned}$$

is a worst-case distribution achieving the supremum in (10).

Proof

If \(\Xi \) is compact, then the sequence \(\{\alpha _{ik}(r), \xi _{ik}(r)\}_{r \in \mathbb {N}}\) has a converging subsequence with limit \(\{\alpha ^\star _{ik},\xi ^\star _{ik}\}\). Similarly, if \(K = 1\), then \(\alpha _{i1} = 1\) for all \(i\le N\), in which case (13) reduces to a convex optimization problem with an upper semicontinuous objective function over a compact feasible set. Hence, its supremum is attained at a point \(\{\alpha ^\star _{ik},\xi ^\star _{ik}\}\). In both cases, Theorem 4.4 guarantees that the distribution \(\mathbb {Q}^\star \) implied by \(\{\alpha ^\star _{ik},\xi ^\star _{ik}\}\) achieves the supremum in (10). \(\square \)

The worst-case distribution of Corollary 4.6 is discrete, and its atoms \(\xi ^\star _{ik}\) reside in the neighborhood of the given data points \(\widehat{\xi }_i\). By the constraints of problem (13), the probability-weighted cumulative distance between the atoms and the respective data points amounts to

$$\begin{aligned} {1 \over N}\sum _{i=1}^N\sum _{k=1}^K \alpha ^\star _{ik}\Vert \xi ^\star _{ik}-\widehat{\xi }_i \Vert = {1 \over N}\sum _{i=1}^N\sum _{k=1}^K \Vert q_{ik}\Vert \le \varepsilon , \end{aligned}$$

which is bounded above by the radius of the Wasserstein ball. The fact that the worst-case distribution \(\mathbb {Q}^\star \) (if it exists) may be supported outside of \(\widehat{\Xi }_N\) is a key feature distinguishing the Wasserstein ball from the ambiguity sets induced by other probability metrics such as the total variation distance or the Kullback–Leibler divergence; see Fig. 2. Thus, the worst-case expectation criterion based on Wasserstein balls advocated in this paper should appeal to decision makers who wish to immunize their optimization problems against perturbations of the data points.

Fig. 2 Representative distributions in balls centered at \(\widehat{\mathbb {P}}_N\) induced by different metrics. (a) Empirical distribution on a training dataset with \(N = 2\) samples. (b) A representative discrete distribution in the total variation or the Kullback–Leibler ball. (c) A representative discrete distribution in the Wasserstein ball

Remark 4.7

(Weak coupling) We highlight that the convex program (13) is amenable to decomposition and parallelization techniques as the decision variables associated with different sample points are only coupled through the norm constraint. We expect the resulting scenario decomposition to offer a substantial speedup of the solution times for problems involving large datasets. Efficient decomposition algorithms that could be used for solving the convex program (13) are described, for example, in [35] and [5, Chapter 4].
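To see the weak coupling more explicitly, note that relaxing the single budget constraint of (13) with a multiplier \(\lambda \ge 0\) yields, for every fixed \(\lambda \), the upper bound

$$\begin{aligned} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N}\; \sup \limits _{\alpha _{ik}, q_{ik}} \Big \{ \sum \limits _{k = 1}^{K} \Big ( \alpha _{ik}\ell _k\big (\widehat{\xi }_i - {q_{ik} \over \alpha _{ik}}\big ) - \lambda \Vert q_{ik}\Vert \Big ) \,:\, \sum \limits _{k = 1}^{K}\alpha _{ik}=1,~ \alpha _{ik}\ge 0,~ \widehat{\xi }_i - {q_{ik} \over \alpha _{ik}} \in \Xi \Big \}, \end{aligned}$$

which decomposes into N independent subproblems, one per training sample; the outer minimization over the scalar \(\lambda \) could then be carried out, for instance, by bisection or subgradient steps.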

5 Special loss functions

We now demonstrate that the convex optimization problems (11) and (13) reduce to computationally tractable conic programs for several loss functions of practical interest.

5.1 Piecewise affine loss functions

We first investigate the worst-case expectations of convex and concave piecewise affine loss functions, which arise, for example, in option pricing [8], risk management [34] and in generic two-stage stochastic programming [6]. Moreover, piecewise affine functions frequently serve as approximations of smooth convex or concave loss functions.

Corollary 5.1

(Piecewise affine loss functions) Suppose that the uncertainty set is a polytope, that is, \(\Xi = \{ \xi \in \mathbb {R}^m : C \xi \le d \}\) where C is a matrix and d a vector of appropriate dimensions. Moreover, consider the affine functions \(a_k(\xi ) {:=}\big \langle a_{k}, \xi \big \rangle + b_{k}\) for all \(k\le K\).

  1. (i)

    If \(\ell (\xi )= \max _{k\le K}a_k(\xi )\), then the worst-case expectation (10) evaluates to

    $$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} b_k +\big \langle a_k, \widehat{\xi }_i \big \rangle + \big \langle \gamma _{ik}, d-C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N, &{} \forall k \le K\\ &{} \Vert C^\intercal \gamma _{ik} - a_{k}\Vert _* \le \lambda &{} \quad \forall i \le N, &{} \forall k \le K \\ &{} \gamma _{ik} \ge 0&{} \quad \forall i \le N, &{} \forall k \le K . \end{array}\right. \end{aligned}$$
    (15a)
  2. (ii)

    If \(\ell (\xi )= \min _{k\le K}a_k(\xi )\), then the worst-case expectation (10) evaluates to

    $$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{i},\theta _{i}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{}\big \langle \theta _i, b+ A\widehat{\xi }_i \big \rangle +\big \langle \gamma _{i}, d- C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N\\ &{} \Vert C^\intercal \gamma _i-A^\intercal \theta _i\Vert _* \le \lambda &{} \quad \forall i \le N \\ &{} \big \langle \theta _{i}, e \big \rangle = 1 &{} \quad \forall i \le N\\ &{} \gamma _{i}\ge 0&{} \quad \forall i \le N\\ &{} \theta _{i} \ge 0&{} \quad \forall i \le N, \end{array}\right. \end{aligned}$$
    (15b)

    where A is the matrix with rows \(a^\intercal _k\), \(k\le K\), b is the column vector with entries \(b_k\), \(k\le K\), and \(e\) is the vector of all ones.

Proof

Assertion (i) is an immediate consequence of Theorem 4.2, which applies because \(\ell (\xi )\) is the pointwise maximum of the affine functions \(\ell _k(\xi )= a_k(\xi )\), \(k\le K\), and thus Assumption 4.1 is satisfied. By the definition of the conjugacy operator, we have

$$\begin{aligned}{}[-\ell _k]^*(z) =[-a_k]^*(z) = \sup \limits _{\xi } \big \langle z, \xi \big \rangle + \big \langle a_k, \xi \big \rangle + b_k=\left\{ \begin{array}{cl} b_k &{} \quad \text {if }z=-a_{k}, \\ \infty &{} \quad \text {else,} \end{array} \right. \end{aligned}$$

and

$$\begin{aligned} \sigma _\Xi (\nu ) = \left\{ \begin{array}{cl} \sup \limits _{\xi } &{} \big \langle \nu , \xi \big \rangle \\ \text {s.t.}&{} C \xi \le d \end{array}\right. = \left\{ \begin{array}{cl} \inf \limits _{\gamma \ge 0} &{} \big \langle \gamma , d \big \rangle \\ \text {s.t.}&{} C^\intercal \gamma = \nu , \end{array} \right. \end{aligned}$$

where the last equality follows from strong duality, which holds as the uncertainty set is non-empty. Assertion (i) then follows by substituting the above expressions into (11).

Assertion (ii) also follows directly from Theorem 4.2 because \(\ell (\xi )=\ell _1(\xi )= \min _{k\le K}a_k(\xi )\) is concave and thus satisfies Assumption 4.1 with a single concave piece. In this setting, we find

$$\begin{aligned}{}[-\ell ]^*(z)&= \sup \limits _{\xi } \big \langle z, \xi \big \rangle + \min _{k\le K}\Big \{ \big \langle a_k, \xi \big \rangle + b_k\Big \} = \left\{ \begin{array}{cl} \mathop {\sup }\limits _{\xi ,\tau } &{} \big \langle z, \xi \big \rangle +\tau \\ \text {s.t.}&{} A{\xi } + b \ge \tau e\end{array}\right. = \left\{ \begin{array}{cl} \mathop {\inf }\limits _{\theta \ge 0} &{} \big \langle \theta , b \big \rangle \\ \text {s.t.}&{} A^\intercal \theta = -z \\ &{} \big \langle \theta , e \big \rangle = 1 \end{array} \right. \end{aligned}$$

where the last equality follows again from strong linear programming duality, which holds since the primal maximization problem is feasible. Assertion (ii) then follows by substituting \([-\ell ]^*\) as well as the formula for \(\sigma _\Xi \) from the proof of assertion (i) into (11). \(\square \)

As a consistency check, we ascertain that in the ambiguity-free limit, the optimal value of (15a) reduces to the expectation of \(\max _{k\le K}a_k(\xi )\) under the empirical distribution. Indeed, for \(\varepsilon = 0\), the variable \(\lambda \) can be made arbitrarily large at no penalty. For this reason and because all training samples must belong to the uncertainty set (i.e., \(d-C\widehat{\xi }_i\ge 0\) for all \(i\le N\)), it is optimal to set \(\gamma _{ik}=0\). This in turn implies that \(s_i= \max _{k\le K}a_k(\widehat{\xi }_i)\) at optimality, in which case \(\frac{1}{N}\sum _{i=1}^Ns_i\) represents the sample average of the convex loss function at hand.
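For concreteness, the following Python sketch assembles problem (15a) with the off-the-shelf cvxpy modeling package for a Wasserstein ball defined via the 1-norm, in which case the dual norm is the \(\infty \)-norm. All problem data below (the affine pieces, the box-shaped uncertainty set, the training samples and the radius) are hypothetical placeholders, and the code is only a sketch, not a definitive implementation.

```python
# A minimal cvxpy sketch of the linear program (15a); all data are hypothetical placeholders.
# The Wasserstein metric uses the 1-norm, so the dual norm appearing in (15a) is the inf-norm.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, m, K = 20, 2, 3
C = np.vstack([np.eye(m), -np.eye(m)])                  # uncertainty set Xi = {xi : C xi <= d} ...
d = 5.0 * np.ones(2 * m)                                # ... here the box [-5, 5]^m
xi_hat = np.clip(rng.normal(size=(N, m)), -5.0, 5.0)    # training samples (clipped so they lie in Xi)
a = rng.normal(size=(K, m))                             # affine pieces a_k(xi) = <a_k, xi> + b_k
b = rng.normal(size=K)
eps = 0.1                                               # Wasserstein radius

lam = cp.Variable(nonneg=True)
s = cp.Variable(N)
gam = [[cp.Variable(2 * m, nonneg=True) for _ in range(K)] for _ in range(N)]

constraints = []
for i in range(N):
    for k in range(K):
        g = gam[i][k]
        constraints += [
            b[k] + a[k] @ xi_hat[i] + g @ (d - C @ xi_hat[i]) <= s[i],   # epigraph constraints of (15a)
            cp.norm(C.T @ g - a[k], "inf") <= lam,                       # dual-norm constraints of (15a)
        ]

prob = cp.Problem(cp.Minimize(lam * eps + cp.sum(s) / N), constraints)
prob.solve()
print("worst-case expectation:", prob.value)
```

For the \(\infty \)-norm Wasserstein metric one would impose the dual-norm constraints with the 1-norm instead, and setting eps = 0 in the sketch reproduces the sample average just discussed.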

An analogous argument shows that, for \(\varepsilon =0\), the optimal value of (15b) reduces to the expectation of \(\min _{k\le K}a_k(\xi )\) under the empirical distribution. As before, \(\lambda \) can be increased at no penalty. Thus, we conclude that \(\gamma _i=0\) and

$$\begin{aligned} s_i=\min \limits _{\theta _i\ge 0}\left\{ \big \langle \theta _i, b+ A\widehat{\xi }_i \big \rangle :\big \langle \theta _{i}, e \big \rangle = 1 \right\} = \min _{k\le K}a_k(\widehat{\xi }_i) \end{aligned}$$

at optimality, in which case \(\frac{1}{N}\sum _{i=1}^Ns_i\) is the sample average of the given concave loss function.

5.2 Uncertainty quantification

A problem of great practical interest is to ascertain whether a physical, economic or engineering system with an uncertain state \(\xi \) satisfies a number of safety constraints with high probability. In the following we denote by \(\mathbb {A}\) the set of states in which the system is safe. Our goal is to quantify the probability of the event \(\xi \in \mathbb {A}\) (\(\xi \notin \mathbb {A}\)) under an ambiguous state distribution that is only indirectly observable through a finite training dataset. More precisely, we aim to calculate the worst-case probability of the system being unsafe, i.e.,

$$\begin{aligned} \sup _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {Q}\left[ \xi \notin \mathbb {A}\right] , \end{aligned}$$
(16a)

as well as the best-case probability of the system being safe, that is,

$$\begin{aligned} \sup _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {Q}\left[ \xi \in \mathbb {A}\right] . \end{aligned}$$
(16b)

Remark 5.2

(Data-dependent sets) The set \(\mathbb {A}\) may even depend on the samples \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\), in which case \(\mathbb {A}\) is renamed as \({\widehat{\mathbb {A}}}\). If the Wasserstein radius \(\varepsilon \) is set to \(\varepsilon _N(\beta )\), then we have \(\mathbb {P}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) with probability \(1-\beta \), implying that (16a) and (16b) still provide \(1-\beta \) confidence bounds on \(\mathbb {P}[\xi \notin {\widehat{\mathbb {A}}}]\) and \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\), respectively.

Corollary 5.3

(Uncertainty quantification) Suppose that the uncertainty set is a polytope of the form \(\Xi = \{ \xi \in \mathbb {R}^m : C \xi \le d \}\) as in Corollary 5.1.

  1. (i)

    If \(\mathbb {A} = \{\xi \in \mathbb {R}^m: A\xi < b\}\) is an open polytope and the halfspace \(\big \{\xi :\big \langle a_k, \xi \big \rangle \ge b_k \big \}\) has a nonempty intersection with \(\Xi \) for any \(k\le K\), where \(a_k\) is the k-th row of the matrix A and \(b_k\) is the k-th entry of the vector b, then the worst-case probability (16a) is given by

    $$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{ik},\theta _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{}1-\theta _{ik}\big (b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle \big ) +\big \langle \gamma _{ik}, d-C\widehat{\xi }_i \big \rangle \le s_i &{}\quad \forall i \le N, &{} \forall k \le K\\ &{} \Vert a_k\theta _{ik}-C^\intercal \gamma _{ik}\Vert _* \le \lambda &{}\quad \forall i \le N, &{} \forall k \le K \\ &{} \gamma _{ik}\ge 0&{}\quad \forall i \le N, &{} \forall k \le K\\ &{} \theta _{ik} \ge 0&{}\quad \forall i \le N, &{} \forall k \le K\\ &{} s_i \ge 0 &{} \quad \forall i \le N. \end{array}\right. \end{aligned}$$
    (17a)
  2. (ii)

    If \(\mathbb {A} = \{\xi \in \mathbb {R}^m : A\xi \le b\}\) is a closed polytope that has a nonempty intersection with \(\Xi \), then the best-case probability (16b) is given by

    $$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _i, \theta _i} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} 1+\big \langle \theta _i, b - A\widehat{\xi }_i \big \rangle + \big \langle \gamma _{i}, d - C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N \\ &{} \Vert A^\intercal \theta _i+C^\intercal \gamma _{i}\Vert _* \le \lambda &{} \quad \forall i \le N \\ &{} \gamma _i \ge 0 &{}\quad \forall i \le N\\ &{} \theta _{i} \ge 0 &{}\quad \forall i \le N\\ &{} s_i\ge 0 &{}\quad \forall i \le N. \end{array}\right. \end{aligned}$$
    (17b)

Proof

The uncertainty quantification problems (16a) and (16b) can be interpreted as instances of (10) with loss functions \(\ell = 1 - \mathbbm {1}_{\mathbb {A}}\) and \(\ell = \mathbbm {1}_{\mathbb {A}}\), respectively. In order to be able to apply Theorem 4.2, we should represent these loss functions as finite maxima of concave functions as shown in Fig. 3.

Formally, assertion (i) follows from Theorem 4.2 for a loss function with \(K+1\) pieces if we use the following definitions. For every \(k\le K\) we define

$$\begin{aligned} \ell _{k}(\xi ) = \left\{ \begin{array}{cl} 1 &{} \quad \text {if }\,\big \langle a_k, \xi \big \rangle \ge b_k, \\ -\infty &{} \quad \text {otherwise.} \end{array}\right. \end{aligned}$$

Moreover, we define \(\ell _{K+1}(\xi ) = 0\). As illustrated in Fig. 3a, we thus have \(\ell (\xi )=\max _{k\le K+1} \ell _k(\xi )= 1 - \mathbbm {1}_{\mathbb {A}}(\xi )\) and

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {Q}\left[ \xi \notin \mathbb {A}\right] ~= \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\left[ \ell (\xi )\right] . \end{aligned}$$

Assumption 4.1 holds due to the postulated properties of \(\mathbb {A}\) and \(\Xi \). In order to apply Theorem 4.2, we must determine the support function \(\sigma _\Xi \), which is already known from Corollary 5.1, as well as the conjugate functions of \(-\ell _k\), \(k\le K+1\). A standard duality argument yields

$$\begin{aligned}{}[-\ell _k]^*(z)&= \left\{ \begin{array}{cl} \mathop {\sup }\limits _{\xi } &{} \big \langle z, \xi \big \rangle + 1 \\ \text {s.t.}&{} \big \langle a_k, \xi \big \rangle \ge b_k \end{array}\right. = \left\{ \begin{array}{cl} \mathop {\inf }\limits _{\theta \ge 0} &{} 1 - b_k\theta \\ \text {s.t.}&{} a_k \theta =-z, \end{array}\right. \end{aligned}$$

for all \(k\le K\). Moreover, we have \([-\ell _{K+1}]^*(z) = 0\) if \(z=0\) and \([-\ell _{K+1}]^*(z)=\infty \) otherwise. Assertion (i) then follows by substituting the formulas for \([-\ell _k]^*\), \(k\le K+1\), and \(\sigma _\Xi \) into (11).

Assertion (ii) follows from Theorem 4.2 by setting \(K= 2\), \(\ell _1(\xi ) = 1-\chi _{\mathbb {A}}(\xi )\) and \(\ell _2(\xi ) = 0\). As illustrated in Fig. 3b, this implies that \(\ell (\xi )=\max \{\ell _1(\xi ),\ell _2(\xi )\}=\mathbbm {1}_{\mathbb {A}}(\xi )\) and

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {Q}\left[ \xi \in \mathbb {A}\right] ~= \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\left[ \ell (\xi )\right] . \end{aligned}$$

Assumption 4.1 holds by our assumptions on \(\mathbb {A}\) and \(\Xi \). In order to apply Theorem 4.2, we thus have to determine the support function \(\sigma _\Xi \), which was already calculated in Corollary 5.1, and the conjugate functions of \(-\ell _1\) and \(-\ell _2\). By the definition of the conjugacy operator, we find

$$\begin{aligned}{}[-\ell _1]^*(z)&= \sup _{\xi \in \mathbb {A}} \big \langle z, \xi \big \rangle + 1 = \left\{ \begin{array}{cl} \mathop {\sup }\limits _{\xi } &{} \big \langle z, \xi \big \rangle + 1 \\ \text {s.t.}&{} A\xi \le b \end{array}\right. = \left\{ \begin{array}{cl} \mathop {\inf }\limits _{\theta \ge 0} &{} \big \langle \theta , b \big \rangle + 1 \\ \text {s.t.}&{} A^\intercal \theta = z \end{array}\right. \end{aligned}$$

where the last equality follows from strong linear programming duality, which holds as the safe set is non-empty. Similarly, we find \([-\ell _{2}]^*(z) = 0\) if \(z=0\) and \([-\ell _{2}]^*(z)=\infty \) otherwise. Assertion (ii) then follows by substituting the above expressions into (11). \(\square \)

Fig. 3 Representing the indicator function of a convex set and its complement as a pointwise maximum of concave functions. (a) Indicator function of the unsafe set. (b) Indicator function of the safe set

In the ambiguity-free limit (i.e., for \(\varepsilon = 0\)) the optimal value of (17a) reduces to the fraction of training samples residing outside of the open polytope \(\mathbb {A}=\{\xi :A\xi <b\}\). Indeed, in this case the variable \(\lambda \) can be made arbitrarily large at no penalty. For this reason and because all training samples belong to the uncertainty set (i.e., \(d-C\widehat{\xi }_i\ge 0\) for all \(i\le N\)), it is optimal to set \(\gamma _{ik}=0\). If the i-th training sample belongs to \(\mathbb {A}\) (i.e., \(b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle > 0\) for all \(k\le K\)), then choosing \(\theta _{ik}\ge 1/(b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle )\) for all \(k\le K\) allows \(s_i=0\) at optimality. Conversely, if the i-th training sample belongs to the complement of \(\mathbb {A}\) (i.e., \(b_k-\big \langle a_k, \widehat{\xi }_i \big \rangle \le 0\) for some \(k\le K\)), then the corresponding constraint forces \(s_i\ge 1\) for every \(\theta _{ik}\ge 0\), and \(s_i=1\) is attained, for instance, with \(\theta _{ik}=0\). Thus, \(\sum _{i=1}^Ns_i\) coincides with the number of training samples outside of \(\mathbb {A}\) at optimality. An analogous argument shows that, for \(\varepsilon =0\), the optimal value of (17b) reduces to the fraction of training samples residing inside the closed polytope \(\mathbb {A}=\{\xi :A\xi \le b\}\).
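This consistency check is easy to reproduce numerically. The following Python sketch (again cvxpy, with purely hypothetical data) solves (17a) for \(\varepsilon =0\) and compares its optimal value with the empirical fraction of training samples falling outside the open polytope \(\mathbb {A}\); increasing eps beyond zero cannot decrease the reported worst-case probability.

```python
# Consistency check for (17a) in the ambiguity-free limit eps = 0; all data are hypothetical.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
N, m, K = 30, 2, 3
C = np.vstack([np.eye(m), -np.eye(m)]); d = 5.0 * np.ones(2 * m)       # Xi = [-5, 5]^m
xi_hat = np.clip(rng.normal(size=(N, m)), -5.0, 5.0)
A = rng.normal(size=(K, m)); b = rng.normal(size=K)                    # safe set A = {xi : A xi < b}
eps = 0.0

lam = cp.Variable(nonneg=True)
s = cp.Variable(N, nonneg=True)
th = cp.Variable((N, K), nonneg=True)
gam = [[cp.Variable(2 * m, nonneg=True) for _ in range(K)] for _ in range(N)]

constraints = []
for i in range(N):
    for k in range(K):
        g = gam[i][k]
        constraints += [
            1 - th[i, k] * (b[k] - A[k] @ xi_hat[i]) + g @ (d - C @ xi_hat[i]) <= s[i],
            cp.norm(th[i, k] * A[k] - C.T @ g, "inf") <= lam,
        ]

prob = cp.Problem(cp.Minimize(lam * eps + cp.sum(s) / N), constraints)
prob.solve()

empirical = np.mean(np.any(xi_hat @ A.T >= b, axis=1))    # fraction of samples outside the open set A
print("optimal value of (17a):", prob.value, "  empirical fraction outside A:", empirical)
```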

5.3 Two-stage stochastic programming

A major challenge in linear two-stage stochastic programming is to evaluate the expected recourse costs, which are only implicitly defined as the optimal value of a linear program whose coefficients depend linearly on the uncertain problem parameters [46, Section 2.1]. The following corollary shows how we can evaluate the worst-case expectation of the recourse costs with respect to an ambiguous parameter distribution that is only observable through a finite training dataset. For ease of notation and without loss of generality, we suppress here any dependence on the first-stage decisions.

Corollary 5.4

(Two-stage stochastic programming) Suppose that the uncertainty set is a polytope of the form \(\Xi = \{ \xi \in \mathbb {R}^m : C \xi \le d \}\) as in Corollaries 5.1 and 5.3.

  1. (i)

    If \(\ell (\xi ) =\inf _{y} \left\{ \big \langle y, Q\xi \big \rangle : Wy\ge h \right\} \) is the optimal value of a parametric linear program with objective uncertainty, and if the feasible set \(\{y:Wy\ge h\}\) is non-empty and compact, then the worst-case expectation (10) is given by

    $$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _i, y_i} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} \big \langle y_i, Q\widehat{\xi }_i \big \rangle + \big \langle \gamma _{i}, d - C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N \\ &{} Wy_i\ge h &{}\quad \forall i \le N\\ &{} \Vert Q^\intercal y_i-C^\intercal \gamma _{i}\Vert _* \le \lambda &{} \quad \forall i \le N \\ &{} \gamma _i \ge 0 &{} \quad \forall i \le N. \end{array}\right. \end{aligned}$$
    (18a)
  2. (ii)

    If \(\ell (\xi ) =\inf _{y} \left\{ \big \langle q, y \big \rangle : Wy \ge H\xi + h \right\} \) is the optimal value of a parametric linear program with right-hand side uncertainty, and if the dual feasible set \(\{\theta \ge 0:W^\intercal \theta =q\}\) is non-empty and compact with vertices \(v_k\), \(k\le K\), then the worst-case expectation (10) is given by

    $$\begin{aligned} \left\{ \begin{array}{clll} \inf \limits _{\lambda ,s_i, \gamma _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ \text {s.t.}&{} \big \langle v_k, h \big \rangle + \big \langle H^\intercal v_k, \widehat{\xi }_i \big \rangle + \big \langle \gamma _{ik}, d-C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N, &{} \forall k \le K\\ &{} \Vert C^\intercal \gamma _{ik}-H^\intercal v_k\Vert _* \le \lambda &{} \quad \forall i \le N, &{} \forall k \le K \\ &{} \gamma _{ik} \ge 0&{}\quad \forall i \le N, &{} \forall k \le K. \end{array}\right. \end{aligned}$$
    (18b)

Proof

Assertion (i) follows directly from Theorem 4.2 because \(\ell (\xi )\) is concave as an infimum of linear functions in \(\xi \). Indeed, the compactness of the feasible set \(\{y: Wy\ge h\}\) ensures that Assumption 4.1 holds for \(K=1\). In this setting, we find

$$\begin{aligned}{}[-\ell ]^*(z)&= \sup \limits _{\xi } \left\{ \big \langle z, \xi \big \rangle + \inf \limits _{y} \left\{ \big \langle y, Q\xi \big \rangle : Wy\ge h\right\} \right\} \\&= \inf \limits _{y}\left\{ \sup \limits _{\xi }\left\{ \big \langle z+Q^\intercal y, \xi \big \rangle \right\} : Wy\ge h \right\} \\&= \left\{ \begin{array}{c@{\quad }l} 0 &{} \text {if there exists { y} with } Q^\intercal y=-z \text { and }Wy\ge h,\\ \infty &{} \text {otherwise,} \end{array} \right. \end{aligned}$$

where the second equality follows from the classical minimax theorem [4, Proposition 5.5.4], which applies because \(\{y: Wy\ge h\}\) is compact. Assertion (i) then follows by substituting \([-\ell ]^*\) as well as the formula for \(\sigma _\Xi \) from Corollary 5.1 into (11).

Assertion (ii) relies on the following reformulation of the loss function,

$$\begin{aligned} \ell (\xi )&= \left\{ \begin{array}{cl} \inf \limits _{y} &{} \big \langle q, y \big \rangle \\ \text {s.t.}&{} Wy \ge H\xi + h \end{array}\right. = \left\{ \begin{array}{cl} \sup \limits _{\theta \ge 0} &{} \big \langle \theta , H\xi + h \big \rangle \\ \text {s.t.}&{} W^\intercal \theta = q \end{array}\right. = \max \limits _{k\le K} \big \langle v_k, H\xi + h \big \rangle \\&= \max \limits _{k\le K} \big \langle H^\intercal v_k, \xi \big \rangle + \big \langle v_k, h \big \rangle , \end{aligned}$$

where the first equality holds due to strong linear programming duality, which applies as the dual feasible set is non-empty. The second equality exploits the elementary observation that the optimal value of a linear program with non-empty, compact feasible set is always attained at a vertex. As we managed to express \(\ell (\xi )\) as a pointwise maximum of affine functions, assertion (ii) follows immediately from Corollary 5.1 (i). \(\square \)

As expected, in the ambiguity-free limit, problem (18a) reduces to a standard SAA problem. Indeed, for \(\varepsilon =0\), the variable \(\lambda \) can be made large at no penalty, and thus \(\gamma _i=0\) and \(s_i=\big \langle y_i, Q\widehat{\xi }_i \big \rangle \) at optimality. In this case, problem (18a) is equivalent to

$$\begin{aligned} \inf \limits _{y_i} \left\{ {1 \over N}\sum \limits _{i = 1}^{N} \big \langle y_i, Q\widehat{\xi }_i \big \rangle : Wy_i\ge h \quad \forall i \le N\right\} . \end{aligned}$$

Similarly, one can verify that for \(\varepsilon =0\), (18b) reduces to the SAA problem

$$\begin{aligned} \inf \limits _{y_i} \left\{ {1 \over N}\sum \limits _{i = 1}^{N} \big \langle q, y_i \big \rangle : Wy_i\ge H\widehat{\xi }_i + h \quad \forall i \le N\right\} . \end{aligned}$$
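The following Python sketch (cvxpy, with hypothetical data) performs this consistency check for (18a): at \(\varepsilon =0\) its optimal value coincides with the sample average approximation displayed above, whereas any \(\varepsilon >0\) cannot decrease it. The box-shaped recourse feasible set, the random matrix Q and the use of cvxpy are illustrative assumptions, not part of the corollary.

```python
# Consistency check for (18a): at eps = 0 its optimal value equals the SAA value above.
# Recourse feasible set {y : W y >= h} is the compact box [0, 1]^p; all data are hypothetical.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
N, m, p = 15, 2, 3
C = np.vstack([np.eye(m), -np.eye(m)]); d = 5.0 * np.ones(2 * m)       # Xi = [-5, 5]^m
xi_hat = np.clip(rng.normal(size=(N, m)), -5.0, 5.0)
Q = rng.normal(size=(p, m))                                            # recourse cost <y, Q xi>
W = np.vstack([np.eye(p), -np.eye(p)])                                 # W y >= h  <=>  0 <= y <= 1
h = np.concatenate([np.zeros(p), -np.ones(p)])

def solve_18a(eps):
    lam = cp.Variable(nonneg=True)
    s = cp.Variable(N)
    y = [cp.Variable(p) for _ in range(N)]
    gam = [cp.Variable(2 * m, nonneg=True) for _ in range(N)]
    cons = []
    for i in range(N):
        cons += [
            y[i] @ (Q @ xi_hat[i]) + gam[i] @ (d - C @ xi_hat[i]) <= s[i],
            W @ y[i] >= h,
            cp.norm(Q.T @ y[i] - C.T @ gam[i], "inf") <= lam,
        ]
    prob = cp.Problem(cp.Minimize(lam * eps + cp.sum(s) / N), cons)
    prob.solve()
    return prob.value

# SAA value: minimizing <y, Q xi_hat_i> over the box [0, 1]^p sums the negative parts of Q xi_hat_i.
saa = np.minimum(xi_hat @ Q.T, 0.0).sum(axis=1).mean()
print("(18a) at eps = 0  :", solve_18a(0.0))
print("SAA value         :", saa)
print("(18a) at eps = 0.1:", solve_18a(0.1))
```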

We close this section with a remark on the computational complexity of all the convex optimization problems derived in this section.

Remark 5.5

(Computational tractability)

  • If the Wasserstein metric is defined in terms of the 1-norm (i.e., \(\Vert \xi \Vert =\sum _{k=1}^m|\xi _k|\)) or the \(\infty \)-norm (i.e., \(\Vert \xi \Vert =\max _{k\le m}|\xi _k|\)), then the optimization problems (15a), (15b), (17a), (17b), (18a) and (18b) all reduce to linear programs whose sizes scale with the number N of data points and the number K of affine pieces of the underlying loss functions.

  • Except for the two-stage stochastic program with right-hand side uncertainty in (18b), the resulting linear programs scale polynomially in the problem description and are therefore computationally tractable. As the number of vertices \(v_k\), \(k\le K\), of the polytope \(\{\theta \ge 0:W^\intercal \theta =q\}\) may be exponential in the number of its facets, however, the linear program (18b) has generically exponential size.

  • Inspecting (15a), one easily verifies that the distributionally robust optimization problem (5) reduces to a finite convex program if \(\mathbb {X}\) is convex and \(h(x,\xi )= \max _{k\le K} \big \langle a_{k}(x), \xi \big \rangle + b_{k}(x)\), where the gradients \(a_{k}(x)\) and the intercepts \(b_{k}(x)\) depend linearly on x. Similarly, (5) can be reformulated as a finite convex program if \(\mathbb {X}\) is convex and \(h(x,\xi )=\inf _{y} \left\{ \big \langle y, Q\xi \big \rangle : Wy\ge h(x) \right\} \) or \(h(x,\xi )=\inf _{y} \left\{ \big \langle q, y \big \rangle : Wy \ge H(x)\xi + h(x) \right\} \), where the right-hand side coefficients h(x) and H(x) depend linearly on x; see (18a) and (18b), respectively. In contrast, problems (15b), (17a) and (17b) result in non-convex optimization problems when their data depends on x.

  • We emphasize that the computational complexity of all convex programs examined in this section is independent of the radius \(\varepsilon \) of the Wasserstein ball.

6 Tractable extensions

We now demonstrate that through minor modifications of the proofs, Theorems 4.2 and 4.4 extend to worst-case expectation problems involving even richer classes of loss functions. First, we investigate problems where the uncertainty can be viewed as a stochastic process and where the loss function is additively separable. Next, we study problems whose loss functions are convex in the uncertain variables and are therefore not necessarily representable as finite maxima of concave functions as postulated by Assumption 4.1.

6.1 Stochastic processes with a separable cost

Consider a variant of the worst-case expectation problem (10), where the uncertain parameters can be interpreted as a stochastic process \(\xi = \big (\xi _1,\ldots ,\xi _T\big )\), and assume that \(\xi _t \in \Xi _t\), where \( \Xi _t \subseteq \mathbb {R}^m\) is non-empty and closed for any \(t\le T\). Moreover, assume that the loss function is additively separable with respect to the temporal structure of \(\xi \), that is,

$$\begin{aligned} \ell (\xi ) {:=}\sum \limits _{t = 1}^{T} \max _{k\le K}\ell _{tk} \big (\xi _t\big ), \end{aligned}$$
(19)

where \(\ell _{tk}:\mathbb {R}^m\rightarrow \overline{\mathbb {R}}\) is a measurable function for any \(k\le K\) and \(t\le T\). Such loss functions appear, for instance, in open-loop stochastic optimal control or in multi-item newsvendor problems. Consider a process norm \(\left\| \xi \right\| _{\mathrm{T}} = \sum _{t = 1}^{T} \Vert \xi _t\Vert \) associated with the base norm \(\Vert \cdot \Vert \) on \(\mathbb {R}^m\), and assume that its induced metric is the one used in the definition of the Wasserstein distance. Note that if \(\Vert \cdot \Vert \) is the 1-norm on \(\mathbb {R}^m\), then \(\left\| \cdot \right\| _{\mathrm{T}}\) reduces to the 1-norm on \(\mathbb {R}^{mT}\).

By interchanging summation and maximization, the loss function (19) can be re-expressed as

$$\begin{aligned} \ell (\xi )= \max _{k_t \le K} \sum \limits _{t = 1}^{T} \ell _{tk_t} \big (\xi _t \big ), \end{aligned}$$

where the maximum runs over all \(K^T\) combinations of \(k_1,\ldots , k_T\le K\). Under this representation, Theorem 4.2 remains applicable. However, the resulting convex optimization problem would involve \(\mathcal O(K^T)\) decision variables and constraints, indicating that an efficient solution may not be available. Fortunately, this deficiency can be overcome by modifying Theorem 4.2.
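For instance, already for \(T=2\) and \(K=2\) the representation above comprises the \(K^T=4\) concave pieces

$$\begin{aligned} \ell (\xi ) = \max \Big \{ \ell _{11}(\xi _1) + \ell _{21}(\xi _2),~ \ell _{11}(\xi _1) + \ell _{22}(\xi _2),~ \ell _{12}(\xi _1) + \ell _{21}(\xi _2),~ \ell _{12}(\xi _1) + \ell _{22}(\xi _2) \Big \}, \end{aligned}$$

and the number of pieces grows exponentially with the length T of the planning horizon, which is precisely the blow-up that Theorem 6.1 below avoids.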

Theorem 6.1

(Convex reduction for separable loss functions) Assume that the loss function \(\ell \) is of the form (19), and the Wasserstein ball is defined through the process norm \(\left\| \cdot \right\| _{\mathrm{T}}\). Then, for any \(\varepsilon \ge 0 \), the worst-case expectation (10) is less than or equal to the optimal value of the finite convex program

$$\begin{aligned} \left\{ \begin{array}{cllll} \inf \limits _{\lambda , s_{ti}, z_{tik}, \nu _{tik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \sum \limits _{t = 1}^{T} s_{ti} &{}&{} \\ \text {s.t.}&{} [-\ell _{tk}]^*\big (z_{tik} - \nu _{tik}\big ) + \sigma _{\Xi _t}(\nu _{tik}) - \big \langle z_{tik}, \widehat{\xi }_{ti} \big \rangle \le s_{ti} &{}\quad \forall i \le N, &{} \forall k\le K, &{} \forall t \le T,\\ &{} \Vert z_{tik}\Vert _* \le \lambda &{}\quad \forall i \le N, &{} \forall k\le K, &{} \forall t \le T. \end{array}\right. \end{aligned}$$
(20)

If \(\Xi _t\) and \(\{\ell _{tk}\}_{k\le K}\) satisfy the convexity Assumption 4.1 for every \(t\le T\), then the worst-case expectation (10) coincides exactly with the optimal value of problem (20).

Proof

Up until equation (12b), the proof of Theorem 6.1 parallels that of Theorem 4.2. Starting from (12b), we then have

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)}&\mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ] = \inf \limits _{\lambda \ge 0}~ \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \sup _{\xi } \left( \ell (\xi ) - \lambda \left\| \xi - \widehat{\xi }_i \right\| _{\mathrm{T}} \right) \nonumber \\&= \inf \limits _{\lambda \ge 0} ~ \lambda \varepsilon + {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{t = 1}^{T} \sup _{\xi _t \in \Xi _t} \left( \max _{k\le K} \ell _{tk} \big (\xi _t \big ) - \lambda \big \Vert \xi _t - \widehat{\xi }_{ti}\big \Vert \right) , \end{aligned}$$

where the interchange of the summation and the maximization is facilitated by the separability of the overall loss function. Introducing epigraphical auxiliary variables yields

$$\begin{aligned}&\left\{ \begin{array}{cllll} \inf \limits _{\lambda , s_{ti}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \sum \limits _{t=1}^{T} s_{ti} \\ \text {s.t.}&{} \sup \limits _{\xi _t\in \Xi _t} \Big (\ell _{tk}\big (\xi _t\big ) - \lambda \big \Vert \xi _t - \widehat{\xi }_{ti} \big \Vert \Big )\le s_{ti} &{}\quad \forall i\le N, ~ \forall k\le K, ~ \forall t \le T \\ &{} \lambda \ge 0 \end{array} \right. \nonumber \\ \le&\left\{ \begin{array}{cllll} \inf \limits _{\lambda , s_{ti},z_{tik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \sum \limits _{t = 1}^{T}s_{ti} \\ \text {s.t.}&{} \sup \limits _{\xi _t\in \Xi _t} \Big (\ell _{tk}\big (\xi _t \big ) - \big \langle z_{tik}, \xi _t \big \rangle \Big ) + \big \langle z_{tik}, \widehat{\xi }_{ti} \big \rangle \le s_{ti} &{}\quad \forall i\le N, ~\forall k\le K, ~\forall t \le T \\ &{} \Vert z_{tik}\Vert _* \le \lambda &{}\quad \forall i\le N,~ \forall k\le K, ~ \forall t \le T \end{array} \right. \\ =&\left\{ \begin{array}{cllll} \inf \limits _{\lambda , s_{ti},z_{tik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \sum \limits _{t = 1}^{T} s_{ti} \\ \text {s.t.}&{} [-\ell _{tk} + \chi _{\Xi _t}]^*\big (-z_{tik}\big ) + \big \langle z_{tik}, \widehat{\xi }_{ti} \big \rangle \le s_{ti} &{}\quad \forall i\le N,~ \forall k\le K,~ \forall t \le T \\ &{} \Vert z_{tik}\Vert _* \le \lambda &{}\quad \forall i\le N, ~ \forall k\le K, ~ \forall t \le T, \end{array} \right. \end{aligned}$$

where the inequality is justified in a similar manner as the one in (12e), and it holds as an equality provided that \(\Xi _t\) and \(\{\ell _{tk}\}_{k\le K}\) satisfy Assumption 4.1 for all \(t \le T\). Finally, by Rockafellar and Wets [42, Theorem 11.23(a),p. 493], the conjugate of \(-\ell _{tk} + \chi _{\Xi _t}\) can be replaced by the inf-convolution of the conjugates of \(-\ell _{tk}\) and \(\chi _{\Xi _t}\). This completes the proof. \(\square \)

Note that the convex program (20) involves only \(\mathcal {O}(NKT)\) decision variables and constraints. Moreover, if \(\ell _{tk}\) is affine for every \(t\le T\) and \(k\le K\), while \(\Vert \cdot \Vert \) represents the 1-norm or the \(\infty \)-norm on \(\mathbb {R}^m\), then (20) reduces to a tractable linear program (see also Remark 5.5). A natural generalization of Theorem 4.4 further allows us to characterize the extremal distributions of the worst-case expectation problem (10) with a separable loss function of the form (19).

Theorem 6.2

(Worst-case distributions for separable loss functions) Assume that the loss function \(\ell \) is of the form (19), and the Wasserstein ball is defined through the process norm \(\left\| \cdot \right\| _{\mathrm{T}}\). If \(\Xi _t\) and \(\{\ell _{tk}\}_{k\le K}\) satisfy Assumption 4.1 for all \(t \le T\), then the worst-case expectation (10) coincides with the optimal value of the finite convex program

$$\begin{aligned} \left\{ \begin{array}{clll} \mathop {\sup }\limits _{\alpha _{tik}, q_{tik}} &{} {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k = 1}^{K} \sum \limits _{t=1}^{T} \alpha _{tik}\ell _{tk}\Big ( \widehat{\xi }_{ti} - {q_{tik} \over \alpha _{tik}}\Big ) \\ \text {s.t.}&{} {1 \over N}\sum \limits _{i =1}^{N} \sum \limits _{k =1}^{K} \sum \limits _{t=1}^{T} \Vert q_{tik}\Vert \le \varepsilon \\ &{} \sum \limits _{k = 1}^{K} \alpha _{tik} = 1 &{}\quad \forall i \le N, \quad \forall t \le T\\ &{} \alpha _{tik} \ge 0 &{}\quad \forall i \le N, \quad \forall t \le T, \quad \forall k\le K \\ &{} \widehat{\xi }_{ti} - {q_{tik} \over \alpha _{tik}} \in \Xi _t &{}\quad \forall i \le N, \quad \forall t \le T, \quad \forall k\le K \end{array} \right. \end{aligned}$$
(21)

irrespective of \(\varepsilon \ge 0 \). Let \(\big \{\alpha _{tik}(r), q_{tik}(r)\big \}_{r \in \mathbb {N}}\) be a sequence of feasible decisions whose objective values converge to the supremum of (21). Then, the discrete (product) probability distributions

$$\begin{aligned} \mathbb {Q}_r {:=}{1 \over N} \sum _{i = 1}^{N} \bigotimes _{t=1}^T \Big (\sum _{k = 1}^{K} \alpha _{tik}(r) \delta _{\xi _{tik}(r)}\Big ) \quad \text{ with }\quad \xi _{tik}(r) {:=}\widehat{\xi }_{ti} - {q_{tik}(r) \over \alpha _{tik}(r)} \end{aligned}$$

belong to the Wasserstein ball \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\) and attain the supremum of (10) asymptotically, i.e.,

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ] = \lim \limits _{r \rightarrow \infty } \mathbb {E}^{\mathbb {Q}_r} \big [ \ell (\xi ) \big ] = \lim \limits _{r \rightarrow \infty } {1 \over N} \sum \limits _{i = 1}^{N} \sum \limits _{k=1}^{K} \sum \limits _{t=1}^{T} \alpha _{tik}(r) \ell _{tk}\big (\xi _{tik}(r)\big ) . \end{aligned}$$

Proof

As in the proof of Theorem 4.4, the claim follows by dualizing the convex program (20). Details are omitted for brevity of exposition. \(\square \)

We emphasize that the distributions \(\mathbb {Q}_r\) from Theorem 6.2 can be constructed efficiently by solving a convex program of polynomial size even though they have \(NK^T\) discretization points.

6.2 Convex loss functions

Consider now another variant of the worst-case expectation problem (10), where the loss function \(\ell \) is proper, convex and lower semicontinuous. Unless \(\ell \) is piecewise affine, we cannot represent such a loss function as a pointwise maximum of finitely many concave functions, and thus Theorem 4.2 may only provide a loose upper bound on the worst-case expectation (10). The following theorem provides an alternative upper bound that admits new insights into distributionally robust optimization with Wasserstein balls and becomes exact for \(\Xi =\mathbb {R}^m\).

Theorem 6.3

(Convex reduction for convex loss functions) Assume that the loss function \(\ell \) is proper, convex, and lower semicontinuous, and define \(\kappa {:=}\sup \big \{ \Vert \theta \Vert _* : \ell ^*(\theta ) < \infty \big \}\). Then, for any \(\varepsilon \ge 0 \), the worst-case expectation (10) is smaller than or equal to

$$\begin{aligned} \kappa \varepsilon + {1 \over N}\sum _{i = 1}^{N} \ell (\widehat{\xi }_i). \end{aligned}$$
(22)

If \(\Xi =\mathbb {R}^m\), then the worst-case expectation (10) coincides exactly with (22).

Remark 6.4

(Radius of effective domain) The parameter \(\kappa \) can be viewed as the radius of the smallest ball containing the effective domain of the conjugate function \(\ell ^*\) in terms of the dual norm. By the standard conventions of extended arithmetic, the term \(\kappa \varepsilon \) in (22) is interpreted as 0 if \(\kappa =\infty \) and \(\varepsilon =0\).
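
For instance, if \(\ell (\xi )=\Vert \xi \Vert \) and \(\Xi =\mathbb {R}^m\), then \(\ell ^*(\theta )=\sup _{\xi }\big \langle \theta , \xi \big \rangle - \Vert \xi \Vert =\chi _{\{\Vert \theta \Vert _*\le 1\}}(\theta )\) is the indicator function of the dual-norm unit ball, whence \(\kappa =1\) and Theorem 6.3 yields the closed-form expression

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\big [ \Vert \xi \Vert \big ] = \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \Vert \widehat{\xi }_i\Vert . \end{aligned}$$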

Proof

Equation (12b) in the proof of Theorem 4.2 implies that

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\big [ \ell (\xi ) \big ] = \inf \limits _{\lambda \ge 0} ~\lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \sup _{\xi \in \Xi } \left( \ell (\xi ) - \lambda \Vert \xi - \widehat{\xi }_i\Vert \right) \end{aligned}$$
(23)

for every \(\varepsilon > 0\). As \(\ell \) is proper, convex, and lower semicontinuous, it coincides with its bi-conjugate function \(\ell ^{**}\), see e.g. [4, Proposition 1.6.1(c)]. Thus, we may write

$$\begin{aligned} \ell (\xi ) = \sup _{\theta \in \Theta } \big \langle \theta , \xi \big \rangle - \ell ^*(\theta ), \end{aligned}$$

where \(\Theta {:=}\{\theta \in \mathbb {R}^m : \ell ^*(\theta ) < \infty \}\) denotes the effective domain of the conjugate function \(\ell ^*\). Using this dual representation of \(\ell \) in conjunction with the definition of the dual norm, we find

$$\begin{aligned} \mathop {\sup }\limits _{\xi \in \Xi } \Big (\ell (\xi ) - \lambda \Vert \xi -\widehat{\xi }_i\Vert \Big )&= \mathop {\sup }\limits _{\xi \in \Xi }~ \mathop {\sup }\limits _{\theta \in \Theta } \Big (\big \langle \theta , \xi \big \rangle - \ell ^*(\theta ) - \lambda \Vert \xi -\widehat{\xi }_i\Vert \Big ) \\&= \mathop {\sup }\limits _{\xi \in \Xi }~ \mathop {\sup }\limits _{\theta \in \Theta } \mathop {\inf }\limits _{\Vert z\Vert _* \le \lambda } \Big (\big \langle \theta , \xi \big \rangle - \ell ^*(\theta ) + \big \langle z, \xi \big \rangle - \big \langle z, \widehat{\xi }_i \big \rangle \Big ). \end{aligned}$$

The classical minimax theorem [4, Proposition 5.5.4] then allows us to interchange the maximization over \(\xi \) with the maximization over \(\theta \) and the minimization over z to obtain

$$\begin{aligned} \mathop {\sup }\limits _{\xi \in \Xi } \Big (\ell (\xi ) - \lambda \Vert \xi -\widehat{\xi }_i\Vert \Big )&= \mathop {\sup }\limits _{\theta \in \Theta } \mathop {\inf }\limits _{\Vert z\Vert _* \le \lambda } \mathop {\sup }\limits _{\xi \in \Xi }\Big (\big \langle \theta + z, \xi \big \rangle - \ell ^*(\theta ) - \big \langle z, \widehat{\xi }_i \big \rangle \Big ) \nonumber \\&= \mathop {\sup }\limits _{\theta \in \Theta } \mathop {\inf }\limits _{\Vert z\Vert _* \le \lambda } \sigma _{\Xi }(\theta + z) - \ell ^*(\theta ) - \big \langle z, \widehat{\xi }_i \big \rangle . \end{aligned}$$
(24)

Recall that \(\sigma _\Xi \) denotes the support function of \(\Xi \). It seems that there is no simple exact reformulation of (24) for arbitrary convex uncertainty sets \(\Xi \). Interchanging the maximization over \(\theta \) with the minimization over z in (24) would lead to the conservative upper bound of Corollary 4.3. Here, however, we employ an alternative approximation. By definition of the support function, we have \(\sigma _\Xi \le \sigma _{\mathbb {R}^m} = \chi _{\{0\}}\). Replacing \(\sigma _\Xi \) with \( \chi _{\{0\}}\) in (24) thus results in the conservative approximation

$$\begin{aligned} \mathop {\sup }\limits _{\xi \in \mathbb {R}^m} \Big (\ell (\xi ) - \lambda \Vert \xi -\widehat{\xi }_i\Vert \Big )&\le \left\{ \begin{array}{cl} \ell (\widehat{\xi }_i) &{} \quad \text{ if } \sup \big \{\Vert \theta \Vert _* : \theta \in \Theta \big \} \le \lambda , \\ \infty &{} \quad \text{ otherwise. } \end{array} \right. \end{aligned}$$
(25)

The inequality (22) then follows readily by substituting (25) into (23) and using the definition of \(\kappa \) in the theorem statement. For \(\Xi =\mathbb {R}^m\) we have \(\sigma _\Xi = \chi _{\{0\}}\), and thus the upper bound (22) becomes exact. Finally, if \(\varepsilon =0\), then (10) trivially coincides with (22) under our conventions of extended arithmetic. Thus, the claim follows. \(\square \)

Theorem 6.3 asserts that for \(\Xi =\mathbb {R}^m\), the worst-case expectation (10) of a convex loss function reduces to the sample average of the loss adjusted by the simple correction term \(\kappa \varepsilon \). The following proposition highlights that \(\kappa \) can be interpreted as a measure of the maximum steepness of the loss function. This interpretation has intuitive appeal in view of Definition 3.1.

Proposition 6.5

(Steepness of the loss function) Let \(\kappa \) be defined as in Theorem 6.3.

  1. (i)

    If \(\ell \) is \({\overline{L}}\)-Lipschitz continuous, i.e., if there exists \(\xi ' \in \mathbb {R}^m\) such that \(\ell (\xi ) - \ell (\xi ') \le {\overline{L}}\Vert \xi -\xi '\Vert \) for all \(\xi \in \mathbb {R}^m\), then \(\kappa \le {\overline{L}}\).

  2. (ii)

    If \(\ell \) majorizes an affine function, i.e., if there exists \(\theta \in \mathbb {R}^m\) with \(\Vert \theta \Vert _*=:{\underline{L}}\) and \(\xi ' \in \mathbb {R}^m\) such that \(\ell (\xi ) - \ell (\xi ') \ge \big \langle \theta , \xi -\xi ' \big \rangle \) for all \(\xi \in \mathbb {R}^m\), then \(\kappa \ge {\underline{L}} \).

Proof

The proof follows directly from the definition of conjugacy. As for (i), we have

$$\begin{aligned} \ell ^*(\theta ) = \sup _{\xi \in \mathbb {R}^m} \big \langle \theta , \xi \big \rangle - \ell (\xi )~&\ge \sup _{\xi \in \mathbb {R}^m} \big \langle \theta , \xi \big \rangle - {\overline{L}} \Vert \xi -\xi '\Vert -\ell (\xi ')\\&= \sup _{\xi \in \mathbb {R}^m} \inf _{\Vert z\Vert _*\le {\overline{L}}} \big \langle \theta , \xi \big \rangle - \big \langle z, \xi -\xi ' \big \rangle - \ell (\xi '), \end{aligned}$$

where the last equality follows from the definition of the dual norm. Applying the minimax theorem [4, Proposition 5.5.4] and explicitly carrying out the maximization over \(\xi \) yields

$$\begin{aligned} \ell ^*(\theta ) \ge \left\{ \begin{array}{cl} \big \langle \theta , \xi ' \big \rangle -\ell (\xi ') &{} \quad \text{ if } \Vert \theta \Vert _* \le {\overline{L}}, \\ \infty &{} \quad \text{ otherwise. } \end{array}\right. \end{aligned}$$

Consequently, \(\ell ^*(\theta )\) is infinite for all \(\theta \) with \(\Vert \theta \Vert _*> {\overline{L}}\), which readily implies that the \(\Vert \cdot \Vert _*\)-ball of radius \({\overline{L}}\) contains the effective domain of \(\ell ^*\). Thus, \(\kappa \le {\overline{L}}\).

As for (ii), we have

$$\begin{aligned} \ell ^*(\theta ) = \sup _{\xi \in \mathbb {R}^m} \big \langle \theta , \xi \big \rangle - \ell (\xi )&\le \sup _{\xi \in \mathbb {R}^m} \big \langle \theta , \xi \big \rangle - \big \langle \theta , \xi -\xi ' \big \rangle - \ell (\xi ') \\&= \big \langle \theta , \xi ' \big \rangle - \ell (\xi '), \end{aligned}$$

where the inequality exploits the affine minorant of \(\ell \) postulated in the assertion. Thus, \(\ell ^*(\theta ) \le \big \langle \theta , \xi ' \big \rangle - \ell (\xi ') < \infty \), and \(\theta \) belongs to the effective domain of \(\ell ^*\). We then conclude that \(\kappa \ge \Vert \theta \Vert _* = {\underline{L}}\). \(\square \)

Remark 6.6

(Consistent formulations) If \(\Xi =\mathbb {R}^m\) and the loss function is given by \(\ell (\xi ) = \max _{k \le K}\{\big \langle a_{k}, \xi \big \rangle + b_{k}\}\), then both Corollary 5.1 and Theorem 6.3 offer an exact reformulation of the worst-case expectation (10) in terms of a finite-dimensional convex program. On the one hand, Corollary 5.1 implies that (10) is equivalent to

$$\begin{aligned} \left\{ \begin{array}{clll} \mathop {\min }\limits _{\lambda } &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} \ell (\widehat{\xi }_i)\\ \text {s.t.}&{} \Vert a_k\Vert _* \le \lambda &{} \quad \forall k \le K, \end{array} \right. \end{aligned}$$

which is obtained by setting \(C=0\) and \(d=0\) in (15a). At optimality we have \(\lambda ^\star =\max _{k\le K} \Vert a_k\Vert _*\), which corresponds to the (best) Lipschitz constant of \(\ell (\xi )\) with respect to the norm \(\Vert \cdot \Vert \). On the other hand, Theorem 6.3 implies that (10) is equivalent to (22) with \(\kappa =\lambda ^\star \). Thus, Corollary 5.1 and Theorem 6.3 are consistent.

Remark 6.7

(\(\varepsilon \)-insensitive optimizers) Consider a loss function \(h(x,\xi )\) that is convex in \(\xi \), and assume that \(\Xi =\mathbb {R}^m\). In this case Theorem 6.3 remains valid, but the steepness parameter \(\kappa (x)\) may depend on x. For loss functions whose Lipschitz modulus with respect to \(\xi \) is independent of x (e.g., the newsvendor loss), however, \(\kappa (x)\) is constant. In this case the distributionally robust optimization problem (5) and the SAA problem (4) share the same minimizers irrespective of the Wasserstein radius \(\varepsilon \). This phenomenon could explain why the SAA solutions tend to display a surprisingly strong out-of-sample performance in these problems.
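
For instance, writing the newsvendor loss as \(h(x,\xi )=\max \{c_b(\xi -x),\, c_h(x-\xi )\}\) with backorder cost \(c_b>0\) and holding cost \(c_h>0\) (illustrative parameters, as the loss is not spelled out here), the two affine pieces have slopes \(c_b\) and \(-c_h\) in \(\xi \) for every x, and both are global affine minorants of the maximum. Proposition 6.5 thus implies

$$\begin{aligned} \kappa (x) = \max \{c_b, c_h\} \qquad \text {for all } x, \end{aligned}$$

so that the correction term \(\kappa (x)\varepsilon \) merely shifts the objective of (5) by a constant and leaves its minimizers unchanged.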

7 Numerical results

We validate the theoretical results of this paper in the context of a stylized portfolio selection problem. The subsequent simulation experiments are designed to provide additional insights into the performance guarantees of the proposed distributionally robust optimization scheme.

7.1 Mean-risk portfolio optimization

Consider a capital market consisting of m assets whose yearly returns are captured by the random vector \(\xi = [\xi _1, \ldots , \xi _m]^\intercal \). If short-selling is forbidden, a portfolio is encoded by a vector of percentage weights \(x=[x_1,\ldots ,x_m]^\intercal \) ranging over the probability simplex \(\mathbb {X}=\{x\in {\mathbb {R}}^m_+: \sum _{i=1}^{m}x_i = 1\}\). As portfolio x invests a percentage \(x_i\) of the available capital in asset i for each \(i=1,\ldots ,m\), its return amounts to \(\big \langle x, \xi \big \rangle \). In the remainder we aim to solve the single-stage stochastic program

$$\begin{aligned} J^\star = \inf _{x\in \mathbb {X}} \bigg \{\mathbb {E}^{\mathbb {P}}\big [-\big \langle x, \xi \big \rangle \big ] + \rho \, \mathbb {P}\text {-}\mathrm{CVaR}_{\alpha }\big (-\big \langle x, \xi \big \rangle \big ) \bigg \}, \end{aligned}$$
(26)

which minimizes a weighted sum of the mean and the conditional value-at-risk (CVaR) of the portfolio loss \(-\big \langle x, \xi \big \rangle \), where \(\alpha \in (0,1]\) is referred to as the confidence level of the CVaR, and \(\rho \in \mathbb {R}_+\) quantifies the investor’s risk-aversion. Intuitively, the CVaR at level \(\alpha \) represents the average of the \(\alpha \times 100{\%}\) worst (highest) portfolio losses under the distribution \(\mathbb {P}\). Replacing the CVaR in the above expression with its formal definition [41], we obtain

$$\begin{aligned} J^\star&= \inf _{x\in \mathbb {X}}\bigg \{\mathbb {E}^{\mathbb {P}}\big [-\big \langle x, \xi \big \rangle \big ] + \rho \,\inf _{\tau \in \mathbb {R}} \mathbb {E}^\mathbb {P}\Big [ \tau + {1\over \alpha } \max \big \{ -\big \langle x, \xi \big \rangle - \tau , 0 \big \}\Big ] \bigg \} \\&= \inf _{x\in \mathbb {X}, \tau \in \mathbb {R}} \mathbb {E}^\mathbb {P}\Big [ \max _{k\le K} \, a_k\big \langle x, \xi \big \rangle +b_k \tau \Big ], \end{aligned}$$

where \(K=2\), \(a_1= -1\), \(a_2= -1-\frac{\rho }{\alpha }\), \(b_1=\rho \) and \(b_2= \rho (1-\frac{1}{\alpha })\). An investor who is unaware of the distribution \(\mathbb {P}\) but has observed a dataset \(\widehat{\Xi }_N\) of N historical samples from \(\mathbb {P}\) and knows that the support of \(\mathbb {P}\) is contained in \(\Xi =\{\xi \in \mathbb {R}^m:C\xi \le d\}\) might solve the distributionally robust counterpart of (26) with respect to the Wasserstein ambiguity set \(\mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\), that is,

$$\begin{aligned} {\widehat{J}_N(\varepsilon )} {:=}\inf _{x\in \mathbb {X}, \tau \in \mathbb {R}} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {E}^\mathbb {Q}\Big [ \max _{k\le K} \, a_k \big \langle x, \xi \big \rangle +b_k \tau \Big ], \end{aligned}$$

where we make the dependence on the Wasserstein radius \(\varepsilon \) explicit. By Corollary 5.1 we know that

$$\begin{aligned} {\widehat{J}_N(\varepsilon )} = \left\{ \begin{array}{lclll} &{} \inf \limits _{x,\tau ,\lambda ,s_i, \gamma _{ik}} &{} \lambda \varepsilon + {1 \over N}\sum \limits _{i = 1}^{N} s_i \\ &{} \text {s.t.}&{} x\in \mathbb {X}\\ &{} &{} b_k\tau +a_k\big \langle x, \widehat{\xi }_i \big \rangle + \big \langle \gamma _{ik}, d-C\widehat{\xi }_i \big \rangle \le s_i &{} \quad \forall i \le N, &{}\forall k \le K\\ &{} &{} \Vert C^\intercal \gamma _{ik} - a_{k}x\Vert _* \le \lambda &{} \quad \forall i \le N, &{} \forall k \le K \\ &{} &{} \gamma _{ik} \ge 0&{} \quad \forall i \le N, &{} \forall k \le K. \end{array}\right. \end{aligned}$$
(27)
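
The finite convex program (27) is easy to set up with an off-the-shelf modeling layer. The sketch below is our illustration (not the authors' code); it assumes the CVXPY package, the loss coefficients \(a_k, b_k\) defined above, a support set \(\Xi =\{\xi : C\xi \le d\}\), and the \(\infty \)-norm as the dual norm, which corresponds to measuring distances in the uncertainty space with the 1-norm as in Sect. 7.2.

```python
import cvxpy as cp
import numpy as np

def solve_dro_portfolio(xi_hat, eps, alpha=0.2, rho=10.0, C=None, d=None):
    """Solve the distributionally robust mean-CVaR program (27).

    xi_hat : (N, m) array of training return samples.
    eps    : Wasserstein radius epsilon.
    C, d   : support set Xi = {xi : C xi <= d}; pass None for Xi = R^m.
    The dual norm ||.||_* is taken to be the infinity-norm (1-norm Wasserstein metric).
    Returns the optimal portfolio x and the certificate J_N(eps).
    """
    N, m = xi_hat.shape
    a = [-1.0, -1.0 - rho / alpha]             # slopes a_k of the piecewise-affine loss
    b = [rho, rho * (1.0 - 1.0 / alpha)]       # coefficients b_k multiplying tau
    K = len(a)
    if C is None:                              # Xi = R^m encoded via a vacuous constraint
        C, d = np.zeros((1, m)), np.zeros(1)

    x = cp.Variable(m, nonneg=True)
    tau, lam = cp.Variable(), cp.Variable(nonneg=True)
    s = cp.Variable(N)
    gamma = [[cp.Variable(C.shape[0], nonneg=True) for _ in range(K)] for _ in range(N)]

    cons = [cp.sum(x) == 1]                    # x ranges over the probability simplex
    for i in range(N):
        for k in range(K):
            g = gamma[i][k]
            cons.append(b[k] * tau + a[k] * (xi_hat[i] @ x)
                        + g @ (d - C @ xi_hat[i]) <= s[i])
            cons.append(cp.norm(C.T @ g - a[k] * x, 'inf') <= lam)
    problem = cp.Problem(cp.Minimize(lam * eps + cp.sum(s) / N), cons)
    problem.solve()
    return x.value, problem.value
```

For \(\Xi =\mathbb {R}^m\) the variables \(\gamma _{ik}\) are redundant, and the dual-norm constraint reduces to \(|a_k|\,\Vert x\Vert _\infty \le \lambda \).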

Before proceeding with the numerical analysis of this problem, we provide some analytical insights into its optimal solutions when there is significant ambiguity. In what follows we keep the training dataset fixed and let \(\widehat{x}_N(\varepsilon )\) be an optimal distributionally robust portfolio corresponding to the Wasserstein ambiguity set of radius \(\varepsilon \). We will now show that, for natural choices of the ambiguity set, \(\widehat{x}_N(\varepsilon )\) converges to the equally weighted portfolio \(\frac{1}{m}e\) as \(\varepsilon \) tends to infinity, where \(e {:=}(1,\ldots ,1)^\intercal \). The optimality of the equally weighted portfolio under high ambiguity was first demonstrated in [37] using analytical methods. We identify this result here as an immediate consequence of Theorem 4.2, which is primarily a computational result.

For any non-empty set \(S\subseteq \mathbb {R}^m\) we denote by \(\text{ recc }(S) {:=}\{y\in \mathbb {R}^m:x+\lambda y\in S~\forall x\in S, ~\forall \lambda \ge 0\}\) the recession cone and by \(S^\circ {:=}\{y\in \mathbb {R}^m:\big \langle y, x \big \rangle \le 0~\forall x\in S\}\) the polar cone of S.

Lemma 7.1

If \(\{\varepsilon _k\}_{k\in {\mathbb {N}}}\subset \mathbb {R}_+\) tends to infinity, then any accumulation point \(x^\star \) of \(\big \{\widehat{x}_N(\varepsilon _k)\big \}_{k\in {\mathbb {N}}}\) is a portfolio that has minimum distance to \((\text{ recc }(\Xi ))^\circ \) with respect to \(\Vert \cdot \Vert _*\).

Proof

Note first that \(\widehat{x}_N(\varepsilon _k)\), \(k\in {\mathbb {N}}\), and \(x^\star \) exist because \(\mathbb {X}\) is compact. For large Wasserstein radii \(\varepsilon \), the term \(\lambda \varepsilon \) dominates the objective function of problem (27). Using standard epi-convergence results [42, Section 7.E], one can thus show that

$$\begin{aligned} x^\star \,&\in \arg \min _{x\in \mathbb {X}}~ \min _{\gamma _{ik}\ge 0}~ \max _{i\le N,\, k\le K} \Vert C^\intercal \gamma _{ik} - a_{k}x\Vert _*\\&= \arg \min _{x\in \mathbb {X}}~ \max _{i\le N,\, k\le K}~ \min _{\gamma \ge 0} ~ \Vert C^\intercal \gamma + |a_{k}|\,x\Vert _* \\&= \arg \min _{x\in \mathbb {X}} ~ \min _{\gamma \ge 0} ~ \Vert C^\intercal \gamma + x\Vert _* ~ \max _{k\le K} |a_k| \\&= \arg \min _{x\in \mathbb {X}} ~ \min _{\gamma \ge 0} ~ \Vert C^\intercal \gamma + x\Vert _*, \end{aligned}$$

where the first equality follows from the fact that \(a_k<0\) for all \(k\le K\), the second equality uses the substitution \(\gamma \rightarrow \gamma |a_k|\), and the last equality holds because the set of minimizers of an optimization problem is not affected by a positive scaling of the objective function. Thus, \(x^\star \) is the portfolio nearest to the cone \({\mathcal {C}}=\{C^\intercal \gamma :\gamma \ge 0\}\). The claim now follows as the polar cone

$$\begin{aligned} {\mathcal {C}}^\circ&{:=}\, \{y\in \mathbb {R}^m:y^\intercal x\le 0~\forall x\in {\mathcal {C}}\}= \{y\in \mathbb {R}^m:y^\intercal C^\intercal \gamma \le 0~\forall \gamma \ge 0\}\\&= \{y\in \mathbb {R}^m: Cy\le 0\} \end{aligned}$$

is readily recognized as the recession cone of \(\Xi \) and as \({\mathcal {C}}=({\mathcal {C}}^\circ )^\circ \). \(\square \)

Proposition 7.2

(Equally weighted portfolio) Assume that the Wasserstein metric is defined in terms of the p-norm in the uncertainty space for some \(p\in [1,\infty )\). If \(\{\varepsilon _k\}_{k\in {\mathbb {N}}}\subset \mathbb {R}_+\) tends to infinity, then \(\big \{\widehat{x}_N(\varepsilon _k)\big \}_{k\in {\mathbb {N}}}\) converges to the equally weighted portfolio \(x^\star =\frac{1}{m}e\) provided that the uncertainty set is given by

  1. (i)

    the entire space, i.e., \(\Xi =\mathbb {R}^m\), or

  2. (ii)

    the nonnegative orthant shifted by \(-e\), i.e., \(\Xi =\{\xi \in \mathbb {R}^m:\xi \ge -e\}\), which captures the idea that no asset can lose more than \(100\%\) of its value.

Proof

(i) One easily verifies from the definitions that \((\text{ recc }(\Xi ))^\circ =\{0\}\). Moreover, we have \(\Vert \cdot \Vert _*=\Vert \cdot \Vert _q\) where \(\frac{1}{p}+\frac{1}{q}=1\). As \(p\in [1,\infty )\), we conclude that \(q\in (1,\infty ]\), and thus the unique nearest portfolio to \((\text{ recc }(\Xi ))^\circ \) with respect to \(\Vert \cdot \Vert _*\) is \(x^\star =\frac{1}{m}e\). The claim then follows from Lemma 7.1. Assertion (ii) follows in a similar manner from the observation that \((\text{ recc }(\Xi ))^\circ \) is now the non-positive orthant. \(\square \)

With some extra effort one can show that for every \(p\in [1,\infty )\) there is a threshold \({\bar{\varepsilon }}>0\) with \(\widehat{x}_N(\varepsilon )=x^\star \) for all \(\varepsilon \ge {\bar{\varepsilon }}\), see [37, Proposition 3]. Moreover, for \(p\in \{1,2\}\) the threshold \({\bar{\varepsilon }}\) is known analytically.

7.2 Simulation results: portfolio optimization

Our experiments are based on a market with \(m=10\) assets considered in [7, Section 7.5]. In view of the capital asset pricing model we may assume that the return \(\xi _i\) is decomposable into a systematic risk factor \(\psi \sim {\mathcal {N}}(0,2\%)\) common to all assets and an unsystematic or idiosyncratic risk factor \(\zeta _i\sim {\mathcal {N}}(i\times 3\%, i\times 2.5\%)\) specific to asset i. Thus, we set \(\xi _i=\psi +\zeta _i\), where \(\psi \) and the idiosyncratic risk factors \(\zeta _i\), \(i=1,\ldots ,m\), constitute independent normal random variables. By construction, assets with higher indices promise higher mean returns at a higher risk. Note that the given moments of the risk factors completely determine the distribution \(\mathbb {P}\) of \(\xi \). This distribution has support \(\Xi =\mathbb {R}^m\) and satisfies Assumption 3.3 for the tail exponent \(a=1\), say. We also set \(\alpha =20\%\) and \(\rho =10\) in all numerical experiments, and we use the 1-norm to measure distances in the uncertainty space. Thus, \(\Vert \cdot \Vert _*\) is the \(\infty \)-norm, whereby (27) reduces to a linear program.
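
For reference, the synthetic market can be simulated in a few lines. The sketch below is our illustration and reads the second parameter of each normal distribution as a standard deviation; this reading is an assumption, as the text leaves the parameterization implicit.

```python
import numpy as np

def simulate_returns(n_samples, m=10, rng=None):
    """Generate samples of xi_i = psi + zeta_i for the synthetic market of Sect. 7.2.

    psi    ~ N(0, 2%)             systematic risk factor, common to all assets
    zeta_i ~ N(i * 3%, i * 2.5%)  idiosyncratic risk factor of asset i, i = 1,...,m
    Note: the second parameter of each normal distribution is interpreted here as a
    standard deviation (an assumption, not stated explicitly in the text).
    """
    rng = np.random.default_rng() if rng is None else rng
    i = np.arange(1, m + 1)
    psi = rng.normal(0.0, 0.02, size=(n_samples, 1))             # systematic factor
    zeta = rng.normal(i * 0.03, i * 0.025, size=(n_samples, m))  # idiosyncratic factors
    return psi + zeta                                            # (n_samples, m) returns
```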

Fig. 4  Optimal portfolio composition as a function of the Wasserstein radius \(\varepsilon \) averaged over 200 simulations; the portfolio weights are depicted in ascending order, i.e., the weight of asset 1 at the bottom (dark blue area) and that of asset 10 at the top (dark red area). (a) \(N=30\) training samples. (b) \(N=300\) training samples. (c) \(N=3000\) training samples (color figure online)

7.2.1 Impact of the Wasserstein radius

In the first experiment we investigate the impact of the Wasserstein radius \(\varepsilon \) on the optimal distributionally robust portfolios and their out-of-sample performance. We solve problem (27) using training datasets of cardinality \(N \in \{30, 300, 3000\}\). Figure 4 visualizes the corresponding optimal portfolio weights \(\widehat{x}_N(\varepsilon )\) as a function of \(\varepsilon \), averaged over 200 independent simulation runs. Our numerical results confirm the theoretical insight of Proposition 7.2 that the optimal distributionally robust portfolios converge to the equally weighted portfolio as the Wasserstein radius \(\varepsilon \) increases; see also [37].

The out-of-sample performance

$$\begin{aligned} J\big (\widehat{x}_N(\varepsilon )\big ) \,{:=}\, \mathbb {E}^{\mathbb {P}}\big [-\big \langle \widehat{x}_N(\varepsilon ), \xi \big \rangle \big ] + \rho \, \mathbb {P}\text {-}\mathrm{CVaR}_{\alpha }\big (-\big \langle \widehat{x}_N(\varepsilon ), \xi \big \rangle \big ) \end{aligned}$$

of any fixed distributionally robust portfolio \(\widehat{x}_N(\varepsilon )\) can be computed analytically because \(\mathbb {P}\) constitutes a normal distribution by design, see, e.g., [41, p. 29]. Figure 5 shows the tubes between the 20 and 80% quantiles (shaded areas) and the means (solid lines) of the out-of-sample performance \(J\big (\widehat{x}_N(\varepsilon )\big )\) as a function of \(\varepsilon \)—estimated using 200 independent simulation runs. We observe that the out-of-sample performance improves (decreases) up to a critical Wasserstein radius \(\varepsilon _\mathrm{crit}\) and then deteriorates (increases). This stylized fact was observed consistently across all of our simulations and provides an empirical justification for adopting a distributionally robust approach.
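
Concretely, since the portfolio loss \(-\big \langle x, \xi \big \rangle \) is normally distributed, \(J(x)\) follows from the standard closed form for the CVaR of a normal random variable. The sketch below is our illustration of this evaluation; the mean vector and covariance matrix are those implied by the factor model above, under the same standard-deviation reading as in the earlier sketch.

```python
import numpy as np
from scipy.stats import norm

def out_of_sample_performance(x, alpha=0.2, rho=10.0, m=10):
    """Evaluate J(x) = E[-<x, xi>] + rho * CVaR_alpha(-<x, xi>) in closed form.

    For a normal loss L ~ N(mu, sigma^2), CVaR_alpha(L) = mu + sigma * pdf(z) / alpha
    with z = Phi^{-1}(1 - alpha).  The moments of xi follow the factor model of
    Sect. 7.2 (second parameters read as standard deviations).
    """
    i = np.arange(1, m + 1)
    mean_xi = i * 0.03                                                # E[xi_i] = i * 3%
    cov_xi = np.full((m, m), 0.02 ** 2) + np.diag((i * 0.025) ** 2)   # common + idiosyncratic
    mu = -x @ mean_xi                                                 # mean of the loss
    sigma = np.sqrt(x @ cov_xi @ x)                                   # std dev of the loss
    cvar = mu + sigma * norm.pdf(norm.ppf(1 - alpha)) / alpha
    return mu + rho * cvar
```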

Figure 5 also visualizes the reliability of the performance guarantees offered by our distributionally robust portfolio model. Specifically, the dashed lines represent the empirical probability of the event \(J\big (\widehat{x}_N(\varepsilon )\big ) \le \widehat{J}_N(\varepsilon )\) with respect to 200 independent training datasets. We find that the reliability is nondecreasing in \(\varepsilon \). This observation has intuitive appeal because \(\widehat{J}_N(\varepsilon ) \ge J(\widehat{x}_N(\varepsilon ))\) whenever \(\mathbb {P}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)\), and the latter event becomes increasingly likely as \(\varepsilon \) grows. Figure 5 also indicates that the reliability of the performance guarantee rises sharply towards 1 near the critical Wasserstein radius \(\varepsilon _\mathrm{crit}\). Hence, the out-of-sample performance of the distributionally robust portfolios improves as long as the reliability of the performance guarantee is noticeably smaller than 1 and deteriorates when it saturates at 1. Even though this observation was made consistently across all simulations, we were unable to validate it theoretically.

Fig. 5  Out-of-sample performance \(J(\widehat{x}_N(\varepsilon )) \) (left axis, solid line and shaded area) and reliability \(\mathbb {P}^N[J(\widehat{x}_N(\varepsilon )) \le \widehat{J}_N(\varepsilon )]\) (right axis, dashed line) as a function of the Wasserstein radius \(\varepsilon \) and estimated on the basis of 200 simulations. (a) \(N=30\) training samples. (b) \(N=300\) training samples. (c) \(N=3000\) training samples

7.2.2 Portfolios driven by out-of-sample performance

Different Wasserstein radii \(\varepsilon \) may result in robust portfolios \(\widehat{x}_N(\varepsilon )\) with vastly different out-of-sample performance \(J(\widehat{x}_N(\varepsilon ))\). Ideally, one should select the radius \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) that minimizes \(J(\widehat{x}_N(\varepsilon ))\) over all \(\varepsilon \ge 0\); note that \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) inherits the dependence on the training data from \(J(\widehat{x}_N(\varepsilon ))\). As the true distribution \(\mathbb {P}\) is unknown, however, it is impossible to evaluate and minimize \(J(\widehat{x}_N(\varepsilon ))\). In practice, the best we can hope for is to approximate \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) using the training data. Statistics offers several methods to accomplish this goal:

  • Holdout method: Partition \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\) into a training dataset of size \(N_T\) and a validation dataset of size \(N_V=N-N_T\). Using only the training dataset, solve (27) for a large but finite number of candidate radii \(\varepsilon \) to obtain \({\widehat{x}}_{N_T}(\varepsilon )\). Use the validation dataset to estimate the out-of-sample performance of \({\widehat{x}}_{N_T}(\varepsilon )\) via the sample average approximation. Set \({\widehat{\varepsilon }}_N^\mathrm{\; hm}\) to any \(\varepsilon \) that minimizes this quantity. Report \(\widehat{x}_N^\mathrm{\; hm}={\widehat{x}}_{N_T}({\widehat{\varepsilon }}_N^\mathrm{\; hm})\) as the data-driven solution and \(\widehat{J}_N^\mathrm{\; hm}={\widehat{J}}_{N_T}({\widehat{\varepsilon }}_N^\mathrm{\; hm})\) as the corresponding certificate.

  • k -fold cross validation: Partition \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\) into k subsets, and run the holdout method k times. In each run, use exactly one subset as the validation dataset and merge the remaining \(k-1\) subsets to a training dataset. Set \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) to the average of the Wasserstein radii obtained from the k holdout runs. Resolve (27) with \(\varepsilon ={\widehat{\varepsilon }}_N^\mathrm{\; cv}\) using all N samples, and report \(\widehat{x}_N^\mathrm{\; cv}=\widehat{x}_N({\widehat{\varepsilon }}_N^\mathrm{\; cv})\) as the data-driven solution and \(\widehat{J}_N^\mathrm{\; cv}=\widehat{J}_N{(\widehat{\varepsilon }}_N^\mathrm{\; cv})\) as the corresponding certificate.

The holdout method is computationally cheaper, but cross validation has superior statistical properties. There are several other methods to estimate the best Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\). By construction, however, no method can provide a radius \({\widehat{\varepsilon }}_N\) such that \(\widehat{x}_N({\widehat{\varepsilon }}_N)\) has a better out-of-sample performance than \(\widehat{x}_N({\widehat{\varepsilon }}_N^\mathrm{\; opt})\).
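
A compact sketch of the holdout method is given below (our illustration; `solve_dro_portfolio` refers to the hypothetical helper sketched earlier, the candidate radii are supplied by the user, and the out-of-sample performance is estimated on the validation dataset via the sample average approximation):

```python
import numpy as np

def holdout_radius(xi_hat, radii, train_frac=0.8, alpha=0.2, rho=10.0):
    """Holdout method: pick the Wasserstein radius with the best validation performance.

    xi_hat : (N, m) array of samples; radii : iterable of candidate Wasserstein radii.
    Returns the selected radius, the reported portfolio and its certificate.
    """
    N = xi_hat.shape[0]
    n_train = int(train_frac * N)
    train, val = xi_hat[:n_train], xi_hat[n_train:]

    def saa_objective(x, data):                 # SAA estimate of the mean-CVaR objective
        losses = -data @ x
        var = np.quantile(losses, 1 - alpha)
        return losses.mean() + rho * (var + np.mean(np.maximum(losses - var, 0.0)) / alpha)

    best = None
    for eps in radii:
        x_eps, cert = solve_dro_portfolio(train, eps, alpha=alpha, rho=rho)
        score = saa_objective(x_eps, val)       # validation estimate of J(x_eps)
        if best is None or score < best[0]:
            best = (score, eps, x_eps, cert)
    _, eps_hm, x_hm, cert_hm = best
    return eps_hm, x_hm, cert_hm                # data-driven solution and certificate
```

As described above, k-fold cross validation repeats this loop over the k folds, averages the selected radii, and re-solves (27) with the averaged radius on all N samples.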

In all experiments we compare the distributionally robust approach based on the Wasserstein ambiguity set with the classical sample average approximation (SAA) and with a state-of-the-art data-driven distributionally robust approach, where the ambiguity set is defined via a linear-convex ordering (LCX)-based goodness-of-fit test [7, Section 3.3.2]. The size of the LCX ambiguity set is determined by a single parameter, which should be tuned to optimize the out-of-sample performance. While the best parameter value is unavailable, it can again be estimated using the holdout method or via cross validation. To the best of our knowledge, the LCX approach represents the only existing data-driven distributionally robust approach for continuous uncertainty spaces that enjoys strong finite-sample guarantees, asymptotic consistency, and computational tractability.

To keep the computational burden manageable, in all experiments we select the Wasserstein radius as well as the LCX size parameter from within the discrete set \({\mathcal {E}}=\{\varepsilon =b\cdot 10^c:b\in \{0,\ldots ,9\},\; c\in \{-3,-2,-1\}\}\) instead of \({\mathbb {R}}_+\). We have verified that refining or extending \(\mathcal E\) has only a marginal impact on our results, which indicates that \({\mathcal {E}}\) provides a sufficiently rich approximation of \(\mathbb {R}_+\).

Fig. 6  Out-of-sample performance \(J(\widehat{x}_N)\), certificate \(\widehat{J}_N\), and certificate reliability \(\mathbb {P}^N\big [J(\widehat{x}_N) \le \widehat{J}_N\big ]\) for the performance-driven SAA, LCX and Wasserstein solutions as a function of N. (a) Holdout method, (b) Holdout method, (c) Holdout method, (d) k-fold cross validation, (e) k-fold cross validation, (f) k-fold cross validation, (g) optimal size, (h) optimal size, (i) optimal size (color figure online)

In Fig. 6a–c the sizes of the (LCX and Wasserstein) ambiguity sets are determined via the holdout method, where \(80\%\) of the data are used for training and \(20\%\) for validation. Figure 6a visualizes the tube between the 20 and \(80\%\) quantiles (shaded areas) as well as the mean value (solid lines) of the out-of-sample performance \(J(\widehat{x}_N)\) as a function of the sample size N and based on 200 independent simulation runs, where \(\widehat{x}_N\) is set to the minimizer of the SAA (blue), LCX (purple) and Wasserstein (green) problems, respectively. The constant dashed line represents the optimal value \(J^\star \) of the original stochastic program (1), which is computed through an SAA problem with \(N = 10^6\) samples. We observe that the Wasserstein solutions tend to be superior to the SAA and LCX solutions in terms of out-of-sample performance.

Figure 6b shows the optimal values \(\widehat{J}_N\) of the SAA, LCX and Wasserstein problems, where the sizes of the ambiguity sets are chosen via the holdout method. Unlike Fig. 6a, Fig. 6b thus reports in-sample estimates of the achievable portfolio performance. As expected, the SAA approach is over-optimistic due to the optimizer’s curse, while the LCX and Wasserstein approaches err on the side of caution. All three methods are known to enjoy asymptotic consistency, which is in agreement with all in-sample and out-of-sample results.

Figure 6c visualizes the reliability of the different performance certificates, that is, the empirical probability of the event \(J(\widehat{x}_N) \le \widehat{J}_N\) evaluated over 200 independent simulation runs. Here, \(\widehat{x}_N\) represents either an optimal portfolio of the SAA, LCX or Wasserstein problems, while \(\widehat{J}_N\) denotes the corresponding optimal value. The optimal SAA portfolios display a disappointing out-of-sample performance relative to the optimistically biased minimum of the SAA problem—particularly when the training data is scarce. In contrast, the out-of-sample performance of the optimal LCX and Wasserstein portfolios often undershoots \(\widehat{J}_N\).

Figure 6d–f show the same graphs as Fig. 6a–c, but now the sizes of the ambiguity sets are determined via k-fold cross validation with \(k=5\). In this case, the out-of-sample performance of both distributionally robust methods improves slightly, while the corresponding certificates and their reliabilities increase significantly with respect to the naïve holdout method. However, these improvements come at the expense of a k-fold increase in the computational cost.

One could think of numerous other statistical methods to select the size of the Wasserstein ambiguity set. As discussed above, however, if the ultimate goal is to minimize the out-of-sample performance of \(\widehat{x}_N(\varepsilon )\), then the best possible choice is \(\varepsilon ={\widehat{\varepsilon }}_N^\mathrm{\; opt}\). Similarly, one can construct a size parameter for the LCX ambiguity set that leads to the best possible out-of-sample performance of any LCX solution. We emphasize that these optimal Wasserstein radii and LCX size parameters are not available in practice because computing \(J(\widehat{x}_N(\varepsilon ))\) requires knowledge of the data-generating distribution. In our experiments we evaluate \(J(\widehat{x}_N(\varepsilon ))\) to high accuracy for every fixed \(\varepsilon \in \mathcal {E}\) using \(2\cdot 10^5\) validation samples, which are independent of the (far fewer) training samples used to compute \(\widehat{x}_N(\varepsilon )\). Figure 6g–i show the same graphs as Fig. 6a–c for optimally sized ambiguity sets. By construction, no method for sizing the Wasserstein or LCX ambiguity sets can result in a better out-of-sample performance. In this sense, the graphs in Fig. 6g capture the fundamental limitations of the different distributionally robust schemes.

7.2.3 Portfolios driven by reliability

In Sect. 7.2.2 the Wasserstein radii and LCX size parameters were calibrated with the goal to achieve the best out-of-sample performance. Figure 6c, f, i reveal, however, that by optimizing the out-of-sample performance one may sacrifice reliability. An alternative objective more in line with the general philosophy of Sect. 2 would be to choose Wasserstein radii that guarantee a prescribed reliability level. Thus, for a given \(\beta \in [0,1]\) we should find the smallest Wasserstein radius \(\varepsilon \ge 0\) for which the optimal value \(\widehat{J}_N(\varepsilon )\) of (27) provides an upper \(1-\beta \) confidence bound on the out-of-sample performance \(J(\widehat{x}_N(\varepsilon ))\) of its optimal solution. As the true distribution \(\mathbb {P}\) is unknown, however, the optimal Wasserstein radius corresponding to a given \(\beta \) cannot be computed exactly. Instead, we must derive an estimator \({\widehat{\varepsilon }}_N^{\; \beta }\) that depends on the training data. We construct \({\widehat{\varepsilon }}_N^{\; \beta }\) and the corresponding reliability-driven portfolio via bootstrapping as follows:

  1. (1)

    Construct k resamples of size N (with replacement) from the original training dataset. It is well known that, as N grows, the probability that any fixed training data point appears in a particular resample converges to \(\frac{e-1}{e}\approx \frac{2}{3}\). Thus, about \(\frac{N}{3}\) training samples are absent from any resample. We collect all unused samples in a validation dataset.

  2. (2)

    For each resample \(\kappa =1,\ldots , k\) and \(\varepsilon \ge 0\), solve problem (27) using the Wasserstein ball of radius \(\varepsilon \) around the empirical distribution \(\widehat{\mathbb {P}}_N^\kappa \) on the \(\kappa \)-th resample. The resulting optimal decision and optimal value are denoted as \({\widehat{x}}_N^\kappa (\varepsilon )\) and \({\widehat{J}}_N^\kappa (\varepsilon )\), respectively. Next, estimate the out-of-sample performance \(J(\widehat{x}_N^\kappa (\varepsilon ))\) of \(\widehat{x}_N^\kappa (\varepsilon )\) using the sample average over the \(\kappa \)-th validation dataset.

  3. (3)

    Set \({\widehat{\varepsilon }}_N^{\; \beta }\) to the smallest \(\varepsilon \ge 0\) so that the certificate \({\widehat{J}}_N^\kappa (\varepsilon )\) exceeds the estimate of \(J({\widehat{x}}_N^\kappa (\varepsilon ))\) in at least \((1-\beta )\times k\) different resamples.

  4. (4)

    Compute the data-driven portfolio \(\widehat{x}_N=\widehat{x}_N({\widehat{\varepsilon }}_N^{\; \beta })\) and the corresponding certificate \(\widehat{J}_N={\widehat{J}}_N({\widehat{\varepsilon }}_N^{\; \beta })\) using the original training dataset.

As in Sect. 7.2.2, we compare the Wasserstein approach with the LCX and SAA approaches. Specifically, by using bootstrapping, we calibrate the size of the LCX ambiguity set so as to guarantee a desired reliability level \(1-\beta \). The SAA problem, on the other hand, has no free parameter that can be tuned to meet a prescribed reliability target. Nevertheless, we can construct a meaningful certificate of the form \(\widehat{J}_N(\Delta ):=\widehat{J}_{\mathrm{SAA}}+\Delta \) for the SAA portfolio by adding a non-negative constant to the optimal value of the SAA problem. Our aim is to find the smallest offset \(\Delta \ge 0\) with the property that \(\widehat{J}_N(\Delta )\) provides an upper \(1-\beta \) confidence bound on the out-of-sample performance \(J(\widehat{x}_{\mathrm{SAA}})\) of the optimal SAA portfolio \(\widehat{x}_{\mathrm{SAA}}\). The optimal offset corresponding to a given \(\beta \) cannot be computed exactly. Instead, we must derive an estimator \({\widehat{\Delta }}_N^{\; \beta }\) that depends on the training data. Such an estimator can be found through a simple variant of the above bootstrapping procedure.
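
A minimal sketch of the bootstrapping procedure for the Wasserstein radius is given below (our illustration; `solve_dro_portfolio` is the hypothetical helper sketched in Sect. 7.1, the SAA estimate of the mean-CVaR objective is as in the holdout sketch, and the candidate radii are assumed to be sorted in increasing order):

```python
import numpy as np

def reliability_driven_radius(xi_hat, radii, beta=0.1, k=50, alpha=0.2, rho=10.0, rng=None):
    """Bootstrap the smallest candidate radius whose certificate covers the estimated
    out-of-sample performance in at least (1 - beta) * k resamples (steps 1-4 above)."""
    rng = np.random.default_rng() if rng is None else rng
    N = xi_hat.shape[0]

    def saa_objective(x, data):                     # SAA estimate of the mean-CVaR objective
        losses = -data @ x
        var = np.quantile(losses, 1 - alpha)
        return losses.mean() + rho * (var + np.mean(np.maximum(losses - var, 0.0)) / alpha)

    coverage = np.zeros(len(radii), dtype=int)      # per radius: resamples in which the
    for _ in range(k):                              # certificate covers the validation estimate
        idx = rng.integers(N, size=N)               # resample of size N with replacement
        unused = np.setdiff1d(np.arange(N), idx)    # roughly N/3 samples form the validation set
        resample, validation = xi_hat[idx], xi_hat[unused]
        for j, eps in enumerate(radii):
            x_eps, cert = solve_dro_portfolio(resample, eps, alpha=alpha, rho=rho)
            if cert >= saa_objective(x_eps, validation):
                coverage[j] += 1

    qualified = np.nonzero(coverage >= (1 - beta) * k)[0]
    eps_beta = radii[qualified[0]] if len(qualified) else radii[-1]  # fallback: largest candidate
    x_N, cert_N = solve_dro_portfolio(xi_hat, eps_beta, alpha=alpha, rho=rho)
    return eps_beta, x_N, cert_N
```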

In all experiments we set the number of resamples to \(k=50\). Figure 7a–c visualize the out-of-sample performance, the certificate and the empirical reliability of the reliability-driven portfolios obtained with the SAA, LCX and Wasserstein approaches, respectively, for the reliability target \(1-\beta =90\%\) and based on 200 independent simulation runs. Figure 7d–f show the same graphs as Fig. 7a–c but for the reliability target \(1-\beta =75\%\). We observe that the new SAA certificate now overestimates the true optimal value of the portfolio problem. Moreover, while the empirical reliability of the SAA solution now closely matches the desired reliability target, the empirical reliabilities of the LCX and Wasserstein solutions are similar but noticeably exceed the prescribed reliability threshold. A possible explanation for this phenomenon is that the k resamples generated by the bootstrapping algorithm are not independent, which may give rise to a systematic bias in estimating the Wasserstein radii required for the desired reliability levels.

Fig. 7  Out-of-sample performance \(J(\widehat{x}_N)\), certificate \(\widehat{J}_N\), and certificate reliability \(\mathbb {P}^N\big [J(\widehat{x}_N) \le \widehat{J}_N\big ]\) for the reliability-driven SAA, LCX and Wasserstein portfolios as a function of N. (a) \(\beta = 10\%\), (b) \(\beta = 10\%\), (c) \(\beta = 10\%\), (d) \(\beta = 25\%\), (e) \(\beta = 25\%\), (f) \(\beta = 25\%\)

Fig. 8  Optimal performance-driven Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) and its estimates \({\widehat{\varepsilon }}_N^\mathrm{\; hm}\) and \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) obtained via the holdout method and k-fold cross validation, respectively, as well as the reliability-driven Wasserstein radius \({\widehat{\varepsilon }}_N^{\beta }\) for \(\beta \in \{10\%,25\%\}\) obtained via bootstrapping

7.2.4 Impact of the sample size on the Wasserstein radius

It is instructive to analyze the dependence of the Wasserstein radii on the sample size N for different data-driven schemes. As for the performance-driven portfolios from Sect. 7.2.2, Fig. 8 depicts the best possible Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; opt}\) as well as the Wasserstein radii \({\widehat{\varepsilon }}_N^\mathrm{\; hm}\) and \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) obtained by the holdout method and via k-fold cross validation, respectively. As for the reliability-driven portfolios from Sect. 7.2.3, Fig. 8 further depicts the Wasserstein radii \({\widehat{\varepsilon }}_N^{\beta }\), for \(\beta \in \{10\%,25\%\}\), obtained by bootstrapping. All results are averaged across 200 independent simulation runs. As expected from Theorem 3.6, all Wasserstein radii tend to zero as N increases. Moreover, the convergence rate is approximately equal to \(N^{-\frac{1}{2}}\). This rate is likely to be optimal. Indeed, if \(\mathbb {X}\) is a singleton, then every quantile of the sample average estimator \(\widehat{J}_{\mathrm{SAA}}\) converges to \(J^\star \) at rate \(N^{-\frac{1}{2}}\) due to the central limit theorem. Thus, if \({\widehat{\varepsilon }}_N= o(N^{-\frac{1}{2}})\), then \(\widehat{J}_N\) also converges to \(J^\star \) at leading order \(N^{-\frac{1}{2}}\) by Theorem 6.3, which applies as the loss function is convex. This indicates that the a priori rate \(N^{-\frac{1}{m}}\) suggested by Theorem 3.4 is too pessimistic in practice.

7.3 Simulation results: uncertainty quantification

Investors often wish to determine the probability that a given portfolio will outperform various benchmark indices or assets. Our results on uncertainty quantification developed in Sect. 5.2 enable us to compute this probability in a meaningful way—solely on the basis of the training dataset.

Assume for example that we wish to quantify the probability that any data-driven portfolio \(\widehat{x}_N\) outperforms the three most risky assets in the market jointly. Thus, we should compute the probability of the closed polytope

$$\begin{aligned} {\widehat{\mathbb {A}}} = \Big \{\xi \in \mathbb {R}^m ~:~ \big \langle \widehat{x}_N, \xi \big \rangle \ge \xi _i ~ \forall i=8,9,10 \Big \}. \end{aligned}$$

As the true distribution \(\mathbb {P}\) is unknown, the probability \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\) cannot be evaluated exactly. Note that \({\widehat{\mathbb {A}}}\) as well as \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\) constitute random objects that depend on \(\widehat{x}_N\) and thus on the training data. Using the same training dataset that was used to compute \(\widehat{x}_N\), however, we may estimate \(\mathbb {P}[\xi \in {\widehat{\mathbb {A}}}]\) from above and below by

$$\begin{aligned} \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {Q}\left[ \xi \in {\widehat{\mathbb {A}}}\right] \qquad \text {and}\qquad \inf \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {Q}\left[ \xi \in {\widehat{\mathbb {A}}}\right] = 1 - \sup \limits _{\mathbb {Q}\in \mathbb {B}_{\varepsilon }(\widehat{\mathbb {P}}_N)} \mathbb {Q}\left[ \xi \notin \widehat{{\mathbb {A}}}\right] , \end{aligned}$$

respectively. Indeed, recall that the true data-generating probability distribution resides in the Wasserstein ball of radius \(\varepsilon _N(\beta )\) defined in (8) with probability \(1-\beta \). Therefore, we have

$$\begin{aligned}&1 - \beta \le \mathbb {P}^N\Big [\widehat{\Xi }_N: \mathbb {P}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\Big ]&\\&\quad \le \mathbb {P}^N\Big [\widehat{\Xi }_N: ~ \sup _{\mathbb {Q}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)} \mathbb {Q}\big [{\mathbb {A}}\big ] \ge \mathbb {P}\big [{\mathbb {A}}\big ] \quad \forall {\mathbb {A}} \in \mathfrak {B}(\Xi ) \Big ] \\&\quad = \mathbb {P}^N\Big [\widehat{\Xi }_N: ~ \inf _{{\mathbb {A}} \in \mathfrak {B}(\Xi )} \sup _{\mathbb {Q}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)} \mathbb {Q}\big [{\mathbb {A}}\big ] - \mathbb {P}\big [{\mathbb {A}}\big ] \ge 0 \Big ], \end{aligned}$$

where \(\mathfrak {B}(\Xi )\) denotes the set of all Borel subsets of \(\Xi \). The data-dependent set \({\widehat{\mathbb {A}}}_N\) can now be viewed as a (measurable) mapping from \(\widehat{\Xi }_N\) to the subsets in \(\mathfrak {B}(\Xi )\). The above inequality then implies

$$\begin{aligned} \mathbb {P}^N\Big [\widehat{\Xi }_N: ~ \sup _{\mathbb {Q}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)} \mathbb {Q}\big [{{\widehat{\mathbb {A}}}_N}\big ] - \mathbb {P}\big [{{\widehat{\mathbb {A}}}_N}\big ] \ge 0 \Big ]\ge 1-\beta . \end{aligned}$$

Thus, \( \sup \{\mathbb {Q}[{{\widehat{\mathbb {A}}}_N}]:\mathbb {Q}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\}\) provides indeed an upper bound on \(\mathbb {P}[{{\widehat{\mathbb {A}}}_N}]\) with confidence \(1-\beta \). Similarly, one can show that \( \inf \{\mathbb {Q}[{{\widehat{\mathbb {A}}}_N}]: \mathbb {Q}\in \mathbb {B}_{\varepsilon _N(\beta )}(\widehat{\mathbb {P}}_N)\}\) provides a lower confidence bound on \(\mathbb {P}[{{\widehat{\mathbb {A}}}_N}]\).

The upper confidence bound can be computed by solving the linear program (17a). Replacing \({\widehat{\mathbb {A}}}\) with its interior in the lower confidence bound leads to another (potentially weaker) lower bound that can be computed by solving the linear program (17b). We denote these computable bounds by \(\widehat{J}_N^+(\varepsilon )\) and \(\widehat{J}_N^-(\varepsilon )\), respectively. In all subsequent experiments \(\widehat{x}_N\) is set to a solution of the distributionally robust program (27) calibrated via k-fold cross validation as described in Sect. 7.2.2.
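
For \(\Xi =\mathbb {R}^m\), the upper bound can also be evaluated directly from the duality (12b) applied to the indicator loss of the closed set \({\widehat{\mathbb {A}}}\): the inner suprema reduce to \(\max \{1-\lambda \,\mathrm{dist}(\widehat{\xi }_i,{\widehat{\mathbb {A}}}),0\}\), where the distance is measured in the norm defining the Wasserstein metric, and the remaining one-dimensional minimization over \(\lambda \) can be carried out over the breakpoints \(\lambda =1/\mathrm{dist}(\widehat{\xi }_i,{\widehat{\mathbb {A}}})\). The sketch below is our illustration of this route (it is not the linear program (17a) itself); it assumes CVXPY, the 1-norm of Sect. 7.2, and \(\varepsilon >0\).

```python
import cvxpy as cp
import numpy as np

def worst_case_probability(xi_hat, x_port, eps, indices=(7, 8, 9)):
    """Upper bound sup_{Q in B_eps} Q[xi in A_hat] for
    A_hat = {xi : <x_port, xi> >= xi_j for the given (0-based) asset indices}.

    Evaluates inf_{lam >= 0} lam * eps + (1/N) * sum_i max(1 - lam * d_i, 0),
    where d_i is the 1-norm distance from the i-th training sample to A_hat
    (assumes Xi = R^m and eps > 0).
    """
    N, m = xi_hat.shape
    A = np.stack([np.eye(m)[j] - x_port for j in indices])   # rows encode xi_j - <x, xi> <= 0

    def dist_to_A(xi_i):                                     # 1-norm projection distance
        z = cp.Variable(m)
        problem = cp.Problem(cp.Minimize(cp.norm1(z - xi_i)), [A @ z <= 0])
        problem.solve()
        return problem.value

    d = np.array([dist_to_A(xi_hat[i]) for i in range(N)])
    # The objective is convex and piecewise linear in lam, so its minimum over
    # lam >= 0 is attained at lam = 0 or at a breakpoint lam = 1 / d_i.
    candidates = np.concatenate(([0.0], 1.0 / d[d > 1e-12]))
    return min(lam * eps + np.mean(np.maximum(1.0 - lam * d, 0.0)) for lam in candidates)
```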

7.3.1 Impact of the Wasserstein radius

As \(\widehat{J}_N^+(\varepsilon )\) and \(\widehat{J}_N^-(\varepsilon )\) estimate a random target \(\mathbb {P}[{\widehat{\mathbb {A}}}]\), it makes sense to filter out the randomness of the target and to study only the differences \(\widehat{J}_N^+(\varepsilon )- \mathbb {P}[{\widehat{\mathbb {A}}}]\) and \(\widehat{J}_N^-(\varepsilon )- \mathbb {P}[{\widehat{\mathbb {A}}}]\). Figure 9a, b visualize the empirical mean (solid lines) as well as the tube between the empirical 20 and 80% quantiles (shaded areas) of these differences as a function of the Wasserstein radius \(\varepsilon \), based on 200 training datasets of cardinality \(N = 30\) and \(N=300\), respectively. Figure 9 also shows the empirical reliability of the bounds (dashed lines), that is, the empirical probability of the event \(\widehat{J}_N^-(\varepsilon ) \le \mathbb {P}[{\widehat{\mathbb {A}}}] \le \widehat{J}_N^+(\varepsilon )\). Note that the reliability drops to 0 for \(\varepsilon =0\), in which case both \(\widehat{J}_N^+(0)\) and \(\widehat{J}_N^-(0)\) coincide with the SAA estimator for \(\mathbb {P}[{\widehat{\mathbb {A}}}]\). Moreover, at \(\varepsilon =0\) the set \({\widehat{\mathbb {A}}}\) is constructed from the SAA portfolio \(\widehat{x}_N\), whose performance is overestimated on the training dataset. Thus, the SAA estimator for \(\mathbb {P}[{\widehat{\mathbb {A}}}]\), which is evaluated using the same training dataset, is positively biased. For \(\varepsilon >0\), finally, the reliability increases as the shaded confidence intervals move away from 0.

Fig. 9  Excess \(\widehat{J}_N^+(\varepsilon )- \mathbb {P}[{\widehat{\mathbb {A}}}]\) and shortfall \(\widehat{J}_N^-(\varepsilon )- \mathbb {P}[{\widehat{\mathbb {A}}}]\) (solid lines, left axis) as well as reliability \(\mathbb {P}^N[\widehat{J}_N^-(\varepsilon ) \le \mathbb {P}[{\widehat{\mathbb {A}}}] \le \widehat{J}_N^+(\varepsilon )]\) (dashed lines, right axis) as a function of \(\varepsilon \). (a) \(N=30\), (b) \(N=300\)

7.3.2 Impact of the sample size

We propose a variant of the k-fold cross validation procedure for selecting \(\varepsilon \) in uncertainty quantification. Partition \(\widehat{\xi }_1,\ldots ,\widehat{\xi }_N\) into k subsets and repeat the following holdout method k times. Select one of the subsets as the validation set of size \(N_V\) and merge the remaining \(k-1\) subsets to a training dataset of size \(N_T=N-N_V\). Use the validation set to compute the SAA estimator of \(\mathbb {P}[{\widehat{\mathbb {A}}}]\), and use the training dataset to compute \({\widehat{J}}_{N_T}^+(\varepsilon )\) for a large but finite number of candidate radii \(\varepsilon \). Set \({\widehat{\varepsilon }}_N^{\; \mathrm hm}\) to the smallest candidate radius for which the SAA estimator of \(\mathbb {P}[{\widehat{\mathbb {A}}}]\) is not larger than \({\widehat{J}}_{N_T}^+(\varepsilon )\). Next, set \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) to the average of the Wasserstein radii obtained from the k holdout runs, and report \(\widehat{J}_N^+={\widehat{J}}_{N}^+({\widehat{\varepsilon }}_N^\mathrm{\; cv})\) as the data-driven upper bound on \(\mathbb {P}[{\widehat{\mathbb {A}}}]\). The data-driven lower bound \(\widehat{J}_N^-\) is constructed analogously in the obvious way.
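
A sketch of this selection rule for the upper bound is given below (our illustration; `worst_case_probability` is the hypothetical helper sketched above, the SAA estimator of \(\mathbb {P}[{\widehat{\mathbb {A}}}]\) is the empirical frequency of the event on the validation fold, and the candidate radii are assumed sorted in increasing order):

```python
import numpy as np

def cv_radius_for_upper_bound(xi_hat, x_port, radii, k=5, indices=(7, 8, 9), rng=None):
    """k-fold cross validation variant for the data-driven upper bound on P[A_hat]."""
    rng = np.random.default_rng() if rng is None else rng
    N = xi_hat.shape[0]
    folds = np.array_split(rng.permutation(N), k)
    chosen = []
    for fold in folds:
        val = xi_hat[fold]
        train = xi_hat[np.setdiff1d(np.arange(N), fold)]
        saa = np.mean(val @ x_port >= val[:, list(indices)].max(axis=1))  # SAA estimate of P[A_hat]
        ok = [eps for eps in radii
              if saa <= worst_case_probability(train, x_port, eps, indices)]
        chosen.append(ok[0] if ok else radii[-1])   # smallest qualifying radius (fallback: largest)
    eps_cv = float(np.mean(chosen))
    return eps_cv, worst_case_probability(xi_hat, x_port, eps_cv, indices)
```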

Figure 10a visualizes the empirical means (solid lines) as well as the tubes between the empirical 20 and 80% quantiles (shaded areas) of \(\widehat{J}_N^+-\mathbb {P}[{\widehat{\mathbb {A}}}]\) and \(\widehat{J}_N^--\mathbb {P}[{\widehat{\mathbb {A}}}]\) as a function of the sample size N, based on 300 independent training datasets. As expected, the confidence intervals shrink and converge to 0 as N increases. We emphasize that \(\widehat{J}_N^+\) and \(\widehat{J}_N^-\) are computed solely on the basis of N training samples, whereas the computation of \(\mathbb {P}[{\widehat{\mathbb {A}}}]\) necessitates a much larger dataset, particularly if \({\widehat{\mathbb {A}}}\) constitutes a rare event.

Fig. 10  Dependence of the confidence bounds and the Wasserstein radius on N. (a) Excess \(\widehat{J}_N^+-\mathbb {P}[{\widehat{\mathbb {A}}}]\) and shortfall \(\widehat{J}_N^--\mathbb {P}[{\widehat{\mathbb {A}}}]\) of the data-driven confidence bounds for \(\mathbb {P}[{\widehat{\mathbb {A}}}]\). (b) Data-driven Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) obtained via k-fold cross validation

Figure 10b shows the Wasserstein radius \({\widehat{\varepsilon }}_N^\mathrm{\; cv}\) obtained via k-fold cross validation (both for \(\widehat{J}_N^+\) and \(\widehat{J}_N^-\)). As usual, all results are averaged across 300 independent simulation runs. A comparison with Fig. 8 reveals that the data-driven Wasserstein radii in uncertainty quantification display a polynomial decay similar to, but faster than, the one observed in portfolio optimization. We conjecture that this is because uncertainty quantification involves no decisions and is therefore less susceptible to the optimizer's curse. Thus, nature (i.e., the fictitious adversary choosing the distribution in the ambiguity set) only has to compensate for noise but not for bias, and a smaller Wasserstein radius seems to suffice for this purpose.