Distributionally Robust Shortfall Risk Optimization Model and Its Approximation

Utility-based shortfall risk measures (SR) have received increasing attention over the past few years for their potential to quantify the risk of large tail losses more effectively than conditional value at risk. In this paper, we consider a distributionally robust version of the shortfall risk measure (DRSR) where the true probability distribution is unknown and the worst distribution from an ambiguity set of distributions is used to calculate the SR. We start by showing that the DRSR is a convex risk measure and, under some special circumstances, a coherent risk measure. We then study an optimization problem whose objective is to minimize the DRSR of a random function and investigate numerical tractability of the problem when the ambiguity set is constructed through a φ-divergence ball or a Kantorovich ball. In the case when the nominal distribution in the balls is an empirical distribution constructed from iid samples, we quantify, under the Kantorovich metric, the convergence of the ambiguity sets to the true probability distribution as the sample size increases, and consequently the convergence of the optimal values of the corresponding DRSR problems. Specifically, we show that the error of the optimal value is linearly bounded by the error of each of the approximate ambiguity sets and subsequently derive a confidence interval for the optimal value under each of the approximation schemes. Some preliminary numerical test results are reported for the proposed modeling and computational schemes. The research is supported by EPSRC grant EP/M003191/1. The work of the first author was partially carried out while she was working as a postdoctoral research fellow in the School of Mathematical Sciences, University of Southampton, supported by the EPSRC grant.
Shaoyan Guo School of Mathematical Sciences, Dalian University of Technology, Dalian, 116024, China E-mail: syguo@dlut.edu.cn Huifu Xu School of Mathematics, University of Southampton, Southampton, SO17 1BJ, UK E-mail: h.xu@soton.ac.uk


Introduction
Quantitative measure of risk is a key element for financial institutions and regulatory authorities. It provides a way to compare different financial positions. A financial position can be mathematically characterized by a random variable Z : (Ω, F , P ) → IR, where Ω is a sample space with sigma algebra F and P is a probability measure. A risk measure ρ assigns to Z a number that signifies the risk of the position. A good risk measure should have some virtues, such as being sensitive to excessive losses, penalizing concentration and encouraging diversification, and supporting dynamically consistent risk managements over multiple horizons [15].
Artzner et al. [1] considered the axiomatic characterizations of risk measures and first introduced the concept of a coherent risk measure, which satisfies: (a) positive homogeneity (ρ(αZ) = αρ(Z) for α ≥ 0); (b) subadditivity (ρ(Z + Y) ≤ ρ(Z) + ρ(Y)); (c) monotonicity (if Z ≥ Y, then ρ(Z) ≤ ρ(Y)); (d) translation invariance (if m ∈ IR, then ρ(Z + m) = ρ(Z) − m). Frittelli and Rosazza Gianin [12], Heath [17] and Föllmer and Schied [9] extended the notion of coherent risk measure to convex risk measure by replacing positive homogeneity and subadditivity with convexity, that is, ρ(αZ + (1 − α)Y) ≤ αρ(Z) + (1 − α)ρ(Y) for all α ∈ [0, 1]. Obviously positive homogeneity and subadditivity imply convexity but not vice versa; in other words, every coherent risk measure is a convex risk measure but the converse may fail. A well-known coherent risk measure is conditional value at risk (CVaR), defined by CVaR_α(Z) := (1/α) ∫_0^α VaR_λ(Z) dλ, where VaR_λ(Z) denotes the value at risk (VaR), which in this context is the smallest amount of cash that needs to be added to Z so that the probability of the financial position falling into a loss does not exceed a specified level λ, that is, VaR_λ(Z) := inf{t ∈ IR : P(Z + t < 0) ≤ λ}. In a financial context, CVaR has a number of advantages over the commonly used VaR, and CVaR has been proposed as the primary tool for banking capital regulation in the draft Basel III standard [2]. However, CVaR has a couple of deficiencies.
One is that CVaR is not invariant under randomization, a property which is closely related to the weak dynamic consistency of risk measurements; that is, if CVaR_α(Z_i) ≤ 0 for i = 1, 2 and Z := Z_1 with probability p, Z_2 with probability 1 − p, for p ∈ (0, 1), then we do not necessarily have CVaR_α(Z) ≤ 0, see [26, Example 3.4]. The other is that CVaR is not particularly sensitive to heavy tailed losses [15, Section 5]. We illustrate this with a simple example: let X_1, X_2 and X_3 be the financial positions defined in (1). It is easy to calculate that CVaR_0.02(X_1) = CVaR_0.02(X_2) = CVaR_0.02(X_3) = 150.
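The empirical calculation behind such examples is straightforward. The following minimal Python sketch (our own illustration, not taken from the paper) estimates VaR and CVaR from a sample of losses; the function name `var_cvar` and the sign convention (positive values are losses) are our assumptions.

```python
import numpy as np

def var_cvar(losses, alpha):
    """Empirical VaR and CVaR at tail level alpha for a sample of
    losses (positive values are losses).  VaR is the (1 - alpha)
    sample quantile; CVaR averages the losses at or beyond VaR,
    mirroring CVaR_alpha(Z) = (1/alpha) * integral of VaR over (0, alpha]."""
    losses = np.sort(np.asarray(losses, dtype=float))
    n = len(losses)
    var = losses[int(np.ceil((1 - alpha) * n)) - 1]   # empirical quantile
    tail = losses[losses >= var]                       # the alpha-tail
    return var, tail.mean()

# e.g. for 100 equally likely losses 1, 2, ..., 100 at alpha = 0.05,
# VaR is 95 and CVaR averages the tail {95, ..., 100}.
```

Because CVaR only averages the tail beyond VaR, distributions with very different extreme losses can share the same CVaR, which is the insensitivity the example above exhibits.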
To overcome these deficiencies, a special category of convex risk measure, called the utility-based shortfall risk measure (abbreviated as SR hereafter), was introduced by Föllmer and Schied [9] and has attracted increasing attention in recent years, see [7,15,18]. Let l : IR → IR be a convex, increasing and non-constant function, and let λ be a pre-specified constant in the interior of the range of l specifying the risk level. The SR of a financial position Z is defined as (SR) SR^P_{l,λ}(Z) := inf{t ∈ IR : t + Z ∈ A_P}, where A_P := {Z ∈ L^∞ : E_P[l(−Z(ω))] ≤ λ} is called the acceptance set and L^∞ denotes the set of bounded random variables. From the definition, we can see that the SR is the smallest amount of cash that must be added to the position Z to make it acceptable, i.e., t + Z ∈ A_P. Observe that when l(·) is the (nonconvex) characteristic function 1_{(0,+∞)}(·), that is, l(z) = 1 if z > 0 and l(z) = 0 otherwise, SR^P_{l,λ}(Z) coincides with VaR_λ(Z).
Compared to CVaR, SR not only satisfies convexity but is also invariant under randomization and can be used more appropriately for dynamic measurement of risks over time. To see invariance under randomization, note that SR defined as in (2) is a function on the space of random variables; it can also be represented as a function on the space of probability measures, see [26, Remark 2.1]. In the latter case, the acceptance set can be characterized by N := {µ ∈ P(C) : ∫_C l(−z) µ(dz) ≤ λ}, where P(C) denotes the space of probability measures with support contained in a compact set C ⊂ IR. If µ, ν ∈ N, then pµ + (1 − p)ν ∈ N for any p ∈ (0, 1) by linearity of the integral, which is precisely invariance under randomization. Moreover, the SR is found to be more sensitive to financial losses from extreme events with heavy tailed distributions, see [15, Section 5]. Indeed, if we set l(z) = e^z and λ = e, then we can easily calculate the shortfall risk values of X_1, X_2 and X_3 in (1) as SR^P_{l,λ}(X_1) ≈ 194, SR^P_{l,λ}(X_2) ≈ 293, and SR^P_{l,λ}(X_3) ≈ 393. Furthermore, if we choose l(z) = e^{βz} with β > 0, the resulting SR coincides, up to an additive constant, with the entropic risk measure ρ(Z) = (1/β) log E_P[e^{−βZ}]. In the case when l(z) = z^α 1_{[0,∞)}(z) with α ≥ 1, the associated risk measure focuses on downside risk only and thus neglects the tradeoff between gains and losses. Dunkel and Weber [7] are perhaps the first to discuss the computational aspects of SR. They characterized SR as a stochastic root finding problem and proposed the stochastic approximation (SA) method combined with importance sampling techniques to calculate it. Hu and Zhang [18] proposed an alternative approach by reformulating SR as the optimal value of a stochastic optimization problem and applying the well-known sample average approximation (SAA) method to solve the latter when either the true probability distribution is unknown or it is prohibitively expensive to compute the expected value of the underlying random functions.
A detailed asymptotic analysis of the optimal values obtained from solving the sample average approximated problem was also provided.
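On a finite sample, the SAA version of SR reduces to a one-dimensional root-finding problem in t, since t ↦ E[l(−Z − t)] is nonincreasing when l is increasing. A minimal bisection sketch in Python (our illustration; the function name and the bracket [lo, hi] are assumptions, not part of the papers cited):

```python
import numpy as np

def shortfall_risk(samples, l, lam, lo=-1e2, hi=1e2, tol=1e-8):
    """Empirical shortfall risk: the smallest t with
    (1/N) * sum_i l(-z_i - t) <= lam.  Since l is increasing,
    t -> mean(l(-Z - t)) is nonincreasing, so bisection applies;
    we assume the bracket [lo, hi] contains the root."""
    z = np.asarray(samples, dtype=float)
    g = lambda t: l(-z - t).mean() - lam
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:      # still unacceptable: need more cash
            lo = mid
        else:
            hi = mid
    return hi

# With l(z) = exp(z) and lam = e this reproduces, up to sampling error,
# the exponential-loss shortfall risk discussed above.
```

For a degenerate position Z ≡ c, the root is t = −c (adding −c cash makes the position exactly acceptable when λ = l(0)), which gives a quick sanity check of the routine.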
In some practical applications, however, the true probability distribution may be unknown, and it is expensive to collect a large set of samples or the samples are not trustworthy. Nevertheless, it might be possible to use some partial information such as empirical data, computer simulation, prior moments or subjective judgements to construct a set of distributions which contains or approximates the true probability distribution in good faith. Under these circumstances, it might be reasonable to consider a distributionally robust version of (2) in order to hedge the risk arising from ambiguity of the true probability distribution, where the expectation is taken under the worst distribution from an ambiguity set P of probability distributions: (DRSR) SR^P_{l,λ}(Z) := inf{t ∈ IR : sup_{P∈P} E_P[l(−Z(ω) − t)] ≤ λ}. (3) [27] demonstrated how a DRSR optimization problem may be reformulated as a tractable convex programming problem when l is piecewise affine and the ambiguity set is constructed through some moment conditions, see [27, Example 6] for details. In this paper, we build on that research by giving a more comprehensive treatment of DRSR. We start by looking into the properties of DRSR and then move on to discuss some optimization problems associated with DRSR. Specifically, for a loss c(x, ξ) associated with decision vector x ∈ X ⊂ IR^n and random vector ξ ∈ IR^k, we consider an optimization problem which aims to minimize the distributionally robust shortfall risk of the random loss: (DRSRP) min_{x∈X} SR^P_{l,λ}(−c(x, ξ)), (4) where SR^P_{l,λ}(·) is defined as in (3). We present a detailed discussion on (DRSRP) including tractable reformulation of the problem when the ambiguity set has a specific structure.
The main contributions of the paper can be summarized as follows. First, we demonstrate that DRSR is the worst-case SR (Proposition 1) and hence a convex risk measure. Second, we investigate tractability of (DRSRP) by considering particular cases where the ambiguity set P is constructed through a ϕ-divergence ball and a Kantorovich ball respectively. Since the structure of P often involves sample data, we analyse convergence of the ambiguity set as the sample size increases (Propositions 3 and 5). To quantify how the errors arising from the ambiguity set propagate to the optimal value of (DRSRP), we then show under some moderate conditions that the error of the optimal value is linearly bounded by the error of the ambiguity set, and subsequently derive a finite sample guarantee (Theorem 1) and confidence intervals for the optimal value of (DRSRP) associated with the ambiguity sets (Theorem 2 and Corollary 1). Finally, as an application, we apply the (DRSRP) model to a portfolio management problem and carry out various out-of-sample tests of the numerical schemes for the (DRSRP) model with simulated and real data (Section 5).
The rest of the paper is organised as follows. In Section 2, we present the properties of DRSR, that is, it is a convex risk measure and it is the worst-case SR. In Section 3, we derive the formulation of (DRSRP) when the ambiguity set is constructed through ϕ-divergence ball and Kantorovich ball and then establish the convergence of ambiguity sets as sample size increases. In Section 4, the finite sample guarantees on the quality of the optimal solutions and convergence of the optimal values as the sample size increases are discussed. In Section 5, we report results of numerical experiments.
Throughout the paper, we use IR^n to represent the n-dimensional Euclidean space and IR^n_+ its nonnegative orthant. Given a norm ∥·∥ on IR^n, the dual norm ∥·∥_* is defined by ∥y∥_* := sup_{∥z∥≤1} ⟨y, z⟩. We use B to denote the unit ball in a matrix or vector space. Finally, for a sequence of subsets {S_N} in a metric space, we denote by lim sup_{N→∞} S_N its outer limit, that is, lim sup_{N→∞} S_N := {x : ∃ x_{N_k} ∈ S_{N_k} such that x_{N_k} → x as k → ∞}.

Properties of DRSR
In this section, we investigate the properties of DRSR. It is easy to observe that SR^P_{l,λ}(Z) is the optimal value of the minimization problem min_{t∈IR} t subject to E_P[l(−Z − t)] ≤ λ. The following proposition states that the DRSR is the worst-case SR and that it preserves the convexity of SR.
Proposition 1 Let SR^P_{l,λ}(Z) be defined as in (3), Z ∈ L^∞, and let l : IR → IR be a convex, increasing and non-constant function and λ a pre-specified constant in the range of l. Then SR^P_{l,λ}(Z) is finite, SR^P_{l,λ}(Z) = sup_{P∈P} SR^P_{l,λ}(Z), (6) and SR^P_{l,λ}(Z) is a convex risk measure.
Remark 1 It may be helpful to make some comments on Proposition 1.
(i) The relationship established in (6) means that DRSR is the worst-case SR. This observation allows one to calculate DRSR via SR for each P ∈ P if the latter is easy to do. Moreover, Giesecke et al. [15] showed that SR is a coherent risk measure if and only if the loss function l takes a specific piecewise linear form expressed through [z]_+ and [z]_−, where [z]_− denotes the negative part of z and [z]_+ the positive part. In this case, the SR gives rise to an expectile, see [3, Theorem 4.9].
Using this result, we can easily show through equation (6) that DRSR is a coherent risk measure when l takes that specific form, since the operation sup_{P∈P} preserves positive homogeneity and subadditivity. (ii) The restriction of Z to L^∞ implies that the support of the probability distribution of Z is bounded. This condition may be relaxed to the case when there exist t_l, t_u ∈ IR such that sup_{P∈P} E_P[l(−Z − t_l)] > λ and sup_{P∈P} E_P[l(−Z − t_u)] < λ, see [18].
We now move on to discuss the properties of DRSR when it is applied to a random function. This paves the way for a full investigation of (DRSRP) in Sections 3-4. To this end, we need to make some assumptions on the random function c(·, ·) and the loss function l(·). Throughout this section, we use Ξ to denote the image space of the random variable ξ(ω) and P(Ξ) to denote the set of all probability measures defined on the measurable space (Ξ, B) with Borel sigma algebra B. To ease notation, we will use ξ to denote either the random vector ξ(ω) or an element of IR^k depending on the context.

Assumption 1 Let X, l(·) and c(·, ·) be defined as in (DRSRP) (4). We assume the following: (a) X is a convex and compact set and Ξ is a compact set; (b) l is convex, increasing, non-constant and Lipschitz continuous with modulus L; (c) c(·, ξ) is finite valued and convex w.r.t. x ∈ X for each ξ ∈ Ξ, and there exists a positive constant κ such that |c(x, ξ_1) − c(x, ξ_2)| ≤ κ∥ξ_1 − ξ_2∥ for all x ∈ X and ξ_1, ξ_2 ∈ Ξ.

The proposition below summarises some important properties of l(c(x, ξ) − t) and v(x, t) := sup_{P∈P} E_P[l(c(x, ξ) − t)].

Proposition 2 The following assertions hold: (i) l(c(x, ξ) − t) is convex in (x, t) for each fixed ξ ∈ Ξ; (ii) if, in addition, Assumption 1 (a) holds and λ is a pre-specified constant in the interior of the range of l, then there exist a point (x_0, t_0) ∈ X × IR and a constant η > 0 such that sup_{P∈P} E_P[l(c(x_0, ξ) − t_0)] − λ ≤ −η, (7) and (DRSRP) has a finite optimal value.
Proof Part (i). It is well known that composition with a monotonically increasing convex function preserves convexity. The remaining claims can also be easily verified. Part (ii). Since c(x, ξ) is finite valued and convex in x, it is continuous in x for each fixed ξ. Together with its uniform continuity in ξ, this allows us to show that c(x, ξ) is continuous over X × Ξ. By the boundedness of X and Ξ, there is a positive constant α such that c(x, ξ) ≤ α for all (x, ξ) ∈ X × Ξ. With the boundedness of c and the increasing, convex and non-constant properties of l, we can show Part (ii) analogously to the proof of the first part of Proposition 1. We omit the details.

Structure of (DRSRP') and approximation of the ambiguity set
In this section, we investigate the structure and numerical solvability of (DRSRP). Using the formulation (5) for DRSR, we can reformulate (DRSRP) as (DRSRP') min_{x∈X, t∈T} t subject to sup_{P∈P} E_P[l(c(x, ξ) − t)] ≤ λ, where T is a compact set in IR which contains the point t_0 defined as in (7); its existence is ensured by Proposition 2 under some moderate conditions. Obviously, the structure of (DRSRP') is determined by the distributionally robust constraint, which in turn relies heavily on the concrete structure of the ambiguity set P and the loss function l.
In the literature of distributionally robust optimization, various statistical methods have been proposed to build ambiguity sets from the available information on the underlying uncertainty, see for instance [27,28] and the references therein. Here we consider the ϕ-divergence ball and Kantorovich ball approaches and discuss tractable formulations of the corresponding (DRSRP').

Ambiguity set constructed through ϕ-divergence
Let us now consider the case where the only available information about the random vector ξ is its empirical data and the size of such data is limited (not very large). In stochastic programming, a well-known approach in this situation is to use the empirical distribution constructed from the data to approximate the true probability distribution. However, if the sample size is not big enough, or there is a computational reason to use a small set of empirical data (e.g., in multistage decision-making problems), then the quality of such an approximation may be compromised. The ϕ-divergence approach has been proposed to address this dilemma. Let ϕ : IR_+ → IR be a convex function with ϕ(1) = 0; the ϕ-divergence between two probability vectors p, q ∈ IR^M_+ is defined by I_ϕ(p, q) := Σ_{i=1}^M q_i ϕ(p_i/q_i). In this subsection, we consider some common ϕ-divergences, which are defined as follows.

Lemma 1 (Relationships between ϕ-divergences)
For two probability vectors p, q ∈ IR^M_+, the following inequalities hold. We omit the proof as the results can be easily derived from the definitions of the divergence functions ϕ.
Let {ζ_1, . . . , ζ_M} ⊂ Ξ denote M distinct points in the support of ξ and let Ξ_i denote the Voronoi cell of Ξ centered at ζ_i for i = 1, . . . , M. Let ξ^1, . . . , ξ^N be an iid sample of ξ, where N ≫ M, and let N_i denote the number of samples falling into the cell Ξ_i. Define the empirical distribution P_N := Σ_{i=1}^M (N_i/N) δ_{ζ_i} (9) and the ambiguity set P^M_N := {P = Σ_{i=1}^M p_i δ_{ζ_i} : p ≥ 0, Σ_{i=1}^M p_i = 1, I_ϕ(p, p_N) ≤ r}, (10) where p_N := (N_1/N, . . . , N_M/N) and δ_{ζ_i} denotes the Dirac measure at ζ_i. Using P^M_N for the ambiguity set in (DRSRP'), we can derive a dual formulation (11) of (DRSRP'), where p_N is defined as in (10); the resulting constraint function is convex in x, u, τ and t, see [19]. Thus, problem (11) is a convex program.
It is important to note that the reformulation (11) relies heavily on the discrete structure of the nominal distribution. It is possible to use a continuous nominal distribution, in which case the summation in the first constraint of problem (11) (before introducing the new variables s_i) becomes an expected value with respect to that distribution. In such a case, we will need the SAA approach to deal with the expected value.
The reallocation of the probabilities through the Voronoi partition provides an effective way to reduce the number of scenarios of the discretized problem and hence the size of problem (11). It remains to be explained how the ambiguity set approximates the true probability distribution.
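The scenario-reduction step just described can be sketched in a few lines: assign each sample to its nearest centre ζ_i and use the cell frequencies N_i/N as the nominal probability vector p_N. The following Python sketch is our own illustration (the paper does not prescribe an implementation, and the function name is an assumption):

```python
import numpy as np

def voronoi_weights(samples, centers):
    """Nominal probability vector p_N with p_N[i] = N_i / N, where N_i
    is the number of samples whose nearest centre (Euclidean norm) is
    centers[i], i.e. the samples falling in the i-th Voronoi cell."""
    X = np.atleast_2d(np.asarray(samples, dtype=float))   # (N, k)
    C = np.atleast_2d(np.asarray(centers, dtype=float))   # (M, k)
    # pairwise distances, then nearest-centre index for each sample
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # (N, M)
    idx = d.argmin(axis=1)
    counts = np.bincount(idx, minlength=len(C))
    return counts / len(X)
```

For example, three one-dimensional samples {0.1, 0.2, 0.9} with centres {0, 1} yield p_N = (2/3, 1/3): two samples fall in the cell of the first centre and one in the cell of the second.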
Let L denote the set of functions h : Ξ → IR satisfying |h(ξ_1) − h(ξ_2)| ≤ ∥ξ_1 − ξ_2∥ for all ξ_1, ξ_2 ∈ Ξ, and let P, Q ∈ P(Ξ) be two probability measures. Recall that the Kantorovich metric (or distance) between P and Q, denoted by dl_K(P, Q), is defined by dl_K(P, Q) := sup_{h∈L} |E_P[h(ξ)] − E_Q[h(ξ)]|. Using the Kantorovich metric, we can define the deviation of a set of probability measures P from another set of probability measures Q by D_K(P, Q) := sup_{P∈P} inf_{Q∈Q} dl_K(P, Q), and the Hausdorff distance between the two sets by H_K(P, Q) := max{D_K(P, Q), D_K(Q, P)}. An important property of the Kantorovich metric is that it metrizes weak convergence of probability measures [5] when the support is bounded; that is, a sequence of probability measures {P_N} converges weakly to P if and only if dl_K(P_N, P) → 0 as N tends to infinity.
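For distributions on the real line, the Kantorovich (1-Wasserstein) distance has a closed form as the L¹ distance between quantile functions; for two equal-size empirical distributions this reduces to the mean absolute difference of the order statistics. A quick sanity check (our own illustration, not from the paper):

```python
import numpy as np

def kantorovich_1d(xs, ys):
    """Kantorovich (1-Wasserstein) distance between two equal-size
    empirical distributions on IR: the mean absolute difference of the
    order statistics (the optimal transport plan is monotone in 1-D)."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    assert len(xs) == len(ys), "equal sample sizes assumed"
    return np.abs(xs - ys).mean()

# Shifting every atom by 1 transports unit mass a distance 1:
# kantorovich_1d([0, 1, 2], [1, 2, 3]) == 1.0
```

In higher dimensions no such closed form exists and the distance must be computed as a linear program over transport plans, which is why the paper works with bounds such as (14) rather than exact distances.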
Recall that for a given set of points {ζ_1, . . . , ζ_M} ⊂ Ξ, the quantity β_M defined in (13) bounds the distance from any point of Ξ to its nearest ζ_i. Using this, we can estimate the Kantorovich distance between P^M_N and the true probability distribution P*.

Proposition 3 Let P^M_N be defined as in (10) and P* be the true probability distribution of ξ. Let β_M be defined as in (13) and δ be a positive number such that Mδ < 1. If ϕ is chosen from one of the functions listed in (a)-(g) preceding Lemma 1, then inequality (14) holds with probability at least 1 − Mδ, where ∆(M, N, δ) is defined as the minimum of the two bounds derived in the proof below and r is defined as in (10). In the case when ξ follows a discrete distribution with support {ζ_1, . . . , ζ_M}, the corresponding bound holds with probability at least 1 − Mδ.
Proof We apply the triangle inequality of the Hausdorff distance under the Kantorovich metric.
Moreover, it follows from [14, Theorem 4] that the Kantorovich distance is bounded by D/2 times the total variation distance. By Lemma 1, in order to show (14) it therefore suffices to show (16). Let a ∈ IR^M be a vector with ∥a∥_∞ := max_{1≤i≤M} |a_i| = 1 and let ϕ_a(ξ) be the piecewise constant function taking value a_i on Ξ_i for i = 1, . . . , M. Then the required concentration inequality holds with probability at least 1 − δ for the fixed a. In particular, if we set a = e_i for i = 1, . . . , M, where e_i ∈ IR^M is the vector with i-th component 1 and the rest 0, then we obtain the corresponding inequality with probability at least 1 − δ for each i = 1, . . . , M. By Bonferroni's inequality, the inequalities hold jointly with probability at least 1 − Mδ, and hence we have shown (16) for the first part of the bound in ∆(M, N, δ).
To show the second part of the bound, we need a more involved argument to estimate the left hand side of (16). Let A := {a ∈ IR^M : ∥a∥_∞ = 1}. For a small positive number ν (less than or equal to 2), let A_k := {a^1, . . . , a^k} be a ν-net of A, i.e., for any a ∈ A there exists a point a^{i(a)} ∈ A_k depending on a such that ∥a − a^{i(a)}∥_∞ ≤ ν, where we write p* for the M-dimensional vector with i-th component P*(Ξ_i). By (17), for each i the corresponding inequality holds with probability at least 1 − δ; thus inequality (21) holds uniformly for all i = 1, . . . , k with probability at least 1 − kδ. This enables us to conclude (22) with probability at least 1 − kδ. Since A_k is a trivial ν-net when ν = 2, we can set k = M and obtain from (22) the required bound with probability at least 1 − Mδ. This completes the proof of (16) and hence of inequality (14).
In the case when ξ follows a discrete distribution with support {ζ_1, . . . , ζ_M}, the rest follows from an analysis similar to the proof of (14).
It might be helpful to make a few comments on the above technical results. First, if we set δ = 1/(10M), then 1 − δM = 90% and the third term on the right hand side of (14) is given by (24). In order for the first part of (24) to be small, N must be significantly larger than M. The approach works for the case when there is a large data set which is not scattered evenly over Ξ but rather forms clumps, locally dense areas, modes, or clusters. In the case that N is less than (M − 1)^2, the second part of (24) is smaller than the first part, which means the second part provides the tighter bound. Second, the true distribution in the local areas may be further described by moment conditions, see [20,27]. Third, Pflug and Pichler proposed a practical way of identifying the optimal location of the discrete points ζ_1, . . . , ζ_M and computing the probability of each Voronoi cell, see [22]. Fourth, inequality (14) gives a bound for the Hausdorff distance between the true probability distribution P* and the ambiguity set P^M_N; it does not indicate that the true probability distribution P* is located in P^M_N. Since the ambiguity set P^M_N does not contain any continuous distribution irrespective of r > 0, when the true probability distribution P* is continuous, P* lies outside P^M_N with probability 1. If the true probability distribution P* is discrete, Pardo [21] showed that the scaled ϕ-divergence (2N/ϕ''(1)) I_ϕ(p*, p_N) asymptotically follows a χ^2 distribution with M − 1 degrees of freedom, where p* denotes the probability vector corresponding to the probability measure P* and M is the cardinality of Ξ (the support of P*). This means that if we set r := (ϕ''(1)/(2N)) χ^2_{M−1, 1−δ}, (25) where χ^2_{M−1, 1−δ} denotes the 1 − δ quantile of the χ^2 distribution with M − 1 degrees of freedom, then with probability 1 − δ, I_ϕ(p*, p_N) ≤ r. The latter indicates that the ambiguity set (10) lies in the 1 − δ confidence region. For general ϕ-divergences, we are unable to establish the quantitative convergence as in Proposition 3. However, if P* follows a discrete distribution with support {ζ_1, . . . , ζ_M}, the following qualitative convergence result holds.
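Pardo's asymptotic result gives a concrete recipe for the radius in (25). A small sketch using SciPy (the function name and the default ϕ''(1) = 2, which corresponds to the χ²-divergence ϕ(t) = (t − 1)², are our own choices):

```python
from scipy.stats import chi2

def divergence_radius(N, M, delta, phi_pp1=2.0):
    """Radius r = phi''(1) / (2N) * chi2_{M-1, 1-delta}, so that the
    phi-divergence ball of radius r around p_N contains the true
    probability vector p* with asymptotic probability 1 - delta."""
    return phi_pp1 * chi2.ppf(1 - delta, df=M - 1) / (2.0 * N)

# The radius shrinks at rate O(1/N) as the sample size grows,
# while it grows with the number of cells M and the confidence level.
```

For instance, with N = 1000 samples, M = 10 cells and δ = 0.05 the radius is roughly 0.017, and doubling the sample size halves it.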

Proposition 4 [19, Proposition 2]
Suppose that ϕ(t) ≥ 0 has a unique root at t = 1 and that the samples are independent and identically distributed from the true distribution P*. Then H_K(P^M_N, P*) → 0 with probability 1 as N → ∞, where r is defined as in (25).
Note that in [19, Proposition 2] the convergence is established under the total variation metric; since the probability distributions here are discrete, this is equivalent to convergence under the Kantorovich metric. We refer readers to [19] for the details of the proof.

Kantorovich ball
An alternative to the ϕ-divergence ball is the Kantorovich ball centered at a nominal distribution, that is, P_N := {P ∈ P(Ξ) : dl_K(P, P_N) ≤ r}, (26) where P_N(·) := (1/N) Σ_{i=1}^N δ_{ξ^i}(·) with ξ^1, . . . , ξ^N being iid samples of ξ. Differing from the ϕ-divergence ball, the Kantorovich ball contains both discrete and continuous distributions. In particular, if there exists a positive number a > 0 such that E_{P*}[exp(a∥ξ∥^θ)] < ∞ for some θ > 1, then for any r > 0 there exist positive constants C_1 and C_2 such that, for all N ≥ 1 and k ≠ 2, Prob(dl_K(P_N, P*) ≥ r) ≤ C_1 e^{−C_2 N r^{max(k,2)}}, where C_1 and C_2 are positive constants depending only on a, θ and k, "Prob" is the probability distribution over the space Ξ × · · · × Ξ (N times) with Borel sigma algebra B ⊗ · · · ⊗ B, and k is the dimension of ξ, see [11] for details. By setting the right hand side of the above inequality to δ and solving for r, we may set r_N(δ) := (log(C_1 δ^{−1})/(C_2 N))^{1/max(k,2)}, (29) and consequently the ambiguity set (26) contains the true probability distribution P* with probability 1 − δ when r = r_N(δ).
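Solving the concentration inequality for r can be sketched in a few lines of Python (our illustration; the constants C_1 and C_2 are problem-dependent inputs that the theory does not make explicit, so they are assumptions here):

```python
import math

def kantorovich_radius(N, delta, k, C1, C2):
    """Radius r_N(delta) obtained by setting
    C1 * exp(-C2 * N * r**max(k, 2)) = delta and solving for r, so
    that the Kantorovich ball of radius r around the empirical
    distribution contains P* with probability at least 1 - delta.
    C1, C2 are the (unknown) constants of the concentration bound."""
    return (math.log(C1 / delta) / (C2 * N)) ** (1.0 / max(k, 2))
```

The 1/max(k, 2) exponent shows the curse of dimensionality: in dimension k the radius decays only at rate N^{−1/k}, so ever larger samples are needed to certify the same confidence level as k grows.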
The proposition below gives a bound for the Hausdorff distance of P N and P * under the Kantorovich metric.

Proposition 5
Let P_N be defined as in (26) and let P* denote the true probability distribution. Let r_N(δ) be defined as in (29). If the radius of the Kantorovich ball in (26) is equal to r_N(δ), then, with probability at least 1 − δ, H_K(P_N, P*) ≤ 2 r_N(δ). Proof We first prove that H_K(P_N, P*) ≤ dl_K(P_N, P*) + r. (32) To see this, for any P′ ∈ P_N we have dl_K(P′, P*) ≤ dl_K(P′, P_N) + dl_K(P_N, P*) ≤ r + dl_K(P_N, P*). On the other hand, the deviation of P* from the ball is at most dl_K(P_N, P*), since the nominal distribution belongs to the ball. A combination of the last two inequalities yields (32). Let us now estimate the first term in (32), i.e., dl_K(P_N, P*). By the definition of r_N(δ), we have dl_K(P_N, P*) ≤ r_N(δ) with probability 1 − δ. The conclusion follows.
In the case when the centre of the Kantorovich ball in (26) is replaced by the distribution defined as in (9), we have, with probability at least 1 − Mδ, inequality (33), where ∆(M, N, δ) is defined as in Proposition 3. To see this, we use the triangle inequality of the Hausdorff distance under the Kantorovich metric to derive (34) and (35), and then establish (33) by combining (34), (35), (19) and (22). Before concluding this section, we note that it is possible to construct ambiguity sets by other statistical methods, such as moment conditions and mixture distributions; we omit them due to length limitations, and interested readers may find them in [16] and the references therein.

Convergence of (DRSRP')
In Section 3, we discussed two approaches for constructing the ambiguity set of the (DRSRP') model, each of which is defined through iid samples. To explicitly indicate the dependence on the samples, let us rewrite the model with P replaced by P_N: (DRSRP'-N) min_{x∈X, t∈T} t subject to sup_{P∈P_N} E_P[l(c(x, ξ) − t)] ≤ λ. In this section, we investigate finite sample guarantees on the quality of the optimal solutions obtained from solving (DRSRP'-N), a concept proposed by Esfahani and Kuhn [8], as well as convergence of the optimal values as the sample size increases. Let x_N be a solution of the distributionally robust shortfall risk minimization problem (DRSRP'-N). The out-of-sample performance of x_N is defined as SR^{P*}_{l,λ}(−c(x_N, ξ)), where P* is the true probability distribution. Since P* is unknown, the exact out-of-sample performance of x_N cannot be computed, but we may seek an upper bound ϑ_N such that Prob(SR^{P*}_{l,λ}(−c(x_N, ξ)) ≤ ϑ_N) ≥ 1 − δ, (37) where δ ∈ (0, 1). Following the terminology of Esfahani and Kuhn [8], we call δ a significance parameter and ϑ_N the certificate for the out-of-sample performance. The probability on the left-hand side of (37) indicates ϑ_N's reliability.
The following theorem states that the finite sample guarantee condition is fulfilled for the ambiguity sets discussed in Section 3; that is, when the sizes of the ambiguity sets are chosen carefully, the certificate ϑ_N provides a 1 − δ confidence bound of the type (37) on the out-of-sample performance of x_N.
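In practice, the certificate is compared against an estimate of the out-of-sample performance on a large fresh sample standing in for P*. A sketch of that evaluation step (the portfolio-style loss c(x, ξ) = −x^Tξ, the bisection bracket and all names are our assumptions, not prescribed by the paper):

```python
import numpy as np

def out_of_sample_sr(x, xi_test, l, lam, lo=-1e2, hi=1e2, tol=1e-8):
    """Proxy for the out-of-sample performance SR_{l,lam}(-c(x_N, xi)):
    on a fresh test sample, find by bisection the smallest t with
    mean(l(c(x, xi) - t)) <= lam, where c(x, xi) = -x @ xi is an
    illustrative (portfolio) loss.  A large test sample stands in
    for the unknown true distribution P*."""
    loss = -np.asarray(xi_test, dtype=float) @ np.asarray(x, dtype=float)
    g = lambda t: l(loss - t).mean() - lam
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi
```

A reliable certificate ϑ_N should then exceed this estimate in roughly a 1 − δ fraction of repeated experiments, which is exactly what the out-of-sample tests in Section 5 check.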
We now move on to investigate convergence of ϑ_N and of the set S_N of optimal solutions of (DRSRP'-N). From the discussion in Section 3, we know that H_K(P_N, P*) → 0. However, to broaden the coverage of the convergence results, we present them for a slightly more general case with P* replaced by a set P*.
Theorem 2 (Convergence of the optimal values and optimal solutions) Let P* ⊂ P(Ξ) be such that lim_{N→∞} H_K(P_N, P*) = 0. Let ϑ* denote the optimal value of (DRSRP') with P replaced by P* and let S* be the corresponding set of optimal solutions. Under Assumption 1, for N sufficiently large, the error bound (38) on the optimal value and the inclusion (39) for the optimal solutions hold, where D_X denotes the diameter of X, η is defined as in Proposition 2, and L, κ are defined as in Assumption 1.
where the inequality is due to the equi-Lipschitz continuity of g in ξ and the definition of the Kantorovich metric. Likewise, we can establish the reverse estimate. Combining the two inequalities, we obtain (40). By Proposition 2, v* and v_N are convex on X × T. Moreover, the Slater condition (7) allows us to apply Robinson's error bound for convex inequality systems (see [23]): there exists a positive constant C_1 such that, for any (x, t) ∈ X × T, the distance from (x, t) to the feasible set is bounded by C_1 times the constraint violation. This enables us to establish estimate (41); the last inequality follows from (40), and Robinson's error bound [23] ensures that the constant C_1 is bounded. On the other hand, the uniform convergence of v_N to v ensures v_N(x_0, t_0) − λ < −η/2 for N sufficiently large, which means the convex inequality v_N(x, t) − λ ≤ 0 satisfies the Slater condition. By applying Robinson's error bound to this inequality, we obtain (42) for (x, t) ∈ F* and N sufficiently large, where C_2 is bounded by 2D_X/η. Combining (41) and (42), we obtain (43). Let (x*, t*) be an optimal solution of (DRSRP') with P replaced by P* and (x_N, t_N) an optimal solution of (DRSRP'-N). Comparing the two feasible sets via (43), we can bound the difference of the optimal values, which yields (38). Now we move on to show (39). Let (x_N, t_N) ∈ S_N. Since X and T are compact, there exist a subsequence {(x_{N_k}, t_{N_k})} and a point (x̄, t̄) ∈ X × T such that (x_{N_k}, t_{N_k}) → (x̄, t̄). It follows from (43) and (38) that (x̄, t̄) ∈ F* and t̄ = ϑ*. This shows (x̄, t̄) ∈ S*. Theorem 2 is instrumental in that it provides a unified quantitative convergence result for the optimal value of (DRSRP'-N) in terms of H_K(P_N, P*) when P_N is constructed in the various ways discussed in Section 3. Based on the theorem and quantitative convergence results for H_K(P_N, P*), we can establish confidence intervals for the true optimal value ϑ* in the following corollary.
Corollary 1 Under the assumptions in Theorem 2, the following assertions hold.
(i) If P* comprises the true probability distribution only and P_N is defined by (10), then, under the conditions of Proposition 3, the confidence interval for ϑ* holds with probability 1 − Mδ, with β defined as in (13) and D being the diameter of Ξ. (ii) If P* comprises the true probability distribution only and P_N is defined by (26), then, under the conditions of Proposition 5, the corresponding confidence interval holds with probability 1 − δ.

Extension
We now extend the convergence results to optimization problems with DRSR constraints: (DRSRCP) min_{x∈X} f(x) subject to SR^P_{l,λ}(−c(x, ξ)) ≤ γ, where the decision maker optimizes an objective f(x) while requiring the DRSR to remain below a threshold γ. By replacing P with P_N, we may associate (DRSRCP) with its approximation (DRSRCP-N). A tractable reformulation of problem (DRSRCP) or (DRSRCP-N) may be derived as in Section 3. In what follows, we establish a quantitative convergence result for (DRSRCP-N). Let F̂, Ŝ and θ̂ denote respectively the feasible set, the set of optimal solutions and the optimal value of (DRSRCP). Likewise, we define F̂_N, Ŝ_N and θ̂_N for its approximate problem (DRSRCP-N).
Theorem 3 Let Assumption 1 hold. Suppose that there exists x_0 ∈ X such that SR^P_{l,λ}(−c(x_0, ξ)) < γ and that H_K(P_N, P) → 0 as N → ∞. Then the following assertions hold.
(i) There is a constant C > 0 such that inequality (46) holds for N sufficiently large. (ii) Moreover, if (DRSRCP) satisfies the second order growth condition at the optimal solution set Ŝ, i.e., there exist positive constants α and ε such that the growth inequality holds, then (47) holds when N is sufficiently large.
Proof Part (i) can be established through an analogous proof of Theorem 2.
Next, we show (47). Let x N ∈Ŝ N and x ∈Ŝ. By the second order growth condition, where ΠŜ(a) denotes the orthogonal projection of vector a on setŜ, that is, ΠŜ(a) ∈ arg min s∈Ŝ ∥s − a∥. By the Lipschitz continuity of f , the inequality implies Therefore, Since max we have from inequality (48) and Part (i), ] .
The last inequality implies (47), since x_N is arbitrarily chosen from Ŝ_N and H_K(P_N, P) ≤ √(H_K(P_N, P)) when N is sufficiently large.
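Since the displayed inequalities in the proof were lost in transcription, we sketch the standard chain of estimates behind (47); this is our reconstruction (assumed, not necessarily the authors' exact chain), with L the Lipschitz modulus of f and x̄_N the projection of x_N onto F̂.

```latex
% Reconstruction (assumed): let \bar x_N := \Pi_{\hat F}(x_N) and let
% L be the Lipschitz modulus of f. Then
\begin{align*}
\alpha\, d(\bar x_N,\hat S)^2
  &\le f(\bar x_N)-\hat\theta
     && \text{(second order growth on } \hat F)\\
  &\le f(x_N)-\hat\theta + L\,\|x_N-\bar x_N\|
     && \text{(Lipschitz continuity of } f)\\
  &=  (\hat\theta_N-\hat\theta) + L\, d(x_N,\hat F)
     && (f(x_N)=\hat\theta_N)\\
  &\le C'\, H_K(\mathcal{P}_N,\mathcal{P})
     && \text{(Part (i))},
\end{align*}
% so that
\[
  d(x_N,\hat S)\le \|x_N-\bar x_N\| + d(\bar x_N,\hat S)
  \le C'H_K(\mathcal{P}_N,\mathcal{P})
   + \sqrt{C'H_K(\mathcal{P}_N,\mathcal{P})/\alpha},
\]
% which is of order \sqrt{H_K(\mathcal{P}_N,\mathcal{P})}, as in (47).
```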
Analogous to Corollary 1, we can derive confidence intervals and regions for the optimal values under the different constructions of P_N.

Application in portfolio optimization
In this section, we apply the (DRSRP) model to decision-making problems in portfolio optimization. Let ξ_i denote the rate of return from investment in stock i and x_i the capital invested in stock i, for i = 1, . . . , d. The total return from the investment in the d stocks is x^T ξ, where we write ξ for (ξ_1, ξ_2, . . . , ξ_d)^T and x for (x_1, x_2, . . . , x_d)^T. We consider a situation where the investor's decision on allocation of the capital is based on minimizing the distributionally robust shortfall risk of x^T ξ, namely SR^P_{l,λ}(x^T ξ) for some specified l, λ and P; that is, the investor finds an optimal decision x* by solving (49). We have undertaken numerical experiments on problem (49) from different perspectives, ranging from the efficiency of the computational schemes discussed in Section 3 to the out-of-sample performance of the optimal portfolio and the growth of the total portfolio value over a specified time horizon under different optimal strategies. Our main numerical experiments focus on problem (49) with the ambiguity set defined through the Kantorovich ball. We report the details in Example 1.
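In display form, the investor's problem (numbered (49) in the original) can be reconstructed as follows; taking X to be the budget simplex is our assumption here:

```latex
\[
  \min_{x\in X}\ \sup_{P\in\mathcal{P}}\
    \mathrm{SR}^{P}_{l,\lambda}\bigl(x^{T}\xi\bigr),
  \qquad
  X:=\Bigl\{x\in\mathbb{R}^d:\ x\ge 0,\ \textstyle\sum_{i=1}^{d}x_i=1\Bigr\}.
\]
```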
Example 1 Let ξ^1, . . . , ξ^N be iid samples of ξ and let P_N be the nominal (empirical) distribution constructed from the samples; the ambiguity set is then defined through the Kantorovich ball centred at P_N. To simplify the tests, we consider a specific piecewise affine loss function l(z) = max{0.05z + 1, z + 0.1, 4z + 2}. We set λ = 1 and fix the total number of stocks at d = 10. We follow Esfahani and Kuhn [8] in generating the iid samples by assuming that the rate of return ξ_i decomposes into a systematic risk factor ψ ∼ N(0, 2%) common to all stocks and an unsystematic risk factor ζ_i ∼ N(i × 3%, i × 2.5%) specific to stock i, that is, ξ_i = ψ + ζ_i. Based on the discussion in Section 3.2, problem (49) can be reformulated through the dual formulation as problem (51). We use ∥·∥ to denote the 1-norm, so the dual norm ∥·∥* is the ∞-norm. Following the terminology of Esfahani and Kuhn [8], we call J_N(r) the certificate.
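The sample-generation scheme and the loss function above can be sketched in code as follows. Whether the percentages denote standard deviations or variances is not fully specified in the text, so treating them as standard deviations is an assumption here; the function names are ours.

```python
import numpy as np

def generate_returns(N, d=10, rng=None):
    """Generate N iid return samples: xi_i = psi + zeta_i, where psi is a
    systematic factor common to all d stocks and zeta_i is stock-specific.
    Assumption: the 2% / (i x 3%, i x 2.5%) figures from Example 1 are read
    as (mean, standard deviation) pairs."""
    rng = np.random.default_rng(rng)
    psi = rng.normal(0.0, 0.02, size=(N, 1))       # systematic risk factor
    means = 0.03 * np.arange(1, d + 1)             # i * 3%
    stds = 0.025 * np.arange(1, d + 1)             # i * 2.5%
    zeta = rng.normal(means, stds, size=(N, d))    # unsystematic factors
    return psi + zeta                              # broadcast sum, shape (N, d)

def loss(z):
    """Piecewise affine loss l(z) = max{0.05 z + 1, z + 0.1, 4 z + 2}."""
    z = np.asarray(z, dtype=float)
    return np.maximum.reduce([0.05 * z + 1.0, z + 0.1, 4.0 * z + 2.0])
```

For instance, `loss(0.0)` evaluates the three affine pieces at zero and returns their maximum.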
In the first set of experiments, we investigate the impact of the radius r of the Kantorovich ball on the out-of-sample performance of the optimal portfolio. For a fixed portfolio x_N(r) obtained from problem (51), the out-of-sample performance is defined as J(x_N(r)) := SR^{P*}_{l,λ}(x_N(r)^T ξ), which can in principle be computed exactly since the true probability distribution P* is known by design; in the experiments, however, we generate a validation set of 2 × 10^5 samples to carry out the evaluation. Following the same strategy as in [8], we generate training datasets of cardinality N ∈ {30, 300, 3000} to solve problem (51) and then use the same validation samples to evaluate J(x_N(r)). Each experiment is carried out over 200 simulation runs.
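The evaluation of the shortfall risk on the validation samples can be sketched via bisection, using the standard utility-based definition SR_{l,λ}(X) = inf{t : E[l(−X − t)] ≤ λ}; the exact definition used in the paper, and the initial bracket [lo, hi], are assumptions here.

```python
import numpy as np

def empirical_shortfall_risk(returns, lam=1.0, lo=-10.0, hi=10.0, tol=1e-8):
    """Empirical utility-based shortfall risk
        SR(X) = inf{ t : E[ l(-X - t) ] <= lam },
    computed by bisection with the piecewise affine loss of Example 1.
    Since l is nondecreasing, t -> E[l(-X - t)] is nonincreasing, so
    bisection applies. The bracket [lo, hi] may need widening."""
    def l(z):
        return np.maximum.reduce([0.05 * z + 1.0, z + 0.1, 4.0 * z + 2.0])
    X = np.asarray(returns, dtype=float)
    phi = lambda t: l(-X - t).mean()       # empirical expected loss at level t
    assert phi(lo) > lam >= phi(hi), "bracket does not contain the root"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) <= lam:                # feasible: shrink from above
            hi = mid
        else:                              # infeasible: shrink from below
            lo = mid
    return hi
```

Cash invariance (SR(X + c) = SR(X) − c) offers a quick sanity check on the implementation.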
Figure 1 depicts the tubes between the 20% and 80% quantiles (shaded areas) and the means (solid lines) of the out-of-sample performance J(x_N(r)) as a function of the radius r; the dashed lines represent the empirical probability of the event J(x_N(r)) ≤ J_N(r) over the 200 independent runs, which is called reliability in Esfahani and Kuhn [8]. The reliability is nondecreasing in r because, as r grows, the true probability distribution P* is more likely to be contained in P_N, and hence the event J(x_N(r)) ≤ J_N(r) occurs more often. The out-of-sample performance of the portfolio first improves (decreases) and then deteriorates (increases) as r grows.
In the second set of experiments, we investigate convergence of the out-of-sample performance, the certificate and the reliability of the DRO approach (51) and of the SAA approach as the sample size increases. Note that SAA corresponds to the case where the radius r of the Kantorovich ball is zero. In all tests we use the cross-validation method in [8] to select the Kantorovich radius from the discrete set {5, 6, 7, 8, 9} × 10^{−3} ∪ {0, 1, 2, . . . , 9} × 10^{−2} ∪ {0, 1, 2, . . . , 9} × 10^{−1}. We have verified that refining or extending this discrete set has only a marginal impact on the results. Figure 2(a) shows the tubes between the 20% and 80% quantiles (shaded areas) and the means (solid lines) of the out-of-sample performance J(x_N) as a function of the sample size N based on 200 independent simulation runs, where x_N is the minimizer of (51) or of its SAA counterpart (r = 0). The constant dashed line represents the optimal value of the SAA problem with N = 10^6 samples, which is regarded as the optimal value of the original problem with the true probability distribution. It is observed that the DRO model (51) outperforms the SAA model in terms of out-of-sample performance. Figure 2(b) depicts the optimal values of the DRO model and its SAA counterpart, which are the in-sample estimates of the obtained portfolio performance. Both approaches display asymptotic consistency, in agreement with the out-of-sample and in-sample results. Figure 2(c) describes the empirical probability of the event J(x_N) ≤ J_N over the 200 independent runs, where x_N is the optimal solution of the DRO or SAA model and J_N is the optimal value of the corresponding problem. It is clear that the performance of the DRO model is better than that of the SAA model.
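The cross-validation selection of the Kantorovich radius can be sketched as follows. The functions `solve_dro` and `evaluate` are hypothetical placeholders for the DRO solver of (51) and the out-of-sample shortfall-risk evaluator; the k-fold scheme is a generic sketch of the method borrowed from [8], not necessarily the authors' exact protocol.

```python
import numpy as np

def select_radius(samples, radii, solve_dro, evaluate, k=5):
    """k-fold cross validation for the Kantorovich radius.
    solve_dro(train, r): returns a portfolio for training data and radius r.
    evaluate(x, valid):  returns the portfolio's empirical shortfall risk
                         on the validation fold.
    Returns the radius with the smallest average validation score."""
    samples = np.asarray(samples)
    folds = np.array_split(np.arange(len(samples)), k)
    scores = []
    for r in radii:
        fold_scores = []
        for i in range(k):
            valid_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            x = solve_dro(samples[train_idx], r)
            fold_scores.append(evaluate(x, samples[valid_idx]))
        scores.append(np.mean(fold_scores))   # average over folds
    return radii[int(np.argmin(scores))]
```

Any solver/evaluator pair with these signatures can be plugged in; the discrete grid of candidate radii from the text is passed as `radii`.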

Example 2
In the last experiment, we evaluate the performance of problem (49) with the ambiguity set constructed through the KL-divergence ball and through the Kantorovich ball. The tests are carried out on problem (49) with 10 stocks (Apple Inc., Amazon.com, Inc., Baidu Inc., Costco Wholesale Corporation, DISH Network Corp., eBay Inc., Fox Inc., Alphabet Inc Class A, Marriott International Inc., QUALCOMM Inc.), whose historical data are collected from the National Association of Securities Dealers Automated Quotations (NASDAQ) index over four years (from 3rd May 2011 to 23rd April 2015), giving a total of 1000 records of historical stock returns.
We have carried out out-of-sample tests with a rolling window of 500 days; that is, we use the first 500 observations to calculate the optimal portfolio strategy for day 501 and then roll the window forward. The radii in the two ambiguity sets are selected through the cross-validation method. Figure 3 depicts the performance of the three models over 500 trading days. It seems that the KL-divergence model and the SAA model perform similarly, whereas the Kantorovich model outperforms both over most of the time period.
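The rolling-window test above can be sketched as follows; `solve_portfolio` is a hypothetical placeholder for whichever model (KL-divergence, Kantorovich or SAA) produces the portfolio weights from the trailing history, and compounding the daily returns into a wealth path is our assumption about how "performance over 500 trading days" is tracked.

```python
import numpy as np

def rolling_backtest(returns, solve_portfolio, window=500):
    """Rolling-window out-of-sample test: use the last `window` daily return
    observations to compute a portfolio for the next day, apply that day's
    realized return, then roll forward one day.
    solve_portfolio(history): maps a (window x d) array to a weight vector x.
    Returns the compounded wealth path starting from 1.0."""
    returns = np.asarray(returns)
    wealth = [1.0]
    for t in range(window, len(returns)):
        x = solve_portfolio(returns[t - window:t])   # trailing history only
        wealth.append(wealth[-1] * (1.0 + returns[t] @ x))
    return np.array(wealth)
```

With 1000 daily records and `window=500`, this produces the 500-day out-of-sample wealth path described in the text.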