Distributionally robust optimization with matrix moment constraints: Lagrange duality and cutting plane methods

A key step in solving minimax distributionally robust optimization (DRO) problems is to reformulate the inner maximization with respect to the probability measure as a semi-infinite programming problem through Lagrange duality. Slater type conditions have been widely used to establish strong duality (zero duality gap) when the ambiguity set is defined through moments. In this paper, we investigate effective ways of verifying the Slater type conditions and introduce other conditions based on the lower semicontinuity of the optimal value function of the inner maximization problem. Moreover, we propose two discretization schemes for solving the DRO, one for the dualized DRO and the other working directly with the ambiguity set of the DRO. In the absence of strong duality, the discretization scheme via Lagrange duality may provide an upper bound for the optimal value of the DRO, whereas the direct discretization approach provides a lower bound. Two cutting plane schemes are consequently proposed: one for the discretized dualized DRO and the other for the minimax DRO with discretized ambiguity set. Convergence analysis is presented for the approximation schemes in terms of the optimal values, optimal solutions and stationary points. Comparative numerical results are reported for the resulting algorithms.


Introduction
One of the most challenging issues in decision analysis is to find an optimal decision under uncertainty. The solvability of a decision problem and the quality of an optimal decision rely heavily on the information about the underlying uncertainties, which are often mathematically represented by a vector of random variables. If a decision maker has complete information on the distribution of the random variables, then he can either obtain a closed form of the integral of the random functions in the problem and then convert it into a deterministic optimization problem, or alternatively use various statistical and numerical integration approaches such as the scenario method [20], Monte Carlo sampling [39] and quadrature rules [11] to develop a deterministic approximation scheme and solve it using a standard linear/nonlinear programming code. The numerical efficiency of an approximation scheme and the quality of an optimal solution obtained from it depend on the structure (both the objective and constraints) and the scale (dimensionality) of the problem.
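The Monte Carlo route mentioned above can be sketched in a few lines via sample average approximation (SAA); the loss function, distribution and sample size below are illustrative assumptions for this sketch, not taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize

# Sample average approximation (SAA): replace E_P[f(x, xi)] by the mean over
# a Monte Carlo sample of xi, then solve the resulting deterministic problem.
rng = np.random.default_rng(0)

def f(x, xi):
    # illustrative loss (an assumption for this sketch): quadratic tracking
    return (x - xi) ** 2

samples = rng.normal(loc=1.0, scale=0.5, size=2000)   # xi ~ N(1, 0.25)

def saa_objective(x):
    return np.mean(f(x[0], samples))

res = minimize(saa_objective, x0=np.array([0.0]))
# for this loss the SAA minimizer is the sample mean, close to E[xi] = 1
```

As the sample size grows, the SAA optimal value and solution converge to those of the true expected-value problem under standard conditions.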
The situation may become far more complex if the decision maker does not have complete information on the distribution of the random variables. For instance, if the decision maker does not have any information other than the range of the values of the random variables, then it might be a reasonable option to choose an optimal decision on the basis of the extreme values of the random variables in order to mitigate the risks. This kind of decision making framework is known as robust optimization, where the decision maker is extremely risk averse or lacks information on the distribution of the underlying random variables as described above. It is useful in some decision making problems, particularly in engineering design [3,8] where a design takes into account extreme and rare events. However, the model may incur significant economic and/or computational costs in that excessive resources are used to prevent a rare event, resulting in numerical intractability or inefficiency. Over the past two decades, numerous efforts have been made to develop approximation schemes for solving robust optimization models which balance numerical tractability and the quality of an optimal solution; see the monograph by Ben-Tal et al. [4].
An alternative but possibly less conservative robust optimization model, which is known as distributionally robust optimization (DRO), involves a decision maker who is able to construct, from historical data, computer simulation or subjective judgement, an ambiguity set of distributions which contains the true distribution with certain confidence. In such circumstances, it is possible to choose an optimal decision on the basis of the worst distribution from the ambiguity set. For example, if we know roughly the nature of the distribution of the random variables and can observe some samples, then we may use the classical maximum likelihood method to determine the parameters of the distribution and in that way construct a set of distributions when the sample is inadequate.
This kind of robust optimization framework may be traced back to the early work of Scarf [37], which was motivated by incomplete information on the underlying uncertainty in supply chain and inventory control problems. In such problems, historical data may be insufficient to estimate the future distribution either because the sample size of past demand is too small or because there is reason to suspect that future demand will come from a different distribution. A larger distributional set which contains the true distribution may adequately address the risk from the ambiguity of the uncertainty. The DRO model has found many applications in operations research, finance and management sciences. It has been well investigated through a number of further research works by Žáčková [52], Dupačová [14], Shapiro and Ahmed [40]. Over the past few years, it has gained substantial popularity through further contributions by Bertsimas and Popescu [7], Bertsimas et al. [6], Delage and Ye [13], Goh and Sim [17], Hu and Hong [21], Goldfarb and Iyengar [18], Mehrotra and Papp [26], Pflug et al. [29], Popescu [31], Wiesemann et al. [47,48], to name a few.
In this paper, we consider the following distributionally robust optimization problem:

min_{x ∈ X} sup_{P ∈ P} E_P[f(x, ξ)],  (1.1)

where X is a closed subset of IR^n, f : IR^n × IR^k → IR is a continuous function, ξ : Ω → Ξ ⊂ IR^k is a vector of random variables defined on a measurable space (Ω, F) equipped with sigma algebra F, and P is a set of probability distributions defined as

P := {P ∈ 𝒫 : E_P[Ψ_i(ξ)] = 0, for i = 1, …, p; E_P[Ψ_i(ξ)] ⪯ 0, for i = p + 1, …, q}.  (1.2)

Here Ψ_i : Ξ → IR^{n_i × n_i}, i = 1, …, q, is a symmetric matrix or a scalar with measurable random components, and 𝒫 denotes the set of all probability distributions/measures over the space (Ω, F); the notation ⪯ means that when Ψ_i is a matrix, its expected value must be negative semidefinite. In the case when n_i = 1 for i = 1, …, q, Ψ_i reduces to a scalar function and (1.2) collapses to a classical moment problem. Note that if we consider (Ξ, B) as a measurable space equipped with the Borel sigma algebra B, then P may be viewed as a set of probability measures defined on (Ξ, B) induced by the random variate ξ. So we may also write P(Ξ) for 𝒫. Following the terminology in the literature of robust optimization, we call P the ambiguity set, which indicates ambiguity of the true probability distribution of ξ at the point of decision making. As we will see in later discussions, Ψ_i may take some specific forms. Here we consider a general form in the hope that our model covers a range of interesting moment problems. To ease the notation, we will use ξ to denote either the random vector ξ(ω) or an element of IR^k depending on the context. An important issue concerning DRO is the numerical tractability of the robust formulation. For example, Delage and Ye [13] consider the DRO problem with ambiguity in both the mean and the covariance and demonstrate how their model can be solved in polynomial time when the support set is convex and compact.
Goh and Sim [17] provide a tractable approximation scheme when the DRO is applied to a class of two stage stochastic programming problems. More recently, Wiesemann, Kuhn and Sim [49] provide a unified framework for DRO problems where the ambiguity set is constructed through some probabilistic and moment constraints. Under the Slater type conditions and essential boundedness of the support set, they provide a tractable reformulation of the problems.
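Once the support is discretized into finitely many atoms ξ_1, …, ξ_N, membership in a moment-constrained ambiguity set of the form (1.2) reduces to finite-dimensional checks on a weight vector. A minimal sketch (the atoms, the matrix moment map Ψ and the threshold Σ are illustrative assumptions):

```python
import numpy as np

# A measure on atoms xi_1, ..., xi_N is a weight vector p on the simplex;
# it belongs to a set of the form (1.2) if the weighted moments satisfy the
# equality/semidefinite constraints.
rng = np.random.default_rng(1)
N = 50
atoms = rng.normal(size=(N, 2))              # xi_j in IR^2
p = np.full(N, 1.0 / N)                      # candidate weights (uniform)

mu = atoms.T @ p                             # E_p[xi]
Sigma = 3.0 * np.eye(2)                      # generous covariance bound
# matrix moment Psi(xi) = (xi - mu)(xi - mu)^T - Sigma; require E_p[Psi] <= 0
M = sum(pj * np.outer(x - mu, x - mu) for pj, x in zip(p, atoms)) - Sigma

def neg_semidefinite(A, tol=1e-10):
    # A <= 0 in the semidefinite order iff all eigenvalues of A are <= 0
    return bool(np.all(np.linalg.eigvalsh(A) <= tol))

feasible = neg_semidefinite(M)               # True for this candidate
```

The same eigenvalue test is what a cutting plane oracle would use to detect a violated semidefinite moment constraint.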
In a slightly different direction, the DRO approach has been applied to tackle chance constrained stochastic programming problems where there is a lack of complete information on the true probability distribution. Zymler et al. [54] consider a class of robust chance constrained optimization problems with the ambiguity set being constructed through moment conditions and reformulate the robust constraint as semi-infinite constraints. In the case when the support set of the random variable covers the whole space and the underlying functions in the chance constraint are linear w.r.t. both the decision variables and the random variables, they reformulate the semi-infinite constraints as a system of semidefinite constraints and demonstrate that the resulting semidefinite program (SDP for short) is numerically tractable. In a more recent development, Yang and Xu [46] extend the research to the case where the underlying functions in the chance constraint are nonlinear. A deficiency of these robust approaches is that they may easily render the robust chance constraint infeasible: the ambiguity set may comprise a sequence of probability measures which concentrate their probability mass near the mean value, so that the robust probability of the inner random constraints (in the chance constraint) equals 1 when the mean lies in the inner feasible set. Of course, we are less concerned about this issue if the chance constraint is focused on the tail distribution of a loss function.
Our aim in this paper is to develop numerical methods for solving problem (1.1). Differing from the mainstream research in the literature of DRO, we concentrate on the practical applicability of the computational methods without paying particular attention to numerical tractability, in the hope that the computational schemes and the underlying theory developed in this paper can be applied to a wide range of problems. Recall that a popular method for solving minimax distributionally robust optimization problems is to reformulate the inner maximization problem as a semi-infinite programming problem through Lagrange duality. A key theoretical question in our context is under what conditions problem (1.1) and its Lagrange dual problem are equivalent, i.e., strong duality holds. The equivalence is well known when either the support set is compact in a finite dimensional space (see [41]) or the system of equalities and/or inequalities satisfies the Slater type conditions [38]. In the latter case, since the decision variables in the inner maximization problem are probability measures, one might wish to see whether a probability measure defined by an inequality moment constraint, i.e. ⟨P, ψ(ξ)⟩ ⪯ 0 (hereafter ⟨·, ·⟩ is a bilinear representation of the expected value of the function ψ), lies in the "interior" of the feasible set (the ambiguity set P). Unfortunately this kind of verification may turn out to be difficult, at least technically, since it concerns the topological structure of the ambiguity set, which is a subset of the space of probability measures. Shapiro [38] proposes an alternative way to characterize the condition, which in this context requires the range of ⟨·, ψ(ξ)⟩ over the cone of positive measures generated by P to have nonempty intersection with the interior of IR_+.
While this effectively addresses the theoretical issue we have just raised, the condition is often difficult to verify, particularly when ψ is a vector of random functions or matrices, because in that case we would need "coordination" of the components of ψ for the expected values. Likewise, in the equality case, the condition requires 0 to lie in the range of ⟨·, ψ(ξ)⟩, which is difficult to verify when ψ is vector-valued. It would become even more challenging when P is composed of both equality and inequality constraints. This motivates us to develop effective approaches for verifying the conditions and to look into other complementary conditions in this paper.
Another main challenge concerning (1.1) is to develop efficient numerical methods for solving the problem. When the support set is a finite set, the Lagrange dual is an ordinary matrix optimization problem, so we may apply the available codes on matrix optimization (see e.g. [45]) to solve the latter. It is also possible to solve problem (1.1) directly as a finite dimensional minimax saddle point problem. Indeed, Pflug and Wozabal [30] propose an iterative scheme for solving distributionally robust portfolio optimization problems where the inner maximization problem and the outer minimization problem are solved in turn. Mehrotra and Papp [26] extend the approach to a general class of DRO problems and design a process which generates a "cutting surface" of the inner optimal value function at each iterate. In the case when Ξ is well structured, such as polyhedral or semialgebraic, and the underlying functions (f and Ψ) are quadratic or linear, one may recast the semi-infinite inequality constraints as a semidefinite constraint through the well known S-lemma [32]. We note that this kind of formulation is the most popular approach in the literature of distributionally robust optimization; see for instance [13,49] and the references therein. Here we concentrate on the case where Ξ is neither a finite set nor has the aforementioned structure, and develop some computational methods which complement the existing numerical schemes for the DRO. The main contributions of the paper can be summarized as follows.
• We present a detailed analysis of conditions for the strong Lagrange duality of the inner maximization problem, namely the Slater type conditions and the lower semicontinuity condition. The analysis concerning the Slater type conditions is based on Shapiro's duality theorem [38, Proposition 3.4], which has been widely used in the literature of distributionally robust optimization with moment constraints but rarely scrutinized in detail. In Sect. 2, we present detailed discussions on the Slater type condition through a few practically interesting moment problems and demonstrate how the condition may be effectively verified. We also look into the duality conditions from the lower semicontinuity of the optimal value function of the perturbed inner maximization problem and derive sufficient conditions which are easy to verify (Proposition 2.3). While the conditions are restrictive in general, we find that they are satisfied in a number of important cases, such as when the support set is compact or Ψ_i is bounded, and this may effectively complement the popular Slater type condition in circumstances when the latter is difficult to verify. Indeed, we can easily find some examples where the lower semicontinuity conditions are satisfied whereas the Slater type condition fails; see Example 2.7. • We propose a discretization scheme based on Monte Carlo sampling for approximating the semi-infinite constraints of the Lagrange dual of the inner maximization problem. The approach is in line with the randomization scheme considered by Campi and Calafiore [10] and Anderson, Xu and Zhang [1] for mathematical programs with robust convex constraints. Under some moderate conditions, we demonstrate convergence of the optimal values, the optimal solutions and the stationary points obtained from the approximate problems as the sample size increases (Theorems 3.1 and 3.2).
Moreover, by observing the equivalence between the Monte Carlo discretization scheme and discretization of the ambiguity set P under strong duality, we propose a cutting plane method for solving the approximate DRO (1.1) directly as a finite dimensional minimax optimization problem and show convergence of the approximation scheme in terms of the optimal values and optimal solutions as the sample size increases (Theorem 4.2). In the absence of strong duality, we observe that the discretization scheme via Lagrange duality provides an upper bound for the optimal value of the DRO when the sample size is sufficiently large, whereas the direct discretization approach provides a lower bound for any sample size. • Based on the aforementioned approximation schemes, we propose two algorithms for solving problem (1.1): a cutting plane algorithm for solving the discretized dual problem (Algorithm 3.1) and a cutting plane method for the minimax DRO with discretized ambiguity set (Algorithm 4.1). We have carried out comparative numerical tests on the two algorithms through a portfolio optimization problem (Example 5.1) and a multiproduct newsvendor problem (Example 5.2) and conclude that the former is more sensitive to the change of the number of decision variables whereas the latter is more sensitive to the change of sample size.
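To illustrate the flavor of the cutting plane schemes summarized above, the following sketch runs a Kelley-type cutting plane method on a toy discretized minimax problem min_x max_j f(x, ξ_j); the loss f, the atoms and the box bounds are illustrative assumptions, and this sketch is not the paper's Algorithm 3.1 or 4.1:

```python
import numpy as np
from scipy.optimize import linprog

# Kelley-type cutting plane sketch for
#   min_{x in [-5, 5]} max_j f(x, xi_j),   f(x, xi) = (x - xi)^2,
# with atoms standing in for a discretized ambiguity set. Each iteration adds
# a linear cut of f at the current worst-case atom and re-solves a small LP.
atoms = np.array([-1.0, 0.5, 2.0])
cuts = []                                    # cuts of the form t >= a*x + b
x = 0.0
for _ in range(30):
    j = int(np.argmax((x - atoms) ** 2))     # worst-case atom at current x
    a = 2.0 * (x - atoms[j])                 # gradient of f(., xi_j) at x
    b = (x - atoms[j]) ** 2 - a * x          # intercept making the cut tight
    cuts.append((a, b))
    # LP in (x, t): minimize t subject to t >= a_i * x + b_i for all cuts
    A_ub = np.array([[a_i, -1.0] for a_i, _ in cuts])
    b_ub = np.array([-b_i for _, b_i in cuts])
    res = linprog(c=[0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-5.0, 5.0), (None, None)])
    x = float(res.x[0])
# x approaches the minimax point (midpoint of the extreme atoms), here 0.5
```

The actual algorithms in the paper cut the inner optimal value function over probability measures rather than a fixed finite loss, but the add-a-cut/re-solve loop has the same structure.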
Throughout the paper, we use the following notation. By convention, we use S^n, S^n_+ and S^n_− to denote the space of symmetric matrices, the cone of symmetric positive semidefinite matrices and the cone of symmetric negative semidefinite matrices in the n × n matrix space IR^{n×n}, and IR^n_+ to denote the cone of vectors with non-negative components in IR^n. For matrices A, B ∈ IR^{n×n}, we write A • B for the Frobenius inner product, i.e., A • B = trace(AB), and A ⪯ B and A ≺ B to indicate that A − B is negative semidefinite and negative definite respectively. We use (Z, d) to represent an abstract metric space Z with metric d. For a set C ⊂ Z, we use by convention "int C", "cl C" and "conv C" to denote its interior, closure and convex hull respectively. We write d(z, D) := inf_{z′ ∈ D} d(z, z′) for the distance from a point z to a set D. For two sets C and D, D(C, D) := sup_{z ∈ C} d(z, D) stands for the deviation/excess of the set C from/over the set D. For a sequence of subsets {C_k} in a metric space, we follow the standard notation [36] by using lim sup_{k→∞} C_k to denote its outer limit, that is, the set of all points z such that z = lim_{i→∞} z_{k_i} for some subsequence {z_{k_i}} with z_{k_i} ∈ C_{k_i}. For a set-valued mapping (also called a multifunction in the literature) A : X → 2^Y, A is said to be closed at x̄ if x_k ∈ X, x_k → x̄, y_k ∈ A(x_k) and y_k → ȳ imply ȳ ∈ A(x̄). A is said to be outer semicontinuous at x̄ ∈ X if lim sup_{x→x̄} A(x) ⊆ A(x̄). When A(x) is compact for each x, A(x) is upper semicontinuous (in the sense of Berge [5]) at x̄ if and only if for every ε > 0 there exists a constant δ > 0 such that A(x) ⊂ A(x̄) + εB for all x with d(x, x̄) ≤ δ. When the set-valued mapping A(·) is bounded, outer semicontinuity coincides with upper semicontinuity; see [36, Theorem 5.19] for the Euclidean space and [24, Theorem 4.27] for a general Hausdorff space.
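The matrix notation above can be checked numerically; a small sketch of the Frobenius inner product and of the semidefinite order ⪯ (the eigenvalue test is one standard way to verify it):

```python
import numpy as np

# Frobenius inner product A • B = trace(AB), and the order A <= B meaning
# A - B is negative semidefinite (all eigenvalues of A - B nonpositive).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
B = np.array([[3.0, 0.0],
              [0.0, 3.0]])

frob = np.trace(A @ B)            # for symmetric matrices equals sum(A * B)
eigs = np.linalg.eigvalsh(A - B)  # eigenvalues of A - B: here 0 and -2
A_preceq_B = bool(np.all(eigs <= 1e-12))
```

Here A − B = [[−1, 1], [1, −1]] has eigenvalues 0 and −2, so A ⪯ B holds but A ≺ B does not.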

Lagrange dual of the inner maximization problem in (1.1)
Let x ∈ X be fixed. We consider the inner maximization problem of (1.1) and its Lagrange dual, where M_+ denotes the cone of positive measures generated by P(Ξ).
As discussed in the introduction, a key step towards the numerical solution of problem (1.1) is to establish equivalence between problems (2.1) and (2.2). In the literature of distributionally robust optimization, the equivalence has been well established under the circumstances where the support set is compact and Ψ_i(·) is continuous (see [41, page 312]), or the moment problem satisfies the Slater type condition (see [38,49] and references therein). This is underpinned by Shapiro's duality theorem ([38, Proposition 3.4]) for a general class of moment problems.
Note that in the case when the optimal value of problem (2.1) is +∞, the dual problem (2.2) is infeasible. In that case, the equivalence is trivial. So our focus in this section is on the case when the optimal value of (2.1) is finite.

Slater type conditions
Let us start with the Slater type condition (STC for short). Following Shapiro's duality theory for moment problems, the condition in our context can be written as

0 ∈ int{⟨P, Ψ(ξ)⟩ + {0} × K_{q−p} : P ∈ M_+},  (2.3)

where Ψ(ξ) := (Ψ_1(ξ), …, Ψ_q(ξ)), ⟨P, Ψ(ξ)⟩ = ∫_Ξ Ψ(ξ) P(dξ) with the integration taken componentwise, the second {0} at the right hand side is the Cartesian product of zero matrices in the respective matrix spaces IR^{n_i × n_i} corresponding to Ψ_i for i = 1, …, p, and K_{q−p} := S^{n_{p+1}}_+ × ⋯ × S^{n_q}_+; see [38, condition (3.12)] for general moment problems. Here we discuss how this condition may be satisfied and how it could be appropriately verified through some typical examples.
Example 2.1 (Reformulation of the STC and sufficient conditions for it) Consider the following moment problem:

P := {P ∈ P(Ξ) : E_P[Ψ_i(ξ)] = μ_i for i = 1, …, p; E_P[Ψ_i(ξ)] ⪯ μ_i for i = p + 1, …, q},

where Ψ_i : Ξ → S^{n_i}, i = 1, …, q, are continuous maps and μ_i ∈ S^{n_i}, i = 1, …, q, are constant matrices which could be either the true mean values of Ψ_i or their approximations/estimates. For simplicity of notation we write Ψ_E for (Ψ_1, …, Ψ_p), Ψ_I for (Ψ_{p+1}, …, Ψ_q), μ_E for (μ_1, …, μ_p) and μ_I for (μ_{p+1}, …, μ_q). The Slater type condition in this case can be written as

(μ_E, μ_I) ∈ int{(⟨P, Ψ_E(ξ)⟩, ⟨P, Ψ_I(ξ)⟩ + K_{q−p}) : P ∈ M_+},  (2.4)

where the pairing ⟨·, ·⟩ is understood componentwise.

Proposition 2.1
The following assertions hold.
(iii) Condition (2.6) holds naturally in the case when Conversely, let (2.5) hold. Then for a sufficiently small positive number δ which yields (2.4). Part (ii). Conditions (2.6) and (2.7) guarantee the existence of P_E ∈ P(Ξ) such that The proposition shows how the complex STC (2.4) can be examined through (2.5) and further through (2.7). Condition (2.9) is widely known as the Slater condition for inequality systems. The discussions show that the STC is weaker than the well known Slater condition.
We now turn to consider the case when P comprises a single matrix moment constraint.

Example 2.2 (STC for a single matrix moment constraint) Let Ψ(ξ) := (ξ − μ)(ξ − μ)^T − Σ,
where ξ is a random vector with support set Ξ ⊂ IR^n, and μ and Σ are either the true mean value and covariance matrix respectively or their estimates. Consider two types of moment conditions: one inequality constrained and the other equality constrained. The former is often used when a decision maker does not have complete information on the true mean value and/or covariance, whereas the latter corresponds to the circumstance when the true covariance is known. We discuss them in turn.
(a) With incomplete information on the mean and/or covariance, the moment problem is often written as P := {P ∈ P(Ξ) : E_P[(ξ − μ)(ξ − μ)^T] ⪯ Σ}, where Σ is some positive definite matrix. Let Σ_0 denote the true covariance matrix (corresponding to the unknown true probability distribution of ξ) and assume that Σ_0 ≺ Σ. Note that, following the analysis as in Example 2.1, we can recast condition (2.3) as

0 ∈ int{⟨P, Ψ(ξ)⟩ + S^n_+ : P ∈ M_+}.  (2.10)

It is easy to observe that condition (2.10) holds in that ⟨P_0, Ψ(ξ)⟩ ≺ 0 under the assumption Σ_0 ≺ Σ, and any n × n positive definite matrix lies in the interior of S^n_+. (b) In the equality constraint case, the moment condition becomes E_P[(ξ − μ)(ξ − μ)^T] = Σ and the Slater type condition becomes 0 ∈ int{⟨P, Ψ(ξ)⟩ : P ∈ P(Ξ)}. The condition is fulfilled if Σ ∈ int conv{(ξ − μ)(ξ − μ)^T : ξ ∈ Ξ}. The latter is automatically satisfied when Ξ = IR^n; see Proposition 2.2 below.
The example shows how condition (2.4) is verified through different arguments for equality and inequality matrix moment constraints. Proposition 2.2 (Image of covariance mapping over P(Ξ)) If Ξ = IR^n, then

S^n_+ = {E_P[(ξ − μ)(ξ − μ)^T] : P ∈ P(Ξ)}.  (2.11)

Proof Observe that the right hand side of (2.11) is the image of the covariance mapping over the space of probability measures P(Ξ). It suffices to show S^n_+ ⊆ {E_P[(ξ − μ)(ξ − μ)^T] : P ∈ P(Ξ)} because the opposite inclusion always holds. Let M ∈ S^n_+ be any positive semidefinite matrix with eigenvalues λ_j and normalized eigenvectors q_j for j = 1, …, n. Let ξ_j := μ + √(nλ_j) q_j, let P_j, j = 1, …, n, denote the Dirac probability measure at ξ_j, and let P̄ := Σ_{j=1}^n (1/n) P_j. Then P̄ ∈ P(Ξ) and

E_{P̄}[(ξ − μ)(ξ − μ)^T] = (1/n) Σ_{j=1}^n nλ_j q_j q_j^T = M.

The conclusion follows.
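The construction in the proof of Proposition 2.2 can be verified numerically; the sketch below assumes the scaling ξ_j = μ + √(nλ_j) q_j, under which the uniform measure on {ξ_1, …, ξ_n} has second moment about μ exactly equal to M:

```python
import numpy as np

# Verify: with xi_j = mu + sqrt(n * lambda_j) q_j and mass 1/n at each xi_j,
# (1/n) sum_j (xi_j - mu)(xi_j - mu)^T = sum_j lambda_j q_j q_j^T = M.
rng = np.random.default_rng(2)
n = 4
mu = rng.normal(size=n)
G = rng.normal(size=(n, n))
M = G @ G.T                                  # an arbitrary PSD matrix
lam, Q = np.linalg.eigh(M)                   # eigenpairs (lambda_j, q_j)
lam = np.clip(lam, 0.0, None)                # guard against tiny roundoff

points = [mu + np.sqrt(n * lam[j]) * Q[:, j] for j in range(n)]
second_moment = sum(np.outer(x - mu, x - mu) for x in points) / n
# second_moment equals M up to floating point error
```

Note that the measure's mean need not equal μ; the proposition concerns the second moment about the fixed reference point μ, which is all that matters for the covariance mapping.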
In many practical cases, the covariance constraint is often coupled with mean value constraints. Let us consider a few examples as such.

Example 2.3 (STC for matrix moments due to Delage and Ye [13] and So [43]) Consider the ambiguity set
where γ_1 and γ_2 are nonnegative constants. The ambiguity set was first considered by Delage and Ye [13] and further studied by So [43]. It is easy to observe that the inequality can be equivalently written as a semidefinite constraint via the Schur complement, and thus P can be written in the form (1.2). When γ_i > 0 for i = 1, 2, the moment constraints (2.12) satisfy the Slater type constraint qualification; see [44, Theorem 3]. However, when γ_1 = 0, the constraint qualification fails. To see this, note that the matrix in the Schur complement reformulation can never be negative definite: by the Schur complement, for the matrix to be negative definite we would need (E_P[ξ] − μ_0)^T Σ_0^{−1} (E_P[ξ] − μ_0) < 0, which will never happen. Nevertheless, if we rewrite the ambiguity set with the mean constraint stated as an equality, then the Slater type condition holds; see [44, Theorem 3] for details.
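The Schur complement argument used here can be illustrated numerically: for Σ_0 ≻ 0, the quadratic constraint (m − μ_0)^T Σ_0^{−1} (m − μ_0) ≤ γ_1 holds iff an augmented block matrix is positive semidefinite. The data below are illustrative assumptions:

```python
import numpy as np

# Schur complement check: for Sigma0 > 0,
#   (m - mu0)^T Sigma0^{-1} (m - mu0) <= gamma1
# iff  [[Sigma0, m - mu0], [(m - mu0)^T, gamma1]]  is positive semidefinite.
Sigma0 = np.eye(3) + 0.1 * np.ones((3, 3))   # positive definite
mu0 = np.zeros(3)
m = np.array([0.3, -0.2, 0.1])               # a candidate mean E_P[xi]

v = m - mu0
quad = v @ np.linalg.solve(Sigma0, v)        # about 0.137 here

def is_psd(A, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

def block(gamma1):
    return np.block([[Sigma0, v[:, None]],
                     [v[None, :], np.array([[gamma1]])]])

ok_large = is_psd(block(0.5))    # 0.137 <= 0.5  -> PSD
ok_small = is_psd(block(0.1))    # 0.137 >  0.1  -> not PSD
```

This also makes the γ_1 = 0 degeneracy visible: the bottom-right entry of the block matrix is then 0, so the matrix can never be negative definite.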
Example 2.4 (STC for a variation of moment system (2.12)) Consider the following ambiguity set, where γ_1 and γ_2 are small positive numbers, e is the vector of ones, |a| denotes the componentwise absolute value of a vector a, and ‖·‖_2 denotes the spectral norm of a matrix. Using the properties of the norm, we can reformulate the ambiguity set accordingly. If γ_1 > 0 and γ_2 > 0, then there exists a probability measure P_0 such that the strict inequalities of the system of moment conditions hold. Following the remark after Proposition 2.1, we conclude that the Slater type condition holds.
Example 2.5 (STC for the moment system due to [25]) Consider the following ambiguity set, which is considered in [25]. It is easy to verify that ‖·‖_max is a matrix norm, albeit without the sub-multiplicative property.

Lower semicontinuity condition
We now study a different condition which is fundamentally based on Shapiro's result [38, Proposition 2.4]. To this end, we consider the following perturbation of problem (2.1), where the perturbation Y is in a small neighborhood of 0.
To simplify the notation, here and later on we mean 0 is in the appropriate space without indicating its dimension. Let v(Y) denote the optimal value of the perturbed problem (2.13). Recall that a sequence of probability measures {P_N} is said to converge weakly to P if E_{P_N}[h(ξ)] → E_P[h(ξ)] for each bounded and continuous function h : Ξ → IR. For a set of probability measures A ⊂ P(Ξ), A is said to be weakly compact w.r.t. the topology of weak convergence if every sequence {P_N} ⊂ A contains a subsequence {P_{N_i}} and P ∈ A such that P_{N_i} → P weakly. A is said to be tight if for any ε > 0, there exists a compact set Ξ_ε ⊂ Ξ such that inf_{P ∈ A} P(Ξ_ε) > 1 − ε. In the case when A is a singleton, this reduces to the tightness of a single probability measure. A is said to be closed (under the topology of weak convergence) if for any sequence {P_N} ⊂ A with P_N → P weakly, we have P ∈ A.
By the well-known Prokhorov theorem (see [2,33]), a closed set A of probability measures is weakly compact if and only if it is tight. In particular, since Ξ is a set in IR^k, if Ξ is compact, then the set of all probability measures on (Ξ, B) is weakly compact with respect to the topology of weak convergence; see [27].
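Weak convergence as defined above can be observed empirically: the empirical measures of a growing i.i.d. sample converge weakly to the sampling distribution, so expectations of bounded continuous test functions converge. A small sketch with h = arctan (the distribution and sample sizes are illustrative choices):

```python
import numpy as np

# Empirical measures P_N of an i.i.d. N(0,1) sample converge weakly to N(0,1),
# so E_{P_N}[h] -> E_P[h] for every bounded continuous h. Here h = arctan and
# E_P[arctan(xi)] = 0 by symmetry of the standard normal.
rng = np.random.default_rng(4)
errs = []
for N in (100, 10_000, 1_000_000):
    xi = rng.normal(size=N)
    errs.append(abs(float(np.mean(np.arctan(xi)))))
# errs shrinks toward 0 at roughly the 1/sqrt(N) Monte Carlo rate
```

Boundedness of h matters here: for unbounded integrands, weak convergence alone does not guarantee convergence of the expectations, which is precisely why tightness-type conditions appear in the analysis.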
For two probability measures P_1, P_2 ∈ P(Ξ), the Prokhorov metric [34] can be used to metrize the topology of weak convergence. A sufficient condition for Assumption 2.1 (a) is that there are positive constants τ and C such that a corresponding growth condition holds; likewise, when ψ_ijt is a continuous function, a sufficient condition for Assumption 2.1 (b) is that there exists a positive constant τ such that a similar condition holds. The mapping is then upper semicontinuous at 0 in the sense of Berge [5], that is, for any ε > 0, which means P ∈ P(Y). Part (ii). Let {Y_k} be a sequence converging to 0. By the definition of outer semicontinuity, we only need to consider points with P(Y_k) ≠ ∅. Let P_k ∈ P(Y_k). By [9, Theorem 5.1], the tightness assumption ensures that {P_k} has a subsequence {P_{k_i}} such that P_{k_i} → P* weakly. Using a similar argument to that of Part (i), we obtain (2.17), which means P* ∈ P(0). This shows that the set-valued mapping P(Y) is outer semicontinuous. On the other hand, since Ξ is a compact set in IR^k, by the Prokhorov theorem P(Ξ) is metrizable (e.g. by the Prokhorov metric) and hence a metric space; the latter ensures P(Ξ) is a Hausdorff space. Subsequently, by [24, Theorem 4.27], P(·) is upper semicontinuous at the point Y = 0.

Remark 2.1
In the case when Ψ_i(ξ), i = 1, …, q, is a scalar function, Assumption 2.1 (b) may be replaced by the following in Lemma 2.1: there exist an upper semicontinuous function l(ξ) and a lower semicontinuous function u(ξ) such that the stated bounds hold, and for any sequence {P_N} ⊂ P̄ and any accumulation point P* of the sequence, the corresponding limits hold. We give a proof for completeness. In what follows, we consider the case when P(Y) ≠ ∅. Let S(Y) denote the set of optimal solutions of problem (2.13). By Lemma 2.1, P(Y) is weakly compact and hence v(Y) < +∞. Since P(·) is upper semicontinuous at the point Y = 0 and f is continuous in ξ, it is easy to verify that the optimal solution mapping is also upper semicontinuous at 0. Thus, for any ε > 0 there exists a neighborhood of the set of optimal solutions at 0 which controls the perturbation of the optimal value; since ε is arbitrarily chosen, we conclude that v(Y) is lower semicontinuous at the point Y = 0. In what follows, we revisit some examples in the preceding subsection in light of Proposition 2.3. Consider Example 2.1. Assume that there exist i_0 ∈ {p + 1, …, q} and a positive number τ such that the corresponding strict inequality holds. Moreover, if Ξ is bounded, then Assumption 2.1 (b) is fulfilled.
Of course, the boundedness assumption is undesirable in DRO (1.1) and is in fact not needed for the Slater type condition; we impose the restriction just to illustrate how Proposition 2.3 could be applied in some special circumstances. However, in the application of DRO to optimization problems with chance constraints, it might be necessary to impose boundedness of Ξ in order for the robust chance constraints to be applicable. We illustrate this argument through the following example.
Example 2.6 (Infeasibility of robust chance constraint) Consider the following distributionally robust chance constraint, where x ∈ IR, p* ∈ (0, 1), ξ is a random variable with support set Ξ = IR, and P is an ambiguity set defined through the true mean value 0 and variance σ². It is easy to show that sup_{P ∈ P} P(ξ = 0) = 1. To see this, let P_k be a discrete probability measure, where k > 2. It is easy to verify that P_k ∈ P and sup_k P_k(ξ = 0) = 1. Let H(x) := {ξ ∈ IR : xξ ≤ α}. Then 0 ∈ H(x) for any x ∈ IR whenever α ≥ 0. Consequently the robust chance constraint does not have a feasible solution. The key issue here is that the unboundedness of Ξ allows the ambiguity set P to contain probability measures which concentrate their probability mass near the mean value of ξ.
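A concrete family of measures P_k as in Example 2.6 can be realized as follows; the atoms and masses below (mass 1 − 1/k² at 0 and 1/(2k²) at ±kσ) are one standard construction, stated here as an assumption:

```python
import numpy as np

# P_k puts mass 1 - 1/k^2 at 0 and mass 1/(2 k^2) at +/- k*sigma, so the
# mean is 0 and the variance is exactly sigma^2 for every k > 2, while the
# mass at 0 tends to 1 as k grows.
sigma = 1.5
for k in (3, 10, 100):
    atoms = np.array([0.0, k * sigma, -k * sigma])
    probs = np.array([1 - 1 / k**2, 1 / (2 * k**2), 1 / (2 * k**2)])
    mean = float(atoms @ probs)                    # = 0
    var = float((atoms**2) @ probs) - mean**2      # = sigma^2
    mass_at_zero = float(probs[0])                 # -> 1 as k grows
```

The two outlying atoms drift to ±∞ as k grows, which is exactly why compactness of the support rules this pathology out.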
In a more recent development of distributionally robust optimization (see [49]), the ambiguity set P comprises not only moment conditions but also probabilistic constraints. Here we illustrate how Proposition 2.3 may be applied to such a case.
Example 2.7 (Ambiguity set with probabilistic constraints) Consider an ambiguity set P of the form (2.20), where Ξ_j, j = 1, …, k, are subsets of Ξ and 0 ≤ a_j ≤ 1. Using indicator functions, the probabilistic constraints can be rewritten as moment constraints. Observe first that since Ξ is compact, P(Ξ) is tight and so is P, as the latter is just a subset of P(Ξ). Second, for any sequence {P_k} ⊂ P converging weakly to P̄, the lower semicontinuity of the indicator function 1_{(0.5,1]} applies; on the other hand, the boundedness of ξ_1 over Ξ ensures E_{P_k}[ξ_1] → E_{P̄}[ξ_1] = 0.8. This shows P is closed and hence, by the well known Prokhorov theorem, P is weakly compact. Third, by Remark 2.1, the conclusions of Lemma 2.1 hold with the above stated boundedness and lower semicontinuity; hence by Proposition 2.3, we can assert that the inner maximization problem of (1.1) with ambiguity set (2.20) satisfies strong Lagrange duality.
Note that it is possible to find an example where Assumption 2.1 fails to hold but the Slater type condition is satisfied.
Example 2.8 Let ξ be a random variable defined on IR with σ-algebra F. Let P(IR) denote the set of all probability measures on (IR, F) and consider the corresponding ambiguity set P. It is shown in [44] that P is not closed. On the other hand, it is easy to verify that the Slater type condition (2.6) holds. This shows that Assumption 2.1 is not necessarily strictly weaker than the Slater type condition and may be used as a condition complementary to the STC.
Before concluding this section, we give a simple example where strong duality fails in the absence of the STC and the lower semicontinuity condition. Consider the parametric problem (2.23). The optimal value v(y) of (2.23) is +∞ for y ≥ 0 because the feasible set is empty in that case. When y < 0, we can write down its Lagrange dual (2.24). Since the inequality constraint of problem (2.23) satisfies the STC, problem (2.24) does not have a duality gap. Analogous to [22, Example 2], we can work out the optimal value of (2.24), which is v(y) = 2 + y(1 − r_y) − 2^{r_y}, where r_y is either ⌊ln(−y/ln 2)/ln 2⌋ or ⌈ln(−y/ln 2)/ln 2⌉, depending on which one provides a lower value for v(y). Since v(y) → 2 as y → 0⁻, while v(0) = +∞, v is not lower semicontinuous at y = 0.
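The failure of lower semicontinuity can be observed numerically. The sketch below evaluates the closed form quoted above, written here as v(y) = 2 + y(1 − r_y) − 2^{r_y} with r_y the floor or ceiling of ln(−y/ln 2)/ln 2 (this reading of the garbled display is an assumption), and shows v(y) approaching 2 as y → 0⁻ while v(0) = +∞.

```python
import math

def v(y):
    """Optimal value of the dual (2.24) for y < 0, via the closed form
    v(y) = 2 + y*(1 - r) - 2**r with r the floor or the ceiling of
    ln(-y/ln 2)/ln 2, whichever gives the smaller value."""
    r0 = math.log(-y / math.log(2)) / math.log(2)
    return min(2 + y * (1 - r) - 2 ** r
               for r in (math.floor(r0), math.ceil(r0)))

# v(y) -> 2 as y -> 0-, while v(0) = +inf (empty feasible set for y >= 0),
# so the optimal value function is not lower semicontinuous at y = 0
for y in (-1e-2, -1e-4, -1e-8):
    print(y, v(y))
```

The printed values stay below 2 and approach 2 as y ↑ 0, whereas v(0) jumps to +∞.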

Boundedness of the Lagrange multipliers
In the last part of this section, we study the existence of bounded optimal solutions to the Lagrange dual problem (2.2). This is motivated by the need for boundedness of the set of feasible solutions to the dual problem in order to carry out convergence analysis when a randomization method is applied to the Lagrange dual in Section 3. To ease the notation, we write W for the (q + 1)-tuple of Lagrange multipliers (λ_0, Λ_1, ..., Λ_q), which takes values in IR × S^{n_1} × ··· × S^{n_p} × K^{q−p}. We use W(x) to denote the set of optimal solutions to the Lagrange dual problem (2.2). We investigate conditions under which there is a positive constant η, independent of x, such that W(x) ∩ ηB ≠ ∅ for all x ∈ X, (2.25), where B denotes the unit ball in the space IR × S^{n_1} × ··· × S^{n_p} × K^{q−p}. The boundedness is required for the convergence analysis in Sect. 3; see Assumption 3.2.

Proposition 2.4 (Existence of bounded Lagrange multipliers) Assume: (a) the optimal value of problem (2.2) is bounded by a constant (independent of x); (b) sup_{x∈X, ξ∈Ξ} |f(x, ξ)| < ∞; (c) the associated homogeneous system of inequalities has the unique solution 0. Then there exists a positive constant η such that (2.25) holds.
Proof For each x ∈ X, let λ_0(x) denote the optimal value of problem (2.2). Under condition (a), there exists a constant C such that λ_0(x) ≤ C for all x ∈ X. In what follows, we show that the components Λ := (Λ_1, ..., Λ_q) of the corresponding optimal solution are also bounded. Let F(x) denote the set of Λ(x) such that (λ_0(x), Λ(x)) is an optimal solution to problem (2.2) for given x ∈ X. It suffices to show that F(x) is bounded. Assume for the sake of a contradiction that there exists a sequence {Λ_N} ⊂ F(x) with ‖Λ_N‖ → ∞. By driving N to infinity and taking a subsequence if necessary, we may assume without loss of generality that Λ_N/‖Λ_N‖ → Λ̂ with ‖Λ̂‖ = 1 and consequently deduce a contradiction to condition (c).
Note that condition (c) is implied by the Slater type condition (2.3), see [53, Remark 2.1 (iii)]. It is unclear whether the condition can be fulfilled under the lower semicontinuity condition.

A randomization method and convergence analysis
Having established the equivalence between problem (2.2) and its primal (2.1), we now move on to discuss numerical methods for solving problem (1.1). For simplicity of notation, we use Λ to denote (Λ_1, ..., Λ_q) and write the dualized problem as

(3.1)

with constraints x ∈ X and Λ_i ⪰ 0 for i = p + 1, ..., q.
This is an optimization problem with decision variables x and matrix variables Λ_i, i = 1, ..., q. In the case when f(·, ξ) is convex for every fixed ξ, the objective function is convex w.r.t. (x, Λ). Our idea here is to apply the well known cutting plane method [23] to solve (3.1). A key step of the method is to calculate a subgradient of the objective function at each iterate. This requires us to maximize the Lagrange function w.r.t. ξ, which could be numerically expensive, particularly when it is not concave w.r.t. ξ.
To circumvent the difficulty, we propose a randomization approach which discretizes the support set Ξ through Monte Carlo sampling. Specifically, let ξ^1, ..., ξ^N be independent and identically distributed (iid) samples of ξ. We consider the following approximation

(3.2)

with constraints x ∈ X and Λ_i ⪰ 0 for i = p + 1, ..., q.
From a practical point of view, this kind of approximation scheme is sensible in that it relies only on the samples rather than on the range of the support set Ξ. This is a notable departure from the existing numerical approaches for solving distributionally robust optimization, where the structure of the support set is vital to developing an SDP reformulation. Of course, it is arguable that samples obtained in practice may be contaminated; we will address this issue in a separate paper as it is not the main focus here. Unless otherwise specified, we assume the samples do not contain noise. At this point, it might be helpful to remind readers of the notation ξ. In formulation (3.1), ξ is a deterministic vector. In the randomization approach, ξ is a random vector whose distribution is unknown but whose iid samples can be obtained. This is similar to the "uncertain parameter" in robust convex programs considered by Campi and Calafiore [10]. Note that, theoretically speaking, samples generated by any continuous distribution with support set Ξ can be used to construct a random approximation scheme (3.2), although the resulting rate of convergence may differ.
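The core of the randomization can be illustrated with a minimal sketch (a hypothetical one-dimensional g on Ξ = [−1, 1]; all names are illustrative, not from the paper): the semi-infinite supremum over Ξ is replaced by a finite maximum over iid samples, which is all that scheme (3.2) requires.

```python
import random

def sup_by_sampling(g, draw, N, seed=0):
    """Approximate sup over Xi of g by the maximum over N iid samples,
    mimicking the passage from the semi-infinite problem (3.1) to (3.2)."""
    rng = random.Random(seed)
    return max(g(draw(rng)) for _ in range(N))

# toy instance: g(xi) = 1 - xi^2 on Xi = [-1, 1], true supremum 1 at xi = 0
g = lambda s: 1.0 - s * s
draw = lambda rng: rng.uniform(-1.0, 1.0)
for N in (10, 100, 10000):
    print(N, sup_by_sampling(g, draw, N))
```

The approximation never exceeds the true supremum and approaches it as N grows, consistent with the convergence results derived below.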
For a fixed sample, we propose to apply the well known cutting plane method for solving problem (3.2). Observe that we can easily compute a subgradient of the objective function of problem (3.2). To see this, let J(x, Λ) denote the index set of sample points at which the maximum in the objective is attained. By the well known Danskin's theorem, a subgradient is obtained from the gradient of the Lagrange function at any index in J(x, Λ).

Optimal value and optimal solution
Before going into the details of the numerical methods for problem (3.2), we derive some convergence results which theoretically justify the proposed approximation scheme. Specifically, we demonstrate convergence of the optimal value and the optimal solutions obtained from solving problem (3.2) to those of problem (3.1) as N → ∞. To this end, let us first consider the general optimization problem (3.3), where X is a compact set in IR^n, g is a continuous function of (x, ξ), and ξ is a parameter which takes values over Ξ ⊂ IR^k. Slightly abusing the notation, let us consider a random variable ξ with support set Ξ. Let ξ^1, ..., ξ^N be independent and identically distributed samples of ξ and consider the approximation problem (3.4). For each realization of the random variables, we solve problem (3.4) and obtain an optimal value and an optimal solution. We then ask whether these quantities converge as N increases and investigate conditions under which the optimal value and optimal solutions converge to their counterparts in problem (3.3). In what follows, we present a detailed analysis for (3.4) under Assumption 3.1, whose conditions include: (b) there exist a nonnegative measurable function κ : Ξ → IR_+ and a constant γ > 0 such that the corresponding growth condition holds for all ξ ∈ Ξ; (c) the moment generating function M_κ(t) of κ(ξ) is finite valued for all t in a neighborhood of zero. Let ϑ and ϑ_N denote respectively the optimal values of problems (3.3) and (3.4), and let X* and X_N denote the corresponding sets of optimal solutions.

Lemma 3.1 (Convergence of the random discretization scheme (3.4)) Assume: (a) g(x, ξ) satisfies Assumption 3.1; (b) the true probability distribution of ξ is continuous and there exist positive constants C and ν such that (3.5) holds for all x ∈ X; and (c) there are positive constants γ_2 and δ_0 such that the corresponding local probability lower bound holds for any fixed point ξ_0 ∈ Ξ and δ ∈ (0, δ_0). Then the following assertions hold.
(i) For any positive number ε, there exist positive constants C(ε) and β(ε) such that (3.6) holds. (ii) If there exists an ε_0 > 0 such that R(ε) > 0 for ε ∈ (0, ε_0) and R(·) is monotonically increasing over the interval, then R(ε) → 0 as ε ↓ 0 and the optimal solutions of (3.4) converge accordingly. Proof The thrust of the proof is to use CVaR and its sample average approximation to approximate sup_{ξ∈Ξ} g(x, ξ) in problem (3.3), in line with the convergence analysis carried out in [1]. However, there are a couple of important differences: (a) the convergence here is for the randomization scheme (3.4) rather than for the sample average approximation of the CVaR approximation of sup_{ξ∈Ξ} g(x, ξ); (b) g is not necessarily a convex function. Part (i). For β ∈ (0, 1), define CVaR_β(g(x, ξ)) and its sample average approximation v_β^N(x) as in (3.7), where ρ(·) denotes the density function of the random variable ξ and (c)_+ = max(0, c) for c ∈ IR. In the literature, CVaR_β(g(x, ξ)) is known as the conditional value at risk at confidence level β and v_β^N(x) is its sample average approximation; see [1,35]. It is well known that the maximum w.r.t. η in this formulation is achieved at a finite η. In other words, we may restrict the maximum w.r.t. η to a closed interval [−a, a] for some sufficiently large positive number a; see [35]. It is easy to verify the inequalities (3.8). We proceed with the rest of the proof of this part in two steps.
Step 1. By the definition of CVaR, the corresponding bound holds for any β ∈ (0, 1). Moreover, under conditions (b) and (c), it follows by [1, Proposition 1] that g(x, ξ) has the so-called consistent tail behaviour, that is, the tail condition holds for all β ∈ (β_0, 1). Therefore, driving β to 1, we obtain (3.10). Step 2. Using the inequalities (3.8), we can bound sup_{x∈X} |v_N(x) − v(x)|. Let ε be a small positive number. By (3.10), we may set β sufficiently close to 1 such that (3.11) holds, and (3.12) holds for N sufficiently large. Here in the first inequality we use the fact that the maximum w.r.t. η is achieved in [−a, a] for some appropriate positive constant a; see similar discussions in [51]. Note that |ϑ_N − ϑ| ≤ sup_{x∈X} |v_N(x) − v(x)|. Combining (3.11) and (3.12), we arrive at the claimed estimate. Part (ii). Let R(ε) be defined as in the statement of the lemma, let ε be a fixed small positive number and let δ ∈ (0, δ_0). When v_N is uniformly close to v, any x with v(x) > ϑ + ε cannot be an optimal solution to problem (3.4); in other words, if x_N is an optimal solution to problem (3.4), it must lie close to X*. We make a few comments in the sequel about the conditions and the conclusion of the lemma, as the result might be of broader interest.
First, condition (a) ensures that the essential supremum of g(x, ξ) is bounded for each fixed x ∈ X. Condition (b) is considered by Anderson, Xu and Zhang [1]. Inequality (3.5) is guaranteed when g(x, ·) is Hölder continuous over Ξ. Condition (c) is fulfilled when the density function of ξ is bounded away from zero around ξ_0. A combination of (b) and (c) provides a sufficient condition for the so-called consistent tail behaviour of g(x, ξ); see [1] for a detailed discussion.
Second, this kind of convergence analysis differs slightly from standard convergence analysis in stochastic programming in that here we use the largest sampled value of g(x, ξ) rather than its sample average. It should also be distinguished from the convergence analysis by Campi and Calafiore [10] for a similar discretization scheme, whose focus is on the feasibility of an optimal solution obtained from solving (3.4) and on the number of samples needed to guarantee feasibility with a specified confidence. Instead, it is more closely related to recent work by Esfahani et al. [15], which presents a probabilistic argument for the convergence of the optimal value of (3.4) under a similar discretization scheme. Based on Lemma 3.1, it is possible to estimate the sample size for a specified discrepancy of the optimal values through the large deviation theorem; see [50] and the references therein. We leave the details for interested readers to explore, as they are beyond the main focus of this paper.
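The CVaR device used in the proof can be illustrated with a small sketch. Below, the sample-average CVaR is written in the Rockafellar–Uryasev minimization form (an assumption on our part, since display (3.7) is not legible in this copy); as β → 1 it approaches the sample maximum, i.e. the discretized supremum used in scheme (3.4).

```python
def sample_cvar(samples, beta):
    """Sample-average CVaR_beta of a loss: the Rockafellar-Uryasev form
    min_eta { eta + sum((z_i - eta)+) / ((1 - beta) * N) }; the minimizing
    eta may be taken among the sample points themselves."""
    n = len(samples)
    obj = lambda eta: eta + sum(max(z - eta, 0.0) for z in samples) / ((1 - beta) * n)
    return min(obj(eta) for eta in samples)

z = list(range(1, 11))        # ten sampled losses: 1, 2, ..., 10
print(sample_cvar(z, 0.8))    # mean of the worst 20%: (9 + 10)/2 = 9.5
print(sample_cvar(z, 0.999))  # close to beta = 1: the sample maximum, 10.0
```

This is exactly the mechanism exploited in Step 1 of the proof: CVaR sandwiches the sampled maximum, and the gap vanishes as β → 1.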
With Lemma 3.1, we are ready to state convergence of problem (3.2) to problem (3.1) in terms of the optimal value and the optimal solutions. For simplicity of notation, let g(x, Λ, ξ) denote the Lagrange function maximized in (3.1) and let W(x) be defined as in (2.25). We make the following assumptions (Assumption 3.2): (a) condition (2.25) holds, i.e., the Lagrange multipliers are bounded uniformly in x; (b) the true probability distribution of ξ is continuous and there exist positive constants C_1 and ν_1 (independent of x) such that the analogue of inequality (3.5) holds for g(x, Λ, ξ); (c) the moment generating function of σ(ξ), namely E[e^{t(σ(ξ)−E[σ(ξ)])}], is finite valued for all t in a neighbourhood of zero. The convergence then follows along the lines of Lemma 3.1; we omit the details.

Stationary points
In the case when f(x, ξ) is not convex in x, problem (3.2) is not a convex optimization problem. In such a case, we may not be able to obtain an optimal solution by solving the problem. This motivates us to study convergence of stationary points. Let x_N be a stationary point of problem (3.2). We look into whether any cluster point of the sequence {x_N} is a stationary point of (3.1).
To ease the exposition of the analysis and maximize the potential application of the convergence results, we consider the general problems (3.3) and (3.4). Throughout this subsection, we assume g is continuously differentiable in x for every ξ; therefore both v(x) and v_N(x) are Lipschitz continuous. Recall that the Clarke subdifferential of a locally Lipschitz continuous function φ at x, denoted by ∂φ(x), is defined as ∂φ(x) := conv{lim_{x_k→x, x_k∈D} ∇φ(x_k)}, where D denotes the set of points near x at which φ is Fréchet differentiable and ∇φ(x) denotes the gradient of φ at x. In the case when φ is convex, the Clarke subdifferential coincides with the convex subdifferential; see [12] for details. By [12, Theorem 2.8.2], the Clarke subdifferential of v(x) can be written in terms of P[Ξ*(x)], where P[S] signifies the collection of probability measures supported on S; likewise, by [12, Proposition 2.3.12], (3.14) holds. By taking a subsequence if necessary, we may assume without loss of generality that x_N → x*, and by relabeling the samples we may assume the maximum defining v_N(x_N) is attained at the leading samples. Using the properties of the Clarke subdifferential, we deduce from (3.14) that there exist weights which allow us to view P_N as a probability distribution of ξ over the support set Ξ_N(x_N), and consequently to write the subgradient η_N as an integral w.r.t. P_N. Let P(Ξ) denote the set of all probability measures over Ξ induced by ξ. Since Ξ is a compact set, P(Ξ) is weakly compact, which means {P_N} must have a weakly convergent subsequence. Assume for simplicity of notation that P_N → P* weakly; then P* ∈ P(Ξ). Since g(x, ξ) is continuous and bounded on X × Ξ, the weak convergence and condition (b) of Lemma 3.1 ensure convergence of the corresponding integrals. Moreover, since Ξ is compact, all conditions of Lemma 3.1 are fulfilled; thus v_N(x) converges to v(x) uniformly over X as N → ∞, and likewise for the subdifferentials. To complete the proof, we need to show that P* ∈ P[Ξ*(x*)], but this follows from the definition of P[·] and the convergence established above. With Proposition 3.1, we are ready to study the convergence of stationary points.
We call (x, Λ) a stationary point of problem (3.1) if it satisfies the corresponding first order optimality condition, where {0} × K is defined as in (2.3) and N_Z(z) denotes the Clarke normal cone to Z at z for z ∈ Z, with N_Z(z) = ∅ when z ∉ Z. Likewise, we say (x, Λ) is a stationary point of problem (3.2) if it satisfies the analogous condition with v replaced by v_N. Proof Theorem 3.2 follows from the outer semicontinuity of the normal cones N_X(·) and N_{{0}×K}(·) and the consistency of the subdifferentials established in Proposition 3.1.

Cutting plane method
We now turn to numerical methods for solving problem (3.2) with a fixed sample. This is a deterministic convex program when f(x, ξ) is convex in x for every ξ. We propose to apply the well known cutting plane method for solving the problem. Step 1. Solve the linear semidefinite programming problem (3.15) and let (x^t, λ_0^t, Λ_1^t, ..., Λ_q^t) denote the optimal solution. Step 2. Find an index j*_t at which the constraint violation is maximal. Step 3. If no constraint is violated, stop and return (x^t, λ_0^t, Λ_1^t, ..., Λ_q^t) as the optimal solution. Otherwise, construct a feasibility cut, set t := t + 1 and go to Step 1.
The algorithmic procedure follows the classical cutting plane method of Kelley [23]. The only minor difference is that our problem (3.2) involves matrix variables, so problem (3.15) has to be solved by an SDP solver. Convergence of the algorithm can be established similarly to Kelley [23]; we omit the details.
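As a minimal illustration of the cutting plane mechanics, the sketch below runs Kelley's method in one dimension under our own simplifications (no matrix variables and no SDP solver; the piecewise-linear model is minimized exactly over an interval by checking endpoints and pairwise cut intersections).

```python
def kelley_1d(f, df, lo, hi, tol=1e-6, max_iter=100):
    """Kelley's cutting plane method for a convex f on [lo, hi]. Each cut
    is the linearization f(x_t) + f'(x_t)(x - x_t); the model (max of
    cuts) is minimized exactly over [lo, hi]."""
    cuts, x, best = [], lo, float("inf")
    for _ in range(max_iter):
        fx = f(x)
        best = min(best, fx)
        cuts.append((df(x), fx - df(x) * x))     # slope, intercept
        cand = [lo, hi]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                (a1, b1), (a2, b2) = cuts[i], cuts[j]
                if a1 != a2:
                    z = (b2 - b1) / (a1 - a2)
                    if lo <= z <= hi:
                        cand.append(z)
        model = lambda z: max(a * z + b for a, b in cuts)
        x = min(cand, key=model)
        if best - model(x) <= tol:   # model minimum is a lower bound
            break
    return x, best

x, val = kelley_1d(lambda z: (z - 1) ** 2, lambda z: 2 * (z - 1), -3.0, 5.0)
print(x, val)   # minimizer 1 with value 0
```

The stopping test compares the best objective value found against the model minimum, which bounds the true optimum from below; this is the same certificate Algorithm 3.1 uses, with the LP/SDP subproblem replaced here by exact one-dimensional minimization.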

Discretization of the ambiguity set
The randomization scheme (3.2) may be investigated from a different perspective. Let Ξ_N := {ξ^1, ..., ξ^N}. If we restrict the ambiguity set P in (1.2) to discrete probability measures with support set Ξ_N, then we obtain the discretized set and the corresponding problem (4.1). Here, instead of writing P, we use P_N to indicate that the set depends on N. Obviously P_N ⊂ P in the sense that for every (p_1, ..., p_N) ∈ P_N, P_N := Σ_{j=1}^N p_j δ_{ξ^j} ∈ P, where δ_{ξ^j} denotes the Dirac probability measure on Ξ with unit mass at ξ^j. Consequently, the distributionally robust optimization problem (1.1) can be written with the discretized ambiguity set, and it is easy to verify that the Lagrange dual of its inner maximization problem can be written as (4.2), which is equivalent to (3.2). This means the randomization scheme in Sect. 3 is equivalent to the discretization scheme (4.1). From a numerical point of view, the difference between (4.1) and (4.2) lies in the fact that the latter is a single minimization problem whereas the former is a finite dimensional min-max optimization problem. When f(x, ξ) is convex in x for every ξ, (4.1) becomes a saddle point problem. In the previous section, we developed a numerical method for solving (4.2). Here our focus is on a numerical scheme which solves (4.1) directly for fixed N.
Our idea is based on the classical cutting plane method applied to the convex function v_N(x) := sup_{P∈P_N} E_P[f(x, ξ)] over the compact set X, which can be described as follows. We start by selecting a probability p_0 ∈ P_N and a point x_0 ∈ X. Let l_0(x) denote the linearization of E_{p_0}[f(x, ξ)] at x_0 and find a minimizer of l_0(x) over X. Note that l_0(x) ≤ v_N(x) for all x ∈ X, but it is not necessarily a cutting plane. Let x_1 denote the minimizer of l_0(x). Next, evaluate v_N(x) at x_1 by solving the inner maximization problem, that is, maximizing E_P[f(x_1, ξ)] w.r.t. P over P_N. Let p_1 denote the optimal solution; then v_N(x_1) is the corresponding optimal value. If v_N(x_1) exceeds the current model value, construct l_1(x) analogously and find a minimizer of max(l_0(x), l_1(x)). In this way, we generate a sequence of cutting planes of v_N(x) and a sequence of approximate optimal solutions {x_t}.
Step 1. Solve the outer minimization problem (4.3). Let x_t and σ_t denote the optimal solution and optimal value respectively.
Step 2. Solve the inner maximization problem (4.4). Let p_t and v_t denote the optimal solution and optimal value. If v_t ≤ σ_t, then stop; otherwise construct a new cutting plane at x_t, set t := t + 1 and go to Step 1.
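The loop of Algorithm 4.1 can be sketched as follows, under simplifying assumptions of our own: f(x, ξ) = (x − ξ)² is scalar and convex, and the discretized ambiguity set P_N is replaced by an explicit finite list of probability vectors, so that Step 2 becomes a finite enumeration (in the paper it is a moment-constrained linear program). All names are illustrative.

```python
def algorithm_4_1(xi, P_N, lo, hi, tol=1e-6, max_iter=100):
    """Cutting plane loop for min_{x in [lo,hi]} max_{p in P_N} E_p[f(x, xi)]
    with f(x, s) = (x - s)**2 and P_N a finite list of probability vectors."""
    f = lambda x, s: (x - s) ** 2
    df = lambda x, s: 2 * (x - s)
    E = lambda p, x: sum(pj * f(x, s) for pj, s in zip(p, xi))
    cuts, x, sigma = [], lo, float("-inf")
    for _ in range(max_iter):
        # Step 2: inner maximization over the discretized ambiguity set
        p_t = max(P_N, key=lambda p: E(p, x))
        v_t = E(p_t, x)
        if v_t <= sigma + tol:          # stopping test: v_t <= sigma_t
            break
        # new cutting plane: linearization of E_{p_t}[f(., xi)] at x
        g = sum(pj * df(x, s) for pj, s in zip(p_t, xi))
        cuts.append((g, v_t - g * x))
        # Step 1: minimize the piecewise-linear model over [lo, hi]
        model = lambda z: max(a * z + b for a, b in cuts)
        cand = [lo, hi]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                (a1, b1), (a2, b2) = cuts[i], cuts[j]
                if a1 != a2:
                    z = (b2 - b1) / (a1 - a2)
                    if lo <= z <= hi:
                        cand.append(z)
        x = min(cand, key=model)
        sigma = model(x)
    return x, v_t

xi = [0.0, 1.0, 2.0]                          # sample points
P_N = [(1, 0, 0), (0, 0, 1), (1/3, 1/3, 1/3)]  # stand-in for P_N
x_star, val = algorithm_4_1(xi, P_N, -1.0, 3.0)
print(x_star, val)   # converges to x = 1, worst-case value 1
```

Note that the cuts live in the space of x only; the distributions p_t enter solely through the cut coefficients, which is precisely the structural difference from Algorithm 3.1 discussed below.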
Algorithm 4.1 is inspired by a similar algorithm proposed by Pflug and Wozabal [30] for solving a distributionally robust portfolio problem and by the cutting surface method of Mehrotra and Papp [26] for a general class of moment robust optimization problems. Compared to the cutting surface method, our algorithm is not particularly aimed at finding a finite number of "points" in Ξ at which the inner maximum w.r.t. P is achieved; rather, it is an ordinary cutting plane method based on the fundamental idea of Kelley's method.
In comparison with Algorithm 3.1, a notable difference is that Algorithm 4.1 builds up cutting planes in the space of decision variables, whereas Algorithm 3.1 constructs cutting planes in the joint space of decision variables and Lagrange multipliers. This difference affects the applicability of the algorithms in different circumstances. We will come back to this in Sect. 5 after conducting some comparative numerical tests of the two algorithms.
Following the convergence of the classical cutting plane method (see [23]), we can assert convergence of Algorithm 4.1. Note that Algorithm 4.1 is proposed for solving the discretized minimax problem (4.1) for fixed sample size N. It is natural to ask whether the optimum obtained from the sampling scheme converges to the optimum of the original DRO (1.1) as N increases. The following theorem addresses this. Proof Since x_N is an optimal solution of problem (4.1), there exists P_N ∈ P_N such that (x_N, P_N) is a saddle point of min_{x∈X} max_{P∈P_N} ⟨P, f(x, ξ)⟩, i.e., (4.5) holds. On the other hand, since Ξ is a compact set in a Euclidean space, by [33, Theorem 1.12] P(Ξ) is weakly compact under the topology of weak convergence. The latter guarantees that every sequence in P contains a weakly convergent subsequence; see Rachev [33,34]. By taking a subsequence if necessary, we may assume that x_N → x* and P_N → P* weakly. By the second equality of (4.5), we obtain ⟨P*, f(x*, ξ)⟩ ≤ min_{x∈X} ⟨P*, f(x, ξ)⟩. In what follows, we show that P* attains the inner maximum, which will then enable us to claim that (x*, P*) is a saddle point of min_{x∈X} max_{P∈P} ⟨P, f(x, ξ)⟩. Assume for the sake of a contradiction that there exists P̄ ∈ P such that ⟨P̄, f(x*, ξ)⟩ > ⟨P*, f(x*, ξ)⟩. Under condition (a), there exists a sequence {P̄_N} ⊂ P_N converging weakly to P̄, which contradicts the first equality of (4.5). Proof (of Corollary 4.1) Let P̄ be defined as in the proof of Theorem 4.2, let λ ∈ (0, 1) be a constant, let P_0 be defined as in condition (b), and let P_λ := λP̄ + (1 − λ)P_0. Since P is a convex set, P_λ ∈ P. For fixed λ, there exists P̂_N^λ ∈ P_N such that P̂_N^λ converges weakly to P_λ. To see this, let {Ξ_1, ..., Ξ_N} be a Voronoi partition of Ξ, that is, Ξ_i, i = 1, ..., N, are pairwise disjoint sets, and consider Σ_i p_i δ_{ξ^i}, where p_i = P_λ(Ξ_i) and δ_{ξ^i} denotes the Dirac probability measure at ξ^i. Under condition (b), the largest diameter of the Voronoi cells goes to zero as N increases.
Consequently, we deduce by [28, Lemma 4.9] that P̂_N^λ converges weakly to P_λ, as required. It might be helpful to make a few comments about the conditions of Theorem 4.2 and Corollary 4.1.
First, from the proof of the corollary, we can see that condition (b) in the theorem can be replaced by conditions (b) and (c) in the corollary when the moment system in the definition of P does not involve an equality constraint. It is an open question whether this remains true when the moment system involves an equality constraint; we leave this for future research. We prefer conditions (b) and (c) in the corollary to condition (b) of the theorem in that the former are more easily verifiable. Moreover, since condition (b) in the corollary is a Slater condition, it ensures strong duality for the inner maximization problem of (1.1), whereas condition (b) of the theorem carries no such guarantee. Further, under conditions (b) and (c) of the corollary, convergence of the optimal value of problem (4.1) can be drawn directly from Theorem 3.1, and in that case Theorem 4.2 may be understood as complementing Theorem 3.1 by ensuring convergence of the optimal solution. In contrast, under condition (b) of the theorem, it is unclear whether Theorem 3.1 would also guarantee convergence of the optimal value of (3.2) to that of problem (1.1) without the Slater condition (although we may verify the lower semicontinuity condition derived in Sect. 2). Overall, we conclude that the discretization scheme (4.1) is somewhat safer than scheme (3.2) in the absence of strong duality for the inner maximization problem of (1.1).
Second, condition (c) of the corollary means that Ξ_N may consist of iid samples generated by any continuous distribution with support set Ξ, or may be constructed in a deterministic manner.
Third, in the absence of strong duality, the optimal value of the discretized minimax optimization problem (4.1) provides a lower bound for the optimal value of the original distributionally robust optimization problem (1.1) because the discretized ambiguity set P N is smaller than P. In contrast, the optimal value of problem (3.1) may provide an upper bound for problem (1.1) as it is formulated through the Lagrange dual of the inner maximization problem. The follow-up discretization scheme (3.2) gives a lower bound for the optimal value of problem (3.1). Overall, in the absence of strong duality, we can conclude via Theorem 3.1 that the optimal value of problem (3.2) provides an upper bound for problem (1.1) when N is sufficiently large.
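The bounding relations described in this section can be recorded compactly (a sketch in our notation, with val(·) denoting optimal values; the limit is understood in the sense of Theorem 3.1):

```latex
\mathrm{val}(4.1)\;\le\;\mathrm{val}(1.1)\;\le\;\mathrm{val}(3.1),
\qquad
\mathrm{val}(3.2)\;\le\;\mathrm{val}(3.1),
\qquad
\mathrm{val}(3.2)\;\longrightarrow\;\mathrm{val}(3.1)\ \ (N\to\infty).
```

The first inequality follows from P_N ⊂ P, the second from weak Lagrange duality, and the last two from sampling the semi-infinite constraint system; under strong duality the middle inequality becomes an equality and the two schemes target the same value.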

Numerical tests
In this section, we investigate the numerical performance of Algorithms 3.1 and 4.1 by carrying out some comparative analysis. We do so by applying them to a portfolio optimization problem and a multiproduct newsvendor problem. In the implementation of the algorithms, we use the ambiguity set defined as in (2.12) with γ_1 = 0.1 and γ_2 = 1.1. The mean μ_0 and covariance matrix Σ_0 are calculated from samples which are either obtained from historical data (in the first example) or generated by computer (in the second example).
The tests are carried out in MATLAB 8.0 installed on a Thinkpad T430 notebook computer with Windows 7 operating system and Intel Core i5 processor. The SDP subproblems in Algorithms 4.1 and 3.1 are solved by Matlab solver "SDPT3-4.0" [45].
Example 5.1 We consider a portfolio optimization problem where the investor makes an optimal decision using the historical return rates of 80 stocks between May 2009 and April 2015 from the National Association of Securities Dealers Automated Quotations (NASDAQ) index. The sample size is 2000. To simplify the discussion, we ignore transaction fees; the total value of the portfolio is therefore f(x, ξ) = ξ_1 x_1 + ξ_2 x_2 + ··· + ξ_n x_n, where ξ_j denotes the random return rate of asset j.
The investor wants to choose from the NASDAQ index the several stocks with the highest average return rates and to make an optimal decision based on them, where the average return rates in the selection rule are calculated by averaging all historical rates. In order to compare the two algorithms, we carried out two sets of experiments. In the first, the number of stocks in the portfolio is fixed at 5 and we examine the performance of the algorithms in terms of CPU time for different sample sizes; this investigates the sensitivity of the algorithms w.r.t. the change of sample size. In the second, the sample size is fixed at 500 and we test the performance of the algorithms as the problem size increases from 5 to 80.
The results are depicted in Figs. 1 and 2, which show the relationships between CPU time and sample size, and between CPU time and portfolio size. In Fig. 1, we can see that the CPU time of Algorithm 4.1 increases rapidly, at a linear rate, as the sample size increases, whereas Algorithm 3.1 is not sensitive to the change of sample size. The underlying reason is that an increase of sample size does not affect the problem size of (3.2), but it does affect the size of the inner maximization problem (4.4). Figure 2 displays the opposite behaviour of the two algorithms, where we fix the sample size at 500 but increase the portfolio size. The phenomenon can be explained by the fact that Algorithm 3.1 is sensitive to the increase of portfolio size (the number of components of x) because its cutting planes are constructed in higher dimensional vector and matrix spaces: with the matrix variables in place, any increase in the number of components of x significantly affects the overall problem size and hence the effectiveness of the cutting plane method. In contrast, a change of portfolio size does not have any impact on the size of problem (4.4), which is the key step of Algorithm 4.1, and its impact on the outer minimization problem (4.3) is limited because the latter is an LP without any matrix variables.
In Example 5.1, the objective function is linear in x, so no linearization is needed at Step 1 of Algorithm 4.1. In what follows, we consider the case when the objective function is nonlinear.
Example 5.2 (Multiproduct newsvendor problem varied from Wiesemann et al. [49]) Assume that a newsvendor trades in i = 1, . . . , n products. Before observing the uncertain demands ξ i , the newsvendor orders x i units of product i at the wholesale price c i . Once ξ i is observed, she can sell the quantity min(x i , ξ i ) at the retail price v i . Any unsold stock (x i − ξ i ) + is cleared at the salvage price g i , and any unsatisfied demand (ξ i − x i ) + is lost.
We can describe the newsvendor's total loss as a function of the order decision x := (x_1, ..., x_n)^T, where the minimum and the nonnegativity operators are applied componentwise, and the newsvendor minimizes the worst-case expected disutility of this loss, where U(y) := e^{y/10} is an exponential disutility function. In order to compare the performance of the two algorithms, we carried out three sets of experiments. In the first, the number of products is fixed at 7 and we examine the performance of the algorithms in terms of CPU time for sample sizes from 400 to 900; the results are depicted in Fig. 3. In the second, the sample size is fixed at 100 and we test the performance of the algorithms as the problem size (number of products) increases from 9 to 27; the results are displayed in Fig. 4. In the third, the number of products is fixed at 2 and we investigate the optimal values returned by the two algorithms as the sample size increases from 100 to 900; we generate 20 groups of samples for each sample size, calculate the optimal value by each algorithm for each group, and show the convergence in Figs. 5 and 6. The data are generated as follows: for the ith product, the wholesale, retail and salvage prices are set as c_i = 0.1(5 + i − 1), v_i = 0.15(5 + i − 1) and g_i = 0.05(5 + i − 1); the vector of product demands ξ follows a multivariate log-normal distribution with mean μ = (μ_1, ..., μ_n), μ_i = 2 for i = 1, ..., n, and covariance matrix Σ = (σ_ij) with σ_ii = 0.35 + 0.01(i − 1) and σ_ij = 0.01 for i ≠ j, i, j = 1, ..., n. Figures 3 and 4 display patterns similar to those observed in Example 5.1 for the change of CPU time against the sample size and the number of products (the problem size). Figures 5 and 6 display, through boxplots, the same trend of convergence of the optimal values obtained from the two algorithms as the sample size increases.
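Since the loss display is garbled in this copy, here is a sketch of the standard multiproduct newsvendor loss consistent with the verbal description above (wholesale cost minus sales revenue minus salvage value, componentwise); the numerical check uses the prices of the data-generation rule for i = 1.

```python
def newsvendor_loss(x, demand, c, v, g):
    """Total loss of the multiproduct newsvendor: wholesale cost c_i x_i
    minus sales revenue v_i min(x_i, xi_i) minus salvage value
    g_i (x_i - xi_i)_+, summed over the n products."""
    return sum(ci * oi - vi * min(oi, di) - gi * max(oi - di, 0.0)
               for oi, di, ci, vi, gi in zip(x, demand, c, v, g))

# one product with the prices of the data-generation rule for i = 1:
# c = 0.5, v = 0.75, g = 0.25; order 10 units, observe demand 6
loss = newsvendor_loss([10.0], [6.0], [0.5], [0.75], [0.25])
print(loss)   # 0.5*10 - 0.75*6 - 0.25*4 = -0.5, i.e. a profit of 0.5
```

In the experiments, the disutility U(y) = e^{y/10} is applied to this loss inside the expectation, which is what makes the objective nonlinear in x and calls for the linearization at Step 1 of Algorithm 4.1.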
We can see roughly that the optimal values (or the range of the optimal values) converge relatively quickly when the sample size is less than 500, and that the convergence slows down once the sample size reaches 700. This observation is consistent with our established exponential convergence results. Note that no gap is observed, as strong duality holds in this case.

Conclusion
The paper explores conditions for strong duality in distributionally robust optimization with moment constraints, together with discrete approximation schemes for solving such problems. For moment problems with only inequality constraints, the Slater condition is often satisfied, and in this paper we show how it can be verified for some specific moment problems. For moment problems with equality and/or inequality constraints, strong duality often requires Slater type conditions which are relatively difficult to fulfil and verify. In the absence of the Slater type conditions, we show that a new condition based on lower semicontinuity of the optimal value of the perturbed inner maximization problem may be used instead.
We propose two discrete approximation schemes for solving (1.1): one through the well known Lagrange dual formulation and the other through discretization of the ambiguity set, which is effectively a direct discretization of the minimax optimization problem. In terms of the optimal value, the dual based discretization scheme tends to give an upper bound, whereas the direct discretization gives rise to a lower bound in the absence of strong duality. We then apply the well known cutting plane method to solve the respective discretized problems. In view of numerical efficiency, the preliminary tests show that the dual based discretization scheme is more sensitive to an increase in the number of decision variables, whereas the direct discretization scheme is more sensitive to an increase in the sample size. Neither scheme requires any specific structure of the underlying functions in the moment constraints or in the objective, or any specific structure of the support set of the random variable; hence they provide an alternative to the mainstream SDP based approaches in the literature.
There is a prospect of applying the discretization schemes to distributionally robust optimization problems whose objective is to minimize risk. For example, in formulation (1.1), if we replace the expected loss E_P[f(x, ξ)] with the CVaR of f(x, ξ) as defined in (3.7), then the objective becomes minimization of the worst-case CVaR. By exchanging the minimization w.r.t. η and the maximization w.r.t. the probability measure, we end up with the standard formulation (1.1) with an auxiliary "decision variable" η. A similar reformulation can be applied when the objective is a convex composition of E_P[f(x, ξ)], through Fenchel duality. Thus both the theoretical results in Sect. 2 and the numerical schemes in Sects. 3 and 4 apply to a large class of distributionally robust optimization problems with moment constraints.