Distributionally robust stochastic programs with side information based on trimmings

We consider stochastic programs conditional on some covariate information, where the only knowledge of the possible relationship between the uncertain parameters and the covariates is reduced to a finite data sample of their joint distribution. By exploiting the close link between the notion of trimmings of a probability measure and the partial mass transportation problem, we construct a data-driven Distributionally Robust Optimization (DRO) framework to hedge the decision against the intrinsic error in the process of inferring conditional information from limited joint data. We show that our approach is computationally as tractable as the standard (without side information) Wasserstein-metric-based DRO and enjoys performance guarantees. Furthermore, our DRO framework can be conveniently used to address data-driven decision-making problems under contaminated samples. Finally, the theoretical results are illustrated using a single-item newsvendor problem and a portfolio allocation problem with side information.


Introduction
Today's decision makers not only collect observations of the uncertainties directly affecting their decision-making processes, but also gather data about measurable exogenous variables that may have some predictive power on those uncertainties (Ban and Rudin 2019). In Statistics, Operations Research and Machine Learning, these variables are often referred to as covariates, explanatory variables, side information or features (Pang Ho and Hanasusanto 2019).
In the framework of Optimization Under Uncertainty, the side information acts by changing the probability measure of the uncertainties. In fact, if the joint distribution of the features and the uncertainties were known, this measure change would correspond to conditioning that distribution on the side information given. Unfortunately, in practice, the decision maker only has an incomplete picture of such a joint distribution in the form of a finite data sample. The development of optimization methods capable of exploiting the side information to make improved decisions, in a context of limited knowledge of its explanatory power on the uncertainties, defines the ultimate purpose of the so-called Prescriptive Stochastic Programming or Conditional Stochastic Optimization paradigm. This paradigm has recently become very popular in the technical literature, see, for instance, Ban and Rudin (2019), Bertsimas and Kallus (2020), Pang Ho and Hanasusanto (2019) and references therein. More specifically, a data-driven approach to address the newsvendor problem, whereby the decision is explicitly modeled as a parametric function of the features, is proposed in Ban and Rudin (2019). This approach thus seeks to optimize said function. In contrast, Bertsimas and Kallus (2020) formulate and formalize the problem of minimizing the conditional expectation cost given the side information, and develop various schemes based on machine learning methods (typically used for regression and prediction) to get data-driven solutions. Their approach is non-parametric in the sense that the optimal decision is not constrained to be a member of a certain family of the features' functions. The inspiring work of Bertsimas and Kallus (2020) has been subject to further study and improvement in two principal directions, namely, the design of efficient algorithms to trim down the computational burden of the optimization (Diao and Sen 2020) and the development of strategies to reduce the variance and bias of the decision obtained and its associated cost (the pairing of both interpreted as a statistical estimator). In the latter case, we can cite the work of Bertsimas and Van Parys (2017), where they leverage ideas from bootstrapping and machine learning to confer robustness on the decision and acquire asymptotic performance guarantees. Similarly, Bertsimas and McCord (2018) and Pang Ho and Hanasusanto (2019) propose regularization procedures to reduce the variance of the data-driven solution to the conditional expectation cost minimization problem, which is formalized and studied in Bertsimas and Kallus (2020). A scheme to robustify the data-driven methods introduced in this work is also proposed in Bertsimas et al. (2019) for dynamic decision-making.
A different, but related, thrust of research focuses on developing methods to construct predictions specifically tailored to the optimization problem that is to be solved and where those predictions are then used as input information. Essentially, the predictions are intended to yield decisions with a low disappointment or regret. This framework is known in the literature as (smart) Predict-then-Optimize, see, e.g., Balghiti et al. (2019), Donti et al. (2017), Elmachtoub and Grigas (2021), Muñoz et al. (2020), and references therein.
Our research, in contrast, builds upon Distributionally Robust Optimization (DRO), which is a powerful modeling paradigm to protect the task of decision-making against the ambiguity of the underlying probability distribution of the uncertainty (Rahimian and Mehrotra 2019). Nevertheless, the technical literature on the use of DRO to address Prescriptive or Conditional Stochastic Programming problems is still relatively scarce. We highlight Bertsimas et al. (2019), Chen et al. (2020), Hanasusanto and Kuhn (2013), Kannan et al. (2021, 2020b), and Nguyen et al. (2020, 2021), with Nguyen et al. (2021) being a generalization of Nguyen et al. (2020). In Chen et al. (2020), the authors resort to a scenario-dependent ambiguity set to exploit feature information in a DRO framework.
However, their objective is to minimize a joint expectation and, consequently, their approach cannot directly handle the Conditional Stochastic Optimization setting we consider here. Hanasusanto and Kuhn (2013) deal with a stochastic control problem with time-dependent data. They extend the idea of Hannah et al. (2010) to a fully dynamic setting and robustify the control policy against the worst-case weight vector that is within a certain χ²-distance from the one originally given by the Nadaraya-Watson estimator. In the case of Bertsimas et al. (2019), the authors propose using the conditional empirical distribution given by a local predictive method as the center of the Wasserstein ball that characterizes the DRO approach in Mohajerin Esfahani and Kuhn (2018).
This proposal, nonetheless, fails to explicitly account for the inference error associated with the local estimation. In Kannan et al. (2021, 2020b), the authors develop a two-step procedure whereby a regression model between the uncertainty and the features is first estimated and then a distributionally robust decision-making problem is formulated, considering a Wasserstein ball around the empirical distribution of the residuals. Finally, the authors in Nguyen et al. (2021) also consider a Wasserstein-ball ambiguity set as in Bertsimas et al. (2019) and Kannan et al. (2021, 2020b), but centered at the empirical distribution of the joint data sample of the uncertainty and the features.
In addition, they further constrain the ambiguity set by imposing that the worst-case distribution assigns some probability mass to the support of the uncertainty conditional on the values taken on by the features.
Against this background, our main contributions are:
1. Modeling power: We develop a general framework to handle prescriptive stochastic programs within the DRO paradigm. Our DRO framework is based on a new class of ambiguity sets that exploit the close and convenient connection between trimmings and the partial mass problem to immunize the decision against the error incurred in the process of inferring conditional information from joint (limited) data. We also show that our approach serves as a natural framework for the application of DRO in data-driven decision-making under contaminated samples and naturally produces distributionally robust versions of some local nonparametric predictive methods such as Nadaraya-Watson kernel regression and K-nearest neighbors, which are used in the context of conditional stochastic optimization (Bertsimas and Van Parys 2017, Bertsimas and McCord 2018, Bertsimas et al. 2019, Pang Ho and Hanasusanto 2019).
2. Computational tractability: Our framework is as complex as the Wasserstein-metric-based DRO approach proposed in Mohajerin Esfahani and Kuhn (2018) without side information. Therefore, we extend the mass-transportation approach to the realm of Conditional Stochastic Optimization while preserving its appealing tractability properties.
3. Theoretical results and performance guarantees: Leveraging theory from probability trimmings and optimal transport, we show that our DRO model enjoys a finite sample guarantee and is asymptotically consistent.
4. Numerical results: We evaluate our DRO approach on the single-item newsvendor problem and the portfolio allocation problem, and compare it with the KNN method described in Bertsimas and Kallus (2020), the robustified KNN proposed in Bertsimas et al. (2019), and a KNN followed by the standard Wasserstein-distance-based DRO model introduced in Mohajerin Esfahani and Kuhn (2018), as also suggested in Bertsimas et al. (2019). Unlike all these approaches, ours explicitly accounts for the cost impact of the potential error made when inferring conditional information from a joint sample of the uncertainty and the covariates. To this end, we minimize the worst-case cost over a Wasserstein ball of probability measures with an ambiguous center.
The rest of the paper is organized as follows. In Section 2, we formulate our DRO framework to address decision-making problems under uncertainty in the presence of side information and show that it is as tractable as the standard Wasserstein-metric-based DRO approach developed in Mohajerin Esfahani and Kuhn (2018). In Section 3.1, we deal with the case in which the side information corresponds to an event of known and positive probability and discuss its application to data-driven decision-making under contaminated samples. The situation in which the probability of such an event is positive, but unknown, is treated in Section 3.2. Section 3.3 elaborates on the case in which the side information reduces to a specific realization of the feature vector, more precisely, the instance where the side information represents an event of zero probability. Section 4 provides results from numerical experiments and, finally, Section 5 concludes the paper.
Notation. We use R̄ to represent the extended real line, and adopt the conventions of its associated arithmetic. Moreover, R_+ stands for the set of non-negative real numbers. We employ lower-case boldface letters to represent vectors. The inner product of two vectors u, v is denoted as ⟨u, v⟩ = uᵀv, and by ‖u‖ we denote the norm of the vector u. For a set A, the indicator function I_A(a) is defined through I_A(a) = 1 if a ∈ A, and I_A(a) = 0 otherwise. The Lebesgue measure on R^d is denoted as λ_d. We use the symbol δ_ξ to represent the Dirac distribution supported on ξ. Additionally, we reserve the hat symbol "ˆ" for objects that are dependent on the sample data. The K-fold product of a distribution Q will be denoted as Q^K. Finally, the symbols E and P denote, respectively, "expectation" and "probability" (the context will make clear the measure under which that expectation or probability is taken).

Data-driven distributionally robust optimization with side information
Unfortunately, when it comes to solving problem (1), neither the true distribution Q nor, even less so, the conditional one Q_Ξ̃ are generally known to the decision maker. Actually, the decision maker typically counts only on a data sample consisting of N observations ξ̂_i := (ẑ_i, ŷ_i), i = 1, . . . , N, which we assume are i.i.d. Therefore, the solution to problem (1) per se is, in practice, out of reach, and the best the decision maker can do is to approximate the solution to (1) with some (probabilistic) performance guarantees. Within this context, Distributionally Robust Optimization (DRO) emerges as a powerful modeling framework to achieve that goal. In brief, the DRO approach aims to find a decision x ∈ X that is robust against all conditional probability distributions that are somehow plausible given the information at the decision maker's disposal. This is mathematically stated as follows:

inf_{x ∈ X} sup_{Q_Ξ̃ ∈ Û_N} E_{Q_Ξ̃}[f(x, ξ)],   (2)

where Û_N is a so-called ambiguity set that contains all those plausible conditional distributions.
This ambiguity set must be built from the available information on ξ, which, in our case, comprises the sample data {ξ̂_i}_{i=1}^N. The subscript N in Û_N is intended to underline this issue. Furthermore, the condition Q_Ξ̃(Ξ̃) = 1 for all Q_Ξ̃ ∈ Û_N is implicit in the construction of that set. In our setup, however, problem (2) poses a major challenge, which has to do with the fact that the observations pertain to the true joint distribution Q, and not to the conditional one Q_Ξ̃. Consequently, we need to build an ambiguity set Û_N for the plausible conditional distributions from the limited joint data at our disposal. At this point, we should note that there are several approaches in the technical literature to handle the conditional stochastic optimization problem (1) for the particular case in which Ξ̃ is defined as Ξ̃ := {ξ = (z, y) ∈ Ξ : z = z*}. For example, Bertsimas and Kallus (2020) approximate (1) by the following conditional estimate

inf_{x ∈ X} Σ_{i=1}^N w_N^i(z*) f(x, ŷ_i),   (3)

where w_N^i(z*) is a weight function that can be given by various non-parametric machine learning methods such as K-nearest neighbors, kernel regression, CART, and random forests. Formulation (3) can be naturally interpreted as a (conditional) Sample-Average-Approximation (SAA) of problem (1). Bertsimas and McCord (2018) extend the work by Bertsimas and Kallus (2020) to accommodate the setting in which the outcome of the uncertainty y may be contingent on the taken decision x.
For this purpose, they work with an enriched data set comprising observations of the uncertainty y, the decision x and the covariates z, and allow the weights in (3) to depend on x too. Besides, they add terms to the objective function of (3) to penalize estimates of its variance and bias. The case in which the weight function in (3) is given by the Nadaraya-Watson (NW) kernel regression estimator is considered in Hannah et al. (2010) and Pang Ho and Hanasusanto (2019). A robustified version of (3), in which each sample cost is replaced by its worst case over a neighborhood of the corresponding observation, takes the form

inf_{x ∈ X} Σ_{i=1}^N w_N^i(z*) sup_{y ∈ U_N^i} f(x, (z*, y)),   (4)

where U_N^i := {y ∈ Ξ_y : ‖y − ŷ_i‖_p ≤ ε_N}. This problem can be seen as a robust SAA method capable of exploiting side information and has also been used in Bertsimas et al. (2019, 2021). In our case, however, we follow a different path to address the conditional stochastic optimization problem (1) by way of (2). More precisely, we leverage the notion of trimmings of a distribution and the related theory of partial mass transportation.
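As a concrete illustration of the weighted conditional SAA (3), the following sketch (our own, with illustrative names and a toy data-generating process, not code from any of the cited works) computes K-nearest-neighbor and Nadaraya-Watson weights for a given context z* and minimizes the weighted sample average of a newsvendor-type cost by grid search.

```python
# A minimal sketch of the weighted conditional SAA in (3): given joint
# observations (z_i, y_i) and a context z*, a local predictive method assigns
# weights w_N^i(z*) and the decision minimizes the weighted sample average of
# f(x, y_i).  Shown for KNN and Nadaraya-Watson weights and a toy newsvendor cost.
import numpy as np

def knn_weights(Z, z_star, k):
    """Uniform weights on the k covariate observations closest to z_star."""
    d = np.linalg.norm(Z - z_star, axis=1)
    idx = np.argsort(d)[:k]
    w = np.zeros(len(Z))
    w[idx] = 1.0 / k
    return w

def nw_weights(Z, z_star, h):
    """Nadaraya-Watson (Gaussian kernel) weights with bandwidth h."""
    d = np.linalg.norm(Z - z_star, axis=1)
    k = np.exp(-0.5 * (d / h) ** 2)
    return k / k.sum()

def newsvendor_cost(x, y, b=5.0, h_cost=1.0):
    """Piecewise-linear newsvendor cost: backorder penalty b, holding cost h_cost."""
    return b * np.maximum(y - x, 0.0) + h_cost * np.maximum(x - y, 0.0)

rng = np.random.default_rng(1)
Z = rng.uniform(0, 1, size=(200, 1))                 # toy covariate sample
Y = 10 * Z[:, 0] + rng.normal(0, 1, size=200)        # toy demand sample
w = knn_weights(Z, z_star=np.array([0.44]), k=14)    # nw_weights(Z, ..., h=0.1) is the NW alternative
orders = np.linspace(Y.min(), Y.max(), 500)
sa_costs = [(w * newsvendor_cost(x, Y)).sum() for x in orders]
print("conditional SAA order quantity:", orders[int(np.argmin(sa_costs))])
```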

The Partial Mass Transportation Problem and Trimmings
This section introduces some concepts about trimmings and the partial mass transportation problem that help us construct the ambiguity set U N in (2) from the sample data { ξ i } N i=1 . For simplicity, we restrict ourselves to probability measures defined in R d .
Now we introduce the notion of a trimming of a distribution, which is at the core of our proposed DRO framework.
Definition 1 (Trimming). Given 0 ≤ α ≤ 1 and probability measures P, Q on R^d, we say that Q is a (1 − α)-trimming of P if Q is absolutely continuous with respect to P and the Radon-Nikodym derivative satisfies dQ/dP ≤ 1/α.
The set of all (1 − α)-trimmings (or trimming set of level 1 − α) of P will be denoted by R 1−α (P ).
As extreme cases, we have that, for α = 1, R_0(P) is just {P}, while, for α = 0, R_1(P) is the set of all probability measures absolutely continuous with respect to P. Given a probability P on R^d, if α_1 ≤ α_2, then R_{1−α_2}(P) ⊂ R_{1−α_1}(P). Especially useful is the fact that a trimming set is a convex set, which is, besides, compact under the topology of weak convergence. We refer the reader to (Álvarez-Esteban et al. 2011, Proposition 2.7) for other interesting properties of the set R_{1−α}(P).
Consider now the following minimization problem:

min_{P̃ ∈ R_{1−α}(P)} D(P̃, R),   (6)

where D is a probability metric and R is a given target probability distribution.
Problem (6) is known as the (D, 1 − α)−partial (or incomplete) mass problem (del Barrio and Matrán 2013). While there is a variety of probability metrics we could choose from to play the role of D in (6), here we work with the space P p (R d ) of probability distributions supported on R d with finite p-th moment and restrict ourselves to the p−Wasserstein metric, W p , for its tractability and theoretical advantages. In such a case (i.e., when D = W p ), problem (6) is referred to as a partial mass transportation problem and interpolates between the classical optimal mass transportation problem (when α = 1) and the random quantization problem (when α = 0).
Intuitively, the partial optimal transport problem goes as follows. We have an excess supply of mass at origin and a demand that needs to be met at destination, so that it is not necessary to ship all the mass at origin (demand = α × supply). In other words, some (1 − α)-fraction of the mass at origin can be left unserved. The goal is to perform this task at the cheapest transportation cost. If we represent the demand at destination by a target probability distribution R, we can model the supply at origin by another probability distribution P, with the mass required at destination being α times the mass at origin. This way, a partial optimal transportation plan is a probability measure Π on R^d × R^d with first marginal in R_{1−α}(P) and with second marginal equal to R, which solves the following cost minimization problem:

min_Π ∫ ‖u − v‖^p Π(du, dv)  s.t. the first marginal of Π belongs to R_{1−α}(P) and the second marginal of Π equals R.

The following lemma allows us to characterize the connection between the joint distribution Q and the conditional distribution Q_Ξ̃ in problem (1) above in terms of the partial mass problem.
Lemma 1. Let Q be a probability measure on R^d such that Q(Ξ̃) = α > 0 and let Q_Ξ̃ be the Q-conditional probability distribution given the event ξ ∈ Ξ̃. Also, assume that, for a given probability metric D, the trimming set R_{1−α}(Q) is closed for D over an appropriate set of probability distributions. Then, Q_Ξ̃ is the unique distribution that satisfies Q_Ξ̃(Ξ̃) = 1 and D(R_{1−α}(Q), Q_Ξ̃) = 0.
By way of Lemma 1, we can reformulate problem (1) as follows:

inf_{x ∈ X} E_{Q_Ξ̃}[f(x, ξ)]   (7a)
s.t. Q_Ξ̃(Ξ̃) = 1, W_p(R_{1−α}(Q), Q_Ξ̃) = 0,   (7b)

which now presents a form that is much more suited to our purpose, that is, to get to the DRO-type of problem (2) we propose. The change, nonetheless, has been essentially cosmetic, because problem (7) still relies on the true joint distribution Q and, therefore, is of no use in practice as it stands right now. To make it practical, we need to rewrite it not in terms of the unknown Q, but in terms of the information available to the decision maker, i.e., the sample data {ξ̂_i}_{i=1}^N. For that purpose, it seems sensible and natural to replace Q in (7b) with its best approximation taken directly from the data, namely, the empirical measure of the sample, Q̂_N. Logically, to accommodate the approximation, we will need to introduce a budget ρ in equation (7b), that is,

inf_{x ∈ X} sup_{Q_Ξ̃} E_{Q_Ξ̃}[f(x, ξ)]   (8a)
s.t. W_p^p(R_{1−α}(Q̂_N), Q_Ξ̃) ≤ ρ,   (8b)
     Q_Ξ̃(Ξ̃) = 1.   (8c)

Hereinafter, we will use Û_N(α, ρ) to denote the ambiguity set defined by constraints (8b)-(8c).
Under certain conditions, this uncertainty set enjoys nice topological properties, as we state in Proposition EC.2 in Appendix EC.1.2.
Now we define what we call the minimum transportation budget, ε̂_{N,α}, which plays an important role in the selection of the budget ρ in problem (P).
Importantly, the minimum transportation budget to the power of p, i.e., ε̂^p_{N,α}, is the minimum value of ρ in (P) for this problem to be feasible. Furthermore, ε̂_{N,α} is random, because it depends on the available data sample, but it is realized before the decision x is to be made. It constitutes, therefore, input data to problem (P).
We note that, if the random vector y takes values in a set that is independent of the feature vector z, i.e., {y ∈ Ξ_y : ξ = (z*, y) ∈ Ξ̃} = Ξ_y for all z* in the covariate projection of Ξ̃, then dist(ξ̂_j, Ξ̃) reduces to the distance from ẑ_j to that projection. Furthermore, in what follows, we assume that dist(ξ̂_j, Ξ̃) (interpreted as a random variable) conditional on ξ̂_j ∉ Ξ̃ has a continuous distribution function. This ensures that, in the case Q(Ξ̃) = 0, which we study in Section 3.3, there will be exactly K nearest data points to Ξ̃ with probability one.
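To make the minimum transportation budget concrete, the sketch below (ours, with illustrative names, not the paper's code) computes the smallest feasible ρ for the event Ξ̃ = {z = z*} × Ξ_y under the reweighting characterization of empirical trimmings given later in Lemma 2: since each data point can carry at most 1/(Nα) probability mass, the cheapest way to concentrate all conditional mass on Ξ̃ is to load that cap on the data points closest to Ξ̃ and pay the p-th power of their projection distances.

```python
# Illustrative sketch: smallest feasible budget rho in (P) for the event
# {z = z*}.  Each data point can carry at most 1/(N*alpha) probability mass,
# so we greedily fill that cap on the points whose covariates are closest to
# z* and pay the p-th power of the distance needed to project them onto the
# conditional support.
import numpy as np

def min_transport_budget(Z, z_star, alpha, p=1):
    N = len(Z)
    cap = 1.0 / (N * alpha)                    # maximum weight per data point
    dist = np.sort(np.linalg.norm(Z - z_star, axis=1)) ** p
    budget, mass_left = 0.0, 1.0
    for c in dist:                             # cheapest points first
        w = min(cap, mass_left)
        budget += w * c
        mass_left -= w
        if mass_left <= 1e-12:
            break
    return budget                              # minimum feasible rho (p-th power scale)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))
print(min_transport_budget(Z, z_star=np.zeros(2), alpha=0.2, p=1))
```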
Next we present an interesting result, which deals with the inner supremum of problem (P) and adds more meaning to this problem by linking it to an alternative formulation more in the style of the Wasserstein data-driven DRO approach proposed in Mohajerin Esfahani and Kuhn (2018), where, however, no side information is taken into account. In fact, the distributionally robust approach to conditional stochastic optimization that is proposed in Nguyen et al. (2021) is based on this alternative formulation (see Proposition A.4 in that work). A proof of the following result can be found in Appendix A.2.
Proposition 1. Given N ≥ 1, Q(Ξ̃) = α > 0, and any positive value of ρ, problem (SP2) is a relaxation of (SP1), where by "relaxation" it is meant that any solution Q feasible in (SP1) can be mapped into a solution Q_Ξ̃ feasible in (SP2) with the same objective function value.
Among other things, Proposition 1 reveals that parameter ρ in problem (SP2), and hence in problem (P), can be understood as a cost budget per unit of transported mass. Likewise, parameter α can be interpreted as the minimum amount of mass (in per unit) of the empirical distribution Q N that must be transported to the support Ξ. This interpretation of parameters ρ and α will be useful to follow the rationale behind the DRO solution approaches that we develop later on.
On the other hand, despite the connection between problems (SP1) and (SP2) that Proposition 1 unveils, the latter is qualitatively more amenable to further generalization and analysis. Examples of this are given by the relevant cases α = 0, for which problem (SP1) is ill-posed, while problem (SP2) is not, and α unknown, for which the use of trimming sets in (SP2) allows for a more straightforward treatment. We will deal with both cases in Sections 3.3 and 3.2, respectively. Before that, we provide an implementable reformulation of the proposed DRO problem (P).

Towards a tractable reformulation of the partial mass transportation problem
In this section, we put the proposed DRO problem (P) in a form more suited to tackle its computational implementation and solution. For this purpose, we first need to introduce a technical result whereby we characterize the trimming sets of an empirical probability measure.
Lemma 2. Consider the sample data {ξ̂_i}_{i=1}^N and their associated empirical measure Q̂_N := (1/N) Σ_{i=1}^N δ_{ξ̂_i}. For α > 0, the trimming set R_{1−α}(Q̂_N) consists of all probability measures of the form Σ_{i=1}^N w_i δ_{ξ̂_i} with 0 ≤ w_i ≤ 1/(Nα), i = 1, . . . , N, and Σ_{i=1}^N w_i = 1.
Proof. If α > 0, the form of any (1 − α)-trimming of Q̂_N follows from Definition 1, which requires absolute continuity with respect to Q̂_N with Radon-Nikodym derivative bounded by 1/α, that is, N w_i ≤ 1/α. The conditions w_i ≥ 0 and Σ_i w_i = 1 are then required because any (1 − α)-trimming is a probability distribution. Q.E.D.
On the other hand, if α = 0, the resulting trimming set R_1(Q̂_N) is simply the family of all probability distributions supported on the data points {ξ̂_i}_{i=1}^N. In short, Lemma 2 tells us that trimming a data sample of size N with level 1 − α involves reweighting the empirical distribution of such data by giving a new weight less than or equal to 1/(Nα) to each data point. Therefore, we can recast constraint (8b) in terms of such reweightings of the empirical distribution. We are now ready to introduce the main result of this section; before stating it, we include a brief numerical illustration of Lemma 2.
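The sketch below (our own, not part of the paper) makes Lemma 2 and the partial mass problem (6) tangible for p = 1: the (1 − α)-trimmings of Q̂_N are exactly the reweightings of its atoms with weights capped at 1/(Nα), so the distance W_1(R_{1−α}(Q̂_N), R) to a discrete target R reduces to a small linear program over partial transport plans.

```python
# Sketch illustrating Lemma 2: the (1-alpha)-trimmings of an empirical measure
# are reweightings of its atoms with weights capped at 1/(N*alpha).  We compute
# the partial-mass distance W_1(R_{1-alpha}(Q_N), R) to a discrete target R by
# a small transport LP (scipy's HiGHS backend).
import numpy as np
from scipy.optimize import linprog

def trimmed_wasserstein1(source_atoms, target_atoms, target_weights, alpha):
    """Min W_1 distance between the (1-alpha)-trimming set of the empirical
    measure on `source_atoms` and the discrete measure (target_atoms, target_weights)."""
    N, M = len(source_atoms), len(target_atoms)
    cap = 1.0 / (N * alpha)                              # weight cap from Lemma 2
    # Pairwise transport costs (absolute distance, 1-D atoms).
    cost = np.abs(np.asarray(source_atoms)[:, None] - np.asarray(target_atoms)[None, :]).ravel()
    # Row sums (mass shipped from each data point) must not exceed the cap ...
    A_ub = np.zeros((N, N * M))
    for i in range(N):
        A_ub[i, i * M:(i + 1) * M] = 1.0
    b_ub = np.full(N, cap)
    # ... while column sums must reproduce the target weights exactly.
    A_eq = np.zeros((M, N * M))
    for j in range(M):
        A_eq[j, j::M] = 1.0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=target_weights,
                  bounds=(0, None), method="highs")
    return res.fun

# Toy usage: five data points, a two-atom target, alpha = 0.4.
xi = np.random.default_rng(0).normal(size=5)
print(trimmed_wasserstein1(xi, np.array([0.0, 1.0]), np.array([0.5, 0.5]), alpha=0.4))
```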
Theorem 1 (Reformulation based on strong duality). For α > 0 and any value of ρ ≥ ε̂^p_{N,α}, subproblem (SP2) admits an equivalent finite-dimensional reformulation, denoted (SP2'), obtained via strong duality. Surely the most important takeaway message of Theorem 1 is that problem (P) is as tractable as the standard Wasserstein-metric-based DRO formulation proposed in Mohajerin Esfahani and Kuhn (2018) and Kuhn et al. (2019). In these two references, conditions under which the inner supremum in (SP2') can be recast in a more tractable form are provided. As an example, in Theorem EC.2 in Appendix EC.1.3, we provide a more refined reformulation of (SP2'), whereby the problems we solve in Section 4 can be directly handled.
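Although the dual reformulation itself is not reproduced here, the primal object it dualizes can be checked by brute force on small instances. The sketch below (ours) restricts the worst-case distribution to a finite candidate grid inside the conditional support and solves the resulting linear program for p = 1; it yields a lower bound on the inner supremum and is intended only as a sanity check, not as the tractable reformulation of Theorem 1.

```python
# Brute-force sketch of the inner supremum in (SP2) for p = 1: worst-case
# expectation over distributions supported on a finite candidate grid inside
# the conditional support, reachable from some (1-alpha)-trimming of the
# empirical measure within transport budget rho.
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation(xi, grid, f_vals, alpha, rho):
    """max sum_ij Pi_ij f(grid_j) s.t. row sums <= 1/(N alpha),
    total mass = 1, total transport cost <= rho."""
    N, M = len(xi), len(grid)
    cost = np.abs(xi[:, None] - grid[None, :])           # 1-Wasserstein costs
    c = -np.repeat(f_vals[None, :], N, axis=0).ravel()   # maximize => minimize -f
    A_ub = np.zeros((N + 1, N * M))
    for i in range(N):                                   # trimming weight caps
        A_ub[i, i * M:(i + 1) * M] = 1.0
    A_ub[N] = cost.ravel()                               # transport budget
    b_ub = np.r_[np.full(N, 1.0 / (N * alpha)), rho]
    A_eq = np.ones((1, N * M))                           # total mass one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=(0, None), method="highs")
    return -res.fun

xi = np.random.default_rng(0).normal(size=20)            # projected data (1-D)
grid = np.linspace(-3, 3, 61)                             # candidate support of Q_Xi
f_vals = np.maximum(grid, 0.0)                            # f(x, .) for a fixed x
print(worst_case_expectation(xi, grid, f_vals, alpha=0.3, rho=0.5))
```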
In the following section, we show that problem (P) works, under certain conditions, as a statistically meaningful surrogate decision-making model for the target conditional stochastic program (1).

Finite sample guarantee and asymptotic consistency
Next we argue that the worst-case optimal expected cost provided by problem (P) for a fixed sample size N and a suitable choice of parameters (α, ρ) (dependent on N ) leads to an upper confidence bound on the out-of-sample performance attained by the optimizers of (P) (finite sample guarantee) and that those optimizers almost surely converge to an optimizer of the true optimal expected cost as N grows to infinity (asymptotic consistency).
To be more precise, the out-of-sample performance of a given data-driven candidate solution x̂_N is defined as J(x̂_N) := E_{Q_Ξ̃}[f(x̂_N, ξ)]. We say that a data-driven method built to address problem (1) enjoys a finite sample guarantee if it produces pairs (x̂_N, Ĵ_N) satisfying a relation in the form

P^N [ J(x̂_N) ≤ Ĵ_N ] ≥ 1 − β,   (10)

where Ĵ_N is a certificate for the out-of-sample performance of x̂_N (i.e., an upper bound that is generally contingent on the data sample). The probability on the right-hand side of (10), i.e., 1 − β, is known as the reliability of (x̂_N, Ĵ_N) and can be understood as a confidence level.
Our analysis relies on the lemma below, which immediately follows from setting P_1 := Q̂_N, Q := Q_Ξ̃, P_2 := Q in Lemma 3.13 on probability trimmings in Agulló Antolín (2018).
Lemma 3. Assume that Q_Ξ̃, Q ∈ P_p(R^d), and take p ≥ 1. Then,

W_p(R_{1−α}(Q̂_N), Q_Ξ̃) ≤ W_p(R_{1−α}(Q), Q_Ξ̃) + (1/α^{1/p}) W_p(Q̂_N, Q).   (11)

We notice that the term W_p(R_{1−α}(Q), Q_Ξ̃) in (11) is not random and depends exclusively on the true distributions Q_Ξ̃, Q, and the trimming level α. It is, therefore, independent of the data sample (unlike the other two terms involved).
Inequality (11) reveals an interesting trade-off. On the one hand, the distance W_p(R_{1−α}(Q), Q_Ξ̃) diminishes as α decreases to zero, because the trimming set R_{1−α}(Q) grows in size. On the other hand, the term (1/α^{1/p}) W_p(Q̂_N, Q) becomes larger as α approaches zero. As we will see later on, controlling this trade-off is key to endowing problem (P) with performance guarantees. To this end, we will make use of Proposition 2 below.
Assumption 1. Suppose that the true joint probability distribution Q is light-tailed, i.e., there exists a constant a > p ≥ 1 such that E_Q[exp(‖ξ‖^a)] < ∞.
Proposition 2 (Concentration tail inequality). Suppose that Assumption 1 holds. Then, there are constants c, C > 0 such that, for all ε > 0, α > 0, and N ≥ 1, it holds that

P^N [ W_p(R_{1−α}(Q̂_N), Q_Ξ̃) > W_p(R_{1−α}(Q), Q_Ξ̃) + ε ] ≤ β_{p,ε,α}(N),   (12)

where β_{p,ε,α}(N) denotes the concentration bound of (Fournier and Guillin 2015, Theorem 2), which depends on c and C, evaluated at α^{1/p} ε.
Proof. Because of Lemma 3, we have

P^N [ W_p(R_{1−α}(Q̂_N), Q_Ξ̃) > W_p(R_{1−α}(Q), Q_Ξ̃) + ε ] ≤ P^N [ (1/α^{1/p}) W_p(Q̂_N, Q) > ε ],

where the right-hand side of this inequality is upper bounded by β_{p,ε,α}(N) according to (Fournier and Guillin 2015, Theorem 2). Q.E.D.
Assuming p ≠ d/2, if we equate β to β_{p,ε,α}(N) and solve for ε, we obtain an explicit expression, ε_{N,p,α}(β), for the smallest radius that guarantees reliability 1 − β for a given sample size N and trimming level α. In what follows, we distinguish three general setups that may appear in the real-life use of Conditional Stochastic Optimization, namely, the case Q(Ξ̃) = α > 0 with α known, the case Q(Ξ̃) = α > 0 with α unknown, and the case Q ≪ λ_d with Q(Ξ̃) = α = 0.
3.1. Case Q(Ξ̃) = α > 0. Applications in data-driven decision-making under contaminated samples
When Q(Ξ̃) = α > 0 and α is known, we can solve problem (P) with the trimming level set to α and a budget ρ_N, a problem that we denote P(α, ρ_N). As we show below, problem P(α, ρ_N) enjoys a finite sample guarantee and produces solutions that are asymptotically consistent, i.e., that converge to the true solution (under complete information) given by problem (1). This is somewhat hinted at by the connection between problems (SP1) and (SP2) highlighted in Proposition 1.
We point out that, in the case α > 0, data points may fall into the set Ξ̃. Logically, the contribution of these points to the minimum transportation budget ε̂^p_{N,α} is null and their order (the way their tie is broken) is irrelevant to our purpose. Now we show that the solutions of the distributionally robust optimization problem P(α, ρ_N) converge to the solution of the target conditional stochastic program (1) as N increases, for a careful choice of the budget ρ_N. This result is underpinned by the fact that, under that selection of ρ_N, any distribution in Û_N(α, ρ_N) converges to the true conditional distribution Q_Ξ̃. This is formally stated in the following lemma.
Lemma 4 (Case α > 0: Convergence of conditional distributions). Suppose that the assumptions of Proposition 2 hold. Choose a sequence β_N ∈ (0, 1), N ∈ N, such that Σ_{N=1}^∞ β_N < ∞ and lim_{N→∞} ε_{N,p,α}(β_N) = 0, and set ρ_N := max(ε^p_{N,p,α}(β_N), ε̂^p_{N,α}). Then, any sequence Q^N_Ξ̃ ∈ Û_N(α, ρ_N), N ∈ N, satisfies W_p(Q^N_Ξ̃, Q_Ξ̃) → 0 with probability one as N → ∞.
Proof. Take N large enough and let Q̂_{N/Ξ̃} be the conditional probability distribution of Q̂_N given ξ ∈ Ξ̃. By the triangle inequality, we have

W_p(Q^N_Ξ̃, Q_Ξ̃) ≤ W_p(Q^N_Ξ̃, Q̂_{N/Ξ̃}) + W_p(Q̂_{N/Ξ̃}, Q_Ξ̃).

We show that the two terms on the right-hand side of the above inequality vanish with probability one as N grows to infinity. We start with W_p(Q̂_{N/Ξ̃}, Q_Ξ̃).
Let I denote the subset of observations ξ̂_i := (ẑ_i, ŷ_i), i = 1, . . . , N, such that ξ̂_i ∈ Ξ̃. It follows from the Strong Law of Large Numbers that Q̂_N(Ξ̃) = |I|/N = α_N → α almost surely. Besides, since the sequence β_N, N ∈ N, is summable and lim_{N→∞} ε_{N,p,α}(β_N) = 0, the Borel-Cantelli Lemma and Proposition 2 imply that, with probability one, lim sup_{N→∞} W_p(R_{1−α}(Q̂_N), Q_Ξ̃) ≤ W_p(R_{1−α}(Q), Q_Ξ̃). Then, from Lemma 1, which gives W_p(R_{1−α}(Q), Q_Ξ̃) = 0, we deduce that W_p(Q̂_{N/Ξ̃}, Q_Ξ̃) → 0 with probability one.
We can deal with the term W_p(Q^N_Ξ̃, Q̂_{N/Ξ̃}) in a similar fashion, except for the subtle difference that, in this case, we require ρ_N = max(ε^p_{N,p,α}(β_N), ε̂^p_{N,α}), so that, for all N ∈ N, problem P(α, ρ_N) delivers a feasible Q^N_Ξ̃ in the sequence. Hence, in order to prove that W_p(Q^N_Ξ̃, Q̂_{N/Ξ̃}) → 0 almost surely, we need to show that lim_{N→∞} ε̂_{N,α} = 0 with probability one. This is something that can be directly deduced from the definition of ε̂_{N,α} in (9). Q.E.D.
Note that, by Equation (9) in Definition 2, we have that ε̂_{N,α} > 0 if and only if Q̂_N(Ξ̃) < α, that is, if and only if the fraction of sample points falling in Ξ̃ is smaller than the trimming level α. Once the convergence of Q^N_Ξ̃ to the true conditional distribution Q_Ξ̃ in the p-Wasserstein metric has been established by the previous lemma, the following asymptotic consistency result, which is analogous to that of (Mohajerin Esfahani and Kuhn 2018, Theorem 3.6), can also be derived.
Theorem 3 (Asymptotic consistency). Consider that the conditions of Theorem 2 hold and take a sequence ρ_N as in Lemma 4. Then:
(i) If f(x, ξ) is continuous in ξ and there exists L ≥ 0 such that |f(x, ξ)| ≤ L(1 + ‖ξ‖^p) for all x ∈ X and ξ ∈ Ξ, then we have that Ĵ_N → J* almost surely when N grows to infinity.
(ii) If the assumptions in (i) are satisfied, f (x, ξ) is lower semicontinuous on X for any fixed ξ ∈ Ξ, and the feasible set X is closed, then we have that any accumulation point of the sequence { x N } N is almost surely an optimal solution of problem (1).
Proof. We omit the proof, because it is essentially the same as the one in (Mohajerin Esfahani and Kuhn 2018, Theorem 3.6), except that, since we are working with p ≥ 1, we additionally require that f(x, ξ) be continuous in ξ so that we can make use of Theorem 7.12 from Villani (2003).
In the following remark, we show how problem P (α, ρ N ) can be used to make distributionally robust decisions in a context where the data available to the decision maker is contaminated.
Remark 1 (Data-driven decision-making under contaminated samples). Suppose that the dataset ξ i := ( z i , y i ) for i = 1, . . . , N is composed of correct and contaminated samples.
The decision maker only knows that a sample is correct with probability α and contaminated with probability 1 − α, but does not know which type each sample belongs to. Thus, the data have been generated from a mixture distribution given by P = αQ * + (1 − α)R, where Q * is the correct distribution and R a contamination.
In our context, this is equivalent to stating that Q* ∈ R_{1−α}(P), which, in turn, can be formulated as W_p(R_{1−α}(P), Q*) = 0. Since we only have limited information on P in the form of the empirical measure P̂_N, we proceed as in Section 2, replacing P with P̂_N and introducing a budget ρ_N to accommodate the approximation. This leads to the DRO problem

inf_{x ∈ X} sup_Q E_Q[f(x, ξ)]   (18a)
s.t. W_p^p(R_{1−α}(P̂_N), Q) ≤ ρ_N, Q(Ξ) = 1,   (18b)

where we have assumed that the correct distribution Q*, the contamination R and the data-generating distribution P are all supported on Ξ.
Furthermore, if we choose a summable sequence β_N ∈ (0, 1), N ∈ N, such that lim_{N→∞} ε_N(β_N) = 0, then we have that

W_p(R_{1−α}(P̂_N), Q*) → 0 with probability one as N → ∞.   (19)

In plain words, for N large enough, the decision vector x is being optimized by way of problem (18a)-(18b) over the "smallest" ambiguity set that almost surely contains the correct distribution Q* of the data (in the absence of any other information on Q*). In fact, this means our DRO approach deals with contaminated samples in a way that is distinctly more convenient than that of Chen (2019) and Farokhi (2021). Essentially, they suggest optimizing over a 1-Wasserstein ball centered at P̂_N of radius ρ, under the argument that, for ρ sufficiently large, the Wasserstein ball contains the true distribution of the data Q* with a certain confidence level. For instance, the author of Farokhi (2021) uses the triangle inequality and the convexity property of the Wasserstein distance to establish that an extra budget of (1 − α)W_1(Q*, R) must be added to the radius to ensure that Q* is within the Wasserstein ball with a given confidence level (a similar argument is made in Chen (2019)). In practice, though, this extra budget as such cannot be computed, because neither the correct distribution Q* nor the contamination R are known to the decision maker. However, our approach naturally encodes it in the ambiguity set (18b). Indeed, for N large enough, result (19) tells us that the (1 − α)-trimming set of the empirical distribution P̂_N gets, almost surely, arbitrarily close to the correct distribution Q* in the p-Wasserstein metric. In short, our approach offers probabilistic guarantees in the finite-sample regime and, in the asymptotic one, naturally exploits all the information we have on Q*, namely, Q* ∈ R_{1−α}(P), to robustify the decision x under contamination.
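As a quick numerical companion to Remark 1 (our own sketch, not from the paper), the snippet below draws data from a mixture P = αQ* + (1 − α)R and tracks the partial-mass distance from the (1 − α)-trimming set of the contaminated empirical measure to a discrete proxy of the clean distribution Q*; it tends to shrink as N grows, in line with (19), up to the discretization error of the proxy. It reuses the trimmed_wasserstein1 helper from the earlier sketch following Lemma 2.

```python
# Simulation companion to Remark 1 (illustrative).  P = alpha*Q* + (1-alpha)*R,
# with Q* = N(0,1) the correct distribution and R = N(6, 0.5) a contamination.
# We measure how close the (1-alpha)-trimming set of the contaminated empirical
# measure gets to a fixed discrete proxy of Q* as the sample size N grows.
# Assumes trimmed_wasserstein1 (defined earlier) is in scope.
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.8                                    # known fraction of correct samples
proxy = rng.normal(0.0, 1.0, size=80)          # discrete stand-in for Q*
proxy_w = np.full(80, 1.0 / 80)
for N in (40, 100, 250):
    clean = rng.normal(0.0, 1.0, size=N)
    contam = rng.normal(6.0, 0.5, size=N)
    mix = np.where(rng.random(N) < alpha, clean, contam)
    print(N, round(trimmed_wasserstein1(mix, proxy, proxy_w, alpha), 3))
```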
3.2. Case Q(Ξ̃) = α > 0 with α unknown
In this section, we discuss how we can use the proposed DRO approach to deal with the case in which Q(Ξ̃) = α > 0 is unknown. For this purpose, we first introduce a proposition that will allow us to design a distributionally robust strategy to tackle problem (1) by means of problem (P).
Proof. The proof of the proposition is trivial and directly follows from the fact that R_{1−α}(Q̂_N) ⊆ R_{1−α'}(Q̂_N) whenever 0 < α' ≤ α, which implies Û_N(α, ρ) ⊆ Û_N(α', ρ). Q.E.D.
Based on Proposition 3, we could use the following two-step safe strategy to handle the case of unknown Q(Ξ̃) = α > 0:
1. First, solve the following uncertainty quantification problem (see Gao and Kleywegt (2016) and Mohajerin Esfahani and Kuhn (2018) for further details):

α̂_N := inf { Q(Ξ̃) : W_p(Q̂_N, Q) ≤ ε_N(β_N) },

where the radius ε_N(β_N) of the Wasserstein ball has been chosen so that α̂_N represents the minimum probability that the joint true distribution Q of the data assigns to the event ξ ∈ Ξ̃ with confidence 1 − β_N.
2. Then, solve problem P(α̂_N, ρ_N) with ρ_N := ε^p_N(β_N)/α̂_N.
Now suppose that Q belongs to the Wasserstein ball of radius ε_N(β_N) centered at Q̂_N and, therefore, α̂_N ≤ α (this is a random event that occurs with probability at least 1 − β_N). According to Lemma 3, we have

W_p(R_{1−α̂_N}(Q̂_N), Q_Ξ̃) ≤ W_p(R_{1−α̂_N}(Q), Q_Ξ̃) + (1/α̂_N^{1/p}) W_p(Q̂_N, Q) ≤ ε_N(β_N)/α̂_N^{1/p},

where the first term on the right-hand side vanishes because Q_Ξ̃ ∈ R_{1−α}(Q) ⊆ R_{1−α̂_N}(Q); hence, Q_Ξ̃ ∈ Û_N(α̂_N, ρ_N). In other words, the two-step procedure here described does not degrade the reliability of the DRO solution. Furthermore, the minimum transportation budget ε̂_{N,α̂_N} that makes problem P(α̂_N, ρ_N) feasible is always zero here, if the event ξ ∈ Ξ̃ has been observed at least once. This is so because the uncertainty quantification problem of step 1 ensures that α̂_N is lower than or equal to the fraction of training data points falling in Ξ̃.
Moreover, when N grows to infinity, this uncertainty quantification problem reduces to computing such a fraction of points, which, by the Strong Law of Large Numbers converges to the real α, i.e., α N → α with probability one. Therefore, in the asymptotic regime, this case resembles that of known α > 0.
We notice, however, that, in practice, setting ρ_N ≥ ε^p_N(β_N)/α̂_N may result in too large budgets ρ_N, and thus in overly conservative solutions, because, as the radius ε_N(β_N) is increased, α̂_N decreases to zero. For this reason, in the numerical experiments of the electronic companion, we describe an alternative data-driven procedure to address the case α > 0, in which we simply set α̂_N = Q̂_N(Ξ̃) in problem P(α̂_N, ρ_N) and use the data to tune the parameter ρ_N.
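For intuition on step 1 of the two-step strategy, the following one-dimensional sketch (ours, not the paper's code) computes a data-driven lower bound α̂_N on Q(Ξ̃) for an interval event Ξ̃ = {ξ ∈ [a, b]} and the 1-Wasserstein metric: the worst-case distribution in the ball pushes mass from the points closest to the interval's endpoints just outside it, cheapest mass first, until the transport budget is exhausted.

```python
# Sketch of the uncertainty quantification step for a 1-D interval event:
# alpha_hat = inf { Q([a, b]) : W_1(Q_N, Q) <= eps }.  The infimum moves
# probability mass from the data points inside [a, b] that are cheapest to
# push outside (cost = distance to the nearest endpoint), greedily, until
# the budget eps is spent.  Conservative (it never overstates alpha).
import numpy as np

def alpha_lower_bound(xi, a, b, eps):
    N = len(xi)
    inside = (xi >= a) & (xi <= b)
    exit_cost = np.minimum(xi[inside] - a, b - xi[inside])   # cost to leave [a, b]
    removed, budget = 0.0, float(eps)
    for c in np.sort(exit_cost):                              # cheapest mass first
        if budget <= 0 and c > 0:
            break
        take = 1.0 / N if c == 0 else min(1.0 / N, budget / c)
        removed += take
        budget -= take * c
    return max(inside.mean() - removed, 0.0)

xi = np.random.default_rng(7).normal(size=500)
print(alpha_lower_bound(xi, a=-1.0, b=1.0, eps=0.05))         # conservative alpha_hat
```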
3.3. Case Q(Ξ̃) = α = 0
Suppose that the true joint distribution Q governing the random vector ξ := (z, y) admits a density function with respect to the Lebesgue measure λ_d, with d = d_z + d_y. Without loss of generality, consider the event ξ ∈ Ξ̃, where Ξ̃ is defined as Ξ̃ = {ξ = (z, y) ∈ Ξ : z = z*}. This means that Q(Ξ̃) = 0. Therefore, our focus in this case is on the particular variant of problem (1) given by

inf_{x ∈ X} E_Q[f(x, (z, y)) | z = z*].   (24)

Devising a DRO approach to problem (24) using the standard Wasserstein ball W_p(Q̂_N, Q) ≤ ε is of no use here, because any point from the support of Q̂_N with an arbitrarily small mass can be transported to the set Ξ̃ at an arbitrarily small cost in terms of W_p(Q̂_N, Q). This way, one could always place this arbitrarily small particle at a point (z*, y') ∈ arg max_{(z,y) ∈ Ξ̃} f(x, (z, y)). In contrast, problem (P), which is based on partial mass transportation, offers a richer framework to seek a distributionally robust solution to (24). To see this, consider again inequality (11). If we could set α = 0, the term W_p(R_{1−α}(Q), Q_Ξ̃) would vanish, since Q_Ξ̃ can be approximated arbitrarily well, in the p-Wasserstein metric, by distributions that are absolutely continuous with respect to Q (for instance, by conditioning Q on small balls around z*). Unfortunately, fixing α to zero is not a real option due to the term (1/α^{1/p}) W_p(Q̂_N, Q) in the inequality. Therefore, what we propose instead is to solve a sequence of optimization problems in the form P(α_N, ρ_N), with both α_N and ρ_N tending to zero appropriately as N increases. Next we show that, under certain conditions, problem P(α_N, ρ_N) enjoys a finite sample guarantee and is asymptotically consistent.
Assumption 2 (Smoothness of the joint density around z*). Suppose that Q admits a joint density φ(z, y) and that:
1. It admits, uniformly for r ∈ [0, r_0] and y ∈ R^{d_y}, a first-order expansion in r around z* along any direction u ∈ R^{d_z} with ‖u‖ = 1.
2. The marginal density of z is bounded away from zero in B(z*, r_0).
Assumption 3 (Regularity and boundedness). We assume that:
1. There exist C > 0 and r_0 > 0 such that P(‖z* − z‖ ≤ r) ≥ C r^{d_z}, for all 0 < r ≤ r_0.
2. The uncertainty y is bounded, that is, ‖y‖ ≤ M a.s. for some constant M > 0.
We note that Assumption 3.1 is automatically implied by Assumption 2, but we explicitly state it here for ease of readability. Furthermore, under the boundedness condition established in Assumption 3.2, Assumption 2 is satisfied, for example, by a twice differentiable joint density φ(z, y) with continuous and bounded partial derivatives in B(z * , r) × Ξ y and bounded away from zero in that set. These are standard regularity conditions in the technical literature on kernel density estimation and regression (Pang Ho and Hanasusanto 2019).
Theorem 4 (Finite sample guarantee). Suppose that Assumptions 2 and 3 hold. Then, for a suitable choice of α_N and ρ_N (dependent on N and the prescribed reliability level 1 − β), we have that the pair (x̂_N, Ĵ_N) delivered by problem P(α_N, ρ_N) with parameters ρ_N and α_N enjoys the finite sample guarantee (10).
Proof. For problem P(α_N, ρ_N) to be feasible, we need ρ_N ≥ ε̂^p_{N,α_N}.
Furthermore, according to Theorem 3.5.2 in Falk et al. (2010), there exists a positive constant C' such that Hell(Q_{B(z*,r)×Ξ_y}, Q_Ξ̃) ≤ C' r uniformly for 0 < r < r_0, where Hell stands for the Hellinger distance and Q_{B(z*,r)×Ξ_y} denotes the Q-conditional distribution given z ∈ B(z*, r). From Equation (5.1) in Santambrogio (2015) and Assumption 3.2, the p-Wasserstein distance between these two conditional distributions is bounded by their total variation distance up to a constant depending on M. In turn, from Gibbs and Su (2002), the total variation distance is controlled by the Hellinger distance. Chaining these bounds and using the relation α(r) = C r^{d_z}, we can express the resulting upper bound on W_p(R_{1−α}(Q), Q_Ξ̃) in terms of α. Q.E.D.
From Equation (5.1) in Santambrogio (2015) and Assumption 3.2 we know that In turn, from Gibbs and Su (2002) which we can express in terms of α as Remark 2. There are conditions on the smoothness of the true joint distribution Q around z = z * , other than those stated in Assumptions 2 and 3, for which we can also upper bound the distance W p (R 1−α (Q), Q Ξ ). We provide below two examples of these conditions, which have been invoked in Kannan et al. (2020a,b) and Bertsimas et al. (2019), respectively, and neither of which requires the boundedness of the uncertainty y.
Example 1. Suppose that the true data-generating model is given by y = m(z) + e, where m(·) is the regression function and e is a zero-mean random error independent of z. Furthermore, suppose that Assumption 3.1 holds and there exists a positive constant L such that ‖m(z) − m(z*)‖ ≤ L‖z − z*‖ for all z ∈ B(z*, r_0). Take α(r) = C r^{d_z}, for all 0 < r ≤ r_0, and set α_0 := α(r_0). With abuse of notation, for any event within B(z*, r) × Ξ_y, we can write the conditional measure Q_{B(z*,r)×Ξ_y} as a mixture of the conditional measures Q_{z=z'}, z' ∈ B(z*, r), weighted by the normalized restriction of Q_z, where Q_z is the probability law of the feature vector z and Q_{z=z'} is the conditional measure of Q given that z = z'.
Since Q_{B(z*,r)×Ξ_y} ∈ R_{1−α(r)}(Q) for all 0 < r ≤ r_0, by the convexity of the Wasserstein distance we have

W_p(R_{1−α(r)}(Q), Q_Ξ̃) ≤ W_p(Q_{B(z*,r)×Ξ_y}, Q_{z=z*}) ≤ sup_{z' ∈ B(z*,r)} W_p(Q_{z=z'}, Q_{z=z*}) ≤ L r = L (α(r)/C)^{1/d_z}.

Example 2. Take p = 1. Suppose that there exists a positive constant L such that W_1(Q_{z=z'}, Q_{z=z*}) ≤ L ‖z' − z*‖, for all 0 ≤ ‖z' − z*‖ ≤ r_0, and that Assumption 3.1 holds.
Following a line of reasoning that is parallel to that of the previous example, we also get W_1(R_{1−α(r)}(Q), Q_Ξ̃) ≤ L (α(r)/C)^{1/d_z}. Equation (27) and Examples 1 and 2 reveal that our finite sample guarantee is affected by the curse of dimensionality. Recently, powerful ideas to break this curse have been introduced in Gao (2020) under the standard Wasserstein-metric-based DRO scheme. In our setup, however, we also need distributional robustness against the (uncertain) error incurred when inferring conditional information from a sample of the true joint distribution. This implies increasing the robustness budget in our approach by an amount linked to the term W_p(R_{1−α}(Q), Q_Ξ̃). Consequently, we might need stronger assumptions on the data-generating model to break the dependence of this term on the dimension of the feature vector and thus extend the ideas in Gao (2020) to the realm of conditional stochastic optimization. Now we state the conditions under which the sequence of problems P(α_N, ρ_N), N → ∞, is asymptotically consistent.
Lemma 5 (Convergence of conditional distributions). Suppose that the support Ξ of the true joint distribution Q is compact and that Assumptions 2 and 3.1 hold.
Suppose further that α_N → 0, that Nα_N grows sufficiently fast with N (so that the concentration bound (12) is summable), and that ρ_N is set arbitrarily close to the minimum transportation budget ε̂^p_{N,α_N} of Definition 2. Then, we have that any sequence Q^N_Ξ̃ ∈ Û_N(α_N, ρ_N) converges to Q_Ξ̃ in the p-Wasserstein metric with probability one.
Proof. First, we need to provide conditions under which (12) becomes summable over N for any arbitrarily small ε. In this way, we can choose a summable sequence β_N ∈ (0, 1) and, for N large enough, set ρ_N arbitrarily close to ε̂^p_{N,α_N} and notice that Û_N(α_N, ε̂^p_{N,α_N}) boils down to one single probability measure, the one made up of the ⌈Nα_N⌉ data points of Q̂_N that are the closest to Ξ̃. In addition, we have ε̂^p_{N,α_N} → 0 with probability one. To see this, take K := ⌈Nα_N⌉ and note that ε̂^p_{N,α_N} ≤ ‖ẑ_{K:N} − z*‖^p → 0 almost surely provided that α_N → 0 (see Biau and Devroye 2015, Lemmas 2.2 and 2.3), where ẑ_{K:N} is the z-component of the K-th nearest neighbor to z* after reordering the data sample by increasing distance of ẑ_i to z*.
Remark 3. The compactness of the support set Ξ is assumed here just to simplify the proof.
In fact, in Appendix EC.2, we use results from nearest neighbors to show that the convergence of conditional distributions can be attained under the less restrictive condition Nα_N / log(N) → ∞, even in some cases for which the uncertainty y and the feature vector z are unbounded. In addition, we also make use of those results to demonstrate that distributionally robust versions of some local nonparametric predictive methods, such as Nadaraya-Watson kernel regression and K-nearest neighbors, naturally emerge from our approach.
Remark 5. Suppose that the event Ξ̃ on which we condition problem (1) is given by Ξ̃ := {ξ = (z_1, z_2, y) ∈ Ξ : z_1 = z*_1, z_2 ∈ Z_2}, with Q(Ξ̃) = 0 and P(z_2 ∈ Z_2) > 0. Let Q_{Z_2} be the probability measure of (z_1, y) conditional on z_2 ∈ Z_2. If we have that there are C > 0 and r_0 > 0 such that P(‖z*_1 − z_1‖ ≤ r) ≥ C r^{d_{z_1}}, for all 0 < r ≤ r_0, and that Q_{Z_2} satisfies the smoothness condition invoked in either Theorem 4, Example 1 or Example 2, then the analysis in this section extends to that type of event by setting α(r) = C r^{d_{z_1}} · P(z_2 ∈ Z_2).

Numerical Experiments
The following simulation experiments are designed to provide numerical evidence on the performance of the DRO framework with side information that we propose, with respect to other methods available in the technical literature. Here we only consider the case α = 0, while additional numerical experiments for the case α > 0 can be found in Appendix EC.3.
To numerically illustrate the setting Q( Ξ) = α = 0, we consider two well-known problems, namely, the (single-item) newsvendor problem and the portfolio allocation problem, both posed in the form inf x∈X E Q f (x, ξ) | ξ ∈ Ξ to allow for side information. We compare four data-driven approaches to address the solution to these two problems: Our approach, i.e., problem P (α N , ρ N ) with α N = K N /N , which we denote "DROTRIMM"; a Sample Average Approximation method based on a local predictive technique, in particular, the K N nearest neighbors, which we refer to as "KNN" (see Bertsimas and Kallus (2020) for further details); this very same local predictive method followed by a standard Wasserstein-metric-based DRO approach to robustify it, as suggested in (Bertsimas et al. 2019, Section 5), which we call "KNNDRO"; and the robustified KNN method (4), also proposed in Bertsimas et al. (2019), which we term "KNNROBUST." We clarify that KNNDRO uses the K nearest neighbors projected onto the set Ξ as the nominal "empirical" distribution that is used as the center of the Wasserstein ball in Mohajerin Esfahani and Kuhn (2018).
We also note that the newsvendor problem and the portfolio optimization problem are structurally different if seen from the lens of the standard Wasserstein-metric-based DRO approach.
Indeed, the newsvendor problem features an objective function with a Lipschitz constant with respect to the uncertainty that is independent of the decision x. Consequently, as per (Mohajerin Esfahani and Kuhn 2018, Remark 6.7), KNNDRO renders the same minimizer for this problem as that of KNN whenever the support set Ξ is equal to the whole space. This is, in contrast, not true for the portfolio allocation problem, which has an objective function with a Lipschitz constant with regard to the uncertainty that depends on the decision x.
In all the numerical experiments, we take the p-norm with p = 1 and, accordingly, we use the Wasserstein distance of order 1. Thus, all the optimization problems that we solve are linear programs. We consider a series of different values for the size N of the sample data. Unless stated otherwise in the text, for each N, we choose as the number of neighbors, K_N, the value ⌊N/log(N + 1)⌋, where ⌊·⌋ stands for the floor function. Nevertheless, for the portfolio allocation problem, we also test the values N^0.9 and √N to assess the impact of the number of neighbors on the out-of-sample performance of the four methods we compare.
We estimate x * ∈ arg min x∈X E Q Ξ [f (x, ξ)] and J * = E Q Ξ [f (x * , ξ)] using a discrete proxy of the true conditional distribution Q Ξ . In the newsvendor problem, this proxy is made up of 1085 data points, resulting from applying the KNN method (with the logarithmic rule) to 10 000 samples from the true data-generating joint distribution. In the portfolio optimization problem, we have an explicit form of Q Ξ , which we utilize to directly construct a 10 000-data-point approximation.
To compare the four data-driven approaches under consideration, we use two performance metrics, specifically, the out-of-sample performance of the data-driven solution and its out-of-sample disappointment. The former is given by E_{Q_Ξ̃}[f(x̂_N, ξ)], estimated on the discrete proxy of Q_Ξ̃ described above, while the latter is computed as the difference between that out-of-sample performance and the certificate Ĵ_N returned by the corresponding method, so that a negative disappointment means that the certificate is indeed an upper bound on the actual expected cost.
As is customary in practice, we use a data-driven procedure to tune the robustness parameter of each method (a schematic sketch of this procedure is given right after step 4 below). In particular, for a desired value of reliability 1 − β ∈ (0, 1) (in our numerical experiments, we set β to 0.15), and for each method j ∈ {KNNROBUST, KNNDRO, DROTRIMM}, we aim for the value of the robustness parameter for which the estimate of the objective value Ĵ^j_N given by method j provides an upper (1 − β)-confidence bound on the out-of-sample performance of its respective optimal solution (see Equation (10)), while delivering the best out-of-sample performance. As the optimal robustness parameter is unknown and depends on the available data sample, we need to derive an estimator param^{β,j}_N that is also a function of the training data. We construct param^{β,j}_N and the corresponding reliability-driven solution as follows: 1. We generate kboot resamples (with replacement) of size N, each playing the role of a different training set. In our experiments, we set kboot = 50. Moreover, we build a validation dataset by determining, for the N_val data points of the original sample of size N that have not been used to form the training set, their K_{N_val} nearest neighbors.
2. For each resample k = 1, . . . , kboot and each candidate value for param, we compute a solution by method j with parameter param on the k-th resample. The resulting optimal decision is denoted as x j,k N (param) and its corresponding objective value as J j,k N (param). Thereafter, we calculate the out-of-sample performance J( x j,k N (param)) of the data-driven solution x j,k N (param) over the validation set.
3. From among the candidate values for param such that J j,k N (param) exceeds the value J( x j,k N (param)) in at least (1 − β) × kboot different resamples, we take as param β,j N the one yielding the best out-of-sample performance averaged over the kboot validation datasets.
4. Finally, we compute the solution given by method j with parameter param β,j N , x j N := x j N (param β,j N ) and the respective certificate J j N := J j N (param β,j N ).
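The following schematic sketch (ours) mirrors steps 1-4 for a generic method j; solve_method, out_of_sample_cost and the candidate grid are placeholders the reader would supply, and the construction of the conditional validation set via nearest neighbors is elided for brevity.

```python
# Schematic sketch of the bootstrapping-based tuning of the robustness parameter.
import numpy as np

def tune_parameter(data, candidates, solve_method, out_of_sample_cost,
                   beta=0.15, kboot=50, rng=np.random.default_rng(0)):
    N = len(data)
    records = {param: [] for param in candidates}
    for _ in range(kboot):
        idx = rng.integers(0, N, size=N)                   # step 1: bootstrap resample
        train = data[idx]
        valid = np.delete(data, np.unique(idx), axis=0)    # left-out points -> validation
        for param in candidates:                           # step 2: solve and evaluate
            x_hat, J_hat = solve_method(train, param)      # decision and certificate
            J_true = out_of_sample_cost(x_hat, valid)      # estimated actual cost
            records[param].append((J_hat, J_true))
    best, best_cost = None, np.inf
    for param, runs in records.items():                    # step 3: reliable candidates
        reliable = np.mean([J_hat >= J_true for J_hat, J_true in runs])
        avg_cost = np.mean([J_true for _, J_true in runs])
        if reliable >= 1 - beta and avg_cost < best_cost:
            best, best_cost = param, avg_cost
    return best                                            # step 4: retrain with `best`
```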
Recall that, in our approach, the robustness parameter ρ_N must be greater than or equal to the minimum transportation budget to the power of p, that is, ε̂^p_{N,α_N}. Hence, if we decompose ρ_N as ρ_N = ε̂^p_{N,α_N} + Δρ_N, what one really needs to tune in DROTRIMM is the budget excess Δρ_N. Furthermore, for the same amount of budget Δρ_N, our approach will lead to more robust decisions x than KNNDRO, because the worst-case distribution in KNNDRO is also feasible in DROTRIMM. Consequently, in practice, the tuning of one of these methods could guide the tuning of the other.
Lastly, all the simulations have been run on a Linux-based server using up to 116 CPUs running in parallel, each clocking at 2.6 GHz with 4 GB of RAM. We have employed Gurobi 9.0 under Pyomo 5.2 to solve the associated linear programs.

The single-item newsvendor problem
In this subsection, we deal with the popular single-item newsvendor problem, which has received a lot of attention lately (see, for example, Ban and Rudin (2019)). The data are generated from a joint distribution of the feature z and the demand y whose support is the whole space R^{d_z+d_y}, with d_z = d_y = 1.
In addition, we consider as Z̃ the singleton {z* = 0.44}, so that the conditional support of the demand is the real line R. Note that the true optimal order (quantile) curve is highly nonlinear around the context z*. Also, the demand may be negative, which, in the context of the newsvendor problem, can be interpreted as items being returned to the stores due to, for example, some quality defect. The set of candidate values from which the robustness parameters in methods KNNROBUST, KNNDRO and DROTRIMM have been selected is the discrete set composed of the thirty linearly spaced numbers between 0 and 2. We also consider the machine learning algorithm proposed in Ban and Rudin (2019), which was especially designed for the newsvendor problem with features. In this algorithm, a polynomial mapping between the optimal order quantity (i.e., the optimal quantile) and the covariates is presumed. The degree of the polynomial, up to the fourth degree, is tuned using the bootstrapping procedure described above. We denote this approach as ML, from "Machine Learning".
Figure 1 (newsvendor problem with features: true distributions, quantile estimate and performance metrics) reports the results delivered by each of the considered data-driven approaches for various sample sizes and runs. The shaded color areas have been obtained by joining the 15th and 85th percentiles of the box plots, while the associated bold colored lines link their means. The true optimal quantile (with complete information) and its out-of-sample performance are also depicted in Figures (1c) and (1b), respectively, using black dotted lines.
Interestingly, whereas the quantile estimators provided by DROTRIMM, KNNDRO and KNNROBUST all lead to negative out-of-sample disappointment in general, KNNDRO and KNNROBUST exhibit substantially worse out-of-sample performance, both in expectation and in volatility. Recall that KNNDRO delivers the same solutions provided by KNN for this problem.
Its behavior is, therefore, influenced by the bias introduced by the K-nearest-neighbor estimation, which is particularly noticeable for small-size samples in this case, given the shape of the true conditional density, see Figure (1a). Actually, for some runs, the K-nearest neighbors, and hence KNNDRO, lead to negative quantile estimates, while the true one is positive and greater than 0.5.
By construction, both KNNDRO and KNNROBUST are mainly affected by the estimation error of the conditional probability distribution incurred by the local predictive method. On the contrary, our approach DROTRIMM offers a natural protection against this error and a richer spectrum of data-driven solutions. Indeed, DROTRIMM is able to identify solutions that lead to a better out-of-sample performance with a negative out-of-sample disappointment.
Finally, both ML and DROTRIMM exhibit a notably stable behavior against the randomness of the sample. The order quantity provided by the former, however, does not converge to the true optimal one, because the relationship between the true optimal order and the feature z is far from being polynomial. Note that ML is a global method that seeks to learn the optimal order quantity for all possible contexts by using a polynomial of degree up to four. However, the (true) optimal order curve (that is, the white line in Figure 1a) is highly nonlinear within a neighborhood of the context z* = 0.44, but practically constant outside of it.

Portfolio optimization
We consider next an instance of the portfolio optimization problem that is based on that used in Bertsimas and McCord (2018) and Bertsimas and Van Parys (2017). The instance corresponds to a single-stage portfolio optimization problem in which we wish to find an allocation of a fixed budget to six different assets. Thus, x ∈ R^6_+ denotes the decision variable vector, that is, the asset allocations, and their uncertain return is represented by y ∈ R^6. In practice, these uncertain returns may be influenced by a set of features. First, the decision maker observes auxiliary covariates and, later, selects the portfolio. We consider three different covariates that can potentially impact the returns and that we denote as z = (z_1, z_2, z_3). The decision maker wishes to leverage this side information to improve his/her decision-making process, in which the goal is to maximize the expected value of the return while minimizing the conditional value at risk (CVaR) of the portfolio, that is, the risk that the loss (−⟨x, y⟩)^+ := max(−⟨x, y⟩, 0) is large. Using the reformulation of the CVaR (see Rockafellar and Uryasev (2000) and Bertsimas and Van Parys (2017)) and introducing the auxiliary variable β, the decision maker solves the resulting mean-CVaR conditional stochastic program given the value of the covariate z* (= (1000, 0.01, 5) in the numerical experiments), where the feasible set of decision variables of the problem, that is, X, is equal to {(x, β) ∈ R^6_+ × R : Σ_{j=1}^6 x_j = 1}. We set δ = 0.5 and λ = 0.1 to simulate an investor with a moderate level of risk aversion. The parameter λ ∈ R_+ serves to trade off between risk and return, and δ refers to the (1 − δ)-quantile of the loss distribution. We take the same marginal distributions for the covariates as in Section 5.2 of Bertsimas and Van Parys (2017), i.e., z_1 ∼ N(1000, 50), z_2 ∼ N(0.02, 0.01) and log(z_3) ∼ N(0, 1). Furthermore, we follow their approach to construct the joint true distribution of the covariates and the asset returns. In particular, we take y | (z = (z_1, z_2, z_3)) ∼ N_6(µ + 0.1·(z_1 − 1000)·v_1 + 1000·z_2·v_2 + 10·log(z_3 + 1)·v_3, Σ), with v_1 = (1, 1, 1, 1, 1, 1)^T, v_2 = (4, 1, 1, 1, 1, 1)^T, v_3 = (1, 1, 1, 1, 1, 1)^T, and with µ, Σ^{1/2} given in Bertsimas and Van Parys (2017) and Esteban-Pérez and Morales (2021).
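To make the structure of the decision problem concrete, the sketch below (ours) solves the nominal (non-robust) weighted-sample version of a mean-CVaR portfolio objective using the Rockafellar-Uryasev linearization. The particular objective E[β + (1/δ)(−⟨x, y⟩ − β)_+ − λ⟨x, y⟩] and the helper names are assumptions for illustration, not necessarily the paper's exact formulation; the weights w_i would come, e.g., from the K nearest neighbors of z*.

```python
# Illustrative conditional SAA for a mean-CVaR portfolio via the
# Rockafellar-Uryasev linearization (the placement of lambda in the objective
# is an assumption, not necessarily the paper's exact formulation).
import numpy as np
from scipy.optimize import linprog

def cvar_portfolio(Y, w, delta=0.5, lam=0.1):
    S, n = Y.shape                      # S weighted return scenarios, n assets
    # Variable order: x (n), beta (1), u (S), with u_i >= (-<x, y_i> - beta)_+.
    c = np.concatenate([-lam * (w @ Y), [1.0], w / delta])
    A_ub = np.hstack([-Y, -np.ones((S, 1)), -np.eye(S)])   # -Y x - beta - u <= 0
    b_ub = np.zeros(S)
    A_eq = np.concatenate([np.ones(n), [0.0], np.zeros(S)])[None, :]  # sum(x) = 1
    bounds = [(0, None)] * n + [(None, None)] + [(0, None)] * S
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[:n], res.fun

rng = np.random.default_rng(0)
Y = rng.normal(0.05, 0.1, size=(100, 6))        # toy return scenarios
w = np.full(100, 1.0 / 100)                     # e.g., KNN weights for z*
x_opt, val = cvar_portfolio(Y, w)
print(np.round(x_opt, 3), val)
```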
Note that, unlike in Bertsimas and Van Parys (2017), not all the features affect equally all the asset returns. Moreover, feature z 3 is log-normal and therefore, Assumption 1 does not hold.
Nonetheless, as we show below, DROTRIMM performs satisfactorily, which reveals that the conditions we derive in this paper to guarantee that our approach performs well are sufficient, but not necessary. Indeed, the condition Q_Ξ̃ ∈ Û_N(α_N, ρ_N) is not required to ensure performance guarantees (Gao 2020, Kuhn et al. 2019). For all the methods, we have standardized the covariates z and the asset returns y using their means and variances. In all the simulations, the robustness parameter each method uses (i.e., ε_N in KNNROBUST, the radius of the Wasserstein ball, ρ_N, in KNNDRO, and the budget excess Δρ_N in DROTRIMM) has been chosen from the discrete set {b · 10^c : b ∈ {0, . . . , 9}, c ∈ {−2, −1, 0}}, following the above data-driven procedure.
Similarly to the case of the single-item newsvendor problem, Figure 2 shows, for various sample sizes and 200 runs, the box plots pertaining to the out-of-sample disappointment and performance associated with each of the considered data-driven approaches. Each of the three pairs of subplots at the top of the figure has been obtained with a different rule to determine the number K_N of nearest neighbors. Increasing this number seems to have a positive effect on the convergence speed of all the methods for this instance, although KNNROBUST (and KNNDRO to a lesser extent) has some trouble ensuring the desired reliability level, with the 85% line above 0 for the largest values of N we represent. In contrast, DROTRIMM manages to keep the disappointment negative. This is, in addition, accompanied by an important improvement of the out-of-sample performance (in line with the criterion for selecting the best portfolio that we have established).
In fact, DROTRIMM produces box plots that appear shifted downward, i.e., in the direction of better objective function values. On the other hand, the KNN method substantially improves its performance when a larger number of neighbors is employed. However, it remains far too optimistic in all cases.
The results shown in the pair of subplots at the bottom of Figure 2 correspond to a number K_N of neighbors that has been tuned jointly with the robustness parameter and for each method independently. For this purpose, we have selected the best value of K_N for each approach from the discrete set {N^{0.1}, N^{0.2}, . . . , N^{0.9}}, following the bootstrapping-based procedure previously described. The data-driven tuning of the number K_N of neighbors appears not to have a major effect on the performance of the different methods, especially in comparative terms.

To facilitate the analysis of the results shown in Figure 2, we also provide Figure 3, which illustrates the (random) performance of the methods KNNROBUST, DROTRIMM and KNNDRO as a function of their respective robustness parameters, estimated over 200 independent runs. Again, the shaded areas cover the 15th to 85th percentiles, while the bold colored lines correspond to the average performance. The various plots are obtained for N = 30 and N = 400, with the number of neighbors given by the logarithmic rule. These plots are especially informative because they are independent of the specific validation procedure used to tune the robustness parameters of the methods and thus provide insight into the potential of each method to identify good solutions.
Note that the out-of-sample performance of all three methods stabilizes around the same value as their respective robustness parameters grow large enough. This phenomenon is analogous to that discussed in (Mohajerin Esfahani and Kuhn 2018, Section 7.1). However, the value we observe here does not correspond to the "equally weighted portfolio," because we have standardized the data on the asset returns. As a result, the "robust portfolio" that delivers this out-of-sample performance is solely driven by the standard deviations of the different assets.
Very interestingly, DROTRIMM is able to uncover portfolios whose out-of-sample performance features a better mean-variance trade-off, in general. Furthermore, it requires a smaller value of the robustness parameter to guarantee reliability. All this is more evident (and useful) for the case N = 400, as we explain next. When N = 30, all the considered methods need large values of their robustness parameter to ensure reliability, so they all tend to operate close to the "robust portfolio" we mentioned above. DROTRIMM can certainly afford lower values of ∆ ρ in an attempt to improve performance, but this proves not to be that profitable for such a small sample size, for which the robust portfolio performs very well. As N increases, the robust portfolio loses its appeal, since its performance gradually becomes comparatively worse. DROTRIMM is then able to identify portfolios that perform significantly better in expectation, while providing an estimate of their return such that the desired reliability is guaranteed. For their part, KNNDRO and KNNROBUST are also able to discover solutions with an actual average cost lower than that of the robust portfolio (albeit with a worse expectation and a higher variance than those given by DROTRIMM). However, they are more prone to overestimate their returns.
Finally, we study the behavior of the different methods in other contexts. For this, we consider several values of N, one random data sample for each N, and 200 different contexts z* sampled from the marginal distributions of the features. The performance metrics (i.e., the out-of-sample disappointment and performance) are plotted in Figures 4a and 4b, respectively, under an optimal selection of the robustness parameters (that is, for each method we use the value of the robustness parameter that, while ensuring a negative disappointment, delivers the best out-of-sample performance). We observe that DROTRIMM systematically performs better, with an actual cost, averaged over the 200 contexts, that is lower irrespective of the sample size.

Conclusions
In this paper, we have exploited the connection between probability trimmings and partial mass transportation to provide an easy, but powerful and novel way to extend the standard Wasserstein-metric-based DRO to the case of conditional stochastic programs. Our approach produces decisions that are distributionally robust against the uncertainty in the whole process of inferring the conditional probability measure of the random parameters from a finite sample coming from the true joint data-generating distribution. Through a series of numerical experiments built on the single-item newsvendor problem and a portfolio allocation problem, we have demonstrated that our method attains notably better out-of-sample performance than some existing alternatives. We have supported these empirical findings with theoretical analysis, showing that our approach enjoys attractive performance guarantees.

Appendix A: Proofs of theoretical results
This appendix compiles the proofs of some of the theoretical derivations that appear in the paper. The following technical results are needed to develop these proofs.
Definition 3 (Contamination of a distribution). Given two probabilities P, Q on R^d and α ∈ [0, 1], we say that P is a (1 − α)-contaminated version of Q if P = (1 − α)Q + αR for some probability R. The contamination neighbourhood of Q is the set of all (1 − α)-contaminated versions of Q.

Proposition 4 (Section 2.2 of Álvarez-Esteban et al. (2012) and p. 18 in Agulló Antolín (2018)). Let P, Q be probabilities on R^d and α ∈ (0, 1]. Then

Q ∈ R_{1−α}(P) if and only if P = (1 − α)Q + αR   (29)

for some probability R. Moreover, if D is a probability metric such that R_{1−α}(P) is closed for D over an appropriate set of probability distributions, then (29) is equivalent to D(Q, R_{1−α}(P)) = 0.
Remark 6. As a particular case, if we consider D = W_p over the set P_p of probability distributions with a finite p-th moment, we have that, if P, Q ∈ P_p, then Q ∈ R_{1−α}(P) if and only if W_p(Q, R_{1−α}(P)) = 0.
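To make the characterization above concrete, the following minimal sketch checks, for two toy distributions on a common finite support (a hypothetical setting of ours), whether Q belongs to R_{1−α}(P) by testing the non-negativity of P − (1 − α)Q:

    import numpy as np

    def is_trimming(p, q, alpha, tol=1e-12):
        """Check whether Q is a (1-alpha)-trimming of P, with P and Q given as probability
        vectors on a common finite support (a toy setting): by the characterization above,
        this holds exactly when P - (1-alpha)Q is a non-negative measure."""
        residual = np.asarray(p) - (1.0 - alpha) * np.asarray(q)
        return bool(np.all(residual >= -tol))

    p = np.array([0.20, 0.50, 0.30])
    q = np.array([0.25, 0.50, 0.25])
    print(is_trimming(p, q, alpha=0.2))   # True:  0.8*q <= p componentwise
    print(is_trimming(p, q, alpha=0.0))   # False: q is not equal to p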
Since, by hypothesis, the two conditional distributions Q_Ξ̄ and Q′_Ξ̄ are different, there must exist an event A ⊂ Ξ̄ such that Q_Ξ̄(A) ≠ Q′_Ξ̄(A).
We take that event and compute Q(A), which renders a contradiction, given that R(A) = R′(A) = 0. Q.E.D.

A.2. Proof of Proposition 1
We begin by proving the first claim of Proposition 1.
We show that every feasible solution of (SP1) can be mapped into a feasible solution of (SP2) with the same objective function value. To this end, take Q as a feasible solution of (SP1) and let Q Ξ be the Q-conditional probability measure given ξ ∈ Ξ. Take Q N and Q Ξ as the two probabilities in Corollary 1 with α ∈ (0, 1). There exists ). Furthermore, it automatically follows from Proposition 5 that Next we prove the second claim of the proposition. For this purpose, first we show that, if Q N ( Ξ) = 0, then every feasible solution of (SP2) can also be mapped into a feasible solution of (SP1) with the same objective function value. To this end, take Q Ξ feasible in (SP2) and consider Q 1−α ∈ R 1−α ( Q N ) such that . By Proposition 5, we have Hence, Q 1−α ( Ξ) = α, because Q N ( Ξ) gives zero measure to Ξ and so does any of its (1 − α)-trimmings.
Besides, we have that Therefore, Q 1−α is feasible in (SP1) and Q Ξ is the Q 1−α -conditional probability measure given ξ ∈ Ξ. , ξ)] and the mapping is direct,

A.3. Proof of Theorem 1
Thanks to Lemma 2, the subproblem (SP2) can be written equivalently as follows, which, in turn, can be reformulated by noting that Π is a joint distribution of (z, y) and (ẑ, ŷ) whose marginal distribution for (ẑ, ŷ) is the discrete distribution supported on the points (ẑ_i, ŷ_i), with probability masses b_i, i = 1, . . . , N; this fact yields reformulation (31). Thus, Π is completely determined by the conditional distributions Q^i_Ξ̄ of (z, y) given (ẑ, ŷ) = (ẑ_i, ŷ_i), i = 1, . . . , N. Now we split the supremum into two. If we set λ as the dual variable of constraint (32c), then, using standard duality arguments, we can equivalently rewrite the inner supremum, where we have swapped the supremum and the infimum in (35) by appealing to Sion's min-max theorem (Sion 1958), given that the objective function in (35) satisfies the conditions of that theorem.

Remark 7 (Limiting case α = 0).

Endnotes
1. The preprints Kannan et al. (2020b, 2021) and Nguyen et al. (2020, 2021) became available online while this paper was under review in this journal.
2. Proposition 1 in this paper predates the publication of the preprint Nguyen et al. (2021).

Notation. Given any norm ‖·‖ in the Euclidean space (of a certain dimension d), the dual norm is defined as ‖u‖_* := sup_{‖v‖ ≤ 1} ⟨u, v⟩. Let g be a function from R^d to R; we say that g is a proper function if g(x) < +∞ for at least one x and g(x) > −∞ for all x ∈ R^d. In addition, the convex conjugate function of g, g^*, is given by g^*(y) := sup_{x ∈ R^d} ⟨y, x⟩ − g(x). It is well known that if g is a proper function, then g^* is a proper function as well. The support function of a set A, S_A, is defined as S_A(b) := sup_{a ∈ A} ⟨b, a⟩. The recession cone of a non-empty set A ⊆ R^d is given by {w ∈ R^d : a + λw ∈ A for all a ∈ A and all λ ≥ 0}.

Appendix EC.1: Complementary theoretical results
This section contains some theoretical results which are complementary to the theory developed in the manuscript. First, we introduce a few preliminary concepts and definitions. Second, we state the topological properties of the ambiguity set U N (α, ρ) in problem (P). Finally, we introduce a tractable reformulation of our DRO approach, which is similar to that in Kuhn et al. (2019, Theorem 8).

EC.1.1. Auxiliary measure theoretic concepts and Wasserstein metric
This subsection compiles some definitions and results from the measure theory that underpins our research. It starts with concepts related to the weak convergence of measures and compactness.
Subsequently, some known facts in connection with the topology generated by the Wasserstein metric W_p are presented. We denote the set of all Borel probability measures supported on X as P(X). Although some of the following concepts and results remain true in the more general setting of Polish spaces, we restrict ourselves here to X ⊆ R^d. Similarly, we denote the p-Wasserstein space as P_p(X), that is, the set of all Borel probability measures supported on X with a finite p-th moment. It is well known that W_p defines a metric on P_p(X) (Villani 2003, Theorem 7.3).
Definition EC.1 (Weak convergence of probability measures). Given a sequence of probability measures {Q_N}_N ⊆ P(X), we say that it converges weakly to Q if ∫_X h dQ_N → ∫_X h dQ as N → ∞ for every bounded and continuous function h on X.
Definition EC.2 (Tightness). A given set K ⊆ P(X ) is tight if for all ε > 0, there is a compact set X ε ⊂ X such that inf Q∈K Q(X ε ) > 1 − ε. If K reduces to a singleton, then we refer to the "tightness of a probability measure".
Definition EC.3 (Closed sets). A given set K ⊆ P(X) is closed (under the topology of weak convergence) if, for every sequence {Q_N}_N ⊂ K such that Q_N converges weakly to Q, we have Q ∈ K.
The following theorem, which is known as Prokhorov's Theorem, connects the notions of weak compactness and tightness.
Theorem EC.1 (Prokhorov's Theorem). A set K ⊆ P(X) is tight if and only if the closure of K is weakly compact in P(X).

Finally, we introduce a proposition that connects some of the aforementioned concepts with the Wasserstein metric. More concretely, this proposition establishes the topological properties of the Wasserstein space.
Proposition EC.1. Given p ≥ 1 and a closed set X ⊆ R^d, we have that P_p(X) endowed with W_p is a Polish space, and that a closed set K ⊆ P_p(X) is weakly compact if and only if it has p-uniformly integrable moments (and hence is tight). Specifically, given a sequence of probability measures {Q_N}_N ⊆ P_p(X) and Q ∈ P_p(X), the following statements are equivalent:
1. W_p(Q_N, Q) → 0 as N → ∞.
2. Q_N converges weakly to Q and {Q_N}_N has p-uniformly integrable moments.
3. Q_N converges weakly to Q and ∫_X ‖ξ‖^p dQ_N → ∫_X ‖ξ‖^p dQ.
4. For any L > 0 and any continuous function ℓ : X → R such that |ℓ(ξ)| ≤ L(1 + ‖ξ‖^p) for all ξ ∈ X, it holds that ∫_X ℓ dQ_N → ∫_X ℓ dQ.

Remark EC.1. Proposition EC.1 compiles results from Prop. 7.1.5 in Ambrosio et al. (2005) and Th. 7.12 in Villani (2003). It implies that the topology generated by W_p and the weak topology coincide on any subset K that has p-uniformly integrable moments. We note that assertion 2 in Proposition EC.1 reduces to weak convergence if X is a compact set (see, for example, Panaretos and Zemel (2020, Corollary 2.2.2)).

EC.1.2. Topological properties of the ambiguity set
The following proposition formally establishes the topological properties of our ambiguity set.

Proposition EC.2. Given Q ∈ P_p(R^d), α > 0, and ρ no smaller than the minimum transportation budget, the ambiguity set of problem (P), U_N(α, ρ), is non-empty, tight, weakly compact, and has p-uniformly integrable moments.

Proof. The set U_N(α, ρ) is non-empty because ρ is no smaller than the minimum transportation budget. We can equivalently rewrite U_N(α, ρ) in terms of the trimming set R_{1−α}(Q_N). If α > 0, then R_{1−α}(Q_N) is tight and weakly compact; see Cascos and López-Díaz (2008, Lemmas 2 and 3). Furthermore, U_N(α, ρ) is a subset of a set K which is tight and weakly compact by Pichler and Xu (2018, Proposition 3). The tightness of U_N(α, ρ) is thus trivially guaranteed, since any subset of a tight set is also tight. Hence, by Prokhorov's theorem, to demonstrate that U_N(α, ρ) is also weakly compact, it suffices to show that it is closed.
For this purpose, let {Q^N_Ξ̄}_N be a sequence of probability measures in U_N(α, ρ) that converges weakly to Q. We need to show that Q is in U_N(α, ρ) too. In turn, since U_N(α, ρ) is a subset of K, which is closed, this boils down to proving that the weak limit satisfies Q ∈ P(Ξ̄), that is, Q(Ξ̄) = 1. Given that the sequence {Q^N_Ξ̄}_N converges weakly to Q and the support set Ξ̄ is closed, Portmanteau's theorem (see Billingsley (1999, Theorem 2.1)) tells us that lim sup_{N→∞} Q^N_Ξ̄(Ξ̄) = 1 ≤ Q(Ξ̄). This implies that Q(Ξ̄) = 1.
Finally, the p-uniform integrability of the moments of our ambiguity set follows from Proposition EC.1. To apply this proposition, we only need to check that every distribution in U_N(α, ρ) has a finite p-th moment. Suppose, on the contrary, that there is a distribution Q_Ξ̄ in U_N(α, ρ) that does not have a finite p-th moment. From Álvarez-Esteban et al. (2011) (see p. 363 for the case p = 2, although the proof works similarly for any p ≥ 1), we would then have W_p(Q_Ξ̄, R) = ∞ for every R ∈ R_{1−α}(Q_N), which contradicts the fact that W_p(Q_Ξ̄, R) must be less than or equal to the finite value ρ^{1/p} for some R. Q.E.D.

EC.1.3. Tractable reformulation and maximizer of problem (SP2)
Next we provide a more manageable reformulation of problem (SP2), which can be used directly to address the decision-making problems considered in our numerical experiments. However, we omit its proof, as it runs in parallel with that of Mohajerin Esfahani and Kuhn (2018, Theorem 4.2) and Kuhn et al. (2019, Theorem 8). See also Zhen et al. (2021). Said reformulation relies on the following assumption.
Assumption EC.1. The region Ξ is a closed convex set, and f(x, ξ) := max_{k ≤ K} g_k(x, ξ), with g_k, for each k ≤ K, being a proper, concave and upper semicontinuous function with respect to ξ (for any fixed value of x ∈ X) and not identically ∞ on Ξ.
In problem (SP2 ), we have suppressed the dependence of functions g k on x for ease of notation.
The following theorem serves to construct a maximizer (i.e., a worst-case distribution) of problem (SP2). Again, we omit its proof, as it is analogous to the proofs of Mohajerin Esfahani and Kuhn (2018, Theorem 4.4) and Kuhn et al. (2019, Theorem 9).

Theorem EC.3 (Worst-case distributions). Under the assumptions of Theorem EC.2, the worst-case expectation in (SP2) is equal to the optimal objective value of the following finite convex optimization problem. Here, the constraint ξ_i − q_ik/0 ∈ Ξ means that q_ik lies in the recession cone of Ξ, and 0‖q_ik/0‖^p is understood as lim_{γ_ik↓0} γ_ik ‖q_ik/γ_ik‖^p.
Moreover, if we assume that p > 1 or that Ξ is bounded (with p ≥ 1), then, if (γ*_ik, q*_ik) maximizes the problem above, we have that the discrete probability distribution Q_Ξ̄ so defined represents a maximizer of the worst-case expectation problem.

Appendix EC.2: Asymptotic consistency under a nearest neighbors lens
In this section, we show that the asymptotic consistency of our DRO framework for the case Q ≪ λ^d with Q(Ξ̄) = α = 0 can also be proved using a nearest-neighbors approach.
If the density of Q is sufficiently smooth, it is known that Q_Ξ̄ can be inferred from information on Q within a neighborhood of z = z*. This essentially means that the portion of mass from the empirical distribution Q_N that is closest to Ξ̄ is statistically representative of the conditional distribution Q_Ξ̄. Inspired by popular data-driven local predictive methods such as K-nearest neighbors and kernel regression, we can solve problem (P) for a series of pairs (α_N, ρ_N), both of which tend to zero appropriately as N increases. Indeed, we will demonstrate that, in doing so, problem P(α_N, ρ_N) naturally produces distributionally robustified versions of those popular methods when applied to solve problem (1). Next, we formalize these ideas.
Remark EC.2. Throughout this section, we will assume that dist(ξ̂_i, Ξ̄) = ‖ẑ_i − z*‖. This assumption is standard in the technical literature. The geometry of the joint support set Ξ is expected to have a negligible impact on the asymptotic performance of problem P(α_N, ρ_N) (i.e., for large samples), because, under a smoothness condition on Q and K/N → 0, both dist(ξ̂_{K:N}, Ξ̄) and ‖ẑ_{K:N} − z*‖ tend to 0 almost surely (see Biau and Devroye (2015, Lemmas 2.2 and 2.3)), where ẑ_{K:N} is the z-component of the K-th nearest neighbor to z* after reordering the data sample.

Here, we show that the solutions of the distributionally robust optimization problem P(α_N, ρ_N) converge to the solution of the targeted conditional stochastic program (1) as N increases, for a careful choice of the parameters α_N and ρ_N. This result is underpinned by the fact that, under that selection of α_N and ρ_N, any distribution in U_N(α_N, ρ_N) converges to the true conditional distribution Q_Ξ̄.
Assumption EC.2 (Lipschitz-regularity). We assume that there exists an integrable function ℓ : R^{d_y} → R_+ such that, for all y ∈ R^{d_y},

|φ_{y/z=z′}(y) − φ_{y/z=z*}(y)| ≤ ℓ(y) ‖z′ − z*‖,  ∀z′ such that ‖z′ − z*‖ ≤ r_0,   (EC.3)

where φ_{y/z=z′}(·) stands for the density function of y conditional on z = z′.

Lemma EC.1 (Convergence of transported trimmed distributions). Suppose that
Assumptions 3 and EC.2 hold. Take (α_N, ρ_N) such that α_N → 0, Nα_N/log(N) → ∞, and ρ_N → 0 a.s., with ρ_N no smaller than the minimum transportation budget of Definition 2. Then, we have that the transported trimmed distribution Q^N_Ξ̄ converges to Q_Ξ̄ in the p-Wasserstein metric, almost surely.

Proof. Since y is bounded, we only need to prove that Q^N_Ξ̄ converges weakly to Q_Ξ̄. For this purpose, take a continuous and bounded function h and let m(z′) := E[h(y) | z = z′]. We deal with each of the terms in the resulting inequality one by one. First, we use Devroye (1982, Lemma 6). Now let ℓ be an integrable function satisfying condition (EC.3). Hence, the bound in (EC.3) applies for any z′ such that ‖z′ − z*‖ ≤ r_0. Let J be the number of samples whose distance from the set Ξ̄ is smaller than or equal to r_0. Next, we upper bound the second term on the right-hand side of the last inequality.
It suffices to take a feasible solution. In particular, we consider µ i:N = 0, ∀i, θ = 0, and λ = 1/r p 0 , which renders Consequently, we essentially need that lim N →∞ ρ N = 0 with probability one. To show this, as ρ Importantly, the budget ∆ ρ N is under the decision-maker's control, who simply needs to guarantee that ∆ ρ N → 0 so that the last two terms on the right-hand side of the previous inequality vanishes. Group these two terms into a N (∆ ρ N ), set K := N α N and note that N α N z K:N − z * .
Thus, for any arbitrary ε > 0, the corresponding bound holds. In turn, we have the following. Furthermore, due to the first point in Assumption 3, it holds, for any 0 < η ≤ r_0 and provided that K ≤ N C_2 η^{d_z}, that the required bound is satisfied (see Loubes and Pelletier (2017, formula (34)), which is an application of the lower tail of Chernoff's bound).
Therefore, in that case, which we guarantee, for N large enough, by enforcing α N → 0. This way, for any arbitrarily small ε > 0, we finally have The last two terms on the right-hand side of (EC.6) are summable over N , while the first one is Consequently, the Borel-Cantelli Lemma allows us to conclude that given that a N → 0 when ∆ ρ N → 0. Thus, Q N Ξ converges weakly to Q Ξ almost surely. Q.E.D.
The following corollary extends the convergence to any distribution in the proposed ambiguity set (apart from the transported trimmings of the empirical distribution).
where Q N Ξ is any distribution from the ambiguity set U N (α N , ρ N ).
Proof. This corollary is an immediate result of the previous lemma. With some abuse of notation, by the triangle inequality, we have (EC.7). We again use the triangle inequality to upper bound the second term on the right-hand side of (EC.7).
is the distribution with support on Ξ̄ that is the closest (in the p-Wasserstein metric) and is precisely one of the transported trimmed distributions to which Lemma EC.1 refers. Hence, the claim follows. Q.E.D.

The following result establishes the asymptotic consistency of our data-driven approach:
(i) If, for any fixed ξ ∈ Ξ, f(·, ξ) is continuous on X, and, for any fixed value x ∈ X, f(x, ξ) is continuous in ξ and there is L ≥ 0 such that |f(x, ξ)| ≤ L(1 + ‖ξ‖^p) for all x ∈ X and ξ ∈ Ξ, then we have that J_N → J* almost surely when N grows to infinity.
(ii) Let X N , X * be the set of optimal solutions of problems P (α N , ρ N ) and (24), respectively. If the assumptions in (i) are satisfied, the feasible set X is closed and X N , X * are non-empty, then we have that any accumulation point of the sequence { x N } N is almost surely an optimal solution of problem (24).
Let F be the class of random functions defined as follows, and let D be the pseudometric between two probability measures P and Q given below. For two sets of probability measures U_1 and U_2, define the excess of U_1 over U_2 accordingly. The function f satisfies the following uniform-integrability-type condition for all x. Then, from Sun and Xu (2016, Proposition 1), we deduce that the set V_N is compact (and hence bounded).
Let a_N := inf_{v∈V_N} v, b_N := sup_{v∈V_N} v and c := inf_{v∈V} v = sup_{v∈V} v. Now, denote the Hausdorff distance between the respective convex hulls of the sets V and V_N as H(conv V, conv V_N). On the other hand, by Hess (1999, Proposition 2.1(c)) and the definition of the Hausdorff distance, the following holds under Corollary EC.1 and Proposition EC.1. Hence, since the inequality above is independent of the value of x, we have the corresponding uniform bound. Now, we show that the functions v_N(x) and v(x) are continuous in x ∈ X. Fix an arbitrary x ∈ X and consider a sequence (x_N)_N such that x_N → x as N grows to infinity. For any ε > 0, there exists N_0 > 0 sufficiently large such that the following holds for N ≥ N_0. As ε > 0 is arbitrary, this implies that the function v_N(x) is continuous in x ∈ X. Similarly, since f is continuous in x, we have that the function v(x) is continuous in x ∈ X. Finally, as v_N(x) and v(x) are continuous in x ∈ X and lim_{N→∞} sup_{x∈X} |v_N(x) − v(x)| = 0 a.s., we deduce from Xu and Meng (2007, Lemma 3.4) that J_N → J* a.s., and the proof of (i) is complete.
The proof of (ii) is given by the application of Liu and Xu (2013, Lemma 3.8).

In the following two corollaries, we show that our framework naturally produces distributionally robust variants of popular non-parametric regression techniques such as the K-nearest neighbors and the Nadaraya-Watson kernel regression. This could serve to guide the selection of α_N and ρ_N.

Corollary EC.2 (Distributionally robust K-nearest neighbors). Let K_N be the number of nearest neighbors, chosen such that K_N → ∞, K_N/N → 0 and K_N/log(N) → ∞ when the sample size N grows to infinity. This defines a standard KNN regression method.
Take problem P(α_N, ρ_N), set α_N := K_N/N and compute the associated minimum transportation budget as in Definition 2. Then, problem P(α_N, ρ_N), for any sequence ρ_N, N ∈ N, equal to that minimum budget plus ∆ρ_N with ∆ρ_N ↓ 0, is a distributionally robust variant of that KNN method.
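For illustration, a minimal sketch (all names and the constant in the logarithmic-type rule are ours) of the parameter selection suggested by Corollary EC.2:

    import numpy as np

    def knn_dro_parameters(Z, z_star, c=3.0):
        """Select K_N with a logarithmic-type rule K_N ~ c*log(N) (c is an illustrative
        constant), set alpha_N = K_N/N, and return the indices of the K_N nearest
        neighbours of z_star, which carry the conditional (trimmed) probability mass."""
        N = len(Z)
        K = max(1, int(np.ceil(c * np.log(N))))
        dist = np.linalg.norm(np.asarray(Z) - np.asarray(z_star), axis=1)
        nn_idx = np.argsort(dist)[:K]
        return K, K / N, nn_idx

    rng = np.random.default_rng(1)
    Z = rng.normal(size=(400, 3))
    K_N, alpha_N, idx = knn_dro_parameters(Z, z_star=np.zeros(3))
    print(K_N, alpha_N)   # e.g. 18 neighbours -> alpha_N = 0.045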
Proof. The proof of this claim directly follows from the fact that all the conditions in Lemma EC.1 are satisfied.

Corollary EC.3 (Distributionally robust Nadaraya-Watson kernel regression). Consider a Nadaraya-Watson (NW) kernel regression method with bandwidth h_N such that h_N → 0 and N h_N^{d_z}/log(N) → ∞ when N grows to infinity. Also, assume that the non-negative kernel K of the NW method is such that there exist positive numbers c_1, c_2 and r satisfying c_1 1{‖u‖ ≤ r} ≤ K(u) ≤ c_2 1{‖u‖ ≤ r}. Let w_i, i = 1, . . . , N, be the weights given by the NW method to the data points in a certain sample of size N and let w_max := max_i w_i. Compute the transportation budget ρ_N^{NW} associated with these weights. The choices α_N := 1/(N w_max) and ρ_N := ρ_N^{NW} + ∆ρ_N with ∆ρ_N ↓ 0 produce an asymptotically consistent and distributionally robust Nadaraya-Watson kernel regression method.
Proof. To prove this corollary, we will use the following lemma, which appears in Devroye (1981).

Lemma EC.2 (Lemma 4.1 from Devroye (1981)). If n is a binomial random variable with parameters N and p̄, then the stated tail bound holds.

Define A_i as the event (‖ẑ_i − z*‖ ≤ r h_N). Then, n = Σ_{i=1}^{N} I_{A_i} is a binomial random variable with parameters N and p̄ = P(‖ẑ_i − z*‖ ≤ r h_N) that represents the number of samples that are given a nonzero weight by the NW method. By Assumption 3, it follows that p̄ ≥ C r^{d_z} h_N^{d_z} when r h_N < r_0. Furthermore, by the way the weights are constructed in this method and the choice of α_N, we have that ρ_N^{NW} is no smaller than the minimum transportation budget, provided that n ≥ 1. In that case, it also holds that 1/n ≤ w_max ≤ c_2/(c_1 n) and thus (c_1 n)/c_2 ≤ N α_N ≤ n. Note that the event (n = 0) can happen only in a finite number of instances as N increases. Indeed, for N sufficiently large, P(n = 0) = (1 − p̄)^N is summable; alternatively, the bandwidth of the NW method could be occasionally augmented in those specific instances so that n ≥ 1, without affecting the convergence of the method.
Thus, we have the stated bound, because h_N tends to 0 as N grows to infinity. Now, we need to revisit Equation (EC.4), since Nα_N is random here (contingent on the training sample). In particular, we have the corresponding bound for any arbitrary ε > 0. Hence, the desired inequality follows. The summability with respect to N of the expectation on the right-hand side of the inequality above is ensured by Lemma EC.2, given that, for N large enough, Np̄/log N ≥ C r^{d_z} N h_N^{d_z}/log(N) → ∞.
The Borel-Cantelli lemma does the rest to conclude the proof.
While not explicitly required in this proof, it is easy to check that α N → 0 almost surely as well.
Note that c_1 n/(c_2 N) ≤ α_N ≤ n/N, with E[n/N] = p̄ → 0, since h_N → 0. Using Devroye (1982, Lemma 6), we get, for any ε > 0, a bound which is summable with respect to N. Thus, lim_{N→∞} n/N = 0 with probability one (as expected) and, consequently, α_N → 0 a.s. Q.E.D.
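Analogously, the following sketch computes NW weights and the choice α_N = 1/(N·w_max) of Corollary EC.3, using a uniform (boxcar) kernel as one admissible choice under the assumed kernel bounds; the bandwidth value and all names are ours.

    import numpy as np

    def nw_alpha(Z, z_star, h, r=1.0):
        """Nadaraya-Watson weights with a uniform (boxcar) kernel K(u) = 1{||u|| <= r},
        one admissible choice under the two-sided kernel bounds; returns the weights w_i
        and the choice alpha_N = 1/(N * max_i w_i) from the corollary above."""
        dist = np.linalg.norm(np.asarray(Z) - np.asarray(z_star), axis=1)
        raw = (dist / h <= r).astype(float)
        if raw.sum() == 0:                  # enlarge the bandwidth if no point is captured
            return nw_alpha(Z, z_star, 2 * h, r)
        w = raw / raw.sum()
        return w, 1.0 / (len(Z) * w.max())

    rng = np.random.default_rng(2)
    Z = rng.normal(size=(400, 3))
    w, alpha_N = nw_alpha(Z, z_star=np.zeros(3), h=0.8)
    print(round(alpha_N, 4))                # with the boxcar kernel, alpha_N = n/N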
Similarly to before, the extra budget ∆ρ_N can be used by the decision-maker to robustify the NW solution. Nevertheless, in this case, as ρ_N^{NW} may exceed the minimum transportation budget, the ambiguity set is not necessarily a singleton, meaning that our DRO approach already confers some degree of robustness on the decision vector.

We conclude this section with a corollary that extends Lemma EC.1 to the case of unbounded uncertainty y under certain conditions. This extension guarantees that the solution to problem P(α_N, ρ_N) is asymptotically consistent also for this case.
Corollary EC.4 (Extension of Lemma EC.1 to unbounded y). Suppose that Assumptions 3.1 and EC.2 hold. Consider the true data-generating distribution Q of the random vector ξ := (z, y) with support Ξ := Ξ_z × R^{d_y} and define m(z*) := E[‖y‖^a | z = z*], for some a ≥ p.
Assume that there exists a constant m̄ > 0 such that m(z) < m̄ for almost all z ∈ Ξ_z, and that there are non-negative numbers (σ, ν) such that the corresponding sub-exponential bound holds for almost all z* ∈ Ξ_z. Then, if the sequence (α_N, ρ_N), N ∈ N, meets the conditions stated in Lemma EC.1, the convergence result stated in that lemma also applies in the following two cases: i) a = p and the function ℓ : R^{d_y} → R_+ in Assumption EC.2 is such that ∫ ‖y‖^p ℓ(y) dy < R < ∞; and ii) a > p.

Proof. Since the weak convergence of distributions is guaranteed by way of Lemma EC.1, we just need to prove that ∫_Ξ̄ ‖y‖^p dQ^N_Ξ̄ → ∫_Ξ̄ ‖y‖^p dQ_Ξ̄ (i.e., convergence of the p-th moment; see Proposition EC.1). For this purpose, we will use different strategies in cases i) and ii).

Case i):
Here we follow a similar strategy to that used to prove Lemma EC.1.

We have
To upper bound the first term on the right-hand side of the above inequality, we exploit the sub-exponential character of ‖y_i‖^p, i = 1, . . . , N (understood as random variables). To this end, we employ the following technical result, which corresponds to Theorem 2.51 in Bercu et al. (2015).
Theorem EC.5 (Theorem 2.51 from Bercu et al. (2015)). Let Z_1, . . . , Z_n be a finite sequence of independent and centered random variables such that, for all 1 ≤ k ≤ n, the random variable Z_k satisfies log E[exp(tZ_k)] ≤ l(t) for any t ≥ 0, with l(t) being a suitable function on [0, ∞), where l* stands for the convex conjugate of l.
We finish the proof of case i) here because, from this point on, the process is the same as in Lemma EC.1, just replacing L and 2‖h‖_∞ with R and m̄, respectively.
Case ii): Based on the corollary to Billingsley (2012, Theorem 25.12), it suffices to show that sup_N ∫_{R^{d_y}} ‖y‖^a dQ^N_Ξ̄ < ∞. We first compute the integral for a fixed N.
Σ_i b^N_i (‖ŷ_i‖^a − m(ẑ_i)) + m̄. By Theorem EC.5, we have, for any arbitrary ε > 0, a bound which is summable with respect to N, because Nα_N/log(N) → ∞. Take ε := ε_0 > 0; there must then exist a sufficiently large N_0 such that the stated inequality holds for N ≥ N_0 with probability one.
Therefore, the required bound holds for almost all z ∈ Ξ_z and some R > 0. In this case, for instance, we do not need the almost-everywhere boundedness condition on the random variable m(z).

Portfolio optimization
In this section, we present and discuss some numerical results for the case Q(Ξ̄) = α > 0. For this purpose, we use the same portfolio allocation problem described in the main manuscript, but we now condition on the event that the feature vector lies in a set Z, so that Q(Ξ̄) > 0.
In particular, we consider Z := {z̃ ∈ R^3 : ‖z̃‖_∞ ≤ r}, with z̃ being the standardized feature vector.
We draw 50 000 samples from the true joint data-generating distribution through the explicit form of y/z given in the main text. We then use the conditional empirical distribution made up of those samples falling within Ξ̄, specifically 7306 data points, as a proxy of the true conditional distribution Q_Ξ̄. Consequently, we have that Q(Ξ̄) ≈ 0.14612. We wish to solve the following optimization problem:

min_{(x, β) ∈ X}  E[ −λ⟨x, y⟩ + β + (1/δ)(−⟨x, y⟩ − β)^+ | ξ ∈ Ξ̄ ],   (EC.9)

with the rest of the parameters being equal to the values taken in the instance α = 0.
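The Monte Carlo estimate of α = Q(Ξ̄) described above can be reproduced, up to the unspecified radius r, with a short sketch (the radius value below is a placeholder of ours):

    import numpy as np

    rng = np.random.default_rng(3)

    def estimate_alpha(n=50_000, r=1.0):
        """Monte Carlo estimate of alpha = Q(Xi_bar) = P(||standardized z||_inf <= r);
        r is a placeholder for the radius used in the experiments."""
        z1 = rng.normal(1000.0, 50.0, n)
        z2 = rng.normal(0.02, 0.01, n)
        z3 = np.exp(rng.normal(0.0, 1.0, n))
        Z = np.column_stack([z1, z2, z3])
        Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)
        return float(np.mean(np.max(np.abs(Z_std), axis=1) <= r))

    print(estimate_alpha())   # value depends on r; the text reports about 0.146 for its radius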
We also compare here four data-driven approaches to solve problem (EC.9), namely:
• Our two approaches, i.e., problem P(α, ρ_N) with α := Q(Ξ̄) (approximately equal to 0.14612, as we have just mentioned), denoted "DROTRIMM1", and problem P(α_N, ρ_N), where α_N := Q_N(Ξ̄) is an estimate of α. We refer to this approach as "DROTRIMM2." In principle, this would be the natural approach for a decision-maker with no knowledge of α.
• A sample average approximation (SAA) method that works with the samples falling in Ξ.
• The aforementioned SAA method followed by a standard Wasserstein-metric-based DRO approach to robustify it, which we call "SAADRO".
As in the previous numerical experiments, we employ a similar bootstrapping procedure based on the available data sample to tune the robustness parameter that each method j, with j ∈ {DROTRIMM1, DROTRIMM2, SAADRO}, uses. More specifically, for each j ∈ {DROTRIMM1, DROTRIMM2, SAADRO} and a given reliability level 1 − β ∈ (0, 1) (in our numerical experiments, we set β to 0.15), we seek an estimator param^{β,j}_N that leads to the best out-of-sample performance, while guaranteeing the desired level of confidence 1 − β. For each sample of size N, we use the following algorithm to derive param^{β,j}_N and the corresponding portfolio solution:
1. We construct kboot resamples (with replacement) of size N, each playing the role of a different training dataset. In our experiments, we use kboot = 50. Moreover, we build a validation dataset (per resample) from those data points of the original sample of size N that fall in Ξ̄ but have not been involved in the resample. We only consider resamples from which we can build a validation set of at least one data point. Furthermore, unlike DROTRIMM1 and DROTRIMM2, SAADRO can only be implemented if we have at least one data point falling within Ξ̄ in the training set (the same occurs with SAA). Thus, we implicitly assume that the source sample has no fewer than two data points in Ξ̄.
2. For each resample k = 1, . . . , kboot and each candidate value for param (taken from the discrete set {b · 10 c : b ∈ {0, . . . , 9}, c ∈ {−3, −2, −1, 0}}), we compute a solution by method j with parameter param on the k-th resample. The resulting optimal decision is denoted as x j,k N (param) and its corresponding objective value as J j,k N (param). Thereafter, we calculate the out-of-sample performance J( x j,k N (param)) of the data-driven solution x j,k N (param) over the validation dataset.
3. From among the candidate values for param such that J j,k N (param) exceeds the value J( x j,k N (param)) in at least (1 − β) × kboot different resamples, we take as param β,j N the one yielding the best cost performance averaged over the kboot resamples.
4. Finally, we compute the solution given by method j with parameter param^{β,j}_N, namely x^j_N := x^j_N(param^{β,j}_N), and the respective certificate J^j_N := J^j_N(param^{β,j}_N) (a schematic sketch of steps 1-4 is provided at the end of this section).

Among the approaches that incorporate robustness in the decision-making, DROTRIMM1 and DROTRIMM2 seem to systematically identify reliable portfolios with a better expected performance than those given by SAADRO. To investigate the ability of SAADRO, DROTRIMM1 and DROTRIMM2 to identify good portfolios, we provide Figure EC.2, which is analogous to Figure 3 in the case α = 0. Observe that both DROTRIMM1 and DROTRIMM2 guarantee reliability for smaller values of their robustness parameter than SAADRO. This gives the former a competitive advantage over the latter, essentially because a better out-of-sample performance (in expectation) appears to be, in general, aligned with a lower degree of distributional robustness (this finding is consistent with the fact that the unreliable SAA solution performs fairly well in terms of the weighted mean-risk asset returns). To be more precise, taking a small sample size N (say 50) and an equal value for each of their robustness parameters, DROTRIMM1 and DROTRIMM2 deliver portfolios with an actual expected cost (and variance) that is lower than or approximately equal to that of the portfolios provided by SAADRO. They do so for any value of their robustness parameter. Furthermore, when N is increased, even though there exists a range of values of the robustness parameter for which SAADRO also identifies portfolios with a good out-of-sample performance, these are discarded by the method because they do not comply with the reliability specification. For instance, take N = 400. SAADRO needs a radius larger than 0.2-0.3 to ensure reliability. However, for these values of the Wasserstein-ball radius, the portfolios given by SAADRO result in an actual expected cost above −70. On the other hand, DROTRIMM2 guarantees reliability with a value of its robustness parameter above 0.003-0.004, for which, in addition, it provides solutions with an actual expected cost below −77. These last results are independent of the specific validation procedure used to tune the robustness parameters of the DRO methods. Therefore, they correspond to the best solutions that can be obtained from SAADRO, DROTRIMM1 and DROTRIMM2, and confirm that our approaches (especially DROTRIMM2) can potentially identify portfolios that significantly outperform those delivered by SAADRO under the same reliability requirement.
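The bootstrap tuning in steps 1-4 above can be sketched as follows; the hook functions solve_method, evaluate and in_xi are hypothetical placeholders of ours for the method-specific solver, the empirical cost evaluation and the Ξ̄-membership test, respectively.

    import numpy as np

    def tune_parameter(sample, solve_method, evaluate, in_xi, grid, beta=0.15, kboot=50, rng=None):
        """Schematic bootstrap tuning of a robustness parameter (steps 1-4 above).
        `solve_method(train, param)` returns a decision and its certificate,
        `evaluate(x, data)` is the empirical cost on a validation set, and `in_xi(point)`
        flags points lying in Xi_bar; all of these names are ours. The source sample is
        assumed to contain at least two points in Xi_bar."""
        rng = rng or np.random.default_rng()
        N = len(sample)
        records = {param: [] for param in grid}
        k = 0
        while k < kboot:
            idx = rng.integers(0, N, size=N)                       # resample with replacement
            held_out = [sample[i] for i in set(range(N)) - set(idx) if in_xi(sample[i])]
            if not held_out:                                       # need a non-empty validation set
                continue
            k += 1
            train = [sample[i] for i in idx]
            for param in grid:
                x, certificate = solve_method(train, param)
                records[param].append((certificate, evaluate(x, held_out)))
        best, best_cost = None, np.inf
        for param, rec in records.items():
            reliable = sum(cert >= cost for cert, cost in rec) >= (1 - beta) * kboot
            avg_cost = np.mean([cost for _, cost in rec])
            if reliable and avg_cost < best_cost:                  # best average out-of-sample cost
                best, best_cost = param, avg_cost
        return best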
They do so for any value of their robustness parameter. Furthermore, when N is increased, even though there exists a range of values of the robustness parameter for which SAADRO also identifies portfolios with a good performance out of sample, these are discarded by the method because ec24 e-companion to Esteban-Pérez and Morales: Distributionally robust conditional stochastic programs based on trimmings they do not comply with the reliability specification. For instance, take N = 400. SAADRO needs a radius larger than 0.2-0.3 to ensure reliability. However, for these values of the Wasserstein-ball radius, the portfolios given by SAADRO result in an actual expected cost above −70. On the other hand, DROTRIMM2 guarantees reliability with a value of its robustness parameter above 0.003-0.004, for which, in addition, it provides solutions with an actual expected cost below −77. used to tune the robustness parameters of the DRO methods. Therefore, these results correspond to the best solutions that can be obtained from SAADRO, DROTRIMM1 and DROTRIMM2, and confirm that our approaches (especially, DROTRIMM2) can potentially identify portfolios that significantly outperform those delivered by SAADRO under the same reliability requirement.