
Mathematical Programming, Volume 167, Issue 2, pp 235–292

Data-driven robust optimization

  • Dimitris Bertsimas
  • Vishal Gupta
  • Nathan Kallus
Full Length Paper, Series A

Abstract

The last decade witnessed an explosion in the availability of data for operations research applications. Motivated by this growing availability, we propose a novel schema for utilizing data to design uncertainty sets for robust optimization using statistical hypothesis tests. The approach is flexible and widely applicable, and robust optimization problems built from our new sets are computationally tractable, both theoretically and practically. Furthermore, optimal solutions to these problems enjoy a strong, finite-sample probabilistic guarantee whenever the constraints and objective function are concave in the uncertainty. We describe concrete procedures for choosing an appropriate set for a given application and applying our approach to multiple uncertain constraints. Computational evidence in portfolio management and queueing confirms that our data-driven sets significantly outperform traditional robust optimization techniques whenever data are available.

Keywords

Robust optimization · Data-driven optimization · Chance constraints · Hypothesis testing

Mathematics Subject Classification

80M50 (Optimization: operations research, mathematical programming) · 62H15 (Multivariate analysis: hypothesis testing)

1 Introduction

Robust optimization is a popular approach to optimization under uncertainty. The key idea is to define an uncertainty set of possible realizations of the uncertain parameters and then optimize against worst-case realizations within this set. Computational experience suggests that with well-chosen sets, robust models yield tractable optimization problems whose solutions perform as well as or better than other approaches. With poorly chosen sets, however, robust models may be overly conservative or computationally intractable. Choosing a good set is crucial. Fortunately, there are several theoretically motivated and experimentally validated proposals for constructing good uncertainty sets [3, 6, 10, 16]. These proposals share a common paradigm: they combine a priori reasoning with mild assumptions on the uncertainty to motivate the construction of the set.

On the other hand, the last decade witnessed an explosion in the availability of data. Massive amounts of data are now routinely collected in many industries. Retailers archive terabytes of transaction data. Suppliers track order patterns across their supply chains. Energy markets can access global weather data, historical demand profiles, and, in some cases, real-time power consumption information. These data have motivated a shift in thinking—away from a priori reasoning and assumptions and towards a new data-centered paradigm. A natural question, then, is how should robust optimization techniques be tailored to this new paradigm?

In this paper, we propose a general schema for designing uncertainty sets for robust optimization from data. We consider uncertain constraints of the form \(f({\tilde{{\mathbf {u}}}}, {\mathbf {x}}) \le 0\) where \({\mathbf {x}}\in {{\mathbb {R}}}^k\) is the optimization variable, and \({\tilde{{\mathbf {u}}}}\in {{\mathbb {R}}}^d\) is an uncertain parameter. We model this constraint by choosing a set \({\mathcal {U}}\) and forming the corresponding robust constraint
$$\begin{aligned} f({\mathbf {u}}, {\mathbf {x}}) \le 0 \quad \forall {\mathbf {u}}\in {\mathcal {U}}. \end{aligned}$$
(1)
We assume throughout that \(f({\mathbf {u}}, {\mathbf {x}})\) is concave in \({\mathbf {u}}\) for any \({\mathbf {x}}\).
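For intuition, consider the simplest instance of (1): a linear constraint with a box uncertainty set, where the worst-case realization decomposes coordinate-wise. The following is a minimal sketch; the box set and all numbers are illustrative and not one of the sets constructed later in the paper.

```python
# Robust feasibility check for the linear constraint u^T x - b <= 0
# over the box U = [lo_1, hi_1] x ... x [lo_d, hi_d].
# max_{u in U} u^T x decomposes coordinate-wise: each u_i moves to the
# endpoint that maximizes u_i * x_i.

def worst_case_value(x, lo, hi, b):
    """Return max_{u in U} u^T x - b."""
    return sum(max(l * xi, h * xi) for xi, l, h in zip(x, lo, hi)) - b

def robust_feasible(x, lo, hi, b):
    """True iff u^T x - b <= 0 for every u in the box U."""
    return worst_case_value(x, lo, hi, b) <= 0.0

lo, hi = [-1.0, 0.0], [1.0, 2.0]
print(robust_feasible([0.5, 0.25], lo, hi, b=1.5))  # worst case 0.5 + 0.5 - 1.5 -> True
print(robust_feasible([2.0, 1.0], lo, hi, b=1.5))   # worst case 2 + 2 - 1.5 -> False
```

Enforcing the constraint for the single worst-case realization is exactly what makes the semi-infinite constraint (1) finitely checkable.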

In many applications, robust formulations decompose into a series of constraints of the form (1) through an appropriate transformation of variables, including uncertain linear optimization and multistage adaptive optimization (see, e.g., [6]). In this sense, (1) is the fundamental building block of many robust optimization models.

Many approaches [6, 16, 22] to constructing uncertainty sets for (1) assume \({\tilde{{\mathbf {u}}}}\) is a random variable whose distribution \({\mathbb {P}}^*\) is not known except for some assumed structural features. For example, they may assume that \({\mathbb {P}}^*\) has independent components but unknown marginal distributions. Furthermore, instead of insisting the given constraint hold almost surely with respect to \({\mathbb {P}}^*\), they instead authorize a small probability of violation. Specifically, given \(\epsilon > 0\), these approaches seek sets \({\mathcal {U}}_\epsilon \) that satisfy two key properties:
(P1)

The robust constraint (1) is computationally tractable.

(P2)
The set \({\mathcal {U}}_\epsilon \) implies a probabilistic guarantee for \({\mathbb {P}}^*\) at level \(\epsilon \), that is, for any \({\mathbf {x}}^* \in {{\mathbb {R}}}^k\) and for every function \(f({\mathbf {u}}, {\mathbf {x}})\) that is concave in \({\mathbf {u}}\) for every \({\mathbf {x}}\), we have the implication:
$$\begin{aligned} \text {If } f\left( {\mathbf {u}}, {\mathbf {x}}^*\right) \le 0 \ \ \forall {\mathbf {u}}\in {\mathcal {U}}_\epsilon ,\quad \text { then } {\mathbb {P}}^*\left( f\left( {\tilde{{\mathbf {u}}}}, {\mathbf {x}}^*\right) \le 0 \right) \ge 1-\epsilon . \end{aligned}$$
(2)

(P2) ensures that a feasible solution to the robust constraint will also be feasible with probability \(1-\epsilon \) with respect to \({\mathbb {P}}^*\), despite not knowing \({\mathbb {P}}^*\) exactly. Existing proposals achieve (P2) by leveraging the a priori structural features of \({\mathbb {P}}^*\). Some of these approaches, e.g., [16], only consider the special case when \(f({\mathbf {u}}, {\mathbf {x}})\) is bi-affine, but one can generalize them to (2) using techniques from [5] (see also Sect. 2.1).

Like previous proposals, we also assume \({\tilde{{\mathbf {u}}}}\) is a random variable whose distribution \({\mathbb {P}}^*\) is not known exactly, and seek sets \({\mathcal {U}}_\epsilon \) that satisfy these properties. Unlike previous proposals—and this is critical—we assume that we have data \({\mathcal {S}}=\{\hat{{\mathbf {u}}}^1, \ldots , \hat{{\mathbf {u}}}^N\}\) drawn i.i.d. according to \({\mathbb {P}}^*\). By combining these data with the a priori structural features of \({\mathbb {P}}^*\), we can design new sets that imply similar probabilistic guarantees, but which are much smaller with respect to subset containment than their traditional counterparts. Consequently, robust models built from our new sets yield less conservative solutions than traditional counterparts, while retaining their robustness properties.

The key to our schema is using the confidence region of a statistical hypothesis test to quantify what we learn about \({\mathbb {P}}^*\) from the data. Specifically, our schema depends on three ingredients: a priori assumptions on \({\mathbb {P}}^*\), data, and a hypothesis test. By pairing different a priori assumptions and tests, we obtain distinct data-driven uncertainty sets, each with its own geometric shape, computational properties, and modeling power. These sets can capture a variety of features of \({\mathbb {P}}^*\), including skewness, heavy-tails and correlations.

In principle, there are many possible pairings of a priori assumptions and tests. We focus on pairings we believe are most relevant to practitioners for their tractability and applicability. Our list is non-exhaustive; there may exist other pairings that yield effective sets. Specifically, we consider situations where:
  • \({\mathbb {P}}^*\) has known, finite discrete support (Sect. 4).

  • \({\mathbb {P}}^*\) may have continuous support, and the components of \({\tilde{{\mathbf {u}}}}\) are independent (Sect. 5).

  • \({\mathbb {P}}^*\) may have continuous support, but data are drawn from its marginal distributions asynchronously (Sect. 6). This situation models the case of missing values.

  • \({\mathbb {P}}^*\) may have continuous support, and data are drawn from its joint distribution (Sect. 7). This is the general case.

Table 1 summarizes the a priori structural assumptions, hypothesis tests, and resulting uncertainty sets that we propose. Each set is convex and admits a tractable, explicit description; see the referenced equations.
Table 1

Summary of data-driven uncertainty sets proposed in this paper. SOC, EC and LMI denote second-order cone representable sets, exponential cone representable sets, and linear matrix inequalities, respectively

Assumptions on \({\mathbb {P}}^*\) | Hypothesis test | Geometric description | Eqs. | Inner problem
Discrete support | \(\chi ^2\)-test | SOC | (13, 15) |
Discrete support | G-test | Polyhedral* | (13, 16) |
Independent marginals | KS test | Polyhedral* | (21) | Line search
Independent marginals | K test | Polyhedral* | (76) | Line search
Independent marginals | CvM test | SOC* | (76, 69) |
Independent marginals | W test | SOC* | (76, 70) |
Independent marginals | AD test | EC | (76, 71) |
Independent marginals | Chen et al. [23] | SOC | (27) | Closed-form
None | Marginal samples | Box | (31) | Closed-form
None | Linear convex ordering | Polyhedron | (34) |
None | Shawe-Taylor and Cristianini [46] | SOC | (39) | Closed-form
None | Delage and Ye [25] | LMI | (41) |

The additional “*” notation indicates a set of the above type with one additional relative entropy constraint. KS, K, CvM, W, and AD denote the Kolmogorov–Smirnov, Kuiper, Cramér–von Mises, Watson, and Anderson–Darling goodness-of-fit tests, respectively. In some cases, we can identify a worst-case realization of \({\mathbf {u}}\) in (1) for bi-affine f and a candidate \({\mathbf {x}}\) with a specialized algorithm; in these cases, the column “Inner problem” roughly describes this algorithm

For each of our sets, we provide an explicit, equivalent reformulation of (1). The complexity of optimizing over this reformulation depends both on the function \(f({\mathbf {u}}, {\mathbf {x}})\) and the set \({\mathcal {U}}\). For each of our sets, we show that this reformulation is polynomial time tractable for a large class of functions f including bi-affine functions, separable functions, conic-quadratic representable functions and certain sums of uncertain exponential functions. By exploiting special structure in some of our sets, we can provide specialized routines for identifying a worst-case realization of \({\mathbf {u}}\) in (1) for bi-affine f and a candidate solution \({\mathbf {x}}\).1 Utilizing this separation routine within a cutting-plane method may offer performance superior to approaches which attempt to solve (1) directly [13, 38]. In these cases, the column “Inner Problem” in Table 1 roughly describes these routines.

We are not the first to consider using hypothesis tests in data-driven optimization; others have considered more specialized applications of hypothesis testing. Klabjan et al. [34] propose a distributionally robust dynamic program based on Pearson’s \(\chi ^2\)-test for a particular inventory problem. Goldfarb and Iyengar [29] calibrate an uncertainty set for the mean and covariance of a distribution using linear regression and the t test. It is not clear how to generalize these methods to other settings, e.g., distributions with continuous support in the first case or general parameter uncertainty in the second. By contrast, we offer a comprehensive study of the connection between hypothesis testing and uncertainty set design, addressing a number of cases with general machinery.

Recently, Ben-Tal et al. [9] proposed a class of data-driven uncertainty sets based on phi-divergences. Several classical hypothesis tests, like Pearson’s \(\chi ^2\)-test and the G-test, are based on phi-divergences (see also [32]). Ben-Tal et al. [9] focus on the case where the uncertain parameters \({\tilde{{\mathbf {u}}}}\), themselves, are a probability distribution with known, finite, discrete support. Robust optimization problems where the uncertainty is a probability distribution are typically called distributionally robust optimization (DRO) problems, and the corresponding uncertainty sets are called ambiguity sets. Although a large number of ambiguity sets based on generalized moment constraints and probability metrics have been proposed in the literature (see, e.g., [28, 50] for recent work), to the best of our knowledge Ben-Tal et al. [9] is the first to connect an ambiguity set with a hypothesis test. In contrast to these DRO models for ambiguity sets, we design uncertainty sets for general uncertain parameters \({\tilde{{\mathbf {u}}}}\), such as future product demand, service times, and asset returns; these uncertain parameters need not represent probabilities. Methodologically, treating general uncertain parameters requires different techniques than those typically used in constructing ambiguity sets.

This distinction is not to suggest that our work is entirely unrelated to DRO. Our hypothesis testing perspective provides a unified view of ambiguity sets in DRO and many other data-driven methods from the literature. For example, Calafiore and El Ghaoui [18] and Delage and Ye [25] have proposed data-driven methods for chance-constrained and distributionally robust problems, respectively, without using hypothesis testing. We show how these works can be reinterpreted through the lens of hypothesis testing. Leveraging this viewpoint enables us to apply methods from statistics, such as the bootstrap, to refine these methods and improve their numerical performance. Moreover, applying our schema, we can design data-driven uncertainty sets for robust optimization based upon these methods. Although we focus on Calafiore and El Ghaoui [18] and Delage and Ye [25] in this paper, this strategy applies equally well to a host of other methods beyond DRO, such as the likelihood estimation approach of [49]. In this sense, we believe hypothesis testing and uncertainty set design provide a common framework in which to compare and contrast different approaches.

At the same time, Ben-Tal et al. [6] establish a one-to-one correspondence between uncertainty sets for linear optimization that satisfy (P2) and safe approximations to ambiguous linear chance constraints (see also Remark 1). Recall that an ambiguous, linear chance constraint in \({\mathbf {x}}\) is of the form \(\sup _{{\mathbb {P}}\in {\mathcal {P}}} {\mathbb {P}}({\mathbf {x}}^T{\tilde{{\mathbf {u}}}}\le 0) \ge 1-\epsilon \) for some ambiguity set \({\mathcal {P}}\), i.e., it is a specific instance of DRO. Thus, through this correspondence, all of our results can be recast as new data-driven constructions for safe-approximations to chance constraints. Whether one phrases our results in the language of ambiguous chance constraints or uncertainty sets for (classical) robust optimization is largely a matter of taste. In what follows, we prefer uncertainty sets since many existing robust optimization applications in engineering and operations research are formulated in terms of general uncertain parameters. Our new uncertainty sets can be directly substituted into these existing models with little additional effort.

Finally, we note that Campi and Garatti [21] propose a very different data-driven method for robust optimization not based on hypothesis tests. In their approach, one replaces the uncertain constraint \(f({\tilde{{\mathbf {u}}}}, {\mathbf {x}}) \le 0\) with N sampled constraints over the data, \(f(\hat{{\mathbf {u}}}^j, {\mathbf {x}}) \le 0\), for \(j=1, \ldots , N\). For \(f({\mathbf {u}}, {\mathbf {x}})\) convex in \({\mathbf {x}}\) with arbitrary dependence in \({\mathbf {u}}\), they provide a tight bound \(N(\epsilon )\) such that if \(N \ge N(\epsilon )\), then, with high probability with respect to the sampling procedure \({\mathbb {P}}_{\mathcal {S}}\), any \({\mathbf {x}}\) which is feasible in the N sampled constraints satisfies \({\mathbb {P}}^*(f({\tilde{{\mathbf {u}}}}, {\mathbf {x}}) \le 0) \ge 1-\epsilon \). Various refinements of this base method have been proposed yielding smaller bounds \(N(\epsilon )\), including incorporating \(\ell _1\)-regularization [20] and allowing \({\mathbf {x}}\) to violate a small fraction of the constraints [19]. Compared to our approach, these methods are more generally applicable and provide a similar probabilistic guarantee. In the special case we treat where \(f({\tilde{{\mathbf {u}}}}, {\mathbf {x}})\) is concave in \({\mathbf {u}}\), however, our proposed approach offers some advantages. First, because it leverages the concave structure of \(f({\mathbf {u}}, {\mathbf {x}})\), our approach generally yields less conservative solutions (for the same N and \(\epsilon \)) than [21] (see Sect. 3). Second, for fixed \(\epsilon > 0\), our approach is applicable even if \(N < N(\epsilon )\), while theirs is not. This distinction is important when \(\epsilon \) is very small and there may not exist enough data. Third, as we will show, our approach reformulates (1) as a series of (relatively) sparse convex constraints, while the approach of Campi and Garatti [21] will in general yield N dense constraints, which may be numerically challenging when N is large.

We summarize our contributions:
  1. We propose a new, systematic schema for constructing uncertainty sets from data using statistical hypothesis tests. When the data are drawn i.i.d. from an unknown distribution \({\mathbb {P}}^*\), sets built from our schema imply a probabilistic guarantee for \({\mathbb {P}}^*\) at any desired level \(\epsilon \).

  2. We illustrate our schema by constructing a multitude of uncertainty sets. Each set is applicable under slightly different a priori assumptions on \({\mathbb {P}}^*\), as described in Table 1.

  3. We prove that robust optimization problems over each of our sets are generally tractable. Specifically, for each set, we derive an explicit robust counterpart to (1) and show that, for a large class of functions \(f({\mathbf {u}}, {\mathbf {x}})\), optimizing over this counterpart can be accomplished in polynomial time using off-the-shelf software.

  4. We unify several existing data-driven methods through the lens of hypothesis testing. Through this lens, we motivate the use of common numerical techniques from statistics, such as bootstrapping and Gaussian approximation, to improve their performance. Moreover, we apply our schema to derive new uncertainty sets for (1) inspired by the refined versions of these methods.

  5. We illustrate how to model multiple uncertain constraints with our sets by optimizing the parameters chosen for each individual constraint. This approach is tractable and yields solutions that satisfy all uncertain constraints simultaneously at any desired level \(\epsilon \).

  6. We illustrate how common cross-validation techniques from model selection in machine learning can be used to choose an appropriate set and calibrate its parameters.

  7. Through applications in queueing and portfolio allocation, we assess the relative strengths and weaknesses of our sets. Overall, we find that although all of our sets shrink in size as \(N\rightarrow \infty \), they differ in their ability to represent features of \({\mathbb {P}}^*\). Consequently, they may perform very differently in a given application. In these two settings, we find that our model selection technique frequently identifies a good set choice, and a robust optimization model built with this set performs as well as or better than other robust data-driven approaches.
The remainder of the paper is structured as follows. Section 2 reviews background to keep the paper self-contained. Section 3 presents our schema for constructing uncertainty sets. Sections 4–7 describe the various constructions in Table 1. Section 8 reinterprets several techniques in the literature through the lens of hypothesis testing and, subsequently, uses them to motivate new uncertainty sets. Section 9.1 and “Appendix 3” discuss modeling multiple constraints and choosing the right set for an application, respectively. The remainder of Sect. 9 presents numerical experiments, and Sect. 10 concludes. All proofs are in the electronic companion.

In what follows, we adopt the following notational conventions: Boldfaced lowercase letters (\({\mathbf {x}}, \varvec{\theta }, \ldots \)) denote vectors, boldfaced capital letters (\({\mathbf {A}}, {\mathbf {C}}, \ldots \)) denote matrices, and ordinary lowercase letters (\(x, \theta \)) denote scalars. Calligraphic type (\({\mathcal {P}}, {\mathcal {S}} \ldots \)) denotes sets. The \(i\text {th}\) coordinate vector is \({\mathbf {e}}_i\), and the vector of all ones is \({\mathbf {e}}\). We always use \({\tilde{{\mathbf {u}}}}\in {{\mathbb {R}}}^d\) to denote a random vector and \({{\tilde{u}}}_i\) to denote its components. \({\mathbb {P}}\) denotes a generic probability measure for \({\tilde{{\mathbf {u}}}}\), and \({\mathbb {P}}^*\) denotes its true (unknown) measure. Moreover, \({\mathbb {P}}_i\) denotes the marginal measure of \({{\tilde{u}}}_i\). We let \({\mathcal {S}}= \{\hat{{\mathbf {u}}}^1, \ldots , \hat{{\mathbf {u}}}^N\}\) be a sample of N data points drawn i.i.d. according to \({\mathbb {P}}^*\), and let \({\mathbb {P}}^*_{\mathcal {S}}\) denote the measure of the sample \({\mathcal {S}}\), i.e., the N-fold product distribution of \({\mathbb {P}}^*\). Finally, \({\hat{{\mathbb {P}}}}\) denotes the empirical distribution with respect to \({\mathcal {S}}\), i.e., for any Borel set \({\mathcal {A}}\), \({{\hat{{\mathbb {P}}}}}( {\tilde{{\mathbf {u}}}}\in {\mathcal {A}} ) \equiv \frac{1}{N} \sum _{j=1}^N {\mathbb {I}}( \hat{{\mathbf {u}}}^j \in {\mathcal {A}}).\) Here \({\mathbb {I}}( \cdot )\) denotes the usual indicator function.

2 Background

To keep the paper self-contained, we recall some results needed to prove our sets are tractable and imply a probabilistic guarantee.

2.1 Tractability of Robust Nonlinear Constraints

Ben-Tal et al. [5] study constraint (1) and prove that for nonempty, convex, compact \({\mathcal {U}}\) satisfying a mild, regularity condition,2 (1) is equivalent to
$$\begin{aligned} \exists {\mathbf {v}}\in {{\mathbb {R}}}^d, t, s \in {{\mathbb {R}}}\ \text {s.t. } \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}\right) \le t, \ f_*\left( {\mathbf {v}}, {\mathbf {x}}\right) \ge s, \ t - s \le 0. \end{aligned}$$
(3)
Here, \(f_*({\mathbf {v}}, {\mathbf {x}})\) denotes the partial concave-conjugate of \(f({\mathbf {u}}, {\mathbf {x}})\) and \(\delta ^*({\mathbf {v}}| \ {\mathcal {U}})\) denotes the support function of \({\mathcal {U}}\), defined respectively as
$$\begin{aligned} f_*\left( {\mathbf {v}}, {\mathbf {x}}\right) \equiv \inf _{{\mathbf {u}}\in {{\mathbb {R}}}^d} {\mathbf {u}}^T{\mathbf {v}}- f\left( {\mathbf {u}}, {\mathbf {x}}\right) , \quad \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}\right) \equiv \sup _{{\mathbf {u}}\in {\mathcal {U}}} {\mathbf {v}}^T{\mathbf {u}}. \end{aligned}$$
(4)
For many \(f({\mathbf {u}}, {\mathbf {x}})\), \(f_*({\mathbf {v}}, {\mathbf {x}})\) admits a simple, explicit description. For example, for bi-affine \(f({\mathbf {u}}, {\mathbf {x}}) = {\mathbf {u}}^T{\mathbf {F}}{\mathbf {x}}+ {\mathbf {f}}_{\mathbf {u}}^T{\mathbf {u}}+ {\mathbf {f}}_{\mathbf {x}}^T{\mathbf {x}}+ f_0\), we have
$$\begin{aligned} f_*\left( {\mathbf {v}}, {\mathbf {x}}\right) = {\left\{ \begin{array}{ll} -{\mathbf {f}}_{\mathbf {x}}^T{\mathbf {x}}- f_0 &{}\text { if } {\mathbf {v}}= {\mathbf {F}}{\mathbf {x}}+ {\mathbf {f}}_{\mathbf {u}}, \\ -\infty &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
and (3) simplifies to
$$\begin{aligned} \delta ^*\left( {\mathbf {F}}{\mathbf {x}}+ {\mathbf {f}}_{\mathbf {u}}| \ {\mathcal {U}}\right) + {\mathbf {f}}_{\mathbf {x}}^T{\mathbf {x}}+ f_0 \le 0. \end{aligned}$$
(5)
In what follows, we concentrate on proving that we can represent \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\) with a small number of convex inequalities suitable for off-the-shelf solvers for each of our sets \({\mathcal {U}}\). From (5), this representation will imply that (1) is theoretically and practically tractable for each of our sets whenever \(f({\mathbf {u}}, {\mathbf {x}})\) is bi-affine.
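As an illustration of (5), suppose \({\mathcal {U}}\) is a Euclidean ball \(\{{\mathbf {u}}: \Vert {\mathbf {u}}- {\mathbf {u}}_0\Vert _2 \le \rho \}\), for which the support function has the standard closed form \(\delta ^*({\mathbf {v}}\,|\ {\mathcal {U}}) = {\mathbf {v}}^T{\mathbf {u}}_0 + \rho \Vert {\mathbf {v}}\Vert _2\). A minimal sketch of checking (5) for a bi-affine f; the ball set and all numbers are illustrative, not one of the data-driven sets constructed later.

```python
import math

# Support function of the ball U = {u : ||u - u0||_2 <= rho}:
#   delta*(v | U) = v^T u0 + rho * ||v||_2.
def support_ball(v, u0, rho):
    return sum(vi * ui for vi, ui in zip(v, u0)) + rho * math.sqrt(sum(vi * vi for vi in v))

# Constraint (5) for bi-affine f(u, x) = u^T F x + f_u^T u + f_x^T x + f0:
#   delta*(F x + f_u | U) + f_x^T x + f0 <= 0.
def robust_biaffine_feasible(x, F, f_u, f_x, f0, u0, rho):
    v = [sum(F[i][j] * x[j] for j in range(len(x))) + f_u[i] for i in range(len(f_u))]
    return support_ball(v, u0, rho) + sum(fi * xi for fi, xi in zip(f_x, x)) + f0 <= 0.0

# d = 2, k = 1: f(u, x) = (u1 + u2) * x1 - 1, with U a ball of radius 0.5 at the origin.
F = [[1.0], [1.0]]
print(robust_biaffine_feasible([1.0], F, f_u=[0.0, 0.0], f_x=[0.0], f0=-1.0,
                               u0=[0.0, 0.0], rho=0.5))   # 0.5*sqrt(2) - 1 <= 0 -> True
```

The same template applies to each of our sets: only the representation of \(\delta ^*({\mathbf {v}}\,|\ {\mathcal {U}})\) changes.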
On the other hand, Ben-Tal et al. [5] provide a number of other examples of \(f({\mathbf {u}}, {\mathbf {x}})\) for which \(f_*({\mathbf {v}}, {\mathbf {x}})\) is tractable, including:
  • Separable Concave: \(f({\mathbf {u}}, {\mathbf {x}}) = \sum _{i=1}^k f_i({\mathbf {u}}) x_i\), for \(f_i({\mathbf {u}})\) concave and \(x_i \ge 0\).

  • Uncertain Exponentials: \(f({\mathbf {u}}, {\mathbf {x}}) = -\sum _{i=1}^k x_i^{u_i}\), for \(x_i > 1\) and \(0 < u_i \le 1\).

  • Conic Quadratic Representable: \(f({\mathbf {u}}, {\mathbf {x}})\) such that the set \(\{(t, {\mathbf {u}}) \in {{\mathbb {R}}}\times {{\mathbb {R}}}^d : f({\mathbf {u}}, {\mathbf {x}}) \ge t \} \) is conic quadratic representable (cf. [40]).

Consequently, by providing a representation of \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\) for each of our sets, we will also have proven that (1) is tractable for each of these classes of functions via (3).

For some sets, our formulation of \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\) will involve complex nonlinear constraints, such as exponential cone constraints (cf. Table 1). Although it is theoretically possible to optimize over such constraints in polynomial time, this approach may be numerically challenging. An alternative to solving (3) directly is to use cutting-plane, bundle, or online optimization methods (see [8, 13, 38] for details). While these methods differ in the specifics of how they address (1), the critical subroutine in each method is “solving the inner problem.” Specifically, given a candidate solution \(({\mathbf {v}}_0, t_0)\), one must be able to easily compute \({\mathbf {u}}^* \in \arg \max _{{\mathbf {u}}\in {\mathcal {U}}} {\mathbf {v}}_0^T{\mathbf {u}}\) (notice \({\mathbf {u}}^*\) depends on \({\mathbf {v}}_0\)). From the definitions of the support function and \({\mathbf {u}}^*\), we have \(({\mathbf {v}}_0, t_0) \in \{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\) if and only if \({\mathbf {v}}_0^T{\mathbf {u}}^* \le t_0\). In particular, if \(({\mathbf {v}}_0, t_0) \not \in \{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\), then the hyperplane \(\{({\mathbf {v}}, t) : {\mathbf {v}}^T{\mathbf {u}}^* = t \}\) separates \(({\mathbf {v}}_0, t_0)\) from \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\). Namely, any \(({\mathbf {v}}, t) \in \{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\) satisfies the inequality \({\mathbf {v}}^T{\mathbf {u}}^* \le t\), but \(({\mathbf {v}}_0, t_0)\) does not. Such separating hyperplanes are used in cutting-plane and bundling methods to iteratively build up the constraint (1).
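To make the separation idea concrete, suppose \({\mathcal {U}}\) is a box, so the inner problem decomposes coordinate-wise. A minimal sketch of the resulting separation subroutine; the box set is illustrative, and for our actual sets the inner problem is solved by the specialized routines summarized in Table 1.

```python
def inner_problem_box(v0, lo, hi):
    """u* in argmax_{u in U} v0^T u for the box U = prod_i [lo_i, hi_i]."""
    return [h if vi >= 0 else l for vi, l, h in zip(v0, lo, hi)]

def separate(v0, t0, lo, hi):
    """Return None if delta*(v0 | U) <= t0; otherwise return u*, defining the
    cut v^T u* <= t, which all feasible (v, t) satisfy but (v0, t0) violates."""
    u_star = inner_problem_box(v0, lo, hi)
    val = sum(vi * ui for vi, ui in zip(v0, u_star))
    return None if val <= t0 else u_star

lo, hi = [-1.0, -1.0], [1.0, 1.0]
print(separate([1.0, -2.0], 4.0, lo, hi))  # delta* = 3 <= 4 -> None (feasible)
print(separate([1.0, -2.0], 2.0, lo, hi))  # 3 > 2 -> cut point [1.0, -1.0]
```

A cutting-plane method would add the returned cut and re-solve, repeating until no violated cut exists.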

Although it is possible to use this idea to prove polynomial time tractability of robust constraints over our sets via the ellipsoid algorithm using separation oracles (see [30] for details), we do not pursue this idea. Rather, our primary motivation is in improving practical efficiency in the spirit of Bertsimas et al. [13] when the reformulation (3) may be challenging. To this end, when possible, we provide specialized algorithms for solving the inner problem and identifying a \({\mathbf {u}}^*\) through closed-form formulas or line searches. Practitioners can then employ these specialized algorithms within one of the above referenced cutting-plane, bundle, or online learning methods to yield practically efficient algorithms for large-scale instances.

2.2 Hypothesis Testing

We briefly review hypothesis testing. See [35] for a more complete treatment.

Given a null-hypothesis \(H_0\) that makes a claim about an unknown distribution \({\mathbb {P}}^*\), a hypothesis test seeks to use data \({\mathcal {S}}\) drawn from \({\mathbb {P}}^*\) to either declare that \(H_0\) is false, or, else, that there is insufficient evidence to determine its validity. For a given significance level \(0< \alpha < 1\), a typical test consists of a test statistic \(T \equiv T({\mathcal {S}}, H_0)\), depending on the data and \(H_0\), and a threshold \({\varGamma }\equiv {\varGamma }(\alpha , {\mathcal {S}}, H_0)\), depending on \(\alpha \), \({\mathcal {S}}\), and \(H_0\). If \(T > {\varGamma }\), we reject \(H_0\). Since T depends on \({\mathcal {S}}\), it is random. The threshold \({\varGamma }\) is chosen so that the probability with respect to \({\mathbb {P}}_{\mathcal {S}}\) of incorrectly rejecting \(H_0\) is at most \(\alpha \). The choice of \(\alpha \) is often application specific, although values of \(\alpha = 1, 5\) and \(10\%\) are common (cf. [35, Chapt. 3.1]).

As an example, consider the two-sided Student’s t test ([35, Chapt. 5]). Given \(\mu _0 \in {{\mathbb {R}}}\), the t test considers the null-hypothesis \( H_0: {{\mathbb {E}}}^{{\mathbb {P}}^*}[{{\tilde{u}}}] = \mu _0\) using the statistic \(T = | ({{\hat{\mu }} - \mu _0})/({ {\hat{\sigma }}/\sqrt{N}})|\) and threshold \({\varGamma }= t_{N-1, 1-\alpha /2}\). Here \({\hat{\mu }}, {\hat{\sigma }}\) are the sample mean and sample standard deviation, respectively, and \(t_{N-1, 1-\alpha }\) is the \(1-\alpha \) quantile of the Student t distribution with \(N-1\) degrees of freedom. Under the a priori assumption that \({\mathbb {P}}^*\) is Gaussian, the test guarantees that we will incorrectly reject \(H_0\) with probability at most \(\alpha \).
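Numerically, the test amounts to computing the standard two-sided t statistic \(T = |{\hat{\mu }} - \mu _0|/({\hat{\sigma }}/\sqrt{N})\) from the sample and comparing it to the t quantile. A minimal stdlib-only sketch; the data are illustrative, and the critical value \(t_{9,\,0.975} \approx 2.262\) is taken from a standard t table rather than computed.

```python
import statistics

def t_test_rejects(data, mu0, threshold):
    """Two-sided t test of H0: E[u] = mu0; reject iff T > threshold."""
    N = len(data)
    mu_hat = statistics.mean(data)
    sigma_hat = statistics.stdev(data)              # sample standard deviation
    T = abs(mu_hat - mu0) / (sigma_hat / N ** 0.5)  # t statistic
    return T > threshold

data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 9.9]  # N = 10
T_CRIT = 2.262  # t_{9, 0.975}, from a standard t table (alpha = 5%)
print(t_test_rejects(data, mu0=10.0, threshold=T_CRIT))  # T ~ 0.57 -> False
print(t_test_rejects(data, mu0=11.0, threshold=T_CRIT))  # T ~ 13.7 -> True
```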

Many of the tests we consider are common in applied statistics, and tables for their thresholds are widely available. Several of our tests, however, are novel (e.g., the deviations test in Sect. 5.2). In these cases, we propose using the bootstrap to approximate a threshold (cf. Algorithm 1). \(N_B\) should be chosen fairly large; we take \(N_B = 10^4\) in our experiments. The bootstrap is a well-studied and widely used technique in statistics [26, 35]. Strictly speaking, hypothesis tests based on the bootstrap are only asymptotically valid for large N (see the references for a precise statement). Nonetheless, they are routinely used in applied statistics, even with N as small as 100, and a wealth of practical experience suggests they are extremely accurate. Consequently, we believe practitioners can safely use bootstrapped thresholds in the above tests.
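The bootstrap procedure referenced above (we do not reproduce Algorithm 1 here) can be sketched as follows: resample the data with replacement \(N_B\) times, recompute the statistic on each resample, and take the empirical \(1-\alpha \) quantile as the threshold. A minimal stdlib-only sketch under these assumptions; the statistic used (deviation of the resampled mean from the sample mean) and the data are purely illustrative.

```python
import random

def bootstrap_threshold(data, statistic, alpha=0.05, N_B=10_000, seed=0):
    """Approximate the 1 - alpha threshold of `statistic` by bootstrapping."""
    rng = random.Random(seed)
    N = len(data)
    stats = sorted(statistic([rng.choice(data) for _ in range(N)], data)
                   for _ in range(N_B))
    return stats[int((1 - alpha) * N_B) - 1]  # empirical (1 - alpha) quantile

# Illustrative statistic: absolute deviation of the resampled mean
# from the original sample mean.
def mean_deviation(resample, data):
    return abs(sum(resample) / len(resample) - sum(data) / len(data))

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(100)]      # N = 100 synthetic points
gamma = bootstrap_threshold(data, mean_deviation)  # threshold Gamma at alpha = 5%
print(round(gamma, 3))
```

The resulting test rejects whenever the statistic computed on fresh data exceeds \({\varGamma }\).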
Finally, we introduce the confidence region of a test, which will play a critical role in our construction. Given data \({\mathcal {S}}\), the \(1-\alpha \) confidence region of a test is the set of null-hypotheses that would not be rejected for \({\mathcal {S}}\) at significance level \(\alpha \). For example, the \(1-\alpha \) confidence region of the t test is \(\left\{ \mu _0 \in {{\mathbb {R}}}: \left| \frac{{\hat{\mu }} - \mu _0}{ {\hat{\sigma }}/\sqrt{N}} \right| \le t_{N-1, 1-\alpha /2} \right\} .\) In what follows, however, we commit a slight abuse of nomenclature and instead use the term confidence region to refer to the set of all measures that are consistent with any a priori assumptions of the test and also satisfy a null-hypothesis that would not be rejected. In the case of the t test, the confidence region in the context of this paper is
$$\begin{aligned} {\mathcal {P}}^t \equiv \left\{ {\mathbb {P}}\in {\varTheta }(-\infty , \infty ): {\mathbb {P}}\text { is Gaussian with mean } \mu _0, \text { and } \left| \frac{{\hat{\mu }} - \mu _0}{ {\hat{\sigma }}/\sqrt{N}} \right| \le t_{N-1, 1-\alpha /2} \right\} , \end{aligned}$$
(6)
where \({\varTheta }(-\infty , \infty )\) is the set of Borel probability measures on \({{\mathbb {R}}}\).

By construction, the probability (with respect to the sampling procedure \({\mathbb {P}}_{\mathcal {S}}\)) that \({\mathbb {P}}^*\) is a member of its confidence region is at least \(1-\alpha \), as long as all a priori assumptions are valid. This is a critical observation: despite not knowing \({\mathbb {P}}^*\), we can use a hypothesis test to construct from the data a set of distributions that contains \({\mathbb {P}}^*\) with any specified probability.

3 Designing Data-Driven Uncertainty Sets

3.1 Geometric Characterization of the Probabilistic Guarantee

As a first step towards our schema, we provide a geometric characterization of (P2). One might intuit that a set \({\mathcal {U}}\) implies a probabilistic guarantee at level \(\epsilon \) only if \({\mathbb {P}}^*( {\tilde{{\mathbf {u}}}}\in {\mathcal {U}}) \ge 1-\epsilon \). As noted by Ben-Tal et al. [6, pp. 32–33], however, this intuition is false. Often, sets that are much smaller than the \(1-\epsilon \) support will still imply a probabilistic guarantee at level \(\epsilon \), and such sets should be preferred because they are less conservative.

The crux of the issue is that there may be many realizations \({\tilde{{\mathbf {u}}}}\not \in {\mathcal {U}}\) where nonetheless \(f({\tilde{{\mathbf {u}}}}, {\mathbf {x}}^*) \le 0\). Thus, \({\mathbb {P}}^*({\tilde{{\mathbf {u}}}}\in {\mathcal {U}})\) is in general an underestimate of \({\mathbb {P}}^*(f({\tilde{{\mathbf {u}}}}, {\mathbf {x}}^*) \le 0)\). One needs to exploit the dependence of f on \({\mathbf {u}}\) to refine the estimate. We note in passing that many existing data-driven approaches for robust optimization, e.g., [21], do not leverage this dependence. Consequently, although these approaches are general purpose, they may yield overly conservative uncertainty sets for (1).

In order to tightly characterize (P2), we introduce the Value at Risk. For any \({\mathbf {v}}\in {{\mathbb {R}}}^d\) and measure \({\mathbb {P}}\), the Value at Risk at level \(\epsilon \) with respect to \({\mathbf {v}}\) is
$$\begin{aligned} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \equiv \inf \left\{ t : {\mathbb {P}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\le t\right) \ge 1- \epsilon \right\} . \end{aligned}$$
(7)
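Under the empirical measure, the infimum in (7) is attained at an upper sample quantile, which makes the definition easy to probe numerically. A minimal sketch (the samples are hypothetical):

```python
import numpy as np

def empirical_var(u_samples, v, eps):
    """VaR_eps of (7) under the empirical distribution: the smallest t with
    P(v^T u <= t) >= 1 - eps, i.e., an upper sample quantile of v^T u."""
    losses = np.sort(u_samples @ v)
    k = int(np.ceil((1 - eps) * len(losses))) - 1   # smallest index with enough mass
    return losses[k]

rng = np.random.default_rng(0)
U = rng.normal(size=(10_000, 3))        # hypothetical samples of u-tilde
v = np.array([1.0, 0.0, 0.0])
var10 = empirical_var(U, v, eps=0.10)   # ≈ 1.28, the 90% standard normal quantile
```

Scaling \({\mathbf {v}}\) scales the empirical VaR by the same factor, reflecting positive homogeneity.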
Value at Risk is positively homogeneous (in \({\mathbf {v}}\)), but typically non-convex. (Recall a function \(g({\mathbf {v}})\) is positively homogeneous if \(g(\lambda {\mathbf {v}}) = \lambda g({\mathbf {v}})\) for all \(\lambda > 0\).) The critical result underlying our method is a relationship between Value at Risk and support functions of sets which satisfy (P2) (cf. (4)):

Theorem 1

a) Suppose \({\mathcal {U}}\) is non-empty, convex, and compact. If \(\delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \ge \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \) for all \({\mathbf {v}}\in {{\mathbb {R}}}^d\), then \({\mathcal {U}}\) implies a probabilistic guarantee at level \(\epsilon \) for \({\mathbb {P}}\).

b) Suppose there exists \({\mathbf {v}}\in {{\mathbb {R}}}^d\) such that \(\delta ^*( {\mathbf {v}}| \ {\mathcal {U}}) < \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \). Then, there exists a bi-affine function \(f({\mathbf {u}}, {\mathbf {x}})\) for which (2) does not hold.

The first part generalizes a result implicitly used in [6, 23] when designing uncertainty sets for the special case of bi-affine functions. To the best of our knowledge, the extension to general concave functions f is new.

3.2 Our Schema

The principal challenge in applying Theorem 1 to designing uncertainty sets is that \({\mathbb {P}}^*\) is not known. Recall, however, that the confidence region \({\mathcal {P}}\) of a hypothesis test will contain \({\mathbb {P}}^*\) with probability at least \(1-\alpha \). This motivates the following schema: Fix \(0< \alpha < 1\) and \(0< \epsilon < 1\).

Note that the existence of the set in Step 3 is guaranteed by the bijection between closed, finite-valued, positively homogeneous convex functions and convex, compact sets (see [11]).

Theorem 2

With probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), the resulting set \({\mathcal {U}}({\mathcal {S}}, \epsilon , \alpha )\) implies a probabilistic guarantee at level \(\epsilon \) for \({\mathbb {P}}^*\).

Remark 1

Note that \(\delta ^*( {\mathbf {v}}| \ {\mathcal {U}}({\mathcal {S}}, \epsilon , \alpha ) ) \le t \) is a safe-approximation to the ambiguous chance constraint \(\sup _{{\mathbb {P}}\in {\mathcal {P}}({\mathcal {S}}, \alpha , \epsilon )} {\mathbb {P}}({\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\le t ) \ge 1-\epsilon \) as defined in [6]. Ambiguous chance-constraints are closely related to sets which satisfy (P2). See [6] for more details. Practitioners who prefer to model with ambiguous chance constraints can directly use \(\delta ^*( {\mathbf {v}}| \ {\mathcal {U}}({\mathcal {S}}, \epsilon , \alpha ) ) \le t \) in their formulations as a data-driven approach. We provide explicit descriptions of \(\delta ^*( {\mathbf {v}}| \ {\mathcal {U}}({\mathcal {S}}, \epsilon , \alpha ) ) \le t \) below for each of our sets for this purpose.

Theorem 2 ensures that with probability at least \(1-\alpha \) with respect to the sampling procedure \({\mathbb {P}}_{\mathcal {S}}\), a robust feasible solution \({\mathbf {x}}\) will satisfy a single uncertain constraint \(f({\tilde{{\mathbf {u}}}}, {\mathbf {x}}) \le 0\) with probability at least \(1-\epsilon \). Often, however, we face \(m>1\) uncertain constraints \(f_j({\tilde{{\mathbf {u}}}}, {\mathbf {x}}) \le 0\), \(j=1, \ldots , m\), and seek \({\mathbf {x}}\) that will simultaneously satisfy these constraints, i.e.,
$$\begin{aligned} {\mathbb {P}}\left( \max _{j=1, \ldots , m} f_j\left( {\tilde{{\mathbf {u}}}}, {\mathbf {x}}\right) \le 0\right) \ge 1-{\overline{\epsilon }}, \end{aligned}$$
(8)
for some given \({\overline{\epsilon }}\). One approach is to replace each uncertain constraint with a corresponding robust constraint
$$\begin{aligned} f_j\left( {\mathbf {u}}, {\mathbf {x}}\right) \le 0, \quad \forall {\mathbf {u}}\in {\mathcal {U}}\left( {\mathcal {S}}, \epsilon _j, \alpha \right) , \end{aligned}$$
(9)
where \({\mathcal {U}}({\mathcal {S}}, \epsilon _j, \alpha )\) is constructed via our schema at level \(\epsilon _j = {\overline{\epsilon }}/m\). By the union bound and Theorem 2, with probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), any \({\mathbf {x}}\) which satisfies (9) will satisfy (8).
The choice \(\epsilon _j = {\overline{\epsilon }}/m\) is somewhat arbitrary. We would prefer to treat the \(\epsilon _j\) as decision variables and optimize over them, i.e., replace the m uncertain constraints by
$$\begin{aligned}&\displaystyle \min _{\epsilon _1 + \ldots + \epsilon _m \le {\overline{\epsilon }}, \varvec{\epsilon }\ge {\mathbf {0}}} \Biggr \{ \max _{j=1, \ldots , m} \Big \{ \max _{{\mathbf {u}}\in {\mathcal {U}}\left( {\mathcal {S}},\epsilon _j, \alpha \right) } f_j\left( {\mathbf {u}}, {\mathbf {x}}\right) \Big \} \Biggr \} \le 0\nonumber \\&\displaystyle \text {or, equivalently,}\nonumber \\&\displaystyle \exists \epsilon _1 + \ldots + \epsilon _m \le {\overline{\epsilon }}, \ \varvec{\epsilon }\ge {\mathbf {0}}\ : \ f_j\left( {\mathbf {u}}, {\mathbf {x}}\right) \le 0 \ \ \forall {\mathbf {u}}\in {\mathcal {U}}\left( {\mathcal {S}}, \epsilon _j, \alpha \right) , \quad j=1, \ldots , m.\nonumber \\ \end{aligned}$$
(10)
Unfortunately, Theorem 2 does not imply that with probability at least \(1-\alpha \) any feasible solution to (10) will satisfy (8). The issue is that Theorem 2 requires selecting \(\epsilon \) independently of \({\mathcal {S}}\), whereas the optimal \(\epsilon _j\)’s in (10) will depend on \({\mathcal {S}}\), creating an in-sample bias. We next introduce a stronger requirement on an uncertainty set than “implying a probabilistic guarantee,” and adapt Theorem 2 to address (10).

Given a family of sets indexed by \(\epsilon \), \(\{ {\mathcal {U}}(\epsilon ) : 0< \epsilon < 1 \}\), we say this family simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\) if, for all \(0< \epsilon < 1\), each \({\mathcal {U}}(\epsilon )\) implies a probabilistic guarantee for \({\mathbb {P}}^*\) at level \(\epsilon \).

Theorem 3

Suppose \({\mathcal {P}}({\mathcal {S}}, \alpha , \epsilon ) \equiv {\mathcal {P}}({\mathcal {S}}, \alpha )\) does not depend on \(\epsilon \) in Step 1 above. Let \(\{{\mathcal {U}}({\mathcal {S}}, \epsilon , \alpha ) : 0< \epsilon < 1 \}\) be the resulting family of sets obtained from our schema.

a) With probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), \(\{{\mathcal {U}}({\mathcal {S}}, \epsilon , \alpha ) : 0< \epsilon < 1 \}\) simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\).

b) With probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), any \({\mathbf {x}}\) which satisfies (10) will satisfy (8).

We provide numerical evidence in Sect. 9 that (10) offers significant benefit over (9). In some special cases, we can optimize the \(\epsilon _j\)’s in (10) exactly (see “Appendix 4”). More generally, we must approximate this outer optimization numerically. We propose a specialized method leveraging the structure of our sets for this purpose in “Appendix 3”.

Depending on the quality of bound \(g(\cdot )\) in Step 2 of our schema, the resulting set \({\mathcal {U}}({\mathcal {S}}, \epsilon , \alpha )\) may not be contained in the support of \({\mathbb {P}}^*\). When a priori information is available on this support, we can always improve our set by taking intersections:

Theorem 4

Suppose \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*) \subseteq {\mathcal {U}}_0\) where \({\mathcal {U}}_0\) is closed and convex. Suppose further that \({\mathcal {U}}_\epsilon \) is convex and compact. Then,

a) If \({\mathcal {U}}_\epsilon \) implies a probabilistic guarantee for \({\mathbb {P}}^*\) at level \(\epsilon \), then \({\mathcal {U}}_\epsilon \cap {\mathcal {U}}_0\) also implies a probabilistic guarantee for \({\mathbb {P}}^*\) at level \(\epsilon \).

b) If \(\{ {\mathcal {U}}_\epsilon : 0< \epsilon < 1 \}\) simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\), then \(\{ {\mathcal {U}}_\epsilon \cap {\mathcal {U}}_0 : 0< \epsilon < 1 \}\) also simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\).

Remark 2

The convexity condition on \({\mathcal {U}}_0\) is necessary. It is not difficult to construct examples where \({\mathcal {U}}_0\) is non-convex and \({\mathcal {U}}_0 \cap {\mathcal {U}}_\epsilon = \emptyset \), e.g., the example from [6, pp. 32–33] has this property.

Remark 3

For many sets \({\mathcal {U}}_0\), such as boxes, polyhedra or ellipsoids, robust constraints over \({\mathcal {U}}_\epsilon \cap {\mathcal {U}}_0\) are essentially as tractable as robust constraints over \({\mathcal {U}}_\epsilon \). Specifically, from [5, Lemma A.4],
$$\begin{aligned} \left\{ ({\mathbf {v}}, t) : \delta ^*\left( {\mathbf {v}}\,|\, {\mathcal {U}}_\epsilon \cap {\mathcal {U}}_0\right) \le t \right\} = \Big \{ \left( {\mathbf {v}}, t \right) : \exists {\mathbf {w}}\in {{\mathbb {R}}}^d, \ t_1, t_2 \in {{\mathbb {R}}}\text { s.t. } \delta ^*\left( {\mathbf {v}}- {\mathbf {w}}\,|\, {\mathcal {U}}_\epsilon \right) \le t_1, \ \delta ^*\left( {\mathbf {w}}\,|\, {\mathcal {U}}_0\right) \le t_2, \ t_1 + t_2 \le t \Big \}. \end{aligned}$$
(11)
Consequently, whenever the constraint \(\delta ^*({\mathbf {w}}| \ {\mathcal {U}}_0) \le t_2\) is tractable, the constraint (11) will also be tractable.
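To see the decomposition (11) in action, consider a toy case of our own choosing: \({\mathcal {U}}_\epsilon\) a Euclidean ball of radius 2 centered at the origin and \({\mathcal {U}}_0 = [-1,1]^2\), so both support functions have closed forms (\(r\Vert \cdot \Vert_2\) and \(\Vert \cdot \Vert_1\)). A sketch that minimizes over \({\mathbf {w}}\) numerically:

```python
import numpy as np
from scipy.optimize import minimize

def support_intersection(v, r=2.0):
    """Evaluate delta*(v | U_eps ∩ U_0) via the decomposition (11), where
    U_eps is the radius-r Euclidean ball at the origin and U_0 = [-1,1]^d,
    so that delta*(y | U_eps) = r*||y||_2 and delta*(w | U_0) = ||w||_1."""
    obj = lambda w: r * np.linalg.norm(v - w) + np.abs(w).sum()
    res = minimize(obj, x0=0.5 * v, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-10})
    return res.fun

# The maximum of u_1 over (ball of radius 2) ∩ [-1,1]^2 is 1, attained with w = (1, 0).
val = support_intersection(np.array([1.0, 0.0]))
```

Any feasible \({\mathbf {w}}\) gives an upper bound on the true support function, and the minimum over \({\mathbf {w}}\) closes the gap, consistent with (11).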

The next four sections apply this schema to create uncertainty sets. Since \(\epsilon \), \(\alpha \) and \({\mathcal {S}}\) are typically fixed, we suppress some or all of them in the notation.

4 Uncertainty Sets Built from Discrete Distributions

In this section, we assume \({\mathbb {P}}^*\) has known, finite support, i.e., \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*) \subseteq \{{\mathbf {a}}_0, \ldots , {\mathbf {a}}_{n-1}\}\). We consider two hypothesis tests: Pearson’s \(\chi ^2\) test and the G test [42]. Both tests consider the hypothesis \(H_0 : {\mathbb {P}}^* = {\mathbb {P}}_0\) where \({\mathbb {P}}_0\) is some specified measure. Specifically, let \(p_i = {\mathbb {P}}_0({\tilde{{\mathbf {u}}}}= {\mathbf {a}}_i)\) be the probabilities specified by the null-hypothesis, and let \({\hat{ {\mathbf {p}}}}\) denote the empirical probability distribution, i.e.,
$$\begin{aligned} {\hat{p}}_i \equiv \frac{1}{N}\sum _{j=1}^N {\mathbb {I}}\left( \hat{{\mathbf {u}}}^j = {\mathbf {a}}_i\right) \quad i =0, \ldots , n-1. \end{aligned}$$
In words, \({{\hat{p}}}_i\) represents the proportion of the sample taking value \({\mathbf {a}}_i\). Pearson’s \(\chi ^2\) test rejects \(H_0\) at level \(\alpha \) if \(\sum _{i=0}^{n-1} \frac{ (p_i - {\hat{p}}_i)^2}{2p_i} > \frac{1}{2N} \chi ^2_{n-1, 1-\alpha }, \) where \(\chi ^2_{n-1, 1-\alpha }\) is the \(1-\alpha \) quantile of a \(\chi ^2\) distribution with \(n-1\) degrees of freedom. Similarly, the G test rejects the null hypothesis at level \(\alpha \) if \( D({\hat{ {\mathbf {p}}}}, {\mathbf {p}}) > \frac{1}{2N} \chi ^2_{n-1, 1-\alpha } \) where
$$\begin{aligned} D\left( {\mathbf {p}}, {\mathbf {q}}\right) \equiv \sum _{i=0}^{n-1} p_i \log \left( p_i / q_i\right) \end{aligned}$$
(12)
is the relative entropy between \( {\mathbf {p}}\) and \( {\mathbf {q}}\).
The confidence regions for Pearson’s \(\chi ^2\) test and the G test are, respectively,
$$\begin{aligned} {\mathcal {P}}^{\chi ^2}= & {} \left\{ {\mathbf {p}}\in {\varDelta }_n : \sum _{i=0}^{n-1} \frac{ \left( p_i - {\hat{p}}_i\right) ^2}{2p_i} \le \frac{1}{2N} \chi ^2_{n-1, 1-\alpha } \right\} , \nonumber \\ {\mathcal {P}}^{G}= & {} \left\{ {\mathbf {p}}\in {\varDelta }_n : D\left( {\hat{ {\mathbf {p}}}}, {\mathbf {p}}\right) \le \frac{1}{2N} \chi ^2_{n-1, 1-\alpha } \right\} . \end{aligned}$$
(13)
Here \({\varDelta }_n = \left\{ (p_0, \ldots , p_{n-1})^T : {\mathbf {e}}^T {\mathbf {p}}= 1, \ \ p_i \ge 0 \ \ i = 0, \ldots , n-1 \right\} \) denotes the probability simplex. We will use these two confidence regions in Step 1 of our schema.
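Checking whether a candidate pmf lies in the regions (13) is a direct computation; a sketch, using scipy only for the \(\chi^2\) quantile (the example pmfs and sample size are hypothetical):

```python
import numpy as np
from scipy import stats

def in_confidence_region(p_hat, p, N, alpha=0.10, test="chi2"):
    """Membership of a candidate pmf `p` in P^{chi2} or P^{G} of (13),
    given the empirical pmf `p_hat` built from N samples."""
    thresh = stats.chi2.ppf(1 - alpha, df=len(p) - 1) / (2 * N)
    if test == "chi2":
        stat = np.sum((p - p_hat) ** 2 / (2 * p))
    else:  # G test: relative entropy D(p_hat, p), cf. (12)
        mask = p_hat > 0
        stat = np.sum(p_hat[mask] * np.log(p_hat[mask] / p[mask]))
    return bool(stat <= thresh)

rng = np.random.default_rng(0)
p_true = np.array([0.2, 0.3, 0.5])
N = 500
p_hat = rng.multinomial(N, p_true) / N
miss = in_confidence_region(p_hat, np.array([0.8, 0.1, 0.1]), N)  # far-off pmf: excluded
```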
For a fixed measure \({\mathbb {P}}\), and vector \({\mathbf {v}}\in {{\mathbb {R}}}^d\), recall the Conditional Value at Risk:
$$\begin{aligned} {{\mathrm{\text {CVaR}}}}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \equiv \min _{t} \left\{ t + \frac{1}{\epsilon } {{\mathbb {E}}}^{\mathbb {P}}\left[ \left( {\tilde{{\mathbf {u}}}}^T{\mathbf {v}}- t \right) ^+\right] \right\} . \end{aligned}$$
(14)
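Under the empirical measure, the minimization in (14) is attained at the empirical \((1-\epsilon)\) quantile, giving a short estimator. A sketch with simulated losses (the data are hypothetical):

```python
import numpy as np

def empirical_cvar(losses, eps):
    """CVaR_eps of (14) under the empirical measure; the minimizing t is the
    empirical (1 - eps) quantile, i.e., the empirical VaR."""
    losses = np.sort(np.asarray(losses))
    t = losses[int(np.ceil((1 - eps) * len(losses))) - 1]
    return t + np.mean(np.maximum(losses - t, 0.0)) / eps

rng = np.random.default_rng(0)
losses = rng.normal(size=100_000)
cvar10 = empirical_cvar(losses, eps=0.10)   # ≈ 1.75 for a standard normal
```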
Conditional Value at Risk is well-known to be a convex upper bound to Value at Risk [1, 43] for a fixed \({\mathbb {P}}\). We can compute a bound in Step 2 by considering the worst-case Conditional Value at Risk over the above confidence regions, yielding

Theorem 5

Suppose \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*) \subseteq \{{\mathbf {a}}_0, \ldots , {\mathbf {a}}_{n-1}\}\). With probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), the families \(\{ {\mathcal {U}}^{\chi ^2}_\epsilon : 0< \epsilon < 1 \}\) and \(\{ {\mathcal {U}}^{G}_\epsilon : 0< \epsilon < 1 \}\) simultaneously imply a probabilistic guarantee for \({\mathbb {P}}^*\), where
$$\begin{aligned} {\mathcal {U}}_\epsilon ^{\chi ^2} =&\left\{ {\mathbf {u}}\in {{\mathbb {R}}}^d : {\mathbf {u}}= \sum _{j=0}^{n-1} q_j {\mathbf {a}}_j, \ {\mathbf {q}}\in {\varDelta }_n, \ {\mathbf {q}}\le \frac{1}{\epsilon } {\mathbf {p}}, \ {\mathbf {p}}\in {\mathcal {P}}^{\chi ^2} \right\} , \end{aligned}$$
(15)
$$\begin{aligned} {\mathcal {U}}_\epsilon ^{G} =&\left\{ {\mathbf {u}}\in {{\mathbb {R}}}^d : {\mathbf {u}}= \sum _{j=0}^{n-1} q_j {\mathbf {a}}_j, \ {\mathbf {q}}\in {\varDelta }_n, \ {\mathbf {q}}\le \frac{1}{\epsilon } {\mathbf {p}}, \ {\mathbf {p}}\in {\mathcal {P}}^{G} \right\} . \end{aligned}$$
(16)
Their support functions are given by
$$\begin{aligned} \begin{aligned} \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}_\epsilon ^{\chi ^2}\right) = \min _{\beta , {\mathbf {w}}, \eta , \lambda , {\mathbf {t}}, {\mathbf {s}}} \quad&\beta + \frac{1}{\epsilon }\left( \eta + \frac{\lambda \chi ^2_{n-1, 1-\alpha }}{N} + 2\lambda - 2 \sum _{i=0}^{n-1} {\hat{p}}_i s_i \right) \\ \text {s.t.} \quad&{\mathbf {0}}\le {\mathbf {w}}\le \left( \lambda + \eta \right) {\mathbf {e}}, \ \ \lambda \ge 0, \ \ {\mathbf {s}}\ge {\mathbf {0}},\\&\left\| \begin{matrix} 2 s_i \\ w_i - \eta \end{matrix} \right\| \le 2\lambda - w_i + \eta , \ \ {\mathbf {a}}_i^T {\mathbf {v}}- w_i \le \beta , \quad i = 0, \ldots , n-1, \end{aligned} \end{aligned}$$
(17)
$$\begin{aligned} \begin{aligned} \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}_\epsilon ^{G} \right) = \min _{\beta , {\mathbf {w}}, \eta , \lambda } \quad&\beta + \frac{1}{\epsilon }\left( \eta + \frac{\lambda \chi ^2_{n-1, 1-\alpha }}{2N} - \lambda \sum _{i=0}^{n-1} {\hat{p}}_i \log \left( 1 - \frac{w_i - \eta }{\lambda }\right) \right) \quad \\ \text {s.t}\quad&{\mathbf {0}}\le {\mathbf {w}}\le \left( \lambda + \eta \right) {\mathbf {e}}, \ \ \lambda \ge 0,\\&{\mathbf {a}}_i^T {\mathbf {v}}- w_i \le \beta , \quad i = 0, \ldots , n-1. \end{aligned} \end{aligned}$$
(18)
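To build intuition for (15)–(16), note that for a fixed pmf \({\mathbf {p}}\) the inner problem \(\max \{ {\mathbf {v}}^T\sum_j q_j {\mathbf {a}}_j : {\mathbf {q}}\in {\varDelta }_n, \ {\mathbf {q}}\le {\mathbf {p}}/\epsilon \}\) is a small linear program; the extra conic layer in (17)–(18) comes from additionally optimizing \({\mathbf {p}}\) over the confidence region, which is not reproduced here. A sketch with hypothetical support points:

```python
import numpy as np
from scipy.optimize import linprog

def support_fixed_p(v, A, p, eps):
    """max v^T(sum_j q_j a_j) over q in the simplex with q <= p/eps, for a
    FIXED pmf p; rows of A are the support points a_j. This is the inner
    linear program hidden inside (15)-(16)."""
    c = -(A @ v)                        # linprog minimizes, so negate
    res = linprog(c, A_eq=np.ones((1, len(p))), b_eq=[1.0],
                  bounds=[(0.0, pi / eps) for pi in p])
    return -res.fun

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical support points
p = np.array([0.5, 0.3, 0.2])
val = support_fixed_p(np.array([1.0, 0.0]), A, p, eps=0.4)   # all mass moves to a_0
```

At \(\epsilon = 1\) the only feasible \({\mathbf {q}}\) is \({\mathbf {p}}\) itself, so the value reduces to the expectation, consistent with CVaR interpolating between the mean and the worst case.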

Remark 4

The sets \({\mathcal {U}}_\epsilon ^{\chi ^2}\), \({\mathcal {U}}_\epsilon ^G\) strongly resemble the uncertainty set for \({{\mathrm{\text {CVaR}}}}_\epsilon ^{{\hat{{\mathbb {P}}}}}\) in [12]. In fact, as \(N \rightarrow \infty \), all three of these sets converge almost surely to the set \({\mathcal {U}}^{{{\mathrm{\text {CVaR}}}}_\epsilon ^{{\mathbb {P}}^*}}\) defined by \(\delta ^*({\mathbf {v}}| {\mathcal {U}}^{{{\mathrm{\text {CVaR}}}}_\epsilon ^{{\mathbb {P}}^*}}) = {{\mathrm{\text {CVaR}}}}^{{\mathbb {P}}^*}_\epsilon \left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \). The key difference is that for finite N, \({\mathcal {U}}_\epsilon ^{\chi ^2}\) and \({\mathcal {U}}_\epsilon ^G\) imply a probabilistic guarantee for \({\mathbb {P}}^*\) at level \(\epsilon \), while \({\mathcal {U}}^{{{\mathrm{\text {CVaR}}}}_\epsilon ^{{\hat{{\mathbb {P}}}}}}\) does not.

Remark 5

Theorem 5 exemplifies the distinction drawn in the introduction between uncertainty sets for discrete probability distributions—such as \({\mathcal {P}}^{\chi ^2}\) or \({\mathcal {P}}^G\) which have been proposed in [9]—and uncertainty sets for general uncertain parameters like \({\mathcal {U}}^{\chi ^2}_\epsilon \) and \({\mathcal {U}}^G_\epsilon \). The relationship between these two types of sets is explicit in Eqs. (15) and (16) because we have known, finite support. For continuous support and our other sets, the relationship is implicit in the worst-case value-at-risk in Step 2 of our schema.

Remark 6

When representing \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}_\epsilon ^{\chi ^2}) \le t \}\) or \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}_\epsilon ^{G}) \le t \}\), it suffices to find auxiliary variables that are feasible in (17) or (18). Thus, these sets are second-order-cone and exponential-cone representable, respectively. Although theoretically tractable, the exponential cone can be numerically challenging.

Because of these numerical issues, modeling with \({\mathcal {U}}_\epsilon ^{\chi ^2}\) is perhaps preferable to modeling with \({\mathcal {U}}_\epsilon ^{G}\). Fortunately, for large N, the difference between these two sets is negligible:

Proposition 1

With arbitrarily high probability, for any \( {\mathbf {p}}\in {\mathcal {P}}^G\), \(| D({\hat{ {\mathbf {p}}}}, {\mathbf {p}}) - \sum _{j=0}^{n-1} \frac{({\hat{p}}_j - p_j)^2}{2p_j}| = O(nN^{-3})\).

Thus, for large N, \({\mathcal {P}}^G\) is approximately equal to \({\mathcal {P}}^{\chi ^2}\), and hence \({\mathcal {U}}_\epsilon ^{G}\) is approximately equal to \({\mathcal {U}}_\epsilon ^{\chi ^2}\). For large N, then, \({\mathcal {U}}_\epsilon ^{\chi ^2}\) should be preferred for its computational tractability.
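A quick numeric sanity check of this approximation, with a hypothetical pmf and a small perturbation standing in for a nearby member of the confidence region:

```python
import numpy as np

# Empirical pmf from N multinomial draws, and a nearby candidate pmf `p`.
rng = np.random.default_rng(0)
N = 10_000
p_true = np.array([0.25, 0.25, 0.25, 0.25])
p_hat = rng.multinomial(N, p_true) / N

p = p_hat + np.array([0.002, -0.002, 0.001, -0.001])  # still sums to 1

g_stat = np.sum(p_hat * np.log(p_hat / p))            # D(p_hat, p), cf. (12)
chi2_stat = np.sum((p_hat - p) ** 2 / (2 * p))        # Pearson-style statistic
gap = abs(g_stat - chi2_stat)                         # tiny, as Proposition 1 predicts
```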

4.1 A Numerical Example of \({\mathcal {U}}^{\chi ^2}_\epsilon \) and \({\mathcal {U}}^G_\epsilon \)

Figure 1 illustrates the sets \({\mathcal {U}}^{\chi ^2}_\epsilon \) and \({\mathcal {U}}^G_\epsilon \) with a particular numerical example. The true distribution is supported on the vertices of the given octagon. Each vertex is labeled with its true probability. When the support of \({\mathbb {P}}^*\) is known but no data are available, the only uncertainty set \({\mathcal {U}}\) which implies a probabilistic guarantee for \({\mathbb {P}}^*\) is the convex hull of these points. We construct the sets \({\mathcal {U}}^{\chi ^2}_\epsilon \) (grey line) and \({\mathcal {U}}^G_\epsilon \) (black line) for \(\alpha = \epsilon = 10\%\) for various N. For reference, we also plot \({\mathcal {U}}^{{{\mathrm{\text {CVaR}}}}_{\epsilon }^{{\mathbb {P}}^*}}\) (shaded region), which is the limit of both sets as \(N\rightarrow \infty \).

For small N, our data-driven sets are equivalent to the convex hull of \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\); however, as N increases, our sets shrink considerably. For large N, as predicted by Proposition 1, \({\mathcal {U}}^G_\epsilon \) and \({\mathcal {U}}^{\chi ^2}_\epsilon \) are very similarly shaped.
Fig. 1

The left panel shows the sets \({\mathcal {U}}^{\chi ^2}_\epsilon \) and \({\mathcal {U}}^G_\epsilon \), \(\alpha =\epsilon =10\%\). When \(N=0\), the smallest set which implies a probabilistic guarantee is \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\), the given octagon. As N increases, both sets shrink to the \({\mathcal {U}}^{{{\mathrm{\text {CVaR}}}}_\epsilon ^{{\mathbb {P}}^*}}\) given by the shaded region. The right panel shows the empirical distribution function and confidence region corresponding to the Kolmogorov–Smirnov test

Remark 7

Figure 1 also enables us to contrast our approach to that of Campi and Garatti [21]. Namely, suppose that \(f({\mathbf {u}}, {\mathbf {x}})\) is linear in \({\mathbf {u}}\). In this case, \({\mathbf {x}}\) satisfies \(f({\hat{{\mathbf {u}}}}^j, {\mathbf {x}}) \le 0\) for \(j=1, \ldots , N\), if and only if \(f({\mathbf {u}}, {\mathbf {x}}) \le 0\) for all \({\mathbf {u}}\in \text {conv}({\mathcal {A}})\) where \({\mathcal {A}} \equiv \{{\mathbf {a}}\in {{\mathrm{\text {supp}}}}({\mathbb {P}}^*) : \exists 1 \le j \le N \text { s.t. } {\mathbf {a}}= {\hat{{\mathbf {u}}}}^j \}\). As \(N\rightarrow \infty \), \({\mathcal {A}} \rightarrow {{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) almost surely. In other words, as \(N \rightarrow \infty \), the method of Campi and Garatti [21] in this case is equivalent to using the entire support as an uncertainty set, which is much larger than \({\mathcal {U}}^{{{\mathrm{\text {CVaR}}}}_\epsilon ^{{\mathbb {P}}^*}}\) above. Similar examples can be constructed with continuous distributions or the method of Calafiore and Monastero [19]. In each case, the critical observation is that these methods do not explicitly leverage the concave (or, in this case, linear) structure of \(f({\mathbf {u}}, {\mathbf {x}})\).

5 Independent Marginal Distributions

We next assume \({\mathbb {P}}^*\) may have continuous support, but the marginal distributions \({\mathbb {P}}^*_i\) are independent. Our strategy is to build a multivariate test by combining univariate tests for each marginal distribution.

5.1 Uncertainty Sets Built from the Kolmogorov–Smirnov Test

For this section, we assume that \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) is contained in a known, finite box \([ \hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}] \equiv \{ {\mathbf {u}}\in {{\mathbb {R}}}^d : {\hat{u}}_i^{(0)} \le u_i \le {\hat{u}}_i^{(N+1)}, \ \ i = 1, \ldots , d \}\).

Given a univariate measure \({\mathbb {P}}_{0, i}\), the Kolmogorov–Smirnov (KS) goodness-of-fit test applied to marginal i considers the null-hypothesis \(H_0: {\mathbb {P}}^*_i = {\mathbb {P}}_{0, i}\). It rejects this hypothesis if
$$\begin{aligned} \max _{j = 1, \ldots , N} \max \left( \frac{j}{N} - {\mathbb {P}}_{0, i}\left( {{\tilde{u}}}\le {\hat{u}}_i^{(j)}\right) , {\mathbb {P}}_{0, i}\left( {{\tilde{u}}}< {\hat{u}}_i^{(j)}\right) - \frac{j-1}{N} \right) > {\varGamma }^{KS}. \end{aligned}$$
where \({\hat{u}}_i^{(j)}\) is the \(j\text {th}\) smallest element among \({\hat{u}}_i^1, \ldots , {\hat{u}}_i^N\). Tables for \({\varGamma }^{KS}\) are widely available [47, 48].
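The left-hand side of this rejection rule is simple to compute; for a continuous null cdf the two inner terms are the classical \(D^+\) and \(D^-\) statistics, and scipy's `kstest` computes the same two-sided quantity. A sketch with hypothetical data:

```python
import numpy as np
from scipy import stats

def ks_statistic(sample, cdf):
    """max_j max( j/N - F0(u^{(j)}), F0(u^{(j)}) - (j-1)/N ) for a continuous
    null cdf F0 (continuity gives F0(u-) = F0(u))."""
    u = np.sort(sample)
    N = len(u)
    j = np.arange(1, N + 1)
    F0 = cdf(u)
    return max(np.max(j / N - F0), np.max(F0 - (j - 1) / N))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
d_manual = ks_statistic(x, stats.norm.cdf)
d_scipy = stats.kstest(x, "norm").statistic   # same two-sided statistic
```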
The confidence region of the above test for the i-th marginal distribution is
$$\begin{aligned} {\mathcal {P}}_i^{KS} = \left\{ {\mathbb {P}}_i \in {\varTheta }\left[ {\hat{u}}_i^{(0)}, {\hat{u}}_i^{(N+1)}\right] : \ {\mathbb {P}}_i\left( {{\tilde{u}}}_i \le {\hat{u}}_i^{(j)} \right) \ge \frac{j}{N} - {\varGamma }^{KS}, \ {\mathbb {P}}_i\left( {{\tilde{u}}}_i < {\hat{u}}_i^{(j)}\right) \le \frac{j-1}{N} + {\varGamma }^{KS}, \ j=1, \ldots , N \right\} , \end{aligned}$$
where \({\varTheta }[{\hat{u}}_i^{(0)}, {\hat{u}}_i^{(N+1)}]\) is the set of all Borel probability measures on \([{\hat{u}}_i^{(0)}, {\hat{u}}_i^{(N+1)}]\). Unlike \({\mathcal {P}}^{\chi ^2}\) and \({\mathcal {P}}^G\), this confidence region is infinite dimensional.

Figure 1 illustrates an example. The true distribution is a standard normal whose cumulative distribution function (cdf) is the dotted line. We draw \(N=100\) data points and form the empirical cdf (solid black line). The \(80\%\) confidence region of the KS test is the set of measures whose cdfs deviate from this solid line by no more than \({\varGamma }^{KS}\), i.e., whose cdfs lie within the grey region.

Now consider the multivariate null-hypothesis \(H_0: {\mathbb {P}}^* = {\mathbb {P}}_0\). Since \({\mathbb {P}}^*\) has independent components, the test which rejects if \({\mathbb {P}}_i\) fails the KS test at level \(\alpha ^\prime = 1- \root d \of {1-\alpha }\) for any i is a valid test. Namely, \({\mathbb {P}}^*_{\mathcal {S}}( {\mathbb {P}}^*_i \text { is accepted by KS at level } \alpha ^\prime \text { for all } i = 1, \ldots ,d ) = \prod _{i=1}^d (1-\alpha ^\prime ) = 1-\alpha \) by independence. The confidence region of this multivariate test is
$$\begin{aligned} {\mathcal {P}}^I =\, \Big \{ {\mathbb {P}}\in {\varTheta }\left[ \hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}\right] : {\mathbb {P}}= \prod _{i=1}^d {\mathbb {P}}_i, \ \ {\mathbb {P}}_i \in {\mathcal {P}}^{KS}_i \ \ i=1, \ldots , d \Big \}. \end{aligned}$$
(“I” in \({\mathcal {P}}^I\) emphasizes independence). We use this confidence region in Step 1 of our schema.
When the marginals are independent, Nemirovski and Shapiro [41] proved
$$\begin{aligned} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \le \inf _{\lambda \ge 0} \left( \lambda \log \left( 1/\epsilon \right) + \lambda \sum _{i=1}^d \log {{\mathbb {E}}}^{{\mathbb {P}}_i}\left[ e^{v_i{{\tilde{u}}}_i/\lambda }\right] \right) . \end{aligned}$$
This bound implies the worst-case bound
$$\begin{aligned} \sup _{{\mathbb {P}}\in {\mathcal {P}}^I} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \le \inf _{\lambda \ge 0} \left( \lambda \log (1/\epsilon ) + \lambda \sum _{i=1}^d \log \sup _{{\mathbb {P}}_i \in {\mathcal {P}}^{KS}_i }{{\mathbb {E}}}^{{\mathbb {P}}_i}\left[ e^{v_i{{\tilde{u}}}_i/\lambda }\right] \right) , \end{aligned}$$
(19)
which we use in Step 2 of our schema. We solve the inner-most supremum explicitly by leveraging the simple geometry of \({\mathcal {P}}^{KS}_i\). Intuitively, the worst-case distribution will either be the lefthand boundary or the righthand boundary of the region in Fig. 1 depending on the sign of \(v_i\). These boundaries are defined by the discrete distributions \( {\mathbf {q}}^L({\varGamma }), {\mathbf {q}}^R({\varGamma }) \in {\varDelta }_{N+2}\) supported on \({\hat{u}}_i^{(0)}, \ldots , {\hat{u}}_i^{(N+1)}\) and defined by
$$\begin{aligned} \begin{aligned} q^L_j({\varGamma }) = {\left\{ \begin{array}{ll} {\varGamma }&{} \text { if } j = 0, \\ \frac{1}{N} &{} \text { if } 1\le j \le \lfloor N (1- {\varGamma }) \rfloor , \\ 1 - {\varGamma }- \frac{\lfloor N (1- {\varGamma }) \rfloor }{N} &{} \text { if } j = \lfloor N (1- {\varGamma }) \rfloor + 1, \\ 0 &{}\text { otherwise,} \end{array}\right. } \end{aligned} \quad \quad \begin{aligned} q^R_j({\varGamma }) = q^L_{N+1-j}({\varGamma }), \ \ j = 0, \ldots , N+1. \end{aligned}\nonumber \\ \end{aligned}$$
(20)
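The boundary pmfs (20) are straightforward to materialize; a sketch (the choices \(\Gamma = 0.05\) and \(N = 100\) are arbitrary):

```python
import numpy as np

def q_left(gamma, N):
    """The boundary pmf q^L(Gamma) of (20) on the points u^(0), ..., u^(N+1)."""
    q = np.zeros(N + 2)
    k = int(np.floor(N * (1 - gamma)))
    q[0] = gamma                   # mass Gamma on the lower endpoint u^(0)
    q[1:k + 1] = 1.0 / N           # full weight on the k smallest data points
    q[k + 1] = 1 - gamma - k / N   # fractional remainder
    return q

def q_right(gamma, N):
    """q^R_j(Gamma) = q^L_{N+1-j}(Gamma): the reversed pmf."""
    return q_left(gamma, N)[::-1]

qL = q_left(0.05, N=100)
qR = q_right(0.05, N=100)
```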
(Recall that \(D(\cdot , \cdot )\) denotes the relative entropy, cf. (12).) Then, we have

Theorem 6

Suppose \({\mathbb {P}}^*\) has independent components, with \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*) \subseteq [\hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}]\). With probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), \(\{ {\mathcal {U}}^I_\epsilon : 0< \epsilon < 1\}\) simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\), where
$$\begin{aligned} \begin{aligned} {\mathcal {U}}^I_\epsilon&=\, \Biggr \{ {\mathbf {u}}\in {{\mathbb {R}}}^d : \ \exists \theta _i \in [0, 1], \ {\mathbf {q}}^i \in {\varDelta }_{N+2}, \ i = 1, \ldots , d,\\&\quad \sum _{j=0}^{N+1} {\hat{u}}_i^{(j)} q_j^i = u_i, \ i = 1, \ldots , d, \ \ \sum _{i=1}^d D\left( {\mathbf {q}}^i, \theta _i {\mathbf {q}}^L\left( {\varGamma }^{KS}\right) + \left( 1-\theta _i\right) {\mathbf {q}}^R\left( {\varGamma }^{KS}\right) \right) \le \log \left( 1/\epsilon \right) \Biggr \}. \end{aligned} \end{aligned}$$
(21)
Moreover,
$$\begin{aligned} \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}^I_\epsilon \right) = \inf _{\lambda \ge 0} \left\{ \lambda \log \left( 1/\epsilon \right) + \lambda \sum _{i=1}^d \log \left[ \max \left( \sum _{j=0}^{N+1} q^L_j({\varGamma }^{KS})e^{v_i{\hat{u}}_i^{(j)}/\lambda }, \sum _{j=0}^{N+1} q^R_j({\varGamma }^{KS})e^{v_i{\hat{u}}_i^{(j)}/\lambda } \right) \right] \right\} . \end{aligned}$$
(22)

Remark 8

When representing \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}^I_\epsilon ) \le t \}\), we can drop the infimum over \(\lambda \ge 0\) in (22). This set is exponential-cone representable, which, again, may be numerically challenging.

Remark 9

By contrast, because \( {\mathbf {q}}^L({\varGamma })\) (resp. \( {\mathbf {q}}^R({\varGamma })\)) is decreasing (resp. increasing) in its components, the lefthand branch of the innermost maximum in (22) will be attained when \(v_i \le 0\) and the righthand branch is attained otherwise. Thus, for fixed \({\mathbf {v}}\), the optimization problem in \(\lambda \) is convex and differentiable and can be efficiently solved with a line search. We can use this line search to identify a worst-case realization of \({\mathbf {u}}\) for a fixed \({\mathbf {v}}\). Specifically, let \(\lambda ^*\) be an optimal solution. Define
$$\begin{aligned} {\mathbf {p}}^i&= {\left\{ \begin{array}{ll} {\mathbf {q}}^L &{}\text { if } v_i \le 0, \\ {\mathbf {q}}^R &{}\text {otherwise,} \end{array}\right. } \quad \quad q_j^i = \frac{p_j^i e^{v_i {\hat{u}}_{i}^{(j)} /\lambda ^*}}{\sum _{j=0}^{N+1} p_j^i e^{v_i {\hat{u}}_{i}^{(j)} /\lambda ^*} }, \ \ j = 0, \ldots , N+1, \ \ i = 1, \ldots , d,\\ u^*_i&= \sum _{j=0}^{N+1} q_j^i {\hat{u}}_i^{(j)}, \ \ i = 1\ldots , d. \end{aligned}$$
Then \({\mathbf {u}}^* \in \arg \max _{{\mathbf {u}}\in {\mathcal {U}}^I_\epsilon } {\mathbf {v}}^T{\mathbf {u}}\). That this procedure is valid follows from the proof of Theorem 6.
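This recipe, a bounded convex line search over \(\lambda\) followed by the exponential tilting that recovers \({\mathbf {u}}^*\), can be sketched as follows. The sketch fixes the branch of the maximum in (22) by the sign of \(v_i\) as described above; the data, \(\Gamma\), and the simplified stand-in pmfs are our own assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def support_UI(v, u_sorted, qL, qR, eps):
    """Evaluate (22) by a bounded line search over lambda, then recover the
    worst-case realization u* by exponential tilting as in Remark 9.
    u_sorted has shape (d, N+2): the ordered points u_i^(0), ..., u_i^(N+1)."""
    P = np.where(v[:, None] <= 0, qL[None, :], qR[None, :])  # branch by sign(v_i)

    def obj(lam):
        logmgf = sum(np.log(P[i] @ np.exp(v[i] * u_sorted[i] / lam))
                     for i in range(len(v)))
        return lam * (np.log(1 / eps) + logmgf)

    res = minimize_scalar(obj, bounds=(1e-2, 1e2), method="bounded")
    lam = res.x
    W = P * np.exp(v[:, None] * u_sorted / lam)   # exponential tilting
    Q = W / W.sum(axis=1, keepdims=True)
    u_star = (Q * u_sorted).sum(axis=1)           # worst-case realization
    return res.fun, u_star

rng = np.random.default_rng(0)
N, d = 50, 2
u_sorted = np.sort(rng.normal(size=(d, N + 2)), axis=1)
gamma = 0.15
qL = np.r_[gamma, np.full(N, (1 - gamma) / N), 0.0]   # simplified stand-in for (20)
qR = qL[::-1]
v_test = np.array([1.0, -0.5])
val, u_star = support_UI(v_test, u_sorted, qL, qR, eps=0.1)
```

By convex duality, \({\mathbf {v}}^T {\mathbf {u}}^*\) matches the optimal value of the line search up to the search tolerance.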

Remark 10

The KS test is one of many goodness-of-fit tests based on the empirical distribution function (EDF), including the Kuiper (K), Cramér–von Mises (CvM), Watson (W) and Anderson–Darling (AD) tests [48, Chapt. 5]. We can define analogues of \({\mathcal {U}}^I_\epsilon \) for each of these tests, each having slightly different shape. Separating over \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}) \le t\}\) is polynomial-time tractable for each of these sets, but we no longer have a simple algorithm for generating violated cuts. Thus, these sets are considerably less attractive from a computational point of view. Fortunately, through simulation studies with a variety of different distributions, we have found that the version of \({\mathcal {U}}^I_\epsilon \) based on the KS test generally performs as well as or better than the other EDF tests. Consequently, we recommend using the sets \({\mathcal {U}}^I_\epsilon \) as described. For completeness, we present the constructions for the analogous tests in “Appendix 5”.

5.2 Uncertainty Sets Motivated by Forward and Backward Deviations

In [23], the authors propose an uncertainty set based on the forward and backward deviations of a distribution. Recall, for a univariate distribution \({\mathbb {P}}_i\), its forward and backward deviations are defined by
$$\begin{aligned} \sigma _{fi}({\mathbb {P}}_i)= & {} \sup _{x> 0 } \sqrt{-\frac{2\mu _i}{x} + \frac{2}{x^2} \log \left( {{\mathbb {E}}}^{{\mathbb {P}}_i}\left[ e^{x {{\tilde{u}}}_i}\right] \right) } ,\nonumber \\ \sigma _{bi}({\mathbb {P}}_i)= & {} \sup _{x > 0 } \sqrt{\frac{2\mu _i}{x} + \frac{2}{x^2} \log \left( {{\mathbb {E}}}^{{\mathbb {P}}_i}\left[ e^{-x {{\tilde{u}}}_i}\right] \right) }, \end{aligned}$$
(23)
where \( {{\mathbb {E}}}^{{\mathbb {P}}_i}[{{\tilde{u}}}_i] = \mu _i\). The optimizations defining \(\sigma _{fi}({\mathbb {P}}_i), \sigma _{bi}({\mathbb {P}}_i)\) are one dimensional, convex problems which can be solved by a line search.

Chen et al. [23] focus on a non-data-driven setting, where the mean and support of \({\mathbb {P}}^*\) are known a priori, and show how to upper bound these deviations to calibrate their set. In a setting where one has data and a priori knows the mean of \({\mathbb {P}}^*\) precisely, they propose a method based on sample average approximation to estimate these deviations. Unfortunately, the precise statistical behavior of these estimators is not known, so it is not clear that this set calibrated from data implies a probabilistic guarantee with high probability with respect to \({\mathbb {P}}_{\mathcal {S}}\).

In this section, we use our schema to generalize the set of Chen et al. [23] to a data-driven setting where neither the mean of the distribution nor its support are known. Our set differs in shape and size from their proposal, and, unlike their original proposal, will simultaneously imply a probabilistic guarantee for \({\mathbb {P}}^*\).

We begin by creating an appropriate multivariate hypothesis test. To streamline the exposition, we assume throughout this section \({\mathbb {P}}^*\) has bounded (but potentially unknown) support. This assumption ensures both \(\sigma _{fi}({\mathbb {P}}_i), \sigma _{bi}({\mathbb {P}}_i)\) are finite [23].

Let \(\alpha ^\prime = 1- \root d \of {1-\alpha }\). For a given \(\mu _{0,i}, \sigma _{0, fi}, \sigma _{0, bi} \in {{\mathbb {R}}}\), consider the following null-hypotheses
$$\begin{aligned} H_0^1: {{\mathbb {E}}}^{{\mathbb {P}}^*_i}\left[ {{\tilde{u}}}_i\right] = \mu _{0, i}, \ \ H_0^2: \sigma _{fi}\left( {\mathbb {P}}^*_i\right) \le \sigma _{0, fi}, \ \ H_0^3: \sigma _{bi}\left( {\mathbb {P}}^*_i\right) \le \sigma _{0, bi} \end{aligned}$$
(24)
and the three tests that reject if \(| {\hat{\mu }}_i - \mu _{0,i} | > t_i\), \(\sigma _{fi}({\hat{{\mathbb {P}}}}_i) > {\overline{\sigma }}_{fi}\) and \(\sigma _{bi}({\hat{{\mathbb {P}}}}_i) > {\overline{\sigma }}_{bi}\), respectively. Pick the thresholds \(t_i\), \({\overline{\sigma }}_{fi}\) and \({\overline{\sigma }}_{bi}\) so that these tests are valid at levels \(\alpha ^\prime /2\), \(\alpha ^\prime /4\), and \(\alpha ^\prime /4\), respectively. Since these three tests are not common in applied statistics, there are no tables for their thresholds. In practice, however, we will compute approximate thresholds for each test using the bootstrap (Algorithm 1). By the union bound, the test that rejects if any of these three tests rejects is valid at level \(\alpha ^\prime \) for the null-hypothesis that \(H_0^1\), \(H_0^2\) and \(H_0^3\) are all true. The confidence region of this test is
$$\begin{aligned} {\mathcal {P}}^{FB}_i = \{{\mathbb {P}}_i \in {\varTheta }\left( -\infty , \infty \right) : m_{bi} \le {{\mathbb {E}}}^{{\mathbb {P}}_i}\left[ {{\tilde{u}}}_i\right] \le m_{fi}, \ \ \sigma _{fi}\left( {\mathbb {P}}_i\right) \le {\overline{\sigma }}_{fi}, \ \ \sigma _{bi}\left( {\mathbb {P}}_i\right) \le {\overline{\sigma }}_{bi} \}, \end{aligned}$$
where \(m_{bi} = {\hat{\mu }}_i - t_i\) and \(m_{fi} = {\hat{\mu }}_i + t_i\).
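Since Algorithm 1 is a percentile bootstrap, one plausible sketch of the threshold computation is the following; the resampling scheme and the function names are our assumptions, not a transcription of the paper's pseudocode:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_threshold(data, statistic, level, n_boot=2000):
    """(1 - level) quantile of the statistic over bootstrap resamples."""
    n = len(data)
    stats = [statistic(rng.choice(data, size=n, replace=True))
             for _ in range(n_boot)]
    return np.quantile(stats, 1.0 - level)

# Example: the threshold t_i for the mean test at level alpha'/2 = 0.025.
data = rng.normal(size=200)
mu_hat = data.mean()
t_i = bootstrap_threshold(data, lambda s: abs(s.mean() - mu_hat), level=0.025)
```

The thresholds \({\overline{\sigma }}_{fi}\) and \({\overline{\sigma }}_{bi}\) would be obtained the same way, with the statistic replaced by the forward or backward deviation of the resampled data.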
Now consider the multivariate null-hypothesis and test
$$\begin{aligned} H_0: {{\mathbb {E}}}^{{\mathbb {P}}^*_i}\left[ {{\tilde{u}}}_i\right] = \mu _{0, i}, \ \sigma _{fi}\left( {\mathbb {P}}^*_i\right) \le \sigma _{0, fi}, \ \sigma _{bi}\left( {\mathbb {P}}^*_i\right) \le \sigma _{0, bi} \quad \forall i =1, \ldots , d,\nonumber \\ \text {Reject if } | {\hat{\mu }}_i - \mu _{0,i} |> t_i \text { or } \sigma _{fi}\left( {\hat{{\mathbb {P}}}}_i\right)> {\overline{\sigma }}_{fi} \text { or } \sigma _{bi}\left( {\hat{{\mathbb {P}}}}_i\right) > {\overline{\sigma }}_{bi} \text { for any } i =1, \ldots , d \end{aligned}$$
(25)
where \(t_i, {\overline{\sigma }}_{fi}, {\overline{\sigma }}_{bi}\) are valid thresholds for the previous univariate tests (24) at levels \(\alpha ^\prime /2\), \(\alpha ^\prime /4\) and \(\alpha ^\prime /4\), respectively. As in Sect. 5, this is a valid test at level \(\alpha \). Its confidence region is \({\mathcal {P}}^{FB} = \{ {\mathbb {P}}: {\mathbb {P}}_i \in {\mathcal {P}}^{FB}_i, \ i = 1, \ldots , d \}.\) We use this confidence region in Step 1 of our schema.
When the mean and deviations for \({\mathbb {P}}\) are known and the marginals are independent, Chen et al. [23] prove
$$\begin{aligned} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \le \sum _{i=1}^d {{\mathbb {E}}}^{\mathbb {P}}\left[ {{\tilde{u}}}_i\right] v_i + \sqrt{ 2 \log \left( 1/\epsilon \right) \left( \sum _{i : v_i < 0 } \sigma _{bi}^2({\mathbb {P}}_i) v_i^2 + \sum _{i : v_i \ge 0 } \sigma _{fi}^2({\mathbb {P}}_i) v_i^2 \right) }.\nonumber \\ \end{aligned}$$
(26)
Computing the worst-case value of this bound over the above confidence region in Step 2 of our schema yields:

Theorem 7

Suppose \({\mathbb {P}}^*\) has independent components and bounded support. Let \(t_i\), \({\overline{\sigma }}_{fi}\) and \({\overline{\sigma }}_{bi}\) be thresholds such that (25) is a valid test at level \(\alpha \). With probability \(1-\alpha \) with respect to the sample, the family \(\{{\mathcal {U}}^{FB}_\epsilon : 0< \epsilon < 1 \}\) simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\), where
$$\begin{aligned} {\mathcal {U}}^{FB}_\epsilon= & {} \left\{ {\mathbf {y}}_1 + {\mathbf {y}}_2 - {\mathbf {y}}_3 : {\mathbf {y}}_2, {\mathbf {y}}_3 \in {{\mathbb {R}}}^d_+, \ \ \sum _{i=1}^d \frac{y_{2i}^2 }{2 {\overline{\sigma }}_{fi}^2} + \frac{y_{3i}^2 }{2 {\overline{\sigma }}_{bi}^2} \right. \nonumber \\\le & {} \left. \log (1/\epsilon ), \ \ m_{bi} \le y_{1i} \le m_{fi}, \ \ i = 1, \ldots , d \right\} . \end{aligned}$$
(27)
Moreover,
$$\begin{aligned} \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}^{FB}_\epsilon \right)= & {} \sum _{i : v_i \ge 0} m_{fi} v_i + \sum _{i: v_i< 0 } m_{bi} v_i \nonumber \\&+ \sqrt{ 2 \log \left( 1/\epsilon \right) \left( \sum _{i: v_i \ge 0} {\overline{\sigma }}_{fi}^2 v_i^2 + \sum _{i: v_i < 0 } {\overline{\sigma }}_{bi}^2 v_i^2\right) } \end{aligned}$$
(28)

Remark 11

From (28), \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}^{FB}_\epsilon ) \le t \}\) is second order cone representable. We can identify a worst-case realization of \({\mathbf {u}}\) in closed-form. Given \({\mathbf {v}}\), let
$$\begin{aligned} \lambda = \sqrt{\frac{\sum _{i: v_i> 0 } v_i^2 {\overline{\sigma }}_{fi}^2 + \sum _{i: v_i \le 0 } v_i^2 {\overline{\sigma }}_{bi}^2}{2 \log (1/\epsilon )} }, \quad u^*_i = {\left\{ \begin{array}{ll} m_{fi} + \frac{v_i {\overline{\sigma }}_{fi}^2}{\lambda } &{}\text { if } v_i > 0\\ \ m_{bi} + \frac{v_i {\overline{\sigma }}_{bi}^2}{\lambda } &{} \text { otherwise.} \end{array}\right. } \end{aligned}$$
Then \({\mathbf {u}}^* \in \arg \max _{{\mathbf {u}}\in {\mathcal {U}}^{FB}_\epsilon } {\mathbf {v}}^T{\mathbf {u}}\). The correctness of this procedure follows from the proof of Theorem 7.
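The closed form in Remark 11 is straightforward to implement; the sketch below simply transcribes the formulas (names and vectorization are ours):

```python
import numpy as np

def worst_case_u_fb(v, m_f, m_b, sig_f, sig_b, eps):
    """Closed-form worst case over U^FB_eps, transcribing Remark 11."""
    pos = v > 0
    sig2 = np.where(pos, sig_f**2, sig_b**2)   # sigma_fi^2 if v_i > 0, else sigma_bi^2
    lam = np.sqrt((v**2 * sig2).sum() / (2.0 * np.log(1.0 / eps)))
    return np.where(pos, m_f, m_b) + v * sig2 / lam
```

By construction, \({\mathbf {v}}^T{\mathbf {u}}^*\) equals the support function (28).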

Remark 12

\({\mathcal {U}}^{FB}_\epsilon \) need not be contained within \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\). If a priori information about \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) is known, we should apply Theorem 4 to refine \({\mathcal {U}}^{FB}_\epsilon \) to the smaller intersection \({\mathcal {U}}^{FB}_\epsilon \cap \text {conv}({{\mathrm{\text {supp}}}}({\mathbb {P}}^*))\).

5.3 Comparing \({\mathcal {U}}^I_\epsilon \) and \({\mathcal {U}}^{FB}_\epsilon \)

Figure 2 illustrates the sets \({\mathcal {U}}^I_\epsilon \) and \({\mathcal {U}}^{FB}_\epsilon \) numerically. The marginal distributions of \({\mathbb {P}}^*\) are independent and their densities are given in the left panel. Notice that the first marginal is symmetric while the second is highly skewed.

In the absence of data, knowing only \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) and that \({\mathbb {P}}^*\) has independent components, the smallest uncertainty set which implies a probabilistic guarantee is the unit square (dotted line). With \(N=100\) data points from this distribution (blue circles), however, we can construct both \({\mathcal {U}}^I_\epsilon \) (dashed black line) and \({\mathcal {U}}^{FB}_\epsilon \) (solid black line) with \(\epsilon = \alpha = 10\%\), as shown. We also plot the limiting shape of these two sets as \(N \rightarrow \infty \) (corresponding grey lines).
Fig. 2

The left panel shows the marginal densities. The right panel shows \({\mathcal {U}}^I_\epsilon \) (dashed black line) and \({\mathcal {U}}^{FB}_\epsilon \) (solid black line) built from \(N=100\) data points (blue circles) and in the limit as \(N\rightarrow \infty \) (corresponding grey lines) (color figure online)

Several features are evident from the plots. First, both sets are able to learn from the data that \({\mathbb {P}}^*\) is symmetric in its first coordinate (the sets display vertical symmetry) and that \({\mathbb {P}}^*\) is skewed downwards in its second coordinate (the sets taper more sharply towards the top). Second, although \({\mathcal {U}}^I_\epsilon \) is a strict subset of \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\), \({\mathcal {U}}^{FB}_\epsilon \) is not. Finally, neither set is a subset of the other, and, although for \(N=100\), \({\mathcal {U}}^{FB}_\epsilon \cap {{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) has smaller volume than \({\mathcal {U}}^I_\epsilon \), the reverse holds for larger N. Consequently, the best choice of set likely depends on N.

6 Uncertainty Sets Built from Marginal Samples

In this section, we observe samples from the marginal distributions of \({\mathbb {P}}^*\) separately, but do not assume these marginals are independent. This happens, e.g., when samples are drawn asynchronously, or when there are many missing values. In these cases, it is impossible to learn the joint distribution of \({\mathbb {P}}^*\) from the data. To streamline the exposition, we assume that we observe exactly N samples of each marginal distribution. The results generalize to the case of different numbers of samples at the expense of more notation.

In the univariate case, David and Nagaraja [24] develop a hypothesis test for the \(1-\epsilon /d\) quantile, or equivalently \(\text {VaR}_{\epsilon /d}^{{\mathbb {P}}_i}({{\tilde{u}}}_i)\), of a distribution \({\mathbb {P}}_i\). Namely, given \( {\overline{q}}_{i,0} \in {{\mathbb {R}}}\), consider the hypothesis \(H_{0, i}: \text {VaR}_{\epsilon /d}^{{\mathbb {P}}^*}({{\tilde{u}}}_i) \ge {\overline{q}}_{i,0}\). Define the index s by
$$\begin{aligned} s = \min \left\{ k \in {\mathbb {N}}: \sum _{j = k}^N \left( {\begin{array}{c}N\\ j\end{array}}\right) \left( \epsilon /d\right) ^{N-j} \left( 1-\epsilon /d\right) ^{j} \le \frac{\alpha }{2d} \right\} , \end{aligned}$$
(29)
and let \(s = N+1\) if the corresponding set is empty. Then, the test which rejects if \({\overline{q}}_{i, 0} > {\hat{u}}_i^{(s)}\) is valid at level \(\alpha /2d\) [24, Sect. 7.1]. David and Nagaraja [24] also prove that \(\frac{s}{N} \downarrow (1-\epsilon /d)\) as \(N \rightarrow \infty \).
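The index s can be computed by a direct scan of the binomial tail in (29), for example:

```python
from math import comb

def quantile_index(N, eps, d, alpha):
    """Smallest index s satisfying (29); returns N + 1 if the set is empty."""
    p = 1.0 - eps / d
    for k in range(N + 1):
        # Binomial tail: sum_{j=k}^N C(N, j) (eps/d)^{N-j} (1 - eps/d)^j
        tail = sum(comb(N, j) * (eps / d) ** (N - j) * p ** j
                   for j in range(k, N + 1))
        if tail <= alpha / (2 * d):
            return k
    return N + 1
```

For instance, with \(N=100\), \(\epsilon = 10\%\), \(d=2\), \(\alpha = 10\%\) this gives \(s = 100\); as N grows, \(s/N\) approaches \(1 - \epsilon /d = 0.95\) from above.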

The above argument applies symmetrically to the hypothesis \(H_{0, i}: \text {VaR}_{\epsilon /d}^{{\mathbb {P}}^*}(-{{\tilde{u}}}_i) \ge {\underline{q}}_{i,0}\) where the rejection threshold now becomes \({\hat{u}}_i^{(N-s + 1)}\). In the typical case when \(\epsilon /d\) is small, \(N - s +1 < s\) so that \({\hat{u}}_i^{(N-s + 1)} \le {\hat{u}}_i^{(s)}\).

Next given \({\overline{q}}_{i, 0}, {\underline{q}}_{i, 0} \in {{\mathbb {R}}}\) for \(i=1, \ldots , d\), consider the multivariate hypothesis:
$$\begin{aligned}&H_0: \text {VaR}_{\epsilon /d}^{{\mathbb {P}}^*}\left( {{\tilde{u}}}_i\right) \ge {\overline{q}}_{i, 0} \;\; \text { and }\;\; \text {VaR}_{\epsilon /d}^{{\mathbb {P}}^*}\left( -{{\tilde{u}}}_i\right) \ge {\underline{q}}_{i, 0} \quad \text { for all } i = 1, \ldots , d. \end{aligned}$$
By the union bound, the test which rejects if \({\hat{u}}_i^{(s)} < {\overline{q}}_{i, 0}\) or \(-{\hat{u}}_i^{(N-s+1)} < {\underline{q}}_{i, 0}\), i.e., if one of the above univariate tests rejects for the i-th component, is valid at level \(\alpha \). Its confidence region is
$$\begin{aligned} {\mathcal {P}}^M&= \left\{ {\mathbb {P}}\in {\varTheta }\left[ \hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}\right] : \ \ \text {VaR}_{\epsilon /d}^{{\mathbb {P}}_i}({{\tilde{u}}}_i) \le {\hat{u}}_i^{(s)}, \ \ \right. \\&\quad \left. \text {VaR}_{\epsilon /d}^{{\mathbb {P}}_i} (-{{\tilde{u}}}_i)\le -{\hat{u}}_i^{(N-s+1)}, \ \ i=1, \ldots , d \right\} . \end{aligned}$$
Here “M” is to emphasize “marginals.” We use this confidence region in Step 1 of our schema.
When the marginals of \({\mathbb {P}}\) are known, Embrechts et al. [27] proves
$$\begin{aligned} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \le \min _{\varvec{\lambda }: {\mathbf {e}}^T\varvec{\lambda }= \epsilon } \sum _{i=1}^d \text {VaR}_{\lambda _i}^{{\mathbb {P}}}\left( v_i{{\tilde{u}}}_i\right) \le \sum _{i=1}^d\text {VaR}_{\epsilon /d}^{{\mathbb {P}}}\left( v_i {{\tilde{u}}}_i\right) \end{aligned}$$
(30)
where the last inequality is obtained by letting \(\lambda _i = \epsilon /d\) for all i. From our schema,

Theorem 8

If s defined by Eq. (29) satisfies \(N-s+1 < s\), then, with probability at least \(1-\alpha \) over the sample, the set
$$\begin{aligned} {\mathcal {U}}^M_\epsilon = \left\{ {\mathbf {u}}\in {{\mathbb {R}}}^d: {\hat{u}}_i^{(N-s+1)} \le u_i \le {\hat{u}}_i^{(s)}, \ \ i =1, \ldots , d \right\} \end{aligned}$$
(31)
implies a probabilistic guarantee for \({\mathbb {P}}^*\) at level \(\epsilon \). Moreover,
$$\begin{aligned} \delta ^*({\mathbf {v}}| \ {\mathcal {U}}^M_\epsilon ) = \sum _{i=1}^d \max \left( v_i {\hat{u}}_i^{(N-s+1)}, v_i {\hat{u}}_i^{(s)}\right) . \end{aligned}$$
(32)

Remark 13

Notice that the family \(\{ {\mathcal {U}}^M_\epsilon : 0< \epsilon < 1 \}\) may not simultaneously imply a probabilistic guarantee for \({\mathbb {P}}^*\) because the confidence region \({\mathcal {P}}^M\) depends on \(\epsilon \).

Remark 14

Since \({\mathcal {U}}^M_\epsilon \) is a simple box, the set \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| {\mathcal {U}}^M_\epsilon ) \le t \}\) is representable by linear inequalities. From (32), a worst-case realization is given by \(u^*_i = {\hat{u}}_i^{(s)} {\mathbb {I}}( v_i \ge 0) + {\hat{u}}_i^{(N-s + 1)} {\mathbb {I}}(v_i < 0 )\).
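Both (32) and this worst case amount to a componentwise comparison; a minimal sketch (names ours):

```python
import numpy as np

def support_and_argmax_box(v, lo, hi):
    """delta*(v | U^M_eps) from (32) and a maximizer, for the box [lo, hi]
    with lo_i = u_i^{(N-s+1)}, hi_i = u_i^{(s)}.
    Ties at v_i = 0 are broken arbitrarily toward hi."""
    u_star = np.where(v >= 0, hi, lo)
    return float(v @ u_star), u_star
```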

7 Uncertainty Sets for Potentially Non-independent Components

In this section, we assume we observe samples drawn from the joint distribution of \({\mathbb {P}}^*\), which may have unbounded support. We consider a goodness-of-fit hypothesis test based on linear-convex ordering proposed in [15]. Specifically, given some multivariate \({\mathbb {P}}_0\), consider the null-hypothesis \(H_0: {\mathbb {P}}^* = {\mathbb {P}}_0\). Bertsimas et al. [15] prove that the test which rejects \(H_0\) if there exists \(({\mathbf {a}}, b) \in {\mathcal {B}} \equiv \{({\mathbf {a}}, b) \in {{\mathbb {R}}}^d \times {{\mathbb {R}}}: \Vert {\mathbf {a}}\Vert _1 + | b | \le 1\}\) such that
$$\begin{aligned}&{{\mathbb {E}}}^{{\mathbb {P}}_0}\left[ \left( {\mathbf {a}}^T{\tilde{{\mathbf {u}}}}- b\right) ^+\right] - \frac{1}{N} \sum _{j=1}^N \left( {\mathbf {a}}^T\hat{{\mathbf {u}}}^j - b\right) ^+ \\&\quad> {\varGamma }_{LCX} \ \ \text { or } \ \ \frac{1}{N} \sum _{j=1}^N \left( \hat{{\mathbf {u}}}^j\right) ^T \hat{{\mathbf {u}}}^j -{{\mathbb {E}}}^{{\mathbb {P}}_0}\left[ {\tilde{{\mathbf {u}}}}^T {\tilde{{\mathbf {u}}}}\right] > {\varGamma }_\sigma \end{aligned}$$
for appropriate thresholds \({\varGamma }_{LCX}, {\varGamma }_\sigma \) is a valid test at level \(\alpha \). The authors provide an explicit bootstrap algorithm to compute \({\varGamma }_{LCX}, {\varGamma }_\sigma \) as well as exact formulae for upper-bounding these quantities.
The confidence region of this test is
$$\begin{aligned} {\mathcal {P}}^{LCX}&=\, \Biggl \{ {\mathbb {P}}\in {\varTheta }({{\mathbb {R}}}^d): \ {{\mathbb {E}}}^{\mathbb {P}}\left[ \left( {\mathbf {a}}^T{\tilde{{\mathbf {u}}}}- b\right) ^+ \right] \le \frac{1}{N} \sum _{j=1}^N \left( {\mathbf {a}}^T\hat{{\mathbf {u}}}_j - b\right) ^+ \nonumber \\&\quad + {\varGamma }_{LCX} \ \ \forall ({\mathbf {a}}, b) \in {\mathcal {B}}, \nonumber \\&\quad {{\mathbb {E}}}^{\mathbb {P}}\left[ \Vert {\tilde{{\mathbf {u}}}}\Vert ^2 \right] \ge \frac{1}{N}\sum _{j=1}^N \Vert \hat{{\mathbf {u}}}_j \Vert ^2 - {\varGamma }_\sigma \Biggr \}. \end{aligned}$$
(33)
We use this confidence region in Step 1 of our schema. By explicitly computing the worst-case Value-at-Risk and applying our schema,

Theorem 9

The family \(\{ {\mathcal {U}}^{LCX}_\epsilon : 0< \epsilon < 1 \}\) simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\) where
$$\begin{aligned} {\mathcal {U}}^{LCX}_\epsilon&= \,\Biggl \{ {\mathbf {u}}\in {{\mathbb {R}}}^d: \ \exists {\mathbf {r}}\in {{\mathbb {R}}}^d, \ 1 \le z \le 1/\epsilon , \ {\mathbf {s}}^1, {\mathbf {s}}^2, {\mathbf {s}}^3 \in {{\mathbb {R}}}^N \ \text { s.t. } \nonumber \\&\qquad {\mathbf {0}}\le {\mathbf {s}}^k \le \frac{z}{N} {\mathbf {e}}, \ \ k = 1, 2, 3, \nonumber \\&\qquad | z - {\mathbf {e}}^T{\mathbf {s}}^1 | \le {\varGamma }_{LCX}, \ \ | (z-1) - {\mathbf {e}}^T {\mathbf {s}}^2 | \le {\varGamma }_{LCX}, \ \ | 1 - {\mathbf {e}}^T{\mathbf {s}}^3 | \le {\varGamma }_{LCX}, \nonumber \\&\qquad \Vert {\mathbf {r}}+ {\mathbf {u}}- \sum _{j=1}^N s^1_j \hat{{\mathbf {u}}}_j \Vert _\infty \le {\varGamma }_{LCX}, \ \ \Vert {\mathbf {r}}- \sum _{j=1}^N s^2_j \hat{{\mathbf {u}}}_j \Vert _\infty \le {\varGamma }_{LCX}, \ \ \nonumber \\&\qquad \Vert {\mathbf {u}}- \sum _{j=1}^N s^3_j \hat{{\mathbf {u}}}_j \Vert _\infty \le {\varGamma }_{LCX} \Biggr \}. \end{aligned}$$
(34)
Moreover, \(\delta ^*({\mathbf {v}}| \ {\mathcal {U}}^{LCX}_\epsilon ) = \sup _{{\mathbb {P}}\in {\mathcal {P}}^{LCX}} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \) where
$$\begin{aligned} \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}^{LCX}_\epsilon \right) \! =\!&\min _{\tau , \theta , \varvec{\alpha }, \varvec{\beta }, {\mathbf {y}}^1, {\mathbf {y}}^2, {\mathbf {y}}^3} \quad \frac{1}{\epsilon }\tau - \theta {+} {\varGamma }_{LCX} \Vert \varvec{\alpha }\Vert _1 {+} 2 {\varGamma }_{LCX} \Vert \varvec{\beta }\Vert _1 {+} {\varGamma }_{LCX} \Vert {\mathbf {v}}{+} \varvec{\beta }\Vert _1 \nonumber \\ \text {s.t.}&\quad -\theta + \tau + \alpha _1 + \alpha _2 = \frac{1}{N} \sum _{j=1}^N \left( y^1_j + y^2_j + y^3_j\right) \nonumber \\&\alpha _1 - \varvec{\beta }^T \hat{{\mathbf {u}}}_j \le y^1_j, \ \ \alpha _2 + \varvec{\beta }^T \hat{{\mathbf {u}}}_j \le y^2_j, \ \ \alpha _3 + \varvec{\beta }^T \hat{{\mathbf {u}}}_j + {\mathbf {v}}^T\hat{{\mathbf {u}}}_j \le y^3_j, \nonumber \\&j=1, \ldots , N, \nonumber \\&\tau , \theta \ge 0, \ \ {\mathbf {y}}^1, {\mathbf {y}}^2, {\mathbf {y}}^3 \ge {\mathbf {0}}\end{aligned}$$
(35)

Remark 15

By adding auxiliary variables, we can represent \({\mathcal {U}}_\epsilon ^{LCX}\) as the intersection of linear inequalities. Robust constraints over \({\mathcal {U}}_\epsilon ^{LCX}\) are thus tractable.

Remark 16

We stress that the robust constraint \(\max _{{\mathbf {u}}\in {\mathcal {U}}_\epsilon ^{LCX}} {\mathbf {v}}^T{\mathbf {u}}\le 0\) is exactly equivalent to the ambiguous chance-constraint \( \sup _{{\mathbb {P}}\in {\mathcal {P}}^{LCX}} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \le 0\) above.

8 Hypothesis Testing: A Unifying Perspective

Several data-driven methods in the literature create families of measures \({\mathcal {P}}({\mathcal {S}})\) that contain \({\mathbb {P}}^*\) with high probability. These methods do not explicitly reference hypothesis testing. In this section, we provide a hypothesis testing interpretation of two such methods [25, 46]. Leveraging this new perspective, we show how standard techniques for hypothesis testing, such as the bootstrap, can be used to improve upon these methods. Finally, we illustrate how our schema can be applied to these improved families of measures to generate new uncertainty sets. To the best of our knowledge, generating uncertainty sets for (1) is a new application of both [25, 46].

The key idea in both cases is to recast \({\mathcal {P}}({\mathcal {S}})\) as the confidence region of a hypothesis test. This correspondence is not unique to these methods. There is a one-to-one correspondence between families of measures which contain \({\mathbb {P}}^*\) with probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\) and the confidence regions of hypothesis tests. This correspondence is sometimes called the “duality between confidence regions and hypothesis testing” in the statistical literature [42]. It implies that any data-driven method predicated on a family of measures that contain \({\mathbb {P}}^*\) with probability \(1-\alpha \) can be interpreted in the light of hypothesis testing.

This observation is interesting for two reasons. First, it provides a unified framework to compare distinct methods in the literature and ties them to the well-established theory of hypothesis testing in statistics. Secondly, there is a wealth of practical experience with hypothesis testing. In particular, we know empirically which tests are best suited to various applications and which tests perform well even when the underlying assumptions on \({\mathbb {P}}^*\) that motivated the test may be violated. In the next section, we leverage some of this practical experience with hypothesis testing to strengthen these methods, and then derive uncertainty sets corresponding to these hypothesis tests to facilitate comparison between the approaches.

8.1 Uncertainty Set Motivated by Cristianini and Shawe-Taylor 2003

Let \(\Vert \cdot \Vert _F\) denote the Frobenius norm of matrices. In a particular machine learning context, Shawe-Taylor and Cristianini [46] prove

Theorem 10

(Cristianini and Shawe-Taylor, 2003) Suppose that \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) is contained within the ball of radius R and that \(N > (2 + 2 \log (2/\alpha ))^2.\) Then, with probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), \({\mathbb {P}}^* \in {\mathcal {P}}^{CS}({\varGamma }_1(\alpha /2, N), {\varGamma }_2(\alpha /2, N))\), where
$$\begin{aligned} {\mathcal {P}}^{CS}\left( {\varGamma }_1, {\varGamma }_2\right)= & {} \left\{ {\mathbb {P}}\in {\varTheta }(R) : \Vert {{\mathbb {E}}}^{\mathbb {P}}\left[ {\tilde{{\mathbf {u}}}}\right] - {\hat{\varvec{\mu }}} \Vert _2 \le {\varGamma }_1 \quad \text { and } \right. \\&\left. \Vert {{\mathbb {E}}}^{\mathbb {P}}\left[ {\tilde{{\mathbf {u}}}}{\tilde{{\mathbf {u}}}}^T\right] - {{\mathbb {E}}}^{\mathbb {P}}\left[ {\tilde{{\mathbf {u}}}}\right] {{\mathbb {E}}}^{\mathbb {P}}\left[ {\tilde{{\mathbf {u}}}}^T\right] - {\hat{\varvec{{\varSigma }}}} \Vert _F \le {\varGamma }_2 \right\} \end{aligned}$$
where \({\hat{\varvec{\mu }}}, {\hat{\varvec{{\varSigma }}}}\) denote the sample mean and covariance,
$$\begin{aligned} {\varGamma }_1(\alpha , N) = \frac{R}{\sqrt{N}} \left( 2 + \sqrt{2 \log 1/\alpha } \right) , \quad {\varGamma }_2(\alpha , N) = \frac{2R^2}{\sqrt{N}} \left( 2 + \sqrt{2 \log 2/\alpha } \right) , \end{aligned}$$
and \({\varTheta }(R)\) denotes the set of Borel probability measures supported on the ball of radius R.

We note that the key step in their proof utilizes a general-purpose concentration inequality to compute \({\varGamma }_1(\alpha , N)\) and \({\varGamma }_2(\alpha , N)\) (cf. [46, Theorem 1]).
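The thresholds themselves are simple closed forms and can be transcribed directly; evaluating them at \(\alpha = 10\%\), \(N = 100\), \(R = 9.2\) reproduces the corresponding entries of Table 2:

```python
import math

def gamma_1(alpha, N, R):
    # Gamma_1(alpha, N) from Theorem 10
    return R / math.sqrt(N) * (2.0 + math.sqrt(2.0 * math.log(1.0 / alpha)))

def gamma_2(alpha, N, R):
    # Gamma_2(alpha, N) from Theorem 10
    return 2.0 * R**2 / math.sqrt(N) * (2.0 + math.sqrt(2.0 * math.log(2.0 / alpha)))
```

The \(\infty \) entries in Table 2 correspond to values of N violating the requirement \(N > (2 + 2 \log (2/\alpha ))^2 \approx 63.9\) of Theorem 10.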

On the other hand, \({\mathcal {P}}^{CS}({\varGamma }_1(\alpha /2, N), {\varGamma }_2(\alpha /2, N))\) is also the \(1-\alpha \) confidence region of a hypothesis test for the mean and covariance of \({\mathbb {P}}^*\). Namely, consider the null-hypothesis and test
$$\begin{aligned}&H_0 : {{\mathbb {E}}}^{{\mathbb {P}}^*}[ {\tilde{{\mathbf {u}}}}] = \varvec{\mu }_0 \text { and } {{\mathbb {E}}}^{{\mathbb {P}}^*}\left[ {\tilde{{\mathbf {u}}}}{\tilde{{\mathbf {u}}}}^T\right] - {{\mathbb {E}}}^{{\mathbb {P}}^*}\left[ {\tilde{{\mathbf {u}}}}\right] {{\mathbb {E}}}^{{\mathbb {P}}^*}\left[ {\tilde{{\mathbf {u}}}}^T\right] = \varvec{{\varSigma }}_0, \end{aligned}$$
(36)
$$\begin{aligned}&\text {Reject if } \Vert {\hat{\varvec{\mu }}} - \varvec{\mu }_0 \Vert _2> {\varGamma }_1 \text { or } \Vert {\hat{\varvec{{\varSigma }}}} - \varvec{{\varSigma }}_0 \Vert _F > {\varGamma }_2. \end{aligned}$$
(37)
Theorem 10 proves that with \({\varGamma }_1 = {\varGamma }_1(\alpha /2, N)\) and \({\varGamma }_2 = {\varGamma }_2(\alpha /2, N)\), this is a valid test at level \(\alpha \), and \({\mathcal {P}}^{CS}({\varGamma }_1, {\varGamma }_2)\) is its confidence region.

Practical experience in applied statistics suggests, however, that tests whose thresholds are computed as above from general-purpose concentration inequalities, while valid, are typically very conservative for reasonable values of \(\alpha \) and N; they reject \(H_0\) when it is false only when N is very large. The standard remedy is to use the bootstrap (Algorithm 1) to approximate thresholds \({\varGamma }_1^{B}, {\varGamma }_2^{B}\). These bootstrapped thresholds are typically much smaller than thresholds based on concentration inequalities, and the resulting tests are still (approximately) valid at level \(\alpha \). The first five columns of Table 2 illustrate the magnitude of the difference with a particular example. Entries of \(\infty \) indicate that the threshold as derived in [46] does not apply for this value of N. The data are drawn from a standard normal distribution with \(d=2\) truncated to live in a ball of radius 9.2. We take \(\alpha = 10\%\), \(N_B = 10{,}000\). We can see that the reduction can be a full order of magnitude or more.

Reducing the thresholds \({\varGamma }_1, {\varGamma }_2\) shrinks \({\mathcal {P}}^{CS}({\varGamma }_1, {\varGamma }_2)\). Thus, replacing \({\varGamma }_1(\alpha /2, N), {\varGamma }_2(\alpha /2, N)\) by \({\varGamma }_1^B, {\varGamma }_2^B\) reduces the conservativeness of any method using \({\mathcal {P}}^{CS}\) (including the original machine learning application of Shawe-Taylor and Cristianini [46]) while retaining its robustness to ambiguity in \({\mathbb {P}}^*\), since \({\varGamma }_1^B, {\varGamma }_2^B\) are approximately valid thresholds which become exact as \(N\rightarrow \infty \). Consequently, in applications where having a precise \(1-\alpha \) guarantee is not necessary, or N is very large, bootstrapped thresholds should be preferred.
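A bootstrap sketch in the spirit of Algorithm 1 for \({\varGamma }_1^B, {\varGamma }_2^B\) follows; the resampling details and the per-threshold levels are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_cs_thresholds(U, alpha, n_boot=2000):
    """Approximate Gamma_1^B, Gamma_2^B as (1 - alpha/2) quantiles of the
    deviations of resampled means/covariances from the sample values."""
    N, d = U.shape
    mu = U.mean(axis=0)
    Sigma = np.cov(U, rowvar=False, bias=True)
    g1 = np.empty(n_boot); g2 = np.empty(n_boot)
    for b in range(n_boot):
        Ub = U[rng.integers(0, N, size=N)]    # resample rows with replacement
        g1[b] = np.linalg.norm(Ub.mean(axis=0) - mu)
        g2[b] = np.linalg.norm(np.cov(Ub, rowvar=False, bias=True) - Sigma, "fro")
    return np.quantile(g1, 1.0 - alpha / 2), np.quantile(g2, 1.0 - alpha / 2)

# Example on synthetic data (d = 2, N = 100):
U = rng.normal(size=(100, 2))
g1_B, g2_B = bootstrap_cs_thresholds(U, alpha=0.10)
```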
Table 2

Comparing thresholds with and without bootstrap using \(N_B = 10{,}000\) replications, \(\alpha =10\%\)

          | Shawe-Taylor and Cristianini [46]  | Delage and Ye [25]
N         | Γ₁      Γ₂       Γ₁^B    Γ₂^B      | γ₁      γ₂      γ₁^B    γ₂^B
10        | ∞       ∞        0.805   1.161     | ∞       ∞       0.526   5.372
50        | ∞       ∞        0.382   0.585     | ∞       ∞       0.118   1.684
100       | 3.814   75.291   0.262   0.427     | ∞       ∞       0.061   1.452
500       | 1.706   33.671   0.105   0.157     | ∞       ∞       0.012   1.154
50,000    | 0.171   3.367    0.011   0.018     | ∞       ∞       1e−4    1.015
100,000   | 0.121   2.381    0.008   0.013     | 0.083   5.044   6e−5    1.010

We use \({\mathcal {P}}^{CS}({\varGamma }_1, {\varGamma }_2)\) in Step 1 of our schema. In [18], the authors prove that for any \({\varGamma }_1, {\varGamma }_2\),
$$\begin{aligned} \sup _{{\mathbb {P}}\in {\mathcal {P}}^{CS}\left( {\varGamma }_1, {\varGamma }_2\right) } \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) = {\hat{\varvec{\mu }}}^T{\mathbf {v}}+ {\varGamma }_1 \Vert {\mathbf {v}}\Vert _2 + \sqrt{\frac{1-\epsilon }{\epsilon }} \sqrt{ {\mathbf {v}}^T \left( {\hat{\varvec{{\varSigma }}}} + {\varGamma }_2 {\mathbf {I}}\right) {\mathbf {v}}}. \end{aligned}$$
(38)
We translate this bound into an uncertainty set.

Theorem 11

Suppose \({\varGamma }_1, {\varGamma }_2\) are such that the test (37) is valid at level \(\alpha \). With probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), the family \(\{ {\mathcal {U}}^{CS}_\epsilon : 0< \epsilon < 1\}\) simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\), where
$$\begin{aligned} {\mathcal {U}}_\epsilon ^{CS} = \left\{ {\hat{\varvec{\mu }}} + {\mathbf {y}}+ {\mathbf {C}}^T{\mathbf {w}}: \exists {\mathbf {y}}, {\mathbf {w}}\in {{\mathbb {R}}}^d \text { s.t. } \Vert {\mathbf {y}}\Vert \le {\varGamma }_1, \ \ \Vert {\mathbf {w}}\Vert \le \sqrt{\frac{1}{\epsilon } -1} \right\} , \end{aligned}$$
(39)
where \({\mathbf {C}}^T{\mathbf {C}}= {\hat{\varvec{{\varSigma }}}} + {\varGamma }_2 {\mathbf {I}}\) is a Cholesky decomposition. Moreover, \(\delta ^*({\mathbf {v}}| \ {\mathcal {U}}^{CS}_\epsilon )\) is given explicitly by the right-hand side of Eq. (38).

Remark 17

Notice that (38) is written with an equality. Thus, the robust constraint \(\max _{{\mathbf {u}}\in {\mathcal {U}}^{CS}_\epsilon } {\mathbf {v}}^T{\mathbf {u}}\le 0\) is exactly equivalent to the ambiguous chance-constraint \(\sup _{{\mathbb {P}}\in {\mathcal {P}}^{CS}({\varGamma }_1, {\varGamma }_2)} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \le 0\).

Remark 18

From (38), \(\{ ({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}^{CS}_\epsilon ) \le t \}\) is second order cone representable. Moreover, we can identify a worst-case realization in closed-form. Given \({\mathbf {v}}\), let \({\mathbf {u}}^* = {\hat{\varvec{\mu }}} + \frac{{\varGamma }_1}{ \Vert {\mathbf {v}}\Vert } {\mathbf {v}}+ \sqrt{\frac{1}{\epsilon }-1} \frac{{\mathbf {C}}^T{\mathbf {C}}{\mathbf {v}}}{\Vert {\mathbf {C}}{\mathbf {v}}\Vert } \). Then \({\mathbf {u}}^* \in \arg \max _{{\mathbf {u}}\in {\mathcal {U}}^{CS}_\epsilon } {\mathbf {v}}^T{\mathbf {u}}\) (cf. Proof of Theorem 11).
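A sketch of this closed form follows; we write the last term as \({\mathbf {C}}^T{\mathbf {C}}{\mathbf {v}}/\Vert {\mathbf {C}}{\mathbf {v}}\Vert \), i.e., as \({\mathbf {C}}^T{\mathbf {w}}\) with \(\Vert {\mathbf {w}}\Vert = \sqrt{1/\epsilon - 1}\), so that \({\mathbf {u}}^*\) visibly has the form required by (39):

```python
import numpy as np

def worst_case_u_cs(v, mu_hat, Sigma_hat, g1, g2, eps):
    """Closed-form maximizer of v^T u over U^CS_eps."""
    d = len(v)
    # C^T C = Sigma_hat + g2 * I  (numpy's cholesky returns the lower
    # factor L with L L^T = A, so take C = L^T)
    C = np.linalg.cholesky(Sigma_hat + g2 * np.eye(d)).T
    Cv = C @ v
    return (mu_hat + g1 * v / np.linalg.norm(v)
            + np.sqrt(1.0 / eps - 1.0) * (C.T @ Cv) / np.linalg.norm(Cv))
```

Plugging \({\mathbf {u}}^*\) into \({\mathbf {v}}^T{\mathbf {u}}\) recovers exactly the right-hand side of (38).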

Remark 19

\({\mathcal {U}}^{CS}_\epsilon \) need not be a subset of \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\). Consequently, when a priori knowledge of the support is available, we can refine this set as in Theorem 4.

To emphasize the benefits of bootstrapping when constructing uncertainty sets, Fig. 5 in the electronic companion illustrates the set \({\mathcal {U}}^{CS}_\epsilon \) for the example considered in Fig. 2 with thresholds computed with and without the bootstrap.

8.2 Uncertainty Set Motivated by Delage and Ye 2010

Delage and Ye [25] propose a data-driven approach for solving distributionally robust optimization problems. Their method relies on a slightly more general version of the following:

Theorem 12

(Delage and Ye [25]) Let R be such that \({\mathbb {P}}^*( ({\tilde{{\mathbf {u}}}}- \varvec{\mu })^T \varvec{{\varSigma }}^{-1} ({\tilde{{\mathbf {u}}}}-\varvec{\mu }) \le R^2 ) = 1\) where \(\varvec{\mu }, \varvec{{\varSigma }}\) are the true mean and covariance of \({\tilde{{\mathbf {u}}}}\) under \({\mathbb {P}}^*\). Let,
$$\begin{aligned} \beta _2 \equiv \frac{R^2}{N}\left( 2 + \sqrt{2 \log (2/\alpha ) } \right) ^2, \quad \beta _1 \equiv \frac{R^2}{\sqrt{N}} \left( \sqrt{1- \frac{d}{R^4}} + \sqrt{ \log (4/\alpha )} \right) , \end{aligned}$$
and suppose N is large enough so that \(1-\beta _1 -\beta _2 > 0\). Finally suppose \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*) \subseteq [\hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}] \). Then with probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), \({\mathbb {P}}^* \in {\mathcal {P}}^{DY}( \frac{ \beta _2}{1 - \beta _1 - \beta _2}, \frac{1+ \beta _2}{1 - \beta _1 - \beta _2})\) where
$$\begin{aligned} {\mathcal {P}}^{DY}\left( \gamma _1, \gamma _2\right) \equiv&\left\{ {\mathbb {P}}\in {\varTheta }\left[ \hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}\right] : \left( {{\mathbb {E}}}^{\mathbb {P}}\left[ {\tilde{{\mathbf {u}}}}\right] - {\hat{\varvec{\mu }}} \right) ^T {\hat{\varvec{{\varSigma }}}}^{-1} \left( {{\mathbb {E}}}^{\mathbb {P}}\left[ {\tilde{{\mathbf {u}}}}\right] - {\hat{\varvec{\mu }}} \right) \le \gamma _1,\right. \\&\left. {{\mathbb {E}}}^{\mathbb {P}}\left[ \left( {\tilde{{\mathbf {u}}}}- {\hat{\varvec{\mu }}}\right) \left( {\tilde{{\mathbf {u}}}}- {\hat{\varvec{\mu }}}\right) ^T \right] \preceq \gamma _2 {\hat{\varvec{{\varSigma }}}} \right\} . \end{aligned}$$

The key idea is again to compute the thresholds using a general-purpose concentration inequality. The condition on N is required for the confidence region to be well-defined.
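The thresholds in Theorem 12 are simple to compute once R, d, N, and \(\alpha \) are fixed. The following sketch (with illustrative placeholder values, assuming R is known as in our simplified treatment) evaluates \(\beta _1\), \(\beta _2\), checks the condition \(1 - \beta _1 - \beta _2 > 0\), and forms the resulting \(\gamma _1\), \(\gamma _2\):

```python
import math

# Sketch of the threshold computation in Theorem 12 (illustrative values only).
def dy_betas(R, d, N, alpha):
    beta2 = (R**2 / N) * (2.0 + math.sqrt(2.0 * math.log(2.0 / alpha)))**2
    beta1 = (R**2 / math.sqrt(N)) * (math.sqrt(1.0 - d / R**4)
                                     + math.sqrt(math.log(4.0 / alpha)))
    return beta1, beta2

R, d, N, alpha = 2.0, 10, 10_000, 0.1       # placeholder problem data
beta1, beta2 = dy_betas(R, d, N, alpha)
well_defined = 1.0 - beta1 - beta2 > 0      # N large enough for P^DY to exist
gamma1 = beta2 / (1.0 - beta1 - beta2)
gamma2 = (1.0 + beta2) / (1.0 - beta1 - beta2)
```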

We again observe that \({\mathcal {P}}^{DY}(\gamma _1, \gamma _2)\) is the \(1-\alpha \) confidence region of a hypothesis test. Consider the null hypothesis (36) and the test
$$\begin{aligned} \text {Reject if } ({\hat{\varvec{\mu }}} - \varvec{\mu }_0)^T {\hat{\varvec{{\varSigma }}}}^{-1}({\hat{\varvec{\mu }}} - \varvec{\mu }_0)> \gamma _1 \quad \text { or } \quad \max _{\varvec{\lambda }} \frac{\varvec{\lambda }^T\left( \varvec{{\varSigma }}_0 + (\varvec{\mu }_0 - {\hat{\varvec{\mu }}})\left( \varvec{\mu }_0 - {\hat{\varvec{\mu }}}\right) ^T\right) \varvec{\lambda }}{\varvec{\lambda }^T {\hat{\varvec{{\varSigma }}}} \varvec{\lambda }} > \gamma _2. \end{aligned}$$
(40)
Then, Theorem 12 proves that setting \(\gamma _1 = \frac{ \beta _2}{1 - \beta _1 - \beta _2}\) and \(\gamma _2 = \frac{1+ \beta _2}{1 - \beta _1 - \beta _2}\) yields a test valid at level \(\alpha \) whose confidence region is \({\mathcal {P}}^{DY}(\gamma _1, \gamma _2)\).

Again, these thresholds are calculated via a general-purpose inequality. Instead, we can approximate new thresholds using the bootstrap. Table 2 shows the reduction in magnitude. Observe that the bootstrap thresholds exist for all N, not just for N sufficiently large. Moreover, they are significantly smaller, so that \({\mathcal {P}}^{DY}(\gamma _1^B, \gamma _2^B)\) is significantly smaller than \({\mathcal {P}}^{DY}( \frac{ \beta _2}{1 - \beta _1 - \beta _2}, \frac{1+ \beta _2}{1 - \beta _1 - \beta _2})\), while retaining (approximately) the same probabilistic guarantee. Therefore, in applications where a precise \(1-\alpha \) guarantee is not essential or N is very large, the bootstrap thresholds may be preferred. We use \({\mathcal {P}}^{DY}(\gamma _1^B, \gamma _2^B)\) in Step 1 of our schema.
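One way to approximate such bootstrap thresholds can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it treats the empirical moments as the null \((\varvec{\mu }_0, \varvec{{\varSigma }}_0)\) in (40), resamples the data with replacement, and takes the \(1-\alpha \) quantile of each resampled test statistic. The maximization over \(\varvec{\lambda }\) is a generalized eigenvalue problem:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_thresholds(data, alpha=0.1, B=500):
    """Bootstrap approximation of gamma_1^B, gamma_2^B for test (40).
    Illustrative sketch: empirical moments play the role of (mu_0, Sigma_0)."""
    N, d = data.shape
    mu0 = data.mean(axis=0)
    Sigma0 = np.cov(data, rowvar=False)
    t1, t2 = np.empty(B), np.empty(B)
    for b in range(B):
        boot = data[rng.integers(0, N, size=N)]          # resample with replacement
        mu_hat = boot.mean(axis=0)
        Sigma_hat = np.cov(boot, rowvar=False)
        diff = mu_hat - mu0
        t1[b] = diff @ np.linalg.solve(Sigma_hat, diff)
        # max over lambda of lambda^T (Sigma0 + diff diff^T) lambda / lambda^T Sigma_hat lambda
        # is the largest generalized eigenvalue of the matrix pair below.
        M = Sigma0 + np.outer(diff, diff)
        t2[b] = np.max(np.real(np.linalg.eigvals(np.linalg.solve(Sigma_hat, M))))
    return np.quantile(t1, 1 - alpha), np.quantile(t2, 1 - alpha)

data = rng.normal(size=(200, 3))                          # synthetic stand-in for S
g1B, g2B = bootstrap_thresholds(data, alpha=0.1)
```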

Theorem 13

Let \(\gamma _1, \gamma _2\) be such that the test (40) is valid at level \(\alpha \). Suppose \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*) \subset [\hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}]\). Then, with probability at least \(1-\alpha \) with respect to \({\mathbb {P}}_{\mathcal {S}}\), the family \(\{ {\mathcal {U}}^{DY}_\epsilon : 0< \epsilon < 1 \}\) simultaneously implies a probabilistic guarantee for \({\mathbb {P}}^*\), where
$$\begin{aligned} {\mathcal {U}}^{DY}_\epsilon = \Big \{ {\mathbf {u}}\in \left[ \hat{{\mathbf {u}}}^{(0)}, \hat{{\mathbf {u}}}^{(N+1)}\right] :\ &\exists \lambda \in {{\mathbb {R}}},\ {\mathbf {w}}, {\mathbf {m}}\in {{\mathbb {R}}}^d,\ {\mathbf {A}}, {\hat{{\mathbf {A}}}} \succeq {\mathbf {0}}\text { s.t. } \\ &\lambda \le \frac{1}{\epsilon },\quad (\lambda -1) \hat{{\mathbf {u}}}^{(0)} \le {\mathbf {m}}\le (\lambda -1) \hat{{\mathbf {u}}}^{(N+1)}, \\ &\lambda {\hat{\varvec{\mu }}} = {\mathbf {m}}+ {\mathbf {u}}+ {\mathbf {w}},\quad \Vert {\mathbf {C}}{\mathbf {w}}\Vert \le \lambda \sqrt{\gamma _1^B}, \\ &\begin{pmatrix} \lambda -1 &{} {\mathbf {m}}^T \\ {\mathbf {m}}&{} {\mathbf {A}}\end{pmatrix} \succeq {\mathbf {0}},\quad \begin{pmatrix} 1 &{} {\mathbf {u}}^T \\ {\mathbf {u}}&{} {\hat{{\mathbf {A}}}} \end{pmatrix} \succeq {\mathbf {0}}, \\ &\lambda \left( \gamma _2^B {\hat{\varvec{{\varSigma }}}} + {\hat{\varvec{\mu }}} {\hat{\varvec{\mu }}}^T\right) - {\mathbf {A}}- {\hat{{\mathbf {A}}}} - {\mathbf {w}}{\hat{\varvec{\mu }}}^T - {\hat{\varvec{\mu }}} {\mathbf {w}}^T \succeq {\mathbf {0}}\Big \}, \end{aligned}$$
(41)
\({\mathbf {C}}^T {\mathbf {C}} = {\hat{\varvec{{\varSigma }}}}^{-1}\) is a Cholesky decomposition, and \(\gamma _1^B, \gamma _2^B\) are computed by the bootstrap. Moreover,
$$\begin{aligned} \delta ^*\left( {\mathbf {v}}| \ {\mathcal {U}}^{DY}_\epsilon \right)&= \sup _{{\mathbb {P}}\in {\mathcal {P}}^{DY}\left( \gamma _1^B, \gamma _2^B\right) } \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) = \inf \quad t\\ \text {s.t.} \quad&r + s \le \theta \epsilon , \\&\begin{pmatrix} r + {\mathbf {y}}_1^{+T} \hat{{\mathbf {u}}}^{(0)} - {\mathbf {y}}_1^{-T} \hat{{\mathbf {u}}}^{(N+1)} &{} \frac{1}{2} ( {\mathbf {q}}- {\mathbf {y}}_1)^T \\ \frac{1}{2} ( {\mathbf {q}}- {\mathbf {y}}_1) &{} {\mathbf {Z}}\end{pmatrix} \succeq {\mathbf {0}}, \\&\begin{pmatrix} r + {\mathbf {y}}_2^{+T} \hat{{\mathbf {u}}}^{(0)} - {\mathbf {y}}_2^{-T} \hat{{\mathbf {u}}}^{(N+1)} + t - \theta &{} \frac{1}{2} ( {\mathbf {q}}- {\mathbf {y}}_2 - {\mathbf {v}})^T \\ \frac{1}{2} ( {\mathbf {q}}- {\mathbf {y}}_2 - {\mathbf {v}}) &{} {\mathbf {Z}}\end{pmatrix} \succeq {\mathbf {0}}, \\&s \ge \left( \gamma ^B_2 {\hat{\varvec{{\varSigma }}}} + {\hat{\varvec{\mu }}} {\hat{\varvec{\mu }}}^T\right) \circ {\mathbf {Z}}+ {\hat{\varvec{\mu }}}^T {\mathbf {q}}+ \sqrt{\gamma ^B_1} \Vert {\mathbf {q}}+ 2 {\mathbf {Z}}{\hat{\varvec{\mu }}} \Vert _{{\hat{\varvec{{\varSigma }}}}^{-1}}, \\&{\mathbf {y}}_1 = {\mathbf {y}}_1^+ - {\mathbf {y}}_1^-, \ \ {\mathbf {y}}_2 = {\mathbf {y}}_2^+ - {\mathbf {y}}_2^-, \ \ {\mathbf {y}}_1^+, {\mathbf {y}}_1^-, {\mathbf {y}}_2^+,{\mathbf {y}}_2^-, \theta \ge {\mathbf {0}}. \end{aligned}$$

Remark 20

Similar to \({\mathcal {U}}^{CS}_\epsilon \), the robust constraint \(\max _{{\mathbf {u}}\in {\mathcal {U}}^{DY}_\epsilon } {\mathbf {v}}^T{\mathbf {u}}\le 0\) is equivalent to the ambiguous chance constraint \(\sup _{{\mathbb {P}}\in {\mathcal {P}}^{DY}(\gamma _1^B, \gamma _2^B)} \text {VaR}_{\epsilon }^{{\mathbb {P}}}\left( {\mathbf {v}}^T{\tilde{{\mathbf {u}}}}\right) \le 0\).

Remark 21

The set \(\{({\mathbf {v}}, t) : \delta ^*({\mathbf {v}}| \ {\mathcal {U}}^{DY}_\epsilon ) \le t \}\) is representable as a linear matrix inequality. At the time of writing, solvers for linear matrix inequalities are not as developed as those for second-order cone programs. Consequently, one may prefer \({\mathcal {U}}^{CS}_\epsilon \) to \({\mathcal {U}}^{DY}_\epsilon \) in practice for its simplicity.

8.3 Comparing \({\mathcal {U}}^M_\epsilon \), \({\mathcal {U}}^{LCX}_\epsilon \), \({\mathcal {U}}^{CS}_\epsilon \) and \({\mathcal {U}}^{DY}_\epsilon \)

One of the benefits of deriving uncertainty sets corresponding to the methods of Delage and Ye [25] and Shawe-Taylor and Cristianini [46] is that it facilitates comparisons between these methods and our own proposals. In particular, we can make visual, qualitative assessments of the conservatism (in terms of size) and modeling power (in terms of shape). In Fig. 3, we illustrate the sets \({\mathcal {U}}^M_\epsilon \), \({\mathcal {U}}^{LCX}_\epsilon \), \({\mathcal {U}}^{CS}_\epsilon \) and \({\mathcal {U}}^{DY}_\epsilon \) for the same numerical example from Fig. 2. Note that each of these sets implies a probabilistic guarantee when data are drawn i.i.d. from a general joint distribution. Because \({\mathcal {U}}^M\) does not leverage the joint distribution \({\mathbb {P}}^*\), it does not learn that its marginals are independent. Consequently, \({\mathcal {U}}^M\) has pointed corners permitting extreme values of both coordinates simultaneously. The remaining sets do learn the marginal independence from the data and, hence, have rounded corners.

Interestingly, \({\mathcal {U}}^{CS}_\epsilon \cap {{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) is very similar to \({\mathcal {U}}^{DY}_\epsilon \) for this example (the two are indistinguishable in the figure). Since \({\mathcal {U}}^{CS}\) and \({\mathcal {U}}^{DY}\) only depend on the first two moments of \({\mathbb {P}}^*\), neither is able to capture the skewness in the second coordinate. Finally, \({\mathcal {U}}^{LCX}\) is contained within \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) and displays symmetry in the first coordinate and skewness in the second. In this example it is also the smallest set (in terms of volume). All sets shrink as N increases.
Fig. 3

Comparing \({\mathcal {U}}^M_\epsilon \), \({\mathcal {U}}^{LCX}_\epsilon \), \({\mathcal {U}}^{CS}_\epsilon \) and \({\mathcal {U}}^{DY}_\epsilon \) for the example from Fig. 2, \(\epsilon = 10\%\), \(\alpha = 20\%\). The black dotted line represents \({{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\). The left panel uses \(N=100\) data points, while the right panel uses \(N=1000\) data points

8.4 Refining \({\mathcal {U}}^{FB}_\epsilon \)

Another common approach to hypothesis testing in applied statistics is to use tests designed for Gaussian data that are “robust to departures from normality.” The best known example of this approach is the t test from Sect. 2.2, for which there is a great deal of experimental evidence to suggest that the test is still approximately valid when the underlying data are non-Gaussian [35, Chapt. 11.3]. Moreover, certain nonparametric tests of the mean for non-Gaussian data are asymptotically equivalent to the t test, so that the t test, itself, is asymptotically valid for non-Gaussian data [35, p.180]. Consequently, the t test is routinely used in practice, even when the Gaussian assumption may be invalid.

We use the t test in combination with bootstrapping to refine \({\mathcal {U}}^{FB}_\epsilon \). We replace \(m_{fi}, m_{bi}\) in Eq. (27) with the upper and lower thresholds of a t test at level \(\alpha ^\prime /2\). We expect these new thresholds to correctly bound the true mean \(\mu _i\) with probability approximately \(1-\alpha ^\prime /2\) with respect to the data. We then use the bootstrap to calculate bounds on the forward and backward deviations \({\overline{\sigma }}_{fi}, {\overline{\sigma }}_{bi}\).
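The per-coordinate mean bounds can be sketched as follows. This is an illustrative reconstruction, not the authors' code: for simplicity it uses a normal approximation to the t quantile (reasonable for the large-N regimes considered here), and the names `mean_bounds`, `data_i` are ours rather than the paper's:

```python
import numpy as np
from statistics import NormalDist

def mean_bounds(data_i, alpha_prime):
    """Two-sided confidence bounds on the mean of coordinate i, used in place of
    m_bi, m_fi in Eq. (27).  A two-sided test at level alpha'/2 puts mass
    alpha'/4 in each tail; we approximate the t quantile by the normal one."""
    N = len(data_i)
    mu_hat = float(np.mean(data_i))
    se = float(np.std(data_i, ddof=1)) / np.sqrt(N)
    z = NormalDist().inv_cdf(1.0 - alpha_prime / 4.0)
    return mu_hat - z * se, mu_hat + z * se     # (lower, upper) = (m_bi, m_fi)

rng = np.random.default_rng(1)
lo, hi = mean_bounds(rng.normal(loc=0.5, size=1000), alpha_prime=0.2)
```

The bootstrap step for \({\overline{\sigma }}_{fi}, {\overline{\sigma }}_{bi}\) would then resample the same coordinate and take quantiles of the resampled deviations, analogously to the threshold computation above.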

We stress that not all tests designed for Gaussian data are robust to departures from normality. Applying Gaussian tests that lack this robustness will likely yield poor performance. Consequently, some care must be taken when choosing an appropriate test.

9 Implementation Details and Applications

9.1 Choosing the “Right” Set and Tuning \(\alpha \), \(\epsilon \)

Choosing an appropriate set from amongst those consistent with the a priori knowledge of \({\mathbb {P}}^*\) is a non-trivial task that depends on the application, data and N. In what follows, we adapt classical model selection procedures from machine learning by viewing a robust optimal solution \({\mathbf {x}}^*\) as analogous to a fitted parameter in a statistical model. There are, of course, a wide variety of common model selection procedures (see [2, 31]), some of which may be more appropriate to the specific application than others. Perhaps the simplest approach is to split the data into two parts, a training set and a hold-out set. Use the training set to construct each potential uncertainty set, in turn, and solve the robust optimization problem. Evaluate each of the corresponding solutions out-of-sample on the hold-out set, and select the best solution. ("Best" may be interpreted in an application-specific way.) When choosing among k sets that each imply a probabilistic guarantee at level \(\epsilon \) with probability \(1-\alpha \), this procedure will yield a set that satisfies a probabilistic guarantee at level \(\epsilon \) with probability at least \(1-k \alpha \) by the union bound.

In situations where N is only moderately large and using only half the data to calibrate an uncertainty set is impractical, we suggest using k-fold cross-validation to select a set (see [31] for a review of cross-validation). Unlike the above procedure, we cannot prove that the set chosen by k-fold cross-validation implies a probabilistic guarantee. Nevertheless, experience in machine learning suggests cross-validation is extremely effective. In what follows, we use fivefold cross-validation to select our sets.
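The k-fold selection procedure can be sketched generically as below. This is an illustrative scaffold, not the paper's implementation: `build_and_solve` and `evaluate` are hypothetical placeholders for "construct the uncertainty set and solve the robust problem" and "score the solution out-of-sample," and the toy instance at the end (candidate box half-widths scored against an empirical 10% quantile) is our own:

```python
import numpy as np

def kfold_select(data, candidates, build_and_solve, evaluate, k=5, seed=0):
    """Select among candidate set constructions by k-fold cross-validation.
    `candidates`: list of set-construction choices; `build_and_solve`: maps
    (candidate, training data) to a robust solution; `evaluate`: scores a
    solution on held-out data (higher is better).  Names are illustrative."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), k)
    scores = []
    for cand in candidates:
        fold_scores = []
        for j in range(k):
            train = np.concatenate([folds[i] for i in range(k) if i != j])
            x = build_and_solve(cand, data[train])
            fold_scores.append(evaluate(x, data[folds[j]]))
        scores.append(float(np.mean(fold_scores)))
    return candidates[int(np.argmax(scores))], scores

# Toy instance: candidates are box half-widths w; the robust "solution" is the
# worst-case value mean - w, scored by closeness to the held-out 10% quantile.
data = np.random.default_rng(2).normal(size=200)
build = lambda w, tr: tr.mean() - w
score = lambda x, te: -abs(np.quantile(te, 0.1) - x)
best, scores = kfold_select(data, [0.5, 1.28, 3.0], build, score)
```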

As an aside, we point out that in applications where there is no natural choice for \(\alpha \) or \(\epsilon \), similar techniques can also be used to tune these parameters. Namely, solve the model over a grid of potential values for \(\alpha \) and/or \(\epsilon \) and then select the best value either using a hold-out set or cross-validation. Since the optimal value likely depends on the choice of uncertainty set, we suggest choosing the set and these parameters jointly.

9.2 Applications

We demonstrate how our new sets may be used in two applications: portfolio management and queueing theory. Our goals are, first, to illustrate their application and, second, to compare the sets to one another. We summarize our major insights:
  • In these two applications, our data-driven sets outperform traditional, non-data-driven uncertainty sets, and, moreover, robust models built with our sets perform as well as or better than other data-driven approaches.

  • Although our data-driven sets all shrink as \(N\rightarrow \infty \), they learn different features of \({\mathbb {P}}^*\), such as correlation structure and skewness. Consequently, different sets may be better suited to different applications, and the right choice of set may depend on N. Cross-validation effectively identifies the best set.

  • Optimizing the \(\epsilon _j\)’s in the case of multiple constraints can significantly improve performance.

Because of space considerations, we treat only the portfolio management application in the main text. The queueing application can be found in “Appendix 4”.

9.3 Portfolio Management

Portfolio management has been well-studied in the robust optimization literature [19, 29, 39]. For simplicity, we will consider the one period allocation problem:
$$\begin{aligned} \max _{{\mathbf {x}}} \left\{ \min _{{\mathbf {r}} \in {\mathcal {U}}} \ \ {\mathbf {r}}^T {\mathbf {x}}: \ \ {\mathbf {e}}^T {\mathbf {x}}= 1, \ \ {\mathbf {x}}\ge {\mathbf {0}}\right\} , \end{aligned}$$
(42)
which seeks the portfolio \({\mathbf {x}}\) with maximal worst-case return over the set \({\mathcal {U}}\). If \({\mathcal {U}}\) implies a probabilistic guarantee for \({\mathbb {P}}^*\) at level \(\epsilon \), then the optimal value \(z^*\) of this optimization is a conservative bound on the \(\epsilon \)-worst case return for the optimal solution \({\mathbf {x}}^*\).
We consider a synthetic market with \(d = 10\) assets. Returns are generated according to the following model from [39]:
$$\begin{aligned} {\tilde{r}}_i = {\left\{ \begin{array}{ll} \frac{\sqrt{(1-\beta _i)\beta _i}}{\beta _i} &{} \text {with probability } \beta _i \\ -\frac{\sqrt{(1-\beta _i)\beta _i}}{1-\beta _i} &{} \text {with probability } 1-\beta _i \end{array}\right. }, \quad \beta _i = \frac{1}{2}\left( 1 + \frac{i}{11}\right) , \ \ i = 1, \ldots , 10.\nonumber \\ \end{aligned}$$
(43)
In this model, all assets have the same mean return (0%) and the same standard deviation (\(1.00\%\)), but they differ in skew and support. Higher-indexed assets are highly skewed; they have a small probability of achieving a very negative return. Returns for different assets are independent. We simulate \(N=500\) returns as data.
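The ground-truth model (43) is easy to simulate, which is also how we evaluate out-of-sample performance below. The following sketch (an illustration, not the experimental code) draws returns from (43) and computes the empirical 10% worst-case return of an equal-weight portfolio; one can check analytically that each \({\tilde{r}}_i\) has mean \(\beta _i \frac{\sqrt{(1-\beta _i)\beta _i}}{\beta _i} - (1-\beta _i)\frac{\sqrt{(1-\beta _i)\beta _i}}{1-\beta _i} = 0\) and variance \((1-\beta _i) + \beta _i = 1\):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_returns(N, d=10):
    """Draw N i.i.d. return vectors from model (43); returns are in percent,
    so each asset has mean 0 and standard deviation 1 (= 1.00%)."""
    beta = 0.5 * (1.0 + np.arange(1, d + 1) / 11.0)      # beta_i, i = 1..d
    up = np.sqrt((1 - beta) * beta) / beta               # value w.p. beta_i
    down = -np.sqrt((1 - beta) * beta) / (1 - beta)      # value w.p. 1 - beta_i
    return np.where(rng.random((N, d)) < beta, up, down)

R = simulate_returns(500)
# Empirical 10% worst-case (10% quantile) return of the equal-weight portfolio.
port = R @ np.full(10, 0.1)
var10 = float(np.quantile(port, 0.10))
```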

We will utilize our sets \({\mathcal {U}}^M_\epsilon \) and \({\mathcal {U}}^{LCX}_\epsilon \) in this application. We do not consider the sets \({\mathcal {U}}^I_\epsilon \) or \({\mathcal {U}}^{FB}_\epsilon \) since we do not know a priori that the returns are independent. To contrast to the methods of Delage and Ye [25] and Shawe-Taylor and Cristianini [46] we also construct the sets \({\mathcal {U}}^{CS}_\epsilon \) and \({\mathcal {U}}^{DY}_\epsilon \). Recall from Remarks 17 and 20 that robust linear constraints over these sets are equivalent to ambiguous chance-constraints in the original methods, but with improved thresholds. As discussed in Remark 19, we also construct \({\mathcal {U}}^{CS}_\epsilon \cap {{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) for comparison. We use \(\alpha = \epsilon = 10\%\) in all of our sets. Finally, we will also compare to the method of Calafiore and Monastero [19] (denoted "CM" in our plots), which is not an uncertainty set based method. We calibrate this method to also provide a bound on the \(10\%\) worst-case return that holds with probability at least \(90\%\) with respect to \({\mathbb {P}}_{\mathcal {S}}\) so as to provide a fair comparison.

We first consider the problem of selecting an appropriate set via 5-fold cross-validation. The top left panel in Fig. 4 shows the out-of-sample 10% worst-case return for each of the 5 runs (blue dots), as well as the average performance over the 5 runs for each set (black square). Sets \({\mathcal {U}}^M_\epsilon \), \({\mathcal {U}}^{CS}_\epsilon \cap {{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) and \({\mathcal {U}}^{DY}_\epsilon \) yield identical portfolios (investing everything in the first asset), so we only include \({\mathcal {U}}^M\) in our graphs. The average performance is also shown in Table 3 under column CV (for "cross-validation"). The optimal objective value of (42) for each of our sets (trained with the entire data set) is shown in column \(z_{In}\).
Fig. 4

Portfolio performance by method: \(\alpha = \epsilon = 10\%\). Top left Cross-validation results. Top right Out-of-sample distribution of the 10% worst-case return over 100 runs. Bottom left Average portfolio holdings by method. Bottom right Out-of-sample distribution of the 10% worst-case return over 100 runs. The bottom right panel uses \(N=2000\). The remainder use \(N=500\)

Table 3

Portfolio statistics for each of our methods

              N=500                                 N=2000
        z_In     CV      z_Out    z_Avg      z_In     CV      z_Out    z_Avg
M      −1.095  −1.095   −1.095   −1.095     −1.095  −1.095   −1.095   −1.095
LCX    −0.699  −0.373   −0.373   −0.411     −0.890  −0.428   −0.395   −0.411
CS     −1.125  −0.403   −0.416   −0.397     −1.306  −0.400   −0.417   −0.396
CM     −0.653  −0.495   −0.425   −0.539     −0.739  −0.426   −0.549   −0.451

\({\mathcal {U}}^{DY}_\epsilon \) and \({\mathcal {U}}^{CS}_\epsilon \cap {{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) perform identically to \({\mathcal {U}}^{M}_\epsilon \). "CM" refers to the method of [19]

Based on the top left panel of Fig. 4, it is clear that \({\mathcal {U}}^{LCX}_\epsilon \) and \({\mathcal {U}}^{CS}_\epsilon \) significantly outperform the remaining sets. They seem to perform similarly to the CM method. Consequently, we would choose one of these two sets in practice.

We can assess the quality of this choice by using the ground-truth model (43) to calculate the true 10% worst-case return for each of the portfolios. These are shown in Table 3 under column \(z_{Out}\). Indeed, these sets perform better than the alternatives, and, as expected, the cross-validation estimates are reasonably close to the true out-of-sample performance. By contrast, the in-sample objective value \(z_{In}\) is a loose bound. We caution against using this in-sample value to select the best set.

Interestingly, while \({\mathcal {U}}^{CS}_\epsilon \cap {{\mathrm{\text {supp}}}}({\mathbb {P}}^*)\) is potentially smaller (with respect to subset containment) than \({\mathcal {U}}^{CS}_\epsilon \), it performs much worse out-of-sample (identically to \({\mathcal {U}}^M_\epsilon \)). This experiment highlights the fact that size calculations alone cannot predict performance; cross-validation or similar techniques are required.

One might ask if these results are specific to the particular draw of 500 data points we use. We repeat the above procedure 100 times. The resulting distribution of the 10% worst-case return is shown in the top right panel of Fig. 4, and the average over these runs is shown in Table 3 under column \(z_{Avg}\). As might have been guessed from the cross-validation results, \({\mathcal {U}}^{CS}_\epsilon \) delivers more stable and better performance than either \({\mathcal {U}}^{LCX}_\epsilon \) or CM. \({\mathcal {U}}^{LCX}_\epsilon \) slightly outperforms CM, and its distribution is shifted right.

We next look at the distribution of actual holdings across these methods. We show the average holding over these 100 runs as well as the \(10\%\) and \(90\%\) quantiles for each asset in the bottom left panel of Fig. 4. Since \({\mathcal {U}}^M_\epsilon \) does not use the joint distribution, it sees no benefit to diversification. Portfolios built from \({\mathcal {U}}^M_\epsilon \) consistently hold all their wealth in the first asset over all the runs; hence, they are omitted from the graphs. The set \({\mathcal {U}}^{CS}_\epsilon \) depends only on the first two moments of the data, and, consequently, cannot distinguish between the assets. It holds a very stable portfolio of approximately the same amount in each asset. By contrast, \({\mathcal {U}}^{LCX}\) is able to learn the asymmetry in the distributions, and holds slightly less of the higher-indexed (toxic) assets. CM is similar to \({\mathcal {U}}^{LCX}\), but demonstrates more variability in the holdings.

We point out that the performance of each method depends slightly on N. We repeat the above experiments with \(N=2000\). Results are summarized in Table 3. The bottom right panel of Fig. 4 shows the distribution of the \(10\%\) worst-case return. (Additional plots are also available in “Additional Portfolio Results” in Appendix.) Both \({\mathcal {U}}^{LCX}\) and CM perform noticeably better with the extra data, but \({\mathcal {U}}^{LCX}\) now noticeably outperforms CM and its distribution is shifted significantly to the right.

10 Conclusions

The prevalence of high quality data is reshaping operations research. Indeed, a new data-centered paradigm is emerging. In this work, we took a step towards adapting traditional robust optimization techniques to this new paradigm. Specifically, we proposed a novel schema for designing uncertainty sets for robust optimization from data using hypothesis tests. Sets designed using our schema imply a probabilistic guarantee and are typically much smaller than corresponding data-poor variants. Models built from these sets are thus less conservative than conventional robust approaches, yet retain the same robustness guarantees.

Footnotes

  1. We say \(f({\mathbf {u}}, {\mathbf {x}})\) is bi-affine if the function \({\mathbf {u}}\mapsto f({\mathbf {u}}, {\mathbf {x}})\) is affine for any fixed \({\mathbf {x}}\) and the function \({\mathbf {x}}\mapsto f({\mathbf {u}}, {\mathbf {x}})\) is affine for any fixed \({\mathbf {u}}\).

  2. An example of a sufficient regularity condition is that \(ri({\mathcal {U}}) \cap ri(dom(f(\cdot , {\mathbf {x}}))) \ne \emptyset \), \(\forall {\mathbf {x}}\in {{\mathbb {R}}}^k\). Here \(ri({\mathcal {U}})\) denotes the relative interior of \({\mathcal {U}}\). Recall that for any non-empty convex set \({\mathcal {U}}\), \(ri({\mathcal {U}}) \equiv \{ {\mathbf {u}}\in {\mathcal {U}}\ : \ \forall {\mathbf {z}}\in {\mathcal {U}}, \ \exists \lambda > 1 \text { s.t. } \lambda {\mathbf {u}}+ (1-\lambda ) {\mathbf {z}}\in {\mathcal {U}}\}\) (cf. [11]).

  3. Specifically, since R is typically unknown, the authors describe an estimation procedure for R and prove a modified version of Theorem 12 using this estimate and different constants. We treat the simpler case where R is known here. Extensions to the other case are straightforward.

Notes

Acknowledgements

We would like to thank the area editor, associate editor and two anonymous reviewers for their helpful comments on an earlier draft of this manuscript. Part of this work was supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1122374.

References

  1. Acerbi, C., Tasche, D.: On the coherence of expected shortfall. J. Bank. Financ. 26(7), 1487–1503 (2002)
  2. Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)
  3. Bandi, C., Bertsimas, D.: Tractable stochastic analysis in high dimensions via robust optimization. Math. Program. 134(1), 23–70 (2012)
  4. Bandi, C., Bertsimas, D., Youssef, N.: Robust queueing theory. Oper. Res. 63(3), 676–700 (2012)
  5. Ben-Tal, A., Den Hertog, D., Vial, J.P.: Deriving robust counterparts of nonlinear uncertain inequalities. Math. Program. 149, 1–35 (2012)
  6. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)
  7. Ben-Tal, A., Golany, B., Nemirovski, A., Vial, J.: Retailer-supplier flexible commitments contracts: a robust optimization approach. Manuf. Serv. Oper. Manag. 7(3), 248–271 (2005)
  8. Ben-Tal, A., Hazan, E., Koren, T., Mannor, S.: Oracle-based robust optimization via online learning. Oper. Res. 63(3), 628–638 (2015)
  9. Ben-Tal, A., den Hertog, D., De Waegenaere, A., Melenberg, B., Rennen, G.: Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 59(2), 341–357 (2013)
  10. Ben-Tal, A., Nemirovski, A.: Robust solutions of linear programming problems contaminated with uncertain data. Math. Program. 88(3), 411–424 (2000)
  11. Bertsekas, D., Nedić, A., Ozdaglar, A.: Convex Analysis and Optimization. Athena Scientific, Belmont (2003)
  12. Bertsimas, D., Brown, D.: Constructing uncertainty sets for robust linear optimization. Oper. Res. 57(6), 1483–1495 (2009)
  13. Bertsimas, D., Dunning, I., Lubin, M.: Reformulations versus cutting planes for robust optimization (2014). http://www.optimization-online.org/DB_HTML/2014/04/4336.html
  14. Bertsimas, D., Gamarnik, D., Rikun, A.: Performance analysis of queueing networks via robust optimization. Oper. Res. 59(2), 455–466 (2011)
  15. Bertsimas, D., Gupta, V., Kallus, N.: Robust sample average approximation (2013). arXiv:1408.4445
  16. Bertsimas, D., Sim, M.: The price of robustness. Oper. Res. 52(1), 35–53 (2004)
  17. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
  18. Calafiore, G., El Ghaoui, L.: On distributionally robust chance-constrained linear programs. J. Optim. Theory Appl. 130(1), 1–22 (2006)
  19. Calafiore, G., Monastero, B.: Data-driven asset allocation with guaranteed short-fall probability. In: American Control Conference (ACC), 2012, pp. 3687–3692. IEEE (2012)
  20. Campi, M., Carè, A.: Random convex programs with l1-regularization: sparsity and generalization. SIAM J. Control Optim. 51(5), 3532–3557 (2013)
  21. Campi, M., Garatti, S.: The exact feasibility of randomized solutions of uncertain convex programs. SIAM J. Optim. 19(3), 1211–1230 (2008)
  22. Chen, W., Sim, M., Sun, J., Teo, C.: From CVaR to uncertainty set: implications in joint chance-constrained optimization. Oper. Res. 58(2), 470–485 (2010)
  23. Chen, X., Sim, M., Sun, P.: A robust optimization perspective on stochastic programming. Oper. Res. 55(6), 1058–1071 (2007)
  24. David, H., Nagaraja, H.: Order Statistics. Wiley, New York (1970)
  25. Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3), 596–612 (2010)
  26. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap, vol. 57. CRC Press, Boca Raton (1993)
  27. Embrechts, P., Höing, A., Juri, A.: Using copulae to bound the value-at-risk for functions of dependent risks. Financ. Stoch. 7(2), 145–167 (2003)
  28. Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Preprint (2015). arXiv:1505.05116
  29. Goldfarb, D., Iyengar, G.: Robust portfolio selection problems. Math. Oper. Res. 28(1), 1–38 (2003)
  30. Grötschel, M., Lovász, L., Schrijver, A.: The ellipsoid method and its consequences in combinatorial optimization. Combinatorica 1(2), 169–197 (1981)
  31. Hastie, T., Friedman, J., Tibshirani, R.: The Elements of Statistical Learning, vol. 2. Springer, Berlin (2009)
  32. Jager, L., Wellner, J.A.: Goodness-of-fit tests via phi-divergences. Ann. Stat. 35(5), 2018–2053 (2007)
  33. Kingman, J.: Some inequalities for the queue GI/G/1. Biometrika 49(3/4), 315–324 (1962)
  34. Klabjan, D., Simchi-Levi, D., Song, M.: Robust stochastic lot-sizing by means of histograms. Prod. Oper. Manag. 22(3), 691–710 (2013)
  35. Lehmann, E., Romano, J.: Testing Statistical Hypotheses. Texts in Statistics. Springer, Berlin (2010)
  36. Lindley, D.: The theory of queues with a single server. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 48, pp. 277–289. Cambridge University Press, Cambridge (1952)
  37. Lobo, M., Vandenberghe, L., Boyd, S., Lebret, H.: Applications of second-order cone programming. Linear Algebra Appl. 284(1), 193–228 (1998)
  38. Mutapcic, A., Boyd, S.: Cutting-set methods for robust convex optimization with pessimizing oracles. Optim. Methods Softw. 24(3), 381–406 (2009)
  39. Natarajan, K., Dessislava, P., Sim, M.: Incorporating asymmetric distributional information in robust value-at-risk optimization. Manag. Sci. 54(3), 573–585 (2008)
  40. Nemirovski, A.: Lectures on Modern Convex Optimization. Society for Industrial and Applied Mathematics (SIAM) (2001)
  41. Nemirovski, A., Shapiro, A.: Convex approximations of chance constrained programs. SIAM J. Optim. 17(4), 969–996 (2006)
  42. Rice, J.: Mathematical Statistics and Data Analysis. Duxbury Press, Pacific Grove (2007)
  43. Rockafellar, R., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)
  44. Rusmevichientong, P., Topaloglu, H.: Robust assortment optimization in revenue management under the multinomial logit choice model. Oper. Res. 60(4), 865–882 (2012)
  45. Shapiro, A.: On duality theory of conic linear problems. In: Goberna, M.Á., López, M.A. (eds.) Semi-infinite Programming, pp. 135–165. Springer, Berlin (2001)
  46. Shawe-Taylor, J., Cristianini, N.: Estimating the moments of a random vector with applications (2003). http://eprints.soton.ac.uk/260372/1/EstimatingTheMomentsOfARandomVectorWithApplications.pdf
  47. Stephens, M.: EDF statistics for goodness of fit and some comparisons. J. Am. Stat. Assoc. 69(347), 730–737 (1974)
  48. Thas, O.: Comparing Distributions. Springer, Berlin (2010)
  49. Wang, Z., Glynn, P.W., Ye, Y.: Likelihood robust optimization for data-driven newsvendor problems. Working paper (2009)
  50. Wiesemann, W., Kuhn, D., Sim, M.: Distributionally robust convex optimization. Oper. Res. 62(6), 1358–1376 (2014)

Copyright information

© Springer-Verlag Berlin Heidelberg and Mathematical Optimization Society 2017

Authors and Affiliations

  1. Sloan School of Management, Massachusetts Institute of Technology, Cambridge, USA
  2. Marshall School of Business, University of Southern California, Los Angeles, USA
  3. School of Operations Research and Information Engineering, Cornell University and Cornell Tech, New York, USA
