Data-driven stochastic optimization for distributional ambiguity with integrated confidence region

We discuss stochastic optimization problems under distributional ambiguity. The distributional uncertainty is captured by considering an entire family of distributions. Because we assume the existence of data, we can consider confidence regions for the different estimators of the parameters of the distributions. Based on the definition of an appropriate estimator in the interior of the resulting confidence region, we propose a new data-driven stochastic optimization problem. This new approach applies the idea of a-posteriori Bayesian methods to the confidence region. We are able to prove that the expected value, over all observations and all possible distributions, of the optimal objective function of the proposed stochastic optimization problem is bounded by a constant. This constant is small for a sufficiently large i.i.d. sample size and depends on the chosen confidence level and the size of the confidence region. We demonstrate the utility of the new optimization approach on a Newsvendor and a reliability problem.


Introduction
Since the seminal work by Dantzig [16] and Beale [4], the stochastic (linear) optimization problem has been well studied under the assumption that the involved probability distribution is known [11,34,38,39,52]. However, in practical applications, the distribution is rarely known with sufficient accuracy, even if good estimators are at hand. Stability studies of stochastic optimization problems can yield important insights into the sensitivity of the computed optimal solutions if the assumed distribution does not materialize [37,44,47,48]. To explicitly take the uncertainty around the involved distribution into account, the field of stochastic optimization under distributional ambiguity has emerged, which assumes that the underlying distribution is part of a family of distributions [35, chapter 7]. This paper proposes a new data-driven stochastic optimization approach under distributional ambiguity.

Correspondence: Steffen Rebennack (steffen.rebennack@kit.edu), Institute for Operations Research, Stochastic Optimization, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany.
The key idea of all stochastic optimization approaches under distributional ambiguity is to optimize some objective function over an entire family of distributions. Such an optimization problem is well-defined as soon as an appropriate metric has been chosen. To this end, two different methods have emerged: Bayesian and minimax approaches; in the context of our paper, minimax is also known as distributionally robust stochastic optimization. The Bayesian approach assumes the knowledge of some a-priori distribution; if such an a-priori distribution is known, then the problem reduces to classical stochastic optimization. Minimax approaches take only the worst-case distribution into account, instead of the entire family of distributions. This may result in over-conservative solutions [51,53]; careful construction of the family of distributions can mitigate this conservatism [9].
In this paper, we follow the recent trend of data-driven optimization [2,10,14,45]. We use observations to construct confidence regions. Specifically, we assume the parametric case, i.e., the family of distributions is parametrized by θ . A confidence region at level (1 − α) is then constructed from the observations for the parameter θ . Therefore, we assume that the set of parameter vectors is compact and that we know a safe region which contains the true parameter, i.e., θ is a vector that lies in some compact multidimensional set. For such a situation, minimax approaches typically optimize against the worst case distribution in either the safe region or some kind of confidence region taking the observations into account. In contrast, the Bayesian approach assumes the knowledge of a distribution of the parameter θ . This distribution is then typically estimated by a uniform distribution or a conditional uniform distribution taking the observations into account, for the safe region. The resulting optimization problem is then obtained by taking the expected value, where the parameter θ is treated as a random vector of the associated stochastic optimization problem (which is a function of θ ).
In our proposed approach, we combine the idea of the Bayesian method with a confidence region in a unique way. First, we define an appropriate estimator inside the calculated confidence region. In a second step, a new confidence region around this estimator is constructed. Then, a Bayesian-type optimization model is set up for the resulting confidence region. The careful construction of the new stochastic optimization problem allows us to study the quality of the optimal solutions obtained. Because this data-driven stochastic optimization approach takes only the available observations into account, the solution quality analysis needs to consider all possible observations and all possible distributions of the parameter θ , in the Bayesian sense. Therefore, we build the expectation with respect to all possible observations. This allows us to bound the optimal objective function in the expected value sense.
The derived bound is valid not only for the parameters contained in the confidence region, but for all parameters of the distribution in the safe region. Furthermore, it holds for any finite number of observations. This bound depends on the user-specified confidence level and three terms characterizing the quality of the estimation methods and the set of parameters for the distributions chosen. We are able to show that the derived bound converges to zero with an increase in the number of observations. This paper contributes to the body of literature on stochastic optimization under distributional ambiguity in the following unique ways:
- We propose a new data-driven stochastic optimization model for stochastic optimization under distributional ambiguity (Definition 6).
- We analyze optimal solutions of the proposed model to provide optimality bounds (Theorem 1). We further show that the proposed data-driven stochastic optimization model reduces to stochastic optimization in case of a known parameter (Corollary 1) and that the optimality bound converges to zero for an infinite number of observations (Corollary 2); two desired properties which show the consistency of the proposed model. We show that our approach contains the Bayesian approach as a special case, yielding a new bound for the Bayesian approach (Corollary 3). Further, we use the same proof technique to yield a classical result for the naïve case, where only an estimator is used for θ instead of a confidence region (Corollary 4).
- We provide computational results for two different problems: a Newsvendor and a reliability problem.
The remainder of the paper is organized as follows. We present the mathematical foundations and notation of stochastic optimization under distributional ambiguity and review the literature on both Bayesian and distributionally robust stochastic optimization approaches in Sect. 2. In Sect. 3, we define confidence regions and how they can be computed by combining different estimators. Then, the main contributions are presented in Sect. 4. We suggest a new data-driven methodology towards stochastic optimization under distributional ambiguity and analyze its solution quality with respect to the expected value over all observations and all a-priori distributions. Further, we study the behavior of the optimal solutions for an increase in the number of available observations, i.e., the asymptotic case. We also discuss the complexity of the proposed model. In Sect. 5, we apply the proof ideas towards the Bayesian approach and a naïve approach using only one estimator. We then present computational results in Sect. 6 for two different problems, before we conclude with Sect. 7.
We denote the decision variables by vector x and the decision function by d(·) or simply d. Observations are denoted by a capital X.

Stochastic optimization problem under distributional ambiguity
Let there be l (unknown) parameters which build a random vector denoted by ξ, i.e., the function ξ : Ω → R^l is measurable w.r.t. a σ-algebra A on Ω. Then, E_P denotes the expectation, given the distribution P of this random vector ξ.
We are interested in solving the following stochastic optimization problem for a continuous loss function L : R^n × R^l → R and a feasible region X ⊂ R^n:

min_{x ∈ X} E_P[ L(x, ξ) ].   (1)

The feasible region, X, is assumed to be non-empty and compact. It might be given by a collection of constraints, e.g., X = { x ∈ R^n : g_j(x) ≤ 0, j = 1, …, m }. As such, the m constraints are all deterministic. The optimization problem (1) has n here-and-now decision variables, collected in vector x.
We follow the spirit of stochastic optimization under distributional ambiguity and assume that the distribution P, the "true" distribution of ξ, is not known, but is instead an element of a non-empty set P. The set P can contain all kinds of probability distributions. However, we make the following assumptions:

Assumption 1 The set of distributions of the random vector ξ has the form

P = { P_θ : θ ∈ Θ }   (2)

with Θ ⊂ R^d being compact.
The non-emptiness and compactness of X together with the continuity of L(·, ·) imply that the optimization problems min_{x∈X} E_P[L(x, ξ)] and max_{x∈X} E_P[L(x, ξ)] admit a finite extremum. Together with the compactness assumption on Θ, the corresponding minimization and maximization problems over P ∈ P are finite as well; we exclude the trivial case that the minimum equals the maximum. This allows us to present the main theoretical results of this paper as absolute bounds, rather than relative bounds, by considering the normalized function

Q(x, ξ) := ( L(x, ξ) − A ) / ( B − A ),  with A := min_{P∈P} min_{x∈X} E_P[L(x, ξ)] and B := max_{P∈P} max_{x∈X} E_P[L(x, ξ)],   (3)

and the corresponding stochastic optimization problem

min_{x ∈ X} E_P[ Q(x, ξ) ].   (SP)

This normalization is constructed such that 0 ≤ E_P[Q(x, ξ)] ≤ 1 for all x ∈ X; note that this does not necessarily hold for Q(x, ξ) itself. Assumption 1 assumes the parametric case with an unknown d-dimensional parameter θ. In other words, the form of the distribution is known but its coefficients or parameters are unknown. Then,

Θ_0 := { θ_1, …, θ_K } ⊂ Θ   (4)

is a discretization of Θ in (2) and therefore

P_0 := { P_{θ_k} : k = 1, …, K }   (5)

is a discretization of P, for some K ∈ N. The set of all distributions on the discrete set P_0, as defined in (4) and (5), creates the (K − 1)-dimensional standard simplex Δ^{K−1} ⊂ R^K. In other words, assuming that θ is a random vector, the discretizations (4) and (5) yield a discrete random vector with K possible realizations. The collection of all possible distributions with K realizations is then given by Δ^{K−1}.
To avoid notational ambiguity, we distinguish between an element θ ∈ Θ and the true parameter, which we denote by θ̄. Given the situation that the "true" distribution P is unknown, we cannot expect to solve problem (SP). In fact, we call the stochastic optimization problem (SP) the baseline problem. Our methodology assumes the availability of R data points in R^l.

Assumption 2
We have R realizations X 1 , . . . , X R with X r ∈ R l , r = 1, . . . , R, of the random vector ξ and assume that the corresponding random vectors ξ 1 , . . . , ξ R are i.i.d. random vectors. We denote ζ = (ξ 1 , . . . , ξ R ) and assume that the distribution of ξ i is discrete or has density f (·) with respect to the Lebesgue measure in R l .
Note that we do not require the independence (nor the identical distribution) of the l coordinates of ξ r , r = 1, . . . , R, in Assumption 2.
We also refer to realizations as observations or data throughout the paper. For our discussions of data-driven Bayesian approaches, we require

Definition 1 Let s ∈ Δ^{K−1}. As in the Bayesian approach, s = (s_1, …, s_K) is called the a-priori distribution, i.e., s_k is the probability that θ, as a random vector, has value θ_k for k = 1, …, K. For a given a-priori distribution s, s_k(X) is defined as the conditional probability that θ, as a random vector, has value θ_k, under the condition that ζ has realization X. This conditional distribution s(X) is called the a-posteriori distribution and can be calculated by Bayes' rule:

s_k(X) = s_k · ∏_{r=1}^R f(X_r | θ_k) / Σ_{j=1}^K s_j · ∏_{r=1}^R f(X_r | θ_j).

Two main streams of research have emerged to solve stochastic optimization problems under distributional ambiguity, where P ∈ P: Bayes and minimax.

Bayesian approach
Assuming the discrete case P_0, in the Bayesian approach, the unknown distribution P (or parameter θ) is interpreted as a realization of a discrete random vector. Therefore, we assume an a-priori distribution s = (s_1, …, s_K) on the discrete set of distributions P_0. In the Bayesian approach, typically the discrete uniform distribution is chosen by default on P_0, i.e., s_i = 1/K, i ∈ {1, …, K}, if no special a-priori distribution is given. We obtain the a-priori Bayesian problem, also called the mean-risk stochastic optimization problem,

min_{x ∈ X} Σ_{k=1}^K s_k · E_{θ_k}[ Q(x, ξ) ],   (B)

where we write E_{θ_k}[·] for ease of notation, instead of E_{P_{θ_k}}[·], stating that the expected value is taken for random vector ξ following the distribution P_{θ_k} with parameter θ_k. Any optimal solution to (B) is called the Bayesian solution to the stochastic optimization problem under distributional ambiguity. We can define an analogous a-posteriori Bayesian approach by taking the data X into account to yield

min_{x ∈ X} Σ_{k=1}^K s_k(X) · E_{θ_k}[ Q(x, ξ) ].   (B(X))

We observe that the optimal solution of B(X) depends on the data X, i.e., a different set of realizations of the observation ζ may lead to a different optimal solution.
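The a-posteriori distribution s(X) above can be sketched numerically. The following is a minimal sketch, assuming a discrete grid of K = 3 parameter values and an i.i.d. Normal(θ, σ²) likelihood; all numbers and the likelihood choice are illustrative, not from the paper.

```python
import math

def posterior(prior, thetas, data, sigma=1.0):
    """A-posteriori distribution s(X) via Bayes' rule on a discrete
    parameter grid; the i.i.d. Normal(theta, sigma^2) likelihood is an
    illustrative assumption."""
    def likelihood(theta):
        return math.prod(
            math.exp(-0.5 * ((x - theta) / sigma) ** 2) for x in data)
    weights = [s_k * likelihood(t_k) for s_k, t_k in zip(prior, thetas)]
    total = sum(weights)
    return [w / total for w in weights]

thetas = [0.0, 1.0, 2.0]        # discretized parameter set Theta_0
prior = [1 / 3, 1 / 3, 1 / 3]   # default uniform a-priori distribution
post = posterior(prior, thetas, data=[0.9, 1.1, 1.2])
# the a-posteriori mass concentrates on theta = 1.0
```

The a-posteriori Bayesian problem B(X) then simply replaces the weights s_k by `post[k]` when averaging the expected losses.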
Recently, Wu et al. [57] proposed a data-driven Bayesian optimization approach. The authors apply a risk functional to the expected value E_{θ_k}[Q(x, ξ)]. This risk functional contains the a-posteriori distribution, as a mapping from the random vector to the real numbers. Mean, mean-variance, value-at-risk (VaR) and conditional value-at-risk (CVaR) are considered as four different risk measures. This Bayesian risk optimization framework had already been proposed earlier [60], whereas tailored solution strategies were first presented in [61]. For the case of an infinite number of observations, Wu et al. [57] prove several consistency and asymptotic results. This is the main theoretical difference to our work, where we establish bounds for a finite number of observations; cf. Theorem 1.

Minimax: distributionally robust stochastic optimization
Robust optimization approaches typically do not require probabilistic information but rely solely on the range (support) of the parameters. Then, robust optimization seeks a solution which optimizes against the worst among all possible realizations of the parameters [6].
If a set of distributions P is given instead of a single one, recent work has applied robust approaches for stochastic optimization problems under distributional ambiguity in the following minimax sense [59]:

min_{x ∈ X} max_{P ∈ P} E_P[ Q(x, ξ) ].

This line of research is called distributionally robust stochastic optimization. In this context, the set of distributions P is called the ambiguity set. An a-posteriori approach is obtained when taking the observations X into account to yield a new ambiguity set P(X); see Sect. 3.3. More precisely, the data are used to construct a probability model with an associated "confidence region" which then yields the ambiguity set (as a subset of P). Chapter 7 in the book by Pflug and Pichler [35] explains this concept very clearly. However, using an a-posteriori ambiguity set, constructed in such a way, is "dangerous" in the sense that we lose control over the error in the set of distributions "outside" P(X); see Sect. 4.
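The minimax structure above is easy to illustrate on a fully discretized toy instance. This is a sketch with made-up loss values: each row holds the expected loss of one candidate decision under each distribution of a two-element ambiguity set.

```python
# Each row: expected loss E_theta[Q(x, xi)] of one candidate decision x,
# evaluated for each distribution in a discretized ambiguity set;
# the numbers are purely illustrative.
losses = [
    [0.2, 0.9],   # decision 0: great under theta_1, poor under theta_2
    [0.4, 0.5],   # decision 1: balanced
    [0.8, 0.3],   # decision 2: poor under theta_1, good under theta_2
]
worst_case = [max(row) for row in losses]            # inner max over P
x_star = min(range(len(losses)), key=worst_case.__getitem__)  # outer min
# the balanced decision (index 1) minimizes the worst case
```

Note how the minimax criterion ignores that decision 0 is excellent under θ_1; this is exactly the conservatism the paper's Bayesian-like approach tries to soften.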
Because there is a growing body of literature in the area of distributionally robust stochastic optimization [49], we restrict our discussions on a few truly outstanding papers. The papers on distributionally robust stochastic optimization can be classified by the way the ambiguity sets are constructed. There are several papers which restrict the set of probability distributions by limiting its maximal deviation to a reference distribution, the baseline model [13,27]. Others restrict the set of probability distributions by some conical constraint [17] or derive bounds via two measurable sets [50]. Also, methods have been proposed which restrict the moments of the probability distributions considered [24,26,28]. Finally, one can also restrict the set of possible distributions by some ball in the Wasserstein distance sense [36]. A general class of ambiguity sets was also proposed which contains most published ambiguity sets as special cases [56].
Data-driven distributionally robust stochastic optimization approaches considering confidence regions have been proposed by various groups. Delage and Ye [17] construct ambiguity sets from data for mean and covariance matrices to yield distributionally robust stochastic optimization problems which allow for probabilistic performance guarantees. Bertsimas et al. [8] connect sample average approximation, distributionally robust optimization and hypothesis testing of goodness-of-fit. The resulting robust sample average approximation makes use of a refined ambiguity set construction and confidence regions to establish necessary and sufficient conditions on the hypothesis test to ensure that the resulting solution satisfies certain (probabilistic) finite-sample and asymptotic performance guarantees in a variety of parametric and non-parametric settings. This idea has been extended to so-called Wasserstein sets, establishing statistical guarantees on distributionally robust policies [19]. Ambiguity sets composed of all distributions within a certain Wasserstein distance of an empirical distribution (for given observations) were also proposed and analyzed [21]. In an earlier work, Wang et al. [55] construct ambiguity sets by considering distributions which achieve a certain level of likelihood, given the set of observations. Gupta [25] uses Bayesian techniques to construct ambiguity sets which have very desirable features: they are optimal in some sense. In addition, the constructed ambiguity sets outperform many popular ambiguity sets previously proposed in the literature. The tutorial by Bayraksan [3] discusses the construction of data-driven ambiguity sets by limiting the phi-divergence of all considered distributions with respect to some nominal distribution. Agrawal et al. [1] quantify the maximal loss incurred when correlations among data are ignored via distributionally robust stochastic optimization problems. All papers in the stream of research on data-driven distributionally robust stochastic optimization approaches differ from the approach proposed in this paper in two main ways: (1) these approaches are worst-case approaches in the minimax sense, while we propose the optimization of an expected value, and (2) their performance guarantees are either probabilistic (i.e., holding with high probability) or asymptotic (i.e., for an infinite number of observations), while our guarantees are deterministic and hold for any number of observations in the entire parameter space.
Van Parys et al. [54] also deal with data-driven distributional optimization and derive a special kind of asymptotically optimal decision. Similar to our work, they also provide a series of finite-sample-size results. There are, however, some fundamental differences to the present paper. First, their approach is strictly robust, not a Bayesian-like softened robust approach. Second, the data are used for a predictor (and prescriptor) of the cost function. In contrast, we have chosen an estimator (as a function of the data) for the parameter that defines the distribution of the random parameter vector. Third, we can bound the expected value, over all observations and all possible distributions, of the optimal objective function of the proposed stochastic optimization problem for a finite number of observations by a constant, and accordingly the asymptotic analysis yields the convergence of this constant to zero. This stands in contrast to the finite-sample-size results of Van Parys et al. [54], for example their Theorem 8, in that they assume the asymptotic distribution P∞ (i.e., the sample path distribution) while our results hold for all a-priori distributions, cf. Theorem 1.
Soft robust optimization [5] and light robustness [20] have been suggested, for instance, to overcome the conservativeness of robust optimization under distributional ambiguity. Robust stochastic optimization approaches under ambiguity have received special attention in financial applications by considering ambiguous risk and utility functions [22,62].

Application of estimation methods
By construction, the true parameter θ̄ is included in the set Θ. However, in many applications, it is sufficient that this condition is met with probability less than 1, say equal to (1 − α) with 0 < α < 1, for a subset of Θ. The value α is a user input and should be chosen as a good compromise between the level (1 − α) and the size of the corresponding subset (i.e., confidence region). Loosely speaking, the level (1 − α) allows the application of confidence regions instead of the full set Θ.

Confidence region
We are seeking a confidence region, determined for an estimator T = T(ζ), where we rely on the usual definition of a (point) estimator, i.e., a measurable function T = T(ζ) of the observations with values in Θ ⊂ R^d, according to the following Definition 2. In Definition 2, (a) requires that any estimator T is, for all realizations of ζ, contained in the confidence region R(α, T) as a subset of Θ. Conditions (b) and (c) ensure that the confidence region R(α, T) contains the true parameter θ̄ with probability ≥ (1 − α). We remark that in the literature, confidence regions are also defined without any estimator. For the following, we are given the realization X of ζ.
We call R(α, T(X)) the a-posteriori confidence region to T(X) at the level (1 − α).

Integration of confidence regions
If more than one estimation procedure is at hand, say V estimators T_1, …, T_V, we can intersect their corresponding a-posteriori confidence regions R(α_ν, T_ν(X)). The idea is to choose, for every T_ν, (1 − α/V) as the level of the corresponding confidence region and to intersect all a-posteriori confidence regions as follows:

I(α, X) := ∩_{ν=1}^V R(α/V, T_ν(X)).

Note that I(α, X) is in general not an a-posteriori confidence region according to Definition 3, as an appropriate estimator T is missing. However, if only one estimator is at hand, i.e., V = 1, then we denote the appropriate a-posteriori confidence region by I(α, X) as well.

Remark 1
The choice of the α_ν values as α/V is arbitrary. Any other assignment is fine as well, as long as α_ν ∈ [0, 1] and the condition α = Σ_{ν=1}^V α_ν holds. Remark 1 holds because the probability that ∩_{ν=1}^V R(α_ν, T_ν(X)) does not contain θ̄ is at most the probability that at least one of the confidence regions R(α_ν, T_ν(X)) does not contain θ̄, which is less than or equal to Σ_{ν=1}^V α_ν = α. Figure 1 illustrates the resulting region I(α, X) when two a-posteriori confidence regions are integrated, i.e., intersected.
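The Bonferroni-style intersection can be sketched in one dimension. This sketch assumes two stand-in estimators (sample means of the two halves of the data) with normal-approximation intervals; the data and helper names are illustrative only.

```python
import math
import statistics

def mean_ci(data, alpha):
    """Normal-approximation confidence interval for the mean; a simple
    stand-in for a region R(alpha_nu, T_nu(X))."""
    n = len(data)
    m = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return (m - z * se, m + z * se)

def integrate_regions(intervals):
    """Intersect one-dimensional regions; None if the intersection is empty."""
    lo = max(i[0] for i in intervals)
    hi = min(i[1] for i in intervals)
    return (lo, hi) if lo <= hi else None

data = [1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.05]
alpha, V = 0.10, 2
# Bonferroni split: each of the V = 2 estimators gets level 1 - alpha/V
regions = [mean_ci(data[:4], alpha / V), mean_ci(data[4:], alpha / V)]
region = integrate_regions(regions)
# the intersected region I(alpha, X) covers the true mean with level >= 1 - alpha
```

The union-bound argument of Remark 1 is exactly why each interval is widened to level 1 − α/V before intersecting.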

Integrated confidence region with a new estimator
To transform I (α, X) into an a-posteriori confidence region, we require a suitable estimator T o . Such an estimator is defined in the following, together with the resulting a-posteriori confidence region.
For any given α, with 0 < α < 1, we define an estimator T_o(α, X) of θ as an element of I(α, X) which minimizes the squared maximal Euclidean distance, i.e.,

T_o(α, X) ∈ argmin_{t ∈ I(α,X)} max_{θ ∈ I(α,X)} ||t − θ||².   (8)

The motivation of definition (8) is to obtain a parameter which has the minimal distance to the "true" parameter θ̄ at the level (1 − α), i.e., with high probability. This is an a-posteriori-like result. The mathematical reason for this particular choice of T_o becomes clear with the proof of Theorem 1 on page 13.
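On a discretized region, the estimator defined by the min-max distance criterion (essentially a Chebyshev-center-like point) can be computed by brute force. A minimal sketch, assuming the region has been discretized into a finite grid of points; the grid and function name are illustrative.

```python
def t_o(region):
    """Grid sketch of the estimator T_o in (8): pick the point of a
    discretized region I(alpha, X) that minimizes the squared maximal
    Euclidean distance to all points of the region."""
    def max_sq_dist(t):
        return max(sum((a - b) ** 2 for a, b in zip(t, p)) for p in region)
    return min(region, key=max_sq_dist)

# discretized unit square as a toy region I(alpha, X)
grid = [(x / 4, y / 4) for x in range(5) for y in range(5)]
center = t_o(grid)
# for this symmetric grid, the minimizer is the midpoint (0.5, 0.5)
```

For large or continuous regions one would replace the brute-force scan by a dedicated min-max (Chebyshev center) solver, but the criterion is the same.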

Remark 2 The set
is also a confidence region at the level (1 − α).

Remark 3 The set
is an a-posteriori confidence region to estimator T o at level (1 − α) and, with that, also a subset of .
In the remainder of the paper, we mostly use H_{T_o}(α, X). Figure 2 illustrates the estimator T_o(α, X) and H_{T_o}(α, X), corresponding to Fig. 1.

A-posteriori minimax using confidence region
Any (a-posteriori) region I(α, X) can be readily applied towards the minimax approach. The idea is to replace the ambiguity set P by I(α, X) ∩ Θ_0. We obtain the DRSP problem at level (1 − α):

min_{x ∈ X} max_{θ ∈ I(α,X) ∩ Θ_0} E_θ[ Q(x, ξ) ].   (DRSP_α)

For any realization X of observation ζ, I(α, X) is a subset of Θ, and hence the robustness of (DRSP_α) is somewhat "softened."

Stochastic optimization with confidence level for the continuum Θ
Based on the a-posteriori confidence region H T o (α, X) to estimator T o , we propose a new stochastic optimization problem in an effort to solve stochastic optimization problems under distributional ambiguity. The key idea is the combination of the a-posteriori Bayesian approach with our (integrated) confidence region.

Decision functions and mean-risk
In addition to the new estimator T_o(X) and its corresponding confidence region, we propose a novel solution for the stochastic optimization problem under distributional ambiguity minimizing the mean-risk. For that, we require the following definitions.
Definition 4 A decision function d is a measurable function mapping each realization X of the observation ζ to a feasible decision d(X) ∈ X. We collect all such feasible decision functions d in the set D.
The decision function formalizes the concept that any solution of an a-posteriori optimization problem depends on the data X. With that, any solution is a function of the observation ζ .
Definition 5 For any distribution W on Θ (in the a-priori sense) with Lipschitz-continuous density w(θ), the risk of a decision function d ∈ D for parameter θ is defined as the expectation over the observations,

risk(d, θ) := E[ E_θ[ Q(d(ζ), ξ) ] ],

and the mean-risk for d ∈ D is then defined as

mean-risk(d) := ∫_Θ risk(d, θ) · w(θ) dθ.

The mean-risk for d ∈ D is also called the Bayes risk for d.
Because Θ is compact, it is in particular a Lebesgue-measurable set. The risk is defined as the expected value of the function E_θ[Q(d(ζ), ξ)] of the random vector ζ. As such, the risk is an average of the objective function value for the decision function d of the stochastic optimization problem (SP). The mean-risk is then the average of risk(d, θ) over all parameters θ ∈ Θ. The risk function is of interest because the true value of the parameter θ is unknown.

The proposed solution
We propose a new kind of "softened robustness" in an effort to avoid the often criticized "over-conservatism" of the minimax criterion. This proposed alternative is a Bayesian-like approach. In this context, "robust" means that the Bayesian solution remains approximately optimal in a neighborhood (of some appropriate metric space of a-priori distributions) around the given a-priori distribution, or around the a-priori uniform distribution (if none is specified). While the Bayesian approach can overcome the aforementioned "over-conservatism", it is sensitive to the choice of the a-priori distribution and thus, robustness with respect to this distribution is of interest, cf. [7] and [23, chapters 3.6 and 3.7]. This is what we propose in this paper. This new kind of robustness is intuitively successful if, for the computation of the a-posteriori Bayes solution, instead of the entire parameter space Θ, a confidence region around an efficient estimator T_o is chosen whose diameter is small (in the sense that the resulting a-posteriori confidence region is smaller than Θ) with a high confidence level (i.e., small α). This is the motivation for us to define the a-posteriori confidence region H_{T_o}(α, X) for the estimator T_o.

Definition 6
For any given data X and level (1 − α), consider the data-driven stochastic optimization problem BP*_α(X), which minimizes over x ∈ X the integral of E_θ[Q(x, ξ)] over our best confidence region H_{T_o}(α, X). We denote a solution of the data-driven stochastic optimization problem BP*_α(X) by B*(α, X).
The proposed new optimization model BP*_α(X) represents a Bayes-like solution. Therefore, the appropriate optimality criterion is the Bayesian risk. The Bayesian risk, as defined in Definition 5, is an average, with respect to the a-posteriori distribution, of the mean risk as a function of θ.
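The structure of BP*_α(X) can be sketched on a toy newsvendor instance. This is a Monte-Carlo sketch under stated assumptions: demand follows Exponential(θ), the confidence region H is discretized into a uniform grid of θ values, and the expected loss is averaged over this grid; all constants (c, p, the grid ranges) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
c, p = 1.0, 3.0                          # unit cost and selling price
H = np.linspace(0.8, 1.2, 9)             # discretized confidence region H
xs = np.linspace(0.0, 3.0, 61)           # candidate order quantities x

def expected_loss(x, theta, n=4000):
    """Monte-Carlo estimate of E_theta[L(x, xi)] for the newsvendor
    loss L(x, xi) = c*x - p*min(x, xi), demand xi ~ Exp(mean 1/theta)."""
    xi = rng.exponential(1.0 / theta, size=n)
    return c * x - p * np.minimum(x, xi).mean()

# BP*-style objective: average the expected loss over the region H
objective = [np.mean([expected_loss(x, t) for t in H]) for x in xs]
x_star = float(xs[int(np.argmin(objective))])
# the averaged objective has an interior minimizer (ordering something
# beats ordering nothing since p/c > 1)
```

Discretizing the integral over H into a finite sum is exactly the numerical strategy discussed for BP*_α(X) in the complexity section.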

Remark 4
Solutions B*(α, X) are well-defined because BP*_α(X) optimizes a continuous function of x over the compact set X; we assume that Q is a continuous function.

The main property of B*(α, X)
Theorem 1 provides the main properties of solutions B*(α, X) on the continuum Θ for any finite number of observations. We utilize this theorem to derive similar bounds for the Bayesian approach and the naïve approach.
First, for fixed α, we define η*(α) in terms of Δ(α), as defined in (9), and the standard deviation std(T_o) of T_o; see (12). In particular, if only one estimator is used, then it is particularly easy to check whether or not T_o is unbiased. The main result is then stated as:

Theorem 1 Let the continuum Θ be non-empty and compact, let L(·, ·) be continuous and non-constant on the non-empty compactum X, and let Assumption 2 hold. For any data X and level (1 − α), let B*(α, X) be a solution of BP*_α(X), and let B*(α, ·) denote the corresponding decision function. Then, for any distribution W on Θ (in the a-priori sense) with Lipschitz-continuous density w(θ) and Lipschitz constant l_W, as well as η*(α) as defined in (12) and the Lebesgue measure λ(Θ) of Θ, it holds that

mean-risk(B*(α, ·)) − min_{d ∈ D} mean-risk(d) ≤ α + η*(α) · λ(Θ) · l_W.   (14)

Proof For any feasible decision function d ∈ D, θ ∈ Θ and realization y of ζ, we define for notational convenience

Case: T_o not unbiased.
In this case, η*(α) = Δ(α). By definition and using the Lebesgue integral, the optimality deviation splits into the two terms (15a) and (15b). The first term, (15a), is re-written as follows. For any y, according to the definition of B*(α, y), the term (16a) is ≤ 0. By using the Lipschitz constant l_W, the term (16b) can be estimated via Hölder's inequality (with parameter p = 1), by re-writing the integrand and by identifying ∏_{r=1}^R f(y_r | θ) dy_1 … dy_R as the (probability) measure. Hence, for (15a), we obtain the bound η*(α) · λ(Θ) · l_W. The second term, (15b), is estimated by α, as w(·) is a probability density and therefore ∫_Θ w(θ) dθ = 1, and H_{T_o}(α, ζ) is a confidence region at level (1 − α), cf. Remark 2.
We use the normalization (3) of the function Q(·, ·) in inequality (14); this is precisely the motivation for normalizing Q(·, ·).
We make the following comments on Theorem 1 for our data-driven approach towards stochastic optimization under distributional ambiguity and solutions B*(α, X): (i) Theorem 1 is a Bayesian-type result because the bound α + η*(α) · λ(Θ) · l_W assumes an a-priori distribution W on Θ. (ii) The bound α + η*(α) · λ(Θ) · l_W is averaged over all possible observations (this is a unique result). (iii) The bound depends on four constants: α, η*(α), λ(Θ) and l_W. α is a user-controlled parameter; η*(α) depends on the quality of the estimator, see (12); λ(Θ) is the size of the parameter space; l_W depends on the chosen a-priori distribution W and is controlled by the user as well. (iv) The bound improves with smaller α values. This is consistent with our intuition in that smaller α values yield confidence regions of higher level. (v) Better estimators also yield smaller bounds because η*(α) gets smaller. (vi) Increasing the "sharpness" of the a-priori distribution yields a larger bound as the slope l_W of the density increases. This is because a sharper distribution puts more "weight" on a smaller number of parameters θ ∈ Θ, making it harder for solutions B*(α, X) to yield good results for this sharper a-priori distribution.
Next, we draw a connection to classical stochastic optimization where the distribution function is known, i.e., |P| = 1.
Corollary 1 If the parameter is known, i.e., |P| = 1 (K = 1), then any solution of BP*_α(X) solves (SP).

Proof For K = 1, the estimator T_o(α, X) = θ̄ and η*(α) ≡ 0. The confidence level (1 − α) = 1. The quantities λ(Θ) and l_W are bounded, as the dimension d of the parameter θ is fixed. The solution of BP*_α(X) solves (SP) to optimality since the Lebesgue integral in BP*_α(X) with respect to the unique trivial (probability) measure, which gives the singleton θ̄ mass 1, yields the optimization problem min { E_θ̄[Q(x, ξ)] : x ∈ X }.

Asymptotic analysis
The idea of the asymptotic analysis is to show that the optimality deviation (α + η*(α) · λ(Θ) · l_W) converges to zero as the number of observations R grows.
We have defined R ∈ N as the finite number of i.i.d. observations and ζ = (ξ_1, …, ξ_R) as the corresponding random vector. In the asymptotic case, there is an infinite (and countable) set of i.i.d. observations (X_1, X_2, …) and (ξ_1, ξ_2, …) is the corresponding infinite-dimensional random vector. Therefore, we have a sequence of estimators (T_R)_{R∈N}, where T_R is based on the first R observations. We also must require the sequence of estimators (T_R)_{R∈N} to be asymptotically efficient in the sense that lim_{R→∞} α_R = 0, with

α_R := P( ||T_R − θ̄|| ≥ η_R )   (19)

for some positive null sequence (η_R)_{R∈N}. Clearly, T_R corresponds with T_o, α_R with α and η_R with η*. In many cases, one can choose as (η_R)_{R∈N} any null sequence for which (√R · η_R)_{R∈N} diverges to infinity; for example, one may choose η_R = R^{−1/4}. If the parameter is the expected value (i.e., the ξ_i are random variables) and (T_R)_{R∈N} is the sample mean, then, under the corresponding assumptions, the inequality of Chebyshev provides the convergence of (α_R)_{R∈N} to zero. This can be seen as follows. According to the Chebyshev inequality,

P( |T_R − θ̄| ≥ η_R ) ≤ Var(ξ_1) / (R · η_R²),

which converges to zero as R tends to infinity whenever √R · η_R → ∞.
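The Chebyshev argument can be checked numerically. A minimal sketch, assuming the sample mean with unit variance and the choice η_R = R^{−1/4}, so the bound becomes σ²/√R; the sample sizes are illustrative.

```python
# Chebyshev sketch for the sample mean T_R of R i.i.d. observations
# with variance sigma^2: P(|T_R - theta| >= eta_R) <= sigma^2/(R*eta_R^2).
# With eta_R = R**(-1/4), the bound equals sigma^2 / sqrt(R) -> 0.
sigma2 = 1.0

def alpha_bound(R):
    eta = R ** (-0.25)                   # null sequence with sqrt(R)*eta -> inf
    return sigma2 / (R * eta ** 2)       # = sigma2 / sqrt(R)

bounds = [alpha_bound(R) for R in (10, 100, 10_000)]
# the bound decreases monotonically toward zero
```

Both requirements are visible here: η_R itself is a null sequence, while √R · η_R = R^{1/4} diverges, which is what drives α_R to zero.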

Corollary 2
We assume that the sequence (l_R)_{R∈N} of Lipschitz constants, corresponding with l_W, is bounded. Typically, it even converges to zero, as the corresponding calculation for the case of the normal distribution with the mean as parameter shows.
For any given data X and R ∈ N, consider the stochastic optimization problem (BP*_{α_R}(X))_R for the confidence region of T_R, with optimal value z*_{B*,R}(X) and solution (B*(α_R, X))_R. Then the sequence of solutions (B*(α_R, X))_{R∈N} is asymptotically optimal for all θ ∈ Θ (Theorem 1), i.e., the optimality deviation (α_R + η_R · λ(Θ) · l_R) converges to zero.

The complexity of BP*_α(X)
The complexity of optimization problem BP*_α(X) depends (i) on the shape of the function E_θ[Q(x, ξ)], (ii) the shape of the feasible region X and (iii) the dimension of H_{T_o}(α, X). We start by recognizing that the optimization problem BP*_α(X) preserves the convexity of the original problem, i.e., if the stochastic optimization problem (1) is a convex optimization problem (for known probability distribution, i.e., given θ), then BP*_α(X) is also a convex optimization problem. This holds because, for λ ∈ [0, 1] and feasible x_1, x_2, the integral over H_{T_o}(α, X) preserves the convexity inequality of the integrand. Problem BP*_α(X) can be solved, for instance, by using a numerical approximation method for the integral over H_{T_o}(α, X). The resulting optimization problem then falls into one of the standard classes, for example, convex nonlinear programming (as in the computational examples in Sect. 6) or, more generally, mixed-integer nonlinear non-convex programming (MINLP) problems. If the resulting optimization problem, after using some numerical approximation for the integral, is a (non-convex) MINLP, then one can resort to available off-the-shelf global optimization solvers. Alternatively, the resulting MINLP problem can be approximated using piecewise linear constructs to yield MILP problems, to any desired approximation quality [12,29,40–43].
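As a minimal sketch of this solution strategy (our own illustration, not the paper's model: the quadratic stand-in g(x, θ) = (x − θ)², the interval [45, 55] and the search grid are assumptions), the integral over a one-dimensional confidence region is approximated by the composite trapezoidal rule and the resulting convex problem is solved by enumeration:

```python
# Sketch: outer minimization over x of the trapezoidal approximation of
# integral_{H} g(x, theta) d theta for a 1-D confidence region H = [lo, hi].
# In BP*_alpha(X), g would be the inner expectation E_theta[Q(x, xi)].

def trapezoid(f, lo, hi, n):
    """Composite trapezoidal rule with n sub-intervals."""
    h = (hi - lo) / n
    inner = sum(f(lo + i * h) for i in range(1, n))
    return h * (0.5 * (f(lo) + f(hi)) + inner)

def bp_objective(x, lo=45.0, hi=55.0, n=73):
    return trapezoid(lambda theta: (x - theta) ** 2, lo, hi, n)

# Convexity in x is preserved by the integral, so a fine grid (or any convex
# NLP solver) recovers the minimizer; here it is the midpoint of [45, 55].
grid = [45.0 + 0.01 * k for k in range(1001)]
x_star = min(grid, key=bp_objective)
```

Because the integral of convex functions is convex, any local method suffices for the convex case; the grid search above is only used to keep the sketch dependency-free.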
For the optimization problem BP*_α(X) to be computationally tractable, it is not sufficient that the stochastic optimization problem (1) is tractable. The reason is that numerical integration (e.g., when using a cubature formula for dimension ≥ 2) gets hard with increasing dimension of H_{T_o}(α, X): one runs into the curse of dimensionality [15].

Bounding the standard Bayesian and the Naïve approach
We next apply the idea of the proof of Theorem 1 to the Bayesian and the naïve approach.

The standard Bayesian approach
We start with the definition of the standard Bayesian approach. Definition 7 For any given data X, consider the data-driven stochastic optimization problem BP*(X) with its associated constant. The chosen a-posteriori distribution is the uniform distribution on Θ, i.e., standard Bayesian. We denote a solution as B(X), called the standard Bayesian solution; see also Sect. 2.1 about the Bayesian approach.
For the standard Bayesian solution B(X), as defined in Definition 7, we derive from Theorem 1 the following

Corollary 3 We impose the same assumptions and notation as in Theorem 1 and additionally the unbiasedness of T_o. Then, for any distribution W on Θ with Lipschitz-continuous density w(θ)
and Lipschitz constant l_W, the bound (22) holds. Proof The proof, especially the second part, is analogous to that of Theorem 1, by definition and using the Lebesgue integral with respect to ∏_{r=1}^R f(y_r|θ) dy_1 . . . dy_R dθ.
Now we estimate the term (24) as follows: we use Hölder's inequality by re-writing ‖θ − T_o(α, y)‖_2 = ‖θ − T_o(α, y)‖_2 · 1 and by identifying ∏_{r=1}^R f(y_r|θ) dy_1 . . . dy_R as the (probability) measure; we also recall (14). Because the term (23) is ≤ 0, the claimed bound holds.

Remark 5 Theorem 1 and Corollary 3 establish the optimization of the mean-risk up to the constant (α + η* · λ(Θ) · l_W) in (13) resp. up to (η · λ(Θ) · l_W) in (22) for an entire class of a-priori distributions. In other words, B*(α, X) is the solution of the stochastic optimization problem under distributional ambiguity, minimizing the mean-risk (up to a constant and for all a-priori distributions with Lipschitz-continuous density).
Theorem 1 differs from the Bayesian approach (as described in Sect. 2.1) in two ways. First, in the Bayesian approach, the full parameter set Θ is used instead of H_{T_o}(α, X), as mentioned above. Second, Bayesian approaches do not involve an estimator T_o(α, X).

Optimality bound for the risk
The key idea of our new approach, BP*_α(X), is the combination of an a-posteriori Bayesian-like approach with a confidence region around a special estimator. Therefore, the derived optimality bound relates to the mean-risk.
In order to also obtain a similar result in the classical sense, i.e., an optimality bound for the risk with respect to the true parameter, we now take the estimator itself instead of a confidence region around it. This is the naïve approach, where the uncertainty around the distribution P is ignored and the best estimator for θ is chosen.
We assume that we have an estimator T_d = T_d(ζ) for the (true) parameter θ ∈ Θ, where we rely on the usual definition of an estimator, i.e., a measurable function T_d = T_d(ζ) from the sample space into Θ. We then solve the stochastic optimization problem (SP) for T_d(X) instead of θ, which we call the naïve approach.
Definition 8 For any given data X, consider the data-driven stochastic optimization problem DP*_{T_d}(X). We denote a solution of the data-driven stochastic optimization problem DP*_{T_d}(X) as D*_{T_d}(X). For our analysis, we require the confidence region for estimator T_d at any confidence level (1 − α). For any θ ∈ Θ, we set for abbreviation f(y|θ) := ∏_{r=1}^R f(y_r|θ) for y = (y_1, . . . , y_R).
We assume the Lipschitz continuity of f(y_0|θ) as a function of θ, with Lipschitz constant l(y_0) for every y_0 ∈ ξ(Ω), and assume that the corresponding constant l_d exists. With this notation in place, we can derive the bound for the naïve approach.

Corollary 4 Let L(·, ·) be continuous and non-constant on the non-empty compactum X and let Assumption 2 hold. For any data X, let D*_{T_d}(X) be a solution of DP*_{T_d}(X). Then, for any θ ∈ Θ and any α with 0 < α < 1, the bound (30) holds. Proof Let any θ ∈ Θ be given. For any feasible decision function d ∈ D and y = (y_1, . . . , y_R), we define the corresponding quantities for notational convenience. By definition and using the Lebesgue integral, the first term, (32a), is re-written accordingly. According to the definition of D*(y), the inner inequality holds for all y; therefore (33a) is also ≤ 0.
According to (31) and by using the Lipschitz constant l_d, (33b) can be estimated as well. Hence, for (32a), we obtain the stated bound (because the probability is less than or equal to one). The second term, (32b), is estimated analogously, the first estimation respecting (31). Corollary 4 is a classical result, in contrast to the Bayesian-type results in Theorem 1 and Corollary 3, in the sense that it holds for all θ ∈ Θ. In other words, the bound in (30) holds for the risk and not the mean-risk.

Computational results
The computational results are obtained with GAMS version 24.4.6, where we use LINDOGLOBAL and BARON to solve the resulting non-linear programming (NLP) problems in Sect. 6.1 and Sect. 6.2, respectively [33,46]. For the sake of simplicity in the discussions, we optimize the loss function L(·, ·) directly, instead of the normalized and standardized expected function E Q(·, ·). We approximate the integrals in the optimization problem BP*_α(X) via the trapezoidal rule. Together with the convexity of the considered functions L(·, ·), the resulting optimization problem for BP*_α(X) is convex. Note that solving the new optimization model BP*_α(X) is computationally more expensive than the Bayesian approaches, which are in turn more computationally expensive than the distributionally robust optimization approaches, for our tested instances.

Production planning & control (PPC): newsvendor problem
Consider the following simplified and stylized production planning & control (PPC) problem, in this special form also known as the newsvendor problem, with a single consumer product to be produced for a single time horizon [18,32,58]. The demand is random and follows a normal distribution with unknown mean μ and known standard deviation of 10, i.e., ξ ∼ N(μ, 10²).

Illustrative example
We assume that the true mean μ̄ = 50 is unknown. However, we assume knowledge that μ̄ is contained in the interval [40, 55]. Overproduction leads to inventory cost of c_1 = 2 [$/item]; underproduction is allowed but imposes a penalty of c_2 = 10 [$/item]. The cost (or loss) function is then given by L(x, ξ) = c_1 · [x − ξ]_+ + c_2 · [ξ − x]_+ (35) with [y]_+ = max{0, y}. There is no initial inventory, and between [l, u] = [25, 100] items can be produced. For some known random variable ξ ∼ P = N(μ, σ²), the expected value of (35) is evaluated in closed form via the standard normal density and distribution function. Notice that E_P L(x, ξ) is a convex function in x and that both min_{x∈X,P∈P} E_P L(x, ξ) and max_{x∈X,P∈P} E_P L(x, ξ) exist, i.e., they are finite. To evaluate the computed solutions for the true objective function, we define (36). The optimal solution x* as a function of θ is shown in Fig. 3. The corresponding function values for the true parameter θ̄ = 50 are shown in Fig. 4. Table 1 summarizes the results of the different stochastic optimization models, as listed in the first column. The second column reports their optimal solution; the third column lists their corresponding objective function value; the fourth column shows their true objective function value (36), i.e., the computed solution is evaluated by the model with the true but unknown parameter θ̄; the last column presents the gap between the optimal objective function value and the true objective function value, i.e., gap = (true objective function value − 29.982)/29.982. The a-posteriori approaches (S2), (S4), (S5), (T1) and (T2) utilize R = 20 realizations of ζ. These values are given in the footnote below so that an interested reader can re-compute all quantities of the illustrative example.
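The closed-form expected loss can be verified numerically. The following sketch (our own check; the closed-form expressions for E[x − ξ]_+ and E[ξ − x]_+ are the standard formulas for the normal distribution) recovers the true optimal value of approximately 29.982 used in the gap column of Table 1:

```python
import math

# Expected newsvendor loss E L(x, xi) = c1 * E[x - xi]_+ + c2 * E[xi - x]_+
# for xi ~ N(mu, sigma^2), via the standard normal pdf phi and cdf Phi.

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_loss(x, mu=50.0, sigma=10.0, c1=2.0, c2=10.0):
    z = (x - mu) / sigma
    over = (x - mu) * Phi(z) + sigma * phi(z)  # E[x - xi]_+
    under = (mu - x) + over                    # E[xi - x]_+
    return c1 * over + c2 * under

# Search the feasible set X = [25, 100]; the minimizer obeys the critical
# fractile Phi((x* - mu) / sigma) = c2 / (c1 + c2).
grid = [25.0 + 0.01 * k for k in range(7501)]
x_star = min(grid, key=expected_loss)
```

The minimum value is approximately 29.982, matching the denominator of the reported gaps.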
The baseline model is the true model (therefore, the values in columns three and four are identical and the gap in column five is 0%); thus, the baseline model (B) serves as reference. For the a-priori Bayesian model (S1), we use the (discrete) uniform distribution as the a-priori distribution, i.e., s_k = 1/151 for k ∈ {1, . . . , 151}. The corresponding results are shown in Table 1. The obtained solution is worse than the three a-posteriori approaches but slightly better than the other a-priori approach (S3). This is an expected result (though it does not hold in general for any random draw of the realizations), as the a-posteriori approaches take additional information into account compared to the a-priori approaches; (DRSP) is a worst-case approach and, thus, is typically outperformed if the worst-case distribution does not mature.
In our example, ζ is a continuous random vector. Because ξ_1, . . . , ξ_R are i.i.d. random vectors, we obtain the a-posteriori Bayesian problem (B(X)). Its optimal solution and the function values are reported in row (S2) in Table 1. We observe that the a-posteriori Bayesian approach outperforms both a-priori approaches: the a-priori Bayesian and the a-priori distributionally robust stochastic optimization model, whose results are reported in row (S3). We choose the sample mean as our first estimator T_1, i.e., μ̂_1 = (1/R) ∑_{r=1}^R X_r. Because the standard deviation is known, the sample mean follows the normal distribution N(μ̄, 10²/R). This yields the (1 − α) confidence interval (37) for the true mean μ̄, with the critical value z_α of the standard normal distribution. Because we know that μ̄ ∈ [a, b] ⊂ R, we may truncate the confidence interval (37) to obtain the truncated confidence interval (38). For a second estimator T_2, we choose the weighted-moving-average method. Its estimator for μ̄ is given by μ̂_2 = (∑_{r=1}^R g_r X_r) / (∑_{r=1}^R g_r) with weights or parameters g_r ∈ R_+, r = 1, . . . , R. Choosing g_r = 1 for all r = 1, . . . , R reveals that the weighted-moving-average method is a generalization of the sample mean. Estimator μ̂_2 is unbiased and follows the normal distribution N(μ̄, 10² · ∑_{r=1}^R g_r² / (∑_{r=1}^R g_r)²). This yields another (1 − α) confidence interval for μ̄, similar to (38).
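Both confidence intervals can be sketched in a few lines (our own illustration: the placeholder data, the hard-coded two-sided critical value for α = 0.05, and the helper names are assumptions). Choosing g_r = 1 for all r makes the weighted interval collapse to the sample-mean interval:

```python
# (1 - alpha) confidence intervals for the unknown mean with known sigma:
# (i) the sample mean T1 ~ N(mu, sigma^2 / R);
# (ii) a weighted moving average T2 ~ N(mu, sigma^2 * sum(g^2) / (sum g)^2).

Z_975 = 1.959963984540054  # two-sided standard normal critical value, alpha = 0.05

def ci_sample_mean(data, sigma=10.0, z=Z_975):
    R = len(data)
    mean = sum(data) / R
    half = z * sigma / R ** 0.5
    return mean - half, mean + half

def ci_weighted(data, weights, sigma=10.0, z=Z_975):
    sw = sum(weights)
    est = sum(g * x for g, x in zip(weights, data)) / sw
    half = z * sigma * (sum(g * g for g in weights) ** 0.5) / sw
    return est - half, est + half

data = [50.0] * 20                        # placeholder observations, R = 20
lo1, hi1 = ci_sample_mean(data)
lo2, hi2 = ci_weighted(data, [1.0] * 20)  # g_r = 1 recovers the sample mean
```

Truncation to the safe interval [a, b], as in (38), would simply clip both interval endpoints.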
We choose as confidence level (1 − α) = 0.95. The weights g_r are chosen as a null sequence in r. The 20 data points lead to confidence regions containing 73 distributional parameters. The solution of (DRSP_α(X)) for I(α, X) ⊂ Θ is reported in row (S4) in Table 1. We observe that (DRSP_α(X)) yields a slightly better solution than the a-posteriori Bayesian model (S2). This might be unexpected because of the worst-case nature of (DRSP_α(X)), but it is explained by the incorporation of the additional information of the estimators T_1 and T_2. We obtain as estimator T_o(α, X) = 50.6. We approximate the integral over H_{T_o}(α, X) in optimization problem (BP*_α(X)) via the trapezoidal rule with 73 trapezoids. This yields a box-constrained NLP. The obtained solution is shown in row (S5) in Table 1. We observe that its solution x* and true objective function value are the best among the seven different stochastic optimization models (S1)-(S5) and (T1)-(T2), with a gap of 0.02%. We also compute the results for the naïve approach, using the two estimators T_1 and T_2. The sample average yields T_1(X) ≈ 49.0004 and the second estimator T_2(X) ≈ 51.1941. The resulting solutions are shown in rows (T1) and (T2) in Table 1. Note that the naïve approach is outperformed by all other a-posteriori approaches (S2), (S4) and (S5).
Next, we study the error bounds from Theorem 1 and Corollaries 3-4. Note that the derived bounds apply to the normalized expected function E Q(·, ·). For the error bound α + η*(α) · λ(Θ) · l_W in (13), we restrict the discussion to the sample average as estimator T_o = T_1 alone. Because the half-width of the confidence interval (37) is independent of the observations (as we assume a known standard deviation), we obtain η*(α) directly. The sample average is an unbiased estimator of θ̄ with standard deviation std(T_o) = σ/√R. Thus, η*(α) ≥ std(T_o) for α ≤ 0.317. The Lebesgue measure λ(Θ) = 55 − 40 = 15 is the interval length. It remains to determine the Lipschitz constant l_W of the density w(θ). Thus, l_W is a user input and depends on the "sharpness" of the imposed measure. For example, for the uniform distribution, l_W = 0; for the triangular distribution with lower limit 40, upper limit 55 and mode 47.5, l_W = 0.017; and analogously for the truncated normal with lower limit 40, upper limit 55, mean μ_W and standard deviation σ_W. For example, for the choices μ_W = 47.5 and σ_W = 15, we obtain l_W ≈ 0.0001872 for the truncated normal. With R = 20 and assuming this truncated normal, we obtain the bound (39) in Theorem 1. This yields the bound (22), std(T_o) · λ(Θ) · l_W ≤ 0.00629 (40), for the Bayesian approach, as analyzed in Corollary 3. Both bounds (39) and (40) are for the mean-risk. The bound α + l_d · η(α) of Corollary 4 is estimated as follows. The Lipschitz constant l_d is bounded because R = 20 and σ = 10 for our illustrative example; see above. Because η(α) is bounded and not "too large," we obtain the bound (41). Note that (41) bounds the risk (instead of the mean-risk as for the other two approaches).
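The Bayesian bound (40) is plain arithmetic on the constants stated above (σ = 10, R = 20, λ(Θ) = 15, l_W ≈ 0.0001872 for the chosen truncated normal); the following sketch recomputes it:

```python
import math

# Recompute the bound (40): std(T_o) * lambda(Theta) * l_W with
# std(T_o) = sigma / sqrt(R) for the sample mean with known sigma.

sigma, R = 10.0, 20
std_To = sigma / math.sqrt(R)   # ~ 2.236
lam_Theta = 55.0 - 40.0         # length of the parameter interval [40, 55]
l_W = 0.0001872                 # Lipschitz constant of the truncated normal density

bayes_bound = std_To * lam_Theta * l_W
```

This evaluates to about 0.00628, in line with the stated bound of 0.00629.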

Varying the number of observations and the confidence level
We continue our newsvendor example by making the following changes: the cost c_2 is reduced from 10 to 7, and we choose only one estimator, T_1, to compute the confidence intervals around the single parameter μ̄ = θ̄ = 50. This yields a true optimal objective function value of approx. 26.802. Further, we assume the knowledge of θ̄ ∈ [40, 80], and the parameter discretization is set to 0.5, leading to 81 parameters in Θ_0. We test α = 0.10, 0.05, 0.04 and 0.03 for R = 10, 15, 20, 25, 50, 75, 100, 150 and 200 observations. For each combination, we solve 100 instances. All computed x* are then evaluated by the true objective function (36); this is an out-of-sample performance test. For α = 0.05, the results are plotted in Fig. 5. Specifically, Fig. 5 plots the objective function values for the 100 samples for different values of R in decreasing order. These function values are obtained by the optimal solution x* of (BP*_{0.05}(X)), evaluated with the true objective function; for example, for R = 25 observations, 100 random test instances are drawn and evaluated. For our experiments, the solution plots for (B(X)) and (DRSP_{0.05}(X)) look very similar and largely overlap with the one produced by (BP*_{0.05}(X)). We observe in Fig. 5 that the worst solutions computed among the 100 instances improve with the increase in the number of observations. This is very desirable and is the expected behavior of the model (BP*_{0.05}(X)). Further, we note a convergence towards the optimal true objective function value of approximately 26.802.
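The out-of-sample test can be sketched as follows (our own illustration: as a simple stand-in for solving (BP*_{0.05}(X)) we use the critical-fractile plug-in solution based on the sample mean; the seed and the instance count are assumptions). The spread of the evaluated true objective values shrinks as R grows, mirroring Fig. 5:

```python
import math
import random

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def true_loss(x, mu=50.0, sigma=10.0, c1=2.0, c2=7.0):
    """True expected newsvendor loss (36) with the reduced cost c2 = 7."""
    z = (x - mu) / sigma
    over = (x - mu) * Phi(z) + sigma * phi(z)
    return c1 * over + c2 * ((mu - x) + over)

Z_STAR = 0.7647  # approx. Phi^{-1}(c2 / (c1 + c2)) for c1 = 2, c2 = 7

def out_of_sample(R, instances=100, seed=1):
    """Evaluate the plug-in solution of each instance in the true objective."""
    random.seed(seed)
    vals = []
    for _ in range(instances):
        sample_mean = sum(random.gauss(50.0, 10.0) for _ in range(R)) / R
        x_hat = sample_mean + Z_STAR * 10.0  # plug-in (stand-in) solution
        vals.append(true_loss(x_hat))
    return vals

spread_small = max(out_of_sample(10)) - min(out_of_sample(10))
spread_large = max(out_of_sample(200)) - min(out_of_sample(200))
```

All evaluated values stay above the true optimum of about 26.802, and the spread over the 100 instances shrinks with growing R.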
The solutions of the two a-priori approaches (B) and (DRSP) lead to true objective function values of approx. 45.3913901 and 45.1890283, respectively. They are far worse than any of the solutions computed by (B(X)), (DRSP_α(X)) and (BP*(X)) for all 3600 instances. This comes as no surprise, since the interval for θ̄ ∈ [40, 80] is quite large and, consequently, the value of this information is quite limited. Table 2 summarizes the computational results for all instances on the three different a-posteriori approaches (B(X)), (DRSP_α(X)) and (BP*(X)). Specifically, columns 3-5 report the number of instances where each single approach yields a smaller true objective function value than the other two methods, for the computed solutions. Columns 6-8 state the standard deviation of the 100 computed true objective function values. Note that the a-posteriori Bayesian model (B(X)) does not depend on our choice of α; while the observations are taken into account via the a-posteriori distribution, the confidence interval is not.
The computational results, as shown in columns 3-5, reveal that none of the three approaches always dominates any other method (in Table 2, the numbers in bold mark the largest number among the three approaches). Actually, we argue that each of the three approaches is valuable and is particularly strong in specific cases. The a-posteriori Bayesian model (B(X)) performs very well when many i.i.d. observations are available (e.g., 150 or more). For these cases, all three approaches yield similar results (while recognizing that the standard deviation between the 100 objective function values is very small if many observations are available, leading also to small differences between the three approaches). A big advantage of (B(X)) is that neither a confidence level nor an estimator has to be chosen. Also, the computational burden in solving (B(X)) is lower compared to (DRSP_α(X)).
The a-posteriori distributionally robust stochastic optimization model (DRSP α (X)) is particularly strong for the cases where a significant number of i.i.d. observations are given (e.g., between 100 and 150).
Finally, the new approach BP * α (X) yields strong results when relatively few observations are given (e.g., between 10 and 75). The computational effort required in solving BP * α (X) is similar to (B(X)) if the trapezoidal rule is used.
The results of columns 6-8 in Table 2 quantify the tendency shown in Fig. 5 that the standard deviation decreases with an increase in the number of observations.

Reliability: two-dimensional parameter space
As a second test bed, we consider a reliability problem. Replacing a part in time, i.e., before it is broken, costs p_1 > 0. Once it is broken, the replacement cost increases to p_2 > p_1. The goal is to find the optimal time t > 0 to replace the part, minimizing the expected replacement cost.
The corresponding loss function for the continuous random variable ξ, measuring the time until failure, is then given by (42), with indicator function I_A(·) : R → R. Our goal is to minimize the expected loss as a function of t > 0. For an exponentially distributed random variable ξ ∼ P = EXP(α, λ) with mean α + λ and density f(y|α, λ) = (1/λ) · exp(−(y − α)/λ) for y ≥ α, we obtain the expected loss for t ≥ α. An optimal replacement time is obtained at t* = argmin_{t>0} E_P L(t, ξ) = λ ln(p_1/p_2 + 1) + α > α, yielding the minimum min_{t>0} E_P L(t, ξ) = E_P L(t*, ξ) = p_2 · λ · ln(p_1/p_2 + 1), which is independent of α. We assume that both parameters α and λ are unknown. This yields the two-dimensional parameter θ = (α, λ). To obtain a common estimator θ̂ = (α̂, λ̂), we use the maximum likelihood estimates, which make use of our (i.i.d.) realizations X_1, . . . , X_R having distribution EXP(α, λ) with unknown α and λ.
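The maximum-likelihood estimates can be sketched as follows (hedged: α̂ = min_r X_r and λ̂ = X̄ − α̂ are the standard MLEs for the two-parameter exponential; the simulated data, sample size and seed are our own assumptions):

```python
import random

# MLEs for the shifted exponential EXP(alpha, lambda) with density
# f(y) = (1/lambda) * exp(-(y - alpha)/lambda) for y >= alpha:
# alpha_hat = minimum of the sample, lambda_hat = sample mean minus alpha_hat.

def mle_shifted_exponential(data):
    a_hat = min(data)
    l_hat = sum(data) / len(data) - a_hat
    return a_hat, l_hat

random.seed(7)
alpha_true, lam_true = 25.0, 100.0  # parameter configuration from Figs. 6 and 7
data = [alpha_true + random.expovariate(1.0 / lam_true) for _ in range(5000)]
a_hat, l_hat = mle_shifted_exponential(data)
```

Note that α̂ ≥ α always holds by construction, consistent with the one-sided confidence interval for α.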
Because α̂ and λ̂ are independent [31], we can use the following confidence regions at levels (1 − α_1) and (1 − α_2) for α and λ, respectively, with critical values r_{1,R,α_1} and r_{2,R,α_2}. The distributions from which the critical values r_{1,R,α_1} and r_{2,R,α_2} are derived can be taken from the literature [30]. The non-negativity of the exponential distribution implies that α ≤ α̂, which explains the one-sided confidence interval for α. An exact confidence region at level (1 − α_1) · (1 − α_2) for both parameters α and λ jointly is then obtained by (44).
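The joint confidence level in (44) follows from independence by elementary arithmetic; the following sketch (helper names are our own) verifies it for levels used in the experiments:

```python
# Two independent confidence regions at levels (1 - a1) and (1 - a2) cover
# both parameters jointly with probability (1 - a1) * (1 - a2); at least one
# true parameter lies outside with probability a1 + a2 - a1 * a2.

def joint_coverage(a1, a2):
    return (1.0 - a1) * (1.0 - a2)

def miss_probability(a1, a2):
    return a1 + a2 - a1 * a2

cov = joint_coverage(0.100, 0.100)     # 0.81 (up to floating point)
miss = miss_probability(0.100, 0.100)  # 0.19 (up to floating point)
expected_misses = round(1000 * miss)   # about 190 of 1000 instances
```

The two quantities sum to one, so either can be used to read off how often the true parameter pair escapes the joint region.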
It is theoretically possible that the (1 − α_1) · (1 − α_2) confidence interval (44) is empty; this did not occur in our tests of over 42,000 instances (as reported in Tables 3 and 4, see below). Figures 6 and 7 plot the expected loss functions for different parameter configurations of λ and α within the safe interval (43). The horizontally dotted lines show the optimal t* values for each parameter configuration. The graphs for λ = 100 and α = 25 in Figs. 6 and 7 show the true expected loss function, against which we compare the computed t values (for the three different a-posteriori approaches tested). The figures show that (i) the loss functions are smooth in t for different parameter configurations λ and α, and (ii) the optimal values for t vary with the parameters; although the true expected loss function (the graph for λ = 100 in Fig. 6 and the red curve in Fig. 7) is smooth, the true objective function values differ quite significantly when evaluating different values of t. Table 3 shows the results for the different computational tests performed for the reliability example; in Table 3, the bold numbers show the largest number of times a particular approach yields the best objective function value among the three approaches. The setup of the experiments is similar to the ones reported in Table 2, while we use 1000 instances in Table 3 compared to 100 instances in Table 2. Column 1 shows the number of draws, R. The values of α_1 and α_2 are reported in column 2, while columns 3-5 report the number of instances where each approach yielded a better expected loss, i.e., the computed t was inserted into the true objective function with the true parameters θ̄. The last three columns show the standard deviation of the true objective function values for the 1000 different runs.
Most notably, the proposed solution BP*_α(X) outperforms both the a-posteriori distributionally robust and Bayesian approaches consistently and considerably. The new approach yielded the best true objective function values among the three approaches in over 87% of the 36,000 instances. In general, BP*_α(X) does particularly well for 50, 75 and 100 i.i.d. data points. We observe that the standard deviation of the true objective function values, evaluated at the computed solutions, tends to decrease for all approaches with the number of available data points, i.e., with increasing R. This behavior is expected, as more realizations lead to smaller confidence regions, which in turn should lead to smaller variations in the computed optimal time t.
The a-posteriori Bayesian approach has a much lower standard deviation (column 6 in Table 3) compared to the two other a-posteriori approaches. In general, a lower standard deviation is a desired property, but only when it comes together with better objective function values; this is not the case for the a-posteriori Bayesian approach. In contrast, the standard deviation for the new approach (column 8) is always smaller compared to the a-posteriori distributionally robust approach (column 7). One might expect that the a-posteriori distributionally robust approach is better for such cases where one of the true parameters ᾱ and λ̄ lies outside the confidence interval. This is the case with probability α_1 + α_2 − α_1 · α_2. For example, for α_1 = α_2 = 0.075 and α_1 = α_2 = 0.100, this leads to probabilities of 0.9375 and 0.81, respectively. Thus, for 1000 independent instances, we expect that the a-posteriori distributionally robust approach outperforms the other two approaches in about 73 and 190 instances, for α_1 = α_2 = 0.075 and α_1 = α_2 = 0.100, respectively. As the results in Table 3 reveal, this is only rarely the case. That the solution of the new approach, BP*_α(X), tends to outperform the a-posteriori distributionally robust approach even in such cases where at least one of the true parameters lies outside of the confidence interval further illustrates the strengths of the new approach.

Varying the safe interval
Next, we study the computational effects of varying the safe interval (43). Consider Table 4, which shows experiments for R = 20 data points. The results indicate that the particular choice of the safe interval influences the relative results of the three approaches; this comes as no surprise, as the (1 − α_1) · (1 − α_2) confidence interval (44) directly depends on the safe interval. With smaller safe intervals (43), both the a-posteriori Bayesian and the a-posteriori distributionally robust approaches tend to improve. Despite that trend, even for intervals as small as [20, 30] × [80, 110], the new approach does considerably and consistently better. The relative improvement of both the a-posteriori Bayesian and a-posteriori distributionally robust optimization approaches is explained by the decreasing leeway. Very small safe intervals favor a-posteriori Bayesian approaches, as their solutions get closer to those of BP*_α(X); this can also be observed by comparing the results of α_1 = α_2 = 0.075 with the smaller confidence intervals resulting from α_1 = α_2 = 0.100.

Conclusions
We present a new data-driven approach towards stochastic optimization under distributional uncertainty. The proposed approach uses observations to construct confidence regions. We then apply an a-posteriori Bayesian approach towards this confidence region. This yields a new data-driven stochastic optimization problem.
The careful construction of the confidence region around an appropriate estimator allows us to analyze the quality of the solutions obtained when taking the expected value over all observations and all a-priori distributions. The derived optimality bound provides various insights on the quality of the obtained solutions. For example, this bound improves with better estimators, and it converges to zero in the number of observations. If the distribution is known, i.e., the family of distributions reduces to a singleton, then the proposed approach is identical to the standard stochastic optimization problem. In this case, the aforementioned constant is zero. Our computational results show that solutions of the proposed data-driven stochastic optimization problems can be superior to solutions of data-driven Bayesian and data-driven distributionally robust stochastic optimization approaches.
Future research might attempt to derive data-driven stochastic optimization problems which minimize the derived bound, though it remains unclear how this can be achieved. Another research direction is to search for a novel type of robustness, understood as the criterion that optimality holds for an entire environment of a-posteriori distributions, for a suitable distance measure in the space of probability distributions. The new model can then be designed with the aim of producing the best compromise between the maximal distance and the resulting bound.