Abstract
We consider data-driven approaches that integrate a machine learning prediction model within distributionally robust optimization (DRO) given limited joint observations of uncertain parameters and covariates. Our framework is flexible in the sense that it can accommodate a variety of regression setups and DRO ambiguity sets. We investigate asymptotic and finite sample properties of solutions obtained using Wasserstein, sample robust optimization, and phi-divergence-based ambiguity sets within our DRO formulations, and explore cross-validation approaches for sizing these ambiguity sets. Through numerical experiments, we validate our theoretical results, study the effectiveness of our approaches for sizing ambiguity sets, and illustrate the benefits of our DRO formulations in the limited data regime even when the prediction model is misspecified.
Notes
Our results can be extended to Wasserstein distances defined using \(\ell _q\)-norms with \(q \ne 2\).
See Remark 1 at the end of this section for a discussion of the case when \(f^* \not \in \mathcal {F}\).
For any \(u,v \in \mathbb {R}^{d_y}\), \(\Vert {{\text {proj}}_{\mathcal {Y}}(u) - {\text {proj}}_{\mathcal {Y}}(v)}\Vert \le \Vert {u-v}\Vert \).
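This nonexpansiveness holds for projection onto any closed convex set \(\mathcal {Y}\). As a quick numerical spot-check (for the hypothetical special case of a box, not the paper's specific \(\mathcal {Y}\)):

```python
import math
import random

def proj_box(v, lo=-1.0, hi=1.0):
    # Euclidean projection onto the box [lo, hi]^d (a closed convex set)
    return [min(max(x, lo), hi) for x in v]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

random.seed(0)
for _ in range(1000):
    u = [random.uniform(-3, 3) for _ in range(4)]
    v = [random.uniform(-3, 3) for _ in range(4)]
    du = [a - b for a, b in zip(proj_box(u), proj_box(v))]
    dv = [a - b for a, b in zip(u, v)]
    assert norm(du) <= norm(dv) + 1e-12  # projection is 1-Lipschitz
```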
In special cases (e.g., smooth unconstrained problems, see [31, Section 5]), it may be possible to establish the sharper convergence rate \(|{g(\hat{z}^{DRO}_n(x);x) - v^*(x)}| = O_p\big (n^{-r}\big )\).
We use \(\mu _n(x)\) to avoid a clash with the notation \(\zeta _n(x)\) for the radius of ambiguity sets with the same support as \(\hat{P}^{ER}_n(x)\). Having different notation for these two radii will prove useful in our unified analysis of the corresponding ER-DRO problems in Sect. 5.
We do not include results for Algorithm 3 with \(n = 20\) because it requires at least 30 samples (line 7 of Algorithm 3 needs at least 6 points for Lasso regression with 5-fold CV).
Once again, we only report results for Algorithm 3 when \(n \ge 30\).
References
Ban, G.Y., Gallien, J., Mersereau, A.J.: Dynamic procurement of new products with covariate information: the residual tree method. Manuf. Serv. Oper. Manag. 21(4), 798–815 (2019)
Ban, G.Y., Rudin, C.: The big data newsvendor: practical insights from machine learning. Oper. Res. 67(1), 90–108 (2018)
Bansal, M., Huang, K.L., Mehrotra, S.: Decomposition algorithms for two-stage distributionally robust mixed binary programs. SIAM J. Optim. 28(3), 2360–2383 (2018)
Bayraksan, G., Love, D.K.: Data-driven stochastic programming using phi-divergences. In: The Operations Research Revolution, pp. 1–19. INFORMS TutORials in Operations Research (2015)
Bazier-Matte, T., Delage, E.: Generalization bounds for regularized portfolio selection with market side information. INFOR: Information Systems and Operational Research, pp. 1–28 (2020)
Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., Rennen, G.: Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 59(2), 341–357 (2013)
Bertsimas, D., Gupta, V., Kallus, N.: Robust sample average approximation. Math. Program. 171(1–2), 217–282 (2018)
Bertsimas, D., Kallus, N.: From predictive to prescriptive analytics. Manag. Sci. 66(3), 1025–1044 (2020)
Bertsimas, D., McCord, C., Sturt, B.: Dynamic optimization with side information. Eur. J. Oper. Res. 304(2), 634–651 (2023)
Bertsimas, D., Shtern, S., Sturt, B.: A data-driven approach to multistage stochastic linear optimization. Manag. Sci. (2022). https://doi.org/10.1287/mnsc.2022.4352
Bertsimas, D., Shtern, S., Sturt, B.: Two-stage sample robust optimization. Oper. Res. 70(1), 624–640 (2022)
Bertsimas, D., Van Parys, B.: Bootstrap robust prescriptive analytics. Math. Program. 195(1), 39–78 (2022)
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)
Blanchet, J., Kang, Y., Murthy, K.: Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 56(3), 830–857 (2019)
Blanchet, J., Murthy, K., Nguyen, V.A.: Statistical analysis of Wasserstein distributionally robust estimators. In: Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, pp. 227–254. INFORMS (2021)
Blanchet, J., Murthy, K., Si, N.: Confidence regions in Wasserstein distributionally robust estimation. Biometrika 109(2), 295–315 (2022)
Boskos, D., Cortés, J., Martínez, S.: Data-driven ambiguity sets for linear systems under disturbances and noisy observations. In: 2020 American Control Conference (ACC), pp. 4491–4496. IEEE (2020)
Boskos, D., Cortés, J., Martínez, S.: Data-driven ambiguity sets with probabilistic guarantees for dynamic processes. IEEE Trans. Autom. Control 66(7), 2991–3006 (2020)
Boskos, D., Cortés, J., Martínez, S.: High-confidence data-driven ambiguity sets for time-varying linear systems. arXiv preprint arXiv:2102.01142 (2021)
Deng, Y., Sen, S.: Predictive stochastic programming. CMS 19(1), 65–98 (2022)
Dou, X., Anitescu, M.: Distributionally robust optimization with correlated data from vector autoregressive processes. Oper. Res. Lett. 47(4), 294–299 (2019)
Duchi, J.C., Glynn, P.W., Namkoong, H.: Statistics of robust optimization: a generalized empirical likelihood approach. Math. Oper. Res. 46(3), 946–969 (2021)
Dunning, I., Huchette, J., Lubin, M.: JuMP: a modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017)
Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171(1–2), 115–166 (2018)
Esteban-Pérez, A., Morales, J.M.: Distributionally robust stochastic programs with side information based on trimmings. Math. Program. 195(1), 1069–1105 (2022)
Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 162(3–4), 707–738 (2015)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
Gao, R.: Finite-sample guarantees for Wasserstein distributionally robust optimization: breaking the curse of dimensionality. Oper. Res. (2022). https://doi.org/10.1287/opre.2022.2326
Gao, R., Chen, X., Kleywegt, A.J.: Wasserstein distributionally robust optimization and variation regularization. Oper. Res. (2022). https://doi.org/10.1287/opre.2022.2383
Gao, R., Kleywegt, A.: Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res. (2022). https://doi.org/10.1287/moor.2022.1275
Gotoh, J.Y., Kim, M.J., Lim, A.E.: Calibration of distributionally robust empirical optimization models. Oper. Res. 69(5), 1630–1650 (2021)
Hanasusanto, G.A., Kuhn, D.: Robust data-driven dynamic programming. In: Advances in Neural Information Processing Systems, pp. 827–835 (2013)
Hanasusanto, G.A., Kuhn, D.: Conic programming reformulations of two-stage distributionally robust linear programs over Wasserstein balls. Oper. Res. 66(3), 849–869 (2018)
Homem-de-Mello, T., Bayraksan, G.: Monte Carlo sampling-based methods for stochastic optimization. Surv. Oper. Res. Manag. Sci. 19(1), 56–85 (2014)
Kannan, R., Bayraksan, G., Luedtke, J.: Heteroscedasticity-aware residuals-based contextual stochastic optimization. arXiv preprint arXiv:2101.03139, pp. 1–15 (2021)
Kannan, R., Bayraksan, G., Luedtke, J.: Data-driven sample average approximation with covariate information. arXiv preprint arXiv:2207.13554v1, pp. 1–57 (2022)
Kuhn, D., Esfahani, P.M., Nguyen, V.A., Shafieezadeh-Abadeh, S.: Wasserstein distributionally robust optimization: theory and applications in machine learning. In: Operations Research & Management Science in the Age of Analytics, pp. 130–166. INFORMS (2019)
Lam, H.: Robust sensitivity analysis for stochastic systems. Math. Oper. Res. 41(4), 1248–1275 (2016)
Lam, H.: Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Oper. Res. 67(4), 1090–1105 (2019)
Lewandowski, D., Kurowicka, D., Joe, H.: Generating random correlation matrices based on vines and extended onion method. J. Multivar. Anal. 100(9), 1989–2001 (2009)
Mak, W.K., Morton, D.P., Wood, R.K.: Monte Carlo bounding techniques for determining solution quality in stochastic programs. Oper. Res. Lett. 24(1–2), 47–56 (1999)
Nguyen, V.A., Zhang, F., Blanchet, J., Delage, E., Ye, Y.: Distributionally robust local non-parametric conditional estimation. Adv. Neural Inf. Process. Syst. 33, 15232–15242 (2020)
Pflug, G., Wozabal, D.: Ambiguity in portfolio selection. Quant. Finance 7(4), 435–442 (2007)
Rahimian, H., Mehrotra, S.: Frameworks and results in distributionally robust optimization. Open J. Math. Optim. 3, 1–85 (2022)
Rigollet, P., Hütter, J.C.: High dimensional statistics. Lecture notes for MIT’s 18.657 course (2017). http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf
Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)
Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)
Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on stochastic programming: modeling and theory. SIAM (2009)
Trillos, N.G., Slepčev, D.: On the rate of convergence of empirical measures in \(\infty \)-transportation distance. Can. J. Math. 67(6), 1358–1383 (2015)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: with Applications to Statistics. Springer, Berlin (1996)
Villani, C.: Optimal Transport: Old and New, vol. 338. Springer Science & Business Media, Berlin (2008)
Xie, W.: Tractable reformulations of two-stage distributionally robust linear programs over the type-\(\infty \) Wasserstein ball. Oper. Res. Lett. 48(4), 513–523 (2020)
Xu, H., Caramanis, C., Mannor, S.: A distributional interpretation of robust optimization. Math. Oper. Res. 37(1), 95–110 (2012)
Acknowledgements
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract No. DE-AC02-06CH11357. R.K. also gratefully acknowledges the support of the U.S. Department of Energy through the LANL/LDRD Program and the Center for Nonlinear Studies. This research was performed using the computing resources of the UW-Madison CHTC in the CS Department. The CHTC is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation. The authors thank the three anonymous reviewers for suggestions that helped improve the readability of this paper. R.K. also thanks Nam Ho-Nguyen for helpful discussions.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing interests that could have influenced the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Omitted proofs
1.1 Proof of Proposition 1
We begin with the following useful results.
Lemma 22
Let \(a_1, a_2, \dots , a_{d}\) be positive constants with \(a_j \ge 1\), \(\forall j \in [d]\). Then, we have
Proof
Let \(F(\theta ):= \bigl (\sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i\bigr )^{2/\theta }\) and \(G(\theta ):= \log (F(\theta )) = \frac{2}{\theta } \log \left( \sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i \right) \). The stated result holds if F (or equivalently, G) is nonincreasing on \(\theta \in [1,2]\). We have
where the nonnegative weight \(w_i:= \bigl (a^{1+ \theta /2}_i\bigr ) \bigl (\sum _{j=1}^{d} a^{1+ \theta /2}_j\bigr )^{-1}\). Note that \(w_i \in (0,1)\) and \(\sum _{i=1}^{d} w_i = 1\). For \(\theta \in [1,2]\), the above expression implies that \(G'(\theta ) \le 0\) whenever \(\prod _{i=1}^{d} (a_i)^{w_i} \le \sum _{i=1}^{d} a^{1+ \theta /2}_i\). We have
where the first inequality follows from the weighted AM-GM inequality, the second inequality follows from the fact that \(0< w_i < 1\), \(\forall i \in [d]\), and the final inequality follows from the assumption that \(a_i \ge 1\), \(\forall i \in [d]\). Consequently, \(G'(\theta ) \le 0\), \(\forall \theta \in [1,2]\), which implies F is nonincreasing on \(\theta \in [1,2]\). \(\square \)
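As a numerical sanity check of Lemma 22 (a sketch, not part of the proof), the following verifies on random instances that \(F(\theta )\) is nonincreasing on [1, 2] whenever every \(a_i \ge 1\):

```python
import random

def F(a, theta):
    # F(theta) = (sum_i a_i^(1 + theta/2))^(2/theta), as in the proof of Lemma 22
    return sum(ai ** (1 + theta / 2) for ai in a) ** (2 / theta)

random.seed(0)
for _ in range(100):
    d = random.randint(1, 10)
    a = [1 + 4 * random.random() for _ in range(d)]  # constants with a_i >= 1
    thetas = [1 + 0.01 * k for k in range(101)]      # grid on [1, 2]
    vals = [F(a, t) for t in thetas]
    # F should be nonincreasing on [1, 2] (up to floating-point slack)
    assert all(vals[k + 1] <= vals[k] + 1e-9 for k in range(100))
```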
Lemma 23
Let W be a sub-Gaussian random variable with variance proxy \(\sigma ^2_w\). Then
Proof
We have
Lemma 1.4 of [45] implies that
where \(e:= \exp (1)\). Therefore, inequality (14), the Fubini-Tonelli theorem, and the ratio test imply the stated result whenever \(\lim _{j \rightarrow \infty } t_{j+1}/t_j < 1\), where \(t_j:= \bigl ( \sigma ^2_w e^{\frac{2}{e}} j \bigr )^{(1+\frac{\theta }{2})\frac{j}{2}}(j!)^{-1}\). Let \(C:= \sigma ^2_w e^{\frac{2}{e}}\). We have
where we use the fact \(\theta \in (1,2)\) in the last step. \(\square \)
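The ratio test in the proof can also be checked numerically. The sketch below evaluates \(t_{j+1}/t_j\) for \(t_j = (Cj)^{(1+\frac{\theta }{2})\frac{j}{2}}(j!)^{-1}\) at illustrative (hypothetical) values \(\theta = 1.1\) and \(\sigma ^2_w = 0.5\), working in log space via `math.lgamma` to avoid overflow; the ratios decrease toward zero, confirming convergence of the series:

```python
import math

def log_t(j, theta, C):
    # log of t_j = (C * j)^((1 + theta/2) * j / 2) / j!
    return (1 + theta / 2) * (j / 2) * math.log(C * j) - math.lgamma(j + 1)

theta = 1.1                     # hypothetical theta in (1, 2)
C = 0.5 * math.exp(2 / math.e)  # C = sigma_w^2 * e^(2/e), with sigma_w^2 = 0.5
ratios = [math.exp(log_t(j + 1, theta, C) - log_t(j, theta, C))
          for j in (100, 1_000, 10_000, 100_000)]
assert all(r2 < r1 for r1, r2 in zip(ratios, ratios[1:]))  # ratios decrease
assert ratios[-1] < 1  # limiting ratio below 1, so sum_j t_j converges
```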
We are now ready to prove Proposition 1. To establish \(\mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a)}\right] < +\infty \), it suffices to show that \(\mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a) \mathbbm {1}_{\{\Vert {\varepsilon }\Vert _{\infty } \ge 1\}}}\right] < +\infty \), where \(\mathbbm {1}_{\{\Vert {\varepsilon }\Vert _{\infty } \ge 1\}} = 1\) if \(\Vert {\varepsilon }\Vert _{\infty } \ge 1\) and 0 otherwise. Lemma 22 implies that
Independence of the components of \(\varepsilon \) further implies
Consequently, it suffices to show that \(\mathbb {E}\Bigl [\exp \bigl ( |{\varepsilon _i}|^{1 + \frac{a}{2}}\bigr ) \mathbbm {1}_{\{|{\varepsilon _i}| \ge 1\}}\Bigr ] < +\infty \) for each \(i \in [d_y]\), which follows from Lemma 23. \(\square \)
1.2 Proof of Lemma 8
We require the following result (cf. [24, Lemma 3.7]).
Lemma 24
Suppose Assumptions 1, 2, and 6 hold and the samples \(\{\varepsilon ^{i}\}_{i=1}^{n}\) are i.i.d. Let \(\{Q_n(x)\}\) be a sequence of distributions with \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\). Then
Consequently, we a.s. have for n large enough:
Furthermore, \(\{Q_n(x)\}\) a.s. converges to \(P_{Y \mid X = x}\) under the Wasserstein metric.
Proof
Mirrors the proof of [24, Lemma 3.7] on account of Theorem 7. \(\square \)
We now prove Lemma 8. From Theorem 7, we have for a.e. \(x \in \mathcal {X}\) that
Since \(\sum _n \alpha _n < +\infty \), the Borel-Cantelli lemma implies a.s. that for all n large enough
From Lemma 24, we a.s. have for any distribution \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) and n large enough that \(d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x)\) for a.e. \(x \in \mathcal {X}\).
Suppose Assumption 4 holds. Using the fact that \(\hat{\mathcal {P}}_n(x;\zeta _n (\alpha _n,x)) \subseteq \bar{\mathcal {P}}_{1,n}(x;\zeta _n(\alpha _n,x))\) for all orders \(p \in [1,+\infty )\), we a.s. have for n large enough and for a.e. \(x \in \mathcal {X}\) that
where the second inequality follows from Assumption 4 and the Kantorovich-Rubinstein theorem (cf. [37, Theorem 5]).
Suppose instead that Assumption 5 holds and \(p \in [2,+\infty )\). Since \(\hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x)) \subseteq \bar{\mathcal {P}}_{2,n}(x;\zeta _n(\alpha _n,x))\) for all \(p \in [2,+\infty )\), we a.s. have for n large enough and a.e. \(x \in \mathcal {X}\) that
where the latter inequality follows from Assumption 5 and [28, Lemma 2] (see also [29]). \(\square \)
1.3 Proof of Theorem 9
From Theorem 7, we have
From Lemma 24, we a.s. have \(\lim _{n \rightarrow \infty } d_{W,p}(P_{Y \mid X=x},Q_n(x)) = 0\) for a.e. \(x \in \mathcal {X}\) for any \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\). Theorem 6.9 of [51] then a.s. implies that \(Q_n(x)\) converges weakly to \(P_{Y \mid X = x}\) in the space of distributions with finite pth moments for a.e. \(x \in \mathcal {X}\).
Lemma 8 implies a.s. that for all n large enough
Therefore, to prove \(\lim _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) = v^*(x) = \lim _{n \rightarrow \infty } g(\hat{z}^{DRO}_n(x);x)\) in probability (or a.s.) for a.e. \(x \in \mathcal {X}\), it suffices to show that \(\limsup _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) \le v^*(x)\) a.s. for a.e. \(x \in \mathcal {X}\).
Fix \(\eta > 0\). For a.e. \(x \in \mathcal {X}\), let \(z^*(x) \in S^*(x)\) be an optimal solution to the true problem (3), and \(Q^{*}_{n}(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) be such that
We suppress the dependence of \(Q^{*}_{n}(x)\) on \(\eta \) for simplicity. We a.s. have for a.e. \(x \in \mathcal {X}\)
The first equality above follows from the fact that \(Q^*_n(x)\) converges weakly to \(P_{Y \mid X = x}\) (as noted above) and by Definition 6.8 of [51] (which holds by virtue of Assumption 3). Since \(\eta > 0\) was arbitrary, we conclude that \(\limsup _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) \le v^*(x)\) a.s. for a.e. \(x \in \mathcal {X}\).
Finally, we show that any accumulation point of \(\{\hat{z}^{DRO}_n(x)\}\) is almost surely an element of \(S^*(x)\) for a.e. \(x \in \mathcal {X}\), and argue that this implies \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x)) \xrightarrow {a.s.} 0\) for a.e. \(x \in \mathcal {X}\). From (15) and the above conclusion, we a.s. have
Let \(\bar{z}(x)\) be an accumulation point of \(\hat{z}^{DRO}_n(x)\) for a.e. \(x \in \mathcal {X}\). Assume by moving to a subsequence if necessary that \(\lim _{n \rightarrow \infty } \hat{z}^{DRO}_n(x) = \bar{z}(x)\). We a.s. have for a.e. \(x \in \mathcal {X}\)
where the second inequality follows from the lower semicontinuity of \(c(\cdot ,Y)\) on \(\mathcal {Z}\) for each \(Y \in \mathcal {Y}\) and the third inequality follows from Fatou’s lemma by virtue of Assumption 3. Consequently, we a.s. have that \(\bar{z}(x) \in S^*(x)\).
Suppose by contradiction that \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not a.s. converge to zero for a.e. \(x \in \mathcal {X}\). Then, there exists \(\bar{\mathcal {X}} \subseteq \mathcal {X}\) with \(P_X(\bar{\mathcal {X}}) > 0\) such that for each \(x \in \bar{\mathcal {X}}\), \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not a.s. converge to zero. Since \(\mathcal {Z}\) is compact, any sequence of estimators \(\{\hat{z}^{DRO}_n(x)\}\) has a convergent subsequence for each \(x \in \bar{\mathcal {X}}\). Therefore, whenever \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not converge to zero for some \(x \in \bar{\mathcal {X}}\) and a realization of the data \(\mathcal {D}_n\), there exists an accumulation point of the sequence \(\{\hat{z}^{DRO}_n(x)\}\) that is not a solution to problem (3). This contradicts the fact that every accumulation point of \(\{\hat{z}^{DRO}_n(x)\}\) is almost surely a solution to problem (3) for a.e. \(x \in \mathcal {X}\). \(\square \)
1.4 Proof of Theorem 10
Lemma 8 implies that inequality (15) a.s. holds for all n large enough. From Lemma 24, we a.s. have for any distribution \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) and sample size n large enough that \(d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x)\) for a.e. \(x \in \mathcal {X}\).
If Assumption 4 holds, then the desired result follows from inequality (15) and part A of Lemma 8. On the other hand, if Assumption 5 holds and \(p \ge 2\), then the desired result follows from inequality (15) and part B of Lemma 8. \(\square \)
1.5 Proof of Lemma 12
Theorem 7 implies \(g(\hat{z}^{DRO}_n(x);x) \le \hat{v}^{DRO}_n(x)\) with probability at least \(1-\alpha \) when \(\zeta _n(\alpha ,x)\) is chosen according to equation (10). Lemma 5 then yields
Suppose Assumption 4 holds. Following the proof of part A of Lemma 8 (see Lemma 24), we have for any \(z^*(x) \in S^*(x)\)
Therefore, if we choose \(\alpha \in (0,1)\) so that \(2L_1(z^*(x))\zeta _n(\alpha ,x) \le \kappa \), we have
Equation (10) implies that \(2L_1(z^*(x))\kappa ^{(2)}_{p,n}(\alpha ) \le \kappa /2\) whenever the risk level
with \(\theta \) equal to \(\min \{p/d_y, 1/2\}\) or p/a. Assumption 8 implies that we can choose the constant \(\kappa ^{(1)}_{p,n}(\alpha ,x)\) in equation (10) such that for a.e. \(x \in \mathcal {X}\), \(2L_1(z^*(x))\kappa ^{(1)}_{p,n}(\alpha ,x) \le \kappa /2\) whenever
The above bounds imply the existence of constants \(\tilde{\varGamma }(\kappa ,x), \tilde{\gamma }(\kappa ,x) > 0\) such that the risk level \(\alpha = \tilde{\varGamma }(\kappa ,x) \exp (-n\tilde{\gamma }(\kappa ,x))\) satisfies \(2L_1(z^*(x))\zeta _n(\alpha ,x) \le \kappa \). Consequently, (11) holds.
Next, suppose instead that Assumption 5 holds. Following the proof of part B of Lemma 8 (see Lemma 24), we have for any \(z^*(x) \in S^*(x)\)
Therefore, if we pick \(\alpha \in (0,1)\) so that
then \(\mathbb {P}\left\{ {g(\hat{z}^{DRO}_n(x);x) > v^*(x) + \kappa }\right\} \le 2\alpha \). Similar to the analysis above, positive constants \(\tilde{\varGamma }(\kappa ,x)\) and \(\tilde{\gamma }(\kappa ,x)\) and inequality (11) can be obtained by bounding the smallest value of \(\alpha \) using Assumption 8 and equation (10) so that
1.6 Proof of Lemma 14
By first adding and subtracting \(g^*_n(z;x)\), defined in problem (4), and then doing the same with \(g^*_{s,n}(z;x)\), we obtain
Consider the first term on the r.h.s. of (16). We have for each \(x \in \mathcal {X}\)
where the first step follows from Assumption 9, the second step follows from the definition of the set \(\hat{\mathcal {Y}}^i_n(x;\mu _n(x))\), the triangle inequality, and inequality (6), and the fourth step follows by applying the Cauchy-Schwarz inequality twice.
Next, consider the second term on the r.h.s. of (16). For each \(x \in \mathcal {X}\)
where the inequality follows by Cauchy-Schwarz. \(\square \)
1.7 Proof of Theorem 18
Since
we look to establish uniform rates of convergence of \(\hat{g}^{ER}_{s,n}(\cdot ;X)\) to \(g(\cdot ;X)\) with respect to the \(L^q\)-norm on \(\mathcal {X}\). From (16) and the triangle inequality, we have
We bound the terms on the r.h.s. of (18) using Lemma 14. Assumptions 9, 16, and 17 and \(\Vert {\mu _{n}(X)}\Vert _{L^q} = \bar{C}_{\mu } n^{-r/2}\) imply the first term on the r.h.s. of inequality (12) satisfies:
Assumptions 16 and 18 imply that the second term on the r.h.s. of inequality (12) satisfies
Finally, Assumption 14 implies
Putting the above three inequalities together into inequality (18), we obtain
The first part of the stated result then follows from (17). The second part of the stated result follows from (17) and the fact that
Ambiguity sets satisfying Assumption 12
In Sect. 5, we outlined conditions under which phi-divergence ambiguity sets \(\mathfrak {P}_n(x;\zeta _{n}(x))\) satisfy Assumption 12 for a suitable choice of the radius \(\zeta _{n}(x)\). Lemma 25 below determines sharp bounds on the radius \(\zeta _{n}(x)\) for some other families of ambiguity sets to satisfy Assumption 12. Before presenting the lemma, we introduce a third example of the ambiguity set \(\mathfrak {P}_n(x;\zeta _{n}(x))\) to add to Examples 1 and 2 in Sect. 3.
Example 3
Mean-upper-semideviation-based ambiguity sets [48]: given order \(a \in [1,+\infty )\) and radius \(\zeta _n(x) \ge 0\), let \(b:= a/(a-1)\) and define \(\hat{\mathcal {P}}_n(x)\) using
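As an illustration (a sketch under assumptions, not the paper's code), the reparameterization used in the proof of Lemma 25(c), \(p_i = \frac{1}{n}\bigl [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \bigr ]\) with \(q \ge 0\) and \(\Vert {q}\Vert _b \le \zeta _n(x)\), always yields weights that sum to one, and nonnegative ones when \(\zeta _n(x) \le n^{1/b}\):

```python
import random

def weights_from_q(q):
    # p_i = (1/n) * (1 + q_i - qbar): the reparameterization from Lemma 25(c)
    n = len(q)
    qbar = sum(q) / n
    return [(1 + qi - qbar) / n for qi in q]

random.seed(1)
n, b, zeta = 8, 2.0, 0.3          # hypothetical sizes; b = a/(a-1)
q = [random.random() for _ in range(n)]
scale = zeta / sum(qi ** b for qi in q) ** (1 / b)
q = [qi * scale for qi in q]      # rescale so that ||q||_b = zeta
p = weights_from_q(q)
assert abs(sum(p) - 1) < 1e-12    # weights always sum to one
assert all(pi >= 0 for pi in p)   # nonnegative: qbar <= zeta * n^(-1/b) <= 1
```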
Lemma 25
The following ambiguity sets satisfy Assumption 12 with constant \(\rho \in (1,2]\):
-
(a)
CVaR-based ambiguity sets (see Example 1) with radius \(\zeta _{n}(x) = {C^{\zeta }_1 n^{1-\rho }}\),
-
(b)
Variation distance-based ambiguity sets (see Example 2) with radius \(\zeta _{n}(x) = {C^{\zeta }_2 n^{-\rho /2}}\),
-
(c)
Mean-upper-semideviation-based ambiguity sets of order \(a \in [1,+\infty )\) (see Example 3) with radius \(\zeta _{n}(x) = {\left\{ \begin{array}{ll} {C^{\zeta }_3 n^{1 - \rho /2}} &{}\text{ if } a \ge 2 \\ {C^{\zeta }_4 n^{3/2 - 1/a - \rho /2}} &{} \text{ if } a < 2 \end{array}\right. }\),
for some constants \(C^{\zeta }_i > 0\), \(i \in [4]\). Furthermore, these bounds are sharp in the sense described above.
Proof
-
(a)
Assume that \(\zeta _{n}(x) < 0.5\). We begin by noting that there exists an optimal solution to the problem \(\sup _{p \in \mathfrak {P}_n(x;\zeta _{n}(x))} \sum _{i=1}^{n} \big (p_i - \frac{1}{n}\big )^2\) that is an extreme point of the polytope \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Furthermore, every extreme point of \(\mathfrak {P}_n(x;\zeta _{n}(x))\) satisfies at least \(n-1\) of the set of 2n inequalities \(\Bigl \{p_i \ge 0, \, i \in [n], p_i \le \frac{1}{n(1-\zeta _{n}(x))}, \, i \in [n]\Bigr \}\), with equality. This implies that there exists an optimal solution at which at least \(n-1\) of the \(p_i\)s either take the value zero, or take the value \(\frac{1}{n(1-\zeta _{n}(x))}\). At this solution, \(n-1\) of the terms \(\big (p_i - \frac{1}{n}\big )^2\) are either \(\frac{1}{n^2}\) or \(\frac{1}{n^2} \bigl (\frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\bigr )^2\) (with \(\frac{1}{n^2}\) larger since \(\zeta _{n}(x) < 0.5\) by assumption).
Suppose \(M \in \{0,\dots ,n-1\}\) of the inequalities \(p_i \ge 0\), \(i \in [n]\), are satisfied with equality at such an optimal solution. Since \(\sum _{i=1}^{n} p_i = 1\) and \(p_i \le \frac{1}{n(1-\zeta _{n}(x))}\), \(\forall i \in [n]\), we require \((n-M)\frac{1}{n(1-\zeta _{n}(x))} \ge 1\), which implies \(M \le n\zeta _{n}(x)\). Consequently, \(M \le n\zeta _{n}(x) < n/2\) of the inequalities \(p_i \ge 0\), \(i \in [n]\), are satisfied with equality and at least \((n-1-M) \ge n(1-\zeta _{n}(x)) - 1 > n/2 - 1\) of the inequalities \(p_i \le \frac{1}{n(1-\zeta _{n}(x))}\), \(i \in [n]\), are satisfied with equality. Therefore, whenever \(\zeta _n(x) < 0.5\), we have:
$$\begin{aligned} \sum _{i=1}^{n} \bigg (p_i - \frac{1}{n}\bigg )^2&\le (n\zeta _{n}(x)+1)\frac{1}{n^2} + n(1-\zeta _{n}(x))\frac{1}{n^2} \left( \frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\right) ^2 \\&= \frac{1}{n^2} + \frac{1}{n}\left( \frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\right) . \end{aligned}$$Because the above analysis is constructive, it can be immediately used to deduce that the bound on \(\zeta _n(x)\) is sharp.
-
(b)
The stated result follows from the fact that
$$\begin{aligned} \sum _{i=1}^{n} \left( p_i - \frac{1}{n}\right) ^2 \le \left( \sum _{i=1}^{n} |{p_i - \frac{1}{n}}|\right) ^2 \le \zeta ^2_{n}(x), \quad \forall p \in \mathfrak {P}_n(x;\zeta _{n}(x)), \, x \in \mathcal {X}. \end{aligned}$$To see that the above bound is sharp, assume without loss of generality that \(n \ge 2\) and \(\zeta _{n}(x) \le 1\). Then, because
$$\begin{aligned} \left( \frac{1}{n} + \frac{\zeta _{n}(x)}{2}, \underbrace{\frac{1}{n} - \frac{\zeta _{n}(x)}{2n-2},\dots ,\frac{1}{n} - \frac{\zeta _{n}(x)}{2n-2}}_{n-1 \text { terms}}\right) \in \mathfrak {P}_n(x;\zeta _{n}(x)), \end{aligned}$$we have
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \ge \frac{\zeta ^2_n(x)}{4} + \frac{\zeta ^2_n(x)}{4(n-1)}. \end{aligned}$$ -
(c)
Let \(\bar{q}:= \frac{1}{n}\sum _{i=1}^{n} q_i\). We have:
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \le \underset{q \in \mathfrak {Q}_n(x;\zeta _{n}(x))}{\sup } \frac{1}{n^2}\sum _{i=1}^{n} (q_i - \bar{q})^2, \end{aligned}$$where \(\mathfrak {Q}_n(x;\zeta _{n}(x)):= \left\{ q \in \mathbb {R}^n_+ : \Vert {q}\Vert _b \le \zeta _{n}(x) \right\} \). Note that for each \(q \in \mathfrak {Q}_n(x;\zeta _{n}(x))\), we have \(|{\bar{q}}| \le n^{-1} \Vert {q}\Vert _1 \le n^{-1/b} \Vert {q}\Vert _b\), which in turn implies
$$\begin{aligned} \Vert {q - \bar{q} \textbf{1}}\Vert _b \le \Vert {q}\Vert _b + |{\bar{q}}| \Vert {\textbf{1}}\Vert _b = \Vert {q}\Vert _b + |{\bar{q}}| n^{1/b} \le \Vert {q}\Vert _b + \Vert {q}\Vert _b \le 2\zeta _{n}(x), \end{aligned}$$where \(\textbf{1}\) is a vector of ones of appropriate dimension. Additionally, note that
$$\begin{aligned} \sum _{i=1}^{n} (q_i - \bar{q})^2 = \Vert {q - \bar{q}\textbf{1}}\Vert ^2 \le {\left\{ \begin{array}{ll} \Vert {q - \bar{q}\textbf{1}}\Vert ^2_b &{}\text{ if } b \le 2 \\ n^{1 - 2/b} \Vert {q - \bar{q}\textbf{1}}\Vert ^2_b &{} \text{ if } b > 2 \end{array}\right. }. \end{aligned}$$The desired result then follows from
$$\begin{aligned} \underset{q \in \mathfrak {Q}_n(x;\zeta _{n}(x))}{\sup } \frac{1}{n^2}\sum _{i=1}^{n} (q_i - \bar{q})^2&\le \underset{\left\{ q : \Vert {q - \bar{q}\textbf{1}}\Vert _b \le 2\zeta _{n}(x) \right\} }{\sup } \frac{1}{n^2}\Vert {q - \bar{q}\textbf{1}}\Vert ^2 \\&\le {\left\{ \begin{array}{ll} \frac{4}{n^2}\zeta ^2_{n}(x) &{}\text{ if } b \le 2 \\ \frac{4}{n^{1 + 2/b}} \zeta ^2_{n}(x) &{} \text{ if } b > 2 \end{array}\right. }. \end{aligned}$$We now show that the above bounds are sharp.
Consider first the case when \(b \le 2\) and assume without loss of generality that \(\zeta _{n}(x) = O(\sqrt{n})\). Note that \(p_i = \frac{1}{n} \big [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \big ]\), \(i \in [n]\), with \(q_1 = \zeta _{n}(x)\) and \(q_i = 0\), \(\forall i \ge 2\), is an element of \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Therefore
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 = \varTheta \left( \frac{\zeta ^2_{n}(x)}{n^2}\right) . \end{aligned}$$Next, suppose instead that \(b > 2\) and assume without loss of generality that \(\zeta _{n}(x) = O(n^{1/b})\). Note that \(p_i = \frac{1}{n} \big [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \big ]\) with \(q_i = {\left\{ \begin{array}{ll} \bigl (\frac{2}{n}\bigr )^{1/b}\zeta _{n}(x) &{}\text{ if } i \equiv 0 \;(\bmod \; 2) \\ 0 &{} \text{ if } i \equiv 1 \;(\bmod \; 2) \end{array}\right. }\), \(i \in [n]\), is an element of \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Therefore
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 = \varTheta \left( \frac{\zeta ^2_{n}(x)}{n^{1+2/b}}\right) . \end{aligned}$$
\(\square \)
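As a numerical complement to parts (a) and (b) of the proof (a sketch, not part of the argument), the snippet below enumerates the extreme points described in part (a) for the CVaR-based set and checks the resulting supremum against the bound \(\frac{1}{n^2} + \frac{1}{n}\bigl (\frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\bigr )\), and verifies that the witness from part (b) is feasible and attains the value \(\frac{\zeta ^2_n(x)}{4} + \frac{\zeta ^2_n(x)}{4(n-1)}\):

```python
def cvar_sup(n, zeta):
    # Maximize sum_i (p_i - 1/n)^2 over extreme points of the CVaR-based set
    # {p >= 0, sum_i p_i = 1, p_i <= u}, with u = 1/(n(1 - zeta)), zeta < 0.5:
    # K coordinates at u, one residual r = 1 - K*u in [0, u], the rest zero.
    u, best = 1.0 / (n * (1 - zeta)), 0.0
    for K in range(n):
        r = 1 - K * u
        if 0 <= r <= u:
            val = K * (u - 1/n)**2 + (n - K - 1) * (1/n)**2 + (r - 1/n)**2
            best = max(best, val)
    return best

for n, zeta in [(10, 0.3), (50, 0.1), (200, 0.45)]:
    assert cvar_sup(n, zeta) <= 1/n**2 + (zeta / (1 - zeta)) / n + 1e-12

# Part (b): the sharpness witness for the variation-distance-based set
n, zeta = 12, 0.4
p = [1/n + zeta/2] + [1/n - zeta/(2*(n - 1))] * (n - 1)
assert abs(sum(p) - 1) < 1e-12
assert sum(abs(pi - 1/n) for pi in p) <= zeta + 1e-12       # feasible
val_b = sum((pi - 1/n)**2 for pi in p)
assert abs(val_b - (zeta**2/4 + zeta**2/(4*(n - 1)))) < 1e-12
assert val_b <= zeta**2                                      # part (b) bound
```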
Additional computational results
Algorithm 4 describes our procedure for estimating the 99% UCB on the optimality gap of our data-driven solutions using the multiple replication procedure [41]. We only use 20,000 of the \(10^5\) generated samples from the conditional distribution of Y given \(X = x\) to compute these UCBs, since they suffice to yield an accurate estimate of the optimality gaps. Unlike [36, Algorithm 1], which uses relative optimality gaps, we use absolute optimality gaps in our 99% UCB estimates to avoid division by small quantities when \(v^*(x)\) is close to zero.
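For concreteness, a hedged sketch of a multiple replication procedure in the spirit of [41] is given below; the cost function, SAA solver, and toy data are placeholders, not the paper's Algorithm 4, and a normal quantile stands in for the Student-t quantile:

```python
import math
import random
import statistics

def mrp_ucb(zhat, batches, cost, solve_saa, alpha=0.01):
    # One nonnegative optimality-gap estimate per replication (batch), then a
    # one-sided (1 - alpha) upper confidence bound on the gap of zhat.
    gaps = []
    for ys in batches:
        zg = solve_saa(ys)  # SAA solution on this batch (lower-bound problem)
        gaps.append(sum(cost(zhat, y) - cost(zg, y) for y in ys) / len(ys))
    k = len(gaps)
    se = statistics.stdev(gaps) / math.sqrt(k)
    # Normal approximation to the Student-t quantile (adequate for moderate k)
    z = statistics.NormalDist().inv_cdf(1 - alpha)
    return statistics.mean(gaps) + z * se

# Toy instance: quadratic cost, so the batch SAA solution is the batch mean
random.seed(0)
cost = lambda z, y: (z - y) ** 2
solve_saa = lambda ys: sum(ys) / len(ys)
batches = [[random.gauss(0, 1) for _ in range(500)] for _ in range(30)]
ucb = mrp_ucb(0.1, batches, cost, solve_saa)  # candidate decision zhat = 0.1
assert ucb >= 0  # each per-batch gap estimate is nonnegative by construction
```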
We compare Algorithm 2 with an “optimal covariate-independent” specification of the radius \(\zeta _n\). This optimal covariate-independent radius is determined by choosing \(\zeta _n\) such that the medians of the \(99\%\) UCBs over the 20 different covariate realizations are minimized. We also benchmark Algorithm 3 against an “optimal covariate-dependent” specification of \(\zeta _n(x)\), determined by choosing \(\zeta _n(x)\) such that the \(99\%\) UCBs are minimized. Determining these optimal covariate-independent and covariate-dependent radii is impractical because it requires 20,000 i.i.d. samples from the conditional distribution of Y given \(X = x\) (which a decision-maker does not have); we consider them only to benchmark the performance of Algorithms 2 and 3.
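The covariate-independent selection rule just described amounts to a one-line minimization over candidate radii; a minimal sketch (with hypothetical UCB values, not our experimental data) is:

```python
import statistics

def optimal_covariate_independent_radius(ucbs_by_radius):
    # ucbs_by_radius: {zeta: [99% UCB for each covariate realization x]}
    # Pick the single radius whose median UCB across realizations is smallest.
    return min(ucbs_by_radius,
               key=lambda z: statistics.median(ucbs_by_radius[z]))

# Hypothetical UCB estimates over 20 covariate realizations per candidate radius
ucbs = {0.0: [1.0] * 20, 0.1: [0.4] * 20, 0.5: [0.7] * 20}
assert optimal_covariate_independent_radius(ucbs) == 0.1
```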
“Optimal” tuning of the Wasserstein radius. Figure 6 compares the performance of the W+OLS formulations when the radius \(\zeta _n(x)\) of the ambiguity set is determined using Algorithms 2 and 3 and using optimal covariate-dependent and covariate-independent tuning. We vary the model degree \(\theta \), the covariate dimension \(d_x \in \{3,10,100\}\), and the sample size \(n \in \{5(d_x + 1),10(d_x + 1),20(d_x + 1),50(d_x + 1)\}\) in these experiments. The radius specified by Algorithm 2 performs better than that specified by Algorithm 3 for smaller sample sizes and covariate dimensions, and the converse holds for larger covariate dimensions and sample sizes. These results indicate that while covariate-dependent tuning can in theory yield better results than the covariate-independent tuning of Algorithm 2, Algorithm 3 only obtains good estimates of the optimal covariate-dependent radius \(\zeta _n(x)\) for relatively large sample sizes n. The gap between the performance of Algorithm 2 and the optimal covariate-independent tuning shrinks with increasing sample size and covariate dimension, except when \(\theta = 2\). The gap between the performance of Algorithm 3 and the optimal covariate-dependent tuning likewise shrinks with increasing covariate dimension and sample size.
Comparison of the radii specified by Algorithms 1, 2, and 3. Figure 7 compares the radii specified by Algorithms 1, 2, and 3 with the optimal covariate-dependent radius and optimal covariate-independent radius for the W+OLS formulation. We consider \(d_x = 100\), vary the model degree \(\theta \), and vary the sample size among \(n \in \{5(d_x + 1),10(d_x + 1),20(d_x + 1),50(d_x + 1)\}\) in these experiments. Note that the y-axis limits differ across the subplots. First, the radius specified by Algorithm 1 shrinks very quickly to zero for all three values of \(\theta \). Consequently, we note from Fig. 2 that the resulting ER-DRO estimators typically do not perform as well as the estimators obtained when the radius \(\zeta _n\) is specified using Algorithms 2 and 3. Second, the covariate-independent specifications of the radius result in narrower distributions overall than the covariate-dependent specifications. This may be because the covariate-independent specifications must choose a single value of \(\zeta _n(x)\) for all possible covariate realizations \(x \in \mathcal {X}\), whereas the covariate-dependent specifications can choose a different value of \(\zeta _n(x)\) depending on the realization \(x \in \mathcal {X}\). Third, the distribution of the radius determined using Algorithm 3 appears to converge to the distribution of the optimal covariate-dependent radius as the sample size increases. Similarly, the distribution of the radius determined using Algorithm 2 also appears to converge to the distribution of the optimal covariate-independent radius as n increases (except when \(\theta = 2\)). Finally, as noted in Sect. 6, it may be advantageous to use a positive radius for the ambiguity set when the prediction model is misspecified (e.g., using OLS regression even when \(\theta \ne 1\)).
This is corroborated by the plots for \(\theta = 2\), where the distribution of the optimal covariate-dependent radius remains far from zero even for large sample sizes n. See Fig. 5.
Comparison with the jackknife-based formulations. Figure 8 compares the performance of the ER-SAA+OLS approach and the jackknife-based SAA (J-SAA+OLS) approach [36] with the W+OLS formulations when the radius \(\zeta _n(x)\) is specified using Algorithms 2 and 3. We consider \(d_x = 100\), vary the model degree \(\theta \), and vary the sample size among \(n \in \{3(d_x + 1),5(d_x + 1),10(d_x + 1),20(d_x + 1)\}\) in these experiments. As observed in [36], the J-SAA+OLS formulation performs better than the E+OLS formulation in the small sample size regime. Figure 8 shows that the W+OLS formulations outperform the J-SAA+OLS formulation (except when using Algorithm 2 for \(\theta = 2\) and large n). This is expected because the ER-DRO formulations account for both the error in approximating \(f^*\) by \(\hat{f}_n\) and the error in approximating \(P_{Y \mid X = x}\) by \(P^*_n(x)\), whereas the J-SAA+OLS formulation only addresses the bias in the residuals obtained from OLS regression (i.e., even if \(\hat{f}_n\) is an accurate estimate of \(f^*\), the jackknife formulations do not account for the fact that \(P^*_n(x)\) may be a crude approximation of \(P_{Y \mid X = x}\)). We omit the results for the J+-SAA+OLS formulation because they are similar to those for the J-SAA+OLS formulation. See Fig. 6.
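To make the contrast concrete, the snippet below sketches the difference between in-sample (ER-style) residuals and jackknife (leave-one-out) residuals for OLS on synthetic data; the data-generating model here is illustrative, not the paper's experimental setup. The leave-one-out refit inflates each residual, which is the debiasing effect the jackknife formulations exploit, but it leaves untouched the separate issue of how well the empirical residual distribution approximates \(P_{Y \mid X = x}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear model y = X @ beta + noise (illustrative only).
n, d = 30, 3
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

def ols_fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# In-sample residuals from the full-data OLS fit.
beta_hat = ols_fit(X, y)
er_residuals = y - X @ beta_hat

# Jackknife residuals: each point is scored by the model trained
# without it, which inflates the residuals (e_i / (1 - h_ii) for OLS)
# and counteracts in-sample optimism.
jk_residuals = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta_i = ols_fit(X[mask], y[mask])
    jk_residuals[i] = y[i] - X[i] @ beta_i
```

For OLS the jackknife residual dominates the in-sample residual pointwise in magnitude, since each leverage value \(h_{ii}\) lies in \([0,1)\).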
Comparison of mean optimality gaps. Figures 9 and 10 are the counterparts of Figs. 1 and 2 in Sect. 7 when the mean optimality gap is used instead of the \(99\%\) UCBs to compare the different estimators, i.e., when \(100 \, \texttt {avg}(\{\hat{G}^k(x)\})\) is used instead of the output \(\hat{B}_{99}(x)\) of Algorithm 4. As mentioned in Sect. 7, comparing mean optimality gaps instead of \(99\%\) UCBs results in very similar plots because the standard deviations \(\sqrt{\texttt {var}(\{\hat{G}^k(x)\})}\) of the gap estimates \(\{\hat{G}^k(x)\}\) are significantly lower than the gap estimates themselves—this is because we use 20,000 random samples to derive each gap estimate, which yields estimates with low variability. Finally, we note that we also plotted median optimality gaps instead of mean optimality gaps and obtained very similar plots. See Figs. 7, 8, 9 and 10.
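As a rough illustration of why the two comparisons agree, the following computes a normal-approximation one-sided \(99\%\) UCB from a set of low-variance gap estimates; the construction and the synthetic numbers are assumptions for illustration, not the exact procedure of Algorithm 4. When the standard deviation of the gap estimates is small relative to their mean, the UCB sits just above the mean gap, so the two criteria rank estimators almost identically.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical gap estimates G^k(x) over K = 25 replications; each estimate
# has low variability, mimicking the effect of the 20,000 samples per
# estimate used in the paper's experiments.
gaps = 0.05 + rng.normal(scale=0.001, size=25)

def ucb_99(gaps):
    # One-sided 99% upper confidence bound on the mean gap via the
    # normal approximation (z_{0.99} ~ 2.326).
    k = len(gaps)
    return float(np.mean(gaps) + 2.326 * np.std(gaps, ddof=1) / np.sqrt(k))

mean_gap = float(np.mean(gaps))
b99 = ucb_99(gaps)
```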
Cite this article
Kannan, R., Bayraksan, G. & Luedtke, J.R. Residuals-based distributionally robust optimization with covariate information. Math. Program. (2023). https://doi.org/10.1007/s10107-023-02014-7
Keywords
- Data-driven stochastic programming
- Distributionally robust optimization
- Wasserstein distance
- Phi-divergences
- Covariates
- Machine learning
- Convergence rate
- Large deviations