Residuals-based distributionally robust optimization with covariate information

  • Full Length Paper
  • Mathematical Programming, Series A

Abstract

We consider data-driven approaches that integrate a machine learning prediction model within distributionally robust optimization (DRO) given limited joint observations of uncertain parameters and covariates. Our framework is flexible in the sense that it can accommodate a variety of regression setups and DRO ambiguity sets. We investigate asymptotic and finite sample properties of solutions obtained using Wasserstein, sample robust optimization, and phi-divergence-based ambiguity sets within our DRO formulations, and explore cross-validation approaches for sizing these ambiguity sets. Through numerical experiments, we validate our theoretical results, study the effectiveness of our approaches for sizing ambiguity sets, and illustrate the benefits of our DRO formulations in the limited data regime even when the prediction model is misspecified.


Notes

  1. Our results can be extended to Wasserstein distances defined using \(\ell _q\)-norms with \(q \ne 2\).

  2. We investigate extensions of our framework that can adapt to heteroscedasticity in [35, 36], where we assume that \(Y = f^*(X) + Q^*(X)\varepsilon \) with X and \(\varepsilon \) independent (here, \(Q^*(X)\) denotes the covariate-dependent covariance matrix of the errors).

  3. See Remark 1 at the end of this section for a discussion of the case when \(f^* \not \in \mathcal {F}\).

  4. For any \(u,v \in \mathbb {R}^{d_y}\), \(\Vert {{\text {proj}}_{\mathcal {Y}}(u) - {\text {proj}}_{\mathcal {Y}}(v)}\Vert \le \Vert {u-v}\Vert \).

  5. In special cases (e.g., smooth unconstrained problems, see [31, Section 5]), it may be possible to establish the sharper convergence rate \(|{g(\hat{z}^{DRO}_n(x);x) - v^*(x)}| = O_p\big (n^{-r}\big )\).

  6. We use \(\mu _n(x)\) to avoid a clash with the notation \(\zeta _n(x)\) for the radius of ambiguity sets with the same support as \(\hat{P}^{ER}_n(x)\). Having different notation for these two radii will prove useful in our unified analysis of the corresponding ER-DRO problems in Sect. 5.

  7. We do not include results for Algorithm 3 with \(n = 20\) because it requires at least 30 samples (line 7 of Algorithm 3 needs at least 6 points for Lasso regression with 5-fold CV).

  8. Once again, we only report results for Algorithm 3 when \(n \ge 30\).

References

  1. Ban, G.Y., Gallien, J., Mersereau, A.J.: Dynamic procurement of new products with covariate information: the residual tree method. Manuf. Serv. Oper. Manag. 21(4), 798–815 (2019)

  2. Ban, G.Y., Rudin, C.: The big data newsvendor: practical insights from machine learning. Oper. Res. 67(1), 90–108 (2018)

  3. Bansal, M., Huang, K.L., Mehrotra, S.: Decomposition algorithms for two-stage distributionally robust mixed binary programs. SIAM J. Optim. 28(3), 2360–2383 (2018)

  4. Bayraksan, G., Love, D.K.: Data-driven stochastic programming using phi-divergences. In: The Operations Research Revolution, pp. 1–19. INFORMS TutORials in Operations Research (2015)

  5. Bazier-Matte, T., Delage, E.: Generalization bounds for regularized portfolio selection with market side information. INFOR: Information Systems and Operational Research, pp. 1–28 (2020)

  6. Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., Rennen, G.: Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 59(2), 341–357 (2013)

  7. Bertsimas, D., Gupta, V., Kallus, N.: Robust sample average approximation. Math. Program. 171(1–2), 217–282 (2018)

  8. Bertsimas, D., Kallus, N.: From predictive to prescriptive analytics. Manag. Sci. 66(3), 1025–1044 (2020)

  9. Bertsimas, D., McCord, C., Sturt, B.: Dynamic optimization with side information. Eur. J. Oper. Res. 304(2), 634–651 (2023)

  10. Bertsimas, D., Shtern, S., Sturt, B.: A data-driven approach to multistage stochastic linear optimization. Manag. Sci. (2022). https://doi.org/10.1287/mnsc.2022.4352

  11. Bertsimas, D., Shtern, S., Sturt, B.: Two-stage sample robust optimization. Oper. Res. 70(1), 624–640 (2022)

  12. Bertsimas, D., Van Parys, B.: Bootstrap robust prescriptive analytics. Math. Program. 195(1), 39–78 (2022)

  13. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)

  14. Blanchet, J., Kang, Y., Murthy, K.: Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 56(3), 830–857 (2019)

  15. Blanchet, J., Murthy, K., Nguyen, V.A.: Statistical analysis of Wasserstein distributionally robust estimators. In: Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, pp. 227–254. INFORMS (2021)

  16. Blanchet, J., Murthy, K., Si, N.: Confidence regions in Wasserstein distributionally robust estimation. Biometrika 109(2), 295–315 (2022)

  17. Boskos, D., Cortés, J., Martínez, S.: Data-driven ambiguity sets for linear systems under disturbances and noisy observations. In: 2020 American Control Conference (ACC), pp. 4491–4496. IEEE (2020)

  18. Boskos, D., Cortés, J., Martínez, S.: Data-driven ambiguity sets with probabilistic guarantees for dynamic processes. IEEE Trans. Autom. Control 66(7), 2991–3006 (2020)

  19. Boskos, D., Cortés, J., Martínez, S.: High-confidence data-driven ambiguity sets for time-varying linear systems. arXiv preprint arXiv:2102.01142 (2021)

  20. Deng, Y., Sen, S.: Predictive stochastic programming. CMS 19(1), 65–98 (2022)

  21. Dou, X., Anitescu, M.: Distributionally robust optimization with correlated data from vector autoregressive processes. Oper. Res. Lett. 47(4), 294–299 (2019)

  22. Duchi, J.C., Glynn, P.W., Namkoong, H.: Statistics of robust optimization: a generalized empirical likelihood approach. Math. Oper. Res. 46(3), 946–969 (2021)

  23. Dunning, I., Huchette, J., Lubin, M.: JuMP: a modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017)

  24. Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171(1–2), 115–166 (2018)

  25. Esteban-Pérez, A., Morales, J.M.: Distributionally robust stochastic programs with side information based on trimmings. Math. Program. 195(1), 1069–1105 (2022)

  26. Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 162(3–4), 707–738 (2015)

  27. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)

  28. Gao, R.: Finite-sample guarantees for Wasserstein distributionally robust optimization: breaking the curse of dimensionality. Oper. Res. (2022). https://doi.org/10.1287/opre.2022.2326

  29. Gao, R., Chen, X., Kleywegt, A.J.: Wasserstein distributionally robust optimization and variation regularization. Oper. Res. (2022). https://doi.org/10.1287/opre.2022.2383

  30. Gao, R., Kleywegt, A.: Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res. (2022). https://doi.org/10.1287/moor.2022.1275

  31. Gotoh, J.Y., Kim, M.J., Lim, A.E.: Calibration of distributionally robust empirical optimization models. Oper. Res. 69(5), 1630–1650 (2021)

  32. Hanasusanto, G.A., Kuhn, D.: Robust data-driven dynamic programming. In: Advances in Neural Information Processing Systems, pp. 827–835 (2013)

  33. Hanasusanto, G.A., Kuhn, D.: Conic programming reformulations of two-stage distributionally robust linear programs over Wasserstein balls. Oper. Res. 66(3), 849–869 (2018)

  34. Homem-de-Mello, T., Bayraksan, G.: Monte Carlo sampling-based methods for stochastic optimization. Surv. Oper. Res. Manag. Sci. 19(1), 56–85 (2014)

  35. Kannan, R., Bayraksan, G., Luedtke, J.: Heteroscedasticity-aware residuals-based contextual stochastic optimization. arXiv preprint arXiv:2101.03139, pp. 1–15 (2021)

  36. Kannan, R., Bayraksan, G., Luedtke, J.: Data-driven sample average approximation with covariate information. arXiv preprint arXiv:2207.13554v1, pp. 1–57 (2022)

  37. Kuhn, D., Esfahani, P.M., Nguyen, V.A., Shafieezadeh-Abadeh, S.: Wasserstein distributionally robust optimization: theory and applications in machine learning. In: Operations Research & Management Science in the Age of Analytics, pp. 130–166. INFORMS (2019)

  38. Lam, H.: Robust sensitivity analysis for stochastic systems. Math. Oper. Res. 41(4), 1248–1275 (2016)

  39. Lam, H.: Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Oper. Res. 67(4), 1090–1105 (2019)

  40. Lewandowski, D., Kurowicka, D., Joe, H.: Generating random correlation matrices based on vines and extended onion method. J. Multivar. Anal. 100(9), 1989–2001 (2009)

  41. Mak, W.K., Morton, D.P., Wood, R.K.: Monte Carlo bounding techniques for determining solution quality in stochastic programs. Oper. Res. Lett. 24(1–2), 47–56 (1999)

  42. Nguyen, V.A., Zhang, F., Blanchet, J., Delage, E., Ye, Y.: Distributionally robust local non-parametric conditional estimation. Adv. Neural Inf. Process. Syst. 33, 15232–15242 (2020)

  43. Pflug, G., Wozabal, D.: Ambiguity in portfolio selection. Quant. Finance 7(4), 435–442 (2007)

  44. Rahimian, H., Mehrotra, S.: Frameworks and results in distributionally robust optimization. Open J. Math. Optim. 3, 1–85 (2022)

  45. Rigollet, P., Hütter, J.C.: High dimensional statistics. Lecture notes for MIT’s 18.657 course (2017). URL http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf

  46. Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)

  47. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)

  48. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on stochastic programming: modeling and theory. SIAM (2009)

  49. Trillos, N.G., Slepčev, D.: On the rate of convergence of empirical measures in \(\infty \)-transportation distance. Can. J. Math. 67(6), 1358–1383 (2015)

  50. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: with Applications to Statistics. Springer, Berlin (1996)

  51. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer Science & Business Media, Berlin (2008)

  52. Xie, W.: Tractable reformulations of two-stage distributionally robust linear programs over the type-\(\infty \) Wasserstein ball. Oper. Res. Lett. 48(4), 513–523 (2020)

  53. Xu, H., Caramanis, C., Mannor, S.: A distributional interpretation of robust optimization. Math. Oper. Res. 37(1), 95–110 (2012)


Acknowledgements

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract No. DE-AC02-06CH11357. R.K. also gratefully acknowledges the support of the U.S. Department of Energy through the LANL/LDRD Program and the Center for Nonlinear Studies. This research was performed using the computing resources of the UW-Madison Center for High Throughput Computing (CHTC) in the Department of Computer Sciences. The CHTC is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation. The authors thank the three anonymous reviewers for suggestions that helped improve the readability of this paper. R.K. also thanks Nam Ho-Nguyen for helpful discussions.

Author information

Corresponding author

Correspondence to Rohit Kannan.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing interests that influenced the work reported in this paper.

Additional information

This research is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Contract No. DE-AC02-06CH11357. R.K. also gratefully acknowledges the support of the U.S. Department of Energy through the LANL/LDRD Program and the Center for Nonlinear Studies.

Appendices

Omitted proofs

1.1 Proof of Proposition 1

We begin with the following useful results.

Lemma 22

Let \(a_1, a_2, \dots , a_{d}\) be positive constants with \(a_j \ge 1\), \(\forall j \in [d]\). Then, we have

$$\begin{aligned} { \left( \sum _{i=1}^{d} a^2_i \right) ^{\theta /2} \le \sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i, \quad \forall \theta \in [1,2]. } \end{aligned}$$

Proof

Let \(F(\theta ):= \bigl (\sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i\bigr )^{2/\theta }\) and \(G(\theta ):= \log (F(\theta )) = \frac{2}{\theta } \log \left( \sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i \right) \). The stated result holds if F (or equivalently, G) is nonincreasing on \(\theta \in [1,2]\). We have

$$\begin{aligned} G'(\theta )= & {} \frac{1}{\theta } \left[ \sum _{i=1}^{d} w_i \log (a_i) - \frac{2}{\theta } \log \left( \sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i\right) \right] \\= & {} \frac{1}{\theta } \left[ \log \left( \prod _{i=1}^{d} (a_i)^{w_i}\right) - \frac{2}{\theta } \log \left( \sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i\right) \right] , \end{aligned}$$

where the nonnegative weight \(w_i:= \bigl (a^{1+ \theta /2}_i\bigr ) \bigl (\sum _{j=1}^{d} a^{1+ \theta /2}_j\bigr )^{-1}\). Note that \(w_i \in (0,1)\) and \(\sum _{i=1}^{d} w_i = 1\). For \(\theta \in [1,2]\), we have \(\frac{2}{\theta } \ge 1\) and \(\log \bigl (\sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i\bigr ) \ge 0\) (since \(a_i \ge 1\)), so the above expression implies that \(G'(\theta ) \le 0\) whenever \(\prod _{i=1}^{d} (a_i)^{w_i} \le \sum _{i=1}^{d} a^{1+ \theta /2}_i\). We have

$$\begin{aligned} { \prod _{i=1}^{d} (a_i)^{w_i} \le \sum _{i=1}^{d} w_i a_i \le \sum _{i=1}^{d} a_i \le \sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i, } \end{aligned}$$

where the first inequality follows from the weighted AM-GM inequality, the second inequality follows from the fact that \(0< w_i < 1\), \(\forall i \in [d]\), and the final inequality follows from the assumption that \(a_i \ge 1\), \(\forall i \in [d]\). Consequently, \(G'(\theta ) \le 0\), \(\forall \theta \in [1,2]\), which implies F is nonincreasing on \(\theta \in [1,2]\). \(\square \)
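The inequality of Lemma 22 is also easy to sanity-check numerically. The following minimal Python sketch (ours, not part of the paper) samples random vectors with entries at least one and random \(\theta \in [1,2]\), and verifies that the left-hand side never exceeds the right-hand side:

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of Lemma 22: (sum_i a_i^2)^(theta/2) <= sum_i a_i^(1 + theta/2)
# for entries a_i >= 1 and theta in [1, 2].
for _ in range(10_000):
    d = rng.integers(1, 20)
    a = 1.0 + rng.exponential(scale=2.0, size=d)   # entries >= 1
    theta = rng.uniform(1.0, 2.0)
    lhs = np.sum(a ** 2) ** (theta / 2)
    rhs = np.sum(a ** (1 + theta / 2))
    assert lhs <= rhs + 1e-9, (a, theta)
print("Lemma 22 inequality held on all sampled instances.")
```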

Lemma 23

Let W be a sub-Gaussian random variable with variance proxy \(\sigma ^2_w\). Then

$$\begin{aligned} { \mathbb {E}\left[ {\exp \bigl (|{W}|^{1 + \frac{\theta }{2}}\bigr )}\right] < +\infty , \quad \forall \theta \in (1,2). } \end{aligned}$$

Proof

We have

$$\begin{aligned} \mathbb {E}\left[ {\exp \bigl (|{W}|^{1 + \frac{\theta }{2}}\bigr )}\right]&= \mathbb {E}\left[ \sum _{j=0}^{\infty } \frac{|{W}|^{(1+\frac{\theta }{2})j}}{j!}\right] \end{aligned}$$
(14)

Lemma 1.4 of [45] implies that

$$\begin{aligned} { \mathbb {E}\left[ \frac{|{W}|^{(1+\frac{\theta }{2})j}}{j!}\right] \le \frac{\bigl ( \sigma ^2_w e^{\frac{2}{e}} j \bigr )^{(1+\frac{\theta }{2})\frac{j}{2}}}{j!}, \quad \forall j \ge 2, } \end{aligned}$$

where \(e:= \exp (1)\). Therefore, inequality (14), the Fubini-Tonelli theorem, and the ratio test imply the stated result whenever \(\lim _{j \rightarrow \infty } t_{j+1}/t_j < 1\), where \(t_j:= \bigl ( \sigma ^2_w e^{\frac{2}{e}} j \bigr )^{(1+\frac{\theta }{2})\frac{j}{2}}(j!)^{-1}\). Let \(C:= \sigma ^2_w e^{\frac{2}{e}}\). We have

$$\begin{aligned} \lim _{j \rightarrow \infty } \frac{t_{j+1}}{t_j}&= \lim _{j \rightarrow \infty } \frac{\bigl ( C (j+1) \bigr )^{(1+\frac{\theta }{2})\frac{j+1}{2}}}{\bigl ( C j \bigr )^{(1+\frac{\theta }{2})\frac{j}{2}}} \frac{j!}{(j+1)!} \\&= \lim _{j \rightarrow \infty } \frac{(C(j+1))^{0.5(1+\frac{\theta }{2})}}{j+1} \left( \frac{j+1}{j}\right) ^{(1+\frac{\theta }{2})\frac{j}{2}} \\&= \lim _{j \rightarrow \infty } O(1) (j+1)^{0.5(\frac{\theta }{2}-1)} \, \lim _{j \rightarrow \infty } \left( \frac{j+1}{j}\right) ^{(1+\frac{\theta }{2})\frac{j}{2}} \\&= O(1)\lim _{j \rightarrow \infty } (j+1)^{0.5(\frac{\theta }{2}-1)} \, \lim _{j \rightarrow \infty } \left( 1 + \frac{(1+\frac{\theta }{2})\frac{1}{2}}{(1+\frac{\theta }{2})\frac{j}{2}}\right) ^{(1+\frac{\theta }{2})\frac{j}{2}} \\&= 0 \times O(1) = 0, \end{aligned}$$

where we use the fact \(\theta \in (1,2)\) in the last step. \(\square \)
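For intuition, the ratio test above can be evaluated numerically in the log domain, which avoids overflowing the factorial. The sketch below is our own illustration with the assumed values \(\sigma _w = 1\) and \(\theta = 1.5\) (the lemma does not fix these); the log-ratio eventually turns negative and tends to \(-\infty \), albeit slowly:

```python
import math

# t_j := (C * j)^{(1 + theta/2) * j / 2} / j!  with  C = sigma_w^2 * e^{2/e}, so
# log(t_{j+1}/t_j) = 0.5*(1 + theta/2)*[(j+1)*log(C*(j+1)) - j*log(C*j)] - log(j+1).
sigma_w = 1.0   # assumed variance proxy (illustrative only)
theta = 1.5     # assumed theta in (1, 2) (illustrative only)
C = sigma_w ** 2 * math.exp(2 / math.e)

def log_ratio(j: int) -> float:
    s = 0.5 * (1 + theta / 2)
    return s * ((j + 1) * math.log(C * (j + 1)) - j * math.log(C * j)) - math.log(j + 1)

for j in [10, 10**3, 10**6, 10**9, 10**12]:
    print(j, log_ratio(j))
# The printed values decrease and eventually become negative, consistent with
# lim_{j -> infinity} t_{j+1}/t_j = 0; the decay is slow, of order j^(theta/4 - 1/2).
```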

We are now ready to prove Proposition 1. To establish \(\mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a)}\right] < +\infty \), it suffices to show that \(\mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a) \mathbbm {1}_{\{\Vert {\varepsilon }\Vert _{\infty } \ge 1\}}}\right] < +\infty \), where \(\mathbbm {1}_{\{\Vert {\varepsilon }\Vert _{\infty } \ge 1\}} = 1\) if \(\Vert {\varepsilon }\Vert _{\infty } \ge 1\) and 0 otherwise. Lemma 22 implies that

$$\begin{aligned} \mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a)}\right] \le \mathbb {E}\left[ \exp \left( \sum _{i=1}^{d_y} |{\varepsilon _i}|^{1 + \frac{a}{2}}\right) \right] = \mathbb {E}\left[ \prod _{i=1}^{d_y}\exp \Bigl ( |{\varepsilon _i}|^{1 + \frac{a}{2}}\Bigr )\right] . \end{aligned}$$

Independence of the components of \(\varepsilon \) further implies

$$\begin{aligned} \mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a)}\right] \le \prod _{i=1}^{d_y} \mathbb {E}\left[ \exp \left( |{\varepsilon _i}|^{1 + \frac{a}{2}}\right) \right] . \end{aligned}$$

Consequently, it suffices to show that \(\mathbb {E}\Bigl [\exp \bigl ( |{\varepsilon _i}|^{1 + \frac{a}{2}}\bigr ) \mathbbm {1}_{\{|{\varepsilon _i}| \ge 1\}}\Bigr ] < +\infty \) for each \(i \in [d_y]\), which follows from Lemma 23. \(\square \)

1.2 Proof of Lemma 8

We require the following result (cf. [24, Lemma 3.7]).

Lemma 24

Suppose Assumptions 1, 2, and 6 hold and the samples \(\{\varepsilon ^{i}\}_{i=1}^{n}\) are i.i.d. Let \(\{Q_n(x)\}\) be a sequence of distributions with \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\). Then

$$\begin{aligned} \mathbb {P}\Bigl \{ d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x) \Bigr \} \ge 1 - \alpha _n, \quad \text {for a.e. } x \in \mathcal {X}. \end{aligned}$$

Consequently, we a.s. have for n large enough:

$$\begin{aligned} d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x), \quad \text {for a.e. } x \in \mathcal {X}. \end{aligned}$$

Furthermore, \(\{Q_n(x)\}\) a.s. converges to \(P_{Y \mid X = x}\) under the Wasserstein metric:

$$\begin{aligned} { \mathbb {P}\Bigl \{ \lim _{n \rightarrow \infty } d_{W,p}(P_{Y \mid X=x},Q_n(x)) = 0 \Bigr \} = 1. } \end{aligned}$$

Proof

Mirrors the proof of [24, Lemma 3.7] on account of Theorem 7. \(\square \)

We now prove Lemma 8. From Theorem 7, we have for a.e. \(x \in \mathcal {X}\) that

$$\begin{aligned} \mathbb {P}\big \{v^*(x) \le g(\hat{z}^{DRO}_n(x);x) \le \hat{v}^{DRO}_n(x)\big \} \ge 1 - \alpha _n, \quad \forall n \in \mathbb {N}. \end{aligned}$$

Since \(\sum _n \alpha _n < +\infty \), the Borel-Cantelli lemma implies a.s. that for all n large enough

$$\begin{aligned} v^*(x) \le g(\hat{z}^{DRO}_n(x);x) \le \hat{v}^{DRO}_n(x), \quad \text {for a.e. } x \in \mathcal {X}. \end{aligned}$$

From Lemma 24, we a.s. have for any distribution \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) and n large enough that \(d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x)\) for a.e. \(x \in \mathcal {X}\).

Suppose Assumption 4 holds. Using the fact that \(\hat{\mathcal {P}}_n(x;\zeta _n (\alpha _n,x)) \subseteq \bar{\mathcal {P}}_{1,n}(x;\zeta _n(\alpha _n,x))\) for all orders \(p \in [1,+\infty )\), we a.s. have for n large enough and for a.e. \(x \in \mathcal {X}\) that

$$\begin{aligned}&\hat{v}^{DRO}_n(x) \le \underset{Q \in \bar{\mathcal {P}}_{1,n}(x;\zeta _n(\alpha _n,x))}{\sup } \mathbb {E}_{Y \sim Q}\left[ {c(z^*(x),Y)}\right] \le g(z^*(x);x) + 2L_1(z^*(x))\zeta _n(\alpha _n,x), \end{aligned}$$

where the second inequality follows from Assumption 4 and the Kantorovich-Rubinstein theorem (cf. [37, Theorem 5]).

Suppose instead that Assumption 5 holds and \(p \in [2,+\infty )\). Since \(\hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x)) \subseteq \bar{\mathcal {P}}_{2,n}(x;\zeta _n(\alpha _n,x))\) for all \(p \in [2,+\infty )\), we a.s. have for n large enough and a.e. \(x \in \mathcal {X}\) that

$$\begin{aligned} \hat{v}^{DRO}_n(x)&\le \underset{Q \in \bar{\mathcal {P}}_{2,n}(x;\zeta _n(\alpha _n,x))}{\sup } \mathbb {E}_{Y \sim Q}\left[ {c(z^*(x),Y)}\right] \\&\le g(z^*(x);x) + 2\bigl (\mathbb {E}\left[ {\Vert {\nabla c(z^*(x),Y)}\Vert ^2}\right] \bigr )^{1/2} \zeta _n(\alpha _n,x)\\&\quad + 4L_2(z^*(x))\zeta ^2_n(\alpha _n,x), \end{aligned}$$

where the latter inequality follows from Assumption 5 and [28, Lemma 2] (see also [29]). \(\square \)

1.3 Proof of Theorem 9

From Theorem 7, we have

$$\begin{aligned} \mathbb {P}\big \{d_{W,p}( \hat{P}^{ER}_n(x), P_{Y \mid X=x} ) > \zeta _n(\alpha _n,x)\big \} \le \alpha _n, \quad \text {for a.e. } x \in \mathcal {X}. \end{aligned}$$

From Lemma 24, we a.s. have \(\lim _{n \rightarrow \infty } d_{W,p}(P_{Y \mid X=x},Q_n(x)) = 0\) for a.e. \(x \in \mathcal {X}\) for any \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\). Theorem 6.9 of [51] then implies that, a.s., \(Q_n(x)\) converges weakly to \(P_{Y \mid X = x}\) in the space of distributions with finite pth moments for a.e. \(x \in \mathcal {X}\).

Lemma 8 implies a.s. that for all n large enough

$$\begin{aligned}&v^*(x) \le g(\hat{z}^{DRO}_n(x);x) \le \hat{v}^{DRO}_n(x), \quad \text {for a.e. } x \in \mathcal {X}. \end{aligned}$$
(15)

Therefore, to prove \(\lim _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) = v^*(x) = \lim _{n \rightarrow \infty } g(\hat{z}^{DRO}_n(x);x)\) in probability (or a.s.) for a.e. \(x \in \mathcal {X}\), it suffices to show that \(\limsup _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) \le v^*(x)\) a.s. for a.e. \(x \in \mathcal {X}\).

Fix \(\eta > 0\). For a.e. \(x \in \mathcal {X}\), let \(z^*(x) \in S^*(x)\) be an optimal solution to the true problem (3), and \(Q^{*}_{n}(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) be such that

$$\begin{aligned} \underset{Q \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))}{\sup } \mathbb {E}_{Y \sim Q}\left[ {c(z^*(x),Y)}\right] \le \mathbb {E}_{Y \sim Q^*_n(x)}\left[ {c(z^*(x),Y)}\right] + \eta . \end{aligned}$$

We suppress the dependence of \(Q^{*}_{n}(x)\) on \(\eta \) for simplicity. We a.s. have for a.e. \(x \in \mathcal {X}\)

$$\begin{aligned} \limsup _{n \rightarrow \infty } \hat{v}^{DRO}_n(x)&\le \limsup _{n \rightarrow \infty } \underset{Q \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))}{\sup } \mathbb {E}_{Y \sim Q}\left[ {c(z^*(x),Y)}\right] \\&\le \limsup _{n \rightarrow \infty } \mathbb {E}_{Y \sim Q^*_n(x)}\left[ {c(z^*(x),Y)}\right] + \eta \\&= g(z^*(x);x) + \eta = v^*(x) + \eta . \end{aligned}$$

The first equality above follows from the fact that \(Q^*_n(x)\) converges weakly to \(P_{Y \mid X = x}\) (as noted above) and by Definition 6.8 of [51] (which holds by virtue of Assumption 3). Since \(\eta > 0\) was arbitrary, we conclude that \(\limsup _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) \le v^*(x)\) a.s. for a.e. \(x \in \mathcal {X}\).

Finally, we show that any accumulation point of \(\{\hat{z}^{DRO}_n(x)\}\) is almost surely an element of \(S^*(x)\) for a.e. \(x \in \mathcal {X}\), and argue that this implies \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x)) \xrightarrow {a.s.} 0\) for a.e. \(x \in \mathcal {X}\). From (15) and the above conclusion, we a.s. have

$$\begin{aligned} \liminf _{n \rightarrow \infty } g(\hat{z}^{DRO}_n(x);x) \le \lim _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) = v^*(x), \quad \text {for a.e. } x \in \mathcal {X}. \end{aligned}$$

Let \(\bar{z}(x)\) be an accumulation point of \(\hat{z}^{DRO}_n(x)\) for a.e. \(x \in \mathcal {X}\). Assume by moving to a subsequence if necessary that \(\lim _{n \rightarrow \infty } \hat{z}^{DRO}_n(x) = \bar{z}(x)\). We a.s. have for a.e. \(x \in \mathcal {X}\)

$$\begin{aligned} v^*(x)\le & {} g(\bar{z}(x);x) \le \mathbb {E}\left[ {\liminf _{n \rightarrow \infty } c(\hat{z}^{DRO}_n(x),f^*(x)+\varepsilon )}\right] \\\le & {} \liminf _{n \rightarrow \infty } g(\hat{z}^{DRO}_n(x);x) \le v^*(x), \end{aligned}$$

where the second inequality follows from the lower semicontinuity of \(c(\cdot ,Y)\) on \(\mathcal {Z}\) for each \(Y \in \mathcal {Y}\) and the third inequality follows from Fatou’s lemma by virtue of Assumption 3. Consequently, we a.s. have that \(\bar{z}(x) \in S^*(x)\).

Suppose by contradiction that \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not a.s. converge to zero for a.e. \(x \in \mathcal {X}\). Then, there exists \(\bar{\mathcal {X}} \subseteq \mathcal {X}\) with \(P_X(\bar{\mathcal {X}}) > 0\) such that for each \(x \in \bar{\mathcal {X}}\), \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not a.s. converge to zero. Since \(\mathcal {Z}\) is compact, any sequence of estimators \(\{\hat{z}^{DRO}_n(x)\}\) has a convergent subsequence for each \(x \in \bar{\mathcal {X}}\). Therefore, whenever \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not converge to zero for some \(x \in \bar{\mathcal {X}}\) and a realization of the data \(\mathcal {D}_n\), there exists an accumulation point of the sequence \(\{\hat{z}^{DRO}_n(x)\}\) that is not a solution to problem (3). This contradicts the fact that every accumulation point of \(\{\hat{z}^{DRO}_n(x)\}\) is almost surely a solution to problem (3) for a.e. \(x \in \mathcal {X}\). \(\square \)

1.4 Proof of Theorem 10

Lemma 8 implies that inequality (15) a.s. holds for all n large enough. From Lemma 24, we a.s. have for any distribution \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) and sample size n large enough that \(d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x)\) for a.e. \(x \in \mathcal {X}\).

If Assumption 4 holds, then the desired result follows from inequality (15) and part A of Lemma 8. On the other hand, if Assumption 5 holds and \(p \ge 2\), then the desired result follows from inequality (15) and part B of Lemma 8. \(\square \)

1.5 Proof of Lemma 12

Theorem 7 implies \(g(\hat{z}^{DRO}_n(x);x) \le \hat{v}^{DRO}_n(x)\) with probability at least \(1-\alpha \) when \(\zeta _n(\alpha ,x)\) is chosen according to equation (10). Lemma 5 then yields

$$\begin{aligned}&\mathbb {P}\left\{ {g(\hat{z}^{DRO}_n(x);x)> v^*(x) + \kappa }\right\} \\&\quad =\, \mathbb {P}\left\{ {g(\hat{z}^{DRO}_n(x);x) - \hat{v}^{DRO}_n(x) + \hat{v}^{DRO}_n(x)> v^*(x) + \kappa }\right\} \\&\quad \le \, \alpha + \mathbb {P}\left\{ {\hat{v}^{DRO}_n(x) > v^*(x) + \kappa }\right\} . \end{aligned}$$

Suppose Assumption 4 holds. Following the proof of part A of Lemma 8 (see Lemma 24), we have for any \(z^*(x) \in S^*(x)\)

$$\begin{aligned} \mathbb {P}\left\{ {\hat{v}^{DRO}_n(x) > v^*(x) + 2L_1(z^*(x))\zeta _n(\alpha ,x)}\right\} \le \alpha . \end{aligned}$$

Therefore, if we choose \(\alpha \in (0,1)\) so that \(2L_1(z^*(x))\zeta _n(\alpha ,x) \le \kappa \), we have

$$\begin{aligned} \mathbb {P}\left\{ {g(\hat{z}^{DRO}_n(x);x) > v^*(x) + \kappa }\right\} \le 2\alpha . \end{aligned}$$

Equation (10) implies that \(2L_1(z^*(x))\kappa ^{(2)}_{p,n}(\alpha ) \le \kappa /2\) whenever the risk level

$$\begin{aligned} \alpha \ge {c_1 \exp \bigl (-c_2 n \bigl (\tfrac{\kappa }{4L_1(z^*(x))}\bigr )^{1/\theta }\bigr )} \end{aligned}$$

with \(\theta \) equal to \(\min \{p/d_y, 1/2\}\) or p/a. Assumption 8 implies that we can choose the constant \(\kappa ^{(1)}_{p,n}(\alpha ,x)\) in equation (10) such that for a.e. \(x \in \mathcal {X}\), \(2L_1(z^*(x))\kappa ^{(1)}_{p,n}(\alpha ,x) \le \kappa /2\) whenever

$$\begin{aligned} \alpha \ge 4\max \bigl \{&K_{p,f}\bigl (\tfrac{\kappa }{8L_1(z^*(x))},x\bigr ) \exp \bigl (-n\beta _{p,f}\bigl (\tfrac{\kappa }{8L_1(z^*(x))},x\bigr )\bigr ), \\&\quad \quad \quad \bar{K}_{p,f}\bigl (\tfrac{\kappa }{8L_1(z^*(x))}\bigr ) \exp \bigl (-n\bar{\beta }_{p,f}\bigl (\tfrac{\kappa }{8L_1(z^*(x))}\bigr )\bigr )\bigr \}. \end{aligned}$$

The above bounds imply the existence of constants \(\tilde{\varGamma }(\kappa ,x), \tilde{\gamma }(\kappa ,x) > 0\) such that the risk level \(\alpha = \tilde{\varGamma }(\kappa ,x) \exp (-n\tilde{\gamma }(\kappa ,x))\) satisfies \(2L_1(z^*(x))\zeta _n(\alpha ,x) \le \kappa \). Consequently, (11) holds.

Next, suppose instead that Assumption 5 holds. Following the proof of part B of Lemma 8 (see Lemma 24), we have for any \(z^*(x) \in S^*(x)\)

$$\begin{aligned} \mathbb {P}\bigl \{\hat{v}^{DRO}_n(x) > v^*(x) + \bigl (\mathbb {E}\left[ {\Vert {\nabla c(z^*(x),Y)}\Vert ^2}\right] \bigr )^{1/2} \zeta _n(\alpha ,x) + 4L_2(z^*(x))\zeta ^2_n(\alpha ,x)\bigr \} \le \alpha . \end{aligned}$$

Therefore, if we pick \(\alpha \in (0,1)\) so that

$$\begin{aligned} \left( \mathbb {E}\left[ {\Vert {\nabla c(z^*(x),Y)}\Vert ^2}\right] \right) ^{1/2} \zeta _n(\alpha ,x) + 4L_2(z^*(x))\zeta ^2_n(\alpha ,x) \le \kappa , \end{aligned}$$

then \(\mathbb {P}\left\{ {g(\hat{z}^{DRO}_n(x);x) > v^*(x) + \kappa }\right\} \le 2\alpha \). Similar to the analysis above, positive constants \(\tilde{\varGamma }(\kappa ,x)\) and \(\tilde{\gamma }(\kappa ,x)\) and inequality (11) can be obtained by bounding the smallest value of \(\alpha \) using Assumption 8 and equation (10) so that

$$\begin{aligned}&\left( \mathbb {E}\left[ {\Vert {\nabla c(z^*(x),Y)}\Vert ^2}\right] \right) ^{1/2} \zeta _n(\alpha ,x) + 4L_2(z^*(x))\zeta ^2_n(\alpha ,x) \le \kappa . \quad \square \end{aligned}$$

1.6 Proof of Lemma 14

By first adding and subtracting \(g^*_n(z;x)\), defined in problem (4), and then doing the same with \(g^*_{s,n}(z;x)\), we obtain

$$\begin{aligned}&\underset{z \in \mathcal {Z}}{\sup }\, |{\hat{g}^{ER}_{s,n}(z;x) - g(z;x)}|\nonumber \\&\quad \le \, \underset{z \in \mathcal {Z}}{\sup } \, \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup }\displaystyle \sum _{i=1}^{n} p_i \Big |\sup _{y \in \hat{\mathcal {Y}}^{i}_n(x;\mu _n(x))} c(z,y) - c \left( z,f^*(x) + \varepsilon ^i \right) \Big |\nonumber \\&\qquad + \underset{z \in \mathcal {Z}}{\sup } \, |{g^*_{s,n}(z;x) - g^*_n(z;x)}| + \underset{z \in \mathcal {Z}}{\sup } \, |{g^*_n(z;x) - g(z;x)}|. \end{aligned}$$
(16)

Consider the first term on the r.h.s. of (16). We have for each \(x \in \mathcal {X}\)

$$\begin{aligned}&\underset{z \in \mathcal {Z}}{\sup } \, \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup }\displaystyle \sum _{i=1}^{n} p_i \Big |\sup _{y \in \hat{\mathcal {Y}}^{i}_n(x;\mu _n(x))} c(z,y) - c \left( z,f^*(x) + \varepsilon ^i \right) \Big |\\&\quad \le \, \underset{z \in \mathcal {Z}}{\sup } \, \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup }\displaystyle \sum _{i=1}^{n} p_i \sup _{y \in \hat{\mathcal {Y}}^{i}_n(x;\mu _n(x))} L(z) \Vert {y - (f^*(x) + \varepsilon ^i)}\Vert \\&\quad \le \, \underset{z \in \mathcal {Z}}{\sup } \, \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup }\displaystyle \sum _{i=1}^{n} p_i L(z) \left( \mu _{n}(x) + \Vert {\tilde{\varepsilon }^{i}_{n}(x)}\Vert \right) \\&\quad = \, \underset{z \in \mathcal {Z}}{\sup } \, L(z) \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup }\displaystyle \sum _{i=1}^{n} p_i \left( \mu _{n}(x) + \Vert {\tilde{\varepsilon }^{i}_{n}(x)}\Vert \right) \\&\quad \le \, \underset{z \in \mathcal {Z}}{\sup } \, L(z) \left( \mu _{n}(x) + \left( \dfrac{1}{n}\displaystyle \sum _{i=1}^{n} \bigl ( \Vert {\tilde{\varepsilon }^{i}_{n}(x)}\Vert \bigr )^2\right) ^{\frac{1}{2}} \right) \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \left( n \sum _{i=1}^{n} p^2_i\right) ^{\frac{1}{2}} \\&\quad = \, \underset{z \in \mathcal {Z}}{\sup } \, L(z) \left( \mu _{n}(x) + \left( \dfrac{1}{n}\displaystyle \sum _{i=1}^{n} \bigl ( \Vert {\tilde{\varepsilon }^{i}_{n}(x)}\Vert \bigr )^2\right) ^{\frac{1}{2}} \right) \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \left( 1 + n \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2\right) ^{\frac{1}{2}}, \end{aligned}$$

where the first step follows from Assumption 9, the second step follows from the definition of the set \(\hat{\mathcal {Y}}^i_n(x;\mu _n(x))\), the triangle inequality, and inequality (6), and the fourth step follows by applying the Cauchy-Schwarz inequality twice.

Next, consider the second term on the r.h.s. of (16). For each \(x \in \mathcal {X}\)

$$\begin{aligned}&\underset{z \in \mathcal {Z}}{\sup } \, |g^*_{s,n}(z;x) - g^*_n(z;x) |\\&\quad = \, \underset{z \in \mathcal {Z}}{\sup } \Big |\underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} p_i c(z,f^*(x) + \varepsilon ^i) - \frac{1}{n} \sum _{i=1}^{n} c(z,f^*(x) + \varepsilon ^i) \Big |\\&\quad = \, \underset{z \in \mathcal {Z}}{\sup } \Big |\underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big ( p_i - \frac{1}{n} \Big ) c(z,f^*(x) + \varepsilon ^i) \Big |\\&\quad \le \, \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \left( n \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \right) ^{\frac{1}{2}} \underset{z \in \mathcal {Z}}{\sup } \left( \frac{1}{n}\sum _{i=1}^{n} \big (c(z,f^*(x) + \varepsilon ^i)\big )^2\right) ^{\frac{1}{2}}, \end{aligned}$$

where the inequality follows by Cauchy-Schwarz. \(\square \)
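The two Cauchy-Schwarz steps used above, together with the identity \(n\sum _i p^2_i = 1 + n\sum _i (p_i - 1/n)^2\), can be checked numerically. A minimal sketch (ours; random Dirichlet vectors stand in for elements of \(\mathfrak {P}_n(x;\zeta _{n}(x))\) and random nonnegative weights for the cost terms):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check: sum_i p_i * a_i <= (mean_i a_i^2)^(1/2) * (n * sum_i p_i^2)^(1/2), and
#        n * sum_i p_i^2 == 1 + n * sum_i (p_i - 1/n)^2,
# for any probability vector p and nonnegative weights a (as used in the proof of Lemma 14).
for _ in range(10_000):
    n = rng.integers(2, 50)
    p = rng.dirichlet(np.ones(n))          # random probability vector
    a = rng.exponential(size=n)            # nonnegative weights
    lhs = np.dot(p, a)
    rhs = np.sqrt(np.mean(a ** 2)) * np.sqrt(n * np.sum(p ** 2))
    assert lhs <= rhs + 1e-9
    ident = n * np.sum(p ** 2) - (1 + n * np.sum((p - 1 / n) ** 2))
    assert abs(ident) < 1e-9
print("Cauchy-Schwarz bound and the variance identity held on all samples.")
```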

1.7 Proof of Theorem 18

Since

$$\begin{aligned} \Vert {\hat{v}^{DRO}_n(X) - v^*(X)}\Vert _{L^q}&= \big \Vert \min _{z \in \mathcal {Z}} \hat{g}^{ER}_{s,n}(z;X) - \min _{z \in \mathcal {Z}} g(z;X)\big \Vert _{L^q} \nonumber \\&\le \Big \Vert \underset{z \in \mathcal {Z}}{\sup } |{\hat{g}^{ER}_{s,n}(z;X) - g(z;X)}|\Big \Vert _{L^q}, \end{aligned}$$
(17)

we look to establish uniform rates of convergence of \(\hat{g}^{ER}_{s,n}(\cdot ;X)\) to \(g(\cdot ;X)\) with respect to the \(L^q\)-norm on \(\mathcal {X}\). From (16) and the triangle inequality, we have

$$\begin{aligned} \Big \Vert \underset{z \in \mathcal {Z}}{\sup }\, |{\hat{g}^{ER}_{s,n}(z;X) - g(z;X)}|\Big \Vert _{L^q} \le&\, \Big \Vert \underset{z \in \mathcal {Z}}{\sup } \, |{\hat{g}^{ER}_{s,n}(z;X) - g^*_{s,n}(z;X)}| \Big \Vert _{L^q} \nonumber \\&\quad +\Big \Vert \underset{z \in \mathcal {Z}}{\sup } \, |{g^*_{s,n}(z;X) - g^*_n(z;X)}|\Big \Vert _{L^q} \nonumber \\&\quad +\Big \Vert \underset{z \in \mathcal {Z}}{\sup } \, |{g^*_n(z;X) - g(z;X)}|\Big \Vert _{L^q}. \end{aligned}$$
(18)

We bound the terms on the r.h.s. of (18) using Lemma 14. Assumptions 9, 16, and 17 and \(\Vert {\mu _{n}(X)}\Vert _{L^q} = \bar{C}_{\mu } n^{-r/2}\) imply the first term on the r.h.s. of inequality (12) satisfies:

$$\begin{aligned}&\Big \Vert \underset{z \in \mathcal {Z}}{\sup } \, |{\hat{g}^{ER}_{s,n}(z;X) - g^*_{s,n}(z;X)}| \Big \Vert _{L^q} \\&\quad \le \, \bigg \Vert \underset{z \in \mathcal {Z}}{\sup } \, L(z) \biggl (\mu _{n}(X) + \biggl (\dfrac{1}{n}\displaystyle \sum _{i=1}^{n} \Vert {\tilde{\varepsilon }^{i}_{n}(X)}\Vert ^2\biggr )^{\frac{1}{2}} \biggr ) \\&\quad \underset{p \in \mathfrak {P}_n(X;\zeta _{n}(X))}{\sup } \biggl ( 1 + n \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2\biggr )^{\frac{1}{2}} \bigg \Vert _{L^q} \\&\quad = \, O(1) \biggl [ \Vert {\mu _n(X)}\Vert _{L^q} + \bigg \Vert \biggl (\dfrac{1}{n}\displaystyle \sum _{i=1}^{n} \Vert {\tilde{\varepsilon }^{i}_{n}(X)}\Vert ^2\biggr )^{\frac{1}{2}} \bigg \Vert _{L^q} \biggr ] \\&\quad = \, O(1) \biggl [ \Vert {\mu _n(X)}\Vert _{L^q} + \Vert {f^*(X) - \hat{f}_n(X)}\Vert _{L^q} \\&\qquad \qquad + \bigg \Vert \biggl (\dfrac{1}{n}\displaystyle \sum _{i=1}^{n} \Vert {f^*(x^i) - \hat{f}_n(x^i)}\Vert ^2\biggr )^{\frac{1}{2}} \bigg \Vert _{L^q} \biggr ] \\&\quad = \, O_p(n^{-r/2}). \end{aligned}$$

Assumptions 16 and 18 imply that the second term on the r.h.s. of inequality (12) satisfies

$$\begin{aligned}&\Big \Vert \sup _{z \in \mathcal {Z}} |{g^*_{s,n}(z;X) - g^*_n(z;X)}| \Big \Vert _{L^q} \\&\quad \le \, \bigg \Vert \underset{p \in \mathfrak {P}_n(X;\zeta _{n}(X))}{\sup } \left( n \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \right) ^{\frac{1}{2}} \underset{z \in \mathcal {Z}}{\sup } \left( \frac{1}{n}\sum _{i=1}^{n} \big (c(z,f^*(X) + \varepsilon ^i)\big )^2\right) ^{\frac{1}{2}}\bigg \Vert _{L^q} \\&\quad = \, O_p(1) \bigg \Vert \underset{p \in \mathfrak {P}_n(X;\zeta _{n}(X))}{\sup } \left( n \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \right) ^{\frac{1}{2}} \bigg \Vert _{L^q} = O_p(n^{-r/2}). \end{aligned}$$

Finally, Assumption 14 implies

$$\begin{aligned} \Big \Vert \sup _{z \in \mathcal {Z}} |{g^*_{n}(z;X) - g(z;X)}| \Big \Vert _{L^q} = O_p(n^{-1/2}). \end{aligned}$$

Putting the above three inequalities together into inequality (18), we obtain

$$\begin{aligned} \Big \Vert \underset{z \in \mathcal {Z}}{\sup } |{\hat{g}^{ER}_{s,n}(z;X) - g(z;X)}|\Big \Vert _{L^q} = O_p(n^{-r/2}). \end{aligned}$$

The first part of the stated result then follows from (17). The second part of the stated result follows from (17) and the fact that

$$\begin{aligned} {\Vert {g(\hat{z}^{DRO}_n(X);X) - v^*(X)}\Vert _{L^q}}&{\le \Vert {g(\hat{z}^{DRO}_n(X);X) - \hat{v}^{DRO}_n(X)}\Vert _{L^q} } \\&\qquad +{\Vert {\hat{v}^{DRO}_n(X) - v^*(X)}\Vert _{L^q}.} \quad \square \end{aligned}$$

Ambiguity sets satisfying Assumption 12

In Sect. 5, we outlined conditions under which phi-divergence ambiguity sets \(\mathfrak {P}_n(x;\zeta _{n}(x))\) satisfy Assumption 12 for a suitable choice of the radius \(\zeta _{n}(x)\). Lemma 25 below determines sharp bounds on the radius \(\zeta _{n}(x)\) for some other families of ambiguity sets to satisfy Assumption 12. Before presenting the lemma, we introduce a third example of the ambiguity set \(\mathfrak {P}_n(x;\zeta _{n}(x))\) to add to Examples 1 and 2 in Sect. 3.

Example 3

Mean-upper-semideviation-based ambiguity sets [48]: given order \(a \in [1,+\infty )\) and radius \(\zeta _n(x) \ge 0\), let \(b:= a/(a-1)\) and define \(\hat{\mathcal {P}}_n(x)\) using

$$\begin{aligned} \mathfrak {P}_n(x;\zeta _n(x)) := \bigg \{p \in \mathbb {R}^{n}_+ :&\sum _{i=1}^{n} p_i = 1 \, \text {and } \exists \, q \in \mathbb {R}^n_+ \text { such that } \Vert {q}\Vert _b \le \zeta _n(x), \\&\quad p_i = \frac{1}{n} \bigg [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \bigg ], \forall i \in [n] \bigg \}. \end{aligned}$$
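To make Example 3 concrete, the sketch below (our illustration; the function name and the choice \(a = 2\) are not from the paper) maps a feasible \(q\) to the corresponding weight vector \(p\) and checks the defining conditions:

```python
import numpy as np

def msd_ambiguity_member(q, zeta, a):
    """Map q >= 0 with ||q||_b <= zeta, b = a/(a-1), to the weight vector p of
    Example 3; return None if the construction leaves the probability simplex.
    Assumes a > 1 so that b is finite."""
    q = np.asarray(q, dtype=float)
    n = q.size
    b = a / (a - 1.0)
    if np.any(q < 0) or np.linalg.norm(q, ord=b) > zeta + 1e-12:
        return None                     # q violates the nonnegativity/norm constraints
    p = (1.0 + q - q.mean()) / n        # p_i = (1/n)[1 + q_i - (1/n) sum_j q_j]
    if np.any(p < 0):
        return None                     # p must remain a probability vector
    assert abs(p.sum() - 1.0) < 1e-12   # the weights sum to one by construction
    return p

# Illustration: n = 5, order a = 2 (so b = 2), radius zeta = 0.3.
print(msd_ambiguity_member(q=[0.3, 0.0, 0.0, 0.0, 0.0], zeta=0.3, a=2.0))
# -> a small perturbation of the uniform weights (1/5, ..., 1/5)
```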

Lemma 25

The following ambiguity sets satisfy Assumption 12 with constant \(\rho \in (1,2]\):

  (a) CVaR-based ambiguity sets (see Example 1) with radius \(\zeta _{n}(x) = {C^{\zeta }_1 n^{1-\rho }}\),

  (b) Variation distance-based ambiguity sets (see Example 2) with radius \(\zeta _{n}(x) = {C^{\zeta }_2 n^{-\rho /2}}\),

  (c) Mean-upper-semideviation-based ambiguity sets of order \(a \in [1,+\infty )\) (see Example 3) with radius \(\zeta _{n}(x) = {\left\{ \begin{array}{ll} {C^{\zeta }_3 n^{1 - \rho /2}} &{}\text{ if } a \ge 2 \\ {C^{\zeta }_4 n^{3/2 - 1/a - \rho /2}} &{} \text{ if } a < 2 \end{array}\right. }\),

for some constants \(C^{\zeta }_i > 0\), \(i \in [4]\). Furthermore, these bounds are sharp in the sense described above.

Proof

  (a)

    Assume that \(\zeta _{n}(x) < 0.5\). We begin by noting that there exists an optimal solution to the problem \(\sup _{p \in \mathfrak {P}_n(x;\zeta _{n}(x))} \sum _{i=1}^{n} \big (p_i - \frac{1}{n}\big )^2\) that is an extreme point of the polytope \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Furthermore, every extreme point of \(\mathfrak {P}_n(x;\zeta _{n}(x))\) satisfies at least \(n-1\) of the set of 2n inequalities \(\Bigl \{p_i \ge 0, \, i \in [n], p_i \le \frac{1}{n(1-\zeta _{n}(x))}, \, i \in [n]\Bigr \}\), with equality. This implies that there exists an optimal solution at which at least \(n-1\) of the \(p_i\)s either take the value zero, or take the value \(\frac{1}{n(1-\zeta _{n}(x))}\). At this solution, \(n-1\) of the terms \(\big (p_i - \frac{1}{n}\big )^2\) are either \(\frac{1}{n^2}\) or \(\frac{1}{n^2} \bigl (\frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\bigr )^2\) (with \(\frac{1}{n^2}\) larger since \(\zeta _{n}(x) < 0.5\) by assumption).

       Suppose \(M \in \{0,\dots ,n-1\}\) of the inequalities \(p_i \ge 0\), \(i \in [n]\), are satisfied with equality at such an optimal solution. Since \(\sum _{i=1}^{n} p_i = 1\) and \(p_i \le \frac{1}{n(1-\zeta _{n}(x))}\), \(\forall i \in [n]\), we require \((n-M)\frac{1}{n(1-\zeta _{n}(x))} \ge 1\), which implies \(M \le n\zeta _{n}(x)\). Consequently, \(M \le n\zeta _{n}(x) < n/2\) of the inequalities \(p_i \ge 0\), \(i \in [n]\), are satisfied with equality and at least \((n-1-M) \ge n(1-\zeta _{n}(x)) - 1 > n/2 - 1\) of the inequalities \(p_i \le \frac{1}{n(1-\zeta _{n}(x))}\), \(i \in [n]\), are satisfied with equality. Therefore, whenever \(\zeta _n(x) < 0.5\), we have:

    $$\begin{aligned} \sum _{i=1}^{n} \bigg (p_i - \frac{1}{n}\bigg )^2&\le (n\zeta _{n}(x)+1)\frac{1}{n^2} + n(1-\zeta _{n}(x))\frac{1}{n^2} \left( \frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\right) ^2 \\&= \frac{1}{n^2} + \frac{1}{n}\left( \frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\right) . \end{aligned}$$

    Because the above analysis is constructive, it can be immediately used to deduce that the bound on \(\zeta _n(x)\) is sharp.

  (b)

    The stated result follows from the fact that

    $$\begin{aligned} \sum _{i=1}^{n} \left( p_i - \frac{1}{n}\right) ^2 \le \left( \sum _{i=1}^{n} |{p_i - \frac{1}{n}}|\right) ^2 \le \zeta ^2_{n}(x), \quad \forall p \in \mathfrak {P}_n(x;\zeta _{n}(x)), \, x \in \mathcal {X}. \end{aligned}$$

    To see that the above bound is sharp, assume without loss of generality that \(n \ge 2\) and \(\zeta _{n}(x) \le 1\). Then, because

    $$\begin{aligned} \left( \frac{1}{n} + \frac{\zeta _{n}(x)}{2}, \underbrace{\frac{1}{n} - \frac{\zeta _{n}(x)}{2n-2},\dots ,\frac{1}{n} - \frac{\zeta _{n}(x)}{2n-2}}_{n-1 \text { terms}}\right) \in \mathfrak {P}_n(x;\zeta _{n}(x)), \end{aligned}$$

    we have

    $$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \ge \frac{\zeta ^2_n(x)}{4} + \frac{\zeta ^2_n(x)}{4(n-1)}. \end{aligned}$$

  (c)

    Let \(\bar{q}:= \frac{1}{n}\sum _{i=1}^{n} q_i\). We have:

    $$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \le \underset{q \in \mathfrak {Q}_n(x;\zeta _{n}(x))}{\sup } \frac{1}{n^2}\sum _{i=1}^{n} (q_i - \bar{q})^2, \end{aligned}$$

    where \(\mathfrak {Q}_n(x;\zeta _{n}(x)):= \left\{ q \in \mathbb {R}^n_+ : \Vert {q}\Vert _b \le \zeta _{n}(x) \right\} \). Note that for each \(q \in \mathfrak {Q}_n(x;\zeta _{n}(x))\), we have \(|{\bar{q}}| \le n^{-1} \Vert {q}\Vert _1 \le n^{-1/b} \Vert {q}\Vert _b\), which in turn implies

    $$\begin{aligned} \Vert {q - \bar{q} \textbf{1}}\Vert _b \le \Vert {q}\Vert _b + |{\bar{q}}| \Vert {\textbf{1}}\Vert _b = \Vert {q}\Vert _b + |{\bar{q}}| n^{1/b} \le \Vert {q}\Vert _b + \Vert {q}\Vert _b \le 2\zeta _{n}(x), \end{aligned}$$

    where \(\textbf{1}\) is a vector of ones of appropriate dimension. Additionally, note that

    $$\begin{aligned} \sum _{i=1}^{n} (q_i - \bar{q})^2 = \Vert {q - \bar{q}\textbf{1}}\Vert ^2 \le {\left\{ \begin{array}{ll} \Vert {q - \bar{q}\textbf{1}}\Vert ^2_b &{}\text{ if } b \le 2 \\ n^{1 - 2/b} \Vert {q - \bar{q}\textbf{1}}\Vert ^2_b &{} \text{ if } b > 2 \end{array}\right. }. \end{aligned}$$

    The desired result then follows from

    $$\begin{aligned} \underset{q \in \mathfrak {Q}_n(x;\zeta _{n}(x))}{\sup } \frac{1}{n^2}\sum _{i=1}^{n} (q_i - \bar{q})^2&\le \underset{\left\{ q : \Vert {q - \bar{q}\textbf{1}}\Vert _b \le 2\zeta _{n}(x) \right\} }{\sup } \frac{1}{n^2}\Vert {q - \bar{q}\textbf{1}}\Vert ^2 \\&\le {\left\{ \begin{array}{ll} \frac{4}{n^2}\zeta ^2_{n}(x) &{}\text{ if } b \le 2 \\ \frac{4}{n^{1 + 2/b}} \zeta ^2_{n}(x) &{} \text{ if } b > 2 \end{array}\right. }. \end{aligned}$$

    We now show that the above bounds are sharp.

       Consider first the case when \(b \le 2\) and assume without loss of generality that \(\zeta _{n}(x) = O(\sqrt{n})\). Note that \(p_i = \frac{1}{n} \big [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \big ]\), \(i \in [n]\), with \(q_1 = \zeta _{n}(x)\) and \(q_i = 0\), \(\forall i \ge 2\), is an element of \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Therefore

    $$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 = \varTheta \left( \frac{\zeta ^2_{n}(x)}{n^2}\right) . \end{aligned}$$

    Next, suppose instead that \(b > 2\) and assume without loss of generality that \(\zeta _{n}(x) = O(n^{1/b})\). Note that \(p_i = \frac{1}{n} \big [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \big ]\) with \(q_i = {\left\{ \begin{array}{ll} \bigl (\frac{2}{n}\bigr )^{1/b}\zeta _{n}(x) &{}\text{ if } i \equiv 0 \;(\bmod \; 2) \\ 0 &{} \text{ if } i \equiv 1 \;(\bmod \; 2) \end{array}\right. }\), \(i \in [n]\), is an element of \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Therefore

    $$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 = \varTheta \left( \frac{\zeta ^2_{n}(x)}{n^{1+2/b}}\right) . \end{aligned}$$

\(\square \)
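The bound and the sharpness construction in part (b) of Lemma 25 can be reproduced numerically. The sketch below is our own check (it samples members of the variation-distance ball rather than solving the supremum exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, zeta = 50, 0.4

# Part (b) upper bound: sum_i (p_i - 1/n)^2 <= zeta^2 whenever ||p - (1/n,...,1/n)||_1 <= zeta.
upper = zeta ** 2

# The explicit vector from the sharpness argument.
p_star = np.full(n, 1 / n - zeta / (2 * n - 2))
p_star[0] = 1 / n + zeta / 2
assert np.all(p_star >= 0) and abs(p_star.sum() - 1) < 1e-12
assert np.sum(np.abs(p_star - 1 / n)) <= zeta + 1e-12
value = np.sum((p_star - 1 / n) ** 2)
print(value, zeta ** 2 / 4 + zeta ** 2 / (4 * (n - 1)))   # the two numbers coincide

# Randomly sampled members of the ball never exceed the upper bound.
for _ in range(10_000):
    step = rng.normal(size=n)
    step -= step.mean()                              # keep the weights summing to one
    step *= zeta * rng.uniform() / np.sum(np.abs(step))  # l1-norm of the perturbation <= zeta
    p = 1 / n + step
    if np.all(p >= 0):
        assert np.sum((p - 1 / n) ** 2) <= upper + 1e-12
print("All sampled members respected the bound.")
```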

Additional computational results

Algorithm 4: Estimating the 99% UCB on the optimality gap of a solution

Algorithm 4 describes our procedure for estimating the 99% UCB on the optimality gap of our data-driven solutions using the multiple replication procedure [41]. We use only 20,000 of the \(10^5\) generated samples from the conditional distribution of Y given \(X = x\) to compute these UCBs, since they suffice to yield accurate estimates of the optimality gaps. Unlike [36, Algorithm 1], which uses relative optimality gaps, we use absolute optimality gaps in our 99% UCB estimates to avoid division by small quantities when \(v^*(x)\) is close to zero.
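Since the pseudocode of Algorithm 4 is not reproduced here, the following Python sketch outlines the multiple replication procedure of [41] on which it is based, as we read it from the surrounding description; the replication count, the interfaces, and all function names are our assumptions rather than the paper's exact Algorithm 4:

```python
import numpy as np
from scipy import stats

def mrp_99_ucb(candidate_cost, solve_saa, sample_conditional, n_reps=25, n_eval=20_000):
    """Sketch of a multiple replication procedure (cf. [41]) for a 99% upper
    confidence bound on the absolute optimality gap of a fixed decision.

    candidate_cost(ys)    -> mean cost of the candidate decision on samples ys
    solve_saa(ys)         -> optimal value of the SAA built from samples ys
    sample_conditional(m) -> m samples from the conditional distribution of Y given X = x
    """
    gaps = np.empty(n_reps)
    for k in range(n_reps):
        ys = sample_conditional(n_eval)
        # Gap estimate: candidate's estimated cost minus the SAA optimal value,
        # both computed on the same conditional samples.
        gaps[k] = candidate_cost(ys) - solve_saa(ys)
    t_quantile = stats.t.ppf(0.99, df=n_reps - 1)
    return gaps.mean() + t_quantile * gaps.std(ddof=1) / np.sqrt(n_reps)
```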

We compare Algorithm 2 with an “optimal covariate-independent” specification of the radius \(\zeta _n\). This optimal covariate-independent radius is determined by choosing the single value of \(\zeta _n\) that minimizes the median of the \(99\%\) UCBs over the 20 different covariate realizations. We also benchmark Algorithm 3 against an “optimal covariate-dependent” specification of \(\zeta _n(x)\), determined by choosing, for each covariate realization, the value of \(\zeta _n(x)\) that minimizes the \(99\%\) UCB. Determining these optimal covariate-independent and covariate-dependent radii is impractical because it requires 20,000 i.i.d. samples from the conditional distribution of Y given \(X = x\) (which a decision-maker does not have); we consider them only to benchmark the performance of Algorithms 2 and 3.
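For concreteness, the covariate-independent benchmark just described amounts to a grid search of the following form (a sketch under our own naming; the candidate grid and helper interface are illustrative):

```python
import numpy as np

def optimal_covariate_independent_radius(radius_grid, ucb_for_radius, covariate_realizations):
    """Pick the single radius zeta_n minimizing the median 99% UCB across the
    tested covariate realizations.

    ucb_for_radius(zeta, x) -> estimated 99% UCB on the optimality gap of the
    ER-DRO solution computed with radius zeta at covariate realization x
    (e.g., via the multiple replication procedure sketched above).
    """
    medians = [
        np.median([ucb_for_radius(zeta, x) for x in covariate_realizations])
        for zeta in radius_grid
    ]
    return radius_grid[int(np.argmin(medians))]
```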

Fig. 5: (Wasserstein-DRO with Ridge regression) Comparison of the E+Ridge approach (E) with tuning of the W+Ridge radius using Algorithms 1 (1), 2 (2), and 3 (3). Top row: \(\theta = 1\). Middle row: \(\theta = 0.5\). Bottom row: \(\theta = 2\). Left column: \(d_x = 3\). Middle column: \(d_x = 10\). Right column: \(d_x = 100\)

“Optimal” tuning of the Wasserstein radius. Figure 6 compares the performance of the W+OLS formulations when the radius \(\zeta _n(x)\) of the ambiguity set is determined using Algorithms 2 and 3 and using optimal covariate-dependent and covariate-independent tuning. We vary the model degree \(\theta \), the covariate dimension among \(d_x \in \{3,10,100\}\), and the sample size among \(n \in \{5(d_x + 1),10(d_x + 1),20(d_x + 1),50(d_x + 1)\}\) in these experiments. The radius specified by Algorithm 2 performs better than the radius specified by Algorithm 3 for smaller sample sizes and covariate dimensions, and the converse holds for larger covariate dimensions and sample sizes. These results indicate that although covariate-dependent tuning has the potential to yield better results than the covariate-independent tuning of Algorithm 2, Algorithm 3 is only able to obtain good estimates of the optimal covariate-dependent radius \(\zeta _n(x)\) for relatively large sample sizes n. The gap between the performance of Algorithm 2 and the optimal covariate-independent tuning shrinks with increasing sample size and covariate dimension, except for \(\theta = 2\). The gap between the performance of Algorithm 3 and the optimal covariate-dependent tuning of the radius also shrinks with increasing covariate dimension and sample size.

Fig. 6: (Comparison with “optimal” specification of the Wasserstein radius) Comparison of the W+OLS approach with the optimal covariate-dependent (\(\texttt {D}^{*}\)) and covariate-independent (\(\texttt {I}^{*}\)) tuning of the W+OLS radius, and the tuning of the W+OLS radius using Algorithm 3 (3) and Algorithm 2 (2). Top row: \(\theta = 1\). Middle row: \(\theta = 0.5\). Bottom row: \(\theta = 2\). Left column: \(d_x = 3\). Middle column: \(d_x = 10\). Right column: \(d_x = 100\)

Fig. 7: (Comparison of the radii specified by Algorithms 1, 2, and 3) Comparison of the optimal covariate-dependent tuning (\(\texttt {D}^{*}\)) and optimal covariate-independent tuning (\(\texttt {I}^{*}\)) of the W+OLS radius, the covariate-dependent tuning of the W+OLS radius using Algorithm 3 (3), and the covariate-independent tuning of the W+OLS radius using Algorithm 1 (1) and Algorithm 2 (2) for \(d_x = 100\). Left: \(\theta = 1\). Middle: \(\theta = 0.5\). Right: \(\theta = 2\)

Comparison of the radii specified by Algorithms 1, 2, and 3. Figure 7 compares the radii specified by Algorithms 1, 2, and 3 with the optimal covariate-dependent radius and the optimal covariate-independent radius for the W+OLS formulation. We consider \(d_x = 100\), vary the model degree \(\theta \), and vary the sample size among \(n \in \{5(d_x + 1),10(d_x + 1),20(d_x + 1),50(d_x + 1)\}\) in these experiments. Note that the y-axis limits are different across the subplots. First, note that the radius specified by Algorithm 1 shrinks very quickly to zero for all three values of \(\theta \). Consequently, we note from Fig. 2 that the resulting ER-DRO estimators typically do not perform as well as the estimators obtained when the radius \(\zeta _n\) is specified using Algorithms 2 and 3. Second, we see that the covariate-independent specifications of the radius result in narrower distributions overall compared to the covariate-dependent specifications. This may be because the covariate-independent specifications of the radius attempt to choose a single value of \(\zeta _n(x)\) for all possible covariate realizations \(x \in \mathcal {X}\), whereas the covariate-dependent specifications can choose a different value of \(\zeta _n(x)\) depending on the realization \(x \in \mathcal {X}\). Third, the distribution of the radius determined using Algorithm 3 appears to converge to the distribution of the optimal covariate-dependent radius as the sample size increases. Similarly, the distribution of the radius determined using Algorithm 2 also appears to converge to the distribution of the optimal covariate-independent radius as n increases (except for the case when \(\theta = 2\)). Finally, as noted in Sect. 6, it may be advantageous to use a positive radius for the ambiguity set when the prediction model is misspecified (e.g., when using OLS regression even though \(\theta \ne 1\)). This is corroborated by the plots for \(\theta = 2\), where the distribution of the optimal covariate-dependent radius remains far from zero even for large sample sizes n. See Fig. 5.

Fig. 8: (Comparison of Wasserstein-DRO with J-SAA) Comparison of the E+OLS (E) and J+OLS (J) approaches with tuning of the W+OLS radius using Algorithms 2 (2) and 3 (3) for \(d_x = 100\). Left: \(\theta = 1\). Middle: \(\theta = 0.5\). Right: \(\theta = 2\)

Fig. 9: (Comparison of the mean optimality gaps of the ER-DRO formulations) Comparison of the mean optimality gaps of the E+OLS approach (E) with the covariate-independent tuning of the W+OLS radius (W), the S+OLS radius (S), and the H+OLS radius (H), all tuned using Algorithm 2. Top row: \(\theta = 1\). Middle row: \(\theta = 0.5\). Bottom row: \(\theta = 2\). Left column: \(d_x = 3\). Middle column: \(d_x = 10\). Right column: \(d_x = 100\)

Fig. 10: (Mean optimality gaps for Wasserstein-DRO with OLS regression) Comparison of the mean optimality gaps of the E+OLS approach (E) with the tuning of the W+OLS radius using Algorithms 1 (1), 2 (2), and 3 (3). Top row: \(\theta = 1\). Middle row: \(\theta = 0.5\). Bottom row: \(\theta = 2\). Left column: \(d_x = 3\). Middle column: \(d_x = 10\). Right column: \(d_x = 100\)

Comparison with the jackknife-based formulations. Figure 8 compares the performance of the ER-SAA+OLS approach and the jackknife-based SAA (J-SAA+OLS) approach [36] with the W+OLS formulations when the radius \(\zeta _n(x)\) is specified using Algorithms 2 and 3. We consider \(d_x = 100\), vary the model degree \(\theta \), and vary the sample size among \(n \in \{3(d_x + 1),5(d_x + 1),10(d_x + 1),20(d_x + 1)\}\) in these experiments. As observed in [36], the J-SAA+OLS formulation performs better than the E+OLS formulation in the small sample size regime. Figure 8 shows that the W+OLS formulations outperform the J-SAA+OLS formulation (except when using Algorithm 2 for \(\theta = 2\) and large n). This is expected because the ER-DRO formulations account for both the error in approximating \(f^*\) by \(\hat{f}_n\) and the error in approximating \(P_{Y \mid X = x}\) by \(P^*_n(x)\), whereas the J-SAA+OLS formulation only addresses the bias in the residuals obtained from OLS regression (i.e., even if \(\hat{f}_n\) is an accurate estimate of \(f^*\), the jackknife formulations do not account for the fact that \(P^*_n(x)\) may be a crude approximation of \(P_{Y \mid X = x}\)). We omit the results for the J+-SAA+OLS formulation because they are similar to those for the J-SAA+OLS formulation. See Fig. 6.

Comparison of mean optimality gaps. Figures 9 and 10 are the counterparts of Figs. 1 and 2 in Sect. 7 when the mean optimality gap is used instead of the \(99\%\) UCB to compare the different estimators, i.e., when \(100 \, \texttt {avg}(\{\hat{G}^k(x)\})\) is used instead of the output \(\hat{B}_{99}(x)\) of Algorithm 4. As mentioned in Sect. 7, comparing mean optimality gaps instead of \(99\%\) UCBs results in very similar plots because the standard deviations \(\sqrt{\texttt {var}(\{\hat{G}^k(x)\})}\) of the gap estimates \(\{\hat{G}^k(x)\}\) are significantly lower than the gap estimates themselves; this is because we use 20,000 random samples to derive each gap estimate, which yields estimates with low variability. Finally, we note that we also plotted median optimality gaps instead of mean optimality gaps and obtained very similar plots. See Figs. 7, 8, 9 and 10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kannan, R., Bayraksan, G. & Luedtke, J.R. Residuals-based distributionally robust optimization with covariate information. Math. Program. (2023). https://doi.org/10.1007/s10107-023-02014-7
