Abstract
We consider data-driven approaches that integrate a machine learning prediction model within distributionally robust optimization (DRO) given limited joint observations of uncertain parameters and covariates. Our framework is flexible in the sense that it can accommodate a variety of regression setups and DRO ambiguity sets. We investigate asymptotic and finite sample properties of solutions obtained using Wasserstein, sample robust optimization, and phi-divergence-based ambiguity sets within our DRO formulations, and explore cross-validation approaches for sizing these ambiguity sets. Through numerical experiments, we validate our theoretical results, study the effectiveness of our approaches for sizing ambiguity sets, and illustrate the benefits of our DRO formulations in the limited data regime even when the prediction model is misspecified.
Notes
Our results can be extended to Wasserstein distances defined using \(\ell _q\)-norms with \(q \ne 2\).
See Remark 1 at the end of this section for a discussion of the case when \(f^* \not \in \mathcal {F}\).
For any \(u,v \in \mathbb {R}^{d_y}\), \(\Vert {{\text {proj}}_{\mathcal {Y}}(u) - {\text {proj}}_{\mathcal {Y}}(v)}\Vert \le \Vert {u-v}\Vert \).
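This nonexpansiveness holds for projection onto any closed convex set \(\mathcal {Y}\). As a quick numerical spot-check (for the hypothetical special case of a box, not the paper's specific \(\mathcal {Y}\)):

```python
import math
import random

def proj_box(v, lo=-1.0, hi=1.0):
    # Euclidean projection onto the box [lo, hi]^d (a closed convex set)
    return [min(max(x, lo), hi) for x in v]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

random.seed(0)
for _ in range(1000):
    u = [random.uniform(-3, 3) for _ in range(4)]
    v = [random.uniform(-3, 3) for _ in range(4)]
    du = [a - b for a, b in zip(proj_box(u), proj_box(v))]
    dv = [a - b for a, b in zip(u, v)]
    assert norm(du) <= norm(dv) + 1e-12  # projection is 1-Lipschitz
```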
In special cases (e.g., smooth unconstrained problems, see [31, Section 5]), it may be possible to establish the sharper convergence rate \(|{g(\hat{z}^{DRO}_n(x);x) - v^*(x)}| = O_p\big (n^{-r}\big )\).
We use \(\mu _n(x)\) to avoid a clash with the notation \(\zeta _n(x)\) for the radius of ambiguity sets with the same support as \(\hat{P}^{ER}_n(x)\). Having different notation for these two radii will prove useful in our unified analysis of the corresponding ER-DRO problems in Sect. 5.
We do not include results for Algorithm 3 with \(n = 20\) because it requires at least 30 samples (line 7 of Algorithm 3 needs at least 6 points for Lasso regression with 5-fold CV).
Once again, we only report results for Algorithm 3 when \(n \ge 30\).
References
Ban, G.Y., Gallien, J., Mersereau, A.J.: Dynamic procurement of new products with covariate information: the residual tree method. Manuf. Serv. Oper. Manag. 21(4), 798–815 (2019)
Ban, G.Y., Rudin, C.: The big data newsvendor: practical insights from machine learning. Oper. Res. 67(1), 90–108 (2018)
Bansal, M., Huang, K.L., Mehrotra, S.: Decomposition algorithms for two-stage distributionally robust mixed binary programs. SIAM J. Optim. 28(3), 2360–2383 (2018)
Bayraksan, G., Love, D.K.: Data-driven stochastic programming using phi-divergences. In: The Operations Research Revolution, pp. 1–19. INFORMS TutORials in Operations Research (2015)
Bazier-Matte, T., Delage, E.: Generalization bounds for regularized portfolio selection with market side information. INFOR: Information Systems and Operational Research, pp. 1–28 (2020)
Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., Rennen, G.: Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 59(2), 341–357 (2013)
Bertsimas, D., Gupta, V., Kallus, N.: Robust sample average approximation. Math. Program. 171(1–2), 217–282 (2018)
Bertsimas, D., Kallus, N.: From predictive to prescriptive analytics. Manag. Sci. 66(3), 1025–1044 (2020)
Bertsimas, D., McCord, C., Sturt, B.: Dynamic optimization with side information. Eur. J. Oper. Res. 304(2), 634–651 (2023)
Bertsimas, D., Shtern, S., Sturt, B.: A data-driven approach to multistage stochastic linear optimization. Manag. Sci. (2022). https://doi.org/10.1287/mnsc.2022.4352
Bertsimas, D., Shtern, S., Sturt, B.: Two-stage sample robust optimization. Oper. Res. 70(1), 624–640 (2022)
Bertsimas, D., Van Parys, B.: Bootstrap robust prescriptive analytics. Math. Program. 195(1), 39–78 (2022)
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)
Blanchet, J., Kang, Y., Murthy, K.: Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 56(3), 830–857 (2019)
Blanchet, J., Murthy, K., Nguyen, V.A.: Statistical analysis of Wasserstein distributionally robust estimators. In: Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, pp. 227–254. INFORMS (2021)
Blanchet, J., Murthy, K., Si, N.: Confidence regions in Wasserstein distributionally robust estimation. Biometrika 109(2), 295–315 (2022)
Boskos, D., Cortés, J., Martínez, S.: Data-driven ambiguity sets for linear systems under disturbances and noisy observations. In: 2020 American Control Conference (ACC), pp. 4491–4496. IEEE (2020)
Boskos, D., Cortés, J., Martínez, S.: Data-driven ambiguity sets with probabilistic guarantees for dynamic processes. IEEE Trans. Autom. Control 66(7), 2991–3006 (2020)
Boskos, D., Cortés, J., Martínez, S.: High-confidence data-driven ambiguity sets for time-varying linear systems. arXiv preprint arXiv:2102.01142 (2021)
Deng, Y., Sen, S.: Predictive stochastic programming. CMS 19(1), 65–98 (2022)
Dou, X., Anitescu, M.: Distributionally robust optimization with correlated data from vector autoregressive processes. Oper. Res. Lett. 47(4), 294–299 (2019)
Duchi, J.C., Glynn, P.W., Namkoong, H.: Statistics of robust optimization: a generalized empirical likelihood approach. Math. Oper. Res. 46(3), 946–969 (2021)
Dunning, I., Huchette, J., Lubin, M.: JuMP: a modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017)
Esfahani, P.M., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171(1–2), 115–166 (2018)
Esteban-Pérez, A., Morales, J.M.: Distributionally robust stochastic programs with side information based on trimmings. Math. Program. 195(1), 1069–1105 (2022)
Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 162(3–4), 707–738 (2015)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
Gao, R.: Finite-sample guarantees for Wasserstein distributionally robust optimization: breaking the curse of dimensionality. Oper. Res. (2022). https://doi.org/10.1287/opre.2022.2326
Gao, R., Chen, X., Kleywegt, A.J.: Wasserstein distributionally robust optimization and variation regularization. Oper. Res. (2022). https://doi.org/10.1287/opre.2022.2383
Gao, R., Kleywegt, A.: Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res. (2022). https://doi.org/10.1287/moor.2022.1275
Gotoh, J.Y., Kim, M.J., Lim, A.E.: Calibration of distributionally robust empirical optimization models. Oper. Res. 69(5), 1630–1650 (2021)
Hanasusanto, G.A., Kuhn, D.: Robust data-driven dynamic programming. In: Advances in Neural Information Processing Systems, pp. 827–835 (2013)
Hanasusanto, G.A., Kuhn, D.: Conic programming reformulations of two-stage distributionally robust linear programs over Wasserstein balls. Oper. Res. 66(3), 849–869 (2018)
Homem-de-Mello, T., Bayraksan, G.: Monte Carlo sampling-based methods for stochastic optimization. Surv. Oper. Res. Manag. Sci. 19(1), 56–85 (2014)
Kannan, R., Bayraksan, G., Luedtke, J.: Heteroscedasticity-aware residuals-based contextual stochastic optimization. arXiv preprint arXiv:2101.03139, pp. 1–15 (2021)
Kannan, R., Bayraksan, G., Luedtke, J.: Data-driven sample average approximation with covariate information. arXiv preprint arXiv:2207.13554v1, pp. 1–57 (2022)
Kuhn, D., Esfahani, P.M., Nguyen, V.A., Shafieezadeh-Abadeh, S.: Wasserstein distributionally robust optimization: theory and applications in machine learning. In: Operations Research & Management Science in the Age of Analytics, pp. 130–166. INFORMS (2019)
Lam, H.: Robust sensitivity analysis for stochastic systems. Math. Oper. Res. 41(4), 1248–1275 (2016)
Lam, H.: Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Oper. Res. 67(4), 1090–1105 (2019)
Lewandowski, D., Kurowicka, D., Joe, H.: Generating random correlation matrices based on vines and extended onion method. J. Multivar. Anal. 100(9), 1989–2001 (2009)
Mak, W.K., Morton, D.P., Wood, R.K.: Monte Carlo bounding techniques for determining solution quality in stochastic programs. Oper. Res. Lett. 24(1–2), 47–56 (1999)
Nguyen, V.A., Zhang, F., Blanchet, J., Delage, E., Ye, Y.: Distributionally robust local non-parametric conditional estimation. Adv. Neural Inf. Process. Syst. 33, 15232–15242 (2020)
Pflug, G., Wozabal, D.: Ambiguity in portfolio selection. Quant. Finance 7(4), 435–442 (2007)
Rahimian, H., Mehrotra, S.: Frameworks and results in distributionally robust optimization. Open J. Math. Optim. 3, 1–85 (2022)
Rigollet, P., Hütter, J.C.: High dimensional statistics. Lecture notes for MIT’s 18.657 course (2017). http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf
Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)
Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)
Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on stochastic programming: modeling and theory. SIAM (2009)
Trillos, N.G., Slepčev, D.: On the rate of convergence of empirical measures in \(\infty \)-transportation distance. Can. J. Math. 67(6), 1358–1383 (2015)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: with Applications to Statistics. Springer, Berlin (1996)
Villani, C.: Optimal Transport: Old and New, vol. 338. Springer Science & Business Media, Berlin (2008)
Xie, W.: Tractable reformulations of two-stage distributionally robust linear programs over the type-\(\infty \) Wasserstein ball. Oper. Res. Lett. 48(4), 513–523 (2020)
Xu, H., Caramanis, C., Mannor, S.: A distributional interpretation of robust optimization. Math. Oper. Res. 37(1), 95–110 (2012)
Acknowledgements
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract No. DE-AC02-06CH11357. R.K. also gratefully acknowledges the support of the U.S. Department of Energy through the LANL/LDRD Program and the Center for Nonlinear Studies. This research was performed using the computing resources of the UW-Madison CHTC in the CS Department. The CHTC is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation. The authors thank the three anonymous reviewers for suggestions that helped improve the readability of this paper. R.K. also thanks Nam Ho-Nguyen for helpful discussions.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing interests that could have influenced the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Omitted proofs
1.1 Proof of Proposition 1
We begin with the following useful results.
Lemma 22
Let \(a_1, a_2, \dots , a_{d}\) be positive constants with \(a_j \ge 1\), \(\forall j \in [d]\). Then, we have
Proof
Let \(F(\theta ):= \bigl (\sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i\bigr )^{2/\theta }\) and \(G(\theta ):= \log (F(\theta )) = \frac{2}{\theta } \log \left( \sum _{i=1}^{d} a^{1+ \frac{\theta }{2}}_i \right) \). The stated result holds if F (or equivalently, G) is nonincreasing on \(\theta \in [1,2]\). We have
where the nonnegative weight \(w_i:= \bigl (a^{1+ \theta /2}_i\bigr ) \bigl (\sum _{j=1}^{d} a^{1+ \theta /2}_j\bigr )^{-1}\). Note that \(w_i \in (0,1)\) and \(\sum _{i=1}^{d} w_i = 1\). For \(\theta \in [1,2]\), the above expression implies that \(G'(\theta ) \le 0\) whenever \(\prod _{i=1}^{d} (a_i)^{w_i} \le \sum _{i=1}^{d} a^{1+ \theta /2}_i\). We have
where the first inequality follows from the weighted AM-GM inequality, the second inequality follows from the fact that \(0< w_i < 1\), \(\forall i \in [d]\), and the final inequality follows from the assumption that \(a_i \ge 1\), \(\forall i \in [d]\). Consequently, \(G'(\theta ) \le 0\), \(\forall \theta \in [1,2]\), which implies F is nonincreasing on \(\theta \in [1,2]\). \(\square \)
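As a numerical sanity check of Lemma 22 (a sketch, not part of the proof), the following verifies on random instances that \(F(\theta )\) is nonincreasing on [1, 2] whenever every \(a_i \ge 1\):

```python
import random

def F(a, theta):
    # F(theta) = (sum_i a_i^(1 + theta/2))^(2/theta), as in the proof of Lemma 22
    return sum(ai ** (1 + theta / 2) for ai in a) ** (2 / theta)

random.seed(0)
for _ in range(100):
    d = random.randint(1, 10)
    a = [1 + 4 * random.random() for _ in range(d)]  # constants with a_i >= 1
    thetas = [1 + 0.01 * k for k in range(101)]      # grid on [1, 2]
    vals = [F(a, t) for t in thetas]
    # F should be nonincreasing on [1, 2] (up to floating-point slack)
    assert all(vals[k + 1] <= vals[k] + 1e-9 for k in range(100))
```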
Lemma 23
Let W be a sub-Gaussian random variable with variance proxy \(\sigma ^2_w\). Then
Proof
We have
Lemma 1.4 of [45] implies that
where \(e:= \exp (1)\). Therefore, inequality (14), the Fubini-Tonelli theorem, and the ratio test imply the stated result whenever \(\lim _{j \rightarrow \infty } t_{j+1}/t_j < 1\), where \(t_j:= \bigl ( \sigma ^2_w e^{\frac{2}{e}} j \bigr )^{(1+\frac{\theta }{2})\frac{j}{2}}(j!)^{-1}\). Let \(C:= \sigma ^2_w e^{\frac{2}{e}}\). We have
where we use the fact \(\theta \in (1,2)\) in the last step. \(\square \)
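The ratio test in the proof can also be checked numerically. The sketch below evaluates \(t_{j+1}/t_j\) for \(t_j = (Cj)^{(1+\frac{\theta }{2})\frac{j}{2}}(j!)^{-1}\) at illustrative (hypothetical) values \(\theta = 1.1\) and \(\sigma ^2_w = 0.5\), working in log space via `math.lgamma` to avoid overflow; the ratios decrease toward zero, confirming convergence of the series:

```python
import math

def log_t(j, theta, C):
    # log of t_j = (C * j)^((1 + theta/2) * j / 2) / j!
    return (1 + theta / 2) * (j / 2) * math.log(C * j) - math.lgamma(j + 1)

theta = 1.1                     # hypothetical theta in (1, 2)
C = 0.5 * math.exp(2 / math.e)  # C = sigma_w^2 * e^(2/e), with sigma_w^2 = 0.5
ratios = [math.exp(log_t(j + 1, theta, C) - log_t(j, theta, C))
          for j in (100, 1_000, 10_000, 100_000)]
assert all(r2 < r1 for r1, r2 in zip(ratios, ratios[1:]))  # ratios decrease
assert ratios[-1] < 1  # limiting ratio below 1, so sum_j t_j converges
```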
We are now ready to prove Proposition 1. To establish \(\mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a)}\right] < +\infty \), it suffices to show that \(\mathbb {E}\left[ {\exp (\Vert {\varepsilon }\Vert ^a) \mathbbm {1}_{\{\Vert {\varepsilon }\Vert _{\infty } \ge 1\}}}\right] < +\infty \), where \(\mathbbm {1}_{\{\Vert {\varepsilon }\Vert _{\infty } \ge 1\}} = 1\) if \(\Vert {\varepsilon }\Vert _{\infty } \ge 1\) and 0 otherwise. Lemma 22 implies that
Independence of the components of \(\varepsilon \) further implies
Consequently, it suffices to show that \(\mathbb {E}\Bigl [\exp \bigl ( |{\varepsilon _i}|^{1 + \frac{a}{2}}\bigr ) \mathbbm {1}_{\{|{\varepsilon _i}| \ge 1\}}\Bigr ] < +\infty \) for each \(i \in [d_y]\), which follows from Lemma 23. \(\square \)
1.2 Proof of Lemma 8
We require the following result (cf. [24, Lemma 3.7]).
Lemma 24
Suppose Assumptions 1, 2, and 6 hold and the samples \(\{\varepsilon ^{i}\}_{i=1}^{n}\) are i.i.d. Let \(\{Q_n(x)\}\) be a sequence of distributions with \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\). Then
Consequently, we a.s. have for n large enough:
Furthermore, \(\{Q_n(x)\}\) a.s. converges to \(P_{Y \mid X = x}\) under the Wasserstein metric.
Proof
Mirrors the proof of [24, Lemma 3.7] on account of Theorem 7. \(\square \)
We now prove Lemma 8. From Theorem 7, we have for a.e. \(x \in \mathcal {X}\) that
Since \(\sum _n \alpha _n < +\infty \), the Borel-Cantelli lemma implies a.s. that for all n large enough
From Lemma 24, we a.s. have for any distribution \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) and n large enough that \(d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x)\) for a.e. \(x \in \mathcal {X}\).
Suppose Assumption 4 holds. Using the fact that \(\hat{\mathcal {P}}_n(x;\zeta _n (\alpha _n,x)) \subseteq \bar{\mathcal {P}}_{1,n}(x;\zeta _n(\alpha _n,x))\) for all orders \(p \in [1,+\infty )\), we a.s. have for n large enough and for a.e. \(x \in \mathcal {X}\) that
where the second inequality follows from Assumption 4 and the Kantorovich-Rubinstein theorem (cf. [37, Theorem 5]).
Suppose instead that Assumption 5 holds and \(p \in [2,+\infty )\). Since \(\hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x)) \subseteq \bar{\mathcal {P}}_{2,n}(x;\zeta _n(\alpha _n,x))\) for all \(p \in [2,+\infty )\), we a.s. have for n large enough and a.e. \(x \in \mathcal {X}\) that
where the latter inequality follows from Assumption 5 and [28, Lemma 2] (see also [29]). \(\square \)
1.3 Proof of Theorem 9
From Theorem 7, we have
From Lemma 24, we a.s. have \(\lim _{n \rightarrow \infty } d_{W,p}(P_{Y \mid X=x},Q_n(x)) = 0\) for a.e. \(x \in \mathcal {X}\) for any \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\). Theorem 6.9 of [51] then a.s. implies that \(Q_n(x)\) converges weakly to \(P_{Y \mid X = x}\) in the space of distributions with finite pth moments for a.e. \(x \in \mathcal {X}\).
Lemma 8 implies a.s. that for all n large enough
Therefore, to prove \(\lim _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) = v^*(x) = \lim _{n \rightarrow \infty } g(\hat{z}^{DRO}_n(x);x)\) in probability (or a.s.) for a.e. \(x \in \mathcal {X}\), it suffices to show that \(\limsup _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) \le v^*(x)\) a.s. for a.e. \(x \in \mathcal {X}\).
Fix \(\eta > 0\). For a.e. \(x \in \mathcal {X}\), let \(z^*(x) \in S^*(x)\) be an optimal solution to the true problem (3), and \(Q^{*}_{n}(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) be such that
We suppress the dependence of \(Q^{*}_{n}(x)\) on \(\eta \) for simplicity. We a.s. have for a.e. \(x \in \mathcal {X}\)
The first equality above follows from the fact that \(Q^*_n(x)\) converges weakly to \(P_{Y \mid X = x}\) (as noted above) and by Definition 6.8 of [51] (which holds by virtue of Assumption 3). Since \(\eta > 0\) was arbitrary, we conclude that \(\limsup _{n \rightarrow \infty } \hat{v}^{DRO}_n(x) \le v^*(x)\) a.s. for a.e. \(x \in \mathcal {X}\).
Finally, we show that any accumulation point of \(\{\hat{z}^{DRO}_n(x)\}\) is almost surely an element of \(S^*(x)\) for a.e. \(x \in \mathcal {X}\), and argue that this implies \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x)) \xrightarrow {a.s.} 0\) for a.e. \(x \in \mathcal {X}\). From (15) and the above conclusion, we a.s. have
Let \(\bar{z}(x)\) be an accumulation point of \(\hat{z}^{DRO}_n(x)\) for a.e. \(x \in \mathcal {X}\). Assume by moving to a subsequence if necessary that \(\lim _{n \rightarrow \infty } \hat{z}^{DRO}_n(x) = \bar{z}(x)\). We a.s. have for a.e. \(x \in \mathcal {X}\)
where the second inequality follows from the lower semicontinuity of \(c(\cdot ,Y)\) on \(\mathcal {Z}\) for each \(Y \in \mathcal {Y}\) and the third inequality follows from Fatou’s lemma by virtue of Assumption 3. Consequently, we a.s. have that \(\bar{z}(x) \in S^*(x)\).
Suppose by contradiction that \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not a.s. converge to zero for a.e. \(x \in \mathcal {X}\). Then, there exists \(\bar{\mathcal {X}} \subseteq \mathcal {X}\) with \(P_X(\bar{\mathcal {X}}) > 0\) such that for each \(x \in \bar{\mathcal {X}}\), \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not a.s. converge to zero. Since \(\mathcal {Z}\) is compact, any sequence of estimators \(\{\hat{z}^{DRO}_n(x)\}\) has a convergent subsequence for each \(x \in \bar{\mathcal {X}}\). Therefore, whenever \(\text {dist}(\hat{z}^{DRO}_n(x),S^*(x))\) does not converge to zero for some \(x \in \bar{\mathcal {X}}\) and a realization of the data \(\mathcal {D}_n\), there exists an accumulation point of the sequence \(\{\hat{z}^{DRO}_n(x)\}\) that is not a solution to problem (3). This contradicts the fact that every accumulation point of \(\{\hat{z}^{DRO}_n(x)\}\) is almost surely a solution to problem (3) for a.e. \(x \in \mathcal {X}\). \(\square \)
1.4 Proof of Theorem 10
Lemma 8 implies that inequality (15) a.s. holds for all n large enough. From Lemma 24, we a.s. have for any distribution \(Q_n(x) \in \hat{\mathcal {P}}_n(x;\zeta _n(\alpha _n,x))\) and sample size n large enough that \(d_{W,p}(P_{Y \mid X=x},Q_n(x)) \le 2\zeta _n(\alpha _n,x)\) for a.e. \(x \in \mathcal {X}\).
If Assumption 4 holds, then the desired result follows from inequality (15) and part A of Lemma 8. On the other hand, if Assumption 5 holds and \(p \ge 2\), then the desired result follows from inequality (15) and part B of Lemma 8. \(\square \)
1.5 Proof of Lemma 12
Theorem 7 implies \(g(\hat{z}^{DRO}_n(x);x) \le \hat{v}^{DRO}_n(x)\) with probability at least \(1-\alpha \) when \(\zeta _n(\alpha ,x)\) is chosen according to equation (10). Lemma 5 then yields
Suppose Assumption 4 holds. Following the proof of part A of Lemma 8 (see Lemma 24), we have for any \(z^*(x) \in S^*(x)\)
Therefore, if we choose \(\alpha \in (0,1)\) so that \(2L_1(z^*(x))\zeta _n(\alpha ,x) \le \kappa \), we have
Equation (10) implies that \(2L_1(z^*(x))\kappa ^{(2)}_{p,n}(\alpha ) \le \kappa /2\) whenever the risk level
with \(\theta \) equal to \(\min \{p/d_y, 1/2\}\) or p/a. Assumption 8 implies that we can choose the constant \(\kappa ^{(1)}_{p,n}(\alpha ,x)\) in equation (10) such that for a.e. \(x \in \mathcal {X}\), \(2L_1(z^*(x))\kappa ^{(1)}_{p,n}(\alpha ,x) \le \kappa /2\) whenever
The above bounds imply the existence of constants \(\tilde{\varGamma }(\kappa ,x), \tilde{\gamma }(\kappa ,x) > 0\) such that the risk level \(\alpha = \tilde{\varGamma }(\kappa ,x) \exp (-n\tilde{\gamma }(\kappa ,x))\) satisfies \(2L_1(z^*(x))\zeta _n(\alpha ,x) \le \kappa \). Consequently, (11) holds.
Next, suppose instead that Assumption 5 holds. Following the proof of part B of Lemma 8 (see Lemma 24), we have for any \(z^*(x) \in S^*(x)\)
Therefore, if we pick \(\alpha \in (0,1)\) so that
then \(\mathbb {P}\left\{ {g(\hat{z}^{DRO}_n(x);x) > v^*(x) + \kappa }\right\} \le 2\alpha \). Similar to the analysis above, positive constants \(\tilde{\varGamma }(\kappa ,x)\) and \(\tilde{\gamma }(\kappa ,x)\) and inequality (11) can be obtained by bounding the smallest value of \(\alpha \) using Assumption 8 and equation (10) so that
1.6 Proof of Lemma 14
By first adding and subtracting \(g^*_n(z;x)\), defined in problem (4), and then doing the same with \(g^*_{s,n}(z;x)\), we obtain
Consider the first term on the r.h.s. of (16). We have for each \(x \in \mathcal {X}\)
where the first step follows from Assumption 9, the second step follows from the definition of the set \(\hat{\mathcal {Y}}^i_n(x;\mu _n(x))\), the triangle inequality, and inequality (6), and the fourth step follows by applying the Cauchy-Schwarz inequality twice.
Next, consider the second term on the r.h.s. of (16). For each \(x \in \mathcal {X}\)
where the inequality follows by Cauchy-Schwarz. \(\square \)
1.7 Proof of Theorem 18
Since
we look to establish uniform rates of convergence of \(\hat{g}^{ER}_{s,n}(\cdot ;X)\) to \(g(\cdot ;X)\) with respect to the \(L^q\)-norm on \(\mathcal {X}\). From (16) and the triangle inequality, we have
We bound the terms on the r.h.s. of (18) using Lemma 14. Assumptions 9, 16, and 17 and \(\Vert {\mu _{n}(X)}\Vert _{L^q} = \bar{C}_{\mu } n^{-r/2}\) imply the first term on the r.h.s. of inequality (12) satisfies:
Assumptions 16 and 18 imply that the second term on the r.h.s. of inequality (12) satisfies
Finally, Assumption 14 implies
Putting the above three inequalities together into inequality (18), we obtain
The first part of the stated result then follows from (17). The second part of the stated result follows from (17) and the fact that
Ambiguity sets satisfying Assumption 12
In Sect. 5, we outlined conditions under which phi-divergence ambiguity sets \(\mathfrak {P}_n(x;\zeta _{n}(x))\) satisfy Assumption 12 for a suitable choice of the radius \(\zeta _{n}(x)\). Lemma 25 below determines sharp bounds on the radius \(\zeta _{n}(x)\) for some other families of ambiguity sets to satisfy Assumption 12. Before presenting the lemma, we introduce a third example of the ambiguity set \(\mathfrak {P}_n(x;\zeta _{n}(x))\) to add to Examples 1 and 2 in Sect. 3.
Example 3
Mean-upper-semideviation-based ambiguity sets [48]: given order \(a \in [1,+\infty )\) and radius \(\zeta _n(x) \ge 0\), let \(b:= a/(a-1)\) and define \(\hat{\mathcal {P}}_n(x)\) using
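As an illustration (a sketch under assumptions, not the paper's code), the reparameterization used in the proof of Lemma 25(c), \(p_i = \frac{1}{n}\bigl [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \bigr ]\) with \(q \ge 0\) and \(\Vert {q}\Vert _b \le \zeta _n(x)\), always yields weights that sum to one, and nonnegative ones when \(\zeta _n(x) \le n^{1/b}\):

```python
import random

def weights_from_q(q):
    # p_i = (1/n) * (1 + q_i - qbar): the reparameterization from Lemma 25(c)
    n = len(q)
    qbar = sum(q) / n
    return [(1 + qi - qbar) / n for qi in q]

random.seed(1)
n, b, zeta = 8, 2.0, 0.3          # hypothetical sizes; b = a/(a-1)
q = [random.random() for _ in range(n)]
scale = zeta / sum(qi ** b for qi in q) ** (1 / b)
q = [qi * scale for qi in q]      # rescale so that ||q||_b = zeta
p = weights_from_q(q)
assert abs(sum(p) - 1) < 1e-12    # weights always sum to one
assert all(pi >= 0 for pi in p)   # nonnegative: qbar <= zeta * n^(-1/b) <= 1
```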
Lemma 25
The following ambiguity sets satisfy Assumption 12 with constant \(\rho \in (1,2]\):
-
(a)
CVaR-based ambiguity sets (see Example 1) with radius \(\zeta _{n}(x) = {C^{\zeta }_1 n^{1-\rho }}\),
-
(b)
Variation distance-based ambiguity sets (see Example 2) with radius \(\zeta _{n}(x) = {C^{\zeta }_2 n^{-\rho /2}}\),
-
(c)
Mean-upper-semideviation-based ambiguity sets of order \(a \in [1,+\infty )\) (see Example 3) with radius \(\zeta _{n}(x) = {\left\{ \begin{array}{ll} {C^{\zeta }_3 n^{1 - \rho /2}} &{}\text{ if } a \ge 2 \\ {C^{\zeta }_4 n^{3/2 - 1/a - \rho /2}} &{} \text{ if } a < 2 \end{array}\right. }\),
for some constants \(C^{\zeta }_i > 0\), \(i \in [4]\). Furthermore, these bounds are sharp in the sense described above.
Proof
-
(a)
Assume that \(\zeta _{n}(x) < 0.5\). We begin by noting that there exists an optimal solution to the problem \(\sup _{p \in \mathfrak {P}_n(x;\zeta _{n}(x))} \sum _{i=1}^{n} \big (p_i - \frac{1}{n}\big )^2\) that is an extreme point of the polytope \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Furthermore, every extreme point of \(\mathfrak {P}_n(x;\zeta _{n}(x))\) satisfies at least \(n-1\) of the set of 2n inequalities \(\Bigl \{p_i \ge 0, \, i \in [n], p_i \le \frac{1}{n(1-\zeta _{n}(x))}, \, i \in [n]\Bigr \}\), with equality. This implies that there exists an optimal solution at which at least \(n-1\) of the \(p_i\)s either take the value zero, or take the value \(\frac{1}{n(1-\zeta _{n}(x))}\). At this solution, \(n-1\) of the terms \(\big (p_i - \frac{1}{n}\big )^2\) are either \(\frac{1}{n^2}\) or \(\frac{1}{n^2} \bigl (\frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\bigr )^2\) (with \(\frac{1}{n^2}\) larger since \(\zeta _{n}(x) < 0.5\) by assumption).
Suppose \(M \in \{0,\dots ,n-1\}\) of the inequalities \(p_i \ge 0\), \(i \in [n]\), are satisfied with equality at such an optimal solution. Since \(\sum _{i=1}^{n} p_i = 1\) and \(p_i \le \frac{1}{n(1-\zeta _{n}(x))}\), \(\forall i \in [n]\), we require \((n-M)\frac{1}{n(1-\zeta _{n}(x))} \ge 1\), which implies \(M \le n\zeta _{n}(x)\). Consequently, \(M \le n\zeta _{n}(x) < n/2\) of the inequalities \(p_i \ge 0\), \(i \in [n]\), are satisfied with equality and at least \((n-1-M) \ge n(1-\zeta _{n}(x)) - 1 > n/2 - 1\) of the inequalities \(p_i \le \frac{1}{n(1-\zeta _{n}(x))}\), \(i \in [n]\), are satisfied with equality. Therefore, whenever \(\zeta _n(x) < 0.5\), we have:
$$\begin{aligned} \sum _{i=1}^{n} \bigg (p_i - \frac{1}{n}\bigg )^2&\le (n\zeta _{n}(x)+1)\frac{1}{n^2} + n(1-\zeta _{n}(x))\frac{1}{n^2} \left( \frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\right) ^2 \\&= \frac{1}{n^2} + \frac{1}{n}\left( \frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\right) . \end{aligned}$$Because the above analysis is constructive, it can be immediately used to deduce that the bound on \(\zeta _n(x)\) is sharp.
-
(b)
The stated result follows from the fact that
$$\begin{aligned} \sum _{i=1}^{n} \left( p_i - \frac{1}{n}\right) ^2 \le \left( \sum _{i=1}^{n} |{p_i - \frac{1}{n}}|\right) ^2 \le \zeta ^2_{n}(x), \quad \forall p \in \mathfrak {P}_n(x;\zeta _{n}(x)), \, x \in \mathcal {X}. \end{aligned}$$To see that the above bound is sharp, assume without loss of generality that \(n \ge 2\) and \(\zeta _{n}(x) \le 1\). Then, because
$$\begin{aligned} \left( \frac{1}{n} + \frac{\zeta _{n}(x)}{2}, \underbrace{\frac{1}{n} - \frac{\zeta _{n}(x)}{2n-2},\dots ,\frac{1}{n} - \frac{\zeta _{n}(x)}{2n-2}}_{n-1 \text { terms}}\right) \in \mathfrak {P}_n(x;\zeta _{n}(x)), \end{aligned}$$we have
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \ge \frac{\zeta ^2_n(x)}{4} + \frac{\zeta ^2_n(x)}{4(n-1)}. \end{aligned}$$ -
(c)
Let \(\bar{q}:= \frac{1}{n}\sum _{i=1}^{n} q_i\). We have:
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 \le \underset{q \in \mathfrak {Q}_n(x;\zeta _{n}(x))}{\sup } \frac{1}{n^2}\sum _{i=1}^{n} (q_i - \bar{q})^2, \end{aligned}$$where \(\mathfrak {Q}_n(x;\zeta _{n}(x)):= \left\{ q \in \mathbb {R}^n_+ : \Vert {q}\Vert _b \le \zeta _{n}(x) \right\} \). Note that for each \(q \in \mathfrak {Q}_n(x;\zeta _{n}(x))\), we have \(|{\bar{q}}| \le n^{-1} \Vert {q}\Vert _1 \le n^{-1/b} \Vert {q}\Vert _b\), which in turn implies
$$\begin{aligned} \Vert {q - \bar{q} \textbf{1}}\Vert _b \le \Vert {q}\Vert _b + |{\bar{q}}| \Vert {\textbf{1}}\Vert _b = \Vert {q}\Vert _b + |{\bar{q}}| n^{1/b} \le \Vert {q}\Vert _b + \Vert {q}\Vert _b \le 2\zeta _{n}(x), \end{aligned}$$where \(\textbf{1}\) is a vector of ones of appropriate dimension. Additionally, note that
$$\begin{aligned} \sum _{i=1}^{n} (q_i - \bar{q})^2 = \Vert {q - \bar{q}\textbf{1}}\Vert ^2 \le {\left\{ \begin{array}{ll} \Vert {q - \bar{q}\textbf{1}}\Vert ^2_b &{}\text{ if } b \le 2 \\ n^{1 - 2/b} \Vert {q - \bar{q}\textbf{1}}\Vert ^2_b &{} \text{ if } b > 2 \end{array}\right. }. \end{aligned}$$The desired result then follows from
$$\begin{aligned} \underset{q \in \mathfrak {Q}_n(x;\zeta _{n}(x))}{\sup } \frac{1}{n^2}\sum _{i=1}^{n} (q_i - \bar{q})^2&\le \underset{\left\{ q : \Vert {q - \bar{q}\textbf{1}}\Vert _b \le 2\zeta _{n}(x) \right\} }{\sup } \frac{1}{n^2}\Vert {q - \bar{q}\textbf{1}}\Vert ^2 \\&\le {\left\{ \begin{array}{ll} \frac{4}{n^2}\zeta ^2_{n}(x) &{}\text{ if } b \le 2 \\ \frac{4}{n^{1 + 2/b}} \zeta ^2_{n}(x) &{} \text{ if } b > 2 \end{array}\right. }. \end{aligned}$$We now show that the above bounds are sharp.
Consider first the case when \(b \le 2\) and assume without loss of generality that \(\zeta _{n}(x) = O(\sqrt{n})\). Note that \(p_i = \frac{1}{n} \big [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \big ]\), \(i \in [n]\), with \(q_1 = \zeta _{n}(x)\) and \(q_i = 0\), \(\forall i \ge 2\), is an element of \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Therefore
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 = \varTheta \left( \frac{\zeta ^2_{n}(x)}{n^2}\right) . \end{aligned}$$Next, suppose instead that \(b > 2\) and assume without loss of generality that \(\zeta _{n}(x) = O(n^{1/b})\). Note that \(p_i = \frac{1}{n} \big [ 1 + q_i - \frac{1}{n} \sum _{j=1}^{n} q_j \big ]\) with \(q_i = {\left\{ \begin{array}{ll} \bigl (\frac{2}{n}\bigr )^{1/b}\zeta _{n}(x) &{}\text{ if } i \equiv 0 \;(\bmod \; 2) \\ 0 &{} \text{ if } i \equiv 1 \;(\bmod \; 2) \end{array}\right. }\), \(i \in [n]\), is an element of \(\mathfrak {P}_n(x;\zeta _{n}(x))\). Therefore
$$\begin{aligned} \underset{p \in \mathfrak {P}_n(x;\zeta _{n}(x))}{\sup } \sum _{i=1}^{n} \Big (p_i - \frac{1}{n}\Big )^2 = \varTheta \left( \frac{\zeta ^2_{n}(x)}{n^{1+2/b}}\right) . \end{aligned}$$
\(\square \)
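As a numerical complement to parts (a) and (b) of the proof (a sketch, not part of the argument), the snippet below enumerates the extreme points described in part (a) for the CVaR-based set and checks the resulting supremum against the bound \(\frac{1}{n^2} + \frac{1}{n}\bigl (\frac{\zeta _{n}(x)}{1-\zeta _{n}(x)}\bigr )\), and verifies that the witness from part (b) is feasible and attains the value \(\frac{\zeta ^2_n(x)}{4} + \frac{\zeta ^2_n(x)}{4(n-1)}\):

```python
def cvar_sup(n, zeta):
    # Maximize sum_i (p_i - 1/n)^2 over extreme points of the CVaR-based set
    # {p >= 0, sum_i p_i = 1, p_i <= u}, with u = 1/(n(1 - zeta)), zeta < 0.5:
    # K coordinates at u, one residual r = 1 - K*u in [0, u], the rest zero.
    u, best = 1.0 / (n * (1 - zeta)), 0.0
    for K in range(n):
        r = 1 - K * u
        if 0 <= r <= u:
            val = K * (u - 1/n)**2 + (n - K - 1) * (1/n)**2 + (r - 1/n)**2
            best = max(best, val)
    return best

for n, zeta in [(10, 0.3), (50, 0.1), (200, 0.45)]:
    assert cvar_sup(n, zeta) <= 1/n**2 + (zeta / (1 - zeta)) / n + 1e-12

# Part (b): the sharpness witness for the variation-distance-based set
n, zeta = 12, 0.4
p = [1/n + zeta/2] + [1/n - zeta/(2*(n - 1))] * (n - 1)
assert abs(sum(p) - 1) < 1e-12
assert sum(abs(pi - 1/n) for pi in p) <= zeta + 1e-12       # feasible
val_b = sum((pi - 1/n)**2 for pi in p)
assert abs(val_b - (zeta**2/4 + zeta**2/(4*(n - 1)))) < 1e-12
assert val_b <= zeta**2                                      # part (b) bound
```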
Additional computational results
Algorithm 4 describes our procedure for estimating the 99% UCB on the optimality gap of our data-driven solutions using the multiple replication procedure [41]. We only use 20,000 of the \(10^5\) generated samples from the conditional distribution of Y given \(X = x\) to compute these UCBs, since they suffice to yield an accurate estimate of the optimality gaps. Unlike [36, Algorithm 1], which uses relative optimality gaps, we use absolute optimality gaps in our 99% UCB estimates to avoid division by small quantities when \(v^*(x)\) is close to zero.
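For concreteness, a hedged sketch of a multiple replication procedure in the spirit of [41] is given below; the cost function, SAA solver, and toy data are placeholders, not the paper's Algorithm 4, and a normal quantile stands in for the Student-t quantile:

```python
import math
import random
import statistics

def mrp_ucb(zhat, batches, cost, solve_saa, alpha=0.01):
    # One nonnegative optimality-gap estimate per replication (batch), then a
    # one-sided (1 - alpha) upper confidence bound on the gap of zhat.
    gaps = []
    for ys in batches:
        zg = solve_saa(ys)  # SAA solution on this batch (lower-bound problem)
        gaps.append(sum(cost(zhat, y) - cost(zg, y) for y in ys) / len(ys))
    k = len(gaps)
    se = statistics.stdev(gaps) / math.sqrt(k)
    # Normal approximation to the Student-t quantile (adequate for moderate k)
    z = statistics.NormalDist().inv_cdf(1 - alpha)
    return statistics.mean(gaps) + z * se

# Toy instance: quadratic cost, so the batch SAA solution is the batch mean
random.seed(0)
cost = lambda z, y: (z - y) ** 2
solve_saa = lambda ys: sum(ys) / len(ys)
batches = [[random.gauss(0, 1) for _ in range(500)] for _ in range(30)]
ucb = mrp_ucb(0.1, batches, cost, solve_saa)  # candidate decision zhat = 0.1
assert ucb >= 0  # each per-batch gap estimate is nonnegative by construction
```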
We compare Algorithm 2 with an “optimal covariate-independent” specification of the radius \(\zeta _n\). This optimal covariate-independent radius is determined by choosing \(\zeta _n\) such that the medians of the \(99\%\) UCBs over the 20 different covariate realizations are minimized. We also benchmark Algorithm 3 against an “optimal covariate-dependent” specification of \(\zeta _n(x)\), determined by choosing \(\zeta _n(x)\) such that the \(99\%\) UCBs are minimized. Determining these optimal covariate-independent and covariate-dependent radii is impractical because it requires 20,000 i.i.d. samples from the conditional distribution of Y given \(X = x\) (which a decision-maker does not have); we consider them only to benchmark the performance of Algorithms 2 and 3.
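The covariate-independent selection rule just described amounts to a one-line minimization over candidate radii; a minimal sketch (with hypothetical UCB values, not our experimental data) is:

```python
import statistics

def optimal_covariate_independent_radius(ucbs_by_radius):
    # ucbs_by_radius: {zeta: [99% UCB for each covariate realization x]}
    # Pick the single radius whose median UCB across realizations is smallest.
    return min(ucbs_by_radius,
               key=lambda z: statistics.median(ucbs_by_radius[z]))

# Hypothetical UCB estimates over 20 covariate realizations per candidate radius
ucbs = {0.0: [1.0] * 20, 0.1: [0.4] * 20, 0.5: [0.7] * 20}
assert optimal_covariate_independent_radius(ucbs) == 0.1
```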
“Optimal” tuning of the Wasserstein radius. Figure 6 compares the performance of the W+OLS formulations when the radius \(\zeta _n(x)\) of the ambiguity set is determined using Algorithms 2 and 3 and using optimal covariate-dependent and covariate-independent tuning. We vary the model degree \(\theta \), the covariate dimension \(d_x \in \{3,10,100\}\), and the sample size \(n \in \{5(d_x + 1),10(d_x + 1),20(d_x + 1),50(d_x + 1)\}\) in these experiments. The radius specified by Algorithm 2 performs better than that specified by Algorithm 3 for smaller sample sizes and covariate dimensions, and the converse holds for larger covariate dimensions and sample sizes. These results indicate that while covariate-dependent tuning can in theory yield better results than the covariate-independent tuning of Algorithm 2, Algorithm 3 only obtains good estimates of the optimal covariate-dependent radius \(\zeta _n(x)\) for relatively large sample sizes n. The gap between the performance of Algorithm 2 and the optimal covariate-independent tuning shrinks with increasing sample size and covariate dimension, except when \(\theta = 2\). The gap between the performance of Algorithm 3 and the optimal covariate-dependent tuning likewise shrinks with increasing covariate dimension and sample size.
Comparison of the radii specified by Algorithms 1, 2, and 3. Figure 7 compares the radii specified by Algorithms 1, 2, and 3 with the optimal covariate-dependent radius and optimal covariate-independent radius for the W+OLS formulation. We consider \(d_x = 100\), vary the model degree \(\theta \), and vary the sample size among \(n \in \{5(d_x + 1),10(d_x + 1),20(d_x + 1),50(d_x + 1)\}\) in these experiments. Note that the y-axis limits differ across the subplots. First, the radius specified by Algorithm 1 shrinks very quickly to zero for all three values of \(\theta \). Consequently, we note from Fig. 2 that the resulting ER-DRO estimators typically do not perform as well as the estimators obtained when the radius \(\zeta _n\) is specified using Algorithms 2 and 3. Second, the covariate-independent specifications of the radius result in narrower distributions overall than the covariate-dependent specifications. This may be because the covariate-independent specifications must choose a single value of \(\zeta _n(x)\) for all possible covariate realizations \(x \in \mathcal {X}\), whereas the covariate-dependent specifications can choose a different value of \(\zeta _n(x)\) depending on the realization \(x \in \mathcal {X}\). Third, the distribution of the radius determined using Algorithm 3 appears to converge to the distribution of the optimal covariate-dependent radius as the sample size increases. Similarly, the distribution of the radius determined using Algorithm 2 also appears to converge to the distribution of the optimal covariate-independent radius as n increases (except when \(\theta = 2\)). Finally, as noted in Sect. 6, it may be advantageous to use a positive radius for the ambiguity set when the prediction model is misspecified (e.g., using OLS regression even when \(\theta \ne 1\)).
This is corroborated by the plots for \(\theta = 2\), where the distribution of the optimal covariate-dependent radius remains far from zero even for large sample sizes n. See Fig. 5.
Comparison with the jackknife-based formulations. Figure 8 compares the performance of the ER-SAA+OLS approach and the jackknife-based SAA (J-SAA+OLS) approach [36] with the W+OLS formulations when the radius \(\zeta _n(x)\) is specified using Algorithms 2 and 3. We consider \(d_x = 100\), vary the model degree \(\theta \), and vary the sample size among \(n \in \{3(d_x + 1),5(d_x + 1),10(d_x + 1),20(d_x + 1)\}\) in these experiments. As observed in [36], the J-SAA+OLS formulation performs better than the E+OLS formulation in the small sample size regime. Figure 8 shows that the W+OLS formulations outperform the J-SAA+OLS formulation (except when using Algorithm 2 for \(\theta = 2\) and large n). This is expected because the ER-DRO formulations account for both the error in approximating \(f^*\) by \(\hat{f}_n\) and the error in approximating \(P_{Y \mid X = x}\) by \(P^*_n(x)\), whereas the J-SAA+OLS formulation only addresses the bias in the residuals obtained from OLS regression (i.e., even if \(\hat{f}_n\) is an accurate estimate of \(f^*\), the jackknife formulations do not account for the fact that \(P^*_n(x)\) may be a crude approximation of \(P_{Y \mid X = x}\)). We omit the results for the J+-SAA+OLS formulation because they are similar to those for the J-SAA+OLS formulation. See Fig. 6.
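To make the contrast concrete, the snippet below sketches the difference between in-sample (ER-style) residuals and jackknife (leave-one-out) residuals for OLS on synthetic data; the data-generating model here is illustrative, not the paper's experimental setup. The leave-one-out refit inflates each residual, which is the debiasing effect the jackknife formulations exploit, but it leaves untouched the separate issue of how well the empirical residual distribution approximates \(P_{Y \mid X = x}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear model y = X @ beta + noise (illustrative only).
n, d = 30, 3
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

def ols_fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# In-sample residuals from the full-data OLS fit.
beta_hat = ols_fit(X, y)
er_residuals = y - X @ beta_hat

# Jackknife residuals: each point is scored by the model trained
# without it, which inflates the residuals (e_i / (1 - h_ii) for OLS)
# and counteracts in-sample optimism.
jk_residuals = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta_i = ols_fit(X[mask], y[mask])
    jk_residuals[i] = y[i] - X[i] @ beta_i
```

For OLS the jackknife residual dominates the in-sample residual pointwise in magnitude, since each leverage value \(h_{ii}\) lies in \([0,1)\).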
Comparison of mean optimality gaps. Figures 9 and 10 are the counterparts of Figs. 1 and 2 in Sect. 7 when the mean optimality gap is used instead of the \(99\%\) UCBs to compare the different estimators, i.e., when \(100 \, \texttt {avg}(\{\hat{G}^k(x)\})\) is used instead of the output \(\hat{B}_{99}(x)\) of Algorithm 4. As mentioned in Sect. 7, comparing mean optimality gaps instead of \(99\%\) UCBs results in very similar plots because the standard deviations \(\sqrt{\texttt {var}(\{\hat{G}^k(x)\})}\) of the gap estimates \(\{\hat{G}^k(x)\}\) are significantly lower than the gap estimates themselves—this is because we use 20,000 random samples to derive each gap estimate, which yields estimates with low variability. Finally, we note that we also plotted median optimality gaps instead of mean optimality gaps and obtained very similar plots. See Figs. 7, 8, 9 and 10.
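As a rough illustration of why the two comparisons agree, the following computes a normal-approximation one-sided \(99\%\) UCB from a set of low-variance gap estimates; the construction and the synthetic numbers are assumptions for illustration, not the exact procedure of Algorithm 4. When the standard deviation of the gap estimates is small relative to their mean, the UCB sits just above the mean gap, so the two criteria rank estimators almost identically.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical gap estimates G^k(x) over K = 25 replications; each estimate
# has low variability, mimicking the effect of the 20,000 samples per
# estimate used in the paper's experiments.
gaps = 0.05 + rng.normal(scale=0.001, size=25)

def ucb_99(gaps):
    # One-sided 99% upper confidence bound on the mean gap via the
    # normal approximation (z_{0.99} ~ 2.326).
    k = len(gaps)
    return float(np.mean(gaps) + 2.326 * np.std(gaps, ddof=1) / np.sqrt(k))

mean_gap = float(np.mean(gaps))
b99 = ucb_99(gaps)
```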
Cite this article
Kannan, R., Bayraksan, G. & Luedtke, J.R. Residuals-based distributionally robust optimization with covariate information. Math. Program. (2023). https://doi.org/10.1007/s10107-023-02014-7
Keywords
- Data-driven stochastic programming
- Distributionally robust optimization
- Wasserstein distance
- Phi-divergences
- Covariates
- Machine learning
- Convergence rate
- Large deviations