
Bootstrap robust prescriptive analytics

  • Full Length Paper
  • Series A

Mathematical Programming

Abstract

We address the problem of prescribing an optimal decision in a framework where the cost function depends on uncertain problem parameters that need to be learned from data. Earlier work proposed prescriptive formulations based on supervised machine learning methods. These prescriptive methods can factor in contextual information on a potentially large number of covariates to take context-specific actions which are superior to any static decision. When working with noisy or corrupt data, however, such nominal prescriptive methods can be prone to adverse overfitting phenomena and fail to generalize on out-of-sample data. In this paper we combine ideas from robust optimization and the statistical bootstrap to propose novel prescriptive methods which safeguard against overfitting. Indeed, we show that a particular entropic robust counterpart to such nominal formulations guarantees good performance on synthetic bootstrap data. As bootstrap data is often a sensible proxy for actual out-of-sample data, our robust counterpart can be interpreted to directly encourage good out-of-sample performance. The associated robust prescriptive methods furthermore reduce to convenient tractable convex optimization problems in the context of local learning methods such as nearest neighbors and Nadaraya–Watson learning. We illustrate our data-driven decision-making framework and our novel robustness notion on a small newsvendor problem.
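To make the nominal (non-robust) prescriptive approach described above concrete, the following minimal Python sketch computes a context-specific newsvendor order by minimizing a nearest-neighbors estimate of the conditional expected cost. It is only an illustration: the cost parameters, function names and synthetic data are our own assumptions and not taken from the paper.

import numpy as np

def knn_newsvendor_decision(X, y, x0, k=10, cb=4.0, ch=1.0):
    """Nominal prescriptive decision: minimize the k-nearest-neighbors estimate
    of the conditional expected newsvendor cost at the observed covariate x0."""
    dist = np.linalg.norm(X - x0, axis=1)              # distances of training covariates to x0
    y_local = y[np.argsort(dist)[:k]]                  # demands observed in the k-neighborhood
    candidates = np.unique(y_local)                    # some observed demand is always an optimal order
    cost = [np.mean(cb * np.maximum(y_local - z, 0) + ch * np.maximum(z - y_local, 0))
            for z in candidates]                       # estimated conditional cost of each candidate z
    return candidates[int(np.argmin(cost))]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                          # synthetic covariates
y = 10 + X[:, 0] + rng.normal(size=500)                # synthetic demands depending on the context
z_data = knn_newsvendor_decision(X, y, x0=np.zeros(3)) # context-specific order at x0 = 0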


Notes

  1. Technically, the conditional expectation \(\mathbb {E}_{D^\star \!\!}\left[ L(z, y) | x\right] \) is a random variable and uniquely defined only up to events of measure zero. That is, the inclusion in (2) can only hold almost surely; see for instance the standard text by Billingsley [7]. Slightly abusing notation, all statements in this paper involving the observation \(x_0\) should hence be interpreted to hold \(X^\star \)-almost surely, where \(X^\star \) is the distribution of the covariates x.

  2. Our notation here alludes to the fact that this estimator function mapping data to cost estimates is often, asymptotically as \(n\rightarrow \infty \), an unbiased estimate of the actual unknown expected cost \(z\mapsto \mathbb {E}_{D^\star \!\!}\left[ L( z, y)|x={x_0}\right] \). In our paper the symbol “\(\mathrm{E}\)” will be associated with an estimator (a procedure mapping data to cost estimates), while “\(\mathbb {E}\)” denotes an expectation operator which may define, for instance, the actual but unknown expected cost.

  3. Notice that this distance function does not possess the discrimination property. Indeed, when \(\left\| {\bar{x}}_i-{x_0}\right\| _2=\left\| {\bar{x}}_j-{x_0}\right\| _2\) for \({\bar{x}}_i\ne \bar{x}_j\) we have a tie. The lack of the discrimination property translates into ambiguously defined neighborhood sets. The discrimination property may be recovered by breaking ties deterministically, for instance based on the value of \({\bar{y}}\). Györfi et al. [17] propose a randomized alternative by augmenting the covariates with an independent auxiliary random variable distributed uniformly on [0, 1]. They prove that by doing so ties occur with probability zero.
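A minimal Python sketch of this randomized tie-breaking device, implemented through a secondary sort key; the function name and interface are our own illustrative choices.

import numpy as np

def knn_indices_random_tiebreak(X, x0, k, rng=np.random.default_rng(1)):
    """Indices of the k nearest covariates to x0, breaking distance ties with an
    independent auxiliary uniform random variable as proposed by Gyorfi et al. [17]."""
    dist = np.linalg.norm(X - x0, axis=1)
    aux = rng.uniform(size=len(X))            # auxiliary U[0, 1] coordinate; ties now occur w.p. 0
    return np.lexsort((aux, dist))[:k]        # sort by distance first, auxiliary coordinate second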

References

  1. Altman, N.S.: An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46(3), 175–185 (1992)

  2. Ban, G.-Y., Rudin, C.: The big data newsvendor: practical insights from machine learning. Oper. Res. 67(1), 90–108 (2019)

  3. Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., Rennen, G.: Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 59(2), 341–357 (2013)

  4. Bertsimas, D., Gupta, V., Kallus, N.: Robust sample average approximation. Mathematical Programming, pp. 1–66 (2017)

  5. Bertsimas, D., Kallus, N.: From predictive to prescriptive analytics. Manag. Sci. 66(3), 1025–1044 (2020)

  6. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)

  7. Billingsley, P.: Probability and Measure. Wiley, New York (2008)

  8. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  9. Csiszár, I.: Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probab. 12(3), 768–793 (1984)

  10. Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3), 595–612 (2010)

  11. Dembo, A., Zeitouni, O.: Large Deviations Techniques and Applications, vol. 38. Springer, New York (2009)

  12. Domahidi, A., Chu, E., Boyd, S.: ECOS: an SOCP solver for embedded systems. In: European Control Conference (ECC), pp. 3071–3076 (2013)

  13. Efron, B.: The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia (1982)

  14. Elmachtoub, A.N., Grigas, P.: Smart “predict, then optimize”. Management Science (2021)

  15. Erdoğan, E., Iyengar, G.: Ambiguous chance constrained problems and robust optimization. Math. Program. 107(1–2), 37–61 (2006)

  16. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Springer, New York (2001)

  17. Györfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer, New York (2006)

  18. Hanasusanto, G.A., Kuhn, D.: Robust data-driven dynamic programming. In: Advances in Neural Information Processing Systems, pp. 827–835 (2013)

  19. Hannah, L., Powell, W., Blei, D.M.: Nonparametric density estimation for stochastic optimization with an observable state variable. In: Advances in Neural Information Processing Systems, pp. 820–828 (2010)

  20. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)

  21. Michaud, R.O.: The Markowitz optimization enigma: is ‘optimized’ optimal? Financ. Anal. J. 45(1), 31–42 (1989)

  22. Mohajerin Esfahani, P., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171(1–2), 115–166 (2018)

  23. Nadaraya, E.A.: On estimating regression. Theory Probab. Appl. 9(1), 141–142 (1964)

  24. Pflug, G., Wozabal, D.: Ambiguity in portfolio selection. Quant. Finance 7(4), 435–442 (2007)

  25. Popescu, I.: A semidefinite programming approach to optimal-moment bounds for convex classes of distributions. Math. Oper. Res. 30(3), 632–657 (2005)

  26. Postek, K., den Hertog, D., Melenberg, B.: Computationally tractable counterparts of distributionally robust constraints on risk measures. SIAM Rev. 58(4), 603–650 (2016)

  27. Sain, S.R.: Multivariate locally adaptive density estimation. Comput. Stat. Data Anal. 39(2), 165–186 (2002)

  28. Sun, H., Xu, H.: Convergence analysis for distributionally robust optimization and equilibrium problems. Math. Oper. Res. 41(2), 377–401 (2016)

  29. Udell, M., Mohan, K., Zeng, D., Hong, J., Diamond, S., Boyd, S.: Convex optimization in Julia. In: SC14 Workshop on High Performance Technical Computing in Dynamic Languages (2014)

  30. Van Parys, B.P.G., Esfahani, P.M., Kuhn, D.: From data to decisions: Distributionally robust optimization is optimal. arXiv:1505.05116 (2017)

  31. Van Parys, B.P.G., Goulart, P.J., Kuhn, D.: Generalized Gauss inequalities via semidefinite programming. Math. Program. 156(1–2), 271–302 (2016)

  32. Walk, H.: Strong laws of large numbers and nonparametric estimation. In: Recent Developments in Applied Probability and Statistics, pp. 183–214. Springer (2010)

  33. Wang, Z., Glynn, P.W., Ye, Y.: Likelihood robust optimization for data-driven newsvendor problems. CMS 12(2), 241–261 (2016)

  34. Watson, G.S.: Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A 26(4), 359–372 (1964)

  35. Wiesemann, W., Kuhn, D., Sim, M.: Distributionally robust convex optimization. Oper. Res. 62(6), 1358–1376 (2014)

  36. Zymler, S., Kuhn, D., Rustem, B.: Distributionally robust joint chance constraints with second-order moment information. Math. Program. 137(1–2), 167–198 (2013)

Acknowledgements

The authors would like to thank the reviewers for their careful reading of this manuscript which greatly improved its overall exposition. The second author is generously supported by the Early Post.Mobility fellowship No. 165226 of the Swiss National Science Foundation.

Author information

Correspondence to Bart Van Parys.

Appendices

Uniformly consistent estimators

Nadaraya–Watson estimation can be shown to be point-wise consistent when using an appropriately scaled bandwidth parameter h(n) for any of the smoother functions listed in Fig. 4:

Theorem 7

(Walk [32]) Let the loss function satisfy \(\mathbb {E}_{D^\star \!\!}\left[ \left| L({\bar{z}}, y)\right| \cdot \max \{\log (\left| L({\bar{z}}, y)\right| ),0\}\right] < \infty \) for all \({\bar{z}}\). Let the bandwidth be \(h(n)={cn^{-\delta }}\) for some \(c>0\) and \(\delta \in (0, {1}/{\dim (x)})\). Let S be any of the smoother functions listed in Fig. 4 and let the weighting function be \(w_n(x, {x_0}) = S(\left\| x-{x_0}\right\| _2/h(n))\). Then, our estimation formulation (with \(k(n)=n\)) is asymptotically consistent for any \(D^\star \), i.e., with probability one we have

$$\begin{aligned} \lim _{n\rightarrow \infty } {\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] = \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y) | x={x_0}\right] \qquad \forall {\bar{z}}. \end{aligned}$$
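For concreteness, a minimal Python sketch of the Nadaraya–Watson cost estimate with the bandwidth scaling \(h(n)=cn^{-\delta }\) of Theorem 7, using the naive (box) smoother; the constants and names are illustrative assumptions (the default \(\delta =0.25\) presumes \(\dim (x)\le 3\)).

import numpy as np

def nadaraya_watson_cost(X, y, x0, z, loss, c=1.0, delta=0.25):
    """Nadaraya-Watson estimate of E[L(z, y) | x = x0] with bandwidth h(n) = c * n**(-delta)."""
    n = len(X)
    h = c * n ** (-delta)                                            # bandwidth h(n) = c n^{-delta}
    w = (np.linalg.norm(X - x0, axis=1) / h <= 1.0).astype(float)    # w_n(x, x0) = S(||x - x0||_2 / h)
    return np.nan if w.sum() == 0 else np.sum(w * loss(z, y)) / np.sum(w)

newsvendor_loss = lambda z, y: 4.0 * np.maximum(y - z, 0) + 1.0 * np.maximum(z - y, 0)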

Nearest neighbors estimation is also consistent under very mild technical conditions provided that the number of neighbors k(n) is scaled appropriately with the number of training data samples:

Theorem 8

(Walk [32]) Assume \(\mathrm {dist}({\bar{d}}=({\bar{x}}, {\bar{y}}), {x_0}) = \left\| {\bar{x}}-{x_0}\right\| _2\) and follow the random tie breaking rule discussed in Györfi et al. [17]. Let \(k(n)=\lceil \min \{cn^{\delta }, n\} \rceil \) for some \(c>0\) and \(\delta \in (0,1)\). Then, our estimation formulation (with \(w_n(\bar{x}, {x_0})=1\)) is asymptotically consistent for any \(D^\star \), i.e., with probability one we have

$$\begin{aligned} \lim _{n\rightarrow \infty } {\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] = \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y) | x={x_0}\right] \quad \forall {\bar{z}}. \end{aligned}$$
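Analogously, a minimal sketch of the nearest-neighbors cost estimate with the scaling \(k(n)=\lceil \min \{cn^{\delta }, n\} \rceil \) of Theorem 8; the constants and names are again illustrative assumptions.

import numpy as np

def knn_cost(X, y, x0, z, loss, c=2.0, delta=0.5):
    """Nearest-neighbors estimate of E[L(z, y) | x = x0] with k(n) = ceil(min(c * n**delta, n))."""
    n = len(X)
    k = int(np.ceil(min(c * n ** delta, n)))                    # number of neighbors k(n)
    neighbors = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    return np.mean(loss(z, y[neighbors]))                       # unit weights w_n = 1 on the neighborhood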

An estimator is said to be uniformly consistent if

$$\begin{aligned} \lim _{n\rightarrow \infty }D^{\star \infty }\left[ \max _{{\bar{z}}}\, |{\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] - \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y) | x={x_0}\right] |\le \epsilon \right] =1 \end{aligned}$$
(34)

holds for any \(\epsilon >0\) and \(D^\star \). That is, the estimator predicts the cost of all potential decisions to arbitrary accuracy when given access to a sufficiently large amount of training data. It is not difficult to see that this uniform consistency in turn implies the consistency of the budget minimization formulation. Indeed, with probability one we have

$$\begin{aligned}&\textstyle \lim _{n\rightarrow \infty }D^{\star \infty }\left[ \min _{z}\,\mathbb {E}_{D^\star \!\!}\left[ L(z, y)|x={x_0}\right] +\epsilon \ge {\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L(z^\star ({x_0}), y)|x={x_0}\right] \right] =1 \end{aligned}$$
(35)
$$\begin{aligned}&{\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L(z^\star ({x_0}), y)|x={x_0}\right] \ge {\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] \end{aligned}$$
(36)
$$\begin{aligned}&\textstyle \lim _{n\rightarrow \infty }D^{\star \infty }\left[ {\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] \ge \mathbb {E}_{D^\star \!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] -\epsilon \right] =1 \end{aligned}$$
(37)

Here, the limit (35) follows from the consistency of the estimator with regards to the full information decision \(z^\star ({x_0})\). Inequality (36) is a direct consequence of the characterization of the data-driven decision \(z_{\mathrm {data}[n]}\) as a minimizer of \({\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L(z, y)|x={x_0}\right] \). Remarking that \(\max _{{\bar{z}}}\, |{\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] - \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y) | x={x_0}\right] | \le \epsilon \implies {\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] \ge \mathbb {E}_{D^\star \!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] -\epsilon \) the limit (37) follows from the uniform consistency of the estimator stated in Eq. (34). Chaining the inequalities in results (35), (36) and (37) implies that

$$\begin{aligned} \textstyle \lim _{n\rightarrow \infty }D^{\star \infty }\left[ 0 \le \mathbb {E}_{D^\star \!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] - \min _{z}\,\mathbb {E}_{D^\star \!\!}\left[ L(z, y)|x={x_0}\right] \le 2\epsilon \right] =1. \end{aligned}$$

Here the optimality gap between the cost of the full information decision \(z^\star ({x_0})\) and the cost of the data-driven decision \(z_{\mathrm {data}[n]}\) is bounded by \(2\epsilon \). As this result holds for arbitrary \(\epsilon >0\), the data-driven formulation is asymptotically consistent.

Proofs

1.1 Proof of Theorem 1

Proof

First note that the domain of the partial estimators as a function of the distribution D satisfies

$$\begin{aligned} \mathrm{dom}~{\mathrm {E}}^{n,j}_{D}\left[ L(z, Y)|x={x_0}\right] \subseteq \mathcal D_{n}^j. \end{aligned}$$

For a distribution D to be in the domain of the partial estimator the constraints in Eq. (22) must indeed be feasible. In other words, there must exist some P and \(s>0\) for which \(s\cdot D[{\bar{x}}, {\bar{y}}] = P[{\bar{x}}, {\bar{y}}]\) for all \(({\bar{x}}, {\bar{y}})\in \Omega _n\). The last two constraints in Eq. (22) then imply that any such \(D\in \mathcal D_n\) must also be in \(\mathcal D^j_{n}\).

Any \(D\in \mathcal D_{n,n}\) is the empirical distribution of some bootstrap data set, say \(\mathrm {bs}[n]\), consisting of n observations from the training data set. Each set \(\mathcal D_{n,n}^j:=\mathcal D_n^j\cap \mathcal D_{n,n}\) has a very natural interpretation in terms of \(N^j_n({x_0})\) being the smallest neighborhood containing at least k observations of this associated bootstrap data set \(\mathrm {bs}[n]\). Indeed, in terms of the associated data set, \(D\in \mathcal D_{n,n}^j\) is equivalent to

$$\begin{aligned} \begin{array}{llll} k &{} \le &{} \sum _{({\bar{x}}, {\bar{y}})\in \mathrm {bs}[n]} \mathbb {1} \{({\bar{x}}, {\bar{y}})\in N^{j}_{n}({x_0})\} &{}= n \cdot \sum _{({\bar{x}}, {\bar{y}})\in N^j_n({x_0})} D[{\bar{x}}, {\bar{y}}], \\ k &{} > &{} \sum _{({\bar{x}}, {\bar{y}})\in \mathrm {bs}[n]}\mathbb {1}\{({\bar{x}}, {\bar{y}})\in N^{j\!-\!1}_{n}\! ({x_0})\} &{}= n\cdot \sum _{({\bar{x}}, {\bar{y}})\in N^{j\!-\!1}_{n}\! ({x_0})} D[{\bar{x}}, {\bar{y}}]. \end{array} \end{aligned}$$

The first inequality implies that the neighborhood \(N^j_n({x_0})\) contains at least k observations of the associated data set. Note that the sets \(N^j_n({x_0})\) are increasing with increasing j in terms of set inclusion. The second inequality hence implies that the largest strictly smaller neighborhood \(N^{j\!-\!1}_n({x_0})\) contains fewer than k observations. Both conditions taken together thus imply that \(N^j_n({x_0})\) is the smallest neighborhood which contains at least k samples of the bootstrap data set. Any D in \(\mathcal D_{n,n}\) is an element of one and only one set \(\mathcal D^j_{n,n}\) as the smallest neighborhood containing at least k samples is uniquely defined for any data set. Formally, \(\mathcal D^j_{n,n}\cap \mathcal D^{j'}_{n,n}=\emptyset \) for all \(j\ne j'\) and moreover \(\cup _{j\in [n]} \mathcal D_{n,n}^j=\mathcal D_{n,n}\). Notice also that the only feasible s in the constraints defining the partial predictors in Eq. (22) is the particular choice \(s = {1}/{\sum _{({\bar{x}}, {\bar{y}})\in N^j_n({x_0})} w_n({\bar{x}}, {x_0}) \cdot P[{\bar{x}}, {\bar{y}}]}>0\). Hence, we must have that for any \(D\in \mathcal D^j_{n,n}\) the partial estimator equates to

$$\begin{aligned} {\mathrm {E}}^{n,j}_{D}\left[ L({\bar{z}}, y)|x={x_0}\right] = \frac{\textstyle \sum _{({\bar{x}},\bar{y})\in N^j_{n}({x_0})} L({\bar{z}}, {\bar{y}}) \cdot w_n({\bar{x}},{x_0})\cdot D({\bar{x}}, {\bar{y}})}{\textstyle \sum _{({\bar{x}},{\bar{y}})\in N_n^j({x_0})} w_n({\bar{x}},{x_0}) \cdot D({\bar{x}}, {\bar{y}})} \qquad \forall {\bar{z}} \end{aligned}$$

which is precisely the weighted average over the neighborhood \(N^j_n({x_0})\). We have hence for all \({\bar{z}}\) that

$$\begin{aligned}&\max _{j\in [1,\dots ,n]} \, {\mathrm {E}}^{n,j}_{D}\left[ L({\bar{z}}, y)|x={x_0}\right] \\&\quad = \left\{ \begin{array}{ll} \frac{\textstyle \sum _{({\bar{x}},{\bar{y}})\in N^1_{n}({x_0})} L({\bar{z}}, {\bar{y}}) \cdot w_n({\bar{x}},{x_0})\cdot D[{\bar{x}}, {\bar{y}}]}{\textstyle \sum _{({\bar{x}},{\bar{y}})\in N_n^1({x_0})} w_n({\bar{x}},{x_0}) \cdot D[{\bar{x}}, {\bar{y}}]} &{} \mathrm{for~} D\in \mathcal D^1_n ,\\ \vdots &{} \vdots \\ \frac{\textstyle \sum _{({\bar{x}},{\bar{y}})\in N^n_{n}({x_0})} L({\bar{z}}, {\bar{y}}) \cdot w_n({\bar{x}},{x_0})\cdot D({\bar{x}}, \bar{y})}{\textstyle \sum _{({\bar{x}},{\bar{y}})\in N_n^n({x_0})} w_n({\bar{x}},{x_0}) \cdot D({\bar{x}}, {\bar{y}})} &{} \mathrm{for~} D\in \mathcal D^n_n. \end{array}\right. \\&\quad = {\mathrm {E}}^n_{D\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \end{aligned}$$

as we have argued that \(D\in \mathcal D^j_{n,n}\) if and only if \(N^j_n({x_0})\) is the smallest neighborhood containing at least k data points. As the empirical distribution D and associated data set \(\mathrm {bs}[n]\) were chosen arbitrarily, the result follows. \(\square \)
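The partition property established in this proof is easy to check numerically. The following sketch assumes the training points are indexed by increasing distance to \({x_0}\), so that \(N^j_n({x_0})\) simply collects the first j of them; all names are ours.

import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 5
bs = rng.integers(0, n, size=n)                 # indices of a bootstrap resample bs[n] of the training data
counts = np.bincount(bs, minlength=n)           # multiplicities, so that D_bs[i] = counts[i] / n
cum = np.cumsum(counts)                         # number of bootstrap observations in N^1, ..., N^n
j_star = int(np.argmax(cum >= k)) + 1           # smallest neighborhood holding at least k observations
# the two defining inequalities of D^{j*}_{n,n}:  n * sum_{N^{j*}} D >= k > n * sum_{N^{j*-1}} D
assert cum[j_star - 1] >= k and (j_star == 1 or cum[j_star - 2] < k)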

1.2 Proof of Proposition 1

Proof

Observe that \(\left\{ D \in \mathcal D_n\ : \ R(D, D_{\mathrm {tr}})\le r \right\} \,\cap \, \mathcal N=\emptyset \) by construction. Furthermore, \(\left\{ D \in \mathcal D_n\ : \ R(D, D_{\mathrm {tr}})\le r \right\} \) is convex as R is convex in its first argument. Without loss of generality, the neighborhood \(\mathcal N\) is convex as well. By the Hahn–Banach separation theorem an open convex set can be separated linearly from any other disjoint convex set. That is, there must exist a function G and a constant \(a\in \mathrm {R}\) so that

$$\begin{aligned} \mathbb {E}_{D\!\!}\left[ G(x, y)\right] \le a < \mathbb {E}_{D'\!\!}\left[ G(x, y)\right] \quad \forall D\in \left\{ D \in \mathcal D_n\ : \ R(D, D_{\mathrm {tr}})\le r \right\} , ~D'\in \mathcal N. \end{aligned}$$

Hence, \({{\,\mathrm{int}\,}}\mathcal N = \mathcal N \subseteq \mathcal R' :=\{ D\in \mathcal D_n\ : \ \mathbb {E}_{D\!\!}\left[ G(x, y)\right] >a \ge \sup _{R(D', D_{\mathrm {tr}})\le r} \mathbb {E}_{D'\!\!}\left[ G(x, y)\right] \} \subseteq \mathcal R \) with \(\mathcal R=\{ D\in \mathcal D_n\ : \ \mathrm{E}^n_{D}[L(z^{\mathrm {r}}_{\mathrm {tr}}({x_0}), y)|x={x_0}] > \sup _{R(D', D_{\mathrm {tr}})\le r} \mathrm{E}^n_{D'}[L(z^{\mathrm {r}}_{\mathrm {tr}}({x_0}), y)|x={x_0}] \}\) the disappointment set of a nominal formulation based on the cost estimator \((z, D)\mapsto \mathrm{E}^n_{D}[L(z, y)|x={x_0}]=\mathbb {E}_{D\!\!}\left[ G(x, y)\right] \). Consequently, following inequality (31) we have

$$\begin{aligned} - \inf _{D' \in \mathcal N} \, B(D', D_{\mathrm {tr}})\le & {} \liminf _{n\rightarrow \infty } \frac{1}{n} \log D_{\mathrm {tr}}^\infty \left[ D_{\mathrm {bs}[n]} \in \mathcal N \right] \\\le & {} \liminf _{n\rightarrow \infty } \frac{1}{n} \log D_{\mathrm {tr}}^\infty \left[ D_{\mathrm {bs}[n]} \in \mathcal R \right] \end{aligned}$$

The stated result hence follows should we have \(\inf _{D' \in \mathcal N} \, B(D', D_{\mathrm {tr}})<r\). As \(\mathcal N\) is an open set there exists \(\lambda \in (0, 1)\) so that \(D(\lambda )=\lambda D+(1-\lambda )D_{\mathrm {tr}[n]}\in \mathcal N\). From convexity of B we have \(B(D(\lambda ), D_{\mathrm {tr}}) \le \lambda B(D, D_{\mathrm {tr}[n]}) + (1-\lambda ) B(D_{\mathrm {tr}}, D_{\mathrm {tr}}) < r\) for any \(\lambda \in (0,1)\) using that \(B(D, D_{\mathrm {tr}})=r\) and \(B(D_{\mathrm {tr}}, D_{\mathrm {tr}})=0\). Hence, indeed we have \(\inf _{D' \in \mathcal N} \, B(D', D_{\mathrm {tr}})\le B(D(\lambda ), D_{\mathrm {tr}}) < r\). \(\square \)

1.3 Proof of Corollary 1

Proof

Remark that from the definition of the Nadaraya–Watson cost estimate

$$\begin{aligned} {\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L(z, y)|x={x_0}\right] :={\mathbb {E}_{D_{\mathrm {tr}[n]}\!\!}\left[ L(z, y)\cdot w_n(x,{x_0})\right] }/{\mathbb {E}_{D_{\mathrm {tr}[n]}\!\!}\left[ w_n(x,{x_0})\right] } \end{aligned}$$

given in (19) it follows that we have

$$\begin{aligned} \begin{array}{rl} {\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L(z, y)|x={x_0}\right] :=\max _{s>0, P} &{} \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n}\, w_n({\bar{x}}, {x_0})\cdot L(z, {\bar{y}})\cdot P[{\bar{x}}, {\bar{y}}] \\ \mathrm {s.t.}&{} P[{\bar{x}}, {\bar{y}}] = D[{\bar{x}}, {\bar{y}}] \cdot s \quad \forall ({\bar{x}}, {\bar{y}}) \in \Omega _n,\\ &{} \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} P[{\bar{x}}, {\bar{y}}] = s,~ \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0}) \cdot P[\bar{x}, {\bar{y}}] = 1. \end{array} \end{aligned}$$

Indeed, the only feasible s is \(s = {1}/{\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0}) \cdot D[{\bar{x}}, {\bar{y}}]}\). Hence, the only feasible P is \(P = {D}/{\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0}) \cdot D[{\bar{x}}, {\bar{y}}]}\) and the equivalence follows. The chain of equalities

$$\begin{aligned} \begin{array}{r@{}l} \left\{ (s, P)\ : \ \exists D ~\mathrm {s.t.}~ s\cdot D= P,\,R(D, D_{\mathrm {tr}[n]})\le r \right\} &{} = \left\{ (s, P)\ : \ R({P}/{s}, D_{\mathrm {tr}[n]}) \le r \right\} \\ &{} = \left\{ (s, P)\ : \ s\cdot R({P}/{s}, D_{\mathrm {tr}[n]}) \le s\cdot r \right\} \end{array} \end{aligned}$$

imply that the robust budget function \(c_{n}(z, D_{\mathrm {tr}[n]}, {x_0})\) corresponds exactly to the optimization formulation claimed in the corollary. \(\square \)
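The elimination of the scaling variable s can also be verified numerically; a minimal sketch with randomly generated weights, losses and a distribution (all names are ours):

import numpy as np

rng = np.random.default_rng(0)
n = 50
w = rng.uniform(0.1, 1.0, size=n)               # weights w_n(x_i, x0)
L = rng.uniform(size=n)                         # losses L(z, y_i) at some fixed decision z
D = rng.dirichlet(np.ones(n))                   # a distribution D on Omega_n
s = 1.0 / np.sum(w * D)                         # the only feasible scaling
P = s * D                                       # the only feasible P = s * D
assert np.isclose(np.sum(P), s) and np.isclose(np.sum(w * P), 1.0)       # both normalizations hold
assert np.isclose(np.sum(w * L * P), np.sum(w * L * D) / np.sum(w * D))  # objective = Nadaraya-Watson estimate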

1.4 Proof of Theorem 3

Proof

We first show the uniform convergence of the robust budget function to its nominal counterpart when the loss function \(L({\bar{z}}, {\bar{y}})< {\bar{L}} < \infty \) is bounded. Let us first consider a given training data set \(\mathrm {tr}[n]\). Note that, by its definition as a robust counterpart of our estimator, the robust budget can be bounded as

$$\begin{aligned} {\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \le c_{n}({\bar{z}}, D_{\mathrm {tr}[n]}, {x_0}) \le {\mathrm {E}}^n_{D_{\mathrm {wc}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] +\alpha \end{aligned}$$

for some worst-case distribution \(D_{\mathrm {wc}[n]}\) at distance at most \(B(D_{\mathrm {wc}[n]}, D_{\mathrm {tr}[n]})\le r(n)\) from the training distribution, for any arbitrary \(\alpha >0\). In terms of the total variation distance, we have that \(\left\| D_{\mathrm {wc}[n]} - D_{\mathrm {tr}[n]}\right\| _1 \le \sqrt{{B(D_{\mathrm {wc}[n]} , D_{\mathrm {tr}[n]})}/{2}} \le \sqrt{{r(n)}/{2}}\) following Pinsker’s inequality.

The Nadaraya–Watson estimate based on the worst-case distribution is the fraction

$$\begin{aligned} {\mathrm {E}}^n_{D_{\mathrm {wc}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] :=\frac{\textstyle \sum _{({\bar{x}},{\bar{y}})\in \Omega _n} L({\bar{z}}, {\bar{y}}) \cdot w_n({\bar{x}},{x_0})\cdot D_{\mathrm {wc}[n]}[{\bar{x}}, \bar{y}]/h(n)^d}{\sum _{({\bar{x}},{\bar{y}})\in \Omega _n} w_n({\bar{x}},{x_0})\cdot D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]/h(n)^d} \end{aligned}$$

where we denote here \(d=\dim (x)\) for conciseness. We have that the denominator of the Nadaraya–Watson estimator is lower bounded by

$$\begin{aligned}&\textstyle \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]\\ =&\textstyle \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}] +\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n({\bar{x}}, {x_0})}/{h(n)^d} \left( D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]-D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]\right) \\ \ge&\textstyle \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}] - \left( \max _{({\bar{x}}, {\bar{y}})\in \Omega _n}w_n({\bar{x}}, {x_0})\right) {\sqrt{{r(n)}/{2}}}/{h(n)^d}\\ \ge&\textstyle \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n(\bar{x}, {x_0})}/{h(n)^d} D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}] - {\sqrt{{r(n)}/{2}}}/{h(n)^d} \end{aligned}$$

Here the first inequality follows from the Cauchy–Schwarz inequality \(\left| a^\top b\right| \le \left\| a\right\| _\infty \cdot \left\| b\right\| _1 \). Notice that here the weights \(0\le w_n({\bar{x}}, {x_0})\le w_n({x_0}, {x_0})\le 1\) are all non-negative and bounded from above by one for all smoother functions in Fig. 4. Lemma 6 in Walk [32] establishes that the limit

$$\begin{aligned} \liminf _{n\rightarrow \infty } \textstyle \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {data}[n]}[{\bar{x}}, {\bar{y}}] = 2d(x_0)>0 \end{aligned}$$

is positive with probability one. Taken together with the premise \(\lim _{n\rightarrow \infty } {\sqrt{r(n)}}/{h(n)^d}=0\) this establishes the existence of a large enough sample size \(n_0\) such that for all \(n\ge n_0\) the denominator of the Nadaraya–Watson estimator satisfies \(\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}] \!\ge \! d({x_0})\) and \(\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n}\!\! {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]\ge 2{\sqrt{{r(n)}/{2}}}/{h(n)^d}\). Similarly, the numerator of the Nadaraya–Watson estimator satisfies

$$\begin{aligned} \begin{array}{rl} &{} \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} L({\bar{z}}, {\bar{y}})\cdot w_n({\bar{x}}, {x_0}) \cdot D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]/h(n)^d\\ \le &{}\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} L({\bar{z}}, {\bar{y}})\cdot w_n({\bar{x}}, {x_0})\cdot D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]/h(n)^d + \sqrt{{r(n)}/{2}}/h(n)^d. \end{array} \end{aligned}$$

In what follows we will use the inequality \({a}/{(b-x)} \le {a}/{b} + {2a}/{b^2} \cdot x\) for all \(0\le x\le {b}/{2}\) when \(a, b>0\). This inequality follows from the convexity of the function \(x\mapsto {a}/{(b-x)}\) on \(x<b\): the convex function lies below its chord on \([0,{b}/{2}]\), whose value at x is precisely \({a}/{b} + {2a}/{b^2} \cdot x\). Using the previous inequalities we can establish for \(n\ge n_0\) the following claims

$$\begin{aligned}&{\mathrm {E}}^n_{D_{\mathrm {wc}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \\&\le \frac{\textstyle \sum _{({\bar{x}},{\bar{y}})\in \Omega _n} L({\bar{z}}, {\bar{y}}) \cdot w_n({\bar{x}},{x_0})\cdot D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]/h(n)^d}{\sum _{({\bar{x}},{\bar{y}})\in \Omega _n} w_n({\bar{x}},{x_0})\cdot D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]/h(n)^d} + \frac{\sqrt{{r(n)}/{2}}}{h(n)^d d(x_0)} \\&\le \frac{\textstyle \sum _{({\bar{x}},{\bar{y}})\in \Omega _n} L({\bar{z}}, {\bar{y}}) \cdot w_n({\bar{x}},{x_0})\cdot D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]/h(n)^d}{\sum _{({\bar{x}},{\bar{y}})\in \Omega _n} w_n({\bar{x}},{x_0})\cdot D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]/h(n)^d-{\sqrt{{r(n)}/{2}}}/{h(n)^d}} + \frac{\sqrt{{r(n)}/{2}}}{h(n)^d d(x_0)}\\&\le {\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] + 2{\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] {\sqrt{{r(n)}/{2}}}/{h(n)^d} + \frac{\sqrt{{r(n)}/{2}}}{h(n)^d d(x_0)}\\&\le {\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] + (2{\bar{L}} + {1}/{d(x_0)}) {\sqrt{{r(n)}/{2}}}/{h(n)^d}. \end{aligned}$$

The nominal Nadaraya–Watson estimator is uniformly consistent, i.e.,

$$\begin{aligned} |{\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] - \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] |\le \epsilon (n) \end{aligned}$$

where \(\lim _{n\rightarrow \infty } \epsilon (n) = 0\), as discussed before. It hence trivially follows that the robust Nadaraya–Watson estimator is uniformly consistent as well. Indeed, from the previous inequality it follows that

$$\begin{aligned}&|c_{n}({\bar{z}}, D_{\mathrm {data}[n]}, {x_0}) - \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] |\\&\quad \le \epsilon (n) + (2{\bar{L}} + {1}/{d(x_0)}) {\sqrt{{r(n)}/{2}}}/{h(n)^d} + \alpha \end{aligned}$$

with probability one for all \({\bar{z}}\). This inequality holds for any arbitrary \(\alpha >0\). Uniform consistency then directly implies here an asymptotically diminishing optimality gap

$$\begin{aligned}&\mathbb {E}_{D^\star \!\!}\left[ L(z_{\mathrm {data}[n]}^r, y)|x={x_0}\right] - \min _{z}\, \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \\&\quad \le 2\epsilon (n) + (4{\bar{L}} + {2}/{d(x_0)}) {\sqrt{{r(n)}/{2}}}/{h(n)^d}. \end{aligned}$$

Using that \(\lim _{n\rightarrow \infty } {\sqrt{r(n)}}/{h(n)^d}=0\) yields the desired result immediately. \(\square \)

1.5 Proof of Theorem 4

Proof

We show the uniform convergence of the robust budget function to the unknown cost for any bounded loss function, that is, \(L({\bar{z}}, {\bar{y}})< {\bar{L}} < \infty \) for all \({\bar{z}}\) and \({\bar{y}}\). Let us first consider a given training data set \(\mathrm {tr}[n]\) without ties. That is, we have that \(\left| N^j_n({x_0})\right| =j\) for all \(j\in [n]\). Note that, by its definition as a robust counterpart of the estimator \({\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \), the robust cost can be bounded as

$$\begin{aligned} {\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \le c_{n}(z^r_{\mathrm {tr}[n]}, D_{\mathrm {tr}[n]}, {x_0}) \le {\mathrm{E}}_{D_{\mathrm {wc}[n]}}^{j^\star }[L({\bar{z}}, y)|x={x_0}]+\alpha \end{aligned}$$

for some worst-case distribution \(D_{\mathrm {wc}[n]}\in \mathcal D^{j\star }_n\) at distance at most \(B(D_{\mathrm {wc}[n]}, D_{\mathrm {tr}[n]})\le r(n)\) from the training distribution, for any arbitrary \(\alpha >0\). In terms of the total variation distance, we have that \(\left\| D_{\mathrm {wc}[n]} - D_{\mathrm {tr}[n]}\right\| _1 \le \sqrt{{B(D_{\mathrm {wc}[n]} , D_{\mathrm {tr}[n]})}/{2}} \le \sqrt{{r(n)}/{2}}\) following Pinsker’s inequality. The nearest-neighbors estimate based on the worst-case distribution is defined as the fraction

$$\begin{aligned} {\mathrm {E}}^n_{D_{\mathrm {wc}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] :=\frac{\textstyle \sum _{({\bar{x}},{\bar{y}})\in N^{j\star }_n({x_0})} L(\bar{z}, {\bar{y}}) \cdot D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]}{\sum _{({\bar{x}},\bar{y})\in N^{j\star }_n({x_0})} D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]}. \end{aligned}$$

The neighborhood parameter \(j^\star \) satisfies by definition \( \textstyle \sum _{({\bar{x}}, {\bar{y}}) \in N^{j\star }_n({x_0})} D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}] \ge k(n)/n. \) The previous inequality bounds the denominator from below by \({k(n)}/{n}\). Using the Cauchy–Schwarz inequality \(\left| a^\top b\right| \le \left\| a\right\| _\infty \cdot \left\| b\right\| _1 \) as well as Pinsker’s inequality, the numerator of the nearest-neighbors estimator satisfies

$$\begin{aligned} \begin{array}{rl} &{} \sum _{({\bar{x}}, {\bar{y}})\in N^{j\star }_n({x_0})} L({\bar{z}}, {\bar{y}}) \cdot D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}]\\ \le &{} \sum _{({\bar{x}}, {\bar{y}})\in N^{j\star }_n({x_0})} L({\bar{z}}, \bar{y}) \cdot D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]+ {\bar{L}} \sqrt{{r(n)}/{2}}. \end{array} \end{aligned}$$

From the definition of the total variation distance we also have that \(\Vert D_{\mathrm {tr}[n]}-D_{\mathrm {wc}[n]}\Vert _1\ge D_{\mathrm {tr}[n]}[N^{j\star }_n({x_0})] - D_{\mathrm {wc}[n]}[N^{j\star }_n({x_0})] \ge {(j^\star -k(n))}/{n}\). We also have \(\Vert D_{\mathrm {wc}[n]}-D_{\mathrm {tr}[n]}\Vert _1 \ge D_{\mathrm {wc}[n]}[N^{j\star -1}_n({x_0})] - D_{\mathrm {tr}[n]}[N^{j\star -1}_n({x_0})] \ge {(k(n)-j^\star )}/{n}\). The last two inequalities imply that we can use the bound \(\sqrt{r(n)/2}\ge \Vert D_{\mathrm {tr}[n]}-D_{\mathrm {wc}[n]}\Vert _1 \ge \left| k(n)-j^\star \right| /n\). By first applying the Cauchy–Schwarz inequality again and then the previously obtained bounds we obtain

$$\begin{aligned} \begin{array}{rl} &{} \sum _{({\bar{x}}, {\bar{y}})\in N^{j\star }_n({x_0})} L({\bar{z}}, {\bar{y}}) D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]\\ \le &{} \sum _{({\bar{x}}, {\bar{y}})\in N^{k(n)}_n({x_0})} L({\bar{z}}, {\bar{y}}) D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}] + {\bar{L}} | D_{\mathrm {tr}[n]}[N^{j\star }_n({x_0})]-D_{\mathrm {tr}[n]}[N^{k(n)}_n({x_0})]|\\ \le &{} \sum _{({\bar{x}}, {\bar{y}})\in N^{k(n)}_n({x_0})} L({\bar{z}}, {\bar{y}}) D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}] + {\bar{L}} \left| j^\star -k(n)\right| /n\\ \le &{} \sum _{({\bar{x}}, {\bar{y}})\in N^{k(n)}_n({x_0})} L({\bar{z}}, {\bar{y}}) D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}] + {\bar{L}} \sqrt{{r(n)}/{2}}. \end{array} \end{aligned}$$

Hence,

$$\begin{aligned}&{\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \nonumber \\&\quad \le c_n({\bar{z}}, D_{\mathrm {tr}[n]}, {x_0}) \le {\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] + \frac{2 {\bar{L}} n\sqrt{{r(n)}/{2}}}{k(n)}+\alpha \end{aligned}$$
(38)

for any arbitrary training data set without ties. Ties among data points when using the random tie breaking method are a probability zero event which we may ignore. We already know that the nearest neighbors estimator is uniformly consistent. That is, we have that \(|{\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] - \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] | \le \epsilon (n)\) with probability one and \(\lim _{n\rightarrow \infty } \epsilon (n)=0\). When the robustness radius shrinks at an appropriate rate, i.e., when its size is negligible compared to the neighborhood fraction \(k(n)/n\) in the sense that \(\lim _{n\rightarrow \infty } {n\sqrt{r(n)}}/{k(n)}=0\), then uniform consistency of the budget estimator \(c_n\) follows by taking the limit as n tends to infinity in the chain of inequalities (38) applied to \(D_{\mathrm {data}[n]}\) and observing that \(\alpha >0\) is arbitrarily small. Consistency of the nearest-neighbors formulation then follows by the exact same argument as given in the proof of Theorem 3 in the case of the Nadaraya–Watson formulation. \(\square \)

1.6 Proof of Corollary 2

Proof

Let us fix a training data set with empirical training distribution \(D_{\mathrm {tr}[n]}\) and a given decision \({\bar{z}}\). Let \({\bar{c}}_n:=c_{n}({\bar{z}}, D_{\mathrm {tr}[n]}, {x_0})\) be the robust budget based on the training data with \(k(n)=n\). In order to prove the corollary, it suffices to characterize the probability of the event that the empirical distribution \(D_{\mathrm {bs}[n]}\) of random bootstrap data resampled from the training data realizes in the set

$$\begin{aligned} \mathcal C :=\left\{ D \in \mathcal D_n\ : \ \begin{array}{l} \exists s>0, ~\textstyle s \cdot \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0})\cdot L({\bar{z}}, y) \cdot D({\bar{x}}, {\bar{y}}) > {\bar{c}}_n,\\ s\cdot \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}},{x_0}) \cdot D({\bar{x}}, {\bar{y}})=1 \end{array} \right\} \end{aligned}$$

as follows from Corollary 1. After eliminating the auxiliary variable s we arrive at the description \(\mathcal C = \{D\in \mathcal D_n : \textstyle \sum _{({\bar{x}}, \bar{y})\in \Omega _n} w_n({\bar{x}}, {x_0})\cdot L({\bar{z}}, y) \cdot D({\bar{x}}, {\bar{y}}) > {\bar{c}}_n \cdot \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}},{x_0}) \cdot D({\bar{x}}, {\bar{y}})\}\). The set \(\mathcal C\) is a convex polyhedron. The robust budget cost \({\bar{c}}_n\) is constructed to ensure that \(\inf _{D\in \mathcal C}\, R(D, D_{\mathrm {tr}[n]}) > r\). Indeed, we have the rather direct implication \(\bar{D} \in \mathcal C \!\implies \! {\mathrm {E}}^n_{{\bar{D}}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] > {\bar{c}}_n = \sup \,\{ {\mathrm {E}}^n_{ D\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \,|\, R(D, D_{\mathrm {tr}[n]})\le r \}\) which in turn itself implies \(R({\bar{D}}, D_{\mathrm {tr}[n]}) > r\). Hence, the result follows from the bootstrap inequality (28) applied to the probability \(D_{\mathrm {tr}[n]}^\infty (D_{\mathrm {bs}[n]} \in \mathcal C)\) as in this particular case the employed distribution distance function (\(R=B\)) coincides with the bootstrap distance function. \(\square \)

1.7 Proof of Lemma 1

Proof

We will employ standard Lagrangian duality on the convex optimization characterization (25) of the partial nearest neighbors cost function \(c^j_n({\bar{z}}, D, {x_0})\). The Lagrangian function associated with the primal optimization problem in (25) is denoted here as the function

$$\begin{aligned}&\mathcal L(P, s; \alpha , \beta , \eta , \nu ) :=\\&\textstyle \sum _{({\bar{x}}, {\bar{y}})\in N^j_{n}({x_0})} w_n({\bar{x}}, {x_0}) \cdot L(z, {\bar{y}}) \cdot P[x, y] \textstyle + \left( 1-\sum _{({\bar{x}}, {\bar{y}})\in N^{j-1}_{n}({x_0})} w_n({\bar{x}}, {x_0}) \cdot P[{\bar{x}}, {\bar{y}}]\right) \alpha \\&\textstyle + \left( \sum _{({\bar{x}}, {\bar{y}})\in N^j_{n}({x_0})} P[{\bar{x}}, {\bar{y}}] - \frac{k}{n} \cdot s \right) \eta _1 + \left( \frac{k-1}{n} \cdot s - \sum _{({\bar{x}}, {\bar{y}})\in N^{j-1}_{n}({x_0})} P[{\bar{x}}, {\bar{y}}] \right) \eta _2 \\&\textstyle + \left( \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} P[{\bar{x}}, {\bar{y}}] -s\right) \beta + \left( r \cdot s - \sum _{({\bar{x}}, \bar{y})\in \Omega _n} P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, \bar{y}]}{ s \cdot D[{\bar{x}}, {\bar{y}}]}\right) \right) \nu \end{aligned}$$

where P and s are the primal variables of the primal optimization problem (25) and \(\alpha \), \(\beta \), \(\eta \) and \(\nu \) are the dual variables associated with each of its constraints. Collecting the relevant terms in the Lagrangian function results in \(\mathcal L(P, s; \alpha , \beta , \eta , \nu ) =\)

$$\begin{aligned} \textstyle \alpha +&\textstyle s (r \nu - \beta - \frac{k}{n}(\eta _1-\eta _2) - \frac{\eta _2}{n}) \\&\textstyle + \sum _{({\bar{x}}, {\bar{y}})\in N^{j-1}_{n}({x_0})} \left[ P[{\bar{x}}, {\bar{y}}] \left( (L(z, y)-\alpha )\cdot w_n({\bar{x}}, {x_0}) +\beta + \eta _1 -\eta _2\right) - \nu P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, {\bar{y}}]}{ s \cdot D[{\bar{x}}, {\bar{y}}]}\right) \right] \\&\textstyle + \sum _{({\bar{x}}, {\bar{y}})\in N^j_{n}({x_0})\setminus N^{j-1}_{n}({x_0})} \left[ P[{\bar{x}}, {\bar{y}}] \left( (L(z, y)-\alpha )\cdot w_n({\bar{x}},{x_0}) +\beta + \eta _1\right) - \nu P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, {\bar{y}}]}{ s \cdot D[{\bar{x}}, {\bar{y}}]}\right) \right] \\&\textstyle + \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n\setminus \mathtt N^j_{n}({x_0})} \left[ P[{\bar{x}}, {\bar{y}}] \beta - \nu P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, {\bar{y}}]}{ s \cdot D[{\bar{x}}, \bar{y}]}\right) \right] \end{aligned}$$

The dual function of the primal optimization problem (25) is identified with the function \(g(\alpha , \beta , \eta , \nu ) :=\sup _{P\ge 0,\,s> 0} \mathcal L(P, s; \alpha , \beta , \eta , \nu )\). Using the same manipulations as presented in the proof of Lemma 2 we can express the dual function as \(g(\alpha , \beta , \eta , \nu ) =\)

$$\begin{aligned} = \textstyle \sup _{s> 0} \, \alpha + s (r \nu -&\textstyle \beta - \frac{k}{n}(\eta _1-\eta _2) - \frac{\eta _2}{n}) +s \nu \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n\setminus N^j_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{ \beta }{\nu }-1\right) \\&\textstyle +s \nu \sum _{({\bar{x}}, {\bar{y}})\in N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{(L(z, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) + \beta + \eta _1-\eta _2}{\nu }-1\right) \\&\textstyle +s \nu \sum _{({\bar{x}}, {\bar{y}})\in N^j_{n}({x_0})\setminus N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{(L(z, \bar{y})-\alpha ) \cdot w_n({\bar{x}},{x_0}) + \beta + \eta _1}{\nu }-1\right) . \end{aligned}$$

Our dual function can be expressed alternatively as

$$\begin{aligned} g(\alpha , \beta , \eta , \nu ) = \textstyle \Big \{\alpha : r \cdot \nu&\textstyle - \frac{k}{n}(\eta _1-\eta _2) - \frac{\eta _2}{n} + \nu \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n\setminus N^j_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{ \beta }{\nu }-1\right) \\&\textstyle + \nu \sum _{({\bar{x}}, {\bar{y}})\in N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{(L(z, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) + \beta + \eta _1-\eta _2}{\nu }-1\right) \\&\textstyle + \nu \sum _{({\bar{x}}, {\bar{y}})\in N^j_{n}({x_0})\setminus N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{(L(z, \bar{y})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) + \beta + \eta _1}{\nu }-1\right) \le \beta \Big \} . \end{aligned}$$

The dual optimization problem of the primal problem (25) is now found as \(\inf _{\alpha , \beta , \eta \ge 0, \nu \ge 0}\, g(\alpha , \beta , \eta , \nu )\). As the primal optimization problem in (25) is convex, strong duality holds under Slater’s condition, which is satisfied whenever \(r>r^j_n\). Using first-order optimality conditions, the optimal \(\beta ^\star \) must satisfy the relationship \(\beta ^\star = \nu - \nu \log (\sum _{({\bar{x}}, {\bar{y}}) \in N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp ([(L(z, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) +\eta _1-\eta _2]/\nu ) + \sum _{({\bar{x}}, {\bar{y}}) \in N^j_{n}({x_0})\setminus N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp ([(L(z, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) +\eta _1] /\nu ) + \sum _{({\bar{x}}, {\bar{y}}) \in \Omega _n\setminus N^{j}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}])\). Substituting the optimal value of \(\beta ^\star \) back into the dual optimization problem gives

$$\begin{aligned} \inf _{\alpha , \beta , \eta \ge 0, \nu \ge 0}\, g(\alpha , \beta , \eta , \nu ) =&\textstyle \inf _{\alpha , \eta \ge 0, \nu \ge 0}\, g(\alpha , \beta ^\star , \eta , \nu ) \\ =&\textstyle \inf \Big \{\alpha \in \mathrm {R}: \exists \nu \in \mathrm {R}_+, \exists \eta \in \mathrm {R}_+^2, ~r\cdot \nu - \frac{k}{n}(\eta _1-\eta _2) - \frac{\eta _2}{n} \\&\qquad + \nu \log (\textstyle \sum _{({\bar{x}}, {\bar{y}})\in N^{j-1}_n({x_0})} \exp ({[(L(z, {\bar{y}})-\alpha )\cdot w_n({\bar{x}}, {x_0}) + \eta _1-\eta _2]}/{\nu })\cdot D[{\bar{x}}, {\bar{y}}] \\&\qquad \textstyle + \sum _{({\bar{x}}, {\bar{y}})\in N^j_n({x_0})\setminus N^{j-1}_n({x_0})} \exp ({[(L(z, {\bar{y}})-\alpha )\cdot w_n({\bar{x}},{x_0}) + \eta _1]}/{\nu })\cdot D[{\bar{x}}, {\bar{y}}] \\&\qquad \textstyle + \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n \setminus N^j_n({x_0})} D[{\bar{x}}, {\bar{y}}] ) \le 0 \Big \}. \end{aligned}$$

\(\square \)

1.8 Proof of Lemma 2

Proof

We will employ standard Lagrangian duality on the convex optimization characterization of the Nadaraya–Watson cost function given in Corollary 1. The Lagrangian function associated with the primal optimization problem is denoted here as the function

$$\begin{aligned} \mathcal L(P, s; \alpha , \beta , \nu ) :=\textstyle \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0})&\cdot L({\bar{z}}, {\bar{y}}) \cdot P[{\bar{x}}, {\bar{y}}] \textstyle + \left( 1-\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0}) \cdot P[{\bar{x}}, {\bar{y}}]\right) \alpha + \\&\textstyle \left( \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} P[{\bar{x}}, \bar{y}] -s\right) \beta + \left( r \cdot s - \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, {\bar{y}}]}{ s \cdot D[{\bar{x}}, {\bar{y}}]}\right) \right) \nu \end{aligned}$$

where P and s are the primal variables of the primal optimization problem given in Corollary 1 and \(\alpha \), \(\beta \) and \(\nu \) are the dual variables associated with each of its constraints. Collecting the relevant terms in the Lagrangian function results in

$$\begin{aligned} \mathcal L(P, s; \alpha , \beta , \nu ) = \textstyle \alpha + s (r \nu - \beta ) + \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} \left[ P[{\bar{x}}, {\bar{y}}] \left( (L({\bar{z}}, {\bar{y}})-\alpha )\cdot w_n({\bar{x}}, {x_0})+\beta \right) - \nu P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, \bar{y}]}{ s \cdot D[{\bar{x}}, {\bar{y}}]}\right) \right] \end{aligned}$$

The dual function of the primal optimization problem is identified with \(g(\alpha , \beta , \nu ) :=\sup _{P\ge 0,\,s> 0} \mathcal L(P, s; \alpha , \beta , \nu )\). Our dual function can be expressed alternatively as \(g(\alpha , \beta , \nu ) =\)

$$\begin{aligned}&\textstyle \sup _{s> 0} \, \alpha + s\left( r \nu - \beta \right) + {\displaystyle \sup _{P\ge 0}\, \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n}} \left[ P[{\bar{x}}, {\bar{y}}] \left( (L({\bar{z}}, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}},{x_0}) +\beta \right) - \nu P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, {\bar{y}}]}{ s\cdot D[{\bar{x}}, {\bar{y}}]}\right) \right] \\ =&\textstyle \sup _{s> 0} \, \alpha + s\left( r \nu - \beta \right) + {\displaystyle \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} \sup _{P[{\bar{x}}, {\bar{y}}]\ge 0}} \left[ P[{\bar{x}}, {\bar{y}}] \left( (L({\bar{z}}, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) +\beta \right) - \nu P[{\bar{x}}, {\bar{y}}] \log \left( \frac{P[{\bar{x}}, {\bar{y}}]}{ s \cdot D[{\bar{x}}, {\bar{y}}]}\right) \right] \\ =&\textstyle \sup _{s> 0} \, \alpha + s\left( r \nu - \beta \right) + s \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} D[{\bar{x}}, {\bar{y}}][ \sup _{\lambda \ge 0} \, \lambda \left( (L({\bar{z}}, {\bar{y}})-\alpha )\cdot w_n({\bar{x}}, {x_0}) + \beta \right) - \nu \lambda \log \left( \lambda \right) ]. \end{aligned}$$

The inner maximization problems over \(\lambda \) can be dealt with using the Fenchel conjugate of the \(\lambda \mapsto \lambda \cdot \log \lambda \) function as

$$\begin{aligned} =&\textstyle \sup _{s> 0} \, \alpha + s\left( r \nu - \beta \right) +s \nu \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{(L({\bar{z}}, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) + \beta }{\nu }-1\right) \\ =&\textstyle \left\{ \alpha \ : \ r\nu + \nu \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} D[{\bar{x}}, {\bar{y}}] \exp \left( \frac{(L({\bar{z}}, \bar{y})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) + \beta }{\nu }-1\right) \le \beta \right\} . \end{aligned}$$
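For completeness, the inner maximization over \(\lambda \) admits the closed form used in the first equality above: with \(t = (L({\bar{z}}, {\bar{y}})-\alpha )\cdot w_n({\bar{x}}, {x_0}) + \beta \) and \(\nu >0\), setting the derivative of \(\lambda \mapsto \lambda t - \nu \lambda \log \lambda \) to zero gives

$$\begin{aligned} t - \nu (\log \lambda + 1) = 0 \;\Longrightarrow \; \lambda ^\star = \exp \left( \frac{t}{\nu }-1\right) , \qquad \sup _{\lambda \ge 0}\, \big \{ \lambda t - \nu \lambda \log \lambda \big \} = \nu \exp \left( \frac{t}{\nu }-1\right) . \end{aligned}$$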

The dual optimization problem is now found as \(\inf _{\alpha , \beta , \nu \ge 0}\, g(\alpha , \beta , \nu )\). As our primal optimization problem is convex, strong duality holds under Slater’s condition, which is satisfied whenever \(r>0\). Using first-order optimality conditions, the optimal \(\beta ^\star \) must satisfy \( \beta ^\star = \nu - \nu \log (\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} D[{\bar{x}}, {\bar{y}}] \exp ((L({\bar{z}}, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0})/\nu )). \) Substituting the optimal value of \(\beta ^\star \) back into the dual optimization problem gives

$$\begin{aligned} \inf _{\alpha , \beta , \nu \ge 0}\, g(\alpha , \beta , \nu )&= \textstyle \inf _{\alpha , \nu \ge 0}\, g(\alpha , \beta ^\star , \nu ) \\&= \textstyle \inf \left\{ \alpha \in \mathrm {R}\ : \ \exists \nu \in \mathrm {R}_+, \,r\nu + \nu \log \left( \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} D[{\bar{x}}, \bar{y}] \exp \left( \frac{(L({\bar{z}}, {\bar{y}})-\alpha )\cdot w_n(\bar{x},{x_0})}{\nu }\right) \right) \le 0 \right\} . \end{aligned}$$

\(\square \)
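The dual program above is a convex exponential-cone problem and can be handed to off-the-shelf conic solvers; the paper's numerical experiments rely on Julia with Convex.jl and ECOS [6, 12, 29]. The following Python/cvxpy sketch is our own illustration of the Nadaraya–Watson robust budget of Lemma 2: it rewrites the constraint \(r\nu + \nu \log (\sum _i D_i\exp (u_i/\nu ))\le 0\), with \(u_i = (L({\bar{z}}, {\bar{y}}_i)-\alpha )\cdot w_i\), as \(\sum _i D_i s_i \le \nu e^{-r}\) together with the exponential-cone constraints \(\nu \exp (u_i/\nu )\le s_i\).

import numpy as np
import cvxpy as cp
from cvxpy.constraints import ExpCone

def robust_nw_budget(L, w, D, r):
    """Robust Nadaraya-Watson budget via the dual formulation of Lemma 2.
    L[i] = L(z, y_i), w[i] = w_n(x_i, x0), D[i] = training probability of (x_i, y_i), r = radius."""
    n = len(L)
    alpha = cp.Variable()                            # the budget level to be minimized
    nu = cp.Variable(nonneg=True)                    # dual variable nu
    s = cp.Variable(n)                               # epigraph variables for nu * exp(u_i / nu)
    u = cp.multiply(w, L - alpha)                    # u_i = (L(z, y_i) - alpha) * w_n(x_i, x0)
    constraints = [ExpCone(u, nu * np.ones(n), s),   # nu * exp(u_i / nu) <= s_i
                   D @ s <= nu * np.exp(-r)]         # i.e. sum_i D_i exp(u_i / nu) <= exp(-r)
    cp.Problem(cp.Minimize(alpha), constraints).solve(solver=cp.ECOS)  # any exponential-cone solver works
    return alpha.value

# as r -> 0 the returned budget approaches the nominal estimate sum(w * L * D) / sum(w * D)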

Cite this article

Bertsimas, D., Van Parys, B. Bootstrap robust prescriptive analytics. Math. Program. 195, 39–78 (2022). https://doi.org/10.1007/s10107-021-01679-2
