Abstract
We address the problem of prescribing an optimal decision in a framework where the cost function depends on uncertain problem parameters that must be learned from data. Earlier work proposed prescriptive formulations based on supervised machine learning methods. These prescriptive methods can factor in contextual information on a potentially large number of covariates to take context-specific actions which are superior to any static decision. When working with noisy or corrupted data, however, such nominal prescriptive methods can be prone to adverse overfitting phenomena and fail to generalize on out-of-sample data. In this paper we combine ideas from robust optimization and the statistical bootstrap to propose novel prescriptive methods which safeguard against overfitting. We show that a particular entropic robust counterpart to such nominal formulations guarantees good performance on synthetic bootstrap data. As bootstrap data is often a sensible proxy for actual out-of-sample data, our robust counterpart can be interpreted to directly encourage good out-of-sample performance. The associated robust prescriptive methods furthermore reduce to tractable convex optimization problems in the context of local learning methods such as nearest neighbors and Nadaraya–Watson learning. We illustrate our data-driven decision-making framework and our novel robustness notion on a small newsvendor problem.
Notes
Technically, the conditional expectation \(\mathbb {E}_{D^\star \!\!}\left[ L(z, y) | x\right] \) is a random variable and uniquely defined only up to events of measure zero. That is, the inclusion in (2) can only hold almost surely; see for instance the standard text by Billingsley [7]. Slightly abusing notation, all statements in this paper involving the observation \(x_0\) should hence be interpreted to hold \(X^\star \)-almost surely, where \(X^\star \) is the distribution of the covariates x.
Our notation here alludes to the fact that this estimator function mapping data to cost estimates is often, asymptotically as \(n\rightarrow \infty \), an unbiased estimate of the actual unknown expected cost \(z\mapsto \mathbb {E}_{D^\star \!\!}\left[ L( z, y)|x={x_0}\right] \). In our paper the symbol “\(\mathrm{E}\)” will be associated with an estimator (a procedure mapping data to cost estimates) while “\(\mathbb {E}\)” denotes an expectation operator which may define for instance the actual but unknown expected cost.
Notice that this distance function does not possess the discrimination property. Indeed, when \(\left\| {\bar{x}}_i-{x_0}\right\| _2=\left\| {\bar{x}}_j-{x_0}\right\| _2\) for \({\bar{x}}_i\ne \bar{x}_j\) we have a tie. The lack of the discrimination property translates into ambiguously defined neighborhood sets. The discrimination property may be recovered by breaking ties deterministically, based for instance on the value of \({\bar{y}}\). Györfi et al. [17] propose a randomized alternative by augmenting the covariates with an independent auxiliary uniformly distributed random variable on [0, 1]. They prove that by doing so ties occur with probability zero.
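For illustration, the randomized tie-breaking device can be sketched as follows; the function name and the scalar covariates are hypothetical, but the mechanism (compare augmented distances lexicographically, so that the auxiliary uniform coordinate decides only in case of a tie) is the one described above.

```python
import random

def tie_broken_order(xs, x0):
    """Order covariates by distance to x0, breaking ties with an
    independent Uniform[0, 1] auxiliary coordinate (Gyorfi et al. [17])."""
    aug = [(abs(x - x0), random.random()) for x in xs]
    # Lexicographic comparison: the auxiliary coordinate matters only on ties.
    return sorted(range(len(xs)), key=lambda i: aug[i])

random.seed(0)
xs = [1.0, 3.0, 3.0, 5.0]          # three points tie at distance 1 from x0 = 2
order = tie_broken_order(xs, 2.0)  # a permutation; ties resolved at random
```

Since the auxiliary coordinates are almost surely distinct, the resulting ordering is unambiguous with probability one.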
References
Altman, N.S.: An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46(3), 175–185 (1992)
Ban, G.-Y., Rudin, C.: The big data newsvendor: practical insights from machine learning. Oper. Res. 67(1), 90–108 (2019)
Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., Rennen, G.: Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 59(2), 341–357 (2013)
Bertsimas, D., Gupta, V., Kallus, N.: Robust sample average approximation. Math. Program. 171(1–2), 217–282 (2018)
Bertsimas, D., Kallus, N.: From predictive to prescriptive analytics. Manag. Sci. 66(3), 1025–1044 (2020)
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)
Billingsley, P.: Probability and Measure. Wiley, New York (2008)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Csiszár, I.: Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probab. 12(3), 768–793 (1984)
Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58(3), 595–612 (2010)
Dembo, A., Zeitouni, O.: Large Deviations Techniques and Applications, vol. 38. Springer, New York (2009)
Domahidi, A., Chu, E., Boyd, S.: ECOS: an SOCP solver for embedded systems. In: European Control Conference (ECC), pp. 3071–3076 (2013)
Efron, B.: The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia (1982)
Elmachtoub, A.N., Grigas, P.: Smart “predict, then optimize”. Manag. Sci. (2021)
Erdoğan, E., Iyengar, G.: Ambiguous chance constrained problems and robust optimization. Math. Program. 107(1–2), 37–61 (2006)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Springer, New York (2001)
Györfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer, New York (2006)
Hanasusanto, G.A., Kuhn, D.: Robust data-driven dynamic programming. In: Advances in Neural Information Processing Systems, pp. 827–835 (2013)
Hannah, L., Powell, W., Blei, D.M.: Nonparametric density estimation for stochastic optimization with an observable state variable. In: Advances in Neural Information Processing Systems, pp. 820–828 (2010)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
Michaud, R.O.: The Markowitz optimization enigma: is ‘optimized’ optimal? Financ. Anal. J. 45(1), 31–42 (1989)
Mohajerin Esfahani, P., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171(1–2), 115–166 (2018)
Nadaraya, E.A.: On estimating regression. Theory Probab. Appl. 9(1), 141–142 (1964)
Pflug, G., Wozabal, D.: Ambiguity in portfolio selection. Quant. Finance 7(4), 435–442 (2007)
Popescu, I.: A semidefinite programming approach to optimal-moment bounds for convex classes of distributions. Math. Oper. Res. 30(3), 632–657 (2005)
Postek, K., den Hertog, D., Melenberg, B.: Computationally tractable counterparts of distributionally robust constraints on risk measures. SIAM Rev. 58(4), 603–650 (2016)
Sain, S.R.: Multivariate locally adaptive density estimation. Comput. Stat. Data Anal 39(2), 165–186 (2002)
Sun, H., Xu, H.: Convergence analysis for distributionally robust optimization and equilibrium problems. Math. Oper. Res. 41(2), 377–401 (2016)
Udell, M., Mohan, K., Zeng, D., Hong, J., Diamond, S., Boyd, S.: Convex optimization in Julia. In: SC14 Workshop on High Performance Technical Computing in Dynamic Languages (2014)
Van Parys, B.P.G., Esfahani, P.M., Kuhn, D.: From data to decisions: Distributionally robust optimization is optimal. arXiv:1505.05116 (2017)
Van Parys, B.P.G., Goulart, P.J., Kuhn, D.: Generalized Gauss inequalities via semidefinite programming. Math. Program. 156(1–2), 271–302 (2016)
Walk, H.: Strong laws of large numbers and nonparametric estimation. In: Recent Developments in Applied Probability and Statistics, pp. 183–214. Springer (2010)
Wang, Z., Glynn, P.W., Ye, Y.: Likelihood robust optimization for data-driven newsvendor problems. CMS 12(2), 241–261 (2016)
Watson, G.S.: Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A 26(4), 359–372 (1964)
Wiesemann, W., Kuhn, D., Sim, M.: Distributionally robust convex optimization. Oper. Res. 62(6), 1358–1376 (2014)
Zymler, S., Kuhn, D., Rustem, B.: Distributionally robust joint chance constraints with second-order moment information. Math. Program. 137(1–2), 167–198 (2013)
Acknowledgements
The authors would like to thank the reviewers for their careful reading of this manuscript, which greatly improved its overall exposition. The second author is generously supported by the Early Postdoc.Mobility fellowship No. 165226 of the Swiss National Science Foundation.
Appendices
Uniformly consistent estimators
Nadaraya–Watson estimation can be shown to be pointwise consistent, when using an appropriately scaled bandwidth parameter h(n), for any of the smoother functions listed in Fig. 4:
Theorem 7
(Walk [32]) Let us have a loss function satisfying \(\mathbb {E}_{D^\star \!\!}\left[ \left| L({\bar{z}}, y)\right| \cdot \max \{\log (\left| L({\bar{z}}, y)\right| ),0\}\right] < \infty \) for all \({\bar{z}}\). Let the bandwidth \(h(n)={cn^{-\delta }}\) for some \(c>0\) and \(\delta \in (0, {1}/{\dim (x)})\). Let S be any of the smoother functions listed in Fig. 4 and take the weighting function to be \(w_n(x, {x_0}) = S(\left\| x-{x_0}\right\| _2/h(n))\). Then, our estimation formulation (with \(k(n)=n\)) is asymptotically consistent for any \(D^\star \), i.e., with probability one we have
Nearest neighbors estimation is also consistent under very mild technical conditions provided that the number of neighbors k(n) is scaled appropriately with the number of training data samples:
Theorem 8
(Walk [32]) Assume \(\mathrm {dist}({\bar{d}}=({\bar{x}}, {\bar{y}}), {x_0}) = \left\| {\bar{x}}-{x_0}\right\| _2\) and follow the random tie-breaking rule discussed in Györfi et al. [17]. Let \(k(n)=\lceil \min \{cn^{\delta }, n\} \rceil \) for some \(c>0\) and \(\delta \in (0,1)\). Then, our estimation formulation (with \(w_n(\bar{x}, {x_0})=1\)) is asymptotically consistent for any \(D^\star \), i.e., with probability one we have
An estimator is said to be uniformly consistent if the event
for any \(\epsilon >0\) and \(D^\star \). That is, given access to a sufficiently large amount of training data, the estimator predicts the cost of all potential decisions to any arbitrary accuracy. It is not difficult to see that this uniform consistency in turn implies the consistency of the budget minimization formulation. Indeed, with probability one we have
Here, the limit (35) follows from the consistency of the estimator evaluated at the full information decision \(z^\star ({x_0})\). Inequality (36) is a direct consequence of the characterization of the data-driven decision \(z_{\mathrm {data}[n]}\) as a minimizer of \({\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L(z, y)|x={x_0}\right] \). Remarking that \(\max _{{\bar{z}}}\, |{\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] - \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y) | x={x_0}\right] | \le \epsilon \implies {\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] \ge \mathbb {E}_{D^\star \!\!}\left[ L(z_{\mathrm {data}[n]}, y)|x={x_0}\right] -\epsilon \), the limit (37) follows from the uniform consistency of the estimator stated in Eq. (34). Chaining the inequalities in results (35), (36) and (37) implies that
Here the optimality gap between the cost of the full information decision \(z^\star ({x_0})\) and the cost of the data-driven decision \(z_{\mathrm {data}[n]}\) is bounded by \(2\epsilon \). As this result holds for any arbitrary \(\epsilon >0\) the data-driven formulation is asymptotically consistent.
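A minimal numerical sketch of this consistency argument for a newsvendor-type problem: the Gaussian smoother, the symmetric loss, and the data-generating process below are illustrative assumptions, chosen so that the full information decision is the conditional median \(z^\star(x_0)=x_0\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative training data: covariate x, uncertain outcome y = x + noise.
n = 2000
x = rng.uniform(0.0, 4.0, n)
y = x + rng.normal(0.0, 1.0, n)

x0, h = 2.0, 0.3                          # query point and bandwidth
w = np.exp(-0.5 * ((x - x0) / h) ** 2)    # Gaussian smoother weights
w /= w.sum()                              # Nadaraya-Watson normalization

def loss(z, y):
    # Symmetric newsvendor-style loss; its conditional minimizer is the median.
    return np.abs(y - z)

# Estimate z -> E[L(z, y) | x = x0] and minimize over a grid of decisions.
zs = np.linspace(0.0, 4.0, 161)
est = np.array([w @ loss(z, y) for z in zs])
z_data = zs[np.argmin(est)]               # data-driven decision, close to x0
```

With a uniformly consistent estimator, the data-driven decision `z_data` concentrates around the full information decision \(x_0 = 2\) as the sample size grows.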
Proofs
1.1 Proof of Theorem 1
Proof
First note that the domain of the partial estimators as a function of the distribution D satisfies
For a distribution D to be in the domain of the partial estimator the constraints in Eq. (22) must indeed be feasible. In other words, there must exist some P and \(s>0\) for which \(s\cdot D[{\bar{x}}, {\bar{y}}] = P[{\bar{x}}, {\bar{y}}]\) for all \(({\bar{x}}, {\bar{y}})\in \Omega _n\). The last two constraints in Eq. (22) then imply that any such \(D\in \mathcal D_n\) must also be in \(\mathcal D^j_{n}\).
Any \(D\in \mathcal D_{n,n}\) is the empirical distribution of some bootstrap data set, say \(\mathrm {bs}[n]\), consisting of n observations from the training data set. Each set \(\mathcal D_{n,n}^j:=\mathcal D_n^j\cap \mathcal D_{n,n}\) has a very natural interpretation in terms of \(N^j_n({x_0})\) being the smallest neighborhood containing at least k observations of this associated bootstrap data set \(\mathrm {bs}[n]\). Indeed, \(D\in \mathcal D_{n,n}^j\) is in terms of the associated data set equivalent to
The first inequality implies that the neighborhood \(N^j_n({x_0})\) contains at least k observations of the associated data set. Note that the sets \(N^j_n({x_0})\) are increasing in j in terms of set inclusion. The latter inequality hence implies that the largest strictly smaller neighborhood \(N^{j\!-\!1}_n({x_0})\) contains fewer than k observations. Both conditions taken together thus imply that \(N^j_n({x_0})\) is the smallest neighborhood which contains at least k samples of the bootstrap data set. Any D in \(\mathcal D_{n,n}\) is an element of one and only one set \(\mathcal D^j_{n,n}\) as the smallest neighborhood containing at least k samples is uniquely defined for any data set. Formally, \(\mathcal D^j_{n,n}\cap \mathcal D^{j'}_{n,n}=\emptyset \) for all \(j\ne j'\) and moreover \(\cup _{j\in [n]} \mathcal D_{n,n}^j=\mathcal D_{n,n}\). Notice also that the only feasible s in the constraints defining the partial predictors in Eq. (22) is the particular choice \(s = {1}/{\sum _{({\bar{x}}, {\bar{y}})\in N^j_n({x_0})} w_n({\bar{x}}, {x_0}) \cdot P[{\bar{x}}, {\bar{y}}]}>0\). Hence, we must have that for any \(D\in \mathcal D^j_{n,n}\) the partial estimator equates to
which is precisely the weighted average over the neighborhood \(N^j_n({x_0})\). We have hence for all \({\bar{z}}\) that
as we have argued that \(D\in \mathcal D^j_{n,n}\) if and only if \(N^j_n({x_0})\) is the smallest neighborhood containing at least k data points. As the empirical distribution D and associated data set \(\mathrm {bs}[n]\) were chosen arbitrarily the result follows. \(\square \)
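The identification of \(\mathcal D^j_{n,n}\) with the smallest neighborhood holding at least k bootstrap observations can be sketched numerically; the function name, the scalar covariates, and the bootstrap multiplicities below are illustrative.

```python
import numpy as np

def smallest_neighborhood(x_train, x0, counts, k):
    """Return the index j of the smallest neighborhood N^j(x0) containing
    at least k observations of a bootstrap resample.

    counts[i] is the multiplicity of training point i in the bootstrap
    sample; the nested neighborhoods are balls ordered by distance to x0."""
    order = np.argsort(np.abs(np.asarray(x_train, dtype=float) - x0))
    running = np.cumsum(np.asarray(counts)[order])
    return int(np.searchsorted(running, k) + 1)   # 1-based neighborhood index

x_train = [0.5, 1.0, 2.2, 3.0]
counts = [0, 3, 1, 0]   # a bootstrap resample of n = 4 training points
j2 = smallest_neighborhood(x_train, 1.1, counts, k=2)  # nearest point suffices
j4 = smallest_neighborhood(x_train, 1.1, counts, k=4)  # needs three neighbors
```

Since the neighborhoods are nested, the returned index is unique for every resample, mirroring the disjointness of the sets \(\mathcal D^j_{n,n}\).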
1.2 Proof of Proposition 1
Proof
Observe that \(\left\{ D \in \mathcal D_n\ : \ R(D, D_{\mathrm {tr}})\le r \right\} \,\cap \, \mathcal N=\emptyset \) by construction. Furthermore, \(\left\{ D \in \mathcal D_n\ : \ R(D, D_{\mathrm {tr}})\le r \right\} \) is convex as R is convex in its first argument. Without loss of generality, the neighborhood \(\mathcal N\) is convex as well. By the Hahn–Banach separation theorem, an open convex set can be separated linearly from any disjoint convex set. That is, there must exist a function G and a constant \(a\in \mathbb {R}\) so that
Hence, \({{\,\mathrm{int}\,}}\mathcal N = \mathcal N \subseteq \mathcal R' :=\{ D\in \mathcal D_n\ : \ \mathbb {E}_{D\!\!}\left[ G(x, y)\right] >a \ge \sup _{R(D', D_{\mathrm {tr}})\le r} \mathbb {E}_{D'\!\!}\left[ G(x, y)\right] \} \subseteq \mathcal R \) with \(\mathcal R=\{ D\in \mathcal D_n\ : \ \mathrm{E}^n_{D}[L(z^{\mathrm {r}}_{\mathrm {tr}}({x_0}), y)|x={x_0}] > \sup _{R(D', D_{\mathrm {tr}})\le r} \mathrm{E}^n_{D'}[L(z^{\mathrm {r}}_{\mathrm {tr}}({x_0}), y)|x={x_0}] \}\) the disappointment set of a nominal formulation based on the cost estimator \((z, D)\mapsto \mathrm{E}^n_{D}[L(z, y)|x={x_0}]=\mathbb {E}_{D\!\!}\left[ G(x, y)\right] \). Consequently, following inequality (31) we have
The stated result hence follows provided \(\inf _{D' \in \mathcal N} \, B(D', D_{\mathrm {tr}})<r\). As \(\mathcal N\) is an open set there exists \(\lambda \in (0, 1)\) so that \(D(\lambda )=\lambda D+(1-\lambda )D_{\mathrm {tr}[n]}\in \mathcal N\). From the convexity of B we have \(B(D(\lambda ), D_{\mathrm {tr}}) \le \lambda B(D, D_{\mathrm {tr}[n]}) + (1-\lambda ) B(D_{\mathrm {tr}}, D_{\mathrm {tr}}) < r\) for any \(\lambda \in (0,1)\), using that \(B(D, D_{\mathrm {tr}})=r\) and \(B(D_{\mathrm {tr}}, D_{\mathrm {tr}})=0\). Hence, indeed we have \(\inf _{D' \in \mathcal N} \, B(D', D_{\mathrm {tr}})\le B(D(\lambda ), D_{\mathrm {tr}}) < r\). \(\square \)
1.3 Proof of Corollary 1
Proof
Remark that from the definition of the Nadaraya–Watson cost estimate
given in (19) it follows that we have
Indeed, the only feasible s is such that \(s = {1}/{\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0}) \cdot P[{\bar{x}}, {\bar{y}}]}\). Hence, the only feasible P is \(P = {D}/{\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}}, {x_0}) \cdot P[{\bar{x}}, {\bar{y}}]}\) and the equivalence follows. The chain of equalities
imply that the robust budget function \(c_{n}(z, D_{\mathrm {tr}[n]}, {x_0})\) corresponds exactly to the optimization formulation claimed in the corollary. \(\square \)
1.4 Proof of Theorem 3
Proof
We first show the uniform convergence of the robust budget function to its nominal counterpart when the loss function \(L({\bar{z}}, {\bar{y}})< {\bar{L}} < \infty \) is bounded. Consider first a given training data set \(\mathrm {tr}[n]\). Note that, by its definition as the robust counterpart of our estimator, the robust budget can be bounded as
for some worst-case distributions \(D_{\mathrm {wc}[n]}\) at distance at most \(B(D_{\mathrm {wc}[n]}, D_{\mathrm {tr}[n]})\le r(n)\) from the training distribution for any arbitrary \(\alpha >0\). In terms of the total variation distance, we have that \(\left\| D_{\mathrm {wc}[n]} - D_{\mathrm {tr}[n]}\right\| _1 \le \sqrt{{B(D_{\mathrm {wc}[n]} , D_{\mathrm {tr}[n]})}/{2}} \le \sqrt{{r(n)}/{2}}\) following Pinsker's inequality.
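A quick numerical check of Pinsker's inequality; note that the constant in front of the divergence depends on the convention used for B and for the total variation distance, and the sketch below uses the common form \(\Vert P-Q\Vert _1 \le \sqrt{2\,\mathrm{KL}(P\Vert Q)}\) for discrete distributions.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence of discrete distribution p from q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))   # two random distributions on 6 atoms
q = rng.dirichlet(np.ones(6))

l1 = float(np.abs(p - q).sum())           # ell-1 distance
bound = float(np.sqrt(2.0 * kl(p, q)))    # Pinsker bound on the ell-1 distance
```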
The Nadaraya–Watson estimate based on the worst-case distribution is the fraction
where we denote here \(d=\dim (x)\) for conciseness. We have that the denominator of the Nadaraya–Watson estimator is lower bounded by
Here the first inequality follows from Hölder's inequality \(\left| a^\top b\right| \le \left\| a\right\| _\infty \cdot \left\| b\right\| _1 \). Notice that here the weights \(0\le w_n({\bar{x}}, {x_0})\le w_n({x_0}, {x_0})\le 1\) are all non-negative and bounded from above by one for all smoother functions in Fig. 4. Lemma 6 in Walk [32] establishes that the limit
is positive with probability one. Taken together with the premise \(\lim _{n\rightarrow \infty } {\sqrt{r(n)}}/{h(n)^d}=0\) this establishes the existence of a large enough sample size \(n_0\) such that for all \(n\ge n_0\) the denominator of the Nadaraya–Watson estimator satisfies \(\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}] \!\ge \! d({x_0})\) and \(\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n}\!\! {w_n({\bar{x}}, {x_0})}/{h(n)^d} D_{\mathrm {tr}[n]}[{\bar{x}}, {\bar{y}}]\ge 2{\sqrt{{r(n)}/{2}}}/{h(n)^d}\). Similarly, the numerator of the Nadaraya–Watson estimator satisfies
In what follows we will use the inequality \({a}/{(b-x)} \le {a}/{b} + {2a}/{b^2} \cdot x\) for all \(x\le {b}/{2}\) when \(a, b>0\). This inequality follows from the convexity of the function \(x\mapsto {a}/{(b-x)}\) on \(x\le b\) when \(a, b>0\): a convex function lies below its secant through the endpoints \(x=0\) and \(x={b}/{2}\). Using the previous inequalities we can establish when \(n\ge n_0\) the following claims
The nominal Nadaraya–Watson estimator is uniformly consistent, i.e.,
for \(\lim _{n\rightarrow \infty } \epsilon (n) = 0\) as discussed before. It hence trivially follows that the robust Nadaraya–Watson estimator is uniformly consistent as well. Indeed, from the previous inequality it follows that
with probability one for all \({\bar{z}}\). This inequality holds for any arbitrary \(\alpha >0\). Uniform consistency then directly implies here an asymptotically diminishing optimality gap
Using that \(\lim _{n\rightarrow \infty } {\sqrt{r(n)}}/{h(n)^d}=0\) yields the wanted result immediately. \(\square \)
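The elementary secant inequality invoked in the proof above, \(a/(b-x) \le a/b + ({2a}/{b^2})\,x\) for \(0\le x\le b/2\) and \(a,b>0\), follows from the convexity of \(x\mapsto a/(b-x)\) and can be checked numerically; equality holds at both endpoints.

```python
import numpy as np

# Illustrative constants; the inequality holds for any a, b > 0.
a, b = 0.7, 0.5
xs = np.linspace(0.0, b / 2.0, 101)
lhs = a / (b - xs)
rhs = a / b + (2.0 * a / b**2) * xs   # secant of the convex map over [0, b/2]
```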
1.5 Proof of Theorem 4
Proof
We show the uniform convergence of the robust budget function to the unknown cost for any bounded loss function \(L({\bar{z}}, {\bar{y}})< {\bar{L}} < \infty \) for all \({\bar{z}}\) and \({\bar{y}}\). Let us first consider a given training data set \(\mathrm {tr}[n]\) without ties. That is, we have that \(\left| N^j_n({x_0})\right| =j\) for all \(j\in [n]\). Note that, by its definition as the robust counterpart of the estimator \({\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \), the robust cost can be bounded as
for some worst-case distributions \(D_{\mathrm {wc}[n]}\in \mathcal D^{j\star }_n\) at distance at most \(B(D_{\mathrm {wc}[n]}, D_{\mathrm {tr}[n]})\le r(n)\) from the training distribution for any arbitrary \(\alpha >0\). In terms of the total variation distance, we have that \(\left\| D_{\mathrm {wc}[n]} - D_{\mathrm {tr}[n]}\right\| _1 \le \sqrt{{B(D_{\mathrm {wc}[n]} , D_{\mathrm {tr}[n]})}/{2}} \le \sqrt{{r(n)}/{2}}\) following Pinsker's inequality. The nearest-neighbors estimate based on the worst-case distribution is defined as the fraction
The neighborhood parameter \(j^\star \) satisfies by definition \( \textstyle \sum _{({\bar{x}}, {\bar{y}}) \in N^{j\star }_n({x_0})} D_{\mathrm {wc}[n]}[{\bar{x}}, {\bar{y}}] \ge k(n)/n. \) This inequality bounds the denominator from below by \({k(n)}/{n}\). Using Hölder's inequality \(\left| a^\top b\right| \le \left\| a\right\| _\infty \cdot \left\| b\right\| _1 \) as well as Pinsker's inequality, the numerator of the nearest-neighbors estimator satisfies
We also have from the definition of the total variation distance that \(\Vert D_{\mathrm {tr}[n]}-D_{\mathrm {wc}[n]}\Vert _1\ge D_{\mathrm {tr}[n]}[N^{j\star }_n({x_0})] - D_{\mathrm {wc}[n]}[N^{j\star }_n({x_0})] \ge {(j^\star -k(n))}/{n}\). Similarly, \(\Vert D_{\mathrm {wc}[n]}-D_{\mathrm {tr}[n]}\Vert _1 \ge D_{\mathrm {wc}[n]}[N^{j\star -1}_n({x_0})] - D_{\mathrm {tr}[n]}[N^{j\star -1}_n({x_0})] \ge {(k(n)-j^\star )}/{n}\). The last two inequalities imply that we can use the bound \(\sqrt{r(n)/2}\ge \Vert D_{\mathrm {tr}[n]}-D_{\mathrm {wc}[n]}\Vert _1 \ge \left| k(n)-j^\star \right| /n\). Applying first Hölder's inequality again and then the previously obtained bounds we obtain
Hence,
for any arbitrary training data set without ties. Ties among data points are, when using the random tie-breaking method, a probability zero event which we may ignore. We already know that the nearest neighbors estimator is uniformly consistent. That is, we have that \(|{\mathrm {E}}^n_{D_{\mathrm {data}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] - \mathbb {E}_{D^\star \!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] | \le \epsilon (n)\) with probability one and \(\lim _{n\rightarrow \infty } \epsilon (n)=0\). When the robustness radius shrinks at an appropriate rate, i.e., its size compared to the number of neighbors k(n) is negligible (\(\lim _{n\rightarrow \infty } {n\sqrt{r(n)}}/{k(n)}=0\)), then uniform consistency of the budget estimator \(c_n\) follows by letting n tend to infinity in the chain of inequalities in (38) applied to \(D_{\mathrm {data}[n]}\) and observing that \(\alpha >0\) is arbitrarily small. Uniform consistency of the nearest-neighbors formulation then follows by the same argument as given in the proof of Theorem 3 for the Nadaraya–Watson formulation. \(\square \)
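For illustration, a minimal sketch of the nominal nearest-neighbors cost estimate with the scaling \(k(n)=\lceil \min \{cn^{\delta }, n\}\rceil \); the quadratic loss and the data-generating process are illustrative assumptions under which \(\mathbb {E}[(y-2)^2\,|\,x=2]=1\).

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_cost_estimate(x, y, x0, k, loss):
    """Average the loss over the k training points closest to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return float(np.mean(loss(y[idx])))

c, delta = 1.0, 0.5                       # k(n) = ceil(min(c * n**delta, n))
n = 10000
x = rng.uniform(0.0, 4.0, n)
y = x + rng.normal(0.0, 1.0, n)           # so that y | x = 2 is N(2, 1)
k = int(np.ceil(min(c * n**delta, n)))    # k = 100 neighbors
est = knn_cost_estimate(x, y, 2.0, k, lambda yy: (yy - 2.0) ** 2)
```

With this scaling, k grows without bound while \(k(n)/n \rightarrow 0\), so the neighborhood both accumulates samples and shrinks around \(x_0\), which is the mechanism behind Theorem 8.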
1.6 Proof of Corollary 2
Proof
Let us fix a training data set with empirical training distribution \(D_{\mathrm {tr}[n]}\) and a given decision z. Let \(\bar{c}_n:={\mathrm {E}}^n_{D_{\mathrm {tr}[n]}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \) be the budgeted cost based on the training data with \(k(n)=n\). In order to prove the theorem, it suffices to characterize the probability of the event that the empirical distribution \(D_{\mathrm {bs}[n]}\) of random bootstrap data resampled from the training data realizes in the set
as follows from Corollary 1. After eliminating the auxiliary variable s we arrive at the description \(\mathcal C = \{D\in \mathcal D_n : \textstyle \sum _{({\bar{x}}, \bar{y})\in \Omega _n} w_n({\bar{x}}, {x_0})\cdot L({\bar{z}}, {\bar{y}}) \cdot D({\bar{x}}, {\bar{y}}) > {\bar{c}}_n \cdot \sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} w_n({\bar{x}},{x_0}) \cdot D({\bar{x}}, {\bar{y}})\}\). The set \(\mathcal C\) is a convex polyhedron. The robust budget cost \({\bar{c}}_n\) is constructed to ensure that \(\inf _{D\in \mathcal C}\, R(D, D_{\mathrm {tr}[n]}) > r\). Indeed, we have the direct implication \(\bar{D} \in \mathcal C \!\implies \! {\mathrm {E}}^n_{{\bar{D}}\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] > {\bar{c}}_n = \sup \,\{ {\mathrm {E}}^n_{ D\!\!}\left[ L({\bar{z}}, y)|x={x_0}\right] \,|\, R(D, D_{\mathrm {tr}[n]})\le r \}\) which in turn implies \(R({\bar{D}}, D_{\mathrm {tr}[n]}) > r\). Hence, the result follows from the bootstrap inequality (28) applied to the probability \(D_{\mathrm {tr}[n]}^\infty (D_{\mathrm {bs}[n]} \in \mathcal C)\) as in this particular case the employed distribution distance function (\(R=B\)) coincides with the bootstrap distance function. \(\square \)
1.7 Proof of Lemma 1
Proof
We will employ standard Lagrangian duality on the convex optimization characterization (25) of the partial nearest neighbors cost function \(c^j_n({\bar{z}}, D, {x_0})\). The Lagrangian function associated with the primal optimization problem in (25) is denoted here as the function
where P and s are the primal variables of the primal optimization problem (25) and \(\alpha \), \(\beta \), \(\eta \) and \(\nu \) the dual variables associated with each of its constraints. Collecting the relevant terms in the Lagrangian function results in \(\mathcal L(P, s; \alpha , \beta , \nu ) =\)
The dual function of the primal optimization problem (25) is identified with the concave function \(g(\alpha , \beta , \eta , \nu ) :=\inf _{P\ge 0,\,s> 0} \mathcal L(P, s; \alpha , \beta , \nu )\). Using the same manipulations as presented in the proof of Lemma 2 we can express the dual function as \(g(\alpha , \beta , \eta , \nu ) =\)
Our dual function can be expressed alternatively as
The dual optimization problem of the primal problem (25) is now found as \(\inf _{\alpha , \beta , \eta , \nu \ge 0}\, g(\alpha , \beta , \eta , \nu )\). As the primal optimization problem in (25) is convex, strong duality holds under Slater’s condition which is satisfied whenever \(r>r^j_n\). Using first-order optimality conditions, the optimal \(\beta ^\star \) must satisfy the relationship \(\beta ^\star = -\nu + \nu \log (\sum _{({\bar{x}}, {\bar{y}}) \in N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp ([(L(z, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) +\eta _1-\eta _2]/\nu ) + \sum _{({\bar{x}}, \bar{y}) \in N^j_{n}({x_0})\setminus N^{j-1}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}] \exp ([(L(z, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0}) +\eta _1] /\nu ) + \sum _{({\bar{x}}, {\bar{y}}) \in \Omega _n\setminus N^{j}_{n}({x_0})} D[{\bar{x}}, {\bar{y}}])\). Substituting the optimal value of \(\beta ^\star \) back into the dual optimization problem gives
\(\square \)
1.8 Proof of Lemma 2
Proof
We will employ standard Lagrangian duality on the convex optimization characterization of the Nadaraya–Watson cost function given in Corollary 1. The Lagrangian function associated with the primal optimization problem is denoted here as the function
where P and s are the primal variables of the primal optimization problem given in Corollary 1 and \(\alpha \), \(\beta \) and \(\nu \) the dual variables associated with each of its constraints. Collecting the relevant terms in the Lagrangian function results in
The dual function of the primal optimization problem is identified with \(g(\alpha , \beta , \nu ) :=\inf _{P\ge 0,\,s> 0} \mathcal L(P, s; \alpha , \beta , \nu )\). Our dual function can be expressed alternatively as \(g(\alpha , \beta , \nu ) =\)
The inner maximization problems over \(\lambda \) can be dealt with using the Fenchel conjugate of the \(\lambda \mapsto \lambda \cdot \log \lambda \) function as
The dual optimization problem is now found as \(\inf _{\alpha , \beta , \nu \ge 0}\, g(\alpha , \beta , \nu )\). As our primal optimization problem is convex, strong duality holds under Slater’s condition which is satisfied whenever \(r>0\). Using first-order optimality conditions, the optimal \(\beta ^\star \) must satisfy \( \beta ^\star = -\nu + \nu \log (\sum _{({\bar{x}}, {\bar{y}})\in \Omega _n} D[{\bar{x}}, {\bar{y}}] \exp ((L({\bar{z}}, {\bar{y}})-\alpha ) \cdot w_n({\bar{x}}, {x_0})/\nu )). \) Substituting the optimal value of \(\beta ^\star \) back into the dual optimization problem gives
\(\square \)
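The dual problems derived in Lemmas 1 and 2 are instances of the classical entropic (Kullback–Leibler) duality \(\sup \{\mathbb {E}_P[\ell ] : \mathrm {KL}(P\Vert D)\le r\} = \min _{\nu >0}\, \nu r + \nu \log \mathbb {E}_D[e^{\ell /\nu }]\); see for instance Ben-Tal et al. [11]. A minimal sketch, without the Nadaraya–Watson weights and with the one-dimensional dual minimized on a grid:

```python
import numpy as np

def kl_robust_mean(loss, d, r, nus=np.logspace(-3, 3, 2001)):
    """Approximate sup { E_P[loss] : KL(P || d) <= r } via its entropic dual
    min_{nu > 0} nu*r + nu*log E_d[exp(loss/nu)], minimized on a nu-grid."""
    loss = np.asarray(loss, dtype=float)
    m = loss.max()   # shift the losses for numerical stability of the exp
    vals = [nu * r + nu * np.log(d @ np.exp((loss - m) / nu)) + m for nu in nus]
    return float(min(vals))

d = np.array([0.3, 0.4, 0.3])       # nominal (training) distribution
loss = np.array([1.0, 2.0, 5.0])    # losses on the three atoms
nominal = float(d @ loss)           # nominal expected loss, 2.6
robust_small = kl_robust_mean(loss, d, 0.01)  # slightly above the nominal cost
robust_big = kl_robust_mean(loss, d, 5.0)     # approaches the worst atom, 5
```

The worst-case expectation interpolates between the nominal expected loss as \(r\rightarrow 0\) and the largest loss value as \(r\rightarrow \infty \), which is the behavior exploited by the robust budget function.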
Bertsimas, D., Van Parys, B. Bootstrap robust prescriptive analytics. Math. Program. 195, 39–78 (2022). https://doi.org/10.1007/s10107-021-01679-2
Keywords
- Data analytics
- Distributionally robust optimization
- Statistical bootstrap
- Nadaraya–Watson learning
- Nearest neighbors learning