Abstract
“Localization” has proven to be a valuable tool in the statistical learning literature, as it allows for sharp risk bounds in terms of the problem geometry. Localized bounds seem to be much less exploited in the stochastic optimization literature. In addition, both communities have a clear interest in risk bounds that require only weak moment assumptions, i.e., that accommodate “heavier tails”. In this work we use a localization toolbox to derive risk bounds in two specific applications. The first is portfolio risk minimization with conditional value-at-risk constraints. We consider a setting where, among all assets with high returns, there is a portion of dimension g, unknown to the investor, that carries significantly less risk than the remaining portion. Our rates for the SAA problem show that the “risk inflation” caused by a multiplicative factor affects the statistical rate only through a term proportional to g. As the “normalized risk” increases, the contribution to the rate from the extrinsic dimension diminishes, while the dependence on g remains fixed. Localization is the key tool in establishing this property. As a second application of our localization toolbox, we obtain sharp oracle inequalities for least-squares estimators with a Lasso-type constraint under weak moment assumptions. One main consequence of these inequalities is persistence, as posed by Greenshtein and Ritov, with covariates having heavier tails. This improves on prior work of Bartlett, Mendelson and Neeman.
Notes
In fact, Corollary 2.5 in [29] covers only the case where \(\varvec{\Sigma }\) is the identity matrix, but its arguments are based on VC dimension theory and are readily extendable to our setting. We omit such details.
References
Artstein, Z., Wets, R.J.-B.: Consistency of minimizers and the SLLN for stochastic programs. J. Convex Anal. 2, 1–17 (1995)
Bartlett, P., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33, 1497–1537 (2005)
Bartlett, P., Mendelson, S.: Empirical minimization. Probab. Theory Rel. Fields 135(3), 311–334 (2006)
Bartlett, P.L., Mendelson, S., Neeman, J.: \(\ell _1\)-regularized linear regression: persistence and oracle inequalities. Probab. Theory Relat. Fields 154, 193–224 (2012)
Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of the Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
Bellec, P.C., Lecué, G., Tsybakov, A.B.: Slope meets lasso: improved oracle bounds and optimality. Ann. Stat. 46(6B), 3603–3642 (2018)
Bunea, F., Tsybakov, A.B., Wegkamp, M.H.: Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1, 169–194 (2007)
Bunea, F., Tsybakov, A.B., Wegkamp, M.H.: Aggregation for Gaussian regression. Ann. Stat. 35(4), 1674–1697 (2007)
Bunea, F., Tsybakov, A.B., Wegkamp, M.H.: Sparse density estimation with \(\ell _1\) penalties. In: Bshouty, N.H., Gentile, C. (Eds.) Learning Theory. COLT 2007. Lecture Notes in Computer Science, vol. 4539. Springer, Berlin (2007)
Bunea, F., Tsybakov, A.B., Wegkamp, M.H.: Aggregation and sparsity via \(\ell _1\)-penalized least squares. In: Lugosi, G., Simon, H.U. (Eds.) Learning Theory. COLT 2006. Lecture Notes in Computer Science, vol. 4005. Springer, Berlin (2006)
Bunea, F., Tsybakov, A.B., Wegkamp, M.H.: Aggregation for regression learning (2004). Preprint at arXiv:math/0410214
Candes, E., Tao, T.: The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35(6), 2313–2351 (2007)
Dupačová, J., Wets, R.J.-B.: Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. Ann. Stat. 16(4), 1517–1549 (1988)
Guigues, V., Juditsky, A., Nemirovski, A.: Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optim. Methods Softw. 32(5), 1033–1058 (2017)
Greenshtein, E.: Best subset selection, persistence in high-dimensional statistical learning and optimization under \(\ell _1\) constraint. Ann. Stat. 34(5), 2367–2386 (2006)
Greenshtein, E., Ritov, Y.: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971–988 (2004)
Homem-de-Mello, T., Bayraksan, G.: Monte Carlo sampling-based methods for stochastic optimization. Surv. Oper. Res. Manag. Sci. 19, 56–85 (2014)
Iusem, A.N., Jofré, A., Thompson, P.: Incremental constraint projection methods for monotone stochastic variational inequalities. Math. Oper. Res. 44(1), 236–263 (2019)
Kim, S., Pasupathy, R., Henderson, S.G.: A guide to sample average approximation. In: Michael, Fu. (ed.) Handbook of Simulation Optimization, International Series in Operations Research & Management Science, vol. 216, pp. 207–243. Springer, New York (2015)
King, A.J., Rockafellar, R.T.: Asymptotic theory for solutions in statistical estimation and stochastic programming. Math. Oper. Res. 18, 148–162 (1993)
King, A.J., Wets, R.J.-B.: Epi-consistency of convex stochastic programs. Stoch. Stoch. Rep. 34, 83–92 (1991)
Koltchinskii, V.: Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Mathematics, vol. 2033. École d'Été de Probabilités de Saint-Flour. Springer, Berlin (2011)
Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011)
Koltchinskii, V.: The Dantzig selector and sparsity oracle inequalities. Bernoulli 15(3), 799–828 (2009)
Koltchinskii, V.: Sparsity in penalized empirical risk minimization. Ann. Inst. H. Poincaré Probab. Stat. 45(1), 7–57 (2009)
Koltchinskii, V.: Sparse recovery in convex hulls via entropy penalization. Ann. Stat. 37(3), 1332–1359 (2009)
Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)
Lecué, G., Mendelson, S.: General nonexact oracle inequalities for classes with subexponential envelope. Ann. Stat. 40(2), 832–860 (2012)
Lecué, G., Mendelson, S.: Sparse recovery under weak moment assumptions. J. Eur. Math. Soc. 19, 881–904 (2017)
Leng, C., Lin, Y., Wahba, G.: A note on the lasso and related procedures in model selection. Stat. Sin. 16, 1273–1284 (2006)
Lounici, K.: Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Stat. 2, 90–102 (2008)
Meinshausen, N., Yu, B.: Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat. 37(1), 246–270 (2009)
Meinshausen, N.: Relaxed lasso. Comput. Stat. Data Anal. 52(1), 374–393 (2007)
Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34(3), 1436–1462 (2006)
Oliveira, R.I.: The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties (2013), preprint at arXiv:1312.2903
Oliveira, R.I.: The lower tail of random quadratic forms with applications to ordinary least squares. Probab. Theory Relat. Fields 166, 1175–1194 (2016)
Oliveira, R.I., Thompson, P.: Sample average approximation with heavier tails I: non-asymptotic bounds with weak assumptions and stochastic constraints. Math. Program. (2022). https://doi.org/10.1007/s10107-022-01810-x
Pflug, G.C.: Asymptotic stochastic programs. Math. Oper. Res. 20, 769–789 (1995)
Panchenko, D.: Symmetrization approach to concentration inequalities for empirical processes. Ann. Probab. 31, 2068–2081 (2003)
Pang, J.-S.: Error bounds in mathematical programming. Math. Program. Ser. B 79(1), 299–332 (1997)
Pflug, G.C.: Stochastic programs and statistical data. Ann. Oper. Res. 85, 59–78 (1999)
Pflug, G.C.: Stochastic optimization and statistical inference. In: Ruszczyński, A., Shapiro, A. (eds.) Handbooks in OR & MS, vol. 10, pp. 427–482. Elsevier (2003)
Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2(3), 21–41 (2000)
Römisch, W.: Stability of stochastic programming problems. In: Ruszczyński, A., Shapiro, A. (eds.) Handbooks in OR & MS, vol. 10, pp. 483–554. Elsevier (2003)
Shapiro, A.: Asymptotic properties of statistical estimators in stochastic programming. Ann. Stat. 17, 841–858 (1989)
Shapiro, A.: Asymptotic analysis of stochastic programs. Ann. Oper. Res. 30, 169–186 (1991)
Shapiro, A.: Monte Carlo sampling methods. In: Ruszczyński, A., Shapiro, A. (eds.) Handbooks in OR & MS, vol. 10, pp. 353–425. Elsevier (2003)
Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. MOS-SIAM Series on Optimization. SIAM, Philadelphia (2009)
Talagrand, M.: Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22, 28–76 (1994)
Talagrand, M.: Upper and Lower Bounds for Stochastic Processes. Springer (2014)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996)
van de Geer, S.A.: High-dimensional generalized linear models and the Lasso. Ann. Stat. 36(2), 614–645 (2008)
Zhang, C.-H., Huang, J.: The sparsity and the bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 36(4), 1567–1594 (2008)
Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
Zhang, T.: Some sharp performance bounds for least squares regression with L1 regularization. Ann. Stat. 37(5A), 2109–2144 (2009)
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)
Acknowledgements
Roberto I. Oliveira has been funded by FAPESP. Philip Thompson was funded by grant STAR-F.10005389.06.001 from the Krannert School of Management, Purdue University.
Appendix
Proof of Proposition 1
Given admissible sequences \(\{{\mathcal {A}}_{1,j}\}_{j\ge 0}\) and \(\{{\mathcal {A}}_{2,j}\}_{j\ge 0}\) for \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\), one may define an admissible sequence \(\{{\mathcal {C}}_j\}_{j\ge 0}\) for \({\mathcal {M}}\) via:
It is easy to see that this is indeed admissible and moreover
Therefore,
or equivalently
The proof finishes when we note that \(\text {\textsf{diam}}({\mathcal {M}})^{\alpha }\le \gamma ^{(\alpha )}_2({\mathcal {M}})\) and take the infimum over admissible sequences. \(\square \)
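For the reader's convenience, the product construction used above can be sketched as follows. This is the standard construction from generic chaining, written under the usual convention that an admissible sequence \(\{{\mathcal {A}}_j\}_{j\ge 0}\) satisfies \(|{\mathcal {A}}_0|=1\) and \(|{\mathcal {A}}_j|\le 2^{2^j}\):

```latex
% Standard product construction (sketch): given admissible sequences
% for $\mathcal{M}_1$ and $\mathcal{M}_2$, set
\mathcal{C}_0 := \{\mathcal{M}\},\qquad
\mathcal{C}_j := \bigl\{A\times B \;:\; A\in\mathcal{A}_{1,j-1},\;
                 B\in\mathcal{A}_{2,j-1}\bigr\},\quad j\ge 1,
% which is admissible since
% $|\mathcal{C}_j| \le 2^{2^{j-1}}\cdot 2^{2^{j-1}} = 2^{2^j}$.
```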
We recall the following fundamental result due to Panchenko. It establishes a sub-Gaussian tail for the deviation of a heavy-tailed empirical process around its mean after a proper self-normalization by a random quantity \({{\widehat{V}}}\).
Theorem 4
(Panchenko’s inequality [39]) Let \({\mathcal {F}}\) be a finite family of measurable functions \(g:\Xi \rightarrow {\mathbb {R}}\) such that \({\textbf{P}}g^2(\cdot )<\infty \). Let also \(\{\xi _j\}_{j=1}^N\) and \(\{\eta _j\}_{j=1}^N\) be both i.i.d. samples drawn from a distribution \({\textbf{P}}\) over \(\Xi \) which are independent of each other. Define
Then, for all \(t>0\),
The following result is a direct consequence of Theorem 4 applied to the singleton class \({\mathcal {F}}:=\{g\}\). It provides a sub-Gaussian tail for any random variable with finite second moment in terms of its variance and empirical variance.
Lemma 8
(Sub-Gaussian tail for self-normalized sums) Suppose \(\{\xi _j\}_{j=1}^N\) is an i.i.d. sample of a distribution \({\textbf{P}}\) over \(\Xi \) and denote by \({{\widehat{{\textbf{P}}}}}\) the corresponding empirical distribution. Then for any measurable function \(g:\Xi \rightarrow {\mathbb {R}}\) satisfying \({\textbf{P}}g(\cdot )^2<\infty \) and for any \(t>0\),
Finally, we present a sub-Gaussian lower tail for nonnegative random variables.
Lemma 9
(Sub-Gaussian lower tail for nonnegative random variables) Let \(\{Z_j\}_{j=1}^N\) be i.i.d. nonnegative random variables. Assume \(a\in (1,2]\) and \(0<{\mathbb {E}}[Z_1^a]<\infty \). Then, for all \(\epsilon >0\),
Proof
Let \(\theta ,\epsilon >0\). By the usual “Bernstein trick”, we get
It is a simple calculus exercise to show that \( \forall x\ge 0,e^{-x}\le 1-x+\frac{x^a}{a}. \) Applying this with \(x:=\theta Z_1\), we obtain
where the second inequality follows from the relation \(1+x\le e^x\) for all \(x\in {\mathbb {R}}\). We plug this back into (57) and get, for all \(\theta >0\),
Since \(a\in (1,2]\), we may actually minimize the above bound over \(\theta >0\). The minimum is attained at \( \theta _*:=\left( \frac{\epsilon {\mathbb {E}}[Z_1]}{{\mathbb {E}}[Z_1^a]}\right) ^{\frac{1}{a-1}}. \) To finish the proof, we plug this into (58) and notice that
using that \(1+\frac{1}{a-1}=\frac{a}{a-1}\). \(\square \)
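As a quick sanity check of the two elementary facts used in the proof, the following sketch (with illustrative, assumed values for \(a\), \(\epsilon \) and the moments \({\mathbb {E}}[Z_1]\), \({\mathbb {E}}[Z_1^a]\)) verifies numerically that \(e^{-x}\le 1-x+\frac{x^a}{a}\) for \(x\ge 0\), and that \(\theta _*\) attains the closed-form minimum of the per-sample exponent:

```python
import math

# Illustrative, assumed values: a in (1, 2], epsilon > 0, and the
# moments mu = E[Z_1], mu_a = E[Z_1^a] of a nonnegative variable Z_1.
a, eps = 1.5, 0.3
mu, mu_a = 1.0, 2.0

# Elementary inequality used in the proof: exp(-x) <= 1 - x + x^a / a, x >= 0.
for x in [0.0, 0.1, 0.5, 1.0, 3.0, 10.0]:
    assert math.exp(-x) <= 1 - x + x**a / a + 1e-12

# Per-sample exponent after the "Bernstein trick":
#   f(theta) = -theta * eps * mu + theta^a * mu_a / a,
# which is convex in theta > 0 since a > 1, so the stationary point
#   theta_* = (eps * mu / mu_a)^(1/(a-1))
# is the global minimizer.
f = lambda t: -t * eps * mu + t**a * mu_a / a
theta_star = (eps * mu / mu_a) ** (1.0 / (a - 1.0))
for t in [0.5 * theta_star, 0.9 * theta_star, 1.1 * theta_star, 2.0 * theta_star]:
    assert f(theta_star) <= f(t)

# Closed form of the minimum, using 1 + 1/(a-1) = a/(a-1):
#   f(theta_*) = -(1 - 1/a) * (eps * mu)^(a/(a-1)) / mu_a^(1/(a-1)).
closed = -(1.0 - 1.0 / a) * (eps * mu) ** (a / (a - 1.0)) / mu_a ** (1.0 / (a - 1.0))
assert abs(f(theta_star) - closed) < 1e-12
```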
Oliveira, R.I., Thompson, P. Sample average approximation with heavier tails II: localization in stochastic convex optimization and persistence results for the Lasso. Math. Program. 199, 49–86 (2023). https://doi.org/10.1007/s10107-023-01940-w