Sharp oracle inequalities for low-complexity priors

Abstract

In this paper, we consider a high-dimensional statistical estimation problem in which the number of parameters is comparable to or larger than the sample size. We present a unified analysis of the performance guarantees of exponential weighted aggregation and penalized estimators with a general class of data losses and priors which encourage objects conforming to some notion of simplicity/complexity. More precisely, we show that these two estimators satisfy sharp oracle inequalities for prediction, ensuring their good theoretical performance. We also highlight the differences between them. When the noise is random, we provide oracle inequalities in probability using concentration inequalities. These results are then applied to several instances including the Lasso, the group Lasso, their analysis-type counterparts, the \(\ell _\infty \) and the nuclear norm penalties. All our estimators can be efficiently implemented using proximal splitting algorithms.

Notes

  1. This is for instance the case if \(\varvec{X}\) is drawn from the standard Gaussian ensemble and \(K=O(n)\) (the \(O(\cdot )\) is in fact even \(o(\cdot )\), since the remainder term is supposed to go to 0 as \(n \rightarrow +\infty \)). In this case, classical concentration bounds on the largest eigenvalue of a Wishart matrix allow one to conclude that \(s(\varvec{X}) = O(1+\sqrt{K/n}) = O(1)\) with high probability; one standard form of such a bound is recalled below for illustration.
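
     For concreteness, a standard Gaussian concentration bound (recalled here purely as an illustration, not reproduced from the paper) reads as follows: if \(\varvec{X}\in \mathbb {R}^{n \times K}\) has i.i.d. standard Gaussian entries, then for every \(t > 0\),

     $$\begin{aligned} \mathbb {P}\big (\sigma _{\max }(\varvec{X}) \ge \sqrt{n} + \sqrt{K} + t\big ) \le \mathrm {e}^{-t^2/2} , \end{aligned}$$

     where \(\sigma _{\max }(\varvec{X})\) denotes the largest singular value. Hence \(\sigma _{\max }(\varvec{X})/\sqrt{n} \le 1 + \sqrt{K/n} + t/\sqrt{n}\) with probability at least \(1-\mathrm {e}^{-t^2/2}\), which yields the \(O(1+\sqrt{K/n})\) behaviour quoted above when \(K=O(n)\).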

References

  • Bach, F. (2008). Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9, 1179–1225.

  • Bakin, S. (1999). Adaptive regression and model selection in data mining problems. PhD Thesis, Australian National University, Canberra.

  • Bauschke, H. H., Combettes, P. L. (2011). Convex analysis and monotone operator theory in Hilbert spaces (1st ed.). New York: Springer.

  • Bellec, P. (2014). Concentration of quadratic forms under a Bernstein moment assumption. Technical report 1, ENSAE, France.

  • Bickel, P. J., Ritov, Y., Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4), 1705–1732.

  • Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candès, E. J. (2014). Slope—Adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3), 1103–1140.

  • Bühlmann, P., van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications, springer series in statistics. Berlin, Heidelberg: Springer.

  • Candès, E., Plan, Y. (2009). Near-ideal model selection by \(\ell _1\) minimization. The Annals of Statistics, 37(5A), 2145–2177.

  • Candès, E. J., Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6), 717–772.

  • Candès, E. J., Li, X., Ma, Y., Wright, J. (2011). Robust principal component analysis? Journal of the ACM, 58(3), 11:1–11:37.

  • Candès, E. J., Strohmer, T., Voroninski, V. (2013). Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8), 1241–1274.

  • Castillo, I., Schmidt-Hieber, J., van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5), 1986–2018.

  • Catoni, O. (2003). A PAC-Bayesian approach to adaptive classification. Technical report PMA-840, Laboratoire de Probabilités et de modèles aléatoires, Paris, France.

  • Catoni, O. (2007). PAC-Bayesian supervised classification (the thermodynamics of statistical learning), lecture notes-monograph series, Vol. 56. Beachwood, OH: Institute of Mathematical Statistics.

  • Chandrasekaran, V., Recht, B., Parrilo, P. A., Willsky, A. (2012). The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6), 805–849.

  • Chen, S., Donoho, D., Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1), 33–61.

  • Chen, X., Lin, Q., Kim, S., Carbonell, J. G., Xing, E. P. (2010). An efficient proximal-gradient method for general structured sparse learning. Preprint arXiv:1005.4717.

  • Coste, M. (2000). An introduction to semialgebraic geometry. Dottorato di ricerca in matematica, Università di Pisa, Dipartimento di Matematica. Pisa: Istituti Editoriali e Poligrafici Internazionali.

  • Dalalyan, A., Tsybakov, A. (2009). PAC-Bayesian bounds for the expected error of aggregation by exponential weights. Technical report 1, Université Paris 6, CREST and CERTIS, Ecole des Ponts ParisTech, Paris, personal communication.

  • Dalalyan, A., Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1–2), 39–61. https://doi.org/10.1007/s10994-008-5051-0.

  • Dalalyan, A. S., Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th annual conference on learning theory (pp. 97–111). Springer, Berlin, Heidelberg, COLT’07. http://dl.acm.org/citation.cfm?id=1768841.1768854.

  • Dalalyan, A. S., Tsybakov, A. B. (2012). Sparse regression learning by aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences, 78(5), 1423–1443. https://doi.org/10.1016/j.jcss.2011.12.023.

  • Dalalyan, A. S., Hebiri, M., Lederer, J. (2017). On the prediction performance of the Lasso. Bernoulli, 23(1), 552–581. https://doi.org/10.3150/15-BEJ756.

  • Dalalyan, A. S., Grappin, E., Paris, Q. (2018). On the exponentially weighted aggregate with the Laplace prior. The Annals of Statistics, 46(5), 2452–2478.

  • Daniilidis, A., Drusvyatskiy, D., Lewis, A. S. (2014). Orthogonal invariance and identifiability. SIAM Journal on Matrix Analysis and Applications, 35(2), 580–598. https://doi.org/10.1137/130916710.

  • Donoho, D. (2006). For most large underdetermined systems of linear equations the minimal \(\ell ^1\)-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797–829.

  • Donoho, D., Tanner, J. (2010). Counting the faces of randomly-projected hypercubes and orthants. Discrete and Computational Geometry, 43(3), 522–541.

  • Durmus, A., Moulines, E., Pereyra, M. (2016). Sampling from convex non continuously differentiable functions, when Moreau meets Langevin. https://hal.archives-ouvertes.fr/hal-01267115, preprint hal-01267115.

  • Duy Luu, T., Fadili, J. M., Chesneau, C. (2016). PAC-Bayesian risk bounds for group-analysis sparse regression by exponential weighting. Technical report, hal-01367742. https://hal.archives-ouvertes.fr/hal-01367742.

  • Fadili, M. J., Peyré, G., Vaiter, S., Deledalle, C., Salmon, J. (2013). Stable recovery with analysis decomposable priors. In Sampling theory and applications (SAMPTA) (pp. 113–116). Bremen: Springer.

  • Fazel, M., Hindi, H., Boyd, S. P. (2001). A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference (ACC), IEEE, Arlington, USA (Vol. 6, pp. 4734–4739).

  • Guedj, B., Alquier, P. (2013). PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics, 7, 264–291. https://doi.org/10.1214/13-EJS771.

  • Hiriart-Urruty, J. B., Lemaréchal, C. (2001). Convex analysis and minimization algorithms, Grundlehren der mathematischen Wissenschaften, vol I and II. Berlin, Heidelberg: Springer.

  • Jacob, L., Obozinski, G., Vert, J. P. (2009). Group lasso with overlap and graph lasso. In A. P. Danyluk, L. Bottou, M. L. Littman (Eds.), 26th International conference on machine learning (ICML) (Vol. 382, p. 55). Montreal; ACM Press.

  • Jégou, H., Furon, T., Fuchs, J. J. (2012). Anti-sparse coding for approximate nearest neighbor search. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2029–2032). Kyoto: IEEE.

  • Koltchinskii, V. (2008). Oracle inequalities in empirical risk minimization and sparse recovery problems. In 38th Summer school on probability theory and statistics saint-flour. Lecture Notes in Mathematics (Vol. 2033). New York: Springer.

  • Koltchinskii, V., Lounici, K., Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5), 2302–2329. https://doi.org/10.1214/11-AOS894.

  • Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. The Annals of Statistics, 35(4), 1698–1721. https://doi.org/10.1214/009053607000000055.

  • Ledoux, M. (2001). The concentration of measure phenomenon. Mathematical surveys and monographs. Providence, RI: American Mathematical Society.

  • Leung, G., Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory, 52(8), 3396–3410.

  • Lounici, K., Pontil, M., van de Geer, S., Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4), 2164–2204. https://doi.org/10.1214/11-AOS896.

  • Lyubarskii, Y., Vershynin, R. (2010). Uncertainty principles and vector quantization. IEEE Transactions on Information Theory, 56(7), 3491–3501.

  • Mai, T. T., Alquier, P. (2015). A bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electronic Journal of Statistics, 9(1), 823–841. https://doi.org/10.1214/15-EJS1020.

  • Massart, P. (2007). Concentration inequalities and model selection. In Summer school on probability theory and statistics Saint-Flour XXXIII—2003. New York: Springer.

  • Negahban, S., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.

  • Nemirovski, A. (2000). Topics in non-parametric statistics. In M. Emery, A. Nemirovski & D. Voiculescu (Eds.), Summer school on probability theory and statistics Saint-Flour XXVIII-1998, Lecture notes in mathematics, Vol. 1738, pp. 87–285. New York: Springer.

  • Osborne, M., Presnell, B., Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.

  • Peyré, G., Fadili, J., Chesneau, C. (2011). Adaptive structured block sparsity via dyadic partitioning. In 19th European signal processing conference (EUSIPCO). Barcelona, Spain: Springer.

  • Raskutti, G., Wainwright, M. J., Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over \(\ell _q\) -balls. IEEE Transactions on Information Theory, 57(10), 6976–6994. https://doi.org/10.1109/TIT.2011.2165799.

  • Recht, B., Fazel, M., Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 471–501.

  • Rigollet, P., Tsybakov, A. (2007). Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16(3), 260–280. https://doi.org/10.3103/S1066530707030052.

  • Rigollet, P., Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2), 731–771. https://doi.org/10.1214/10-AOS854.

  • Rigollet, P., Tsybakov, A. B. (2012). Sparse estimation by exponential weighting. Statistical Science, 27(4), 558–575. https://doi.org/10.1214/12-STS393.

  • Rockafellar, R. (1996). Convex analysis, Vol. 28. Princeton, NJ: Princeton University Press.

  • Rockafellar, R. T., Wets, R. (1998). Variational analysis, Vol. 317. New York: Springer.

  • Rudelson, M., Vershynin, R. (2008). On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8), 1025–1045.

  • Rudin, L., Osher, S., Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4), 259–268.

  • Studer, C., Yin, W., Baraniuk, R. G. (2012). Signal representations with minimum \(\ell _\infty \)-norm. In 50th Annual Allerton conference on communication, control and computing. Champaign-Urbana, USA: IEEE.

  • Su, W., Candès, E. J. (2015). Slope is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44(3), 1038–1068.

  • Sun, T., Zhang, C. H. (2012). Scaled sparse linear regression. Biometrika, 99(4), 879. https://doi.org/10.1093/biomet/ass043.

  • Suzuki, T. (2015). Convergence rate of Bayesian tensor estimator and its minimax optimality. In 32nd International conference on machine learning (ICML) (Vol. 37, pp. 1273–1282). Lille, France: ACM Press.

  • Talagrand, M. (2005). The generic chaining. Upper and lower bounds of stochastic processes. Berlin: Springer.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B Methodological, 58(1), 267–288.

  • Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. (2005). Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.

  • Tropp, J. (2015a). Convex recovery of a structured signal from independent random linear measurements. In G. Pfander (Ed.), Sampling theory, a renaissance, applied and numerical harmonic analysis (ANHA). Berlin, Heidelberg: Birkhäuser/Springer.

  • Tropp, J. A. (2015b). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1–2), 1–230. https://doi.org/10.1561/2200000048.

  • Tsybakov, A. B. (2008). Introduction to nonparametric estimation (1st ed.). New York: Springer.

  • Vaiter, S., Golbabaee, M., Fadili, J., Peyré, G. (2015a). Model selection with low complexity priors. Information and Inference: A Journal of the IMA, 4(3), 230.

  • Vaiter, S., Peyré, G., Fadili, M. J. (2015b). Low complexity regularization of linear inverse problems. In G. Pfander (Ed.), Sampling theory, a renaissance, applied and numerical harmonic analysis (ANHA). Berlin, Heidelberg: Birkhäuser/Springer.

  • Vaiter, S., Deledalle, C., Fadili, M. J., Peyré, G., Dossal, C. (2017). The degrees of freedom of partly smooth regularizers. Annals of the Institute of Statistical Mathematics, 69(4), 791–832.

  • Vaiter, S., Peyré, G., Fadili, M. J. (2018). Model consistency of partly smooth regularizers. IEEE Transactions on Information Theory, 64(3), 1725–1737.

  • van de Geer, S. (2008). High-dimensional generalized linear models and the Lasso. The Annals of Statistics, 36, 614–645.

  • van de Geer, S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1), 72–86. https://doi.org/10.1111/sjos.12032.

  • van de Geer, S., Buhlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3, 1360–1392. https://doi.org/10.1214/09-EJS506.

  • van de Geer, S., Lederer, J. (2013). The Bernstein–Orlicz norm and deviation inequalities. Probability Theory and Related Fields, 157, 225–250.

  • Vershynin, R. (2015). Estimation in high dimension : A geometric perspective. In G. Pfander (Ed.), Sampling theory, a renaissance, applied and numerical harmonic analysis (ANHA). Berlin, Heidelberg: Birkhäuser/Springer.

  • Verzelen, N. (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics, 6, 38–90. https://doi.org/10.1214/12-ejs666.

  • Wang, Z., Paterlini, S., Gao, F., Yang, Y. (2014). Adaptive minimax regression estimation over sparse \(\ell _q\)-hulls. Journal of Machine Learning Research, 15(1), 1675–1711. http://dl.acm.org/citation.cfm?id=2627435.2638589.

  • Wei, F., Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli, 16(4), 1369–1384.

  • Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli, 10(1), 25–47.

  • Ye, F., Zhang, C. H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the \(\ell _q\) loss in \(\ell _r\) balls. Journal of Machine Learning Research, 11, 3519–3540. http://dl.acm.org/citation.cfm?id=1756006.1953043.

  • Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.

Acknowledgements

This work was supported by Conseil Régional de Basse-Normandie and partly by Institut Universitaire de France.

Author information

Corresponding author

Correspondence to Jalal Fadili.

Appendices

Prerequisites from convex analysis

We here collect some ingredients from convex analysis that are essential to our exposition.

Monotone conjugate

Lemma 2

Let g be a non-decreasing function on \(\mathbb {R}_+\) that vanishes at 0. Then the following holds:

  1. (i)

    \(g^+\) is a proper closed convex and non-decreasing function on \(\mathbb {R}_+\) that vanishes at 0.

  2. (ii)

    If g is also closed and convex, then \(g^{++}=g\).

  3. (iii)

    Let \(f: t \in \mathbb {R}\mapsto g(|t|)\), where g is finite-valued, strictly convex and strongly coercive, and assume that f is differentiable on \(\mathbb {R}\). Then \(g^+\) is likewise finite-valued, strictly convex and strongly coercive, and \(f^*=g^+ \circ |\cdot |\) is differentiable on \(\mathbb {R}\). In particular, both g and \(g^+\) are strictly increasing on \(\mathbb {R}_+\).

Proof

  1. (i)

    By Bauschke and Combettes (2011, Proposition 13.11), \(g^+\) is a closed convex function. We have \(g^+(0)=\sup _{t \ge 0} \big (t \cdot 0 - g(t)\big )=-\inf _{t \ge 0} g(t)\). Since g is non-decreasing and \(g(0)=0\), we get \(\inf _{t \ge 0} g(t)=g(0)=0\), and thus \(g^+(0)=0\). In addition, by (5), we have \(g^+(a) \ge a \cdot 0 - g(0)=0\), \(\forall a \in \mathbb {R}_+\). This shows that \(g^+\) is nonnegative and \({{\mathrm{dom}}}(g^+) \ne \emptyset \), and in turn, that it is also proper.

    Let \(a, b \in \mathbb {R}_+\) be such that \(a < b\). Then

    $$\begin{aligned} g^+(a)-g^+(b) &= \Big (\sup _{t \ge 0} t a - g(t)\Big ) - \Big (\sup _{t' \ge 0} t' b - g(t')\Big ) \\ &\le \sup _{t \ge 0} \big (t a - g(t) - t b + g(t)\big ) \\ &= \sup _{t \ge 0} t(a-b) = 0. \end{aligned}$$

    That is, \(g^+\) is non-decreasing on \(\mathbb {R}_+\).

  2. (ii)

    This follows from Rockafellar (1996, Theorem 12.4).

  3. (iii)

    By definition of f, f is a finite-valued function on \(\mathbb {R}\), strictly convex, differentiable and strongly coercive. It then follows from Hiriart-Urruty and Lemaréchal (2001, Corollary X.4.1.4) that \(f^*\) enjoys the same properties. In turn, using the fact that both f and \(f^*\) are even, we deduce that \(g^+\) is strongly coercive, and that strict convexity of f (resp. \(f^*\)) is equivalent to that of g (resp. \(g^+\)). Altogether, this shows the first claim. We now prove that g vanishes only at 0 (and similarly for \(g^+\)). As g is non-decreasing and strictly convex, we have, for any \(\rho \in ]0,1[\) and \(a, b \in \mathbb {R}_+\) such that \(a < b\),

    $$\begin{aligned} g(a) \le g(\rho a + (1-\rho ) b) < \rho g(a) + (1-\rho ) g(b) \le \rho g(b) + (1-\rho ) g(b) = g(b) . \end{aligned}$$

\(\square \)
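
To fix ideas, here is a simple illustration of the monotone conjugate (our own example, not taken from the main text): for \(q > 1\) and \(g(t) = t^q/q\) on \(\mathbb {R}_+\), a direct computation gives

$$\begin{aligned} g^+(a) = \sup _{t \ge 0} \big (t a - t^q/q\big ) = a^{q'}/q' , \qquad q' = q/(q-1) , \end{aligned}$$

so that \(g^+\) is proper, closed, convex, non-decreasing and vanishes at 0, \(g^{++}=g\), and both g and \(g^+\) are strictly increasing on \(\mathbb {R}_+\), in agreement with Lemma 2.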

Support function

The support function of \({\mathcal C}\subset \mathbb {R}^p\) is

$$\begin{aligned} \sigma _{{\mathcal C}}({\varvec{\omega }})=\sup _{{\varvec{\theta }}\in {\mathcal C}} \langle {\varvec{\omega }},{\varvec{\theta }}\rangle . \end{aligned}$$

We recall the following properties, whose proofs can be found in, e.g., Rockafellar (1996) and Hiriart-Urruty and Lemaréchal (2001); a simple illustration is given after the lemma.

Lemma 3

Let \({\mathcal C}\) be a non-empty set.

  1. (i)

    \(\sigma _{\mathcal C}\) is proper lower semicontinuous (lsc) and sublinear.

  2. (ii)

    \(\sigma _{\mathcal C}\) is finite-valued if and only if \({\mathcal C}\) is bounded.

  3. (iii)

    If \(0 \in {\mathcal C}\), then \(\sigma _{\mathcal C}\) is nonnegative.

  4. (iv)

    If \({\mathcal C}\) is convex and compact with \(0 \in {{\mathrm{int}}}({\mathcal C})\), then \(\sigma _{\mathcal C}\) is finite-valued and coercive.
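
As a concrete illustration (ours, not from the paper), take \({\mathcal C}\) to be the unit \(\ell _1\) ball \(B_1\) or the unit \(\ell _2\) ball \(B_2\). Then

$$\begin{aligned} \sigma _{B_1}({\varvec{\omega }}) = \max _{1 \le i \le p} |{\varvec{\omega }}_i| = \left\| {\varvec{\omega }}\right\| _{\infty } \quad \text {and} \quad \sigma _{B_2}({\varvec{\omega }}) = \left\| {\varvec{\omega }}\right\| _{2} . \end{aligned}$$

Both sets are convex and compact and contain the origin in their interior, so that both support functions are nonnegative, finite-valued and coercive, as predicted by Lemma 3(iii)–(iv).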

Gauges and polars

Definition 3

(Polar set) Let \({\mathcal C}\) be a non-empty convex set. The set \({\mathcal C}^\circ \) given by

$$\begin{aligned} {\mathcal C}^\circ = \big \{ {\varvec{\eta }}\in \mathbb {R}^p \;:\; \langle {\varvec{\eta }},{\varvec{\theta }}\rangle \le 1 \quad \forall {\varvec{\theta }}\in {\mathcal C} \big \} \end{aligned}$$

is called the polar of \({\mathcal C}\).

The set \({\mathcal C}^\circ \) is closed convex and contains the origin. When \({\mathcal C}\) is, in addition, closed and contains the origin, it coincides with its bipolar, i.e., \({\mathcal C}^{\circ \circ }={\mathcal C}\).

Let \({\mathcal C}\subseteq \mathbb {R}^p\) be a non-empty closed convex set containing the origin. The gauge of \({\mathcal C}\) is the function \(\gamma _{\mathcal C}\) defined on \(\mathbb {R}^p\) by

$$\begin{aligned} \gamma _{\mathcal C}({\varvec{\theta }}) = \inf \big \{ \lambda > 0 \;:\; {\varvec{\theta }}\in \lambda {\mathcal C} \big \} . \end{aligned}$$

As usual, \(\gamma _{\mathcal C}({\varvec{\theta }}) = + \infty \) when the infimum is taken over the empty set, i.e., when \({\varvec{\theta }}\notin \lambda {\mathcal C}\) for every \(\lambda > 0\).

Lemma 4 hereafter recaps the main properties of a gauge that we need. In particular, (ii) is a fundamental result of convex analysis stating that there is a one-to-one correspondence between gauge functions and closed convex sets containing the origin. This allows one to identify sets with their gauges, and vice versa.

Lemma 4

  1. (i)

    \(\gamma _{\mathcal C}\) is a nonnegative, lsc and sublinear function.

  2. (ii)

    \({\mathcal C}\) is the unique closed convex set containing the origin such that

    $$\begin{aligned} {\mathcal C}= \big \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; \gamma _{\mathcal C}({\varvec{\theta }}) \le 1 \big \} . \end{aligned}$$
  3. (iii)

    \(\gamma _{\mathcal C}\) is finite-valued if, and only if, \(0 \in {{\mathrm{int}}}({\mathcal C})\), in which case \(\gamma _{\mathcal C}\) is Lipschitz continuous.

  4. (iv)

    \(\gamma _{\mathcal C}\) is finite-valued and coercive if, and only if, \({\mathcal C}\) is compact and \(0 \in {{\mathrm{int}}}({\mathcal C})\).

See Vaiter et al. (2015a) for the proof.

Observe that, thanks to sublinearity, the local Lipschitz continuity enjoyed by any finite-valued convex function is strengthened to global Lipschitz continuity. Moreover, \(\gamma _{\mathcal C}\) is a norm, having \({\mathcal C}\) as its unit ball, if and only if \({\mathcal C}\) is bounded, symmetric and has non-empty interior.
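
Two simple examples may help here (ours, for illustration). The gauge of the unit \(\ell _1\) ball is \(\gamma _{B_1} = \left\| \cdot \right\| _{1}\), a norm, since \(B_1\) is compact, symmetric and has non-empty interior. In contrast, the gauge of the simplex \(\Delta = \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; {\varvec{\theta }}\ge 0, \sum _i {\varvec{\theta }}_i \le 1 \}\) is

$$\begin{aligned} \gamma _{\Delta }({\varvec{\theta }}) = \sum _i {\varvec{\theta }}_i \text { if } {\varvec{\theta }}\ge 0 , \quad \text {and} \quad \gamma _{\Delta }({\varvec{\theta }}) = +\infty \text { otherwise} , \end{aligned}$$

which is not finite-valued (0 is not in the interior of \(\Delta \)) and not a norm (\(\Delta \) is not symmetric), in line with Lemma 4(iii) and the remark above.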

We now define the polar gauge.

Definition 4

(Polar Gauge) The polar of a gauge \(\gamma _{\mathcal C}\) is the function \(\gamma _{\mathcal C}^\circ \) defined by

$$\begin{aligned} \gamma _{\mathcal C}^\circ ({\varvec{\omega }}) = \inf \big \{ \mu \ge 0 \;:\; \langle {\varvec{\theta }},{\varvec{\omega }}\rangle \le \mu \gamma _{\mathcal C}({\varvec{\theta }}), \forall {\varvec{\theta }} \big \} . \end{aligned}$$

An immediate consequence is that gauges polar to each other have the property

$$\begin{aligned} \langle {\varvec{\theta }},{\varvec{u}}\rangle \le \gamma _{\mathcal C}({\varvec{\theta }}) \gamma _{\mathcal C}^\circ ({\varvec{u}}) \quad \forall ({\varvec{\theta }},{\varvec{u}}) \in {{\mathrm{dom}}}(\gamma _{\mathcal C}) \times {{\mathrm{dom}}}(\gamma _{\mathcal C}^\circ ) , \end{aligned}$$
(42)

just as dual norms satisfy a duality inequality. In fact, polar pairs of gauges correspond to the best inequalities of this type.
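
For instance (an illustration on our part), if \({\mathcal C}\) is the unit \(\ell _1\) ball, then \(\gamma _{\mathcal C}= \left\| \cdot \right\| _{1}\) and \(\gamma _{\mathcal C}^\circ = \left\| \cdot \right\| _{\infty }\), so that (42) reduces to the Hölder-type inequality

$$\begin{aligned} \langle {\varvec{\theta }},{\varvec{u}}\rangle \le \left\| {\varvec{\theta }}\right\| _{1} \left\| {\varvec{u}}\right\| _{\infty } , \end{aligned}$$

which is indeed the tightest inequality of this form.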

Lemma 5

Let \({\mathcal C}\subseteq \mathbb {R}^p\) be a closed convex set containing the origin. Then,

  1. (ii)

    \(\gamma _{\mathcal C}^\circ \) is a gauge function and \(\gamma _{\mathcal C}^{\circ \circ }=\gamma _{\mathcal C}\).

  2. (iii)

    \(\gamma _{\mathcal C}^\circ =\gamma _{{\mathcal C}^\circ }\), or equivalently

    $$\begin{aligned} {\mathcal C}^\circ = \big \{ {\varvec{\theta }}\in \mathbb {R}^p \;:\; \gamma _{\mathcal C}^\circ ({\varvec{\theta }}) \le 1 \big \} . \end{aligned}$$
  3. (iv)

    The gauge of \({\mathcal C}\) and the support function of \({\mathcal C}\) are mutually polar, i.e.,

    $$\begin{aligned} \gamma _{\mathcal C}= \sigma _{{\mathcal C}^\circ } \quad \text {and} \quad \gamma _{{\mathcal C}^\circ } = \sigma _{\mathcal C}~. \end{aligned}$$

See Rockafellar (1996), Hiriart-Urruty and Lemaréchal (2001) and Vaiter et al. (2015a) for the proof.
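
A matrix-valued illustration, relevant to the nuclear norm instance treated in the paper (and added here only as an example): if \({\mathcal C}\) is the unit ball of the nuclear norm \(\left\| \cdot \right\| _{*}\) (the sum of the singular values), then

$$\begin{aligned} \gamma _{\mathcal C}= \left\| \cdot \right\| _{*} \quad \text {and} \quad \gamma _{\mathcal C}^\circ = \sigma _{\mathcal C}= \left\| \cdot \right\| _{\mathrm {op}} , \end{aligned}$$

the operator (largest singular value) norm; this is the familiar duality between the nuclear and spectral norms rephrased in the language of Lemma 5.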

Expectation of the inner product

We start with some definitions and notations that will be used in the proof. For a non-empty closed convex set \({\mathcal C}\subseteq \mathbb {R}^p\), we denote by \({\big ({\mathcal C}\big )}^0\) its minimal selection, i.e., the element of minimal norm in \({\mathcal C}\); this element is of course unique. For a proper lsc and convex function f and \(\gamma > 0\), its Moreau envelope (or Moreau–Yosida regularization) is defined by

$$\begin{aligned} {}^{\gamma }f({\varvec{\theta }})&{\mathop {=}\limits ^{\text { def}}}\min _{\overline{{\varvec{\theta }}}\in \mathbb {R}^p} \frac{1}{2\gamma }\big \Vert \overline{{\varvec{\theta }}}- {\varvec{\theta }}\big \Vert _{2}^2 + f(\overline{{\varvec{\theta }}}). \end{aligned}$$

The Moreau envelope enjoys several important properties that we collect in the following lemma.

Lemma 6

Let f be a finite-valued and convex function. Then

  1. (i)

    \({\left( {}^{\gamma }f({\varvec{\theta }})\right) }_{\gamma > 0}\) is a decreasing net, and \(\forall {\varvec{\theta }}\in \mathbb {R}^p\), \({}^{\gamma }f({\varvec{\theta }}) \nearrow f({\varvec{\theta }})\) as \(\gamma \searrow 0\).

  2. (ii)

    \({}^{\gamma }f \in C^1(\mathbb {R}^p)\) with \(\gamma ^{-1}\)-Lipschitz continuous gradient.

  3. (iii)

    \(\forall {\varvec{\theta }}\in \mathbb {R}^p\), \(\nabla {}^{\gamma }f({\varvec{\theta }}) \rightarrow {\big (\partial f({\varvec{\theta }})\big )}^0\) and \(\big \Vert \nabla {}^{\gamma }f({\varvec{\theta }})\big \Vert _{2} \nearrow \big \Vert {\big (\partial f({\varvec{\theta }})\big )}^0\big \Vert _{2}\) as \(\gamma \searrow 0\).

Proof

(i) Bauschke and Combettes (2011, Proposition 12.32). (ii) Bauschke and Combettes (2011, Proposition 12.29). (iii) Since f is finite-valued and convex, it is subdifferentiable everywhere and its subdifferential is a maximal monotone operator with full domain \(\mathbb {R}^p\), and the result follows from Bauschke and Combettes (2011, Corollary 23.46(i)). \(\square \)
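
Before moving on, a standard one-dimensional example (ours, for illustration) makes Lemma 6 concrete. For \(f = |\cdot |\) on \(\mathbb {R}\), the Moreau envelope is the Huber function

$$\begin{aligned} {}^{\gamma }f(t) = t^2/(2\gamma ) \text { if } |t| \le \gamma , \qquad {}^{\gamma }f(t) = |t| - \gamma /2 \text { otherwise}, \qquad \nabla {}^{\gamma }f(t) = \max \big (-1,\min (1,t/\gamma )\big ) . \end{aligned}$$

One checks directly that \({}^{\gamma }f(t) \nearrow |t|\) as \(\gamma \searrow 0\), that \(\nabla {}^{\gamma }f\) is \(\gamma ^{-1}\)-Lipschitz, and that \(\nabla {}^{\gamma }f(t) \rightarrow {{\mathrm{sign}}}(t)\) for \(t \ne 0\) while \(\nabla {}^{\gamma }f(0) = 0 = {\big (\partial |\cdot |(0)\big )}^0\), as stated in items (i)–(iii).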

We are now equipped to prove the following important result, using Moreau–Yosida regularization. An alternative proof could be based on mollifiers for approximating subdifferentials. The result turns out to be instrumental in studying EWA in the low-temperature regime for general penalties.

Proposition 5

Consider the density \(\mu _n\) in (2), where

  1. (a)

    F satisfies Assumptions (H.1)–(H.2);

  2. (b)

    J is a finite-valued lower-bounded convex function, and \(\exists R > 0\) and \(\rho \ge 0\), such that \(\forall {\varvec{\theta }}\in \mathbb {R}^p\), \(\big \Vert {\big (\partial J({\varvec{\theta }})\big )}^0\big \Vert _{2} \le R \left\| {\varvec{\theta }}\right\| _{2}^\rho \);

  3. (c)

    and \(V_n\) is coercive.

Then, \(\forall \overline{{\varvec{\theta }}}\in \mathbb {R}^p\),

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle {\big (\partial V_n({\varvec{\theta }})\big )}^0,\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right] = -p\beta . \end{aligned}$$

This result of course covers the situation where J fulfills (H.3). In this case, since \(\partial J({\varvec{\theta }}) \subset {\mathcal C}^\circ \) by Theorem 1(i), we have \(\rho =0\) and \(R={{\mathrm{diam}}}({\mathcal C}^\circ )\), the diameter of the convex compact set \({\mathcal C}^\circ \) containing the origin. It can be shown that, when \(F(\cdot ,\varvec{y})\) is strongly coercive, the coercivity assumption (c) can be equivalently stated as \(J_{\infty }({\varvec{\theta }}) > 0\), \(\forall {\varvec{\theta }}\in \ker (\varvec{X}) \setminus \left\{ 0 \right\} \), where \(J_\infty \) is the recession/asymptotic function of J, see e.g., Rockafellar and Wets (1998).
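
For example (our illustration), when \(J = \left\| \cdot \right\| _{1}\), positive homogeneity gives \(J_\infty = \left\| \cdot \right\| _{1}\), so the condition \(J_{\infty }({\varvec{\theta }}) > 0\) on \(\ker (\varvec{X}) \setminus \left\{ 0 \right\} \) holds trivially; this is consistent with the fact that adding \(\lambda _n \left\| \cdot \right\| _{1}\) with \(\lambda _n > 0\) to a loss that is bounded from below already makes \(V_n\) coercive, whatever \(\ker (\varvec{X})\).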

Proof

Let \(V^\gamma _n({\varvec{\theta }}) {\mathop {=}\limits ^{\text { def}}}\tfrac{1}{n}F(\varvec{X}{\varvec{\theta }},\varvec{y})+\lambda _n {}^{\gamma }J({\varvec{\theta }})\) and define \(\mu ^{\gamma }_n({\varvec{\theta }}) {\mathop {=}\limits ^{\text { def}}}\exp {\left( -V^\gamma _n({\varvec{\theta }})/\beta \right) }/Z\), where \(0< Z < +\infty \) is the normalizing constant of the density \(\mu _n\). Assumption (H.1) and Lemma 6(ii)–(iii) tell us that \(V^\gamma _n \in C^1(\mathbb {R}^p)\) and \(\nabla V^\gamma _n({\varvec{\theta }}) \rightarrow {\big (\partial V_n({\varvec{\theta }})\big )}^0\) as \(\gamma \rightarrow 0\). Thus

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle {\big (\partial V_n({\varvec{\theta }})\big )}^0,\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right]&= \int _{\mathbb {R}^p} \lim _{\gamma \rightarrow 0} \langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \mathrm{d}{\varvec{\theta }}. \end{aligned}$$

We now check that \(\langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \) is dominated by an integrable function. From the definition of the Moreau envelope, we have

$$\begin{aligned} V^\gamma _n({\varvec{\theta }}) = \min _{\overline{{\varvec{\theta }}}\in \mathbb {R}^p} \tfrac{1}{n}F(\varvec{X}{\varvec{\theta }},\varvec{y}) + \lambda _n{\big (J({\varvec{\theta }}- \overline{{\varvec{\theta }}}) + \frac{1}{2\gamma }\big \Vert \overline{{\varvec{\theta }}}\big \Vert _{2}^2\big )} . \end{aligned}$$

From coercivity of \(V_n\), the objective in the \(\min \) is also coercive in \(({\varvec{\theta }},\overline{{\varvec{\theta }}})\) by Rockafellar and Wets (1998, Exercise 3.29(b)). It then follows from Rockafellar and Wets (1998, Theorem 3.31) that \(V^\gamma _n\) is also coercive. In turn, Rockafellar and Wets (1998, Theorem 11.8(c) and 3.26(a)) allow us to assert that for some \(a \in ]0,+\infty [\), \(\exists b \in ]-\infty ,+\infty [\) such that for all \(\gamma > 0\) and \({\varvec{\theta }}\in \mathbb {R}^p\)

$$\begin{aligned} \mu ^{\gamma }_n({\varvec{\theta }}) \le \exp {\left( -a\left\| {\varvec{\theta }}\right\| _{2}-b\right) }/Z . \end{aligned}$$
(43)

Lemma 6(iii) and assumption (b) on J entail that for any \({\varvec{\theta }}\in \mathbb {R}^p\),

$$\begin{aligned} \big \Vert \nabla {}^{\gamma }J({\varvec{\theta }})\big \Vert _{2} \le \big \Vert {\big (\partial J({\varvec{\theta }})\big )}^0\big \Vert _{2} \le R \left\| {\varvec{\theta }}\right\| _{2}^\rho . \end{aligned}$$

Altogether, we have

$$\begin{aligned}&\big |\langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \big |\\&\quad \le \mu ^\gamma _n({\varvec{\theta }}) {\left( \big |\langle \varvec{X}^\top \tfrac{1}{n}\nabla F(\varvec{X}{\varvec{\theta }},\varvec{y}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \big |+\lambda _n\big \Vert \nabla {}^{\gamma }J({\varvec{\theta }})\big \Vert _{2}\big \Vert \overline{{\varvec{\theta }}}-{\varvec{\theta }}\big \Vert _{2}\right) }\\&\quad \le C Z^{-1}\exp {\left( -F(\varvec{X}{\varvec{\theta }},\varvec{y})/(n\beta )\right) } \big |\langle \tfrac{1}{n}\nabla F(\varvec{X}{\varvec{\theta }},\varvec{y}),\varvec{X}(\overline{{\varvec{\theta }}}-{\varvec{\theta }})\rangle \big | \\&\qquad + (Z\exp {b})^{-1} \lambda _n R \exp {\left( -a\left\| {\varvec{\theta }}\right\| _{2}\right) } \big \Vert {\varvec{\theta }}\big \Vert _{2}^\rho \big \Vert \overline{{\varvec{\theta }}}-{\varvec{\theta }}\big \Vert _{2} , \end{aligned}$$

where the constant \(C > 0\) reflects the lower boundedness of J. Using also (H.2), it is easy to see that the function in this upper bound is integrable. Hence, we can apply the dominated convergence theorem to get

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle {\big (\partial V_n({\varvec{\theta }})\big )}^0,\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \right]&= \lim _{\gamma \rightarrow 0} \int _{\mathbb {R}^p} \langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \mathrm{d}{\varvec{\theta }}. \end{aligned}$$

Now, by simple differential calculus (chain and product rules), we have

$$\begin{aligned} \langle \mu ^\gamma _n({\varvec{\theta }})\nabla V^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle&= -\beta \langle \nabla \mu ^\gamma _n({\varvec{\theta }}),\overline{{\varvec{\theta }}}-{\varvec{\theta }}\rangle \\&= -\beta \sum _{i=1}^p \frac{\partial }{\partial {\varvec{\theta }}_i}{\left( \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i)\right) } - p\beta \mu ^\gamma _n({\varvec{\theta }}). \end{aligned}$$

Integrating the first term, we get by Fubini's theorem and the Newton–Leibniz formula

$$\begin{aligned}&\int _{\mathbb {R}^{p-1}} {\left( \int _{\mathbb {R}} \frac{\partial }{\partial {\varvec{\theta }}_i}{\left( \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i)\right) } \mathrm{d}{\varvec{\theta }}_i\right) } \mathrm{d}{\varvec{\theta }}_1 \cdots \mathrm{d}{\varvec{\theta }}_{i-1} \mathrm{d}{\varvec{\theta }}_{i+1}\cdots \mathrm{d}{\varvec{\theta }}_p \\&\quad = \int _{\mathbb {R}^{p-1}} {\left[ \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i)\right] }_{\mathbb {R}} \mathrm{d}{\varvec{\theta }}_1 \cdots \mathrm{d}{\varvec{\theta }}_{i-1} \mathrm{d}{\varvec{\theta }}_{i+1}\cdots \mathrm{d}{\varvec{\theta }}_p = 0 , \end{aligned}$$

where we used coercivity of \(V^\gamma _n\) (see (43)) to conclude that \(\lim _{|{\varvec{\theta }}_i| \rightarrow +\infty } \mu ^\gamma _n({\varvec{\theta }})(\overline{{\varvec{\theta }}}_i-{\varvec{\theta }}_i) = 0\). For the second term, we have from Lemma 6(i) that \(\mu ^\gamma _n \rightarrow \mu _n\) as \(\gamma \rightarrow 0\). Thus, arguing again as in (43), we can apply the dominated convergence theorem to conclude that

$$\begin{aligned} \lim _{\gamma \rightarrow 0} \int _{\mathbb {R}^p} \mu ^\gamma _n({\varvec{\theta }}) \mathrm{d}{\varvec{\theta }}= \int _{\mathbb {R}^p} \mu _n({\varvec{\theta }}) \mathrm{d}{\varvec{\theta }}= 1. \end{aligned}$$

This concludes the proof. \(\square \)
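
As a quick sanity check of Proposition 5 (our computation, not part of the original argument), take \(p=1\), \(J \equiv 0\) and a smooth quadratic loss giving \(V_n(\theta ) = \theta ^2/2\), assumed to fit the hypotheses. Then \(\mu _n\) is the Gaussian density \(\mathcal {N}(0,\beta )\) and \({\big (\partial V_n(\theta )\big )}^0 = \nabla V_n(\theta ) = \theta \), so that for any \(\overline{\theta }\in \mathbb {R}\),

$$\begin{aligned} \mathbb {E}_{\mu _n}\left[ \langle \theta ,\overline{\theta }-\theta \rangle \right] = \overline{\theta }\, \mathbb {E}_{\mu _n}[\theta ] - \mathbb {E}_{\mu _n}[\theta ^2] = 0 - \beta = -p\beta , \end{aligned}$$

in agreement with the proposition.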

About this article

Cite this article

Luu, T.D., Fadili, J. & Chesneau, C. Sharp oracle inequalities for low-complexity priors. Ann Inst Stat Math 72, 353–397 (2020). https://doi.org/10.1007/s10463-018-0693-6
