Abstract
We study regularized regression problems where the regularizer is a proper, lower-semicontinuous, convex and partly smooth function relative to a Riemannian submanifold. This encompasses several popular examples including the Lasso, the group Lasso, the max and nuclear norms, as well as their composition with linear operators (e.g., total variation or fused Lasso). Our main sensitivity analysis result shows that the predictor moves locally stably along the same active submanifold as the observations undergo small perturbations. This plays a pivotal role in getting a closed-form expression for the divergence of the predictor w.r.t. observations. We also show that, for many regularizers, including polyhedral ones or the analysis group Lasso, this divergence formula holds Lebesgue a.e. When the perturbation is random (with an appropriate continuous distribution), this allows us to derive an unbiased estimator of the degrees of freedom and the prediction risk. Our results unify and go beyond those already known in the literature.
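To make the divergence formula concrete on the simplest instance — the Lasso with identity design, where the predictor reduces to componentwise soft-thresholding and the divergence is the size of the active set (Zou et al. 2007) — here is a minimal numerical sketch. It is not taken from the paper (function names are ours), and it only illustrates the orthogonal-design specialization, assuming NumPy:

```python
import numpy as np

def soft_threshold(y, lam):
    # Lasso predictor when the design is the identity:
    # componentwise soft-thresholding of the observations.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def divergence_closed_form(y, lam):
    # Closed-form divergence of the predictor: the size of the active set.
    return int(np.sum(np.abs(y) > lam))

def divergence_numerical(y, lam, eps=1e-6):
    # Central finite differences; valid Lebesgue-a.e. in y (away from the
    # thresholds +/- lam, where the predictor is not differentiable).
    div = 0.0
    for i in range(y.size):
        e = np.zeros_like(y)
        e[i] = eps
        div += (soft_threshold(y + e, lam)[i]
                - soft_threshold(y - e, lam)[i]) / (2 * eps)
    return div

y = np.array([3.0, -0.5, 1.2, 0.1])
print(divergence_closed_form(y, 1.0))          # -> 2 (coordinates 0 and 2 active)
print(round(divergence_numerical(y, 1.0), 6))  # -> 2.0
```

Under Gaussian noise, plugging such a divergence into Stein's identity yields the unbiased degrees-of-freedom estimator discussed in the abstract.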
Notes
Strictly speaking, the minimization may have to be over a convex subset of \(\mathbb {R}^p\).
The meaning of sensitivity is different here from what is usually intended in statistical sensitivity and uncertainty analysis.
We write the same symbol as for the derivative, and rigorously speaking, this has to be understood to hold Lebesgue a.e.
To be understood here as a set-valued mapping.
Obviously, Lemma 2(ii) holds in such a case at the unique minimizer \({\widehat{\beta }}(y)\).
References
Absil, P. A., Mahony, R., Trumpf, J. (2013). An extrinsic look at the Riemannian Hessian. Geometric science of information. Lecture notes in computer science (Vol. 8085, pp. 361–368). Berlin: Springer.
Bach, F. (2008). Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9, 1179–1225.
Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4, 384–414.
Bakin, S. (1999). Adaptive regression and model selection in data mining problems. Thesis (Ph.D.)–Australian National University.
Bickel, P. J., Ritov, Y., Tsybakov, A. (2009). Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4), 1705–1732.
Bolte, J., Daniilidis, A., Lewis, A. S. (2011). Generic optimality conditions for semialgebraic convex programs. Mathematics of Operations Research, 36(1), 55–70.
Bonnans, J., Shapiro, A. (2000). Perturbation analysis of optimization problems., Springer Series in Operations Research. New York: Springer.
Brown, L. D. (1986). Fundamentals of statistical exponential families with applications in statistical decision theory. Institute of Mathematical Statistics lecture notes-monograph series (Vol. 9). Hayward: IMS.
Bühlmann, P., van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Series in Statistics. Berlin: Springer.
Bunea, F. (2008). Honest variable selection in linear and logistic regression models via \(\ell _1\) and \(\ell _1+\ell _2\) penalization. Electronic Journal of Statistics, 2, 1153–1194.
Candès, E., Plan, Y. (2009). Near-ideal model selection by \(\ell _1\) minimization. Annals of Statistics, 37(5A), 2145–2177.
Candès, E. J., Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
Candès, E. J., Li, X., Ma, Y., Wright, J. (2011). Robust principal component analysis? Journal of the ACM, 58(3), 11:1–11:37.
Candès, E. J., Sing-Long, C. A., Trzasko, J. D. (2012). Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing, 61(19), 4643–4657.
Candès, E. J., Strohmer, T., Voroninski, V. (2013). Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8), 1241–1274.
Chavel, I. (2006). Riemannian geometry: a modern introduction. Cambridge studies in advanced mathematics (2nd ed., Vol. 98). New York: Cambridge University Press.
Chen, S., Donoho, D., Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1), 33–61.
Chen, X., Lin, Q., Kim, S., Carbonell, J. G., Xing, E. P. (2010). An efficient proximal-gradient method for general structured sparse learning. arXiv:1005.4717.
Combettes, P., Pesquet, J. (2007). A Douglas–Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4), 564–574.
Coste, M. (1999). An introduction to o-minimal geometry. Technical report, Institut de Recherche Mathematiques de Rennes.
Coste, M. (2002). An introduction to semialgebraic geometry. Technical report, Institut de Recherche Mathematiques de Rennes.
Daniilidis, A., Hare, W., Malick, J. (2009). Geometrical interpretation of the predictor–corrector type algorithms in structured optimization problems. Optimization: A Journal of Mathematical Programming and Operations Research, 55(5–6), 482–503.
Daniilidis, A., Drusvyatskiy, D., Lewis, A. S. (2013). Orthogonal invariance and identifiability. arXiv:1304.1198.
DasGupta, A. (2008). Asymptotic theory of statistics and probability. Berlin: Springer.
Deledalle, C. A., Vaiter, S., Peyré, G., Fadili, M., Dossal, C. (2012). Risk estimation for matrix recovery with spectral regularization. In: ICML’12 workshop on sparsity, dictionaries and projections in machine learning and signal processing. arXiv:1205.1482.
Deledalle, C. A., Vaiter, S., Peyré, G., Fadili, J. M. (2014). Stein unbiased gradient estimator of the risk (SUGAR) for multiple parameter selection. SIAM Journal on Imaging Sciences, 7(4), 2448–2487.
Donoho, D. (2006). For most large underdetermined systems of linear equations the minimal \(\ell ^1\)-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797–829.
Dossal, C., Kachour, M., Fadili, M. J., Peyré, G., Chesneau, C. (2013). The degrees of freedom of penalized \(\ell _1\) minimization. Statistica Sinica, 23(2), 809–828.
Drusvyatskiy, D., Lewis, A. (2011). Generic nondegeneracy in convex optimization. Proceedings of the American Mathematical Society, 129, 2519–2527.
Drusvyatskiy, D., Ioffe, A., Lewis, A. (2015). Generic minimizing behavior in semi-algebraic optimization. SIAM Journal on Optimization (to appear).
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394), 461–470.
Eldar, Y. C. (2009). Generalized SURE for exponential families: Applications to regularization. IEEE Transactions on Signal Processing, 57(2), 471–481.
Evans, L. C., Gariepy, R. F. (1992). Measure theory and fine properties of functions. Studies in advanced mathematics. Boca Raton: CRC Press.
Fazel, M., Hindi, H., Boyd, S. P. (2001). A rank minimization heuristic with application to minimum order system approximation. Proceedings of the American Control Conference IEEE, 6, 4734–4739.
van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Annals of Statistics, 36, 614–645.
Hansen, N. R., Sokol, A. (2014). Degrees of freedom for nonlinear least squares estimation. arXiv:1402.2997.
Hudson, H. (1978). A natural identity for exponential families with applications in multiparameter estimation. Annals of Statistics, 6(3), 473–484.
Hwang, J. T. (1982). Improving upon standard estimators in discrete exponential families with applications to poisson and negative binomial cases. Annals of Statistics, 10(3), 857–867.
Jacob, L., Obozinski, G., Vert, J. P. (2009). Group lasso with overlap and graph lasso. In: Danyluk, A. P., Bottou, L., Littman, M. L. (eds.) ICML’09, Vol. 382, p. 55.
Jégou, H., Furon, T., Fuchs, J. J. (2012). Anti-sparse coding for approximate nearest neighbor search. In: IEEE ICASSP, pp. 2029–2032.
Kakade, S., Shamir, O., Sindharan, K., Tewari, A. (2010). Learning exponential families in high-dimensions: Strong convexity and sparsity. In: Teh, Y. W., Titterington, D. M. (eds.) Proceedings of the thirteenth international conference on artificial intelligence and statistics (AISTATS-10), Vol. 9, pp. 381–388.
Kato, K. (2009). On the degrees of freedom in shrinkage estimation. Journal of Multivariate Analysis, 100(7), 1338–1352.
Lee, J. M. (2003). Introduction to smooth manifolds. Graduate texts in mathematics. New York: Springer.
Lemaréchal, C., Hiriart-Urruty, J. (1996). Convex analysis and minimization algorithms: Fundamentals (Vol. 305). Berlin: Springer.
Lemaréchal, C., Oustry, F., Sagastizábal, C. (2000). The \(\mathcal {U}\)-Lagrangian of a convex function. Transactions of the American Mathematical Society, 352(2), 711–729.
Lewis, A. (1995). The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2, 173–183.
Lewis, A., Sendov, H. (2001). Twice differentiable spectral functions. SIAM Journal on Matrix Analysis and Applications, 23, 368–386.
Lewis, A. S. (2003a). Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization, 13(3), 702–725.
Lewis, A. S. (2003b). The mathematics of eigenvalue optimization. Mathematical Programming, 97(1–2), 155–176.
Lewis, A. S., Zhang, S. (2013). Partial smoothness, tilt stability, and generalized hessians. SIAM Journal on Optimization, 23(1), 74–94.
Liang, J., Fadili, M. J., Peyré, G., Luke, R. (2014). Activity Identification and local linear convergence of Douglas–Rachford/ADMM under partial smoothness. arXiv:1412.6858.
Liu, H., Zhang, J. (2009). Estimation consistency of the group lasso and its applications. Journal of Machine Learning Research, 5, 376–383.
Lyubarskii, Y., Vershynin, R. (2010). Uncertainty principles and vector quantization. IEEE Transactions on Information Theory, 56(7), 3491–3501.
McCullagh, P., Nelder, J. A. (1989). Generalized Linear Models (2nd edn). Monographs on Statistics & Applied Probability. Boca Raton: Chapman & Hall/CRC.
Meier, L., van de Geer, S., Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 51–71.
Meinshausen, N., Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34, 1436–1462.
Meyer, M., Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Annals of Statistics, 28(4), 1083–1104.
Miller, S. A., Malick, J. (2005). Newton methods for nonsmooth convex minimization: Connections among \(\mathcal {U}\)-Lagrangian, Riemannian Newton and SQP methods. Mathematical Programming, 104(2–3), 609–633.
Mordukhovich, B. (1992). Sensitivity analysis in nonsmooth optimization. In: Field, D. A. & Komkov, V. (eds.) Theoretical aspects of industrial design. SIAM volumes in applied mathematics (Vol. 58), Philadelphia, pp 32–46.
Negahban, S., Ravikumar, P., Wainwright, M. J., Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Osborne, M., Presnell, B., Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.
Peyré, G., Fadili, J., Chesneau, C. (2011). Adaptive structured block sparsity via dyadic partitioning. In: EUSIPCO, Barcelona, Spain.
Ramani, S., Blu, T., Unser, M. (2008). Monte-Carlo SURE: A black-box optimization of regularization parameters for general denoising algorithms. IEEE Transactions on Image Processing, 17(9), 1540–1554.
Recht, B., Fazel, M., Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 471–501.
Rockafellar, R. T. (1996). Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton: Princeton University Press.
Rudin, L., Osher, S., Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4), 259–268.
Saad, Y., Schultz, M. H. (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3), 856–869.
Solo, V., Ulfarsson, M. (2010). Threshold selection for group sparsity. In: IEEE ICASSP, pp. 3754–3757.
Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6), 1135–1151.
Studer, C., Yin, W., Baraniuk, R. G. (2012). Signal representations with minimum \(\ell _\infty \)-norm. In: 50th annual Allerton conference on communication, control, and computing.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B Methodological, 58(1), 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. (2005). Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
Tibshirani, R. J., Taylor, J. (2012). Degrees of freedom in Lasso problems. Annals of Statistics, 40(2), 1198–1232.
Tikhonov, A. N., Arsenin, V. Y. (1977). Solutions of Ill-posed problems. New York: Halsted Press.
Vaiter, S., Deledalle, C., Peyré, G., Fadili, M. J., Dossal, C. (2012). Degrees of freedom of the group Lasso. In: ICML’12 Workshops, pp. 89–92.
Vaiter, S., Deledalle, C., Peyré, G., Dossal, C., Fadili, M. J. (2013). Local behavior of sparse analysis regularization: Applications to risk estimation. Applied and Computational Harmonic Analysis, 35(3), 433–451.
Vaiter, S., Peyré, G., Fadili, M. J. (2014). Model consistency of partly smooth regularizers. arXiv:1405.1004.
Vaiter, S., Golbabaee, M., Fadili, M. J., Peyré, G. (2015). Model selection with low complexity priors. Information and Inference: A Journal of the IMA (IMAIAI).
van den Dries, L. (1998). Tame topology and o-minimal structures. London Mathematical Society lecture note series (Vol. 248). New York: Cambridge University Press.
van den Dries, L., Miller, C. (1996). Geometric categories and o-minimal structures. Duke Mathematical Journal, 84, 497–540.
Vonesch, C., Ramani, S., Unser, M. (2008). Recursive risk estimation for non-linear image deconvolution with a wavelet-domain sparsity constraint. In: ICIP, IEEE, pp. 665–668.
Wei, F., Huang, J. (2010). Consistent group selection in high-dimensional linear regression. Bernoulli, 16(4), 1369–1384.
Wright, S. J. (1993). Identifiable surfaces in constrained optimization. SIAM Journal on Control and Optimization, 31(4), 1063–1079.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Yuan, M., Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19–35.
Zou, H., Hastie, T., Tibshirani, R. (2007). On the “degrees of freedom” of the Lasso. Annals of Statistics, 35(5), 2173–2192.
Acknowledgments
This work has been supported by the European Research Council (ERC project SIGMA-Vision) and Institut Universitaire de France.
Basic properties of o-minimal structures
In the following results, we collect some important stability properties of o-minimal structures. To be self-contained, we also provide proofs. To the best of our knowledge, these proofs, although simple, are either not reported in the literature or left as exercises in the authoritative references (van den Dries 1998; Coste 1999). Moreover, in most proofs, to show that a subset is definable, we could simply write the appropriate first-order formula (see Coste 1999, Page 12; van den Dries 1998, Chapter 1, Section 2) and conclude using (Coste 1999, Theorem 1.13). Here, for the sake of clarity and to avoid cryptic statements for the non-specialist, we translate the first-order formula into operations on the subsets involved, in particular projections, and invoke the above stability axioms of o-minimal structures. In the following, n denotes an arbitrary (finite) dimension, which is not necessarily the number of observations used previously in the paper.
Lemma 5
(Addition and multiplication) Let \(f : \Omega \subset \mathbb {R}^n \rightarrow \mathbb {R}^p\) and \(g : \Omega \subset \mathbb {R}^n \rightarrow \mathbb {R}^p\) be definable functions. Then their pointwise addition and multiplication are also definable.
Proof
Let \(h=f+g\), and set \(B = \big( \mathbb {R}^n \times \mathbb {R}^p \times {{\mathrm{gph}}}(f) \times {{\mathrm{gph}}}(g) \big) \cap S\),
where \(S= \{ (x,u,y,v,z,w) \;:\; x=y=z, u=v+w \} \) is obviously an algebraic (in fact linear) subset, hence definable by axiom 2. Axioms 1 and 2 then imply that B is also definable. Let \(\varPi _{3n+3p,n+p}: \mathbb {R}^{3n+3p} \rightarrow \mathbb {R}^{n+p}\) be the projection on the first \(n+p\) coordinates. We then have
\({{\mathrm{gph}}}(h) = \varPi _{3n+3p,n+p}(B)\), whence we deduce that h is definable by applying axiom 4 \(2n+2p\) times. Definability of the pointwise multiplication follows from the same proof, taking \(u=v \cdot w\) in S. \(\square \)
Lemma 6
(Inequalities in definable sets) Let \(f : \Omega \subset \mathbb {R}^n \rightarrow \mathbb {R}\) be a definable function. Then \( \{ x \in \Omega \;:\; f(x) > 0 \} \) is definable. The same holds when replacing > with <.
In other words, inequalities involving definable functions may be freely used when defining definable sets.
There are many possible proofs of this statement.
Proof
(1) Let \(B= \{ (x,y) \in \mathbb {R}^n \times \mathbb {R} \;:\; f(x)=y \} \cap (\Omega \times (0,+\infty ))\), which is definable thanks to axioms 1 and 3 and to the fact that the level sets of a definable function are also definable. Thus \( \{ x \in \Omega \;:\; f(x) > 0 \} = \varPi _{n+1,n}(B)\), and we conclude using again axiom 4. \(\square \)
Yet another (simpler) proof.
Proof
(2) It is sufficient to remark that \( \{ x \in \Omega \;:\; f(x) > 0 \} \) is the projection of the set \( \{ (x,t) \in \Omega \times \mathbb {R} \;:\; t^2f(x)-1 = 0 \} \), where the latter is definable owing to Lemma 5. \(\square \)
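To see the squaring trick at work on a one-dimensional toy case (an illustration we add for concreteness, not taken from the original proof), take \(\Omega = \mathbb {R}\) and \(f(x)=x\):

```latex
\{ x \in \mathbb{R} \;:\; x > 0 \}
  = \varPi_{2,1}\big( \{ (x,t) \in \mathbb{R}^2 \;:\; t^2 x - 1 = 0 \} \big),
```

since \(t^2 x = 1\) admits a real solution t exactly when \(x > 0\); the right-hand side is the projection of an algebraic set, hence definable.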
Lemma 7
(Derivative) Let \(f: I \rightarrow \mathbb {R}\) be a definable differentiable function on an open interval I of \(\mathbb {R}\). Then its derivative \(f': I \rightarrow \mathbb {R}\) is also definable.
Proof
Let \(g: (x,t) \in I \times \mathbb {R}\mapsto g(x,t) = f(x+t)-f(x)\). Note that g is a definable function on \(I \times \mathbb {R}\) by Lemma 5. We now write the graph of \(f'\) as \( {{\mathrm{gph}}}(f') = \{ (x,y) \in I \times \mathbb {R} \;:\; \forall \varepsilon > 0, \exists \delta > 0, \forall t, |t| < \delta \Rightarrow |g(x,t) - yt| \le \varepsilon |t| \} \).
Let \(C= \{ (x,y,v,t,\varepsilon ,\delta ) \in I \times \mathbb {R}^5 \;:\; ((x,t),v) \in {{\mathrm{gph}}}(g) \} \), which is definable since g is definable and using axiom 3. Let
The first part in B is semi-algebraic, hence definable thanks to axiom 2. Thus B is also definable using axiom 1. We can now write
where the projectors and completions translate the actions of the existential and universal quantifiers. Using again axioms 4 and 1, we conclude. \(\square \)
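As a sanity check of Lemma 7 on a toy case (an illustration we add; it is not part of the original proof), take \(I = \mathbb {R}\) and \(f(x)=x^2\):

```latex
g(x,t) = f(x+t) - f(x) = 2xt + t^2, \qquad
\mathrm{gph}(f') = \{ (x,y) \in \mathbb{R}^2 \;:\; y = 2x \}.
```

Indeed, \(|g(x,t) - 2x \cdot t| = t^2 \le \varepsilon |t|\) whenever \(|t| \le \varepsilon \), so the quantified condition selects exactly \(y = 2x\), the zero set of a polynomial, which is definable directly by axiom 2, in agreement with the lemma.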
With this result at hand, the following proposition is immediate.
Proposition 2
(Differential and Jacobian) Let \(f=(f_1,\ldots ,f_p): \Omega \rightarrow \mathbb {R}^p\) be a differentiable function on an open subset \(\Omega \) of \(\mathbb {R}^n\). If f is definable, then so are its differential mapping and its Jacobian. In particular, for each \(i=1,\ldots ,p\) and \(j=1,\ldots ,n\), the partial derivative \(\partial f_i/\partial x_j: \Omega \rightarrow \mathbb {R}\) is definable.
We provide below some results concerning the subdifferential.
Proposition 3
(Subdifferential) Suppose that f is a finite-valued convex definable function. Then for any \(x \in \mathbb {R}^n\), the subdifferential \(\partial f(x)\) is definable.
Proof
For every \(x \in \mathbb {R}^n\), the subdifferential \(\partial f(x)\) reads \( \partial f(x) = \{ \eta \in \mathbb {R}^n \;:\; f(x') \ge f(x) + \langle \eta ,\,x'-x\rangle , \; \forall x' \in \mathbb {R}^n \} \).
Let \(K = \{ (\eta ,x') \in \mathbb {R}^n \times \mathbb {R}^n \;:\; f(x') < f(x) + \langle \eta ,\,x'-x\rangle \} \). Hence, \(\partial f(x) = \mathbb {R}^n {\setminus } \varPi _{2n,n}(K)\). Since f is definable, the set K is also definable using Lemmas 5 and 6, whence definability of \(\partial f(x)\) follows using axiom 4. \(\square \)
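A worked one-dimensional example (added here for illustration) may help: take \(f(x)=|x|\) and \(x=0\). Then

```latex
K = \{ (\eta, x') \in \mathbb{R} \times \mathbb{R} \;:\; |x'| < \eta\, x' \},
```

and \(\eta \in \varPi _{2,1}(K)\) if and only if \(|\eta | > 1\) (choose \(x'\) of the same sign as \(\eta \)). Hence \(\partial f(0) = \mathbb {R}{\setminus }\varPi _{2,1}(K) = [-1,1]\), as expected.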
Lemma 8
Suppose that f is a finite-valued convex definable function. Then, the set
is definable.
Proof
Denote \(C = \{ (x,\eta ) \;:\; \eta \in {{\mathrm{ri}}}\partial f(x) \} \). Using the characterization of the relative interior of a convex set (Rockafellar 1996, Theorem 6.4), we rewrite C in the more convenient form
Let \(D = \mathbb {R}^n \times \mathbb {R}^n \times \mathbb {R}^n \times \mathbb {R}^n \times (1,+\infty ) \times \mathbb {R}^n\) and K defined as
Thus,
where the projectors and completions translate the actions of the existential and universal quantifiers. Using again axioms 4 and 1, we conclude. \(\square \)
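As an illustration of the characterization invoked in the proof (added here; not part of the original argument), take \(f(x)=|x|\) and \(x=0\), so that \(\partial f(0) = [-1,1]\). Then (Rockafellar 1996, Theorem 6.4) reads

```latex
\eta \in \mathrm{ri}\, \partial f(0)
  \iff \forall \eta' \in [-1,1],\ \exists \mu > 1 :\
       \mu \eta + (1-\mu) \eta' \in [-1,1].
```

For \(\eta = 1\) and \(\eta ' = -1\) this would require \(2\mu - 1 \le 1\), which fails for every \(\mu > 1\); for \(|\eta | < 1\), any \(\mu \) close enough to 1 works. Hence \(\mathrm{ri}\,\partial f(0) = (-1,1)\), as expected.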
Vaiter, S., Deledalle, C., Fadili, J. et al. The degrees of freedom of partly smooth regularizers. Ann Inst Stat Math 69, 791–832 (2017). https://doi.org/10.1007/s10463-016-0563-z