The degrees of freedom of partly smooth regularizers

  • Samuel Vaiter
  • Charles Deledalle
  • Jalal Fadili
  • Gabriel Peyré
  • Charles Dossal
Abstract

We study regularized regression problems where the regularizer is a proper, lower-semicontinuous, convex function that is partly smooth relative to a Riemannian submanifold. This class encompasses several popular examples, including the Lasso, the group Lasso, the max and nuclear norms, as well as their compositions with linear operators (e.g., total variation or the fused Lasso). Our main sensitivity analysis result shows that, as the observations undergo small perturbations, the predictor moves stably along the same active submanifold. This result plays a pivotal role in deriving a closed-form expression for the divergence of the predictor with respect to the observations. We also show that, for many regularizers, including polyhedral ones and the analysis group Lasso, this divergence formula holds Lebesgue almost everywhere. When the perturbation is random (with an appropriate continuous distribution), this yields an unbiased estimator of both the degrees of freedom and the prediction risk. Our results unify and go beyond those already known in the literature.
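
The divergence formula at the heart of the abstract can be made concrete in its best-known special case. The following is a minimal numerical sketch, not the paper's general construction: for the Lasso, the divergence of the predictor \(y \mapsto X\hat{\beta}(y)\) reduces to the number of active (nonzero) coefficients, a classical result of Zou, Hastie and Tibshirani (2007), which plugs directly into Stein's unbiased risk estimate. The noise level, regularization weight, and synthetic data below are illustrative choices, and scikit-learn's Lasso solver is assumed available.

    # Minimal sketch of the classical Lasso special case (assumptions: Gaussian
    # noise with known sigma, no intercept, scikit-learn available). For the
    # Lasso, the divergence of y -> X beta_hat(y) equals the active-set size.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, sigma = 100, 200, 0.5
    X = rng.standard_normal((n, p))
    beta0 = np.zeros(p)
    beta0[:5] = 1.0                                 # sparse ground truth
    y = X @ beta0 + sigma * rng.standard_normal(n)

    model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
    mu_hat = X @ model.coef_                        # predictor mu_hat(y)

    # Unbiased estimate of the degrees of freedom: size of the active set.
    df_hat = np.count_nonzero(model.coef_)

    # Stein's unbiased estimate of the prediction risk E||mu_hat - X beta0||^2:
    #   SURE(y) = ||y - mu_hat||^2 - n*sigma^2 + 2*sigma^2 * div(mu_hat)(y).
    sure = np.sum((y - mu_hat) ** 2) - n * sigma**2 + 2 * sigma**2 * df_hat
    print(f"degrees of freedom estimate: {df_hat}, SURE: {sure:.2f}")

The paper's contribution is to show that an analogous closed-form divergence, and hence an analogous unbiased risk estimate, holds Lebesgue almost everywhere for the whole class of partly smooth regularizers, with the active set replaced by the active submanifold.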

Keywords

Degrees of freedom · Partial smoothness · Manifold · Sparsity · Model selection · O-minimal structures · Semi-algebraic sets · Group Lasso · Total variation

Acknowledgments

This work has been supported by the European Research Council (ERC project SIGMA-Vision) and Institut Universitaire de France.

Copyright information

© The Institute of Statistical Mathematics, Tokyo 2016

Authors and Affiliations

  • Samuel Vaiter (1)
  • Charles Deledalle (3)
  • Jalal Fadili (2)
  • Gabriel Peyré (1)
  • Charles Dossal (3)

  1. CEREMADE, CNRS, Université Paris-Dauphine, Paris Cedex 16, France
  2. Normandie Univ, ENSICAEN, CNRS, GREYC, Caen Cedex, France
  3. IMB, CNRS, Université Bordeaux 1, Talence Cedex, France
