Abstract
We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. This problem is relevant in machine learning, statistics and signal processing. It is well known that a linear regression can benefit from knowledge that the underlying regression vector is sparse. The combinatorial problem of selecting the nonzero components of this vector can be “relaxed” by regularizing the squared error with a convex penalty function like the ℓ1 norm. However, in many applications, additional conditions on the structure of the regression vector and its sparsity pattern are available. Incorporating this information into the learning method may lead to a significant decrease of the estimation error. In this paper, we present a family of convex penalty functions, which encode prior knowledge on the structure of the vector formed by the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish the basic properties of these penalty functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso method and other related methods.
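The abstract refers to the standard convex relaxation of sparse estimation: regularizing the squared error with the ℓ1 norm, i.e. the Lasso, which is the baseline the paper compares against. Purely as an illustration of that baseline, and not of the authors' structured penalty family, the following is a minimal proximal-gradient (ISTA) sketch of ℓ1-regularized least squares; the function names, step-size choice, and toy data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Componentwise soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # Approximately solve min_beta 0.5*||X beta - y||^2 + lam*||beta||_1
    # by proximal gradient descent (ISTA).
    n, d = X.shape
    beta = np.zeros(d)
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    # of the quadratic term (the squared spectral norm of X).
    L = np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

# Toy example: recover a sparse vector from noisy linear measurements.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
beta_true = np.zeros(100)
beta_true[:5] = 1.0
y = X @ beta_true + 0.01 * rng.standard_normal(50)
beta_hat = lasso_ista(X, y, lam=0.1)
print(np.flatnonzero(np.abs(beta_hat) > 1e-3))  # indices of the selected components
```

Structured-sparsity penalties of the kind studied in the paper replace the ℓ1 term above with a convex function of the absolute values of the coefficients that encodes prior knowledge about the sparsity pattern; the ℓ1 norm is recovered as a special case.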
Communicated by Lixin Shen.
Cite this article
Micchelli, C.A., Morales, J.M. & Pontil, M. Regularizers for structured sparsity. Adv Comput Math 38, 455–489 (2013). https://doi.org/10.1007/s10444-011-9245-9