Abstract
Joint sparsity offers powerful structural cues for feature selection, especially for variables that are expected to exhibit "grouped" behavior. Such behavior is commonly modeled via the group lasso, the multitask lasso, and related methods, where feature selection is effected through mixed norms. Several mixed-norm based sparse models have received substantial attention, and efficient algorithms are available for some of them. Surprisingly, several constrained sparse models still lack scalable algorithms. We address this deficiency by presenting batch and online (stochastic-gradient) optimization methods, both of which rely on efficient projections onto mixed-norm balls. We illustrate our methods by applying them to the multitask lasso, and conclude by mentioning some open problems.
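The mixed-norm ball projections the abstract refers to admit a compact sketch in the ℓ1,2 case: projecting a matrix onto the ball {X : Σ_g ‖x_g‖₂ ≤ τ} (one group per row) reduces to a Euclidean projection of the vector of group norms onto the ℓ1 ball, followed by rescaling each group. This is a standard reduction rather than the paper's specific algorithm, and the function names below are illustrative.

```python
import numpy as np

def project_l1(v, tau):
    """Euclidean projection of a nonnegative vector v onto the l1 ball of radius tau."""
    if v.sum() <= tau:
        return v
    u = np.sort(v)[::-1]                      # sort in decreasing order
    cssv = np.cumsum(u) - tau                 # cumulative sums shifted by the radius
    rho = np.nonzero(u - cssv / (np.arange(len(u)) + 1) > 0)[0][-1]
    theta = cssv[rho] / (rho + 1.0)           # soft-threshold level
    return np.maximum(v - theta, 0.0)

def project_l12_ball(A, tau):
    """Project matrix A (rows = groups) onto {X : sum_g ||x_g||_2 <= tau}."""
    norms = np.linalg.norm(A, axis=1)         # one l2 norm per group
    shrunk = project_l1(norms, tau)           # shrink the norms jointly
    scale = np.where(norms > 0, shrunk / np.maximum(norms, 1e-12), 0.0)
    return A * scale[:, None]                 # rescale each group to its shrunk norm
```

Such a projection is the workhorse inside both batch (spectral projected gradient) and online (stochastic projected gradient) schemes: each iteration takes a gradient step on the loss and then projects the iterate back onto the mixed-norm ball.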
Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.
Cite this article
Sra, S. Fast projections onto mixed-norm balls with applications. Data Min Knowl Disc 25, 358–377 (2012). https://doi.org/10.1007/s10618-012-0277-7