Fast projections onto mixed-norm balls with applications

Abstract

Joint sparsity offers powerful structural cues for feature selection, especially for variables that are expected to exhibit "grouped" behavior. Such behavior is commonly modeled via the group lasso, the multitask lasso, and related methods, where feature selection is effected via mixed norms. Several mixed-norm-based sparse models have received substantial attention, and for some of them efficient algorithms are available. Surprisingly, several constrained sparse models still seem to lack scalable algorithms. We address this deficiency by presenting batch and online (stochastic-gradient) optimization methods, both of which rely on efficient projections onto mixed-norm balls. We illustrate our methods by applying them to the multitask lasso, and conclude by mentioning some open problems.
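
The two ingredients the abstract describes, a fast projection onto a mixed-norm ball and a gradient loop built around it, admit a compact illustration. The Python sketch below is not the paper's implementation: it covers only the ℓ1,2 case and relies on a standard reduction, namely projecting the vector of per-row ℓ2 norms onto the ℓ1 ball and then rescaling each row accordingly. That projection then drives a plain projected-gradient solver for the constrained multitask lasso, min_W ½‖XW − Y‖²_F subject to ‖W‖₁,₂ ≤ τ, where the rows of W (one weight vector per feature, shared across tasks) are the groups. All function names are illustrative.

import numpy as np

def project_l1_ball(v, tau):
    # Project a nonnegative vector v onto {x >= 0 : sum(x) <= tau}, tau > 0.
    # Sort-based O(n log n) thresholding; linear-time variants also exist.
    if v.sum() <= tau:
        return v.copy()
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - tau)[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)    # soft-threshold level
    return np.maximum(v - theta, 0.0)

def project_l12_ball(W, tau):
    # Project W onto the l1,2 ball {W : sum_i ||W[i, :]||_2 <= tau}.
    # Reduction: project the row norms onto the l1 ball, rescale the rows.
    norms = np.linalg.norm(W, axis=1)
    shrunk = project_l1_ball(norms, tau)
    scale = np.where(norms > 0, shrunk / np.maximum(norms, 1e-12), 0.0)
    return W * scale[:, None]

def multitask_lasso_pg(X, Y, tau, iters=500):
    # Projected gradient for min_W 0.5*||XW - Y||_F^2 s.t. ||W||_{1,2} <= tau.
    W = np.zeros((X.shape[1], Y.shape[1]))
    lr = 1.0 / np.linalg.norm(X, 2) ** 2      # step 1/L, L = sigma_max(X)^2
    for _ in range(iters):
        grad = X.T @ (X @ W - Y)              # gradient of the smooth term
        W = project_l12_ball(W - lr * grad, tau)
    return W

# Tiny demo: 3 of 30 features active across 4 tasks.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
W_true = np.zeros((30, 4))
W_true[:3] = rng.standard_normal((3, 4))
Y = X @ W_true + 0.01 * rng.standard_normal((100, 4))
W_hat = multitask_lasso_pg(X, Y, tau=np.linalg.norm(W_true, axis=1).sum())

Shrinking τ zeroes out entire rows of W, i.e., drops a feature across all tasks simultaneously, which is exactly the "grouped" selection the abstract refers to. An online variant replaces the full gradient with a minibatch estimate and a decaying step size; the ℓ1,∞ ball requires a different, though also efficient, projection.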

Author information

Correspondence to Suvrit Sra.

Additional information

Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.

About this article

Cite this article

Sra, S. Fast projections onto mixed-norm balls with applications. Data Min Knowl Disc 25, 358–377 (2012). https://doi.org/10.1007/s10618-012-0277-7
