Data Mining and Knowledge Discovery, Volume 25, Issue 2, pp 358–377

Fast projections onto mixed-norm balls with applications

Suvrit Sra


Joint sparsity offers powerful structural cues for feature selection, especially for variables that are expected to exhibit "grouped" behavior. Such behavior is commonly modeled via the group lasso, the multitask lasso, and related methods, where feature selection is effected via mixed-norms. Several mixed-norm based sparse models have received substantial attention, and for some of them efficient algorithms are available. Surprisingly, several constrained sparse models still lack scalable algorithms. We address this deficiency by presenting batch and online (stochastic-gradient) optimization methods, both of which rely on efficient projections onto mixed-norm balls. We illustrate our methods by applying them to the multitask lasso, and conclude by mentioning some open problems.
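The abstract's key computational primitive is the Euclidean projection onto a mixed-norm ball. As a hedged illustration (not the paper's own algorithm), the sketch below handles the ℓ1,2 case, where a well-known factorization applies: project the vector of row norms onto an ℓ1 ball (here via the standard sort-based routine), then rescale each row to its new norm. Function names are illustrative.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a nonnegative vector v onto the l1 ball
    of the given radius, using the standard sort-and-threshold method."""
    if v.sum() <= radius:
        return v.copy()
    u = np.sort(v)[::-1]                      # sorted in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0) # soft-threshold level
    return np.maximum(v - theta, 0.0)

def project_l12_ball(W, radius):
    """Euclidean projection of matrix W onto the l1,2 mixed-norm ball
    { X : sum_i ||X[i, :]||_2 <= radius }.  The projection factors into
    (i) an l1-ball projection of the row norms and (ii) row rescaling."""
    norms = np.linalg.norm(W, axis=1)
    if norms.sum() <= radius:
        return W.copy()                       # already inside the ball
    new_norms = project_l1_ball(norms, radius)
    scale = np.where(norms > 0, new_norms / np.maximum(norms, 1e-12), 0.0)
    return W * scale[:, None]
```

Such a projection is the workhorse inside a projected-(stochastic-)gradient loop: take a gradient step on the smooth loss, then project the iterate back onto the mixed-norm ball.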


Keywords: Mixed-norm · Group sparsity · Fast projection · Multitask learning · Matrix norms · Stochastic gradient





Copyright information

© The Author(s) 2012

Authors and Affiliations

Max Planck Institute for Intelligent Systems, Tübingen, Germany
