Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches

  • Mark Schmidt
  • Glenn Fung
  • Rómer Rosales
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4701)


L1 regularization is effective for feature selection, but the resulting optimization is challenging due to the non-differentiability of the 1-norm. In this paper we compare state-of-the-art optimization techniques to solve this problem across several loss functions. Furthermore, we propose two new techniques. The first is based on a smooth (differentiable) convex approximation for the L1 regularizer that does not depend on any assumptions about the loss function used. The second is a new strategy that addresses the non-differentiability of the L1 regularizer by casting the problem as a constrained optimization problem that is then solved using a specialized gradient projection method. Extensive comparisons show that our newly proposed approaches consistently rank among the best in terms of convergence speed and efficiency, as measured by the number of function evaluations required.
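The two ideas in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the softplus-based smoothing of |x| with sharpness parameter `alpha`, the split of the weights into non-negative parts `pos` and `neg`, and the plain fixed-step projected gradient loop, which stands in for (and is not) the specialized gradient projection method the paper refers to. The function names `smooth_abs`, `smooth_abs_grad`, and `projected_gradient_l1` are hypothetical.

```python
import math


def smooth_abs(x, alpha=100.0):
    """Differentiable convex approximation of |x| (assumed form, for illustration).

    Uses (1/alpha) * [log(1 + exp(alpha*x)) + log(1 + exp(-alpha*x))],
    which simplifies to |x| + (2/alpha) * log(1 + exp(-alpha*|x|)).
    The gap to |x| is at most 2*log(2)/alpha and shrinks as alpha grows.
    """
    return abs(x) + (2.0 / alpha) * math.log1p(math.exp(-alpha * abs(x)))


def smooth_abs_grad(x, alpha=100.0):
    """Derivative of smooth_abs: sigmoid(alpha*x) - sigmoid(-alpha*x) = tanh(alpha*x/2)."""
    return math.tanh(alpha * x / 2.0)


def projected_gradient_l1(grad_loss, dim, lam, step=0.5, iters=200):
    """Minimize loss(w) + lam * ||w||_1 via a bound-constrained reformulation.

    Write w = pos - neg with pos >= 0 and neg >= 0, so the penalty becomes the
    smooth term lam * sum(pos + neg). Each iteration takes a gradient step and
    projects back onto the non-negative orthant (clip at zero).
    `grad_loss` maps a weight list to the gradient of the loss alone.
    The fixed step size is an assumption; it must suit the chosen loss.
    """
    pos = [0.0] * dim
    neg = [0.0] * dim
    for _ in range(iters):
        w = [p - n for p, n in zip(pos, neg)]
        g = grad_loss(w)
        # d/dpos = g + lam, d/dneg = -g + lam; project by clipping at zero.
        pos = [max(0.0, p - step * (gi + lam)) for p, gi in zip(pos, g)]
        neg = [max(0.0, n - step * (-gi + lam)) for n, gi in zip(neg, g)]
    return [p - n for p, n in zip(pos, neg)]


if __name__ == "__main__":
    # Toy loss 0.5 * ||w - b||^2, whose L1-regularized minimizer is the
    # soft-thresholding of b by lam, so the result is easy to check by hand.
    b = [3.0, 0.5, -3.0]
    w_hat = projected_gradient_l1(lambda w: [wi - bi for wi, bi in zip(w, b)],
                                  dim=len(b), lam=1.0)
    print(w_hat)
    print(smooth_abs(0.0), smooth_abs(1.0))
```

On the toy quadratic loss the constrained formulation drives small coefficients exactly to zero, which is the feature-selection behaviour the abstract attributes to L1 regularization. The smoothed penalty, by contrast, can be handed to any unconstrained gradient-based solver, at the cost of an approximation error controlled by `alpha`.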


Feature Selection, Loss Function, Step Length, Line Search, Constrain Optimization Problem
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Mark Schmidt (1)
  • Glenn Fung (2)
  • Rómer Rosales (2)
  1. Department of Computer Science, University of British Columbia
  2. IKM CKS, Siemens Medical Solutions, USA
