Machine Learning, Volume 93, Issue 1, pp 31–52

Block coordinate descent algorithms for large-scale sparse multiclass classification

Abstract

Over the past decade, ℓ1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., ℓ1/ℓ2) regularization has been utilized as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs ℓ1/ℓ2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun in Math. Program. 117:387–423, 2009) and the other without (Richtárik and Takáč in Math. Program. 1–38, 2012a; Tech. Rep. arXiv:1212.0873, 2012b). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a couple of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably compared to other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms ℓ1/ℓ2-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.
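The group sparsity described above comes from applying the ℓ1/ℓ2 penalty to each feature's row of the weight matrix: when a row's ℓ2 norm is small enough, the whole row is zeroed, discarding that feature for every class at once. As an illustration only (not the authors' code; the function and variable names below are hypothetical), here is a minimal sketch of the group soft-thresholding operation that block coordinate descent solvers typically apply to one such row:

    import numpy as np

    def group_soft_threshold(row, threshold):
        # Proximal operator of the l1/l2 (group lasso) penalty for one
        # weight block: here, one feature's weights across all classes.
        # The whole block is set exactly to zero when its l2 norm falls
        # below the threshold, which yields feature-level sparsity.
        norm = np.linalg.norm(row)
        if norm <= threshold:
            return np.zeros_like(row)
        return (1.0 - threshold / norm) * row

    # Toy weight matrix W: one row per feature, one column per class.
    W = np.array([[0.50, -0.20,  0.10],   # feature 0: informative weights
                  [0.05,  0.02, -0.01]])  # feature 1: near-zero weights
    lam = 0.1  # regularization strength (illustrative value)
    W_new = np.vstack([group_soft_threshold(W[j], lam) for j in range(W.shape[0])])
    # Feature 1's entire row is zeroed across all three classes,
    # while feature 0's row is merely shrunk toward zero.

Roughly speaking, the algorithms in the paper interleave such block-wise shrinkage with partial minimization of the multiclass squared hinge loss over the same block, either with a line search (Tseng and Yun 2009) or without (Richtárik and Takáč 2012a, 2012b).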

Keywords

Multiclass classification · Group sparsity · Block coordinate descent

References

  1. Bach, F. R., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1), 1–106.
  2. Bakin, S. (1999). Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University.
  3. Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183–202.
  4. Bertsekas, D. P. (1999). Nonlinear programming. Belmont: Athena Scientific.
  5. Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of conference on learning theory (COLT) (pp. 144–152).
  6. Chang, K. W., Hsieh, C. J., & Lin, C. J. (2008). Coordinate descent method for large-scale L2-loss linear support vector machines. Journal of Machine Learning Research, 9, 1369–1398.
  7. Combettes, P., & Wajs, V. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4, 1168–1200.
  8. Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
  9. Dredze, M., Crammer, K., & Pereira, F. (2008). Confidence-weighted linear classification. In Proceedings of international conference on machine learning (ICML) (pp. 264–271).
  10. Duchi, J., & Singer, Y. (2009a). Boosting with structural sparsity. In Proceedings of international conference on machine learning (ICML) (pp. 297–304).
  11. Duchi, J., & Singer, Y. (2009b). Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2899–2934.
  12. Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. In Proceedings of neural information processing systems (NIPS) (pp. 681–687).
  13. Fan, R. E., & Lin, C. J. (2007). A study on threshold selection for multi-label classification. Tech. rep., National Taiwan University.
  14. Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302–332.
  15. Friedman, J., Hastie, T., & Tibshirani, R. (2010a). A note on the group lasso and a sparse group lasso. Tech. Rep. arXiv:1001.0736.
  16. Friedman, J. H., Hastie, T., & Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
  17. Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7, 397–416.
  18. Lee, Y., Lin, Y., & Wahba, G. (2004). Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 67–81.
  19. Mangasarian, O. (2002). A finite Newton method for classification. Optimization Methods and Software, 17, 913–929.
  20. Meier, L., Van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70(1), 53–71.
  21. Obozinski, G., Taskar, B., & Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 231–252.
  22. Qin, Z., Scheinberg, K., & Goldfarb, D. (2010). Efficient block-coordinate descent algorithms for the group lasso. Tech. rep., Columbia University.
  23. Richtárik, P., & Takáč, M. (2012a). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 1–38.
  24. Richtárik, P., & Takáč, M. (2012b). Parallel coordinate descent methods for big data optimization. Tech. Rep. arXiv:1212.0873.
  25. Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101–141.
  26. Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2010). Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 1–28.
  27. Shevade, S. K., & Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17), 2246–2253.
  28. Tseng, P., & Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117, 387–423.
  29. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). Feature hashing for large scale multitask learning. In Proceedings of international conference on machine learning (ICML) (pp. 1113–1120).
  30. Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of European symposium on artificial neural networks, computational intelligence and machine learning (pp. 219–224).
  31. Wright, S. J. (2012). Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22, 159–186.
  32. Wright, S. J., Nowak, R. D., & Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7), 2479–2493.
  33. Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.
  34. Yuan, G. X., Chang, K. W., Hsieh, C. J., & Lin, C. J. (2010). A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11, 3183–3234.
  35. Yuan, G. X., Ho, C. H., & Lin, C. J. (2011). An improved GLMNET for l1-regularized logistic regression. In Proceedings of the international conference on knowledge discovery and data mining (pp. 33–41).
  36. Zhang, H. H., Liu, Y., Wu, Y., & Zhu, J. (2006). Variable selection for multicategory SVM via sup-norm regularization. Electronic Journal of Statistics, 2, 149–167.
  37. Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
  38. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.

Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Mathieu Blondel (1)
  • Kazuhiro Seki (1)
  • Kuniaki Uehara (1)
  1. Graduate School of System Informatics, Kobe University, Kobe, Japan