Block coordinate descent algorithms for large-scale sparse multiclass classification
Over the past decade, ℓ1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., ℓ1/ℓ2) regularization has been utilized as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs ℓ1/ℓ2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun in Math. Program. 117:387–423, 2009) and the other without (Richtárik and Takáč in Math. Program. 1–38, 2012a; Tech. Rep. arXiv:1212.0873, 2012b). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a pair of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably compared to other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms ℓ1/ℓ2-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.
Keywords: Multiclass classification · Group sparsity · Block coordinate descent
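The row-wise zeroing effect of ℓ1/ℓ2 regularization described in the abstract can be illustrated with the standard group soft-thresholding operator, i.e., the proximal operator of the ℓ1/ℓ2 norm applied to each feature's row of the class-weight matrix. The sketch below is illustrative only and is not the authors' implementation; the function name and the use of NumPy are assumptions for the example.

```python
import numpy as np

def group_soft_threshold(W, tau):
    """Proximal operator of tau * sum_j ||W[j, :]||_2.

    W   -- (n_features, n_classes) weight matrix; row j holds the
           weights of feature j across all classes.
    tau -- threshold (regularization strength times step size).

    Each row is shrunk toward zero by tau in Euclidean norm; rows
    whose norm is below tau are set exactly to zero, which removes
    the corresponding feature from every class simultaneously.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Avoid division by zero for rows that are already all-zero.
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * W

# Example: with tau = 1.0, the weak first row is zeroed out entirely,
# while the strong second row is only shrunk.
W = np.array([[0.1, 0.1],
              [3.0, 4.0]])
print(group_soft_threshold(W, 1.0))
```

This operator is what distinguishes ℓ1/ℓ2 from plain ℓ1 regularization: sparsity is induced at the level of whole rows (features) rather than individual entries, which is why the resulting multiclass models are compact and fast to evaluate.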
- Bakin, S. (1999). Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University.
- Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the conference on learning theory (COLT) (pp. 144–152).
- Duchi, J., & Singer, Y. (2009a). Boosting with structural sparsity. In Proceedings of the international conference on machine learning (ICML) (pp. 297–304).
- Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. In Proceedings of neural information processing systems (NIPS) (pp. 681–687).
- Fan, R. E., & Lin, C. J. (2007). A study on threshold selection for multi-label classification. Tech. rep., National Taiwan University.
- Friedman, J., Hastie, T., & Tibshirani, R. (2010a). A note on the group lasso and a sparse group lasso. Tech. Rep. arXiv:1001.0736.
- Friedman, J. H., Hastie, T., & Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
- Qin, Z., Scheinberg, K., & Goldfarb, D. (2010). Efficient block-coordinate descent algorithms for the group lasso. Tech. rep., Columbia University.
- Richtárik, P., & Takáč, M. (2012a). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 1–38.
- Richtárik, P., & Takáč, M. (2012b). Parallel coordinate descent methods for big data optimization. Tech. Rep. arXiv:1212.0873.
- Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2010). Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 1–28.
- Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). Feature hashing for large scale multitask learning. In Proceedings of the international conference on machine learning (ICML) (pp. 1113–1120).
- Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of the European symposium on artificial neural networks, computational intelligence and machine learning (pp. 219–224).
- Yuan, G. X., Ho, C. H., & Lin, C. J. (2011). An improved GLMNET for ℓ1-regularized logistic regression. In Proceedings of the international conference on knowledge discovery and data mining (pp. 33–41).