Block coordinate descent algorithms for large-scale sparse multiclass classification
Abstract
Over the past decade, ℓ1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., ℓ1/ℓ2) regularization has been utilized as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs ℓ1/ℓ2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun in Math. Program. 117:387–423, 2009) and the other without (Richtárik and Takáč in Math. Program. 1–38, 2012a; Tech. Rep. arXiv:1212.0873, 2012b). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a couple of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably compared to other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms ℓ1/ℓ2-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.
Keywords
Multiclass classification · Group sparsity · Block coordinate descent

1 Introduction
ℓ1-regularized loss minimization has attracted a great deal of research over the past decade (Yuan et al. 2010). ℓ1 regularization has many advantages, including its computational efficiency, its ability to perform implicit feature selection and, under certain conditions, to recover the model's true sparsity (Zhao and Yu 2006). More recently, mixed-norm (e.g., ℓ1/ℓ2) regularization has been proposed (Bakin 1999; Yuan and Lin 2006) as a way to select groups of features. Here, the notion of group is application-dependent and may be used to exploit prior knowledge about natural feature groups (Yuan and Lin 2006; Meier et al. 2008) or problem structure (Obozinski et al. 2010; Duchi and Singer 2009a).
In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs ℓ1/ℓ2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models (see Sect. 2). For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun 2009) and the other without (Richtárik and Takáč 2012a). We present the two variants in a unified manner and develop the core components needed to optimize our objective (efficient gradient computation, Lipschitz constant of the gradient, computationally cheap stopping criterion). The end result is a couple of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably compared to other solvers such as FOBOS (Duchi and Singer 2009b), FISTA (Beck and Teboulle 2009) and SpaRSA (Wright et al. 2009). Furthermore, we show that our formulation obtains very compact multiclass models and outperforms ℓ1/ℓ2-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.
2 Sparsityinducing regularization
Let \(R_{\ell_2}(\boldsymbol{W}_{j:}) = \|\boldsymbol{W}_{j:}\|_2\) (notice that the ℓ2 norm is not squared). With ℓ1/ℓ2 regularization (group lasso), \(R_{\ell_1/\ell_2}(\boldsymbol{W}) = \sum_{j} R_{\ell_2}(\boldsymbol{W}_{j:})\), the model becomes sparse at the feature group (here, row) level. Applied to a multiclass model, ℓ1/ℓ2 regularization can thus force weights corresponding to the same feature to become zero across all classes. The corresponding features can therefore be safely ignored at test time, which is especially useful when features are expensive to extract. For more information on sparsity-inducing penalties, see the excellent survey by Bach et al. (2012).
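Concretely, the row-wise group penalty can be sketched in a few lines of NumPy (the layout, one row per feature and one column per class, follows the paper; the function name is ours):

```python
import numpy as np

def l1_l2_penalty(W):
    """R(W) = sum_j ||W_{j:}||_2: the l1/l2 (group lasso) penalty with
    one group per row (feature) of the d x m weight matrix."""
    return np.sum(np.linalg.norm(W, ord=2, axis=1))

# A row that is zero across all classes contributes nothing to the
# penalty, so the corresponding feature can be ignored at test time.
W = np.array([[0.0, 0.0, 0.0],   # feature zeroed out across all classes
              [3.0, 4.0, 0.0]])  # active feature
print(l1_l2_penalty(W))  # 5.0, i.e. sqrt(3^2 + 4^2)
```

Because the ℓ2 norm (unsquared) is non-differentiable at zero, minimizing an objective containing this penalty drives entire rows exactly to zero, rather than merely shrinking them.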
3 Related work
3.1 Multiclass classification: direct vs. indirect formulations
Classifying an object into one of several categories is an important problem arising in many applications such as document classification and object recognition. Machine learning approaches to this problem can be roughly divided into two categories: direct and indirect approaches. While direct approaches formulate the multiclass problem directly, indirect approaches reduce the multiclass problem to multiple independent binary classification or regression problems. Because support vector machines (SVMs) (Boser et al. 1992) were originally proposed as a binary classification model, they have frequently been used in combination with indirect approaches to perform multiclass classification. Among these, one of the most popular is "one-vs-rest" (Rifkin and Klautau 2004), which consists of learning to separate one class from all the others, independently for each of the m possible classes. Direct multiclass SVM extensions were later proposed by Weston and Watkins (1999), Lee et al. (2004) and Crammer and Singer (2002). They were all formulated as constrained problems and solved in the dual. An unconstrained (non-differentiable) form of the Crammer-Singer formulation is popularly used with stochastic subgradient descent algorithms such as Pegasos (Shalev-Shwartz et al. 2010). Another popular direct multiclass (smooth) formulation, which is an intuitive extension of traditional logistic regression, is multiclass logistic regression. In this paper, we propose an efficient direct multiclass formulation.
3.2 Sparse multiclass classification
Recently, mixed-norm regularization has attracted much interest (Yuan and Lin 2006; Meier et al. 2008; Duchi and Singer 2009a, 2009b; Obozinski et al. 2010) due to its ability to impose sparsity at the feature group level. Few papers, however, have investigated its application to multiclass classification. Zhang et al. (2006) extend Lee et al.'s multiclass SVM formulation (Lee et al. 2004) to employ ℓ1/ℓ∞ regularization and formulate the learning problem as a linear program (LP). However, they experimentally verify their method only on very small problems (both in terms of n and d). Duchi and Singer (2009a) propose a boosting-like algorithm specialized for ℓ1/ℓ2-regularized multiclass logistic regression. In another paper, Duchi and Singer (2009b) derive and analyze FOBOS, a stochastic subgradient descent framework based on forward-backward splitting, and apply it, among other things, to ℓ1/ℓ2-regularized multiclass logistic regression. In this paper, we choose ℓ1/ℓ2 regularization, since it can be optimized more efficiently than ℓ1/ℓ∞ regularization (see Sect. 4.7).
3.3 Coordinate descent methods
Although coordinate descent methods were among the first optimization methods proposed and studied in the literature (see Bertsekas 1999 and references therein), it is only recently that they regained popularity, thanks to several successful applications in the machine learning (Fu 1998; Shevade and Keerthi 2003; Friedman et al. 2007, 2010b; Yuan et al. 2010; Qin et al. 2010) and optimization (Tseng and Yun 2009; Wright 2012; Richtárik and Takáč 2012a) communities. Conceptually and algorithmically simple, (block) coordinate descent algorithms focus at each iteration on updating one block of variables while keeping the others fixed, and have been shown to be particularly well-suited to minimizing objective functions with non-smooth separable regularization such as ℓ1 or ℓ1/ℓ2 (Tseng and Yun 2009; Wright 2012; Richtárik and Takáč 2012a).
Coordinate descent algorithms have different trade-offs: expensive gradient-based greedy block selection as opposed to cheap cyclic or randomized selection, and use of a line search (Tseng and Yun 2009; Wright 2012) or not (Richtárik and Takáč 2012a). For large-scale linear classification, as we confirm in this paper, cyclic and randomized block selection schemes have been shown to achieve excellent performance (Yuan et al. 2010, 2011; Chang et al. 2008; Richtárik and Takáč 2012a). The most popular loss function for ℓ1-regularized binary classification is arguably logistic regression, due to its smoothness (Yuan et al. 2010). Binary logistic regression was also successfully combined with ℓ1/ℓ2 regularization in the case of user-defined feature groups (Meier et al. 2008). However, recent work (Yuan et al. 2010, 2011; Chang et al. 2008) using coordinate descent indicates that logistic regression is substantially slower to train than ℓ2-loss (squared hinge) SVMs. This is because, contrary to ℓ2-loss SVMs, logistic regression requires expensive log and exp computations (equivalent to dozens of multiplications) to compute the gradient or objective value (Yuan et al. 2011). Motivated by this background, we propose a novel efficient direct multiclass formulation. Compared to multiclass logistic regression, which suffers from the same problems as its binary counterpart, our formulation can be optimized very efficiently by block coordinate descent and lends itself to large-scale and high-dimensional problems such as document classification.
4 Sparse direct multiclass classification
4.1 Objective function
1. It is a direct multiclass formulation and its relation with Eq. (1) is intuitive.
2. Its objective value and gradient can be computed efficiently (unlike multiclass logistic regression, which requires expensive log and exp operations).
3. It empirically performs comparably to or better than other multitask and multiclass formulations.
4. It meets several conditions needed to prove global convergence of block coordinate descent algorithms (see Sect. 4.6).
Our objective, Eq. (2), is similar in spirit to Weston and Watkins' multiclass SVM formulation (Weston and Watkins 1999), in that it ensures that the correct class's score is greater than all the other classes' scores by at least 1. However, it has the following differences: it is unconstrained (rather than constrained), it is ℓ1/ℓ2-regularized (rather than \(\ell_{2}^{2}\)-regularized) and it penalizes misclassifications quadratically (rather than linearly), which ensures differentiability of L(W).
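As an illustrative sketch (not the authors' implementation), the unregularized part of this objective can be written directly from the description above; a practical solver would vectorize the loops and exploit data sparsity:

```python
import numpy as np

def multiclass_squared_hinge_loss(W, X, y):
    """L(W): for each instance i, quadratically penalize every wrong class r
    whose score comes within a margin of 1 of the correct class's score.
    W is d x m (features x classes), X is n x d, y holds class indices."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        scores = X[i] @ W                        # one score per class
        margins = 1.0 - (scores[y[i]] - scores)  # margin violations
        margins[y[i]] = 0.0                      # the correct class is skipped
        total += np.sum(np.maximum(margins, 0.0) ** 2)
    return total / n
```

If the correct class outscores every other class by at least 1, the instance contributes nothing to the sum, which keeps both the objective and its gradient cheap to evaluate.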
4.2 Optimization by block coordinate descent
A key property of F(W) is the separability of its non-smooth part R(W) over groups j=1,2,…,d. This calls for an algorithm which minimizes F(W) by updating W group by group. In this paper, to minimize F(W), we thus employ block coordinate descent. We consider two variants, one with line search (Tseng and Yun 2009) and the other without (Richtárik and Takáč 2012a). We present the two variants in a unified manner.
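The group-by-group scheme can be sketched as follows. Here `partial_grad(W, j)` and `lipschitz[j]` are hypothetical stand-ins for the quantities developed in the next subsections, `prox_l2` is the closed-form proximity operator of the scaled ℓ2 norm, and the variant shown is the randomized one without line search:

```python
import numpy as np

def prox_l2(v, tau):
    """Closed-form proximity operator of tau * ||.||_2 (block soft-thresholding):
    shrinks v toward 0 and zeroes it entirely when ||v||_2 <= tau."""
    norm = np.linalg.norm(v)
    if norm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v

def bcd_no_linesearch(partial_grad, lipschitz, W, lam, n_iters, rng):
    """Sketch of randomized block coordinate descent without line search.
    partial_grad(W, j) returns the gradient of the smooth loss w.r.t. row j;
    lipschitz[j] is a Lipschitz constant of that partial gradient."""
    d = W.shape[0]
    for _ in range(n_iters):
        j = rng.integers(d)                     # uniform random block
        L_j = lipschitz[j]
        step = W[j] - partial_grad(W, j) / L_j  # gradient step on row j only
        W[j] = prox_l2(step, lam / L_j)         # proximal (shrinkage) step
    return W
```

On a toy separable quadratic loss, each row converges to the block soft-thresholded target, so rows whose signal is below the regularization level end up exactly zero.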
4.3 Efficient partial gradient computation
4.4 Choice of block, \(\mathcal{L}_{j}^{t}\) and \(\alpha_{j}^{t}\)
We now discuss how to choose, at every iteration, the block W _{ j:} to update, \(\mathcal{L}_{j}^{t}\) and \(\alpha_{j}^{t}\), depending on whether a line search is used or not.
4.4.1 With line search (Tseng and Yun)
Similarly to Eq. (5), the cost of computing Eq. (7) and Eq. (6) is \(O(\hat{n}(m-1))\). In practice, we observe that one line search step often suffices for Eq. (7) to be satisfied. Therefore, the cost of one call to Algorithm 2 is in general \(O(\hat{n}(m-1))\).
To enjoy Tseng and Yun’s theoretical guarantees (see Sect. 4.6), we need to use cyclic block selection. That is, in Algorithm 1, at each inner iteration, we need to choose j=l.
4.4.2 Without line search (Richtárik and Takáč)
Using a line search or not is a matter of trade-off: a line search has a higher cost per iteration but can potentially lead to greater progress thanks to larger step sizes. We compare both strategies experimentally in Sect. 5.2. One advantage of Richtárik and Takáč's framework, however, is that it can be parallelized (Richtárik and Takáč 2012b), potentially leading to significant speedups. In future work, we plan to compare sequential and parallel block coordinate descent when applied to our objective, Eq. (2).
4.5 Stopping criterion
We would like to develop a stopping criterion for Algorithm 1 which can be checked at almost no extra computational cost. Proposition 1 characterizes an optimal solution of Eq. (2).
Proposition 1
4.6 Global convergence properties
We discuss convergence properties of the two block coordinate descent variants we considered: cyclic block coordinate descent with line search (Tseng and Yun 2009) and randomized block coordinate descent without line search (Richtárik and Takáč 2012a). For finite termination of the line search, Tseng and Yun (Lemma 5.1) require that L has a Lipschitz continuous gradient, which we prove with Lemma 1 in Appendix A. For asymptotic convergence, Tseng and Yun assume that each block is cyclically visited (Eq. (12)). They further assume (Assumption 1) that H^{t} is upper-bounded by some value and lower-bounded by 0, which is guaranteed by our choice \(\boldsymbol{H}^{t} = \mathcal{L}_{j}^{t} \boldsymbol{I}\). Richtárik and Takáč also assume (Sect. 2) that the blockwise gradient is Lipschitz. They show (Theorem 4) that using their algorithm, there exists a finite iteration t such that P(F(W^{t})−F(W^{∗})≤ϵ)≥1−ρ, where ϵ>0 is the accuracy of the solution and 0<ρ<1 is the target confidence.
4.7 Extensions
Another possible extension consists of replacing ℓ1/ℓ2 regularization by ℓ1/ℓ∞ regularization or ℓ1+ℓ1/ℓ2 regularization (sparse group lasso, Friedman et al. 2010a). This requires changing the proximity operator, Eq. (4), as well as reworking the stopping criterion developed in Sect. 4.5. Similarly to ℓ1/ℓ2 regularization, ℓ1/ℓ∞ regularization leads to group sparsity. However, the proximity operator associated with the ℓ∞ norm requires a projection onto an ℓ1-norm ball (Bach et al. 2012) and is thus computationally more expensive than the proximity operator associated with the ℓ2 norm, which takes a closed form, Eq. (4). For ℓ1+ℓ1/ℓ2 regularization (sparse group lasso), the group-wise proximity operator can readily be computed by applying first the proximity operator associated with the ℓ1 norm and then the one associated with the ℓ2 norm (Bach et al. 2012). However, sparse group lasso regularization requires tuning an extra hyperparameter, which balances between ℓ1/ℓ2 and ℓ1 regularization. For this reason, we do not consider it in our experiments.
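For illustration, the composed proximity operator mentioned above can be sketched as follows (a self-contained sketch with our naming; `prox_l2` repeats the closed-form ℓ2 operator of Eq. (4)):

```python
import numpy as np

def prox_l1(v, tau):
    """Elementwise soft-thresholding: proximity operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_l2(v, tau):
    """Block soft-thresholding: proximity operator of tau * ||.||_2."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= tau else (1.0 - tau / norm) * v

def prox_sparse_group_lasso(v, tau1, tau2):
    """Group-wise proximity operator of tau1*||.||_1 + tau2*||.||_2,
    obtained by composing the two operators, l1 first then l2, as noted
    in Bach et al. (2012)."""
    return prox_l2(prox_l1(v, tau1), tau2)
```

The composition yields sparsity both within a group (individual coordinates zeroed by the ℓ1 step) and across groups (entire rows zeroed by the ℓ2 step).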
5 Experiments
We conducted two experiments. In the first experiment, we investigated the performance (in terms of speed of convergence and row sparsity) of block coordinate descent (with or without line search) for optimizing the proposed direct multiclass formulation Eq. (2), compared to other state-of-the-art solvers. In the second experiment, we compared the proposed direct multiclass formulation with other multiclass and multitask formulations in terms of test accuracy, row sparsity and training speed. Experiments were run on a Linux machine with an Intel Xeon CPU (3.47 GHz) and 4 GB memory.
5.1 Datasets

Amazon7: product-review (books, DVD, electronics, …) classification.

RCV1: news document classification.

MNIST: handwritten digit classification.

News20: newsgroup message classification.

Sector: webpage (industry sectors) classification.
Table 1 Datasets used in Sect. 5
Dataset   Instances   Features   Nonzero features   Classes
Amazon7   1,362,109   262,144    0.04 %             7
RCV1      534,135     47,236     0.1 %              52
MNIST     70,000      780        19 %               10
News20    18,846      130,088    0.1 %              20
Sector    9,619       55,197     0.3 %              105
5.2 Comparison of block coordinate descent with other solvers

BCD (LS): block coordinate descent with line search and with cyclic block selection (Tseng and Yun 2009),

BCD (CST): block coordinate descent without line search and with randomized uniform block selection (Richtárik and Takáč 2012a),

FISTA (LS): an accelerated iterative thresholding algorithm with line search (Beck and Teboulle 2009),

FISTA (CST): same as above but with constant step size \(\frac{1}{\mathcal{K}}\) (see Appendix A),

SpaRSA: a similar approach to ISTA (Beck and Teboulle 2009) but with a different line search (Wright et al. 2009),

FOBOS: a projected stochastic subgradient descent framework (Duchi and Singer 2009b).
All solvers are used to minimize the same objective: our proposed multiclass formulation, Eq. (2).
5.2.1 Comparison of block coordinate descent with or without line search
Figures 2 and 3 indicate that block coordinate descent (BCD) with line search was overall slightly faster to converge than without. Empirically, we observe that the sufficient decrease condition checked by the line search, Eq. (7), is usually satisfied on the first try (\(\alpha^{t}_{j}=1\)). In that case, the line search does not incur much extra cost, since the objective value difference F(W^{t+1})−F(W^{t}), needed for Eq. (7), can be computed in the same loop as the partial gradient. For the few times when more than one line search step is required, our formulation has the advantage that the objective value difference can be computed very efficiently (no expensive log or exp). However, similarly to other iterative solvers, BCD (both with or without line search) may suffer from slow convergence on very loosely regularized problems (very small λ).
In terms of row sparsity, Fig. 4 shows that in all datasets, BCD had a twophase behavior: first increasing the number of nonzero rows, then rapidly decreasing it. Compared to other solvers, BCD was always the fastest to reach the sparsity level corresponding to a given λ value.
5.2.2 Comparison with a projected stochastic subgradient descent solver: FOBOS
BCD outperformed FOBOS on smaller datasets (News20, Sector) and was comparable to FOBOS on larger datasets (MNIST, RCV1, Amazon7). However, for FOBOS, we found that tuning the initial step size η0 was crucial to obtain good convergence speed and accuracy. This additional "degree of freedom" is a major disadvantage of FOBOS over BCD in practice. On the other hand, since it is based on stochastic subgradient descent, FOBOS can handle non-differentiable loss functions (e.g., the Crammer-Singer multiclass loss), unlike BCD.
Figure 4 shows that FOBOS obtained much less sparse solutions than BCD. In particular, on RCV1 with λ=10^{−3}, BCD obtained less than 5 % nonzero rows whereas FOBOS obtained almost 80 %.
5.2.3 Comparison with full-gradient solvers: FISTA and SpaRSA
BCD outperformed FISTA and SpaRSA on all datasets, both in speed of objective value decrease and test accuracy increase. FISTA (LS) and SpaRSA achieved similar convergence speed, with a slight advantage for FISTA (LS). Interestingly, FISTA (CST) was consistently worse than FISTA (LS), showing that, in the full-gradient case, adjusting the step size with a line search at every iteration is greatly beneficial. In contrast, the difference between BCD (LS) and BCD (CST) was smaller. FISTA (CST) uses one global step size \(\frac{1}{\mathcal{K}}\) whereas BCD (CST) uses a per-block step size \(\frac{1}{\mathcal{K}_{j}}\). BCD (CST) therefore uses a constant step size that is better suited to each block.
BCD, FOBOS, FISTA and SpaRSA differ in how they use gradient information at each iteration. FISTA and SpaRSA use the entire gradient G(W)∈R^{d×m} averaged over all n training instances. This is expensive, especially when both n and d are large. On the other hand, FOBOS uses a stochastic approximation of the entire gradient (averaged over a single training instance) and BCD uses only the partial gradient G(W)_{j:}∈R^{m} (averaged over all training instances). FOBOS and BCD can therefore quickly start to minimize Eq. (2) and increase test accuracy while FISTA and SpaRSA are not even done computing G(W). Additionally, FISTA and SpaRSA change W entirely at each iteration, which forces them to recompute G(W) and F(W^{t+1})−F(W^{t}) entirely. In the case of BCD, only one block W_{j:} is modified at a time, enabling the fast implementation technique described in Sect. 4.3.
In terms of sparsity, FISTA and SpaRSA reduced the number of nonzero rows much more slowly than BCD. However, in the limit, they obtained similar row sparsity to BCD.
5.2.4 Effect of shrinking
We also extended to ℓ1/ℓ2 regularization the shrinking method originally proposed by Yuan et al. (2010) for ℓ1-regularized binary classification. Indeed, using the optimality conditions developed in Sect. 4.5, it is possible to discard zero blocks early if they are likely to remain zero. However, we found that shrinking did not improve convergence on lower-dimensional datasets such as RCV1 and only slightly helped on higher-dimensional datasets such as Amazon7. This is in line with Yuan et al.'s experimental results on ℓ1-regularized binary classification.
5.3 Comparison with other multiclass and multitask objectives
multiclass squared hinge (proposed, same as Eq. (2)): $$\underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}}\ \frac{1}{n} \sum_{i=1}^n \sum_{r \neq y_i} \max\bigl(1 - (\boldsymbol{W}_{:y_i} \cdot\boldsymbol{x}_i - \boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i), 0\bigr)^2 + \lambda\sum_{j=1}^d \|\boldsymbol{W}_{j:}\|_2. $$
multitask squared hinge: $$ \underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}}\ \frac{1}{n} \sum_{r=1}^m \sum_{i=1}^n \max(1 - \boldsymbol{Y}_{ir} \boldsymbol{W}_{:r} \cdot\boldsymbol{x}_i, 0)^2 + \lambda\sum_{j=1}^d \|\boldsymbol{W}_{j:}\|_2, $$(11) where \(\boldsymbol{Y}_{ir}=+1\) if \(y_i=r\) and \(\boldsymbol{Y}_{ir}=-1\) otherwise.
multiclass logistic regression: $$ \underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}}\ \frac{1}{n} \sum_{i=1}^n \log\biggl(1 + \sum_{r \neq y_i} \exp(\boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i - \boldsymbol{W}_{:y_i} \cdot\boldsymbol{x}_i)\biggr) + \lambda\sum_{j=1}^d \|\boldsymbol{W}_{j:}\|_2. $$(12)
For both multitask squared hinge and multiclass logistic regression, we computed the partial gradient using an efficient implementation technique similar to the one described in Sect. 4.3. For the multiclass and multitask squared hinge formulations, we used BCD with line search. For the multiclass logistic regression formulation, we used BCD without line search, since we observed faster training times (see Sect. 5.3.1). For multiclass logistic regression, the partial gradient’s Lipschitz constant is \(\mathcal{K}_{j} = \frac{1}{2} \sum_{i} \boldsymbol{x}_{ij}^{2}\) (Duchi and Singer 2009a).
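These per-block Lipschitz constants are cheap to precompute; a sketch (function name is ours):

```python
import numpy as np

def logistic_block_lipschitz(X):
    """K_j = (1/2) * sum_i x_{ij}^2: Lipschitz constant of the partial
    gradient of multiclass logistic regression (Duchi and Singer 2009a),
    one constant per feature j. X is the n x d data matrix."""
    return 0.5 * np.sum(X ** 2, axis=0)
```

A single pass over the data yields all d constants, which the no-line-search variant then uses as fixed per-block step sizes \(1/\mathcal{K}_j\).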
Table 2 Minimum, median and maximum training times (in seconds) of different ℓ1/ℓ2-regularized objective functions, when computing solutions for 10 log-spaced values between λ=10^{−2} and λ=10^{−4} (Sector), and between λ=10^{−3} and λ=10^{−5} (other datasets)
Dataset            MC Squared Hinge  MT Squared Hinge  MC Logistic
Amazon7   min      285.93            655.53            4,137.39
          median   486.54            909.84            5,822.83
          max      1,118.65          1,740.90          6,128.27
RCV1      min      2,962.82          5,089.91          10,413.47
          median   3,901.58          6,300.77          11,540.80
          max      4,581.33          6,556.67          15,055.57
MNIST     min      32.55             42.74             141.54
          median   55.25             79.56             260.30
          max      61.67             99.37             645.04
News20    min      50.46             71.42             172.94
          median   70.73             93.82             218.04
          max      75.08             98.15             244.62
Sector    min      62.01             251.57            149.09
          median   167.01            259.50            532.30
          max      191.52            273.85            743.45
5.3.1 Comparison with multiclass logistic regression
Compared to multiclass logistic regression, Eq. (12), our objective achieved overall comparable accuracy. As indicated in Table 2, however, our objective was substantially faster to train (up to ten times faster in terms of median time) than multiclass logistic regression. Computationally, our objective has two important advantages. First, the objective and gradient are "lazy": an (instance, class) pair incurs computation only when the correct class's score does not exceed that class's score by at least 1, whereas multiclass logistic regression always iterates over all n instances and m−1 classes. Second, they do not contain any exp or log computations, which are expensive in practice (equivalent to dozens of multiplications) (Yuan et al. 2011).
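The "lazy" behavior can be made concrete with a sketch of the partial gradient computation (our naming; a real implementation would iterate only over the instances with nonzero x_{ij}, as in Sect. 4.3):

```python
import numpy as np

def partial_grad_squared_hinge(W, X, y, j):
    """Partial gradient G(W)_{j:} of the multiclass squared hinge loss.
    'Lazy': an (instance, class) pair contributes only when its margin
    is violated, i.e. when 1 - (s_{y_i} - s_r) > 0."""
    n, m = X.shape[0], W.shape[1]
    g = np.zeros(m)
    for i in range(n):
        if X[i, j] == 0.0:        # sparse data: zero feature, no contribution
            continue
        s = X[i] @ W
        for r in range(m):
            if r == y[i]:
                continue
            margin = 1.0 - (s[y[i]] - s[r])
            if margin > 0.0:      # skipped entirely when the margin holds
                g[r] += 2.0 * margin * X[i, j]
                g[y[i]] -= 2.0 * margin * X[i, j]
    return g / n
```

When the model already classifies most instances with a comfortable margin, the inner updates are skipped almost everywhere, and no exp or log evaluation is ever needed.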
We also tried to use BCD with line search for optimizing the multiclass logistic regression objective. However, we found that the version without line search was overall faster. For example, on Amazon7, the median time with line search was 8723.28 seconds instead of 5822.83 seconds without line search. This contrasts with our results from Sect. 5.2 and thus suggests that a line search may not be beneficial when the objective value and gradient are expensive to compute. However, our results show that even without line search, multiclass logistic regression is much slower to train than our formulation.
5.3.2 Comparison with multitask squared hinge loss
Using an efficient implementation technique similar to the one described in Sect. 4.3, the cost of computing Eq. (13) is \(O(\hat{n}m)\) rather than \(O(\hat{n}(m-1))\) for Eq. (14). We also observed that our multiclass objective typically reached the stopping criterion in fewer iterations than the multitask objective (e.g., k=73 vs. k=108 on the News20 dataset with λ=10^{−3}).
6 Conclusion
In this paper, we proposed a novel direct sparse multiclass formulation, specifically designed for large-scale and high-dimensional problems. We presented two block coordinate descent variants (Tseng and Yun 2009; Richtárik and Takáč 2012a) in a unified manner and developed the core components needed to efficiently optimize our formulation. Experimentally, we showed that block coordinate descent achieves comparable or better convergence speed than FOBOS (Duchi and Singer 2009b), while obtaining much sparser solutions and not requiring an extra hyperparameter. Furthermore, it outperformed full-gradient solvers such as FISTA (Beck and Teboulle 2009) and SpaRSA (Wright et al. 2009). Compared to multiclass logistic regression, our multiclass formulation had significantly faster training times (up to ten times faster in terms of median time) while achieving similar test accuracy. Compared to a multitask squared hinge formulation, our formulation had overall better test accuracy and faster training times. In future work, we would like to empirically evaluate the extensions described in Sect. 4.7.
References
Bach, F. R., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1), 1–106.
Bakin, S. (1999). Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University.
Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183–202.
Bertsekas, D. P. (1999). Nonlinear programming. Belmont: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of conference on learning theory (COLT) (pp. 144–152).
Chang, K. W., Hsieh, C. J., & Lin, C. J. (2008). Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9, 1369–1398.
Combettes, P., & Wajs, V. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4, 1168–1200.
Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
Dredze, M., Crammer, K., & Pereira, F. (2008). Confidence-weighted linear classification. In Proceedings of international conference on machine learning (ICML) (pp. 264–271).
Duchi, J., & Singer, Y. (2009a). Boosting with structural sparsity. In Proceedings of international conference on machine learning (ICML) (pp. 297–304).
Duchi, J., & Singer, Y. (2009b). Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2899–2934.
Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. In Proceedings of neural information processing systems (NIPS) (pp. 681–687).
Fan, R. E., & Lin, C. J. (2007). A study on threshold selection for multi-label classification. Tech. rep., National Taiwan University.
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302–332.
Friedman, J., Hastie, T., & Tibshirani, R. (2010a). A note on the group lasso and a sparse group lasso. Tech. Rep. arXiv:1001.0736.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7, 397–416.
Lee, Y., Lin, Y., & Wahba, G. (2004). Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 67–81.
Mangasarian, O. (2002). A finite Newton method for classification. Optimization Methods and Software, 17, 913–929.
Meier, L., Van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 70(1), 53–71.
Obozinski, G., Taskar, B., & Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 231–252.
Qin, Z., Scheinberg, K., & Goldfarb, D. (2010). Efficient block-coordinate descent algorithms for the group lasso. Tech. rep., Columbia University.
Richtárik, P., & Takáč, M. (2012a). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 1–38.
Richtárik, P., & Takáč, M. (2012b). Parallel coordinate descent methods for big data optimization. Tech. Rep. arXiv:1212.0873.
Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101–141.
Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2010). Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 1–28.
Shevade, S. K., & Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17), 2246–2253.
Tseng, P., & Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117, 387–423.
Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). Feature hashing for large scale multitask learning. In Proceedings of international conference on machine learning (ICML) (pp. 1113–1120).
Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of European symposium on artificial neural networks, computational intelligence and machine learning (pp. 219–224).
Wright, S. J. (2012). Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22, 159–186.
Wright, S. J., Nowak, R. D., & Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7), 2479–2493.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.
Yuan, G. X., Chang, K. W., Hsieh, C. J., & Lin, C. J. (2010). A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11, 3183–3234.
Yuan, G. X., Ho, C. H., & Lin, C. J. (2011). An improved GLMNET for l1-regularized logistic regression. In Proceedings of the international conference on knowledge discovery and data mining (pp. 33–41).
Zhang, H. H., Liu, Y., Wu, Y., & Zhu, J. (2006). Variable selection for multicategory SVM via sup-norm regularization. Electronic Journal of Statistics, 2, 149–167.
Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.