1 Introduction

\(\ell_1\)-regularized loss minimization has attracted a great deal of research over the past decade (Yuan et al. 2010). \(\ell_1\) regularization has many advantages, including its computational efficiency, its ability to perform implicit feature selection and, under certain conditions, to recover the model’s true sparsity (Zhao and Yu 2006). More recently, mixed-norm (e.g., \(\ell_1/\ell_2\)) regularization has been proposed (Bakin 1999; Yuan and Lin 2006) as a way to select groups of features. Here, the notion of group is application-dependent and may be used to exploit prior knowledge about natural feature groups (Yuan and Lin 2006; Meier et al. 2008) or problem structure (Obozinski et al. 2010; Duchi and Singer 2009a).

In this paper, we focus on the application of \(\ell_1/\ell_2\) regularization to multiclass classification problems. Let W be a d×m matrix, where d represents the number of features and m the number of classes. We denote by \(\boldsymbol{W}_{j:} \in \mathbf{R}^m\) the jth row of W and by \(\boldsymbol{W}_{:r} \in \mathbf{R}^d\) its rth column. We consider the traditional multiclass model representation, where an input vector \(\boldsymbol{x} \in \mathbf{R}^d\) is classified into one of the m classes using the following rule:

$$ y = \mathop{\mathrm{argmax}}_{r \in\{1, \dots, m\}} \boldsymbol{W}_{:r} \cdot \boldsymbol{x}. $$
(1)

Each column \(\boldsymbol{W}_{:r}\) of W can be thought of as a prototype representing the rth class and the inner product \(\boldsymbol{W}_{:r} \cdot \boldsymbol{x}\) as the score of the rth class with respect to x. Therefore, Eq. (1) chooses the class with the highest score. Given n training instances \(\boldsymbol{x}_i \in \mathbf{R}^d\) and their associated labels \(y_i \in \{1, \dots, m\}\), our goal is to estimate W.
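
As an illustration (not the paper’s implementation), the prediction rule of Eq. (1) amounts to a matrix product followed by a row-wise argmax; a minimal NumPy sketch:

```python
import numpy as np

def predict(W, X):
    """Prediction rule of Eq. (1).

    W : (d, m) weight matrix, one column (class prototype) per class.
    X : (n, d) matrix whose rows are input vectors.
    Returns, for each row of X, the index of the class with highest score.
    """
    scores = X @ W               # scores[i, r] = W_{:r} . x_i
    return np.argmax(scores, axis=1)
```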

In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs \(\ell_1/\ell_2\) regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models (see Sect. 2). For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun 2009) and the other without (Richtárik and Takáč 2012a). We present the two variants in a unified manner and develop the core components needed to optimize our objective (efficient gradient computation, Lipschitz constant of the gradient, computationally cheap stopping criterion). The end result is a pair of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent compares favorably with other solvers such as FOBOS (Duchi and Singer 2009b), FISTA (Beck and Teboulle 2009) and SpaRSA (Wright et al. 2009). Furthermore, we show that our formulation obtains very compact multiclass models and outperforms \(\ell_1/\ell_2\)-regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.

2 Sparsity-inducing regularization

Figure 1 illustrates the sparsity patterns obtained by different forms of regularization and shows why \(\ell_1/\ell_2\) regularization is particularly well adapted to multiclass models. With \(\ell_{2}^{2}\) regularization (ridge), \(R_{\ell_{2}^{2}}(\boldsymbol{W}) = \frac{1}{2} \sum_{j,r} \boldsymbol{W}_{jr}^{2}\), sparsity is not enforced and therefore the obtained model may become completely dense. With \(\ell_1\) regularization (lasso), \(R_{\ell_{1}}(\boldsymbol{W}) = \sum_{j,r} |\boldsymbol{W}_{jr}|\), the model becomes sparse at the individual weight level. A well-known problem of \(\ell_1\) regularization is that, if several features are correlated, it tends to select only one of them, even if other features are useful for prediction. To solve this problem, \(\ell_1\) regularization can be combined with \(\ell_{2}^{2}\) regularization. The resulting regularization, \(R_{\ell_{1} + \ell_{2}^{2}}(\boldsymbol{W}) = \rho R_{\ell_{1}}(\boldsymbol{W}) + (1 - \rho) R_{\ell_{2}^{2}}(\boldsymbol{W})\), where ρ>0 is a hyperparameter, is known as elastic-net in the literature (Zou and Hastie 2005) and leads to sparsity at the individual weight level.

Fig. 1

Illustration of the sparsity patterns obtained by \(\ell_{2}^{2}\) (ridge), \(\ell_1\) (lasso), \(\ell_{1} + \ell_{2}^{2}\) (elastic-net) and \(\ell_1/\ell_2\) (group lasso) regularizations on the matrix \(\boldsymbol{W} \in \mathbf{R}^{d \times m}\). With \(\ell_1/\ell_2\) regularization, we can obtain compact and fast-to-evaluate multiclass models

Let \(R_{\ell_{2}}(\boldsymbol{W}_{j:}) = \|\boldsymbol{W}_{j:}\|_{2}\) (notice that the \(\ell_2\) norm is not squared). With \(\ell_1/\ell_2\) regularization (group lasso), \(R_{\ell_{1}/\ell_{2}}(\boldsymbol{W}) = \sum_{j} R_{\ell_{2}}(\boldsymbol{W}_{j:})\), the model becomes sparse at the feature group (here, row) level. Applied to a multiclass model, \(\ell_1/\ell_2\) regularization can thus force weights corresponding to the same feature to become zero across all classes. The corresponding features can therefore be safely ignored at test time, which is especially useful when features are expensive to extract. For more information on sparsity-inducing penalties, see Bach et al.’s excellent survey (Bach et al. 2012).
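
For concreteness, here is a small NumPy sketch (ours, not from the paper) of the penalties discussed above and of the set of features kept by an \(\ell_1/\ell_2\)-regularized model:

```python
import numpy as np

def ridge_penalty(W):
    # R_{l2^2}(W) = 1/2 * sum_{j,r} W_{jr}^2  (no sparsity)
    return 0.5 * np.sum(W ** 2)

def lasso_penalty(W):
    # R_{l1}(W) = sum_{j,r} |W_{jr}|  (sparsity at the individual weight level)
    return np.sum(np.abs(W))

def group_lasso_penalty(W):
    # R_{l1/l2}(W) = sum_j ||W_{j:}||_2, one group per row (feature)
    return np.sum(np.linalg.norm(W, axis=1))

def selected_features(W):
    # Rows with a non-zero l2 norm: the only features needed at test time.
    return np.flatnonzero(np.linalg.norm(W, axis=1) > 0)
```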

3 Related work

3.1 Multiclass classification: direct vs. indirect formulations

Classifying an object into one of several categories is an important problem arising in many applications such as document classification and object recognition. Machine learning approaches to this problem can be roughly divided into two categories: direct and indirect approaches. While direct approaches formulate the multiclass problem directly, indirect approaches reduce the multiclass problem to multiple independent binary classification or regression problems. Because support vector machines (SVMs) (Boser et al. 1992) were originally proposed as a binary classification model, they have frequently been used in combination with indirect approaches to perform multiclass classification. Among them, one of the most popular is “one-vs-rest” (Rifkin and Klautau 2004), which consists of learning to separate one class from all the others, independently for all m possible classes. Direct multiclass SVM extensions were later proposed by Weston and Watkins (1999), Lee et al. (2004) and Crammer and Singer (2002). They were all formulated as constrained problems and solved in the dual. An unconstrained (non-differentiable) form of the Crammer-Singer formulation is popularly used with stochastic subgradient descent algorithms such as Pegasos (Shalev-Shwartz et al. 2010). Another popular direct multiclass (smooth) formulation, which is an intuitive extension of traditional logistic regression, is multiclass logistic regression. In this paper, we propose an efficient direct multiclass formulation.

3.2 Sparse multiclass classification

Recently, mixed-norm regularization has attracted much interest (Yuan and Lin 2006; Meier et al. 2008; Duchi and Singer 2009a, 2009b; Obozinski et al. 2010) due to its ability to impose sparsity at the feature group level. Few papers, however, have investigated its application to multiclass classification. Zhang et al. (2006) extend Lee et al.’s multiclass SVM formulation (Lee et al. 2004) to employ \(\ell_1/\ell_\infty\) regularization and formulate the learning problem as a linear program (LP). However, they experimentally verify their method only on very small problems (both in terms of n and d). Duchi and Singer (2009a) propose a boosting-like algorithm specialized for \(\ell_1/\ell_2\)-regularized multiclass logistic regression. In another paper, Duchi and Singer (2009b) derive and analyze FOBOS, a stochastic subgradient descent framework based on forward-backward splitting, and apply it, among other things, to \(\ell_1/\ell_2\)-regularized multiclass logistic regression. In this paper, we choose \(\ell_1/\ell_2\) regularization, since it can be optimized more efficiently than \(\ell_1/\ell_\infty\) regularization (see Sect. 4.7).

3.3 Coordinate descent methods

Although coordinate descent methods were among the first optimization methods proposed and studied in the literature (see Bertsekas 1999 and references therein), it is only recently that they regained popularity, thanks to several successful applications in the machine learning (Fu 1998; Shevade and Keerthi 2003; Friedman et al. 2007, 2010b; Yuan et al. 2010; Qin et al. 2010) and optimization (Tseng and Yun 2009; Wright 2012; Richtárik and Takáč 2012a) communities. Conceptually and algorithmically simple, (block) coordinate descent algorithms focus at each iteration on updating one block of variables while keeping the others fixed, and have been shown to be particularly well-suited for minimizing objective functions with non-smooth separable regularization such as \(\ell_1\) or \(\ell_1/\ell_2\) (Tseng and Yun 2009; Wright 2012; Richtárik and Takáč 2012a).

Coordinate descent algorithms have different trade-offs: expensive gradient-based greedy block selection as opposed to cheap cyclic or randomized selection, use of line search (Tseng and Yun 2009; Wright 2012) or not (Richtárik and Takáč 2012a). For large-scale linear classification, as we confirm in this paper, cyclic and randomized block selection schemes have been shown to achieve excellent performance (Yuan et al. 2010, 2011; Chang et al. 2008; Richtárik and Takáč 2012a). The most popular loss function for \(\ell_1\)-regularized binary classification is arguably logistic regression, due to its smoothness (Yuan et al. 2010). Binary logistic regression was also successfully combined with \(\ell_1/\ell_2\) regularization in the case of user-defined feature groups (Meier et al. 2008). However, recent work (Yuan et al. 2010, 2011; Chang et al. 2008) using coordinate descent indicates that logistic regression is substantially slower to train than \(\ell_2\)-loss (squared hinge) SVMs. This is because, contrary to \(\ell_2\)-loss SVMs, logistic regression requires expensive log and exp computations (equivalent to dozens of multiplications) to compute the gradient or objective value (Yuan et al. 2011). Motivated by this background, we propose a novel efficient direct multiclass formulation. Compared to multiclass logistic regression, which suffers from the same problems as its binary counterpart, our formulation can be optimized very efficiently by block coordinate descent and lends itself to large-scale and high-dimensional problems such as document classification.

4 Sparse direct multiclass classification

4.1 Objective function

Given n training instances \(\boldsymbol{x}_i \in \mathbf{R}^d\) and their associated labels \(y_i \in \{1, \dots, m\}\), our goal is to estimate W such that Eq. (1) produces accurate predictions and W is row-wise sparse. To this end, we minimize the following convex objective:

$$ \underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}} F( \boldsymbol{W}) = \underbrace{\frac{1}{n} \sum_{i=1}^n \sum_{r \neq y_i} \max\bigl(1 - (\boldsymbol{W}_{:y_i} \cdot\boldsymbol{x}_i - \boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i), 0\bigr)^2}_{L(\boldsymbol{W})} + \lambda \underbrace{\sum_{j=1}^d \| \boldsymbol{W}_{j:}\|_2}_{R(\boldsymbol{W})}, $$
(2)

where λ>0 is a parameter controlling the trade-off between loss and penalty minimization. We call F(W) the \(\ell_1/\ell_2\)-regularized multiclass squared hinge loss function. Intuitively, for every training instance and every class different from the correct label, if that class’s score is lower than the score assigned to the correct label by at least 1, the model suffers zero loss. Otherwise, it suffers a loss which grows quadratically with the margin violation. Besides convexity, F(W) possesses the following desirable properties:

  1. It is a direct multiclass formulation and its relation with Eq. (1) is intuitive.

  2. Its objective value and gradient can be computed efficiently (unlike multiclass logistic regression, which requires expensive log and exp operations).

  3. It empirically performs comparably or better than other multitask and multiclass formulations.

  4. It meets several conditions needed to prove global convergence of block coordinate descent algorithms (see Sect. 4.6).

Our objective, Eq. (2), is similar in spirit to Weston and Watkins’ multiclass SVM formulation (Weston and Watkins 1999), in that it ensures that the correct class’s score is greater than the scores of all the other classes by at least 1. However, it has the following differences: it is unconstrained (rather than constrained), it is \(\ell_1/\ell_2\)-regularized (rather than \(\ell_{2}^{2}\)-regularized) and it penalizes misclassifications quadratically (rather than linearly), which ensures differentiability of L(W).
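
To make Eq. (2) concrete, here is a naive NumPy sketch of evaluating F(W); it materializes all n×m margins at once and is therefore not the efficient implementation described in Sect. 4.3:

```python
import numpy as np

def objective(W, X, y, lam):
    """Naive evaluation of F(W) in Eq. (2).

    W : (d, m) weights, X : (n, d) data, y : (n,) labels in {0, ..., m-1},
    lam : regularization parameter lambda > 0.
    """
    n = X.shape[0]
    scores = X @ W                                       # (n, m)
    correct = scores[np.arange(n), y][:, np.newaxis]     # W_{:y_i} . x_i
    margins = np.maximum(1.0 - (correct - scores), 0.0)  # A(W)_{ir} clipped at 0
    margins[np.arange(n), y] = 0.0                       # exclude r = y_i
    loss = np.sum(margins ** 2) / n
    penalty = lam * np.sum(np.linalg.norm(W, axis=1))
    return loss + penalty
```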

4.2 Optimization by block coordinate descent

A key property of F(W) is the separability of its non-smooth part R(W) over groups j=1,2,…,d. This calls for an algorithm which minimizes F(W) by updating W group by group. In this paper, to minimize F(W), we thus employ block coordinate descent. We consider two variants, one with line search (Tseng and Yun 2009) and the other without (Richtárik and Takáč 2012a). We present the two variants in a unified manner.

Algorithm 1 outlines block coordinate descent for minimizing F(W). At each iteration, Algorithm 1 selects a block \(\boldsymbol{W}_{j:} \in \mathbf{R}^m\) of coefficients and updates it, keeping all other blocks fixed (how to choose the block is deferred to Sect. 4.4). This procedure is repeated several times until a suitable stopping criterion is met or the maximum number of outer iterations K is reached. The main difficulty arising in Algorithm 1 is how to solve the sub-problem associated with each weight block \(\boldsymbol{W}_{j:}\). Let \(\boldsymbol{W}^t\) be the weight matrix at iteration t. The key idea of block coordinate descent frameworks for non-smooth separable minimization (Tseng and Yun 2009; Richtárik and Takáč 2012a) is to update each block by solving the following quadratic approximation of F around \(\boldsymbol{W}^t\):

$$ \boldsymbol{W}^*_{j:} = \underset{\boldsymbol{W}_{j:} \in \mathbf{R}^m}{\mathrm{argmin}} ~ G\bigl(\boldsymbol{W}^t \bigr)_{j:}^\mathrm{T} \bigl(\boldsymbol{W}_{j:} - \boldsymbol{W}_{j:}^t\bigr) + \frac{1}{2} \bigl( \boldsymbol{W}_{j:} - \boldsymbol{W}_{j:}^t \bigr)^\mathrm{T} \boldsymbol{H}^t \bigl(\boldsymbol{W}_{j:} - \boldsymbol{W}_{j:}^t\bigr) + \lambda\| \boldsymbol{W}_{j:}\|_2, $$
(3)

where we used \(G(\boldsymbol{W})_{j:} \in \mathbf{R}^m\) to denote the jth row of the gradient of L(W) and \(\boldsymbol{H}^t\) is an m×m matrix. If we choose \(\boldsymbol{H}^{t}=\mathcal{L}_{j}^{t} \boldsymbol{I}\), where \(\mathcal{L}_{j}^{t}\) is a scalar (we discuss its choice in Sect. 4.4) and I is the identity matrix, Eq. (3) can be rewritten as:

$$\boldsymbol{W}^*_{j:} = \underset{\boldsymbol{W}_{j:} \in \mathbf{R}^m}{\mathrm{argmin}} ~ \frac{1}{2} \bigl\| \boldsymbol{W}_{j:} - \boldsymbol{V}_{j:}^t \bigr\|^2 + \mu_j^t \|\boldsymbol{W}_{j:} \|_2 $$

where we defined \(\boldsymbol{V}_{j:}^{t} = \boldsymbol{W}^{t}_{j:} - \frac{1}{\mathcal{L}_{j}^{t}} G(\boldsymbol{W}^{t})_{j:}\) and \(\mu_{j}^{t} = \frac{\lambda}{\mathcal{L}_{j}^{t}}\). This problem takes a form which is well known in the signal-processing literature and whose solution is called the proximity operator (Combettes and Wajs 2005). The proximity operator associated with the \(\ell_2\) norm takes a closed form (see e.g. Duchi and Singer 2009b for a derivation):

$$ \boldsymbol{W}^*_{j:} = \mathrm{Prox}_{\mu_j^t \|\cdot\|_2}\bigl( \boldsymbol{V}^t_{j:}\bigr) = \max\biggl\{1 - \frac{\mu_j^t}{\|\boldsymbol{V}^t_{j:}\|_2}, 0\biggr\} \boldsymbol{V}^t_{j:}. $$
(4)

This operator is known as the vectorial soft-thresholding operator (Wright et al. 2009), owing to the fact that \(\boldsymbol{W}^{*}_{j:}\) becomes entirely zero when \(1 - \frac{\mu^{t}_{j}}{\|\boldsymbol{V}^{t}_{j:}\|_{2}} < 0\). Summarizing, we obtain \(\boldsymbol{W}^{*}_{j:}\) by taking a partial gradient step with step size \(\frac{1}{\mathcal{L}^{t}_{j}}\) and then applying \(\mathrm{Prox}_{\mu_{j}^{t} \|\cdot\|_{2}}\) to the result. Finally, let \(\boldsymbol{\delta }^{t}_{j} = \boldsymbol{W}^{*}_{j:} - \boldsymbol{W}^{t}_{j:}\). The last step consists in setting \(\boldsymbol{W}^{t+1}_{j:} = \boldsymbol{W}^{t}_{j:} + \alpha_{j}^{t} \boldsymbol{\delta}^{t}_{j}\). We discuss the choice of \(\alpha_{j}^{t}\) in Sect. 4.4. Algorithm 2 summarizes how to solve the block sub-problem associated with \(\boldsymbol{W}_{j:}\) (we drop the superscript t since there is no ambiguity).
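
A sketch of this block update (Algorithm 2), assuming the partial gradient \(G(\boldsymbol{W})_{j:}\) and the scalar \(\mathcal{L}_j\) are already available (the function names are ours):

```python
import numpy as np

def prox_l2(v, mu):
    """Vectorial soft-thresholding operator of Eq. (4)."""
    norm = np.linalg.norm(v)
    if norm == 0.0 or 1.0 - mu / norm < 0.0:
        return np.zeros_like(v)        # the whole row is zeroed out
    return (1.0 - mu / norm) * v

def block_update(W_j, grad_j, L_j, lam, alpha=1.0):
    """Partial gradient step with step size 1/L_j, then proximal projection (Eqs. (3)-(4))."""
    V_j = W_j - grad_j / L_j
    W_star = prox_l2(V_j, lam / L_j)
    delta = W_star - W_j               # delta_j of the text
    return W_j + alpha * delta, delta  # alpha = 1 when no line search is used
```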

Algorithm 1

Block-coordinate algorithm for minimization of F(W)

Algorithm 2

Solving the block sub-problem associated with \(\boldsymbol{W}_{j:}\)

4.3 Efficient partial gradient computation

We now discuss efficient computation of the partial gradient \(G(\boldsymbol{W})_{j:}\) of L(W), which is crucial for the overall efficiency of Algorithm 1. We first rewrite L(W) as:

$$L(\boldsymbol{W}) = \frac{1}{n} \sum_{i=1}^n \sum_{r \neq y_i} \max\bigl(A(\boldsymbol{W})_{ir}, 0\bigr)^2, $$

where A(W) is an n×m matrix defined by:

$$A(\boldsymbol{W})_{ir} = 1 - (\boldsymbol{W}_{:y_i} \cdot \boldsymbol {x}_i - \boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i). $$

The partial gradient of L(W) can then be concisely written as:

$$ G(\boldsymbol{W})_{j:} = -\frac{2}{n} \sum _{i=1}^n \sum_{r \neq y_i} \max \bigl(A(\boldsymbol{W})_{ir}, 0\bigr) [\boldsymbol{x}_{ij} \boldsymbol{e}_{y_i} - \boldsymbol{x}_{ij} \boldsymbol{e}_r], $$
(5)

where \(\boldsymbol{e}_{r} = [\underbrace{0, \dots, 0}_{r-1}, 1, 0, \dots, 0]^{\mathrm{T}}\).

Since computing \(A(\boldsymbol{W})_{ir}\) from scratch would be computationally prohibitive, we instead initialize A(W) to \(\mathbf{1}_{n \times m}\) at the beginning of Algorithm 1; then, when a weight block is updated by \(\boldsymbol{W}_{j:} \leftarrow \boldsymbol{W}_{j:} + \alpha_j \boldsymbol{\delta}_j\), we update A(W) by \(A(\boldsymbol{W})_{ir} \leftarrow A(\boldsymbol{W})_{ir} + \alpha_{j} (\boldsymbol{\delta}_{jr} - \boldsymbol{\delta}_{j{y_{i}}}) \boldsymbol{x}_{ij}\) for all i such that \(\boldsymbol{x}_{ij} \neq 0\) and all \(r \neq y_i\). Thanks to this implementation technique, denoting by \(\hat{n}\) the average number of non-zero values per feature, the cost of computing Eq. (5) is only \(O(\hat{n}(m-1))\). We summarize how to efficiently compute \(G(\boldsymbol{W})_{j:}\) in Algorithm 3. When using sparse data, the compressed sparse column (CSC) format can be used for fast access to all non-zero values of feature j (inner loop in Algorithm 3).

Algorithm 3

Efficient computation of \(G(\boldsymbol{W})_{j:}\) and \(h(\boldsymbol{W})_{j:}\)
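
Below is a possible rendering of Algorithm 3 and of the bookkeeping of A(W) described above. It is only a sketch: it assumes the data are stored as a scipy.sparse CSC matrix and computes only the gradient part (the generalized second derivatives \(h(\boldsymbol{W})_{j:}\), defined in Sect. 4.4.1, can be accumulated in the same loop); the variable names are ours.

```python
import numpy as np

def partial_gradient(A, X_csc, y, j, m):
    """Compute G(W)_{j:} of Eq. (5) from the cached margin matrix A(W).

    A     : dense (n, m) array holding A(W), kept up to date by update_margins.
    X_csc : data in scipy.sparse.csc_matrix format, for fast access to column j.
    """
    n = A.shape[0]
    G_j = np.zeros(m)
    start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
    for i, x_ij in zip(X_csc.indices[start:end], X_csc.data[start:end]):
        for r in range(m):
            if r == y[i] or A[i, r] <= 0.0:   # only margin-violating pairs contribute
                continue
            G_j[y[i]] -= 2.0 / n * A[i, r] * x_ij
            G_j[r] += 2.0 / n * A[i, r] * x_ij
    return G_j

def update_margins(A, X_csc, y, j, delta, alpha):
    """After W_{j:} <- W_{j:} + alpha * delta, refresh A(W) for all i with x_ij != 0."""
    start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
    for i, x_ij in zip(X_csc.indices[start:end], X_csc.data[start:end]):
        # For r = y_i the increment is zero, so the whole row can be updated at once.
        A[i, :] += alpha * (delta - delta[y[i]]) * x_ij
```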

4.4 Choice of block, \(\mathcal{L}_{j}^{t}\) and \(\alpha_{j}^{t}\)

We now discuss how to choose, at every iteration, the block \(\boldsymbol{W}_{j:}\) to update, \(\mathcal{L}_{j}^{t}\) and \(\alpha_{j}^{t}\), depending on whether a line search is used or not.

4.4.1 With line search (Tseng and Yun)

Following Tseng and Yun (2009), we can choose

$$\mathcal{L}_j^t = \max\bigl(\bigl\|{h}\bigl( \boldsymbol{W}^t\bigr)_{j:}\bigr\|_{\infty}, \epsilon \bigr), $$

where ϵ is a small constant (e.g., \(10^{-12}\)) to ensure positivity and:

$$h(\boldsymbol{W})_{j:} = \biggl[ \frac{\partial^2 L}{\partial\boldsymbol{W}_{j1}^2}, \dots, \frac{\partial^2 L}{\partial\boldsymbol{W}_{jm}^2} \biggr]^\mathrm{T}. $$

In our case, L is not twice-differentiable, since G(W) is not differentiable when \(A(\boldsymbol{W})_{ir} = 0\). We can however define its generalized second derivatives (Mangasarian 2002; Chang et al. 2008):

$$ h(\boldsymbol{W})_{j:} = \frac{2}{n} \sum _{i=1}^n \sum_{r \neq y_i} \delta_{[A(\boldsymbol {W})_{ir} > 0]} \bigl(\boldsymbol{x}_{ij}^2 \boldsymbol{e}_{y_i} + \boldsymbol{x}_{ij}^2 \boldsymbol{e}_r\bigr), $$
(6)

where δ[.] is the Kronecker delta. Since choosing \(\mathcal {L}^{t}_{j}\) as above might lead to an overly large step size \(\frac{1}{\mathcal {L}^{t}_{j}}\), Tseng and Yun choose \(\alpha^{t}_{j}\) such that the following sufficient decrease condition is satisfied:

$$ F\bigl(\boldsymbol{W}^{t+1}\bigr) - F\bigl(\boldsymbol{W}^t \bigr) \le \sigma\alpha^t_j \bigl(G\bigl( \boldsymbol{W}^t\bigr)_{j:}^\mathrm{T} \boldsymbol { \delta}_j + \lambda\bigl\|\boldsymbol{W}^t_{j:} + \boldsymbol{\delta}_j\bigr\|_2 - \lambda \bigl\| \boldsymbol{W}^t_{j:}\bigr\|_2\bigr), $$
(7)

where σ is a user-defined constant such that 0<σ<1. We can choose \(\alpha^{t}_{j}\) by backtracking line search, that is, by sequentially trying \(\alpha^{t}_{j}=1, \omega, \omega^{2}, \dots\) until Eq. (7) is satisfied. Common choices in the optimization literature for σ and ω are 0.01 and 0.5, respectively. Since we have \(\boldsymbol{W}^{t+1}_{j:} = (1-\alpha^{t}_{j}) \boldsymbol {W}^{t}_{j:} + \alpha^{t}_{j} \boldsymbol{W}^{*}_{j:}\), we see that \(\boldsymbol {W}^{t+1}_{j:}\) can be interpreted as a convex combination of the current iterate and the sub-problem’s solution.
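
A sketch of this backtracking step, assuming a callable objective_diff(alpha) that returns \(F(\boldsymbol{W}^{t+1}) - F(\boldsymbol{W}^t)\) when \(\boldsymbol{W}_{j:}\) is moved to \(\boldsymbol{W}_{j:} + \alpha \boldsymbol{\delta}_j\) (in our implementation this difference is obtained in the same pass over the data as the partial gradient):

```python
import numpy as np

def backtracking_line_search(objective_diff, grad_j, W_j, delta, lam,
                             sigma=0.01, omega=0.5, max_steps=30):
    """Return a step size alpha satisfying the sufficient decrease condition, Eq. (7)."""
    # Right-hand side of Eq. (7), without the sigma * alpha factor.
    decrease = (grad_j @ delta
                + lam * np.linalg.norm(W_j + delta)
                - lam * np.linalg.norm(W_j))
    alpha = 1.0
    for _ in range(max_steps):
        if objective_diff(alpha) <= sigma * alpha * decrease:
            break
        alpha *= omega
    return alpha
```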

Similarly to Eq. (5), the cost of computing Eq. (7) and Eq. (6) is \(O(\hat{n}(m-1))\). In practice, we observe that one line search step often suffices for Eq. (7) to be satisfied. Therefore, the cost of one call to Algorithm 2 is in general \(O(\hat{n}(m-1))\).

To enjoy Tseng and Yun’s theoretical guarantees (see Sect. 4.6), we need to use cyclic block selection. That is, in Algorithm 1, at each inner iteration, we need to choose j=l.

4.4.2 Without line search (Richtárik and Takáč)

We show in Appendix A that \(G(\boldsymbol{W})_{j:}\) is Lipschitz continuous with constant

$$\mathcal{K}_j = \frac{4(m-1)}{n} \sum _i \boldsymbol{x}_{ij}^2. $$

Following Richtárik and Takáč (2012a), we can choose \(\mathcal{L}^{t}_{j} = \mathcal{K}_{j}\). In that case, no line search is needed, i.e., \(\alpha^{t}_{j} = 1\) and \(\boldsymbol{W}^{t+1}_{j:} = \boldsymbol {W}^{*}_{j:}\). Our implementation pre-computes \(\mathcal{K}_{j}\ \forall j \in\{1, \dots, d\}\) and stores the results in a d-dimensional vector. Note that Richtárik and Takáč assume that blocks are selected with uniform probability \(\frac{1}{d}\).
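
Precomputing the constants \(\mathcal{K}_j\) is straightforward; a sketch for dense or sparse data (our own helper, not part of the paper):

```python
import numpy as np
import scipy.sparse as sp

def block_lipschitz_constants(X, m):
    """K_j = 4 (m - 1) / n * sum_i x_ij^2, for every feature j."""
    n = X.shape[0]
    if sp.issparse(X):
        col_sq_norms = np.asarray(X.multiply(X).sum(axis=0)).ravel()
    else:
        col_sq_norms = np.sum(np.asarray(X) ** 2, axis=0)
    return 4.0 * (m - 1) / n * col_sq_norms
```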

Using a line search or not is a matter of trade-off: using a line search has higher cost per iteration but can potentially lead to greater progress due to the larger step size. We compare both strategies experimentally in Sect. 5.2. One advantage of Richtárik and Takáč’s framework, however, is that it can be parallelized (Richtárik and Takáč 2012b), potentially leading to significant speedups. In future work, we plan to compare sequential and parallel block coordinate descent when applied to our objective, Eq. (2).

4.5 Stopping criterion

We would like to develop a stopping criterion for Algorithm 1 which can be checked at almost no extra computational cost. Proposition 1 characterizes an optimal solution of Eq. (2).

Proposition 1

W is an optimal solution of Eq. (2) if and only if, ∀j:

$$ \bigl\|G(\boldsymbol{W})_{j:}\bigr\|_2 \le\lambda \quad \mbox{if } \boldsymbol{W}_{j:} = \mathbf{0}, $$
(8a)

$$ G(\boldsymbol{W})_{j:} = -\lambda \frac{\boldsymbol{W}_{j:}}{\|\boldsymbol{W}_{j:}\|_2} \quad \mbox{if } \boldsymbol{W}_{j:} \neq\mathbf{0}. $$
(8b)

The proof is given in Appendix B. Using Proposition 1 and the fact that Eq. (8b) is equivalent to \(\|G(\boldsymbol{W})_{j:}\|_2 = \lambda\) if \(\boldsymbol{W}_{j:} \neq \mathbf{0}\), we define \(v^t\), the optimality violation at the tth iteration (the larger, the stronger the violation):

$$ v^t = \begin{cases} \max(\|G(\boldsymbol{W}^t)_{j(t):}\|_2 - \lambda, 0) & \mbox{if } \boldsymbol{W}^t_{j(t):} = \mathbf{0} \\ \bigl|\|G(\boldsymbol{W}^t)_{j(t):}\|_2 - \lambda\bigr| & \mbox{if } \boldsymbol{W}^t_{j(t):} \neq\mathbf{0}, \end{cases} $$
(9)

where j(t) denotes the block selected at the tth iteration. In Eq. (9), the max operator is to account for the inequality in (8a) and the absolute value for the equality in (8b). Since we already need \(G(\boldsymbol{W}^t)_{j(t):}\) for solving each block sub-problem, computing \(v^t\) comes at almost no extra cost.

As indicated in Algorithm 1, we check convergence at the end of each outer iteration. Let \(\mathcal{T}^{k} = \{(k-1)d +1, (k-1)d+2, \dots, kd\}\) be the set of values taken by t at the kth outer iteration. One possible stopping criterion is:

$$ \frac{\sum_{t \in\mathcal{T}^k} v^t}{\sum_{t \in\mathcal{T}^1} v^t} < \tau, $$
(10)

where 0<τ≤1 is a user-defined tolerance constant (the larger, the earlier the algorithm stops). This criterion is the most natural when cyclic block selection is used, since the sums in Eq. (10) are then over all blocks 1,…,d. Another possible stopping criterion consists in replacing the \(\ell_1\) norm by the \(\ell_\infty\) norm:

$$\frac{\max_{t \in\mathcal{T}^k} v^t}{\max_{t \in\mathcal{T}^1} v^t} < \tau. $$

We use this criterion when randomized uniform block selection is used. In both cases, the denominator serves the purpose of normalization (hence, τ is not sensitive to the dataset dimensionality).
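
A sketch of the optimality violation of Eq. (9) and of the two stopping rules above, reusing the partial gradient already computed for the block sub-problem:

```python
import numpy as np

def violation(grad_j, W_j, lam):
    """Optimality violation v^t of Eq. (9) for the selected block."""
    g_norm = np.linalg.norm(grad_j)
    if not np.any(W_j):                 # W_{j:} = 0: inequality (8a)
        return max(g_norm - lam, 0.0)
    return abs(g_norm - lam)            # W_{j:} != 0: equality (8b)

def should_stop(violations_k, violations_1, tol, randomized=False):
    """Eq. (10) for cyclic selection; the max-based variant for randomized selection."""
    if randomized:
        return max(violations_k) / max(violations_1) < tol
    return sum(violations_k) / sum(violations_1) < tol
```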

4.6 Global convergence properties

We discuss convergence properties for the two block coordinate descent variants we considered: cyclic block coordinate descent with line search (Tseng and Yun 2009) and randomized block coordinate descent without line search (Richtárik and Takáč 2012a). To have finite termination of the line search, Tseng and Yun (Lemma 5.1) require that L has a Lipschitz continuous gradient, which we prove with Lemma 1 in Appendix A. For asymptotic convergence, Tseng and Yun assume that each block is cyclically visited (Eq. (12)). They further assume (Assumption 1) that \(\boldsymbol{H}^t\) is upper-bounded by some value and lower-bounded by 0, which is guaranteed by our choice \(\boldsymbol {H}^{t} = \mathcal{L}_{j}^{t} \boldsymbol{I}\). Richtárik and Takáč also assume (Sect. 2) that the blockwise gradient is Lipschitz continuous. They show (Theorem 4) that, using their algorithm, there exists a finite iteration t such that \(P(F(\boldsymbol{W}^t) - F(\boldsymbol{W}^*) \le \epsilon) \ge 1 - \rho\), where ϵ>0 is the accuracy of the solution and 0<ρ<1 is the target confidence.

4.7 Extensions

A straightforward extension of our objective, Eq. (2), is label ranking with multilabel data:

$$\underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}} \frac{1}{n} \sum_{i=1}^n \sum _{r \in\mathcal{Y}_i, r' \notin\mathcal{Y}_i} \max\bigl(1 - (\boldsymbol{W}_{:r} \cdot\boldsymbol{x}_i - \boldsymbol{W}_{:r'} \cdot \boldsymbol{x}_i), 0\bigr)^2 + \lambda\sum _{j=1}^d \|\boldsymbol{W}_{j:} \|_2, $$

where \(\mathcal{Y}_{i}\) is the set of labels assigned to \(\boldsymbol{x}_i\). Intuitively, this objective attempts to assign higher scores to relevant labels than to non-relevant labels. If the goal is to predict label sets rather than label rankings, threshold selection methods (Fan and Lin 2007; Elisseeff and Weston 2001) may be applied as a post-processing step.

Another possible extension consists in replacing \(\ell_1/\ell_2\) regularization by \(\ell_1/\ell_\infty\) regularization or \(\ell_1 + \ell_1/\ell_2\) regularization (sparse group lasso, Friedman et al. 2010a). This requires changing the proximity operator, Eq. (4), as well as reworking the stopping criterion developed in Sect. 4.5. Similarly to \(\ell_1/\ell_2\) regularization, \(\ell_1/\ell_\infty\) regularization leads to group sparsity. However, the proximity operator associated with the \(\ell_\infty\) norm requires a projection onto an \(\ell_1\)-norm ball (Bach et al. 2012) and is thus computationally more expensive than the proximity operator associated with the \(\ell_2\) norm, which takes a closed form, Eq. (4). For \(\ell_1 + \ell_1/\ell_2\) regularization (sparse group lasso), the group-wise proximity operator can readily be computed by applying first the proximity operator associated with the \(\ell_1\) norm and then the one associated with the \(\ell_2\) norm (Bach et al. 2012). However, sparse group lasso regularization requires the tuning of an extra hyperparameter, which balances between \(\ell_1/\ell_2\) and \(\ell_1\) regularizations. For this reason, we do not consider it in our experiments.
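
For example, the group-wise sparse group lasso proximity operator mentioned above can be obtained by composing the two elementary operators; a sketch (hypothetical helper functions, not part of our solver):

```python
import numpy as np

def prox_l1(v, mu):
    """Elementwise soft-thresholding: proximity operator of mu * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def prox_sparse_group_lasso(v, mu_l1, mu_l2):
    """Apply the l1 operator first, then the l2 (group) operator (Bach et al. 2012)."""
    u = prox_l1(v, mu_l1)
    norm = np.linalg.norm(u)
    if norm <= mu_l2:
        return np.zeros_like(u)
    return (1.0 - mu_l2 / norm) * u
```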

5 Experiments

We conducted two experiments. In the first experiment, we investigated the performance (in terms of speed of convergence and row sparsity) of block coordinate descent (with or without line search) for optimizing the proposed direct multiclass formulation Eq. (2), compared to other state-of-the-art solvers. In the second experiment, we compared the proposed direct multiclass formulation with other multiclass and multitask formulations in terms of test accuracy, row sparsity and training speed. Experiments were run on a Linux machine with an Intel Xeon CPU (3.47 GHz) and 4 GB memory.

5.1 Datasets

Table 1 summarizes the datasets we used to conduct our experiments:

  • Amazon7: product-review (books, DVD, electronics, …) classification.

  • RCV1: news document classification.

  • MNIST: handwritten digit classification.

  • News20: newsgroup message classification.

  • Sector: web-page (industry sectors) classification.

We created Amazon7 using the entire data of Dredze et al. (2008) (they used only a small subset). At this scale, constructing feature vectors from raw text by conventional bag-of-words extraction exceeded the memory of our computer. For this reason, we instead used the hashing trick (Weinberger et al. 2009) (a popular technique for large-scale and high-dimensional linear classification problems) and set the dimensionality to \(d = 2^{18}\). Amazon7 is available for download from http://www.mblondel.org/data/. Other datasets are available in vectorized form from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. To determine test accuracy, we used stratified selection in order to split each dataset into 4/5 training and 1/5 testing.

Table 1 Datasets used in Sect. 5

5.2 Comparison of block coordinate descent with other solvers

In this section, we compare different solvers:

  • BCD (LS): block coordinate descent with line search and with cyclic block selection (Tseng and Yun 2009),

  • BCD (CST): block coordinate descent without line search and with randomized uniform block selection (Richtárik and Takáč 2012a),

  • FISTA (LS): an accelerated iterative thresholding algorithm with line search (Beck and Teboulle 2009),

  • FISTA (CST): same as above but with constant step size \(\frac{1}{\mathcal{K}}\) (see Appendix A),

  • SpaRSA: an approach similar to ISTA (Beck and Teboulle 2009) but with a different line search (Wright et al. 2009),

  • FOBOS: a projected stochastic subgradient descent framework (Duchi and Singer 2009b).

All solvers are used to minimize the same objective: our proposed multiclass formulation, Eq. (2).

Figures 2 and 3 compare the relative objective value difference \(\frac{F(\boldsymbol{W}) - F(\boldsymbol{W}^{*})}{F(\boldsymbol{W^{*}})}\) (lower is better) and test accuracy (higher is better) of the above solvers as a function of training time, when \(\lambda=10^{-3}\) and \(\lambda=10^{-5}\), respectively. For FOBOS, we used the step size \(\eta_{t} = \frac{\eta_{0}}{\sqrt{t}}\), where we chose \(\eta_0\) beforehand from \(10^{-3}, 10^{-2}, \dots, 10^{3}\) with a held-out validation set.

Fig. 2

Relative objective value difference (left) and test accuracy (right) as a function of training time, when \(\lambda=10^{-3}\). Time is in log-scale

Fig. 3

Relative objective value difference (left) and test accuracy (right) as a function of training time, when \(\lambda=10^{-5}\). Time is in log-scale

Figure 4 compares the number of non-zero rows of the solution (lower is better) as a function of training time for the different solvers, when \(\lambda=10^{-3}\) (left) and \(\lambda=10^{-5}\) (right).

Fig. 4

Percentage of non-zero rows as a function of training time, when \(\lambda=10^{-3}\) (left) and \(\lambda=10^{-5}\) (right). Time is in log-scale

5.2.1 Comparison of block coordinate descent with or without line search

Figures 2 and 3 indicate that block coordinate descent (BCD) with line search was overall slightly faster to converge than without. Empirically, we observe that the sufficient decrease condition checked by the line search, Eq. (7), is usually satisfied on the first try (\(\alpha^{t}_{j}=1\)). In that case, the line search does not incur much extra cost, since the objective value difference F(W t+1)−F(W t), needed for Eq. (7), can be computed in the same loop as the partial gradient. For the few times when more than one line search step is required, our formulation has the advantage that the objective value difference can be computed very efficiently (no expensive log or exp). However, similarly to other iterative solvers, BCD (with or without line search) may suffer from slow convergence on very loosely regularized problems (very small λ).

In terms of row sparsity, Fig. 4 shows that in all datasets, BCD had a two-phase behavior: first increasing the number of non-zero rows, then rapidly decreasing it. Compared to other solvers, BCD was always the fastest to reach the sparsity level corresponding to a given λ value.

5.2.2 Comparison with a projected stochastic subgradient descent solver: FOBOS

BCD outperformed FOBOS on smaller datasets (News20, Sector) and was comparable to FOBOS on larger datasets (MNIST, RCV1, Amazon7). However, for FOBOS, we found that tuning the initial step size η 0 was crucial to obtain good convergence speed and accuracy. This additional “degree of freedom” is a major disadvantage of FOBOS over BCD, in practice. However, since it is based on stochastic subgradient descent, FOBOS can handle non-differentiable loss functions (e.g., the Crammer-Singer multiclass loss), unlike BCD.

Figure 4 shows that FOBOS obtained much less sparse solutions than BCD. In particular, on RCV1 with \(\lambda=10^{-3}\), BCD obtained less than 5 % non-zero rows whereas FOBOS obtained almost 80 %.

5.2.3 Comparison with full-gradient solvers: FISTA and SpaRSA

BCD outperformed FISTA and SpaRSA on all datasets, both in speed of objective value decrease and of test accuracy increase. FISTA (LS) and SpaRSA achieved similar convergence speed, with a slight advantage for FISTA (LS). Interestingly, FISTA (CST) was consistently worse than FISTA (LS), showing that, in the full-gradient case, doing a line search to adjust the step size at every iteration is greatly beneficial. In contrast, the difference between BCD (LS) and BCD (CST) appeared to be smaller. FISTA (CST) uses one global step size \(\frac{1}{\mathcal{K}}\) whereas BCD (CST) uses a per-block step size \(\frac{1}{\mathcal{K}_{j}}\). Therefore, BCD (CST) uses a constant step size that is better adapted to each block.

BCD, FOBOS, FISTA and SpaRSA differ in how they make use of gradient information at each iteration. FISTA and SpaRSA use the entire gradient \(G(\boldsymbol{W}) \in \mathbf{R}^{d \times m}\), averaged over all n training instances. This is expensive, especially when both n and d are large. On the other hand, FOBOS uses a stochastic approximation of the entire gradient (averaged over a single training instance) and BCD uses only the partial gradient \(G(\boldsymbol{W})_{j:} \in \mathbf{R}^m\) (averaged over all training instances). FOBOS and BCD can therefore quickly start to minimize Eq. (2) and increase test accuracy, while FISTA and SpaRSA are not even done computing G(W). Additionally, FISTA and SpaRSA change W entirely at each iteration, which forces them to recompute G(W) and \(F(\boldsymbol{W}^{t+1}) - F(\boldsymbol{W}^t)\) entirely. In the case of BCD, only one block \(\boldsymbol{W}_{j:}\) is modified at a time, enabling the fast implementation technique described in Sect. 4.3.

In terms of sparsity, FISTA and SpaRSA reduced the number of non-zero rows much more slowly than BCD. However, in the limit, they obtained similar row sparsity to BCD.

5.2.4 Effect of shrinking

We also extended to \(\ell_1/\ell_2\) regularization the shrinking method originally proposed by Yuan et al. (2010) for \(\ell_1\)-regularized binary classification. Indeed, using the optimality conditions developed in Sect. 4.5, it is possible to discard zero blocks early if they are likely to remain zero. However, we found that shrinking did not improve convergence on lower-dimensional datasets such as RCV1 and only slightly helped on higher-dimensional datasets such as Amazon7. This is in line with Yuan et al.’s experimental results on \(\ell_1\)-regularized binary classification.

5.3 Comparison with other multiclass and multitask objectives

In this experiment, we used block coordinate descent to minimize and compare different \(\ell_1/\ell_2\)-regularized multiclass and multitask objectives:

  • multiclass squared hinge (proposed, same as Eq. (2)):

    $$\underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}} \frac{1}{n} \sum_{i=1}^n \sum _{r \neq y_i} \max\bigl(1 - (\boldsymbol{W}_{:y_i} \cdot\boldsymbol{x}_i - \boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i), 0\bigr)^2 + \lambda\sum _{j=1}^d \|\boldsymbol{W}_{j:} \|_2. $$
  • multitask squared hinge:

    $$ \underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}} \frac{1}{n} \sum_{r=1}^m \sum _{i=1}^n \max(1 - \boldsymbol{Y}_{ir} \boldsymbol{W}_{:r} \cdot\boldsymbol{x}_i, 0)^2 + \lambda\sum_{j=1}^d \| \boldsymbol{W}_{j:}\|_2, $$
    (11)

    where \(\boldsymbol{Y}_{ir} = +1\) if \(y_i = r\) and \(\boldsymbol{Y}_{ir} = -1\) otherwise.

  • multiclass logistic regression:

    $$ \underset{\boldsymbol{W} \in\mathbf{R}^{d \times m}}{\mathrm{minimize}} \frac{1}{n} \sum_{i=1}^n \log\biggl(1 + \sum_{r \neq y_i} \exp(\boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i - \boldsymbol{W}_{:y_i} \cdot\boldsymbol {x}_i)\biggr) + \lambda\sum_{j=1}^d \|\boldsymbol{W}_{j:}\|_2. $$
    (12)

For both multitask squared hinge and multiclass logistic regression, we computed the partial gradient using an efficient implementation technique similar to the one described in Sect. 4.3. For the multiclass and multitask squared hinge formulations, we used BCD with line search. For the multiclass logistic regression formulation, we used BCD without line search, since we observed faster training times (see Sect. 5.3.1). For multiclass logistic regression, the partial gradient’s Lipschitz constant is \(\mathcal{K}_{j} = \frac{1}{2} \sum_{i} \boldsymbol{x}_{ij}^{2}\) (Duchi and Singer 2009a).

Figure 5 compares the row-sparsity/test-accuracy trade-off of the above objectives. We generated 10 log-spaced values between \(\lambda=10^{-2}\) and \(\lambda=10^{-4}\) (Sector), and between \(\lambda=10^{-3}\) and \(\lambda=10^{-5}\) (other datasets). For each λ value, we computed the solution and measured the percentage of non-zero rows as well as test accuracy, so as to obtain Fig. 5. Table 2 shows the time (minimum, median and maximum) that was needed to compute the solutions of each objective. The results reported are averages obtained from 3 different train-test splits. We set \(\tau=10^{-3}\) and K=200.

Fig. 5

Test accuracy of multiclass and multitask \(\ell_1/\ell_2\)-regularized objective functions as a function of the percentage of non-zero rows in the solution. Results were obtained by computing solutions for 10 log-spaced values between \(\lambda=10^{-2}\) and \(\lambda=10^{-4}\) (Sector), and between \(\lambda=10^{-3}\) and \(\lambda=10^{-5}\) (other datasets)

Table 2 Minimum, median and maximum training times (in seconds) of different \(\ell_1/\ell_2\)-regularized objective functions, when computing solutions for 10 log-spaced values between \(\lambda=10^{-2}\) and \(\lambda=10^{-4}\) (Sector), and between \(\lambda=10^{-3}\) and \(\lambda=10^{-5}\) (other datasets)

5.3.1 Comparison with multiclass logistic regression

Compared to multiclass logistic regression, Eq. (12), our objective achieved overall comparable accuracy. As indicated in Table 2, however, our objective was substantially faster to train (up to ten times in terms of median time) than multiclass logistic regression. Computationally, our objective has indeed two important advantages. First, the objective and gradient are “lazy”: they only iterate over the instance-class pairs for which the correct class’s score does not exceed the other class’s score by at least 1 (i.e., the pairs which incur nonzero loss), whereas multiclass logistic regression always iterates over all n instances and m−1 classes. Second, they do not contain any exp or log computations, which are expensive to compute in practice (equivalent to dozens of multiplications) (Yuan et al. 2011).

We also tried to use BCD with line search for optimizing the multiclass logistic regression objective. However, we found that the version without line search was overall faster. For example, on Amazon7, the median time with line search was 8723.28 seconds instead of 5822.83 seconds without line search. This contrasts with our results from Sect. 5.2 and thus suggests that a line search may not be beneficial when the objective value and gradient are expensive to compute. However, our results show that even without line search, multiclass logistic regression is much slower to train than our formulation.

5.3.2 Comparison with multitask squared hinge loss

Compared to the multitask squared hinge formulation, Eq. (11), we found that our direct multiclass formulation had overall better accuracy and training time. This multitask formulation can be thought of as one-vs-rest with \(\ell_1/\ell_2\) regularization. It is semantically different from our multiclass formulation, since it only attempts to correctly predict binary labels for different tasks (columns of Y), not the true multiclass labels. It is instructive to compare its gradient (without the regularization term)

$$ G_{MT}(\boldsymbol{W})_{j:} = -\frac{2}{n} \sum _r \sum_i \boldsymbol{Y}_{ir} \max(1 - \boldsymbol{Y}_{ir} \boldsymbol{W}_{:r} \cdot\boldsymbol {x}_i, 0) \boldsymbol{x}_{ij} \boldsymbol{e}_r $$
(13)

with the gradient in the case of our formulation,

$$ G(\boldsymbol{W})_{j:} = -\frac{2}{n} \sum _{i=1}^n \sum_{r \neq y_i} \max\bigl(1 - (\boldsymbol{W}_{:y_i} \cdot \boldsymbol{x}_i - \boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i), 0\bigr) [\boldsymbol{x}_{ij} \boldsymbol{e}_{y_i} - \boldsymbol{x}_{ij} \boldsymbol{e}_r]. $$
(14)

The main difference is that the inner sum in Eq. (13) updates only one element \(G_{MT}(\boldsymbol{W})_{jr}\) of the gradient by adding \(\boldsymbol{x}_{ij}\) weighted by \(\boldsymbol{Y}_{ir} \max(1 - \boldsymbol{Y}_{ir} \boldsymbol{W}_{:r} \cdot \boldsymbol{x}_i, 0)\), whereas the inner sum in Eq. (14) updates two elements \(G(\boldsymbol{W})_{jr}\) and \(G(\boldsymbol{W})_{jy_{i}}\) by adding/subtracting \(\boldsymbol{x}_{ij}\) weighted by \(\max(A(\boldsymbol{W})_{ir}, 0) = \max(1 - (\boldsymbol{W}_{:y_{i}} \cdot\boldsymbol{x}_{i} - \boldsymbol{W}_{:r} \cdot\boldsymbol{x}_{i}), 0)\).

Using an efficient implementation technique similar to the one described in Sect. 4.3, the cost of computing Eq. (13) is \(O(\hat{n}m)\) rather than \(O(\hat{n}(m-1))\) for Eq. (14). We also observed that our multiclass objective typically reached the stopping criterion in fewer iterations than the multitask objective (e.g., k=73 vs. k=108 on the News20 dataset with \(\lambda=10^{-3}\)).

6 Conclusion

In this paper, we proposed a novel direct sparse multiclass formulation, specifically designed for large-scale and high-dimensional problems. We presented two block coordinate descent variants (Tseng and Yun 2009; Richtárik and Takáč 2012a) in a unified manner and developed the core components needed to efficiently optimize our formulation. Experimentally, we showed that block coordinate descent achieves comparable or better convergence speed than FOBOS (Duchi and Singer 2009b), while obtaining much sparser solutions and not needing an extra hyperparameter. Furthermore, it outperformed full-gradient solvers such as FISTA (Beck and Teboulle 2009) and SpaRSA (Wright et al. 2009). Compared to multiclass logistic regression, our multiclass formulation had significantly faster training times (up to ten times in terms of median time) while achieving similar test accuracy. Compared to a multitask squared hinge formulation, our formulation had overall better test accuracy and faster training times. In future work, we would like to empirically evaluate the extensions described in Sect. 4.7.