Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function
DOI: 10.1007/s10107-012-0614-z
Cite this article as: Richtárik, P. & Takáč, M. Math. Program. (2014) 144: 1.
Abstract
In this paper we develop a randomized block-coordinate descent method for minimizing the sum of a smooth and a simple nonsmooth block-separable convex function and prove that it obtains an \(\varepsilon \)-accurate solution with probability at least \(1-\rho \) in at most \(O((n/\varepsilon ) \log (1/\rho ))\) iterations, where \(n\) is the number of blocks. This extends recent results of Nesterov (SIAM J Optim 22(2): 341–362, 2012), which cover the smooth case, to composite minimization, while at the same time improving the complexity by the factor of 4 and removing \(\varepsilon \) from the logarithmic term. More importantly, in contrast with the aforementioned work in which the author achieves the results by applying the method to a regularized version of the objective function with an unknown scaling factor, we show that this is not necessary, thus achieving first true iteration complexity bounds. For strongly convex functions the method converges linearly. In the smooth case we also allow for arbitrary probability vectors and non-Euclidean norms. Finally, we demonstrate numerically that the algorithm is able to solve huge-scale \(\ell _1\)-regularized least squares problems with a billion variables.
Keywords
Block coordinate descent · Huge-scale optimization · Composite minimization · Iteration complexity · Convex optimization · LASSO · Sparse regression · Gradient descent · Coordinate relaxation · Gauss–Seidel method
Mathematics Subject Classification (2000)
65K05 · 90C05 · 90C06 · 90C25
1 Introduction
- 1. Size of Data. The size of the problem, measured as the dimension of the variable of interest, is so large that the computation of a single function value or gradient is prohibitive. There are several situations in which this is the case; let us mention two of them.
Memory. If the dimension of the space of variables is larger than the available memory, the task of forming a gradient or even of evaluating the function value may be impossible to execute and hence the usual gradient methods will not work.
Patience. Even if the memory does not preclude the possibility of taking a gradient step, for large enough problems this step will take considerable time and, in some applications such as image processing, users might prefer to see/have some intermediary results before a single iteration is over.
- 2. Nature of Data. The nature and structure of the data describing the problem may be an obstacle to using current methods for various reasons, including the following.
Completeness. If the data describing the problem is not immediately available in its entirety, but instead arrives incomplete in pieces and blocks over time, with each block “corresponding to” one variable, it may not be realistic (for various reasons such as “memory” and “patience” described above) to wait for the entire data set to arrive before the optimization process is started.
Source. If the data is distributed on a network not all nodes of which are equally responsive or functioning, it may be necessary to work with whatever data is available at a given time.
1.1 Block coordinate descent methods
The basic algorithmic strategy of CD methods is known in the literature under various names such as alternating minimization, coordinate relaxation, linear and non-linear Gauss–Seidel methods, subspace correction and domain decomposition. As working with all the variables of an optimization problem at each iteration may be inconvenient, difficult or impossible for any or all of the reasons mentioned above, the variables are partitioned into manageable blocks, with each iteration focused on updating a single block only, the remaining blocks being fixed. Both for their conceptual and algorithmic simplicity, CD methods were among the first optimization approaches proposed and studied in the literature (see [1] and the references therein; for a survey of block CD methods in semidefinite programming we refer the reader to [24]). While they seem to have never belonged to the mainstream focus of the optimization community, a renewed interest in CD methods was sparked recently by their successful application in several areas—training support vector machines in machine learning [3, 5, 18, 28, 29], optimization [9, 13, 17, 21, 23, 25, 31], compressed sensing [8], regression [27], protein loop closure [2] and truss topology design [16]—partly due to a change in the size and nature of data described above.
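The basic strategy just described fits in a few lines. The following is a minimal illustrative sketch (not the paper's Algorithm 1), assuming single-coordinate blocks, a user-supplied per-coordinate gradient oracle `grad_block`, and known coordinate Lipschitz constants `L`:

```python
import numpy as np

def rand_block_cd(grad_block, L, x0, n_iters, seed=0):
    """Minimal randomized coordinate descent sketch: pick a coordinate
    uniformly at random and take a gradient step on it with step 1/L[i]."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = x.size
    for _ in range(n_iters):
        i = rng.integers(n)                 # uniform random coordinate
        x[i] -= grad_block(x, i) / L[i]     # update coordinate i only
    return x

# Toy separable quadratic f(x) = 0.5 * sum_i L_i x_i^2 with minimizer 0;
# each coordinate is solved exactly the first time it is drawn.
L = np.array([1.0, 10.0, 100.0])
grad_block = lambda x, i: L[i] * x[i]
x = rand_block_cd(grad_block, L, np.ones(3), n_iters=200)
```

Note that each iteration touches a single coordinate, so no full gradient is ever formed; this is what makes the strategy attractive for the memory- and patience-limited settings above.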
- (i)
Recent efforts suggest that complexity results are perhaps more readily obtained for randomized methods and that randomization can actually improve the convergence rate [6, 18, 19].
- (ii)
Choosing all blocks with equal probabilities should, intuitively, lead to results similar to those of a cyclic strategy. In fact, a randomized strategy avoids worst-case orderings of the coordinates, and hence might be preferable.
- (iii)
Randomized choice seems more suitable in cases when not all data is available at all times.
- (iv)
One may study the possibility of choosing blocks with different probabilities (we do this in Sect. 4). The goal of such a strategy may be either to improve the speed of the method (in Sect. 6.1 we introduce a speedup heuristic based on adaptively changing the probabilities), or a more realistic modeling of the availability frequencies of the data defining each block.
1.2 Problem description and our contribution
- (i)
\(\varPsi \equiv 0\). This covers the case of smooth minimization and was considered in [13].
- (ii)\(\varPsi \) is the indicator function of a block-separable convex set (a box), i.e.,$$\begin{aligned} \varPsi (x) = I_{S_1\times \cdots \times S_n}(x) {\overset{\text{ def}}{=}}{\left\{ \begin{array}{ll}0&\text{ if} \quad x^{(i)}\in S_i \quad \forall i,\\ +\infty&\text{ otherwise,}\end{array}\right.} \end{aligned}$$where \(x^{(i)}\) is block \(i\) of \(x\in \mathbb{R }^N\) (to be defined precisely in Sect. 2) and \(S_1,\ldots ,S_n\) are closed convex sets. This choice of \(\varPsi \) models problems with smooth objective and convex constraints on blocks of variables. Indeed, (1) takes on the form$$\begin{aligned} \min \;f(x) \quad \text{ subject} \text{ to} \quad x^{(i)} \in S_i, \quad i=1,\ldots ,n. \end{aligned}$$Iteration complexity results in this case were given in [13].
- (iii)
\(\varPsi (x) \equiv \lambda \Vert x\Vert _1\) for \(\lambda >0\). In this case we decompose \(\mathbb{R }^N\) into \(N\) blocks, each corresponding to one coordinate of \(x\). Increasing \(\lambda \) encourages the solution of (1) to be sparser [26]. Applications abound in, for instance, machine learning [3], statistics [20] and signal processing [8]. The first iteration complexity results for the case with a single block were given in [12].
- (iv)
There are many more choices such as the elastic net [32], group lasso [10, 14, 30] and sparse group lasso [4]. One may combine indicator functions with other block separable functions such as \(\varPsi (x) = \lambda \Vert x\Vert _1 + I_{S_1 \times \cdots \times S_n}(x)\), \(S_i = [l_i,u_i]\), where the sets introduce lower and upper bounds on the coordinates of \(x\).
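For all of the block-separable choices of \(\varPsi \) above, the block update of a coordinate descent method reduces to a closed-form formula. The following is an illustrative sketch of a single coordinate update for the combined choice \(\varPsi (x) = \lambda \Vert x\Vert _1 + I_{S_1\times \cdots \times S_n}(x)\) with \(S_i=[l_i,u_i]\); it assumes \(l_i \le 0 \le u_i\), in which case the update is soft-thresholding followed by clipping (the helper names are ours, not from the paper):

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*|.|: shrink z toward 0 by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def coordinate_update(x_i, g_i, L_i, lam, l_i, u_i):
    """One composite coordinate step for Psi_i(t) = lam*|t| + I_[l_i,u_i](t).

    g_i is the i-th partial derivative of f at x and L_i its coordinate
    Lipschitz constant; assumes l_i <= 0 <= u_i, so the proximal step
    splits into soft-thresholding followed by clipping.
    """
    z = x_i - g_i / L_i                     # plain gradient step
    return np.clip(soft_threshold(z, lam / L_i), l_i, u_i)

# Example: the gradient step lands at 0.7; lam/L = 0.2 shrinks it to 0.5,
# and the box [-0.4, 0.4] clips it to 0.4.
y = coordinate_update(x_i=1.0, g_i=0.3, L_i=1.0, lam=0.2, l_i=-0.4, u_i=0.4)
```

The point is that the nonsmooth term never needs to be handled globally: separability makes each block update as cheap as a scalar formula.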
While the asymptotic convergence rates of some variants of CD methods are well understood [9, 21, 23, 31], iteration complexity results are very rare. To the best of our knowledge, randomized CD algorithms for minimizing a composite function have been proposed and analyzed (in the iteration complexity sense) in a few special cases only: (a) the unconstrained convex quadratic case [6], (b) the smooth unconstrained (\(\varPsi \equiv 0\)) and the smooth block-constrained case (\(\varPsi \) is the indicator function of a direct sum of boxes) [13] and (c) the \(\ell _1\)-regularized case [18]. As the approach in [18] is to rewrite the problem into a smooth box-constrained format first, the results of [13] can be viewed as a (major) generalization and improvement of those in [18] (the results were obtained independently).
Our contribution. In this paper we improve upon, extend and simplify the iteration complexity results of Nesterov [13], treating the problem of minimizing the sum of a smooth convex function and a simple nonsmooth convex block-separable function (1). We focus exclusively on simple (as opposed to accelerated) methods. The reason is that the per-iteration work of the accelerated algorithm in [13] is excessive on huge-scale instances of problems with sparse data (such as the Google problem, where sparsity corresponds to each website linking only to a few other websites, or the sparse problems we consider in Sect. 6). In fact, even the author does not recommend using the accelerated method for solving such problems; the simple methods seem to be more efficient.
Table 1 Summary of complexity results obtained in this paper

Algorithm | Objective | Complexity |
---|---|---|
Algorithm 2 (UCDC) (Theorem 5) | Convex composite | \(\tfrac{2n\max \left\{ {\fancyscript{R}}^2_{L}(x_0), F(x_0)-F^*\right\} }{\varepsilon }\left(1+\log \tfrac{1}{\rho }\right)\) |
 | | \(\tfrac{2n{\fancyscript{R}}^2_{L}(x_0)}{\varepsilon }\log \left(\tfrac{F(x_0)-F^*}{\varepsilon \rho }\right)\) |
Algorithm 2 (UCDC) (Theorem 8) | Strongly convex composite | \(n\tfrac{1+\mu _{\varPsi }(L)}{\mu _f(L)+\mu _{\varPsi }(L)} \log \left(\tfrac{F(x_0)-F^*}{\varepsilon \rho }\right)\) |
Algorithm 3 (RCDS) (Theorem 12) | Convex smooth | \(\tfrac{2{\fancyscript{R}}^2_{LP^{-1}}(x_0)}{\varepsilon } \left(1 + \log \tfrac{1}{\rho }\right) -2\) |
Algorithm 3 (RCDS) (Theorem 13) | Strongly convex smooth | \(\tfrac{1}{\mu _f(LP^{-1})}\log \left(\tfrac{f(x_0)-f^*}{\varepsilon \rho }\right)\) |
The symbols \(P\), \(L\), \({\fancyscript{R}}^2_{W}(x_0)\) and \(\mu _\phi (W)\) appearing in Table 1 will be defined precisely in Sect. 2. For now it suffices to say that \(L\) is a diagonal matrix encoding the (block) coordinate Lipschitz constants of the gradient of \(f\), \(P\) is a diagonal matrix encoding the probabilities \(\{p_i\}\), \({\fancyscript{R}}^2_{W}(x_0)\) is a measure of the squared distance of the initial iterate \(x_0\) from the set of minimizers of problem (1) in a norm defined by a diagonal matrix \(W\), and \(\mu _\phi (W)\) is the strong convexity parameter of function \(\phi \) with respect to that norm.
- 1.
Composite setting. We consider the composite setting (1), whereas [13] covers the unconstrained and constrained smooth setting only.
- 2.
No need for regularization. Nesterov’s high probability results in the case of minimizing a function which is not strongly convex are based on regularizing the objective to make it strongly convex and then running the method on the regularized function. The regularizing term depends on the distance of the initial iterate to the optimal point, and hence is unknown, which means that the analysis in [13] does not lead to true iteration complexity results. Our contribution here is to show, via a more detailed analysis using a thresholding argument (Theorem 1), that no regularization is needed.
- 3.
Better complexity. Our complexity results are better by the constant factor of 4. Also, we have removed \(\varepsilon \) from the logarithmic term.
- 4.
General probabilities. Nesterov considers probabilities \(p_i\) proportional to \(L_i^{\alpha }\), where \(\alpha \ge 0\) is a parameter. High probability results are proved in [13] for \(\alpha \in \{0,1\}\) only. Our results in the smooth case hold for an arbitrary probability vector \(p\).
- 5.
General norms. Nesterov’s expectation results (Theorems 1 and 2) are proved for general norms. However, his high probability results are proved for Euclidean norms only. In our approach all results in the smooth case hold for general norms.
- 6.
Simplification. Our analysis is shorter.
1.3 Contents
This paper is organized as follows. We start in Sect. 2 by defining basic notation, describing the block structure of the problem, stating assumptions and describing the generic randomized block-coordinate descent algorithm (RCDC). In Sect. 3 we study the performance of a uniform variant (UCDC) of RCDC as applied to a composite objective function and in Sect. 4 we analyze a smooth variant (RCDS) of RCDC; that is, we study the performance of RCDC on a smooth objective function. In Sect. 5 we compare known complexity results for CD methods with the ones established in this paper. Finally, in Sect. 6 we demonstrate the efficiency of the method on \(\ell _1\)-regularized least squares and linear support vector machine problems.
2 Preliminaries
In Sect. 2.1 we describe the setting, basic assumptions and notation; Sect. 2.2 describes the algorithm; and in Sect. 2.3 we present the key technical tool of our complexity analysis.
2.1 Assumptions and notation
Example
Let \(n=N,N_i=1\) for all \(i\) and \(U = [e_1,e_2,\ldots ,e_n]\) be the \(n\times n\) identity matrix. Then \(U_i=e_i\) is the \(i\)th unit vector and \(x^{(i)} = e_i^Tx\in \mathbb{R }_i =\mathbb{R }\) is the \(i\)th coordinate of \(x\). Also, \(x = \sum _i e_ix^{(i)}\). If we let \(B_i=1\) for all \(i\), then \(\Vert t\Vert _{(i)} = \Vert t\Vert ^*_{(i)} = |t|\) for all \(t\in \mathbb{R }\).
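The block decomposition underlying the Example, in its general form with blocks of sizes \(N_i\), can be illustrated numerically: \(U=[U_1,\ldots ,U_n]\) is a column partition of the \(N\times N\) identity, \(x^{(i)}=U_i^Tx\) extracts block \(i\), and \(x=\sum _i U_ix^{(i)}\). The block sizes below are hypothetical:

```python
import numpy as np

# Decompose R^N into n blocks of sizes N_i via column submatrices of the
# identity: U_i = U[:, start_i : start_i + N_i].
N_sizes = [2, 3, 1]                        # hypothetical block sizes, N = 6
N = sum(N_sizes)
U = np.eye(N)
starts = np.cumsum([0] + N_sizes)
blocks = [U[:, starts[i]:starts[i + 1]] for i in range(len(N_sizes))]

x = np.arange(1.0, N + 1)                  # x = (1, 2, ..., 6)
x_blocks = [Ui.T @ x for Ui in blocks]     # x^{(1)} = (1, 2), x^{(2)} = (3, 4, 5), ...
x_rebuilt = sum(Ui @ xi for Ui, xi in zip(blocks, x_blocks))   # equals x
```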
2.2 The algorithm
The iterates \(\{x_k\}\) are random vectors and the values \(\{F(x_k)\}\) are random variables. Clearly, \(x_{k+1}\) depends on \(x_k\) only. As our analysis will be based on the (expected) per-iteration decrease of the objective function, our results hold if we replace \(V_i(x_k,t)\) by \(F(x_k +U_{i} t)\) in Algorithm 1.
2.3 Key technical tool
Here we present the main technical tool which is used at the end of our iteration complexity proofs.
Theorem 1
- (i)
\(\mathbf{E }[\xi _{k+1} \;|\; x_k] \le \xi _k - \tfrac{\xi _k^2}{c_1}\), for all \(k\), where \(c_1>0\) is a constant,
- (ii)
\(\mathbf{E }[\xi _{k+1} \;|\; x_k] \le (1-\tfrac{1}{c_2}) \xi _k\), for all \(k\) such that \(\xi _k\ge \varepsilon \), where \(c_2>1\) is a constant.
Proof
The above theorem will be used with \(\{x_k\}_{k\ge 0}\) corresponding to the iterates of Algorithm 1 and \(\phi (x) = F(x)-F^*\).
Restarting. Note that similar, albeit slightly weaker, high probability results can be achieved by restarting as follows. We run the random process \(\{\xi _k\}\) repeatedly \(r=\lceil \log \tfrac{1}{\rho }\rceil \) times, always starting from \(\xi _0\), each time for the same number of iterations \(k_1\) for which \(\mathbf{P }(\xi _{k_1}>\varepsilon ) \le \tfrac{1}{e}\). It then follows that the probability that all \(r\) values \(\xi _{k_1}\) will be larger than \(\varepsilon \) is at most \((\tfrac{1}{e})^r \le \rho \). Note that the restarting technique demands that we perform \(r\) evaluations of the objective function; this is not needed in the one-shot approach covered by the theorem.
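The restarting scheme above can be simulated in a few lines. The sketch below uses a toy stand-in for the random process: a single run of \(k_1\) iterations is assumed to end above \(\varepsilon \) with probability exactly \(1/e\) (a hypothetical worst case consistent with the choice of \(k_1\)):

```python
import math
import random

def restarted_run(single_run, k1, rho, seed=0):
    """Restarting sketch: repeat a randomized run r = ceil(log(1/rho))
    times, each for k1 iterations, and keep the best residual found;
    this costs r evaluations of the objective."""
    r = math.ceil(math.log(1.0 / rho))
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(r):
        best = min(best, single_run(k1, rng))
    return best

# Toy stand-in for one run of the process {xi_k}: fail (residual 1.0)
# with probability 1/e, otherwise succeed (residual well below eps).
def single_run(k1, rng):
    return 1.0 if rng.random() < 1.0 / math.e else 1e-6

best = restarted_run(single_run, k1=100, rho=1e-3)
# P(all r runs fail) <= (1/e)^r <= rho, as claimed in the text.
```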
Tightness. It can be shown on simple examples that the bounds in the above result are tight.
3 Coordinate descent for composite functions
In this section we study the performance of Algorithm 1 in the special case when all probabilities are chosen to be the same, i.e., \(p_i=\tfrac{1}{n}\) for all \(i\). For easier future reference we set this method apart and give it a name (Algorithm 2).
Lemma 2.
Proof
Lemma 3.
For all \(x\in \mathrm{dom }F\) we have \(H(x,T(x)) \le \min _{y\in \mathbb{R }^N} \{F(y) + \tfrac{1-\mu _f(L)}{2}\Vert y-x\Vert _L^2\}\).
Proof
3.1 Convex objective
In order for Lemma 2 to be useful, we need to estimate \(H(x_k,T(x_k))-F^*\) from above in terms of \(F(x_k)-F^*\).
Lemma 4.
Proof
We are now ready to estimate the number of iterations needed to push the objective value within \(\varepsilon \) of the optimal value with high probability. Note that since \(\rho \) appears in the logarithm, it is easy to attain high confidence.
Theorem 5
- (i)\(\varepsilon <F(x_0)-F^*\) and$$\begin{aligned} k \ge \tfrac{2n \max \left\{ {\fancyscript{R}}^2_{L}(x_0), F(x_0)-F^*\right\} }{\varepsilon } \left(1 + \log \tfrac{1}{\rho }\right) + 2 - \tfrac{2n\max \left\{ {\fancyscript{R}}^2_{L}(x_0), F(x_0)-F^*\right\} }{F(x_0)-F^*},\quad \end{aligned}$$(32)
- (ii)\(\varepsilon < \min \{{\fancyscript{R}}^2_{L}(x_0), F(x_0)-F^*\}\) and$$\begin{aligned} k \ge \tfrac{2n {\fancyscript{R}}^2_{L}(x_0)}{\varepsilon } \log \tfrac{F(x_0)-F^*}{\varepsilon \rho }. \end{aligned}$$(33)
Proof
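To get a feel for bound (33), one can evaluate it for hypothetical problem data; the point, emphasized above, is that the confidence parameter \(\rho \) enters only through the logarithm:

```python
import math

def ucdc_iterations(n, R2, F_gap, eps, rho):
    """Bound (33): iteration count guaranteeing an eps-accurate solution
    with probability at least 1 - rho (convex composite case). R2 plays
    the role of R^2_L(x0) and F_gap that of F(x0) - F*."""
    return math.ceil(2 * n * R2 / eps * math.log(F_gap / (eps * rho)))

# Hypothetical instance: n = 1000 blocks, R^2_L(x0) = 1, initial gap 10.
k_90 = ucdc_iterations(1000, 1.0, 10.0, eps=1e-3, rho=0.10)     # 90% confidence
k_high = ucdc_iterations(1000, 1.0, 10.0, eps=1e-3, rho=1e-5)   # 99.999% confidence
# Raising the confidence from 90% to 99.999% increases the bound by less
# than a factor of 2.
```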
3.2 Strongly convex objective
The following lemma will be useful in proving linear convergence of the expected value of the objective function to the minimum.
Lemma 6.
Proof
A modification of the above lemma (and of the subsequent results using it) is possible where the assumption \(\mu _{f}(L)+\mu _\varPsi (L)>0\) is replaced by the slightly weaker assumption \(\mu _F(L)>0\). Indeed, in the third inequality in the proof one can replace \(\mu _f+\mu _\varPsi \) by \(\mu _F\); the estimate (35) then improves slightly. However, we prefer the current version for reasons of simplicity of exposition.
We now show that the expected value of \(F(x_k)\) converges to \(F^*\) linearly.
Theorem 7.
Proof
Follows from Lemmas 2 and 6. \(\square \)
The following is an analogue of Theorem 5 in the case of a strongly convex objective. Note that both the accuracy and confidence parameters appear in the logarithm.
Theorem 8.
Proof
3.3 A regularization technique
In this section we investigate an alternative approach to establishing an iteration complexity result in the case of an objective function that is not strongly convex. The strategy is very simple. We first regularize the objective function by adding a small quadratic term to it, thus making it strongly convex, and then argue that when Algorithm 2 is applied to the regularized objective, we can recover an approximate solution of the original non-regularized problem. This approach was used in [13] to obtain iteration complexity results for a randomized block coordinate descent method applied to a smooth function. Here we use the same idea outlined above with the following differences: (i) our proof is different, (ii) we get a better complexity result, and (iii) our approach works also in the composite setting.
We first need to establish that an approximate minimizer of \(F_\mu \) must be an approximate minimizer of \(F\).
Lemma 9.
If \(x^{\prime }\) satisfies \(F_\mu (x^{\prime }) \le \min _{x\in \mathbb{R }^N} F_\mu (x) +\tfrac{\varepsilon }{2}\), then \(F(x^{\prime }) \le F^* +\varepsilon \).
Proof
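Lemma 9 can be checked numerically on a toy one-dimensional problem (\(F(x)=|x|\), so \(F^*=0\)); the choice of \(\mu \) below is the natural one, making the added quadratic at \(x^*\) equal to \(\varepsilon /2\):

```python
import numpy as np

# Toy check of Lemma 9: F(x) = |x| with F* = 0; regularize
# F_mu(x) = F(x) + (mu/2)*(x - x0)^2 and pick mu so that
# (mu/2)*(x0 - x*)^2 = eps/2. Then any (eps/2)-minimizer of F_mu
# should be an eps-minimizer of F.
x0, eps, x_star = 4.0, 0.5, 0.0
mu = eps / (x0 - x_star) ** 2

F = lambda x: np.abs(x)
F_mu = lambda x: F(x) + 0.5 * mu * (x - x0) ** 2

grid = np.linspace(-10.0, 10.0, 200001)
Fmu_min = F_mu(grid).min()
ok = grid[F_mu(grid) <= Fmu_min + eps / 2]   # approximate minimizers of F_mu
# Every such point has F-value within eps of F* = 0.
```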
The following theorem is an analogue of Theorem 5. The result we obtain in this way is slightly different from the one given in Theorem 5 in that \(2n{\fancyscript{R}}^2_{L}(x_0)/\varepsilon \) is replaced by \(n(1+\Vert x_0-x^*\Vert _L^2/\varepsilon )\). In some situations, \(\Vert x_0-x^*\Vert _L^2\) can be significantly smaller than \({\fancyscript{R}}^2_{L}(x_0)\).
Theorem 10.
Proof
4 Coordinate descent for smooth functions
In this section we give a much simplified and improved treatment of the smooth case (\(\varPsi \equiv 0\)) as compared to the analysis in Sects. 2 and 3 of [13].
Lemma 11.
Proof
4.1 Convex objective
We are now ready to state the main result of this section.
Theorem 12.
Proof
4.2 Strongly convex objective
Theorem 13.
Proof
- 1.Uniform probabilities. If \(p_i=\tfrac{1}{n}\) for all \(i\), then$$\begin{aligned} \tfrac{1}{\mu _f(LP^{-1})} = \tfrac{1}{\mu _f(nL)} \overset{(13)}{=} \tfrac{n}{\mu _f(L)} \overset{(38)}{=} \tfrac{\mathrm{tr }(L)}{\mu _f\left({\widetilde{L}}^{}\right)} = n\tfrac{\mathrm{tr }(L)/n}{\mu _f\left({\widetilde{L}}^{}\right)}. \end{aligned}$$
- 2.Probabilities proportional to the Lipschitz constants. If \(p_i = \tfrac{L_i}{\mathrm{tr }(L)}\) for all \(i\), then$$\begin{aligned} \tfrac{1}{\mu _f(LP^{-1})} = \tfrac{1}{\mu _f(\mathrm{tr }(L) I)} \overset{(13)}{=} \tfrac{\mathrm{tr }(L)}{\mu _f(I)} = n\tfrac{\mathrm{tr }(L)/n}{\mu _f(I)}. \end{aligned}$$
5 Comparison of CD methods with complexity guarantees
In this section we compare the results obtained in this paper with existing CD methods endowed with iteration complexity bounds.
5.1 Smooth case (\(\varPsi = 0\))
- 1. Uniform probabilities. Note that in the uniform case (\(p_i=\tfrac{1}{n}\) for all \(i\)) we have$$\begin{aligned} {\fancyscript{R}}^2_{LP^{-1}}(x_0) = n{\fancyscript{R}}^2_{L}(x_0), \end{aligned}$$and hence the leading term (ignoring the logarithmic factor) in the complexity estimate of Theorem 12 (line 3 of Table 2) coincides with the leading term in the complexity estimate of Theorem 5 (line 4 of Table 2; the second result): in both cases it is$$\begin{aligned} \tfrac{2n{\fancyscript{R}}^2_{L}(x_0)}{\varepsilon }. \end{aligned}$$Note that the leading term of the complexity estimate given in Theorem 3 of [13] (line 2 of Table 2), which covers the uniform case, is worse by a factor of 4.
Table 2 Comparison of our results to the results in [13] in the non-strongly convex case

Algorithm | \(\varPsi \) | \(p_i\) | Norms | Complexity | Objective |
---|---|---|---|---|---|
Nesterov [13] (Theorem 4) | 0 | \(\tfrac{L_i}{\sum _i L_i}\) | Eucl. | \(\left(2n\!+\! \tfrac{8\left(\sum _i L_i\right){\fancyscript{R}}^2_{I}(x_0)}{\varepsilon }\right)\log \tfrac{4\left(f(x_0)\!-\!f^*\right)}{\varepsilon \rho }\) | \(f(x)+\tfrac{\varepsilon \Vert x-x_0\Vert _{I}^2}{8{\fancyscript{R}}^2_{I}(x_0)}\) |
Nesterov [13] (Theorem 3) | 0 | \(\tfrac{1}{n}\) | Eucl. | \(\tfrac{8n{\fancyscript{R}}^2_{L}(x_0)}{\varepsilon }\log \tfrac{4(f(x_0)-f^*)}{\varepsilon \rho }\) | \(f(x)+\tfrac{\varepsilon \Vert x-x_0\Vert _L^2}{8{\fancyscript{R}}^2_{L}(x_0)}\) |
Algorithm 3 (Theorem 12) | 0 | \(>\!0\) | General | \(\tfrac{2{\fancyscript{R}}^2_{LP^{-1}}(x_0)}{\varepsilon } \left(1 + \log \tfrac{1}{\rho }\right) -2\) | \(f(x)\) |
Algorithm 2 (Theorem 5) | Separable | \(\tfrac{1}{n}\) | Eucl. | \(\tfrac{2n\max \left\{ {\fancyscript{R}}^2_{L}(x_0), F(x_0)-F^*\right\} }{\varepsilon }\left(1\!+\!\log \tfrac{1}{\rho }\right)\) | \(F(x)\) |
 | | | | \(\tfrac{2n{\fancyscript{R}}^2_{L}(x_0)}{\varepsilon }\log \tfrac{F(x_0)-F^*}{\varepsilon \rho }\) | |
- 2. Probabilities proportional to Lipschitz constants. If we set \(p_i = \tfrac{L_i}{\mathrm{tr }(L)}\) for all \(i\), then$$\begin{aligned} {\fancyscript{R}}^2_{LP^{-1}}(x_0) = \mathrm{tr }(L){\fancyscript{R}}^2_{I}(x_0). \end{aligned}$$In this case Theorem 4 in [13] (line 1 of Table 2) gives the complexity bound \(2[n+\tfrac{4\mathrm{tr }(L){\fancyscript{R}}^2_{I}(x_0)}{\varepsilon }]\) (ignoring the logarithmic factor), whereas we obtain the bound \(\tfrac{2\mathrm{tr }(L){\fancyscript{R}}^2_{I}(x_0)}{\varepsilon }\) (line 3 of Table 2), an improvement by a factor of 4. Note that there is a further additive decrease by the constant \(2n\) (and the additional constant \(\tfrac{2{\fancyscript{R}}^2_{LP^{-1}}(x_0)}{f(x_0)-f^*}-2\) if we look at the sharper bound (51)).
- 3.
General probabilities. Note that unlike the results in [13], which cover the choice of two probability vectors only (lines 1 and 2 of Table 2)—uniform and proportional to \(L_i\)—our result (line 3 of Table 2) covers the case of arbitrary probability vector \(p\). This opens the possibility for fine-tuning the choice of \(p\), in certain situations, so as to minimize \({\fancyscript{R}}^2_{LP^{-1}}(x_0)\).
- 4.
Logarithmic factor. Note that in our results we have managed to push \(\varepsilon \) out of the logarithm.
- 5.
Norms. Our results hold for general norms.
- 6.
No need for regularization. Our results hold for applying the algorithms to \(F\) directly; i.e., there is no need to first regularize the function by adding a small quadratic term to it (in a similar fashion as we have done it in Sect. 3.3). This is an essential feature as the regularization constants are not known and hence the complexity results obtained that way are not true/valid complexity results.
5.2 Nonsmooth case (\(\varPsi \ne 0\))
Table 3 Comparison of CD approaches for minimizing composite functions (for which iteration complexity results are provided)
Algorithm | Lipschitz constant(s) | \(\varPsi \) | Block | Choice of coordinate | Work per 1 iteration |
---|---|---|---|---|---|
Yun and Tseng [22] | \(L(\nabla f)\) | Separable | Yes | Greedy | Expensive |
Saha and Tewari [17] | \(L(\nabla f)\) | \(\Vert \cdot \Vert _1\) | No | Cyclic | Cheap |
Shalev-Shwartz and Tewari [18] | \(\beta = \max _i L_i\) | \(\Vert \cdot \Vert _1\) | No | \(\tfrac{1}{n}\) | Cheap |
This paper (Algorithm 2) | \(L_i\) | Separable | Yes | \(\tfrac{1}{n}\) | Cheap |
The methods of Yun & Tseng and Saha & Tewari use a single Lipschitz constant only: the constant \(L(\nabla f)\) of the gradient of \(f\) with respect to the standard Euclidean norm. Note that \(\max _i L_i \le L(\nabla f) \le \sum _i L_i\). If \(n\) is large, this constant is typically much larger than the (block) coordinate constants \(L_i\). Shalev-Shwartz and Tewari use coordinate Lipschitz constants, but assume that all of them are the same. This is suboptimal, as in many applications the constants \(\{L_i\}\) will have a large variation; hence if one chooses \(\beta = \max _i L_i\) for the common Lipschitz constant, step lengths will necessarily be small (see Fig. 2 in Sect. 6).
Table 4 Comparison of iteration complexities of the methods listed in Table 3
Algorithm | Complexity | Complexity (expanded) |
---|---|---|
Yun and Tseng [22] | \(O\left(\tfrac{nL(\nabla f)\Vert x^*-x_0\Vert ^2_2}{\varepsilon }\right)\) | \(O\left(\tfrac{n}{\varepsilon }\sum \limits _i L(\nabla f)\left(u^{(i)}\right)^2\right)\) |
Saha and Tewari [17] | \(O\left(\tfrac{nL(\nabla f)\Vert x^*-x_0\Vert ^2_2}{\varepsilon }\right)\) | \(O\left(\tfrac{n}{\varepsilon } \sum \limits _i L(\nabla f)\left(u^{(i)}\right)^2\right)\) |
Shalev-Shwartz and Tewari [18] | \(O\left(\tfrac{n \beta \Vert x^*-x_0\Vert _2^2}{\varepsilon }\right)\) | \(O\left(\tfrac{n}{\varepsilon } \sum \limits _i (\max _i L_i) \left(u^{(i)}\right)^2\right)\) |
This paper (Algorithm 2) | \(O\left(\tfrac{n\Vert x^*-x_0\Vert ^2_{L}}{\varepsilon }\right)\) | \(O\left(\tfrac{n}{\varepsilon } \sum \limits _i L_i \left(u^{(i)}\right)^2\right)\) |
6 Numerical experiments
In this section we study the numerical behavior of RCDC on synthetic and real problem instances of two problem classes: Sparse Regression / Lasso [20] (Sect. 6.1) and Linear Support Vector Machines (Sect. 6.2). As an important concern in Sect. 6.1 is to demonstrate that our methods scale well with size, our algorithms were written in C and all experiments were run on a PC with 480 GB RAM.
6.1 Sparse regression/lasso
6.1.1 Instance generator
In order to be able to test Algorithm 4 under controlled conditions we use a (variant of the) instance generator proposed in Sect. 6 of [12] (the generator was presented for \(\lambda = 1\) but can be easily extended to any \(\lambda >0\)). In it, one chooses the sparsity level of \(A\) and the optimal solution \(x^*\); after that \(A,b, x^*\) and \(F^*=F(x^*)\) are generated. For details we refer the reader to the aforementioned paper.
In what follows we use the notation \(\Vert A\Vert _0\) and \(\Vert x\Vert _0\) to denote the number of nonzero elements of matrix \(A\) and of vector \(x\), respectively.
6.1.2 Speed versus sparsity
In the first experiment we investigate, on problems of size \(m=10^7\) and \(n=10^6\), the dependence of the time it takes for UCDC to complete a block of \(n\) iterations (the measurements were done by running the method for \(10\times n\) iterations and then dividing by 10) on the sparsity levels of \(A\) and \(x^*\). Looking at Table 5, we see that the speed of UCDC depends roughly linearly on the sparsity level of \(A\) (and does not depend on \(\Vert x^*\Vert _0\) at all). Indeed, as \(\Vert A\Vert _0\) increases from \(10^7\) through \(10^8\) to \(10^9\), the time it takes for the method to complete \(n\) iterations increases from about \(0.9\) s through \(4\)–\(6\) s to about \(46\) s. This is to be expected since the amount of work per iteration of the method in which coordinate \(i\) is chosen is proportional to \(\Vert a_i\Vert _0\) (computation of \(\alpha , \Vert a_i\Vert _2^2\) and \(g_{k+1}\)).
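The per-iteration cost structure just described can be sketched as follows: one iteration of UCDC for \(F(x)=\tfrac{1}{2}\Vert Ax-b\Vert _2^2+\lambda \Vert x\Vert _1\), maintaining the residual \(g_k=Ax_k-b\) incrementally so the work is proportional to \(\Vert a_i\Vert _0\). This is an illustrative reimplementation under our own data layout, not the authors' C code:

```python
import numpy as np

def ucdc_lasso_step(cols, x, g, lam, L, i):
    """One coordinate iteration of UCDC for F(x) = 0.5*||Ax-b||^2 + lam*||x||_1,
    maintaining the residual g = A x - b in place. cols[i] = (rows, vals)
    holds the nonzeros of column a_i, so the cost is O(||a_i||_0): one
    sparse dot product for the partial derivative, one sparse update of g."""
    rows, vals = cols[i]
    grad_i = vals @ g[rows]                           # f'_i(x) = a_i^T (A x - b)
    z = x[i] - grad_i / L[i]
    x_new = np.sign(z) * max(abs(z) - lam / L[i], 0.0)  # soft-threshold
    g[rows] += vals * (x_new - x[i])                  # touches ||a_i||_0 entries
    x[i] = x_new

# Tiny check with a 3 x 2 matrix and b = 0 (hypothetical data).
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]])
nz = [np.nonzero(A[:, j])[0] for j in range(2)]
cols = [(nz[j], A[nz[j], j]) for j in range(2)]
L = (A ** 2).sum(axis=0)                              # L_i = ||a_i||_2^2
x = np.array([1.0, 1.0])
g = A @ x                                             # residual for b = 0
ucdc_lasso_step(cols, x, g, lam=0.1, L=L, i=0)
```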
6.1.3 Efficiency on huge-scale problems
Table 5 The time it takes for UCDC to complete a block of \(n\) iterations increases linearly with \(\Vert A\Vert _0\) and does not depend on \(\Vert x^*\Vert _0\)
\(\Vert x^*\Vert _0\) | \(\Vert A\Vert _0 = 10^7\) | \(\Vert A\Vert _0 = 10^8\) | \(\Vert A\Vert _0 = 10^9\) |
---|---|---|---|
\(16\times 10^2\) | 0.89 | 5.89 | 46.23 |
\(16\times 10^3\) | 0.85 | 5.83 | 46.07 |
\(16\times 10^4\) | 0.86 | 4.28 | 46.93 |
Table 6 Performance of UCDC on a sparse regression instance with a million variables
\(A\in \mathbb{R }^{(2\times 10^7)\times 10^6}, \Vert A\Vert _0 =5\times 10^7\) | |||
---|---|---|---|
\({k}/n\) | \(\frac{F(x_k)-F^*}{F(x_0)-F^*}\) | \(\Vert x_k\Vert _0\) | Time (sec) |
0.00 | \(10^{0}\) | 0 | 0.0 |
2.12 | \(10^{-1}\) | 880,056 | 5.6 |
4.64 | \(10^{-2}\) | 990,166 | 12.3 |
5.63 | \(10^{-3}\) | 996,121 | 15.1 |
7.93 | \(10^{-4}\) | 998,981 | 20.7 |
10.39 | \(10^{-5}\) | 997,394 | 27.4 |
12.11 | \(10^{-6}\) | 993,569 | 32.3 |
14.46 | \(10^{-7}\) | 977,260 | 38.3 |
18.07 | \(10^{-8}\) | 847,156 | 48.1 |
19.52 | \(10^{-9}\) | 701,449 | 51.7 |
21.47 | \(10^{-10}\) | 413,163 | 56.4 |
23.92 | \(10^{-11}\) | 210,624 | 63.1 |
25.18 | \(10^{-12}\) | 179,355 | 66.6 |
27.38 | \(10^{-13}\) | 163,048 | 72.4 |
29.96 | \(10^{-14}\) | 160,311 | 79.3 |
30.94 | \(10^{-15}\) | 160,139 | 82.0 |
32.75 | \(10^{-16}\) | 160,021 | 86.6 |
34.17 | \(10^{-17}\) | 160,003 | 90.1 |
35.26 | \(10^{-18}\) | 160,000 | 93.0 |
36.55 | \(10^{-19}\) | 160,000 | 96.6 |
38.52 | \(10^{-20}\) | 160,000 | 101.4 |
39.99 | \(10^{-21}\) | 160,000 | 105.3 |
40.98 | \(10^{-22}\) | 160,000 | 108.1 |
43.14 | \(10^{-23}\) | 160,000 | 113.7 |
47.28 | \(10^{-24}\) | 160,000 | 124.8 |
47.28 | \(10^{-25}\) | 160,000 | 124.8 |
47.96 | \(10^{-26}\) | 160,000 | 126.4 |
49.58 | \(10^{-27}\) | 160,000 | 130.3 |
52.31 | \(10^{-28}\) | 160,000 | 136.8 |
53.43 | \(10^{-29}\) | 160,000 | 139.4 |
In both tables the first column corresponds to the “full-pass” iteration counter \(k/n\). That is, after \(k=n\) coordinate iterations the value of this counter is 1, reflecting a single “pass” through the coordinates. The remaining columns correspond to, respectively, the current residual \(F(x_k)-F^*\) relative to the initial residual \(F(x_0)-F^*\), the size \(\Vert x_k\Vert _0\) of the support of the current iterate \(x_k\), and time (in seconds or hours). A row is added whenever the residual is decreased by an additional factor of 10 relative to the initial residual.
Table 7 Performance of UCDC on a sparse regression instance with a billion variables and 20 billion nonzeros in matrix \(A\)
\(A\in \mathbb{R }^{10^{10}\times 10^9}, \Vert A\Vert _0 = 2\times 10^{10}\) | |||
---|---|---|---|
\({k}/n\) | \(\frac{F(x_k)-F^*}{F(x_0)-F^*}\) | \(\Vert x_k\Vert _0\) | Time (hours) |
0 | \(10^{0}\) | 0 | 0.00 |
1 | \(10^{-1}\) | 14,923,993 | 1.43 |
3 | \(10^{-2}\) | 22,688,665 | 4.25 |
16 | \(10^{-3}\) | 24,090,068 | 22.65 |
UCDC has a very similar behavior on the larger problem as well (Table 7). Note that \(A\) has 20 billion nonzeros. In \(1\times n\) iterations the initial residual is decreased by a factor of \(10\), and this takes less than an hour and a half. After less than a day, the residual is decreased by a factor of 1,000. Note that it is very unusual for convex optimization methods equipped with iteration complexity guarantees to be able to solve problems of these sizes.
6.1.4 Performance on fat matrices (\(m<n\))
UCDC needs many more iterations when \(m<n\), but local convergence is still fast
| \(k/n\) | \(F(x_k)-F^*\) | \(\Vert x_k\Vert _0\) | Time (sec) |
|---|---|---|---|
| 1 | \({<}10^7\) | 63,106 | 0.21 |
| 5,010 | \({<}10^6\) | 33,182 | 1,092.59 |
| 18,286 | \({<}10^5\) | 17,073 | 3,811.67 |
| 21,092 | \({<}10^4\) | 15,077 | 4,341.52 |
| 21,416 | \({<}10^3\) | 11,469 | 4,402.77 |
| 21,454 | \({<}10^2\) | 5,316 | 4,410.09 |
| 21,459 | \({<}10^1\) | 1,856 | 4,411.04 |
| 21,462 | \({<}10^0\) | 1,609 | 4,411.63 |
| 21,465 | \({<}10^{-1}\) | 1,600 | 4,412.21 |
| 21,468 | \({<}10^{-2}\) | 1,600 | 4,412.79 |
| 21,471 | \({<}10^{-3}\) | 1,600 | 4,413.38 |
6.1.5 Comparing different probability vectors
Note that in both cases the choice \(\alpha =1\) is the best. In other words, coordinates with large \(L_i\) have a tendency to decrease the objective function the most. However, looking at the \(\lambda =0\) case, we see that the method with \(\alpha = 1\) stalls after about 20,000 iterations. The reason is that at this stage the coordinates with small \(L_i\) need to be chosen to further decrease the objective value; however, they are chosen with very small probability, hence the slowdown. A remedy could be to start the method with \(\alpha = 1\) and switch to \(\alpha =0\) later on. On the problem with \(\lambda =1\) this effect is less pronounced. This is to be expected, since the objective function is now a combination of \(f\) and \(\varPsi \), with \(\varPsi \) exerting its influence and mitigating the effect of the Lipschitz constants.
6.1.6 Coordinate descent versus a full-gradient method
In the 1–1 plot of Fig. 2 (plot in the 1–1 position, i.e., in the upper-left corner), the Lipschitz constants \(L_i\) were generated uniformly at random in the interval \((0,1)\). We see that the RCDC variants with \(\alpha =0\) and \(\alpha =0.2\) exhibit virtually the same behavior, whereas \(\alpha =1\) and FG struggle to find a solution with error tolerance below \(10^{-5}\) and \(10^{-2}\), respectively. The \(\alpha =1\) method does start off a bit faster, but then stalls because the coordinates with small Lipschitz constants are chosen with extremely small probabilities. For a more accurate solution one needs to update these coordinates as well.
In order to zoom in on this phenomenon, in the 1–2 plot we construct an instance with an extreme distribution of Lipschitz constants: 98 % of the constants have the value \(10^{-6}\), whereas the remaining 2 % have the value \(10^3\). Note that while the FG and \(\alpha =1\) methods are able to quickly decrease the objective function within \(10^{-4}\) of the optimum, they get stuck afterwards since they effectively never update the coordinates with \(L_i = 10^{-6}\). On the other hand, the \(\alpha = 0\) method starts off slowly, but does not stop and manages to solve the problem eventually, in about \(2\times 10^5\) iterations.
In the 2–1 (resp. 2–2) plot we choose 70 % (resp. 50 %) of the Lipschitz constants \(L_i\) to be 1, and the remaining 30 % (resp. 50 %) equal to 100. Again, the \(\alpha =0\) and \(\alpha = 0.2\) methods give the best long-term performance.
In summary, if fast convergence to a solution of moderate accuracy is needed, then \(\alpha =1\) is the best choice (and is always better than FG). If one desires a solution of higher accuracy, it is recommended to switch to \(\alpha = 0\). In fact, it turns out that we can do much better than this using a “shrinking” heuristic.
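The probability vectors compared above take the form \(p_i \propto L_i^\alpha\), so \(\alpha=0\) gives uniform sampling while \(\alpha=1\) biases towards coordinates with large Lipschitz constants. A minimal sketch of such sampling (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def sample_coordinates(L, alpha, num_iters, rng=None):
    """Draw coordinate indices with probability p_i proportional to
    L_i**alpha: alpha=0 is uniform, alpha=1 favors large L_i."""
    rng = np.random.default_rng() if rng is None else rng
    p = L ** alpha
    p = p / p.sum()  # normalize to a probability vector
    return rng.choice(len(L), size=num_iters, p=p)
```

With the extreme distribution used in the 1–2 plot (98 % of the \(L_i\) equal to \(10^{-6}\), 2 % equal to \(10^3\)), \(\alpha=1\) assigns the small-\(L_i\) coordinates probabilities around \(10^{-11}\), which is precisely why those methods stall once only small-\(L_i\) coordinates remain to be updated.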
6.1.7 Speedup by adaptive change of probability vectors
6.2 Linear support vector machines
A list of a few popular loss functions
| \({\fancyscript{L}}(w;x_j,y_j)\) | Name | Property |
|---|---|---|
| \(\max \left\{ 0, 1-y_j w^Tx_j \right\} \) | L1-SVM loss (L1-SVM) | \(C^0\) continuous |
| \(\max \left\{ 0, 1-y_j w^Tx_j \right\} ^2\) | L2-SVM loss (L2-SVM) | \(C^1\) continuous |
| \(\log \left(1+e^{-y_j w^Tx_j}\right)\) | Logistic loss (LG) | \(C^2\) continuous |
Because our setup requires \(f\) to be at least \(C^1\) continuous, we consider only the L2-SVM and LG loss functions. In the experiments below we use the L1-regularized setup.
6.2.1 A few implementation remarks
Lipschitz constants and coordinate derivatives for SVM
| Loss function | \(L_i\) | \(\nabla _i f(w)\) |
|---|---|---|
| L2-SVM | \(2\gamma \displaystyle \sum \limits _{j=1}^m \left(y_jx_j^{(i)}\right)^2\) | \(-2\gamma \displaystyle \sum \limits _{j\,:\, y_jw^Tx_j<1} y_j x_j^{(i)}\left(1 -y_jw^Tx_j\right)\) |
| LG | \(\displaystyle \tfrac{\gamma }{4} \sum _{j=1}^m \left(y_jx_j^{(i)}\right)^2\) | \(\displaystyle -\gamma \sum _{j=1}^m y_jx_j^{(i)}\frac{ e^{-y_j w^Tx_j}}{1+e^{-y_j w^Tx_j}}\) |
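The closed forms in the table translate directly into code. A minimal sketch for the L2-SVM case, assuming \(f(w)=\gamma \sum_j \max\{0,\,1-y_jw^Tx_j\}^2\) with a dense data matrix (the function name and dense interface are illustrative; the experiments use sparse data):

```python
import numpy as np

def l2svm_coordinate_info(X, y, w, i, gamma):
    """Coordinate Lipschitz constant L_i and partial derivative
    grad_i of f(w) = gamma * sum_j max(0, 1 - y_j w^T x_j)^2,
    following the closed forms in the table above."""
    margins = 1.0 - y * (X @ w)          # 1 - y_j w^T x_j for all j
    active = margins > 0                 # examples with y_j w^T x_j < 1
    L_i = 2.0 * gamma * np.sum((y * X[:, i]) ** 2)
    grad_i = -2.0 * gamma * np.sum(y[active] * X[active, i] * margins[active])
    return L_i, grad_i
```

In an actual implementation one would precompute the \(L_i\) once (they do not depend on \(w\)) and maintain the residuals \(y_j w^Tx_j\) incrementally, so that each coordinate update touches only the nonzeros of feature \(i\).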
6.2.2 Small scale test
Cross validation accuracy (CV-A) and testing accuracy (TA) for various choices of \(\gamma \)
| Loss function | \(\gamma \) | CV-A (%) | TA (%) | \(\gamma \) | CV-A (%) | TA (%) |
|---|---|---|---|---|---|---|
| L2-SVM | 0.0625 | 94.1 | 93.2 | 2 | 97.0 | 95.6 |
| | 0.1250 | 95.5 | 94.5 | 4 | 97.0 | 95.4 |
| | 0.2500 | 96.5 | 95.4 | 8 | 96.9 | 95.1 |
| | 0.5000 | 97.0 | 95.8 | 16 | 96.7 | 95.0 |
| | 1.0000 | 97.0 | 95.8 | 32 | 96.4 | 94.9 |
| LG | 0.5000 | 0.0 | 0.0 | 8 | 40.7 | 37.0 |
| | 1.0000 | 96.4 | 95.2 | 16 | 37.7 | 36.0 |
| | 2.0000 | 43.2 | 39.4 | 32 | 37.6 | 33.4 |
| | 4.0000 | 39.3 | 36.5 | 64 | 36.9 | 34.1 |
6.2.3 Large scale test
We have used the dataset kdd2010 (bridge to algebra),^{6} which has 29,890,095 features and 19,264,097 training and 748,401 testing instances. Training the classifier on the entire training set required approximately 70 s in the case of L2-SVM loss and 112 s in the case of LG loss. We have run UCDC for \(n\) iterations.
Footnotes
- A function \(F: \mathbb{R}^N\rightarrow \mathbb{R}\) is isotone if \(x\ge y\) implies \(F(x)\ge F(y)\).
- Note that in [12] Nesterov considered the composite setting and developed standard and accelerated gradient methods with iteration complexity guarantees for minimizing composite objective functions. These can be viewed as block coordinate descent methods with a single block.
- This will not be the case for certain types of matrices, such as those arising from wavelet bases or FFT.
Acknowledgments
We thank anonymous referees and Hui Zhang (National University of Defense Technology, China) for useful comments that helped to improve the manuscript.