Abstract
We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an inner-outer iteration procedure. We analyze the runtime of the framework and obtain rates that improve state-of-the-art results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings.
Notes
Technically speaking, it may be more accurate to use the term randomized dual coordinate ascent, instead of stochastic dual coordinate ascent. This is because our algorithm makes more than one pass over the data, and therefore cannot work directly on distributions with infinite support. However, following the convention in the prior machine learning literature, we do not make this distinction.
If the regularizer \(g(w)\) in the definition of \(P(w)\) is non-differentiable, we can replace \(\nabla \Psi (\tilde{w})\) with an appropriate sub-gradient of \(\Psi \) at \(\tilde{w}\). It is easy to verify that the proof is still valid.
Usually, the training data comes with labels, \(y_i \in \{\pm 1\}\), and the loss function becomes \(\log (1+e^{-y_i x_i^\top w})\). However, we can easily get rid of the labels by re-defining \(x_i \leftarrow -y_i x_i\).
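The label-folding trick above can be checked numerically. The following sketch (plain Python, illustrative function names) verifies that the labeled logistic loss \(\log (1+e^{-y_i x_i^\top w})\) coincides with the label-free form after the substitution \(x_i \leftarrow -y_i x_i\):

```python
import math

def logistic_loss_labeled(w, x, y):
    # log(1 + exp(-y * <x, w>))
    return math.log1p(math.exp(-y * sum(xi * wi for xi, wi in zip(x, w))))

def logistic_loss_unlabeled(w, x):
    # log(1 + exp(<x, w>))
    return math.log1p(math.exp(sum(xi * wi for xi, wi in zip(x, w))))

w = [0.5, -1.0]
x = [2.0, 3.0]
for y in (-1.0, 1.0):
    x_folded = [-y * xi for xi in x]  # the substitution x_i <- -y_i x_i
    assert abs(logistic_loss_labeled(w, x, y)
               - logistic_loss_unlabeled(w, x_folded)) < 1e-12
```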
References
Baes, M.: Estimate Sequence Methods: Extensions and Approximations. Institute for Operations Research, ETH, Zürich (2009)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)
Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574 (2011)
Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001)
d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272–279. ACM (2008)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the 23rd Annual Conference on Learning Theory, pp. 14–26 (2010)
Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. Technical report. arXiv:1312.5799 (2013)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)
Hu, C., Weike, P., Kwok, J.T.: Accelerated gradient methods for stochastic optimization and online learning. In: Advances in Neural Information Processing Systems, pp. 781–789 (2009)
Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. arXiv preprint arXiv:1207.4747 (2012)
Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. In: NIPS, pp. 905–912 (2009)
Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258 (2012)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140, 125–161 (2013)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Schmidt, M., Roux, N.L., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. Technical report. arXiv:1109.2415 (2011)
Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. In: ICML, p. 117 (2009)
Shalev-Shwartz, S., Zhang, T.: Proximal stochastic dual coordinate ascent. arXiv:1211.2717 (2012)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: ICML, pp. 807–814 (2007)
Shalev-Shwartz, S., Srebro, N., Zhang, T.: Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20(6), 2807–2832 (2010)
Takác, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: ICML (2013)
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)
Zhang, T.: On the dual formulation of regularized linear systems. Mach. Learn. 46, 91–129 (2002)
Acknowledgments
The authors would like to thank Fen Xia for careful proof-reading of the paper which helped us to correct numerous typos. Shai Shalev-Shwartz is supported by the following grants: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and ISF 598-10. Tong Zhang is supported by the following grants: NSF IIS-1407939 and NSF IIS-1250985.
Additional information
This paper replaces and improves an earlier arxiv version [25]. An extended abstract of the paper was presented in ICML 2014.
Appendix: Proofs of iteration bounds for Prox-SDCA
The proof technique follows that of Shalev-Shwartz and Zhang [26], but with the required generality for handling general strongly convex regularizers and smoothness/Lipschitzness with respect to general norms. The proof of Theorem 1 is almost identical to the proof of Theorem 1 in Shalev-Shwartz and Zhang [25], except that we do not upper bound \(\mathbb {E}[D(\alpha ^*)-D(\alpha ^{(0)})]\) by 1.
Proof of Theorem 2
Denote \(\epsilon _D^{(t)} := D(\alpha ^*)-D(\alpha ^{(t)})\) and define \(t_0 = \lceil \frac{n}{s} \log (2\epsilon _D^{(0)}/\epsilon _D) \rceil \). The proof of Theorem 1 implies that for every \(t\), \(\mathbb {E}[\epsilon _D^{(t)}] \le \epsilon _D^{(0)}\,e^{-\frac{st}{n}}\). By Markov’s inequality, with probability of at least \(1/2\) we have \(\epsilon _D^{(t)} \le 2\epsilon _D^{(0)}\,e^{-\frac{st}{n}}\). Applying this with \(t=t_0\), we get that \(\epsilon _D^{(t_0)} \le \epsilon _D\) with probability of at least \(1/2\). Now, let us apply the same argument again, this time with the initial dual sub-optimality being \(\epsilon _D^{(t_0)}\). Since the dual objective is monotonically non-decreasing, the dual sub-optimality is monotonically non-increasing, and in particular \(\epsilon _D^{(t_0)} \le \epsilon _D^{(0)}\). Therefore, the same argument tells us that with probability of at least \(1/2\) we have \(\epsilon _D^{(2t_0)} \le \epsilon _D\). Repeating this \(\lceil \log _2(1/\delta ) \rceil \) times, we obtain that with probability of at least \(1-\delta \), for some \(k \le \lceil \log _2(1/\delta ) \rceil \) we have \(\epsilon _D^{(kt_0)} \le \epsilon _D\). Since the dual sub-optimality is monotonically non-increasing, it remains at most \(\epsilon _D\) thereafter, and the claim about the dual sub-optimality follows.
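The restart schedule in this argument can be worked through numerically. The sketch below (illustrative, not part of the proof) computes the stage length \(t_0 = \lceil \frac{n}{s} \log (2\epsilon _D^{(0)}/\epsilon _D) \rceil \) and the total iteration count over \(\lceil \log _2(1/\delta ) \rceil \) stages, checking that the failure probability halves per stage:

```python
import math

def stage_length(n, s, eps0, eps):
    # t0 = ceil((n/s) * log(2*eps0/eps)): iterations per stage
    return math.ceil((n / s) * math.log(2 * eps0 / eps))

def total_iterations(n, s, eps0, eps, delta):
    # Each stage fails with probability at most 1/2, so k stages
    # drive the overall failure probability below delta.
    k = math.ceil(math.log2(1 / delta))
    assert 0.5 ** k <= delta
    return k * stage_length(n, s, eps0, eps)

# e.g. n = 1000 examples, mini-batch size s = 1, eps0 = 1, eps = 1e-3, delta = 1e-2
print(total_iterations(1000, 1, 1.0, 1e-3, 0.01))
```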
Next, for the duality gap, Lemma 1 of Shalev-Shwartz and Zhang [25] gives the inequality
\( \mathbb {E}[P(w^{(t-1)})-D(\alpha ^{(t-1)})] \le \frac{n}{s}\,\mathbb {E}[D(\alpha ^{(t)})-D(\alpha ^{(t-1)})] = \frac{n}{s}\,\mathbb {E}[\epsilon _D^{(t-1)}-\epsilon _D^{(t)}] .\)
Hence, for every \(t\) such that \(\epsilon _D^{(t-1)} \le \epsilon _D\) we have
\( \mathbb {E}[P(w^{(t-1)})-D(\alpha ^{(t-1)})] \le \frac{n}{s}\,\epsilon _D^{(t-1)} \le \frac{n}{s}\,\epsilon _D .\)
This proves the second claim of Theorem 2.
For the last claim, suppose that at round \(T_0\) we have \(\epsilon _D^{(T_0)} \le \epsilon _D\), and let \(T = T_0 + n/s\). It follows that if we choose \(t\) uniformly at random from \(\{T_0,\ldots ,T-1\}\), then \(\mathbb {E}[ P(w^{(t)})-D(\alpha ^{(t)})] \le \epsilon _D\). By Markov’s inequality, with probability of at least \(1/2\) we have \( P(w^{(t)})-D(\alpha ^{(t)}) \le 2\epsilon _D\). Therefore, if we choose \(\lceil \log _2(2/\delta ) \rceil \) such random \(t\), then with probability of at least \(1-\delta /2\), at least one of them satisfies \( P(w^{(t)})-D(\alpha ^{(t)}) \le 2\epsilon _D\). Combining this with the first claim of the theorem, choosing \(\epsilon _D = \epsilon _P/2\), and applying the union bound, we conclude the proof of the last claim of Theorem 2. \(\square \)
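The probability amplification in this last step is straight arithmetic, and can be checked as follows (an illustrative sanity check, not part of the proof): each random \(t\) succeeds with probability at least \(1/2\), so \(\lceil \log _2(2/\delta ) \rceil \) independent draws all fail with probability at most \(\delta /2\), and the union bound with the \(\delta /2\) failure probability of the dual claim gives \(\delta \).

```python
import math

delta = 0.05
r = math.ceil(math.log2(2 / delta))  # number of independent random draws of t
gap_failure = 0.5 ** r               # probability that every draw fails Markov's bound
assert gap_failure <= delta / 2      # amplification step
assert gap_failure + delta / 2 <= delta  # union bound with the dual claim
```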
Shalev-Shwartz, S., Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155, 105–145 (2016). https://doi.org/10.1007/s10107-014-0839-0