Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization

Full Length Paper · Mathematical Programming, Series A

Abstract

We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an inner-outer iteration procedure. We analyze the runtime of the framework and obtain rates that improve state-of-the-art results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings.
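
To make the inner-outer acceleration described above concrete, here is a minimal, schematic sketch in Python. It is illustrative rather than the paper's algorithm: the `inner_solver` routine, the proximal parameter `kappa`, and the extrapolation coefficient `beta` are placeholders standing in for the precise choices specified in the paper.

```python
import numpy as np

def inner_outer_acceleration(inner_solver, w0, num_outer, kappa, beta):
    """Schematic inner-outer loop (illustrative; not the paper's exact algorithm).

    Each outer round approximately minimizes the original objective plus a
    proximal term (kappa/2)*||w - y||^2 centered at an extrapolated point y,
    using `inner_solver` as a black-box approximate solver (e.g. a proximal
    stochastic dual coordinate ascent routine), and then extrapolates."""
    w = np.asarray(w0, dtype=float).copy()
    y = w.copy()
    for _ in range(num_outer):
        w_new = inner_solver(center=y, kappa=kappa, w_init=w)  # inner stage
        y = w_new + beta * (w_new - w)                         # extrapolation
        w = w_new
    return w

# Toy usage: this "inner solver" exactly minimizes
# 0.5*||w - b||^2 + (kappa/2)*||w - y||^2, i.e. w = (b + kappa*y)/(1 + kappa).
b = np.array([1.0, -2.0, 3.0])
toy_inner = lambda center, kappa, w_init: (b + kappa * center) / (1.0 + kappa)
w_hat = inner_outer_acceleration(toy_inner, np.zeros(3), num_outer=30, kappa=0.5, beta=0.4)
```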

Notes

  1. Technically speaking, it may be more accurate to use the term randomized dual coordinate ascent, instead of stochastic dual coordinate ascent. This is because our algorithm makes more than one pass over the data, and therefore cannot work directly on distributions with infinite support. However, following the convention in the prior machine learning literature, we do not make this distinction.

  2. If the regularizer \(g(w)\) in the definition of \(P(w)\) is non-differentiable, we can replace \(\nabla \Psi (\tilde{w})\) with an appropriate sub-gradient of \(\Psi \) at \(\tilde{w}\). It is easy to verify that the proof is still valid.

  3. Usually, the training data comes with labels, \(y_i \in \{\pm 1\}\), and the loss function becomes \(\log (1+e^{-y_i x_i^\top w})\). However, we can easily get rid of the labels by re-defining \(x_i \leftarrow -y_i x_i\).
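
As a quick numerical check of this label re-definition (a minimal sketch; the array names and toy data below are illustrative, not from the paper):

```python
import numpy as np

# With x_i <- -y_i * x_i, the per-example logistic loss
# log(1 + exp(-y_i * <x_i, w>)) becomes the label-free form log(1 + exp(<x_i, w>)).
X = np.array([[1.0, 2.0], [0.5, -1.0]])  # rows are examples x_i (toy data)
y = np.array([1.0, -1.0])                # labels in {+1, -1}
w = np.array([0.3, -0.2])

X_signed = -y[:, None] * X               # re-defined examples with labels absorbed
loss_with_labels = np.log1p(np.exp(-y * (X @ w))).mean()
loss_label_free = np.log1p(np.exp(X_signed @ w)).mean()
assert np.isclose(loss_with_labels, loss_label_free)
```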

References

  1. Baes, M.: Estimate Sequence Methods: Extensions and Approximations. Institute for Operations Research, ETH, Zürich (2009)

  2. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  3. Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)

  4. Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574 (2011)

  5. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001)

  6. d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)

  7. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)

  8. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)

  9. Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the ℓ1-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272–279. ACM (2008)

  10. Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the 23rd Annual Conference on Learning Theory, pp. 14–26 (2010)

  11. Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. Technical report. arXiv:1312.5799 (2013)

  12. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)

  13. Hu, C., Weike, P., Kwok, J.T.: Accelerated gradient methods for stochastic optimization and online learning. In: Advances in Neural Information Processing Systems, pp. 781–789 (2009)

  14. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. arXiv preprint arXiv:1207.4747 (2012)

  15. Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. In: NIPS, pp. 905–912 (2009)

  16. Roux, N.L., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258 (2012)

  17. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  18. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  19. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140, 125–161 (2013)

  20. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)

  21. Schmidt, M., Roux, N.L., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. Technical report. arXiv:1109.2415 (2011)

  22. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for ℓ1-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)

  23. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)

  24. Shalev-Shwartz, S., Tewari, A.: Stochastic methods for ℓ1-regularized loss minimization. In: ICML, p. 117 (2009)

  25. Shalev-Shwartz, S., Zhang, T.: Proximal stochastic dual coordinate ascent. arXiv:1211.2717 (2012)

  26. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  27. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: primal estimated sub-GrAdient SOlver for SVM. In: ICML, pp. 807–814 (2007)

  28. Shalev-Shwartz, S., Srebro, N., Zhang, T.: Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20(6), 2807–2832 (2010)

  29. Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: ICML (2013)

  30. Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)

  31. Zhang, T.: On the dual formulation of regularized linear systems. Mach. Learn. 46, 91–129 (2002)

Acknowledgments

The authors would like to thank Fen Xia for careful proofreading of the paper, which helped us correct numerous typos. Shai Shalev-Shwartz is supported by the following grants: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and ISF 598-10. Tong Zhang is supported by the following grants: NSF IIS-1407939 and NSF IIS-1250985.

Author information

Corresponding author

Correspondence to Tong Zhang.

Additional information

This paper replaces and improves an earlier arXiv version [25]. An extended abstract of the paper was presented at ICML 2014.

Appendix: Proofs of iteration bounds for Prox-SDCA

The proof technique follows that of Shalev-Shwartz and Zhang [26], but with the required generality for handling general strongly convex regularizers and smoothness/Lipschitzness with respect to general norms. The proof of Theorem 1 is almost identical to the proof of Theorem 1 in Shalev-Shwartz and Zhang [25], except that we do not upper bound \(\mathbb {E}[D(\alpha ^*)-D(\alpha ^{(0)})]\) by 1.

Proof of Theorem 2

Denote \(\epsilon _D^{(t)} := D(\alpha ^*)-D(\alpha ^{(t)})\) and define \(t_0 = \lceil \frac{n}{s} \log (2\epsilon _D^{(0)}/\epsilon _D) \rceil \). The proof of Theorem 1 implies that for every \(t\), \(\mathbb {E}[\epsilon _D^{(t)}] \le \epsilon _D^{(0)}\,e^{-\frac{st}{n}}\). By Markov’s inequality, with probability of at least \(1/2\) we have \(\epsilon _D^{(t)} \le 2\epsilon _D^{(0)}\,e^{-\frac{st}{n}}\). Applying this with \(t=t_0\), we get that \(\epsilon _D^{(t_0)} \le \epsilon _D\) with probability of at least \(1/2\). Now, let us apply the same argument again, this time with the initial dual sub-optimality being \(\epsilon _D^{(t_0)}\). Since the dual objective is monotonically non-decreasing, the dual sub-optimality is monotonically non-increasing, and in particular \(\epsilon _D^{(t_0)} \le \epsilon _D^{(0)}\). Therefore, the same argument tells us that with probability of at least \(1/2\) we have \(\epsilon _D^{(2t_0)} \le \epsilon _D\). Repeating this \(\lceil \log _2(1/\delta ) \rceil \) times, we obtain that with probability of at least \(1-\delta \), for some \(k\) we have \(\epsilon _D^{(kt_0)} \le \epsilon _D\). Since the dual sub-optimality is monotonically non-increasing, it remains at most \(\epsilon _D\) in all subsequent rounds, which proves the claim about the dual sub-optimality.
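
To spell out the step that fixes \(t_0\): since \(t_0 \ge \frac{n}{s} \log (2\epsilon _D^{(0)}/\epsilon _D)\), the Markov bound at \(t = t_0\) gives

$$\begin{aligned} \epsilon _D^{(t_0)} \le 2\epsilon _D^{(0)}\,e^{-\frac{s t_0}{n}} \le 2\epsilon _D^{(0)}\,e^{-\log (2\epsilon _D^{(0)}/\epsilon _D)} = \epsilon _D \end{aligned}$$

with probability of at least \(1/2\), which is exactly the statement used above.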

Next, for the duality gap, we use the inequality

$$\begin{aligned} \mathbb {E}_t[D(\alpha ^{(t)})-D(\alpha ^{(t-1)})] \ge \frac{s}{n}\, (P(w^{(t-1)})-D(\alpha ^{(t-1)})) \end{aligned}$$

from Lemma 1 of Shalev-Shwartz and Zhang [25]. It follows that for every \(t\) such that \(\epsilon _D^{(t-1)} \le \epsilon _D\),

$$\begin{aligned} P(w^{(t-1)})-D(\alpha ^{(t-1)}) \le \frac{n}{s} \, \mathbb {E}_t[D(\alpha ^{(t)})-D(\alpha ^{(t-1)})] \le \frac{n}{s} \epsilon _D. \end{aligned}$$

This proves the second claim of Theorem 2.

For the last claim, suppose that at round \(T_0\) we have \(\epsilon _D^{(T_0)} \le \epsilon _D\), and let \(T = T_0 + n/s\). It follows that if we choose \(t\) uniformly at random from \(\{T_0,\ldots ,T-1\}\), then \(\mathbb {E}[ P(w^{(t)})-D(\alpha ^{(t)})] \le \epsilon _D\). By Markov’s inequality, with probability of at least \(1/2\) we have \( P(w^{(t)})-D(\alpha ^{(t)}) \le 2\epsilon _D\). Therefore, if we draw \(\lceil \log _2(2/\delta ) \rceil \) such random values of \(t\), then with probability of at least \(1-\delta /2\), at least one of them satisfies \( P(w^{(t)})-D(\alpha ^{(t)}) \le 2\epsilon _D\). Combining this with the first claim of the theorem, choosing \(\epsilon _D = \epsilon _P/2\), and applying the union bound, we conclude the proof of the last claim of Theorem 2. \(\square \)
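
The boosting step in the last claim translates into a simple selection procedure: draw \(\lceil \log _2(2/\delta ) \rceil \) random rounds from the window and keep the iterate with the smallest duality gap. The sketch below is only illustrative; `iterates` and `duality_gap` are hypothetical placeholders for how an implementation would store the primal-dual pairs \((w^{(t)},\alpha ^{(t)})\) for \(t \in \{T_0,\ldots ,T-1\}\) and evaluate \(P(w)-D(\alpha )\).

```python
import math
import random

def select_small_gap_iterate(iterates, duality_gap, delta):
    """Pick ceil(log2(2/delta)) iterates uniformly at random from the window and
    return the one with the smallest duality gap, mirroring the boosting argument
    in the proof of Theorem 2. `iterates` is a list of (w, alpha) pairs and
    `duality_gap(w, alpha)` evaluates P(w) - D(alpha); both are assumed helpers."""
    num_draws = math.ceil(math.log2(2.0 / delta))
    draws = [random.choice(iterates) for _ in range(num_draws)]
    return min(draws, key=lambda pair: duality_gap(*pair))
```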

Cite this article

Shalev-Shwartz, S., Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155, 105–145 (2016). https://doi.org/10.1007/s10107-014-0839-0
