Abstract
We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an inner-outer iteration procedure. We analyze the runtime of the framework and obtain rates that improve state-of-the-art results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings.
Notes
Technically speaking, it may be more accurate to use the term randomized dual coordinate ascent, instead of stochastic dual coordinate ascent. This is because our algorithm makes more than one pass over the data, and therefore cannot work directly on distributions with infinite support. However, following the convention in the prior machine learning literature, we do not make this distinction.
If the regularizer \(g(w)\) in the definition of \(P(w)\) is non-differentiable, we can replace \(\nabla \Psi (\tilde{w})\) with an appropriate sub-gradient of \(\Psi \) at \(\tilde{w}\). It is easy to verify that the proof is still valid.
Usually, the training data comes with labels, \(y_i \in \{\pm 1\}\), and the loss function becomes \(\log (1+e^{-y_i x_i^\top w})\). However, we can easily get rid of the labels by re-defining \(x_i \leftarrow -y_i x_i\).
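The label-folding trick above can be checked numerically. The following sketch (plain Python, illustrative function names) verifies that the labeled logistic loss \(\log (1+e^{-y_i x_i^\top w})\) coincides with the label-free form after the substitution \(x_i \leftarrow -y_i x_i\):

```python
import math

def logistic_loss_labeled(w, x, y):
    # log(1 + exp(-y * <x, w>))
    return math.log1p(math.exp(-y * sum(xi * wi for xi, wi in zip(x, w))))

def logistic_loss_unlabeled(w, x):
    # log(1 + exp(<x, w>))
    return math.log1p(math.exp(sum(xi * wi for xi, wi in zip(x, w))))

w = [0.5, -1.0]
x = [2.0, 3.0]
for y in (-1.0, 1.0):
    x_folded = [-y * xi for xi in x]  # the substitution x_i <- -y_i x_i
    assert abs(logistic_loss_labeled(w, x, y)
               - logistic_loss_unlabeled(w, x_folded)) < 1e-12
```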
References
Baes, M.: Estimate Sequence Methods: Extensions and Approximations. Institute for Operations Research, ETH, Zürich (2009)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)
Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574 (2011)
Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001)
d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272–279. ACM (2008)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the 23rd Annual Conference on Learning Theory, pp. 14–26 (2010)
Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. Technical report. arXiv:1312.5799 (2013)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)
Hu, C., Weike, P., Kwok, J.T.: Accelerated gradient methods for stochastic optimization and online learning. In: Advances in Neural Information Processing Systems, pp. 781–789 (2009)
Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. arXiv preprint arXiv:1207.4747 (2012)
Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. In: NIPS, pp. 905–912 (2009)
Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint arXiv:1202.6258 (2012)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140, 125–161 (2013)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Schmidt, M., Roux, N.L., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. Technical report. arXiv:1109.2415 (2011)
Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. In: ICML, p. 117 (2009)
Shalev-Shwartz, S., Zhang, T.: Proximal stochastic dual coordinate ascent. arXiv:1211.2717 (2012)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: ICML, pp. 807–814 (2007)
Shalev-Shwartz, S., Srebro, N., Zhang, T.: Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20(6), 2807–2832 (2010)
Takác, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: ICML (2013)
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)
Zhang, T.: On the dual formulation of regularized linear systems. Mach. Learn. 46, 91–129 (2002)
Acknowledgments
The authors would like to thank Fen Xia for careful proof-reading of the paper which helped us to correct numerous typos. Shai Shalev-Shwartz is supported by the following grants: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and ISF 598-10. Tong Zhang is supported by the following grants: NSF IIS-1407939 and NSF IIS-1250985.
Additional information
This paper replaces and improves an earlier arxiv version [25]. An extended abstract of the paper was presented in ICML 2014.
Appendix: Proofs of iteration bounds for Prox-SDCA
The proof technique follows that of Shalev-Shwartz and Zhang [26], but with the required generality for handling general strongly convex regularizers and smoothness/Lipschitzness with respect to general norms. The proof of Theorem 1 is almost identical to the proof of Theorem 1 in Shalev-Shwartz and Zhang [25], except that we do not upper bound \(\mathbb {E}[D(\alpha ^*)-D(\alpha ^{(0)})]\) by 1.
Proof of Theorem 2
Denote \(\epsilon _D^{(t)} := D(\alpha ^*)-D(\alpha ^{(t)})\) and define \(t_0 = \lceil \frac{n}{s} \log (2\epsilon _D^{(0)}/\epsilon _D) \rceil \). The proof of Theorem 1 implies that for every \(t\), \(\mathbb {E}[\epsilon _D^{(t)}] \le \epsilon _D^{(0)}\,e^{-\frac{st}{n}}\). By Markov’s inequality, with probability of at least \(1/2\) we have \(\epsilon _D^{(t)} \le 2\epsilon _D^{(0)}\,e^{-\frac{st}{n}}\). Applying this with \(t=t_0\), we get that \(\epsilon _D^{(t_0)} \le \epsilon _D\) with probability of at least \(1/2\). Now, let us apply the same argument again, this time with the initial dual sub-optimality being \(\epsilon _D^{(t_0)}\). Since the dual objective is monotonically non-decreasing, the dual sub-optimality is monotonically non-increasing, and in particular \(\epsilon _D^{(t_0)} \le \epsilon _D^{(0)}\). Therefore, the same argument tells us that with probability of at least \(1/2\) we have \(\epsilon _D^{(2t_0)} \le \epsilon _D\). Repeating this \(\lceil \log _2(1/\delta ) \rceil \) times, we obtain that with probability of at least \(1-\delta \), for some \(k \le \lceil \log _2(1/\delta ) \rceil \) we have \(\epsilon _D^{(kt_0)} \le \epsilon _D\). Since the dual sub-optimality is monotonically non-increasing, it remains at most \(\epsilon _D\) thereafter, and the claim about the dual sub-optimality follows.
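The restart schedule in this argument can be worked through numerically. The sketch below (illustrative, not part of the proof) computes the stage length \(t_0 = \lceil \frac{n}{s} \log (2\epsilon _D^{(0)}/\epsilon _D) \rceil \) and the total iteration count over \(\lceil \log _2(1/\delta ) \rceil \) stages, checking that the failure probability halves per stage:

```python
import math

def stage_length(n, s, eps0, eps):
    # t0 = ceil((n/s) * log(2*eps0/eps)): iterations per stage
    return math.ceil((n / s) * math.log(2 * eps0 / eps))

def total_iterations(n, s, eps0, eps, delta):
    # Each stage fails with probability at most 1/2, so k stages
    # drive the overall failure probability below delta.
    k = math.ceil(math.log2(1 / delta))
    assert 0.5 ** k <= delta
    return k * stage_length(n, s, eps0, eps)

# e.g. n = 1000 examples, mini-batch size s = 1, eps0 = 1, eps = 1e-3, delta = 1e-2
print(total_iterations(1000, 1, 1.0, 1e-3, 0.01))
```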
Next, for the duality gap, Lemma 1 of Shalev-Shwartz and Zhang [25] gives the inequality
\( \mathbb {E}[P(w^{(t-1)})-D(\alpha ^{(t-1)})] \le \frac{n}{s}\,\mathbb {E}[D(\alpha ^{(t)})-D(\alpha ^{(t-1)})] = \frac{n}{s}\,\mathbb {E}[\epsilon _D^{(t-1)}-\epsilon _D^{(t)}] .\)
Hence, for every \(t\) such that \(\epsilon _D^{(t-1)} \le \epsilon _D\) we have
\( \mathbb {E}[P(w^{(t-1)})-D(\alpha ^{(t-1)})] \le \frac{n}{s}\,\epsilon _D^{(t-1)} \le \frac{n}{s}\,\epsilon _D .\)
This proves the second claim of Theorem 2.
For the last claim, suppose that at round \(T_0\) we have \(\epsilon _D^{(T_0)} \le \epsilon _D\), and let \(T = T_0 + n/s\). It follows that if we choose \(t\) uniformly at random from \(\{T_0,\ldots ,T-1\}\), then \(\mathbb {E}[ P(w^{(t)})-D(\alpha ^{(t)})] \le \epsilon _D\). By Markov’s inequality, with probability of at least \(1/2\) we have \( P(w^{(t)})-D(\alpha ^{(t)}) \le 2\epsilon _D\). Therefore, if we choose \(\lceil \log _2(2/\delta ) \rceil \) such random \(t\), then with probability of at least \(1-\delta /2\), at least one of them satisfies \( P(w^{(t)})-D(\alpha ^{(t)}) \le 2\epsilon _D\). Combining this with the first claim of the theorem, choosing \(\epsilon _D = \epsilon _P/2\), and applying the union bound, we conclude the proof of the last claim of Theorem 2. \(\square \)
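The probability amplification in this last step is straight arithmetic, and can be checked as follows (an illustrative sanity check, not part of the proof): each random \(t\) succeeds with probability at least \(1/2\), so \(\lceil \log _2(2/\delta ) \rceil \) independent draws all fail with probability at most \(\delta /2\), and the union bound with the \(\delta /2\) failure probability of the dual claim gives \(\delta \).

```python
import math

delta = 0.05
r = math.ceil(math.log2(2 / delta))  # number of independent random draws of t
gap_failure = 0.5 ** r               # probability that every draw fails Markov's bound
assert gap_failure <= delta / 2      # amplification step
assert gap_failure + delta / 2 <= delta  # union bound with the dual claim
```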
Shalev-Shwartz, S., Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155, 105–145 (2016). https://doi.org/10.1007/s10107-014-0839-0