
Fully asynchronous stochastic coordinate descent: a tight lower bound on the parallelism achieving linear speedup

Abstract

We seek tight bounds on the viable parallelism achieving linear speedup in asynchronous implementations of coordinate descent. We focus on asynchronous coordinate descent (ACD) algorithms on convex functions that consist of the sum of a smooth convex part and a possibly non-smooth separable convex part. We quantify the shortfall in progress compared to the standard sequential stochastic gradient descent. This leads to a simple yet tight analysis of the standard stochastic ACD in a partially asynchronous environment, generalizing and improving the bounds in prior work. We also give a considerably more involved analysis for general asynchronous environments, in which the only constraint is that each update can overlap with at most q others. The new lower bound on the maximum degree of parallelism attaining linear speedup is tight and improves the best prior bound almost quadratically.
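To make the setting concrete, here is a minimal, self-contained sketch of sequential stochastic proximal coordinate descent on a toy lasso instance, with smooth part \(f(x) = \frac{1}{2}\Vert Ax - b\Vert ^2\) and separable non-smooth part \(\lambda \Vert x\Vert _1\). It is an illustration only, not the algorithm or analysis of this paper; the toy data and all names in the code are assumptions made for the example.

    # Illustrative sketch only: sequential stochastic proximal coordinate descent
    # on a toy lasso instance f(x) = 0.5*||Ax - b||^2 + lam*||x||_1.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, lam = 50, 20, 0.1
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    L = np.max(np.sum(A * A, axis=0))  # bound on the coordinate-wise Lipschitz constants of grad f

    def grad_f(x, j):
        # j-th coordinate of grad f(x) = A^T (A x - b)
        return A[:, j] @ (A @ x - b)

    def prox(v, step):
        # proximal map of step * lam * |.| : soft-thresholding
        return np.sign(v) * max(abs(v) - lam * step, 0.0)

    x = np.zeros(n)
    for _ in range(5000):
        j = rng.integers(n)  # coordinate chosen uniformly at random
        x[j] = prox(x[j] - grad_f(x, j) / L, 1.0 / L)  # proximal coordinate update

In an asynchronous implementation, several cores execute the loop body concurrently, so the iterate read inside grad_f may already be stale by the time the update to x[j] is committed; quantifying the resulting shortfall in progress relative to the sequential method is the subject of the paper.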



Notes

  1. In fact, having a continuous gradient suffices.

  2. There are also versions of the sequential algorithm in which different coordinates can be selected with different probabilities.

  3. This is a weakening of standard strong convexity.

  4. This is expressed in terms of a parameter \(\tau \), renamed q in this paper, which is essentially the possible parallelism; the connection between them depends on the relative times to calculate different updates.

  5. E.g., communication delays, interference from other computations (say due to mutual exclusion when multiple cores commit updates to the same coordinate), and interference from the operating system and CPU scheduling.

  6. The fetch-and-add CPU instruction atomically increments the contents of a memory location by a specified value.

  7. The standard birthday paradox result states that if \(\epsilon \sqrt{n}\) cores each choose a coordinate from [n] uniformly at random, the probability of a collision is \(\varTheta (\epsilon ^2)\); see the short calculation following these notes.
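
For completeness, here is the back-of-the-envelope calculation behind note 7, assuming \(p = \epsilon \sqrt{n}\) cores with \(\epsilon \le 1\), so that \(p \ll n\):

\[ \Pr [\text{collision}] \;=\; 1 - \prod _{i=1}^{p-1}\Bigl (1 - \frac{i}{n}\Bigr ) \;\approx \; \sum _{i=1}^{p-1}\frac{i}{n} \;=\; \frac{p(p-1)}{2n} \;\approx \; \frac{\epsilon ^2}{2} \;=\; \varTheta (\epsilon ^2). \]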


Acknowledgements

We thank several anonymous reviewers for their helpful and thoughtful suggestions regarding earlier versions of this paper.

Author information

Corresponding author

Correspondence to Yixin Tao.

Additional information


Part of this work was done while Yun Kuen Cheung held positions at the Courant Institute, NYU, at the Faculty of Computer Science, University of Vienna, and at the Max Planck Institute for Informatics, Saarland Informatics Campus. He was supported in part by NSF Grant CCF-1217989, the Vienna Science and Technology Fund (WWTF) project ICT10-002, Singapore NRF 2018 Fellowship NRF-NRFF2018-07, and MOE AcRF Tier 2 Grant 2016-T2-1-170. Additionally, the research leading to these results received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC Grant Agreement No. 340506. Richard Cole and Yixin Tao’s work was supported in part by NSF Grants CCF-1217989, CCF-1527568, and CCF-1909538.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 65 KB)


About this article


Cite this article

Cheung, Y.K., Cole, R. & Tao, Y. Fully asynchronous stochastic coordinate descent: a tight lower bound on the parallelism achieving linear speedup. Math. Program. 190, 615–677 (2021). https://doi.org/10.1007/s10107-020-01552-8


Keywords

  • Asynchronous coordinate descent
  • Stochastic coordinate descent
  • Linear speedup
  • Amortized analysis

Mathematics Subject Classification

  • 90C06 Large-scale problems in mathematical programming
  • 90C25 Convex programming