Distributed Stochastic Optimization of Regularized Risk via Saddle-Point Problem

  • Shin Matsushima
  • Hyokun Yun
  • Xinhua Zhang
  • S. V. N. Vishwanathan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10534)


Many machine learning algorithms minimize a regularized risk, and stochastic optimization is widely used for this task. When working with massive data, it is desirable to perform stochastic optimization in parallel. Unfortunately, many existing stochastic optimization algorithms cannot be parallelized efficiently. In this paper, we show that one can rewrite the regularized risk minimization problem as an equivalent saddle-point problem, and we propose an efficient distributed stochastic optimization (DSO) algorithm for it. We prove the algorithm’s rate of convergence; remarkably, our analysis shows that the algorithm scales almost linearly with the number of processors. We also verify empirically that the proposed algorithm is competitive with other parallel, general-purpose stochastic and batch optimization algorithms for regularized risk minimization.
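
As a rough illustration of the kind of reformulation the abstract refers to, the sketch below uses generic notation that is assumed here rather than taken from the paper: w is the parameter vector, x_i is the i-th of m training examples, each ℓ_i is a closed convex loss with Fenchel conjugate ℓ_i^*, and Ω is a regularizer with weight λ. Because a closed convex function equals its biconjugate, each loss term can be written as a maximization over a dual variable α_i, which turns the minimization into a min-max problem. The paper's exact formulation and sign conventions may differ.

```latex
% Regularized risk minimization (generic notation, assumed for illustration)
\[
  \min_{w} \; \lambda\,\Omega(w) + \frac{1}{m}\sum_{i=1}^{m} \ell_i\bigl(\langle w, x_i\rangle\bigr)
\]

% Biconjugate identity for a closed convex loss
\[
  \ell_i(u) = \sup_{\alpha_i} \bigl\{ \alpha_i u - \ell_i^{*}(\alpha_i) \bigr\}
\]

% Substituting u = <w, x_i> gives an equivalent saddle-point problem
\[
  \min_{w} \, \max_{\alpha} \; \lambda\,\Omega(w)
  + \frac{1}{m}\sum_{i=1}^{m} \bigl( \alpha_i \langle w, x_i\rangle - \ell_i^{*}(\alpha_i) \bigr)
\]
```

In this form the only coupling between w and α is the bilinear term, which expands as a sum of α_i w_j x_{ij} over the nonzero entries of the data. When Ω is also separable across coordinates, the whole objective becomes a sum over (example, coordinate) pairs, which suggests why stochastic updates could be split across processors.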



This work is partially supported by MEXT KAKENHI Grant Number 26730114 and JST-CREST JPMJCR1304.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Shin Matsushima (1), corresponding author
  • Hyokun Yun (2)
  • Xinhua Zhang (3)
  • S. V. N. Vishwanathan (2, 4)

  1. The University of Tokyo, Tokyo, Japan
  2. Amazon.com, Seattle, USA
  3. University of Illinois at Chicago, Chicago, USA
  4. University of California, Santa Cruz, USA
