
Asynchronous COMID: The Theoretic Basis for Transmitted Data Sparsification Tricks on Parameter Server

  • Cheng Daning
  • Li Shigang
  • Zhang Yunquan
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 911)

Abstract

Asynchronous FTRL-proximal and performing the L2-norm update on the server are two widely used tricks in the Parameter Server framework, which is an implementation of delayed SGD. Their common feature is that part of the update computation is left on the server, which reduces the network burden by keeping the transmitted data sparse. However, the convergence of these tricks has not been well proved. In this paper, based on this common feature, we propose a more general algorithm, named asynchronous COMID, and prove its regret bound. We show that asynchronous FTRL-proximal and the server-side L2-norm update are both applications of asynchronous COMID, which establishes the convergence of these two tricks. We then conduct experiments to verify the theoretical results. The experimental results show that, compared with delayed SGD on the Parameter Server, asynchronous COMID reduces the network burden without harming the mathematical convergence speed or the final output.
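To illustrate the common feature described above, the following is a minimal sketch of a COMID-style update in which the worker transmits only a sparse gradient and the server performs the composite (regularization) part of the step. It assumes a Euclidean Bregman divergence and an L2 regularizer r(w) = (lam/2)||w||^2, for which the COMID update reduces to the proximal step w <- (w - eta*g) / (1 + eta*lam). The function name comid_server_update and the dense NumPy model are illustrative assumptions, not the paper's implementation.

import numpy as np

def comid_server_update(w, sparse_grad, eta, lam):
    """One server-side COMID step (illustrative sketch).

    The worker only transmits the non-zero gradient coordinates
    (sparse_grad: dict index -> value); the composite/regularization
    part of the update is left on the server, so the data sent over
    the network stays sparse.
    """
    # Apply the (possibly delayed) sparse gradient received from a worker.
    for i, g in sparse_grad.items():
        w[i] -= eta * g
    # Server-side composite step: L2 shrinkage on the full parameter vector,
    # i.e. the proximal operator of (lam/2)*||w||^2 with step size eta.
    w /= (1.0 + eta * lam)
    return w

# Toy usage: one worker update on a 6-dimensional model.
w = np.zeros(6)
w = comid_server_update(w, {1: 0.5, 4: -0.2}, eta=0.1, lam=0.01)
print(w)

Replacing the shrinkage line with a soft-thresholding step would give the analogous sparse-transmission variant for an L1 regularizer, which is the setting FTRL-proximal-style tricks target.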

Keywords

Asynchronous parallel · COMID · FTRL · L2 norm · Parameter server

Acknowledgement

This work was supported by National Natural Science Foundation of China under Grant No. 61502450, Grant No. 61432018, and Grant No. 61521092; National Key R&D Program of China under Grant No. 2016YFB0200800, Grant No. 2017YFB0202302, and Grant No. 2016YFE0100300.


Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
