
Efficient Decentralized Deep Learning by Dynamic Model Averaging

  • Michael Kamp
  • Linara Adilova
  • Joachim Sicking
  • Fabian Hüger
  • Peter Schlicht
  • Tim Wirtz
  • Stefan Wrobel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11051)

Abstract

We propose an efficient protocol for decentralized training of deep neural networks from distributed data sources. The protocol handles different phases of model training equally well and adapts quickly to concept drift. This reduces communication by an order of magnitude compared to periodically communicating state-of-the-art approaches. Moreover, we derive a communication bound that scales well with the hardness of the serialized learning problem. The reduction in communication comes at almost no cost, as the predictive performance remains virtually unchanged; indeed, the proposed protocol retains the loss bounds of periodically averaging schemes. An extensive empirical evaluation validates a major improvement in the trade-off between model performance and communication, which could benefit numerous decentralized learning applications, such as autonomous driving, or voice recognition and image classification on mobile phones. Code related to this paper is available at: https://bitbucket.org/Michael_Kamp/decentralized-machine-learning.
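To make the idea of dynamic (divergence-triggered) model averaging concrete, the following is a minimal sketch in Python. It assumes a simplified variant of the protocol: each learner checks whether its parameters have drifted, in squared Euclidean distance, more than a threshold delta from the last shared reference model, and a full averaging step is triggered only when some learner reports a violation. The names (dynamic_averaging_round, delta, reference) are illustrative, and the authors' implementation in the linked repository may differ, for instance by resolving violations with partial synchronizations.

import numpy as np

def average(models):
    # Element-wise mean of the learners' parameter vectors.
    return np.mean(models, axis=0)

def dynamic_averaging_round(local_models, reference, delta):
    # Each learner reports a violation if its model has drifted more than
    # `delta` (squared Euclidean distance) from the shared reference model.
    violations = [float(np.sum((w - reference) ** 2)) > delta for w in local_models]
    if any(violations):
        # Simplified: any violation triggers a full synchronization in which
        # all learners adopt the average as the new model and reference.
        new_reference = average(local_models)
        return [new_reference.copy() for _ in local_models], new_reference
    # No violation: no parameters are communicated in this round.
    return local_models, reference

# Toy usage: three learners that drifted slightly after local SGD steps.
rng = np.random.default_rng(0)
reference = np.zeros(10)
models = [reference + 0.1 * rng.standard_normal(10) for _ in range(3)]
models, reference = dynamic_averaging_round(models, reference, delta=0.5)

In contrast to averaging on a fixed schedule, communication here adapts to how quickly the local models diverge, which is what allows the protocol to stay quiet during stable phases and synchronize more often under concept drift.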

Notes

Acknowledgements

This research has been supported by the Center of Competence Machine Learning Rhein-Ruhr (ML2R).

Supplementary material

Supplementary material 1: 478880_1_En_24_MOESM1_ESM.pdf (PDF, 4.3 MB)


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Michael Kamp 1, 2, 3
  • Linara Adilova 1, 2
  • Joachim Sicking 1, 2
  • Fabian Hüger 4
  • Peter Schlicht 4
  • Tim Wirtz 1, 2
  • Stefan Wrobel 1, 2, 3

  1. Fraunhofer IAIS, Sankt Augustin, Germany
  2. Fraunhofer Center for Machine Learning, Sankt Augustin, Germany
  3. University of Bonn, Bonn, Germany
  4. Volkswagen Group Research, Wolfsburg, Germany
