ZipLine: an optimized algorithm for the elastic bulk synchronous parallel model

Abstract

The bulk synchronous parallel (BSP) model is a celebrated synchronization model for general-purpose parallel computing that has been successfully employed for distributed training of deep learning models. A shortcoming of BSP is that it requires workers to wait for the straggler at every iteration; it therefore increases the waiting time of the faster workers of a cluster and results in an overall prolonged training time. To ameliorate this shortcoming of BSP, we propose ElasticBSP, a model that relaxes BSP's strict synchronization requirement by allowing delayed synchronization so as to minimize the waiting time. ElasticBSP offers more flexibility and adaptability during the training phase without sacrificing the accuracy of the trained model. ElasticBSP is realized by an algorithm named ZipLine, which consists of two phases. First, it estimates at run time, for each worker, the end time points of its future iterations; then a one-pass algorithm over the estimated time points of all workers quickly computes an optimal future time point for synchronization. We provide theoretical results on the correctness and performance of the ZipLine algorithm. Furthermore, we propose algorithmic and implementation optimizations of ZipLine, namely ZipLineOpt and ZipLineOptBS, which reduce the time complexity of ZipLine to linearithmic time. A thorough experimental evaluation demonstrates that our proposed ElasticBSP model, materialized by the proposed optimized ZipLine variants, converges faster and to a higher accuracy than the predominant BSP. The focus of the paper is on optimizing synchronization scheduling over a parameter server architecture; it is orthogonal to other types of optimization, such as learning rate optimization.
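To make the one-pass idea concrete, the following is a minimal, hypothetical sketch (ours, not the authors' implementation) of such a scan, assuming the objective is to pick one predicted iteration-end time per worker so that the total waiting time until the latest chosen point is minimized:

    # Hypothetical sketch: choose one predicted end time per worker so that the
    # total waiting time until the latest chosen point is minimized, using a
    # single pass over the merged, time-sorted points.
    def best_sync_point(predicted_ends):
        """predicted_ends: dict mapping worker id -> list of predicted end times."""
        points = sorted((t, w) for w, times in predicted_ends.items() for t in times)
        latest = {}                               # latest point seen so far, per worker
        best_cost, best_time = float("inf"), None
        for t, w in points:                       # one pass over all predicted points
            latest[w] = t
            if len(latest) == len(predicted_ends):           # every worker has a candidate
                cost = sum(t - u for u in latest.values())   # waiting time of faster workers
                if cost < best_cost:
                    best_cost, best_time = cost, t           # synchronize at this point
        return best_time, best_cost

    # Example: three workers, each with three predicted iteration-end times.
    print(best_sync_point({0: [1.0, 2.0, 3.25], 1: [1.5, 2.25, 2.75], 2: [0.75, 1.75, 3.0]}))
    # -> (2.0, 0.75)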

Notes

  1. We have mentioned the inter-computer communication bottleneck earlier. For intra-computer communication, the delay is caused by data moving between GPUs. GPUs accelerate DNN training because DNN computation is dominated by matrix operations, and GPUs specialize in SIMD (single instruction, multiple data) parallel processing of large data batches. However, data movement between GPUs within a computer is a potential bottleneck, since GPU-to-GPU memory copies must go through PCIe links (64 Gbps theoretical bandwidth for 4 PCIe links on a regular motherboard) unless expensive NVLink interconnects are installed (e.g., 80 Gbps theoretical bandwidth for 4 NVLink links); a back-of-the-envelope illustration of these figures is sketched after these notes.

  2. The time points depicted in Fig. 4 were generated by our synthetic data generator, which is described in Sect. 6.1.

  3. Note that \(e_i^p\) is a triple in our proposed algorithm, containing a timestamp value, the worker id p and the iteration id i of worker p, where i and p are metadata used to identify to whom the timestamp value belongs. For simplicity, we may ignore p and i and treat \(e_i^p\) as a timestamp value when the metadata are clear from context (a possible representation is sketched after these notes).

  4. The code of the generator is available at https://github.com/xingzhaoo/ElasticBSP.

  5. BSP is predominantly used in industry and is supported by PyTorch, TensorFlow and MXNet; the latter two also support ASP. SSP is available in Petuum, and we implemented it in MXNet. Other state-of-the-art synchronous models for the parameter server framework that are not used in practice or are incompatible with MXNet are not included.

  6. https://www.soscip.org/

  7. AlexNet was designed to train on ImageNet 1K, which has 1,000 classes and about a million training samples at \(256 \times 256\) pixel resolution. Since training AlexNet takes a long time, and to obtain the results of our 24 experiments (3 runs per parallel paradigm) faster, we reduced the size and number of layers of AlexNet for CIFAR-10, which has 10 classes and 50,000 training samples at only \(28 \times 28\) pixel resolution.

  8. We did not use 0.0001 as the learning rate because it led to an excessively long training time. Note that other settings may lead to better predictive performance. However, hyperparameter tuning for a deep model is not a focus of this paper, so we did not search for the best parameter setting for our method. We focus on comparing the synchronization methods in terms of their convergence rate and converged accuracy under the same hyperparameter setting; different methods may reach their best performance under different settings.

  9. We tried the hyperparameter setting of the original work (He et al., 2016) and did not obtain better accuracy than with our setting.

  10. The GPU cluster that we used only allows a job to run for up to 24 hours. Given this time constraint, with batch size 256, the largest number of epochs VGG-16 can complete on ImageNet 1K using the baseline model (i.e., BSP) is 19. Without the time constraint, one could expect all distributed paradigms to converge to a higher accuracy on ImageNet 1K.

  11. We tried the hyperparameter setting of the original work (He et al., 2016) and did not obtain better accuracy than with our setting.
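
As a back-of-the-envelope illustration of the intra-computer bandwidth figures quoted in note 1, the sketch below estimates the time for a single GPU-to-GPU gradient copy over the two link types. The payload (about 138 million parameters, roughly the size of VGG-16, sent as fp32 gradients) is our assumption for illustration; the bandwidths are the theoretical figures from the note.

    # Back-of-the-envelope estimate for note 1. The model size (~138M parameters,
    # roughly VGG-16) and fp32 gradients are assumptions for illustration only.
    PARAMS = 138_000_000                 # assumed number of model parameters
    PAYLOAD_BYTES = PARAMS * 4           # fp32 gradients
    BYTES_PER_GBPS = 1e9 / 8             # 1 Gbps expressed in bytes per second

    for link, gbps in [("4x PCIe", 64), ("4x NVLink", 80)]:
        seconds = PAYLOAD_BYTES / (gbps * BYTES_PER_GBPS)
        print(f"{link:9s}: ~{seconds * 1000:.0f} ms per GPU-to-GPU gradient copy")
    # -> 4x PCIe  : ~69 ms,  4x NVLink: ~55 ms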
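Note 3's triple \(e_i^p\) could be represented as follows. This is a hypothetical sketch with field names of our choosing; ordering is defined on the timestamp alone, so a point can be handled "as a timestamp value" while the worker and iteration ids ride along as metadata.

    from dataclasses import dataclass, field

    # Hypothetical representation of the triple e_i^p from note 3 (field names are ours).
    # Comparison uses the timestamp only, so a point sorts and compares like a plain
    # timestamp while still carrying its metadata.
    @dataclass(order=True)
    class EndPoint:
        timestamp: float
        worker: int = field(compare=False)     # p: the worker this point belongs to
        iteration: int = field(compare=False)  # i: which of that worker's iterations

    a = EndPoint(12.5, worker=3, iteration=41)
    b = EndPoint(13.0, worker=1, iteration=40)
    print(a < b)  # True: ordered by timestamp alone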

References

  1. Benz, K., & Bohnert, T. (2013). Dependability modeling framework: A test procedure for high availability in cloud operating systems. In: 2013 IEEE 78th Vehicular technology conference (VTC Fall), IEEE, pp 1–8.

  2. Chen, J., Pan, X., Monga, R., Bengio, S., & Jozefowicz, R. (2016). Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981.

  3. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., et al. (2015). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.

  4. Chen, X. W., & Lin, X. (2014). Big data deep learning: Challenges and perspectives. IEEE Access, 2, 514–525.

  5. Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., & Andrew, N. (2013). Deep learning with COTS HPC systems. In: International conference on machine learning, PMLR, pp 1337–1345.

  6. Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., et al. (2014). Exploiting bounded staleness to speed up big data analytics. In: 2014 USENIX Annual technical conference (USENIX ATC 14), pp 37–48.

  7. Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., & Xing, E. P. (2016). GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In: Proceedings of the eleventh European conference on computer systems, pp 1–16.

  8. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q., & Ng, A. (2012). Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 25, pp. 1223–1231). Curran Associates, Inc. https://papers.nips.cc/paper/2012/hash/6aca97005c68f1206823815f66102863-Abstract.html.

  9. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255.

  10. Dryden, N., Moon, T., Jacobs, S. A., & Van Essen, B. (2016). Communication quantization for data-parallel training of deep neural networks. In: 2016 2nd Workshop on machine learning in HPC environments (MLHPC), IEEE, pp 1–8.

  11. Dunke, F. (2014). Online optimization with lookahead. PhD thesis, Karlsruher Institut für Technologie (KIT), https://doi.org/10.5445/IR/1000042132.

  12. Dutta, S., Joshi, G., Ghosh, S., Dube, P., & Nagpurkar, P. (2018). Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In: International conference on artificial intelligence and statistics, PMLR, pp 803–812.

  13. Gerbessiotis, A. V., & Valiant, L. G. (1994). Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, 22(2), 251–267.

  14. Harlap, A., Cui, H., Dai, W., Wei, J., Ganger, G. R., Gibbons, P. B., Gibson, G. A., & Xing, E. P. (2016). Addressing the straggler problem for iterative convergent parallel ML. In: Proceedings of the seventh ACM symposium on cloud computing, pp 98–111.

  15. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778.

  16. Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., et al. (2013). More effective distributed ML via a stale synchronous parallel parameter server. Advances in neural information processing systems (pp. 1223–1231). New York: Curran Associates Inc.

  17. Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., Citeseer.

  18. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.

  19. Langer, M., He, Z., Rahayu, W., & Xue, Y. (2020). Distributed training of deep learning models: A taxonomic perspective. IEEE Transactions on Parallel and Distributed Systems, 31(12), 2802–2818.

  20. Li, H., Kadav, A., Kruus, E., & Ungureanu, C. (2015). MALT: Distributed data-parallelism for existing ML applications. In: Proceedings of the tenth European conference on computer systems, pp 1–16.

  21. Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., & Su, B. Y. (2014). Scaling distributed machine learning with the parameter server. In: 11th USENIX Symposium on operating systems design and implementation (OSDI 14), pp 583–598.

  22. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2019). On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.

  23. Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training deep networks in Spark. arXiv preprint arXiv:1511.06051.

  24. Recht, B., Re, C., Wright, S., & Niu, F. (2011). Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In: Proceedings of the 24th international conference on neural information processing systems, NIPS'11, pp 693–701.

  25. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  26. Strom, N. (2015). Scalable distributed DNN training using commodity GPU cloud computing. In: Sixteenth annual conference of the international speech communication association, ISCA, pp 1488–1492.

  27. Teng, M., & Wood, F. (2018). Bayesian distributed stochastic gradient descent. Advances in Neural Information Processing Systems, 31, 6378–6388.

  28. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI conference on artificial intelligence, AAAI’16, pp. 2094–2100.

  29. Wang, J., & Joshi, G. (2019). Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In: Proceedings of machine learning and systems (SysML'19), 1, 212–229.

  30. Wilson, D. R., & Martinez, T. R. (2001). The need for small learning rates on large problems. In: IJCNN’01. International joint conference on neural networks. Proceedings (Cat. No. 01CH37222), IEEE, vol 1, pp 115–119.

  31. Wu, Y., Liu, L., Bae, J., Chow, K. H., Iyengar, A., Pu, C., Wei, W., Yu, L., & Zhang, Q. (2019). Demystifying learning rate policies for high accuracy training of deep neural networks. In: 2019 IEEE International conference on big data (Big Data), IEEE, pp 1971–1980.

  32. Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., et al. (2017). Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In: USENIX Annual technical conference (USENIX ATC 17), pp 181–193.

  33. Zhang, H., Li, Y., Deng, Z., Liang, X., Carin, L., & Xing, E. (2020). AutoSync: Learning to synchronize for data-parallel distributed deep learning. Advances in Neural Information Processing Systems, 33, 906–917.

  34. Zhang, S., Choromanska, A. E., & LeCun, Y. (2015). Deep learning with elastic averaging SGD. Advances in Neural Information Processing Systems, 28, 685–693.

  35. Zhao, X., An, A., Liu, J., & Chen, B. X. (2019a). Dynamic stale synchronous parallel distributed training for deep learning. In: 2019 IEEE 39th International conference on distributed computing systems (ICDCS'19), IEEE, pp 1507–1517.

  36. Zhao, X., Papagelis, M., An, A., Chen, B. X., Liu, J., & Hu, Y. (2019b). Elastic bulk synchronous parallel model for distributed deep learning. In: 2019 IEEE International conference on data mining (ICDM’19), IEEE, pp 1504–1509.

  37. Zhou, Z., Mertikopoulos, P., Bambos, N., Glynn, P., Ye, Y., Li, L. J., & Li, F. F. (2018). Distributed asynchronous optimization with unbounded delays: How slow can you go? In: 2018 International conference on machine learning, PMLR, pp 5970–5979.

  38. Zhu, R., Yang, S., Pfadler, A., Qian, Z., & Zhou, J. (2020). Learning efficient parameter server synchronization policies for distributed SGD. In: 8th International conference on learning representations, URL https://openreview.net/forum?id=rJxX8T4Kvr.

  39. Zinkevich, M., Weimer, M., Li, L., & Smola, A. J. (2010). Parallelized stochastic gradient descent. In: Proceedings of the 23rd international conference on neural information processing systems, vol 2, pp 2595–2603.

Acknowledgements

This work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), IBM Canada and the Big Data Research Analytics and Information Network (BRAIN) Alliance established by Ontario Research Fund - Research Excellence Program (ORF-RE). The experiments were performed on the GPU cluster of SOSCIP. SOSCIP is funded by the Federal Economic Development Agency of Southern Ontario, the Province of Ontario, IBM Canada, Ontario Centres of Excellence, Mitacs and 15 Ontario academic member institutions.

Author information

Corresponding author

Correspondence to Xing Zhao.

Ethics declarations

Conflict of interest

The authors have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editors: João Gama, Alípio Jorge, Salvador García.

About this article

Cite this article

Zhao, X., Papagelis, M., An, A. et al. ZipLine: an optimized algorithm for the elastic bulk synchronous parallel model. Mach Learn 110, 2867–2903 (2021). https://doi.org/10.1007/s10994-021-06064-w

Keywords

  • Distributed deep learning
  • Parameter server framework
  • Data parallelism
  • BSP
  • Stale synchronous parallel
  • Asynchronous parallel