
Accelerate distributed deep learning with cluster-aware sketch quantization

  • Research Paper
  • Published in Science China Information Sciences

Abstract

Gradient quantization is widely used in distributed training of deep neural network (DNN) models to reduce communication cost. However, existing quantization methods overlook the fact that gradients follow a nonuniform distribution that changes over time. This oversight can lead to significant compression error, which not only increases the number of training iterations but also requires more quantization bits (and consequently a longer delay per iteration) to keep the validation accuracy as high as that of the original stochastic gradient descent (SGD) approach. To address this problem, we propose cluster-aware sketch quantization (CASQ), a novel sketch-based gradient quantization method for SGD with convergence guarantees. CASQ models the nonuniform distribution of gradients via clustering and adaptively allocates hash buckets to compress gradients based on the statistics of the different clusters. Extensive evaluation shows that, compared with existing quantization methods, CASQ-based SGD (i) achieves the same validation accuracy while decreasing the quantization level from 3 bits to 2 bits, and (ii) reduces the training time to convergence by up to 43% for the same training loss.
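The abstract only sketches how CASQ works, so the following is a minimal, illustrative NumPy example of the general cluster-then-allocate idea rather than the authors' implementation. It groups gradient coordinates by magnitude, gives each cluster a share of count-sketch hash buckets in proportion to its spread, and hashes each coordinate into its cluster's table with a random sign. The function names (`cluster_aware_sketch`, `unsketch`), the 1-D k-means clustering, and the spread-proportional bucket allocation are all assumptions made for this example; the paper's actual allocation rule and quantization step may differ.

```python
# Illustrative sketch of a cluster-aware, count-sketch-style gradient compressor.
# NOT the authors' CASQ code; clustering and bucket allocation are assumptions.
import numpy as np

def cluster_aware_sketch(grad, num_clusters=3, total_buckets=1024, seed=0):
    rng = np.random.default_rng(seed)
    flat = np.asarray(grad, dtype=float).ravel()

    # 1. Cluster gradient magnitudes with a tiny 1-D k-means.
    mags = np.abs(flat)
    centers = np.quantile(mags, np.linspace(0.1, 0.9, num_clusters))
    for _ in range(10):
        labels = np.argmin(np.abs(mags[:, None] - centers[None, :]), axis=1)
        for c in range(num_clusters):
            if np.any(labels == c):
                centers[c] = mags[labels == c].mean()

    # 2. Allocate hash buckets to each cluster in proportion to its spread,
    #    so clusters with larger variation get finer resolution.
    spreads = np.array([flat[labels == c].std() + 1e-12 if np.any(labels == c)
                        else 1e-12 for c in range(num_clusters)])
    buckets = np.maximum(1, (total_buckets * spreads / spreads.sum()).astype(int))

    # 3. Hash each coordinate into its cluster's table with a random sign
    #    (count-sketch style); colliding coordinates are summed.
    sketches, hash_state = [], []
    for c in range(num_clusters):
        idx = np.where(labels == c)[0]
        h = rng.integers(0, buckets[c], size=idx.size)
        s = rng.choice([-1.0, 1.0], size=idx.size)
        table = np.zeros(buckets[c])
        np.add.at(table, h, s * flat[idx])
        sketches.append(table)
        hash_state.append((idx, h, s))
    return sketches, hash_state, np.asarray(grad).shape

def unsketch(sketches, hash_state, shape):
    # Decode: approximate each coordinate by its sign-corrected bucket value.
    out = np.zeros(int(np.prod(shape)))
    for table, (idx, h, s) in zip(sketches, hash_state):
        out[idx] = s * table[h]
    return out.reshape(shape)
```

In this toy setup, only the per-cluster tables need to be communicated when workers share the random seed, and `unsketch` returns a noisy (but, over the sign randomness, unbiased) estimate of the gradient; how CASQ combines such estimates across workers and quantizes the table entries to 2-3 bits is described in the full paper.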



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62025208, 61972409) and the National Key Research and Development Program of China (Grant No. 2021YFB0301200).

Author information

Correspondence to Dongsheng Li.


About this article


Cite this article

Ge, K., Zhang, Y., Fu, Y. et al. Accelerate distributed deep learning with cluster-aware sketch quantization. Sci. China Inf. Sci. 66, 162102 (2023). https://doi.org/10.1007/s11432-021-3532-8

