
Accelerate distributed deep learning with cluster-aware sketch quantization

  • Research Paper
  • Published in Science China Information Sciences

Abstract

Gradient quantization is widely used in distributed training of deep neural network (DNN) models to reduce communication cost. However, existing quantization methods overlook the fact that gradients follow a nonuniform distribution that changes over time. This oversight can lead to significant compression error, which not only increases the number of training iterations but also requires more quantization bits (and consequently a longer delay per iteration) to keep the validation accuracy as high as that of the original stochastic gradient descent (SGD) approach. To address this problem, we propose cluster-aware sketch quantization (CASQ), a novel sketch-based gradient quantization method for SGD with convergence guarantees. CASQ models the nonuniform distribution of gradients via clustering and adaptively allocates hash buckets to compress gradients based on the statistics of the different clusters. Extensive evaluation shows that, compared with existing quantization methods, CASQ-based SGD (i) achieves the same validation accuracy while decreasing the quantization level from 3 bits to 2 bits, and (ii) reduces the training time to convergence by up to 43% for the same training loss.
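The abstract only sketches how CASQ works, so the following is a minimal, illustrative NumPy example of the general cluster-then-allocate idea rather than the authors' implementation. It groups gradient coordinates by magnitude, gives each cluster a share of count-sketch hash buckets in proportion to its spread, and hashes each coordinate into its cluster's table with a random sign. The function names (`cluster_aware_sketch`, `unsketch`), the 1-D k-means clustering, and the spread-proportional bucket allocation are all assumptions made for this example; the paper's actual allocation rule and quantization step may differ.

```python
# Illustrative sketch of a cluster-aware, count-sketch-style gradient compressor.
# NOT the authors' CASQ code; clustering and bucket allocation are assumptions.
import numpy as np

def cluster_aware_sketch(grad, num_clusters=3, total_buckets=1024, seed=0):
    rng = np.random.default_rng(seed)
    flat = np.asarray(grad, dtype=float).ravel()

    # 1. Cluster gradient magnitudes with a tiny 1-D k-means.
    mags = np.abs(flat)
    centers = np.quantile(mags, np.linspace(0.1, 0.9, num_clusters))
    for _ in range(10):
        labels = np.argmin(np.abs(mags[:, None] - centers[None, :]), axis=1)
        for c in range(num_clusters):
            if np.any(labels == c):
                centers[c] = mags[labels == c].mean()

    # 2. Allocate hash buckets to each cluster in proportion to its spread,
    #    so clusters with larger variation get finer resolution.
    spreads = np.array([flat[labels == c].std() + 1e-12 if np.any(labels == c)
                        else 1e-12 for c in range(num_clusters)])
    buckets = np.maximum(1, (total_buckets * spreads / spreads.sum()).astype(int))

    # 3. Hash each coordinate into its cluster's table with a random sign
    #    (count-sketch style); colliding coordinates are summed.
    sketches, hash_state = [], []
    for c in range(num_clusters):
        idx = np.where(labels == c)[0]
        h = rng.integers(0, buckets[c], size=idx.size)
        s = rng.choice([-1.0, 1.0], size=idx.size)
        table = np.zeros(buckets[c])
        np.add.at(table, h, s * flat[idx])
        sketches.append(table)
        hash_state.append((idx, h, s))
    return sketches, hash_state, np.asarray(grad).shape

def unsketch(sketches, hash_state, shape):
    # Decode: approximate each coordinate by its sign-corrected bucket value.
    out = np.zeros(int(np.prod(shape)))
    for table, (idx, h, s) in zip(sketches, hash_state):
        out[idx] = s * table[h]
    return out.reshape(shape)
```

In this toy setup, only the per-cluster tables need to be communicated when workers share the random seed, and `unsketch` returns a noisy (but, over the sign randomness, unbiased) estimate of the gradient; how CASQ combines such estimates across workers and quantizes the table entries to 2-3 bits is described in the full paper.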



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62025208, 61972409) and the National Key Research and Development Program of China (Grant No. 2021YFB0301200).

Author information

Correspondence to Dongsheng Li.


About this article


Cite this article

Ge, K., Zhang, Y., Fu, Y. et al. Accelerate distributed deep learning with cluster-aware sketch quantization. Sci. China Inf. Sci. 66, 162102 (2023). https://doi.org/10.1007/s11432-021-3532-8

