Abstract
Gradient quantization is widely used in distributed training of deep neural network (DNN) models to reduce communication cost. However, existing quantization methods overlook the fact that gradients follow a nonuniform distribution that changes over time. This mismatch can cause significant compression error, which not only increases the number of training iterations but also demands more quantization bits (and hence a longer delay per iteration) to keep validation accuracy as high as that of the original stochastic gradient descent (SGD) approach. To address this problem, in this paper we propose cluster-aware sketch quantization (CASQ), a novel sketch-based gradient quantization method for SGD with convergence guarantees. CASQ models the nonuniform distribution of gradients via clustering, and adaptively allocates hash buckets to compress gradients based on the statistics of each cluster. Extensive evaluation shows that, compared with existing quantization methods, CASQ-based SGD (i) achieves the same validation accuracy while reducing the quantization level from 3 bits to 2 bits, and (ii) shortens training time to convergence by up to 43% for the same training loss.
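To make the idea concrete, the following is a minimal NumPy sketch of cluster-aware sketch compression. It is an illustrative toy, not the authors' actual CASQ algorithm: magnitude clusters are formed by simple quantile binning (standing in for clustering), each cluster receives a count-sketch bucket budget proportional to its share of the total variance, and each cluster is compressed and reconstructed independently. The function names (`count_sketch_compress`, `casq_like_compress`) and the variance-proportional budget rule are assumptions for illustration.

```python
import numpy as np

def count_sketch_compress(values, num_buckets, seed=0):
    """Compress a vector slice into num_buckets counters using a
    single-row count sketch: random bucket hash plus random sign."""
    rng = np.random.default_rng(seed)
    bucket = rng.integers(0, num_buckets, size=values.size)  # h(i)
    sign = rng.choice([-1.0, 1.0], size=values.size)         # s(i)
    sketch = np.zeros(num_buckets)
    np.add.at(sketch, bucket, sign * values)                 # accumulate collisions
    return sketch, bucket, sign

def count_sketch_decompress(sketch, bucket, sign):
    """Estimate each coordinate from its bucket counter."""
    return sign * sketch[bucket]

def casq_like_compress(grad, num_clusters=3, total_buckets=64, seed=0):
    """Toy cluster-aware compression: bin coordinates by magnitude,
    then split the bucket budget across bins in proportion to each
    bin's variance, so spread-out clusters get finer sketches."""
    mag = np.abs(grad)
    edges = np.quantile(mag, np.linspace(0.0, 1.0, num_clusters + 1))
    edges[-1] += 1e-12  # make the top edge inclusive
    cluster = np.clip(np.searchsorted(edges, mag, side="right") - 1,
                      0, num_clusters - 1)
    # per-cluster variance drives the bucket allocation
    var = np.array([grad[cluster == c].var() if np.any(cluster == c) else 0.0
                    for c in range(num_clusters)]) + 1e-12
    budgets = np.maximum(1, (total_buckets * var / var.sum()).astype(int))
    recon = np.zeros_like(grad)
    for c in range(num_clusters):
        idx = np.nonzero(cluster == c)[0]
        if idx.size == 0:
            continue
        sk, b, s = count_sketch_compress(grad[idx], budgets[c], seed + c)
        recon[idx] = count_sketch_decompress(sk, b, s)
    return recon, budgets
```

A worker would send only the per-cluster sketches (plus the shared hash seeds) instead of the full gradient; the design choice illustrated here is that buckets are spent where the gradient statistics indicate the most variance, rather than uniformly.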
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62025208, 61972409) and the National Key Research and Development Program of China (Grant No. 2021YFB0301200).
Cite this article
Ge, K., Zhang, Y., Fu, Y. et al. Accelerate distributed deep learning with cluster-aware sketch quantization. Sci. China Inf. Sci. 66, 162102 (2023). https://doi.org/10.1007/s11432-021-3532-8