Abstract
Training large neural networks requires enormous clusters of machines. Model-parallel training, in which the model architecture is partitioned sequentially across workers, is a popular approach for training modern models. Since communication between workers is often the bottleneck in such systems, compressing the transmitted information can reduce communication time. This work explores how simultaneously compressing activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and we also experiment with error compensation techniques. Moreover, we combine TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression than activations do. We observe that \(K = 10\% \) is the lowest TopK compression level that does not severely harm model convergence. Experiments also show that models trained with TopK perform well only when compression is applied during inference as well. We find that error feedback techniques do not improve model-parallel training over plain compression, but they allow model inference without compression with almost no quality drop. Finally, when combined with AQ-SGD, TopK compression stronger than \(K = 30\% \) significantly degrades model performance.
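To make the compression scheme concrete, the following is a minimal PyTorch-style sketch of TopK sparsification combined with classical error feedback. It is an illustration under assumed names (topk_compress and ErrorFeedbackTopK are ours), not the implementation from the repository linked in the Notes; in particular, AQ-SGD’s per-batch error feedback keeps a separate buffer per training batch, which is more involved than the single residual shown here.

    import torch

    def topk_compress(x: torch.Tensor, k_ratio: float = 0.1) -> torch.Tensor:
        # Keep the k_ratio largest-magnitude entries of x; zero out the rest.
        flat = x.flatten()
        k = max(1, int(k_ratio * flat.numel()))
        _, idx = torch.topk(flat.abs(), k)
        out = torch.zeros_like(flat)
        out[idx] = flat[idx]
        return out.view_as(x)

    class ErrorFeedbackTopK:
        # TopK with classical error feedback: the part of the tensor lost
        # to compression is stored locally and added back next time.
        def __init__(self, k_ratio: float = 0.1):
            self.k_ratio = k_ratio
            self.residual = None

        def __call__(self, x: torch.Tensor) -> torch.Tensor:
            if self.residual is None:
                self.residual = torch.zeros_like(x)
            corrected = x + self.residual              # reinject past error
            compressed = topk_compress(corrected, self.k_ratio)
            self.residual = corrected - compressed     # store new error
            return compressed                          # transmit this tensor

In a model-parallel pipeline such a compressor would be applied to the activations sent forward and to the gradients sent backward between stages; with k_ratio = 0.1 it corresponds to the \(K = 10\% \) level discussed above.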
Notes
https://github.com/Glemhel/ActivationsGradientsCompressionForMP
https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
REFERENCES
OpenAI, “GPT-4 technical report” (2023). https://doi.org/10.48550/arXiv.2303.08774
T. L. Scao, A. Fan, C. Akiki, et al., “BLOOM: A 176B-parameter open-access multilingual language model” (2022). https://doi.org/10.48550/arXiv.2211.05100
H. Laurençon, L. Saulnier, T. Wang, et al., “The BigScience ROOTS corpus: A 1.6 TB composite multilingual dataset,” in Proceedings of the 36th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).
J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Comput. Surv. 53 (2), 1–33 (2020).
A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks” (2014). https://doi.org/10.48550/arXiv.1404.5997
C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, “Measuring the effects of data parallelism on neural network training” (2018). https://doi.org/10.48550/arXiv.1811.03600
A. Sergeev and M. Del Balso, “Horovod: Fast and easy distributed deep learning in TensorFlow” (2018). https://doi.org/10.48550/arXiv.1802.05799
S. Li, Y. Zhao, R. Varma, et al., “PyTorch distributed: Experiences on accelerating data parallel training” (2020). https://doi.org/10.14778/3415478.3415530
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism” (2019). https://doi.org/10.48550/arXiv.1909.08053
J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020), pp. 3505–3506.
A. Borzunov, D. Baranchuk, T. Dettmers, et al., “Petals: Collaborative inference and fine-tuning of large models” (2022). https://doi.org/10.48550/arXiv.2209.01188
Y. Huang, Y. Cheng, A. Bapna, et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Advances in Neural Information Processing Systems (2019), Vol. 32.
L. Guan, W. Yin, D. Li, and X. Lu, “XPipe: Efficient pipeline model parallelism for multi-GPU DNN training” (2019). https://doi.org/10.48550/arXiv.1911.04610
A. Harlap, D. Narayanan, A. Phanishayee, et al., “PipeDream: Fast and efficient pipeline parallel DNN training” (2018). https://doi.org/10.48550/arXiv.1806.03377
M. Diskin, A. Bukhtiyarov, M. Ryabinin, et al., “Distributed deep learning in open collaborations,” Adv. Neural Inf. Process. Syst. 34, 7879–7897 (2021).
F. Fu, Y. Hu, Y. He, et al., “Don’t waste your bits! Squeeze activations and gradients for deep neural networks via TinyScript,” in International Conference on Machine Learning, PMLR (2020), pp. 3304–3314.
R. D. Evans and T. Aamodt, “AC-GC: Lossy activation compression with guaranteed convergence,” Adv. Neural Inf. Process. Syst. 34, 27434–27448 (2021).
T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “GPT3.int8(): 8-bit matrix multiplication for transformers at scale,” Adv. Neural Inf. Process. Syst. 35, 30318–30332 (2022).
J. Song, J. Yim, J. Jung, et al., “Optimus-CC: Efficient large NLP model training with 3D parallelism aware communication compression,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2023), Vol. 2, pp. 560–573.
S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems (2018), Vol. 31.
A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan, “On biased compression for distributed learning” (2020). https://doi.org/10.48550/arXiv.2002.12410
S. Bian, D. Li, H. Wang, E. P. Xing, and S. Venkataraman, “Does compressing activations help model parallel training?” (2023). https://doi.org/10.48550/arXiv.2301.02654
V. Gupta, D. Choudhary, P. Tang, et al., “Training recommender systems at scale: Communication-efficient model and data parallelism,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (2021), pp. 2928–2936.
J. Wang, B. Yuan, L. Rimanic, et al., “Fine-tuning language models over slow networks using activation quantization with guarantees,” Adv. Neural Inf. Process. Syst. 35, 19215–19230 (2022).
F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Proceedings of the 15th Annual Conference of the International Speech Communication Association (2014).
K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik, “Distributed learning with compressed gradient differences” (2019). https://doi.org/10.48550/arXiv.1901.09269
P. Richtárik, I. Sokolov, and I. Fatkhullin, “EF21: A new, simpler, theoretically better, and practically faster error feedback,” Adv. Neural Inf. Process. Syst. 34, 4384–4396 (2021).
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images” (2009). https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog 1 (8), 9 (2019).
S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in International Conference on Learning Representations (2016).
J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimization for non-convex problems,” in International Conference on Machine Learning, PMLR (2018), pp. 560–569.
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems (2017), Vol. 30.
S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding” (2015). https://doi.org/10.48550/arXiv.1510.00149
C. Hong, H. Kim, S. Baik, J. Oh, and K. M. Lee, “DAQ: Channel-wise distribution-aware quantization for deep image super-resolution networks,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2022), pp. 2675–2684.
H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, “ATOMO: Communication-efficient learning via atomic sparsification,” in Advances in Neural Information Processing Systems (2018), Vol. 31.
T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical low-rank gradient compression for distributed optimization,” in Advances in Neural Information Processing Systems (2019), Vol. 32.
E. Gorbunov, K. P. Burlachenko, Z. Li, and P. Richtárik, “MARINA: Faster non-convex distributed learning with compression,” in International Conference on Machine Learning, PMLR (2021), pp. 3788–3798.
ACKNOWLEDGMENTS
We deeply thank the most irreplaceable and the most secretive colleague from Yandex.Research for insightful discussions during our research. We also thank A. Antonov for providing computational resources for the project.
Funding
The research of A. Beznosikov was supported by the Russian Science Foundation (project no. 23-11-00229).
Ethics declarations
The authors of this work declare that they have no conflicts of interest.
Additional information
Best article on artificial intelligence and machine learning, AI Journey 2023.
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Rudakov, M.I., Beznosikov, A.N., Kholodov, Y.A. et al. Activations and Gradients Compression for Model-Parallel Training. Dokl. Math. 108 (Suppl 2), S272–S281 (2023). https://doi.org/10.1134/S1064562423701314