Abstract
Training large neural networks requires enormous clusters of machines. Model-parallel training, in which the model architecture is partitioned sequentially across workers, is a popular approach for training modern models. Since communication between workers is often the bottleneck in such systems, compressing the transmitted information can reduce communication time. This work explores how simultaneously compressing activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and we also experiment with error compensation techniques. Moreover, we combine TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression than activations do. We observe that \(K = 10\% \) is the lowest TopK compression level that does not severely harm model convergence. Experiments also show that models trained with TopK perform well only when compression is applied during inference as well. We find that error feedback techniques do not improve model-parallel training over plain compression, but they allow model inference without compression with almost no quality drop. Finally, when combined with AQ-SGD, TopK compression stronger than \(K = 30\% \) significantly degrades model performance.
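To make the compression scheme concrete, the following is a minimal PyTorch-style sketch of TopK sparsification combined with classical error feedback. It is an illustration under assumed names (topk_compress and ErrorFeedbackTopK are ours), not the implementation from the repository linked in the Notes; in particular, AQ-SGD’s per-batch error feedback keeps a separate buffer per training batch, which is more involved than the single residual shown here.

    import torch

    def topk_compress(x: torch.Tensor, k_ratio: float = 0.1) -> torch.Tensor:
        # Keep the k_ratio largest-magnitude entries of x; zero out the rest.
        flat = x.flatten()
        k = max(1, int(k_ratio * flat.numel()))
        _, idx = torch.topk(flat.abs(), k)
        out = torch.zeros_like(flat)
        out[idx] = flat[idx]
        return out.view_as(x)

    class ErrorFeedbackTopK:
        # TopK with classical error feedback: the part of the tensor lost
        # to compression is stored locally and added back next time.
        def __init__(self, k_ratio: float = 0.1):
            self.k_ratio = k_ratio
            self.residual = None

        def __call__(self, x: torch.Tensor) -> torch.Tensor:
            if self.residual is None:
                self.residual = torch.zeros_like(x)
            corrected = x + self.residual              # reinject past error
            compressed = topk_compress(corrected, self.k_ratio)
            self.residual = corrected - compressed     # store new error
            return compressed                          # transmit this tensor

In a model-parallel pipeline such a compressor would be applied to the activations sent forward and to the gradients sent backward between stages; with k_ratio = 0.1 it corresponds to the \(K = 10\% \) level discussed above.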
Notes
https://github.com/Glemhel/ActivationsGradientsCompressionForMP
https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
REFERENCES
OpenAI, “GPT-4 technical report” (2023). https://doi.org/10.48550/arXiv.2303.08774
T. L. Scao, A. Fan, C. Akiki, et al., “BLOOM: A 176B-parameter open-access multilingual language model” (2022). https://doi.org/10.48550/arXiv.2211.05100
H. Laurençon, L. Saulnier, T. Wang, et al., “The BigScience ROOTS corpus: A 1.6 TB composite multilingual dataset,” in Proceedings of the 36th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).
J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Comput. Surv. 53 (2), 1–33 (2020).
A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks” (2014). https://doi.org/10.48550/arXiv.1404.5997
C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, “Measuring the effects of data parallelism on neural network training” (2018). https://doi.org/10.48550/arXiv.1811.03600
A. Sergeev and M. Del Balso, “Horovod: Fast and easy distributed deep learning in TensorFlow” (2018). https://doi.org/10.48550/arXiv.1802.05799
S. Li, Y. Zhao, R. Varma, et al., “PyTorch distributed: Experiences on accelerating data parallel training” (2020). https://doi.org/10.14778/3415478.3415530
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism” (2019). https://doi.org/10.48550/arXiv.1909.08053
J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020), pp. 3505–3506.
A. Borzunov, D. Baranchuk, T. Dettmers, et al., “Petals: Collaborative inference and fine-tuning of large models” (2022). https://doi.org/10.48550/arXiv.2209.01188
Y. Huang, Y. Cheng, A. Bapna, et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Advances in Neural Information Processing Systems (2019), Vol. 32.
L. Guan, W. Yin, D. Li, and X. Lu, “XPipe: Efficient pipeline model parallelism for multi-GPU DNN training” (2019). https://doi.org/10.48550/arXiv.1911.04610
A. Harlap, D. Narayanan, A. Phanishayee, et al., “PipeDream: Fast and efficient pipeline parallel DNN training” (2018). https://doi.org/10.48550/arXiv.1806.03377
M. Diskin, A. Bukhtiyarov, M. Ryabinin, et al., “Distributed deep learning in open collaborations,” Adv. Neural Inf. Process. Syst. 34, 7879–7897 (2021).
F. Fu, Y. Hu, Y. He, et al., “Don’t waste your bits! Squeeze activations and gradients for deep neural networks via TinyScript,” in International Conference on Machine Learning, PMLR (2020), pp. 3304–3314.
R. D. Evans and T. Aamodt, “AC-GC: Lossy activation compression with guaranteed convergence,” Adv. Neural Inf. Process. Syst. 34, 27434–27448 (2021).
T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “GPT3.int8(): 8-bit matrix multiplication for transformers at scale,” Adv. Neural Inf. Process. Syst. 35, 30318–30332 (2022).
J. Song, J. Yim, J. Jung, et al., “Optimus-CC: Efficient large NLP model training with 3D parallelism aware communication compression,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2023), Vol. 2, pp. 560–573.
S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” in Advances in Neural Information Processing Systems (2018), Vol. 31.
A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan, “On biased compression for distributed learning” (2020). https://doi.org/10.48550/arXiv.2002.12410
S. Bian, D. Li, H. Wang, E. P. Xing, and S. Venkataraman, “Does compressing activations help model parallel training?” (2023). https://doi.org/10.48550/arXiv.2301.02654
V. Gupta, D. Choudhary, P. Tang, et al., “Training recommender systems at scale: Communication-efficient model and data parallelism,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (2021), pp. 2928–2936.
J. Wang, B. Yuan, L. Rimanic, et al., “Fine-tuning language models over slow networks using activation quantization with guarantees,” Adv. Neural Inf. Process. Syst. 35, 19215–19230 (2022).
F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Proceedings of the 15th Annual Conference of the International Speech Communication Association (2014).
K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik, “Distributed learning with compressed gradient differences” (2019). https://doi.org/10.48550/arXiv.1901.09269
P. Richtárik, I. Sokolov, and I. Fatkhullin, “EF21: A new, simpler, theoretically better, and practically faster error feedback,” Adv. Neural Inf. Process. Syst. 34, 4384–4396 (2021).
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images” (2009). https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog 1 (8), 9 (2019).
S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in International Conference on Learning Representations (2016).
J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimization for non-convex problems,” in International Conference on Machine Learning, PMLR (2018), pp. 560–569.
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems (2017), Vol. 30.
S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding” (2015). https://doi.org/10.48550/arXiv.1510.00149
C. Hong, H. Kim, S. Baik, J. Oh, and K. M. Lee, “DAQ: Channel-wise distribution-aware quantization for deep image super-resolution networks,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2022), pp. 2675–2684.
H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, “ATOMO: Communication-efficient learning via atomic sparsification,” in Advances in Neural Information Processing Systems (2018), Vol. 31.
T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical low-rank gradient compression for distributed optimization,” in Advances in Neural Information Processing Systems (2019), Vol. 32.
E. Gorbunov, K. P. Burlachenko, Z. Li, and P. Richtárik, “MARINA: Faster non-convex distributed learning with compression,” in International Conference on Machine Learning, PMLR (2021), pp. 3788–3798.
ACKNOWLEDGMENTS
We deeply thank the most irreplaceable and the most secretive colleague from Yandex.Research for insightful discussions during our research. We also thank A. Antonov for providing computational resources for the project.
Funding
The research of A. Beznosikov was supported by the Russian Science Foundation (project no. 23-11-00229).
Ethics declarations
The authors of this work declare that they have no conflicts of interest.
Additional information
Best article on artificial intelligence and machine learning, AI Journey 2023.
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Rudakov, M.I., Beznosikov, A.N., Kholodov, Y.A. et al. Activations and Gradients Compression for Model-Parallel Training. Dokl. Math. 108 (Suppl 2), S272–S281 (2023). https://doi.org/10.1134/S1064562423701314