Activations and Gradients Compression for Model-Parallel Training

Doklady Mathematics

Abstract

Training large neural networks requires enormous clusters of machines. Model-parallel training, in which the model architecture is partitioned sequentially across workers, is a popular approach for training modern models. Since communication between workers is often a bottleneck in such systems, compressing the transmitted information can reduce communication time. This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we employ TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that \(K = 10\%\) is the lowest TopK compression level that does not severely harm model convergence. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but they allow model inference without compression with almost no quality drop. Finally, when applied together with the AQ-SGD approach, TopK compression stronger than \(K = 30\%\) significantly worsens model performance.
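To make the compression operators discussed above concrete, here is a minimal PyTorch sketch; it is not the authors' implementation, and the names topk_compress and ErrorFeedbackCompressor are hypothetical. It shows TopK sparsification with \(K = 10\%\) and a simple error feedback wrapper that stores the residual discarded by the compressor and adds it back to the next tensor before compressing.

import torch

def topk_compress(x: torch.Tensor, k_ratio: float = 0.10) -> torch.Tensor:
    # Keep the k_ratio fraction of entries with the largest magnitude; zero out the rest.
    flat = x.flatten()
    k = max(1, int(k_ratio * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(x)

class ErrorFeedbackCompressor:
    # Accumulates the compression error and re-injects it before the next call.
    # Assumes tensors of a fixed shape across calls.
    def __init__(self, k_ratio: float = 0.10):
        self.k_ratio = k_ratio
        self.residual = None

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        if self.residual is None:
            self.residual = torch.zeros_like(x)
        corrected = x + self.residual           # add back what was lost previously
        compressed = topk_compress(corrected, self.k_ratio)
        self.residual = corrected - compressed  # store the new compression error
        return compressed

In a pipeline-parallel setup, an operator like this would be applied to the activations sent forward and, with a milder compression level, to the gradients sent backward across the worker boundary; the AQ-SGD-based variant mentioned above applies this kind of error feedback per batch of activations.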


Notes

  1. https://github.com/Glemhel/ActivationsGradientsCompressionForMP.

  2. https://github.com/kuangliu/pytorch-cifar.

  3. https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py.


ACKNOWLEDGMENTS

We deeply thank the most irreplaceable and the most secretive colleague from Yandex.Research for insightful discussions during our research. We also thank A. Antonov for providing computational resources for the project.

Funding

The research of A. Beznosikov was supported by the Russian Science Foundation (project no. 23-11-00229).

Author information


Corresponding authors

Correspondence to M. I. Rudakov, A. N. Beznosikov, Ya. A. Kholodov or A. V. Gasnikov.

Ethics declarations

The authors of this work declare that they have no conflicts of interest.

Additional information

Best article on artificial intelligence and machine learning, AI Journey 2023

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Rudakov, M.I., Beznosikov, A.N., Kholodov, Y.A. et al. Activations and Gradients Compression for Model-Parallel Training. Dokl. Math. 108 (Suppl 2), S272–S281 (2023). https://doi.org/10.1134/S1064562423701314

