
Benchmark assessment for the DeepSpeed acceleration library on image classification


Deep neural networks have shown remarkable performance on a wide range of classification tasks and applications. However, large model sizes and enormous training datasets make the training process slow and often limited by available computing resources. To overcome this limitation, distributed training can accelerate the process by using multiple devices to train a single model. In this work, we evaluate the performance of Microsoft DeepSpeed, a distributed training library, on image classification tasks by comparing 108 trained neural networks across 27 unique settings. Our experimental results suggest that DeepSpeed may provide limited benefits for simpler learning tasks (e.g., smaller neural network models or simpler datasets). For more complex learning tasks, however, DeepSpeed can provide up to 8× faster training with possible performance improvements. Our study contributes to a better understanding of the capabilities and limitations of the DeepSpeed library, offering insights into when and where it may be most beneficial in image classification settings.
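For orientation, DeepSpeed is driven by a JSON configuration file that controls batching, mixed precision, and memory optimizations. The sketch below shows the general shape of such a file; the specific values (batch size, learning rate, ZeRO stage) are illustrative assumptions, not the settings used in this study:

```json
{
  "train_batch_size": 256,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 },
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 0.001 }
  }
}
```

A configuration like this is typically passed to DeepSpeed's `deepspeed.initialize(...)` call, which wraps a PyTorch model and optimizer for multi-device training.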



Data availability

The datasets used in this study are publicly available.










Funding

No funding was received for this study.

Author information

Authors and Affiliations



All authors contributed to the paper through code, experiments, or writing.

Corresponding author

Correspondence to Izzat Alsmadi.

Ethics declarations

Conflict of interest

The authors declare none.

Informed consent

The authors declare none.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liang, G., Atoum, M.S., Xing, X. et al. Benchmark assessment for the DeepSpeed acceleration library on image classification. Cluster Comput (2023).
