mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training

  • Conference paper
Euro-Par 2022: Parallel Processing (Euro-Par 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13440)

Abstract

Memory usage is becoming an increasingly pressing bottleneck in the training process of Deep Neural Networks (DNNs), especially when training on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training setups partition the neural network over the GPUs in a way that favors training throughput over memory usage, and thus over the maximum trainable network size.

We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It evenly distributes deep learning models over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that, in terms of the number of parameters, our approach enables training of neural networks that are 1.55 times larger than those supported by existing partitioning solutions.
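
As a rough illustration of the kind of decision such a predictor makes (this is a minimal sketch, not the paper's actual algorithm), the code below balances peak memory across pipeline stages under a toy cost model: each layer is reduced to a single, hypothetical peak-memory figure of the sort an incremental profiling pass would supply, and the layer chain is cut into a fixed number of consecutive stages so that the largest per-stage total is minimized. mCAP's real predictor models memory in more detail (weights, activations of in-flight micro-batches, optimizer state), so the function names and numbers here are illustrative only.

```python
# Sketch only: minimize the maximum per-stage memory when splitting a layer
# chain into consecutive pipeline stages. Assumes each layer's peak memory is
# a single known number (hypothetical profiling output), which is a strong
# simplification of the memory model described in the paper.

from typing import List, Tuple


def fits(per_layer_mem: List[int], num_stages: int, cap: int) -> bool:
    """Can the layer chain be cut into at most `num_stages` stages of <= `cap` each?"""
    stages_used, current = 1, 0
    for mem in per_layer_mem:
        if mem > cap:
            return False
        if current + mem > cap:
            stages_used += 1
            current = 0
        current += mem
    return stages_used <= num_stages


def balance_partition(per_layer_mem: List[int], num_stages: int) -> Tuple[int, List[List[int]]]:
    """Binary-search the smallest feasible per-stage cap, then greedily cut the chain."""
    lo, hi = max(per_layer_mem), sum(per_layer_mem)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(per_layer_mem, num_stages, mid):
            hi = mid
        else:
            lo = mid + 1

    # Re-run the greedy packing with the optimal cap to recover stage boundaries.
    stages, used = [[]], 0
    for mem in per_layer_mem:
        if used + mem > lo and stages[-1]:
            stages.append([])
            used = 0
        stages[-1].append(mem)
        used += mem
    return lo, stages


if __name__ == "__main__":
    # Hypothetical per-layer peak-memory estimates (GiB), standing in for the
    # statistics an incremental profiling pass would report.
    layer_mem = [4, 7, 3, 6, 2, 8, 5, 4]
    cap, stages = balance_partition(layer_mem, num_stages=4)
    print(f"per-stage cap: {cap} GiB, stages: {stages}")
```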

References

  1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI, pp. 265–283 (2016)

  2. Awan, A.A., et al.: OC-DNN: exploiting advanced unified memory capabilities in CUDA 9 and Volta GPUs for out-of-core DNN training. In: HiPC, pp. 143–152. IEEE (2018)

  3. Deng, J., et al.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255. IEEE (2009)

  4. Dreuning, H., et al.: Artifact and instructions to generate experimental results for Euro-Par 2022 paper: mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training (2022). https://doi.org/10.6084/m9.figshare.20000960

  5. Fan, S., et al.: DAPPLE: a pipelined data parallel approach for training large models. In: PPoPP, pp. 431–445 (2021)

  6. Hara, K., et al.: Learning spatio-temporal features with 3D residual networks for action recognition. In: ICCV Workshops, pp. 3154–3160 (2017)

  7. Huang, Y., et al.: GPipe: efficient training of giant neural networks using pipeline parallelism. In: NeurIPS, pp. 103–112 (2019)

  8. Jansen, M., et al.: DDLBench: towards a scalable benchmarking infrastructure for distributed deep learning. In: 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), pp. 31–39. IEEE (2020)

  9. Kim, C., et al.: Torchgpipe: on-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910 (2020)

  10. Mittal, S., Vaishay, S.: A survey of techniques for optimizing deep learning on GPUs. J. Syst. Archit. 99, 101635 (2019)

  11. Narayanan, D., et al.: PipeDream: generalized pipeline parallelism for DNN training. In: SOSP, pp. 1–15 (2019)

  12. Narayanan, D., et al.: Efficient large-scale language model training on GPU clusters using Megatron-LM. In: SC21, pp. 1–15 (2021)

  13. Narayanan, D., et al.: Memory-efficient pipeline-parallel DNN training. In: ICML, pp. 7937–7947. PMLR (2021)

  14. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8026–8037 (2019)

  15. Pinckaers, H., Litjens, G.: Training convolutional neural networks with megapixel images. arXiv preprint arXiv:1804.05712 (2018)

  16. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)

  17. Rajbhandari, S., et al.: ZeRO: memory optimizations toward training trillion parameter models. In: SC20, pp. 1–16. IEEE (2020)

  18. Real, E., et al.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4780–4789 (2019)

  19. Rhu, M., et al.: vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design. In: MICRO, pp. 1–13. IEEE (2016)

  20. Shazeer, N., et al.: Mesh-TensorFlow: deep learning for supercomputers. In: NeurIPS, pp. 10414–10423 (2018)

  21. Shoeybi, M., et al.: Megatron-LM: training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  22. Shriram, S., et al.: Dynamic memory management for GPU-based training of deep neural networks. In: IPDPS, pp. 200–209. IEEE (2019)

  23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  24. Tanaka, M., et al.: Automatic graph partitioning for very large-scale deep learning. In: IPDPS, pp. 1004–1013. IEEE (2021)

  25. Yang, B., et al.: PipeMare: asynchronous pipeline parallel DNN training. MLSys 3, 269–296 (2021)

  26. Zhang, J., et al.: Efficient memory management for GPU-based deep learning systems. arXiv preprint arXiv:1903.06631 (2019)

Acknowledgements and Data Availability Statement

We would like to thank the anonymous reviewers for their valuable feedback. This work is part of the Efficient Deep Learning (EDL) programme (grant number P16-25), financed by the Dutch Research Council (NWO). This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative. The datasets generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.20000960 [4].

Author information

Corresponding author

Correspondence to Henk Dreuning.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Dreuning, H., Bal, H.E., Nieuwpoort, R.V.v. (2022). mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham. https://doi.org/10.1007/978-3-031-12597-3_10

  • DOI: https://doi.org/10.1007/978-3-031-12597-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12596-6

  • Online ISBN: 978-3-031-12597-3

  • eBook Packages: Computer Science, Computer Science (R0)
