Abstract
Memory usage is becoming an increasingly pressing bottleneck in the training process of Deep Neural Networks (DNNs), especially when training on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training setups partition the neural network over the GPUs in a way that favors training throughput over memory usage and, consequently, over the maximum trainable network size.
We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It evenly distributes deep learning models over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables training of neural networks that are 1.55 times larger, in terms of the number of parameters, than those trainable with existing partitioning solutions.
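To make the partitioning objective concrete, the sketch below shows a brute-force, memory-balanced split of a layer sequence into contiguous pipeline stages: given per-layer peak-memory estimates (such as those an incremental profiling pass could produce), it picks the split whose worst per-stage memory sum is smallest. This is only an illustration of the min-max balancing idea under assumed inputs; the function name and example profile are hypothetical, and mCAP's actual predictor is model-based rather than a simple sum of per-layer figures.

```python
# Minimal sketch of memory-balanced pipeline partitioning (not the authors' code).
# Assumption: `layer_mem` holds per-layer peak-memory estimates, e.g. obtained by
# profiling each layer's forward/backward pass in isolation.

from itertools import combinations


def balance_partition(layer_mem, num_stages):
    """Split consecutive layers into `num_stages` contiguous stages so that the
    largest per-stage memory sum (a stand-in for per-device peak usage) is minimal.

    Returns (best_cuts, best_peak), where `best_cuts` are the layer indices at
    which the layer list is split.
    """
    n = len(layer_mem)
    best_cuts, best_peak = None, float("inf")
    # Enumerate all placements of num_stages - 1 cut points between layers.
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = (0, *cuts, n)
        peak = max(sum(layer_mem[a:b]) for a, b in zip(bounds, bounds[1:]))
        if peak < best_peak:
            best_cuts, best_peak = cuts, peak
    return best_cuts, best_peak


if __name__ == "__main__":
    # Hypothetical per-layer peak-memory profile (GB) for an 8-layer model on 4 GPUs.
    layer_mem = [1.2, 0.8, 2.5, 1.1, 3.0, 0.9, 1.4, 0.7]
    cuts, peak = balance_partition(layer_mem, num_stages=4)
    print(f"cut after layers {cuts}, worst-stage memory = {peak:.1f} GB")
```

The exhaustive search is only practical for models with tens of layers; a production partitioner would use dynamic programming or binary search over the peak value, but the balancing criterion stays the same.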
References
Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI, pp. 265–283 (2016)
Awan, A.A., et al.: OC-DNN: exploiting advanced unified memory capabilities in CUDA 9 and Volta GPUs for out-of-core DNN training. In: HiPC, pp. 143–152. IEEE (2018)
Deng, J., et al.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255. IEEE (2009)
Dreuning, H., et al.: Artifact and instructions to generate experimental results for Euro-Par 2022 paper: mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training (2022). https://doi.org/10.6084/m9.figshare.20000960
Fan, S., et al.: DAPPLE: a pipelined data parallel approach for training large models. In: PPoPP, pp. 431–445 (2021)
Hara, K., et al.: Learning spatio-temporal features with 3D residual networks for action recognition. In: ICCV Workshops, pp. 3154–3160 (2017)
Huang, Y., et al.: GPipe: efficient training of giant neural networks using pipeline parallelism. In: NeurIPS, pp. 103–112 (2019)
Jansen, M., et al.: DDLBench: towards a scalable benchmarking infrastructure for distributed deep learning. In: 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), pp. 31–39. IEEE (2020)
Kim, C., et al.: Torchgpipe: on-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910 (2020)
Mittal, S., Vaishay, S.: A survey of techniques for optimizing deep learning on GPUs. J. Syst. Archit. 99, 101635 (2019)
Narayanan, D., et al.: PipeDream: generalized pipeline parallelism for DNN training. In: SOSP, pp. 1–15 (2019)
Narayanan, D., et al.: Efficient large-scale language model training on GPU clusters using Megatron-LM. In: SC21, pp. 1–15 (2021)
Narayanan, D., et al.: Memory-efficient pipeline-parallel DNN training. In: ICML, pp. 7937–7947. PMLR (2021)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8026–8037 (2019)
Pinckaers, H., Litjens, G.: Training convolutional neural networks with megapixel images. arXiv preprint arXiv:1804.05712 (2018)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)
Rajbhandari, S., et al.: ZeRO: memory optimizations toward training trillion parameter models. In: SC20, pp. 1–16. IEEE (2020)
Real, E., et al.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4780–4789 (2019)
Rhu, M., et al.: vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design. In: MICRO, pp. 1–13. IEEE (2016)
Shazeer, N., et al.: Mesh-Tensorflow: deep learning for supercomputers. In: NeurIPS, pp. 10414–10423 (2018)
Shoeybi, M., et al.: Megatron-LM: training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053 (2019)
Shriram, S., et al.: Dynamic memory management for GPU-based training of deep neural networks. In: IPDPS, pp. 200–209. IEEE (2019)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Tanaka, M., et al.: Automatic graph partitioning for very large-scale deep learning. In: IPDPS, pp. 1004–1013. IEEE (2021)
Yang, B., et al.: PipeMare: asynchronous pipeline parallel DNN training. MLSys 3, 269–296 (2021)
Zhang, J., et al.: Efficient memory management for GPU-based deep learning systems. arXiv preprint arXiv:1903.06631 (2019)
Acknowledgements and Data Availability Statement
We would like to thank the anonymous reviewers for their valuable feedback. This work is part of the Efficient Deep Learning (EDL) programme (grant number P16-25), financed by the Dutch Research Council (NWO). This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative. The datasets generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.20000960 [4].
Cite this paper
Dreuning, H., Bal, H.E., Nieuwpoort, R.V.v. (2022). mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham. https://doi.org/10.1007/978-3-031-12597-3_10