Abstract
Deep Learning (DL) applications have become an essential tool for analyzing and making predictions over big data in many areas. However, DL applications place heavy input/output (I/O) loads on computer systems: when running on distributed or distributed-memory parallel systems, they must read large amounts of data during the training stage. Their inherently parallel execution and persistent file accesses can easily overwhelm traditional shared file systems and degrade application performance. Managing these applications is therefore a constant challenge, given their growing popularity on HPC systems, which have traditionally been built and optimized for scientific applications and simulators. It is thus essential to identify the key factors involved in the I/O of a DL application in order to find the most appropriate configuration and minimize the impact of I/O on performance. In this work, we analyze the patterns generated by I/O operations during the training stage of distributed deep learning applications, using two well-known datasets, CIFAR and MNIST, to characterize file access patterns.
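As background for why training generates non-sequential reads: data loaders in frameworks such as PyTorch and TensorFlow typically draw a fresh random permutation of sample indices every epoch, so the order in which dataset records (and hence file offsets) are read changes from epoch to epoch. The sketch below is a minimal plain-Python illustration of that mechanism; the function name and parameters are illustrative, not part of any framework's API.

```python
import random

def epoch_access_pattern(num_samples, batch_size, seed):
    """Simulate the sample-access order of one shuffled training epoch:
    indices are randomly permuted, then grouped into mini-batches, so
    reads against the dataset file are non-sequential."""
    indices = list(range(num_samples))
    rng = random.Random(seed)
    rng.shuffle(indices)  # new permutation per epoch -> new read order
    return [indices[i:i + batch_size]
            for i in range(0, num_samples, batch_size)]

# Two epochs over a toy "dataset" of 8 samples, batch size 4: each epoch
# covers every sample exactly once, but in a (likely) different order.
e1 = epoch_access_pattern(8, 4, seed=1)
e2 = epoch_access_pattern(8, 4, seed=2)
```

Tracing such index sequences against the byte offsets of the stored samples is one simple way to reproduce the kind of access-pattern analysis the paper performs.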
This publication is supported under contract PID2020-112496GB-I00, funded by the Agencia Estatal de Investigación (AEI), Spain, and the Fondo Europeo de Desarrollo Regional (FEDER), EU, and partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG).
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Parraga, E., Leon, B., Mendez, S., Rexachs, D., Luque, E. (2022). File Access Patterns of Distributed Deep Learning Applications. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L., De Giusti, A. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2022. Communications in Computer and Information Science, vol 1634. Springer, Cham. https://doi.org/10.1007/978-3-031-14599-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14598-8
Online ISBN: 978-3-031-14599-5