Abstract
Deep Learning (DL) applications have become an essential tool for analyzing and making predictions over big data in many areas. However, DL applications place heavy input/output (I/O) loads on computer systems: when running on distributed or distributed-memory parallel systems, they must read large amounts of data during the training stage. Their inherently parallel execution and persistent file accesses can easily overwhelm traditional shared file systems and degrade application performance. Managing these applications is therefore a constant challenge, given their growing popularity on HPC systems, which have traditionally been built and optimized for scientific applications and simulators. It is thus essential to identify the key factors involved in the I/O of a DL application in order to find the most appropriate configuration and minimize the impact of I/O on performance. In this work, we analyze the patterns generated by I/O operations during the training stage of distributed deep learning applications, using two well-known datasets, CIFAR and MNIST, to characterize file access patterns.
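As background for why training generates non-sequential reads: data loaders in frameworks such as PyTorch and TensorFlow typically draw a fresh random permutation of sample indices every epoch, so the order in which dataset records (and hence file offsets) are read changes from epoch to epoch. The sketch below is a minimal plain-Python illustration of that mechanism; the function name and parameters are illustrative, not part of any framework's API.

```python
import random

def epoch_access_pattern(num_samples, batch_size, seed):
    """Simulate the sample-access order of one shuffled training epoch:
    indices are randomly permuted, then grouped into mini-batches, so
    reads against the dataset file are non-sequential."""
    indices = list(range(num_samples))
    rng = random.Random(seed)
    rng.shuffle(indices)  # new permutation per epoch -> new read order
    return [indices[i:i + batch_size]
            for i in range(0, num_samples, batch_size)]

# Two epochs over a toy "dataset" of 8 samples, batch size 4: each epoch
# covers every sample exactly once, but in a (likely) different order.
e1 = epoch_access_pattern(8, 4, seed=1)
e2 = epoch_access_pattern(8, 4, seed=2)
```

Tracing such index sequences against the byte offsets of the stored samples is one simple way to reproduce the kind of access-pattern analysis the paper performs.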
This publication is supported under contract PID2020-112496GB-I00, funded by the Agencia Estatal de Investigación (AEI), Spain, and the Fondo Europeo de Desarrollo Regional (FEDER), EU, and partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG).
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Parraga, E., Leon, B., Mendez, S., Rexachs, D., Luque, E. (2022). File Access Patterns of Distributed Deep Learning Applications. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L., De Giusti, A. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2022. Communications in Computer and Information Science, vol 1634. Springer, Cham. https://doi.org/10.1007/978-3-031-14599-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14598-8
Online ISBN: 978-3-031-14599-5