
File Access Patterns of Distributed Deep Learning Applications

  • Conference paper
  • Part of the proceedings: Cloud Computing, Big Data & Emerging Topics (JCC-BD&ET 2022)

Abstract

Deep Learning (DL) applications have become a common solution for analyzing and making predictions from big data in many areas. However, DL applications place heavy input/output (I/O) loads on computer systems. When running on distributed or distributed-memory parallel systems, these applications handle large volumes of data that must be read during the training stage. Their inherently parallel and distributed execution and persistent file accesses can easily overwhelm traditional shared file systems and degrade application performance. Managing these applications is therefore a constant challenge given their growing popularity on HPC systems, which have traditionally executed, and been tuned for, scientific applications and simulators. It is thus essential to identify the key factors involved in the I/O of a DL application in order to find the configuration that minimizes the impact of I/O on its performance. In this work, we present an analysis of the access patterns generated by I/O operations during the training stage of distributed deep learning applications. We use two well-known datasets, CIFAR and MNIST, to describe the file access patterns.

This publication is supported under contract PID2020-112496GB-I00, funded by the Agencia Estatal de Investigación (AEI), Spain, and the Fondo Europeo de Desarrollo Regional (FEDER), EU, and partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG).
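The abstract above refers to the file access patterns that the training stage generates over the dataset files. As an illustration of what such a pattern looks like, the following Python sketch replays one shuffled training epoch over the raw MNIST image file and records the offset and size of every read. This is only an illustrative example, not the instrumentation used by the authors: the local file name, the seed, and the helper epoch_access_log are assumptions made for the sketch, and the record layout follows the public MNIST IDX format.

    # Minimal sketch (not the authors' tooling): log the per-sample read pattern
    # that a shuffled training epoch generates over the raw MNIST image file.
    # Assumes the IDX file "train-images-idx3-ubyte" has been downloaded and
    # decompressed locally.
    import random
    import struct

    IMAGE_FILE = "train-images-idx3-ubyte"   # assumed local path
    HEADER_SIZE = 16                          # magic, count, rows, cols (4 x int32)

    def epoch_access_log(path: str, seed: int = 0):
        """Return the (offset, size) sequence of reads for one shuffled epoch."""
        log = []
        with open(path, "rb") as f:
            magic, count, rows, cols = struct.unpack(">4i", f.read(HEADER_SIZE))
            record_size = rows * cols          # one uint8 per pixel, 784 for MNIST
            order = list(range(count))
            random.Random(seed).shuffle(order) # DL loaders typically shuffle each epoch
            for idx in order:
                offset = HEADER_SIZE + idx * record_size
                f.seek(offset)
                data = f.read(record_size)     # the actual small, non-sequential read
                log.append((offset, len(data)))
        return log

    if __name__ == "__main__":
        accesses = epoch_access_log(IMAGE_FILE)
        print(f"reads: {len(accesses)}, request size: {accesses[0][1]} bytes")
        print("first offsets:", [off for off, _ in accesses[:5]])

Run against a local copy of the MNIST training-image file, the log shows many small (784-byte) reads at scattered offsets within the same file; issued concurrently by many training processes, this is the kind of workload that can overwhelm a shared file system.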



Author information


Corresponding author

Correspondence to Edixon Parraga.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Parraga, E., Leon, B., Mendez, S., Rexachs, D., Luque, E. (2022). File Access Patterns of Distributed Deep Learning Applications. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L., De Giusti, A. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2022. Communications in Computer and Information Science, vol 1634. Springer, Cham. https://doi.org/10.1007/978-3-031-14599-5_1


  • DOI: https://doi.org/10.1007/978-3-031-14599-5_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14598-8

  • Online ISBN: 978-3-031-14599-5

  • eBook Packages: Computer Science, Computer Science (R0)
