
Optimizing I/O forwarding techniques for extreme-scale event tracing

Cluster Computing

Abstract

Program development tools are a vital component for understanding the behavior of parallel applications. Event tracing is a principal ingredient of these tools, but new and serious challenges place event tracing at risk on extreme-scale machines. As the quantity of captured events increases with concurrency, the additional data can overload the parallel file system and perturb the application being observed. In this work we present a solution for event tracing on extreme-scale machines. We enhance an I/O forwarding software layer to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system. Furthermore, we introduce a write buffering capability that limits the impact of trace I/O on the running application. To validate the approach, we employ the Vampir tracing toolset using these new capabilities. Our results demonstrate that the approach increases the maximum traced application size by 5×, to more than 200,000 processes.
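To make the idea concrete, the following minimal C sketch illustrates the general pattern described above: many per-process event streams are funneled through an aggregation buffer that reorganizes records and issues a few large writes to one shared file, rather than every process writing its own small file. This is an illustration only, not the IOFSL or VampirTrace implementation; all names (event_rec, aggregator, agg_append, AGG_BUF_SIZE) are hypothetical.

    /* Illustrative sketch only -- not the actual I/O forwarding middleware. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    #define AGG_BUF_SIZE (1u << 20)   /* 1 MB = 2^20 B aggregation buffer */

    typedef struct {                  /* one trace event record */
        uint32_t stream_id;           /* originating process/thread stream */
        uint64_t timestamp;
        uint32_t event_id;
    } event_rec;

    typedef struct {                  /* per-forwarding-server aggregator */
        unsigned char buf[AGG_BUF_SIZE];
        size_t        used;
        FILE         *out;            /* single shared output file */
    } aggregator;

    /* Flush the buffered, aggregated records with one large write. */
    static void agg_flush(aggregator *a)
    {
        if (a->used > 0) {
            fwrite(a->buf, 1, a->used, a->out);
            a->used = 0;
        }
    }

    /* Append one record; flush only when the buffer is full, so the file
     * system sees few large writes instead of many small ones. */
    static void agg_append(aggregator *a, const event_rec *ev)
    {
        if (a->used + sizeof *ev > AGG_BUF_SIZE)
            agg_flush(a);
        memcpy(a->buf + a->used, ev, sizeof *ev);
        a->used += sizeof *ev;
    }

    int main(void)
    {
        aggregator agg = { .used = 0, .out = fopen("aggregated_trace.bin", "wb") };
        if (!agg.out) { perror("fopen"); return EXIT_FAILURE; }

        /* Simulate 4 client streams, each producing 100,000 events. */
        for (uint32_t stream = 0; stream < 4; stream++) {
            for (uint64_t t = 0; t < 100000; t++) {
                event_rec ev = { .stream_id = stream, .timestamp = t, .event_id = 1 };
                agg_append(&agg, &ev);
            }
        }
        agg_flush(&agg);
        fclose(agg.out);
        return EXIT_SUCCESS;
    }

A real forwarding layer would receive such records over the network from many compute nodes and manage buffers on dedicated I/O nodes, but the core pattern of buffering and issuing large, aggregated writes is the same.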



Notes

  1. In this paper, we use 1 MB = 2^20 B, 1 GB = 2^30 B, and 1 TB = 2^40 B (these conventions are illustrated in the sketch after these notes).

  2. An OTF stream abstracts the events from a single process or thread.

  3. 10,752 event files, one definitions file, and one control file.
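As a hedged illustration of the conventions in notes 1 and 2 (binary units, and one stream holding the events of a single process or thread), the following C sketch uses hypothetical names (otf_stream_info and the MB/GB/TB macros); it does not reflect the actual OTF data structures or API.

    /* Hypothetical sketch of the conventions in the notes above. */
    #include <stdint.h>
    #include <stdio.h>

    #define MB ((uint64_t)1 << 20)   /* 1 MB = 2^20 B */
    #define GB ((uint64_t)1 << 30)   /* 1 GB = 2^30 B */
    #define TB ((uint64_t)1 << 40)   /* 1 TB = 2^40 B */

    typedef struct {
        uint32_t process_id;   /* process (or thread) that owns this stream */
        uint32_t thread_id;
        uint64_t num_events;   /* events recorded for this stream only */
    } otf_stream_info;

    int main(void)
    {
        otf_stream_info s = { .process_id = 42, .thread_id = 0, .num_events = 3 * MB };
        printf("stream %u.%u: %llu events (1 GB = %llu B)\n",
               s.process_id, s.thread_id,
               (unsigned long long)s.num_events, (unsigned long long)GB);
        return 0;
    }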


Acknowledgements

We thank Ramanan Sankaran (ORNL) for providing a working version of S3D as well as a benchmark problem set for JaguarPF. We are grateful to Matthias Jurenz for his assistance on VampirTrace as well as Matthias Weber and Ronald Geisler for their support for Vampir. The IOFSL project is supported by the DOE Office of Science and National Nuclear Security Administration (NNSA). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory and the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which are supported by the Office of Science of the U.S. Department of Energy under contracts DE-AC02-06CH11357 and DE-AC05-00OR22725, respectively. This work was supported in part by the National Science Foundation (NSF) through NSF-0937928 and NSF-0724599. This work is supported in part by the German Research Foundation (DFG) within the Collaborative Research Center 912 “Highly Adaptive Energy-Efficient Computing”.

The general enhancement of the VampirTrace and Vampir tools at TU Dresden for full-size runs on large-scale HPC systems is supported with funding from, and in cooperation with, ORNL and UT-Battelle.

Author information

Correspondence to Thomas Ilsche.

Cite this article

Ilsche, T., Schuchart, J., Cope, J. et al. Optimizing I/O forwarding techniques for extreme-scale event tracing. Cluster Comput 17, 1–18 (2014). https://doi.org/10.1007/s10586-013-0272-9
