Abstract
Program development tools are a vital component for understanding the behavior of parallel applications. Event tracing is a principal ingredient of these tools, but new and serious challenges place event tracing at risk on extreme-scale machines. As the quantity of captured events increases with concurrency, the additional data can overload the parallel file system and perturb the application being observed. In this work we present a solution for event tracing on extreme-scale machines. We enhance an I/O forwarding software layer to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system. Furthermore, we introduce a sophisticated write buffering capability to limit the impact of tracing on the traced application. To validate the approach, we employ the Vampir tracing toolset using these new capabilities. Our results demonstrate that the approach increases the maximum traced application size by a factor of 5×, to more than 200,000 processes.
Notes
In this paper, we use 1 MB = 2^20 B, 1 GB = 2^30 B, and 1 TB = 2^40 B.
An OTF stream abstracts the events from a single process or thread.
10,752 event files, one definitions file, and one control file.
Acknowledgements
We thank Ramanan Sankaran (ORNL) for providing a working version of S3D as well as a benchmark problem set for JaguarPF. We are grateful to Matthias Jurenz for his assistance with VampirTrace, and to Matthias Weber and Ronald Geisler for their support for Vampir. The IOFSL project is supported by the DOE Office of Science and the National Nuclear Security Administration (NNSA). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory and the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which are supported by the Office of Science of the U.S. Department of Energy under contracts DE-AC02-06CH11357 and DE-AC05-00OR22725, respectively. This work was supported in part by the National Science Foundation (NSF) through NSF-0937928 and NSF-0724599. This work was also supported in part by the German Research Foundation (DFG) through the Collaborative Research Center 912 “Highly Adaptive Energy-Efficient Computing”.
The general enhancement of the VampirTrace and Vampir tools at TU Dresden for full-size runs on large-scale HPC systems is supported with funding and cooperation by ORNL and UT-Battelle.
Cite this article
Ilsche, T., Schuchart, J., Cope, J. et al. Optimizing I/O forwarding techniques for extreme-scale event tracing. Cluster Comput 17, 1–18 (2014). https://doi.org/10.1007/s10586-013-0272-9