Cluster Computing, Volume 17, Issue 1, pp 1–18

Optimizing I/O forwarding techniques for extreme-scale event tracing

  • Thomas Ilsche
  • Joseph Schuchart
  • Jason Cope
  • Dries Kimpe
  • Terry Jones
  • Andreas Knüpfer
  • Kamil Iskra
  • Robert Ross
  • Wolfgang E. Nagel
  • Stephen Poole


Program development tools are a vital component for understanding the behavior of parallel applications. Event tracing is a principal ingredient of these tools, but new and serious challenges place event tracing at risk on extreme-scale machines. As the quantity of captured events increases with concurrency, the additional data can overload the parallel file system and perturb the application being observed. In this work we present a solution for event tracing on extreme-scale machines. We enhance an I/O forwarding software layer to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system. Furthermore, we introduce a sophisticated write buffering capability to limit the impact of tracing on the observed application. To validate the approach, we employ the Vampir tracing toolset with these new capabilities. Our results demonstrate that the approach increases the maximum traced application size by a factor of 5×, to more than 200,000 processes.


Keywords: Event tracing · I/O forwarding · Atomic append



We thank Ramanan Sankaran (ORNL) for providing a working version of S3D as well as a benchmark problem set for JaguarPF. We are grateful to Matthias Jurenz for his assistance on VampirTrace as well as Matthias Weber and Ronald Geisler for their support for Vampir. The IOFSL project is supported by the DOE Office of Science and National Nuclear Security Administration (NNSA). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory and the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which are supported by the Office of Science of the U.S. Department of Energy under contracts DE-AC02-06CH11357 and DE-AC05-00OR22725, respectively. This work was supported in part by the National Science Foundation (NSF) through NSF-0937928 and NSF-0724599. This work is supported in part by the German Research Foundation (DFG) in the Collaborative Research Center 912 “Highly Adaptive Energy-Efficient Computing”.

The general enhancement of the VampirTrace and Vampir tools at TU Dresden for full-size runs on large-scale HPC systems is supported with funding and cooperation by ORNL and UT-Battelle.



Copyright information

© Springer Science + Business Media New York (outside the USA) 2013

Authors and Affiliations

  • Thomas Ilsche (1), Email author
  • Joseph Schuchart (3)
  • Jason Cope (2)
  • Dries Kimpe (2)
  • Terry Jones (3)
  • Andreas Knüpfer (1)
  • Kamil Iskra (2)
  • Robert Ross (2)
  • Wolfgang E. Nagel (1)
  • Stephen Poole (3)

  1. Technische Universität Dresden (ZIH), Dresden, Germany
  2. Argonne National Laboratory, Argonne, USA
  3. Oak Ridge National Laboratory, Oak Ridge, USA
