Multi-versioning Performance Opportunities in BGAS System for Resilience

  • Nan Dun
  • Dirk Pleiter
  • Aiman Fang
  • Nicolas Vandenbergen
  • Andrew A. Chien
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9697)

Abstract

Resilience has become a major concern in high-performance computing (HPC) systems. Addressing the increasing risk of latent errors (or silent data corruption) is one of the biggest challenges. Multi-version checkpointing, which keeps multiple versions of the application state, has been proposed as a solution and has been implemented in Global View Resilience (GVR). The more sophisticated data management that this requires introduces overheads, and the resulting impact on performance needs to be investigated. In this paper we explore the performance of GVR on an HPC system with integrated non-volatile memory, namely Blue Gene Active Storage (BGAS). Our empirical study shows that BGAS provides a significantly more efficient basis for flexible error recovery with GVR's multi-versioning features than a standard external storage system attached to the same Blue Gene/Q installation. In particular, BGAS achieves at least a \(10\times\) performance boost for random traversal across multiple versions, owing to its significantly better performance for small random I/O operations.
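
To illustrate why keeping several versions helps with latent errors, the following is a minimal sketch of a toy in-memory multi-version store. It is not the GVR API: the names `VersionedStore`, `checkpoint`, `restore_latest_valid`, and `run_demo`, and the NaN-based validity check, are all assumptions made for this example. The point it shows is that when corruption is detected only after further checkpoints have been taken, the most recent versions are already tainted and recovery has to walk back through older ones.

```python
import copy


class VersionedStore:
    """Toy multi-version checkpoint store (hypothetical, not the GVR API)."""

    def __init__(self, max_versions=4):
        self.max_versions = max_versions
        self._versions = []  # (version_id, snapshot) pairs, oldest first
        self._next_id = 0

    def checkpoint(self, state):
        """Store a deep copy of state as a new version and return its id."""
        vid = self._next_id
        self._next_id += 1
        self._versions.append((vid, copy.deepcopy(state)))
        # Keep only the newest max_versions snapshots.
        if len(self._versions) > self.max_versions:
            self._versions.pop(0)
        return vid

    def restore_latest_valid(self, is_valid):
        """Walk versions from newest to oldest; return the first valid one."""
        for vid, snapshot in reversed(self._versions):
            if is_valid(snapshot):
                return vid, copy.deepcopy(snapshot)
        raise LookupError("no retained version passes the validity check")


def run_demo():
    store = VersionedStore(max_versions=4)
    state = {"step": 0, "data": [1.0, 2.0, 3.0]}
    for step in range(1, 6):
        state["step"] = step
        if step == 4:
            # Latent error: the corruption is not noticed when it happens,
            # so it is captured by this checkpoint and the next one.
            state["data"][1] = float("nan")
        store.checkpoint(state)

    # Assumed detector for this sketch: NaN is the only value unequal to itself.
    def is_valid(snapshot):
        return all(x == x for x in snapshot["data"])

    return store.restore_latest_valid(is_valid)
```

In this sketch a single-version scheme would hold only the corrupted step-5 state, whereas the multi-version store can fall back to the last version taken before the error was introduced. The backward walk over versions is the kind of access pattern for which the abstract reports a benefit from fast small random I/O.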

Keywords

Resilience, Multi-versioning, Global view resilience, BGAS, Parallel file-system

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Nan Dun (1)
  • Dirk Pleiter (2)
  • Aiman Fang (1)
  • Nicolas Vandenbergen (2)
  • Andrew A. Chien (1)

  1. Department of Computer Science, University of Chicago, Chicago, USA
  2. Jülich Research Centre, JSC, Jülich, Germany
