
EXAHD: An Exa-Scalable Two-Level Sparse Grid Approach for Higher-Dimensional Problems in Plasma Physics and Beyond

  • Mario Heene
  • Alfredo Parra Hinojosa
  • Michael Obersteiner
  • Hans-Joachim Bungartz
  • Dirk Pflüger (Email author)
Conference paper

Abstract

Within the current reporting period (04/2016–04/2017) of our HLRS project, we have developed a scalable implementation of the fault-tolerant combination technique. Fault tolerance is one of the key topics in ongoing research on algorithms for future exascale systems. Our algorithms provide fault tolerance for both hard and soft faults, enabling the efficient and massively parallel computation of high-dimensional PDEs without the need for checkpointing or process replication. The research project EXAHD is part of DFG’s priority program “Software for Exascale Computing” (SPPEXA). The project’s target application is the large-scale simulation of plasma turbulence with the code GENE. The report combines parts of three publications.

Notes

Acknowledgements

This work was supported by the German Research Foundation (DFG) through the Priority Programme 1648 Software for Exascale Computing (SPPEXA) and by the HLRS.

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Mario Heene (1)
  • Alfredo Parra Hinojosa (2)
  • Michael Obersteiner (2)
  • Hans-Joachim Bungartz (2)
  • Dirk Pflüger (1), Email author
  1. Institute for Parallel and Distributed Systems, University of Stuttgart, Stuttgart, Germany
  2. Chair of Scientific Computing, Technical University of Munich, Garching, Germany
