Towards a Fault-Tolerant, Scalable Implementation of GENE

  • Alfredo Parra Hinojosa
  • C. Kowitz
  • M. Heene
  • D. Pflüger
  • H.-J. Bungartz
Part of the Lecture Notes in Computational Science and Engineering book series (LNCSE, volume 105)

Abstract

We consider the HPC challenge of fault tolerance in the context of plasma physics simulations using the sparse grid combination technique. In the combination technique formalism, one breaks down a single, highly expensive simulation into many considerably cheaper, independent simulations that are propagated in time and then combined to approximate the results of the full solution. This introduces a new level of parallelism from which various fault tolerance approaches can be deduced. We investigate two such approaches, corresponding to two different simulation modes of the plasma physics code GENE: the simulation of a time-dependent, 5-dimensional PDE, and the computation of certain eigenvalues of the spectrum of a problem-specific linear operator. This paper makes two main contributions to the field of fault tolerance with the combination technique. First, we show that the recently developed fault-tolerant combination technique performs well even for highly complex simulation codes, i.e., beyond the usual Poisson or advection problems; and second, we demonstrate a new way to use the optimized combination technique (OptiCom) in the context of fault tolerance when dealing with eigenvalue computations. This work is a building block of the project EXAHD within the DFG’s Priority Programme “Software for Exascale Computing” (SPPEXA).
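
To make the idea of combining many cheap simulations concrete, the following is a minimal Python sketch of how combination coefficients can be computed for a set of coarse grids, and how they change when one grid is lost. It is an illustration under stated assumptions, not the chapter's or GENE's implementation: the function names `downset` and `combination_coefficients` are hypothetical, grids are identified only by their level vectors, and the coefficients come from the standard inclusion-exclusion formula for downward-closed index sets. The fault-tolerant combination technique in the literature is more general (it may also choose which further grids to drop); here the lost grid is simply assumed to be a maximal one, so the remaining set is still downward closed.

```python
from itertools import product


def downset(d, n):
    """Level vectors l (all l_i >= 1) with |l|_1 <= n + d - 1 (assumed index set)."""
    return {l for l in product(range(1, n + 1), repeat=d) if sum(l) <= n + d - 1}


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed index set.

    c_l = sum over z in {0,1}^d of (-1)^{|z|_1} * [l + z in index_set].
    Grids whose coefficient is zero are omitted from the result.
    """
    d = len(next(iter(index_set)))
    coeffs = {}
    for l in index_set:
        c = 0
        for z in product((0, 1), repeat=d):
            shifted = tuple(li + zi for li, zi in zip(l, z))
            if shifted in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[l] = c
    return coeffs


# Classical 2D combination of level n = 3.
full = downset(2, 3)
classical = combination_coefficients(full)

# Hypothetical fault: the simulation on grid (2, 2) is lost.
# (2, 2) is maximal in `full`, so the reduced set is still downward closed.
reduced = full - {(2, 2)}
recovered = combination_coefficients(reduced)

print("classical:", sorted(classical.items()))
print("after fault:", sorted(recovered.items()))
```

In this sketch the recomputed coefficients still sum to one and rely only on grids that were already part of the original index set, which is the property that lets a combination proceed without recomputing the lost simulation at full cost.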

Keywords

Fault Tolerance, Coarse Grid, Sparse Grid, Combination Technique, Eigenvalue Computation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

This work was supported (in part) by the German Research Foundation (DFG) through the Priority Programme 1648 “Software for Exascale Computing” (SPPEXA), along with the support of the Technische Universität München – Institute for Advanced Study, funded by the German Excellence Initiative (and the European Union Seventh Framework Programme under grant agreement no. 291763). D. Pflüger further acknowledges the financial support of the DFG within the Cluster of Excellence in Simulation Technology (EXC 310/1), and A. Parra Hinojosa thanks CONACYT, Mexico, for its support.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Alfredo Parra Hinojosa (1), corresponding author
  • C. Kowitz (1)
  • M. Heene (2)
  • D. Pflüger (2)
  • H.-J. Bungartz (1)
  1. Scientific Computing, Technische Universität München, München, Germany
  2. Institute for Parallel and Distributed Systems, University of Stuttgart, Stuttgart, Germany
