Towards a Fault-Tolerant, Scalable Implementation of GENE

Hinojosa, Alfredo Parra; Kowitz, C.; Heene, M.; Pflüger, D.; Bungartz, H.-J.

doi:10.1007/978-3-319-22997-3_3

Part of the book series: Lecture Notes in Computational Science and Engineering (LNCSE, volume 105)

Abstract

We consider the HPC challenge of fault tolerance in the context of plasma physics simulations using the sparse grid combination technique. In the combination technique formalism, one breaks down a single, highly expensive simulation into many considerably cheaper, independent simulations that are propagated in time and then combined to approximate the results of the full solution. This introduces a new level of parallelism from which various fault tolerance approaches can be deduced. We investigate two such approaches, corresponding to two different simulation modes of the plasma physics code GENE: the simulation of a time-dependent, 5-dimensional PDE, and the computation of certain eigenvalues of the spectrum of a problem-specific linear operator. This paper makes two main contributions to the field of fault tolerance with the combination technique. First, we show that the recently developed fault-tolerant combination technique performs well even for highly complex simulation codes, i.e., beyond the usual Poisson or advection problems; and second, we demonstrate a new way to use the optimized combination technique (OptiCom) in the context of fault tolerance when dealing with eigenvalue computations. This work is a building block of the project EXAHD within the DFG’s Priority Programme “Software for Exascale Computing” (SPPEXA).
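As a rough illustration of the decomposition described in the abstract, the following Python sketch enumerates the component grids and coefficients of the classical combination technique, \(u_n^c = \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{|\boldsymbol{l}|_1 = n+(d-1)-q} u_{\boldsymbol{l}}\). The function name and the convention that levels start at 1 are assumptions made for this sketch; the index sets actually used in the chapter, in particular for the fault-tolerant and OptiCom variants, may differ.

```python
from itertools import product
from math import comb


def combination_coefficients(dim, n):
    """Map each level vector of the classical combination technique
    to its combination coefficient (hypothetical helper, levels >= 1)."""
    coeffs = {}
    for q in range(dim):
        # All component grids on the q-th diagonal share one coefficient.
        target = n + (dim - 1) - q
        coefficient = (-1) ** q * comb(dim - 1, q)
        for level in product(range(1, target + 1), repeat=dim):
            if sum(level) == target:
                coeffs[level] = coefficient
    return coeffs


if __name__ == "__main__":
    # Each entry stands for one cheap, independent simulation.
    for level, coefficient in sorted(combination_coefficients(2, 3).items()):
        print(level, coefficient)
```

Each level vector corresponds to one independent, coarse simulation, and the coefficients always sum to 1. The fault tolerance approaches discussed in the chapter rely on this redundancy to recompute a valid combination when some component results are lost.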

Notes

1. Two level vectors \(\boldsymbol{\alpha},\boldsymbol{\beta}\) satisfy \(\boldsymbol{\alpha} \geq \boldsymbol{\beta}\) if \(\alpha_i \geq \beta_i\) for all \(i \in \{1,\ldots,d\}\).
2. Modulo the treatment of the boundary, which in some dimensions might add ±1 discretization points.
3. http://www.mac.tum.de/wiki/index.php/MAC_Cluster
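
The componentwise ordering of level vectors defined in the first note can be sketched in Python as follows. The function name `level_geq` is hypothetical and chosen only for this illustration.

```python
def level_geq(alpha, beta):
    """Return True if alpha >= beta holds in every component."""
    if len(alpha) != len(beta):
        raise ValueError("level vectors must have the same dimension")
    return all(a >= b for a, b in zip(alpha, beta))
```

The ordering is only partial: for the level vectors (2, 1) and (1, 3), neither `level_geq((2, 1), (1, 3))` nor `level_geq((1, 3), (2, 1))` holds.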

Acknowledgements

This work was supported (in part) by the German Research Foundation (DFG) through the Priority Programme 1648 “Software for Exascale Computing” (SPPEXA), along with the support of the Technische Universität München – Institute for Advanced Study, funded by the German Excellence Initiative (and the European Union Seventh Framework Programme under grant agreement no. 291763). D. Pflüger further acknowledges the financial support of the DFG within the Cluster of Excellence in Simulation Technology (EXC 310/1), and A. Parra Hinojosa thanks CONACYT, Mexico, for its support.

Author information

Authors and Affiliations

Scientific Computing, Technische Universität München, München, Germany
Alfredo Parra Hinojosa (Chair), C. Kowitz (Chair) & H.-J. Bungartz (Chair)
Institute for Parallel and Distributed Systems, University of Stuttgart, Stuttgart, Germany
M. Heene & D. Pflüger

Corresponding author

Correspondence to Alfredo Parra Hinojosa.

Editor information

Editors and Affiliations

Institut für Parallele und Verteilte Systeme, Universität Stuttgart, Stuttgart, Germany
Miriam Mehl
Institut für Baustatik und Baudynamik, Universität Stuttgart, Stuttgart, Germany
Manfred Bischoff
Fakultät für Maschinenbau, Technische Universität Darmstadt, Darmstadt, Germany
Michael Schäfer

Appendix: GENE Parameter File

&box
kymin = 0.3
lv = 3.00
lw = 9.00
lx = 4.16667
adapt_lx = T
mu_grid_type = 'clenshaw_curtis'
/

&general
nonlinear = F
arakawa_zv = T
arakawa_zv_order = 2
calc_dt = F
dt_max = 7.39E-4
courant = 1.0
beta = 0.1E-02
debye2 = 0.0
collision_op = 'none'
init_cond = 'fb'
hyp_z = 2.000
hyp_v = 0.5000
/

&geometry
magn_geometry = 'circular'
q0 = 1.4
shat = 0.8
trpeps = 0.18
major_R = 1.0
norm_flux_projection = F
/

&species
name = 'ions'
omn = 2.0
omt = 4.5
mass = 1.0
temp = 1.0
dens = 1.0
charge = 1
/

&species
name = 'electrons'
omn = 2.0
omt = 3.5
mass = 0.27E-03
temp = 1.5
dens = 1.0
charge = -1
/
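
The parameter file uses Fortran namelist syntax: each group opens with `&name`, lists `key = value` pairs, and closes with `/`. For readers who want to inspect such a file programmatically, the following is a minimal Python sketch of a reader, assuming one `key = value` pair per line and scalar values only. The functions `parse_namelist` and `convert` are hypothetical helpers written for this illustration; they are not part of GENE, and they do not cover the full namelist syntax.

```python
def convert(value):
    """Turn a namelist scalar into a Python bool, str, int, or float."""
    if value in ("T", "F"):
        return value == "T"
    if len(value) >= 2 and value[0] == "'" and value[-1] == "'":
        return value[1:-1].strip()
    try:
        return int(value)
    except ValueError:
        return float(value)


def parse_namelist(text):
    """Return a list of (group_name, parameters) pairs in file order.

    A list is used instead of a dict because group names such as
    'species' may occur more than once.
    """
    groups = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("&"):
            current = (line[1:].strip(), {})
            groups.append(current)
        elif line == "/":
            current = None
        elif current is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            current[1][key] = convert(value)
    return groups


# Short excerpt in the same syntax as the appendix.
SAMPLE = """
&box
kymin = 0.3
adapt_lx = T
mu_grid_type = 'clenshaw_curtis'
/
&species
name = 'ions'
charge = 1
/
&species
name = 'electrons'
mass = 0.27E-03
charge = -1
/
"""

if __name__ == "__main__":
    for name, parameters in parse_namelist(SAMPLE):
        print(name, parameters)
```

Keeping the groups in a list preserves both `&species` blocks, which a plain dictionary keyed by group name would overwrite.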

About this chapter

Cite this chapter

Hinojosa, A.P., Kowitz, C., Heene, M., Pflüger, D., Bungartz, HJ. (2015). Towards a Fault-Tolerant, Scalable Implementation of GENE. In: Mehl, M., Bischoff, M., Schäfer, M. (eds) Recent Trends in Computational Engineering - CE2014. Lecture Notes in Computational Science and Engineering, vol 105. Springer, Cham. https://doi.org/10.1007/978-3-319-22997-3_3

DOI: https://doi.org/10.1007/978-3-319-22997-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22996-6
Online ISBN: 978-3-319-22997-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)