Scalable Parallelization of Stencils Using MODA

  • Nabeeh Jumah
  • Julian Kunkel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11887)

Abstract

The natural and design limitations on processor evolution, e.g., frequency scaling and memory bandwidth bottlenecks, push applications towards scaling across multiple nodes in addition to exploiting the power of each single node. This introduces new challenges in porting applications to the new infrastructure, especially in heterogeneous environments. Domain decomposition and handling the resulting communication are not trivial tasks. In general, tools cannot decide automatically whether code can be parallelized, because of the semantics of general-purpose languages.
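
To illustrate why domain decomposition forces communication, the following minimal sketch splits a 1D three-point stencil into sub-domains, each extended by one halo cell per side. It is not the MODA implementation: the function names `stencil_step` and `decomposed_step`, the averaging stencil, and the 1D layout are assumptions made for this example, and the halo copy that would be a message exchange between nodes is simulated with a list slice.

```python
def stencil_step(u):
    """Apply a 3-point averaging stencil to the interior of u; end cells are kept."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def decomposed_step(u, parts):
    """Same stencil step, computed on `parts` contiguous sub-domains.

    Each sub-domain is extended by one halo cell per side, taken from the
    neighbouring sub-domain. On a cluster this copy would be a message
    exchange between nodes; here it is a plain list slice (an assumption
    made to keep the sketch self-contained).
    """
    n = len(u)
    bounds = [(p * n // parts, (p + 1) * n // parts) for p in range(parts)]
    result = []
    for lo, hi in bounds:
        left = max(lo - 1, 0)    # include the left neighbour's edge cell, if any
        right = min(hi + 1, n)   # include the right neighbour's edge cell, if any
        local = stencil_step(u[left:right])
        # Drop the halo cells and keep only the cells this sub-domain owns.
        start = lo - left
        result.extend(local[start:start + (hi - lo)])
    return result


if __name__ == "__main__":
    field = [float(i * i % 7) for i in range(16)]
    # The decomposed update matches the single-domain update.
    print(decomposed_step(field, 4) == stencil_step(field))
```

Without the halo cells, the first and last owned cell of every sub-domain would read stale or missing neighbour values, which is the communication that has to be identified and generated.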

To allow scientists to avoid such problems, we introduce the Memory-Oblivious Data Access (MODA) technique and use it to scale code to configurations ranging from a single node to multiple nodes, supporting different architectures, without requiring changes to the application's source code. We present a technique to automatically identify necessary communication based on higher-level semantics. The extracted information enables tools to generate code that handles the communication. We developed a prototype that implements these techniques and used it to evaluate the approach. The results show that the techniques are effective for scaling code on multi-core processors and on GPU-based machines. Comparing the ratio of achieved GFLOPS to the number of nodes across runs with different node counts, up to 100 nodes, shows a scaling efficiency of around 100%. The exception is the single-node GPU configuration, in which no communication is needed and hence no data moves between GPU and host memory, which yields higher GFLOPS.
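
One way to read the efficiency comparison is as per-node throughput normalised to a baseline run. The sketch below assumes that reading; the function name `scaling_efficiency`, the choice of the smallest run as the baseline, and the input numbers are hypothetical placeholders, not measurements from the paper.

```python
def scaling_efficiency(gflops_by_nodes):
    """Return per-node GFLOPS of each run relative to the smallest run.

    gflops_by_nodes maps a node count to the total GFLOPS achieved on
    that many nodes. A value of 1.0 means per-node throughput equals
    that of the baseline run (the run with the fewest nodes).
    """
    base_nodes = min(gflops_by_nodes)
    base_per_node = gflops_by_nodes[base_nodes] / base_nodes
    return {
        nodes: (gflops / nodes) / base_per_node
        for nodes, gflops in sorted(gflops_by_nodes.items())
    }


if __name__ == "__main__":
    # Placeholder values for illustration only, not results from the paper.
    runs = {2: 20.0, 4: 40.0, 8: 78.0}
    for nodes, eff in scaling_efficiency(runs).items():
        print(f"{nodes} nodes: {eff:.1%}")
```

Because the single-node GPU run avoids host-device transfers, leaving it out of the input (as in the usage above) keeps it from inflating the baseline.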

Keywords

HPC, Scalability, Parallel programming, Stencils

Notes

Acknowledgements

This work was supported in part by the German Research Foundation (DFG) through the Priority Programme 1648 Software for Exascale Computing SPPEXA (GZ: LU 1353/11-1). We thank the Swiss National Supercomputing Center (CSCS) for providing access to their machines to run the experiments, and Prof. John Thuburn (University of Exeter) for his help in developing the code of the shallow water equations.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Universität Hamburg, Hamburg, Germany
  2. University of Reading, Reading, UK