Highly Productive, High-Performance Application Frameworks for Post-Petascale Computing

Abstract

We present an overview of our project, which aimed to achieve both high performance and high productivity. To that end, we designed and developed high-level domain-specific frameworks that automate many of the tedious and complicated program optimizations required for certain computation patterns. This article walks through some of our research results and highlights how these frameworks deliver both high performance and high productivity.
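The key idea behind such frameworks can be illustrated with a minimal sketch. The user writes only the per-point computation for a pattern (here, a 2-D stencil), while the framework owns the traversal, which is exactly where optimizations such as tiling, vectorization, or GPU offload can be applied without changing user code. The names `apply_stencil` and `smooth` below are hypothetical illustrations, not the APIs of the frameworks developed in this project.

```python
def apply_stencil(grid, kernel):
    """Apply a user-supplied per-point kernel over the interior of a 2-D
    grid (a list of lists). The user writes only the point-wise update;
    the framework owns the loop nest, where optimizations such as tiling
    or accelerator offload could be applied transparently."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = kernel(grid, i, j)
    return out

def smooth(g, i, j):
    # 4-neighbor average, written with no loops and no boundary handling.
    return 0.25 * (g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1])

grid = [[0.0] * 8 for _ in range(8)]
grid[4][4] = 1.0
result = apply_stencil(grid, smooth)
```

Because the framework controls iteration, the same user kernel could be re-targeted from this naive sequential loop to a tiled, multi-threaded, or GPU implementation, which is the separation of concerns that makes high productivity and high performance compatible.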




Corresponding author

Correspondence to Naoya Maruyama.


Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Maruyama, N. et al. (2019). Highly Productive, High-Performance Application Frameworks for Post-Petascale Computing. In: Sato, M. (eds) Advanced Software Technologies for Post-Peta Scale Computing. Springer, Singapore. https://doi.org/10.1007/978-981-13-1924-2_5


  • DOI: https://doi.org/10.1007/978-981-13-1924-2_5

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1923-5

  • Online ISBN: 978-981-13-1924-2

  • eBook Packages: Computer Science, Computer Science (R0)
