Abstract
Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical N-body algorithms such as the Fast Multipole Method (FMM). In the present work, we propose three independent strategies to improve partitioning and reduce communication. First, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which constitute a significant portion of FMM's application user base. We propose an alternative method that modifies orthogonal recursive bisection to relieve the cell-partition misalignment that has previously kept it from scaling. Second, we optimize the granularity of communication to find the optimal balance between a bulk-synchronous collective communication of the local essential tree and an RDMA per task per cell. Finally, we take the dynamic sparse data exchange proposed by Hoefler et al. [1] and extend it to a hierarchical sparse data exchange, which we demonstrate at scale to be faster than the commonly used MPI_Alltoallv.
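For context on the partitioning contribution, classical orthogonal recursive bisection (ORB) splits a point set at the median coordinate along alternating axes until one partition per process remains. The sketch below is a minimal illustration of that baseline scheme only; the function name and NumPy-based layout are our own, and the paper's actual contribution, a cell-aligned modification of this idea, is not reproduced here:

```python
import numpy as np

def orb_partition(points, n_parts, axis=0):
    """Classical orthogonal recursive bisection: recursively split
    the point set at the median coordinate, cycling through axes,
    until n_parts partitions remain."""
    if n_parts == 1:
        return [points]
    # Divide the point count in proportion to the (possibly odd) part count.
    left_parts = n_parts // 2
    k = len(points) * left_parts // n_parts
    order = np.argsort(points[:, axis], kind="stable")
    pts = points[order]
    nxt = (axis + 1) % points.shape[1]  # alternate the split axis
    return (orb_partition(pts[:k], left_parts, nxt)
            + orb_partition(pts[k:], n_parts - left_parts, nxt))

rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
parts = orb_partition(pts, 8)
print([len(p) for p in parts])  # → [125, 125, 125, 125, 125, 125, 125, 125]
```

The splits balance particle counts, but the resulting planar cuts need not coincide with octree cell boundaries; that cell-partition misalignment is exactly what the modified ORB in this paper addresses.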
Notes
1. ExaFMM is an open-source, parallel, GPU-capable fast multipole method code base. All algorithms pertaining to partitioning and communication reduction are available in the public repository: https://github.com/exafmm/exafmm.
References
Hoefler, T., Siebert, C., Lumsdaine, A.: Scalable communication protocols for dynamic sparse data exchange. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP 2010, pp. 159–168. ACM, New York (2010)
Appel, A.W.: An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput. 6(1), 85–103 (1985)
Greengard, L., Rokhlin, V.: A fast algorithm for particle simulations. J. Comput. Phys. 73(2), 325–348 (1987)
Beatson, R., Greengard, L.: A short course on fast multipole methods. Wavelets Multilevel Methods Elliptic PDEs 1, 1–37 (1997)
Lu, B., Cheng, X., Huang, J., McCammon, J.A.: Order \(N\) algorithm for computation of electrostatic interactions in biomolecular systems. Proc. Natl. Acad. Sci. 103(51), 19314–19319 (2006)
Yokota, R., Bardhan, J.P., Knepley, M.G., Barba, L.A., Hamada, T.: Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns. Comput. Phys. Commun. 182(6), 1272–1283 (2011)
Ohno, Y., Yokota, R., Koyama, H., Morimoto, G., Hasegawa, A., Masumoto, G., Okimoto, N., Hirano, Y., Ibeid, H., Narumi, T., et al.: Petascale molecular dynamics simulation using the fast multipole method on K computer. Comput. Phys. Commun. 185(10), 2575–2585 (2014)
Rui, P., Chen, R.: An efficient sparse approximate inverse preconditioning for FMM implementation. Microw. Opt. Technol. Lett. 49(7), 1746–1750 (2007)
Bédorf, J., Gaburov, E., Zwart, S.P.: A sparse octree gravitational \(N\)-body code that runs entirely on the GPU processor. J. Comput. Phys. 231(7), 2825–2839 (2012)
Price, D., Monaghan, J.: An energy-conserving formalism for adaptive gravitational force softening in smoothed particle hydrodynamics and \(N\)-body codes. Mon. Not. R. Astron. Soc. 374(4), 1347–1358 (2007)
Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., et al.: The landscape of parallel computing research: a view from Berkeley. Technical report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (2006)
Warren, M.S., Salmon, J.K.: A fast tree code for many-body problems. Los Alamos Sci. 22(10), 88–97 (1994)
Bédorf, J., Gaburov, E., Fujii, M.S., Nitadori, K., Ishiyama, T., Portegies Zwart, S.: 24.77 Pflops on a gravitational tree-code to simulate the Milky Way Galaxy with 18600 GPUs. In: Proceedings of the 2014 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2014)
Speck, R., Ruprecht, D., Krause, R., Emmett, M., Minion, M., Winkel, M., Gibbon, P.: A massively space-time parallel \(N\)-body solver. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 92. IEEE Computer Society Press (2012)
Winkel, M., Speck, R., Hubner, H., Arnold, L., Krause, R., Gibbon, P.: A massively parallel, multi-disciplinary Barnes-Hut tree code for extreme-scale \(N\)-body simulations. Comput. Phys. Commun. 183(4), 880–889 (2012)
Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T.-A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D., Biros, G.: A massively parallel adaptive fast multipole method on heterogeneous architectures. Commun. ACM 55(5), 101–109 (2012)
Zandifar, M., Abdul Jabbar, M., Majidi, A., Keyes, D., Amato, N.M., Rauchwerger, L.: Composing algorithmic skeletons to express high-performance scientific applications. In: Proceedings of the 29th ACM on International Conference on Supercomputing, ser. ICS 2015, pp. 415–424. ACM, New York (2015)
AbdulJabbar, M., Yokota, R., Keyes, D.: Asynchronous execution of the fast multipole method using Charm++. arXiv preprint arXiv:1405.7487 (2014)
Salmon, J.K.: Parallel hierarchical N-body methods. Ph.D. dissertation, California Institute of Technology (1991)
Warren, M.S., Salmon, J.K.: A parallel hashed oct-tree \(N\)-body algorithm. In: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pp. 12–21. ACM (1993)
Makino, J.: A fast parallel treecode with GRAPE. Publ. Astron. Soc. Jpn. 56, 521–531 (2004)
Solomonik, E., Kalé, L.V.: Highly scalable parallel sorting. In: Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–12 (2010)
Haverkort, H.: An inventory of three-dimensional Hilbert space-filling curves. arXiv preprint arXiv:1109.2323 (2011)
Dubinski, J.: A parallel tree code. New Astron. 1, 133–147 (1996)
Warren, M.S., Salmon, J.K.: Astrophysical \(N\)-body simulations using hierarchical tree data structures. In: Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, ser. Supercomputing 1992, pp. 570–576. IEEE Computer Society Press, Los Alamitos (1992)
Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T.-A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D., Biros, G.: A massively parallel adaptive fast multipole method on heterogeneous architectures. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (2009)
Teng, S.-H.: Provably good partitioning and load balancing algorithms for parallel adaptive \(N\)-body simulation. SIAM J. Sci. Comput. 19(2), 635–656 (1998)
Yokota, R., Turkiyyah, G., Keyes, D.: Communication complexity of the fast multipole method and its algebraic variants. Supercomput. Front. Innov.: Int. J. 1(1), 63–84 (2014)
Malhotra, D., Biros, G.: PVFMM: a parallel kernel independent FMM for particle and volume potentials. Commun. Comput. Phys. 18(3), 808–830 (2015)
Acknowledgment
This work was supported by JSPS KAKENHI Grant-in-Aid for Young Scientists (A) Grant Number 16H05859. This work was also partially supported by the "Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures" and "High Performance Computing Infrastructure" projects in Japan. The authors are grateful to the KAUST Supercomputing Laboratory for the use of the Shaheen XC40 system.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Abduljabbar, M., Markomanolis, G.S., Ibeid, H., Yokota, R., Keyes, D. (2017). Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science(), vol 10266. Springer, Cham. https://doi.org/10.1007/978-3-319-58667-0_5
Print ISBN: 978-3-319-58666-3
Online ISBN: 978-3-319-58667-0