Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10266)


Reducing communication and partitioning efficiently are key to achieving scalability in hierarchical N-body algorithms such as the fast multipole method (FMM). In the present work, we propose three independent strategies to improve partitioning and reduce communication. First, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which constitute a significant portion of FMM’s user base. We propose an alternative method that modifies orthogonal recursive bisection to relieve the cell-partition misalignment that has previously kept it from scaling. Second, we optimize the granularity of communication to find the optimal balance between a bulk-synchronous collective communication of the local essential tree and an RDMA per task per cell. Finally, we take the dynamic sparse data exchange proposed by Hoefler et al. [1] and extend it to a hierarchical sparse data exchange, which is demonstrated at scale to be faster than the commonly used MPI_Alltoallv of the MPI library.
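For readers unfamiliar with orthogonal recursive bisection, the following is a minimal sketch of the plain, unmodified technique applied to points on a sphere surface, a simple stand-in for a boundary distribution. It is not the modified scheme proposed in the paper. The function name `orb_partition`, the choice of splitting along the axis of largest extent, and the test geometry are all assumptions made for illustration.

```python
import numpy as np

def orb_partition(points, num_parts):
    """Assign each point a partition id via plain orthogonal recursive bisection.

    Illustrative only: the splitting rule (largest-extent axis, cut by
    point count) is an assumption, not the paper's modified algorithm.
    """
    points = np.asarray(points, dtype=float)
    labels = np.zeros(len(points), dtype=int)

    def bisect(indices, first_id, parts):
        if parts == 1 or len(indices) == 0:
            labels[indices] = first_id
            return
        subset = points[indices]
        # Split along the axis with the largest spatial extent.
        axis = int(np.argmax(subset.max(axis=0) - subset.min(axis=0)))
        order = indices[np.argsort(subset[:, axis], kind="stable")]
        left_parts = parts // 2
        # Cut so that point counts are proportional to the number of
        # partitions assigned to each side (handles non-power-of-two counts).
        cut = len(order) * left_parts // parts
        bisect(order[:cut], first_id, left_parts)
        bisect(order[cut:], first_id + left_parts, parts - left_parts)

    bisect(np.arange(len(points)), 0, num_parts)
    return labels

# Usage: 1000 points on the unit sphere surface, split into 8 partitions.
rng = np.random.default_rng(0)
xyz = rng.normal(size=(1000, 3))
xyz /= np.linalg.norm(xyz, axis=1, keepdims=True)
labels = orb_partition(xyz, 8)
counts = np.bincount(labels, minlength=8)  # 125 points in each partition
```

Because each cut is placed by point count, the partitions are balanced in the number of points, but the cut planes are not tied to the boundaries of tree cells. This is one way to picture the cell-partition misalignment the abstract refers to.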


N-body methods, Fast multipole method, Load balancing, Communication reduction



This work was supported by JSPS KAKENHI Grant-in-Aid for Young Scientists A Grant Number 16H05859. This work is partially supported by “Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures” and “High Performance Computing Infrastructure” in Japan. The authors are grateful to the KAUST Supercomputing Laboratory for the use of the Shaheen XC40 system.


  1. Hoefler, T., Siebert, C., Lumsdaine, A.: Scalable communication protocols for dynamic sparse data exchange. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP 2010, pp. 159–168. ACM, New York (2010)
  2. Appel, A.W.: An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput. 6(1), 85–103 (1985)
  3. Greengard, L., Rokhlin, V.: A fast algorithm for particle simulations. J. Comput. Phys. 73(2), 325–348 (1987)
  4. Beatson, R., Greengard, L.: A short course on fast multipole methods. Wavelets Multilevel Methods Elliptic PDEs 1, 1–37 (1997)
  5. Lu, B., Cheng, X., Huang, J., McCammon, J.A.: Order \(N\) algorithm for computation of electrostatic interactions in biomolecular systems. Proc. Natl. Acad. Sci. 103(51), 19314–19319 (2006)
  6. Yokota, R., Bardhan, J.P., Knepley, M.G., Barba, L.A., Hamada, T.: Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns. Comput. Phys. Commun. 182(6), 1272–1283 (2011)
  7. Ohno, Y., Yokota, R., Koyama, H., Morimoto, G., Hasegawa, A., Masumoto, G., Okimoto, N., Hirano, Y., Ibeid, H., Narumi, T., et al.: Petascale molecular dynamics simulation using the fast multipole method on K computer. Comput. Phys. Commun. 185(10), 2575–2585 (2014)
  8. Rui, P., Chen, R.: An efficient sparse approximate inverse preconditioning for FMM implementation. Microw. Opt. Technol. Lett. 49(7), 1746–1750 (2007)
  9. Bédorf, J., Gaburov, E., Zwart, S.P.: A sparse octree gravitational \(N\)-body code that runs entirely on the GPU processor. J. Comput. Phys. 231(7), 2825–2839 (2012)
  10. Price, D., Monaghan, J.: An energy-conserving formalism for adaptive gravitational force softening in smoothed particle hydrodynamics and \(N\)-body codes. Mon. Not. R. Astron. Soc. 374(4), 1347–1358 (2007)
  11. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., et al.: The landscape of parallel computing research: a view from Berkeley. Technical report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (2006)
  12. Warren, M.S., Salmon, J.K.: A fast tree code for many-body problems. Los Alamos Sci. 22(10), 88–97 (1994)
  13. Bédorf, J., Gaburov, E., Fujii, M.S., Nitadori, K., Ishiyama, T., Portegies Zwart, S.: 24.77 Pflops on a gravitational tree-code to simulate the Milky Way Galaxy with 18600 GPUs. In: Proceedings of the 2014 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2014)
  14. Speck, R., Ruprecht, D., Krause, R., Emmett, M., Minion, M., Winkel, M., Gibbon, P.: A massively space-time parallel \(N\)-body solver. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 92. IEEE Computer Society Press (2012)
  15. Winkel, M., Speck, R., Hubner, H., Arnold, L., Krause, R., Gibbon, P.: A massively parallel, multi-disciplinary Barnes-Hut tree code for extreme-scale \(N\)-body simulations. Comput. Phys. Commun. 183(4), 880–889 (2012)
  16. Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T.-A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D., Biros, G.: A massively parallel adaptive fast multipole method on heterogeneous architectures. Commun. ACM 55(5), 101–109 (2012)
  17. Zandifar, M., Abdul Jabbar, M., Majidi, A., Keyes, D., Amato, N.M., Rauchwerger, L.: Composing algorithmic skeletons to express high-performance scientific applications. In: Proceedings of the 29th ACM on International Conference on Supercomputing, ser. ICS 2015, pp. 415–424. ACM, New York (2015)
  18. AbdulJabbar, M., Yokota, R., Keyes, D.: Asynchronous execution of the fast multipole method using Charm++. arXiv preprint arXiv:1405.7487 (2014)
  19. Salmon, J.K.: Parallel hierarchical N-body methods. Ph.D. dissertation, California Institute of Technology (1991)
  20. Warren, M.S., Salmon, J.K.: A parallel hashed oct-tree \(N\)-body algorithm. In: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pp. 12–21. ACM (1993)
  21. Makino, J.: A fast parallel treecode with GRAPE. Publ. Astron. Soc. Jpn. 56, 521–531 (2004)
  22. Solomonik, E., Kalé, L.V.: Highly scalable parallel sorting. In: Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–12 (2010)
  23. Haverkort, H.: An inventory of three-dimensional Hilbert space-filling curves. arXiv preprint arXiv:1109.2323 (2011)
  24. Dubinski, J.: A parallel tree code. New Astron. 1, 133–147 (1996)
  25. Warren, M.S., Salmon, J.K.: Astrophysical \(N\)-body simulations using hierarchical tree data structures. In: Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, ser. Supercomputing 1992, pp. 570–576. IEEE Computer Society Press, Los Alamitos (1992)
  26. Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen, T.-A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L., Zorin, D., Biros, G.: A massively parallel adaptive fast multipole method on heterogeneous architectures. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (2009)
  27. Teng, S.-H.: Provably good partitioning and load balancing algorithms for parallel adaptive \(N\)-body simulation. SIAM J. Sci. Comput. 19(2), 635–656 (1998)
  28. Yokota, R., Turkiyyah, G., Keyes, D.: Communication complexity of the fast multipole method and its algebraic variants. Supercomput. Front. Innov.: Int. J. 1(1), 63–84 (2014)
  29. Malhotra, D., Biros, G.: PVFMM: a parallel kernel independent FMM for particle and volume potentials. Commun. Comput. Phys. 18(3), 808–830 (2015)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Extreme Computing Research Center (ECRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  2. KAUST Supercomputing Laboratory (KSL), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  3. Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology (TITECH), Tokyo, Japan
