Special Issue Paper

Computer Science - Research and Development

, Volume 24, Issue 1, pp 21-31

First online:

A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation

  • Tsuyoshi HamadaAffiliated withFaculty of Engineering, Department of Computer and Information Sciences, Nagasaki University Email author 
  • , Keigo NitadoriAffiliated withDepartment of Astronomy, University of Tokyo
  • , Khaled BenkridAffiliated withSchool of Engineering, The University of Edinburgh
  • , Yousuke OhnoAffiliated withRIKEN (The Institute of Physical and Chemical Research)
  • , Gentaro MorimotoAffiliated withRIKEN (The Institute of Physical and Chemical Research)
  • , Tomonari MasadaAffiliated withFaculty of Engineering, Department of Computer and Information Sciences, Nagasaki University
  • , Yuichiro ShibataAffiliated withFaculty of Engineering, Department of Computer and Information Sciences, Nagasaki University
  • , Kiyoshi OguriAffiliated withFaculty of Engineering, Department of Computer and Information Sciences, Nagasaki University
  • , Makoto TaijiAffiliated withRIKEN (The Institute of Physical and Chemical Research)

Rent the article at a discount

Rent now

* Final gross prices may vary according to local VAT.

Get Access

Abstract

Recently, general-purpose computation on graphics processing units (GPGPU) has become an increasingly popular field of study as graphics processing units (GPUs) continue to be proposed as high performance and relatively low cost implementation platforms for scientific computing applications. Among these applications figure astrophysical N-bodysimulations, which form one of the most challenging problems in computational science. However, in most reported studies, a simple \( \mathcal{O}(N^{2})\) algorithm was used for GPGPUs, and the resulting performances were not observed to be better than those of conventional CPUs that were based on more optimized \( \mathcal{O}(N \log N)\) algorithms such as the tree algorithm or the particle-particle particle-mesh algorithm. Because of the difficulty in getting efficient implementations of such algorithms on GPUs, a GPU cluster had no practical advantage over general-purpose PC clusters for N-bodysimulations. In this paper, we report a new method for efficient parallel implementation of the tree algorithm on GPUs. Our novel tree code allows the realization of an N-bodysimulation on a GPU cluster at a much higher performance than that on general PC clusters. We practically performed a cosmological simulation with 562 million particles on a GPU cluster using 128 NVIDIA GeForce 8800GTS GPUs at an overall cost of 168172 $. We obtained a sustained performance of 20.1 Tflops, which when normalized against a general-purpose CPU implementation leads to a performance of 8.50 Tflops. The achieved cost/performance was hence a mere $19.8 /Gflops which shows the high competitiveness of GPGPUs.

Keywords

GPU N-bodysimulation Tree algorithm