An Optimal Parallel Prefix-Sums Algorithm on the Memory Machine Models for GPUs

  • Koji Nakano
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7439)

Abstract

The main contribution of this paper is to show optimal algorithms computing the sum and the prefix-sums on two memory machine models: the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs, respectively. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in \(O({n\over w}+{nl\over p}+l\log n)\) time units on the DMM and the UMM. We then show that \(\Omega({n\over w}+{nl\over p}+l\log n)\) time units are necessary to compute the sum. Finally, we present an optimal parallel algorithm that computes the prefix-sums of n numbers in \(O({n\over w}+{nl\over p}+l\log n)\) time units on the DMM and the UMM.
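To illustrate where the \(l\log n\) term in these bounds comes from, the following sketch sequentially simulates the classic Hillis–Steele scan, which computes the inclusive prefix-sums in \(O(\log n)\) parallel rounds. This is a textbook scan for illustration only, not the DMM/UMM-optimal algorithm of the paper; the function name and the sequential simulation of the rounds are ours.

```python
def hillis_steele_scan(a):
    """Inclusive prefix-sums via a sequential simulation of the
    Hillis-Steele scan: in round d (d = 1, 2, 4, ...), every element
    at index i >= d adds the value d positions to its left, so the
    scan finishes after ceil(log2(n)) rounds."""
    out = list(a)
    n = len(out)
    d = 1
    while d < n:
        # Copy so that all "threads" in a round read the previous
        # round's values, as they would in a synchronous parallel step.
        nxt = out[:]
        for i in range(d, n):
            nxt[i] = out[i] + out[i - d]
        out = nxt
        d *= 2
    return out

print(hillis_steele_scan([1, 2, 3, 4, 5]))  # → [1, 3, 6, 10, 15]
```

Each of the \(\log n\) rounds incurs at least one memory access latency, which is the intuition behind the \(l\log n\) term in the lower bound above.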

Keywords

Memory machine models, prefix-sums computation, parallel algorithm, GPU, CUDA

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Koji Nakano
  1. Department of Information Engineering, Hiroshima University, Higashi Hiroshima, Japan
