Theoretical Parallel Computing Models for GPU Computing

Chapter

Abstract

The latest GPUs are designed for general-purpose computing and attract the attention of many application developers. The main purpose of this chapter is to introduce theoretical parallel computing models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), that capture the essence of CUDA-enabled GPUs. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. As examples of parallel algorithms on these theoretical models, we show fundamental algorithms for computing the sum and the prefix-sums of n numbers. We first show that the sum of n numbers can be computed in \(O(\frac{n}{w} + \frac{nl}{p} + l\log n)\) time units on the DMM and the UMM. We then go on to show that \(\Omega(\frac{n}{w} + \frac{nl}{p} + l\log n)\) time units are necessary to compute the sum, so this algorithm is optimal. We also present a simple parallel algorithm for computing the prefix-sums that runs in \(O(\frac{n\log n}{w} + \frac{nl\log n}{p} + l\log n)\) time units on the DMM and the UMM. Clearly, this algorithm is not optimal. We then present an optimal parallel algorithm that computes the prefix-sums of n numbers in \(O(\frac{n}{w} + \frac{nl}{p} + l\log n)\) time units on the DMM and the UMM. We also show several experimental results on the GeForce TITAN GPU.
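To illustrate the two prefix-sums strategies contrasted in the abstract, the following sequential Python sketch simulates them. It is not taken from the chapter: `hillis_steele_scan` is the classic non-work-optimal scan (log n parallel rounds, O(n log n) additions in total, matching the \(O(\frac{n\log n}{w} + \frac{nl\log n}{p} + l\log n)\) bound), while `block_scan` follows the standard work-optimal scheme of scanning p blocks locally and then offsetting by a scan of the block totals (O(n) work, in the spirit of the optimal \(O(\frac{n}{w} + \frac{nl}{p} + l\log n)\) algorithm). The function names and the parameter `p` are illustrative choices, not identifiers from the chapter.

```python
def hillis_steele_scan(a):
    """Non-work-optimal inclusive prefix-sums.

    log2(n) rounds; in each round every position i >= d adds the value
    d places to its left. On a parallel machine each round is one step,
    but total work is O(n log n).
    """
    n = len(a)
    out = list(a)
    d = 1
    while d < n:
        nxt = list(out)
        for i in range(d, n):   # conceptually done by n threads in parallel
            nxt[i] = out[i - d] + out[i]
        out = nxt
        d *= 2
    return out


def block_scan(a, p):
    """Work-optimal inclusive prefix-sums using p blocks.

    Phase 1: each block is scanned sequentially (one thread per block).
    Phase 2: the block totals are scanned and added as per-block offsets.
    Total work is O(n), independent of p.
    """
    n = len(a)
    size = -(-n // p)  # ceiling division: block length
    blocks = [a[i:i + size] for i in range(0, n, size)]
    # phase 1: local inclusive scan of each block
    local = []
    for b in blocks:
        s, out = 0, []
        for x in b:
            s += x
            out.append(s)
        local.append(out)
    # phase 2: exclusive scan of block totals supplies the offsets
    offset, result = 0, []
    for out in local:
        result.extend(v + offset for v in out)
        offset += out[-1]
    return result
```

For example, both functions map `[1, 2, 3, 4]` to the inclusive prefix-sums `[1, 3, 6, 10]`; the difference between them is only in the amount of work performed, which is exactly what separates the non-optimal and optimal bounds above.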


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

Hiroshima University, Higashi-Hiroshima, Japan