# Theoretical Parallel Computing Models for GPU Computing

Koji Nakano

## Abstract

The latest GPUs are designed for general-purpose computing and attract the attention of many application developers. The main purpose of this chapter is to introduce two theoretical parallel computing models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), that capture the essence of CUDA-enabled GPUs. These models have three parameters: the number $p$ of threads, the width $w$ of the memory, and the memory access latency $l$. As examples of parallel algorithms on these models, we show fundamental algorithms for computing the sum and the prefix-sums of $n$ numbers. We first show that the sum of $n$ numbers can be computed in $O(\frac{n}{w} + \frac{nl}{p} + l\log n)$ time units on the DMM and the UMM. We then show that $\Omega(\frac{n}{w} + \frac{nl}{p} + l\log n)$ time units are necessary to compute the sum, so this algorithm is optimal. We next present a simple parallel algorithm for computing the prefix-sums that runs in $O(\frac{n\log n}{w} + \frac{nl\log n}{p} + l\log n)$ time units on the DMM and the UMM; clearly, this algorithm is not optimal. We then present an optimal parallel algorithm that computes the prefix-sums of $n$ numbers in $O(\frac{n}{w} + \frac{nl}{p} + l\log n)$ time units on the DMM and the UMM. Finally, we show several experimental results on the GeForce Titan GPU.
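The contrast the abstract draws — a simple scan with an extra $\log n$ factor in its work versus a work-optimal scan — can be illustrated by the round structure of the two classic algorithms. Below is a minimal sequential Python sketch (function names are mine, and the DMM/UMM memory-access cost is deliberately not modeled): each `while` iteration corresponds to one parallel round, so both scans take $O(\log n)$ rounds, but the doubling scan performs $O(n\log n)$ total additions while the two-phase scan performs $O(n)$.

```python
def tree_sum(a):
    """Tree reduction: pair up elements each round; O(log n) rounds, O(n) work."""
    x = list(a)
    while len(x) > 1:
        x = [x[i] + x[i + 1] if i + 1 < len(x) else x[i]
             for i in range(0, len(x), 2)]
    return x[0]

def prefix_sums_doubling(a):
    """Doubling (Hillis-Steele style) inclusive scan:
    O(log n) rounds but O(n log n) total work -- the non-optimal variant."""
    x = list(a)
    d = 1
    while d < len(x):
        # every element adds the value d positions to its left
        x = [x[i] + (x[i - d] if i >= d else 0) for i in range(len(x))]
        d *= 2
    return x

def prefix_sums_work_efficient(a):
    """Two-phase (up-sweep/down-sweep) inclusive scan:
    O(log n) rounds and O(n) total work. Assumes len(a) is a power of two."""
    x = list(a)
    n = len(x)
    # up-sweep: build partial sums at the internal nodes of a binary tree
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] += x[i - d]
        d *= 2
    # down-sweep: combine the partial sums into the remaining prefixes
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            x[i] += x[i - d]
        d //= 2
    return x
```

For example, `prefix_sums_work_efficient([3, 1, 4, 1, 5, 9, 2, 6])` returns `[3, 4, 8, 9, 14, 23, 25, 31]`. On the memory machine models, each round costs $O(\frac{n}{w} + \frac{nl}{p} + l)$ time units, which is why eliminating the $\log n$ work factor matters for matching the lower bound.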

## Keywords

Graphic Processing Unit · Memory Access · Shared Memory · Global Memory · Memory Bank
