Abstract
The latest GPUs are designed for general-purpose computing and attract the attention of many application developers. The main purpose of this chapter is to introduce theoretical parallel computing models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), which capture the essence of CUDA-enabled GPUs. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. As examples of parallel algorithms on these theoretical models, we show fundamental algorithms for computing the sum and the prefix-sums of n numbers. We first show that the sum of n numbers can be computed in \(O( \frac{n} {w} + \frac{nl} {p} + l\log n)\) time units on the DMM and the UMM. We then show that \(\varOmega ( \frac{n} {w} + \frac{nl} {p} + l\log n)\) time units are necessary to compute the sum, so this algorithm is optimal. We also present a simple parallel algorithm for computing the prefix-sums that runs in \(O(\frac{n\log n} {w} + \frac{nl\log n} {p} + l\log n)\) time units on the DMM and the UMM; clearly, this algorithm is not optimal. We therefore present an optimal parallel algorithm that computes the prefix-sums of n numbers in \(O( \frac{n} {w} + \frac{nl} {p} + l\log n)\) time units on the DMM and the UMM. Finally, we show several experimental results on a GeForce Titan GPU.
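To make the prefix-sums bound concrete, the following sketch simulates the classic work-efficient two-phase (up-sweep/down-sweep) scan; this is a standard illustration of how prefix-sums can be computed in O(log n) parallel rounds with O(n) total work, not the chapter's specific DMM/UMM algorithm. Each `while` iteration corresponds to one parallel round, and all iterations of the inner `for` loop are independent, so they could run as concurrent threads.

```python
def prefix_sums(a):
    """Inclusive prefix-sums of a (length assumed a power of two),
    computed in 2 log n rounds; each inner loop is one parallel round."""
    n = len(a)
    t = list(a)
    # Up-sweep: build partial sums in a binary-tree pattern.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):  # independent updates
            t[i] += t[i - d]
        d *= 2
    # Down-sweep: distribute the partial sums so that
    # t[i] = a[0] + a[1] + ... + a[i] for every i.
    d = n // 2
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):  # independent updates
            t[i] += t[i - d]
        d //= 2
    return t
```

On the memory machine models, the cost of each round additionally depends on how the p threads' accesses align with the memory width w and latency l, which is exactly what the O(n/w + nl/p + l log n) bound accounts for.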
© 2014 Springer International Publishing Switzerland
Cite this chapter
Nakano, K. (2014). Theoretical Parallel Computing Models for GPU Computing. In: Koç, Ç. (eds) Open Problems in Mathematics and Computational Science. Springer, Cham. https://doi.org/10.1007/978-3-319-10683-0_14
DOI: https://doi.org/10.1007/978-3-319-10683-0_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10682-3
Online ISBN: 978-3-319-10683-0