# Theoretical Parallel Computing Models for GPU Computing

## Abstract

The latest GPUs are designed for general purpose computing and attract the attention of many application developers. The main purpose of this chapter is to introduce theoretical parallel computing models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), that capture the essence of CUDA-enabled GPUs. These models have three parameters: the number *p* of threads, the width *w* of the memory, and the memory access latency *l*. As examples of parallel algorithms on these theoretical models, we show fundamental algorithms for computing the sum and the prefix-sums of *n* numbers. We first show that the sum of *n* numbers can be computed in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units on the DMM and the UMM. We then go on to show that \(\varOmega ( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units are necessary to compute the sum. We also present a simple parallel algorithm for computing the prefix-sums that runs in \(O(\frac{n\log n} {w} + \frac{nl\log n} {p} + l\log n)\) time units on the DMM and the UMM. Clearly, this algorithm is not optimal. We then present an optimal parallel algorithm that computes the prefix-sums of *n* numbers in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units on the DMM and the UMM. Finally, we show several experimental results on the GeForce Titan GPU.

## 1 Introduction

Research into parallel algorithms has a long history of more than 40 years. Sequential algorithms have been developed mostly on the random access machine (RAM) [1]. In contrast, since there are a variety of connection methods and patterns between processors and memories, many parallel computing models have been presented and many parallel algorithmic techniques have been shown on them. The most well-studied parallel computing model is the parallel random access machine (PRAM) [5, 7, 30], which consists of processors and a shared memory. Each processor on the PRAM can access any address of the shared memory in a time unit. The PRAM is a good parallel computing model in the sense that parallelism of each problem can be revealed by the performance of parallel algorithms on the PRAM. However, since the PRAM requires a shared memory that can be accessed by all processors at the same time, it is not feasible.

*The graphics processing unit (GPU)* is a specialized circuit designed to accelerate computation for building and manipulating images [10, 11, 17, 31]. Latest GPUs are designed for general purpose computing and can perform computation in applications traditionally handled by the CPU. Hence, GPUs have recently attracted the attention of many application developers [10, 26, 27]. NVIDIA provides a parallel computing architecture called *CUDA* (compute unified device architecture) [29], the computing engine for NVIDIA GPUs. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in NVIDIA GPUs. In many cases, GPUs are more efficient than multicore processors [18], since they have hundreds of processor cores and very high memory bandwidth.

CUDA uses two types of memories in NVIDIA GPUs: *the global memory* and *the shared memory* [29]. The global memory is implemented in off-chip DRAMs with large capacity, say, 1.5–6 GB, but its access latency is very high. The shared memory is an extremely fast on-chip memory with lower capacity, say, 16–48 KB. The efficient usage of the global memory and the shared memory is a key for CUDA developers to accelerate applications using GPUs. In particular, we need to consider *the coalescing* of the global memory access and *the bank conflicts* of the shared memory access [17, 18, 28]. To maximize the bandwidth between the GPU and the DRAM chips, consecutive addresses of the global memory must be accessed at the same time. Thus, threads of CUDA should perform coalesced access when they access the global memory. The address space of the shared memory is mapped into several physical memory banks. If two or more threads access the same memory bank at the same time, the access requests are processed sequentially. Hence, to maximize the memory access performance, threads should access distinct memory banks to avoid bank conflicts.

*Memory machine models*, *the Discrete Memory Machine (DMM)* and *the Unified Memory Machine (UMM)* [22], are parallel computing models which reflect the essential features of the shared memory and the global memory of NVIDIA GPUs. The outline of the architectures of the DMM and the UMM is illustrated in Fig. 1. In both architectures, *a sea of threads (Ts)* is connected to *the memory banks (MBs)* through *the memory management unit (MMU)*. Each thread is a random access machine (RAM) [1], which can execute fundamental operations in a time unit. We do not discuss the architecture of the sea of threads in this chapter, but we can imagine that it consists of a set of multicore processors which can execute many threads in parallel. Threads are executed in SIMD [4] fashion: the processors run the same program and work on different data.

The MBs constitute a single address space of the memory, which is mapped to the MBs in an interleaved way such that a word of data of address *i* is stored in the \((i\bmod w)\)th bank, where *w* is the number of MBs. The main difference between the two architectures is the connection of the address line between the MMU and the MBs, which can transfer an address value. In the DMM, the address lines connect the MBs and the MMU separately, while in the UMM a single address line from the MMU is connected to the MBs. Hence, in the UMM, the same address value is broadcast to every MB, and only the same address of the MBs can be accessed in each time unit. On the other hand, different addresses of the MBs can be accessed in the DMM. Since the memory access of the UMM is more restricted than that of the DMM, the UMM is less powerful than the DMM.

The performance of algorithms of the PRAM is usually evaluated using two parameters: the size *n* of the input and the number *p* of processors. For example, it is well known that the sum of *n* numbers can be computed in \(O(\frac{n} {p} +\log n)\) time on the PRAM [5]. On the other hand, four parameters, the size *n* of the input, the number *p* of threads, the width *w*, and the latency *l* of the memory are used when the performance of algorithms on the DMM and on the UMM is evaluated. The width *w* is the number of memory banks and the latency *l* is the number of time units to complete the memory access. Hence, the performance of algorithms on the DMM and the UMM is evaluated as a function of *n* (the size of a problem), *p* (the number of threads), *w* (the width of a memory), and *l* (the latency of a memory). In NVIDIA GPUs, the width *w* of global and shared memory is 16 or 32. Also, the latency *l* of the global memory is several hundreds of clock cycles. In CUDA, a grid can have at most 65,535 blocks with at most 1,024 threads each [29]. Thus, the number *p* of threads can be 65 million.

Suppose that an array *a* of *n* numbers is given. The prefix-sums of *a* is the array of size *n* such that the *i*th (\(0 \leq i \leq n - 1\)) element is \(a[0] + a[1] + \cdots + a[i]\). Clearly, a sequential algorithm can compute the prefix-sums by executing \(a[i + 1] \leftarrow a[i + 1] + a[i]\) for all *i* (0 ≤ *i* ≤ *n* − 2). The computation of the prefix-sums of an array is one of the most important algorithmic procedures. Many sequential algorithms such as graph algorithms, geometric algorithms, image processing, and matrix computation call prefix-sums algorithms as a subroutine. In particular, many parallel algorithms use a parallel prefix-sum algorithm. For example, the prefix-sum computation is used to obtain the preorder, the in-order, and the post-order of a rooted binary tree in parallel [5]. So, it is very important to develop efficient parallel algorithms for the prefix-sums.
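
The sequential computation described above can be written down directly. The following is a minimal sketch in Python (used here as executable pseudocode; the function name is ours):

```python
def sequential_prefix_sums(a):
    """Compute the prefix-sums in place by a[i+1] <- a[i+1] + a[i]."""
    for i in range(len(a) - 1):   # 0 <= i <= n-2, in increasing order
        a[i + 1] += a[i]
    return a
```

For example, `sequential_prefix_sums([3, 1, 4, 1])` yields `[3, 4, 8, 9]`. Note that the increasing order of *i* is essential here; this strict data dependence is exactly what the parallel algorithms in Sects. 6 and 7 must break.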

As examples of parallel algorithms on the DMM and the UMM, this chapter shows algorithms for computing the sum and the prefix-sums. We first show that the sum of *n* numbers can be computed in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units using *p* threads on the DMM and the UMM with width *w* and latency *l*. We then go on to discuss the lower bound of the time complexity and show three lower bounds, \(\varOmega ( \frac{n} {w})\)-time bandwidth limitation, \(\varOmega (\frac{\mathit{nl}} {p} )\)-time latency limitation, and *Ω*(*l*log*n*)-time reduction limitation. From this discussion, the computation of the sum and the prefix-sums takes at least \(\varOmega ( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units on the DMM and the UMM. Thus, the sum algorithm is optimal. For the computation of the prefix-sums, we first evaluate the computing time of a well-known simple algorithm [8, 30]. We show that a simple prefix-sum algorithm runs in \(O(\frac{n\log n} {w} + \frac{nl\log n} {p} + l\log n)\) time. Hence, this fact shows this simple prefix-sum algorithm is not optimal, and it has an overhead of factor log*n* both for the bandwidth limitation \(\frac{n} {w}\) and for the latency limitation \(\frac{\mathit{nl}} {p}\). We show an optimal parallel algorithm that computes the prefix-sums of *n* numbers in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units on the DMM and the UMM.

This article is organized as follows. Section 2 introduces the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) [22]. In Sect. 3, we evaluate the computing time of the contiguous memory access to the memory of the DMM and the UMM. The contiguous memory access is a key ingredient of parallel algorithm development on the DMM and the UMM. Using the contiguous access, we show that the sum of *n* numbers can be computed in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units in Sect. 4. We then go on to discuss the lower bound of the time complexity and show three lower bounds, \(\varOmega ( \frac{n} {w})\)-time bandwidth limitation, \(\varOmega (\frac{\mathit{nl}} {p} )\)-time latency limitation, and *Ω*(*l*log*n*)-time reduction limitation in Sect. 5. Section 6 shows a simple prefix-sum algorithm, which runs in \(O(\frac{n\log n} {w} + \frac{nl\log n} {p} + l\log n)\) time units. In Sect. 7, we show an optimal parallel prefix-sum algorithm running in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units. Section 8 presents several implementation and experimental results of the sum and the prefix-sum algorithms using GeForce Titan GPU. In Sect. 9, we briefly show several published results on memory machine models. The final section offers conclusion of this article.

## 2 Memory Machine Models: DMM and UMM

The main purpose of this section is to define the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) [22].

We first define *the Discrete Memory Machine (DMM)* of width *w* and latency *l*. Let *m*[*i*] (*i* ≥ 0) denote a memory cell of address *i* in the memory. Let \(B[j] =\{ m[j],m[j + w],m[j + 2w],m[j + 3w],\ldots \}\) (0 ≤ *j* ≤ *w* − 1) denote *the jth bank* of the memory as illustrated in Fig. 2. Clearly, a memory cell *m*[*i*] is in the \((i\bmod w)\)th memory bank. We assume that memory cells in different banks can be accessed in a time unit, but no two memory cells in the same bank can be accessed in a time unit. Also, we assume that *l* time units are necessary to complete an access request and continuous requests are processed in a pipeline fashion through the MMU. Thus, it takes \(k + l - 1\) time units to complete *k* access requests to a particular bank.

We assume that *p* threads are partitioned into \(\frac{p} {w}\) groups of *w* threads called *warps*. More specifically, *p* threads are partitioned into \(\frac{p} {w}\) warps *W*(0), *W*(1), \(\ldots\), \(W( \frac{p} {w} - 1)\) such that \(W(i) =\{ T(i \cdot w),T(i \cdot w + 1),\ldots,T((i + 1) \cdot w - 1)\}\) (\(0 \leq i \leq \frac{p} {w} - 1\)). Warps are dispatched for memory access in turn, and *w* threads in a warp try to access the memory at the same time. In other words, \(W(0),W(1),\ldots,W( \frac{p} {w} - 1)\) are dispatched in a round-robin manner if at least one thread in a warp requests memory access. If no thread in a warp needs memory access, such a warp is not activated for memory access and is skipped. When *W*(*i*) is activated, the *w* threads in *W*(*i*) send memory access requests, one request per thread, to the memory. We also assume that a thread cannot send a new memory access request until the previous memory access request is completed. Hence, if a thread sends a memory access request, it must wait *l* time units before sending a new memory access request.

Let us consider an example of memory access illustrated in Fig. 3, where *p* = 8, *w* = 4, and *l* = 5. In the figure, *p* = 8 threads are partitioned into \(\frac{p} {w} = 2\) warps *W*(0) = {*T*(0), *T*(1), *T*(2), *T*(3)} and *W*(1) = {*T*(4), *T*(5), *T*(6), *T*(7)}. As illustrated in the figure, the four threads in *W*(0) try to access *m*[7], *m*[5], *m*[15], and *m*[0], and those in *W*(1) try to access *m*[10], *m*[11], *m*[12], and *m*[9]. The time for the memory access is evaluated under the assumption that memory access is processed by imaginary *l* pipeline stages with *w* registers each as illustrated in the figure. Each pipeline register in the first stage receives a memory access request from threads in an activated warp. The *i*th (0 ≤ *i* ≤ *w* − 1) pipeline register receives the request to the *i*th memory bank. In each time unit, a memory request in a pipeline register is moved to the next one. We assume that the memory access completes when the request reaches the last pipeline register.

Note that the architecture of pipeline registers illustrated in Fig. 3 is imaginary; it is used only for evaluating the computing time. The actual architecture would involve a multistage interconnection network [6, 14] or a sorting network [2, 3] to route memory access requests.

Let us evaluate the time for memory access on the DMM. First, the access requests for *m*[7], *m*[5], and *m*[0] are sent to the first pipeline stage. Since *m*[7] and *m*[15] are in the same bank *B*[3], their memory requests cannot be sent to the first stage at the same time. Next, the access request for *m*[15] is sent to the first stage. After that, the memory access requests for *m*[10], *m*[11], *m*[12], and *m*[9] are sent at the same time, because they are in different memory banks. Finally, after \(l - 1 = 4\) time units, these memory requests are processed. Hence, the DMM takes \(2 + 1 + 4 = 7\) time units to complete the memory access.

We next define *the Unified Memory Machine* (*UMM*) of width *w* as follows. Let \(A[j] =\{ m[j \cdot w],m[j \cdot w + 1],\ldots,m[(j + 1) \cdot w - 1]\}\) denote the *j*th address group as illustrated in Fig. 2. We assume that memory cells in the same address group are processed at the same time. However, if they are in different groups, one time unit is necessary for each of the groups. Also, similarly to the DMM, *p* threads are partitioned into warps and each warp accesses the memory in turn.

Again, let us evaluate the time for memory access using Fig. 3 on the UMM for *p* = 8, *w* = 4, and *l* = 5. The memory access requests by *W*(0) are in three address groups. Thus, three time units are necessary to send them to the first stage of pipeline registers. Next, two time units are necessary to send the memory access requests by *W*(1), because they are in two address groups. After that, it takes \(l - 1 = 4\) time units to process the memory access requests. Hence, in total, \(3 + 2 + 4 = 9\) time units are necessary to complete all memory access.
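
The two evaluations above can be reproduced by a small simulator. The following Python sketch (function names are ours) charges each warp one time unit per conflicting request to the same bank on the DMM, and one time unit per distinct address group on the UMM, plus the trailing \(l - 1\) pipeline time units:

```python
def dmm_access_time(warps, w, l):
    """Time units for one round of memory access on the DMM.

    Requests of a warp to the same bank (addr mod w) must be sent in
    separate time units; the last request needs l - 1 further time
    units to traverse the remaining pipeline stages.
    """
    send = 0
    for warp in warps:
        banks = [addr % w for addr in warp]
        send += max(banks.count(b) for b in set(banks))  # bank conflicts
    return send + (l - 1)

def umm_access_time(warps, w, l):
    """Same evaluation on the UMM: each distinct address group
    (addr // w) touched by a warp costs one time unit."""
    send = 0
    for warp in warps:
        send += len({addr // w for addr in warp})        # address groups
    return send + (l - 1)

# The example of Fig. 3: W(0) and W(1) with p = 8, w = 4, l = 5.
warps = [[7, 5, 15, 0], [10, 11, 12, 9]]
```

With these inputs, `dmm_access_time(warps, 4, 5)` evaluates to 7 and `umm_access_time(warps, 4, 5)` to 9, matching the hand evaluations above.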

## 3 Contiguous Memory Access

The main purpose of this section is to review the contiguous memory access on the DMM and the UMM shown in [22]. Suppose that an array *a* of size *n* ( ≥ *p*) is given. We use *p* threads to access all of the *n* memory cells in *a* such that each thread accesses \(\frac{n} {p}\) memory cells. Note that “accessing” can be “reading from” or “writing in.” Let *a*[*i*] (0 ≤ *i* ≤ *n* − 1) denote the *i*th memory cell in *a*. When *n* ≥ *p*, the contiguous access can be performed as follows:
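
A sequential sketch of this access pattern (Python as executable pseudocode; here “accessing” means reading) is:

```python
def contiguous_access(a, p):
    """Each of p threads accesses n/p cells: in round t, thread j
    touches a[p*t + j], so a warp of w consecutive threads always hits
    w consecutive addresses (distinct banks / one address group)."""
    n = len(a)
    total = 0
    for t in range(n // p):          # n/p rounds
        for j in range(p):           # conceptually in parallel
            total += a[p * t + j]    # the "access" performed by thread j
    return total
```

For example, `contiguous_access(list(range(16)), 4)` reads every cell exactly once and returns the sum 120.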

Let us evaluate the computing time. For each *t* (\(0 \leq t \leq \frac{n} {p} - 1\)), *p* threads access the *p* memory cells \(a[\mathit{pt}],a[\mathit{pt} + 1],\ldots,a[p(t + 1) - 1]\). This memory access is performed by \(\frac{p} {w}\) warps in turn. More specifically, first, *w* threads in *W*(0) access \(a[\mathit{pt}],a[\mathit{pt} + 1],\ldots,a[\mathit{pt} + w - 1]\). After that, *w* threads in *W*(1) access \(a[\mathit{pt} + w],a[\mathit{pt} + w + 1],\ldots,a[\mathit{pt} + 2w - 1]\), and the same operation is repeatedly performed. In general, *w* threads in \(W(j)\) (\(0 \leq j \leq \frac{p} {w} - 1\)) access \(a[\mathit{pt} + \mathit{jw}],a[\mathit{pt} + \mathit{jw} + 1],\ldots,a[\mathit{pt} + (j + 1)w - 1]\). Since the *w* memory cells accessed by a warp are in different banks, the access can be completed in *l* time units on the DMM. Also, these *w* memory cells are in the same address group, and thus, the access can be completed in *l* time units on the UMM.

Hence, the *w* threads in each *W*(*j*) send *w* memory access requests in one time unit. Let us consider two cases:

- Case 1: \(\frac{p} {w} \leq l\). If this is the case, \(\frac{p} {w}\) warps send memory access requests in turn, and the first memory access requests by the first warp *W*(0) are completed in *l* time units as illustrated in Fig. 4. After that, the *w* threads in *W*(0) can send the next memory access requests immediately, and they can be completed in *l* time units. This is repeated \(\frac{n} {p}\) times. After all memory access requests by *W*(0) are completed, it takes \(\frac{p} {w} - 1\) more time units to complete the last memory access requests by \(W( \frac{p} {w} - 1)\). Hence, the contiguous access can be done in \(\frac{\mathit{nl}} {p} + \frac{p} {w} - 1\) time units.
- Case 2: \(\frac{p} {w} > l\). When the memory access requests by the *w* threads in *W*(0) are completed, they cannot send the next memory requests immediately. They must wait until the *w* threads in \(W( \frac{p} {w} - 1)\) have sent their memory access requests. Hence, all warps send memory access requests continuously in turn as illustrated in Fig. 4. Since each of the \(\frac{p} {w}\) warps sends memory access requests \(\frac{n} {p}\) times, it takes \(\frac{p} {w} \cdot \frac{n} {p} = \frac{n} {w}\) time units to send all memory access requests. After that, the last memory access requests by the last warp are completed in *l* − 1 time units. Hence, the contiguous access can be done in \(\frac{n} {w} + l - 1\) time units.

With cases 1 and 2 combined, the contiguous access can be done in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l)\) time units. Further, if *n* < *p*, the contiguous memory access can simply be done using *n* of the *p* threads. If this is the case, the memory access can be done in \(O( \frac{n} {w} + l)\) time units. Therefore, we have,

### Lemma 1

*The contiguous access to an array of size n can be done in*\(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l)\)*time using p threads on the UMM and the DMM with width w and latency l.*
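
The exact time-unit counts derived in the two cases above can be packaged as a small helper (our naming; assumes *n* ≥ *p*, that *w* divides *p*, and that *p* divides *n*):

```python
def contiguous_time(n, p, w, l):
    """Exact time units for the contiguous access of n cells by p
    threads on a machine of width w and latency l (two-case formula)."""
    if p // w <= l:                      # Case 1: pipeline not saturated
        return n * l // p + p // w - 1   # nl/p + p/w - 1
    return n // w + l - 1                # Case 2: n/w + l - 1
```

For instance, with *n* = 1,024, *w* = 16, and *l* = 8, using *p* = 64 threads gives Case 1 (\(\frac{p}{w} = 4 \leq 8\)) and 131 time units, while *p* = 256 threads gives Case 2 and 71 time units; both are \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l)\) as Lemma 1 states.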

## 4 An Optimal Parallel Algorithm for Computing the Sum

The main purpose of this section is to show an optimal parallel algorithm for computing the sum on the memory machine models.

Let *a* be an array of *n* = 2^{m} numbers. Let us show an algorithm to compute the sum \(a[0] + a[1] + \cdots + a[n - 1]\). The algorithm uses a well-known parallel computing technique which repeatedly computes the sums of pairs. We implement this technique to perform contiguous memory access. The details are spelled out as follows:
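
A sequential sketch of this pairwise scheme (Python as executable pseudocode; *t* is taken in decreasing order so that each step halves the number of partial sums, and the inner loop is conceptually parallel with contiguous access):

```python
def pairwise_sum(a):
    """For t = m-1 down to 0, perform the 2^t operations
    a[i] <- a[i] + a[i + 2^t]; a[0] ends up holding the sum."""
    n = len(a)                        # n = 2^m
    m = n.bit_length() - 1
    for t in range(m - 1, -1, -1):
        for i in range(2 ** t):       # contiguous, conceptually parallel
            a[i] += a[i + 2 ** t]
    return a[0]
```

For example, `pairwise_sum(list(range(8)))` returns 28 after log 8 = 3 rounds.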

We use *p* threads to compute the sum. For each *t* (0 ≤ *t* ≤ *m* − 1), 2^{t} operations “\(a[i] \leftarrow a[i] + a[i + 2^{t}]\)” are performed. These operations involve the following memory access operations:

reading from \(a[0],a[1],\ldots,a[2^{t} - 1]\),

reading from \(a[2^{t}],a[2^{t} + 1],\ldots,a[2 \cdot 2^{t} - 1]\), and

writing in \(a[0],a[1],\ldots,a[2^{t} - 1]\).

From Lemma 1, each of these operations can be done in \(O(\frac{2^{t}} {w} + \frac{2^{t}l} {p} + l)\) time units using *p* threads both on the DMM and on the UMM with width *w* and latency *l*. Thus, the total computing time is \(\sum _{t=0}^{m-1}O(\frac{2^{t}} {w} + \frac{2^{t}l} {p} + l) = O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\), and we have

### Theorem 1

*The sum of n numbers can be computed in*\(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\)*time units using p threads on the DMM and on the UMM with width w and latency l.*

## 5 The Lower Bound of the Computing Time

Let us discuss the lower bound of the time necessary to compute the sum on the DMM and the UMM to show that our parallel summing algorithm for Theorem 1 is optimal. Since the sum is the last value of the prefix-sums, this lower bound discussion for the sum can be applied to that for the prefix-sums. We will show three lower bounds of the sum, \(\varOmega ( \frac{n} {w})\)-time bandwidth limitation, \(\varOmega (\frac{\mathit{nl}} {p} )\)-time latency limitation, and *Ω*(*l*log*n*)-time reduction limitation.

Since the width of the memory is *w*, at most *w* numbers in the memory can be read in a time unit. Clearly, all of the *n* numbers must be read to compute the sum. Hence, \(\varOmega ( \frac{n} {w})\) time units are necessary to compute the sum. We call the \(\varOmega ( \frac{n} {w})\)-time lower bound *the bandwidth limitation*.

Since the memory access takes latency *l*, a thread can send at most \(\frac{t} {l}\) memory read requests in *t* time units. Thus, *p* threads can send at most \(\frac{pt} {l}\) total memory requests in *t* time units. Since at least *n* numbers in the memory must be read to compute the sum, \(\frac{pt} {l} \geq n\) must be satisfied. Thus, at least \(t =\varOmega (\frac{\mathit{nl}} {p} )\) time units are necessary. We call the \(\varOmega (\frac{\mathit{nl}} {p} )\)-time lower bound *the latency limitation*.

Next, we will show *the reduction limitation*, the *Ω*(*l*log*n*)-time lower bound. The formal proof is more complicated than those for the bandwidth limitation and the latency limitation.

Imagine that each of *n* input numbers stored in the shared memory (or the global memory) is a token and each thread is a box. Whenever two tokens are placed in a box, they are merged into one immediately. We can move tokens to boxes and each box can accept at most one token in *l* time units. Suppose that we have *n* tokens outside boxes. We will prove that it takes at least *l*log*n* time units to merge them into one token. For this purpose, we will prove that if we have *n*^{′} tokens at some time, we must have at least \(\frac{n^{{\prime}}} {2}\) tokens *l* time units later. Suppose that we have *n*^{′} tokens such that *k* of them are in *k* boxes and the remaining *n*^{′}− *k* tokens are out of boxes. If *k* ≤ *n*^{′}− *k*, then we can move *k* tokens to *k* boxes and can merge *k* pairs of tokens in *l* time units. After that, *n*^{′}− *k* tokens remain. If *k* > *n*^{′}− *k*, then we can merge *n*^{′}− *k* pairs of tokens and we have *k* tokens after *l* time units. Hence, after *l* time units, we have at least \(\max (n^{{\prime}}- k,k) \geq \frac{n^{{\prime}}} {2}\) tokens. Thus, we must have at least \(\frac{n^{{\prime}}} {2}\) tokens *l* time units later. In other words, it is not possible to reduce the number of tokens by less than half. Hence, in *t* time units, we have at least \(\frac{n} {2^{\frac{t} {l} }}\) tokens. Since \(\frac{n} {2^{\frac{t} {l} }} \leq 1\) must be satisfied, it takes at least *t* ≥ *l*log*n* time units to merge *n* tokens into one. It should be clear that reading a number by a thread from the shared memory (or the global memory) corresponds to a token movement to a box. Therefore, it takes at least *Ω*(*l*log*n*) time units to compute the sum of *n* numbers.

From the discussion above, we have

### Theorem 2

*Both the DMM and the UMM with p threads, width w, and latency l take at least*\(\varOmega ( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\)*time units to compute the sum of n numbers.*

From Theorem 2, the parallel algorithm for computing the sum shown for Theorem 1 is optimal.

## 6 A Simple Prefix-Sum Algorithm

Suppose that an array *a* with *n* = 2^{m} numbers is given. Let us start with a well-known simple prefix-sum algorithm for array *a* [8, 9] and show that it is not optimal. The simple prefix-sum algorithm is written as follows:
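
Since the listing itself is the standard one of [8, 9], we give only a sequential sketch in Python; the copy `prev` mimics the synchronous parallel reads (all reads of a round happen before any write of that round):

```python
def simple_prefix_sums(a):
    """For each t, a[i] <- a[i] + a[i - 2^t] for all i >= 2^t,
    conceptually in parallel; after log n rounds a holds the
    prefix-sums."""
    n = len(a)                        # n = 2^m
    m = n.bit_length() - 1
    for t in range(m):
        prev = a[:]                   # snapshot: reads precede writes
        for i in range(2 ** t, n):    # conceptually parallel
            a[i] = prev[i] + prev[i - 2 ** t]
    return a
```

For example, `simple_prefix_sums([3, 1, 4, 1])` returns `[3, 4, 8, 9]`. Note that round *t* touches \(n - 2^{t}\) elements, which is the source of the log*n* overhead analyzed below.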

We assume that *p* threads are available and evaluate the computing time of the simple prefix-sum algorithm. The following three memory access operations are performed for each *t* (0 ≤ *t* ≤ *m* − 1):

reading from \(a[0],a[1],\ldots,a[n - 2^{t} - 1]\),

reading from \(a[2^{t}],a[2^{t} + 1],\ldots,a[n - 1]\), and

writing in \(a[2^{t}],a[2^{t} + 1],\ldots,a[n - 1]\).

Each of these operations accesses \(n - 2^{t}\) elements. Hence, the computing time for each *t* is \(O(\frac{n-2^{t}} {w} + \frac{(n-2^{t})l} {p} + l)\) from Lemma 1. The total computing time is \(\sum _{t=0}^{m-1}O(\frac{n-2^{t}} {w} + \frac{(n-2^{t})l} {p} + l) = O(\frac{n\log n} {w} + \frac{nl\log n} {p} + l\log n)\), and we have

### Lemma 2

*The simple prefix-sum algorithm runs in*\(O(\frac{n\log n} {w} + \frac{nl\log n} {p} + l\log n)\)*time units using p threads on the DMM and on the UMM with width w and latency l.*

If the computing time of Lemma 2 matches the lower bound shown in Theorem 2, the prefix-sum algorithm is optimal. However, it does not match the lower bound. In the following section, we will show an optimal prefix-sum algorithm whose running time matches the lower bound.

## 7 An Optimal Prefix-Sum Algorithm

This section shows an optimal algorithm for the prefix-sums running in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units. We use *m* arrays \(a_{0},a_{1},\ldots,a_{m-1}\) as work space. Each *a*_{t} (0 ≤ *t* ≤ *m* − 1) can store 2^{t} numbers. Thus, the total size of the *m* arrays is no more than \(2^{0} + 2^{1} + \cdots + 2^{m-1} = 2^{m} - 1 = n - 1\). We assume that the input of *n* numbers is stored in array *a*_{m} of size *n*.

The algorithm has two stages. In the first stage, interval sums are stored in the *m* arrays. The second stage uses interval sums in the *m* arrays to compute the resulting prefix-sums. The details of the first stage are spelled out as follows.
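
A sequential sketch of the first stage (Python as executable pseudocode; the list `a` plays the role of the arrays \(a_{0},\ldots,a_{m}\), with `a[m]` holding the input):

```python
def interval_sums(a_m):
    """First stage: for t = m-1 down to 0, compute
    a_t[i] = a_{t+1}[2i] + a_{t+1}[2i+1], so that a_t[i] holds the sum
    of the input block of length n/2^t starting at i * n/2^t."""
    n = len(a_m)                      # n = 2^m
    m = n.bit_length() - 1
    a = [None] * m + [list(a_m)]      # a[t] will have 2^t cells
    for t in range(m - 1, -1, -1):
        a[t] = [a[t + 1][2 * i] + a[t + 1][2 * i + 1]
                for i in range(2 ** t)]
    return a
```

For input `[1, 2, 3, 4]` this yields `a[1] == [3, 7]` (the two half sums) and `a[0] == [10]` (the total sum).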

When the first stage terminates, each *a*_{t}[*i*] (\(0 \leq t \leq m - 1,0 \leq i \leq 2^{t} - 1\)) stores the interval sum \(a_{m}[i \cdot \frac{n} {2^{t}}] + a_{m}[i \cdot \frac{n} {2^{t}} + 1] + \cdots + a_{m}[(i + 1) \cdot \frac{n} {2^{t}} - 1]\).

In the second stage, the prefix-sums are computed by computing the sums of the interval sums as follows:

Figure 7 shows how the prefix-sums are computed. In the figure, “\(a_{t+1}[2 \cdot i + 1] \leftarrow a_{t}[i]\)” and “\(a_{t+1}[2 \cdot i + 2] \leftarrow a_{t+1}[2 \cdot i + 2] + a_{t}[i]\)” correspond to “copy” and “add,” respectively.
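
Putting the two stages together, a self-contained sequential sketch (Python as executable pseudocode; array names follow the text) is:

```python
def optimal_prefix_sums(a_m):
    """Stage 1 builds the interval sums a_t[i]; stage 2 pushes them
    back down with the "copy" step a_{t+1}[2i+1] <- a_t[i] and the
    "add" step a_{t+1}[2i+2] <- a_{t+1}[2i+2] + a_t[i]."""
    n = len(a_m)                      # n = 2^m
    m = n.bit_length() - 1
    a = [None] * m + [list(a_m)]
    for t in range(m - 1, -1, -1):    # first stage: interval sums
        a[t] = [a[t + 1][2 * i] + a[t + 1][2 * i + 1]
                for i in range(2 ** t)]
    for t in range(m):                # second stage: copy and add
        for i in range(2 ** t):       # conceptually parallel
            a[t + 1][2 * i + 1] = a[t][i]            # copy
            if 2 * i + 2 < 2 ** (t + 1):
                a[t + 1][2 * i + 2] += a[t][i]       # add
    return a[m]
```

For example, `optimal_prefix_sums([1, 2, 3, 4])` returns `[1, 3, 6, 10]`.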

When the second stage terminates, each *a*_{m}[*i*] (\(0 \leq i \leq n - 1\)) stores the prefix-sum \(a_{m}[0] + a_{m}[1] + \cdots + a_{m}[i]\). We assume that *p* threads are available and evaluate the computing time. The first stage involves the following memory access operations for each *t* (0 ≤ *t* ≤ *m* − 1):

reading from \(a_{t+1}[0],a_{t+1}[2],\ldots,a_{t+1}[2^{t+1} - 2]\),

reading from \(a_{t+1}[1],a_{t+1}[3],\ldots,a_{t+1}[2^{t+1} - 1]\), and

writing in \(a_{t}[0],a_{t}[1],\ldots,a_{t}[2^{t} - 1]\).

The second stage involves the following memory access operations for each *t* (0 ≤ *t* ≤ *m* − 1):

reading from \(a_{t}[0],a_{t}[1],\ldots,a_{t}[2^{t} - 1]\),

reading from \(a_{t+1}[2],a_{t+1}[4],\ldots,a_{t+1}[2^{t+1} - 2]\), and

writing in \(a_{t+1}[0],a_{t+1}[1],\ldots,a_{t+1}[2^{t+1} - 1]\).

Similarly, these operations can be done in \(O(\frac{2^{t}} {w} + \frac{2^{t}l} {p} + l)\) time units. Hence, the total computing time of the second stage is also \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\). Thus, we have

### Theorem 3

*The prefix-sums of n numbers can be computed in*\(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\)*time units using p threads on the DMM and the UMM with width w and latency l.*

From Theorem 2, the lower bound of the computing time of the prefix-sums is \(\varOmega ( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\). Thus, the prefix-sum algorithm for Theorem 3 is optimal.

## 8 Experimental Results

This section is devoted to showing experimental results. We have implemented the algorithm for computing the sum (Theorem 1) on the shared memory and the global memory of GeForce Titan and evaluated the performance. We have also implemented the simple prefix-sum algorithm (Lemma 2) and the optimal prefix-sum algorithm (Theorem 3) on GeForce Titan. GeForce Titan has 2688 processor cores in 14 streaming multiprocessors.

Figure 8 shows the computing time of the three algorithms on the shared memory. We have used \(\frac{n} {2}\) threads for *n* 32-bit (float) numbers when *n* ≤ 2,048 and 1,024 threads for *n* = 4,096, because a CUDA block can have up to 1,024 threads. Since the capacity of the shared memory is up to 48 KB, we can implement these algorithms for up to 4,096 32-bit numbers. The running times of the sum algorithm and of the simple prefix-sum algorithm are almost the same for small *n*, because the latency overhead *l*log*n* is dominant in the computing time. On the other hand, the optimal prefix-sum algorithm takes much more computing time, because its latency overhead is more than 2*l*log*n*. However, the bandwidth overhead \(\frac{n\log n} {w}\) of the simple prefix-sum algorithm is dominant when *n* is large. Hence, the running time of the simple prefix-sum algorithm is larger than the others for large *n*.

Figure 8 also shows the computing time for 1K-128M (2^{10} − 2^{27}) 32-bit (float) numbers on the global memory. We have used \(\frac{n} {2}\) threads in multiple CUDA blocks for *n* 32-bit (float) numbers. We can see that the bandwidth overhead \(O( \frac{n} {w})\) for the sum algorithm and the optimal prefix-sum algorithm and \(O(\frac{n\log n} {w} )\) for the simple prefix-sum algorithm is dominant for large *n*. For small *n*, the latency overhead *l*log*n* is dominant.

## 9 Open Problems in Memory Machine Models

A common approach in GPU computing research is to develop and implement parallel algorithms using CUDA and to evaluate their performance on a specific GPU. However, we are not able to evaluate the goodness of parallel algorithms by the running time on a specific GPU, since the running time depends on many factors, including programming skills, compiler optimization, GPU card models, and CUDA versions. It is very important to evaluate the performance of parallel algorithms for GPU computing by theoretical analysis. For this reason, we have published memory machine models for GPU computing and several algorithmic techniques on memory machine models.

We still have a lot of things to do in the area of theoretical research for GPU computing. Many researchers have developed parallel algorithmic techniques on traditional parallel computing models [5, 7, 15, 30] such as PRAMs. However, in many cases, direct implementation of these techniques on the memory machine models is not efficient. We need to consider the memory access characteristics of the memory machine models when we design parallel algorithms on them. There are a lot of open problems in developing algorithmic techniques for graph problems, geometric problems, and optimization problems, among others.

### Conclusion

This chapter introduced theoretical parallel computing models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM), that capture the essence of CUDA-enabled GPUs. We have shown that the sum and the prefix-sums of *n* numbers can be computed in \(O( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units on the DMM and the UMM. We have also shown that \(\varOmega ( \frac{n} {w} + \frac{\mathit{nl}} {p} + l\log n)\) time units are necessary to compute the sum. Finally, we have shown several experimental results on a CUDA-enabled GPU.

### References

- 1. A.V. Aho, J.D. Ullman, J.E. Hopcroft, *Data Structures and Algorithms* (Addison Wesley, Boston, 1983)
- 2. S.G. Akl, *Parallel Sorting Algorithms* (Academic, London, 1985)
- 3. K.E. Batcher, Sorting networks and their applications, in *Proc. AFIPS Spring Joint Comput. Conf.*, vol. 32, pp. 307–314, 1968
- 4. M.J. Flynn, Some computer organizations and their effectiveness. IEEE Trans. Comput. **C-21**, 948–960 (1972)
- 5. A. Gibbons, W. Rytter, *Efficient Parallel Algorithms* (Cambridge University Press, New York, 1988)
- 6. A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph, M. Snir, The NYU Ultracomputer – designing an MIMD shared memory parallel computer. IEEE Trans. Comput. **C-32**(2), 175–189 (1983)
- 7. A. Grama, G. Karypis, V. Kumar, A. Gupta, *Introduction to Parallel Computing* (Addison Wesley, Boston, 2003)
- 8. M. Harris, S. Sengupta, J.D. Owens, Parallel prefix sum (scan) with CUDA (Chapter 39), in *GPU Gems 3* (Addison Wesley, Boston, 2007)
- 9. W.D. Hillis, G.L. Steele Jr., Data parallel algorithms. Commun. ACM **29**(12), 1170–1183 (1986). doi:10.1145/7902.7903
- 10. W.W. Hwu, *GPU Computing Gems*, Emerald Edition (Morgan Kaufmann, San Francisco, 2011)
- 11. Y. Ito, K. Ogawa, K. Nakano, Fast ellipse detection algorithm using Hough transform on the GPU, in *Proc. of International Conference on Networking and Computing*, pp. 313–319, 2011
- 12. A. Kasagi, K. Nakano, Y. Ito, Offline permutation algorithms on the discrete memory machine with performance evaluation on the GPU. IEICE Trans. Inf. Syst. **E96-D**(12), 2617–2625 (2013)
- 13. A. Kasagi, K. Nakano, Y. Ito, An optimal offline permutation algorithm on the hierarchical memory machine, with the GPU implementation, in *Proc. of International Conference on Parallel Processing*, pp. 1–10, 2013
- 14. D.H. Lawrie, Access and alignment of data in an array processor. IEEE Trans. Comput. **C-24**(12), 1145–1155 (1975)
- 15. F.T. Leighton, *Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes* (Morgan Kaufmann, San Francisco, 1991)
- 16. D. Man, K. Nakano, Y. Ito, The approximate string matching on the hierarchical memory machine, with performance evaluation, in
*Proc. of International Symposium on Embedded Multicore/Many-core System-on-Chip*, pp. 79–84, 2013Google Scholar - 17.D. Man, K. Uda, Y. Ito, K. Nakano, A GPU implementation of computing Euclidean distance map with efficient memory access, in
*Proc. of International Conference on Networking and Computing*, pp. 68–76, 2011Google Scholar - 18.D. Man, K. Uda, H. Ueyama, Y. Ito, K. Nakano, Implementations of a parallel algorithm for computing euclidean distance map in multicore processors and GPUs. Int. J. Netw. Comput.
**1**(2), 260–276 (2011)Google Scholar - 19.K. Nakano, Asynchronous memory machine models with barrier synchronization, in
*Proc. of International Conference on Networking and Computing*, pp. 58–67, 2012Google Scholar - 20.K. Nakano, Efficient implementations of the approximate string matching on the memory machine models, in
*Proc. of International Conference on Networking and Computing*, pp. 233–239, 2012Google Scholar - 21.K. Nakano, An optimal parallel prefix-sums algorithm on the memory machine models for GPUs, in
*Proc. of International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP)*. Lecture Notes in Computer Science, vol. 7439 (Springer, Berlin, 2012), pp. 99–113Google Scholar - 22.K. Nakano, Simple memory machine models for GPUs, in
*Proc. of International Parallel and Distributed Processing Symposium Workshops*, pp. 788–797, 2012Google Scholar - 23.K. Nakano, The hierarchical memory machine model for GPUs, in
*Proc. of International Parallel and Distributed Processing Symposium Workshops*, pp. 591–600, 2013Google Scholar - 24.K. Nakano, Sequential memory access on the unified memory machine with application to the dynamic programming, in
*Proc. of International Symposium on Computing and Networking*, pp. 85–94, 2013Google Scholar - 25.K. Nakano, S. Matsumae, Y. Ito, The random address shift to reduce the memory access congestion on the discrete memory machine, in
*Proc. of International Symposium on Computing and Networking*, pp. 95–103, 2013Google Scholar - 26.K. Nishida, Y. Ito, K. Nakano, Accelerating the dynamic programming for the matrix chain product on the GPU, in
*Proc. of International Conference on Networking and Computing*, pp. 320–326, 2011Google Scholar - 27.K. Nishida, Y. Ito, K. Nakano, Accelerating the dynamic programming for the optimal poygon triangulation on the GPU, in
*Proc. of International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP)*. Lecture Notes in Computer Science, vol. 7439 (Springer, Berlin, 2012), pp. 1–15Google Scholar - 28.NVIDIA Corporation, NVIDIA CUDA C best practice guide version 3.1 (2010)Google Scholar
- 29.NVIDIA Corporation, NVIDIA CUDA C programming guide version 5.0 (2012)Google Scholar
- 30.M.J. Quinn,
*Parallel Computing: Theory and Practice*(McGraw-Hill, New York, 1994)Google Scholar - 31.A. Uchida, Y. Ito, K. Nakano, Fast and accurate template matching using pixel rearrangement on the GPU, in
*Proc. of International Conference on Networking and Computing*, pp. 153–159, 2011Google Scholar