# Exploiting Batch Processing on Streaming Architectures to Solve 2D Elliptic Finite Element Problems: A Hybridized Discontinuous Galerkin (HDG) Case Study

## Abstract

Numerical methods for elliptic partial differential equations (PDEs) within both continuous and hybridized discontinuous Galerkin (HDG) frameworks share the same general structure: local (elemental) matrix generation followed by a global linear system assembly and solve. The lack of inter-element communication and the easily parallelizable nature of the local matrix generation stage, coupled with the parallelization techniques developed for linear system solvers, make a numerical scheme for elliptic PDEs a good candidate for implementation on streaming architectures such as modern graphical processing units (GPUs). We propose an algorithmic pipeline for mapping an elliptic finite element method to the GPU and perform a case study for a particular method within the HDG framework. This study compares CPU and GPU implementations of the method and highlights certain performance-crucial implementation details. The choice of the HDG method for the case study was dictated by the computationally heavy local matrix generation stage as well as the reduced trace-based communication pattern, which together make the method amenable to the fine-grained parallelism of GPUs. We demonstrate that the HDG method is well-suited for GPU implementation, obtaining total speedups on the order of 30–35 times over a serial CPU implementation for moderately sized problems.

### Keywords

High-order finite elements · Spectral/\(hp\) elements · Discontinuous Galerkin method · Hybridization · Streaming processors · Graphical processing units (GPUs)

## 1 Introduction

In the last decade, commodity streaming processors such as those found in graphical processing units (GPUs) have arisen as a driving platform for heterogeneous parallel processing with strong scalability, power and computational efficiency [1]. In the past few years, a number of algorithms have been developed to harness the processing power of GPUs for a number of problems which require multi-element processing techniques [2, 3]. This work is motivated by our attempt to find effective ways of mapping continuous and hybridized discontinuous Galerkin (HDG) methods to the GPU. Significant gains in performance have been made when combining GPUs with discontinuous Galerkin (DG) for hyperbolic problems (e.g. [4]); in this work, we focus on whether similar gains can be achieved when solving elliptic problems.

Note that within a hyperbolic setting, each time step of a DG method algorithmically consists of a single parallel update step where the inter-element communication is limited to the numerical flux computation that is performed locally. In the case of many elliptic operator discretizations, however, one is required to solve a linear system in order to find the values of globally coupled unknowns. The linear system in question can be reduced in size if the static condensation (Schur complement) technique is applied, but it has to be solved nevertheless. Depending on the choice of linear solver, the system matrix can either be explicitly assembled or stored as a collection of elemental matrices accompanied by the local-to-global mapping data. In this particular work we have chosen to explicitly assemble the system matrix on the GPU to match the CPU code used for comparison.

Due to the different structure of numerical methods for elliptic PDEs and the unavoidable global coupling of unknowns, one usually breaks the solution process into several stages: local (elemental) matrix generation, global linear system matrix assembly, and global linear system solve. If static condensation is applied and the global linear system is solved for the trace solution (the solution on the boundary of elements), there is an additional stage of recovering the elemental solution from the trace data. Each of the stages outlined above benefits from parallelization on the GPU to a different degree: the local matrix generation stage benefits much more than the assembly and global solve stages, because the operations performed are completely independent for different elements.

The goals this paper pursues are the following: (a) to provide the reader with an intuition regarding the overall benefit that parallelization on streaming architectures provides to numerical methods for elliptic problems as well as per-stage benefits and the runtime trends for different stages; (b) to propose a pipeline for solving 2D elliptic finite element problems on GPUs and provide a case study to understand the benefits of GPU implementation for numerical problems formulated within the HDG framework; (c) to propose a per-edge assembly as a more efficient approach than the traditional per-element assembly, given the structure of the HDG method and the restrictions of the current generation of SIMD hardware. The key ingredients of our proposed approach are the mathematical nature of the HDG method and the batch processing capabilities (and algorithmic limitations) of the GPU. The choice of method for our case study is motivated by the fact that the local matrix generation stage, which benefits the most from parallelization, is much more computationally intensive for the HDG method than for the continuous Galerkin (CG) method. We now provide background concerning the HDG method and discuss the batch processing capabilities of the GPU.

### 1.1 Background

DG methods have seen considerable success in a variety of applications due to ease of implementation, ability to use arbitrary unstructured geometries, and suitability for parallelization. The local support of the basis functions in DG methods allows for domain decomposition at the element level which lends itself well to parallel implementations (e.g. [5, 6]). A number of recent works have demonstrated that DG methods are well-suited for implementation on a GPU [7, 8], for reasons of memory reference locality, regularity of access patterns, and dense arithmetic computations. Computational performance of DG methods is closely tied to polynomial order. As polynomial order increases on DG methods, memory bandwidth becomes less of a bottleneck as the floating point arithmetic operations become the dominant factor. The increase in floating point operation throughput on GPUs has led to implementations of high-order DG methods on the GPU [9].

However, DG methods still suffer from, and are often criticized for, the need to employ significantly more degrees of freedom than other numerical methods [10], which results in a bigger global linear system to solve. The introduction of the HDG method in Cockburn et al. [11] successfully resolved this issue by providing a method within the DG framework whose only globally coupled degrees of freedom were those of the scalar unknown on the borders of the elements. The HDG method uses a formulation which expresses all of the unknowns in terms of the numerical trace of the hybrid scalar variable \(\lambda \). This method greatly reduces the global linear system size, while maintaining properties that make DG methods apt to parallelization. The elemental nature of DG methods has encouraged many to assert that they should be “easily parallelizable” (e.g. [4, 12, 13]). Due to the weak coupling between elements in the HDG method, less inter-element communication is needed, which is advantageous for scaling the method to a parallel implementation. The combination of a batch collection of local (elemental) problems which needs to be computed and the reduced trace-based communication pattern of HDG conceptually makes this method well-suited to the fine-grained parallelism of streaming architectures such as modern GPUs. It is the local (elemental) batch nature of the decomposition which directs us to investigate the GPU implementation of the method. In the next subsection we provide an overview of batched operations, describe the current state of batch processing in existing software packages, and explain why it was relevant to create our own batch processing framework.

### 1.2 Batched Operations

Batch processing is the act of grouping some number of like tasks and computing them as a “batch” in parallel. This generally involves a large set of data whose elements can be processed independently of each other. Batch processing eliminates much of the overhead of iterative non-batched operations. “Batch” processing is well-suited to GPUs due to the SIMD architecture which allows for high parallelization of large streams of data. Basic linear algebra subprograms (BLAS) are a common example of large scale operations that benefit significantly from batch processing. The HDG method specifically benefits from batched BLAS Level 2 (matrix–vector multiplication) and BLAS Level 3 (matrix–matrix multiplication) operations.

Finding efficient implementations for solving linear algebra problems is one of the most active areas of research in GPU computing. The NVIDIA CUBLAS [14] and AMD APPML [15] are well-known solutions for BLAS functions on GPUs. While CUBLAS is specifically designed for the NVIDIA GPU architecture based on CUDA [14], the AMD solution using OpenCL [16] is a more general cross-platform solution for both GPU and multi-CPU architectures. CUBLAS has constantly improved based on a succession of research efforts by Volkov [17], Dongarra [18, 19] *etc.* This led to a speed improvement of one to two orders of magnitude for many functions between the first release and the present version. In recent releases, CUBLAS and other similar packages have been providing batch processing support to improve processing efficiency on multi-element processing tasks. The support is, however, not complete: currently CUBLAS supports batch mode processing only for BLAS Level 3, but not for functions within BLAS Level 1 and BLAS Level 2.

It is due to these limitations of existing software that the authors were prompted to create a batch processing framework. We developed a batch processing framework for the GPU which uses the same philosophy present in CUBLAS. However, we augmented it with additional operations such as matrix-vector multiplication and matrix inversion. The framework is generalized such that it is not limited specifically to linear algebra operations; however, due to the finite element context of this paper, we restricted our focus to linear algebra operations.

### 1.3 Outline

The paper is organized as follows. In Sect. 2 we present the mathematical formulation of the HDG method. In Sect. 3 we introduce all the necessary implementation building blocks: polynomial expansion bases, matrix form of the equations from Sect. 2, trace assembly and spread operators, *etc.* Sect. 4 and its subsections present details that are specific to GPU implementation of the HDG method. First we describe the implementation pipeline followed by the description of the local matrix generation in Sect. 4.1, the global system matrix assembly in Sect. 4.2, and the global solve and subsequent local solve in Sect. 4.3. In Sect. 5 we present numerical results which include a comparison of CPU and GPU implementations of HDG method. Finally, in Sect. 6 we conclude with potential directions for future research along with a summary of the results.

## 2 Mathematical Formulation of HDG

In Sects. 2.2–2.4 we define the HDG methods. We start by presenting the global weak formulation in Sect. 2.2. In Sect. 2.3, we define *local problems*: a collection of elemental operators that express the approximation inside each element in terms of the approximation at its border. Finally, we provide a *global* formulation with which we determine the approximation on the border of the elements in Sect. 2.4. The resulting global boundary system is significantly smaller than the full system one would solve without solving *local problems* first. Once the solution has been obtained on the boundaries of the elements, the primary solution over each element can be determined independently through a forward-application of the elemental operators. However before proceeding we first define the partitioning of the domain and the finite element spaces in Sect. 2.1.

### 2.1 Partitioning of the Domain and the Spectral/\(hp\) Element Spaces

We begin by discretizing our domain. We assume \({\mathcal T}(\varOmega )\) is a two-dimensional tessellation of \(\varOmega \). Let \(\varOmega ^e \in {\mathcal T}(\varOmega )\) be a non-overlapping element within the tessellation such that if \(e_1 \ne e_2\) then \(\varOmega ^{e_1} \cap \varOmega ^{e_2} = \emptyset \). By \(N_{el}\), we denote the number of elements (or cardinality) of \({\mathcal T}(\varOmega )\). Let \(\partial \varOmega ^e\) denote the boundary of the element \(\varOmega ^e\) (*i.e.* \(\bar{\varOmega }^e\setminus \varOmega ^e\)) and \(\partial \varOmega ^e_i\) denote an individual edge of \(\partial \varOmega ^e\) such that \(1 \le i\le N_b^e\) where \(N_b^e\) denotes the number of edges of element \(e\). We then denote by \(\varGamma \) the set of boundaries \(\partial \varOmega ^e\) of all the elements \(\varOmega ^e\) of \({\mathcal T}(\varOmega )\). Finally, we denote by \(N_\varGamma \) the number of edges (or cardinality) of \(\varGamma \).

For simplicity, we assume that the tessellation \({\mathcal T}(\varOmega )\) consists of conforming elements. Note that HDG formulation can be extended to non-conforming meshes. We do not consider the case of a non-conforming mesh in this work, as it would complicate the implementation while not enhancing the contribution statement in any way. We say that \(\varGamma ^l\) is an *interior edge* of the tessellation \({\mathcal T}(\varOmega )\) if there are two elements of the tessellation, \(\varOmega ^e\) and \(\varOmega ^f\), such that \(\varGamma ^l=\partial \varOmega ^{e}\cap \partial \varOmega ^f\) and the length of \(\varGamma ^l\) is not zero. We say that \(\varGamma ^l\) is a *boundary edge* of the tessellation \({\mathcal T}(\varOmega )\) if there is an element of the tessellation, \(\varOmega ^e\), such that \(\varGamma ^l=\partial \varOmega ^e\cap \partial \varOmega \) and the length of \(\varGamma ^l\) is not zero.

As it will be useful later, let us define a collection of index mapping functions, that allow us to relate the local edges of an element \(\varOmega ^e\), namely, \(\partial \varOmega ^e_1, \dots , \partial \varOmega ^e_{N^e_b}\), with the global edges of \(\varGamma \), that is, with \(\varGamma ^1,\dots ,\varGamma ^{N_\varGamma }\). Thus, since the \(j\)th edge of the element \(\varOmega ^e\), \(\partial \varOmega ^e_j\), is the \(l\)th edge \(\varGamma ^l\) of the set of edges \(\varGamma \), we set \(\sigma (e,j)=l\) so that we can write \(\partial \varOmega ^e_j = \varGamma ^{\sigma (e,j)}\).
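The mapping \(\sigma (e,j)\) can be illustrated with a minimal sketch. It assumes a triangular mesh stored as a flat list of global edge indices (three per element) and uses 0-based indices, unlike the 1-based convention of the text. The function name `sigma` and the two-triangle mesh are illustrative only.

```python
from typing import List

EDGES_PER_TRIANGLE = 3


def sigma(triangle_list: List[int], e: int, j: int) -> int:
    """Global edge index l = sigma(e, j) of local edge j of element e (0-based)."""
    return triangle_list[EDGES_PER_TRIANGLE * e + j]


# Hypothetical mesh of two triangles that share global edge 0.
triangle_list = [0, 1, 2, 0, 4, 3]

# Local edge 0 of both elements maps to the same global edge.
shared = sigma(triangle_list, 0, 0) == sigma(triangle_list, 1, 0)
```

Storing the map as a flat array keeps the lookup a single indexed read, which is the kind of regular access pattern that suits a GPU.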

### 2.2 The HDG Method

### 2.3 Local Problems of the HDG Method

For the element whose *global number* is \(e\), we denote the value of \(\tau \) on the edge whose *local number* is \(i\) by \(\tau ^{e,i}\).

### 2.4 The Global Formulation for \(\lambda \)

It remains to determine \(\lambda \). To do so, we require that the boundary conditions be weakly satisfied *and* that the normal component of the numerical trace of the flux \(\widetilde{\varvec{q}}\) given by (5d) be single valued. This renders this numerical trace *conservative*, a highly valued property for this type of method; see Arnold et al. [25].

## 3 HDG Discrete Matrix Formulation and Implementation Considerations

In this section, to get a better appreciation of the implementation of the HDG approach, we consider the matrix representation of the HDG equations. The intention here is to introduce the notation and provide the basis for the discussion in the following sections. More details regarding the matrix formulation can be found in Kirby et al. [26].

In our numerical implementation, we have applied a spectral/\(hp\) element type discretization which is described in detail in Karniadakis and Sherwin [20]. In this work we use the modified Jacobi polynomial expansions on a triangle in the form of generalized tensor products. This expansion was originally proposed by Dubiner [27] and is also detailed in Karniadakis and Sherwin [20] and Sherwin and Karniadakis [21]. We have selected this basis due to computational considerations: the tensorial nature of the basis, coupled with its decomposition into *interior* and *boundary* modes [20, 21], benefits the HDG implementation. In particular, when computing a boundary integral of an elemental basis function, the edge basis functions together with the edge-to-element mapping can be used. This fact will be further commented upon in the following sections.

### 3.1 Matrix Form of the Equations of the HDG Local Solvers

### 3.2 Matrix Form of the Global Equation for \(\lambda \)

### 3.3 Assembling the Transmission Condition from Elemental Contributions

## 4 Implementation Pipeline

We formulated our approach as a pipeline which illustrates the division of tasks between CPU (host) and GPU (Fig. 2). Initial setup steps are handled by the CPU after which the majority of the work is performed on the GPU and finally the resulting elemental solution is passed back to the CPU. Initially, the host parses the mesh file to determine the number of elements, forcing function, and mesh configuration. From this information the CPU can generate the data set that is required by the GPU to compute the finite element solution. This is followed by the generation of the \(\mathbb E^e, (\mathbb M^e)^{-1}, \mathbb D_k^e\) elemental matrices, edge to element mappings, global edge permutation lists and the right hand side vector \({{\underline{F}}}\). This data is then transferred to the GPU.

### 4.1 Building the Local Problems on the GPU

The local matrices are created using a batch processing scheme. The generation of the local matrices can be conducted in a matrix-free manner, but we choose to construct the matrices to take advantage of BLAS Level 3 batched matrix functions. We have found this to be a more computationally efficient approach on the GPU. Each step of the local matrix generation process is executed as a batch operating on all elements in the mesh. The batched matrix operations assign a thread block to each elemental matrix. In most cases a thread is assigned to each entry of a matrix, and the entries are processed concurrently by the GPU in the various assembly and matrix operations.
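The mapping of thread blocks to elemental matrices and of threads to matrix entries can be emulated serially. The following is a sketch only: the loop variables `block`, `ty` and `tx` stand in for the block and thread indices a GPU kernel would use, and the function name `batched_matmul` is hypothetical rather than part of any library.

```python
from typing import List

Matrix = List[List[float]]


def batched_matmul(a: List[Matrix], b: List[Matrix]) -> List[Matrix]:
    """Emulate C_b = A_b B_b with one 'block' per matrix and one 'thread' per entry."""
    out: List[Matrix] = []
    for block in range(len(a)):          # block index: selects the elemental matrix
        rows, inner, cols = len(a[block]), len(b[block]), len(b[block][0])
        c = [[0.0] * cols for _ in range(rows)]
        for ty in range(rows):           # thread index: row of the output entry
            for tx in range(cols):       # thread index: column of the output entry
                c[ty][tx] = sum(a[block][ty][k] * b[block][k][tx]
                                for k in range(inner))
        out.append(c)
    return out


# Two hypothetical 2x2 elemental products processed as one batch.
mats_a = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 0.0], [0.0, 2.0]]]
mats_b = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
mats_c = batched_matmul(mats_a, mats_b)
```

No iteration of the three loops reads a value written by another iteration, which is what allows every output entry of every matrix in the batch to be computed concurrently.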

Before we proceed to discuss the details of the local matrix generation we would like to make note of a certain implementation detail: the use of the edge to element map. As was previously mentioned in Sect. 3.1, we choose the trace expansion to match the elemental expansion along the element’s edge. This choice allows us to use edge expansions together with the edge to element map to generate some of the matrices in a more efficient manner. For example, in Eq. (11) we use the edge to element map to form a sparse matrix \(\mathbb E^{e}_{l}[i,j] = \left\langle \phi ^e_i,\phi ^e_j \right\rangle _{\partial \varOmega ^e_l}\) from the entries of a dense matrix \(\hat{\mathbb E}^{e}_{l}[m,n] = \left\langle \psi ^e_m,\psi ^e_n \right\rangle _{\partial \varOmega ^e_l}\). This approach is also used in the formation of the \(\widetilde{\mathbb E}^{e}_{kl}\), \(\mathbb F^{e}_{l}\) and \(\widetilde{\mathbb F}^{e}_{kl}\) matrices.

The goal of the local matrix generation process (steps B1 and B2) is to form matrices \(\mathbb K^e\) for every element in the mesh. In order to facilitate this, the following matrices must be generated: \(\mathbb Z^e\), block entries of \((\mathbb A^e)^{-1}\), \(\mathbb C^e\), \(\mathbb B^e\) and \(\mathbb G^e\). The \(\mathbb Z^e\) and \(\mathbb U^e\) matrices will be saved for later computations while the rest of the matrices are discarded after use to reduce memory usage.

The final step of the local matrix generation involves constructing the local \({\mathbb K}^e\) matrices, which are formed from the explicit matrix-matrix multiplication of the \(\mathbb B^e\) matrices with the concatenated \(\mathbb U^e\) and \(\mathbb Q_k^e\) matrices. This product is subtracted from the diagonal \(\mathbb G^e\) matrix, which is not formed explicitly, to form \(\mathbb K^e\). Note that the matrices \({\mathbb M}^e\), \({\mathbb Z}^e\), and \({\mathbb K}^e\) are symmetric, which halves the required storage space. The elemental operations at each step are independent of each other, so the batches can be broken up into smaller tiles to conform to memory constraints or to be distributed across multiple processing units. This process results in the local \(\mathbb K^e\) matrices being generated for each element, which are then used to assemble the global \(\mathbf K\) matrix.
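Breaking a batch into tiles amounts to partitioning the element index range. A minimal sketch follows, assuming a fixed tile size chosen by the caller; the helper `tile_ranges` and the numbers used are hypothetical.

```python
from typing import List, Tuple


def tile_ranges(n_elements: int, tile_size: int) -> List[Tuple[int, int]]:
    """Split element indices [0, n_elements) into half-open tiles of at most tile_size."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    return [(start, min(start + tile_size, n_elements))
            for start in range(0, n_elements, tile_size)]


# Hypothetical numbers: 10 elements, room for 4 elemental matrices at a time.
tiles = tile_ranges(10, 4)
```

Because the elemental operations are independent, each tile can be processed in turn on one device or handed to a different processing unit without changing the result.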

### 4.2 Assembling the Local Problems on the GPU

In this section we describe the assembly of the global linear system matrix \(\mathbf{K}\) from the elemental matrices \({\mathbb K}^e\). A typical CG or DG element-based approach to the assembly process, when parallelized, has to employ atomic operations to avoid race conditions. In this paper we propose an edge based assembly process that eliminates the need of expensive GPU atomic operations and avoids race conditions by using reduction operations. The reduction list is generated with a sorting operation which is relatively efficient on GPUs. This lock-free approach is better suited for the SIMD architecture of the GPU where each thread is acting on a separate edge in the mesh. In this way we avoid any race conditions during the assembly process while still maximizing throughput on the GPU.

Next, we describe the proposed method for triangular meshes. Note that this approach can be straightforwardly extended to quadrilateral meshes. In order to evaluate a single entry of the global matrix \(\mathbf K\) we need to determine the indices of entries to which local matrices \({\mathbb K}^e\) will be assembled. To do this, we need to know which element(s) a given edge \(l_i\) belongs to. Given the input triangle list that stores the global edge indices of each triangle, we can generate the edge neighbor list that stores the neighboring triangle indices for each edge. Having the edge neighbor list, we assign the assembly task of each row of \(\mathbf{K}\) to a thread. Each thread uses the edge neighbor list and the triangle list to find the element index \(e\) as well as the entry indices of \({\mathbb K}^e\) to fetch the appropriate data and perform the assembly operation on the corresponding row of \(\mathbf K\).

For our example, the triangle list would be {0,1,2,0,4,3}. Using it we can create an edge neighbor list {0,0,0,1,1,1} that stores the index of the triangle to which each edge from the first list belongs. Next we sort the triangle list by edge index and permute the edge neighbor list according to the sorting. Now the triangle list and edge neighbor list are {0,0,1,2,3,4} and {0,1,0,0,1,1} respectively. These new lists indicate that edge \(l_0\) neighbors triangles \(e_0\) and \(e_1\), that edge \(l_1\) neighbors only one triangle \(e_0\), etc. Figure 4 demonstrates the assembly process of the \(0\)th row (corresponding to the \(l_0\) edge) of the \(\mathbf K\) matrix from the entries of elemental matrices \({\mathbb K}^{e_0}\) and \({\mathbb K}^{e_1}\).
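The list manipulation above and the per-row gather can be reproduced with a short serial sketch. It assumes one unknown per edge, so that each \({\mathbb K}^e\) is \(3 \times 3\), and uses made-up values for the elemental matrices. The helper names are hypothetical, and the stable sort stands in for a parallel sort-by-key on the GPU.

```python
from typing import Dict, List, Tuple

EDGES_PER_TRIANGLE = 3


def build_sorted_lists(triangle_list: List[int]) -> Tuple[List[int], List[int]]:
    """Sort the triangle list by edge index and permute the edge neighbor list with it."""
    edge_neighbor = [i // EDGES_PER_TRIANGLE for i in range(len(triangle_list))]
    pairs = sorted(zip(triangle_list, edge_neighbor), key=lambda p: p[0])
    return [p[0] for p in pairs], [p[1] for p in pairs]


def neighbors_per_edge(edges: List[int], tris: List[int]) -> Dict[int, List[int]]:
    """Group the sorted lists into edge -> neighboring triangle indices."""
    out: Dict[int, List[int]] = {}
    for edge, tri in zip(edges, tris):
        out.setdefault(edge, []).append(tri)
    return out


def assemble_row(l: int, triangle_list: List[int], neighbors: Dict[int, List[int]],
                 k_local: List[List[List[float]]], n_edges: int) -> List[float]:
    """Gather row l of K; only the task that owns row l ever writes to it."""
    row = [0.0] * n_edges
    for e in neighbors[l]:
        local_edges = triangle_list[3 * e:3 * e + 3]
        j = local_edges.index(l)              # local number of edge l inside element e
        for k, col in enumerate(local_edges):
            row[col] += k_local[e][j][k]
    return row


triangle_list = [0, 1, 2, 0, 4, 3]
sorted_edges, sorted_tris = build_sorted_lists(triangle_list)
neighbors = neighbors_per_edge(sorted_edges, sorted_tris)

# Made-up symmetric elemental matrices, for illustration only.
k_local = [
    [[4.0, -1.0, -1.0], [-1.0, 4.0, -1.0], [-1.0, -1.0, 4.0]],
    [[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]],
]
row0 = assemble_row(0, triangle_list, neighbors, k_local, n_edges=5)
```

Since each row is produced by a single gathering task, no two tasks write to the same memory location, which is why this arrangement needs no atomic updates.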

Fig. 4 Assembly of the \(0\)th row of the global \(\mathbf K\) matrix from the corresponding entries of the elemental matrices \({\mathbb K}^{e_0}\) and \({\mathbb K}^{e_1}\)

*Remark 1*

We would like to stress the importance of the edge-only inter-element connectivity provided by the HDG method. This property ensures that the sparsity (number of nonzero entries per row) of the global linear system matrix depends only on the element types used and not on the mesh structure (e.g. vertex degree). The other benefit provided by the HDG method is the ability to assemble the system matrix by edges as opposed to by elements, which removes the need for costly atomic assembly operations. In the CG method, by contrast, elements are connected through both edge degrees of freedom and vertex degrees of freedom. This through-the-vertex element connectivity makes it infeasible to use the compact ELL system matrix representation for a general mesh and makes it hard to avoid atomic operations in the assembly process.
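The bounded row length is what makes a fixed-width (ELL-style) layout practical. As a sketch, assuming one unknown per edge and a triangular mesh, the column indices of each row can be collected and padded to a common width. The padding marker `PAD` and the helper name are choices of this sketch, not of any particular library.

```python
from typing import List

PAD = -1  # marker for unused slots (an assumption of this sketch)


def ell_column_indices(triangle_list: List[int], n_edges: int) -> List[List[int]]:
    """Per-edge column indices of K (one unknown per edge), padded to a fixed width."""
    cols: List[List[int]] = [[] for _ in range(n_edges)]
    for start in range(0, len(triangle_list), 3):
        tri_edges = triangle_list[start:start + 3]
        for l in tri_edges:
            for m in tri_edges:        # edge l couples with every edge of its triangle
                if m not in cols[l]:
                    cols[l].append(m)
    width = max(len(c) for c in cols)
    return [sorted(c) + [PAD] * (width - len(c)) for c in cols]


# Two triangles sharing edge 0: the interior edge couples with five edges,
# each boundary edge with three.
ell = ell_column_indices([0, 1, 2, 0, 4, 3], n_edges=5)
```

For triangles an edge belongs to at most two elements, so a row never couples with more than five edges regardless of how many elements meet at a vertex. With several unknowns per edge the width scales by the number of unknowns per edge.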

*Remark 2*

We note that there are multiple ways to address the issue of evaluating the discrete system. A full global system need not be assembled in some cases. One can use a local matrix approach or a global matrix approach. In the local matrix approach, a local operator matrix is applied to each elemental matrix. This allows for on-the-fly assembly without the need to construct a global matrix system. The global matrix approach assembles a global matrix system from the local elemental contributions. Vos et al. [29] describe these approaches in detail for the continuous Galerkin (FEM) method. In either case, information from multiple elements must be used to compute any given portion of the final system. This requires the use of some synchronized ordering within the mapping process. There are several methods for handling this ordering. One such method is to use atomic operations to ensure that each element in the final system is updated without race conditions. Another method is to use asynchronous ordering and pass the updates to a communication interface which handles the updates in a synchronized fashion. This is demonstrated in the work by Göddeke et al. [30, 31], in which they use MPI to handle the many-to-one mapping through asynchronous ordering. In either case a many-to-one mapping exists and a synchronized ordering must be used to prevent race conditions. We chose to use the global approach to compare our results to the previous work by Kirby et al. [26], in which the authors also used the global approach.

### 4.3 Trace Space Solve and Local Problem Spreading on the GPU

The final steps of the process construct the elemental primitive solution \(\hat{u}^e\) (B5 and B6 of the GPU pipeline). This requires retrieving the elemental solution from the trace solution. We form the element-wise vector of local \({\underline{\lambda }}^e\) coefficients by scattering the coefficients of the global trace solution \({\underline{\varLambda }}\) produced by the sparse solve. The values are scattered back out to the local vectors using the edge to triangle list. Each interior edge will be scattered to two elements and each boundary edge will be scattered to one element. This is equivalent to the operation performed by the trace space spreading operator \(\mathcal {A}^e_{HDG}\) which we conduct in a matrix free manner.
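The scatter of the global trace coefficients can be sketched serially as follows. The sketch assumes that the global vector stores the coefficients edge by edge and ignores any edge-orientation handling that a full implementation may need. The function name and sample values are hypothetical.

```python
from typing import List


def scatter_trace(global_lambda: List[float], triangle_list: List[int],
                  dofs_per_edge: int) -> List[List[float]]:
    """Copy the trace coefficients of each global edge into every element that owns it."""
    n_tri = len(triangle_list) // 3
    local = [[0.0] * (3 * dofs_per_edge) for _ in range(n_tri)]
    for slot, l in enumerate(triangle_list):   # one independent task per (element, local edge)
        e, j = divmod(slot, 3)
        for m in range(dofs_per_edge):
            local[e][j * dofs_per_edge + m] = global_lambda[l * dofs_per_edge + m]
    return local


# Five edges with two coefficients each; values chosen to make the copy visible.
global_lambda = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 4.0, 4.1]
local_lambda = scatter_trace(global_lambda, [0, 1, 2, 0, 4, 3], dofs_per_edge=2)
```

The coefficients of the shared edge 0 appear in both local vectors, while those of each boundary edge appear in exactly one, matching the one-or-two-element scatter described above.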

## 5 Numerical Results

In this section we discuss the performance of the GPU implementation of the HDG method using the Helmholtz equation as a test case. At the end of the section we also provide a short discussion of the CG method GPU implementation based on the preliminary data collected. For verification and runtime comparison we use a CPU implementation of the Helmholtz solver existing within the Nektar++ framework v3.2 [32]. Nektar++ is a freely available, highly optimized finite element framework. The code is robust and efficient, and it allows for ease of reproducibility of our CPU test results. Our implementation also takes advantage of the GPU parallel primitives in the CUDA Cusp and Thrust libraries [33, 34]. All the tests referenced in this section were performed on a machine with an NVIDIA Tesla M2090 GPU, 128 GB of memory, and an Intel Xeon E5630 CPU running at 2.53 GHz. The system was using openSUSE 12.1 with CUDA runtime version 4.2.

Numerical errors from the GPU implementation of the Helmholtz solver on a \(40 \times 40\) triangular mesh

Polynomial order | GPU \(L^\infty \) error | Order of convergence | GPU \(L^2\) error | Order of convergence |
---|---|---|---|---|
1 | 1.59334e-02 | – | 3.95318e-03 | – |
2 | 4.95546e-04 | 5.01 | 8.04917e-05 | 5.62 |
3 | 1.10739e-05 | 5.48 | 1.3446e-06 | 5.90 |
4 | 1.93802e-07 | 5.84 | 1.88309e-08 | 6.16 |
5 | 5.71909e-09 | 5.08 | 1.07007e-09 | 4.14 |
6 | 1.40495e-08 | -1.30 | 4.63559e-09 | -2.12 |
7 | 2.46212e-08 | -0.81 | 5.77189e-09 | -0.32 |
8 | 5.19398e-08 | -1.08 | 1.44714e-08 | -1.33 |
9 | 1.17087e-07 | -1.17 | 2.92382e-08 | -1.01 |

Total run time data for CPU and GPU implementations of the Helmholtz problem (time is measured in ms)

Polynomial order | \(20 \times 20\) GPU | \(20 \times 20\) CPU | \(20 \times 20\) Speedup | \(40 \times 40\) GPU | \(40 \times 40\) CPU | \(40 \times 40\) Speedup | \(80 \times 80\) GPU | \(80 \times 80\) CPU | \(80 \times 80\) Speedup |
---|---|---|---|---|---|---|---|---|---|
1 | 117 | 268 | 2.29 | 231 | 1,427 | 6.19 | 559 | 9,889 | 17.69 |
2 | 170 | 483 | 2.84 | 323 | 2,843 | 8.8 | 858 | 24,459 | 28.5 |
3 | 264 | 828 | 3.14 | 480 | 5,145 | 10.71 | 1,508 | 54,728 | 36.28 |
4 | 383 | 1,414 | 3.69 | 853 | 8,896 | 10.43 | 2,777 | 105,896 | 38.13 |
5 | 526 | 2,268 | 4.31 | 1,387 | 15,165 | 10.94 | 4,894 | 180,373 | 36.85 |
6 | 769 | 3,484 | 4.53 | 2,295 | 24,873 | 10.84 | 8,165 | 289,319 | 35.44 |
7 | 1,136 | 5,251 | 4.62 | 3,550 | 36,869 | 10.39 | 12,879 | 436,217 | 33.87 |
8 | 1,613 | 7,683 | 4.76 | 5,393 | 54,474 | 10.1 | 20,072 | 630,613 | 31.42 |
9 | 2,214 | 11,451 | 5.17 | 7,489 | 79,604 | 10.63 | 28,481 | 883,340 | 31.02 |

GPU memory requirements (in kB) for each mesh and polynomial order

Polynomial order | \(20\times 20\) mesh | \(40\times 40\) mesh | \(80\times 80\) mesh |
---|---|---|---|
1 | 685 | 2,727 | 14,869 |
2 | 1,887 | 7,517 | 41,818 |
3 | 3,968 | 15,821 | 89,211 |
4 | 7,160 | 28,560 | 162,624 |
5 | 11,693 | 46,656 | 267,633 |
6 | 17,797 | 71,031 | 409,813 |
7 | 25,703 | 102,605 | 594,740 |
8 | 35,640 | 142,301 | 827,989 |
9 | 47,840 | 191,040 | 1,115,136 |

Timing data for the four major stages of GPU implementation on \(20 \times 20\) mesh (time is measured in ms)

Polynomial order | Local matrix generation (HDG) | Global assembly | Global solve | Local solve |
---|---|---|---|---|
1 | 7 | 18 | 75 | 2 |
2 | 9 | 51 | 106 | 2 |
3 | 11 | 47 | 113 | 2 |
4 | 14 | 56 | 161 | 2 |
5 | 21 | 42 | 162 | 2 |
6 | 40 | 95 | 215 | 2 |
7 | 60 | 113 | 203 | 2 |
8 | 107 | 121 | 253 | 2 |
9 | 155 | 132 | 246 | 3 |

Timing data for the four major stages of GPU implementation on \(40 \times 40\) mesh (time is measured in ms)

Polynomial order | Local matrix generation (HDG) | Global assembly | Global solve | Local solve |
---|---|---|---|---|
1 | 11 | 29 | 124 | 3 |
2 | 14 | 47 | 128 | 3 |
3 | 19 | 61 | 133 | 4 |
4 | 28 | 59 | 191 | 4 |
5 | 55 | 57 | 195 | 5 |
6 | 94 | 192 | 257 | 6 |
7 | 140 | 139 | 266 | 7 |
8 | 249 | 92 | 346 | 7 |
9 | 422 | 137 | 361 | 8 |

Timing data for the four major stages of GPU implementation on \(80 \times 80\) mesh (time is measured in ms)

Polynomial order | Local matrix generation (HDG) | Global assembly | Global solve | Local solve |
---|---|---|---|---|
1 | 18 | 53 | 210 | 6 |
2 | 32 | 88 | 213 | 7 |
3 | 44 | 135 | 239 | 8 |
4 | 82 | 159 | 303 | 9 |
5 | 194 | 146 | 355 | 10 |
6 | 347 | 236 | 469 | 10 |
7 | 537 | 291 | 551 | 12 |
8 | 868 | 322 | 722 | 13 |
9 | 1,413 | 405 | 769 | 17 |

We use batched matrix-matrix multiplication operations as the baseline comparison for our method. The FLOPS demonstrated by homogeneous BLAS3 operations serve as an upper bound on the performance of the batched operations carried out in the HDG process. The batched operations in the HDG pipeline are a combination of BLAS1, BLAS2, BLAS3, and matrix inversion operations. BLAS3 operations demonstrate the best performance, in terms of FLOPS, due to their higher computational density compared with the other operations. Our method demonstrates peak performance of 60 GFLOPS, which is \(\sim \)75 % of the peak FLOPS seen by batched matrix-matrix multiplication operations using cuBLAS [35], on a GPU with 665 peak GFLOPS for double precision. The addition of matrix inversion, BLAS1, and BLAS2 operations lowers the computational performance below that of pure BLAS3 operations.

Figure 6 illustrates the FLOPS and bandwidth of the local matrix generation process and compares the rates on the CPU and GPU (with and without the transfer time). Figure 7 provides an estimate of the FLOPS for the global solve stage, in which the solver applies the conjugate gradient method to the sparse global matrix. We estimated the FLOPS from the size of \(\mathbf{K}\), the number of non-zero entries in the global matrix, and the number of iterations required to converge to a solution. The conjugate gradient algorithm requires approximately \(2N_{nz}+3N_{rows} + N_{iter}\,(2N_{nz} + 10N_{rows})\) operations, where \(N_{nz}\) is the number of non-zero entries in the sparse global system (which is approximately \(N_{\lambda }^l N_{\varGamma } \times 5N_{\lambda }^l\)), \(N_{rows}\) is the number of rows (which corresponds to \(N_{\lambda }^l N_{\varGamma }\)), and \(N_{iter}\) is the number of iterations required to converge to a solution. Because of implementation-specific optimizations, this estimate may be slightly higher than the FLOPS actually achieved by the solver.
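The operation count above can be written as a small helper. This is a minimal sketch of the estimate only, not of the solver: the function and parameter names are hypothetical, and the input values in the usage note are made up for illustration rather than taken from our experiments.

```python
def cg_flop_estimate(n_lambda_local, n_edges, n_iter):
    """Approximate operation count for the conjugate gradient global solve.

    n_lambda_local : trace unknowns per edge (N_lambda^l in the text)
    n_edges        : number of mesh edges (N_Gamma in the text)
    n_iter         : CG iterations needed to converge (N_iter in the text)
    """
    n_rows = n_lambda_local * n_edges
    # Approximation from the text: each row holds about 5 * N_lambda^l
    # non-zeros (an interior edge couples to the five edges of the two
    # triangles that share it, itself included).
    n_nz = n_rows * 5 * n_lambda_local
    setup = 2 * n_nz + 3 * n_rows          # work before the first iteration
    per_iteration = 2 * n_nz + 10 * n_rows  # one SpMV plus vector updates
    return setup + n_iter * per_iteration


def gflops(flop_count, elapsed_ms):
    """Convert an operation count and a time in milliseconds to GFLOPS."""
    return flop_count / (elapsed_ms * 1e6)
```

For example, `cg_flop_estimate(2, 10, 3)` counts the work for 10 edges with 2 unknowns each and 3 iterations, and `gflops(count, t_ms)` turns that count into a rate once a measured solve time is available.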

We note that the global solve stage contributes a non-negligible amount of time to the overall method, and that the choice of solver influences the time taken by this stage. Our CPU implementation uses a banded Cholesky solver, while the GPU implementation uses an iterative conjugate gradient solver from the CUSP library. This CUDA library uses a multigrid preconditioner and is a state-of-the-art GPU solver for sparse linear systems. There are alternatives to this approach, such as the sparse matrix-vector product technique described by Roca et al. [36], which takes advantage of the sparsity pattern of the global matrix to perform an efficient iterative solve of the system. We chose our approach because the global system solve is not the focus of our method; we focus instead on the parallelization of the elemental operations.
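For readers unfamiliar with the iteration, the following is a minimal sketch of a conjugate gradient solve on a sparse symmetric positive definite system. It is unpreconditioned, serial, and written in plain Python, so it is not the CUSP solver used in our GPU implementation; the row-of-dictionaries matrix layout and all names are assumptions made for illustration.

```python
def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] is a dict {column: value}."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(rows, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG; returns (solution, iterations performed)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for k in range(max_iter):
        if rs_old ** 0.5 <= tol:   # converged: residual norm below tol
            return x, k
        Ap = spmv(rows, p)         # the only matrix operation per iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

Each iteration performs one sparse matrix-vector product and a handful of vector updates, which is why the cost per iteration is governed by the number of non-zero entries and the number of rows of the global matrix.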

Local matrix generation time for CG and HDG methods on GPU (time is measured in ms)

Order | \(20 \times 20\) mesh, HDG | \(20 \times 20\) mesh, CG | \(40 \times 40\) mesh, HDG | \(40 \times 40\) mesh, CG | \(80 \times 80\) mesh, HDG | \(80 \times 80\) mesh, CG |
---|---|---|---|---|---|---|
1 | 7 | 4 | 11 | 7 | 18 | 14 |
2 | 9 | 6 | 14 | 12 | 32 | 20 |
3 | 11 | 10 | 19 | 15 | 44 | 32 |
4 | 14 | 12 | 28 | 22 | 82 | 56 |
5 | 21 | 16 | 55 | 36 | 194 | 116 |
6 | 40 | 20 | 94 | 54 | 347 | 209 |
7 | 60 | 30 | 140 | 91 | 537 | 338 |
8 | 107 | 40 | 249 | 131 | 868 | 492 |
9 | 155 | 59 | 422 | 205 | 1,413 | 808 |

## 6 Conclusions and Future Work

We have directly compared CPU and GPU implementations of the HDG method for a two-dimensional elliptic scalar problem using regular triangular meshes with polynomial orders \(1 \le P \le 9\). We have discussed how to efficiently implement the HDG method within the context of the GPU architecture, and we provide results which show the relative costs and scaling of the stages that take place in the HDG method as polynomial order and mesh size increase.

Our results indicate the efficacy of applying batched operations to the HDG method. We provide an efficient way to map values from the local matrices to the global matrix during the global assembly step through the use of a lock-free edge mapping technique. This technique avoids atomic operations and is key for implementing an efficient HDG method on the GPU. The framework we suggest illustrates an effective GPU pipeline which could be adapted to fit methods structurally similar to HDG.
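The idea behind per-edge assembly can be sketched as follows. This is an illustrative serial sketch under simplifying assumptions, not our GPU kernel: it uses one unknown per edge (\(N_{\lambda }^l = 1\)), stores each global row as a dictionary, and replaces GPU threads with an ordinary loop. All function and variable names are hypothetical.

```python
def build_edge_to_elements(elem_edges, n_edges):
    """For each global edge, list the (element, local edge index) pairs on it."""
    edge_to_elems = [[] for _ in range(n_edges)]
    for e, edges in enumerate(elem_edges):
        for local_i, g in enumerate(edges):
            edge_to_elems[g].append((e, local_i))
    return edge_to_elems


def assemble_row(g, elem_edges, edge_to_elems, local_mats):
    """Gather every contribution to global row g (work for one 'thread')."""
    row = {}
    for e, local_i in edge_to_elems[g]:
        for local_j, gj in enumerate(elem_edges[e]):
            row[gj] = row.get(gj, 0.0) + local_mats[e][local_i][local_j]
    return row


def assemble_by_edge(elem_edges, local_mats, n_edges):
    """Assemble the global matrix one row at a time.

    Row g is written only by the task that owns edge g, so no two tasks
    ever update the same entry and no atomic additions are required.
    """
    edge_to_elems = build_edge_to_elements(elem_edges, n_edges)
    return [assemble_row(g, elem_edges, edge_to_elems, local_mats)
            for g in range(n_edges)]
```

For two triangles with edges `(0, 1, 2)` and `(2, 3, 4)`, the row of the shared edge 2 gathers from both elements and couples to all five edges, while a boundary edge such as 0 gathers from a single element. The contrast is with element-wise scatter assembly, where two elements sharing an edge may write to the same global entry concurrently.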

Through our numerical tests we have demonstrated that the HDG method is well suited to large-scale streaming SIMD architectures such as the GPU. We consistently see a speedup of \(30\times \) or more for meshes of size \(80 \times 80\) and larger. The method demonstrates strong scaling with respect to mesh size. With each mesh refinement, for a given polynomial order, the number of elements increases by \(4\times \), and we see a corresponding increase in compute time of roughly \(4\times \). As the mesh size increases, the process becomes more efficient due to increased computational density relative to processing overhead. We have also demonstrated that the HDG method is well suited to batch processing, with low inter-element coupling and highly independent operations.

Let us end by indicating possible extensions to the work presented. One possible extension could be a GPU implementation of the statically condensed CG method. The formulation of the statically condensed CG method is similar to that of the HDG method. The structure of the global \(\mathbf K\) matrix will differ due to increased coupling between elements in the CG case (see Kirby et al. [26] for details). This may present an additional challenge in formulating the global assembly step in an efficient manner on the GPU, because elements are coupled by edges and vertices. We suspect that the performance gains will not be as great as in the HDG case.

Another possible extension could be scaling of the HDG method to multiple GPUs. The local matrix generation and the global assembly step consist of independent operations and would scale well with increased parallelization. The cost of the local matrix generation stage grows at a faster rate than the other stages, and becomes the dominant factor for \(P \ge 7\) for moderately sized and larger meshes. The global assembly stage would also see performance gains, since the assembly process is performed on a per-edge basis. Each GPU could be given a unique set of edges to assemble into the global matrix \(\mathbf K\), with some overlapping edges being passed along to avoid cross communication. The global solve stage may prove to be a bottleneck in a multi-GPU implementation since it cannot be easily divided up amongst multiple processing units. However, as we have shown in our results, the computation time for this step does not grow at the same rate as the local matrix generation step.

### Acknowledgments

We would like to thank Professor B. Cockburn (U. Minnesota) for the helpful discussions on this topic. This work was supported by the Department of Energy (DOE NETL DE-EE0004449) and by NSF OCI-1148291.

