Abstract
Sparse general matrix–matrix multiplication (SpGEMM) is a crucial and complex computational task in many practical applications. Improving the performance of SpGEMM on SIMT processors like modern GPUs is challenging due to the unpredictable sparsity of sparse matrices. Although existing GPU solutions have made progress in improving performance through advanced algorithm design, they ignore some optimizations related to specific processor architectures. This can result in a partially inefficient implementation of their algorithms. This paper focuses on optimizing four inefficient parts of the NSparse algorithm on DCU (a GPU-like accelerator). The optimizations include: 1) setting parameters to improve the load balance of the second matrix by extracting maximum row information at runtime; 2) reducing the overhead of binning operations by using registers and shared memory effectively; 3) improving numerical SpGEMM performance by adjusting its calculation mode; and 4) enhancing global load balance through finer-grained grouping and kernel configurations. Experimental results demonstrate that when compared to five state-of-the-art SpGEMM algorithms (bhSparse, KokkosKernels, NSparse, rocSparse, and spECK), our optimized method achieves an average of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x) speedups on 29 sparse matrices with different sparse structures, respectively.
1 Introduction
Sparse general matrix–matrix multiplication (SpGEMM) calculates the product of two sparse matrices A and B to obtain a sparse result matrix C. It is a key computing kernel in many scientific and engineering applications, particularly in linear algebra, graph algorithms, machine learning, and other areas. Specifically, these applications include algebraic multigrid solvers [1, 2], multi-source breadth-first search [3, 4], colored intersection search [5], triangle counting [6, 7], graph betweenness centrality [8], Markov clustering algorithm [9, 10], etc. Therefore, enhancing the efficiency of the SpGEMM algorithm becomes crucial.
The access and computational behavior of SpGEMM are influenced by several factors, including the distribution of nonzero elements in the sparse matrix, the compressed storage format, and the computational data flow used. This poses a significant challenge when computing SpGEMM on SIMT hardware. First, the differing distribution of nonzero elements in the input matrix causes a difference in the number of intermediate products produced in each row of the result matrix. Consequently, this creates an uneven workload for the threads. Second, due to the unpredictable sparsity of the result matrix, it becomes very difficult to allocate memory in advance for the result matrix stored in compressed format. Finally, the position of the nonzero elements in each row of the output matrix is unknown, making it difficult to find an efficient accumulator to accumulate intermediate products.
Although the state-of-the-art SpGEMM algorithm NSparse [11] has addressed the mentioned issues to a certain extent and achieved significant performance improvement, it primarily emphasizes high-level algorithm design, overlooking some low-level optimizations related to hardware architecture. This can result in a partially inefficient implementation of the algorithm. For example, inefficient parts of this algorithm implementation include neglecting the load balance processing of the second matrix, performing excessive atomic operations on global memory during the binning process, performing excessive shared memory accesses during numerical SpGEMM computation, and improperly mapping some threads to computation tasks, all of which can have an impact on SpGEMM performance.
In order to improve the performance of SpGEMM on DCU (a GPU-like accelerator) [12,13,14,15,16], we analyze four inefficient parts of the NSparse implementation and optimize each inefficient part based on the DCU architecture. We evaluated the performance of our optimized algorithm on 29 widely used benchmarks. When compared to the five SpGEMM algorithms bhSparse [17], KokkosKernels [18], NSparse [11], rocSparse [19], and spECK [20], our work has achieved significant speedups. The average speedups achieved are 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x), respectively.
The rest of the paper is organized as follows: Sect. 2 describes the background knowledge of SpGEMM and the architecture of DCU. Section 3 discusses the related work. Section 4 describes the main idea and four inefficient parts of the NSparse implementation. In Sect. 5, the optimization work for the four inefficient parts of the NSparse implementation is described. In Sect. 6, the experimental environment and the performance comparison between our optimized algorithm and five state-of-the-art SpGEMM algorithms are presented. Section 7 provides a summary and an outlook on future work.
2 Background
This section mainly introduces the background knowledge of the SpGEMM and the architecture of DCU.
2.1 Notations
Table 1 shows the symbolic representation commonly used in this paper. The matrix sizes for the three matrices A, B, and C are \(M*K\), \(K*N\), and \(M* N\), respectively.
2.2 CSR storage format
The Compressed Sparse Row storage format (CSR) is a widely used storage format for sparse matrices. It represents sparse matrices efficiently by storing only the values of nonzero elements and their column indices, together with row pointers, thus saving storage space. The CSR format has been widely used in previous SpGEMM research [11, 17, 18, 20, 21]. We also selected the CSR format as the input and output format of SpGEMM.
Figure 1 shows the CSR storage format of a \(5*5\) sparse matrix. There are three arrays contained in the CSR storage format. The first array rpt records the starting position of the first nonzero element of each row in the one-dimensional storage space. The Nnz in the i-th row can be calculated by \(\text {rpt}[i+1]-\text {rpt}[i]\), so the size of the rpt array is M+1. The second array col and the third array val record the column indices and values of the nonzero elements in row-major order, and the size of the col and val arrays is determined by the Nnz of the sparse matrix. The advantage of the CSR storage format is its ability to quickly retrieve information using row pointers and minimize storage space usage.
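The three-array layout described above can be sketched as follows. The concrete \(5*5\) matrix below is hypothetical and chosen only for illustration; it is not the matrix of Fig. 1:

```python
# A minimal sketch of the CSR layout: rpt (row pointers), col (column
# indices), and val (values), as described above.
def to_csr(dense):
    """Convert a dense matrix to (rpt, col, val) CSR arrays."""
    rpt, col, val = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col.append(j)
                val.append(v)
        rpt.append(len(col))   # rpt[i+1] marks the end of row i
    return rpt, col, val

A = [[1, 0, 0, 2, 0],
     [0, 3, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [4, 0, 5, 0, 0],
     [0, 0, 0, 0, 6]]
rpt, col, val = to_csr(A)
# Nnz of row i is rpt[i+1] - rpt[i]; rpt has M + 1 entries.
```

Note that the empty third row simply repeats the previous row pointer, so its Nnz is zero.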
2.3 SpGEMM and Its row–row algorithm
SpGEMM computes \(C=AB\), where both input and output matrices are sparse and each element of matrix C is calculated as:

$$c_{ij} = \sum_{k} a_{ik} \, b_{kj}$$
where i and j denote the row index of matrix A and the column index of matrix B, respectively, and k denotes the column index of matrix A and the row index of matrix B.
The parallel SpGEMM primarily utilizes the row–row formulation proposed by Gustavson [22], as demonstrated in Algorithm 1. In the row–row formulation, the nonzero element \(a_{ik}\) in the i-th row of matrix A is multiplied with all nonzero elements in the k-th row of matrix B, resulting in intermediate products for the current result row (line 5 of Algorithm 1). If the column index of the intermediate product appears for the first time, first set \(c_{ij}\) and then insert it into temporary storage (lines 6–8). For intermediate products with the same column indices, directly accumulate them into \(c_{ij}\) (line 10). The row–row formulation has the advantage of ensuring that each row in the result matrix is independent, which makes it well-suited for parallel processing on SIMT hardware. The computation formulation adopted in this paper is also the row–row formulation.
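The row–row formulation can be sketched serially as follows. This is an illustrative analogue of Algorithm 1 operating directly on CSR arrays, using a dictionary as the temporary storage for the current result row; it is not the parallel kernel itself:

```python
# A serial sketch of Gustavson's row-row formulation: each nonzero
# a_ik of row i of A scales row k of B, and intermediate products with
# the same column index are accumulated into c_ij.
def spgemm_row_row(a_rpt, a_col, a_val, b_rpt, b_col, b_val, M):
    c_rpt, c_col, c_val = [0], [], []
    for i in range(M):
        acc = {}                       # temporary per-row accumulator
        for idx in range(a_rpt[i], a_rpt[i + 1]):
            k, a_ik = a_col[idx], a_val[idx]
            for jdx in range(b_rpt[k], b_rpt[k + 1]):
                j, b_kj = b_col[jdx], b_val[jdx]
                acc[j] = acc.get(j, 0) + a_ik * b_kj   # c_ij += a_ik * b_kj
        for j in sorted(acc):          # emit row i in column order
            c_col.append(j)
            c_val.append(acc[j])
        c_rpt.append(len(c_col))
    return c_rpt, c_col, c_val
```

Because each result row depends only on one row of A and the rows of B it selects, the outer loop over i is trivially parallel, which is exactly the property exploited on SIMT hardware.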
2.4 General workflow of SpGEMM
Figure 2 illustrates the general workflow of SpGEMM, which consists of four steps. Allocating memory for the compressed format storage of the result matrix can be challenging due to the unpredictable sparsity of the matrix. Therefore, it is crucial to initially predict the size of the result matrix. There are three common approaches for predicting the size of the result matrix: the exact method [11, 18, 20, 21, 23, 24], the upper-bound method [17, 25, 26], and the probabilistic method [27,28,29]. The exact method allocates memory of the actual size for the result matrix. This approach involves performing a symbolic SpGEMM computation first to determine the size of the result matrix. Due to the need for two SpGEMM computations, this method is more time-consuming. The upper-bound method allocates the maximum amount of memory for the result matrix, typically the total number of intermediate products. However, the memory allocated by this method often exceeds the actual requirements, leading to resource waste. The probabilistic method represents a compromise, estimating an imprecise size through random sampling and probability analysis of the input matrix. When this method’s estimation fails, additional memory allocation is necessary.
The second step primarily involves memory allocation on the target device. The third step involves partitioning the work according to the sparse characteristics of the result matrix, ensuring a balanced workload for all threads. The last step is the most time-consuming in the entire computation process, involving numerical multiplication and generating a substantial number of intermediate products. Subsequently, these intermediate products are accumulated using accumulators to obtain the final result row. Commonly used accumulators include hash accumulators [11, 20, 21, 27, 30], heap accumulators [17], and dense accumulators [20, 31]. This paper uses the exact method to calculate the size of the result matrix and hash accumulators to accumulate intermediate products.
2.5 Introduction of DCU
Hygon DCU [12,13,14,15,16] is a GPU-like accelerator running in the Radeon Open Compute (ROCm) environment [32], which is based on the AMD second-generation Vega architecture and supports applications for HIP-based [33] heterogeneous computing. The Hygon DCU has 60 Compute Units (CUs) operating at a core clock frequency of 1.7 GHz, along with 16GB of global memory. Figure 3 shows the internal structure of a CU. Each CU is equipped with 64 KB of shared memory and is divided into 4 SIMD compute groups. Each SIMD compute group consists of 16 lanes and can support 16 double-precision and 32 single-precision floating-point operations per clock cycle. Additionally, each SIMD compute group is equipped with 64 KB of VGPR high-speed registers. A single thread can use up to 256 32-bit registers.
Unlike most GPUs, which have a warp size of 32, the DCU has a warp size of 64. A warp composed of 64 threads is executed on one SIMD compute group, and each SIMD compute group can run up to 10 warps concurrently. In addition, there is a 16 KB L1 cache in the CU as an on-chip storage resource. The SIMD units within a CU share a scalar ALU (SALU), which is responsible for address calculation and branch evaluation.
3 Related work
In recent years, researchers have proposed many methods for GPUs to improve the performance of SpGEMM. Since the DCU architecture is closely related to that of GPUs, much of the existing body of SpGEMM research on GPUs is also applicable to the DCU. We draw from the advancements made in these works to inform our programming of the DCU. In this section, we will present a variety of SpGEMM methods that utilize hash tables to accumulate intermediate products on the GPU [11, 20, 21, 23, 27, 34].
Demounth et al. [23] proposed the implementation of the SpGEMM algorithm in the cuSPARSE mathematics library. To save storage space, the algorithm utilizes the exact method for predicting the size of the result matrix. Additionally, it employs a shared memory hash table to accelerate the accumulation process of intermediate products. When a collision occurs in the shared memory hash table, the global memory hash table is utilized, leading to many global memory accesses. In addition, the algorithm utilizes a single uniform kernel for calculating all rows, leading to significant imbalances in workload and subsequently impacting the algorithm's efficiency. Anh et al. [27] proposed Balanced Hash. This algorithm addresses load imbalance by storing the intermediate products in global memory and distributing the work list evenly among threads for hash accumulation operations. The algorithm's hash table is located in shared memory and has a fixed size. Once the calculation for each round is finished, the hash table's contents are written to global memory. When a hash conflict occurs, the intermediate product is placed in a temporary queue and handled in the subsequent round of hash operations until the temporary queue is emptied. Although the balanced hash algorithm achieves a good complexity balance, the work list consumes a significant amount of temporary storage space, and handling hash collisions necessitates additional global memory accesses, limiting the algorithm's performance.
Kurt et al. [34] deeply investigated the reasons for the lower performance of SpGEMM compared to SpMV and used hypergraph partitioning to prove that there is no inherent lower limit to data movement, allowing SpGEMM to handle more data movement than SpMV. Based on this conclusion, they developed an adaptive work distribution strategy to implement hashing based on scatter vector updates that reduced the number of idle threads while improving performance significantly. Specifically, thread blocks are divided into virtual warps to process a row of matrix A, and the size of the virtual warp is determined based on the average number of nonzero elements in B corresponding to a given row of matrix A. Nagasaka et al. [11] proposed NSparse, which groups rows by Nprod (or Nnz) in the result matrix and creates different-size hash tables and kernel configurations for each group to fully utilize the GPU resources. This algorithm also eliminates the need to access global memory when a hash collision occurs. However, we also find some inefficiencies in the implementation of this algorithm, analyzing and optimizing them in detail in Sects. 4 and 5.
Parger et al. [20] proposed spECK. In addition to utilizing hash-based accumulation methods, spECK also introduces a dense accumulation method and selects a more suitable accumulator based on the sparsity of the input matrix rows. Therefore, spECK demonstrates strong performance across a wide range of input matrices. In addition, spECK proposes a method to improve the load balance of the second matrix. This involves conducting a simple row analysis on the input matrix during runtime and subsequently dynamically adjusting the number of threads that access a complete row of the B matrix. However, for small matrices, this matrix row analysis can be expensive. Inspired by this, this paper adopts a second matrix load balancing method with less additional overhead (Sect. 5.1). Du et al. [21] proposed OpSparse. The algorithm addresses seven inefficient parts of NSparse and spECK, resulting in significant performance enhancements on Nvidia Tesla V100 GPUs. OpSparse improves binning performance by using shared memory to move atomic operations from global memory to shared memory, but it does not take registers into account in this process. Our work makes full use of registers and shared memory to eliminate most atomic operations and further improve binning performance (Sect. 5.2).
4 Main ideas and inefficient parts of NSparse
This section mainly introduces the main idea of the NSparse algorithm proposed by Nagasaka et al. [11] and discusses its four inefficient parts.
4.1 Main ideas of NSparse
NSparse is composed of seven steps, as shown in Fig. 4. The first step is to calculate Nprod for every row of the result matrix. The second step assigns different rows to different cores for calculation based on the Nprod of each row to ensure load balance. The third step involves utilizing a hash table to determine the exact size of the result matrix. The hash table is allocated in the shared memory or global memory of the target device. Next, the fourth and fifth steps allocate memory on the target device and set the row pointers in CSR format for the result matrix. In the sixth step, the focus is on achieving a balanced load by classifying rows into separate bins according to their Nnz. In the last step, numerical calculations are performed using a hash-based accumulation method to determine the values and column indices of the result matrix.
NSparse primarily utilizes a hash table for the accumulation of intermediate products. Figure 5 shows the hash-based accumulation method. This method starts by creating a hash table in temporary space, with its size being an upper bound on the Nnz in the result row. The keys of the hash table are column indices, and their initial values are set to \(-1\). During the computation, a hash function is used to efficiently find the position for inserting the intermediate product. Before insertion, the hash table verifies the key value. If it is \(-1\) or equal to the column index of the intermediate product to be inserted, accumulation is performed directly. If a hash collision occurs, linear probing is employed to locate an available position in the hash table for insertion. Once the hash operations are completed, a shrinking operation is applied to the hash table to transit into a dense state. Finally, the hash table is sorted according to column indices to obtain the result row stored in compressed format.
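The hash-based accumulation of Fig. 5 can be sketched serially as follows. The modulo hash function and table size below are illustrative assumptions; the real kernels run this per result row, in shared or global memory:

```python
# A serial sketch of hash-based accumulation: keys are column indices,
# empty slots hold -1, collisions are resolved by linear probing, and
# the table is finally compacted (shrunk) and sorted by column index.
def hash_accumulate(products, table_size):
    """products: list of (col, val) intermediate products for one row."""
    keys = [-1] * table_size
    vals = [0.0] * table_size
    for col, val in products:
        pos = col % table_size             # hash function: modulo table size
        while keys[pos] != -1 and keys[pos] != col:
            pos = (pos + 1) % table_size   # linear probing on collision
        keys[pos] = col
        vals[pos] += val                   # first insert or direct accumulate
    # shrink to a dense state, then sort by column index
    dense = [(k, v) for k, v in zip(keys, vals) if k != -1]
    return sorted(dense)
```

The table size is an upper bound on the Nnz of the result row, so probing always finds a free slot.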
The main objectives of NSparse are to address three key challenges: memory allocation for the result matrix, accumulation of intermediate products, and global load balance. This method demonstrates substantial performance enhancements in common benchmark tests. However, its implementation also has several inefficient parts. In the following sections, we will explore four of these inefficient parts.
4.2 Inefficient parts of the NSparse implementation
4.2.1 Second matrix load balance
NSparse assigns a varying number of threads to each row of matrix A through binning to address the issue of global load imbalance. However, it neglects to perform local load balance on matrix B. Before running, NSparse utilizes a heuristic approach to allocate a fixed number of threads to the B matrix. When the result rows are sparse, a single thread is used to access a row of B, whereas when the result rows are dense, a warp is used to access a row of B. However, it is important to note that the Nnz in each row of the B matrix may be different, and the result matrix's sparsity cannot be determined solely from the B matrix. Therefore, if the number of threads is set based on the sparsity of the result rows, it can result in a significant load imbalance when threads access matrix B. For example, for the webbase-1M matrix in our test set, the longest row of B has 4700 nonzero elements and the shortest has zero. When calculating long result rows, NSparse assigns 32 threads to access each of these two rows. However, the longest row requires 147 iterations, whereas the shortest row does not require any iterations. This situation can lead to severe thread load imbalance.
4.2.2 Binning method
The binning operation aims to classify the rows of matrix A into bins and assign them to different GPU cores for computation. This helps to address the issue of load imbalance during the computation process. During the binning process, atomic operations are crucial as they ensure that multiple GPU threads can safely modify a shared variable concurrently. The binning operation used by NSparse is divided into two phases, and Fig. 6 shows these two phases of the binning process. The first phase counts the size of each bin and generates offsets to record where each bin is stored in one-dimensional space. During this phase, every thread carries out a single atomicAdd and atomicMax operation on global memory. During the second phase, the row IDs of different rows are allocated to different bins according to the Nprod (or Nnz). During this phase, every thread must perform two atomicAdd operations on global memory. It is evident that during the binning process, NSparse performs a large number of atomic operations on global memory, leading to significant global memory traffic.
Despite the binning operation being an auxiliary operation of NSparse with a low time complexity, our experiments (Fig. 10 in Sect. 6) indicate that it accounts for over \(12\%\) of the total execution time on average, and for some small benchmarks it even exceeds \(35\%\). This indicates that the binning method utilized by NSparse is inefficient.
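The two-phase process described above can be sketched serially as follows. The bin boundaries below are illustrative powers of two, not the exact bin ranges of NSparse; in the real implementation both the counting and the scatter are done with atomics:

```python
# A serial sketch of two-phase binning: phase one counts bin sizes and
# turns them into offsets with an exclusive prefix sum; phase two
# scatters each row id into its bin in one-dimensional storage.
def bin_rows(nprod, bounds):
    def bin_id(n):
        for b, upper in enumerate(bounds):
            if n <= upper:
                return b
        return len(bounds)               # last bin catches the longest rows
    nbins = len(bounds) + 1
    sizes = [0] * nbins
    for n in nprod:                      # phase 1: count each bin
        sizes[bin_id(n)] += 1
    offsets, total = [], 0
    for s in sizes:                      # exclusive prefix sum -> offsets
        offsets.append(total)
        total += s
    cursor = list(offsets)
    bins = [0] * len(nprod)
    for row, n in enumerate(nprod):      # phase 2: scatter row ids
        b = bin_id(n)
        bins[cursor[b]] = row
        cursor[b] += 1
    return offsets, bins
```

In NSparse, `sizes` and `cursor` live in global memory and every increment is an atomic operation, which is precisely the overhead targeted in Sect. 5.2.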
4.2.3 Numerical SpGEMM
During numerical SpGEMM (Fig. 5), NSparse utilizes a hash table to accumulate intermediate products and then carries out a shrink operation. Because the shrink operation shrinks the hash table in place using multiple threads, atomic operations are required to obtain the storage locations of nonzero elements after shrinkage. These atomic operations lead to excessive shared memory accesses and cause bank conflicts in shared memory. During the shrinking process, a significant number of branch operations are also utilized, resulting in warp divergence. These factors all affect the performance of the numerical SpGEMM. In addition, NSparse returns the results using a counting-like sorting method that is also relatively inefficient.
Furthermore, in the numerical SpGEMM process, NSparse assigns the same number of threads and hardware resources to rows of matrix A that are entirely zero or contain a single nonzero element as it does to other rows. This results in an increase in idle threads and a waste of resources. Among the 29 benchmark matrices we selected, eight matrices include such special rows. It is worth noting that in matrices such as cit-Patents, patents_main, and NotreDame_actors, such rows make up more than 10% of all rows. It is evident that these particular rows are common and require a distinct approach to their handling.
4.2.4 Kernel configuration
There is an unreasonable situation in NSparse’s thread and computing task mapping. For instance, in the second group of numerical calculations, result rows with Nnz ranging from 17 to 256 are allocated 32 threads each for calculation. However, if the nonzero elements are mostly distributed on the left boundary, it may result in many idle threads. Additionally, for result rows where nonzero elements are mostly distributed on the right boundary, the workload of threads may become excessively heavy. This unreasonable mapping can result in a global load imbalance, which can potentially hinder the algorithm’s performance.
5 Optimization
This section mainly introduces the optimization of the above four inefficient parts of the NSparse implementation on DCU.
5.1 Optimization of the second matrix load balance
The main objective of load balance for the second matrix is to determine the number of threads assigned to access each row of matrix B. In Fig. 7, during the computation process, N threads are allocated to process one row of the A matrix. These N threads are divided into K groups, with each group consisting of \(G = N/K\) threads. Each group reads a nonzero element of the A matrix multiplied by a row of the corresponding B matrix. It can be seen that the performance of the algorithm is affected by the size of G. If G is too large, it will result in an increase in the number of iterations and the generation of many idle threads. If G is chosen to be too small, the workload of the threads will become heavy for the long rows of B.
Based on the observations above, an intuitive approach is to set the size of G according to the average value of Nnz in the B matrix. However, due to the possibility of significant variations in the Nnz of each row of the B matrix, using an average value for G can be unfavorable for longer rows. In our experiments, we discovered that longer rows of the B matrix have a more significant impact on performance. Therefore, we focus more on long rows. During runtime, the row information of the B matrix is extracted to determine the maximum row length. Based on this information, the size of G is set accordingly. We set each thread to access up to 8 elements of the B matrix as much as possible. To reduce synchronization overhead, we set the size of G to be no more than the size of one warp (64 here). Therefore, we set up a candidate set for G, \(\{2,4,8,16,32,64\}\). For a common GPU with a warp size of 32, the maximum value of G is set to 32.
Extracting the row information of B can be completed during the calculation of Nprod, resulting in negligible additional overhead. Therefore, this approach can enhance the load balance of the second matrix with minimal additional overhead.
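The selection rule described above can be sketched as follows. The exact policy (smallest candidate G such that each thread handles at most 8 elements of the longest B row, capped at the warp size) is our reading of the heuristic and is assumed for illustration:

```python
# A sketch of the G-selection heuristic: pick the smallest group size
# from the candidate set {2, 4, 8, 16, 32, 64} such that each thread
# touches at most elems_per_thread elements of the longest row of B,
# never exceeding the warp size of the target device.
def choose_group_size(max_row_len, warp_size=64, elems_per_thread=8):
    for g in (2, 4, 8, 16, 32, 64):
        if g > warp_size:
            break
        if g * elems_per_thread >= max_row_len:
            return g
    return min(64, warp_size)    # longest rows get a full warp
```

For example, a maximum B-row length of 4700 (as in webbase-1M) saturates the candidate set and a full warp is assigned, while short maximum rows keep G small so few threads sit idle.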
5.2 Optimization of binning method
Before executing symbolic SpGEMM and numerical SpGEMM, a binning process must be carried out. Both operations have identical functions: classifying the row ids of the A matrix based on the Nprod (or Nnz) of each row and writing them into a one-dimensional contiguous storage space. The binning process involves two stages of calculation. Algorithm 2 shows the first stage of the binning operation. Its objective is to calculate the size of each bin and then use an exclusive prefix sum to obtain each bin's offset in the one-dimensional storage space. Algorithm 3 demonstrates the second stage of the binning calculation, where the row id is recorded in each bin. For other devices, the parameter warpsize (line 18 of Algorithm 2) can be modified to the warp size of the target device.
Because atomic operations on global memory have the greatest impact on performance, we maximize the use of registers and shared memory to reduce their occurrence. Specifically, in order to determine the size of each bin, every thread within the warp initially utilizes a register to store its target bin id (lines 8–16 of Algorithm 2). The __shfl_down instruction is utilized within the warp to perform reduction operations for obtaining the local bin size based on the warp (lines 18–22 of Algorithm 2). Next, within each block, an atomicAdd operation is performed on the shared memory to calculate the local bin size for each block (lines 24–28 of Algorithm 2). Finally, these values are written to global memory, resulting in the global bin size (lines 31–33 of Algorithm 2). In the second stage, after calculating the block-based local bin size, the starting position of each block for binning is determined (lines 9–13 of Algorithm 3). Then, the row id is written into the corresponding bin (lines 16–25 of Algorithm 3).
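A serial simulation of this reduction hierarchy follows. It is an illustrative analogue of the first stage (Algorithm 2), not the kernel itself: per-thread counts are first reduced within each warp (via `__shfl_down` in the real kernel), warp sums are then combined per block in shared memory, and only the per-block totals reach global memory:

```python
# Serial simulation of the hierarchical bin counting: thread -> warp
# -> block -> global. Each level absorbs atomics that would otherwise
# hit global memory; only one update per (block, bin) goes global.
def count_bins_hierarchical(thread_bin_ids, warp_size, block_size, nbins):
    global_sizes = [0] * nbins                   # global memory counters
    for blk_start in range(0, len(thread_bin_ids), block_size):
        block = thread_bin_ids[blk_start:blk_start + block_size]
        shared = [0] * nbins                     # per-block shared memory
        for w in range(0, len(block), warp_size):
            warp = block[w:w + warp_size]
            for b in range(nbins):               # warp-level reduction
                shared[b] += sum(1 for t in warp if t == b)
        for b in range(nbins):                   # one global update per bin
            global_sizes[b] += shared[b]
    return global_sizes
```

With B blocks and K bins, global memory sees at most B * K updates instead of one atomic per row, which is the source of the speedup reported for the optimized binning.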
The above process takes advantage of the locality of computing tasks within the thread group and makes full use of registers and shared memory. This significantly reduces the need for atomic operations on global memory, resulting in improved performance for binning operations. At the same time, all zero rows are also excluded during the binning process to avoid wasting resources in the calculation stage.
5.3 Optimization of numerical SpGEMM
Figure 8 shows the optimization we made to the numerical SpGEMM process. We utilize a combination of direct and hash methods to accumulate intermediate products. During the calculation process, the rows of the A matrix are initially examined to determine whether they consist of just a single nonzero element. If so, the direct method is employed: the single nonzero element of the A matrix is multiplied with the nonzero elements of the corresponding row of the B matrix in sequence, and the results are stored directly. If the row of matrix A contains more than one nonzero element, a hash operation is performed to accumulate the intermediate products. Then, sorting is done based on the key values of the hash table. We utilize the highly efficient Bitonic Sort for our sorting algorithm. Once the sorting is complete, the starting position where the result needs to be written is determined by subtracting the Nnz of the current result row from the hash table size. For all-zero rows in matrix A, the result rows are all zero and do not need to be computed in any way, so they are directly excluded during the binning process (lines 15–17 of Algorithm 3).
The above method helps reduce the number of idle threads and waste of hardware resources. Additionally, it eliminates shrinkage operations in the numerical computation process, reducing access to shared memory and branching operations.
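The per-row dispatch described above can be sketched as follows. The helper below is illustrative (serial, with a dictionary standing in for the hash table and Python's sort standing in for Bitonic Sort); all-zero rows are assumed to have been excluded during binning:

```python
# A sketch of the combined direct/hash numerical method: rows of A
# with a single nonzero skip hashing entirely, since the result row is
# just that scalar times one row of B (already in column order in CSR).
def numeric_row(a_cols, a_vals, b_rpt, b_col, b_val):
    if len(a_cols) == 1:                 # direct method: scale one B row
        k, a_val = a_cols[0], a_vals[0]
        lo, hi = b_rpt[k], b_rpt[k + 1]
        return b_col[lo:hi], [a_val * v for v in b_val[lo:hi]]
    acc = {}                             # hash method for general rows
    for k, a_val in zip(a_cols, a_vals):
        for idx in range(b_rpt[k], b_rpt[k + 1]):
            j = b_col[idx]
            acc[j] = acc.get(j, 0.0) + a_val * b_val[idx]
    cols = sorted(acc)                   # Bitonic Sort in the real kernel
    return cols, [acc[j] for j in cols]
```

The direct path needs no hash table, no shrink, and no sort, which is why single-nonzero rows (over 10% of rows in matrices like cit-Patents) benefit from the separate treatment.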
5.4 Optimization of kernel configuration
This paper focuses on finer-grained grouping and kernel configuration to improve global load balance. We use the Hygon DCU as the target device for kernel configuration. Each compute unit (CU) in the DCU has the same amount of shared memory, with each CU having 64KB. Additionally, each block has up to 48KB of shared memory available. Each block can contain up to 1024 threads. More specific parameters of the Hygon DCU are presented in Sect. 2.5. For SIMT hardware with different configurations, Bin 7 can be scaled up or down based on its shared memory size. For instance, for a device with a maximum available shared memory of 64KB, bin 7 of the symbolic stage can be expanded to 16,384, while bin 7 of the numerical stage can be expanded to 5461.
5.4.1 Kernel configuration in symbolic step
Table 2 shows the kernel configurations and the selection of the bin ranges for the symbolic step. In the symbolic step, the 9 bins are divided based on the Nprod of each row and are computed by 9 kernels. Since the DCU has at most 48KB of shared memory available per block and each column index in the symbolic step occupies 4B, the maximum size of the shared memory hash table in each block is 12,288 entries. Therefore, the hash tables for Kernel0-Kernel7 are implemented in shared memory, while the hash table for Kernel8 is implemented in global memory, with twice the size of the upper bound of the Nprod in the result rows. In order to avoid warp divergence caused by short rows (those with \(Nnz < warp\ size\)), every 16 short rows are combined into a single row group and given to Kernel0 for processing. The threads are then organized into thread groups of size eight, with each group handling a single short row. Kernel1 processes one row using a warp, while Kernels 2 through 8 all use one block to process one row. The above grouping enables a more balanced workload among threads, thus enhancing global load balance in the symbolic step.
5.4.2 Kernel configuration in numerical step
Table 3 shows the kernel configuration for the numerical step and the selection of the bin ranges. Additionally, it highlights the use of nine bins, which are processed by ten kernels. The hash table in the numerical step is made up of column indices and values, with each pair occupying 12B, yielding a maximum shared memory hash table size of 4096. Each block of Kernel0 and Kernel1 uses thread groups to calculate multiple rows. Kernel2 utilizes a warp for processing a single row, whereas Kernel3–Kernel9 employ a single block for calculating a row.
Both the Kernel8 and Kernel9 hash tables reside in global memory. They differ in that Kernel8 utilizes the highly optimized thrust::sort_by_key API [35] for sorting after hashing. Subsequently, the calculated result rows are copied to the result storage space through device-to-device memory copying. To reduce the sorting overhead, its hash table is set to 1.5 times the upper bound of the Nnz in the result rows. The hash table size used by Kernel9 is twice the upper bound of the Nnz in the result rows, and it performs calculations in the conventional way. When the number of rows allocated to bin8 is small, Kernel8 is invoked; when the number of rows is large, Kernel9 is called. This is because when there are many rows, it becomes necessary to perform multiple device-to-device row copies, which is quite time-consuming; the advantage of Kernel8 therefore diminishes as the number of rows increases. The mentioned grouping and kernel configuration can improve global load balance during the numerical step, while also enhancing performance in handling long rows.
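The shared memory hash table capacities quoted above follow directly from the 48KB per-block limit: 4B per entry (a column index only) in the symbolic step, and 12B per entry (a 4B index plus an 8B double value) in the numerical step:

```python
# Sizing arithmetic for the shared-memory hash tables on the DCU:
# 48 KB available per block, divided by the bytes per hash-table entry.
SHARED_MEM_PER_BLOCK = 48 * 1024   # bytes

def max_hash_entries(bytes_per_entry):
    return SHARED_MEM_PER_BLOCK // bytes_per_entry

symbolic_capacity = max_hash_entries(4)    # column index only: 4 B
numeric_capacity = max_hash_entries(12)    # index (4 B) + value (8 B)
```

The same arithmetic explains the scaled bin 7 sizes quoted in Sect. 5.4 for a device with 64KB of usable shared memory (64KB / 4B = 16,384 symbolic entries; 64KB / 12B = 5461 numerical entries).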
6 Performance evaluation
We conducted an overall performance comparison with five state-of-the-art SpGEMM algorithms: bhSparse, KokkosKernels, NSparse, rocSparse, and spECK. To better demonstrate the effectiveness of our optimizations, we also compared the computational and additional auxiliary overheads against NSparse. In all benchmark tests, we evaluated the performance of double-precision multiplication for \(C = A^2\).
6.1 Evaluation environments
To evaluate the performance of our algorithms, we used a Deep Computing Unit (DCU) as the accelerator. The code is written in HIP and compiled with HIPCC (version 4.2.2), and runs on a Hygon CPU under the CentOS 7.6 operating system. The CPU has 32 physical cores (64 hardware threads), 32 KB of L1 cache, and 128 GB of memory. Each program is warmed up for one round and then executed 20 times, with the average time taken as the base time. Benchmarks are evaluated by FLOPS performance [11, 17, 20, 21], calculated as twice the number of intermediate products generated in computing the result matrix, divided by the base time.
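The FLOPS metric above amounts to counting one multiply and one add per intermediate product. A minimal sketch of the calculation (the function name and GFLOPS scaling are our own, not from the paper):

```python
def spgemm_gflops(num_intermediate_products: int, base_time_s: float) -> float:
    """FLOPS metric used in the evaluation: each intermediate product
    contributes two floating-point operations (multiply + add).
    Returns GFLOPS for readability."""
    return 2.0 * num_intermediate_products / base_time_s / 1e9
```
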
We select 29 matrices from the SuiteSparse matrix collection [36] as our benchmark data. These matrices exhibit various sparse characteristics and consist of 27 square matrices and two non-square matrices; they have been extensively used in previous SpGEMM performance evaluations [11, 17,18,19,20,21, 23,24,25,26,27, 30, 37]. Table 4 summarizes the details of these matrices. The bhSparse algorithm is unable to execute the last two matrices because their intermediate products require a prohibitive amount of temporary storage space.
The source code provided by the SpGEMM algorithms we compare against is written in CUDA [38]. Since CUDA and HIP are highly similar, we use HIPIFY [39] to transcode bhSparse, NSparse, and spECK to HIP so that they run on the DCU; this does not change the logic of their algorithms. Additionally, KokkosKernels is built with its HIP backend.
6.2 Overall performance of SpGEMM
Figure 9 shows the overall FLOPS performance of SpGEMM. Overall, our algorithm demonstrates the best performance, followed by spECK and rocSparse, while bhSparse and KokkosKernels have the lowest performance. In terms of acceleration, our work achieves an average speedup of 7.99x (up to 18.2x) compared to bhSparse, 8.01x (up to 20.83x) compared to KokkosKernels, 2.37x (up to 6.16x) compared to NSparse, 1.82x (up to 4.20x) compared to rocSparse, and 1.63x (up to 5.01x) compared to spECK.
Among all the benchmark matrices, our work achieves the best results on 27 matrices, with particularly significant speedups on m133-b3, mc2depi, atmosmodl, parabolic_fem, wb-edu, and mari002. However, on cage15 and conf5_4-8x8-05, our work does not perform as well as spECK. This is because, during the numerical SpGEMM phase, more than 50% of the rows in these two matrices use hash tables with occupancy greater than 80%, so more hash conflicts arise during numerical computation; resolving them causes warp divergence and more idle threads, limiting SpGEMM performance. This also highlights a limitation of hash-based methods. In contrast, spECK keeps the hash table occupancy below 66% by expanding the table, so it performs better on this type of matrix. We decided against adopting this strategy because enlarging the hash table increases the sorting overhead, which is detrimental for most matrices.
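The occupancy trade-off can be illustrated with a small sketch. The 66% bound is stated in the text; the power-of-two growth in `expand_for_occupancy` is an assumption of ours for illustration, not necessarily spECK's actual sizing rule.

```python
def occupancy(nnz_row: int, table_size: int) -> float:
    """Fraction of hash table slots that end up occupied for a row."""
    return nnz_row / table_size

def expand_for_occupancy(nnz_row: int, max_occupancy: float = 0.66) -> int:
    """Hypothetical spECK-style sizing: grow the table (here by doubling,
    an assumption for illustration) until occupancy stays at or below
    max_occupancy. Larger tables mean fewer conflicts but more to sort."""
    size = 1
    while nnz_row / size > max_occupancy:
        size *= 2
    return size
```
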
6.3 Multiple optimizations performance analysis of SpGEMM
In order to clearly demonstrate our multiple optimizations for NSparse, in this section, we analyze the contributions of various optimizations to the performance improvements of NSparse, highlighting their significance. Given that the time taken for memory allocation is the same, our next focus is solely on comparing the binning overhead and computational overhead of SpGEMM. In this case, the computational overhead involves the time it takes for both the symbolic SpGEMM and numerical SpGEMM.
6.3.1 The ratio of execution time
The ratio of execution time of each stage is shown in Fig. 10, where the execution time of NSparse is normalized to 1. The majority of the computation time is spent on numerical SpGEMM. Compared with NSparse, our work greatly reduces the time of numerical SpGEMM, with especially significant results on matrix atmosmodl. For symbolic SpGEMM, we achieve significant acceleration on most matrices. The overall effect of our binning operation is equally significant. The time required for binning correlates with matrix size: for large matrices such as webbase-1M, cage15, and wb-edu, the binning overhead is almost negligible, whereas on smaller matrices like m133-b3 and patents_main it is comparatively more significant. In the following sections, we conduct separate comparative analyses of these stages.
6.3.2 Binning overhead
Figure 11 shows the percentage of total SpGEMM time spent on the two binning operations. The results demonstrate that our work performs best across all benchmark matrices. NSparse's additional auxiliary overhead accounts for 12.9% of the overall overhead on average and 35.49% in the worst case, whereas in our work it represents a minor portion: 4.9% on average and 10.15% in the worst case. This indicates that NSparse's auxiliary tasks incur greater overhead. We exploit the locality of computing tasks within thread groups and make full use of registers and shared memory to reduce atomic operations on global memory, making our binning method more efficient than NSparse's.
6.3.3 Computation overhead
Compared to NSparse, we applied finer-grained grouping and kernel configuration as well as load-balance optimization for the second matrix. In addition, the optimizations in the numerical SpGEMM step focus mainly on minimizing shared-memory accesses and handling special rows. Specifically, our computational mode for numerical SpGEMM sorts the hash table with an efficient sorting algorithm and then writes the result directly, with special rows handled separately. NSparse, by contrast, must first perform a shrinking operation and then determine each result's final position through a counting-sort-like pass.
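The two result-writing modes can be contrasted on a toy, dictionary-based hash table. This is a simplified host-side sketch of the data flow only, not the actual GPU kernels; both functions assume each column appears at most once, as in a hash table.

```python
def write_result_sorted(hash_table: dict):
    """Our mode (sketch): sort the (column, value) pairs by column and
    write the result row out directly."""
    return sorted(hash_table.items())

def write_result_nsparse_style(slots: list):
    """NSparse-style mode (sketch): first shrink the table by dropping
    empty slots, then place each entry at its final position by counting
    how many surviving columns are smaller (a counting-sort-like pass).
    Assumes column indices are unique."""
    entries = [e for e in slots if e is not None]          # shrink step
    row = [None] * len(entries)
    for col, val in entries:
        pos = sum(1 for c, _ in entries if c < col)        # count smaller cols
        row[pos] = (col, val)
    return row
```

Both produce the same sorted row; the difference lies in the extra shrinking pass and position counting that the NSparse-style mode performs.
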
Figures 12 and 13 show the FLOPS performance comparison between our work and NSparse in symbolic SpGEMM and numerical SpGEMM, respectively. The results demonstrate that for the symbolic step, our work performs optimally on 28 matrices, with an average speedup of 2.23x (up to 8.9x) over NSparse. In the numerical step, our work performs optimally on all benchmarks, achieving an average speedup of 2.52x (up to 5.83x). This demonstrates the effectiveness of our enhanced second matrix load balance scheme, numerical SpGEMM optimization scheme, and fine-grained grouping and kernel configuration scheme.
7 Summary and outlook
This paper provides an in-depth analysis of four inefficient parts of the NSparse implementation and optimizes them on DCU: 1) the load balance of the second matrix is enhanced by extracting its maximum row information and setting parameters at runtime; 2) the binning method is optimized by making full use of registers and shared memory; 3) the computation mode of numerical SpGEMM is optimized to improve its efficiency; and 4) the global load balance is improved by finer-grained grouping and kernel configuration. The experimental results demonstrate that our work achieves average speedups of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x) over the five state-of-the-art SpGEMM algorithms bhSparse, KokkosKernels, NSparse, rocSparse, and spECK, respectively.
In the future, we will continue to work on the performance optimization of SpGEMM. We will apply the optimization techniques proposed in this paper to more advanced SpGEMM algorithms, for example, applying the binning method to spECK and the numerical computation mode to OpSparse. In addition, we will apply these techniques to other sparse matrix operations, such as using the binning method in sparse-matrix dense-matrix multiplication (SpMM) to improve its load balance.
Data availability
The dataset is the open source SuiteSparse benchmark and can be accessed at https://suitesparse-collection-website.herokuapp.com/.
References
Bell N, Dalton S, Olson LN (2012) Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J Sci Comput 34(4):C123–C152. https://doi.org/10.1137/110838844
Ballard G, Siefert C, Hu J (2016) Reducing communication costs for sparse matrix multiplication within algebraic multigrid. SIAM J Sci Comput 38(3):C203–C231. https://doi.org/10.1137/15M1028807
Then M, Kaufmann M, Chirigati F, et al (2014) The more the merrier: efficient multi-source graph traversal. Proc VLDB Endow 8(4):449–460. https://doi.org/10.14778/2735496.2735507
Buluç A, Madduri K (2011) Parallel breadth-first search on distributed memory systems. In: Conference on High Performance Computing Networking, Storage and Analysis, pp 65:1–65:12. https://doi.org/10.1145/2063384.2063471
Kaplan H, Sharir M, Verbin E (2006) Colored intersection searching via sparse rectangular matrix multiplication. In: Proceedings of the 22nd ACM Symposium on Computational Geometry, pp 52–60. https://doi.org/10.1145/1137856.1137866
Davis TA (2018) Graph algorithms via SuiteSparse:GraphBLAS: triangle counting and k-truss. In: 2018 IEEE High Performance Extreme Computing Conference, pp 1–6. https://doi.org/10.1109/HPEC.2018.8547538
Azad A, Buluç A, Gilbert JR (2015) Parallel triangle counting and enumeration using matrix algebra. In: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pp 804–811. https://doi.org/10.1109/IPDPSW.2015.75
Buluç A, Gilbert JR (2011) The Combinatorial BLAS: design, implementation, and applications. Int J High Perform Comput Appl 25(4):496–509. https://doi.org/10.1177/1094342011403516
Niu Q, Lai PW, Faisal SM, et al (2014) A fast implementation of MLR-MCL algorithm on multi-core processors. In: 2014 21st International Conference on High Performance Computing (HiPC), pp 1–10. https://doi.org/10.1109/HiPC.2014.7116888
Bustamam A, Burrage K, Hamilton NA (2010) A GPU implementation of fast parallel Markov clustering in bioinformatics using ellpack-r sparse data format. In: 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp 173–175. https://doi.org/10.1109/ACT.2010.10
Nagasaka Y, Nukada A, Matsuoka S (2017) High-performance and memory-saving sparse general matrix-matrix multiplication for NVIDIA Pascal GPU. In: Proceedings of the 46th International Conference on Parallel Processing (ICPP), pp 101–110. https://doi.org/10.1109/ICPP.2017.19
Han P, Hua H, Wang H et al (2024) A universal parallel simulation framework for energy pipeline networks on high-performance computers. J Supercomput. https://doi.org/10.1007/s11227-024-05996-z
Guo H, Zhang L, Zhang Y et al (2024) Openmp offloading data transfer optimization for DCUs. J Supercomput 80(2):2381–2402. https://doi.org/10.1007/s11227-023-05422-w
Niu J, Gao W, Han L, et al (2023) A DCU code generation and optimization method based on polyhedral model. In: International Conference on Cloud Computing, Performance Computing, and Deep Learning (CCPCDL 2023), SPIE, pp 416–428
Zhou QW, Li JN, Zhao RC, et al (2023) Compilation optimization of DCU-oriented openMP thread scheduling. In: Journal of Physics: Conference Series, IOP Publishing, p 012003. https://doi.org/10.1088/1742-6596/2558/1/012003
Hua H, Jin Q, Zhang Y, et al (2023) Immersed boundary method of two-phase flow based on DCU parallel acceleration. In: International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2023), SPIE, pp 265–274. https://doi.org/10.1117/12.2681641
Liu W, Vinter B (2015) A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors. J Parallel Distrib Comput 85:47–61. https://doi.org/10.1016/j.jpdc.2015.06.010
Deveci M, Trott C, Rajamanickam S (2018) Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures. Parallel Comput 78:33–46. https://doi.org/10.1016/j.parco.2018.06.009
AMD (2023) Rocsparse documentation. https://rocsparse.readthedocs.io/en/master. Accessed 22 December 2023
Parger M, Winter M, Mlakar D, et al (2020) spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 362–375. https://doi.org/10.1145/3332466.3374521
Du Z et al (2022) OpSparse: a highly optimized framework for sparse general matrix multiplication on GPUs. IEEE Access 10:85960–85974. https://doi.org/10.1109/ACCESS.2022.3196940
Gustavson FG (1978) Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Trans Math Softw 4(3):250–269. https://doi.org/10.1145/355791.355796
Demouth J (2012) Sparse matrix-matrix multiplication on the GPU. In: Proceedings of the GPU Technology Conference (GTC), pp 1–21
Niu Y, Lu Z, Ji H, et al (2022) Tilespgemm: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 90–106. https://doi.org/10.1145/3503221.3508431
Winter M, Mlakar D, Zayer R, et al (2019) Adaptive sparse matrix-matrix multiplication on the GPU. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp 68–81. https://doi.org/10.1145/3293883.3295701
Gremse F, Hofter A, Schwen LO et al (2015) GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SIAM J Sci Comput 37(1):C54–C71. https://doi.org/10.1137/130948811
Anh PNQ, Fan R, Wen Y (2016) Balanced hashing and efficient GPU sparse general matrix-matrix multiplication. In: Proceedings of the 2016 International Conference on Supercomputing, pp 1–12. https://doi.org/10.1145/2925426.2926273
Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55(3):441–453. https://doi.org/10.1006/jcss.1997.1534
Du Z, et al (2023) Predicting the output structure of sparse matrix multiplication with sampled compression ratio. In: 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS), pp 483–490. https://doi.org/10.1109/ICPADS56603.2022.00069
Liu J, He X, Liu W et al (2019) Register-aware optimizations for parallel sparse matrix-matrix multiplication. Int J Parallel Prog 47(3):403–417. https://doi.org/10.1007/s10766-018-0604-8
Shah V, Gilbert JR (2010) Sparse matrices in Matlab*P: design and implementation. In: High Performance Computing-HiPC 2004: 11th International Conference, Bangalore, India, December 19–22, 2004. Proceedings 11, pp 144–155. https://doi.org/10.1007/978-3-540-30474-6_20
AMD (2024) Rocm documentation. https://rocm.docs.amd.com/projects/HIP/en/latest/index.html. Accessed 24 April 2024
AMD (2023) Hip documentation. https://rocm.docs.amd.com/projects/HIP/en/latest/index.html. Accessed 22 December 2023
Kurt SE, Thumma V, Hong C, et al (2017) Characterization of data movement requirements for sparse matrix computations on GPUs. In: 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp 283–293. https://doi.org/10.1109/HiPC.2017.00040
NVIDIA (2023) Thrust documentation. https://thrust.github.io/doc/index.html. Accessed 22 December 2023
Davis TA, Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans Math Softw 38(1):1–25. https://doi.org/10.1145/2049662.2049663
Dalton S, Olson L, Bell N (2015) Optimizing sparse matrix-matrix multiplication for the GPU. ACM Trans Math Softw (TOMS) 41(4):1–20. https://doi.org/10.1145/2699470
NVIDIA (2023) Cuda documentation. https://docs.nvidia.com/cuda/. Accessed 22 December 2023
AMD (2024) Hipify documentation. https://rocm.docs.amd.com/projects/HIPIFY/en/latest/index.html. Accessed 21 March 2024
Funding
This research was funded by the Major Science and Technology Special Projects in Henan Province (221100210600 and 201400210100).
Author information
Authors and Affiliations
Contributions
H.G., H.W. and W.C. conceived the idea and wrote the main manuscript text, C.Z., Y.H. and S.Z. participated in experiments, D.Z. prepared Figs. 1, 2, 3, 4, 5, 6, 7, Y.G. prepared Figs. 8, 9, 10, 11, J.S., T.W., Q.L. and G.W. provided suggestions for revisions to the paper, and all authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Guo, H., Wang, H., Chen, W. et al. Optimizing sparse general matrix–matrix multiplication for DCUs. J Supercomput (2024). https://doi.org/10.1007/s11227-024-06234-2