1 Introduction

Sparse general matrix–matrix multiplication (SpGEMM) computes the product of two sparse matrices A and B to obtain a sparse result matrix C. It is a key computational kernel in many scientific and engineering applications, particularly in linear algebra, graph algorithms, and machine learning. Specific applications include algebraic multigrid solvers [1, 2], multi-source breadth-first search [3, 4], colored intersection search [5], triangle counting [6, 7], graph betweenness centrality [8], the Markov clustering algorithm [9, 10], etc. Improving the efficiency of SpGEMM is therefore crucial.

The memory access and computational behavior of SpGEMM are influenced by several factors, including the distribution of nonzero elements in the sparse matrices, the compressed storage format, and the computational data flow used. This poses significant challenges when computing SpGEMM on SIMT hardware. First, the uneven distribution of nonzero elements in the input matrices causes the number of intermediate products produced for each row of the result matrix to vary, which creates an uneven workload across threads. Second, because the sparsity of the result matrix is unpredictable, it is difficult to allocate memory in advance for the result matrix stored in a compressed format. Finally, the positions of the nonzero elements in each row of the output matrix are unknown, making it difficult to design an efficient accumulator for the intermediate products.

Although the state-of-the-art SpGEMM algorithm NSparse [11] has addressed the above issues to a certain extent and achieved significant performance improvements, it primarily emphasizes high-level algorithm design and overlooks some low-level optimizations related to the hardware architecture, which leaves parts of its implementation inefficient. In particular, its implementation neglects load balancing for the second input matrix, performs excessive atomic operations on global memory during the binning process, issues excessive shared memory accesses during numerical SpGEMM, and maps some threads to computation tasks improperly, all of which degrade SpGEMM performance.

To improve the performance of SpGEMM on the DCU (a GPU-like accelerator) [12,13,14,15,16], we analyze four inefficient parts of the NSparse implementation and optimize each of them based on the DCU architecture. We evaluate the performance of our optimized algorithm on 29 widely used benchmarks. Compared with five SpGEMM algorithms, bhSparse [17], KokkosKernels [18], NSparse [11], rocSparse [19], and spECK [20], our work achieves average speedups of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x), respectively.

The rest of the paper is organized as follows: Sect. 2 describes the background knowledge of SpGEMM and the architecture of DCU. Section 3 discusses the related work. Section 4 describes the main idea and four inefficient parts of the NSparse implementation. In Sect. 5, the optimization work for the four inefficient parts of the NSparse implementation is described. In Sect. 6, the experimental environment and the performance comparison between our optimized algorithm and five state-of-the-art SpGEMM algorithms are presented. Section 7 provides a summary and an outlook on future work.

2 Background

This section introduces background knowledge of SpGEMM and the architecture of the DCU.

2.1 Notations

Table 1 lists the notation used throughout this paper. The sizes of the three matrices A, B, and C are \(M \times K\), \(K \times N\), and \(M \times N\), respectively.

Table 1 Notations

2.2 CSR storage format

The Compressed Sparse Row (CSR) format is a widely used storage format for sparse matrices. It represents a sparse matrix efficiently by storing only the values of the nonzero elements, their column indices, and a row pointer array, thus saving storage space. The CSR format has been widely used in previous SpGEMM research [11, 17, 18, 20, 21]. We also select the CSR format as the input and output format of SpGEMM.

Fig. 1: Illustration of the CSR storage format

Figure 1 shows the CSR representation of a \(5 \times 5\) sparse matrix. The CSR format consists of three arrays. The first array, rpt, records the starting position of the first nonzero element of each row in the one-dimensional storage space; the Nnz of the i-th row can be computed as \(\text {rpt}[i+1]-\text {rpt}[i]\), so the size of the rpt array is \(M+1\). The second array, col, and the third array, val, record the column indices and values of the nonzero elements in row-major order, so the sizes of the col and val arrays equal the Nnz of the sparse matrix. The advantages of the CSR format are its ability to quickly retrieve row information through the row pointers and its minimal storage footprint.
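
As a small concrete example (our own illustration, independent of the matrix in Fig. 1), the three CSR arrays of a \(4 \times 4\) matrix look as follows:

```cpp
// Example 4x4 sparse matrix (our own illustration, not the matrix of Fig. 1):
//   [ 1 0 0 2 ]
//   [ 0 3 0 0 ]
//   [ 0 0 0 0 ]
//   [ 4 0 5 0 ]
int    rpt[] = {0, 2, 3, 3, 5};   // size M+1 = 5; row i holds rpt[i+1]-rpt[i] nonzeros
int    col[] = {0, 3, 1, 0, 2};   // column indices of the nonzeros, row-major order
double val[] = {1, 2, 3, 4, 5};   // values, aligned one-to-one with col
```

Row 2 is empty, so rpt[2] equals rpt[3]; the col and val arrays each hold exactly Nnz = 5 entries.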

2.3 SpGEMM and its row–row algorithm

SpGEMM computes \(C=AB\), where both input and output matrices are sparse and each element of matrix C is calculated as:

$$\begin{aligned} C_{ij}=\sum _{k}A_{ik} \times B_{kj} \end{aligned} \qquad (1)$$

where i and j denote the row index of matrix A and the column index of matrix B, respectively, and k denotes the column index of matrix A and the row index of matrix B.

Parallel SpGEMM implementations primarily use the row–row formulation proposed by Gustavson [22], as shown in Algorithm 1. In the row–row formulation, each nonzero element \(a_{ik}\) in the i-th row of matrix A is multiplied with all nonzero elements in the k-th row of matrix B, producing intermediate products for the current result row (line 5 of Algorithm 1). If the column index of an intermediate product appears for the first time, \(c_{ij}\) is initialized and the index is inserted into temporary storage (lines 6–8); intermediate products with the same column index are accumulated directly into \(c_{ij}\) (line 10). The row–row formulation has the advantage that each row of the result matrix is computed independently, which makes it well suited for parallel processing on SIMT hardware. This paper also adopts the row–row formulation.

Algorithm 1: Pseudo code of row–row SpGEMM
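
Since the pseudocode figure is not reproduced here, the following is a minimal serial C++ sketch of the row–row formulation on CSR inputs; the std::map accumulator stands in for the temporary storage of Algorithm 1, whereas the GPU implementations discussed later replace it with hash tables.

```cpp
#include <map>
#include <vector>

// Minimal serial sketch of Gustavson's row-row SpGEMM on CSR inputs.
void spgemm_row_row(int M,
                    const std::vector<int>& arpt, const std::vector<int>& acol,
                    const std::vector<double>& aval,
                    const std::vector<int>& brpt, const std::vector<int>& bcol,
                    const std::vector<double>& bval,
                    std::vector<int>& crpt, std::vector<int>& ccol,
                    std::vector<double>& cval) {
    crpt.assign(M + 1, 0);
    for (int i = 0; i < M; ++i) {
        std::map<int, double> acc;                 // column index -> accumulated value
        for (int p = arpt[i]; p < arpt[i + 1]; ++p) {
            int k = acol[p];
            double a_ik = aval[p];
            for (int q = brpt[k]; q < brpt[k + 1]; ++q)
                acc[bcol[q]] += a_ik * bval[q];    // accumulate intermediate products
        }
        for (const auto& kv : acc) {               // emit result row i in column order
            ccol.push_back(kv.first);
            cval.push_back(kv.second);
        }
        crpt[i + 1] = static_cast<int>(ccol.size());
    }
}
```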

2.4 General workflow of SpGEMM

Figure 2 illustrates the general workflow of SpGEMM, which consists of four steps. Allocating memory for the compressed storage of the result matrix is challenging because its sparsity is unpredictable, so the size of the result matrix must be predicted first. There are three common approaches: the exact method [11, 18, 20, 21, 23, 24], the upper-bound method [17, 25, 26], and the probabilistic method [27,28,29]. The exact method allocates exactly the required memory for the result matrix by first performing a symbolic SpGEMM computation to determine its size; because two SpGEMM passes are needed, this method is more time-consuming. The upper-bound method allocates the maximum possible amount of memory, typically proportional to the total number of intermediate products; the allocated memory often far exceeds the actual requirement, wasting resources. The probabilistic method is a compromise that estimates an approximate size through random sampling and probabilistic analysis of the input matrices; when the estimate turns out to be insufficient, additional memory must be allocated.

Fig. 2: General workflow of SpGEMM

The second step primarily involves memory allocation on the target device. The third step partitions the work according to the sparsity characteristics of the result matrix, ensuring a balanced workload across all threads. The last step is the most time-consuming part of the entire computation: it performs the numerical multiplications, generates a large number of intermediate products, and accumulates these intermediate products to obtain the final result rows. Commonly used accumulators include hash accumulators [11, 20, 21, 27, 30], heap accumulators [17], and dense accumulators [20, 31]. This paper uses the exact method to determine the size of the result matrix and hash accumulators to accumulate intermediate products.

2.5 Introduction of DCU

The Hygon DCU [12,13,14,15,16] is a GPU-like accelerator running in the Radeon Open Compute (ROCm) environment [32]. It is based on AMD's second-generation Vega architecture and supports HIP-based [33] heterogeneous computing applications. The Hygon DCU has 60 Compute Units (CUs) operating at a core clock frequency of 1.7 GHz, along with 16 GB of global memory. Figure 3 shows the internal structure of a CU. Each CU is equipped with 64 KB of shared memory and is divided into 4 SIMD compute groups. Each SIMD compute group consists of 16 lanes and can perform 16 double-precision and 32 single-precision floating-point operations per clock cycle. Additionally, each SIMD compute group is equipped with 64 KB of vector general-purpose registers (VGPRs), and a single thread can use up to 256 32-bit registers.

Unlike most GPUs, which have a warp size of 32, the DCU has a warp size of 64. A warp of 64 threads executes on one SIMD compute group, and each SIMD compute group can run up to 10 warps concurrently. In addition, each CU has a 16 KB L1 cache as an on-chip storage resource and a scalar ALU (SALU) shared within the CU, which is responsible for address calculation and branch evaluation.

Fig. 3: The internal structure of CU

3 Related work

In recent years, researchers have proposed many methods to improve the performance of SpGEMM on GPUs. Since GPU architectures are closely related to that of the DCU, much of the existing body of SpGEMM research on GPUs is also applicable to the DCU, and we draw on the advances made in these works to guide our DCU implementation. In this section, we present a variety of SpGEMM methods that use hash tables to accumulate intermediate products on the GPU [11, 20, 21, 23, 27, 34].

Demouth et al. [23] presented the SpGEMM implementation in the cuSPARSE mathematics library. To save storage space, the algorithm uses the exact method to predict the size of the result matrix, and it employs a shared memory hash table to accelerate the accumulation of intermediate products. When a collision occurs in the shared memory hash table, a global memory hash table is used, leading to many global memory accesses. In addition, the algorithm uses a single kernel configuration for all rows, leading to significant workload imbalance that impacts its efficiency. Anh et al. [27] proposed Balanced Hash. This algorithm addresses load imbalance by storing the intermediate products in global memory and distributing the work list evenly among threads for hash accumulation. Its hash table resides in shared memory and has a fixed size; once the calculation for each round is finished, the hash table's contents are written to global memory. When a hash conflict occurs, the intermediate product is placed in a temporary queue and handled in subsequent rounds of hash operations until the queue is empty. Although Balanced Hash achieves good load balance, the work list consumes a significant amount of temporary storage space, and handling hash collisions requires additional global memory accesses, limiting the algorithm's performance.

Kurt et al. [34] investigated in depth why SpGEMM performs worse than SpMV and used hypergraph partitioning to show that there is no inherent lower limit on data movement, even though SpGEMM must handle more data movement than SpMV. Based on this conclusion, they developed an adaptive work distribution strategy implementing hashing based on scatter-vector updates, which reduces the number of idle threads and improves performance significantly. Specifically, thread blocks are divided into virtual warps that each process one row of matrix A, and the size of a virtual warp is determined by the average number of nonzero elements in the rows of B referenced by the given row of matrix A. Nagasaka et al. [11] proposed NSparse, which groups rows by their Nprod (or Nnz) in the result matrix and creates hash tables of different sizes and kernel configurations for each group to fully utilize the GPU resources. This algorithm also eliminates the need to access global memory when a hash collision occurs. However, we find several inefficiencies in the implementation of this algorithm, which we analyze and optimize in detail in Sects. 4 and 5.

Parger et al. [20] proposed spECK. In addition to hash-based accumulation, spECK introduces a dense accumulation method and selects the more suitable accumulator based on the sparsity of the input matrix rows, so it performs well across a wide range of input matrices. spECK also proposes a method to improve the load balance of the second matrix: it conducts a simple row analysis of the input matrix at runtime and then dynamically adjusts the number of threads that access a complete row of the B matrix. However, for small matrices this row analysis can be expensive. Inspired by this, our paper adopts a load-balancing method for the second matrix with less additional overhead (Sect. 5.1). Du et al. [21] proposed OpSparse, which optimizes seven inefficient parts of NSparse and spECK and achieves significant performance improvements on Nvidia Tesla V100 GPUs. OpSparse improves binning performance by moving atomic operations from global memory to shared memory, but it does not exploit registers in this process. Our work makes full use of both registers and shared memory to eliminate most atomic operations and further improve binning performance (Sect. 5.2).

4 Main ideas and inefficient parts of NSparse

This section introduces the main ideas of the NSparse algorithm proposed by Nagasaka et al. [11] and discusses four inefficient parts of its implementation.

4.1 Main ideas of NSparse

NSparse consists of seven steps, as shown in Fig. 4. The first step calculates Nprod for every row of the result matrix. The second step assigns different rows to different cores based on the Nprod of each row to ensure load balance. The third step uses a hash table to determine the exact size of the result matrix; the hash table is allocated in the shared memory or global memory of the target device. The fourth and fifth steps allocate memory on the target device and set the row pointers of the result matrix in CSR format. The sixth step again balances the load by classifying rows into separate bins according to their Nnz. In the last step, the numerical calculation is performed using hash-based accumulation to determine the values and column indices of the result matrix.

Fig. 4: Overall workflow of NSparse

NSparse primarily uses a hash table to accumulate intermediate products. Figure 5 shows the hash-based accumulation method. The method starts by creating a hash table in temporary space whose size is an upper bound on the Nnz of the result row. The keys of the hash table are column indices, and their initial values are set to \(-1\). During the computation, a hash function is used to find the position at which to insert each intermediate product. Before insertion, the key stored at that position is checked: if it is \(-1\) or equal to the column index of the intermediate product, the accumulation is performed directly; if a hash collision occurs, linear probing is used to locate an available position in the hash table. Once the hash operations are complete, a shrinking operation compacts the hash table into a dense state. Finally, the hash table is sorted by column index to obtain the result row in compressed format.

Fig. 5: Illustration of hash-based accumulation method
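
As a rough device-side sketch of this accumulation scheme (our illustration, not NSparse's exact code), assume a per-row hash table of keys and vals of size table_size held in shared memory, with key \(-1\) marking an empty slot:

```cpp
// Insert one intermediate product (column index `col`, value `prod`) into the
// per-row hash table using linear probing; illustrative sketch only.
__device__ void hash_accumulate(int *keys, double *vals, int table_size,
                                int col, double prod) {
    int pos = col % table_size;                   // simple hash of the column index
    while (true) {
        int old = atomicCAS(&keys[pos], -1, col); // claim an empty slot or read the stored key
        if (old == -1 || old == col) {            // first occurrence, or same column index
            atomicAdd(&vals[pos], prod);          // accumulate the intermediate product
            return;
        }
        pos = (pos + 1) % table_size;             // hash collision: probe the next slot
    }
}
```

The subsequent shrink and sort steps then compact the non-empty slots and order them by column index to produce the CSR result row.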

The main objectives of NSparse are to address three key challenges: memory allocation for the result matrix, accumulation of intermediate products, and global load balance. The method delivers substantial performance improvements on common benchmarks. However, its implementation also has several inefficient parts, four of which we examine in the following sections.

4.2 Inefficient parts of the NSparse implementation

4.2.1 Second matrix load balance

NSparse assigns a varying number of threads to each row of matrix A through binning to address global load imbalance. However, it neglects local load balance on matrix B. Before execution, NSparse uses a heuristic to fix the number of threads assigned to each row of the B matrix: when the result rows are sparse, a single thread accesses a row of B, whereas when the result rows are dense, a warp accesses a row of B. However, the Nnz of each row of the B matrix can differ greatly, and the sparsity of the result matrix cannot be inferred from the B matrix alone. Therefore, setting the number of threads based only on the sparsity of the result rows can cause significant load imbalance when threads access matrix B. For example, in the webbase-1M matrix from our test set, the longest row has 4700 nonzero elements while the shortest has none. When calculating long result rows, NSparse assigns 32 threads to access each of these rows, so the longest row requires 147 iterations while the shortest requires none. This leads to severe thread load imbalance.

4.2.2 Binning method

The binning operation classifies the rows of matrix A into bins and assigns them to different GPU cores for computation, which helps address load imbalance during the computation. During binning, atomic operations are essential to let multiple GPU threads safely update shared counters. The binning operation used by NSparse is divided into two phases, shown in Fig. 6. The first phase counts the size of each bin and generates offsets recording where each bin is stored in the one-dimensional space; in this phase, every thread performs one atomicAdd and one atomicMax operation on global memory. In the second phase, the row IDs are distributed to the bins according to their Nprod (or Nnz); in this phase, every thread performs two atomicAdd operations on global memory. Clearly, NSparse performs a large number of atomic operations on global memory during binning, leading to heavy global memory traffic.

Although binning is only an auxiliary operation of NSparse with low time complexity, our experiments (Fig. 10 in Sect. 6) show that it accounts for over \(12\%\) of the total execution time on average, and for some small benchmarks it exceeds \(35\%\). This indicates that the binning method used by NSparse is inefficient.

Fig. 6: Illustration of binning method
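
For reference, a minimal HIP-style sketch of the first binning phase as described above (our illustration, not NSparse's code; bin_id_of is a hypothetical helper mapping Nprod to a bin index):

```cpp
#include <hip/hip_runtime.h>
#define NUM_BINS 9

// Hypothetical mapping from a row's Nprod (or Nnz) to a bin index;
// the actual bin boundaries are those listed later in Tables 2 and 3.
__device__ int bin_id_of(int n) {
    int b = 0;
    while (b < NUM_BINS - 1 && n > (32 << (2 * b))) ++b;
    return b;
}

// First binning phase following the description above: every thread issues
// one atomicAdd and one atomicMax directly on global memory.
__global__ void binning_count_baseline(const int *row_nprod, int M,
                                       int *global_bin_size, int *max_nprod) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;
    int bin = bin_id_of(row_nprod[row]);
    atomicAdd(&global_bin_size[bin], 1);   // one global atomic per row of A
    atomicMax(max_nprod, row_nprod[row]);  // plus one global atomicMax per row
}
```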

4.2.3 Numerical SpGEMM

During numerical SpGEMM (Fig. 5), NSparse uses a hash table to accumulate intermediate products and then performs a shrink operation. Because the shrink operation compacts the hash table in place using multiple threads, atomic operations are required to obtain the storage location of each nonzero element after shrinkage. These atomic operations lead to excessive shared memory accesses and cause bank conflicts in shared memory. The shrinking process also uses a large number of branch operations, resulting in warp divergence. All of these factors affect the performance of the numerical SpGEMM. In addition, NSparse writes back the results using a counting-like sorting method that is also relatively inefficient.

Furthermore, in the numerical SpGEMM step, NSparse assigns the same number of threads and hardware resources to rows of matrix A that are entirely zero or contain a single nonzero element as it does to other rows. This increases the number of idle threads and wastes resources. Among the 29 benchmark matrices we selected, eight include such special rows; notably, in matrices such as cit-Patents, patents_main, and NotreDame_actors, these rows make up more than 10% of all rows. Such rows are therefore common enough to warrant dedicated handling.

4.2.4 Kernel configuration

NSparse's mapping of threads to computation tasks is also unreasonable in some cases. For instance, in the second group of the numerical calculation, result rows with Nnz ranging from 17 to 256 are each allocated 32 threads. If most rows in this group lie near the lower boundary of the range, many threads remain idle; if most rows lie near the upper boundary, the per-thread workload becomes excessively heavy. This mapping can cause global load imbalance and hinder the algorithm's performance.

5 Optimization

This section describes our optimizations of the four inefficient parts of the NSparse implementation on the DCU.

5.1 Optimization of the second matrix load balance

The main objective of load balancing for the second matrix is to determine the number of threads assigned to access each row of matrix B. As shown in Fig. 7, N threads are allocated to process one row of the A matrix during the computation. These N threads are divided into K groups of \(G = N/K\) threads each, and each group reads one nonzero element of the A matrix and multiplies it with the corresponding row of the B matrix. The performance of the algorithm is therefore affected by the size of G: if G is too large, the number of iterations increases and many threads sit idle; if G is too small, the workload per thread becomes heavy for long rows of B.

Based on the observations above, an intuitive approach is to set G according to the average Nnz of the rows of the B matrix. However, because the Nnz can vary significantly from row to row, an average-based G can be unfavorable for longer rows, and our experiments show that the longer rows of the B matrix have the greater impact on performance. We therefore focus on long rows: at runtime, the row information of the B matrix is extracted to determine the maximum row length, and G is set accordingly so that each thread accesses at most about eight elements of the B matrix where possible. To limit synchronization overhead, G is capped at the size of one warp (64 here). We therefore choose G from the candidate set \(\{2,4,8,16,32,64\}\); for a common GPU with a warp size of 32, the maximum value of G is 32.

Extracting the row information of B can be done while computing Nprod, so the additional overhead is negligible. This approach therefore improves the load balance of the second matrix at minimal cost.
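
The rule we use to pick G can be sketched as follows (our own host-side illustration of the selection just described; max_row_nnz_B is assumed to be gathered while computing Nprod):

```cpp
// Choose the per-row-of-B group size G: the smallest power of two (>= 2)
// such that G threads cover the longest row of B with at most ~8 elements
// per thread, capped at the warp size (64 on DCU, 32 on common GPUs).
int choose_group_size(int max_row_nnz_B, int warp_size /* 64 on DCU */) {
    int G = 2;
    while (G < warp_size && G * 8 < max_row_nnz_B)
        G *= 2;                 // walks the candidate set {2, 4, 8, 16, 32, 64}
    return G;
}
```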

Fig. 7: Illustration of the second matrix load balance. \(G=4\) requires 3 iterations, \(G=2\) requires 2 iterations

5.2 Optimization of binning method

Before executing symbolic SpGEMM and numerical SpGEMM, a binning pass must be carried out. Both passes perform the same function: classifying the row ids of the A matrix based on the Nprod (or Nnz) of each row and writing them into a contiguous one-dimensional storage space. The binning process involves two stages of computation. Algorithm 2 shows the first stage, whose objective is to calculate the size of each bin and then use an exclusive prefix sum to obtain each bin's offset in the one-dimensional storage space. Algorithm 3 shows the second stage, in which the row ids are recorded in their bins. For other devices, the parameter warpsize (line 18 of Algorithm 2) can be changed to the warp size of the target device.

Because atomic operations on global memory have the greatest impact on performance, we maximize the use of registers and shared memory to reduce their number. Specifically, to determine the size of each bin, every thread within a warp first stores its target bin id in a register (lines 8–16 of Algorithm 2). The __shfl_down instruction is then used within the warp to perform a reduction that yields the warp-local bin sizes (lines 18–22 of Algorithm 2). Next, within each block, atomicAdd operations on shared memory accumulate the block-local bin sizes (lines 24–28 of Algorithm 2). Finally, these values are written to global memory to obtain the global bin sizes (lines 31–33 of Algorithm 2). In the second stage, after the block-local bin sizes are computed, the starting position of each block within each bin is determined (lines 9–13 of Algorithm 3), and then the row ids are written into the corresponding bins (lines 16–25 of Algorithm 3).

This process exploits the locality of computing tasks within thread groups and makes full use of registers and shared memory, which greatly reduces atomic operations on global memory and improves binning performance. At the same time, all-zero rows are excluded during binning to avoid wasting resources in the computation stage.

Algorithm 2: Pseudo code of first pass binning

Algorithm 3: Pseudo code of second pass binning
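
Because the pseudocode figures are not reproduced here, the following HIP-style sketch illustrates the first binning pass along the lines of Algorithm 2 (our illustration with simplified naming; bin_id_of is the same hypothetical helper as in the sketch of Sect. 4.2.2, and the exclusive prefix sum over global_bin_size is omitted):

```cpp
#include <hip/hip_runtime.h>
#define WARP_SIZE 64
#define NUM_BINS  9

__device__ int bin_id_of(int n) {            // same illustrative mapping as in Sect. 4.2.2
    int b = 0;
    while (b < NUM_BINS - 1 && n > (32 << (2 * b))) ++b;
    return b;
}

__global__ void binning_first_pass(const int *row_nprod, int M, int *global_bin_size) {
    __shared__ int block_bin_size[NUM_BINS];
    int tid = threadIdx.x;
    int row = blockIdx.x * blockDim.x + tid;

    if (tid < NUM_BINS) block_bin_size[tid] = 0;
    __syncthreads();

    // Each thread keeps its target bin id in a register (cf. lines 8-16 of Algorithm 2).
    int my_bin = (row < M) ? bin_id_of(row_nprod[row]) : -1;

    // Warp-level reduction with __shfl_down: count, per bin, how many lanes of
    // this warp target it, so only lane 0 touches shared memory (cf. lines 18-22).
    for (int b = 0; b < NUM_BINS; ++b) {
        int flag = (my_bin == b) ? 1 : 0;
        for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
            flag += __shfl_down(flag, offset);
        if ((tid % WARP_SIZE) == 0 && flag > 0)
            atomicAdd(&block_bin_size[b], flag);   // shared-memory atomic only (cf. lines 24-28)
    }
    __syncthreads();

    // One global atomic per bin per block instead of one per thread (cf. lines 31-33).
    if (tid < NUM_BINS && block_bin_size[tid] > 0)
        atomicAdd(&global_bin_size[tid], block_bin_size[tid]);
}
```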

5.3 Optimization of numerical SpGEMM

Figure 8 shows our optimization of the numerical SpGEMM step. We use a combination of a direct method and a hash method to accumulate intermediate products. During the computation, each row of the A matrix is first examined to determine whether it contains only a single nonzero element. If so, the direct method is used: the single nonzero element of A is multiplied in sequence with the nonzero elements of the corresponding row of B, and the results are stored directly. Otherwise, a hash operation accumulates the intermediate products, after which the hash table is sorted by its keys; we use the highly efficient bitonic sort for this step. Once sorting is complete, the starting position at which the result must be written is obtained by subtracting the Nnz of the current result row from the hash table size. Rows of matrix A that are entirely zero produce all-zero result rows and do not need to be computed at all, so they are excluded directly during the binning process (lines 15–17 of Algorithm 3).

Fig. 8: Illustration of numerical SpGEMM optimization

This method reduces the number of idle threads and the waste of hardware resources. It also eliminates the shrink operation from the numerical computation, reducing shared memory accesses and branch operations.
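
As a rough sketch of the direct method for rows of A with a single nonzero element (our illustration, assuming such rows have been collected into a single_rows list during binning and that the result row pointers crpt are already set):

```cpp
// Direct method: the result row is just the single nonzero of A scaled into
// the corresponding row of B, so no hash table, shrink, or sort is needed.
__global__ void numeric_single_nnz_rows(const int *single_rows, int num_rows,
                                        const int *arpt, const int *acol, const double *aval,
                                        const int *brpt, const int *bcol, const double *bval,
                                        const int *crpt, int *ccol, double *cval) {
    int r = blockIdx.x;                         // one block per single-nonzero row of A
    if (r >= num_rows) return;
    int row = single_rows[r];
    int ap  = arpt[row];                        // position of the single nonzero of A
    int k   = acol[ap];
    double a = aval[ap];
    int bstart = brpt[k], blen = brpt[k + 1] - bstart;
    int cstart = crpt[row];
    for (int j = threadIdx.x; j < blen; j += blockDim.x) {
        ccol[cstart + j] = bcol[bstart + j];    // B's CSR row is already sorted by column
        cval[cstart + j] = a * bval[bstart + j];
    }
}
```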

5.4 Optimization of kernel configuration

This section describes the finer-grained grouping and kernel configuration we use to improve global load balance. We use the Hygon DCU as the target device. Each compute unit (CU) of the DCU has 64 KB of shared memory, of which up to 48 KB is available per block, and each block can contain up to 1024 threads. More detailed parameters of the Hygon DCU are given in Sect. 2.5. For SIMT hardware with different configurations, bin 7 can be scaled up or down according to the available shared memory; for instance, on a device with up to 64 KB of shared memory per block, bin 7 of the symbolic stage can be expanded to 16,384 and bin 7 of the numerical stage to 5461.

5.4.1 Kernel configuration in symbolic step

Table 2 shows the kernel configurations and the bin ranges for the symbolic step. In this step, the 9 bins are divided based on the Nprod of each row and are computed by 9 kernels. Since the DCU has at most 48 KB of shared memory available per block and each column index in the symbolic step occupies 4 B, the maximum size of the shared memory hash table per block is 12,288 entries. The hash tables for Kernel0–Kernel7 are therefore placed in shared memory, while the hash table for Kernel8 resides in global memory and is twice the size of the upper bound of the Nprod of the result rows. To avoid the warp divergence caused by short rows (rows with \(Nnz < warp\ size\)), 16 short rows are combined into a single row group and handed to Kernel0, whose threads are organized into thread groups of size eight, with each group handling one short row. Kernel1 processes one row per warp, while Kernel2 through Kernel8 each process one row per block. This grouping yields a more balanced workload among threads and thus better global load balance in the symbolic step.
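
Spelling out the shared memory budget behind these hash table sizes (48 KB = 49,152 B per block):

$$\begin{aligned} \frac{49152\,\text {B}}{4\,\text {B per column index}} = 12288 \ \text {(symbolic step)}, \qquad \frac{49152\,\text {B}}{12\,\text {B per index--value pair}} = 4096 \ \text {(numerical step, Sect. 5.4.2)} \end{aligned}$$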

5.4.2 Kernel configuration in numerical step

Table 3 shows the kernel configuration and bin ranges for the numerical step, which uses nine bins processed by ten kernels. The hash table in the numerical step stores column-index/value pairs of 12 B each, which yields a maximum shared memory hash table size of 4096 entries. Each block of Kernel0 and Kernel1 uses thread groups to calculate multiple rows, Kernel2 uses a warp to process a single row, and Kernel3–Kernel9 each use a single block to calculate one row.

The hash tables of both Kernel8 and Kernel9 reside in global memory. They differ in that Kernel8 uses the highly optimized thrust::sort_by_key API [35] for sorting after hashing and then copies the computed result rows to the result storage space through device-to-device memory copies; to reduce the sorting overhead, its hash table is set to 1.5 times the upper bound of the Nnz of the result rows. Kernel9 uses a hash table twice the upper bound of the Nnz of the result rows and performs the calculation in the conventional way. When the number of rows allocated to bin 8 is small, Kernel8 is invoked; when the number of rows is large, Kernel9 is called, because many rows would require many device-to-device row copies, which is time-consuming and erodes the advantage of Kernel8. This grouping and kernel configuration improves global load balance in the numerical step and enhances performance on long rows.

Table 2 Symbolic step kernel configuration for DCU
Table 3 Numerical step kernel configuration for DCU

6 Performance evaluation

We conducted an overall performance comparison with five state-of-the-art SpGEMM algorithms: bhSparse, KokkosKernels, NSparse, rocSparse, and spECK. To better demonstrate the effectiveness of our optimizations, we also compared the computational and auxiliary overheads with those of NSparse. In all benchmarks, we evaluated the performance of double-precision multiplication for \(C = A^2\).

6.1 Evaluation environments

To evaluate the performance of our algorithm, we used the Hygon DCU described in Sect. 2.5 as the accelerator. The code is written in HIP and compiled with HIPCC (version 4.2.2). The host is a Hygon CPU running CentOS 7.6, with 32 physical cores, 64 hardware threads, 32 KB of L1 cache, and 128 GB of main memory. Each program is warmed up for one round and then executed 20 times, and the average time is taken as the base time. Benchmarks are evaluated in terms of FLOPS performance [11, 17, 20, 21], computed as twice the number of intermediate products generated for the result matrix divided by the base time.
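
Written out, the performance metric is:

$$\begin{aligned} \text {FLOPS} = \frac{2 \times \text {Nprod}}{t_{\text {base}}} \end{aligned}$$

where Nprod is the total number of intermediate products generated for the result matrix and \(t_{\text {base}}\) is the averaged execution time.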

We select 29 matrices from the SuiteSparse Matrix Collection [36] as our benchmark data. These matrices have varied sparsity characteristics and consist of 27 square matrices and two non-square matrices; they have been used extensively in previous SpGEMM performance evaluations [11, 17,18,19,20,21, 23,24,25,26,27, 30, 37]. Table 4 summarizes their details. bhSparse is unable to execute the last two matrices because their intermediate products require too much temporary storage space.

The source code of the SpGEMM algorithms we compare against is written in CUDA [38]. Since CUDA and HIP are highly similar, we use HIPIFY [39] to transcode bhSparse, NSparse, and spECK to HIP so that they run on the DCU; this does not change the logic of the algorithms. For KokkosKernels, the HIP backend is selected.

Table 4 Detailed information of the 29 sparse matrices

6.2 Overall performance of SpGEMM

Figure 9 shows the overall FLOPS performance of SpGEMM. Our algorithm demonstrates the best overall performance, followed by spECK and rocSparse, while bhSparse and KokkosKernels perform the worst. In terms of speedup, our work achieves an average of 7.99x (up to 18.2x) over bhSparse, 8.01x (up to 20.83x) over KokkosKernels, 2.37x (up to 6.16x) over NSparse, 1.82x (up to 4.20x) over rocSparse, and 1.63x (up to 5.01x) over spECK.

Among all the benchmark matrices, our work achieves the best results on 27 matrices, with particularly significant speedups on m133-b3, mc2depi, atmosmodl, parabolic_fem, wb-edu, and mari002. However, on cage15 and conf5_4-8x8-05, our work does not perform as well as spECK. This is because, during the numerical SpGEMM phase, more than 50% of the rows in these two matrices use hash tables with occupancy greater than 80%, so more hash conflicts occur during the numerical computation; handling them causes warp divergence and more idle threads, limiting SpGEMM performance. This also highlights a limitation of hash-based methods. In contrast, spECK keeps hash table occupancy below 66% by enlarging the hash table, so it performs better on this type of matrix. We chose not to adopt this strategy because enlarging the hash table increases the sorting overhead, which is detrimental for most matrices.

Fig. 9: Overall FLOPS performance of SpGEMM on DCU

6.3 Multiple optimizations performance analysis of SpGEMM

To clearly demonstrate our optimizations of NSparse, this section analyzes the contribution of each optimization to the overall performance improvement. Since the time taken for memory allocation is the same, we focus on comparing the binning overhead and the computational overhead of SpGEMM, where the computational overhead comprises the time of both the symbolic SpGEMM and the numerical SpGEMM.

6.3.1 The ratio of execution time

Figure 10 shows the ratio of the execution time of each stage, with the execution time of NSparse normalized to 1. The majority of the computation time is spent in the numerical SpGEMM. Compared with NSparse, our work greatly reduces the time of the numerical SpGEMM, with especially significant gains on the matrix atmosmodl. For the symbolic SpGEMM, we achieve significant acceleration on most matrices. The overall effect of our binning optimization is equally notable: the time required for binning correlates with matrix size, so for large matrices such as webbase-1M, cage15, and wb-edu the binning overhead is almost negligible, whereas on smaller matrices such as m133-b3 and patents_main it is comparatively more significant. The following sections analyze these stages separately.

Fig. 10: Performance breakdown comparing with NSparse

6.3.2 Binning overhead

Figure 11 shows the percentage of the total SpGEMM time taken by the two binning operations. The results show that our work performs best on all benchmark matrices. For NSparse, this auxiliary overhead accounts for 12.9% of the total time on average and 35.49% in the worst case, whereas in our work it accounts for only 4.9% on average and 10.15% in the worst case. We exploit the locality of computing tasks within thread groups and make full use of registers and shared memory to reduce atomic operations on global memory, making our binning method more effective than that of NSparse.

Fig. 11: Relative execution time of two binning operations compared to the overall execution time of SpGEMM

6.3.3 Computation overhead

Compared with NSparse, we apply finer-grained grouping and kernel configuration as well as load balance optimization for the second matrix. In addition, our optimizations of the numerical SpGEMM step focus on minimizing shared memory accesses and handling special rows. Specifically, our computation mode for numerical SpGEMM sorts the hash table with an efficient sorting algorithm and writes the results directly, with dedicated handling of special rows, whereas NSparse must first perform a shrinking operation and then determine the final positions of the results through a counting-like sorting process.

Figures 12 and 13 compare the FLOPS performance of our work and NSparse for the symbolic SpGEMM and the numerical SpGEMM, respectively. For the symbolic step, our work performs best on 28 matrices, with an average speedup of 2.23x (up to 8.9x) over NSparse. For the numerical step, our work performs best on all benchmarks, achieving an average speedup of 2.52x (up to 5.83x). This demonstrates the effectiveness of our second matrix load balance scheme, our numerical SpGEMM optimization, and our fine-grained grouping and kernel configuration.

Fig. 12: Performance of symbolic SpGEMM

Fig. 13: Performance of numerical SpGEMM

7 Summary and outlook

This paper provides an in-depth analysis of four inefficient parts of the NSparse implementation and optimizes them on the DCU: 1) the load balance of the second matrix is improved by extracting the maximum row length of the second matrix at runtime and setting the group size accordingly; 2) the binning method is optimized by making full use of registers and shared memory; 3) the computation mode of the numerical SpGEMM is optimized to improve its efficiency; and 4) global load balance is improved through finer-grained grouping and kernel configuration. The experimental results show that our work achieves average speedups of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x) over the five state-of-the-art SpGEMM algorithms bhSparse, KokkosKernels, NSparse, rocSparse, and spECK, respectively.

In the future, we will continue to work on the performance optimization of SpGEMM. We will apply the optimization techniques proposed in this paper to more advanced SpGEMM algorithms, for example, applying the binning method to spECK and the numerical computation mode to OpSparse. In addition, we will focus on applying these optimization techniques to other sparse matrix operations, such as applying the binning method to sparse-matrix dense-matrix multiplication (SpMM) to improve the load balance of sparse matrix operations.