1 Introduction

In the digital age, the exponential growth of data presents significant challenges for processing and analyzing information [1]. Efficient data processing algorithms are crucial, particularly for handling large-scale datasets. In this context, selection and top-K algorithms are vital tools for filtering and extracting valuable data [2]. These algorithms aim to select the \(k\)th element or the top \(K\) elements from a dataset based on specific criteria. A popular method for optimizing these algorithms is bucket partitioning [3]. Importantly, the surge in data processing requirements has turned GPUs into key computational resources [4]. Equipped with numerous parallel processing units, GPUs offer significant speedup opportunities for computation tasks. Consequently, algorithms [5,6,7,8,9,10,11,12] like bucket partition-based selection [3] and top-K [4, 13] are increasingly being adapted for GPU implementation to harness this computational power.

However, implementing top-K and selection algorithms with bucket partitioning on GPUs presents specific challenges. In traditional GPU-based bucket partitioning implementations, data are usually segmented into different buckets according to certain rules, with each bucket processed by an individual thread block [14]. This method can cause problems, particularly when the data distribution is skewed, resulting in GPU workload imbalance and excessive recursion depth, which in turn degrade performance [10, 11]. Furthermore, traditional implementations typically involve multiple kernels executed serially [3, 15, 16], incurring significant kernel launch and communication overhead. Additionally, merging data from selected buckets at a global level introduces high-latency data transfers, further impacting throughput [12, 13]. To address these challenges, we introduce a novel execution model for top-K and selection algorithms, named the Split-Bucket Partition (SBP). In contrast to traditional methods, the SBP model first physically splits the data into tiles, each processed by a thread block, and then divides the data into buckets based on varying partition rules. Through fine-grained task allocation and by generating fixed-size arrays from each tile, this approach effectively mitigates the drawbacks of uneven bucket distribution and reduces the latency of data merging. Moreover, our model implements the algorithm with only a single kernel, significantly reducing the communication and startup costs associated with multiple kernels.

For practical validation, we applied the SBP model to both radix and search-tree bucket partitioning rules and further tailored the selection and top-K algorithms with specific optimizations. One such optimization is task optimization, which fuses or divides tasks within the GPU kernel. This enhances resource utilization by minimizing idle GPU cycles and reducing computational overhead. Control flow optimization, on the other hand, streamlines the internal logic of the kernel. Figure 1 shows the cost cycle of top-K algorithms with varying \(K\) values, where the proportion of time consumed by bucket traversal ranges from 37% to 50% and remains roughly constant regardless of the value of \(K\). Specifically, we optimized the traversal processes across buckets, which are often time-consuming and not essential to the core functionality of top-K algorithms. These adjustments to the control flow eliminate unnecessary computation, thereby enhancing overall algorithmic efficiency.

Fig. 1 Cost cycle of top-K algorithms based on different partition rules when \(\textrm{DataSize}=2^{19}\)

To demonstrate the robustness and applicability of our approach, we conducted experiments on two distinct GPU architectures: TU102 and A100. The results reveal that, for bucket partitioning under both radix and search-tree partitioning rules, the SBP execution model outperforms the current state-of-the-art top-K and selection algorithms. Specifically, the performance improvement achieved by our model in the top-K algorithm reaches up to 2.48 times on a uniform distribution, 5 times on a non-uniform distribution, and 2.3 times on real-world datasets. For the selection algorithm, the improvement peaks at approximately 2.3 times on a uniform distribution, 15.5 times on a non-uniform distribution, and 1.4 times on real-world datasets.

Critically, while our research primarily focuses on bucket partition-based top-K and selection algorithms, the introduced SBP model is inherently versatile. It is designed to be adaptable to a broad array of bucket partitioning algorithms, going beyond the scope of this paper. We believe the SBP approach can be pivotal for other algorithms, like sorting [17], hashing [1, 18], and other relevant areas [19], and we aim to explore these applications in future work.

In summary, the contributions of this paper are as follows:

  • We introduce the SBP model for more efficient bucket partitioning, considering both selection and top-K algorithms.

  • We demonstrate the effectiveness of this approach by applying it to two distinct bucket partitioning rules and optimizing the associated algorithms.

  • We pave the way for future research by showing that the SBP model and the accompanying optimizations have broader applications in other computational tasks, such as sorting and hashing.

2 Background

2.1 Bucket partition

Bucket partitioning is a prevalent data partitioning technique employed in various computational scenarios, such as selection [3], top-K [4, 13], sorting [14], query processing [11], and hash tables [1].

Figure 2 shows the basic bucket partition execution model of the top-K and selection algorithms under search tree rules. Different bucketing rules determine distinct bucket partitioning steps, such as pivot selection [20]. Here, we take the search tree rule as an example. There are three steps in the basic bucket partition execution model of the top-K algorithm and the selection algorithm: (1) collecting data into different buckets based on specific rules, (2) executing algorithms independently within each bucket, and (3) merging the outputs of each bucket.

Specifically, in the first step, the input data are divided into different buckets through the bucket partitioning rule (radix or search tree), and the algorithm (selection or top-K) is then executed independently in each bucket. The third step, merging, that is, the calculation of the target bucket in Fig. 2, is determined by the algorithm. For example, the top-K algorithm needs to find the buckets containing the top-K elements, while the selection algorithm needs to find the bucket containing the \(k\)th element. We introduce these two algorithms in detail in Sects. 2.2 and 2.3.

This bucket partitioning method has obvious advantages in parallel processing [20]. It not only improves the degree of parallelism but also significantly reduces the workload of each iteration. As data volumes grow, this technique becomes increasingly popular in GPU-based computing.

Despite the extensive application of bucket partitioning in GPUs, research [5, 19] has mainly focused on the high degree of parallelism enabled by various bucketing strategies. However, the effectiveness of a given policy can vary across different algorithms, influenced by several factors such as bucket size, the number of buckets, and bucket division rules. In terms of hardware architecture, the memory hierarchy of GPUs introduces another layer of complexity, as varying memory accesses or partition processes can result in different GPU execution patterns, thereby affecting performance and potentially causing bottlenecks. Some algorithms see their execution time dominated by the bucket collection process [21], while others incur high communication costs during the merging phase [3, 22]. Additionally, certain algorithms suffer from load imbalances due to the specific bucket partitioning rules employed [23]. Consequently, formulating a potent strategy for partitioning buckets effectively on GPUs continues to pose an intricate challenge that, as far as we know, lacks a generalized solution.

In our study, we specifically focus on the top-K and selection algorithms. We analyze the impact of existing bucket partitioning execution modes on these algorithms and, based on this analysis, propose a novel execution mode that addresses the current issues. Our approach not only targets the top-K and selection algorithms but also offers a new perspective for other algorithms based on bucket partitioning.

The emphasis of our research lies in the abstraction and optimization of the top-K and selection algorithms, particularly in the context of bucket partitioning. Initially, we identify performance bottlenecks under the current execution modes for these algorithms. Subsequently, we introduce a new execution model, termed the SBP (split-bucket partitioning) execution model. This model is designed to enhance the performance of these algorithms on GPUs. Our methodology demonstrates how algorithms based on bucket partitioning can be transformed into an execution model and analyzed for performance bottlenecks, a process that holds potential applicability to other algorithms. However, due to the complexity of GPU programming and the unique characteristics of different algorithms, applying our execution model to other algorithms may require manual adjustments to the implementation details to ensure optimal performance.

2.2 Bucketing in top-K algorithms

The top-K algorithm takes as input a dataset containing \(n\) elements and identifies the \(K\) highest or lowest valued elements. It finds wide application in fields such as deep learning, data mining, and information retrieval [2]. In the past, however, research predominantly focused on the top-K algorithm in databases [13, 15, 24], leaving its role as a fundamental data operation in other domains comparatively underexplored. Given the growing role of GPUs in these domains, optimizing top-K algorithms for GPUs has become a priority [4]. Notably, some high-performance parallel computing libraries, such as Arrayfire [25], still use sorting operations to retrieve the top \(K\) elements, which is unnecessary and inefficient [26]. Several methods exist for finding the top-K elements, including threshold-based [27], bitonic sort-based [21], and bucket partition-based algorithms [3]. The bucket partition-based algorithms [3] are a variant of the selection algorithms discussed in Sect. 2.3.

As shown in Fig. 2, these algorithms typically partition the input dataset into buckets based on specific rules. (Figure 2 demonstrates the search tree technique; in this paper, we also introduce radix computation.) The search tree technique selects pivots from the input data by sampling, builds a search tree from these pivots, and allocates the data to different buckets. The radix technique computes the radix of each element and allocates the data to buckets according to the radix value. The resulting buckets are ordered ascendingly. If the size of the \(i\)th bucket \(\textrm{bucket}_i\) is \(\textrm{bucket}[i]\), then the set of target buckets containing the top-\(K\) elements is \(\{\textrm{bucket}_0, \ldots, \textrm{bucket}_i\}\), where \(\sum _{j=0} ^{i-1} \textrm{bucket}[j] \le K < \sum _{j=0} ^{i} \textrm{bucket}[j]\). In Fig. 2, \(i=1\). The set of target buckets is then taken as the input for the next iteration, repeating the process to narrow the search range until the number of elements in the target buckets falls below a cutoff value, at which point the top-\(K\) elements are obtained by sorting a minimal number of elements.
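The target-bucket condition above amounts to a prefix-sum walk over the bucket counts. The following is a minimal host-side sketch of that walk (the function and variable names are ours, not from the paper's implementation):

```cuda
#include <cstdio>
#include <vector>

// Given per-bucket counts of ordered buckets, return the smallest index i with
// sum(counts[0..i-1]) <= K < sum(counts[0..i]); buckets 0..i then hold the top-K set.
int findTargetBucket(const std::vector<int>& counts, int K) {
    int running = 0;
    for (int i = 0; i < (int)counts.size(); ++i) {
        running += counts[i];          // running = sum(counts[0..i])
        if (K < running) return i;     // condition met: stop at bucket i
    }
    return (int)counts.size() - 1;     // K covers every bucket
}

int main() {
    std::vector<int> counts = {3, 5, 2, 6}; // example histogram from one partition pass
    printf("i = %d\n", findTargetBucket(counts, 4)); // prints 1: buckets 0..1 cover K = 4
    return 0;
}
```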

2.3 Bucketing in selection algorithms

Selection algorithms aim to choose the \(k\)th element from a set of \(n\) elements, a fundamental operation encountered in various computational problems. Most established selection algorithms, such as radix selection [16], random selection [28], and sample selection [29], employ the characteristic execution pattern [3] of partition-based sorting algorithms [14]. Similar to the top-K computation process, the selection algorithm focuses on the specific bucket containing the \(k\)th element, as shown in Fig. 2. Whereas the set of target buckets for top-K in Fig. 2 is \(\{\textrm{bucket}_0, \textrm{bucket}_1\}\), the selection algorithm in the same scenario has exactly one target bucket. Figure 2 demonstrates the search tree technique; in this paper, we also introduce algorithms based on radix computation, which differ only in the bucket partitioning rule, with a similar execution flow. We provide a detailed description in Sect. 3.

Fig. 2 Framework of the selection and top-K algorithms based on bucket partitioning

2.4 GPU architecture

Efficiency in data processing tasks like top-K queries on modern GPUs is strongly influenced by their memory hierarchy. Figure 3 shows a simplified GPU memory structure and execution model, consisting of multiple streaming multiprocessors (SMs). Each SM houses around \(96-128\,\textrm{KB}\) of memory, doubling as L1 cache and shared memory, and a \(256\,\textrm{KB}\) register file, the fastest memory component. The hierarchy spans from the fastest register file to the shared memory, on-chip L2 cache, and finally, the slower off-chip global memory.
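Where exact capacities matter, they can be queried at runtime instead of hard-coded, since the figures above vary by architecture. A small sketch using the CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);   // properties of device 0
    printf("shared memory per SM: %zu KB\n", p.sharedMemPerMultiprocessor / 1024);
    printf("registers per SM:     %d (about %d KB)\n",
           p.regsPerMultiprocessor, p.regsPerMultiprocessor * 4 / 1024);
    printf("L2 cache:             %d KB\n", p.l2CacheSize / 1024);
    return 0;
}
```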

Fig. 3 GPU memory hierarchy and execution model

3 Methodology

In this section, we elucidate the SBP execution model for top-K and selection algorithms, which aims to minimize communication overhead, memory access costs, and load imbalance. The SBP execution model consists of two fundamental components: the execution model and the bucket partitioning strategy. The execution model abstracts the execution process of bucket-based algorithms into a generic model, enabling bottleneck analysis for an entire class of algorithms. From this perspective, Sect. 3.1 introduces the execution model under the traditional bucket partitioning strategy for top-K and selection algorithms and analyzes the potential bottlenecks of these algorithms from both theoretical (computation and memory access) and implementation perspectives. To address these bottlenecks, we propose an execution model based on the SBP strategy (the SBP execution model), which refines and adjusts task allocation to resolve them. Section 3.2 details the SBP execution model for top-K and selection algorithms from both theoretical and implementation perspectives.

For top-K and selection algorithms based on bucket partitioning, radix and search tree techniques are generally considered the two most effective bucket partitioning techniques. Therefore, based on the SBP execution model, we introduce top-K and selection algorithms for the two bucket division rules, namely four algorithms: \(TK\_RdSP\) (radix-based top-K algorithm under the SBP execution model), \(TK\_StSP\) (search tree-based top-K algorithm under the SBP execution model), \(S\_RdSP\) (radix-based selection algorithm under the SBP execution model), and \(S\_StSP\) (search tree-based selection algorithm under the SBP execution model). Section 3.3 introduces the SBP execution model applied to the top-K algorithm (\(TK\_RdSP\), \(TK\_StSP\)); additionally, we discuss two optimizations applied to \(TK\_StSP\) and its implementation on a GPU. Section 3.4 introduces the SBP execution model applied to the selection algorithm (\(S\_RdSP\), \(S\_StSP\)) and its implementation on a GPU.

3.1 Execution model based on customary bucket partition strategy

In this section, we introduce the traditional bucket partitioning execution model of the top-K algorithm and the selection algorithm and then analyze its performance bottlenecks from the perspectives of computation, memory access, and memory hierarchy. Finally, we introduce the implementation of the traditional bucket partition execution model.

For traditional bucket partitioning on GPUs, the general data flow is based on the bucket partitioning strategy [14], as shown in Fig. 4. According to the partitioning rules, the input data are divided into several parts, and the algorithm is executed in each part. The execution unit of partitioning is determined by the granularity of the algorithm. For instance, in basic algorithms like sorting [14], hashing [1], top-K algorithm [15], and selection algorithm [3], the execution unit is the thread block. However, this method has limitations. Due to the coarse granularity of bucket partitioning, data updates require global synchronization, leading to high communication overhead. Additionally, the throughput is also limited by the random memory accesses that occur during the normalization of elements within the buckets. This method is also very sensitive to data distribution, often leading to load imbalance.

Fig. 4 Traditional partition strategy and general implementation of one iteration on GPU

3.1.1 Execution model analysis

We analyze the execution model of traditional bucket partitioning for top-K and selection algorithms by focusing on performance bottlenecks through an examination of individual operations: the bucket collection operation (\(OP_{bc}\)), the algorithm execution within buckets (\(OP_{ex}\)), and the merging operation of output elements in buckets (\(OP_{mg}\)). The execution sequence of bucket partitioning can be delineated by the following equation:

$$\begin{aligned} OP_{bc} \rightarrow OP_{ex} \rightarrow OP_{mg} \end{aligned}$$
(1)

Next, we analyze the cost of each operation based on the amount of computation and memory transactions on the GPU, as well as the memory access hierarchy. This analysis helps in understanding the workings of each operation and the execution formula of bucket partitioning.

$$\begin{aligned} \textrm{Cost}(OP_{bc\_rd}) &= C_{rd} + C_{hg} \\ &= 1\,\textrm{com}_{L} + 1\,\textrm{read}_{L} + 1\,\textrm{add}_{G} + 1\,\textrm{write}_{G} \\ N_C &= \textrm{bucket}\_\textrm{num} + \log (\textrm{TreeHeight}) \\ \textrm{Cost}(OP_{bc\_st}) &= C_{bst} + C_{cb} \\ &= 2 \times (\textrm{read}_{G} + \textrm{write}_{G} + \textrm{read}_{L}) + N_C \times \textrm{com}_{L} + 1\,\textrm{add}_{G} \end{aligned}$$
(2)
$$\begin{aligned} \textrm{Cost}(OP_{mg}) &= \textrm{read}_G + \textrm{write}_G + \textrm{com}_L \end{aligned}$$
(3)
$$\begin{aligned} \textrm{Cost(total)} &= \textrm{Cost}(OP_{bc} + OP_{ex} + OP_{mg}) + 2 \times \textrm{Cost}(\textrm{sync}) \end{aligned}$$
(4)

The bucket collection operation \(OP_{bc}\) collects elements of the input dataset according to the bucket partition rule; its cost is shown in Eq. 2. Specifically, it computes the bucket ID for each element and counts the number of elements in each bucket. We distinguish between two types of rules: radix (\(OP_{bc\_rd}\)) and search tree (\(OP_{bc\_st}\)).

\(OP_{bc\_rd}\) primarily computes the radix of each element and updates the histogram. Let \(C_{rd}\) and \(C_{hg}\) denote the costs of these two operations, respectively. For \(C_{rd}\), the radix calculation is executed within a thread block, meaning the accessed memory hierarchy is local to the thread block, including shared memory and registers. Hence, the cost of the radix computation for each element is \(1\,\textrm{com}_L\), as shown in Eq. 2. The histogram update under the radix rule is global, meaning both accumulation and writing access global memory; hence the subscripts for \(\textrm{add}\) and \(\textrm{write}\) are both \(G\). The numbers of computations and memory transactions are similar, but the memory transactions incur high latency due to global memory access.
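A minimal sketch of this radix-rule collection step, assuming an arbitrary digit width and a histogram resident in global memory as in the traditional model (kernel and parameter names are illustrative, not the paper's code):

```cuda
__global__ void radixHistogram(const unsigned int* keys, int n,
                               unsigned int* globalHist, // global memory: high-latency atomics
                               int shift, int bits) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int mask = (1u << bits) - 1u;
    if (idx < n) {
        unsigned int bucketId = (keys[idx] >> shift) & mask; // local bit extraction (com_L)
        atomicAdd(&globalHist[bucketId], 1u);                // global accumulate (add_G, write_G)
    }
}
```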

\(OP_{bc\_st}\) involves building a search tree from the input dataset and then collecting the data into ordered buckets. Let \(C_{bst}\) and \(C_{cb}\) denote the costs of building the search tree and collecting buckets, respectively. As shown in Eq. 2, \(C_{bst}\) can be expressed as \(1\,\textrm{read}_G + 1\,\textrm{write}_G\), since building the search tree is a global operation. \(C_{cb}\) involves reading the search tree and elements from global memory, computing the bucket ID in local memory, and counting the bucket for each element, which requires global accumulation and writing, as in the radix-based partitioning rule. It is important to note that \(N_C\), as referenced in Eqs. 2 and 5, denotes the total computational amount of \(OP_{bc\_st}\) analyzed above. Although the amount of computation appears to exceed the number of memory accesses, the computations are performed in local memory while the memory accesses go to global memory, so the memory accesses dominate the cost. Therefore, the cost of \(OP_{bc\_st}\) is greater than that of \(OP_{bc\_rd}\).
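For the search-tree rule, the per-element bucket-ID step descends a small pivot structure. The sketch below models the tree walk as a binary search over sorted pivots, which yields the same ordered buckets; the node layout and names are our assumptions, not the paper's implementation:

```cuda
// Map one key to its ordered bucket: numPivots sorted pivots split the key
// range into numPivots + 1 buckets, and each comparison plays the role of
// one tree level (a com_L in the cost model).
__device__ int treeBucketId(unsigned int key, const unsigned int* pivots, int numPivots) {
    int lo = 0, hi = numPivots;        // resulting bucket ID lies in [0, numPivots]
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (key < pivots[mid]) hi = mid; else lo = mid + 1;
    }
    return lo;
}
```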


Algorithm 1 Bucket partition on GPU.

The algorithm execution within a bucket operation (\(OP_{ex}\)) executes the algorithm on each bucket independently. The specific tasks vary depending on the algorithm used; details are provided in Sects. 3.3 and 3.4.

Traditional bucket partitioning for \(OP_{mg}\) involves inter-thread block communication and synchronization to collect target elements. This is done using the prefix operator and global allocation, leading to increased communication overhead in global memory. In Eq. 3, elements are fetched from global memory into thread blocks. The prefix operation is then used to determine the goal bucket ID, and the target elements of that bucket ID are written back into global memory. This process involves multiple global memory accesses for relatively few computations, which is inefficient on GPUs.

Overall, for the traditional execution model, the total cost also needs to include the cost of synchronization. This is because there is a need to synchronize thread blocks twice, as shown in Algorithm 1, where three kernels are executed serially. This results in high latency and low bandwidth.

3.1.2 General implementation

Algorithm 1 shows the implementation of the common bucket partition execution model in the top-K algorithm and the selection algorithm. This algorithm adopts a general approach, executing operations independently in sequence through multiple kernels [3, 14]. Each operation is implemented as a separate kernel, executed by several threads organized as blocks, as shown in the right part of Fig. 4. In both the diagram and the algorithm, we only describe a single recursive implementation of the algorithm. However, in actual implementation, we iterate the algorithm multiple times until the total number of elements is less than the number of elements processed by a basic unit. While this execution pattern is straightforward to implement, it is not particularly efficient due to the high-latency and low-bandwidth global transactions it incurs.

3.2 Split-bucket partition execution model

In this section, we introduce the SBP execution model for top-K and selection algorithms, which aims to solve the problems of high global communication consumption and load imbalance in the traditional strategy. We introduce the split-bucket partition (SBP) strategy in Sect. 3.2.1, which addresses the load imbalance problem through the split operation and reduces global communication through an improved merge operation. We then discuss the execution model under the SBP strategy from the perspective of computation and memory access costs in Sect. 3.2.2. Finally, we detail how the SBP model maps to the GPU from the perspectives of coarse-grained and fine-grained task allocation in Sect. 3.2.3, highlighting how the SBP execution model solves the problems of traditional strategies from an implementation perspective.

3.2.1 SBP strategy

Fig. 5 Data flow of SBP strategy and general implementation of one iteration on GPU

We introduce the “split-bucket partitioning (SBP)” strategy, as shown in Fig. 5. This strategy divides data into several parts based on task granularity and further subdivides them into buckets for executing algorithms. In SBP, blocks serve as the unit of division, with threads acting as the execution unit for each bucket. Specifically, the input data are divided into different tiles, each mapped to a thread block on the GPU. Within a thread block, a tile’s data are allocated to several buckets according to different bucket partitioning rules and processed by threads executing algorithms, such as the top-K and selection algorithms. Unlike traditional bucket partitioning (illustrated in Sect. 3.1.1), the merging phase of SBP does not compute a prefix sum of the element counts of all buckets to determine the target bucket’s index and then perform a global memory merge. Instead, we perform padding within each thread block, ensuring a consistent number of output elements per block. This allows direct calculation of each thread block’s offset position. During the merge phase, we directly allocate the output elements of each thread block to their corresponding positions in global memory.

The “Split” step evenly distributes data to each thread block and then performs bucket partitioning within the thread block, solving the load imbalance caused by the coarse granularity of the traditional method’s bucket partitioning and improving the algorithm’s parallelism. However, because the task granularity is smaller, with the thread block as the execution unit, task scheduling overhead within the thread block increases, which can cause phenomena like branch divergence and register spilling. Finding the right balance between task granularity and load balancing is therefore a challenge. The “Merge” phase, by fixing the output amount of each thread block, directly eliminates the process of global address updates, thus minimizing communication overhead.
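A minimal sketch of the “Merge” idea, assuming each block's surviving candidates have already been staged in shared memory by the preceding steps; the CUTOFF size and FLT_MAX sentinel are illustrative choices, not the paper's tuned values:

```cuda
#include <cfloat>

#define CUTOFF 128  // fixed per-block output size (illustrative)

// Each block pads its output up to CUTOFF elements, so its slice of global
// memory starts at blockIdx.x * CUTOFF and no global prefix sum or
// inter-block synchronization is needed to place the result.
__device__ void sbpMergeBlock(float* globalOut, const float* sCand, int count) {
    for (int i = threadIdx.x; i < CUTOFF; i += blockDim.x) {
        float v = (i < count) ? sCand[i] : FLT_MAX;  // pad with a sentinel value
        globalOut[blockIdx.x * CUTOFF + i] = v;      // direct, precomputable global write
    }
}
```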

Overall, the SBP execution model enhances data locality and decreases global communication through the improved merge operation, and it mitigates load imbalance through the fine-grained task allocation of the split operation. The cost is increased task scheduling overhead within thread blocks, which can lead to issues such as branch divergence and register spilling.

3.2.2 SBP execution model analysis

In this section, we analyze the SBP execution model of top-K and selection algorithms from the perspective of memory access. The execution model based on the SBP strategy can be represented by Eq. 1, but the difference lies in the memory hierarchy corresponding to each operation. For \(OP_{bc}\), the costs of \(OP_{bc\_rd}\) and \(OP_{bc\_st}\) are:

$$\begin{aligned} \textrm{Cost}(OP_{bc\_rd}) &= C_{rd} + C_{hg} \\ &= 1\,\textrm{com}_{L} + 1\,\textrm{read}_{L} + 1\,\textrm{add}_{L} + 1\,\textrm{write}_{L} \\ N_C &= \textrm{bucket}\_\textrm{num} + \log (\textrm{TreeHeight}) \\ \textrm{Cost}(OP_{bc\_st}) &= C_{bst} + C_{cb} \\ &= 2 \times (2 \times \textrm{read}_{L} + \textrm{write}_{L}) + N_C \times \textrm{com}_{L} + 1\,\textrm{add}_{L} \end{aligned}$$
(5)

As shown in Eq. 5, since the cost of the bucket collection operation is primarily determined by the memory access hierarchy as discussed in Sect. 3.1.1, and both bucket partitioning and traversal in the SBP strategy are conducted within thread blocks, all memory accesses occur in local memory. Specifically, for the radix bucket partitioning operation, both radix computation and histogram updating are performed within thread blocks; hence, they are local memory accesses, indicated by subscript \(L\). In tree-based bucket partitioning, the operations of building the search tree and collecting buckets are also executed within thread blocks, involving local memory accesses. In contrast, operations in the execution model under the traditional bucket partitioning strategy occur in global memory. Therefore, in this part of the analysis, the SBP execution model effectively considers and utilizes the GPU’s memory hierarchy, increasing access to shared memory and registers while minimizing global memory access, thereby reducing memory access overhead.

In this paper, we focus on top-K and selection algorithms, so the main goal of \(OP_{ex}\) is to filter elements. After completing the bucket collection step, this involves computing the target bucket ID, hence involving prefix sum calculations. In the execution model under the traditional bucket partitioning strategy, a new kernel might be launched with thread blocks as its execution units, leading to global memory access. In contrast, bucket partitioning operations under the SBP strategy are based on threads, so communication may occur more within registers or shared memory. Detailed analysis will be presented in Sects. 3.3 and 3.4.

$$\begin{aligned} \textrm{Cost}(OP_{mg}) &= \textrm{read}_L + \textrm{write}_G + \textrm{write}_L + \textrm{com}_L \end{aligned}$$
(6)
$$\begin{aligned} \textrm{Cost(total)} &= \textrm{Cost}(OP_{bc} + OP_{ex} + OP_{mg}) \end{aligned}$$
(7)

The merge operation \(OP_{mg}\) under the SBP strategy is designed to compute the target bucket ID determined within thread blocks, so its communication overhead is predominantly within local memory. As illustrated in Eq. 6, the input for the merge operation consists of elements within each thread block, stored in registers. These elements are then compared with the target bucket ID and written into shared memory. Subsequently, a fixed size of data is uniformly output by padding, facilitating the calculation of the offset of the target bucket elements in each thread block to determine their position in global memory. Finally, the elements in shared memory are directly written into global memory. Consequently, each element in \(OP_{mg}\) undergoes one local write (\(\textrm{write}_L\)) and one global write (\(\textrm{write}_G\)). Hence, the merge operation based on the SBP model not only involves more extensive computational tasks than those in the traditional execution model but also primarily leverages shared memory and registers for memory access, which are notably efficient in GPUs. This approach in the SBP execution model substantially minimizes global memory access and communication overhead.

Overall, the total cost in the SBP execution model does not account for the cost of synchronization. This is evident in Algorithm 2, where there is only one kernel, obviating the need for synchronization within a single algorithm iteration. Thus, the SBP execution model further reduces the costs associated with high-latency communication and synchronization. However, this model can exert increased pressure on shared memory and registers, potentially impacting the number of active warps and reducing the number of parallel threads during execution.

Overall, the SBP execution model exhibits significant advantages in handling top-K and selection algorithms, especially in optimizing memory access. By operating mainly in local memory (shared memory and registers), the split operation localizes the data, while the improved merge operation increases the amount of work performed on locally resident data, making the model significantly less dependent on global memory. Besides reducing global communication, this improves robustness to skewed datasets and execution efficiency on the GPU architecture. However, this optimization also brings challenges, including increased implementation complexity and pressure on limited GPU resources. Nonetheless, the potential of the SBP model to improve GPU computing efficiency makes it a powerful tool for top-K and selection algorithms.

3.2.3 Execution model with SBP mapped into GPU

In this section, we present the implementation of the SBP model on GPUs from the perspective of task granularity.

Coarse-grained allocation Considering the memory hierarchy in GPU architectures, we differentiate between coarse-grained and fine-grained task allocations.

Coarse-grained allocation mainly deals with high-latency global memory. Performance bottlenecks here are often due to high-latency device synchronization and memory access. The customary coarse-grained task allocation, as shown in Fig. 4, can lead to the serialization of multiple kernels, resulting in performance bottlenecks [22] due to high-latency data transfers and waits (the red part in Fig. 4). Therefore, in the SBP execution model, multiple operations are combined into a single kernel, reducing global synchronization, data transfer, and kernel launch overhead, thereby improving performance.

As illustrated in Algorithm 2, we first divide the data into several tiles based on task volume granularity, allocate them to each thread block, and execute the corresponding algorithm (top-K or selection) in a single kernel. The bucket collection, algorithm execution, and merging operations are all completed within this kernel; the fine-grained task allocation is detailed below. The SBP execution flow maps to the GPU such that each thread block executes the kernel, which includes all operations of a single SBP iteration, as shown in line 4 of Algorithm 2. After an iteration ends, the results output by each thread block are aggregated into global memory for the next iteration if the total count of output elements exceeds the number of elements processed by one thread block (\(\textrm{tile}\_\textrm{size}\)), as shown in line 5 of Algorithm 2.
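A hedged host-side sketch of this coarse-grained loop: one fused kernel launch per iteration, with the candidate set shrinking by a factor tied to the fixed per-block output. The kernel body, tile size, and cutoff below are placeholders, not the paper's tuned values:

```cuda
#include <utility>

constexpr int TILE_SIZE  = 4096; // elements per thread block (illustrative)
constexpr int BLOCK_SIZE = 256;  // threads per block (illustrative)
constexpr int CUTOFF     = 128;  // fixed per-block output size (illustrative)

// Placeholder for the fused kernel (OP_bc + OP_ex + OP_mg in one launch);
// the real body corresponds to Algorithm 3.
__global__ void sbpKernel(const float* in, float* out, int n, int K) { /* ... */ }

// Mirrors Algorithm 2: iterate the single kernel until the candidates fit in one tile.
void sbpIterate(float* d_in, float* d_out, int n, int K) {
    while (n > TILE_SIZE) {
        int numTiles = (n + TILE_SIZE - 1) / TILE_SIZE;
        sbpKernel<<<numTiles, BLOCK_SIZE>>>(d_in, d_out, n, K); // no inter-kernel sync needed
        n = numTiles * CUTOFF;   // every block emitted exactly CUTOFF padded elements
        std::swap(d_in, d_out);  // this round's output feeds the next round
    }
    // n now fits in a single tile; one small final pass finishes the job.
}
```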

Implementing the SBP model on a GPU presents significant challenges. First, although encapsulating every operation within a single kernel reduces the overhead of inter-kernel synchronization and data transfer, this approach increases the demand for registers. High register usage can lead to reduced GPU occupancy, fewer active warps, and consequently a decline in overall computational efficiency. More seriously, excessive register demand can cause register spilling, further increasing local memory access costs. Second, determining the optimal block size and the number of elements processed per thread to balance parallelism and resource utilization is also challenging, requiring fine-tuning to achieve optimal performance. In response to these challenges, we redesign the task allocation logic of the kernels, reducing the register requirements per kernel, thereby enhancing GPU occupancy and reducing the risk of register spilling. This strategy effectively balances computational resource usage and execution efficiency. The detailed implementation and effects of these optimizations are discussed in Sect. 3.3.2.


Algorithm 2 Coarse-grained implementation of the SBP execution model on GPU.

Fine-grained allocation Fine-grained task allocation mainly targets on-chip memory, such as shared memory and registers. The primary challenge at this granularity is to increase thread-level parallelism, thereby maximizing the number of active warps and reducing stall time. For general bucket partition implementations, low utilization of shared memory and registers is the main obstacle. Our split-bucket partitioning (SBP) model improves these aspects by allocating data to tiles in shared memory and processing more tasks in one kernel to improve register utilization. As shown in Algorithm 3, we first allocate shared memory (lines 1–2). Then, we execute operations \(OP_{bc}\), \(OP_{ex}\), and \(OP_{mg}\) in lines 7–8, lines 9–11, and lines 14–15, respectively.

For \(OP_{mg}\) in the SBP model, we fix the number of output elements per thread block, thereby directly calculating each thread block’s offset position and reducing communication overhead, as shown in line 16 of Algorithm 3. This parameter choice is based on a comprehensive evaluation of multiple configurations, aiming to balance computational load and resource utilization. The approach allows parallel processing of data within a single tile in global memory, enhancing data parallelism and reducing latency and synchronization overhead in global memory. Specifically, this optimization significantly improves overall processing speed and throughput by reducing the number of global memory accesses and enhancing the efficiency of thread collaboration. However, the optimization also faces challenges: a fixed cutoff value affects the number of algorithm iterations within each thread block, especially when the data contain many duplicate elements. In such circumstances, while a smaller cutoff value reduces each thread block's output and decreases the number of kernel iterations, it also increases the number of iterations per thread block in a single execution. Choosing an appropriate cutoff value therefore requires finding the optimal balance between reducing global synchronization overhead and optimizing computational efficiency within thread blocks. Experimental testing determined that when a thread processes 1024 elements, a cutoff value of 128 achieves optimal performance.


Algorithm 3 Kernel of SBP execution model on GPU.

As for \(OP_{bc}\) and \(OP_{ex}\), the implementation of these two operations largely depends on the algorithm and the bucket partitioning rules, so we provide a detailed introduction in Sects. 3.3 and 3.4.

In summary, this section introduced the implementation of the SBP (split-bucket partitioning) model on the GPU in detail, focusing on task granularity. Through coarse-grained allocation, the SBP model reduces synchronization and data transfer overhead between kernels, but increases register demand and may reduce GPU occupancy. Fine-grained allocation mainly targets on-chip memory (shared memory and registers), aiming to increase thread-level parallelism. We improved register utilization by optimizing the allocation of data in shared memory and processing more tasks in a single kernel. These strategies balance the use of computing resources while improving execution efficiency, with the remaining challenges being register usage optimization and appropriate thread block size selection. Through careful tuning, the implementation of the SBP model on the GPU effectively improves data processing speed and throughput while ensuring efficient resource utilization.

3.3 Top-K algorithm based on SBP

In this section, we introduce the design and implementation of the top-K algorithm based on the SBP execution model. We first present the basic concept of applying the top-K algorithm to both the traditional and the SBP execution models. We then discuss the design and implementation of the radix-based top-K algorithm (\(TK\_RdSP\)) in Sect. 3.3.1 and the search tree-based top-K algorithm (\(TK\_StSP\)) in Sect. 3.3.2. Section 3.3.2 also introduces two optimization methods (task optimization and control flow optimization) for \(TK\_StSP\), addressing the decreased occupancy caused by a single complex kernel and the redundant computation in the bucket collection process.

As illustrated in Eq. 1, \(OP_{ex}\) can be described as \(OP_{top-K}\), which decides which buckets contain the top-K elements, using the bucket counts as input and traversing them to find the goal buckets. This operation is usually implemented on the host. However, \(OP_{mg}\) is a memory-bound operation that requires a global data update; the larger the dataset, the more global memory accesses it incurs.

Therefore, we apply the SBP execution model to the top-K algorithm. Specifically, we integrated the calculations of \(OP_{mg}\) and \(OP_{bc}\) operations into a single kernel, which reduces access to global memory but at the same time increases the use of registers, leading to decreased execution efficiency (occupancy). To optimize this issue, we adjusted the task allocation of the kernel. Specifically, we separated part of the operations in \(OP_{bc}\) to run as an independent kernel. This approach is intended to alleviate the burden on complex kernels. We chose to isolate the construction part of the search tree, as this part only needs to process a relatively small amount of data, and its cost and communication overhead regarding global memory are almost negligible.

Also, for \(OP_{bc}\), it is not necessary to collect all the elements into buckets for small \(K\) values, as the collection operation dominates the runtime. We introduce control flow optimization in Sect. 3.3.2 to solve this problem. We therefore need to find a balance between task granularity and load balance.

When we apply the SBP strategy to the top-K algorithm, we evenly split the data into multiple tiles and then divide the data into buckets according to the partitioning rules. For each bucket, we first calculate its element count and then compute the prefix sum of these counts. Using this prefix sum, we identify the buckets containing the target elements and aggregate them into the result array. Afterward, by padding the array so that each block outputs a fixed number of elements, we can allocate a fixed range of global memory addresses to each block. This implements a parallel merge operation and reduces the cost of computing each output element's index in the result array. This method is effective for both \(TK\_RdSP\) and \(TK\_StSP\), enhancing the parallelism of the algorithm.

3.3.1 \(TK\_RdSP\) algorithm

Algorithm 4 describes the specific steps of the radix-based \(TK\_RdSP\) algorithm. As the coarse-grained implementation of SBP is the same for every algorithm (see Algorithm 2), here we only introduce the fine-grained implementation. In the \(TK\_RdSP\) algorithm, we modified the radix top-K from [15]. Specifically, the implementation of \(OP_{bc}\) is described in lines 5–6 of Algorithm 4: we determine the bucket ID of each element by performing bit computations on the element’s binary representation and count the number of elements in each bucket using atomic addition. In line 7, we traverse the prefix sum array and compare it with the selection condition, i.e., the value of \(k\), to find the goal bucket ID. Lines 8–11 implement \(OP_{ex}\) and the local merge of \(OP_{mg}\): we store the elements that meet the condition in the goal bucket. As our radix top-K operates bitwise, a for loop traverses all the digit positions, as shown in line 4 of Algorithm 4; when the loop ends, the radix top-K computation within the thread block is complete. To output a fixed amount of data from each thread block, we pad the elements in the target bucket in line 14 and then write the thread block’s target bucket to its corresponding position in the global target bucket in line 15, implementing the global merge of \(OP_{mg}\). This is also a characteristic of the fine-grained task allocation in the SBP execution model (Sect. 3.2.3).


Algorithm 4 Top-K algorithm based on radix with SBP model.
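To make the digit-narrowing core concrete, the following is a serial host-side sketch of the radix rule. For brevity it returns the \(k\)th smallest element (the selection flavor); the top-K variant of Algorithm 4 additionally keeps all buckets preceding the goal bucket. All names are ours, not the paper's:

```cuda
#include <cstdint>
#include <vector>

// k-th smallest (0-based, assumes k < v.size()): fix one 8-bit digit per pass,
// from the most significant digit down, keeping only the goal bucket each time.
uint32_t radixSelectSmallest(std::vector<uint32_t> v, int k) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        int hist[256] = {0};
        for (uint32_t x : v) ++hist[(x >> shift) & 0xFF];  // OP_bc: digit histogram
        int goal = 0;
        while (k >= hist[goal]) k -= hist[goal++];         // prefix-sum walk to the goal bucket
        std::vector<uint32_t> next;
        for (uint32_t x : v)                               // keep only the goal bucket
            if (((x >> shift) & 0xFF) == (uint32_t)goal) next.push_back(x);
        v.swap(next);
    }
    return v.front();  // all digits fixed: remaining elements share one value
}
```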

For the \(TK\_RdSP\) algorithm, the three operations of the execution model are performed sequentially within a single kernel. This increases the workload of each thread and instruction-level parallelism, but it may also increase register usage within thread blocks, potentially causing registers to spill into local memory (whose latency is as high as global memory's). The increased use of registers and shared memory by complex kernels lowers occupancy, which is disadvantageous on GPUs. However, because the bucket calculation and bucket collection parts (bucket traversal and bucket counting) of the radix algorithm are closely intertwined, separating them would introduce redundant computation and unnecessary workload, reducing performance. After experimental testing, we retained this structure; for the tree-based top-K algorithm, however, we optimized the tasks.

3.3.2 \(TK\_StSP\) algorithm and optimization

The \(TK\_StSP\) algorithm differs from the \(TK\_RdSP\) algorithm primarily in its bucket partitioning rules. The most significant distinction is that the bucket collection process in \(TK\_StSP\) allows three steps, bucket partitioning, bucket ID calculation, and bucket counting, to be separated and implemented independently. This separation leaves room for reducing the kernel's register count and its redundant computation. To this end, we introduce task optimization later in this section, where bucket partitioning is isolated into a separate kernel, diminishing the workload and the number of registers required by the complex kernel and thereby enhancing kernel utilization. We also present control flow optimization, where bucket ID calculation and bucket counting are separated, and bucket counting is combined with merging. This permits obtaining the target bucket array without traversing all elements, curtailing redundant computation and improving algorithm efficiency.

Control flow optimization

Fig. 6 Runtime proportion of the bucket traversal in the top-K algorithms

We measured the proportion of the top-K runtime spent on bucket traversal, as depicted in Fig. 6. The bucket traversal process consumes a significant portion of the runtime for the search tree-based top-K algorithm when \(K\) is small, and this proportion diminishes as \(K\) increases.

This phenomenon occurs because the overall runtime grows with \(K\), while the time spent on bucket traversal remains unchanged; consequently, the proportion of time spent on bucket traversal decreases. For the radix-based top-K algorithm, the total runtime does not grow with \(K\), so the proportion remains steady, averaging 37%. The constant bucket traversal time stems from the generic implementation of bucket partitioning, which traverses and counts all buckets; the prefix sum of these counts pinpoints the bucket containing the target element, and the target bucket is merged into the result array. When \(K\) is relatively small, traversing all buckets is unnecessary and can even hurt efficiency. This is particularly true when the buckets are ordered, meaning the elements in one bucket are uniformly smaller or larger than those in the next. In such cases, traversing a few buckets suffices to locate the target elements, without examining all of them. Hence, we propose control flow optimization for the top-K algorithm to address this problem.

Fig. 7 Traditional execution details of bucket partitioning in the top-K algorithm

Fig. 8 Execution details of the top-K algorithm with control flow optimization when \(K=2\)

The traditional control flow of the bucket partition model, illustrated in Fig. 7, generates a flag vector to compute the histogram, prefix-sums the histogram to locate the goal bucket, and then merges the goal bucket into the result array. Building on the analysis above, we refine the control flow to avoid traversing all buckets: we first merge the elements of the first bucket into the result array and count them; we then merge the elements of the second bucket and accumulate the count of the first two buckets, and so on, until the accumulated count reaches the desired number of target elements \(K\). This procedure is depicted in Fig. 8.

As Fig. 8 shows, after each bucket is merged into the result array, we check whether the aggregate count exceeds \(K\); if not, the next bucket is merged and the count updated. In the illustrated example, \(K=2\) and the first bucket contains 3 elements, so only a single pass of bucket traversal is required.

Specifically, as illustrated in Algorithm 5, we first construct the search tree and calculate the bucket ID for each element (lines 2 and 3 of Algorithm 5). In line 4, we initialize the cumulative count of elements within each bucket. Lines 5–11 depict the initial bucket collection process, starting from the first bucket: we progressively add the bucket's elements to the \(\textrm{goal}\_\textrm{bucket}\) array while accumulating the bucket's element count into a variable named count, continuing until count exceeds \(k\). In practice, particularly when \(k\) is relatively small, traversing no more than two buckets generally satisfies the condition \(\textrm{count} > k\), and count is often substantially large. This is because the nodes of the search tree are generated randomly, and for a small \(k\), node values are likely to exceed \(k\). Consequently, we rerun the algorithm on the collected elements, starting from line 12 of Algorithm 5, iterating until the number of collected elements falls below the cutoff value.
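A serial sketch of this early-exit collection loop, with an illustrative stand-in for the search-tree partitioning rule (the real kernel parallelizes the inner scan across the threads of a block; all names are ours):

```cuda
#include <vector>

// Merge buckets in ascending order and stop as soon as the running count
// exceeds K; no full histogram or prefix sum over all buckets is needed.
int collectUntilK(const std::vector<float>& tile, const std::vector<float>& pivots,
                  std::vector<float>& goalBucket, int K) {
    auto bucketId = [&](float x) {          // stand-in rule: count pivots not above x
        int b = 0;
        for (float p : pivots) b += (x >= p);
        return b;
    };
    int count = 0;
    for (int b = 0; count <= K && b <= (int)pivots.size(); ++b)
        for (float x : tile)                // merge bucket b, then re-test the running count
            if (bucketId(x) == b) { goalBucket.push_back(x); ++count; }
    return count;                           // on exit, goalBucket covers the top-K set
}
```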

By optimizing the control flow, we reduce the need for comprehensive bucket traversal, thereby enhancing the performance of the top-K algorithm. Furthermore, the cutoff value ensures that each thread block outputs a fixed number of elements, eliminating the need for global synchronization and communication in the merging operation while promoting register reuse, which significantly improves performance. However, this optimization is most effective when \(K\) is relatively small; as \(K\) increases, its benefits diminish and may even turn into a performance loss. This can be traced back to line 16 of Algorithm 5, which causes the number of iterations to grow in proportion to \(K\).

Task optimization In the top-K algorithm implemented via SBP, the extensive task volume of complex kernels can lead to register spillover and low occupancy. To mitigate these issues, we separated part of the bucket collection operation from the complex kernel, reducing the kernel’s load. However, this separation can cause an imbalance in thread block loads, leading to only a few thread blocks being engaged in computation, while numerous others remain idle.


Algorithm 5 Top-K algorithm based on search tree with SBP model.

In the search tree-based top-K algorithm, the tree construction operation is isolated into a separate kernel, and only the search tree is stored in global memory, a cost that is practically negligible for large datasets. Specifically, we relocated line 1 of Algorithm 5 outside the kernel, as shown in Algorithm 6. We construct the search tree in line 1, and since the search tree has few nodes (our implementation determined that 3 nodes are optimal), a single thread block suffices. While this leaves many thread blocks idle, the computation is minimal, ensuring brief wait times. Remarkably, this optimization enhances performance by approximately 30%.


Algorithm 6 Task optimization of \(TK\_StSP\).
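A hedged sketch of this separated kernel, assuming a simple strided-sampling rule for the three pivots (the paper specifies sampling but not the exact rule; names are ours):

```cuda
// One-block kernel: sample three elements as pivots and sort them so the
// resulting buckets are ordered; the work is trivial, so one thread suffices.
__global__ void buildSearchTree(const float* in, int n, float* pivots /* 3 floats */) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float a = in[n / 4], b = in[n / 2], c = in[3 * n / 4]; // assumed sampling rule
        if (a > b) { float t = a; a = b; b = t; }              // 3-element sorting network
        if (b > c) { float t = b; b = c; c = t; }
        if (a > b) { float t = a; a = b; b = t; }
        pivots[0] = a; pivots[1] = b; pivots[2] = c;
    }
}
// Launched as buildSearchTree<<<1, 32>>>(d_in, n, d_pivots) before the main kernel.
```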

Conversely, for a radix-based top-K algorithm, separation is superfluous due to its reliance solely on radix computations, leaving no room for division.

3.4 Selection algorithm based on SBP

In this section, we delve into the design and implementation of selection algorithms under the SBP execution model, utilizing two distinct bucket partitioning methods. Section 3.4.1 will introduce the radix-based selection algorithm within the SBP execution model (\(S\_RdSP\)), and Sect. 3.4.2 will focus on the search tree-based selection algorithm within the SBP execution model (\(S\_StSP\)).

As illustrated in Eq. 1, \(OP_{ex}\) can be represented as \(OP_{\textrm{selection}}\) to determine which bucket contains the \(k\)th element, utilizing the bucket counts as input and traversing them to identify the goal bucket; this step is typically implemented on the host. However, \(OP_{\textrm{selection}}\) is a serial operation, which might diminish the algorithm’s parallelism. Consequently, we propose a parallel selection algorithm using the SBP strategy.

3.4.1 \(S\_RdSP\) algorithm

We extend the SBP execution model to selection algorithms, employing a modular approach similar to that of the top-K algorithm. However, the objectives differ: while the top-K algorithm seeks to identify the top \(K\) elements, the selection algorithm aims to find the \(k\)th smallest element. For the \(S\_RdSP\) algorithm, lines 5–7 of Algorithm 7 outline the implementation of \(OP_{bc}\) for the radix-based selection algorithm: elements are allocated to buckets via bitwise operations, and the number of elements in each bucket is computed using atomic addition. Lines 8–11 describe storing elements that meet the criterion, i.e., elements whose bucket ID matches the target bucket ID, in their respective buckets, fulfilling the basic procedure of \(OP_{\textrm{selection}}\) for the radix-based selection algorithm. Lines 17–18 of Algorithm 7 detail the implementation of \(OP_{mg}\) for global merging, merging the elements of the target bucket into the result array.
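A minimal sketch of that filtering step, simplified to a single global atomic counter rather than the per-block staging that the SBP kernels actually use (all names are illustrative):

```cuda
// Keep only elements whose radix digit matches the goal bucket ID.
__global__ void keepGoalBucket(const unsigned int* keys, int n, int shift, int bits,
                               unsigned int goalId, unsigned int* kept, int* keptCount) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int mask = (1u << bits) - 1u;
    if (idx < n && ((keys[idx] >> shift) & mask) == goalId) {
        int pos = atomicAdd(keptCount, 1);  // reserve an output slot
        kept[pos] = keys[idx];              // surviving key feeds the next iteration
    }
}
```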

3.4.2 \(S\_StSP\) algorithm and optimization

Given the selection algorithm’s focus on a single element, which may reside in any bucket, the efficacy of control flow optimization could be compromised. Specifically, if the target element is located in a higher-indexed bucket, control flow optimization might necessitate additional bucket traversal, extra checks, and potential branch divergence, which could hinder performance. Therefore, for the selection algorithm, we do not employ control flow optimization. Instead, we focus on task optimization, applicable to the search tree-based selection algorithm. This optimization, similar to that used in the top-K algorithm detailed in Sect. 3.3.2, separates the bucket partitioning operation from the complex kernel, reducing the number of registers and improving the kernel’s occupancy. The specific implementation of the \(S\_StSP\) algorithm is depicted in Algorithm 8. As shown in lines 1–3 of Algorithm 8, we first construct the search tree and calculate the bucket ID for each element. In the \(S\_StSPT\) algorithm, the task-optimized version, we separate line 1 of Algorithm 8 into its own kernel. Lines 5–7 use atomic addition to calculate the number of elements in each bucket and determine the global bucket ID. Lines 8–10 depict the bucket collection process, which stores elements whose bucket ID matches the goal bucket ID in the goal bucket. The \(OP_{mg}\) operation, similar to that in \(TK\_StSP\), takes the target bucket as input and iterates over it until the number of elements in the target bucket falls below the cutoff value, as shown in lines 11–22. We pad with maximum values until the total count equals the cutoff value, as in line 11, and finally write the result directly to the corresponding global memory, as in line 24.


Algorithm 7 Selection algorithm based on radix with SBP model.


Algorithm 8 Selection algorithm based on search tree with SBP model.

4 Evaluation

4.1 Experimental setup

In this section, we evaluate the performance of top-K and selection algorithms under different strategies. Our experiments were conducted on TU102 and A100 GPUs, representing some of the most advanced GPU architectures available. The algorithms under evaluation, listed in Table 1, include the selection and top-K algorithms based on the SBP execution model: \(S\_RdSP\), \(S\_StSP\), \(TK\_RdSP\), and \(TK\_StSP\), alongside their optimized versions: the task-optimized search tree-based selection algorithm \(S\_StSPT\) and the task- and control-flow-optimized search tree-based top-K algorithm \(TK\_StSPTC\).

The benchmark algorithms utilized are:

  • \(S\_StP\) and \(TK\_StP\) algorithms: These represent the state-of-the-art parallel selection and top-K algorithms based on the search tree on GPU [3].

  • \(S\_RdP\) and \(TK\_RdP\) algorithms: \(TK\_RdP\) is the GPU implementation of the state-of-the-art radix-based top-K algorithm [15], inspired by the popular GGKS algorithm [16]. For \(S\_RdP\), we adapted the implementation of \(TK\_RdP\) [15], modifying the selection condition of the result array to select only buckets whose bucket ID matches the goal bucket ID. This modification restricts the result to the bucket containing the \(k\)th element and yields the state-of-the-art radix-based selection algorithm under the traditional execution model.

Table 1 Algorithms

We assessed the algorithms on three datasets:

  • Uniform dataset: Characterized by a uniform distribution with values spanning from 0 to \(2^{30}\), this dataset encompasses sample sizes ranging from \(2^{21}\) to \(2^{29}\).

  • Bucket killer dataset: Composed primarily of 1.0 values (as floating-point numbers), this dataset contains only four distinct numbers that deviate from 1.0, each differing from 1.0 in a single digit of its 8-bit representation. The dataset, introduced by researchers from MIT for evaluating top-K algorithms based on bucket partitioning technology [15], is specifically designed to minimize the compression efficiency of a single pass of bucket traversal. A hedged generator sketch follows this list.

  • Real dataset: The \(ANN\_SIFT1B\) dataset [30], widely used in k-nearest neighbor research [4], consists of 1 billion 128-dimensional vectors representing images. We take the first vector of \(ANN\_SIFT1B\) as the query and compute the Euclidean distance between it and each of the other 1 billion vectors; the resulting distances form the input vector for the top-K operation, enabling analysis of performance on a real-world dataset.
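To make the bucket killer construction concrete, the host-side sketch below (referenced in the dataset description above) fills an array with 1.0f and perturbs four entries by flipping a single bit of their bit patterns; the specific flipped positions are assumptions, since [15] does not fix them here.

#include <cstring>
#include <vector>

// Generate a bucket-killer-style dataset: almost all entries are 1.0f,
// and four entries differ from 1.0f by one flipped bit. The chosen bit
// positions below are illustrative, not those of [15].
std::vector<float> makeBucketKiller(size_t n) {
  std::vector<float> data(n, 1.0f);
  unsigned int onePattern;
  std::memcpy(&onePattern, &data[0], sizeof(float));  // bit pattern of 1.0f
  const int flippedBits[4] = { 0, 8, 16, 23 };        // assumed positions
  for (int i = 0; i < 4 && static_cast<size_t>(i) < n; ++i) {
    unsigned int pattern = onePattern ^ (1u << flippedBits[i]);
    std::memcpy(&data[i], &pattern, sizeof(float));
  }
  return data;
}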

4.2 Performance evaluation of top-K algorithm based on SBP model

In this section, we focus on evaluating the top-K algorithm based on the SBP execution model. Specifically, in Sect. 4.2.1, we investigate the impact of data size on the algorithm’s performance. In Sect. 4.2.2, we analyze the influence of varying \(K\) values on the efficiency of the top-K algorithm, and in Sect. 4.2.3, we explore how data distribution affects performance.

4.2.1 Dependence on data size

We evaluated the impact of data volume on the top-K algorithms using the Uniform dataset on both the TU102 and A100 architectures, as depicted in Fig. 9. On the TU102 architecture, we assessed three SBP-based top-K algorithms: \(TK\_RdSP\), \(TK\_StSP\), and \(TK\_StSPTC\). Evaluating \(TK\_RdSP\) allowed us to isolate the impact of the SBP execution strategy on top-K algorithms based on radix bucket partitioning, while evaluating \(TK\_StSP\) and \(TK\_StSPTC\) provided insights into the effects of the SBP execution strategy and its optimization methods on top-K algorithms based on search tree bucket partitioning.

We observed that as the data volume increases, the runtime of all algorithms also increases. However, algorithms based on SBP generally outperform those under traditional models. For the search tree-based bucket partitioning top-K algorithms, \(TK\_StSPTC\) and \(TK\_StSP\) are more efficient and exhibit a more gradual increase in runtime than \(TK\_StP\). Specifically, \(TK\_StSP\) is 1.2–1.78 times faster than the benchmark algorithm \(TK\_StP\), \(TK\_StSPTC\) is 1.14–1.96 times faster than \(TK\_StSP\), and \(TK\_StSPTC\) achieves a speedup of 1.4–2.3 times over the benchmark algorithm \(TK\_StP\). Evidently, the SBP execution model significantly influences the performance of search tree-based bucket partitioning top-K algorithms, and the optimization techniques contribute substantially to the performance gains. For radix-based bucket partitioning top-K algorithms, the SBP execution model also notably improves performance, with \(TK\_RdSP\) being 1.2–1.8 times faster than the benchmark algorithm \(TK\_RdP\). This improvement is attributed to the finer execution granularity of algorithms under the SBP model, which better harnesses the parallelism of GPUs.

Fig. 9 Performance of top-K algorithm with increasing dataset size under uniform data distribution when \(K=8\) on TU102 architecture

Performance on A100 We evaluated two optimized top-K algorithms utilizing the SBP model on the A100 architecture, \(TK\_StSPTC\) and \(TK\_RdSP\), as illustrated in Fig. 10. The results indicate that for the search tree-based bucket partitioning top-K algorithm, \(TK\_StSPTC\) not only outperforms \(TK\_StP\) but also exhibits a more moderate increase in runtime, a trend even more pronounced than on the TU102 architecture (Fig. 9). Specifically, the speedup of \(TK\_StSPTC\) over the benchmark algorithm \(TK\_StP\) ranges between 1.5 and 2.9 times. For radix-based bucket partitioning top-K algorithms, \(TK\_RdSP\) surpasses the benchmark algorithm \(TK\_RdP\) by a factor of 2.0 to 2.3. This pronounced improvement on the A100 can be attributed to the architecture's larger shared memory capacity, which alleviates the pressure that algorithms under the SBP execution model place on shared memory, thus improving memory utilization and substantially boosting algorithm efficiency.

Fig. 10 Performance of top-K algorithm with increasing dataset size under uniform data distribution when \(K=8\) on A100 architecture

4.2.2 Dependence on \(K\) value

In this section, we examine the impact of varying \(K\) values on the top-K algorithms based on the SBP execution model. As in Sect. 4.2.1, we assessed \(TK\_RdSP\), \(TK\_StSP\), and \(TK\_StSPTC\) on the TU102 architecture and the best SBP-based implementations on the A100. The SBP execution model, with its fixed cutoff value for the output size of each thread block, inherently limits the maximum \(K\) value; currently, our algorithms support \(K\) values up to 128. Notably, when \(K=128\), the thread block size and cutoff value must be adjusted to meet the demand for larger \(K\) values, requiring a thread block size of 1024 and a cutoff value of 256.
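A minimal sketch of the launch-configuration adjustment just described, with the default block size and cutoff assumed (only the \(K=128\) values are stated in the text):

struct LaunchConfig { int blockSize; int cutoff; };

// Choose the thread block size and per-block cutoff from K. The K = 128
// case (block size 1024, cutoff 256) follows the text; the default
// configuration for smaller K is an assumption.
LaunchConfig configForK(int k) {
  if (k <= 64) return { 256, 128 };   // assumed default configuration
  return { 1024, 256 };               // required when K = 128
}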

Figure 11 demonstrates the impact of different \(K\) values on SBP top-K algorithms at a data volume of \(2^{28}\). We observed that the runtime of the algorithms based on search tree bucket partitioning technology (\(TK\_StP\), \(TK\_StSP\), and \(TK\_StSPTC\)) increases as \(K\) grows. The runtime of \(TK\_StSPTC\) rises significantly once \(K > 16\): when \(K\) is small, the algorithm eliminates unnecessary bucket traversal operations, reducing redundant computation and improving efficiency, but the number of buckets to traverse grows with \(K\). Conversely, \(TK\_StSP\) exhibits a more gradual increase in runtime with growing \(K\) than \(TK\_StSPTC\); because it does not adopt control flow optimization, the increase in \(K\) does not affect the number of bucket traversals. However, as the nodes of the search tree are randomly generated, a larger \(K\) value increases the number of elements in the target bucket and hence the number of iterations within the thread block. For \(K \ge 64\), \(TK\_StSP\) becomes more advantageous than \(TK\_StSPTC\). Overall, \(TK\_StSPTC\) outperforms the algorithms based on the traditional execution model, with speedups ranging between 1.2 and 2.48 times, as depicted in Fig. 12. For radix-based bucket partitioning top-K algorithms, the runtime of \(TK\_RdSP\) grows relatively slowly with \(K\); since \(TK\_RdSP\) lacks control flow optimization, the variation in \(K\) does not influence the number of bucket traversals. As with \(TK\_StSP\), a larger \(K\) increases the number of elements in the target bucket and thus the number of iterations. When \(K=128\), due to the change in block size and cutoff value, the runtime of both \(TK\_RdSP\) and \(TK\_StSP\) increases. Nevertheless, as shown in Fig. 12, \(TK\_RdSP\) still outperforms \(TK\_RdP\), with speedups ranging between 1.1 and 1.25 times.

Fig. 11 Performance of top-K algorithms for various \(K\) on TU102 when \(DataSize = 2^{28}\)

Fig. 12 Speedup of top-K algorithms for various \(K\) on TU102 when \(DataSize = 2^{28}\)

Performance on A100 We assessed the effect of different \(K\) values on the top-K algorithms on the A100 architecture, as depicted in Fig. 13. As \(K\) increases, the runtime of the algorithms based on search tree bucket partitioning technology (\(TK\_StP\), \(TK\_StSP\), and \(TK\_StSPTC\)) also increases. However, in contrast to the behavior on the TU102 architecture, the runtime of \(TK\_StSPTC\) grows nearly in parallel with that of the benchmark algorithm. This can be attributed to the larger shared memory on the A100, which provides additional headroom for bucket traversal. As illustrated in Fig. 14, the speedups of \(TK\_StSPTC\) and \(TK\_StSP\) relative to \(TK\_StP\) range between 1.42–2.72 times and 1.48–1.99 times, respectively. For \(K \ge 16\), \(TK\_StSP\) outperforms \(TK\_StSPTC\). This is because the control flow optimization in \(TK\_StSPTC\) eliminates unnecessary bucket traversal operations only when \(K\) is small, reducing redundant computation, whereas the number of buckets that must be traversed grows with \(K\). For radix-based bucket partitioning top-K algorithms, both \(TK\_RdP\) and \(TK\_RdSP\) exhibit a consistent, stable growth trend as \(K\) increases. This stability results from the absence of control flow optimization in \(TK\_RdSP\), meaning that changes in the \(K\) value do not affect the number of bucket traversals. As shown in Fig. 14, the speedup of \(TK\_RdSP\) remains around 2.0–2.3 times.

Fig. 13 Performance of top-K algorithms for various \(K\) on A100 when \(DataSize = 2^{28}\)

Fig. 14 Speedup of top-K algorithms for various \(K\) on A100 when \(DataSize = 2^{28}\)

4.2.3 Dependence on distribution

Here, we focus on algorithms based on bucket partitioning, whose performance often depends on the data distribution; the bucket killer dataset exemplifies a distribution that typically challenges bucket partitioning techniques [15]. In this section, we evaluate the performance of the top-K algorithm under the SBP model on both the bucket killer dataset and a real-world dataset, underscoring the practical viability of our algorithm.

Evaluation on bucket killer dataset Figure 15 shows that both the \(TK\_StSP\) and \(TK\_RdSP\) algorithms exhibit a more restrained increase in runtime with growing data volume than the benchmark algorithms. Moreover, their runtimes are shorter, with speedups of 1.9–4.9 times and 2.6–4.3 times, respectively. By splitting the data into smaller segments for concurrent processing, the SBP execution model increases the parallelism of the algorithms. This strategy mitigates, to a certain degree, the issues caused by skewed data distributions and reduces the associated high-latency communication overhead, as discussed in Sect. 3.2.2.

Fig. 15 Performance of top-K algorithm with increasing dataset size under bucket killer data distribution

Evaluation on real-world dataset The acceleration on the real dataset is relatively subdued because of the prevalence of duplicate elements, which limits the algorithm's acceleration potential. The SBP execution model, with its fixed output size for each thread block, aims to stabilize the output position of target bucket elements for each thread. However, the abundance of duplicate elements in real datasets increases the probability of selecting such elements during the search tree construction phase of the search tree-based algorithms. This leads to many duplicate elements in the target bucket, requiring multiple iterations within the thread block to filter them and thereby increasing the number of bucket traversals. As a result, as shown in Fig. 16, the acceleration on the real dataset is somewhat diminished, with the speedups of \(TK\_StSP\) and \(TK\_StSPTC\) relative to \(TK\_StP\) being 1.4–1.86 and 1.03–2.36 times, respectively. Nevertheless, for top-K algorithms utilizing radix-based bucket partitioning rules, the SBP execution model maintains a stable acceleration profile, achieving a speedup of 1.03–1.3 times.

Fig. 16 Speedup of top-K algorithms for various \(K\) values on the real-world dataset on TU102

4.3 Performance evaluation of selection algorithm based on SBP model

In this section, we evaluate the selection algorithms that utilize the SBP execution model. Specifically, Sect. 4.3.1 examines the impact of data size on the performance of these algorithms, and Sect. 4.3.2 investigates how data distribution influences their performance, providing a comprehensive picture of the factors that affect their efficiency and effectiveness.

4.3.1 Dependence on data size

We evaluated the impact of data volume on the selection algorithms using the Uniform dataset on both the TU102 and A100 architectures, as depicted in Figs. 17 and 18. On TU102, we assessed three SBP selection algorithms: \(S\_RdSP\), \(S\_StSP\), and \(S\_StSPT\). Evaluating \(S\_RdSP\) reveals the impact of the SBP execution strategy on radix-based bucket partitioning selection algorithms, while evaluating \(S\_StSP\) and \(S\_StSPT\) provides insight into the influence of the SBP execution strategy and its optimization methods on selection algorithms based on search tree bucket partitioning.

Fig. 17 Performance of selection algorithm with increasing dataset size under uniform data distribution on TU102

Fig. 18 Performance of selection algorithm with increasing dataset size under uniform data distribution on A100

Figure 17 shows that as the data volume expands, the runtime of all algorithms increases accordingly. Nonetheless, algorithms built on the SBP model outperform those based on traditional models. Specifically, for search tree-based bucket partitioning selection algorithms, both \(S\_StSPT\) and \(S\_StSP\) outperform \(S\_StP\): \(S\_StSP\) is 1.47–1.86 times faster than the benchmark algorithm \(S\_StP\), \(S\_StSPT\) is 1.14–1.78 times faster than \(S\_StSP\), and \(S\_StSPT\) achieves a speedup of 1.14–2.3 times over the benchmark algorithm \(S\_StP\). This underscores the substantial impact of the SBP execution model on the performance of search tree-based bucket partitioning selection algorithms, with the optimization techniques also playing a pivotal role. Similarly, for radix-based bucket partitioning selection algorithms, the SBP execution model enhances performance, with \(S\_RdSP\) running 1.59–1.77 times faster than the benchmark algorithm \(S\_RdP\). This improvement is attributed to the finer execution granularity of algorithms under the SBP model, which better leverages the parallelism of GPUs.

Performance on A100 On the A100, the runtime of all algorithms likewise grows with data volume, and algorithms based on the SBP model consistently outperform those based on traditional models. For search tree-based bucket partitioning selection algorithms, both \(S\_StSPT\) and \(S\_StSP\) surpass \(S\_StP\): \(S\_StSP\) is 1.47–1.86 times faster than the benchmark algorithm \(S\_StP\), \(S\_StSPT\) is 1.14–1.78 times faster than \(S\_StSP\), and \(S\_StSPT\) achieves a speedup of 1.14–2.3 times over \(S\_StP\). These findings corroborate the strong influence of the SBP execution model on the efficiency of search tree-based bucket partitioning selection algorithms, with the optimization techniques markedly enhancing performance. Likewise, for radix-based bucket partitioning selection algorithms, the SBP execution model substantially improves performance, with \(S\_RdSP\) running 1.59–1.77 times faster than the benchmark algorithm \(S\_RdP\). This improvement is attributed to the finer execution granularity under the SBP model, which maximizes the utilization of GPU parallelism.

4.3.2 Dependence on distribution

In this section, we evaluate the performance of selection algorithms based on the SBP model on the bucket killer dataset and a real-world dataset. As depicted in Fig. 19, the SBP selection algorithms (\(S\_RdSP\) and \(S\_StSPT\)) achieve shorter runtimes that increase more gradually than those of the benchmark algorithms (\(S\_RdP\) and \(S\_StP\)), with speedups of 3.1–4.6 times and 4.56–15.5 times, respectively. This superior performance is primarily attributable to the SBP selection algorithms' ability to avoid the high-latency communication costs of the \(S\_StP\) algorithm and the uneven bucket distribution induced by the bucket killer dataset. By dividing the data into smaller chunks for parallel processing and subsequent bucketing, the SBP selection algorithms effectively counter the uneven distribution inherent in traditional bucket partitioning techniques. Furthermore, integrating all operations into a single, complex kernel reduces the high-latency communication overhead stemming from global memory access, further improving performance.

On the \(ANN\_SIFT1B\) dataset, the variation in speedup of the SBP selection algorithms mirrors the trends observed for the top-K algorithms. For algorithms based on search tree bucket partitioning technology, acceleration is somewhat limited, largely because the SBP execution model constrains each thread block to a fixed output size; a high prevalence of duplicate elements therefore increases the number of iterations within each thread block, affecting performance. Specifically, the speedup of \(S\_StSPT\) over \(S\_StP\) is about 2.1 times. For algorithms based on radix bucket partitioning technology, the observed speedup is around 1.42 times.

Fig. 19 Performance of selection algorithm with increasing dataset size under bucket killer data distribution on TU102

5 Conclusion

Top-K and selection operations pose fundamental challenges in data processing and analysis. The conventional bucket partition execution model, widely adopted to address these challenges, exhibits specific limitations when applied to GPU implementations, such as uneven bucket distribution and increased merging latency. To counter these issues, this paper introduced the Split-Bucket Partition (SBP) execution model. Our empirical studies applied the SBP model to both radix and search-tree bucket partitioning strategies for top-K and selection algorithms, integrating task and control flow optimizations. The model significantly surpasses existing methods, yielding performance improvements of 2.5 times and 1.5 times for radix and search-tree partitioning, respectively. Under non-uniform data distributions, the performance gains range from 1.9 times to 15.5 times, with the maximum speedup on the real-world dataset reaching 2.3 times.

It is important to acknowledge, however, that the SBP model is not without limitations. Notably, its architecture, which integrates nearly all operations into a single, intricate kernel, places significant demands on shared memory and registers. This may push shared memory usage beyond the device's capacity and cause register spilling to local memory, thereby impeding overall performance. Our optimization strategies, tested on both the TU102 and A100 GPU architectures, demonstrated notable speedups; for instance, on the A100 architecture, a maximum speedup of 2.9 times was attained on uniformly distributed datasets. While the current focus of this paper is on refining top-K and selection algorithms, the SBP model also shows promise for other computational tasks, such as sorting and hashing, which we aim to pursue in subsequent research.