1 Introduction

In chip design, a CPU is usually required to process tasks at full speed and respond in real time. A multi-level cache is necessary to sustain that speed, but it is ill-suited to massively parallel multi-tasking and may greatly increase the processor's power consumption [1]. GPUs specialize in floating-point computation; their cores are simpler and their per-operation response time is shorter than that of CPUs. GPUs parallelize work by following the SIMD approach and executing the same program across many data elements [2,3,4]. They also switch contexts and snapshot internal state quickly, so the power consumed by control logic is relatively small, leaving most resources for floating-point computation. GPUs are therefore a good fit for compute-intensive tasks that run thousands of threads simultaneously, and for workloads with high predictability, latency tolerance, and throughput [5].

Nowadays, in view of their numerous advantages, GPUs are no longer limited to 3D graphics processing. The development of general-purpose GPU [6,7,8] (referred to as GPGPU hereinafter) computing technology has opened up new opportunities for accelerating general-purpose parallel applications and has drawn considerable attention in industry. For floating-point, parallel, and general-purpose computing, GPGPU can improve performance by several orders of magnitude compared to using CPUs alone [3, 4, 9, 10]. GPGPU jobs can be carried out on various platforms, such as the Internet of Things [11].

The efficiency of memory access is a deciding factor in performance because of the GPU's multi-threaded execution model. Since a large number of threads may issue memory requests at the same time, those requests may be forwarded to off-chip DRAM if the on-chip storage hierarchy cannot handle this access pattern effectively, leaving many threads blocked and unable to obtain the requested data [2, 12,13,14]. The consequences are severe: the computing efficiency of the GPU can plummet. Such access behavior does not match the design of the GPU on-chip storage hierarchy, which negates the advantages of the GPU architecture and greatly degrades performance [14,15,16].

Given the unique structure and execution model of GPUs, efficient warp scheduling strategies are necessary. To this end, we propose a new approach called WSMP, which combines a multi-level feedback queue with perceptron-based prefetch filtering. WSMP consists of four key components that together enable effective scheduling of warps on a GPU.

The first component of WSMP is the latency tolerance of a warp. When a warp is scheduled, each of its threads has either already located or obtained the data required for computation or has failed to find that data in the cache. This information is used to calculate the latency tolerance of the warp, defined as the proportion of non-blocked threads among all the threads in the warp. By taking latency tolerance into account, WSMP can prioritize the warps that most need immediate attention, improving the overall efficiency of the GPU. The higher the latency tolerance of a warp, the lower its priority, because such a warp can afford to let warps that urgently need data from the cache be scheduled before it.

The second component is the multi-level feedback queue (MFQ). An MFQ is composed of multiple queues, each holding a set of warps with a certain priority. A higher latency tolerance of a warp indicates a lower priority in the MFQ, and vice versa. The scheduler traverses from the high-priority queues to the low-priority ones, tags each warp with an ID, and sends them individually to the underlying prefetcher and then to PPF. The approved cache blocks are subsequently prefetched into the L2 cache. The scheduler is thus ready to process warps in a well-defined order, maximizing cache hits and considerably reducing queuing delay, which is an important performance indicator [17].

The third component is the signature path prefetcher (SPP) [18]. Hardware prefetching is an important feature of modern high-performance processors [19]. SPP prefetches by dynamically updating entries in its signature table and pattern tables. These tables record, respectively, information about the last 256 visited pages and the occurrence counts of signature-related parameters. SPP selects cache lines by referencing prefetch confidence and prefetch depth, and finally generates a bundle of cache lines to prefetch.

The last component is perceptron-based prefetch filtering (PPF) [20]. PPF uses the perceptron training algorithm to gather various forms of information and make comprehensive decisions. In our design, PPF is implemented as a set of vectors containing weighted entries. The PPF structure we model retains n features, requiring n different weight tables. These weights are adjusted in real time based on feedback, and preset thresholds are used to filter out invalid cache lines proposed by SPP. We verify our design using GPGPU-Sim, a cycle-level simulator that models GPUs and their corresponding workloads [21].

This paper mainly has the following three contributions:

  • On the basis of the characteristics of GPUs and warps, the structures of the advanced SPP and PPF are modified to adapt to GPU benchmarks and to work better on GPU platforms;

  • Building on the latency tolerance of a GPU core, the latency tolerance of a GPU warp is proposed; this attribute reflects a warp's importance (priority) and assists the subsequent scheduling process;

  • The idea of MFQ is proposed to schedule all the warps, and the actual scheduling sequence is reasonably formulated according to the priority of the warps.

2 Motivation

To reduce DRAM accesses, the GPU integrates a memory coalescing component into the load/store unit of each streaming multiprocessor (SM). This allows memory requests from multiple threads in a warp to be merged into just a few memory transactions, improving memory access efficiency when memory instructions are executed. However, for GPGPU applications with irregular memory references, such as BFS and DG in ISPASS 2009, the accesses of threads in a warp can hardly be combined. Such uncoalesced memory requests often lead to memory divergence, in which some threads in a warp experience low latency due to cache hits while others endure longer latency due to cache misses [14, 16, 22], as shown in Fig. 1. We would prefer each thread to finish its current task as soon as possible, as shown in Fig. 2. Because the GPU uses the SIMD execution model, a warp must wait for its slowest memory access to complete before it can be considered finished. Uncoalesced load and store instructions also generate a large number of memory accesses, which exacerbates cache contention and cache pollution. Improving cache utilization and performance is therefore key to enhancing GPUs' processing capacity, resource utilization, and efficiency, making it an area of high research value.

Fig. 1 Internal structure of an original warp

Fig. 2 A warp after prefetching
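To make the contrast concrete, the two access patterns can be sketched as a pair of small CUDA kernels (the kernels below are our own illustration and are not taken from the benchmarks): in the first, consecutive threads of a warp touch consecutive addresses and the loads coalesce into a few transactions; in the second, each thread follows a data-dependent index, so the loads of one warp may scatter across many cache lines and diverge.

```cuda
// Coalesced pattern: consecutive threads of a warp read consecutive
// addresses, so the hardware can merge the warp's 32 loads into a few
// memory transactions.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Irregular pattern: each thread follows its own data-dependent index
// (e.g., a graph frontier in BFS), so the warp's loads may touch many
// different cache lines; some threads hit in cache while others miss,
// which is the memory divergence described above.
__global__ void gather_copy(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}
```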

There is a body of existing work on hiding latency. In [23], the authors investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency, and a larger set of pending threads to hide main memory latency. In [24], the authors propose two independent ideas: (1) the large warp microarchitecture and (2) two-level warp scheduling. Once the warps in the active group stall during a long wait, the warps in the candidate group are executed to avoid long pauses. However, such algorithms (including traditional round-robin scheduling) destroy locality to a certain extent and reduce the cache hit rate. In [25], the authors propose a dynamic resizing on active warps (DRAW) scheduler, which adjusts the size of a fetch group dynamically based on the execution phases of applications to hide various types of stalls. It does not, however, pay much attention to minimizing performance loss.

These studies still have notable shortcomings, such as excessive status checks and corrections, volatile cache models with frequent data changes, a single criterion parameter, and the inability to accurately monitor the timing of reordering memory instructions. Therefore, to some extent, the strategies above are not exactly GPGPU-friendly, and these challenges must be addressed to make GPUs effective and efficient when running GPGPU workloads.

Moreover, there is also much existing work on cache management. In [26], the authors propose adaptive cache and concurrency allocation (CCA) to prevent cache thrashing and improve the utilization of bandwidth and computational resources, thereby improving performance. Bypassed warps cannot allocate cache lines in the data cache, which prevents cache thrashing, but they can still exploit the available memory bandwidth and computational resources. In [27], the authors propose an adaptive cache management policy specifically for many-core accelerators. By preventing un-reused data from polluting the cache and alleviating contention, cache efficiency is significantly improved. The schemes above are all based on data locality, which requires on-chip memory resources to track that locality, and the original cache structure must be modified substantially. In addition, although these studies reach finer granularity, they have not been further optimized for characteristics within memory. Other prefetching algorithms such as [19, 22] are well designed and efficient, but they are not applicable to (or easily migrated to) pure GPUs because of their different and distinctive architectures.

It is worth noting that these proposals cannot guarantee that urgent warps will get a fair chance of being processed. Implementing these strategies is also far from trivial, especially for a heuristic strategy, which already limits the efficiency attainable on GPU-only architectures. Considering these defects and the increasingly sophisticated uses of GPUs, a better and more practical way of tackling these problems is worth pursuing.

Little existing work accommodates both warp scheduling and cache prefetching. The goals of this research are therefore to strike a balance between cache hit rate and prefetch range without heavily disrupting the original GPU cache structure, and to adjust the scheduling order of warps more reasonably. Meanwhile, we make an effort to keep the collateral overhead within an acceptable range.

3 The design of WSMP

3.1 HS: a customized GPGPU benchmark

To show that a GPU can gain a certain performance improvement from WSMP, it must also handle ordinary workloads in addition to the dedicated benchmarks used for performance testing [28, 29]. Therefore, we also write a customized benchmark in CUDA. We choose heap sort (abbreviated as HS hereinafter) as this self-made workload.

The basic idea of HS is as follows. First, an unordered sequence is built into a max heap, so the largest value is at the root of the heap. Then, the root is exchanged with the last element of the heap, placing the maximum value at the end. Next, the remaining n-1 elements are rebuilt into a heap; all of them are smaller than (or equal to) the element at the end. This process is repeated until the termination condition of the current recursion is reached, at which point the originally unordered sequence is sorted. Traversing the heap (i.e., the underlying array) from top to bottom and from left to right then yields an ascending sequence. HS participates in all subsequent data analyses.
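The HS source is not reproduced in this paper; the following CUDA sketch only illustrates the kind of workload it creates. The partitioning into fixed-size per-thread chunks and the iterative sift-down (in place of the recursive formulation above) are our own simplifications.

```cuda
// Sift-down for a max heap stored in a[0 .. size-1].
__device__ void sift_down(int *a, int start, int size) {
    int root = start;
    while (2 * root + 1 < size) {
        int child = 2 * root + 1;
        if (child + 1 < size && a[child + 1] > a[child]) child++;
        if (a[root] >= a[child]) return;
        int tmp = a[root]; a[root] = a[child]; a[child] = tmp;
        root = child;
    }
}

// Each thread heap-sorts its own chunk-element slice, so the kernel
// produces many small, independent tasks.
__global__ void heap_sort_chunks(int *data, int n, int chunk) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid + 1) * chunk > n) return;        // slice would run past the array
    int *a = data + tid * chunk;
    for (int i = chunk / 2 - 1; i >= 0; --i)  // build the max heap bottom-up
        sift_down(a, i, chunk);
    for (int end = chunk - 1; end > 0; --end) {
        int tmp = a[0]; a[0] = a[end]; a[end] = tmp;  // move current max to the end
        sift_down(a, 0, end);                         // restore the heap property
    }
}
```

A subsequent merge step (on the host or in another kernel) would still be needed for a fully sorted array; the point of the sketch is only that HS decomposes into many small, independent tasks of the kind GPUs handle well.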

3.2 Modified SPP model

The signature path prefetcher (SPP) is a confidence-based lookahead prefetcher [18]. SPP uses a method resembling data coloring to record the necessary history. Specifically, it exploits the fact that an integer contains enough non-interfering bit fields to construct a compact signature that reflects the page tag, the offset of the last access, and any other necessary information. On each update, the signature discards only the oldest recorded information and appends the newest; in other words, the authors choose an incremental update strategy. This allows SPP not only to learn simple memory access patterns quickly but also to analyze more complex ones.

The original SPP model works well in a CPU, but it is not fully applicable to a GPU. The main reasons are: (1) the minimum scheduling unit of a GPU is a warp (rather than a single thread), and (2) CPUs suit workloads with few but heavy tasks, whereas GPUs suit workloads with a huge number of small tasks. Considering these reasons, and because the prediction granularity in this paper is reduced to the warp level, we make numerous modifications to the original SPP model. The modified model is described below and shown in Fig. 3.

Fig. 3 Overall structure of modified SPP

3.2.1 Signature table

Originally, SPP dynamically maintains two tables at runtime: a signature table and a pattern table. In our design, SPP maintains as many pattern tables as there are warps. The signature table records information about the last 256 visited pages, including the tags of the visited pages, the offset of the last access to each page, and the new signature generated in real time. To keep track of the issuer of each request, i.e., the warp that triggered the access to the current page, we also add a warp_id field to the signature table; warp_id later participates in signature generation. Specifically, the new signature is generated by formulas (1) and (2):

$$\begin{aligned} S' = (S<<3)\;\textrm{XOR}\;\textrm{delta} \end{aligned}$$
(1)
$$\begin{aligned} S'' = (S'\;\textrm{AND}\;\mathrm{0x7FFFFFF})\;\textrm{OR}\;(\mathrm{warp\_id}<<27) \end{aligned}$$
(2)

where S and S″ are the original and new signatures, respectively, and S′ serves only as an intermediate variable. delta makes an important contribution to preserving history because it records the difference between the offsets of two successive accesses to the same page. In addition, we sacrifice some bits to record the unique identifier of the warp. Although this reduces the amount of history stored, the discarded entries are the oldest and have little impact on subsequent prediction, let alone on the workflow of SPP. This trade-off is also more reasonable for GPUs, because GPU threads do not exhibit data-access locality as strong as that of CPU threads.
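As a minimal sketch of the update (the function name and host-side integer types are ours), formulas (1) and (2) amount to:

```cuda
#include <cstdint>

// Signature update following formulas (1) and (2): shift in the new delta,
// keep the lower 27 bits of history, and stamp the warp ID into the upper bits.
uint32_t update_signature(uint32_t sig, uint32_t delta, uint32_t warp_id) {
    uint32_t s1 = (sig << 3) ^ delta;              // formula (1)
    return (s1 & 0x7FFFFFFu) | (warp_id << 27);    // formula (2)
}
```

With the values of the worked example in section 3.2.3 (lower signature bits 0x1, delta 4, warp #02), the lower 27 bits indeed become 0xC.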

3.2.2 Pattern table

A pattern table includes four fields: index, delta, \(C_\textrm{delta}\) (the number of times a delta appears under the current index), and \(C_\textrm{sig}\) (the number of times the current signature appears), where index corresponds to a signature in the signature table. Because of the large number of warps and their complex interactions, we stipulate that the number of pattern tables equals the number of active warps, i.e., each warp has its own pattern table, to reduce the complexity of updating the data. The signature table and the pattern tables cooperate with each other, and their entries are updated synchronously to ensure that the entries in SPP stay current and that the prefetched data are reliable.
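For illustration, the two table layouts can be sketched as follows; the field widths and container types are ours, and the real tables are hardware structures with much narrower fields.

```cuda
#include <cstdint>
#include <vector>

struct SignatureEntry {        // one of the 256 entries in the signature table
    uint32_t page_tag;         // tag of the visited page
    uint32_t last_offset;      // offset of the last access to this page
    uint32_t signature;        // lower 27 bits: history; upper bits: warp_id
    uint32_t warp_id;          // issuer of the request (our added field)
};

struct PatternEntry {          // indexed by signature inside one warp's table
    uint32_t index;            // corresponds to a signature
    int32_t  delta;            // offset difference
    uint32_t c_delta;          // times this delta was seen under this index
    uint32_t c_sig;            // times this signature was seen
};

using PatternTable = std::vector<PatternEntry>;   // one table per active warp
```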

3.2.3 Workflow

In the initial stage, SPP needs some known parameters to decide its next step; it bootstraps from its own output, so once SPP is running, each step of its work is based on the result of the previous step. When the GPU issues an access request, the corresponding page, offset, and other information are recorded, and the signature table is accessed according to the recorded entry. This triggers a data update, so both the signature and the corresponding entry in the signature table are updated using the formulas in section 3.2.1.

The issued request is mapped to a unique entry in the signature table, and the warp_id field is likewise mapped to a unique pattern table. At this point, cache-line prefetching can be performed, as follows (see Fig. 3). After locating the entry in the signature table by the accessed page and offset, SPP increments \(C_\textrm{delta}\) and \(C_\textrm{sig}\) by one, takes the original signature stored in the entry as input to the formulas in section 3.2.1, and obtains a new signature in which the oldest information is discarded. Similarly, after mapping to the pattern table via the original signature, the corresponding counters are also incremented, and the current confidence \(\textrm{CONF}_\textrm{d}\) is calculated using formula (3):

$$\begin{aligned} \textrm{CONF}_\textrm{d} = \frac{C_\textrm{delta}}{C_\textrm{sig}} \end{aligned}$$
(3)

where d is the current prefetch depth, i.e., the number of times the pattern table has been consulted along the current path. During prefetching, the current output is used as input for the next step to obtain further candidates. To keep the prefetching process under control, several thresholds must be set. First, we use formula (4) to calculate the confidence \(P_\textrm{d}\) of the current prefetch path:

$$\begin{aligned} P_\textrm{d} = \alpha _\mathrm{warp\_id} \cdot \textrm{CONF}_\textrm{d} \cdot P_\mathrm{d-1} \end{aligned}$$
(4)

where \(\alpha _\mathrm{warp\_id}\) indicates the prefetch accuracy for the specific warp whose unique identifier is warp_id. Since each warp has its own pattern table, there is no need to use a global accuracy; \(\alpha _\mathrm{warp\_id}\) better reflects the behavior of an individual warp. \(P_\mathrm{d-1}\) is the path confidence of the previous round of prefetching. For the first prefetch (the \(P_0\) phase), \(P_\mathrm{d-1}\) is taken to be 1 to simplify the subsequent calculation.

Looking back at Fig. 3: suppose Warp #02 requests access to Page 4, Offset 5. We locate the entry in the signature table where WarpID equals 02 and Page Tag equals 4, change Last Offset to 5 (the offset the warp wants to access), and obtain a delta of 4. If the lower 27-bit part of the signature was originally 0x1, it becomes 0xC. After locating the entry in the pattern table where index equals 0x1 and delta equals 4, \(C_\textrm{delta}\) and \(C_\textrm{sig}\) are incremented by one, and the confidence on the current path is updated from 4/6 to 5/7. Through sufficient iterations, prefetch-ready cache lines are eventually produced and output for further use.

It is worth mentioning that SPP's prefetching cannot be limited merely by the subsequent PPF filtering. To avoid unbounded updates of \(P_\textrm{d}\), we set two thresholds: \(T_\textrm{p}\) (a lower bound on \(P_\textrm{d}\)) and \(T_\textrm{f}\) (an upper bound on the number of prefetched cache lines). Our experiments show that \(P_\textrm{d}\) decays faster than in the original CPU-oriented design, and that under the current structure the number of cache lines prefetched for the vast majority of warps is small: statistically, it does not exceed 20 lines per invocation of SPP, where each invocation involves several rounds. Therefore, setting \(T_\textrm{p}\) to 0.65 is reasonable. To avoid comparing floating-point numbers, however, we again use the magnifying-coefficient method to map the coefficients to the range 0 to 100, so \(T_\textrm{p}\) is set to 65. Moreover, because the prefetch count stays small, \(T_\textrm{f}\) rarely needs to restrict the size of the candidate set passed to PPF, which indirectly reduces PPF's filtering time and improves overall GPGPU performance.
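The confidence walk can be sketched as follows, using the integer scaling described above; the loop structure, names, and the concrete \(T_\textrm{f}\) value of 20 are our own assumptions based on the statistics quoted in this section.

```cuda
#include <cstdint>
#include <utility>
#include <vector>

// One entry per accepted prefetch: the proposed line address and the scaled
// path confidence at that depth.
struct Candidate { uint64_t line_addr; int path_conf; };

std::vector<Candidate> walk_prefetch_path(
        int alpha_warp,                                      // per-warp accuracy, scaled to 0..100
        const std::vector<std::pair<uint64_t, int>> &steps,  // (line address, CONF_d scaled to 0..100) per depth
        int T_p = 65, int T_f = 20) {
    std::vector<Candidate> out;
    int P = 100;                                   // P_0 treated as 1 (scaled to 100)
    for (const auto &s : steps) {
        // Formula (4): P_d = alpha_warp_id * CONF_d * P_{d-1}, kept in 0..100
        // with integer arithmetic, mirroring the scaled comparison in the text.
        P = (alpha_warp * s.second / 100) * P / 100;
        if (P < T_p || (int)out.size() >= T_f)     // stop below T_p or beyond T_f lines
            break;
        out.push_back({s.first, P});
    }
    return out;
}
```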

3.3 Modified PPF model

The perceptron is a lightweight learning mechanism that can gather different forms of information and make comprehensive decisions based on the information collected. PPF considers many features related to prefetching, such as prefetch depth, page tag, and offset, which are later fed to PPF as input. The underlying prefetcher (i.e., the modified SPP model of Section 3.2) is then enhanced by filtering invalid prefetches out of the proposed cache-line set [20].

The architecture of PPF in our experiment is shown in Fig. 4. PPF can be implemented as a set of vectors in which every entry is weighted. PPF retains n features related to prefetching, which is why n different weight tables are required. These weights are adjusted in real time according to feedback, and each feature is used to index into a different table.

Fig. 4 Overall structure of modified PPF

Once SPP is triggered, its proposed candidate group of cache lines (i.e., the total output of SPP) is fed to PPF, and PPF decides which addresses in the group appear valid and should be recommended for prefetching; this judgment is made by the filter module in PPF. After obtaining the features of the proposed cache lines, the filter module indexes the corresponding weight tables with these features and adds up the corresponding weights. The sum of these weights, \(\zeta \), is the confidence of the current prefetch proposal.

Similar to SPP, we also need to set thresholds for PPF: \(\tau _\textrm{hi}\) and \(\tau _\textrm{lo}\). Our model determines whether the current cache-line address is worth prefetching by comparing the sum \(\zeta \) with these thresholds. Specifically, we use Algorithm 1 to make the judgment.

Algorithm 1

According to the algorithm, the modified PPF dynamically maintains three types of tables at runtime: a reject table, a candidate table, and a prefetch table, which record the filtered-out entries, candidate entries (entries that do not strictly meet the prefetch requirements but are not bad enough to be rejected outright), and entries to prefetch, respectively. If the algorithm determines that a prefetched cache line is invalid, the corresponding cache block may be evicted, but only when there is not enough space left in the L2 cache.

When a cache block is evicted, PPF tries to locate the corresponding address in the prefetch table. If the entry exists, PPF mispredicted. The weight tables are therefore re-indexed according to the features of that prefetch request, and the weights are adjusted accordingly. Likewise, when a cache block is demanded, the reject table is accessed; its entries are checked before the next set of cache lines to prefetch is generated. If a hit occurs, the corresponding cache block must have been proposed by SPP but incorrectly rejected by PPF. This enables PPF to learn from the mistake and adjust the weights accordingly, further optimizing the prefetching mechanism. In addition, if the number of SPP prefetches is too small to reach the minimum desired number, PPF tries to obtain additional cache lines from the candidate table. If the number of prefetched entries still falls short, no further attempt is made, so as to preserve the quality of the prefetched cache lines, which is reflected in the statistics as the cache hit rate (i.e., one minus the cache miss rate).
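Algorithm 1 itself is not reproduced here; the sketch below is only consistent with the description above, and the feature set, table size, training step, and threshold values are assumptions.

```cuda
#include <cstdint>

// Perceptron-style filter sketch: sum the weights selected by each feature,
// compare against the two thresholds, and adjust the weights on feedback.
struct PPF {
    static const int N_FEATURES = 4;               // e.g., depth, page tag, offset, warp_id (assumed)
    static const int TABLE_SIZE = 1024;            // entries per weight table (assumed)
    int weights[N_FEATURES][TABLE_SIZE] = {};
    int tau_hi = 8, tau_lo = -8;                   // illustrative thresholds

    int sum(const uint32_t feat[N_FEATURES]) const {
        int zeta = 0;
        for (int f = 0; f < N_FEATURES; ++f)
            zeta += weights[f][feat[f] % TABLE_SIZE];
        return zeta;                               // confidence of this proposal
    }

    // Returns +1 (prefetch table), 0 (candidate table), or -1 (reject table).
    int classify(const uint32_t feat[N_FEATURES]) const {
        int zeta = sum(feat);
        if (zeta >= tau_hi) return +1;
        if (zeta <= tau_lo) return -1;
        return 0;
    }

    // Feedback training: lower the weights of a prefetch that was evicted
    // unused, raise them for a line that was demanded after being rejected.
    void train(const uint32_t feat[N_FEATURES], bool was_useful) {
        for (int f = 0; f < N_FEATURES; ++f)
            weights[f][feat[f] % TABLE_SIZE] += was_useful ? 1 : -1;
    }
};
```

In this sketch, the three outcomes of classify() map to the prefetch, candidate, and reject tables, and train() is called with was_useful = false when an unused prefetch is evicted and with was_useful = true when a previously rejected line is later demanded.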

3.4 The latency tolerance of a warp

To maximize throughput, the memory schedulers used in modern GPUs are usually specially designed, but by default they assume that all requests from different cores have the same priority, i.e., that the requests are equally important to the scheduler. This is not always the case. The internal state of each warp can be quite complicated: some threads are almost done and simply waiting to retire; some threads are still busy computing but have little impact on other threads; some threads have not finished their computation and the data they need may not be in the cache, so a miss forces them to issue an extra memory request; and some threads have completed their task but are waiting for other threads to finish [30]. Obviously, threads in different stages should not share the same priority.

According to our observations on several ISPASS 2009 benchmarks, each benchmark produces different types of warps, as expected. Therefore, the warps must be classified and rearranged according to their priorities so that the scheduler can schedule them more reasonably and cooperate with the multi-level feedback queue strategy. GPU cores with a larger fraction of warps waiting for data to come back from DRAM are less able to tolerate the latency of an outstanding memory request. In [30], the authors propose the concept of the latency tolerance of a GPU core: if a warp will not issue a memory request in the near future, either because it is executing computation or because the data it needs already resides in the cache, the warp is considered low-latency, or in other words, it has a high latency tolerance. The proportion of low-latency warps among all warps on a core is the latency tolerance of that core. To optimize the scheduling process and facilitate the subsequent prefetching process, we refine this notion to the warp level and take the proportion of non-blocked threads within a warp as its real-time latency tolerance \(\sigma \):

$$\begin{aligned} \sigma = \frac{\mathrm{Nonblocked\;Threads}}{\mathrm{Threads\;Count\;within\;a\;Warp}} = 1 - \frac{\mathrm{Blocked\;Threads}}{32} \end{aligned}$$
(5)

A warp with a higher latency tolerance has a lower scheduling priority: warps with low latency tolerance require an immediate response, whereas warps with high latency tolerance can, to some extent, let other warps jump the queue and compete for scheduling opportunities.
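A sketch of this computation over the per-thread flags we add to a warp's metadata (the flag array and the scaling constant below are our own abstraction; the scaling anticipates the magnifying coefficient discussed in the next subsection) is:

```cuda
// Latency tolerance of a warp as in formula (5): the fraction of threads
// that are not blocked on an outstanding memory request, scaled to an
// integer so that later comparisons avoid floating point.
const int WARP_SIZE = 32;
const int SCALE = 100;          // magnifying coefficient applied before enqueueing

int latency_tolerance(const bool blocked[WARP_SIZE]) {
    int blocked_cnt = 0;
    for (int t = 0; t < WARP_SIZE; ++t)
        if (blocked[t]) ++blocked_cnt;
    // sigma = 1 - blocked/32, mapped to an integer in [0, 100].
    return SCALE * (WARP_SIZE - blocked_cnt) / WARP_SIZE;
}
```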

3.5 Multi-level feedback queue

With the concept of warp latency tolerance in place, we proceed to build a multi-level feedback queue (MFQ). As mentioned above, before any warp is scheduled, the latency tolerance of each warp must be determined. We divide the warps into several groups according to their tolerance; the lower the tolerance, the higher the criticality. After grouping, the warps of each group are inserted into the corresponding queue according to the average criticality of the group. The multiple queues storing warps of different criticality constitute a dynamically adjusted MFQ, as shown in Fig. 5.

In addition, we regularly reorder the requests within a queue before each level is actually scheduled, to make the scheduling process more reasonable. Requests with the same warp_id are placed together so that they can be processed consecutively, and the cache access delay induced by warps is reduced by mitigating interference between requests from different warps. Both measures help improve the overall performance of the GPU.

Fig. 5 Multi-level feedback queue

Although the workload of each core on a GPU is not particularly heavy, there is still no guarantee that a subsequent warp will not be delayed too long by the current one; in other words, the actual execution time of earlier or later jobs is unknown in advance. This is common because a warp's structure contains no flag indicating its progress, there is no easy way to add one, and it is almost impossible to predict where a program will exit from the currently executed instruction alone. For these reasons, the MFQ is organized as a set of queues arranged from top to bottom by priority. The top queue stores the most critical warps (those with the lowest latency tolerance) and the bottom queue stores the least critical ones. Each time a new warp is added to the MFQ, it is inserted at the end of the queue corresponding to the priority derived from its latency tolerance.

Doing so, however, raises a problem: high-priority warps may keep entering the upper levels of the MFQ, making it a long shot for low-priority warps to be scheduled. To restrain this phenomenon, we additionally record the number of cycles each warp has been waiting. As soon as this number exceeds a certain threshold, the warp is automatically promoted to the next higher priority. This active pursuit of a scheduling opportunity prevents starvation and ensures that warps of the same level are treated fairly. The scheduling process in the MFQ is shown in Algorithm 2.

Algorithm 2

The algorithm shows that we distinguish read requests from write requests, which naturally prioritizes reads over writes. However, when the top-level queue contains no warp requests whose priority was raised due to a timeout, the current write requests may also get the chance to be scheduled first. In addition, we multiply the original latency tolerance calculated by the formula in the previous section by a magnifying coefficient to convert decimals into integers, avoiding large errors from precision loss when comparing against the thresholds.
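Algorithm 2 itself is not reproduced here. The sketch below captures the queue structure, tolerance-based insertion, timeout promotion, and read-first selection described above; the per-level tolerance bounds and the timeout value are placeholders for the actual numbers in Table 2, and the special rule for the top-level queue (section 4.4) is omitted.

```cuda
#include <cstddef>
#include <deque>

// One request handed to the MFQ; tolerance is already scaled to [0, 100).
struct WarpReq { int warp_id; int tolerance; bool is_write; long enq_cycle; };

struct MFQ {
    static const int LEVELS = 4;
    std::deque<WarpReq> q[LEVELS];                 // q[3]: most critical warps
    int tol_upper[LEVELS] = {100, 75, 50, 25};     // placeholder per-level bounds
    long timeout_cycles = 1000;                    // placeholder timeout

    void enqueue(const WarpReq &w) {               // lower tolerance -> higher level
        for (int l = LEVELS - 1; l >= 0; --l)
            if (w.tolerance < tol_upper[l]) { q[l].push_back(w); return; }
    }

    // Promote warps that have waited longer than the timeout to the next level.
    void age(long now) {
        for (int l = 0; l < LEVELS - 1; ++l)
            for (std::size_t i = 0; i < q[l].size(); )
                if (now - q[l][i].enq_cycle >= timeout_cycles) {
                    q[l + 1].push_back(q[l][i]);
                    q[l].erase(q[l].begin() + i);
                } else {
                    ++i;
                }
    }

    // Pop from the highest-priority non-empty queue, preferring read requests.
    bool schedule(WarpReq &out) {
        for (int l = LEVELS - 1; l >= 0; --l) {
            for (std::size_t i = 0; i < q[l].size(); ++i)
                if (!q[l][i].is_write) {
                    out = q[l][i]; q[l].erase(q[l].begin() + i); return true;
                }
            if (!q[l].empty()) { out = q[l].front(); q[l].pop_front(); return true; }
        }
        return false;
    }
};
```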

4 Experiment result & analysis

4.1 Experiment environment

In order to evaluate our proposed warp scheduling strategy, we conduct experiments using GPGPU-Sim and 14 benchmarks: seven from ISPASS 2009, six from Rodinia 3.1 [31, 32], and the customized HS benchmark described in section 3.1. GPGPU-Sim is a cycle-level simulator that models contemporary GPUs running GPU computing workloads written in CUDA or OpenCL. The simulator was developed by Tor Aamodt's research group at the University of British Columbia and has been widely used by GPU researchers [21, 33].

The experimental setup and configuration are detailed in Table 1. We selected these benchmarks to provide a diverse set of workloads that represent a range of real-world applications. By conducting experiments with these benchmarks, we are able to evaluate the performance and effectiveness of our proposed warp scheduling strategy under different workload conditions.

Table 1 Research configuration

FR-FCFS (first ready, first come first served) is the original scheduling algorithm of GPGPU-Sim. In our experiments, WSMP can be turned on or off individually, so the environment reverts to the baseline when it is turned off. SPP and PPF within WSMP can also be turned on or off together. There are therefore three experimental groups: the baseline, WSMP with only the warp scheduling policy changed (latency tolerance + MFQ, denoted Partial), and WSMP with all functions enabled (latency tolerance + MFQ + SPP + PPF, denoted WSMP). In addition, the Rodinia benchmarks do not have official abbreviations, so for clarity: CFD, LUD, PF, LMD2, HyS, and NW stand for CFD Solver, LU Decomposition, PathFinder, LavaMD2, Hybrid Sort, and Needleman–Wunsch, respectively.

4.2 IPC

A widely used metric for application performance is IPC (instructions per cycle). IPC reflects the instruction throughput within a given period and is a good indicator of an application's computational intensity. Note, however, that IPC represents instruction throughput only over a period of time and does not represent the running time of a program. For example, for an I/O-bound program, the IPC of each SM may remain consistently below 1.0, because such programs are limited by input/output operations rather than by the processing speed of the GPU. IPC is calculated as:

$$\begin{aligned} \textrm{IPC} = \frac{\textrm{InstsExecuted}}{\textrm{ActiveCycles}} \end{aligned}$$
(6)

where InstsExecuted is the number of warp instructions (not thread instructions) retired by an SM, and ActiveCycles is the number of SM cycles during which the SM has at least one active warp. A related quantity, ElapsedCycles, refers to the number of SM cycles during the performance-monitoring (PM) collection period.

To quantify the effect of our design, we compare Partial and WSMP with the baseline. Figure 6 shows the IPC improvement relative to the baseline, which is normalized to 1.0. We also compare our work with [34] (denoted PWS) for further confirmation.

Fig. 6 IPC comparison for all benchmarks

The figure shows that the average IPC improvement across all benchmarks is 21.88% for Partial and 26.45% for WSMP. As a general rule, the higher the IPC, the better. Even though WSMP is not always the best strategy, it outperforms Partial and PWS most of the time, indicating that it is a good and practical strategy and that the performance of the GPU has been improved. Furthermore, given the variety of our benchmarks, it is reasonable to expect WSMP to improve other GPGPU workloads as well.

The WSMP strategy proposed in this research has a significant impact on IPC, especially through cache prefetching. Although the modifications to SPP and PPF eliminate modules designed only for CPUs, and the active warps are evenly distributed among temporary vectors, the additional computation and extra GPU cycles cannot be ignored. Despite this, WSMP still improves IPC markedly and performs better than the earlier strategy, PWS, on most benchmarks. We therefore conclude that WSMP is a promising strategy for enhancing the performance of GPU workloads.

4.3 L2 Cache miss rate

As in CPUs, registers are the fastest memory. The L1 cache and shared memory rank second, and they are quite limited in size. The L1 cache is private, while the L2 cache is shared (globally public). The L2 cache caches local and global memory and reduces memory bandwidth usage and power consumption to hide the cost of accessing memory. Main memory is typically much slower than the L2 cache, so the L2 cache can improve performance considerably in many applications [26, 35]. In other words, efficient L2 cache management can boost GPGPU performance [36], which is why we focus mainly on the behavior of the L2 cache. More specifically, we use the L2 cache miss rate to reflect how well it is managed. We also compare our work with [37] (denoted MSC) for further confirmation. Figure 7 shows the L2 cache miss rates for all the benchmarks mentioned above.

Fig. 7 L2 cache miss rate for all benchmarks

We do not include Partial in the figure because it disables the cache management module, which makes the comparison between Partial and the baseline meaningless. As can be seen from the figure, WSMP reduces the cache miss rate by 9.54% on average. The miss rates of most benchmarks become much lower (e.g., BFS and STO), while others change little, indicating that they may be computation-bound. As for MSC, it outperforms WSMP only on a few benchmarks such as NN and LMD2; otherwise, WSMP performs considerably better. The overall L2 cache miss rate still declines, meaning that the modified combination of SPP and PPF adapts well to the GPU memory access pattern, and it is reasonable to expect WSMP to work for other workloads as well.

4.4 MFQ-related targets

As mentioned in section 3.5, warps are sorted by the MFQ and then scheduled level by level, replacing the original sequential scheduling algorithm (FR-FCFS). To assess whether the MFQ is reasonably designed, we count how many warps are waiting for scheduling in queues of each level while each benchmark runs. The integer thresholds for the MFQ (the product of the latency tolerance and the magnifying coefficient, in the range from 0 inclusive to 100 exclusive) are shown in Table 2.

Table 2 Thresholds for MFQ

Here, Lower Tolerance, Upper Tolerance, and Timeout Cycle refer to the lower tolerance bound of a warp (\(\tau _{lo}\)), the upper tolerance bound of a warp (\(\tau _{hi}\)), and the number of cycles after which a timeout and promotion are performed. When a warp in the L0, L1, or L2 queue times out, it is automatically transferred to the back of the queue with the next higher priority, while a warp in the L3 queue is transferred to the back of the L2 queue after a timeout. Because of these timeouts, we collect statistics under the following rule: each warp contributes to the cumulative count of the queue in which it is eventually scheduled. For example, suppose the L2 and L3 queues currently contain 35 and 15 warps, respectively. Because of lengthy waiting, 8 warps in the L2 queue are marked as timed out, leaving 27 and 23 warps in the L2 and L3 queues, respectively. The warps in L3 and then L2 are scheduled in turn, and the final contributions of the L2 and L3 queues to the cumulative count are 27 and 23. Figure 8 shows the proportion of scheduled warps in queues of each level for every benchmark (banks are not distinguished here). Figure 9 shows the timeout rate of the warps in each queue for every benchmark.

Fig. 8 Warp dispersal percentage

Fig. 9 Timeout rate for each level of queue

We can see from Fig. 8 that, for almost all benchmarks, prio-0 warps account for more than half of the warps while prio-3 warps are only a tiny fraction, which is rather self-explanatory: urgent requests do not turn up very often.

As for Fig. 9, timeouts occur at every queue level, yet no timeout rate exceeds 15.00% at any level. The timeout rate is an important indicator for the MFQ: higher rates mean more enqueueing and dequeueing, which are cycle-consuming operations, whereas lower rates indicate that warps are being scheduled relatively smoothly and quickly. Keeping every timeout rate at or below 15.00% is the best result we achieved after extensive experiments, and the workloads met our expectations in terms of GPU performance.

We then instrument the MFQ to learn more about its runtime behavior. Specifically, we track how many times each warp is transferred to another queue. Figure 10 shows, for each benchmark, the percentage of warps that are transferred fewer than 3 times versus at least 3 times. Note that the percentages concern only warps that are transferred to another queue at least once; warps that never leave their original queues are excluded.

Fig. 10 Warp transfer percentage

Clearly, almost every warp is scheduled after no more than two transfers, which benefits the smooth operation of the GPU. This sheds light on the working behavior of GPGPU workloads and on the fact that most warps that emerge during execution are not urgent. As a result, there is little need to go to great lengths to reduce the number of transfers for the warps that account for only 6.9%.

4.5 Overhead

In our design, we mainly add two modules compared with the original GPGPU-Sim architecture:

(1) Warp scheduling strategy, namely the combination of latency tolerance and the MFQ. We replace the simulator's original FR-FCFS scheduling strategy and integrate WSMP for warp scheduling. To calculate the latency tolerance, we modify the structure that stores a warp's metadata and add memory-access and computation flags for the threads in a warp, which makes it easy to compute the proportion of non-blocked threads. Before any warp is inserted into the MFQ, this proportion is multiplied by the amplification factor to avoid the trouble that floating-point computation may bring. We set up four queues of four levels in the MFQ. The scheduler inserts a warp at the end of a queue according to its latency tolerance and schedules the four queues from the highest level to the lowest. Although each bank must be equipped with an MFQ, a real program does not use more than 16 banks (#0 to #15) at runtime (as can be deduced from the output statistics), so no more than 16 MFQs are constructed; in fact, the benchmarks rarely use more than 6 banks according to our results. It is also worth mentioning that each warp stored in an MFQ queue is only a pointer to the actual task, so there is no extra copy of the underlying object. In addition, the coalesce-and-sort operation is triggered only when the L3 queue reaches its maximum capacity or at least one warp times out. Therefore, it causes little additional overhead and has little impact on the system given the low probability of this event; it certainly does not affect IPC on a large scale, let alone the L2 cache hit rate.

(2) GPU cache management, namely the combination of the modified SPP and PPF, which is mainly used for the shared L2 cache. We implement and integrate the GPU cache management module to reach a balance between cache hit rate and IPC. To record the runtime information of warps in detail and prefetch as many reasonable cache blocks as possible, we change the structures of many components in SPP and PPF, and expand (and privatize) the global tables into many per-warp tables, so that the number of signature tables, pattern tables, and weight tables equals the number of warps. The additional overhead of this cache management therefore consists mainly of the lookups and maintenance of these tables. According to the output results, they occupy only kilobytes of memory; coupled with their reusability, these collateral costs can be safely ignored. In addition, the results do not show any serious performance degradation caused by the cache management strategy, meaning it is fully applicable to the GPU platform.

Besides, we collect the actual simulation rate (also known as instructions per second, IPS). The results show that the IPS of some benchmarks decreases slightly, while the IPS of other benchmarks increases slightly. That being said, the overall change is extremely small, indicating that the additional overhead does not have much impact on performance.

5 Future work

The main goal of this research is to improve GPU performance for general GPU programs. In section 4.5 we showed that the cost of the new strategy is relatively small, though it still relies on a large number of vectors to record information about each warp, which increases the computational burden to some extent. Theoretically, a feasible improvement would be to group the warps issued by each benchmark so that multiple warps can share vectors, reducing the overhead of memory and related operations.

6 Conclusion

We make multiple modifications to the original SPP and PPF models so that they adapt to GPU platforms and benchmarks, which lets us manage the L2 cache better and improve the cache hit rate. In addition, by referencing the latency tolerance of each warp, the scheduling sequence of all warps is handed over to the multi-level feedback queue. Experimental results show that IPC increases by 26.45% on average, the L2 cache miss rate decreases by 9.54% on average, and the simulation rate does not decrease significantly, indicating that the proposed strategy is feasible for GPU applications.