1 Introduction

The hardware data prefetcher predicts the memory access patterns of an application and fetches useful cache lines from deeper levels of the memory hierarchy ahead of their first demand reference [1,2,3,4,5,6]. Existing prefetchers make predictions based on the spatial or temporal locality of program memory accesses, and their effectiveness is limited to specific memory access patterns. Spatial prefetchers predict fixed offsets, multiple offsets, or streams within a spatial region [7,8,9,10,11]. For example, VLDP [9], SPP [4], and the IP complex stride prefetcher included in IPCP [11] focus on deltas between accessed addresses to predict the next address delta. However, spatial prefetchers cannot predict the first cache miss in each region and are limited to a fixed region size.

Temporal prefetchers are capable of predicting intricate memory access patterns by anticipating recurring miss sequences [3, 12,13,14]. For example, the Irregular Stream Buffer (ISB) [12] and Domino [3] track the temporal order of memory accesses. However, prefetchers that predict by recording pairs of correlated addresses incur substantial storage overhead, typically hundreds of KBs, and they cannot predict compulsory misses. Recently, managed ISB (MISB) [13] and Triage [14] have optimized this hardware overhead without compromising coverage. When dealing with complex programs that exhibit multiple memory access patterns, or with mixed workloads whose access patterns differ, the aforementioned prefetchers alone cannot cover all of a program's accesses. Moreover, the prefetch degree is a key parameter of a data prefetcher, indicating how many prefetch requests are generated. Specifically, it denotes the number of cache lines or memory blocks that the prefetcher is instructed to proactively fetch ahead of the actual demand. A higher prefetch degree implies fetching more data ahead of demand, which can improve performance by reducing cache misses at the cost of increased memory bandwidth usage. Dynamically adjusting the prefetch degree therefore has varying impacts on program performance and main memory bandwidth during program execution [15,16,17].
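To make the notion of prefetch degree concrete, the following Python sketch (an illustration of ours, not tied to any particular prefetcher) expands a single predicted delta into `degree` prefetch candidates:

```python
def issue_prefetches(trigger_addr: int, delta: int, degree: int,
                     block_size: int = 64) -> list:
    """Expand one predicted line delta into `degree` prefetch addresses.

    A higher degree fetches further ahead of the demand stream, trading
    extra memory bandwidth for potentially fewer cache misses.
    """
    line = trigger_addr // block_size              # cache line of the demand access
    return [(line + i * delta) * block_size        # i lines ahead along the stride
            for i in range(1, degree + 1)]

# A +1-line stream starting at 0x1000: degree 2 vs. degree 4.
print([hex(a) for a in issue_prefetches(0x1000, delta=1, degree=2)])
print([hex(a) for a in issue_prefetches(0x1000, delta=1, degree=4)])
```

The degree-4 variant covers more of the stream ahead of demand, but spends proportionally more bandwidth whenever the prediction is wrong.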

Integrating multiple prefetchers in a processor can compensate for the prediction limitations of individual prefetchers, cover different memory access patterns of programs, and combine the performance benefits of multiple prefetchers. However, multiple prefetchers compete for limited shared resources, such as the LLC and main memory bandwidth [18,19,20,21]. Without efficient control of multiple prefetchers, the system performance of certain programs may degrade. Existing prefetch control methods suffer from inherent limitations, such as offline model training and insufficient adaptation to dynamic program behavior and system configurations [15, 16]. Prefetch control methods based on the classification of memory access patterns require offline model training and assume that memory access patterns are consistent between offline training and online operation. When the observed access patterns differ from those seen in training, the accuracy of such methods drops, which degrades system performance. Prefetch control methods based on performance feedback compare the performance of different prefetchers during program execution and select the optimal one for the current program phase from a set of prefetchers. For example, the Sandbox Prefetcher [22], which uses Bloom filters to evaluate the accuracy of prefetchers, selects the prefetcher with the highest prefetching accuracy from a set of candidate fixed-offset prefetchers. However, it suffers from hysteresis: for interleaved access patterns, a sub-optimal prefetcher can degrade prefetch performance before the optimal prefetcher is selected.

The objective of our work is to design a prefetch control framework that can (1) dynamically select the prefetcher to be activated from multiple options based on online training; (2) adaptively adjust the prefetch degree of the selected prefetcher; and (3) adjust prefetchers after each cache access to alleviate the hysteresis of prefetch control.

In this paper, we propose RL-CoPref, a reinforcement learning-based prefetch control framework that effectively manages prefetch activation and adjusts prefetch degree in response to changes in cache and main memory bandwidth. To achieve this, RL-CoPref utilizes tile coding [23] to discretize the continuous state space, enhancing the learning and decision-making processes in the RL framework. Tile coding is employed to structure the state space, particularly for program features such as program counters and cacheline addresses, which are inherently continuous. After each demand request, RL-CoPref extracts a set of program features from the current program context information, discretizes them using tile coding, and employs them as state features in the RL process. RL-CoPref then selects the best prefetching action based on historical learning experience and evaluates the effectiveness of each action by considering prefetch accuracy and main memory bandwidth utilization. The reward value received by RL-CoPref after each action is calculated based on prefetch hits/misses and the current usage of main memory bandwidth. Through this RL approach, RL-CoPref learns to improve prefetching performance over time, facilitating the more efficient use of system resources.

This paper introduces RL-CoPref, an RL-based coordinated prefetching controller designed to integrate the performance improvements of multiple prefetchers under complex and mixed memory access patterns. The key contributions of RL-CoPref are outlined as follows.

  • We present a novel RL-based prefetcher control framework that leverages RL advantages to adaptively learn, optimize, and adjust both the activation and prefetch degree of prefetchers. This dynamic adaptation aims to enhance overall program performance.

  • To address the diverse landscape of processor architectures and workload types, RL-CoPref incorporates multiple program features as feature vectors. We meticulously design appropriate reward functions to incentivize prefetch controllers, ensuring an effective balance between performance improvement and DRAM bandwidth consumption.

  • RL-CoPref is engineered as a lightweight and low-overhead prefetch controller. It adopts a simple table-based approach to store learned knowledge, minimizing complex computations and reducing storage overheads. Experimental results affirm the efficacy of our prefetch control framework across various workloads and processor architectures, highlighting its low hardware overhead.

2 Related work

2.1 Hardware prefetchers

Prefetchers can be classified into different types based on their prefetching strategies, of which the most commonly used are temporal prefetchers and spatial prefetchers [24,25,26,27,28].

Temporal prefetchers [3, 13, 29] can predict memory access patterns based on temporal locality, especially recurring access patterns that arise when iterating over data structures, so as to reduce data access latency. MISB [13] utilizes the access patterns of irregular data structures, such as linked lists, trees, and graphs, to predict future data accesses and fetch data into the cache. The Domino [3] temporal prefetcher finds a matching item for prefetching by looking up the two most recent miss addresses. Although temporal address-correlating prefetchers can predict recurring sequences of misses in data structures, their prefetch degree is limited because large-scale data traversals do not repeat perfectly.

Unlike temporal prefetchers, spatial prefetchers [7, 24, 30, 31] are more proficient in predicting repetitive spatial layouts over contiguous regions of memory. The BOP prefetcher [24] uses a set of predefined offsets derived from historical cache access patterns and program code analysis. When the program accesses a particular address, the hardware prefetcher attempts to prefetch data at the predefined offset that best matches that address. In contrast, the PC-based stride prefetcher [7] uses the program counter to track the execution stride of programs and predict the data to prefetch. However, spatial prefetchers struggle to predict irregular access patterns, and the region size of a spatial prefetcher is typically fixed, which means that a large dataset may require multiple prefetch passes to be fully covered.
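As a minimal sketch of the PC-based stride idea (the table organization, confidence threshold, and degree below are illustrative, not the exact design of [7]):

```python
class StrideEntry:
    """Per-PC tracking state: last address, last stride, and a confidence counter."""
    def __init__(self):
        self.last_addr = None
        self.stride = 0
        self.confidence = 0

class PCStridePrefetcher:
    """Issue prefetches once the same stride has repeated for a given PC."""
    def __init__(self, degree=2, threshold=2):
        self.table = {}                      # PC -> StrideEntry (a real design bounds this table)
        self.degree = degree
        self.threshold = threshold

    def access(self, pc, addr):
        entry = self.table.setdefault(pc, StrideEntry())
        prefetches = []
        if entry.last_addr is not None:
            stride = addr - entry.last_addr
            if stride != 0 and stride == entry.stride:
                entry.confidence += 1
            else:
                entry.stride, entry.confidence = stride, 0
            if entry.confidence >= self.threshold:
                prefetches = [addr + entry.stride * i
                              for i in range(1, self.degree + 1)]
        entry.last_addr = addr
        return prefetches
```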

In addition to temporal prefetchers and spatial prefetchers, there are other prefetching strategies, such as instruction-based prefetching [32] and hybrid prefetching [33]. While these prefetching strategies have their own advantages, they also exhibit limitations in specific usage scenarios. For instance, temporal prefetchers may experience performance degradation due to their inability to adapt to certain program access patterns. Spatial prefetchers, when dealing with multithreaded programs or programs with numerous branches, may encounter inaccuracies in prefetching, leading to bandwidth wastage. Therefore, prefetch control strategies become crucial in addressing these issues. By comprehensively considering the strengths and limitations of different prefetchers, control strategies can more effectively select and configure prefetchers to meet specific workload and system requirements.

2.2 Prefetching control policy

The prefetching control policy [16, 17, 22, 34,35,36,37,38] aims to improve system performance by exploiting the performance advantages of multiple prefetchers. In recent years, various prefetching controllers based on different control policies have been proposed. Some rule-based controllers use specific prefetching rules to control the prefetchers. BAPC [17] is a bandwidth-aware dynamic prefetching controller that adjusts prefetching behavior based on the memory bandwidth utilization of the system. SPAC [16] cooperatively controls the aggressiveness of multiple prefetchers and throttles them based on the improvement of fair-speedup in multi-core systems. Sandbox Prefetching [22] is a method for determining the appropriate prefetching strategy at runtime. It evaluates simple and aggressive offset prefetchers by adding each prefetch address to a Bloom filter instead of actually prefetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to determine whether the evaluated aggressive prefetcher would have accurately prefetched the data, while also detecting the presence of prefetchable streams. Real prefetching is performed only when an evaluated prefetcher surpasses a certain accuracy threshold. The rule-based prefetching control methods, however, suffer from limitations such as reliance on predefined rules, inadequate adaptability to unknown access patterns, and susceptibility to hardware design constraints.
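The sandbox evaluation can be sketched as follows; a plain Python set stands in for the Bloom filter of [22], and the evaluation period and accuracy threshold are illustrative:

```python
class SandboxEvaluator:
    """Score one candidate offset prefetcher without issuing real prefetches."""
    def __init__(self, offset, period=256, threshold=0.5):
        self.offset = offset
        self.sandbox = set()                 # pseudo-prefetched line addresses
        self.hits = 0
        self.accesses = 0
        self.period = period
        self.threshold = threshold

    def access(self, line_addr):
        # A demand access that finds its line in the sandbox counts as a hit
        # for the candidate prefetcher under evaluation.
        if line_addr in self.sandbox:
            self.hits += 1
        self.sandbox.add(line_addr + self.offset)   # pseudo-prefetch, no cache fill
        self.accesses += 1

    def accurate_enough(self):
        # Real prefetching is enabled only after the accuracy threshold is met.
        return self.accesses >= self.period and self.hits / self.accesses >= self.threshold
```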

Although there are various prefetching controllers to optimize the performance of the prefetchers, most of these controllers only focus on controlling a single prefetcher without considering how to control the joint behavior of multiple prefetchers. However, by effectively controlling multiple prefetchers, programs can better predict necessary data and pre-read it into the processor, thereby avoiding execution delays caused by data waiting. By optimizing the control of multiple prefetchers, the performance advantages of multiple prefetchers can be better utilized, and the utilization of the cache can be improved, thereby improving system performance.

2.3 Reinforcement learning in computer architecture

Recently, machine learning-based algorithms have been proposed for several micro-architectural tasks, including cache management [39,40,41,42], branch prediction [43, 44], and hardware prefetching [45,46,47,48,49,50,51]. Despite the impressive results achieved by these methods on memory access prediction and branch jumps, they suffer from two significant limitations. Firstly, the models employed are often too large to fit within the LLC of typical processors, which limits their applications in practice. Secondly, these models require substantial computation during inference, resulting in significant delays that exceed the admissible delays of prefetchers or branch predictors.

In contrast, RL does not require prior knowledge or data to determine the optimal action. Rather, it acquires this ability through interactions with the environment. This advantage makes RL promising to address computer architecture problems, particularly when significant prior knowledge or data is not available, and the optimal actions can only be learned dynamically [52,53,54,55].

In recent years, researchers have proposed RL-based techniques to address various challenges in computer architecture, such as DRAM scheduling and memory prefetching. In particular, [56] presents a preliminary memory scheduler that leverages RL to handle the complex problem of DRAM scheduling. Meanwhile, [30] introduces Pythia, a customizable prefetching framework that formulates prefetching as an RL problem. Pythia is designed to be highly efficient, which consumes minimal power and area in each core, and its overheads in terms of die area and power consumption are negligible compared to those of commercial processors.

3 Preliminaries

In this section, we discuss the challenges of controlling multiple prefetchers in modern computer systems and introduce RL as a potential solution to these challenges. Specifically, we provide an overview of reinforcement learning and its applicability to prefetching control.

3.1 Challenges in controlling multiple prefetchers

Modern processors employ multiple prefetchers that use different prediction strategies to predict and prefetch data. However, coordinating the activation and degree of multiple prefetchers to maximize overall performance presents a significant challenge. This is because different prefetchers may exhibit varying prediction accuracies, and the optimal prefetch degree can differ across prefetchers and workloads. Moreover, the performance impact of activating or deactivating a prefetcher, or of changing its prefetch degree, is hard to predict, as it depends on the specific workload and system configuration. Some additional reasons why coordinating the activation and prefetch degree of multiple prefetchers is difficult are listed as follows:

  • Workload heterogeneity Different workloads have different memory access patterns, which makes it challenging to design a single prefetcher that can perform well across all workloads. Thus, multiple prefetchers with different designs are often used to handle different types of memory access patterns in workloads.

  • Interaction among prefetchers Managing the activation and degree of prefetchers in a system can be challenging due to their interaction. For example, some prefetchers may produce inaccurate predictions that could negatively affect the accuracy of other prefetchers. Additionally, some prefetchers may produce excessive prefetches, which could consume an excessive amount of memory bandwidth and affect the performance of other prefetchers.

  • Prefetch degree Adjusting the prefetch degree of a prefetcher may have various degrees of impacts on system performance and main memory bandwidth. Increasing the prefetch degree can improve the prefetcher’s data anticipation, reducing cache misses and enhancing system performance. However, it may increase memory bandwidth usage. Conversely, decreasing the prefetch degree may ease memory bandwidth pressure but could result in more cache misses and a potential performance decline. It can be challenging to find the optimal balance between prefetch degree and performance, especially when multiple prefetchers are involved.

  • Runtime variability Workloads may exhibit runtime variability, where the memory access patterns change over time. It is challenging to predict which prefetcher or combination of prefetchers will perform best at any given time. Additionally, the effectiveness of a prefetcher may change as the system load varies, which makes it difficult to determine optimal prefetch degree settings that are universally applicable.

Overall, these factors present significant challenges to control the activation and prefetch degree of multiple prefetchers in a system. Therefore, there is a critical need for a coordinated prefetching controller that can dynamically adjust the activation and prefetch degree of multiple prefetchers based on program behavior to optimize overall system performance.

3.2 Reinforcement learning

RL [23] is a machine learning method based on interactive trial-and-error learning, which aims to enable agents to learn optimal behavior strategies through interaction with the environment. RL can be viewed as a sequential decision-making problem, in which agents need to make optimal decisions at different time steps to maximize cumulative rewards. This process can be modeled as a Markov Decision Process (MDP), which consists of a five-tuple \((S, A, P, R, \gamma )\), where S is the set of states, A is the set of actions, P is the state transition probability matrix, R is the reward function, and \(\gamma \) is the discount factor. In MDP, an agent chooses an action \(a_t\) based on the current state \(s_t\) and receives an immediate reward \(r_t\) and transition to the next state \(s_{t+1}\). The agent’s goal is to learn the optimal policy \(\pi \) by maximizing the expected cumulative reward \(\sum _{t=0}^\infty \gamma ^t r_t\), where \(\gamma \in [0,1]\) is the discount factor used to balance short-term and long-term rewards.

The control problem of multiple prefetchers can be formalized as a MDP in RL, where the agent needs to choose the optimal action in the state space to maximize the performance improvement brought by the prefetchers. RL methods possess characteristics such as autonomous learning and strong adaptability, which can effectively learn and optimize in different environments and tasks. As such, they are well-suited for complex prefetcher control problems.

3.3 Applicability of reinforcement learning to multi-prefetcher control

In the context of coordinating multiple prefetchers, RL exhibits significant applicability. Firstly, its adaptive learning feature enables prefetch controllers to autonomously learn in a complex memory access environment, adapting to dynamic changes in the system. Simultaneously, a prefetch controller should be performance-driven, possessing the ability to adaptively coordinate different prefetchers throughout the system based on their impact on performance, providing robust performance improvements across varying workloads and system configurations. Secondly, RL’s online learning capability allows prefetchers to continuously receive rewards from the environment and iteratively optimize their strategies in real-time, eliminating the need for an expensive offline training phase. This aligns with the continuous learning requirement of prefetch controllers to adapt to evolving workloads and system conditions, ensuring consistent performance improvements.

Lastly, RL’s ease of implementation makes it an ideal choice. Compared to other complex machine learning models, RL models are relatively efficient in hardware implementation, featuring reasonable size and low latency, which makes them more accessible for adoption in practical processors. The SARSA algorithm, a simple yet effective variant of RL, offers low cost and low overhead, making it suitable for a lightweight prefetch controller in real-world implementations. Its simplicity not only reduces computational complexity but also facilitates effective coordination in a multi-prefetcher environment, providing a stable and reliable advantage for system performance.
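For reference, the entire learning step of SARSA is the single on-policy temporal-difference update below, where \(\alpha \) is the learning rate; the rule itself is standard [23], and RL-CoPref's particular hyperparameter values are not implied here:

$$\begin{aligned} Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_t + \gamma \, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right] \end{aligned}$$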

4 The basic principle of RL-CoPref

In this paper, we present an intelligent framework for multi-prefetcher control based on RL. The primary goal of our framework is to dynamically optimize the activation and prefetching degree of multiple prefetchers, thereby enhancing overall system performance and bandwidth utilization. Our scalable and adaptable framework is designed to accommodate diverse system configurations and workloads. Figure 1 provides an overview of our RL-based multi-prefetcher controller.

Fig. 1 High-level overview of the RL-based prefetcher controller

The controller’s objective is to select an optimal prefetching control strategy by interacting with the environment, aiming to maximize system performance and bandwidth utilization. Specifically, for each demand access, the state space vector in RL is constructed using information such as the program counter (PC), address, and other program features extracted from the access. The RL agent then chooses the optimal prefetching control action, represented by a value that corresponds to the activation and prefetch degree settings for all underlying prefetchers. This action involves determining which prefetchers to activate and the prefetch degree of each prefetcher. The prefetch degree determines the number of cache lines to prefetch on each occasion. Upon selecting the action, the RL agent utilizes a set of predicted addresses generated by multiple prefetchers, each employing its unique prefetching strategy. For each prefetching control action, the controller receives a reward value, evaluating the accuracy of prefetching with respect to the current DRAM bandwidth usage. This comprehensive RL-based multi-prefetcher controller aims to intelligently adapt to varying system demands and configurations, ultimately optimizing system performance.

4.1 Formulation of the RL-CoPref

We have formulated the prefetcher controller as a RL agent, and thus, we have defined a set of states and actions, along with an appropriate reward structure to guide the learning process.

4.1.1 State

In the development of RL-based multi-prefetcher controllers, the careful selection of the state space is crucial. We leverage detailed control-flow and data-flow information derived from memory accesses as integral components of the state space. Examples of control-flow information include the program counter (PC) or PC sequences, representing the program’s execution path. Data-flow information involves data dependencies within the program, illustrated by cacheline address and cacheline delta. The chosen state space features provide valuable insights into program execution and memory access, facilitating the acquisition of more effective prefetching strategies through RL algorithms.

We systematically evaluated various combinations of program features and identified the best-performing set (PC, Address, and PC+Delta) as the state vector. This selection enables the agent to infer data locality and dependencies. It is worth noting that these program features have been widely adopted in recent prefetching research [1, 2, 4, 11, 57].
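A minimal sketch of how such a state vector could be assembled per demand access is shown below; the delta definition and the folding of PC+Delta into a fixed width are our assumptions:

```python
def build_state(pc: int, addr: int, last_addr: int, block_size: int = 64):
    """Assemble the (PC, Address, PC+Delta) state features for one demand access."""
    line = addr // block_size
    delta = line - (last_addr // block_size)       # line delta w.r.t. the previous access
    pc_delta = (pc << 7) ^ (delta & 0x7F)          # fold PC with a 7-bit delta (width assumed)
    return (pc, line, pc_delta)
```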

4.1.2 Action

RL-CoPref leverages RL to learn the optimal actions for controlling and optimizing prefetch operations across multiple underlying prefetchers in various states. In the RL-based multi-prefetcher controller, the action space is defined as a set of selectable operations representing the mapping values for activation and prefetch degree of all underlying prefetchers. The activation state determines whether to execute a prefetch operation, while the prefetch degree of each prefetcher controls the amount of data retrieved from L3 cache. By adjusting the mapping values of activation state and prefetch degree, the intelligent agent can choose actual prefetch addresses from the predicted addresses provided by the underlying prefetchers based on their individual prefetching strategies. To minimize RL-CoPref’s storage overhead, we pruned the action list without significantly impacting performance.
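One possible encoding of such an action value is sketched below, assuming two underlying prefetchers and a small pruned set of allowed degrees (degree 0 meaning the prefetcher is deactivated); the exact mapping used by RL-CoPref is not specified here:

```python
# Candidate prefetch degrees per prefetcher; degree 0 means the prefetcher is off.
DEGREES = [0, 1, 2, 4]                              # illustrative pruned set

def decode_action(action_id: int, num_prefetchers: int = 2):
    """Map a single action index to one degree setting per underlying prefetcher."""
    settings = []
    for _ in range(num_prefetchers):
        settings.append(DEGREES[action_id % len(DEGREES)])
        action_id //= len(DEGREES)
    return settings

# With 2 prefetchers and 4 degree levels each, the action space has 16 entries.
print(decode_action(9))   # -> [1, 2]: prefetcher 0 at degree 1, prefetcher 1 at degree 2
```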

4.1.3 Reward

The design of reward values plays a crucial role in shaping the behavior and decision-making of the prefetching controller. In our proposed prefetching controller, the primary objective is to enhance performance by selecting an optimal set of prefetchers and determining their corresponding prefetch degree. To achieve this optimization, we introduce a numeric reward value to assess the controller’s behavior. This reward value is contingent on the success of the prefetching operation, marked as positive for prefetch hits and negative for prefetch misses.

However, variations exist in the reward values based on high and low bandwidth usage scenarios. Specifically, in instances of inaccurate prefetching operations, the negative reward value is smaller in high bandwidth usage scenarios compared to low bandwidth usage scenarios. Conversely, if the agent refrains from prefetching, the negative reward value in low bandwidth usage scenarios is relatively lower than in high bandwidth usage scenarios. This approach is strategically designed to enhance the performance and accuracy of the prefetching controller across different bandwidth usage scenarios, ultimately contributing to increased efficiency in DRAM bandwidth utilization.

RL-CoPref tracks memory bandwidth utilization by employing a straightforward counter situated at the memory controller. This counter keeps track of DRAM column access (CAS) commands over a time window spanning 4\(\times \) tRC cycles, where tRC represents the minimum interval between two DRAM row activations. To introduce hysteresis into the tracking mechanism, the counter is halved at the conclusion of each window. The peak DRAM bandwidth and the maximum potential count of CAS commands within each tRC window are determined by the number of channels and the width of each channel. This counter is further categorized into quartiles (25%, 50%, and 75%) of peak bandwidth. If the counter exceeds 75% of the peak value, it indicates high bandwidth usage; otherwise, it signifies low usage.
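A behavioral sketch of this bandwidth tracker is given below; the window length of 4×tRC and the halving at each window end follow the text, while the class structure and quartile helper are illustrative:

```python
class BandwidthTracker:
    """Count DRAM CAS commands over 4*tRC windows and classify bandwidth usage."""
    def __init__(self, peak_cas_per_window: int):
        self.peak = peak_cas_per_window      # derived from channel count and channel width
        self.count = 0

    def on_cas(self):
        self.count += 1                      # one column access observed at the memory controller

    def on_window_end(self):
        self.count //= 2                     # halve the counter for hysteresis

    def usage_quartile(self) -> int:
        """Return 1..4, the quartile of peak bandwidth currently in use."""
        frac = self.count / self.peak
        if frac > 0.75:
            return 4                         # high bandwidth usage
        if frac > 0.50:
            return 3
        if frac > 0.25:
            return 2
        return 1

    def is_high(self) -> bool:
        return self.usage_quartile() == 4
```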

5 RL-CoPref framework

Figure 2 presents an overview of the proposed RL-CoPref. Initially, each demand request is handled by a preprocessing step, where a state vector is generated, which contains pertinent information regarding PC, address, and other program features, as elaborated in Sect. 4.1.1. An agent, based on the Q-Value Vault, selects the best possible action based on the current state vector. To facilitate the training of the model, a sampling buffer is designed to collect state transitions. Since the action and reward are asynchronous, a delayed sampling mechanism is employed for Q-table training. The reward is derived from future prefetch hits/misses, as detailed in Sect. 5.2.2.

Fig. 2 Overall design of RL-CoPref

5.1 RL-based multi-prefetcher control algorithm

Algorithm 1 shows the SARSA-based RL algorithm used to control multiple prefetchers. The action space of the algorithm is the mapping between the prefetch activation status and prefetch degree of all underlying prefetchers. The algorithm starts with an initialization of the Q-table. In each iteration, the algorithm uses program features, which include the PC, memory address, and PC+Delta, as state information to select an action. The action selection process is guided by the epsilon-greedy policy, which selects a random action with probability epsilon and otherwise selects the action with the highest Q-value.

Algorithm 1 Train and Predict Algorithm

The algorithm then calculates a reward based on prefetch hits/misses and main memory bandwidth utilization. The reward value is positive if the prefetch hits and negative if it misses, and the magnitude of the negative reward under high bandwidth utilization differs from that under low bandwidth utilization, as described in Sect. 4.1.3. Moreover, if no prefetch is issued, the algorithm also assigns a (bandwidth-dependent) negative reward. After receiving the reward, the algorithm updates the Q-value based on the difference between the expected and actual rewards. The updated Q-value is used in the next iteration to select actions. The algorithm iteratively performs these steps until convergence.
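A condensed, behavioral sketch of the per-access SARSA step in Algorithm 1 is shown below; the hyperparameters and the dictionary-backed Q-table are illustrative stand-ins for the tile-coded Q-Value Vault described in Sect. 5.2.1:

```python
import random

def select_action(q, state, actions, epsilon=0.1):
    """Epsilon-greedy selection over the pruned action list."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: q.get((state, a), 0.0))  # exploit

def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One on-policy SARSA update of a dictionary-backed Q-table."""
    td_target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = q.get((state, action), 0.0) + \
        alpha * (td_target - q.get((state, action), 0.0))
```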

5.2 Detailed design of RL-CoPref

We first present the design of the Q-Value Vault (Sect. 5.2.1), and then discuss the assignment of rewards and the updating of Q-values in RL-CoPref for coordinating multiple prefetchers (Sect. 5.2.2).

5.2.1 Q-value vault

Because the PC and address information of memory accesses are effectively continuous variables, directly using them as state-space elements would lead to a very large state space and make the algorithm difficult to handle. Therefore, we use tile coding to divide the continuous feature space into multiple discrete regions, where each region is called a tile. Each tile is represented by a two-dimensional table over the discretized feature space. We partition each feature into a group of tiles, with each tile representing a discrete region, so that together the tiles cover all possible values of the feature space.

The Q-values for a feature-action pair \((j, a)\) can be represented as the sum of the Q-values over all tiles corresponding to feature j, where feature j is tiled into \(m_j\) tiles, and the Q-value for the ith tile is denoted as \(Q_{ij}(s,a)\) for \(i \in \{1,2,\ldots ,m_j\}\). Mathematically, we can express the Q-value of the feature-action pair, \(Q_j(s,a)\), as in (1):

$$\begin{aligned} Q_j(s,a) = \sum _{i=1}^{m_j} Q_{ij}(s,a) \end{aligned}$$
(1)

We use the max-Q algorithm to compute the \(Q(s,a)\) value for a state-action pair \((s, a)\), which is defined as the maximum of the Q-values over all feature-action pairs, as in (2):

$$\begin{aligned} Q(s,a) = \max \limits _{j=1}^{k} \left\{ \sum \limits _{i=1}^{m_j} Q_{ij}(s,a) \right\} \end{aligned}$$
(2)

where k represents the number of features, \(m_j\) represents the number of tiles for feature j, and \(Q_{ij}(s,a)\) represents the Q-value for the ith tile of feature j and action a. In the computation of Q-values, a pipelined approach can be employed to accelerate the process. This approach exploits the independence of the Q-value summation across the tiles of each feature. Specifically, the Q-values of the tiles belonging to a feature can be summed by parallel arithmetic units to produce the Q-value of the feature-action pair \(Q_j(s,a)\). This hardware design is characterized by high parallelism and efficiency, which allows a large number of Q-values to be computed in a short period of time, thereby improving the efficiency and performance of the algorithm.
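A software sketch of Eqs. (1) and (2) is given below; the tile counts, table sizes, and the hashing of a feature value into a tile bin are illustrative, and the loops correspond to the adders that a hardware implementation would operate in parallel:

```python
import numpy as np

class QValueVault:
    """Tile-coded Q storage: one small table per (feature, tile), as in Eqs. (1)-(2)."""
    def __init__(self, num_features, tiles_per_feature, bins_per_tile, num_actions):
        # q[j][i] is the table for tile i of feature j, indexed by (bin, action).
        self.q = [[np.zeros((bins_per_tile, num_actions))
                   for _ in range(tiles_per_feature)]
                  for _ in range(num_features)]
        self.bins = bins_per_tile

    def _bin(self, feature_value, tile_id):
        # Offset each tile's mapping slightly so tiles overlap, then wrap into a bin.
        return (feature_value + tile_id) % self.bins

    def q_feature(self, j, feature_value, action):
        """Eq. (1): sum the Q-values of all tiles of feature j."""
        return sum(tile[self._bin(feature_value, i), action]
                   for i, tile in enumerate(self.q[j]))

    def q_state(self, state, action):
        """Eq. (2): maximum feature-action Q-value over all features."""
        return max(self.q_feature(j, v, action) for j, v in enumerate(state))
```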

Figure 3 illustrates the high-level organization of the Q-Value Vault, and how it retrieves a Q-value for a given state and action. Each state is represented as a vector, with each element indicating which tile the state falls into. In the initial stage, RL-CoPref computes the index for each tile and each constituent feature of the given state vector. In the second pipeline stage, RL-CoPref uses the feature indices and an action index to retrieve the \(Q_{ij}(s,a)\) values from each tile. In the third pipeline stage, RL-CoPref sums up the \(Q_{ij}(s,a)\) values to get the feature-action Q-value for each constituent feature.

Fig. 3 Overall design of Q-value vault

In the fourth pipeline stage, RL-CoPref computes the maximum of all feature-action Q-values to get the state-action Q-value. The final step involves comparing the retrieved state-action Q-value with the maximum state-action Q-value found so far, and updating the maximum Q-value, if the retrieved value is greater.

5.2.2 Assigning rewards and updating Q-values

To track the usefulness of prefetch requests, RL-CoPref maintains a first-in-first-out list of recently taken actions, along with their corresponding prefetch addresses, in the sampling buffer. Every prefetch action is inserted into the sampling buffer. Because RL-CoPref cannot always immediately assign a reward to a taken action, since the usefulness of the corresponding prefetch request is not yet known, it stores both the action and the prefetch addresses in the sampling buffer so that rewards can be assigned to them later. A reward is assigned to every sampling buffer entry before or when it is evicted from the buffer. During eviction, the reward and the state-action pair associated with the evicted entry are used to update the corresponding Q-value in the Q-Value Vault. This process is referred to as delayed sampling.

RL-CoPref adopts a RL methodology to assign rewards to each sampling buffer entry, which comprises three distinct scenarios. The first scenario involves assigning an immediate reward during the buffer insertion phase, where RL-CoPref assigns an immediate reward \(R_{NH}\) or \(R_{NL}\) to the corresponding sampling buffer entry based on the current system memory bandwidth usage, if it chooses not to issue a prefetch. The second scenario occurs during the buffer residency period, where rewards are assigned based on the prefetching action’s performance. During a sampling buffer entry’s residency period, a positive reward \(R_{P}\) is assigned if the prefetch address stored in the entry matches the address of a demand request. The third scenario occurs during the buffer eviction phase, where the reward is assigned based on the corresponding prefetch address’s demand status. If a reward is not assigned to a sampling buffer entry until eviction, a negative reward \(R_{NAH}\) or \(R_{NAL}\) is assigned based on whether the current system memory bandwidth usage is high or low. Through these three scenarios, RL-CoPref’s reward assignment system aims to optimize the prefetching mechanism’s performance and maximize the system’s memory bandwidth utilization.
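The three reward scenarios can be sketched as follows; the reward names mirror those in the text, while their magnitudes, the buffer capacity, and the vault interface are illustrative assumptions:

```python
from collections import deque

R_P   = 20    # prefetch hit
R_NH  = -2    # no prefetch issued, high bandwidth usage
R_NL  = -4    # no prefetch issued, low bandwidth usage
R_NAH = -14   # prefetch never demanded before eviction, high bandwidth usage
R_NAL = -8    # prefetch never demanded before eviction, low bandwidth usage

class SamplingBuffer:
    """FIFO of (state, action, prefetch addresses, reward) entries awaiting delayed updates."""
    def __init__(self, capacity, vault):
        self.buf = deque()
        self.capacity = capacity
        self.vault = vault                   # exposes update(state, action, reward)

    def insert(self, state, action, prefetch_addrs, bandwidth_high):
        reward = None
        if not prefetch_addrs:               # scenario 1: agent chose not to prefetch
            reward = R_NH if bandwidth_high else R_NL
        if len(self.buf) == self.capacity:
            self._evict(bandwidth_high)
        self.buf.append([state, action, set(prefetch_addrs), reward])

    def on_demand(self, addr):
        for entry in self.buf:               # scenario 2: a resident prefetch proved useful
            if entry[3] is None and addr in entry[2]:
                entry[3] = R_P

    def _evict(self, bandwidth_high):
        state, action, _, reward = self.buf.popleft()
        if reward is None:                   # scenario 3: still unrewarded at eviction
            reward = R_NAH if bandwidth_high else R_NAL
        self.vault.update(state, action, reward)
```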

6 Experiments

6.1 Experimental settings

To evaluate the performance of RL-CoPref, we employ a simulation framework based on ChampSim [58], which has been released by the second JILP Cache Replacement Championship (CRC2). The simulated system models a processor with a 4-wide out-of-order execution, an 8-stage pipeline, and a three-level cache hierarchy. The configuration parameters of the simulated system are provided in Table 1.

Table 1 Baseline configuration

Hardware Prefetcher We use Best-offset Prefetcher (BO) and Managed Irregular Stream Buffer Prefetcher (MISB) as the underlying prefetchers in the RL-CoPref prefetching control framework. BO and MISB are state-of-the-art spatial and temporal prefetchers, respectively. BO algorithm is a spatial prefetcher that predicts memory requests based on the spatial locality of previously accessed data. It tries to find the optimal prefetching offset by testing a list of deltas. MISB identifies correlated addresses within a PC-localized stream and learns temporally correlated memory accesses based on them. It manages the movement of metadata between the on-chip metadata caches and off-chip metadata storage using a metadata prefetcher.

Ensemble Prefetch To provide a comparison to our framework, we consider the Sandbox Prefetcher (SBP) [22], which is a state-of-the-art non-RL ensemble prefetcher. SBP uses a Bloom filter to evaluate the accuracy of multiple offset prefetchers at runtime. However, the greedy strategy of SBP limits its ability to quickly adapt to changing access patterns. To address this limitation, we propose a modified version of SBP, which we refer to as the Extended Sandbox Prefetcher (ESBP). ESBP extends the candidate set from offset prefetchers to all types of prefetchers and selects the prefetcher with the highest recent prefetching accuracy. We further enhance ESBP with a greedy parameter \(\epsilon \), which determines the probability of exploration. Specifically, with a probability of \(1 - \epsilon \), ESBP selects the best prefetcher based on the most recent prefetch accuracy, while with probability \(\epsilon \) it randomly selects a prefetcher from all available options. We evaluate the performance of RL-CoPref against ESBP and other state-of-the-art prefetchers in our experiments.
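The ESBP selection rule reduces to a simple epsilon-greedy choice over candidate prefetchers, as sketched below (the accuracy bookkeeping over a recent window is omitted for brevity):

```python
import random

def esbp_select(prefetchers, recent_accuracy, epsilon=0.1):
    """Pick a prefetcher: explore with probability epsilon, otherwise exploit
    the one with the highest recent prefetch accuracy."""
    if random.random() < epsilon:
        return random.choice(prefetchers)
    return max(prefetchers, key=lambda p: recent_accuracy[p])

# Example: BO has been more accurate recently, so it is usually chosen.
print(esbp_select(["BO", "MISB"], {"BO": 0.62, "MISB": 0.41}))
```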

Benchmarks We evaluate RL-CoPref on the SPEC CPU2006 and SPEC CPU2017 [59] benchmarks, using the SimPoint traces provided by DPC-3 [60]. We warm up the caches for 100 M (100 million) instructions and evaluate performance over the next 500 M instructions. Figures 4 and 5 show the performance of the different prefetchers for all traces; the results show that RL-CoPref provides a 32.09% improvement over the baseline LRU.

6.2 Experiment results

6.2.1 Single-core performance

We evaluate the effectiveness of our proposed coordinated prefetching controller (i.e., RL-CoPref) for multiple prefetchers. In our evaluation, we compare RL-CoPref with other state-of-the-art prefetchers, including BO Prefetcher, MISB Prefetcher, and ESBP. Figure 4 shows the performance improvement of each prefetch method. RL-CoPref provides the best performance across all prefetch methods on average. Compared with the baseline LRU, RL-CoPref improves the performance by 32.09% on average. RL-CoPref provides 4.68%, 10.88%, and 2.87% higher performance improvement on average over the BO, MISB, and ESBP prefetch methods, respectively.

The adjustment policies of ESBP are based on selecting the prefetcher with the highest recent prefetching accuracy, whereas RL-CoPref applies RL to find the prefetch control policy with the greatest long-term performance benefit. The effectiveness of RL-CoPref stems from its ability to learn a coordinated prefetching strategy that accounts for the interactions between the prefetchers and the cache, whereas ESBP's approach lacks such coordination and may select a suboptimal prefetcher. Furthermore, RL-CoPref's ability to continuously adapt to changing workloads and system conditions also enables it to outperform ESBP.

Fig. 4 Performance improvement of RL-CoPref and state-of-the-art prefetchers

Fig. 5 Prefetch coverage of RL-CoPref and state-of-the-art prefetchers

Figure 5 shows the single-core prefetch coverage of each configuration in the LLC. Compared with the rule-based prefetch controller, RL-CoPref improves prefetch coverage. On average, RL-CoPref provides 7.07%, 17.44%, and 2.46% higher coverage than BO, MISB, and ESBP, respectively. RL-CoPref achieves higher prefetch coverage by learning to coordinate multiple prefetchers and adaptively adjusting their prefetch degrees based on feedback from the cache. This enables the controller to exploit the strengths of each prefetcher while mitigating its weaknesses, leading to a more effective and efficient use of the prefetching resources. The result is a more robust and adaptable prefetching strategy that responds to changes in the workload and system conditions, yielding higher prefetch coverage and improved performance.

6.2.2 Multi-core performance

To evaluate the performance of RL-CoPref in a multi-core environment, we compare the performance results of different algorithms using the Weighted Speedup (WS) metric. WS takes into account enhancements in multi-core performance, considering the impact of each core and providing a holistic view of system efficiency. For a multi-core mixed workload, we randomly select four distinct traces from our trace list and execute one trace on each core. The WS calculation formula is given by:

$$\begin{aligned} \text {WS} = \sum _{i=0}^{N-1} \frac{{\text {IPC}_\textrm{together}}_i}{{\text {IPC}_\textrm{alone}}_i} \end{aligned}$$
(3)
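For example, Eq. (3) amounts to the following computation over per-core IPC values (the numbers below are hypothetical and only illustrate the metric):

```python
def weighted_speedup(ipc_together, ipc_alone):
    """Eq. (3): sum of per-core IPC ratios (co-run IPC over standalone IPC)."""
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))

# A hypothetical 4-core mix where each core keeps 80-95% of its standalone IPC.
print(weighted_speedup([1.9, 0.8, 1.2, 0.45], [2.0, 1.0, 1.5, 0.5]))  # 3.45
```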

Figure 6 illustrates the corresponding benefits of WS normalized to no prefetching. RL-CoPref showed an additional average enhancement of \(14.99\%\), \(5.44\%\), \(1.42\%\), and \(1.26\%\) in WS compared to baseline, MISB, BO, and ESBP, respectively. This improvement can be attributed to RL-CoPref’s ability to adaptively respond to changing workloads, leveraging its reinforcement learning techniques for efficient prefetching coordination across multiple cores.

Fig. 6 WS improvement normalized to baseline

6.2.3 Performance evaluation of RL-CoPref with more underlying prefetchers

To evaluate the performance of RL-CoPref in the presence of more underlying prefetchers, we conducted experiments in which two additional prefetchers, an IP-based stride prefetcher and the Domino prefetcher, were added on top of the existing BO and MISB prefetchers. The IP-based stride prefetcher detects access patterns within a spatial region based on the instruction pointer and generates prefetch requests following this pattern to reduce the number of cache misses. The Domino prefetcher is a temporal data prefetching technique designed to improve the effectiveness of existing temporal prefetchers; it overcomes the limitations of existing lookup mechanisms by logically looking up the history with both the one and two most recent miss addresses to find a match for prefetching. Using more prefetchers to test RL-CoPref verifies its ability to coordinate among multiple prefetchers and optimize their interactions with the cache, and evaluates whether its performance is maintained in more complex prefetching environments.

Our experiments show that RL-CoPref outperforms the baseline LRU policy, all individual prefetchers, and the state-of-the-art ESBP prefetcher. Figure 7 shows the performance improvement of each prefetch method. Specifically, RL-CoPref improves performance by 35.50% over the LRU baseline, 5.91% over BO, 16.54% over MISB, 7.87% over the Domino prefetcher, 14.38% over the IP-based stride prefetcher, and 4.64% over the ensemble prefetching controller ESBP.

Moreover, as the number of underlying prefetchers increases, RL-CoPref shows better performance. This improvement stems from its ability to coordinate among multiple prefetchers and optimize their interactions with the cache. With the growing complexity of modern computer systems, the ability to coordinate multiple prefetchers becomes increasingly important for prefetching efficiency. Therefore, RL-CoPref is a promising solution for prefetching in modern computer systems.

Fig. 7 Performance improvement of RL-CoPref and state-of-the-art prefetchers

Fig. 8 Prefetch coverage of RL-CoPref and state-of-the-art prefetchers

Figure 8 shows the prefetch coverage of each prefetch method. Specifically, RL-CoPref achieves 76.15% higher prefetch coverage than the LRU baseline (which performs no prefetching), 22.23% higher than BO, 23.07% higher than MISB, 20.62% higher than the Domino prefetcher, 32.36% higher than the IP-based stride prefetcher, and 9.03% higher than ESBP. As the number of underlying prefetchers increases, the multi-prefetcher controller can leverage more historical data to train and update its prefetching policies, while better sensing dynamic changes in the environment. This enables the multi-prefetcher controller to more accurately predict data access patterns and thus achieve higher prefetch coverage in practical applications.

6.3 Sensitivity to the LLC size

Figure 9 shows the performance improvement of the BO Prefetcher, MISB Prefetcher, ESBP, and RL-CoPref averaged over all traces while varying the LLC size from 256 KB to 4 MB.

Fig. 9 Average performance improvements with varying LLC size

It can be observed that RL-CoPref consistently outperforms BO, MISB, and ESBP across the different LLC size configurations. For a 256 KB (4 MB) LLC, RL-CoPref improves performance over BO, MISB, and ESBP by 6.42% (3.40%), 8.73% (9.91%), and 1.34% (2.23%), respectively. RL-CoPref outperforms the other prefetchers because of its adaptive management of prefetcher activation and prefetch degree based on system-level feedback, which provides robust performance benefits across different LLC sizes.

6.4 Performance scaling with memory bandwidth

To evaluate the effectiveness of RL-CoPref under different DRAM bandwidth configurations, we perform experiments in which the DRAM bandwidth is scaled. Figure 10 shows how the performance improvement of each configuration scales as the DRAM bandwidth increases from 600 MTPS to 9600 MTPS. Each of the bandwidth configurations depicted in Fig. 10 corresponds approximately to the per-core DRAM bandwidth available in various commercially available processors, such as the Intel Core i9 [61], AMD EPYC Rome [62], and AMD Ryzen Threadripper [63]. RL-CoPref maintains better performance than the other prefetchers across this range of DRAM bandwidth configurations. As the DRAM bandwidth scales, RL-CoPref optimizes the utilization of the underlying prefetchers by dynamically adjusting its prefetching policy to the current environment, which leads to consistent performance improvements across the various DRAM bandwidth configurations.

Fig. 10 Average performance scaling with DRAM bandwidth

6.5 Storage overhead

The storage overhead of RL-CoPref is 38.3KB, with the Q-Value Vault accounting for 36KB and the sampling buffer accounting for 2.3KB. This low-cost design makes RL-CoPref hardware-friendly and easily implementable. A comprehensive analysis of the storage overhead is provided in Table 2.

Table 2 Storage overhead of RL-CoPref

7 Conclusion

In this paper, we propose RL-CoPref, a coordinated prefetching controller that leverages RL to dynamically select among multiple prefetchers based on program context information. Our approach incorporates the tile coding technique to effectively address the challenge of mixed memory access patterns, giving it strong adaptability and fast learning. Through extensive evaluations using the ChampSim simulator, RL-CoPref demonstrates the best performance over both state-of-the-art individual prefetchers and the ensemble prefetcher ESBP, achieving an average prefetch coverage of 76.15% and a 35.50% IPC improvement. In our future work, we will investigate hardware implementation optimizations, budget sensitivity, and ensemble prefetching for multi-core architectures in the context of RL-CoPref.