1 Introduction

The hardware data prefetcher predicts the memory access patterns of an application and fetches useful cache lines from deeper levels of the memory hierarchy ahead of their first demand reference [1,2,3,4,5,6]. Existing prefetchers make predictions based on the spatial or temporal locality of program memory accesses, and their effectiveness is limited to specific memory access patterns. Spatial prefetchers predict fixed offsets, multiple offsets, or streams within a spatial region [7,8,9,10,11]. For example, VLDP [9], SPP [4], and the IP complex stride prefetcher included in IPCP [11] focus on deltas between accessed addresses to predict the next address delta. However, spatial prefetchers cannot predict the first cache miss in each region and are limited to a fixed region size.

Temporal prefetchers are capable of predicting intricate memory access patterns by anticipating recurring miss sequences [3, 12,13,14]. For example, the Irregular Stream Buffer (ISB) [12] and Domino [3] track the temporal order of memory accesses. However, prefetchers that predict by recording pairs of correlated addresses incur substantial storage overhead, typically hundreds of KBs, and they cannot predict compulsory misses. Recently, managed ISB (MISB) [13] and Triage [14] have optimized this hardware overhead without compromising coverage. When dealing with complex programs that exhibit multiple memory access patterns, or with mixed workloads whose access patterns differ, the aforementioned prefetchers alone cannot cover all of a program's accesses. Moreover, the prefetch degree is a key parameter of a data prefetcher, indicating how many prefetch requests are generated. Specifically, it denotes the number of cache lines or memory blocks that the prefetcher is instructed to proactively fetch ahead of the actual demand. A higher prefetch degree implies fetching more data ahead of demand, which can improve performance by reducing cache misses at the cost of increased memory bandwidth usage. Dynamically adjusting the prefetch degree therefore has varying impacts on program performance and main memory bandwidth during program execution [15,16,17].
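To make the notion of prefetch degree concrete, the following Python sketch (an illustration of ours, not tied to any particular prefetcher) expands a single predicted delta into `degree` prefetch candidates:

```python
def issue_prefetches(trigger_addr: int, delta: int, degree: int,
                     block_size: int = 64) -> list:
    """Expand one predicted line delta into `degree` prefetch addresses.

    A higher degree fetches further ahead of the demand stream, trading
    extra memory bandwidth for potentially fewer cache misses.
    """
    line = trigger_addr // block_size              # cache line of the demand access
    return [(line + i * delta) * block_size        # i lines ahead along the stride
            for i in range(1, degree + 1)]

# A +1-line stream starting at 0x1000: degree 2 vs. degree 4.
print([hex(a) for a in issue_prefetches(0x1000, delta=1, degree=2)])
print([hex(a) for a in issue_prefetches(0x1000, delta=1, degree=4)])
```

The degree-4 variant covers more of the stream ahead of demand, but spends proportionally more bandwidth whenever the prediction is wrong.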

Integrating multiple prefetchers in a processor can compensate for the prediction limitations of individual prefetchers, cover different memory access patterns of programs, and combine the performance benefits of multiple prefetchers. However, multiple prefetchers compete for limited shared resources, such as the LLC and main memory bandwidth [18,19,20,21]. Without efficient control of multiple prefetchers, the system performance of certain programs may degrade. Existing prefetch control methods suffer from inherent limitations, such as offline model training and insufficient adaptation to dynamic program behavior and system configurations [15, 16]. Prefetch control methods based on the classification of memory access patterns require offline model training and assume that memory access patterns are consistent between offline training and online operation. When the observed access patterns differ from those seen in training, the accuracy of such methods drops, which degrades system performance. Prefetch control methods based on performance feedback compare the performance of different prefetchers during program execution and select the optimal one for the current program phase from a set of prefetchers. For example, the Sandbox Prefetcher [22], which uses Bloom filters to evaluate the accuracy of prefetchers, selects the prefetcher with the highest prefetching accuracy from a set of candidate fixed-offset prefetchers. However, it suffers from hysteresis: for interleaved access patterns, a sub-optimal prefetcher can degrade prefetch performance before the optimal prefetcher is selected.

The objective of our work is to design a prefetch control framework that can (1) dynamically select the prefetcher to be activated from multiple options based on online training; (2) adaptively adjust the prefetch degree of the selected prefetcher; and (3) adjust prefetchers after each cache access to alleviate the hysteresis of prefetch control.

In this paper, we propose RL-CoPref, a reinforcement learning-based prefetch control framework that effectively manages prefetch activation and adjusts prefetch degree in response to changes in cache and main memory bandwidth. To achieve this, RL-CoPref utilizes tile coding [23] to discretize the continuous state space, enhancing the learning and decision-making processes in the RL framework. Tile coding is employed to structure the state space, particularly for program features such as program counters and cacheline addresses, which are inherently continuous. After each demand request, RL-CoPref extracts a set of program features from the current program context information, discretizes them using tile coding, and employs them as state features in the RL process. RL-CoPref then selects the best prefetching action based on historical learning experience and evaluates the effectiveness of each action by considering prefetch accuracy and main memory bandwidth utilization. The reward value received by RL-CoPref after each action is calculated based on prefetch hits/misses and the current usage of main memory bandwidth. Through this RL approach, RL-CoPref learns to improve prefetching performance over time, facilitating the more efficient use of system resources.

This paper introduces RL-CoPref, an RL-based coordinated prefetching controller designed to integrate the performance improvements of multiple prefetchers under complex and mixed memory access patterns. The key contributions of RL-CoPref are outlined as follows.

  • We present a novel RL-based prefetcher control framework that leverages RL advantages to adaptively learn, optimize, and adjust both the activation and prefetch degree of prefetchers. This dynamic adaptation aims to enhance overall program performance.

  • To address the diverse landscape of processor architectures and workload types, RL-CoPref incorporates multiple program features as feature vectors. We meticulously design appropriate reward functions to incentivize prefetch controllers, ensuring an effective balance between performance improvement and DRAM bandwidth consumption.

  • RL-CoPref is engineered as a lightweight and low-overhead prefetch controller. It adopts a simple table-based approach to store learned knowledge, minimizing complex computations and reducing storage overheads. Experimental results affirm the efficacy of our prefetch control framework across various workloads and processor architectures, highlighting its low hardware overhead.

2 Related work

2.1 Hardware prefetchers

Prefetchers can be classified into different types based on their prefetching strategies, of which the most commonly used are temporal prefetchers and spatial prefetchers [24,25,26,27,28].

Temporal prefetchers [3, 13, 29] can predict memory access patterns based on temporal locality, especially recurring access patterns that arise when iterating over data structures, so as to reduce data access latency. MISB [13] utilizes the access patterns of irregular data structures, such as linked lists, trees, and graphs, to predict future data accesses and fetch data into the cache. The Domino [3] temporal prefetcher finds a matching item for prefetching by looking up the two most recent miss addresses. Although temporal address-correlating prefetchers can predict recurring sequences of misses in data structures, their prefetch degree is limited because large-scale data traversals do not repeat perfectly.

Unlike temporal prefetchers, spatial prefetchers [7, 24, 30, 31] are more proficient in predicting repetitive spatial layouts over contiguous regions of memory. The BOP prefetcher [24] uses a set of predefined offsets derived from historical cache access patterns and program code analysis. When the program accesses a particular address, the hardware prefetcher attempts to prefetch data at the predefined offset that best matches that address. In contrast, the PC-based stride prefetcher [7] uses the program counter to track the execution stride of programs and predict the data to prefetch. However, spatial prefetchers struggle to predict irregular access patterns, and the region size of a spatial prefetcher is typically fixed, which means that a large dataset may require multiple prefetch passes to be fully covered.
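As a minimal sketch of the PC-based stride idea (the table organization, confidence threshold, and degree below are illustrative, not the exact design of [7]):

```python
class StrideEntry:
    """Per-PC tracking state: last address, last stride, and a confidence counter."""
    def __init__(self):
        self.last_addr = None
        self.stride = 0
        self.confidence = 0

class PCStridePrefetcher:
    """Issue prefetches once the same stride has repeated for a given PC."""
    def __init__(self, degree=2, threshold=2):
        self.table = {}                      # PC -> StrideEntry (a real design bounds this table)
        self.degree = degree
        self.threshold = threshold

    def access(self, pc, addr):
        entry = self.table.setdefault(pc, StrideEntry())
        prefetches = []
        if entry.last_addr is not None:
            stride = addr - entry.last_addr
            if stride != 0 and stride == entry.stride:
                entry.confidence += 1
            else:
                entry.stride, entry.confidence = stride, 0
            if entry.confidence >= self.threshold:
                prefetches = [addr + entry.stride * i
                              for i in range(1, self.degree + 1)]
        entry.last_addr = addr
        return prefetches
```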

In addition to temporal prefetchers and spatial prefetchers, there are other prefetching strategies, such as instruction-based prefetching [32] and hybrid prefetching [33]. While these prefetching strategies have their own advantages, they also exhibit limitations in specific usage scenarios. For instance, temporal prefetchers may experience performance degradation due to their inability to adapt to certain program access patterns. Spatial prefetchers, when dealing with multithreaded programs or programs with numerous branches, may encounter inaccuracies in prefetching, leading to bandwidth wastage. Therefore, prefetch control strategies become crucial in addressing these issues. By comprehensively considering the strengths and limitations of different prefetchers, control strategies can more effectively select and configure prefetchers to meet specific workload and system requirements.

2.2 Prefetching control policy

The prefetching control policy [16, 17, 22, 34,35,36,37,38] aims to improve system performance by exploiting the performance advantages of multiple prefetchers. In recent years, various prefetching controllers based on different control policies have been proposed. Some rule-based controllers use specific prefetching rules to control the prefetchers. BAPC [17] is a bandwidth-aware dynamic prefetching controller that adjusts prefetching behavior based on the memory bandwidth utilization of the system. SPAC [16] cooperatively controls the aggressiveness of multiple prefetchers and throttles them based on the improvement of fair-speedup in multi-core systems. Sandbox Prefetching [22] is a method for determining the appropriate prefetching strategy at runtime. It evaluates simple and aggressive offset prefetchers by adding each prefetch address to a Bloom filter instead of actually prefetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to determine whether the evaluated aggressive prefetcher would have accurately prefetched the data, while also detecting the presence of prefetchable streams. Real prefetching is performed only when an evaluated prefetcher surpasses a certain accuracy threshold. The rule-based prefetching control methods, however, suffer from limitations such as reliance on predefined rules, inadequate adaptability to unknown access patterns, and susceptibility to hardware design constraints.
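The sandbox evaluation can be sketched as follows; a plain Python set stands in for the Bloom filter of [22], and the evaluation period and accuracy threshold are illustrative:

```python
class SandboxEvaluator:
    """Score one candidate offset prefetcher without issuing real prefetches."""
    def __init__(self, offset, period=256, threshold=0.5):
        self.offset = offset
        self.sandbox = set()                 # pseudo-prefetched line addresses
        self.hits = 0
        self.accesses = 0
        self.period = period
        self.threshold = threshold

    def access(self, line_addr):
        # A demand access that finds its line in the sandbox counts as a hit
        # for the candidate prefetcher under evaluation.
        if line_addr in self.sandbox:
            self.hits += 1
        self.sandbox.add(line_addr + self.offset)   # pseudo-prefetch, no cache fill
        self.accesses += 1

    def accurate_enough(self):
        # Real prefetching is enabled only after the accuracy threshold is met.
        return self.accesses >= self.period and self.hits / self.accesses >= self.threshold
```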

Although there are various prefetching controllers to optimize the performance of the prefetchers, most of these controllers only focus on controlling a single prefetcher without considering how to control the joint behavior of multiple prefetchers. However, by effectively controlling multiple prefetchers, programs can better predict necessary data and pre-read it into the processor, thereby avoiding execution delays caused by data waiting. By optimizing the control of multiple prefetchers, the performance advantages of multiple prefetchers can be better utilized, and the utilization of the cache can be improved, thereby improving system performance.

2.3 Reinforcement learning in computer architecture

Recently, machine learning-based algorithms have been proposed for several micro-architectural tasks, including cache management [39,40,41,42], branch prediction [43, 44], and hardware prefetching [45,46,47,48,49,50,51]. Despite the impressive results achieved by these methods on memory access prediction and branch jumps, they suffer from two significant limitations. Firstly, the models employed are often too large to fit within the LLC of typical processors, which limits their applications in practice. Secondly, these models require substantial computation during inference, resulting in significant delays that exceed the admissible delays of prefetchers or branch predictors.

In contrast, RL does not require prior knowledge or data to determine the optimal action. Rather, it acquires this ability through interactions with the environment. This advantage makes RL promising to address computer architecture problems, particularly when significant prior knowledge or data is not available, and the optimal actions can only be learned dynamically [52,53,54,55].

In recent years, researchers have proposed RL-based techniques to address various challenges in computer architecture, such as DRAM scheduling and memory prefetching. In particular, [56] presents a preliminary memory scheduler that leverages RL to handle the complex problem of DRAM scheduling. Meanwhile, [30] introduces Pythia, a customizable prefetching framework that formulates prefetching as an RL problem. Pythia is designed to be highly efficient, which consumes minimal power and area in each core, and its overheads in terms of die area and power consumption are negligible compared to those of commercial processors.

3 Preliminaries

In this section, we discuss the challenges of controlling multiple prefetchers in modern computer systems and introduce RL as a potential solution to these challenges. Specifically, we provide an overview of reinforcement learning and its applicability to prefetching control.

3.1 Challenges in controlling multiple prefetchers

Modern processors employ multiple prefetchers that use different prediction strategies to predict and prefetch data. However, coordinating the activation and degree of multiple prefetchers to maximize overall performance presents a significant challenge. This is because different prefetchers may exhibit varying prediction accuracies, and the optimal prefetch degree can differ across prefetchers and workloads. Moreover, the performance impact of activating or deactivating a prefetcher, or of changing its prefetch degree, is hard to predict, as it depends on the specific workload and system configuration. Some additional reasons why coordinating the activation and prefetch degree of multiple prefetchers is difficult are listed as follows:

  • Workload heterogeneity Different workloads have different memory access patterns, which makes it challenging to design a single prefetcher that can perform well across all workloads. Thus, multiple prefetchers with different designs are often used to handle different types of memory access patterns in workloads.

  • Interaction among prefetchers Managing the activation and degree of prefetchers in a system can be challenging due to their interaction. For example, some prefetchers may produce inaccurate predictions that could negatively affect the accuracy of other prefetchers. Additionally, some prefetchers may produce excessive prefetches, which could consume an excessive amount of memory bandwidth and affect the performance of other prefetchers.

  • Prefetch degree Adjusting the prefetch degree of a prefetcher may have various degrees of impacts on system performance and main memory bandwidth. Increasing the prefetch degree can improve the prefetcher’s data anticipation, reducing cache misses and enhancing system performance. However, it may increase memory bandwidth usage. Conversely, decreasing the prefetch degree may ease memory bandwidth pressure but could result in more cache misses and a potential performance decline. It can be challenging to find the optimal balance between prefetch degree and performance, especially when multiple prefetchers are involved.

  • Runtime variability Workloads may exhibit runtime variability, where the memory access patterns change over time. It is challenging to predict which prefetcher or combination of prefetchers will perform best at any given time. Additionally, the effectiveness of a prefetcher may change as the system load varies, which makes it difficult to determine optimal prefetch degree settings that are universally applicable.

Overall, these factors present significant challenges to control the activation and prefetch degree of multiple prefetchers in a system. Therefore, there is a critical need for a coordinated prefetching controller that can dynamically adjust the activation and prefetch degree of multiple prefetchers based on program behavior to optimize overall system performance.

3.2 Reinforcement learning

RL [23] is a machine learning method based on interactive trial-and-error learning, which aims to enable agents to learn optimal behavior strategies through interaction with the environment. RL can be viewed as a sequential decision-making problem, in which agents need to make optimal decisions at different time steps to maximize cumulative rewards. This process can be modeled as a Markov Decision Process (MDP), which consists of a five-tuple \((S, A, P, R, \gamma )\), where S is the set of states, A is the set of actions, P is the state transition probability matrix, R is the reward function, and \(\gamma \) is the discount factor. In MDP, an agent chooses an action \(a_t\) based on the current state \(s_t\) and receives an immediate reward \(r_t\) and transition to the next state \(s_{t+1}\). The agent’s goal is to learn the optimal policy \(\pi \) by maximizing the expected cumulative reward \(\sum _{t=0}^\infty \gamma ^t r_t\), where \(\gamma \in [0,1]\) is the discount factor used to balance short-term and long-term rewards.

The control problem of multiple prefetchers can be formalized as a MDP in RL, where the agent needs to choose the optimal action in the state space to maximize the performance improvement brought by the prefetchers. RL methods possess characteristics such as autonomous learning and strong adaptability, which can effectively learn and optimize in different environments and tasks. As such, they are well-suited for complex prefetcher control problems.

3.3 Applicability of reinforcement learning to multi-prefetcher control

In the context of coordinating multiple prefetchers, RL exhibits significant applicability. Firstly, its adaptive learning feature enables prefetch controllers to autonomously learn in a complex memory access environment, adapting to dynamic changes in the system. Simultaneously, a prefetch controller should be performance-driven, possessing the ability to adaptively coordinate different prefetchers throughout the system based on their impact on performance, providing robust performance improvements across varying workloads and system configurations. Secondly, RL’s online learning capability allows prefetchers to continuously receive rewards from the environment and iteratively optimize their strategies in real-time, eliminating the need for an expensive offline training phase. This aligns with the continuous learning requirement of prefetch controllers to adapt to evolving workloads and system conditions, ensuring consistent performance improvements.

Lastly, RL’s ease of implementation makes it an ideal choice. Compared to other complex machine learning models, RL models are relatively efficient in hardware implementation, featuring reasonable size and low latency, which makes them more accessible for adoption in practical processors. The SARSA algorithm, a simple yet effective variant of RL, offers low cost and low overhead, making it suitable for a lightweight prefetch controller in real-world implementations. Its simplicity not only reduces computational complexity but also facilitates effective coordination in a multi-prefetcher environment, providing a stable and reliable advantage for system performance.
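For reference, the entire learning step of SARSA is the single on-policy temporal-difference update below, where \(\alpha \) is the learning rate; the rule itself is standard [23], and RL-CoPref's particular hyperparameter values are not implied here:

$$\begin{aligned} Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_t + \gamma \, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right] \end{aligned}$$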

4 The basic principle of RL-CoPref

In this paper, we present an intelligent framework for multi-prefetcher control based on RL. The primary goal of our framework is to dynamically optimize the activation and prefetching degree of multiple prefetchers, thereby enhancing overall system performance and bandwidth utilization. Our scalable and adaptable framework is designed to accommodate diverse system configurations and workloads. Figure 1 provides an overview of our RL-based multi-prefetcher controller.

Fig. 1 High-level overview of the RL-based prefetcher controller

The controller’s objective is to select an optimal prefetching control strategy by interacting with the environment, aiming to maximize system performance and bandwidth utilization. Specifically, for each demand access, the state space vector in RL is constructed using information such as the program counter (PC), address, and other program features extracted from the access. The RL agent then chooses the optimal prefetching control action, represented by a value that corresponds to the activation and prefetch degree settings for all underlying prefetchers. This action involves determining which prefetchers to activate and the prefetch degree of each prefetcher. The prefetch degree determines the number of cache lines to prefetch on each occasion. Upon selecting the action, the RL agent utilizes a set of predicted addresses generated by multiple prefetchers, each employing its unique prefetching strategy. For each prefetching control action, the controller receives a reward value, evaluating the accuracy of prefetching with respect to the current DRAM bandwidth usage. This comprehensive RL-based multi-prefetcher controller aims to intelligently adapt to varying system demands and configurations, ultimately optimizing system performance.

4.1 Formulation of the RL-CoPref

We have formulated the prefetcher controller as a RL agent, and thus, we have defined a set of states and actions, along with an appropriate reward structure to guide the learning process.

4.1.1 State

In the development of RL-based multi-prefetcher controllers, the careful selection of the state space is crucial. We leverage detailed control-flow and data-flow information derived from memory accesses as integral components of the state space. Examples of control-flow information include the program counter (PC) or PC sequences, representing the program’s execution path. Data-flow information involves data dependencies within the program, illustrated by cacheline address and cacheline delta. The chosen state space features provide valuable insights into program execution and memory access, facilitating the acquisition of more effective prefetching strategies through RL algorithms.

We systematically evaluated various combinations of program features and identified the best-performing set (PC, Address, and PC+Delta) as the state vector. This selection enables the agent to infer data locality and dependencies. It is worth noting that these program features have been widely adopted in recent prefetching research [1, 2, 4, 11, 57].
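A minimal sketch of how such a state vector could be assembled per demand access is shown below; the delta definition and the folding of PC+Delta into a fixed width are our assumptions:

```python
def build_state(pc: int, addr: int, last_addr: int, block_size: int = 64):
    """Assemble the (PC, Address, PC+Delta) state features for one demand access."""
    line = addr // block_size
    delta = line - (last_addr // block_size)       # line delta w.r.t. the previous access
    pc_delta = (pc << 7) ^ (delta & 0x7F)          # fold PC with a 7-bit delta (width assumed)
    return (pc, line, pc_delta)
```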

4.1.2 Action

RL-CoPref leverages RL to learn the optimal actions for controlling and optimizing prefetch operations across multiple underlying prefetchers in various states. In the RL-based multi-prefetcher controller, the action space is defined as a set of selectable operations representing the mapping values for activation and prefetch degree of all underlying prefetchers. The activation state determines whether to execute a prefetch operation, while the prefetch degree of each prefetcher controls the amount of data retrieved from L3 cache. By adjusting the mapping values of activation state and prefetch degree, the intelligent agent can choose actual prefetch addresses from the predicted addresses provided by the underlying prefetchers based on their individual prefetching strategies. To minimize RL-CoPref’s storage overhead, we pruned the action list without significantly impacting performance.
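One possible encoding of such an action value is sketched below, assuming two underlying prefetchers and a small pruned set of allowed degrees (degree 0 meaning the prefetcher is deactivated); the exact mapping used by RL-CoPref is not specified here:

```python
# Candidate prefetch degrees per prefetcher; degree 0 means the prefetcher is off.
DEGREES = [0, 1, 2, 4]                              # illustrative pruned set

def decode_action(action_id: int, num_prefetchers: int = 2):
    """Map a single action index to one degree setting per underlying prefetcher."""
    settings = []
    for _ in range(num_prefetchers):
        settings.append(DEGREES[action_id % len(DEGREES)])
        action_id //= len(DEGREES)
    return settings

# With 2 prefetchers and 4 degree levels each, the action space has 16 entries.
print(decode_action(9))   # -> [1, 2]: prefetcher 0 at degree 1, prefetcher 1 at degree 2
```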

4.1.3 Reward

The design of reward values plays a crucial role in shaping the behavior and decision-making of the prefetching controller. In our proposed prefetching controller, the primary objective is to enhance performance by selecting an optimal set of prefetchers and determining their corresponding prefetch degree. To achieve this optimization, we introduce a numeric reward value to assess the controller’s behavior. This reward value is contingent on the success of the prefetching operation, marked as positive for prefetch hits and negative for prefetch misses.

However, variations exist in the reward values based on high and low bandwidth usage scenarios. Specifically, in instances of inaccurate prefetching operations, the negative reward value is smaller in high bandwidth usage scenarios compared to low bandwidth usage scenarios. Conversely, if the agent refrains from prefetching, the negative reward value in low bandwidth usage scenarios is relatively lower than in high bandwidth usage scenarios. This approach is strategically designed to enhance the performance and accuracy of the prefetching controller across different bandwidth usage scenarios, ultimately contributing to increased efficiency in DRAM bandwidth utilization.

RL-CoPref tracks memory bandwidth utilization by employing a straightforward counter situated at the memory controller. This counter keeps track of DRAM column access (CAS) commands over a time window spanning 4\(\times \) tRC cycles, where tRC represents the minimum interval between two DRAM row activations. To introduce hysteresis into the tracking mechanism, the counter is halved at the conclusion of each window. The peak DRAM bandwidth and the maximum potential count of CAS commands within each tRC window are determined by the number of channels and the width of each channel. This counter is further categorized into quartiles (25%, 50%, and 75%) of peak bandwidth. If the counter exceeds 75% of the peak value, it indicates high bandwidth usage; otherwise, it signifies low usage.
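A behavioral sketch of this bandwidth tracker is given below; the window length of 4×tRC and the halving at each window end follow the text, while the class structure and quartile helper are illustrative:

```python
class BandwidthTracker:
    """Count DRAM CAS commands over 4*tRC windows and classify bandwidth usage."""
    def __init__(self, peak_cas_per_window: int):
        self.peak = peak_cas_per_window      # derived from channel count and channel width
        self.count = 0

    def on_cas(self):
        self.count += 1                      # one column access observed at the memory controller

    def on_window_end(self):
        self.count //= 2                     # halve the counter for hysteresis

    def usage_quartile(self) -> int:
        """Return 1..4, the quartile of peak bandwidth currently in use."""
        frac = self.count / self.peak
        if frac > 0.75:
            return 4                         # high bandwidth usage
        if frac > 0.50:
            return 3
        if frac > 0.25:
            return 2
        return 1

    def is_high(self) -> bool:
        return self.usage_quartile() == 4
```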

5 RL-CoPref framework

Figure 2 presents an overview of the proposed RL-CoPref. Initially, each demand request is handled by a preprocessing step, where a state vector is generated, which contains pertinent information regarding PC, address, and other program features, as elaborated in Sect. 4.1.1. An agent, based on the Q-Value Vault, selects the best possible action based on the current state vector. To facilitate the training of the model, a sampling buffer is designed to collect state transitions. Since the action and reward are asynchronous, a delayed sampling mechanism is employed for Q-table training. The reward is derived from future prefetch hits/misses, as detailed in Sect. 5.2.2.

Fig. 2 Overall design of RL-CoPref

5.1 RL-based multi-prefetcher control algorithm

Algorithm 1 shows the SARSA-based RL algorithm used to control multiple prefetchers. The action space of the algorithm is the mapping between the prefetch activation status and prefetch degree of all underlying prefetchers. The algorithm starts with an initialization of the Q-table. In each iteration, the algorithm uses program features, which include the PC, memory address, and PC+Delta, as state information to select an action. The action selection process is guided by the epsilon-greedy policy, which selects a random action with probability epsilon and otherwise selects the action with the highest Q-value.

Algorithm 1 Train and Predict Algorithm

The algorithm then calculates a reward based on prefetch hits/misses and main memory bandwidth utilization. The reward value is positive if the prefetch hits and negative if it misses, and the magnitude of the negative reward under high bandwidth utilization differs from that under low bandwidth utilization, as described in Sect. 4.1.3. Moreover, if no prefetch is issued, the algorithm also assigns a (bandwidth-dependent) negative reward. After receiving the reward, the algorithm updates the Q-value based on the difference between the expected and actual rewards. The updated Q-value is used in the next iteration to select actions. The algorithm iteratively performs these steps until convergence.
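A condensed, behavioral sketch of the per-access SARSA step in Algorithm 1 is shown below; the hyperparameters and the dictionary-backed Q-table are illustrative stand-ins for the tile-coded Q-Value Vault described in Sect. 5.2.1:

```python
import random

def select_action(q, state, actions, epsilon=0.1):
    """Epsilon-greedy selection over the pruned action list."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: q.get((state, a), 0.0))  # exploit

def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One on-policy SARSA update of a dictionary-backed Q-table."""
    td_target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = q.get((state, action), 0.0) + \
        alpha * (td_target - q.get((state, action), 0.0))
```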

5.2 Detailed design of RL-CoPref

We first present the design of the Q-Value Vault (Sect. 5.2.1), and then discuss the assignment of rewards and the updating of Q-values in RL-CoPref for coordinating multiple prefetchers (Sect. 5.2.2).

5.2.1 Q-value vault

Because the PC and address information of memory accesses are effectively continuous variables, directly using them as state-space elements would lead to a very large state space and make the algorithm difficult to handle. Therefore, we use tile coding to divide the continuous feature space into multiple discrete regions, where each region is called a tile. Each tile is represented by a two-dimensional table over the discretized feature space. We partition each feature into a group of tiles, with each tile representing a discrete region, so that together the tiles cover all possible values of the feature space.

The Q-values for a feature-action pair \((j, a)\) can be represented as the sum of the Q-values over all tiles corresponding to feature j, where feature j is tiled into \(m_j\) tiles, and the Q-value for the ith tile is denoted as \(Q_{ij}(s,a)\) for \(i \in \{1,2,\ldots ,m_j\}\). Mathematically, we can express the Q-value of the feature-action pair, \(Q_j(s,a)\), as in (1):

$$\begin{aligned} Q_j(s,a) = \sum _{i=1}^{m_j} Q_{ij}(s,a) \end{aligned}$$
(1)

We use the max-Q algorithm to compute the \(Q(s,a)\) value for a state-action pair \((s, a)\), which is defined as the maximum of the Q-values over all feature-action pairs, as in (2):

$$\begin{aligned} Q(s,a) = \max \limits _{j=1}^{k} \left\{ \sum \limits _{i=1}^{m_j} Q_{ij}(s,a) \right\} \end{aligned}$$
(2)

where k represents the number of features, \(m_j\) represents the number of tiles for feature j, and \(Q_{ij}(s,a)\) represents the Q-value for the ith tile of feature j and action a. In the computation of Q-values, a pipelined approach can be employed to accelerate the process. This approach exploits the independence of the Q-value summation across the tiles of each feature. Specifically, the Q-values of the tiles belonging to a feature can be summed by parallel arithmetic units to produce the Q-value of the feature-action pair \(Q_j(s,a)\). This hardware design is characterized by high parallelism and efficiency, which allows a large number of Q-values to be computed in a short period of time, thereby improving the efficiency and performance of the algorithm.
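A software sketch of Eqs. (1) and (2) is given below; the tile counts, table sizes, and the hashing of a feature value into a tile bin are illustrative, and the loops correspond to the adders that a hardware implementation would operate in parallel:

```python
import numpy as np

class QValueVault:
    """Tile-coded Q storage: one small table per (feature, tile), as in Eqs. (1)-(2)."""
    def __init__(self, num_features, tiles_per_feature, bins_per_tile, num_actions):
        # q[j][i] is the table for tile i of feature j, indexed by (bin, action).
        self.q = [[np.zeros((bins_per_tile, num_actions))
                   for _ in range(tiles_per_feature)]
                  for _ in range(num_features)]
        self.bins = bins_per_tile

    def _bin(self, feature_value, tile_id):
        # Offset each tile's mapping slightly so tiles overlap, then wrap into a bin.
        return (feature_value + tile_id) % self.bins

    def q_feature(self, j, feature_value, action):
        """Eq. (1): sum the Q-values of all tiles of feature j."""
        return sum(tile[self._bin(feature_value, i), action]
                   for i, tile in enumerate(self.q[j]))

    def q_state(self, state, action):
        """Eq. (2): maximum feature-action Q-value over all features."""
        return max(self.q_feature(j, v, action) for j, v in enumerate(state))
```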

Figure 3 illustrates the high-level organization of the Q-Value Vault, and how it retrieves a Q-value for a given state and action. Each state is represented as a vector, with each element indicating which tile the state falls into. In the initial stage, RL-CoPref computes the index for each tile and each constituent feature of the given state vector. In the second pipeline stage, RL-CoPref uses the feature indices and an action index to retrieve the \(Q_{ij}(s,a)\) values from each tile. In the third pipeline stage, RL-CoPref sums up the \(Q_{ij}(s,a)\) values to get the feature-action Q-value for each constituent feature.

Fig. 3 Overall design of Q-value vault

In the fourth pipeline stage, RL-CoPref computes the maximum of all feature-action Q-values to get the state-action Q-value. The final step involves comparing the retrieved state-action Q-value with the maximum state-action Q-value found so far, and updating the maximum Q-value, if the retrieved value is greater.

5.2.2 Assigning rewards and updating Q-values

To track the usefulness of prefetch requests, RL-CoPref maintains a first-in-first-out list of recently taken actions, along with their corresponding prefetch addresses, in the sampling buffer. Every prefetch action is inserted into the sampling buffer. Because RL-CoPref cannot always immediately assign a reward to a taken action, since the usefulness of the corresponding prefetch request is not yet known, it stores both the action and the prefetch addresses in the sampling buffer so that rewards can be assigned to them later. A reward is assigned to every sampling buffer entry before or when it is evicted from the buffer. During eviction, the reward and the state-action pair associated with the evicted entry are used to update the corresponding Q-value in the Q-Value Vault. This process is referred to as delayed sampling.

RL-CoPref adopts a RL methodology to assign rewards to each sampling buffer entry, which comprises three distinct scenarios. The first scenario involves assigning an immediate reward during the buffer insertion phase, where RL-CoPref assigns an immediate reward \(R_{NH}\) or \(R_{NL}\) to the corresponding sampling buffer entry based on the current system memory bandwidth usage, if it chooses not to issue a prefetch. The second scenario occurs during the buffer residency period, where rewards are assigned based on the prefetching action’s performance. During a sampling buffer entry’s residency period, a positive reward \(R_{P}\) is assigned if the prefetch address stored in the entry matches the address of a demand request. The third scenario occurs during the buffer eviction phase, where the reward is assigned based on the corresponding prefetch address’s demand status. If a reward is not assigned to a sampling buffer entry until eviction, a negative reward \(R_{NAH}\) or \(R_{NAL}\) is assigned based on whether the current system memory bandwidth usage is high or low. Through these three scenarios, RL-CoPref’s reward assignment system aims to optimize the prefetching mechanism’s performance and maximize the system’s memory bandwidth utilization.
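The three reward scenarios can be sketched as follows; the reward names mirror those in the text, while their magnitudes, the buffer capacity, and the vault interface are illustrative assumptions:

```python
from collections import deque

R_P   = 20    # prefetch hit
R_NH  = -2    # no prefetch issued, high bandwidth usage
R_NL  = -4    # no prefetch issued, low bandwidth usage
R_NAH = -14   # prefetch never demanded before eviction, high bandwidth usage
R_NAL = -8    # prefetch never demanded before eviction, low bandwidth usage

class SamplingBuffer:
    """FIFO of (state, action, prefetch addresses, reward) entries awaiting delayed updates."""
    def __init__(self, capacity, vault):
        self.buf = deque()
        self.capacity = capacity
        self.vault = vault                   # exposes update(state, action, reward)

    def insert(self, state, action, prefetch_addrs, bandwidth_high):
        reward = None
        if not prefetch_addrs:               # scenario 1: agent chose not to prefetch
            reward = R_NH if bandwidth_high else R_NL
        if len(self.buf) == self.capacity:
            self._evict(bandwidth_high)
        self.buf.append([state, action, set(prefetch_addrs), reward])

    def on_demand(self, addr):
        for entry in self.buf:               # scenario 2: a resident prefetch proved useful
            if entry[3] is None and addr in entry[2]:
                entry[3] = R_P

    def _evict(self, bandwidth_high):
        state, action, _, reward = self.buf.popleft()
        if reward is None:                   # scenario 3: still unrewarded at eviction
            reward = R_NAH if bandwidth_high else R_NAL
        self.vault.update(state, action, reward)
```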

6 Experiments

6.1 Experimental settings

To evaluate the performance of RL-CoPref, we employ a simulation framework based on ChampSim [58], which has been released by the second JILP Cache Replacement Championship (CRC2). The simulated system models a processor with a 4-wide out-of-order execution, an 8-stage pipeline, and a three-level cache hierarchy. The configuration parameters of the simulated system are provided in Table 1.

Table 1 Baseline configuration

Hardware Prefetcher We use Best-offset Prefetcher (BO) and Managed Irregular Stream Buffer Prefetcher (MISB) as the underlying prefetchers in the RL-CoPref prefetching control framework. BO and MISB are state-of-the-art spatial and temporal prefetchers, respectively. BO algorithm is a spatial prefetcher that predicts memory requests based on the spatial locality of previously accessed data. It tries to find the optimal prefetching offset by testing a list of deltas. MISB identifies correlated addresses within a PC-localized stream and learns temporally correlated memory accesses based on them. It manages the movement of metadata between the on-chip metadata caches and off-chip metadata storage using a metadata prefetcher.

Ensemble Prefetch To provide a comparison to our framework, we consider the Sandbox Prefetcher (SBP) [22], which is a state-of-the-art non-RL ensemble prefetcher. SBP uses a Bloom filter to evaluate the accuracy of multiple offset prefetchers at runtime. However, the greedy strategy of SBP limits its ability to quickly adapt to changing access patterns. To address this limitation, we propose a modified version of SBP, which we refer to as the Extended Sandbox Prefetcher (ESBP). ESBP extends the candidate set from offset prefetchers to all types of prefetchers and selects the prefetcher with the highest recent prefetching accuracy. We further enhance ESBP with a greedy parameter \(\epsilon \), which determines the probability of exploration. Specifically, with a probability of \(1 - \epsilon \), ESBP selects the best prefetcher based on the most recent prefetch accuracy, while with probability \(\epsilon \) it randomly selects a prefetcher from all available options. We evaluate the performance of RL-CoPref against ESBP and other state-of-the-art prefetchers in our experiments.
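The ESBP selection rule reduces to a simple epsilon-greedy choice over candidate prefetchers, as sketched below (the accuracy bookkeeping over a recent window is omitted for brevity):

```python
import random

def esbp_select(prefetchers, recent_accuracy, epsilon=0.1):
    """Pick a prefetcher: explore with probability epsilon, otherwise exploit
    the one with the highest recent prefetch accuracy."""
    if random.random() < epsilon:
        return random.choice(prefetchers)
    return max(prefetchers, key=lambda p: recent_accuracy[p])

# Example: BO has been more accurate recently, so it is usually chosen.
print(esbp_select(["BO", "MISB"], {"BO": 0.62, "MISB": 0.41}))
```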

Benchmarks We evaluate RL-CoPref on the SPEC CPU2006 and SPEC CPU2017 [59] benchmarks, using the SimPoint traces provided by DPC-3 [60]. We warm up the caches for 100 M (100 million) instructions and evaluate performance over the next 500 M instructions. Figures 4 and 5 show the performance of the different prefetchers for all traces; the results show that RL-CoPref provides a 32.09% improvement over the baseline LRU.

6.2 Experiment results

6.2.1 Single-core performance

We evaluate the effectiveness of our proposed coordinated prefetching controller (i.e., RL-CoPref) for multiple prefetchers. In our evaluation, we compare RL-CoPref with other state-of-the-art prefetchers, including BO Prefetcher, MISB Prefetcher, and ESBP. Figure 4 shows the performance improvement of each prefetch method. RL-CoPref provides the best performance across all prefetch methods on average. Compared with the baseline LRU, RL-CoPref improves the performance by 32.09% on average. RL-CoPref provides 4.68%, 10.88%, and 2.87% higher performance improvement on average over the BO, MISB, and ESBP prefetch methods, respectively.

The adjustment policies of ESBP are based on selecting the prefetcher with the highest recent prefetching accuracy, whereas RL-CoPref applies RL to find the prefetch control policy with the greatest long-term performance benefit. The effectiveness of RL-CoPref stems from its ability to learn a coordinated prefetching strategy that accounts for the interactions between the prefetchers and the cache, whereas ESBP's approach lacks such coordination and may select a suboptimal prefetcher. Furthermore, RL-CoPref's ability to continuously adapt to changing workloads and system conditions also enables it to outperform ESBP.

Fig. 4 Performance improvement of RL-CoPref and state-of-the-art prefetchers

Fig. 5 Prefetch coverage of RL-CoPref and state-of-the-art prefetchers

Figure 5 shows the single-core prefetch coverage of each configuration in the LLC. Compared with the rule-based prefetch controller, RL-CoPref improves prefetch coverage. On average, RL-CoPref provides 7.07%, 17.44%, and 2.46% higher coverage than BO, MISB, and ESBP, respectively. RL-CoPref achieves higher prefetch coverage by learning to coordinate multiple prefetchers and adaptively adjusting their prefetch degrees based on feedback from the cache. This enables the controller to exploit the strengths of each prefetcher while mitigating its weaknesses, leading to a more effective and efficient use of the prefetching resources. The result is a more robust and adaptable prefetching strategy that responds to changes in the workload and system conditions, yielding higher prefetch coverage and improved performance.

6.2.2 Multi-core performance

To evaluate the performance of RL-CoPref in a multi-core environment, we compare the performance results of different algorithms using the Weighted Speedup (WS) metric. WS takes into account enhancements in multi-core performance, considering the impact of each core and providing a holistic view of system efficiency. For a multi-core mixed workload, we randomly select four distinct traces from our trace list and execute one trace on each core. The WS calculation formula is given by:

$$\begin{aligned} \text {WS} = \sum _{i=0}^{N-1} \frac{{\text {IPC}_\textrm{together}}_i}{{\text {IPC}_\textrm{alone}}_i} \end{aligned}$$
(3)
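For example, Eq. (3) amounts to the following computation over per-core IPC values (the numbers below are hypothetical and only illustrate the metric):

```python
def weighted_speedup(ipc_together, ipc_alone):
    """Eq. (3): sum of per-core IPC ratios (co-run IPC over standalone IPC)."""
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))

# A hypothetical 4-core mix where each core keeps 80-95% of its standalone IPC.
print(weighted_speedup([1.9, 0.8, 1.2, 0.45], [2.0, 1.0, 1.5, 0.5]))  # 3.45
```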

Figure 6 illustrates the corresponding benefits of WS normalized to no prefetching. RL-CoPref showed an additional average enhancement of \(14.99\%\), \(5.44\%\), \(1.42\%\), and \(1.26\%\) in WS compared to baseline, MISB, BO, and ESBP, respectively. This improvement can be attributed to RL-CoPref’s ability to adaptively respond to changing workloads, leveraging its reinforcement learning techniques for efficient prefetching coordination across multiple cores.

Fig. 6 WS improvement normalized to baseline

6.2.3 Performance evaluation of RL-CoPref with more underlying prefetchers

To evaluate the performance of RL-CoPref in the presence of more underlying prefetchers, we conducted experiments in which two additional prefetchers, an IP-based stride prefetcher and the Domino prefetcher, were added on top of the existing BO and MISB prefetchers. The IP-based stride prefetcher detects access patterns within a spatial region based on the instruction pointer and generates prefetch requests following this pattern to reduce the number of cache misses. The Domino prefetcher is a temporal data prefetching technique designed to improve the effectiveness of existing temporal prefetchers; it overcomes the limitations of existing lookup mechanisms by logically looking up the history with both the one and two most recent miss addresses to find a match for prefetching. Using more prefetchers to test RL-CoPref verifies its ability to coordinate among multiple prefetchers and optimize their interactions with the cache, and evaluates whether its performance is maintained in more complex prefetching environments.

Our experiments show that RL-CoPref outperforms the baseline LRU policy, all individual prefetchers, and the state-of-the-art ESBP prefetcher. Figure 7 shows the performance improvement of each prefetch method. Specifically, RL-CoPref improves performance by 35.50% over the LRU baseline, 5.91% over BO, 16.54% over MISB, 7.87% over the Domino prefetcher, 14.38% over the IP-based stride prefetcher, and 4.64% over the ensemble prefetching controller ESBP.

Moreover, as the number of underlying prefetchers increases, RL-CoPref shows better performance. This improvement stems from its ability to coordinate among multiple prefetchers and optimize their interactions with the cache. With the growing complexity of modern computer systems, the ability to coordinate multiple prefetchers becomes increasingly important for prefetching efficiency. Therefore, RL-CoPref is a promising solution for prefetching in modern computer systems.

Fig. 7 Performance improvement of RL-CoPref and state-of-the-art prefetchers

Fig. 8 Prefetch coverage of RL-CoPref and state-of-the-art prefetchers

Figure 8 shows the prefetch coverage of each prefetch method. Specifically, RL-CoPref achieves 76.15% higher prefetch coverage than the LRU baseline (which performs no prefetching), 22.23% higher than BO, 23.07% higher than MISB, 20.62% higher than the Domino prefetcher, 32.36% higher than the IP-based stride prefetcher, and 9.03% higher than ESBP. As the number of underlying prefetchers increases, the multi-prefetcher controller can leverage more historical data to train and update its prefetching policies, while better sensing dynamic changes in the environment. This enables the multi-prefetcher controller to more accurately predict data access patterns and thus achieve higher prefetch coverage in practical applications.

6.3 Sensitivity to the LLC size

Figure 9 shows the performance improvement of the BO Prefetcher, MISB Prefetcher, ESBP, and RL-CoPref averaged over all traces while varying the LLC size from 256 KB to 4 MB.

Fig. 9 Average performance improvements with varying LLC size

It can be observed that RL-CoPref consistently outperforms BO, MISB, and ESBP across the different LLC size configurations. For a 256 KB (4 MB) LLC, RL-CoPref improves performance over BO, MISB, and ESBP by 6.42% (3.40%), 8.73% (9.91%), and 1.34% (2.23%), respectively. RL-CoPref outperforms the other prefetchers because of its adaptive management of prefetcher activation and prefetch degree based on system-level feedback, which provides robust performance benefits across different LLC sizes.

6.4 Performance scaling with memory bandwidth

To evaluate the effectiveness of RL-CoPref under different DRAM bandwidth configurations, we perform experiments in which the DRAM bandwidth is scaled. Figure 10 shows how the performance improvement of each configuration scales as the DRAM bandwidth increases from 600 MTPS to 9600 MTPS. Each of the bandwidth configurations depicted in Fig. 10 corresponds approximately to the per-core DRAM bandwidth available in various commercially available processors, such as the Intel Core i9 [61], AMD EPYC Rome [62], and AMD Ryzen Threadripper [63]. RL-CoPref maintains better performance than the other prefetchers across this range of DRAM bandwidth configurations. As the DRAM bandwidth scales, RL-CoPref optimizes the utilization of the underlying prefetchers by dynamically adjusting its prefetching policy to the current environment, which leads to consistent performance improvements across the various DRAM bandwidth configurations.

Fig. 10 Average performance scaling with DRAM bandwidth

6.5 Storage overhead

The storage overhead of RL-CoPref is 38.3KB, with the Q-Value Vault accounting for 36KB and the sampling buffer accounting for 2.3KB. This low-cost design makes RL-CoPref hardware-friendly and easily implementable. A comprehensive analysis of the storage overhead is provided in Table 2.

Table 2 Storage overhead of RL-CoPref

7 Conclusion

In this paper, we propose RL-CoPref, a coordinated prefetching controller that leverages RL to dynamically select among multiple prefetchers based on program context information. Our approach incorporates the tile coding technique to effectively address the challenge of mixed memory access patterns, giving it strong adaptability and fast learning. Through extensive evaluations using the ChampSim simulator, RL-CoPref demonstrates the best performance over both state-of-the-art individual prefetchers and the ensemble prefetcher ESBP, achieving an average prefetch coverage of 76.15% and a 35.50% IPC improvement. In our future work, we will investigate hardware implementation optimizations, budget sensitivity, and ensemble prefetching for multi-core architectures in the context of RL-CoPref.