Meta-reinforcement learning for edge caching in vehicular networks

Content caching in local repositories closer to the user can significantly reduce the load on the backbone network, reduce latency, and improve reliability. Nevertheless, proactive caching in fast-varying environments, especially vehicular networks, faces many challenges, including content popularity that changes over time as well as with the set of requesting vehicles associated with the roadside units (RSUs). Learning techniques, especially reinforcement learning (RL), play a significant role in caching. Nevertheless, faster adaptation toward the optimal policy is still needed given the dynamic nature of vehicular caching. In this paper, we propose a meta-reinforcement learning (meta-RL) algorithm for proactive caching and cache replacement based on model-agnostic meta-learning (MAML) that can learn and adapt to new tasks faster and can improve the overall hit rate. Simulation results show that the proposed meta-RL algorithm improves the efficiency of the content caching system and provides faster convergence. The proposed meta-RL exhibits particularly superior performance when the popularity of cached data changes.


Introduction
With the increased interest in autonomous vehicles and the proliferation of the internet of things (IoT) and the internet of vehicles (Ji et al. 2020), delivering internet services and data to moving vehicles is gaining increased attention. The deployment of road-side units (RSUs) with multi-access edge computing (MEC) capabilities opens the door for content caching for moving vehicles. For timely cache delivery, caching systems opt for proactive caching, where popular contents are pre-cached from the remote content servers to edge nodes, whether base stations (BSs) or RSUs, within proximity of the end-users. In proactive caching, if the requested data resides within the local cache, it can be readily delivered to the user with a small delay (Wang et al. 2017). Otherwise, the request is forwarded to the original content server. Data retrieved from the content server can then replace less popular content (content with fewer requests) within the local cache.
One of the main drivers of the edge caching concept is the statistical observation that popular content, i.e., content that is highly requested by users over a brief time span, is limited. Therefore, caching the most popular content at the edge can relieve the traffic burden on the backhaul network (Luong et al. 2019). Nevertheless, the gain of proactive caching is heavily related to the caching policy, the size of the available cache, and the popularity of the content. Social media content, for example, has a non-stationary nature and a short popularity duration (Lobzhanidze et al. 2013). The popularity of content such as videos on Facebook is limited to ∼ 2 hours, while tweeted videos on Twitter have a lifetime of 18 min (Somuyiwa et al. 2018). This, in turn, places a burden on the caching policy, which should be dynamic and continuously updated. In addition, the limited caching quota at the RSUs adds to the challenge: the caching policy must carefully choose the content that maximizes the fraction of requests served from the cache without accessing the core network, a fraction referred to as the hit rate. Another challenge is the mobility of the vehicles (Mohammadnezhad and Ghaffari 2019; Ghaffari 2020), which calls for a fast-learning algorithm capable of leveraging previous experience to adapt within a few learning steps to the most suitable caching policy in response to the high arrival and departure rates of the passing vehicles.
Learning techniques have been extensively used for proactive caching, e.g., Ji et al. (2020); Wang et al. (2017). In Varanasi and Chilukuri (2019), a differential caching technique for vehicular networks called FlexiCache is presented to address quality of service (QoS) requirements; it uses kernel ridge regression (KRR) to predict the percentage of the cache allocated to different traffic types. In Saputra et al. (2019), two deep learning (DL) frameworks for cooperative caching in the edge network are proposed. The first approach is based on a centralized DL algorithm that collects the log files from the edge nodes, while the second is a distributed learning model in which edge nodes learn the demand from their local data and share only the trained model with the central node. In the same context, federated learning has been introduced to proactive caching in Yu et al. (2018), where the popularity of the content is updated hierarchically without the need to access individual user access data. In addition, RL has been heavily utilized to adapt to new environments. In Hu et al. (2018), deep Q-learning is used to calculate the parameters of the cache placement, the resource allocation, and the possible vehicle-to-vehicle (V2V) and vehicle-to-RSU links. The model in Hu et al. (2018) assumes cache storage is available at both vehicles and RSUs, with different sizes at the respective caching points. The work in Somuyiwa et al. (2018) targets a threshold-based cache replacement scheme, where RL is used to optimize the long-term average energy cost; this cost is directly related to the number of downloads, which is, in turn, related to the dynamic popularity as well as the channel quality between the user and the caching unit. Recent work based on federated reinforcement learning is proposed in Majidi et al. (2021), where several edge caching units cooperate to predict user requests and determine the suitable placement policy. However, this temporal feature extraction method provides only weak awareness of the large-scale edge environment (Xu and Li 2021).
Recently, meta-learning has been proposed as a new direction in machine learning (Finn et al. 2017). Meta-learning principally targets the cognitive experience of learning to learn from previous tasks to achieve faster adaptation to newer tasks. This translates directly into shorter training periods with fewer samples, which suits the dynamic nature of caching in the vehicular scene. The seminal work in Finn et al. (2017) laid the foundation for adapting meta-learning to reinforcement learning in the framework of meta-reinforcement learning (meta-RL). The combination of meta-learning and reinforcement learning enables the agent to learn a general policy that can be adapted to different content distributions. In this paper, we propose a meta-reinforcement learning scheme for proactive content caching and cache replacement to improve the caching capability and reduce latency while increasing the cache hit rate. We adopt two main algorithms: the Deep Deterministic Policy Gradient (DDPG) RL algorithm as the base learner and Model-Agnostic Meta-Learning (MAML) (Finn et al. 2017) as the meta learner. DDPG is an off-policy actor-critic RL algorithm used here to learn the long-term caching policy and predict the popularity of the requested content. DDPG has proved to be an efficient RL algorithm, especially for large discrete or continuous action spaces, i.e., the action space that contains all requested and cached contents. Different from other attempts to improve DDPG, e.g., the work in Chen et al. (2021) that improves the convergence speed of DDPG by adding a convolutional neural network and long short-term memory, we opt for model-agnostic meta-learning, which has the inherent concept of learning from one task to another in a few gradient steps and can be generalized to different problems without being specific to a certain environment or system model assumptions (Finn et al. 2017). The MAML algorithm aims to generalize the learning parameters across various tasks, which enables the RL agent (DDPG) to rapidly learn from different experiences and better adapt to new environments. This has a significant effect on the convergence speed as well as on reducing the number of required floating-point operations (FLOPs). The main contributions of this paper can be summarized as follows:

1. We present MAML-DDPG-based proactive content caching, which enables two major benefits: (a) learning from experience and adapting to the dynamic nature of content popularity, which continuously changes with time, and (b) sample efficiency, i.e., learning from a few data examples.

2. We develop a system model to test the MAML-DDPG-based proactive caching scheme, consisting of a single BS and vehicular users. The framework aims at constructing a global model that selects the most popular files.

3. The proposed work is tested on two different datasets to assess the MAML-DDPG algorithm against selected state-of-the-art reference models, namely DDPG and deep Q-learning as deep learning methods, as well as legacy classical caching approaches, namely First In First Out (FIFO), Least Recently Used (LRU), and Least Frequently Used (LFU). Simulation results verify that the proposed MAML-DDPG outperforms the other algorithms in terms of cache efficiency and cache hit rate, in addition to speed and complexity.
The rest of the paper is organized as follows: the related work is described in Section 2. Section 3 introduces the system model. In Section 4, we discuss the proposed meta-RL scheme in detail. The performance evaluation and experimental results are presented in Section 5. Finally, the paper is concluded in Section 6.

Related work
In the classical caching method (Fares et al. 2012), the cache storage is updated statically based on legacy rules such as FIFO, LRU, and LFU. Nevertheless, classical placement methods have limitations considering the dynamic nature of content popularity. Several works have addressed this dynamic nature. In , the authors considered collaborative caching to optimize content distribution in vehicular networks, considering two classes of cached data, namely location-based and popularity-based caches.
On the other hand, deep learning has recently become a major enabler in different aspects of wireless communications, e.g., Chen et al. (2019); Sun et al. (2019). A new trend of designing more efficient popularity-based proactive caching has emerged, e.g., Anokye et al. (2020); Nomikos et al. (2021). The work in Song et al. (2021) focuses on building quality-of-experience edge caching based on a class-based user interest model, which is specifically suitable for a large number of small files, like short videos. Reinforcement learning has been among the most successful learning approaches that can learn the caching strategy through updating a reward. More specifically, the work presented in Zhong et al. (2018) presents a deep reinforcement learning (DRL) scheme based on the Wolpertinger architecture. The Wolpertinger architecture introduces an action embedding layer between the actor and critic modules, realized by k-nearest-neighbour search, to explore the possible actions and exclude poor decisions. The reported simulation results indicate that the long-term hit rate achieved by the proposed DRL outperforms that of the FIFO scheme. Another work presented in Hou et al. (2018) proposes Q-learning as the reinforcement learning policy for non-safety data caching in vehicular networks and focuses on improving the overall latency of transmitting data to the moving vehicles. The proposed system entails a group of RSUs, and the algorithm decides on both the index of the RSU that will communicate the data to the vehicle and the number of caches to be assigned. Nevertheless, the vehicular network is highly dynamic due to the departure of current vehicles and the arrival of new users associated with the caching RSU or BS. As such, to update the system more efficiently and adapt to new tasks, the learning technique involved should have the capability of learning to learn and adapting to new tasks of changing popularity. Meta-learning can provide such an ability to adapt to the vehicular caching dynamics.
In the domain of applying meta-learning to proactive caching, the authors of Thar et al. (2019) proposed a meta-learning framework based on finding the best deep learning model for content popularity prediction and updating the model whenever its performance degrades. However, their model is computationally expensive and cannot respond quickly to degradation in model performance; it also does not consider vehicle mobility.

System model
The system model of the MAML-DDPG content caching scheme is presented in Fig. 1. The system model considers a single edge BS node with caching capability limited to C files. A mobile user within the BS coverage range can access the BS with requests for data. If the requested data resides within the edge cache, this corresponds to a cache hit. Otherwise, i.e., in the case of a miss, the BS retrieves the data from the central content server. We assume that the BS is connected to the original content server through a reliable link, so that packet loss is not considered and the penalty of a miss is limited to the latency encountered in retrieving the data from the central server.
In our system model, each BS has a cache storage capacity $C$ and receives a sequence of requests $R^{eq} = \{R_1, R_2, \ldots, R_t, \ldots\}$, where $R_t$ is the content requested at time slot $t$, identified by a unique ID. We also assume that all contents, each with a unique ID, have the same size (Zhong et al. 2018). For each content request, the meta-RL agent determines the caching and replacement policy. We model this problem as a Markov decision process (MDP). From the caching point of view, each content can be either cached or not cached; this categorization is also referred to as the cache state. The cache state can be altered by a caching decision involving the content under investigation and another selected content, resulting in either swapping the states of the two contents or keeping them unchanged.

Proposed MAML-DDPG scheme
One key challenge in cache placement is to cope with the dynamic nature resulting from changing popularity, either over time or as current users leave the caching BS coverage area and new users with different content preferences enter it. Reinforcement learning provides the ability to learn through interaction and reward and has been utilized extensively in caching (Nomikos et al. 2021). Introducing meta-learning into reinforcement learning provides a powerful hierarchical learning model, where a meta learner above the base reinforcement learner provides faster adaptation to new tasks (popularity profiles) with few-shot learning, i.e., only a few samples are needed to reach an optimized reward. In the following, we first describe the Markov decision process underlying the reinforcement learning model and then explain the proposed meta-reinforcement learning scheme in detail.

Markov decision process for the system model
In this section, we formalize the reinforcement learning problem as a Markov decision process (MDP) similar to the one introduced in Zhong et al. (2018), in which we define the state space $\mathcal{S}$, the action space $\mathcal{A}$, and the reward $r$ for the cache replacement policy as follows.

State space: We consider both the cached contents $C$ and the requested content $R^{eq}$ as the basis of the state $\mathcal{S}$. The state $s_t$ at each decision epoch $t$ is represented by $(C_t, R^{eq}_t)$, where $t = \{1, 2, \ldots, T\}$. In order to avoid a high-dimensional state space, we define the row vector $\mathbf{f}_i = (f^s_i, f^m_i, f^l_i)$ to represent the number of requests for content $i$ over the short-term, medium-term, and long-term time scales, such that

$$s_t = (\mathbf{f}_0, \mathbf{f}_1, \ldots, \mathbf{f}_C),$$

where $f^x_0$, with $x = s, m, l$, is the short-term, medium-term, or long-term feature of the currently requested content, and $f^x_i$ is, similarly, the short-term, medium-term, or long-term feature of the $i$-th cached content, $i = \{1, 2, \ldots, C\}$, where $C$ is the cache capacity at the BS. The state features are calculated offline and then updated online as the time index progresses in the caching algorithm.
Action space: We define $\mathcal{A} = \{0, 1, 2, \ldots, C\}$ as the action space. To limit the action space, we assume that at most one item can be swapped out of the cache storage at each decision epoch $t$. Let $a_t \in \mathcal{A}$ be the action taken at time $t$, which transitions state $s_t$ to $s_{t+1}$. For every epoch $t$, the action $a_t$ can take one of $C + 1$ possible values. The case $a_t = 0$ means that the currently requested content is not stored in the cache, and the current cache is left unchanged. When $a_t \in \{1, 2, \ldots, C\}$, the action replaces the cached content whose index equals the action value with the newly requested content.
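To make the action semantics concrete, the following is a minimal Python sketch, not taken from the paper, of how such an action could be applied to the cache contents; the function name and data layout are illustrative assumptions.

```python
def apply_action(cache, requested_id, action):
    """Apply a cache replacement action a_t in {0, 1, ..., C}.

    cache        : list of content IDs currently cached (length C)
    requested_id : ID of the content requested at this epoch
    action       : 0 keeps the cache unchanged; k >= 1 replaces the k-th
                   cached content (1-indexed) with the requested content
    """
    if action == 0:
        return cache                      # serve from the server, cache untouched
    new_cache = list(cache)
    new_cache[action - 1] = requested_id  # swap out the selected content
    return new_cache

print(apply_action([11, 22, 33], requested_id=44, action=2))  # [11, 44, 33]
```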
Reward: One popular reward is the cache hit rate, which we utilize as the objective of the proposed framework; it is one of the most effective measures of caching success in practice. The hit rate can be measured over the short term and the long term. The short-term hit rate at time $t$ refers to successful content retrieval at the next time epoch $t + 1$; consequently, the short-term reward $r^s_t$ at time $t$ can take only two values, $r^s_t \in \{0, 1\}$. The long-term reward, on the other hand, looks at a longer time span; in this work, we set the long-term span to 100 epochs, so that $r^l_t \in \{0, 1, \ldots, 100\}$. Defining the total reward as the weighted sum of the short-term and long-term rewards, we can express the total reward at time $t$ as

$$r_t = r^s_t + \lambda\, r^l_t, \qquad (1)$$

where $\lambda$ is the weight balancing the short-term and long-term rewards, which can be tuned during the experiments. By choosing $\lambda$ to be a value between 0 and 1, more priority is given to the short-term reward, maximizing the cache hit rate at every epoch given the selected action.
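As a small illustration of the weighted reward in Equation (1), the sketch below assumes a particular value of the weight $\lambda$ and hypothetical bookkeeping of recent hits; neither is specified by the paper.

```python
LAMBDA = 0.1  # assumed weight; the paper tunes this value experimentally

def total_reward(hit_next_epoch, hits_last_100_epochs):
    """Weighted sum of short-term and long-term rewards (Equation 1)."""
    r_short = int(hit_next_epoch)    # r_t^s in {0, 1}
    r_long = hits_last_100_epochs    # r_t^l in {0, ..., 100}
    return r_short + LAMBDA * r_long

print(total_reward(True, 63))  # 1 + 0.1 * 63 = 7.3
```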
Fig. 1 The system model of the proposed edge caching framework


MAML-DDPG algorithm
The meta-reinforcement learning architecture can be viewed as the interaction of two learners: the base learner and the meta learner (Finn et al. 2017). The base learner is a typical reinforcement learner responsible for learning the caching policy. The meta learner, on the other hand, is responsible for adapting to the dynamically changing caching environment in which the content popularity rapidly changes over time. Figure 2 illustrates the MAML-DDPG operation. In our proposed framework, we choose the DDPG reinforcement learner as the base learner. The meta learner is chosen to be MAML, which is among the most successful meta-learners since it has the advantage of fast learning on new tasks. MAML is also model-agnostic and is usually capable of adapting the model within a few gradient steps. The MAML algorithm can be combined with different model types and domains, including reinforcement learning, and especially policy gradient methods such as the DDPG algorithm. Another main advantage of MAML is sample efficiency, which means that the algorithm can learn from a few data samples.

The base learner: DDPG algorithm
DDPG is an actor-critic algorithm that inherits the merits of both the DQN and policy gradient methods. The algorithm is described as follows.
The actor network: defined as a function $\mu(s \mid \theta^{\mu})$ with parameter $\theta^{\mu}$, its goal is to map a state $s$ from the state space $\mathcal{S}$ to the action space $\mathcal{A}$. The mapping provides a proto-action $a'$ in $\mathcal{A}$ for the given state under the current parameters. The actions are then scaled, and the most appropriate valid action $a \in \mathcal{A}$ is chosen through the argmax operator (Dulac-Arnold et al. 2015).

The critic network: criticizes the policy designed by the actor and signals out the actions with low Q-values, much like a game played between opponents in game theory. The critic network bases its decision on the deterministic target

$$y_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right),$$

where the term $Q'(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$ is the future accumulated reward, weighted by the discount factor $\gamma \in (0, 1]$. When $\gamma$ is closer to 0, the learning agent gives more weight to the immediate reward than to future ones.
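A minimal sketch of mapping the actor's continuous output to a valid discrete cache action follows; it uses a simple nearest-index rule and an assumed tanh scaling, whereas the full Wolpertinger approach of Dulac-Arnold et al. (2015) would additionally rank the nearest candidates by their critic Q-values.

```python
import numpy as np

def select_valid_action(proto_action, num_actions):
    """Snap the actor's raw scalar output to the nearest valid action in {0, ..., C}."""
    # scale the proto-action into [0, C] (assumed scaling), then pick the closest valid index
    scaled = (np.tanh(proto_action) + 1.0) / 2.0 * (num_actions - 1)
    candidates = np.arange(num_actions)
    return int(candidates[np.argmin(np.abs(candidates - scaled))])

print(select_valid_action(0.3, num_actions=51))  # -> 32 for this illustrative input
```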
Update: The critic is updated by minimizing the loss

$$L(\theta^{Q}) = \frac{1}{B} \sum_{t} \left( y_t - Q(s_t, a_t \mid \theta^{Q}) \right)^2,$$

while the actor policy is updated using the policy gradient

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{B} \sum_{t} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_t,\, a = \mu(s_t)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_t},$$

where $B$ is the batch size. DDPG also updates the parameters of the actor and critic target networks; the objective is to constrain the target network values to change slowly rather than being frozen for a while, as usually occurs in the DQN algorithm. The updates of the target parameters $\theta^{Q'}$ and $\theta^{\mu'}$ are given as

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'},$$

where $\tau \ll 1$.
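For concreteness, the following PyTorch-style sketch implements one DDPG update step as described above; the network sizes, variable names, and mini-batch interface are illustrative assumptions rather than the paper's implementation (the discount factor and soft-update rate use the values reported later in the experiments).

```python
import torch
import torch.nn as nn

gamma, tau = 0.85, 0.01          # discount factor and soft-update rate (experiment values)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

state_dim, action_dim = 153, 1   # illustrative sizes, e.g. (C + 1) * 3 features, scalar proto-action
actor,  actor_tgt  = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
critic, critic_tgt = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_tgt.load_state_dict(actor.state_dict())
critic_tgt.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One DDPG update on a sampled mini-batch (s, a, r, s_next)."""
    # Critic: minimize (y_t - Q(s_t, a_t))^2, with the target networks providing y_t
    with torch.no_grad():
        y = r + gamma * critic_tgt(torch.cat([s_next, actor_tgt(s_next)], dim=1))
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, mu(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target update: theta' <- tau * theta + (1 - tau) * theta'
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)
```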

The meta learner: MAML-RL
Meta-RL applies meta-learning techniques to reinforcement learning tasks, in which the learning agent is trained over a distribution of tasks; the method follows Algorithm 3 in Finn et al. (2017). After several training epochs, the agent should be able to solve a new task by adapting the base RL algorithm, acquiring the optimal policy faster with only a few samples from the new, previously unencountered task. We define the RL task $\mathcal{T}_i$ as an MDP that represents a caching environment with a dataset containing the cached contents and requested contents. The task data are represented by $\mathcal{D}^{\mathcal{T}} = \{\mathcal{D}^{\mathcal{T}_1}, \mathcal{D}^{\mathcal{T}_2}, \ldots\}$, where $\mathcal{D}^{\mathcal{T}_i}$ is the sampled batch of data for the $i$-th RL task.
Each training epoch contains two learning steps, meta-learning and inner-learning, over the learning parameters $\theta$ and $\theta'_i$, respectively. In the meta-learning step, the DDPG agent samples $K$ trajectories of states and actions over a time horizon $T$ using $f_{\theta}$, a policy that maps the states $s_t$ to a distribution over actions $a_t$ at each time $t \in \{1, 2, \ldots, T\}$, where $\theta$ is the general learning parameter shared across all RL tasks. Then, the gradient of the loss $\mathcal{L}_{\mathcal{D}_i}(f_{\theta})$ is computed and utilized to update the inner-task parameter $\theta'_i$. In the inner-learning step, the agent samples $K$ trajectories using the policy $f_{\theta'_i}$ and, in each learning episode, computes the loss $\mathcal{L}_{\mathcal{D}_i}(f_{\theta'_i})$. At the end of each meta epoch, the meta parameter $\theta$ is updated by applying a gradient step over the sum of the task losses. We detail the proposed procedure in Algorithm 1. In outline, at each decision epoch the agent receives a content request $R^{eq}_t$; if the requested content is cached, the cache state and cache hit rate are updated; otherwise, the selected replacement action is applied and the cache state and hit rate are updated accordingly. For each task $i$, the agent evaluates $\nabla_{\theta} \mathcal{L}_{\mathcal{D}_i}(f_{\theta})$ using $\mathcal{D}$ and the task loss $\mathcal{L}_{\mathcal{D}_i}$, and computes the adapted parameters $\theta'_i$ with gradient descent.
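The sketch below shows one possible first-order implementation of the MAML outer/inner loop over caching tasks; it is a simplification of the paper's Algorithm 1 (which follows full MAML), and the task interface, step sizes, and helper names are assumptions.

```python
import copy
import torch

alpha, beta, meta_epochs = 0.01, 0.001, 50   # inner/outer step sizes and meta epochs (illustrative)

def maml_ddpg_outer_loop(policy, tasks, task_loss):
    """First-order MAML sketch over caching tasks.

    policy    : the DDPG actor (torch.nn.Module) holding the meta parameters theta
    tasks     : list of caching tasks, each exposing sample_batch() -> D_i
    task_loss : function (policy, batch) -> scalar loss L_{D_i}(f_theta)
    """
    meta_opt = torch.optim.Adam(policy.parameters(), lr=beta)
    for _ in range(meta_epochs):
        meta_grads = [torch.zeros_like(p) for p in policy.parameters()]
        for task in tasks:
            # Inner step: adapt a copy of the meta parameters theta to task i
            adapted = copy.deepcopy(policy)
            inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
            loss_inner = task_loss(adapted, task.sample_batch())
            inner_opt.zero_grad(); loss_inner.backward(); inner_opt.step()

            # Outer loss: evaluate the adapted parameters theta'_i on fresh task data
            loss_outer = task_loss(adapted, task.sample_batch())
            loss_outer.backward()
            for g, p in zip(meta_grads, adapted.parameters()):
                g += p.grad                   # first-order approximation of the meta gradient

        # Meta update: apply the accumulated gradient to the shared parameters theta
        for p, g in zip(policy.parameters(), meta_grads):
            p.grad = g / len(tasks)
        meta_opt.step(); meta_opt.zero_grad()
```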

Simulation setup and datasets
We simulate our proposed meta-RL scheme on different types of datasets, including Zipf-distributed data and Pintos data.
Zipf Data: Similar to Zhong et al. (2018), the total number of cached files is set to 5000 while 10,000 requests are collected as initial test data. The Zipf parameter is set to 1.3 to indicate fixed content popularity, which is the initial test case. Subsequently, the long-term cache hit rate is studied for dynamic popularity where the popularity distribution varies with time.
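As an illustration of how such a synthetic trace might be generated, the snippet below draws requests for 5000 files from a Zipf-like popularity law with parameter 1.3, matching the setup above; the generator itself is an assumption for reproducibility, not the paper's data pipeline.

```python
import numpy as np

def zipf_requests(num_files=5000, num_requests=10000, zipf_param=1.3, seed=0):
    """Generate a synthetic request trace where file popularity follows a Zipf law."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, num_files + 1)
    probs = ranks ** (-zipf_param)
    probs /= probs.sum()
    # Each request picks a file ID with probability proportional to its popularity rank
    return rng.choice(num_files, size=num_requests, p=probs)

requests = zipf_requests()   # the lowest-ranked file IDs dominate the trace
```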
Pintos Data: We use real data collected via the Pintos system. The entire dataset contains 159 disk activity records for 159 running program instances. Each record is saved as a csv file, where the first column (header block sector) gives the access sequence of block sectors, the second column (header read/write) records the access operation (read or write), and the last column (header boot/exec) indicates whether the disk access is made to load the running program. The data contain 10,000 requests produced by 5 selected user programs in the Pintos test files.

Feature extraction
The feature $\mathbf{f}_i$ is extracted from the requests and used as input to the training model. The short-, medium-, and long-term features keep a small window of request history so that the neural network can explore the latent information in it. In the experimental setup, we consider the most recent 10, 100, and 1000 requests for each file content.
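A minimal sketch of this feature extraction is given below, assuming a simple sliding window over the request trace; the class and method names are illustrative.

```python
from collections import deque

WINDOWS = (10, 100, 1000)   # short-, medium-, and long-term horizons used in the experiments

class FeatureExtractor:
    """Tracks how often each content was requested in the recent 10/100/1000 requests."""

    def __init__(self):
        self.history = deque(maxlen=max(WINDOWS))

    def observe(self, content_id):
        self.history.append(content_id)

    def features(self, content_id):
        """Return the row vector f_i = (f_i^s, f_i^m, f_i^l) for one content."""
        recent = list(self.history)
        return tuple(recent[-w:].count(content_id) for w in WINDOWS)

extractor = FeatureExtractor()
for req in [3, 7, 3, 3, 9]:
    extractor.observe(req)
print(extractor.features(3))   # (3, 3, 3) with such a short history
```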

DDPG:
We compare the proposed MAML-DDPG with the plain DDPG algorithm. The learning parameters of the DDPG agent are as follows: the learning rates for the actor and critic networks are both 0.001, the batch size is B = 32, the discount factor is 0.85, and the replay buffer capacity is set to 10,000. We use two different target network replacement strategies: soft replacement with $\tau = 0.01$, and hard replacement, where the target networks are copied every 500 and 600 iterations for the critic and actor networks, respectively.
Deep Q-Network (DQN): The DQN algorithm was first introduced in Mnih et al. (2013) and Mnih et al. (2015). DQN enhances classical Q-learning by replacing the Q-table with a neural network. The DQN hyperparameters used in the experiments are as follows: learning rate 0.01, batch size B = 32, one hidden layer of 256 neurons, memory list of size M = 10,000, and discount factor 0.9.
First in first out (FIFO): In this technique, the cached contents are indexed according to the time of caching. When a miss occurs and the cache storage is full, the system has to replace a selected content. If FIFO policy is used then the oldest stored cache content is replaced with the new content.
Least recently used (LRU): This technique keeps a record of the most recent request for each cached file. As the name indicates, when the cache storage is full and a miss occurs, the replacement policy selects the least recently requested content to be replaced with the new data from the server.

Least frequently used (LFU): In contrast to LRU, this technique keeps a count of the number of requests for every cached item as an indication of popularity. Consequently, when cache replacement is due, i.e., the cache is full and a cache miss occurs, the cached content with the smallest request count is replaced with the new content.
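As an example of how these classical baselines can be realized, the sketch below implements an LRU cache and reports the cache efficiency (hits over total requests) used later as the evaluation metric; the class is an illustrative assumption, and FIFO and LFU follow the same pattern with a queue or a request counter as the eviction key.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU baseline: on a miss with a full cache, evict the least recently used ID."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()     # content IDs ordered by recency of use

    def request(self, content_id):
        if content_id in self.store:
            self.store.move_to_end(content_id)   # refresh recency
            return True                          # cache hit
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)       # evict the least recently used content
        self.store[content_id] = None
        return False                             # cache miss

cache = LRUCache(capacity=50)
hits = sum(cache.request(r) for r in [1, 2, 1, 3, 2, 1])
print(hits / 6)   # cache efficiency = hits / total requests
```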

Performance comparison
To analyze the performance of our proposed model, we evaluate the cache efficiency (Müller et al. 2016) as the performance metric, which is the ratio of cache hits to the total number of user requests. We compare the performance of the proposed MAML-DDPG scheme with other reinforcement learning algorithms, i.e., Deep Q-Network and DDPG. In the experimental setup, we chose the short-, medium-, and long-term features to be 10, 100, and 1000 requests, respectively. The number of meta-training epochs is set to 50, and the number of learning episodes is set to 200 for the RL agents, i.e., DDPG and DQN.

a) Cache hit rate

In this part, we calculate the hit rate and compare the results with the reference algorithms presented above. Figure 3 compares the cache hit rate realized by the proposed meta-learning algorithm and the competing algorithms for different cache capacities. The results were estimated at cache sizes {1, 5, 25, 50, 100, 300, 500}. Here, the Zipf distribution parameter is kept fixed at 1.3 to model a static popularity distribution. For the MAML-DDPG approach, we split the training data into batches, where every batch is considered a different learning task. Each task parameter is initialized from the general meta parameter as in Algorithm 1. The results indicate that the proposed meta-learning framework provides a higher cache hit rate for all cache capacity values under inspection. The figure also shows that the DDPG algorithm can achieve a hit rate similar to that of the DQN while running much faster.
In Figures 4 and 5, we study the effect of the dynamic environment by varying the Zipf parameter. More particularly, in Figure 4, we compare the overall cache hit rate for MAML-DDPG and the other algorithms with different Zipf parameters {0.5, 0.7, 0.9, 1.1, 1.3, 1.5} and C = 50. In the proposed MAML-DDPG experiment, each learning task is initialized with one Zipf parameter, and for each task the data are split into 5 batches for inner-task learning, where each batch has 1000 files and 2000 user requests. The meta parameters are updated and shared between all tasks as stated in the algorithm. Figure 4 shows that our proposed scheme outperforms the other caching techniques for all values of the Zipf parameter. Following the same setup, Figure 5 shows the resulting hit rate at C = 500, where the algorithms behave similarly to Figure 4 but achieve higher hit rates due to the larger caching capacity. It is worth mentioning that at high caching capacity, the DQN and the classical techniques take a large amount of training and running time to perform cache replacement, which greatly increases the delay and the computational cost.
We also performed experiments on the Pintos data files, where the cache hit rate is evaluated for the proposed MAML-DDPG, DDPG, DQN, and LRU caching approaches. Table 1 illustrates that our proposed framework results in a better cache hit rate for most of the files.

b) Efficiency
In this part, we perform two types of comparison. First, we compare the runtime of our proposed framework with the learning-based caching schemes, i.e., DDPG and DQN. The main difference between the DDPG- and DQN-based caching methods is that the DDPG algorithm samples a policy that considers only a small set of valid actions, whereas the DQN computes the Q-value for all possible actions, which increases the time of each decision epoch. Table 2 shows the running time for MAML-DDPG, DDPG, and DQN over 1000 decision epochs. As stated in the table, MAML-DDPG requires less time than the other algorithms, which demonstrates the efficiency of our proposed system in reducing the caching latency. This comparison is carried out at cache sizes of 50 and 500, and, as is clear from the results, with larger cache sizes the difference is more noticeable in favor of the proposed MAML-DDPG algorithm. Figure 6 shows the average runtime over a dataset of 5000 contents for different cache sizes. The MAML-DDPG algorithm outperforms the other RL algorithms in running time, while the DQN is the slowest. Figure 7 shows the cache hit rate over a dataset with a dynamically changing popularity distribution, over which we measured the performance of the MAML-DDPG, DDPG, and DQN algorithms. The learning curves are presented in Figure 8, where we plot the convergence of the reward values for MAML-DDPG compared to the DDPG and DQN algorithms for a cache size of 500. The figure shows that MAML-DDPG converges faster than the DDPG and DQN algorithms.
In the second comparison, we compare both the number of training parameters and the number of floating-point operations (FLOPs). Table 3 shows a clear advantage for the proposed MAML-DDPG algorithm, as it involves fewer parameters as well as fewer FLOPs. The comparison is extended in Figure 9, where the number of FLOPs is plotted for different cache sizes. The figure shows that the models follow a linear complexity that grows with the cache size.
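As a small illustration of how the parameter-count side of such a comparison might be obtained, the helper below counts trainable parameters of a PyTorch model; it is an assumed utility, not the paper's measurement code, and the example network is only illustrative.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters (one axis of the complexity comparison)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

example_actor = nn.Sequential(nn.Linear(153, 128), nn.ReLU(), nn.Linear(128, 1))
print(count_parameters(example_actor))  # 19841 for this illustrative network
```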

Conclusion
In this paper, we proposed a new meta-reinforcement learning scheme for proactive content caching in vehicular networks. Our proposed model, MAML-DDPG, can achieve high cache hit rates while autonomously adapting to dynamic changes in content popularity. The proposed model is applied to improve caching at a single RSU and is tested for changing popularity distributions and different cache sizes. The proposed MAML-DDPG is compared to widely utilized classical caching models, i.e., FIFO, LRU, and LFU, as well as state-of-the-art reinforcement learning caching strategies, namely DDPG and DQN. The improved hit rate, for different cache sizes and changing popularity, is confirmed through simulations. In addition, the faster convergence of the proposed MAML-DDPG is confirmed through convergence analysis, runtime estimation, and the number of required FLOPs. The ability of the proposed algorithm to converge faster and adapt to changing popularity makes it highly suitable for vehicular caching scenarios. As for future work, the proposed MAML-DDPG can be extended to a collaborative and federated structure, where collaboration among several RSUs is considered with an incorporated realistic mobility model to simulate the movement of the vehicles and their association with different RSU caches.
Funding Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). This work is done as part of the project "Meta-learning core for Vehicular Networks" funded by the Information Technology Industry Development Agency (ITIDA), Project ID: PRP2019.R27.1.
Data availability Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Conflict of interest No conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.