Introduction

Internet of Things devices (IoTs) in wireless networks generate a wide range of workload data with specific requests in real time, such as real-time online maps, somatosensory games, and high-definition videos. These request-specific workloads are typically computation-intensive (requiring sufficient computation resources) or latency-sensitive (with strict completion deadlines) [1,2,3,4]. However, an IoT's computation power and battery level are inadequate to meet the stringent deadlines for completing such workloads. Cloud computing can effectively alleviate the shortage of computation resources [5, 6], for example through evolutionary-algorithm-based global cost minimization over all users' workloads [7] and cloudlet-based improvement of workload computation efficiency [8]. Unfortunately, cloud computing-assisted schemes suffer from the backhaul delay of request-specific workloads and from resource allocation scheduling on the cloud server.

To avoid the problems caused by cloud computing applications, mobile edge computing (MEC) has been introduced as a new computing and communication paradigm [9,10,11]. It separates the request-specific workloads generated by IoTs from distant cloud servers and offloads the workloads to closer MEC servers [12]. In summary, the advantages of MEC over cloud computing are: (1) MEC servers are closer to IoT terminals, which tackles the problem of backhaul delay, and (2) computation resources are more concentrated and request-oriented toward the IoTs within a server's coverage range, which solves the problem of resource allocation scheduling. However, MEC servers equipped with base stations are usually installed on the ground. Under special circumstances such as natural disasters and occlusion, MEC servers cannot guarantee the quality of service (QoS) of IoTs' computation workloads. As UAV applications gradually mature, UAVs are widely used in wireless communication environments thanks to their flexible mobility and a computing ability not inferior to that of MEC servers. Nonetheless, limited studies have addressed the offloading of request-specific workloads that necessitate specific application software, such as real-time online maps and games. These workloads usually require specific service software and cannot be completed unless that software is available on an upper-level server such as the MEC server. To this end, pre-caching the service software required by request-specific workloads on the MEC server can significantly enhance the server's execution efficiency and increase the completion ratio of workloads.

Currently, two types of caching are applied to edge or cloud devices: content caching and service caching. Content caching is static; it only caches content required by some terminals and is not related to task execution. For instance, edge servers on highways or urban expressways cache in-vehicle movies or nearby traffic data, which merely help vehicle terminals achieve a better driving experience. Service caching is dynamic; a service cache is an application program that actually executes tasks, for instance applications for virtual reality/augmented reality (VR/AR) or human face recognition. Service caches of specific types can serve request-specific workloads. Therefore, in addition to the storage space that service caching occupies on the MEC server, indicators such as the delay and energy consumption of executing request-specific workloads with the cached services must be taken into account. For content caching, the edge server only needs to consider the downlink delivery of content, not the execution of workloads, whereas service caching must consider workload execution on the edge server side. However, most works on service caching optimization consider the offloading decision and cache configuration but neglect communication conditions such as the time-varying channel state.

Deep reinforcement learning (DRL) is a hybrid of deep learning and reinforcement learning [13,14,15]. It learns the optimal policy through continuous real-time interactions between agents and the environment. In other words, the training goal of DRL is to find the optimal policy by minimizing a cost function or maximizing a reward function, so it can be deployed to solve optimization problems. There are currently some state-of-the-art DRL-based optimization schemes in MEC or UAV scenarios, such as Deep Deterministic Policy Gradient (DDPG)-based task offloading optimization [16], UAV-based task offloading and energy harvesting with Double Deep Q-learning (DDQN) [17], and a service caching placement scheme [18]. However, when processing heterogeneous workloads from multiple IoTs in complex scenarios, a single learning-based method is difficult to apply.

In our work, we formulate an UAV-enabled MEC computation and communication system with multi-IoT, multi-UAV, and multi-MEC server-mounted macro base station (MECS-M). To the best of our knowledge, the proposed MEC system is the first to utilize service caching to tackle the heterogeneous workloads generated by IoTs while utilizing UAVs to address communication occlusion. Since the cost minimization problem comprises two parts, the request-specific workload offloading decision for IoTs and the service caching hosting placement for UAVs and MECS-Ms, we use two DRL schemes (MADDPG and DDQN) to solve the two sub-problems. Furthermore, the proposed DRL-based scheme for request-specific workload offloading and service caching hosting placement in the UAV-enabled MEC model is the first cost-minimizing work that considers the randomness of workload arrival, the time-varying channel state, the limit on hosted service caching, and wireless communication blocking.

The main contributions of this work are summarized below:

  • We model an UAV-enabled MEC computation and communication system with multi-IoT, multi-UAV, and multi-MECS-M. The request-specific workloads generated by IoTs are offloaded over the uplink transmission channel to UAVs and MECS-Ms, which host particular service cachings. In addition to computation and communication capabilities, UAVs can harvest renewable resources through their energy-harvesting equipment.

  • Taking into account the randomness of workload arrival, the time-varying channel state, the limit on hosted service caching, and wireless communication blocking, we formulate the joint workload offloading and service caching problem as a two-stage optimization problem that minimizes the long-term weighted average cost. Furthermore, we propose a request-specific workload offloading and service caching decision-making scheme based on a medley deep reinforcement learning scheme (WSSMDRL) to tackle the two-stage optimization problem.

  • In the UAV-enabled MEC environment, we define the state, the action, and the reward for the two DRL algorithms. For the first optimization sub-problem, we propose a MADDPG-based request-specific workload offloading policy in which each IoT in a group acts as a learning agent interacting with the MEC environment under a centralized training and decentralized execution mode. For the second optimization sub-problem, DDQN is deployed to find the optimal service caching hosting decision in a decentralized learning pattern.

  • We carry out comprehensive simulation experiments to demonstrate the convergence of WSSMDRL under various learning rates, offload ratios, IoT occlusion rates, and total numbers of service cachings hosted by the UAV or MECS-M. In addition, we introduce four benchmark learning algorithms to verify the performance of WSSMDRL in terms of average cumulative rewards under different parameter settings.

The paper is structured as follows. Section “Related works” summarizes the related works. In section “System model”, we present the system model, which includes the network model, energy harvesting model, wireless communication model, and workloads computation model. We state the optimization problem in section “Optimization problem statement”. In section “Transformation of cost minimization problem”, we decompose the cost minimization problem into two sub-problems. Section “Medley DRL-based optimization scheme” lists two optimization schemes: MADDPG-based request-specific workloads offloading decision-making optimization scheme and DDQN-based service caching hosting selection scheme. Comprehensive performance evaluation and experiments are conducted in section “Performance evaluation”. Finally, in section “Conclusions”, we conclude our work.

Related works

In this section, we first introduce computation offloading and resource allocation works in the MEC model, then describe works on caching-based optimization strategies, and finally present related learning-based optimization schemes.

Computation offloading and resource allocation in MEC

There are several existing works in the MEC environment. Waqar et al. optimized the task offloading and resource allocation problem based on Q-learning and Deep Q-Learning (DQN) in an MEC-assisted vehicular network [19]. Chen et al. introduced an MEC framework assisted by the vehicular network to implement a multi-hop offloading optimization strategy for minimizing the delay of finishing workloads generated by vehicles; the bat algorithm combined with a greedy policy was proposed to tackle the optimization problem [20]. Li et al. considered the dual connectivity (DC) between user terminals and the MEC server in a non-orthogonal multiple access (NOMA) MEC system; a heuristic algorithm and deep deterministic policy gradient (DDPG), a model-free deep reinforcement learning (DRL) algorithm, were deployed to find the optimal workload segmentation and resource allocation for minimizing global energy consumption [21]. Nonetheless, few of these studies address request-specific workload offloading that necessitates specific application software, such as real-time online maps and games.

Content caching and service caching-based optimization strategy

In terms of content caching, many works have been studied. Tan et al. deployed matrix completion (MC) theory to optimize the content caching placement strategy on the MEC node for minimizing the estimated caching error rate [22]. Zhang et al. formulated a Device-to-Device (D2D)-assisted MEC model and modeled blockchain-based caching placement and contract selection as a Markov Decision Process (MDP); a DRL-based scheme was designed to minimize the latency of executing workloads from devices [23]. Zhang et al. tackled the long-term delay minimization problem, considering caching and offloading on multiple base stations, based on a genetic algorithm and the Lyapunov algorithm [24]. In [22,23,24], given the characteristics of the workloads from IoTs, the update cycle of content caches cannot be too short; otherwise, workload execution efficiency is significantly reduced due to the lack of hosted caches on the edge server side. The existing works on service caching are less extensive. Zheng et al. constructed an UAV-assisted MEC framework in which each edge server (UAV and MEC server) hosted some service cachings for request-specific tasks; the problem of minimizing the total delay of all terminals was decoupled into several sub-problems, and an iterative algorithm and the K-Means algorithm were deployed to minimize the total latency and energy consumption cost for mobile users in the service caching-assisted MEC model [18]. Bi et al. applied the branch and bound method to solve the offloading decision-making strategy and caching placement decision [25]. Zhong et al. formulated a cloud-edge cooperation framework equipped with specific service cachings for executing workloads from mobile users; based on this framework, a modified generalized Benders decomposition strategy decomposed the task offloading and caching placement problem into two sub-problems for minimizing the average computing latency of all workloads [26]. The works [18, 25, 26] considered the offloading decision and cache configuration optimization but did not consider communication conditions such as the time-varying channel state.

Learning-based optimization scheme

There are currently some state-of-the-art works based on DRL and other learning-based optimization schemes in MEC or UAV scenarios. DRL methods fall into two types: value-based methods and policy-based methods [27]. Value-based DRL methods comprise DQN [28], Double Deep Q-learning (DDQN) [29], and Dueling DQN [30]. Policy-based DRL methods include Policy Gradient and Proximal Policy Optimization (PPO) [31]. There are also DRL schemes that combine the two, such as Actor-Critic [32], Asynchronous Advantage Actor-Critic (A3C) [33], Deep Deterministic Policy Gradient (DDPG) [34], and Multi-agent Deep Deterministic Policy Gradient (MADDPG) [35]. Some existing works utilize DRL for workload offloading or resource allocation optimization in MEC or cloud computing environments, such as DDPG-based communication and computation latency minimization in a reconfigurable intelligent surface (RIS)-enabled MEC model [36], privacy protection and edge server performance optimization based on Actor-Critic [37], and physical layer security optimization for video data in the MEC layer with a hierarchical reward function-based DRL scheme [38]. However, none of the mentioned works optimizes service caching on the edge server, and few studies optimize the service caching configuration based on DRL. Ren et al. modeled a service caching-assisted MEC system and proposed a PPO-based optimization scheme for computation resource scheduling and bandwidth resource allocation to minimize the mean service delay for all IoT devices [39]. Zhou et al. combined blockchain with edge servers and designed an edge caching framework in which service cachings were placed on MEC servers to improve computation resource utilization; DQN and a greedy policy were utilized to obtain the offloading decision and service caching update strategy for minimizing execution delay [40]. In [39, 40], DRL-based schemes were deployed to optimize service caching-related problems, but these schemes did not consider edge cooperation or task execution under edge server exceptions such as communication blocking. For other learning-based schemes, a novel research field successfully combines machine learning and swarm intelligence approaches and has obtained outstanding results in different areas [41], such as the Chaotic Firefly Algorithm [42], a Genetic Algorithm-based hierarchical feature selection scheme [43], and a long short-term memory and gated recurrent unit medley neural network optimization algorithm [44]. We are considering applying these hybrid algorithms in our future work.

System model

This section provides a comprehensive description of the proposed MEC system model, including the network model, energy harvesting model, wireless communication model, and workload computation model.

Fig. 1: Multi-IoTs and multi-edge servers communication and computation model

Network model

Figure 1 illustrates an UAV-enabled heterogeneous wireless computation and communication network comprising multiple IoTs, multiple UAVs, multiple MEC servers, and a cloud service center. In this network, each IoT may generate, with a specific probability, request-specific workloads that are computation-intensive and request particular service cachings. However, due to their limited battery capacity and computing power, and especially the lack of specific service cachings, IoTs cannot handle computation-intensive and latency-sensitive request-specific workloads. To tackle this issue, the MEC layer consists of multiple strategically positioned high-performance MEC servers. Each MEC server is equipped with a macro base station (MBS) to facilitate communication between IoTs and cloud servers, and the combination is referred to as an MECS-M. Each MECS-M is powered by a traditional power grid without wireless charging capability. In addition, each MECS-M allocates storage resources to host specialized and unique service cachings from the upper cloud server; these service cachings are designed to process request-specific workloads offloaded by linked IoTs. As revealed in Fig. 1, some unique buildings (e.g., power plants, abandoned factories) block wireless communication between IoTs and the MECS-M. Therefore, UAVs with strong computation ability can be utilized to take offloaded computation tasks from IoTs through wireless communication. It is worth noting that UAVs can be charged wirelessly using solar energy, which eliminates the need for manual charging. Besides, the cloud service center comprises a cluster of servers and a set of service cachings. With its robust capability, the cloud server can accommodate all service cachings and tackle request-specific workloads from UAVs, MECS-Ms, and IoTs. Specific applications, such as traffic scenarios, real-time mapping, and augmented reality/virtual reality (AR/VR), require services hosted by an UAV, MECS-M, or cloud server to execute the workloads requested by IoTs. To fulfill these computation-intensive requests, UAVs and MECS-Ms must be allocated particular service cachings beforehand. As IoTs communicate with UAVs and MECS-Ms more frequently, the cloud server must selectively sink (transfer) service cachings to the UAVs and MECS-Ms. However, because UAVs and MECS-Ms have limited computation and storage space compared to the cloud server, only a limited number of service cachings can be transferred; this ensures the maximum utility of the UAV and MECS-M.

Without loss of generality, \({\mathcal {I}} = \{1, 2, \ldots ,I\}\), \({\mathcal {D}} = \{1, 2, \ldots , D\}\), and \({\mathcal {E}} = \{1, 2, \ldots , E\}\) represent the sets of IoTs, UAVs, and MECS-Ms, respectively. This means that there are I IoTs and E MECS-Ms located in a certain region, while D UAVs hover at a certain height in the air. The set \({\mathcal {I}} = \{1, 2, \ldots ,I\}\) can be divided into M subsets: \({\mathcal {I}}_1, {\mathcal {I}}_2, \ldots , {\mathcal {I}}_M\), where M is the total number of groups of UAVs and MECS-Ms; that is, the total number of IoTs served by the m-th group of UAVs and MECS-Ms is \(I_m\). Similarly, we define \({\mathcal {C}} = \{1, 2, \ldots , C\}\) as the service caching set, and all service cachings are stored on the cloud server. UAVs and MECS-Ms have limited storage space, so they can only host certain types of service cachings in a given time period. We define \({\hat{C}}_m\) as the maximum number of service cachings hosted by each UAV or MECS-M. Accordingly, if UAV \(d, d \in {\mathcal {D}}\) or MECS-M \(e, e \in {\mathcal {E}}\) hosts service caching \(c, c \in {\mathcal {C}}\), then the request-specific workloads generated by IoT \(i, i \in {\mathcal {I}}\) can be processed by that UAV or MECS-M; otherwise, these workloads must be offloaded to the cloud server, which hosts all types of service cachings.

Moreover, UAVs can charge certain connected IoTs while receiving offloaded request-specific workloads, thanks to their energy-harvesting devices for collecting renewable energy. It is important to note that we assume the coverage of each UAV and MECS-M does not overlap, meaning that each IoT can only communicate with one UAV or MECS-M and cannot move out of their coverage within each time slot.

Energy harvesting model

In the proposed MEC system, \({\mathcal {T}}\) is defined to describe a continuous time duration, which is divided into discrete intervals denoted as time slots. Specifically, the set of time slots is represented as \({\mathcal {T}} = \{1, 2, \ldots , T\}\). This means that the total duration of the system is T, and each time slot is denoted as \(t \in {\mathcal {T}}\).

As stated in section “Network model”, UAVs can charge connected IoTs while receiving offloaded request-specific workloads because the UAVs carry specific energy harvesting cells. Since the UAVs hovering in the air contain energy harvesting cells, energy utilization efficiency can be significantly improved, leading to lower energy consumption. The UAV's energy harvesting cell collects renewable energy, such as wind or solar power, in real time, converts it into available electrical energy, and stores it in the UAV's energy storage unit. Additionally, we assume that the arrival of renewable energy for UAV \(d, d \in {\mathcal {D}}\) follows a Poisson distribution with rate \(\lambda _d^{\textrm{energy}}\) in each time slot, independent and identically distributed (i.i.d.). It is essential to consider the energy loss during harvesting. Accordingly, the expected energy harvested by UAV \(d, d \in {\mathcal {D}}\) during time slot t can be derived as:

$$\begin{aligned} {\mathbb {E}}\left[ A_d^{\textrm{energy}}(t)\right] =\xi _d \lambda _d^{\textrm{energy}} \end{aligned}$$
(1)

where \(\xi _d\) stands for the linear energy loss coefficient.

As stated in the EARTH project [45], the UAV's base energy consumption in each time slot comprises static (stationary) energy consumption for hovering, caching, and charging, and dynamic (controllable) energy consumption for executing offloaded workloads and trajectory planning. This can be expressed by the following equation:

$$\begin{aligned} E_{d}^{\textrm{base}}\left( t\right) =\left[ p_{d}^{\textrm{static}}\left( t\right) + \nu _d p_{d}^{\textrm{dynamic}}\left( t\right) \right] \cdot t \end{aligned}$$
(2)

where Eq. (2) includes the static power (\(p_{d}^{\textrm{static}}\)) that accounts for hovering, caching, charging, and other base power consumption. The dynamic power (\(p_{d}^{\textrm{dynamic}}\)) is made up of two parts: transmission power for wireless charging energy and execution power for computing offloaded workloads. The control parameter (\(\nu _d\)) is strongly associated with the processing abilities of UAV \(d, d \in {\mathcal {D}}\). By considering the energy harvested and consumed by each UAV, the energy consumption model of UAV \(d, d \in {\mathcal {D}}\) over the whole duration T can be constructed as follows:

$$\begin{aligned} B_d^{\textrm{energy}}(t+1) = \min \left\{ \max \left\{ B_d^{\textrm{energy}}(t)-E_d^{\textrm{base}}(t)+\xi _d \lambda _d^{\textrm{energy}},\,0 \right\} ,\, B_d^{\textrm{max}} \right\} \end{aligned}$$
(3)

where the energy consumption of UAV \(d, d \in {\mathcal {D}}\) is modeled as a queue, and \(B_d^{\textrm{max}}\) represents the maximum available size of the energy queue for UAV \(d, d \in {\mathcal {D}}\).

As derived from Eq. (3), the harvested (input) energy and consumed (output) energy vary over each time slot depending on the energy arrival rate \(\lambda _d^{\textrm{energy}}\), the size of the request-specific workloads being processed, and the transmission and execution power for offloaded workloads. This directly influences the size of the energy queue \(B_d^{\textrm{energy}}\) of each UAV. \(B_d^{\textrm{energy}}(t) > 0\) means there is sufficient renewable energy to meet UAV d's energy consumption. When \(B_d^{\textrm{energy}}(t)-E_d^{\textrm{base}}(t)+\xi _d \lambda _d^{\textrm{energy}} > B_d^{\textrm{max}}\), the arrival of renewable energy is more than sufficient and queue \(B_d^{\textrm{energy}}\) overflows. However, once \(B_d^{\textrm{energy}}(t)-E_d^{\textrm{base}}(t)+\xi _d \lambda _d^{\textrm{energy}}\) is negative, the arrival of renewable energy is deficient. In such cases, to guarantee the UAV's computation and communication ability, the UAV is powered by the traditional grid, which inevitably leads to high energy consumption.
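To make the queue dynamics concrete, the following minimal Python sketch simulates Eqs. (1)-(3); every parameter value in it (the loss coefficient, arrival rate, queue size, and base consumption) is an illustrative assumption, not a value from our simulation settings.

```python
# A minimal numpy sketch of the UAV energy-queue dynamics in Eqs. (1)-(3);
# all parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

xi_d = 0.8          # linear energy-loss coefficient xi_d
lam_energy = 5.0    # Poisson arrival rate lambda_d^energy (energy units/slot)
B_max = 100.0       # maximum energy-queue size B_d^max
B = 20.0            # initial energy level B_d^energy(0)

for t in range(50):
    harvested = xi_d * rng.poisson(lam_energy)   # Eq. (1): lossy renewable arrival
    E_base = 3.0 + 0.5 * rng.random()            # E_d^base(t): static + dynamic use
    residual = B - E_base + harvested
    grid_powered = residual < 0                  # deficient arrival: fall back to the grid
    B = min(max(residual, 0.0), B_max)           # Eq. (3): clip the queue to [0, B_max]
```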

Wireless communication model

To describe relative locations in the proposed MEC framework, 3D Cartesian coordinates are introduced for the positions of IoTs, UAVs, and MECS-Ms. We set \([x_i, y_i, z_i]\), \([x_d, y_d, z_d]\), and \([x_e, y_e, z_e]\) as the coordinates of IoT \(i, i \in {\mathcal {I}}\), UAV \(d, d \in {\mathcal {D}}\), and MECS-M \(e, e \in {\mathcal {E}}\), respectively. Accordingly, we can conveniently calculate the Euclidean distance between IoT \(i, i \in {\mathcal {I}}\) and UAV \(d, d \in {\mathcal {D}}\) in time slot t as:

$$\begin{aligned}{} & {} \text {Dis}_{i,d}\left( t\right) \\{} & {} \quad =\sqrt{[x_{i}(t)-x_d(t)]^2+ [y_{i}(t)-y_d(t)]^2+[z_{i}(t)-z_d(t)]^2}\nonumber \end{aligned}$$
(4)

Similarly, the distance between IoT \(i, i \in {\mathcal {I}}\) and MECS-M \(e, e \in {\mathcal {E}}\) is given by:

$$\begin{aligned}{} & {} \text {Dis}_{i,e}\left( t\right) \\{} & {} \quad =\sqrt{[x_{i}(t)-x_e(t)]^2 +[y_{i}(t)-y_e(t)]^2+[z_{i}(t)-z_e(t)]^2}\nonumber \end{aligned}$$
(5)

Without loss of generality, we assume that the coverage of the multiple UAVs and multiple MECS-Ms does not overlap. In other words, a specific UAV or MECS-M serves a given IoT exclusively, and the initial channel state between an IoT and an UAV or MECS-M follows a zero-mean complex Gaussian distribution satisfying \(h_{i,d}(0) = \Phi _d^{\textrm{loss}} \sqrt{(Dis_d^{\textrm{init}}/Dis_{i,d})^{p_d}} I_d\) and \(h_{i,e}(0) = \Phi _e^{\textrm{loss}} \sqrt{(Dis_e^{\textrm{init}}/Dis_{i,e})^{p_e}} I_e\), where \(\Phi _d^{\textrm{loss}},\Phi _e^{\textrm{loss}}\) are the path loss coefficients, \(Dis_d^{\textrm{init}},Dis_e^{\textrm{init}}\) indicate the initial reference distances, which are constants, and \(p_d,p_e\) are the control factors, which are positive integers. Apart from the impact of distance and path loss on the channel state, we adopt an autoregressive model [16, 46] to depict the time-varying channel state in each time slot. The channel model between IoT \(i, i \in {\mathcal {I}}\) and UAV \(d, d \in {\mathcal {D}}\) in time slot t can be formulated as follows:

$$\begin{aligned} h_{i, d}\left( t\right) =\varrho _{d} h_{i, d}(t-1)+\sqrt{1-\varrho _{d}^{2}} r_{d}^{\textrm{error}}\left( t\right) \end{aligned}$$
(6)

where \(h_{i, d}\left( t\right) \) denotes the channel state between IoT \(i, i \in {\mathcal {I}}\) and UAV \(d, d \in {\mathcal {D}}\) in time slot t, \(\varrho _{d}\) represents the control factor, which is close to 1, and \(r_{d}^{\textrm{error}}\) is the error vector, which follows a complex Gaussian distribution with zero mean and unit variance.

Similarly, the channel model between the IoT \(i, i \in {\mathcal {I}}\) and the MECS-M \(e, e \in {\mathcal {E}}\) in time slot t can be written as follows:

$$\begin{aligned} h_{i, e}\left( t\right) =\varrho _{e} h_{i, e}(t-1)+\sqrt{1-\varrho _{e}^{2}} r_{e}^{\textrm{error}}\left( t\right) \end{aligned}$$
(7)

Furthermore, we adopt an orthogonal frequency-division multiple access (OFDMA) mode to model the communication between IoTs and UAVs or MECS-Ms. Following the description in [17], the bandwidth between IoTs and UAVs or MECS-Ms is divisible. Specifically, we use \(B_d\) to represent the total bandwidth of the wireless channel of UAV d for linked IoTs, divided into \(K_d\) sub-channels. Based on Shannon's theorem, the transmission rate achieved between IoT \(i, i \in {\mathcal {I}}\) and UAV \(d, d \in {\mathcal {D}}\) can be expressed as:

$$\begin{aligned} R_{i,d}^{\textrm{offload}}(t)= B_{d}/K_d \log _{2}\left( 1+ \frac{p_{i, d}^{\textrm{offload}}(t) \cdot h_{i, d}(t)}{\sigma _d(t)^{2}}\right) \end{aligned}$$
(8)

where \(p_{i, d}^{\textrm{offload}}\) indicates the transmission power for offloading the request-specific workloads from IoT \(i, i \in {\mathcal {I}}\) to UAV \(d, d \in {\mathcal {D}}\), and \(\sigma _d(t)\) is the standard deviation of the additive white Gaussian noise (AWGN) with zero mean.

Similarly, the transmission rate between IoT \(i, i \in {\mathcal {I}}\) and MECS-M \(e, e \in {\mathcal {E}}\) in time slot t can be written as follows:

$$\begin{aligned} R_{i,e}^{\textrm{offload}}(t)= B_{e}/K_e \log _{2}\left( 1+ \frac{p_{i, e}^{\textrm{offload}}(t) \cdot h_{i, e}(t)}{\sigma _e(t)^{2}}\right) \end{aligned}$$
(9)

It is noteworthy that although the transmission mode between an IoT and an UAV is the same as that between an IoT and an MECS-M, when occlusion occurs between an IoT and the MECS-M, the IoT can only communicate with the UAV, even though the computation and communication capabilities of the UAV are weaker than those of the MECS-M.
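For illustration, the short Python sketch below evolves the autoregressive channel of Eq. (6) and evaluates the achievable rate of Eq. (8); all parameter values are assumptions, and the channel power gain \(|h|^2\) is interpreted as the gain term in the signal-to-noise ratio.

```python
# A short numpy sketch of the AR(1) channel in Eq. (6) and the uplink rate
# in Eq. (8); parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

rho_d = 0.95              # control factor varrho_d, close to 1
B_d, K_d = 20e6, 10       # total bandwidth (Hz) and number of sub-channels (assumed)
p_offload = 0.5           # transmission power p_{i,d}^offload (W)
sigma2 = 1e-9             # AWGN power sigma_d(t)^2

h = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)   # h_{i,d}(0) ~ CN(0, 1)
for t in range(10):
    error = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)  # r_d^error(t) ~ CN(0, 1)
    h = rho_d * h + np.sqrt(1 - rho_d**2) * error            # Eq. (6)
    snr = p_offload * abs(h) ** 2 / sigma2                   # assumed SNR form
    rate = (B_d / K_d) * np.log2(1 + snr)                    # Eq. (8), bits/s
```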

Workloads computation model

We assume that each IoT generates a request-specific workload with a certain probability in each time slot. We set the arrival probability of the workload requiring specific service caching \(c, c \in {\mathcal {C}}\) generated by IoT \(i, i \in {\mathcal {I}}\) as \(P_{i,c}^{\textrm{in}}(t)\), which follows a Bernoulli distribution with parameter \(\lambda _{i,c}^{\textrm{in}}\). We define the request-specific workload generated by IoT \(i, i \in {\mathcal {I}}\) in time slot t as a tuple \(w_{i,c}(t)=\langle w_{i,c}^{\textrm{size}}(t), w_{i,c}^{\textrm{cycle}}(t), w_{i,c}^{\textrm{nocache}}(t), w_{i,c}^{\textrm{hide}}(t), w_{i,c}^{\textrm{duration}}(t) \rangle \). \(w_{i,c}^{\textrm{size}}(t)\) means the total size of the request-specific workload of IoT \(i, i \in {\mathcal {I}}\). \(w_{i,c}^{\textrm{cycle}}(t)\) represents the total number of CPU cycles for finishing the request-specific workload generated by IoT \(i, i \in {\mathcal {I}}\). \(w_{i,c}^{\textrm{nocache}}(t)\) indicates the total capacity of service caching \(c, c \in {\mathcal {C}}\) required for finishing the request-specific workload generated by IoT \(i, i \in {\mathcal {I}}\). \(w_{i,c}^{\textrm{hide}}(t)\) stands for the probability of IoT \(i, i \in {\mathcal {I}}\) being blocked by buildings. \(w_{i,c}^{\textrm{duration}}(t)\) denotes the deadline for finishing the request-specific workload generated by IoT \(i, i \in {\mathcal {I}}\). It is worth noting that because no IoT can host the service caching required to complete its request-specific workload, the workload needs to be offloaded to an upper-layer UAV or MECS-M. The arriving request-specific workloads offloaded from IoTs must first be stored on the UAV or MECS-M and then processed within the task deadline. To this end, we define \(B_d^{\textrm{data}}\) and \(B_e^{\textrm{data}}\) as the buffer queues for offloaded workloads on the UAV and MECS-M, respectively. In addition, we assume that all request-specific workloads are indivisible and latency-sensitive, so a workload can only be offloaded in full to an UAV or an MECS-M. To this end, we define \(\tau _{i,j}^{\textrm{offload}}\) as the offloading decision-making variable with \(\tau _{i,j}^{\textrm{offload}} \in [0,1]\), where \(\tau _{i,j}^{\textrm{offload}}=0\) means the request-specific workload \(w_{i,c}(t)\) generated by IoT \(i, i \in {\mathcal {I}}\) is offloaded to UAV \(j, j \in {\mathcal {D}}\), and \(\tau _{i,j}^{\textrm{offload}}=1\) means it is offloaded to MECS-M \(j, j \in {\mathcal {E}}\).
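As a concrete illustration of the workload tuple and its Bernoulli arrival, consider the following Python sketch; all field values in it are hypothetical.

```python
# A compact sketch of the request-specific workload tuple w_{i,c}(t) and its
# Bernoulli arrival process; field values are illustrative assumptions.
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Workload:
    size: float      # w^size: input size in bits
    cycles: float    # w^cycle: CPU cycles required to finish
    nocache: float   # w^nocache: capacity of the required service caching c
    hide: float      # w^hide: probability of being blocked by buildings
    duration: float  # w^duration: completion deadline (s)

def maybe_generate(lam_in: float) -> Optional[Workload]:
    """Bernoulli arrival with parameter lambda_{i,c}^in in one time slot."""
    if random.random() < lam_in:
        # Hypothetical field values for illustration only.
        return Workload(size=2e6, cycles=5e8, nocache=1.0, hide=0.2, duration=0.5)
    return None
```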

Workloads computation offload to UAV

As described in section “Network model”, the UAV hovers in the air along a predetermined trajectory, so as long as the IoTs do not move out of the UAV's communication coverage, the request-specific workloads generated by IoTs can be offloaded to the UAV for processing through uplink wireless communication. Besides, when IoTs are obstructed by buildings, they cannot communicate with the MECS-M. To this end, the transmission latency for offloading request-specific workload \(w_{i,c}(t)\) from IoT \(i, i \in {\mathcal {I}}\) to UAV \(d, d \in {\mathcal {D}}\) in time slot t can be derived as

$$\begin{aligned} l_{i,d}^{\textrm{offload}}(t)= \frac{(1-\tau _{i,j}^{\textrm{offload}}) w_{i,c}^{\textrm{size}}(t)}{R_{i,d}^{\textrm{offload}}(t)} \end{aligned}$$
(10)

Accordingly, the transmission energy consumption for offloading request-specific workload \(w_{i,c}(t)\) from IoT \(i, i \in {\mathcal {I}}\) to UAV \(d, d \in {\mathcal {D}}\) in the time slot t can be derived as

$$\begin{aligned} e_{i,d}^{\textrm{offload}}(t)= p_{i,d}^{\textrm{offload}}(t) l_{i,d}^{\textrm{offload}}(t) \end{aligned}$$
(11)

In this paper, we define \(\rho _{i,c,j}^{\textrm{execute}}\) as the decision-making variable indicating whether the UAV or the MECS-M hosts service caching \(c, c \in {\mathcal {C}}\), with \(\rho _{i,c,j}^{\textrm{execute}} \in [0,1]\), where \(\rho _{i,c,j}^{\textrm{execute}}=0\) means the service caching \(c, c \in {\mathcal {C}}\) required to execute the request-specific workload \(w_{i,c}(t)\) generated by IoT \(i, i \in {\mathcal {I}}\) is not hosted by UAV \(j, j \in {\mathcal {D}}\) or MECS-M \(j, j \in {\mathcal {E}}\), and \(\rho _{i,c,j}^{\textrm{execute}}=1\) means it is hosted. After the request-specific workload \(w_{i,c}(t)\) generated by IoT \(i, i \in {\mathcal {I}}\) is offloaded to UAV \(d, d \in {\mathcal {D}}\), if the UAV hosts the service caching \(c, c \in {\mathcal {C}}\) required for workload \(w_{i,c}(t)\), the UAV can directly execute the workload with its processor. The execution latency for the offloaded request-specific workload \(w_{i,c}(t)\) on UAV \(d, d \in {\mathcal {D}}\) in time slot t can be calculated by

$$\begin{aligned} l_{i,d}^{\textrm{execute}}(t)= \frac{(1-\tau _{i,j}^{\textrm{offload}}) \rho _{i,c,j}^{\textrm{execute}} w_{i,c}^{\textrm{cycle}}(t)}{f_{i,d}^{\textrm{alloc}}(t)} \end{aligned}$$
(12)

where \(f_{i,d}^{\textrm{alloc}}(t)\) means the total number of allocated CPU cycles for executing the request-specific workload \(w_{i,c}(t)\), which can be obtained by \(f_{i,d}^{\textrm{alloc}}(t)= \sqrt{\frac{p_{d}^{\textrm{execute}}(t)}{\kappa _{d} \cdot (1-\tau _{i,j}^{\textrm{offload}}) w_{i,c}^{\textrm{cycle}}(t)}}\) [47], and \(\kappa _{d}\) represents the effective switching capacitance parameter of the UAV's control chip.

To this end, the execution energy consumption for offloaded request-specific workload \(w_{i,c}(t)\) on UAV \(d, d \in {\mathcal {D}}\) in the time slot t can be calculated by

$$\begin{aligned} e_{i,d}^{\textrm{execute}}(t)= \kappa _{d} [f_{i,d}^{\textrm{alloc}}(t)]^3 l_{i,d}^{\textrm{execute}}(t) \end{aligned}$$
(13)

When UAV \(d, d \in {\mathcal {D}}\) does not host the service caching \(c, c \in {\mathcal {C}}\) for executing request-specific workload \(w_{i,c}(t)\), the workload \(w_{i,c}(t)\) needs to be transmitted to the cloud server for processing. Furthermore, we assume that the total execution latency of transmitting the request-specific workload \(w_{i,c}(t)\) to the cloud server and returning the results is related to the total capacity \(w_{i,c}^{\textrm{nocache}}(t)\) of service caching \(c, c \in {\mathcal {C}}\) required for finishing the request-specific workload generated by IoT \(i, i \in {\mathcal {I}}\). Accordingly, we can obtain the total execution latency by the following equation.

$$\begin{aligned} l_{i,d}^{\textrm{nocach}}(t)= (1-\tau _{i,j}^{\textrm{offload}}) (1-\rho _{i,c,j}^{\textrm{execute}}) \Delta _{i,d} w_{i,c}^{\textrm{nocache}}(t) \end{aligned}$$
(14)

where \(\Delta _{i,d}\) is the control coefficient which is a positive constant.

Then, the total execution energy consumption for finishing request-specific workloads generated by IoT \(i, i \in {\mathcal {I}}\) without service caching can be written by

$$\begin{aligned} e_{i,d}^{\textrm{nocach}}(t)= p_{i,d}^{\textrm{nocach}}(t) l_{i,d}^{\textrm{nocach}}(t) \end{aligned}$$
(15)

where \(p_{i,d}^{\textrm{nocach}}(t)\) indicates the allocated power for finishing the request-specific workloads without service caching hosted on UAV \(d, d \in {\mathcal {D}}\).
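The UAV-side cost terms of Eqs. (10)-(15) can be collected into one small helper, sketched below in Python; the default parameter values are assumptions chosen purely for illustration.

```python
# A minimal sketch of the UAV-side latency and energy terms in Eqs. (10)-(15),
# assuming the offloading decision tau and caching decision rho are given.
# Default parameter values (kappa_d, delta, p_nocache) are assumptions.
def uav_cost(tau, rho, w_size, w_cycle, w_nocache, rate, p_off, f_alloc,
             kappa_d=1e-27, delta=0.01, p_nocache=0.3):
    l_off = (1 - tau) * w_size / rate                       # Eq. (10): uplink latency
    e_off = p_off * l_off                                    # Eq. (11)
    l_exec = (1 - tau) * rho * w_cycle / f_alloc             # Eq. (12): on-UAV execution
    e_exec = kappa_d * f_alloc**3 * l_exec                   # Eq. (13)
    l_nocache = (1 - tau) * (1 - rho) * delta * w_nocache    # Eq. (14): forwarded to cloud
    e_nocache = p_nocache * l_nocache                        # Eq. (15)
    return l_off + l_exec + l_nocache, e_off + e_exec + e_nocache
```

The MECS-M-side terms of Eqs. (16)-(21) mirror this helper with \((1-\tau )\) replaced by \(\tau \).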

Workloads computation offload to MECS-M

When buildings do not obstruct the IoTs, the IoTs communicate with the MECS-M owing to the MECS-M's powerful computation resources and communication capability. Similar to the communication and computation model between IoTs and UAVs, the transmission latency for offloading request-specific workload \(w_{i,c}(t)\) from IoT \(i, i \in {\mathcal {I}}\) to MECS-M \(e, e \in {\mathcal {E}}\) in time slot t can be derived as

$$\begin{aligned} l_{i,e}^{\textrm{offload}}(t)= \frac{\tau _{i,j}^{\textrm{offload}} w_{i,c}^{\textrm{size}}(t)}{R_{i,e}^{\textrm{offload}}(t)} \end{aligned}$$
(16)

Furthermore, the transmission energy consumption for offloading request-specific workload \(w_{i,c}(t)\) from IoT \(i, i \in {\mathcal {I}}\) to MECS-M \(e, e \in {\mathcal {E}}\) in the time slot t can be derived as

$$\begin{aligned} e_{i,e}^{\textrm{offload}}(t)= p_{i,e}^{\textrm{offload}}(t) l_{i,e}^{\textrm{offload}}(t) \end{aligned}$$
(17)

After request-specific workload \(w_{i,c}(t)\) generated by IoT \(i, i \in {\mathcal {I}}\) is offloaded to the MECS-M \(e, e \in {\mathcal {E}}\), if the MECS-M \(e, e \in {\mathcal {E}}\) hosts the service caching \(c, c \in {\mathcal {C}}\) required for workload \(w_{i,c}(t)\), the MECS-M \(e, e \in {\mathcal {E}}\) can execute workload \(w_{i,c}(t)\) through its own processor. The execution latency for offloaded request-specific workload \(w_{i,c}(t)\) on MECS-M \(e, e \in {\mathcal {E}}\) in the time slot t can be calculated by

$$\begin{aligned} l_{i,e}^{\textrm{execute}}(t)= \frac{\tau _{i,j}^{\textrm{offload}} \rho _{i,c,j}^{\textrm{execute}} w_{i,c}^{\textrm{cycle}}(t)}{f_{i,e}^{\textrm{alloc}}(t)} \end{aligned}$$
(18)

where \(f_{i,e}^{\textrm{alloc}}(t)\) means the total number of allocated CPU cycles for executing the request-specific workload \(w_{i,c}(t)\), which can be obtained by \(f_{i,e}^{\textrm{alloc}}(t)= \sqrt{\frac{p_{e}^{\textrm{execute}}(t)}{\kappa _{e} \cdot \tau _{i,j}^{\textrm{offload}} w_{i,c}^{\textrm{cycle}}(t)}}\), and \(\kappa _{e}\) represents the effective switching capacitance parameter of the MECS-M's control chip.

The execution energy consumption for offloaded request-specific workload \(w_{i,c}(t)\) on MECS-M \(e, e \in {\mathcal {E}}\) in the time slot t can be calculated by

$$\begin{aligned} e_{i,e}^{\textrm{execute}}(t)= \kappa _{e} [f_{i,e}^{\textrm{alloc}}(t)]^3 l_{i,e}^{\textrm{execute}}(t) \end{aligned}$$
(19)

Similarly, when the MECS-M \(e, e \in {\mathcal {E}}\) does not host the service caching \(c, c \in {\mathcal {C}}\) for executing request-specific workload \(w_{i,c}(t)\), the workload \(w_{i,c}(t)\) needs to be transmitted to the cloud service for processing. Accordingly, we can obtain the total execution latency by the following equation:

$$\begin{aligned} l_{i,e}^{\textrm{nocach}}(t)= \tau _{i,j}^{\textrm{offload}} (1-\rho _{i,c,j}^{\textrm{execute}}) \Delta _{i,e} w_{i,c}^{\textrm{nocache}}(t) \end{aligned}$$
(20)

Furthermore, the total execution energy consumption for finishing request-specific workloads generated by IoT \(i, i \in {\mathcal {I}}\) without service caching can be written by

$$\begin{aligned} e_{i,e}^{\textrm{nocach}}(t)= p_{i,e}^{\textrm{nocach}}(t) l_{i,e}^{\textrm{nocach}}(t) \end{aligned}$$
(21)

Optimization problem statement

To formally state the optimization problem, we define the total latency and energy consumption in this section.

As previously mentioned in section “Workloads computation model”, in terms of IoT \(i, i \in {\mathcal {I}}\), the total latency for completing the request-specific workload \(w_{i,c}(t)\) in time slot t can be given by:

$$\begin{aligned} l_i^{\textrm{total}}(t)= & {} l_{i,d}^{\textrm{offload}}(t) + l_{i,d}^{\textrm{execute}}(t) + l_{i,d}^{\textrm{nocach}}(t)+l_{i,e}^{\textrm{offload}}(t) \nonumber \\{} & {} + l_{i,e}^{\textrm{execute}}(t) + l_{i,e}^{\textrm{nocach}}(t) \end{aligned}$$
(22)

Similarly, the total energy consumption for completing the request-specific workload \(w_{i,c}(t)\) of IoT \(i, i \in {\mathcal {I}}\) can be calculated by:

$$\begin{aligned} e_i^{\textrm{total}}(t)= & {} e_{i,d}^{\textrm{offload}}(t) + \varkappa _d e_{i,d}^{\textrm{execute}}(t) + \varkappa _d e_{i,d}^{\textrm{nocach}}(t)\nonumber \\{} & {} +e_{i,e}^{\textrm{offload}}(t) + e_{i,e}^{\textrm{execute}}(t) + e_{i,e}^{\textrm{nocach}}(t) \end{aligned}$$
(23)

where \(\varkappa _d\) indicates the control parameter related to the length of renewable resource \(B_d^{\textrm{energy}}\) described in section “Energy harvesting model”.

Accordingly, for IoT \(i, i \in {\mathcal {I}}\), the weighted sum of the latency cost and energy consumption cost is written as:

$$\begin{aligned} C_{i}^{\textrm{total}}(t)= \xi ^{\textrm{latency}} l_i^{\textrm{total}}(t) + \xi ^{\textrm{energy}} e_i^{\textrm{total}}(t) \end{aligned}$$
(24)

where \(\xi ^{\textrm{latency}}\) and \(\xi ^{\textrm{energy}}\) are the weighting factors controlling the latency cost and the energy consumption cost for completing the request-specific workloads of IoT \(i, i \in {\mathcal {I}}\), satisfying \(\xi ^{\textrm{latency}}+\xi ^{\textrm{energy}}=1\).
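A compact sketch of Eqs. (22)-(24) follows; since \(\tau _{i,j}^{\textrm{offload}}\) zeroes out one of the two paths, summing the UAV-side and MECS-M-side terms reproduces the totals. The weight values are assumptions.

```python
# A sketch of the weighted cost in Eqs. (22)-(24); the UAV- and MECS-M-side
# terms are mutually exclusive through tau, so summing both paths is safe.
def weighted_cost(l_uav, e_uav, l_mec, e_mec, xi_latency=0.5, xi_energy=0.5):
    assert abs(xi_latency + xi_energy - 1.0) < 1e-9    # weights must sum to 1
    l_total = l_uav + l_mec                            # Eq. (22)
    e_total = e_uav + e_mec                            # Eq. (23), control factors folded in
    return xi_latency * l_total + xi_energy * e_total  # Eq. (24)
```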

We have derived the latency cost, the energy consumption cost, and the weighted total cost for finishing the request-specific workload \(w_{i,c}(t)\) generated by IoT \(i, i \in {\mathcal {I}}\) in Eqs. (22), (23), and (24). The objective of the optimization problem is to find the minimum long-term weighted average cost, including the latency cost and energy consumption cost, while satisfying the following constraints. To this end, the optimization problem can be defined as follows:

$$\begin{aligned} \textbf{P1}&\quad \min _{(\tau _{i,j}^{\textrm{offload}},\rho _{i,c,j}^{\textrm{execute}}) } \quad {\mathbb {E}}\left[ \lim _{t \rightarrow \infty } \frac{1}{T}\frac{1}{M} \sum _{t=1}^{T} \sum _{m=1}^{M} \sum _{i=1}^{I_m} C_{i}^{\textrm{total}}(t)\right] \end{aligned}$$
(25a)
$$\begin{aligned} \text{ s.t. }&\quad \tau _{i,j}^{\textrm{offload}} \in [0,1] \end{aligned}$$
(25b)
$$\begin{aligned}&\quad \rho _{i,c,j}^{\textrm{execute}} \in [0,1],\sum \limits _{c \in {\mathcal {C}}} \rho _{i,c,j}^{\textrm{execute}} \le {\hat{C}}_m \end{aligned}$$
(25c)
$$\begin{aligned}&\quad l_i^{\textrm{total}}(t) \le w_{i,c}^{\textrm{duration}}(t) \end{aligned}$$
(25d)
$$\begin{aligned}&\quad B_{d}^{\textrm{energy}} \le B_{d}^{\textrm{max}} \end{aligned}$$
(25e)
$$\begin{aligned}&\quad f_{d}(t) \le F_{d}^{\textrm{cycle,max}}, f_{e}(t) \le F_{e}^{\textrm{cycle,max}} \end{aligned}$$
(25f)

where (25a) means the optimization problem's objective is to jointly find the optimal request-specific workload offloading decision-making policy and service caching hosting selection for minimizing the long-term weighted average cost, including the latency cost and energy consumption cost. Constraint (25b) gives the range of the request-specific workload offloading decision variable. Constraint (25c) indicates whether the UAV or MECS-M hosts the service caching for the request-specific workload and ensures that the number of service cachings hosted by an UAV or MECS-M does not exceed the maximum \({\hat{C}}_m\). Constraint (25d) guarantees that each request-specific workload is completed within its deadline. Constraint (25e) requires that the length of the energy buffer queue of each UAV does not exceed the maximum size. Constraint (25f) denotes that the CPU cycles allocated by the UAV and MECS-M for processing offloaded request-specific workloads cannot exceed their maximum available CPU cycles \(F_{d}^{\textrm{cycle,max}}\) and \(F_{e}^{\textrm{cycle,max}}\).
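As an illustration, the following hedged Python helper checks constraints (25b)-(25f) for one candidate decision pair; all argument names are hypothetical.

```python
# A hedged helper that checks constraints (25b)-(25f) for one candidate
# decision pair; argument names and bounds are hypothetical.
def feasible(tau, rho_list, l_total, deadline, B_energy, B_max,
             f_d, f_e, F_d_max, F_e_max, C_hat):
    return (0 <= tau <= 1                              # (25b): offloading decision
            and all(0 <= r <= 1 for r in rho_list)     # (25c): caching decisions
            and sum(rho_list) <= C_hat                 # (25c): hosting capacity
            and l_total <= deadline                    # (25d): workload deadline
            and B_energy <= B_max                      # (25e): energy-queue bound
            and f_d <= F_d_max and f_e <= F_e_max)     # (25f): CPU cycle limits
```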

Transformation of cost minimization problem

Cost minimization problem \(\textbf{P1}\) jointly optimizes the request-specific workload offloading selection \(\tau _{i,j}^{\textrm{offload}}\) and the service caching hosting selection \(\rho _{i,c,j}^{\textrm{execute}}\). Obviously, these two optimization variables are coupled. To deal with this conundrum, we decompose the cost minimization problem \(\textbf{P1}\) into two sub-problems, \(\textbf{P2}\) and \(\textbf{P3}\).

In terms of \(\textbf{P2}\), the goal is to find the optimal request-specific workload offloading decision-making policy for minimizing the long-term weighted average cost with the service caching configuration \(\rho _{i,c,j}^{\textrm{execute}}\) on UAV and MECS-M obtained in advance. To this end, we define \(\textbf{P2}\) as follows.

$$\begin{aligned} \textbf{P2}&\quad \min _{(\tau _{i,j}^{\textrm{offload}})} \quad {\mathbb {E}} \left[ \lim _{t \rightarrow \infty } \frac{1}{T}\frac{1}{M} \sum _{t=1}^{T} \sum _{m=1}^{M}\sum _{i=1}^{I_m} C_{i}^{\textrm{total}}(t)\right] \end{aligned}$$
(26a)
$$\begin{aligned} \text{ s.t. }&\quad \textrm{constraints} \quad \mathrm{(25b), (25d), (25e), (25f)} \end{aligned}$$
(26b)
$$\begin{aligned}&\quad \sum \limits _{c \in {\mathcal {C}}} \rho _{i,c,j}^{\textrm{execute}} = {\hat{C}}_m \end{aligned}$$
(26c)

It is worth noting that constraint (26c) means the UAV and MECS-M host a fixed number of service cachings, and the types of hosted service cachings are fixed.

Similarly, the objective of \(\textbf{P3}\) is to obtain the optimal service caching hosting selection policy for minimizing the long-term weighted average cost under a given \(\tau _{i,j}^{\textrm{offload}}\); that is, the total size of request-specific workloads offloaded by IoTs to the UAV and MECS-M is known in advance. Accordingly, \(\textbf{P3}\) can be defined as follows

$$\begin{aligned} \textbf{P3}&\quad \min _{(\rho _{i,c,j}^{\textrm{execute}})} \quad {\mathbb {E}}\left[ \lim _{t \rightarrow \infty } \frac{1}{T}\frac{1}{M} \sum _{t=1}^{T} \sum _{m=1}^{M} \sum _{i=1}^{I_m} C_{i}^{\textrm{total}}(t)\right] \end{aligned}$$
(27a)
$$\begin{aligned} \text{ s.t. }&\quad \textrm{constraints}\quad \mathrm{(25c), (25d), (25e), (25f)} \end{aligned}$$
(27b)

Medley DRL-based optimization scheme

We elaborate on the implementation of the long-term weighted average cost minimization problem. In section “Transformation of cost minimization problem”, we transformed the joint optimization problem \(\textbf{P1}\) into two separate optimization sub-problems, \(\textbf{P2}\) and \(\textbf{P3}\). However, sub-problems \(\textbf{P2}\) and \(\textbf{P3}\) are still NP-hard, which traditional convex optimization schemes and heuristic algorithms cannot address efficiently. To this end, we deploy DRL to tackle these two sub-problems separately.

MADDPG-based multi-workloads offloading decision-making scheme

As described in section “Network model”, the coverage of each UAV and MECS-M does not overlap, meaning that each IoT can only communicate with one UAV or MECS-M and cannot move out of their coverage in each time slot. Each IoT can be considered a training agent in each group (some IoTs, an UAV, and an MECS-M). All IoTs cooperate in implementing optimal offloading decisions based on the proposed MEC environment to obtain the maximum expected rewards and achieve the minimum weighted average cost. Therefore, we exploit the MADDPG algorithm to tackle \(\textbf{P2}\). Figure 2 illustrates the schematic diagram of the MADDPG-based decision-making training and execution model for IoT agents.

Fig. 2: MADDPG-based decision-making training and execution model for IoT agents

Then, we give the workflow of the MADDPG agents. In time slot t, there are \(I_m\) agents, and IoT \(i_m, i_m \in {\mathcal {I}}_m\) is the \(i_m\)-th agent. Each IoT agent interacts with the UAV-enabled MEC environment for joint training and adopts centralized training with decentralized execution; that is, joint actions are derived from the joint states during training to obtain the optimal expected cumulative rewards, while during execution each agent acts based on its local observation. Furthermore, we define the state space of the proposed MADDPG-based optimization scheme as \({\textbf{O}}\), and the state \(o_{i_m}, o_{i_m} \in {\textbf{O}}\) of agent \(i_m\) in time slot t can be written as

$$\begin{aligned} o_{i_m}(t)= & {} \{w(i_m,c)(t), h_{i_m,d}(t), h_{i_m,e}(t), \rho _{i_m,j}^{\textrm{execute}}, B_d^{\textrm{energy}}(t),\nonumber \\{} & {} f_{i_m,d}^{\textrm{alloc}}(t),f_{i_m,e}^{\textrm{alloc}}(t), {\tilde{\rho }}_{i,c^{\star },j}^{\textrm{execute}} \} \end{aligned}$$
(28)

where \(w(i_m,c)(t), h_{i_m,d}(t), h_{i_m,e}(t), \rho _{i_m,j}^{\textrm{execute}}\) are locally observable information of the IoT's own agent, and \(B_d^{\textrm{energy}}(t), f_{i_m,d}^{\textrm{alloc}}(t),f_{i_m,e}^{\textrm{alloc}}(t)\) are global information for the IoT's agent, which must be received from the UAV and MECS-M via broadcasting. Most importantly, \({\tilde{\rho }}_{i,c^{\star },j}^{\textrm{execute}}\) represents the optimal service caching hosting decision of the UAV or MECS-M.

We continue by defining the action space as \({\textbf{A}}\); the action \(a_{i_m}, a_{i_m} \in {\textbf{A}}\) of agent \(i_m\) in time slot t is given by \(a_{i_m}(t)=\{\tau _{i,j}^{\textrm{offload}}\}\), where \(\tau _{i,j}^{\textrm{offload}}\) is limited to [0, 1].

In terms of the reward function, the goal of \(\textbf{P2}\) is to find the optimal request-specific workload offloading decision-making policy for minimizing the long-term weighted average cost. To this end, we define the reward function as \(r_{i_m}(t)=-{\textbf{F}}(c_{i_m}^{\textrm{total}}(t))\), where \({\textbf{F}}\) is a linear function of \(c_{i_m}^{\textrm{total}}(t)\). In this way, minimizing the weighted average cost is transformed into maximizing the average cumulative rewards.

Without loss of generality, the joint observation for IoT agent \(i_m\) is defined as \({{\textbf {o}}}_{m}(t)= \{o_1, o_2, \ldots , o_{I_m}\}\). From Fig. 2, while the Actor network only requires local information to guide its actions, the Critic network must acquire global information from the environment during training. Accordingly, the policy gradient of the Actor network can be defined as:

$$\begin{aligned}{} & {} \nabla _{\theta ^{\mu }} J\left( \mu _{\theta }\right) \nonumber \\{} & {} \quad ={\mathbb {E}}_{{\textbf{o}}_m \sim W_m}\left[ \nabla _{\theta ^{\mu }} \mu _{\theta } \left( {\textbf{o}}_m\right) \nabla _{a_{i_m}} Q_{i_m}^{\mu } \left( {\textbf{o}}_m, a_1,a_2,\ldots ,a_{I_m}\right) \right. \nonumber \\{} & {} \qquad \left. \mid _{a_{i_m}=\mu _{\theta _{i_m}}\left( {\textbf{o}}_m\right) }\right] \end{aligned}$$
(29)

where \(\theta = [\theta _1, \theta _2, \ldots , \theta _{I_m}]\) denotes the policy parameters of the Actor's primary networks of MADDPG, \(\theta ^{\mu } = [\theta _1^{\mu }, \theta _2^{\mu }, \ldots , \theta _{I_m}^{\mu }]\) denotes the policy parameters of the Actor's target networks, and \(\mu _{\theta }\) represents the policy of the Actor networks. \(W_m\) stands for the experience replay memory, which is composed of tuples \(\{{\textbf{o}}_m, a_1, a_2, \ldots , a_{I_m}, r_1, r_2, \ldots , r_{I_m}, {\textbf{o}}_m^{\prime } \}\). \(Q(o_{i_m}, a_{i_m})\) denotes the state-action function, which is the expected cumulative reward. The ultimate goal of experience replay is to prevent overfitting and satisfy the independent and identically distributed assumption. Moreover, experience replay also makes samples reusable, thereby improving the learning efficiency of DRL.

In terms of the Critic’s target network, we can gain the target expected cumulative reward by

$$\begin{aligned} y_{i_m}(t)=r_{i_m}+\gamma {\hat{Q}}_{i_m}^{\mu ^{\prime }}\left( \mathbf {o_m^{\prime }}, a_1^{\prime },a_2^{\prime },\ldots ,a_{i_m}^{\prime } \mid _{a_{i_m}^{\prime }=\mu _{\theta ^{\mu }}^{\prime } \left( {\textbf{o}}_m^{\prime }\right) }\right) \nonumber \\ \end{aligned}$$
(30)

To this end, the loss function of the centralized mode can be indicated as follows:

$$\begin{aligned} L= & {} {\mathbb {E}}_{\left( {\textbf{o}}_m, a_{i_m}, r_{i_m}, \mathbf {o_m^{\prime }}\right) \sim W^{\textrm{buffer}}}\nonumber \\{} & {} \times \left( \left( y_{i_m}(t)-Q_{i_m}^{\mu }\left( {\textbf{o}}_m, a_1,a_2,\ldots ,a_{i_m} \mid \theta _{i_m}^{Q}\right) \right) ^{2}\right) \nonumber \\ \end{aligned}$$
(31)

The Actor and Critic network parameters (four networks per agent) of the target policy networks are updated via the soft update pattern, where \(\tau \) in Eq. (32) denotes the soft update rate (not the offloading variable):

$$\begin{aligned} \left\{ \begin{array}{l} {\theta ^{\mu ^{\prime }} \leftarrow \tau \theta ^{\mu }+(1-\tau ) \theta ^{\mu ^{\prime }}} \\ {\theta ^{Q^{\prime }} \leftarrow \tau \theta ^{Q}+(1-\tau ) \theta ^{Q^{\prime }}}\end{array}\right. \end{aligned}$$
(32)
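To make the update rules of Eqs. (29)-(32) concrete, the condensed PyTorch sketch below performs one MADDPG training step for \(I_m\) cooperating agents under the centralized training and decentralized execution pattern; the network sizes, learning rates, and batch layout are illustrative assumptions, not the settings of our experiments.

```python
# A condensed PyTorch sketch of one MADDPG update (Eqs. (29)-(32)); all
# dimensions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

OBS, ACT, N_AGENTS, TAU, GAMMA = 8, 1, 3, 0.01, 0.95

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

actors    = [mlp(OBS, ACT) for _ in range(N_AGENTS)]                  # decentralized actors
critics   = [mlp(N_AGENTS * (OBS + ACT), 1) for _ in range(N_AGENTS)] # centralized critics
actors_t  = [mlp(OBS, ACT) for _ in range(N_AGENTS)]
critics_t = [mlp(N_AGENTS * (OBS + ACT), 1) for _ in range(N_AGENTS)]
for net, net_t in zip(actors + critics, actors_t + critics_t):
    net_t.load_state_dict(net.state_dict())                           # sync targets
opt_a = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
opt_c = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

def update(obs, act, rew, obs2):
    """One step; obs/obs2: (B, N, OBS), act: (B, N, ACT), rew: (B, N, 1)."""
    B = obs.shape[0]
    for i in range(N_AGENTS):
        # Eq. (30): TD target from target actors and the target critic
        with torch.no_grad():
            act2 = torch.cat([actors_t[j](obs2[:, j]) for j in range(N_AGENTS)], dim=1)
            y = rew[:, i] + GAMMA * critics_t[i](torch.cat([obs2.reshape(B, -1), act2], dim=1))
        # Eq. (31): centralized critic regression on the TD target
        q = critics[i](torch.cat([obs.reshape(B, -1), act.reshape(B, -1)], dim=1))
        loss_c = ((y - q) ** 2).mean()
        opt_c[i].zero_grad(); loss_c.backward(); opt_c[i].step()
        # Eq. (29): ascend the critic's value w.r.t. agent i's own action
        acts = [actors[j](obs[:, j]).detach() for j in range(N_AGENTS)]
        acts[i] = actors[i](obs[:, i])
        loss_a = -critics[i](torch.cat([obs.reshape(B, -1), torch.cat(acts, dim=1)], dim=1)).mean()
        opt_a[i].zero_grad(); loss_a.backward(); opt_a[i].step()
        # Eq. (32): soft update of the target networks
        for net, net_t in ((actors[i], actors_t[i]), (critics[i], critics_t[i])):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```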

Algorithm 1 signifies the MADDPG-based request-specific workloads decision-making optimization scheme for all IoTs.

Algorithm 1: The MADDPG-based request-specific workloads offloading decision-making optimization scheme for all IoTs

DDQN-based service caching hosting decision-making scheme

The aim of problem \(\textbf{P3}\) is to seek the optimal service caching hosting selection policy for the UAV and MECS-M, which is specific to the workloads generated by the IoTs covered by that UAV and MECS-M. Since the coverage of each UAV and MECS-M does not overlap and the total number of coverage regions is M, as explained in section “Network model”, there is no association between the IoTs, UAVs, and MECS-Ms of different coverage regions. Accordingly, for each coverage region, the UAVs and MECS-Ms can determine which service cachings to host based on the features of the request-specific workloads generated by all linked IoTs under the optimal offloading policy given by \(\textbf{P2}\). Because the service cachings are selectively hosted by the UAV or MECS-M, we transform the optimization problem \(\textbf{P3}\) for each IoT into the optimization problem \(\textbf{P4}\) for each edge server (UAV or MECS-M), denoted as follows:

$$\begin{aligned} \textbf{P4}&\quad \min _{(\rho _{i,c,j}^{\textrm{execute}})} \quad {\mathbb {E}}\left[ \lim _{t \rightarrow \infty } \frac{1}{T}\frac{1}{M} \sum _{t=1}^{T} \sum _{m=1}^{M} C_{I_m}^{\textrm{total}}(t)\right] \end{aligned}$$
(33a)
$$\begin{aligned} \text{ s.t. }&\quad \textrm{constraints}\quad \mathrm{(25c), (25d), (25e), (25f)} \end{aligned}$$
(33b)

From section “Network model”, there are M groups of UAVs and MECS-Ms, and the m-th group serves \(I_m\) IoTs. For the \(I_m\) IoTs covered by the m-th group, \(C_{I_m}^{\textrm{total}}(t)\) indicates the weighted cost for completing all request-specific workloads generated by the \(I_m\) IoTs in time slot t, which can be computed as

$$\begin{aligned} C_{I_m}^{\textrm{total}}(t)= \xi ^{\textrm{latency}} \max \limits _{i \in {\mathcal {I}}_m}\{l_i^{\textrm{total}}(t)\} + \xi ^{\textrm{energy}} \sum _{i=1}^{I_m} e_i^{\textrm{total}}(t) \end{aligned}$$
(34)

To this end, we deploy the DDQN scheme, a model-free DRL algorithm, to solve \(\textbf{P4}\) in the proposed UAV-enabled MEC environment. Figure 3 shows the DDQN-based service caching selection policy model for all UAVs and MECS-Ms. Since the coverage of each UAV and MECS-M does not overlap, we propose a decentralized DDQN scheme that sets each UAV or MECS-M as a learning agent; each agent trains its optimal service caching hosting selection policy in a decentralized training and decentralized execution pattern. Without loss of generality, we define the state of UAV \(d, d \in {\mathcal {D}}\) in group m during time slot t as \(s_m(t)=\{w(i_m,c)(t), h_{i_m,d}(t), h_{i_m,e}(t), \rho _{i_m,j}^{\textrm{execute}}, B_d^{\textrm{energy}}(t), f_{i_m,d}^{\textrm{alloc}}(t),f_{i_m,e}^{\textrm{alloc}}(t), {\tilde{\tau }}_{i,j}^{\textrm{execute}}\}\), where \({\tilde{\tau }}_{i,j}^{\textrm{execute}}\) indicates the workload offloading decision of IoT \(i, i \in {\mathcal {I}}_m\). Similarly, the action of UAV \(d, d \in {\mathcal {D}}\) in group m is defined as \(a_m(t)=\{\rho _{i,c,j}^{\textrm{execute}}\}\), and the reward function is set as \(r_{m}(t)=-{\textbf{F}}(c_{I_m}^{\textrm{total}}(t))\). For MECS-M \(e, e \in {\mathcal {E}}\) in group m, the state, action, and reward are defined in the same way as for UAV \(d, d \in {\mathcal {D}}\), so we omit the analogous definitions.

For the service caching decision-making policy of UAV \(d, d \in {\mathcal {D}}\) or MECS-M \(e, e \in {\mathcal {E}}\) in group m, we deploy the \(\epsilon \)-greedy policy, which is defined as follows:

$$\begin{aligned} a_m(t)= \left\{ \begin{array}{ll} \arg \max \limits _{a \in {\mathcal {A}}_m} Q_m(s_m(t), a \vert \theta _m), &{} P(a_m(t))=1-\epsilon \\ \textrm{random}(0, \textrm{length}(a_m(t))), &{} P(a_m(t))=\epsilon \end{array} \right. \end{aligned}$$
(35)

where \({\mathcal {A}}_m\) denotes the action space of the agent for UAV \(d, d \in {\mathcal {D}}\) or MECS-M \(e, e \in {\mathcal {E}}\) in group m, and \(\textrm{length}(a_m(t))\) denotes the number of available actions in the action space \({\mathcal {A}}_m\).
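A minimal sketch of this \(\epsilon \)-greedy rule, assuming the Q-values for all actions in \({\mathcal {A}}_m\) are already available as a NumPy array:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random.default_rng()):
    """Eq. (35): with probability 1 - epsilon exploit the best action; with
    probability epsilon explore a random action index in [0, len(q_values)).
    Rewards are negative costs, so exploitation takes the arg max."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```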

From Fig. 3, there are M UAV or MECS-M agents interacting with the UAV-enabled MEC environment. Moreover, we define M experience replay memories to store the training samples for the DDQN networks, with each sample tuple set as \(\{s_m, a_m, r_m, s_m^{\prime }\}\). Each agent is served by two neural networks: a primary network and a target network. For the DDQN agent, the expected cumulative reward function (Q-function) is obtained with the temporal-difference error (TD-error) model by

$$\begin{aligned} y_{m}(t)=r_{m}(t)+\gamma _{m} Q_m^{\prime }\left( s_{m}^{\prime }(t), \arg \max _{a \in {\mathcal {A}}_m} Q_m \left( s_{m}^{\prime }(t), a \vert \theta _m\right) \vert \theta _m^{\prime }\right) \end{aligned}$$
(36)

To this end, the loss function of agent m is obtained through continuous training of the primary network and periodic parameter updates of the target network:

$$\begin{aligned} L_m(Q_m,Q_m^{\prime })={\mathbb {E}}_{(s_{m}, a_{m}, r_{m}, s_{m}^{\prime }) \sim W_m }\left[ \left( y_{m}(t)-Q_m \left( s_{m}(t), a_{m}(t) \vert \theta _m\right) \right) ^{2}\right] \end{aligned}$$
(37)

where \(\gamma _{m}\) is the discount factor of reward updated by decay mode during training of the primary network from agent m.
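To make the two updates concrete, the following minimal sketch computes the double-DQN target of Eq. (36) and the minibatch loss of Eq. (37); `q_primary_fn` and `q_target_fn` are illustrative stand-ins for the primary and target networks, each mapping a state to a vector of Q-values:

```python
import numpy as np

def ddqn_target(reward, q_next_primary, q_next_target, gamma):
    """Eq. (36): the primary network selects the next action (arg max) and
    the target network evaluates it, the decoupling that defines double DQN
    and reduces overestimation bias."""
    a_star = int(np.argmax(q_next_primary))          # action selection
    return reward + gamma * q_next_target[a_star]    # action evaluation

def ddqn_loss(batch, q_primary_fn, q_target_fn, gamma):
    """Eq. (37): mean squared TD-error over a minibatch sampled from the
    replay memory W_m; `batch` is a list of (s, a, r, s') tuples."""
    errors = []
    for s, a, r, s_next in batch:
        y = ddqn_target(r, q_primary_fn(s_next), q_target_fn(s_next), gamma)
        errors.append((y - q_primary_fn(s)[a]) ** 2)
    return float(np.mean(errors))
```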

Fig. 3 DDQN-based service caching hosting decision-making scheme

Algorithm 2 explains the execution pattern of the DDQN-based service caching hosting decision-making scheme; a minimal toy sketch of this loop follows the listing below.

Algorithm 2 DDQN-based service caching hosting decision-making scheme
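The following self-contained toy sketch mirrors the decentralized loop of Algorithm 2 under heavy simplifications: linear Q-models replace the real neural networks, and a random-noise environment replaces the UAV-enabled MEC simulator; all names and constants are illustrative, not the paper's code.

```python
import numpy as np

# Toy stand-ins: M decentralized agents, each owning a linear Q "network"
# w[m] and a target copy w_tgt[m], plus one replay memory W_m per agent.
rng = np.random.default_rng(0)
M, T, EPISODES, S_DIM, N_ACT = 3, 20, 3, 8, 4
GAMMA, EPS, LR, SYNC = 0.9, 0.1, 1e-2, 50
w = [rng.normal(size=(S_DIM, N_ACT)) * 0.1 for _ in range(M)]
w_tgt = [wm.copy() for wm in w]
memory = [[] for _ in range(M)]
steps = 0

for ep in range(EPISODES):
    s = rng.normal(size=(M, S_DIM))                    # one state per group m
    for t in range(T):
        # epsilon-greedy action per agent (Eq. (35))
        a = [int(np.argmax(s[m] @ w[m])) if rng.random() > EPS
             else int(rng.integers(N_ACT)) for m in range(M)]
        s2 = rng.normal(size=(M, S_DIM))               # toy env transition
        r = -rng.uniform(size=M)                       # r_m(t) = -F(C^total)
        for m in range(M):
            memory[m].append((s[m], a[m], r[m], s2[m]))
            sm, am, rm, sm2 = memory[m][rng.integers(len(memory[m]))]
            a_star = int(np.argmax(sm2 @ w[m]))        # select with primary
            y = rm + GAMMA * (sm2 @ w_tgt[m])[a_star]  # evaluate with target (Eq. (36))
            td = y - (sm @ w[m])[am]
            w[m][:, am] += LR * td * sm                # one SGD step on Eq. (37)
        steps += 1
        if steps % SYNC == 0:
            w_tgt = [wm.copy() for wm in w]            # theta_m' <- theta_m
        s = s2
```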

Complexity analysis

For the MADDPG-based workload offloading decision-making algorithm of the proposed WSSMDRL, in terms of each IoT device \(i\in {{{\mathcal {I}}}_{m}}\) in group \(m\in {\mathcal {M}}\), the neural network for MADDPG is composed of \({{N}_{a}}+1\) layers: an input layer, \({{N}_{a}}-1\) fully connected layers, and an output layer. Let K be the number of training samples and E the number of epochs for computing the loss function. \({{m}_{I}}\) represents the dimension of the input layer, which is related to the size of the state space; \({{m}_{l}}\) is the number of neurons in hidden layer l, where \(l \ge 2\); and \({{m}_{O}}\) represents the dimension of the output layer, which is related to the size of the action space. Therefore, the computational complexity of MADDPG is \(O({{I}_{m}}\times K\times E\times {{m}_{I}}\times {{({{N}_{a}}-1)}^{{{m}_{l}}}}\times {{m}_{O}})\). We ignore the complexity \(O({{m}_{l}})\) of the activation functions, which depends only on \({{m}_{l}}\) and is far smaller than the complexity of the input, hidden, and output layers of MADDPG.

Similarly, for the DDQN-based service caching placement algorithm of the proposed WSSMDRL, in terms of each edge server (UAV or MECS-M), the neural network for DDQN is composed of \({{N}_{b}}+1\) layers: an input layer, \({{N}_{b}}-1\) fully connected layers, and an output layer. Again, K is the number of training samples and E the number of epochs for computing the loss function; \({{m}_{s}}\) represents the dimension of the input layer, and \({{m}_{l}}\) the number of neurons in layer l, where \(l \ge 2\). Therefore, the computational complexity of the DDQN-based algorithm is \(O(K\times E\times {{m}_{s}}\times {{({{N}_{b}}-1)}^{{{m}_{l}}}}\times {{m}_{l}})\). It is worth noting that each training agent for a UAV or MECS-M is independent, because the DDQN-based service caching placement algorithm is trained in a decentralized manner in the simulation environment; its computational complexity is therefore independent of the number of agents. Here, too, we neglect the complexity of the activation functions.

Therefore, the computational complexity of the proposed WSSMDRL algorithm is \(O({{I}_{m}}\times K\times E\times {{m}_{I}}\times {{({{N}_{a}}-1)}^{{{m}_{l}}}}\times {{m}_{O}}+K\times E\times {{m}_{s}}\times {{({{N}_{b}}-1)}^{{{m}_{l}}}}\times {{m}_{l}})\). Because it combines two learning algorithms, WSSMDRL has a higher computational complexity than a traditional single learning algorithm. However, as GPU hardware improves and the underlying learning algorithms are optimized, the performance of the combined learning algorithm will gradually improve as well.
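The per-layer factors in these expressions count the work of a forward pass through a fully connected network; the tiny helper below makes that counting explicit as multiply-accumulate operations (the layer widths in the example are illustrative, not the paper's exact dimensions):

```python
def dense_network_macs(layer_sizes):
    """Multiply-accumulate count of one forward pass through a fully
    connected network; layer_sizes lists the input, hidden, and output
    widths. This is the quantity the per-layer complexity factors track."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# e.g. a critic-like network: state_dim -> 300 -> 200 -> 1
print(dense_network_macs([32, 300, 200, 1]))  # 32*300 + 300*200 + 200*1
```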

Performance evaluation

In this section, we first provide the parameter configuration of the experiment, then describe four benchmark algorithms, and finally verify the convergence and advantages of the proposed WSSMDRL algorithm. WSSMDRL is trained and tested on a hardware platform with an NVIDIA RTX 2080Ti GPU and an Intel i9-9900K CPU; the software platform is PyCharm 2020.1 with TensorFlow 1.13.1.

Experiment configuration

In our proposed UAV-enabled MEC environment, there are three UAVs, three MECS-Ms, and 30 IoTs; that is, \(D=3, E=3\), and \(I=30\). Each group includes 10 IoTs, \(I_m=10\). The total number of service cachings is defined as \(C=10\), and the maximum number of service cachings hosted by each UAV or MECS-M is set as \({\hat{C}}_m=2\). The total duration of the system is set as \(T=100\). The arrival rate of renewable energy for each UAV is set as \(\lambda _d^{r}=3\), the loss coefficient is \(\xi _d=0.2\), and the maximum length of the energy queue is \(B_d^{\textrm{max}}=10\). The parameters of the communication model are set as \(z_i=z_e=0\), \(z_d=100\), \(\Phi _d^{\textrm{loss}}=\Phi _e^{\textrm{loss}}=10^{-3}\), \(Dis_d^{\textrm{init}}=Dis_e^{\textrm{init}}=1\), \(\varrho _{d}=\varrho _{e}=0.95\), \(B_d=B_e=10\,\hbox {MHz}\), \(\sigma _d^2=\sigma _e^2=10^{-9}\), and \(p_{i,d}^{\textrm{offload}}=p_{i,e}^{\textrm{offload}}=1\,\hbox {W}\). The parameters of the computation model are set as \(\kappa _{d}=2\times 10^{-29}\), \(\kappa _{e}=2.2\times 10^{-29}\), \(F_{d}^{\textrm{cycle,max}}=1.2\,\hbox {GHz}\), \(F_{e}^{\textrm{cycle,max}}=1.4\,\hbox {GHz}\), \(w_{i,c}^{\textrm{hide}}=0.3\), and \(p_{d,l}^{\textrm{execute}}=1\,\hbox {W}\).
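For convenience, the constants above can be gathered into a single configuration object; the sketch below simply mirrors the values listed in this paragraph (the key names are our own shorthand, not the paper's code):

```python
# Simulation constants from this section, collected in one place.
CONFIG = {
    "D": 3, "E": 3, "I": 30, "I_m": 10,           # UAVs, MECS-Ms, IoTs
    "C": 10, "C_hat_m": 2, "T": 100,              # cachings, per-server limit, slots
    "lambda_d_r": 3, "xi_d": 0.2, "B_d_max": 10,  # energy arrival, loss, queue
    "z_i": 0, "z_e": 0, "z_d": 100,               # heights
    "bandwidth_Hz": 10e6, "noise_power": 1e-9,
    "p_offload_W": 1.0, "p_execute_W": 1.0,
    "kappa_d": 2e-29, "kappa_e": 2.2e-29,
    "F_d_GHz": 1.2, "F_e_GHz": 1.4,
    "w_hide": 0.3,
}
```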

The structure of the MADDPG-based decision-making training and execution model for IoT agents is described in Table 1. As depicted in Fig. 2 of section “MADDPG-based multi-workloads offloading decision-making scheme”, each of the \(I_m\) agents interacts with the UAV-enabled MEC environment through multiple actor and critic neural networks.

Table 1 The structure of each neural network

The actor and critic networks each include two fully connected layers with 300 and 200 neurons, respectively. The activation function in the actor's hidden layers is tanh, and that in the critic's hidden layers is ReLU. Furthermore, the main hyperparameters of our proposed MADDPG-based algorithm are listed in Table 2. Each learning algorithm is trained for 400 episodes per run, and the reported learning results of each algorithm are the mean of the 10 best runs out of 100 runs. The results obtained by training and testing are therefore more reliable and precise.
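A minimal sketch of these two network structures in tf.keras (available in the TensorFlow 1.13 release used here) is given below; the output-layer activations and the input dimensions are our assumptions, since Table 1 is not reproduced:

```python
import tensorflow as tf

def build_actor(state_dim, action_dim):
    """Actor: two fully connected layers (300, 200) with tanh, per the
    description above; a tanh output keeps continuous actions bounded
    (an assumption, not confirmed by the table)."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(300, activation="tanh", input_shape=(state_dim,)),
        tf.keras.layers.Dense(200, activation="tanh"),
        tf.keras.layers.Dense(action_dim, activation="tanh"),
    ])

def build_critic(state_dim, action_dim):
    """Critic: Q(s, a) with ReLU hidden layers (300, 200); it scores a
    (state, action) pair with a single linear output."""
    s_in = tf.keras.Input(shape=(state_dim,))
    a_in = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Concatenate()([s_in, a_in])
    x = tf.keras.layers.Dense(300, activation="relu")(x)
    x = tf.keras.layers.Dense(200, activation="relu")(x)
    q = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model([s_in, a_in], q)
```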

Table 2 Hyperparameters setup of MADDPG

Similarly, the structure of DDQN-based service caching hosting decision-making scheme can be described in Table 3.

Table 3 Hyperparameters setup of DDQN

Performance comparison

To evaluate the performance and advantages of our proposed WSSMDRL algorithm, we have introduced four benchmark algorithms as follows:

Workloads offloading scheme based on decentralized DDPG (WOSBDD)

Each piece of service equipment trains its DDPG networks independently, with no cooperation among them; hence there is also no edge cooperation among the multiple edge servers in the upper layer. Regarding the workloads offloading optimization policy, each agent trains its four DDPG networks (actor, critic, and their target copies) independently based on DDPG, with no interactive training among agents; that is, there is no cooperation among the IoT agents. DDQN is still deployed to train the optimal service caching hosting selection strategy for the UAVs and MECS-Ms.

Fig. 4 The trajectory of all UAVs and the distribution of all IoTs

Greedy hosting service caching scheme based on MADDPG (GHSCSBM)

For the request-specific workloads offloading selection strategy, we still use MADDPG. However, we do not deploy DDQN to train the optimal service caching hosting selection strategy for the UAV or MECS-M; instead, a greedy strategy optimizes the service caching hosting selection. The strategy first counts how many times each service caching was required by the arriving workloads in the last two time slots, and the UAV or MECS-M then hosts the two most required cachings, as sketched below.
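A minimal sketch of this greedy heuristic, assuming request histories are available as flat lists of caching indices (the function name and signature are our own):

```python
from collections import Counter

def greedy_cache_selection(recent_requests, capacity=2):
    """GHSCSBM's heuristic as described above: count the service cachings
    requested in the last two time slots and host the `capacity` most
    requested ones. `recent_requests` is a flat list of caching indices
    observed in those slots."""
    counts = Counter(recent_requests)
    return [c for c, _ in counts.most_common(capacity)]

# Example: workload requests over the last two slots
print(greedy_cache_selection([3, 1, 3, 7, 1, 3, 2], capacity=2))  # [3, 1]
```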

Random hosting service caching scheme based on MADDPG (RHSCSBM)

MADDPG is still deployed to optimize the request-specific workloads offloading selection strategy. Nevertheless, DDQN is not deployed to train the optimal service caching hosting selection strategy; the UAV or MECS-M hosts service cachings at random, regardless of which types are required by the arriving workloads.

Genetic algorithm for hosting service caching scheme based on MADDPG (GASCSBM)

MADDPG is deployed to optimize the request-specific workloads offloading selection strategy. For the service caching hosting selection, a genetic algorithm is designed to obtain the optimal service caching hosting selection strategy for the UAV or MECS-M.

Experiment results

In Fig. 4, we can see the projected trajectories of all three UAVs under the different IoT distributions in each group. The orange curves indicate the flying trajectories taken by the UAVs. The red dots stand for IoTs that offload their request-specific workloads to either the UAV or the MECS-M, because there is no obstruction between these IoTs and the MECS-M. In contrast, the green dots denote IoTs that can only offload their request-specific workloads to the UAV, because obstructions lie between them and the MECS-M. In the simulation, each UAV flies from 3D coordinate (0, 0, 100) to 3D coordinate (100, 100, 100) along the planned trajectory.

The experimental results are demonstrated from two perspectives: the convergence of the proposed WSSMDRL algorithm and its performance advantages over the baseline algorithms. Figures 5, 6, 7, and 8 demonstrate the convergence of WSSMDRL. In addition, we define the average cumulative reward, composed of the reward \(r_m(t)\) from the DDQN-based service caching hosting decision-making scheme and the reward \(r_{i_m}(t)\) from the MADDPG-based workloads offloading decision-making scheme, during all time slots \(T^{\textrm{total}}\) of each episode as follows:

$$\begin{aligned} {\textbf{R}}_m^{{\text {mean-perep}}}=\frac{1}{T}\frac{1}{M}\frac{1}{I_m}\sum _{t=1}^{T}\sum _{m=1}^{M} \sum _{i_m=1}^{I_m} \left[ \frac{r_{i_m}(t)+r_{m}(t)}{2}\right] \end{aligned}$$
(38)
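Assuming the per-slot rewards are stored as NumPy arrays of shape (T, M, I_m) for the IoT agents and (T, M) for the edge-server agents, Eq. (38) reduces to a single mean, as in this sketch:

```python
import numpy as np

def mean_reward_per_episode(r_iot, r_server):
    """Eq. (38): average over T slots, M groups, and I_m IoTs per group of
    the two half-weighted rewards. r_iot has shape (T, M, I_m) and
    r_server has shape (T, M)."""
    r_server = r_server[:, :, None]   # broadcast r_m(t) over the IoT axis
    return np.mean((r_iot + r_server) / 2.0)

# Shapes matching the simulation: T=100 slots, M=3 groups, I_m=10 IoTs
T, M, I_m = 100, 3, 10
print(mean_reward_per_episode(np.zeros((T, M, I_m)), np.ones((T, M))))  # 0.5
```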

The performance of WSSMDRL in terms of the average cumulative reward \({\textbf{R}}_m^{{\text {mean-perep}}}\) with different critic-network learning rates \(\alpha _c\) of MADDPG for workloads offloading selection is shown in Fig. 5. We fix the actor network's learning rate of MADDPG at \(\alpha _a=0.0001\) and the learning rate of DDQN for service caching hosting selection at \(\alpha _m=0.001\). It is observed that the convergence of WSSMDRL varies significantly among the different learning rates \(\alpha _c\). The convergence of WSSMDRL is unsatisfactory, particularly when \(\alpha _c\) is set to 0.1 or 0.01, because the excessive learning step size prevents WSSMDRL from effectively learning the decision-making strategies of the two DRL-based schemes. WSSMDRL converges to a stable value of the average cumulative reward \({\textbf{R}}_m^{{\text {mean-perep}}}\) when \(\alpha _c\) is set to 0.001 or 0.0001. From the curves, WSSMDRL is unstable between 400 and 600 episodes when \(\alpha _c\) is set to 0.0001, even though it finally converges after 600 episodes. Therefore, the best convergence is obtained when \(\alpha _c = 0.001\).

Fig. 5 Convergence of WSSMDRL with different learning rates \(\alpha _c\) for the critic network

Figure 6 depicts the convergence of WSSMDRL with different flying heights \(z_d(t)\) of the UAV. Different flight heights affect the flight trajectories of the UAVs, their performance in serving IoTs, and the selection of hosted service cachings, especially through the different channel states between IoTs and UAVs. When the flight height of a UAV is set as \(z_d(t)=80\), it is closer to the IoTs it serves, resulting in better channel states and a shorter transmission distance, and thus better performance of WSSMDRL in terms of \({\textbf{R}}_m^{{\text {mean-perep}}}\). Conversely, when the flight height is set as \(z_d(t)=120\), the performance of WSSMDRL is worse than that at \(z_d(t)=80\) or 100. However, regardless of the height setting, the convergence of WSSMDRL is guaranteed.

Fig. 6 Convergence of WSSMDRL with different flying heights of UAV \(z_d(t)\)

Figure 7 reflects the convergence of WSSMDRL with different probabilities of IoTs being blocked by buildings, \(w_{i,j}^{\textrm{hide}}\). We set this probability in the three groups to 0.2, 0.3, and 0.4, respectively. As can be seen from the curves in Fig. 7, when the occlusion rate is low, the average cumulative reward \({\textbf{R}}_m^{{\text {mean-perep}}}\) is greater. This is because when IoTs are shielded by buildings, they cannot communicate with the MECS-M and can only communicate with the UAVs. In addition, because the computation power of the UAV is weaker than that of the MECS-M and the distance \(Dis_{i,d}(t)\) between IoTs and the UAV is larger than the distance \(Dis_{i,e}(t)\) between IoTs and the MECS-M, the channel state \(h_{i,d}(t)\) is lower than \(h_{i,e}(t)\). Therefore, the curves with high occlusion rates show lower performance.

Fig. 7 Convergence of WSSMDRL with different probabilities of IoTs being blocked by buildings \(w_{i,j}^{\textrm{hide}}\)

Fig. 8 Convergence of WSSMDRL with different numbers of service cachings hosted by UAV or MECS-M \({\hat{C}}_m\)

Fig. 9 Performance of WSSMDRL with different total numbers of IoTs I in all groups

Figure 8 demonstrates the convergence of WSSMDRL with different numbers of service cachings hosted by the UAV or MECS-M, \({\hat{C}}_m\). The maximum number of service cachings that UAVs and MECS-Ms can host directly affects the average cumulative reward \({\textbf{R}}_m^{{\text {mean-perep}}}\). We set the maximum number of service cachings hosted by the UAVs and MECS-Ms in the three groups to 1, 2, and 3, respectively. A larger hosting capacity means the edge server can directly process more request-specific workloads; otherwise, workloads must be offloaded to the cloud server for processing, which inevitably lowers the task completion ratio, incurs more penalties, and further degrades the WSSMDRL algorithm's performance.

Accordingly, after verifying the convergence of WSSMDRL, we demonstrate its performance in training and testing using box-and-whisker diagrams. Figure 9 shows the performance of WSSMDRL with different total numbers of IoTs I in all groups during the training and testing processes. As shown in Fig. 9a, at the initial stage of training, the average cumulative reward \({\textbf{R}}_m^{{\text {mean-perep}}}\) of WSSMDRL is small. After 400 episodes, the performance of WSSMDRL becomes stable; most of the outlier samples in the box-and-whisker diagrams are generated during this initial training stage. As shown in Fig. 9b, the performance of WSSMDRL during testing is significantly better than during training, and the outlier samples in the box-and-whisker diagrams are fewer than those during training.

Furthermore, Figs. 10, 11, 12, and 13 compare the performance of WSSMDRL with the other four benchmark algorithms: WOSBDD, GHSCSBM, RHSCSBM, and GASCSBM. Accordingly, we define the average cumulative reward, composed of the reward \(r_m(t)\) from the DDQN-based service caching hosting decision-making scheme and the reward \(r_{i_m}(t)\) from the MADDPG-based workloads offloading decision-making scheme, during all time slots \(T^{\textrm{total}}\) over all episodes \(\textrm{Episode}^{\textrm{total}}\) as follows:

$$\begin{aligned} {\textbf{R}}_m^{{\text {mean-alleps}}}= & {} \frac{1}{\textrm{Episode}^{\textrm{total}}}\frac{1}{T} \frac{1}{M}\frac{1}{I_m}\nonumber \\{} & {} \sum _{t=1}^{T}\sum _{m=1}^{M} \sum _{i_m=1}^{I_m} \left[ \frac{r_{i_m}(t)+r_{m}(t)}{2}\right] \end{aligned}$$
(39)

We evaluate the five optimization schemes with different numbers of IoTs I in all groups. Firstly, Fig. 10 shows the swarm plots of the five optimization schemes with different numbers of IoTs I in the testing process. As shown in Fig. 10, the number of IoTs I in all groups ranges over [18, 24, 30, 36, 42]; accordingly, the number of IoTs \(I_m\) in each group ranges over [6, 8, 10, 12, 14]. Each point in the plots is the \({\textbf{R}}_m^{{\text {mean-perep}}}\) defined by Eq. (38), i.e., the average cumulative reward over all time slots \(T^{\textrm{total}}\) of one episode; the swarm plots for each algorithm cover all testing episodes \(\textrm{Episode}^{\textrm{total}}\), and \({\textbf{R}}_m^{{\text {mean-alleps}}}\) is the mean of each group of swarm plots for each optimization scheme.

Fig. 10 Swarm plots of five optimization schemes with different numbers of IoTs I in all groups

It can be seen from Fig. 11 that as the number of IoTs in each group increases, the request-specific workloads to be processed also increase, which inevitably increases the computation and communication burden of the UAVs and MECS-Ms, thereby reducing the average cumulative reward \({\textbf{R}}_m^{{\text {mean-alleps}}}\). Nevertheless, regardless of the value of I, the proposed WSSMDRL algorithm outperforms the other four algorithms. However, when \(I \ge 36\), the performance of WSSMDRL and WOSBDD decreases significantly, while that of GHSCSBM, RHSCSBM, and GASCSBM decreases slowly. This is because processing many workloads averages out the service caching requirements, so the choice of hosted service cachings has less influence; the caching-optimization advantage of WSSMDRL and WOSBDD therefore diminishes, narrowing the gap to the other algorithms.

Fig. 11 Performance evaluation of five optimization schemes with different numbers of IoTs I in all groups

Figure 12 displays the performance evaluation of the five optimization schemes with different arrival probabilities \(p_{i,c}^{\textrm{in}}\) of request-specific workloads at the IoTs. We set the range of \(p_{i,c}^{\textrm{in}}\) to [0.4, 0.5, 0.6, 0.7, 0.8]. When \(p_{i,c}^{\textrm{in}}=0.4\), each IoT generates few request-specific workloads per time slot, so the performance gap among the five algorithms is not obvious. However, when \(p_{i,c}^{\textrm{in}} \ge 0.5\), the increased amount of request-specific workloads leads to a smaller average cumulative reward \({\textbf{R}}_m^{{\text {mean-alleps}}}\) for all five algorithms. In addition, the curves in Fig. 12 show that the performance degradation of WSSMDRL and WOSBDD is not significant, whereas that of GHSCSBM, RHSCSBM, and GASCSBM is severe.

Fig. 12 Performance evaluation of five optimization schemes with different arrival probabilities \(p_{i,c}^{\textrm{in}}\) of request-specific workloads by IoTs

Figure 13 reveals the performance evaluation of the five optimization schemes with different capacities of the buffer queue \(B_d^{\textrm{data}}\) on UAV \(d, d \in {\mathcal {D}}\); here, the capacity of the buffer queue \(B_e^{\textrm{data}}\) on MECS-M \(e, e \in {\mathcal {E}}\) is fixed. Although the buffer capacity does not determine which service cachings the UAV and MECS-M host, it affects the processing latency of request-specific workloads. We can see that when \(B_d^{\textrm{data}} < 1.2\), the buffer capacity cannot satisfy the processing latency requirements of the request-specific workloads, so some workloads are not completed within the deadline \(w_{i,c}^{\textrm{duration}}\); the resulting penalties \(\Psi _{i_m}\) on request-specific workloads \(i_m, i_m \in {\mathcal {I}}_m\) degrade performance.

Fig. 13 Performance evaluation of five optimization schemes with different capacities of buffer queue \(B_d^{\textrm{data}}\) on UAV

Conclusions

This paper studied the long-term weighted-average cost minimization problem of multi-IoT request-specific workload offloading and service caching hosting selection in a UAV-enabled MEC framework. Since request-specific workloads generated by massive IoTs need to be offloaded to an upper edge server and processed by a specific service caching, we decomposed the joint optimization problem into two-stage subproblems. First, for the IoTs in each group, we proposed a request-specific workloads decision-making optimization scheme based on MADDPG to obtain the optimal offloading policy under the given service caching categories hosted by the UAVs and MECS-Ms. Next, for the UAVs and MECS-Ms in all groups, we implemented a decentralized DDQN-based scheme to obtain the optimal service caching hosting strategy under the existing offloaded workloads from the IoTs. We carried out comprehensive simulations in the proposed MEC environment, and the experimental results showed that the proposed WSSMDRL converges quickly under different parameter configurations and outperforms the other baseline schemes in terms of average cumulative reward.

Our work still has some shortcomings in both theory and practice. For instance, in terms of the hybrid optimization method, we rely mostly on DRL while neglecting other learning methods, such as combinations of learning-based methods and swarm intelligence methods. Regarding the simulation practice, we do not fully consider the trajectory planning of the UAVs, and we did not conduct complete statistical testing due to the complexity of the simulation environment. In future work, we will study a hybrid scheme that combines metaheuristics and machine learning to optimize the problem further. In addition, we will consider optimizing the flying trajectory of the UAV based on the features of arriving workloads in the proposed system, rather than planning the path in advance through deep learning-based methods. Furthermore, our proposed hybrid optimization method can be further expanded and applied in the digital transformation of future intelligent farms and forest operations.