Introduction

The volume of medical image data, which now exceeds 34 trillion GB, imposes a heavy workload on doctors [1]. The radiomics-based image diagnosis model (RIDM) [2] is a mature clinical diagnostic method, but it is time-consuming and computation-intensive (CI). The medical image cloud [3, 4], a common solution for hospitals to handle large-scale computing tasks, is far from the hospital terminal device (TD), resulting in significant transmission delay and energy consumption (DEC).

Task offloading (TO) [5], a critical technology of edge computing (EC) [6], offers a way out of the above dilemma by offloading the CI task to a closer edge server (ES). This can effectively decrease delay, but it also increases energy consumption. Thus, choosing an appropriate offloading strategy for the RIDM task to trade off DEC is a key problem [7]. In fact, the complexity of medical image data requires significant computational resources to support the execution of the various phases in RIDM. In addition, the combination of different methods in each radiomics phase results in different RIDM tasks. Hence, we believe that a good TO strategy should improve the RIDM execution efficiency and adapt to different RIDM environments. Nevertheless, to obtain such a TO strategy, the following issues in the RIDM task must be addressed.

The execution efficiency of the RIDM task is constrained because existing TO solutions are divided into binary and partial offloading based on task separability [8]. Given the complexity of the radiomics workflow, such simplistic task partitioning has proven unsuitable. Therefore, correctly partitioning and handling subtasks with multiple dependencies according to the internal logic of the workflow is crucial for the successful execution of RIDM tasks [9]. On the other hand, owing to the limited resources of an ES, executing assigned subtasks independently results in slower speeds, thereby degrading the user experience [10]. Thus, collaboration between ESs must be developed to speed up task processing.

The TO problem mentioned above is NP-hard [11]. Many solutions based on heuristic [8] or approximation [12] (HA) algorithms have been developed. Nevertheless, HA algorithms depend on expert knowledge and precise mathematical models (EKM), which makes them impractical for adapting to different RIDM environments. Model-free deep reinforcement learning (DRL) [13, 14] has received widespread attention because it relies neither on manual intervention nor on EKM. However, DRL suffers from low sample efficiency and brittleness to hyperparameters [15]. Hence, a robust DRL algorithm is urgently needed for the RIDM-TO problem to adapt to different RIDM environments.

Therefore, the following three challenges need to be solved in RIDM task offloading. First, how can the dependencies between modules (i.e., subtasks) in the RIDM task be represented? Second, how can collaborative computing between ESs be exploited to improve the efficiency of RIDM task execution? Third, how can the drawbacks of model-free DRL be overcome to improve the robustness of the offloading decision-making process?

Motivated by the above challenges, we propose a distributed collaboration-dependent task offloading strategy based on DRL (DCDO-DRL). In particular, considering the uniqueness of the radiomics workflow and the limited resources of ESs, we combine reinforcement learning (RL) [16], a sequence-to-sequence (S2S) neural network [17] and EC to optimize the offloading decision-making process of the RIDM task. Specifically, the main contributions of this article are summarized as follows.

1. This article proposes a DCDO-DRL strategy that can improve the RIDM execution efficiency and adapt to different RIDM environments. In DCDO-DRL, we use RL to model the TO problem as a Markov decision process (MDP). DCDO-DRL aims to maximize the RIDM task utility, a weighted sum of the DEC generated by execution.

2. In a radiomics workflow-based medical scenario, the RIDM task consists of several dependent subtasks that can be modeled as a directed acyclic graph (DAG). The offloading decision process in the DAG is represented by the sequence prediction of the S2S neural network. A distributed collaboration processing (DCP) algorithm across multiple ESs, based on the network topology and available resources, is proposed for subtasks offloaded to the edge.

3. The DCDO-DRL strategy utilizes a discrete soft actor-critic (SAC) method based on maximum entropy to learn a robust DRL algorithm empowered by the S2S neural network, enabling it to adapt to different RIDM environments. In particular, we modify the action space of the SAC algorithm from continuous to discrete to adapt to the offloading actions in the RIDM task.

4. We prove the convergence and statistical superiority of the DCDO-DRL strategy. The numerical results reveal that, compared with other algorithms, the DCDO-DRL strategy improves the execution utility of the RIDM task by at least 23.07%, 12.77%, and 8.51% in the three scenarios.

Related work

The massive amount of data poses challenges to traditional medical image processing based on MapReduce and Hadoop [18,19,20,21]. Cloud computing (CC) is a proven way to manage and process big data [22, 23]. However, in the medical imaging cloud there are great distances between the CC and TDs, so transferring a large amount of image data incurs significant delay and energy consumption. Task offloading has attracted wide attention as one of the most promising solutions to this issue [24]. Unfortunately, researchers have paid little attention to improving medical image processing through task offloading; existing work concentrates mainly on fields such as the Internet of Vehicles and unmanned aerial vehicles.

The existing task offloading strategies fall into two categories: HA-based TO and DRL-based TO. HA-based TO strategies rely on expert knowledge or precise mathematical models. For example, Li et al. proposed a binary offloading policy based on an alternating direction method of multipliers algorithm to minimize power consumption [25]. To minimize system cost, Pan et al. proposed a heuristic algorithm to solve the binary computation offloading problem, which is formulated as a mixed-integer non-linear programming problem [26]. Chen and Wang proposed a situation-aware binary offloading strategy based on heuristic algorithms that optimizes delay and energy consumption by opportunistically exploiting changing resource availability [8]. To minimize task latency and energy consumption, Zhang et al. proposed an offloading scheme that adjusts the task priority in the subtask dependency graph [27]. Fu et al. aimed to minimize the energy consumption of task execution with an iterative algorithm based on successive convex approximation [28]. Bi et al. combined particle swarm optimization and genetic learning to design a meta-heuristic algorithm that minimizes the system energy [29]. These studies adopt HA algorithms backed by expert knowledge, which are difficult to adjust dynamically to different environments. In addition, when the task offloading problem is large, the decision generation time is very long and only an approximately optimal solution can be obtained.

DRL-based strategies continuously optimize offloading decisions through online learning and gradually converge to the optimal offloading strategy. For example, Wang et al. combined Lyapunov optimization, multi-armed bandits, and extreme value theory to propose a learning-based energy-aware task offloading policy [30]. Seid et al. formulated the task offloading problem as a Markov decision process under a stochastic game to minimize energy consumption and delay [31]. Similarly, Alam and Jamalipour modeled the task offloading problem as a stochastic game and solved it with a multi-agent DRL-based Hungarian algorithm [32]. Zhan et al. proposed a policy gradient-based DRL approach to the task offloading problem, which is formulated as a partially observable Markov decision process [33]. Chen et al. considered task relevance and designed a distributed DRL algorithm to solve task offloading in industrial networks [34]. Some researchers combine blockchain and DRL to solve the task offloading problem. Wang et al. formulated the task offloading problem as a Markov game and combined blockchain, DRL, and mean field theory to propose a secure learning-based off-chain task offloading algorithm [35]. To guarantee the security and reliability of task offloading, Shi et al. incorporated a DRL-based computational offloading scheme and a consensus algorithm based on practical Byzantine fault tolerance (PBFT) into the smart contract of a blockchain [36]. The model-free DRL frameworks [37] used above, such as deep Q-learning, PPO, and DDPG, have self-learning and adaptive characteristics, but they suffer from poor sample efficiency and hyperparameter brittleness. A robust DRL algorithm is therefore urgently needed for the RIDM-TO problem to adapt to different RIDM environments.

All the above solutions assume that the task is an indivisible whole with no internal dependencies. However, most real-world tasks, especially the RIDM task, are not like this, and ignoring dependencies when making offloading decisions degrades strategy performance. Furthermore, these solutions fail to consider the limited computing resources of edge servers, which makes it difficult for a single server to undertake computation-intensive tasks like RIDM. Therefore, for the RIDM task offloading scenario, this article proposes a DCDO-DRL strategy designed to maximize the execution utility of the RIDM task. We propose a DCP algorithm to enable collaborative computing between ESs. We adopt a DAG to represent the dependencies of the RIDM task, and the offloading decision process in the DAG is represented by the sequence prediction of the S2S neural network. Finally, to obtain a robust offloading strategy, the DCDO-DRL strategy utilizes discrete SAC to train the S2S neural network.

System model and problem formulation

This section first presents the hierarchical system architecture. Next, the radiomics workflow is converted into a DAG to express the dependencies of the RIDM task. Then, the computation and transmission processes at the local and edge layers are described. Finally, a utility function is designed to formalize the goal of this article.

System model

As illustrated in Fig. 1, we consider a three-layer hierarchical system framework with terminal-edge-cloud collaboration for RIDM task execution. The system comprises multiple terminal devices, multiple edge servers, and a centralized cloud. TDs have limited computation and storage capabilities and typically perform small-scale RIDM tasks on hospital PCs. ESs have large computation and storage capabilities. The communication between ESs and the TDs within their communication range, between ESs, and between the CC and ESs is carried out via wireless links, fiber optic links, and backbone links, respectively. The CC has near-infinite resources and affords the computation and storage capabilities to train a task offloading planner (TOP) model. TDs and ESs execute tasks at the positions assigned by the cloud-trained TOP model. For readability, Table 1 summarizes the notation used in this article. To clearly explain the hierarchical framework for RIDM task execution, it is formalized as follows:

Fig. 1 Three-layer hierarchical system framework

Definition 1

RIDM task offloading system model is a 12-tuple: \({\text {RIDM-TO}}=(S,D,{\text {DCP}},G,B,\mu ,\phi ,{\zeta }, V^l,{V}^s,T^{total},{\Psi }^{total})\), which is described in Appendix.

Conversion of radiomics workflow to DAG

In existing studies, the RIDM task is often treated holistically, neglecting its internal dependencies, which may negatively affect the DEC of task execution. Thus, we design a more fine-grained division of the RIDM task according to the radiomics workflow.

The radiomics workflow consists of multiple interdependent modules. A simple workflow has linear dependencies between modules, whereas a complex workflow involves more complicated internal dependencies. Each module can be seen as a medical subtask. We model RIDM tasks as different DAGs based on the methods selected in the actual radiomics workflow. To clearly explain the DAG topology of the RIDM task, it is formalized as follows:

Definition 2

DAG topology of the RIDM task model is a 2-tuple: \({g}_{{d}_i}= (M, Z)\), where \(M=\left\{ {m}_{{d}_i,v}| v=1,2,\ldots ,\right. \left. V\right\} \) is the finite vertex set that represents the medical subtasks. \(Z=\left\{ \textbf{z}<{m}_{{d}_i,v},{m}_{{d}_i,w}>| v,w\in T\right\} \) is the finite set of directed edges that expresses the dependencies among medical subtasks. \({m}_{{d}_i,v}\) is the immediate predecessor medical subtask of \({m}_{{d}_i,w}\).

Fig. 2 An example of a radiomics-based prediction of diffuse glioma grading model with DAG conversion

Figure 2 shows the workflow and DAG of the diffuse glioma grading (DGG) prediction model based on radiomics in [38]. The radiomics workflow presented in Fig. 2a is divided into five phases [39, 40], each with several methods. Specifically, (1) image pre-processing is a standardized operation before using image data. It mainly includes 6 methods, such as histogram equalization [41], image enhancement [42], and image registration [43]. (2) Segmentation is the extraction of regions of interest in images, which can be divided into automatic, semi-automatic, and manual segmentation methods, such as edge-based segmentation [44], K-means clustering [45], and fuzzy C-means clustering [46]. (3) Feature extraction is performed on the original image (OI) or nine derived images (DI) processed by filters. There are four types of features: first-order statistical features (FSF) [47], shape-based features [48], texture-based features (TF) [49], and wavelet features (WF) [50]. Note that the FSF and TF are extracted on the OI and the nine DIs, while the WF are the FSF and TF calculated on the 8 wavelet sub-bands of the OI. Thus, there are 10 + 1 + 10 + 16 = 37 methods. (4) Feature selection filters out redundant and unstable features. It mainly contains 8 methods, such as LASSO regression [51] and minimum redundancy maximum relevance [52]. (5) Model construction is the selection of a suitable model for the target problem, which mainly includes 4 methods, such as logistic regression [53] and support vector machine (SVM) [54]. In Ref. [38], the DGG model first segments the three sequences and extracts features from the ROI with four methods. The features are then filtered by the LASSO algorithm and modeled using SVM. Thus, the subtask set after modeling the DGG model as the corresponding DAG is \(M=\left\{ {m}_{{d}_1,1}, {m}_{{d}_1,2},{m}_{{d}_1,3},{m}_{{d}_1,4},{m}_{{d}_1,5},{m}_{{d}_1,6},{m}_{{d}_1,7},\right. \left. {m}_{{d}_1,8}\right\} \) (Fig. 2b). Directed lines indicate the data dependencies between its subtasks. For example, \({m}_{{d}_1,7}\) must run after the processing of \({m}_{{d}_1,3}, {m}_{{d}_1,4}, {m}_{{d}_1,5}\) and \({m}_{{d}_1,6}\).
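To make the conversion concrete, the following minimal sketch encodes the DGG example of Fig. 2b as a DAG. It is illustrative only: apart from the stated dependency of \({m}_{{d}_1,7}\) on \({m}_{{d}_1,3}\)–\({m}_{{d}_1,6}\) and of the SVM node on the selection node, the exact node-to-method mapping and the remaining edges are assumptions.

```python
# Minimal sketch (illustrative): the DGG workflow of Fig. 2b as a DAG
# g_{d_1} = (M, Z) with 8 medical subtasks and directed dependency edges.
import networkx as nx

g_d1 = nx.DiGraph()
g_d1.add_nodes_from(range(1, 9))          # subtasks m_{d_1,1} ... m_{d_1,8}
# Assumed edges z<m_v, m_w>: w may start only after its predecessor v finishes.
g_d1.add_edges_from([(1, 3), (1, 4), (2, 5), (2, 6),   # pre-processing/segmentation -> feature extraction
                     (3, 7), (4, 7), (5, 7), (6, 7),   # four feature-extraction methods -> LASSO selection
                     (7, 8)])                          # feature selection -> SVM model construction

assert nx.is_directed_acyclic_graph(g_d1)
print(sorted(g_d1.predecessors(7)))        # pred(7) -> [3, 4, 5, 6], as in the example
```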

Table 1 System notations description

Local computing

As shown in Fig. 3a, we construct the DAG topology for the RIDM task. In the local computing mode (LCM), the \({m}_{{d}_i,v}\) in \({g}_{{d}_i}\) is only performed locally on the terminal device, with the offloading proportion \(\mu _{{d}_i,v}=0\). To clearly explain the parameters of LCM under \({g}_{{d}_i}\) topology, it is formalized as follows:

Definition 3

Local computing parameters model is 4-tuple: \({V}^l=\left( F^l,\chi ^l,T^l,E^{l,c}\right) \), which is described in Appendix. For terminal device \({d}_i\), the local computation delay \({\tau }_{{d}_i,v}^{l,c}\) [s] of processing \(b_{{d}_i,v}^l\) can be given by

$$\begin{aligned} \begin{matrix} {\tau }_{{d}_i,v}^{l,c}=\left( b_{{d}_i,v}^l\cdot \phi \right) /f_{{d}_i}^l \end{matrix} \end{aligned}$$
(1)

The local actual execution start time \({\text {st}}_{{d}_i,v}^{l,c}\) [s] of processing \(b_{{d}_i,v}^l\) on \({d}_i\) can be given by

$$\begin{aligned} \begin{matrix} {\text {st}}_{{d}_i,v}^{l,c}=\max \left\{ {\text {it}}_{{d}_i,v}^{l,c}, \textrm{psc}_{{d}_i,v}^{l,c}\right\} \end{matrix} \end{aligned}$$
(2)

where \({\text {it}}_{{d}_i,v}^{l,c}=\max \left\{ {\text {it}}_{{d}_i,v-1}^{l,c},{\text {ft}}_{{d}_i,v-1}^{l,c}\right\} \) denotes the idle time of the CPU of \({d}_i\) when executing \({m}_{{d}_i,v}\). \({\text {psc}}_{{d}_i,v}^{l,c}={\max }_{g\in {\text {pred}}(v)}\left\{ {\text {ft}}_{{{e}_j,d}_i,g}^{s,d},{\text {ft}}_{{d}_i,g}^{l,c}\right\} \) indicates the time at which the last predecessor subtask of \({m}_{{d}_i,v}\) has been completed. \({\text {pred}}\left( v\right) \) is the set of predecessor subtasks of \({m}_{{d}_i,v}\). \({\text {ft}}_{{{e}_j,d}_i,g}^{s,d}\) is defined in “ECCM workflow”. Therefore, the outer max block in (2) means that \({m}_{{d}_i,v}\) starts execution on \({d}_i\) if and only if all subtasks in \({\text {pred}}\left( v\right) \) have completed and the CPU of \({d}_i\) is idle. Hence, the local actual execution finish time \({\text {ft}}_{{d}_i,v}^{l,c}\) [s] of processing \(b_{{d}_i,v}^l\) on \({d}_i\) can be given by

$$\begin{aligned} \begin{matrix} {\text {ft}}_{{d}_i,v}^{l,c}={\text {st}}_{{d}_i,v}^{l,c}+{\tau }_{{d}_i,v}^{l,c} \end{matrix} \end{aligned}$$
(3)

Besides the required computation delay, processing each subtask also generates some computation energy. The local computation energy consumption \({\psi }_{{d}_i,v}^{l,c}\) [J] required by \({d}_i\) to process \(b_{{d}_i,v}^l\) can be given by

$$\begin{aligned} \begin{matrix} {\psi }_{{d}_i,v}^{l,c}={\chi _{{d}_i}^L\cdot {b}_{{d}_i,v}^l\cdot \phi } \end{matrix} \end{aligned}$$
(4)

Figure 3b illustrates the execution process of the subtask locally at \({d}_i\). The factors that determine the local actual execution finish time of \({m}_{{d}_i,v}\) are the local actual start time and the local computation delay. In addition, the local computation energy consumption during execution is also essential.
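As a quick illustration of Eqs. (1)–(4), the following minimal sketch computes the timing and energy of one locally executed subtask; all variable names and the example numbers are illustrative, not taken from the paper.

```python
# Minimal sketch of the local computing mode, Eqs. (1)-(4).
# b_l: subtask data size (bits); phi: computational complexity (cycles per bit, assumed);
# f_l: CPU rate of d_i (cycles/s); chi_l: energy coefficient (J/cycle).

def local_execution(b_l, phi, f_l, chi_l, cpu_idle_time, pred_finish_times):
    tau_lc = b_l * phi / f_l                              # Eq. (1): local computation delay
    psc = max(pred_finish_times, default=0.0)             # last predecessor finish time
    st_lc = max(cpu_idle_time, psc)                       # Eq. (2): actual start time
    ft_lc = st_lc + tau_lc                                # Eq. (3): actual finish time
    psi_lc = chi_l * b_l * phi                            # Eq. (4): local computation energy
    return st_lc, ft_lc, tau_lc, psi_lc

# Example (illustrative values): 1 Gcycle/s terminal device, two finished predecessors.
print(local_execution(b_l=2e6, phi=0.05, f_l=1e9, chi_l=1.25e-8,
                      cpu_idle_time=0.0, pred_finish_times=[0.3, 0.7]))
```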

Fig. 3 Local computing mode. The width and length of the rectangle represent the energy consumption and delay generated during this stage

Edge collaboration computing

Figure 4b demonstrates the edge collaborative computing mode (ECCM) network architecture. The architecture comprises multiple heterogeneous ESs, each of which has equal rights to share computing and communication resources at the edge of the network. Formally, we model the network architecture as an undirected graph \(G^{s}=\left\{ N^{s},C^{s}\right\} \), where the vertex set \(N^{s}=E\) is the set of ESs in the network and the edge set \(C^{s}\) denotes the connections among ESs. \(c_{jk}=\left( {e}_j,{e}_k\right) \in C^{s}\) represents the connection between \({e}_j\) and \({e}_k\). We assume that \({d}_i\) falls within the communication range of \({e}_j\), as shown in Fig. 4. \({m}_{{d}_i,v}\) in \({g}_{{d}_i}\) is offloaded to and runs on the governing \({e}_j\) under the \(\zeta \) function mapping, with offloading proportion \(\mu _{{d}_i,v}=1\). To clearly explain the parameters of the ECCM with the \({g}_{{d}_i}\) topology, it is formalized as follows:

Definition 4

Edge collaboration computing parameters model is 10-tuple: \({V}^s=(F^s,\chi ^s,{{\Delta }M}^s,{{\Delta }F}^s,{P}^s, {R}^s,T^{u},T^{c},T^{d},E^s)\), which is described in Appendix.

The ECCM is a three-step process that includes sending, processing, and feedback. \({m}_{{d}_i,v}\) is first sent from \({d}_i\) to \({e}_j\). Second, to enhance processing speed, \({e}_j\) adopts the DCP algorithm to find suitable adjacent ESs at the edge layer. Subsequently, \({e}_j\) executes \({m}_{{d}_i,v}\) in a distributed manner with these adjacent ESs. Finally, the processing result is fed back to \({d}_i\). The detailed workflow of the ECCM is described in “ECCM workflow”.

Fig. 4 Edge collaboration computing mode

Fig. 5 The execution process of subtasks on the ECCM. The width and length of the rectangle represent the energy consumption and delay generated during this stage

Problem formulation

The goal of the three-layer hierarchical system is to find an effective offloading strategy to maximize the utility of \({g}_{{d}_i}\) after RIDM task execution. The total delay and total energy consumption are affected by the resources of \({d}_i\) and \({e}_j\) and the execution location of subtasks. The total delay \({\tau }_{{d}_i}^{total}\) [s] required to process all data of a \({g}_{{d}_i}\) can be given by

$$\begin{aligned} \begin{matrix} {\tau }_{{d}_i}^{total}=\max \left[ {\max }_{{q}\in {\text {EMT}}}\left( {\text {ft}}_{{e}_j,{d}_i,q}^{s,d},{\text {ft}}_{{d}_i,q}^{l,c}\right) \right] \end{matrix} \end{aligned}$$
(5)

where EMT is the set of exit medical subtasks that are without successor subtasks. The total energy consumption \({\psi }_{{d}_i}^{total}\) [J] required to process all data of a \({g}_{{d}_i}\) can be given by

$$\begin{aligned} {\psi }_{{d}_i}^{total}=\sum _{v=1}^{V}\left[ {\psi }_{{d}_i,v}^{l,c}\cdot \left( 1-\mu _{{d}_i,v}\right) +\left( \psi _{{e}_j,{d}_i,v}^{s,u}+\psi _{{e}_j,{d}_i,v}^{s,c}+\psi _{{e}_j,{d}_i,v}^{s,d}\right) \cdot \mu _{{d}_i,v}\right] \end{aligned}$$
(6)

where \({\text {ft}}_{{e}_j,{d}_i,q}^{s,d}\), \(\psi _{{e}_j,{d}_i,v}^{s,u}\), \(\psi _{{e}_j,{d}_i,v}^{s,c}\), and \(\psi _{{e}_j,{d}_i,v}^{s,d}\) are defined in “ECCM workflow”. The weighted sum of delay and energy consumption, i.e., the utility, is used as the performance metric in this article. Let \(\beta ^t\) and \(\beta ^e\) be the weight indicators, where \(\beta ^t+\beta ^e=1\) and \(\beta ^t,\beta ^e\in \left[ 0,1\right] \). For the terminal device \({d}_i\), the utility \(O_{\mu _{{d}_i}}^C\) of a \({g}_{{d}_i}\) under an offloading strategy \(\mu _{{d}_i}\) is given by

$$\begin{aligned} O_{\mu _{{d}_i}}^C{} & {} ={\beta }^t\cdot \frac{{{\max }_{{q}\in {\text {EMT}}}{\text {ft}}_{{d}_i,q}^{l,c}-{\tau }_{{d}_i}^{total}}}{{\max }_{{q}\in {\text {EMT}}}{\text {ft}}_{{d}_i,q}^{l,c}} \nonumber \\{} & {} \quad \,+\,{\beta ^e\cdot \frac{\sum _{v=1}^{V}{\psi }_{{d}_i,v}^{l,c}-{\psi }_{{d}_i}^{total}}{\sum _{v=1}^{V}{\psi }_{{d}_i,v}^{l,c}}} \end{aligned}$$
(7)

where \({\max }_{{q}\in {\text {EMT}}}{\text {ft}}_{{d}_i,q}^{l,c}\) and \(\sum _{v=1}^{V}{\psi }_{{d}_i,v}^{l,c}\) are the total delay and total energy consumption of the local execution of the RIDM task. Hence, the optimization problem with respect to the utility is formulated as follows:

$$\begin{aligned} \begin{matrix} {\max \ O}_{\mu _{{d}_i}}^C \end{matrix} \end{aligned}$$
(8)

Intuitively, the optimization problem in (8) is NP-hard [55]. Finding the optimal offloading strategy can be extremely challenging for a highly dynamic DAG topology. To tackle this challenge, this article proposes the DCDO-DRL strategy in “DCDO-DRL design”.
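The sketch below illustrates, under illustrative names and data structures, how the objective in (8) is evaluated for a given offloading vector \(\mu_{d_i}\) using Eqs. (5)–(7).

```python
# Minimal sketch of the utility in Eqs. (5)-(7): the weighted relative saving in
# delay and energy of an offloading strategy mu over all-local execution.
# Inputs are illustrative dictionaries keyed by subtask index v.

def utility(exit_tasks, ft, ft_local, psi_local, psi_edge, mu, beta_t=0.5, beta_e=0.5):
    # Eq. (5): total delay is the latest finish time among the exit subtasks (EMT),
    # where ft[q] is the finish time of q under the chosen strategy (local or edge).
    tau_total = max(ft[q] for q in exit_tasks)
    # Eq. (6): per-subtask energy, local or edge depending on mu_v in {0, 1}.
    psi_total = sum(psi_local[v] * (1 - mu[v]) + psi_edge[v] * mu[v] for v in mu)
    # Baselines: delay and energy of executing the whole DAG locally.
    tau_loc = max(ft_local[q] for q in exit_tasks)
    psi_loc = sum(psi_local.values())
    # Eq. (7): weighted sum of the relative delay and energy savings; Eq. (8) maximizes this.
    return beta_t * (tau_loc - tau_total) / tau_loc + beta_e * (psi_loc - psi_total) / psi_loc
```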

ECCM workflow

This subsection describes the three stages of ECCM, including uploading subtasks, running subtasks on edge servers, and receiving the result data from subtasks.

Uploading subtasks

In the ECCM, the subtask is first uploaded from the TD to the ES and then executed on the edge server instead of locally. The transmission delay \(\tau _{{e}_j,{d}_i,v}^{s,u}\) [s] required by \({d}_i\) to send \(b_{{d}_i,v}^u\) to \({e}_j\) via the uplink channel can be given by

$$\begin{aligned} \begin{matrix} \tau _{{e}_j,{d}_i,v}^{s,u}=b_{{d}_i,v}^u/{r}_{{e}_j,{d}_i}^s \end{matrix} \end{aligned}$$
(9)

The actual execution start time \({\text {st}}_{{e}_j,{d}_i,v}^{s,u}\) [s] of sending \(b_{{d}_i,v}^u\) on the uplink channel can be given by

$$\begin{aligned} \begin{matrix} {\text {st}}_{{e}_j,{d}_i,v}^{s,u}=\max \left\{ {\text {it}}_{{e}_j,{d}_i,v}^{s,u},{\text {psc}}_{{e}_j,{d}_i,v}^{s,u}\right\} \end{matrix} \end{aligned}$$
(10)

where \({\text {it}}_{{e}_j,{d}_i,v}^{s,u}=\max \left\{ {\text {it}}_{{e}_j,{d}_i,v-1}^{s,u},{\text {ft}}_{{e}_j,{d}_i,v-1}^{s,u}\right\} \) is the idle time of the uplink channel when sending \({m}_{{d}_i,v}\). \({\text {psc}}_{{e}_j,{d}_i,v}^{s,u}={\max }_{g\in {\text {pred}}(v)}\left\{ {\text {ft}}_{{e}_j,{d}_i,g}^{s,d},{\text {ft}}_{{d}_i,g}^{l,c}\right\} \) represents the time at which all data needed by \({m}_{{d}_i,v}\) are ready. Therefore, the outer max block in (10) denotes that \({m}_{{d}_i,v}\) is allowed to send data to \({e}_j\) if and only if the uplink channel is idle and \({\text {pred}}\left( v\right) \) has completed.

The actual execution finish time \({\text {ft}}_{{e}_j,{d}_i,v}^{s,u}\) [s] of sending \(b_{{d}_i,v}^u\) on the uplink channel can be given by

$$\begin{aligned} \begin{matrix} {\text {ft}}_{{e}_j,{d}_i,v}^{s,u}={\text {st}}_{{e}_j,{d}_i,v}^{s,u}+\tau _{{e}_j,{d}_i,v}^{s,u} \end{matrix} \end{aligned}$$
(11)

The transmission energy consumption \(\psi _{{e}_j,{d}_i,v}^{s,u}\) [J] on the uplink channel when sending \(b_{{d}_i,v}^u\) can be given by

$$\begin{aligned} \begin{matrix} \psi _{{e}_j,{d}_i,v}^{s,u}={p}_{{e}_j,{d}_i}^s\cdot \tau _{{e}_j,{d}_i,v}^{s,u}=\left( {p}_{{e}_j,{d}_i}^s\cdot b_{{d}_i,v}^u\right) /{r}_{{e}_j,{d}_i}^s \end{matrix} \end{aligned}$$
(12)

Figure 5a shows the sending phase of the ECCM, illustrating the process by which \({m}_{{d}_i,v}\) is uploaded from \({d}_i\) to its affiliated \({e}_j\). In this phase, the actual execution finish time of \({m}_{{d}_i,v}\) is determined by the actual start time and the transmission delay on the upload channel. In addition, transmission energy consumption is generated.
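The following minimal sketch mirrors Eqs. (9)–(12) for the uploading phase; the variable names are illustrative.

```python
# Minimal sketch of the uploading phase, Eqs. (9)-(12).
# b_u: upload data size (bits); r: uplink rate (bits/s); p: transmission power (W).

def upload_subtask(b_u, r, p, uplink_idle_time, pred_finish_times):
    tau_su = b_u / r                                                   # Eq. (9): transmission delay
    st_su = max(uplink_idle_time, max(pred_finish_times, default=0.0)) # Eq. (10): actual start time
    ft_su = st_su + tau_su                                             # Eq. (11): finish time on uplink
    psi_su = p * tau_su                                                # Eq. (12): transmission energy
    return st_su, ft_su, tau_su, psi_su
```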

Running on edge servers

Inspired by Ref. [8], we propose a DCP algorithm at the edge layer to accelerate subtask processing by exploiting the collaborative computing capabilities between ESs. The DCP algorithm avoids the long processing time that would arise on a single ES with limited computational resources.

To describe the DCP algorithm more clearly, its pseudo-code and diagram are shown in Algorithm 1 and Fig. 6. The DCP algorithm includes three parts. The first part (lines 1–6) locates the adjacent ESs of \({e}_j\); two ESs are adjacent if there is an edge between them in the edge-layer network topology \(G^{s}=\left\{ N^{s},C^{s}\right\} \). The second part (lines 7–11) finds the suitable adjacent ESs (SAESs) by filtering the adjacent ESs according to whether they have enough remaining memory to execute subtasks. The third part (lines 12–17) further divides the subtask into small subtasks according to the subtask allocation matrix U and the remaining computing capacity of the ESs. The SAESs cooperatively process the assigned small subtasks, calculate the computational delay and computational energy consumption, and return the results to \({e}_j\).
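The sketch below outlines the three parts just described; it is not a reproduction of Algorithm 1, and the data structures (an adjacency mapping for \(G^s\), per-ES memory and capacity dictionaries) are assumptions made for illustration.

```python
# Rough sketch of the three parts of the DCP algorithm (illustrative structures).

def dcp(e_j, topology, free_mem, free_cap, cap, mem_needed, b_c, phi):
    # Part 1 (lines 1-6): adjacent ESs of e_j taken from the edge-layer topology G^s,
    # e.g. a networkx graph or an adjacency dict mapping each ES to its neighbours.
    adjacent = list(topology[e_j])
    # Part 2 (lines 7-11): keep only adjacent ESs with enough remaining memory (SAESs).
    saess = [e_k for e_k in adjacent if free_mem[e_k] >= mem_needed]
    # Part 3 (lines 12-17): total capacity F (Eq. (13)) and a split of the subtask
    # data b_c proportional to each server's (remaining) computing capacity.
    F = cap[e_j] + sum(free_cap[e_k] for e_k in saess)
    shares = {e_j: cap[e_j] / F, **{e_k: free_cap[e_k] / F for e_k in saess}}
    small_subtasks = {e: share * b_c for e, share in shares.items()}
    tau_sc = b_c * phi / F        # Eq. (14): every ES finishes its share at the same time
    return saess, small_subtasks, tau_sc
```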

Fig. 6 The execution process of subtasks on the ECCM

The total computational capacity F available for subtask execution at the edge layer is

$$\begin{aligned} \begin{matrix} F=f_{{e}_j}^s+\sum \limits _{u_{jk}=1}{{\Delta }f_{{e}_k}^s}\ \ \forall k=1,2,\ldots ,m \end{matrix} \end{aligned}$$
(13)

Note that fiber optic communication with a high transmission rate is used between edge servers. Thus, the transmission delay is negligible when the adjacent ESs receive assigned small subtasks and send processing results back to \({e}_j\).

The computation delay \(\tau _{{e}_j,{d}_i,v}^{s,c}\) [s] required by \({d}_i\) to process \(b_{{d}_i,v}^c\) on the SAESs can be given by

$$\begin{aligned} \begin{matrix} \tau _{{e}_j,{d}_i,v}^{s,c}=\left( b_{{d}_i,v}^s\cdot \phi \right) /F \end{matrix} \end{aligned}$$
(14)

Similarly, the actual execution start time \({\text {st}}_{{e}_j,{d}_i,v}^{s,c}\) [s] for \(b_{{d}_i,v}^c\) processing on the SAESs can be given by

$$\begin{aligned} \begin{matrix} {\text {st}}_{{e}_j,{d}_i,v}^{s,c}=\max \left\{ {\text {it}}_{{e}_j,{d}_i,v}^{s,c},{\text {psc}}_{{e}_j,{d}_i,v}^{s,c}\right\} \end{matrix} \end{aligned}$$
(15)

where \({\text {psc}}_{{e}_j,{d}_i,v}^{s,c}=\max \left\{ {\max }_{g\in {\text {pred}}(v)}{\text {ft}}_{{e}_j,{d}_i,g}^{s,c}, {\text {ft}}_{{e}_j,{d}_i,v}^{s,u}\right\} \) indicates that \(b_{{d}_i,v}^c\) has been uploaded to \({e}_j\) and all the predecessor data needed by \({m}_{{d}_i,v}\) are ready. \({\text {it}}_{{e}_j,{d}_i,v}^{s,c}=\max \left\{ {\text {it}}_{{e}_j,{d}_i,v-1}^{s,c},{\text {ft}}_{{e}_j,{d}_i,v-1}^{s,c}\right\} \) is the idle time at which \({e}_j\) can handle \(b_{{d}_i,v}^c\). Notice that, since we assign small subtasks to the ESs in the SAESs based on their computational capacity, the execution time of each small subtask is the same, which ensures the consistency of the idle times of \({e}_j\) and the ESs in the SAESs. Therefore, the outer max block in (15) indicates that the actual start time of \({m}_{{d}_i,v}\) depends on the idle time of \({e}_j\) and the actual execution finish times of its predecessor subtasks.

The actual execution finish time \({\text {ft}}_{{e}_j,{d}_i,v}^{s,c}\) [s] for \(b_{{d}_i,v}^c\) processing on the SAESs can be given by

$$\begin{aligned} \begin{matrix} {\text {ft}}_{{e}_j,{d}_i,v}^{s,c}={\text {st}}_{{e}_j,{d}_i,v}^{s,c}+\tau _{{e}_j,{d}_i,v}^{s,c} \end{matrix} \end{aligned}$$
(16)

Meanwhile, computation energy consumption is also generated. The computation energy consumption \(\psi _{{e}_j,{d}_i,v}^{s,c}\) [J] required by \({d}_i\) to process \(b_{{d}_i,v}^c\) on the SAESs can be given by

$$\begin{aligned} \begin{matrix} \psi _{{e}_j,{d}_i,v}^{s,c}=\left( \frac{\chi _{{e}_j}^s\cdot f_{{e}_j}^s}{F}+\sum \limits _{u_{jk}=1}{\frac{\chi _{{e}_k}^s\cdot f_{{e}_k}^s}{F}}\right) \cdot b_{{d}_i,v}^s\cdot \phi \\ \ \ \ \ \ \ \ \ \forall k=1,2,\ldots m \end{matrix} \end{aligned}$$
(17)

The subtasks are executed in a distributed manner by \({e}_j\) and the SAESs, as shown in Fig. 6. As Fig. 5b illustrates, during the processing phase, the actual execution finish time of \({m}_{{d}_i,v}\) depends on the actual execution start time and the computation time on \({e}_j\). In addition, computational energy consumption is generated.
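For completeness, the short sketch below evaluates the distributed computation energy of Eq. (17), reusing the SAESs and total capacity F produced by the DCP sketch above; all names are illustrative.

```python
# Minimal sketch of the distributed computation energy, Eq. (17).

def edge_compute_energy(e_j, saess, chi, f, F, b_c, phi):
    # Each server contributes energy in proportion to the share of cycles it executes.
    coeff = chi[e_j] * f[e_j] / F + sum(chi[e_k] * f[e_k] / F for e_k in saess)
    return coeff * b_c * phi      # psi^{s,c} in joules
```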

Algorithm 1 DCP

Receiving the result data

After the subtask is executed on the ES, the result data are sent back to the TD. The transmission delay \(\tau _{{e}_j,{d}_i,v}^{s,d}\) [s] required to receive the result data \(b_{{d}_i,v}^d\) from \({e}_j\) at \({d}_i \) via the downlink channel can be given by

$$\begin{aligned} \begin{matrix} \tau _{{e}_j,{d}_i,v}^{s,d}=b_{{d}_i,v}^d/{r}_{{e}_j,{d}_i}^s \end{matrix} \end{aligned}$$
(18)

Similarly, the actual execution start time \({\text {st}}_{{e}_j,{d}_i,v}^{s,d}\) [s] of receiving \(b_{{d}_i,v}^d\) on the downlink channel can be given by

$$\begin{aligned} \begin{matrix} {\text {st}}_{{e}_j,{d}_i,v}^{s,d}=\max \left\{ {\text {ft}}_{{e}_j,{d}_i,v}^{s,c},{\text {it}}_{{e}_j,{d}_i,v}^{s,d}\right\} \end{matrix} \end{aligned}$$
(19)

where \({\text {it}}_{{e}_j,{d}_i,v}^{s,d}=\max \left\{ {\text {it}}_{{e}_j,{d}_i,v-1}^{s,d},{\text {ft}}_{{e}_j,{d}_i,v-1}^{s,d}\right\} \) is the idle time of the downlink channel when receiving the result. Therefore, the outer max block in (19) denotes that the time at which \({m}_{{d}_i,v}\) can return its result data to \({d}_i\) depends on the idle time of the downlink channel and the actual execution finish time of \({m}_{{d}_i,v}\) in the processing phase.

The actual execution finish time \({\text {ft}}_{{e}_j,{d}_i,v}^{s,d}\) [s] of receiving \(b_{{d}_i,v}^d\) on the downlink channel can be given by

$$\begin{aligned} \begin{matrix} {\text {ft}}_{{e}_j,{d}_i,v}^{s,d}={\text {st}}_{{e}_j,{d}_i,v}^{s,d}+\tau _{{e}_j,{d}_i,v}^{s,d} \end{matrix} \end{aligned}$$
(20)

The transmission energy consumption \(\psi _{{e}_j,{d}_i,v}^{s,d}\) [J] required to receive the result data \(b_{{d}_i,v}^d\) from ES to \({d}_i\) via the downlink channel can be given by

$$\begin{aligned} \begin{matrix} \psi _{{e}_j,{d}_i,v}^{s,d}={p}_{{e}_j,{d}_i}^s\cdot \tau _{{e}_j,{d}_i,v}^{s,d}=\left( {p}_{{e}_j,{d}_i}^s\cdot b_{{d}_i,v}^d\right) /{r}_{{e}_j,{d}_i}^s \end{matrix} \end{aligned}$$
(21)

The execution result data are sent from the governing ES to the terminal device, as shown in Fig. 5c. Similarly, during the feedback phase, the actual execution finish time of \({m}_{{d}_i,v}\) depends on the actual execution start time and the transmission time on the downlink channel. In addition, transmission energy consumption is generated.

Fig. 7 The framework of DCDO-DRL. (1) TOP model training data flow: the TOP model uses an S2S neural network to interact with the environment and optimizes the offloading strategy through RL. (2) RIDM task offloading data flow: the TD loads the trained TOP model from the cloud to obtain subtask offloading locations and executes subtasks accordingly

DCDO-DRL design

This section first describes the architecture of the DCDO-DRL strategy. Next, the RIDM task offloading problem is formulated as a Markov decision process (MDP). Then, an S2S neural network is adopted to predict the offloading decisions. Finally, we introduce the training mechanism of the DCDO-DRL strategy.

DCDO-DRL

According to the challenges introduced in “Introduction”, we optimize the system model in “System model and problem formulation” and construct the DCDO-DRL strategy, whose architecture is shown in Fig. 7. Each TD is equipped with a TOP module derived from the cloud-trained model. The edge collaborative processing module executes the assigned subtasks in a distributed manner. There are two components in the cloud layer: (1) the RIDM task DAG pool stores DAGs from the different RIDM tasks of TDs; (2) the TOP model training module outputs the offloading location of each subtask via RL and the S2S neural network.

The DCDO-DRL architecture includes two data flows. (1) TOP model training data flow. The TD first embeds the data information of the RIDM task into a DAG; next, the DAG is uploaded to the RIDM task DAG pool; finally, the S2S neural network (agent) in the TOP model interacts with the environment (the network, the DAG, and the computing power of TDs and ESs) to iteratively learn and optimize the offloading strategy. (2) RIDM task offloading data flow. The TD first loads the TOP model trained in the cloud; then, the test DAG on the TD is input to the TOP model to obtain the execution location of each subtask, i.e., local processing or edge distributed processing.

MDP formulation

To deal with the RIDM task offloading problem, we adopt a DRL-based algorithm to get an offloading strategy to maximize the utility of the RIDM task execution. First, the offloading problem is formulated as an MDP to implement the DRL algorithm. In this article, the MDP is defined by a tuple \(\left( {\mathcal {S}},{\mathcal {A}},{\mathcal {R}},{\mathcal {P}},\gamma \right) \), where \({\mathcal {S}}\) is the environment states space, \({\mathcal {A}}\) is the action space, \({\mathcal {R}}\) is reward function, \({\mathcal {P}}\) is the state transition probability matrix and \(\gamma \) is the discount factor. The motivation of an agent is to find a strategy that can maximize accumulated reward and select the best behavior. Hence, the three key elements of MDP can be defined as follows:

State: The DEC cost of performing \({m}_{{d}_i,v}\) is related to the RIDM topology \({g}_{{d}_i}\), the task size B, the task computational complexity \(\phi \), the local computing mode parameters \({V}^l\), the edge collaboration computing mode parameters \({V}^s\), etc. Thus, the state space reflects the observations from the environment when the RIDM task executes, which can be given by

$$\begin{aligned} \begin{matrix} {\mathcal {S}}=\left\{ s_{{d}_i,v}| i=1,2,\ldots ,n;v=1,2,\ldots ,V\right\} \end{matrix} \end{aligned}$$
(22)

where \(s_{{d}_i,v}=\left( Em\left( {g}_{{d}_i}\right) ,\mu _{{d}_i,1:\ v}\right) \) denotes the state when running \({m}_{{d}_i,v}\). \(\mu _{{d}_i,1:\ v}=\left\{ \mu _{{d}_i,1},\mu _{{d}_i,2},...,\mu _{{d}_i,v}\right\} \) is the partial offloading decision for the subtasks from \({m}_{{d}_i,1}\) to \({m}_{{d}_i,v}\). \(Em\left( {g}_{{d}_i}\right) \) is the encoded \({g}_{{d}_i}\), a sequence of subtask embeddings. Each subtask embedding consists of three vectors: the first vector holds the indices of the immediate predecessors of \({m}_{{d}_i,v}\); the second vector contains the index of \({m}_{{d}_i,v}\) and the DEC cost of \({m}_{{d}_i,v}\); the last vector holds the indices of the immediate successors of \({m}_{{d}_i,v}\).
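The following minimal sketch shows one way such a three-part subtask embedding could be assembled; the padding length, padding value, and flattened layout are illustrative assumptions, not details given in the paper.

```python
# Minimal sketch of one subtask embedding in Em(g_{d_i}) (illustrative layout).

def embed_subtask(v, dag, dec_cost, max_deg=6, pad=-1):
    pred = list(dag.predecessors(v))[:max_deg]     # vector 1: immediate predecessors
    succ = list(dag.successors(v))[:max_deg]       # vector 3: immediate successors
    pred += [pad] * (max_deg - len(pred))          # pad to a fixed length
    succ += [pad] * (max_deg - len(succ))
    task = [v, dec_cost[v]]                        # vector 2: index and DEC cost of m_v
    return pred + task + succ                      # flattened embedding fed to the encoder
```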

Action: Based on the observed environment states, the agent has two executions for each subtask, i.e., local execution or offloading to the edge server, so the action space can be given by \({\mathcal {A}}=\left\{ 0,1\right\} \), \(a_{{d}_i,v}=\mu _{{d}_i,v}=0\) denotes local processing and \(a_{{d}_i,v}=\mu _{{d}_i,v}=1\) denotes edge collaboration processing.

Fig. 8 Subtask offloading process. TD inputs the subtask sequence for the DAG to the encoder in the S2S neural network with an attention mechanism. The decoder then outputs offloading locations based on this input, which are used to execute the subtasks accordingly

Reward: According to the environment state and action, the agent calculates reward values. The objective is to maximize the utility in (8). The utility is the weighted sum of the DEC generated after the completion of the RIDM subtasks, and the reward should align the learning process with this objective. To achieve this, we define the reward function as the increment of the DEC after making the offloading decision for a subtask. There are four reasons. First, we have to consider both delay and energy consumption to ensure maximum utility without sacrificing either factor. Second, the weights can flexibly adjust the proportion of delay and energy consumption. Third, the function helps prevent the agent from getting stuck and allows it to adapt to changes in the environment. Finally, the increment measures the consequence of each offloading decision, facilitating a balance between global and local utility. Formally, the reward function can be given by

$$\begin{aligned} r_{{d}_i,v}= & {} \beta ^t\cdot \frac{\left( {\max }_{{q}\in {\text {EMT}}}{\text {ft}}_{{d}_i,q}^{l,c}\right) /V-\left( \tau _{{d}_i,1:v}^{total}-\tau _{{d}_i,1:v-1}^{total}\right) }{{\max }_{{q}\in {\text {EMT}}}{\text {ft}}_{{d}_i,q}^{l,c}} \nonumber \\{} & {} +\,\beta ^e\cdot \frac{\left( \sum _{v=1}^{V}{\psi }_{{d}_i,v}^{l,c}\right) /V-\left( \psi _{{d}_i,1:v}^{total}-\psi _{{d}_i,1:v-1}^{total}\right) }{\sum _{v=1}^{V}{\psi }_{{d}_i,v}^{l,c}}\nonumber \\ \end{aligned}$$
(23)

where \({\max _{q\in {\text {EMT}}}}\textrm{ft}_{d_i,q}^{l,c}\) and \(\sum _{v=1}^{V}\psi _{d_i,v}^{l,c}\) are the delay and energy consumption required to run all subtasks of the DAG locally. \(\tau _{d_i,1:v}^{total}-\tau _{d_i,1:v-1}^{total}\) and \(\psi _{d_i,1:v}^{total}-\psi _{d_i,1:v-1}^{total}\) are the increments of delay and energy consumption caused by the offloading decision for \({m}_{{d}_i,v}\).
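The short sketch below evaluates Eq. (23) for one decision step; the list-based bookkeeping of the cumulative delay and energy is an illustrative choice.

```python
# Minimal sketch of the reward in Eq. (23): the per-step saving of delay and energy
# relative to an even split of the all-local baseline over the V subtasks.
# tau_total / psi_total: lists whose index v holds the cumulative delay / energy
# after the first v offloading decisions (index 0 is 0.0).

def reward(v, V, tau_total, psi_total, tau_loc, psi_loc, beta_t=0.5, beta_e=0.5):
    d_tau = tau_total[v] - tau_total[v - 1]        # delay increment caused by decision v
    d_psi = psi_total[v] - psi_total[v - 1]        # energy increment caused by decision v
    return (beta_t * (tau_loc / V - d_tau) / tau_loc
            + beta_e * (psi_loc / V - d_psi) / psi_loc)
```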

Subtask offloading process

According to (8) and the MDP, the sequential decision-making of the RIDM task offloading problem is converted into an S2S prediction problem. The input of the S2S neural network is a sequence of subtask embeddings, and the output is an offloading strategy \(\pi \left( \mu _{{d}_i}|Em\left( {g}_{{d}_i}\right) \right) \). The strategy gives the probability of the V subtasks selecting their actions given the encoded \({g}_{{d}_i}\), which can be given by

$$\begin{aligned}{} & {} \pi \left( \mu _{{d}_i}|Em\left( {g}_{{d}_i}\right) \right) \nonumber \\{} & {} \quad =\prod _{v=1}^{V}\pi \left( \mu _{{d}_i,v}|Em\left( {g}_{{d}_i}\right) ,\mu _{{d}_i,v-1}\right) \nonumber \\{} & {} \quad =\prod _{v=1}^{V}{\mathbb {P}}\left( \mu _{{d}_i,v}|Em\left( {g}_{{d}_i}\right) ,\mu _{{d}_i,v-1}\right) \end{aligned}$$
(24)

where \({\mathbb {P}}\left( \mu _{{d}_i,v}|Em\left( {g}_{{d}_i}\right) ,\mu _{{d}_i,v-1}\right) \) is the probability of selecting action \(\mu _{{d}_i,v}\) for \({m}_{{d}_i,v}\) under the state \(s_{{d}_i,v}\). The subtask offloading process includes three steps, as shown in Fig. 8.

Step 1: Obtain the subtask sequence for \({g}_{{d}_i}\). We rank all the subtasks by (25). The central idea is to choose the maximum weighted sum of running delay and energy consumption for each subtask under the LCM and the ECCM. The indices of all subtasks are then sorted in ascending order of the sort value, where succ(v) is the set of successor subtasks of \({m}_{{d}_i,v}\). \(O_{{d}_i,v}^{l}={\tau }_{{d}_i,v}^{l,c}+{\psi }_{{d}_i,v}^{l,c}\) is the running delay plus energy consumption of local execution. \(\tau _{{e}_j,{d}_i,v}^s=\tau _{{e}_j,{d}_i,v}^{s,u}+\tau _{{e}_j,{d}_i,v}^{s,c}+\tau _{{e}_j,{d}_i,v}^{s,d}\) and \(\psi _{{e}_j,{d}_i,v}^s=\psi _{{e}_j,{d}_i,v}^{s,u}+\psi _{{e}_j,{d}_i,v}^{s,c}+\psi _{{e}_j,{d}_i,v}^{s,d}\) indicate the running delay and energy consumption of \({m}_{{d}_i,v}\) during the upload, processing, and feedback phases of the ECCM.

$$\begin{aligned} {\text {sort}}\left( {m}_{{d}_i,v}\right) ={\left\{ \begin{array}{ll} \min \left( O_{{d}_i,v}^{l},\ \tau _{{e}_j,{d}_i,v}^s+\psi _{{e}_j,{d}_i,v}^s\right) , &{} {\text {if}}\ v\in {\text {EMT}}\\ \min \left( O_{{d}_i,v}^{l},\ \tau _{{e}_j,{d}_i,v}^s+\psi _{{e}_j,{d}_i,v}^s+{\max }_{q\in {\text {succ}}\left( v\right) }{\text {sort}}\left( {m}_{{d}_i,q}\right) \right) , &{} {\text {if}}\ v\notin {\text {EMT}} \end{array}\right. } \end{aligned}$$
(25)
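The recursion in Eq. (25) can be evaluated with memoisation over the DAG, as in the sketch below; the cost dictionaries `O_l` and `O_s` are illustrative containers for \(O^{l}_{d_i,v}\) and \(\tau^{s}_{e_j,d_i,v}+\psi^{s}_{e_j,d_i,v}\).

```python
# Minimal sketch of the ranking rule in Eq. (25) and the ascending sort of Step 1.
from functools import lru_cache

def sort_order(dag, O_l, O_s):
    @lru_cache(maxsize=None)
    def sort_value(v):
        succ = list(dag.successors(v))
        if not succ:                                   # v in EMT: exit subtask
            return min(O_l[v], O_s[v])
        return min(O_l[v], O_s[v] + max(sort_value(q) for q in succ))
    # Indices of all subtasks sorted in ascending order of their sort value.
    return sorted(dag.nodes, key=sort_value)
```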

Step 2: Input the subtask sequence to the encoder of the S2S neural network. Once the encoding is done, feed it to the decoder and get the output.

The offloading strategy defined in (25) can be represented by an S2S neural network. In this article, we adopt Bidirectional Long Short-Term Memory (Bi-LSTM) [56] and Long Short-Term Memory (LSTM) [57] as the encoder and the decoder of the S2S neural network. The encoder of the S2S neural network converts the input graph into a continuous subtask sequence \(M=\left\{ {m}_{{d}_i,v}|\right. \left. v=1,2,\ldots ,V\right\} \). The decoder then uses this sequence to generate the offloading strategy \(\mu _{{d}_i}=\left\{ \mu _{{d}_i,v}| v=1,2,\right. \left. \ldots ,V\right\} \). This combination can integrate node features and relationships, capture global context, and handle long-term dependencies effectively. The details are as follows: \({m}_{{d}_i,v}\) is first converted to an embedding vector \({{\varvec{m}}}_{{d}_i,v}\) before each encoding step, and then the Bi-LSTM transforms the hidden state \({\varvec{h}}_{{d}_i,v-1}^{en}\) of the previous step and \({{\varvec{m}}}_{{d}_i,v}\) into the hidden state \({\varvec{h}}_{{d}_i,v}^{en}\) of the current encoding step, which can be given by

$$\begin{aligned} \begin{matrix} {\varvec{h}}_{{d}_i,v}^{en}={\text {Bi-LSTM}}\left( {\varvec{h}}_{{d}_i,v-1}^{en},{{\varvec{m}}}_{{d}_i,v}\right) \end{matrix} \end{aligned}$$
(26)

After the embedding vectors of all subtasks are encoded in sequence, the hidden layer state vector \({\varvec{h}}_{{d}_i}^{en}=\left\{ h_{{d}_i,v}^{en}|\right. \left. v=1,2,\ldots ,V\right\} \) of an encoder is got.

To improve the efficiency and accuracy of task processing, we introduce an attention mechanism. The context vector \({\varvec{c}}_{{d}_i,d}\) decoded in step d is the weighted average of all hidden states \({\varvec{h}}_{{d}_i,v}^{en}\) of the encoder output, which can be given by

$$\begin{aligned} \begin{matrix} {\varvec{c}}_{{d}_i,d}=\sum _{v=1}^{V}{\frac{{\text {exp}}\left( {\text {score}}\left( {\varvec{h}}_{{d}_i,d}^{de},{\varvec{h}}_{{d}_i,v}^{en}\right) \right) }{\sum _{k=1}^{V}{\text {exp}}\left( {\text {score}}\left( {\varvec{h}}_{{d}_i,d}^{de},{\varvec{h}}_{{d}_i,k}^{en}\right) \right) }\cdot {\varvec{h}}_{{d}_i,v}^{en}} \end{matrix} \end{aligned}$$
(27)

where the weight \(a_{{d}_i,d,v}\) is a probability distribution at \(v=1,2,\ldots ,V\) for a given d. \({\text {score}}\left( {\varvec{h}}_{{d}_i,d}^{de},{\varvec{h}}_{{d}_i,v}^{en}\right) \) is a forward feedback neural network, which computes an alignment score from the hidden state \({\varvec{h}}_{{d}_i,v}^{en}\) of encoder at step v and the hidden state \({\varvec{h}}_{{d}_i,d}^{de}\) of decoder at step d. At each step of decoding, LSTM takes as inputs the hidden state \({\varvec{h}}_{{d}_i,d-1}^{de}\) at the previous step and the context vector \({\varvec{c}}_{{d}_i,d}\) at the current step d, the hidden state \({\varvec{h}}_{{d}_i,d}^{de}\) of the decoder output can be given by

$$\begin{aligned} \begin{matrix} {\varvec{h}}_{{d}_i,d}^{de}={\text {LSTM}}\left( {\varvec{h}}_{{d}_i,d-1}^{de},{\varvec{c}}_{{d}_i,d}\right) \end{matrix} \end{aligned}$$
(28)

Combine the current decoder hidden state \({\varvec{h}}_{{d}_i,d}^{de}\) and context vector \({\varvec{c}}_{{d}_i,d}\), we get the attention hidden state \({\widetilde{{\varvec{h}}}}_{{d}_i,d}^{de}\), which can be given by

$$\begin{aligned} \begin{matrix} {\widetilde{{\varvec{h}}}}_{{d}_i,d}^{de}=\tanh {\left( W_c\left[ {\varvec{c}}_{{d}_i,d};{\varvec{h}}_{{d}_i,d}^{de}\right] \right) } \end{matrix} \end{aligned}$$
(29)

The predictive distribution is produced from the attentional vector \({\widetilde{{\varvec{h}}}}_{{d}_i,d}^{de}\) and softmax layer, which can be given by

$$\begin{aligned} \begin{matrix} p(\mu _{{d}_i,d}|\mu _{{d}_i,<d},{\text {M}})={\text {softmax}}\left( W_s{\widetilde{{\varvec{h}}}}_{{d}_i,d}^{de}\right) \end{matrix} \end{aligned}$$
(30)

Step 3: The output of the decoder in the S2S neural network is the offloading decision sequence \(\mu _{{d}_i}\), which contains a decision for every subtask; each subtask is then placed on the corresponding device. If \(\mu _{{d}_i,v}=0\), \({m}_{{d}_i,v}\) is performed locally; if \(\mu _{{d}_i,v}=1\), \({m}_{{d}_i,v}\) is sent to the corresponding edge server \({e}_j\) and processed in a distributed manner according to the DCP algorithm.
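As a compact illustration of Eqs. (26)–(30) (not the authors' code), the sketch below builds a Bi-LSTM encoder, an attention layer, an LSTM decoder, and a softmax over the two offloading actions with the Keras functional API. It simplifies the paper's formulation in two labeled ways: dot-product attention stands in for the feed-forward score network of Eq. (27), and the decoder is driven by a sequence of previous-decision embeddings rather than step-by-step feeding; all dimensions are illustrative.

```python
# Simplified sketch of the S2S policy network (illustrative dimensions).
import tensorflow as tf

V, EMB, HID = 15, 18, 256                          # subtasks, embedding size, hidden units

enc_in = tf.keras.Input(shape=(V, EMB))            # sequence of subtask embeddings Em(g)
h_en = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HID, return_sequences=True))(enc_in)          # Eq. (26)

dec_in = tf.keras.Input(shape=(V, EMB))            # embeddings of previous decisions
h_de = tf.keras.layers.LSTM(2 * HID, return_sequences=True)(dec_in)    # Eq. (28)

context = tf.keras.layers.Attention()([h_de, h_en])                    # Eq. (27), dot-product form
h_att = tf.keras.layers.Dense(2 * HID, activation="tanh")(
    tf.keras.layers.Concatenate()([context, h_de]))                    # Eq. (29)
probs = tf.keras.layers.Dense(2, activation="softmax")(h_att)          # Eq. (30)

policy = tf.keras.Model([enc_in, dec_in], probs)   # pi(mu_{d_i,v} | Em(g), mu_{d_i,<v})
```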

Training mechanism

Fig. 9 The training mechanism of the DCDO-DRL strategy. (1) Actor network: maximizes the expected return. (2) Critic network: accurately estimates the value of each state-action pair. (3) Target network: its parameters gradually synchronize with the parameters of the main network during training to stabilize the learning process

The SAC algorithm proposed by Haarnoja et al. [58] maximizes the entropy along with the expected reward. Inspired by SAC, a training mechanism is designed for the DCDO-DRL strategy to learn a robust DRL algorithm. The mechanism follows discrete SAC, which reconstructs the action space of SAC to fit task offloading scenarios. Next, we discuss the training mechanism. Compared with traditional RL, the objective function considers the entropy term \(\alpha {\mathcal {H}}\left( \pi \left( \cdot |s_{{d}_i,v}\right) \right) \) while maximizing the accumulated reward. The definition is as follows:

$$\begin{aligned}{} & {} {\max }_\pi \sum _{v=1}^{V}{{\mathbb {E}}_{\left( s_{d_i,v},a_{d_i,v}\right) \sim \tau _\pi }\left[ \left( r\left( s_{d_i,v},a_{d_i,v}\right) \right. \right. } \nonumber \\{} & {} \qquad \left. \left. +\alpha {\mathcal {H}}{\left( \pi \left( a_{d_i,v}| s_{d_i,v}\right) \right) }\right) \gamma ^{v-1}\right] \nonumber \\{} & {} \quad {=\max }_\pi \sum _{v=1}^{V}{{\mathbb {E}}_{\left( s_{d_i,v},a_{d_i,v}\right) \sim \tau _\pi }\left[ \left( r\left( s_{d_i,v},a_{d_i,v}\right) \right. \right. }\nonumber \\{} & {} \qquad \left. \left. -\alpha \log {\left( \pi \left( a_{d_i,v}| s_{d_i,v}\right) \right) }\right) \gamma ^{v-1}\right] \end{aligned}$$
(31)

where \(\tau _\pi \) is the state-action trajectory distribution following the policy \(\pi \); \(\gamma \in \left[ 0,1\right] \) is a discount factor used to distinguish the importance of current and future rewards; \(\alpha \) is the temperature parameter that controls the stochasticity of the optimal policy; \({\mathcal {H}}\left( \pi \left( \cdot |s_{{d}_i,v}\right) \right) =-{\mathbb {E}}_{\left( s_{d_i,v},a_{d_i,v}\right) \sim \tau _\pi }\log {\left( \pi \left( a_{d_i,v}| s_{d_i,v}\right) \right) }\) is the entropy of the policy distribution, which encourages the exploration of additional solutions.

The optimal temperature \(\alpha \) varies across tasks due to differences in reward. In addition, the policy is continuously updated during training, which changes the corresponding Q-values and further affects the choice of \(\alpha \). Therefore, to tune the temperature parameter \(\alpha \) dynamically, we rewrite (31) with the mean entropy as a constraint; the transformed objective function is as follows [59]:

$$\begin{aligned}{} & {} {\max }_\pi \sum _{v=1}^{V}{{\mathbb {E}}_{\left( s_{d_i,v},a_{d_i,v}\right) \sim \tau _\pi }\left[ r\left( s_{d_i,v},a_{d_i,v}\right) \right. }{\left. \gamma ^{v-1}\right] }\nonumber \\ {}{} & {} \quad {\text {s.t.}}\ {\mathcal {H}}\left( \pi \left( \cdot |s_{{d}_i,v}\right) \right) \ge \hat{{\mathcal {H}}}\ \ \forall v\in V \end{aligned}$$
(32)

where \(\hat{{\mathcal {H}}}\) is the minimum value of the average entropy over the sample. The objective of our policy is transformed to maximize the cumulative reward, provided the sample average entropy is no less than \(\hat{{\mathcal {H}}}\). The optimal temperature \(\alpha _v^*\) can be given by

$$\begin{aligned} \alpha _v^*={\text {argmin}}_{\alpha _{d_i,v}}{\mathbb {E}}_{a_{d_i,v}\sim \pi _{d_i,v}^*}\left[ -\alpha _{d_i,v}\left( \log {\pi _{d_i,v}^*\left( a_{d_i,v}| s_{d_i,v};\alpha _{d_i,v}\right) }+\hat{{\mathcal {H}}}\right) \right] \end{aligned}$$
(33)

where \(\pi _{d_i,v}^*\left( a_{d_i,v}| s_{d_i,v};\alpha _{d_i,v}\right) \) denotes the optimal policy \(\pi _{d_i,v}^*\) under temperature \(\alpha _{d_i,v}\) when the action \(a_{d_i,v}\) is chosen in state \(s_{d_i,v}\). Thus, the temperature objective for solving \(\alpha _v^*\) can be given by

$$\begin{aligned} {\mathcal {L}}\left( \alpha \right) ={\mathbb {E}}_{a_{d_i,v}\sim \pi _{d_i,v}}\left[ \alpha \left( -\log {\left( \pi _{d_i,v}\left( a_{d_i,v}| s_{d_i,v}\right) \right) }-\hat{{\mathcal {H}}}\right) \right] \end{aligned}$$
(34)

It can be observed that the optimal policy and the optimal temperature interact with each other, so both should be updated iteratively. Based on Ref. [60], (32) is solved using soft policy iteration, which alternates policy evaluation and policy improvement. In the policy evaluation phase, the DCDO-DRL strategy constructs two functions by modifying the Bellman backup: (1) the soft action-value function \(Q_\pi \left( s,a\right) \) evaluates the Q-value of a given state-action pair under the policy \(\pi \); (2) the soft state-value function \(V_\pi \left( s\right) \) evaluates the value of a state under the policy \(\pi \) with the entropy term. The two functions can be given by

$$\begin{aligned} \begin{matrix} Q_\pi \left( s,a\right) =r\left( s,a\right) +\gamma \sum \limits _{s^\prime \in S}{{\mathcal {P}}\left( s^\prime | s,a\right) V_\pi \left( s^\prime \right) } \end{matrix} \end{aligned}$$
(35)
$$\begin{aligned} \begin{matrix} V_\pi \left( s\right) ={\mathbb {E}}_{a\sim \pi }\left[ Q_\pi \left( s,a\right) -\alpha \log {\left( \pi \left( a| s\right) \right) }\right] \end{matrix} \end{aligned}$$
(36)

Then, the soft Bellman residual, i.e., the mean squared error between the Q-network and the target Q-network, is used to update the soft Q-network parameter \(\xi \), which can be given by

$$\begin{aligned} {\mathcal {L}}_{\mathcal {Q}}\left( \xi \right)= & {} {\mathbb {E}}_{\left( s_{d_i,v},a_{d_i,v}\right) \sim {\mathcal {D}}}\left[ \frac{1}{2}\left( Q_\xi \left( s_{d_i,v},a_{d_i,v}\right) \right. \right. \nonumber \\{} & {} - \left. \left. \acute{Q}\left( s_{d_i,v},a_{d_i,v}\right) \right) ^2\right] \nonumber \\= & {} {\mathbb {E}}_{\left( s_{d_i,v},a_{d_i,v}\right) \sim {\mathcal {D}}}\left[ \frac{1}{2}\left( Q_\xi \left( s_{d_i,v},a_{d_i,v}\right) \ \ \right. \right. \nonumber \\{} & {} - \left. \left. \left( r\left( s_{d_i,v},a_{d_i,v}\right) +\gamma V_{{\bar{\xi }}}\left( s_{d_i,v+1}\right) \right) \right) ^2\right] \end{aligned}$$
(37)

where \({\mathcal {D}}\) is the replay buffer that stores a series of transitions \(\left( s_{d_i,v},a_{d_i,v},r_{{d}_i,v},s_{d_i,v+1}\right) \). \({\bar{\xi }}\) is the parameter of the target Q-network and is copied from \(\xi \) after a certain time.

Since the action space in this article is discrete, the expectation in \(V_{{\bar{\xi }}}\left( s_{d_i,v+1}\right) \) can be computed directly over the discrete action probabilities, which can be given by

$$\begin{aligned}{} & {} V_{{\bar{\xi }}}\left( s_{d_i,v+1}\right) =\sum _{a_{d_i,v+1}\in {\mathcal {A}}}{\pi \left( a_{d_i,v+1}| s_{d_i,v+1}\right) } \nonumber \\{} & {} \left[ Q_{{\bar{\xi }}}\left( s_{d_i,v+1},a_{d_i,v+1}\right) -\alpha {\text {log}}{\left( \pi \left( a_{d_i,v+1}| s_{d_i,v+1}\right) \right) }\right] \end{aligned}$$
(38)

The aim of the policy improvement phase is to update the policy to maximize the reward. Based on Ref. [58], to keep the policy tractable, the exponential of the Q-value obtained during the policy evaluation phase is first used to update the policy, which is then projected onto the acceptable policy set \(\mathrm {\Pi }\) by minimizing the Kullback–Leibler divergence. Thus, the policy update is defined in (39). The loss function of the policy network is given by (40), in which the parameter \(\varphi \) is updated using stochastic gradients.

$$\begin{aligned} \pi _{\text {new}}= & {} {\text {argmin}}_{\pi \in \mathrm {\Pi }}D_{\textrm{KL}} \nonumber \\{} & {} \left( \mathrm {\pi }\left( \cdot | s_{d_i,v}\right) ||\frac{{\text {exp}}{\left( \frac{1}{\mathrm {\alpha }}\textrm{Q}_{\mathrm {\pi }_{\textrm{old}}}\left( s_{d_i,v},\cdot \right) \right) }}{Z_{\mathrm {\pi }_{\textrm{old}}}\left( s_{d_i,v}\right) }\right) \end{aligned}$$
(39)
$$\begin{aligned} {\mathcal {L}}_\pi \left( \varphi \right)= & {} {\mathbb {E}}_{s_{d_i,v}\in {\mathcal {D}}}\sum _{a_{d_i,v}\in {\mathcal {A}}}{\pi _\varphi \left( a_{d_i,v}| s_{d_i,v}\right) } \nonumber \\{} & {} \left( \alpha {\text {log}}{\left( \pi _\varphi \left( a_{d_i,v}| s_{d_i,v}\right) \right) }-Q_\xi \left( s_{d_i,v},a_{d_i,v}\right) \right) \nonumber \\ \end{aligned}$$
(40)

Algorithm 2 and Fig. 9 illustrate the pseudo-code and training mechanism of the DCDO-DRL strategy. Algorithm 2 comprises three parts. The first part (lines 1–6) defines the initial parameters, including the environment, the critic networks, the actor network, the target networks, the replay buffer, and the gradient descent step length. The second part (lines 7–13) interacts with the environment to obtain the action and the next state following the current policy; the transition is then stored in the replay buffer. The third part (lines 14–21) updates the S2S neural network using stochastic gradients and the transitions stored in the replay buffer; the two critic networks, the actor network, the temperature parameter, and the two target networks are updated in lines 16–19. Off-policy learning is more effective mainly because it can learn from experience generated by policies other than the target policy. The core idea of the DCDO-DRL strategy lies in two aspects: (1) the loss functions (37) and (40) of the critic and actor networks incorporate an entropy term; (2) two Q-networks serve as the critic and target networks, respectively, and the loss functions (37) and (40) adopt the minimum of the two \(Q_\pi \left( s,a\right) \) values to improve the training speed.
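The following condensed sketch (not the authors' implementation) shows how one update step could compute the three losses of Eqs. (34), (37), (38), and (40). Tensor shapes, the discount factor, and the entropy target `H_bar` are illustrative assumptions; `pi`/`log_pi` are the current-state action probabilities, `pi_next`/`log_pi_next` those of the next state, `q1`/`q2` the two critics, and `q1_t_next`/`q2_t_next` the target critics evaluated at the next state.

```python
# Condensed sketch of one discrete-SAC update following Eqs. (34), (37), (38), (40).
import tensorflow as tf

def sac_losses(pi, log_pi, pi_next, log_pi_next, q1, q2, q1_t_next, q2_t_next,
               a, r, done, alpha, gamma=0.99, H_bar=0.4):
    # Eq. (38): soft value of the next state under the current policy and target critics.
    v_next = tf.reduce_sum(
        pi_next * (tf.minimum(q1_t_next, q2_t_next) - alpha * log_pi_next), axis=-1)
    # Eq. (37): soft Bellman target, and MSE against the Q-values of the taken actions a.
    y = tf.stop_gradient(r + gamma * (1.0 - done) * v_next)
    idx = tf.stack([tf.range(tf.shape(a)[0]), tf.cast(a, tf.int32)], axis=-1)
    critic_loss = tf.reduce_mean((tf.gather_nd(q1, idx) - y) ** 2
                                 + (tf.gather_nd(q2, idx) - y) ** 2)
    # Eq. (40): actor loss as an expectation over the discrete action distribution.
    actor_loss = tf.reduce_mean(tf.reduce_sum(
        pi * (alpha * log_pi - tf.stop_gradient(tf.minimum(q1, q2))), axis=-1))
    # Eq. (34): temperature loss that drives the policy entropy towards the target H_bar.
    alpha_loss = tf.reduce_mean(tf.reduce_sum(
        tf.stop_gradient(pi) * (-alpha * (tf.stop_gradient(log_pi) + H_bar)), axis=-1))
    return critic_loss, actor_loss, alpha_loss
```

Each loss would then be minimized with its own optimizer, and the target-network parameters would be periodically (or softly) copied from the critics, as described for lines 16–19 of Algorithm 2.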

Algorithm 2 DCDO-DRL

Complexity analysis

The time complexity of DCDO-DRL mainly involves two parts: the S2S neural network and discrete SAC. The S2S neural network includes an encoder (Bi-LSTM), a decoder (LSTM), and an attention mechanism, whose time complexities are \( {\mathcal {O}}(L\times N\times M^2)\), \({\mathcal {O}}(L\times N\times M^2)\), and \({\mathcal {O}}(L\times N\times M\times H)\), respectively, where L is the sequence length, N is the batch size, M is the number of hidden units, and H is the number of attention heads. In addition, the time complexity of discrete SAC is \({\mathcal {O}}(B\times P\times K\times T)\), where B is the batch size, P is the number of parameters in the neural network, K is the number of training steps, and T is the number of computations per step. Hence, the time complexity of DCDO-DRL is the sum of the two parts, \({\mathcal {O}}(L\times N\times (2\,M^2+M\times H)+B\times P\times K\times T)\).

Numerical results

This section presents the experimental settings, the algorithm convergence, and the impact of attributes on algorithm performance. Furthermore, we investigate the statistical advantages of the DCDO-DRL strategy over seven methods in three scenarios.

Simulation setup

To evaluate the performance of the DCDO-DRL strategy, PyCharm is used as the Python IDE. The S2S neural network is built with the TensorFlow framework. We implement the DCDO-DRL strategy in TensorFlow based on OpenAI Spinning Up.

Inspired by Ref. [24], we set the system parameters and the initial hyperparameters of the S2S neural network after visiting three centers. The system parameters are given in Table 2, covering the hardware and communication conditions of the TD and ESs as well as the RIDM task information. Specifically, the transmission rate and power are set to 7 Mbps and 1.258 W [61]. The CPU computational capacity of the TD is 1 G cycles/s, while that of an ES is 9 G cycles/s. The energy coefficients of the TD and ES are set to \(1.25 \times 10^{-8}\) J/cycle and \(1.25\times 10^{-7}\) J/cycle according to Ref. [61]. In our simulation experiment, it is assumed that each subtask of the RIDM task is offloaded to an ES or executed on the TD. Because the radiomics workflow is complex and changeable, we model different RIDM tasks as DAGs with different topologies, which capture the dependencies among modules. The RIDM task data size is set between 250 and 2500 KB, and the subtask number V of the DAG ranges from 10 to 30 according to the different requirements of radiomics. The computational complexity of each subtask is \({10}^{7}\)–\({10}^{8}\) cycles. We select 100 DAGs for each subtask number as the training set and another 20 DAGs as the test set. The S2S neural network is then trained with the information of each subtask in the DAG as input. Finally, to obtain a robust offloading strategy, the DCDO-DRL strategy uses discrete SAC to train the S2S neural network.
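A minimal sketch of how such a task set could be generated with the stated parameter ranges is given below. The edge-sampling rule, the even split of the task data across subtasks, and the function names are assumptions for illustration, not the paper's generator.

import random
import networkx as nx

def random_ridm_dag(num_subtasks, edge_prob=0.3, seed=None):
    """Random DAG with the parameter ranges above: 10-30 subtasks,
    250-2500 KB task data, 1e7-1e8 cycles per subtask."""
    rng = random.Random(seed)
    task_data_kb = rng.uniform(250, 2500)            # total RIDM task data size (KB)
    dag = nx.DiGraph()
    for v in range(num_subtasks):
        dag.add_node(v,
                     data_kb=task_data_kb / num_subtasks,  # assumed even split
                     cycles=rng.uniform(1e7, 1e8))         # required CPU cycles
    # Only allow edges from lower- to higher-indexed nodes so the graph stays acyclic.
    for u in range(num_subtasks):
        for v in range(u + 1, num_subtasks):
            if rng.random() < edge_prob:
                dag.add_edge(u, v)
    return dag

train_set = [random_ridm_dag(v, seed=i) for v in range(10, 31) for i in range(100)]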

The S2S neural network is set as a two-layer Bi-LSTM encoder and a two-layer LSTM decoder, each with 256 hidden units and layer normalization [62]. During training, the learning rate is 0.0003, the gradient descent step length is 0.00001, and the batch size is 100. These hyperparameters significantly affect the training and convergence speed of the DCDO-DRL strategy. After initialization and grid search, the optimal hyperparameter settings are presented in Table 3.

Table 2 The value for system parameters
Table 3 The value for the S2S neural network and training hyperparameters
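To make the network configuration above concrete, the following is a minimal sketch of an S2S offloading network with a two-layer Bi-LSTM encoder, a two-layer LSTM decoder, 256 hidden units, layer normalization, and attention. The feature dimension and action count are illustrative placeholders, and the decoder here consumes the encoder states directly, which simplifies the autoregressive decoding used in the paper.

import tensorflow as tf
from tensorflow.keras import layers

FEAT_DIM, NUM_ACTIONS, UNITS = 8, 2, 256             # placeholder dimensions

subtask_seq = layers.Input(shape=(None, FEAT_DIM))    # one DAG encoded as a sequence
x = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))(subtask_seq)
x = layers.LayerNormalization()(x)
enc_out = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))(x)
enc_out = layers.LayerNormalization()(enc_out)

dec = layers.LSTM(UNITS, return_sequences=True)(enc_out)
dec = layers.LayerNormalization()(dec)
dec = layers.LSTM(UNITS, return_sequences=True)(dec)
dec = layers.LayerNormalization()(dec)

# Project encoder states to the decoder width, then attend over them before
# predicting a TD/ES decision for every subtask position.
enc_proj = layers.Dense(UNITS)(enc_out)
context = layers.Attention()([dec, enc_proj])
logits = layers.Dense(NUM_ACTIONS)(layers.Concatenate()([dec, context]))

actor = tf.keras.Model(subtask_seq, logits)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)   # lr = 0.0003 as above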

Comparison algorithms

To evaluate the performance of the DCDO-DRL strategy, we compare it with the following seven algorithms: (1) local computing (L. Comp.): all subtasks of the DAG are executed on the user terminal device without offloading. (2) Full offloading (F. Offl.): all subtasks of the DAG are executed on the edge server. (3) Random offloading (R. Offl.): each subtask of the DAG is randomly offloaded to the user terminal device or the edge server. (4) Greedy offloading (G. Offl.): the offloading location of each subtask of the DAG is chosen by selecting the locally optimal option at each step. (5) Round-Robin-based offloading (RR. Offl.): the subtasks of the DAG are alternately executed on the TD and the ES. (6) HEFT-based offloading (HEFT. Offl.): HEFT. Offl. [27] adopts the Heterogeneous Earliest Finish Time algorithm to prioritize the subtasks of the DAG and schedules the sorted subtasks according to the earliest estimated finish time. (7) DRL-based task offloading (DRLTO): DRLTO [24] combines a recurrent neural network and DRL to handle the task offloading scheme and adopts Proximal Policy Optimization to improve training efficiency. The simplest rule-based baselines are sketched in the code after this list.
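The decision rules of the first few baselines admit one-line implementations; the sketch below uses the convention 0 = run on the TD and 1 = offload to the ES, with function names chosen here for illustration.

import random

def local_computing(num_subtasks):
    return [0] * num_subtasks                    # L. Comp.: everything on the TD

def full_offloading(num_subtasks):
    return [1] * num_subtasks                    # F. Offl.: everything on the ES

def random_offloading(num_subtasks, seed=None):
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(num_subtasks)]   # R. Offl.

def round_robin_offloading(num_subtasks):
    return [v % 2 for v in range(num_subtasks)]  # RR. Offl.: alternate TD and ES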

Convergence analysis

This subsection evaluates the convergence of the proposed DCDO-DRL and DRLTO. Since the aim of this article is to maximize the RIDM task execution utility, we set \(\beta ^t=\beta ^e=0.5\) according to Ref. [24]. The subtask number V of the DAG for the RIDM task is 15, the transmission rate is 7 Mbps, and the transmission power is 1.258 W. The CPU computational capacities of the main ES and the TD are 9 G cycles/s and 1 G cycles/s, respectively. Other parameters are detailed in Tables 2 and 3. The DCDO-DRL strategy records the average reward and updates the S2S neural network at each iteration.

The simulation results are shown in Fig. 10. The x-axis denotes the number of iterations, and the y-axis represents the average reward. It can be seen that the average reward rises quickly within the first 100 iterations. As the number of iterations increases further, the value grows steadily with a smaller oscillation amplitude. The results demonstrate that the average reward of the DCDO-DRL strategy converges to 0.021 at around 200 iterations. Although DRLTO shows the same convergence trend, its convergence speed is lower than that of the proposed DCDO-DRL strategy. DRLTO converges quickly before 200 iterations and then slows down, with its average reward finally converging to 0.009 at 500 iterations. Therefore, compared to DRLTO, the DCDO-DRL strategy improves the training speed. The reason is that the proposed DCDO-DRL strategy maximizes the entropy and the expected reward at the same time, which gives it a stronger exploration capability during training.

Impact of subtask numbers

This subsection contrasts the performance of the DCDO-DRL strategy with that of the seven algorithms for various subtask numbers. In scenario 1, the system is deployed as follows: the subtask number V of the DAG for the RIDM task ranges from 10 to 30; the transmission rate is 7 Mbps; the CPU computational capacities of the main ES and the TD are 9 G cycles/s and 1 G cycles/s, respectively. The rest of the parameter values are shown in Tables 2 and 3. The simulation results are shown in Fig. 11.

Fig. 10 The average reward of the DCDO-DRL and DRLTO

By varying the subtask number, the DCDO-DRL strategy achieves higher utility on the RIDM task than the other algorithms. As shown in Fig. 11, the DCDO-DRL strategy has a lower average delay than most algorithms (Fig. 11a), and its average energy consumption is usually lower (Fig. 11b) and its average utility higher (Fig. 11c) than those of the other algorithms. When the number of subtasks is small (i.e., V=10), the average delay, average energy consumption, and average utility of each algorithm are lower, but the DCDO-DRL strategy is still optimal. As V increases further, all three metrics increase. The main reason is that a larger number of subtasks yields a more complex DAG for the RIDM task, which exacerbates the difficulty of task scheduling. In addition, assigning more subtasks to the ES reduces the computation time but increases the data transmission time as well as the computation and transmission energy consumption; the computation energy consumption of the ES is also higher than that of the TD. In summary, in the scenario with variable subtask numbers, the DCDO-DRL strategy improves the execution utility of RIDM tasks by 23.07% compared to DRLTO (computed as \(\frac{{\text {utility}}_{\text {DCDO-DRL}}-{\text {utility}}_{\text {DRLTO}}}{{\text {utility}}_{\text {DRLTO}}}\)).

In addition, we analyze the correlation between average delay, average energy consumption, and average utility. A joint distribution diagram visualizes the interrelationship between two variables. Figure 12a shows the joint distribution between the average delay and the average utility under scenario 1; the regression line indicates a positive correlation, with the average utility increasing as the average delay increases. Figure 12b shows the joint distribution between the average energy consumption and the average utility under scenario 1, and its regression line likewise exhibits a positive correlation.
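A plot of this kind can be produced as sketched below with a joint-distribution plot and a fitted regression line; the DataFrame here holds synthetic placeholder values only, standing in for the per-run averages logged during the simulation.

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
delay = rng.uniform(1.0, 5.0, size=50)                         # placeholder values
results = pd.DataFrame({
    "avg_delay": delay,
    "avg_utility": 0.2 * delay + rng.normal(0, 0.1, size=50),  # placeholder values
})
sns.jointplot(data=results, x="avg_delay", y="avg_utility", kind="reg")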

Fig. 11 Illustration on the impact of subtask number

Fig. 12 Joint distribution between average utility and average delay or average energy consumption under scenario 1

Fig. 13 Illustration on the impact of the transmission rate

Impact of transmission rate

This subsection contrasts the performance of the DCDO-DRL strategy with that of the seven algorithms under various transmission rates. In scenario 2, the system is deployed as follows: the transmission rate ranges from 5 to 17 Mbps; the subtask number V of the DAG for the RIDM task is 15; the CPU computational capacities of the main ES and the TD are again 9 G cycles/s and 1 G cycles/s, respectively. The rest of the parameters are shown in Tables 2 and 3. The simulation results are shown in Fig. 13.

By varying the transmission rate, the results show that the DCDO-DRL strategy performs better on the RIDM task than the other algorithms. As shown in Fig. 13, the average delay and average energy consumption of L. Comp. are fixed. When the transmission rate is small (i.e., \({r}_{{e}_j,{d}_i}^s=5\)), transmitting all data to the ES incurs considerable delay (Fig. 13a). When \({r}_{{e}_j,{d}_i}^s=7\) or \(9\), L. Comp. has the lowest average energy consumption and the highest average delay (Fig. 13b). As \({r}_{{e}_j,{d}_i}^s\) increases further, the average delay and average energy consumption of all algorithms except L. Comp. decrease, and the DCDO-DRL strategy has the lowest energy consumption (Fig. 13a, b). This is because a higher transmission rate favors offloading tasks to the ES. Figure 13c shows that the average utility of the algorithms from F. Offl. to DCDO-DRL gradually increases with the transmission rate, since the reduced transmission time encourages executing subtasks on the ES. To sum up, the DCDO-DRL strategy improves the execution utility of the RIDM task by 12.77% compared to the second-best algorithm, DRLTO, in the scenario with varying transmission rates.

Similarly, the histograms on the upper and right sides of Fig. 14a show the marginal distributions of the average delay and the average utility under scenario 2, respectively, while the middle part shows the joint distribution between the two variables. The histogram at the top of Fig. 14b displays the marginal distribution of the average energy consumption. The two regression lines in Fig. 14 have negative slopes, implying that both the average delay and the average energy consumption are negatively correlated with the average utility; the shaded areas show the confidence intervals of the regression lines. As either variable increases, the average utility displays a decreasing trend. However, the regression line in Fig. 14a is clearly steeper than that in Fig. 14b. Therefore, the average delay has a greater effect on the average utility.

Fig. 14 Illustration on joint distribution between average utility and average delay or average energy consumption under scenario 2

Fig. 15 Illustration on the impact of CPU computational capacity

Fig. 16 Illustration on joint distribution between average utility and average delay or average energy consumption under scenario 3

Impact of CPU computational capacity

To further evaluate the DCDO-DRL strategy, this subsection compares its performance with that of the seven algorithms under various CPU computational capacities. In scenario 3, the system is deployed as follows: the CPU computational capacity of the main ES ranges from 1 G cycles/s to 8 G cycles/s; the transmission rate is 7 Mbps; the subtask number V of the DAG for the RIDM task is 15; and the CPU computational capacity of the TD is 1 G cycles/s. The rest of the parameter values are shown in Tables 2 and 3. The simulation results are shown in Fig. 15.

When the computing power of the ES is adjusted, the DCDO-DRL strategy again performs better on the RIDM task. As shown in Fig. 15, the average delay, average energy consumption, and average utility of L. Comp. are constant, because L. Comp. is not affected by the computing power of the ES. When the computational power of the ES is small (i.e., \(f_{{e}_j}^s=1\)), it is equivalent to that of the TD, and running all data on the ES generates massive energy consumption; thus, the energy consumption of F. Offl. is very high in Fig. 15b. When \(f_{{e}_j}^s=2\), the average energy consumption of F. Offl. drops abruptly and its average utility rises steeply. As \(f_{{e}_j}^s\) increases further, there is little difference in the average delay of the individual algorithms (Fig. 15a), the average energy consumption shows a steadily decreasing trend, and the average utility increases slowly, except for L. Comp. and F. Offl. (Fig. 15b, c). The diminishing influence of the ES's computing capability is the primary cause of this. In conclusion, compared to the second-best algorithm, DRLTO, the DCDO-DRL strategy improves the execution utility of RIDM tasks by 8.51% when faced with different CPU computing power.

Likewise, Fig. 16 shows the joint distributions between the average delay, average energy consumption, and average utility under scenario 3. Both regression lines indicate a negative correlation between the corresponding pair of variables. However, the regression line in Fig. 16b is steeper than that in Fig. 16a, reflecting that the average energy consumption has a greater impact on the average utility.

Table 4 The P value under various experiment settings

Statistical superiority analysis

Statistical testing is a widely used method for evaluating algorithm performance in various fields. In the above analysis, we compute the average utility of the algorithms across different subtask numbers, different transmission rates, and different CPU computational capacities. To determine the superiority of the DCDO-DRL strategy in the three scenarios, we conduct pairwise comparisons. In this article, we use the Wilcoxon rank sum test [60] as a non-parametric statistical test, which assesses the significance of the differences between algorithms via the P value. Note that we consider two algorithms to be statistically different if and only if the P value is less than 0.05. The P values calculated under the three scenarios are shown in Table 4. For the subtask numbers, the P values are all less than 0.05, indicating a statistically significant difference between the DCDO-DRL strategy and the other algorithms. Similarly, there is a statistically significant difference for the transmission rates, as all P values are less than 0.05. In terms of CPU computing power, although not all P values are less than 0.05, 5 out of 7 comparisons still show statistically significant differences. To sum up, across the 21 pairwise comparisons between DCDO-DRL and the seven algorithms, only two P values exceed 0.05, reflecting the statistical superiority of the DCDO-DRL strategy in maximizing the RIDM task execution utility.
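A single pairwise comparison of this kind can be sketched as follows with SciPy's rank-sum test; the input lists of per-setting average utilities and the function name are assumptions for illustration.

from scipy import stats

def significantly_better(utility_dcdo, utility_baseline, alpha=0.05):
    """Wilcoxon rank-sum test on two samples of average utilities."""
    statistic, p_value = stats.ranksums(utility_dcdo, utility_baseline)
    return p_value < alpha, p_value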

Conclusion

In this article, we propose the DCDO-DRL strategy, which plays a significant role in improving the RIDM execution efficiency and adapting to different RIDM environments in the medical image cloud. DCDO-DRL aims to maximize the RIDM task utility, a weighted sum of the DEC generated by execution. Specifically, the internal dependencies of the RIDM task based on radiomics are modeled by a DAG, and the offloading decision process over the DAG is represented as sequence prediction by the S2S neural network. Next, we propose the DCP algorithm to accelerate subtask processing by coordinating the resources of multiple ESs. Finally, to improve the robustness of the S2S neural network, the DCDO-DRL strategy follows discrete SAC. The results show that the DCDO-DRL strategy improves the execution utility of the RIDM task by at least 23.07%, 12.77%, and 8.51% in the three scenarios, respectively.

It is worth noting that content caching is also an effective way to decrease computational delay and energy consumption. Therefore, our future research will focus on the problem of combining content caching with task offloading. One potential solution is to formulate the problem as a mixed-integer non-linear programming (MINLP) problem, which can then be reduced to a 0–1 knapsack problem and solved by an efficient algorithm.