1 Introduction

Cloud computing is generally accepted as a type of distributed system linked by a high-speed network. It encompasses both the applications delivered as services over the Internet and the hardware and systems software that dynamically provide those services to users (Armbrust et al. 2010; Adhikari et al. 2019). As a paradigm that provides services in a pay-as-you-go (Zhang et al. 2020) or pay-per-use (Zhou et al. 2019) manner, Cloud computing takes four forms: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS) (Armbrust et al. 2010; Adhikari et al. 2019; Zhan et al. 2015; Midya et al. 2018; Chase and Niyato 2017), and the newer serverless computing (Rings et al. 2009; Adhikari et al. 2019).

Cloud computing provisions computing resources in terms of CPU (Central Processing Unit) (Adhikari et al. 2019; Kardani-Moghaddam et al. 2021), RAM (Random Access Memory) (Rjoub et al. 2020; Monge et al. 2020), GPU (Graphics Processing Unit) (Shao et al. 2019; Tong et al. 2020), disk capacity (Adhikari et al. 2019; Kardani-Moghaddam et al. 2021) and network bandwidth (Rjoub et al. 2020; Mei et al. 2019). From another perspective, “time” and “space” are also two pivotal resources of Cloud computing: time refers to the whole service life cycle of the Cloud platform, and space refers to the physical place where devices are deployed. The electrical components of Cloud computing devices are driven by electric energy and operate within this time and space, and together they constitute the real resources that Cloud computing assembles. Therefore, the real natural resource provided by Cloud computing is effective electric energy conversion per unit of space and per unit of time (frequency), with energy, time, and space regarded as essential resources (Lin et al. 2023a). Limited resource utilization capacity raises the cost and energy consumption of a Cloud system (Zhou et al. 2019; Wan et al. 2020). Moreover, long response times, long queuing times and high delay rates directly degrade QoS (quality of service). Consequently, how to schedule the components of Cloud computing in an efficient, energy-saving and balanced manner is a critical factor that will influence the future orientation of Cloud computing.

Cloud computing has several characteristics, including the huge scale of devices, the complexity of scenarios, the unpredictability of user requests, the randomness of electronic components, and the uncertain temperature of various components during operation. These characteristics pose challenges to efficient and effective resource scheduling in Cloud computing (Xie et al. 2019; Guo et al. 2019; Duc et al. 2019). Currently, multi-phase approaches (Laili et al. 2020; Guo et al. 2019; Xu and Buyya 2019), virtual machine migration (Kumar et al. 2019; Ren et al. 2020b; Zhang et al. 2019a), queuing models (Caron et al. 2009; Ding et al. 2020; Zhang et al. 2020; Duc et al. 2019), service migration (Tuli et al. 2022; Ren et al. 2020a; Zhan et al. 2015), workload migration (Fiandrino et al. 2017), application migration (Zhan et al. 2015; Duc et al. 2019), task migration (Tian et al. 2018; Kumar et al. 2019; Miao et al. 2020) and scheduler scheduling algorithms are the common strategies for resource management in Cloud computing. Among these, the core of the solution is still the design of the scheduler on the basis of a scheduling algorithm. Figure 1 shows a resource management and task allocation process with a scheduler at its core. Users operate clients to submit task requests to the Cloud center through high-speed networks; the Cloud center collects tasks, generates scheduling schemes using scheduling algorithms, and allocates tasks to server nodes; the server nodes then provide the corresponding services to users (Zhou et al. 2023a). Owing to their impact on the effective operation of the Cloud, scheduling algorithms for Cloud computing have attracted considerable research attention. The scheduling problem in distributed systems is usually NP-complete or NP-hard, admitting no polynomial-time algorithm unless \(NP=P\) (Adhikari et al. 2019; Chen et al. 2019; Mei et al. 2019). Existing methods for scheduling problems fall mainly into six categories: Dynamic Programming, Probability (randomized) algorithms, Heuristic methods, Meta-Heuristic algorithms, Hybrid algorithms and Machine Learning (ML).

Fig. 1: Resource management and task allocation process based on schedule center

As classic (non-machine-learning) methods are not adept at addressing the complex scheduling scenarios of Cloud computing, there are abundant discussions and research on the application of ML in Cloud scheduling, such as work from Microsoft (Bianchini et al. 2020), the CLOUDS Laboratory of The University of Melbourne (Ilager et al. 2021), and other institutes (Duc et al. 2019; Rodrigues et al. 2020; Demirci 2015). Deep reinforcement learning (DRL), a branch of ML, is a novel approach combining the advantages of deep neural networks (DNN) and reinforcement learning (RL). In recent years, DRL has become prevalent in solving Cloud scheduling and has proven to hold strong advantages in many complex scenarios (Guo et al. 2021; Feng et al. 2020; Karthiban and Raj 2020; Wang et al. 2021a; Dong et al. 2020; Cao et al. 2020; Chen et al. 2022c; Xu et al. 2022). Many surveys (Price 1982; Kumar et al. 2019; Adhikari et al. 2019; Duc et al. 2019; Rodrigues et al. 2020; Cong et al. 2020b; Zhan et al. 2015; Bera et al. 2015; Xu et al. 2017a; Lin et al. 2021; Ren et al. 2020b; Xu and Buyya 2019; Welsh and Benkhelifa 2020; Cong et al. 2020a; Braiki and Youssef 2019; Jennings and Stadler 2015; Arunarani et al. 2019; Demirci 2015; Goodarzy et al. 2020; Singh et al. 2023; Khan et al. 2022; Lin et al. 2023b) have provided detailed, comprehensive and valuable reviews of various fields in Cloud computing. Some examples related to Cloud resource optimization management are as follows. Adhikari et al. (2019) reviewed workflow scheduling in the Cloud and analyzed the characteristics of its techniques by classifying them based on objectives and execution mode. Lin et al. (2023b) focused on the performance interference of virtual machines and revisited interference-aware strategies for scheduling optimization as well as co-optimization-based approaches. Arunarani et al. (2019) provided a literature survey of task scheduling strategies (mainly covering meta-heuristic-based task scheduling) and discussed the various issues related to scheduling methodologies and the limitations to overcome. Xu et al. (2017a) reviewed load balancing algorithms for virtual machine placement in Cloud computing, including heuristic, meta-heuristic and hybrid algorithms related to load balancing problems. Following different scheduling scenarios, Zhan et al. (2015) presented a comprehensive survey of evolutionary approaches in Cloud resource scheduling, mainly including the genetic algorithm (GA), ant colony optimization (ACO) and particle swarm optimization (PSO). Singh et al. (2023) presented a review and taxonomy of meta-heuristic scheduling techniques in Cloud and fog across several categories, including physics-based, evolutionary, biology-based and chemistry-based algorithms. Some existing surveys have discussed the application of ML in Cloud scheduling (Goodarzy et al. 2020; Duc et al. 2019; Rodrigues et al. 2020; Demirci 2015; Khan et al. 2022). For example, Duc et al. (2019) discussed ML methods for resource provisioning in edge-Cloud applications, mainly including DNN, support vector machines (SVM), decision trees, Bayesian networks, splines, and exponential smoothing. Rodrigues et al. (2020) discussed the application of machine learning to computation and communication control in mobile edge computing, including fuzzy control models, tree-based naive Bayes, SVM, etc. Khan et al. (2022) presented a literature review of ML methods in Cloud resource management, mainly covering prediction or classification approaches such as SVM, k-nearest neighbors (KNN), DL, etc. However, there is no survey specifically discussing the application of DRL in Cloud scheduling, a novel direction that has emerged and developed in recent years. Researchers are still exploring the application patterns of RL, and especially DRL, in Cloud scheduling (Feng et al. 2020; Lu et al. 2020; Xu et al. 2017b; Liu et al. 2017; Kardani-Moghaddam et al. 2021; Guo et al. 2021; Karthiban and Raj 2020; Tong et al. 2020; Wang et al. 2021a; Cao et al. 2020; Nouri et al. 2019). Similarly, DRL (or RL) is also applied to scheduling problems in other fields (Ni et al. 2020; Baccour et al. 2020).

Noting the potential application value of DRL in Cloud scheduling, we provide a comprehensive survey of existing research that uses DRL-based methods to solve Cloud scheduling problems. Based on the reviews and discussions, we then identify challenges and future directions for applying DRL to more realistic Cloud scheduling scenarios.

The main contributions of this paper can be summarized as follows.

  (1) A comprehensive review and discussion of existing scheduling algorithms for Cloud computing;

  (2) An analysis of the frameworks of RL and DRL from the perspective of model structures;

  (3) A structured review and discussion of existing research using DRL in Cloud scheduling;

  (4) Identified challenges and potential future directions of DRL-based methods in Cloud scheduling.

The rest of the paper is organized as follows. Sect. 2 formulates the scheduling problem and reviews the existing scheduling algorithms used in Cloud computing, classified into classic methods and machine learning methods. Sect. 3 presents a structural analysis of RL and DRL applied in Cloud scheduling to aid understanding of the DRL (RL) methods used in existing research. Sect. 4 provides structured presentations of existing research using DRL methods and discusses the current situation of DRL in Cloud scheduling. Then, Sect. 5 lists challenges and potential future directions of applying DRL in Cloud scheduling. Finally, Sect. 6 concludes this paper.

2 Scheduling and algorithms in cloud

2.1 Mathematical formulation of scheduling

For ease of presentation, we list some notations with their descriptions in Table 1.

Table 1 A list of notations with descriptions

In distributed systems, scheduling problems are usually NP-hard (Adhikari et al. 2019; Ghalami and Grosu 2019; Xu et al. 2009). Some of the mainstreams in Cloud scheduling focus on objectives including minimizing energy consumption (Gokuldhev et al. 2020; Mishra and Manjula 2020; Lin et al. 2023a), minimizing makespan (Sardaraz and Tahir 2020; Natesan and Chokkalingam 2020; Dong et al. 2020), minimizing delay time (or delayed services) (Pandiyan et al. 2020; Belgacem et al. 2020; Zhang et al. 2020), reducing response time (Tuli et al. 2022; Haytamy and Omara 2020), maximizing the degree of load balancing (Sardaraz and Tahir 2020; Ghasemi and Haghighat 2020; Adhikari et al. 2020), increasing reliability (Pandiyan et al. 2020; Tuli et al. 2022), increasing the utilization of resources (Li et al. 2020a; Lu et al. 2020; Ding et al. 2020), maximizing the profit of providers (Sardaraz and Tahir 2020; Natesan and Chokkalingam 2020; Gabi et al. 2020), maximizing task completion ratio (Tuli et al. 2022; Priya et al. 2019; Wang et al. 2015), minimizing Service Level Agreement (SLA) Violation (Tuli et al. 2022; Li et al. 2020a; Nouri et al. 2019), maximizing throughput (Zhang et al. 2019b; Mishra and Manjula 2020; Devaraj et al. 2020), and multi-objectives (Natesan and Chokkalingam 2020; Gokuldhev et al. 2020; Mishra and Manjula 2020).

There are several different definitions of resource scheduling in the literature (Kumar et al. 2019; Adhikari et al. 2019; Zhan et al. 2015). From Kumar et al. (2019), resource scheduling can be done in two ways: the first is on-demand scheduling, in which the Cloud service provider provides resources quickly to random workloads, and the second is long-term reservation, in which large numbers of virtual machines sit in idle condition, due to which an under-provisioning type of problem occurs. From Adhikari et al. (2019), task scheduling is to find an optimal order of the tasks that meets the scheduling objectives. Resource scheduling is defined by Zhan et al. (2015) as finding an “optimal” mapping “Tasks \(\rightarrow\) Resources” to meet one or several given objectives. There are still other definitions, which focus on whether the scheduled object is a task, a workflow, or a resource. Additionally, some definitions use resource scheduling as a general term for resource management, which may cover task scheduling, workflow scheduling, resource scheduling, etc. In this paper, we unify these under the terms “resource scheduling in Cloud computing” or “Cloud scheduling”. A scheduling algorithm for the Cloud can then be defined as an algorithm with specific rules, strategies, or processes that generates a scheduling scheme specifying which resources a task is assigned to (i.e., X) and when to start processing the task (i.e., S).

Referring to existing studies of Cloud scheduling, and for the sake of a comprehensive discussion, we establish a universal formulation for scheduling problems. Assume that the number of indivisible tasks is M, the number of server nodes is N, and each server node has D dimensional resources (such as CPU load, GPU load, RAM, bandwidth, disk storage, etc.). Then, the i-th task can be represented by a parameter matrix \(V_i=\left\{ v_{ijk} \right\} _{N\times D}\) where \(1\le i\le M\), \(1\le j\le N\), \(1\le k\le D\), and \(v_{ijk}\) indicates the capacity, space or time requirement for the k-th dimensional resource when the i-th task is allocated to the j-th server node. The set of task parameters \(\left\langle V_1, V_2,\dots V_M \right\rangle\) is denoted \(V=\left\{ v_{ijk} \right\} _{M\times N\times D}\). The parameters of server nodes are set as \(L=\left\{ l_{jk} \right\} _{N\times D}\), where \(l_{jk}\) denotes the load status of the k-th dimensional resource in the j-th server node. Using a matrix \(X=\left\{ x_{ij}\right\} _{M\times N}\) to represent the allocation solution of the mapping “Tasks \(\rightarrow\) Resources” and a matrix \(S=\left\{ s_{i}\right\} _{M}\) to represent the start times of tasks, a scheduling scheme can be expressed by the combination of X and S, denoted \(\left\langle X, S \right\rangle\). Here, \(x_{ij}\in \{0,1\}\) and \(\sum _{j=1}^{N} x_{ij}=1\), which means each indivisible task can be allocated to only one node; \(x_{ij}=1\) means the i-th task is allocated to the j-th node. Constraints on S can encode the execution order between tasks. For example, setting \(s_{i_1}\ge e_{i_2}\) (where \(e_{i}\) is the end time of the i-th task) means that the \(i_1\)-th task must begin after the \(i_2\)-th task finishes. Thus, S is sufficient to encode the execution order of tasks.
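To make the notation concrete, the following Python sketch (illustrative only; the dimensions and random values are hypothetical) represents V, L, X and S as NumPy arrays and checks the constraint \(\sum _{j=1}^{N} x_{ij}=1\).

```python
import numpy as np

M, N, D = 5, 3, 4          # tasks, server nodes, resource dimensions (illustrative values)
rng = np.random.default_rng(0)

V = rng.uniform(1.0, 10.0, size=(M, N, D))   # v_ijk: demand of task i on node j, resource k
L = rng.uniform(0.0, 0.5, size=(N, D))       # l_jk: current load of node j, resource k

# A candidate scheduling scheme <X, S>
X = np.zeros((M, N), dtype=int)
X[np.arange(M), rng.integers(0, N, size=M)] = 1   # each indivisible task goes to exactly one node
S = rng.uniform(0.0, 5.0, size=M)                 # start time s_i of each task

assert (X.sum(axis=1) == 1).all()   # constraint: sum_j x_ij = 1 for every task i
```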

An optimization result of scheduling is a mapping from the solution \(\left\langle X, S \right\rangle\) and the parameters of tasks V and server nodes L to an objective value. Thus, the optimization objective can be set as

$$\begin{aligned} \min \omega =\omega \left( X, S, V, L \right) \end{aligned}$$
(1)

where \(\omega\) is a function with respect to X, S, V and L. Multi-objectives can be represented by multiple functions of \(\omega\) as

$$\begin{aligned} \min \omega =\left\{ \begin{aligned} \omega _1\left( X, S, V, L \right) \\ \omega _2\left( X, S, V, L \right) \\ \dots \end{aligned} \right. \end{aligned}$$
(2)

For example, the objective of minimizing makespan can be expressed as Eq. (3) and that of minimizing total running time as Eq. (4) assuming each node is either idle or processing only one task (Zhou et al. 2023a).

$$\begin{aligned} \min {\omega _{makespan}} = \mathop {\max }\limits _{j = 1,2, \ldots ,N} \left( {\sum \limits _{i = 1}^M {{x_{ij}}{v^{PT}_{ij}}} } \right) \end{aligned}$$
(3)
$$\begin{aligned} \min {\omega _{total \_ time}} = \sum \limits _{j = 1}^N {\sum \limits _{i = 1}^M {{x_{ij}}v^{PT}_{ij}} } \end{aligned}$$
(4)

where \(v^{PT}_{ij}\) denotes the processing time of the i-th task when executed on the j-th node; it belongs to one dimension of V, as time can also be regarded as a resource dimension. The objective of load balancing can be expressed as Eq. (5) when using the variance of load to measure the degree of balancing (Zhou et al. 2022).

$$\begin{aligned} \min \omega _{variance} = { \frac{1}{N}\sum \limits _{j = 1}^{N} {{{\left( {\sum \limits _{i = 1}^{M} {{x_{ij}}{v^{DS}_{ij}}} } \right) }^2}}} {- { {\frac{1}{N^2}}\left( \sum \limits _{j = 1}^{N} {\left( {\sum \limits _{i = 1}^{M} {{x_{ij}}{v^{DS}_{ij}}} } \right) }\right) ^2 }} \end{aligned}$$
(5)

where \(v^{DS}_{ij}\) denotes the disk storage requirement when the i-th task is allocated to the j-th node; it also belongs to one dimension of V.

Assuming the power of the j-th node at time t depends on its load status \(L_j(t)\), denoted \(P_j(t)=P_j \left( L_j(t) \right)\), the objective of minimizing the energy consumption of the whole system from time 0 to T can be written as Eq. (6) (Ding et al. 2020; Shan et al. 2020; Lin et al. 2021).

$$\begin{aligned} \min \omega _{energy\_consumption} = \sum \limits _{j = 1}^N {\int _0^T {{P_j}\left( {{L_j}\left( t \right) } \right) } \, dt } \end{aligned}$$
(6)

As the above examples show, most optimization objectives in Cloud scheduling can be expressed in the form \(\omega \left( X, S, V, L \right)\).
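Continuing the sketch above, and assuming (purely for illustration) that the processing-time values \(v^{PT}_{ij}\) and disk-storage values \(v^{DS}_{ij}\) occupy the first two resource dimensions of V, the objectives of Eqs. (3)-(5) can be evaluated directly for a given X:

```python
V_PT = V[:, :, 0]   # v^PT_ij: processing time of task i on node j (assumed to be dimension 0)
V_DS = V[:, :, 1]   # v^DS_ij: disk-storage demand of task i on node j (assumed dimension 1)

node_time = (X * V_PT).sum(axis=0)        # total processing time accumulated on each node
makespan = node_time.max()                # Eq. (3): the most loaded node finishes last
total_time = node_time.sum()              # Eq. (4): total running time over all nodes

node_disk = (X * V_DS).sum(axis=0)        # disk storage placed on each node
variance = (node_disk ** 2).mean() - node_disk.mean() ** 2   # Eq. (5): load-balancing objective
```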

Fig. 2: A diagram of scheduling algorithm to generate the scheme \(\left\langle X, S \right\rangle\)

With this formulation of the scheduling problem, a scheduling algorithm is an integration of mappers from \(\left( V,L,\omega \right)\) to the scheduling scheme \(\left\langle X, S \right\rangle\). Denoting an algorithm as Al, its solution can be expressed by

$$\begin{aligned} Al\left( V,L,\omega \right) =\left\langle X, S \right\rangle \end{aligned}$$
(7)

Thus, the process of using an algorithm to solve for optimized solutions is shown in Fig. 2. As Fig. 2 illustrates, the two key factors in scheduling are the production and evaluation of schemes. When solving scheduling schemes, evaluating the performance of an optimized solution, i.e., the process of obtaining \(\omega \left( X, S, V, L \right)\) or an equivalent evaluation function, is crucial. Some simple optimization objectives in ideal scenarios can be calculated directly. However, for some complex optimization objectives, the function \(\omega \left( X, S, V, L \right)\) may have no explicit expression. For example, for minimizing energy consumption in Eq. (6), \(P_j \left( L_j(t) \right)\) generally cannot be represented by elementary functions, so the expression of \(\omega \left( X, S, V, L \right)\) is implicit. Even for optimization objectives with explicit expressions in ideal scenarios, it may also be difficult to directly calculate the optimization results in highly stochastic system processes. For example, when the processing time \(v^{PT}_{ij}\) in Eq. (3) is random and follows different distributions, the makespan will also be random. Thus, the selection of scheduling algorithms should be based on the characteristics of the scenario. Different mapping processes in Eq. (7) correspond to different categories of algorithms.
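To illustrate this point, one hedged workaround when \(v^{PT}_{ij}\) is random is a Monte Carlo estimate of the expected makespan; the sketch below (continuing the NumPy example, with an exponential distribution chosen purely as an assumption) averages the makespan over sampled realizations of the processing times.

```python
def expected_makespan(X, V_PT, n_samples=1000, seed=1):
    """Estimate E[makespan] when processing times are random (exponential here, purely an assumption)."""
    rng = np.random.default_rng(seed)
    draws = rng.exponential(V_PT, size=(n_samples,) + V_PT.shape)   # one realization per sample
    return float(np.mean((X * draws).sum(axis=1).max(axis=1)))      # average makespan over samples
```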

Fig. 3: A diagram for continuous dynamic scheduling process over time t

For dynamic scheduling, a diagram of the process over time is shown in Fig. 3. The scheduling scheme at time t must meet the scheduling requirements at the current time, but it also affects the status of server nodes at subsequent times. This means that scheduling decisions at time t must account for subsequent changes in the system, which in turn raises requirements for evaluating the quality of scheduling schemes and highlights the significance of a predictor.

Generally, algorithms for Cloud scheduling fall into six categories: Dynamic Programming (DP), Probability algorithms (randomization), Heuristic methods, Meta-Heuristic algorithms, Hybrid algorithms and Machine Learning. Judging from their properties, all of these except ML lack the ability to predict system states. In this paper, we regard dynamic programming, randomization, heuristic methods, meta-heuristic algorithms, and hybrid algorithms as classic approaches. To analyze the future direction of Cloud scheduling and discuss the potential application of DRL, we first review the current scheduling algorithms for the Cloud.

2.2 Review for classic algorithms

Among classic approaches, the most commonly utilized methods in the surveyed literature are heuristic, meta-heuristic and hybrid algorithms. We therefore mainly review these three types of algorithms to support the later review and discussion of the application of DRL in Cloud scheduling.

2.2.1 Heuristic algorithms

A heuristic is an algorithm that solves an optimization problem based on an intuitive or empirical construction. Due to their low complexity, heuristic algorithms are prevalent in scenarios with a clear evaluation function that demand speed but not highly optimized results. Additionally, the worst-case behavior of heuristic algorithms is generally predictable, hence the lower risk of improper allocation.

In existing research, Guan and Melodia (2017) applied the Jacobi Best-response Algorithm (JBA) to minimize cost in Multi-Broker Mobile Cloud Computing Networks and proved theoretical results demonstrating the existence of disagreement points and the convergence of the brokers' Jacobi Best-response Algorithm to those points. Tian et al. (2016) proposed a 2-competitive algorithm adapting Johnson's model to minimize the makespan of multiple MapReduce jobs and proved its performance in theory. Lin et al. (2022) proposed Peak Efficiency Aware Scheduling (PEAS) to optimize energy consumption and QoS in the online virtual machine allocation and reallocation of the Cloud. Dynamic Bipartition-First-Fit (BFF), a \((1+\frac{g-2}{k}-\frac{g-1}{k^2})\)-competitive algorithm based on the First-Fit algorithm, was proposed by Tian et al. (2013), who proved its performance theoretically. Hong et al. (2019) proposed a QoS-Aware Distributed Algorithm based on first-come-first-improve (FCFI) and all-come-then-improve (ACTI) algorithms to reduce the computation time and energy consumption of Industrial IoT-Edge-Cloud Computing Environments. ECOTS (energy consumption optimization cloud task scheduling algorithm), with low time and space complexity, took into account multiple key factors such as task resource requirements, the server power efficiency model and performance degradation in order to reduce the energy consumption of the Cloud (Lin et al. 2018). The Longest Loaded Interval First algorithm (LLIF), a 2-approximation algorithm with a theoretical proof of its performance, was proposed by Tian et al. (2018) to minimize the energy consumption of VM reservations in the Cloud.

Other common heuristic methods are Johnson’s model, FF (first fit), BF (best fit), RR (round-robin), FFD (first fit decreasing), BFD (best fit decreasing), Jacobi Best-response Algorithm (Guan and Melodia 2017) and their variants.
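As a minimal illustration of this family (not a reproduction of any surveyed algorithm), the sketch below implements a best-fit-decreasing placement for a single resource dimension; the function name and capacity model are hypothetical.

```python
def best_fit_decreasing(demands, capacities):
    """Place tasks (by single-dimension demand) on the node with the tightest remaining slack that still fits."""
    remaining = list(capacities)
    placement = {}
    # consider tasks in decreasing order of demand
    for task, demand in sorted(enumerate(demands), key=lambda kv: kv[1], reverse=True):
        candidates = [j for j, slack in enumerate(remaining) if slack >= demand]
        if not candidates:
            raise ValueError(f"task {task} cannot be placed")
        best = min(candidates, key=lambda j: remaining[j] - demand)  # best fit: least leftover slack
        remaining[best] -= demand
        placement[task] = best
    return placement

# e.g. best_fit_decreasing([4, 2, 7, 3], [8, 8, 8]) maps each task index to a node index
```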

Table 2 A summary of heuristic algorithms

To give an overall view, we collected the reviewed literature and obtained Table 2. As Table 2 shows, heuristic algorithms mainly focus on single-objective optimization, including minimizing makespan, minimizing energy consumption and load balancing. However, heuristics have several defects, as follows.

  (1) In scenarios using heuristics, some major quantities (such as time, energy or load) are often assumed to be given or easily calculated. For complex scenarios where the optimization objective is implicit with respect to solutions, heuristic algorithms often fail to generate a feasible solution.

  (2) A heuristic algorithm is often designed for one or a few specific scenarios. When even one element in the scenario changes, the algorithm may need to be redesigned.

  (3) Heuristics are usually only suitable for single-objective problems.

  (4) The solutions of heuristic algorithms can usually be further optimized.

2.2.2 Meta-heuristic algorithms

Structurally, meta-heuristics, which combine heuristics with randomization (Kumar et al. 2019), include Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), the Genetic Algorithm (GA), the Firefly Algorithm (FA), etc.

ACO imitates the foraging behavior of ant colonies as its search route. Liu et al. (2018) proposed OEMACS, combining OEM (order exchange and migration) local search techniques and ACO, to reduce the energy consumption of VM deployment in Cloud computing; it significantly reduced energy consumption and improved the effectiveness of different resources compared with conventional heuristic and other evolutionary-based approaches. A et al. (2019) proposed two ant colony-based optimization algorithms (TACO) to address VM scheduling and routing in multi-tenant Cloud data centers, aiming to improve energy utilization in Cloud computing. Abualigah and Diabat (2020) proposed an alternative meta-heuristic technique based on the Ant Lion Optimizer Algorithm (MALO) to address multi-objective optimization in Cloud computing, which performed better in load balancing and makespan compared with GA, MSDE, PSO, WOA, MSA and ALO.

GA imitates the process of natural evolution as its search route. Proposed by Deb et al. (2002), NSGA-II achieves better convergence and solution quality and has become one of the benchmarks, using the fast non-dominated sorting algorithm, introducing an elitist strategy and using the crowded-comparison operator. Liu et al. (2016) improved the search strategy based on NSGA-II to reduce energy consumption, response time, load imbalance and makespan in Cloud computing. NSGA-III uses well-distributed reference points as a novel search route to maintain the diversity of the population and improve the optimization results of GA (Seada and Deb 2015; Miriam et al. 2021). Xu et al. (2019) applied NSGA-III to optimize the execution time and energy consumption of IoT-enabled Cloud-edge computing. MOGA (Jiang et al. 2016) and MOEAs (Laili et al. 2020) improved the search-route strategies based on NSGA-II and were utilized to address Cloud scheduling.

Studies of the Firefly algorithm include FA (Adhikari et al. 2020) and FIMPSOA (Devaraj et al. 2020); those of PSO include MOPSO (Li et al. 2017), TSPSO (Jena 2015) and HAPSO (Midya et al. 2018). Other meta-heuristic algorithms include the Multi-objective Whale Optimization Algorithm (MWOA) (Reddy and Kumar 2017), the nature-inspired Chaotic Squirrel Search Algorithm (CSSA) (Sanaj and Prathap 2020), etc.

Table 3 A summary of meta-heuristic algorithms

By collecting and sorting the literature that uses meta-heuristic algorithms to solve resource scheduling problems, we obtain Table 3 with the corresponding optimization objectives. Since meta-heuristic algorithms are also applicable to the dynamic scheduling and heterogeneous server scenarios where heuristic algorithms apply, their application scenarios are not listed in Table 3. As Table 3 shows, meta-heuristic algorithms, with their searchability over the solution space, can address more complex optimization problems, covering not only single-objective but also multi-objective problems. They are applied to optimizing cost, energy consumption, makespan, running time and resource utilization. Meta-heuristic algorithms are more widely applicable than heuristic algorithms, but at the expense of computational complexity and randomness. However, although the optimization objectives addressed by meta-heuristics include some complex quantities (such as energy, QoS and cost), their calculations have been simplified with idealized assumptions far from reality (A et al. 2019; Ramezani et al. 2015; Xu et al. 2019).

Fig. 4: A diagram of search-based algorithms

Meta-heuristic and other search algorithms are based on a specific search route, as diagrammed in Fig. 4. They use the search route to adjust current solutions and generate new ones, evaluate the performance (such as fitness) of the newly generated solutions according to evaluation functions based on the optimization objectives, and then decide whether to proceed to the next search step based on that evaluation. The two key factors in Fig. 4 are the search route and the evaluation of solutions; the search route must generate better solutions. However, meta-heuristics have several inevitable defects, as follows.

  (1) Even in scenarios where solutions can be directly evaluated, the convergence of a meta-heuristic cannot be guaranteed due to the presence of randomness, and this randomness incurs redundant computations.

  (2) As the search space grows, the required number of search iterations must also increase accordingly, producing more redundant solutions.

  (3) When it is difficult to evaluate the quality of a solution, the search route loses its direction, and the search algorithm degenerates into pure randomization. When \(\omega \left( X,S,V,L \right)\) is implicit, meta-heuristics and other search algorithms do not themselves provide a method for evaluating solutions.

  (4) Meta-heuristics also do not provide a way to predict system status.

The first and second defects limit the optimality of meta-heuristics even in their feasible scenarios. The third and fourth defects, which also appear in heuristics, make the algorithms unusable in some real-world complex scenarios.
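To make the generate-and-evaluate loop of Fig. 4 concrete, the sketch below uses random single-task reassignments as the search route and the makespan of Sect. 2.1 as the evaluation function; both choices are illustrative and do not correspond to any specific published meta-heuristic.

```python
def local_search(X, V_PT, iters=500, seed=2):
    """Search route: move one task to another node; evaluation: makespan of the new scheme."""
    rng = np.random.default_rng(seed)
    best_X = X.copy()
    best_val = (best_X * V_PT).sum(axis=0).max()
    for _ in range(iters):
        cand = best_X.copy()
        i = rng.integers(cand.shape[0])           # pick a task
        cand[i] = 0
        cand[i, rng.integers(cand.shape[1])] = 1  # reassign it to a random node
        val = (cand * V_PT).sum(axis=0).max()     # evaluate the candidate scheme
        if val < best_val:                        # accept only improving solutions
            best_X, best_val = cand, val
    return best_X, best_val
```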

2.2.3 Hybrid algorithms

Other classic algorithms used in Cloud scheduling mainly comprise DP, random algorithms, and hybrid algorithms (which combine two or more algorithms). Among them, hybrid algorithms are also widely used to solve complex scheduling problems in Cloud computing, as they can combine the advantages of multiple algorithms to produce better solutions. In terms of search algorithms, a single algorithm has an inherent local convergence solution, whereas the solution of a hybrid algorithm must satisfy the convergence conditions of multiple algorithms simultaneously (Zhou et al. 2023a); therefore, the convergence solution of a hybrid algorithm is usually better than that of the corresponding single algorithm. PSO-ACS (M and T 2021), which mingles PSO and ACO, applies PSO to find the optimal task scheduling solution and ACO to find the best migration path of VMs on PMs. FACO (Ragmani et al. 2020), a hybrid fuzzy ant colony optimization algorithm, exploited a fuzzy module dedicated to pheromone evaluation to improve the performance of ACO by optimizing its search route. The Hybrid Genetic-Gravitational Search Algorithm (HG-GSA) (Chaudhary and Kumar 2019) is based on the gravitational search algorithm for searching the best particle position, consequently optimizing the search route of GA. FMPSO (modified PSO + fuzzy theory) (Mansouri et al. 2019) used crossover and mutation operators to surmount the local optimum of PSO and applied a fuzzy inference system for fitness calculation. The SFLA-GA algorithm (shuffled frog leaping algorithm + GA) (Kayalvili and Selvam 2019) took advantage of both algorithms to transmit information among groups, hence an improved search route. GHW-NSGA II (Zhou et al. 2023b) leveraged a heuristic-based search algorithm as an extra search route of NSGA-II to optimize the utilization of multi-dimensional resources, which improved convergence speed and optimality relative to GA. SPSO-GA (Chen et al. 2022a) combined a Self-adaptive Particle Swarm Optimization algorithm with Genetic Algorithm operators to reduce energy consumption when offloading DNN layers in a Cloud-Edge environment. On the basis of SPSO-GA, PSO-GA-G (Chen et al. 2022d) added a greedy algorithm to optimize computation offloading. The combination of multiple meta-heuristics is beneficial for improving overall convergence speed, hence improving search efficiency. LPT-One and BFD-One (Zhou et al. 2023a) used heuristic algorithms as the search routes and combined multiple heuristic-based search routes to improve the approximation of minimizing makespan.
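A common pattern behind such hybrids is to let a heuristic seed or steer the search; the hypothetical sketch below builds an LPT-style initial scheme and then refines it with the `local_search` routine from the earlier sketch (both components are illustrative stand-ins, not any of the cited algorithms).

```python
def heuristic_seeded_search(V_PT, iters=500):
    """Hybrid sketch: a greedy heuristic builds the initial scheme, a search route then refines it."""
    M, N = V_PT.shape
    X0 = np.zeros((M, N), dtype=int)
    load = np.zeros(N)
    # heuristic seed (LPT-style): take tasks longest-first, give each to the least-loaded node
    for t in sorted(range(M), key=lambda t: V_PT[t].mean(), reverse=True):
        j = int(np.argmin(load))
        X0[t, j] = 1
        load[j] += V_PT[t, j]
    # meta-heuristic-style refinement starting from the heuristic seed
    return local_search(X0, V_PT, iters=iters)
```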

Other hybrid algorithms in Cloud scheduling include ABC-SA, which integrates simulated annealing (SA) into the artificial bee colony (Muthulakshmi and Somasundaram 2019), SFGA (a hybrid Shuffled Frog Leaping Algorithm and Genetic Algorithm) (Ibrahim et al. 2020), TSDQ (hybrid meta-heuristic algorithms based on Dynamic dispatch Queues) (Alla et al. 2018), etc. These algorithms demonstrate the flexibility, superiority, adaptability and mobility of hybrid algorithms and simultaneously show the wide possibilities and significance of structural research into hybrids.

Similar to meta-heuristic algorithms, hybrid algorithms are applicable to a variety of multi-objective problems. However, a hybrid algorithm, built from multiple heuristics or meta-heuristics as elemental algorithms, cannot go beyond the scenarios its elemental algorithms are suitable for, which implies that it is likewise unsuitable for scenarios with an implicit \(\omega \left( X,S,V,L \right)\).

2.2.4 Summary of classic algorithms

Although the classic algorithms have been applied to various objectives under various scenarios and have achieved considerable performance, they still do not solve how to calculate or evaluate elements such as energy, time, load and utilization from the properties of tasks and resources. Therefore, the Cloud computing models in these applications differ from realistic scenes, which means the algorithms are only applicable when such elements (time, cost, energy, load) are given or easy to calculate. This also leads to a gap between the expected performance of these algorithms and their actual operational performance in reality. When the objective mapping \(\omega \left( X,S,V,L \right)\) is implicit, classic algorithms cannot guarantee optimality and may even fail to obtain feasible solutions, because they do not provide a method to measure the performance of solutions. In addition, for a new optimization problem, classic algorithms, lacking memorability, need to resolve the optimization from scratch.

2.3 Machine learning

Before providing a detailed introduction to DRL-based algorithms in Cloud scheduling, we collect the ML methods used in Cloud scheduling from the reviewed literature in Table 4. These methods mainly comprise deep learning (DL), RL and DRL. In addition, other types of machine learning, such as KNN (Khan et al. 2022; Lin et al. 2023a), imitation learning (Guo et al. 2021; Wang et al. 2021b) and SVM (Lin et al. 2021; Rodrigues et al. 2020), have also been applied in Cloud scheduling.

Table 4 Machine Learning Methods in Cloud Scheduling

In practice, a Cloud system has several characteristics:

  • large scale and complexity of systems that make it impossible to model accurately;

  • timeliness of scheduling decisions, which demands high-speed scheduling algorithms;

  • randomness of tasks (or requests) including randomness of task numbers, arrival time and sizes.

These characteristics make research on Cloud scheduling challenging. Most existing optimization methods are designed for specific applications (Lin et al. 2023a). As more factors are incorporated into the modeling of Cloud scheduling, the existing classic algorithms become inapplicable: it is hard for any specific heuristic, meta-heuristic or hybrid algorithm to fully adapt to real dynamic Cloud computing or Edge-Cloud computing systems.

Considering the defects of classic algorithms, ML-related methods can utilize specific mapping structures to record the computational pattern of optimization objectives. This addresses the dilemma of evaluating the quality of a solution when \(\omega \left( X,S,V,L \right)\) is implicit. For example, the Q-table in Q-learning and the various DNNs in DRL both embed the ability to evaluate the quality of a solution with some memorability. However, there is no given target scheduling scheme to act as a label, which makes it impossible to solve the scheduling problem using DL alone. One effective approach is to apply DL within a meta-heuristic to evaluate the quality of solutions, using the realistic behavior of the system to obtain optimization objectives and to train the neural networks. This enables meta-heuristics to perform effective searches, as in the NN-DNSGA-II algorithm (combining DL with GA) (Ismayilov and Topcuoglu 2020) and the DLSC framework (combining DL with PSO) (Haytamy and Omara 2020).
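The idea can be sketched as a learned surrogate for the implicit \(\omega \left( X,S,V,L \right)\): a small regressor is trained on observed (scheme features, measured objective) pairs and then used as the evaluation function inside a search loop. The PyTorch sketch below is a hypothetical illustration of this pattern, not the architecture used in NN-DNSGA-II or DLSC.

```python
import torch
import torch.nn as nn

class ObjectiveEstimator(nn.Module):
    """Surrogate for the implicit objective omega(X, S, V, L), learned from observed system runs."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, scheme_features):
        return self.net(scheme_features)

def train_step(model, optimizer, features, measured_objective):
    """Regress objectives measured on the real system (e.g., energy) onto scheme features."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(features).squeeze(-1), measured_objective)
    loss.backward()
    optimizer.step()
    return loss.item()

# Inside a meta-heuristic, model(features_of(candidate)) then replaces the explicit evaluation function.
```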

A novel type of ML policy for Cloud scheduling is the combination of DNN and RL, called deep reinforcement learning. Different from the combination of DL and meta-heuristics, DRL leverages a DNN to act as the solution generator. Integrating the advantages of RL and DL, DRL has the following benefits.

  • Ability of modeling: it can model complex systems and decision-making policies with DNN even when \(\omega \left( X,S,V,L \right)\) is implicit;

  • Adaptability for optimization objectives: training based on gradient descent makes it possible to search for optimized solutions under various objectives;

  • Adaptability for the environment: DRL can adjust parameters to adapt to various environments;

  • Possibilities for further growth: DRL can grow over time to process large-scale tasks;

  • Adaptability for state-space: DRL can process continuous states or multi-dimensional states;

  • Memory of experience: DRL possesses the capacity to memorize experience through experience replay.

To demonstrate the above benefits and further analyze the challenges of DRL in scheduling, the next section introduces and analyzes the frameworks of RL and DRL as the foundation for the follow-up review and discussion.

3 Analysis of RL and DRL frameworks in scheduling

Machine learning is the discipline of teaching computers to predict outcomes or classify objects without explicit programming (Rodrigues et al. 2020). ML is also an artificial intelligence discipline that studies how computers simulate or implement human learning behavior so that they can gain new knowledge and skills. Based on learning methods, ML can be divided into supervised learning, unsupervised learning and semi-supervised learning (Rodrigues et al. 2020); RL is regarded as a form of unsupervised learning (Nouri et al. 2019). Based on learning strategy, ML comprises symbolic learning, artificial neural network learning, statistical ML, bionic ML, etc. DL, based on deep artificial neural networks, and RL are two intersecting subsets of ML, and their intersection is DRL. DRL, combining the perception ability of DL with the decision-making ability of RL, has been applied in robot control, computer vision, natural language processing and the game of Go (Luong et al. 2019; Wang et al. 2020).

From the scheduling formulation and classic algorithms in Sect. 2, the two key factors of scheduling are the production and evaluation of solutions. Classic algorithms, including heuristics and meta-heuristics, do not possess the ability to evaluate solutions or to predict the system status when \(\omega \left( X,S,V,L \right)\) is implicit; they therefore suffer in some realistic scenarios. Owing to the combination of DL and RL, DRL has the flexibility to adopt a DNN for any step in solving scheduling schemes. Meanwhile, the RL mechanism in DRL maintains well-performing solutions, as it ensures statistical performance advantages through training. RL within DRL, based on the theory of Markov processes, is also suitable for dynamic scheduling (Dong et al. 2023; Chen et al. 2023c, a).

The application of DRL to scheduling problems in Cloud computing has emerged in recent years as an effective intersection of two developing technologies, and DRL has shown superior performance in current research on Cloud scheduling. To support the subsequent review of existing research and the discussion of challenges and future directions, this section introduces and analyzes the evolution of RL and DRL frameworks, providing a comprehensive insight into how DRL operates.

3.1 RL framework

Firstly, we introduce and analyze the framework of RL. RL is based on the interaction model between the agent and the environment: it instructs the agent to learn the optimal action strategy from the feedback the environment returns for the agent's actions. The RL model can update the action strategy according to both timely feedback and long-term feedback, and the agent chooses actions on the basis of the action strategy. State space, action space, environment, feedback (reward), and strategy are the five basic elements of RL. Figure 5 shows a fundamental structure of RL with these basic elements.

Fig. 5: A fundamental framework of RL

In some of the reviewed literature related to RL (Dong et al. 2020; Kardani-Moghaddam et al. 2021; Liu et al. 2017), the feedback is described as a reward. In this paper, we regard it as feedback, considering that both positive feedback and negative feedback affect the learning of the agent's action strategy (Tong et al. 2020; Guo et al. 2021; Lu et al. 2020). The concept of “feedback” originated in cybernetics (Glushkov and Kranc 1966). From this perspective, RL requires feedback from the environment outside the solver, whereas meta-heuristics and other search algorithms typically rely on evaluation functions; the role of feedback in RL is to update the solver (i.e., it affects how a scheme is solved) (Kaelbling et al. 1996; Gronauer and Diepold 2022; Yan et al. 2009; Shishira et al. 2016; Zhan et al. 2015), while evaluation functions in meta-heuristics or other search algorithms are responsible for updating the solution (directly changing the scheme). The fundamental framework of RL in Fig. 5 provides a simplified overall structure, but it is not sufficient to directly solve scheduling problems. In particular, this architecture does not address the difficulties, discussed above, that arise when solving scheduling problems. Therefore, we need to further evolve the RL architecture from this foundation.

From another standpoint for comprehending the fundamental structure of RL, the agent learns the strategy by trial and error. Trial and error requires the agent to maintain a balance between exploration and exploitation; greedy, random and meta-heuristic methods can be used to simulate the decision process between exploration and exploitation. The Markov decision process (MDP) is a common model for expressing the action-choice process, and the Bellman equation, a dynamic programming equation, is a common function for updating the action strategy. Hence, a framework of RL containing action selection and strategy update is shown in Fig. 6.

Fig. 6: A framework of RL with action selection and strategy update

The framework of RL with action selection and strategy update in Fig. 6 can already be used to solve some optimization problems, but it does not sufficiently consider the temporal changes in the system state and the agent state. Thus, it is not sufficient for time-related scheduling problems.
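As a concrete instance of the action-selection and strategy-update loop in Fig. 6, the following tabular Q-learning sketch combines ε-greedy selection with the Bellman-equation update; the environment interface (`env.reset`, `env.step`) is an assumption of the illustration.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=3):
    """Tabular Q-learning: epsilon-greedy action selection plus Bellman-equation strategy update."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))                 # tabular strategy: state -> action values
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # action selection: balance exploration (random) and exploitation (greedy)
            action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
            next_state, feedback, done = env.step(action)   # feedback from the environment
            # strategy update via the Bellman equation
            Q[state, action] += alpha * (feedback + gamma * Q[next_state].max() - Q[state, action])
            state = next_state
    return Q
```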

In complex scenarios, the state and the agent vary with time. In addition, decisions should be made according to the state and agent in real time, and the feedback from the environment affects the strategy directly. Hence, an agent-state-based structure of RL is shown in Fig. 7.

Fig. 7: A complex framework of RL based on varying agent-state

In most realistic scenes, a system is rarely completely independent and will change with extrinsic stimuli. The environment in Fig. 7 is actually the internal environment of the system, which cannot express the overall interference from other systems on this internal system. Hence, the system of RL in Fig. 7 should be regarded as an autonomous system, because the agent and environment evolve under autonomous rules. A computer game, the game of Go, or a language processing problem covering a large enough amount of data may be regarded as an autonomous system, because its regulation is quite stable without external modification. The movement of vehicles and antagonistic sports, by contrast, are usually affected by external incentives. A Cloud computing system, with time-varying constructive demands, optimization objectives and user requests, is therefore not an autonomous system. Moreover, the internal environment's update function for the agent-state and the decision-making function are also time-varying; for example, the revenue ratio of Cloud computing varies across different periods of the same day. Regarding the decision-making process as an ensemble, a framework of RL with a time-varying extrinsic stimulus is shown in Fig. 8.

Fig. 8: A framework of RL with extrinsic stimulus

The frameworks in Fig. 5 \(\rightarrow\) Fig. 6 \(\rightarrow\) Fig. 7 \(\rightarrow\) Fig. 8 constitute the evolution of the RL structure from simple to complex. The more complicated frameworks can address many decision-making problems when the input data is discretized. As the input data grows more complex and the dimensionality of the agent-state increases, RL frameworks that do not leverage DNNs become inapplicable, because decision-makers such as Q-tables cannot make use of the state and feedback information without sufficient perception ability, and training may not converge. Therefore, it is significant to integrate neural networks to enhance information perception ability and thereby improve the quality of optimized solutions.

3.2 DRL framework

Before further analyzing DRL frameworks, we revisit the framework in Fig. 8 from the viewpoint of mathematical mappings. In Fig. 8, the decision-maker, which is a complex mapping from agent-state to action, is integrated as an ensemble. Some patterns of RL focus on the expression of this mapping relationship, such as the Q-table, the Advantage Function, the Policy Gradient, and the Hidden Markov Chain. However, as the sizes and dimensions of the state space and action space increase, the computational complexity and storage requirements of these patterns grow exponentially. Furthermore, when the state space is non-discrete, as usually happens in scheduling problems, it is difficult to express the mapping relationship with general RL methods. Nonhomogeneous Markov process-based RL, one way to express a process with time-varying continuous time-space and continuous state space, requires solving differential equations with variable coefficients, which restricts its application to such problems. Hence, for various reasons, a DNN, with its excellent performance in establishing mapping relationships, is a sensible choice as the mapper of strategy between agent-state and action. We can then improve Fig. 8 into a framework that uses a DNN to express the decision process, shown in Fig. 9.

Fig. 9: A framework of RL with DNN-based decision-maker (belonging to DRL)

The framework in Fig. 9 is in fact DRL, which can deal with more complex scenarios than Fig. 8 by using DNNs to participate in decision-making. Regarding the decision process as a mapping process, Fig. 9 motivates us to reconstruct the structure of Fig. 8 according to mapping relations. We can then decompose the framework in Fig. 8 into five mappers, including the mapper of time-varying, the mapper of stimulus evolution, the mapper of decision, the environment and the mapper of feedback, as shown in Fig. 10.

Fig. 10: A framework of RL with multiple segments of system represented by multiple mappers

The details of each mapper in Fig. 10 are as follows:

  • The mapper of time-varying captures the relationship between the agent-state and time under stimuli from the extrinsic or internal space; time and updates are its input, and the stimulus force is its output.

  • The mapper of stimulus evolution captures how the agent and state evolve under stimuli, as both are usually variable with stimulus; the stimulus force is the input and the set of agent-state at the current time is the output.

  • The mapper of decision is responsible for computing the next action according to the current agent-state; the set of agent-state at the current time is the input and the action at the next time is the output.

  • The environment receives the action produced by the mapper of decision, evolves according to the agent's action, and then outputs the environment state at the next time. This output enters the mapper of feedback as its input and enters the mapper of time-varying as the internal stimulus for the agent.

  • The mapper of feedback receives the environment state, stores it in replay storage in preparation for long-term feedback, and simultaneously takes it as timely feedback. Long-term feedback and real-time feedback update the parameters of the decision-maker.

The framework in Fig. 10 is a generalized RL based on the integration of mappers. In experiments, the mapper of time-varying, the mapper of stimulus evolution and the environment are usually simulated by a program or observed in real scenes, while the mapper of decision and the mapper of feedback can be constructed with DNNs. Since the mapper of feedback aims to update the parameters of the decision-maker, it can be designed as a neural network that computes the loss function of the DNN in the mapper of decision, leading to Nature DQN or Double DQN (Li et al. 2020b; Karthiban and Raj 2020; Dong et al. 2020; Cheng et al. 2018), where the feedback network, called the target network, has the same structure as the decision network. Inherently, all five mappers in Fig. 10 can be represented by neural networks. In some scenes, it is difficult to simulate or observe the realistic process of a complex system, and a neural network can be used as an end-to-end alternative; an extreme example is when all five mappers are expressed with DNNs, as shown in Fig. 11. However, the existing research surveyed in this paper that uses DRL to resolve scheduling problems in Cloud computing replaces one or several of the five mappers with neural networks and has performed well in experiments according to the reported results. This research is reviewed in the next section to support the discussion and analysis of the current situation, challenges and future directions of DRL in Cloud scheduling.
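A condensed PyTorch sketch of this pattern is given below: a decision network (mapper of decision) is updated against a target network of the same structure (mapper of feedback), as in Nature/Double DQN. The state encoding, batch format and network sizes are assumptions of the illustration, not taken from any surveyed algorithm.

```python
import torch
import torch.nn as nn

class DecisionNet(nn.Module):
    """Mapper of decision: maps an encoded agent-state to one value per candidate action (node)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_update(decision_net, target_net, optimizer, batch, gamma=0.99):
    """One training step; `batch` holds tensors (states, actions, feedback, next_states, dones)."""
    states, actions, feedback, next_states, dones = batch
    # Q-value of the action actually taken (decision network)
    q = decision_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    # bootstrapped target computed by the frozen target network (mapper of feedback)
    with torch.no_grad():
        target = feedback + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# periodic synchronization keeps the two mappers consistent:
# target_net.load_state_dict(decision_net.state_dict())
```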

Fig. 11: A framework of DRL with multiple DL segments of system

3.3 Summary

The Cloud environment is a complex and random system with large-scale user requests and a complex physical environment, and these user requests and the extrinsic physical environment can be regarded as time-varying stimuli. In addition, the actual running processes of electronic components and software programs are hard to express by simulation. Crucially, the high dimensionality and continuity of the state space make the mapper of decision and the mapper of feedback difficult to model with conventional methods. In summary, the five mappers may need to be modeled with implicit expression functions, and the DNN is currently a practical method for delivering such implicit relations, given sufficient data and training time.

Moreover, the literature adjusts the structure of the neural network (CNN, LSTM, fully connected, Transformer, etc.), adds strategies for initializing neural network parameters and for training, introduces prediction or simulation of the internal and external environment, and assists with other meta-heuristic algorithms as appropriate. With the analysis of the RL framework in this section, we can review and analyze existing DRL-based methods in Cloud scheduling from this perspective.

4 Overview of DRL-based scheduling in cloud

Based on the location of the neural network, we review the frameworks in the surveyed papers. In RL, the central component is the mapper of decision, which acts as the scheduler of Cloud computing; in Cloud scheduling using RL, this mapper is usually represented by a DNN or a Q-table. To analyze in depth and summarize macroscopically the application of DRL in Cloud scheduling, we organize the information from the literature in a structured way. In addition to this information, we also reorganize the possible future work of some papers to provide another perspective derived from the analysis of the reviewed literature.

4.1 Literature review

QEEC (Ding et al. 2020) is a Q-learning-based task scheduling framework for energy-efficient Cloud computing using a Q-value table to express the decision-maker of action.

PCRA (Chen et al. 2022b) is a Prediction-enabled feedback Control with RL-based resource Allocation using a feedback control Q-value prediction model to predict the values of management operations at different system states.

DeepRM_Plus (Guo et al. 2021) uses a six-layer convolutional neural network (CNN) to describe the mapper of decision, building on the great success of DNNs in image processing. The state of the environment, represented as an image, is composed of the center cluster, the waiting queues, and the backlog queue.

AGH+QL (Sun et al. 2020), a novel revised Q-learning-based model, takes hash codes as input states with reduced size of state space.

DQST (Tong et al. 2020), deep Q-learning task scheduling, uses the fully connected network to calculate the Q-values which can express the mapper of action decision.

DERP (Bitsakos et al. 2018) uses three different approaches of a DRL agent to handle the multi-dimensional state and to provide elastic VM resources.

Modified DRL (Karthiban and Raj 2020), RLTS (Dong et al. 2020), DRL-Cloud (Cheng et al. 2018), and ADRL (Kardani-Moghaddam et al. 2021) also use the structure of an action-value Q-network (also called an evaluate Q-network (Dong et al. 2020)) and a target Q-network. Their similarities and differences are as follows.

IDRQN (Lu et al. 2020) is a fine-grained task offloading scheme based on DRL with a Q-network and a target network, where an LSTM layer is used in the Q-network and a candidate network is used to update the target network. The DPM framework (Liu et al. 2017), based on RL, adopts a long short-term memory (LSTM) network to capture prediction results and uses DRL to train the resource allocation strategy, aiming to reduce energy consumption in the Cloud environment. The LSTM network used to predict the state of the environment can be regarded as a mapper of time-varying in Fig. 10.

DDQN (Dueling Deep Q-Network) (Li et al. 2020b) contains a set of convolutional neural networks and a fully connected layer to achieve higher efficiency of data processing, lower network cost, and better security of data interaction.

MRLCO (Wang et al. 2021a), a Meta Reinforcement Learning-based method, contains a seq2seq neural network to represent the policy.

MADRL (Cao et al. 2020), a novel multi-agent DRL, contains actor-network and critic-network to generate Q value. The actor-network with the two-layer fully connected network is a mapper from state to action, and the critic-network with two fully connected network hidden layers and an output layer with one node is a mapper from state and action to Q-value.

DRL+FL (Shan et al. 2020), based on DDQN, uses Federal Learning to accelerate the training of DRL agents.

MDP_DT (Lolos et al. 2017a), a novel full-model-based RL for elastic resource management, employs adaptive state space partitioning.

RLFTWS (Dong et al. 2023) designed a heuristic algorithm for task allocation and execution according to the selected fault-tolerant strategy, and developed a DDQN to adaptively select the fault-tolerant strategy for each task under the current environment state, which involves not only prediction but also learning while interacting with the environment.

AV-MPO (Chen et al. 2023c), an on-policy maximum a posteriori policy optimization with gated Transformer-XL, applies an attention-based DRL algorithm to Cloud-edge collaborative manufacturing task scheduling.

Other DRL-based scheduling algorithms include HCDRL (Chen et al. 2023a), DT (decision transformer) using GPT (Wang et al. 2023), CORA (Huang et al. 2023), DRAW (Chen et al. 2023b), PRLCC (Zade et al. 2022), ReCARL (Xu et al. 2022), etc. These algorithms still belong to the DRL architecture of Fig. 10 analyzed in Sect. 3.

Based on the review and collection of the literature, Table 5 summarizes the category and objectives of RL-based algorithms, Table 6 summarizes the mappers, Table 7 summarizes the scenario and task/server nature, and Table 8 summarizes the experimental data and compared baselines.

Table 5 Summary of RL-based algorithms in terms of category and objectives
Table 6 Summary of RL-based algorithms in terms of the mappers of decision and other mappers
Table 7 Summary of RL-based algorithms in terms of scenario and task/server nature
Table 8 Summary of RL-based algorithms in terms of experimental data and compared baselines
Table 9 Summary of RL-based algorithms in terms of strategies and advantages

From Table 5, DRL methods, mainly based on QL and DQN, can address a variety of optimization objectives that cover almost all existing optimization objectives. Based on the analysis of the evolution of the DRL framework in Sect. 3, we can generalize these algorithms from the perspective of mappers. In Table 6, DNNs are mainly used as the decision-makers of DRL in scheduling. In DDQN, a target network is used as the mapper of feedback. This is consistent with our analysis in Sects. 2 and 3 that the two key factors for scheduling are the production and evaluation of schemes, and it also indicates that these two factors are the difficulties in scheduling problems. By incorporating different strategies and networks into the corresponding mapper in Fig. 10, we can obtain the corresponding RL-based scheduling algorithms. On this basis, the additional strategies are mainly aimed at improving training speed, perception ability, accuracy of solution evaluation, and optimality. From Table 7, DRL methods are mainly used for dynamic or online scheduling, and they are applicable not only to heterogeneous resources and independent tasks but also to dependent and non-preemptive tasks, which demonstrates that DRL methods have wider application scenarios. From Table 8, DRL methods outperform many existing scheduling algorithms. This re-verifies our analysis in Sect. 2 that, before tasks are actually executed on server nodes, classic algorithms cannot accurately evaluate or predict the quality of optimization schemes in the dynamic scheduling of complex scenarios. Using DNNs, DRL can obtain the performance of a scheme and guide further improvement of the solution.

Table 10 Future work of RL-based algorithms

In the reviewed literature, strategies for queuing, accelerating training, partitioning the agent's state space, capturing resource states, keeping rewards stable, etc., are proposed to optimize the performance of algorithms. To analyze more precisely the advantages of the DRL (or RL) methods in this literature, we collect the advantages of each work according to the description in the corresponding paper; their details are listed in Table 9. Combined with the results and conclusions in the reviewed literature, the future work of DRL-based (or RL-based) algorithms is listed in Table 10. With the structured information listed in these tables, we can discuss existing DRL-based methods in Cloud scheduling in depth.

Fig. 12 A framework of DDQN of DRL for scheduling

4.2 Discussion

Based on the above review of RL-based Cloud scheduling, and especially on the information listed in Tables 9 and 10, we summarize the current situation and advantages of DRL (and RL) in Cloud scheduling as follows.

  • DRL has strong adaptability to continuous or high-dimensional state spaces, as well as to the scheduling scenarios and various optimization objectives of Cloud computing.

  • DRL has the flexibility to adopt various DNNs as the mappers to predict some implicit information, so as to improve the optimality of scheduling.

  • In the reviewed literature, the scenario closest to a realistic setting that RL is used to solve is the dynamic online multi-resource scheduling problem in a Cloud or Edge-Cloud computing environment, which can involve dependent or independent tasks, workflows, and homogeneous or heterogeneous servers.

  • In the reviewed literature, experimental results show that DRL can achieve better performance than various commonly compared algorithms such as Randomization, FCFS, Round-robin, Greedy, Q-learning, MDP, QDT, FIFO, HEFT, FA, and SDR. These algorithms, together with the conventional DQN, can be regarded as baselines to evaluate other algorithms in the future.

  • DDQN is the most commonly used model to solve the scheduling problem in Cloud computing, as well as in some Edge-Cloud computing settings in the reviewed literature (Karthiban and Raj 2020; Dong et al. 2020; Cheng et al. 2018; Lu et al. 2020; Xu et al. 2017b; Kardani-Moghaddam et al. 2021; Li et al. 2020b; Cao et al. 2020; Shan et al. 2020). Its common framework is drawn in Fig. 12. The DDQN contains two networks with the same structure, the action-value Q network and the target-Q network (Shan et al. 2020; Wang et al. 2020). The action-value Q network generates the Q-value of the action corresponding to the current state, while the target-Q network generates the target value based on real-time and long-term feedback to obtain the loss function that trains the action-value Q network; a minimal code sketch of this two-network pattern is given after this list. Simultaneously, the wide application of DDQN also shows that the DRL model has strong adaptability and portability for various scheduling scenarios and optimization objectives.

  • DRL can also be leveraged to address multi-objective scheduling problems, whereas previous methods for solving multi-objective optimization problems are mainly meta-heuristic algorithms.

  • The major policies in the reviewed literature using DRL (or RL) cover several aspects:

    • Adjusting the structure of decision-mapper to DNN or Q-table;

    • Strategies to accelerate training of (deep) RL such as the periodical update;

    • Partition strategies for state-space;

    • Federated learning to improve convergence and stability;

    • Strategies to perceive current states or to predict subsequent states of the agent in RL;

    • Policies to provide the loss function to train the main net in DRL.
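
As referenced in the DDQN bullet above, the following is a minimal, illustrative sketch of the two-network pattern of Fig. 12: a trainable action-value Q network and a structurally identical target-Q network that supplies the training target and is periodically synchronized. The dimensions, hyperparameters, and the exact target computation (here, a Double-DQN-style target) are assumptions and do not reproduce any single reviewed implementation.

```python
# Illustrative action-value / target-Q network pattern (assumed sizes and hyperparameters).
import copy
import torch
import torch.nn as nn

def make_q_net(state_dim: int = 8, n_actions: int = 4) -> nn.Module:
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

q_net = make_q_net()                      # action-value Q network (trained)
target_net = copy.deepcopy(q_net)         # target-Q network (periodically synchronized)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(states, actions, rewards, next_states, dones):
    """One update: the target network provides the target value for the loss."""
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)          # select
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)    # evaluate
        targets = rewards + gamma * next_q * (1.0 - dones)
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target() -> None:
    """Periodic hard update of the target network from the action-value network."""
    target_net.load_state_dict(q_net.state_dict())

# Toy batch of random transitions.
B = 32
train_step(torch.randn(B, 8), torch.randint(0, 4, (B,)),
           torch.randn(B), torch.randn(B, 8), torch.zeros(B))
sync_target()
```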

5 Challenges and future directions for DRL-based scheduling

With the comprehensive review and analysis of the previous sections, we can discuss the challenges and future directions of DRL in Cloud scheduling.

Although DRL-based scheduling algorithms have demonstrated advantages in the reviewed literature, DRL, as a complex, non-analytic, and time-costly algorithm (Luong et al. 2019), faces inevitable challenges in addressing the scheduling problems of real large-scale Cloud computing systems. Based on the comprehensive review of existing Cloud scheduling and the investigation of the actual operation process, we summarize the main challenges and defects of using DRL to solve scheduling problems in Cloud computing as follows:

  1. (1)

    DRL consumes substantial computing power and incurs prodigious complexity during training and computation, especially for multi-cluster or large-scale systems. Thus, DRL requires a certain period of time before it can be put into use. For scenarios with single, easily computable objectives, using DRL is not cost-effective.

  2. (2)

    The scheduling results based on DRL are still unpredictable, so the worst-case performance is hard to evaluate. Therefore, the probability of system collapse caused by extremely poor scheduling schemes is not zero.

  3. (3)

    Real scheduling also depends on the prediction of dynamic tasks without preemption or prior knowledge. Since DRL relies on DNNs to perceive system states and generate solutions, its performance is highly dependent on the accuracy of those DNNs. The training dataset of DRL is too limited to cover all scenarios of the real system; thus, it is necessary to retrain the DRL model for each new scenario.

  4. (4)

    The gradient descent algorithm used in DRL and the Bellman equation used in QL have inherent restrictions that lead to local rather than global optimization (the one-step Q-learning update derived from the Bellman equation is recalled after this list). Additionally, the training labels of DRL in scheduling are usually not the global optimal solutions, and there is currently a lack of public, reliable datasets for training. These factors can result in a significant gap between the DRL solution and the theoretically optimal solution.

  5. (5)

    The unexplainability of the training process challenges theoretical derivation based on mathematical techniques. Modeling and theoretical derivation for high-dimensional continuous state spaces demand further development of the underlying mathematics.

  6. (6)

    Adversarial networks, federated learning, mathematical logic, non-homogeneous Markov processes, hidden Markov processes, and other policies can be utilized in DRL or RL, while existing DRL methods are mainly based on homogeneous Markov processes. For non-homogeneous Markov processes, the results involve solving matrix differential equations with variable coefficients, which remains difficult.
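
For reference, and as noted in item (4), the one-step Q-learning update derived from the Bellman optimality equation can be recalled as

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big], \]

where \(\alpha\) is the learning rate and \(\gamma\) is the discount factor; the bootstrapped \(\max\) term, together with gradient-based function approximation in DQN-style methods, is one reason why local rather than global optima may be reached.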

Addressing these challenges requires further theoretical research grounded in improved mathematical theory, such as Markov Decision Processes (Luong et al. 2019; Sun et al. 2020), Gradient Descent Theory (Wang et al. 2020; Guo et al. 2021), Matrix Theory (Sun et al. 2020) and Discrete Mathematics, as well as more research on real objects in realistic scenarios, such as the thermal conversion process, the calculation process, the driving process by electric signals, and the voltage switching process of components over time. Nevertheless, research on applying DRL to Cloud scheduling still has considerable potential and can advance the modeling of complex scenes, theoretical modeling for ML, reduction of computational complexity, the flexibility of scheduling algorithms, and theoretical research on existing algorithms in principle.

According to the previous review and analysis, some future directions for utilizing RL, especially DRL, in Cloud scheduling can be summarized as follows.

Fig. 13 Framework of a modified DDQN combining various approaches and networks from the references

  1. (1)

    DRL can be combined with other policies or approaches to handle complex scenes and multiple objectives, which has been verified in the experiments of the reviewed literature. Therefore, its application mode in realistic engineering environments, reducing the risk of excessive computational complexity, is a noteworthy direction.

  2. (2)

    Since RL is a category of non-supervised learning, one of the crucial issues is how to train the main network of the decision mapper in DRL. The main net in DRL can be represented by a CNN, an LSTM, a Transformer, etc. Various training strategies can accelerate the convergence of DRL, enabling it to participate in generating scheduling schemes more quickly. Policies to improve the convergence performance of DRL include leveraging meta-heuristic methods and imitation learning (Guo et al. 2021). The queuing model is also a crucial aspect for improving scheduler performance, e.g., the M/M/S queuing model (Ding et al. 2020) and the M/G/1 queuing model (Li 2009).

  3. (3)

    Application patterns in which DRL assists other analyzable scheduling algorithms, such as FCFS, the Hungarian algorithm, the LPT algorithm, Johnson's algorithm, and more.

  4. (4)

    It is a potential direction to adjust and improve the mappers based on the existing architecture of Fig. 10, so as to widen the adapted scenarios and improve the optimality of DRL. For example, combining the reviewed literature with our own research, we integrate a framework of a modified DDQN that combines various approaches and networks to solve scheduling problems in Cloud computing, especially for realistic scenarios, as shown in Fig. 13.

  5. (5)

    Exploring more novel roles of DRL is also a practical research direction; e.g., DRL can act as a system strategist to guide the process of selecting methods, since specific approaches are suited to specific scenarios (Zhou et al. 2022). To illustrate this possible novel role, a Deep Q-learning-based scheduler framework used to schedule various scheduling algorithms for different scenarios is presented in Fig. 14; it can be called a scheduler of scheduling algorithms, aiming to give full play to the strengths of all scheduling algorithms and recognizing that each algorithm embodies part of accumulated human knowledge. In this framework, the scheduling algorithms are regarded as resources that can be automatically selected, and DRL-based algorithms are not only components of the resource-scheduling algorithms but also strategies that guide the selection of specific scheduling algorithms. A DRL-based algorithm selector is one such attempt, whose framework can be seen in Fig. 15 (Zhou et al. 2022). As concluded in Zhou et al. (2022), it is not recommended to use a single algorithm for all scenarios.

  6. (6)

    Additional bases for applying DRL to scheduling problems are baselines and benchmarks. We summarize the desired characteristics of baselines and benchmarks as follows.

    A baseline should possess the following properties.

    1. (a)

      The baseline is easy to construct and implement;

    2. (b)

      It has reproducibility and performance stability;

    3. (c)

      It can adapt to multiple scenes, objectives and experiments with variable scales;

    4. (d)

      It should include various types of algorithms;

    5. (e)

      Its performance can be proved theoretically or has a recognized conclusion;

    6. (f)

      It should have been optimized to a certain extent and should contain some state-of-the-art algorithms representing the characteristics of their types;

    7. (g)

      The compared experiments should reduce the influence of parameter adjustment as much as possible and reserve the inherent performance of the algorithm.

    A benchmark should possess the following properties.

    8. (a)

      It should be easy to reproduce and calculate;

    9. (b)

      It should contain data from multiple scenarios and be as close as possible to real scenarios;

    10. (c)

      It can be applied to the experimental verification of a variety of optimization objectives;

    11. (d)

      It should have a dynamic scale rather than a single scale, which helps avoid performance improvements caused merely by parameter tuning;

    12. (e)

      If there is a random experiment, a benchmark should have enough sampling times;

    13. (f)

      It should have certain control variables to verify the advantage of a local strategy;

    14. (g)

      It should contain enough extreme scenarios, especially on some parameter boundaries;

    15. (h)

      The comparison is relatively fair, such as comparing the solutions generated under the same computational cost;

    16. (i)

      It can test the algorithm running under a variety of devices and components.

  7. (7)

    Other potential directions for using DRL to solve scheduling problems in Cloud computing still demand further research and can be listed as follows.

    1. (a)

      How to construct a novel well-performed framework of RL or DRL and how to construct a novel well-performed DNN in DRL?

    2. (b)

      How to accelerate the training or reduce the calculation complexity of (deep) RL to enhance its transferability?

    3. (c)

      How to ensure the stability of the results under the application of DRL to resolve scheduling problems in large-scale Cloud computing to avoid the risk caused by extremely poor schemes?

    4. (d)

      How to construct the deducible optimization theory?

    5. (e)

      How to build a flexible scheduling system combining various scheduling algorithms to cope with time-varying objectives?

    6. (f)

      How to capture agent-state of DRL accurately?

    7. (g)

      How to improve other categories of methods to address pervasive scheduling problems not only in Cloud computing but also in other distributed systems?

Fig. 14 Two-phase Q-learning-based scheduler used to schedule various scheduling algorithms

Fig. 15 A framework of DRL-based selector with various strategies (Zhou et al. 2022)

6 Conclusions

In this paper, we provide a universal formulation of scheduling and review various types of scheduling algorithms in the Cloud. Two key factors of scheduling are the production and evaluation of solutions. By analyzing the formulation and algorithms of scheduling, we discuss the defects of classic algorithms, which also demonstrates the necessity of DRL-based methods for scheduling. To aid understanding of DRL in Cloud scheduling, we analyze the evolution of RL frameworks (including DRL) from the perspective of mappers. On this basis, we provide a survey of existing DRL-based methods in Cloud scheduling. Finally, we analyze and discuss the advantages, challenges, and future directions of DRL-based Cloud scheduling.

From this survey, we can see that the application of DRL to resource scheduling in Cloud computing is an effective and irreplaceable technique. Based on the reviewed literature, the main advantages of DRL in resource scheduling are adaptability and portability across scenarios and optimization objectives: the use of DNNs enables DRL to represent high-dimensional and/or continuous agent state spaces when the objective is implicit or hard to calculate, which allows DRL-based methods to achieve better performance in many complex scenarios, such as the dynamic resource scheduling of large-scale Cloud computing with dependent tasks and heterogeneous servers. Due to the combination of DL and RL, DRL-based scheduling algorithms can solve some scheduling problems that classic algorithms are unable to solve.

With the review of existing works, we discuss the challenges of DRL in Cloud scheduling. The main challenges of using DRL in Cloud scheduling are complexity, unexplainability, and local convergence of the training process, as well as the unpredictability of scheduling results. We then provide several potential directions for future research on DRL in Cloud scheduling based on these challenges. Among the future directions, in addition to combining other policies to reduce complexity and improving the structure of DRL, we also propose the viewpoint of using DRL as an algorithm selector for scheduling. Moreover, we list the properties required for baselines and benchmarks of DRL-based Cloud scheduling.

Based on the above work, we can see that DRL has significant potential in Cloud scheduling and gives rise to abundant research directions. Regarding all algorithms as resources, how to combine DRL with other types of algorithms to solve more difficult scheduling problems (i.e., scheduling-algorithm selectors) is still worthy of continuous exploration and research.