Deep Reinforcement Learning-based Methods for Resource Scheduling in Cloud Computing: A Review and Future Directions

As the quantity and complexity of information processed by software systems increase, large-scale software systems increasingly require high-performance distributed computing systems. With the acceleration of the Internet in the Web 2.0 era, Cloud computing, a paradigm for providing dynamic, uncertain and elastic services, has shown its superiority in meeting computing needs dynamically. Without an appropriate scheduling approach, large-scale Cloud computing may incur high energy consumption and high cost, and high energy consumption in turn causes massive carbon dioxide emissions. Moreover, inappropriate scheduling reduces the service life of physical devices and increases the response time to users' requests. Hence, efficient scheduling of resources or optimal allocation of requests, which is usually an NP-hard problem, is one of the prominent issues in emerging trends of Cloud computing. Focusing on improving quality of service (QoS), reducing cost and abating pollution, researchers have conducted extensive work on resource scheduling problems of Cloud computing over the years. Nevertheless, the growing complexity of Cloud computing, a super-massive distributed system, is limiting the application of scheduling approaches. Machine learning, a practical method for tackling problems in complex scenes, has been used in recent years as an innovative way to solve resource scheduling in Cloud computing. Deep reinforcement learning (DRL), a combination of deep learning (DL) and reinforcement learning (RL), is one branch of machine learning and has considerable prospects in resource scheduling of Cloud computing. This paper surveys methods of resource scheduling with a focus on DRL-based scheduling approaches in Cloud computing, reviews the application of DRL, and discusses challenges and future directions of DRL in scheduling of Cloud computing.


INTRODUCTION
A Cloud environment is generally accepted as a set of distributed systems linked by a high-speed network, including the applications delivered as services over the Internet and the hardware and systems software that can dynamically provide services to users [1][2][3]. As a paradigm that provides services to the general public according to users' demand in a pay-as-you-go [4][5][6] or pay-per-use [7,8] manner, Cloud computing usually takes four forms: three traditional types of services, namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) [1,3,7,9,10], as well as the newer form of serverless computing [3,11].
As an elastic, reliable and dynamic service provider, Cloud computing provisions computing resources on the basis of CPU (Central Processing Unit) [3,12,13], RAM (Random Access Memory) [14,15], GPU (Graphics Processing Unit) [16,17], disk capacity [3,13] and network bandwidth [14,18]. From another perspective, "time", that is, the whole service life cycle of a Cloud computing platform, and "space", that is, the physical place that houses the physical devices of Cloud computing, are also two pivotal resources. Each electrical component of these devices, driven by electric energy and working in time and space, is part of the real resource assembly of Cloud computing. Hence, the real natural resource provided by Cloud computing is effective electric energy conversion per unit time and per unit space, with energy, time and space as the most essential resources. Therefore, limited resource utilization capacity raises both the cost of Cloud providers and the energy consumption of the Cloud computing system. Moreover, since users' time and resources' usage time are both valuable to society, long response time, long queuing time and a high delay rate amount to another form of social energy consumption. Consequently, how to schedule the components of Cloud computing in an efficient, energy-saving and balanced way is a critical factor influencing the direction of Cloud computing in society. However, the huge scale of Cloud computing, the complexity of scenarios, the unpredictability of user requests, the randomness of electronic components and the uncertain temperature of various components during operation pose challenges to efficient and effective scheduling of Cloud computing [6,12,15,18-22]. To give a more precise understanding of Cloud computing, we briefly review the development history of Cloud computing and its scheduling approaches.

Figure 1: Resource management and task allocation process based on central scheduler
There are abundant discussions about the application of machine learning in Cloud computing resource scheduling, such as work from Microsoft [40], the CLOUDS Laboratory of The University of Melbourne [22,41], and other research [6,42-45]. Deep reinforcement learning (DRL), a novel ML approach that combines the advantages of deep neural networks and reinforcement learning, has been used in recent years to solve scheduling problems in Cloud computing, and has been reported to hold strong advantages in many scenarios, especially complex ones [26,46-50]. Because the application of DRL to resource scheduling in Cloud computing began only recently, existing surveys [3, 5-7, 24, 27, 42-44, 51-59] provide detailed, comprehensive and valuable reviews of Cloud computing, yet do not specifically discuss the application of DRL in resource scheduling, although some of them discuss the application of machine learning in Cloud computing [6,42-44]. Researchers are still exploring how to apply RL, and especially DRL, to scheduling in Cloud computing [13, 17, 26, 46-48, 50, 60-63]. RL has likewise been applied to scheduling problems in other fields [64,65]. Aiming to provide a specific view of scheduling methods in Cloud computing and an integrated reference on DRL for resource scheduling, we focus on reviewing research that uses RL, especially DRL, to solve scheduling problems in Cloud computing, and finally identify challenges and future directions for adapting DRL to more realistic resource scheduling scenarios.
The main contributions of this paper can be summarized as follows: (1) We briefly review the development of Cloud computing and categorize scheduling methods and optimization objectives in Cloud computing. (2) We summarize models in the scheduling of Cloud computing for various objectives.

OVERVIEW OF CLOUD COMPUTING AND ITS SCHEDULING
2.1 Concept development of Cloud computing
The concept of the Clouds distributed system can be found in papers from the 1980s and 1990s [66-70]. In 1983, Allchin presented a nested action management algorithm, noting that constructing reliable programs for distributed processing systems is a difficult task [71]. This work described an architecture for the construction of reliable distributed operating systems. Building theoretically on Allchin's reliable algorithm, Dasgupta et al. provided a functional description of the Clouds distributed system in 1988 [66]. In [66], Dasgupta et al. mentioned that "Clouds is designed to run on a set of general purpose computers (uniprocessors or multiprocessors) that are connected via a medium-to-high speed local area network". In 1989, Bernabeu-Auban et al. described the architecture of Ra, a kernel for Clouds that can be used to support the development of an object-based distributed operating system [67]. In [67], the Ra virtual space was proposed to structure Clouds objects and distributed operating systems that use the object-thread model of computation. In 1991, Dasgupta et al. expounded the Clouds distributed operating system based on the object-thread model, regarding Clouds as both a paradigm and a general-purpose operating system for structuring distributed operating systems, and described the potential and implications of this paradigm [68]. Also in 1991, Raymond C. Chen et al. presented a kernel-level consistency control mechanism called Invocation-Based Consistency Control (IBCC) as part of the Clouds distributed operating system to support general-purpose persistent object-based distributed computing [69]. As a set of flexible mechanisms that control recovery and synchronization, IBCC had three default types of consistency: global, local and standard. In the same year, Pearson et al. presented the Clouds LISP distributed environments and proposed the CLIDE environment structure [72]. In 1992, Gunaseelan et al. designed a language, Distributed Eiffel, for programming multi-granular distributed objects on the Clouds operating system [70]. The design in [70] addressed a number of issues such as parameter passing, asynchronous invocations and result claiming, and concurrency control, which makes it possible to combine and control both data migration and computation migration effectively at the language level.
With the exponential development of the Internet, its users have an increasing demand for computing. More concepts of distributed computing have been proposed or developed, such as grid computing [73,74], peer-to-peer computing [75,76] and utility computing [77,78]. With the advent of the network information age, Web 2.0 [79,80], network information flow shows the characteristics of big data, high speed and multiple functions. The concept of Cloud computing was presented at the search engine conference in August 2006. Amazon, Google, IBM, Microsoft and Yahoo, the forerunners in providing Cloud computing services, contributed to demonstrating Cloud computing [1,81,82]. In 2008, Microsoft released the Windows Azure Platform [1,83]. By the time the international conference CloudCom 2009 [84] started, the three categories of services provided by Cloud computing, IaaS, PaaS and SaaS, had been recognized by researchers [85,86].
In reference [79], which gave a Berkeley view of Cloud computing in 2009, Armando Fox et al. held that four conditions, namely widespread broadband Internet access, the capacity of providers, a suitable billing model and the development of efficient virtualization techniques, would make Cloud computing available. Meanwhile, experts began to discuss the relationships and differences between Cloud computing and grid computing [73,87,88]. In [73], Foster et al. compared Cloud computing and grid computing in various aspects including business model, architecture, resource management, programming model, application model and security model. Foster et al. also gave a definition of Cloud computing: a large-scale distributed computing paradigm driven by economies of scale, in which a pool of abstracted, virtualized, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet [73]. Before this definition, there had been many definitions of Cloud computing, but little consensus on how to define the Cloud [2]. In Foster's definition, Cloud computing differs from traditional distributed computing paradigms in that it is massively scalable, can be encapsulated as an abstract entity that delivers different levels of services to customers, is driven by economies of scale and provides dynamic services [73]. In [87], Klems et al. held that Cloud computing is an emerging trend of provisioning scalable and reliable services over the Internet as computing utilities. They argued that the methods and business models introduced for grid computing in previous work on distributed computing do not consider all the economic drivers identified as relevant for Cloud computing, such as pushing for short time to market in the context of organizational inertia, or low entry barriers for start-up companies.
The definition "Cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the data centers that provide those services" was given in [1], where Armbrust et al. demonstrated the advantages of public Cloud compared with private data centers. According to Armbrust et al., public Cloud has characteristics that a conventional data center does not, or usually does not, possess: the appearance of infinite computing resources on demand, the elimination of an up-front commitment by Cloud users, the ability to pay for use of computing resources on a short-term basis as needed, economies of scale due to very large data centers, and simplified operation and increased utilization via resource virtualization [1]. In [88], Buyya et al. proposed one of the popular definitions of Cloud computing and demonstrated how it enabled the emergence of "computing as 5th utility". A high-level market-oriented Cloud architecture, consisting of users/brokers, an SLA resource allocator, VMs and physical machines, was presented in [88].
From 2010 to 2020, resource scheduling strategies for Cloud computing attracted considerable research attention. The mainstream strategies focus on objectives such as minimizing energy consumption, minimizing makespan, minimizing delay time, reducing response time, load balancing, increasing reliability, increasing the utilization of resources, maximizing the profit of providers, maximizing task completion ratio, minimizing Service Level Agreement (SLA) violations, maximizing throughput, and multiple objectives. Table 1 provides a summary of the objectives addressed in the surveyed papers. These objectives can be grouped into three aspects: reducing cost and increasing profit of Cloud providers, improving the quality of service (QoS) provisioned to users, and greening Cloud computing.

TERMINOLOGIES AND DIFFERENT OPTIMIZATION OBJECTIVES OF SCHEDULING PROBLEMS IN CLOUD COMPUTING
3.1 Terminologies of scheduling problems in Cloud computing
References [4,14,18,20,23,31,32,39,47,50,100,102,140-142,145] describe, to a certain extent, the symbols and terms related to Cloud computing and Edge computing. Surveying these references and combining various scenarios, this paper gives an integrated description of the notations and terminologies of scheduling in Cloud computing, as shown in Table 2. When the scene becomes complex, many parameters become time-varying or random. Therefore, this paper uses time-related parameters to represent the various elements. Then, with future modeling in mind, we add the dimension of parameters.

Modeling of resource scheduling problems in Cloud computing with different optimization objectives
In the surveyed literature, energy consumption minimization, makespan minimization, delay time (or delayed services) minimization, response time minimization, load balancing maximization, utilization maximization, cost minimization and multiple objectives are the main optimization objectives of most scheduling algorithms, as shown in Table 1. These objectives closely affect the performance of a Cloud computing system, including quality of service, total profit of Cloud computing providers and carbon emission. In general, energy consumption, time cost and load balancing are the major criteria for evaluating the performance of a scheduling algorithm in Cloud computing.

3.2.1 Energy consumption optimization problems. Energy consumption optimization problems can be summarized into two types: total energy consumption and utilization of energy (energy efficiency). The total energy consumption comprises the energy consumed by physical resources during stand-by time, consumed in processing tasks, consumed in hanging tasks, and dissipated in the form of heat exchange. In practice, it is not easy to measure the energy consumption of each part, especially when multiple tasks run on a physical resource at the same time. A more intuitive measurement index is electricity consumption. Another approach is to treat tasks that are simultaneously on the same physical resource as mutually coupled objects and to decouple them using statistics of the power consumption under various scenes.
In addition, the energy consumed by a task can be approximately calculated from the task's lifetime [29] or regarded as a given value [3]. Studying the measurement of energy consumption and the coupling relationship between multiple tasks will be one of the far-reaching prospects of Cloud computing. Since the power consumption of a specific task varies over time and across physical resources, we regard the power consumption of a task as a time-varying function in this paper, which can adapt to various forms of energy calculation in the future. For the same reason, the corresponding parameters of physical resources are also assumed to be time-varying functions.
The set of tasks on all resources at time $t$ follows the notation of Table 2.
Hence, from the perspective of resources, tasks and workflows, the problem of minimizing total energy consumption from time $t_1$ to time $t_2$ is modeled in Equation (1), and the problem of maximizing energy utilization over the same interval is modeled in Equation (2). The public constraint is given in Equation (3). The constraints of energy consumption optimization problems are the combination of the Public Constraint, Constraint1 and Constraint2, given in Equation (3), Equation (4) and Equation (5).
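As a minimal sketch of the two energy objectives above, the snippet below integrates a time-varying power function over the interval from t1 to t2 and forms an energy-utilization ratio. The function names, the trapezoidal integration and the definition of utilization as task energy divided by total resource energy are illustrative assumptions, not the paper's Equations (1) and (2).

```python
from typing import Callable, Sequence

PowerFn = Callable[[float], float]  # hypothetical: power draw (W) at time t


def total_energy(power: PowerFn, t1: float, t2: float, steps: int = 1000) -> float:
    """Approximate the energy used over [t1, t2] with the trapezoidal rule."""
    if t2 < t1:
        raise ValueError("t2 must not precede t1")
    h = (t2 - t1) / steps
    acc = 0.5 * (power(t1) + power(t2))
    for k in range(1, steps):
        acc += power(t1 + k * h)
    return acc * h


def energy_utilization(task_powers: Sequence[PowerFn],
                       resource_powers: Sequence[PowerFn],
                       t1: float, t2: float) -> float:
    """Assumed definition: energy spent on tasks / energy drawn by resources."""
    useful = sum(total_energy(p, t1, t2) for p in task_powers)
    consumed = sum(total_energy(p, t1, t2) for p in resource_powers)
    return useful / consumed if consumed > 0 else 0.0
```

A scheduler comparing two candidate placements could evaluate both quantities for each placement and keep the one with lower total energy or higher utilization, depending on the chosen objective.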
3.2.2 Time related optimization problem. Time indexes of scheduling in Cloud computing can be summarized as four types: makespan (execution time), delay time, response time and waiting time. In some cases, the capacity of physical resources and the size of tasks can be used to calculate the execution time of tasks [3]. In realistic scenes, however, the capacity of physical resources varies with load, temperature and working time. The processing speed of the same resource also differs for tasks with different properties. For example, a resource with better GPU performance can process a task with many matrix operations faster than a resource without it, and a resource with better CPU performance can process a task with numerous command transfers faster. The most practical approach is to classify tasks according to their specific properties in order to estimate their execution time on different types of resources. Given the complexity of practical situations, these quantities can also be treated as time-varying parameters.
The objectives of makespan, also known as total execution time [3], can be divided into four aspects: minimizing the maximum execution time of resources, minimizing the average makespan of workflows, minimizing the average execution time of tasks and maximizing the total utilization of occupied time. The problems of makespan can be formulated as follows.
Minimizing the maximum work time of resources, minimizing the average execution time of tasks and maximizing the total utilization of occupied time are formulated in Equation (6), Equation (7) and Equation (8), respectively.
The indexes to evaluate delayed tasks can be divided into three aspects: minimizing the average delay time of delayed tasks or workflows, minimizing the sum of delay time of all tasks or all workflows, and minimizing the number of delayed tasks or delayed workflows. Minimizing the average delay time of delayed tasks is formulated in Equation (9). The constraints of Equation (6) to Equation (9) are the combination of the Public Constraint, Constraint2 and Constraint3 (or Constraint4), given in Equation (3), Equation (5) and Equation (10).
Minimizing the sum of delay time of all tasks and minimizing the number of delayed tasks are formulated in Equation (11) and Equation (12), respectively. The objectives for delayed workflows are similar to those for delayed tasks. For the delay problems, the delay constraints can be removed, which simplifies the constraints to the combination of the Public Constraint and Constraint2 in Equation (3) and Equation (5).
Response time equals the sum of transmission time and queuing time. Minimizing the average response time of tasks is formulated in Equation (13). Another index to evaluate time performance is the utilization of time for all tasks; maximizing the percentage of processing time of all tasks (utilization of time for all tasks) is formulated in Equation (14). The constraints of Equation (13) and Equation (14) are the combination of the Public Constraint and Constraint2, where Constraint3 or Constraint4 is optional.
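The time indexes above can be illustrated on a small task log. The sketch below is not the paper's formulation: the `TaskRecord` fields, the metric names and the choice to measure time utilization as processing time divided by the time from submission to completion are assumptions made for the example. Response time is taken as start time minus submission time, which covers transmission plus queuing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRecord:
    submit: float    # time the user issued the request
    start: float     # time execution began (after transmission and queuing)
    finish: float    # time execution ended
    deadline: float  # latest acceptable finish time
    resource: int    # index of the resource that ran the task


def time_metrics(tasks: List[TaskRecord]) -> Dict[str, float]:
    if not tasks:
        raise ValueError("need at least one task")
    busy = defaultdict(float)  # accumulated work time per resource
    for t in tasks:
        busy[t.resource] += t.finish - t.start
    delays = [t.finish - t.deadline for t in tasks if t.finish > t.deadline]
    processing = sum(t.finish - t.start for t in tasks)
    sojourn = sum(t.finish - t.submit for t in tasks)
    n = len(tasks)
    return {
        "max_work_time": max(busy.values()),
        "avg_execution_time": processing / n,
        "avg_response_time": sum(t.start - t.submit for t in tasks) / n,
        "delayed_tasks": len(delays),
        "total_delay": sum(delays),
        "avg_delay_of_delayed": sum(delays) / len(delays) if delays else 0.0,
        "time_utilization": processing / sojourn if sojourn > 0 else 0.0,
    }
```

Each dictionary entry corresponds to one of the objectives listed in this subsection, so a scheduling policy can be scored by replaying its task log through `time_metrics`.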

3.2.3 Load balancing optimization problem.
Load balancing indexes can be divided into two types according to the loaded objects: load balancing of task volume and load balancing of task number. The degree of load balancing can also be divided into two types according to the time period: cumulative load balancing and real-time load balancing. The surveyed references use various functions to calculate the degree of load balancing, such as variance or standard deviation [17,117,121], average success rate [30], coefficient of variance (CV) [45], load value [60] and degree of imbalance [99]. We may treat the degree of load balancing as a function of a vector related to the load object and time. The load balancing optimization problem can then be written as the maximization of this function.
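A few of the indexes named above can be computed directly from a vector of per-resource loads. This is a hedged sketch: the function names are hypothetical, population (not sample) variance is assumed, and degree of imbalance is taken in its commonly used form (maximum load minus minimum load, divided by mean load), which may differ from the exact definition in [99].

```python
import statistics
from typing import Sequence


def load_variance(loads: Sequence[float]) -> float:
    return statistics.pvariance(loads)


def load_std(loads: Sequence[float]) -> float:
    return statistics.pstdev(loads)


def coefficient_of_variation(loads: Sequence[float]) -> float:
    mean = statistics.fmean(loads)
    return statistics.pstdev(loads) / mean if mean else 0.0


def degree_of_imbalance(loads: Sequence[float]) -> float:
    mean = statistics.fmean(loads)
    return (max(loads) - min(loads)) / mean if mean else 0.0


def balance_score(loads: Sequence[float]) -> float:
    """Negated CV, so that a larger score means a better balance."""
    return -coefficient_of_variation(loads)
```

Negating a dispersion measure is one simple way to obtain a quantity to maximize; any of the other indexes could be substituted in `balance_score`.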
Other indexes such as Simpson's index [151] and the Shannon-Wiener index [152] can also be used as measures of load balancing, in addition to variance, standard deviation, coefficient of variance and degree of imbalance. In realistic applications, load balancing may not be the final goal; it is usually a means to achieve other objectives such as maximizing utilization of resources, reducing energy consumption or prolonging the service life of devices. The index used to evaluate the degree of balancing should be determined by the final target.
3.2.4 Other optimization problems. Besides energy consumption optimization problems, time optimization problems and load balancing problems, other optimization problems include maximizing utilization of resources, minimizing CO2 emission, minimizing temperature of components [22] and minimizing tasks loss rate. The objective of minimizing CO2 emission can be modeled as minimizing the emission function over the interval from $t_1$ to $t_2$.
Maximizing the utilization of resources can be divided into energy utilization, seen in Equation (2), total utilization of occupied time, seen in Equation (8), and utilization of time for all tasks, seen in Equation (14).
When the queue or the response time is too long, users are likely to give up executing a task on a Cloud platform, which causes loss of tasks and users. In extreme cases where Cloud platforms are unable to provide new services, for example under severe overload, physical device failure, failure caused by network attacks or excessive energy consumption, Cloud computing providers have no choice but to suspend service. Therefore, tasks loss rate is also an urgent factor to study in building a stable Cloud computing platform. In turn, stable groups of Cloud environments and users will also reduce the difficulty of allocating tasks or scheduling resources in Cloud computing. Modeling the minimization of tasks loss rate is complex, as it involves users' psychology, users' characteristics and the randomness of task executions.

Summary
So far, this section has provided models for various common scheduling optimization problems in Cloud computing. The constraints of an optimization problem in Cloud computing can vary with the actual scenario. In some specific scenarios, weights can be assigned to constraints to handle complex scenes. For multi-objective problems, the objectives can be integrated based on their weights in realistic scenarios or can be evaluated by Pareto efficiency [23,95,120].
Approaches for scheduling in Cloud computing can be classified into six categories: Dynamic Programming (DP), Probability algorithm (Random), Heuristic method, Meta-Heuristic algorithm, Hybrid algorithms and Machine Learning. In this section, we introduce the approaches applied in the surveyed literature from two aspects, classic approaches and machine learning methods. In this paper, we regard Dynamic Programming, Randomization, Heuristic method, Meta-Heuristic algorithm and Hybrid algorithms as classic approaches.
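For the multi-objective case noted in the summary above, the two options (weighted integration and Pareto efficiency) can be sketched as follows. All objectives are assumed to be minimized, and the function names and the tuple representation of a candidate schedule's objective values are hypothetical.

```python
from typing import List, Sequence, Tuple


def weighted_objective(values: Sequence[float], weights: Sequence[float]) -> float:
    """Scalarize several (minimized) objectives with fixed weights."""
    if len(values) != len(weights):
        raise ValueError("one weight per objective is required")
    return sum(w * v for w, v in zip(weights, values))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

With candidates given as (energy, makespan) pairs, `pareto_front` returns the trade-off set, while `weighted_objective` selects a single schedule once the weights are fixed.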

Classic approaches for scheduling in Cloud computing
Among classic approaches, the most commonly used methods in the surveyed literature are Meta-Heuristic algorithms and Heuristic methods. In outline, Meta-Heuristics, the combination of Heuristics and Randomization [5], include Ant Colony Algorithm, Particle Swarm Optimization, Genetic algorithm, Firefly algorithm and other Meta-Heuristic algorithms. Some common Heuristic methods are Johnson's model, FF (first fit), BF (best fit), RR (Round-robin), FFD (first fit decreasing), BFD (best fit decreasing) and their variants. Table 3, Table 4 and Table 5 summarize the classic methods from the surveyed literature, as an intuitive reference for research on scheduling problems in Cloud computing.
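To make the heuristics named above concrete, here is a minimal sketch of FF, BF and FFD for placing tasks on identical machines. It treats each task as a single scalar demand and each machine as a bin of fixed capacity, which is a simplifying assumption; the function names are hypothetical and real Cloud schedulers handle several resource dimensions.

```python
from typing import List


def first_fit(sizes: List[float], capacity: float) -> List[int]:
    """Place each task on the first machine that still has room."""
    free: List[float] = []   # remaining capacity per opened machine
    placement: List[int] = []
    for size in sizes:
        if size > capacity:
            raise ValueError("task larger than a machine")
        for i, room in enumerate(free):
            if size <= room:
                free[i] -= size
                placement.append(i)
                break
        else:  # no opened machine fits: open a new one
            free.append(capacity - size)
            placement.append(len(free) - 1)
    return placement


def best_fit(sizes: List[float], capacity: float) -> List[int]:
    """Place each task on the fitting machine with the least room left."""
    free: List[float] = []
    placement: List[int] = []
    for size in sizes:
        if size > capacity:
            raise ValueError("task larger than a machine")
        candidates = [i for i, room in enumerate(free) if size <= room]
        if candidates:
            i = min(candidates, key=lambda j: free[j])
            free[i] -= size
        else:
            free.append(capacity - size)
            i = len(free) - 1
        placement.append(i)
    return placement


def first_fit_decreasing(sizes: List[float], capacity: float) -> List[int]:
    """Sort tasks by decreasing size, run first fit, report by original index."""
    order = sorted(range(len(sizes)), key=lambda i: -sizes[i])
    packed = first_fit([sizes[i] for i in order], capacity)
    placement = [0] * len(sizes)
    for pos, i in enumerate(order):
        placement[i] = packed[pos]
    return placement
```

Each function returns, for every task, the index of the machine it was assigned to; BFD follows from `best_fit` in the same way that FFD follows from `first_fit`.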

Machine learning for scheduling in Cloud computing
In practice, a Cloud system has several characteristics: a large scale and complexity that make it impossible to model accurately; timeliness of scheduling decisions, which demands high-speed scheduling algorithms; and randomness of tasks (or requests), including randomness in task numbers, arrival times and sizes. These characteristics make research on resource scheduling in Cloud computing challenging. It is hard for any single Meta-Heuristic or Heuristic algorithm to fully adapt to real dynamic Cloud computing systems or Edge-Cloud computing systems. A novel type of machine learning policy for resource scheduling in Cloud computing is the combination of deep neural networks (DNN) and reinforcement learning (RL), called deep reinforcement learning (DRL). Integrating the advantages of RL and DNN, DRL has the following benefits. Modeling ability: it can model complex systems and decision-making policies with a DNN. Adaptability for optimization objectives: a training process based on gradient descent makes it possible to search for the optimal solution for various objectives. Adaptability for environment: DRL can adjust its parameters to adapt to various environments. Possibilities for further growth: DRL can grow over time to process large-scale tasks. Adaptability for state space: DRL can process continuous or multi-dimensional states. Memory of experience: DRL can memorize experience through experience replay.
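The "memory of experience" benefit refers to experience replay. A minimal sketch of a replay buffer is given below; the class name, the transition layout (state, action, feedback, next state, done flag) and uniform sampling are assumptions for illustration, and published DRL schedulers may store and sample transitions differently.

```python
import random
from collections import deque
from typing import Any, List, Optional, Tuple

Transition = Tuple[Any, int, float, Any, bool]


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._buf: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state: Any, action: int, feedback: float,
             next_state: Any, done: bool) -> None:
        self._buf.append((state, action, feedback, next_state, done))

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a uniform random mini-batch without replacement."""
        k = min(batch_size, len(self._buf))
        return self._rng.sample(list(self._buf), k)

    def __len__(self) -> int:
        return len(self._buf)
```

Sampling past transitions at random, rather than learning only from the latest one, is what lets the agent reuse earlier scheduling decisions during training.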
Before providing a detailed introduction to RL-based algorithms in resource scheduling of Cloud computing, we summarize in Table 6 the machine learning methods used for scheduling in Cloud computing in the reviewed literature.

ANALYSIS OF RL AND DRL FRAMEWORKS IN SCHEDULING OF CLOUD COMPUTING
Machine learning is the discipline of teaching computers to predict outcomes or classify objects without explicit programming [42]. Machine learning is also an artificial intelligence discipline that studies how computers simulate or implement human learning behavior so that they can gain new knowledge and skills. Based on learning methods, machine learning can be divided into supervised learning, unsupervised learning and semi-supervised learning [42]. RL is classed as a form of unsupervised learning in [63]. Based on learning strategy, machine learning includes symbol learning, artificial neural network learning, statistical machine learning, bionic machine learning and so on. Deep learning, built on deep artificial neural networks, and reinforcement learning (RL) are two intersecting subsets of machine learning, and their intersection is deep reinforcement learning (DRL). DRL, which combines the perception ability of deep learning with the decision-making ability of RL, has been applied in robot control, computer vision, natural language processing and the game of Go [167,168]. DRL and Cloud computing, two emerging technologies, do not yet intersect sufficiently. The use of DRL to solve scheduling problems in Cloud computing emerged in recent years, and DRL has since shown superior performance in current research on resource scheduling of Cloud computing. We discuss the evolution of RL and DRL frameworks in this section, in order to provide a comprehensive view for the discussion of RL and DRL in resource scheduling of Cloud computing.

Analysis of RL framework
RL, based on the interaction model between agent and environment, instructs the agent to learn the optimal action strategy from the environment's feedback on the agent's actions. The RL model can update the action strategy according to both timely feedback and long-term feedback. The agent chooses its action on the basis of the action strategy. State space, action space, environment, feedback and strategy are the five basic elements of RL. Fig. 2 shows a fundamental structure of RL. In some of the reviewed literature related to RL [29,32,47,49,50,60], the feedback is described as reward. In this paper, we regard it as feedback, considering that both positive feedback and negative feedback will affect the learning process of the agent's action strategy [17,46,60].

Figure 2: A fundamental framework of RL (elements: Agent, Environment, Action, Feedback, State)
From another standpoint, the agent learns the strategy by trial and error, which requires the agent to balance exploration and exploitation. Greedy, random and Meta-Heuristic methods can be used to simulate the decision process between exploration and exploitation. The Markov decision process (MDP) is a common model for expressing the action selection process, and the Bellman equation, a dynamic programming equation, is a common function for updating the action strategy. Hence, a framework of RL containing action selection and strategy update is shown in Fig. 3. In complex scenarios, state and agent vary with time. In addition, decisions should be made in real time according to the state and agent, and the feedback from the environment affects the strategy directly. Hence, an agent-state based structure of RL can be shown as Fig. 4.
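Action selection and strategy update can be illustrated with tabular Q-learning, one common RL instance that is used here only as an example. The sketch combines epsilon-greedy selection (exploration versus exploitation) with a Bellman-style update. The toy environment is entirely hypothetical: a single state, two machines with fixed execution times, and feedback equal to the negative execution time of the chosen machine.

```python
import random
from typing import Dict, List


def epsilon_greedy(q_row: List[float], epsilon: float, rng: random.Random) -> int:
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(q: Dict[int, List[float]], state: int, action: int, feedback: float,
             next_state: int, alpha: float, gamma: float) -> None:
    """Move Q(state, action) toward feedback + gamma * max_a Q(next_state, a)."""
    target = feedback + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def train(steps: int = 5000, alpha: float = 0.1, gamma: float = 0.9,
          epsilon: float = 0.1, seed: int = 0) -> Dict[int, List[float]]:
    exec_time = [3.0, 1.0]  # hypothetical: machine 1 is the faster one
    q = {0: [0.0, 0.0]}     # one state, two actions (which machine to use)
    rng = random.Random(seed)
    for _ in range(steps):
        action = epsilon_greedy(q[0], epsilon, rng)
        q_update(q, 0, action, -exec_time[action], 0, alpha, gamma)
    return q
```

After training, the learned values favor the faster machine, so the greedy strategy assigns tasks to it; negative feedback is what drives the agent away from the slower one.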
In most realistic scenes, a system is not completely independent and will change under extrinsic stimulus. The environment in Fig. 4 is actually the internal environment of the system, which cannot express the overall interference from other systems. Hence, the RL system in Fig. 4 should be regarded as an autonomous system, because agent and environment evolve according to autonomous rules. A computer game, a game of Go or a language processing problem that covers a large enough amount of data may be regarded as an autonomous system, because its rules are quite stable and are not modified externally. The movement of vehicles and ships, and antagonistic sports like basketball and football, are usually affected by external incentives. Similarly, a Cloud computing system, with time-varying constructive demands, optimization objectives and users' requests, is not an autonomous system. Moreover, the update function of the internal environment for agent-state and the decision-making function are also time-varying; for example, the revenue ratio of Cloud computing varies across different periods of the same day. Regarding the decision-making process as an ensemble, a framework of RL with time-varying extrinsic stimulus can be shown as Fig. 5.

Analysis of DRL framework
In Fig. 5, the decision maker, which is a complex mapping from agent-state to action, is integrated as an ensemble. Some patterns of RL focus on the expression of this mapping relationship, such as Q-table, Advantage Function, Policy Gradient, and Hidden Markov Chain. However, as the sizes and dimensions of the state space and action space increase, the computational complexity and storage space of these patterns grow exponentially. Furthermore, when the state space is non-discrete, which is usually the case in resource scheduling problems, it is difficult to express the mapping relationship with the general methods of RL. Nonhomogeneous Markov process-based RL, one method to express a time-varying process with continuous time and continuous state space, requires solving differential equations with variable coefficients, which restricts its application to such problems. Hence, a deep artificial neural network, with its excellent performance in establishing mapping relationships, is a fairly good choice as the mapper of strategy between agent-state and action. Fig. 6 shows a framework using a deep neural network to express the decision process. Fig. 6 regards the decision process as a mapping process, which suggests reconstructing the structure of Fig. 5 according to the mapping relation. Thus, we can construct the framework in Fig. 5 as five mappers, namely the mapper of time varying, the mapper of stimulus evolution, the mapper of decision, the environment, and the mapper of feedback, as shown in Fig. 7.
The details of each mapper are as follows. The mapper of time varying refers to the relationship between agent-state and time under stimulus from the extrinsic or internal space, where time and update are the input and the stimulus force is the output. The mapper of stimulus evolution refers to the evolution of agent and state under stimulus, since agent and state usually vary with stimulus; the stimulus force is the input and the set of agent-state in real time is the output. The mapper of decision is responsible for calculating the next action according to the current state of the agent, where the set of agent-state at the current time is the input and the action at the next time is the output. The environment receives the action produced by the mapper of decision, evolves according to that action, and then outputs the environment's state at the next time. The output of the environment enters the mapper of feedback as its input on the one hand, and enters the mapper of time varying as an internal stimulus for the agent on the other hand. The mapper of feedback receives the environment's state, stores it in replay storage in preparation for long-term feedback, and simultaneously uses it as timely feedback. Long-term feedback and real-time feedback update the parameters of the decision maker. The framework in Fig. 7 is a generalized RL based on the integration of mappers. In some experiments, the mapper of time varying, the mapper of stimulus evolution and the environment are simulated by a program or observed in real scenes. The mapper of decision and the mapper of feedback can be constructed with neural networks. Since the mapper of feedback is aimed at updating the parameters of the decision maker, it can be designed as a neural network that calculates the loss function of the neural network in the mapper of decision. This is the case in Nature DQN and Double DQN [48,49,115,166], where the feedback network is called the target network and has the same structure as the decision network.
Inherently, each of the five mappers in Fig. 7 can be represented by a neural network. In some scenes, it is difficult to simulate or observe the realistic process of a complex system, and a neural network can be used as an end-to-end alternative. An extreme example is one in which all five mappers are expressed with deep neural networks. However, the existing research surveyed in this paper that uses DRL to resolve scheduling problems in Cloud computing replaces one or several of the five mappers with neural networks, and reports good performance in experiments.
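The data flow between the five mappers can be sketched as a composition of plain functions. This is a toy illustration under stated assumptions: every function body, the threshold parameter `theta`, and the numeric constants are hypothetical stand-ins, and the parameter update of the decision maker is left as a comment rather than implemented.

```python
from collections import deque

def mapper_time_varying(t, env_state):
    """Time and the last environment state in, stimulus force out."""
    external = 1.0 if 8 <= t % 24 < 20 else 0.2   # assumed daily load pattern
    return external + 0.1 * env_state              # environment acts as internal stimulus

def mapper_stimulus_evolution(stimulus):
    """Stimulus force in, agent-state at the current time out."""
    return 2.0 * stimulus

def mapper_decision(agent_state, theta):
    """Agent-state in, next action out (a one-parameter threshold policy)."""
    return 1 if agent_state > theta else 0

def environment(env_state, action):
    """Action in, next environment state out (a toy backlog that action 1 drains)."""
    return max(0.0, env_state + 1.0 - 2.0 * action)

class MapperFeedback:
    """Stores environment states for long-term feedback and emits timely feedback."""
    def __init__(self, capacity=100):
        self.replay = deque(maxlen=capacity)

    def __call__(self, env_state):
        self.replay.append(env_state)
        timely = -env_state
        long_term = -sum(self.replay) / len(self.replay)
        return timely, long_term

def run(steps=48, theta=1.0):
    env_state = 0.0
    feedback = MapperFeedback()
    history = []
    for t in range(steps):
        stimulus = mapper_time_varying(t, env_state)
        agent_state = mapper_stimulus_evolution(stimulus)
        action = mapper_decision(agent_state, theta)
        env_state = environment(env_state, action)
        timely, long_term = feedback(env_state)
        # A learning system would adjust theta here from both feedback signals.
        history.append((action, env_state, timely, long_term))
    return history
```

Replacing any one of these functions with a trained network, while keeping the others as simulators, corresponds to the partial replacement described in the text.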

Summary
The Cloud environment is a complex and random system with large-scale user requests and a complex physical environment, and these user requests and the extrinsic physical environment can be regarded as time-varying stimuli. In addition, the actual running processes of electronic components and software programs are hard to express through simulation. Crucially, the high dimensionality and continuity of the state space make the mapper of decision and the mapper of feedback difficult to model with conventional methods. In summary, the five mappers may need to be modeled with implicit functions, and a deep neural network is currently a practical method for fitting such implicit relations, given sufficient data and sufficient training time.
Moreover, the literature adjusts the structure of the neural network (CNN, LSTM, fully connected, etc.), adds strategies for initializing the neural network parameters, for training the neural network, and for predicting or simulating the internal and external environment, and combines DRL with Meta-Heuristic algorithms as appropriate.

A REVIEW OF DRL-BASED RESOURCE SCHEDULING IN CLOUD COMPUTING
We review the frameworks in the surveyed papers based on the location of the neural network. In RL, the central component is the mapper of decision, which can act as the scheduler of Cloud computing. In scheduling of Cloud computing using RL, the mapper of decision is usually represented by a deep neural network or a Q-table. In order to analyze in depth and summarize at the macro level the application of RL in resource scheduling of Cloud computing, we organize the information of the literature structurally. In addition to the information in the literature itself, we also reorganize the possible future work of some literature to provide another possibility. QEEC [29] is a Q-learning-based task scheduling framework for energy-efficient Cloud computing that uses a Q-value table to express the decision maker of action. DeepRM_Plus [46] uses a six-layer convolutional neural network (CNN) to describe the mapper of decision, motivated by the great success of DNNs in image processing. The data center cluster, waiting queues, and backlog queue compose the state of the environment, which is represented as an image.
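To make the Q-table form of the mapper of decision concrete, the following is a minimal tabular Q-learning sketch for assigning arriving tasks to heterogeneous servers. It is not the QEEC algorithm of [29]: the state (clipped queue lengths), the reward (negative estimated response time), the completion model, and all parameter values are assumptions chosen only to keep the example small.

```python
import random

def q_learning_scheduler(speeds=(1.0, 2.0, 4.0), episodes=500, alpha=0.1,
                         gamma=0.9, epsilon=0.1, max_load=5, seed=0):
    """Learn a Q-table mapping (queue lengths, chosen server) to a value."""
    rng = random.Random(seed)
    n_servers = len(speeds)
    q = {}  # Q-table: (state, action) -> value, missing entries read as 0.0

    def q_value(state, action):
        return q.get((state, action), 0.0)

    for _ in range(episodes):
        loads = [0] * n_servers
        for _ in range(20):  # 20 task arrivals per episode
            state = tuple(loads)
            if rng.random() < epsilon:  # epsilon-greedy exploration
                action = rng.randrange(n_servers)
            else:
                action = max(range(n_servers), key=lambda a: q_value(state, a))
            # Assumed reward: negative response time of the newly placed task.
            reward = -(loads[action] + 1) / speeds[action]
            loads[action] = min(max_load, loads[action] + 1)
            # Assumed completion model: faster servers finish queued work more often.
            for i in range(n_servers):
                if loads[i] > 0 and rng.random() < speeds[i] / 5.0:
                    loads[i] -= 1
            next_state = tuple(loads)
            best_next = max(q_value(next_state, a) for a in range(n_servers))
            target = reward + gamma * best_next
            q[(state, action)] = q_value(state, action) + alpha * (target - q_value(state, action))
    return q
```

With three servers and loads clipped at 5 there are only 216 states, so a table suffices; the exponential growth discussed earlier appears as soon as more servers or finer load levels are added, which is where a neural network replaces the table.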

Table 7: Category and objectives of RL-based algorithms

| Algorithm | Category | Objectives |
|---|---|---|
| QEEC [29] | QL | Response time, CPU utilization |
| MDP_DT [164] | QL | Cost |
| AGH+QL [114] | QL | Energy consumption, QoS |
| MRLCO [26] | MRL | Network traffic, service latency |
| DeepRM_Plus [46] | DRL | Turnaround time, cycling time |
| DERP [160] | DRL | Automatic elasticity |
| DPM [62] | DRL | Task latency, energy consumption |
| DQST [17] | DQL | Makespan, load balancing |
| MDRL [48] | DDQN | Energy consumption, response time |
| RLTS [49] | DDQN | Makespan |
| DRL-Cloud [115] | DDQN | Energy consumption, cost |
| ADRL [13] | DDQN | Resource utilization, response time |
| IDRQN [60] | DDQN | Energy consumption, service latency |
| MADRL [50] | DDQN | Computation delay, channel utilization |
| DDQN [166] | DDQN | Service latency, system rewards |
| DRL+FL [116] | DDQN | Energy consumption, load balancing |

AGH+QL [114], a novel revised Q-learning-based model, takes hash codes as input states to reduce the size of the state space. DQST [17], deep Q-learning task scheduling, uses a fully connected network to calculate the Q-values, which express the mapper of action decision. DERP [160] uses three different approaches of a DRL agent to handle the multi-dimensional state and to provide elastic VM resources. Modified DRL [48], RLTS [49], DRL-Cloud [115], and ADRL [13] also use the structure of an action-value Q-network (also called the evaluate Q-network [49]) and a target Q-network. Their similarities and differences are as follows. IDRQN [60] is a fine-grained task offloading scheme based on DRL with a Q-network and a target net, where an LSTM layer is used in the Q-network and a candidate network is used to update the target net. The DPM framework [62], based on RL, adopts a long short-term memory (LSTM) network to capture the prediction results and uses DRL to train a resource allocation strategy aimed at reducing energy consumption in the Cloud environment. The LSTM network used to predict the state of the environment can be regarded as a mapper of time varying in Fig. 7.
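The interplay of the evaluate Q-network and the target Q-network can be illustrated with the standard Double DQN target, in which the evaluate network selects the next action and the target network scores it. The sketch below works on plain lists of Q-values rather than real networks; the function names and the list-based interface are assumptions, and the surveyed papers may differ in details.

```python
def dqn_targets(q_target_next, rewards, dones, gamma=0.99):
    """Nature DQN target: bootstrap from the target network's own maximum."""
    return [r + (0.0 if done else gamma * max(q_tgt))
            for q_tgt, r, done in zip(q_target_next, rewards, dones)]

def double_dqn_targets(q_eval_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN target: evaluate network selects, target network evaluates."""
    targets = []
    for q_eval, q_tgt, r, done in zip(q_eval_next, q_target_next, rewards, dones):
        # Action selection uses the evaluate (online) network's Q-values.
        best_action = max(range(len(q_eval)), key=q_eval.__getitem__)
        # Action evaluation uses the target network's Q-value for that action.
        bootstrap = 0.0 if done else gamma * q_tgt[best_action]
        targets.append(r + bootstrap)
    return targets
```

For example, with evaluate values `[1.0, 2.0]`, target values `[0.5, 0.1]`, reward `1.0` and `gamma=0.5`, the Double DQN target is `1.0 + 0.5 * 0.1`, whereas the Nature DQN target is `1.0 + 0.5 * 0.5`; decoupling selection from evaluation is what reduces overestimation.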
DDQN [166], a dueling deep Q-network, contains a set of convolutional neural networks and a fully connected layer to achieve higher data processing efficiency, lower network cost, and better security of data interaction. MRLCO [26], a Meta Reinforcement Learning-based method, uses a seq2seq neural network to represent the policy. MADRL [50], a novel multi-agent DRL method, contains an actor network and a critic network to generate the Q-value. The actor network, a two-layer fully connected network, is a mapper from state to action, and the critic network, with two fully connected hidden layers and an output layer with one node, is a mapper from state and action to Q-value. DRL+FL [116], based on DDQN, uses federated learning (FL) to accelerate the training of DRL agents. MDP_DT [164], a novel full-model-based RL method for elastic resource management, employs adaptive state space partitioning. Table 7 summarizes multiple aspects of the RL-based algorithms, including category, system type and objectives; Table 8 summarizes the scenario and task/server nature; and Table 9 summarizes the experimental data and compared baselines.
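The actor and critic shapes described for MADRL [50] can be sketched as forward passes in plain Python. Only the layer counts follow the description above; the hidden width of 16, the ReLU activations, the softmax output of the actor, the random initialization, and the class names are assumptions made for illustration, and no training step is shown.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one weighted sum per output node."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    if activation == "relu":
        outputs = [max(0.0, o) for o in outputs]
    return outputs

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class Actor:
    """Two fully connected layers mapping a state to action probabilities."""
    def __init__(self, state_dim, n_actions, hidden=16, seed=0):
        rng = random.Random(seed)
        self.l1 = init_layer(state_dim, hidden, rng)
        self.l2 = init_layer(hidden, n_actions, rng)

    def __call__(self, state):
        h = dense(state, *self.l1, activation="relu")
        return softmax(dense(h, *self.l2))

class Critic:
    """Two hidden layers and a one-node output mapping (state, action) to a Q-value."""
    def __init__(self, state_dim, n_actions, hidden=16, seed=1):
        rng = random.Random(seed)
        self.l1 = init_layer(state_dim + n_actions, hidden, rng)
        self.l2 = init_layer(hidden, hidden, rng)
        self.out = init_layer(hidden, 1, rng)

    def __call__(self, state, action_probs):
        h = dense(list(state) + list(action_probs), *self.l1, activation="relu")
        h = dense(h, *self.l2, activation="relu")
        return dense(h, *self.out)[0]
```

A call such as `Critic(4, 3)([0.1, 0.2, 0.3, 0.4], Actor(4, 3)([0.1, 0.2, 0.3, 0.4]))` returns a single scalar, which matches the one-node output layer in the description.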
Table 8: Scenario and Task/Server Nature of RL-based algorithms

| Algorithm | Scenario | Task/Server Nature |
|---|---|---|
| QEEC [29] | Online scheduling | Heterogeneous servers |
| MDP_DT [164] | Dynamic scheduling | Dynamic tasks |
| AGH+QL [114] | Scheduling in C-RANs | Traffic demand |
| MRLCO [26] | Adaptive offloading | Multiple tasks |
| DeepRM_Plus [46] | Online scheduling | Independent tasks |
| DERP [160] | Dynamic scheduling | Dynamic tasks |
| DPM [62] | Online scheduling | Dynamic tasks |
| DQST [17] | Dynamic scheduling | Non-preemptive tasks |
| MDRL [48] | Dynamic scheduling | Heterogeneous servers |
| RLTS [49] | Dynamic scheduling | Heterogeneous servers |
| DRL-Cloud [115] | Resource provisioning | Dependent tasks |
| ADRL [13] | On-time scheduling | Dynamic tasks |
| IDRQN [60] | Task offloading | Heterogeneous servers |
| MADRL [50] | Multichannel access | Joint multichannel |
| DDQN [166] | Online scheduling | Delay-tolerant tasks |
| DRL+FL [116] | Dynamic scheduling | Dynamic tasks |

In the reviewed literature, strategies for queueing, accelerating training, partitioning the agent's state space, capturing the resource state, keeping rewards stable, etc. are proposed to optimize the performance of the algorithms; their details are listed in Table 10.
Combining the results of the literature research, the future work of the RL-based algorithms in the reviewed literature is discussed in Table 11.
Based on the review of RL-based resource scheduling in Cloud computing, the current situation of RL in resource scheduling of Cloud computing is summarized as follows.
DRL adapts well to continuous or high-dimensional state spaces, and to the scheduling scenarios and various optimization objectives of Cloud computing. The main scenario solved with RL in the reviewed literature, and the one closest to realistic scenes, is the dynamic online multi-resource scheduling problem in a Cloud or Edge-Cloud computing environment, which can contain dependent or independent tasks, workflows, and homogeneous or heterogeneous servers. In the reviewed literature, experimental results showed that DRL can achieve better performance than various commonly compared algorithms such as Randomization, FCFS, Round-robin, Greedy, Q-learning, MDP, QDT, FIFO, HEFT, FA, and SDR. These algorithms, together with conventional DQN, can be regarded as baselines for evaluating other algorithms in the future.
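Three of the baselines named above (Randomization, Round-robin and Greedy) are simple enough to sketch directly, together with a makespan metric for comparing them. The task model (a length divided by a server speed) and the function names are assumptions for illustration; the reviewed papers use their own workloads and metrics.

```python
import random

def makespan(assignment, task_lengths, speeds):
    """Completion time of the busiest server for a given task-to-server assignment."""
    finish = [0.0] * len(speeds)
    for length, server in zip(task_lengths, assignment):
        finish[server] += length / speeds[server]
    return max(finish)

def round_robin(task_lengths, speeds):
    """Cycle through servers in arrival order, ignoring load and speed."""
    return [i % len(speeds) for i in range(len(task_lengths))]

def random_assign(task_lengths, speeds, seed=0):
    """Pick a server uniformly at random for each task."""
    rng = random.Random(seed)
    return [rng.randrange(len(speeds)) for _ in task_lengths]

def greedy(task_lengths, speeds):
    """Place each task on the server that would finish it earliest."""
    finish = [0.0] * len(speeds)
    assignment = []
    for length in task_lengths:
        server = min(range(len(speeds)), key=lambda s: finish[s] + length / speeds[s])
        finish[server] += length / speeds[server]
        assignment.append(server)
    return assignment
```

For tasks of length `[4, 2, 6, 8]` on servers with speeds `[1, 2]`, round-robin yields a makespan of 10.0 and greedy yields 9.0, so even on a tiny instance the choice of baseline changes the number a DRL method is compared against.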

Table 10: Strategies and Advantages of RL-based algorithms

| Algorithm | Strategies and Advantages |
|---|---|
| QEEC [29] | M/M/S to reduce the average waiting time of tasks; dynamic task ordering strategy to promote the quality of Cloud services |
| MDP_DT [164] | Adaptively partitions the state space utilizing novel statistical criteria and strategies to perform accurate splits |
| AGH+QL [114] | Anchor graph hashing can accelerate training; hash codes can reduce the size of the state space |
| MRLCO [26] | Seq2seq neural network to represent the offloading policy; new training method combining the first-order approximation |
| DeepRM_Plus [46] | Imitation learning to accelerate convergence and CNN to capture the state of resources |
| DERP [160] | DERP does not demand space partitioning; DERP with three aspects manages to collect rewards |
| DPM [62] | Using an LSTM network to predict workload, which can eliminate the vanishing gradient problem |
| DQST [17] | Entropy weight method to produce a high-quality solution of bi-objective optimization |
| MDRL [48] | DRL can adapt to a scalable state space; fair resource allocation helps reduce the underlying practical problems |
| RLTS [49] | Utilization of DQN to describe the relationship between agent-state and action |
| DRL-Cloud [115] | Experience replay, target networks as well as exploration and exploitation can accelerate convergence |
| ADRL [13] | Using an anomaly detection model to identify performance problems and to increase awareness of the environment |
| IDRQN [60] | LSTM to estimate values; candidate networks to decouple action selection and action value evaluation |
| MADRL [50] | Combination of actor-critic and DQN can improve the performance of the algorithm |
| DDQN [166] | DDQN can keep rewards stable |
| DRL+FL [116] | Combination of DRL and FL can improve the performance in training |

Table 11: Future work of RL-based algorithms

| Algorithm | Future work |
|---|---|
| QEEC [29] | To investigate Meta-Heuristics to increase the performance; to establish various queueing models to satisfy realistic scenes |
| MDP_DT [164] | To combine the strategies of MDP_DT with DQN to solve complex scenarios |
| AGH+QL [114] | To combine anchor graph hashing with DRL |
| MRLCO [26] | To apply an adaptive client selection algorithm to automatically filter out stragglers |
| DeepRM_Plus [46] | To apply other policies such as Actor-Critic network and DDPG; to analyze the state recognition analytically |
| DERP [160] | To combine DERP with federated learning; to design an intercommunication framework of simple DRL, full DRL and DDRL |
| DPM [62] | To combine the LSTM predictor and CNN to reduce energy |
| DQST [17] | To establish a model of energy consumption or multiple objectives |
| MDRL [48] | To consider dependent tasks and workflows |
| DRL-Cloud [115] | To utilize it in dynamic scheduling and static scheduling |
| ADRL [13] | To apply a parameter initialization strategy and a combination of DRL and semi-supervised learning to accelerate training |
| IDRQN [60] | To apply transfer learning to heterogeneous MEC; to utilize federated learning to solve multi-objective problems |
| MADRL [50] | To use gated recurrent units of the network to predict channel conditions |
| DDQN [166] | To apply it to other issues such as energy efficiency |
| DRL+FL [116] | To utilize the combination of FL and DRL in other scenarios |

several aspects: adjusting the structure of the decision mapper to a DNN or Q-table; strategies to accelerate training of (deep) RL such as periodical update; partition strategies for the state space; federated learning to improve convergence and stability; strategies to perceive current states or to predict subsequent states of the agent in RL; policies to provide the loss function to train the main net in DRL.
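Table 10 credits DQST [17] with an entropy weight method for bi-objective optimization. The sketch below shows the textbook form of that method, which may differ from the exact variant used in [17]: each objective column is turned into proportions, its normalized Shannon entropy is computed, and objectives whose values vary more across alternatives receive larger weights. The function names and the assumption of strictly positive column sums are illustrative.

```python
import math

def entropy_weights(matrix):
    """Textbook entropy weights for a rows-by-objectives matrix of non-negative scores.

    Assumes at least two rows and a positive sum in every column.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        total = sum(column)
        proportions = [x / total for x in column]
        # Normalized Shannon entropy in [0, 1]; terms with p = 0 contribute nothing.
        entropy = -sum(p * math.log(p) for p in proportions if p > 0) / math.log(n_rows)
        divergences.append(1.0 - entropy)
    spread = sum(divergences)
    if spread <= 0:  # every objective is uninformative: fall back to equal weights
        return [1.0 / n_cols] * n_cols
    return [d / spread for d in divergences]

def weighted_score(row, weights):
    """Scalarize one alternative's objective values with the given weights."""
    return sum(w * x for w, x in zip(weights, row))
```

As a usage note, `entropy_weights([[1, 1], [1, 3]])` puts essentially all weight on the second objective, because the first objective takes the same value for both alternatives and therefore cannot discriminate between them.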

CHALLENGES AND FUTURE DIRECTIONS FOR USING DRL TO RESOLVE SCHEDULING PROBLEMS IN CLOUD COMPUTING
Although DRL-based scheduling algorithms have shown advantages in the reviewed literature, they have only been tested in laboratory simulation experiments where the number of tasks is orders of magnitude smaller than in a real Cloud environment. DRL, as a complex, non-analytic and time-consuming algorithm [167], faces inevitable challenges in addressing the scheduling problems of real large-scale Cloud computing systems. The main challenges of using DRL to solve scheduling problems in realistic Cloud computing scenarios are as follows: (1) DRL consumes large computing power and incurs high complexity during training and computation, especially for multi-cluster or large-scale systems. Its computational complexity needs to be optimized. (2) The scheduling results based on DRL are still unpredictable, so worst-case performance is hard to evaluate. (3) Real scheduling also depends on the prediction of dynamic tasks without preemptive and a priori knowledge. puting is the application of algorithms in real complex and random scenarios. (8) The exploration of more novel roles for DRL is also a practical research direction; for example, DRL can act as a system strategist to boost the process of selecting methods, since specific approaches suit specific scenarios. To illustrate this possible novel role, Fig. 9 presents a Deep Q-learning-based scheduler framework used to schedule various scheduling algorithms for different scenarios. It can be called a scheduler of scheduling algorithms, aiming to give full play to the strengths of all scheduling algorithms and considering that all the algorithms are part of anthroposophy. In this framework, the scheduling algorithms are regarded as resources that can be automatically selected, and DRL-based algorithms are not only components of the resource scheduling algorithm but also a strategy to guide the selection of a specific scheduling algorithm.
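The idea of a scheduler of scheduling algorithms can be sketched as a value table over (scenario, algorithm) pairs that is updated from observed performance. This is a deliberately simplified, one-step (bandit-style) stand-in for the Deep Q-learning framework of Fig. 9: it uses a table instead of a deep network, has no state transitions, and the scenario labels, algorithm names and the `evaluate` callback are hypothetical.

```python
import random

def train_selector(scenarios, algorithms, evaluate, episodes=5000, alpha=0.2,
                   epsilon=0.2, seed=0):
    """Learn which scheduling algorithm performs best in each scenario.

    evaluate(scenario, algorithm, rng) is an assumed callback returning a reward,
    for example the negative makespan measured in a simulator.
    """
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in scenarios for a in algorithms}
    for _ in range(episodes):
        scenario = rng.choice(scenarios)
        if rng.random() < epsilon:  # explore another scheduling algorithm
            algorithm = rng.choice(algorithms)
        else:                       # exploit the best-known algorithm
            algorithm = max(algorithms, key=lambda a: q[(scenario, a)])
        reward = evaluate(scenario, algorithm, rng)
        q[(scenario, algorithm)] += alpha * (reward - q[(scenario, algorithm)])
    return q

def select(q, scenario, algorithms):
    """Pick the scheduling algorithm with the highest learned value."""
    return max(algorithms, key=lambda a: q[(scenario, a)])
```

In this view each scheduling algorithm is an action, which matches the statement that scheduling algorithms are treated as resources to be selected automatically; a full version would replace the table with a Q-network over scenario features.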
of DRL to resolve scheduling problems in large-scale Cloud computing? (d) How to construct the deducible optimization theory? (e) How to predict the running state of Cloud systems? (f) How to build a flexible scheduling system to cope with time-varying objectives? (g) How to capture the agent-state of DRL correctly? (h) How to construct the communication between multiple DRL agents in federated learning? (i) How to improve other categories of methods to address pervasive scheduling problems, not only in Cloud computing but also in other dynamic distributed systems?

SUMMARY AND CONCLUSIONS
Resource scheduling in Cloud computing is a crucial and emerging research area. In this paper, we reviewed the development of Cloud computing and of resource scheduling in Cloud computing. We described the terminology of the resource scheduling problem in Cloud computing to reproduce this problem as it appears in the real world. According to the optimization objectives, the scheduling problem is divided into the energy consumption optimization problem, the time optimization problem, the load balancing optimization problem, and other problems. On the basis of the reviewed literature, we classified the existing methods used to solve resource scheduling problems in Cloud computing, which provides a macro perspective for the selection of scheduling methods.
Additionally, DRL-based resource scheduling of Cloud computing successfully integrates the areas of Cloud computing and DRL. According to the reviewed literature, DRL is one of the effective methods for solving dynamic resource scheduling in large-scale Cloud computing. Therefore, we collated and studied the architectures of RL and DRL and, based on the mathematical notion of a mapping process, gradually analyzed and constructed various architectures of RL and DRL, providing a unified architectural view for subsequent RL or DRL work on resource scheduling of Cloud computing. Finally, based on the framework of the DRL model and the perspective of mappers, we reviewed the information of the given literature on resource scheduling of Cloud computing using RL or DRL. We also discussed the current status and future directions of RL, especially DRL, in resource scheduling of Cloud computing.

(3) We discuss the application of DRL in scheduling of Cloud computing compared with non-DRL methods by reviewing applications of DRL in Cloud computing. (4) We identify and propose challenges and future directions of DRL in scheduling problems of Cloud computing and present a specific idea for using DRL to solve resource scheduling. The rest of the paper is organized as follows: Section 2 briefly reviews the development history of Cloud computing and its scheduling. Section 3 discusses the terminology and establishes the mathematical models for various objectives of scheduling problems in Cloud computing. Following the classification into classic methods and machine learning methods, Section 4 sorts out the existing approaches used in scheduling of Cloud computing. Section 5 presents the structural analysis of RL and DRL applied in the scheduling of Cloud computing. Section 6 reviews the literature that uses RL, especially DRL, to address resource scheduling of Cloud computing. Section 7 discusses future directions of DRL applied in scheduling of Cloud computing. Finally, Section 8 summarizes and concludes the paper.

Figure 3: A framework of RL with action selection and strategy update

Figure 4: A complex Framework of RL based on varying agent-state

Figure 5: A framework of RL with extrinsic stimulus

Figure 6: A framework of RL with neural network-based decision maker

Figure 7: A framework of RL with segment of system

Figure 8: A framework of duel deep Q network of DRL

(4) The gradient descent algorithm used in DRL and the Bellman equation used in QL have inherent restrictions that lead to local rather than global optimization. (5) The unexplainability of the training process challenges theoretical derivation based on mathematical techniques. Modeling and theoretical derivation for high-dimensional continuous state spaces demand further development of mathematics. A benchmark should possess the following properties. (a) It should be easy to reproduce and calculate; (b) It should contain data from multiple scenarios and be as close as possible to real scenarios; (c) It can be applied to the experimental verification of a variety of optimization objectives; (d) It should have a dynamic scale rather than a single scale, which avoids performance gains that come merely from adjusting parameters; (e) If there is a random experiment, a benchmark should have enough sampling times; (f) It should have certain control variables to verify the advantage of a local strategy; (g) It should contain enough extreme scenarios, especially on some parameter boundaries; (h) It can test the algorithm running on a variety of devices and components. (7) Another emerging trend of resource scheduling in Cloud com-

Figure 9: Two phases Q learning-based Scheduler used to schedule various scheduling algorithms

Table 1: A summary of objectives discussed in reviewed papers

Table 3: A summary of Meta-Heuristic approaches

Table 4: A summary of Heuristic algorithms

Table 5: A summary of other classic approaches

Table 6: Machine learning methods used in scheduling of Cloud

Table 10: Strategies and Advantages of RL-based algorithms

Table 11: Future work of RL-based algorithms