Abstract
Efficient job scheduling on data centers under heterogeneous complexity is crucial but challenging, since it involves the allocation of multi-dimensional resources over time and space. To adapt to the complex computing environment in data centers, we propose an innovative Advantage Actor-Critic (A2C) deep reinforcement learning based approach called A2cScheduler for job scheduling. A2cScheduler consists of two agents: one, dubbed the actor, is responsible for learning the scheduling policy automatically, and the other, the critic, reduces the estimation error. Unlike previous policy gradient approaches, A2cScheduler is designed to reduce the gradient estimation variance and to update parameters efficiently. We show that A2cScheduler can achieve competitive scheduling performance using both simulated workloads and real data collected from an academic data center.
1 Introduction
Job scheduling is a critical and challenging task for computer systems since it involves a complex allocation of limited resources such as CPU/GPU, memory and I/O among numerous jobs. It is one of the major tasks of the scheduler in a computer system's Resource Management System (RMS), especially in high-performance computing (HPC) and cloud computing systems, where inefficient job scheduling may result in a significant waste of valuable computing resources. Data centers, including HPC systems and cloud computing systems, have become progressively more complex in their architecture [15], configuration (e.g., special visualization nodes in a cluster) [6] and the size of the workloads received [3], all of which increase the job scheduling complexity sharply.
The undoubted importance of job scheduling has fueled interest in scheduling algorithms for data centers. At present, the fundamental scheduling methodologies [18], such as FCFS (first-come-first-serve), backfilling, and priority queues that are commonly deployed in data centers are extremely hard and time-consuming to configure, severely compromising system performance, flexibility and usability. To address this problem, several researchers have proposed data-driven machine learning methods capable of automatically learning scheduling policies, thus reducing human interference to a minimum. Specifically, a series of policy based deep reinforcement learning approaches have been proposed to manage CPU and memory for incoming jobs [10], schedule time-critical workloads [8], handle jobs with dependencies [9], and schedule data centers with hundreds of nodes [2].
Despite the extensive research into job scheduling, however, the increasing heterogeneity of the data being handled remains a challenge. These difficulties arise from multiple issues. First, scheduling methods based on policy gradient DRL suffer from high variance, which can lead to low accuracy when computing the gradient. Second, previous work has relied on the Monte Carlo (MC) method to update the parameters, which involves massive calculation, especially when there are large numbers of jobs in the trajectory.
To solve the above-mentioned challenges, we propose a policy-value based deep reinforcement learning scheduling method called A2cScheduler, which can satisfy the heterogeneous requirements of diverse users, improve the space exploration efficiency, and reduce the variance of the policy. A2cScheduler consists of two agents named the actor and the critic: the actor is responsible for learning the scheduling policy and the critic reduces the estimation error. The approximate value function of the critic is incorporated as a baseline for the actor, thus reducing the estimation variance considerably [14]. A2cScheduler updates parameters via the multi-step Temporal-difference (TD) method, which speeds up the training process markedly compared to the conventional MC method due to the way the TD method updates parameters. The main contributions are summarized as below:

1.
This represents the first time that A2C deep reinforcement learning has been successfully applied to data center resource management, to the best of the authors' knowledge.

2.
A2cScheduler updates parameters via the multi-step Temporal-difference (TD) method, which speeds up the training process compared to the MC method due to the way the TD method updates parameters. This is critical for real-world data center scheduling applications, since jobs arrive in real time and low latency is undeniably important.

3.
We tested the proposed approach on both real-world and simulated datasets, and the results demonstrate that our proposed model outperforms many existing widely used methods.
2 Related Work
Job Scheduling with Deep Reinforcement Learning. Recently, researchers have tried to apply deep reinforcement learning to cluster resource management. A resource manager, DeepRM, was proposed in [10] to manage CPU and memory for incoming jobs. The results show that policy based deep reinforcement learning outperforms conventional job scheduling algorithms such as Shortest Job First and Tetris [4]. [8] improves the exploration efficiency by adding baseline guided actions for time-critical workload job scheduling. [17] discussed a heuristic based method to coordinate disaster response. Mao et al. proposed Decima in [9], which can handle jobs with dependencies by utilizing a graph embedding technique. [2] proved that policy gradient based deep reinforcement learning can be implemented to schedule data centers with hundreds of nodes.
Actor-Critic Reinforcement Learning. The actor-critic algorithm is a popular algorithm in the reinforcement learning framework [5], which falls into three categories: actor-only, critic-only and actor-critic methods [7]. Actor-critic methods combine the advantages of actor-only and critic-only methods, and usually have good convergence properties, in contrast to critic-only methods [5]. At the core of several recent state-of-the-art deep RL algorithms is the advantage actor-critic (A2C) algorithm [11]. In addition to learning a policy (actor) \(\pi(a \mid s; \theta)\), A2C learns a parameterized critic: an estimate of the value function \(v_{\pi}(s)\), which it then uses both to estimate the remaining return after k steps and as a control variate (i.e., baseline) that reduces the variance of the return estimates [13].
3 Method and Problem Formulation
In this section, we first review the framework of A2C deep reinforcement learning, and then explain how the proposed A2C based A2cScheduler works for job scheduling on data centers. The rest of this section covers the essential details of model training.
3.1 A2C in Job Scheduling
The Advantage Actor-Critic (A2C), which combines a policy based method and a value based method, can overcome the high variance problem of pure policy gradient approaches. The A2C algorithm is composed of a policy \(\pi\left(a_{t} \mid s_{t}; \theta\right)\) and a value function \(V\left(s_{t}; w\right)\), where the policy is generated by the actor network and the value is estimated by the critic network. The proposed A2cScheduler framework is shown in Fig. 1; it consists of an actor network, a critic network and the cluster environment. The cluster environment includes a global queue, a backlog and the simulated machines. The queue is the place holding the waiting jobs. The backlog is an extension of the queue for when there is not enough space in the queue for waiting jobs. Only jobs in the queue are allocated in each state.
The Setting of A2C

Actor: The policy \(\pi\) is an actor which generates a probability for each possible action; \(\pi\) is a mapping from state \(s_t\) to action \(a_t\). The actor chooses a job from the queue based on the action probabilities generated by the policy \(\pi\). For instance, given the action probabilities \(P=\{p_1,\dots, p_{N}\}\) for N actions, \(p_i\) denotes the probability that action \(a_i\) will be selected. If the action is chosen according to the maximum probability (\(action = \mathop{\mathrm{arg\,max}}_{i \in \{1,\dots,N\}} p_i\)), the actor acts greedily, which limits the exploration of the agent. Exploration is allowed in this research. The policy is estimated by a neural network \(\pi(a \mid s, \theta)\), where a is an action, s is the state of the system and \(\theta\) is the weights of the policy network.
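The distinction between greedy selection and sampling from the policy's action distribution can be sketched with a short, illustrative snippet (the function name and probability values are hypothetical, not taken from the paper's implementation):

```python
import random

def select_action(probs, greedy=False):
    """Pick a queue slot index from the policy's action distribution.

    probs: list of N probabilities (one per action), summing to 1.
    greedy=True reproduces the arg-max behaviour discussed above,
    which limits exploration; sampling keeps exploration alive.
    """
    if greedy:
        return max(range(len(probs)), key=lambda i: probs[i])
    # inverse-CDF sampling over the discrete distribution
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# greedy selection always picks the most likely action
print(select_action([0.1, 0.6, 0.3], greedy=True))  # → 1
```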

Critic: A state-value function v(s) used to evaluate the performance of the actor. It is estimated by a neural network \(\hat{v}(s, \mathbf{w})\) in this research, where s is the state and \(\mathbf{w}\) is the weights of the value network.

State \(s_t \in S\): A state \(s_t\) is defined as the resource allocation status of the data center, including the status of the cluster and the status of the queue at time t. The state set S is finite. Figure 2 shows an example of the state at one time step. The state includes three parts: the status of the allocated and available resources in the cluster, the resources requested by jobs in the queue, and the status of the jobs waiting in the backlog. The scheduler only schedules jobs in the queue.

Action \(a_t \in A\): An action \(a_t\) denotes the allocation strategy for jobs waiting in the queue at time t, where N is the number of slots for waiting jobs in the queue. The action space A of an actor specifies all the possible allocations of jobs in the queue for the next iteration, which gives a set of \(N+1\) discrete actions represented by \(\{\emptyset, 1, 2, \dots, N\}\), where \(a_t = i\) (\(\forall i \in \{1,\dots, N\}\)) means the allocation of the \(i^{th}\) job in the queue and \(a_t=\emptyset\) denotes a void action where no job is allocated.

Environment: The simulated data center contains resources such as CPUs, RAM and I/O. It also includes a resource management queue system in which jobs wait to be allocated.

Discount Factor \(\gamma\): The discount factor \(\gamma\) lies between 0 and 1, and is used to quantify the difference in importance between immediate rewards and future rewards. The smaller \(\gamma\) is, the less important future rewards are.

Transition function \(P: S \times A \rightarrow [0,1]\): The transition function describes the probabilities of moving from the current state to the next state. The state transition probability \(p(s_{t+1} \mid s_t, a_t)\) represents the probability of transiting to \(s_{t+1} \in S\) given that action \(a_t \in A\) is taken in the current state \(s_t \in S\).

Reward function \(r: S \times A \rightarrow (-\infty, +\infty)\): A reward in the data center scheduling problem is defined as the feedback from the environment when the actor takes an action in a state. The actor attempts to maximize its expected discounted return: \({R_t}=E({r_t^i}+{\gamma}{r_{t+1}^i}+\dots)=E\left({\sum\limits_{k=0}^{\infty}{\gamma}^k}{r_{t+k}^i}\right)=E({r_t^i}+{\gamma}{R_{t+1}})\). The agent reward at time t is defined as \(r_t=-\frac{1}{T_j}\), where \(T_j\) is the runtime of job j.
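The discounted return defined above can be computed for a finite trajectory with the backward recursion \(R_t = r_t + \gamma R_{t+1}\). A minimal sketch in Python (the function name and sample values are illustrative, not from the paper's implementation):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for every step t of a
    finite trajectory, using the recursion R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# with gamma = 0.5: R_2 = 1, R_1 = 1 + 0.5*1 = 1.5, R_0 = 1 + 0.5*1.5 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # → [1.75, 1.5, 1.0]
```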
The goal of data center job scheduling is to find the optimal policy \(\pi^*\) (a sequence of actions for agents) that maximizes the total reward. The action-value function \({Q^\pi}(s,a)\) is introduced to evaluate the performance of different policies. \({Q^\pi}(s,a)\) stands for the expected total discounted reward from the current state s onwards under the policy \(\pi\), which is equal to:

\(Q^{\pi}(s, a) = E\left[r(s, a) + \gamma\, Q^{\pi}(s', a')\right]\)  (1)

where \(s'\) is the next state, and \(a'\) is the action for the next time step.
Function approximation is a way to generalize when the state and/or action spaces are large or continuous. Several reinforcement learning algorithms have been proposed to estimate the value of an action in various contexts, such as Q-learning [16] and SARSA [12]. Among them, the model-free Q-learning algorithm stands out for its simplicity [1]. In Q-learning, the algorithm uses a Q-function to calculate the total reward, defined as \(Q: S \times A \rightarrow R\). Q-learning iteratively evaluates the optimal Q-value function using backups:

\(Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]\)  (2)

where \(\alpha \in [0,1)\) is the learning rate and the term in the brackets is the temporal-difference (TD) error. Convergence to \(Q^{\pi^*}\) is guaranteed in the tabular case provided there is sufficient state/action space exploration.
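The tabular backup described above translates directly into code. A minimal sketch, assuming a dictionary-based Q-table and illustrative states and actions (none of these names come from the paper):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    The bracketed term is the TD error; unseen entries default to 0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error

Q = {}
td = q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
# empty table: TD error = 1 + 0.9*0 - 0 = 1, so Q[(0,1)] moves by alpha*1
print(Q[(0, 1)])  # → 0.1
```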
The Loss Function for Critic. The loss function of the critic is used to update the critic network parameters:

\(L(\mathbf{w}) = \left(r + \gamma\, \hat{v}(s', \mathbf{w}) - \hat{v}(s, \mathbf{w})\right)^2\)  (3)

where \(s'\) is the state encountered after state s. The critic updates the parameters of the value network by minimizing the critic loss in Eq. 3.
Advantage Actor-Critic. The critic updates the value function parameters, and the actor updates the policy parameters in the direction suggested by the critic. A2C updates both the policy and value-function networks with multi-step returns, as described in [11]. The critic is updated by minimizing the loss function of Eq. 3. The actor network is updated by minimizing the actor loss function:

\(L(\theta_i) = -\log \pi(a \mid s; \theta_i)\left(r + \gamma\, \hat{v}(s', w_i) - \hat{v}(s, w_i)\right)\)  (4)

where \(\theta_i\) is the parameters of the actor neural network and \(w_i\) is the parameters of the critic neural network. Note that the parameters \(\theta_i\) of the policy and \(w_i\) of the value function are distinct for generality. Algorithm 1 presents the calculation and update of parameters per episode.
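Putting the two losses together, the per-step A2C update can be sketched with plain scalars. This illustrative snippet assumes a one-step TD target (the paper uses multi-step returns, which generalize this) and hypothetical input values:

```python
def a2c_losses(log_prob, value_s, value_s_next, reward, gamma=0.99, done=False):
    """Per-step A2C losses under a one-step TD target.

    target    = r + gamma * V(s')   (bootstrapped critic target)
    advantage = target - V(s)       (baseline-subtracted return estimate)
    critic loss: squared TD error; actor loss: -log pi(a|s) * advantage,
    where the advantage is treated as a constant for the actor update.
    """
    target = reward + (0.0 if done else gamma * value_s_next)
    advantage = target - value_s
    critic_loss = advantage ** 2
    actor_loss = -log_prob * advantage
    return actor_loss, critic_loss

actor_l, critic_l = a2c_losses(log_prob=-0.5, value_s=1.0,
                               value_s_next=2.0, reward=1.0, gamma=0.5)
# target = 1 + 0.5*2 = 2, advantage = 2 - 1 = 1
print(actor_l, critic_l)  # → 0.5 1.0
```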
3.2 Training Algorithm
The A2C consists of an actor and a critic, and we implement both of them using deep convolutional neural networks. The actor network takes the aforementioned tensor representation of resource requests and machine status as input, and outputs the probability distribution over all possible actions, representing the jobs to be scheduled. The critic network takes as input the combination of the action and the state of the system, and outputs a single value indicating the evaluation of the actor's action.
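The actor's final layer converts raw network scores into the action probability distribution via a softmax. The following standalone sketch (not the paper's TensorFlow code; values are illustrative) shows that conversion:

```python
import math

def softmax(logits):
    """Convert the actor network's raw output scores into the action
    probability distribution over the N+1 scheduling actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))  # → 1.0
```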
4 Experiments
4.1 Experiment Setup
The experiments are executed on a desktop computer with two RTX 2080 GPUs and one i7-9700K 8-core CPU. A2cScheduler is implemented using the TensorFlow framework. Simulated jobs arrive online in a Bernoulli process. A piece of job trace from a real data center is also tested. CPU and memory are the two kinds of resources considered in this research.
The training process begins with an initial state of the data center. At each time step, a state is passed into the policy network \(\pi\) and an action is generated under policy \(\pi\): either a void action is taken or a job is chosen from the global queue and put into the cluster for execution. Then a new state is generated and some reward is collected. The states, actions, policies and rewards are collected as trajectories. Meanwhile, the state is also passed into the value network to estimate the value, which is used to evaluate the performance of the action. The actor in A2cScheduler learns to produce resource allocation strategies from experience after many epochs.
4.2 Evaluation Metrics
Reinforcement learning algorithms, including A2C, have mostly been evaluated by convergence speed. However, this metric is not very informative in domain-specific applications such as scheduling. Therefore, we present several evaluation metrics that are helpful for assessing the performance of the proposed model.
Given a set of jobs \(J=\{j_1, \dots, j_N\}\), where the \(i^{th}\) job is associated with arrival time \(t_i^{a}\), finish time \(t_i^{f}\), and execution time \(t_i^{e}\). Average Job Slowdown. The slowdown of the \(i^{th}\) job is defined as \(s_i = \frac{t_i^{f}-t_i^{a}}{t_i^{e}}=\frac{c_i}{t_i}\), where \(c_i=t_i^{f}-t_i^{a}\) is the completion time of the job and \(t_i\) is the duration of the job. The average job slowdown is defined as \(s_{avg} =\frac{1}{N}\sum\limits_{i=1}^{N} \frac{t_i^{f}-t_i^{a}}{t_i^{e}}= \frac{1}{N}\sum\limits_{i=1}^{N} \frac{c_i}{t_i}\). The slowdown metric is important because it helps to evaluate the normalized waiting time of a system.
Average Job Waiting Time. For the \(i^{th}\) job, the waiting time \(t_{wi}\) is the time between arrival and the start of execution, which is formally defined as \(t_{wi}=t_i^{s}-t_i^{a}\).
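Both metrics can be computed directly from job timestamps. An illustrative sketch, assuming each job is given as an (arrival, start, finish, execution-time) tuple (a representation chosen here for illustration, not specified by the paper):

```python
def slowdown_and_wait(jobs):
    """Compute average slowdown and average waiting time.

    Each job is (arrival, start, finish, exec_time); slowdown is
    (finish - arrival) / exec_time and waiting time is start - arrival,
    matching the definitions above."""
    n = len(jobs)
    avg_slow = sum((f - a) / e for a, s, f, e in jobs) / n
    avg_wait = sum(s - a for a, s, f, e in jobs) / n
    return avg_slow, avg_wait

# one job starts immediately, the other waits 2 time units
jobs = [(0, 0, 4, 4), (0, 2, 6, 4)]
print(slowdown_and_wait(jobs))  # → (1.25, 1.0)
```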
4.3 A2cScheduler with CNN
We simulated a data center cluster containing N nodes with two resources: CPU and memory. We trained the A2cScheduler with different neural networks, including a fully connected layer and convolutional neural networks (CNNs). To find the best performing architecture, we explored different CNN architectures and compared whether they converge and how fast they converge under different settings. As shown in Table 3, a fully connected (FC) layer with a flattening layer in front did not converge. This is because the state of the environment is a matrix carrying location information, some of which is lost in the flattening layer when the state is processed. To keep the location information, we utilized CNN layers (a 16-filter 3×3 CNN layer and a 32-filter 3×3 CNN layer), which showed better results. We then explored a CNN followed by max-pooling and a CNN followed by a flattening layer. Results show that both of them converge, but the CNN with max-pooling gets poorer results, because some of the state information is also lost when it passes through the max-pooling layer. Based on these results, we chose the CNN followed by a flattening layer, as it converges quickly and gives the best performance.
4.4 Baselines
The performance of the proposed method is compared with some mainstream baselines: Shortest Job First (SJF), Tetris [4], and a random policy. SJF sorts jobs according to their execution time and schedules the jobs with the shortest execution time first; Tetris schedules jobs by a combined score of preferences for short jobs and resource packing; the random policy schedules jobs randomly. All of these baselines work in a greedy way, allocating as many jobs as the resources allow, and they share the same resource constraints and take the same input as the proposed model.
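The SJF baseline can be sketched as a greedy pass over the sorted queue. This simplified single-resource version (job tuples, names and slot counts are hypothetical, not from the paper's setup) illustrates the idea:

```python
def sjf_schedule(queue, free_slots):
    """Greedy Shortest-Job-First: sort waiting jobs by execution time and
    admit every job whose resource demand still fits in the free slots,
    a simplified single-resource version of the SJF baseline."""
    admitted, remaining = [], free_slots
    for job_id, exec_time, demand in sorted(queue, key=lambda j: j[1]):
        if demand <= remaining:
            admitted.append(job_id)
            remaining -= demand
    return admitted

queue = [("a", 10, 2), ("b", 3, 2), ("c", 5, 3)]
# sorted by exec time: b (3), c (5), a (10); with 4 free slots,
# b is admitted, c no longer fits, and a fills the remaining slots
print(sjf_schedule(queue, free_slots=4))  # → ['b', 'a']
```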
4.5 Performance Comparison
Performance on Synthetic Dataset. In our experiment, the A2cScheduler utilized the A2C reinforcement learning method. It is worth mentioning that the model includes the option to run multiple episodes, which allows us to measure the average performance achieved and the capacity to learn for each scheduling policy. Algorithm 1 presents the calculation and update of parameters per episode. Figure 3 shows experimental results with a synthetic job distribution as input.
Figure 3(a) and Figure 3(b) present the rewards and averaged slowdown when the new job rate is 0.7. Cumulative rewards and averaged slowdown converge after around 500 episodes. A2cScheduler has a lower averaged slowdown than random, Tetris and SJF after 500 episodes. Figure 3(c) and Figure 3(d) show the average completion time and average waiting time of the A2cScheduler algorithm versus the baselines. As we can see, A2cScheduler performs best among all the baselines.
Tables 1 and 2 present the steady state simulation results at different job rates. We can see that the A2cScheduler algorithm achieves the best, or close to the best, performance regarding slowdown, average completion time and average waiting time at job rates ranging from 0.6 to 0.9.
Performance on Real-world Dataset. We ran experiments with a piece of job trace from an academic data center. The results are shown in Fig. 4. The job traces were preprocessed before being used to train the A2cScheduler. There was some fluctuation during the first 500 episodes in Fig. 4(a), after which training started to converge. Figure 4(b) shows that the average slowdown was better than all the baselines and close to the optimal value of 1, which means the average waiting time was almost 0, as shown in Fig. 4(d). This happens because there were only 60 jobs in this case study and the job runtimes were short; it was a case where almost no job had to wait for allocation when scheduling was optimal. A2cScheduler also achieves the shortest completion time among the compared methods, as shown in Fig. 4(c). Table 4 shows the steady state results for a real-world job distribution running on an academic cluster. A2cScheduler obtains optimal scheduling results, since the average waiting time is near 0 for this job distribution. Again, these experimental results prove that A2cScheduler effectively finds proper scheduling policies by itself given adequate training, on both the simulated and real-world datasets. No rules were predefined for the scheduler in advance; instead, only a reward incorporating the system optimization target was defined. This proves that our reward function was effective in helping the scheduler learn the optimal strategy automatically after adequate training.
5 Conclusion
Job scheduling with resource constraints is a long-standing but critically important problem for computer systems. In this paper, we proposed an A2C deep reinforcement learning algorithm to address the customized job scheduling problem in data centers. We defined a reward function related to averaged job waiting time, which leads A2cScheduler to find a scheduling policy by itself. Without the need for any predefined rules, this scheduler is able to automatically learn strategies directly from experience and thus improve scheduling policies. Our experiments on both simulated data and real job traces from a data center show that our proposed method performs better than the widely used SJF and Tetris multi-resource cluster scheduling algorithms, offering a real alternative to current conventional approaches. The experimental results reported in this paper are based on two-resource (CPU/memory) restrictions, but this approach can also easily be adapted to more complex multi-resource scheduling scenarios.
References
Al-Tamimi, A., et al.: Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica 43(3), 473–481 (2007)
Domeniconi, G., Lee, E.K., Morari, A.: CuSH: cognitive scheduler for heterogeneous high performance computing system (2019)
Garg, S.K., Gopalaiyengar, S.K., Buyya, R.: SLA-based resource provisioning for heterogeneous workloads in a virtualized cloud datacenter. In: Proceedings of ICA3PP, pp. 371–384 (2011)
Grandl, R., et al.: Multi-resource packing for cluster schedulers. Comput. Commun. Rev. 44(4), 455–466 (2015)
Grondman, I., et al.: A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Trans. Syst. Man Cybern. 42(6), 1291–1307 (2012)
Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: queuing vs. planning. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 1–20. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_1
Konda, V.R., et al.: Actor-critic algorithms. In: Proceedings of NIPS, pp. 1008–1014 (2000)
Liu, Z., Zhang, H., Rao, B., Wang, L.: A reinforcement learning based resource management approach for time-critical workloads in distributed computing environment. In: Proceedings of Big Data, pp. 252–261. IEEE (2018)
Mao, H., et al.: Learning scheduling algorithms for data processing clusters. arXiv preprint arXiv:1810.01963 (2018)
Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning. In: HotNets 2016, pp. 50–56. ACM, New York (2016). https://doi.org/10.1145/3005745.3005750
Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proceedings of ICML, pp. 1928–1937 (2016)
Sprague, N., Ballard, D.: Multiple-goal reinforcement learning with modular sarsa(0) (2003)
Srinivasan, S., et al.: Actor-critic policy optimization in partially observable multiagent environments. In: Proceedings of NIPS, pp. 3422–3435 (2018)
Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge (1998)
Van Craeynest, K., et al.: Scheduling heterogeneous multi-cores through performance impact estimation (PIE). In: Computer Architecture News, vol. 40, pp. 213–224 (2012)
Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
Yang, Z., Nguyen, L., Jin, F.: Coordinating disaster emergency response with heuristic reinforcement learning
Zhou, X., Chen, H., Wang, K., Lang, M., Raicu, I.: Exploring distributed resource allocation techniques in the SLURM job management system. Technical report (2013)
Acknowledgement
We are thankful to the anonymous reviewers for their valuable feedback. This research is supported in part by the National Science Foundation under grants CCF-1718336 and CNS-1817094.
© 2020 Springer Nature Switzerland AG
Liang, S., Yang, Z., Jin, F., Chen, Y. (2020). Data Centers Job Scheduling with Deep Reinforcement Learning. In: Lauw, H., Wong, R.C.-W., Ntoulas, A., Lim, E.-P., Ng, S.-K., Pan, S. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science, vol. 12085. Springer, Cham. https://doi.org/10.1007/978-3-030-47436-2_68