Multi-stage resource-aware scheduling for data centers with heterogeneous servers

  • Tony T. Tran
  • Meghana Padmanabhan
  • Peter Yun Zhang
  • Heyse Li
  • Douglas G. Down
  • J. Christopher Beck


This paper presents a three-stage algorithm for resource-aware scheduling of computational jobs in a large-scale heterogeneous data center. The algorithm aims to allocate job classes to machine configurations to attain an efficient mapping between job resource request profiles and machine resource capacity profiles. The first stage uses a queueing model that treats the system in an aggregated manner with pooled machines and jobs represented as a fluid flow. The latter two stages use combinatorial optimization techniques to solve a shorter-term, more accurate representation of the problem using the first-stage, long-term solution for heuristic guidance. In the second stage, jobs and machines are discretized. A linear programming model is used to obtain a solution to the discrete problem that maximizes the system capacity given a restriction on the job class and machine configuration pairings based on the solution of the first stage. The final stage is a scheduling policy that uses the solution from the second stage to guide the dispatching of arriving jobs to machines. We present experimental results of our algorithm on both Google workload trace data and generated data and show that it outperforms existing schedulers. These results illustrate the importance of considering heterogeneity of both job and machine configuration profiles in making effective scheduling decisions.
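The final-stage idea of dispatching arriving jobs under a restricted set of job class and machine configuration pairings can be illustrated with a minimal sketch. It is not the authors' policy: the `Machine` class, the `dispatch` function, the first-fit rule, the resource names, and the pairing set `allowed_pairs` are all hypothetical, and the pairings are simply assumed to have been produced by the earlier stages.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    config: str      # machine configuration label (hypothetical)
    capacity: dict   # resource name -> total capacity
    used: dict = field(default_factory=dict)  # resource name -> amount in use

    def fits(self, request):
        # A job fits only if every requested resource stays within capacity.
        return all(
            self.used.get(r, 0.0) + q <= self.capacity.get(r, 0.0)
            for r, q in request.items()
        )

    def assign(self, request):
        # Reserve the requested amount of each resource.
        for r, q in request.items():
            self.used[r] = self.used.get(r, 0.0) + q


def dispatch(job_class, request, machines, allowed_pairs):
    """First-fit dispatch restricted to allowed (job class, configuration) pairs.

    Returns the index of the chosen machine, or None if the job must wait.
    """
    for i, m in enumerate(machines):
        if (job_class, m.config) in allowed_pairs and m.fits(request):
            m.assign(request)
            return i
    return None


# Illustrative data only; none of these values come from the paper.
machines = [
    Machine("cpu_heavy", {"cpu": 8.0, "mem": 4.0}),
    Machine("mem_heavy", {"cpu": 2.0, "mem": 16.0}),
]
# Assumed output of the earlier stages: which classes may use which configurations.
allowed_pairs = {("compute", "cpu_heavy"), ("cache", "mem_heavy")}

first = dispatch("compute", {"cpu": 6.0, "mem": 1.0}, machines, allowed_pairs)
second = dispatch("cache", {"cpu": 1.0, "mem": 8.0}, machines, allowed_pairs)
third = dispatch("compute", {"cpu": 4.0, "mem": 1.0}, machines, allowed_pairs)
```

In this toy run the third job is not placed: its only allowed configuration lacks spare CPU, and the pairing restriction keeps it off the other machine even though capacity is free there. This is the trade-off that restricting pairings introduces.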


  • Resource-aware scheduling
  • Dynamic scheduling
  • Heterogeneous servers



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Tony T. Tran (1)
  • Meghana Padmanabhan (1)
  • Peter Yun Zhang (2)
  • Heyse Li (1)
  • Douglas G. Down (3)
  • J. Christopher Beck (1)

  1. Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada
  2. Engineering Systems Division, Massachusetts Institute of Technology, Cambridge, USA
  3. Department of Computing and Software, McMaster University, Hamilton, Canada
