The Journal of Supercomputing, Volume 74, Issue 6, pp 2353–2384

Non-clairvoyant online scheduling of synchronized jobs on virtual clusters

  • Sina Mahmoodi Khorandi
  • Mohsen Sharifi


Although virtualization technology has recently been applied to next-generation distributed high-performance computing systems, the theoretical aspects of scheduling jobs in these virtualized environments have not been sufficiently studied, especially in the online and non-clairvoyant cases. Virtualization of computing resources introduces interference and virtualization overheads that negatively affect load-balancing objectives on commonly used clusters of multi-core physical machines. We present a technique for non-clairvoyant online scheduling of globally synchronized jobs, each of which spawns tasks that execute compute-intensive work. Our technique considers both load balancing across physical cores and minimization of the per-job synchronization cost. We show that, in the presence of arbitrary virtualization overheads, interference effects and synchronization costs, the problem can be reduced to online scheduling on unrelated parallel machines, which is solved using routing of virtual circuits. We present a new opportunity cost model to reduce the problem to the routing of virtual circuits and demonstrate the effectiveness of our scheduling technique using mathematical analysis and simulation experiments.
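To make the reduction mentioned above more concrete, the following is a minimal sketch of the classic exponential-cost greedy rule used for online load balancing on unrelated machines in the virtual-circuit routing literature. It is not the opportunity cost model of this paper: the function name `assign_job`, the base `a`, and the normalisation constant `scale` are hypothetical, and the sketch ignores job departures, interference, virtualization overheads and synchronization cost.

```python
from typing import List, Sequence


def assign_job(loads: List[float], costs: Sequence[float],
               a: float = 2.0, scale: float = 1.0) -> int:
    """Place one arriving job on the machine with the smallest marginal cost.

    loads[i] : current load of machine i (updated in place)
    costs[i] : load the job would add on machine i; it may differ per
               machine, which is what makes the machines "unrelated"
    a        : base of the exponential cost, a > 1 (assumed value)
    scale    : normalisation constant, e.g. an estimate of the optimal
               maximum load (assumed value)
    """
    def marginal(i: int) -> float:
        # Increase in a**(normalised load) if the job goes to machine i.
        before = a ** (loads[i] / scale)
        after = a ** ((loads[i] + costs[i]) / scale)
        return after - before

    best = min(range(len(loads)), key=marginal)
    loads[best] += costs[best]
    return best


if __name__ == "__main__":
    loads = [0.0, 0.0]
    for job_costs in ([1.0, 3.0], [1.0, 1.0], [2.0, 1.0]):
        machine = assign_job(loads, job_costs)
        print(f"job {job_costs} -> machine {machine}, loads = {loads}")
```

Because the cost grows exponentially with load, a heavily loaded machine quickly becomes expensive, so the rule trades off a job's machine-specific cost against the current imbalance without knowing anything about future arrivals.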


Job scheduling, Virtual clusters, Synchronization, Load balancing, Non-clairvoyant



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Distributed Systems Research Lab, School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
  2. School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
