Using Inaccurate Estimates Accurately
Job schedulers improve system utilization by requiring users to estimate how long their jobs will run and by using this information to better pack (or “backfill”) the jobs. Surprisingly, many studies find that deliberately making estimates less accurate boosts (or does not affect) performance, which helps explain why production systems still rely exclusively on notoriously inaccurate estimates.
We prove these studies wrong by showing that their methodology is erroneous. The studies model an estimate e as being correlated with r·F (where r is the runtime of the associated job, F is some “badness” factor, and larger F values imply increased inaccuracy). We show this model is invalid, because: (1) it conveys too much information to the scheduler; (2) it induces favoritism of short jobs; and (3) it is inherently different from real user inaccuracy, which associates 90% of the jobs with merely 20 estimate values, hindering the scheduler’s ability to backfill.
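To make point (1) concrete, the following minimal sketch assumes the simplest deterministic variant of the criticized model, e = r·F, applied to a made-up list of runtimes. The function names and sample values are illustrative assumptions, not taken from the studies. It shows that scaling every runtime by the same factor leaves the relative order of jobs intact and yields one distinct estimate per distinct runtime, so the scheduler still sees which jobs are short and which are long.

```python
def multiplicative_estimates(runtimes, badness):
    """Return e = r * F for each runtime r.

    Assumption: a deterministic variant of the r*F model with F >= 1;
    the studies under discussion may add randomness on top of this.
    """
    if badness < 1:
        raise ValueError("badness factor F is assumed to be >= 1")
    return [r * badness for r in runtimes]


def rank_order(values):
    """Indices of values in ascending order (ties broken by index)."""
    return sorted(range(len(values)), key=lambda i: (values[i], i))


# Made-up sample runtimes in seconds (illustrative only).
runtimes = [30, 600, 45, 7200, 900, 120]
estimates = multiplicative_estimates(runtimes, badness=10)

# The ordering of jobs by estimate is identical to the ordering by runtime.
same_order = rank_order(runtimes) == rank_order(estimates)

# Every distinct runtime maps to its own distinct estimate.
distinct_estimates = len(set(estimates))
```

Under this assumption, raising F inflates every estimate but never hides which job is shorter, in contrast to the handful of popular values that real users choose.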
We conclude that researchers must stop using multiples of runtimes as estimates, or else their results will likely be invalid. We develop (and propose to use) a realistic model that preserves the estimates’ modality and allows one to soundly simulate increased inaccuracy by, e.g., associating more jobs with the maximal runtime allowed (an always-popular estimate, which prevents backfilling).
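The idea of modal estimates can be illustrated with a toy sketch. This is not the model developed in the paper; it only assumes that each job receives one value from a small menu of popular estimates (the smallest menu value that covers its runtime), and that increased inaccuracy is simulated by moving a chosen fraction of jobs to the maximal allowed runtime. The function name `modal_estimates`, the menu values, and the `frac_max` parameter are all hypothetical.

```python
import bisect
import random


def modal_estimates(runtimes, popular_values, frac_max, seed=0):
    """Assign each job an estimate drawn from a small menu of popular values.

    Toy assumptions (not the paper's model):
      * the largest menu value is the maximal runtime the system allows;
      * with probability frac_max a job gets that maximal value;
      * otherwise it gets the smallest menu value >= its runtime.
    """
    rng = random.Random(seed)          # seeded for reproducibility
    menu = sorted(popular_values)
    max_allowed = menu[-1]
    estimates = []
    for r in runtimes:
        if r > max_allowed:
            raise ValueError("runtime exceeds the maximal allowed estimate")
        if rng.random() < frac_max:
            estimates.append(max_allowed)
        else:
            # bisect_left returns the first index whose value is >= r
            estimates.append(menu[bisect.bisect_left(menu, r)])
    return estimates


# Made-up runtimes and menu, in seconds (illustrative only).
runtimes = [30, 600, 45, 7200, 900, 120]
menu = [300, 900, 3600, 64800]

baseline = modal_estimates(runtimes, menu, frac_max=0.0)
all_max = modal_estimates(runtimes, menu, frac_max=1.0)
```

With `frac_max=0.0` the six jobs collapse onto three menu values, and with `frac_max=1.0` every job carries the same maximal estimate, so the estimates no longer distinguish short jobs from long ones.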
Keywords: Supercomputing, scheduling, backfilling, user runtime estimates