Performance Implications of Failures in Large-Scale Cluster Scheduling

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 3277)

Abstract

As parallel systems continue to grow in scale, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design them to anticipate and accommodate failures. Failures are commonplace in such large-scale systems and can no longer be treated as exceptions. Despite their current and increasing importance, our understanding of how failures affect the performance of parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and we then use this framework to conduct a detailed analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs seems to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with the former providing the greatest benefits.
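
The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in the job scheduling literature (response time as completion time minus arrival time, slowdown as response time divided by failure-free run time); the abstract does not state the paper's exact definitions, and the `Job` record and sample values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; field names are illustrative assumptions.
    arrival: float      # time the job was submitted
    completion: float   # time the job finished, including any rework after failures
    runtime: float      # failure-free execution time of the job


def response_time(job: Job) -> float:
    """Time from submission to completion."""
    return job.completion - job.arrival


def slowdown(job: Job) -> float:
    """Response time normalized by the failure-free run time (assumed definition)."""
    return response_time(job) / job.runtime


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


# Two made-up jobs: the second is delayed, for example by queueing or a restart.
jobs = [
    Job(arrival=0.0, completion=10.0, runtime=10.0),
    Job(arrival=2.0, completion=32.0, runtime=10.0),
]

mean_response = mean(response_time(j) for j in jobs)
mean_slowdown = mean(slowdown(j) for j in jobs)
```

Under these assumed definitions, work lost to a failure lengthens a job's completion time, which raises both its response time and its slowdown.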

Keywords

  • Temporal Correlation
  • Performance Implication
  • Work Loss
  • Parallel Computing Environment
  • Checkpoint Interval

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, Y., Squillante, M.S., Sivasubramaniam, A., Sahoo, R.K. (2005). Performance Implications of Failures in Large-Scale Cluster Scheduling. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2004. Lecture Notes in Computer Science, vol 3277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11407522_13

  • DOI: https://doi.org/10.1007/11407522_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25330-3

  • Online ISBN: 978-3-540-31795-1

  • eBook Packages: Computer Science, Computer Science (R0)
