
Predictable quality of service atop degradable distributed systems

Cluster Computing

Abstract

High-performance and distributed computing systems, such as petascale, grid, and cloud infrastructure, are increasingly used for running scientific models and business services. These systems experience large variations in availability caused by hardware and software failures. Resource providers need to account for these variations while providing the required Quality of Service (QoS) at appropriate cost in dynamic resource and application environments. Although the performance and reliability of these systems have been studied separately, there has been little analysis of the QoS lost as availability levels vary. In this paper, we present a resource performability model that estimates lost performance, and the corresponding cost considerations, at varying availability levels. We use the resulting model in a multi-phase planning approach for scheduling a set of deadline-sensitive meteorological workflows atop grid and cloud resources, trading off performance, reliability, and cost. Simulation results driven by failure data collected over the lifetime of high-performance systems demonstrate how the proposed scheme better accounts for resource availability.



Author information


Corresponding author

Correspondence to Lavanya Ramakrishnan.


About this article

Cite this article

Ramakrishnan, L., Reed, D.A. Predictable quality of service atop degradable distributed systems. Cluster Comput 16, 321–334 (2013). https://doi.org/10.1007/s10586-009-0078-y


  • DOI: https://doi.org/10.1007/s10586-009-0078-y
