Predictable quality of service atop degradable distributed systems

Ramakrishnan, Lavanya; Reed, Daniel A.

doi:10.1007/s10586-009-0078-y

Published: 17 January 2009

Cluster Computing, Volume 16, pages 321–334 (2013)

Lavanya Ramakrishnan¹ &
Daniel A. Reed²

Abstract

High-performance and distributed computing systems, such as peta-scale, grid, and cloud infrastructure, are increasingly used to run scientific models and business services. These systems experience large variations in availability because of hardware and software failures. Resource providers need to account for these variations while providing the required Quality of Service (QoS) at appropriate cost in dynamic resource and application environments. Although the performance and reliability of these systems have been studied separately, there has been little analysis of the QoS lost at varying availability levels. In this paper, we present a resource performability model to estimate lost performance and the corresponding cost considerations at varying availability levels. We use the resulting model in a multi-phase planning approach for scheduling a set of deadline-sensitive meteorological workflows atop grid and cloud resources to trade off performance, reliability, and cost. We use simulation results driven by failure data collected over the lifetime of high-performance systems to demonstrate how the proposed scheme better accounts for resource availability.

Author information

Authors and Affiliations

Indiana University, Bloomington, IN, USA
Lavanya Ramakrishnan
Microsoft Research, Redmond, WA, USA
Daniel A. Reed

Corresponding author

Correspondence to Lavanya Ramakrishnan.

About this article

Cite this article

Ramakrishnan, L., Reed, D.A. Predictable quality of service atop degradable distributed systems. Cluster Comput 16, 321–334 (2013). https://doi.org/10.1007/s10586-009-0078-y

Received: 31 December 2008
Accepted: 06 January 2009
Published: 17 January 2009
Issue Date: June 2013
DOI: https://doi.org/10.1007/s10586-009-0078-y
