Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint

Gu, Yi; Wu, Chase Qishi; Liu, Xin; Yu, Dantong

doi:10.1007/s10723-013-9266-3

Published: 08 June 2013

Volume 11, pages 361–379 (2013)
Journal of Grid Computing

Abstract

With the advent of next-generation scientific applications, the workflow approach that integrates various computing and networking technologies has provided a viable solution to managing and optimizing large-scale distributed data transfer, processing, and analysis. This paper investigates a problem of mapping distributed scientific workflows for maximum throughput in faulty networks where nodes and links are subject to probabilistic failures. We formulate this problem as a bi-objective optimization problem to maximize both throughput and reliability. By adapting and modifying a centralized fault-free workflow mapping scheme, we propose a new mapping algorithm to achieve high throughput for smooth data flow in a distributed manner while satisfying a pre-specified bound of the overall failure rate for a guaranteed level of reliability. The performance superiority of the proposed solution is illustrated by both extensive simulation-based comparisons with existing algorithms and experimental results from a real-life scientific workflow deployed in wide-area networks.
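
The constrained formulation described in the abstract (maximize throughput while keeping the overall failure rate under a pre-specified bound) can be illustrated with a toy model. The sketch below is not the paper's algorithm: it is a centralized brute-force search, it ignores link transfer times and link failures, and it assumes that nodes fail independently and that modules sharing a node run back to back. The names `evaluate`, `best_mapping`, `compute_time`, `fail_prob`, and `max_failure` are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod


def evaluate(mapping, compute_time, fail_prob):
    """Return (throughput, overall failure probability) for one mapping.

    mapping[i] is the index of the node hosting module i.
    compute_time[i][n] is the time module i needs on node n.
    Assumption: modules on the same node run back to back, so a node's
    load is the sum of its module times and the slowest node is the
    bottleneck that limits steady-state throughput.
    """
    load = {}
    for module, node in enumerate(mapping):
        load[node] = load.get(node, 0.0) + compute_time[module][node]
    throughput = 1.0 / max(load.values())
    # Assumption: the nodes in use fail independently of each other.
    failure = 1.0 - prod(1.0 - fail_prob[node] for node in load)
    return throughput, failure


def best_mapping(compute_time, fail_prob, max_failure):
    """Exhaustively pick the highest-throughput mapping within the bound.

    Returns (mapping, throughput, failure) or None if no mapping
    satisfies failure <= max_failure. Exponential in the number of
    modules, so only usable for tiny illustrative instances.
    """
    n_modules = len(compute_time)
    n_nodes = len(fail_prob)
    best = None
    for mapping in product(range(n_nodes), repeat=n_modules):
        throughput, failure = evaluate(mapping, compute_time, fail_prob)
        if failure <= max_failure and (best is None or throughput > best[1]):
            best = (mapping, throughput, failure)
    return best


# Two modules, two nodes: node 0 is failure-prone, node 1 is reliable.
times = [[1.0, 2.0], [3.0, 2.0]]
probs = [0.1, 0.01]
loose = best_mapping(times, probs, max_failure=0.2)
tight = best_mapping(times, probs, max_failure=0.05)
```

In this made-up instance, the loose bound admits a mapping that spreads the modules over both nodes, while the tight bound forces both modules onto the reliable node at lower throughput, which is the trade-off the bi-objective formulation is meant to capture.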

Author information

Authors and Affiliations

Department of Management, Marketing, Computer Science & Info System, The University of Tennessee at Martin, 554 University St., Martin, TN, 38237, USA
Yi Gu
Department of Computer Science, The University of Memphis, 317 Dunn Hall, Memphis, TN, 38152, USA
Chase Qishi Wu
Computational Science Center, Brookhaven National Laboratory, 2 Center St., Upton, NY, 11973, USA
Xin Liu & Dantong Yu

Corresponding author

Correspondence to Yi Gu.

About this article

Cite this article

Gu, Y., Wu, C.Q., Liu, X. et al. Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint. J Grid Computing 11, 361–379 (2013). https://doi.org/10.1007/s10723-013-9266-3

Received: 20 May 2012
Accepted: 24 May 2013
Published: 08 June 2013
Issue Date: September 2013
DOI: https://doi.org/10.1007/s10723-013-9266-3
