Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint

Journal of Grid Computing

Abstract

With the advent of next-generation scientific applications, the workflow approach, which integrates various computing and networking technologies, has become a viable solution for managing and optimizing large-scale distributed data transfer, processing, and analysis. This paper investigates the problem of mapping distributed scientific workflows for maximum throughput in faulty networks where nodes and links are subject to probabilistic failures. We formulate this problem as a bi-objective optimization problem that maximizes both throughput and reliability. By adapting a centralized fault-free workflow mapping scheme, we propose a new mapping algorithm that achieves high throughput for smooth data flow in a distributed manner while satisfying a pre-specified bound on the overall failure rate for a guaranteed level of reliability. The performance superiority of the proposed solution is illustrated by extensive simulation-based comparisons with existing algorithms and by experimental results from a real-life scientific workflow deployed in wide-area networks.
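To make the formulation concrete, the following is a minimal sketch of how a throughput objective and a failure-rate bound might interact when choosing among candidate mappings. It is not the algorithm proposed in the paper. It assumes that throughput is the reciprocal of the slowest stage (node computation or link transfer) in a pipelined mapping, that nodes and links fail independently, and that a small set of candidate mappings is already given. The names `throughput`, `overall_failure_rate`, `best_mapping`, and the dictionary keys are hypothetical.

```python
from math import prod


def throughput(stage_times):
    """Throughput of a pipelined mapping, assumed here to be the
    reciprocal of the slowest stage (the bottleneck)."""
    return 1.0 / max(stage_times)


def overall_failure_rate(failure_probs):
    """Probability that at least one used node or link fails,
    assuming independent failures (an assumption of this sketch)."""
    return 1.0 - prod(1.0 - p for p in failure_probs)


def best_mapping(candidates, failure_bound):
    """Return the highest-throughput candidate whose overall failure
    rate does not exceed failure_bound, or None if none qualifies."""
    feasible = [
        c for c in candidates
        if overall_failure_rate(c["failure_probs"]) <= failure_bound
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: throughput(c["stage_times"]))


# Hypothetical candidates: B is faster but uses less reliable components.
candidates = [
    {"name": "A", "stage_times": [2.0, 4.0], "failure_probs": [0.01, 0.02]},
    {"name": "B", "stage_times": [1.0, 2.0], "failure_probs": [0.10, 0.10]},
]
```

Under these assumptions, tightening the failure bound can force the choice of a slower but more reliable mapping, which is the trade-off the bi-objective formulation captures.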

Author information

Correspondence to Yi Gu.

About this article

Cite this article

Gu, Y., Wu, C.Q., Liu, X. et al. Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint. J Grid Computing 11, 361–379 (2013). https://doi.org/10.1007/s10723-013-9266-3

  • DOI: https://doi.org/10.1007/s10723-013-9266-3
