Abstract
With the advent of next-generation scientific applications, the workflow approach, which integrates various computing and networking technologies, has become a viable solution for managing and optimizing large-scale distributed data transfer, processing, and analysis. This paper investigates the problem of mapping distributed scientific workflows for maximum throughput in faulty networks, where nodes and links are subject to probabilistic failures. We formulate this problem as a bi-objective optimization problem that maximizes both throughput and reliability. By adapting a centralized fault-free workflow mapping scheme, we propose a new mapping algorithm that achieves high throughput for smooth data flow in a distributed manner while satisfying a pre-specified bound on the overall failure rate, which guarantees a given level of reliability. The superior performance of the proposed solution is demonstrated through extensive simulation-based comparisons with existing algorithms and experimental results from a real-life scientific workflow deployed in wide-area networks.
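To make the formulation in the abstract concrete, the following is a minimal sketch of choosing a mapping that maximizes throughput subject to a bound on the overall failure rate. It is not the paper's distributed algorithm: it simply picks the best of a few given candidate mappings in a centralized way. The bottleneck throughput model, the assumption of independent failures, and every name in the code (`Stage`, `throughput`, `overall_failure_rate`, `best_feasible_mapping`, the candidate mappings and their numbers) are illustrative assumptions that do not come from the paper.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Stage:
    """One mapped workflow component: a module on a node or a transfer on a link."""
    time_per_dataset: float  # time to process or transfer one dataset (hypothetical unit)
    failure_prob: float      # probability that the underlying node or link fails


def throughput(stages):
    """Assumed model: steady-state throughput is limited by the slowest stage."""
    return 1.0 / max(s.time_per_dataset for s in stages)


def overall_failure_rate(stages):
    """Probability that at least one stage fails, assuming independent failures."""
    return 1.0 - prod(1.0 - s.failure_prob for s in stages)


def best_feasible_mapping(candidates, failure_bound):
    """Return the name of the highest-throughput mapping within the failure bound.

    Returns None when no candidate satisfies the bound.
    """
    feasible = {
        name: stages
        for name, stages in candidates.items()
        if overall_failure_rate(stages) <= failure_bound
    }
    if not feasible:
        return None
    return max(feasible, key=lambda name: throughput(feasible[name]))


# Two hypothetical mappings of the same two-stage workflow.
candidates = {
    "fast_but_fragile": [Stage(1.0, 0.10), Stage(2.0, 0.10)],
    "slower_but_reliable": [Stage(2.0, 0.01), Stage(4.0, 0.01)],
}

print(best_feasible_mapping(candidates, failure_bound=0.05))
print(best_feasible_mapping(candidates, failure_bound=0.25))
```

With a tight bound of 0.05 only the reliable mapping is feasible, while a looser bound of 0.25 admits both and the faster one is chosen. This shows how the reliability requirement acts as a constraint on the throughput objective.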
About this article
Cite this article
Gu, Y., Wu, C.Q., Liu, X. et al. Distributed Throughput Optimization for Large-Scale Scientific Workflows Under Fault-Tolerance Constraint. J Grid Computing 11, 361–379 (2013). https://doi.org/10.1007/s10723-013-9266-3
DOI: https://doi.org/10.1007/s10723-013-9266-3