Abstract
Scientific workflows are abstractions composed of activities, data and dependencies that model a computer simulation, and they are managed by complex engines named scientific workflow management systems (SWfMSs). Many workflows demand substantial computational resources, since their executions may involve a number of different programs processing a massive volume of data. Thus, high-performance computing (HPC) and data-intensive scalable computing environments, allied to parallelization techniques, provide the necessary support for the execution of such workflows. Clouds already offer HPC capabilities, and workflows can exploit them. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment. Thus, SWfMSs must be fault-tolerant. Several fault tolerance techniques are used in SWfMSs, such as Checkpoint/Restart, Re-Execution and Over-provisioning, but it is far from trivial to choose a suitable technique for a workflow execution without jeopardizing the parallel execution. The major problem is that the suitable fault tolerance technique may differ for each workflow, activity or activation, since the programs associated with activities may behave differently. This article analyzes several fault tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for a user's workflow activities and activations using machine learning techniques and provenance data, thus aiming to improve resiliency.
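The recommendation idea summarized in the abstract can be pictured with a small sketch. It is not the article's method: the feature names, the provenance records, and the use of a k-nearest-neighbour vote are all assumptions made for illustration. Only the three technique names come from the abstract. The sketch looks up the past activations most similar to a new one and returns the technique that was labelled best for most of them.

```python
from collections import Counter
from math import dist

# Hypothetical provenance records (invented values, not data from the article):
# (runtime in seconds, input size in MB, past failure count) -> technique labelled best.
PROVENANCE = [
    ((1200.0, 800.0, 3), "Checkpoint/Restart"),
    ((1500.0, 950.0, 4), "Checkpoint/Restart"),
    ((30.0, 5.0, 1), "Re-Execution"),
    ((45.0, 8.0, 0), "Re-Execution"),
    ((300.0, 50.0, 6), "Over-provisioning"),
    ((280.0, 60.0, 7), "Over-provisioning"),
]


def _make_scaler(records):
    """Build a min-max scaler from the feature ranges found in the records."""
    columns = list(zip(*(features for features, _ in records)))
    lows = [min(column) for column in columns]
    # Guard against a zero range so that scaling never divides by zero.
    spans = [(max(column) - min(column)) or 1.0 for column in columns]
    return lambda vector: tuple(
        (value - low) / span for value, low, span in zip(vector, lows, spans)
    )


def recommend(features, records=PROVENANCE, k=3):
    """Return the majority technique among the k most similar past activations."""
    scale = _make_scaler(records)
    target = scale(features)
    nearest = sorted(records, key=lambda record: dist(scale(record[0]), target))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # A long-running, data-heavy activation resembles the checkpointed ones.
    print(recommend((1300.0, 900.0, 3)))
```

Features are min-max scaled before distances are computed so that the runtime, which has the largest numeric range, does not dominate the vote. Any classifier trained on provenance data could stand in for the nearest-neighbour step.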
Acknowledgements
This research made use of Montage. It is funded by the National Science Foundation under Grant Number ACI-1440620, and was previously funded by the National Aeronautics and Space Administration’s Earth Science Technology Office, Computation Technologies Project, under Cooperative Agreement Number NCC5-626 between NASA and the California Institute of Technology.
Additional information
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors would also like to thank CNPq and FAPERJ for partially sponsoring this research.
About this article
Cite this article
Guedes, T., Jesus, L.A., Ocaña, K.A.C.S. et al. Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach. Cluster Comput 23, 123–148 (2020). https://doi.org/10.1007/s10586-019-02920-6
DOI: https://doi.org/10.1007/s10586-019-02920-6