Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach

Guedes, Thaylon; Jesus, Leonardo A.; Ocaña, Kary A. C. S.; Drummond, Lucia M. A.; de Oliveira, Daniel

doi:10.1007/s10586-019-02920-6

Published: 09 March 2019

Volume 23, pages 123–148 (2020)

Cluster Computing

Thaylon Guedes¹, Leonardo A. Jesus¹, Kary A. C. S. Ocaña², Lucia M. A. Drummond¹ & Daniel de Oliveira¹ (ORCID: orcid.org/0000-0001-9346-7651)

620 Accesses
10 Citations

Abstract

Scientific workflows are abstractions composed of activities, data, and dependencies that model a computer simulation; they are managed by complex engines called scientific workflow management systems (SWfMS). Many workflows demand substantial computational resources, since their executions may involve many different programs processing a massive volume of data. High-performance computing (HPC) and data-intensive scalable computing environments, combined with parallelization techniques, therefore provide the necessary support for executing such workflows. Clouds already offer HPC capabilities, and workflows can exploit them. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility in this environment, so SWfMS must be fault-tolerant. SWfMS use several types of fault tolerance techniques, such as Checkpoint/Restart, Re-Execution, and Over-provisioning, but choosing a suitable technique for a workflow execution without jeopardizing the parallel execution is far from trivial. The major problem is that the suitable technique may differ for each workflow, activity, or activation, since the programs associated with activities may behave differently. This article analyzes several fault tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user’s workflow activities and activations using machine learning techniques and provenance data, with the aim of improving resiliency.
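To make the idea of a provenance-based recommendation concrete, the following is a minimal sketch and not the method used in the article. It assumes, purely for illustration, that each past activation is summarized by two provenance features (average runtime in seconds and observed failure rate) and labeled with the technique that worked best. It then recommends a technique for a new activation with a simple k-nearest-neighbor vote. The feature choice, the `HISTORY` records, and the `recommend` function are all hypothetical; the abstract does not say which learning algorithms or features the authors use.

```python
from collections import Counter
from math import dist

# Hypothetical provenance records (invented values, not from the article):
# (avg_runtime_s, past_failure_rate) -> technique that worked best before.
HISTORY = [
    ((30.0, 0.01), "Re-Execution"),
    ((45.0, 0.02), "Re-Execution"),
    ((3600.0, 0.10), "Checkpoint/Restart"),
    ((5400.0, 0.15), "Checkpoint/Restart"),
    ((600.0, 0.40), "Over-provisioning"),
    ((900.0, 0.35), "Over-provisioning"),
]


def _scale(features, mins, maxs):
    """Min-max scale each feature to [0, 1] so no feature dominates the distance."""
    return tuple(
        (f - lo) / (hi - lo) if hi > lo else 0.0
        for f, lo, hi in zip(features, mins, maxs)
    )


def recommend(features, history=HISTORY, k=3):
    """Return the majority technique among the k most similar past activations."""
    columns = list(zip(*(f for f, _ in history)))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    query = _scale(features, mins, maxs)
    # Rank past activations by Euclidean distance in the scaled feature space.
    ranked = sorted(
        history, key=lambda rec: dist(_scale(rec[0], mins, maxs), query)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # A short, rarely failing activation resembles the Re-Execution examples.
    print(recommend((40.0, 0.015)))
    # A long-running activation resembles the Checkpoint/Restart examples.
    print(recommend((5000.0, 0.12)))
```

In this toy setting, short and reliable activations fall near the Re-Execution examples, while long-running ones fall near the Checkpoint/Restart examples. A real recommender would be trained on the provenance data that the SWfMS actually captures.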

Acknowledgements

This research made use of Montage. It is funded by the National Science Foundation under Grant Number ACI-1440620, and was previously funded by the National Aeronautics and Space Administration’s Earth Science Technology Office, Computation Technologies Project, under Cooperative Agreement Number NCC5-626 between NASA and the California Institute of Technology.

Author information

Authors and Affiliations

Instituto de Computação - Universidade Federal Fluminense, Niterói, Brazil
Thaylon Guedes, Leonardo A. Jesus, Lucia M. A. Drummond & Daniel de Oliveira
Laboratório Nacional de Computação Científica, Petrópolis, Brazil
Kary A. C. S. Ocaña

Corresponding author

Correspondence to Daniel de Oliveira.

Additional information

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors would also like to thank CNPq and FAPERJ for partially sponsoring this research.

About this article

Cite this article

Guedes, T., Jesus, L.A., Ocaña, K.A.C.S. et al. Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach. Cluster Comput 23, 123–148 (2020). https://doi.org/10.1007/s10586-019-02920-6

Received: 31 March 2018
Revised: 20 December 2018
Accepted: 23 February 2019
Published: 09 March 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10586-019-02920-6
