Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach

Abstract

Scientific workflows are abstractions composed of activities, data, and dependencies that model a computer simulation; they are managed by complex engines named Scientific Workflow Management Systems (SWfMS). Many workflows demand considerable computational resources, since their executions may involve several different programs processing massive volumes of data. Thus, high-performance computing (HPC) and data-intensive scalable computing environments, allied to parallelization techniques, provide the necessary support for executing such workflows. Clouds already offer HPC capabilities that workflows can exploit. Although clouds offer advantages such as elasticity and availability, failures in this environment are a reality rather than a possibility, so existing SWfMS must be fault-tolerant. Several types of fault tolerance techniques are used in SWfMS, such as Checkpoint/Restart, Re-Execution, and Over-provisioning, but choosing a suitable fault tolerance technique that does not jeopardize the parallel execution of a workflow is far from trivial. The main problem is that the suitable technique may differ for each workflow, activity, or activation, since the programs associated with activities may present different behaviors. This article analyzes several fault tolerance techniques in a cloud-based SWfMS named SciCumulus and recommends the suitable one for the user's workflow activities and activations using machine learning techniques and provenance data, thus aiming at improved resiliency.
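To make the recommendation idea in the abstract concrete, here is a minimal, hypothetical sketch: provenance-derived features of an activation (e.g., average runtime and historical failure rate) drive a per-activation choice among the three techniques named above. The feature names, thresholds, and decision rules below are invented for illustration only; they mimic the kind of rule a trained classifier might learn and are not the paper's actual model.

```python
# Hypothetical sketch: recommend a fault tolerance technique per workflow
# activation from provenance-derived features. Thresholds and rules are
# illustrative inventions, not taken from the paper.

from dataclasses import dataclass


@dataclass
class ProvenanceRecord:
    activity: str
    avg_runtime_s: float   # mean activation runtime from past executions
    failure_rate: float    # fraction of past activations that failed


def recommend(rec: ProvenanceRecord) -> str:
    """Pick among Checkpoint/Restart, Re-Execution, and Over-provisioning."""
    if rec.failure_rate < 0.05:
        # Failures are rare: the cheapest option is to re-run on failure.
        return "Re-Execution"
    if rec.avg_runtime_s > 3600:
        # Long-running, failure-prone activations lose too much work when
        # restarted from scratch, so periodic checkpoints pay off.
        return "Checkpoint/Restart"
    # Short but unreliable activations: run redundant copies up front.
    return "Over-provisioning"


print(recommend(ProvenanceRecord("assembly", 7200.0, 0.2)))  # Checkpoint/Restart
```

In the paper's setting this decision would be learned from the SWfMS provenance database rather than hand-coded, and applied per workflow, activity, or activation.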



Notes

  1. https://criu.org/.
  2. https://www.postgresql.org/.
  3. https://aws.amazon.com/pt/s3/details/.
  4. https://github.com/s3fs-fuse/s3fs-fuse.
  5. https://pegasus.isi.edu/documentation/cli-pegasus-s3.php.
  6. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
  7. http://docs.aws.amazon.com/cli/latest/reference/ec2/describe-instance-status.html.
  8. http://scicumulusc2.wordpress.com/.
  9. crd.lbl.gov.
  10. https://github.com/najoshi/sickle.
  11. https://www.ebi.ac.uk/~zerbino/velvet/.
  12. http://metavelvet.dna.bio.keio.ac.jp/.
  13. http://denovoassembler.sourceforge.net/.
  14. https://github.com/LeeBergstrand/Bioinformatics_scripts
  15. http://en.wikipedia.org/wiki/Anterior_nares.
  16. https://aws.amazon.com/pt/ec2/instance-types/.
  17. http://montage.ipac.caltech.edu/.
  18. https://criu.org/Comparison_to_other_CR_projects.
  19. https://orange.biolab.si/.


Acknowledgements

This research made use of Montage. It is funded by the National Science Foundation under Grant Number ACI-1440620, and was previously funded by the National Aeronautics and Space Administration’s Earth Science Technology Office, Computation Technologies Project, under Cooperative Agreement Number NCC5-626 between NASA and the California Institute of Technology.

Author information


Corresponding author

Correspondence to Daniel de Oliveira.

Additional information

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors would also like to thank CNPq and FAPERJ for partially sponsoring this research.


About this article


Cite this article

Guedes, T., Jesus, L.A., Ocaña, K.A.C.S. et al. Provenance-based fault tolerance technique recommendation for cloud-based scientific workflows: a practical approach. Cluster Comput 23, 123–148 (2020). https://doi.org/10.1007/s10586-019-02920-6


Keywords

  • Cloud computing
  • Scientific workflow
  • Fault-tolerance
  • Recommendation