Towards fault tolerance optimization based on checkpoints of in-memory framework spark

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Performance optimization, and fault tolerance optimization in particular, is a significant concern for in-memory computing. When a node fails, an in-memory computing framework may lose data, which increases the execution time when no checkpoint is available. In the traditional Spark strategy, the programmer chooses the checkpoint manually, which carries uncertainty and risk. This paper therefore addresses the checkpoint strategy of the in-memory computing framework Spark. Following a theoretical analysis, we present a checkpoint selection algorithm that takes into account the length of the RDD lineage, the computational cost, the operation complexity, and the size of the RDD when setting a checkpoint. The greater the weight of an RDD, the higher its priority: the RDD with the higher cost is set as a checkpoint first, which reduces the recomputation cost of the task. When a failure occurs, the recovery algorithm is executed, which improves the efficiency of task recovery. The experimental results show that the strategy optimizes the fault tolerance mechanism of Spark and improves the efficiency of job recovery.
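The abstract does not give the weight formula, so the following is only a minimal sketch of weight-based checkpoint selection under stated assumptions. It assumes the weight is a linear combination of the four factors named above; the class `RddStats`, the functions `weight` and `select_checkpoints`, and all coefficients are hypothetical names and values introduced for illustration, not the authors' algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RddStats:
    """Hypothetical per-RDD statistics; field names are illustrative."""
    name: str
    lineage_length: int   # number of transformations in the lineage
    compute_cost: float   # assumed measure, e.g. seconds to compute
    op_complexity: float  # assumed score for the operation type
    size_mb: float        # assumed size of the materialized RDD


def weight(rdd: RddStats,
           w_lineage: float = 1.0,
           w_cost: float = 1.0,
           w_complexity: float = 1.0,
           w_size: float = 1.0) -> float:
    """Assumed linear weight; the paper's actual formula may differ."""
    return (w_lineage * rdd.lineage_length
            + w_cost * rdd.compute_cost
            + w_complexity * rdd.op_complexity
            + w_size * rdd.size_mb)


def select_checkpoints(rdds: List[RddStats], k: int) -> List[RddStats]:
    """Return the k RDDs with the greatest weight, highest priority first."""
    return sorted(rdds, key=weight, reverse=True)[:k]


if __name__ == "__main__":
    candidates = [
        RddStats("parsed", 1, 1.0, 1.0, 1.0),
        RddStats("joined", 5, 10.0, 2.0, 3.0),
        RddStats("filtered", 2, 2.0, 1.0, 1.0),
    ]
    for rdd in select_checkpoints(candidates, k=2):
        print(rdd.name, weight(rdd))
```

In a PySpark job, an RDD chosen this way could then be marked with `rdd.checkpoint()` after a checkpoint directory has been set through `SparkContext.setCheckpointDir`; how the statistics themselves would be collected is left open here.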

Acknowledgements

The authors are grateful to the anonymous referees for their insightful suggestions and comments. This research was supported by the National Natural Science Foundation of China under Grant Nos. 61262088, 61462079, 61363083, 61562086.

Author information

Corresponding author

Correspondence to Changtian Ying.

About this article

Cite this article

Ying, C., Yu, J. & He, J. Towards fault tolerance optimization based on checkpoints of in-memory framework spark. J Ambient Intell Human Comput (2018). https://doi.org/10.1007/s12652-018-1018-6

  • DOI: https://doi.org/10.1007/s12652-018-1018-6
