Soft error resilience of Big Data kernels through algorithmic approaches

LeCompte, Travis; Legrand, Walker; Chen, Sui; Peng, Lu

doi:10.1007/s11227-017-2042-6

Published: 18 April 2017

Volume 73, pages 4739–4772, (2017)
The Journal of Supercomputing

Abstract

As the volume of data generated each day continues to grow, so does interest in Big Data algorithms and the insight they provide. Because these analyses require substantial resources, including physical machines, power, and time, reliable execution of the algorithms becomes critical. This paper analyzes the error resilience of a select group of popular Big Data algorithms and shows how they can effectively be made more fault-tolerant. Compiling the algorithms with the LLVM compiler (Lattner and Adve 2004) and the KULFI fault injector (Sharma et al. 2013) allows artificial soft faults to be injected throughout them, giving a thorough analysis of how faults in different locations can affect the outcome of the program. This information is then used to guide the incorporation of fault tolerance mechanisms into the programs, making them as impervious as possible to soft faults.
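
The kind of experiment the abstract describes can be illustrated with a small sketch. It is not KULFI's interface and not the authors' method: the function names `flip_bit`, `kernel`, and `inject_and_classify`, the toy mean kernel, and the relative tolerance are all assumptions made for illustration. A soft fault is modeled here as a single bit flip in one IEEE-754 double input, and the faulty run is compared against a fault-free ("golden") run.

```python
import math
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its 64-bit IEEE-754 encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def kernel(data):
    """Hypothetical stand-in for a Big Data kernel: the arithmetic mean."""
    return sum(data) / len(data)


def inject_and_classify(data, index, bit, rel_tol=1e-6):
    """Flip one bit in data[index], rerun the kernel, and classify the outcome."""
    golden = kernel(data)                  # fault-free reference result
    faulty = list(data)                    # copy so the caller's data is untouched
    faulty[index] = flip_bit(faulty[index], bit)
    result = kernel(faulty)
    if not math.isfinite(result):
        return "non-finite"                # NaN or infinity, easy to detect
    if abs(result - golden) <= rel_tol * abs(golden):
        return "masked"                    # fault had no meaningful effect
    return "corrupted"                     # silent, wrong-looking-valid output


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    for bit in (0, 62, 63):
        print(bit, inject_and_classify(data, 0, bit))
```

Sweeping `index` and `bit` over all positions gives a rough picture of which fault locations are masked and which silently corrupt the output, which is the sort of information the abstract says is used to decide where fault tolerance mechanisms are worth adding.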

References

Christoforides A (2011) Metropolis–Hastings implementation. https://github.com/alexischr/mh
IBM’s Big Data Platform and Decision Management (2012) What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
Gordon R (2013) http://math.stackexchange.com/questions/346894/prove-of-the-parsevals-theorem-for-discrete-fourier-transform-dft
Sharma V, Haran A, Chen S (2013) Kulfi fault injector. http://github.com/quadpixels/kulfi
AMPLab at University of California, Berkeley (2014) AMPLab big data benchmark. https://amplab.cs.berkeley.edu/benchmark/
Harris M, NVidia (2015) https://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/
Armstrong TG, Ponnekanti V, Borthakur D, Callaghan M (2013) LinkBench: a database benchmark based on the Facebook social graph. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’13, pp 1185–1196. doi:10.1145/2463676.2465296
Austin T (1999) DIVA: a reliable substrate for deep submicron microarchitecture design. In: Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO 1999)
Bender C, Sanda PN, Kudva P, Mata R, Pokala V, Haraden R, Schallhorn M (2008) Soft-error resilience of the IBM POWER6 processor input/output subsystem. IBM J Res Dev 52(3):285–292. doi:10.1147/rd.523.0285
Cappello F, Geist A, Gropp W, Kale S, Kramer B, Snir M (2014) Toward exascale resilience: 2014 update. J Supercomput Front Innov 1(1). doi:10.14529/jsfi140101
Chen S, Bronevetsky G, Li B, Guix MC, Peng L (2015) A framework for evaluating comprehensive fault resilience mechanisms in numerical programs. J Supercomput 71(8):2963–2984. doi:10.1007/s11227-015-1422-z
Chen S, Bronevetsky G, Peng L, Li B, Fu X (2016) Soft error resilience in big data kernels through modular analysis. J Supercomput 72(4):1570–1596. doi:10.1007/s11227-016-1682-2
Chung J, Lee I, Sullivan M, Ryoo JH, Kim DW, Yoon DH, Kaplan L, Erez M (2012) Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC12)
Collange S, Defour D, Graillat S, Iakymchuk R (2015) Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Comput 49:83–97. doi:10.1016/j.parco.2015.09.001, http://www.sciencedirect.com/science/article/pii/S0167819115001155
Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing, ACM, New York, NY, USA, SoCC ’10, pp 143–154. doi:10.1145/1807128.1807152
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge, MA
Ferdman M, Adileh A, Koçberber YO, Volos S, Alisafaee M, Jevdjic D, Kaynak C, Popescu AD, Ailamaki A, Falsafi B (2012) Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 3–7, 2012, pp 37–48. doi:10.1145/2150976.2150982
Free Software Foundation (2016) GSL—GNU scientific library. https://www.gnu.org/software/gsl/
Gao W, Luo C, Zhan J, Ye H, He X, Wang L, Zhu Y, Tian X (2015) Identifying dwarfs workloads in big data analytics. http://arxiv.org/abs/1505.06872
Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen HA (2013) BigBench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, SIGMOD ’13, pp 1197–1208. doi:10.1145/2463676.2463712
Guan Q, DeBardeleben N, Blanchard S, Wu P, Monrow L, Chen Z (2016) P-FSEFI: a parallel soft error fault injection framework for parallel applications. In: Proceedings of the 12th Workshop on Silicon Error in Logic-System Effect (SELSE)
Huang S, Huang J, Dai J, Xie T, Huang B (2010) The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp 41–51. doi:10.1109/ICDEW.2010.5452747
Iakymchuk R, Collange S, Defour D, Graillat S (2015) ExBLAS: reproducible and accurate BLAS library. In: Proceedings of the Numerical Reproducibility at Exascale (NRE2015) workshop held as part of the Supercomputing Conference (SC15), Austin, TX, USA, November 15–20, 2015. HAL ID: hal-01202396
ITRS (2013) International technology roadmap for semiconductors. Technical report
Kumar S, Hari S, Adve SV, Naeimi H, Ramachandran P (2012) Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults. In: Proceedings of the 17th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012)
Lattner C, Adve V (2004) LLVM: A compilation framework for lifelong program analysis and transformation. In: Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO 2004), San Jose, CA, USA
Liu W, Zhang W, Wang X, Xu J (2016) Distributed sensor network-on-chip for performance optimization of soft-error-tolerant multiprocessor system-on-chip. IEEE Trans Very Large Scale Integr (VLSI) Syst 24(4):1546–1559. doi:10.1109/TVLSI.2015.2452910
NVIDIA (2013) Tesla K20 GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v07.pdf
Serrano F, Clemente JA, Mecha H (2015) A methodology to emulate single event upsets in flip-flops using FPGAs through partial reconfiguration and instrumentation. IEEE Trans Nucl Sci 62(4):1617–1624. doi:10.1109/TNS.2015.2447391
Tiwari D, Gupta S, Gallarno G, Rogers J, Maxwell D (2015) Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge Leadership Computing Facility. In: SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–12. doi:10.1145/2807591.2807666
Wang L, Bertran R, Buyuktosunoglu A, Bose P, Skadron K (2014) Characterization of transient error tolerance for a class of mobile embedded applications. In: 2014 IEEE International Symposium on Workload Characterization (IISWC), pp 74–75. doi:10.1109/IISWC.2014.6983042
Yeh TY, Reinman G, Patel SJ, Faloutsos P (2009) Fool me twice: exploring and exploiting error tolerance in physics-based animation. ACM Trans Graph 29(1):5:1–5:11. doi:10.1145/1640443.1640448

Acknowledgements

We are grateful to Prof. Nian-Feng Tzeng at the Center for Advanced Computer Studies, University of Louisiana at Lafayette, for providing invaluable feedback on our research. We are grateful to Vishal Sharma and Arvind Haran, the authors of the original KULFI, for granting us permission to modify it for our experiments. We are also appreciative of the opportunity to be involved in and contribute to KULFI. Support for this research was provided by the National Science Foundation under Award Numbers 1527318, 1422408 (Directorate for Computer and Information Science and Engineering), and 1017961 (Division of Computing and Communication Foundations).

Author information

Authors and Affiliations

Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, Louisiana, USA
Travis LeCompte, Walker Legrand, Sui Chen & Lu Peng

Corresponding author

Correspondence to Lu Peng.

About this article

Cite this article

LeCompte, T., Legrand, W., Chen, S. et al. Soft error resilience of Big Data kernels through algorithmic approaches. J Supercomput 73, 4739–4772 (2017). https://doi.org/10.1007/s11227-017-2042-6

Published: 18 April 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s11227-017-2042-6
