Soft-Error Effects on Graphics Processing Units

Rech, Paolo; Oliveira, Daniel; Navaux, Philippe; Carro, Luigi

doi:10.1007/978-3-319-14352-1_20

Abstract

Graphics Processing Units (GPUs) have evolved from graphics-specific devices into general-purpose computing accelerators that scientists use to run large-scale simulations. GPUs are also attractive for safety-critical applications that make extensive use of signal or image processing.

Unfortunately, while the performance and efficiency of GPUs are well established, their resilience in large-scale computing systems and safety-critical applications has not been fully evaluated. The presence of complex scheduling circuitry, for instance, may significantly increase the error rate of parallel code. Moreover, the parallel architecture of GPUs introduces novel challenges for radiation experiments that need to be solved.

In this chapter we present a detailed radiation test setup for GPUs, including some recommendations for experiments on parallel devices. We also present experimental results on the radiation sensitivity of modern GPUs, considering both low-level static analysis and the behavior of typical parallel applications under radiation.

Author information

Authors and Affiliations

Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Paolo Rech, Daniel Oliveira, Philippe Navaux & Luigi Carro

Corresponding author

Correspondence to Paolo Rech.

Editor information

Editors and Affiliations

Instituto de Informatica, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Fernanda Kastensmidt
Instituto de Informática, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil
Paolo Rech

About this chapter

Cite this chapter

Rech, P., Oliveira, D., Navaux, P., Carro, L. (2016). Soft-Error Effects on Graphics Processing Units. In: Kastensmidt, F., Rech, P. (eds) FPGAs and Parallel Architectures for Aerospace Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-14352-1_20

DOI: https://doi.org/10.1007/978-3-319-14352-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14351-4
Online ISBN: 978-3-319-14352-1
eBook Packages: Engineering, Engineering (R0)
