Abstract

Graphics Processing Units (GPUs) have evolved from graphics-specific devices into general-purpose computing accelerators that scientists use to run large-scale simulations. GPUs are also very attractive for safety-critical applications that make extensive use of signal or image processing.

Unfortunately, while the performance and efficiency of GPUs are well established, their resilience characteristics in large-scale computing systems and safety-critical applications have not been fully evaluated. The presence of complex scheduling circuitry, for instance, may significantly increase the error rate of parallel code. Moreover, the parallel architecture of GPUs introduces novel challenges for radiation experiments that need to be solved.

In this chapter we present a detailed radiation test setup for GPUs, including some recommendations for experiments on parallel devices. We also present experimental results on the radiation sensitivity of modern GPUs, considering both low-level static analysis and the behavior of typical parallel applications under radiation.

Author information

Corresponding author

Correspondence to Paolo Rech.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Rech, P., Oliveira, D., Navaux, P., Carro, L. (2016). Soft-Error Effects on Graphics Processing Units. In: Kastensmidt, F., Rech, P. (eds) FPGAs and Parallel Architectures for Aerospace Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-14352-1_20

  • DOI: https://doi.org/10.1007/978-3-319-14352-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14351-4

  • Online ISBN: 978-3-319-14352-1

  • eBook Packages: Engineering, Engineering (R0)
