Soft error vulnerability prediction of GPGPU applications

The Journal of Supercomputing

Abstract

As graphics processing units (GPUs) evolve to offer high performance for general-purpose computations in addition to inherently fault-tolerant graphics applications, soft error reliability becomes a significant concern. Fault injection provides a method of evaluating the soft error vulnerability of target programs. Because fault injection experiments on complex GPU hardware structures take an impractical amount of time, prediction-based techniques that evaluate the soft error vulnerability of general-purpose GPU (GPGPU) programs using metrics from different domains become crucial for both HPC developers and GPU vendors. In this work, we propose machine learning (ML)-based prediction frameworks for the soft error vulnerability evaluation of GPGPU programs. We consider program characteristics, hardware usage, and performance metrics collected from simulation and profiling tools. We use regression models to predict the masked fault rates, and we build classification models to specify the vulnerability level of GPGPU programs based on their silent data corruption (SDC) and crash rates. Our prediction models achieve maximum prediction accuracy rates of 95.9%, 88.46%, and 85.7% for masked fault rates, SDCs, and crashes, respectively.
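
To make the two prediction targets concrete, the following minimal sketch shows how fault injection outcomes could be turned into masked, SDC, and crash rates, how a rate could be binned into a vulnerability level, and how a masked fault rate could be regressed on a single metric. It is an illustration only and not the models used in this work: the program names, outcome counts, feature values, and class thresholds are all hypothetical, and a one-feature least-squares fit stands in for a trained regression model.

```python
from collections import Counter

# Hypothetical fault injection outcomes per program (illustrative values only).
OUTCOMES = {
    "prog_a": ["masked"] * 70 + ["sdc"] * 20 + ["crash"] * 10,
    "prog_b": ["masked"] * 50 + ["sdc"] * 40 + ["crash"] * 10,
    "prog_c": ["masked"] * 90 + ["sdc"] * 5 + ["crash"] * 5,
}

# Hypothetical single feature per program, e.g. a profiler metric scaled to [0, 1].
FEATURE = {"prog_a": 0.5, "prog_b": 0.9, "prog_c": 0.1}


def outcome_rates(outcomes):
    """Return the fraction of injections that ended as masked, sdc, or crash."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {kind: counts[kind] / total for kind in ("masked", "sdc", "crash")}


def vulnerability_level(rate, low=0.1, high=0.3):
    """Bin a rate into a class label; the thresholds are assumed, not from the paper."""
    if rate < low:
        return "low"
    if rate < high:
        return "medium"
    return "high"


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Regression target: masked fault rate. Classification target: SDC level.
rates = {name: outcome_rates(o) for name, o in OUTCOMES.items()}
labels = {name: vulnerability_level(r["sdc"]) for name, r in rates.items()}

names = sorted(OUTCOMES)
slope, intercept = fit_line(
    [FEATURE[n] for n in names],
    [rates[n]["masked"] for n in names],
)


def predict_masked_rate(x):
    """Predict the masked fault rate for an unseen program from its feature value."""
    return slope * x + intercept
```

In a setting like the one described above, the derived labels would serve as training targets for a classifier over many metrics rather than being the prediction itself, and the single hypothetical feature would be replaced by the collected program, hardware usage, and performance metrics.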

Data availability

We share all source code related to our prediction environment, our configuration files, and the metrics collected from both the profiler and the simulator in our GitHub repository: https://github.com/topcuburak/FaultPredictionOnGPGPUs.

Notes

  1. https://github.com/topcuburak/FaultPredictionOnGPGPUs.

Acknowledgements

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 119E011.

Author information

Contributions

BT ran the experiments and wrote the main manuscript. IO defined the methodology, analyzed the results, and contributed to writing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Işıl Öz.

Ethics declarations

Conflicts of interest

The authors declare that they have no competing interests.

Consent to participate

Not applicable

Consent for publication

Not applicable

Ethical approval

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Topçu, B., Öz, I. Soft error vulnerability prediction of GPGPU applications. J Supercomput 79, 6965–6990 (2023). https://doi.org/10.1007/s11227-022-04933-2

  • DOI: https://doi.org/10.1007/s11227-022-04933-2
