1 Introduction

In modern computing, massive parallelism is ubiquitous whenever the application allows it. It saves time and enables computations that would otherwise be unfeasible. Despite these advantages, the higher the number of processes/threads involved, the larger the probability that one of them fails and affects the computation. Cosmic ray radiation is one possible source of such errors. When a high-energy particle interacts with the atmosphere, it creates a cascade of secondary particles that reaches the Earth’s surface. Among those particles, neutrons can interact with the hardware, flipping one or more bits in the registers and perturbing the computation. The probability of such an error depends on the energy and flux of the incoming neutrons as well as on the density of processors per unit volume. Several types of failure can be distinguished according to their after-effects. The most undesirable case is known as a Silent Error (SE), which corrupts the computation output in an undetectable manner. As a consequence, software and hardware resilience to radiation-induced errors is a key issue in computing science on the road to exascale. More restrictive or more relaxed contingency plans to overcome those errors can be implemented depending on the Mean Time Between Failures (MTBF) of the hardware components.

In the present work, we perform a series of experiments devoted to studying the response of commercial hardware devices to a neutron flux originated by radioactive materials. The Neutron Standards Laboratory (LPN) is the Spanish national reference in neutron metrology and one of the installations that constitute the Ionizing Radiation Metrology Laboratory (LMRI) at CIEMAT. LPN is equipped with two neutron standards based on \(^{252}\)Cf and \(^{241}\)Am-Be neutron sources, currently used for calibration purposes. These sources provide well-characterised neutron fluxes that can be used to irradiate materials or devices. In our experiments, we irradiate computing nodes with various CPUs and GPUs.

Checkpointing and rollback recovery are the de facto general-purpose error recovery techniques. They employ checkpoints to periodically save the state of a parallel application so that, when an error strikes some process, the application can be restored to one of its former states. Nevertheless, the problem can also be approached from different perspectives.

The work carried out so far on characterizing the effect of cosmic radiation on computing processors has mainly focused on demonstrating that this effect translates into silent errors and on proposing solutions (L1 bypass, metal shielding, etc.). It has also produced noticeable results in characterizing the radiation effects on different computing units (memories, processors, or modern accelerators), but with a neutron flux that is huge compared with the actual cosmic radiation received at sea level (around six orders of magnitude higher in the referenced GPU and Xeon Phi studies, see for example [1, 2]). In order to better determine the actual effect of cosmic rays, it is therefore necessary to correlate the dose received with the flux that occurs at any latitude, longitude, and altitude around the world, together with the experimental rate of SE per unit neutron flux.

This paper is organised as follows: in Sect. 2 we briefly review the scientific work regarding silent errors and hardware irradiation; in Sect. 3 we describe the irradiation facility and numerical tests employed; in Sect. 4 we discuss the results and finally in Sect. 5 we present our conclusions.

2 Related work

2.1 Related work on overcoming silent errors

Silent data corruption has typically been studied by comparing the experimental output with the expected result [3]. In order to find a suitable solution, it is necessary to consider error criticality, that is, the impact of this corruption on the application or system (see initial results in [1]).

Considerable efforts have been directed to revealing silent errors. In [4], a comprehensive list of techniques and references can be found. Most of the current techniques combine redundancy at various levels with a variety of verification mechanisms. The classic approach is at the hardware level, where all computations are executed twice or even in triplicate, and majority voting is enforced in case of different results [5]. Another hardware-based error detection approach is proposed by Moody [6], namely the use of ECC (Error Correcting Code) memory, which can detect and even correct a fraction of errors; in practice it is complemented with software techniques because not all parts of the system are ECC-protected (in particular, logic units and registers inside the processing units). It thus becomes clear that a deeper integration between these two methods must be accomplished.

Regarding verification mechanisms via application specific information, they include memory scrubbing [7], fault-tolerant algorithms [8], Algorithm Based Fault Tolerance (ABFT) techniques [9], coding sparse matrix–vector multiplication kernels [10], coupling a higher-order with a lower-order scheme for ordinary differential equations [11] and critical MPI message validation [12].

A set of novel silent data corruption detectors leveraging support vector machine regression has been explored in [13]. Attempts have also been made to use genetic algorithms to detect such errors. These detectors have demonstrated high precision and recall, but only if they run immediately after an error has been injected. A neural network detector that can identify silent data corruptions even multiple iterations after they were injected is proposed in [14].

In the field of sparse systems, and for the sake of completeness, it is of interest to consider the preconditioned conjugate gradient method proposed by Chen [15] to work with sparse iterative solvers. Another iterative algorithm that deserves consideration is GMRES, which has recently begun to be analysed with soft fault error models [16]. Other articles have focused on evaluating algorithms for both stability and accuracy in the presence of faults [17], detecting software errors by using properties of the algorithm [18], or providing more information and control for the library or application in handling likely errors, such as different types of DRAM failures [8, 19].

Between generic and application-oriented verification mechanisms, approximate replication should be mentioned. This methodology can be applied either at the numerical-method [11] or at the hardware [20] level, by comparing the exact floating-point results with those of an approximate operator. It has shown promising results, but it is still not general enough, since each application needs to be manually complemented with the required computing kernels. Transparent and agnostic developments should therefore be designed; [21] has worked in this direction by combining an analytical model with replication and checkpointing.

Regarding studies on optimal checkpointing and/or verification periods, two models have mostly been used: (i) errors are detected after a certain delay following a probability distribution (typically, an exponential distribution); (ii) errors are detected through some verification mechanism [22]. In both cases, the period that minimises the dead time, i.e., the fraction of time where nodes do not perform useful computations, can be computed. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, the solution has been to determine the period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, thanks to the verification mechanism. The right replication level to detect and correct silent errors at scale is studied in [23]. A more realistic model, assuming that errors are detected through some verification/validation mechanism (re-computation, checksums, coherence, etc.), would reduce the overhead induced by this kind of solution. See for example [24], in which 7–70% less overhead than a full duplication technique with similar detection recall is achieved.
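To make the trade-off concrete, the classical Young/Daly first-order approximation for the fail-stop setting gives the checkpointing period that minimises the expected dead time. The sketch below is a simplified illustration under that classical model, not the exact formulation of [22] or [24]; the checkpoint cost and platform MTBF are hypothetical values.

```python
# Simplified sketch of the classical Young/Daly first-order approximation for the
# optimal checkpointing period (fail-stop setting); C and mu below are hypothetical.
import math

def optimal_period(C: float, mu: float) -> float:
    """Checkpointing period that minimises the expected overhead to first order."""
    return math.sqrt(2.0 * C * mu)

def expected_dead_time_fraction(C: float, mu: float) -> float:
    """First-order fraction of time lost to checkpoints and re-execution."""
    return math.sqrt(2.0 * C / mu)

C, mu = 5.0, 24.0 * 60.0                   # 5 min checkpoints, 24 h platform MTBF
print(optimal_period(C, mu))               # ~120 min between checkpoints
print(expected_dead_time_fraction(C, mu))  # ~8% dead time
```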

Traditionally, a general assumption for building a resilient recovery system has been to consider that each checkpoint forms a consistent recovery line and that silent errors strike according to a Poisson process. Additionally, application workflows here refer to a number of parallel tasks that exchange data at the end of their execution, i.e., the task graph is a linear chain, and each task (except maybe the first and the last one) reads data from its predecessor and produces data for its successor. Bearing this in mind, there exist results describing the efficiency of coping with silent errors by combining checkpointing with some verification mechanism [25]. Thus, it is possible to design a general-purpose technique based upon computational patterns that periodically repeat over time, i.e., to optimise the aforementioned trade-off between error-free overhead and execution time. These patterns interleave verifications and checkpoints, so a pattern minimizing the expected execution time can be determined. From this point on, it is possible to move to application-specific techniques via dynamic programming algorithms for linear chains of tasks as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra. Promising results can be seen in [26] with a sparse grid combination technique applied to several scientific fields. In the case of an application composed of a chain of tasks, the optimal and dynamic checkpointing strategy has been solved in [27], but there is still room for designing new algorithms exploiting a proper verification technique as well as for integrating multi-level checkpointing in order to cope with both fail-stop and silent errors. The User Level Failure Mitigation (ULFM) interface enables the implementation of resilient MPI applications, system runtimes, and programming language constructs by detecting and reacting to failures without aborting their execution [28].

Regarding ABFT algorithms, the (preconditioned) Conjugate Gradient and GMRES methods can be used for both detection and correction [29]. This can be extrapolated to any iterative solver that uses sparse matrix–vector multiplications and vector operations (non-stationary iterative solvers such as CGNE, BiCG, BiCGstab, etc.). ABFT also allows detecting soft errors in the LU Decomposition with Partial Pivoting (LUPP) algorithm [30], a widely used method with serious scalability limits.

Recent compilation studies on fault tolerance in exascale systems can be found in [31] and [32].

2.2 Related work on radiating hardware

To date, several articles have been published about the effect of radiation on computing hardware. It was first studied by Ziegler et al. in the late 1990s [33] by irradiating different 16-MB DRAM memories. They concluded that silent errors induced by cosmic radiation depend on the cell technology of the memories. In order to demonstrate to the manufacturers that the errors appearing in ASCI Q at Los Alamos Nat. Lab. were produced by cosmic rays, one of the servers was placed in front of a neutron beam, causing errors to spike. This evidence was complemented by studies devoted to radiation-induced soft errors in advanced semiconductor technologies [34] and to the dependence of neutron-induced single event upsets on bias voltage for CMOS SRAM [35].

Other experimental results related to the occurrence of silent errors are the single-bit ECC error rate of 350 \(\hbox {min}^{-1}\) and the double-bit error rate of once per day, both observed in the Jaguar supercomputer in 2006. The latter error class was detected, but not corrected, by ECC techniques.

Radioactive lead in the solder was observed to cause bad data in the L1 cache of BlueGene/L at Lawrence Livermore Nat. Lab., a fact that led to the need to bypass L1 and, consequently, to slower computations. Also, the U.S. Department of Energy’s Titan has a radiation-induced MTBF of the order of dozens of hours in its Kepler GPUs [36]. Hence, works elaborating on the direct dependence between silent errors and radiation have been published, focused on determining the reliability of GPUs [1] and Xeon Phis, also applying high-level fault injection [2]. These works also quantify and qualify radiation effects on the applications’ output by correlating the number of corrupted elements with their spatial locality and providing the mean relative error (dataset-wise) to evaluate the magnitude of radiation-induced errors. A comparison between the effects of high-energy and thermal neutrons on processor and memory error rates has been studied as well [19].

Fig. 1: Scheme of the experimental facility. Distances are expressed in millimetres

Fig. 2: Pictures of the experimental set-up

3 Experimental set-up

3.1 Irradiation facility

In this work, the Neutron Standards Laboratory (LPN) has been employed as the irradiation facility to study the response of hardware devices to neutron fluence [37].

LPN has an irradiation room, a bunker with dimensions of 7 m \(\times\) 9 m \(\times\) 8 m and 1.25 m thick walls, following ISO 8529-1 recommendations [38], as shown in Fig. 1 [37]. The calibration neutron sources are stored in water, which works as a very efficient neutron shield. They are remotely manipulated by means of a Cartesian manipulator and a launcher, which select the neutron source and move it from its storage position at the bottom of a pool to the irradiation position, 4 m above the ground. The equipment to be calibrated is placed in front of the neutron source, on an automated table that allows precise positioning at the desired height and distance. These systems are controlled from a control room outside the bunker. Some pictures of the facility are shown in Fig. 2.

The neutron sources used in the experiments are \(^{252}\)Cf and \(^{241}\)Am-Be, which are currently employed in routine calibrations. Both sources provide neutron spectra in the fast energy range, see Fig. 3, according to ISO 8529-1 recommendations [38]. First, the device to be irradiated is placed on the automated table. Then, using the remote systems, the neutron source is placed under this device, as seen in Fig. 2. The \(^{252}\)Cf is placed inside a cylindrical capsule with external dimensions of 9.8 mm (H) \(\times\) 7.9 mm (\(\varnothing\)). This capsule is itself inside a capsule holder, handled by the automated manipulation system (Fig. 2).

The Cf neutron source consists of 236 \(\mu\)g of \(^{252}\)Cf and, according to the calibration certificate, its emission rate was \(B_{{\textrm{Cf0}}} = 5.471\times 10^8\, {{\textrm{s}}}^{-1} \pm 2.6\% \,\, (2\sigma )\), measured by the National Institute of Standards and Technology (NIST, USA) on 12/05/2012. From this value, knowing that the half-life of \(^{252}\)Cf is \(T_{1/2} = 2.65\) years and assuming an exponential decay, the current emission rate can be calculated for the experiment dates of Sects. 4.1, 4.2, 4.3 and 4.4.
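For reference, the decay correction amounts to the following short calculation; the certified values are those quoted above, while the campaign date used below is only illustrative.

```python
# Decay-correct the certified 252Cf emission rate to a given campaign date.
# B0 and the half-life are quoted above; the campaign date is illustrative.
from datetime import date
from math import exp, log

B0 = 5.471e8                     # n/s, certified by NIST on 12/05/2012
T_half = 2.65                    # years, 252Cf half-life
t = (date(2020, 11, 15) - date(2012, 5, 12)).days / 365.25

B_now = B0 * exp(-log(2.0) * t / T_half)
print(f"B_Cf ~ {B_now:.2e} n/s")  # ~6e7 n/s, consistent with the campaign values of Sect. 4
```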

On the other hand, the \(^{241}\)Am-Be neutron source consists of a compacted mixture of \(^{241}\hbox {AmO}_2\) and \(^9\)Be powder, doubly encapsulated. This source has a nominal activity of 185 GBq and emits \(1.11\times 10^7\,{{\textrm{s}}}^{-1} \pm 1.4\%\,\, (2\sigma )\) neutrons, traceable to the Czech Metrology Institute (CMI) on 1/1/2012. The capsule has dimensions of 48.6 mm (H) \(\times\) 19.1 mm (\(\varnothing\)) and sits inside its own capsule holder. Considering the value of \(T_{1/2} = 432.6\) years for \(^{241}\)Am, the current emission rate can be determined for the irradiation dates; this value is practically constant across the different irradiation campaigns and very close to the original value, \(B_{{{\textrm{AmBe}}}} = 1.09\times 10^7\,{{\textrm{s}}}^{-1}\).

The irradiation distance is defined as the distance between the end of the capsule holder and the external surface of the server chassis. For safety reasons, the smallest irradiation distance is 0.5 cm. This is not the real distance between the source and the irradiated device, because it is necessary to take into account the capsule holder thickness (4 mm), the distance to the centre of the source and the distance from the external chassis of the server to the target device inside.

In order to have a more realistic estimation of the distance (d) between the radioactive source and the target hardware, we introduce a correction offset to the distance set by the remote system. We add the capsule holder correction (\(\sim 0.5\,\hbox {cm}\)) and half of the capsule size. This leads to offsets of +1.0 cm and +3.0 cm for the Cf and Am-Be distances, respectively. In the experiments, the source was placed perpendicular to and centred on the surface of the target chip to be irradiated. The distances inside the server were previously measured with a calliper. Nevertheless, we assume that the corrected distance has an uncertainty of \(\pm 0.5\,\hbox {cm}\), used in the error calculations of Sect. 4.

Fig. 3: Neutron energy spectra for both sources. Note that the \(^{252}\)Cf spectrum peaks at \(2-3\,{{\textrm{MeV}}}\) and the \(^{241}\)Am-Be one at slightly higher energies

From the experiments of Sect. 4 we can provide a rough estimation of the effective cross section of the hardware piece. We measure the error rate E by dividing the number of observed errors by the elapsed time, and then:

$$\begin{aligned} \sigma _{{{\textrm{eff}}}} \approx \frac{E \times 4 \pi d^2 }{B_i}, \end{aligned}$$
(1)

where \(B_i\) is the neutron emission rate (the subscript i labels the source) and \(B_i/(4\pi d^2)\) is the flux (neutrons per unit surface per unit time).
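As an illustration of Eq. 1, the following sketch computes the effective cross section for one (source, distance) configuration; the error count, irradiation time and distance are placeholders, not values taken from Table 1.

```python
# Effective cross section from Eq. 1; n_errors, t_min and d_cm are placeholders,
# not values taken from Table 1.
from math import pi

def sigma_eff(n_errors: int, t_min: float, d_cm: float, B: float) -> float:
    """Eq. 1: error rate times the sphere surface at distance d, over the emission rate."""
    E = n_errors / (t_min * 60.0)          # observed error rate in s^-1
    return E * 4.0 * pi * d_cm**2 / B      # cm^2

# Example: 30 errors in 120 min at a corrected distance of 5 cm from the Cf source
print(sigma_eff(30, 120.0, 5.0, 5.48e7))   # ~2e-8 cm^2
```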

3.2 Hardware description

The hardware irradiated in this work is standard in many HPC data centres. We are provided with two modern computing servers featuring two different CPU models, and with three GPUs relying on different lithography technologies, with transistor sizes ranging from 7 nm to 14 nm. The CPUs are an Intel Xeon CLX Gold 6230 (Cascade Lake), 20 cores (2.1 GHz), and an AMD EPYC Rome 7282, 16 cores (2.8 GHz). In terms of GPUs, we have a T4, a Tesla V100 16 GB and an A100 40 GB, all manufactured by NVIDIA Corp. but based on different microarchitectures, namely Turing, Volta and Ampere, respectively.

In preparation for every experiment, we removed the unnecessary hardware from the servers. The RAM modules were reduced to the minimum and a single hard disk was used, keeping only the components strictly needed for running the OS and the programs. In this way, we minimise any possible effect induced in those components by the neutron radiation. We also shield, when possible, the parts of the server that should not receive neutrons, especially the hard disk, with 1 cm thick polyethylene tiles.

3.3 Numerical experiments

Since we are irradiating both CPUs and GPUs, we prepare different numerical experiments for each case. The computation itself is irrelevant, but we design it so the CPU/GPU is running at full capacity. The output of those experiments is monitored during the irradiation to detect any errors.

A basic CentOS 7.9 Linux operating system is installed on each computing node. On top of that, Python is installed via Anaconda. The CUDA 11.6 and cuDNN 8 libraries are also installed on the nodes in order to use the GPUs for scientific computing.

For the CPU experiments, we use the so-called power method for finding the largest eigenvalue of a symmetric matrix [39]. This is a simple, iterative procedure, and if we fix the random seed we can exactly reproduce the computation and track numerical errors during the irradiation. We can modify the size of the matrix and the number of iterations to control memory usage and execution time. Figure 4 shows a reference execution, run prior to the experiments, and another execution where we artificially introduced some random noise. This procedure allows the identification of any anomaly in the computation, either by looking at the value of the eigenvalue as a function of the iterations, or by computing the root mean square (RMS) difference between both time series. If this number is greater than the machine precision (in this case \(\sim 10^{-16}\)), then we have an indication of an unexpected modification of a numerical variable in the hardware.
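The following sketch illustrates this test; the matrix size, number of iterations and detection threshold are illustrative choices, not the exact parameters used in the experiments.

```python
# Power-method CPU stressor with a fixed seed, compared against a reference run.
# Matrix size, iteration count and the 1e-12 threshold are illustrative choices.
import numpy as np

def power_method_trace(n=4000, iters=500, seed=0):
    """Eigenvalue estimate at every iteration; deterministic for a fixed seed."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2.0                    # symmetric matrix
    v = rng.standard_normal(n)
    trace = np.empty(iters)
    for k in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)
        trace[k] = v @ A @ v               # Rayleigh quotient estimate
    return trace

reference = power_method_trace()           # computed once before the irradiation
monitored = power_method_trace()           # re-run cyclically during the irradiation
rms = np.sqrt(np.mean((monitored - reference) ** 2))
if rms > 1e-12:                            # well above machine precision (~1e-16)
    print(f"possible silent error, RMS = {rms:.3e}")
```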

Fig. 4: Executions of the power method: left, a reference execution; right, an execution artificially perturbed in a single value of the eigenvector

In the case of GPUs, we take advantage of their huge parallelisation capabilities and train a deep neural network image classifier. We use the well-known Fashion-MNIST dataset [40]: a set of 60,000 greyscale images of fashion articles (handbags, shoes, shirts, etc.) with 28 \(\times\) 28 pixel resolution. We code a convolutional neural network (CNN) classifier and train this model on the whole dataset. The model is composed of several convolutional and max-pooling layers, with a final softmax layer that performs the classification. The number of layers and of neurons in each layer is adjusted to the maximum size allowed by the GPU memory. Additionally, we use a batch size of 60,000 (the whole dataset) so that the GPU works at full capacity. The GPU usage level is checked with the nvidia-smi tool. Since the neural network training produces a time series of loss function values, we can detect SE in the same way as in the CPU tests.
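A minimal sketch of this GPU stressor is given below, assuming PyTorch is used; the layer sizes and the number of epochs are illustrative, whereas in the experiments the model was scaled up to fill the available GPU memory.

```python
# Sketch of the GPU stressor: a small CNN trained on Fashion-MNIST with the whole
# dataset as a single batch. Layer sizes and epoch count are illustrative; in the
# experiments the model was scaled up to fill the available GPU memory.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

torch.manual_seed(0)                                  # reproducible loss curve
device = torch.device("cuda")

data = datasets.FashionMNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(data, batch_size=60000)   # full dataset per step

model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 7 * 7, 10),          # CrossEntropyLoss applies softmax
).to(device)

opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

loss_history = []                                     # compared against a reference run
for epoch in range(50):
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        opt.step()
        loss_history.append(loss.item())
```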

Both numerical experiments are tuned to last approximately five minutes, and the computations run cyclically during the irradiation experiments. When an experiment is over, the neutron source is returned to the safety pool. We can then check the outputs and OS kernel messages and quantify and classify any errors. Knowing the irradiation time, we can estimate the MTBF of the target piece of hardware. This experimental procedure requires constant supervision, since some errors can freeze the OS or even reboot the computer, which may require human intervention.

We will differentiate three categories of errors according to their effects on the irradiated hardware from the application point of view:

  • Auto-corrected Errors (AE): these errors appear and are immediately corrected by the computing node itself using ECC techniques. They are harmless and do not modify the computation output in any way. AE can be detected either by a message in the application terminal or in the system logs.

  • Catastrophic Errors (CE): in this case, the neutron hits a critical component of the hardware in an irrecoverable way. The computation is then interrupted, and the OS may deactivate the hardware component, or the system may even block and have to be rebooted. In any case, CE are easy to identify and require manual intervention by the user. Notice that in a parallel job a single CE in one of the nodes can stop the whole computation, with the consequent waste of resources.

  • Silent Errors (SE): as mentioned in the Introduction, this error category is by far the most dangerous. Silent errors are not detected by the OS and have no detectable impact on the hardware, but they change the computation output. They therefore modify the results in an undetectable way and compromise the validity of the outcome. In some cases, this silent data corruption can be easily identified a posteriori by the user because it leads to unrealistic results. In other cases, the output is plausible, simply giving misleading values. Consequently, SE must be avoided by all means. An example of an SE would be, in a weather forecast code, an increment of 20 km/h in the wind speed prediction. Decisions based on these data can needlessly trigger hurricane protocols and alerts. Even worse, if the variation is −20 km/h, a hurricane forecast can go unnoticed, with all its consequences. Many other use cases, such as autonomous cars or aerospace engineering, are also very sensitive to SE.

Because the most common natural cause of SE is cosmic neutron radiation, this work focuses on experimentally characterizing the MTBF of different processors commonly used in supercomputers (both CPUs and GPUs) as a function of the incoming radiation. The experimental results obtained will thus complement previous estimations of the natural radiation at diverse data centres that currently host exascale computers [41].

4 Results and discussion

In this section we present the experimental results obtained with the previously described set-up. We divide the results into four groups, each corresponding to a \(\sim 2\) week campaign at the radiation facility, from November 2020 to April 2022. Within each campaign, we consider the emission rate of the Cf source to be constant. Recall that the Am-Be emission rate is considered constant over all campaigns due to its long half-life.

The numerical data for all the hardware considered can be found in Table 1. The uncertainties on the counts are estimated assuming that the events follow a Poisson distribution and applying quadratic propagation of errors. In the following subsections we discuss every case separately.
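For instance, combining the Poisson counting error with the \(\pm 0.5\) cm distance uncertainty of Sect. 3.1 through Eq. 1 gives a relative error on \(\sigma\) as sketched below; the counts and distance are placeholders, and the exact treatment behind Table 1 may differ in detail.

```python
# Quadratic propagation of the Poisson counting error and the +-0.5 cm distance
# uncertainty through Eq. 1 (sigma ~ N * d^2); placeholder inputs, the exact
# treatment behind Table 1 may differ in detail.
from math import sqrt

def relative_sigma_error(n_errors: int, d_cm: float, delta_d_cm: float = 0.5) -> float:
    rel_count = 1.0 / sqrt(n_errors)       # Poisson: delta N / N = 1 / sqrt(N)
    rel_dist = 2.0 * delta_d_cm / d_cm     # sigma scales with d^2, so the d term doubles
    return sqrt(rel_count**2 + rel_dist**2)

print(relative_sigma_error(30, 5.0))       # ~0.27, within the 10-30% range quoted below
```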

Table 1 Results of the irradiation experiments on five pieces of hardware using two radioactive sources. d is the distance to the source and t is the irradiation time

4.1 Intel Xeon Gold 6230 CPU

In the experiments with this processor, with a Cf emission rate of \(B_{{{\textrm{Cf}}}}=5.48\times 10^7\, {{\textrm{s}}}^{-1}\), we have observed many AE logged in the system. We can see in Table 1 that there is a clear positive correlation between the MTBF and the distance to the source. In Fig. 5 we plot the \(\mathrm{{MTBF}}_{{{\textrm{AE}}}}\) as a function of the distance from the irradiated computer to the radioactive source (Cf or Am-Be). Note that, because the number of events (AE, CE, SE) in Table 1 is not very large, the errors on the derived magnitudes are significant, in the range of 10–30%.

We fit the \(\textrm{MTBF}_{{{\textrm{AE}}}}\) to a function \(\mathrm{{MTBF}}_{{{\textrm{AE}}}}(d) = A\times (d+B)^2 + C\), taking into account the uncertainties in \(\mathrm{{MTBF}}_{{{\textrm{AE}}}}\). This function is chosen because the radiation intensity scales with \(d^{-2}\) for a perfectly isotropic point source, and consequently \(\mathrm{{MTBF}}_{{{\textrm{AE}}}}\sim d^2\). Here A plays the role of a constant that depends on the source, B represents a constant offset to the distance and C represents contributions from external sources, such as background radiation. Given the values of \(\chi ^2\) divided by the number of degrees of freedom, see Fig. 5, and the low statistics of the points at larger distances, we conclude that the data can be coarsely described by the parabola defined above. The shaded area in Fig. 5 represents the error intervals of the fits, calculated using quadratic error propagation with the standard deviations of the fit parameters. It is clear, then, that many other functional dependencies between \(\mathrm{{MTBF}}_{{{\textrm{AE}}}}\) and d are also compatible with the experimental data.
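A sketch of this weighted fit is shown below; the data arrays contain illustrative values only, not the measurements of Table 1.

```python
# Weighted fit of MTBF_AE(d) = A*(d+B)^2 + C; the arrays hold illustrative values
# only, not the measurements of Table 1.
import numpy as np
from scipy.optimize import curve_fit

def mtbf_model(d, A, B, C):
    # MTBF ~ d^2 for an isotropic point source, plus a distance offset B and a
    # background term C
    return A * (d + B) ** 2 + C

d_cm = np.array([1.0, 4.0, 8.0, 16.0])
mtbf_min = np.array([2.0, 5.0, 12.0, 40.0])
mtbf_err = np.array([0.4, 1.0, 2.5, 10.0])

popt, pcov = curve_fit(mtbf_model, d_cm, mtbf_min,
                       sigma=mtbf_err, absolute_sigma=True, p0=[0.1, 1.0, 0.0])
perr = np.sqrt(np.diag(pcov))              # 1-sigma errors of A, B, C
resid = (mtbf_min - mtbf_model(d_cm, *popt)) / mtbf_err
chi2_dof = np.sum(resid ** 2) / (len(d_cm) - len(popt))
print(popt, perr, chi2_dof)
```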

Fig. 5: MTBF for the AE error class in the Xeon Gold processor and its fit to a parabola

The effective cross section, see Eq. 1, is depicted in Fig. 6. Given the large uncertainties observed, some dependence of the cross section on the distance to the source remains, but it can be considered approximately constant within the obtained errors for \(d\gtrsim 8\) cm when the Cf source was used. In any case, we may estimate this cross section to be of the order of \(3-5\times 10^{-8} \, {{\textrm{cm}}}^2\) for both sources. Although the Am-Be source produces a smaller number of errors, its lower emission rate compensates for this and leads to a cross section similar to that of the Cf.

Fig. 6: Effective cross section for the AE error class in the Xeon Gold processor

Regarding the CE, we have observed very few of them. Their rate is very small, typically one every few hours, so there are not enough statistics to identify any trend as we did with the AE.

4.2 AMD 7282 EPYC Rome CPU

This CPU presents the best inherent resistance to radiation of all the tested devices. According to Table 1, the AMD CPU presents an AE rate of \(\approx 0.011(7)\) errors/min and a CE rate of 0.0009(7) errors/min. In terms of MTBF we have \(\mathrm{{MTBF}}_{{\textrm{AE}}}=8.9(6)\) min and \(\mathrm{{MTBF}}_{{\textrm{CE}}}=11(8)\times 10^2\) min. The corresponding effective cross sections are \(\sigma _{{{\textrm{AE}}}}= \,2.5(1.2)\times 10^{-9}{{\textrm{cm}}}^2\) and \(\sigma _{{{\textrm{CE}}}}= \,2.1(1.8)\times 10^{-11}{{\textrm{cm}}}^2\). Because the CE rates are very small and because of the time limitations of the LPN (it is important to highlight that this is the neutron standard facility in Spain and must provide continuous service to the domestic community), we did not perform experiments irradiating at larger distances or with the Am-Be source. The Cf activity is lower than in the previous section, \(B_{{\textrm{Cf}}}=4.13\times 10^7 {{\textrm{s}}}^{-1}\). Still, \(\sigma _{{{\textrm{CE}}}}\) for this processor is orders of magnitude smaller than for the other hardware, and almost compatible with zero within its error bars due to the small number of events measured. Note that, although this CPU was launched at a similar date to the aforementioned Xeon processor, the lithography technology is completely different: AMD encapsulates the computing cores in separate chiplets while halving the transistor size (7 nm), with the exception of the I/O die (14 nm).

4.3 NVIDIA A100 GPU

These experiments are performed with a Cf emission rate of \(B_{{\textrm{Cf}}}=4.22\times 10^7\, {{\textrm{s}}}^{-1}\). Similar to the other GPUs (see Sect. 4.4), no AE are observed in the system logs or on the terminal while executing the neural network training. Only CE and indications of SE are encountered.

In Fig. 7 we plot the \(\mathrm{{MTBF}}_{{\textrm{CE}}}\) as a function of the distance from the irradiated computer to the radioactive source (Cf or Am-Be). For short distances, we find an \(\mathrm{{MTBF}}_{{\textrm{CE}}}\) of roughly 20–30 min using Cf and larger than one hour using Am-Be. In Fig. 8 we observe that the effective cross section in this case is \(\approx 10^{-8}\,{{\textrm{cm}}}^2\).

Silent errors are detected by comparing the RMS difference between the learning curve (loss function value vs training time) of each CNN training and that of a reference run. If nothing disturbs the training, the RMS value is of the order of machine precision (\(O(10^{-16})\)). When an SE occurs, this value is larger. We have detected several dozen SE giving an RMS of \(O(10^{-2})\). In Fig. 9 we plot the fraction of silent errors over the total number of runs for each case of Table 1. We also ran some extra cases without radiation. We can see that all the values are essentially compatible with each other within their uncertainties. This indicates that the observed SE were not triggered by the radiation. However, since the experiments without radiation were the last ones, we cannot rule out that the radiation caused permanent damage to the A100 at the beginning of the campaign.

Fig. 7: MTBF for the CE error class in the A100 GPU

Fig. 8: Effective cross section for the CE error class in the A100 GPU

Fig. 9: Fraction of SE detected in the A100 GPU. The vertical black lines over each bar represent the statistical error of the measurements

4.4 NVIDIA V100 and T4 GPUs

The V100 GPU presents very strong resistance to incoming neutrons. As we can see in Table 1, having the neutron source as close as possible to the target produces very few events. No AE nor SE appeared, and only one CE occurred in the GPU when using Am-Be. The differences between the two sources can be attributed to their distinct activities and energy spectra (Fig. 3). The Cf emission rate is the largest among all our experiments: \(B_{{{\textrm{Cf}}}}=5.97\times 10^7\, {{\textrm{s}}}^{-1}\).

On the other hand, the T4 GPU suffered CE more often, especially with Cf. We detected a CE every 13 or 25 min, depending on the distance to the Cf source. The Am-Be source induces an \(\mathrm{{MTBF}}_{{\textrm{CE}}}\) one order of magnitude higher. Table 2 shows the estimation of the effective cross section (Eq. 1). Together with the absence of SE, this indicates that checkpointing techniques are appropriate in this radiation scenario.

Table 2 Effective cross section estimation for the T4 GPU

4.5 Scaling down the measurements to an estimate of errors under operating conditions

Considering the experiments documented in the previous subsections, we focus on the most sensitive devices: the NVIDIA A100 and T4 GPUs. A similar procedure would apply to the other processors.

We can approximate the \(\mathrm{{MTBF}}^{{{\textrm{CR}}}}\) caused by natural radiation (CR, cosmic-ray induced). We use the \(\sigma\) obtained in the experiments together with the natural neutron flux (\(F_{\text {CR}}\)) at the geographical location of the hardware. The mean time between failures due to natural radiation then reads:

$$\begin{aligned} \mathrm{{MTBF}}^{{{\textrm{CR}}}}= \frac{1}{\sigma \times F_{{\textrm{CR}}} }. \end{aligned}$$
(2)

Therefore, we also need to calculate \(F_{\mathrm{{CR}}}\). It is important to mention again that the interaction of neutrons with matter depends strongly on their energy, so the following estimation applies to the energy range provided by the neutron sources used here.

Bearing that in mind, we first obtained the natural neutron flux using the curve labelled M in Fig. 3.18 of [42]. We then fitted the data to a decaying power law and integrated it between 1 and 10 MeV to have an energy range overlapping with our sources (see Fig. 3). This led to a cosmic-ray neutron flux \(F_{CR} = 3.5\,{{\textrm{cm}}}^{-2}\,{{\textrm{h}}}^{-1}\) at sea level and \(50^{\circ }\) of latitude. We then calculated the neutron flux for the places where several supercomputers of interest are installed (see Table 3), by correcting for the corresponding latitude [43] and altitude [44] effects.
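A sketch of this step is given below; the digitised points are placeholders chosen to roughly reproduce the quoted value, not the actual curve of [42], and the latitude [43] and altitude [44] corrections are applied afterwards as multiplicative factors.

```python
# Fit a decaying power law to a few points digitised from the reference neutron
# spectrum (placeholder values, not the actual curve of [42]) and integrate it
# between 1 and 10 MeV to obtain the sea-level reference flux.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import curve_fit

E = np.array([1.0, 2.0, 5.0, 10.0])        # MeV, illustrative digitised points
phi = np.array([2.2, 1.0, 0.3, 0.1])       # n cm^-2 h^-1 MeV^-1, illustrative

powerlaw = lambda e, a, b: a * e ** (-b)
(a, b), _ = curve_fit(powerlaw, E, phi, p0=[2.0, 1.0])

F_ref, _ = quad(powerlaw, 1.0, 10.0, args=(a, b))
print(f"F_CR ~ {F_ref:.1f} n cm^-2 h^-1")  # ~3.5 at sea level and 50 deg latitude
# The site-specific fluxes of Table 3 follow by multiplying F_ref by the latitude
# [43] and altitude [44] correction factors of each location.
```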

Table 3 Location and calculated natural neutron flux (\(F_{{\mathrm {{CR}}}}\)) for supercomputers relying on NVIDIA A100 GPUs and ranked in the Top500 list in June 2022

Figure 8 shows the values of \(\sigma _{{\textrm{CE}}}\) for the NVIDIA A100. Taking into account that \(\sigma\) is underestimated when Eq. 1 is used, we can round the averaged value to \(\sigma _{{\textrm{CE}}}= 10^{-8}\,{{\textrm{cm}}}^2\). As an application, knowing that the Perlmutter supercomputer hosts 6,188 A100 devices, we can expect at least one CE event approximately every 4.6 months of operation. Other facilities with fewer cards are more influenced by their location. This is the case of the JUWELS Booster Module and the Polaris supercomputer, which will suffer at least one CE crash every year. As seen in Table 3, altitude is the most important factor (as for BioHive-1), but the geomagnetic conditions at every location also matter (Chervonenkis).
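The arithmetic behind these estimates is summarised in the sketch below; \(\sigma\) and the sea-level reference flux are the values quoted in the text, while the numbers actually reported use the site-specific fluxes of Table 3.

```python
# Scaling the measured cross section to operational MTBF (Eq. 2), per device and
# per system. sigma and the sea-level reference flux are quoted in the text; the
# reported numbers use the site-specific fluxes of Table 3 instead of F_ref.
HOURS_PER_YEAR = 24 * 365.25

sigma_ce = 1.0e-8              # cm^2, rounded A100 value
F_ref = 3.5                    # n cm^-2 h^-1 at sea level and 50 deg latitude

mtbf_device_h = 1.0 / (sigma_ce * F_ref)       # Eq. 2, single device
print(mtbf_device_h / HOURS_PER_YEAR)          # ~3.3e3 years per GPU

n_gpus = 6188                                  # e.g. Perlmutter's A100 count
mtbf_system_h = mtbf_device_h / n_gpus         # failure rates add across devices
print(mtbf_system_h / (24 * 30))               # ~6 months at the reference flux
```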

The mean time between silent errors, \(\mathrm{{MTBF}}^{{{\textrm{CR}}}}_{{\textrm{SE}}}\), can also be estimated. From Table 1, we calculate the averaged \(\sigma _{{\textrm{SE}}}\), resulting in \(5.6\times 10^{-9}\,{{\textrm{cm}}}^2\) with a larger uncertainty (\(\sim 40\%\)). The resulting mean times between events are almost twice the \(\mathrm{{MTBF}}^{{{\textrm{CR}}}}_{{{\textrm{CE}}}}\) values, as can be seen in Table 4.

Similarly, we can also estimate the impact of CE on the NVIDIA T4 GPU. According to the values of Table 2, we also take \(\sigma _{{\textrm{CE}}}= 10^{-8}\,{{\textrm{cm}}}^2\) for the T4; consequently, each card will individually experience, at least, a \(\mathrm{{MTBF}}^{{{\textrm{CR}}}}_{{\textrm{CE}}} \approx 3,200 \,{{\textrm{years}}}\) at sea level and \(50^{\circ }\) of latitude. However, the value drops to \(\approx 834 \,{{\textrm{years}}}\) in places such as Salt Lake City, US, where large companies devoted to the AI business are established. Moreover, it is common to extend the lifetime of a cluster by replacing or installing improved hardware parts, such as SSD disks or more memory slots. T4 cards are cheaper and thinner than V100/A100 devices, which makes them a good choice to amortise past investments. The Tetralith supercomputer at NSC in Sweden is an example of this: it was purchased in several phases in 2018, reached the 74th position of the Top500 list in June 2019, and was later upgraded with 170 T4 cards.

Table 4 Estimated mean times between failures and silent errors of several Top500 supercomputers relying on NVIDIA A100 GPUs

These numbers indicate that, unless a very large number of GPUs are working in parallel on the same problem, this kind of CE is very unlikely to have a serious impact, with the exception of large supercomputers. Our estimation is larger than, but still comparable to, that of the experiments documented in [36], although the hardware used in that work is older and more susceptible to CE. Regarding catastrophic errors in CPUs, Table 1 indicates that the expected \(\mathrm{{MTBF}}^{{\textrm{CR}}}_{{\textrm{CE}}}\) is one order of magnitude larger than for GPUs. Additional work is required to characterize the response of the devices to neutrons with energies below 1 MeV and above 10 MeV, which will reduce our estimation of \(\mathrm{{MTBF}}^{{\textrm{CR}}}_{{\textrm{CE}}}\) [2].

5 Conclusions

We have performed a series of irradiation experiments on a set of state-of-the-art processors commonly used for HPC (three GPU and two CPU models). Two neutron sources, \(^{252}\)Cf and \(^{241}\)Am-Be, were used for a total of 190 h of irradiation spanning 18 months. The hardware response to the radiation is characterised in terms of anomalies in its functioning. We have classified the errors as auto-corrected (AE), catastrophic (CE) and silent (SE), the last class being the most dangerous and undesirable. The experimental results are shown in Table 1.

Auto-corrected errors are harmless except for a possible small increase in the computation time, not detected in the current experiments. CE appeared mostly on GPUs, indicating that those devices are more sensitive to radiation than CPUs. On the other hand, the latter are more prone to AE, as can be checked in Figs. 6 and 8. The most sensitive GPU is the NVIDIA A100, followed by the T4 and finally the V100, which shows a higher resistance. These errors are easily detected by users or administrators, since they simply stop the computation or crash the computer; checkpointing techniques are appropriate for dealing with them. The most positive result of this work is the absence of silent errors in all the completed computations, except perhaps those in the NVIDIA A100, whose impact is still very low even in the presence of only natural radiation (see Sects. 4.3 and 4.5 for discussion). We can also conclude that a more precise estimation of any MTBF or \(\sigma\) would require a larger number of AE/CE/SE events. This can be accomplished by increasing the irradiation time and/or using a more intense radioactive source.

Taking these results into account, the administrators of supercomputing facilities can improve the contingency plans that need to be implemented via software (or even from an architectural point of view) to overcome potential errors due to the natural neutron radiation deposited on their infrastructure. Moreover, the researchers using these facilities can estimate the risk of silent errors in order to implement resilient algorithms or to check their results with additional calculations, if needed. A very approximate estimation of the mean time between failures for CE in part of the cosmic-ray spectrum for the most sensitive devices leads to thousands of years per unit. Thus, modern hardware combined with ECC techniques turns out to be very resilient to naturally occurring ionizing radiation at sea level. However, the increased error rates at altitude and the extra weight due to shielding could make this hardware unsuitable for the aerospace field. Furthermore, since the MTBF is inversely proportional to the number of hardware elements, accidents would multiply in a world where autonomous driving was predominant. For these reasons, shielding mechanisms and materials should be evaluated for those uses in future works.

According to the experimental results obtained, the latest hardware has incorporated automatic capabilities for overcoming silent errors, which were foreseen as critical a few years ago with the advent of exascale computing and the higher density of processors per square metre. In this sense, lithography improvements are reducing the size of transistors, but our results suggest that this size does not influence the resistance of the device. Nowadays, vendors have greatly improved the performance of their products and silent errors are rarely produced, but they still occur. Therefore, as new hardware continuously appears, it must be tested for radiation-induced errors prior to its installation in large supercomputers.