Response of HPC hardware to neutron radiation at the dawn of exascale

Every computation carries a small chance that an unexpected phenomenon ruins or modifies its output. Computers are prone to errors that, although possibly very unlikely, are hard, expensive or simply impossible to avoid. At the exascale, with thousands of processors involved in a single computation, those errors are especially harmful because they can corrupt or distort the results, wasting human and material resources. In the present work, we study the effect of ionizing radiation on several pieces of commercial hardware that are very common in modern supercomputers. Aiming to reproduce the natural radiation that could arise, CPUs (Xeon, EPYC) and GPUs (A100, V100, T4) are subjected to a known flux of neutrons coming from two radioactive sources, namely ²⁵²Cf and ²⁴¹Am-Be, in a special irradiation facility. The working hardware is irradiated under supervision to quantify any appearing error. Once the hardware response is characterised, we are able to scale down the radiation intensity and estimate the effects on standard data centres. This can help administrators and researchers to develop their contingency plans and protocols.


Introduction
In modern computations, massive parallelism is ubiquitous when the application allows it. It saves time and enables computations that would have been unfeasible otherwise. In spite of all these advantages, the higher the number of processes/threads involved, the larger the probability that one of them fails and affects the computation. Cosmic-ray radiation is one of the possible sources of such errors. When a high-energy particle interacts with the atmosphere, it creates a cascade of secondary particles that reaches the Earth's surface. Among those particles, neutrons can interact with the hardware, flipping one or more bits in the registers and perturbing the computation. The probability that such an error occurs depends on the energy and flux of the incoming neutrons, as well as on the density of processors per volume. There are several types of such failures, depending on their after-effects. The most undesirable case is known as a Silent Error (SE), which corrupts the computation output in an undetectable manner. As a consequence, software and hardware resilience to radiation-induced errors is a key issue in computing science towards the exascale. More restrictive or relaxed contingency plans to overcome those errors can be implemented depending on the Mean Time Between Failures (MTBF) of the hardware components.
In the present work, we perform a series of experiments devoted to studying the response of commercial hardware devices to a neutron flux originating from radioactive materials. The Neutron Standards Laboratory (LPN) is the Spanish national reference in neutron metrology and one of the installations that constitute the Ionizing Radiation Metrology Laboratory (LMRI) at CIEMAT. The LPN operates two neutron standards based on ²⁵²Cf and ²⁴¹Am-Be neutron sources, currently used for calibration purposes. These sources provide well-characterised neutron fluxes that can be used to irradiate materials or devices. In our experiments, we irradiate computing nodes with various CPUs and GPUs.
Checkpointing and rollback recovery are the de facto general-purpose error recovery techniques. They employ checkpoints to periodically save the state of a parallel application so that, when an error strikes some process, the application can be restored to one of its former states. Nevertheless, the problem can be approached from different perspectives as well.
Thus, the work carried out so far on characterizing the effect of cosmic radiation on computing processors has basically focused on demonstrating that it translates into silent errors and on proposing solutions (L1 bypass, metal shielding, etc.). These studies have also produced noticeable results in characterizing radiation effects on different computing units (memories, processors, or modern accelerators), but with a huge neutron flux compared with the actual cosmic radiation received at sea level (around six orders of magnitude higher in the referenced GPU and Xeon Phi studies, see for example [1,2]). Hence, in order to better determine the actual effect of cosmic rays, it is necessary to correlate the experimental rate of SE per unit neutron flux with the flux that occurs at any given latitude, longitude, and altitude around the world. This paper is organised as follows: in Sect. 2 we briefly review the scientific work regarding silent errors and hardware irradiation; in Sect. 3 we describe the irradiation facility and numerical tests employed; in Sect. 4 we discuss the results; and finally in Sect. 5 we present our conclusions.

Related work on overcoming silent errors
Silent data corruption has typically been studied by comparing the experimental output with the expected result [3]. In order to find a suitable solution, it is necessary to consider error criticality, that is, the impact of this corruption on the application or system (see initial results in [1]).
Considerable efforts have been directed to revealing silent errors. In [4], a comprehensive list of techniques and references can be found. Most of the current techniques combine redundancy at various levels with a variety of verification mechanisms. The classic approach is at the hardware level, where all computations are executed twice or even in triplicate, and majority voting is enforced in case of differing results [5]. Another hardware-based error detection approach, proposed by Moody [6], is the use of ECC (Error Correcting Code) memory, which can detect and even correct a fraction of errors; in practice, however, it is complemented with software techniques because not all parts of the system are ECC-protected (in particular, the logic units and registers inside the processing units). It is thus clear that a tighter integration between these two methods must be accomplished.
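At the software level, the hardware-style majority voting mentioned above can be illustrated with a minimal sketch (the function name and interface are ours, purely illustrative, not a scheme taken from the cited works):

```python
def tmr(f, *args):
    """Triple modular redundancy: run f three times, return the majority result.

    A software sketch of the hardware-level triplication with majority voting
    described above; real systems replicate at the circuit or process level.
    """
    results = [f(*args) for _ in range(3)]
    for r in results:
        if results.count(r) >= 2:
            return r
    raise RuntimeError("no majority: three distinct results")
```

A fault injected into one of the three replicas is outvoted by the other two; only two faults producing identical wrong results would go unnoticed.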
Novel silent data corruption detectors leveraging support vector machine regression have been explored in [13]. Attempts have also been made to use genetic algorithms to detect such errors. These detectors have demonstrated high precision and recall, but only if they run immediately after an error has been injected. A neural network detector that can identify silent data corruptions even multiple iterations after injection is proposed in [14].
In the field of sparse systems, and for the sake of completeness, it is of interest to consider the preconditioned conjugate gradient method proposed by Chen [15] for sparse iterative solvers. Another iterative algorithm that deserves consideration is GMRES, for which soft-fault error models have begun to be analysed [16]. Other articles have focused on evaluating algorithms for both stability and accuracy in the presence of faults [17], detecting software errors by using properties of the algorithm [18], or providing more information and control to the library or application for handling likely errors, such as different types of DRAM failures [8,19].

Between generic and application-oriented verification mechanisms, approximate replication should be mentioned. This methodology can be applied either at the numerical-method [11] or hardware [20] level by comparing the exact floating-point results with those of an approximate operator. It has shown promising results, but it is still not general enough, since each application needs to be manually complemented with the required computing kernels. Transparent and agnostic developments should therefore be designed, and [21] has worked in such a direction by combining an analytical model of replication and checkpointing.
Regarding studies on optimal checkpointing and/or verification periods, two models have mostly been used: (i) errors are detected after a certain delay following a probability distribution (typically, an exponential distribution); (ii) errors are detected through some verification mechanism [22]. In both cases, the optimal period that minimises the dead time, i.e., the fraction of time where nodes do not perform useful computations, can be computed. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure; in this case, the solution has been to determine the period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, thanks to the verification mechanism. The right replication level to detect and correct silent errors at scale is studied in [23]. A more realistic model assuming that errors are detected through some verification/validation mechanism (re-computation, checksums, coherence, etc.) would improve the overhead induced by this kind of solution. See for example [24], in which 7–70% less overhead than a full duplication technique with similar detection recall is achieved.
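For the first model, a standard first-order estimate of the optimal checkpointing period is the Young/Daly formula, T_opt = √(2·C·MTBF), with C the checkpoint cost; a sketch (our own illustration, with arbitrary example figures, not values from the cited works):

```python
from math import sqrt

def optimal_checkpoint_period(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order optimal checkpointing period.

    Minimises the expected wasted time (checkpoint overhead plus lost work
    after a failure), assuming exponentially distributed errors and a
    checkpoint cost much smaller than the MTBF.
    """
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# e.g. 60 s checkpoints and a 24 h system MTBF:
# optimal_checkpoint_period(60, 24 * 3600) ≈ 3.2e3 s (roughly 54 min)
```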
Traditionally, a general assumption for building a resilient recovery system has been to consider that each checkpoint forms a consistent recovery line and that silent errors strike according to a Poisson process. Additionally, application workflows here refer to a number of parallel tasks that exchange data at the end of their execution, i.e., the task graph is a linear chain, and each task (except maybe the first and the last one) reads data from its predecessor and produces data for its successor. Bearing this in mind, there exist results describing the efficiency of coping with silent errors by combining checkpointing with some verification mechanism [25]. Thus, it is possible to design a general-purpose technique based upon computational patterns that periodically repeat over time, i.e., to optimise the aforementioned trade-off between error-free overhead and execution time. These patterns interleave verifications and checkpoints, so a pattern minimizing the expected execution time can be determined. From this point on, it is possible to move to application-specific techniques via dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra. Promising results are shown in [26] with a sparse grid combination technique applied to several scientific fields. In the case of an application composed of a chain of tasks, the optimal and dynamic checkpointing strategy has been solved in [27], but there is still room for designing new algorithms exploiting a proper verification technique, as well as for integrating multi-level checkpointing in order to cope with both fail-stop and silent errors. The User Level Failure Mitigation (ULFM) interface enables the implementation of resilient MPI applications, system runtimes, and programming language constructs by detecting and reacting to failures without aborting their execution [28].
Regarding ABFT algorithms, the (preconditioned) Conjugate Gradient and GMRES methods can be used for both detection and correction [29]. This can be extrapolated to any iterative solver that uses sparse matrix-vector multiplications and vector operations (non-stationary iterative solvers such as CGNE, BiCG, BiCGstab, etc.). Also, ABFT allows detecting soft errors in the LU Decomposition with Partial Pivoting (LUPP) algorithm [30], a widely used method with serious scalability limits.
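The checksum idea behind such ABFT schemes can be sketched for a plain matrix-vector product (a toy dense version of what the cited sparse solvers protect; the function name is ours and purely illustrative):

```python
def abft_matvec(a, x):
    """Checksum-protected matrix-vector product (ABFT sketch).

    The column-sum vector w = 1ᵀA satisfies the invariant w·x == sum(A·x)
    (up to round-off), so a radiation-induced corruption of an entry of
    the result y becomes detectable by a single scalar comparison.
    """
    n = len(a)
    y = [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(n)]
    w = [sum(a[i][j] for i in range(n)) for j in range(len(x))]
    check = sum(wj * xj for wj, xj in zip(w, x))
    ok = abs(check - sum(y)) < 1e-9 * max(1.0, abs(check))
    return y, ok
```

In a real sparse solver the checksum row is computed once per matrix and carried along the iterations, so the per-iteration overhead is a dot product and a sum.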
Recent compilation studies on fault tolerance in exascale systems can be found in [31] and [32].

Related work on radiating hardware
To date, several articles have been published about the effect of radiation on computing hardware. The topic was initially studied by Ziegler et al. in the late '90s [33] by irradiating different 16-MB DRAM memories. They concluded that the silent errors induced by cosmic radiation depend on the memory cell technology used. In order to demonstrate to the manufacturers that the errors appearing in ASCI Q at Los Alamos Nat. Lab. were produced by cosmic rays, one of the servers was placed in front of a neutron beam, causing errors to spike. This evidence was complemented by studies devoted to radiation-induced soft errors in advanced semiconductor technologies [34] and to the dependence of neutron-induced single event upsets on bias voltage in CMOS SRAM [35].
Other experimental results related to the occurrence of silent errors are the single-bit ECC error rate of 350 min⁻¹ and the double-bit error rate of once per day, both observed in the Jaguar supercomputer in 2006. The latter error class was detected, but not corrected, by ECC techniques.
Radioactive lead in the solder causing bad data in the L1 cache was observed in BlueGene/L at Lawrence Livermore Nat. Lab., a fact that led to the necessity of bypassing L1 and, consequently, slower computations. Also, the U.S. Department of Energy's Titan has a radiation-induced MTBF on the order of dozens of hours in its Kepler GPUs [36]. Hence, works elaborating on the direct dependence between silent errors and radiation have been published, focused on determining the reliability of GPUs [1] and Xeon Phis, also applying high-level fault injection [2]. These works also quantify and qualify radiation effects on applications' output by correlating the number of corrupted elements with their spatial locality and providing the mean relative error (dataset-wise) to evaluate the radiation-induced error magnitude. A comparison between the effects of high-energy and thermal neutrons on processor and memory error rates has been carried out as well [19].

Irradiation facility
In this work, the Neutron Standards Laboratory (LPN) has been employed as the irradiation facility to study the response of hardware devices to neutron fluence [37].
The LPN has an irradiation room, a bunker with dimensions of 7 m × 9 m × 8 m, following ISO 8529-1 recommendations [38], with 1.25 m thick walls, as shown in Fig. 1 [37]. The calibration neutron sources are stored in water, which works as a very efficient neutron shield. They are remotely manipulated, making use of a Cartesian manipulator and a launcher, to select the neutron source and move it from its storage position at the bottom of a pool to the irradiation position, 4 m above the ground. The equipment to be calibrated is placed in front of the neutron source, on an automated table that allows precise positioning. The neutron sources used in the experiments are ²⁵²Cf and ²⁴¹Am-Be, which are currently employed in routine calibrations. Both sources provide neutron spectra in the fast energy range, see Fig. 3, according to ISO 8529-1 recommendations [38]. First, the device to be irradiated is placed on the automated table. Then, using the remote systems, the neutron source is placed under this device, as seen in Fig. 2. The ²⁵²Cf source is placed inside a cylindrical capsule with external dimensions of 9.8 mm (height) × 7.9 mm (diameter). This capsule is itself inside another capsule holder, handled by the automated manipulation system (Fig. 2).
The Cf neutron source consists of 236 mg of ²⁵²Cf and, according to the calibration certificate, its emission rate was B_Cf0 = 5.471 × 10⁸ s⁻¹ ± 2.6% (2σ), measured by the National Institute of Standards and Technology (NIST, USA) on 12/05/2012. From this value, and knowing the half-life of ²⁵²Cf (T₁/₂ = 2.645 years), the emission rate at the date of each irradiation can be computed. On the other hand, the ²⁴¹Am-Be neutron source consists of a compacted mixture of ²⁴¹AmO₂ and ⁹Be powder, doubly encapsulated. This source has a nominal activity of 185 GBq and emits 1.11 × 10⁷ s⁻¹ ± 1.4% (2σ) neutrons, traceable to the Czech Metrology Institute (CMI) on 1/1/2012. The capsule has dimensions of 48.6 mm (height) × 19.1 mm (diameter) and it is inside its own capsule holder. Considering the value of T₁/₂ = 432.6 years for ²⁴¹Am-Be, the current emission rate can be determined for the irradiation dates; this value is practically constant across the different irradiation campaigns and very close to the original one, B_AmBe = 1.09 × 10⁷ s⁻¹.
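The decay correction described above amounts to B(t) = B₀ · 2^(−t/T₁/₂); a sketch (the half-life is standard nuclear data, and the elapsed time of ~8.5 years is an illustrative figure for a 2020 campaign, not a value quoted in the text):

```python
def emission_rate(b0, half_life_years, elapsed_years):
    """Neutron emission rate after radioactive decay: B(t) = B0 * 2**(-t/T_half)."""
    return b0 * 2.0 ** (-elapsed_years / half_life_years)

# 252Cf calibrated at B0 = 5.471e8 1/s (NIST, May 2012); T_half = 2.645 y.
# For an irradiation campaign roughly 8.5 years later (illustrative):
b_cf = emission_rate(5.471e8, 2.645, 8.5)  # of order 6e7 1/s
```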
The irradiation distance is defined as the distance between the end of the capsule holder and the external surface of the server chassis. For safety reasons, the smallest irradiation distance is 0.5 cm. This is not the real distance between the source and the irradiated device, because it is necessary to take into account the capsule holder thickness (4 mm), the distance to the centre of the source, and the distance from the external chassis of the server to the target device inside.
In order to obtain a more realistic estimation of this distance (d) between the radioactive source and the target hardware, we introduce a correction offset to the distance set by the remote system. We add the capsule holder correction (∼0.5 cm) and half of the capsule size. This leads to offsets of +1.0 cm and +3.0 cm for the Cf and Am-Be distances, respectively. In the experiments, the source was placed perpendicular to, and centred on, the surface of the target chip to be irradiated. The distances inside the server were previously measured with a calliper. We nevertheless assume that the corrected distance has an uncertainty of ±0.5 cm, used in the later error calculations of Sect. 4.
In the experiments of Sect. 4 we are able to provide a rough estimation of the effective cross section σ of the hardware piece. We measure the error rate E by dividing the number of observed errors by the elapsed time, and then σ = E / (B_i / (4πd²)), where B_i is the neutron emission rate (the subscript i labels the source) and B_i/(4πd²) is the flux (neutrons per unit surface per unit time).
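This cross-section estimate can be evaluated directly; the sketch below follows the definitions above with purely illustrative numbers, not measured ones:

```python
from math import pi

def effective_cross_section(errors, time_s, distance_cm, emission_rate_s):
    """Effective cross section sigma = E / Phi, with error rate E = N/t and
    flux Phi = B / (4*pi*d**2) for an isotropic, point-like source."""
    error_rate = errors / time_s          # observed errors per second
    flux = emission_rate_s / (4.0 * pi * distance_cm**2)  # n / (cm^2 s)
    return error_rate / flux              # cm^2

# Illustrative: 10 errors in 100 s at d = 10 cm from a 1e7 1/s source.
sigma = effective_cross_section(10, 100.0, 10.0, 1e7)
```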

Hardware description
The hardware irradiated in this work is standard in many HPC data centres. We are provided with two modern computing servers, with two different CPU models, and with three GPUs relying on different lithography technologies, whose transistor sizes range from 7 nm to 14 nm. We have an Intel Xeon CLX Gold 6230 Cascade Lake, 20c (2.1 GHz), and an AMD EPYC Rome 7282, 16c (2.8 GHz). In terms of GPUs, we have a T4, a Tesla V100 16GB and an A100 40GB, all manufactured by the NVIDIA Corp. but based on different microarchitectures, namely Turing, Volta and Ampere, respectively. As preparation for every experiment, we removed the unnecessary hardware from the servers: the RAM modules were reduced to the minimum and a single hard disk was used, keeping only the components strictly needed for running the OS and the programs. In this way, we minimise any possible effect induced in those components by the neutron radiation. We also shield, when possible, the parts of the server that should not receive neutrons, especially the hard disk, with 1 cm thick polyethylene tiles.

Numerical experiments
Since we are irradiating both CPUs and GPUs, we prepare different numerical experiments for each case. The computation itself is irrelevant, but we design it so the CPU/GPU is running at full capacity. The output of those experiments is monitored during the irradiation to detect any errors.
A basic CentOS 7.9 Linux operating system is installed on each computing node. On top of that, Python is installed via Anaconda. The CUDA 11.6 and cuDNN 8 libraries are also installed on the nodes to use the GPUs for scientific computing.
For the CPU experiments, we use the so-called power method for finding the largest eigenvalue of a symmetric matrix [39]. This is a simple, iterative procedure, and if we fix the random seed then we can exactly reproduce the computation and track numerical errors during the irradiation. We can modify the size of the matrix and the number of iterations to control memory usage and execution time. Figure 4 shows a reference execution, run prior to the experiments, and another execution where we artificially introduce some random noise. This procedure allows the identification of any anomaly in the computation, either by looking at the eigenvalue estimate as a function of the iterations, or by computing the root mean square (RMS) difference of the two time series. If this number is greater than the machine precision (in this case ∼10⁻¹⁶), then we have an indication of an unexpected modification of a numerical variable in the hardware.
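A minimal pure-Python sketch of this reference-vs-irradiated comparison (not the authors' actual code; the matrix, seed and noise magnitude are illustrative):

```python
import random

def power_method(a, iters=100, seed=42):
    """Per-iteration largest-eigenvalue estimates for a symmetric matrix `a`
    (list of lists). Fixing the seed makes the run bit-reproducible, so a
    stored reference series can flag any perturbation of the computation."""
    rng = random.Random(seed)
    n = len(a)
    v = [rng.random() for _ in range(n)]
    history = []
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)     # converges to |largest eigenvalue|
        v = [x / norm for x in w]
        history.append(norm)
    return history

def rms_diff(ref, run):
    """RMS difference between a run and its reference series."""
    return (sum((x - y) ** 2 for x, y in zip(ref, run)) / len(ref)) ** 0.5
```

With identical seeds the two series are bit-identical (RMS exactly 0); a bit flip during the run shows up as an RMS far above the ∼10⁻¹⁶ machine-precision floor.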
In the case of GPUs, we take advantage of their huge parallelisation capabilities and train a deep neural network image classifier. We use the well-known Fashion MNIST dataset [40]: a set of 60,000 greyscale images of fashion articles (handbags, shoes, shirts, etc.), with 28 × 28 pixel resolution. We code a convolutional neural network (CNN) classifier and train this model with the whole dataset. The model is composed of several CNN and Maxpool layers, with a final Softmax layer that performs the classification. The number of layers and neurons in each layer is adjusted to the maximum size available depending on the GPU memory. Additionally, we use a batch size of 60,000 (the whole dataset) so the GPU works at full capacity. The GPU usage level is checked using the nvidia-smi tool. Considering that the neural network training produces a time series of loss function values, we can detect SE in the same way as in the CPU tests.
Both numerical experiments are tuned to last approximately five minutes, and the computations run cyclically during the irradiation experiments. When the experiment is over, the neutron source is returned to the safety pool. We can then check the outputs and OS kernel messages and quantify and classify any error. Knowing the irradiation time, we can estimate the MTBF of the target piece of hardware. This experimental procedure requires constant supervision, since some errors can freeze the OS or even reboot the computer, requiring human intervention.
We differentiate three categories of errors according to their effects on the irradiated hardware from the application point of view:
• Auto-corrected Errors (AE): these errors appear and are immediately corrected by the computing node itself using ECC techniques. They are harmless and do not modify the computation output in any way. AE can be detected either by a message in the application terminal or in the system logs.
• Catastrophic Errors (CE): in this case, the neutron hits a critical component of the hardware in an irrecoverable way. The computation is then interrupted, and the OS may deactivate the hardware component, or the system may even block and have to be rebooted. In any case, CE are easy to identify and require manual intervention by the user. Notice that in a parallel job, a single CE in one of the nodes can stop the whole computation, with the consequent waste of resources.
• Silent Errors (SE): as mentioned in the Introduction, this error category is by far the most dangerous. Silent errors are not detected by the OS and have no detectable impact on the hardware, but they change the computation output. They thus modify the results in an undetectable way and compromise the validity of the outcome. In some cases, this silent data corruption can easily be identified a posteriori by the user because it leads to unrealistic results. But in other cases, the output is plausible, simply giving misleading values. Consequently, SE must be avoided by all means. An example of a SE would be, in a weather forecast code, an increment of 20 km/h in the wind speed prediction. Decisions based on this data could needlessly trigger hurricane protocols and alerts. Or even worse, if the variation is −20 km/h, a hurricane forecast could go unnoticed, with all its consequences. Many other cases, such as autonomous cars or aerospace engineering, are also very sensitive to SE.
Because the most common natural cause of SE is cosmic neutron radiation, this work focuses on experimentally characterizing the MTBF of different processors commonly used in supercomputers (both CPUs and GPUs) as a function of the incoming radiation. The experimental results obtained will thus complement previous estimations of the natural radiation at diverse data centres that currently host exascale computers [41].

Results and discussion
In this section we present the experimental results obtained with the previously discussed set-up. We divide the results into four groups, each corresponding to a ∼2-week campaign at the radiation facility, between November 2020 and April 2022. Within each campaign, we consider the emission rate of the Cf source constant.
Recall that the Am-Be emission rate is considered as constant over all campaigns due to its long lifetime. The numerical data can be found in Table 1 for all the hardware considered. The uncertainties on the counts are estimated assuming that the events follow a Poisson distribution and applying quadratic propagation of errors. In the following subsections we will discuss every case separately.
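The Poisson treatment of the counts amounts to the following (our own helper functions; the 16 events in 240 min are an invented example, not a Table 1 entry):

```python
from math import sqrt

def rate_with_uncertainty(counts, minutes):
    """Error rate and its 1-sigma Poisson uncertainty: N/t +- sqrt(N)/t."""
    return counts / minutes, sqrt(counts) / minutes

def mtbf_with_uncertainty(counts, minutes):
    """MTBF = t/N, with the relative Poisson uncertainty 1/sqrt(N)
    propagated quadratically onto the mean time between failures."""
    mtbf = minutes / counts
    return mtbf, mtbf / sqrt(counts)

# e.g. 16 events in 240 min -> rate 0.067(17) 1/min, MTBF 15.0(3.8) min
```

For counts between ~10 and ~100 events, the relative uncertainty 1/√N lies between 10% and ~30%, consistent with the ranges discussed in the following subsections.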

Intel Xeon Gold 6230 CPU
In the experiments with this processor we have observed many AE logged in the system, with a Cf emission rate of B_Cf = 5.48 × 10⁷ s⁻¹. We can see in Table 1 that there is a clear positive correlation between the MTBF and the distance to the source. In Fig. 5 we plot MTBF_AE as a function of the distance from the irradiated computer to the radioactive source (Cf or Am-Be). Note that, because the number of events (AE, CE, SE) in Table 1 is not very large, the uncertainties on the derived magnitudes are important, being 10–30%.
We fit MTBF_AE to a function MTBF_AE(d) = A × (d + B)² + C, taking into account the uncertainties in MTBF_AE. This function is chosen because the radiation intensity scales with d⁻² for a perfectly isotropic point source, and consequently MTBF_AE ∼ d². Here A plays the role of a constant that depends on the source, B represents a constant offset to the distance, and C represents contributions from external sources, such as background radiation. Given the values of χ² divided by the number of degrees of freedom, see Fig. 5, we can conclude that the data can be coarsely described by the parabola we defined, given the low statistics of the points at larger distances. The shaded area in Fig. 5 represents the error intervals of the fits, calculated using quadratic error propagation with the standard deviations of the fit parameters. It is clear, then, that many other functional dependencies between MTBF_AE and d are also compatible with the experimental data. The effective cross section, see Eq. 1, is depicted in Fig. 6. Given the large uncertainties observed, a dependence of the cross section on the distance to the source still remains, but it can be considered approximately constant within the obtained errors for d ≳ 8 cm when the Cf source was used. In any case, we may estimate this cross section to be of the order of 3–5 × 10⁻⁸ cm² for both sources. Although the Am-Be source produces a smaller number of errors, its lower emission rate compensates and leads to a cross section similar to that of the Cf.
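For illustration, if the distance offset B is absorbed into the corrected distances of Sect. 3, the model reduces to MTBF_AE = A·d² + C, which is linear in x = d² and can be fit in closed form (a sketch with synthetic data, not the measured values; a full fit of the three-parameter model would need a nonlinear optimiser):

```python
def fit_quadratic_mtbf(distances_cm, mtbf_min):
    """Unweighted least-squares fit of MTBF(d) = A*d**2 + C.

    Linear in x = d**2, so the ordinary closed-form slope/intercept
    formulas apply (the weighted fit of the text additionally uses the
    MTBF uncertainties).
    """
    xs = [d * d for d in distances_cm]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(mtbf_min) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, mtbf_min)) / sum(
        (x - mx) ** 2 for x in xs
    )
    c = my - a * mx
    return a, c
```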
Regarding the CE, we have observed very few of them, typically one every few hours, so there are not enough statistics to find any tendency as we did with the AE.

AMD 7282 EPYC Rome CPU
This CPU presents the best inherent resistance to radiation of all the tested devices. According to Table 1, the AMD CPU presents an AE rate of ≈0.011(7) errors/min and a CE rate of 0.0009(7) errors/min. In terms of MTBF, we have MTBF_AE = 8.9(6) × 10 min and MTBF_CE = 11(8) × 10² min. The corresponding effective cross sections are σ_AE = 2.5(1.2) × 10⁻⁹ cm² and σ_CE = 2.1(1.8) × 10⁻¹¹ cm². Because the CE rates are very small, and due to the time limitations of the LPN (it is important to highlight that this is the neutron standard facility in Spain and must provide continuous service to the domestic community), we did not perform experiments irradiating at larger distances or with the Am-Be source. The Cf activity is lower than in the previous section, B_Cf = 4.13 × 10⁷ s⁻¹. Still, σ_CE for this processor is orders of magnitude smaller than for the other hardware, and almost compatible with zero within its error bars due to the small number of events measured. Note that although this CPU was launched on a similar date to the aforementioned Xeon processor, the lithography technology is completely different: AMD encapsulates the computing cores in separate chiplets while halving the transistor size (7 nm), with the exception of the I/O die (14 nm).

NVIDIA A100 GPU
These experiments are performed with a Cf emission rate of B_Cf = 4.22 × 10⁷ s⁻¹. As with the other GPUs (see Sect. 4.4), no AE are observed in the system logs or on the terminal while executing the neural network training. Only CE and indications of SE are encountered.
In Fig. 7 we plot MTBF_CE as a function of the distance from the irradiated computer to the radioactive source (Cf or Am-Be). For short distances, we find an MTBF_CE of roughly 20–30 min using Cf, and larger than one hour using Am-Be. In Fig. 8 we observe that the effective cross section in this case is ≈10⁻⁸ cm².
Silent errors are detected by comparing the RMS of the learning curves (loss function value vs. training time) of the CNN training with a reference run. If nothing disturbs the training, the RMS value is of the order of machine precision (O(10⁻¹⁶)). When a SE occurs, this value is larger. We have detected several dozen SE that give an RMS of O(10⁻²). In Fig. 9 we plot the fraction of silent errors over the total number of runs for each case of Table 1. We also ran some extra cases without radiation. We can see that all the values are essentially compatible with each other within their uncertainties. This indicates that the observed SE were not triggered by the radiation. However, since the experiments without radiation were the last ones, we cannot rule out that the radiation caused permanent damage to the A100 at the beginning of the campaign.

NVIDIA V100 and T4 GPUs
The V100 GPU presents very strong resistance to incoming neutrons. As we can see in Table 1, placing the neutron source as close as possible to the target produces very few events. No AE nor SE appeared, and only one CE occurred in the GPU, when using Am-Be. The differences between the two sources can be attributed to their distinct activities and energy spectra (Fig. 3). The Cf emission rate is the largest among all our experiments: B_Cf = 5.97 × 10⁷ s⁻¹.
On the other hand, the T4 GPU suffered CE more often, especially with Cf. We detected a CE every 13 or 25 min, depending on the distance to the Cf source. The Am-Be source induces an MTBF_CE one order of magnitude higher. Table 2 shows the estimation of the effective cross section (Eq. 1). Together with the absence of SE, this indicates that checkpointing techniques are appropriate in this radiation scenario.

Scaling down the measurements to an estimation of errors in operation status
Considering the experiments documented in the previous subsections, we will focus on the most sensitive devices: the NVIDIA A100 and T4 GPUs. A similar procedure would follow for the other processors. We can approximate the MTBF caused by natural radiation (CR, cosmic-ray induced) by using the effective cross section σ obtained in the experiments together with the natural neutron flux (F_CR) at the geographical location of the hardware. The mean time between failures due to natural radiation then reads MTBF^CR = 1/(σ F_CR). Therefore, we also need to calculate F_CR. It is important to mention again that the interaction of neutrons with matter depends strongly on their energy, so the following estimation applies to the energy range provided by the neutron sources used here.
Bearing that in mind, we first obtained the natural neutron flux using the curve labelled as M in Fig. 3.18 of [42]. We then fit the data to a decaying power law and integrated between 1 and 10 MeV to obtain an energy range overlapping with our sources (cf. Fig. 3). This led to a cosmic-ray neutron flux F_CR = 3.5 cm^-2 h^-1 at sea level and 50° of latitude. Then, we calculated the neutron flux for the places where several supercomputers of interest are installed (see Table 3). To do so, we calculated the expected flux at every location by correcting for the corresponding latitude [43] and altitude [44] effects. Figure 8 shows the values of σ_CE for the NVIDIA A100. Taking into account that σ is undervalued if Eq. 1 is used, we can round the averaged value to σ_CE = 10^-8 cm^2. As an application, knowing that the Perlmutter supercomputer contains 6,188 A100 devices, we can expect at least one CE event approximately every 4.6 months of operation. Other facilities with fewer cards are more influenced by their placement. This is the case of the JUWELS Booster Module and the Polaris supercomputer, which will suffer at least one CE crash every year. As Table 3 shows, altitude is the most important factor (as for Bio-Hive-1), but the geomagnetic conditions at every location also matter (Chervonenkis).
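The flux estimate above (a decaying power-law fit to the spectrum, integrated between 1 and 10 MeV) has a closed form, sketched below. The amplitude and exponent are placeholder values, not the parameters fitted to the curve in [42].

```python
def integrated_flux(amplitude, gamma, e_min=1.0, e_max=10.0):
    """Integral of a power-law spectrum phi(E) = A * E**(-gamma)
    over [e_min, e_max] MeV, valid for gamma != 1."""
    p = 1.0 - gamma
    return amplitude * (e_max ** p - e_min ** p) / p

# Placeholder spectrum, A = 1 and gamma = 2: the 1-10 MeV integral is 0.9
# in whatever flux units A carries.
f_est = integrated_flux(1.0, 2.0)
```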
The mean time between silent errors, MTBF^CR_SE, can also be estimated. From Table 1, we calculate the averaged σ_SE, resulting in 5.6 × 10^-9 cm^2 with a larger uncertainty estimation (∼40%). The resulting mean times between events are almost twice MTBF^CR_CE, as can be seen in Table 4. Similarly, we can estimate the impact of CEs on the NVIDIA T4 GPU. According to the values of Table 2, we also take σ_CE = 10^-8 cm^2 for the T4; consequently, an individual card will experience, at least, an MTBF^CR_CE ≈ 3,200 years at sea level and 50° of latitude. However, the value lowers to ≈ 834 years in places such as Salt Lake City, US, where large companies devoted to the AI business are established. Moreover, it is usual to extend the lifetime of a cluster by replacing or installing improved hardware parts, such as SSD disks or more memory slots. T4 cards are cheaper and thinner than V100/A100 devices, making them a good choice to amortise past investments. The Tetralith supercomputer at NSC in Sweden is an example of this: it was purchased in several phases in 2018, reached the Top500 74th position in June 2019, and was later upgraded with 170 T4 cards.
These numbers indicate that, unless a very large number of GPUs are working in parallel on the same problem, this kind of CE is very unlikely to have a serious impact, with the exception of large supercomputers. Our estimation is larger than, but still comparable to, the one documented in [36], although the hardware used in that work is older and more susceptible to CEs. Regarding catastrophic errors in CPUs, Table 1 indicates that the expected MTBF^CR_CE is one order of magnitude larger than for GPUs. Additional work is required to characterise the response of the devices to neutron energies below 1 MeV and above 10 MeV, which will reduce our estimation of MTBF^CR_CE [2].
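The cluster-level figures follow from the MTBF being inversely proportional to the number of devices. A minimal sketch, using the sea-level reference flux rather than the site-corrected values of Table 3 (so the result differs somewhat from the 4.6-month figure quoted for Perlmutter):

```python
def cluster_mtbf_hours(sigma_cm2, flux_cm2_h, n_devices):
    """Per-device MTBF = 1/(sigma * flux); N devices fail N times as often."""
    return 1.0 / (sigma_cm2 * flux_cm2_h * n_devices)

# A100-class cards at Perlmutter scale, sea-level reference flux:
months = cluster_mtbf_hours(1.0e-8, 3.5, 6188) / 730.0  # order of months
```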

Conclusions
We have performed a series of irradiation experiments on a set of state-of-the-art processors commonly used for HPC (three GPU and two CPU models). Two neutron sources, 252 Cf and 241 Am-Be, were used for a total of 190 h of irradiation spanning 18 months. The hardware response to the radiation is characterised in terms of anomalies in its functioning. We have classified the errors as auto-corrected (AE), catastrophic (CE) and silent (SE), the last class being the most dangerous and undesirable. The experimental results are shown in Table 1.
Auto-corrected errors are harmless except for a possible small increase in computation time, not detected in the current experiments. CEs appeared mostly on GPUs, indicating that those devices are more sensitive to radiation than CPUs; GPUs are also more prone to AEs, as can be checked in Figs. 6 and 8. The most sensitive GPU is the NVIDIA A100, followed by the T4 and finally the V100, which shows a higher resistance. These errors are easily detected by users or administrators, since the computation simply stops or the computer crashes; checkpointing techniques are appropriate for dealing with them. The most positive result of this work is the lack of silent errors in all the completed computations, except perhaps those in the NVIDIA A100, whose impact is still very low even in the presence of natural radiation alone (see Sects. 4.3 and 4.5 for discussion). We can also conclude that a more precise estimation of any MTBF or σ would require a larger number of AE/CE/SE events. This can be accomplished by increasing the irradiation time and/or using a more intense radioactive source.
Taking these results into account, the administrators of supercomputing facilities can improve the contingency plans to be implemented via software (or even from an architectural point of view) to overcome potential errors due to the natural neutron irradiation deposited on their infrastructure. Moreover, researchers making use of these facilities can estimate the risk of silent errors in order to implement resilient algorithms or check their results with additional calculations, if needed. A very approximate estimation of the mean time between failures for CEs, in part of the cosmic-ray spectrum, for the most sensitive devices leads to thousands of years per unit. Thus, modern hardware combined with ECC techniques turns out to be very resilient to naturally occurring ionizing radiation at sea level. However, the increased error rates at altitude and the weight excess due to shielding could make this hardware unfeasible for the aerospace field. Furthermore, as the MTBF is inversely proportional to the number of hardware elements, accidents would multiply in a world where autonomous driving was predominant. For these reasons, shielding mechanisms and materials should be evaluated for those uses in future works.
According to the experimental results obtained, the latest hardware has incorporated automatic capabilities for overcoming silent errors, which were foreseen as critical a few years ago with the advent of exascale computing and the higher density of processors per square metre. In this sense, lithography improvements are reducing the size of transistors, but our results suggest that size does not influence the resistance of the device. Nowadays, vendors have greatly improved the performance of their products and silent errors are scarcely produced, but they still occur. Therefore, as new hardware continuously appears, it must be tested for radiation-induced errors prior to its installation in large supercomputers.
Author Contributions AB is the main contributor to the manuscript and experiments. RM-G, AJR-M, RM, and HA have reviewed the text, datasets, and physics, and they have performed additional calculations. Experimental measurements in the Neutron Standards Laboratory were performed by its staff, RM, SR, FG, and XC.
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was partially funded by the Spanish Ministry of Science and Innovation under contract number FIS2017-88892-P and RTI2018-096006-B-I00 (CODEC-OSE) with European Regional Development Fund (ERDF) funds, and by the Comunidad de Madrid CABAHLA-CM project (S2018/TCS-4423).
Data Availability Meaningful data generated or analysed during this study are included in this published article. Training datasets and the software used during this study are publicly available following the links included in the References section. Other raw datasets generated or analysed during the current study are available from the corresponding author on reasonable request.