Introduction

The experiments at the Large Hadron Collider (LHC) at CERN handle large, high-dimensional datasets to find complex patterns or to identify rare signals in background-dominated regions—tasks where machine learning and especially deep learning [1, 2] provide considerable performance gains over traditional methods. The relevance of new deep learning technologies is expected to increase further as the era of the High-Luminosity LHC (HL-LHC) approaches [3]. However, studies aimed at understanding a neural network’s decisions demonstrate the relevance of explainability [4] and raise questions about the safety of systems that use artificial intelligence (AI), which is often perceived as a black box [4, 5]. Moreover, other studies show that small modifications of the inputs (adversarial examples) can severely affect the performance of neural networks [6, 7] (adversarial attack), a worrying prospect for a field that relies on simulation, which may at times be inaccurate. Careful exploration of the susceptibility to mismodelings is necessary to examine how severe these “intriguing properties of neural networks” [6] are in practice. Such effects could be driven by the fact that various popular classes of deep neural networks react linearly when exposed to linear perturbations, together with the large number of input variables [6, 7]. As such, this property is not in conflict with a neural network’s ability to approximate any function via a combination of non-linear activation functions [8], but the presence of (piecewise-)linear activation functions is sufficient to cause a severe impact on performance when evaluated on first-order adversarial examples [6, 7]. In computer vision and image recognition, it has been demonstrated that modifying only a single pixel is enough to “fool” a neural network [9].

We apply methods from AI safety [5, 10, 11] to the classification of jets based on the flavor of their initiating particle (a quark or gluon), so-called jet heavy-flavor identification (tagging) [12, 13]. Identifying the jet flavor plays an important role in various analysis branches exploited by experiments like CMS [12, 14] and ATLAS [15, 16], for example, for the observation of the decay of the Higgs boson to bottom (b) quark-antiquark pairs (\(\mathrm {H}\rightarrow \mathrm {b}\bar{\mathrm {b}}\)) [17,18,19]. Moreover, for analyses that also apply charm (c) tagging [12, 20,21,22], such as searches for the Higgs boson decaying to c quarks [23,24,25], multiclassifiers become increasingly important. Therefore, investigating the susceptibility to mismodeling could be even more relevant for c tagging. We probe the trade-off between performance and robustness to systematic distortions by benchmarking an established algorithm for jet flavor tagging with a realistic dataset. Early taggers used only the displacement of tracks to discriminate heavy- from light-flavored jets, which is possible due to the different lifetimes of the initiating hadrons. It is also possible to leverage information related to secondary vertices, giving rise to algorithms such as the (deep) combined secondary vertex algorithm [12].

Mismodelings can arise at various steps of the Monte Carlo (MC) simulation chain, starting with the hard process (matrix element calculation), followed by the subsequent steps that model the parton shower, fragmentation and hadronization, where the perturbative order is limited, and ending with the detector simulation [21], which introduces imperfections such as detector misalignment and calorimeter miscalibration. These imperfections in the modeling, particularly for variables with high discriminating power, demand the calibration of the discriminator shapes [21] and call for investigations of the tagger response to slightly distorted input data [10]. We use adversarial attacks to model systematic uncertainties induced by these subtle mismodelings that could be invisible to typical validation methods, as proposed in Ref. [10]. The approach followed in this study does not eliminate these mismodelings, nor does it provide a definitive a posteriori correction, but it helps in estimating to what extent tagging efficiency and misidentification rates could be affected [10, 19]. We assume that more adversarially robust models also generalize better when applied to a non-training domain [2, 26] (e.g. a model evaluated on data [10, 21]). To that end, we seek to modify the training to minimize the impact of adversarial attacks, without sacrificing performance.

Using Adversarial Training [26,27,28] to decrease the effect of simulation-specific artefacts, we show that the injection of systematically distorted samples during the training yields a successful defense strategy. In related works, Adversarial Training is employed through joint training of a classifier and an adversary [29], making use of gradient reversal layers to connect two networks or utilizing domain adaptation [30,31,32,33]. Other approaches towards regularization and generalization in the realm of high-energy physics include data augmentation or uncertainty-aware learning [34].

Dataset and Input Features

We use the Jet Flavor dataset [13]. These samples are generated with Madgraph5 [35] and Pythia 6 [36]. The detector response is simulated with Delphes 3 [37], using the ATLAS [15] detector configuration.

Jets are clustered with the anti-\(k_\text {T}\) algorithm [38] using the FastJet [39] package, with \(R=0.4\). Secondary vertices are reconstructed using the adaptive vertex reconstruction algorithm, as implemented in RAVE [40]. Parton matching within a cone of \(\varDelta R < 0.5\) is used to define the simulated truth labeling of jets. The targets fall in one of the three classes, depending on the jet flavor: light (up, down, strange quarks or gluons), charm, or bottom [13], where the heavier flavor takes precedence in case multiple partons are found. Using this hierarchy for light, charm and bottom, the flavor content is distributed among the classes as \(48.7\%:12.0\%:39.3\%\).

Input Features

A description of all input variables is given in Tables 1 and 2, and is based on Ref. [13]; here we only summarize the main categorization.

Input features are organized hierarchically. Low-level features consist of tracks and their helix parameters, along with the track covariance matrix. Additional information is taken from the relationship between each track and the associated vertex. Up to 33 tracks, sorted by impact parameter significance, are available per jet; however, we only consider the first six.

At jet level, expert (high-level) features are constructed as a function of the low-level inputs, for example by summing over all tracks or summing over secondary vertices, such as the weighted sum of displacement significances. Additionally, kinematic features of the jet are taken into account.

Missing or otherwise unavailable variables are filled with a convenient default value for later processing.

Preprocessing

The entire dataset consists of 11,491,971 jets, which are split randomly into training (\(72\%\)), validation (\(8\%\)) and test (\(20\%\)) sets. Input features are normalized such that they have a mean of 0 and a standard deviation of 1. The scaling is calculated using only the training dataset distributions, excluding the defaulted values. Defaulted input values are set just below the minima of the primary input distributions, ensuring no interference between regular and irregular (or missing) values. Minimizing the gap between the default value and the rest of the distribution improves training convergence. This technique of missing data imputation allows us to create fixed-length input shapes that are transferred to the first layer of a deep feed-forward neural network, and at the same time prevents vanishing or exploding gradients due to extreme default values [2, 41, 42].
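As an illustration of this preprocessing, the following sketch standardizes the features using statistics computed from the training set only and imputes missing entries with a per-feature default placed just below the observed minimum. The margin below the minimum and the NaN encoding of missing values are assumptions made for the example, not choices documented in the text.

```python
import numpy as np

def fit_scaler_and_defaults(X_train, margin=0.1):
    """Per-feature mean/std from non-missing training values and a default
    just below each feature's minimum (the margin is an illustrative choice)."""
    n_feat = X_train.shape[1]
    means, stds, mins = np.empty(n_feat), np.empty(n_feat), np.empty(n_feat)
    for j in range(n_feat):
        col = X_train[:, j]
        col = col[~np.isnan(col)]          # exclude defaulted/missing values
        means[j], stds[j], mins[j] = col.mean(), col.std(), col.min()
    defaults = mins - margin * stds        # just below the physical minimum
    return means, stds, defaults

def transform(X, means, stds, defaults):
    """Impute missing entries with the per-feature default, then standardize."""
    X = np.where(np.isnan(X), defaults, X)
    return (X - means) / stds
```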

Sample weights are calculated to exclude a potential flavor dependence of the classifier on the particular kinematic properties of the chosen dataset and to correct for the inherent class imbalance. The reweighting aims at identical kinematic distributions for all three flavors and is done with respect to the jet transverse momentum (\(p_\text {T}\)) and pseudorapidity (\(\eta\)) distributions [12]. The target shape is the average of the three initial distributions, which balances the relative fractions of the three classes at the same time. These distributions are binned into a 2D grid of \(50\times 50\) bins, spanning the ranges \((20, 900)~{\hbox {GeV}}\) in \(p_\text {T}\) and \((-2.5, 2.5)\) in \(\eta\), respectively. When calculating the loss per batch, the individual losses per sample are multiplied by these weights.
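The reweighting can be sketched as follows: two-dimensional \((p_\text{T}, \eta)\) histograms are filled per flavor, their average defines the common target shape, and each jet receives the ratio of the target occupancy to its own-flavor occupancy in its bin. The flavor encoding and variable names are illustrative assumptions.

```python
import numpy as np

def kinematic_weights(pt, eta, flavor, n_bins=50,
                      pt_range=(20.0, 900.0), eta_range=(-2.5, 2.5)):
    """Per-jet weights that reshape each flavor's (pt, eta) distribution toward
    the average of the three flavor distributions, balancing class fractions."""
    bins_pt = np.linspace(*pt_range, n_bins + 1)
    bins_eta = np.linspace(*eta_range, n_bins + 1)
    hists = {}
    for f in (0, 1, 2):                                    # light, charm, bottom
        sel = flavor == f
        hists[f], _, _ = np.histogram2d(pt[sel], eta[sel], bins=[bins_pt, bins_eta])
    target = (hists[0] + hists[1] + hists[2]) / 3.0        # common target shape
    i = np.clip(np.digitize(pt, bins_pt) - 1, 0, n_bins - 1)
    j = np.clip(np.digitize(eta, bins_eta) - 1, 0, n_bins - 1)
    w = np.ones_like(pt, dtype=float)
    for f in (0, 1, 2):
        sel = flavor == f
        w[sel] = target[i[sel], j[sel]] / np.maximum(hists[f][i[sel], j[sel]], 1.0)
    return w
```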

Methods

Reference Classifier

The studies are carried out on a jet flavor tagging algorithm similar in implementation to the ones used at the LHC experiments, such as ATLAS and CMS. We use a fully-connected sequential model with five hidden layers of 100 nodes each. We use dropout layers [43] with a \(10\%\) probability of zeroing out each neuron at each hidden layer to prevent overfitting. The Rectified Linear Unit (ReLU) activation function [2, 12, 41] is used for the hidden layers; the activation of the output layer is computed with the Softmax [2, 12] function. In total, there are 184 input nodes, where the low-level per-track features are flattened. We define three output classes, analogous to the dataset.
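A minimal PyTorch sketch of the reference architecture, with layer sizes taken from the description above; the Softmax is applied separately (e.g. inside the loss or at evaluation time), so the module returns logits.

```python
import torch.nn as nn

def build_tagger(n_inputs=184, n_hidden=100, n_layers=5, n_classes=3, p_drop=0.10):
    """Fully-connected reference tagger: ReLU hidden layers with dropout,
    returning raw logits (the Softmax is applied later)."""
    layers, width = [], n_inputs
    for _ in range(n_layers):
        layers += [nn.Linear(width, n_hidden), nn.ReLU(), nn.Dropout(p_drop)]
        width = n_hidden
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)

model = build_tagger()
```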

As the loss function, we use the categorical cross-entropy loss [41, 44], multiplied with an additional term that downweights easy-to-classify samples during training. The resulting formula for the so-called focal loss [45,46,47], evaluated for one batch of length N, is given as:

$$\begin{aligned} \frac{1}{\sum _{i=1}^N w_i} \sum _{i=1}^N w_i \sum _{j=1}^3 - (1-y_{ij})^\gamma \cdot \hat{y}_{ij} \log (y_{ij}), \end{aligned}$$
(1)

where \(y_{ij}\) is a placeholder for the output probability assigned to one of the three possible flavors j of the jet i, \(\hat{y}_{ij}\) can be understood as the one-hot-encoded truth label which is either 0 or 1, \(w_i\) is the sample weight obtained from preprocessing and \(\gamma\) is called the focusing parameter. Although we already treat the class imbalance by reweighting the nominal loss function, without the focusing term the neural network is prone to assigning the most frequent class. In this setting with highly imbalanced data, the chosen technique ensures smooth classifier output distributions, which we achieve by choosing a focusing parameter of \(\gamma =25\).
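A possible PyTorch implementation of the sample-weighted focal loss of Eq. (1) is sketched below; the small constant added inside the logarithm is a numerical-stability assumption, not part of the formula.

```python
import torch

def weighted_focal_loss(logits, targets, sample_weights, gamma=25.0, eps=1e-10):
    """Sample-weighted focal loss of Eq. (1): Softmax probabilities, one-hot
    truth labels, and a (1 - y)^gamma focusing term (illustrative sketch)."""
    probs = torch.softmax(logits, dim=1)
    onehot = torch.nn.functional.one_hot(targets, num_classes=probs.shape[1]).float()
    per_sample = -((1.0 - probs) ** gamma * onehot * torch.log(probs + eps)).sum(dim=1)
    return (sample_weights * per_sample).sum() / sample_weights.sum()
```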

Model parameters are updated with the Adaptive Moments Estimation (Adam) optimizer [48] using PyTorch’s [49] default settings, further controlled with a learning rate schedule [50] that starts at 0.0001 and decays proportionally to \(\left( 1+\frac{\text {epoch}}{30}\right) ^{-1}\). The batch size is fixed to \(2^{16}=65{,}536\). To ensure that there is no overfitting, training is stopped when the validation loss no longer improves [2]. For each training, the model’s parameters are saved after each iteration through the full training dataset (i.e. after each epoch) to store a checkpoint for later evaluation.
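The optimizer, learning-rate schedule, checkpointing and early stopping could look as follows; `train_one_epoch`, `validation_loss` and `max_epochs` are assumed helpers and settings for the sketch, and `model` refers to the network defined above.

```python
import torch

# Illustrative setup: Adam with default settings, learning rate
# lr(epoch) = 1e-4 * (1 + epoch/30)^-1, a checkpoint after every epoch,
# and a simple early-stopping criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 / (1.0 + epoch / 30.0))

max_epochs = 200                                          # illustrative upper bound
best_val = float("inf")
for epoch in range(max_epochs):
    train_one_epoch(model, optimizer, batch_size=2**16)   # assumed helper
    scheduler.step()
    torch.save(model.state_dict(), f"checkpoint_epoch{epoch}.pt")
    current = validation_loss(model)                      # assumed helper
    if current >= best_val:                               # stop when the validation loss no longer improves
        break
    best_val = current
```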

Evaluation Metrics

While multi-class taggers are convenient for implementation, for physics analysis purposes, one is often interested in constructing classifiers distinguishing two classes at a time. We take appropriate likelihood ratios of the bottom, charm and light output classes as needed for discrimination. The likelihood ratio XvsY for discriminating class X from Y is given as:

$$\begin{aligned} \frac{P(\text {X})}{P(\text {X}) + P(\text {Y})}. \end{aligned}$$
(2)

For example, for the BvsL discriminator, \(P(\text {X})\) and \(P(\text {Y})\) refer to the classifier’s scores for the bottom and light flavor classes, respectively. The performance of the binary classifiers is visualized and evaluated using Receiver Operating Characteristic (ROC) curves [51,52,53]. With some loss of information, a ROC curve is characterized by its area under the curve (AUC), which can be used as a reasonable single scalar proxy for the classifier performance [54]. It should be noted that, due to the large class imbalance in the available dataset, accuracy could be a misleading measure of performance [54].
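As a sketch, the discriminator of Eq. (2) and the corresponding ROC curve can be computed from the Softmax outputs as shown below; the use of scikit-learn, the toy data, and the class ordering (light, charm, bottom) are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def discriminator(probs, idx_x, idx_y):
    """Two-class likelihood ratio of Eq. (2), e.g. BvsL = P(b) / (P(b) + P(light))."""
    return probs[:, idx_x] / (probs[:, idx_x] + probs[:, idx_y])

# Toy stand-ins for the model's Softmax outputs and the truth labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=1000)            # 0 = light, 1 = charm, 2 = bottom
probs = rng.dirichlet(np.ones(3), size=1000)

bvsl = discriminator(probs, idx_x=2, idx_y=0)
mask = (labels == 2) | (labels == 0)              # keep only b and light jets
fpr, tpr, _ = roc_curve(labels[mask] == 2, bvsl[mask])
auc = roc_auc_score(labels[mask] == 2, bvsl[mask])
```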

Adversarial Attacks

One way to generate adversarial inputs is the Fast Gradient Sign Method (FGSM) [2, 7], which modifies the inputs in a systematic way such that the loss function increases. First, the direction of the steepest increase of the loss function around the raw inputs is computed. Mathematically, the operator that retrieves this direction of steepest increase is the gradient of the loss function with respect to the inputs. Once the direction is known, only its sign is kept; this vector is multiplied with a (small) limiting parameter \(\epsilon\) to specify the desired severity of the impact. Then, the nominal inputs are shifted by this quantity. FGSM can, therefore, be seen as a technique to maximally disturb the inputs, or maximally confuse the network, without necessarily manifesting in the input variable distributions.

Expressed in a single equation, the FGSM attack generates adversarial inputs \(x_\text {FGSM}\) from raw inputs \(x_\text {raw}\) by computing

$$\begin{aligned} x_\text {FGSM} = x_\text {raw} + \epsilon \cdot \mathrm {sgn}\left( \nabla _{x_\text {raw}}J(x_\text {raw},y)\right) , \end{aligned}$$
(3)

where \(\mathrm {sgn}(\alpha )\) stands for the sign of \(\alpha\). In Eq. (3), the loss function is denoted as \(J(x_\text {raw},y)\), a function of the inputs (\(x_\text {raw}\)) and targets (y). Moreover, the FGSM attack can be interpreted as a method that locally inverts the approach of gradient descent by performing a gradient ascent with the loss function, but in the input space [7, 26, 28]. Using the terminology of Ref. [26], this is a white box attack with full knowledge of the network (architecture and parameters).
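A compact PyTorch sketch of Eq. (3); `criterion` stands for the (weighted) loss used during training, and all names are illustrative.

```python
import torch

def fgsm_attack(model, criterion, x, y, epsilon):
    """Generate first-order adversarial inputs via Eq. (3):
    x_FGSM = x + epsilon * sign(grad_x J(x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = criterion(model(x_adv), y)
    loss.backward()                        # gradient of the loss w.r.t. the inputs
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```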

The corresponding visualization is shown in Fig. 1, for didactic reasons with only one input variable \(x_i\). In practice, this method is applied multidimensionally, assigning the same limiting parameter \(\epsilon\) in each input dimension.

Fig. 1 Visualization of the generation of adversarial inputs by applying the FGSM attack

Whereas the gradient of an arbitrary function could take any value, the distortion should stay within reasonable bounds to mimic the behaviour of possible mismodelings or differences between data and simulation [7, 10]. Therefore, we take only a small step in the direction of the gradient, which is expected to introduce practically unnoticeable changes of the input distributions [6, 7].

Increasing the number of inputs to the model also increases the susceptibility towards adversarial attacks, because each shift by \(\epsilon\) for additional features is propagated to the change in activation [7]. Thus it is conceivable that individual feature distributions remain almost unaffected, but the performance of the neural network is substantially deteriorated.

The FGSM attack does not necessarily replicate a global worst-case scenario [28]. Depending on the actual properties of the loss surface, the adversarial attack could also shift the inputs into local minima (or at least harmless regions) if the limiting parameter is chosen unfavorably. On average, for small distortions, the attack is still expected to maximally confuse the model up to first order in a given region.

In this implementation, the FGSM attack is applied neither to integer variables, such as the number of tracks, nor to defaulted values, as these would not be shifted by \(\epsilon\) in a physically meaningful way.

As large distortions of input variables would be easy to detect, a limit of \(25\%\) with respect to the original value is applied to the perturbation. The modified value \(x_\text {FGSM}\) is then given by Eq. (4), where x denotes the original input value, \(x'\) the transformed (preprocessed) value and \(\epsilon\) the FGSM scaling factor. Inverting the normalization is denoted by \(()^{-1}\).

$$\begin{aligned} x_\text {FGSM} = \Big ( x + \mathrm {sgn}\left( \nabla _{x}J(x,y)\right) \cdot \min \big \{&\left| \left( x' + \mathrm {sgn}\left( \nabla _{x}J(x,y)\right) \cdot \epsilon \right) ^{-1} - x\right| ,\nonumber \\&\left| 0.25 \cdot x\right| \big \} \Big )' \end{aligned}$$
(4)

Distortions of low-level features are not propagated to high-level features; instead, each feature is taken into account only via the multidimensional gradient. Therefore, correlations are not fully taken into account.
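A sketch of the restricted attack of Eq. (4) in PyTorch: the gradient sign is computed in the preprocessed space, the step is capped at 25% of the original (unscaled) value, and protected features (integers and defaults) are left untouched. The `mean`/`std` inversion of the normalization and the `protect_mask` argument are assumptions made for the example.

```python
import torch

def fgsm_attack_clipped(model, criterion, x_scaled, y, epsilon,
                        mean, std, protect_mask, rel_limit=0.25):
    """FGSM with the perturbation capped at rel_limit of the original value,
    following Eq. (4); protected features are returned unchanged."""
    x_adv = x_scaled.clone().detach().requires_grad_(True)
    criterion(model(x_adv), y).backward()
    sign = x_adv.grad.sign()

    x_raw     = x_scaled * std + mean                      # back to physical units
    x_shifted = (x_scaled + sign * epsilon) * std + mean   # unclipped FGSM, unscaled
    step_raw  = torch.minimum((x_shifted - x_raw).abs(), rel_limit * x_raw.abs())
    x_new_raw = x_raw + sign * step_raw
    x_new     = (x_new_raw - mean) / std                   # re-apply the preprocessing
    return torch.where(protect_mask, x_scaled, x_new).detach()
```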

Adversarial Training

The approach followed in this study is a simple type of Adversarial Training that injects perturbed inputs already during the training phase [26]. The algorithmic description is shown in Fig. 2. The difference between the nominal and the Adversarial Training is highlighted in red.

Fig. 2 Adversarial Training algorithm. The inputs are distorted with the FGSM attack prior to the forward and backward passes. The standard training algorithm, denoted in black, is based on Ref. [41]; the modified implementation for Adversarial Training is demonstrated in Ref. [11]

Fig. 3 Comparison of the nominal and Adversarial Training against the FGSM attack

In fact, in this approach the neural network never sees the raw inputs at any point during the training [26,27,28]. In Fig. 3, this is shown with the insertion of a red block prior to backpropagation. The idea is that by applying the FGSM attack continuously to the training data (for every minibatch, i.e. with every intermediate state of the model after updating the model parameters), the network is less likely to learn the simulation-specific properties of the used sample. Instead, the introduction of a saddle point into the loss surface is expected to improve the generalization capability of the network [2, 26, 28]. This can be understood as a “competition” between gradient descent to solve the outer minimization problem and gradient ascent to handle the inner maximization [28].
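The modified training loop can be sketched as follows, reusing the unrestricted FGSM sketch of Eq. (3) and the focal-loss sketch from above with \(\epsilon = 0.01\), the value quoted later in the text; the data loader yielding inputs, targets and sample weights is assumed.

```python
# Every minibatch is replaced by its FGSM-distorted counterpart before the
# forward and backward passes, so the model never trains on raw inputs.
for x, y, w in loader:
    x_adv = fgsm_attack(model,
                        lambda out, tgt: weighted_focal_loss(out, tgt, w),
                        x, y, epsilon=0.01)
    optimizer.zero_grad()
    loss = weighted_focal_loss(model(x_adv), y, w)
    loss.backward()
    optimizer.step()
```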

Madry et al. [28] have shown that this is an effective method to reduce susceptibility to first-order adversaries, obtained from an FGSM attack. In that sense, Adversarial Training could also be described as a regularization technique, but a more systematic one than only randomly smearing inputs (another example of data augmentation), randomly deleting connections (dropout), or assigning a probability to the different targets to be wrong (label smoothing) [2].

Fig. 4 Schematic overview of the inference process when comparing the robustness of both training strategies. Evaluation of the nominal training (green and blue paths) is described in “Vulnerability of the Nominal Training”, while the comparison for the Adversarial Training, including all four combinations, is described in “Improving Robustness Through Adversarial Training”

The principle behind this technique involves the linearity of neural networks, to which the high susceptibility to mismodelings is attributed. Adversarial Training can be interpreted as a method that adjusts the loss surface to be locally constant around the inputs and thereby reduces the impact of perturbations evaluated with a high-dimensional linear function [2]. Slightly distorted inputs then cannot significantly increase the value of the loss function, because it is almost flat in the vicinity of the raw inputs [55]. This can be seen as a geometrical problem in which the loss manifold is flattened [55,56,57,58]. When evaluating this adversarially trained model with distorted test inputs, the model should be more robust to those modifications, and the performance should not be affected as much as with the generic training. The price for the increased robustness is that the maximally achievable performance on raw inputs can be somewhat reduced with respect to the nominal training [2]. During Adversarial Training, the FGSM attack uses \(\epsilon =0.01\) when injecting adversarial samples, and no further restrictions are applied, i.e. the attack is not limited with respect to the relative scale of the impact on different values, and Eq. (3) holds.

Inference

The inference step is split into two separate parts, as shown in Fig. 4. First, the relevant samples need to be acquired. These can be either original (raw) samples or systematically distorted samples. Both trainings under consideration have their own respective loss surfaces, which continuously change during the training process. Therefore, samples that maximally deteriorate the performance of one model do not necessarily confuse another model. To cause a severe impact, the FGSM attack is applied individually per training. A similar argument can be made for different checkpoints of the training, where we also craft adversarial samples per epoch to reflect the model’s exact status and loss surface. After a fixed number of epochs or after convergence of both training strategies, this yields three different sets of samples: nominal samples (green, equal for both contenders), FGSM samples corresponding to the nominal training (blue), and FGSM samples created for the Adversarial Training (orange). These can then be injected into the different models for evaluation.

Fig. 5 Distributions of raw and systematically distorted inputs, for a set of features containing high- and low-level information. The displayed range for the signed impact parameter (\(d_0\)) of the first track has been clipped to the most relevant central region, where distortions naturally appear enhanced

Robustness to Mismodeling

Adversarial Attack

As we are interested in producing disturbances that would simulate the behaviour of systematic uncertainties, we verify that the distorted distributions remain within an envelope expected from the typical data-to-simulation agreement. The effect of the FGSM attack at two values of \(\epsilon\), compared to the nominal distribution, is shown for four input variables (both high- and low-level inputs) in Fig. 5. Even with the largest value of \(\epsilon = 0.05\) chosen for the following performance studies, the modifications of the input shapes remain marginal, within typical data-to-simulation agreements at the level of 10–20% [12].

Vulnerability of the Nominal Training

First, we establish how susceptible the nominal model is to the FGSM attack (mismodeling) of various magnitudes. Figure 6 shows the ROC curves for the BvsL (left) and CvsL (right) discriminators, on FGSM datasets generated with varying parameter \(\epsilon\) and on the nominal inputs. As expected, the model performs best on undisturbed test samples with an AUC of 0.946, but the performance decays quite quickly with increasing \(\epsilon\). At \(\epsilon =0.05\), which still causes only barely visible differences in the input distributions, the model reaches an AUC of 0.883. At a \(1\%\) mistag rate working point, this would correspond to a decrease in signal efficiency from 73 to 60%, requiring a scale factor of 0.82.

In the context of the ongoing hunt for better-performing classifiers, it is of interest to investigate the susceptibility in relation to the performance. Some insight can be gleaned by evaluating the performance of the classifier at various steps during the training on both the nominal and the perturbed datasets with a fixed \(\epsilon =0.05\), where an AUC value is calculated for each checkpoint. This dependence is shown in Fig. 7, again for the two discriminators. Not surprisingly, before the training performance saturates, longer training leads to an increase in nominal performance. At the same time, however, it shows higher vulnerability towards adversarial attacks; in fact, the performance on the perturbed datasets follows exactly the opposite trend. Another way to phrase this finding is that the least performant configuration (after only a few epochs, i.e. iterations through the full training dataset) shows the highest robustness, i.e. the gap between the dashed and solid lines is minimal.

Improving Robustness Through Adversarial Training

Fig. 6 ROC curves for the BvsL (left) and CvsL discriminator (right), using the nominal training and applying FGSM attacks of different magnitudes. The model is evaluated when the training has reached peak performance

Fig. 7 ROC curves for the BvsL (left) and CvsL discriminator (right), using the nominal training and applying FGSM attacks with \(\epsilon =0.05\) at various checkpoints of the training that each come with different nominal performance. Solid lines in different colors represent the nominal performance gain with an increased number of epochs; dashed lines show the corresponding performance on individually crafted FGSM samples for the particular checkpoints

In this subsection, the studies described above are repeated with the adversarial model, using the same setup for the attacks when performing the inference.

As a check of robustness, we perform a direct comparison of the nominal and Adversarial Training, crafting the FGSM samples individually per model, with the resulting ROC curves for the BvsL and CvsL discriminators shown in Fig. 8.

Fig. 8 ROC curves for the BvsL (left) and CvsL (right) discriminators, comparing the nominal with the Adversarial Training when applying the FGSM attack to both trainings individually. The nominal training is visualized in blue, the Adversarial Training in orange; solid lines depict nominal performance, dashed lines show the performance on distorted inputs for the nominal training, and dashed-dotted lines represent the systematically distorted samples for the Adversarial Training

The corresponding AUC values for BvsL are identical (0.946) and practically identical for CvsL (nominal: 0.759, adversarial training: 0.757). At the same time, the adversarial model maintains a high performance also when given systematically distorted samples, which can be seen from the dashed-dotted lines corresponding to the colors mentioned above. The ROC curve corresponding to FGSM samples crafted for and injected into the Adversarial Training (orange dashed-dotted line) lies much closer to that showing nominal performance (solid line) than what is observed for the ROC curves corresponding to the FGSM attack on the nominal training (blue lines). In numbers, this effect is best observed for the CvsL discriminator, where the decrease in performance is roughly \(21\%\) for the nominal training, but only \(8.2\%\) for the adversarial training, while the nominal performance of both models is nearly the same. Hence, we have shown that it is possible to build a more robust tagger that is simultaneously highly performant. A label leaking effect (see Ref. [59]), which refers to a better performance on adversarial examples than on undisturbed data for an adversarial model, is not observed.

Figure 9 compares the susceptibility to mismodeling of the two classifiers as a function of performance. FGSM samples have been generated individually for each model and checkpoint (each epoch denoted by a single point) to scan over different discrete stages of the training. The higher density of points in the high-performance region reflects the small improvements at later stages of the training, whereas the performance gain during the first few epochs is rapid. Ideally, there would be a constant relation showing no signs of decreasing robustness with increasing performance. However, we observe a considerable deterioration (and thus higher susceptibility to mismodeling) of the nominal classifier. The effect for the adversarial model, while still noticeable, is to a large degree mitigated.

Fig. 9 Relation between susceptibility and nominal performance for the nominal and adversarial training, tested on systematically distorted inputs with varying \(\epsilon\) in different colors. The x axis shows nominal performance, measured with the BvsL AUC, while the y axis shows the difference between disturbed and raw AUC. A drop on the y axis while moving to higher nominal performance (x axis) indicates higher susceptibility. The empty markers represent the nominal training, which becomes highly vulnerable with increasing nominal performance (the drop getting ever steeper), while the filled markers for the adversarial training show a much flatter relation

In fact, the adversarial training seems to recover some of its robustness (peaking at an AUC of around 0.938) before the robustness starts to worsen again at higher performance. Again, this shows the intriguing trade-off between performance and robustness for the nominal training, where training to the highest performance is not necessarily advisable due to the high susceptibility. The adversarial training, on the other hand, performs equally well on nominal samples and shows only a weak dependence between the performance on first-order adversaries and the respective undisturbed performance.

Probing Flavor Dependence of the Attack as a Proxy for Generalization Capability

In an attempt to understand why the adversarial model is more robust than the nominal classifier, we investigate nominal and perturbed input distributions of a selected feature, split by flavor. To visualize the geometric properties of the distorted samples, we intentionally choose a large distortion of \(\epsilon =0.1\); this corresponds to the regular FGSM attack described by Eq. (3) without the limitation described in Eq. (4). The signed impact parameter (\(d_0\)), shown in Fig. 10, originally offers discriminating power via the fact that heavy-flavor jets contain displaced tracks associated with a secondary vertex, which naturally leads to more positive values of the \(d_0\) variable. For light-flavored jets, this behaviour is not expected; instead, the tracks in light jets have a roughly symmetric \(d_0\) distribution, peaking at 0, apart from some skewness due to relatively long-lived, but light, hadrons (\(K^0_s\) or \(\varLambda\)) or contamination with tracks from heavy-flavor hadrons [12].

Fig. 10 Signed transverse impact parameter distribution for the first track, split by flavor, before (filled histograms) and after (lines) applying the FGSM attack for the nominal (top) and adversarial (bottom) models, respectively. Clearly asymmetric shapes are produced when using the FGSM attack for the loss function assigned to the nominal training, whereas applying the FGSM attack based on an adversarial model shows suppressed flavor dependence and relatively symmetric shapes. The attack uses the parameter \(\epsilon =0.1\), higher than the moderately chosen parameter of \(\epsilon =0.01\) used during the modified training loop

For the nominal training, light-flavor jets are shifted mostly into the positive region, which should be dominated by b jets; b jets are shifted into the negative region, where these jets were not abundant previously. From a geometric point of view, the FGSM attack on the nominal training produces asymmetric shapes. In contrast, the resulting perturbed input distributions for the adversarial training are symmetric. We observe that the adversarial model is almost agnostic to the direction into which the FGSM attack shifts the inputs, while the nominal training shows a clearly preferred direction that could be described as an inversion of the expected physics. For the adversarial training, the attack seems to have difficulties deciding which direction is worse, resulting in a perceived “coin-flipping” of the shift. Thus, the adversarial training remains less susceptible than the nominal training, even when the distortions are noticeably large.

It is conceivable that the different geometric properties of the distributions are related to the geometry of the loss surface [55,56,57,58]. This is expected to be responsible for differences in robustness as well. Figure 11 illustrates how the flatness of the loss surface in the vicinity of raw inputs could influence symmetric or asymmetric shifts.

A nominal training converges into a minimum associated with the original (undistorted) input distributions. In that case, for a given flavor, there will be a specific vector pointing away from a local minimum, and its direction is fixed according to the steepest increase in loss. The adversarial training always “sees” (new) adversarial inputs, so the adjustment of the model’s parameters might eventually average out over further training epochs. Always following the newly distorted inputs yields a locally constant loss manifold around the original inputs, due to the more complex saddle point problem. This would mean that not the exact memorization of training data, but rather higher-order correlations contribute to the improvement of the performance of the adversarial training [26, 28, 55, 56]. With the assumption of a flat loss surface close to the raw inputs, there would be no preferred direction for first-order adversarial attacks crafted for the adversarial model; many vectors would fulfill the criterion of pointing in the direction of increasing loss, much like choosing the direction randomly.

Thus, the geometric properties of the adversarial samples suggest that the loss landscape of the adversarial model is flat, leading to higher robustness [55, 56]. For mismodelings of order \(\epsilon\) that are still on-manifold, the adversarial training would then generalize better to data than the nominal training. Robustness and generalization are not equivalent [26, 28, 60], which is why the above statement cannot be general, but is only valid under the assumption that adversarial methods like the FGSM attack replicate mismodelings between simulation and detector data.

Conclusion

In this paper, we investigated the performance of a jet flavor tagging algorithm when exposed to systematically distorted inputs generated with an adversarial attack, the Fast Gradient Sign Method. Moreover, we showed how model performance and robustness are related. We explored the trade-off between the performance on unperturbed and on distorted test samples, investigating ROC curves and AUC scores for the BvsL and CvsL discriminators. All tests conducted with the nominal training confirm earlier findings that relate higher performance with higher susceptibility, now for a deep neural network that replicates a typical jet tagging algorithm. We applied a defense strategy to counter first-order adversarial attacks by injecting adversarial samples already during the training stage of the classifier, without altering the network architecture.

When comparing this new classifier with the nominal model, no difference in performance was observed, but the robustness towards adversarial attacks is enhanced by a large margin. As an example from the direct comparison of the two trainings, both reached an AUC score of approximately \(76\%\) when discriminating c from light jets, yet an FGSM attack that is still moderate in its impact on the input distributions decreases the performance of the nominal training by \(21\%\), but only by \(8.2\%\) for the adversarial training. A study of raw and distorted input distributions allowed us to relate geometric properties of the attack to geometric properties of the underlying loss surfaces for a nominal and an adversarially trained model, yielding a possible explanation for the higher robustness of the latter, attributed to the flatness of its loss manifold.

To some extent, the higher robustness as shown in this paper points at better generalization capability, but a study that will also utilize detector data has yet to be conducted to confirm this conjecture. The approach followed for this work is comparatively general, in that it only needs access to the model and the criterion. This is the first application of adversarial training to build a robust jet flavor tagger suitable for usage at the LHC.

It would be interesting to apply this type of attack and defense also to more complex neural network structures, to see if, for example, convolutional layers respond to adversarial attacks differently, and if adversarial training is as effective for taggers with a larger (or smaller) feature space. Another focus could be the use of adversarial methods of higher complexity, both for the attack and for the defense against it. Summarizing the efforts so far, adversarial training was applied successfully to resist first-order adversarial attacks on jet flavor tagging algorithms; corresponding studies with higher-order adversaries are left for future investigations.