Abstract
Deep learning is a standard tool in the field of high-energy physics, facilitating considerable sensitivity enhancements for numerous analysis strategies. In particular, in identification of physics objects, such as jet flavor tagging, complex neural network architectures play a major role. However, these methods are reliant on accurate simulations. Mismodeling can lead to non-negligible differences in performance in data that need to be measured and calibrated against. We investigate the classifier response to input data with injected mismodelings and probe the vulnerability of flavor tagging algorithms via application of adversarial attacks. Subsequently, we present an adversarial training strategy that mitigates the impact of such simulated attacks and improves the classifier robustness. We examine the relationship between performance and vulnerability and show that this method constitutes a promising approach to reduce the vulnerability to poor modeling.
Introduction
The experiments at the Large Hadron Collider (LHC) at CERN handle large, high-dimensional datasets to find complex patterns or to identify rare signals in background-dominated regions—tasks where machine learning and especially deep learning [1, 2] provide considerable performance gains over traditional methods. It is expected that the relevance of new deep learning technologies will increase, with the era of the High-Luminosity LHC (HL-LHC) approaching [3]. However, studies with the aim of understanding a neural network’s decisions demonstrate the relevance of explainability [4] and raise questions on the safety of systems that use artificial intelligence (AI), which is often perceived as a black-box [4, 5]. Moreover, other studies show that small modifications of the inputs (adversarial examples) can severely affect the performance of neural networks [6, 7] (adversarial attack), a worrying prospect for a field that is reliant on simulation, which might be at times inaccurate. Careful exploration of the susceptibility to mismodelings is necessary to examine how severe these “intriguing properties of neural networks” [6] are in practice. Such effects could be driven by the fact that various popular classes of deep neural networks react linearly when exposed to linear perturbations, together with the large number of input variables [6, 7]. As such, this property is not in conflict with a neural network’s ability to approximate any function via a combination of non-linear activation functions [8], but the presence of (piecewise-)linear activation functions is sufficient to cause severe impact on performance when evaluated on first-order adversarial examples [6, 7]. Applied to computer vision / image recognition, it has been demonstrated that modifications that involve only one pixel are enough to “fool” a neural network [9].
We apply methods from AI safety [5, 10, 11] to the classification of jets based on the flavor of their initiating particle (a quark or gluon), so called jet heavy-flavor identification (tagging) [12, 13]. Identifying the jet flavor plays an important role in various analysis branches exploited by experiments like CMS [12, 14] and ATLAS [15, 16], for example, for the observation of the decay of the Higgs boson to bottom (b) quark-antiquark pairs (\(\mathrm {H}\rightarrow \mathrm {b}\bar{\mathrm {b}}\)) [17,18,19]. Moreover, for analyses that also apply charm (c) tagging [12, 20,21,22], such as searches for the Higgs boson decaying to c quarks [23,24,25], multiclassifiers become increasingly important. Therefore, investigating the susceptibility to mismodeling could be even more relevant for c tagging. We probe the trade-off between performance and robustness to systematic distortions by benchmarking an established algorithm for jet flavor tagging with a realistic dataset. Early taggers included only the displacement of tracks as a way to discriminate heavy- from light-flavored jets, possible due to the different lifetimes of the initiating hadrons. It is also possible to leverage information related to the secondary vertices, giving rise to algorithms such as the (deep) combined secondary vertex algorithm [12].
Mismodelings can arise at various steps during the Monte Carlo (MC) simulation chain, starting with the hard process (matrix element calculation), followed by the subsequent steps that model the parton shower, fragmentation and hadronization, where the perturbative order is limited, and ending with the detector simulation [21], which introduces imperfections such as detector misalignment and calorimeter miscalibration. These imperfections in the modeling, particularly for variables with high discriminating power, demand the calibration of the discriminator shapes [21] and call for investigations of the tagger response to slightly distorted input data [10]. We use adversarial attacks to model systematic uncertainties induced by these subtle mismodelings that could be invisible to typical validation methods, as proposed in Ref. [10]. The approach followed in this study does not eliminate these mismodelings, nor does it provide a definitive a posteriori correction, but it helps in estimating to what extent tagging efficiency and misidentification rates could be affected [10, 19]. We assume that more adversarially robust models also generalize better when applied to a non-training domain [2, 26] (e.g. a model evaluated on data [10, 21]). To that end, we seek to modify the training to minimize the impact of adversarial attacks, without sacrificing performance.
Using Adversarial Training [26,27,28] to decrease the effect of simulation-specific artefacts, we show that the injection of systematically distorted samples during the training yields a successful defense strategy. In related works, Adversarial Training is employed through joint training of a classifier and an adversary [29], making use of gradient reversal layers to connect two networks or utilizing domain adaptation [30,31,32,33]. Other approaches towards regularization and generalization in the realm of high-energy physics include data augmentation or uncertainty-aware learning [34].
Dataset and Input Features
We use the Jet Flavor dataset [13]. These samples are generated with Madgraph5 [35] and Pythia 6 [36]. The detector response is simulated with Delphes 3 [37], using the ATLAS [15] detector configuration.
Jets are clustered with the anti-\(k_\text {T}\) algorithm [38] using the FastJet [39] package, with \(R=0.4\). Secondary vertices are reconstructed using the adaptive vertex reconstruction algorithm, as implemented in RAVE [40]. Parton matching within a cone of \(\varDelta R < 0.5\) is used to define the simulated truth labeling of jets. The targets fall in one of the three classes, depending on the jet flavor: light (up, down, strange quarks or gluons), charm, or bottom [13], where the heavier flavor takes precedence in case multiple partons are found. Using this hierarchy for light, charm and bottom, the flavor content is distributed among the classes as \(48.7\%:12.0\%:39.3\%\).
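The flavor hierarchy described above can be sketched as follows; this PDG-id-based helper is a hypothetical illustration of the precedence rule (b over c over light), not the dataset's actual labeling code:

```python
def jet_flavor_label(matched_pdg_ids):
    """Assign a jet flavor label from partons matched within dR < 0.5.

    Hypothetical helper: when several partons are matched, the heavier
    flavor takes precedence (b > c > light), as described in the text.
    """
    ids = {abs(p) for p in matched_pdg_ids}
    if 5 in ids:          # bottom quark
        return "b"
    if 4 in ids:          # charm quark
        return "c"
    return "light"        # u, d, s quarks or gluons (or no match)
```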
Input Features
A description of all input variables is given in Tables 1 and 2, and is based on Ref. [13]; here we only summarize the main categorization.
Input features are organized hierarchically. Low-level features consist of tracks and their helix parameters, along with the track covariance matrix. Additional information is taken from the relationship between each track and the associated vertex. Up to 33 tracks, sorted by impact parameter significance, are available per jet, however, we only consider the first 6.
At jet level, expert (high-level) features are constructed as a function of the low-level inputs, for example by summing over all tracks or summing over secondary vertices, such as the weighted sum of displacement significances. Additionally, kinematic features of the jet are taken into account.
Missing or otherwise unavailable variables are filled with a convenient default value for later processing.
Preprocessing
The entire dataset consists of 11,491,971 jets, which are split randomly into training (\(72\%\)), validation (\(8\%\)) and test (\(20\%\)) sets. Input features are normalized such that they have a mean of 0 and standard deviation of 1. The scaling is calculated using only the training dataset distributions, excluding the defaulted values. Defaulted input values are set just below the minima of the primary input distributions, ensuring no interference between regular and irregular (or missing) values. Minimizing the gap between the default value and the rest of the distributions improves training convergence. This technique of missing data imputation allows us to create fixed-length input shapes that are transferred to the first layer of a deep feed-forward neural network, and at the same time prevents vanishing or exploding gradients due to extreme values for the defaults [2, 41, 42].
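A minimal sketch of this imputation-aware standardization, shown for a single feature for brevity; the `gap` parameter is a hypothetical choice for how far below the scaled minimum the defaults are placed:

```python
import numpy as np

def fit_scaler(x_train, default_mask):
    """Compute mean/std from the training set only, excluding defaulted entries."""
    valid = x_train[~default_mask]
    return valid.mean(), valid.std()

def standardize_with_defaults(x, default_mask, mean, std, gap=0.1):
    """Standardize to zero mean / unit variance; place defaulted values
    just below the minimum of the scaled physical distribution."""
    z = (x - mean) / std
    z[default_mask] = z[~default_mask].min() - gap
    return z
```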
Sample weights are calculated to exclude a potential flavor dependence of the classifier on the particular kinematic properties of the chosen dataset and to correct for the inherent class imbalance. The reweighting aims at identical, kinematic distributions for all three flavors and is done with respect to the jet transverse momentum (\(p_\text {T}\)) and pseudorapidity (\(\eta\)) distributions [12]. The target shape is the average of the three initial distributions, thus balancing the relative fractions for the three classes at the same time. These distributions are binned into a 2D grid of \(50\times 50\) bins, spanning ranges between \((20{,}900)~{\hbox {GeV}}\) and \((-2.5,2.5)\), respectively. When calculating the loss per batch, these weights are multiplied to the individual losses per sample.
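The reweighting step might be sketched as follows; the \(50\times 50\) binning and target shape match the text, while the histogram normalization and bin-lookup details are assumptions:

```python
import numpy as np

def kinematic_weights(pt, eta, flavor, n_bins=50,
                      pt_range=(20.0, 900.0), eta_range=(-2.5, 2.5)):
    """Per-jet weights reshaping each flavor's (pT, eta) distribution
    toward the average of the per-flavor histograms on an n_bins^2 grid.
    Normalized (density) histograms also balance the class fractions."""
    bins = (np.linspace(*pt_range, n_bins + 1),
            np.linspace(*eta_range, n_bins + 1))
    hists = {f: np.histogram2d(pt[flavor == f], eta[flavor == f],
                               bins=bins, density=True)[0]
             for f in np.unique(flavor)}
    target = np.mean(list(hists.values()), axis=0)
    # look up each jet's bin; the weight is target/own-flavor density
    i = np.clip(np.digitize(pt, bins[0]) - 1, 0, n_bins - 1)
    j = np.clip(np.digitize(eta, bins[1]) - 1, 0, n_bins - 1)
    w = np.ones(len(pt))
    for f, h in hists.items():
        sel = flavor == f
        ratio = np.divide(target, h, out=np.zeros_like(h), where=h > 0)
        w[sel] = ratio[i[sel], j[sel]]
    return w
```

These weights are then multiplied into the per-sample losses when evaluating each batch.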
Methods
Reference Classifier
The studies are carried out on a jet flavor tagging algorithm similar in implementation to the ones used at the LHC experiments, such as ATLAS and CMS. We use a fully-connected sequential model with five hidden layers of 100 nodes each. We use dropout layers [43] with a \(10\%\) probability of zeroing out each neuron at each hidden layer to prevent overfitting. The Rectified Linear Unit (ReLU) activation function [2, 12, 41] is used for the hidden layers; the activation of the output layer is computed with the Softmax [2, 12] function. In total, there are 184 input nodes, where the low-level per-track features are flattened. We define three output classes, analogous to the dataset.
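A PyTorch sketch of this reference architecture (184 flattened inputs, five hidden layers of 100 ReLU nodes with 10% dropout, three softmax outputs); the exact layer ordering is an assumption:

```python
import torch
import torch.nn as nn

class FlavorTagger(nn.Module):
    """Fully-connected sequential tagger as described in the text."""
    def __init__(self, n_inputs=184, n_hidden=100, n_classes=3, p_drop=0.1):
        super().__init__()
        layers, width = [], n_inputs
        for _ in range(5):
            layers += [nn.Linear(width, n_hidden), nn.ReLU(), nn.Dropout(p_drop)]
            width = n_hidden
        layers.append(nn.Linear(width, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # return class probabilities; in practice one may keep the logits
        # and use a log-softmax-based loss for numerical stability
        return torch.softmax(self.net(x), dim=-1)
```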
As loss function, we use the categorical cross entropy loss [41, 44], multiplied with an additional term that downweights easy-to-classify samples during training. The resulting formula for the so-called focal loss [45,46,47] evaluated for one batch of length N is given as:

\[ L = -\frac{1}{N}\sum _{i=1}^{N} w_i \sum _{j} \hat{y}_{ij} \left( 1-y_{ij}\right) ^{\gamma } \log \left( y_{ij}\right) , \quad (1) \]
where \(y_{ij}\) denotes the output probability assigned to one of the three possible flavors j of jet i, \(\hat{y}_{ij}\) is the one-hot-encoded truth label (either 0 or 1), \(w_i\) is the sample weight obtained from preprocessing and \(\gamma\) is called the focusing parameter. Although we already treat the class imbalance by reweighting the nominal loss function, without the focusing term the neural network is prone to assigning the most frequent class. In this setting with highly imbalanced data, the chosen technique ensures smooth classifier output distributions, which we achieve by choosing a focusing parameter of \(\gamma =25\).
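A possible PyTorch implementation of this weighted focal loss, assuming the batch mean is taken and probabilities are clamped for numerical stability:

```python
import torch

def focal_loss(probs, targets, sample_weights, gamma=25.0, eps=1e-7):
    """Weighted focal loss for one batch, following the text's notation:
    probs          (N, 3) output probabilities y_ij
    targets        (N, 3) one-hot truth labels \\hat{y}_ij
    sample_weights (N,)   kinematic sample weights w_i
    gamma          focusing parameter (the text uses gamma = 25)."""
    p = probs.clamp(eps, 1.0 - eps)
    ce = -(targets * torch.log(p)).sum(dim=-1)   # cross entropy per jet
    p_true = (targets * p).sum(dim=-1)           # probability of the true class
    focal = (1.0 - p_true) ** gamma              # downweight easy jets
    return (sample_weights * focal * ce).mean()
```

Note how a confidently classified jet (large `p_true`) is suppressed by the focusing factor, so the loss is dominated by hard examples.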
Model parameters are updated with the Adaptive Moments Estimation (Adam) optimizer [48] using PyTorch’s [49] default settings, which is further controlled with a learning rate schedule [50] that starts at 0.0001 and decays proportionally to \(\left( 1+\frac{\text {epoch}}{30}\right) ^{-1}\). The batch size has been fixed to \(2^{16}=65{,}536\). To ensure that there is no overfitting, training is stopped when the validation loss no longer improves [2]. For each training, the model’s parameters are saved after each iteration through the full training dataset (i.e. after each epoch) to store a checkpoint for later evaluation.
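The optimizer and inverse-time learning-rate decay can be reproduced with PyTorch's `LambdaLR` scheduler; the `Linear` stand-in model below is illustrative only:

```python
import torch

model = torch.nn.Linear(184, 3)  # stand-in for the tagger
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# inverse-time decay: lr(epoch) = 1e-4 * (1 + epoch/30)^(-1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1.0 + epoch / 30.0) ** -1)
```

After 30 epochs the learning rate has halved, matching the stated decay law.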
Evaluation Metrics
While multi-class taggers are convenient for implementation, for physics analysis purposes, one is often interested in constructing classifiers distinguishing two classes at a time. We take appropriate likelihood ratios of the bottom, charm and light output classes as needed for discrimination. The likelihood ratio XvsY for discriminating class X from Y is given as:

\[ \text {XvsY} = \frac{P(\text {X})}{P(\text {X}) + P(\text {Y})} . \quad (2) \]
For example, for the BvsL discriminator, \(P(\text {X})\) and \(P(\text {Y})\) refer to the classifier’s score for the bottom and light flavor jets, respectively. The performance of the binary classifiers is visualized and evaluated using Receiver Operating Characteristic (ROC) curves [51,52,53]. With some loss of information, a ROC curve is characterized by its area under the curve (AUC), which can be used as a reasonable single scalar proxy for the classifier performance [54]. It should be noted that due to a large class imbalance in the available dataset, accuracy could be an inaccurate measure of the performance [54].
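A numpy-only sketch of the binary discriminator and a simple AUC, exploiting the equivalence of the AUC with the probability that a randomly chosen signal jet outscores a background jet (suitable for illustration, not for large datasets):

```python
import numpy as np

def discriminator(p_x, p_y):
    """XvsY likelihood ratio, e.g. BvsL = P(b) / (P(b) + P(light))."""
    return p_x / (p_x + p_y)

def roc_auc(scores, labels):
    """AUC via pairwise comparison of signal (label 1) and background
    (label 0) scores; ties count one half."""
    sig, bkg = scores[labels == 1], scores[labels == 0]
    wins = ((sig[:, None] > bkg[None, :]).sum()
            + 0.5 * (sig[:, None] == bkg[None, :]).sum())
    return wins / (len(sig) * len(bkg))
```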
Adversarial Attacks
One way to generate adversarial inputs is the Fast Gradient Sign Method (FGSM) [2, 7], which modifies the inputs in a systematic way, such that the loss function increases. First, the direction of the steepest increase of the loss function around the raw inputs is computed. Mathematically, the operator that allows to retrieve the “steepest increase” is the gradient of the loss function with respect to the inputs. Once the direction is known, of which only the sign is kept, this vector is multiplied with a (small) limiting parameter \(\epsilon\) to specify the desired severity of the impact. Then, the nominal inputs are shifted by this quantity. It can, therefore, be seen as a technique to maximally disturb the inputs or maximally confuse the network without necessarily manifesting in the input variable distributions.
Expressed in a single equation, the FGSM attack generates adversarial inputs \(x_\text {FGSM}\) from raw inputs \(x_\text {raw}\) by computing

\[ x_\text {FGSM} = x_\text {raw} + \epsilon \cdot \mathrm {sgn}\left( \nabla _{x} J(x_\text {raw}, y)\right) , \quad (3) \]
where \(\mathrm {sgn}(\alpha )\) stands for the sign of \(\alpha\). In Eq. (3), the loss function is denoted as \(J(x_\text {raw},y)\), a function of the inputs (\(x_\text {raw}\)) and targets (y). Moreover, the FGSM attack can be interpreted as a method that locally inverts the approach of gradient descent by performing a gradient ascent with the loss function, but in the input space [7, 26, 28]. Using the terminology of Ref. [26], this is a white box attack with full knowledge of the network (architecture and parameters).
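A minimal PyTorch sketch of the FGSM attack of Eq. (3): the gradient of the loss with respect to the inputs is computed, only its sign is kept, and the inputs are shifted by \(\epsilon\):

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.01):
    """Generate first-order adversarial inputs:
    x_FGSM = x_raw + epsilon * sgn(grad_x J(x_raw, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()                       # gradient w.r.t. the inputs
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.detach()
```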
The corresponding visualization is shown in Fig. 1, however, for didactic reasons with one input variable \(x_i\) only. In practice, this method is applied multidimensionally, assigning the same limiting parameter \(\epsilon\) in each input dimension.
Whereas the gradient of an arbitrary function could yield any value, the distortion should stay in reasonable bounds to mimic the behaviour of possible mismodelings or differences between data and simulation [7, 10]. Therefore, we go only a small step in the direction of the gradient, which is expected to introduce practically unnoticeable changes of the input distributions [6, 7].
Increasing the number of inputs to the model also increases the susceptibility towards adversarial attacks, because each shift by \(\epsilon\) for additional features is propagated to the change in activation [7]. Thus it is conceivable that individual feature distributions remain almost unaffected, but the performance of the neural network is substantially deteriorated.
The FGSM attack does not necessarily replicate a global worst-case scenario [28]. Depending on the actual properties of the loss surface, the adversarial attack could shift the inputs also into local minima (or at least harmless regions), if the limiting parameter is chosen unluckily. On average, with small distortions only, it is still expected that in a given region, the attack will maximally confuse the model up to first order.
In this implementation, the FGSM attack is not applied to integer variables, such as the number of tracks, nor to defaulted values, as neither could be shifted by \(\epsilon\) in a physically meaningful way.
As large distortions of input variables would be easy to detect, a limit of \(25\%\) with respect to the original value is applied on the perturbation. The modified value \(x_\text {FGSM}\) is then given by Eq. (4), where x denotes the original input value, \(x'\) the transformed (preprocessed) value and \(\epsilon\) the FGSM scaling factor. Inverting the normalization is denoted by \(()^{-1}\):

\[ x_\text {FGSM} = \min \left( \max \left( \left( x' + \epsilon \cdot \mathrm {sgn}\left( \nabla _{x'} J(x', y)\right) \right) ^{-1},\; x - 0.25\,|x| \right) ,\; x + 0.25\,|x| \right) . \quad (4) \]
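The 25% cap might be implemented as a clip of the un-normalized perturbed value, as sketched below; the exact functional form of Eq. (4) is not reproduced here, so the clipping scheme and the skip mask for integer/defaulted features are assumptions based on the text:

```python
import numpy as np

def fgsm_with_cap(x, x_scaled, grad_sign, mean, std, epsilon,
                  skip_mask, cap=0.25):
    """Apply the epsilon-shift in the normalized space, undo the
    normalization (the text's ()^{-1}), and clip the perturbation to
    25% of the raw value. `skip_mask` marks integer or defaulted
    features that are left unchanged. (Sketch under assumptions.)"""
    shifted = (x_scaled + epsilon * grad_sign) * std + mean
    lo, hi = x - cap * np.abs(x), x + cap * np.abs(x)
    x_fgsm = np.clip(shifted, lo, hi)
    return np.where(skip_mask, x, x_fgsm)
```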
Distortions of low-level features are not propagated to high-level features, instead each feature is taken into account via the multidimensional gradient only. Therefore, correlations are not fully taken into account.
Adversarial Training
The approach followed in this study is a simple type of Adversarial Training that injects perturbed inputs already during the training phase [26]. The algorithmic description is shown in Fig. 2, with the differences between the nominal and the Adversarial Training highlighted in red.
In fact, in this approach the neural network never sees the raw inputs during the whole training step [26,27,28]. In Fig. 3, this is shown with the insertion of a red block prior to backpropagation. The idea is that by applying the FGSM attack continuously to the training data (for every minibatch, i.e. with every intermediate state of the model after updating the model parameters), the network is less likely to learn the simulation-specific properties of the used sample. Instead, the introduction of a saddle point into the loss surface is expected to improve the generalization capability of the network [2, 26, 28]. This can be understood as a “competition” between gradient descent to solve the outer minimization problem and gradient ascent to handle the inner maximization [28].
Madry et al. [28] have shown that this is an effective method to reduce susceptibility to first-order adversaries, obtained from an FGSM attack. In that sense, Adversarial Training could also be described as a regularization technique, but a more systematic one than only randomly smearing inputs (another example of data augmentation), randomly deleting connections (dropout), or assigning a probability to the different targets to be wrong (label smoothing) [2].
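One minibatch of this scheme might look as follows, with the inner maximization (FGSM) regenerated for every batch from the model's current state, and the outer minimization performed only on the perturbed inputs; the loss function and optimizer are passed in:

```python
import torch

def adversarial_training_step(model, loss_fn, optimizer, x, y, epsilon=0.01):
    """One minibatch update in which the network only ever sees
    FGSM-perturbed inputs, crafted for the current model state."""
    # inner maximization: craft adversarial inputs on the fly
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
    # outer minimization: ordinary gradient descent on the perturbed batch
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```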
Schematic overview of the inference process when performing a comparison of robustness of both training strategies. Evaluation of the nominal training (green and blue paths) is described in “Vulnerability of the Nominal Training”, while the comparison for the Adversarial Training, including all four combinations is described in “Improving Robustness Through Adversarial Training”
The principle behind this technique involves the linearity of neural networks to which the high susceptibility to mismodelings is attributed. Adversarial Training can be interpreted as a method that adjusts the loss surface to be locally constant around the inputs and that downsizes the impact of perturbations evaluated with a high-dimensional linear function [2]. Slightly distorted inputs then cannot significantly increase the value of the loss function, because it is almost flat in the vicinity of the raw inputs [55]. This can be seen as a geometrical problem where the loss manifold is flattened [55,56,57,58]. When evaluating this adversarially-trained model with distorted test inputs, the model should be more robust to those modifications and the performance should not be affected as much as with the generic training. The price for the increased robustness is that the maximally achievable performance on raw inputs can be somewhat reduced with respect to the nominal training [2]. During Adversarial Training, the FGSM attack uses \(\epsilon =0.01\) when injecting adversarial samples, and no further restrictions are applied, i.e. there is no limitation of the attack with respect to the relative scale of the impact on different values and Eq. (3) holds.
Inference
The inference step is split into two separate parts, which can be seen in Fig. 4. First, the relevant samples need to be acquired. These can be either original (raw) samples or systematically distorted samples. Both trainings under consideration have their own respective loss surfaces, which continuously change during the training process. Therefore, samples that maximally deteriorate the performance of one model do not necessarily confuse another model. To cause a severe impact, the FGSM attack will be applied individually per training. A similar argument can be made for different checkpoints of the training, where we also craft adversarial samples per epoch to reflect the model’s exact status and loss surface. After a fixed number of epochs or after convergence of both training strategies, this yields three different sets of samples: nominal samples (green, equal for both contenders), FGSM samples corresponding to the nominal training (blue), and FGSM samples that have been created for the Adversarial Training (orange). These can then be injected into the different models for evaluation.
Distributions of raw and systematically distorted inputs, for a set of features containing high- and low-level information. The displayed range for the signed impact parameter (\(d_0\)) of the first track has been clipped to the most relevant central region, where distortions naturally appear enhanced
Robustness to Mismodeling
Adversarial Attack
As we are interested in producing disturbances that would simulate the behaviour of systematic uncertainties, we verify that the distorted distributions remain within an envelope expected by the typical data-to-simulation agreement. The effect of the FGSM attack at two values of \(\epsilon\) compared to the nominal distribution is shown for four input variables (both high and low-level inputs) in Fig. 5. Even with the largest value of \(\epsilon = 0.05\) chosen for the following performance studies, the modifications of input shapes remain marginal, within typical data-to-simulation agreements of the level of 10–20% [12].
Vulnerability of the Nominal Training
First, we establish how susceptible the nominal model is to the FGSM attack (mismodeling) of various magnitudes. Figure 6 shows the ROC curves for the BvsL (left) and CvsL (right) discriminators, on FGSM datasets generated with varying parameter \(\epsilon\) and on the nominal inputs. As expected, the model performs best on undisturbed test samples with an AUC of 0.946, but the performance decays quite quickly with increasing \(\epsilon\). At \(\epsilon =0.05\), which still only causes barely visible differences in the input distributions, the model reaches an AUC of 0.883. At a 1% mistag working point, this would correspond to a decrease in signal efficiency from 73% to 60%, requiring a scale factor of 0.82.
In the context of the ongoing hunt for better performing classifiers, it is of interest to investigate the susceptibility in relation to the performance. Some insight can be gleaned by evaluating the performance of the classifier at various steps during the training on both the nominal and the perturbed datasets with a fixed \(\epsilon =0.05\), where an AUC value is calculated for each checkpoint. This dependence is shown in Fig. 7, again for the two discriminators. Not surprisingly, before the training performance saturates, longer training leads to an increase in nominal performance. However, at the same time, the model shows higher vulnerability towards adversarial attacks; in fact, the performance on the perturbed datasets follows exactly the opposite trend. Another way to phrase this finding is that the least performant configuration (after only a few epochs, i.e. iterations through the full training dataset) shows the highest robustness, i.e. the gap between dashed and solid lines is minimal.
Improving Robustness Through Adversarial Training
ROC curves for the BvsL (left) and CvsL discriminator (right), using the nominal training and applying FGSM attacks with \(\epsilon =0.05\) at various checkpoints of the training that each come with different nominal performance. Solid lines in different colors represent nominal performance gain with an increased number of epochs, dashed lines show corresponding performance on individually crafted FGSM samples for the particular checkpoints
In this subsection, the studies described above are repeated with the adversarial model, using the same setup for the attacks when performing the inference.
As a check of robustness, we perform a direct comparison of the nominal and Adversarial Training, crafting the FGSM samples individually per model, with the resulting ROC curves for the BvsL and CvsL discriminators shown in Fig. 8.
ROC curves for the BvsL (left) and CvsL (right) discriminators, comparing the nominal with Adversarial Training when applying the FGSM attack to both trainings individually. Nominal training is visualized in blue, Adversarial Training in orange, solid lines depict nominal performance, dashed lines show performance on distorted inputs (for nominal training), dashed-dotted lines represent the systematically distorted samples for Adversarial Training
The corresponding AUC values for BvsL are identical (0.946) and are practically identical for CvsL (nominal: 0.759, adversarial: 0.757). At the same time, the adversarial model maintains a high performance also when given systematically distorted samples, which can be seen from the dashed-dotted lines corresponding to the colors mentioned above. The ROC curve corresponding to FGSM samples crafted for and injected into the Adversarial Training (orange dashed-dotted line) appears much closer to that showing nominal performance (solid line) than what can be observed for the ROC curves corresponding to the FGSM attack for the nominal training (blue lines). In numbers, this effect is best observed for the CvsL discriminator, where the decrease in performance is roughly \(21\%\) for the nominal training, but only \(8.2\%\) for the adversarial training, while the nominal performance of both models is nearly the same. Hence, we have shown that it is possible to build a more robust tagger that is simultaneously highly performant. A label leaking effect (see Ref. [59]), which refers to a better performance on adversarial examples than on undisturbed data for an adversarial model, is not observed.
Figure 9 compares the susceptibility to mismodeling of the two classifiers as a function of performance. FGSM samples have been generated individually for each model and checkpoint (denoting each epoch with a single point) to scan over different discrete stages of the training. Higher density of points in the high performance region is representative of the small improvements at later stages of the training, while the performance gain during the first few epochs is quick. Ideally, there would be a constant relation that shows no signs of decreasing robustness for increasing performance. However, we observe a considerable deterioration (and thus higher susceptibility to mismodeling) of the nominal classifier. The effect for the adversarial model, while still noticeable, is to a large degree mitigated.
Relation between susceptibility and nominal performance for the nominal and adversarial training, tested on systematically distorted inputs with varying \(\epsilon\) in different colors. The x axis shows nominal performance, measured with BvsL AUC, while the y axis shows the difference between disturbed and raw AUC. When there is a drop on the y axis while moving to higher nominal performance (x axis), this indicates higher susceptibility. The empty markers represent the nominal training, which becomes highly vulnerable with increasing nominal performance (with the drop always getting steeper), while the filled markers for adversarial training show a much flatter relation
In fact, the adversarial training seems to recover some of its robustness (e.g. peaking at an AUC of around 0.938) before the impact at higher performance starts to worsen the resistance. Again, this shows the intriguing trade-off between performance and robustness for the nominal training, where training to highest performance is not necessarily advisable due to high susceptibility. On the other hand, the adversarial training performs equally well on nominal samples and only shows a weak functional dependence between performance on first-order adversaries and the respective undisturbed performance.
Probing Flavor Dependence of the Attack as a Proxy for Generalization Capability
In an attempt to understand why the adversarial model is more robust than the nominal classifier, we investigate nominal and perturbed input distributions of a selected feature, split by flavor. We intentionally choose a large distortion. This test aims at visualizing geometric properties of the distorted samples, purposefully choosing a large \(\epsilon\) of 0.1. This is equal to the regular FGSM attack described by Eq. (3) without the limitation described in Eq. (4). The signed impact parameter (\(d_0\)) as shown in Fig. 10 originally offers discriminating power via the fact that heavy-flavor jets contain displaced tracks associated to a secondary vertex, which should naturally lead to more positive values for the \(d_0\) variable. For light-flavored jets, this behaviour is not expected, instead the tracks in light jets have a roughly symmetric \(d_0\) distribution, peaking at 0, apart from some skewness due to relatively long-lived, but light hadrons (\(K^0_s\) or \(\varLambda\)) or contamination with tracks from heavy-flavor hadrons [12].
Signed transverse impact parameter distribution for the first track, split by flavor, before (filled histograms) and after (lines) applying the FGSM attack for the nominal (top) and adversarial (bottom) models, respectively. Clearly asymmetric shapes are produced when using the FGSM attack for the loss function assigned to the nominal training. Applying the FGSM attack based on an adversarial model shows suppressed flavor-dependency and relatively symmetric shapes. The attack uses the parameter \(\epsilon =0.1\), which is higher than the moderately chosen parameter of \(\epsilon =0.01\) during the modified training loop
For the nominal training, light-flavor jets are shifted mostly into the positive region, which should be dominated by b jets; b jets are shifted to the negative region where these jets were not abundant previously. From a geometric point of view, the FGSM attack on the nominal training produces asymmetric shapes. On the other hand, the resulting perturbed input distributions for the adversarial training are symmetric. We observe that the adversarial model is almost agnostic to the direction into which the FGSM attack shifts the inputs, while the nominal training shows a clear preferred direction that could be described as an inversion of the expected physics. For the adversarial training, the attack seems to have difficulties deciding which direction is the worse direction, resulting in a perceived “coin-flipping” of the shift. Thus, the adversarial training remains less susceptible than the nominal training, even when the distortions are noticeably large.
It is conceivable that the different geometric properties of the distributions are related to the geometry of the loss surface [55,56,57,58]. This is expected to be responsible for differences in robustness as well. Figure 11 illustrates how the flatness of the loss surface in the vicinity of raw inputs could influence symmetric or asymmetric shifts.
A nominal training converges into a minimum associated with the default distributions. In that case, for a given flavor, there will be a specific vector pointing away from a local minimum and the direction is fixed according to the steepest increase in loss. The adversarial training always “sees” (new) adversarial inputs, so the adjustment of the model’s parameters might average out eventually over further training epochs. Always following the newly distorted inputs yields a locally constant loss manifold around the original inputs due to the more complex saddle point problem. This would mean that not the exact memorization of training data, but rather higher-order correlations contribute to the improvement of the performance of the adversarial training [26, 28, 55, 56]. With the assumption of a flat loss surface close to the raw inputs there would be no preferred direction for first-order adversarial attacks crafted for the adversarial model. Many vectors would fulfill the criterion of pointing in the direction of increasing loss, much like choosing the direction randomly.
Thus, the geometric properties of the adversarial samples make a flat loss landscape for the adversarial model highly plausible, which in turn leads to higher robustness [55, 56]. For mismodelings of order \(\epsilon\) that are still on-manifold, the adversarial training would then generalize better to data than the nominal training. Robustness and generalization are not equivalent [26, 28, 60], which is why this statement cannot be general; it is only valid under the assumption that adversarial methods like the FGSM attack replicate mismodelings between simulation and detector data.
Conclusion
In this paper, we investigated the performance of a jet flavor tagging algorithm when exposed to systematically distorted inputs generated with an adversarial attack, the Fast Gradient Sign Method. Moreover, we showed how model performance and robustness are related. We explored the trade-off between performance on unperturbed and on distorted test samples, investigating ROC curves and AUC scores for the BvsL and CvsL discriminators. All tests conducted with the nominal training confirm earlier findings relating higher performance to higher susceptibility, now for a deep neural network that replicates a typical jet tagging algorithm. We applied a defense strategy against first-order adversarial attacks by injecting adversarial samples already during the training stage of the classifier, without altering the network architecture.
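The defense strategy described above, injecting adversarial samples during the training stage, can be condensed into a single training step. The following is a simplified sketch, not the published code of Ref. [62]; the model, optimizer, and \(\epsilon\) value are placeholders:

```python
import torch

def adversarial_training_step(model, loss_fn, optimizer, x, y, epsilon=0.01):
    """One adversarial training step: craft FGSM examples with the
    current model parameters, then update the model on them."""
    # craft first-order adversarial examples for the current model state
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

    # standard supervised update, but on the distorted inputs
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since fresh adversarial inputs are generated in every step, the network never sees the same distorted sample twice, which is consistent with the averaging-out behavior discussed in the context of the loss surface.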
When comparing this new classifier with the nominal model, no difference in performance was observed, but the robustness towards adversarial attacks is enhanced by a large margin. As a direct comparison of the two trainings, both reached an AUC score of approximately \(76\%\) when discriminating c from light jets, but an FGSM attack that is still moderate in its impact on the input distributions decreases the performance of the nominal training by \(21\%\), while the adversarial training loses only \(8.2\%\). A study of raw and distorted input distributions allowed us to relate geometric properties of the attack to geometric properties of the underlying loss surfaces for a nominal and an adversarially trained model, yielding a possible explanation for the higher robustness of the latter: the flatness of its loss manifold.
To some extent, the higher robustness as shown in this paper points at better generalization capability, but a study that will also utilize detector data has yet to be conducted to confirm this conjecture. The approach followed for this work is comparatively general, in that it only needs access to the model and the criterion. This is the first application of adversarial training to build a robust jet flavor tagger suitable for usage at the LHC.
It would be interesting to apply this type of attack and defense also to more complex neural network structures to see if, for example, convolutional layers are able to leverage adversarial attacks differently, and if adversarial training is as effective for taggers with a larger (or smaller) dimension in the feature space. Another focus could be placed on adversarial methods of higher complexity, both for the attack and for the defense against it. Summarizing the efforts so far, adversarial training was applied successfully to resist first-order adversarial attacks on jet flavor tagging algorithms; corresponding studies with higher-order adversaries are left for future investigations.
Data Availability Statement
This manuscript has associated data in a data repository [Authors’ comment: The dataset has been generated with code accessible under Ref. [61] and can be accessed at the UCI Machine Learning in Physics Web portal under the link http://mlphysics.ics.uci.edu/].
Code Availability
Results shown in this report have been prepared with the help of code accessible under Ref. [62].
References
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444. https://doi.org/10.1038/nature14539
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press. http://www.deeplearningbook.org. Accessed 2 Aug 2022
Albertsson K et al (2018) Machine learning in high energy physics community white paper. J Phys Conf Ser 1085(2):022008. https://doi.org/10.1088/1742-6596/1085/2/022008, arXiv:1807.02876
Adadi A, Berrada M (2018) Peeking inside the black-box: a survey on Explainable Artificial Intelligence (XAI). IEEE Access 6:52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052
Amodei D et al (2016) Concrete problems in AI safety. arXiv e-prints. arXiv:1606.06565
Szegedy C et al (2014) Intriguing properties of neural networks. arXiv e-prints. arXiv:1312.6199
Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. arXiv e-prints. arXiv:1412.6572
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366. https://doi.org/10.1016/0893-6080(89)90020-8
Su J, Vargas DV, Sakurai K (2019) One pixel attack for fooling Deep Neural Networks. IEEE Trans Evol Comput 23:828–841. https://doi.org/10.1109/tevc.2019.2890858, arXiv:1710.08864
Nachman B, Shimmin C (2019) AI safety for high energy physics. arXiv e-prints. arXiv:1910.08606
Shimmin C (2020) advjets-mlhep2020. GitHub repository. https://github.com/cshimmin/advjets-mlhep2020. Accessed 2 Aug 2022
CMS Collaboration (2018) Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV. JINST 13(05):P05011. https://doi.org/10.1088/1748-0221/13/05/P05011, arXiv:1712.07158
Guest D et al (2016) Jet flavor classification in high-energy physics with Deep Neural Networks. Phys Rev D 94(11):112002. https://doi.org/10.1103/PhysRevD.94.112002, arXiv:1607.08633
CMS Collaboration (2008) The CMS experiment at the CERN LHC. JINST 3:S08004. https://doi.org/10.1088/1748-0221/3/08/S08004
ATLAS Collaboration (2008) The ATLAS experiment at the CERN Large Hadron Collider. JINST 3:S08003. https://doi.org/10.1088/1748-0221/3/08/S08003
ATLAS Collaboration (2019) ATLAS b-jet identification performance and efficiency measurement with \(t{\bar{t}}\) events in pp collisions at \(\sqrt{s}=13\) TeV. Eur Phys J C 79(11):970. https://doi.org/10.1140/epjc/s10052-019-7450-8, arXiv:1907.05120
CMS Collaboration (2018) Observation of Higgs boson decay to bottom quarks. Phys Rev Lett 121(12):121801. https://doi.org/10.1103/PhysRevLett.121.121801, arXiv:1808.08242
ATLAS Collaboration (2018) Observation of \(H \rightarrow b\bar{b}\) decays and \(VH\) production with the ATLAS detector. Phys Lett B 786:59–86. https://doi.org/10.1016/j.physletb.2018.09.013, arXiv:1808.08238
Kogler R et al (2019) Jet substructure at the large hadron collider: experimental review. Rev Mod Phys 91(4):045003. https://doi.org/10.1103/RevModPhys.91.045003, arXiv:1803.06991
CMS Collaboration (2016) Identification of c-quark jets at the CMS experiment. CERN Document Server. https://cds.cern.ch/record/2205149. Accessed 2 Aug 2022
CMS Collaboration (2022) A new calibration method for charm jet identification validated with proton-proton collision events at \(\sqrt{s}\) =13 TeV. JINST 17(03):P03014. https://doi.org/10.1088/1748-0221/17/03/P03014, arXiv:2111.03027
ATLAS Collaboration (2022) Measurement of the c-jet mistagging efficiency in \(t\bar{t}\) events using pp collision data at \(\sqrt{s}=13\) \(\text{TeV}\) collected with the ATLAS detector. Eur Phys J C 82(1):95. https://doi.org/10.1140/epjc/s10052-021-09843-w, arXiv:2109.10627
CMS Collaboration (2020) A search for the standard model Higgs boson decaying to charm quarks. JHEP 03:131. https://doi.org/10.1007/JHEP03(2020)131, arXiv:1912.01662
ATLAS Collaboration (2022) Direct constraint on the Higgs-charm coupling from a search for Higgs boson decays into charm quarks with the ATLAS detector. arXiv e-prints. arXiv:2201.11428
CMS Collaboration (2022) Search for Higgs boson decay to a charm quark-antiquark pair in proton-proton collisions at \(\sqrt{s}\) = 13 TeV. arXiv e-prints. arXiv:2205.05550
Chakraborty A et al (2018) Adversarial attacks and defences: a survey. arXiv e-prints. arXiv:1810.00069
Shaham U, Yamada Y, Negahban S (2018) Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing 307:195–204. https://doi.org/10.1016/j.neucom.2018.04.027, arXiv:1511.05432
Madry A et al (2019) Towards deep learning models resistant to adversarial attacks. arXiv e-prints. arXiv:1706.06083
Louppe G, Kagan M, Cranmer K (2016) Learning to pivot with adversarial networks. arXiv e-prints. arXiv:1611.01046
Ganin Y, Lempitsky V (2015) Unsupervised domain adaptation by backpropagation. arXiv e-prints. arXiv:1409.7495
CMS Collaboration (2020) A deep neural network to search for new long-lived particles decaying to jets. Mach Learn Sci Tech 1:035012. https://doi.org/10.1088/2632-2153/ab9023, arXiv:1912.12238
Ćiprijanović A et al (2021) DeepAdversaries: examining the robustness of deep learning models for galaxy morphology classification. arXiv e-prints. arXiv:2112.14299
Babicz M, Alonso-Monsalve S, Dolan S, Terao K (2022) Adversarial methods to reduce simulation bias in neutrino interaction event filtering at liquid argon time projection chambers. Phys Rev D 105(11):112009. https://doi.org/10.1103/PhysRevD.105.112009, arXiv:2201.11009
Ghosh A, Nachman B, Whiteson D (2021) Uncertainty-aware machine learning for high energy physics. Phys Rev D 104(5):056026. https://doi.org/10.1103/PhysRevD.104.056026, arXiv:2105.08742
Alwall J et al (2011) MadGraph 5: going beyond. JHEP 06:128. https://doi.org/10.1007/JHEP06(2011)128, arXiv:1106.0522
Sjostrand T, Mrenna S, Skands PZ (2006) PYTHIA 6.4 physics and manual. JHEP 05:026. https://doi.org/10.1088/1126-6708/2006/05/026, arXiv:hep-ph/0603175
DELPHES 3 Collaboration (2014) DELPHES 3, a modular framework for fast simulation of a generic collider experiment. JHEP 02:057. https://doi.org/10.1007/JHEP02(2014)057, arXiv:1307.6346
Cacciari M, Salam GP, Soyez G (2008) The anti-\(k_t\) jet clustering algorithm. JHEP 04:063. https://doi.org/10.1088/1126-6708/2008/04/063, arXiv:0802.1189
Cacciari M, Salam GP, Soyez G (2012) FastJet user manual. Eur Phys J C 72:1896. https://doi.org/10.1140/epjc/s10052-012-1896-2, arXiv:1111.6097
Waltenberger W (2011) RAVE: a detector-independent toolkit to reconstruct vertices. IEEE Trans Nucl Sci 58:434–444. https://doi.org/10.1109/TNS.2011.2119492
Stevens E, Antiga L, Viehmann T (2020) Deep learning with PyTorch. Manning Publications Company
Kuhn M, Johnson K (2019) Feature engineering and selection: a practical approach for predictive models. CRC Press
Srivastava N et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958. https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf. Accessed 2 Aug 2022
PyTorch (2022) CrossEntropyLoss. https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html. Accessed 2 Aug 2022
Lin T-Y et al (2017) Focal loss for dense object detection. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 2999–3007. arXiv:1708.02002, https://doi.org/10.1109/ICCV.2017.324
Hassan A (2021) Adeelh/pytorch-multi-class-focal-loss: 1.1. Zenodo. https://doi.org/10.5281/zenodo.5547584
CMS Collaboration (2022) Search for new physics using top quark pairs produced in association with a boosted Z or Higgs boson in effective field theory. CERN Document Server. https://cds.cern.ch/record/2802060. Accessed 2 Aug 2022
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference for learning representations (ICLR). arXiv:1412.6980
Paszke A et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp 8024–8035. Curran Associates, Inc. arXiv:1912.01703
Darken C, Chang J, Moody J (1992) Learning rate schedules for faster stochastic gradient search. In: Neural Networks for Signal Processing II: Proceedings of the 1992 IEEE Workshop, pp 3–12. https://doi.org/10.1109/NNSP.1992.253713
Davis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. ICML ’06, pp 233–240. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1143844.1143874
Powers D (2020) Evaluation: from precision, recall and F-factor to ROC, Informedness, Markedness & Correlation. arXiv e-prints. arXiv:2010.16061
Galar M et al (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern. Part C (Applications and Reviews) 42(4):463–484. https://doi.org/10.1109/TSMCC.2011.2161285
Branco P, Torgo L, Ribeiro RP (2016) A survey of predictive modeling on imbalanced domains. ACM Comput Surv. https://doi.org/10.1145/2907070, arXiv:1505.01658
Fawzi A, Moosavi-Dezfooli S-M, Frossard P (2016) Robustness of classifiers: from adversarial to random noise. arXiv e-prints. arXiv:1608.08967. Accepted to NIPS 2016
Fawzi A, Fawzi O, Frossard P (2018) Analysis of classifiers’ robustness to adversarial perturbations. Mach Learn 107(3):481–508. https://doi.org/10.1007/s10994-017-5663-3, arXiv:1502.02590
Li H et al (2018) Visualizing the loss landscape of neural nets. arXiv e-prints. arXiv:1712.09913. NIPS 2018
Fort S, Hu H, Lakshminarayanan B (2020) Deep ensembles: a loss landscape perspective. arXiv e-prints. arXiv:1912.02757
Kurakin A, Goodfellow I, Bengio S (2017) Adversarial machine learning at scale. arXiv e-prints. arXiv:1611.01236
Stutz D, Hein M, Schiele B (2019) Disentangling adversarial robustness and generalization. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 6969–6980. https://doi.org/10.1109/CVPR.2019.00714, arXiv:1812.00740
Guest D (2016) delphes-rave. GitHub repository. https://github.com/dguest/delphes-rave. Accessed 2 Aug 2022
Stein A (2022) Adversarial-training-for-jet-tagging. GitHub repository. https://github.com/AnnikaStein/Adversarial-Training-for-Jet-Tagging. Accessed 2 Aug 2022
Fleshgrinder (2009) Gaussian distribution. Wikimedia Commons. Released into the public domain. https://commons.wikimedia.org/wiki/File:Gaussian_distribution.svg. Accessed 2 Aug 2022
Pivarski J et al (2020) scikit-hep/awkward-1.0: 0.4.5. Zenodo. https://doi.org/10.5281/zenodo.4341376
Gray L et al (2020) CoffeaTeam/coffea: release v0.6.46. Zenodo. https://doi.org/10.5281/zenodo.3266454
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55
ATLAS Collaboration (2010) Impact parameter-based b-tagging algorithms in the 7 TeV collision data with the ATLAS detector: the TrackCounting and JetProb algorithms. CERN Document Server. https://cds.cern.ch/record/1277681. Accessed 2 Aug 2022
Acknowledgements
Simulations were performed with computing resources granted by RWTH Aachen University under projects nova0021 and rwth0619. This work has received support from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, projects SCHM 2796/5 and GRK 2497) and the Bundesministerium für Bildung und Forschung (BMBF, Project 05H2021). We thank Nicolas Frediani for his contributions to the project in the context of his bachelor thesis.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Supplementary Material
See Fig. 11.
B Robustness in the Context of Other Mismodeling Scenarios
B.1 Smearing Inputs with a Gaussian Noise Term
While the FGSM attack aims at worst-case scenarios in the direction of increasing gradients, a physical effect induced by mismodeling of the parton shower, or caused by detector misalignment or miscalibration, does not know the model parameters or its loss surface. It can therefore not act as a “demon” [10] that always points in a preferred direction. Investigating a smearing technique independent of the model under consideration is of interest when studying the robustness to more typical mismodeling scenarios or fluctuations of statistical nature. A non-systematic strategy to create a new, slightly distorted set of inputs randomly shifts the variables by adding a noise term \(\xi\), drawn from a Gaussian distribution, to the original inputs [2, 6, 7, 59]:

\(x_i^{\text{smeared}} = x_i^{\text{raw}} + \xi_i, \quad \xi_i \sim {\mathcal {N}}(\mu ,\sigma ^2).\)
As described in Sect. 2.2, the inputs are scaled to a standard deviation of one and are centered at zero, thus allowing this smearing without further processing.
The effect of this distortion is shown in Fig. 12. Only one arbitrary input \(x_i\) has been chosen for visualization, and the displayed loss function is just an illustration.
Visualization of the random shift of inputs by adding a Gaussian noise term. With the slight distortion based on the blue probability distribution, the formerly green raw datapoint is shifted and the corresponding loss modified. The change of the loss function with respect to the distorted inputs can go in either direction. Gaussian distribution adapted from Ref. [63]
Compared to the settings introduced for the FGSM attack, the difference is that the magnitude of the distortion is now given by \(\sigma =1\) (not \(\epsilon\)). Other parameters remain untouched, the limitation by \(25\%\) of the value applies as well and we choose \(\mu =0\) for this test against random fluctuations of features.
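A minimal sketch of this smearing under the stated settings (\(\mu = 0\), \(\sigma = 1\), shift limited to \(25\%\) of the input value); the function name is illustrative and not taken from the published code:

```python
import torch

def gaussian_smear(x, mu=0.0, sigma=1.0, rel_limit=0.25):
    """Model-independent smearing: add Gaussian noise to the
    standardized inputs, with each shift limited to a fraction
    of the corresponding input's magnitude."""
    xi = torch.normal(mu, sigma, size=x.shape)
    # apply the same 25% limitation as for the FGSM attack
    xi = torch.clamp(xi, -rel_limit * x.abs(), rel_limit * x.abs())
    return x + xi
```

In contrast to the FGSM shift, the sign of the noise is random per feature and per jet, so no gradient information of any model enters the distortion.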
ROC curves for the BvsL (left) and CvsL (right) discriminators, comparing the nominal with adversarial training when smearing the inputs with a Gaussian noise term. Nominal training is visualized in blue, adversarial training in orange, solid lines depict nominal performance, dashed lines show performance on distorted inputs
From Fig. 13 it is evident that the adversarial model also performs better than the nominal model when tested on randomly smeared inputs, although the advantage over the nominal training is not as large as for the FGSM attack. Measured by the difference in AUC, the adversarial training is a factor of 2 less susceptible to Gaussian noise than the nominal training. Therefore, we conclude that also in this scenario, which is somewhat closer to typical mismodelings found in the HEP context, the adversarial training is more robust.
B.2 Transferability of Adversarial Samples as a Black-Box Attack
Adversarial samples created for one model can also deteriorate the performance of another, independent model, a phenomenon known as transferability of adversarial samples [7, 26, 28]. For this study, the two models under consideration share the same architecture, but their weights and bias terms differ as a result of the different training strategies, thus yielding suitable candidates to investigate this transferability. In fact, when injecting the same FGSM inputs generated for the nominal model into both models, we obtain another set of predictions. This can be understood as a black-box attack on the adversarial model [28], as the adversarial inputs are crafted without knowledge of the exact parameters of the adversarial model. In Fig. 4, this corresponds to using the blue branch for both models as an identical set of samples. The parameter used for this scenario is \(\epsilon =0.05\), with the limitation introduced in Eq. (4). Figure 14 shows that the adversarial model is also more robust to this perturbation.
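The transferability test can be sketched as follows: FGSM samples are crafted on the source (nominal) model and then evaluated with the target (adversarially trained) model. All names are illustrative, and plain accuracy is used here as a simple stand-in for the ROC-based metrics reported in the paper:

```python
import torch

def blackbox_transfer_eval(source_model, target_model, loss_fn, x, y,
                           epsilon=0.05):
    """Black-box attack: adversarial inputs are crafted on source_model
    and injected into target_model, which never exposed its gradients."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(source_model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
    with torch.no_grad():
        preds = target_model(x_adv).argmax(dim=1)
    return (preds == y).float().mean().item()
```

Since only the source model's loss surface is queried, the target model is attacked without any knowledge of its exact parameters.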
ROC curves for the BvsL (left) and CvsL (right) discriminators, comparing the nominal with adversarial training when applying the FGSM attack to the nominal training and injecting the obtained inputs to both models. Nominal training is visualized in blue, adversarial training in orange, solid lines depict nominal performance, dashed lines show performance on distorted inputs that were obtained with the help of the loss surface of the nominal model
B.3 Shifting Inputs Systematically with Up/Down Variations
In this simplified scenario, inputs are modified without prior knowledge of the model parameters. For this variation, features are simultaneously shifted upwards (or downwards) by adding (subtracting) small distortions to (from) the nominal values. While the present dataset does not contain the systematic uncertainties directly, we estimate the magnitude of the distortion, applied in a feature-wise manner, with the help of existing commissioning results [12, 16, 20, 22] by the CMS and ATLAS collaborations. A baseline magnitude of 0.05 has been chosen, which is weighted by a factor \(s_i\) ranging from 1 to 5, depending on the maximally observed data-to-simulation disagreement for input i:

\(x_i^{\text{up/down}} = x_i^{\text{raw}} \pm 0.05 \cdot s_i. \qquad (7)\)
ROC curves for the BvsL (left) and CvsL (right) discriminators, comparing the nominal with adversarial training when shifting inputs systematically downwards. Nominal training is visualized in blue, adversarial training in orange, solid lines depict nominal performance, dashed lines show performance on distorted inputs
ROC curves for the BvsL (left) and CvsL (right) discriminators, comparing the nominal with adversarial training when shifting inputs systematically upwards. Nominal training is visualized in blue, adversarial training in orange, solid lines depict nominal performance, dashed lines show performance on distorted inputs
The largest deviation in the data-to-simulation ratio is accounted for by incrementing the initial factor of 1 \(s_i\)-times in steps of 1, where \(s_i\) counts how many intervals of 0.1 fit between the observed ratio and perfect agreement (i.e. a ratio of 1). This already introduces a restriction on the allowed perturbation by itself, which is why the additional limitation of \(25\%\) is not necessary here. For features in the dataset where no direct counterpart is used in the official taggers of said collaborations, or data-to-simulation comparisons are unavailable, reasonable intermediate factors are assumed in Eq. (7).
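This systematic variation can be sketched compactly, assuming a per-feature tensor of factors \(s_i\) (the actual factors follow the commissioning results cited above; the function name is illustrative):

```python
import torch

def updown_shift(x, s, baseline=0.05, direction=+1):
    """Systematic variation: shift every feature i up (direction=+1)
    or down (direction=-1) by baseline * s_i, where s_i in {1,...,5}
    reflects the observed data-to-simulation disagreement."""
    # s has one entry per feature and is broadcast over the batch
    return x + direction * baseline * s
```

Unlike the FGSM attack, all jets receive the same feature-wise shift, so no correlations between inputs are exploited.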
Figures 15 and 16 show that in the case of simultaneous up- or downwards variations, the adversarial model maintains a higher performance than the nominal model. The impact of this distortion is not as large as the one observed for the FGSM attack; furthermore, this perturbation does not take correlations into account, which is why the advantage of adversarial training over nominal training is less pronounced and we might not have probed the worst possible case yet. Nevertheless, in this simplified scenario, adversarial training can be considered more robust towards systematic shifts of input features.
C Computing
Processing of the data is carried out with the \(\texttt{awkward}\) [64] package, the subsequent evaluation utilizes \(\texttt{coffea}\) [65], and the graphics are prepared with \(\texttt{matplotlib}\) [66]. The neural network training is performed with the \(\texttt{PyTorch}\) [49] library, utilizing an NVIDIA Tesla V100 GPU.
D Input Variables
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Stein, A., Coubez, X., Mondal, S. et al. Improving Robustness of Jet Tagging Algorithms with Adversarial Training. Comput Softw Big Sci 6, 15 (2022). https://doi.org/10.1007/s41781-022-00087-1