Improving Robustness of Jet Tagging Algorithms with Adversarial Training

Deep learning is a standard tool in the field of high-energy physics, facilitating considerable sensitivity enhancements for numerous analysis strategies. In particular, in identification of physics objects, such as jet flavor tagging, complex neural network architectures play a major role. However, these methods are reliant on accurate simulations. Mismodeling can lead to non-negligible differences in performance in data that need to be measured and calibrated against. We investigate the classifier response to input data with injected mismodelings and probe the vulnerability of flavor tagging algorithms via application of adversarial attacks. Subsequently, we present an adversarial training strategy that mitigates the impact of such simulated attacks and improves the classifier robustness. We examine the relationship between performance and vulnerability and show that this method constitutes a promising approach to reduce the vulnerability to poor modeling.


Introduction
The experiments at the Large Hadron Collider (LHC) at CERN handle large, high-dimensional datasets to find complex patterns or to identify rare signals in background-dominated regions, tasks where machine learning and especially deep learning [1,2] provide considerable performance gains over traditional methods. It is expected that the relevance of new deep learning technologies will increase with the approaching era of the High-Luminosity LHC (HL-LHC) [3]. However, studies with the aim of understanding a neural network's decisions demonstrate the relevance of explainability [4] and raise questions on the safety of systems that use artificial intelligence (AI), which is often perceived as a black box [4,5]. Moreover, other studies show that small modifications of the inputs (adversarial examples) can severely affect the performance of neural networks [6,7] (adversarial attack), a worrying prospect for a field that relies on simulation, which may at times be inaccurate. Careful exploration of the susceptibility to mismodelings is necessary to examine how severe these "intriguing properties of neural networks" [6] are in practice. Such effects could be driven by the fact that various popular classes of deep neural networks react linearly when exposed to linear perturbations, together with the large number of input variables [6,7]. As such, this property is not in conflict with a neural network's ability to approximate any function via a combination of non-linear activation functions [8], but the presence of (piecewise-)linear activation functions is sufficient to cause a severe impact on performance when evaluated on first-order adversarial examples [6,7]. Applied to computer vision and image recognition, it has been demonstrated that modifications that involve only one pixel are enough to "fool" a neural network [9].
We apply methods from AI safety [5,10,11] to the classification of jets based on the flavor of their initiating particle (a quark or gluon), so-called jet heavy-flavor identification (tagging) [12,13]. Identifying the jet flavor plays an important role in various analysis branches exploited by experiments like CMS [12,14] and ATLAS [15,16], for example, for the observation of the decay of the Higgs boson to bottom (b) quark-antiquark pairs ($H \to b\bar{b}$) [17,18,19]. Moreover, for analyses that also apply charm (c) tagging [12,20,21,22], such as searches for the Higgs boson decaying to c quarks [23,24,25], multiclassifiers become increasingly important. Therefore, investigating the susceptibility to mismodeling could be even more relevant for c tagging. We probe the trade-off between performance and robustness to systematic distortions by benchmarking an established algorithm for jet flavor tagging with a realistic dataset. Early taggers included only the displacement of tracks as a way to discriminate heavy- from light-flavored jets, which is possible due to the different lifetimes of the initiating hadrons. It is also possible to leverage information related to the secondary vertices, giving rise to algorithms such as the (deep) combined secondary vertex algorithm [12].
Mismodelings can arise at various steps during the Monte Carlo (MC) simulation chain, starting with the hard process (matrix element calculation), followed by the subsequent steps that model the parton shower, fragmentation and hadronization, where the perturbation order is limited, and ending with the detector simulation [21] which introduces imperfections such as detector misalignment and calorimeter miscalibration. These imperfections in the modeling, particularly for variables with high discriminating power, demand the calibration of the discriminator shapes [21] and call for investigations of the tagger response to slightly distorted input data [10]. We use adversarial attacks to model systematic uncertainties induced by these subtle mismodelings that could be invisible to typical validation methods, as proposed in Ref. [10]. The approach followed in this study does not eliminate these mismodelings, nor does it provide a definitive a posteriori correction, but it helps in estimating to what extent tagging efficiency and misidentification rates could be affected [10,19]. We assume that more adversarially robust models also generalize better when applied to a non-training domain [2,26] (e.g. model evaluated on data [10,21]). To that end, we seek to modify the training to minimize the impact of adversarial attacks, without sacrificing performance. Using

Input Features
A description of all input variables is given in Tables 1 and 2, and is based on Ref. [13]; here we only summarize the main categorization.
Input features are organized hierarchically. Low-level features consist of tracks and their helix parameters, along with the track covariance matrix. Additional information is taken from the relationship between each track and the associated vertex. Up to 33 tracks, sorted by impact parameter significance, are available per jet; however, we only consider the first six.
At jet level, expert (high-level) features are constructed as a function of the low-level inputs, for example by summing over all tracks or summing over secondary vertices, such as the weighted sum of displacement significances. Additionally, kinematic features of the jet are taken into account.
Missing or otherwise unavailable variables are filled with a convenient default value for later processing.

Preprocessing
The entire dataset consists of 11,491,971 jets, which are split randomly into training (72%), validation (8%) and test (20%) sets. Input features are normalized such that they have a mean of 0 and a standard deviation of 1. The scaling is calculated using only the training dataset distributions, excluding the defaulted values. Defaulted input values are set just below the minima of the primary input distributions, ensuring no interference between regular and irregular (or missing) values. Minimizing the gap between the default value and the rest of the distribution improves training convergence. This technique of missing data imputation allows us to create fixed-length input shapes that are transferred to the first layer of a deep feed-forward neural network, and at the same time prevents vanishing or exploding gradients due to extreme values for the defaults [2,41,42].
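The normalization and default handling described above could be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: the placeholder value `MISSING`, the function names and the size of the gap below the minimum (`gap`) are assumptions made for the example.

```python
import numpy as np

MISSING = -999.0  # hypothetical placeholder for unavailable values (assumption)

def fit_scaler(train_col):
    """Mean, std and scaled minimum of one feature, using valid training entries only."""
    valid = train_col[train_col != MISSING]
    mean, std = valid.mean(), valid.std()
    return mean, std, ((valid - mean) / std).min()

def transform(col, mean, std, scaled_min, gap=1e-3):
    """Standardize valid entries and place defaults just below the scaled minimum."""
    scaled = (col - mean) / std
    scaled[col == MISSING] = scaled_min - gap
    return scaled

# Toy training column with two missing entries.
train = np.array([0.5, 1.5, MISSING, 2.5, MISSING, 3.5])
mean, std, scaled_min = fit_scaler(train)
scaled = transform(train, mean, std, scaled_min)
```

The scaler is fitted once on the training set and then reused unchanged for the validation and test sets, so that no information from those sets enters the normalization.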
Sample weights are calculated to exclude a potential flavor dependence of the classifier on the particular kinematic properties of the chosen dataset and to correct for the inherent class imbalance. The reweighting aims at identical kinematic distributions for all three flavors and is done with respect to the jet transverse momentum ($p_\mathrm{T}$) and pseudorapidity ($\eta$) distributions [12]. The target shape is the average of the three initial distributions, thus balancing the relative fractions of the three classes at the same time. These distributions are binned into a 2D grid of 50 × 50 bins, spanning the ranges (20, 900) GeV and (−2.5, 2.5), respectively. When calculating the loss per batch, the individual per-sample losses are multiplied by these weights.
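One way to realize such a reweighting is sketched below. It is an illustration under stated assumptions rather than the procedure actually used: the function name, the normalization of each flavor's histogram to unit area before averaging, and the toy data are all hypothetical.

```python
import numpy as np

def kinematic_weights(pt, eta, flavor, n_bins=50, n_flavors=3):
    """Per-jet weights that map every flavor onto the average (pT, eta) shape."""
    pt_edges = np.linspace(20.0, 900.0, n_bins + 1)
    eta_edges = np.linspace(-2.5, 2.5, n_bins + 1)
    # Bin index of every jet; values on or beyond the outer edges go to the edge bins.
    ipt = np.clip(np.digitize(pt, pt_edges) - 1, 0, n_bins - 1)
    ieta = np.clip(np.digitize(eta, eta_edges) - 1, 0, n_bins - 1)
    # Jet counts per flavor and bin.
    counts = np.zeros((n_flavors, n_bins, n_bins))
    np.add.at(counts, (flavor, ipt, ieta), 1.0)
    # Unit-normalized shape per flavor; the target is their average.
    shapes = counts / counts.sum(axis=(1, 2), keepdims=True)
    target = shapes.mean(axis=0)
    # A jet's own bin is never empty for its own flavor, so the division is safe.
    return target[ipt, ieta] / counts[flavor, ipt, ieta]

rng = np.random.default_rng(0)
n_jets = 6000
flavor = rng.choice(3, size=n_jets, p=[0.1, 0.1, 0.8])  # imbalanced toy classes
pt = rng.uniform(20.0, 900.0, size=n_jets)
eta = rng.uniform(-2.5, 2.5, size=n_jets)
weights = kinematic_weights(pt, eta, flavor, n_bins=4)   # coarse grid for the toy data
```

With these weights, each flavor follows the same binned target shape and carries the same total weight, as long as every flavor populates every bin of the target.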

Reference Classifier
The studies are carried out on a jet flavor tagging algorithm similar in implementation to the ones used at the LHC experiments, such as ATLAS and CMS. We use a fully-connected sequential model with five hidden layers of 100 nodes each. We use dropout layers [43] with a 10% probability of zeroing out each neuron at each hidden layer to prevent overfitting. The Rectified Linear Unit (ReLU) activation function [2,12,41] is used for the hidden layers; the activation of the output layer is computed with the Softmax [2,12] function. In total, there are 184 input nodes, where the low-level per-track features are flattened. We define three output classes, analogous to the dataset.
As loss function, we use the categorical cross entropy loss [41,44], multiplied with an additional term that downweights easy-to-classify samples during training. The resulting formula for the so-called focal loss [45,46,47], evaluated for one batch of length $N$, is given in Eq. (1), where $y_{ij}$ is a placeholder for the output probability assigned to one of the three possible flavors $j$ of the jet $i$, $\hat{y}_{ij}$ can be understood as the one-hot-encoded truth label, which is either 0 or 1, $w_i$ is the sample weight obtained from preprocessing, and $\gamma$ is called the focusing parameter. Though we already treat the class imbalance by reweighting the nominal loss function, without the focusing term the neural network is prone to assign the most frequent class. In a setting with highly imbalanced data, the chosen technique ensures smooth classifier output distributions, which we achieve by choosing a focusing parameter of $\gamma = 25$. Model parameters are updated with the Adaptive Moment Estimation (Adam) optimizer [48] using PyTorch's [49] default settings, further controlled with a learning rate schedule [50] that starts at 0.0001 and decays proportionally to $(1 + \mathrm{epoch}/30)^{-1}$. The batch size has been fixed to $2^{16} = 65{,}536$. To ensure that there is no overfitting, training is stopped when the validation loss no longer improves [2]. For each training, the model's parameters are saved after each iteration through the full training dataset (i.e. after each epoch) to store a checkpoint for later evaluation.
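A weighted focal loss built from the quantities defined above, together with the learning rate schedule, might look like the following sketch. The function names are hypothetical, and the normalization by a plain batch mean is an assumption, since the exact prefactor is not recoverable from the text.

```python
import numpy as np

def focal_loss(probs, onehot, weights, gamma=25.0):
    """Weighted focal loss for one batch.

    probs:   (N, 3) softmax outputs y_ij
    onehot:  (N, 3) one-hot truth labels
    weights: (N,)   sample weights w_i
    The plain batch mean used as normalization is an assumption.
    """
    p_true = np.sum(probs * onehot, axis=1)            # probability of the true class
    per_jet = -((1.0 - p_true) ** gamma) * np.log(p_true)
    return np.mean(weights * per_jet)

def learning_rate(epoch, lr0=1e-4, tau=30.0):
    """Learning rate proportional to (1 + epoch / tau)^-1."""
    return lr0 / (1.0 + epoch / tau)

probs = np.array([[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]])
onehot = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
weights = np.ones(2)
```

Setting `gamma=0` recovers the weighted cross entropy, which makes the role of the focusing term easy to check: confidently classified jets, with a true-class probability close to 1, contribute very little once $\gamma$ is large.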

Evaluation Metrics
While multi-class taggers are convenient for implementation, for physics analysis purposes one is often interested in constructing classifiers that distinguish two classes at a time. We take appropriate likelihood ratios of the bottom, charm and light output classes as needed for discrimination. The likelihood ratio XvsY for discriminating class X from Y is given as:
$$\mathrm{XvsY} = \frac{P(X)}{P(X) + P(Y)}. \qquad (2)$$
For example, for the BvsL discriminator, $P(X)$ and $P(Y)$ refer to the classifier's score for the bottom and light flavor jets, respectively. The performance of the binary classifiers is visualized and evaluated using Receiver Operating Characteristic (ROC) curves [51,52,53]. With some loss of information, a ROC curve is characterized by its area under the curve (AUC), which can be used as a reasonable single scalar proxy for the classifier performance [54]. It should be noted that, due to a large class imbalance in the available dataset, accuracy could be a misleading measure of the performance [54].
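As an illustration of how a binary discriminator and its AUC can be derived from the three-class output, consider the sketch below. It assumes the common ratio $P(X)/(P(X)+P(Y))$ for XvsY and computes an unweighted AUC through pairwise ranking; the function names and the toy scores are hypothetical.

```python
import numpy as np

def x_vs_y(p_x, p_y):
    """Binary discriminator built from two class probabilities (assumed form)."""
    return p_x / (p_x + p_y)

def roc_auc(signal_scores, background_scores):
    """AUC as the probability that a signal jet outranks a background jet.

    Ties count as one half. Sample weights are ignored in this sketch.
    """
    s = np.asarray(signal_scores)[:, None]
    b = np.asarray(background_scores)[None, :]
    return np.mean((s > b) + 0.5 * (s == b))

# Columns: P(b), P(c), P(light); the first two jets are b jets, the last two light jets.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8],
                  [0.5, 0.3, 0.2]])
is_b = np.array([True, True, False, False])
bvsl = x_vs_y(probs[:, 0], probs[:, 2])
auc = roc_auc(bvsl[is_b], bvsl[~is_b])
```

The pairwise form scales quadratically with the number of jets, so for realistic sample sizes a sorted-rank implementation would be the more practical choice.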

Adversarial Attacks
One way to generate adversarial inputs is the Fast Gradient Sign Method (FGSM) [2,7], which modifies the inputs in a systematic way, such that the loss function increases. First, the direction of the steepest increase of the loss function around the raw inputs is computed. Mathematically, the operator that retrieves the "steepest increase" is the gradient of the loss function with respect to the inputs. Of this direction, only the sign is kept, and the resulting vector is multiplied with a (small) limiting parameter $\epsilon$ to specify the desired severity of the impact. Then, the nominal inputs are shifted by this quantity. It can, therefore, be seen as a technique to maximally disturb the inputs or maximally confuse the network without necessarily manifesting in the input variable distributions. Expressed in a single equation, the FGSM attack generates adversarial inputs $x_{\mathrm{FGSM}}$ from raw inputs $x_{\mathrm{raw}}$ by computing
$$x_{\mathrm{FGSM}} = x_{\mathrm{raw}} + \epsilon \cdot \mathrm{sgn}\left(\nabla_{x_{\mathrm{raw}}} J(x_{\mathrm{raw}}, y)\right), \qquad (3)$$
where sgn(α) stands for the sign of α. In Eq. (3), the loss function is denoted as $J(x_{\mathrm{raw}}, y)$, a function of the inputs ($x_{\mathrm{raw}}$) and targets ($y$). Moreover, the FGSM attack can be interpreted as a method that locally inverts the approach of gradient descent by performing a gradient ascent with the loss function, but in the input space [7,26,28]. Using the terminology of Ref. [26], this is a white box attack with full knowledge of the network (architecture and parameters). A visualization is shown in Fig. 1, for didactic reasons with one input variable $x_i$ only. In practice, this method is applied multidimensionally, assigning the same limiting parameter $\epsilon$ in each input dimension. Whereas the gradient of an arbitrary function could yield any value, the distortion should stay within reasonable bounds to mimic the behaviour of possible mismodelings or differences between data and simulation [7,10]. Therefore, we go only a small step in the direction of the gradient, which is expected to introduce practically unnoticeable changes of the input distributions [6,7].
Increasing the number of inputs to the model also increases the susceptibility towards adversarial attacks, because each shift by $\epsilon$ for additional features is propagated to the change in activation [7]. Thus it is conceivable that individual feature distributions remain almost unaffected, but the performance of the neural network is substantially deteriorated.
The FGSM attack does not necessarily replicate a global worst-case scenario [28]. Depending on the actual properties of the loss surface, the adversarial attack could shift the inputs also into local minima (or at least harmless regions), if the limiting parameter is chosen unluckily. On average, with small distortions only, it is still expected that in a given region, the attack will maximally confuse the model up to first order.
In this implementation, the FGSM attack is not applied to integer variables, such as the number of tracks, or to defaulted values, which would not be shifted by $\epsilon$ in a physically meaningful way.
As large distortions of input variables would be easy to detect, a limit of 25% with respect to the original value is applied to the perturbation. The modified value $x_{\mathrm{FGSM}}$ is then given by Eq. (4), where $x$ denotes the original input value, $\tilde{x}$ the transformed (preprocessed) value, and $\epsilon$ the FGSM scaling factor. Inverting the normalization is denoted by $(\cdot)^{-1}$.
Distortions of low-level features are not propagated to high-level features; instead, each feature is taken into account via the multidimensional gradient only. Therefore, correlations are not fully taken into account.
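The attack and its restrictions can be illustrated with a small NumPy sketch. Several parts of it are assumptions: a linear softmax classifier stands in for the deep network so that the input gradient can be written analytically, the 25% cap is implemented by clipping in the original (unnormalized) units, and the names `fgsm`, `frozen` and `max_rel` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(x, onehot, W, b):
    """Mean cross entropy of a toy linear softmax classifier."""
    return -np.mean(np.log(np.sum(softmax(x @ W + b) * onehot, axis=1)))

def input_gradient(x, onehot, W, b):
    """Analytic gradient of the loss above with respect to the inputs x."""
    return (softmax(x @ W + b) - onehot) @ W.T / len(x)

def fgsm(x_scaled, grad, eps, mean, std, frozen, max_rel=0.25):
    """FGSM step in the normalized space, capped in original units."""
    step = eps * np.sign(grad)
    step[frozen] = 0.0                     # integer or defaulted entries stay fixed
    x_orig = x_scaled * std + mean         # invert the normalization
    x_adv_orig = (x_scaled + step) * std + mean
    bound = max_rel * np.abs(x_orig)       # at most 25% of the original value
    x_adv_orig = np.clip(x_adv_orig, x_orig - bound, x_orig + bound)
    return (x_adv_orig - mean) / std       # back to the normalized space

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
W, b = rng.normal(size=(4, 3)), np.zeros(3)
onehot = np.eye(3)[rng.integers(0, 3, size=8)]
mean, std = np.array([1.0, 0.5, 2.0, 3.0]), np.array([0.5, 1.0, 2.0, 1.5])
frozen = np.zeros(x.shape, dtype=bool)
frozen[:, 3] = True                        # pretend the last feature is an integer count
x_adv = fgsm(x, input_gradient(x, onehot, W, b), 0.05, mean, std, frozen)
```

For a deep network the gradient would come from automatic differentiation instead of the closed form used here; the shift, masking and clipping steps stay the same.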

Adversarial Training
The approach that will be followed in this study is a simple type of adversarial training that injects perturbed inputs already during the training phase [26]. The algorithmic description is shown in Fig. 2. The difference between the nominal and the adversarial training is highlighted in red. In fact, in this approach the neural network never sees the raw inputs during the whole training step [26,27,28]. In Fig. 3, this is shown with the insertion of a red block prior to backpropagation. The idea is that by applying the FGSM attack continuously to the training data (for every minibatch, i.e. with every intermediate state of the model after updating the model parameters), the network is less likely to learn the simulation-specific properties of the used sample. Instead, the introduction of a saddle point into the loss surface is expected to improve the generalization capability of the network [2,26,28]. This can be understood as a "competition" between gradient descent to solve the outer minimization problem and gradient ascent to handle the inner maximization [28]. Madry et al. [28] have shown that this is an effective method to reduce susceptibility to first-order adversaries, obtained from an FGSM attack. In that sense, adversarial training could also be described as a regularization technique, but a more systematic one than only randomly smearing inputs (another example of data augmentation), randomly deleting connections (dropout), or assigning a probability to the different targets to be wrong (label smoothing) [2].
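The training loop described above can be sketched as follows, again with a linear softmax classifier as a stand-in for the deep network and plain minibatch gradient descent instead of Adam. All names, the toy dataset and the hyperparameters are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(x, onehot, W, b):
    return -np.mean(np.log(np.sum(softmax(x @ W + b) * onehot, axis=1)))

def fgsm(x, onehot, W, b, eps):
    """Shift inputs by eps along the sign of the input gradient of the loss."""
    grad_x = (softmax(x @ W + b) - onehot) @ W.T
    return x + eps * np.sign(grad_x)

def train_adversarially(x, onehot, eps=0.1, lr=0.05, epochs=20, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    W, b = np.zeros((x.shape[1], onehot.shape[1])), np.zeros(onehot.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(x))
        for start in range(0, len(x), batch):
            idx = order[start:start + batch]
            # Craft adversarial inputs with the current parameters ...
            x_adv = fgsm(x[idx], onehot[idx], W, b, eps)
            # ... and update the parameters on the perturbed inputs only.
            diff = softmax(x_adv @ W + b) - onehot[idx]
            W -= lr * x_adv.T @ diff / len(idx)
            b -= lr * diff.mean(axis=0)
    return W, b

# Toy data: three Gaussian blobs standing in for the three flavors.
rng = np.random.default_rng(1)
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
labels = rng.integers(0, 3, size=300)
x = centers[labels] + 0.5 * rng.normal(size=(300, 2))
onehot = np.eye(3)[labels]
W, b = train_adversarially(x, onehot)
clean = loss(x, onehot, W, b)
attacked = loss(fgsm(x, onehot, W, b, 0.1), onehot, W, b)
```

The key point is that the attack is regenerated for every minibatch from the current model state, so the model never sees the raw inputs and the perturbation always targets the most recent parameters.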
The principle behind this technique involves the linearity of neural networks, to which the high susceptibility to mismodelings is attributed. Adversarial training can be interpreted as a method that adjusts the loss surface to be locally constant around the inputs and that downsizes the impact of perturbations evaluated with a high-dimensional linear function [2]. Slightly distorted inputs then cannot significantly increase the value of the loss function, because it is almost flat in the vicinity of the raw inputs [55]. This can be seen as a geometrical problem where the loss manifold is flattened [55,56,57,58]. When evaluating this adversarially-trained model with distorted test inputs, the model should be more robust to those modifications and the performance should not be affected as much as with the generic training. The price for the increased robustness is that the maximally achievable performance on raw inputs can be somewhat reduced with respect to the nominal training [2].

Inference
The inference step is split into two separate parts, which can be seen in Fig. 4. [...] typical data-to-simulation agreements of the level of 10-20% [12].

Vulnerability of the Nominal Training
First, we establish how susceptible the nominal model is to FGSM attacks (mismodelings) of various magnitudes. Figure 6 shows the ROC curves for the BvsL (left) and CvsL (right) discriminators, on FGSM datasets generated with varying parameter $\epsilon$ and on the nominal inputs. As expected, the model performs best on undisturbed test samples with an AUC of 0.946, but the performance decays quite quickly with increasing $\epsilon$. At $\epsilon = 0.05$, which still only causes barely visible differences in the input distributions, the model reaches an AUC of 0.883. At the 1% mistag working point, this would correspond to a decrease in signal efficiency from 73% to 60%, requiring a scale factor of 0.82.
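The arithmetic behind the quoted scale factor, and the way an efficiency at a fixed mistag rate can be read off from discriminator scores, can be sketched as below. The function names are hypothetical, and treating the perturbed sample as the stand-in for data is an interpretation of the text.

```python
import numpy as np

def efficiency_at_mistag(signal_scores, background_scores, mistag_rate):
    """Signal efficiency at the threshold that passes `mistag_rate` of the background."""
    threshold = np.quantile(background_scores, 1.0 - mistag_rate)
    return np.mean(np.asarray(signal_scores) > threshold)

def scale_factor(eff_perturbed, eff_nominal):
    """Ratio of efficiencies; the perturbed sample plays the role of data here."""
    return eff_perturbed / eff_nominal

sf = scale_factor(0.60, 0.73)  # efficiencies quoted in the text; about 0.82
```

In practice the threshold would be fixed once on the nominal sample and then applied unchanged to the perturbed one, so that the change in efficiency reflects the attack and not a shifted working point.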
In the context of the ongoing hunt for better performing classifiers, it is of interest to investigate the susceptibility in relation to the performance. Some insight can be gleaned by evaluating the performance of the classifier at various steps during the training on both the nominal and the perturbed datasets with a fixed $\epsilon = 0.05$, where an AUC value is calculated for each checkpoint. This dependence is shown in Fig. 7, again for the two discriminators. Not surprisingly, before the training performance becomes saturated, longer training leads to an increase in nominal performance. However, at the same time it shows higher vulnerability towards adversarial attacks. In fact, the performance on the perturbed datasets follows exactly the opposite trend. Another way to phrase this finding is that the least performant configuration (after only a few epochs, or iterations through the full training dataset) shows the highest robustness, i.e. the gap between dashed and solid lines is minimal.

Improving Robustness Through Adversarial Training
In this subsection, the studies described above are repeated with the adversarial model, using the same setup for the attacks when performing the inference.
As a check of robustness, we perform a direct comparison of the nominal and adversarial training, crafting the FGSM samples individually per model, with the resulting ROC curves for the BvsL and CvsL discriminators shown in Fig. 8. [...] The label leaking effect (see Ref. [59]), which refers to a better performance on adversarial examples than on undisturbed data for an adversarial model, is not observed. Figure 9 compares the susceptibility to mismodeling of the two classifiers as a function of performance. FGSM samples have been generated individually for each model and checkpoint (denoting each epoch with a single point) to scan over different discrete stages of the training. The higher density of points in the high-performance region reflects the small improvements at later stages of the training, while the performance gain during the first few epochs is quick. Ideally, there would be a constant relation that shows no signs of decreasing robustness for increasing performance. However, we observe a considerable deterioration (and thus higher susceptibility to mismodeling) of the nominal classifier. The effect for the adversarial model, while still noticeable, is to a large degree mitigated. In fact, the adversarial training seems to recover some of its robustness (e.g. peaking at an AUC of around 0.938) before the impact at higher performance starts to worsen the resistance. Again, this shows the intriguing trade-off between performance and robustness for the nominal training, where training to highest performance is not necessarily advisable due to high susceptibility. On the other hand, the adversarial training performs equally well on nominal samples and only shows a weak functional dependence between performance on first-order adversaries and the respective undisturbed performance.
Fig. 9 Relation between susceptibility and nominal performance for the nominal and adversarial training, tested on systematically distorted inputs with varying $\epsilon$ in different colors. The x axis shows nominal performance, measured with BvsL AUC, while the y axis shows the difference between disturbed and raw AUC. When there is a drop on the y axis while moving to higher nominal performance (x axis), this indicates higher susceptibility. The empty markers represent the nominal training, which becomes highly vulnerable with increasing nominal performance (with the drop always getting steeper), while the filled markers for adversarial training show a much flatter relation.

Probing Flavor Dependence of the Attack as a Proxy for Generalization Capability
In an attempt to understand why the adversarial model is more robust than the nominal classifier, we investigate nominal and perturbed input distributions of a selected feature, split by flavor. This test aims at visualizing geometric properties of the distorted samples, for which we purposefully choose a large $\epsilon$ of 0.1. This is equal to the regular FGSM attack described by Eq. (3) without the limitation described in Eq. (4). The signed impact parameter ($d_0$) as shown in Fig. 10 originally offers discriminating power via the fact that heavy-flavor jets contain displaced tracks associated to a secondary vertex, which should naturally lead to more positive values for the $d_0$ variable. For light-flavored jets, this behaviour is not expected; instead, the tracks in light jets have a roughly symmetric $d_0$ distribution, peaking at 0, apart from some skewness due to relatively long-lived but light hadrons ($K^0_S$ or $\Lambda$) or contamination with tracks from heavy-flavor hadrons [12].
For the nominal training, light-flavor jets are shifted mostly into the positive region, which should be dominated by b jets; b jets are shifted to the negative region where these jets were not abundant previously. From a geometric point of view, the FGSM attack on the nominal training produces asymmetric shapes. On the other hand, the resulting perturbed input distributions for the adversarial training are symmetric. We observe that the adversarial model is almost agnostic to the direction into which the FGSM attack shifts the inputs, while the nominal training shows a clear preferred direction that could be described as an inversion of the expected physics. For the adversarial training, the attack seems to have difficulties deciding which direction is the worse direction, resulting in a perceived "coinflipping" of the shift. Thus, the adversarial training remains less susceptible than the nominal training, even when the distortions are noticeably large.
It is conceivable that the different geometric properties of the distributions are related to the geometry of the loss surface [55,56,57,58]. This is expected to be responsible for differences in robustness as well. Figure 11 illustrates how the flatness of the loss surface in the vicinity of raw inputs could influence symmetric or asymmetric shifts.
A nominal training converges into a minimum associated with the default distributions. In that case, for a given flavor, there will be a specific vector pointing away from a local minimum and the direction is fixed according to the steepest increase in loss. The adversarial training always "sees" (new) adversarial inputs, so the adjustment of the model's parameters might average out eventually over further training epochs. Always following the newly distorted inputs yields a locally constant loss manifold around the original inputs due to the more complex saddle point problem. This would mean that not the exact memorization of training data, but rather higher-order correlations contribute to the improvement of the performance of the adversarial training [26,28,55,56]. With the assumption of a flat loss surface close to the raw inputs there would be no preferred direction for first-order adversarial attacks crafted for the adversarial model. Many vectors would fulfill the criterion of pointing in the direction of increasing loss, much like choosing the direction randomly. Thus, judging from the geometric properties of adversarial samples, a flat loss landscape for the adversarial model is highly probable, leading to higher robustness [55,56]. For mismodelings of order $\epsilon$ that are still on-manifold, the adversarial training would generalize better to data than nominal training. Robustness and generalization are not equivalent [26,28,60], which is why the above statement cannot be general, but is only valid under the assumption that adversarial methods like the FGSM attack replicate mismodelings between simulation and detector data.

Conclusion
In this paper, we investigated the performance of a jet flavor tagging algorithm when being exposed to systematically distorted inputs that have been generated with an adversarial attack, the Fast Gradient Sign Method. Moreover, we showed how model performance and robustness are related. We explored the trade-off between performance on unperturbed and on distorted test samples, investigating ROC curves and AUC scores for the BvsL and CvsL discriminators. All tests conducted with the nominal training confirm earlier findings that relate higher performance with higher susceptibility, now for a deep neural network that replicates a typical jet tagging algorithm. We applied a defense strategy to counter first-order adversarial attacks by injecting adversarial samples already during the training stage of the classifier, but without altering the network architecture.
When comparing this new classifier with the nominal model, no difference in performance was observed, but the robustness towards adversarial attacks is enhanced by a large margin. Exemplary for the direct comparison of the two trainings, both reached an AUC score of approximately 76% when discriminating c from light jets, but an FGSM attack that is still moderate in its impact on the input distributions decreases the performance of the nominal training by 21%, and only by 8.2% for the adversarial training. A study of raw and distorted input distributions allowed us to relate geometric properties of the attack with geometric properties of the underlying loss surfaces for a nominal and an adversarially trained model, yielding a possible explanation for the higher robustness of the latter attributed to flatness of the loss manifold.
To some extent, the higher robustness as shown in this paper points at better generalization capability, but a study that will also utilize detector data has yet to be conducted to confirm this conjecture. The approach followed for this work is comparatively general, in that it only needs access to the model and the criterion. This is the first application of adversarial training to build a robust jet flavor tagger suitable for usage at the LHC.
It would be interesting to apply this type of attack and defense also to more complex neural network structures to see if, for example, convolutional layers are able to leverage adversarial attacks differently, and if adversarial training is as effective for taggers with a larger (or smaller) dimension in the feature space. Another focus could be on using adversarial methods of higher complexity, both for the attack and for the defense against them. Summarizing the efforts so far, adversarial training was applied successfully to resist first-order adversarial attacks on jet flavor tagging algorithms; corresponding studies with higher-order adversaries are left for future investigations.
Acknowledgements Simulations were performed with computing resources granted by RWTH Aachen University under project nova0021 and rwth0619. This work has received support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, projects SCHM 2796/5 and GRK 2497), and the Bundesministerium für Bildung und Forschung (BMBF, Project 05H2021). We thank Nicolas Frediani for his contributions to the project in context of his bachelor thesis.

Declarations
Data Availability Statement This manuscript has associated data in a data repository [Authors' comment: The dataset has been generated with code accessible under Ref. [61] and can be accessed at the UCI Machine Learning in Physics Web portal under the link http://mlphysics.ics.uci.edu/].
Fig. 11 Illustration of the potential geometry of the loss surfaces for the nominal as well as the adversarial training. Inspired by Refs. [55,56,57].

B.1 Smearing Inputs with a Gaussian Noise Term
While the FGSM attack aims at worst-case scenarios in the direction of increasing gradients, a physical effect induced by mismodeling of the parton shower or caused by detector misalignment or miscalibration does not know the model parameters or its loss surface. It can therefore not act as a "demon" [10] that always points in a preferred direction. Investigating a smearing technique independent of the model under consideration is of interest when studying the robustness to more typical mismodeling scenarios or fluctuations that are of statistical nature. A non-systematic strategy to create a new, slightly distorted set of inputs randomly shifts the variables by adding a noise term $\xi$ to the original inputs, drawn from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ [2,6,7,59]. As described in Sect. 2.2, the inputs are scaled to a standard deviation of one and are centered at zero, thus allowing this smearing without further processing. The effect of this distortion is shown in Fig. 12. Only one arbitrary input $x_i$ has been chosen for visualization, and the displayed loss function is just an illustration. Compared to the settings introduced for the FGSM attack, the difference is that the magnitude of the distortion is now given by $\sigma = 1$ (not $\epsilon$). Other parameters remain untouched: the limitation to 25% of the value applies as well, and we choose $\mu = 0$ for this test against random fluctuations of features. From Fig. 13 it is evident that the adversarial model also performs better than the nominal model when tested on randomly smeared inputs, although the advantage over nominal training is not as large as for the FGSM attack. Measured by the difference in AUC, adversarial training yields a susceptibility to Gaussian noise that is smaller by a factor of 2 compared to nominal training. Therefore we conclude that also in this scenario, which is somewhat closer to typical mismodelings found in the HEP context, the adversarial training is more robust.
Fig. 12 Visualization of the random shift of inputs by adding a Gaussian noise term. With the slight distortion based on the blue probability distribution, the formerly green raw datapoint is shifted and the corresponding loss modified. The change of the loss function with respect to the distorted inputs can go in either direction. Gaussian distribution adapted from Ref. [63].
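A model-independent smearing of this kind could be sketched as follows. As assumptions of this example, the same masking of integer or defaulted entries and the same 25% cap as for the FGSM attack are applied by clipping in the original units; the names `gaussian_smear`, `frozen` and `max_rel` are hypothetical.

```python
import numpy as np

def gaussian_smear(x_scaled, sigma, rng, mean, std, frozen, mu=0.0, max_rel=0.25):
    """Add Gaussian noise in the normalized space, capped in original units."""
    noise = rng.normal(mu, sigma, size=x_scaled.shape)
    noise[frozen] = 0.0                        # integer or defaulted entries stay fixed
    x_orig = x_scaled * std + mean             # invert the normalization
    x_new = (x_scaled + noise) * std + mean
    bound = max_rel * np.abs(x_orig)           # at most 25% of the original value
    x_new = np.clip(x_new, x_orig - bound, x_orig + bound)
    return (x_new - mean) / std                # back to the normalized space

rng = np.random.default_rng(3)
x = rng.normal(size=(10, 3))
mean, std = np.array([1.0, 2.0, 4.0]), np.array([0.5, 1.0, 2.0])
frozen = np.zeros(x.shape, dtype=bool)
frozen[:, 2] = True                            # pretend the last feature is an integer count
x_smeared = gaussian_smear(x, 1.0, rng, mean, std, frozen)
```

Unlike the FGSM attack, this perturbation needs no access to the model or its gradients, which is why its direction is unrelated to the loss surface.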

B.2 Transferability of Adversarial Samples as a Black-Box Attack
Adversarial samples created for one model can also deteriorate the performance of another, independent model, which is known as transferability of adversarial samples [7,26,28]. For this study, the two models under consideration share the same architecture, but the weights and bias terms differ as a result of the different training strategies, thus yielding suitable candidates to investigate the aforementioned transferability. In fact, when injecting the same FGSM inputs generated for the nominal model into both models, we obtain another set of predictions. This can be understood as a black-box attack on the adversarial model [28], as the adversarial inputs are crafted without knowledge of the exact parameters of the adversarial model. In Fig. 4, this corresponds to using the blue branch for both models as an identical set of samples. The parameter used for this scenario is $\epsilon = 0.05$, with the limitation introduced in Eq. (4). Figure 14 shows that the adversarial model is also more robust to this perturbation.
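The mechanics of such a transfer test are sketched below. The two "models" are untrained linear softmax classifiers with random parameters, used only to show which model the samples are crafted for and which models they are evaluated on; they say nothing about the actual robustness of either training, and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(x, onehot, W, b):
    return -np.mean(np.log(np.sum(softmax(x @ W + b) * onehot, axis=1)))

def fgsm(x, onehot, W, b, eps):
    """Shift inputs by eps along the sign of the input gradient of the loss."""
    grad_x = (softmax(x @ W + b) - onehot) @ W.T
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(2)
x = rng.normal(size=(50, 4))
onehot = np.eye(3)[rng.integers(0, 3, size=50)]
# Two stand-in models with the same architecture but different parameters.
W_nom, b_nom = rng.normal(size=(4, 3)), np.zeros(3)
W_adv, b_adv = rng.normal(size=(4, 3)), np.zeros(3)

# The samples are crafted with the parameters of the nominal model only ...
x_fgsm = fgsm(x, onehot, W_nom, b_nom, eps=0.05)
# ... and then evaluated on both models with identical inputs.
white_box = loss(x_fgsm, onehot, W_nom, b_nom) - loss(x, onehot, W_nom, b_nom)
black_box = loss(x_fgsm, onehot, W_adv, b_adv) - loss(x, onehot, W_adv, b_adv)
```

Here `white_box` is the loss increase for the model that the attack was built for, and `black_box` the change for the second model, which never exposed its parameters to the attack.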

B.3 Shifting Inputs Systematically with Up/Down Variations
In this simplified scenario, inputs are modified without prior knowledge of the model parameters. For this variation, features are simultaneously shifted upwards (or downwards) by adding (subtracting) small distortions to (from) the nominal values. Whereas the present dataset does not contain the systematic uncertainties directly, we estimate the magnitude of the distortion that is applied in a feature-wise manner with the help of existing commissioning results [12,16,20,22] by the CMS and ATLAS collaborations. A baseline magnitude of 0.05 has been chosen, which is weighted by a factor $s_i$ ranging from 1 to 5, depending on the maximally observed data-to-simulation disagreement for input $i$:
$$x_{\mathrm{sys},i} = x_{\mathrm{raw},i} \pm 0.05 \cdot s_i, \quad \text{where } s_i \in \{1, \dots, 5\}. \qquad (7)$$
The largest deviation in the data-to-simulation ratio is accounted for by incrementing the initial factor of 1 $s_i$-times (cf. Eq. (7)). Figures 15 and 16 show that in the case of simultaneous up- or downwards variations, the adversarial model maintains a higher performance than the nominal model. The impact of this distortion is not as large as the one observed for the FGSM attack, and this perturbation does not take correlations into account, which is why the advantage of adversarial training over nominal training is less pronounced and we might not have seen the worst possible case yet. However, in this simplified scenario, adversarial training can be considered more robust towards systematic shifts of input features.
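Equation (7) translates directly into a few lines of code. The per-feature factors used below are made-up values within the stated range, and the function name is hypothetical.

```python
import numpy as np

def shift_inputs(x_raw, s, direction, base=0.05):
    """Shift all features simultaneously up (+1) or down (-1) by base * s_i."""
    if direction not in (+1, -1):
        raise ValueError("direction must be +1 (up) or -1 (down)")
    return x_raw + direction * base * np.asarray(s)

x_raw = np.array([[0.2, -1.0, 0.7],
                  [1.1, 0.3, -0.4]])
s = np.array([1, 3, 5])   # hypothetical per-feature factors in {1, ..., 5}
x_up = shift_inputs(x_raw, s, +1)
x_down = shift_inputs(x_raw, s, -1)
```

Because every jet receives the same shift per feature, this variation is independent of the model and of the truth label, in contrast to the FGSM attack.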

C Computing
Processing of the data is carried out with the awkward [64] package, later evaluation is facilitated by coffea [65], and the graphics are prepared with matplotlib [66]. The neural network training is performed with the PyTorch [49] library on an NVIDIA Tesla V100 GPU.

Short name | Description
Jet $p_\mathrm{T}$ | Transverse momentum of the jet with respect to the beam line
Jet $\eta$ | Jet pseudorapidity
Track 2 (3) $d_0$ ($z_0$) significance | Magnitude of impact parameter significance of the second (third) track, transverse to the (along the) beam line, after ranking them by $|d_0|$ significance
N tracks over $d_0$ threshold | Number of tracks with transverse impact parameter significance over 1.8
Jet Prob | Light jet probability (see Ref. [67]); product of likelihoods over all tracks to have come from a light quark jet
Jet width $\eta$ ($\phi$) | Width of the jet in $\eta$ ($\phi$) coordinates, obtained from all tracks in the jet via