Reducing the Dependence of the Neural Network Function on Systematic Uncertainties in the Input Space

Applications of neural networks to data analyses in the natural sciences are complicated by the fact that many inputs are subject to systematic uncertainties. Several methods have been proposed to control the dependence of the neural network function on variations of the input space within these systematic uncertainties. In this work, we propose a new approach: the neural network is trained with penalties on the variation of its output included directly in the loss function. This comes at the cost of only a small number of additional hyperparameters, and it can also be pursued by treating all systematic variations in the form of statistical weights. The proposed method is demonstrated with a simple example based on pseudo-experiments and with a more complex example from high-energy particle physics.

1 Introduction

Examples of the use of neural networks (NNs) in physics object identification, e.g. at the LHC experiments ATLAS and CMS, are the classification of particle jets induced by heavy-flavor quarks [1,2] and the identification of τ leptons [3,4]. Examples of data analyses that make use of NNs not only for object identification, but to distinguish between signal- and background-like samples, are the latest analyses of Higgs boson events in association with third-generation fermions at the LHC [5,6,7,8,9]. These classification tasks usually aim at the distinction of a signal from one or more background processes. They are characterized by a relatively small number of O(10–100) input parameters to the NN, which may reveal non-trivial correlations among each other.
Each physics measurement is subject to systematic uncertainties, which have to be propagated from the input space x = {x_i} to the NN output f(x). This usually happens in terms of variations of a given input parameter x_i within its uncertainties Δ_i. These may be implemented as variations of the actual values of x_i, or such that a sample with a given value of x_i enters the analysis with a different statistical weight, also referred to as reweighting throughout this text. Unlike varying the values of x_i, reweighting does not rely on a reprocessing of the dataset and therefore generally implies significantly smaller computational costs.
The possibility to implement prior information about systematic uncertainties Δ = {x_i + Δ_i} already in the NN training is motivated by two considerations. Firstly, a distinction between classes that is powerful in principle can be considerably compromised by systematic uncertainties. Integrating prior knowledge of uncertainties into the NN training helps to guide the NN towards features in the input space that are less prone to such a performance degradation. This may even result in a gain in analysis performance, as observed in Ref. [10]. Secondly, the dependence of a systematic variation of a given feature x_i on other parameters {x_j}, j ≠ i, in the input space might be only poorly known, or even unknown, and the user might want to decorrelate the NN output from this uncertainty altogether to assure a reliable response of the NN to the given task. Both points raise interest in training the NN with the boundary condition that the dependence of f(x) on Δ should be minimal.
One way to achieve this decorrelation of f(x) from Δ that has been proposed in the past makes use of a secondary NN that is trained in addition to the primary NN in an iterative procedure, resulting in what has been introduced as an adversarial NN in Ref. [11]. This secondary NN has the task of extracting information about the systematic variation from the output of the primary NN. The output of the secondary NN is then included in the loss function of the primary NN as part of a minimax optimization problem. In this way the output of the primary NN becomes insensitive to the systematic variation of the inputs. This method requires a relatively complex iterative training procedure; it introduces a large and to some extent arbitrary number of new hyperparameters implied by the choice of the architecture of the secondary NN, and it requires the resampling of x_i within its uncertainties Δ_i.
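For comparison with the approach proposed below, a minimal sketch of such an adversarial setup is given here. It is not the implementation of Ref. [11]; the architectures, the regression target z for the shift, and the value of λ are illustrative assumptions.

```python
# Schematic adversarial setup: the primary classifier f is trained on the
# minimax objective L_f - lambda * L_r, while the adversary r tries to
# predict the systematic shift z from the classifier output f(x).
import tensorflow as tf

classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(200, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
adversary = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1),   # regresses the shift z applied to the inputs
])

bce = tf.keras.losses.BinaryCrossentropy()
mse = tf.keras.losses.MeanSquaredError()
opt_f = tf.keras.optimizers.Adam()
opt_r = tf.keras.optimizers.Adam()
lam = 10.0

@tf.function
def train_step(x, y, z):
    # Update the classifier on L_f - lam * L_r ...
    with tf.GradientTape() as tape:
        f = classifier(x, training=True)
        loss_f = bce(y, f) - lam * mse(z, adversary(f, training=False))
    grads = tape.gradient(loss_f, classifier.trainable_variables)
    opt_f.apply_gradients(zip(grads, classifier.trainable_variables))
    # ... then update the adversary on L_r alone.
    with tf.GradientTape() as tape:
        loss_r = mse(z, adversary(classifier(x, training=False), training=True))
    grads = tape.gradient(loss_r, adversary.trainable_variables)
    opt_r.apply_gradients(zip(grads, adversary.trainable_variables))
```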
In our approach we implement a penalty on the differences between the NN output obtained from the nominal value of x_i and from its variations Δ_i directly in the loss function. For this purpose we use histograms of f(x) and f(Δ) filled during each training batch. The number n_k of histogram bins {k} and the batch size n_b are hyperparameters of the training. To guarantee a differentiable loss function for the optimization of the trainable parameters of the NN, the histogram bins are blurred by a filter function applied to each sample b of the training batch. We use Gaussian functions G_k(x), normalized to max(G_k(x)) = 1, as filters, where the mean and standard deviation are given by the center and half-width of histogram bin k. The count estimate can then be written as

N_k(f(x)) = Σ_b G_k(f(x_b)),

and the full loss function consists of the two parts

L + λ Λ(x, Δ),

where L corresponds to the loss function of the primary task, for example the cross-entropy function for a classification task, and Λ(x, Δ) to the term that penalizes differences of the NN function between f(x) and f(Δ). The factor λ controls the influence of the penalty and adds another hyperparameter to the training. The count estimate N_k(f(Δ)) can be derived from N_k(f(x)) in terms of reweighting, such that no reprocessing of the dataset during the training procedure is required.
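As an illustration of this construction, the following sketch shows how the Gaussian-blurred bin counts N_k and the penalized loss could be implemented, assuming TensorFlow and a binary classifier with outputs in [0, 1]. The squared difference of the bin counts used for Λ is an assumption made for illustration; the text above only requires Λ to penalize differences between f(x) and f(Δ).

```python
import tensorflow as tf

def soft_histogram(y, n_bins=10):
    """Gaussian-blurred bin counts N_k for NN outputs y in [0, 1].

    Each bin k acts as a filter G_k(y) = exp(-(y - mu_k)^2 / (2 sigma_k^2)),
    with mu_k the bin center and sigma_k the half bin width, so that
    max(G_k) = 1 as described in the text.
    """
    edges = tf.linspace(0.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0              # bin centers mu_k
    half_width = (edges[1] - edges[0]) / 2.0              # sigma_k
    # Shape (batch, n_bins): response of every bin filter to every sample.
    g = tf.exp(-0.5 * ((y[:, None] - centers[None, :]) / half_width) ** 2)
    return tf.reduce_sum(g, axis=0)                       # N_k = sum_b G_k(f(x_b))

def total_loss(y_true, f_nom, f_up, f_down, lam=20.0, n_bins=10):
    """Cross-entropy of the nominal prediction plus the penalty lambda * Lambda."""
    ce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, f_nom))
    n_nom = soft_histogram(tf.reshape(f_nom, [-1]), n_bins)
    penalty = 0.0
    for f_var in (f_up, f_down):                          # outputs for x_i shifted by +/- Delta_i
        n_var = soft_histogram(tf.reshape(f_var, [-1]), n_bins)
        penalty += tf.reduce_sum((n_nom - n_var) ** 2)    # assumed form of Lambda
    return ce + lam * penalty
```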
In Section 2 we demonstrate the method on a simple example based on pseudo-experiments. A more complex analysis task typical for high-energy particle physics is studied in Section 3. We summarize our findings in Section 4.

2 Application to a simple example based on pseudo-experiments
To illustrate our approach, we refer to a simple example based on pseudo-experiments that has also been used in Ref. [11]. It consists of two variables x_1 and x_2, which serve as inputs to separate two classes, in the following labelled as signal and background. The input space is visualized in Fig. 1. A systematic uncertainty for the background class is introduced by two variations of x_2 by ±1.
The NN used to solve the classification task consists of two hidden layers with 200 nodes each, with rectified linear units as activation functions [12], and a sigmoid activation function for the output layer. The trainable parameters are initialized using the Glorot algorithm [13]. The optimization is performed using the Adam algorithm [14] with a batch size of 10^3. Our choice for L is the cross-entropy function. For Λ, we use 10 equidistant bins in the range [0, 1] of the NN output. Finally, we set λ to 20. The training on 5 × 10^4 events is stopped if the loss, evaluated on an independent validation dataset of the same size, has not decreased for five epochs in sequence. In addition, we use 10^5 events for testing and to produce the figures that illustrate the results. The impact of the systematic variations on the NN output is shown in Fig. 2 for a classifier trained with a loss function given only by L (f_L) and a classifier based on a loss function including the additional penalty term Λ (f_LΛ).
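For concreteness, a minimal Keras sketch of this architecture and training setup could look as follows; it assumes TensorFlow/Keras, and the exact Glorot variant and the callback wiring are illustrative rather than taken from the text.

```python
import tensorflow as tf

# Two hidden layers with 200 ReLU nodes, sigmoid output, Glorot initialization.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(200, activation="relu",
                          kernel_initializer="glorot_uniform", input_shape=(2,)),
    tf.keras.layers.Dense(200, activation="relu",
                          kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(1, activation="sigmoid",
                          kernel_initializer="glorot_uniform"),
])

# Plain cross-entropy training yields f_L; for f_LΛ the loss is replaced by the
# penalized total_loss sketched above, which requires the nominal and shifted
# predictions in every batch.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
# model.fit(x_train, y_train, batch_size=1000, epochs=200,
#           validation_data=(x_val, y_val), callbacks=[early_stop])
```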
As can be seen from Fig. 2, the approach successfully mitigates the dependence of the NN output on the variation of x_2 and therefore results in a classifier that is more robust in the presence of this systematic uncertainty. Fig. 3 visualizes the NN output as a function of the input space spanned by x_1 and x_2. The additional penalty term Λ leads to the intended alignment of the surface of the NN output with the variation of x_2, resulting in similar values of the NN output for all realisations of the systematic variation. We find our approach to have an effect similar to using an adversarial NN for decorrelating the NN output from the systematic variation of the inputs, as described in Ref. [11].
3 Application to a more complex analysis task typical for high-energy particle physics

In the following, we apply the proposed method to a more complex task typical for high-energy particle physics. We use a dataset that has been released for the Higgs boson machine learning challenge described in Ref. [15]. This challenge uses a simplified dataset from collisions of high-energy proton beams at the CERN LHC. The task is to separate events containing the decay of a Higgs boson into two τ leptons (signal) from all other events (background). The dataset contains 30 input parameters, whose exact physical meanings are given in Ref. [15].
For our example, we use all parameters as input to the NN training. In addition, we introduce a systematic uncertainty, reflecting the fact that the momentum and energy of a particle are the results of external measurements with a finite resolution. For our study we assume an ad hoc uncertainty of ±10% on the transverse momentum of the reconstructed hadronic τ decay, p_T^τ, measured in GeV and labelled as PRI_tau_pt in Ref. [15]. The distributions of the nominal and varied input parameters are visualized in Fig. 4. Instead of resampling the signal and background datasets with the varied values of p_T^τ, we introduce the systematic variation in the form of statistical weights. In this way we give a higher (lower) statistical weight to subsamples with low (high) values of p_T^τ with respect to the nominal sample. The weights are determined from the p_T^τ distributions shown in Fig. 4.
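As an illustration of this reweighting step, the following sketch derives per-event weights from the ratio of the shifted and nominal p_T^τ histograms. The binning and the histogram-ratio construction are assumptions made for illustration; the text above only states that the weights are determined from the p_T^τ distributions.

```python
import numpy as np

def shift_weights(pt, shift=0.10, bins=50):
    """Per-event weights w(pt) such that the weighted nominal sample reproduces
    the pt distribution of the sample with pt scaled by (1 + shift)."""
    edges = np.linspace(pt.min(), pt.max(), bins + 1)
    nominal, _ = np.histogram(pt, bins=edges)
    shifted, _ = np.histogram(pt * (1.0 + shift), bins=edges)
    # Bin-by-bin ratio of the shifted over the nominal distribution;
    # empty nominal bins default to a weight of one.
    ratio = np.divide(shifted, nominal,
                      out=np.ones_like(shifted, dtype=float),
                      where=nominal > 0)
    # Look up the weight of each event from its pt bin.
    idx = np.clip(np.digitize(pt, edges) - 1, 0, bins - 1)
    return ratio[idx]
```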
The NN has the same architecture as described in Section 2. For the implementation of f_LΛ we choose 20 equidistant bins in the range [0, 1] of the NN output for Λ, and set λ = 1. The batch size is set to 10^3. The optimization of the trainable parameters is performed on 75% of the training dataset and stopped if the loss, evaluated on the remaining part of the training dataset, has not decreased for 10 epochs in sequence. The results are shown on an independent test dataset.
In Fig. 5 the NN outputs f_L and f_LΛ are shown. In this example, too, the training based on a loss function including Λ leads to a mitigated dependence of the NN output on the systematic variation of p_T^τ. In Fig. 6 the p_T^τ distributions for signal and background are shown for the full unbiased sample and for two signal-enriched subsamples. The latter are obtained by restricting f_L and f_LΛ to values larger than 0.8. In the full unbiased sample a generally harder p_T^τ spectrum is observed for the signal, with a maximum around 50 GeV, in contrast to a steadily falling and softer spectrum for the background. In the signal-enriched subsample based on f_L > 0.8 the p_T^τ distribution for the background is biased towards the corresponding distribution for the signal. In the signal-enriched subsample based on f_LΛ > 0.8 this bias is alleviated and the p_T^τ distributions for signal and background remain qualitatively unchanged with respect to the full unbiased sample.
At the LHC experiments the presence of the Higgs boson signal has been inferred from hypothesis tests based on the ratio of a likelihood including the Higgs boson signal over a null hypothesis without the Higgs boson signal [16]. Systematic uncertainties have been incorporated into the likelihoods in the form of nuisance parameters, which might be correlated, e.g., across processes. Best estimates of and constraints on these nuisance parameters have been obtained by nuisance parameter optimization. The presence of the signal has been quantified, e.g., by means of its statistical significance in terms of Gaussian standard deviations (s.d.), in the limit of large numbers. To serve our discussion we emulate this discovery scenario in a simplified way, constructing binned likelihoods for the signal and null hypotheses based on the histograms shown in Fig. 5. In addition to the statistical uncertainties of the pseudodata and the templates used for the model, we incorporate the systematic variations shown in Fig. 5 as a process- and bin-correlated variation in the likelihoods, following the prescriptions of Ref. [16]. We assume the nominal value of p_T^τ to be true and the Higgs boson signal to be present with a signal strength as expected by theory. We emulate five idealized measurement outcomes: one with p_T^τ at its nominal value and two each with p_T^τ shifted by ±10% and ±20%.
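As a much-simplified illustration of such a binned-likelihood significance, the following sketch evaluates the asymptotic discovery significance from signal and background templates such as the histograms of Fig. 5. It uses statistical uncertainties only and neglects the nuisance parameters and template uncertainties discussed above, so it is an assumption-laden stand-in for the full procedure of Ref. [16].

```python
import numpy as np

def discovery_significance(signal, background):
    """Asymptotic median discovery significance
    sqrt(q0) with q0 = 2 * sum_k [(s_k + b_k) * ln(1 + s_k / b_k) - s_k]."""
    s = np.asarray(signal, dtype=float)
    b = np.asarray(background, dtype=float)
    mask = (s > 0) & (b > 0)          # skip empty bins
    q0 = 2.0 * np.sum((s[mask] + b[mask]) * np.log1p(s[mask] / b[mask]) - s[mask])
    return np.sqrt(q0)                # significance in Gaussian standard deviations
```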
In Fig. 7 the significance of these measurement outcomes using f_L and f_LΛ as inputs to the likelihoods is shown. As can be concluded from the slope of the graphs, a realisation with a positive (negative) shift w.r.t. the nominal value of p_T^τ leads to a larger (smaller) significance. As implied already by Fig. 7, the analysis based on f_L reveals a sizable sensitivity to the variation in p_T^τ, which is largely reduced for the analysis based on f_LΛ. Since p_T^τ is a signal-sensitive input parameter, the reduction of this dependence is achieved at the cost of an overall smaller sensitivity. On the other hand it comes with the gain of a smaller probability of a false positive discovery. As indicated by the figure, a realisation of a shift in p_T^τ by slightly more than +10% results in a statistical significance of more than 5 s.d., while at the nominal value the significance would be 4.5 s.d. For such a measurement outcome the announcement of a discovery, while true and therefore not harmful, would still have been non-conservative and based on a false assumption on the scale of p_T^τ. The outcome based on f_LΛ, though less sensitive and in this sense more conservative, represents the more reliable measurement. It should thus be favored over f_L for actual measurements of physical quantities once a signal has been established.

4 Summary
We have presented a new approach to reduce the dependence of the neural network output on variations of features x_i of the input space due to systematic uncertainties in the measured input parameters. We achieve this reduction by including the variation of the NN output w.r.t. the nominal value of x_i in the loss function. Compared to a previously published method based on an adversarial neural network, the complexity of the presented method is reduced to one additional term in the loss function, with fewer hyperparameters and no further trainable parameters. Systematic variations can be inscribed in the form of statistical weights, so that no reprocessing of the dataset is required, further reducing the complexity of the training. In turn, the method requires batch sizes large enough to populate the blurred histogram of the NN output used for the evaluation of the variation w.r.t. the nominal value of x_i in the loss function.
We have demonstrated the new approach with a simple example directly comparable to a solution of the same task with an adversarial neural network, and with a more complex analysis task typical for high-energy particle physics experiments. In all cases the dependence of the NN output on the variation of a chosen input parameter is successfully mitigated. In the application to a high-energy particle physics measurement this leads to a result less prone to systematic uncertainties, which is of increasing interest in the presence of growing datasets, where statistical uncertainties play a subdominant role in the measurement.

Fig. 7 Statistical significance of the Higgs boson signal in the dataset given in Ref. [15], in standard deviations (s.d.), for emulated discovery scenarios based on f_L and f_LΛ, with five realisations of the scale of p_T^τ between −20% and +20%. The nominal value with 0% shift is assumed to be true.