Combine and Conquer: Event Reconstruction with Bayesian Ensemble Neural Networks

Ensemble learning is a technique where multiple component learners are combined through a protocol. We propose an Ensemble Neural Network (ENN) that uses the combined latent-feature space of multiple neural network classifiers to improve the representation of the network hypothesis. We apply this approach to construct an ENN from Convolutional and Recurrent Neural Networks to discriminate top-quark jets from QCD jets. Such an ENN provides the flexibility to improve the classification beyond simple prediction-combining methods by increasing the classification error correlations. In combination with Bayesian techniques, we show that it can reduce epistemic uncertainties and the entropy of the hypothesis by exploiting various kinematic correlations of the system.


Introduction
Deep Learning (DL) has gained tremendous momentum with the latest developments in data analysis. Whilst boosted decision trees (BDTs) have been used in the context of High-Energy Physics for over 30 years, wide usage of Deep Neural Networks (DNNs) only surged very recently. Since then, especially in applications to LHC physics, where large amounts of data are gathered and need fast and automated analysis, there has been a profound improvement in the understanding of Neural Networks (NNs). The analysis of the internal structure of jets, highly complex collimated sprays of radiation [1], is a popular arena where reconstruction techniques evolved from sophisticated multi-variate approaches, e.g. HEPTopTagger [2][3][4], over theory-guided matrix-element methods [5][6][7][8] to data-driven NN techniques [9][10][11][12]. In particular, top tagging has been the prime example to benchmark the performance of various NN classifiers [13][14][15][16][17][18][19][20][21]. Similar tagging algorithms have been used for Higgs [22,23] and W-boson [24,25] tagging and quark-gluon discrimination [26][27][28][29][30]. Thus, it became apparent that there is a wide range of use-cases for NNs in collider phenomenology, where particle tagging is just one of many applications.
A standard supervised learning algorithm produces a fitting function that aims to find an optimal contour of the decision boundary between competing hypotheses. The given algorithm takes a labelled feature-tensor and attempts to find the global minimum of a given objective function, the so-called loss function, resulting in the prediction of the algorithm. This is achieved by passing the input feature vector through non-linear functions, so-called activation functions, and updating the weights of the initial hypothesis through the backpropagation algorithm. Whilst such an approach offers increased flexibility, in general, it can suffer from three major predicaments [45]. First, the problem of statistics denotes the lack of training examples within a particular domain, which can cause the learning algorithm to get stuck in various minima with comparable accuracies in each training. The second problem is computational. A learning algorithm often employs a stochastic search algorithm, e.g. gradient descent. Even assuming the provision of sufficient data, the feature-space can be highly complex, creating a very non-trivial loss hypersurface on which the algorithm is tasked to find the global minimum [46,47]. Finally, the third problem is representational. By the nature of a "fitting" algorithm, it is not always possible to find a linear or non-linear representation of the actual function. Hence, it might be necessary to expand the representation space or employ various possible hypotheses to find a closer approximation of the actual function. Although the representation problem is directly linked to the previously mentioned issues, even with sufficient statistics and advanced algorithms, an optimization algorithm may not proceed after finding a hypothesis that can adequately explain the data [48].
The three most popular architectures for classification tasks in particle physics are currently Deep Neural Networks (DNN), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). Each of these networks is designed to exploit different features and correlations of the input data. For instance, CNNs are special-purpose networks that are widely used for image recognition [14,18]. This method sweeps through the image by dividing it into subvolumes. Each subvolume is transferred to the next layer by passing through an activation function, allowing the network to filter the image's distinguishable features. RNNs are a different kind of specialised network that keeps track of the ordering of the feature vector and thus maintains a sense of "memory" by connecting each node in a graph via an ephemeral sequence. Long-short term memory (LSTM) networks have been employed to classify QCD events with high accuracy [17,25,49]. While each of these techniques can be powerful by itself, it is not clear whether they exploit the full amount of information contained in the feature vectors to perform an optimal classification between competing hypotheses. Thus, combining multiple networks into an Ensemble Neural Network (ENN) might improve on their individual classification performances.
Ensemble learning is a paradigm which employs multiple neural networks to solve a problem. The main idea behind ensembling is to increase the generalisation of the data by harvesting many hypotheses trying to solve the same problem. Each of the networks mentioned above is specialised to learn a particular feature of the given data to achieve the same or similar generalisation. An ensemble of these networks can access all the information presented in each component network and optimise it according to more generic information through data [50][51][52][53][54][55].
While some techniques to combine classifiers have been used in the context of collider phenomenology before (see footnote 3), to our knowledge we present for the first time a parallel combining method that goes beyond simple prediction combinations. As shown in previous studies [45,46,51,61], combining predictions of various networks can significantly improve the overall performance for classification or regression. However, if networks are only combined at the prediction level, they are each separately trained for a specific property of the data. Parallel combined ensembles allow the network to train on a combined higher-dimensional latent-space and to optimize the entire network accordingly. Hence, having access to all component networks allows improvement upon the representation of the problem. We will show that such an approach allows flexibility to improve background rejection beyond simple prediction combinations. Furthermore, we will show that it drastically improves the network's error correlations beyond the component and prediction-based-combined networks.
Footnote 3: Ref. [56] shows that combining predictions of BDTs with specific rules can improve the discrimination of BSM models from the SM. Ref. [57] shows that injecting randomness into a hypothesis and combining its results can boost the accuracy of the classification. Refs. [58,59] use a stacked combining method for Higgs tagging at the LHC, and Ref. [60] combines the predictions of multiple different learners.
With continuously improving performance indicators for NNs, e.g. measured through receiver operating characteristic (ROC) curves, it becomes increasingly important to obtain an understanding of how this is achieved and how reliably the performance is evaluated [62][63][64][65][66][67]. Bayesian neural networks allow one to estimate the intrinsic uncertainties of NNs by treating their weights as distributions instead of single trainable variables [68,69]. Hence the network output is a distribution rather than a fixed value. To estimate the uncertainties of a network, multiple measurements of the same test data are combined to calculate the mean prediction alongside its standard deviation. We will employ Bayesian techniques to show that parallel combining methods, i.e. as implemented in ENNs, can reduce the standard deviation of the predictions and epistemic uncertainties without requiring more data.
In Sec. 2 we provide a discussion of Ensemble Neural Networks and review their applications and benefits in improving the classification performance. In Sec. 3.1 we describe the procedure we employed to preprocess the input data before the training and in Sec. 3.2 we present our results. Finally, in Sec. 4 we compare uncertainties between component networks and their ensemble, and we offer a summary and conclusions in Sec. 5.

Ensemble Neural Networks
Ensemble Neural Networks (ENNs) are protocols that aim to increase the generalizability of a hypothesis by combining multiple component networks. It has been shown that ENNs can provide the necessary resolutions or approximations for all three potential pitfalls of NNs mentioned in Sec. 1 [50][51][52][53][54]. Depending on the problem at hand, ensembling methods can be pooled under three paradigms [55]: (i) parallel combining, (ii) stacked combining and (iii) combining weak classifiers.
Combining classifiers spanning feature-spaces that contain different physical domains can provide an expanded representation of the hypothesis space, see Fig. 1. Such methods are studied under the so-called "parallel combining" paradigm. Another technique, called "stacked combining", employs different classifiers trained on the same feature-tensor. Such techniques can provide simple solutions to the computational problem, where multiple non-correlated hypotheses can approximate the underlying function more efficiently. The final and most widely studied method is "combining weak classifiers" where, as the name suggests, the predictions of weak but successful classifiers are assembled to create a NN that reaches accuracies beyond its constituents [45]. Here successful means that the hypothesis has greater accuracy than random selection. Although existing methods under the paradigms (ii) and (iii) can successfully optimize over statistical and computational shortcomings of the NNs [70][71][72][73][74][75][76][77], they cannot expand the representation of the hypothesis without acquiring an extended domain of the data. Hence one needs a dedicated approach to address the representation problem and to learn over different types of correlations within distinct symmetries of the data.
While ENNs are known to improve on the statistics and computational problems [55], see Sec. 1, their benefit for the representation problem, which is often prevalent in collider-phenomenological applications, is underrated. We propose the use of ENNs for event reconstruction at high-energy collider experiments under the paradigm of parallel combining. We will further show that this approach improves on the representation problem.
For this purpose, we will use two high-level classifiers, a CNN and a RNN which are often used for image recognition and text recognition respectively. Both of these models are generalising a specific property of a jet, i.e. the spatial position of the substructure of a jet and the sequential order of a cluster history respectively. Naively, one could take the mean prediction of both classifiers, which will lead to a generalisation of the problem in the higher-dimensional hypothesis space. Although this can improve the performance, both component networks are optimised for their own feature space. In this study, we show that instead of combining the component networks' predictions, optimising the network over the combined latent-feature space can lead to a more substantial and stable performance improvement for the problem at hand.
Thus, we propose to initialise multiple high-level classifiers separately. For the example of Sec. 3, these are chosen to be CNN and RNN classifiers. The CNN and the RNN each provide a vector in the latent-feature space, corresponding to the flattened image for the CNN and the higher-dimensional representation of the sequence for the RNN. Concatenating these vectors leads to a larger latent-space, including information from both image-type and sequence-type data. Training on this higher-dimensional feature space with extra handles for the NN architecture, such as more layers or nodes to generalise this latent-feature space, can lead to two significant improvements. Firstly, each component network's weights will be optimised with respect to the combined hypothesis space and hence will have access to more features of the base theory. Secondly, the ability to access a larger latent-feature space will make it possible to increase the complexity of the model for a larger hypothesis-space. Fig. 1 shows a schematic representation of this approach where one source of input is divided into multiple branches to be analysed within different architectures. Depending on the nature of the problem, one can employ multiple network architectures such as fully connected networks (blue), CNNs (purple), RNNs (green) or even more complex structures which, for the sake of simplicity, are not shown explicitly. The merging stage represents the concatenation process where, instead of the prediction of each model, one combines the latent-space of each network after its individual i-th layer and continues training on this new feature space. Hence, the network's output will be the prediction optimised over each distinct feature of the problem. Whilst the network architectures discussed often unveil a strong performance improvement over conventional cut-based reconstruction strategies, one wonders whether combining any set of NNs will increase the accuracy.
To answer this question one needs to investigate the bias-variance-covariance decomposition. The prediction of an ensemble estimator, constructed by averaging the predictions of its component estimators, assuming that they are independent from each other, can be cast as
\[ f_{\rm ens}(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x) , \]
where N is the number of component estimators, $f_i(x)$ is the prediction of the i-th estimator and x is the feature-tensor. For such an object, the generalization error is given by [61,78]
\[ {\rm Err}(f_{\rm ens}) = \frac{1}{N}\, \overline{\rm Var}(f) + \left( 1 - \frac{1}{N} \right) \overline{\rm Cov}(f) + \overline{\rm Bias}(f)^2 , \]
where the three terms correspond to the variance, the covariance and the squared bias, each averaged over the component estimators, respectively. Although such a construction assumes a very simplistic case, it shows that the generalization error of the average prediction of multiple hypotheses is also affected by the covariance. This shows that if the component hypotheses are negatively correlated with each other, the average prediction will decrease the generalization error further. However, as the average bias will remain the same, the generalization error can only be reduced to the bias term. Thus an ENN can improve the generalization error if and only if the given component estimators' errors are not completely correlated [50,79].
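The role of the covariance term can be checked numerically. The following is a minimal sketch, assuming the standard form of the decomposition, Err = Var/N + (1 - 1/N) Cov + Bias^2 with averaged variance, covariance and bias; the toy Gaussian "estimators", their noise levels and the function name `ensemble_error_terms` are illustrative assumptions and not part of the study.

```python
import numpy as np

def ensemble_error_terms(preds, target):
    """Empirical bias-variance-covariance terms of an averaging ensemble.

    preds  : array of shape (n_trials, n_estimators); column i holds the
             prediction of estimator i for one fixed input over repeated trials.
    target : the true value for that input.
    """
    n_est = preds.shape[1]
    cov = np.cov(preds, rowvar=False, bias=True)           # population covariance matrix
    avg_var = np.trace(cov) / n_est                        # mean of Var(f_i)
    avg_cov = (cov.sum() - np.trace(cov)) / (n_est * (n_est - 1))  # mean of Cov(f_i, f_j), i != j
    avg_bias = preds.mean(axis=0).mean() - target          # mean of E[f_i] - y
    direct = np.mean((preds.mean(axis=1) - target) ** 2)   # Err(f_ens) measured directly
    decomposed = avg_var / n_est + (1 - 1 / n_est) * avg_cov + avg_bias ** 2
    return direct, decomposed, avg_cov

# Toy data (assumption): three estimators with identical variance and bias,
# once with a shared noise source (correlated errors) and once without.
rng = np.random.default_rng(0)
target = 1.0
shared = rng.normal(size=(5000, 1))
own = rng.normal(size=(5000, 3))
correlated = target + 0.1 + 0.3 * shared + 0.1 * own
independent = target + 0.1 + np.sqrt(0.3**2 + 0.1**2) * own

for name, preds in [("correlated", correlated), ("independent", independent)]:
    direct, decomposed, avg_cov = ensemble_error_terms(preds, target)
    print(f"{name:12s} Err={direct:.4f} decomposition={decomposed:.4f} avg_cov={avg_cov:+.4f}")
```

In this toy setup both ensembles have the same per-estimator variance and bias, so any difference in the ensemble error comes from the covariance term alone.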

Top Tagging through Ensemble Learning
Using CNNs, the pixelated energy deposits in the calorimeters of multi-purpose high-energy experiments have been repeatedly shown to provide a strong discriminatory power between the radiation profile of top quarks versus QCD jets. In the η − φ plane, each pixel corresponds to one or more particles, and so-called colour or luminosity of a pixel can be measured by a combined intrinsic property of these particles such as energy or transverse momentum. This will allow the CNN to learn translationally invariant features of the top and jet system. RNNs instead maintain a sense of timing and memory in a given sequence used as input features. Due to the nature of the clustering algorithm, each jet has an embedded tree structure, where subjets are recombined with respect to a particular rule. Thus, CNNs and RNNs exploit different features of top and QCD jets to discriminate them from each other. We use the complementarity of both methods to combine them in an ENN that has an improved performance over both approaches individually. An implementation of the code we use for preprocessing and network training is provided at this link 4 .

Dataset & Preprocessing
As a case study, we will investigate the top tagging capabilities of NNs by employing a CNN and an RNN. To achieve this, we used the dataset provided in [60,80], which consists of 14 TeV top signal and mixed quark-gluon background jets generated and showered by Pythia 8 [81]. The detector simulation for showered events is obtained through the Delphes 3 package [82] using the default ATLAS detector card. The fat jets are reconstructed using the anti-kT algorithm [83] as defined in FastJet [84], using radius parameter R = 0.8. The fat-jet transverse momentum has been limited to the [550, 650] GeV range in order to be able to benchmark the NN architectures precisely depending on the nature of jet substructure within a specific $p_T$-range. The resulting fat jets are further limited to be within $|\eta_j| < 2$. Finally, the fat jets in the top signal sample have been matched with truth-level tops requiring $\Delta R(j, t_{\rm truth}) < 0.8$. This dataset consists of 1.2 million training events and 400,000 events each for validation and testing. It has been divided into two subsets within our framework, one for CNN-type training and one for RNN-type training. For both datasets, the provided PFlow-objects are clustered into a fat jet as described above.
The CNN dataset has been prepared with the constituents of the leading anti-kT fat jet, ordered by their transverse momentum. Each jet has been centred on its total-$p_T$-weighted centroid, so that the jet vector sits at (η, φ) = (0, 0). Furthermore, the coordinate system has been rotated such that the principal axis points in the direction of positive pseudo-rapidity for all constituents. These modified constituents are fitted into pixels on the η − φ plane, divided into 37 × 37 pixels between (η, φ) = ([−1.5, 1.5], [−1.5, 1.5]), where the pixel value has been set to the total $p_T$ within that pixel. To get the leading constituent into the first quadrant, the vertical half of the image with higher total $p_T$ is flipped to the right and, similarly, the horizontal half of the image with higher $p_T$ is flipped to the top. Fig. 2 shows the averaged top signal (left) and dijet background (right) images for 10,000 events projected on the modified η − φ plane. Colour represents the magnitude of the transverse momentum in the corresponding pixel. Note that this image has been zoomed in to highlight the relevant portion of the image. Since the network requires the input data within the [0, 1] range, each image has been normalized by 1 TeV before training.
The RNN dataset has been constructed using the leading anti-kT fat jet, where the constituents of this jet are re-clustered with the same radius parameter using the Cambridge/Aachen (C/A) clustering algorithm [85]. In order to construct the training sequence, three leading branches have been extracted from the clustering history, where the branches are defined by their respective transverse momentum. The two initial leading branches are constructed from the first two subjets in the clustering history, where the subjet with larger $p_T$ has been chosen to be the leading branch. The third leading branch has been chosen among the parent subjets of the first leading subjet: the parent with the lowest $p_T$ is considered as the third leading branch. Fig. 3 shows a schematic representation of this selection, where blue stands for the leading branch following the subjets with relatively higher momentum than the consecutive parent subjet. Green is the second leading branch and purple is the third leading branch, following the same pattern as the leading branch. Black lines represent the discarded branches, which have less $p_T$ than the corresponding parent subjet. Finally, red represents the C/A jet. The sequence has been constructed using $k_T$-distances in the clustering history, defined as
\[ d_{i,j} = \min\left( p_{T,i}^{2},\, p_{T,j}^{2} \right) \frac{\Delta R_{i,j}^{2}}{R^{2}} . \]
Here i and j label the parent subjets, ∆R is the relative angular distance between two subjets and R is the clustering radius, given as 0.8. For each parent subjet in a branch, the $d_{i,j}$ value is stored in its chronological order. Fig. 4 shows the number of subjets in each branch, where the left and right panels correspond to the top and dijet samples, respectively (see footnote 5). Before passing the input feature vectors to the network for training, the dataset has been standardized using RobustScaler within the Scikit-Learn package [87], using 100,000 mixed events from the training sample.
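The pixelation step and the distance measure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: centring and rotation are assumed to have been applied already, the axis convention (η horizontal, φ vertical) is an assumption, the $k_T$ distance is taken in its standard form, and the function names `pixelate` and `kt_distance` are hypothetical.

```python
import numpy as np

def pixelate(eta, phi, pt, n_pix=37, half_width=1.5, norm=1000.0):
    """Bin jet constituents into an n_pix x n_pix transverse-momentum image.

    eta, phi, pt are 1D arrays of equal length; pt is in GeV.
    Centring and rotation are assumed to have been applied beforehand.
    """
    edges = np.linspace(-half_width, half_width, n_pix + 1)
    image, _, _ = np.histogram2d(eta, phi, bins=(edges, edges), weights=pt)
    half = n_pix // 2                       # index of the central row/column
    # Assumption: axis 0 (eta) is horizontal, axis 1 (phi) is vertical.
    if image[:half, :].sum() > image[half + 1:, :].sum():
        image = image[::-1, :]              # move the harder half to positive eta
    if image[:, :half].sum() > image[:, half + 1:].sum():
        image = image[:, ::-1]              # move the harder half to positive phi
    return image / norm                     # normalise by 1 TeV (1000 GeV)

def kt_distance(pt_i, pt_j, delta_r, radius=0.8):
    """Standard kT distance between two subjets (assumed form)."""
    return min(pt_i, pt_j) ** 2 * delta_r ** 2 / radius ** 2

# Two toy constituents: a hard one in the negative quadrant, a soft one elsewhere.
image = pixelate(np.array([-0.5, 0.2]), np.array([-0.5, 0.1]), np.array([100.0, 20.0]))
print(image.shape, image.sum(), image[19:, 19:].sum())
print(kt_distance(100.0, 50.0, delta_r=0.4))
```

After the two flips the hard constituent ends up in the quadrant with positive η and φ, which is the behaviour the text describes for the leading constituent.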

Network Architecture & Training
In order to study the effects of ensembling multiple architectures, we will first introduce two "comparable" but independent architectures for the CNN and RNN-type datasets. The CNN dataset has been trained by a network receiving 37 × 37-pixel input via a 2D convolutional layer with eight features and four stride pixels together with zero padding. This layer's output is normalized within a batch normalization layer and passed on to a max-pooling layer with a pool size of 2 × 2, leaving a reduced 18 × 18 image with eight features. Finally, these images have been flattened and passed to a fully connected dense layer with sixteen nodes and a dropout probability of 25%. A rectified linear unit (ReLU) activation function has been used for each layer. The network is completed by a dense output layer with a single node and sigmoid activation for classification.
Footnote 5: It is important to note that we also tested our sequence by constructing it out of jets clustered by the kT and anti-kT algorithms; however, the discriminative power has been observed to be less than that of the sequence clustered by the C/A algorithm.
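The layer sizes quoted above can be cross-checked with simple shape arithmetic. The sketch below assumes that "four stride pixels" refers to a 4 × 4 kernel moved with unit stride and "same" zero padding; this reading is an assumption, chosen because it reproduces the quoted 18 × 18 output, whereas an actual stride of four would not. The helper `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output length along one axis of a convolution or pooling layer."""
    if padding == "same":
        return -(-size // stride)              # ceil(size / stride)
    return (size - kernel) // stride + 1       # "valid": no padding

# Assumed reading: 4x4 kernel, unit stride, zero ("same") padding.
conv_side = conv_output_size(37, kernel=4, stride=1, padding="same")
pool_side = conv_output_size(conv_side, kernel=2, stride=2, padding="valid")
flat_features = pool_side * pool_side * 8      # eight feature maps, flattened

# Alternative reading: a stride of four pixels.
strided_side = conv_output_size(37, kernel=4, stride=4, padding="same")

print(conv_side, pool_side, flat_features, strided_side)  # prints 37 18 2592 10
```

Under the assumed reading, the dense layer with sixteen nodes therefore receives 2592 flattened inputs.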
Furthermore, the RNN dataset has been trained in a slightly more complex architecture starting with an LSTM layer with 64 nodes. The activation and recurrent activation functions for the LSTM layer have been chosen as hyperbolic tangent and sigmoid functions, respectively. It is followed by three fully connected dense layers with 64, 64 and 32 nodes, respectively, each followed by a dropout layer with 25% probability. As before, the ReLU activation function has been used for each dense layer. The network output has been generated from a final dense layer with a single node and sigmoid activation function.
Both networks aim to minimize a binary cross-entropy loss function via the Adam optimizer [90] with a learning rate of $10^{-4}$. The networks are trained for up to 500 epochs, and the learning rate is halved whenever the validation loss has not improved for 20 epochs. If the validation loss did not improve for 250 epochs, the training was terminated automatically.
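The training schedule can be illustrated with a small, framework-independent sketch. It assumes the reading that the learning rate is halved after every 20 consecutive epochs without improvement of the validation loss; the function `run_schedule` and the toy loss sequences are hypothetical. In Keras, comparable behaviour is commonly obtained with the `ReduceLROnPlateau` and `EarlyStopping` callbacks, although the source does not state which mechanism was used.

```python
def run_schedule(val_losses, lr=1e-4, lr_patience=20, factor=0.5, stop_patience=250):
    """Return (last epoch run, final learning rate) for a list of validation losses."""
    best = float("inf")
    since_best = 0    # epochs since the last improvement (for early stopping)
    since_cut = 0     # epochs since the last improvement or learning-rate cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_cut = loss, 0, 0
            continue
        since_best += 1
        since_cut += 1
        if since_cut >= lr_patience:
            lr *= factor                      # halve the learning rate on a plateau
            since_cut = 0
        if since_best >= stop_patience:
            return epoch, lr                  # terminate training early
    return len(val_losses), lr

# A run that never improves after the first epoch stops early ...
stalled = run_schedule([1.0] * 500)
# ... while a steadily improving run uses all epochs at the initial rate.
improving = run_schedule([1.0 / (k + 1) for k in range(500)])
print(stalled, improving)
```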
Since the goal of this study is to question whether a more extensive representation can generalize the given problem better than its component hypotheses, we employed two types of ensembling methods. As a reference case, we studied the mean of the CNN and RNN predictions. As mentioned in Sec. 1, such ensembles have been shown to go beyond the accuracies of their component networks. For the main case, we study an extended architecture where the CNN and RNN architectures are concatenated before their output layers, resulting in a latent-space of 48 features. To find an optimal generalization of this latent-feature space, the concatenated features are connected to a fully connected dense layer with 96 nodes, employing the ReLU activation function and L2 kernel regularization with a penalty strength of 0.05. This dense layer is preceded and followed by dropout layers with 25% probability and is then connected to an output layer as before, activated via a sigmoid function.
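The difference between the two ensembling methods can be made concrete with a forward-pass sketch in NumPy. It is only an illustration under stated assumptions: the weights are random rather than trained, dropout is omitted (as at inference time), the L2 penalty is taken as the strength times the sum of squared kernel entries, and the names `mean_prediction`, `enn_head` and `l2_penalty` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_prediction(p_cnn, p_rnn):
    """Reference case: average the two component predictions."""
    return 0.5 * (p_cnn + p_rnn)

def enn_head(z_cnn, z_rnn, params):
    """Main case: merge the latent features and classify on the merged space."""
    z = np.concatenate([z_cnn, z_rnn], axis=1)        # (batch, 16 + 32 = 48)
    h = relu(z @ params["w1"] + params["b1"])         # dense layer with 96 nodes
    return sigmoid(h @ params["w2"] + params["b2"])[:, 0]

def l2_penalty(params, strength=0.05):
    """L2 kernel penalty on the 96-node layer (assumed form)."""
    return strength * np.sum(params["w1"] ** 2)

# Random stand-ins for trained weights and for the component latent vectors.
rng = np.random.default_rng(1)
params = {"w1": rng.normal(scale=0.1, size=(48, 96)), "b1": np.zeros(96),
          "w2": rng.normal(scale=0.1, size=(96, 1)), "b2": np.zeros(1)}
z_cnn = relu(rng.normal(size=(4, 16)))    # last hidden layer of the CNN branch
z_rnn = relu(rng.normal(size=(4, 32)))    # last hidden layer of the RNN branch
scores = enn_head(z_cnn, z_rnn, params)
print(scores.shape, l2_penalty(params))
```

In the mean-prediction case the component networks never see each other's features, whereas gradients from the merged head would flow back into both branches during training.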
In order to estimate the inherent uncertainty on each model, the test data has been divided into randomly selected 50,000 non-overlapping partitions. Fig. 5 shows the ROC curve for each model. The RNN and CNN are represented by blue and green curves alongside the inherent uncertainty for one standard deviation. The orange curve shows the mean prediction of these two models, which already indicates a higher generalization power than each component network. Finally, the red curve shows the minimalistic ENN configuration. Although the training on the concatenated latent-feature space is minimal, it still reveals an improvement beyond the mean prediction. The inner plot of Fig. 5 zooms into the slice of tagging efficiency within [0.4, 0.7] to emphasize this improvement. Fig. 5 also shows the area under the curve (AUC) value for each curve, where the improvement in the mean prediction and the ENN is also visible. For the ENN to show a significant performance improvement over all pooled networks, it is important for the component networks to show mutually comparable performance. As seen from Fig. 5, both the ENN and the mean prediction are dominated by the CNN above a tagging efficiency of 0.8 and by the RNN below a tagging efficiency of 0.15. This explicitly shows that no matter how complex the ENN architecture is, if one component network dominates the others, the ensemble will closely follow the performance curve of the best component network. As seen from the interval [0.4, 0.7] of the ROC curve, the ENN improvement is maximized when the component accuracies are similar.
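One way to attach a one-standard-deviation band to a performance figure is to evaluate it on non-overlapping partitions of the test set. The sketch below does this for the AUC, using the rank-based (Mann-Whitney) estimate; the toy Gaussian scores, the number of partitions and the function names are assumptions for illustration only, and ties between scores are not treated specially.

```python
import numpy as np

def auc(labels, scores):
    """Rank-based AUC: probability that a signal event outscores a background event."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_over_partitions(labels, scores, n_parts, seed=0):
    """Mean and standard deviation of the AUC over random non-overlapping partitions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    values = [auc(labels[part], scores[part]) for part in np.array_split(order, n_parts)]
    return np.mean(values), np.std(values)

# Toy classifier output (assumption): signal scores shifted by one unit.
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=8000)
scores = rng.normal(loc=labels.astype(float), scale=1.0)
mean_auc, std_auc = auc_over_partitions(labels, scores, n_parts=8)
print(f"AUC = {mean_auc:.3f} +/- {std_auc:.3f}")
```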
As discussed in Sec. 2, combining hypotheses with non-correlated errors may improve an ensemble's prediction. In order to test this, Fig. 6 shows the correlations of the squared error, $(y - \hat{y})^2$, mapped on the test images, where y is the truth label and $\hat{y}$ is the prediction of the corresponding network. Fig. 6 shows RNN (upper left panel), CNN (upper right panel), mean prediction (lower left panel) and ENN (lower right panel). Each correlation has been estimated by using randomly selected 50,000 test images. One can immediately see the shrinking area of the blue negative correlation distribution. Although the correlations between the RNN and the CNN mapping look similar, the mean prediction improves the two hypotheses' non-overlapping portions. The ENN goes beyond the mean prediction's achievement by drastically shrinking the blue region and removing the fluctuations in the red (positively correlated) region, hence increasing the correlations between squared error and the image pixels. As expected, similarly correlated regions changed neither for mean prediction nor for ENNs. Thus, combining all available neural networks would not improve the accuracy if their error is highly correlated. Instead, one can benefit from this methodology by employing networks with comparable accuracies and different error correlation to improve the latent-feature space accuracy.
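A map of this kind can be sketched by correlating each pixel's intensity with the squared error across events. The source does not spell out the correlation measure, so the per-pixel Pearson coefficient used below is an assumption, as are the toy data and the function name `error_pixel_correlation`.

```python
import numpy as np

def error_pixel_correlation(images, y_true, y_pred):
    """Pearson correlation between each pixel's intensity and the squared error.

    images : array of shape (n_events, height, width)
    Returns an array of shape (height, width); pixels that never vary give 0.
    """
    err = (y_true - y_pred) ** 2
    pixels = images.reshape(len(images), -1)
    pc = pixels - pixels.mean(axis=0)
    ec = err - err.mean()
    denom = np.sqrt((pc ** 2).sum(axis=0) * (ec ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = (pc * ec[:, None]).sum(axis=0) / denom
    return np.nan_to_num(corr).reshape(images.shape[1:])

# Toy data (assumption): random images, with one pixel tracking the error exactly.
rng = np.random.default_rng(3)
images = rng.random(size=(500, 5, 5))
images[:, 2, 2] = 0.0                                   # an always-empty pixel
y_true = rng.integers(0, 2, size=500).astype(float)
y_pred = np.clip(y_true + rng.normal(scale=0.3, size=500), 0.0, 1.0)
images[:, 0, 0] = (y_true - y_pred) ** 2
corr_map = error_pixel_correlation(images, y_true, y_pred)
print(corr_map.round(2))
```

Positive entries mark pixels whose activity accompanies larger errors, and negative entries mark pixels whose activity accompanies smaller errors.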

Bayesian Deep Learning
For all phenomenological applications it is important to assess the intrinsic uncertainties of a NN model. Two major uncertainties can be modelled within the context of DL [62,69]: the irreducible noise in the observations, called aleatoric uncertainty, and the uncertainty intrinsic to the proposed hypothesis, called epistemic uncertainty. Given sufficient data, epistemic uncertainties can be explained and reduced. The decomposition of the variance of a binary hypothesis is given as [91,92],
\[ {\rm Var}(y) = \left( \mathbb{E}[\hat{y}^2] - \mathbb{E}[\hat{y}]^2 \right) + \mathbb{E}[\hat{y}(1-\hat{y})] , \qquad (4.1) \]
where $\hat{y}$ represents the network's predictive distribution, the first term represents the epistemic uncertainties and the second term is the aleatoric uncertainty. In addition to the uncertainties, the entropy of the network's prediction also gives strong indications about the underlying uncertainties of the system, where higher entropy points to higher uncertainty. The entropy of binary classification is given as [93],
\[ S = -\hat{y} \log \hat{y} - (1-\hat{y}) \log (1-\hat{y}) , \]
where the first term stands for the classification of class 1 (top signal) and the second term for the classification of class 0 (dijet background). In order to analyse the uncertainties underlying our neural network, we used the TensorFlow Probability package version 0.10.0 [94]. We limited ourselves to prediction uncertainties by only changing each network's output layer to a Dense Flipout layer [95] with sigmoid activation (see footnote 6). The kernel divergence function has been chosen to be the mean Kullback-Leibler divergence. We employed the same network architectures presented in Sec. 3.2. As before, all networks are trained for 500 epochs with the Adam optimizer. The initial learning rate has been set to $10^{-4}$ and halved whenever the validation loss has not improved for 20 epochs. The final prediction has been reported using 100,000 randomly chosen test samples, where each network output has been sampled 100 times.
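Given repeated stochastic forward passes, the quantities entering Eq. (4.1) and the entropy can be estimated per event. The sketch below assumes the common form of the binary decomposition, with the epistemic term E[ŷ²] − E[ŷ]² and the aleatoric term E[ŷ(1 − ŷ)], and a base-2 logarithm for the entropy; the base, the toy samples and the function name `bayesian_summaries` are assumptions, not taken from the source.

```python
import numpy as np

def bayesian_summaries(samples, eps=1e-12):
    """Per-event summaries of sampled sigmoid outputs.

    samples : array of shape (n_draws, n_events), one row per stochastic forward pass.
    """
    mean = samples.mean(axis=0)
    epistemic = (samples ** 2).mean(axis=0) - mean ** 2       # spread between draws
    aleatoric = (samples * (1.0 - samples)).mean(axis=0)      # expected Bernoulli variance
    p = np.clip(samples, eps, 1.0 - eps)                      # avoid log(0)
    entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return {
        "mean": mean,
        "std": samples.std(axis=0),
        "epistemic": epistemic,
        "aleatoric": aleatoric,
        "entropy_mean": entropy.mean(axis=0),
        "entropy_std": entropy.std(axis=0),
    }

# Toy stand-in for 100 draws of the network output on 1000 events (assumption).
rng = np.random.default_rng(4)
centres = rng.random(size=1000)
samples = np.clip(centres + rng.normal(scale=0.02, size=(100, 1000)), 0.0, 1.0)
summary = bayesian_summaries(samples)
print(summary["epistemic"].mean(), summary["aleatoric"].mean())
```

With this form, the two terms add up to the Bernoulli variance of the mean prediction, which provides a quick consistency check on any implementation.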
Although the notion of "mean prediction" is ambiguous in the Bayesian context, in order to have a baseline, we defined the mean prediction of the CNN and RNN networks as the mean of each of the 100 samples. This serves as the linear-combination ensemble baseline, which has not been trained on any latent-feature space beyond its component networks. To reveal our ensembling technique's full effect, we used an ensemble learner with one dense layer including 96 nodes, as before, and another ensemble learner with an additional dense layer with 96 nodes (see footnote 7).
Footnote 6: It is important to note here that, to get the complete model uncertainties from each layer, one can modify the entire network with Bayesian layers. This will double the number of trainable parameters in each layer. Thus, in order to simplify our problem, we are only concentrating on prediction uncertainties.
Footnote 7: It is important to note that we did not observe a significant improvement in the ROC AUC by adding an extra dense layer. Thus further optimization beyond adding an extra layer is required to improve the accuracy of an ensemble learner. Since this is beyond our scope, we limit ourselves to a simplistic architecture.
The left panel of Fig. 7 shows the mean entropy, $\hat{\mu}_S$, distribution with respect to the standard deviation in entropy, $\hat{\sigma}_S$, where the RNN, CNN and two-layer ENN are represented by blue, green and red points. In order to simplify the plot, the mean prediction and the one-layer ENN model are not shown. One can immediately conclude that the ensemble learner considerably limits the standard deviation of the entropy: the CNN reaches beyond 0.025 and the RNN 0.015, but the ENN keeps the standard deviation below 0.0075. The right panel of Fig. 7 shows the percentage of events per mean entropy. As before, the RNN and CNN architectures are represented by blue and green solid curves. The separation between the two curves increases between the entropy values 0.2 − 0.8, where the RNN has been observed to have more events with mid-range entropy values than the CNN, but the last bin reveals that the CNN has more events with maximum entropy. The dashed orange curve represents the mean of the two predictions, where only a slight improvement has been observed beyond the RNN. Furthermore, for the two ensemble learners, represented by dashed red and purple curves, one can immediately see the reduction in the number of events for the mid-range entropy values. One can also see that, when sufficient complexity is provided, an ensemble learner further improves the hypothesis's entropy, i.e. reduces its values for both $\hat{\mu}_S$ and $\hat{\sigma}_S$. This is also summarized in Table 1, where more than 78% of the events for both ensemble learners reach a mean entropy $\hat{\mu}_S$ of less than 0.5, while the RNN, CNN and mean prediction remain below 75.3%. We also analyzed the standard deviation in the hypothesis prediction, which is crucial to keep small in order to achieve consistent predictions. The left panel of Fig. 8 shows the fraction of events per standard deviation in prediction, with the same colour scheme as before.
Given a sufficiently complex problem, the ENN is observed to reduce $\hat{\sigma}_{\rm bayes}$ significantly, compared to each component network and the mean combination of those networks, respectively. While the mean prediction reaches up to $\hat{\sigma}_{\rm bayes} \sim 0.01$, the ENN limits the standard deviation below 0.004, which is similar to what was observed for the standard deviation of the entropy. On the right panel of Fig. 8, we show the epistemic uncertainty as given in the first term of Eq. (4.1), using the same colour labelling. Again, we find a significant reduction of the uncertainties with ensemble learners. These results show that learning over various symmetries leads to a more accurate representation of the given problem without requiring more data.
Thus, we observe that employing different data domains, each specialised for specific properties, and optimising a neural network over the combined properties of these component learners drastically reduces the system's uncertainties. Such an ensemble network is shown to learn the system's correlations much more accurately than its individual component networks.

Conclusion
We presented Ensemble Neural Networks for the reconstruction and classification of collider events and applied them to the discrimination of boosted hadronically decaying top quarks from QCD jets. An ENN can improve the accuracy of the network beyond the individual contributions of its component networks by reducing the variance of the prediction, provided that the errors of the component networks are not highly correlated. In this study, we showed that such techniques can be used in the reconstruction of collider events to overcome the representation problem of neural networks and to improve the prediction accuracy and uncertainties.
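The role of the error correlation can be illustrated with the textbook result for a simple average of $N$ predictors. This is a sketch for plain averaging only, not a description of the ENN itself, and it assumes that every component prediction $f_i$ has the same variance $\sigma^2$ and the same pairwise correlation $\rho$ (both symbols are introduced here for illustration):

```latex
\[
\mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} f_i\right)
  = \frac{\sigma^2}{N} + \frac{N-1}{N}\,\rho\,\sigma^2 .
\]
```

For $\rho = 1$ the variance stays at $\sigma^2$ and combining brings no gain, whereas for $\rho = 0$ it falls to $\sigma^2/N$; for two components the expression reduces to $\sigma^2 (1+\rho)/2$.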
Special-purpose networks, such as CNNs or RNNs, have been repeatedly shown to be highly accurate for the classification of LHC events. These networks are specialised to learn and optimise their models with respect to the correlations of the given data. In the case of the classification of fat jets, these correlations can be represented through calorimeter images, where a network learns the spatial distribution of a jet's constituents. On the other hand, clustering algorithms produce sequential tree structures, which can be employed to learn distinct kinematic features of top decays and QCD backgrounds. An ensemble learner is a paradigm that allows the combination of these properties in one algorithm. Instead of optimising the network separately with respect to the distinct symmetries of images or cluster sequences, it allows optimisation through the combined latent-feature space. We showed that combining convolutional and recurrent neural networks and training the network further over their latent-feature space leads to higher accuracy for the classification task. Further, we found that such a technique explicitly reduces the variations in error correlations of the component networks, hence improving the domains where the component networks are not highly correlated.
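The idea of optimising over a combined latent-feature space can be sketched as a forward pass in which the latent vectors of the two component networks are concatenated and passed through a small dense head. The latent dimensions, the single hidden layer and the randomly initialised weights below are hypothetical placeholders that stand in for trained networks; the sketch is not the architecture used in this study.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def enn_head(z_cnn, z_rnn, params):
    """Dense head acting on the concatenated latent features.

    z_cnn: (batch, d_cnn) latent features from the image-based network.
    z_rnn: (batch, d_rnn) latent features from the sequence-based network.
    Returns P(top) per event, shape (batch,).
    """
    z = np.concatenate([z_cnn, z_rnn], axis=1)      # combined latent space
    h = relu(z @ params["W1"] + params["b1"])       # hidden layer of the head
    return sigmoid(h @ params["W2"] + params["b2"]).ravel()

# Hypothetical sizes; random weights stand in for trained ones.
rng = np.random.default_rng(1)
d_cnn, d_rnn, hidden, batch = 8, 6, 4, 5
params = {
    "W1": rng.normal(scale=0.1, size=(d_cnn + d_rnn, hidden)),
    "b1": np.zeros(hidden),
    "W2": rng.normal(scale=0.1, size=(hidden, 1)),
    "b2": np.zeros(1),
}
z_cnn = rng.normal(size=(batch, d_cnn))
z_rnn = rng.normal(size=(batch, d_rnn))
p_top = enn_head(z_cnn, z_rnn, params)
```

In contrast to averaging the two output scores, the head sees both latent vectors at once and can therefore learn correlations between image-based and sequence-based features.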
A detailed understanding of the inner workings of Deep Learning techniques is often missing. To develop confidence in their applicability in measurements and searches for new physics, it is of vital importance to understand and, if possible, reduce the uncertainties of the networks. Bayesian techniques are designed to quantify such uncertainties. We found that ENNs can drastically reduce the uncertainty in the prediction of the network without increasing the amount of training data. We also showed that such methods reduce the entropy of the system as well as the epistemic uncertainties. ENNs can thus provide much more accurate predictions than their component networks. The methodology employed in this study can be applied to a broad range of applications in HEP phenomenology. Instead of expanding the data domain, learning through the combined underlying correlations of the problem has been shown to be very effective.
While ensemble learners can reduce the variance of the hypothesis, we did not observe any improvement in the bias or in the aleatoric uncertainties of the data. Although reducing the network's epistemic uncertainties and variance is a crucial step, the aleatoric uncertainties are observed to be larger than the epistemic uncertainties. Since Genetic-Algorithm-based Selective Ensembles have been shown to reduce the bias as well as the variance of the system [50], employing such techniques is an obvious next step.
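The distinction between the two kinds of uncertainty can be illustrated with a commonly used decomposition of the predictive variance of a binary classifier into an epistemic part (the spread of the prediction across posterior draws) and an aleatoric part (the mean Bernoulli variance). The sketch below assumes this common form, which may differ in detail or ordering from Eq. (4.1); the function name and the toy numbers are illustrative only.

```python
import numpy as np

def decompose_variance(samples):
    """Split the predictive variance per event into two parts.

    samples: (n_draws, n_events) array of P(top) per posterior draw.
    Returns (epistemic, aleatoric, mean prediction).
    """
    samples = np.asarray(samples, dtype=float)
    p_bar = samples.mean(axis=0)
    epistemic = ((samples - p_bar) ** 2).mean(axis=0)     # spread across draws
    aleatoric = (samples * (1.0 - samples)).mean(axis=0)  # mean Bernoulli variance
    return epistemic, aleatoric, p_bar

# Toy input: the draws disagree on the first event and agree on the second.
samples = np.array([[0.6, 0.9],
                    [0.4, 0.9],
                    [0.5, 0.9]])
epi, ale, p_bar = decompose_variance(samples)
```

By construction the two parts add up to `p_bar * (1 - p_bar)`, so a reduction of the epistemic part alone leaves the aleatoric part untouched, in line with the observation above.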