Autoencoders for unsupervised anomaly detection in high energy physics

Autoencoders are widely used in machine learning applications, in particular for anomaly detection. Hence, they have been introduced in high energy physics as a promising tool for model-independent new physics searches. We scrutinize the usage of autoencoders for unsupervised anomaly detection based on reconstruction loss to show their capabilities, but also their limitations. As a particle physics benchmark scenario, we study the tagging of top jet images in a background of QCD jet images. Although we reproduce the positive results from the literature, we show that the standard autoencoder setup cannot be considered as a model-independent anomaly tagger by inverting the task: due to the sparsity and the specific structure of the jet images, the autoencoder fails to tag QCD jets if it is trained on top jets even in a semi-supervised setup. Since the same autoencoder architecture can be a good tagger for a specific example of an anomaly and a bad tagger for a different example, we suggest improved performance measures for the task of model-independent anomaly detection. We also improve the capability of the autoencoder to learn non-trivial features of the jet images, such that it is able to achieve both top jet tagging and the inverse task of QCD jet tagging with the same setup. However, we want to stress that a truly model-independent and powerful autoencoder-based unsupervised jet tagger still needs to be developed.


Introduction
Having discovered the Higgs boson in 2012, the experimental foundation of the Standard Model of particle physics has been completed at the Large Hadron Collider (LHC) [1,2]. However, working at the energy frontier, new particles and interactions, not covered by the Standard Model, could be measured by accumulating more and more statistics, or may already hide in the huge amount of data taken by the LHC experiments. From a machine learning perspective, the search for new physics is a quest for finding anomalous data, usually called signal in the physics context, in the vast background of collider events described by Standard Model interactions. New physics would contribute the out-of-distribution data with respect to the data expected in the Standard Model. Hence, it is not surprising that the recent advances in machine learning have also had a huge impact in the LHC context [3][4][5][6][7][8]. In addition to improving classical search strategies, machine learning may even open up completely new ways to look for anomalous data, i.e. new physics, in a model-independent way.
Single collider events, i.e. data instances, usually cannot simply be labeled as signal and background due to the intrinsically probabilistic nature of quantum mechanics. However, identifying so-called reducible backgrounds, which share some but not all features of the signal events, amounts to a straightforward classification task, where one can try to either employ supervised or unsupervised machine learning techniques to separate the two classes.
Supervised classifiers are extremely powerful tools but have a limited applicability since labeled data has to be available. In the collider-physics context, they cannot be directly applied to experimental data since collider data is not labeled. However, labeled data is available from the sophisticated simulation of collider events. If the transition from simulation to measured data is understood well enough, supervised machine learning can be successfully applied in particle physics. Still, traditional supervised algorithms are always model-dependent and can only efficiently detect the signals simulated for training. Interesting ideas to overcome those shortcomings have been recently discussed in the literature [9][10][11][12][13][14][15].
In unsupervised machine learning, an algorithm does not learn from labeled examples but should understand the structure of the data in some other way in order to sort it into a fixed or variable number of classes. Using unsupervised methods, which can be directly applied to LHC data, is a much more difficult but also an even more rewarding task. Designing an unsupervised tagger which is able to detect anomalies independently of the new physics model is certainly the ultimate vision.
For anomaly or outlier detection (see e.g. [16,17] for reviews), it is usually assumed that only a few anomalous data instances are to be found in the data distribution. In semisupervised machine learning, which is slightly less ambitious, a data sample consisting only of the background class is provided during training. The task is to tag signal data when testing on a mixed sample. This might be also an option in particle physics applications since signal free data from control regions might be available. Many different unsupervised and semi-supervised algorithms have been applied in the physics context (see e.g. [18,19] for reviews).
One particularly promising unsupervised method for anomaly detection is based on the autoencoder (AE) architecture [20]. An AE consists of two parts: an encoder and a decoder. The encoder is a neural network that compresses the input data into a few latent space variables which are also often called the bottleneck. The decoder is a neural network which reconstructs the initial data from the latent space variables. By choosing a suitable loss function both parts are trained together as one neural network to reconstruct the input data as well as possible.
The main idea is to train an AE on a dataset consisting purely (for semi-supervised) or mostly (for unsupervised anomaly detection) of the background class. By having much fewer variables in the latent space than in the input and output layers, the AE is not able to learn the identity transformation. Instead, to minimize the loss, it is forced to extract in its bottleneck the variables that correspond to the most prominent features of the background class. Ideally it extracts correlations in the input data which allow for an efficient data compression. The AE learns a representation that uses the structure of the training data and is therefore specific for this set. Hence, if an AE encounters data that has features somewhat different from the background class, it should not be able to effectively encode and decode these features. The loss of the reconstruction is expected to be larger. Therefore, a trained AE can be used as an anomaly tagger with its loss function as the anomaly score. This simple idea and its variants have been successfully explored in different machine learning applications (see e.g. [16,21,22] for reviews). Also in particle physics autoencoders have been successfully used for anomaly detection [23][24][25][26][27][28][29], in particular for top tagging [30][31][32] which will be the benchmark application in this work.
An unsupervised or semi-supervised machine learning algorithm for anomaly detection has an advantage compared to supervised methods if it is as model-independent as possible. In particular it should not be tailored for a specific kind of anomaly, i.e. a specific new physics model, but it should work for any, or at least a wide variety of possible signals. Ideally, it should also be able to detect an unexpected new physics signature.
Specifically, the setup should also be a working tagger if the background data and the anomalous signal interchange their role. This offers a simple test for model-independence. If such an inverse tagger performs significantly worse (or better) than the original tagger, there is a bias in the setup favoring a particular kind of anomaly. This bias will limit the model-independence of the tagging capabilities and, hence, the usefulness of a given unsupervised machine learning method. In machine learning applications outside of particle physics this kind of bias has recently been investigated mostly in the context of anomaly tagging using deep generative models [33][34][35][36][37][38]. Once an algorithm is known to work as an anomaly tagger in specific examples, it is a natural next step to study and understand potential biases in order to evaluate the performance and to improve the method.
In this work, we explicitly perform this investigation for an autoencoder which is used as a semi-supervised binary classifier to find an unknown signal within a Standard Model background. As a well-known benchmark example [39][40][41][42][43][44], we distinguish boosted top jets from QCD jets employing a simple convolutional autoencoder working on jet images. We confirm findings from the literature [30,31] that an autoencoder can indeed find top jets as anomalies without having seen them during training. However, the ability to detect this particular anomaly does not imply that the autoencoder works in general. Indeed, in Section 2 we will show that the very same autoencoder fails on the inverse task of finding QCD jets as anomalies when being exposed only to top jets during training; it performs worse than picking anomalies randomly. Our anomaly tagging setup employing the autoencoder is strongly biased to label top jets as anomalous no matter what the AE has seen during training. To understand this behavior we investigate what the autoencoder actually learns and why it fails under certain circumstances. It will become clear why a functional autoencoder can be a bad anomaly tagger and vice versa, in particular in the particle physics application at hand. Using those insights, we propose several improvements for our particular autoencoder architecture, such that it works as an anomaly tagger in both ways to distinguish QCD and top jets.
This work is structured as follows: After defining our setup in Section 2.1, we scrutinize the autoencoder performance in Sections 2.2 and 2.3, and discuss its tagging performance in Section 2.4. In Sections 3.1 and 3.2 we introduce improved data preprocessings and loss functions. We discuss the impact of these improvements on the autoencoder performance in general and on the tagging performance in particular in Sections 3.3 and 3.4, respectively. We conclude in Section 4 and present additional material in the appendices.

Autoencoder limitations
In this section, we introduce our AE architecture and investigate its training and performance. We either train the AE on a pure sample of QCD jets and call it a direct tagger, or we train the AE on a pure sample of top jets and call it an inverse tagger. While the former setup is designed to perform the well-known task of tagging top jets as anomalies, the latter setup is designed to perform the inverse task, i.e. tagging QCD jets as anomalies in a background sample of top jets. Hence, we always use a semi-supervised setup, i.e. we assume that a dataset consisting only of background data is available.
Before we discuss the tagging performance in Section 2.4, we scrutinize how and to what extent the AE actually learns to reconstruct jet images during training and how these AE capabilities affect its performance as a model-independent tagger. In particular, we are able to explain the success of the direct tagger and the failure of the inverse tagger by the interplay between an insufficient AE performance and the different complexity in the images of the two jet classes.

Jet data and autoencoder architecture
Throughout this work, we use top tagging as an example, which is a well established benchmark in collider physics: Simulated QCD jets (initiated by light-flavor quarks or gluons) have to be distinguished from hadronically decaying boosted top-quarks which form single jets. We use the benchmark dataset from Ref. [45], which is publicly available at [46]. A subset of 100k and 40k jets of one of the two classes is used for training and validation, respectively, and two 40k subsets of each type for testing.
In the following, we describe our default setup which will be modified and improved in Section 3. We employ the preprocessing introduced in Ref. [42]. The four momenta of the jet constituents are converted into two-dimensional jet images with 40 × 40 pixels. The details of the jet generation and the preprocessing are discussed in Appendix A. Fig. 1 shows the average images of both classes. For QCD jets, most of the intensity is concentrated Figure 2. Architecture of our autoencoder, see also Ref. [30].
in the central pixels. For top jets, there is a clearly visible three-prong structure (as expected for top-quark decays after preprocessing). Of course, individual jets are harder to distinguish than their average images may indicate.
As our anomaly detection algorithm, we use a convolutional autoencoder with an architecture similar to the one in Ref. [30]. We implement our AE with Tensorflow 2.4.1 [47], relying on the built in version of Keras [48]. Several convolution layers with 4 × 4 kernel and average pooling layers with 2 × 2 kernel are applied before the image is flattened and a fully connected network reduces the input further into the bottleneck latent space with 32 nodes. The Parametric ReLU activation function is used in all layers. The described encoder structure is inverted to form the corresponding decoder which is used to reconstruct the original image from its latent space description. Our architecture is defined in Fig. 2; the hyperparameter settings are described in more detail in Appendix A.
Following Ref. [30], to evaluate the reconstruction of the input picture we use the mean squared error (MSE), i.e. the average of the squared error of each reconstructed pixel with respect to its input value, as a loss function. During testing the value of the loss function is also used as the discriminator between signal and background. An event is tagged as signal/anomaly if the value of the loss function is larger than a given threshold. Changing the threshold value, one obtains the usual receiver operating characteristic (ROC) curve.

Limited reconstruction
We first investigate what the AE is actually learning, as it is trained for the reconstruction of the input and not as an anomaly tagger. Fig. 3 shows the learning history of an exemplary top jet in terms of its reconstruction after training on top jets for a given number of epochs. To guide the eye, we also show the evolution of the squared error per pixel and highlight the reconstruction of the intensity of the brightest pixels in detail. Moreover, the MSE of the reconstruction, i.e. the loss function of the AE, is given as a measure for the overall reconstruction improvement.
During the first epoch, the AE learns to reconstruct the average top jet image, as can be seen by comparing the reconstructed image to the average image in Fig. 1. (Note that the images in Fig. 1 are shown on a logarithmic scale.) In the following epochs, the squared error is dominated by the brightest pixels. The AE improves the loss by improving their reconstruction, so its trainable weights are updated accordingly. After 10 epochs, the AE recovers a smeared reconstruction of the brightest pixels. After roughly 25 epochs, the weights are learned to reconstruct the brightest pixels so well that the corresponding error becomes small compared to the error of the remaining pixels. However, the AE further We also show the squared error per pixel between input and reconstructed image (2nd column) and its difference w.r.t. the previous row (3rd column). The 4th column shows the intensity of the 20 brightest input pixels (blue) together with the reconstructed intensity (orange) and the corresponding squared error (purple crosses). improves the reconstruction of the brightest pixels (and the surrounding zero-intensity pixels) instead of changing its focus to the dimmer pixels. The latter would probably harm the previously learned reconstruction of the brightest pixels and therefore increase the loss. After around 100 epochs the four brightest pixels are very well reconstructed, while most of the dimmer pixels are still ignored. The dimmer pixels dominate the total error, as can be seen in the right column of Fig. 3. The AE is apparently trapped in a local minimum of the loss function, and training longer does not change the picture significantly.
The previous discussion applies to both training on QCD and top images. To show that the example presented in Fig. 3 is rather generic, the reconstruction capabilities of the AE for the whole test dataset are summarized in Fig. 4, where we quantify the quality of the reconstruction for individual pixels of a jet image. Specifically, for each of the 40k jets of the test data sample, we determine the ratio of the reconstructed and input intensities, r, for each pixel, and show the fraction of test jet images where the brightest, next-to-brightest etc. pixel (from left to right on the horizontal axis) is reconstructed with a certain quality r. In Fig. 4 (left) this fraction is shown for the AE trained and tested on QCD jets. For the majority of the jet images, the brightest pixels are reconstructed well (blue histogram, corresponding to a ratio 80% ≤ r ≤ 120%). On the other hand, the dimmer a pixel the more likely it will be reconstructed insufficiently, e.g. with an intensity of less than 10% of the input intensity (red histogram). Note that the overall number of jets in the training data, and thus the fraction of jet images, decreases as one requires an increasing number of non-zero pixels. A qualitatively similar picture emerges for the AE trained and tested on top jets (see Fig. 4, right). In the next section, we will explore why such a limited AE can nevertheless tag top jets as anomalies in a QCD background.

Complexity bias
We can gain additional insight into the AE performance by investigating the reconstruction loss as a function of the number of non-zero pixels, N p , of the jet images. We show the N p distribution for QCD and top images in Fig. 5. When training on QCD images, the reconstruction loss of QCD images on average increases strongly with N p , as shown in Fig. 6, left. QCD jets with small N p are simply easier to reconstruct. When training on top images, top jets are also harder to reconstruct for larger N p , see Fig. 6, right. However, in this case the correlation is less pronounced. Since, on average, top images have more non-zero pixels than QCD images, see Fig. 5, the top-trained AE sees less training data with small N p . Given this correlation between the reconstruction loss and the number of non-zero pixels N p , we conclude that N p at least partially describes the complexity of the images. However, complexity is not only related to N p . For fixed N p , QCD jets are on average also easier to reconstruct than top jets. This can be understood from the underlying physics. A top jet is naturally composed of three sub-jets initiated by the top-quark decay products. Hence, this intrinsic three-prong structure leads to more complex structures in the jet images.
These results show that the AE has a strong complexity bias: images which would intuitively be labeled as simpler, are reconstructed better. It has been discussed in the literature, especially for natural images, that simpler images may be harder to identify as anomalous [33-38, 49, 50]. In the context of natural images, it was noted that the algorithm tends to learn features that are not representative of the specific training set, like local pixel correlations.
The strong complexity bias in our application might be mainly due to the limited reconstruction performance of the AE discussed in Section 2.2. It is not surprising that the AE is not able to reconstruct the complex structure of top images when ignoring all but a few pixels. Limitations to the reconstruction of structures in high energy physics have also been noted recently in [51,52].
The correlation between the complexity of an image and the reconstruction loss is dangerous if the AE is to be used as a model-independent anomaly tagger. One should not assume that a new physics signal looks more complex than the SM background. An anomaly tagger should only reconstruct well those images it has been trained on. On the other hand, our AE has the tendency to better reconstruct simple images. For a functional model-independent anomaly tagger the first effect should be much more pronounced than the second. For the case at hand, these two effects go in the same direction for the direct tagger, since it is trained on simple QCD images, but go in opposite directions for the inverse tagger, which is trained on more complex top images. This can be seen already in Fig. 6, where we also show the loss of top jets in the test set for the AE trained on QCD jets and vice versa. Indeed, when training on top jets, even QCD images with the same number of non-zero pixels are better reconstructed than top jets, due to the complexity issue. Although the inverse tagger has improved the top reconstruction through its training on top jet images, it cannot overcome the complexity bias. Given the limited reconstruction capabilities, after learning to reproduce top images, the inverse tagger has also learned to interpolate QCD images. On the other hand, the direct tagger is never exposed to the complex features characterizing top images during training and thus fails to extrapolate to the corresponding top jet structures.

Tagging performance
Having investigated the limitations in the AE reconstruction power and the complexity bias, we show the performance of the direct and inverse taggers in Fig. 7. The ROC curves are constructed by varying the loss threshold for tagging a jet as anomalous. We also show the ROC curve for a supervised CNN tagger inspired by Ref. [43] (see Appendix A). Our direct AE tagger reproduces the results for the direct unsupervised AE tagger in Ref. [30], which has used a similar setup. As expected, it performs worse than a supervised tagger, but much better than random guessing. A similar performance for the unsupervised AE tagger has also been obtained in Ref. [31] in a slightly different setup. Hence, our direct tagger works as expected.
As can be seen from Fig. 7, the inverse tagger performs worse than randomly tagging jets as anomalous. Our inverse autoencoder fails to tag an anomaly (QCD jets in this case) that is simpler than the background. Through its training on top jets, the inverse tagger learns to reconstruct top jets better than the direct tagger (see right plot in Fig. 7), and its performance for the reconstruction of QCD jets is diminished. However, this is not enough to overcome the complexity bias.
To summarize, even with a limited reconstruction capability, an autoencoder can be a good anomaly tagger if there is a bias to reconstruct the background better. However, if the bias works against anomaly detection, the learning capabilities of the AE may not be sufficient to overcome the bias. Only a powerful AE with a background specific data compression in the latent space could potentially be able to overcome such a bias. Improvements of our setup to partly achieve these goals are discussed in Section 3.
It should be noted that a perfect AE, which is always able to perfectly reconstruct the input via the identity mapping regardless of the input data, would be useless as a tagger as it would always interpolate perfectly from the learned data to the anomalies.

Improving the autoencoder performance
Given the limited performance of the AE setup described in Section 2.1, we investigate possible improvements. One approach would be to change the AE architecture. There are unlimited possibilities which are worth investigating. However, here we want to point out some generic improvements concerning the complexity bias and the limited learning capabilities which might be helpful for any AE architecture. These improvements are introduced in Sections 3.1 and 3.2, and their impact on the AE performance and tagging capabilities is quantified in Sections 3.3 and 3.4, respectively.

Intensity remapping
As shown in Section 2.2, the AE focuses on reconstructing only the brightest pixels of an image. The contrast between bright and dim pixels is apparently so large that the dim pixels are not recognized during the early learning epochs. To tackle this problem, we propose several remappings of the pixel intensities, such that the intensities of the dim pixels are enhanced compared to those of the bright pixels. Such a remapping should encourage the AE to ignore less pixels and learn more of the intrinsic structure and features of the given training images. Hence, in particular the complexity of the simplest pictures (having only one or two very bright pixels) is increased, and we expect to also reduce the complexity bias. The intensity remapping can be considered as part of an alternative preprocessing.
Due to the normalization of the input features in the standard preprocessing described in Appendix A, all pixel intensities lie in the interval between 0 and 1. To highlight dim pixels by remapping this interval onto itself, we use the functions which are displayed in Fig. 17 in Appendix B. For the mapping R 3 we use α = 1000; the Θ-function maps each non-zero pixel to one. The original images correspond to the identity remapping R 0 (x) = x and are labeled accordingly in the following. After the remapping the image is normalized again to have a pixel sum equal to one. While remappings R 1 to R 3 are invertible, information is lost through R 4 , which should be thought of as a limiting case. The effects of the remappings are exemplified and quantified further in Appendix B.

Kernel MSE
As discussed in Section 2, it is difficult for the AE to learn the structure of an image. Instead, it only learns the exact position of the few brightest pixels. This is actually not surprising, since the AE is not rewarded by the loss function to reconstruct the image structure. It rather has to reproduce the exact pixel position, which seems to be an unnecessarily tough task. For example, shifting the input image by one pixel would be considered by a human as a valid reconstruction, but is strongly penalized for the standard MSE loss. In fact, the similarity of two images is only poorly represented by the MSE loss. To compare two distributions, i.e. the intensity distribution in our case, the Wasserstein distance, also called energy or earth mover's distance [53], would be a much better choice. However, its full calculation includes an optimization problem that would need to be solved for each image during training. Approximate estimates of this distance, e.g. the sliced Wasserstein distance [54], are also computationally expensive. We thus propose a simple loss function which nevertheless provides a better notion of similarity between two pictures than the MSE loss.
To introduce a notion of neighborhood for each pixel in the loss function we convolve the whole image with a smearing kernel. Hence, neighboring pixels have a partly overlapping smeared distribution which can be recognized by the loss function. The optimal reconstruction, i.e. vanishing loss, is still achieved by reconstructing the original (unsmeared) input image.
On the discrete set of pixels in each image the convolution is defined as where I kl denotes the intensity of the pixel with coordinates (k, l) of the original image, and S ij refers to the smeared pixels. The kernel is defined as with the additional constraint that it is set to zero for negative values. Hence, it is only non-zero for an Euclidean distance to the center pixel which is smaller than L pixels and has an approximately circular shape. We choose L = 8. The normalization is chosen such that the intensity of the central pixel is unchanged. To avoid boundary effects, the original picture is padded with zeros on each side and becomes 54 × 54 pixels. There is an infinite amount of possible choices for this kernel. However, we do not expect the following results to depend on its details. The reconstruction loss is defined as the MSE of the smeared input and the reconstructed image. We refer to it as kernel MSE or KMSE in the following. Since the smearing is a standard matrix convolution its computational cost is negligible for training and testing.

Autoencoder performance
To evaluate the effect of the intensity remapping on the autoencoder performance, we again investigate the distribution of the ratio r of the reconstructed intensity and the input intensity for the leading pixels. The corresponding histogram, i.e. the equivalent of Fig. 4, is shown for the R 2 -remapping in Fig. 8. When training on QCD jets (left figure), the fraction of well-reconstructed pixels is increased by the remapping. This suggests that the AE indeed learns more of the jet features. When training on top jets (right figure), the fraction of well-reconstructed jets is not increased by much, but it extends to more of the brighter pixels, also signaling that more features are potentially accessible. The decrease of performance beyond the 20th pixel (and in particular the disappearance of the peak) compared to the performance without remapping, Fig. 4 (right), is not discouraging, since these pixels correspond to very soft particles. 1 In Fig. 9, we compare the reconstruction quality for the different remappings. The distribution of the intensity ratio r (displayed in Figs. 4 and 8 by stacked histograms) 1 They are only accidentally well reconstructed for R0 by a tiny non-vanishing intensity which is typically found in the reconstructed image in the vicinity of bright pixels. Since the R2-remapping highlights their intensity, this mechanism does not work any more.  . Median and bands for the 25%/75% quantiles for the distribution of r for the leading pixels, where r is the ratio of the reconstructed and the input intensity of a given pixel in a given jet. Results are shown for QCD jets (left) and top jets (right). The AE is trained on the respective training set with a given remapping. is represented by its median and the 25%/75% quantiles. As long as the distribution is dominated by values of r close to one, the leading pixels are reconstructed well. For both the QCD and the top case, the median curves shift to the right for the remappings, i.e. more of the leading pixels are on average taken into account by the AE. Hence, the AE ignores less pixels and thus learns more features of the jet images. For the AE trained and tested on top jets, right figure, this is only achieved by reducing the reconstruction precision of the brightest pixels.
In Fig. 10, we show the reconstruction performance for training on R 0 images using the KMSE loss introduced in Section 3.2. Note that the KMSE loss function is only used during training. To evaluate the pixel reconstruction in the plots no smearing kernel is applied. As one expects, the exact intensity reconstruction of the leading pixels is traded for paying more attention to dimmer pixels. The strong focus to reduce the squared error of the brightest pixels as much as possible is removed. The same can be seen in Fig. 11 where we again show the distributions of the intensity ratio r in terms of the median and  quantiles for the different remappings. Although the brightest pixels are not very precisely reconstructed, dim pixels often have a significant part of their intensity reconstructed. Also when training on top jets using the KMSE loss functions, the remappings lead to more pixels being taken into account.
The stacked histograms for all remappings with and without KMSE loss function, which contain additional information on the r-distribution, are displayed in Appendix C.
The reconstruction of top jets (when training on top jets) with the R 0 and R 2 remappings using the standard MSE and the KMSE loss functions is illustrated in Fig. 12. Using the KMSE loss, the AE focuses more on the reconstruction of a continuous distribution of the intensity than on the reconstruction of individual pixels. Even relatively dim regions far from the center receive some attention. Thus, as also shown in Fig. 11, the dim pixels are partially reconstructed as part of this continuous distribution. Therefore, we expect the KMSE autoencoder to extract more useful features of the overall jet structure compared to the standard AE that focuses almost exclusively on the brightest pixels.
To understand the implications of the intensity remapping with respect to the com- Figure 12. Six top jet input R 0 images (first row) and the corresponding R 2 images (fourth row) together with the AE reconstructions using the MSE (second and fifth row, respectively) or KMSE loss (third and sixth row, respectively). Each image is individually normalized to have the same maximum pixel intensity. Small negative pixel intensities in the reconstructions are set to zero.
plexity bias, we show the loss of R 2 images as a function of the number of non-zero pixels N p in Fig. 13 for the AE trained on QCD (left figure) or top images (right figure) using the MSE loss function. Comparing to Fig. 6, the bias for the QCD-trained AE is reduced a lot and even reversed in direction. For the AE trained on top jets, the bias is also reversed, but additionally increased. This can be understood since the inverse AE sees only few of the training images with small N p , the complexity of which has been increased by the intensity remapping. Training on QCD images, the AE still learns to reconstruct images with low pixel number rather efficiently. Looking at the reconstruction of top images by the direct tagger in the same plot, it still cannot extrapolate well to this unseen data, so that the direct tagger will perform well (see next section). Training on top images with R 2 remapping, the AE has lost its ability to simply interpolate the QCD images with small N p , because their complexity has increased w.r.t. the R 0 case. Hence, this AE is a working inverse tagger. However, comparing the average loss for top and QCD jets in the right panel of Fig. 13, only in a region below 40 active pixels, top images are on average reconstructed better than QCD images. Moreover, the difference is small compared to the width of the distribution and the slope of the bias. Most of the inverse tagging performance is based on the QCD images with small N p , due to the reversed bias and still not due to a well-performing AE.
Using the KMSE loss, Fig. 14 shows that the loss as a function of N p for the direct tagger remains very similar to what we have seen in the MSE case. However, we observe that QCD jets with low N p are not automatically learned to be reconstructed well when training the inverse tagger on top images. Moreover, for N p < 35 top images are again reconstructed on average better than QCD images. Using both a remapping and the KMSE loss function for training, the behavior gets more pronounced. The corresponding distributions are shown in Fig. 19 in Appendix C for all combinations of remappings and loss functions.

Tagging performance
The performance for the direct and inverse taggers using the different intensity remappings introduced in Section 3.1 are summarized in terms of their ROC curves in Fig. 15. Derived performance measures are shown in Table 1 in Appendix C. For the direct tagger, QCD jets are background ( B = QCD ) and top jets are signal ( S = top ), so that the area under the curve plotted in Fig. 15 directly corresponds to the AUC. For the inverse tagger, signal and background interchange their role ( S = QCD and B = top ) and the area under the plotted curves should be as small as possible, since it corresponds to 1-AUC.
Plotting the ROC curves in this way, the area between the ROC curves of the direct and the inverse tagger, called ABC in the following, can be interpreted as the background specific learning achieved by the AE. Furthermore, the difference between the two AUC values for the direct and the inverse tagger, called ∆AUC in the following, is a measure for how biased the AE is. For an AE with small ABC or large ∆AUC one cannot expect a model-independent tagging capability. On the other hand, a large ABC and a small ∆AUC do not guarantee model-independence. It is only an indication that a step towards model-independence has been made.
Using the standard preprocessing without intensity remapping (R 0 ), we find a small ABC = 0.28 and a large ∆AUC = 0.55. As discussed in Section 2.4, the direct tagger only works due to a large complexity bias, and the inverse tagger fails.
For R 1 -remapping we see a minor decrease in the direct tagging performance, but a significant improvement in the inverse tagging performance with ABC = 0.48. This improvement is strong enough to enable inverse tagging although ∆AUC = 0.35 is still large. In contrast, R 2 -remapping shows little bias (∆AUC = 0.06) and an ABC = 0.43. Both direct and inverse tagging have a modest AUC close to 0.7. R 3 has a little more bias (∆AUC = 0.18) but the best ABC = 0.50. R 4 -remapping shows that too much highlighting of dim pixels is also counterproductive. It leads to ∆AUC = −0.3 and a very poor ABC = 0.11.
Although these results are a promising first step towards a more model-independent anomaly tagging, we want to recall that the inverse tagging is mainly caused by the correlation of the AE loss with the number of non-zero pixels N p . The AE is still not able to reconstruct a top jet on which it has been trained better than a QCD jet if both have the same N p . Moreover, in particular the tagging performance of the inverse taggers is still Figure 15. ROC curves of the direct (solid) and inverse (dashed) taggers using the different intensity mappings R 0 to R 4 . The dotted lines represent the ROC curves of the two taggers based on the number of the non-zero pixels in the image. very poor. To illustrate this, we also show the two taggers which only use the distribution of N p for both classes, as shown in Fig. 5, to tag anomalies with either a large or a small N p . None of our inverse taggers can beat the performance of the trivial tagger looking for few non-zero pixels. This shows once more that the inverse tagging for the intensity remappings mostly relies on the correlation of the loss with the number of pixels. Fig. 16 shows the ROC curves for direct and inverse taggers using the KMSE loss and the different intensity remappings (see also Table 2 in Appendix C for additional performance measures). The KMSE loss is also used as the anomaly score. On the one hand, the direct R 0 -tagger using the KMSE loss performs similarly to the MSE one. On the other hand, also the KMSE autoencoder is not able to provide a working inverse tagger, although the performance of the inverse tagger is significantly better than the AE with MSE loss. With ABC = 0.41 and ∆AUC = 0.44 it is less biased towards reconstructing QCD better.
For the intensity remappings, we find ABC = 0.52 and ∆AUC = 0.27 for R 1 , hence the KMSE loss improves the overall performance. The remapping R 2 results in ABC = 0.27 and ∆AUC = −0.09, i.e. the bias is still small but the performance is reduced. The remapping R 3 shows almost no bias (∆AUC = 0.02) but also reduced ABC = 0.32. The remapping R 4 still fails completely. In contrast to the MSE loss, using KMSE the inverse taggers based on R 1 to R 4 all show a performance very close to the pixel number tagger. While we have investigated the ability of the KMSE-based AE to reconstruct more jet features in Section 3.3, the tagging performance shows that these features are not necessarily correlated with only one of the two classes.

Conclusion
In this work we have investigated a specific autoencoder architecture, based on a convolutional neural network, for tagging top jets in a background of QCD jets -and vice versa -using the reconstruction loss as the anomaly score. We observe a rather limited performance of the autoencoder with respect to its image reconstruction capabilities, see Section 2.2. Nevertheless, we confirm findings from the literature [30,31] that such an autoencoder is a powerful top tagger. These apparently contradicting observations are a consequence of a strong complexity bias of the autoencoder-based tagger as discussed in Section 2.3. The sparsity of the jet images in combination with the underlying physics of QCD and top jets allow for a more successful reconstruction of the QCD jet images with respect to the mean squared error of all pixels, no matter which training data (i.e. QCD jets or top jets) are used.
Having made this observation in a very specific benchmark scenario, it is nevertheless quite obvious that it might generalize to other autoencoder architectures and most probably also to some applications with completely different data. Similar observations have been recently made for anomaly detection in natural images [33][34][35][36][37][38]. A poor autoencoder can be a good anomaly tagger if there is a strong bias which favours the reconstruction of the background data for a given anomaly example. Hence, a good tagging performance for a specific example does not imply a functional model-independent unsupervised tagger. On the other hand, a perfect autoencoder is useless as a tagger if it can also interpolate to reconstruct anomalies which have not (in the semi-supervised case) or rarely (in the unsupervised case) been seen during training.
As the top-tagging example shows, a biased autoencoder is not necessarily a bad thing. If, for example, one is only interested in anomalies with more complex jet structures, the bias helps tagging anomalies, and the setup is certainly also robust in an unsupervised approach working with a training sample including signal contamination [30,31]. However, model-independence will always be lost to a certain extent. For example, dark-matter jets from a strongly interacting sector are impossible to tag as anomalies in this case, as discussed in [55,56].
How model-independent an autoencoder based tagger is, is difficult to answer. After all, the autoencoder tagger should exactly be designed to find the sort of anomaly which is not a priori thought of. One way to approach the problem is to test the tagger on as many physics cases as there are available. However, this might be a tedious task, if many physics cases are at hand or if they would have to be designed for this purpose. We propose to at least evaluate the autoencoder by inverting a given test task, i.e. interchanging the role of background and signal and investigating the corresponding performance. Possible performance measures are the difference of the corresponding AUCs or the area between the ROC curves in a plot like Fig. 15 in Section 3.4.
Having understood at least some of the limitations of the original setup, we have investigated two modifications to improve the autoencoder. An intensity remapping for the jet images, as introduced in Section 3.1, and the kernel MSE loss function, as introduced in Section 3.2, are designed to help the autoencoder learn more relevant features within the jet images. Our modifications are shown to be promising first steps, both with respect to the autoencoder performance (Section 3.3) and with respect to the model-independence of the tagging performance (Section 3.4). However, the true progress is hard to quantify and better performance measures are needed in the future. Moreover, the intensity remapping is also affecting the bias of the autoencoder for reconstructing one of the two classes better than the other, irrespective of the training data. In particular, the inverse tagging, i.e. finding QCD jets in a top jet background, is to some extent due to a reversed bias, which is helpful for this particular test case but cannot be claimed as a model-independent advantage.
Future directions include the investigation of improved autoencoder architectures which we have not touched upon at all in this work. It would also be interesting to explore how anomaly detection based on the latent space of (variational) autoencoders or even completely different architectures is impacted by the complexity bias or other related biases when trained on the sparse jet images.
To summarize, this work provides valuable insights into the interplay of the autoencoder performance and anomaly tagging as well as first steps for improvements. However, we want to stress that a powerful and truly model-independent autoencoder for unsupervised anomaly tagging on jet pictures still needs to be developed.
Note added: The submission of this paper has been coordinated with [57,58] and with [59], which address the challenges of using autoencoders for anomaly detection in complementary ways.

Acknowledgements
We are grateful to Anja Butter, Tilman Plehn and the participants of the Machine Learning mini workshops organized within the Collaborative Research Center TRR 257 "Particle Physics Phenomenology after the Higgs Discovery" for useful discussions. This work has been funded by the Deutsche Forschungsgemeinschaft through the CRC TRR 257 under Grant 396021762 -TRR 257 and the Research Training Group GRK 2497 "The physics of the heaviest particles at the Large Hadron Collider" under grant 400140256 -GRK 2497. Simulations were performed with computing resources granted by RWTH Aachen University.

A Jet simulation and neural network architecture
The benchmark dataset from Ref. [45] is publicly available [46]. The jets are obtained for a center-of-mass energy of 14 TeV using Pythia8 [60] and fast detector simulation with the default ATLAS detector card of Delphes [61]. Multiple interactions and pile-up are ignored. The jets are clustered with FastJet [62] using the anti-k T algorithm [63] with a jet radius of R = 0.8. Moreover, the jets fulfil p T ∈ [550 GeV, 650 GeV] and |η| < 2. For top jets, a parton-level top and its decay partons are required within ∆R = 0.8 of the jet axis. The 200 jet constituents leading in p T are stored. Jets with less constituents are padded with zeros.
According to the preprocessing in Ref. [42], the jets are first centered in the η-φ-plane. Afterwards, they are rotated such that the principle axis is vertical and finally mirrored along both axes to obtain the hardest component in the first quadrant. After these steps, the jets are converted into two-dimensional jet images with 40 × 40 pixels corresponding to a two-dimensional transverse-momentum histogram of the jet constituents in the rotated ∆η-∆φ-plane. Finally, each jet image is divided by its total transverse-momentum, i.e. the pixel intensities sum to one.
Here, we explain in more detail the AE architecture given in Fig. 2. Each convolutional layer has a stride of 1 in both directions and a 4 × 4 kernel. Padding is performed in all convolutional layers such that the dimension of the output is the same as of the input. First, we use two convolutional layers with 10 and 5 filters for feature extraction. We reduce the dimension using average pooling with stride 2 and a 2 × 2 kernel, and add two more convolutions with 5 filters each. The feature map of the last convolution is flattened into a vector of 2000 nodes that are fully connected to 100 nodes in the next layer, which in turn are fully connected to the bottleneck layer with 32 nodes. These layers together form the encoder part of the AE. For the decoder we first add two fully connected layers with 100 and 400 nodes, respectively. The output of the latter is reshaped into a 20 × 20 feature map. Afterwards, we perform two convolutions (with 5 filters each) followed by a 2 × 2 upsampling layer to restore the dimensions of the image. We complete the decoder by adding three more convolutional layers with 5, 10 and 1 filter, respectively, resulting in a 40 × 40 dimensional output, matching the dimensions of the input. We apply the parameterized ReLU activation function in all hidden layers (convolutional and fully connected) with the corresponding α parameter initialized from a uniform distribution in the interval [−1, 1]. For the final layer, i.e. the last convolutional layer resulting in the AE reconstruction, no activation function is applied.
The training is done on the designated training set consisting of 100k images using the Adam optimizer [64] and a batch size of 500. As we observe a saturation of the reconstruction loss for this amount of training data, we conclude that it is sufficient. After each epoch of training, validation is performed on 40k images. Early stopping is employed if no improvement of the validation loss is achieved in 10 consecutive epochs. Early stopping always terminates the training before the maximum of 1200 epochs is reached, and we use the weights of the epoch with the lowest validation loss for testing. To improve the training with the KMSE loss function, we reduce the learning rate to 10 −4 and the batch size to 64.
To establish an upper limit for the unsupervised learning and to test our framework, we first use a supervised approach, i.e. we apply a convolutional neural network to this classification problem. The architecture of the CNN is based on the one used in [43]. The corresponding ROC curve in Fig. 7 is obtained by training for 100 epochs on 100k QCD and 100k top jet images, validating on 40k QCD and 40k top images and using the model with the lowest validation loss. The performance of our CNN is comparable to the results from the literature [43,65].

B Intensity distributions after remapping
In this appendix we exemplify and quantify the effects of the intensity remapping as introduced in Section 3.1. The remapping functions, R 1 (x) = √ x , R 2 (x) = 4 √ x , R 3 (x) = log(αx + 1) log(α + 1) and R 4 (x) = Θ(x) , (B.1) are displayed in Fig. 17. The identity mapping, corresponding to the original images, is denoted as R 0 (x) = x. The strength of the highlighting of the dimmer pixels is different for the different functions. Moreover, the derivative of the logarithmic remapping, R 3 , does not diverge at the origin, in contrast to the root functions.
In the left column of Fig. 18, we show the effect of the remapping for an exemplary top jet. The dim region in the bottom right of the R 0 image (upper panel) may be the third sub-jet in the top jet structure. Without highlighting it is barely visible and will most likely be ignored by the AE. By highlighting these dim pixels, the AE can more easily learn the information and features stored in them. Learning more distinctive features of the training set, the AE is also expected to perform better at distinguishing anomalies.
To quantify the effect of the highlighting on the whole set of images, we show the intensity distribution of the non-zero pixels for the 100k top jet images of the training set in the right column of Fig. 18. In the original images (upper panel) the distribution of the intensities of the non-zero pixels is rather broad, with a median that is more than an order of magnitude below the average value. The intensity remapping compresses the distribution on a logarithmic scale. The distributions after the R 1 and R 3 remapping span across around 1.5 orders of magnitude, and the distribution after the R 2 remapping lies nearly fully within one order of magnitude. Due to the normalization to sum to one, the distribution of pixel intensities for the R 4 -images reflects the distribution of the overall number of non-zero pixels. Note that the remappings do not change the number of nonzero pixels. Because of the final normalization, the average intensity of the non-zero pixels is also unchanged.

C Further results
In this appendix, we complete the display of the results discussed in the main part of this work.
We first collect the performance measures for the direct and inverse taggers (see Section 3, Figs. 15 and 16) in Tables 1 and 2. The measures E10 and E100 are defined as S (1/ B = 10) and S (1/ B = 100), respectively.
In Figs. 19 and 20 we finally show the reconstruction loss for individual jet images as a function of the number of non-zero pixels and the distribution of the ratio of reconstructed and input intensity for the leading pixels, respectively, for the different remappings and for the MSE and KMSE loss functions. Example of a top jet image (left), and the distribution of intensities of non-zero pixels for 100k top jet images (right) for different intensity remappings (from top to bottom R 0 , R 1 , R 2 , R 3 and R 4 ). The orange and red-dashed vertical lines are the median and the average of each distribution, respectively. Figure 19. Reconstruction loss of individual test jets as a function of the number of non-zero pixels N p , cf. Fig. 6. The first and second columns represent the results for an AE trained on QCD and top jet images, respectively, using MSE loss, while the third and fourth columns show the corresponding results using the KMSE loss function. The rows correspond to the different image remappings R 0 , R 1 , R 2 , R 3 and R 4 from top to bottom.   . Stacked histogram for different categories of the ratio r of the reconstructed and the input intensity of the non-zero pixels. Pixels are ordered by intensity from left to right for each of the 40k test jets, cf. Fig. 4. The first and second columns represent the results for an AE trained on QCD and top jet images, respectively, using MSE loss, while the third and fourth columns show the corresponding results using the KMSE loss function. The rows correspond to the different image remappings R 0 , R 1 , R 2 , and R 3 from top to bottom.