Topological Obstructions to Autoencoding

Autoencoders have been proposed as a powerful tool for model-independent anomaly detection in high-energy physics. The operating principle is that events which do not belong to the space of training data will be reconstructed poorly, thus flagging them as anomalies. We point out that in a variety of examples of interest, the connection between large reconstruction error and anomalies is not so clear. In particular, for data sets with nontrivial topology, there will always be points that erroneously seem anomalous due to global issues. Conversely, neural networks typically have an inductive bias or prior to locally interpolate such that undersampled or rare events may be reconstructed with small error, despite actually being the desired anomalies. Taken together, these facts are in tension with the simple picture of the autoencoder as an anomaly detector. Using a series of illustrative low-dimensional examples, we show explicitly how the intrinsic and extrinsic topology of the dataset affects the behavior of an autoencoder and how this topology is manifested in the latent space representation during training. We ground this analysis in the discussion of a mock "bump hunt" in which the autoencoder fails to identify an anomalous "signal" for reasons tied to the intrinsic topology of $n$-particle phase space.


Introduction
Data of interest in the physical sciences often consists of features of low intrinsic dimensionality packaged in a high-dimensional space. For example, the variants of a gene might be embedded in the much larger space of base pair sequences, or a single fundamental particle might manifest itself as an $N \times N$-pixel ($N \gg 1$) jet image [1,2] in a particle detector at a high-energy physics experiment. A common task is to detect outliers, or "anomalies," in a large data set; a common tool to perform this task is a neural network autoencoder [3,4].$^{1}$ The autoencoder architecture is quite simple: input data is processed and passed through a feedforward network to a latent layer of smaller width than the input. The output of the latent layer is then processed and unpacked to an output layer of the same size as the input layer. The intuition is that the data is compressed in the smaller latent layer and uncompressed on its way to the output layer. An autoencoder trained on a large data sample attempts to learn a compressed representation of the data, and a network successful in this task should have small reconstruction error, measured for example by taking the loss function to be the mean squared error between the output and the input.
On the other hand, some particles, such as leptons, may be well-characterized by their 4-vectors rather than the more complicated jets characteristic of hadrons. A "bump hunt" search for a new particle in events containing leptons will feature anomalous events drawn from the same manifold as the background events, but localized to a submanifold. For example, in the search for the Higgs in the 4-lepton "golden channel" $H \to ZZ^* \to 4\mu$ [52,53], the background events have 4 muons in the final state with a broad distribution of invariant masses, and the "anomalous" Higgs decay events are distinguished by lying on the submanifold of 4-particle phase space where the squared invariant mass of all four muons is equal to $m_H^2$. In this case, an autoencoder trained on a sideband data set of background events excluding squared invariant masses near $m_H^2$ may attempt to perform an interpolation task when run on a Higgs decay event. Such interpolation tasks are generally "easy" for neural networks, and thus might be expected to lead to low autoencoder loss for the signal Higgs events, which is the opposite of the desired behavior.

$^{1}$ Anomaly detection in high-energy physics using machine learning is a rich and growing field: some other model-independent strategies include weakly supervised learning (such as classification without labels), density estimation, and likelihood-free anomaly detection. See [5,6] for a review of these and other strategies, and also [7] for a summary of some of these techniques as applied to simulated data.
In this paper, we will investigate how the topology of data manifolds may pose a number of important obstructions to autoencoder performance on the second type of anomaly-detection task, where anomalous events lie on a distinguished submanifold of the manifold of background events. Consider an autoencoder trained on a set of 4-vectors sampled from $n$-particle phase space. Since $4n > 3n - 4$, the embedding space is clearly redundant, and one might expect that after sufficient training, an autoencoder can achieve essentially zero reconstruction error on the training set for latent dimension $d_l$ equal to the intrinsic data dimension, $3n - 4$. However, as we will show, this is impossible because phase space does not have the trivial topology of $\mathbb{R}^{3n-4}$, but rather that of a sphere $S^{3n-4}$. A generic neural network autoencoder is a composition of continuous maps, so the nontrivial topology makes unavoidable the existence of nearby points on the data manifold which are mapped to distant points in the latent space, exactly as a Mercator projection distorts the poles of the 2-sphere when mapped into $\mathbb{R}^2$.$^{2}$ The easiest context in which to visualize this topological obstruction is the unit circle, which we will study extensively in order to gain intuition for the breakdown of these maps to the latent space. Points on the circle may be labeled by a single number, an angle $\phi$, but since $\phi$ and $\phi + 2\pi$ represent the same point, an autoencoder which attempts to compress points on the circle to their angular coordinate $\phi$ will rip apart nearby points in the data manifold during the compression. More precisely, in the language of differential topology, the latent space is a single chart on the data manifold, which can accurately capture the local geometry but not the global topology, which requires additional charts with transition functions between them.
The failure of the latent representation will imprint spurious features on the data. This has two important and related consequences:
• If the data manifold has nontrivial topology, there will always be points or regions in the training set with poor reconstruction error, even when the latent dimension is equal to the intrinsic dimension of the data. These regions are not the desired anomalies, but instead avatars of the topological obstruction to mapping the data manifold into a topologically-trivial latent space.
• If anomalous events live on a submanifold (as in the Higgs example above), the autoencoder may learn to interpolate smoothly across the submanifold even if the training distribution had no support there, causing the would-be anomalous events to have the same error distribution as background events.
These observations present obstacles to using autoencoders as practical anomaly detectors. A necessary condition for a successful autoencoder is near-perfect (or at least uniform-loss) reconstruction on the training set, since otherwise the compression of the data is not faithful, but the topology of the data manifold can render that impossible without additional priors on the network. In addition, the background distribution itself may introduce additional topological or geometrical features; in the physics context, a matrix element governing the background process with poles or zeros at certain values of the kinematic variables may concentrate the events with large loss away from the desired submanifold.

This paper is organized as follows. In Sec. 2 we define our basic autoencoder architecture, where in particular we take the latent dimension $d_l$ equal to the dimension $d$ of the data. We then introduce a specific example in Sec. 3 of an autoencoder failing to perform a bump hunt in 3-particle phase space. The remainder of the paper is devoted to understanding the features of that failure by studying a series of low-dimensional examples, motivated by the fact that phase space has the topology of a sphere. We start in Sec. 4 with the simplest example, the circle $S^1$ embedded in $\mathbb{R}^2$, and show how the periodicity of the angular coordinate on the circle poses an obstruction to training an autoencoder with a latent layer $\mathbb{R}^1$. Moving to the 2-sphere in Sec. 5, we construct an easily-visualized analogue of the anomalous submanifold $S^1 \subset S^2$, and examine the interplay between topology, extrinsic geometry, and sampling distributions with a double cone. We confirm that these features persist in higher dimensions in Sec. 6. Armed with this intuition, we return to the example of 3-particle phase space in Sec. 7. We briefly summarize the effects of taking $d_l > d$ in Sec. 8, arguing that this does not cure the issues we have identified, and conclude in Sec. 9.
Additional details are provided in the Appendices: App. A describes our hyperparameter choices, App. B studies the $S^1$ example in depth including an analytic investigation of the trained network dynamics, and App. C describes our studies of spaces with topological obstructions even for $d_l > d$.
Our goal in this work is not to claim that autoencoders are doomed to fail in the high-energy physics context, but rather to make the point that the topology of phase space and the inductive bias of autoencoders toward interpolation are important pieces of prior knowledge which should be considered before attempting a black-box solution to generalized anomaly detection.$^{3}$ In fact, it is somewhat surprising that autoencoders appear to perform worse on the nominally easier task of a bump hunt in leptons than on the superficially much more complicated task of jet image recognition and classification, since leptons live on a phase space of fixed dimension. The increasing prominence of "physics-inspired neural networks," where networks with important symmetry principles (such as gauge equivariance and Lorentz symmetry) hard-coded into the network architecture perform better than networks which are forced to learn these principles from scratch [54-56], suggests that knowledge of the topology may in fact be necessary to appropriately interpret the autoencoder performance. We illustrate this point with the low-dimensional examples described above, and speculate on how these principles might be applied in the context of phase space.$^{4}$

Autoencoder architecture
In this paper, we will implement an autoencoder as a multilayer neural network. Our baseline network architecture will be as follows: a 5-layer, fully-connected network with layer widths $(d_{\rm in}, d_w, d_l, d_w, d_{\rm in})$ and loss function $L = \|y - x\|^2$, where $x$ is the input and $y$ is the output. The second and fourth layers have $d_w \gg d_l, d_{\rm in}$ to ensure that for low-dimensional examples we are not artificially penalizing ourselves by using a network with too few parameters to accurately approximate the embedding of the data manifold to $\mathbb{R}^{d_l}$.
To verify that the number of network parameters is not the limiting factor in autoencoder performance, we will sometimes add a second layer of width $d_w$ to both the encoder and decoder, so that the full network has 7 layers.
We will be primarily concerned with autoencoders with $d_l < d_{\rm in}$ but $d_l = d$, so that the latent representation has the same number of degrees of freedom as the manifold from which the data is sampled. We will refer to the map $\mathbb{R}^{d_{\rm in}} \to \mathbb{R}^{d_l}$ as the encoder or latent representation and the map $\mathbb{R}^{d_l} \to \mathbb{R}^{d_{\rm in}}$ as the decoder, each of which is a 1-hidden-layer neural network. We will refer to the full autoencoder map $\mathbb{R}^{d_{\rm in}} \to \mathbb{R}^{d_{\rm in}}$ as the model. Our default width will be $d_w = 64$; this is small by the standards of networks used for e.g. (jet) image recognition, but it is much larger than the $d_{\rm in} \leq 12$ we will be considering in this paper. In each example we train the network with stochastic gradient descent (SGD) for 20,000 epochs using a training set of size $N_{\rm train}$ and a test set of size $N_{\rm test}$, both sampled from the same distribution. Our batch size and learning rate hyperparameters for each example are given in Tab. 1 in App. A; we have checked that our conclusions are robust to changes in these hyperparameters, because in essentially all examples the networks will be trained to convergence but do not overfit the training data.
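The encoder, decoder, and model maps defined above can be sketched in a few lines of NumPy. This is only a minimal illustration of the layer widths $(d_{\rm in}, d_w, d_l, d_w, d_{\rm in})$ and the loss $L = \|y - x\|^2$: the `tanh` activation, the linear latent and output layers, the weight initialization, and the function names are all assumptions made here (the actual hyperparameter choices are deferred to App. A), and the SGD training loop is omitted.

```python
import numpy as np

def init_autoencoder(d_in, d_l, d_w=64, seed=0):
    """Random weights and zero biases for widths (d_in, d_w, d_l, d_w, d_in)."""
    rng = np.random.default_rng(seed)
    widths = [d_in, d_w, d_l, d_w, d_in]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def encode(params, x):
    """Encoder R^{d_in} -> R^{d_l} with one hidden layer of width d_w."""
    (w1, b1), (w2, b2) = params[:2]
    return np.tanh(x @ w1 + b1) @ w2 + b2   # tanh is an assumed activation

def decode(params, z):
    """Decoder R^{d_l} -> R^{d_in} with one hidden layer of width d_w."""
    (w3, b3), (w4, b4) = params[2:]
    return np.tanh(z @ w3 + b3) @ w4 + b4

def reconstruction_loss(params, x):
    """Per-point loss L = ||y - x||^2, where y is the model output."""
    y = decode(params, encode(params, x))
    return np.sum((y - x) ** 2, axis=-1)
```

With `d_in=12` and `d_l=5` this matches the dimensions of the 3-particle phase space example, up to the extra hidden layers of the 7-layer variant.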
To visualize the output of the autoencoder, especially in low-dimensional examples, we will plot both the test set data and the predictions of the model on the test set. Occasionally, it will be convenient to present these on the same plot, to see both the density of data points and the density of their images as predicted by the autoencoder. Since our loss function is the squared Euclidean distance, plotting them together can show how large-loss points will be mapped far away from their true locations.

$^{4}$ We note that similar considerations have been investigated in [57-60], though not in the context of physics. In particular, [58] notes that nontrivial topological structure in the input data can require an autoencoder with latent dimension larger than the intrinsic dimension of the data, [59] considers adding a term to the loss function to force the latent layer to preserve topological structures of the data, and [60] performs an in-depth study of the observations of [57] to understand how topology is transformed at each layer of a feed-forward network.

Figure 1. Left: cartoon of the geometry of the 3-vectors of particles $Y$, $\bar{Y}$, and $Z$ sampled from 3-particle phase space. In the center-of-momentum frame, the particles are coplanar. Center: example of a Dalitz plot for uniform sampling from 3-particle phase space, which uniformly populates a right isosceles triangle in the $s_{Y\bar{Y}}$-$s_{YZ}$ plane. Right: if the matrix element for the process contains a resonance, say at $s_{Y\bar{Y}} = m_X^2$ from the intermediate decay $X \to Y\bar{Y}$, there will be an oversampled "stripe" (purple) in the Dalitz plot.
An important feature of an autoencoder is that for $d_l \geq d_{\rm in}$, the global minimum of the loss is always the identity function on $\mathbb{R}^{d_{\rm in}}$. However, this is not a generalizable minimum, since the loss would be zero on any input whatsoever, even one having nothing to do with the data distribution, which is a particular submanifold of $\mathbb{R}^{d_{\rm in}}$. Since the goal of an autoencoder is to learn a useful low-dimensional representation of the data (an encoding), a successful autoencoder will find its way to a local minimum, which is zero (or close to it) on the training and test sets, but is nonzero on other data. The function of the latent layer with $d_l < d_{\rm in}$ is to prevent the network from finding the trivial global minimum.$^{5}$ Regardless, the existence of a global minimum for the family of autoencoder architectures with different $d_l$ which is not the desired loss minimum means that initialization and gradient descent algorithms may be an important component of the analysis, and suggests a loss landscape for the autoencoder with a rich structure; we discuss a number of examples in Appendices B and C.

Failure of a bump hunt
The manifold $M_{n=3}$ of 3-particle phase space is defined by imposing energy/momentum conservation (4 constraints) and putting the three particles on mass shell (3 constraints), which imposes 7 algebraic constraints on the three 4-vectors (12 parameters) and yields a 5-dimensional manifold embedded in $\mathbb{R}^{12}$.$^{6}$ As we will explain in Sec. 7, $M_{n=3}$ has the topology of the 5-sphere $S^5$, which has important implications for autoencoder behavior.
To approximate the situation typically encountered at colliders (and also to simplify the analysis), we will consider the final state $Y + \bar{Y} + Z$ where all final-state particles are distinguishable and massless; the example we have in mind is a bump hunt in leptons, where (say) $Y$ is a muon and $Z$ is a photon, and the collision energy is large enough that the muon is approximately massless. In the center-of-momentum (COM) frame, the three particles are coplanar (Fig. 1, left). The natural measure on phase space is the Lorentz-invariant measure, which for 3-body phase space takes a particularly simple form [62]:

$$\int d\Pi_3 = \frac{1}{128\pi^3 s} \int_R ds_{Y\bar{Y}}\, ds_{YZ}, \tag{3.1}$$

where $\sqrt{s}$ is the collision energy in the COM frame, and $s_{Y\bar{Y}} = (p_Y + p_{\bar{Y}})^2$ and $s_{YZ} = (p_Y + p_Z)^2$ are the invariant squared masses of the $Y\bar{Y}$ and $YZ$ pairs, respectively. The shape of the region $R$ depends on the masses of the final-state particles and is conveniently visualized in a Dalitz plot. Events which are sampled uniformly from phase space will uniformly populate $R$ in the $s_{Y\bar{Y}}$-$s_{YZ}$ plane, which for the massless case is the right isosceles triangle defined by $s_{Y\bar{Y}}, s_{YZ} > 0$ and $s_{Y\bar{Y}} + s_{YZ} \leq s$ (Fig. 1, center). The remaining three phase space coordinates are Euler angles defining an element of $SO(3)$ which orients the event, and have been integrated over in Eq. (3.1), making the Dalitz plot a particularly convenient 2-dimensional projection of the 5-dimensional manifold $M_{n=3}$. The boundaries of $R$ correspond to events where two particles are collinear, and the corners of the Dalitz triangle for massless particles correspond to a soft particle whose energy goes to zero; for finite final-state masses, these corners are rounded off. Note that the measure is uniform in any pair of invariant masses, and for massless particles $R$ is the same triangle for all three such pairs. In real particle physics events, the matrix element for the desired process introduces a non-uniform distribution on $R$: for example, if a resonance $X$ of mass $m_X$ can decay to $Y\bar{Y}$, an oversampled "stripe" will appear in the Dalitz plot at $s_{Y\bar{Y}} = m_X^2$ (Fig. 1, right).

$^{5}$ Noise injected into the training data may serve the same purpose, though we do not consider that strategy in this work.

$^{6}$ See Ref. [61] for a detailed study of the geometry of phase space as a Riemannian manifold.
In this work we will focus on the intrinsic topology of phase space and only sample uniformly according to Eq. (3.1), but we will comment throughout on the role of the sampling distribution, which may itself be incorporated into the geometry of the phase space manifold [61].
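One way to realize the uniform sampling described above is sketched below, assuming massless final-state particles: draw a point uniformly in the Dalitz triangle, reconstruct the coplanar momenta in the COM frame from the massless relation $s_{ij} = s - 2\sqrt{s}\,E_k$, and apply a Haar-random rotation to each event. The function name `sample_three_body`, the particle ordering, and the QR-based rotation are choices made for this illustration and are not taken from the text; stacked `np.linalg.qr` requires a reasonably recent NumPy.

```python
import numpy as np

def sample_three_body(n, sqrt_s=1.0, seed=0):
    """Massless 3-body events, uniform in the Dalitz triangle.

    Returns (p, s_yy, s_yz); p has shape (n, 3, 4), one (E, px, py, pz) row
    per particle in the order (Y, Ybar, Z).
    """
    rng = np.random.default_rng(seed)
    s = sqrt_s ** 2
    # Uniform point in the triangle s_yy, s_yz > 0, s_yy + s_yz <= s,
    # obtained by folding the unit square along its diagonal.
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1.0
    u, v = np.where(flip, 1.0 - u, u), np.where(flip, 1.0 - v, v)
    s_yy, s_yz = s * u, s * v
    # Massless kinematics in the COM frame: s_ij = s - 2 sqrt(s) E_k.
    e_z = (s - s_yy) / (2 * sqrt_s)
    e_yb = (s - s_yz) / (2 * sqrt_s)   # s_yz pairs Y with Z, leaving Ybar
    e_y = sqrt_s - e_z - e_yb
    # Opening angle between Y and Z from s_yz = 2 E_Y E_Z (1 - cos theta).
    cos_t = np.clip(1.0 - s_yz / (2 * e_y * e_z), -1.0, 1.0)
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    zero = np.zeros(n)
    p_y = np.stack([e_y, zero, zero], axis=1)
    p_z = np.stack([e_z * cos_t, e_z * sin_t, zero], axis=1)
    p3 = np.stack([p_y, -(p_y + p_z), p_z], axis=1)   # shape (n, 3, 3)
    # Haar-random rotation per event from the QR decomposition of a
    # Gaussian matrix, with signs fixed so that det = +1.
    q, r = np.linalg.qr(rng.normal(size=(n, 3, 3)))
    q = q * np.sign(np.diagonal(r, axis1=1, axis2=2))[:, None, :]
    q[:, :, 0] *= np.linalg.det(q)[:, None]
    p3 = np.einsum("nij,nkj->nki", q, p3)
    energies = np.stack([e_y, e_yb, e_z], axis=1)[:, :, None]
    return np.concatenate([energies, p3], axis=2), s_yy, s_yz
```

Flattening `p` to shape `(n, 12)` gives inputs of dimension $d_{\rm in} = 12$, while each event is specified by only five numbers (two Dalitz variables and three rotation angles).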
We perform a mock "bump hunt" by normalizing our units such that $\sqrt{s} = 1$ and choosing a desired squared invariant mass, say $s_{Y\bar{Y}} = 0.25$, corresponding to a heavy unstable particle $X$ of mass $m_X = 0.5$ ($m_X^2 = 0.25$) which decays to $Y\bar{Y}$. We then train a 7-layer autoencoder with $d_{\rm in} = 12$ and $d_l = 5$ on "sideband" data sampled from the distribution (3.1) excluding the region $0.9\, m_X^2 < s_{Y\bar{Y}} < 1.1\, m_X^2$; we use this deeper network rather than the 5-layer autoencoder to ensure that there are no issues with network capacity that would inhibit learning the full geometric structure of the phase space manifold. Our setup mimics the standard procedure of fitting a background model to sidebands before examining the signal region. The choice of latent dimension is determined by the dimension of phase space: since the sideband data is drawn from a 5-dimensional manifold, $d_l < 5$ would fail to capture the full geometry of the background data and would result in large losses across the whole data distribution, while $d_l > 5$ would be a redundant parameterization of the data. To distinguish signal from background, we generate two test sets: a sideband test set sampled from the same distribution as the training data, and a signal test set with $s_{Y\bar{Y}} = 0.25$ but otherwise sampled uniformly in phase space.

Figure 2. Left: the boundary of the sampled region is shown in black, and only 5% of the test set is shown for clarity. Right: the 10 worst-loss points in the sideband test set (blue) and their model predictions (red arrows). The worst loss is located at isolated points near the boundary of the excised interval, and these points are mapped far away in the Dalitz plane. However, the excised region is reproduced fairly well. The corners of the Dalitz triangle are also reproduced poorly, but note that the density of points mapped to the corners is large while the loss of any individual point there is considerably smaller than the worst-loss points.
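At the level of the Dalitz plane, the sideband and signal selections described above might be sketched as follows. The function names and the `window` parameter (the fractional half-width of the excised interval, 0.1 for the cut quoted in the text) are introduced here for illustration only. The signal set fixes the $Y\bar{Y}$ invariant squared mass at $m_X^2$ and draws the other Dalitz variable uniformly over its allowed range $(0, s - m_X^2)$, which is what uniform phase space sampling reduces to on that line for massless particles.

```python
import numpy as np

def dalitz_uniform(n, rng, s=1.0):
    """Uniform points in the triangle s_yy, s_yz > 0, s_yy + s_yz <= s."""
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1.0
    return s * np.where(flip, 1.0 - u, u), s * np.where(flip, 1.0 - v, v)

def sideband_and_signal(n, m2_x=0.25, window=0.1, s=1.0, seed=0):
    """Return (sideband, signal) arrays with columns (s_yy, s_yz)."""
    rng = np.random.default_rng(seed)
    s_yy, s_yz = dalitz_uniform(n, rng, s)
    # Sideband: reject events with (1 - window) m_X^2 < s_yy < (1 + window) m_X^2.
    keep = (s_yy <= (1.0 - window) * m2_x) | (s_yy >= (1.0 + window) * m2_x)
    sideband = np.column_stack([s_yy[keep], s_yz[keep]])
    # Signal: s_yy pinned to m_X^2, s_yz uniform on its allowed range.
    signal = np.column_stack([np.full(n, m2_x),
                              rng.uniform(0.0, s - m2_x, size=n)])
    return sideband, signal
```

Note that the sideband array is shorter than `n`, since the rejected events are simply dropped.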
The boundary of the sampling region for the training and sideband test sets, along with the autoencoder output on the test set colored by loss, is shown in Fig. 2 (left). Note that the autoencoder does a fairly good job of identifying the boundaries of the sideband region around $s_{Y\bar{Y}} = 0.25$, but has trouble at the corners of the Dalitz triangle which correspond to kinematic endpoints where the energy of one particle goes to zero. While it is true that the autoencoder task is only to minimize the Euclidean distance between the model and the data point-by-point in phase space, the spurious features in the model output imply correlations which will be imprinted on the loss distribution, which is the desired diagnostic for anomaly detection. Furthermore, the largest-loss points (blue), with loss $L \gtrsim 0.05$, are located near the boundary of the excised interval near $s_{Y\bar{Y}} = 0.25$, as shown in Fig. 2 (right). While these large-loss points are mapped far away by the trained model, most points near the excised interval are low-loss and are mapped close to their true locations. Indeed, we will show in the remainder of this paper that the existence of a neighborhood of large-loss points (i.e. those whose predictions are far away from the true value, and thus have large loss as measured by Euclidean distance) is a direct consequence of phase space having the topology of a sphere.

Figure 3. Left: Dalitz plot for a "signal" test set with $s_{Y\bar{Y}} = 0.25$ but uniformly sampled in phase space otherwise. 5% of the data is plotted for clarity. The network trained on the sideband distribution learns to interpolate through the signal region, such that the signal events are not flagged as anomalous. Right: normalized loss distributions for the sideband test set (Fig. 2) and signal test set. Remarkably, the loss distributions are almost identical for the two data sets.
Next, we run the trained autoencoder on the signal test set. The Dalitz plot is shown in Fig. 3, left. In the Dalitz plane, the signal data lives on a vertical line (the purple line in Fig. 1, right), and we see that despite the autoencoder never having seen points in this region before, it can smoothly interpolate across it, reconstructing points in the signal region with low loss except for a few isolated points. These points are not any more or less anomalous than the rest of the signal data, but are simply the neighbors of the large-loss points in the sideband test set which get mapped far away by the model. The loss distributions of the two test sets (Fig. 3, right) are essentially identical. In particular, there is no obvious large-loss tail for the signal events which would flag them as anomalies, despite events with $s_{Y\bar{Y}} = 0.25$ being entirely absent from the training set. There is no reasonable decision boundary that one could draw to separate these two distributions. Cutting on whatever small large-loss tail does exist, at say $L = 10^{-2}$, would give a signal efficiency of $\epsilon_S = 7.4 \times 10^{-4}$ and background rejection power $1/\epsilon_B = 3 \times 10^3 \approx 2/\epsilon_S$, making this autoencoder an extremely poor anomaly detector for rare events and no better than a random classifier at larger signal efficiency. This simple example should be compared with e.g. Ref. [43], where QCD and non-QCD jets have largely non-overlapping loss distributions for a similar autoencoder architecture and reasonable ROC curves which can achieve the same background rejection with $\epsilon_S \sim 0.1$.
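The figures of merit quoted above follow from a simple threshold cut on the per-event loss. A minimal sketch, with a hypothetical function name and with the convention (assumed here) that events whose loss exceeds the threshold are tagged as anomalous:

```python
import numpy as np

def cut_performance(loss_signal, loss_background, threshold):
    """Signal efficiency and background rejection for a cut loss > threshold."""
    eps_s = np.mean(np.asarray(loss_signal) > threshold)
    eps_b = np.mean(np.asarray(loss_background) > threshold)
    # Rejection 1/eps_b is infinite if no background event passes the cut.
    rejection = np.inf if eps_b == 0 else 1.0 / eps_b
    return eps_s, rejection
```

For comparison, a random classifier has equal signal and background efficiencies, so at a signal efficiency of $7.4 \times 10^{-4}$ it would already reject background at the level of $1/(7.4 \times 10^{-4}) \approx 1.4 \times 10^3$; the quoted rejection of $3 \times 10^3$ improves on this by only about a factor of two.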
We have checked that varying $d_l$ does not change this conclusion. For $d_l > 5$ the loss distributions for both background and signal have no large-loss tail (indicating near-perfect reconstruction) and whatever tail exists for the background events exceeds that of the signal events, so the signal events would be classified as "less anomalous." For $d_l < 5$, the loss tails for both the background and signal distributions are large and nearly identical, and moreover the network fails to identify the boundaries of the sideband intervals, so no information is gained by further reducing the latent dimension. Similarly, changing the length of training does not change our conclusions: the large-loss points persist for both shorter and longer training, and the loss tails for the sideband and signal sets do not separate. We have also verified that the results are unchanged when including both 1% Gaussian smearing on all 4-momentum coordinates and sampling the signal test set from a Breit-Wigner distribution with 0.5% width, both of which resemble typical detector effects and matrix element structures for realistic applications.
This result would seem to preclude using a standard neural network autoencoder to perform a bump hunt in leptons, where the lack of soft and collinear radiation makes the particle 4-vector a decent proxy for what is actually measured at a collider detector (in other words, parton-level observables are nearly equivalent to detector-level observables), unless additional features were incorporated into the autoencoder architecture. Given that simple feed-forward autoencoders have already found some success in anomaly detection in jet images [43], it is a priori somewhat surprising that the same network architecture fails at what would naively seem to be an easier problem.$^{7}$ As we will see in the remainder of this paper, because the training data for the lepton bump hunt is sampled directly from $n$-particle phase space, one can better understand these results from the perspective of topology: the network uses its inductive bias toward interpolation to trivialize the topology of $S^{3n-4}$ in the most minimal way possible, which involves localizing all large losses near a single point in phase space and interpolating everywhere else.

Dimension 1: intrinsic topology and the unit circle
To elucidate our observations about phase space, we will explore a series of low-dimensional examples which encapsulate particular topological features and are more easily visualized. In dimension 1, all connected, compact manifolds without boundary have the same intrinsic topology as the circle.$^{8}$ Our examples in this section illustrate two key points:
1. If the manifold has nontrivial intrinsic topology, then there will be some data points on the manifold reconstructed with large loss despite not being anomalies of any kind;
2. The sampling distribution used during training can influence the location of those badly reconstructed points.
To start, we consider the simple example of a training set of points $(x, y)$ equidistantly spaced on the unit circle $S^1$. Since $S^1$ has dimension 1, every point on the data manifold can be represented by a single number, an angle $\phi \in (-\pi, \pi]$ such that $(x, y) = (\cos\phi, \sin\phi)$. Thus, a latent layer with $d_l = 1$ should be able to fully capture the local geometric features of this manifold. However, the periodicity of $\phi$ is a topological obstruction to learning the global structure of the data manifold. In the language of differential geometry, a choice of coordinates is a chart $S^1 \to \mathbb{R}$, but the nontrivial topology of $S^1$ means that it must be covered by at least two charts; without additional structure in the autoencoder network, the latent representation can only provide a single chart. Since $\phi$ is periodic, with $\phi$ and $\phi + 2\pi$ corresponding to the same point in the training set, the latent layer's encoding function $(x, y) \mapsto f_{\rm enc}(\phi)$ will cover its range at least twice, with one arc of the circle mapping onto the interval $[\min(f_{\rm enc}), \max(f_{\rm enc})]$ and its complementary arc mapping onto the same interval. The reconstruction can be accurate on at most one of those arcs, so we expect the autoencoder to make one of those arcs as large as possible and the other as small as possible.
Indeed, this is exactly what happens. Fig. 4 (top left) shows the latent representation as a function of the input $\phi$ after training the 5-layer network on a training set composed of equidistant points on the unit circle. The representation fails quite obviously at a particular angle $\phi_0$ (marked in red) which we refer to as the break point; this result is consistent with the fact that the circle with a point excised is topologically equivalent to $\mathbb{R}$.$^{9}$ As can be seen in the plot of the model output (Fig. 4, top right, where we define the output $\phi$ as $\tan^{-1}(y/x)$), the autoencoder maps points near $\phi_0$ all over the circle, with output values of $\phi$ ranging from $-\pi$ to $\pi$. This is also easily visualized by plotting the model as points in $\mathbb{R}^2$ (Fig. 4, bottom left). This leads to large reconstruction error in the neighborhood of $\phi_0$: Fig. 4 (bottom right) shows the losses as a function of $\phi$ on a uniformly-sampled test set. Losses at $\phi_0$ are on the order of $10^3$ times the loss of a generic test set point, despite the fact that the break point is not an anomalous point of any kind but rather just another generic point on the circle. In Appendix B, we solve the network dynamics for a generic activation function and SGD training and demonstrate why a finite-sized break region around $\phi_0$ persists even after long periods of training. The size of this break region is roughly independent of the network width and depth for a fixed training length, shrinks very slowly (perhaps logarithmically) with training, and depends primarily on the particular form of the activation function and the training algorithm. We plan to return to the rich interplay between topology and network dynamics illustrated by this simple example in future work.
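The inevitability of a break point can be illustrated with a short argument, sketched here under the assumption that the encoder $f_{\rm enc}$ (viewed as a $2\pi$-periodic function of $\phi$) and the decoder, denoted $g$ for this illustration, are both continuous, as is the case for standard activation functions. It is a one-dimensional version of the Borsuk-Ulam theorem and is not taken from the analysis of App. B.

```latex
% Assumptions: f_enc : S^1 -> R and g : R -> R^2 are continuous; g and h are
% names introduced only for this sketch.
\begin{align*}
h(\phi) &\equiv f_{\rm enc}(\phi) - f_{\rm enc}(\phi + \pi),
  \qquad h(\phi + \pi) = -h(\phi) \\
&\Rightarrow\ \exists\, \phi_* :\ h(\phi_*) = 0
  \quad \text{(intermediate value theorem)}, \\
p &\equiv (\cos\phi_*, \sin\phi_*), \qquad
  y \equiv g\big(f_{\rm enc}(\phi_*)\big) = g\big(f_{\rm enc}(\phi_* + \pi)\big), \\
\|y - p\|^2 + \|y + p\|^2 &= 2\|y\|^2 + 2\|p\|^2 \ \geq\ 2 .
\end{align*}
```

Some pair of antipodal points therefore shares a latent code, and at least one member of that pair is reconstructed with loss $L \geq 1$, however long the network is trained. Training can shrink the region where this happens, but cannot remove it.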
In fact, the region around the break point persists even for the absurdly small training set of 20 equidistant points on the unit circle. In that case, the loss at the worst point in the training set is only $\sim 10$ times the loss for a generic point after 100,000 epochs of training. However, the network has not simply memorized the training data because the output map fails to reconstruct at least one of the training set points. Indeed, a bad point seems to occur as long as the density of the training set is high enough that the break region size exceeds the spacing between data points (for the hyperparameters given in Appendix A, this occurs for training sets containing 15 or more equidistant points on the unit circle).$^{10}$ We provide more details in Appendix B, including a number of other checks showing that the behavior we see persists with different training algorithms and activations, and that near-perfect reconstruction error can be achieved if $d_l = 2$ since the autoencoder finds the trivial global minimum.$^{11}$

These observations are related to general considerations about the performance of neural networks on interpolation and extrapolation tasks. To see this, consider a training set which is undersampled near a randomly chosen point on the unit circle given by $\phi_u$. Fig. 5 shows the result of training an autoencoder on points sampled from a normal distribution with mean $\phi_u + \pi$ and standard deviation $\pi/3$. The break point now lies in the undersampled region (with $\phi_u$ shown in purple), but all other aspects of the autoencoder behavior are similar to the equidistant training set. In effect, we are asking the autoencoder to perform an interpolation task, on which neural networks typically have excellent performance, but the nontrivial topology of the circle makes this task impossible.
In this 1-dimensional example, points absent from the training set in the neighborhood of $\phi_u$ are indeed reconstructed with large loss, but this is not because these points are anomalous per se, but rather because the topology forces large loss to occur somewhere and the overall loss is minimized by placing the break point in the region where the fewest training points exist. Indeed, we can choose the location of the break point by changing the sampling distribution. We emphasize again that the reconstruction error for an undersampled topologically-trivial curve is not enhanced in the undersampled region; an autoencoder has no trouble learning a distribution on an interval, except near the endpoints where the reconstruction task changes from interpolation to extrapolation. Thus, topology precludes a simple 1-to-1 mapping between autoencoder loss and typicality of data. This behavior persists in higher dimensions, as we discuss further in Sec. 5.1 below.
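The undersampled training set used for this example can be generated as in the following sketch, which draws angles from a normal distribution with mean $\phi_u + \pi$ and standard deviation $\pi/3$ and wraps them onto the circle. The function name and the explicit wrapping convention are choices made here for illustration.

```python
import numpy as np

def undersampled_circle(n, phi_u, seed=0):
    """Points on the unit circle that are sparse near the angle phi_u.

    Returns (phi, xy): wrapped angles and the embedded points (cos, sin).
    """
    rng = np.random.default_rng(seed)
    phi = rng.normal(loc=phi_u + np.pi, scale=np.pi / 3, size=n)
    # Wrap onto the range (-pi, pi] used for the angular coordinate.
    phi = np.pi - np.mod(np.pi - phi, 2 * np.pi)
    return phi, np.column_stack([np.cos(phi), np.sin(phi)])
```

Since $\phi_u$ sits three standard deviations from the mean, only a small fraction of the training points fall near it, which makes it the cheapest place for the network to put the break point.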

Dimension 2
As we begin to investigate higher-dimensional data sets, visualizing both the latent representation of the data and the data manifold itself will become more difficult. Visualization is still manageable in d = 2, but to prepare for higher-dimensional examples, we will introduce a useful tool, the loss-versus-distance plot. This is a scatter plot of the autoencoder loss on points in the test set versus their Euclidean distance from the point of largest loss. The intuition is that manifolds which suffer poor reconstruction error in the neighborhood of a single point will show losses anti-correlated with distance from that break point, as in the case of the circle. Indeed, since the n-sphere S n with a single point excised can be covered with a single chart, all autoencoders trained on spheres should exhibit this behavior, regardless of dimension (we will see this explicitly in Sec. 6). If, on the other hand, the loss appears to be uncorrelated with distance, then the manifold may have more complex topology, requiring tearing along a submanifold (instead of just puncturing) to fit in R n ; we study such examples in App. C.
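As a minimal sketch of how the loss-versus-distance data described above could be assembled, the following helper pairs each test point's loss with its Euclidean distance from the largest-loss point. The function name `loss_versus_distance` and the toy loss values are illustrative assumptions, not part of the analysis code used for the figures.

```python
import math

def loss_versus_distance(points, losses):
    """Pair each test point's loss with its Euclidean distance from the worst point.

    points : sequence of equal-length coordinate tuples (test set in the embedding space)
    losses : per-point reconstruction losses, in the same order as points
    Returns a list of (distance, loss) tuples, ready to be scatter-plotted.
    """
    if len(points) != len(losses) or len(points) == 0:
        raise ValueError("points and losses must be non-empty and of equal length")
    # Index of the largest-loss point, used as the candidate break point.
    worst = max(range(len(losses)), key=lambda k: losses[k])
    anchor = points[worst]
    return [(math.dist(p, anchor), loss) for p, loss in zip(points, losses)]


# Toy usage: points on the unit circle with a synthetic loss peaked at phi = 0.
# These loss values are made up for illustration; they are not network output.
phis = [2.0 * math.pi * k / 12 for k in range(12)]
circle = [(math.cos(phi), math.sin(phi)) for phi in phis]
toy_losses = [math.exp(-4.0 * (1.0 - math.cos(phi))) for phi in phis]
pairs = loss_versus_distance(circle, toy_losses)
```

For a manifold with a single break point, the resulting pairs should show loss decreasing with distance; a flat or scattered pattern would instead hint at tearing along a submanifold.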
Our examples here will illustrate that the issue of intrinsic topology we identified in 1-dimensional data sets persists in d = 2. However, in dimension 2 we can have the qualitatively different situation of undersampling our data distribution along a 1-dimensional submanifold, as opposed to dimension 1 where submanifolds are just isolated points. We will see that, depending on the topology of the data manifold, most of the submanifold may be reconstructed with small loss, despite being absent from the training set.

The 2-sphere and the paraboloid: interpolation and extrapolation
As we did with the unit circle, we consider training an autoencoder on a uniformly-sampled unit sphere S 2 (in this example we use uniform sampling rather than equidistant points for the training set, but this makes no material difference for the examples to follow), defined by x 2 + y 2 + z 2 = 1 in R 3 . Using the same training scheme as with the circle, but now using an autoencoder with d l = 2, we find the results shown in Fig. 6: the loss is localized near a single point on the sphere (as with the circle, this point is randomly chosen by initialization and stochastic dynamics), the autoencoder model punctures the sphere in a region around that point, and the loss in this region is a factor of $\sim 10^3$ larger than at a generic point in the test set. 12
As with the circle, we can undersample the sphere at a point, and just as with the circle, the break point of the model map falls in this undersampled region. However, because the sphere is two-dimensional, we now have the opportunity to undersample along an entire 1-dimensional submanifold, for instance the great circle along the equator. This situation is a closer analogy to our bump hunt example, where rare events tend to lie on submanifolds, rather than at isolated points, of phase space. Since another way to trivialize the topology of the sphere is to excise an entire great circle, yielding the topology of two disks, D 2 ⊕ D 2 , we might expect that this will also be a local minimum of the autoencoder loss. However, after training an autoencoder on a uniform distribution on the sphere but with the region around the equator with |z| < 0.1 excised entirely, the model map (Fig. 7 left and center) typically breaks at a random point along the equator; it has no trouble interpolating the rest of the equator (which was absent from the training set entirely) because there is no topological obstruction to doing so.
The trained network does occasionally yield the output with the D 2 ⊕ D 2 topology; however, over many network realizations, the local minimum with a single break point is much more common, and moreover has lower overall loss. The situation we have described is thus complementary to the 1-dimensional case of the unit circle. The best local minimum for the autoencoder is the one which distorts the data manifold at the fewest number of points; since this can be done by removing a single point on S 2 (i.e. a submanifold of dimension 0), "anomalous" (i.e. undersampled) submanifolds of dimension 1 will be interpolated with low loss except perhaps at an isolated point. Without any additional way of influencing the latent representation, this behavior would seem to preclude using autoencoders to learn this family of distributions on the 2-sphere.
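For concreteness, a training set of the kind described above (uniform on the unit sphere with the equatorial band |z| < 0.1 removed) could be generated as in the following sketch. It relies on the fact that a uniform distribution on S 2 is uniform in z and in the azimuthal angle; the function name, seed handling, and rejection step are illustrative assumptions rather than the procedure actually used.

```python
import math
import random

def sample_sphere_excised(n, z_cut=0.1, seed=0):
    """Draw n points uniformly from the unit 2-sphere with the band |z| < z_cut removed."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        # Uniform in z and in azimuth is uniform in area on the sphere.
        z = rng.uniform(-1.0, 1.0)
        if abs(z) < z_cut:
            continue  # reject points inside the excised equatorial band
        phi = rng.uniform(0.0, 2.0 * math.pi)
        rho = math.sqrt(1.0 - z * z)
        points.append((rho * math.cos(phi), rho * math.sin(phi), z))
    return points


training_set = sample_sphere_excised(200)
```

Setting `z_cut=0.0` recovers plain uniform sampling on the sphere, which makes it easy to compare the excised and unexcised cases with otherwise identical settings.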
To demonstrate that this interpolation is a generic feature of autoencoders, we train a network with the same architecture and hyperparameters as for the sphere example on a topologically trivial surface, the paraboloid z = x 2 + y 2 with the region z < 0.2 excised. The training set is sampled uniformly in x and y up to x 2 + y 2 = 4. Fig. 8 shows the losses on a test set sampled from the full paraboloid with 0 ≤ z ≤ 4. The center region is interpolated with much smaller loss than the largest-loss points, which are localized on the boundary. Indeed, the finite extent of the training set implies the topology of a manifold with boundary, and reconstructing the boundary accurately is an extrapolation task, which is generally more difficult for neural networks than the interpolation task of filling in the center. Despite this, the worst loss is more than two orders of magnitude smaller than the worst loss for the excised sphere example, because (neglecting the boundary at z = 4) the paraboloid has the same topology as R 2 .

The double cone: extrinsic geometry and non-uniform sampling
Any compact, connected, orientable 2-manifold without boundary or handles is topologically equivalent to the 2-sphere S 2 . However, the embedding in R 3 can introduce an extrinsic geometry different from that of the round metric on the sphere. For example, a double cone has two distinguished points (the tips of the cones) where the embedding is not differentiable and the extrinsic curvature diverges. As we will see in Sec. 7 below, this is a decent low-dimensional cartoon of the geometry of massless phase space, where the corners of the Dalitz plot represent the non-differentiable embedding of M n=3 in R 12 at points where the energy of a massless particle goes to zero. In anticipation of that analogy, we will also consider sampling the double cone uniformly in height (analogous to sampling uniformly in the Dalitz triangle), which effectively oversamples near the tips. Fig. 9 (left) shows an example of a training set drawn from this distribution, for a right circular cone of height h = 2 and equatorial radius r = 1. As expected, the density of points near the tips is greater than at the equator.
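The uniform-in-height sampling can be sketched as follows. We assume, for illustration only, that the double cone is aligned with the z-axis, with tips at z = ±h/2 and the equator at z = 0; the function name and this choice of coordinates are not taken from the analysis code.

```python
import math
import random

def sample_double_cone(n, height=2.0, radius=1.0, seed=0):
    """Sample n points on a double cone, uniformly in height and azimuth.

    Assumed geometry: axis along z, tips at z = +/- height/2, equator at z = 0.
    """
    rng = random.Random(seed)
    half = height / 2.0
    points = []
    for _ in range(n):
        z = rng.uniform(-half, half)           # uniform in height
        rho = radius * (1.0 - abs(z) / half)   # radius shrinks linearly to zero at each tip
        phi = rng.uniform(0.0, 2.0 * math.pi)
        points.append((rho * math.cos(phi), rho * math.sin(phi), z))
    return points


cone_points = sample_double_cone(500)
```

Because the area element on the cone is proportional to the local radius, sampling uniformly in height and azimuth gives a density per unit area that scales as one over the radius, which is the oversampling near the tips noted above.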
After training an autoencoder with d l = 2 on the double cone sampled uniformly in height, Fig. 9 (center) shows the output of the model on a test set drawn from the same uniform-height distribution. Since the double cone has the topology of S 2 , there must be a break point, and as with the example of S 2 with an excised equator, the break point is located in the "bulk" of the cone since the average loss is minimized by placing the break point in the undersampled region. However, the large extrinsic curvature at the tips is an obstruction to reconstructing them well by a smooth function, as can be seen visually from the plot of the model output. In Fig. 9 (right) we plot the loss on the test set as a function of the true height of the test set point. The global maximum is at the break point, but there are also local maxima at the tips. The same result is obtained when the equator of the double cone is excised entirely from the training set: the break point now lies on the equator, but the remainder of the equator is interpolated with low loss, and the local maxima at the tips persist. As we will show, the equator in this toy example is analogous to the signal submanifold of fixed 2-particle invariant mass in 3-particle phase space, and our results will be more or less equivalent to Fig. 9 (right).
Figure 10. Loss-versus-distance plots for autoencoders trained on n-dimensional spheres. For n = 4, 5, the autoencoder finds the latent map which approximates stereographic projection, with only a single break point. The same is true for SU(2), which is diffeomorphic to the 3-sphere embedded in R 8 .

Higher-dimensional spheres
Just like the case of S 2 , the n-dimensional sphere S n can be mapped into R n everywhere except a single point, in a higher-dimensional analogue of stereographic projection. We can see this explicitly in Fig. 10, where we have trained the deep 7-layer network with d l = n on uniformly-sampled training points from the standard embeddings of round spheres, S 4 ⊂ R 5 and S 5 ⊂ R 6 . We also consider an example of a sphere embedded in higher-dimensional space. The group SU(2), the set of complex 2 × 2 matrices U satisfying U † U = I 2×2 and det U = 1, can be parameterized by a triplet of Euler angles (α, β, γ) and is diffeomorphic to S 3 . An element of SU(2) can be mapped into a vector of 8 real numbers, the real and imaginary parts of the matrix entries, and thus embedded in R 8 . As shown in Fig. 10, the SU(2) autoencoder with d in = 8 and d l = 3 shows almost identical behavior to the spheres in other dimensions. These examples confirm that the behavior we have been finding (in particular the utility of the loss-versus-distance plot to visualize the effect of data topology on the autoencoder reconstruction) persists to higher dimensions. Note that the magnitude of the loss at the break point compared to a generic point on the training manifold, about 5 orders of magnitude, is also robust with respect to dimension with the other network hyperparameters fixed.
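To make the embedding of SU(2) in R 8 concrete, the sketch below builds a matrix from Euler angles and flattens it into 8 real numbers. The particular Euler-angle convention, the ordering of the 8 components, and the function names are assumptions made for illustration; the text does not specify which convention was used.

```python
import cmath
import math

def su2_from_euler(alpha, beta, gamma):
    """Return a 2x2 SU(2) matrix (as nested tuples) for one common Euler-angle convention."""
    a = cmath.exp(-0.5j * (alpha + gamma)) * math.cos(beta / 2.0)
    b = cmath.exp(0.5j * (alpha - gamma)) * math.sin(beta / 2.0)
    # Any SU(2) element has the form [[a, -conj(b)], [b, conj(a)]] with |a|^2 + |b|^2 = 1.
    return ((a, -b.conjugate()), (b, a.conjugate()))

def flatten_to_r8(U):
    """Flatten a 2x2 complex matrix into (Re, Im) pairs, entry by entry, giving 8 reals."""
    return [part for row in U for entry in row for part in (entry.real, entry.imag)]


U = su2_from_euler(0.3, 1.1, -0.7)
v = flatten_to_r8(U)
det_U = U[0][0] * U[1][1] - U[0][1] * U[1][0]
```

In this parameterization every flattened vector has squared norm 2, so the image is a 3-sphere of radius √2 lying in a 4-dimensional linear subspace of R 8 .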

3-body phase space
Armed with the intuition from the previous lower-dimensional examples, we return to 3-particle phase space. As discussed in Sec. 3, this 5-dimensional manifold $M_{n=3}$ has a natural embedding in $\mathbb{R}^{12}$; here, we will show that it has the topology of the 5-sphere $S^5$. Intuitively, the mass-shell conditions and the conservation of spatial momenta are topologically trivial, as they can be formulated by saying one variable is a single-valued function of the others. Only the conservation of energy creates topology, and the level sets of the energy function turn out to be spheres. More precisely, suppose the particles have masses $m_i$, energies $E_i$, and spatial momenta $\vec{p}_i$ ($i = 1, 2, 3$). Then the mass-shell conditions are $E_i = \sqrt{|\vec{p}_i|^2 + m_i^2}$ for $i = 1, 2, 3$. Since each $E_i$ is determined algebraically by $\vec{p}_i$, dropping the $E_i$ coordinates preserves the topology. Let the total initial-state 4-momentum $P = p_1 + p_2 + p_3$ have 4-vector components $P = (E_0, \vec{P}_0)$. Since each $E_i$ is a convex function of $\vec{p}_i$, the inequality $E_1 + E_2 + E_3 \le E_0$ defines a convex origin-symmetric ball in $\mathbb{R}^9$. Conservation of energy says that phase space lies on the boundary of that ball, where $E_1 + E_2 + E_3 = E_0$. Conservation of momentum slices that ball by the affine subspace $\vec{p}_1 + \vec{p}_2 + \vec{p}_3 = \vec{P}_0$, forming a 6-dimensional ball whose boundary, a sphere, is precisely 3-particle phase space. 13 Note also that this argument generalizes straightforwardly to n-particle phase space, which has the topology of $S^{3n-4}$.
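The dimension counting behind this argument can be summarized for general n as follows. This is only a restatement of the reasoning above; the symbol $B$ for the sliced ball is introduced here for convenience, and we assume $E_0$ is large enough that $B$ has nonempty interior.

```latex
\begin{align}
  B &= \Big\{ (\vec{p}_1, \dots, \vec{p}_n) \in \mathbb{R}^{3n} \;:\;
        \sum_{i=1}^{n} \sqrt{|\vec{p}_i|^2 + m_i^2} \le E_0 , \;\;
        \sum_{i=1}^{n} \vec{p}_i = \vec{P}_0 \Big\}, \\
  \dim B &= 3n - 3, \qquad \partial B \cong S^{3n-4}, \\
  n = 3 :&\quad \dim B = 6, \qquad \partial B \cong S^{5}.
\end{align}
```

The three components of momentum conservation each remove one dimension from $\mathbb{R}^{3n}$, and energy conservation selects the boundary of the resulting convex body, which removes one more.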
Visualizing the geometry and topology of high-dimensional manifolds can be difficult, but the Dalitz plot introduced in Sec. 3 provides a convenient starting point. A point within the Dalitz triangle fixes the energies of the final-state particles. Momentum conservation implies the final-state particles are coplanar in the COM frame, and thus their orientations are determined by three Euler angles (i.e. an element of SO(3)) which fix the unit normal vector to the event plane and the orientation within the event plane. Locally, then, the geometry of 3-body phase space is R 2 × SO(3). At the boundaries of the triangle, a pair of particles becomes collinear and defines an event vector rather than an event plane, which introduces a redundancy because many elements of SO(3) correspond to the same point on the S 2 which orients the event vector. Furthermore, at the vertices of the triangle, a particle becomes soft (i.e. its energy goes to zero). The properties of the boundaries and the corners are particularly important for relating the underlying topology to the extrinsic geometry. At the boundaries, uniform sampling in the Dalitz plane leads to effective oversampling with respect to the round metric on S 5 because of the redundancy of SO(3) rotations when two vectors are collinear, much as in Sec. 5.2 where the double cone sampled uniformly in height oversampled the tips compared to the uniform sampling of the 2-sphere. Furthermore, the embedding of phase space in R 12 is non-differentiable at the corners where E i → 0, leading to a singularity in the extrinsic curvature, a higher-dimensional analogue of the tips of the double cone.
Based on the results of our low-dimensional examples, these topological features should be apparent when uniformly-sampled phase space is used to train an autoencoder with latent dimension 5. In any realistic physics application, the distribution of events will also be weighted by the matrix element for the relevant process, which could have almost arbitrary dependence on the Dalitz plane variables in a model-independent search of the kind autoencoders are useful for. As we have seen in Secs. 4 and 5.1 with the undersampled S 1 and S 2 , the sampling distribution can interplay with the data topology in interesting ways. For the example which follows, we will take a constant matrix element, leaving an exploration of the effects of some common forms of matrix elements for future work.
Footnote 13. Recent work [61] proposes a convenient spinor interpretation of phase space as a double quotient of the unitary group $(U(n)/U(n-2))/U(1)^n$. Ref. [61] further decomposes $U(n)/U(n-2)$ as a twisted product of $S^{2n-1}$ and $S^{2n-3}$, and gives a measure which is defined simply in terms of each factor. Note that global topology is still spherical; as in the Hopf fibration, the twisted product of spheres is another sphere.
Here, we sample events uniformly from massless 3-particle phase space (3.1) -as opposed to our sideband distribution in Sec. 3 -and train with a 7-layer autoencoder with d l = 5. The loss-versus-distance plot is shown in Fig. 11 (left); as expected, the largest loss is localized near a point, reflecting the topology of S 5 (compare with Fig. 10, center). The embedding in R 12 does not change the topology, so just as SU(2) ⊂ R 8 had the same loss-versus-distance plot as the standard embedding of the n-sphere in R n+1 (Fig. 10, right), M n=3 ⊂ R 12 exhibits the same topological features as S 5 ⊂ R 6 .
To visualize the autoencoder reconstruction, we plot the output of the model on the Dalitz plane in Fig. 11 (right), exactly analogous to our bump hunt example in Fig. 2 of Sec. 3. The corners of the triangle, where the extrinsic curvature is singular, are not reproduced well, and there is a local maximum of the loss at each corner. The behavior is a straightforward higher-dimensional analogue of the double cone of Sec. 5.2. However, the 10 worst points (and hence the global maximum of the loss) are located near a generic point inside the Dalitz triangle, as shown in Fig. 11 (bottom). 14 In contrast to the sideband test set of Fig. 2, or the double cone with the equator excised, there is no undersampled region where the break point is preferred. The points near the break point are mapped far away in the Dalitz plane, which is consistent with their large loss under the Euclidean metric. We emphasize once again that none of these features have anything to do with anomalies, because we have uniformly sampled phase space according to the Lorentz-invariant measure, so any point is as "typical" as any other. While it is true that the autoencoder task is only to minimize the Euclidean distance between the model and the data point-by-point in phase space, the spurious features in the predicted distribution point to correlations which will be imprinted on the loss distribution, which is the desired diagnostic for anomaly detection.
From a topological perspective, the failure of the bump hunt described in Sec. 3 is now straightforward to understand. With 3-body phase space having the local geometry of R 2 × SO(3), the submanifold with s Y Y equal to a certain value in the interior of the Dalitz triangle (i.e. the signal) is much like the equator of the sphere in Sec. 5.1 or the equator of the double cone in Sec. 5.2, and interpolating through this region is topologically trivial. The S 5 topology means that a break point must exist, where the latent representation rips the data manifold and test set points near the break point are mapped far away. If the autoencoder is trained on a distribution with an undersampled region, the break point will typically be placed nearby (Fig. 2), but the rest of the submanifold of fixed s Y Y will be reconstructed with low loss like any other generic point in phase space, as we saw in Fig. 3. Said another way, the autoencoder will detect a point in phase space as anomalous (representing a particular orientation of the final-state 3-vectors), but this is only a set of measure zero on the submanifold of desired anomalous events. The situation is even worse if the training set is uniform in phase space, because there is no guarantee that the break point will even lie on the signal submanifold; as seen in Fig. 12, the loss tail from pure signal events is smaller than the loss tail from the background events because the network can achieve near-perfect reconstruction on the 4-dimensional signal submanifold with d l = 5. It is clear that if trained on a smooth background distribution (no sidebands), the autoencoder cannot detect anomalies in a test set with both signal and background, since even the 100% signal sample is indistinguishable from the background. Finally, the effective oversampling of the boundaries of the triangle with respect to the standard round metric on S 5 is analogous to the double-cone example of Sec. 5.2: even though the average loss is minimized by placing the break point in the interior of the Dalitz triangle, the model will struggle to reconstruct the corners where the extrinsic curvature is large, introducing additional distortions in the loss distribution.
Footnote 14. Note that since the Dalitz plot is a 2-dimensional projection of 5-dimensional phase space, points that are somewhat distant in the Dalitz plot can still be "close" in the SO(3) coordinates over each point; the loss-versus-distance plot of Fig. 11 (left) makes clear that the 10 worst points are indeed close in M n=3 .
Figure 11. Left: Loss-versus-distance plot for massless 3-particle phase space with d l = 5. The test set is uniformly sampled using the Lorentz-invariant measure (3.1) from an initial state with unit energy. The resulting loss shows the single break point characteristic of S 5 (Fig. 10, center). Right: Dalitz plot showing the distribution of the model prediction (red) on a uniformly-sampled test set (sampled from the interior of the outlined black triangle). Bottom: The 10 worst points from the uniform test set and their model predictions. The largest-loss points are localized near a generic point in the interior of the Dalitz triangle, while at the corners the loss is lower even as the reconstruction is poor, in close analogy to the double-cone example of Sec. 5.2.
Figure 12. Normalized loss distributions for the uniform test set (Fig. 11) and signal test set with s Y Y = 0.25 from Fig. 3. The loss tail for the signal events is smaller than for the background events, the opposite of the desired behavior.
This strongly suggests that caution is warranted when using an autoencoder as an anomaly detector for real physics events, where the nontrivial matrix element will induce a non-uniform distribution in the Dalitz plane. Of course, if d l ≥ d in , the network will learn the identity map, which will not detect any anomalies, so we focus on the case d < d l < d in .

Changing the latent dimension
• A circle may be embedded in R 3 as a knot, with nontrivial extrinsic topology; even for d l = 2, where perfect reconstruction is theoretically possible (as the circle does embed in the plane), the training process gets stuck at a local minimum with self-intersections in the latent space. However, the performance can be substantially improved by modifying the loss function to force the network to learn a latent representation without self-intersections.
• The torus T 2 may be embedded in R 3 as the standard "donut" embedding, or in R 4 as a direct product of two circles, the Clifford torus. For d in = 4, the global loss minimum for d l = 3 is the donut embedding. After training an ensemble of networks, the latent representations fall into two qualitative categories: infrequently, the network finds the global loss minimum, but more often, the latent map pinches one of the circles in two locations along the torus, yielding poor reconstruction. This demonstrates that even though an embedding may be topologically possible, a randomly-initialized autoencoder is not guaranteed to find it, raising concerns about the robustness of autoencoder performance on data manifolds with nontrivial topology.
• The group manifold SO(3) is locally isomorphic to SU(2), but has the topology of the real projective space RP 3 , obtained by identifying antipodal points on SU(2) ≅ S 3 . With d in = 9 (i.e. flattening the 3 × 3 SO(3) matrix into a 9-component real vector), this global topological obstruction prevents good reconstruction even up to d l = 5, which is the smallest dimension n for which SO(3) may be embedded in R n .
While some of these examples may represent more complicated topology than a generic data set in the wild (or even a practically-relevant data set in high-energy physics), they are important illustrations of the fact that simply increasing the size of the latent space does not guarantee that an autoencoder trained on data sampled from a topologically-nontrivial manifold can achieve low uniform reconstruction error.

Conclusions
The importance of understanding the topology of input data has been recognized since the 1960s and was a pressing issue for Rosenblatt, the inventor of the perceptron (see [64]). At the time, the question was not about a latent representation, but about the extrinsic topology of the input, e.g. whether a circle was inside or outside a square in an input image or whether two components of an image are connected or not. Early neural network models struggled to identify these very global features of input images, as was elucidated by Minsky and Papert in their famous critique of the perceptron model [65].
In this paper we have attempted to further understand this intertwining of data topology and neural network performance via an extensive study of a rich variety of low-dimensional input data sets exhibiting both nontrivial intrinsic and extrinsic topology. In particular, we have identified several situations where the global topological features of the data set pose an obstruction to faithfully compressing the data even when the latent space dimension is equal to the intrinsic dimension of the data, which is a local feature. As an application, we have shown that in the canonical example of anomaly-finding in high-energy physics, a "bump hunt," a neural network autoencoder trained on data drawn from n-particle phase space (with n fixed) inevitably results in order-1 reconstruction error for generic points in the training set, and moreover fails to flag as anomalies events with invariant mass values that are entirely absent from the training set.
Since the issues of large reconstruction error are entirely due to the topologically-impossible task of trying to cover a whole manifold with a single chart, they could in principle be ameliorated by training multiple networks -the latent representations of which would represent independent charts -and using the regions of faithful reconstruction in pairwise overlaps of charts to construct transition functions. Indeed, an ensemble of networks has already been used in the context of weakly-supervised learning for collider physics to mitigate the trials factor or "look-elsewhere" effect [66]. If the large-loss points are uniformly distributed across the data manifold, a simple (but computationally-expensive) way to do this would be to independently train a large number of randomly-initialized networks and take the median of the outputs; specifically, for each test set point, sort the list of autoencoder losses from each network and define the output of the ensemble to be the autoencoder output corresponding to the median loss in this list. A more sophisticated solution would correlate the network parameters to construct transition functions directly, along the lines of [58]. Such a strategy of multiple network realizations would also be helpful in cases like the 2-sphere with the excised equator, where the trained ensemble doesn't concentrate on a single minimum. Alternatively, one could let a single network find the transition functions itself using an architecture consisting of parallel sub-networks for each chart, perhaps with suitable encouragement by modifying the loss function. 15 However, for submanifold-type anomalies -such as the bump hunt in phase space -each of the networks or sub-networks will be able to smoothly interpolate through the anomalous submanifold, and no anomalies would be detected.
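The median-of-ensemble rule described above can be sketched for a single test point as follows. The function name, the use of the upper median for an even number of networks, and the example values are illustrative assumptions; the per-network outputs and losses are taken as given.

```python
def median_loss_output(outputs, losses):
    """Select the ensemble output for one test point by median loss.

    outputs : list of reconstructions of the same test point, one per network
    losses  : list of the corresponding reconstruction losses, in the same order
    Returns (output, loss) for the network whose loss is the median of the list.
    For an even number of networks this picks the upper median (an assumption).
    """
    if len(outputs) != len(losses) or len(losses) == 0:
        raise ValueError("outputs and losses must be non-empty and of equal length")
    order = sorted(range(len(losses)), key=lambda k: losses[k])
    k_median = order[len(order) // 2]
    return outputs[k_median], losses[k_median]


# Hypothetical reconstructions of one point on the unit circle from three networks;
# the second network has its break point nearby and reconstructs poorly.
example_outputs = [(0.99, 0.02), (-0.40, 0.70), (1.00, 0.01)]
example_losses = [0.5, 9.0, 0.1]
chosen_output, chosen_loss = median_loss_output(example_outputs, example_losses)
```

As long as fewer than half of the networks place their break point near a given test point, the median-loss output comes from a network that reconstructs that point faithfully.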
As our examples with the undersampled circle and phase space have shown, some knowledge of the data topology can be very useful in interpreting the output of an autoencoder. Specifically, knowing that there must be points with large loss could motivate a network architecture which correlates that loss with the data distribution. It would be particularly interesting to investigate how the topology of phase space is imprinted on jet substructure observables where the number of final-state particles is not fixed, and additional soft and collinear radiation "dresses" the parton-level phase space with events near the boundaries of the higher-dimensional simplex defining the analogue of the Dalitz plot for hadron-level phase space. Conversely, the autoencoder itself can be a useful diagnostic of the data topology, by examining whether the points with large loss are correlated in distance.
Far from being some esoteric feature, we might expect that some nontrivial topology is generic for data sets containing features with any degree of rotational symmetry, which applies to a number of examples outside of physics such as 3-dimensional objects viewed from different perspectives. Indeed, Refs. [67,68] have used the nontrivial topology of the sphere (in particular, distortions resulting from planar projections) to motivate SO(3)-equivariant networks to perform machine learning tasks on datasets which live on spheres, which may have applications for learning observables which are functions on phase space. There has been some very interesting recent work on estimating the intrinsic dimensionality of generic data sets [69,70], but these techniques rely on various proxies for the data dimension after the data has already passed through layers of the neural network, which as we have argued is a good probe of local dimension but necessarily misses the global topological features. Depending on how that local dimension is being used in the downstream machine learning task, it might be necessary to adapt such methods further to account for cases of nontrivial topology. Given the considerable recent work which has focused on incorporating nontrivial priors about the data set, including symmetry properties, into the network architecture, we hope this work has motivated including data topology into that set of priors.

A Hyperparameters
Our autoencoder neural networks were fully-connected nets constructed with PyTorch, using default initializations and trained with stochastic gradient descent (SGD) for 20,000 epochs.
In the examples described in the main text, we used tanh activations for all layers except the output of the encoder and the output of the decoder; other activation functions and training algorithms for the S 1 autoencoder are discussed in App. B below. The hyperparameters for each of the examples are shown in Tab. 1.
We increased the number of sample points as the dimension increased; for the highest-dimensional examples we also increased the batch size by about an order of magnitude and increased the learning rate accordingly [71].

B Further investigation of the S 1 autoencoder
In Sec. 4, we argued that the appearance of a break point in an autoencoder with latent dimension 1 is an unavoidable feature of a data set with the topology of S 1 . In principle, one could imagine that with sufficient training, the break point would be placed in between finitely-spaced training points, such that the network could achieve perfect reconstruction on the training set. In practice, we find that this is not true, and the appearance of a finite-sized break region encompassing multiple training points seems generic and robust with respect to changes in the network hyperparameters and architecture. In this Appendix we will justify these statements and perform a simple analytic study of the network dynamics to relate the topological requirement of a break region to the network parameters which determine it. While nothing in this Appendix has anything to do with physics per se, we find the richness of this simple example worthy of serious investigation in future study as it touches on notions of spontaneous symmetry breaking, topology of finite data sets, and the neural network loss landscape.

B.1 Changing hyperparameters
We first note that the break region persists under changes in the width or depth of the network. Fig. 13 shows the S 1 autoencoder with three different architectures: a 5-layer network with d w = 32 (left) and d w = 128 (center), and a 7-layer network with d w = 64 (right). In addition, we experimented with changing the activation function: Fig. 14 shows results for our default architecture with a ReLU activation function (left), as well as modified tanh activation functions (1/β) tanh(βx) with varying "temperature" β (center and right). The ReLU and β > 1 activation functions seem to result in a somewhat smaller gap, but one which is still easily visible and encompasses multiple data points from the training set of 1000 equally-spaced points. We also tried a different training algorithm: Fig. 15 shows the results for the Adam [72] optimizer compared to SGD. As expected, Adam converges to the gradient descent minimum faster than SGD. This results in a smaller gap for the same amount of training, though once again the finite gap remains even after 20,000 epochs. Finally, we verified that the topological obstruction was indeed arising from a latent dimension d l = 1 rather than any issues with inadequate network capacity. After training the default S 1 autoencoder with d l = 2 on 1000 equally-spaced points on the unit circle, Fig. 16 shows the output of the autoencoder on a test set of 3000 points uniform on the whole square −1 ≤ x, y ≤ 1. The loss is smallest on the training set (green), as expected, but the fact that the loss is of the same order everywhere else in the square except at the corners strongly suggests that the network is learning the trivial map on R 2 for x 2 + y 2 ≤ 1.
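As a small illustration of the modified activation, the sketch below implements (1/β) tanh(βx) as a scalar function; the name `tempered_tanh` is ours. In a PyTorch model one would presumably wrap the same expression in a module, but that wiring is not shown here.

```python
import math

def tempered_tanh(x, beta=1.0):
    """Modified tanh activation (1/beta) * tanh(beta * x) with 'temperature' beta > 0."""
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    return math.tanh(beta * x) / beta


# beta = 1 reproduces the ordinary tanh; larger beta saturates sooner, at +/- 1/beta.
ordinary = tempered_tanh(0.3, beta=1.0)
saturated = tempered_tanh(100.0, beta=5.0)
near_origin = tempered_tanh(1e-6, beta=5.0)
```

The slope at the origin is 1 for every β, so changing β changes only how quickly the activation saturates and the level (±1/β) at which it does so.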

B.2 Sparse circle
The S 1 autoencoder exhibits interesting behavior when the size of the training set is very small, as shown in Fig. 17. For the same hyperparameters as given in Tab. 1 but with 100,000 epochs of training, a training size of N train = 12 allows the network to memorize the training set but at the cost of a rather poor reconstruction on a larger test set of size N test = 1000 uniformly sampled from the circle. This is obviously an avatar of overfitting. Next, increasing to N train = 20 gives the familiar behavior with a break region as described above, with the network unable to memorize the full training set because the spacing between training points is smaller than the break region. As anticipated above, by changing the activation to (1/β) tanh(βx) with β = 5, the size of the break region can be reduced, allowing the network to perfectly reconstruct the training set while still maintaining accurate reconstruction of the larger test set by placing the break region between training points. However, for N train ≳ 100, even the β = 5 activation function cannot memorize the training set after 100,000 epochs.
Figure 15. Output of the S 1 autoencoder with identical hyperparameters but different training algorithms: SGD (left) (same as Fig. 4), and Adam [72] (right).
Figure 16. The S 1 autoencoder with d l = 2 learns the identity map on all of R 2 . Losses are concentrated outside the circle x 2 + y 2 = 1 which defined the training set, because extrapolation is required in that region.

B.3 Dynamics of training in the $S^1$ encoder
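To gain some analytic understanding of the behavior of the circle autoencoder with $d_l = 1$, we examine the structure of the autoencoder network explicitly. The encoder map $f_{\rm enc}(\mathbf{x})$ with $\mathbf{x} = (x, y)$ is a map $\mathbb{R}^2 \to \mathbb{R}^1$, while the decoder map $f_{\rm dec}(q)$ is a map $\mathbb{R}^1 \to \mathbb{R}^2$. Restricting the input data to the unit circle, the encoder can be thought of as a map $f_{\rm enc}(\phi)$ from $S^1$ to $\mathbb{R}^1$, with $x = \cos\phi$ and $y = \sin\phi$. The model map is $f(\mathbf{x}) = f_{\rm dec}(f_{\rm enc}(\mathbf{x}))$, and the loss function is the mean squared error,

$$ L = \frac{1}{N}\sum_{j=1}^{N} \left\| f(\mathbf{x}_j) - \mathbf{x}_j \right\|^2, \tag{B.1} $$

where $N$ is the number of training points, $\mathbf{x}_j$ and $f(\mathbf{x}_j)$ are points in $\mathbb{R}^2$, and the norm is the usual Euclidean norm. Explicitly, the encoder map with a single hidden layer of width $d_w$ is

$$ f_{\rm enc}(\phi) = \sum_{i=1}^{d_w} W^{(2)}_i\, \sigma\!\left(W_{ix}\cos\phi + W_{iy}\sin\phi + b_i\right) + b_2, \tag{B.2} $$

where $\sigma$ is the activation function on the hidden layer output, and $\theta_\alpha = \{W_{ix}, W_{iy}, W^{(2)}_i, b_i, b_2\}$ are the encoder parameters: $W_{ix}$ and $W_{iy}$ are the weights going to the hidden layer, $W^{(2)}_i$ are the weights going to the output of the encoder, and $b_i$ and $b_2$ are the biases. It will be convenient to define

$$ z_i(\phi) \equiv W_{ix}\cos\phi + W_{iy}\sin\phi + b_i \tag{B.3} $$

as the pre-activation at neuron $i$ of the first hidden layer. A necessary condition for the network to be at a loss minimum after training is that the gradient of the loss with respect to the encoder parameters $\theta_\alpha$ vanishes:

$$ \nabla_\alpha L = \frac{2}{N}\sum_{j=1}^{N} \left( f(\mathbf{x}_j) - \mathbf{x}_j \right) \cdot \left.\frac{\partial f_{\rm dec}}{\partial q}\right|_{q = f_{\rm enc}(\mathbf{x}_j)} \nabla_\alpha f_{\rm enc}(\mathbf{x}_j) = 0. \tag{B.4} $$

The point of this expression is to note that at the loss minimum, one of three things must be true (absent accidental orthogonalities), data point by data point: either the reconstruction is perfect ($f(\mathbf{x}_j) = \mathbf{x}_j$), or the derivative of the decoder vanishes, or the gradient of the encoder vanishes.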
Because of the topological issues previously noted in Sec. 4, it is impossible for the network to satisfy the first condition near the break point. The second condition, the vanishing of the decoder derivative, would imply that the decoder is independent of the latent representation to first order, which would mean the network is not actually learning anything from the latent representation and nearby points in the latent space get mapped to the same point in the model.$^{16}$ We therefore expect that near the break point $\phi_0$, the third condition holds, $\nabla_\alpha f_{\rm enc}(\phi_0) = 0$. To the extent that this is true, the appearance and position of the break point is entirely driven by the encoder, which greatly simplifies the analysis since there is only a single hidden layer. We will use the explicit expression (B.2) to relate $\partial f_{\rm enc}/\partial\phi$ to $\nabla_\alpha f_{\rm enc}(\phi)$.
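Treating the encoder as a function of the input variable $\phi$, we have

$$ \frac{\partial f_{\rm enc}}{\partial\phi} = \sum_{i=1}^{d_w} W^{(2)}_i\, \sigma'(z_i) \left( -W_{ix}\sin\phi + W_{iy}\cos\phi \right), \tag{B.5} $$

where $\sigma'$ is the first derivative of the activation function. Similarly, treating the encoder as a function of $\theta_\alpha$, the derivatives of $f_{\rm enc}$ with respect to the network parameters are

$$ \frac{\partial f_{\rm enc}}{\partial W^{(2)}_i} = \sigma(z_i), \qquad \frac{\partial f_{\rm enc}}{\partial b_2} = 1, $$

$$ \frac{\partial f_{\rm enc}}{\partial W_{ix}} = W^{(2)}_i\, \sigma'(z_i)\cos\phi, \qquad \frac{\partial f_{\rm enc}}{\partial W_{iy}} = W^{(2)}_i\, \sigma'(z_i)\sin\phi, \qquad \frac{\partial f_{\rm enc}}{\partial b_i} = W^{(2)}_i\, \sigma'(z_i). $$

Note that because $\partial f_{\rm enc}/\partial b_2$ never vanishes, there are no true critical points for $f_{\rm enc}$. However, since there are $4d_w$ additional parameters, if $d_w \gg 1$ then the gradient will be dominated by the other parameters, so we can attempt to minimize those derivatives to find a quasi-minimum.

$^{16}$ Of course, a good decoder will map nearby latent points to nearby points in the model, but this implies a nonzero (if small) derivative.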
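The chain-rule expression for $\partial f_{\rm enc}/\partial\phi$ can be checked numerically. The sketch below is a minimal, assumption-laden illustration rather than the code used for the results above: it implements a single-hidden-layer encoder with $\sigma = \tanh$, draws all parameters uniformly from $\pm 1/\sqrt{d_w}$ as described in the text, and compares the analytic derivative against a central finite difference. All function names, and the helper `random_encoder`, are hypothetical.

```python
import math
import random

def pre_activations(phi, Wx, Wy, b):
    """z_i(phi) = W_ix cos(phi) + W_iy sin(phi) + b_i for every hidden neuron i."""
    c, s = math.cos(phi), math.sin(phi)
    return [wx * c + wy * s + bi for wx, wy, bi in zip(Wx, Wy, b)]

def f_enc(phi, Wx, Wy, b, W2, b2):
    """Single-hidden-layer encoder S^1 -> R with tanh activation."""
    z = pre_activations(phi, Wx, Wy, b)
    return sum(w2 * math.tanh(zi) for w2, zi in zip(W2, z)) + b2

def dfenc_dphi(phi, Wx, Wy, b, W2):
    """Analytic derivative: sum_i W2_i * sigma'(z_i) * dz_i/dphi."""
    c, s = math.cos(phi), math.sin(phi)
    z = pre_activations(phi, Wx, Wy, b)
    return sum(w2 * (1.0 - math.tanh(zi) ** 2) * (-wx * s + wy * c)
               for w2, zi, wx, wy in zip(W2, z, Wx, Wy))

def random_encoder(d_w=64, seed=0):
    """Hypothetical helper: parameters uniform in [-1/sqrt(d_w), 1/sqrt(d_w)]."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_w)

    def draw():
        return [rng.uniform(-scale, scale) for _ in range(d_w)]

    Wx, Wy, b, W2 = draw(), draw(), draw(), draw()
    b2 = rng.uniform(-scale, scale)
    return Wx, Wy, b, W2, b2

# Compare the analytic derivative with a central finite difference.
Wx, Wy, b, W2, b2 = random_encoder()
phi, h = 0.7, 1e-5
numeric = (f_enc(phi + h, Wx, Wy, b, W2, b2)
           - f_enc(phi - h, Wx, Wy, b, W2, b2)) / (2 * h)
print(abs(numeric - dfenc_dphi(phi, Wx, Wy, b, W2)) < 1e-6)
```

Because $f_{\rm enc}$ depends on $\phi$ only through $\cos\phi$ and $\sin\phi$, the same functions can be used to scan $\partial f_{\rm enc}/\partial\phi$ over $[0, 2\pi)$ and locate where its magnitude peaks.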
To make further progress, let's suppose the activation function $\sigma(z)$ vanishes only at $z = 0$ and furthermore that $\sigma'(0) \neq 0$, which is true for example for the tanh, and for popular smooth approximations of the ReLU (though not ReLU itself) such as the GELU [73] and SWISH [74] activations. The derivatives with respect to the second-layer weights, $\partial f_{\rm enc}/\partial W^{(2)}_i$, can only vanish if $z_i = 0$, but since $\sigma'(0) \neq 0$ by assumption, the remaining derivatives with respect to the first-layer weights and biases can only vanish if $W^{(2)}_i = 0$. Therefore, the global quasi-minimum is for all of the second-layer weights to vanish, which is a trivial model with poor reconstruction error, since from Eq. (B.5), $f_{\rm enc}(\phi)$ is then independent of $\phi$. Empirically, what the network tries to do instead is minimize all but a few of the $W^{(2)}_i$; the ones that remain nonzero stop evolving when their corresponding pre-activations vanish. Indeed, let $i_* = \operatorname{argmax}_i |W^{(2)}_i|$, and $\phi_0$ be a solution to $z_{i_*}(\phi) = 0$:

$$ W_{i_* x}\cos\phi_0 + W_{i_* y}\sin\phi_0 + b_{i_*} = 0. \tag{B.11} $$

Then by Eq. (B.5), $|\partial f_{\rm enc}/\partial\phi|$ is large at $\phi = \phi_0$ since it is dominated by $|W^{(2)}_{i_*}|$ and $\sigma'(z_{i_*}) \neq 0$. At most input values $\phi$, the derivative of the encoder is small and nearly constant, allowing it to approximate a linear map where the encoder learns the $\phi$ parametrization. Since $f_{\rm enc}$ is continuous, however, there is a short interval in $\phi$ where $f_{\rm enc}$ must retrace the path traversed by the rest of the input domain, incurring a large derivative; the "break point" $\phi_0$ is near the center of that interval (see Fig. 4). Note further that this quasi-minimum has a flat direction at the break point $\phi_0$, which follows from Eq. (B.11); this implies that the first-layer weights and biases can continue to evolve even when the behavior of $f_{\rm enc}$ near $\phi_0$ doesn't change.
The analysis is somewhat different for a ReLU activation; in that case, a quasi-minimum can be found when all $z_i < 0$ since $\sigma(z_i) = \sigma'(z_i) = 0$ for $z_i < 0$. However, $\partial f_{\rm enc}/\partial\phi$ is proportional to $\sigma'(z_i)$, and thus would vanish everywhere, which would map all of the input data to a single point in the latent space. Thus, the competing requirements of simultaneously needing $\sigma'(z_i) \neq 0$ and $0 < \sigma(z_i) \ll 1$ push $z_{i_*}$ towards zero as in the case of a tanh activation, leading to qualitatively similar behavior. The discontinuous derivative makes this case more difficult to analyze analytically, though, so we focus our discussion on smooth activation functions from here on but show an example below of the network dynamics with ReLU activation.
To summarize, at the end of training, the break point $\phi_0$ typically corresponds to a solution to $z_{i_*} = 0$ where $i_* = \operatorname{argmax}_i |W^{(2)}_i|$. We have qualified this statement with "typically" because it may happen that two of the weights have similar magnitudes, and it might be the case that $\phi_0$ is determined by the second-largest one, as we discuss below.

Figure 18. Left: encoder derivative $\partial f_{\rm enc}/\partial\phi$ as a function of input data coordinate $\phi$. The break point $\phi_0$ is shown in red. Center and right: Pre-activations $z_i(\phi)$ of the hidden layer node $i_*$ with the largest outgoing weight (center) and of the paired node $i'_*$ (right) as a function of input $\phi$ for the trained circle network. Note that the point of largest derivative of the encoder, which corresponds to the break point $\phi_0$, is one of the zeros of the pre-activation corresponding to the largest magnitude weight $|W^{(2)}_{i_*}|$. The second zero of the largest weight, $\phi'_0$, is also a zero of the paired pre-activation $z_{i'_*}$; the second zero of the paired weight, $\phi''_0$, is also a point of large encoder derivative.

Fig. 18 shows $z_{i_*}(\phi)$ and $\partial f_{\rm enc}/\partial\phi$ for the trained network shown in Fig. 4 with $\sigma = \tanh$, which was initialized with random weights and biases drawn from a uniform distribution between $-1/\sqrt{d_w}$ and $1/\sqrt{d_w}$ (the default in PyTorch). As anticipated by the analysis above, the magnitude of the derivative of the encoder is largest at the break point $\phi_0$ (red), which satisfies $z_{i_*}(\phi_0) = 0$, where $i_*$ is determined by the largest-magnitude weight. However, since $z_{i_*}(\phi)$ is a linear combination of sines and cosines plus an offset, it can be written as $A\cos(\phi + \delta) + b$. For $|b/A| < 1$, this function has exactly two zeros in $[0, 2\pi)$. The second zero, labeled by $\phi'_0$ (green), does not correspond to a large $\partial f_{\rm enc}/\partial\phi$ despite the fact that its pre-activation is near zero.
Instead, what happens is that there is another weight with large magnitude, $W^{(2)}_{i'_*} \approx -W^{(2)}_{i_*}$, whose pre-activation also contains a zero at $\phi'_0$. We can see this empirically in Fig. 19, which shows the evolution of the weights $W^{(2)}$ and the pre-activations $z_i(\phi_0)$ and $z_i(\phi'_0)$; as expected from this analysis, weights evolve in tandem to large positive and negative values, with the corresponding pre-activations driven to zero. This paired weight approximately cancels the large derivative at $\phi'_0$ (some remnants of the imperfect cancellation can be seen in the "wiggles" of $\partial f_{\rm enc}/\partial\phi$ at $\phi'_0$), but absent a fine-tuning of the $W_{ix}$ and $W_{iy}$, the second zero $\phi''_0$ of the paired pre-activation will be different from $\phi_0$, so the large derivative at $\phi_0$ remains.
On the other hand, there is now a partially uncancelled zero at $\phi''_0$ (magenta), resulting in a large $\partial f_{\rm enc}/\partial\phi$ of opposite sign, which can also be seen in Fig. 18. The difference $|\phi_0 - \phi''_0|$ is thus responsible for the "gap" around the break point, which looks potentially logarithmic as a function of training epoch, since cancelling the zero at $\phi_0$ requires the network to find its way to a finely-tuned quasi-minimum, or equivalently, for the composition of continuous maps to yield a discontinuous latent representation which has a delta-function derivative.

Figure 19. Evolution of the encoder parameters during training, with the weights and pre-activations at nodes $i_*$ and $i'_*$ shown in red and green, respectively. Weights evolve in pairs to large positive and negative values (left). The pre-activation $z_{i_*}(\phi_0)$ is driven to zero while $z_{i'_*}(\phi_0) \neq 0$ (center), resulting in a break point where $f_{\rm enc}$ has large derivative, while $z_{i_*}(\phi'_0)$ and $z_{i'_*}(\phi'_0)$ are both driven to zero (right).

Figure 20. Same as Fig. 19 for a ReLU activation function.
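The zeros of a pre-activation, which organize the discussion of break points and paired weights above, can be written in closed form. The following sketch, with the hypothetical function name `preactivation_zeros`, rewrites $W_x\cos\phi + W_y\sin\phi + b$ as $A\cos(\phi - \phi_c) + b$ with $A = \sqrt{W_x^2 + W_y^2}$ and $\phi_c = \operatorname{atan2}(W_y, W_x)$ (so $\delta = -\phi_c$ in the notation of the text) and returns the zeros in $[0, 2\pi)$. The numerical weights in the usage line are arbitrary illustrative values, not taken from the trained network.

```python
import math

def preactivation_zeros(wx, wy, b):
    """Zeros in [0, 2*pi) of z(phi) = wx*cos(phi) + wy*sin(phi) + b.

    Uses z(phi) = A*cos(phi - center) + b with A = hypot(wx, wy).
    Returns an empty list when |b| > A (no zeros) or when A == 0.
    """
    A = math.hypot(wx, wy)
    if A == 0.0 or abs(b) > A:
        return []
    center = math.atan2(wy, wx)
    offset = math.acos(-b / A)  # cos(offset) = -b/A
    two_pi = 2.0 * math.pi
    zeros = {(center + offset) % two_pi, (center - offset) % two_pi}
    return sorted(zeros)

# Arbitrary illustrative weights: two distinct zeros because |b| < A.
for phi0 in preactivation_zeros(0.3, -0.4, 0.1):
    print(round(phi0, 6), abs(0.3 * math.cos(phi0) - 0.4 * math.sin(phi0) + 0.1) < 1e-12)
```

Applying this to the node with the largest $|W^{(2)}_i|$ and to its paired node gives the candidate locations of the break point and of the cancelled or uncancelled zeros discussed above.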
As noted in App. B.1, we have verified that increasing the hidden layer width, or adding another layer to the encoder, does not affect the size of the break region (as measured by the gap in the decoder), which appears to depend mostly on the length of training (for a fixed learning rate) and to some extent on the particular form of the activation function, including the derivative at the origin as measured by $\beta$. Fig. 20 shows the evolution of the encoder parameters for a ReLU activation, showing the same qualitative behavior as for tanh. The key differences are that $z_{i_*}(\phi_0)$ is not necessarily driven to zero but can remain positive because the activation derivative $\sigma'(z)$ is identical for any $z > 0$; in addition, the weights $W^{(2)}$ which do not determine the break point are not driven to zero as fast as for the tanh activation. In this example the break point is also determined by the second-largest $|W^{(2)}|$. Nonetheless, the main feature which determines the break point, namely the pair of weights of equal magnitudes evolving in parallel, persists independent of the activation function, since it is required by topology.

Figure 21. Attempts to determine the break point of the $S^1$ autoencoder by initialization: one weight is initialized to a value 3, larger than the other initialized weights, and the corresponding first-layer parameters are initialized such that $z(\phi_0) = 0$ at $\phi_0 = \pi/4$. Left: Output of the autoencoder. Right: derivative of the encoder $\partial f_{\rm enc}/\partial\phi$. The network never cancels the second break point, and the trained minimum is poorer quality than the one achieved with random initialization.
Finally, we consider trying to initialize the network to give a break point at a prescribed value of $\phi_0$. Given that $\phi_0$ is determined by the largest second-layer weight $W^{(2)}_i$, we expect that if we initialize one of the weights, say $i = i_*$, to a large value compared to the width of the distribution from which the rest of the weights are drawn ($1/\sqrt{d_w} = 0.125$ for our default network parameters), $\phi_0$ will be determined somehow by the corresponding pre-activation $z_{i_*}$. From the update equations, the network will prefer to move along the flat direction for the first-layer weights and biases, so to choose $\phi_0$ we can also initialize $W_{i_* x}$, $W_{i_* y}$, and $b_{i_*}$ such that $\phi_0$ is a solution to $z_{i_*} = 0$. Fig. 21 shows the results of initializing $W^{(2)}_{i_*} = 3$ and $z_{i_*}(\phi_0) = 0$ with $\phi_0 = \pi/4$. The quasi-minimum the initialized network finds is qualitatively different from that of the randomly-initialized network. One break point ends up close to the target $\phi_0$, but the second zero of the pre-activation, $\phi'_0$, remains uncancelled, and the encoder develops two break points, as shown in Fig. 21. This behavior is qualitatively similar to the latent representation of the Clifford torus in $\mathbb{R}^4$ in App. C.2 below. The interplay between initialization and training is fertile ground for future work, especially in this simple example where analytic approaches may be tractable.

Consider a circle embedded with nontrivial extrinsic topology in $\mathbb{R}^3$: namely, the trefoil knot, defined by

$$ x = (R + r + \cos 2\phi)\cos 3\phi, \qquad y = (R + r + \cos 2\phi)\sin 3\phi, \qquad z = r + \sin 2\phi, \tag{C.1} $$

where we take $r = 1$ and $R = 2$ for concreteness. In this context, extrinsic topology refers to the fact that the knot cannot be continuously deformed into the circle in $\mathbb{R}^3$ without tearing, despite these two manifolds having the same intrinsic topology of the circle.
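For concreteness, the trefoil parametrization can be turned into a small data generator. The sketch below follows Eq. (C.1) with $r = 1$ and $R = 2$, taking the $y$-component proportional to $\sin 3\phi$ so that the curve is not confined to the plane $x = y$ (this reading is an assumption of the sketch). The function names are hypothetical. Every generated point lies on a torus of major radius $R + r$ and unit minor radius centered at height $z = r$, which provides a quick sanity check.

```python
import math

def trefoil_point(phi, R=2.0, r=1.0):
    """Point on the trefoil of Eq. (C.1); y is assumed to carry sin(3*phi)."""
    rho = R + r + math.cos(2 * phi)
    return (rho * math.cos(3 * phi),
            rho * math.sin(3 * phi),
            r + math.sin(2 * phi))

def trefoil_training_set(n):
    """n points equally spaced in the parameter phi over [0, 2*pi)."""
    return [trefoil_point(2 * math.pi * k / n) for k in range(n)]

# Sanity check: each point sits on the torus (rho - (R + r))^2 + (z - r)^2 = 1.
R, r = 2.0, 1.0
points = trefoil_training_set(1000)
print(all(abs((math.hypot(x, y) - (R + r)) ** 2 + (z - r) ** 2 - 1.0) < 1e-9
          for x, y, z in points))
```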
As with the $S^1$ autoencoder, we train on an equidistant training set in $\phi$, for both $d_l = 1$ and $d_l = 2$; the results of the output map are shown in the top row of Fig. 22. Just as with the circle, the output map contains break points, which in the case of $d_l = 2$ correspond to self-intersections in the latent representation (Fig. 22, bottom right). The typical size of the error is much larger for $d_l = 1$, but in both cases the largest errors are confined to neighborhoods of isolated points, as was the case for the circle. Here, though, we are seeing nontrivial intrinsic and extrinsic topology, the latter of which makes it difficult to learn the global geometry of the knot even for $d_l$ larger than the intrinsic dimension of the data, because a generic initialization of the network will lead to a latent representation with self-intersections.

We can cure the topological issues in two ways. First, by taking $d_l = 3$, we can have near-perfect reconstruction of the knot, but at the price of learning the trivial map in the region enclosed by the knot. We can also do something more clever and force the network to learn that the knot is a parametric curve. Consider modifying the loss function to

$$ L' = \frac{1}{N}\sum_{j=1}^{N} \left( \left\| f(\mathbf{x}_j) - \mathbf{x}_j \right\|^2 + \lambda \left\| f_{\rm enc}(\mathbf{x}_j) - \mathbf{x}_{\phi,j} \right\|^2 \right), \tag{C.2} $$

where $f(\mathbf{x})$ is the output of the full network, $f_{\rm enc}(\mathbf{x})$ is the output of the latent layer (i.e. the encoding of $\mathbf{x}$), $\mathbf{x}_\phi$ is the parametric representation of the knot of the same dimension as the latent layer, and $\lambda$ is a hyperparameter. The new loss $L'$ penalizes the network for learning a latent representation different from the parametric representation by $\phi$; for $d_l = 1$, $\mathbf{x}_\phi = \phi$, and for $d_l = 2$, $\mathbf{x}_\phi = (\cos\phi, \sin\phi)$. Fig. 23 shows the results of training with $L'$, setting $\lambda = 10$ with all other hyperparameters the same.
For $d_l = 1$, the latent representation cannot overcome the intrinsic $S^1$ topology of the knot, and while the output is clearly better at approximating the shape of the knot than the case for the unmodified loss function, the knot still breaks around a point as did the unit circle. For $d_l = 2$, the latent representation can learn the 2-dimensional representation of the circle, and we get much better reconstruction. We have checked that this network is not learning the trivial representation on $\mathbb{R}^3$, since there is still a compression with $d_l < d_{\rm in}$. We conclude that autoencoders can untie knots (i.e. evade obstructions associated to nontrivial extrinsic topology), as long as we tell the network to do so with a suitable modification to the loss. Indeed, this extra term in the loss is a toy example of the incorporation of priors based on topology which can help improve network performance.
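A loss of the kind described above, reconstruction error plus a $\lambda$-weighted penalty tying the latent code to the known parametrization, can be sketched as follows. This is an illustrative stand-alone implementation operating on plain lists of points, not the training code behind Fig. 23. The squared Euclidean norm for the penalty and the averaging over samples are assumptions of the sketch, and the function names are hypothetical.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def latent_target(phi, d_l):
    """Parametric target x_phi: phi itself for d_l = 1, (cos phi, sin phi) for d_l = 2."""
    if d_l == 1:
        return (phi,)
    if d_l == 2:
        return (math.cos(phi), math.sin(phi))
    raise ValueError("latent_target is only defined here for d_l = 1 or 2")

def modified_loss(inputs, outputs, latents, phis, lam=10.0):
    """Mean reconstruction error plus lam times the mean latent-space penalty.

    inputs, outputs: lists of points in the input space
    latents:         list of latent codes f_enc(x), each a tuple of length d_l
    phis:            curve parameter phi of each input point
    """
    n = len(inputs)
    d_l = len(latents[0])
    recon = sum(sq_dist(o, x) for o, x in zip(outputs, inputs)) / n
    penalty = sum(sq_dist(q, latent_target(p, d_l))
                  for q, p in zip(latents, phis)) / n
    return recon + lam * penalty

# A perfect reconstruction with the wrong latent code is still penalized.
x = [(1.0, 0.0, 0.0)]
print(modified_loss(x, x, [(0.0, 0.0)], [0.0]))  # prints 10.0
```

In a gradient-based framework the same two terms would be computed on tensors so that the penalty backpropagates through the encoder only, while the reconstruction term backpropagates through both encoder and decoder.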

C.2 The torus: quotient spaces
Moving to $d = 2$, we consider the torus $T^2$, which can be embedded in $\mathbb{R}^3$ by

$$ x = (R + r\cos\alpha)\cos\beta, \qquad y = (R + r\cos\alpha)\sin\beta, \qquad z = r\sin\alpha. \tag{C.3} $$

We take $r = 1$ and $R = 3$, and generate training and test sets uniformly sampled in $\alpha$ and $\beta$.
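A data set of this kind can be generated in a few lines. The sketch below samples $\alpha$ and $\beta$ independently and uniformly on $[0, 2\pi)$ and maps them through Eq. (C.3); the function names and the seed are hypothetical choices. Note that sampling uniformly in the angles is not the same as sampling uniformly with respect to surface area, since the area element is $r(R + r\cos\alpha)\,d\alpha\,d\beta$.

```python
import math
import random

def torus_point(alpha, beta, R=3.0, r=1.0):
    """Embedding of the torus in R^3 from Eq. (C.3)."""
    rho = R + r * math.cos(alpha)
    return (rho * math.cos(beta), rho * math.sin(beta), r * math.sin(alpha))

def sample_torus(n, seed=0, R=3.0, r=1.0):
    """n points with alpha and beta drawn uniformly from [0, 2*pi)."""
    rng = random.Random(seed)
    return [torus_point(rng.uniform(0.0, 2.0 * math.pi),
                        rng.uniform(0.0, 2.0 * math.pi), R, r)
            for _ in range(n)]

# Every sample satisfies the implicit torus equation (rho - R)^2 + z^2 = r^2.
print(all(abs((math.hypot(x, y) - 3.0) ** 2 + z ** 2 - 1.0) < 1e-9
          for x, y, z in sample_torus(1000)))
```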
Since the topology of the torus is that of a quotient space, $S^1 \times S^1 = \mathbb{R}^2/\mathbb{Z}^2$, the torus has a nontrivial fundamental group and cannot be covered with a single chart by excising a single point, unlike the sphere. Anticipating that this may make the autoencoder more difficult to train, we use both the 5-layer network and a deeper 7-layer network as defined in Sec. 2.

Figure 23. Model output and latent representation for the trefoil with the loss function modified to include a penalty when the latent representation deviates from the parametric representation of the data. The left plots show the results for $d_l = 1$ and the right plots show $d_l = 2$. Even with the modified loss, the $d_l = 1$ network still has a break point because of the intrinsic topology of the knot, but forcing the latent representation to approximate a circle for $d_l = 2$ leads to much better reconstruction.
The results are shown in Fig. 24. While the deeper network reduces the loss overall for a generic point on the test set, there are still numerous points with order-1 loss which are far away from the worst point, and numerous points with low loss which are close to the worst point, indicating that the latent representation is non-local. Since at least an $S^1 \vee S^1$ must be excised from a torus to embed the complement in $\mathbb{R}^2$, this behavior is expected.
To explore the role of the extrinsic topology of the data manifold in training an autoencoder, we consider a nontrivial embedding of the torus into a higher-dimensional space with $d_{\rm in} > 3$ and train an autoencoder with latent dimension $d_l = 3$. Since $d_{\rm in} > d_l$, the autoencoder cannot learn the trivial identity map, but it should be able to learn the standard 3-dimensional embedding given in Eq. (C.3) as the latent representation. Indeed, such a high-dimensional embedding exists in $\mathbb{R}^4$, known as the Clifford torus,

$$ (x, y, z, w) = (\cos\alpha, \sin\alpha, \cos\beta, \sin\beta). \tag{C.4} $$

Using a training set of uniformly-sampled points on the Clifford torus, and training multiple instantiations of a 7-layer network, we find two qualitatively different results, shown in Fig. 25. Occasionally, the network will find the global minimum of the loss where the latent representation is homeomorphic to the embedding $T^2 \subset \mathbb{R}^3$. More often, though, the network finds its way to a poor local minimum for the second $S^1$ factor where it "pinches" at two points, rather than the optimal global minimum of the embedding in $\mathbb{R}^3$.$^{17}$ Indeed, the Clifford torus parametrization makes the product-space structure of the torus $T^2 = S^1 \times S^1$ explicit, and the latent representation suggests that the autoencoder is learning both circles independently, rather than the global structure required for the embedding in $\mathbb{R}^3$.

Thinking about the autoencoder in terms of an ensemble (defined by the different possible realizations of weights and biases and learning dynamics), we see that the ensemble doesn't concentrate on one minimum, but rather on a discrete set of them. This lack of typicality is problematic from an application standpoint, as the two minima will have wildly different behaviors when employed as anomaly detectors. This example illustrates once again the richness of the autoencoder loss landscape and the dependence of performance on initialization.
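To make the relation between the two embeddings concrete, the sketch below generates Clifford torus points in $\mathbb{R}^4$ and shows one explicit map from them to the standard embedding of Eq. (C.3) in $\mathbb{R}^3$. Since $(x, y, z, w) = (\cos\alpha, \sin\alpha, \cos\beta, \sin\beta)$, the $\mathbb{R}^3$ coordinates are simple polynomials in the $\mathbb{R}^4$ coordinates. The function names are hypothetical, and this is only an illustration of a latent representation the encoder could in principle learn, not a claim about what the trained network computes.

```python
import math

def clifford_point(alpha, beta):
    """Clifford torus embedding in R^4, Eq. (C.4)."""
    return (math.cos(alpha), math.sin(alpha), math.cos(beta), math.sin(beta))

def clifford_to_r3(p, R=3.0, r=1.0):
    """Map a Clifford torus point to the standard R^3 embedding of Eq. (C.3).

    With p = (cos a, sin a, cos b, sin b) the target is
    ((R + r cos a) cos b, (R + r cos a) sin b, r sin a).
    """
    x, y, z, w = p
    return ((R + r * x) * z, (R + r * x) * w, r * y)

# Each circle factor has unit radius, so every point has squared norm 2.
p = clifford_point(0.4, 1.3)
print(abs(sum(c * c for c in p) - 2.0) < 1e-12)
print(clifford_to_r3(p))
```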

C.3 SU(2) and SO(3): topology versus geometry, or global versus local
The Lie groups SU(2) and SO(3) have the same local structure with isomorphic Lie algebras, but the global structure of the groups differs in a nontrivial way. Both SU(2) and SO(3) are 3-dimensional, but are topologically distinct, with SU(2) the double cover of SO(3). Both groups can be parametrized with a triplet of Euler angles $(\alpha, \beta, \gamma)$, which can be mapped into a vector of 8 real numbers (the entries of a complex $2 \times 2$ SU(2) matrix $U$ satisfying $U^\dagger U = I_{2\times 2}$) or 9 real numbers (the entries of a real $3 \times 3$ SO(3) matrix $O$ satisfying $O^T O = I_{3\times 3}$). Looking at the differing behavior of autoencoders trained on these two manifolds can therefore isolate the topological features from the geometric ones. In particular, since SU(2) is diffeomorphic to $S^3$, the SU(2) autoencoder will provide an example of a topologically nontrivial manifold embedded in a much higher-dimensional space $\mathbb{R}^8$, which will again be analogous to our phase space example.
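The double-cover relation and the two flattened representations can be illustrated with unit quaternions. The sketch below is a stand-in that uses quaternions rather than Euler angles, which is an assumption of this example and not necessarily how the data sets were produced. It draws a point uniformly on $S^3$ by normalizing a Gaussian 4-vector (uniform on $S^3$ coincides with the Haar measure on SU(2)), builds the 8-component SU(2) vector and the 9-component SO(3) vector, and checks that $q$ and $-q$ give opposite SU(2) matrices but the same rotation. All function names are hypothetical.

```python
import math
import random

def random_unit_quaternion(rng):
    """Uniform point on S^3: a normalized 4-vector of independent Gaussians."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def su2_vector(q):
    """Flatten U = [[a+ib, c+id], [-c+id, a-ib]] row-by-row into 8 reals."""
    a, b, c, d = q
    entries = [complex(a, b), complex(c, d), complex(-c, d), complex(a, -b)]
    flat = []
    for e in entries:
        flat.extend([e.real, e.imag])
    return flat

def so3_vector(q):
    """Flatten the rotation matrix of unit quaternion q row-by-row into 9 reals."""
    a, b, c, d = q
    return [a*a + b*b - c*c - d*d, 2*(b*c - a*d),         2*(b*d + a*c),
            2*(b*c + a*d),         a*a - b*b + c*c - d*d, 2*(c*d - a*b),
            2*(b*d - a*c),         2*(c*d + a*b),         a*a - b*b - c*c + d*d]

rng = random.Random(0)
q = random_unit_quaternion(rng)
minus_q = tuple(-c for c in q)

# q and -q are distinct points of SU(2) but the same element of SO(3).
print(all(abs(u + v) < 1e-12 for u, v in zip(su2_vector(q), su2_vector(minus_q))))
print(all(abs(u - v) < 1e-12 for u, v in zip(so3_vector(q), so3_vector(minus_q))))
```

The rotation matrix is quadratic in the quaternion components, which is why the sign of $q$ drops out. This is the $\mathbb{Z}_2$ identification that distinguishes SO(3) from SU(2) globally while leaving the local structure unchanged.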
Since the geometric structure of these manifolds is difficult to visualize, instead of plotting the output directly, we will evaluate the performance of the autoencoder with the loss-versus-distance plot introduced in Sec. 5, as well as examining the loss on the test set as a function of training epoch. We generate training sets by uniformly sampling each group according to the Haar measure, the unique (up to normalization) invariant measure on a compact Lie group. The matrices are then flattened row-by-row into an 8-component or 9-component real vector for SU(2) and SO(3), respectively. Fig. 26 shows the performance of the deep 7-layer autoencoder trained on these two group samples.$^{18}$ Based on the results from the circle and the 2-sphere, it is not surprising that a 3-dimensional latent layer is not able to fully reconstruct the data, while a 4-dimensional latent layer can do so: SU(2) $\cong S^3$ can be embedded in $\mathbb{R}^4$. On the other hand, for SO(3), the size of the loss after the same amount of training is orders of magnitude larger than for SU(2) and barely improves going from $d_l = 3$ to $d_l = 4$. This is due to topology: SO(3) is diffeomorphic to real projective space $\mathbb{RP}^3$, as it is the quotient of SU(2), a 3-sphere, by a $\mathbb{Z}_2$ action, and a classical theorem of Mahowald states that $\mathbb{RP}^3$ does not embed in $\mathbb{R}^4$ [75].

$^{18}$ Note that the MSE loss is computed on the flattened 8- or 9-dimensional vector, which implies a Euclidean