The Universe from a Single Particle

We explore the emergence of many-body physics from quantum mechanics via spontaneous symmetry breaking. To this end, we study potentials which are functionals on the space of Hamiltonians enjoying an unstable critical point corresponding to a random quantum mechanical system (the Gaussian unitary ensemble), but also less symmetrical local minima corresponding to interacting systems.


Introduction
Since Wilson [18] much attention both in condensed matter and high energy has been devoted to the question: Can theory B emerge as a low energy limit of theory A? And even possible concatenations A → B → C → . . . ? Effective field theories are undoubtedly important links in such chains. This paper explores, in a toy model, the possibility that field theory itself, even at high energy, may arise through a kind of symmetry breaking applied to a much simpler system, single particle quantum mechanics.
Research was done as part of an internship of the second author at Microsoft. 1 In our toy model, single particle quantum mechanics is represented by theory A with a finite dimensional Hilbert space of states H A ∼ = C N . Initially, and for simplicity, we take N to be a power of 2, N = 2 n , but, as we will see, this choice simplifies the discussion but may not be essential. We study symmetry breaking to a theory B which is a toy model for field theory. Theory B has a Hilbert space H B isometric to C 2 ⊗ · · · ⊗ C 2 n times but H B is regarded as describing a system of n interacting bosons, each with two states (a very similar choice, not explored here, is for H B to describe n interacting spin- 1 2 Fermions). The difference between models A and B is the probability distribution from which the system Hamiltonian is drawn. In case A, we consider that the Hamiltonian H A to be drawn randomly from the Gaussian Unitary Ensemble (GUE) Z on su(N), the Lie algebra of the group of symmetries of H A . We consider the simple Lie group SU(N) to be the symmetry group of H A as overall phase is unobservable. Since the elements of su(N) are skew-Hermitian, we multiply by −i to get a Hermitian Hamiltonian. To complete the definition of theory A we specify the metric g ij on su(N). We choose g ij to be the (ad-invariant) killing form: a, b ∝ tr(ad(a) • ad(b)). Specifying the metric makes the distribution Z well-defined and induces a measure on su(N), giving meaning to the word "randomly" above. In the case g ij is the Killing form, the resulting distribution is the GUE. Concretely, imagine a particle with access to N states |1 , . . . , |N , with the system Hamiltonian H A built by assembling N 2 i.i.d real Gaussian random variables into (1) the diagonal entries of H A , (2) and the real and imaginary parts of (H A ) ij , i > j with (H A ) ij = (H A ) ji . Constructing H A in this way reflects a complete agnosticism on the possible transitions, making H A the Hamiltonian for a random single particle system.
By contrast, in case of B, some qubit structure has been chosen in the form of an isometry: Thus the moduli space of qubits structures is the homogeneous space where J determines the inclusion. There is now a central definition: Definition 1.1. A non-singular inner product (i.e. a metric) g ij on su(2 n ) which is not bi-invariant, knows about qubits (is kaq) if there is a basis {iH 1 , . . . , iH 4 n −1 } for su(2 n ) consisting of principal axes for g ij with the property that there is an isomorphism of Lie algebras j : u(2) ⊗ · · · ⊗ u(2) n copies → u(2 n ) such that iH k = j(O 1,k ⊗ . . . ⊗ O n,k ) for for 1 ≤ k ≤ 4 n − 1, where O i,k lives in the i-th copy of u (2); that is, is there is a basis of principal axes which are pure tensors relative to j.
The previous choice J determines j. The notions of isometry and principal axis above are with respect to a unique (up to scale) bi-invariant metric a, b = 1 dim su(N ) tr(ad(a) • ad(b)). Implicitly, this definition exploits the natural inclusion SU(N) ⊂ U(N), since our g ij is defined only on the former.
For clarity, we remark that if the (inverse) bi-invariant metric is used to raise an index of g ij to create a symmetric operator g j i , then the principal axes of g ij are precisely the eigenvectors of g j i . The reason the definition refers to a basis of principal axes (eigenvectors) instead of the basis is that in the metrics we encounter there is usually degeneracy (to numerical precision) so we need the freedom to look within eigenspaces for a basis of simple (pure tensor form) eigenvectors. In practice, one might wish to soften the definition slightly and allow eigenspaces with nearly identical eigenvalues also to be combined, thus allowing more freedom to find simple linear combinations.
We can generalize the definition above to knows-about-qunits (kaqn). We have used qubits to define kaq but the reader can imagine other primes or combination of primes not restricted to 2. In this initial paper, we have not attempted computations in this direction but expect to study this in future works.
One challenge we face is the large dimension 4 n − 3n − 1 of M. To decide if a given metric g ij on su(2 n ) is or is not kaq, it may be necessary to numerically search M. For n = 2, we did this on several occasions. But a particularly simple family of kaq metrics are those where there is a basis of principal axes of g ij consisting of Pauli words. The structure J determines a Pauli-word basis, called PB n on the Lie algebra su(2 n ) = consisting of all words of length n in the letters {1, X, Y, Z} except the all 1 word which is in u(2 n ) but not su(2 n ). Given J, this basis is canonical up to the action of SU(2) on each factor.
Hence forth, we normalize all metrics (also the killing form) so that det(g ij ) = 1. A very interesting special case of kaq metrics on su(2 n ) are the previously studied penalty metrics of [14,15,8], [7,6], and [5] where where r is some positive constant and w(i) is the weight of the ith Pauli-word meaning the number of letters it contains, X, Y, or Z, which are different from 1. Of course: We should emphasize that kaq metrics are exceedingly rare having roughly square root the number of parameters as general metrics. A kaq metric needs 4 n − 1 parameters to specify j (on which Aut(su(2 n )) acts freely and transitively), and also 4n parameters for each product of O i,k 's of which there are 4 n − 1 (as 1 ≤ k ≤ 4 n − 1). In total this makes (4 n − 1) + (4 n − 1)(4n) = (4n + 1)(4 n − 1) parameters for a kaq metric, whereas the metrics space g ij on su(2 n ) has dimension 4 n (4 n −1) 2 . This paper studies symmetry breaking from theory A to theory B. To do this, we consider (one of several possible) functionals: Let us repeat what we are looking for and why we are looking for it.
First the "what". We are looking for a functional f on metrics which, if minimized, would confer the usual features of interacting physics on what, at a fundamental level, is actually a single particle system. The systems (A and B) we study are each fully quantum mechanical but are each drawn from their separate statistical ensembles: Gaussians defined from quite different inner products. Because the overwhelming preponderance of kaq metrics produce measures with infinitesimally small overlap with the ad-invariant measure (induced by the Killing form), a kaq metric will display observable features of many-body physics, almost certainly not seen in a Killing-random (GUE) Hamiltonian from theory A. The distinction is most clear in the special case of penalty metric in which the expected strength of k-body terms in the Hamiltonian H B , such as (⊗ is suppressed and the non-identity letters are not required to be consecutive), will decay exponentially with k. If one were handed a seemingly random 2 n × 2 n Hermitian matrix and discovered (not an easy computational task) that it was mostly captured, in some tensor coordinates, by low weight terms, one would believe many-body physics was at work. The SYK model is a (fermionic) illustration. It assembles the system Hamiltonian for a black hole from random 4-Majorana interactions (analogous to 2-body interactions in the bosonic case) governed by a complete graph (no "spatial locality"). The SYK model can easily be enhanced, in the spirit of penalty metrics, to allow exponentially weakening higher body interactions: Hamiltonians are part and parcel of many-body physics. It would indeed by surprising to be told that the system was "really" a single particle whose governing Hamiltonian hand rolled off an unstable equilibrium into a kaq well. This is the surprise we are investigating.
Before addressing the "why," a word about the parameter space {metrics} and the functionals f we investigate.
The geometry of compact simple Lie groups with bi-invariant metrics is very well understood [10]. They are symmetric spaces whose geodesics are the cosets of 1-parameter subgroups. In this case much global information is known: for example, all points on the cut locus (say from id) are also conjugate points. On the other hand, if the Riemannian metric is merely left-invariant the geometry is a rich subject of contemporary research. Local computations are straightforward [13,4] but global properties are harder to derive (in [5] it was recently shown that for a fixed constant c > 0 the diameter of the penalty metrics on SU(2 n ) is larger by a factor, exponential in n, than the Killing metric).
Choosing a left-invariant metric on a Lie group G is of course the same thing as choosing an inner product on its Lie algebra g (and moving that inner product about by the differential of left multiplication). So when we study {metrics} on su(2 n ) we are really studying left invariant Riemannian geometries on SU(2 n ). The normalization condition det(g ij ) = 1 ensures a fixed volume for SU(2 n ) as the metric varies. The penalty metrics make high weight directions exponentially expensive so a typical H B for such a metric will consist mostly of low-weight (i.e. few-body) interactions, i.e. look like typical many-body physics. So we are looking for functionals which break symmetry to penalty metrics or at least kaq metrics. It is clearly cheating to mention a qubit structure in the definition of f . f should be defined only from what is intrinsic to noninteracting physics, the structure constants of su(2 n ), c k ij , and the metric g ij . As geometers, a very natural functional might be f = −diameter. For the Killing metric, the diameter is π but can be substantially increased by varying g ij from id. While f = −diameter might eventually reward investigation, we must restrict ourselves to studying functionals f where computations are efficient. So we have tried to assemble for study the most natural functionals built from the tensors g ij and c k ij . For f , we settled on Ricci scalar curvature which is easily expressed in terms of g ij and c k ij , as well as a variety of (cubicly) perturbed finite dimensional Gaussian integrals. The practical reason for studying these is that it is easy to work out the first few terms of their perturbative expansions in terms of tensor contraction diagrams. Once (a few terms of) the perturbative expansion is in hand, there are efficient methods, e.g. Padé approximates [3] to achieve rapid convergence (to be explored in future works) but merely summing the first few terms is useful when the coupling k to the cubic term is small. These integrals are described in the next section.
Regrading the accuracy of our methods, a mathematical admission is in order. We do not have a proof that the integrals we use to define our functional f have an analytic meaning. This is currently under investigation with preliminary remarks in section 4. We proceed instead, as one often does in field theory, trusting that the integral expression, through some regularization, actually defines a function of a (complex) expansion parameter and trust that detailed information on this function can be extracted from its asymptotic expansion (possibly post-processed via Padé approximates). Of course, this issue is not present when f = Ricci scalar curvature.
Finally, why? The Wilsonian dream is that at highest energy the universe has a simple mathematical description and that any complexity in the apparent physical laws arose through spontaneous symmetry breaking (SSB). For example: All the theories in any such story have necessarily been interacting because they are intended to model an obviously interacting universe. The goal of this paper is to propose, in the context of a highly restricted toy model, that a prequel is possible. That single particle quantum mechanics could also be the beginning of the story. Adding quantum mechanics also to the far left in Figure 2, we propose that SSB could allow QM to simulate the interacting systems from which the familiar story is told, as shown in Figure 3.  Although the vision is broad, the goal of this paper is modest: explore symmetry breaking from QM to a kaq-geometry. The answer to most pertinent questions: Can this approach produce the macroscopic dimensions of spacetime? ... Where does the fine structure constant emerge? ... is that we are nowhere near ready to study them. However, to the the large ensemble of untouched questions, we would like to add a speculation.
It may have, and should have, seemed unnatural that for theory A, our random finite dimensional QM system was placed on a Hilbert space of dimension N = 2 n . What would one expect if N was just a randomly selected large integer? Could there still be a symmetry breaking story to a collection of various qunits? If N is chosen at random in a certain size range (say 100 digit numbers), we can ask what is the expected size of its largest prime factor. The answer is that it likely will have about 62 digits. This phenomenon is governed by the Golomb-Dickman constant ≈ 0.624329989 . . . . Much beyond this is known regarding the statistics of prime factorizations of large random integers. If N instead of being a power of 2, is random, perhaps a suitable functional f will break the large system into qunits of dimensions the prime factors of N.
As a test of our symmetry breaking thesis, one should look near the r.h.s. of Figure 3 for some phenomenological residue of a factorization of the initial random integer N. N should indeed be large. Since the entropy of measured black holes can be 10 90 [2,9] we would need n >> 10 90 and N >> 2 10 90 , roughly what was once called Googolplex. To understand how the prime factorization of a large random integer might be imprinted into the low energy physics is a major challenge, and one approach to gathering physical evidence for the symmetry breaking hypothesis that we study here on a mathematical plane. The distribution of primes is tightly reflected in the Riemann zeta function, and similar zeta functions built from the length distribution of closed geodesics on a hyperbolic surface are linked through the Selberg trace formula [16] to the spectrum of the Laplacian on these surfaces. So statistical properties of factorization could have an audible echo in string dynamics if we know what to listen for.
Another mathematical possibility to be explored is that the prime 2 is energetically favored because of the relation to Majorana/Clifford algebras, and that for a general integer N, SSB will simply round down to a power of 2, splitting the Hilbert space into a direct sum of glassy physics, and a left-over degrees of freedom. The size of the power of 2 may depend on energy/entropy balance, as there should be many more shallow, smaller Clifford algebra minima and fewer larger and deeper ones. Leftover degrees of freedom may separately organize themselves into interacting systems. A Hamiltonian which is a superposition of several separate interacting sectors would govern separate world branches which, like the live and dead cat, do not know about each other.
In future works, we hope to investigate SSB to "partial"-kaq(n) minima in which a proper summand of the Hilbert space is kaq(n), or the possibility of "multi"-kaq(n) minima where the Hilbert space decomposes into several subspaces each of which has a kaq(n) decomposition. For example, one might find a 17-d Hilbert space decomposing as ( Let us finish this introductory section with integrals used to define functionals f : {metrics} → R, and a discussion of their symmetries. A discussion of the numerics and ML used to enhance the numerics is in the next sections.
We define the complex-valued functions: F k : {metrics} → C, and consider functions of the form The structure constants c k ij are defined by (8) [ This seems to be the simplest perturbed Gaussian that can be formed from the two tensors g and c which will not vanish due to symmetry consideration. For example, if we merely integrated over one copy, instead of three, of the Lie algebra su(2 n ) ∼ = R 4 n −1 , using g ij instead of G IJ the integral would become But observe that the cubic perturbation c ijk y i y j y k = 0 since c ijk = −c jik , by the skew symmetry of the Lie bracket [, ]. The formula in 7 indeed seems the most natural way to build a nontrivial perturbed Gaussian integral from the tensors g and c.
Another functional that will be considered is the Euclidean version of the above, where −i in the exponent in (7) is replaced by −1.
We should say a little about the symmetries of F k restricted to the kaq subset M n kaq ⊂ M n := {metrics on su(N)}. There is the adjoint action of SU(N) on su(N) inducing an action on M n , and since M n kaq is defined via the existence of some qubit decomposition, this action preserves M n kaq . The isomorphism to C 2 ⊗ · · · ⊗ C 2 n is merely precomposed by the inverse of the element of SU(N). This action, for example has isotopy properly containing For numerical reasons, we are best able to study f k,2 for k positive real in the range of 100 ≤ k ≤ 1000 and f k,1 for k positive imaginary in a similar range.
Once we locate a locally minimizing metric g we diagonalize it and call the 4 n − 1 eigenvalues the large spectrum. Each of the corresponding 4 n − 1 eigenvalues is a 2 n × 2 n skew-Hermitian matrix and we refer to their spectra (which are imaginary) as the little spectra. When there is eigenvalue degeneracy, there is additional choice choosing principle axes (eigenvectors of g j i ) and the concomitant little spectra. There is a very quick and often reliable check that a metric is in M n kaq : the little spectra should all be consistent with the proposed tensor product structure. This is useful primarily when we a non-degenerate eigenvalue in the little spectrum (a 1-d eigenspace). A further more definitive test is discussed in section 4. In practice we can usually understand the kaq status of the local minima we locate.
The rest of the paper is organized as follows: • §2. A detailed look at the numerical methods and the machine learning techniques that enabled them.
• §3. A summary of what computations were carried out and the most significant conclusions for symmetry breaking to kaqs.

The Summary of Results
Numerical study of a variety of functionals on inner product on su(4) and su (8) finds, in addition to an unstable critical point at the bi-invariant Killing form, metric structure often involving multiple local minima. In many cases we have identified these as kaq metrics (see Figures 9 and 10). Although our study was necessarily limited to very low dimensions, it provides evidence for our proposed SSB scenario. It is furthermore intriguing that the choice of functional does not need to be fine tuned, indicating that perhaps many natural functionals formed from g ij and c k ij commonly have kaq minima. One could add another layer of probability and consider not just our system Hamiltonian H selected from a metric dependent Gaussian, but also that metric g ij drawn from local minima of functionals f themselves drawn from a distribution D of functionals.
Next, a few disclaimers: Ricci scalar curvature was numerically challenging and we were only able to locate critical points whose locations had previously been determined analytically [11] and we already knew would be kaq. These local minima are associated in the two cases with the Lie subalgebras so(4) ⊂ su(4) and sp(2) ⊂ su(4) respectively. Surprisingly, for the functionals studied on su(4), some of the solutions appear numerically to be sp(2) Jensen local minima as well. The metric is small on the Lie subalgebra and large on its orthogonal complement. For other local minima for su(4) and su (8), similar Lie subalgebras have been found that correspond to one (or a combination of some) of the eigenspaces. A more detailed description is given in section 3.
This points to the fact that in addition to the class kaq, there is a class on metrics which we shall call sub n having the structure described above. Particularly for small n, sub n and kaq n have considerable intersection among metrics on su(2 n ).
We must consider to what extent the evidence is actually for kaq and not sub. So far, we have not seen local minima in sub n \ kaq n .
In summary, we find considerable numerical evidence to support the plausibility of spontaneous symmetry breaking to kaq metrics, as the prequel to interacting physics.

Gaussian perturbed integral.
Expanding the perturbative series of (7) up to third order, yields a summation of two, four and six vertex trivalent tensor networks, where vertices are labelled by c and edges by g or g −1 . This expansion follows a similar procedure outlined in [1, Equation 1.7 onwards]. We start with the bosonic Gaussian perturbed integral in (7), and make a simple change of variable Focusing on the m-th term of the expansion gives a Gaussian integral which can be computed using standard methods Writing dratic expression, the only nonzero terms after differentiation and setting V = 0 correspond to the power p = 3m 2 , i.e.
Indeed all other terms are zero, when p < 3m 2 , as there are 3m partials and less (< 2 × 3m 2 ) variables, and when p > 3m 2 , as V = 0. By following the power of imaginary number i in the previous equations, we observe that Im(F k ) = f k,2 corresponds to m ≡ 2 (mod 4), while the rest is Re(F k ) = f k,1 . Similarly, the Euclidean version of this integral has alternating signs ±1 depending on m (mod 4).
Our numerical experiments are based on m = 2, 6 and on m = 2, 4 for the Euclidean version.
We can show that the expression above can be viewed as a contraction of a trivalent network without any loops, and m vertices c ijk and edges g ii , g jj , g kk . Note each partial needs to be paired with its corresponding term in 2 . Let us fix a j and consider the example of −i∂ Therefore, there can not be any loops, or any pairing between two different indices, like i and j. As V j ′ 2 is similarly paired with −i∂ , which is associated with some term c −j ′ − , we get an edge connecting the two vertices c −j− and c −j ′ − with g jj ′ labeling that edge. We note a further simplification, by using (8), allowing us to conclude that vertices can be labelled by c k ij instead of c ijk , while edges are labelled by g ii ′ , g jj ′ and g kk ′ instead of g kk ′ . The reason for the change in the latter edge type is that when two vertices c ijk and c i ′ j ′ k ′ are connected along the kk ′ indices, two factors of g kk ′ come from c ijk and c i ′ j ′ k ′ (thus transforming them to c k ij , c k ′ i ′ j ′ ), and one gets cancelled by g kk ′ coming from g kk ′ V k 3 V k ′ 3 . Each diagram also comes with a coefficient. The coefficient can be computed by following the scalars in equations above for each m, and considering all permutations to each diagram labelling.
We list all diagrams below with their coefficients. For m = 2, there is only the diagram Figure 4. Theta diagram. All diagrams are trivalent networks without any loop, and vertices are the structure constants c k ij . Each vertex has indices i, j, k which are paired with their counterpart in another vertex. This pairing is done using g along edge of type k (colored red) and g −1 for type i and j. For m = 6, there is a common coefficient 1 6! . • Three thetas • Tincan and theta • Tetrahedron and theta • Prism Hence, the functions of study are where the power of k for each m is − 3(4 n −1)+m 2 , thus the ratio yields the term k 2 in F 26 and k in F 24 for m = 2. Remark 2.1. As mentioned in the introduction, and illustrated in experiments and in Appendix 5.3, the tensor contractions above are each convex with minimum at Id. This is why it is expected that a signed sum of these diagrams gives local minima around a local maximum at Id (Figure 1).
In order to fix the volume by restricting to the space of metrics with det(g) = 1, we add the term (det(g) − 1) 2 with a high enough coefficient to the loss function. The more obvious normalization is a suitable fractional power of 1 det(g) . However, not only this power is different for networks with different number of vertices, but also, such an expression leads to more severe numerical round-off errors because of the fractional power, as well as a more involved formula for the gradient than (det(g) − 1) 2 (see Remark 3.4).
Furthermore, in many instances, the value of F 26 , F 24 is too high, which makes the use of scaling factors for the functional and det(g) necessary in order to avoid numerical instability, sometimes leading to det(g) being sent to zero.
The general form of the loss function is: where r 1 ≥ 1, r 2 >> 1. The term solution will be exclusively used to refer to the critical points of the loss function found through gradient descent, which experimentally (through many random small perturbations) have been checked to be local minima. This last check is necessary, as we are looking for local minima of F 26 , F 24 inside the det(g) = 1 subspace, and it is not immediately clear that the set of such local minima is equal to that of L 26 , L 24 in the space of all metrics g.

2.1.2.
Ricci scalar curvature. The Ricci scalar curvature for all unimodular Lie algebras can be computed as follows [11]: which in diagrammatic notation Figure 7. For su(2 n ), this can be simplified to To find the critical points, we need to minimize ||∇R|| which can be computed explicitly using the formula above. Minimizing ||∇R|| through gradient descent is too slow, likely because the functional has a shape very close to flat. In this kind of situation, in machine learning applications, one method that generally provides better results is the evolutionary method. The idea is to do an approximate gradient descent, but much faster and with more potential in finding lower minima in almost flat-shaped graphs.
Instead of computing a gradient, we perturb by small amount the given point (parent) and compute the value of M many of its perturbations (children). The child with the smallest evaluation is chosen as the new parent. In our application, we choose M = 20 and perturb each entry of the parameter by using a uniform distribution on (1, 1/30).

2.2.
Initialization of the gradient descent. Different initializations of the metric g lead to the discovery of different local minima of the loss function. In addition, some forms of initializations are preserved under gradient descent. We need to discuss some terminology and lemmas to discuss the initializations.
Definition 2.1. Given a metric g, the degeneracy pattern is the set {d 1 , . . . , d t } of the dimensions of its eigenspaces. An ordered degeneracy pattern is a tuple (d 1 , . . . , d t ) where the corresponding eigenvalues 0 < λ d 1 < . . . < λ dt are in ascending order.
Let a subgroup H < C n act on P n /{±1, ±i}, where we identify words that are proportional to each other. Consider the orbits of H, which also provide a decomposition of the Pauli word basis PB n of su(2 n ), thus determining a degeneracy pattern. We have the following theorem. Remark 2.2. As a direct corollary, gradient descent reaches a critical point (experimentally checked to be local minima) with that degeneracy pattern modulo merging.
We can now discuss different types of initialization to start the gradient descent with. All such initialization are normalized to have det = 1, and determine the set of parameters on which gradient descent is performed.
• DiagPerturbId: A uniform random diagonal perturbation of the local maxima at identity, the set of parameters being the diagonal entries.
• PenMetricOmega: Penalty metric with gradient on the weight parameter w = e r in (1), which means the local minima are found in the space of penalty metrics. Thus, a solution may not necessarily be a local minimum in the space of all metrics (but see Remark 2.3). This is to search for the minimum that the penalty metric can achieve and compare it to the lowest local minima found in other ways.
• PenMetricPattern : Penalty metric with gradient descent on w 1 , ..., w n where w i is the entry for Pauli word of weight i. All w i are initialized using a random w, i.e. w i = w i at the start. Once the local minimum is reached, we can observe how far the local minima found is from a penalty metric by comparing the ratios w i+1 /w i .
• GenPerturbId: A general (nondiagonal) random Gaussian perturbation of identity, the set of parameters being entries of the matrix.
• GenMetric: A random metric, the parameters being entries initialized by standard normal distribution.

Remark 2.3. For su(4), PenMetricOmega is effectively the same as the more general
PenMetricPattern: There is only one ratio (w 2 /w 1 ) to calculate and it gives the weight parameter w. On the other hand, a penalty pattern for su(8) could experience a merging of eigenspaces or lead to a solution where the eigenvalues do not represent a penalty pattern (i.e. w 2 /w 1 = w 3 /w 2 ), and all local minima found with this initialization for su(8) are as such.

Adam optimization algorithm.
The Adam optimization algorithm [12] has become one of the most used algorithms for deep learning and more generally machine learning applications. Similar to the vanilla stochastic gradient descent, it is used to update parameters of a functional iteratively. However, it has many advantages compared to the stochastic gradient descent, and has been experimentally shown to lead to lower minima faster compared to other stochastic optimization algorithms [12]. Most important to our application, it satisfies some desirable characteristics: • Computationally efficient with little memory requirements.
• Invariant to diagonal rescale of the gradients.
• Appropriate for problems with sparse gradients.
• Hyper-parameters have intuitive interpretation and typically require little tuning. In contrast to stochastic gradient descent, Adam has a learning rate for each parameter which is separately adapted during learning. The name Adam comes from adaptive moment estimation, as it computes individual adaptive learning rates for each parameter from estimates of first and second moments of the gradients. This helps with sparse gradients situations and cases where the gradient is highly noisy (like in nonstationary settings). For our purpose, we would like an algorithm that avoids saddle points and has a better chance of finding the more interesting local minima by not getting stuck in small basins. Figure 8 shows the implementation of the algorithm. The hyperparameters of Adam with their default values are described below.
• α (default 0.001): The learning rate (LR) or stepsize. The proportion that weights are updated. We prefer small values to have a more thorough search and find more local minima.
• β 1 (default 0.9): The exponential decay rate for the first moment estimates. This hyperparameter like the next ones required no tuning.
• β 2 (default 0.999): The exponential decay rate for the second-moment estimates. In regions with a small or sparse gradients, β 2 should be close to 1.
• ǫ (default 10 −8 ): This is to avoid any division by zero in the implementation.

Scheduling and hyperparameters.
LR decay can be used with Adam. In all cases, LR is 10 −3 but is halved during the gradient descent every 20000 steps, as the algorithm struggles to converge in its final stages, as it jumps over and back on lower minima basins due to larger than appropriate LR. All other Figure 8. Adam algorithm as in [12] hyperparameters are taken with their default values. Scaling factors mostly depend only on the type of the integral and value of k. The scaling factors chosen for each initialization are given in Tables 1 and 2. Generally, we prefer to choose higher r 1 to scale F 26 or F 24 if they attain very high values, which happens as k grows larger. If they attain high values but we can still be confident about the accuracy of the computations, r 1 is set to 1 while r 2 is chosen high enough (10 6 ) to avoid det(g) vanishing.

Optimization Results and Conclusions
We shall make a few remarks before discussing the results.
Remark 3.1. In tables above, GenMetric is absent. It was observed through experiments that they • did not give any new local minima (in terms of degeneracy pattern), or • they did not converge, as either the values of the functionals were generally too high for us to have confidence in the numerics, or the complicated scheduling necessary in the hyperparameter tuning made it too hard to have a stable gradient descent.
Remark 3.2. As mentioned in Remark 2.3, for su (4), "PenMetricPattern = PenMetri-cOmega", but for su (8), they are gradient descents in different spaces. In the case of L 2,4 ( Table 2), we could not collect any nontrivial result for su(8) by using PenMetricOmega, hence why its column is absent.
Remark 3.3. Our experiments involved two values of k although the values in-between were tried to some extent, and did not deliver any new results. If a degeneracy pattern reappears for higher values of k, it is always the case that eigenvalues are further apart.
We assess our solutions using the following type of evidences, listed in the order used in the paper.
• Finding Lie subalgebra structure by computing the brackets of (a collection of) eigenspaces.
• We propose two different ways for providing evidence/proof regarding the existence of a map j as defined in 1.1. In this paper, we explore the first one to some extent: (1) An evidence for the existence of j can be found by looking for what is called pattern breaking. First one finds a partition of the given degeneracy pattern that has the same dimensions as the penalty pattern; e.g. for su(4), the pattern (3, 1, 1, 8, 2) (see further below) has one partition such as ({3, 1, 2}, {1, 8}) where the dimensions in each set add up to the penalty pattern (6,9). Next, we check if the Lie algebra generated by the set of eigenspaces ({3, 1, 2} in the example) corresponding to the weight one Pauli words is isomorphic to the Lie subalgebra su(2) ⊕ . . . ⊕ su(2) (n = 2 times in the example). This ensures that there exists a map j, at least for the restriction to weight one Pauli words. The final step, not carried out in this work, would be to see if the rest of the eigenvectors can also be mapped under such a j to a local tensor product form. (2) A more direct approach is a gradient descent in the space of all Lie algebra isomorphisms j −1 . We can check whether j −1 (H i ) is in tensor product form by computing the entanglement entropy and aim to minimize the overall entanglement entropy of j −1 (H i )s. Consider the case of su(4) where we would like to measure the distance of j −1 (H i ) to a decomposition into a tensor product O 1 ⊗ O 2 . Fixing j (or J), determines an isomorphism of the underlying C 4 (on which su(4) are the skew-Hermitian transformations) into C 4 ∼ = C 2 1 ⊗ C 2 2 . Think of j −1 (H i ) as a vector in C 2 1 ⊗ C 2 2 ⊗ (C 2 1 ) ⋆ ⊗ (C 2 2 ) ⋆ so that the partial trace ρ i = Tr C 2 1 (j −1 (H i )) ∈ C 2 2 ⊗ (C 2 2 ) ⋆ under the natural map. By computing the entanglement entropy of j −1 (H i ) which is S(j −1 (H i )) := S i = Tr(ρ i log ρ i ), we can define the loss function as dim su(4)=15 i=1 S i and aim to minimize it over {j}. If S i can be made "close" to zero, then we conclude that , is kaq up to some error (depending on how "close" to zero); i.e. the eigenvectors for g have attained something close to a tensor product form. Note that for higher number of qubits, like in su (8), this test needs to be performed successively to prove decomposition into • If we find a nondegenerate eigenvalue, we make a little spectra test on the corresponding skew-Hermitian matrix, to find if the spectrum can rule out that the matrix is not in the form of a tensor product.
• In the case of PenMetricPattern, we make a direct inspection and look at the ratios of eigenvalues of different weight spaces (Remark 2.3).
• In the case of PenMetricOmega, we check if the solution is a local minima in the space of all metrics, and if the weight parameter is smaller than one. We used the TensorNetwork package to compute the diagrams and PyTorch to do gradient descent using Adam. For the hardware, we used one NVIDIA Tesla V100 GPU. Most simulations took at most 2 hours to converge, with the diagonal initializations converging faster.
Remark 3.4. PyTorch uses Automatic differentiation (Autograd) to compute gradient of a function f . It does so by first building a computational tree of f using simpler functions, such as linear, trigonometric, etc., for which the gradient formula is built into the program and can be calculated exactly. Then, it computes the gradient using the chain rule.
As such, to make gradient descent computations faster on the computer, we replace g with g −1 , making sure that one type of edge (k) gets labelled by g −1 , and two labelled by g. Thus, when we examine the results (eigenvalues of g), we will need to take the inverse again. Note that as the parameters of our program are the entries of g, the computation for gradients of terms involving g is far simpler than those involving g −1 , as the formula for the latter involves many long expressions in terms of g entries (determinants of minors and division by det(g)). Therefore, it makes sense to make such a replacement as it allows us to cut the number of edges labelled by g −1 in half (and double the number of those labelled by g).

DiagPerturbId and GenPerturbId results.
There are different degeneracy patterns among the solutions of DiagPerturbId and Gen-PerturbId. We will list the (lowest) local minima ordered degeneracy patterns along with eigenvalues for the highest k (if it appears for both k), and provide explanation regarding the kaq/sub-ness of the metrics. In these lists, LM = local minima and LLM = lowest local minima. A summary of the results is presented in Figures 9 and 10.
1-L 2,6 and su(4) with k = 100, 200: We found nondiagonal metrics for all patterns above, and diagonal metrics with the (10, 5) pattern only (for both k). The second pattern is interesting as we find it is a pattern breaking of (6,9), where the 1-d eigenspace with corresponding eigenvalue 0.686 along with the 2 and 3-d eigenspaces, form a Lie algebra isomorphic to su(2) ⊕ su(2) (similar to Pauli words with weight one), while the 4 × 4 skew-Hermitian matrix in the other 1-d eigenspace has the eigenvalue structure of a tensor product. For k = 200, some of solutions with pattern (1, 3, 1, 4, 4, 2) have that property as well. Finally, the 6-d degeneracy for the last pattern • LLM: (10, 15, 1, 32, 5) for both k. In diagonal solutions, only (30, 32, 1) appears, while the other two patterns occur in nondiagonal solutions. We know by taking all brackets that the eigenvectors in 1 + 30 (and 30 by itself) give a Lie algebra, but we have not classified it. The same property holds for (10, 15, 1, 32, 5) where 1 + 5 + 10 + 15 (and 5 + 10 + 15) is a 31-d (30-d) Lie algebra.
Given that (9, 27, 27) is the penalty pattern, it can only be related to (1,9,1,12,16,9,8,4, 3) if we were to observe some pattern breaking similar to the instance in su(4). It is the case (up to 1e − 3 errors) that one of the two nines has structure constants similar to that of Pauli words with weight one, i.e. su(2) ⊕ su(2) ⊕ su (2). However, the matrix in each of the 1-d eigenspaces can not be a tensor product of three 2 × 2 matrices, although they can be a tensor product of a 4 × 4 and 2 × 2 matrix. Given that we are only approximating the functional, this suggests that maybe locality does not fully form and one should consider the possibility of partial kaq-ness where there are tensor products of qunits instead of (C 2 ) qubits.

PenMetricOmega and PenMetricPattern results.
Using penalty metric intializations, local minima with such patterns are obtained. As noted in Remark 2.3, we need to check if the eigenvalues are generated by a weight parameter which is smaller than one.
For su(4) and both of L 2,4 , L 2,6 , the answer is always affirmative for the lowest local minima in PenMetricPattern. The loss value attained is ≥ 95% of the lowest loss value found (in LLMs in the previous part), and the weight parameter is pushed more towards zero as k increases.
For su (8), there are local minima in PenMetricPattern, with ≥ 95% of the loss value of LLM previously listed, which have the same ordered degeneracy pattern as the penalty pattern. However the two ratios w 2 /w 1 , w 3 /w 2 do not match; the relative difference is about 30%, more than what could be attributed to numerical errors (although it could be due to missing terms in our expansion of the integral). The lowest local minima are given by degeneracy patterns with order (27 + 9, 27) for L 2,6 , where weight one and three have merged, and (27, 27, 9) with weight two being the smallest for L 2,4 . Finally, The solutions with Pen-MetricOmega initialization achieved ∼ 90% of the LLM loss value, but none were local minima in the space of all metrics.

Summary and Outlook
There are many directions in which this work can be refined. One could start by investigating the nature of those 31-d Lie subalgebras in su (8), while collecting more data for PenMetricOmega and more importantly a larger set of (especially higher) values of k. Results in this work and preliminary unreported results for k > 200 for su(4), suggest that new patterns emerge for higher values of k, which are consolidation of previous ones known for lower ks. Perhaps there is a tendency for energy-close eigenspaces to cluster as k increases while the energies of truly distinct eigenspaces grow further apart. We can then ask about the low-energy limit, and what eigenspaces remain.
With regards to kaq-ness, we still do not have a mechanism/theory to prove it given a certain degeneracy pattern, other than checking the Lie subalgebra structure. By doing so, we are ignoring the possibility that a solution is kaq but does not have any Lie subalgebra among (any proper collection of) its eigenspaces. One way to address this issue is a gradient descent in the space of all Lie algebra isomorphisms j −1 , as mentioned in section 3.
Also, we did not use any post-processing technique like Padé approximates, which could provide a more accurate estimation of the functional.
Another direction is the study of complex fermionic perturbed Gaussian integrals which could reveal similar patterns.
Finally, we would like to have similar numerical simulations for other su(N)s like su (6), to see if kaq-ness would still manifest itself in the form of C 3 ⊗ C 2 , showing that there is nothing special about N = 2 n regarding decomposition to local tensor factors.

Acknowledgments
We would like to thank Carl Bender, Scott Morrison, and Xiaoliang Qi for stimulating conversations.
5. Appendix 5.1. The structure constants can only distinguish weights. We are interested in exploring the symmetries of the anti-commutativity graph of Pauli words G PB n defined below.
Definition 5.1. The graph of Pauli words G PBn has vertices PB n , and two vertices are connected when they anti-commute.
As a simple initial observation, note that all Pauli words have identical spectra, half the eigenvalues are 1 and half -1, thus they are all in the same adjoint orbit, so it turns out to be a rather subtle matter to find asymmetric structures in G PBn which have any hope of distinguishing, for example, high weight from low weight Pauli words. In this appendix, we initiate the study of such subtle asymmetries in the hope that it will allow us in the future to reverse-engineer functionals which break to penalty metrics.
If there are asymmetries in this graph, especially among the vertices with different weights, one can hope to derive a metric sensitive to such asymmetries. Recall a graph is k-ultrahomogeneous if any given isomorphism between two of its induced subgraph with at most k vertices can be extended to an automorphism of the graph. As the following fact demonstrates, G PBn is symmetric.
Remark 5.1. Following this fact, one has graph automorphisms of G PBn where vertices of different weights are sent to each other. Moreover, it is a standard Lie algebra fact that the Lie algebra automorphisms of u(N) all come from conjugations by U(N) and a complex conjugation (Z 2 of the Dynkin diagram), hence by definition of C n , all Lie algebra automorphisms that preserve P n come from the Clifford group and complex conjugation, and the latter acts trivially on PB n .
Remark 5.2. The fact above also means that a simple preference for more commutativity (and less anti-commutativity or less || [a, b] ||) can not be the sole reason behind the possible emergence of a penalty metric. Hence minimizing simple functionals like ||[a, b]|| will not deliver the desired result.
However, we will show that the additional structure imposed by c k ij on this graph is exactly what is needed to distinguish the vertices with different weight. We recall a fact that will be useful for our next theorem.
Fact 5.2. For every a, b ∈ P n , either [a, b] = 0 or ab = −ba. Hence, c k ij = 0, ∀k or c k ij = 0 for a unique k, implying ij = c k ij 2 k, and c k ij satisfies skew-symmetry, i.e. c j ik = −c k ij , etc. We ask if it is possible to have a graph automorphism which preserves the structure constants c k ij while not preserving weights (see (1) for the definition of weight)? The negative answer below gives meaning to the title of this section and provides the theoretical motivation for the possibility of penalty metric to emerge from optimization of Gaussian perturbed integrals: ij , ∀i, j, k if and only if φ is a permutation of tensor factors composed with some local automorphism, i.e. φ ∈ Aut(G PB 1 ) ⊗n ×S n , consequently preserving the weight of each word.
Consider the triangle (i = X1 . . . 1, j = Y 1 . . . 1, k = Z1 . . . 1) which has c k ij = +2. Sitting at the opposite side of each of i, j, k with respect to the triangle edges, and not connected to them, we have the subgraphs S i = Xsu(2 n−1 ), S j = Y su(2 n−1 ), S k = Zsu(2 n−1 ), respectively. Furthermore, every vertex s j ∈ S j is connected to i (and k), and has a counterpart i.s j = s k ∈ S k such that c s k is j = +2. There exists a similar structure for j, S k , S i and k, S i , S j . Finally, there exists a fourth subgraph S ijk = 1su(2 n−1 ) not connected to the triangle ijk. This gives the structure of the whole G PB n when viewed from this triangle (see Figure 11). Figure 11. For each triangle with ij = −ji = k, one can partition the vertices into four sets S i , S j , S k , S ijk , where i is connected to all vertices in S j , S k (and similarly for j, S i , S k and k, S i , S j ) and S ijk is not connected to any of i, j, k. Further, for each s j ∈ S j , there is a corresponding vertex s k ∈ S k such that c s k is j = 0 and is j = c s k is j 2 s k .
Using Fact 5.1, Given any triangle ijk with c k ij = 0, the previous description of G PBn holds for any such i, j, k, with the exception of the fact about c s k is j = +2 for all s j , s k corresponding pairs. This means all triangles with c k ij = 0 have three subgraphs corresponding to each vertex and not connected to them, and each vertex is connected to the other two subgraphs completely. Also, there is a fourth subgraph not connected to the triangle.
Proof. To compute a tensor contraction, one selects indices at the two ends of each edge, e.g. for an edge of type i the selection i, i ′ yields a term like g ii ′ c − i− c − i ′ − . Thus, differentiating such an expression is done by selecting any edge and choosing labels a, b on the two ends, while the rest of the edges are labelled using the diagonal metric D (see Figure 12 for more details). We wish to show that if a = b, any such network contracts to zero. To do so, we prove the structure constants multiplier i,j,k labels on edges c k ij obtained from any label selection over the edges, is zero if exactly one edge is picked nondiagonally, like in a, b. Figure 12. Differentiating the Theta diagram at a diagonal metric D. Differentiating the i or j edges is easy, as it just requires replacing the matrix by the elementary matrix E ab . Note the last term requires computing ∂g −1 ∂g ab at D. Considering the entries, we need to compute (−1) m+n ∂ det(gmn) det(g) ∂g ab , where g mn are the minors. It is straightforward to show that evaluating this expression at a diagonal metric D leaves a nonzero term only for m = b, n = a. The matrix D a,b is the diagonal matrix obtained by removing the two a, b-th rows and the two a, b-th columns.
Without loss of generality, assume that a, b is chosen on an edge of type i. Edges of type j and i form a cycle decomposition of the graph. Consider the product of the indices over those edges; as every edge has two equal indices on both ends (except for a, b), the product is some scalar times a.b.
On the other hand, if the structure constants multiplier is nonzero, it means at every vertex, i, j, k were chosen such that c k ij = 0. Furthermore, in that product, every i is followed by a j giving c k ij 2 k. The ks are each repeated twice as they appear twice on each edge. As a result, the product should be a scalar, but a = b, so a.b is not a scalar and we reach contradiction.
Corollary 5.3. Starting at a diagonal metric, gradient descent on loss functions defined in (18,19) stays in the diagonal metrics space.
Proof. Both (18,19) are linear combinations of tensor networks, and the (det(g) − 1) 2 term can be easily checked to have the same property under gradient descent.
Recall how a subgroup H < C n defines a degeneracy pattern (2.1).
Theorem 5.4. Let D be diagonal with a degeneracy pattern determined according to the orbits of a subgroup H < C n . The gradient flow of loss functions (18,19) preserves not only the diagonal form of the metric, but also the degeneracy pattern modulo the merging of the eigenspaces, e.g {d 1 , d 2 , d 3 } could become {d 1 , d 2 + d 3 }.
Proof. We previously showed that D stays diagonal under gradient flow. Next, we need to show that entries corresponding to the same eigenspace stay equal, i.e. have equal gradients. It suffices to prove so for every tensor network T ; as the determinant term (det(g) − 1) 2 has the same property, the statement would follow.
Assume g aa = g bb where there is an h ∈ H such that h(a) = η a b where η a ∈ {±1}. To show ∂T ∂gaa = ∂T ∂g bb , we use the same picture as in the previous theorem. Every tensor contraction in the summation giving ∂T ∂gaa is given by fixing both ends of some edge e with label a. This tensor is sent to an equal contraction in ∂T ∂g bb by simply applying h on the tensor. Indeed, since g xx = g h(x)h(x) for any x, the coefficients involving g entries are equal. As for the structure constants multiplier, since h(i) = η i h i for some h i in the Pauli word basis, we need to show that i,j,k selected labels This is true for any h ∈ C n . Indeed, as computed in Remark 5.3, and since in the product i,j,k selected labels c k ij every label appears twice and η x ∈ {±1}, ∀x, we are left with the equality of the two products.
Remark 5.4. While on one hand, this theorem implies that one can find local minima with a penalty degeneracy pattern by simply initializing at such a pattern, it also means that the penalty pattern is not special in that regard, as we can find local minima with many other degeneracy patterns.

Individual diagrams are convex.
Experiments show that the individual tensor networks used in the asymptotic expansion of (7) are all positive convex functions with unique minimum at g = Id. We can provide some theoretical evidence in this regard for some of the diagrams.
We start with the Theta diagram Θ(c, g). Assume g = UDU † where D is diagonal and U unitary. We wish to show min D Θ(c, UDU † ) = Θ(c, Id).
We can rewrite the expression above as min D Θ(c U , D) = Θ(c U , Id) = Θ(c, Id), where the edges of tensor c with the tensors U (or U † ) are contracted. We have Writing d i = e w i , Θ(c U , D) becomes a positive sum of exponentials of affine functions in w i , hence convex. By using the arithmetic-geometric mean inequality, the minimum can be seen to occur at g = Id. As previously mentioned, experiments have shown that this minimum is unique.
We can prove similar results for other diagrams such as the Tincan, by bringing the contraction in the form written above for Θ(c U , D), through successive minimization over some edges, while others are assumed fixed and once contracted give two equal tensors being contracted over the same indices, same as in Θ.
For the Tincan T (c, g), the starting idea is similar, fixing U in the diagonalization of g and letting D be the variable. However, at first, the vertical edges are also assumed to be fixed, meaning the variable is D on the 4 vertical edges. The contraction of tensors along the vertical lines gives two 4-valent tensors A, B connected with D (or D −1 ) along each edge. The two tensors are equal due to the symmetry of the diagram and the previous contraction being done along the same index. Hence, we obtain an expression similar to the one for Θ(c U , D), implying the minimum is attained at D = Id on the horizontal edges, given the vertical edges. Next step is to consider D on the vertical edges as the variable and the proof follows similarly.