Optimization Algorithms for Multi-species Spherical Spin Glasses

This paper develops approximate message passing algorithms to optimize multi-species spherical spin glasses. We first show how to efficiently achieve the algorithmic threshold energy identified in our companion work (Huang and Sellke in arXiv preprint, 2023. arXiv:2303.12172), thus confirming that the Lipschitz hardness result proved therein is tight. Next we give two generalized algorithms which produce multiple outputs and show all of them are approximate critical points. Namely, in an r-species model we construct $2^r$ approximate critical points when the external field is stronger than a "topological trivialization" phase boundary, and exponentially many such points in the complementary regime. We also compute the local behavior of the Hamiltonian around each. These extensions are relevant for another companion work (Huang and Sellke in arXiv preprint, 2023. arXiv:2308.09677) on topological trivialization of the landscape.


Introduction
This paper studies the efficient optimization of a family of random non-convex functions H N defined on high-dimensional spaces, namely the Hamiltonians of multi-species spherical spin glasses. Mean-field spin glasses have been studied since [25] as models for disordered magnetic systems and are also closely linked to random combinatorial optimization problems [12,19,22]. In short, their Hamiltonians are certain polynomials in many variables with independent centered Gaussian coefficients.
The purpose of this work is to develop efficient algorithms to optimize H N . Our companion work [16] derives an algorithmic threshold ALG and proves no optimization algorithm with suitably Lipschitz dependence on H N can achieve energy better than ALG with more than exponentially small probability. The value ALG is expressed as the maximum of a variational principle over several increasing functions, which was shown to be achieved by joining the solutions to a pair of well-posed differential equations. The first main contribution of this paper is to show that given a solution to this variational problem, so-called approximate message passing (AMP) algorithms efficiently achieve the value ALG. We note that several previous works [4,21,24,26] have given similar algorithms for mean-field spin glasses with 1 species, and our algorithm is in line with the latter three.
Furthermore, we use these AMP algorithms to aid a detailed study of the landscape of H N by probing neighborhoods of special critical points. This is related to a second companion work [17] which identifies the phase boundary for topological trivialization of H N , where the number of critical points is a constant independent of N . Therein, Kac-Rice estimates are used to show that for r-species models (defined on a product of r spheres) in the "super-solvable" regime with strong external field, H N has exactly 2^r critical points with high probability. In this paper, we give a signed AMP algorithm which explicitly approximates each of these critical points. Moreover in the complementary "sub-solvable" regime, we use AMP to construct exp(cN ) separated approximate critical points with high probability. This implies the failure of strong topological trivialization as defined in [17], which is proved therein to hold for super-solvable models. Finally, the machinery of AMP allows us to compute the local behavior of H N around these algorithmic outputs, giving even more precise information about the landscape.

The random function H N can also be described as the Gaussian process on B N with the covariance above. We will also often refer to the product of spheres (1.6). It will be useful to define, for s ∈ S,

The Value ALG
Given ( λ, ξ ), the ground state energy of the associated multi-species spherical spin glass is OPT = OPT(ξ ) = p-lim N →∞ max σ ∈B N H N (σ )/N . In the bipartite SK model mentioned above, OPT is the limiting operator norm of an IID Gaussian rectangular matrix with aspect ratio λ 1 /λ 2 . For large k, the asymptotic operator norm of an IID random k-tensor is similarly encoded as OPT(ξ ) for some ξ (with e.g. r = k). Perhaps surprisingly, it is generally believed that polynomial-time algorithms are not capable of finding σ ∈ B N such that H N (σ ) ≥ OPT(ξ ) − ε with high probability as N → ∞. Our work [15] showed that in the single species case (and with all terms of even degree), one can identify an exact threshold ALG for the performance of a class of Lipschitz algorithms which includes gradient-based methods and Langevin dynamics. More recently in [16], we extended the algorithmic hardness direction of this result to multi-species spherical spin glasses, using a new proof technique that applies even when OPT is not known. The purpose of this paper is to give explicit algorithms attaining the value ALG, and we present here the formula for this value.
The algorithmic threshold ALG is given by the following variational principle. This is a simplification of the more general variational formula [16, Equation (1.7)], obtained by a partial characterization of its maximizers [16, Theorem 3]. The following generic assumption is needed therein to ensure well-posedness of the ODE (2.3) used in this description, and we will freely assume it throughout the paper.
Assumption 1 All quadratic and cubic interactions participate in H , i.e. the degree 2 and degree 3 coefficients of ξ are strictly positive coordinate-wise. We will call such models non-degenerate. Since this condition depends only on ξ , we similarly call ξ non-degenerate.
To optimize H N for degenerate ξ , it suffices to apply our algorithms to a slight perturbation ξ̃ which is non-degenerate and satisfies ‖ξ̃ − ξ ‖ C 3 ([0,1] r ) ≤ ε to obtain the guarantees in this and the next section. Here, C 3 ([0, 1] r ) denotes the corresponding norm. Since both the ground state and the more general ALG formula in [16] (allowing degenerate ξ ) vary continuously in ξ , there is essentially no loss of generality in assuming non-degeneracy.
The formula for ALG is described by two cases depending on whether 1 = 1 S is super-solvable as defined below.
We also adopt the convention that 0 is always super-solvable, and solvable if h = 0.
The following will be useful.
and the supremum is uniquely attained at v.
It is easy to see that any x ∈ (0, 1] S is sub-solvable when h = 0, and that super-solvability is a coordinate-wise increasing property of h. For our purposes, an external field is large if 1 is super-solvable and small if 1 is strictly sub-solvable. (Unfortunately we do not have more refined intuition for the precise form of M * above, nor the resulting phase boundary between super- and sub-solvability.) As shown in our companion work [17], in super-solvable models the external fields h are strong enough to trivialize the "glassy" nature of the landscape for H N . Namely the number of critical points is exactly 2^r with high probability, the minimum number for any generic smooth ("Morse") function on a product of r spheres. By contrast in the sub-solvable case, the expected number of critical points is exponentially large in the dimension N . As explained below, the optimization algorithms are also simpler in the super-solvable case.
When 1 is strictly sub-solvable, the formula for ALG becomes more complicated and depends on the optimal choice of an increasing C 2 function : [q 1 , 1] → [0, 1] S satisfying certain conditions. We term such functions pseudo-maximizers and defer the formal definition to Definition 2.1. Note that q 1 ∈ [0, 1] is not fixed, but is determined by the choice of pseudo-maximizer. Definition 1.5 (Algorithmic Threshold, Strictly Sub-solvable Case) If 1 is strictly sub-solvable, then ALG is the maximum of A over all pseudo-maximizers of A. See [16, Remark 1.3] for an approach to maximizing A using the well-posedness of the ODEs (2.2), (2.3) in the definition of pseudo-maximizer. The computational complexity of this task is in particular independent of N .
The following theorem is our main result. We equip the space H N of Hamiltonians H N with the following distance. We identify H N with its disorder coefficients (G (k) ) k≥2 , which we arrange in an arbitrary but fixed order into an infinite vector g(H N ), and define the distance accordingly. (In other words, ‖H N − H N ‖ 2 2 is the sum of squared differences (g i 1 ,...,i k − g i 1 ,...,i k ) 2 between all corresponding pairs of coefficients in (G (k) ) k≥2 and (G (k) ) k≥2 .) We say an algorithm is τ -Lipschitz if it is τ -Lipschitz with respect to this distance. Note that ‖H N − H N ‖ 2 may be infinite, and if so this condition holds vacuously for such pairs (H N , H N ). Here and throughout, all implicit constants may depend also on (ξ, h, λ).
Theorem 1 For any ε > 0, there exists an O ε (1)-Lipschitz A N : H N → B N achieving energy at least (ALG − ε)N with high probability. The main result in our companion work [16, Theorem 1] states that any τ -Lipschitz A N : H N → B N satisfies the complementary hardness bound, for the same threshold ALG and N sufficiently large. Together these results thus characterize the best possible Lipschitz optimization algorithms for multi-species spherical spin glasses.
We prove Theorem 1 with an explicit algorithm based on AMP, following a recent line of work [4,5,21,24,26]. Such algorithms are shown to be Lipschitz (up to modification on a set with exp(−cN ) probability) in [15, Sect. 8]. AMP algorithms also have computational complexity which is linear in the input size when H N is a polynomial of finite degree (modulo solving for the pseudo-maximizer, a task that does not depend on N ). See [4, Remark 2.1] for related discussion on this last point.
Similarly to [5,24], our algorithm has two phases, a "root-finding" phase and a "tree-descending" phase. Roughly speaking, the set of points reachable by our algorithm has the geometry of a densely branching ultrametric tree, which is rooted at the origin when h = 0 and more generally at a random point correlated with h. The first phase identifies this root, and the second traces a root-to-leaf path of this tree. The structure of the first phase is similar to the original AMP algorithm of [9] for the SK model at high temperature, while the latter incremental AMP technique was introduced in [21].
For the purposes of this paper, the significance of (super-, sub-)solvability is as follows. When the external field is sufficiently large, the root moves all the way to the boundary of B N (in all r species) and the algorithmic tree becomes degenerate. In [16], it is shown that the external field is large enough for this to occur if and only if 1 is super-solvable. Moreover, [17] shows this condition coincides with strong topological trivialization (defined therein) of the optimization landscape.
In Sect. 3 we extend our main algorithm in several ways. In Sect. 3.1 we define 2^r signed generalizations of the root-finding algorithm with similar behavior. In Sect. 3.2 we compute the gradients of H N at the points output by our algorithm, in both the super-solvable and sub-solvable cases for 1. In particular, we show that they are approximate critical points on the product of spheres S N (defined in (1.6)). As explained in Remark 3.1, in the strictly super-solvable case these 2^r outputs approximate the 2^r genuine critical points of H N on S N . The sub-solvable case of this computation is used in our companion paper [17, Theorem 1.5(c) and Sect. 5.3] to show failure of annealed topological trivialization in the sub-solvable case. Finally in Sect. 3.3 we give a modification of the tree-descending phase for the sub-solvable case. It constructs exp(cN ) well-separated approximate critical points arranged in a densely branching ultrametric tree; this implies the failure of strong topological trivialization in [17, Definition 6 and Theorem 1.6].

Notations
Throughout, we will use boldface lowercase letters (u, v, . . .) to denote vectors in R N , and lowercase letters with vector sign ( u, v, . . .) to denote vectors in R S ≅ R r . Similarly, boldface uppercase letters denote matrices or tensors in (R N ) ⊗k , and non-boldface uppercase letters denote matrices or tensors in (R r ) ⊗k . We define the normalized inner product u, v N for u, v ∈ R N , with corresponding norm ‖•‖ N . Next, a N ≈ b N means that a N − b N converges in probability to 0. Analogously, for two vectors u N , v N , we write u N ≈ v N when ‖u N − v N ‖ N converges in probability to 0. We denote limits in probability by p-lim N →∞ . Analogously we write ≈ δ to denote asymptotic equality as δ → 0.
For any tensor A ∈ (R N ) ⊗k , we define the operator norm ‖A‖ op . The following proposition shows that with exponentially good probability, the operator norms of all constant-order gradients of H N are bounded on the appropriate scale.
Proposition 1.6 ([16, Proposition 1.13]) For any fixed model (ξ, h) there exists a constant c > 0, a sequence (K N ) N ≥1 of convex sets K N ⊆ H N , and a sequence of constants (C k ) k≥1 independent of N , such that the following properties hold. (1.9)

Achieving Energy ALG
In this section we prove Theorem 1 by exhibiting an AMP algorithm. Throughout this section, Assumption 1 on non-degeneracy of ξ will be enforced without loss of generality.

Definition of Pseudo-Maximizer
As mentioned before Definition 1.5, the threshold ALG in the sub-solvable case depends on a notion of pseudo-maximizer. We now provide this definition, which was derived in [16, Theorem 3] as a necessary condition for a function to maximize A defined in (1.8) (and it is proved therein that a maximizer always exists).
We now give an efficient AMP algorithm achieving energy A( ) for any pseudo-maximizer. In particular, for the optimal pseudo-maximizer this achieves energy ALG.

Review of Approximate Message Passing
Here we recall the class of AMP algorithms, specialized to our setting of interest. We initialize AMP with a deterministic vector w 0 with coordinates depending only on the species. Let f t,s : R t+1 → R be a Lipschitz function for each (t, s) ∈ Z ≥0 × S. For (w 0 , w 1 , . . . , w t ) ∈ R N ×(t+1) , let f t (w 0 , w 1 , . . . , w t ) ∈ R N be given by applying f t,s coordinate-wise within each species. We generate subsequent iterates through recursions of the following form, where ons t is known as the Onsager correction term: (2.5) Here W t s , M t s are defined as follows. W 0 s = w s and the variables ( W t s ) (t,s)∈Z ≥1 ×S form a centered Gaussian process with covariance defined recursively below (in particular, different species are independent). The following state evolution characterizes the behavior of the above iterates. It states that for each s ∈ S, when i ∈ I s is uniformly random, the sequence of coordinates (w 1 i , w 2 i , . . . , w t i ) has the same law as the corresponding Gaussian sequence.
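As a concrete illustration of the recursion above, the following minimal sketch runs a single-species AMP iteration for a GOE matrix (the quadratic case of H N ). The choice f = tanh and all problem sizes are illustrative placeholders, not the nonlinearities f t,s used in the paper.

```python
import numpy as np

def goe(n, rng):
    """Sample a GOE-type matrix normalized so the spectrum is O(1)."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2 * n)

def amp(W, w0, n_iter, f=np.tanh, df=lambda x: 1 - np.tanh(x) ** 2):
    """Run w^{t+1} = W f(w^t) - b_t f(w^{t-1}) with Onsager coefficient
    b_t = average of f'(w^t), the quadratic case of recursion (2.5)."""
    n = len(w0)
    w_prev, w = np.zeros(n), w0.copy()
    for _ in range(n_iter):
        b = df(w).mean()                    # Onsager correction coefficient
        w_next = W @ f(w) - b * f(w_prev)   # Onsager-corrected AMP step
        w_prev, w = w, w_next
    return w

rng = np.random.default_rng(0)
n = 500
W = goe(n, rng)
w = amp(W, np.ones(n), n_iter=10)
print(w.shape, bool(np.isfinite(w).all()))
```

State evolution then predicts that the empirical coordinate distribution of each iterate is asymptotically Gaussian with a recursively computable covariance.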

Proposition 2.2 For any pseudo-Lipschitz function ψ and ∈
This proposition allows us to read off normalized inner products of the AMP iterates, since e.g.
Proposition 2.2 is proved in Appendix 1. In fact we show a slight generalization allowing f t = f t (w 0 , . . . , w t , g 0 , . . . , g t ) to depend also on independently generated vectors (g 0 , . . . , g t ) ∈ R N ×(t+1) . When using this extension, we will always take each g t ∼ N (0, I N ) to be standard Gaussian. The more general result essentially says that g t still acts as an independent Gaussian for the purposes of state evolution. Since this is relatively intuitive, we refer to Theorem 2 in the appendix for a precise statement. For random matrices (i.e. the case of quadratic H ) there is a considerable literature establishing state evolution in many settings beginning with [7,9] and later [6,8,10,11,13] (see also [14] for a survey of many statistical applications). The generalization to tensors was introduced in [23] and proved in [4], whose approach we follow.

Stage I: Finding the Root of the Ultrametric Tree
Our goal in this subsection will be to compute a vector m satisfying p-lim N →∞ R(m, m) = (q 1 ) and with the correct energy value (as stated in Lemma 2.5). We take as given a maximizer of A with domain [q 1 , 1]. Recall that (q 1 ) is super-solvable: either 1 is strictly sub-solvable, in which case (q 1 ) is solvable, or 1 is super-solvable, in which case (q 1 ) = (1) = 1.
We use the initialization and the species-wise vector a ∈ R S defined in the displays below. Subsequent iterates are defined via the recursion (2.10)–(2.12). The last term in (2.10) comes from specializing the formula (2.6) for the Onsager term.
Next, recalling (2.8), let (W j s , M j s ) j≥0,s∈S be the state evolution limit of the coordinates of (w 0 , m 0 , . . . , w k , m k ) as N → ∞. Concretely, each W j s is Gaussian with mean h s ; we next compute the covariance of these Gaussians in (2.13). Define the (deterministic) R S ≥0 -valued sequence ( R 0 , R 1 , . . . ) of asymptotic overlaps recursively by R 0 = 0 and R k+1 = α( R k ).

Lemma 2.3
For integers 0 ≤ j < k, the following equalities hold (the first in distribution), so that (2.14) implies (2.16) for each j ≥ 0. On the other hand, state evolution directly implies that if (2.16) holds for j then (2.14) holds for j + 1. This establishes (2.14) and (2.16) for all j ≥ 0.
We similarly show (2.15) and (2.17) together by induction, beginning with (2.15). When j = 0 it is clear because W k s is mean zero and independent of W 0 s . Just as above, it follows from state evolution that (2.15) for ( j, k) implies (2.17) for ( j, k), which in turn implies (2.15) for ( j + 1, k + 1). Hence induction on j proves (2.15) and (2.17) for all ( j, k).
The next lemma is crucial and uses super-solvability of (q 1 ). Lemma 2.4 The limit R ∞ ≡ lim j→∞ R j exists and equals (q 1 ).
Proof First we observe that α (recall (2.13)) is coordinate-wise strictly increasing, in the sense that if 0 ⪯ x ≺ y then α( x) ≺ α( y). Moreover α( 0) ≻ 0 (assuming h ≠ 0, else the result is trivial), so the iterates R j increase coordinate-wise to a limit R ∞ satisfying α( R ∞ ) = R ∞ . It remains to show that the above forces R ∞ = (q 1 ) to hold. Let M ∈ R S×S be the matrix with entries M s,s' = (d/dt) α s ( (q 1 ) + te s' )| t=0 for e s' a standard basis vector. Then M is the derivative matrix for α at (q 1 ). We claim that for any entry-wise non-negative vector w, the inequality (2.18), i.e. (M w) s ≤ w s , holds for some s ∈ S. Indeed, suppose to the contrary that (M w) s > w s for all s ∈ S. This rearranges to M * ( (q 1 )) w ≺ 0 (recall (1.7)). Proposition 1.3 then implies that λ min (M * ( (q 1 ))) < 0, so (q 1 ) is strictly sub-solvable, which is a contradiction. Thus (2.18) holds for some s ∈ S. Now suppose for sake of contradiction that R ∞ ≺ (q 1 ), let w = (q 1 ) − R ∞ , and choose s ∈ S such that (2.18) holds. Write f (t) = α s ( (q 1 ) + t w). Since α s is a polynomial with non-negative coefficients and ξ is non-degenerate, f is strictly convex and strictly increasing. Strict convexity gives f (−1) > f (0) − f'(0) ≥ ( R ∞ ) s , i.e. α s ( R ∞ ) > ( R ∞ ) s . This contradicts α( R ∞ ) = R ∞ . Therefore R ∞ = (q 1 ), completing the proof.
Remark 2.1 Super-solvability of (q 1 ) is a tight condition for the above argument to hold, as the matrix M above needs to have Perron-Frobenius eigenvalue at most 1. Indeed suppose that (q 1 ) were chosen so that λ 1 (M) > 1. Then there exists w ∈ R S >0 with M w ≻ w. Letting x = (q 1 ) − ε w for small ε > 0, we find α( x) ≺ x. Monotonicity implies that α maps the compact, convex set [ 0, x] to itself. By the Brouwer fixed point theorem, a fixed point of α strictly smaller than (q 1 ) exists whenever (q 1 ) is strictly sub-solvable.
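The monotone convergence mechanism in Lemma 2.4 can be seen in a scalar caricature: for an increasing map with α(0) > 0 and a unique fixed point, the iterates R k+1 = α( R k ) started from 0 increase to that fixed point. The specific map below is made up for illustration and is not the α of (2.13).

```python
# Scalar toy model of the overlap recursion R_{k+1} = alpha(R_k):
# alpha is increasing with alpha(0) = 0.3 > 0 and fixed point 0.6.

def alpha(x):
    return 0.5 * x + 0.3

r = 0.0
traj = [r]
for _ in range(100):
    r = alpha(r)          # iterate toward the fixed point
    traj.append(r)

assert all(a <= b for a, b in zip(traj, traj[1:]))  # monotone increase
print(round(r, 6))  # converges to the fixed point 0.6
```

In the multi-species setting the same monotonicity holds coordinate-wise, and super-solvability of (q 1 ) rules out any smaller fixed point.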
We finish our analysis of the first AMP phase by computing the asymptotic energy it achieves. As expected, the resulting value agrees with the first term in the formula (1.8) for ALG.

Lemma 2.5
Proof We use the identity below and interchange the limit in probability with the integral. To compute p-lim N →∞ m k , ∇ H N (t m k ) N we introduce an auxiliary AMP step. For the first term, recalling (2.11) yields the displayed limit. Integrating with respect to t, and switching the roles of s, s' in applying (2.20), we thus find the main contribution. Finally the external field h gives the remaining energy contribution. Since R ∞ = (q 1 ) by Lemma 2.4, we conclude the proof.

Stage II: Descending the Ultrametric Tree
We now turn to the second phase, which uses incremental approximate message passing. Choose a large integer, and with δ = its inverse let the overlap grid be as displayed. We then define n with the square-root taken entrywise, and g ∼ N (0, I N ). The point n will be the "root" of our IAMP algorithm. Moreover we set = max{ ∈ Z + : q δ ≤ 1 − 2δ}. We also define for s ∈ S and ≤ ≤ the constants in (2.23). Set z = w − h. We will define (z ) ≥ +1 via (2.24). The Onsager coefficients d , j are given by (2.7) and will not appear explicitly in any calculations until Sect. 3.2. Note that formally, they may depend on the first iterates, since (2.24) is a continuation of the same AMP iteration. To complete the definition of the iteration (2.24), for s(i) = s and ≥ we set the nonlinearity (2.25), where

(2.26)
The algorithm A outputs the normalization (2.27) of n , where the power −1/2 is taken entry-wise. We show in (2.32) below that the two agree in the limit lim →∞ p-lim N →∞ . Hence we will often not distinguish between the two and just consider n to be the output. This makes essentially no difference by virtue of Proposition 1.6.
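The martingale structure of the tree-descent can be mimicked numerically: summing independent Gaussian increments with variances (q +1 ) − (q ) produces a vector whose normalized squared norm concentrates near (1) − (q 1 ). The function `phi` below is a hypothetical stand-in for the pseudo-maximizer, chosen only for illustration.

```python
import numpy as np

# Toy tree-descent: orthogonal Gaussian increments with step sizes
# u_l = sqrt(phi(q_{l+1}) - phi(q_l)), mimicking the IAMP martingale.

rng = np.random.default_rng(1)
N, L = 2000, 50
q = np.linspace(0.2, 1.0, L + 1)   # grid q_1 < ... < 1 with delta = 0.8 / L

def phi(t):
    """Hypothetical increasing C^2 stand-in for the pseudo-maximizer."""
    return t ** 2

n_vec = np.zeros(N)
for l in range(L):
    u = np.sqrt(phi(q[l + 1]) - phi(q[l]))
    n_vec += u * rng.standard_normal(N)  # fresh, hence orthogonal, increments

# The normalized squared norm concentrates around phi(1) - phi(q_1) = 0.96.
print(round(n_vec @ n_vec / N, 2), round(phi(1.0) - phi(q[0]), 2))
```

In the actual algorithm the increments come from gradient evaluations of H N rather than exogenous noise, but state evolution shows they behave like the orthogonal Gaussian steps above.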
The state evolution limits of z and n are described by time-changed Brownian motions with total variance s (q δ ) in species s after iteration . This is made precise below.

(2.31)
Proof The fact that these sequences are Gaussian processes is a general fact about state evolution (the external Gaussian g is permitted in Theorem 2). We proceed by induction on ≥ . The proof is similar to [24, Sect. 8] so we give only the main points (in fact (2.21) simplifies the corresponding construction therein, which avoided the use of external Gaussian noise). We will make liberal use of (2.8) to connect asymptotic overlaps before and after applying ∇ H N (•).
For base cases, the case of (2.30) is immediate from (2.16). The base case of (2.31) follows from (2.22), and thus the + 1 case of (2.30). The main computation for the base case uses the general AMP statement of Theorem 2. For inductive steps, we always have the displayed identity by state evolution. It follows by the inductive hypothesis of (2.28) that the stated overlap formula holds for j ≤ . Plugging into the above yields, for j ≤ , an expression which depends only on min( j, ), so (2.28) follows. The others are proved by similar computations.
Equation (2.31) implies that R(n δ , n δ j ) ≈ (q δ ( ∧ j)+1 ), which exactly corresponds to the previous sections of the paper. In particular it implies that the final iterate n δ satisfies

(2.33)
Proof Observe that h, n − n N ≈ 0 because the values (N δ ,s ) ≥ form a martingale sequence for each s ∈ S. Therefore it suffices to find the in-probability limit of the remaining sum. We write it out and use a Taylor series approximation for each term. In particular for F ∈ C 3 (R; R), applying Taylor's approximation theorem twice yields a second-order estimate. Assuming sup ‖n ‖ N ≤ 1, which holds with probability 1 − o N (1) by state evolution and the definition of , we apply this estimate. Proposition 1.6 implies a bound with deterministic constants c, C. On the other hand for each ≤ ≤ − 1 we have a corresponding p-lim estimate. Summing and noting that − ≤ δ −1 yields the high-probability estimate, so this term vanishes as δ → 0. It remains to prove the main equality. To establish this it suffices to show for each species s ∈ S the equality (2.34). Observe by (2.24) the rearrangement (2.35). Passing to the limiting Gaussian process (Z δ k ) k∈Z + via state evolution, and using that (N δ k ) k∈Z + is a martingale process, it follows that all right-most expectations vanish. We conclude the claimed limit. In the second-to-last step we used independence of Z δ ,s increments, which follows from Lemma 2.6, while the last step used (2.23) and (2.29). Combining with [16, Lemma 3.7] on discrete approximation of the integral in A implies (2.34).

Proof of Theorem 1
We take A as in (2.27) for a large constant depending on (ε, ξ, h, λ).

First, P[H N (A(H N )) ≥ (ALG − ε)N ] ≥ 1 − o(1) follows from combining Lemmas 2.5, 2.7 and the fact that (recall (2.32)) the normalization changes the energy negligibly. Next, let K N ⊆ H N be as in Proposition 1.6. We recall that P[H N ∈ K N ] ≥ 1 − e −cN . Exactly as in [15, Theorem 10] it follows that there is a C(ε)-Lipschitz function A : H N → R such that A and A agree on K N . Moreover (1.6) and concentration of measure on Gaussian space imply that H N ( A(H N )) is O(N 1/2 )-sub-Gaussian. In light of (2.36) and the above, this concludes the proof.

Signed AMP
In our companion paper [17], we show that strictly super-solvable models have w.h.p. exactly 2^r critical points, indexed by sign patterns ∈ {±1} r with the following physical meaning. Consider first the extreme case of a linear Hamiltonian, with external field h = h 1 where all entries of h are nonzero, and no other interactions. This model clearly has 2^r critical points, which are the products of the maxima and minima on the spheres { x s 2 2 = λ s N } corresponding to each species s ∈ S, and the signs record whether the critical point is a maximum or minimum in each species. As explained in [17, Sect. 6.6], if a strictly super-solvable H N is gradually deformed to a linear function (staying inside the strictly super-solvable phase), the critical points move stably, and over this process their Hessian eigenvalues do not cross zero. Thus, each critical point of H N can also be associated with a sign pattern.
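The linear case described above is easy to verify directly. The following sketch enumerates the 2^r sign patterns and checks that the Riemannian gradient (the projection of h s orthogonal to x s ) vanishes at each candidate critical point; all dimensions and fields are toy values chosen for illustration.

```python
import numpy as np
from itertools import product

# For the purely linear Hamiltonian H(x) = sum_s <h_s, x_s> on a product of
# spheres ||x_s||^2 = lam_s * N, the 2^r critical points are
# x_s = eps_s * sqrt(lam_s * N) * h_s / ||h_s||, one per sign pattern eps.

rng = np.random.default_rng(2)
r, N_s, lam = 3, 20, [1.0, 0.5, 2.0]   # toy species count, size, weights
h = [rng.standard_normal(N_s) for _ in range(r)]

count = 0
for eps in product([1, -1], repeat=r):
    for s in range(r):
        x_s = eps[s] * np.sqrt(lam[s] * N_s) * h[s] / np.linalg.norm(h[s])
        # Riemannian gradient of <h_s, .> on the sphere through x_s:
        grad = h[s] - (h[s] @ x_s) / (x_s @ x_s) * x_s
        assert np.linalg.norm(grad) < 1e-10
    count += 1
print(count)  # 2**r = 8 critical points, one per sign pattern
```

The signed AMP iterations below play the analogous role for general strictly super-solvable models, where the critical points are no longer explicit.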
We now show that the root-finding algorithm defined in Sect. 2.3 can be generalized to find all 2^r critical points in a strictly super-solvable model. More precisely, it finds 2^r approximate critical points, one in a neighborhood of each exact critical point of the model, from which the exact critical points can be computed by Newton's method (see Remark 3.2). For general models, it finds 2^r approximate critical points on the product of spheres with self-overlap (q 1 ). The restriction of H N to this set, considered as a spin glass in its own right (see [16, Remark 1.2]), is a solvable model.
Fixing ∈ {±1} r , the analogous iteration to (2.10) is given below. The change of sign does not affect the proofs or statements of Lemmas 2.3, 2.4. Indeed a 2 s only changes to 2 s a 2 s in the former proof, which is no change at all since the signs square to one. The generalization of Lemma 2.5 is as follows.
Proof The proof is similar to that of Lemma 2.5. The main calculation now becomes:

Moreover the external field h now contributes energy
Combining gives the desired statement.
Remark 3.1 One can sign the IAMP phase as well by redefining (2.26) as displayed. The resulting output n ( ) then achieves the asymptotic energy below (recall (1.8)). However it is unclear whether n ( ) can be made to obey any notable properties. We will show that the signed outputs m k ( ) of the first phase above are approximate critical points for H N (and in [17] that all near-critical points are close to one of them). By contrast, for the output of signed IAMP to be a critical point, must satisfy a signed version of the tree-descending ODE (2.3). Since the relevant quantity appears inside a square root in (2.3), it is unclear when to expect solutions to exist. Furthermore the proof in [16] of well-posedness relies on positivity of coefficients (via Perron-Frobenius theory) and does not seem to generalize. Additionally, a solution would not seem to correspond to a maximizer of any variational problem as in (1.8). As a result we do not know how to prove a solution exists in the signed case. However if one takes as given a smooth function satisfying the signed tree-descending ODE, the iteration (3.2) starting from the signed initialization n ( ) = m ( ) + √ (q 1 + δ) − (q 1 ) g would produce an approximate critical point n ( ) which still satisfies (3.3).

Gradient Computation and Connection to E ∞
We now compute the gradients at the outputs, showing that m ( ) and n ( ≤ ≤ ) are approximate critical points for the restriction of H N to the products of r spheres with suitable radii passing through them. For σ to be an approximate critical point means precisely that there exist coefficients A ∈ R r such that the displayed equation holds. In our case, these coefficients will be given as follows. If 1 is strictly sub-solvable (so q 1 < 1), define A(q) for q ∈ [q 1 , 1] as below. Note that, by (2.2), this is consistent with the definition of A(q 1 ) above, in the sense that A(q 1 ; 1) = A(q 1 ). We take this to be the definition of A(q 1 ) if 1 is super-solvable (and q 1 = 1). Combining with (3.6), the result follows. Remark 3.2 In [17, Theorems 1.5 and 1.6], we show that when ξ is strictly super-solvable, H N has exactly 2^r critical points {x( )} ∈{−1,1} r , and moreover all ε-approximate critical points with small Riemannian gradient lie near one of them.

Proposition 3.2 If is a pseudo-maximizer for A (recall Definition 2.1) then for any
It follows from Proposition 3.2 that each m ( ) is an ε-approximate critical point for large enough = (ξ, ε). In fact the preceding gradient computation shows that the values agree, implying that ‖m ( ) − x( )‖ N ≤ o →∞ (1) (compare with [17, Definition 5, Eq. (1.15)]). Moreover by [17, Theorem 1.6] each Riemannian Hessian ∇ 2 sp H N (x( )) has condition number at least 1/C(ξ ). It follows that each critical point x( ) can be efficiently computed to arbitrary accuracy by applying Newton's method from m ( ) for large enough = (ξ ). (By contrast, the convergence of m ( ) itself to x( ) is only in the careful double-limit sense lim →∞ lim N →∞ .) Proposition 3.3 If is a pseudo-maximizer for A, then for any -indexed sequence (q * , ) = (q * , ) ≥1 such that q * ∈ [q 1 , 1], ≤ ≤ and lim →∞ |q * − q δ | = 0, we have the stated limit. Proof For notational convenience we assume (q * , ) = (1, ); the proof is identical in general. Recall the rearrangement (2.35), restated in (3.9). So far we did not have to compute d , j . We do this now, focusing on the IAMP phase. Recalling (2.25), the IAMP iteration used the nonlinearity displayed there. Using the formula (2.7) we find the coefficients. Note that since ∈ C 2 ([q 1 , 1]) we have the uniform-in-q δ j approximations (recall (3.5)). Substituting into (3.9), we obtain (3.11). Since the increments (n j+1 − n j ) are orthogonal in the state evolution sense, it easily follows that the approximation of C j,s by C s (q δ j ) commutes with summation.
Note that we manifestly have C(1) = A(1). We claim the function C is constant on [q 1 , 1]. This is equivalent to showing that for each s the identity (3.12) holds. Write f s (q) = (q) s (q), where the latter factor is independent of s since solves the tree-descending ODE (2.3). Then using the chain rule, the left-hand side of (3.12) equals the right-hand side of (3.12). Therefore C(q) = A(1) is constant as claimed. Finally it is clear that the n coefficient in (3.11) approximately equals C(q 1 ) and hence also A(1). Then (3.11) implies the claim, which completes the proof.
From the point of view of [16], the fact that ‖∇ sp H N (n )‖ N ≈ 0 is to be expected. At least for ( ; q 1 ) maximizing A, if this were not true then an extra step of gradient descent would essentially suffice to reach energy strictly better than ALG, contradicting the optimality in [16, Theorem 1]. However the radial derivative computation is interesting in its own right and lets us study the spherical Hessian around an output σ . We believe that Corollary 3.4 can be strengthened to hold with λ 1 rather than λ εN . This seems to require a more precise Gaussian conditioning argument around A(H N ), which we chose not to pursue. Notably, Corollary 3.4 explains the equality ALG = E ∞ for pure models, which we derived manually in [16]. Indeed for a pure model with ξ = ∏ r i=1 x a i i , the energy and radial derivative are deterministically proportional. It follows (using again the N 2 large deviation rate for the spectral bulk) that there is a unique energy level E ∞ at which critical points can have spherical Hessian obeying the conclusion of Corollary 3.4. This is the definition of E ∞ given in [1,20].
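The Newton refinement step of Remark 3.2 can be illustrated on a toy objective with a nondegenerate critical point: from a nearby approximate critical point, Newton's method converges rapidly when the Hessian is well-conditioned there. The quadratic function below (critical point at (1, 2)) is invented purely for illustration.

```python
import numpy as np

# Newton refinement from an approximate critical point of a toy function
# with gradient grad and well-conditioned Hessian hess.

def grad(x):
    return np.array([2 * (x[0] - 1) + 0.5 * (x[1] - 2),
                     0.5 * (x[0] - 1) + 4 * (x[1] - 2)])

def hess(x):
    return np.array([[2.0, 0.5], [0.5, 4.0]])

x = np.array([1.3, 1.6])                          # approximate critical point
for _ in range(5):
    x = x - np.linalg.solve(hess(x), grad(x))     # Newton step

print(np.round(x, 8))  # converges to the exact critical point [1, 2]
```

For the spin glass, the role of the toy Hessian is played by the Riemannian Hessian ∇ 2 sp H N (x( )), whose invertibility near x( ) is supplied by [17, Theorem 1.6].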

Branching IAMP and Exponential Concentration
Here we modify the second stage of our IAMP algorithm (which requires q 1 < 1) to use external Gaussian randomness in a small number of increment steps. This allows the construction of an ultrametric tree of outputs with large constant depth and exp(cN ) breadth, with pairwise overlaps given by . More precisely, for any finite ultrametric space X = (x 1 , . . . , x M ), M = exp(cN ), of diameter at most 1 − q 1 , branching IAMP outputs (σ 1 , . . . , σ M ) whose pairwise overlaps converge in probability to the prescribed ultrametric values. We use an approach suggested in [3] by injecting external Gaussian noise g (i) into the IAMP phase of the algorithm at depth q i ∈ (q 1 , 1). Importantly, this gives an explicit construction of exp(cN ) approximate critical points of H N (with exponentially good probability) whenever there is an IAMP phase. A similar construction was used by one of us in [24, Sect. 4]. There the Gaussian noise was constructed artificially by preliminary iterates of AMP rather than from exogenous noise (due to the lack of a state evolution result incorporating independent Gaussian vectors). This only enabled the construction of a large constant number of outputs rather than exponentially many.
Our branching IAMP proceeds as follows. We first apply Stage I as before. We fix q_1 < q_2 < · · · < q_m = 1 and define n^ℓ by the same recursive formula as before, unless ℓδ = q_i for some i ∈ [m]. For these cases, we let g^(1), . . ., g^(m) ∼ N(0, I_N/N) be independent Gaussian vectors. Then we set n^{ℓ+1} = n^ℓ + (a suitably scaled) g^(i) when ℓδ = q_i for some i ∈ [m], and n^{ℓ+1} = n^ℓ + u_δ (z^{ℓ+1} − z^ℓ) otherwise.
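As an illustrative caricature of these branching steps (all names hypothetical; the true increments u_δ(z^{ℓ+1} − z^ℓ) are replaced here by plain Gaussian steps of matching scale, and only a single branch time is used), sharing increments up to depth q and using independent noise afterwards pins the pairwise overlap of the branches at q:

```python
import numpy as np

def branching_increments(N, delta, q_branch, T, rng, n_branches=2):
    """Toy caricature of branching IAMP: branches share every increment
    until time q_branch, then receive independent Gaussian steps, so the
    pairwise overlap freezes near q_branch while each squared norm grows
    to ~T.  (Assumption: real IAMP increments replaced by i.i.d. Gaussian
    steps of the same per-step scale.)"""
    n_steps = int(T / delta)
    sigmas = [np.zeros(N) for _ in range(n_branches)]
    for l in range(n_steps):
        if l * delta < q_branch:
            g = rng.standard_normal(N)          # shared increment
            for s in sigmas:
                s += np.sqrt(delta / N) * g
        else:
            for s in sigmas:                    # branch-specific increments
                s += np.sqrt(delta / N) * rng.standard_normal(N)
    return sigmas

rng = np.random.default_rng(0)
s1, s2 = branching_increments(N=20000, delta=0.01, q_branch=0.5, T=1.0, rng=rng)
print(float(s1 @ s2))   # overlap ~ 0.5, the branching depth
print(float(s1 @ s1))   # squared norm ~ 1.0
```

In the actual algorithm the injected vectors g^(i) play exactly the role of the independent steps above, with their scaling chosen so that a branch step mimics a usual iteration step.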
Proposition 3.5 Let the iterations n^{ℓ,1}, n^{ℓ,2} be q_j-coupled as above, and let the algorithm be run on a pseudo-maximizer of A (recall Definition 2.1). Then (3.16)–(3.18) hold.

Proof The analysis uses the slightly generalized state evolution given in Theorem 2, which states that (2.8) continues to hold even in the presence of the external randomness g^(i). Modulo this point, the calculations are essentially identical. Indeed, [24] uses exactly the same calculations to analyze a slightly different formulation of branching IAMP (therein, the vectors g are defined via negatively time-indexed AMP iterates to sidestep the lack of a generalized state evolution result). We therefore give only an outline below. The SDE description in (2.6) is unchanged if one uses the slightly added generality of Theorem 2 to incorporate the external Gaussian noise. (This Gaussian noise is scaled in (3.15) to achieve exactly the same effect as a usual iteration step.) The energy analysis of H_N(n^ℓ) only changes on the m modified steps, which has negligible effect since δ → 0; similarly for ∇H_N(n^ℓ). Thus (3.16) follows by the same proof as before. The proof of (3.18) is identical to [24, Sect. 8].
In Proposition 3.6 we observe that concentration of measure implies Proposition 3.5 holds with exponentially high probability. Thus we can couple together exp(cN) branching IAMPs to construct a full ultrametric tree of large constant depth m and breadth exp(cN). To do this, we fix m, take ℓ sufficiently large, and then η > 0 sufficiently small. Then with K = exp(ηN), we consider a complete depth-m rooted tree T, with root defined to have depth 1, such that each vertex at depths 1, . . ., m − 1 has K children. Thus the leaf-set L(T) is naturally indexed by [K]^m. For v, v′ ∈ L(T) we let v ∧ v′ ∈ {1, 2, . . ., m} denote the height of their least common ancestor. For each non-leaf x ∈ V(T), label the edge from x to its parent by an i.i.d. Gaussian vector g^(x) ∼ N(0, I_N). Then for each leaf v ∈ L(T), using the m Gaussian vectors along the path from the root of T to v yields a branching IAMP output σ(v) for any H_N.

Proposition 3.6 Proposition 3.5 holds with exponentially good probability in the following sense. Fix m and q_1 < q_2 < · · · < q_m = 1. For any ε > 0 and large enough ℓ, there exists η = η(ε, ℓ) > 0 such that for N large enough, the conclusions (3.19) hold simultaneously across all v, v′ ∈ L(T) with probability at least 1 − exp(−ηN). In particular, the last conclusion in (3.19) shows that all exp(ηN) constructed points have pairwise distance at least δ√N for 0 < δ < 1 − q_{m−1}. Thus for any sub-solvable model, with high probability there are exponentially many √N/C(ξ)-separated approximate critical points. This is a converse to the main result of [17], where we show that strictly super-solvable models enjoy a strong topological trivialization property which rules out such behavior.
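The resulting overlap structure can be illustrated with a GREM-style toy model (an assumption-laden caricature, not the actual IAMP outputs: each level-i prefix of a leaf carries an independent Gaussian g ∼ N(0, I_N), and σ(v) sums the scaled Gaussians along its root-to-leaf path; the leaf encoding by tuples is a hypothetical convention). The overlap of two leaves then equals q at the number of path levels they share:

```python
import numpy as np

def ultrametric_gaussians(K, m, q, N, rng):
    """GREM-style caricature of the tree construction: sigma(v) sums
    independent Gaussians attached to the prefixes of leaf v, scaled so
    that E[<sigma(v), sigma(w)>] = q_{wedge(v, w)} with q_0 = 0."""
    q = [0.0] + list(q)                       # prepend q_0 = 0 baseline
    edge_noise = {}                            # prefix tuple -> Gaussian vector
    sigmas = {}
    for v in np.ndindex(*([K] * m)):
        s = np.zeros(N)
        for i in range(1, m + 1):
            key = v[:i]
            if key not in edge_noise:
                edge_noise[key] = rng.standard_normal(N)
            s += np.sqrt((q[i] - q[i - 1]) / N) * edge_noise[key]
        sigmas[v] = s
    return sigmas

def wedge(v, w):
    """Number of shared leading coordinates, i.e. common path levels."""
    i = 0
    while i < len(v) and v[i] == w[i]:
        i += 1
    return i

rng = np.random.default_rng(1)
sig = ultrametric_gaussians(K=2, m=3, q=[0.3, 0.6, 1.0], N=50000, rng=rng)
v, w = (0, 0, 0), (0, 1, 0)                   # wedge = 1, so overlap ~ q_1 = 0.3
print(float(sig[v] @ sig[w]))
```

The concentration of these overlaps around q_{v∧v′} is the finite-N shadow of the exponential concentration used in Proposition 3.6.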

Remark 3.3
An alternative to branching IAMP, which is very natural from the point of view of our companion work [16], is to slightly perturb the Hamiltonian H_N. Concentration of measure implies that the overlap concentrates exponentially around a limiting value R_{δ,ℓ,η} ∈ R^r. We expect that taking η → 0 together with δ, ℓ in a suitable way allows R_{δ,ℓ,η} to approximate the overlap value at any desired q ∈ [q_1, 1]. This corresponds to the behavior of p(q) on q ∈ [q_1, 1] for any maximizer (p, ·; q_0) of A. However, this approach seems more cumbersome to analyze explicitly.

Remark 3.4
The construction in this section shows the quenched existence of exp(ηN) well-separated approximate critical points for strictly sub-solvable models. In [17, Theorem 5.15] we use this fact to prove the number of exact critical points is exponentially large in expectation. However, we are unable to prove the quenched (i.e. high-probability) existence of exp(ηN) exact critical points in strictly sub-solvable models. Showing that this is the case, or more generally identifying the quenched exponential order of the number of critical points, is an interesting direction for future work.
To deduce the state evolution result for mixed tensors, we analyze a slightly more general iteration in which each homogeneous k-tensor is tracked separately, while restricting ourselves to the case where the mixture ξ has finitely many components: γ_{s_1,…,s_k} = 0 for all (s_1, . . ., s_k) ∈ S^k whenever k ≥ D + 1, for some fixed D ≥ 2. We then extend the convergence to the general case D = ∞ by an approximation argument.
We begin by introducing the Gaussian process that captures the asymptotic behavior of AMP. Define ξ_k to be the degree-k part of ξ. An AMP iteration is specified by Lipschitz functions f_{t,s} : R^{2(t+1)} → R for each (t, s) ∈ N × S. For each iteration t, the state of the algorithm is given by vectors w^t ∈ R^N and z^{k,t} ∈ R^N with k ∈ {2, . . ., D}. Moreover for each t there is also an external randomness vector e^t ∈ R^N with independent coordinates e^t_i ∼ μ_{t,s(i)}, drawn from deterministic probability distributions (μ_{t,s})_{t≥0, s∈S} with finite second moment. We now define the AMP iteration steps (the definition finishes at (A.11)). A single step AMP_t(w^0, . . ., w^t; e^0, . . ., e^t)_k is given in (A.9) in terms of A^(k) and the functions f_t. A general multi-species tensor AMP algorithm then takes the form

z^{k,t+1} = AMP_t(w^0, . . ., w^t; e^0, . . ., e^t)_k. (A.10)

For the right-hand side of (A.9) to make sense, we must define for each t ≥ 0 and s ∈ S a distribution over sequences (W^0_s, . . ., W^t_s; E^0_s, . . ., E^t_s). The variables E^t_s ∼ μ_{t,s} are simply taken independent of each other and of all other variables. The construction of the W variables is recursive across t as follows. For each 2 ≤ k ≤ D and s ∈ S, we let U^{k,0}_s ∼ ν_{k,s} and construct a centered Gaussian process as specified below.

Theorem 2 (State Evolution for AMP) Let {G^(k)}_{k≥2} be independent standard Gaussian tensors with G^(k) ∈ (R^N)^{⊗k}, and define A^(k) as in (A.2). Fix a sequence of Lipschitz functions f_{t,s} : R^{k+1} → R. Let z^{2,0}, . . ., z^{D,0} ∈ R^N be deterministic vectors and w^0 = Σ_{2≤k≤D} z^{k,0}. Assume that for each s ∈ S, the empirical distribution of the vectors (z^{2,0}_i, . . ., z^{D,0}_i), i ∈ I_s, converges in W_2(R^{D−1}) distance to the law of the vector (U^{k,0}_s)_{2≤k≤D}. Let w^t, z^{k,t}, t ≥ 1 be given by the tensor AMP iteration. Then, for all s ∈ S and T ≥ 1 and for any pseudo-Lipschitz functions ψ : R^{D×T} → R and ψ̃ : R^T → R, the convergences (A.12) and (A.13) hold in probability.

Note that (A.13) (which concerns the actual AMP iterates w^t) is a special case of (A.12) (which is more convenient to prove). Indeed, one can take ψ((z^{k,t})_{k≤D, t≤T}) = ψ̃((Σ_{k≤D} z^{k,t})_{t≤T}). In the special case that c_k = 0 for all k ≥ D + 1, Proposition 2.2 follows immediately from Theorem 2 by baking the contribution of h explicitly into f_t (since we require k ≥ 2 above). Proposition 2.2 for non-polynomial ξ follows by a standard approximation argument outlined at the end of Sect. 1. For the remainder of this appendix we thus focus on establishing (A.12).
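To make the shape of such an iteration and its state evolution concrete, here is a minimal single-species sketch for the matrix case k = 2 (assumptions: one GOE coupling matrix, nonlinearity f = tanh, no external randomness e^t; this drastically simplifies the multi-species tensor iteration of (A.10)):

```python
import numpy as np

def amp_goe(N, T, rng):
    """Minimal single-species caricature (k = 2, f = tanh) of a tensor AMP
    iteration.  The Onsager term mean(f'(z^t)) * f(z^{t-1}) removes the
    backtracking bias, so each iterate z^t remains approximately Gaussian
    with variance tau_t^2; this is the content of state evolution."""
    G = rng.standard_normal((N, N))
    A = (G + G.T) / np.sqrt(2 * N)        # GOE normalization: A_ij ~ N(0, 1/N)
    z = rng.standard_normal(N)            # initialization independent of A
    f_prev = np.zeros(N)
    taus = []
    for _ in range(T):
        fz = np.tanh(z)
        z = A @ fz - np.mean(1.0 - fz ** 2) * f_prev   # AMP step with Onsager
        f_prev = fz
        taus.append(float(np.mean(z ** 2)))            # empirical tau_t^2
    return taus

def se_prediction(T, tau2=1.0, n_mc=10 ** 6, seed=0):
    """State evolution recursion tau_{t+1}^2 = E[tanh(tau_t Z)^2], Z ~ N(0,1),
    approximated by Monte Carlo."""
    Z = np.random.default_rng(seed).standard_normal(n_mc)
    out = []
    for _ in range(T):
        tau2 = float(np.mean(np.tanh(np.sqrt(tau2) * Z) ** 2))
        out.append(tau2)
    return out

taus = amp_goe(N=3000, T=5, rng=np.random.default_rng(2))
pred = se_prediction(T=5)
print(taus)   # empirical variances track the state-evolution prediction
print(pred)
```

Theorem 2 is the analogous statement for the full multi-species tensor iteration, with external randomness e^t incorporated into the limiting description.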

A.1: Further Definitions
We define the notations

Corollary 3.4
With λ_k the k-th largest eigenvalue of a symmetric real matrix,

lim_{ℓ→∞, ε→0} p-lim_{N→∞} λ_{εN}(∇²_sp H_N(n^ℓ)) = 0. (3.13)

Proof Fixing A = A(1), the bulk spectral measure of

W(x) = ∇²H_N(x) − A x (3.14)

for deterministic x ∈ S_N concentrates at speed N² around a limiting spectral measure independent of x. By union-bounding over a δ√N-net as in [26, Proof of Lemma 3], it thus suffices to show (3.13) at a point x ∈ S_N independent of H_N, with W(x) in place of ∇²_sp H_N(σ). This is purely a statement of random matrix theory and is shown in [17, Proposition 5.18].
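The random matrix statement behind the proof can be visualized numerically (a toy stand-in, not the actual conditioned Hessian: a GOE matrix shifted so its semicircle bulk edge sits at 0, so that λ_1 is near 0 while λ_{εN} sits strictly inside the bulk):

```python
import numpy as np

# Toy stand-in for W(x): GOE shifted so the semicircle bulk occupies [-4, 0].
# Then lambda_1 ~ 0 (up to N^{-2/3} edge fluctuations), while lambda_{eps*N}
# is the (1 - eps)-bulk-quantile, strictly negative for fixed eps > 0.
rng = np.random.default_rng(3)
N = 1000
G = rng.standard_normal((N, N))
W = (G + G.T) / np.sqrt(2 * N) - 2.0 * np.eye(N)
eig = np.sort(np.linalg.eigvalsh(W))[::-1]       # decreasing order
eps = 0.02
print(float(eig[0]), float(eig[int(eps * N)]))   # near 0 vs strictly negative
```

Sending ε → 0 pushes λ_{εN} up to the bulk edge, which is exactly the content of (3.13).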
Definition 1.2 A symmetric diagonally signed matrix M is super-solvable if it is positive semidefinite, and solvable if it is furthermore singular; otherwise M is strictly sub-solvable. A point x ∈ (0, 1]^S is super-solvable, solvable, or strictly sub-solvable if M_*(x) is.

Proof We proceed by induction on j, first showing (2.14) and (2.16) together. As a base case, (2.14) holds for j = 0 by initialization. For the inductive step, assume first that (2.14) holds for j. Then (2.17) follows by the definition (2.11).

Proof As explained in [15, Sect. 8], the map H_N → n^ℓ agrees, except with probability exp(−cN), with a C(ℓ)-Lipschitz function of the coefficients G^(k) of H_N. The same proof applies to H_N → n^{ℓ,v} as well, since the external noise variables are also Gaussian. Concentration of measure on Gaussian space now ensures that the statements above hold with exponentially high probability for each fixed (v, v′). Union bounding over all such pairs for small enough η implies the result.
The process (U^{k,1}_s, . . ., U^{k,t}_s) is independent of U^{k,0}_s. The variables U^{k,t}_s and U^{k′,t}_{s′} are independent unless (k, s) = (k′, s′). It remains to specify the covariance of (U^{k,t}_s)_{1≤t≤T}, which is given recursively. The main result, an extension of Proposition 2.2, follows. Below we use W_2 to denote the Wasserstein-2 distance between probability measures on Euclidean space, in any dimension. We say a function ψ : R^d → R is pseudo-Lipschitz if |ψ(x) − ψ(y)| ≤ C(1 + ‖x‖ + ‖y‖)‖x − y‖ for some constant C and all x, y ∈ R^d. We write W^t = [w^0 | · · · | w^t], E^t = [e^0 | e^1 | · · · | e^t], and Z^{k,t} = [z^{k,0} | z^{k,1} | · · · | z^{k,t}]. This implies the corresponding covariance identity for any 0 ≤ t_1 ≤ t − 1 and s ∈ S; Gaussian integration by parts yields the latter expression (it can be carried out conditionally on the variables E, since they are independent). Combining (A.70) with the definition (A.8) now allows us to conclude T^{k,t} Q^t ≈ ONS^{k,t} as desired. Indeed, for each s ∈ S, expanding (T^{k,t} Q^t)_{t,s} in terms of the correlation functions R_s yields exactly ons^{k,t+1}. We conclude from (A.64) that ‖AMP_{t+1}(Q^t)_k − LAMP_{t+1}(Q^t)_k‖_N ≈ 0. This concludes the proof.