Abstract
This paper develops approximate message passing algorithms to optimize multi-species spherical spin glasses. We first show how to efficiently achieve the algorithmic threshold energy identified in our companion work (Huang and Sellke in arXiv preprint, 2023. arXiv:2303.12172), thus confirming that the Lipschitz hardness result proved therein is tight. Next we give two generalized algorithms which produce multiple outputs and show all of them are approximate critical points. Namely, in an r-species model we construct \(2^r\) approximate critical points when the external field is stronger than a “topological trivialization” phase boundary, and exponentially many such points in the complementary regime. We also compute the local behavior of the Hamiltonian around each. These extensions are relevant for another companion work (Huang and Sellke in arXiv preprint, 2023. arXiv:2308.09677) on topological trivialization of the landscape.
1 Introduction
This paper studies the efficient optimization of a family of random non-convex functions \(H_N\) defined on high-dimensional spaces, namely the Hamiltonians of multi-species spherical spin glasses. Mean-field spin glasses have been studied since [25] as models for disordered magnetic systems and are also closely linked to random combinatorial optimization problems [12, 19, 22]. In short, their Hamiltonians are certain polynomials in many variables with independent centered Gaussian coefficients.
The purpose of this work is to develop efficient algorithms to optimize \(H_N\). Our companion work [16] derives an algorithmic threshold \({\textsf {ALG}}\) and proves no optimization algorithm with suitably Lipschitz dependence on \(H_N\) can achieve energy better than \({\textsf {ALG}}\) with more than exponentially small probability. The value \({\textsf {ALG}}\) is expressed as the maximum of a variational principle over several increasing functions, which was shown to be achieved by joining the solutions to a pair of well-posed differential equations. The first main contribution of this paper is to show that given a solution to this variational problem, so-called approximate message passing (AMP) algorithms efficiently achieve the value \({\textsf {ALG}}\). We note that several previous works [4, 21, 24, 26] have given similar algorithms for mean-field spin glasses with 1 species, and our algorithm is in line with the latter three.
Furthermore, we use these AMP algorithms to aid a detailed study of the landscape of \(H_N\) by probing neighborhoods of special critical points. This is related to a second companion work [17] which identifies the phase boundary for topological trivialization of \(H_N\), where the number of critical points is a constant independent of N. Therein, Kac-Rice estimates are used to show that for r-species models (defined on a product of r spheres) in the “super-solvable” regime with strong external field, \(H_N\) has exactly \(2^r\) critical points with high probability. In this paper, we give a signed AMP algorithm which explicitly approximates each of these critical points. Moreover in the complementary “sub-solvable” regime, we use AMP to construct \(\exp (cN)\) separated approximate critical points with high probability. This implies the failure of strong topological trivialization as defined in [17], which is proved therein to hold for super-solvable models. Finally, the machinery of AMP allows us to compute the local behavior of \(H_N\) around these algorithmic outputs, giving even more precise information about the landscape.
1.1 Problem Description
Fix a finite set \({\mathscr {S}}= \{1,\ldots ,r\}\). For each positive integer N, fix a deterministic partition \(\{1,\ldots ,N\} = \sqcup _{s\in {\mathscr {S}}}\, {\mathcal {I}}_s\) with \(\lim _{N\rightarrow \infty } |{\mathcal {I}}_s| / N =\lambda _s\) where \({\vec \lambda }= (\lambda _1,\ldots ,\lambda _r) \in {\mathbb {R}}_{>0}^{\mathscr {S}}\). For \(s\in {\mathscr {S}}\) and \({\varvec{x}}\in {\mathbb {R}}^N\), let \({\varvec{x}}_s \in {\mathbb {R}}^{{\mathcal {I}}_s}\) denote the restriction of \({\varvec{x}}\) to coordinates \({\mathcal {I}}_s\). We consider the state space
Fix \({\vec h}= (h_1,\ldots ,h_r) \in {\mathbb {R}}_{\ge 0}^{\mathscr {S}}\) and let \({\textbf {1}}= (1,\ldots ,1) \in {\mathbb {R}}^N\). For each \(k\ge 2\) fix a symmetric tensor \(\Gamma ^{(k)} = (\gamma _{s_1,\ldots ,s_k})_{s_1,\ldots ,s_k\in {\mathscr {S}}} \in ({\mathbb {R}}_{\ge 0}^{{\mathscr {S}}})^{\otimes k}\) with \(\sum _{k\ge 2} 2^k {\left\Vert\Gamma ^{(k)}\right\Vert}_\infty < \infty \), and let \({\textbf {G}}^{(k)} \in ({\mathbb {R}}^N)^{\otimes k}\) be a tensor with i.i.d. standard Gaussian entries.
For \(A\in ({\mathbb {R}}^{\mathscr {S}})^{\otimes k}\), \(B\in ({\mathbb {R}}^N)^{\otimes k}\), define \(A\diamond B \in ({\mathbb {R}}^N)^{\otimes k}\) to be the tensor with entries
where s(i) denotes the \(s\in {\mathscr {S}}\) such that \(i\in {\mathcal {I}}_s\). Let \(\varvec{h}= {\vec h}\diamond {\textbf {1}}\). We consider the mean-field multi-species spin glass Hamiltonian
with inputs \({\varvec{\sigma }}= (\sigma _1,\ldots ,\sigma _N) \in {\mathcal {B}}_N\). For example, the choice of parameters \(\Gamma ^{(2)} = ({\begin{matrix} 0 & 1 \\ 1 & 0 \end{matrix}})\) and \(\Gamma ^{(k)}=0\) for \(k\ge 3\) yields the well-known bipartite spherical SK model [2]. For \({\varvec{\sigma }},{\varvec{\rho }}\in {\mathcal {B}}_N\), define the species s overlap and overlap vector
Let \(\odot \) denote coordinate-wise product. For \(\vec {x}= (x_1,\ldots ,x_r) \in {\mathbb {R}}^{\mathscr {S}}\), let
The random function \(\widetilde{H}_N\) can also be described as the Gaussian process on \({\mathcal {B}}_N\) with covariance
We will also often refer to the product of spheres
It will be useful to define, for \(s\in {\mathscr {S}}\),
1.2 The Value \({\textsf {ALG}}\)
Given \(({\vec \lambda },\xi )\), the ground state energy of the associated multi-species spherical spin glass is
In the bipartite SK model mentioned above, \(\textsf{OPT}\) is the limiting operator norm of an i.i.d. Gaussian rectangular matrix with aspect ratio \(\lambda _1/\lambda _2\). For large k, the asymptotic operator norm of an i.i.d. random k-tensor is similarly encoded as \(\textsf{OPT}(\xi )\) for some \(\xi \) (with e.g. \(r=k\)). Perhaps surprisingly, it is generally believed that polynomial-time algorithms are incapable of finding \({\varvec{\sigma }}\in {\mathcal {B}}_N\) such that \(H_N({\varvec{\sigma }})\ge \textsf{OPT}(\xi )-{\varepsilon }\) with high probability as \(N\rightarrow \infty \). Our work [15] showed that in the single species case (and with all terms of even degree), one can identify an exact threshold \({\textsf {ALG}}\) for the performance of a class of Lipschitz algorithms which includes gradient-based methods and Langevin dynamics. More recently in [16], we extended the algorithmic hardness direction of this result to multi-species spherical spin glasses, using a new proof technique that applies even when \(\textsf{OPT}\) is not known. The purpose of this paper is to give explicit algorithms attaining the value \({\textsf {ALG}}\), and we present here the formula for this value.
The algorithmic threshold \({\textsf {ALG}}\) is given by the following variational principle. This is a simplification of the more general variational formula [16, Equation (1.7)], obtained by a partial characterization of its maximizers [16, Theorem 3]. The following generic assumption is needed therein to ensure well-posedness of the ODE (2.3) used in this description, and we will freely assume it throughout the paper.
Assumption 1
All quadratic and cubic interactions participate in \(H_N\), i.e. \(\Gamma ^{(2)}, \Gamma ^{(3)} > 0\) coordinate-wise. We will call such models non-degenerate. Since this condition depends only on \(\xi \), we similarly call \(\xi \) non-degenerate.
To optimize \(H_N\) for degenerate \(\xi \), it suffices to apply our algorithms to a slight perturbation \(\widetilde{\xi }\) which is non-degenerate and satisfies \(\Vert \xi -\widetilde{\xi }\Vert _{C^3([0,1]^r)}\le {\varepsilon }\) to obtain the guarantees in this and the next section. Here, \(C^3([0,1]^r)\) denotes the norm
Since both the ground state and the more general \({\textsf {ALG}}\) formula in [16] (allowing degenerate \(\xi \)) vary continuously in \(\xi \), there is essentially no loss of generality in assuming non-degeneracy.
The formula for \({\textsf {ALG}}\) is described by two cases depending on whether \(\vec {1}=1^{{\mathscr {S}}}\) is super-solvable as defined below.
Definition 1.1
A matrix \(M\in {\mathbb {R}}^{{\mathscr {S}}\times {\mathscr {S}}}\) is diagonally signed if \(M_{i,i}\ge 0\) and \(M_{i,j}<0\) for all \(i\ne j\).
Definition 1.2
A symmetric diagonally signed matrix M is super-solvable if it is positive semidefinite, and solvable if it is furthermore singular; otherwise M is strictly sub-solvable. A point \(\vec {x}\in (0,1]^{\mathscr {S}}\) is super-solvable, solvable, or strictly sub-solvable if \(M^*(\vec {x})\) is, where
We also adopt the convention that \(\vec {0}\) is always super-solvable, and solvable if \({\vec h}=\vec {0}\).
The following will be useful.
Proposition 1.3
([16, Proposition 4.3], see also [17, Lemma 2.5]) If the square matrix M is diagonally signed, then the minimal eigenvalue \({\varvec{\lambda }}_{\min }(M)\) has multiplicity 1, and the corresponding eigenvector \(\vec {v}\) has strictly positive entries. Moreover
and the supremum is uniquely attained at \(\vec {v}\).
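Proposition 1.3 can be illustrated numerically. The sketch below (Python, with a randomly generated diagonally signed matrix; it is an illustration, not part of the proof) checks that the minimal eigenvalue is simple and that its eigenvector can be taken strictly positive, which follows since \(-M+cI\) is an irreducible nonnegative matrix for large c and Perron-Frobenius applies.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4

# Random symmetric diagonally signed matrix: nonnegative diagonal,
# strictly negative off-diagonal entries.
M = -rng.uniform(0.1, 1.0, size=(r, r))
M = (M + M.T) / 2
np.fill_diagonal(M, rng.uniform(0.0, 1.0, size=r))

eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
lam_min, v = eigvals[0], eigvecs[:, 0]
v = v if v.sum() > 0 else -v           # fix the overall sign

# Minimal eigenvalue is simple; its eigenvector has positive entries.
assert eigvals[1] - eigvals[0] > 1e-8
assert np.all(v > 0)
```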
It is easy to see that any \(\vec {x}\in (0,1]^{\mathscr {S}}\) is sub-solvable when \({\vec h}=\vec {0}\), and that super-solvability is a coordinate-wise increasing property of \({\vec h}\). For our purposes, an external field is large if \(\vec {1}\) is super-solvable and small if \(\vec {1}\) is strictly sub-solvable. (Unfortunately we do not have more refined intuition for the precise form of \(M^*\) above, nor for the resulting phase boundary between super- and sub-solvability.) As shown in our companion work [17], in super-solvable models the external fields \(\varvec{h}\) are strong enough to trivialize the “glassy” nature of the landscape for \(H_N\). Namely, the number of critical points is exactly \(2^r\) with high probability, the minimum possible for any generic smooth (“Morse”) function on a product of r spheres. By contrast, in the sub-solvable case the expected number of critical points is exponentially large in the dimension N. As explained below, the optimization algorithms are also simpler in the super-solvable case.
Definition 1.4
(Algorithmic Threshold, Super-Solvable Case) If \(\vec {1}\) is super-solvable, then
When \(\vec {1}\) is strictly sub-solvable, the formula for \({\textsf {ALG}}\) becomes more complicated and depends on the optimal choice of an increasing \(C^2\) function \(\Phi :[q_1,1]\rightarrow [0,1]^{{\mathscr {S}}}\) satisfying certain conditions. We term such \(\Phi \) pseudo-maximizers and defer the formal definition to Definition 2.1. Note that \(q_1 \in [0,1]\) is not fixed, but is determined by the choice of \(\Phi \).
Definition 1.5
(Algorithmic Threshold, Strictly Sub-solvable Case) If \(\vec {1}\) is strictly sub-solvable, then with the maximum taken over all pseudo-maximizers \(\Phi \) of \({\mathbb {A}}\),
See [16, Remark 1.3] for an approach to maximizing \({\mathbb {A}}\) using the well-posedness of the ODEs (2.2), (2.3) in the definition of pseudo-maximizer. The computational complexity of this task is in particular independent of N.
The following theorem is our main result. We equip the space \({\mathscr {H}}_N\) of Hamiltonians \(H_N\) with the following distance. We identify \(H_N\) with its disorder coefficients \(({\varvec{G}}^{(k)})_{k\ge 2}\), which we arrange in an arbitrary but fixed order into an infinite vector \({\varvec{g}}(H_N)\), and define
(In other words, \(\Vert {H_N-H'_N}\Vert _2^2\) is the sum of squared differences \((g_{i_1,\ldots ,i_k}-g'_{i_1,\ldots ,i_k})^2\) between all corresponding pairs of coefficients in \(({\varvec{G}}^{(k)})_{k\ge 2}\) and \(({\varvec{G}}'^{(k)})_{k\ge 2}\).) We say an algorithm \({\mathcal {A}}_N: {\mathscr {H}}_N \rightarrow {\mathcal {B}}_N\) is \(\tau \)-Lipschitz if
Note that \(\Vert {H_N-H'_N}\Vert _2\) may be infinite, and if so this condition holds vacuously for such pairs \((H_N,H'_N)\). Here and throughout, all implicit constants may depend also on \((\xi ,{\vec h},{\vec \lambda })\).
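As a concrete reading of this metric, the following sketch (Python; the degree cutoff and dimensions are hypothetical, for illustration only) flattens the coefficient tensors in a fixed order and computes the resulting \(\ell ^2\) distance between two Hamiltonians.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5

def coeff_vector(tensors):
    """Flatten the disorder coefficients (G^(k))_k into one vector,
    in an arbitrary but fixed order."""
    return np.concatenate([G.ravel() for G in tensors])

# Two Hamiltonians, identified with their Gaussian coefficient tensors
# for k = 2, 3 (a hypothetical finite-degree model).
H = [rng.standard_normal((N, N)), rng.standard_normal((N, N, N))]
Hp = [G + 0.01 * rng.standard_normal(G.shape) for G in H]

# ||H - H'||_2 is the l2 distance between the coefficient vectors,
# i.e. the sum of squared coefficient differences, square-rooted.
dist = np.linalg.norm(coeff_vector(H) - coeff_vector(Hp))
```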
Theorem 1
For any \({\varepsilon }>0\), there exists an \(O_{{\varepsilon }}(1)\)-Lipschitz \({\mathcal {A}}_N:{\mathscr {H}}_N\rightarrow {\mathcal {B}}_N\) such that
The main result in our companion work [16, Theorem 1] states that any \(\tau \)-Lipschitz \({\mathcal {A}}_N: {\mathscr {H}}_N \rightarrow {\mathcal {B}}_N\) satisfies, for the same threshold \({\textsf {ALG}}\) and N sufficiently large,
Together these results thus characterize the best possible Lipschitz optimization algorithms for multi-species spherical spin glasses.
We prove Theorem 1 with an explicit algorithm based on AMP, following a recent line of work [4, 5, 21, 24, 26]. Such algorithms are shown to be Lipschitz (up to modification on a set with \(\exp (-cN)\) probability) in [15, Sect. 8]. AMP algorithms also have computational complexity which is linear in the input size when \(H_N\) is a polynomial of finite degree (modulo solving for \(\Phi \), a task that does not depend on N). See [4, Remark 2.1] for related discussion on this last point.
Similarly to [5, 24], our algorithm has two phases, a “root-finding” phase and a “tree-descending” phase. Roughly speaking, the set of points reachable by our algorithm has the geometry of a densely branching ultrametric tree, which is rooted at the origin when \(\varvec{h}= {\textbf {0}}\) and more generally at a random point correlated with \(\varvec{h}\). The first phase identifies this root, and the second traces a root-to-leaf path of this tree. The structure of the first phase is similar to the original AMP algorithm of [9] for the SK model at high-temperature, while the latter incremental AMP technique was introduced in [21].
For the purposes of this paper, the significance of (super, sub)-solvability is as follows. When the external field is sufficiently large, the root moves all the way to the boundary of \({\mathcal {B}}_N\) (in all r species) and the algorithmic tree becomes degenerate. In [16], it is shown that the external field is large enough for this to occur if and only if \(\vec {1}\) is super-solvable. Moreover, [17] shows this condition coincides with strong topological trivialization (defined therein) of the optimization landscape.
In Sect. 3 we extend our main algorithm in several ways. In Sect. 3.1 we define \(2^r\) signed generalizations of the root-finding algorithm with similar behavior. In Sect. 3.2 we compute the gradients of \(H_N\) at the points output by our algorithm, in both cases when \(\vec {1}\) is super-solvable and sub-solvable. In particular, we show that they are approximate critical points on the product of spheres \({\mathcal {S}}_N\) (defined in (1.6)). As explained in Remark 3.1, in the strictly super-solvable case these \(2^r\) outputs approximate the \(2^r\) genuine critical points of \(H_N\) on \({\mathcal {S}}_N\). The sub-solvable case of this computation is used in our companion paper [17, Theorem 1.5(c) and Sect. 5.3] to show failure of annealed topological trivialization in the sub-solvable case. Finally in Sect. 3.3 we give a modification of the tree-descending phase for the super-solvable case. It constructs \(\exp (cN)\) well-separated approximate critical points arranged in a densely branching ultrametric tree; this implies the failure of strong topological trivialization in [17, Definition 6 and Theorem 1.6].
1.3 Notations
Throughout, we will use boldface lowercase letters (\({\varvec{u}},{\varvec{v}},\ldots \)) to denote vectors in \({\mathbb {R}}^N\), and lowercase letters with vector sign (\(\vec {u},\vec {v},\ldots \)) to denote vectors in \({\mathbb {R}}^{\mathscr {S}}\simeq {\mathbb {R}}^r\). Similarly, boldface uppercase letters denote matrices or tensors in \(({\mathbb {R}}^N)^{\otimes k}\), and non-boldface uppercase letters denote matrices or tensors in \(({\mathbb {R}}^r)^{\otimes k}\). We let
for \({\varvec{u}},{\varvec{v}}\in {\mathbb {R}}^N\). The corresponding norm is
Next \(a_N\simeq b_N\) means that \(a_N-b_N\) converges in probability to 0. Analogously, for two vectors \({\varvec{u}}_N, {\varvec{v}}_N\), we write \({\varvec{u}}_N\simeq {\varvec{v}}_N\) when \(\Vert {\varvec{u}}_N-{\varvec{v}}_N\Vert _N\) converges in probability to 0. We denote limits in probability by \(\mathop {\mathrm {p-lim}}\limits _{N\rightarrow \infty }\). Analogously we write \(\approx _{\delta }\) to denote asymptotic equality as \(\delta \rightarrow 0\).
For any tensor \(\varvec{A}\in ({\mathbb {R}}^N)^{\otimes k}\), we define the operator norm
The following proposition shows that with exponentially good probability, the operator norms of all constant-order gradients of \(H_N\) are bounded on the appropriate scale.
Proposition 1.6
([16, Proposition 1.13]) For any fixed model \((\xi , {\vec h})\) there exists a constant \(c>0\), sequence \((K_N)_{N\ge 1}\) of convex sets \(K_N\subseteq {\mathscr {H}}_N\), and sequence of constants \((C_{k})_{k\ge 1}\) independent of N, such that the following properties hold.
-
(a)
\(\mathbb {P}[H_N\in K_N]\ge 1-e^{-cN}\);
-
(b)
For all \(H_N\in K_N\) and \({\varvec{x}}\in {\mathcal {B}}_N\),
$$\begin{aligned} {\left\Vert\nabla ^k H_N({\varvec{x}})\right\Vert}_{\text {op}}&\le C_{k}N^{1-\frac{k}{2}}. \end{aligned}$$(1.9)
2 Achieving Energy \({\textsf {ALG}}\)
In this section we prove Theorem 1 by exhibiting an AMP algorithm. Throughout this section, Assumption 1 on non-degeneracy of \(\xi \) will be enforced without loss of generality.
2.1 Definition of Pseudo-Maximizer
As mentioned before Definition 1.5, the threshold \({\textsf {ALG}}\) in the sub-solvable case depends on a notion of pseudo-maximizer. We now provide this definition, which was derived in [16, Theorem 3] as a necessary condition for \(\Phi \) to maximize \({\mathbb {A}}\) defined in (1.8) (and it is proved therein that a maximizer always exists).
Definition 2.1
A coordinate-wise strictly increasing \(C^2\) function \(\Phi :[q_1,1]\rightarrow [0,1]^{{\mathscr {S}}}\), for some \(q_1\in [0,1]\), is a pseudo-maximizer if:
-
(1)
\(\Phi \) is admissible, meaning it satisfies the normalization
$$\begin{aligned} \langle {\vec \lambda }, \Phi (q)\rangle = q,\quad \forall q\in [q_1,1]. \end{aligned}$$(2.1)In particular \(\Phi (1) = \vec {1}\).
-
(2)
\(\Phi (q_1)\) is solvable.
-
(3)
The derivative at \(q_1\) satisfies \(M^*(\Phi (q_1))\Phi '(q_1)=\vec {0}\). This amounts to no restriction when \({\vec h}=\vec {0}\) and thus \((q_1,\Phi (q_1))=(0,\vec {0})\); when \({\vec h}\ne \vec {0}\) it means that
$$\begin{aligned} \Phi _s'(q_1) = \frac{\Phi _s(q_1) (\xi ^s\circ \Phi )'(q_1)}{\xi ^s(\Phi (q_1))+h_s^2}, \quad s\in {\mathscr {S}}. \end{aligned}$$(2.2) -
(4)
For all \(q\in [q_1,1]\), \(\Phi \) solves the (second-order) tree-descending differential equation:
$$\begin{aligned} \Psi (q) \equiv \frac{1}{\Phi '_s(q)} {\frac{{\text {d}}}{{\text {d}q}}} \sqrt{\frac{\Phi '_s(q)}{(\xi ^s \circ \Phi )'(q)}} \end{aligned}$$(2.3)is independent of the species s. (See [16, Lemma 4.37] for well-posedness of this ODE.)
Note that there may exist multiple such \(\Phi \); see [16, Figure 2]. If \(\vec {1}\) is super-solvable, we adopt the convention that \(q_1=1\) and \(\Phi \) has domain \(\{1\}\).
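As an illustration of the admissibility constraint (2.1), the sketch below (Python; the two-species \(\Phi \) and \({\vec \lambda }=(1/2,1/2)\) are hypothetical choices, and conditions (2.2), (2.3) are not checked) verifies \(\langle {\vec \lambda },\Phi (q)\rangle = q\), the consequence \(\Phi (1)=\vec {1}\), and coordinate-wise strict monotonicity on a grid.

```python
import numpy as np

# Hypothetical two-species example with lambda = (1/2, 1/2): the
# admissibility constraint (2.1) says <lambda, Phi(q)> = q on [q1, 1],
# which forces Phi(1) = (1, 1).
lam = np.array([0.5, 0.5])
q1 = 0.3

def Phi(q):
    c = 0.2 * (q - q1) * (1.0 - q)   # asymmetry between the two species
    return np.array([q + c, q - c])

qs = np.linspace(q1, 1.0, 1001)
vals = np.array([Phi(q) for q in qs])

assert np.allclose(vals @ lam, qs)            # normalization (2.1)
assert np.allclose(Phi(1.0), [1.0, 1.0])      # Phi(1) = vec 1
assert np.all(np.diff(vals, axis=0) > 0)      # coordinate-wise increasing
```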
We now give an efficient AMP algorithm achieving energy \({\mathbb {A}}(\Phi )\) for any pseudo-maximizer \(\Phi \). In particular for the optimal pseudo-maximizer this achieves energy \({\textsf {ALG}}\).
2.2 Review of Approximate Message Passing
Here we recall the class of AMP algorithms, specialized to our setting of interest. We initialize AMP with a deterministic vector \({\varvec{w}}^0\) with coordinates
depending only on the species. Let \(f_{t,s}:{\mathbb {R}}^{t+1}\rightarrow {\mathbb {R}}\) be a Lipschitz function for each \((t,s)\in {\mathbb {Z}}_{\ge 0}\times {\mathscr {S}}\). For \(({\varvec{w}}^0,{\varvec{w}}^1,\ldots ,{\varvec{w}}^t)\in {\mathbb {R}}^{N\times (t+1)}\), let \(f_{t}({\varvec{w}}^0,{\varvec{w}}^1,\ldots ,{\varvec{w}}^t)\in {\mathbb {R}}^N\) be given by
We generate subsequent iterates through recursions of the following form, where \({\textbf {ons}}_t\) is known as the Onsager correction term:
Here \(W^t_s,M^t_s\) are defined as follows. \(W^0_s=w_s\) and the variables \((\widetilde{W}^t_s)_{(t,s)\in {\mathbb {Z}}_{\ge 1}\times {\mathscr {S}}}\) form a centered Gaussian process with covariance defined recursively by
and \({\mathbb {E}}[\widetilde{W}^{t+1}_{s} \widetilde{W}^{t'+1}_{s'}]=0\) if \(s\ne s'\) (i.e. different species are independent).
The following state evolution characterizes the behavior of the above iterates. It states that for each \(s\in {\mathscr {S}}\), when \(i\in {\mathcal {I}}_s\) is uniformly random, the sequence of coordinates \((w^1_i,w^2_i,\ldots ,w^t_i)\) asymptotically has the same law as \((W^1_s,\ldots ,W^t_s)\). Say a function \(\psi :{\mathbb {R}}^{\ell } \rightarrow {\mathbb {R}}\) is pseudo-Lipschitz if \(|\psi (x) - \psi (y)| \le C(1+|x|+|y|)|x-y|\) for a constant C.
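For example, \(\psi (x)=x^2\) is pseudo-Lipschitz with \(C=1\), since \(|x^2-y^2| = |x+y|\,|x-y| \le (1+|x|+|y|)|x-y|\); the quick numerical check below (Python) confirms the inequality on random inputs.

```python
import numpy as np

rng = np.random.default_rng(4)

# psi(x) = x^2 is pseudo-Lipschitz with C = 1:
# |x^2 - y^2| = |x + y| |x - y| <= (1 + |x| + |y|) |x - y|.
psi = lambda x: x ** 2
x, y = rng.standard_normal(10_000), rng.standard_normal(10_000)
lhs = np.abs(psi(x) - psi(y))
rhs = (1 + np.abs(x) + np.abs(y)) * np.abs(x - y)
assert np.all(lhs <= rhs + 1e-12)
```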
Proposition 2.2
For any pseudo-Lipschitz function \(\psi \) and \(\ell \in {\mathbb {Z}}_{\ge 0}\), \(s\in {\mathscr {S}}\),
This proposition allows us to read off normalized inner products of the AMP iterates, since e.g.
Proposition 2.2 is proved in Appendix 1. In fact we show a slight generalization allowing \(f_t=f_t({\varvec{w}}^0,\ldots ,{\varvec{w}}^t,{\varvec{g}}^0,\ldots ,{\varvec{g}}^t)\) to depend also on independently generated vectors \(({\varvec{g}}^0,\ldots ,{\varvec{g}}^t)\in {\mathbb {R}}^{N(t+1)}\). When using this extension, we will always take each \({\varvec{g}}^t\sim {\mathcal {N}}(0,I_N)\) to be standard Gaussian. The more general result essentially says that \({\varvec{g}}^t\) still acts as an independent Gaussian for the purposes of state evolution. Since this is relatively intuitive, we refer to Theorem 2 in the appendix for a precise statement.
For random matrices (i.e. the case of quadratic H) there is a considerable literature establishing state evolution in many settings beginning with [7, 9] and later [6, 8, 10, 11, 13] (see also [14] for a survey of many statistical applications). The generalization to tensors was introduced in [23] and proved in [4], whose approach we follow.
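To make the Onsager correction concrete, the following sketch (Python) runs the textbook single-species, degree-2 specialization with identity non-linearities, where the correction reduces to subtracting the previous iterate. State evolution then predicts \(\Vert {\varvec{w}}^t\Vert _N^2\approx 1\) at every step. This is an illustration of the general scheme under a hypothetical normalization, not the multi-species algorithm of Sect. 2.3.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000

# Single-species, degree-2 disorder: symmetric matrix, entries of
# variance 1/N (hypothetical normalization for illustration).
G = rng.standard_normal((N, N))
A = (G + G.T) / np.sqrt(2 * N)

# With identity non-linearities f_t(w) = w, the Onsager coefficient
# (1/N) * sum_i f_t'(w^t_i) equals 1, so the iteration reads
#     w^{t+1} = A w^t - w^{t-1}.
w_prev, w = np.zeros(N), np.ones(N)   # deterministic initialization w^0
norms = []
for t in range(5):
    w, w_prev = A @ w - w_prev, w
    norms.append(np.dot(w, w) / N)

# State evolution: each w^t behaves like i.i.d. Gaussian coordinates of
# variance 1, so ||w^t||_N^2 stays near 1 at every step.
```

Dropping the correction (iterating \(w\mapsto Aw\) instead) inflates \(\Vert {\varvec{w}}^t\Vert _N^2\) like the even moments of the semicircle law, which illustrates why the Onsager term is essential.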
2.3 Stage \(\text {I}\): Finding the Root of the Ultrametric Tree
Our goal in this subsection will be to compute a vector \(\varvec{m}^{{\underline{\ell }}}\) satisfying
and with the correct energy value (as stated in Lemma 2.5). We take as given a pseudo-maximizer \(\Phi \) of \({\mathbb {A}}\) with domain \([q_1,1]\). Recall that \(\Phi (q_1)\) is super-solvable: either \(\vec {1}\) is strictly sub-solvable, in which case \(\Phi (q_1)\) is solvable, or \(\vec {1}\) is super-solvable, in which case \(\Phi (q_1) = \Phi (1) = \vec {1}\).
We use the initialization
Define the vector \({\vec a}\in \mathbb R^{{\mathscr {S}}}\) by
Subsequent iterates are defined via the following recursion.
The last term in (2.10) comes from specializing the formula (2.6) for the Onsager term.
Next recalling (2.8), let \((W^j_s,M^j_s)_{j\ge 0,s\in {\mathscr {S}}}\) be the state evolution limit of the coordinates of
as \(N\rightarrow \infty \). Concretely, each \(W^j_s\) is Gaussian with mean \(h_s\) and
We next compute the covariance of the Gaussians \(\widetilde{W}^j_s = W^j_s - h_s\). Define \({\vec \alpha }: {\mathbb {R}}_{\ge 0}^{\mathscr {S}}\rightarrow {\mathbb {R}}_{\ge 0}^{\mathscr {S}}\) by
Define the (deterministic) \({\mathbb {R}}_{\ge 0}^{{\mathscr {S}}}\)-valued sequence \((\vec R^0,\vec R^1,\dots )\) of asymptotic overlaps recursively by \(\vec R^0=\vec {0}\) and \(\vec R^{k+1} = {\vec \alpha }(\vec R^k)\).
Lemma 2.3
For integers \(0\le j<k\), the following equalities hold (the first in distribution):
Proof
We proceed by induction on j, first showing (2.14) and (2.16) together. As a base case, (2.14) holds for \(j=0\) by initialization. For the inductive step, assume first that (2.14) holds for j. Then by the definition (2.11),
so that (2.14) implies (2.16) for each \(j\ge 0\). On the other hand, state evolution directly implies that if (2.16) holds for j then (2.14) holds for \(j+1\). This establishes (2.14) and (2.16) for all \(j\ge 0\).
We similarly show (2.15) and (2.17) together by induction, beginning with (2.15). When \(j=0\) it is clear because \(\widetilde{W}^k_s\) is mean zero and independent of \(\widetilde{W}^0_s\). Just as above, it follows from state evolution that (2.15) for (j, k) implies (2.17) for (j, k) which in turn implies (2.15) for \((j+1,k+1)\). Hence induction on j proves (2.15) and (2.17) for all (j, k). \(\square \)
The next lemma is crucial and uses super-solvability of \(\Phi (q_1)\).
Lemma 2.4
The limit \(\vec R^\infty \equiv \lim _{j\rightarrow \infty } \vec R^j\) exists and equals \(\Phi (q_1)\).
Proof
First we observe that \({\vec \alpha }\) (recall (2.13)) is coordinate-wise strictly increasing in the sense that if \(0\preceq x\prec y\) then \({\vec \alpha }(x)\prec {\vec \alpha }(y)\). Moreover \({\vec \alpha }(\vec {0})\succ 0\) (assuming \({\vec h}\ne 0\), else the result is trivial) and \({\vec \alpha }(\Phi (q_1))=\Phi (q_1)\). Therefore \(\vec R^\infty \) exists, \({\vec \alpha }(\vec R^\infty )=\vec R^\infty \), and
It remains to show that the above forces \(\vec R^\infty =\Phi (q_1)\) to hold.
Let \(M\in {\mathbb {R}}^{{\mathscr {S}}\times {\mathscr {S}}}\) be the matrix with entries \(M_{s,s'}={\frac{{\text {d}}}{{\text {d}t}}}{\vec \alpha }_s(\Phi (q_1)+te_{s'})|_{t=0}\) for \(e_{s'}\) a standard basis vector. Then M is the derivative matrix for \({\vec \alpha }\) at \(\Phi (q_1)\) in the sense that for any \(\vec {u}\in {\mathbb {R}}^{{\mathscr {S}}}\),
We easily calculate that
We claim that for any entry-wise non-negative vector \(\vec w\in \mathbb R_{\ge 0}^{{\mathscr {S}}}\),
for some \(s\in {\mathscr {S}}\). Indeed, suppose to the contrary that \((M\vec w)_s > w_s\) for all \(s\in {\mathscr {S}}\). This rearranges to
i.e. \(M^*(\Phi (q_1)) \vec w\prec \vec {0}\) (recall (1.7)). Proposition 1.3 then implies that \({\varvec{\lambda }}_{\min }(M^*(\Phi (q_1))) < 0\), so \(\Phi (q_1)\) is strictly sub-solvable, which is a contradiction. Thus (2.18) holds for some \(s\in {\mathscr {S}}\).
Now suppose for sake of contradiction that \(\vec R^\infty \prec \Phi (q_1)\), let \(\vec w=\Phi (q_1)-\vec R^\infty \), and choose \(s\in {\mathscr {S}}\) such that (2.18) holds. Write \(f(t)=\alpha _s(\Phi (q_1)+t\vec w)\). Since \(\alpha _s\) is a polynomial with non-negative coefficients and \(\xi \) is non-degenerate, f is strictly convex and strictly increasing on \([-1,0]\). Hence
The first inequality above is strict, so we deduce that \({\vec \alpha }(\vec R^\infty )\ne \vec R^\infty \) if \(\vec R^\infty \prec \Phi (q_1)\). This contradicts the definition of \(\vec R^\infty \). Therefore \(\vec R^\infty =\Phi (q_1)\), completing the proof. \(\square \)
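The monotone convergence underlying Lemma 2.4 can be seen in a toy example. In the sketch below (Python), \({\vec \alpha }\) is a hypothetical coordinate-wise increasing polynomial map with nonnegative coefficients and \({\vec \alpha }(\vec {0})\succ \vec {0}\), standing in for (2.13); iterating from \(\vec R^0=\vec {0}\) increases monotonically to the smallest fixed point.

```python
import numpy as np

def alpha(x):
    """Hypothetical coordinate-wise increasing polynomial map with
    nonnegative coefficients, standing in for (2.13); alpha(0) > 0
    plays the role of a nonzero external field."""
    return np.array([0.2 + 0.3 * x[1] + 0.2 * x[0] * x[1],
                     0.1 + 0.4 * x[0] + 0.1 * x[0] ** 2])

R = np.zeros(2)
trajectory = [R]
for _ in range(200):
    R = alpha(R)
    trajectory.append(R)

# Coordinate-wise monotone convergence to the least fixed point.
assert all(np.all(b >= a - 1e-12) for a, b in zip(trajectory, trajectory[1:]))
assert np.linalg.norm(alpha(R) - R) < 1e-10
```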
Remark 2.1
Super-solvability of \(\Phi (q_1)\) is a tight condition for the above argument to hold, as the matrix M above needs to have Perron-Frobenius eigenvalue at most 1. Indeed, suppose that \(\Phi (q_1)\) were chosen so that \(\lambda _1(M)>1\). Then there exists \(\vec w\in {\mathbb {R}}_{>0}^{{\mathscr {S}}}\) with \(M\vec w\succ \vec w\). Letting \(\vec {x}=\Phi (q_1)-{\varepsilon }\vec w\) for small \({\varepsilon }>0\), we find \({\vec \alpha }(\vec {x})\prec \vec {x}\). Monotonicity implies that \({\vec \alpha }\) maps the compact, convex set
into itself. By the Brouwer fixed point theorem, a fixed point of \({\vec \alpha }\) strictly smaller than \(\Phi (q_1)\) exists whenever \(\Phi (q_1)\) is strictly sub-solvable.
We finish our analysis of the first AMP phase by computing the asymptotic energy it achieves. As expected, the resulting value agrees with the first term in the formula (1.8) for \({\textsf {ALG}}\).
Lemma 2.5
Proof
We use the identity
and interchange the limit in probability with the integral. To compute \(\mathop {\mathrm {p-lim}}\limits _{N\rightarrow \infty }\langle \varvec{m}^k,\nabla \widetilde{H}_N(t\varvec{m}^k)\rangle \) we introduce an auxiliary AMP step
which depends implicitly on \(t\in [0,1]\). Rearranging yields
For the first term, recalling (2.11) yields
Note also that
Integrating with respect to t, and switching the roles of \(s,s'\) in applying (2.20), we thus find
Finally the external field \(\varvec{h}\) gives energy contribution
Since \(\vec R^\infty =\Phi (q_1)\) by Lemma 2.4, we conclude
\(\square \)
2.4 Stage \(\text {II}\): Descending the Ultrametric Tree
We now turn to the second phase which uses incremental approximate message passing. Choose a large integer \({\underline{\ell }}\), and with \(\delta ={\underline{\ell }}^{-1}\) let
We then define
with the square-root taken entrywise, and \({\varvec{g}}\sim {\mathcal {N}}(0,I_N)\). Then
The point \({\varvec{n}}^{{\underline{\ell }}}\) will be the “root” of our IAMP algorithm.
Moreover we set \(\overline{\ell }=\max \{\ell \in {\mathbb {Z}}_+~:~q_{\ell }^{\delta }\le 1-2\delta \}.\) We also define for \(s\in {\mathscr {S}}\) and \({\underline{\ell }}\le \ell \le \overline{\ell }\) the constants
Set \({\varvec{z}}^{{\underline{\ell }}}={\varvec{w}}^{{\underline{\ell }}}-\varvec{h}\). We will define \(({\varvec{z}}^{\ell })_{\ell \ge {\underline{\ell }}+1}\) via
The Onsager coefficients \(d_{\ell ,j}\) are given by (2.7) and will not appear explicitly in any calculations until Sect. 3.2. Note that formally, they may depend on the first \({\underline{\ell }}\) iterates, since (2.24) is a continuation of the same AMP iteration. To complete the definition of the iteration (2.24), for \(s(i)=s\) and \(\ell \ge {\underline{\ell }}\) we set
where
The algorithm \({\mathcal {A}}\) outputs
where the power \(-1/2\) is taken entry-wise. We show in (2.32) below that
Hence we will often not distinguish between the two and just consider \({\varvec{n}}^{\overline{\ell }}\) to be the output. This makes essentially no difference by virtue of Proposition 1.6.
The state evolution limits of \({\varvec{z}}^\ell \) and \({\varvec{n}}^\ell \) are described by time-changed Brownian motions with total variance \(\Phi _s(q^{\delta }_{\ell })\) in species s after iteration \(\ell \). This is made precise below.
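As a sanity check of this description, the simulation below (Python; the single-species pseudo-maximizer \(\Phi (q)=q\) and the grid parameters are hypothetical) draws the limiting Gaussian process with independent increments of variance \(\Phi (q^{\delta }_{\ell +1})-\Phi (q^{\delta }_{\ell })\) and confirms that the total variance after step \(\ell \) is \(\Phi (q^{\delta }_{\ell })\), i.e. a Brownian motion time-changed by \(\Phi \).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical single-species pseudo-maximizer (admissible when
# lambda = 1): Phi(q) = q on [q1, 1], discretized as q_l = q1 + l*delta.
q1, delta = 0.2, 0.05
Phi = lambda q: q
qs = np.arange(q1, 1.0 + 1e-9, delta)

# The limiting process has independent Gaussian increments of variance
# Phi(q_{l+1}) - Phi(q_l), started from variance Phi(q1) at the root,
# so its variance after step l is Phi(q_l).
n = 100_000
incs = np.sqrt(np.diff(Phi(qs))) * rng.standard_normal((n, len(qs) - 1))
start = np.sqrt(Phi(q1)) * rng.standard_normal((n, 1))
paths = np.concatenate([start, start + np.cumsum(incs, axis=1)], axis=1)

emp_var = paths.var(axis=0)   # should track Phi(q_l) along the grid
```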
Lemma 2.6
Fix \(s\in {\mathscr {S}}\). The sequences \((Z^{\delta }_{{\underline{\ell }},s},Z^{\delta }_{{\underline{\ell }}+1,s},\dots )\) and \((N^{\delta }_{{\underline{\ell }},s},N^{\delta }_{{\underline{\ell }}+1,s},\dots )\) are Gaussian processes satisfying
Proof
The fact that these sequences are Gaussian processes is a general fact about state evolution (the external Gaussian \({\varvec{g}}\) is permitted in Theorem 2). We proceed by induction on \(\ell \ge {\underline{\ell }}\). The proof is similar to [24, Sect. 8] so we give only the main points (in fact (2.21) simplifies the corresponding construction therein, which avoided the use of external Gaussian noise). We will make liberal use of (2.8) to connect asymptotic overlaps before and after applying \(\nabla H_N(\cdot )\).
For base cases, the \({\underline{\ell }}\) case of (2.30) is immediate from (2.16). The base case of (2.31) follows from (2.22), and thus the \({\underline{\ell }}+1\) case of (2.30). The main computation for the base case is
Here we used the general AMP statement of Theorem 2 to say that
For inductive steps, we always have by state evolution
It follows by the inductive hypothesis of (2.28) that for \(j\le \ell \),
Plugging into the above yields that for \(j\le \ell \),
This depends only on \(\min (j,\ell )\), so (2.28) follows. The others are proved by similar computations. \(\square \)
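The covariance structure of Lemma 2.6, a time-changed Brownian motion whose covariance \({\mathbb {E}}[Z_\ell Z_j]\) depends only on \(\Phi (q_{\ell \wedge j})\), can be checked by direct simulation. In this sketch \(\Phi \) is an arbitrary increasing surrogate of our choosing, and the grid stands in for \(q^{\delta }_{\ell }\).

```python
import numpy as np

Phi = lambda q: q + 0.5 * q**2      # any increasing C^1 surrogate for Phi_s
q = np.linspace(0.3, 1.0, 8)        # stand-in for the overlap grid q_l^delta

rng = np.random.default_rng(1)
T = 200_000                          # Monte Carlo trials
Z = np.zeros((T, len(q)))
Z[:, 0] = np.sqrt(Phi(q[0])) * rng.normal(size=T)
for l in range(1, len(q)):
    # independent increments with variance Phi(q_l) - Phi(q_{l-1})
    Z[:, l] = Z[:, l - 1] + np.sqrt(Phi(q[l]) - Phi(q[l - 1])) * rng.normal(size=T)

# empirical covariance: E[Z_l Z_j] = Phi(q_{min(l,j)}), a function of min(l,j) only
emp = (Z.T @ Z) / T
for l in range(len(q)):
    for j in range(len(q)):
        assert abs(emp[l, j] - Phi(q[min(l, j)])) < 0.03
```

The independence of increments used here is exactly what makes the martingale computations in Lemma 2.7 go through.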
Equation (2.31) implies that \(\vec R({\varvec{n}}^{\delta }_{\ell },{\varvec{n}}^{\delta }_{j})\simeq \Phi (q^{\delta }_{(\ell \wedge j)+1})\), which exactly corresponds to the previous sections of the paper. In particular it implies that the final iterate \({\varvec{n}}^{\delta }_{\overline{\ell }}\) satisfies
so the rounding step (2.27) causes only an \(O(\delta )\) change in the Hamiltonian value. Finally we compute in Lemma 2.7 the energy gain from the second phase, which matches the second term in (1.8).
Lemma 2.7
Proof
Observe that \(\langle h,{\varvec{n}}^{\overline{\ell }}-{\varvec{n}}^{{\underline{\ell }}}\rangle _N\simeq 0\) because the values \((N_{\ell ,s}^{\delta })_{\ell \ge {\underline{\ell }}}\) form a martingale sequence for each \(s\in {\mathscr {S}}\). Therefore it suffices to find the in-probability limit of \(\frac{\widetilde{H}_{N}({\varvec{n}}^{\overline{\ell }})-\widetilde{H}_{N}({\varvec{n}}^{{\underline{\ell }}})}{N}\). We write
and use a Taylor series approximation for each term. In particular for \(F\in C^3(\mathbb R;{\mathbb {R}})\), applying Taylor’s approximation theorem twice yields
Assuming \(\sup _{\ell } {\left\Vert{\varvec{n}}^\ell \right\Vert}_N \le 1\), which holds with probability \(1-o_N(1)\) by state evolution and the definition of \(\overline{\ell }\), we apply this estimate with
The result is:
Proposition 1.6 implies that for deterministic constants c, C,
On the other hand for each \({\underline{\ell }}\le \ell \le \overline{\ell }-1\) we have
Summing and noting that \(\overline{\ell }-{\underline{\ell }}\le \delta ^{-1}\) yields the high-probability estimate
So, this term vanishes as \(\delta \rightarrow 0\). It remains to prove
To establish this it suffices to show for each species \(s\in {\mathscr {S}}\) the equality
Observe by (2.24) that
Passing to the limiting Gaussian process \((Z^{\delta }_k)_{k\in \mathbb Z^+}\) via state evolution,
As \((N^{\delta }_k)_{k\in \mathbb Z^+}\) is a martingale process, it follows that all right-most expectations vanish. Similarly it holds that
We conclude that
In the second-to-last step we used independence of \(Z^{\delta }_{\ell ,s}\) increments, which follows from Lemma 2.6, while the last step used (2.23) and (2.29). Combining with [16, Lemma 3.7] on discrete approximation of the integral in \({\mathbb {A}}\) implies (2.34). \(\square \)
Proof of Theorem 1
We take \({\mathcal {A}}\) as in (2.27) for \({\underline{\ell }}\) a large constant depending on \(({\varepsilon },\xi ,h,\lambda )\). First,
follows from combining Lemmas 2.5, 2.7 and the fact that (recall (2.32))
Next, let \(K_N\subseteq {\mathscr {H}}_N\) be as in Proposition 1.6. We recall that \({\mathbb {P}}[H_N\in K_N]\ge 1-e^{-cN}\). Exactly as in [15, Theorem 10] it follows that there is a \(C({\varepsilon })\)-Lipschitz function \(\widetilde{\mathcal {A}}:{\mathscr {H}}_N\rightarrow {\mathbb {R}}\) such that \(\widetilde{\mathcal {A}}\) and \({\mathcal {A}}\) agree on \(K_N\). Moreover (1.6) and concentration of measure on Gaussian space imply that \(H_N(\widetilde{\mathcal {A}}(H_N))\) is \(O(N^{1/2})\)-sub-Gaussian. In light of (2.36) and since \({\mathbb {P}}[\widetilde{\mathcal {A}}(H_N)={\mathcal {A}}(H_N)]\ge {\mathbb {P}}[H_N\in K_N]\ge 1-e^{-cN}\), we deduce that
This concludes the proof. \(\square \)
3 Extensions
3.1 Signed AMP
In our companion paper [17], we show that strictly super-solvable models have w.h.p. exactly \(2^r\) critical points, indexed by sign patterns \({\vec \Delta }\in \{\pm 1\}^r\) with the following physical meaning. Consider first the extreme case of a linear Hamiltonian, with external field \(\varvec{h}= {\vec h}\diamond {\textbf {1}}\), where all entries of \({\vec h}\) are nonzero, and no other interactions. This model clearly has \(2^r\) critical points, which are the products of the maxima and minima in the spheres \(\{{\left\Vert{\varvec{x}}_s\right\Vert}_2^2 = \lambda _s N\}\) corresponding to each species \(s\in {\mathscr {S}}\), and the signs \({\vec \Delta }\) record whether the critical point is a maximum or minimum in each species. As explained in [17, Sect. 6.6], if a strictly super-solvable \(H_N\) is gradually deformed to a linear function (staying inside the strictly super-solvable phase), the critical points move stably, and over this process their Hessian eigenvalues do not cross zero. Thus, each critical point of \(H_N\) can also be associated with a sign pattern \({\vec \Delta }\).
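For the linear Hamiltonian just described, the \(2^r\) critical points can be written down and verified directly: in each species the critical point is \(\pm \sqrt{\lambda _s N}\,{\varvec{h}}_s/\Vert {\varvec{h}}_s\Vert \). A small numpy sketch, in which the equal-size species blocks and equal weights are our illustrative choices:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N, r = 900, 3
species = np.repeat(np.arange(r), N // r)   # three equal species blocks
lam = np.full(r, 1.0 / r)
h = rng.normal(size=N)                      # external field; entries nonzero a.s.

def riem_grad_norm(x):
    """Norm of the Riemannian gradient of H(x) = <h, x> on the product of spheres."""
    g = np.zeros(N)
    for s in range(r):
        m = species == s
        # tangential part: remove the radial component in each species
        g[m] = h[m] - (h[m] @ x[m]) / (x[m] @ x[m]) * x[m]
    return np.linalg.norm(g)

crits = []
for Delta in product([1, -1], repeat=r):    # one sign per species
    x = np.zeros(N)
    for s in range(r):
        m = species == s
        x[m] = Delta[s] * np.sqrt(lam[s] * N) * h[m] / np.linalg.norm(h[m])
    assert riem_grad_norm(x) < 1e-8         # each of the 2^r points is critical
    crits.append(x)
assert len(crits) == 2**r
```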
We now show that the root-finding algorithm defined in Sect. 2.3 can be generalized to find all \(2^r\) critical points in a strictly super-solvable model. More precisely, it finds \(2^r\) approximate critical points, one in a neighborhood of each exact critical point of the model, from which the exact critical points can be computed by Newton’s method (see Remark 3.2). For general models, it finds \(2^r\) approximate critical points on the product of spheres with self-overlap \(\Phi (q_1)\). The restriction of \(H_N\) to this set, considered as a spin glass in its own right (see [16, Remark 1.2]) is a solvable model.
Fixing \({\vec \Delta }\in \{\pm 1\}^r\), the analogous iteration to (2.10) is:
The change of sign does not affect the proofs or statements of Lemmas 2.3 and 2.4. Indeed, \(a_s^2\) only changes to \(\Delta _s^2 a_s^2\) in the proof of the former, which is no change at all. The generalization of Lemma 2.5 is as follows.
Lemma 3.1
Proof
The proof is similar to Lemma 2.5. The main calculation now becomes:
Moreover the external field \(\varvec{h}\) now contributes energy
Combining gives the desired statement. \(\square \)
Remark 3.1
One can sign the IAMP phase as well by redefining (2.26) to
The resulting output \({\varvec{n}}^{\overline{\ell }}({\vec \Delta })\) then achieves asymptotic energy (recall (1.8))
However it is unclear whether \({\varvec{n}}^{\overline{\ell }}({\vec \Delta })\) can be made to obey any notable properties. We will show that the signed outputs \(\varvec{m}^k({\vec \Delta })\) of the first phase above are approximate critical points for \(H_N\) (and in [17] that all near-critical points are close to one of them). By contrast, for the output of signed IAMP to be a critical point, \(\Phi \) must satisfy a signed version of the tree-descending ODE (2.3) in which the function \((\xi ^s \circ \Phi )'(q)\) is replaced by
Since this quantity appears inside a square root in (2.3), it is unclear when to expect solutions to exist. Furthermore the proof in [16] of well-posedness relies on positivity of coefficients (via Perron-Frobenius theory) and does not seem to generalize. Additionally, a solution would not seem to correspond to a maximizer of any variational problem as in (1.8). As a result we do not know how to prove a solution exists in the signed case. However if one takes as given a smooth function \(\Phi \) satisfying the signed tree-descending ODE, the iteration (3.2) starting from signed initialization \({\varvec{n}}^{{\underline{\ell }}}({\vec \Delta })=\varvec{m}^{{\underline{\ell }}}({\vec \Delta })+ \sqrt{\Phi (q_1+\delta )-\Phi (q_1)}\diamond {\varvec{g}}\) would produce an approximate critical point \({\varvec{n}}^{\overline{\ell }}({\vec \Delta })\) which still satisfies (3.3).
3.2 Gradient Computation and Connection to \(E_{\infty }\)
We now compute the gradient of the outputs, showing that \(\varvec{m}^{{\underline{\ell }}}({\vec \Delta })\) and \({\varvec{n}}^{\ell }\) (\({\underline{\ell }}\le \ell \le \overline{\ell }\)) are approximate critical points for the restriction of \(H_N\) to the products of r spheres with suitable radii passing through them. For \({\varvec{\sigma }}\) to be an approximate critical point means precisely that there exist coefficients \(\vec {A}\in {\mathbb {R}}^r\) such that
In our case, these coefficients will be given as follows. If \(\vec {1}\) is strictly sub-solvable (so \(q_1<1\)), define \(\vec {A}(q)\) for \(q\in [q_1,1]\) by
Further define for \({\vec \Delta }\in \{-1,1\}^r\)
Note that, by (2.2), this is consistent with the definition of \(\vec {A}(q_1)\) above, in the sense that \(\vec {A}(q_1;\vec {1}) = \vec {A}(q_1)\). We take this to be the definition of \(\vec {A}(q_1)\) if \(\vec {1}\) is super-solvable (and \(q_1=1\)).
Proposition 3.2
If \(\Phi \) is a pseudo-maximizer for \({\mathbb {A}}\) (recall Definition 2.1) then for any \({\vec \Delta }\in \{\pm 1\}^r\),
Proof
Recall from Lemma 2.3 (which holds without modification for general \({\vec \Delta }\)) that
Thus rearranging (3.1) yields
Since \(\lim _{{\underline{\ell }}\rightarrow \infty } \big (\Delta _s a_s^{-1}+b_{{\underline{\ell }},s}({\vec \Delta })\big )=A_s({\vec \Delta })\) by (3.6), the result follows. \(\square \)
Remark 3.2
In [17, Theorems 1.5 and 1.6], we show that when \(\xi \) is strictly super-solvable, \(H_N\) has exactly \(2^r\) critical points \(\{{\varvec{x}}({\vec \Delta })\}_{{\vec \Delta }\in \{-1,1\}^r}\). Moreover all \({\varepsilon }\)-approximate critical points with Riemannian gradient \(\Vert \nabla _{{\text {sp}}}H_N({\varvec{x}})\Vert \le {\varepsilon }\sqrt{N}\) are within \(o_{{\varepsilon }}(\sqrt{N})\) of some \({\varvec{x}}({\vec \Delta })\). It follows from Proposition 3.2 that each \(\varvec{m}^{{\underline{\ell }}}({\vec \Delta })\) is an \({\varepsilon }\)-approximate critical point for large enough \({\underline{\ell }}={\underline{\ell }}(\xi ,{\varepsilon })\). In fact the preceding gradient computation shows that the values \({\vec \Delta }\) agree, implying that \(\Vert \varvec{m}^{{\underline{\ell }}}({\vec \Delta })-{\varvec{x}}({\vec \Delta })\Vert _N \le o_{{\underline{\ell }}\rightarrow \infty }(1)\) (compare with [17, Definition 5, Eq. (1.15)]). Moreover by [17, Theorem 1.6] each Riemannian Hessian \(\nabla ^2_{{\text {sp}}}H_N({\varvec{x}}({\vec \Delta }))\) has condition number at least \(1/C(\xi )\). It follows that each critical point \({\varvec{x}}({\vec \Delta })\) can be efficiently computed to arbitrary accuracy by applying Newton’s method from \(\varvec{m}^{{\underline{\ell }}}({\vec \Delta })\) for a large enough \({\underline{\ell }}={\underline{\ell }}(\xi )\). (By contrast, the convergence of \(\varvec{m}^{{\underline{\ell }}}({\vec \Delta })\) itself to \({\varvec{x}}({\vec \Delta })\) is only in the careful double-limit sense \(\lim _{{\underline{\ell }}\rightarrow \infty }\lim _{N\rightarrow \infty }\).)
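A toy version of this Newton refinement: on a single sphere with a quadratic Hamiltonian, critical points solve the Lagrange system \(A x = \mu x\), \(\Vert x\Vert =1\), and Newton's method on the bordered system converges rapidly from an approximate critical point when the relevant Jacobian is well-conditioned. This simplified single-species sketch is ours, not the algorithm of [17].

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.normal(size=(n, n)); A = (A + A.T) / 2     # toy symmetric "Hamiltonian"

def newton_refine(x, mu, steps=12):
    """Newton iteration on F(x, mu) = (A x - mu x, (||x||^2 - 1)/2) = 0."""
    for _ in range(steps):
        F = np.concatenate([A @ x - mu * x, [(x @ x - 1.0) / 2]])
        J = np.block([[A - mu * np.eye(n), -x[:, None]],
                      [x[None, :],          np.zeros((1, 1))]])
        d = np.linalg.solve(J, -F)
        x, mu = x + d[:n], mu + d[n]
    return x, mu

# start from a crude approximate critical point: perturbed top eigenvector
w, V = np.linalg.eigh(A)
x0 = V[:, -1] + 0.05 * rng.normal(size=n)
x0 /= np.linalg.norm(x0)
x, mu = newton_refine(x0, x0 @ A @ x0)
# refined point is an exact critical point on the sphere to machine precision
assert np.linalg.norm(A @ x - mu * x) < 1e-8 and abs(x @ x - 1.0) < 1e-8
```

The quadratic convergence here mirrors why a uniform lower bound on the Hessian condition number at \({\varvec{x}}({\vec \Delta })\) suffices for efficient refinement to arbitrary accuracy.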
Proposition 3.3
If \(\Phi \) is a pseudo-maximizer for \({\mathbb {A}}\), then for any \({\underline{\ell }}\)-indexed sequence \((q_*,\ell )=\big ((q_*,\ell )_{{\underline{\ell }}\ge 1}\big )\) such that \(q_*\in [q_1,1]\), \({\underline{\ell }}\le \ell \le \overline{\ell }\) and \(\lim _{{\underline{\ell }}\rightarrow \infty }|q_*-q_{\ell }^{\delta }|=0\), we have
Proof
For notational convenience we assume \((q_*,\ell )=(1,\overline{\ell })\); the proof is identical in general. Recall the rearrangement (2.35):
So far we did not have to compute \(d_{\overline{\ell }, j}\). We do this now, focusing on the IAMP phase. Recalling (2.25), the IAMP iteration used non-linearity
Using the formula (2.7) we find
Note that since \(\Phi \in C^2([q_1,1])\) we have the uniform-in-\(q_j^{\delta }\) approximations (recall (3.5)):
Substituting into (3.9), we obtain
Since the increments \(({\varvec{n}}^{j+1}-{\varvec{n}}^j)\) are orthogonal in the state evolution sense, it easily follows that the approximation of \(C_{j,s}\) by \(\widehat{C}_{s}(q_j^{\delta })\) commutes with summation, i.e.
Note that we manifestly have \(\widehat{C}(1)=\vec {A}(1)\). We claim the function \(\widehat{C}\) is constant on \([q_1,1]\). This is equivalent to showing that for each s the function
is constant. Differentiating, it suffices to show
Write \(f_{s'}'(q)=\Psi (q)\Phi _{s'}'(q)\), where \(\Psi \) is independent of s since \(\Phi \) solves the tree-descending ODE (2.3). Then using the chain rule, the left-hand side of (3.12) equals
Meanwhile the right-hand side of (3.12) is
Therefore \(\widehat{C}(q)=\vec {A}(1)\) is constant as claimed. Finally it is clear that the \({\varvec{n}}^{{\underline{\ell }}}\) coefficient in (3.11) approximately equals \(\widehat{C}(q_1)\) and hence also \(\vec {A}(1)\). Then (3.11) implies
which completes the proof. \(\square \)
From the point of view of [16], the fact that \(\Vert \nabla _{{\text {sp}}} H_N({\varvec{n}}^{\overline{\ell }})\Vert _N\approx 0\) is to be expected. At least for \((\Phi ;q_1)\) maximizing \({\mathbb {A}}\), if this were not true, then an extra step of gradient descent would essentially suffice to reach energy strictly better than \({\textsf {ALG}}\), contradicting the optimality in [16, Theorem 1]. However the radial derivative computation is interesting in its own right and lets us study the spherical Hessian around an output \({\varvec{\sigma }}\). We believe that Corollary 3.4 can be strengthened to hold with \({\varvec{\lambda }}_1\) rather than \({\varvec{\lambda }}_{{\varepsilon }N}\). This seems to require a more precise Gaussian conditioning argument around \({\mathcal {A}}(H_N)\), which we chose not to pursue.
Corollary 3.4
With \({\varvec{\lambda }}_k\) the k-th largest eigenvalue of a symmetric real matrix,
Proof
Fixing \(\vec {A}=\vec {A}(1)\), the bulk spectral measure of
for deterministic \({\varvec{x}}\in {\mathcal {S}}_N\) concentrates with rate function \(N^2\) around a limiting spectral measure independent of \({\varvec{x}}\). By union-bounding over an \(\delta \sqrt{N}\)-net as in [26, Proof of Lemma 3], it thus suffices to show (3.13) at a point \({\varvec{x}}\in {\mathcal {S}}_N\) independent of \(H_N\), with \({\varvec{W}}({\varvec{x}})\) in place of \(\nabla ^2_{{\text {sp}}} H_N({\varvec{\sigma }})\). This is purely a statement of random matrix theory and is shown in [17, Proposition 5.18]. \(\square \)
Notably Corollary 3.4 explains the equality \({\textsf {ALG}}=E_{\infty }\) for pure models, which we derived manually in [16]. Indeed for a pure model with \(\xi =\prod _{i=1}^r x_i^{a_i}\), the energy and radial derivative are deterministically proportional:
It follows (using again the \(N^2\) large deviation rate for the spectral bulk) that there is a unique energy level \(E_{\infty }\) at which critical points can have spherical Hessian obeying the conclusion of Corollary 3.4. This is the definition of \(E_\infty \) given in [1, 20].
3.3 Branching IAMP and Exponential Concentration
Here we modify the second stage of our IAMP algorithm (which requires \({\vec \Delta }=\vec {1}\)) to use external Gaussian randomness in a small number of increment steps. This allows the construction of an ultrametric tree of outputs with large constant depth and \(\exp (cN)\) breadth, with pairwise overlaps given by \(\Phi \). More precisely, for any finite ultrametric space \(X=(x_1,\ldots ,x_M)\), \(M=\exp (cN)\), of diameter at most \(1-q_1\), branching IAMP outputs \(({\varvec{\sigma }}_1,\ldots ,{\varvec{\sigma }}_M)\) with
We use an approach suggested in [3] by injecting external Gaussian noise \({\varvec{g}}^{(i)}\) into the IAMP phase of the algorithm at depth \(q_i\in (q_1,1)\). Importantly, this gives an explicit construction of \(\exp (cN)\) approximate critical points of \(H_N\) (with exponentially good probability) whenever there is an IAMP phase. A similar construction was used by one of us in [24, Sect. 4]. There the Gaussian noise was constructed artificially by preliminary iterates of AMP rather than from exogenous noise (due to the lack of a state evolution result incorporating independent Gaussian vectors). This only enabled the construction of a large constant number of outputs rather than exponentially many.
Our branching IAMP proceeds as follows. We first apply Stage \(\text {I}\) with \({\vec \Delta }=\vec {1}\) as before. We fix \(q_1<q_2<\dots <q_m=1\) and let
We define \({\varvec{n}}^{\ell }\) with the same recursive formula as before, unless \(\ell =\ell ^{\delta }_{q_i}\) for some \(i\in [m]\). For these cases, we define \({\varvec{g}}^{(1)},\ldots ,{\varvec{g}}^{(m)}\sim {\mathcal {N}}(0,I_N)\) to be independent standard Gaussian vectors. Then we set:
The definition (3.15) naturally enables couplings for pairs of iterations. We say the iterations \(\big ({\varvec{n}}^{\ell ,1},{\varvec{n}}^{\ell ,2}\big )_{\ell \ge 1}\) are \(q_j\)-coupled if their associated Gaussian vectors
are coupled so that \({\varvec{g}}^{(i,1)}={\varvec{g}}^{(i,2)}\) almost surely for \(i<j\), and the variables are otherwise independent.
Proposition 3.5
Let the iterations \({\varvec{n}}^{\ell ,1},{\varvec{n}}^{\ell ,2}\) be \(q_j\)-coupled as above, and let \(\Phi \) be a pseudo-maximizer of \({\mathbb {A}}\) (recall Definition 2.1). Then
Proof
The analysis uses the slightly generalized state evolution given in Theorem 2, which states that (2.8) continues to hold even in the presence of external randomness \({\varvec{g}}^{(i)}\). Modulo this point, the calculations are essentially identical. Indeed [24] uses exactly the same calculations to analyze a slightly different formulation of branching IAMP (therein, the vectors \({\varvec{g}}\) are defined via negatively time-indexed AMP iterates to sidestep the lack of a generalized state evolution result). We therefore give only an outline below.
The SDE description in (2.6) is unchanged if one uses the slightly added generality of Theorem 2 to incorporate the external Gaussian noise. (This Gaussian noise is scaled in (3.15) to achieve exactly the same effect as a usual iteration step.) The energy analysis of \(H_N({\varvec{n}}^{\overline{\ell }})\) only changes on the m modified steps which has negligible effect since \(\delta \rightarrow 0\) as \({\underline{\ell }}\rightarrow \infty \); similarly for \(\nabla H_N({\varvec{n}}^{\overline{\ell }})\). Thus (3.16) follows by the same proof as before. The proof of (3.18) is identical to [24, Sect. 8]. \(\square \)
In Proposition 3.6 we observe that concentration of measure implies Proposition 3.5 holds with exponentially high probability. Thus we can couple together \(\exp (cN)\) branching IAMPs to construct a full ultrametric tree of large constant depth m and breadth \(\exp (cN)\). To do this, we fix m, take \({\underline{\ell }}\) sufficiently large and then \(\eta >0\) sufficiently small. Then with \(K=\exp (\eta N)\), we consider a complete depth m rooted tree \({\mathcal {T}}\), with root defined to have depth 1, such that each vertex at depths \(1,\ldots ,m-1\) has K children. Thus the leaf-set \(L({\mathcal {T}})\) is naturally indexed by \([K]^m\). For \(v,v'\in L({\mathcal {T}})\) we let \(v\wedge v'\in \{1,2,\ldots ,m\}\) denote the height of their least common ancestor. For each non-leaf \(x\in V({\mathcal {T}})\), label the edge from x to its parent by an i.i.d. Gaussian vector \({\varvec{g}}^{(x)}\sim {\mathcal {N}}(0,I_N)\). Then for each leaf \(v\in L({\mathcal {T}})\), using the m Gaussian vectors along the path from the root of \({\mathcal {T}}\) to v yields branching IAMP output \({\varvec{\sigma }}^{(v)}\) for any \(H_N\).
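The resulting ultrametric overlap structure can be illustrated by a toy version of this tree construction, in which each output is the shared Stage-I part plus increments whose noise vectors are indexed by the ancestors of the leaf in \({\mathcal {T}}\). Here the breadth \(K\) is a small constant rather than \(\exp (\eta N)\), the model is reduced to a single species, and \(\Phi \) is an arbitrary increasing surrogate; all of these simplifications are ours.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
N, K, m = 20_000, 3, 4               # dimension, branching, depth
Phi = lambda q: q                    # single-species surrogate: Phi = identity
q = np.array([0.4, 0.6, 0.8, 1.0])   # q_1 < q_2 < ... < q_m = 1

base = np.sqrt(Phi(q[0])) * rng.normal(size=N)   # common Stage-I part
edge_noise = {}                                   # g^{(x)} per tree edge

def output(v):
    """Toy branching-IAMP output for leaf v: noise shared up to the LCA."""
    sigma = base.copy()
    for i in range(m - 1):
        key = v[:i + 1]              # ancestor prefix at depth i+1 indexes the noise
        if key not in edge_noise:
            edge_noise[key] = rng.normal(size=N)
        sigma += np.sqrt(Phi(q[i + 1]) - Phi(q[i])) * edge_noise[key]
    return sigma

leaves = list(product(range(K), repeat=m - 1))
sig = {v: output(v) for v in leaves}
for v in leaves:
    for u in leaves:
        # depth of the least common ancestor = first differing coordinate
        wedge = next((i for i in range(m - 1) if v[i] != u[i]), m - 1)
        R = sig[v] @ sig[u] / N
        assert abs(R - Phi(q[wedge])) < 0.05     # overlap = Phi(q_{v ^ u})
```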
Proposition 3.6
Proposition 3.5 holds with exponentially good probability in the following sense. Fix m and \(q_1<q_2<\dots <q_m=1\). For any \({\varepsilon }>0\), for large enough \({\underline{\ell }}\) there exists \(\eta =\eta ({\varepsilon },{\underline{\ell }})>0\) such that for N large enough, the following hold simultaneously across all \(v,v'\in L({\mathcal {T}})\) with probability at least \(1-\exp (-\eta N)\):
Proof
As explained in [15, Sect. 8], the map \(H_N\mapsto {\varvec{n}}^{\overline{\ell }}\) agrees with a \(C({\underline{\ell }})\)-Lipschitz function of the coefficients \({\varvec{G}}^{(k)}\) of \(H_N\) except with probability \(1-\exp (-cN)\). The same proof applies for \(H_N\mapsto {\varvec{n}}^{\overline{\ell },v}\) as well since the external noise variables are also Gaussian. Concentration of measure on Gaussian space now ensures that the statements above hold with exponentially high probability for each fixed \((v,v')\). Union bounding over all such pairs for small enough \(\eta \) implies the result. \(\square \)
In particular, the last conclusion in (3.19) shows that all \(\exp (\eta N)\) constructed points have pairwise distance at least \(\delta \sqrt{N}\) for \(0<\delta <1-q_{m-1}\). Thus for any sub-solvable model, with high probability there are exponentially many \(\sqrt{N}/C(\xi )\)-separated approximate critical points. This is a converse to the main result of [17], where we show that strictly super-solvable models enjoy a strong topological trivialization property which rules out such behavior.
Remark 3.3
An alternative to branching IAMP, which is very natural from the point of view of our companion work [16], is to slightly perturb \(H_N\) to a \((1-\eta )\)-correlated function \(H_N^{(\eta )}\). Concentration of measure implies that the overlap
concentrates exponentially around a limiting value \(R_{\delta ,{\underline{\ell }},\eta }\in {\mathbb {R}}^r\). We expect that taking \(\eta \rightarrow 0\) with \(\delta ,{\underline{\ell }}\) in a suitable way enables \(R_{\delta ,{\underline{\ell }},\eta }\approx \Phi (q)\) for any desired \(q\in [q_1,1]\). This corresponds to the behavior of p(q) for \(q\in [q_1,1]\) for any \((p,\Phi ;q_0)\) maximizing \({\mathbb {A}}\). However this approach seems more cumbersome to analyze explicitly.
Remark 3.4
The construction in this section shows the quenched existence of \(\exp (\eta N)\) well-separated approximate critical points for strictly sub-solvable models. In [17, Theorem 5.15] we use this fact to prove the number of exact critical points is exponentially large in expectation. However we are unable to prove the quenched (i.e. high-probability) existence of \(\exp (\eta N)\) exact critical points in strictly sub-solvable models. Showing that this is the case, or more generally identifying the quenched exponential order of the number of critical points, is an interesting direction for future work.
Data Availability
We do not analyze or generate any data. Instead, our work proceeds via a mathematical approach.
Notes
Technically the \(N\rightarrow \infty \) limit is not known to exist for general \(\xi \). Since \(\textsf{OPT}\) appears in the present paper only in this informal discussion, we will not belabor this point.
If \({\vec h}=0\), one takes \({\underline{\ell }}=q_1=0\), \(n^{1}_i=\sqrt{\Phi _{s(i)}(\delta )}{\varvec{g}}_i\), and proceeds identically.
The unusual factor 2 in the exponent comes from the external randomness vectors \({\varvec{e}}^1,\ldots ,{\varvec{e}}^t\).
References
Auffinger, A., Ben Arous, G., Černý, J.: Random matrices and complexity of spin glasses. Commun. Pure Appl. Math. 66(2), 165–201 (2013)
Auffinger, A., Chen, W.-K.: Free energy and complexity of spherical bipartite models. J. Stat. Phys. 157(1), 40–59 (2014)
El Alaoui, A., Montanari, A.: Algorithmic thresholds in mean field spin glasses. arXiv preprint (2020). arXiv:2009.11481
El Alaoui, A., Montanari, A., Sellke, M.: Optimization of mean-field spin glasses. Ann. Probab. 49(6), 2922–2960 (2021)
El Alaoui, A., Sellke, M.: Algorithmic pure states for the negative spherical perceptron. J. Stat. Phys. 189(2), 27 (2022)
Bayati, M., Lelarge, M., Montanari, A.: Universality in polytope phase transitions and message passing algorithms. Ann. Appl. Probab. 25(2), 753–822 (2015)
Bayati, M., Montanari, A.: The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inf. Theory 57, 764–785 (2011)
Berthier, R., Montanari, A., Nguyen, P.-M.: State evolution for approximate message passing with non-separable functions. Inf. Inference 9, 33–79 (2019)
Bolthausen, E.: An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Commun. Math. Phys. 325(1), 333–366 (2014)
Chen, W.-K., Lam, W.-K.: Universality of approximate message passing algorithms. Electron. J. Probab. 26, 1–44 (2021)
Dudeja, R., Lu, Y.M., Sen, S.: Universality of approximate message passing with semi-random matrices. Ann. Probab. 51(5), 1616–1683 (2023). https://doi.org/10.1214/23-AOP1628
Dembo, A., Montanari, A., Sen, S.: Extremal cuts of sparse random graphs. Ann. Probab. 45(2), 1190–1217 (2017)
Fan, Z.: Approximate message passing algorithms for rotationally invariant matrices. Ann. Stat. 50(1), 197–224 (2022)
Feng, O.Y., Venkataramanan, R., Rush, C., Samworth, R.J., et al.: A unifying tutorial on approximate message passing. Found. Trends Mach. Learn. 15(4), 335–536 (2022)
Huang, B., Sellke, M.: Tight Lipschitz hardness for optimizing mean field spin glasses. arXiv preprint (2021). arXiv:2110.07847
Huang, B., Sellke, M.: Algorithmic threshold for multi-species spherical spin glasses. arXiv preprint (2023). arXiv:2303.12172
Huang, B., Sellke, M.: Strong topological trivialization of multi-species spherical spin glasses. arXiv preprint (2023). arXiv:2308.09677
Javanmard, A., Montanari, A.: State evolution for general approximate message passing algorithms, with applications to spatial coupling. Inf. Inference 2(2), 115–144 (2013)
Krzakala, F., Montanari, A., Ricci-Tersenghi, F., Semerjian, G., Zdeborová, L.: Gibbs states and the set of solutions of random constraint satisfaction problems. Proc. Natl. Acad. Sci. 104(25), 10318–10323 (2007)
McKenna, B.: Complexity of bipartite spherical spin glasses. arXiv preprint (2021). arXiv:2105.05043
Montanari, A.: Optimization of the Sherrington–Kirkpatrick Hamiltonian. SIAM J. Comput. (2021). https://doi.org/10.1137/20M132016X
Panchenko, D.: On the K-sat model with large number of clauses. Random Struct. Algorithms 52(3), 536–542 (2018)
Richard, E., Montanari, A.: A statistical model for tensor PCA. In: Advances in Neural Information Processing Systems, pp. 2897–2905 (2014)
Sellke, M.: Optimizing mean field spin glasses with external field. Electron. J. Probab. 29, 1–47 (2024)
Sherrington, D., Kirkpatrick, S.: Solvable model of a spin-glass. Phys. Rev. Lett. 35(26), 1792 (1975)
Subag, E.: Following the ground states of full-RSB spherical spin glasses. Commun. Pure Appl. Math. 74(5), 1021–1044 (2021)
Acknowledgements
B.H. was supported by an NSF Graduate Research Fellowship, a Siebel scholarship, NSF awards CCF-1940205 and DMS-1940092, and NSF-Simons collaboration grant DMS-2031883. M.S. was supported by an NSF graduate research fellowship, a Stanford graduate fellowship, and NSF award CCF-2006489 and was a member at the IAS while parts of this work were completed.
Funding
Open Access funding provided by the MIT Libraries
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests (beyond the aforementioned funding).
Additional information
Communicated by Francesco Zamponi.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: State Evolution: Proof of Proposition 2.2
In this section we prove Proposition 2.2, following the appendix of [4]. Throughout, we denote by \({\varvec{G}}^{(k)}\in ({\mathbb {R}}^N)^{\otimes k}\), \(k\ge 2\) a sequence of standard Gaussian tensors. For \(S_k\) the symmetric group on k elements we also write
for the rescaled tensors with entries
For a symmetric tensor \(\varvec{A}^{(k)}\in ({\mathbb {R}}^N)^{\otimes k}\) and \(\varvec{T}\in ({\mathbb {R}}^N)^{\otimes (k-1)}\), we denote by \(\varvec{A}^{(k)}\{\varvec{T}\}\in {\mathbb {R}}^N\) the vector with components
For \({\varvec{u}}\in {\mathbb {R}}^N\) we denote by \(\varvec{A}^{(k)}\{{\varvec{u}}\}=\varvec{A}^{(k)}\{{\varvec{u}}^{\otimes (k-1)}\}\) the vector with entries
Note that for \(\varvec{A}^{(k)}\) as in (A.1), one has
where \(H_{N,k}\) denotes the part of \(H_N\) of total degree k.
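For small k, the contraction \(\varvec{A}^{(k)}\{{\varvec{u}}\}\) and its relation to the gradient of the degree-k part of the Hamiltonian can be checked directly: for symmetric \(\varvec{A}\), \(\nabla \langle \varvec{A},{\varvec{u}}^{\otimes k}\rangle = k\,\varvec{A}^{(k)}\{{\varvec{u}}\}\). The sketch below takes k = 3 and ignores the \(1/N\) normalizations and mixture coefficients.

```python
import numpy as np

rng = np.random.default_rng(5)
N, k = 8, 3
G = rng.normal(size=(N,) * k)
# symmetrize: average over all permutations of the k = 3 indices
A = (G + G.transpose(0, 2, 1) + G.transpose(1, 0, 2)
       + G.transpose(1, 2, 0) + G.transpose(2, 0, 1) + G.transpose(2, 1, 0)) / 6

def contract(A, u):
    """A{u}: contract the symmetric k-tensor against u in all but one index."""
    return np.einsum('ijk,j,k->i', A, u, u)

def H(u):
    """Homogeneous degree-k polynomial <A, u^{tensor k}> (normalization omitted)."""
    return np.einsum('ijk,i,j,k->', A, u, u, u)

u = rng.normal(size=N)
# gradient of H equals k * A{u} for symmetric A; check one coordinate numerically
eps, e0 = 1e-6, np.eye(N)[0]
num_grad0 = (H(u + eps * e0) - H(u - eps * e0)) / (2 * eps)
assert abs(num_grad0 - k * contract(A, u)[0]) < 1e-4
```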
For \({\varvec{u}},{\varvec{v}}\in {\mathbb {R}}^N\) we recall from Sect. 1.3 the notations
Given functions \(f_{t,s}:{\mathbb {R}}^{t+1}\rightarrow {\mathbb {R}}\) of \(t+1\) variables for each \(s\in {\mathscr {S}}\), and \({\varvec{v}}^0,{\varvec{v}}^1,\ldots ,{\varvec{v}}^t\in {\mathbb {R}}^{N}\), we define \(f_t({\varvec{v}}^0,{\varvec{v}}^1,\ldots ,{\varvec{v}}^t)\in \mathbb R^N\) component-wise via
Finally, for a sequence of vectors \({\varvec{w}}^0,{\varvec{w}}^1,\dots \), we write \({\varvec{w}}^{\le t} = ({\varvec{w}}^0,{\varvec{w}}^1,\ldots ,{\varvec{w}}^t)\).
To deduce the state evolution result for mixed tensors, we analyze a slightly more general iteration where each homogeneous k-tensor is tracked separately, while restricting ourselves to the case where the mixture \(\xi \) has finitely many components: \(\gamma _{s_1,\ldots ,s_k} = 0\) for all \((s_1,\ldots ,s_k)\in {\mathscr {S}}^k\) whenever \(k \ge D +1\), for some fixed \(D \ge 2\). We then proceed by an approximation argument to extend the convergence to the general case \(D = \infty \).
We begin by introducing the Gaussian process that captures the asymptotic behavior of AMP. Define \(\xi ^k\) to be the degree k part of \(\xi \), and
the degree \(k-1\) part of \(\xi ^s\).
An AMP iteration is specified by Lipschitz functions \(f_{t,s}:{\mathbb {R}}^{2(t+1)}\rightarrow {\mathbb {R}}\) for each \((t,s)\in {\mathbb {N}}\times {\mathscr {S}}\). For each iteration t, the state of the algorithm is given by vectors \({\varvec{w}}^t\in {\mathbb {R}}^N\), and \({\varvec{z}}^{k,t}\in {\mathbb {R}}^N\), with \(k\in \{2,\ldots ,D\}\). Moreover for each t, there is also an external randomness vector \({\varvec{e}}^t\in {\mathbb {R}}^N\) with independent coordinates \(e^t_i\sim \mu _{t,s(i)}\) from deterministic probability distributions \(\big (\mu _{t,s}\big )_{t\ge 0,s\in {\mathscr {S}}}\) with finite second moment. We now start to define the AMP iteration steps (the definition finishes at (A.11)). A single step is given by
A general multi-species tensor AMP algorithm then takes the form:
For the right-hand side of (A.9) to make sense, we must define for each \(t\ge 0\) and \(s\in {\mathscr {S}}\) a distribution over sequences \((W^0_s,\ldots ,W^t_s;E^0_s,\ldots ,E^t_s)\). The latter variables \(E^{t'}_s\sim \mu _{t',s}\) are simply taken independent of each other and all other variables. The construction of the W variables is recursive across t as follows. For each \(2\le k\le D\) and \(s\in {\mathscr {S}}\), we let \(U^{k,0}_s\sim \nu _{k,s}\) and construct a centered Gaussian process
which is independent of \(U^{k,0}_s\). The variables \(U^{k,t}_s\) and \(U^{k',t'}_{s'}\) are independent unless \((k,s)=(k',s')\). It remains to specify the covariance of \((U^{k,t}_s)_{1\le t\le T}\) which is given recursively by:
The main result, an extension of Proposition 2.2, follows. Below we use \({\mathbb {W}}_2\) to denote the Wasserstein-2 distance between probability measures on Euclidean space in any dimension. We say a function \(\psi :{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) is pseudo-Lipschitz if
Theorem 2
(State Evolution for AMP) Let \(\{{\varvec{G}}^{(k)}\}_{k\ge 2}\) be independent standard Gaussian tensors with \({\varvec{G}}^{(k)}\in ({\mathbb {R}}^N)^{\otimes k}\), and define \(\varvec{A}^{(k)}\) as in (A.2). Fix a sequence of Lipschitz functions \(f_{t,s}:{\mathbb {R}}^{2(t+1)}\rightarrow {\mathbb {R}}\). Let \({\varvec{z}}^{2,0},\ldots ,{\varvec{z}}^{D,0}\in {\mathbb {R}}^N\) be deterministic vectors and \({\varvec{w}}^0 =\sum _{2\le k\le D} {\varvec{z}}^{k,0}\). Assume that for each \(s\in {\mathscr {S}}\), the empirical distribution of the vectors
converges in \({\mathbb {W}}_2({\mathbb {R}}^{D-1})\) distance to the law of the vector \((U^{k,0}_s)_{2\le k\le D}\).
Let \({\varvec{w}}^{t}, {\varvec{z}}^{k,t}\), \(t\ge 1\) be given by the tensor AMP iteration. Then, for all \(s\in {\mathscr {S}}\) and \(T\ge 1\) and for any pseudo-Lipschitz functions \(\psi :{\mathbb {R}}^{D \times T}\rightarrow {\mathbb {R}}\) and \(\widetilde{\psi }:{\mathbb {R}}^T\rightarrow {\mathbb {R}}\), we have
Note that (A.13) (which concerns the actual AMP iterates \({\varvec{w}}^t\)) is a special case of (A.12) (which is more convenient to prove). Indeed one can take \(\psi \left((z^{k,t})_{k\le D,t\le T}\right)=\widetilde{\psi }\left(\big (\sum _{k\le D}z^{k,t}\big )_{t\le T}\right)\). In the special case that \(c_k =0\) for all \(k\ge D+1\), Proposition 2.2 follows immediately from Theorem 2 by baking the contribution of \(\varvec{h}\) explicitly into \(f_t\) (since we require \(k\ge 2\) above). Proposition 2.2 for non-polynomial \(\xi \) follows by a standard approximation argument outlined at the end of Sect. 1. For the remainder of this appendix we thus focus on establishing (A.12).
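Although Theorem 2 concerns the multi-species tensor iteration, the state-evolution phenomenon it formalizes is easy to observe numerically. The following sketch uses a single-species \(k=2\) toy case: a Wigner-type matrix, a \(\tanh \) nonlinearity, and the standard Onsager correction (all illustrative choices, not the paper's algorithm), and compares the empirical second moment of the iterates against the scalar state-evolution recursion \(\tau _{t+1}={\mathbb {E}}[f(\sqrt{\tau _t}Z)^2]\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6000

# Wigner-type matrix: symmetric, off-diagonal variance 1/N (a k=2 stand-in for A^(k))
G = rng.standard_normal((N, N))
A = (G + G.T) / np.sqrt(2 * N)

f = np.tanh                           # Lipschitz nonlinearity (illustrative choice)
df = lambda x: 1.0 - np.tanh(x) ** 2  # its derivative, used in the Onsager term

w_prev_f = np.ones(N)   # f applied to the initialization; ||.||_N norm equal to 1
w = A @ w_prev_f        # first iterate; state evolution predicts variance tau_1 = 1
tau = 1.0

taus_emp, taus_pred = [], []
for t in range(3):
    onsager = df(w).mean() * w_prev_f        # Onsager correction
    w_prev_f = f(w)
    w = A @ w_prev_f - onsager               # AMP update for w^{t+1}
    Z = rng.standard_normal(200_000)
    tau = np.mean(f(np.sqrt(tau) * Z) ** 2)  # scalar state-evolution recursion
    taus_pred.append(tau)
    taus_emp.append(np.mean(w ** 2))

for e, p in zip(taus_emp, taus_pred):
    print(f"empirical {e:.3f}   state evolution {p:.3f}")
```

Deleting the Onsager term typically makes the two columns drift apart after a couple of iterations, which is one way to see that the correction term in the AMP update is not optional.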
A.1: Further Definitions
We define the notations
Given an \(N\times (t+1)\) matrix such as \({\varvec{W}}_t\), and a tensor \(\varvec{A}^{(k)}\in ({\mathbb {R}}^{N})^{\otimes k}\), we write \(\varvec{A}^{(k)}\{{\varvec{W}}_t\}\) for the \(N\times (t+1)\) matrix with columns \(\varvec{A}^{(k)}\{{\varvec{w}}^0\}\), ..., \(\varvec{A}^{(k)}\{{\varvec{w}}^t\}\):
We will write \(\varvec{f}_t=f_t({\varvec{W}}_t,{\varvec{E}}_t)=f_t({\varvec{w}}^0,\ldots ,{\varvec{w}}^t,{\varvec{e}}^0,\ldots ,{\varvec{e}}^t)\) and also set
We also define an associated \((t+1)\times (t+1)\) Gram matrix \({\varvec{G}}_{\xi ^{k,s}}={\varvec{G}}_{\xi ^{k,s},t}\) via
The dependence of \({\varvec{G}}_{\xi ^{k,s},t}\) on t will often be suppressed (this dependence is relevant when inverting the matrix \({\varvec{G}}_{\xi ^{k,s},t}\) but not for defining individual entries). Finally, we let \({\mathcal {F}}_t\) denote the \(\sigma \)-algebra generated by all iterates up to time t:
Throughout the proof of state evolution we make the following simplifying assumptions:
Assumption 2
\(\xi \) is a degree D polynomial with all coefficients \(\gamma _{s_1,\ldots ,s_k}\) for \(2\le k\le D\) strictly positive.
Assumption 3
Each matrix \({\varvec{G}}_{\xi ^{k,s},t}\) is well-conditioned, i.e.
for all \(t\le T\). Here \({\varvec{G}}_{\xi ^{k,s},t}\) is defined based on iterates that will appear in Theorem 3 and Lemma A.6. The same holds for \({\mathcal {L}}_{k,t}\) as defined in (A.34).
By a standard argument, to establish Proposition 2.2 it suffices to do so under the above assumptions: one can always slightly perturb both \(\xi \) and the non-linearities \(f_{t,s}\) to ensure the assumptions hold, and suitable continuity properties then transfer all asymptotic guarantees. We refer the reader to [4, Appendices A.8 and A.9] for the arguments in the single-species case, still in the generality of mixed tensors. (In the more common setting \(D=2\) of just a random matrix this step is also standard for state evolution proofs, see e.g. [18, Sect. 4.2.1].) The corresponding extension in our setting is completely analogous and omitted.
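Assumption 3 is a non-degeneracy condition: the normalized Gram matrix of the nonlinearity outputs must have singular values bounded away from 0 and \(\infty \). A small numerical illustration of what it rules out (generic versus nearly collinear columns; the sizes below are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 2000, 4

# Columns stand in for the nonlinearity outputs f_0, ..., f_{T-1} on one species;
# genuinely different iterates give a well-conditioned normalized Gram matrix.
F = rng.standard_normal((N, T))
gram = F.T @ F / N              # entries play the role of overlaps R(f_{t1}, f_{t2})
c_good = np.linalg.cond(gram)
print(c_good)                   # close to 1 here

# Nearly collinear iterates violate the assumption: the Gram matrix degenerates.
F_bad = np.outer(rng.standard_normal(N), np.ones(T)) + 1e-3 * rng.standard_normal((N, T))
c_bad = np.linalg.cond(F_bad.T @ F_bad / N)
print(c_bad)                    # enormous condition number
```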
A.2: Preliminary Lemmas
The next lemma has several parts. All are elementary Gaussian calculations so their proofs are omitted.
Lemma A.1
For any deterministic \({\varvec{u}},{\varvec{v}}\in {\mathbb {R}}^N\) and \(\varvec{A}^{(k)}\) defined by (A.2) we have:
1.
Letting \(g_0\sim \textsf{N}(0,1)\) be independent of \({\varvec{g}}\sim \textsf{N}(0,{\varvec{I}}_N)\), we have
$$\begin{aligned} \varvec{A}^{(k)}\{{\varvec{u}}\}{\mathop {=}\limits ^{\textrm{d}}}\sum _{s\in {\mathscr {S}}} {\varvec{g}}_s \sqrt{\xi ^{k,s}(\vec R({\varvec{u}},{\varvec{u}}))} + \frac{g_0}{\sqrt{N}} \sum _{s\in {\mathscr {S}}} {\varvec{u}}_s \sqrt{ \sum _{s'\in {\mathscr {S}}} \partial _{x_{s'}} \xi ^{k,s} \Big ( \vec R({\varvec{u}},{\varvec{u}})\Big )}. \end{aligned}$$(A.18)
2.
Let \(g_0,g_1,\ldots ,g_r\sim \textsf{N}(0,1)\) be independent. We have (jointly across \(s\in {\mathscr {S}}\))
$$\begin{aligned} \sqrt{\lambda _s N}\, R_s({\varvec{v}},\varvec{A}^{(k)}\{{\varvec{u}}\}) &{\mathop {=}\limits ^{\textrm{d}}} \sqrt{\xi ^{k,s}(\vec R({\varvec{u}},{\varvec{u}}))\cdot \vec R({\varvec{v}},{\varvec{v}})}\, g_s \\ &\quad + \sqrt{\sum _{s'\in {\mathscr {S}}} \partial _{x_{s'}} \xi ^{k,s} \Big (\vec R({\varvec{u}},{\varvec{u}})\Big )}\, R_s({\varvec{u}},{\varvec{v}})\, g_0. \end{aligned}$$(A.19)
3.
For \(s\in {\mathscr {S}}\):
$$\begin{aligned} R\left(\varvec{A}^{(k)}\{{\varvec{u}}\},\varvec{A}^{(k)}\{{\varvec{v}}\}\right)_s\simeq \xi ^{k,s}\left(\vec R( {\varvec{u}},{\varvec{v}})\right). \end{aligned}$$
4.
For a deterministic symmetric tensor \(\varvec{T}\in ({\mathbb {R}}^N)^{\otimes k-1}\), the vector \(\varvec{A}^{(k)}\{\varvec{T}\}\) is centered Gaussian. Its covariance is given by
$$\begin{aligned} \mathop {\mathrm {{\mathbb {E}}}}\limits \big [\varvec{A}^{(k)}\{\varvec{T}\}_i\varvec{A}^{(k)}\{\varvec{T}\}_j\big ]&= \langle \xi ^{k,s(i)}\diamond \varvec{T},\, \varvec{T}\rangle _N \cdot 1\{i=j\} \\&\quad +\frac{k(k-1)}{N^{k-1}}\;\sum _{i_1,\ldots ,i_{k-2}=1}^N \gamma _{i,i_1,\ldots ,i_{k-2}} \gamma _{j,i_1,\ldots ,i_{k-2}} T_{i,i_1,\ldots ,i_{k-2}} T_{j,i_1,\ldots ,i_{k-2}}. \end{aligned}$$
5.
Let \({\varvec{P}}\in {\mathbb {R}}^{N\times N}\) be the orthogonal projection onto a (deterministic) subspace \(S\subseteq {\mathbb {R}}^N\) with \(d=\dim (S)=O(1)\). Then
$$\begin{aligned} \Vert {\varvec{P}}{\varvec{G}}^{(k)}\{{\varvec{u}}\} - {\varvec{G}}^{(k)}\{{\varvec{u}}\}\Vert _2 /\Vert {\varvec{G}}^{(k)}\{{\varvec{u}}\}\Vert _2\simeq 0. \end{aligned}$$
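Point 3 of the lemma can be spot-checked by Monte Carlo in the simplest single-species case \(k=2\). In the normalization below (a Wigner-type stand-in for \(\varvec{A}^{(k)}\), not the paper's multi-species tensor), the limiting overlap of \(\varvec{A}\{{\varvec{u}}\}\) and \(\varvec{A}\{{\varvec{v}}\}\) is simply \(R({\varvec{u}},{\varvec{v}})\):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4000
G = rng.standard_normal((N, N))
A = (G + G.T) / np.sqrt(2 * N)   # symmetric, E[A_ij^2] ~ 1/N

R = lambda a, b: a @ b / N       # normalized overlap R(a, b)

u = rng.standard_normal(N)
v = 0.5 * u + np.sqrt(0.75) * rng.standard_normal(N)   # overlap R(u, v) near 0.5

lhs = R(A @ u, A @ v)   # overlap of the transformed vectors
rhs = R(u, v)           # predicted limit in this normalization
print(lhs, rhs)
```

The agreement improves like \(N^{-1/2}\); the multi-species statement replaces the identity map by \(\xi ^{k,s}\).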
We next develop a formula for the conditional expectation of a Gaussian tensor \(\varvec{A}^{(k)}\) given a collection of linear observations. We set \({\varvec{D}}\) to be the \(t\times t\times t\) tensor with entries \(D_{ijk}=1\) if \(i=j=k\) and \(D_{ijk}=0\) otherwise.
Lemma A.2
Recalling (A.17), let \(\widehat{\varvec{A}}^{(k)}=\mathop {\mathrm {{\mathbb {E}}}}\limits [\varvec{A}^{(k)}|{\mathcal {F}}_t]\). Equivalently, \(\mathop {\mathrm {{\mathbb {E}}}}\limits [\varvec{A}^{(k)}|{\mathcal {F}}_t]\) is the conditional expectation of \(\varvec{A}^{(k)}\) given the linear-in-\(\varvec{A}^{(k)}\) observations
Then we have for \(i_1,i_2,\ldots ,i_k\le N\),
Here, the matrix \(\widehat{\varvec{Z}}_{k,t}\in {\mathbb {R}}^{N\times t}\) is defined as the solution of a system of linear equations as follows. Define the linear operator \({\mathcal {T}}_{k,t}:{\mathbb {R}}^{N\times t}\rightarrow {\mathbb {R}}^{N\times t}\) by letting, for \(i\le N\), \(0\le t_3\le t-1\):
Then \(\widehat{\varvec{Z}}_{k,t}\) is the unique solution of the following linear equation (with \(\varvec{Y}_{k,t}\) defined as per (A.14))
(Here, \(\widehat{\varvec{Z}}_{k,t} = [\hat{{\varvec{z}}}_{k,0},\ldots ,\hat{{\varvec{z}}}_{k,t-1}]\) and \(\varvec{Y}_{k,t} = [{{\varvec{y}}}_{k,1},\ldots ,{{\varvec{y}}}_{k,t}]\) have dimensions \(N \times t\).)
The above formulas for \(\mathop {\mathrm {{\mathbb {E}}}}\limits [\varvec{A}^{(k)}|{\mathcal {F}}_t]\) and \({\mathcal {T}}_{k,t}\) are rather complicated. In [4, Appendix A] the reader may find helpful tensor network diagrams for the single-species case. Unfortunately it is less clear how to draw a corresponding tensor network with multiple species.
Proof of Lemma A.2
Let \({\mathcal {V}}_{k,t}\) be the affine space of symmetric tensors satisfying the constraint (A.20). The conditional expectation \(\mathop {\mathrm {{\mathbb {E}}}}\limits [\varvec{A}^{(k)}|{\mathcal {F}}_t]\) is the tensor with minimum weighted Frobenius norm \(\Vert \cdot \Vert _{F,\xi ^k}\) in the affine space \({\mathcal {V}}_{k,t}\), given by
Here \((\Gamma ^{(k)})^{-1}\) is the entry-wise inverse of \(\Gamma ^{(k)}\), which exists by Assumption 2.
By Lagrange multipliers, there exist vectors \(\varvec{m}^1,\ldots ,\varvec{m}^t\in {\mathbb {R}}^N\) such that \(\mathop {\mathrm {{\mathbb {E}}}}\limits [\varvec{A}^{(k)}|{\mathcal {F}}_t]=\widehat{\varvec{A}}^{(k)}\) equals
Also by Lagrange multipliers, if a tensor \(\widehat{\varvec{A}}^{(k)}\) is of this form (for some choice of vectors \(\varvec{m}^1,\ldots ,\varvec{m}^t\)) and satisfies the constraints \(\widehat{\varvec{A}}^{(k)}\{\varvec{f}_{t'}\}={\varvec{y}}_{k,t'+1}\) for \(t'< t\), then this tensor is unique and equals \(\mathop {\mathrm {{\mathbb {E}}}}\limits [\varvec{A}^{(k)}|{\mathcal {F}}_t]\). Without loss of generality, we write
By direct calculation we obtain that for each \(i\in [N]\),
We next stack these vectors as columns of an \(N\times t\) matrix. The first term yields \(\widehat{\varvec{Z}}_{k,t}\). Moreover the second term coincides with \({\mathcal {T}}_{k,t}(\widehat{\varvec{Z}}_{k,t})\) by rearranging the order of sums in (A.27). Hence
This in turn implies that the equation determining \(\widehat{\varvec{Z}}_{k,t}\) takes the form (A.23). \(\square \)
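The structure of Lemma A.2 is easiest to see in a stripped-down case: \(k=2\), a single observation, i.i.d. unweighted Gaussian entries, and no symmetrization. There the minimum-Frobenius-norm matrix satisfying \(\varvec{T}\{\varvec{f}\}={\varvec{y}}\) is an explicit rank-one regression formula, and Frobenius-orthogonality to the constraint's null space is exactly the characterization used in the proof; the \(\Gamma ^{(k)}\)-weighted norm and the symmetry constraint are what the full lemma adds.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
f = rng.standard_normal(N)
A = rng.standard_normal((N, N)) / np.sqrt(N)   # i.i.d. entries; symmetry dropped in this sketch
y = A @ f                                      # the linear observation A{f} = y

# minimum-Frobenius-norm matrix satisfying T f = y: an explicit rank-one formula
A_hat = np.outer(y, f) / (f @ f)

# A_hat reproduces the observation ...
resid = np.linalg.norm(A_hat @ f - y)

# ... and is Frobenius-orthogonal to every matrix annihilating f, which is the
# variational characterization of the conditional expectation E[A | Af = y]
T = rng.standard_normal((N, N))
T0 = T - np.outer(T @ f, f) / (f @ f)          # now T0 f = 0
inner = float(np.sum(A_hat * T0))
print(resid, inner)
```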
A.3: Long AMP
As an intermediate step towards proving Theorem 2, we introduce a new iteration that we call Long AMP (LAMP), following [8]. This iteration is less compact but simpler to analyze. For each \(k\le D\), let \({\mathcal {S}}_{k,t}\subseteq ({\mathbb {R}}^N)^{\otimes k}\) be the linear subspace of tensors \(\varvec{T}\) that are symmetric and such that \(\varvec{T}\{\varvec{f}_{t_1}\}=0\) for all \(t_1<t\). We denote by \({\mathcal {P}}_t^{\perp }(\varvec{A}^{(k)})\) the projection of \(\varvec{A}^{(k)}\) onto \({\mathcal {S}}_{k,t}\), in the inner product space (A.24) corresponding to \(\Gamma ^{(k)}\). We then define the LAMP mapping
Here we use similar notations \(\varvec{f}_t = f_t({\varvec{V}}_t;{\varvec{E}}_t)\) and \({\varvec{G}}_{\xi ^{k,s},t}\) as before (recall (A.16)), and take the vectors \({\varvec{e}}^t\) as before. However the quantities \(\varvec{f}_t,{\varvec{G}}_{\xi ^{k,s},t}\) are now different: they are computed using the vectors \({\varvec{v}}^0,\ldots ,{\varvec{v}}^t\) using the recursion:
Following [4, 8], we first establish state evolution for LAMP (under the non-degeneracy Assumption 2), and then deduce the result for the original AMP. In analyzing LAMP we use notations analogous to the ones introduced for AMP. In particular:
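In the same stripped-down setting (non-symmetric \(k=2\), unweighted Frobenius norm), the projection \({\mathcal {P}}_t^{\perp }\) amounts to multiplying on the right by the projector orthogonal to the span of past nonlinearities; the projected object then annihilates every \(\varvec{f}_{t_1}\) with \(t_1<t\), which is the independence property exploited in part (a) of Theorem 3 below.

```python
import numpy as np

rng = np.random.default_rng(4)
N, t = 400, 3
F = rng.standard_normal((N, t))               # columns stand in for f_0, ..., f_{t-1}
A = rng.standard_normal((N, N)) / np.sqrt(N)  # non-symmetric sketch of A^(k)

# orthogonal projector onto span(f_0, ..., f_{t-1})
P = F @ np.linalg.solve(F.T @ F, F.T)
A_perp = A @ (np.eye(N) - P)                  # sketch of P_t^perp(A)

# the projected object annihilates every past nonlinearity, hence carries no
# information about the observations A{f_{t1}}
err = np.abs(A_perp @ F).max()
print(err)  # numerically zero
```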
A.4: State Evolution for Long AMP
Theorem 3
Under the assumptions of Theorem 2, let \({\varvec{q}}^{2,0},\ldots ,{\varvec{q}}^{D,0}\in {\mathbb {R}}^N\) be deterministic vectors and \({\varvec{v}}^0 =\sum _{2\le k\le D} {\varvec{q}}^{k,0}\). Assume that the uniform empirical distribution of the N vectors \(\{(q_i^{2,0},\ldots , q_i^{D,0})\}_{i\le N}\) converges in \({\mathbb {W}}_2\) distance to the law of the vector \((U^{k,0})_{2\le k\le D}\).
Further we assume there is a constant \(C<\infty \) such that for all \(t\le T\):
(i)
The matrices \({\varvec{G}}_{\xi ^{k,s},t}= {\varvec{G}}_{\xi ^{k,s},t}({\varvec{V}})\) are uniformly well-conditioned as guaranteed by Assumption 3.
(ii)
Let the linear operator \({\mathcal {T}}_{k,t}:{\mathbb {R}}^{N\times t}\rightarrow {\mathbb {R}}^{N\times t}\) be defined as per (A.22), with \({\varvec{G}}_{\xi ^{k,s},t} = {\varvec{G}}_{\xi ^{k,s},t}({\varvec{V}},{\varvec{E}})\), and \(\varvec{f}_t=f_t({\varvec{V}},{\varvec{E}})\), and define
$$\begin{aligned} {\mathcal {L}}_{k,t} = {\varvec{1}}+{\mathcal {T}}_{k,t}. \end{aligned}$$(A.34)
Then \(C^{-1}\le \sigma _{\min }({\mathcal {L}}_{k,t})\le \sigma _{\max }({\mathcal {L}}_{k,t})\le C\).
Then the following statements hold for any \(t\le T\) and sufficiently large N:
(a)
Correct conditional law:
$$\begin{aligned} {\varvec{q}}^{k,t+1}\big |_{{\mathcal {F}}_t}{\mathop {=}\limits ^{\textrm{d}}}\mathop {\mathrm {{\mathbb {E}}}}\limits [{\varvec{q}}^{k,t+1}|{\mathcal {F}}_t] + {\mathcal {P}}_t^{\perp }(\widetilde{\varvec{A}}^{(k)}) \{\varvec{f}_t\}, \end{aligned}$$(A.35)
where \(\widetilde{\varvec{A}}^{(k)}\) is a symmetric tensor distributed identically to \(\varvec{A}^{(k)}\) and independent of everything else, and \({\mathcal {P}}_{t}^{\perp }\) is the projection onto the subspace \({\mathcal {S}}_{k,t}\) defined in Sect. A.3. Further
$$\begin{aligned} \mathop {\mathrm {{\mathbb {E}}}}\limits [{\varvec{q}}^{k,t+1}|{\mathcal {F}}_t] = \sum _{s\in {\mathscr {S}}} \sum _{0\le t_1\le t} h_{t,t_1-1,k,s} {\varvec{q}}^{k,t_1}_s. \end{aligned}$$(A.36)
Moreover, the vectors \(({\varvec{q}}^{k,t+1})_{2\le k\le D}\) are conditionally independent given \({\mathcal {F}}_t\).
(b)
Approximate isometry: we have
$$\begin{aligned} R_s({\varvec{q}}^{k,t_1+1},{\varvec{q}}^{k,t_2+1})&\simeq \xi ^{k,s}\left(\vec R( \varvec{f}_{t_1},\varvec{f}_{t_2})\right), \end{aligned}$$(A.37)
$$\begin{aligned} R_s({\varvec{v}}^{t_1+1},{\varvec{v}}^{t_2+1})&\simeq \xi ^{s}\big (\vec R( \varvec{f}_{t_1},\varvec{f}_{t_2})\big ). \end{aligned}$$(A.38)
Moreover, both sides converge in probability to constants as \(N\rightarrow \infty \), and for \(k_1\ne k_2\) and any \((t_1,t_2)\) and \(s\in {\mathscr {S}}\),
$$\begin{aligned} R_s({\varvec{q}}^{k_1,t_1},{\varvec{q}}^{k_2,t_2})\simeq 0. \end{aligned}$$(A.39)
(c)
State evolution: for each \(s\in {\mathscr {S}}\) and any pseudo-Lipschitz function \(\psi :{\mathbb {R}}^{D \times 2(t+1)}\rightarrow {\mathbb {R}}\), we have
$$\begin{aligned} \mathop {\mathrm {p-lim}}\limits _{N\rightarrow \infty } \frac{1}{N_s} \sum _{i\in {\mathcal {I}}_s} \psi \big ((q_i^{k,t'})_{ k\le D,t'\le t}; (e^{t'}_i)_{t'\le t}\big ) = \mathop {\mathrm {{\mathbb {E}}}}\limits \big \{\psi \big ((U^{k,t'}_s)_{2\le k\le D,t'\le t}; (E^{t'}_s)_{t'\le t} \big )\big \}, \end{aligned}$$(A.40)
where \((U^{k,t}_s)_{k\le D,1\le t\le T}\) is the centered Gaussian process defined in the statement of Theorem 2.
In the next subsection, we will prove these statements by induction on t. The crucial point we exploit is the representation (a). We emphasize that the iteration number t is bounded as \(N\rightarrow \infty \); therefore all numerical quantities not depending on N (but possibly on t) will be treated as constants.
A.5: Proof of Theorem 3
The proof will be by induction over t. The base case is clear (see, e.g., Proposition A.4), and we focus on the inductive step. We assume the statements above for \(t-1\) and prove them for t.
A.5.1: Proof of (a)
Note that \({\mathcal {P}}_t^{\perp }(\varvec{A}^{(k)})\) is by construction independent of \({\mathcal {F}}_t\), and therefore we can replace \(\varvec{A}^{(k)}\) by a fresh independent tensor in (A.29), whence (A.35) follows. The equality (A.36) holds by definition of the iteration.
A.5.2: Proof of (b): Approximate isometry
We will repeatedly apply Lemma A.1. We start with (A.37). As we are inducting on t, we may limit ourselves to considering overlaps \(\vec R( {\varvec{q}}^{k,t+1},{\varvec{q}}^{k,t_1+1})\), for \(t_1\le t\).
Define the tensor \(\Gamma ^{(k),\nabla }\in ({\mathbb {R}}^{{\mathscr {S}}}_{\ge 0})^{\otimes (k-1)}\) by
We choose
such that
is the orthogonal projection of \(\Gamma ^{(k),\nabla }\diamond \varvec{f}_{t}^{\otimes k-1}\) onto
and also set
We will use (and soon after, prove) the following lemma.
Lemma A.3
For all \(t_1\le t\), we have
For \(t_1\le t-1\), Lemma A.1 (point 2) implies
We next use the formula in (a) for \(\mathop {\mathrm {{\mathbb {E}}}}\limits [{\varvec{q}}^{k,t+1}|{\mathcal {F}}_t]\) together with the expression in (A.29). For each \(s\in {\mathscr {S}}\):
Here (A.43) comes from the induction hypothesis (A.37) (and the symmetry of the matrix \({\varvec{G}}_{\xi ^{k,s},t-1}\) is used to obtain the next line). We next prove that (A.37) holds for \(t_1=t\). We have by definition of the projections that
where the right-hand side is defined according to (A.3). Using (A.42) from Lemma A.3 as well as point 4 of Lemma A.1, we have
Next, using (A.42) and Lemma A.1 (point 2), we obtain that for all \(s\in {\mathscr {S}}\)
Moreover we recall that by the expression for \(\mathop {\mathrm {{\mathbb {E}}}}\limits [{\varvec{q}}^{k,t+1}|{\mathcal {F}}_t]\) from part (a),
The formula for linear regression implies
By part (b) of the inductive step, for \(1\le t_1,t_2\le t-1\) we have
In particular the formulas (A.30) and (A.47) have asymptotically the same coefficients, and the overlap structure between the summands is identical. It follows that
Using together Eqs. (A.44), (A.45), and (A.50), we get
This establishes (A.37).
Next consider (A.39), i.e., approximate orthogonality of \({\varvec{q}}^{k,r}\) and \({\varvec{q}}^{k',r}\) for \(k\ne k'\). This follows easily from the representation in point (a), which, together with Lemma A.1, inductively implies that the iterates \({\varvec{q}}^{k,\cdot }\) for different k are approximately orthogonal. Finally, (A.38) follows directly from (A.37) and (A.39). We now prove Lemma A.3.
Proof of Lemma A.3
For convenience we write \(\widetilde{\varvec{A}}=\widetilde{\varvec{A}}^{(k)}\). By Lagrange multipliers, there exist vectors \(({\varvec{\theta }}_{t_1})_{t_1 \le t-1}\) in \({\mathbb {R}}^N\) such that \({\mathcal {P}}_t^{\perp } (\widetilde{\varvec{A}}) = \widetilde{\varvec{A}}- {\varvec{Q}}\), where
The vectors \(({\varvec{\theta }}_{t_1})_{t_1 \le t-1}\) are determined by the equations \({\varvec{Q}}\{\varvec{f}_{t_1}\}=\widetilde{\varvec{A}}\{\varvec{f}_{t_1}\}\) for all \(t_1\le t-1\). This expands (for each \(t_1\le t-1\)) to
Recall that we assume each \({\varvec{G}}_{\xi ^{k,s},t-1}\) is well-conditioned with high probability. Thus we can multiply the system of t equations above by \({\varvec{G}}_{\xi ^{k,s},t-1}^{-1}\) in the coordinates \({\mathcal {I}}_s\) for each \(s\in {\mathscr {S}}\). For each \(t_3\le t-1\), we obtain:
Relabeling \(t_3\) as \(t_1\), we find
We claim that \(\Vert {\varvec{\theta }}^{\parallel }_{t_1}\Vert _N\simeq 0\), i.e., \({\varvec{\theta }}_{t_1}\simeq {\varvec{\theta }}^0_{t_1}\). Indeed, let \({\varvec{\Theta }}\in {\mathbb {R}}^{N\times t}\) be the matrix with columns \(({\varvec{\theta }}_{t_2})_{t_2<t}\), and \({\varvec{\Theta }}^0\) the matrix with columns \(({\varvec{\theta }}^0_{t_2})_{t_2<t}\). Then (A.51) can be written as
Here we recall \({\mathcal {L}}_{k,t}={\varvec{1}}+{\mathcal {T}}_{k,t}\) and \({\mathcal {T}}_{k,t}\in {\mathbb {R}}^{Nt\times Nt}\) is defined in (A.22). Substituting the decomposition \({\varvec{\Theta }}= {\varvec{\Theta }}^0+{\varvec{\Theta }}^{\parallel }\) in the above, we obtain
Recall that \({\mathcal {L}}_{k,t}\) is well-conditioned by Assumption 3. Therefore it remains to prove
Let \({\varvec{c}}_0,\ldots ,{\varvec{c}}_{t-1} \in {\mathbb {R}}^N\) be the columns of \({\mathcal {T}}^{\textsf{T}}_{k,t}({\varvec{\Theta }}^0)\). We first note that for all \(t_1\le t-1\) and \(s\in {\mathscr {S}}\),
Moreover the Gram matrix
is well-conditioned for each \(s\in {\mathscr {S}}\). Therefore it is sufficient to check that \(R_s(\varvec{f}_{t_1},{\varvec{c}}_{t_4})\simeq 0\) for each \(t_1,t_4<t\) and \(s\in {\mathscr {S}}\). Plugging in the definition (A.22), it remains to check that for \(0\le t_1,t_4\le t-1\),
Finally, this last claim follows by substituting the definition (A.52) of \({\varvec{\theta }}^0_{t_2}\), and using the fact that
which follows from Lemma A.1. Thus (A.53) is established.
We are now ready to prove (A.42). First note that
decomposes into two types of terms based on the definition of \({\varvec{Q}}\) above. Recalling (A.41), the first involves
for \(t_1\le t-1\), which vanishes by the definition of \((\varvec{f}_t^{\otimes k-1})_{\perp }\). The other terms take the form
In particular, this means that to prove that (A.54) vanishes, it suffices to show
for all \(t_2\le t\).
Note that by construction,
By the well-conditioning assumption, the \(b_{t_1}\) are bounded. Therefore it suffices to show that
Finally note that each term in the left-hand side includes an overlap \(R_s({\varvec{\theta }}_{t_1},\varvec{f}_{t_2})\). However these all vanish:
This is because we can substitute \({\varvec{\theta }}_{t_1}\) with \({\varvec{\theta }}_{t_1}^0\) as defined in (A.52) and use the fact that \(\vec R(\widetilde{\varvec{A}}\{\varvec{f}_{t_3}\},\varvec{f}_t)\simeq 0\) which follows from Lemma A.1. This completes the proof. \(\square \)
A.5.3: Proof of (c)
The base case of initialization is handled by the following basic fact.
Proposition A.4
Let \(\mu \in {\mathcal {P}}({\mathbb {R}}^k)\) be a probability distribution with finite second moment. Then if \(E_1,\ldots ,E_N{\mathop {\sim }\limits ^{i.i.d.}}\mu \) and \(\hat{\mu }_N=\frac{1}{N}\sum _{i=1}^N \delta _{E_i}\), one has
Proof
It suffices to show that \(\hat{\mu }_N\rightarrow \mu \) weakly in probability, and that the second moment of \(\hat{\mu }_N\) converges in probability to that of \(\mu \). The first claim is clear, and the second holds by the law of large numbers. \(\square \)
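Proposition A.4 is also easy to observe numerically. In one dimension the \({\mathbb {W}}_2\) distance between two equal-size empirical measures is computed exactly by the sorted (quantile) coupling, so one can watch the distance shrink as the sample size grows. This is a standalone illustration, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(5)

def w2_empirical(x, y):
    # 1-D optimal transport: the monotone (sorted) coupling is optimal
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

dists = []
for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)   # empirical measure mu_hat_n
    y = rng.standard_normal(n)   # fresh sample standing in for mu
    dists.append(w2_empirical(x, y))
print(dists)  # decreasing toward 0 as n grows
```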
Continuing to the inductive step, recall that the process \((U^{k,t}_s)_{t\ge 1}\) is Gaussian by construction, and independent of \(U^{k,0}_s\). Define
We then have
Here in writing \(({\varvec{C}}^{-1}_{\le t,s})_{t_1,t_2}\), we view \({\varvec{C}}_{\le t,s}\) as a \((t+1)\times (t+1)\) matrix for each \(s\in {\mathscr {S}}\).
On the other hand, from point (a), we know that
Moreover the induction hypothesis of (A.40) implies that for \(t_1,t_2 \le t\),
(Recall that by definition \(W^t_s \equiv \sum _{k\le D} U^{k,t}_s\), while \(\varvec{f}_t=f_t({\varvec{V}}_t;{\varvec{E}}_t)\) here.)
Therefore, from the definition of the process \((U^{k,t}_s)_{t\ge 0}\),
Recalling that \({\varvec{G}}_{\xi ^{k,s},t}\) is well-conditioned, we find (recall (A.55), (A.56)):
Therefore we also have
Moreover, Lemma A.1 (point 4) shows that \({\mathcal {P}}^{\perp }_t(\widetilde{\varvec{A}}^{(k)})\{\varvec{f}_t\}\simeq \widetilde{\varvec{A}}^{(k)}\{(\varvec{f}_{t}^{\otimes k-1})_{\perp }\}\) has entries which are approximately independent Gaussian with variance
on coordinates \(i\in {\mathcal {I}}_s\), even conditionally on \({\mathcal {F}}_t\). Therefore
where \(\Vert {\varvec{err}}\Vert _N\simeq 0\) and \({\varvec{g}}\sim \textsf{N}(\varvec{0},{\varvec{I}}_N)\) is independent of everything else. It now remains to verify that this agrees with the desired covariance. As proved in the previous point, for any \(t_1\le t\),
In particular this establishes convergence of the second moment, so in order to prove (A.40) it is sufficient to establish weak convergence. Hence we may assume \(\psi :{\mathbb {R}}^{D \times (t+1)}\rightarrow {\mathbb {R}}\) is Lipschitz (rather than just pseudo-Lipschitz).
Using the representation (A.58), and focusing for simplicity on a single k, we get
The second equality above follows by Gaussian concentration since \(\psi \) is assumed Lipschitz. Applying the induction hypothesis now implies (A.40), except that \({\varvec{e}}^{t+1}\) is not present. However since \({\varvec{e}}^{t+1}_i\) and \(E^{t+1}_s\) have the same law and are both independent of the past, \({\mathbb {W}}_2\) convergence immediately transfers by Proposition A.5 below. This completes the proof of part (c).
Proposition A.5
Let \(\nu _n=\frac{1}{n}\sum _{i=1}^n \delta _{\widehat{X}_i}\) for \(n\ge 1\) be a sequence of probability measures on \({\mathbb {R}}^k\) converging to \(\nu \in {\mathcal {P}}({\mathbb {R}}^k)\) in \({\mathbb {W}}_2\). Let \(\mu \in {\mathcal {P}}({\mathbb {R}}^k)\) be a probability distribution with finite second moment. Let
and set
Then
Proof
Using Proposition A.4 applied to \(\nu \), for any \({\varepsilon }>0\) we can find a coupling \(\Pi =\big ((\widehat{X}_i,X_i)\big )_{i\in [N]}\) of \(\nu _n\) with an i.i.d. empirical sample \(\hat{\nu }_n\), with transport cost at most \({\varepsilon }\). Generate independent variables \(E_1,\ldots ,E_N{\mathop {\sim }\limits ^{i.i.d.}}\mu \). Then note that
Here in the latter step we used the assumption on the coupling \(\Pi \) for the first term and Proposition A.4 applied to \(\nu \otimes \mu \) on the second term. This completes the proof. \(\square \)
A.6: Asymptotic Equivalence of AMP and Long AMP
Here we show that AMP and LAMP produce approximately the same iterates.
Lemma A.6
Let \(\{{\varvec{G}}^{(k)}\}_{k\le D}\) be standard Gaussian tensors, and \(\varvec{A}^{(k)} = \Gamma ^{(k)}\diamond {\varvec{G}}^{(k)}\) for \(k\ge 2\). Consider the corresponding AMP iterates \({\varvec{Z}}_{t}\equiv ({\varvec{z}}^{k,t_1})_{k\le D,t_1\le t}\) and LAMP iterates \({\varvec{Q}}_{t}\equiv ({\varvec{q}}^{k,t_1})_{k\le D,t_1\le t}\), from the same initialization \({\varvec{Z}}_0={\varvec{Q}}_0\) satisfying the assumptions of Theorems 2 and 3.
Let \(\varvec{f}_t = f_t({\varvec{V}}_t;{\varvec{E}}_t)\), \(t\ge 0\) be the nonlinearities applied to LAMP iterates. Further assume that there exists a constant \(C<\infty \) such that, for all \(t\le T\),
-
(i)
The LAMP Gram matrices \({\varvec{G}}_{k,t} = {\varvec{G}}_{k,t}({\varvec{V}})\) are well-conditioned as guaranteed by Assumption 3, i.e.,
$$\begin{aligned} C^{-1}\le \sigma _{\min }({\varvec{G}}_{k,t})\le \sigma _{\max }({\varvec{G}}_{k,t})\le C, \quad \quad \forall k\le D,~t\le T. \end{aligned}$$ -
(ii)
Let the linear operator \({\mathcal {T}}_{k,t}:{\mathbb {R}}^{N\times t}\rightarrow {\mathbb {R}}^{N\times t}\) be defined as per (A.22), with \({\varvec{G}}_{k,t} = {\varvec{G}}_{k,t}({\varvec{V}})\), and \(\varvec{f}_t=f_t({\varvec{V}},{\varvec{E}}_t)\), and define \({\mathcal {L}}_{k,t} = {\textbf {1}}+{\mathcal {T}}_{k,t}\). Then
$$\begin{aligned} C^{-1}\le \sigma _{\min }({\mathcal {L}}_{k,t})\le \sigma _{\max }({\mathcal {L}}_{k,t})\le C. \end{aligned}$$
Then, for any \(t\le T\), we have
Proof
Throughout the proof we will suppress \({\varvec{E}}_t\) and simply write \(f_t({\varvec{W}}_t)\) or \(f_t({\varvec{V}}_t)\) to distinguish AMP and LAMP iterates, and analogously for \({\varvec{G}}_{k,t}({\varvec{W}}_t)\) or \({\varvec{G}}_{k,t}({\varvec{V}}_t)\). The proof is by induction over the iteration number, so we will assume it to hold at iteration t, and prove it for iteration \(t+1\). We prove the induction step by establishing the following two facts for each \(2\le k\le D\):
Let us first consider the claim (A.60), and note that
where we wrote \(d_{t,t_1,k,s}\) for the coefficients of (A.8), with AMP iterates replaced by LAMP iterates. We then have
Notice that, by the induction assumption (and recalling that each \(f_{t,s}\) is Lipschitz continuous and acts component-wise):
Further, for any tensor \(\varvec{T}\in ({\mathbb {R}}^{N})^{\otimes k}\), and any vectors \({\varvec{v}}_1,{\varvec{v}}_2\in {\mathbb {R}}^N\),
Using Lemma A.1, this implies that the following bound holds with high probability for a constant C:
The last step follows from (A.62) and Theorem 3, which implies (recall each \(f_{t,s}\) is Lipschitz) that \(\Vert f_t({\varvec{V}}_t)\Vert _N\le C\) with probability \(1-o(1)\). Notice that the same argument implies \(\Vert f_t({\varvec{W}}_t)\Vert _N \le C\) with high probability.
Similarly, \(D_{2,t}\simeq 0\) follows since \(\Vert f_{t_1-1}({\varvec{W}}_{t_1-1})- f_{t_1-1}({\varvec{V}}_{t_1-1})\Vert _N\simeq 0\) and \(|d_{t,t_1,k,s}|\le C_T\) by construction, thus yielding (A.60).
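The stability fact used in this step, that componentwise Lipschitz nonlinearities cannot amplify the normalized discrepancy between the AMP and LAMP iterates, admits a one-line numerical confirmation (with \(\tanh \) as a stand-in 1-Lipschitz \(f_{t,s}\)):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 1000
norm_N = lambda x: np.sqrt(np.mean(x ** 2))   # the ||.||_N norm

W = rng.standard_normal(N)                    # stand-in AMP iterate
V = W + 1e-3 * rng.standard_normal(N)         # stand-in LAMP iterate, ||W - V||_N small

f = np.tanh                                   # 1-Lipschitz, acts component-wise
gap_in = norm_N(W - V)
gap_out = norm_N(f(W) - f(V))
print(gap_out, gap_in)  # gap_out <= gap_in
```

The same bound, with the Lipschitz constant of \(f_{t,s}\) in place of 1, is what propagates \(\Vert f_{t}({\varvec{W}}_{t})- f_{t}({\varvec{V}}_{t})\Vert _N\simeq 0\) through the induction.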
We now prove (A.61). Comparing (A.8) and (A.29), with \({\mathcal {P}}_t^{\parallel } = {\varvec{1}}-{\mathcal {P}}_t^{\perp }\) we find
Note that \({\mathcal {P}}_t^{\parallel } (\varvec{A}^{(k)})=\mathop {\mathrm {{\mathbb {E}}}}\limits \left[\varvec{A}^{(k)}|{\mathcal {F}}_t\right]\), where \({\mathcal {F}}_t\) here is the analogous \(\sigma \)-algebra generated by \(\{{{\textbf {q}}}^{k,t_1},{\varvec{e}}^{t_1}\}_{t_1\le t, k\le D}\). Equivalently, this is the conditional expectation of \(\varvec{A}^{(k)}\) given the linear constraints
Also notice that, by the induction hypothesis, and the definition of \({\varvec{y}}_{k,t_1}\), (A.14), we have for all \(t_1\le t\),
Lemma A.2 implies that \({\mathcal {P}}_t^{\parallel } (\varvec{A}^{(k)})\) takes the form of (A.21) for a suitable matrix \(\widehat{\varvec{Z}}_{k,t}\in {\mathbb {R}}^{N\times t}\). The key claim is that
In order to establish this claim, we show that, under the inductive hypothesis,
Since \({\mathcal {L}}_{k,t}={\varvec{1}}+{\mathcal {T}}_{k,t}\) is well-conditioned by assumption, the combination of (A.23) and (A.68) implies \(\widehat{\varvec{Z}}_{k,t}\simeq {\varvec{Q}}_t\). By (A.66), in order to prove (A.68), it is sufficient to show that
In order to prove (A.69), we use Theorem 3. Recall that
(The value \(2\le k\le D\) is implicitly fixed in the definition of \({\varvec{C}}_{\le t}\).) By Theorem 3,
This implies for any \(0 \le t_1\le t-1\) and \(s\in {\mathscr {S}}\),
Indeed, Gaussian integration by parts yields the latter expression (it can be done conditionally on the variables E since they are independent). Combining (A.70) with the definition (A.8) will now allow us to conclude \({\mathcal {T}}_{k,t}{\varvec{Q}}_t\simeq {\textbf {ONS}}_{k,t}\) as desired. Indeed for each \(s\in {\mathscr {S}}\) we have
Having established (A.67), we now use the formula (A.21) for \({\mathcal {P}}^{\parallel }_t(\varvec{A}^{(k)})=\mathop {\mathrm {{\mathbb {E}}}}\limits \big [\varvec{A}^{(k)}|{\mathcal {F}}_t\big ]\). The result is:
On the other hand, using again (A.70) gives
We conclude from (A.64) that \(\Vert \textsf{AMP}_{t+1}({\varvec{Q}}_{t})_k -\textsf{LAMP}_{t+1}({\varvec{Q}}_{t})_k \Vert _N\simeq 0\). This completes the proof. \(\square \)
Cite this article
Huang, B., Sellke, M. Optimization Algorithms for Multi-species Spherical Spin Glasses. J Stat Phys 191, 29 (2024). https://doi.org/10.1007/s10955-024-03242-7