Optimal Combination of Linear and Spectral Estimators for Generalized Linear Models

We study the problem of recovering an unknown signal $\boldsymbol x$ given measurements obtained from a generalized linear model with a Gaussian sensing matrix. Two popular solutions are based on a linear estimator $\hat{\boldsymbol x}^{\rm L}$ and a spectral estimator $\hat{\boldsymbol x}^{\rm s}$. The former is a data-dependent linear combination of the columns of the measurement matrix, and its analysis is quite simple. The latter is the principal eigenvector of a data-dependent matrix, and a recent line of work has studied its performance. In this paper, we show how to optimally combine $\hat{\boldsymbol x}^{\rm L}$ and $\hat{\boldsymbol x}^{\rm s}$. At the heart of our analysis is the exact characterization of the joint empirical distribution of $(\boldsymbol x, \hat{\boldsymbol x}^{\rm L}, \hat{\boldsymbol x}^{\rm s})$ in the high-dimensional limit. This allows us to compute the Bayes-optimal combination of $\hat{\boldsymbol x}^{\rm L}$ and $\hat{\boldsymbol x}^{\rm s}$, given the limiting distribution of the signal $\boldsymbol x$. When the distribution of the signal is Gaussian, then the Bayes-optimal combination has the form $\theta\hat{\boldsymbol x}^{\rm L}+\hat{\boldsymbol x}^{\rm s}$ and we derive the optimal combination coefficient. In order to establish the limiting distribution of $(\boldsymbol x, \hat{\boldsymbol x}^{\rm L}, \hat{\boldsymbol x}^{\rm s})$, we design and analyze an Approximate Message Passing (AMP) algorithm whose iterates give $\hat{\boldsymbol x}^{\rm L}$ and approach $\hat{\boldsymbol x}^{\rm s}$. Numerical simulations demonstrate the improvement of the proposed combination with respect to the two methods considered separately.

Throughout this paper, the performance of an estimator x̂ will be measured by its normalized correlation (or "overlap") with x, namely |⟨x̂, x⟩|/(‖x̂‖_2 ‖x‖_2), where ‖·‖_2 denotes the Euclidean norm of a vector. Most of the existing techniques require an initial estimate of the signal, which can then be refined via a local algorithm. Here, we focus on two popular methods: a linear estimator and a spectral estimator. The linear estimator x̂^L has the form (3), i.e., it is proportional to Σ_{i=1}^n T_L(y_i) a_i, where T_L denotes a given preprocessing function. The performance analysis of this estimator is quite simple, see e.g., Proposition 1 in [43] or Sect. 2.3 of this paper. The spectral estimator x̂^s is the principal eigenvector of a matrix of the form (4), namely (up to normalization) Σ_{i=1}^n T_s(y_i) a_i a_i^T, where T_s is another preprocessing function. The idea of a spectral method first appeared in [32] and, for the special case of phase retrieval, a series of works has provided more and more refined performance bounds [11,12,40]. Recently, an exact high-dimensional analysis of the spectral method for generalized linear models with Gaussian sensing vectors has been carried out in [34,37]. These works consider a regime where both n and d grow large at a fixed proportional rate δ = n/d > 0. The choice of T_s which minimizes the value of δ (and, consequently, the amount of data) necessary to achieve a strictly positive overlap (2) was obtained in [37]. Furthermore, the choice of T_s which maximizes the correlation between x and x̂^s for any given value of the sampling ratio δ was obtained in [35]. The case in which the sensing vectors are obtained by picking columns from a Haar matrix is tackled in [16].
In short, the performance of the linear estimate x̂^L and the spectral estimate x̂^s is well understood, and there is no clear winner between the two. In fact, the superiority of one method over the other depends on the output function p(· | ⟨x, a_i⟩) and on the sampling ratio δ. For example, for phase retrieval (y_i = |⟨x, a_i⟩|), the spectral estimate provides positive correlation with the ground-truth signal as long as δ > 1/2 [37], while linear estimators of the form (3) are not effective for any δ > 0. On the contrary, for 1-bit compressed sensing (y_i = sign(⟨x, a_i⟩)) the situation is the opposite: the spectral estimator is uncorrelated with the signal for any δ > 0, while the linear estimate works well. For many cases of practical interest, e.g., neural networks with ReLU activation function (y_i = max(⟨x, a_i⟩, 0)), both the linear and the spectral method give estimators with non-zero correlation. Thus, a natural question is the following: What is the optimal way to combine the linear estimator x̂^L and the spectral estimator x̂^s?
This paper closes the gap and answers the question above for Gaussian sensing vectors {a_i}_{1≤i≤n}. Our main technical contribution is to provide an exact high-dimensional characterization of the joint empirical distribution of (x, x̂^L, x̂^s) in the limit n, d → ∞ with a fixed sampling ratio δ = n/d (see Theorem 1). In particular, we prove that the conditional distribution of (x̂^L, x̂^s) given x converges to the law of a bivariate Gaussian whose mean vector and covariance matrix are specified in terms of the preprocessing functions T_L and T_s. As a consequence, we are able to compute the Bayes-optimal combination of x̂^L and x̂^s for any given prior distribution on x (see Theorem 2). In the special case in which the signal prior is Gaussian, the Bayes-optimal combination has the form θ x̂^L + x̂^s, with θ ∈ R, and we compute the optimal combination coefficient θ* that maximizes the normalized correlation in (2) (see Corollary 2).
The characterization of the joint empirical distribution of (x,x L ,x s ) is achieved by designing and analyzing a suitable approximate message passing (AMP) algorithm. AMP is a family of iterative algorithms that has been applied to several high-dimensional statistical estimation problems including estimation in linear models [4,5,15,29], generalized linear models [44,48,50], and low-rank matrix estimation [13,31,38,45]. An appealing feature of AMP algorithms is that under suitable conditions on the model, the empirical joint distribution of the iterates can be exactly characterized in the high-dimensional limit, in terms of a simple scalar recursion called state evolution.
Fig. 1 (a) Performance comparison among the linear estimator, the spectral estimator and the proposed optimal combination for a specific output function and ranging values of the sampling ratio δ. The performance is measured in terms of the normalized correlation (2). (b) Optimal combination coefficient θ* as a function of δ for the same output function as in (a). (c) Percentage performance gain of the combined estimator for different output functions and sampling ratios.

In this paper, we design an AMP algorithm that is equivalent to a power method computing the principal eigenvector of the matrix (4). Then, the state evolution analysis leads to the desired joint empirical distribution of (x, x̂^L, x̂^s). Using the limiting distribution, we reduce the vector problem of estimating x ∈ R^d given two (correlated) observations x̂^L, x̂^s ∈ R^d to the scalar problem of estimating the random variable X ∈ R given two (correlated) observations X_L, X_s ∈ R. We emphasize that the focus of this work is not on using the AMP algorithm as an estimator for the generalized linear model. Rather, we use AMP as a proof technique to characterize the joint empirical distribution of (x, x̂^L, x̂^s), and thereby understand how to optimally combine the two simple estimators.
Our proposed combination of the linear and spectral estimators can significantly boost the correlation with the ground-truth signal (2). As an illustration, in Fig. 1a we compare against each other the correlations ρ_L, ρ_s and ρ* of the linear, spectral and optimal combined estimators, respectively, for a range of values of the sampling ratio δ = n/d and measurements of the form y_i = 0.3⟨x, a_i⟩ + ⟨x, a_i⟩² + z_i. Here, x is uniformly distributed on the d-dimensional sphere of radius √d, a_i ∼ i.i.d. N(0_d, I_d/d), z_i ∼ i.i.d. N(0, 0.2), and the preprocessing functions are chosen as follows: T_L(y) = y and T_s(y) = min{y, 3.5}. The solid lines correspond to analytically derived asymptotic formulae, and they are compared against numerical simulations (cf. markers of the corresponding color) computed for d = 2000. Specifically, the red line corresponds to the optimal combined estimator θ* x̂^L + x̂^s (in this example, the empirical distribution of x is Gaussian). The optimal combination coefficient θ* is plotted in Fig. 1b as a function of δ. Note that for values of δ for which the spectral estimator achieves strictly positive correlation with x, the combined estimator provides a significant performance improvement. The performance gain depends on: (i) the sampling ratio δ (it can be as large as ∼30% for δ ≈ 8), and (ii) the output function that defines the measurement. To better visualize this dependence, we plot in Fig. 1c the percentage gain (ρ* − ρ_max)/ρ_max × 100 for various values of δ and for different output-function parameterizations. Specifically, the x-axis in Fig. 1c represents the value of the coefficient H_1 of an output function of the form y_i = 0.5 + H_1⟨x, a_i⟩ + 0.5⟨x, a_i⟩² + z_i. Above, ρ* denotes the correlation achieved by our proposed estimator, and ρ_max is the maximum correlation among the linear and spectral estimators. The rest of the paper is organized as follows. In Sect. 2, we describe the setting and review existing results on the linear and the spectral estimator. In Sect. 3, we present our contributions. The main technical result, Theorem 1, gives an exact characterization of the joint empirical distribution of (x, x̂^L, x̂^s). Using this, we derive the Bayes-optimal combination of the estimators x̂^L and x̂^s. In the special case in which the signal prior is Gaussian, the Bayes-optimal combination is linear in x̂^L and x̂^s, and we derive the optimal coefficient. In Sect. 4, we demonstrate the effectiveness of our method via numerical simulation. In Sect. 5, we describe the generalized AMP algorithm and use it to prove Theorem 1.
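To make this setup concrete, the following Python/NumPy sketch reproduces a small-scale version of the experiment of Fig. 1a under the model stated above. It is only an illustration: the combination coefficient theta is left as a free knob (rather than the asymptotically optimal θ* of Corollary 2), the normalization of the linear estimator is immaterial because only normalized correlations are reported, and the sign of the spectral estimator is aligned with the linear one in the spirit of Remark 4.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 500, 8.0
n = int(delta * d)

# Signal uniform on the sphere of radius sqrt(d); sensing vectors with N(0, 1/d) entries.
x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)
A = rng.standard_normal((n, d)) / np.sqrt(d)

g = A @ x                                            # g_i = <x, a_i>
y = 0.3 * g + g**2 + rng.normal(0.0, np.sqrt(0.2), size=n)

T_L = lambda v: v                                    # linear preprocessing
T_s = lambda v: np.minimum(v, 3.5)                   # spectral preprocessing

# Linear estimator: sensing vectors weighted by the preprocessed measurements.
x_lin = A.T @ T_L(y) / n

# Spectral estimator: principal eigenvector of D_n = A^T diag(T_s(y)) A.
D_n = A.T @ (T_s(y)[:, None] * A)
_, eigvecs = np.linalg.eigh(D_n)
x_spec = eigvecs[:, -1]
if x_spec @ x_lin < 0:                               # resolve the sign ambiguity (cf. Remark 4)
    x_spec = -x_spec

def overlap(u):
    return abs(u @ x) / (np.linalg.norm(u) * np.linalg.norm(x))

theta = 1.0                                          # free knob; theta* is derived in Corollary 2
x_comb = theta * x_lin / np.linalg.norm(x_lin) + x_spec

print("overlap linear  :", overlap(x_lin))
print("overlap spectral:", overlap(x_spec))
print("overlap combined:", overlap(x_comb))
```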

Notation and Definitions
Given n ∈ N, we use the shorthand [n] = {1, . . . , n}. Given a vector x, we denote by ‖x‖_2 its Euclidean norm. Given a matrix A, we denote by ‖A‖_op its operator norm.
Given two probability measures μ (on a space X) and ν (on a space Y), a coupling ρ of μ and ν is a probability distribution on X × Y whose marginals coincide with μ and ν, respectively. For k ≥ 1, the Wasserstein-k (W_k) distance between two probability measures μ, ν on R^n is defined by W_k(μ, ν) = (inf_ρ E_{(X,Y)∼ρ} ‖X − Y‖_2^k)^{1/k}, where the infimum is over all the couplings ρ of μ and ν. A sequence of probability distributions ν_n on R^m converges in W_k to ν, written ν_n → ν in W_k, if W_k(ν_n, ν) → 0 as n → ∞.
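For one-dimensional empirical measures with the same number of atoms, the optimal coupling simply matches order statistics, so W_k can be computed directly. The short Python sketch below illustrates this standard fact; it is generic and not specific to this paper.

```python
import numpy as np

def wasserstein_k_1d(a, b, k=2):
    """W_k distance between two empirical measures on R with equally many atoms.
    In one dimension the monotone (sorted) coupling is optimal for this cost,
    so W_k^k is the average k-th power gap between order statistics."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    assert a.shape == b.shape, "this shortcut assumes equal sample sizes"
    return float(np.mean(np.abs(a - b) ** k) ** (1.0 / k))

rng = np.random.default_rng(1)
# Two empirical approximations of N(0,1): the W_2 distance should be close to 0.
print(wasserstein_k_1d(rng.standard_normal(10_000), rng.standard_normal(10_000), k=2))
```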

Generalized Linear Model
Let x ∈ R^d be the signal of interest. We assume that ‖x‖²_2 = d. The signal is observed via inner products with n sensing vectors (a_i)_{i∈[n]}, with each a_i ∈ R^d having independent Gaussian entries with mean zero and variance 1/d, that is, a_i ∼ i.i.d. N(0_d, I_d/d). Given g_i = ⟨x, a_i⟩, the measurement vector y ∈ R^n is obtained by drawing each component y_i independently according to a conditional distribution p_{Y|G}(· | g_i). We stack the sensing vectors as rows to define the n × d sensing matrix A, that is, A = [a_1, . . . , a_n]^T.
We write δ_n = n/d for the sampling ratio and assume that δ_n → δ ∈ (0, ∞). Since the entries of the sensing matrix are i.i.d. N(0, 1/d), each row of A has norm close to 1.

Linear Estimator
Given the measurements (y_i)_{i∈[n]} and a preprocessing function T_L : R → R, consider the following linear estimator that averages the data:
x̂^L = (√d/n) Σ_{i=1}^n T_L(y_i) a_i.    (10)
The following lemma characterizes the asymptotic performance of this simple estimator. The proof is rather straightforward, and we include it in Appendix A for completeness.
Lemma 1 Let x ∈ R^d with ‖x‖²_2 = d, let the sensing vectors a_i have i.i.d. N(0, 1/d) entries, and let y be distributed according to (7). Let n/d → δ, G ∼ N(0, 1), and define Z_L = T_L(Y) for Y ∼ p_{Y|G}(· | G), such that E{|G Z_L|} < ∞. Let x̂^L be the linear estimator defined as in (10). Then, as n → ∞, almost surely,
⟨x̂^L, x⟩ / (‖x̂^L‖_2 ‖x‖_2) → ρ_L := E{G Z_L} / √(E{G Z_L}² + E{Z_L²}/δ),    (11)
and ‖x̂^L‖_2 converges almost surely to n_L := √(E{G Z_L}² + E{Z_L²}/δ).
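As a quick numerical check, the following Python snippet evaluates the asymptotic prediction ρ_L of Lemma 1 by Monte Carlo. The output model and the identity preprocessing are the illustrative choices from the introduction and serve only as placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def rho_L_prediction(delta, T_L, sample_y_given_g, num_mc=200_000):
    """Monte Carlo evaluation of rho_L = E[G Z_L] / sqrt(E[G Z_L]^2 + E[Z_L^2] / delta),
    the limit in Lemma 1, with Z_L = T_L(Y), Y ~ p(.|G) and G ~ N(0, 1)."""
    g = rng.standard_normal(num_mc)
    z = T_L(sample_y_given_g(g))
    a, b = np.mean(g * z), np.mean(z ** 2)
    return a / np.sqrt(a ** 2 + b / delta)

# Running example: y = 0.3 g + g^2 + N(0, 0.2), identity preprocessing T_L(y) = y.
sample_y = lambda g: 0.3 * g + g ** 2 + rng.normal(0.0, np.sqrt(0.2), size=g.shape)
for delta in (1.0, 4.0, 8.0):
    print(delta, rho_L_prediction(delta, lambda y: y, sample_y))
```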

Spectral Estimator
Given the measurements (y_i)_{i∈[n]}, consider the n × n diagonal matrix Z_s = diag(T_s(y_1), . . . , T_s(y_n)), where T_s : R → R is a preprocessing function, and consider the d × d matrix D_n = A^T Z_s A. Let G ∼ N(0, 1), Y ∼ p_{Y|G}(· | G), and Z_s = T_s(Y). We will make the following assumptions on Z_s.
(A1) P(Z_s = 0) < 1.
(A2) Z_s has bounded support, and τ denotes the supremum of this support.
(A3) As λ approaches τ from the right, the corresponding limit condition holds; informally, Z_s has sufficient probability mass near τ.
Let us comment on these assumptions. First, the condition (A1) simply avoids the degenerate case in which the measurement vector, after passing through the preprocessing function, is 0 with high probability. Second, the condition (A2) requires that the support of Z_s is bounded both from above and below. This assumption appears in the papers that have recently analyzed the performance of spectral estimators [34,35,37], and it is also required for Lemma 2. Requiring that the support of Z_s is bounded from above is rather natural, since the argument relies on the matrix D_n having a spectral gap. It is not clear whether having the support of Z_s bounded from both sides (rather than only from above) is necessary, and investigating this aspect is an interesting avenue for future research. Let us also point out that the condition (A2) is purely technical and rather mild. In fact, if the desired preprocessing function is not bounded, then one can construct a sequence of bounded approximations that approach its performance, as done, e.g., in [35]. Finally, the condition (A3) essentially requires that Z_s has sufficient probability mass near the supremum of the support τ. One sufficient condition is that the law of Z_s has a point mass at τ. If this is not the case, the argument in (115)-(118) of [37] shows how to modify the preprocessing function T_s so that (i) condition (A3) holds, and (ii) the spectral estimator suffers no performance loss.
The spectrum of D_n exhibits a phase transition as δ increases. The most basic phenomenon of this kind was unveiled for low-rank perturbations of a Wigner matrix: the well-known BBP phase transition, first discovered in the physics literature [26], and named after the authors of [2]. Here, the model for the random matrix D_n is quite different from that considered in [2,26], and the phase transition is formalized by the following result.
Lemma 2 Let x ∈ R^d with ‖x‖²_2 = d, let the sensing vectors a_i have i.i.d. N(0, 1/d) entries, and let y be distributed according to (7). Let n/d → δ, G ∼ N(0, 1), and define Z_s = T_s(Y) for Y ∼ p_{Y|G}(· | G). Assume that Z_s satisfies the assumptions (A1)-(A2)-(A3). Let x̂^s be the principal eigenvector of the matrix D_n, defined as in (13). Then, the following results hold:
1. The equation (20) admits a unique solution, call it λ*_δ, for λ > τ.
2. As n → ∞, the normalized correlation between x̂^s and x converges almost surely to a limit ρ_s, expressed in terms of ψ'_δ(λ*_δ) and φ'(λ*_δ), where ψ'_δ and φ' denote the derivatives of these two functions.
3. Let λ_1^{D_n} ≥ λ_2^{D_n} denote the two largest eigenvalues of D_n. Then, as n → ∞, both converge almost surely to deterministic limits, with the limit of λ_1^{D_n} strictly larger than that of λ_2^{D_n} above the phase transition.
The proof immediately follows from Lemma 2 of [37]. In that statement, it is assumed that x is uniformly distributed on the d-dimensional sphere, but this assumption is actually never used. In fact, since the sensing vectors {a_i}_{1≤i≤n} are i.i.d. Gaussian, to prove the result above we can assume without loss of generality that x = √d e_1, where e_1 is the first element of the canonical basis of R^d. We also note that the signal x and the measurement matrix A differ from Lemma 2 in [37] by a scaling factor. This accounts for an extra term δ in the expression of the eigenvalues of D_n.
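The phase transition can be observed directly in simulation. The sketch below (Python, using the same illustrative model and preprocessing as in Fig. 1) prints the two largest eigenvalues of D_n and the overlap of its principal eigenvector with x for a few sampling ratios; in this example, as δ grows, a gap opens between λ_1 and λ_2 and the overlap becomes non-trivial.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 400
T_s = lambda y: np.minimum(y, 3.5)

for delta in (0.5, 2.0, 8.0):
    n = int(delta * d)
    x = rng.standard_normal(d)
    x *= np.sqrt(d) / np.linalg.norm(x)
    A = rng.standard_normal((n, d)) / np.sqrt(d)
    g = A @ x
    y = 0.3 * g + g ** 2 + rng.normal(0.0, np.sqrt(0.2), size=n)
    D_n = A.T @ (T_s(y)[:, None] * A)                 # D_n = A^T diag(T_s(y)) A
    eigvals, eigvecs = np.linalg.eigh(D_n)
    lam1, lam2 = eigvals[-1], eigvals[-2]
    ov = abs(eigvecs[:, -1] @ x) / np.linalg.norm(x)  # eigenvector already has unit norm
    print(f"delta={delta:4.1f}  lam1={lam1:6.3f}  lam2={lam2:6.3f}  overlap={ov:5.3f}")
```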

Main Results
Throughout this section, we will make the following additional assumptions on the signal x, the output function of the GLM, and the preprocessing functions T L and T s used for the linear and spectral estimators, respectively.
(B1) Let P̂_{X,d} denote the empirical distribution of x ∈ R^d, with ‖x‖²_2 = d. As d → ∞, P̂_{X,d} converges weakly to a distribution P_X such that, for some k ≥ 2, X ∼ P_X has finite moment of order 2k − 2 (together with convergence of the corresponding empirical moments). (B2) The function T_L : R → R is Lipschitz and |E{T_L(Y) G}| > 0; the function T_s : R → R is bounded and Lipschitz.
The assumption (B2) is mainly technical and rather mild, since one can construct a sequence of approximations of the desired T L , T s that satisfy (B2).
Lemmas 1 and 2 in the previous sections derive formulae for the asymptotic correlation of x with the linear estimator x̂^L and the spectral estimator x̂^s. For convenience, we denote these limits by ρ_L and ρ_s, respectively. We also denote by n_L the high-dimensional limit of ‖x̂^L‖_2, and we define the parameter q in (25), which captures the limiting cross-correlation between the two (normalized) estimators.

Joint Distribution of Linear and Spectral Estimators
The key technical challenge is to compute the limiting joint empirical distribution of the signal x, the linear estimatorx L , and the spectral estimatorx s . This result is stated in terms of pseudo-Lipschitz test functions.

Definition 1 (Pseudo-Lipschitz test function)
We say that a function ψ : R^m → R is pseudo-Lipschitz of order k, denoted ψ ∈ PL(k), if there exists a constant C > 0 such that |ψ(x) − ψ(y)| ≤ C (1 + ‖x‖_2^{k−1} + ‖y‖_2^{k−1}) ‖x − y‖_2 for all x, y ∈ R^m.

Theorem 1 Let x ∈ R^d with ‖x‖²_2 = d, let the sensing vectors a_i have i.i.d. N(0, 1/d) entries, let y be distributed according to (7), and let n/d → δ. Assume that (A1)-(A2)-(A3) and (B1)-(B2) hold. Assume further that ψ'_δ(λ*_δ) > 0, and let x̂^s be the principal eigenvector of D_n, defined as in (13) with preprocessing function T_s, with the sign of x̂^s chosen so that ⟨x̂^s, x⟩ ≥ 0. Let x̂^L be the linear estimator defined as in (10) with preprocessing function T_L.
Consider the rescalings x_s = √d x̂^s and x_L = √d x̂^L / n_L. Then, the following holds almost surely for any PL(k) function ψ : R³ → R:
lim_{n→∞} (1/d) Σ_{i=1}^d ψ(x_i, x_{L,i}, x_{s,i}) = E{ψ(X, ρ_L X + W_L, ρ_s X + W_s)}.    (27)
Here, X ∼ P_X, and (W_L, W_s) are independent of X and jointly Gaussian with zero mean and a covariance specified in terms of the preprocessing functions T_L and T_s. Note that the order k of the pseudo-Lipschitz test function appearing in (27) is the same as the integer k appearing in assumption (B1). In particular, the order of pseudo-Lipschitz functions for which (27) holds is only constrained by the fact that the random variables X and Y should have finite moments of order 2k − 2. The proof of the theorem is given in Sect. 5.
Remark 1 (What happens if either linear or spectral is ineffective?) From Lemma 1, the assumption |E{T_L(Y) G}| > 0 contained in (B2) implies that |ρ_L| > 0. Similarly, from Lemma 2, the assumption ψ'_δ(λ*_δ) > 0 implies that ρ_s > 0. Thus, Theorem 1 assumes that both the linear and the spectral estimators are effective. We note that a similar result also holds when only one of the two estimators achieves strictly positive correlation. In that case, ψ : R² → R takes as input the components of the signal x and of the estimator that is effective (as well as the corresponding random variables), and a formula analogous to (27) holds. The proof of this claim is easily deduced from the argument of Theorem 1. A simpler proof using a rotational invariance argument along the lines of (144)-(151) also leads to the same result.
Remark 2 (Convergence in W k ) The result in Eq. (27) is equivalent to the statement that the empirical joint distribution of (x, x L , x s ) converges almost surely in W k distance to the joint law of (X , ρ L X + W L , ρ s X + W s ).
This follows from the fact that a sequence of distributions P_n with finite k-th moment converges in W_k to P if and only if P_n converges weakly to P and ∫ ‖a‖^k dP_n(a) → ∫ ‖a‖^k dP(a), see [55, Definition 6.7, Theorem 6.8].

Optimal Combination
Equipped with the result of Theorem 1, we now reduce the vector problem of estimating x given (x̂^L, x̂^s) to an estimation problem over scalar random variables, i.e., how to optimally estimate X from the observations X_L := ρ_L X + W_L and X_s := ρ_s X + W_s, where W_L and W_s are jointly Gaussian. The Bayes-optimal estimator for this scalar problem is the conditional expectation
F*(X_L, X_s) = E{X | X_L, X_s}.    (29)
This is formalized in the following result.

Lemma 3
Let (X, X_L, X_s) be jointly distributed random variables such that X_L = ρ_L X + W_L and X_s = ρ_s X + W_s, where (W_L, W_s) are jointly Gaussian, independent of X, with zero mean and a given covariance. Consider the following optimal estimation problem of X given X_L and X_s, over all measurable estimators f with E{f(X_L, X_s)²} < ∞:
maximize E{X f(X_L, X_s)} / √(E{X²} E{f(X_L, X_s)²}).    (32)
Then, for any c ≠ 0, X̂ = c F*(X_L, X_s) attains the maximum in (32), where F* is defined in (29).

Proof By the tower property of conditional expectation, for any admissible f we have
E{X f(X_L, X_s)} = E{E{X | X_L, X_s} f(X_L, X_s)} ≤ √(E{E{X | X_L, X_s}²} E{f(X_L, X_s)²}),    (33)
where we have used the Cauchy-Schwarz inequality. Moreover, the inequality in (33) becomes an equality if and only if f(X_L, X_s) = c E{X | X_L, X_s} for some c ≠ 0, which proves the result.
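For concreteness, here is a small Python sketch of the scalar estimator (29): the posterior mean E[X | X_L, X_s] is approximated by reweighting samples drawn from the prior P_X with the bivariate Gaussian likelihood of the observation model X_L = ρ_L X + W_L, X_s = ρ_s X + W_s. The values of ρ_L, ρ_s and the noise covariance Σ below are placeholders; in the setting of this paper they would be supplied by Theorem 1.

```python
import numpy as np

def bayes_combiner(prior_samples, rho_L, rho_s, Sigma):
    """Return F*(x_L, x_s) = E[X | X_L = x_L, X_s = x_s] as a callable, for the model
    X_L = rho_L X + W_L, X_s = rho_s X + W_s with (W_L, W_s) ~ N(0, Sigma).
    The posterior mean is approximated by reweighting samples from the prior."""
    Sigma_inv = np.linalg.inv(np.asarray(Sigma, float))
    X = np.asarray(prior_samples, float)

    def F_star(x_L, x_s):
        r = np.stack([x_L - rho_L * X, x_s - rho_s * X], axis=-1)   # residuals, shape (N, 2)
        log_w = -0.5 * np.einsum("ni,ij,nj->n", r, Sigma_inv, r)    # Gaussian log-likelihoods
        w = np.exp(log_w - log_w.max())
        return float(np.sum(w * X) / np.sum(w))

    return F_star

# Illustrative numbers only; rho_L, rho_s and Sigma are dictated by Theorem 1 in practice.
rng = np.random.default_rng(4)
F = bayes_combiner(prior_samples=rng.choice([-1.0, 1.0], size=50_000, p=[0.7, 0.3]),
                   rho_L=0.5, rho_s=0.6, Sigma=[[0.75, 0.10], [0.10, 0.64]])
print(F(0.8, 0.9), F(-1.2, -0.7))
```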
At this point, we are ready to show how to optimally combine the linear estimator x̂^L and the spectral estimator x̂^s.

Theorem 2 (Optimal combination) Consider the setting of Theorem 1. Let F* be defined in (29) and assume that F* ∈ PL(k/2). Then, as n → ∞, the normalized correlation between F*(x_L, x_s) and x converges almost surely to the limit ρ* given in (34), where F* acts component-wise on x_L and x_s, i.e., F*(x_L, x_s) = (F*(x_{L,1}, x_{s,1}), . . . , F*(x_{L,d}, x_{s,d})). Furthermore, for any f ∈ PL(k/2) acting component-wise on x_L and x_s, almost surely, the normalized correlation between f(x_L, x_s) and x converges to a limit that is at most ρ*, as stated in (35).

Proof By applying Theorem 1 with a suitable pseudo-Lipschitz test function and taking f = F*, the result (34) immediately follows. By applying Lemma 3, (35) also follows and the proof is complete.
The integer k appearing in assumption (B1) is the same one defining the order k/2 of the pseudo-Lipschitz functions F * and f in (34)- (35).
Remark 3 (What happens if either linear or spectral is ineffective?) Theorem 2 considers the same setting as Theorem 1, and therefore it assumes that ρ_L ≠ 0 and ρ_s ≠ 0. The results in (34)-(35) still hold if either ρ_L = 0 or ρ_s = 0 (and even in the case ρ_L = ρ_s = 0). For the sake of simplicity, suppose that ρ_L = 0 (the argument for ρ_s = 0 is analogous). Then, X_L = W_L is independent of X and therefore the conditional expectation in (29) does not depend on X_L. Recall from Remark 1 that if ρ_L = 0, then a formula analogous to (27) holds where ψ : R² → R takes as input the components of x and x_s on the LHS, and the corresponding random variables on the RHS. Hence, (34)-(35) are obtained by following the same argument in the proof of Theorem 2.
We highlight that, even if one of the two estimators is ineffective, the proposed optimal combination can still improve on the performance of the other one. This is due to the fact that the function F* takes advantage of the knowledge of the signal prior. We showcase an example of this behavior for a binary prior in Fig. 5.

Remark 4 (Sign of x̂^s) The spectral estimator x̂^s is defined up to a change of sign, since it is the principal eigenvector of a suitable matrix. In Theorems 1 and 2, we pick the sign of x̂^s such that ⟨x̂^s, x⟩ ≥ 0. In practice, there is a simple way to resolve the sign ambiguity: one can match the sign of q as defined in (25) with the sign of the scalar product ⟨x̂^L, x̂^s⟩ (see also (41)).

Remark 5 (Sufficient condition for pseudo-Lipschitz F * )
The assumption that the Bayes-optimal estimator F* in (29) is pseudo-Lipschitz is fairly mild. In fact, F* is Lipschitz if either of the following two conditions on X holds [20, Lemma 3.8]: (i) X has a log-concave distribution, or (ii) there exist independent random variables U, V such that U is Gaussian, V is compactly supported, and X is distributed as U + V.

Remark 6 (Non-separable combinations) The optimality of F* in Theorem 2 can be extended to a class of combined estimators that are not necessarily separable, i.e., that do not act component-wise on x_L and x_s. The corresponding result is in terms of convergence in probability, while the limiting statement in (35) holds almost surely. This is because the state evolution result for AMP with non-separable functions [6, Theorem 1] is obtained in terms of convergence in probability.

A Special Case: Optimal Linear Combination
Theorem 2 shows that the optimal way to combine x̂^L and x̂^s is via the Bayes-optimal estimator F* for the corresponding scalar problem. If the signal prior X is standard Gaussian, then F*(x_L, x_s) is a linear combination of x_L and x_s. In this section, we provide closed-form expressions for the performance of such an optimal linear combination. For convenience, let us denote the normalized linear estimator by x̄^L, i.e., x̄^L = x̂^L/‖x̂^L‖_2. (Recall that the spectral estimator x̂^s is already normalized, i.e., ‖x̂^s‖_2 = 1.) We consider an estimator x̂^c(θ) of x, parameterized by θ ∈ R ∪ {±∞} and defined as in (38), namely x̂^c(θ) = θ x̄^L + x̂^s, where we use the convention x̂^c(θ) = ±x̄^L for θ = ±∞.
Let us now compute the asymptotic performance of the proposed estimator x̂^c(θ) in (38). Specifically, using Lemmas 1 and 2, the limit of ⟨x̂^c(θ), x⟩ follows immediately. In order to conclude with the limit of the normalized correlation (2), it still remains to compute the magnitude ‖x̂^c(θ)‖_2 of the new estimator. This is possible thanks to the following result, which gives the correlation between the linear and the spectral estimator, as well as the asymptotic performance of the linear combination x̂^c(θ).

Corollary 1 (Performance of linear combination) Consider the setting of Theorem 1.
Then, as n → ∞, the scalar product ⟨x̄^L, x̂^s⟩ converges almost surely to q, where q is given by (25). Furthermore, let x̂^c(θ) be the estimator defined in (38), parameterized by θ ∈ R. Then, as n → ∞, the normalized correlation of x̂^c(θ) with x converges almost surely to a deterministic limit expressed in terms of ρ_L, ρ_s, q and θ. Using (23), the parameter q can be alternatively expressed in terms of ρ_L and ρ_s in the compact form (43). Using the characterization of Corollary 1, we can compute the combination coefficient θ* that leads to the asymptotically optimal linear combination of the form (38).

Corollary 2 (Optimal linear combination) Recall the notation in Corollary 1 and define the combination coefficient θ* as in (44). Then, θ* maximizes the limiting normalized correlation of x̂^c(θ), and the optimal value is given in (45).

The proof of Corollary 1 is deferred to Appendix B. Let us now comment on the assumption |q| < 1. If x̂^L and x̂^s are perfectly correlated (i.e., |q| = 1), then it is clear that the combined estimator x̂^c(θ) cannot improve the performance for any value of θ. On the contrary, when |q| < 1, Corollary 2 characterizes when the linear combination x̂^c strictly improves upon the performance of the individual estimators x̂^L and x̂^s. Specifically, denoting by p the ratio defined in (46) and by F(θ) the right-hand side of (45), it can be readily checked that F(θ*) > ρ_max provided that |q| < 1 and q ≠ p.
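The following Python helper illustrates the object of Corollary 2 on synthetic data: it grid-searches the coefficient θ in θ x̄^L + x̂^s against the known signal and reports the best overlap. This oracle search is only a sanity check of the asymptotic theory; in practice θ would be set to the value θ* of (44), which requires only ρ_L, ρ_s and q.

```python
import numpy as np

def best_linear_combination(x_lin_unit, x_spec, x_true, thetas=np.linspace(-5.0, 5.0, 2001)):
    """Oracle grid search over theta for the combination theta * x_lin_unit + x_spec.
    x_lin_unit is the normalized linear estimate, x_spec the (unit-norm) spectral estimate,
    and x_true the synthetic ground-truth signal; returns (best theta, best overlap)."""
    def overlap(u):
        return abs(u @ x_true) / (np.linalg.norm(u) * np.linalg.norm(x_true))
    scores = [overlap(t * x_lin_unit + x_spec) for t in thetas]
    best = int(np.argmax(scores))
    return thetas[best], scores[best]

# Example usage, reusing x_lin, x_spec, x from the simulation sketch in the introduction:
#   theta_hat, rho_hat = best_linear_combination(x_lin / np.linalg.norm(x_lin), x_spec, x)
```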
Remark 7 (Optimization of preprocessing functions) The linear estimatorx L and the spectral estimatorx s use the preprocessing functions T L (cf. (3)) and T s (cf. (4)), respectively. The choice of these functions naturally affects the performance of the two estimators, as well as, that of the combined estimator F * (x L , x s ). Lemmas 1-2, Theorem 2 and Corollary 2 derive sharp asymptotics on the estimation performance that hold for any choice of the preprocessing functions T L and T s satisfying our technical assumptions (A1), (A2), (A3), (B2). In Appendix C, we briefly discuss how these results can be used to yield optimal such choices. Specifically, the optimal preprocessing function that maximizes the normalized correlation of the spectral estimator is derived in [35], see Appendix C.2. The optimal choice for the linear estimator is much easier to obtain and we outline the process in Appendix C.1. In Appendix C.3, we combine these two results to derive a precise characterization of the sampling regimes in which the linear estimator surpasses the spectral estimator, and vice-versa. Finally, in Appendix C.4, we use Corollary 2 to cast the problem of optimally choosing T L and T s to maximize the correlation of the combined estimatorx c (θ * ) as a function optimization problem. Solving the latter is beyond the scope of this paper, and it represents an intriguing future research direction.

Numerical Simulations
This section validates our theory via numerical experiments and provides further insights on the benefits of the proposed combined estimator. First, we consider a setting in which the signal x is uniformly distributed on the d-dimensional sphere. In this case, the limiting empirical distribution P X is Gaussian. Thus, the Bayes-optimal estimator F * (x L , x s ) in (29) is linear and is given byx c (θ * ), where θ * is determined in Corollary 2. For this scenario, we study in Figs. 2, 3 and 4 the performance gain ofx c (θ * ) for three different measurement models and for various noise levels.
Second, in Fig. 5 we consider a setting in which the entries of x are binary, drawn i.i.d. from the set {1, −1}, such that the empirical distribution is of the form P_X(1) = 1 − P_X(−1) = p, for some p ∈ (0, 1). For this case, we compute the Bayes-optimal estimator F*(x_L, x_s) and compare it with the optimal linear combination x̂^c(θ*) for various choices of output functions.

Optimal Linear Combination
In Fig. 2, we fix the input-output relation (see Fig. 2a), and we investigate the performance gain of the proposed combined estimator for different values of the noise variance σ². Here, x is generated via a standard Gaussian vector which is normalized such that ‖x‖²_2 = d, and the pre-processing functions are chosen as T_L(y) = y and T_s(y) = min{y, 3.5}. Note that the empirical distribution of x tends to a standard Gaussian distribution in the high-dimensional limit. Thus, following Sect. 3.3, the optimal combined estimator is (asymptotically) linear and is given by (38) for θ = θ* chosen as in (44). In Fig. 2b, we plot the percentage improvement (ρ* − ρ_max)/ρ_max × 100 as a function of the sampling ratio δ, for three values of the noise variance σ² = 0, 0.4 and 0.8. Here, ρ* = F(θ*) defined in (45) and ρ_max = max{|ρ_L|, ρ_s}. We observe that, as σ² increases, larger values of the sampling ratio δ are needed for the combined estimator to improve upon the linear and spectral estimators. However, for sufficiently large δ, the benefit of the combined estimator is more pronounced for larger values of the noise variance. For instance, for σ² = 0.8 and large values of δ, the percentage gain is larger than 10%. In Fig. 2c, we fix σ² = 0.4 and plot the correlations ρ_L, ρ_s and ρ*. The solid lines correspond to the theoretical predictions obtained by Lemma 1, Lemma 2 and Corollary 2, respectively. The theoretical predictions are compared against the results of Monte Carlo simulations. For the simulations, we used d = 1000 and averaged over 15 independent problem realizations. In Fig. 2c (Middle), we also plot the limit q (cf. (25)) of the cross-correlation ⟨x̂^L, x̂^s⟩/(‖x̂^L‖_2 ‖x̂^s‖_2) and the ratio p in (46). The corresponding values of the optimal combination coefficient θ* are plotted in Fig. 2c (Right). For values of δ smaller than the threshold for weak recovery of the spectral method (where ρ_s = 0), we observe that ρ* = ρ_L and θ* = ∞. However, for larger values of δ, the value of the optimal coefficient θ* is non-trivial. Finally, Fig. 2d shows the same plots as in Fig. 2c, but for σ² = 0.8.
The setting of Fig. 3 is the same as in Fig. 2, only now the input-output function is different (see Fig. 3a). Comparing Fig. 3b to Fig. 2b, note that the benefit of the combined estimator is more significant for the link function studied here. Moreover, in contrast to Fig. 2b, here the percentage gain of the combined estimator takes its maximum value in the noiseless case σ² = 0.
In Fig. 4, we repeat the experiments of Figs. 2 and 3, but for f(x) = 1 + 0.3x + (x² − 1). Compared to the two functions studied in Figs. 2 and 3, in Fig. 4 we observe that the performance gain is significantly larger and reaches values up to 30%. This can be argued by considering the expansion of the input-output functions in the basis of the Hermite polynomials, f(x) = Σ_{i≥0} H_i h_i(x). Specifically, recall that the first three (probabilists') Hermite polynomials are h_0(x) = 1, h_1(x) = x and h_2(x) = x² − 1. For the function f considered here, only the first three coefficients {H_i}, i = 0, 1, 2, are non-zero. To see the relevance of these coefficients to the linear and spectral estimators, note that for identity pre-processing functions it holds that E{G Z_L} equals the first Hermite coefficient H_1. Thus, it follows directly from Lemma 1 that ρ_L = 0 if the first Hermite coefficient H_1 is zero. Similarly, it can be shown using Lemma 2 that ρ_s = 0 if the second Hermite coefficient H_2 is zero; in fact, the threshold of weak recovery of the spectral method is infinity in this case (see (167) in Appendix C). Intuitively, the linear and spectral estimators exploit the energy of the output function corresponding to the Hermite polynomials of first and second order, respectively; see also [17,52]. In this example, we have chosen f(x) such that all of its energy is concentrated on the Hermite polynomials of order up to two.
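The Hermite coefficients are easy to estimate numerically. The short Python sketch below computes H_0, H_1, H_2 for a few output nonlinearities by Monte Carlo (using the probabilists' convention above, with the second coefficient normalized by 2!); consistently with the discussion, an absolute-value (phase-retrieval-like) nonlinearity has H_1 = 0, and a sign (1-bit-like) nonlinearity has H_2 = 0.

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal(1_000_000)

def hermite_coeffs(f):
    """Estimate the first three coefficients of f in the probabilists' Hermite basis
    h_0(x) = 1, h_1(x) = x, h_2(x) = x^2 - 1 (normalizing the k-th coefficient by k!)."""
    fG = f(G)
    H0 = np.mean(fG)
    H1 = np.mean(fG * G)                  # H1 = 0  =>  linear estimator ineffective
    H2 = np.mean(fG * (G ** 2 - 1)) / 2   # H2 = 0  =>  spectral estimator ineffective
    return H0, H1, H2

print(hermite_coeffs(lambda x: 1 + 0.3 * x + (x ** 2 - 1)))   # approx (1.0, 0.3, 1.0)
print(hermite_coeffs(np.abs))                                  # H1 ~ 0 (phase-retrieval-like)
print(hermite_coeffs(np.sign))                                 # H2 ~ 0 (1-bit-CS-like)
```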
As a final note, from the numerical results in Figs. 2, 3 and 4, we observe that the proposed optimal combination leads to a performance improvement only if both the linear and the spectral estimators are asymptotically correlated with the signal. This is because the signal prior is Gaussian (see Remark 3). In contrast, as we will see in the next section, when the signal prior is binary, the combined estimator provides an improvement even when only the linear estimator is effective.

Bayes-Optimal Combination
In Figs. 5a-d, we consider the same setting as in Figs. 3 and 4, respectively. However, here each entry of x takes value either +1 or −1. Each entry is chosen independently according to the distribution P_X(1) = 1 − P_X(−1) = p, for p ∈ (0, 1). Thus, the Bayes-optimal combination x̂^mmse = F*(x_L, x_s) is not necessarily linear as in the Gaussian case. In Appendix D, we compute the Bayes-optimal estimator F*(x_L, x_s) (cf. (29)) for the setting considered here. Then, we use the prediction of Theorem 2 to plot in solid black lines the normalized correlation of x̂^mmse with x (i.e., ρ*_mmse = ρ* in (34)). The theoretical predictions (solid lines) are compared against the results of Monte Carlo simulations (markers). Moreover, we compare the optimal performance against those of the linear estimator (cf. ρ_L), the spectral estimator (cf. ρ_s) and the optimal linear combination x̂^c(θ*) (cf. ρ*_linear). We have chosen p = 0.3 in Fig. 5a, c and p = 0.5 in Fig. 5b, d. Note that the optimal linear combination provides a performance improvement only for the values of δ such that ρ_L > 0 and ρ_s > 0. On the contrary, ρ*_mmse is strictly larger than ρ_L even when ρ_s = 0.

Fig. 5 Bayes-optimal combination vs. optimal linear combination for a binary prior P_X(1) = 1 − P_X(−1) = p. In a, b (resp. c, d) the setting is otherwise the same as in Fig. 3 (resp. Fig. 4).

The Combined Estimator as Initialization for Local Algorithms
As mentioned in the introduction, the initial estimates of the signal obtained by either the linear/spectral methods or our proposed combined estimator can then be refined via local algorithms such as gradient descent (GD). The theory and numerical simulations in the previous sections showed the superiority of our proposed combined estimator x̂^mmse = F*(x_L, x_s) over x_L and x_s in terms of correlation with the true signal. In Fig. 6, by plotting the correlation of the GD iterates x^t, t ≥ 1, for different initializations, we show numerically how this improvement translates to improved performance of gradient descent refinements. Specifically, we ran GD with step size 1/2 on the squared error loss, whose i-th term is (y_i − f(⟨a_i, x^t⟩))². Here, f is the output function described in the caption of the figure.
In Fig. 6a, b, the true signal has i.i.d. Gaussian entries, thus the linear combined estimator x̂^c is the optimal combination in terms of correlation performance. We observe that, for two different choices of output function and sampling ratio (see caption), GD with linear or spectral initialization requires hundreds of iterations to reach the performance of the combined estimator. In Fig. 6c, d, the true signal has entries in {+1, −1}, chosen independently with P_X(1) = p, for p = 0.3 and p = 0.1, respectively. Here, the linear combined estimator is sub-optimal (but still improves upon linear/spectral), thus we also compute and study the Bayes-optimal estimator. Interestingly, while for both priors GD converges to the same correlation as t increases, for p = 0.1 GD achieves higher correlation if stopped early. It is a fascinating, albeit challenging, question to better understand the evolution of the GD trajectory as a function of the initialization, signal prior and output function.
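A minimal Python sketch of the refinement step used here: gradient descent on the squared error loss, started from any of the initializations above. The gradient expression assumes a differentiable output function f with derivative f_prime; the step size mirrors the description in the text, while the iteration count and the usage line are illustrative.

```python
import numpy as np

def gd_refine(x0, A, y, f, f_prime, step=0.5, num_iter=300):
    """Gradient descent on L(x) = (1/2n) * sum_i (y_i - f(<a_i, x>))^2, started from x0
    (e.g., the linear, spectral, or combined estimate). Returns the trajectory of iterates."""
    x = np.array(x0, dtype=float)
    n = len(y)
    trajectory = [x.copy()]
    for _ in range(num_iter):
        g = A @ x
        grad = -(A.T @ ((y - f(g)) * f_prime(g))) / n
        x = x - step * grad
        trajectory.append(x.copy())
    return trajectory

# Example usage (reusing A, y, x_comb from the earlier sketch, with f(u) = 0.3u + u^2):
#   traj = gd_refine(x_comb, A, y, f=lambda u: 0.3*u + u**2, f_prime=lambda u: 0.3 + 2*u)
```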

Proof Sketch
The proof of Theorem 1 is based on the design and analysis of a generalized approximate message passing (GAMP) algorithm. GAMP is a class of iterative algorithms proposed by Rangan [44] for estimation in generalized linear models. A GAMP algorithm is defined in terms of a sequence of Lipschitz functions f_t : R → R and g_t : R × R → R, for t ≥ 0. For t ≥ 0, the GAMP iteration computes iterates u^t ∈ R^n and v^{t+1} ∈ R^d via (48). Here, the functions f_t and g_t are understood to be applied component-wise, i.e., f_t(v^t) = (f_t(v^t_1), . . . , f_t(v^t_d)) and g_t(u^t; y) = (g_t(u^t_1; y_1), . . . , g_t(u^t_n; y_n)). The scalars b_t, c_t are Onsager correction coefficients, defined in (49) in terms of the derivatives of f_t and g_t, where g'_t(·, ·) denotes the derivative with respect to the first argument. The iteration (48) is initialized with u^0 = c 1_n for some constant c > 0. Here, 1_n ∈ R^n denotes the all-ones vector.
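For readers who prefer pseudocode, here is a schematic Python implementation of a GAMP iteration of the general shape just described. It is a sketch only: the functional form of the updates follows the standard GAMP template, and the Onsager coefficients b_t, c_t are taken as plain empirical averages of the derivatives of f_t and g_t, whereas the precise normalization used in (48)-(49) depends on the scaling of A and is not reproduced here.

```python
import numpy as np

def gamp(A, y, f, g, df, dg, u0, num_iter=20):
    """Schematic GAMP iteration (cf. (48)-(49)), with f_0 taken to be zero:
         v^{t+1} = A^T g_t(u^t; y) - c_t * f_t(v^t)
         u^{t+1} = A  f_{t+1}(v^{t+1}) - b_{t+1} * g_t(u^t; y)
    Arguments f(t, v), g(t, u, y), df(t, v), dg(t, u, y) supply the denoisers and their
    derivatives; the Onsager terms below use a common convention and may need rescaling."""
    u = np.array(u0, dtype=float)
    v = np.zeros(A.shape[1])
    iterates = []
    for t in range(num_iter):
        g_t = g(t, u, y)
        c_t = np.mean(dg(t, u, y))            # Onsager coefficient for the v-update
        v = A.T @ g_t - c_t * f(t, v)         # v^{t+1}
        b_t = np.mean(df(t + 1, v))           # Onsager coefficient for the u-update
        u = A @ f(t + 1, v) - b_t * g_t       # u^{t+1}
        iterates.append(v.copy())
    return iterates
```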
A key feature of the GAMP algorithm (48) is that the asymptotic empirical distribution of its iterates can be succinctly characterized via a deterministic recursion, called state evolution. Hence, the performance of the high-dimensional problem involving the iterates u^t, v^t is captured by a scalar recursion. Specifically, this result gives that, for t ≥ 1, the empirical distributions of u^t and v^t converge in W_k distance to the laws of the random variables U_t and V_t, respectively, with U_t = μ_{U,t} G + σ_{U,t} W_{U,t}, where G ∼ N(0, 1) and W_{U,t} ∼ N(0, 1) are independent. Similarly, V_t = μ_{V,t} X + σ_{V,t} W_{V,t}, where X ∼ P_X and W_{V,t} ∼ N(0, 1) are independent. The deterministic parameters (μ_{U,t}, σ_{U,t}, μ_{V,t}, σ_{V,t}) are computed via the recursion (56) detailed in Sect. 5.2, and the formal statement of the state evolution result is contained in Proposition 1 (again in Sect. 5.2). Next, in Sect. 5.3, we show that a suitable choice of the functions f_t, g_t leads to a GAMP algorithm such that v^1 = √d x̂^L and v^t approaches (a multiple of) √d x̂^s as t → ∞. In order to approach the spectral estimator as t → ∞, we pick f_t, g_t to be linear functions; see (68). The idea is that, with this choice of f_t and g_t, the GAMP iteration is effectively a power method. Let us now briefly discuss why this is the case.
With the choice of f_t, g_t in (68), the GAMP iteration can be expressed in the matrix form (54). Furthermore, from the analysis of the fixed points of state evolution in Lemma 4, we obtain that β_∞ = √δ E{Z(G² − 1)}. Thus, after some manipulations, (54) can be rewritten so that v^∞ is an eigenvector of the matrix A^T Z(Z + δE{Z(G² − 1)} I_n)^{−1} A, and the GAMP algorithm is effectively performing a power method. Finally, we choose T (and consequently Z) so that Z(Z + δE{Z(G² − 1)} I_n)^{−1} = Z_s, with Z_s given by (12). In this way, the matrix A^T Z(Z + δE{Z(G² − 1)} I_n)^{−1} A coincides with the spectral matrix D_n, as defined in (13), and therefore v^t approaches the spectral estimator x̂^s.
In conclusion, as v 1 = √ dx L and v t tends to be aligned with √ dx s for large t, we can use the state evolution result of Proposition 1 to analyze the joint empirical distribution of (x, √ dx L , √ dx s ). At this point, the proof of Theorem 1 becomes a straightforward application of Lemma 6 and is presented at the end of Sect. 5.3. The proof of Lemma 6 is quite long, so it is provided separately in Sect. 5.4.

State Evolution for Generalized Approximate Message Passing (GAMP)
For t ≥ 1, let us define the following deterministic recursion for the parameters (μ U ,t , σ U ,t , μ V ,t , σ V ,t ) appearing in (51)-(52): Recalling from (50) that u 0 = c1 n , the state evolution recursion is initialized with Furthermore, for t ≥ 0, let the sequences of random variables (W U ,t ) t≥0 and (W V ,t ) t≥0 be each jointly Gaussian with zero mean and covariance defined as follows [6,47]. First, we have: Then, for r , t ≥ 1, we iteratively compute: Note that for r = t, by (56) we have E{W 2 U ,t } = E{W 2 V ,t } = 1.
At this point, we are ready to present the state evolution result [27,44] for the GAMP algorithm (48)-(49).

Proposition 1 Consider the GAMP iteration (48)-(49), with initialization u^0 = c 1_n ∈ R^n, for any constant c > 0. Assume that, for t ≥ 0, the functions g_t : R × R → R and f_t : R → R are Lipschitz, and that Assumption (B1) on p. 6 holds for some k ≥ 2. Then, for every t ≥ 1 and every PL(k) test function, the almost sure limits (61) and (62) hold; that is, the empirical distributions of u^t and v^t are characterized by the random variables U_t and V_t defined in (51)-(52).

Remark 8 Suppose that we have a sequence of P L(k) functions ψ
for some constants c U , c V ∈ R. Then, since the statements (61) and (62) hold with probability 1 for each fixed t ≥ 1, we have that, almost surely,

GAMP as a Power Method to Compute the Spectral Estimator
Consider the GAMP iteration in (48) initialized with u^0 = (1/δ) 1_n, and with the function g_0 : R × R → R chosen as in (66) (the function f_0 is assumed to be zero). From (10), we note that v^1 = √d x̂^L. For t ≥ 1, let the functions g_t : R × R → R and f_t : R → R be the linear functions specified in (68), where the function T : R → R is bounded and Lipschitz, and β_t is a constant, defined iteratively for t ≥ 1 via the state evolution equations below (Eqs. (72)-(74)). To prove Theorem 1, we will choose T as a suitable function of T_s (see (82)). Note that the functions g_t and f_t are required to be Lipschitz for t ≥ 0, and this will be ensured by choosing T to be bounded and Lipschitz (see Lemma 6).
With this choice of f_t, g_t, for t ≥ 1, the scalars in (49) can be computed explicitly. In the GAMP iteration below, we will replace the parameter c_t by its almost sure limit c̄_t. The state evolution result in Proposition 1 still holds when c_t is replaced with c̄_t in the GAMP iteration [27,44]. This can be shown using the pseudo-Lipschitz property of the test functions ψ in Eqs. (61)-(62) and the fact that lim_{n→∞} (1/n) Σ_{i=1}^n T(y_i) = E{Z} almost surely, due to the strong law of large numbers.
With these choices, the GAMP iteration in (48) takes the form (70)-(71): initializing with (70), we have the updates (71) for t ≥ 1, where Z_L = diag(T_L(y_1), . . . , T_L(y_n)) and Z = diag(T(y_1), . . . , T(y_n)). The state evolution equations in (56) become (72)-(73). Here, we recall that G ∼ N(0, 1) and Z = T(Y), for Y ∼ p_{Y|G}(· | G). From (57), the state evolution iteration is initialized as in (75). We will show in Lemma 6 that, in the high-dimensional limit, the vector v^t in (71) tends to align with the principal eigenvector of the matrix M_n = A^T Z(Z + δE{Z(G² − 1)} I_n)^{−1} A as t → ∞. In other words, the GAMP iteration is equivalent to a power iteration for M_n. This equivalence, together with Proposition 1, allows us to precisely characterize the limiting empirical distribution of the principal eigenvector of M_n in Lemma 6.
We begin with a result characterizing the fixed points of the state evolution recursion in (72)-(73).

Lemma 4 Assume that E{Z(G² − 1)} > 0 and that δ > E{Z²}/(E{Z G²} − E{Z})². Then, the state evolution recursion (72)-(73) has three fixed points, denoted FP_0, FP_1 and FP_2. Furthermore, if the initialization (75) is such that μ_{V,1} > 0, then the recursion converges to FP_1. If μ_{V,1} < 0, the recursion converges to FP_2.
The lemma is proved in Appendix E. We note that Lemma 4 (and the subsequent Lemmas 5-6 as well) assumes that E{Z(G² − 1)} > 0 and δ > E{Z²}/(E{Z G²} − E{Z})². These conditions concern the auxiliary random variable Z (or, equivalently, the auxiliary function T). In the proof of Theorem 1, which appears at the end of this section, we will provide a choice of Z (depending on Z_s) that fulfills these requirements; see (82). Let us also point out that the condition δ > E{Z²}/(E{Z G²} − E{Z})² follows from ψ'_δ(λ*_δ) > 0, which in turn ensures that ρ_s > 0. For a discussion of the case ρ_s = 0, see Remark 1.
The next lemma shows that the mean-squared difference between successive AMP iterates vanishes as t → ∞ in the high-dimensional limit.

Lemma 5 Assume that E{Z(G² − 1)} > 0 and that δ > E{Z²}/(E{Z G²} − E{Z})². Consider the GAMP iteration in (71) initialized with u^0 = (1/δ) 1_n. Then, the normalized mean-squared differences between successive iterates vanish almost surely in the limit n → ∞ followed by t → ∞.
The proof of the lemma is given in Appendix F. The next result is the main technical lemma needed to prove Theorem 1. It shows that, as t grows large, v^t tends to be aligned with the principal eigenvector of the matrix M_n defined in (79). Theorem 1 is then obtained from Lemma 6 by using a suitable choice for T(·) in the GAMP iteration (71), which ensures that M_n is a scaled version of D_n in (13).

Lemma 6 Let x ∈ R^d with ‖x‖²_2 = d, let the sensing vectors a_i have i.i.d. N(0_d, I_d/d) entries, and let y be distributed according to (7). Let n/d → δ, G ∼ N(0, 1), and assume that the conditions (B1)-(B2) on p. 6 hold (with T_s(·) replaced by T(·) in (B2)). Assume also that Z is bounded,

)} and assume that Z satisfies the assumptions (A1), (A2), (A3) on p.5. Define the matrix
where Z = diag(T(y_1), . . . , T(y_n)). Let φ̂_1 be the principal eigenvector of M_n, let its sign be chosen so that ⟨φ̂_1, x⟩ ≥ 0, and consider the rescaling φ^(1) = √d φ̂_1. Also, let x̃^L = √d x̂^L, where x̂^L is the linear estimator defined in (10). Then, the limit (80) holds almost surely for any PL(k) function ψ : R × R × R → R. Here X ∼ P_X, the parameters μ̄_V, σ̄_V, β̄ are given by (76)-(77), and μ_{V,1}, σ²_{V,1} are given by (75). The random variables (W_{V,1}, W_{V,∞}) are independent of X, and jointly Gaussian with zero mean and covariance given by (81). We first show how Theorem 1 is obtained from Lemma 6, and then prove the lemma in the following subsection.

Proof of Theorem 1 Recall that λ*_δ is the unique solution of (20) for λ > τ, where τ is the supremum of the support of Z_s. Define Z = T(Y) as in (87). We now verify that Z satisfies the assumptions of Lemma 6. As λ*_δ > τ and Z_s has bounded support with supremum τ, we have that Z is bounded. Furthermore, Z is a Lipschitz function of Y, since Z_s is a Lipschitz function of Y and Z_s is bounded away from λ*_δ. Thus, Z satisfies the condition (B2) (with T_s(·) replaced by T(·)). Next, note that τ > 0 (since Z_s satisfies assumption (A3)). As P(λ*_δ − Z_s > 0) = 1, we have that P(Z > −1) = 1.
Using (87) in (79), we see that M_n = (1/λ*_δ) A^T Z_s A. Therefore, the principal eigenvector of M_n is equal to x̂^s. Furthermore, from (76)-(77), we can compute the coefficients μ̄_V, σ̄_V, β̄ explicitly. By using also (75), one can then easily verify that the limit (80) reduces to the claim of Theorem 1, which yields the desired result.

Proof of Lemma 6
Fix c > 0, and letZ = cZ andZ = cZ. Then, Thus, by inspecting the definition (79), one immediately obtains thatφ (1) does not change if we rescale Z and Z by the multiplicative factor c. Furthermore, by using the definitions (76)-(77), we have that and that (91) Consequently, both the LHS and the RHS of (80) are unchanged when we rescale Z and Z by the multiplicative factor c.
The argument above proves that, without loss of generality, we can rescale Z and Z by any multiplicative factor c > 0. To simplify the rest of the argument, it is convenient to assume the normalization condition Consider the iteration (71) with the initialization in (70). Since by hypothesis, Lemma 4 guarantees that the state evolution recursion (73) converges to either fixed point F P 1 or F P 2 as t → ∞. That is, The last equality above follows by combining (77) and (92). By substituting the expression for u t in (71) in the v t+1 update, the iteration can be rewritten as follows, for t ≥ 2: In the remainder of the proof, we will assume that t ≥ 2. Define By combining (96) with (94), we have that By substituting the expression for u t−1 obtained in (98) into (95), we have From (97) and (99), we obtain Let us decompose v t into a component in the direction ofφ 1 plus an orthogonal component r t : where ξ t = v t ,φ 1 . At this point, the idea is to show that, when t is large, r t is small, thus v t tends to be aligned withφ 1 . To do so, we prove that, as n → ∞, the largest eigenvalue of the matrix M n defined in (79) converges almost surely to δE{Z } + 1. Furthermore, we prove that the matrix M n exhibits a spectral gap, in the sense that the second largest eigenvalue of M n converges almost surely to a value strictly smaller than δE{Z } + 1. Since r t is orthogonal toφ 1 and M n has a spectral gap, the norm of e t 3 in (100) can be lower bounded by a strictly positive constant times the norm of r t . Next, using the expression in (101), we show that the norm of e t 3 can be made arbitrarily small by taking n and t sufficiently large. From this, we conclude that r t must be small.
We begin by proving that M n has a spectral gap.

Lemma 7 (Spectral gap for M n )
The following holds almost surely: for a numerical constant c 1 > 0 that does not depend on n.
Proof of Lemma 7 By hypothesis, Z = Z /(Z + 1) satisfies the assumptions (A1)-(A2)-(A3). Thus, we can use Lemma 2 to compute the almost sure limit of the two largest eigenvalues of M n , call them λ M n 1 ≥ λ M n 2 . Let τ denote the supremum of the support of Z . As P(Z > −1) = 1 and Z has bounded support, we have that τ < 1. For λ ∈ (τ, ∞), define and Thus, by using (92), we have that Furthermore, (109) Thus, where in the last step we combine the hypothesis δ > 2 with the normalization condition (92).
Let us now go back to (100) and combine it with (102). Then, We now prove that, almost surely, for all sufficiently large n, the following lower bound on the norm of the LHS of (114) holds: where c 2 > 0 is a numerical constant independent of n, t.
As the matrix A T Z(Z + I n ) −1 A − (δE{Z } + 1) I d is symmetric, it can be written in the form Q Q T , with Q orthogonal and diagonal. Furthermore, the columns of Q are the eigenvectors of A T Z(Z + I n ) −1 A − (δE{Z } + 1) I d and the diagonal entries of are the corresponding eigenvalues. As r t is orthogonal toφ 1 , we can write where is obtained from by changing the entry corresponding to λ M n 1 −(δE{Z }+1) to any other value. For our purposes, it suffices to substitute λ M n 1 − (δE{Z } + 1) with λ M n 2 − (δE{Z } + 1). Note that where λ min ( Q 2 Q T ) denotes the smallest eigenvalue of Q 2 Q T and the last equality follows from the variational characterization of the smallest eigenvalue of a symmetric matrix. Note that By using (104), we obtain that, almost surely, for all sufficiently large n, min i∈{2,...,d} By combining (116), (117), (118) and (119), we conclude that (115) holds.
Recalling that r t satisfies (114), we will next show that, almost surely, Combined with (114) and (115), this implies that lim t→∞ lim d→∞ By using the triangle inequality, we have where the second inequality uses φ 1 2 = 1 and that |ξ t | = v t ,φ 1 ≤ v t 2 . We can bound the second term on the RHS of (121) using the result in Proposition 1, applied with the PL(2) test function ψ(v) = v 2 . Then, almost surely, Here we used the definitions of V t and β 2 t from (52) and (74). Recalling from (93) that lim t→∞ β 2 t = 1 δ , the limit in (122) combined with Remark 8 and the continuous mapping theorem implies that, almost surely, Thus, by using (103), we conclude that, almost surely, We now bound the first term on the RHS of (121). Recalling the definition of e t 3 in (101), an application of the triangle inequality gives where the second inequality follows from the fact that, given a matrix M and a vector v, Let us bound the operator norm of the two matrices appearing in the RHS of (125). As the operator norm is sub-multiplicative, we have As Z is bounded, the operator norm of Z is upper bounded by a numerical constant (independent of n, t). The operator norm of (Z + I n ) −1 and (Z + √ δβ t I n ) −1 is also upper bounded by a numerical constant (independent of n, t). Indeed, from (93) β t → 1/ √ δ as t → ∞, and the support of Z does not contain points arbitrarily close to −1. We also have that, almost surely, for all sufficiently large n, the operator norm of A is upper bounded by a constant (independent of n, t). As a result, we deduce that, almost surely, for all sufficiently large n, t, where C is a numerical constant (independent of n, t). Furthermore, by Lemma 5, the following limits hold almost surely: By combining (93), (123), (127) and (128), we obtain that, almost surely, each of the terms in the RHS of (125) vanishes when scaled by the factor 1/ √ n, as t, n → ∞. As a result, almost surely, By combining (121), (124) and (129), we conclude that, almost surely, (120) holds.
Recall that r t satisfies (114). Thus, by combining the lower bound in (115) with the almost sure limit in (120), we obtain that, almost surely, Recalling from (102) that r t is the component of v t orthogonal toφ 1 , the result in (130) implies that v t tends to be aligned withφ 1 in the high-dimensional limit. In formulas, by combining (102) with (130), we have that, almost surely, Thus, by using (123), we obtain that, almost surely, To obtain the sign of ξ t , we first observe that, by Proposition 1, almost surely, As μ V ,0 = α > 0 and E{Z (G 2 − 1)} = 1/δ, the state evolution iteration (73) implies that μ V ,t > 0 for all t ≥ 0. Using (102), we can write Recall that by hypothesis, φ 1 , x ≥ 0. Moreover, using (130) and Cauchy-Schwarz, we have that, almost surely, Thus we deduce that, almost surely, Therefore, withφ (1) = √ dφ 1 . At this point, we are ready to prove (80). For any PL(k) function ψ : where C, C are universal constants (which may depend on k but not on d, n). The inequality in the second line above uses ψ ∈ PL(k), and the third and fourth lines are obtained via the Hölder and Cauchy-Schwarz inequalities. We now claim that, almost surely, where, from now on, we will use C to denote a generic positive constant that does not depend on d, n. If (138) holds, then by using (136) and (137), we deduce that, almost surely, Let us now prove (138). First, by assumption (B1), we have that Next, the main technical lemma [27,Lemma 2] leading to the state evolution result in Proposition 1 implies that, almost surely, for t ≥ 1, In particular, this follows from [27, Lemma 2(e)] (see also [4, Lemma 1(e)]). Sincẽ x L = v 1 , we also have that, almost surely, It remains to show that, almost surely, To do so, we use a rotational invariance argument. Let R ∈ R d×d be an orthogonal matrix such that Rx = x. Then, Consequently, we have that which immediately implies that (1) .
Then, we can decomposeφ (1) as where ϕ ⊥ is uniformly distributed over the set of vectors orthogonal to x with norm √ d and Relating the uniform distribution on the sphere to the normal distribution [54,Sec. 3.3.3], we can express ϕ ⊥ as follows: where u ∼ N (0 d , I d ) and independent of x. By the law of large numbers, we have the almost sure limits Thus, by combining (147) and (149), we conclude that where the coefficients c 1 and c 2 can be bounded by universal constants (independent of n, d) using (150). As a result, Note that, almost surely, where U ∼ N (0, 1). By combining (140), (152) and (153), (143) immediately follows. Finally, by combining (140), (141), (142) and (143), we deduce that (138) holds. We now use Proposition 1 which guarantees that, almost surely, To conclude the proof of (80), we take the limit t → ∞ and use Remark 8. For this, we will show that where (W V ,1 , W V ,∞ ) are zero mean jointly Gaussian random variables with covariance given by (81). Using (58) and the formulas for g 0 and g t from (66) and (68), we have In the second equality above, we used (72). Using the expression for σ V ,1 from (75) and letting t → ∞, we have Therefore, the sequence of zero mean jointly Gaussian pairs (W V ,1 , W V ,t ) t≥1 converges in distribution to the jointly Gaussian pair (W V ,1 , W V ,∞ ), whose covariance is given by (81).
To show (155), we use Lemma 9 in Appendix G. We apply this result taking Q t to be the distribution of Since μ V ,t →μ V , σ V ,t →σ V , the sequence (Q t ) t≥2 converges weakly to Q, which is the distribution of In our case, ψ : R 3 → R is PL(k), and therefore ψ(a, b, c) ≤ C (1 + |a| k + |b| k + |c| k ), for all (a, b, c) ∈ R 3 for some constant C . Choosing h(a, b, c . . , σ k V ,t }, with coefficients that do not depend on t. The integral h dQ has the same form, except that μ V ,t , σ V ,t are replaced bỹ μ V ,σ V , respectively. Since μ V ,t →μ V and σ V ,t →σ V , we have that lim t→∞ h dQ t = h dQ. Therefore, by applying Lemma 9 in Appendix G, we have that lim t→∞ ψ dQ t = ψ dQ, which is equivalent to (155). This completes the proof of the lemma.

A Proof of Lemma 1
By rotational invariance of the Gaussian measure, we can assume without loss of generality that Let us also denote the first column of the matrix A by u ∈ R n and the remaining n × (d − 1) sub-matrix by A, i.e., A = u A . In this notation, each measurement y i , i ∈ [n] depends only on the corresponding element u i of the vector u. In particular, the random variables z L i = T (y i ), i ∈ [n] are independent of the sub-matrix A. Furthermore, we may expressx L as follows: We are now ready to prove (11). First, we compute the correlation x L , x where we have that √ du i iid ∼ N (0, 1) and the almost sure convergence follows from the law of large numbers.
Second, we compute the norm of the estimator x L 2 : where we have used the following: −→ E Z 2 L and h 2 2 /n a.s.
−→ 1/δ, by the law of large numbers. Combining the above displays completes the proof of the lemma.

B Proof of Corollary 2
. It can be checked that We consider three cases.
This completes the proof of the result.

C Optimization of Preprocessing Functions
In order to state the results in this section, let us define the following functions for y ∈ R and G ∼ N(0, 1): μ_0(y) = E{p_{Y|G}(y | G)}, μ_1(y) = E{G p_{Y|G}(y | G)} and μ_2(y) = E{G² p_{Y|G}(y | G)}. Note that the functions μ_0, μ_1 and μ_2 only depend on the conditional distribution p_{Y|G}(· | G). Furthermore, let S denote the support of the probability measure of Y (i.e., the support of μ_0(y)).

C.1 Linear Estimator
In terms of the notation in (161), for a preprocessing function T_L(y), we can write E{Z_L G} = ∫_S T_L(y) μ_1(y) dy and E{Z_L²} = ∫_S T_L²(y) μ_0(y) dy. Thus, it follows from (11) in Lemma 1 that
|ρ_L| = |∫_S T_L(y) μ_1(y) dy| / √((∫_S T_L(y) μ_1(y) dy)² + (1/δ) ∫_S T_L²(y) μ_0(y) dy),    (162)
provided ∫_S T_L(y) μ_1(y) dy ≠ 0 and E{|G Z_L|} < ∞. Assume, henceforth, that 0 < ∫_S (μ_1(y)²/μ_0(y)) dy < ∞. Then, we will show in this section that the optimal preprocessing function for the linear estimator is T*_L(y) = μ_1(y)/μ_0(y), as in (164), and the achieved (optimal) normalized correlation is ρ*_L, given in (165). To see this, note from (162) that ρ_L² is maximized for the choice of T_L that minimizes the ratio ∫_S T_L²(y) μ_0(y) dy / (∫_S T_L(y) μ_1(y) dy)², while ensuring ∫_S T_L(y) μ_1(y) dy ≠ 0. Furthermore, by using the Cauchy-Schwarz inequality, we obtain
(∫_S T_L(y) μ_1(y) dy)² ≤ ∫_S T_L²(y) μ_0(y) dy · ∫_S (μ_1(y)²/μ_0(y)) dy.
Rearranging the above and substituting in the expression for |ρ_L| from (162) yields ρ_L² ≤ (ρ*_L)², with equality achieved if and only if T_L(y) = c · μ_1(y)/μ_0(y) for all y ∈ R and some constant c > 0. Clearly, the correlation performance of x̂^L is insensitive to scaling T_L by a constant. Thus, we can choose c = 1 to arrive at (164). To complete the proof of the claim, note that for the choice in (164) the conditions ∫_S T_L(y) μ_1(y) dy ≠ 0 and E{|G Z_L|} < ∞ are satisfied, thanks to (163). As a final note, observe that the optimal T*_L does not depend on the sampling ratio δ.
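As an illustration of (164), the snippet below computes T*_L(y) = μ_1(y)/μ_0(y), i.e., the posterior mean E[G | Y = y], on a grid of y values by Gauss-Hermite quadrature. The Gaussian-noise output model used in the example is the one from the introduction and only serves as a placeholder.

```python
import numpy as np

def optimal_T_L(p_y_given_g, y_grid, num_nodes=101):
    """T_L*(y) = mu_1(y) / mu_0(y) = E[G | Y = y], with mu_0(y) = E[p(y|G)] and
    mu_1(y) = E[G p(y|G)] for G ~ N(0,1), evaluated by Gauss-Hermite quadrature.
    p_y_given_g(y, g) must return the conditional density p_{Y|G}(y | g)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(num_nodes)
    weights = weights / weights.sum()            # turns the rule into an average over N(0,1)
    out = np.empty(len(y_grid))
    for i, y in enumerate(y_grid):
        p = p_y_given_g(y, nodes)
        mu0 = np.sum(weights * p)
        mu1 = np.sum(weights * nodes * p)
        out[i] = mu1 / mu0 if mu0 > 0 else 0.0
    return out

# Placeholder model: y = 0.3 g + g^2 + N(0, 0.2), so p(y|g) is a Gaussian density.
gauss = lambda r, var: np.exp(-r ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
p_model = lambda y, g: gauss(y - (0.3 * g + g ** 2), 0.2)
print(optimal_T_L(p_model, np.linspace(-1.0, 6.0, 8)))
```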

C.2 Spectral Estimator
The optimal preprocessing function for the spectral estimator is derived in [35]. For ease of reference, we present here their result in the special case where $\inf_y \frac{\mu_2(y)}{\mu_0(y)} > 0$. If this condition does not hold, the idea is to construct a sequence of approximations of the optimal preprocessing function (we refer the reader to [35] for the details).
Assume $\delta \ge \delta^*$, where $\delta^*$ is the threshold for weak recovery of the spectral estimator [37]. For a preprocessing function $T_{\rm s}(y)$, Lemma 2 gives a characterization of the normalized correlation, where $\lambda^*_\delta$ is the unique solution of the equation in (168) for $\lambda \ge \tau$ and also satisfies (169) (cf. $\psi_\delta(\lambda^*_\delta) \ge 0$). Using this characterization, [35] shows that the optimal preprocessing function for the spectral estimator is $T^*_{\rm s}$ in (170), and the achieved (optimal) normalized correlation is $\rho^*_{\rm s}$ in (171). As for the linear estimator, the optimal function $T^*_{\rm s}$ does not depend on the sampling ratio $\delta$.
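For completeness, the following sketch constructs a spectral estimate for a given preprocessing function by forming the matrix $\frac1n \sum_i T_{\rm s}(y_i)\,\boldsymbol a_i \boldsymbol a_i^{\sf T}$ and taking its principal eigenvector. The normalizations, the phase-retrieval output, and the choice $T_{\rm s}(y) = y^2 - 1$ are illustrative assumptions of this sketch; in particular, this $T_{\rm s}$ is not the optimal $T^*_{\rm s}$ of (170).

```python
# Sketch (assumed normalizations): spectral estimate as the principal eigenvector
# of D = (1/n) * sum_i T_s(y_i) a_i a_i^T, for an illustrative preprocessing T_s.
# Phase-retrieval-type model y_i = |<x, a_i>|; T_s(y) = y^2 - 1 is a simple,
# not necessarily optimal, choice.
import numpy as np

rng = np.random.default_rng(2)
d, delta = 1500, 4.0
n = int(delta * d)

x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)           # ||x||^2 = d
A = rng.standard_normal((n, d)) / np.sqrt(d)  # rows a_i ~ N(0, I_d / d)
y = np.abs(A @ x)

T_s = lambda y: y**2 - 1.0
D = (A.T * T_s(y)) @ A / n                    # (1/n) sum_i T_s(y_i) a_i a_i^T
eigvals, eigvecs = np.linalg.eigh(D)
x_spec = eigvecs[:, -1]                       # principal eigenvector

overlap = abs(x_spec @ x) / (np.linalg.norm(x_spec) * np.linalg.norm(x))
print(f"empirical overlap of the spectral estimate: {overlap:.3f}")
```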

C.3 Spectral Versus Linear
As mentioned in the introduction, there is no clear winner between the linear and the spectral estimator: the superiority of one method over the other depends on the measurement model and on the sampling ratio. Here, we fix the measurement model (i.e., the stochastic output function $p_{Y|G}(\,\cdot \mid \langle \boldsymbol x, \boldsymbol a_i\rangle)$) and present an analytic condition that determines which method is superior for any given $\delta > 0$, after optimizing both over their preprocessing functions.
Let $\gamma_\delta$ be defined as above. Then, the following holds, where $\rho^*_{\rm L}$ and $\rho^*_{\rm s}$, defined in (165) and (171), denote the optimal normalized correlations of the linear and the spectral estimator, respectively.
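The following purely empirical sketch compares the two estimators on the same simulated data at several sampling ratios; the ReLU model and the preprocessing choices are illustrative assumptions, and the analytic condition above is not used.

```python
# Empirical comparison sketch (illustrative model and preprocessing choices).
import numpy as np

rng = np.random.default_rng(3)

def overlaps(d, delta):
    n = int(delta * d)
    x = rng.choice([-1.0, 1.0], size=d)               # ||x||^2 = d
    A = rng.standard_normal((n, d)) / np.sqrt(d)
    y = np.maximum(A @ x, 0.0)                         # ReLU model
    x_lin = A.T @ y                                    # T_L(y) = y
    D = (A.T * (y**2 - np.mean(y**2))) @ A / n         # illustrative T_s
    x_spec = np.linalg.eigh(D)[1][:, -1]
    unit = lambda v: v / np.linalg.norm(v)
    return abs(unit(x_lin) @ x) / np.sqrt(d), abs(unit(x_spec) @ x) / np.sqrt(d)

for delta in [1.0, 3.0, 8.0]:
    rho_lin, rho_spec = overlaps(d=1200, delta=delta)
    print(f"delta = {delta}: linear {rho_lin:.3f}, spectral {rho_spec:.3f}")
```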

C.4 Combined Estimator
In the previous sections of Appendix C, we discussed how to optimally choose $T_{\rm L}$ and $T_{\rm s}$ to maximize the correlation of the linear and spectral estimators, respectively. This was possible thanks to the asymptotic characterizations in Lemmas 1 and 2. Theorem 2 opens up the possibility of optimally choosing $T_{\rm L}$ and $T_{\rm s}$ to maximize the correlation achieved by the Bayes-optimal combination $F_*(x_{\rm L}, x_{\rm s})$. Here, we focus on the special case in which the signal prior is Gaussian and, hence, $F_*(x_{\rm L}, x_{\rm s})$ is a linear combination of $x_{\rm L}$ and $x_{\rm s}$; see Corollary 2. In the rest of this section, we formalize the problem of (optimally) choosing the functions $T_{\rm L}$ and $T_{\rm s}$.
To begin, note from (43) that $q = \rho_{\rm L} \cdot \rho_{\rm s} \cdot s$, where $s$ is defined in (174). Furthermore, by using (174) in (45), we can express the achieved correlation $F(\theta^*)$ of $\hat{\boldsymbol x}^{\rm c}(\theta^*)$ in terms of the preprocessing functions, where we denote $\tilde T(y) := \frac{1}{\lambda^*_\delta}\, T(y)$ and all integrals are over $y$ (not written explicitly for brevity). Thus, the problem of choosing $T_{\rm L}$ and $T_{\rm s}$ can be reformulated as a constrained functional optimization problem. The second and the third constraints follow from (168) and (169), respectively; they further guarantee that $\rho_{\rm s} > 0$ (so that the division in (175) is allowed). Similarly, the last constraint on $T_{\rm L}$ ensures that $\rho_{\rm L}^2 > 0$. Though concrete, this formulation is a difficult functional optimization problem. Solving it goes beyond the scope of this paper, but it may be an interesting future direction. Another related question is whether or not the solution of this problem coincides with the individually optimal choices in (164) and (170).
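While the functional optimization over $(T_{\rm L}, T_{\rm s})$ is left open, the combination coefficient itself is easy to explore numerically. The following sketch performs a grid search over $\theta$ in the combination $\theta\,\hat{\boldsymbol x}^{\rm L} + \hat{\boldsymbol x}^{\rm s}$ on simulated data; normalizing both estimates to unit norm before combining, as well as the model and preprocessing choices, are assumptions of this sketch (the grid search is not the closed-form $\theta^*$ of Corollary 2).

```python
# Sketch: empirical grid search over the combination coefficient theta in
# theta * x_lin + x_spec, with both estimates normalized to unit norm
# (an assumption of this sketch), for an illustrative ReLU model.
import numpy as np

rng = np.random.default_rng(4)
d, delta = 1500, 4.0
n = int(delta * d)

x = rng.choice([-1.0, 1.0], size=d)
A = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.maximum(A @ x, 0.0)

unit = lambda v: v / np.linalg.norm(v)
x_lin = unit(A.T @ y)
D = (A.T * (y**2 - np.mean(y**2))) @ A / n
x_spec = unit(np.linalg.eigh(D)[1][:, -1])
x_spec *= np.sign(x_spec @ x_lin)            # resolve the eigenvector's sign ambiguity

overlap = lambda v: abs(v @ x) / (np.linalg.norm(v) * np.sqrt(d))
thetas = np.linspace(0.0, 3.0, 61)
best = max(thetas, key=lambda th: overlap(th * x_lin + x_spec))
print(f"linear {overlap(x_lin):.3f}, spectral {overlap(x_spec):.3f}, "
      f"combined (theta={best:.2f}) {overlap(best * x_lin + x_spec):.3f}")
```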

D Example: Bayes-Optimal Combination for Binary Prior
In this section, we explicitly evaluate the Bayes-optimal estimator $F_*(x_{\rm L}, x_{\rm s}) = \mathbb E\{X \mid X_{\rm L} = x_{\rm L}, X_{\rm s} = x_{\rm s}\}$ in (29) for the case where $X \in \{+1, -1\}$ with $P_X(1) = p$ and $P_X(-1) = 1 - p$.
Using this prior, we obtain the expression in (177), where the last line follows from Bayes' rule and simple algebra. Here, $p_{W_{\rm L}, W_{\rm s}}$ denotes the joint density of $(W_{\rm L}, W_{\rm s})$ as predicted by Theorem 1, given in (178), and $C$ is a constant that is irrelevant for our purposes, as it cancels in (177). Using (178) in (177) gives an explicit expression for $F_*(x_{\rm L}, x_{\rm s})$.
In Sect. 4.2, we numerically implement the optimal combined estimator for various measurement models. Specifically, we use the linear and spectral estimators $\hat{\boldsymbol x}^{\rm L}$ and $\hat{\boldsymbol x}^{\rm s}$ to form the combined estimator $\hat{\boldsymbol x}^{\rm mmse}$, where $F_*$ acts element-wise on the entries of its arguments, as specified in (177). The asymptotic correlation of the estimator $\hat{\boldsymbol x}^{\rm mmse}$ is given by Theorem 2 as $\rho_* = \frac{|\mathbb E\{X \cdot F_*(X_{\rm L}, X_{\rm s})\}|}{\sqrt{\mathbb E\{F_*^2(X_{\rm L}, X_{\rm s})\}}}$ (see (34)). Equipped with the explicit expression for $F_*(X_{\rm L}, X_{\rm s})$ in (177), we can compute $\rho_*$ by Monte Carlo averaging over realizations of the triplet $(X, X_{\rm L}, X_{\rm s})$.
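A minimal sketch of this Monte Carlo computation is given below. The channel parameters (the means and the covariance of $(W_{\rm L}, W_{\rm s})$) are hypothetical placeholders for the quantities predicted by Theorem 1; only the structure of $F_*$, obtained via Bayes' rule, follows (177).

```python
# Sketch of the Monte Carlo evaluation of rho_* for the binary prior X in {+1,-1}.
# The means m and the covariance of (W_L, W_s) below are hypothetical placeholders
# for the parameters predicted by Theorem 1.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
p = 0.5                                   # P(X = +1)
m = np.array([0.7, 0.9])                  # hypothetical means of (X_L, X_s) given X = +1
cov = np.array([[1.0, 0.3], [0.3, 1.0]])  # hypothetical covariance of (W_L, W_s)
noise = multivariate_normal(mean=[0.0, 0.0], cov=cov)

def F_star(v):
    # Bayes-rule expression: v has shape (N, 2) with columns (x_L, x_s).
    num = p * noise.pdf(v - m) - (1 - p) * noise.pdf(v + m)
    den = p * noise.pdf(v - m) + (1 - p) * noise.pdf(v + m)
    return num / den

N = 500_000
X = np.where(rng.random(N) < p, 1.0, -1.0)
V = X[:, None] * m + noise.rvs(size=N, random_state=rng)   # realizations of (X_L, X_s)
F = F_star(V)
rho_star = abs(np.mean(X * F)) / np.sqrt(np.mean(F**2))
print(f"Monte Carlo estimate of rho_*: {rho_star:.4f}")
```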

E Proof of Lemma 4
Take $t \to \infty$ in (73), and let $\bar\mu_V = \lim_{t\to\infty} \mu_{V,t}$, $\bar\sigma^2_V = \lim_{t\to\infty} \sigma^2_{V,t}$, and $\bar\beta^2 = \bar\sigma^2_V + \bar\mu^2_V$. Then, by solving these equations, we obtain two solutions for the pair $(\bar\mu^2_V, \bar\sigma^2_V)$: one solution gives the fixed point $FP_0$, and the other solution gives $FP_1$ and $FP_2$. Note that the fixed points $FP_1$ and $FP_2$ exist only when $\bar\beta^2 > \mathbb E\{Z^2\}$. From (77), this is equivalent to $\delta > \mathbb E\{Z^2\} / \big(\mathbb E\{Z G^2\} - \mathbb E\{Z\}\big)^2$, which is the condition in the statement of the lemma.
Let us define $\gamma^2_t$ as follows. Using this definition, the two equations in (73) can be combined to obtain a recursion in $\gamma^2_t$. Note that $\mathbb E\{Z^2\} > 0$: indeed, if $\mathbb E\{Z^2\} = 0$, then $\mathbb P(Z = 0) = 1$ and the condition $\mathbb E\{Z(G^2 - 1)\} > 0$ cannot hold. Thus, the two fixed points of this recursion are $\gamma^2_{FP_0} = 0$ and $\gamma^2_{FP_{12}}$, given in (180). As discussed above, the fixed point $\gamma^2_{FP_{12}}$ exists when $\delta > \mathbb E\{Z^2\} / \big(\mathbb E\{Z G^2\} - \mathbb E\{Z\}\big)^2$. The recursion can be written as $\gamma^2_{t+1} = f(\gamma^2_t)$ (see (179)), and the derivative of $f$ is given in (181). We now argue that, whenever $\gamma^2_1$ is strictly positive, the recursion $\gamma^2_{t+1} = f(\gamma^2_t)$ converges to $\gamma^2_{FP_{12}}$. We separately consider the two cases $\gamma^2_1 < \gamma^2_{FP_{12}}$ and $\gamma^2_1 > \gamma^2_{FP_{12}}$.

We first consider the case $\gamma^2_1 < \gamma^2_{FP_{12}}$. Since $f'(x) > 0$, the function $f(x)$ is strictly increasing for $x \ge 0$. Therefore, if $\gamma^2_t < \gamma^2_{FP_{12}}$ for some $t \ge 0$, then $\gamma^2_{t+1} = f(\gamma^2_t) < f(\gamma^2_{FP_{12}}) = \gamma^2_{FP_{12}}$. Next, we argue that $f(x) > x$ for $0 < x < \gamma^2_{FP_{12}}$. To show this claim, we first note that $f'(0) > 1$, since $\delta > \mathbb E\{Z^2\} / \big(\mathbb E\{Z G^2\} - \mathbb E\{Z\}\big)^2$. Thus, $f(x) > x$ for $x$ sufficiently close to $0$. If $f(\gamma) \le \gamma$ for some $0 < \gamma < \gamma^2_{FP_{12}}$, then, as $f$ is continuous, there exists a fixed point between $0$ and $\gamma$. However, this is not possible, since $\gamma < \gamma^2_{FP_{12}}$ and $\gamma^2_{FP_{12}}$ is the unique strictly positive fixed point. As a result, $f(x) > x$ for $0 < x < \gamma^2_{FP_{12}}$. Hence, if $\gamma^2_1 < \gamma^2_{FP_{12}}$, then the sequence $(\gamma^2_t)_{t \ge 1}$ is strictly increasing and bounded above by $\gamma^2_{FP_{12}}$. Furthermore, by the uniqueness of the positive fixed point, its supremum, and hence its limit, is $\gamma^2_{FP_{12}}$. Therefore, the sequence $(\gamma^2_t)_{t \ge 1}$ converges to $\gamma^2_{FP_{12}}$.

Next, consider the case $\gamma^2_1 > \gamma^2_{FP_{12}}$. From (181), we see that $f'(x)$ is strictly decreasing for $x > 0$; hence $f'(x) < 1$ for $x \ge \gamma^2_{FP_{12}}$. Therefore, by the Banach fixed point theorem, the iteration (179) converges to $\gamma^2_{FP_{12}}$ whenever $\gamma^2_1 \ge \gamma^2_{FP_{12}}$.

Finally, we observe from (73) that, for any initialization such that $\gamma^2_1 > 0$, the limits of $\mu^2_{V,t}$ and $\sigma^2_{V,t}$ follow from the limit of $\gamma^2_t$, where we have used the expression for $\gamma^2_{FP_{12}}$ from (180). Note that both fixed points $FP_1$ and $FP_2$ correspond to the same $(\bar\mu^2_V, \bar\sigma^2_V)$. The assumption $\mathbb E\{Z(G^2 - 1)\} > 0$ ensures that the sign of $\mu_{V,t+1}$ in (73) remains unchanged, and hence the iteration converges to either $FP_1$ or $FP_2$, depending on the sign of $\mu_{V,1}$.
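The convergence argument can be visualized with a stand-in map: the following toy sketch uses a hypothetical $f$ (not the map in (179)) that satisfies the properties used above ($f(0) = 0$, $f$ increasing, $f'$ strictly decreasing, $f'(0) > 1$), and shows convergence to the unique positive fixed point from initializations on either side.

```python
# Toy illustration of the convergence argument with a stand-in map f (NOT the map
# from (179)): f(0) = 0, f increasing, f' strictly decreasing, f'(0) > 1, so the
# iteration converges to the unique positive fixed point from below and from above.
c = 3.0                        # hypothetical parameter playing the role of f'(0) > 1
f = lambda x: c * x / (1.0 + x)
fixed_point = c - 1.0          # unique positive solution of f(x) = x

for gamma2 in [0.05, 10.0]:    # one start below, one above the fixed point
    start = gamma2
    for _ in range(30):
        gamma2 = f(gamma2)
    print(f"start {start:5.2f} -> after 30 steps {gamma2:.6f} (fixed point {fixed_point})")
```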

F Proof of Lemma 5
We first consider a fixed $t \ge 2$ and write the quantity of interest as a sum of three terms. Applying Proposition 1 with the PL(2) functions $\psi(a) = a^2$ (for the first two terms) and $\psi(a, b) = ab$ (for the last term), we obtain (186). Similarly, for any $t \ge 1$, we have (187). Since $|\mathbb E\{T_{\rm L}(Y)\, G\}| > 0$, the initialization $\mu_{V,1}$ of the state evolution recursion in (75) is strictly nonzero. Therefore, Lemma 4 guarantees that the state evolution recursion (72)-(73) converges to either the fixed point $FP_1$ or $FP_2$, depending on the sign of $\mu_{V,1}$. Without loss of generality, assume that $\mu_{V,1} > 0$, so that the recursion converges to $FP_1$; the corresponding limits of the state evolution parameters are given in (188). (The argument for $\mu_{V,1} < 0$ is identical.) Hence, the desired result immediately follows from (186), (187) and Remark 8, if we show that $\mathbb E\{W_{V,t+1} W_{V,t}\} \to 1$ and $\mathbb E\{W_{U,t-1} W_{U,t}\} \to 1$ as $t \to \infty$.
Next, taking $r = t - 1$ in (59) and using the formula for $f_t(\cdot)$ from (68), we obtain (190). Here, the last equality follows by noting from (72) that $\sqrt{\delta}\,\mu_{U,t} = \mu_{V,t}\,\beta_t$, hence the coefficients on $X$ cancel. Combining (189) and (190), we obtain the recursion in (191). For brevity, we write this iteration as $w_{t+1} = a_t + b_t w_t$, with $a_t$ and $b_t$ defined in (192). The iteration is initialized with $w_1 = \mathbb E\{W_{V,1} W_{V,0}\} = 0$. Note that, as $t \to \infty$, the sequences $a_t$ and $b_t$ converge to well-defined limits determined by (188). By using the sub-additivity of $\limsup$, rearranging, and using the limits from (188), we obtain the upper bound in (194), where the last equality follows from (76). By using the super-additivity of $\liminf$, we obtain the corresponding lower bound in (196). By combining (194) and (196), we conclude that $\lim_{t\to\infty} \mathbb E\{W_{V,t+1} W_{V,t}\} = 1$.
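The limiting argument for the affine recursion can also be checked numerically: for $w_{t+1} = a_t + b_t w_t$ with $a_t \to a$, $b_t \to b$ and $|b| < 1$, the iterates converge to $a/(1-b)$. The coefficient sequences below are hypothetical stand-ins for (192), chosen so that the limit equals $1$, as in the lemma.

```python
# Toy check: affine recursion w_{t+1} = a_t + b_t * w_t with a_t -> a, b_t -> b,
# |b| < 1 converges to a / (1 - b). The sequences are hypothetical stand-ins for (192).
a_lim, b_lim = 0.6, 0.4        # hypothetical limits with a / (1 - b) = 1
w = 0.0                        # w_1 = E{W_{V,1} W_{V,0}} = 0
for t in range(1, 201):
    a_t = a_lim + 0.5 / t      # converging coefficient sequences
    b_t = b_lim - 0.3 / t
    w = a_t + b_t * w
print(f"w after 200 iterations: {w:.6f}  (predicted limit {a_lim / (1 - b_lim):.4f})")
```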