Abstract
Recovery of an N-dimensional, K-sparse solution \({\mathbf {x}}\) from an M-dimensional vector of measurements \({\mathbf {y}}\) for multivariate linear regression can be accomplished by minimizing a suitably penalized least-mean-square cost \(||{\mathbf {y}}-{\mathbf {H}} {\mathbf {x}}||_2^2+\lambda V({\mathbf {x}})\). Here \({\mathbf {H}}\) is a known matrix and \(V({\mathbf {x}})\) is an algorithm-dependent sparsity-inducing penalty. For ‘random’ \({\mathbf {H}}\), in the limit \(\lambda \rightarrow 0\) and \(M,N,K\rightarrow \infty \), keeping \(\rho =K/N\) and \(\alpha =M/N\) fixed, exact recovery is possible for \(\alpha \) past a critical value \(\alpha _c = \alpha (\rho )\). Assuming \({\mathbf {x}}\) has iid entries, the critical curve exhibits some universality, in that its shape does not depend on the distribution of \({\mathbf {x}}\). However, the algorithmic phase transition occurring at \(\alpha =\alpha _c\) and the associated universality classes remain ill-understood from a statistical physics perspective, i.e. in terms of scaling exponents near the critical curve. In this article, we analyze the mean-field equations for two algorithms, Basis Pursuit (\(V({\mathbf {x}})=||{\mathbf {x}}||_{1} \)) and Elastic Net (\(V({\mathbf {x}})= ||{\mathbf {x}}||_{1} + \tfrac{g}{2} ||{\mathbf {x}}||_{2}^2\)), and show that they belong to different universality classes in the sense of scaling exponents, with the mean squared error (MSE) of the recovered vector scaling as \(\lambda ^\frac{4}{3}\) and \(\lambda \), respectively, for small \(\lambda \) on the critical line. In the presence of additive noise, we find that, when \(\alpha >\alpha _c\), the MSE is minimized at a non-zero value of \(\lambda \), whereas at \(\alpha =\alpha _c\), the MSE always increases with \(\lambda \).
Notes
Under this circumstance, if \(\xi \) remains of order one, then the error is dominated by \(\xi \), i.e. \(q\,(=\mathrm {MSE})=\sigma _\xi ^2\). However, this is not consistent with \(\sigma _\xi ^2=q/\alpha \), unless \(\sigma _\xi ^2=0\). Hence, in this regime, we need to consider a \(\sigma _\xi ^2\) that is comparable to \(\theta \). Therefore, as \(\vartheta \rightarrow 0\), we will have \(\sigma _\xi ^2\rightarrow 0\) and \(q\rightarrow 0\), making the reconstruction perfect; the relevant limit is \(\vartheta ,\theta ,\sigma _\xi ^2\rightarrow 0\) with \(\tfrac{\theta }{\sigma _\xi }\) of order one.
Note that \({{\hat{\rho }}}>\rho \), even in the perfect reconstruction phase. That is because a fraction of \(x_a\)’s remain non-zero as long as \(\vartheta >0\), and vanish only in the \(\vartheta \rightarrow 0\) limit.
Note that, when \(|\tau _0|=\tfrac{|x_0|}{\sigma _\xi }\rightarrow \infty \), \(\varPhi (\tau +\tau _0)+\varPhi (\tau -\tau _0)\rightarrow 1\) and \((\tau -\tau _0)\phi (\tau +\tau _0),\ (\tau +\tau _0)\phi (\tau -\tau _0)\rightarrow 0\). The \(\tau _0\)-dependent expression inside \([\ldots ]^{\mathrm {av}}_{x_0}\) in Eq. (34) goes from \(2\big \{(1+\tau ^2)\varPhi (\tau ) - \tau \phi (\tau )\big \}\) to \(1+\tau ^2\) as \(\tau _0\) goes from zero to infinity. We wrote this expression as \(1+\tau ^2-\psi _\xi (\tau _0,\tau )\).
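The limits quoted in this note can be checked numerically. Below is a minimal sketch using only the standard library, with \(\varPhi \) and \(\phi \) built from `math.erf` (the function names `Phi` and `phi` are our own, not from the paper):

```python
# Numerical check of the tau_0 -> infinity limits: Phi(tau+tau0)+Phi(tau-tau0) -> 1
# and (tau - tau0)*phi(tau + tau0) -> 0, for the standard normal CDF/PDF.
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

tau = 0.7
for tau0 in (5.0, 10.0, 20.0):
    s = Phi(tau + tau0) + Phi(tau - tau0)   # approaches 1
    t = (tau - tau0) * phi(tau + tau0)      # approaches 0
    print(tau0, s, t)
```

For \(\tau _0=20\) both quantities are already at their limiting values to double precision.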
References
Amelunxen, D., Lotz, M., McCoy, M.B., Tropp, J.A.: Living on the edge: phase transitions in convex programs with random data. Inf. Inference 3(3), 224–294 (2014)
Andersen, M., Dahl, J., Vandenberghe, L.: CVXOPT: a Python package for convex optimization (2010)
Bayati, M., Montanari, A.: The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inf. Theory 57(2), 764–785 (2011)
Bayati, M., Montanari, A.: The LASSO risk for Gaussian matrices. IEEE Trans. Inf. Theory 58(4), 1997–2017 (2012)
Candès, E., Romberg, J.: Sparsity and incoherence in compressive sampling. Inverse Probl. 23(3), 969 (2007)
Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)
Donoho, D.L.: For most large underdetermined systems of linear equations the minimal \(\ell _1\)-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 59(6), 797–829 (2006)
Donoho, D.L., Tanner, J.: Sparse nonnegative solution of underdetermined linear equations by linear programming. Proc. Natl. Acad. Sci. USA 102(27), 9446–9451 (2005)
Donoho, D., Tanner, J.: Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philos. Trans. R. Soc. A 367(1906), 4273–4293 (2009). https://doi.org/10.1098/rsta.2009.0152
Donoho, D.L., Maleki, A., Montanari, A.: Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 106(45), 18914–18919 (2009)
El Karoui, N.: On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 170(1–2), 95–175 (2018)
Ganguli, S., Sompolinsky, H.: Statistical mechanics of compressed sensing. Phys. Rev. Lett. 104(18), 188701 (2010)
Kabashima, Y., Wadayama, T., Tanaka, T.: A typical reconstruction limit for compressed sensing based on \(\ell _p\)-norm minimization. J. Stat. Mech. 2009(09), L09003 (2009)
El Karoui, N.: Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445 (2013)
Kubo, R.: The fluctuation-dissipation theorem. Rep. Prog. Phys. 29(1), 255 (1966)
Ma, S.K.: Modern Theory of Critical Phenomena. Routledge, London (2018)
Mézard, M., Parisi, G., Virasoro, M.: SK model: the replica solution without replicas. Europhys. Lett. 1(2), 77–82 (1986)
Mézard, M., Parisi, G., Virasoro, M.A.: Spin Glass Theory and Beyond, vol. 9. World Scientific, Singapore (1987)
Ramezanali, M., Mitra, P.P., Sengupta, A.M.: The cavity method for analysis of large-scale penalized regression. arXiv preprint arXiv:1501.03194 (2015)
Rudelson, M., Vershynin, R.: On sparse reconstruction from Fourier and Gaussian measurements. Commun. Pure Appl. Math. 61(8), 1025–1045 (2008)
Stojnic, M.: Various thresholds for \(\ell _1 \)-optimization in compressed sensing. arXiv preprint arXiv:0907.3666 (2009)
Stojnic, M.: \(\ell _2/\ell _1\)-optimization in block-sparse compressed sensing and its strong thresholds. IEEE J. Sel. Top. Signal Process. 4(2), 350–357 (2010)
Stojnic, M.: A rigorous geometry-probability equivalence in characterization of \(\ell _1 \)-optimization. arXiv preprint arXiv:1303.7287 (2013)
Tibshirani, R., Wainwright, M., Hastie, T.: Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, New York (2015)
Tikhonov, A.N.: On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39, 195–198 (1943)
Vershik, A.M., Sporyshev, P.: Asymptotic behavior of the number of faces of random polyhedra and the neighborliness problem. Selecta Math. Soviet 11(2), 181–201 (1992)
Xu, Y., Kabashima, Y.: Statistical mechanics approach to 1-bit compressed sensing. J. Stat. Mech. 2013(2), P02041 (2013)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)
Acknowledgements
This work was supported by the National Science Foundation INSPIRE (track 1) Award No. 1344069. Part of this paper was written while two of the authors (AMS and MR) were visiting Center for Computational Biology at Flatiron Institute. We are grateful for their hospitality.
Appendices
Appendix A: Equivalent Single Variable Optimization Problem
Here, we provide a sketch of our cavity argument; a detailed derivation can be found in our preprint [20]. From an algorithmic point of view, the cavity method is related to message-passing algorithms, but we will assume that the algorithm converges to a state. We wish to find an approximate statistical description of that state. After discussing the cavity method, we also briefly mention how to connect these results to the replica calculations found in [13, 14].
We will consider the case where the function V is twice differentiable. To construct potentials like the \(\ell _1\) norm, we use twice-differentiable surrogates such as \(r\ln (2\cosh (x/r))\), which tends to \(|x|\) as r goes to zero. We can study the solution for \(r>0\) and then take the appropriate limit.
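The convergence of this surrogate to \(|x|\) can be illustrated in a few lines. A minimal sketch (our own helper name `smooth_abs`; `logaddexp` is used because \(\cosh (x/r)\) overflows for small r):

```python
# Smooth surrogate r*ln(2*cosh(x/r)) for |x|, evaluated stably via
# 2*cosh(t) = e^t + e^{-t}, i.e. ln(2*cosh(t)) = logaddexp(t, -t).
import numpy as np

def smooth_abs(x, r):
    t = np.asarray(x, dtype=float) / r
    return r * np.logaddexp(t, -t)

x = np.linspace(-2.0, 2.0, 9)
for r in (1.0, 0.1, 0.01):
    err = np.max(np.abs(smooth_abs(x, r) - np.abs(x)))
    print(r, err)  # worst case is at x = 0, where the surrogate equals r*ln(2)
```

The maximal deviation from \(|x|\) shrinks linearly with r, consistent with taking the \(r\rightarrow 0\) limit after solving at \(r>0\).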
Minimization of the original penalized regression objective function is mathematically equivalent to the minimization of \({\mathcal {E}}({\mathbf {u}})\) over \({\mathbf {u}}\) (even if, in practice, we do not know the explicit form of \({\mathcal {E}}({\mathbf {u}})\)).
Note that the \(z_i\) variables and the \(u_a\) variables only interact via the random measurement matrix \({\mathbf {H}}\). From this point on, our arguments are similar to those of Xu and Kabashima [28]. We try to find single-variable functions \({\mathcal {E}}_a(u_a)\) and \({\mathcal {E}}_i(z_i)\) whose optimization mimics the full optimization problem.
We consider the problem with an a-cavity, meaning a problem where the variable \(x_a=u_a+x_{0a}\) has been set to zero. We also consider a problem with an i-cavity, namely one where \(z_i\) has been set to zero. The single-variable functions are constructed by introducing the inactive/missing variable into the corresponding cavity. The discussion becomes simpler if we assume that the optimization in the systems with cavities is already well-approximated by optimizing a sum of single-variable functions of the following forms (as is done in Xu and Kabashima [28]):
and
For simplicity, we will call these single-variable functions potentials.
We will set up slightly more involved notation for these parameters in a cavity system. With a missing, the potentials for \(z_i\) are represented by
Similarly, with i missing, we have
We argue that the corresponding parameters with or without cavity are nearly the same.
1.1 Step 1: Introducing \(x_a\) into the a-cavity
From here, using the parametrization of \({\mathcal {E}}_{i\rightarrow a}(z_i)\) according to Eq. 67, we find that the optimal \(z_j=\tfrac{H_{ja}u_a-K_{j\rightarrow a}}{K_{j\rightarrow a}}\), leading to
and
Since \(H_{ia}\sim \tfrac{1}{\sqrt{N}}\), \(A_a\approx A_{a\rightarrow i}\) and \(F_a\approx F_{a\rightarrow i}\).
1.2 Step 2: Introducing \(z_i\) into the i-cavity
Now, we use the parametrization of \({\mathcal {E}}_{a\rightarrow i}(u_a)\) according to Eq. 68, and find that the optimal \(u_b\) satisfies
Since \(z_iH_{ib}\sim \tfrac{1}{\sqrt{N}}\), we can expand this equation around \((z_i,u_b)=(0,{{\bar{u}}}_b)\) and find that
We could derive equations for \(B_{i\rightarrow a}\) and \(K_{i\rightarrow a}\) as before, but we know that we can ignore the difference between these parameters and \(B_i\) and \(K_i\), respectively. Also, the optimal \(u_a\approx {{\bar{u}}}_a\).
1.3 Step 3: Putting it all together
We now only deal with \(A_a,B_i\) etc. From Eqs. 71 and 77
For large M, N the expressions are self-averaging. One can essentially replace \(H_{ia}^2\) by \(\tfrac{1}{M}=\tfrac{1}{\alpha N}\) and see that \(A_a=A,B_i=B\); in other words, these quantities are essentially index-independent.
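The self-averaging step above can be illustrated numerically. A minimal sketch (our own toy setup, assuming iid Gaussian entries of variance \(1/M\)): the column sums \(\sum _i H_{ia}^2\) concentrate at 1, which is what licenses replacing each \(H_{ia}^2\) by its mean \(1/M\).

```python
# Self-averaging illustration: for H with iid N(0, 1/M) entries,
# sum_i H_ia^2 concentrates at its mean 1, with fluctuations ~ sqrt(2/M).
import numpy as np

rng = np.random.default_rng(0)
alpha, N = 0.5, 4000
M = int(alpha * N)
H = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

col_norms = np.sum(H**2, axis=0)          # one value per column index a
print(col_norms.mean(), col_norms.std())  # mean ~ 1, std ~ sqrt(2/M) ~ 0.03
```

As M, N grow, the spread shrinks like \(1/\sqrt{M}\), making the index dependence of \(A_a\) and \(B_i\) negligible.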
If we identify \(B=\sigma _\mathrm {eff}^2=\tfrac{1}{A}\), we see that we recover
according to the definition in Proposition 1.
To get our final result, we need \(F_a\). Using Eqs. 72 and 78 and the various approximations
We thus identify the term inside the square bracket as \(\xi _a\). Its distribution over different choices of \({\mathbf {H}}\) and \(\varvec{\zeta }\) is approximately normal, with mean zero and variance \(\sigma _\zeta ^2+\tfrac{\alpha }{N}\sum _{b\ne a}u_b^2\approx \sigma _\zeta ^2+\alpha q\). Also, \(\xi _a\) and \(\xi _b\) are nearly uncorrelated for \(a\ne b\).
At the end we get
which is the same expression as in Proposition 1.
The cavity mean-field equations arose in the context of spin systems in solid state physics [18, 19]. These equations take into account feedback dependencies by estimating the reaction of all the other ‘spins’/variables when a single spin is removed from the system, thereby leaving a ‘cavity’. This leads to a considerable simplification by utilizing the fact that the system of variables is fully connected. The local susceptibility matrix \(\varvec{\chi }\), a common quantity in physics, measures how stable the solution is to perturbations. This quantity plays a key role in such systems [20]. In particular, in the asymptotic limit of large M and N, certain quantities (e.g. MSE and average local susceptibility, \({\overline{\chi }}({\mathbf {x}})\)) converge, i.e. become independent of the detailed realization of the matrix \({\mathbf {H}}\). In this limit, a sudden increase in susceptibility signals the error-prone phase.
These results are equivalent to those obtained by [13, 14] with the replica approach. Those approaches begin with a finite-temperature statistical mechanics model. To make a connection with these studies, one should replace \({\overline{\chi }}\) by the quantity \(\beta \varDelta Q\), where \(\beta \) plays the role of inverse temperature. The quantity \(\varDelta Q\) can be defined as
where \(\langle \cdots \rangle \) is the average over ‘thermal’ fluctuations of \({\mathbf {u}}\) within the ensemble \(P_\beta \):
The quantity \(\varDelta Q\) is nothing but the ‘thermal’ fluctuation in \({\mathbf {u}}\), and \(\beta \varDelta Q\) can in fact be identified as a local susceptibility via the fluctuation-dissipation theorem [16]. Our results are obtained in the limit \(\beta \rightarrow \infty \), where the probability distribution becomes peaked near the minimum, turning the problem into an optimization problem. The local susceptibility, however, remains well-defined in this limit.
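Why \(\beta \varDelta Q\) stays finite as \(\beta \rightarrow \infty \) can be seen in a one-variable sketch, assuming the quadratic effective potential of Appendix A (the symbols A and F are those of the cavity parametrization; this is an illustration, not the paper's full calculation):

```latex
% For a single variable with effective potential
%   \mathcal{E}(u) = \tfrac{A}{2}\,u^2 - F u,
% the ensemble P_\beta(u) \propto e^{-\beta \mathcal{E}(u)} is Gaussian, so
\langle u \rangle = \frac{F}{A},
\qquad
\varDelta Q = \langle u^2 \rangle - \langle u \rangle^2 = \frac{1}{\beta A},
\qquad\Longrightarrow\qquad
\beta\,\varDelta Q = \frac{1}{A},
% which is independent of \beta, hence finite as \beta \to \infty,
% consistent with the identification B = \sigma_{\mathrm{eff}}^2 = 1/A.
```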
Appendix B: Ridge Regression via Singular Value Decomposition
For the sake of completeness, in this appendix we derive Eqs. (17) and (18) in Sect. 4.1 using a singular value decomposition. An elementary derivation leads to the explicit expression:
where we use the singular-vector basis of the matrix \({\mathbf {H}}\), with \(s_i\) being the non-zero singular values and \({\mathcal {V}}_i\) the corresponding right singular vectors. When we take the limit of vanishing \(\sigma ^2\), we are left with the projection of the N-dimensional vector \({\mathbf {x}}_0\) onto the M-dimensional subspace spanned by the \({\mathcal {V}}_i\)'s. In other words,
\({\mathbf {P}}\) being the projection matrix. For random \({\mathbf {H}}\), the \({\mathcal {V}}_i\)'s are just a random choice of M orthonormal vectors. Thus, the properties of the estimate depend on the statistics of the projection onto a random M-dimensional subspace.
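The projection limit can be verified directly. A minimal sketch (our own toy instance, assuming Gaussian \({\mathbf {H}}\) and noiseless \({\mathbf {y}}={\mathbf {H}}{\mathbf {x}}_0\)): the ridge estimate with vanishing \(\sigma ^2\) coincides with \({\mathbf {P}}{\mathbf {x}}_0\), where \({\mathbf {P}}={\mathbf {V}}{\mathbf {V}}^T\) is built from the right singular vectors of \({\mathbf {H}}\).

```python
# Ridge estimate as sigma^2 -> 0 versus projection onto the row space of H.
import numpy as np

rng = np.random.default_rng(1)
M, N = 40, 100
H = rng.normal(size=(M, N)) / np.sqrt(M)
x0 = rng.normal(size=N)
y = H @ x0

# Ridge solution in its dual (M x M) form, which is better conditioned:
# (H^T H + s2 I)^{-1} H^T y  ==  H^T (H H^T + s2 I)^{-1} y.
sigma2 = 1e-10
x_ridge = H.T @ np.linalg.solve(H @ H.T + sigma2 * np.eye(M), y)

# Projection P x0 with P = V V^T from the thin SVD (rows of Vt are V_i^T).
_, _, Vt = np.linalg.svd(H, full_matrices=False)
x_proj = Vt.T @ (Vt @ x0)

print(np.max(np.abs(x_ridge - x_proj)))  # negligible for vanishing sigma^2
```

The dual form is used deliberately: \({\mathbf {H}}{\mathbf {H}}^T\) is full rank for \(M<N\), so the solve stays well-conditioned even as \(\sigma ^2\rightarrow 0\).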
For the variance, we need to consider second-order moments of the matrix elements of \({\mathbf {P}}\), in particular \([P_{ab}P_{ac}]^{\mathrm {av}}_{{\mathbf {H}}}\). We can parametrize \([P_{ab}P_{ac}]^{\mathrm {av}}_{{\mathbf {H}}}=A\delta _{bc}+B\delta _{ab}\delta _{bc}\). Since \({\mathbf {P}}\) is a projection operator, \({\mathbf {P}}^2={\mathbf {P}}\), and it is a symmetric matrix. Hence,
In the limit \(M,N\rightarrow \infty \) with \(\alpha \) fixed, the distribution of \(P_{aa}\) becomes highly concentrated around its mean \(\alpha \). As a result,
Using the two constraints, represented by Eqs. (91) and (92), we can determine A and B, in the large M, N limit, leading to,
The variance is now given by,
recovering our earlier result.
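The moment parametrization above can be checked by Monte Carlo. A minimal sketch (our own numerical check, assuming the large-N values \(A\approx \alpha (1-\alpha )/N\) and mean \(P_{aa}\approx \alpha \) implied by the two constraints): we build a random projection from the SVD of a Gaussian matrix and compare the empirical diagonal mean and mean off-diagonal \(P_{ab}^2\) against these predictions.

```python
# Monte-Carlo check of [P_ab P_ac] = A d_bc + B d_ab d_bc for a random
# projection P onto an M-dimensional subspace: mean P_aa ~ alpha and
# mean off-diagonal P_ab^2 ~ A = alpha*(1 - alpha)/N.
import numpy as np

rng = np.random.default_rng(2)
N, M = 400, 200
alpha = M / N

H = rng.normal(size=(M, N))
_, _, Vt = np.linalg.svd(H, full_matrices=False)
P = Vt.T @ Vt                       # projection onto a random M-dim subspace

diag = np.diag(P)
# Mean of P_ab^2 over a != b; note sum_ab P_ab^2 = tr(P^2) = tr(P) = M.
off2 = (np.sum(P**2) - np.sum(diag**2)) / (N * (N - 1))

print(diag.mean())                  # ~ alpha
print(off2 * N)                     # ~ N*A = alpha*(1 - alpha)
```

The identity \(\sum _{ab}P_{ab}^2=\mathrm {tr}({\mathbf {P}})=M\), together with the concentration of \(P_{aa}\) around \(\alpha \), is exactly what fixes A and B in the large-M, N limit.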
Ramezanali, M., Mitra, P.P. & Sengupta, A.M. Critical Behavior and Universality Classes for an Algorithmic Phase Transition in Sparse Reconstruction. J Stat Phys 175, 764–788 (2019). https://doi.org/10.1007/s10955-019-02292-6