
Matrix completion with nonconvex regularization: spectral operators and scalable algorithms

Abstract

In this paper, we study the popularly dubbed matrix completion problem, where the task is to “fill in” the unobserved entries of a matrix from a small subset of observed entries, under the assumption that the underlying matrix is of low rank. Our contributions herein enhance our prior work on nuclear norm regularized problems for matrix completion (Mazumder et al. in J Mach Learn Res 11:2287–2322, 2010) by incorporating a continuum of nonconvex penalty functions between the convex nuclear norm and nonconvex rank functions. Inspired by Soft-Impute (Mazumder et al. 2010; Hastie et al. in J Mach Learn Res, 2016), we propose NC-Impute, an EM-flavored algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm starts. We present a systematic study of the associated spectral thresholding operators, which play an important role in the overall algorithm, and we study the convergence properties of the algorithm. Using structured low-rank SVD computations, we demonstrate the computational scalability of our proposal for problems up to the Netflix size (approximately a 500,000 \(\times \) 20,000 matrix with \(10^8\) observed entries). We demonstrate that, on a wide range of synthetic and real data instances, our proposed nonconvex regularization framework leads to low-rank solutions with better predictive performance than those obtained from nuclear norm problems. Implementations of the algorithms proposed herein, written in the R language, are made available on GitHub.
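
At a high level, the framework described above alternates between imputing the unobserved entries with the current estimate and applying a spectral thresholding operator to the completed matrix. The following R sketch illustrates this basic impute-then-threshold step only; the mcplus_threshold() helper is an assumption (it uses the MC+ form of the scalar thresholding operator), and the sketch omits the warm starts, stabilization, and structured low-rank SVD computations used by the actual implementation.

```r
## Minimal sketch (not the released implementation) of one impute-then-threshold
## step of an NC-Impute-style iteration, assuming the MC+ thresholding operator.
mcplus_threshold <- function(sigma, lambda, gamma) {
  ## MC+ scalar thresholding (gamma > 1): zero below lambda, linear shrinkage on
  ## (lambda, lambda * gamma], identity above lambda * gamma.
  ifelse(sigma <= lambda, 0,
         ifelse(sigma <= lambda * gamma, (sigma - lambda) / (1 - 1 / gamma), sigma))
}

nc_impute_step <- function(X_k, Y, omega, lambda, gamma) {
  ## X_k: current estimate; Y: data matrix; omega: logical matrix of observed entries.
  Z <- X_k
  Z[omega] <- Y[omega]                           # P_Omega(Y) + P_Omega^perp(X_k)
  s <- svd(Z)
  d <- mcplus_threshold(s$d, lambda, gamma)      # threshold the singular values
  s$u %*% diag(d, nrow = length(d)) %*% t(s$v)   # spectral thresholding of Z
}
```

Iterating such a step to convergence, over a decreasing grid of \(\lambda \) values with warm starts, mirrors the strategy described above.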



Notes

  1. We say that a function is a spectral function of a matrix X if it depends only upon the singular values of X. The state of the art in algorithms for mixed integer semidefinite optimization is still in a nascent stage, and is not comparable to the technology available for mixed integer quadratic optimization.

  2. Since the problems under consideration are nonconvex, our methods are not guaranteed to reach the global minimum—we thus refer to the solutions obtained as upper bounds. In many synthetic examples, however, the solutions are indeed seen to be globally optimal. We do show rigorously, however, that these solutions are first-order stationary points for the optimization problems under consideration.

  3. Note that we consider \(\tau \ge 0\) in the definition so that it includes the case of (nonstrong) convexity.

  4. This follows from the simple observation that \(s_{a\lambda , \gamma }(ax)=a s_{\lambda , \gamma }(x)\) and \(s'_{a\lambda , \gamma }(ax)=s'_{\lambda , \gamma }(x)\); a quick numerical check of the first relation is sketched after these notes.

  5. Due to the boundedness of the penalty function, the boundedness of the objective function does not necessarily imply that the sequence \(\varvec{\sigma }(X_k)\) will remain bounded.

  6. We note that the \(X_{k}\)’s are not guaranteed to be of low rank across the iterations of the algorithm for \(k \ge 1\), even if they eventually are for sufficiently large k. However, in the presence of warm starts across \((\lambda ,\gamma )\), they are empirically found to have low rank, provided the regularization parameters are large enough to yield a solution of small rank. Typically, as observed in our experiments, with warm starts the rank of \(X_{k}\) remains low across all iterations.

  7. Available at http://grouplens.org/datasets/movielens/.

  8. Note that we do not assume that the sequence \(\varvec{\sigma }_{k}\) has a limit point.
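
As a quick numerical check of the scaling relation in Note 4, the following R snippet (a sketch, reusing the illustrative mcplus_threshold() helper defined after the abstract, which assumes the MC+ form of the thresholding operator) verifies \(s_{a\lambda , \gamma }(ax)=a s_{\lambda , \gamma }(x)\) on a grid of values:

```r
## Numerical check of s_{a*lambda, gamma}(a*x) == a * s_{lambda, gamma}(x).
a <- 2.5; lambda <- 1; gamma <- 3
x <- seq(0, 5, by = 0.01)
max(abs(mcplus_threshold(a * x, a * lambda, gamma) -
        a * mcplus_threshold(x, lambda, gamma)))    # ~ 0 up to rounding error
```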

References

  • Alquier, P.: A Bayesian approach for noisy matrix completion: optimal rate under general sampling distribution. Electron. J. Stat. 9(1), 823–841 (2015)


  • Bai, Z., Silverstein, J.W.: Spectral Analysis of Large Dimensional Random Matrices. Springer, Berlin (2010)


  • Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)


  • Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3873–3881 (2016)

  • Borwein, J., Lewis, A.: Convex Analysis and Nonlinear Optimization. Springer, New York (2006)


  • Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20, 1956–1982 (2010)


  • Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717–772 (2009)


  • Candès, E.J., Plan, Y.: Matrix completion with noise. Proc. IEEE 98, 925–936 (2010a)


  • Candès, E.J., Tao, T.: The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inf. Theory 56, 2053–2080 (2010b)


  • Candès, E.J., Wakin, M.B., Boyd, S.P.: Enhancing sparsity by reweighted \(\ell _1\) minimization. J. Fourier Anal. Appl. 14(5–6), 877–905 (2008)


  • Candès, E., Sing-Long, C., Trzasko, J.D.: Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Trans. Signal Process. 61(19), 4643–4657 (2013)


  • Chen, Y., Bhojanapalli, S., Sanghavi, S., Ward, R.: Coherent matrix completion. In: Proceedings of the 31st International Conference on Machine Learning, JMLR, pp. 674–682 (2014)

  • Chen, Y.: Incoherence-optimal matrix completion. IEEE Trans. Inf. Theory 61(5), 2909–2923 (2015)


  • Chen, Y., Wainwright, M.J.: Fast low-rank estimation by projected gradient descent: general statistical and algorithmic guarantees (2015). arXiv preprint arXiv:1509.03025

  • Chen, J., Liu, D., Li, X.: Nonconvex rectangular matrix completion via gradient descent without \(\ell _{2,\infty }\) regularization (2019a). arXiv preprint arXiv:1901.06116

  • Chen, Y., Chi, Y., Fan, J., Ma, C., Yan, Y.: Noisy matrix completion: understanding statistical guarantees for convex relaxation via nonconvex optimization (2019b). arXiv preprint arXiv:1902.07698

  • Chi, Y., Lu, Y.M., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Trans. Signal Process. 67(20), 5239–5269 (2019)


  • Chistov, A.L., Grigor’ev, D.Y.: Complexity of quantifier elimination in the theory of algebraically closed fields. In: Proceedings of the 11th International Symposium on Mathematical Foundations of Computer Science, pp. 17–31. Springer (1984)

  • Daubechies, I., DeVore, R., Fornasier, M., Güntürk, C.S.: Iteratively reweighted least squares minimization for sparse recovery. Commun. Pure Appl. Math. 63(1), 1–38 (2010)


  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39, 1–38 (1977)


  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression (with discussion). Ann. Stat. 32(2), 407–499 (2004)


  • Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)


  • Fazel, M.: Matrix rank minimization with applications. Ph.D. thesis, Stanford University (2002)

  • Feng, L., Zhang, C.H.: Sorted concave penalized regression (2017). arXiv preprint arXiv:1712.09941

  • Fornasier, M., Rauhut, H., Ward, R.: Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM J. Optim. 21(4), 1614–1640 (2011)


  • Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–135 (1993)


  • Freund, R.M., Grigas, P., Mazumder, R.: An extended Frank-Wolfe method with “In-Face” directions, and its application to low-rank matrix completion (2015). arXiv preprint arXiv:1511.02204

  • Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2973–2981 (2016)

  • Ge, R., Jin, C., Zheng, Y.: No spurious local minima in nonconvex low rank problems: a unified geometric analysis. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, JMLR. org, pp. 1233–1242 (2017)

  • Golub, G., Van Loan, C.: Matrix Computations. Johns Hopkins University Press, Baltimore (1983)


  • Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57(3), 1548–1566 (2011)


  • Gu, S., Xie, Q., Meng, D., Zuo, W., Feng, X., Zhang, L.: Weighted nuclear norm minimization and its applications to low level vision. Int. J. Comput. Vis. 121(2), 183–208 (2017)


  • Hardt, M.: Understanding alternating minimization for matrix completion. In: IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 651–660. IEEE (2014)

  • Hardt, M., Wootters, M.: Fast matrix completion without the condition number. In: Conference on Learning Theory, pp. 638–678 (2014)

  • Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2009)


  • Hastie, T., Mazumder, R., Lee, J.D., Zadeh, R.: Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 16(1), 3367–3402 (2016)


  • Hazimeh, H., Mazumder, R.: Fast best subset selection: coordinate descent and local combinatorial optimization algorithms. Oper. Res. (2019) (accepted)

  • Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)


  • Jaggi, M., Sulovský, M.: A simple algorithm for nuclear norm regularized problems. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 471–478 (2010)

  • Jain, P., Meka, R., Dhillon, I.S.: Guaranteed rank minimization via singular value projection. In: Advances in Neural Information Processing Systems, pp. 937–945 (2010)

  • Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimization. In: Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 665–674. ACM (2013)

  • Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from noisy entries. J. Mach. Learn. Res. 11, 2057–2078 (2010)


  • Klopp, O.: Noisy low-rank matrix completion with general sampling distribution. Bernoulli 20(1), 282–303 (2014)


  • Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011)


  • Larsen, R.: PROPACK: software for large and sparse SVD calculations (2004). http://sun.stanford.edu/~rmunk/PROPACK

  • Lecué, G., Mendelson, S.: Regularization and the small-ball method I: sparse recovery. Ann. Stat. 46(2), 611–641 (2018)


  • Lewis, A.S.: The convex analysis of unitarily invariant matrix functions. J. Convex Anal. 2, 173–183 (1995)


  • Loh, P.L., Wainwright, M.J.: Regularized m-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16, 559–616 (2015)


  • Lv, J., Fan, Y.: A unified approach to model selection and sparse recovery using regularized least squares. Ann. Stat. 37, 3498–3528 (2009)


  • Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution (2017). arXiv preprint arXiv:1711.10467

  • Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)


  • Mazumder, R., Friedman, J.H., Hastie, T.: Sparsenet: coordinate descent with nonconvex penalties. J. Am. Stat. Assoc. 106, 1125–1138 (2011)


  • Mazumder, R., Radchenko, P.: The discrete dantzig selector: estimating sparse linear models via mixed integer linear optimization (2015). arXiv preprint arXiv:1508.01922

  • Mazumder, R., Radchenko, P., Dedieu, A.: Subset selection with shrinkage: sparse linear modeling when the SNR is low (2017). arXiv preprint arXiv:1708.03288

  • Mohan, K., Fazel, M.: Reweighted nuclear norm minimization with application to system identification. In: Proceedings of the 2010 American Control Conference, pp. 2953–2959. IEEE (2010)

  • Mohan, K., Fazel, M.: Iterative reweighted algorithms for matrix rank minimization. J. Mach. Learn. Res. 13, 3441–3473 (2012)


  • Negahban, S.N., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39, 1069–1097 (2011)


  • Negahban, S.N., Wainwright, M.J.: Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res. 13, 1665–1697 (2012)


  • Nikolova, M.: Local strong homogeneity of a regularized estimator. SIAM J. Appl. Math. 61, 633–658 (2000)


  • Recht, B.: A simpler approach to matrix completion. J. Mach. Learn. Res. 12, 3413–3430 (2011)


  • Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52, 471–501 (2010)


  • Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)


  • Rohde, A., Tsybakov, A.B.: Estimation of high-dimensional low-rank matrices. Ann. Stat. 39, 887–930 (2011)


  • Shapiro, A., Xie, Y., Zhang, R.: Matrix completion with deterministic pattern: a geometric perspective. IEEE Trans. Signal Process. 67(4), 1088–1103 (2018)


  • SIGKDD, A., Netflix: Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In: Proceedings of KDD Cup and Workshop (2007)

  • Stein, C.M.: Estimation of the mean of a multivariate normal distribution. Ann. Stat. 9(6), 1135–1151 (1981)


  • Stewart, G.W., Sun, J.G.: Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press, Cambridge (1990)


  • Sun, R., Luo, Z.Q.: Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory 62(11), 6535–6579 (2016)


  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)


  • Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for dna microarrays. Bioinformatics 17(6), 520–525 (2001)


  • Wang, S., Weng, H., Maleki, A.: Which bridge estimator is the best for variable selection? Ann. Stat. (2019) (accepted)

  • Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010)


  • Zhang, C.H., Zhang, T.: A general theory of concave regularization for high-dimensional sparse estimation problems. Stat. Sci. 27(4), 576–593 (2012)


  • Zheng, Q., Lafferty, J.: Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent (2016). arXiv preprint arXiv:1605.07051

  • Zheng, L., Maleki, A., Weng, H., Wang, X., Long, T.: Does \(\ell _p\)-minimization outperform \(\ell _1\)-minimization? IEEE Trans. Inf. Theory 63(11), 6896–6935 (2017)


  • Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)


  • Zou, H., Li, R.: One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 36(4), 1509–1533 (2008)



Author information

Correspondence to Haolei Weng.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Additional technical material

Lemma 1

(Marchenko–Pastur law (Bai and Silverstein 2010)). Let \(X\in {\mathbb {R}}^{m \times n}\), where \(X_{ij}\) are iid with \({\mathbb {E}}(X_{ij})=0, {\mathbb {E}}(X_{ij}^2)=1\), and \(m>n\). Let \(\lambda _1\le \lambda _2 \le \dots \le \lambda _n\) be the eigenvalues of \(Q_m=\frac{1}{m}X'X\). Define the random spectral measure

$$\begin{aligned} \mu _n=\frac{1}{n}\sum _{i=1}^n\delta _{\lambda _i}\,. \end{aligned}$$

Then, assuming \(n/m \rightarrow \alpha \in (0,1]\), we have

$$\begin{aligned} \mu _n(\cdot , \omega ) \rightarrow \mu ~~a.s., \end{aligned}$$

where \(\mu \) is a deterministic measure with density

$$\begin{aligned} \frac{\mathrm{d}\mu }{\mathrm{d}x}=\frac{\sqrt{(\alpha _+-x)(x-\alpha _-)}}{2\pi \alpha x}I(\alpha _-\le x \le \alpha _+). \end{aligned}$$

Here, \(\alpha _+=(1+\sqrt{\alpha })^2\,\) and \(\, \alpha _-=(1-\sqrt{\alpha })^2\).
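
The statement of Lemma 1 is easy to check numerically. The R sketch below (using standard Gaussian entries, which satisfy the moment conditions, and an illustrative aspect ratio) compares the empirical spectral distribution of \(Q_m\) with the limiting density displayed above.

```r
## Numerical illustration of the Marchenko-Pastur law in Lemma 1.
set.seed(1)
m <- 4000; n <- 1000; alpha <- n / m
X <- matrix(rnorm(m * n), m, n)                    # iid entries with mean 0, variance 1
lam <- eigen(crossprod(X) / m, symmetric = TRUE,   # eigenvalues of Q_m = X'X / m
             only.values = TRUE)$values
a_minus <- (1 - sqrt(alpha))^2
a_plus  <- (1 + sqrt(alpha))^2
mp_density <- function(x)
  sqrt(pmax((a_plus - x) * (x - a_minus), 0)) / (2 * pi * alpha * x)
hist(lam, breaks = 60, probability = TRUE,
     main = "Empirical spectrum vs. Marchenko-Pastur density")
curve(mp_density(x), from = a_minus, to = a_plus, add = TRUE, lwd = 2)
```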

1.1.1 Proof of Proposition 5

Proof

In the following proof, we make use of the notation \(\varTheta _1(\cdot )\) and \(\varTheta _2(\cdot )\), defined as follows. For two positive sequences \(a_{k}\) and \(b_k\), we say \(a_k= \varTheta _2(b_k)\) if there exists a constant \(c>0\) such that \(a_k \ge c b_k\), and we say \(a_k = \varTheta _1(b_k)\) whenever \(a_k = \varTheta _2(b_k)\) and \(b_k=\varTheta _2(a_k)\).

We first consider the case \(\lambda _n=\varTheta _1(\sqrt{m})\). For simplicity, we assume \(\lambda _n=\zeta \sqrt{m}\) for some constant \(\zeta >0\). Denote \(df(S_{\lambda _n,\gamma }(Z))= D_{\lambda _n,\gamma }\), and use \({\mathcal {T}}_{t_1,t_2}\) to represent

$$\begin{aligned} \frac{\sqrt{mt_1} s_{\lambda _n,\gamma }(\sqrt{mt_1}) - \sqrt{mt_2} s_{\lambda _n,\gamma }(\sqrt{mt_2})}{mt_1-mt_2} \mathbb {1}(t_1\ne t_2). \end{aligned}$$

Adopting the notation from Lemma 1, it is not hard to verify that

$$\begin{aligned} D_{\lambda _n,\gamma }&= n {\mathbb {E}}_{\mu _n} \bigg \{ s'_{\lambda _n,\gamma }(\sqrt{mt_1}) + |m-n|\frac{s_{\lambda _n,\gamma }(\sqrt{mt_1})}{\sqrt{mt_1}} \bigg \} \\&\quad + n^2 {\mathbb {E}}_{\mu _n} ({\mathcal {T}}_{t_1,t_2})\,, \end{aligned}$$

where \(t_1, t_2 \overset{\text {iid}}{\sim } \mu _n\). A quick check of the relation between \(s_{\lambda _n,\gamma }\) and \(g_{\zeta ,\gamma }\) yields

$$\begin{aligned} \frac{D_{\lambda _n,\gamma }}{mn}= & {} \frac{1}{m}{\mathbb {E}}_{\mu _n}s'_{\lambda _n, \gamma }(\sqrt{mt_1})+\left( 1-\frac{n}{m}\right) {\mathbb {E}}_{\mu _n}g_{\zeta , \gamma }(t_1)\\&+\frac{n}{m} {\mathbb {E}}_{\mu _n} \left\{ \frac{t_1g_{\zeta ,\gamma }(t_1)-t_2g_{\zeta ,\gamma }(t_2)}{t_1-t_2} \mathbb {1}(t_1\ne t_2) \right\} \,. \end{aligned}$$

Due to the Lipschitz continuity of the functions \(s_{\lambda _n,\gamma }(x)\) and \(xg_{\zeta , \gamma }(x)\), we obtain

$$\begin{aligned} \Big | \frac{D_{\lambda _n,\gamma }}{mn} \Big | \le \frac{\gamma }{m(\gamma -1)}+\left( 1-\frac{n}{m}\right) + \frac{n}{m}\left( \frac{2\gamma -1}{2\gamma -2}\right) \,. \end{aligned}$$

Hence, there exists a positive constant \(C_{\alpha }\), such that for sufficiently large n,

$$\begin{aligned} \Big | \frac{D_{\lambda _n,\gamma }}{mn} \Big | \le C_{\alpha }, \quad \, a.s. \end{aligned}$$

Let \(T_1,T_2\) be two independent random variables generated from the Marchenko–Pastur distribution \(\mu \). If we can show

$$\begin{aligned}&\frac{D_{\lambda _n,\gamma }}{mn} \overset{a.s.}{\rightarrow } \\&\quad (1-\alpha ){\mathbb {E}}g_{\zeta ,\gamma }(T_1) + \alpha {\mathbb {E}}\left( \frac{T_1g_{\zeta ,\gamma }(T_1)-T_2g_{\zeta ,\gamma } (T_2)}{T_1-T_2}\right) , \end{aligned}$$

then by the dominated convergence theorem (DCT), we conclude the proof in the \(\lambda _n=\varTheta _1(\sqrt{m})\) regime. Note immediately that

$$\begin{aligned} \frac{1}{m} {\mathbb {E}}_{\mu _n}s'_{\lambda _n,\gamma }(\sqrt{mt_1}) \rightarrow 0 \quad a.s. \end{aligned}$$
(34)

Moreover, given that \(g_{\zeta ,\gamma }(\cdot )\) is bounded and continuous, the Marchenko–Pastur theorem in Lemma 1 implies

$$\begin{aligned} \left( 1-\frac{n}{m}\right) {\mathbb {E}}_{\mu _n} g_{\zeta ,\gamma }(t_1) \rightarrow (1-\alpha ) {\mathbb {E}}_{\mu } g_{\zeta ,\gamma }(T_1) \quad a.s. \end{aligned}$$
(35)

Since \((t_1, t_2) \overset{d}{\rightarrow } (T_1, T_2)\), and the discontinuity set of the function \(\frac{t_1g_{\zeta ,\gamma }(t_1)-t_2g_{\zeta ,\gamma }(t_2)}{t_1-t_2}\mathbb {1}(t_1\ne t_2)\) has zero probability under the measure induced by \((T_1,T_2)\), by the continuous mapping theorem,

$$\begin{aligned}&\frac{t_1g_{\zeta ,\gamma }(t_1)-t_2g_{\zeta ,\gamma }(t_2)}{t_1-t_2}\mathbb {1}(t_1\ne t_2) \overset{d}{\rightarrow } \nonumber \\&\quad \frac{T_1g_{\zeta ,\gamma }(T_1)-T_2g_{\zeta ,\gamma }(T_2)}{T_1-T_2}\mathbb {1}(T_1 \ne T_2) \quad \text {as } \, n \rightarrow \infty \,. \end{aligned}$$

Also, due to the boundedness of \(\frac{t_1g_{\zeta ,\gamma }(t_1)-t_2g_{\zeta ,\gamma }(t_2)}{t_1-t_2}\mathbb {1}(t_1\ne t_2)\), it holds that

$$\begin{aligned}&{\mathbb {E}}_{\mu _n} \left\{ \frac{t_1g_{\zeta ,\gamma }(t_1)-t_2g_{\zeta ,\gamma }(t_2)}{t_1-t_2}\mathbb {1}(t_1\ne t_2) \right\} \overset{a.s.}{\rightarrow } \nonumber \\&\quad {\mathbb {E}}_{\mu } \left\{ \frac{T_1g_{\zeta ,\gamma }(T_1)-T_2g_{\zeta ,\gamma }(T_2)}{T_1-T_2}\mathbb {1}(T_1 \ne T_2)\right\} . \end{aligned}$$
(36)

Combining (34)–(36) completes the proof for the \(\lambda _n=\varTheta _1(\sqrt{m})\) case.

When \(\lambda _n=o(\sqrt{m})\), we can readily see that

$$\begin{aligned} {\mathbb {E}}_{\mu _n}\mathbb {1}(\sqrt{mt_1} \ge \lambda _n \gamma ) \rightarrow 1 \quad a.s. \end{aligned}$$

Using that both \(\frac{s_{\lambda _n,\gamma }(\sqrt{mt_1})}{\sqrt{mt_1}}\,\) and \({\mathcal {T}}_{t_1,t_2}\) are bounded, we have, almost surely

$$\begin{aligned}&{\mathbb {E}}_{\mu _n}\frac{s_{\lambda _n,\gamma }(\sqrt{mt_1})}{\sqrt{mt_1}}\\&\quad = {\mathbb {E}}_{\mu _n}\mathbb {1}(\sqrt{mt_1}\ge \lambda _n \gamma ) \\&\quad \quad +{\mathbb {E}}_{\mu _n} \left\{ \frac{s_{\lambda _n, \gamma }(\sqrt{mt_1})}{\sqrt{mt_1}} \mathbb {1}(\sqrt{mt_1}< \lambda _n \gamma ) \right\} \rightarrow 1 \end{aligned}$$

and

$$\begin{aligned} {\mathbb {E}}_{\mu _n}({\mathcal {T}}_{t_1,t_2})&={\mathbb {E}}_{\mu _n} \mathbb {1}(\sqrt{mt_1} \ge \lambda _n \gamma ) \mathbb {1}(\sqrt{mt_2} \ge \lambda _n \gamma ) + o(1)\\&\rightarrow 1. \end{aligned}$$

Invoking DCT completes the proof. Similar arguments hold for the case \(\lambda _n=\varTheta _2(\sqrt{m})\). \(\square \)
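
The limiting expression derived above for the \(\lambda _n=\varTheta _1(\sqrt{m})\) regime can be approximated by simulation. The R sketch below is illustrative only: it assumes \(g_{\zeta ,\gamma }(t)=s_{\zeta ,\gamma }(\sqrt{t})/\sqrt{t}\) (consistent with the scaling relation in Note 4), reuses the MC+ thresholding helper mcplus_threshold() sketched after the abstract, draws \(T_1, T_2\) approximately from the Marchenko–Pastur law via the empirical spectrum of a large random matrix (Lemma 1), and picks arbitrary values of \(\zeta \) and \(\gamma \).

```r
## Monte Carlo approximation of the limit of df(S_{lambda_n,gamma}(Z)) / (mn):
##   (1 - alpha) * E g(T1) + alpha * E[(T1 g(T1) - T2 g(T2)) / (T1 - T2) 1(T1 != T2)].
g <- function(t, zeta, gamma) mcplus_threshold(sqrt(t), zeta, gamma) / sqrt(t)

set.seed(1)
m <- 4000; n <- 1000; alpha <- n / m
lam <- eigen(crossprod(matrix(rnorm(m * n), m, n)) / m,
             symmetric = TRUE, only.values = TRUE)$values
T1 <- sample(lam, 1e5, replace = TRUE)             # approximate draws from mu
T2 <- sample(lam, 1e5, replace = TRUE)

zeta <- 0.5; gamma <- 5
term1 <- (1 - alpha) * mean(g(T1, zeta, gamma))
ratio <- (T1 * g(T1, zeta, gamma) - T2 * g(T2, zeta, gamma)) / (T1 - T2)
ratio[T1 == T2] <- 0                               # enforce the indicator 1(t1 != t2)
term2 <- alpha * mean(ratio)
term1 + term2                                      # approximate limiting value
```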

Fig. 9

Random orthogonal model (ROM) simulations with \(\text {SNR}=1\). The optimal nonconvex penalties are obtained at \(\gamma =30\) and \(\gamma =20\) under the two scenarios, respectively. The integers from 1 to 100 on the x-axis index the grid of 100 values of \(\lambda \) (from largest to smallest) as described in Sect. 4.1

Fig. 10

Random orthogonal model (ROM) simulations with \(\text {SNR}=5\). The optimal nonconvex penalties are obtained at \(\gamma =30\) and \(\gamma =5\) under the two scenarios, respectively. The integers from 1 to 100 on the x-axis index the grid of 100 values of \(\lambda \) (from largest to smallest) as described in Sect. 4.1

Fig. 11

Coherent and nonuniform sampling (NUS) simulations with \(\text {SNR}=10\). The optimal nonconvex penalties are both obtained at \(\gamma =5\) under the two scenarios. The integers from 1 to 100 on the x-axis index the grid of 100 values of \(\lambda \) (from largest to smallest) as described in Sect. 4.1

1.1.2 Proof of Proposition 10

Proof

Observe that R as defined in Proposition 9 can be written as:

$$\begin{aligned} R = {\widetilde{A}}{\widetilde{V}}_{1} - {\widetilde{U}}_{1}{\widetilde{\varSigma }}_{1} + (A - {\widetilde{A}}){\widetilde{V}}_{1} = (A - {\widetilde{A}}){\widetilde{V}}_{1}, \end{aligned}$$
(37)

where above we have used the fact that \({\widetilde{A}}{\widetilde{V}}_{1} = {\widetilde{U}}_{1}{\widetilde{\varSigma }}_{1}\), which follows from the definition of the SVD of \({\widetilde{A}}\). By a simple inequality, it follows that

$$\begin{aligned} \Vert R \Vert _2 \le \Vert (A - {\widetilde{A}})\Vert _2 \Vert {\widetilde{V}}_{1}\Vert _2 = \Vert (A - {\widetilde{A}})\Vert _2, \end{aligned}$$
(38)

where we have used the fact that \(\Vert {\widetilde{V}}_{1}\Vert _2 = 1\). Similarly, we have an analogous result for Q:

$$\begin{aligned} \Vert Q \Vert _2 \le \Vert (A - {\widetilde{A}})\Vert _2 \Vert {\widetilde{U}}_{1}\Vert _2 = \Vert (A - {\widetilde{A}})\Vert _2. \end{aligned}$$
(39)

Note that (38) and (39) together imply that if \(\Vert {\widetilde{A}} - A \Vert _2\) is small, then so are \(\Vert R\Vert _2, \Vert Q\Vert _2\).

We now apply (31) (Proposition 9) with \(A = X_{k}\) and \({\widetilde{A}} = X_{k+1}\) and \(r_{1} = p\), to arrive at the proof of Proposition 10. \(\square \)
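
The bound (38) can be verified numerically. The R sketch below uses illustrative random matrices: it builds \(R = A{\widetilde{V}}_{1} - {\widetilde{U}}_{1}{\widetilde{\varSigma }}_{1}\) from the top-p SVD factors of a perturbed matrix \({\widetilde{A}}\) and compares \(\Vert R\Vert _2\) with \(\Vert A - {\widetilde{A}}\Vert _2\).

```r
## Numerical check (sketch) of ||R||_2 <= ||A - A_tilde||_2, with
## R = A V1_tilde - U1_tilde Sigma1_tilde built from the rank-p SVD of A_tilde.
set.seed(2)
A_tilde <- matrix(rnorm(200 * 100), 200, 100)
A <- A_tilde + 1e-2 * matrix(rnorm(200 * 100), 200, 100)   # small perturbation of A_tilde
p <- 5
s <- svd(A_tilde, nu = p, nv = p)                          # top-p singular triplets of A_tilde
R <- A %*% s$v - s$u %*% diag(s$d[1:p])
spec_norm <- function(M) svd(M, nu = 0, nv = 0)$d[1]       # largest singular value
c(norm_R = spec_norm(R), bound = spec_norm(A - A_tilde))   # norm_R should not exceed bound
```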

1.1.3 Proof of Proposition 11

Proof

Proof of Part (a):

Let us write the stationary conditions for every update:

$$\begin{aligned} X_{k+1} = \mathop {\hbox {arg min}}\limits _{X} \; F_{\ell }(X;X_{k}). \end{aligned}$$

We set the subdifferential of the map \(X \mapsto F_{\ell }(X;X_{k})\) to zero at \(X = X_{k+1}\):

$$\begin{aligned}&\left( X_{k+1} - \left( {\mathcal {P}}_{\varOmega }(Y)+ {\mathcal {P}}_{\varOmega }^\perp (X_{k}) \right) \right) \nonumber \\&\quad + \ell (X_{k+1} - X_{k}) + U_{k+1} \nabla _{k+1} V_{k+1}' = 0, \end{aligned}$$
(40)

where \(X_{k+1} = U_{k+1} \mathrm {diag}(\varvec{\sigma }_{k+1})V'_{k+1}\) is the SVD of \(X_{k+1}\). Note that the term \(U_{k+1} \nabla _{k+1} V_{k+1}'\) in (40) is a subgradient (Lewis 1995) of the spectral function:

$$\begin{aligned} X \mapsto \sum _{i} P(\sigma _{i}(X); \lambda , \gamma ), \end{aligned}$$

where \(\nabla _{k+1}\) is a diagonal matrix with the ith diagonal entry being a derivative of the map \(\sigma _{i} \mapsto P(\sigma _{i}; \lambda ,\gamma )\) (on \(\sigma _{i} \ge 0\)), denoted by \(\partial P(\sigma _{k+1, i}; \lambda ,\gamma )/\partial \sigma _{i}\) for all i. Note that (40) can be rewritten as:

$$\begin{aligned}&{\mathcal {P}}_{\varOmega }(X_{k+1}) - {\mathcal {P}}_{\varOmega }(Y) +U_{k+1} \nabla _{k+1} V_{k+1}' \\&\quad + \underbrace{\left( {\mathcal {P}}_{\varOmega }^\perp (X_{k+1} - X_{k}) + \ell (X_{k+1} - X_{k}) \right) }_{(a)} = 0. \end{aligned}$$

As \(k \rightarrow \infty \), term (a) converges to zero (see Proposition 7), and thus we have:

$$\begin{aligned} {\mathcal {P}}_{\varOmega }(X_{k+1}) - {\mathcal {P}}_{\varOmega }(Y) + U_{k+1} \nabla _{k+1} V_{k+1}' \rightarrow 0. \end{aligned}$$

Let us denote the ith column of \(U_{k}\) by \(u_{k,i}\), and similarly the ith column of \(V_{k}\) by \(v_{k,i}\). Let \(r_{k+1}\) denote the rank of \(X_{k+1}\). Hence, we have:

$$\begin{aligned}&\sum _{i=1}^{r_{k+1}}\sigma _{k+1,i} {\mathcal {P}}_{\varOmega }(u_{k+1,i} v_{k+1, i}') \\&\quad - {\mathcal {P}}_{\varOmega }(Y) + U_{k+1} \nabla _{k+1} V_{k+1}' \rightarrow 0. \end{aligned}$$

Multiplying the above on the left by \(u'_{k+1,j}\) and on the right by \(v_{k+1,j}\), we have the following:

$$\begin{aligned}&\sum _{i=1}^{r_{k+1}} \sigma _{k+1,i} u'_{k+1,j}{\mathcal {P}}_{\varOmega }(u_{k+1,i}v'_{k+1,i})v_{k+1,j}\\&\quad - u'_{k+1,j}{\mathcal {P}}_{\varOmega }(Y)v_{k+1,j} + \nabla _{k+1,j} \rightarrow 0, \end{aligned}$$

for \(j = 1, \ldots , r_{k+1}.\) Let \(\left\{ {\bar{U}}, {\bar{V}} \right\} \) denote a limit point of the sequence \(\left\{ U_{k},V_{k}\right\} \) (which exists since the sequence is bounded), and let r be the rank of \({\bar{U}}\) and \({\bar{V}}\). Let us now study the following equations (see Note 8):

$$\begin{aligned}&\sum _{i=1}^{r} {\bar{\sigma }}_{i} {\bar{u}}'_{j}{\mathcal {P}}_{\varOmega }({\bar{u}}_{i}{\bar{v}}'_{i}){\bar{v}}_{j}\nonumber \\&\quad - {\bar{u}}'_{j}{\mathcal {P}}_{\varOmega }(Y){\bar{v}}_{j} + {\bar{\nabla }}_{j} = 0, \;\;\; j = 1, \ldots , r. \end{aligned}$$
(41)

Using the notation \({\bar{\theta }}_{j} = \text {vec} \left( {\mathcal {P}}_{\varOmega }({\bar{u}}_{j}{\bar{v}}'_{j}) \right) \) and \({\bar{y}} = \text {vec}({\mathcal {P}}_{\varOmega }(Y))\), we note that (41) are the first-order stationary conditions for a point \(\bar{\varvec{\sigma }}\) for the following penalized regression problem:

$$\begin{aligned} \mathop {\hbox {min}}\limits _{\varvec{\sigma }} \;\; \frac{1}{2} \Vert \sum _{j=1}^{r} \sigma _{j} {\bar{\theta }}_{j} - {\bar{y}} \Vert _{2}^2 + \sum _{j=1}^{r} P(\sigma _{j}; \lambda ,\gamma ), \end{aligned}$$
(42)

with \(\varvec{\sigma } \ge \mathbf {0}\).

If the matrix \({\bar{\varTheta }} = [{{\bar{\theta }}}_{1}, \ldots , {{\bar{\theta }}}_{r}]\) (note that \({\bar{\varTheta }} \in {\mathbb {R}}^{mn \times r}\)) has rank r, then any \(\varvec{\sigma }\) that satisfies (41) is finite; in particular, the sequence \(\varvec{\sigma }_{k}\) is bounded and has a limit point \(\bar{\varvec{\sigma }}\), which satisfies the first-order stationary conditions (41).

Proof of Part (b):

Furthermore, if we assume that

$$\begin{aligned} \lambda _{\min }( {\bar{\varTheta }}'{\bar{\varTheta }}) + \phi _P > 0, \end{aligned}$$

then (42) admits a unique solution \(\bar{\varvec{\sigma }}\), which implies that \(\varvec{\sigma }_{k}\) has a unique limit point, and hence, the sequence \(\varvec{\sigma }_{k}\) necessarily converges. \(\square \)
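
The objects appearing in this proof can be formed explicitly, which also makes the condition in Part (b) easy to check in small examples. The R sketch below is illustrative: it builds \({\bar{\varTheta }} = [{\bar{\theta }}_{1}, \ldots , {\bar{\theta }}_{r}]\) with \({\bar{\theta }}_{j} = \text {vec}({\mathcal {P}}_{\varOmega }({\bar{u}}_{j}{\bar{v}}'_{j}))\) from randomly generated orthonormal factors and a random observation pattern, and computes \(\lambda _{\min }({\bar{\varTheta }}'{\bar{\varTheta }})\).

```r
## Sketch: forming Theta_bar = [theta_1, ..., theta_r], theta_j = vec(P_Omega(u_j v_j')),
## and computing the smallest eigenvalue appearing in the part (b) condition
## lambda_min(Theta_bar' Theta_bar) + phi_P > 0.
build_theta <- function(U, V, omega) {
  ## U: m x r and V: n x r with orthonormal columns; omega: m x n logical matrix.
  sapply(seq_len(ncol(U)), function(j) {
    M <- tcrossprod(U[, j], V[, j])   # u_j v_j'
    M[!omega] <- 0                    # P_Omega(.)
    as.vector(M)                      # vec(.)
  })
}

## Illustrative inputs: random orthonormal factors and a random observation pattern.
set.seed(3)
m <- 50; n <- 40; r <- 3
U <- qr.Q(qr(matrix(rnorm(m * r), m, r)))
V <- qr.Q(qr(matrix(rnorm(n * r), n, r)))
omega <- matrix(runif(m * n) < 0.3, m, n)

Theta <- build_theta(U, V, omega)     # an (mn) x r matrix
min(eigen(crossprod(Theta), symmetric = TRUE, only.values = TRUE)$values)
```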

1.2 Additional simulation results

Fig. 12

The y-axis denotes the number of iterations NC-Impute takes to stabilize the rank. The integers on the x-axis index some values on a grid of \(\lambda \) (from largest to smallest) as described in Sect. 4.1. The six plots represent the six scenarios considered in Sect. 4.1: (a)–(d) correspond to the four scenarios of Example A; (e) covers Example B; (f) is for Example C. Each procedure is repeated 10 times

This section contains additional numerical results from the simulation study in Sect. 4.1.

  • To demonstrate the variation of the procedures in the experiments, we plot the average and standard error of both the test error and the rank for some representative nonconvex penalty functions. Specifically, under each scenario considered in Sect. 4.1, we pick the nonconvex penalty that yields the best prediction and rank estimation performance. For each picked penalty, we plot the average test error and rank, along with the associated standard errors, against the tuning parameter \(\lambda \). The results are shown in Figs. 9, 10 and 11. As is clear from the figures, the standard error is typically (at least) one order of magnitude smaller than the average. Moreover, the general patterns of test error and rank along the solution path are as expected, except for a few points corresponding to very small values of \(\lambda \). The irregularity at these few points likely occurs because the solutions become unstable as the nonconvex regularization weakens when \(\lambda \) is very small.

  • To examine the rank dynamics of the updates in NC-Impute, we compute the number of iterations the algorithm takes for the rank to converge. We choose the same six nonconvex penalties as above and evaluate the rank stabilization for several values of \(\lambda \). The results are summarized in Fig. 12. One clearly observes that, except for a few instances, it takes fewer than 10 iterations for the rank to stabilize. Moreover, when the penalty is more “nonconvex” (i.e., \(\gamma \) is smaller), rank stabilization occurs earlier. These empirical results complement the theoretical study of rank stabilization in Sect. 3.1.1.


About this article


Cite this article

Mazumder, R., Saldana, D. & Weng, H. Matrix completion with nonconvex regularization: spectral operators and scalable algorithms. Stat Comput 30, 1113–1138 (2020). https://doi.org/10.1007/s11222-020-09939-5


Keywords

  • Matrix completion
  • Low rank
  • Spectral nonconvex penalties
  • MC+ penalty
  • Optimization
  • Degrees of freedom