Background

The fields of machine learning and mathematical programming are increasingly intertwined. Optimization algorithms lie at the heart of most machine learning approaches [1] and have been widely applied in learning to rank [2, 3], clustering analysis [4], matrix factorization [5], etc.

Matrix completion is the task of filling in the missing entries of a partially observed matrix. Given a partially observed matrix O, matrix completion [6] attempts to recover a low-rank matrix X that approximates O on the observed entries. This approach leads to the following low-rank regularized optimization problem:

$$ \min_{X}\frac{1}{2}\left\|P_{\Omega}(O - X)\right\|_{F}^{2} + \lambda \cdot rank(X), $$
(1)

where PΩ(·) projects an m×n matrix onto the observed indices Ω (setting the unobserved entries to zero), and ∥·∥F denotes the Frobenius norm of a matrix.

Direct minimization of the matrix rank is NP-hard, and a natural convexification of rank(X) is the nuclear norm ∥X∥* of X. The corresponding surrogate problem can be solved by various optimization algorithms. Candes and Recht [6] showed that the convex relaxation is equivalent to a classic semidefinite program (SDP), whose complexity is O(max(m,n)^4). Cai et al. [7] gave a first-order method, the singular value thresholding (SVT) algorithm, that approximately solves the convex relaxation; each SVT iteration requires an SVD that takes O(mn^2) time. The soft-impute algorithm exploits a special "sparse + low-rank" structure associated with SVT to compute the SVD in reasonable time. The augmented Lagrangian method (ALM) provides a powerful framework for solving convex problems with equality constraints, and Lin et al. [8] used ALM to solve matrix completion problems. The fixed point continuation with approximate SVD (FPCA) algorithm [9] is based on the proximal gradient method with a continuation technique to accelerate convergence. Toh and Yun [10] proposed an accelerated proximal gradient algorithm based on a balanced approach. Tomioka et al. [11] used a dual augmented Lagrangian algorithm for matrix completion, which achieves super-linear convergence and generalizes to the problem of learning multiple matrices and general spectral regularization.

The traditional nuclear norm regularization may fall short of delivering reliable low-rank estimators with good prediction performance. Many researchers have therefore moved beyond the convex l1 penalty to nonconvex alternatives. Mazumder et al. [12] used the MC+ penalty to demonstrate the performance of a coordinate-descent algorithm. Dimitris et al. [13] developed a discrete extension of modern first-order continuous optimization methods to find high-quality feasible solutions. Zhang [14] proposed the MC+ method for penalized variable selection in high-dimensional linear regression, which was later placed within a unified view of concave high-dimensional sparse estimation procedures [15].

Previous work [16] used the alternating least squares (ALS) procedure to optimize nonconvex matrix completion. However, this method has a high time complexity and requires many iterations to reach convergence, which cannot meet the requirements of big data analytics. Halko et al. [17] presented a modular framework for constructing randomized algorithms that compute partial matrix decompositions. Nesterov's momentum [18] takes a step in the velocity direction and then corrects the velocity vector based on the new location. Nesterov's momentum has been widely used in deep learning [19] and signal processing [20].

In this work, we adopt randomized SVD decomposition and Nesterov's momentum to accelerate the optimization of nonconvex matrix completion. The randomized SVD decomposition requires very few iterations to converge, and Nesterov's acceleration gives the proposed algorithm a low running time and a fast convergence rate. The experimental results on four real-world recommendation system datasets show that the proposed algorithm achieves substantial speedups while maintaining performance comparable to ALS for nonconvex (and convex) matrix completion.

Our contributions are as follows: (1) we introduce Nesterov's momentum to solve the nonconvex matrix completion problem; (2) we further adopt randomized SVD decomposition to accelerate the convergence of the algorithm; and (3) the experimental results on benchmark datasets show the effectiveness of our algorithm.

The remainder of the paper is organized as follows: the "Background" section introduces nonconvex matrix completion and the alternating least squares algorithm; the "Results" section shows the experimental results; the "Discussion" section analyses the time complexities of the algorithms; the "Conclusions" section summarizes the conclusions; and the "Methods" section presents our novel algorithm, Nonconvex Matrix Completion by Nesterov's Acceleration (NMCNA).

Nonconvex Matrix Completion

Nonconvex matrix completion uses nonconvex regularizers to approximate the l0 penalty, which leads to the following nonconvex estimation problem:

$$ \min_{X} \frac{1}{2}\|P_{\Omega}(X - Y)\|_{F}^{2} + \sum\limits_{i=1}^{\min\{m,n\}}P(\sigma_{i}(X);\lambda,\gamma), $$
(2)

where σi(X) (i≥1) are the singular values of X and P(σi(X);λ,γ) is a concave penalty function of σ on [0,∞). The parameters (λ,γ) control the amount of nonconvexity and shrinkage. For the nuclear norm regularized problem with P(σi(X);λ,γ)=λσi(X), the soft-thresholding operator gives a solution of problem (2):

$$ S_{\lambda}(Z) = U \text{diag} (s_{\lambda}(\boldsymbol{\sigma})) V^{T}, \quad (s_{\lambda}(\boldsymbol{\sigma}))_{i} = (\sigma_{i} - \lambda)_{+}, $$
(3)

where U and V are the left and right singular matrices of Z, σ is the vector of singular values of Z, and λ is a threshold.
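For concreteness, a minimal NumPy sketch of the operator in (3) is given below; the function name soft_threshold is ours, and the full SVD is used purely for illustration:

```python
import numpy as np

def soft_threshold(Z, lam):
    """Singular value soft-thresholding S_lambda(Z) from Eq. (3).

    Computes a full SVD of Z, shrinks each singular value by lam
    (clipping at zero), and reconstructs the matrix.
    """
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    shrunk = np.maximum(sigma - lam, 0.0)   # (sigma_i - lambda)_+
    return U @ np.diag(shrunk) @ Vt
```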

For the more general spectral penalty function P(σ;λ,γ), we have the following result from [16], stated as Lemma 1.

Lemma 1

Let Z=Udiag(σ)VT denote the SVD of Z and let sλ,γ denote the following thresholding operator on the singular values of Z:

$$ s_{\lambda,\gamma}(\boldsymbol{\sigma}) \in \text{argmin}_{\boldsymbol{\alpha} \ge 0} \sum\limits_{i = 1}^{\min\{m,n\}} \frac{1}{2}(\alpha_{i} - \sigma_i)^{2} + P(\alpha_{i}; \lambda,\gamma). $$
(4)

Then, Sλ,γ(Z)=Udiag(sλ,γ(σ))VT.

By the separability of the optimization, it can be converted into a sequence of single-variable problems:

$$ s_{\lambda,\gamma}(\sigma_i) \in \text{argmin}_{\alpha \ge 0}\left\{ \frac{1}{2}(\alpha - \sigma_i)^{2} + P(\alpha;\lambda,\gamma)\right\}. $$
(5)

Popular nonconvex functions used in the high-dimensional regression framework include the lγ penalty, the SCAD penalty, the MC+ penalty, and the log-penalty. In this work, we use the MC+ penalty (λ>0 and γ>1) as a representative:

$$ P(\sigma;\lambda,\gamma) =\left\{ \begin{array}{rr} \lambda\left(\sigma - \frac{\sigma^{2}}{2\lambda\gamma}\right),& \quad 0 \le \sigma \le \lambda \gamma,\\ \frac{\lambda^{2} \gamma}{2},& \quad \sigma \ge \lambda \gamma. \end{array}\right. $$
(6)

where the parameters λ and γ control the amount of nonconvexity and shrinkage.

We can easily analyze the solution of the thresholding function induced by the MC+ penalty:

$$ s_{\lambda,\gamma}(\sigma) = \left\{ \begin{array}{rr} 0,& \quad \sigma \le \lambda, \\ \frac{\sigma - \lambda}{1 - 1/\gamma},&\quad \lambda < \sigma < \lambda \gamma, \\ \sigma,&\quad \sigma > \lambda \gamma. \end{array}\right. $$
(7)

As observed in Fig. 1, when γ→+∞, the above thresholding operator coincides with the soft-thresholding operator (σ−λ)+. On the other hand, the thresholding operator approaches the discontinuous hard-thresholding operator σ·1(σ>λ) as γ→1+.
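The thresholding rule (7) can be sketched in NumPy as follows (the name mcp_threshold is ours); it is applied elementwise to the vector of singular values:

```python
import numpy as np

def mcp_threshold(sigma, lam, gamma):
    """MC+ thresholding s_{lambda,gamma}(sigma) from Eq. (7), elementwise."""
    sigma = np.asarray(sigma, dtype=float)
    return np.where(
        sigma <= lam, 0.0,
        np.where(sigma < lam * gamma,
                 (sigma - lam) / (1.0 - 1.0 / gamma),  # linear ramp region
                 sigma))                               # no shrinkage beyond lam*gamma

# As gamma -> +inf this tends to soft-thresholding; as gamma -> 1+ it
# approaches hard-thresholding.
print(mcp_threshold([0.5, 1.5, 3.0], lam=1.0, gamma=2.0))  # [0.0, 1.0, 3.0]
```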

Fig. 1 MC+ threshold operator. Curves of nonconvex penalties P(σ;λ,γ) with λ=2 for different γ

It is critical to compute the SVD efficiently. Mazumder et al. [16] used the alternating least squares (ALS) procedure to compute a low-rank SVD.

Matrix Completion with Fast Alternating Least Squares

The ALS method [21, 22] solves the following nonlinear optimization problem:

$$ \min_{A_{m \times r},B_{n \times r}}\left\|P_{\Omega}\left(Y - A B^{T}\right)\right\|_{F}^{2} + \lambda\left(\left\|A\right\|_{F}^{2} + \left\|B\right\|_{F}^{2}\right), $$
(8)

where λ is a regularization parameter and PΩ(X) is the projection of the matrix X onto the observed index set Ω. We call this the maximum-margin matrix factorization (MMMF) criterion, which is nonconvex but biconvex. It can be solved with an alternating minimization algorithm: one variable (A or B) is fixed, and we solve (8) for the other (B or A). This problem decouples into n separate ridge regressions (see the sketch below). Furthermore, it can be proved [21] that problem (8) is equivalent to the following problem:

$$ \min_{X:rank(X) \le r}\frac{1}{2}\left\|P_{\Omega}(Y - X)\right\|_{F}^{2} + \lambda \|X\|_{*}, $$
(9)

where X = AB^T.

It is noted that these quantities admit efficient "sparse + low-rank" representations for high-dimensional problems, which allow the matrices to be stored and multiplied efficiently, and the resulting linear regression subproblems have analytical solutions. Comparing (8) and (9), we observe that X = UD^2V^T = AB^T with A = UD and B = VD.
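As an illustration of the decoupling into ridge regressions, the following NumPy sketch (our own hypothetical helper update_B, not the implementation of [21]) computes the update of B in (8) with A held fixed; each column of B is fit from its observed entries only:

```python
import numpy as np

def update_B(Y, mask, A, lam):
    """ALS half-step for (8): solve for B with A fixed.

    Y    : m x n data matrix (values on unobserved entries are ignored)
    mask : m x n boolean matrix, True on the observed entries Omega
    A    : m x r current left factor
    lam  : ridge regularization parameter lambda
    """
    n = Y.shape[1]
    r = A.shape[1]
    B = np.zeros((n, r))
    for j in range(n):
        obs = mask[:, j]                    # rows observed in column j
        A_j, y_j = A[obs], Y[obs, j]
        # Ridge regression: (A_j^T A_j + lam I) b_j = A_j^T y_j
        B[j] = np.linalg.solve(A_j.T @ A_j + lam * np.eye(r), A_j.T @ y_j)
    return B
```

The update of A with B fixed is symmetric, with the roles of the rows and columns of Y exchanged.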

Finally, we use the relative change in the Frobenius norm to check for convergence:

$$\begin{array}{*{20}l} \text{ratio} & = \frac{\left\|U D^{2} V^{T} - \bar{U}\bar{D}^{2} \bar{V}^{T}\right\|_{F}^{2}}{\left\|U D^{2} V^{T}\right\|_{F}^{2}} \\ & = \frac{\text{tr}\left(D^{4}\right) + \text{tr}\left(\bar{D}^{4}\right) - 2\text{tr}\left(D^{2} U^{T} \bar{U} \bar{D}^{2} \bar{V}^{T} V\right)}{\text{tr}\left(D^{4}\right)} \end{array} $$
(14)
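A sketch of this check in NumPy, using the trace identity in (14) so that the large matrices never need to be formed explicitly (names ours; d and dbar hold the diagonals of D and its barred counterpart as vectors):

```python
import numpy as np

def frob_change_ratio(U, d, V, Ubar, dbar, Vbar):
    """Relative Frobenius change in Eq. (14) between U diag(d)^2 V^T and
    Ubar diag(dbar)^2 Vbar^T, computed from the factors via traces only."""
    cross = np.trace(np.diag(d ** 2) @ U.T @ Ubar
                     @ np.diag(dbar ** 2) @ Vbar.T @ V)
    return (np.sum(d ** 4) + np.sum(dbar ** 4) - 2.0 * cross) / np.sum(d ** 4)
```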

Results

In this section, we describe the experimental protocols and then compare the effectiveness and efficiency of our proposed algorithm (NMCNA) with the ALS-convex and ALS-nonconvex algorithms. All algorithms are implemented in Python with the NumPy library and run on Ubuntu 14.04.

We used four real-world recommendation system datasets (shown in Table 1) from ml100k and ml1m, provided by MovieLens, to compare nonconvex MC+ regularization. Every observed entry is a rating from {1,2,3,4,5}.

Table 1 Descriptions of the datasets

For performance evaluation, we use the root mean squared error (RMSE):

$$ RMSE = \sqrt{\left\|P_{\Omega}(Y - O)\right\|_{F}^{2}/\left\|\Omega\right\|_{1}}, $$
(15)

where Y and O are the true and the predicted values, respectively, and ∥Ω∥1 denotes the number of observed test entries.
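A minimal sketch of this metric in NumPy, where mask is a Boolean matrix marking the test entries in Ω (names ours):

```python
import numpy as np

def rmse(Y_true, Y_pred, mask):
    """Root mean squared error of Eq. (15) over the test entries in Omega."""
    diff = (Y_true - Y_pred)[mask]
    return np.sqrt(np.sum(diff ** 2) / mask.sum())
```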

For all three settings above, we fix λ1=∥PΩ(Y)∥ and λ50=0.001·λ1. We choose 10 γ-values on a logarithmic grid from 10^6 to 1.1. The true ranks are taken from {3,5,10,20,30,40,50}.

From Fig. 2, we observe that a moderate γ value yields the best generalization performance.

Fig. 2 The test RMSE and the parameter γ for different ranks (left: k=3 and right: k=50) on the ml-100k/ua dataset. The x-axes are the iteration counts and the y-axes are the test RMSE

Figure 3 shows how the test RMSE varies with the training RMSE. In each subfigure, the x-axis is the training RMSE and the y-axis is the test RMSE; the rows and columns correspond to the ranks and the datasets, respectively.

Fig. 3 Training and test RMSE curves on four datasets for the three algorithms. The x-axis is the training RMSE and the y-axis is the test RMSE

Figure 4 shows how the running time varies with the number of iterations. Note that our proposed algorithm runs faster than the ALS-convex and ALS-nonconvex algorithms.

Fig. 4 Running time of the three algorithms on four datasets. The x-axes are the iteration counts and the y-axes are the running times

Discussion

In Fig. 2, it is most apparent for the large rank k=50 that the choice γ=80 achieves the best test RMSE within 35 iterations on the ml-100k/ua dataset. Similar results also hold for rank k=3.

We can observe from Fig. 3 that the proposed algorithm converges rapidly, although it has a high training and test RMSE at the beginning. In the end, it achieves performance comparable to the ALS-convex and ALS-nonconvex algorithms. The minimum RMSE value does not decrease with increasing rank.

Figure 4 shows that our algorithm is 18 times faster than the other two algorithms on the ml-1m/ra.train and ml-1m/rb.train datasets.

According to the discussion of the ALS algorithm in [21], the total cost of one iteration for computing the SVD is O(2r|Ω|+mk^2+3nk^2+k^3).

The randomized SVD procedure [17] in the proposed algorithm requires the computation of Y*Y*^T R (line 5 in Alg. 3), where Y*^T R can be computed as

$$\begin{array}{*{20}l} Y^{*T}R & = P_{\Omega}(Y - X^{*})R + (1 + \theta)XR - \theta X^{\prime}R \\ & = P_{\Omega}(Y - X^{*})R + (1 + \theta)U \text{diag}(s_{\lambda,\gamma}(\sigma)) V^{T} R \\ &- \theta U^{\prime} \text{diag}(s^{\prime}_{\lambda,\gamma}(\sigma)) V^{\prime{T}} R \end{array} $$
(16)

Assume that X and X′ have ranks r and r′, respectively; then it takes O((r+r′)|Ω|) time to construct PΩ(Y−X*) and O(|Ω|k) time to compute PΩ(Y−X*)R. Further, the cost of computing U(diag(sλ,γ(σ))(V^T R)) is O(mrk+nrk); thus, the total cost is

$$ O\left((r + r^{\prime} + k)|\Omega| + (r^{\prime} + r)(m + n)k\right). $$
(17)

Although we do not compare the per-iteration cost of the randomized SVD with that of ALS, the randomized SVD requires fewer iterations (<5) than ALS in these experiments.

Conclusions

This paper proposes a novel algorithm for nonconvex matrix completion. The proposed algorithm uses a randomized SVD decomposition that requires only a few iterations, and Nesterov's acceleration further decreases its running time. The experimental results on four real-world recommendation system datasets show that the approach accelerates the optimization of nonconvex matrix completion problems while maintaining performance comparable to the ALS algorithms. In the future, we will focus on validating other nonconvex penalty functions for matrix completion problems.

Methods

Nonconvex Matrix Completion by Nesterov’s Acceleration (NMCNA)

Although soft-impute is a proximal gradient method that can be accelerated by Nesterov's acceleration, the special sparse + low-rank structure is lost in the ALS algorithm [23]. In this work, we therefore resort to the randomized partial matrix decomposition algorithm [17] to compute the SVD.

Now, we present Nesterov’s acceleration for the proposed nonconvex matrix completion problem. First, define the following sequences:

$$\begin{array}{*{20}l} c_{i} & = \left\{ \begin{array}{rl} \frac{1 + \sqrt{1 + 4 c_{i - 1}^{2}}}{2}, & \quad i > 1, \\ 1, &\quad i = 0,1. \end{array} \right. \end{array} $$
(18)
$$\begin{array}{*{20}l} \theta_{i} & = \frac{c_{i} - 1}{c_{i + 1}} \end{array} $$
(19)

Given the singular value decompositions \(\bar {X} = U\Sigma V^{T}\) (Σ is diagonal) and \(\bar {X}^{\prime } = U^{\prime }\Sigma ^{\prime } V^{\prime {T}}\), the proximal gradient algorithm uses the proximal mapping

$$\begin{array}{*{20}l} X & = U \text{diag}({s_{\lambda,\gamma}}(\sigma)){V^{T}}, \end{array} $$
(20)
$$\begin{array}{*{20}l} X^{\prime} & = U^{\prime} \text{diag}({s^{\prime}_{\lambda,\gamma}}(\sigma))V^{\prime{T}}, \end{array} $$
(21)

where sλ,γ(σ) is computed by (7) and X (X′) is the current (previous) estimated matrix. By Nesterov's acceleration, we have

$$\begin{array}{*{20}l} X^{*} & = & (1 + \theta)X - \theta X^{\prime} \end{array} $$
(22)
$$\begin{array}{*{20}l} Y^{*} & = & P_{\Omega}(Y - X^{*}) + X^{*}. \end{array} $$
(23)
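Combining (18)-(23), one accelerated iteration can be sketched as follows. This is a schematic of the update rather than the authors' exact implementation: partial_svd stands for any routine returning the top-k SVD (for example, the randomized SVD of the next subsection), and mcp_threshold is the MC+ rule sketched after (7).

```python
import numpy as np

def nesterov_step(Y, mask, X_cur, X_prev, c_cur, lam, gamma, k,
                  partial_svd, mcp_threshold):
    """One accelerated proximal iteration combining Eqs. (18)-(23).

    Y      : m x n data matrix (zeros on unobserved entries)
    mask   : boolean matrix of observed entries (Omega)
    X_cur  : current estimate X;  X_prev : previous estimate X'
    c_cur  : current momentum scalar c_i from Eq. (18)
    """
    # Momentum coefficients, Eqs. (18)-(19)
    c_next = (1.0 + np.sqrt(1.0 + 4.0 * c_cur ** 2)) / 2.0
    theta = (c_cur - 1.0) / c_next

    # Extrapolated point and sparse + low-rank target, Eqs. (22)-(23)
    X_star = (1.0 + theta) * X_cur - theta * X_prev
    Y_star = np.where(mask, Y - X_star, 0.0) + X_star

    # Proximal mapping via a rank-k SVD and MC+ thresholding, Eqs. (20)-(21)
    U, sigma, Vt = partial_svd(Y_star, k)
    X_next = U @ np.diag(mcp_threshold(sigma, lam, gamma)) @ Vt
    return X_next, X_cur, c_next
```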

Random SVD Factorization

The problem of computing a low-rank approximation to a given matrix can be divided into two stages: construction of a low-dimensional subspace that approximates the range of the matrix, and computation of the factorization restricted to that subspace.

We require a matrix Q here as an approximate basis (see Alg. 3) for the range of the input matrix Z

$$ Z \approx Q Q^{T} Z, $$
(24)

where Q has orthonormal columns.

In the simplest form, we directly compute the QR factorization Y=ZΩ=QR, where Ω is a random test matrix; Q is then an orthonormal basis for the range of ZΩ, which approximates the range of Z. However, the singular spectrum of the input matrix may decay slowly. Therefore, we alternatively compute the QR factorization (ZZ^T)^q ZΩ = QR, where q≤5 usually suffices in practice.

Given the orthonormal matrix Q, we compute the SVD of B=Q^T Z as follows:

$$ B = Q^{T} Z = \tilde{U}\Sigma V^{T}, $$
(25)

then with (24), we immediately obtain the approximate SVD of Z:

$$ Z \approx Q Q^{T} Z = Q B = \left(Q\tilde{U}\right) \Sigma V^{T}. $$
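A compact NumPy sketch of this two-stage procedure (random range finder with q power iterations, followed by the SVD of the small matrix B) is given below; it is a generic implementation in the spirit of [17], with names of our own choosing, and is not a reproduction of Alg. 3:

```python
import numpy as np

def randomized_svd(Z, k, q=2, oversample=10, seed=None):
    """Approximate rank-k SVD of Z via a randomized range finder.

    Stage 1: build Q spanning the range of (Z Z^T)^q Z Omega for a
             Gaussian test matrix Omega (Eq. (24)).
    Stage 2: SVD of the small matrix B = Q^T Z and lift back, Eqs. (25)-(26).
    """
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(Z @ Omega)                    # orthonormal range basis
    for _ in range(q):                                # power iterations
        Q, _ = np.linalg.qr(Z.T @ Q)
        Q, _ = np.linalg.qr(Z @ Q)
    B = Q.T @ Z                                       # small (k+p) x n matrix
    U_tilde, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde                                   # lift back, Eq. (26)
    return U[:, :k], sigma[:k], Vt[:k]
```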
(26)