Low-rank matrix recovery with Ky Fan 2-k-norm

We propose Ky Fan 2-k-norm-based models for the nonconvex low-rank matrix recovery problem. A general difference of convex algorithm (DCA) is developed to solve these models. Numerical results show that the proposed models achieve high recoverability rates.


Introduction
Matrix recovery problem concerns the construction of a matrix from incomplete information of its entries. This problem has a wide range of applications such as recommendation systems with incomplete information of users' ratings or sensor localization problem with partially observed distance matrices (see, e.g., [4]). In these applications, the matrix is usually known to be (approximately) low-rank. Finding these low-rank matrices are theoretically difficult due to their non-convex properties. Computationally, it is important to study the tractability of these problems given the large scale of datasets considered in practical applications. Recht et al. [21] studied the low-rank matrix recovery problem using a convex relaxation approach which is tractable. More precisely, in order to recover a low-rank matrix X ∈ R m×n which This work is partially supported by the Alan Turing Fellowship of the first author. satisfies A(X) = b, where the linear map A : R m×n → R p and b ∈ R p , b = 0, are given, the following convex optimization problem is proposed: where X * = i σ i (X) is the nuclear norm, the sum of all singular values of X. Recht et al. [21] showed the recoverability of this convex approach using some restricted isometry conditions of the linear map A. In general, these restricted isometry conditions are not satisfied and the proposed convex relaxation can fail to recover the matrix X.
Low-rank matrices appear to be appropriate representations of data in other applications such as biclustering of gene expression data. Doan and Vavasis [7] proposed a convex approach to recover low-rank clusters using dual Ky Fan 2-k-norm instead of the nuclear norm. Ky Fan 2-k-norm is defined as 1 2 , where σ 1 ≥ · · · ≥ σ k ≥ 0 are the first k largest singular values of A, k ≤ k 0 = rank( A). The dual norm of the Ky Fan 2-k-norm is denoted by | · | k,2 , These unitarily invariant norms (see, e.g., Bhatia [3]) and their gauge functions have been used in sparse prediction problems [1], low-rank regression analysis [8] and multi-task learning regularization [12]. When k = 1, the Ky Fan 2-k-norm is the spectral norm, A = σ 1 ( A), the largest singular value of A, whose dual norm is the nuclear norm. Similar to the nuclear norm, the dual Ky Fan 2-k-norm with k > 1 can be used to compute the k-approximation of a matrix A (Proposition 2.9, [7]), which demonstrates its low-rank property. Motivated by this low-rank property of the (dual) Ky Fan 2-k-norm, which is more general than that of the nuclear norm, and its usage in other applications, we aim to utilize this norm to solve the matrix recovery problem. In addition to the convex relaxation approach using nuclear norm, there have been many non-convex approaches for the matrix recovery problem. The recent survey paper by Nguyen et al. [18] surveys and extensively tests many of these methods. According to [18], a natural dichotomy distinguishes methods for which k is a priori unknown versus methods that prescribe k. Our proposed approach falls into the latter category. Among those methods that prescribe k, the next dichotomy is between methods based on nuclear-norm minimization versus those that attempt to directly minimize the residual A(X) − b by carrying out some form of constraint-preserving descent on X. Methods based on nuclear-norm minimization generally have better recovery rates according to the experiments of [18].
Although our method is in neither of these categories, it is more similar to nuclear-norm based methods of [18] because it minimizes convex norms and does not maintain feasible iterates. Of the non-convex methods, the method with the best recovery rate according to [18] is called truncated nuclear norm minimization (TNN) by Hu et al. [10]. TNN minimizes the non-convex function X * − |X| k,1 , where |X| k,1 denotes the sum of the largest k singular values of X, the Ky Fan 1-k-norm. It should be observed that X * − |X| k,1 is nonnegative and is the sum of singular values k + 1, k + 2, . . . , min(m, n). Therefore, X * − |X| k,1 = 0 if and only if rank(X) ≤ k, hence TNN is an exact formulation of the original problem. We are going to compare our proposed approach with TNN among others.

Contributions and paper outline
In this paper, we focus on non-convex approaches for the matrix recovery problem using the (dual) Ky Fan 2-k-norm. Specifically, our contributions and the structure of the paper are as follows: 1. We propose a Ky Fan 2-k-norm-based non-convex approach to solve the matrix recovery problem and discuss the proposed models in detail in Sect. 2. 2. We develop numerical algorithms to solve those models in Sect. 3, which includes the framework for difference of convex functions algorithm (DCA) and a proximal point algorithm (PPA) framework to solve the (dual) Ky Fan 2-k-norm optimization problems. 3. Numerical results are presented in Sect. 4 to compare the proposed approach with other approaches, including the convex relaxation approach and TNN, in terms of recoverability.

Ky Fan 2-k-norm-based models
The Ky Fan 2-k-norm is the 2 -norm of the vector of k largest singular values with k ≤ min{m, n}. Thus we have: where · F is the Frobenius norm. Now consider the dual Ky Fan 2-k-norm and use the definition of the dual norm, we obtain the following inequality: It is clear that these inequalities become equalities if and only if rank( A) ≤ k. It shows that to find a low-rank matrix X that satisfies A(X) = b with rank(X) ≤ k, we can solve either the following optimization problem It is straightforward to see that these non-convex optimization problems can be used to recover low-rank matrices as stated in the following theorem given the norm inequalities in (4).

Theorem 1
If there exists a matrix X ∈ R m×n such that rank(X) ≤ k and A(X) = b, then X is an optimal solution of (5) and (6).
Given the result in Theorem 1, the exact recovery of a low-rank matrix using (5) or (6) relies on the uniqueness of the low-rank solution of A(X) = b. Recht et al. [21] generalized the restricted isometry property of vectors introduced by Candès and Tao [5] to matrices and used it to provide sufficient conditions on the uniqueness of these solutions.
Definition 1 (Recht et al. [21]) For every integer k with 1 ≤ k ≤ min{m, n}, the k-restricted isometry constant is defined as the smallest number δ k (A) such that holds for all matrices X of rank at most k.
Using Theorem 3.2 in Recht et al. [21], we can obtain the following exact recovery result for (5) and (6).
Theorem 2 Suppose that δ 2k (A) < 1 and there exists a matrix X ∈ R m×n which satisfies A(X) = b and rank(X) ≤ k, then X is the unique solution to (5) and (6), which implies exact recoverability.
The condition in Theorem 2 is indeed stronger than those obtained for the nuclear norm approach such as the condition of δ 5k (A) < 1/10 for exact recovery in Recht et al. [21,Theorem 3.3]. The non-convex optimization problems (5) and (6) use norm ratio and difference. When k = 1, the norm ratio and difference are computed between the nuclear and Frobenius norm. The idea of using these norm ratios and differences with k = 1 has been used to generate non-convex sparse generalizer in the vector case, i.e., m = 1. Yin et al. [24] investigated the ratio 1 / 2 while Yin et al. [25] analyzed the difference 1 − 2 in compressed sensing. Note that even though optimization formulations based on these norm ratios and differences are non-convex, they are still relaxations of 0 -norm minimization problem unless the sparsity level of the optimal solution is s = 1. Our proposed approach is similar to the idea of the truncated difference of the nuclear norm and Frobenius norm discussed in Ma et al [16]. Given a parameter t ≥ 0, the truncated difference is defined as For t ≥ k − 1, the problem of truncated difference minimization can be used to recover matrices with rank at most k given that X * ,t−F = 0 if rank(X) ≤ t + 1. Similar results for exact recovery as in Theorem 2 are provided in Theorem 3.7(a) in Ma et al [16]. Despite the similarity with respect to the recovery results, the problems (5) and (6) are motivated from a different perspective. The proposed approach uses the inequality between the dual Ky Fan 2-k-norm and the Frobenius norm. Given the norm inequalities in (4), we can also generate different nonconvex optimization problems using the Ky Fan 2-k-norm instead of its dual together with the Frobenius norm. These problems put the Ky Fan 2-k-norm in the non-convex objective term of the objective, which will be represented only via a (linear) approximation during the inner iterations of the DCA framework, which will be discussed in the next section. This is contrast to our method, where the dual Ky Fan 2-k norm appears in the convex term and therefore is exactly represented on outer iterations, which we believe is preferable. Note that if we use 1-norms, i.e., nuclear norm instead of Frobenius norm and Ky Fan 1-k-norm, | · | k,1 , instead of 2-k-norm, we get the TNN method directly from the difference model.
We believe that, instead of the (dual) Ky Fan 1-k-norm, it is preferable in general to use the (dual) Ky Fan 2-k-norm given its property regarding rank-k approximation (see further remarks on this matter in [7]). This explains why the models (5) and (6) are proposed. We are now going to discuss how to solve these problems next.

Difference of convex functions algorithms
We start with the problem (5). It can be reformulated as with the change of variables z = 1/ |X| k,2 and Z = X/ |X| k,2 . The compact formulation is where Z is the feasible set of the problem (8) and δ Z (·) is the indicator function of Z. The problem (9) is a difference of convex functions (d.c.) optimization problem (see, e.g. [19]). The difference of convex functions algorithm DCA proposed in [19] can be applied to the problem (9) as shown in Algorithm 1.
Step 2. Update (Z s+1 , z s+1 ) as an optimal solution of the following convex optimization problem Step 3. Set s ← s + 1 and repeat Step 2.
Let X s = Z s /z s and use the general convergence analysis of DCA (see, e.g., Theorem 3.7 in [20]), we can obtain the following convergence results.
The convergence results show that the DCA improves the objective function value of the ratio minimization problem (5).
is the set of optimal solution of (10) and (Z s , z s ) which satisfied this condition is called a critical point. Note that (local) optimal solutions of (9) can be shown to be critical points. The following proposition shows that an equivalent condition for critical points.
Proof Consider Y ∈ Null(A), i.e., A(Y ) = 0, we then have: We compute the objective function value in (10) for this solution and compare that of When Y = 0, we achieve the equality. We have: Clearly, it is equivalent to the fact that Y = 0 is an optimal solution of (11).
The result of Proposition 2 shows the similarity between the norm ratio minimization problem (5) and the norm different minimization problem (6) with respect to the implementation of the DCA. It is indeed that the problem (6) is a DC optimization problem with the objective function as the difference between the dual Ky Fan 2-k-norm and the Frobenius norm together with a convex feasible set. Given that the linear approximation of the Frobenius (6) can be described as Algorithm 2 as follows.

Algorithm 2 (DCA-D)
Step 1. Start with some X 0 such that A(X 0 ) = b and set s = 0.
Step 2. Update X s+1 = X s + Y , where Y is an optimal solution of the following convex optimization problem Step 3. Set s ← s + 1 and repeat Step 2. Note that in (12), we change the variable from X to Y = X − X s , which results in the constraints A(Y ) = 0 for Y given that A(X s ) = b. Now, it is clear that X s is a critical point for the problem (6) if and only if Y is an optimal solution of (12). Both problems (11) and (12) can be written in the general form as where α(X) = |X| k,2 (12), respectively. Given that A(X s ) = b, these problems can be written as The following proposition shows that X s is a critical point of the problem (14) for many functions α(·) if rank(X s ) ≤ k.

Proposition 3 If rank(X s ) ≤ k, X s is a critical point of the problem (14) for any function α(·) which satisfies
Thus for all Y , the following inequality holds: It implies Y = 0 is an optimal solution of the problem (13) since the optimality condition is Thus X s is a critical point of the problem (14).
Proposition 3 shows that one can potentially use different functions α(·) such as α(X) = 1 |X| k, 2 for the sub-problem in the general DCA framework to solve the original problem.
We are going to experiment with different pre-determined functions α(·) in the numerical section. Now, the generalized sub-problem (14) is a convex optimization problem, which can be formulated as a semidefinite optimization problem given the following calculation of the dual Ky Fan 2-k-norm provided in [7]: In order to implement the DCA, one also needs to consider how to find the initial solution X 0 . We can use the nuclear norm minimization problem (1), the convex relaxation of the rank minimization problem, to find X 0 . A similar approach is to use the following dual Ky Fan 2-k-norm minimization problem to find X 0 given its low-rank properties: This initial problem can be considered as an instance of (14) with X 0 = 0. Given that X 0 , X = 0, one can set any value for α(0), e.g., α(0) = 1. It is equivalent to starting the iterative algorithm with X 0 = 0 one step ahead. All of these instances of (14) can be solved with a semidefinite optimization solver, which will not be efficient for large problem instances. In the next section, we will develop a primal-dual proximal point algorithm to solve (14).

Proximal point algorithm
Liu et al. [15] proposed a proximal point algorithm (PPA) framework to solve the nuclear norm minimization problem (1). We will develop a similar algorithm to solve the following general problem where C ∈ R m×n is a parameter, of which (14) is an instance. This section is divided into four subsections as follows.
1. We first write down the dual problem, its Moreau-Yoshida regularization, and finally a spherical quadratic approximation of the smooth term. Deriving the formulation of normplus-spherical is the main task needed in order to use the PPA framework. 2. Next, we argue by symmetry that the norm-plus-spherical subproblem can be rewritten as a vector-norm optimization problem involving only the largest singular values of a certain matrix. 3. This vector-norm optimization problem can be solved in closed form. This requires several technical results whose proofs are deferred to Appendix A. 4. Finally, we summarize all the computations of the PPA method.

Deriving the norm-plus-spherical subproblem
The steps involved in setting up the PPA framework are formulating the dual, its Moreau-Yoshida regularization, and the quadratic approximation thereof. The dual problem can be written simply as max z g(z), where g is the concave function defined by The Moreau-Yoshida regularization of g can be computed by applying the strong duality (or minimax theory) result in Rockafellar [22]: The inner optimization problem can be solved easily with y * = z + (b − A(X)). Thus we can write down the Moreau-Yoshida regularization of g as follows: where The function λ (·; z) is convex and differentiable with the gradient In order to find the proximal point mapping p λ associated with g, we need to find an optimal solution X * for the inner minimization problem in (21) For a fixed z, the inner minimization problem we need to consider is where P(X) = λ (X; z). As discussed in Liu et al. [15], we will use an accelerated proximal gradient (APG) algorithm for this problem. The APG algorithm in each iteration needs to solve the following approximation of the sum f (X) + P(X) at the current solution Z: This problem has a unique solution S t (Z) given the strong convexity of the function. We have: which means S t (Z) is the minimizer of the problem: which is an instance of the following optimization problem where λ = 1/t > 0 and Y = G t (Z)− 1 t C are parameters. For the nuclear norm minimization problem, the dual Ky Fan 2-k-norm is replaced by the nuclear norm and the resulting problem can be solved easily with the SVD decomposition and the shrinkage operator (see, e.g., Liu et al. [15] and references therein).

Reduction of the norm-plus-spherical problem to a convex vector-norm problem
Using the orthogonal invariance of | · | * k,2 , we first prove that the optimal solution X * of (24) can be constructed from the optimal solution x * of the following optimization problem: where x * k,2 is the dual of the Ky Fan 2-k-vector norm of x, 1 2 , with |x| (i) is the (n − i + 1)-th order statistic of |x|. · * k,2 is the corresponding symmetric gauge function of | · | * k,2 and the above result which we shall prove can actually be applied to any orthogonally invariant matrix norm | · | and its corresponding symmetric gauge function · . We start with the following lemma.

Lemma 1
Consider the optimal solution x * of the following optimization problem: where · is a symmetric gauge function. The following statements are true.
Proof (i) If x * 0, without loss of generality, let us assume that x * 1 < 0. Construct the solution x from x * by inverting the sign of the first element, x 1 = −x * 1 > 0. We have: x = x * and x − y 2 ≤ x * − y 2 . Thus x = x * is also an optimal solution of (26), which contradicts the uniqueness of the optimal solution x * . Thus we have x * ≥ 0.
(ii) Assume that x * i < x * j , we construct another solution x as follows.
We have: x = x * since . is a symmetric gauge function. We have: Thus x = x * is also an optimal solution of (26), which contradicts the uniqueness of the optimal solution x * . Thus we have x * i ≥ x * j .
We can now state the main result for the optimization problem with the general orthogonally invariant norm |.| , Proposition 4 If Y = UDiag(σ (Y ))V T is a singular decomposition of Y , then the optimal solution X * of (27) can be calculated as where x * is the optimal solution of (26) with y = σ (Y ).
Proof Problem (27) with the general |.| has a unique optimal solution X * that satisfies the following optimality condition: Similarly, the optimal solution x * satisfies the condition We have: y = σ (Y ) ≥ 0. Thus according to Lemma 1, x * ≥ 0. Similarly, y 1 ≥ . . . ≥ y n implies that x * 1 ≥ . . . ≥ x * n . Now let us consider X = UDiag(x * )V T , we have: σ (X) = x * . According to Ziȩtak [26], the subgradient of |.| is where X =ŪDiag(σ (X))V T is any singular value decomposition of X. We have: X satisfies the optimality condition, thus UDiag(x * )V T is the (unique) optimal solution of (27).

Solution of the vector-norm problem
Proposition 4 shows that we can focus on (25). The following proposition is proved in Appendix A.

Proposition 5
The dual Ky Fan 2-k-vector norm can be computed using the following formulation: where We are now ready to solve the problem (25). The optimality condition for (25) can be written as follows: We shall focus on the case y = σ (Y ) ≥ 0 and assume that y 1 ≥ · · · ≥ y n ≥ 0. Using the formulation of · * k,2 stated in Proposition 5, we can compute the subgradient ∂ · * k,2 as follows. If y = 0, we have: If y = 0, compute i * ∈ [0, k − 1] using (29). For 1 ≤ j ≤ k − i * − 1, we have: Let assume that there are n 0 zeros among the last n − k + i * + 1 elements of y starting from (k − i * )-th element. For k − i * ≤ j ≤ n − n 0 , we have: Finally we have: where e n 0 ∈ R n 0 is the vector of all ones.
Using the formulation of the subgradient at x = 0 and the uniqueness of the optimal solution, we can show that if y k,2 ≤ λ, then x = 0 is the optimal solution of (25). Now consider the case y k,2 > λ. According to Lemma 1, the optimal solution of (25) is nonnegative, i.e., x ≥ 0. Assuming that the values of i * and n 0 have been given, the optimality conditions can then be written as follows: where α = x * k,2 /λ, m 0 = n − n 0 − k + i * + 1 and I m 0 ∈ R m 0 ×m 0 is the identity matrix. Using the Sherman-Morrison formula for the inverse of I m 0 + 1 (i * + 1)α e m 0 e T m 0 , we obtain the analytic formulation for x: x [ x [n−n 0 +1:n] = 0.
The unknown α > 0 satisfies the following equation: Given the relationship between x and y, we have: The function f (α) is a (strictly) decreasing function on [0, +∞); therefore, the necessary conditions are f (0) > 1 and f (α max ) < 1. If f (0) > 1 and f (α max ) < 1, the value of α can be found using the efficient binary search. Given the values of i * , n 0 , and α, the solution x can be computed completely. Note that the optimal solution x is unique, which guarantees that if all optimality conditions are satisfied and i * is correct according to (29), we indeed obtain the unique optimal solution x.

PPA summary
The analyses in the preceding paragraphs result in Algorithm 3 that can be used to solve the problem (25).
Algorithm 3 together with Proposition 4 allows us to solve the problem (24) using the SVD decomposition, which means that we can compute S t (Z) efficiently. The APG algorithm for solving (23) can then be described as follows. Given τ 0 = τ −1 = 1 and X 0 = X −1 , each iteration includes the following steps Step 2. Update X s+1 = S t s (Y s ) using Algorithm 3 and Proposition 4.
The additional parameter τ s and how it is updated in each step are necessary for the convergence analysis of the APG algorithm when t s is set to be L, the Lipschitz modulus of the gradient ∇ P(Y ) (see, e.g., Beck and Teboulle [2]). We now can write down the proximal point algorithm (PPA) to solve the problem (18) as follows:
The proximal parameter λ s affects the convergence speeds of both inner optimization problem (37) and the whole PPA algorithm. Similar to the discussion in Doan et al. [6], one can initially fix λ 0 and increase λ s only when the outer algorithm converges too slowly as compared to the inner algorithm. One can use the relative convergence measure rel s for the outer algorithm and rel s i = for the inner algorithm in deciding when to increase λ s . This PPA algorithm can then be used in Step 2 of the proposed DCA to solve the problem (10), which is an instance of (18). We are now ready to provide numerical results to demonstrate the effectiveness of the proposed approach.

Numerical results
In this section, we are going to discuss the proposed general DCA framework both in terms of computation and recoverability. Similar to Candès and Recht [4], we construct the following the experiment. We generate M, an m × n matrix of rank r , by sampling two m × r and n × r factors M L and M R with i.i.d. Gaussian entries and setting M = M L M R . The linear map A is constructed with s independent Gaussian matrices A i whose entries follows N (0, 1/s), i.e., We now start with computational settings for the proposed DCA framework.

Computational settings
The proposed general DCA framework allows us to choose the function α(·) for the subproblem as well as the initial solution. In order to test different functions α(·), we generate K = 50 matrix M with m = 50, n = 40, and r = 2. The dimension of these matrices is d r = r (m + n − r ) = 176. For each M, we generate s matrices for the random linear map with s = 200. We set the maximum number of iterations of the algorithm to be N max = 100 together with the tolerance of = 10 −6 for X s+1 − X s F . For this experiment, the instances are solved using SDPT3 solver [23] for semi-definite optimization problems in Matlab with CVX [9]. The computer used for these numerical experiments is a 64-bit Windows 10 machine with 3.70GHz quad-core CPU, and 32GB RAM. We run the proposed DCA with k = r = 2 and three different pre-determined functions, α 1 (X) = 1 X F , Fig. 1 Computational performance of the DCA under different settings . We use the solution obtained from the nuclear optimization formulation (1) as the initial solution for the proposed DCA in this experiment. Figure 1a shows that all three functions α(·) perform similarly for most of the instances. However, for difficult instances which require a large number of iterations to converge, α 3 (·) performs the worst as compared to the other two. There is one instance with which the DCA with α 3 (·) does not converge after the maximum number of iterations. Given that α 1 (·) performs slightly better than α 2 (·) in all tested instances, we are going to use α 1 (·) in subsequent experiments.
The next experiment is used to test the options of initial solutions. In this experiment, we vary s from 180 to 500 and generate K = 50 instances for each value of s. We run the two variants of the proposed PCA, k2-nuclear with initial solution obtained from (1) and k2-zero with initial solution X 0 = 0 as discussed in Sect. 3.1. Figure 1b shows that two variants perform similarly in terms of average number of iterations for most sizes of the random linear map. k2-nuclear is better when the size of the linear map is large, which can be explained by the recoverability of the convex relaxation approach nuclear using the nuclear optimization formulation (1). We shall therefore use k2-nuclear in comparison with other approaches including nuclear in the next section.

Recoverability
In order to compare the proposed approach with others, we again generate K = 50 random instances with m = 50, n = 40, and r = 2 for different sizes of the linear map ranging from 180 to 500. In order to check whether the matrix M is recovered, we use the relative error X − M F M F as the performance measure. The matrix M is considered to be recovered suc- where the threshold t is set to be 10 −6 for these experiments.
We compare our proposed approach k2-nuclear with nuclear and t-nuclear, an iterative algorithm proposed by Hu et al. [10] for the TNN method. Based on survey paper by Nguyen et al. [18], we also consider the alternating optimization approach proposed by Jain et al. [13] which is used to solve the following equivalent bilinear non-convex optimization problem: Another approach that we consider is the greedy heuristic of atomic decomposition for minimum rank approximation (admira) proposed by Lee and Bresler [14], which aims to find k rank-one matrices representing the original matrix. Finally, in addition to nuclear, we consider schatten approach proposed by Mohan and Fazel [17] to solve the Schatten p-norm minimization problem with p = 1/2, one of the non-convex relaxation methods for which k is a priori unknown (see, e.g., Hu et al. [11]). The tolerance is again set at = 10 −6 for X s+1 − X s F and the maximum number of iterations is set higher atN max = 10000 for these methods given that their (least-squares) subproblems require less computational efforts. Figure 2a shows recovery probabilities given different sizes of the linear map. The results show that the proposed algorithm can recover exactly the matrix M with 100% rate when s ≥ 200 while the nuclear norm approach cannot recover any matrix at all, i.e., 0% rate, if s ≤ 300. Both admira and schatten approaches cannot recover any matrix even though both of them converges with not very large number of iterations as shown in Fig. 2b. This result indicates that the greedy heuristic does not work well for exact recovery. Note that the schatten approach does not solve the non-convex Schatten p-norm optimization problem exactly, which might explain why it also does not perform well. The bilinear approach is better than the nuclear norm approach with 100% recovery rate when s ≥ 300. However, for small s, its recovery probability drops significantly to 2% for s = 190 and 0% for s = 180. The t-nuclear performs comparably with our proposed approach with 100% rate for s ≥ 200. For s = 190 and s = 180, our approach achieves better recovery rate of 98% and 52% respectively, as compared those of 92% and 44% for the t-nuclear approach. For all these three approaches, the average number of iterations increases significantly, especially for the bilinear approach, when the size of the linear map decreases. Both our approach and t-nuclear have similar average numbers of iterations. It is interesting to see that we only need 2 extra iterations when s = 250 or 1 extra iteration on average when s = 300 to obtain 100% recover rate when the nuclear norm optimization approach still cannot recover any of the matrices (0% rate). The average numbers of iterations for bilinear are much higher, which implies that it has a much slower convergence rate. On the other hand, the computation time per iteration for bilinear is much lower than those for nuclear, t-nuclear, and our proposed approach. It can be explained by the fact that one needs to solve matrix norm minimization problems using SDPT3 in each iteration of these approaches why the solutions of least-squares subproblems of bilinear can be computed directly in Matlab. The computation times per iteration are shown in Fig. 3a while the average computation times are shown in Fig. 3b. It shows that bilinear is much more efficient when the size of the linear map is large. On the other hand, its average computation time increases significantly when the size of the linear map decreases. As compared with t-nuclear, the average computation time of our approach is higher even though the numbers of iterations are similar. One explanation might be that the dual Ky Fan 2-k-norm optimization problem is more difficult to solve than the nuclear norm optimization problem with the SDPT3 solver. We are going to test the proposed proximal point algorithm that can be used to solve the dual Ky Fan 2-k-norm optimization problem (14) for our proposed approach in the next section.
To conclude this section, we provide some additional recoverability tests with different rank settings including r = 1, r = 5 and r = 10. Note that when k = r = 1, the dual Ky Fan 2-k-norm is the nuclear norm and the objective function in (6) is X * − X F while that of the TNN method is X * − X , whose sub-problems are both nuclear norm optimization problems. We run the experiments with the following settings of (r , s): (1,95), (2,180), (5,440), and (10, 820), in which the size of the linear map is quite close to the dimension d r . For all ranks, bilinear approach does not recover any matrix with this size of the linear map and the average number of iterations is very close to the maximum number ofN max = 10000. Figure 4a shows the recovery probabilities of our proposed approach  and t-nuclear. The results indicate that our proposed approach is slightly better than t-nuclear in general in terms of recovery with these small sizes of the linear map. It is interesting to see that there are instances with which one approach can recover the original matrix exactly while the other cannot. Figure 4b shows those instances for r = 2 (maximum number of iterations in general indicates no recovery and vice versa under these settings). Figure 5a shows that the average numbers of iterations of our proposed approach are slightly higher than those of t-nuclear. The average computation times of our proposed approach are also higher as shown in Fig. 5b. When r = 1, the average computation time per iteration is still higher even though both sub-problems are nuclear norm minimization problems (10.54 seconds vs. 4.49 seconds). It might be explained by the fact that the sub- Fig. 6 Computational performance of k2-zero(sdp) and k2-zero(ppa) problem in our approach uses the general dual Ky Fan 2-k-norm for k = r = 1 whereas that in t-nuclear uses the function norm_nuc in CVX. In the next section, we are going to test whether the proposed proximal point algorithm can be used to solve subproblems in our proposed approach more efficiently.

Proximal point algorithm
In this section, we are going to test the proposed proximal point algorithm used to solve the subproblem (14). We start with the same initial setting of m = 50, n = 40, and r = 2 for the original matrix M and s = 200 for the random linear map. Instead of k2-nuclear, we run the proposed algorithm k2-zero with initial solution X 0 = 0, which is easier to set without the need of solving the nuclear norm optimization problem. We use the proximal point algorithm (ppa) for the sub-problems and compare it with the original algorithm which uses the interior-point method (sdp) for the semidefinite optimization formulation of the subproblem (14) for K = 50 instances. The tolerance is again set to be = 10 −6 and the maximum number of iterations is N max = 100. The tolerance for the proximal point algorithm is set to be p = 10 −8 with the maximum number of iterations of N p max = 100. We set λ 0 = 10 and will increase λ s+1 = 1.25 · λ s if rel s o > 5 ·rel s i . Figure 6a shows that in general the algorithm k2-zero(ppa) requires more iterations to converge than k2-zero(sdp). In terms of computation time, the algorithm k2-zero(ppa) is definitely better than k2-zero(sdp) in all instances. Figure 6b shows the average computation times for both variants of the proposed algorithm. In terms of accuracy, we plot the performance measure of relative error in Fig. 6c. The algorithm k2-zero(sdp) indeed has much higher accuracy for all instances. On the other hand, the algorithm k2-zero(ppa) also performs consistently and achieves relative errors below the threshold of = 10 −6 for all instances, which is significant given that the proximal point algorithm only uses first-order information.
Next, we will run the algorithm k2-zero(ppa) for larger instances. We set m = n and vary it from 50 to 500 with r = 5 and s = 1.10 · d r , where d r = r (m + n − r ). Note that for m = 500, the decision matrix X has the size of 500 × 500 and there are approximately 5500 dense linear constraints in X, which requires a substantial amount of memory for the representation of instance data and the execution of the algorithm. For these instances, the tolerance is set to be = 10 −4 for , which is more consistent with respect to different instance sizes. The maximum number of iterations is kept at N max = 100 as before. For the proximal point algorithm, we set the tolerance p = 10 −6 and the maximum number of iterations N p max = 100. Given these tolerances, the algorithm converges after approximately 14 iterations for all instances with 12 and 16 as the minimum and maximum, respectively. The relative errors are approximately 10 −4 , which are shown in Fig. 7a for different matrix sizes. The computation time per iteration increases significantly from 2.5 to 10 4 seconds when the matrix size increases from m = 50 to m = 500 as shown in Fig. 7b.

Conclusion
We have proposed non-convex models based on the dual Ky Fan 2-k-norm for low-rank matrix recovery and developed a general DCA framework to solve the models. The computational results show that the proposed approach achieves high recoverability as compared to the truncated nuclear norm approach by Hu et al. [10] as well as the bilinear approach proposed by by Jain et al. [13], especially when the size of the linear map is small. We also demonstrate that the proposed proximal point algorithm is promising for larger instances.
which implies that x * with x * 1 = x * 2 is also the optimal solution of (42) with the original parameters a 1 and a 2 .
The structural property of the optimal solution of (41) can now be stated as follows. an optimal solution x * of (41) that satisfies the condition x * j = x * k for all j ≥ k − i.
Proof We prove the statement by induction. Let i = 1, consider an optimal solution x * of (41) and the following optimization problem: Applying Lemma 3, we obtain the optimal solutionx k−1 =x k . We also have: is also an optimal solution. Now assume the statement is true for i < k − 1, we will prove that the statement is true for i + 1. We have: y k−i−1 < 1 i + 1 k j=1 x 2 j ≤ 1, which can be used to solve the original problem. It is equivalent to the following problem: x 2 j + (i + 1)x 2 k ≤ 1, Similar to the case i = 1, we can construct the problem with two decision variable x k−i−1 and x k from an optimal solution of the problem above and prove that the solution isx k−i−1 =x k . Thus the statement is true for i + 1, which means the statement is true for all i by induction.
Proof of Prop. 5. Without loss of generality, let consider y that satisfies the condition y 1 ≥ . . . ≥ y n ≥ 0. According to Lemma 4, with i * defined in (29), there exists an optimal solution x * of (40) that satisfies x * i = x * k−i * for all i ≥ k − i * + 1 and all values can be found by solving the following optimization problem with the Cauchy-Schwartz inequality: max k−i * −1 j=1 y j x j + (i * + 1) Note that the index i = 0 always belongs to the considered set of indices. Thus we have: which concludes the proof given that the above formulation can be easily extended for an arbitrary vector y.