Robust principal component analysis using facial reduction

We introduce a novel approach for robust principal component analysis (RPCA) for a partially observed data matrix. The aim is to recover the data matrix as a sum of a low-rank matrix and a sparse matrix so as to eliminate erratic noise (outliers). This problem is known to be NP-hard in general. A classical approach to solving RPCA is to consider convex relaxations. One such heuristic involves the minimization of the (weighted) sum of a nuclear norm part, that promotes a low-rank component, with an ℓ₁ norm part, to promote a sparse component. This results in a well-structured convex problem that can be efficiently solved by modern first-order methods. However, first-order methods often yield low accuracy solutions. Moreover, the heuristic of using a norm consisting of a weighted sum of norms may lose some of the advantages that each norm had when used separately. In this paper, we propose a novel nonconvex and nonsmooth reformulation of the original NP-hard RPCA model. The new model adds a redundant semidefinite cone constraint and solves small subproblems using a PALM algorithm. Each subproblem results in an exposing vector for a facial reduction technique that is able to reduce the size significantly. This makes the problem amenable to efficient algorithms that obtain solutions of high accuracy. We include numerical results that confirm the efficacy of our approach.


Introduction
Principal component analysis (PCA) seeks a low-rank approximation to an input data matrix. It is arguably the most popular statistical tool in data analysis and is very effective for handling the effect of small random Gaussian noise. PCA can be accomplished easily via the singular value decomposition (SVD). However, it is also well known that PCA lacks robustness with respect to large erratic noise, i.e., a single corrupted entry can result in an approximation that is far from the true solution. Robust PCA has been proposed to remove the effect of sparse gross errors. In e.g., Candes et al. (2010), Chandrasekaran et al. (2009), Cheng et al. (2015), the intractable RPCA problem and its convex relaxation are introduced. It is shown that the two problems are equivalent, with high probability, under certain conditions on the input data. Specifically, RPCA aims to express a given data matrix Z ∈ ℝ^{m×n} as the sum Z = L* + S*, where L* is the low-rank approximation to Z, and S* is a sparse matrix that captures the additive erratic noise. Throughout this paper, we assume that r = rank(L*) ≪ min(m, n), and that λ > 0 is a given positive weight parameter. RPCA is in a sense a multicriteria (two-criteria) problem in that it attempts to find a low-rank approximation as well as a sparse approximation. Finding Pareto optimal points can be formulated as the following weighted optimization problem:

(1)  min rank(L) + λ‖S‖₀  s.t.  L + S = Z,

where the cardinality function ‖S‖₀ counts the number of nonzeros of S. This problem is NP-hard due to the combinatorial nature of the rank and the cardinality functions. It is shown in e.g., Candes et al. (2010), Chandrasekaran et al. (2009), that under certain conditions (1) is equivalent, with high probability, to the following convex program:

(2)  min ‖L‖_* + λ‖S‖₁  s.t.  L + S = Z.

Here ‖L‖_* is the nuclear norm of L, the sum of the singular values of L, and the (vector) ℓ₁-norm is ‖S‖₁ := ∑_{ij} |S_ij|. The convex program (2) is known as robust principal component pursuit, RPCP.
In practice, it is possible that the matrix Z is only partially observed. That is, there exists a subset Ê of the indices such that only the entries Z_ij for (i, j) ∈ Ê are given. In this case, RPCA (1) and RPCP (2) need to be changed, respectively, to

(3)  min rank(L) + λ‖S‖₀  s.t.  P_Ê(L + S) = z,

and

(4)  min ‖L‖_* + λ‖S‖₁  s.t.  P_Ê(L + S) = z,

where P_Ê : ℝ^{m×n} → ℝ^Ê denotes the projection onto the components with indices in Ê, and the data z = P_Ê(Z). The convex relaxations (2) and (4) can be reformulated as semidefinite programming problems (SDP), e.g., Recht et al. (2010), Candes et al. (2010). Thus, they can be solved by e.g., interior point methods. However, RPCA problems arising from real applications are usually of huge scale, and interior point methods generally do not scale well. As a result, current research on algorithms for RPCA and RPCP is focused on first-order methods; see e.g., the recent survey papers (Bouwmans and Zahzah 2014; Aybat 2016; Ma and Aybat 2018). But it is also known that first-order methods generally cannot provide high accuracy solutions.

Main contributions
In this paper we propose a new approach to solve RPCA with partially observed data, (3). We use semidefinite programming and a novel cone facial reduction (FR) technique applied to a reformulation of (3). As a result, the size of the feasible region is significantly reduced and an efficient algorithm is used to obtain highly accurate solutions. This extends the approach in Huang and Wolkowicz (2018) for low-rank matrix completions where no sparse gross errors are considered. In addition, our facial reduction technique is applied to the nonconvex model (3) directly, rather than within a nuclear norm minimization model as done in Huang and Wolkowicz (2018). A key ingredient here is the use of the exposing vector approach for characterizing faces of the semidefinite cone, S^n_+; see Huang and Wolkowicz (2018). Here we develop a new technique to grow bicliques by submatrix completion. Our tests show that this new technique greatly increases the speed of the completion process and allows us to find more missing entries. In particular, we see that in many cases, and even under relatively low density for the sampled data, we get exact recovery without using any SDP solver. This provides a distinct improvement over simply using the convex relaxation approach, RPCP.

Connections and differences with previous work
Our work is an extension of the previous work in Huang and Wolkowicz (2018), which only considers facial reduction on the convex approximation of the rank minimization problem. In this paper, we show in Lemma 1 that the nonconvex RPCA problem (3) is equivalent to the nonconvex SDP problem (5). And in Proposition 1 we obtain the exposing vectors for the optimal face of the nonconvex SDP problem, rather than for the convex approximation as in Huang and Wolkowicz (2018). Note that our goal is to solve the nonconvex problem in order to recover the true matrices. It is clear after facial reduction that the optimal solution lies in a smaller space. Hence the convex relaxation of the now smaller-size problem generally has a better chance of recovering the true matrices than the convex relaxation of the original large-size problem. This is validated by the numerical tests in Sect. 6. Secondly, we extend the method in Huang and Wolkowicz (2018) to include the sparse component S, and we show that we can reduce the size of S using exposing vectors obtained from the low-rank component L. Thirdly, we propose a novel approach for growing bicliques in Sect. 5. This technique for growing bicliques is much more efficient than the previous work in the literature.

Organization
Further background on RPCA is given in Sect. 2; this includes SDP reformulations and graph representations. In Sect. 3 we discuss how to obtain accurate solutions for small fully sampled submatrices. This is key to obtaining highly accurate exposing vectors for the faces used in reducing the size of the complete problem, while still maintaining low error growth that could arise due to noise. Section 4 presents the main results used in the paper. This includes facial reduction, bicliques and exposing vectors, as well as the algorithm for solving the RPCA problem accurately. In Sect. 5 we discuss a heuristic for growing the size of bicliques using a submatrix approach. We present the numerical results in Sect. 6 for both noiseless and noisy data. We conclude in Sect. 7.

Background
In this paper we focus on the nonconvex and nonsmooth model RPCA in (3). Our approach starts with a reformulation using an additional semidefinite cone constraint.

Reformulating RPCA with SDP
In this section we show that (3) is equivalent to the following nonconvex and nonsmooth SDP problem:

(5)  min rank(Y) + λ‖S‖₀  s.t.  P_Ê(L + S) = z,  Y = ( Y₁₁  L ; Lᵀ  Y₂₂ ) ∈ S^{m+n}_+,

where L ∈ ℝ^{m×n} denotes the upper right corner block of the lifted variable Y.

Lemma 1 Problems (3) and (5) are equivalent in the sense that they have the same optimal solution pair (L*, S*) and the same optimal objective value. In particular, the optimal L* is a submatrix of the optimal Y*.
Proof Suppose (L*, S*) is the optimal solution of (3) with rank(L*) = r < min(m, n), and the compact SVD of L* is given by L* = UΣVᵀ, where both U ∈ ℝ^{m×r} and V ∈ ℝ^{n×r} have orthonormal columns, and Σ ∈ ℝ^{r×r} is a diagonal matrix with the positive singular values of L* on its diagonal. We now see that this is equivalent to (Y*, S*) being an optimal solution pair of (5) with

Y* = ( U ; V ) Σ ( U ; V )ᵀ = ( UΣUᵀ  UΣVᵀ ; VΣUᵀ  VΣVᵀ ),

from which we get that rank(Y*) = rank(L*) = r, and L* is the upper right corner block of Y*. This shows that for an optimal Y* we have rank(Y*) = rank(L*) = r; and it emphasizes that the only change in the two problems is the addition of the semidefinite constraint.
The equivalence follows from a contradiction argument: if there exists a better solution (Ŷ, Ŝ), with the spectral decomposition of Ŷ given by Ŷ = ( Û ; V̂ ) Σ̂ ( Û ; V̂ )ᵀ, then (ÛΣ̂V̂ᵀ, Ŝ) is feasible for (3) and we have

rank(ÛΣ̂V̂ᵀ) + λ‖Ŝ‖₀ ≤ rank(Ŷ) + λ‖Ŝ‖₀ < rank(Y*) + λ‖S*‖₀,

where the first inequality is due to the fact that rank(ÛΣ̂V̂ᵀ) ≤ rank(Σ̂) = rank(Ŷ). This is a contradiction. ◻ Remark 1 Lemma 1 indicates that, in order to solve (3), we can solve (5) instead to obtain (Y*, S*). Then (L*, S*) solves (3), where L* is obtained from the upper right corner block of Y*. It appears that (5) is harder than (3), as it is much larger. However, the fact that the optimal L* and Y* are both of low rank enables us to reduce (5) greatly to a new problem whose size is much smaller than that of (3). This is done by pursuing the facial structure of the semidefinite cone in (5). This is the main motivation for our approach.
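The lift in Lemma 1 can be checked numerically: a small sketch (with NumPy, using a randomly generated rank-r matrix) verifies that Y built from the compact SVD is positive semidefinite, has the same rank as L, and contains L as its upper right block.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2

# Random rank-r matrix L and its compact SVD.
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
U, s, Vt = np.linalg.svd(L, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# Lift to Y = [U; V] Sigma [U; V]^T, a PSD matrix of the same rank.
UV = np.vstack([U, Vt.T])
Y = UV @ np.diag(s) @ UV.T

assert np.linalg.matrix_rank(Y) == r
# The upper right m x n block of Y recovers L.
assert np.allclose(Y[:m, m:], L)
# Y is positive semidefinite.
assert np.min(np.linalg.eigvalsh(Y)) > -1e-10
```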

Outline of our algorithm
We now present a brief outline of our algorithm, based on the graph representation of the partially observed matrix Z and the facial reduction of the reformulation (5). For a given partially observed data matrix Z ∈ ℝ^{m×n}, we aim to find the low-rank-plus-sparse decomposition Z = L* + S* with r = rank(L*) ≪ min(m, n). The main steps of our algorithm are as follows: (i) Associate a bipartite graph to Z using the observed entries, and find a biclique associated with a completely determined submatrix of Z. Denote this submatrix by Z̄ ∈ ℝ^{p×q}, where p, q are generally much smaller than m, n, respectively. (ii) Find the low-rank-plus-sparse decomposition of Z̄, i.e., Z̄ = L̄ + S̄. Note that since p ≪ m, q ≪ n, this problem is much easier than the original problem.
(iii) From L̄ we are able to find an exposing vector for a face of the SDP cone that contains an optimal Y* solving (5), i.e., an exposing vector of the optimal face. Note that L̄ is a submatrix of L* and thus a submatrix of Y*. (iv) We use the fact that the sum of exposing vectors of faces is an exposing vector of the intersection of the faces, see (9) below. We can then reformulate (5) into a new problem whose size is much smaller than that of (3). This new problem can then be solved efficiently and accurately.

Graph representation of a partially observed matrix Z
We associate a bipartite graph G_Z((U_m, V_n), Ê) to Z, whose node set corresponds to the union of the two sets of rows and columns of Z, and there is an edge (i, j) ∈ Ê, with i ∈ U_m and j ∈ V_n, if Z_ij is observed. Note that a biclique of G_Z corresponds to a submatrix of Z whose entries are all observed. To find a biclique of G_Z((U_m, V_n), Ê), we can relate it to finding cliques in the graph G obtained from G_Z by adding all edges between pairs of nodes on the same side, i.e., within U_m and within V_n. Suppose we find a non-trivial clique of G, denoted by C = {i₁, …, i_k}, whose cardinality k = p + q satisfies

(6)  K_min ≤ p + q ≤ K_max.

By removing the edges of C that have both nodes in {1, …, m} or both in {m + 1, …, m + n}, we obtain a biclique of G_Z. We use C to denote this biclique. This biclique then corresponds to a submatrix of Z, with the entries corresponding to edges in C kept and the other entries of Z removed. We use Z̄ to denote this matrix, whose size is p × q. Note that Z̄ is a submatrix of Z, and all entries of Z̄ are observed. Moreover, we generally keep the size of Z̄ much smaller than the size of Z, p ≪ m and q ≪ n.
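The clique–biclique correspondence can be illustrated with a small sketch: we build the auxiliary graph G on m + n nodes from an observation mask (the mask and index sets below are made-up toy data), and check that a fully observed submatrix yields a clique of G.

```python
import numpy as np

m, n = 6, 5
# Mask of sampled entries: a fully observed 3 x 2 block at rows
# {0, 2, 3} and columns {1, 4}, plus a few scattered samples.
observed = np.zeros((m, n), dtype=bool)
observed[np.ix_([0, 2, 3], [1, 4])] = True
observed[1, 0] = observed[4, 2] = True

# Auxiliary graph G on m + n nodes: all intra-side pairs adjacent,
# and node i adjacent to node m + j iff Z_ij is observed.
N = m + n
A = np.zeros((N, N), dtype=bool)
A[:m, :m] = True
A[m:, m:] = True
A[:m, m:] = observed
A[m:, :m] = observed.T
np.fill_diagonal(A, False)

def is_clique(A, nodes):
    """All distinct pairs in `nodes` must be adjacent in A."""
    return all(A[u, v] for i, u in enumerate(nodes)
               for v in nodes[i + 1:])

# The fully observed block corresponds to the clique
# {0, 2, 3} U {m + 1, m + 4} of G, i.e., a biclique of G_Z.
clique = [0, 2, 3, m + 1, m + 4]
assert is_clique(A, clique)
```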

Heuristics for finding cliques
Finding all the cliques of a graph is an NP-hard problem (Blair and Peyton 1993); the clique decision problem is one of Karp's 21 NP-complete problems (Karp 1972). Moreover, the cost of finding the SVD needed for each submatrix corresponding to each clique found becomes expensive for large cliques and large submatrices. Therefore, we use a heuristic algorithm proposed in Krislock and Wolkowicz (2010) and Drusvyatskiy et al. (2017, Algorithm 2) to efficiently find many small cliques. This heuristic algorithm is briefly outlined in Algorithm 1.
Algorithm 1 is very efficient in practice because it only goes through each node in the graph G once. In total, we need to add at most K_max|V| nodes, which corresponds to K_max|V| row operations on the adjacency matrix of the graph. Since it is a heuristic, it is not guaranteed to find all the cliques. However, if we need more cliques, we can apply a random permutation to V and use Algorithm 1 again.
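A minimal sketch in the spirit of this single-pass heuristic (not the exact Algorithm 1, whose details are in the cited references): each node seeds a clique that is greedily extended by nodes adjacent to every current member.

```python
import numpy as np

def grow_cliques(A, k_min=3, k_max=6):
    """Single greedy sweep over the nodes: each node seeds a clique
    that is extended by any node adjacent to every current member,
    up to size k_max. A heuristic sketch; it visits each node once
    and is not guaranteed to find all cliques."""
    cliques = []
    for v in range(A.shape[0]):
        clique = [v]
        for u in range(A.shape[0]):
            if u != v and len(clique) < k_max and all(A[u, w] for w in clique):
                clique.append(u)
        if len(clique) >= k_min:
            cliques.append(sorted(clique))
    return cliques

# Toy graph: a planted 4-clique {0, 1, 2, 3} and one stray edge.
A = np.zeros((6, 6), dtype=bool)
for u in range(4):
    for v in range(4):
        A[u, v] = u != v
A[4, 5] = A[5, 4] = True
assert [0, 1, 2, 3] in grow_cliques(A)
```

As the text notes, a random permutation of the node order can be applied before another sweep to find additional cliques.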

Decomposing the submatrix Z using PALM
From Sects. 2.3 and 2.4 we know that by finding a biclique of G_Z, we get a submatrix of Z, denoted by Z̄ ∈ ℝ^{p×q} with K_min ≤ p + q ≤ K_max, whose entries are all known (sampled). Now we want to find a low-rank-plus-sparse decomposition of Z̄. This problem can be formulated as

(7)  min (1/2)‖Z̄ − L̄ − S̄‖²_F  s.t.  rank(L̄) ≤ r̄, ‖S̄‖₀ ≤ s̄,

where r̄ and s̄ are given parameters that control the rank of L̄ and the sparsity of S̄, respectively. There are two reasons that (7) is much easier to solve than (3). The first reason is that all entries of Z̄ are known, while those in Z are only partially known. The second reason is that the size of Z̄ is much smaller than the size of Z. We adopt the proximal alternating linearization method (PALM) (Bolte et al. 2014) to solve (7). This is summarized in Algorithm 2.
By introducing indicator functions, (7) can be equivalently written as

(8)  min Ψ(L̄, S̄) := (1/2)‖Z̄ − L̄ − S̄‖²_F + ι(L̄ | rank(L̄) ≤ r̄) + ι(S̄ | ‖S̄‖₀ ≤ s̄),

where the indicator function ι(X | X) equals 0 if X ∈ X, and equals +∞ otherwise. It is easy to verify that the objective function Ψ(L̄, S̄) satisfies the so-called Kurdyka-Łojasiewicz property (Kurdyka 1998; Łojasiewicz 1963; Bolte et al. 2014).
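A minimal PALM sketch for the decomposition subproblem (7), assuming the least-squares form with rank and cardinality constraints; `palm_decompose` is a hypothetical helper, not the paper's Algorithm 2. Each half-step is a gradient step on the smooth coupling followed by projection onto the nonconvex constraint set: a truncated SVD for L̄ and keeping the s̄ largest-magnitude entries for S̄.

```python
import numpy as np

def palm_decompose(Z, r, s, iters=200):
    """PALM-style alternating steps for
    min 0.5*||Z - L - S||_F^2  s.t.  rank(L) <= r, ||S||_0 <= s.
    Step sizes are taken as the (unit) Lipschitz constants of the
    partial gradients; a sketch, not a tuned implementation."""
    L = np.zeros_like(Z)
    S = np.zeros_like(Z)
    for _ in range(iters):
        # L-step: gradient step, then project onto {rank <= r}.
        G = L + (Z - L - S)
        U, sig, Vt = np.linalg.svd(G, full_matrices=False)
        L = (U[:, :r] * sig[:r]) @ Vt[:r, :]
        # S-step: gradient step, then keep the s largest entries.
        G = S + (Z - L - S)
        S = np.zeros_like(G)
        if s > 0:
            idx = np.argsort(np.abs(G), axis=None)[-s:]
            S.flat[idx] = G.flat[idx]
    return L, S

# Toy instance: rank-1 plus two large spikes.
rng = np.random.default_rng(2)
Zbar = np.outer(rng.standard_normal(10), rng.standard_normal(10))
Zbar[1, 3] += 10.0
Zbar[7, 2] -= 10.0
L_hat, S_hat = palm_decompose(Zbar, r=1, s=2)
assert np.linalg.matrix_rank(L_hat, tol=1e-6) <= 1
assert np.count_nonzero(S_hat) <= 2
```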
As a result, we have the following convergence result for Algorithm 2 that follows directly from Bolte et al. (2014, Theorem 1).

Theorem 1
The sequence {L k ,S k } k∈ℕ generated by Algorithm 2 converges to a critical point of problem (7).
Suppose we solve (7) and obtain the optimal value zero, which guarantees a global optimum. To ensure that the correct L̄ is recovered, we need this global optimum to be unique. This happens with high probability when the rank of L̄ is much smaller than the size of the matrix and the incoherence conditions hold; see e.g., Candes et al. (2010), Chandrasekaran et al. (2009). Therefore, we make the following assumption throughout the remainder of the paper.
Assumption We assume that r ≪ min(p, q) and, in addition, that (7) has a unique global optimal solution L that is a submatrix of L * and has the same rank as L * .
The uniqueness also holds with higher probability when the sparsity parameter s̄ is smaller. Let D_S denote the density of the sparse matrix S*. We vary s̄ from 1 to max{1, pqD_S} to control the sparsity in problem (7). We also set the target rank r̄ = r. Thus we have the following algorithm:


Preliminaries on faces
We now present some of the geometric facts we need. More details can be found in e.g., Drusvyatskiy and Wolkowicz (2017), Pataki (2000), Krislock and Wolkowicz (2010). Recall that the set K is a proper convex cone if it is closed, convex, pointed, and has nonempty interior. The dual cone, K*, is defined by K* := {φ : ⟨φ, k⟩ ≥ 0, for all k ∈ K}. A subset F ⊆ K is a face of K, denoted F ⊴ K, if x, y ∈ K and x + y ∈ F imply x, y ∈ F. The conjugate face, F*, is defined by F* = F⟂ ∩ K*, where F⟂ denotes the orthogonal complement of F. A face F ⊴ K is an exposed face if there exists φ ∈ K* such that F = φ⟂ ∩ K; then φ is an exposing vector. Let T be a subset of the convex cone K; then face(T) is the smallest face of K containing T. It is known that a face of a face is a face, and an intersection of faces is a face. Essential for our algorithm is the following fact for finding an intersection of exposed faces F_i ⊴ K, i = 1, …, k, with exposing vectors φ_i, see Drusvyatskiy et al. (2017):

(9)  φ = ∑_{i=1}^k φ_i is an exposing vector for the intersection F = ∩_{i=1}^k F_i.

When K = S^n_+, the positive semidefinite cone, the facial structure is well understood. Faces are characterized by the ranges or nullspaces of the matrices in the face. Let X ∈ S^n_+ be rank r and let

X = ( P  Q ) ( D  0 ; 0  0 ) ( P  Q )ᵀ

be the (orthogonal) spectral decomposition with D ∈ S^r_{++} being a diagonal matrix. Then the smallest face of S^n_+ containing X is

(10)  face(X) = P S^r_+ Pᵀ.

The matrix QQᵀ is an exposing vector for face(X). Moreover, the relative interior satisfies relint face(X) = P S^r_{++} Pᵀ, i.e., the face and the exposing vectors are characterized by the eigenspace of any X in the relative interior of the face. For our application we use the following view of facial reduction and exposed faces. The result in (10) details the facial reduction process for the matrix completion problem using exposing vectors. More precisely, if B ⪰ 0 is a principal submatrix of the data and trace(VB) = 0, V ⪰ 0, V ≠ 0, then V provides an exposing vector for the image of the coordinate projection onto the submatrix. We can then complete V with zeros (adjoint of the coordinate projection) to get Y ∈ S^n_+, an exposing vector for F.
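The exposing-vector characterization can be checked numerically: for a rank-r PSD matrix X = PDPᵀ with P orthonormal, any orthonormal basis Q of Range(X)⟂ gives a PSD matrix QQᵀ with trace(XQQᵀ) = 0. A small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 6, 2

# A rank-r PSD matrix X = P D P^T with P having orthonormal columns.
P, _ = np.linalg.qr(rng.standard_normal((n, r)))
D = np.diag(rng.uniform(1.0, 2.0, r))
X = P @ D @ P.T

# Complete P to an orthonormal basis [P Q]; the QR factorization
# keeps the first r columns spanning Range(P), so Q spans Range(X)-perp.
full, _ = np.linalg.qr(np.hstack([P, rng.standard_normal((n, n - r))]))
Q = full[:, r:]
W = Q @ Q.T

# W exposes face(X): it is PSD and orthogonal to X.
assert abs(np.trace(X @ W)) < 1e-10
assert np.min(np.linalg.eigvalsh(W)) > -1e-10
```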
Define the triangular number, t(n) = n(n + 1)∕2 , and the isometry s2vec ∶ S n → ℝ t(n) that vectorizes the upper-triangular part of a symmetric matrix columnwise.
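A possible implementation of s2vec (the √2 scaling on the off-diagonal entries is what makes the map an isometry with respect to the trace inner product):

```python
import numpy as np

def s2vec(X):
    """Columnwise vectorization of the upper-triangular part of a
    symmetric matrix, off-diagonal entries scaled by sqrt(2) so that
    <X, Y> = s2vec(X) . s2vec(Y)."""
    n = X.shape[0]
    return np.array([X[i, j] if i == j else np.sqrt(2.0) * X[i, j]
                     for j in range(n) for i in range(j + 1)])

rng = np.random.default_rng(4)
n = 5
X = rng.standard_normal((n, n)); X = X + X.T
Y = rng.standard_normal((n, n)); Y = Y + Y.T
assert len(s2vec(X)) == n * (n + 1) // 2      # t(n) entries
assert np.isclose(s2vec(X) @ s2vec(Y), np.sum(X * Y))  # isometry
```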

Exposing vector
The following lemma shows a generic rank property of a matrix with its submatrix.
Lemma 2 (Generic rank property, Lemma 3.6 in Huang and Wolkowicz (2018)) Let r be a positive integer and let Z₁ ∈ ℝ^{m×r} and Z₂ ∈ ℝ^{n×r} be continuous random variables with i.i.d. entries. Set Z = Z₁Z₂ᵀ and let X ∈ ℝ^{p×q} be any submatrix of Z with min(p, q) ≥ r. Then rank(X) = r with probability 1.
Based on Lemma 2, we can assume that L̄ returned by Algorithm 3 has the same rank as the target low-rank matrix L* if min(p, q) > r, i.e.,

(11)  Z̄ = L̄ + S̄,  rank(L̄) = rank(L*) = r.

That is, we solved (7) to global optimality with objective value 0.
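Lemma 2 is easy to verify empirically: for a random rank-r product, any sufficiently large submatrix has rank exactly r.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 30, 25, 3
Z = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T

# Any p x q submatrix with min(p, q) >= r has rank exactly r
# (with probability 1 for continuous i.i.d. factors).
rows = rng.choice(m, size=5, replace=False)
cols = rng.choice(n, size=4, replace=False)
X = Z[np.ix_(rows, cols)]
assert np.linalg.matrix_rank(X) == r
```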
Proposition 1 Under Assumption 3, suppose that PALM (Algorithm 2) returns L̄ and S̄ such that (11) holds. Without loss of generality (after a permutation if needed), we further assume that the target low-rank matrix L* can be partitioned as

L* = ( L₁  L₂ ; L̄  L₃ ),

where L̄ ∈ ℝ^{p×q} and r ≤ min(p, q), and the SVD of L̄ is given by L̄ = P̄Σ̄Q̄ᵀ, with P̄ ∈ ℝ^{p×r} and Q̄ ∈ ℝ^{q×r} having orthonormal columns. Let Ū ∈ ℝ^{p×(p−r)} and V̄ ∈ ℝ^{q×(q−r)} be orthonormal bases of Range(P̄)⟂ and Range(Q̄)⟂, respectively. By adding appropriate blocks of zeros to ŪŪᵀ and V̄V̄ᵀ (after a permutation if needed), we get the matrix

W = ( W₁₁  0 ; 0  W₂₂ ) ∈ S^{m+n},

where W₁₁ ∈ S^m carries the block ŪŪᵀ in the rows of L̄ and zeros elsewhere, and W₂₂ ∈ S^n carries the block V̄V̄ᵀ in the columns of L̄ and zeros elsewhere. Then W is an exposing vector for face(Y*), i.e., trace(Y*W) = 0, where Y* is the optimal solution of (5).
Proof Lemma 1 shows the property rank(Y*) = rank(L*) = r. Without loss of generality, after a permutation if needed, we assume that Y* = ( P ; Q ) D ( P ; Q )ᵀ with D ∈ S^r_{++}, so that L* = PDQᵀ. Let P̃ and Q̃ denote the submatrices of P and Q formed from the rows corresponding to the rows and columns of L̄, respectively. Since P̃, P̄, Q̃, Q̄ all have only r columns and rank(L̄) = r, it follows that Range(P̃) = Range(P̄) = Range(L̄) and Range(Q̃) = Range(Q̄) = Range(L̄ᵀ). We conclude that P̃ᵀŪŪᵀ = P̄ᵀŪŪᵀ = 0 and Q̃ᵀV̄V̄ᵀ = Q̄ᵀV̄V̄ᵀ = 0, i.e., that ⟨Y*, W⟩ = 0. Therefore W is an exposing vector of the optimal face, the face that contains Y*, the optimal solution of (5). ◻
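The construction of W can be illustrated numerically; a sketch under the assumption that the recovered block sits in the top rows and last columns of L* (one admissible permutation), with Ū, V̄ taken from the full SVD of that block:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, r = 8, 7, 2
p, q = 4, 4          # size of the fully observed submatrix

# Rank-r target L*; place the recovered block Lbar in its top rows
# and last columns (a permutation of the Proposition 1 partition).
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
Lbar = L[:p, n - q:]

# Full SVD of Lbar; columns r: span the left/right nullspaces.
U, _, Vt = np.linalg.svd(Lbar)
Ubar = U[:, r:]                 # (p, p - r)
Vbar = Vt[r:, :].T              # (q, q - r)

# Pad Ubar Ubar^T and Vbar Vbar^T with zeros into the
# (m + n) x (m + n) exposing vector W.
W = np.zeros((m + n, m + n))
W[:p, :p] = Ubar @ Ubar.T
W[m + n - q:, m + n - q:] = Vbar @ Vbar.T

# Y* from the lift of Lemma 1 satisfies trace(Y* W) = 0.
Us, s, Vst = np.linalg.svd(L, full_matrices=False)
UV = np.vstack([Us[:, :r], Vst[:r, :].T])
Y = UV @ np.diag(s[:r]) @ UV.T
assert abs(np.trace(Y @ W)) < 1e-8
```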

Remark 2
We apply facial reduction to (5) (or equivalently (3)) rather than to the convex relaxation formulation for the following reason. Since the goal is to solve the nonconvex problem (5) to recover the true matrices, Proposition 1 shows that the exposing vector we construct exposes the optimal face of the nonconvex problem (5) itself. The optimal solution therefore lies in a smaller space, and we have a better chance of obtaining the optimal solution of the nonconvex problem (5). If we instead apply facial reduction to the convex approximation, we may not obtain an exposing vector of this optimal face.

Reducing problem size using FR
From Proposition 1, we get Algorithm 4 to find an exposing vector Y expo for face (Y * ).

Remark 3
We emphasize that the sizes of the bicliques we use are small. A large biclique means an expensive SVD when finding the corresponding exposing vector. Adding many exposing vectors results in one exposing vector corresponding to the union of the bicliques after completion.
From Y_expo we can also find a basis for Null(Y_expo), given by the columns of

(12)  V = ( V_P  0 ; 0  V_Q ),

where V_P ∈ ℝ^{m×r_p} and V_Q ∈ ℝ^{n×r_q}. We denote r_v = r_p + r_q < m + n. Therefore Y* (the optimal solution of (5)) can be expressed as

(13)  Y* = V R Vᵀ,  R = ( R_pp  R_pq ; R_pqᵀ  R_qq ) ∈ S^{r_v}_+.

From Lemma 1 and (13) we have rank(Y*) = rank(V_P R_pq V_Qᵀ) = rank(R_pq). Therefore, (5) reduces to the following problem, which has a much smaller size for the low-rank component:

(15)  min_{R_pq ∈ ℝ^{r_p×r_q}, S ∈ ℝ^{m×n}} rank(R_pq) + λ‖S‖₀  s.t.  P_Ê(V_P R_pq V_Qᵀ + S) = z.
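The size reduction rests on a simple fact: any PSD matrix Y with trace(Y·Y_expo) = 0 has its range inside Null(Y_expo), so Y = VRVᵀ for the much smaller variable R = VᵀYV. A numerical sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(9)
n, r_v, r = 10, 5, 2

Qf, _ = np.linalg.qr(rng.standard_normal((n, n)))
V = Qf[:, :r_v]                      # basis of Null(Y_expo)
W = Qf[:, r_v:] @ Qf[:, r_v:].T      # exposing vector Y_expo

# A PSD Y of rank r with range inside Null(W).
Y = V[:, :r] @ np.diag([3.0, 1.0]) @ V[:, :r].T
assert abs(np.trace(Y @ W)) < 1e-10

# Facial reduction: represent Y through the r_v x r_v variable R.
R = V.T @ Y @ V
assert np.allclose(V @ R @ V.T, Y)
```

The optimization variable shrinks from an n × n matrix to an r_v × r_v one, which is what makes the reduced problem (15) tractable.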

Further reducing the size of the problem
Note that the size of S in (15) can be further reduced. In the decomposition (11), we know that L̄ and S̄ are exactly recovered by PALM when it is successful. Therefore, a subset of the entries of S* has been successfully recovered, and we can remove this subset of entries from the linear constraints in (15). We let Ê_S denote the set of indices of entries of S* that are exactly recovered by PALM, and let Ê_{S^c} := Ê \ Ê_S, to get:

(16)  min_{R_pq ∈ ℝ^{r_p×r_q}, S ∈ ℝ^{Ê_{S^c}}} rank(R_pq) + λ‖S‖₀  s.t.  P_{Ê_S}(V_P R_pq V_Qᵀ) = P_{Ê_S}(Z − S*),  P_{Ê_{S^c}}(V_P R_pq V_Qᵀ) + S = P_{Ê_{S^c}}(Z).

Moreover, the number of linear constraints in (16) can, and should, be further reduced to remove any redundant linear constraints that arise. We use A and B to denote the matrix representations of the coefficient matrices in P_{Ê_S}(V_P R_pq V_Qᵀ) and P_{Ê_{S^c}}(V_P R_pq V_Qᵀ), respectively. If the ith row of B (denoted B_i) is in the row space of A, then B_i can be removed so that the size of B gets smaller. The algorithm that we use for this purpose is outlined in Algorithm 5.
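A minimal sketch of the redundancy-elimination step (`drop_redundant_rows` is a hypothetical helper, not the paper's Algorithm 5): a row of B is redundant exactly when it lies in the row space of A, which we test here with a least-squares residual.

```python
import numpy as np

def drop_redundant_rows(A, B, tol=1e-10):
    """Remove every row of B that lies in the row space of A.
    Solves A^T x = b_i in the least-squares sense; b_i is redundant
    iff the residual vanishes. A sketch; a QR-based test would be
    more efficient for many rows."""
    keep = []
    for b in B:
        x, *_ = np.linalg.lstsq(A.T, b, rcond=None)
        if np.linalg.norm(A.T @ x - b) > tol:
            keep.append(b)
    return np.array(keep).reshape(-1, A.shape[1])

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
B = np.array([[1.0, 1.0, 0.0],    # in rowspace(A): redundant
              [0.0, 0.0, 1.0]])   # independent: kept
kept = drop_redundant_rows(A, B)
assert kept.shape == (1, 3)
```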
Since FR typically results in very small values of r_p and r_q in (13), the first set of linear constraints in (16), i.e.,

(17)  P_{Ê_S}(V_P R_pq V_Qᵀ) = P_{Ê_S}(Z − S*),

is usually an overdetermined linear system. As a result, we can often obtain L* by solving this linear system alone.

Lemma 3 The target matrix L* satisfies the linear system (17).

Lemma 4 If L* is the unique optimal solution of (3), then (16) has a unique optimal solution R*_pq that recovers L*, i.e., L* = V_P R*_pq V_Qᵀ. When we have enough bicliques, our numerical tests generally recover L* successfully by solving the linear system (17). If we do not have enough bicliques, then we have to solve the nonconvex and nonsmooth problem (16). In this case, we can solve the following convex relaxation of (16):

(18)  min_{R_pq ∈ ℝ^{r_p×r_q}, s ∈ ℝ^{Ê_{S^c}}} ‖V_P R_pq V_Qᵀ‖_* + λ‖s‖₁  s.t.  the constraints of (16).

Standard first-order methods such as the alternating direction method of multipliers can be used to solve (18). Moreover, since the size of (18) is very small, we can also use interior point methods to solve it. The convex relaxation (18) after facial reduction is usually tighter than (4) and is more likely to successfully recover L*. We do not go into more detail in this paper.

Detecting the target rank
So far we have assumed that we know the target rank in advance, while in practice we may not know the target rank. We now assume that we know a range of the target rank and the density of the sparse part.
To obtain the correct target rank, we can sample many submatrices from the observed data and run Algorithm 6. We then choose the rank that appears most often. We tested Algorithm 6 on random instances with different sizes, ranks and densities of the sparse part; the results are reported in Table 1. For example, the third row of Table 1 presents average results for 100 instances of size 20 × 20, density D_S = 0.01 and original rank r = 2. Algorithm 6 found estimated ranks r = [1, 2, 3, 4, 5] in [0, 85, 12, 3, 0] percent of the runs, respectively. We conclude that Algorithm 6 estimates the correct rank in most cases.
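A sketch of the voting idea (not the paper's Algorithm 6): sample many small submatrices, estimate the numerical rank of each, and take the majority vote. For simplicity this toy version ignores sparse errors and partial observation.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(11)
m, n, r = 40, 40, 3
Z = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Sample random 6 x 6 submatrices and vote on their numerical rank.
votes = Counter()
for _ in range(30):
    rows = rng.choice(m, size=6, replace=False)
    cols = rng.choice(n, size=6, replace=False)
    sig = np.linalg.svd(Z[np.ix_(rows, cols)], compute_uv=False)
    # Numerical rank: singular values above a relative threshold.
    votes[int(np.sum(sig > 1e-8 * sig[0]))] += 1

assert votes.most_common(1)[0][0] == r
```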

Growing bicliques by submatrix completion
We observe that when there are not enough bicliques, the exposing vector Y_expo may have many zero rows and zero columns. Let J₁ be the set of indices of the nonzero rows and columns in the upper-left block of Y_expo, and let J₂ be the set of indices of the nonzero rows and columns in the bottom-right block of Y_expo. Consider the submatrix Z_{J₁,J₂} obtained by taking the rows of Z with indices in J₁ and the columns of Z with indices in J₂. The corresponding problem (16) is much smaller; moreover, since we have removed all the zero rows and columns, the exposing vectors now cover all the remaining rows and columns, i.e., those of Z_{J₁,J₂}. Hence we have a better chance of recovering Z_{J₁,J₂}; in fact, we often get a unique solution. After we recover the submatrix Z_{J₁,J₂}, we can add the indices of all its entries to the sampled index set Ê. The sampled index set Ê then contains a larger biclique J₁ × J₂ and has more elements. We then use Algorithm 1 again to find more bicliques. This process can be repeated until all the rows and columns of Z are recovered.
We illustrate the growth of bicliques and the exposing vector Y_expo with an example. We choose Z of size 200 × 200, sampling density 0.18, minimum biclique size 4 × 4, and target rank r = 2. The original low-rank matrix L is recovered after 3 iterations of our algorithm. The example is shown in Figs. 1 and 2. Our main algorithm for recovering the low-rank matrix is summarized in Algorithm 7.

Remark 4
The motivation to grow bicliques arises from the desire to improve efficiency. We note that Algorithm 1 is only a heuristic, and finding all the bicliques in a graph is an NP-hard problem. If not enough bicliques are identified, the linear system (17) may be underdetermined; we then have to solve the convex approximation, which can fail to recover the true low-rank matrix. The biclique-growing approach uses the existing bicliques to grow one big biclique. When we add this big biclique to the sampled elements and run Algorithm 1 again, we are able to find many more bicliques efficiently; the heuristic Algorithm 1 performs well when a big biclique exists. Our experiments show that this process is much more efficient than simply running Algorithm 1 many times. In fact, this process is critical for the success of the method: without the biclique-growing procedure the proposed method takes much longer and often fails to recover the target matrices when the sampling density is low.
Numerical experiments

Noiseless case
We applied our Algorithm 7 to random noiseless instances of (3). The input data were generated in the following manner. The low-rank matrix L* was integer valued and obtained as L* = round(L_r L_cᵀ), where L_r ∈ ℝ^{m×r} and L_c ∈ ℝ^{n×r} have entries from the normal distribution N(0, 10). The density of the sparse matrix S* ∈ ℝ^{m×n} is denoted by D_S, and its nonzero elements were integer valued, obtained by rounding uniformly distributed random elements. The matrix Z is set as Z = L* + S*. We then randomly sample elements of Z with a given density, i.e., the prescribed ratio of observed components of Z. We use L to denote the recovery returned by our Algorithm 7.
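A sketch of this instance generator; as a slight variant of the rounding construction above, we draw integer factors directly so the product is exactly rank r (rounding a real-valued product can perturb the exact rank).

```python
import numpy as np

rng = np.random.default_rng(13)
m, n, r, D_S, density = 50, 40, 2, 0.02, 0.3

# Integer low-rank part: integer factors keep the product exactly
# rank <= r (a variant of the paper's rounding construction).
L_star = rng.integers(-5, 6, (m, r)) @ rng.integers(-5, 6, (r, n))

# Sparse integer gross errors with density D_S.
S_star = np.where(rng.random((m, n)) < D_S,
                  rng.integers(10, 100, (m, n)), 0)

Z = L_star + S_star
observed = rng.random((m, n)) < density   # sampled index set E-hat

assert np.linalg.matrix_rank(L_star) <= r
```

Because all entries of L* are integers, exact recovery can be verified by checking ‖L − L*‖_F = 0, as in the success criterion used in the tables.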
For each set of data parameters (m, n, density), we randomly generated 20 instances and report the averaged performance in Tables 2 and 3. In these tables, D_S = 0.01, λ = c/√m where c is a chosen constant in [1, 10], the cpu times are in seconds, K_min denotes the smallest size of the cliques we found, K_max the largest, r = rank(L), and "Succ" denotes the number of successfully recovered instances out of the 20 randomly generated ones. Here we say that the matrix L* is successfully recovered if there is no difference between L and L*, i.e., ‖L − L*‖_F = 0. Note that this is possible because all entries of L* are integer valued. From these results we see that we can get exact recovery with very low density. Moreover, our algorithm is very efficient: for very large problems of size 5000 × 5000, we can solve it in about two minutes. We also compare our method with the fast alternating linearization method (FALM) proposed in Goldfarb et al. (2013). The results are reported in Table 4, where D_S = 0.01. From these results we see that after preprocessing by facial reduction, our reduced model (16) has a higher probability of recovering the low-rank matrix than solving the convex relaxation (4) by FALM. Moreover, our algorithm, although using a second-order interior point solver, is significantly faster than the first-order method FALM. Our code is available at http://www.math.uwaterloo.ca/~hwolkowi/henry/reports/Codeforthetables.
Recently there has been a trend to use matrix factorization to solve the low-rank problem, e.g., Sun and Luo (2016), Yi et al. (2016). Therefore we also compared our method with methods based on matrix factorization. Since Sun and Luo (2016) does not consider sparse gross errors, we only compare with the fast gradient descent method of Yi et al. (2016), denoted FGD. We note that only noiseless cases are considered in Yi et al. (2016); therefore we restrict this comparison to the noiseless case. In Yi et al. (2016), the authors make two assumptions about the data: (i) the low-rank matrix M is μ-incoherent; (ii) the entries of the sparse component S are "spread out", i.e., for α ∈ [0, 1), S contains at most an α-fraction of nonzero entries per row and per column. In this paper, we do not make these assumptions. Therefore we first compare our method with FGD when these assumptions hold, and then without them. The results are in Tables 5 and 6. From the results in the tables we see that the fast gradient descent method usually works better for large-scale problems when the two assumptions hold. However, when they do not hold, our method can recover the true low-rank matrix with a high success rate, while the fast gradient descent method fails to recover any of them.

Noisy case
In the noisy case, we consider Z = L + S + E, where L, S are no longer integer valued and the elements of the matrix E are random noise generated from a Gaussian distribution with standard deviation σ. We still assume S is sparse, with density D_S. The results are in Tables 7 and 8. Each row in the tables presents the average over 20 instances.
We see that when the noise level is small (σ = 10⁻⁴ or 10⁻⁵), the CPU time of our Algorithm 7 is significantly smaller than that of FALM, and the residual is smaller as well, which is a great advantage of our algorithm.
When the noise level is relatively large, our algorithm does not perform as well as FALM. This appears to be due to the failure to recover the correct submatrices L̄, S̄ from Z̄. In general, however, if Algorithm 7 succeeds in finding a solution with the correct target rank, then it has a smaller residual than the first-order method FALM. If Algorithm 7 fails to find a solution with the correct target rank, FALM is generally better. We conclude that we should use Algorithm 7 first and check whether the correct target rank is achieved with a satisfactory tolerance; if not, we should fall back on the first-order method FALM.
We note that there are popular algorithms, such as stochastic gradient descent (SGD) (Gemulla et al. 2011; Recht and Ré 2013; Zhuang et al. 2013) and block coordinate descent type methods (Pilászy et al. 2010; Yu et al. 2012), for solving large-scale low-rank matrix completion problems. The method proposed in this paper can be modified to solve low-rank matrix completion problems as well, and in the future it will be very interesting to compare our method with those methods on general low-rank matrix completion problems. Also, in Sun and Luo (2016) a regularization term is added to control incoherence in the iterates; it will be very interesting to study whether such a regularization term helps to improve the recovery quality for RPCA problems.

Conclusion
In this paper we have shown that we can apply a facial reduction approach, in combination with the nuclear norm heuristic, to efficiently solve the robust PCA problem with partially observed data. This exploits the implicit degeneracy at the optimal solution resulting from the low-rank and sparse structure. Specifically, whenever enough complete bipartite subgraphs are available in the data, we are able to find a proper face of the semidefinite cone that contains the optimal solution, resulting in a significant reduction in dimension. If we cannot find enough bicliques, the matrix can still be partially completed. Having an insufficient number of bicliques indicates that there is not enough initial data for our algorithm to recover the unknown elements. This is particularly true for large rank r, where larger bicliques are needed. Throughout we see that facial reduction both regularizes the problem and reduces its size, and often allows for a solution without any refinement or need for an SDP solver.
Our preliminary numerical results are promising, as we efficiently and accurately recover the solutions of large-scale problems. The numerical tests are ongoing, with improvements from using biclique algorithms rather than clique algorithms. At the current stage, we test only randomly generated examples. In many practical problems from the real world, for example surveillance video with moving objects, the gross errors are not sparse enough, or, in other words, the random noise is not significantly smaller than the sparse gross errors. To solve such practical problems, we need to further refine our algorithm to better separate the random noise from the sparse gross errors; this is part of our current ongoing work.
Theoretical results on exact recovery are discussed in many papers, e.g., Candes et al. (2010), Chandrasekaran et al. (2009). They use the so-called incoherence conditions, which are difficult to verify. It appears from our work above that exact recovery guarantees can be found from rigidity results in the graph of Z, i.e., in the number and density of the bicliques. Moreover, there are interesting questions on how to extend these results from the simple matrix completion to general solutions of linear equations, A(Z) = b , where A is some linear transformation.