Robust PCA via Regularized Reaper with a Matrix-Free Proximal Algorithm

Principal component analysis (PCA) is known to be sensitive to outliers, so various robust PCA variants have been proposed in the literature. A recent model, called reaper, aims to find the principal components by solving a convex optimization problem. Usually the number of principal components must be determined in advance, and the minimization is performed over symmetric positive semi-definite matrices having the size of the data, although the number of principal components is substantially smaller. This prohibits its use if the dimension of the data is large, which is often the case in image processing. In this paper, we propose a regularized version of reaper which enforces the sparsity of the number of principal components by penalizing the nuclear norm of the corresponding orthogonal projector. If only an upper bound on the number of principal components is available, our approach can be combined with the L-curve method to reconstruct the appropriate subspace. Our second contribution is a matrix-free algorithm to find a minimizer of the regularized reaper which is also suited for high-dimensional data. The algorithm couples a primal-dual minimization approach with a thick-restarted Lanczos process. This appears to be the first efficient convex variational method for robust PCA that can handle high-dimensional data. As a side result, we discuss the topic of the bias in robust PCA. Numerical examples demonstrate the performance of our algorithm.


Introduction
Principal component analysis (PCA) [37] realizes dimensionality reduction of data by projecting them onto the affine subspace that minimizes the sum of the squared Euclidean distances between the data points and their projections. Unfortunately, PCA is very sensitive to outliers, so that various robust approaches were developed in robust statistics [17,28,46] and in nonlinear optimization. In this paper, we focus on the latter. One possibility to make PCA robust consists in removing outliers before computing the principal components, which has the serious drawback that outliers are difficult to identify and other data points are often falsely labeled as outliers. Another approach assigns different weights to data points based on their estimated relevance, leading to a weighted PCA [20], or repeatedly estimates the model parameters from a random subset of data points until a satisfactory result, indicated by the number of data points within a certain error threshold, is obtained [11]. In a similar vein, least trimmed squares PCA models [38,41] aim to exclude outliers from the squared error function, but in a deterministic way. In [44], a dual principal component pursuit is used for this purpose. The variational model in [4] decomposes the data matrix into a low-rank and a sparse part. Related approaches such as [7,32,51] separate the low-rank component from the column-sparse one using different norms in the variational model. Another group of robust PCA methods replaces the squared L2 norm in the PCA model by the L1 norm [18]. Unfortunately, this norm is not rotationally invariant, i.e., when rotating the centered data points, the minimizing subspace is not rotated in the same way. Replacing the squared Euclidean norm in the PCA model by just the Euclidean one leads to a non-convex robust PCA model with minimization over the Stiefel or Grassmannian manifold, see, e.g., [9,26,31,34].
Instead of the previous model, which minimizes over the sparse number of directions spanning the low-dimensional subspace, it is also possible to minimize over the orthogonal projectors onto the desired subspace. This has the advantage that the minimization can be performed over symmetric positive semi-definite matrices, e.g., using methods from semi-definite programming, and the disadvantage that the projectors are now as large as the data dimension. This prohibits this approach for many applications, in particular in image processing. The projector PCA model is still non-convex, and a convex relaxation, called reaper, was recently proposed by Lerman et al. [27]. An extensive comparison of the benefits and drawbacks of the different approaches in the rich literature on robust PCA, as well as of the related numerical algorithms, can, for instance, be found in [26].
In this paper, we build on the advantages of the convex reaper model, but modify it in two important directions: (i) by penalizing the nuclear norm of the approximated projectors, our model only requires an upper bound on the dimension of the desired subspace. Having the same effect as the sparsity promotion of the 1-norm, the nuclear norm (the 1-norm of the eigenvalues) promotes low-rank matrices or, equivalently, sparse eigenvalue decompositions; (ii) by combining primal-dual minimization techniques with a thick-restarted Lanczos process, we are able to handle high-dimensional data. We call our new convex model rreaper. We provide all computation steps leading to an efficient and provably convergent algorithm for finding a minimizer of the objective function. A performance analysis following the lines of [27] is given. The choice of the offset in robust PCA is an interesting problem which has not been fully discussed in the literature so far. Usually, the geometric median is used. We do not provide a full solution of this issue, but show that under some assumptions the affine hyperplane in R^d having the smallest Euclidean distance to n > d given data points goes through d + 1 of these points. We underline our theoretical findings by numerical examples.
The outline of this paper is as follows: preliminaries from linear algebra and convex analysis are given in Sect. 2. In Sect. 3, we introduce our regularized reaper model. The basic primal-dual algorithm for its minimization is discussed in Sect. 4. The algorithm is formulated with respect to the full projection matrix. The matrix-free version of the algorithm is given in Sect. 5. It is based on the thick-restarted Lanczos algorithm and is suited for high-dimensional data. In Sect. 6, we provide a performance analysis of rreaper along the lines of [27]. Some results on the offset in robust PCA are proved in Sect. 7. The very good performance of rreaper, in particular for high-dimensional data, is demonstrated in Sect. 8. Further, the relation between the dimension of the wanted subspace and the regularization parameter is addressed via the L-curve method. Section 9 finishes the paper with conclusions and directions of future research.

Notation and Preliminaries
Throughout this paper, we will use the following notation and basic facts from linear algebra and convex analysis, which can be found in detail in various monographs and overview papers such as [1,3,6,13,40].
Linear algebra By ||·||_2 we denote the Euclidean vector norm and by ||·||_1 the norm which sums up the absolute values of the vector components. A matrix Π ∈ S(n) is an orthogonal projector if Π^2 = Π. This is equivalent to the statement that Π ∈ S(n) has only eigenvalues in {0, 1}. The nuclear norm ||·||_tr is the unique norm such that rank(Π) = ||Π||_tr for every orthogonal projector Π.
For a given norm ||·|| on R^n, the dual norm is defined by ||x||_* := max_{||y|| ≤ 1} ⟨x, y⟩.
In particular, for a matrix X = (x_1 | . . . | x_N) ∈ R^{n,N}, we will be interested in the norm ||X||_{2,1} := Σ_{k=1}^N ||x_k||_2, which can be considered as a norm on R^{nN} by arranging the columns of the matrix into a vector. Its dual norm is given by ||Y||_{2,∞} := max_{k=1,...,N} ||y_k||_2. Convex analysis Let Γ_0(R^n) denote the space of proper, lower semi-continuous, convex functions mapping from R^n into the extended real numbers (−∞, ∞]. The indicator function ι_C of C ⊆ R^n is defined by ι_C(x) := 0 if x ∈ C and ι_C(x) := +∞ otherwise. We have ι_C ∈ Γ_0(R^n) if and only if C is non-empty, convex and closed.
For f ∈ Γ_0(R^n) and τ > 0, the proximal mapping is defined by

prox_{τ f}(x) := argmin_{y ∈ R^n} { 1/2 ||x − y||_2^2 + τ f(y) }.

Indeed, the minimizer exists and is unique [40, Thm. 31.5]. If C ⊂ R^n is a non-empty, closed, convex set, then the proximal mapping of a multiple of ι_C is just the orthogonal projection onto C, i.e., prox_{τ ι_C} = proj_C. In particular, the orthogonal projection onto the halfspace H(a, β) := {x ∈ R^n : ⟨a, x⟩ ≤ β} with a ∈ R^n and β ∈ R can be computed by

proj_{H(a,β)}(x) = x − ((⟨a, x⟩ − β)_+ / ||a||_2^2) a,    (1)

where (y)_+ := max{0, y}. Further, the orthogonal projection onto the hypercube Q := [0, 1]^n is given by the componentwise clipping

proj_Q(x) = min{1, max{0, x}}.    (2)

The dual function of a norm is just the indicator function of the unit ball with respect to its dual norm. In particular, we have for ||·||_{2,1} : R^{n,N} → R that

(||·||_{2,1})^* = ι_{B_{2,∞}},    (3)

where B_{2,∞} := {X ∈ R^{n,N} : ||X||_{2,∞} ≤ 1}.

Regularized Reaper Model
Classical PCA determines a d-dimensional affine subspace fitting the data x_k ∈ R^n, k = 1, . . . , N, by minimizing

E(A, b) := Σ_{k=1}^N ||x_k − b − AA^T (x_k − b)||_2^2    (4)

over b ∈ R^n and A ∈ R^{n,d} with A^T A = I_d. It is not hard to check that the minimizing affine subspace goes through the offset (bias)

b̂ = (1/N) Σ_{k=1}^N x_k.    (5)

Therefore, we can reduce our attention to the centered data points x_k − b̂, k = 1, . . . , N, which we denote by x_k again, and minimize over the linear d-dimensional subspaces through the origin, i.e., we minimize Σ_{k=1}^N ||x_k − AA^T x_k||_2^2. Unfortunately, the solution of this minimization problem is sensitive to outliers. Therefore, several robust PCA variants were proposed in the literature. A straightforward approach consists in just skipping the square in the Euclidean norm, leading to

min_{A ∈ R^{n,d}, A^T A = I_d} Σ_{k=1}^N ||x_k − AA^T x_k||_2.    (6)

This is a non-convex model which requires minimization over matrices A in the so-called Stiefel manifold, see [9,26,34,35].
Another approach is based on the observation that Π := AA^T is the orthogonal projector onto the linear subspace spanned by the columns of A. Since the linear subspace is d-dimensional, exactly d eigenvalues of Π have to be one. Thus, problem (6) can be reformulated as

min_Π ||ΠX − X||_{2,1} subject to Π ∈ S(n), Π^2 = Π, tr(Π) = d.    (7)

Having computed Π, we can determine A by spectral decomposition. Unfortunately, (7) is still a non-convex model which is moreover NP-hard to solve. Therefore, Lerman et al. [27] suggested to replace it by a convex relaxation, called reaper,

min_{P ∈ S(n)} ||PX − X||_{2,1} subject to 0_{n,n} ⪯ P ⪯ I_n, tr(P) = d.
In order to deal with the non-differentiability of the objective function, Lerman et al. [27] iteratively solve a series of positive semi-definite programs. In contrast to models minimizing directly over A ∈ R^{n,d}, algorithms for minimizing reaper or rreaper seem to require the handling of a large matrix P ∈ S(n) or, more precisely, the handling of its spectral decomposition, which makes the method impracticable for high-dimensional data.
The above model requires the exact knowledge of the dimension d of the linear subspace the data will be reduced to. In this paper, we suggest to replace the strict trace constraint by the relaxed variant tr(Π) ≤ d and to add the nuclear norm of Π as a regularizer which enforces the sparsity of the rank of Π:

min_Π ||ΠX − X||_{2,1} + α ||Π||_tr subject to Π ∈ S(n), Π^2 = Π, tr(Π) ≤ d.    (8)

Here α > 0 is an appropriately fixed regularization parameter.
Since (8) is again hard to solve, we use a relaxation for the eigenvalues and call the new model regularized reaper (rreaper):

min_{P ∈ S(n)} ||PX − X||_{2,1} + α ||P||_tr subject to 0_{n,n} ⪯ P ⪯ I_n, tr(P) ≤ d.    (9)
Finally, we project the solution P̂ of rreaper onto the set of orthoprojectors with rank not larger than d, see Sect. 5. In the following, we will present a primal-dual approach to solve (9) which uses only the sparse spectral decomposition of P, but not the matrix itself, within the computation steps.

Primal-Dual Algorithm
Rreaper is a convex optimization problem, so we may choose from various convex solvers. Before choosing a specific method, we study the structure of rreaper in more detail. For this purpose, we define the forward operator X : S(n) → R^{n,N} : P ↦ PX and rearrange (9) as

min_{P ∈ S(n)} ||X(P) − X||_{2,1} + α R(P),    (10)

where the regularizer R : S(n) → [0, +∞] is defined by

R(P) := ||P||_tr + ι_C(P),  C := {P ∈ S(n) : 0_{n,n} ⪯ P ⪯ I_n, tr(P) ≤ d}.    (11)

Since C is compact and convex, and since the norms ||·||_{2,1} and ||·||_tr are continuous, rreaper has a global minimizer. This minimizer is in general not unique. Concerning the adjoint operator X* : R^{n,N} → S(n), we observe ⟨X(P), Y⟩ = ⟨P, Y X^T⟩ for all P ∈ S(n) and Y ∈ R^{n,N}, where we exploit the symmetry of P by P = 1/2 (P + P^T). Thus, the adjoint is just

X*(Y) = 1/2 (Y X^T + X Y^T).

The operator norm of X is given by the spectral norm of X ∈ R^{n,N}, i.e., ||X|| = ||X||_2 = σ_1, where U and the descendingly ordered singular values σ_1 ≥ · · · ≥ σ_{min{n,N}} arise from the singular value decomposition X = U diag(σ_X) V^T.
More generally, rreaper is an optimization problem of the form

minimize F(A(P)) + G(P),    (12)

whose objective consists of two convex, lower semi-continuous functions F = ||· − X||_{2,1} and G = αR and a linear mapping A = X. Since both the data fidelity and the regularizer of rreaper are non-differentiable, gradient descent methods as well as the forward-backward splitting or the fast iterative shrinkage-thresholding algorithm (FISTA) cannot be applied. In order to exploit the structure, various saddle-point methods that do not require differentiability can be used. Examples are the alternating direction method of multipliers (ADMM), which is related to the Douglas-Rachford splitting, or primal-dual algorithms, see for instance [6] and references therein. Since the iteration variables with respect to P in rreaper computed via ADMM usually have no sparse spectral decomposition and cannot be handled for high-dimensional instances, we prefer to choose the primal-dual method of Chambolle and Pock [6] with extrapolation of the primal variable to compute a minimizer of rreaper (10). For the general minimization problem (12), the primal-dual method consists in the iteration

Y^(r+1) := prox_{σ F*}(Y^(r) + σ A(P̄^(r))),
P^(r+1) := prox_{τ G}(P^(r) − τ A*(Y^(r+1))),
P̄^(r+1) := P^(r+1) + θ (P^(r+1) − P^(r)),

with fixed parameters τ, σ > 0 and θ ∈ [0, 1]. The extrapolation step consisting in the calculation of P̄ is required to ensure convergence whenever τσ < 1/||A||^2 and θ = 1, where ||A|| denotes the operator norm of A. If the functions F and G possess more regularity like strong convexity, the parameter θ may be varied to achieve an acceleration of the convergence and to guarantee certain convergence rates [5,6]. For rreaper, the above iteration leads us to the following numerical method.

Algorithm 1 (Primal-Dual Algorithm)
Input: X ∈ R^{n,N}, d ∈ N, and σ, τ > 0 with στ < 1/||X||_2^2, and θ ∈ (0, 1].
Iteration:

Y^(r+1) := prox_{σ (||·−X||_{2,1})*}(Y^(r) + σ P̄^(r) X),
P^(r+1) := prox_{τα R}(P^(r) − τ X*(Y^(r+1))),
P̄^(r+1) := P^(r+1) + θ (P^(r+1) − P^(r)).

Chambolle and Pock [6] have proven that the sequence {P^(r)}_{r∈N} converges to a minimizer P̂ of (10) and the sequence {Y^(r)}_{r∈N} to a minimizer of the dual problem if the Lagrangian

L(P, Y) := −(||· − X||_{2,1})*(Y) + α R(P) + ⟨X(P), Y⟩    (13)

has a saddle-point, which is, however, clear for rreaper. The algorithm requires the computation of the proximal mapping of the dual data fidelity and of the regularizer, which we consider next.
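For small n, the iteration above can be prototyped with dense matrices. The following sketch is illustrative only: all function names are ours, and the projection onto Q ∩ H is realized here by simple bisection on the monotone function ϕ from Proposition 3 rather than by the exact breakpoint search of Algorithm 2.

```python
import numpy as np

def proj_b2inf(Z):
    # columnwise projection onto the unit ball of the (2,inf)-norm
    return Z / np.maximum(np.linalg.norm(Z, axis=0), 1.0)

def proj_QH(lam, d, tol=1e-12):
    # projection onto {x in [0,1]^n : <1_n, x> <= d} by bisection on
    # phi(t) = sum(clip(lam - t, 0, 1)) - d, which is monotone decreasing
    x = np.clip(lam, 0.0, 1.0)
    if x.sum() <= d:
        return x
    lo, hi = 0.0, lam.max()
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        if np.clip(lam - t, 0.0, 1.0).sum() > d:
            lo = t
        else:
            hi = t
    return np.clip(lam - hi, 0.0, 1.0)

def prox_reg(P, d, alpha_tau):
    # prox of tau*alpha*R via eigenvalue shrinkage (cf. Proposition 2)
    lam, U = np.linalg.eigh(P)
    lam = proj_QH(lam - alpha_tau, d)
    return (U * lam) @ U.T

def rreaper_dense(X, d, alpha, iters=500, theta=1.0):
    n, N = X.shape
    s = 0.99 / np.linalg.norm(X, 2)   # sigma = tau, so sigma*tau < 1/||X||_2^2
    P = np.zeros((n, n)); Pbar = P.copy(); Y = np.zeros((n, N))
    Xadj = lambda Y: 0.5 * (Y @ X.T + X @ Y.T)
    for _ in range(iters):
        Y = proj_b2inf(Y + s * (Pbar @ X - X))      # dual step (cf. Prop. 1)
        P_new = prox_reg(P - s * Xadj(Y), d, alpha * s)
        Pbar = P_new + theta * (P_new - P)          # extrapolation
        P = P_new
    return P
```

On clean rank-one data, the computed P is feasible for (9) and its objective drops far below the value at P = 0.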

Proposition 1 (Proximal mapping of the dual data fidelity)
For Y ∈ R^{n,N} and σ > 0, we have

prox_{σ (||·−X||_{2,1})*}(Y) = proj_{B_{2,∞}}(Y − σX).

Proof Using (3) and, since (f(· − x_0))* = f* + ⟨·, x_0⟩, we obtain (||· − X||_{2,1})* = ι_{B_{2,∞}} + ⟨·, X⟩. The assertion then follows from the definition of the proximal mapping. □

For the maximal dimension d of the target subspace, we henceforth use the half-space H := H(1_n, d) in order to bound the trace of the primal iteration variable P^(r). Then the proximal mapping of the regularizer is given in the following proposition.
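In code, this proximal mapping is a shift by σX followed by a columnwise projection onto the Euclidean unit ball. The sketch below (function name ours) also verifies numerically that the closed form attains a smaller objective value of the defining minimization problem than arbitrary feasible points.

```python
import numpy as np

def prox_dual_fidelity(Y, X, sigma):
    # prox of sigma * (|| . - X ||_{2,1})^*: shift by sigma*X, then
    # project each column onto the Euclidean unit ball (B_{2,inf} ball)
    V = Y - sigma * X
    return V / np.maximum(np.linalg.norm(V, axis=0), 1.0)
```

The objective of the prox problem is Z ↦ 1/2 ||Z − Y||_F^2 + σ⟨Z, X⟩ over Z ∈ B_{2,∞}; the returned point should beat any sampled feasible Z.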
Proposition 2 (Proximal mapping of the regularizer) For P ∈ S(n) with spectral decomposition P = U diag(λ_P) U^T and R in (11), it holds

prox_{τα R}(P) = U diag( proj_{Q∩H}(λ_P − τα 1_n) ) U^T.    (14)

Proof A symmetric matrix P is in C if and only if λ_P ∈ Q ∩ H. Hence, the regularizer can be written as R(P) = ⟨λ_P, 1_n⟩ + ι_{Q∩H}(λ_P). By the theorem of Hoffman and Wielandt [16, Thm. 6.3.5], we know that ||P − S||_F ≥ ||λ_P − λ_S||_2, with equality if and only if S possesses the same eigenspaces as P. Therefore, the minimizer in (14) has to be of the form S = U diag(λ_S) U^T, where the columns of U are the eigenvectors of P. Incorporating this observation in (14), we determine the eigenvalues λ_S by solving the minimization problem

λ_S = argmin_{λ ∈ Q∩H} { 1/2 ||λ_P − λ||_2^2 + τα ⟨λ, 1_n⟩ } = proj_{Q∩H}(λ_P − τα 1_n). □

Alternatively to the proof, we could argue with the so-called spectral function related to R, which is invariant under permutations, see, e.g., [1].
By Proposition 2, the proximal mapping of the regularizer requires the projection onto the truncated hypercube. The following proposition can be found in [1,Ex. 6.32].

Proposition 3 (Projection onto the truncated hypercube)
For any λ ∈ R^n and any d ∈ (0, n], the projection onto the truncated hypercube is given by

proj_{Q∩H}(λ) = proj_Q(λ) if ⟨proj_Q(λ), 1_n⟩ ≤ d, and proj_{Q∩H}(λ) = proj_Q(λ − t̂ 1_n) otherwise,

where t̂ is the smallest root of the function

ϕ(t) := ⟨proj_Q(λ − t 1_n), 1_n⟩ − d.    (15)

Due to the projection onto the hypercube, see (2), only the positive components of λ influence its projection onto Q ∩ H. More precisely, we have proj_{Q∩H}(λ) = proj_{Q∩H}((λ)_+), where the function (·)_+ is applied componentwise.
To formulate a projection algorithm, in particular, to compute the zero of ϕ, we study the properties of ϕ and derive an explicit representation. Within a different setting, the reader may also have a look at [30]. The function ϕ in (15) has the following properties:
(i) ϕ is Lipschitz continuous.
(ii) ϕ is monotone decreasing and piecewise linear. More precisely, we can construct a sequence 0 = s_0 < s_1 < s_2 < . . . < s_M with M ≤ 2n such that ϕ is linear on each interval [s_l, s_{l+1}].
(iii) If ⟨proj_Q(λ), 1_n⟩ > d, then the smallest root t̂ of ϕ lies in the interval [s_m, s_{m+1}], where m is the unique index such that ϕ(s_m) > 0 and ϕ(s_{m+1}) ≤ 0.
Proof (i) Using the definition of ϕ, the Cauchy-Schwarz inequality, and the non-expansiveness of the projection, we get |ϕ(s) − ϕ(t)| ≤ ||1_n||_2 ||proj_Q(λ − s 1_n) − proj_Q(λ − t 1_n)||_2 ≤ n |s − t|. (ii) Starting with s_0 = 0, we construct s_l with l = 1, . . . , M iteratively as follows: given s_l, we set μ := λ − s_l 1_n and choose s_{l+1} as the smallest value s > s_l at which a component of λ − s 1_n enters or leaves the interval (0, 1]. Here we use the convention min ∅ = ∞. If both the set of entering and the set of leaving components are empty, we stop the construction since all components of μ are non-positive, implying proj_Q(μ) = 0_n and thus ϕ(s_l) = −d. Considering the projection onto the hypercube Q in (2), ϕ is linear with slope −k_l on [s_l, s_{l+1}], where k_l is the number of components of λ − s 1_n lying in (0, 1], and a change of the slope appears exactly in s_{l+1}, where at least one component enters or leaves the interval (0, 1]. Let s_M be the first value in this procedure where all components of μ become non-positive. Since each component of λ − t 1_n can at most once enter and once leave the interval (0, 1], we know that M ≤ 2n. Further, k_l ≥ 0 shows that ϕ is monotone decreasing. (iii) By the definition of ϕ and by the assumption ⟨proj_Q(λ), 1_n⟩ > d, we have ϕ(0) > 0. On the other side, ϕ(s_M) = −d was shown above. Consequently, the smallest root t̂ of ϕ has to lie in the interval [s_m, s_{m+1}] with ϕ(s_m) > 0 and ϕ(s_{m+1}) ≤ 0 and can be computed by solving the linear equation ϕ(s_m) − k_m (t̂ − s_m) = 0. This results in t̂ = s_m + (1/k_m) ϕ(s_m) and finishes the proof. □
Following Proposition 3 and the previous proof, we obtain the following algorithm for the projection onto Q ∩ H .
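Such a projection routine can be sketched as follows (function name ours). Instead of walking the breakpoints s_l iteratively as in the proof, this simplified variant collects all candidate breakpoints λ_i − 1 and λ_i at once, which yields the same piecewise-linear structure of ϕ:

```python
import numpy as np

def proj_trunc_hypercube(lam, d):
    # projection onto {x in [0,1]^n : <1_n, x> <= d}
    lam = np.asarray(lam, dtype=float)
    x = np.clip(lam, 0.0, 1.0)
    if x.sum() <= d:                      # proj_Q(lam) already feasible
        return x
    phi = lambda t: np.clip(lam - t, 0.0, 1.0).sum() - d
    # breakpoints of the piecewise-linear phi: where a component of
    # lam - t enters (t = lam_i - 1) or leaves (t = lam_i) (0, 1]
    bps = np.concatenate([lam - 1.0, lam])
    bps = np.unique(bps[bps > 0.0])
    prev_t, prev_phi = 0.0, phi(0.0)
    for t in bps:
        cur = phi(t)
        if cur <= 0.0:                    # smallest root lies in [prev_t, t]
            slope = (cur - prev_phi) / (t - prev_t)
            t_hat = prev_t - prev_phi / slope
            return np.clip(lam - t_hat, 0.0, 1.0)
        prev_t, prev_phi = t, cur
    return np.zeros_like(lam)
```

For λ = (2, 0.5, 0.3) and d = 1, the smallest root is t̂ = 0.5, giving the projection (1, 0, 0).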
Based on the derived proximal mappings, the primal-dual Algorithm 1 to solve rreaper (10) can be specified in matrix form as Algorithm 3.

Matrix-Free Realization
Solving rreaper with the primal-dual Algorithm 3 is possible if the dimension of the surrounding space R^n is moderate, which is often not the case in image processing tasks. While the dual variable Y ∈ R^{n,N} matches the dimension of the data, the primal variable P is in S(n) instead of R^{n,d}, d ≪ n. How can the primal-dual iteration be realized in the case n ≫ d, where the primal variable can neither be held in memory nor its required eigenvalue decomposition be computed in a reasonable amount of time?
Here the nuclear norm in rreaper that promotes low-rank matrices comes to our aid. Our main idea to derive a practical implementation of the primal-dual iteration is thus based on the assumption that the iterates of the primal variable possess the low-rank form

P^(r) = U^(r) diag(λ^(r)) (U^(r))^T,  U^(r) ∈ R^{n,d_r},  λ^(r) ∈ R^{d_r},    (16)

with small rank d_r. In our simulations, we observed that the rank is usually around the dimension d of the wanted low-dimensional subspace.
In order to integrate the matrix-free representation (16) into the primal-dual iteration efficiently, we further require a fast method to compute the eigenvalue thresholding. For this, we compute a partial eigenvalue decomposition using the well-known Lanczos process [22]. Deriving matrix-free versions of the forward operator X and its adjoint X*, we finally obtain a completely matrix-free primal-dual implementation with respect to P^(r).

The Thick-Restarted Lanczos Process
One of the most commonly used methods to extract a small set of eigenvalues and their corresponding eigenvectors of a large symmetric matrix is the Lanczos method [22], which is based on the theory of Krylov subspaces. The method first builds a partial orthogonal basis and then uses a Rayleigh-Ritz projection to extract the wanted eigenpairs approximately. If the set of employed basis vectors is increased, the extracted eigenpairs converge to the eigenpairs of the given matrix [13]. Since the symmetric matrix whose partial eigenvalue decomposition is required in the primal-dual method is usually high-dimensional, we would like to choose the number k_max of basis vectors within the Lanczos method as small as possible, even if the Krylov method does not fully converge for this choice. To calculate the dominant ℓ_fix eigenpairs with high accuracy nevertheless, the Lanczos method can be restarted with the dominant ℓ_fix Ritz pairs. For our purpose, we use the thick-restart scheme of Wu and Simon [50] in Algorithm 4, whose details are discussed below.
Remark 1 Although the Lanczos process computes an orthonormal basis e_1, . . . , e_{k_max} in exact arithmetic, the orthogonality is usually lost because of the floating-point arithmetic. In order to re-establish the orthogonality, we therefore have to orthogonalize the newly computed e_k against the previous basis vectors, which can be achieved by the Gram-Schmidt procedure. More sophisticated re-orthogonalization strategies are discussed in [50].
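For illustration, the basic building block can be sketched as a plain (non-restarted) Lanczos process with full Gram-Schmidt reorthogonalization, extracting Ritz pairs from the tridiagonal matrix T = E^T P E. All names are ours, and the thick-restart logic of Algorithm 4 is omitted for brevity.

```python
import numpy as np

def lanczos_ritz(matvec, n, k_max, seed=0):
    # plain Lanczos with full reorthogonalization (no restarting):
    # builds E = [e_1|...|e_m] with T = E^T P E tridiagonal and returns
    # the Ritz pairs obtained from the small eigenproblem of T
    rng = np.random.default_rng(seed)
    E = np.zeros((n, k_max))
    alpha = np.zeros(k_max)          # diagonal of T
    gamma = np.zeros(k_max)          # off-diagonal (residual norms)
    e = rng.standard_normal(n); e /= np.linalg.norm(e)
    k = 0
    for k in range(k_max):
        E[:, k] = e
        w = matvec(e)
        alpha[k] = e @ w
        # full Gram-Schmidt reorthogonalization against all previous vectors
        w -= E[:, :k + 1] @ (E[:, :k + 1].T @ w)
        gamma[k] = np.linalg.norm(w)
        if gamma[k] < 1e-12 or k == k_max - 1:
            break                    # invariant subspace found or basis full
        e = w / gamma[k]
    m = k + 1
    T = (np.diag(alpha[:m]) + np.diag(gamma[:m - 1], 1)
         + np.diag(gamma[:m - 1], -1))
    mu, Yv = np.linalg.eigh(T)
    return mu, E[:, :m] @ Yv         # Ritz values (ascending) and vectors
```

For a matrix with a well-separated leading eigenvalue, a few dozen Lanczos steps already yield an accurate leading Ritz pair.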

Remark 2
During the Lanczos process, the norm of the residual γ_k can become zero. In this case, we can stop the process, reduce k_max to the current k, and proceed with steps (iii) and (iv). Then the computed basis e_1, . . . , e_k spans an invariant subspace of P such that the eigenpairs in U and Λ become exact, see [13].
The heart of the Lanczos method in Algorithm 4 is the construction of an orthogonal matrix E := [e_1 | . . . | e_{k_max}] ∈ R^{n×k_max} such that T := E^T P E becomes tridiagonal, see (17) with ρ_k = 0 below. Using the eigenvalue decomposition T = Y Λ Y^T, we then compute the Ritz pairs (λ_k, u_k), where u_k are the columns of U := [u_1 | . . . | u_{k_max}] and λ_k the eigenvalues in Λ. In the next iteration, we choose the ℓ_fix Ritz pairs corresponding to the absolutely leading Ritz values, denoted by (λ̆_1, ŭ_1), . . . , (λ̆_{ℓ_fix}, ŭ_{ℓ_fix}), and restart the Lanczos process. Thereby, the chosen Ritz vectors are extended to an orthogonal basis E, where the entries ρ_k := γ̆_{k_max} y̆_{k_max,k}, with γ̆_{k_max} and y̆_{k_max,k} originating from the last iteration, couple the kept Ritz pairs to the new basis vectors, see [50].
The stopping criterion of the thick-restarted Lanczos process is here deduced from the fact that the chosen Ritz pairs fulfill the equation P ŭ_k = λ̆_k ŭ_k + ρ_k r̆_{k_max}, where r̆_{k_max} is the last residual vector of the previous iteration [50]. Consequently, the absolute error of the chosen Ritz pairs is governed by |ρ_k|. Usually, the absolute value of the leading Ritz value is a good approximation of the required spectral norm ||P|| to estimate the current relative error.

Remark 3 (Power method)
Besides Krylov subspace methods like the discussed Lanczos process with thick-restarting, the required eigenpairs may be calculated using the power method. Here the main idea is to start with an orthogonal matrix U^(0) = [u_1^(0) | . . . | u_{ℓ_fix}^(0)] ∈ R^{n×ℓ_fix} representing an ℓ_fix-dimensional subspace and to successively compute the subspace P(U^(r−1)), again represented by an orthogonal matrix U^(r). The convergence of this so-called orthogonal iteration may be improved by constructing U^(r) using the eigenvalue decomposition of P restricted to the current subspace. This approach goes back to [43] and consists in the iteration

E^(r) R^(r) := P U^(r−1),    (18)
Y^(r) Λ^(r) (Y^(r))^T := (E^(r))^T P E^(r),    (19)
U^(r) := E^(r) Y^(r),    (20)

where the first step consists in the QR decomposition of P U^(r−1) into an orthogonal matrix E^(r) and an upper triangular matrix R^(r), and the second in the eigenvalue decomposition of (E^(r))^T P E^(r). If none of the columns of U^(0) is orthogonal to the ℓ_fix leading eigenvectors of P, and if the ℓ_fix leading eigenvalues are separated from the remaining spectrum in absolute value, then the columns of U^(r) and the eigenvalues in Λ^(r) converge to the leading eigenpairs of P, see [13,43].
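The iteration (18)-(20) can be sketched compactly as follows (function name ours):

```python
import numpy as np

def orthogonal_iteration(P, ell, iters=100, seed=0):
    # subspace (orthogonal) iteration with Rayleigh-Ritz refinement
    n = P.shape[0]
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, ell)))
    lam = np.zeros(ell)
    for _ in range(iters):
        E, _ = np.linalg.qr(P @ U)      # (18): QR decomposition of P U
        T = E.T @ P @ E                 # restriction of P to span(E)
        mu, Y = np.linalg.eigh(T)       # (19): small eigendecomposition
        idx = np.argsort(-np.abs(mu))   # leading pairs (in absolute value) first
        U, lam = E @ Y[:, idx], mu[idx] # (20): refined orthogonal basis
    return lam, U
```

For a symmetric matrix with well-separated leading eigenvalues, the returned Ritz values match the leading eigenvalues.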
Although there is no convergence guarantee for the restarted Lanczos process, it usually converges much faster than the orthogonal iteration (18)-(20) for high-dimensional problems. Moreover, the chosen restarting technique can easily be adapted to compute the proximal mapping of the regularizer as discussed in the next section.

Matrix-Free Primal Update
The thick-restarted Lanczos method allows us to compute the leading absolute eigenvalues and their corresponding eigenvectors in a matrix-free manner using only the action of the considered matrix. In our primal-dual method for rreaper, we need the action of P^(r) − τ X*(Y^(r+1)). Incorporating the low-rank representation (16), we see that the action on a vector v ∈ R^n can be rewritten as

(P^(r) − τ X*(Y^(r+1))) v = U^(r) diag(λ^(r)) (U^(r))^T v − (τ/2) (Y^(r+1) (X^T v) + X ((Y^(r+1))^T v)).

For the evaluation of the primal proximal mapping, we first compute the eigenvalue decomposition of P^(r) − τ X*(Y^(r+1)), next shift the eigenvalues, and finally project them onto the truncated hypercube Q ∩ H, see Algorithm 3.
Since the projection onto Q ∩ H is independent of the negative eigenvalues, see the note after Proposition 3, it is sufficient to compute only the eigenpairs with eigenvalue larger than ατ.
For the numerical implementation, we compute the relevant eigenpairs with the thick-restarted Lanczos method. In the course of this, we are confronted with the issue that we actually do not know in advance how many eigenpairs have to be computed. To reduce the overhead of Algorithm 4 as much as possible, the parameters ℓ_fix and k_max can easily be adapted between the restarts. Further, the computation of strongly negative eigenvalues can be avoided by an eigenvalue shift, i.e., we actually compute the eigenpairs of P^(r) − τ X*(Y^(r+1)) + ν I_n with ν ≥ 0, where the required action takes the analogous matrix-free form. Essentially, we may thus implement the primal proximal mapping in the following manner.
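The required matrix-vector product, including the optional shift ν, can be sketched as follows (names ours); we verify it against the densely assembled matrix:

```python
import numpy as np

def action(v, U, lam, X, Y, tau, nu=0.0):
    # action of P - tau * X*(Y) + nu * I on v, where P = U diag(lam) U^T
    # and X*(Y) = (Y X^T + X Y^T)/2, without forming any n-by-n matrix
    Pv = U @ (lam * (U.T @ v)) if U.size else np.zeros_like(v)
    Av = 0.5 * (Y @ (X.T @ v) + X @ (Y.T @ v))
    return Pv - tau * Av + nu * v
```

Each evaluation costs O(n(d_r + N)) operations instead of the O(n^2) of a dense matrix-vector product.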

Remark 4
If the matrix P^(r) − τ X*(Y^(r+1)) does not possess any eigenvalues greater than ατ, then the Lanczos process stops in step (i.a) with m = 0. Since the projection onto the truncated hypercube is then the zero vector, the new iterate P^(r+1) can be represented by an empty low-rank representation, i.e., d_{r+1} = 0.

Matrix-Free Dual Update
Compared with the primal update, the derivation of the matrix-free dual update is more straightforward. The dual step requires the extrapolated matrix P̄^(r) = P^(r) + θ(P^(r) − P^(r−1)). Using the low-rank representations of P^(r) and P^(r−1) similar to (16), the product P̄^(r) X can be evaluated without assembling P̄^(r). Applying the projection of Proposition 1 to Y^(r) + σ(P̄^(r) X − X) = (y_1 | . . . | y_N) column by column too, we obtain the following numerical method.
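A minimal sketch of this dual update via the low-rank factors (names ours), verified against the densely assembled extrapolation:

```python
import numpy as np

def dual_update(Y, U, lam, U_prev, lam_prev, X, sigma, theta):
    # Y_new = proj_{B_{2,inf}}(Y + sigma * (Pbar X - X)) with
    # Pbar = (1 + theta) P - theta P_prev applied via low-rank factors
    def apply(Uf, lf, M):
        return Uf @ (lf[:, None] * (Uf.T @ M)) if Uf.size else np.zeros_like(M)
    PbarX = (1 + theta) * apply(U, lam, X) - theta * apply(U_prev, lam_prev, X)
    V = Y + sigma * (PbarX - X)
    return V / np.maximum(np.linalg.norm(V, axis=0), 1.0)   # columnwise proj.
```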

Matrix-Free Projection onto the Orthoprojectors
With the matrix-free implementations of the primal and dual proximal mappings, we are already able to solve rreaper (9) numerically. Before summarizing the compound algorithm, we briefly discuss the last component needed to tackle the robust PCA problem (8). The final step is to project the solution P̂ of rreaper onto the set of orthoprojectors with rank not larger than d:

O_d := {Π ∈ S(n) : Π^2 = Π, rank(Π) ≤ d}.

We may calculate the projection explicitly in the following manner.

Proposition 4 (Projection onto the orthoprojectors)
For P ∈ S(n) with eigenvalue decomposition P = U diag(λ_P) U^T, where the eigenvalues in λ_P are in descending order, and for every 1 ≤ p ≤ ∞, the projection onto O_d with respect to the Schatten p-norm is given by

proj_{O_d}(P) = U diag(λ̂_Π) U^T, where λ̂_{Π,k} := 1 if k ≤ d and λ_{P,k} ≥ 1/2, and λ̂_{Π,k} := 0 otherwise.

Proof
The key ingredient to prove this statement is the theorem of Lidskii-Mirsky-Wielandt, see for instance [29]. Using this theorem to estimate the Schatten p-norm, we obtain

||P − Π||_{S_p} ≥ ||λ_P − λ_Π||_p,    (22)

where we have equality if Π has the same eigenvectors as P. Recall that the eigenvalues in λ_P appear in descending order. The right-hand side of (22) thus becomes minimal if we choose the eigenvalues of Π for k = 1, . . . , d as λ̂_{Π,k} := 1 if λ_{P,k} ≥ 1/2 and λ̂_{Π,k} := 0 otherwise, and set λ̂_{Π,k} := 0 for k = d + 1, . . . , n. This is exactly the projection onto O_d.
Because of the low-rank representation of the primal variable, the construction of the orthoprojector Π ∈ O d is here especially simple.
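In code, the projection of Proposition 4 amounts to rounding the d largest eigenvalues to {0, 1} (function name ours):

```python
import numpy as np

def proj_orthoprojectors(P, d):
    # keep among the d largest eigenvalues those >= 1/2, round them to 1
    lam, U = np.linalg.eigh(P)               # ascending eigenvalues
    idx = np.argsort(-lam)[:d]               # indices of the d largest
    keep = idx[lam[idx] >= 0.5]
    Uk = U[:, keep]
    return Uk @ Uk.T                         # orthoprojector of rank <= d
```

With the low-rank representation of the primal variable, only the stored d_r eigenpairs need to be inspected instead of a full eigendecomposition.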

Matrix-Free Robust PCA by RREAPER
Combining the matrix-free implementations of the primal and dual proximal mappings, we finally obtain a primal-dual method to solve rreaper (9) without evaluating the primal variable P (r ) representing the relaxed orthoprojector explicitly.
However, using the thick-restarted Lanczos process to evaluate the proximal mapping of the regularizer in Proposition 2, we introduce a systematic error since the required eigenvalue decompositions are only computed approximately. Fortunately, we may control this error by choosing the accuracy δ in Algorithm 4 appropriately small. If the accuracies increase sufficiently fast during the primal-dual iteration, we will see that the underlying primal-dual iteration converges nevertheless.

Theorem 1 (Convergence of Algorithm 8) Let
Q^(r) := P^(r−1) − τ X*(Y^(r)) = U diag(λ) U^T be the argument of the proximal mapping in the r-th primal-dual iteration of Algorithm 8, and let Q̃^(r) be the low-rank approximation of the leading positive eigenvalues greater than τα computed by the thick-restarted Lanczos process. Define the approximation error by

E_r := ||(Q^(r))_{⪰0} − Q̃^(r)||_F,

where (·)_{⪰0} denotes the projection onto the positive semi-definite cone. Then, for στ < 1/||X||_2^2 and θ = 1, the primal-dual iteration in Algorithm 8 converges to a global minimizer of rreaper whenever the square roots of the approximation errors are summable, i.e., Σ_{r∈N} √E_r < ∞.

Proof
The key idea of the proof is to show that the approximations P̃^(r) = prox_{ταR}(Q̃^(r)) are so-called type-one approximations of the proximal points P^(r) in the second step of Algorithm 1. For this, we have to compare the objective of the proximal problem at the points P̃^(r) and P^(r), which we bound by the triangle inequality. Using the non-expansiveness of the projection and assuming without loss of generality that E_r ≤ 1, we conclude that P̃^(r) fulfills the definition of a type-one approximation with precision C E_r (||Q^(r)||_F + √d), see [39, p. 385]. Since the square roots of the precisions are summable, [39, Thm. 2] ensures the convergence of the inexactly performed primal-dual iteration to a saddle-point of the Lagrangian in (13) and thus to a solution P̂ of rreaper.

Performance Analysis
Inspired by ideas of Lerman et al. [27], we provide a performance analysis of rreaper. To this end, we assume that the 'ideal' subspace L of the given data x_k ∈ R^n, k = 1, . . . , N, has dimension d_L ≤ d. As in [27], we determine the best fit of the data by two measures: the first one is the distance of the data from the subspace,

R_L := ||(I_n − Π_L) X||_{2,1},

where Π_L denotes the orthogonal projector onto L. For the second measure, we assume that the projected data {Π_L x_k : k = 1, . . . , N} are well spread within L. In order to recover the entire subspace L, the data obviously have to cover each direction in L with sufficiently many data points. This well-localization of the data is measured by the permeance statistic

P_L := min_{u ∈ L, ||u||_2 = 1} Σ_{k=1}^N |⟨x_k, u⟩|,

which can be seen as an L1 counterpart of the lower frame bound. Clearly, P_L becomes large if all directions in L are uniformly covered by the data. The lower frame bound and the permeance statistic come into play in the following lemma, compare [27, Sect. A2.3].

Lemma 2
Let Π_L be the orthogonal projector onto a subspace L of R^n of dimension d_L, and let x_k ∈ R^n, k = 1, . . . , N, N ≥ d_L, form the columns of the matrix X. Then, for any A ∈ R^{n,n}, the relations (24) and (25) hold true.

Proof We restrict our attention to (25); the relation (24) follows similar lines. Let A Π_L have the singular value decomposition A Π_L = U Σ V^T, where the singular values σ_k, k = 1, . . . , n, are in descending order with σ_{d_L+1} = . . . = σ_n = 0, and where we can arrange V such that the transposes of the first d_L rows of V^T belong to L. The assertion then follows using the orthogonality of U and the concavity of the square root function. □

Proof Since P̂ is a minimizer of (9), we obtain

0 ≤ ||Π_L X − X||_{2,1} + α ||Π_L||_tr − ||P̂ X − X||_{2,1} − α ||P̂||_tr = R_L − ||(I_n − P̂) X||_{2,1} + α (d_L − ||P̂||_tr).

It remains to estimate the trace norm of Δ := P̂ − Π_L. To this end, we decompose Δ as

Δ = Δ_1 + Δ_2 + Δ_2^T + Δ_3,  Δ_1 := Π_L Δ Π_L,  Δ_2 := Π_L Δ (I_n − Π_L),  Δ_3 := (I_n − Π_L) Δ (I_n − Π_L).

Since 0_{n,n} ⪯ P̂ ⪯ I_n, we obtain by conjugation with Π_L, resp. I_n − Π_L, that Δ_1 ⪯ 0_{n,n} and 0_{n,n} ⪯ Δ_3, so that ||Δ_1||_tr = −tr Δ_1 and ||Δ_3||_tr = tr Δ_3. Then we conclude tr Δ_3 = tr Δ − tr Δ_1 = tr(P̂) − d_L + ||Δ_1||_tr, which implies

||Δ||_tr ≤ ||Δ_1||_tr + ||Δ_2||_tr + ||Δ_2^T||_tr + ||Δ_3||_tr = 2||Δ_1||_tr + 2||Δ_2||_tr + tr(P̂) − d_L.

Now we can estimate the last summand by (1) and Lemma 2 as
By (28) and (29), we obtain the stated bound. Using the triangle inequality ||P̂ − Π_L||_tr ≥ |d_L − ||P̂||_tr|, we refine this estimate further. Alternatively, we apply (32) in (30) together with |d_L − ||P̂||_tr| ≤ d, since d_L ≤ d. The final assertion follows by taking the minimum of both bounds. □ The error depends basically on the quotient of the distance of the normalized data from the subspace and the permeance statistic, where the normalization is indirectly contained in the quotient. For α = γ_L, the terms of the minimum in (26) coincide and both terms become small; the second term is preferred for α near zero or 2γ_L. Compared to our experimental results, the estimate is in general too pessimistic.

Remark 5
In [27], the following estimate was given for the original reaper, where the data points X = [X_in | X_out] are divided into inliers X_in ∈ R^{n,N_in} and outliers X_out ∈ R^{n,N_out}, and (·)~ normalizes the columns of a matrix to one. Following the lines of the proof in that paper, we would obtain the analogous estimate for rreaper. Therefore, rreaper inherits the robustness of reaper as well as the theoretical guarantees for certain data models. However, in our numerical examples, we realized that the above estimate is often not defined due to division by zero, especially in the presence of already small inlier noise, i.e., if the columns of X_in are only near the subspace L. Moreover, it is not clear how the division into in- and outliers can be achieved for real data sets. In our estimate (26), we do not distinguish between in- and outliers. Using (31), we would obtain a corresponding estimate.

Incorporating the Offset
So far, we have assumed that the offset b in the robust PCA problem is given, so that we can just search for a low-dimensional linear subspace which represents the data well. While in the classical PCA (4) the affine subspace always passes through the mean value (5) of the data, it is not clear which value b must be chosen in order to minimize (35). Clearly, if (Â, b̂) is a minimizer of E, then, for every b ∈ ran(Â), also (Â, b̂ + b) is a minimizer. A common choice for the offset is the geometric median of the data points defined by

$\hat b \in \operatorname*{argmin}_{b \in \mathbb{R}^n} \sum_{k=1}^N \|x_k - b\|_2,$

which can be computed, e.g., by the Weiszfeld algorithm and its generalizations, see, e.g., [2,36,42,49]. Other choices arising, e.g., from Tyler's M-estimator or other robust statistical approaches [19,24,26,33,47] were proposed in the literature. However, they do in general not minimize (35), as the following example from [35] shows: given three points in R^n which form a triangle with angles smaller than 120°, the geometric median is the point in the interior of the triangle from which all three points are seen under an angle of 120°. In contrast, the line (d = 1) with smallest distance from the three points is the one passing through the two points with the largest distance from each other.
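The plain Weiszfeld iteration mentioned above can be sketched in a few lines of NumPy (the zero-distance guard `eps` is our own choice, not a detail from the cited generalizations):

```python
import numpy as np

def geometric_median(X, iters=200, eps=1e-9):
    # Columns of X are the data points; plain Weiszfeld iteration
    # b <- (sum_k x_k / ||x_k - b||) / (sum_k 1 / ||x_k - b||).
    b = X.mean(axis=1)                       # start at the mean
    for _ in range(iters):
        d = np.linalg.norm(X - b[:, None], axis=0)
        d = np.maximum(d, eps)               # guard against division by zero
        w = 1.0 / d
        b_new = (X * w).sum(axis=1) / w.sum()
        if np.linalg.norm(b_new - b) < 1e-12:
            break
        b = b_new
    return b
```

For an equilateral triangle the iteration returns the centroid, which is the geometric median in that symmetric case.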
In the following, we show that there always exists an optimal hyperplane of dimension d = n − 1 in R^n, determined by a minimizer of E in (35), that contains at least n data points. Further, if the number N of data points is odd, then every optimal hyperplane contains at least n data points. Recall that the hyperplane spanned by the columns of A = (a_1 | . . . | a_{n−1}) ∈ St(n, n − 1) with offset b is given by

$\{x \in \mathbb{R}^n : \langle a_\perp, x\rangle + \beta = 0\},$

where a_⊥ ⊥ a_i, i = 1, . . . , n − 1, is a unit normal vector of the hyperplane, which is uniquely determined up to its sign, and ⟨a_⊥, b⟩ = −β.
The following lemma describes the placement of the data points with respect to the half-spaces determined by a minimizing hyperplane.

Lemma 3
Let x_ℓ ∈ R^n, ℓ = 1, . . . , N, be the given data points. Then there exists a minimizer (Â, b̂) of E whose offset b̂ coincides with a data point; in particular, some minimizing hyperplane passes through a data point. If N is odd, every minimizing hyperplane passes through a data point.

Proof Let (Â, b̂) be a minimizer of E with unit normal vector â_⊥ and ⟨â_⊥, b̂⟩ = −β, and let M_+, M_− and M denote the numbers of data points in the open half-space {x ∈ R^n : ⟨â_⊥, x⟩ > β}, in the open half-space {x ∈ R^n : ⟨â_⊥, x⟩ < β}, and on the hyperplane itself. Assume that the hyperplane contains no data point and that the upper half-space contains one; then ε := min over these points of ⟨â_⊥, x_k⟩ − β is positive, and we consider the shifted hyperplane {x ∈ R^n : ⟨â_⊥, x⟩ = β + ε}, which contains at least one data point. The sum of the distances of the data points from this shifted hyperplane is E(Â, b̂) − εM_+ + ε(M_− + M). Since (Â, b̂) is a minimizer of E, this implies that M_+ ≤ M + M_−; shifting in the opposite direction yields M_− ≤ M + M_+. If M = 0, then M_− = M_+ and the shifted hyperplane is also minimizing, so that a minimizing hyperplane through a data point exists. However, the case M = 0 with M_+ = M_− cannot appear for odd N, so that M ≥ 1 for odd N. This finishes the proof. □

Theorem 3
Let x_ℓ ∈ R^n, ℓ = 1, . . . , N. Then there exists a minimizer (Â, b̂) of E such that the corresponding minimizing hyperplane contains at least n data points. If N is odd, every minimizing hyperplane contains at least n data points.

Proof By Lemma 3, there exists a data point x_ℓ such that (Â, b̂) with b̂ = x_ℓ is a minimizer of E, and for odd N every minimizing hyperplane passes through a data point. Let â_⊥ be a unit vector orthogonal to the columns of Â. Set y_k := x_k − x_ℓ, k = 1, . . . , N. Next, we show: if the subspace normal to â_⊥ contains M linearly independent vectors y_1, . . . , y_M with 0 ≤ M ≤ n − 2, then exactly one of the following situations applies. (i) The remaining vectors y_k with k = M + 1, . . . , N are linearly dependent on the first M vectors and thus lie in the same linear subspace span{y_k : k = 1, . . . , M}. (ii) We find a further independent vector, say y_{M+1}, contained in the subspace normal to â_⊥, so that we can increase M to M + 1. Repeating this argument until M = n − 2, we are done, since y_k + x_ℓ, k = 1, . . . , n − 1, and x_ℓ itself lie in the hyperplane corresponding to (Â, b̂).
Because the vectors y_k with k = 1, . . . , M are linearly independent and contained in the subspace normal to â_⊥ by assumption, there exists a matrix Due to (36), the function has moreover a minimum at α = 0. For the summands of F, we obtain since Q_B and R(α) are orthogonal by construction. Hence, we get Here the first M summands vanish because of the mentioned orthogonality y_k ⊥ u_{n−1} and y_k ⊥ â_⊥ for k = 1, . . . , M.
If all remaining given points y k with k = M + 1, . . . , N are in span{ y 1 , . . . , y M }, then the corresponding remaining summands of F(α) become zero too, and the first situation (i) applies; so we are done.
If this is not the case, consider only those y_k with k = M + 1, . . . , N that are linearly independent of y_1, . . . , y_M. Let us denote the corresponding non-empty index set by I. Assume that there exists a k ∈ I such that f_k is not differentiable at α = 0. This is only possible if the argument of the absolute value vanishes; then the vector y_k lies in the subspace spanned by the columns of Â, and we are done. Otherwise, if f_k(0) ≠ 0 for all k ∈ I, then F is differentiable at α = 0 and, by straightforward differentiation, we obtain F′(0) ≠ 0. But then α = 0 cannot be a minimizer of F, which is a contradiction. Hence, this case cannot occur, and the proof is complete.
If the target dimension d of the minimizing subspace is strictly less than n − 1, then the subspace does not have to contain any data point, as the following example shows.
Counterexample 1 (Lower-dimensional subspace approximation) Initially, we consider the approximation of some given points in R^3 by a one-dimensional subspace, i.e., a line. More precisely, for a fixed T ≫ 1, we consider six given points forming two well-separated clusters of three points each around (0, 0, T)^T and (0, 0, −T)^T.
Obviously, the optimal line has somehow to pass through both clusters. One possible candidate for the approximating line is simply the third coordinate axis {(0, 0, t)^T : t ∈ R}, whose summed distance to the given points is by construction 6, namely 3 for each cluster. Now assume that the line passes through one given point, say (1, 0, T)^T. If T is very large, then we can neglect the slope of the line. Considering only the distances within the cluster around (0, 0, T)^T, we notice that the summed distance increases from 3 to approximately 2√3. Although the axis may not be the optimal line, its distance to the given points is smaller than that of any line passing through a data point. Therefore, we conclude that the optimal line need not contain any given point.
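The comparison can be checked numerically. As an assumption (the paper's exact coordinates are elided here), we instantiate each cluster as an equilateral triangle of circumradius 1 in the planes z = ±T, which reproduces the distances 3 and 2√3 per cluster:

```python
import numpy as np

# Hypothetical instantiation of the two-cluster example: two equilateral
# triangles of circumradius 1 centered at (0, 0, +-T).
T = 1e4
ang = 2 * np.pi * np.arange(3) / 3
cluster = np.stack([np.cos(ang), np.sin(ang), np.full(3, T)])
pts = np.hstack([cluster, cluster * np.array([[1.], [1.], [-1.]])])

def dist_to_line(P, p0, v):
    # sum of Euclidean distances of the columns of P from the line p0 + t v
    v = v / np.linalg.norm(v)
    D = P - p0[:, None]
    proj = v[:, None] * (v @ D)
    return np.linalg.norm(D - proj, axis=0).sum()

axis_cost = dist_to_line(pts, np.zeros(3), np.array([0., 0., 1.]))
# vertical line through the data point (1, 0, T)
thru_cost = dist_to_line(pts, pts[:, 0], np.array([0., 0., 1.]))
```

Here `axis_cost` evaluates to 6, while `thru_cost` evaluates to 4√3 ≈ 6.93, so the coordinate axis beats every vertical line through a data point.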
The same construction can be done for arbitrary subspaces of dimension d < n − 1. For example, consider just the points where e_k is the kth unit vector. Using the same argument as above, the distance to the subspace span{±e_k : k = 1, . . . , d} is smaller than that to any d-dimensional subspace containing at least one data point.

Numerical Examples
In this section, we demonstrate the performance of rreaper by numerical examples implemented in MATLAB® (R2019b, 64-bit) and run on an Intel© Core™ i7-8700 CPU (6 × 3.20 GHz) with 32 GiB main memory. For synthetic data, we employ the so-called haystack model introduced by Lerman et al. in [27]. The idea is to generate a set of inliers X_in ∈ R^{n,N_in} lying near a d_L-dimensional subspace, here the subspace L spanned by the first d_L unit vectors, and a set of outliers X_out ∈ R^{n,N_out} located somewhere in the surrounding space. More precisely, the columns of the inlier and outlier matrices are drawn from the Gaussians $(X_{\mathrm{in}})_{\cdot,k} \sim \mathcal N\bigl(0, (\sigma_{\mathrm{in}}^2/d_L)\,\Pi_L + (\sigma_{\mathrm{noise}}^2/n)\, I_n\bigr)$ and $(X_{\mathrm{out}})_{\cdot,k} \sim \mathcal N\bigl(0, (\sigma_{\mathrm{out}}^2/n)\, I_n\bigr)$.
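A minimal NumPy sketch of the haystack sampling (the function name and seed are ours; the covariance scalings follow the model as described above):

```python
import numpy as np

rng = np.random.default_rng(1)

def haystack(n, d_L, N_in, N_out, s_in=1.0, s_out=1.0, s_noise=0.1):
    # Inliers: Gaussian supported on the subspace spanned by the first
    # d_L unit vectors, plus small ambient Gaussian noise.
    X_in = np.zeros((n, N_in))
    X_in[:d_L] = rng.standard_normal((d_L, N_in)) * np.sqrt(s_in**2 / d_L)
    X_in += rng.standard_normal((n, N_in)) * np.sqrt(s_noise**2 / n)
    # Outliers: isotropic ambient Gaussian.
    X_out = rng.standard_normal((n, N_out)) * np.sqrt(s_out**2 / n)
    return np.hstack([X_in, X_out])
```

For instance, `haystack(100, 10, 100, 25)` generates the data set used in the rank-reduction experiment below.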
The complete data set then consists of the matrix X = [X_in | X_out]. Further, we compare the results of the matrix-free rreaper with several other robust PCA methods which can be applied to high-dimensional images. More precisely, we consider the following methods:

- The PCA-L1 method [21] maximizes the L1 dispersion ‖P X‖_{1,1}. The method is based on a greedy search algorithm that determines the maximizer P̂ by finding the rank-one projections onto the principal components one after another. The method finds a local maximum, but there is no guarantee that it computes the global one. Indeed, the results heavily depend on the starting value, which is the reason why we restart the algorithm several times (here 10 times). Further, in contrast to the other methods, this one is not rotationally invariant due to its objective function, which often results in lower quality.
- The nested Weiszfeld algorithm [35] aims to approximate the minimizer of ‖(I_n − P)X‖_{2,1}. The projection is determined by a greedy-like method that computes the principal components sequentially. In general, this does not yield a (local) minimizer of ‖(I_n − P)X‖_{2,1}.
- R1-PCA [9,34] again minimizes the data fidelity ‖(I_n − P)X‖_{2,1}. The minimization is performed by a gradient descent method on the Grassmannian manifold, which is equivalent to a conditional gradient algorithm, also known as the Frank-Wolfe algorithm. Convergence to a (local) minimizer is only guaranteed under certain assumptions on the anchor set.
- The Fast Median Subspace method (FMS) [25] is a heuristic method that iteratively minimizes the energy ‖(I_n − P)X‖_{2,p} := Σ_{k=1}^N ‖(I_n − P)x_k‖_2^p with 0 < p < 2. For this, a sequence of subspaces is constructed on the basis of weighted PCAs. FMS always converges to a stationary point (or to a continuum of stationary points), which is usually a (local) minimizer of the energy functional.
- Geodesic Gradient Descent (GGD) [31] intends to minimize ‖(I_n − P)X‖_{2,1} by a geodesic descent performed on the Grassmannian. Under certain conditions on the problem and the starting value, this method converges linearly to the underlying unknown subspace.

Fig. 1 Performance of rreaper with (2,1)-norm and Frobenius norm in the data fidelity term, respectively. The first one appears to be more robust against outliers

(2,1)-Norm versus Frobenius Norm
This example with simple synthetic data shows that the (2,1)-norm in the data term is more robust against outliers than the Frobenius norm. For the Frobenius norm, here abbreviated as F-norm, we only have to replace the projection onto B_{2,∞} by the projection onto the Frobenius-norm ball. We want to recover a line in the plane. Since this recovery problem is invariant under rotations, we restrict ourselves to span{(1, 0)^T}. The data are generated randomly using the haystack model and consist of 50 points near the considered axis, to which we added a small amount of noise in the second coordinate, and of 10 outliers located somewhere in the plane, see Fig. 1. Besides the data points, the recovered lines using rreaper with the (2,1)-norm (solid line) and the Frobenius norm (dashed line) as data fidelity, with parameters d = 1 and α = 5, are presented. In this toy example, rreaper yields a nearly perfect result regardless of the outliers and is in particular more robust than the same model with the Frobenius norm.
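The swap between the two data terms only changes the dual projection: column-wise shrinkage for the (2,1)-norm versus a global rescaling for the F-norm. A sketch of both (unit ball radius assumed here for illustration; the radius used inside the actual algorithm may differ):

```python
import numpy as np

def proj_B2inf(Y):
    # Project each column of Y onto the Euclidean unit ball:
    # columns with norm > 1 are shrunk onto the sphere.
    norms = np.maximum(np.linalg.norm(Y, axis=0), 1.0)
    return Y / norms

def proj_frobenius(Y, radius=1.0):
    # Rescale the whole matrix onto the Frobenius-norm ball.
    nrm = np.linalg.norm(Y)
    return Y if nrm <= radius else Y * (radius / nrm)
```

The column-wise projection treats each data point separately, which is exactly what makes the (2,1)-norm insensitive to a few large outlier columns.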

Nuclear Norm and Truncated Hypercube Constraints
In this example, we are interested in how the rank reduction is influenced by the nuclear norm and by the projection onto the truncated hypercube. In this synthetic experiment, we approximate the given data x_k ∈ R^100 by a 10-dimensional subspace. The data are again generated randomly with respect to the haystack model with σ_in = σ_out = 1 and σ_noise = 0.1, where 100 points lie near the subspace L spanned by the first ten unit vectors, together with 25 additional outliers. In Fig. 2a, the data set is represented by the distances to the subspace L and to the orthogonal complement L⊥. We apply rreaper in Algorithm 8 with different parameter combinations. For the target dimension, we choose in our first experiment d = 10, which is the wanted dimension, and in the second one d = 100, which does not truncate the unit hypercube at all. The influence of the regularization parameter α on the rank of P^(r) is shown in Fig. 2b, where the lines from top to bottom correspond to the regularization parameters α = 1/4, 1/2, 3/4, 1, 2. Note that we start the iteration with the rank-zero matrix P^(0) := 0; the parameters of the experiment are summarized in Table 1.
Considering only the results for d = 100 (dashed lines), we see that the nuclear norm reduces the rank of the iteration variable P^(r) with an increasing regularization parameter. Further, the rank during the primal-dual algorithm is very sensitive to the regularization parameter. For d = 10 (solid lines), the situation changes dramatically. After the initial stages, the rank of P^(r) decreases nearly to the target dimension. Since the matrices P^(r) are not orthogonal projections, rank and trace do not coincide; due to this fact, the rank is not strictly bounded by the maximal trace of the truncated hypercube. Nevertheless, the projection onto the truncated hypercube significantly reduces the rank.
For an optimal rank evolution during the matrix-free primal-dual method, the projection onto the truncated hypercube by Algorithm 2 appears to be important. Moreover, this projection makes the rank evolution less sensitive to the regularization parameter α, so that a wider range of regularization parameters can be applied without losing the computational benefits of the low rank. Thus, the truncated hypercube projection is a key component of the algorithm.
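The projection onto the truncated matrix hypercube {0 ⪯ P ⪯ I_n, tr P ≤ d} acts on the spectrum only. The following sketch shows the underlying scalar projection of an eigenvalue vector onto {x : 0 ≤ x_i ≤ 1, Σ_i x_i ≤ d}, solved by bisection on the KKT shift (Algorithm 2 in the paper realizes this matrix-free on a Lanczos factorization, which we do not reproduce here):

```python
import numpy as np

def proj_truncated_hypercube(lam, d):
    # Project the eigenvalue vector lam onto {x : 0 <= x_i <= 1, sum x <= d}.
    # By the KKT conditions, the projection is x_i = clip(lam_i - t, 0, 1)
    # for some shift t >= 0 that is zero if the trace bound is inactive.
    x = np.clip(lam, 0.0, 1.0)
    if x.sum() <= d:
        return x                          # trace bound inactive, t = 0
    lo, hi = 0.0, lam.max()               # bisection on the shift t
    for _ in range(100):
        t = 0.5 * (lo + hi)
        if np.clip(lam - t, 0.0, 1.0).sum() > d:
            lo = t
        else:
            hi = t
    return np.clip(lam - hi, 0.0, 1.0)
```

Applied to the eigenvalues of a symmetric iterate, this realizes exactly the clipping that keeps the trace, and thereby indirectly the rank, under control.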
Finally, we compare the performance of rreaper in this experiment with the other robust PCA methods mentioned above. For this, we repeat the experiment 100 times with random data. For rreaper, we choose the parameters d = 10 and α = 0.75. The mean reconstruction errors and the mean run times are recorded in Table 2. In this experiment, PCA-L1 fails to approximate the true underlying subspace. Interestingly, rreaper here always yields a slightly better approximation than R1-PCA, FMS, and GGD. The reason for this behavior is that the nuclear norm regularization suppresses the Gaussian noise within the computed principal components.

Choosing the Parameters
In the last example, we have seen that the target dimension is achieved by choosing either d or α appropriately. To understand the relation between the parameters in more detail, and finally to get a clue how to choose α if d is just taken large enough, we provide some additional experiments. The first conclusion is that the reconstruction error is much less sensitive to the choice of α than to that of d.
Next, considering again the variational nature of rreaper in (9), we recall that ‖P X − X‖_{2,1} describes the error between the 'projected data' P X and the given data X, and that ‖P‖_tr controls the rank of P. In order to find a good subspace, both terms have to be small. Notice that the data fidelity becomes small if the rank/trace becomes large and vice versa. Consequently, α has to be chosen such that both complementary aspects are balanced. One approach originating from inverse problems is the so-called L-curve, see for instance [14,15]. The L-curve is a standard or log-log plot of the data fidelity against the regularizer. For the haystack data set, the corresponding L-curve, computed once for the outcome P̂ of rreaper and once for its projection Π̂ = proj_{O_25}(P̂), is shown in Fig. 4, where d = 25. Notice that both L-curves have a singularity, especially in the log-log plot. For the sake of clarity, we also plotted the data fidelity and the trace norm over α, see Fig. 5. The corner of the L-curve for Π̂ corresponds to α = 9.2; around this corner, data fidelity and regularizer are balanced. More importantly, for every α ∈ [2.1, 9.2], the determined subspace has dimension 10 and thus coincides with the true dimension.
We repeated this experiment around 500 times for randomly chosen d, d_L, N_in, N_out, and σ_noise and observed the same behavior of the L-curve, meaning that the corner of the curve was clearly visible and corresponded to the true subspace dimension. Therefore, it seems that the unknown subspace dimension can be detected and an appropriate regularization parameter α can be chosen by studying the L-curve of the given data set. Notice that the L-curve, although well studied in the literature, is a heuristic parameter choice rule. Finally, let us mention that if the inlier noise is too large or if the outliers hide the linear structure of the inliers, then the L-curve loses its singularity.
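A simple way to locate the corner automatically is to take the point of maximal discrete curvature of the log-log L-curve. The following sketch is our own heuristic illustration, not the rule used in the paper; refined corner detectors exist in the L-curve literature:

```python
import numpy as np

def lcurve_corner(fid, reg):
    # Pick the L-curve corner as the index of maximal discrete curvature
    # in the log-log plot of data fidelity (fid) against regularizer (reg).
    x, y = np.log(fid), np.log(reg)
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return int(np.argmax(kappa[1:-1]) + 1)      # ignore the endpoints
```

Given the fidelity and trace-norm values recorded over a grid of α, this returns the index of the balancing parameter.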

Face Approximation
The idea to use the principal components of face images, the so-called eigenfaces, for recognition, classification and reconstruction was considered in various papers and goes probably back to [45]. In this experiment, we adopt this idea to show that our matrix-free rreaper can handle high-dimensional data. Since the computation of an optimal offset is non-trivial, as discussed in Sect. 7, we simply choose the geometric median. For the experiment, we use the 'Extended Yale Face Data Set B' [12,23], consisting of several series of face images under different lighting conditions but with the same facial expression. The basis of our noisy data set are 64 images of size 640 × 480 pixels with integer values between 0 and 255. We place artifacts in the first four images covering the right eye, the nose, the right ear and the mouth, respectively, which serve as outliers. The complete employed data set is shown in Fig. 6a.
It is well known that such images can be well approximated by a subspace covering around five directions [10]. To remove the artifacts by unsupervised learning, we therefore approximate the full data set, including the artificially corrupted face images, by a five-dimensional subspace (d = 5) using rreaper and project the corrupted images onto the computed subspace. A typical effect of this procedure is that dark regions are lightened, shadows are removed, and reflections on skin and eyes are cleared away. Besides these effects, the major part of the unwanted artifacts is removed, see Fig. 6b. Note that in this example the projection Π̂ corresponds to a 307,200 × 307,200 matrix, which would require 703.125 GiB in double precision, whereas the matrix-free representation only requires around 14.063 MiB, since the rank of the primal variable is bounded by six due to the choice of α := 4.2 · 10^4. The primal-dual algorithm for rreaper converged after 30 iterations.

Fig. 6 The used data set based on the Extended Yale Face Data Set B and its projections onto the determined subspace using matrix-free rreaper

Fig. 7 Restoration of the first sample by projecting to the principal components determined by several robust PCAs

Fig. 8 Restoration of the artifact in the first sample by projecting to the principal components determined by several robust PCAs
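The memory figures quoted above follow from a two-line computation (8 bytes per double-precision entry; the matrix-free variant stores only an n × r factor):

```python
# Dense 307200 x 307200 double matrix versus the rank-6 factor storage
# of the matrix-free representation (8 bytes per double entry).
n, r = 640 * 480, 6
dense_gib = n * n * 8 / 2**30      # -> 703.125 GiB
factored_mib = n * r * 8 / 2**20   # -> 14.0625 MiB
```

The roughly 50,000-fold reduction is what makes the projector formulation feasible at image resolution.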
Next, we compare rreaper with the robust PCA methods mentioned above. Figure 7 shows examples of the projection of the first corrupted image onto the subspaces computed by the four considered methods. The restoration of the artifact covering the right eye is shown in more detail in Fig. 8. Taking a closer look at Fig. 8, we notice that PCA-L1, although very fast, yields the worst result, which has also been reported in [35]. The nested Weiszfeld algorithm and R1-PCA nearly remove the artifact. The result of rreaper is comparable but appears to be slightly smoother, which is an effect of the additional regularization.
Comparing the required run times, we notice that PCA-L1 with 10 restarts is the fastest method with 10.189 s due to its linear structure, followed by FMS with 12.313 s and the nested Weiszfeld algorithm with 18.875 s. R1-PCA, rreaper and GGD behave similarly with 43.441 s, 43.439 s and 55.968 s, respectively. Note that the run times are only comparable to a limited extent due to the different nature of the algorithms and hence the different stopping criteria. For instance, GGD here requires a very small step size to stay numerically on the Grassmannian. We can clearly see that rreaper is slower than the sequential or heuristic approaches; its numerical effort is, however, comparable with that of non-convex methods like R1-PCA. In contrast to the non-convex methods, which may get stuck in local minima, as happens here for GGD, rreaper converges to a global minimizer.

Conclusion
Convex models are usually preferable over non-convex ones since every local minimizer is a global one. While robust PCA models that can handle high-dimensional data are usually non-convex, a convex relaxation was proposed by the reaper model. Relying on the projector approach, it is, however, not applicable to high-dimensional data in its original form. To manage such data, we have combined primal-dual optimization techniques from convex analysis with sparse factorization techniques from the Lanczos algorithm. Moreover, we have extended the model by penalizing the nuclear norm of the projector and called the result rreaper. This yields the first convex variational method for robust PCA that can handle high-dimensional data, since the required memory is reduced from O(nN + n²) to O(nN + nr), where r is the maximal rank during the primal-dual iteration, which usually has the same magnitude as d. The rreaper minimization algorithm always converges to a global minimizer of the regularized objective. Moreover, when using the L-curve method, an advantage of the new method seems to be that the dimension of the low-dimensional subspace does not have to be known in advance, but may be overestimated. We intend to further investigate this direction from the point of view of multi-objective optimization [8,48]. We have addressed the problem of the bias in robust PCA, but more research in this direction appears to be necessary. Further, other sparsity-promoting norms than the nuclear norm could be considered. Our method can be extended to 3D data such as videos and 3D stacks of medical or material images, where tensor-free methods will come into play. Finally, it may be interesting to couple PCA ideas with approaches from deep learning to better understand the structure of both.