Prox-regularity of rank constraint sets and implications for algorithms

We present an analysis of sets of matrices with rank less than or equal to a specified number $s$. We provide a simple formula for the normal cone to such sets, and use this to show that these sets are prox-regular at all points with rank exactly equal to $s$. The normal cone formula appears to be new. This allows for easy application of prior results guaranteeing local linear convergence of the fundamental alternating projection algorithm between sets, one of which is a rank constraint set. We apply this to show local linear convergence of another fundamental algorithm, approximate steepest descent. Our results apply not only to linear systems with rank constraints, as has been treated extensively in the literature, but also nonconvex systems with rank constraints.


Introduction
Rank optimization is a well-developed topic that has found a tremendous number of applications in recent years (see [23] and references therein). Most of the problems one encounters involve a linear data model that is underdetermined and the very poorly behaved "sparsity function", either the function determining the rank of a matrix or the function counting the number of nonzero entries in an array. A common approach to solving sparsity optimization problems is via a convex surrogate, most often the 1 or (in the case of matrices) the nuclear norm. The rational for working 1 arXiv:1112.0526v3 [math.OC] 3 Dec 2012 with such surrogates is that the original problem is NP-complete, and thus should be avoided. Inspired by earlier work proving local convergence of cyclic projections onto nonconvex sets with an application to sparse signal recovery [6], and a more recent projection-reflection algorithm for x-ray imaging [20] that appears to be very successful at working with a proximal operator of the 0 function and a nonlinear imaging model, we set out in the present note to determine whether sets with sparsity constraints have some sort of regularity that might justify working directly with sparsity rather than through convex surrogates.
Based on the work of Lewis and Sendov [15,16], Le has obtained explicit formulas for the generalized rank function [11]. This formula shows that every point of the rank function is a critical point [8], and so reasonable algorithmic strategies should not directly make use of the rank function. Instead, we consider the lower level sets of the rank function. While sets of matrices of rank less than a specified level are not manifolds, we show here that they are quite regular, in fact prox-regular. While prox-regularity of these sets is not new [14], our proof of this fact established in Section 3 uses elementary tools, at the center of which is a particularly simple and apparently new characterization of the normal cone to these sets established in Proposition 3.6.
Prox-regularity of the lower level sets of the rank function immediately yields local linear convergence of fundamental algorithms for either finding the intersection of the rank constraint set with another set determined by some (nonlinear) data model, or for minimizing the distance to a rank constrained set and a data set. The result, detailed in Section 4, is quite general and extends to nonconvex data imaging models with rank constraints. Our results are an extension of results established recently in [3] for the vector case, however at the cost of additional assumptions on the regularity of the solution set. In particular, [3] establishes local linear convergence, with radius of convergence, of alternating projections between an affine constraint and the set of vectors with no more than s nonzero elements without any assumptions on the regularity of the intersection of these sets, beyond the assumption that it is nonempty. Our results, in contrast, are modeled after results of [14] and [13] where a stronger regularity of the intersection is assumed. We discuss the difficulties in extending the tools developed in [4] to the matrix case in the conclusion. In any case, avoiding convex surrogates is at the cost of global convergence guarantees: these results are local and offer no panacea for solving rank optimization problems. Rather, this analysis shows that certain macro-regularity assumptions such as restricted isometry or mutual coherence (see [23] and references therein) play no role asymptotically in the convergence of algorithms, but rather have bearing only on the radius of convergence. We begin this note with a review of notation and basic results and definitions upon which we build.

Notation
Throughout this paper X and Y are Euclidean spaces. In particular we are interested in Euclidean spaces defined on R m×n where we derive the norm from the trace inner product This naturally specializes to the case of R n when m = n above and x ∈ R n×n is restricted to the subspace of diagonal matrices. For x ∈ R m×n we denote the span of the rows of x by range(x T ) and recall that this is orthogonal to the nullspace of the linear mapping x : For x ∈ {z ∈ R n×n | z ij = 0 if i = j } (that is, when x is square diagonal) this corresponds exactly to the usual support of vectors on R n : where Diag (x) maps the diagonal of the matrix x ∈ R m×n to a vector in R r with r = min{m, n}.
In order to emphasize this connection to the support of vectors, and reduce notational clutter we will denote the span of the rows of x by We denote the rank of x by rank(x) and recall that rank(x) is the dimension of the span of the columns -or equivalently the rows -of x which is equivalent to the number of nonzero singular values. The singular values of x ∈ R m×n are the (positive) square root of the eigenvalues of xx T ; these are denoted by σ j (x) and are assumed to be ordered so that σ i (x) ≥ σ j (x) for i < j. We denote by σ(x) := (σ 1 (x), σ 2 (x), . . . , σ r (x)) T (r = min{m, n}) the ordered vector of singular values of x. The corresponding diagonal matrix is denoted Σ(x) := diag (σ(x)) ∈ R m×n where diag (·) maps vectors in R r to matrices in R m×n . Following [12,15,16] we denote the (Lie) group of n × n orthogonal matrices by O(n) and the product O(m) × O(n) by O(m, n). A singular value decomposition of x ∈ R m×n restricted to the above ordering is then any pair of orthogonal matrices (U, V ) ∈ O(m, n) together with Σ(x) such that x = U Σ(x)V T . We will denote the set of pairs of orthogonal matrices that comprise singular systems for x by U( The closed ball centered at x with radius ρ is denoted by B(x, ρ); the unit ball centered at the origin is simply denoted by B. Given a set Ω ⊂ X , we denote the distance of a point x ∈ X to Ω by If Ω is empty then we use the convention that the distance to this set is +∞. The corresponding (multivalued) projection operator of x onto Ω, denoted P Ω (x), is defined by If Ω is nonempty and closed, then the projection of any point in X onto Ω is nonempty.
We define the normal cone to a closed set Ω ⊂ X following [24, Def. 6.3]: The vectors v k are regular normals to Ω at x k and the cone of regular normals at x k is denoted N Ω (x k ).
What we are calling regular normals are called Fréchet normals in [19,Def. 1.1].
Here and elsewhere we use the notation x → Ω x to mean that x → x with x ∈ Ω. An important example of a regular normal is a proximal normal, defined as any vector v ∈ X that can be written as v = λ(x − x) for λ ≥ 0 and x ∈ P Ω (x) for some x ∈ X . We denote the set of proximal normals to Ω at x ∈ Ω by N P Ω (x). For Ω closed and nonempty, any normal v ∈ N Ω (x) can be approximated arbitrarily closely by a proximal normal [24,Exercise 6.18]. Thus we have the next result which is key to our analysis. Proposition 2.2 (Theorem 1.6 of [19]) Let Ω ⊂ X be closed and x ∈ Ω. Then Central to our results is the regularity of the intersection of sets, which we define in terms of a type constraint qualification formulated with the normal cones to the sets at points in the intersection.

Definition 2.3 (basic set intersection qualification) A family of closed sets
is y i = 0 for i = 1, 2, . . . , m. We say that the intersection is strongly regular at x if the basic set constraint qualification is satisfied there.
In the case m = 2, this condition can be written The two set case is called the basic constraint qualification for sets in [19, Definition 3.2] and has its origins in the the generalized property of nonseparability [18] which is the n-set case. It was later recovered as a dual characterization of what is called strong regularity of the intersection in [10,Theorem 3]. It is called linear regularity in [13].
The case of two sets also yields the following simple quantitative characterization of strong regularity.
Definition 2.5 (angle of regular intersections) Suppose that Ω 1 and Ω 2 are closed subsets of X . We say that the intersection Ω 1 ∩ Ω 2 is strongly regular at x ∈ Ω 1 ∩ Ω 2 with angle θ := cos −1 (c) > 0 when the constant c given by (2.2) is less than 1.
We will also require certain regularity of the sets themselves, not just the intersection. The following definition of prox-regularity of sets is a modern manifestation that can be traced back to [7] and sets of positive reach. What we use here as a definition actually follows from the equivalence of prox-regularity of sets as defined in [

Properties of lower level sets of the rank function
We collect here some facts that will be used repeatedly in what follows.
Proof. This follows immediately from continuity of the singular values as a function of x. (See, for instance, [9, Appendix D].) For the remainder of this note we will consider real m × n matrices and denote by r the minimum of {m, n}. The rank level set will be denoted by S := {y ∈ R m×n | rank(y) ≤ s } for s ∈ {0, 1, . . . , r}. As can be found in textbooks on matrix analysis, the projection onto this set is just the truncation of the r − s smallest singular vectors to zero; in the case of a tie for the s-th largest singular value, the projection is the set of all s-selections from the s-largest singular values.
The projection P S (x) is given by The next results establish that the set S is prox-regular at all points where rank(x) = s. We make use of the following tools. For r = min{m, n} define the mappings J : where |J(x, α)| denotes the cardinality of this discrete set. We define J(x, +∞) to be the empty set. Before proceeding with our results, we collect some observations about these objects.
Proof. (i) Since the cardinality of the empty set is zero, the supremum in the definition of α 0 is unbounded. In any case, the cardinality of J(x, α) is monotonically decreasing with respect to α for x fixed from a value of r at α = 0 to 0 for all α > σ 1 (x). Thus for x fixed α s is bounded for all s ∈ {1, 2, . . . , r}. The value of α for which the cardinality s ≥ 1 is achieved is attained precisely when α = σ j (x) for some j.  (i) P S (x) is multi-valued; Proof. To show that (i) implies (ii), let y and z ∈ P S (x) with y = z. By Lemma 3.
. Then by [9,Theorem 7.4.51] hence rank(x) > s. Since y = z and they have the same singular values, the multiplicity of singular values σ j (x) with value α s must be greater than one, hence |J(x, α s )| > s.
Conversely, to show that (ii) implies (i) first note that by Lemma 3.3(ii) rank(x) > s > 0. Now fix y ∈ P S (x) with y = U y Σ s (x)V T y for (U y , V y ) ∈ U(x). The corresponding decomposition for x is U y Σ(x)V T y . Now construct the orthogonal matrix V by switching the s + 1 th column of V y with the s th column of V y . Since |J(x, α s )| > s we have that x = U y Σ(x) V T . Define z := U y Σ s (x) V T . By Lemma 3.2 z ∈ P S (x) with rank(z) = s, but the s th column of V is in the nullspace of y so z = y and the projection is thus multi-valued. This completes the proof.
An immediate consequence of the above is the obvious observation that the projection onto the trivial sparsity sets S 0 with s = 0 and S r with s = r is single-valued.
The normal cone of this set has the following simple characterization. We first show that W is nonempty and hence Z(w) for w ∈ W is nonempty. For all x ∈ R m×n and s ∈ {0, 1, 2, . . . , r} the zero matrix 0 ∈ W , hence W is nonempty. Next note that for w ∈ W , Z(w) ⊂ ker(w) with dim(ker(w)) ≥ s ≥ 0, and it is always possible to find an element z of ker(w) with rank(x + z) = s. Now, choose any w ∈ W and z 0 ∈ Z(w) and construct the sequences (x k ) k∈N and (w k ) k∈N by There is a K ∈ N such that for all k > K Thus for all k > K Note that by Proposition 3.4 and Lemma 3.3(ii) the representation of the projection above holds with equality since rank x + 1 √ k w 0 = s. Since x k → x, by definition, w ∈ N S (x). As w was arbitrary, we have W ⊂ N S (x).
We show next that, conversely, N S (x) ⊂ W for x ∈ S. The matrix w = 0 trivially belongs to W , so we assume that w = 0. By Proposition 2.2 we can write w as a limit of proximal normals, that is, the limit of sequences (x k ) and (w k ) with x k / ∈ S and w k → w for w k = t k x k − y k for y k ∈ P S (x k ). We consider the corresponding singular value decompositions by y k = U k Σ s (x k )V T k for (U k , V k ) ∈ U(x k ) and Σ s (x) := diag ((σ 1 (x), σ 2 (x), . . . , σ s (x), 0, . . . , 0) T ) ∈ R m×n (see Lemma 3.2). Note that x k and y k have the same left and right singular vectors with the usual ordering. The matrices U k and V k are also collections of left and right singular vectors for w k , although they do not yield the usual ordering of singular values of w k : Let (U , V ) ∈ U(x) be the limit of left and right singular vectors of x k , that is, It follows immediately that rank(w) ≤ r − s and Supp (w) ⊥ Supp (x) which completes the proof of the inclusion.
To see that each normal to the set S at x with rank(x) = s is actually a proximal normal, note that if rank(x) = s then by (3.1) every point v ∈ N S (x) can be written as v = 1 τ ((τ v + x) − P S (τ v + x)) for τ > 0 small enough. Suppose, on the other hand, that rank(x) < s.
It is known that the set of matrices with rank s is a smooth manifold [1] (although the set of matrices with rank less than or equal to s is not), from which it follows that S is prox-regular [14, Lemma 2.1 and Example 2.3]. We present here a simple proof of this fact based on the characterization of the normal cone. Proof. Let (x k ) k∈N in R m×n be any sequence converging to x with the corresponding sin- . . , σ s (x k ), 0 . . . , 0 T and Σ s (x k ) := 0, . . . , 0, σ s+1 (x k ), σ s+2 (x k ), . . . , σ r (x k ) T for r = min{m, n}. Note that y k → x with rank(y k ) = rank(x) = s for all k large enough, while by Proposition 3.6 z k → 0 with z k ∈ N S (y k ) for all k. Then for all k large enough max j {σ j (z k )} = σ s+1 (x k ) < σ s (x k ) = min j {σ j (y k )} and |J(x k , α s )| = s. By Proposition 3.4 the projection P S (x k ) is single-valued. Since the sequence was arbitrarily chosen, it follows that the projection is single-valued on a neighborhood of x, hence S is prox-regular.

Algorithms for optimization with a rank constraint
The prox-regularity of the set S has a number of important implications regarding numerical algorithms. Principal among these is local linear convergence of the elementary alternating projection and steepest descent algorithms. There has been a tremendous number of articles published in recent years about convex (and nonconvex) relaxations of the rank function, and when the solution of optimization problems with respect to these relaxations corresponds to the optimization problem with the rank function (see the review article [23] and references therein). The motivation for such relaxations is that there are polynomial-time algorithms for the solution of the relaxed problems, while the rank minimization problem is NP-complete. As we will show in this section, the above theory implies that in the neighborhood of a solution there are polynomial-time algorithms for the solution of optimization problems with rank constraints. This observation was anticipated in [2] and notably [5] where a (globally) linearly convergent projected gradient algorithm with a rank constraint was presented. Without further assumptions, however, such assurances of convergence of algorithms for problems with rank constraints is at the cost of global guarantees of convergence.

Inexact, extrapolated alternating projections
To the extent that the singular value decomposition can be computed exactly, the projection of a point x onto the rank lower level set S can be calculated exactly simply by ordering the singular values of x and truncating. The above analysis immediately yields local linear convergence of exact and inexact alternating projections for finding the intersection S ∩ M for M closed on neighborhoods of points where the intersection is strongly regular. The following algorithm allows for inexact evaluation of the fixed point operator, and hence implementable algorithms.
For γ = 0 and x 2k+1 = x 2k+1 * the inexact algorithm reduces to the usual alternating projections algorithm. Note that the odd iterates x 2k+1 can lie on the interior of M . This is the major difference between Algorithm 4.1 and the one specified in [13] where all of the iterates are assumed to lie on the boundary of M . We include this feature to allow for extrapolated iterates in the case where M has interior.
If, in addition, M is prox-regular at x, then the iterates converge with rate Proof. Since by Proposition 3.8 S is prox regular at x the results follow immediately from [17,Theorem 4.4].

Remark 4.3
The above result requires only closedness of the set M . For example, this yields convergence for affine sets M = {x | Ax = b } which are not only closed, but convex. But the above result is not restricted to such nice sets. Another important example is inverse scattering with sparsity constraints [20]. Here the set M is M = x ∈ C n |(F x) j | 2 = b j , j = 1, 2, . . . , n where F is a linear mapping (the discrete Fourier or Fresnel transform) and b is some measurement (a far field intensity measurement). This set is not convex, but it is certainly closed (in fact prox-regular), so again, we can apply the above results to provide local guarantees of convergence for nonconvex alternating projections with a sparsity set.
Paradoxically, in the vector case it is the projection onto the affine constraint that in general cannot be evaluated exactly, while the projection onto the sparsity set S can be implemented exactly by simply (hard) thresholding the vectors. In the matrix case, this is no longer possible since in general the singular values cannot be evaluated exactly. In order to accommodate both projections being approximately evaluated, we explore one possible solution using a common reformulation of the problem on a product space. This is explained next.

Approximate steepest descent
Another fundamental approach to solving such problems is simply to minimize the sum of the (squared) distances to the sets M and S: Steepest descent without line search is: given If S and M were convex and the distance function the Euclidean distance, it is well-known that this would be equivalent to averaged projections: If we assume that M is prox-regular, then, since we have already established the prox-regularity of S, the correspondence between the derivative of the sum of squared distances to these sets and the projection operators in (4.2) holds on (common) open neighborhoods of M and S [22, Theorem 1.3]. Using a common product space formulation due to [21] we can show that (4.2) is equivalent to alternating projections between the sets where x k+1 is given by (4.2). The set Ω is prox-regular if M and S are, and the set D is convex, so Theorem 4.2 guarantees local linear convergence of the sequence of iterates (x k ) k∈N with rate depending on the angle of strong intersection of the sets D and Ω. We cannot expect to be able to compute the projection onto the set Ω exactly, but we can reasonably assume to be able to compute the projection onto the diagonal D exactly, even if the magnitudes of the elements of P S (x k ) and P M (x k ) differ in orders of magnitude beyond our numerical precision. Indeed, since the projection operators P S and P M are Lipschitz continuous for S and M prox-regular [22, Theorem 1.3], we can attribute any error we in fact make in the evaluation of P D to the evaluation of P Ω where we compute an approximation according to Algorithm 4.1. Again, Theorem 4.2 guarantees local linear convergence with rate governed by the angle of strong regularity between D and Ω and the accuracy of the approximate projection onto Ω.

Conclusion
We have developed a novel characterization of the normal cone to the lower level sets of the rank function. This enables us to obtain a simple proof of the prox-regularity of such sets. This property then allows for a straight-forward application of previous results on the local linear convergence of approximate alternating projections for finding the intersection of rank constrained sets and another closed set, as long as the intersection is strongly regular at a reference point x. Our characterization of the normal cone to rank constraint sets allows for easy characterization and verification of the strong regularity of intersections of these sets with other sets. The results are also extended to the elementary steepest descent algorithm for minimizing the sum of squared distances to sets, one of which is a rank constraint set. This implies that, in the neighborhood of a solution with sufficient regularity, there are polynomial time algorithms for directly solving rank constraint problems without resorting to convex relaxations or heuristics.
What remains to be determined is the radius of convergence of these algorithms. Using the restricted normal cone developed in [4] Bauschke an coauthors [3] obtained linear rates with estimates of the radius of convergence of alternating projections applied to affine sparsity constrained problems -that is, the vector affine case of the setting considered here -assuming only existence of solutions. The restricted normal cone is not immediately applicable here since the restrictions in [3] are over countable collections of subspaces representing all possible s−sparse vectors. For the rank function this is problematic since the collection of all possible s-rank matrices is not countable. Extending the tools of [4] to the matrix case is the focus of future research.
The results one might obtain using the tools of [4] or similar, however, are based on the regularity near the solution, what we call micro-regularity. We cannot expect the estimates for the radius of convergence to extend very far using these tools, unless certain local-to-global properties like convexity are assumed. In [5] a scalable restricted isometry property is used to prove global convergence of a projected gradient algorithm to the unique solution to the problem of minimizing the distance to an affine subspace subject to a rank constraint. The (scalable) restricted isometry property and other properties like it (mutual coherence, etc) directly concern uniqueness of solutions and indirectly provide sufficient conditions for global convergence of algorithms for solving relaxations of the original sparsity/rank optimization problem. A natural question is whether there is a more general macro-regularity property than the scalable restricted isometry property, one independent of considerations of uniqueness of solutions, that guarantees global convergence.