Efficient Proximal Mapping Computation for Low-Rank Inducing Norms

Low-rank inducing unitarily invariant norms have been introduced to convexify problems with a low-rank/sparsity constraint. The best-known member of this family is the so-called nuclear norm. To solve optimization problems involving such norms with proximal splitting methods, efficient ways of evaluating the proximal mapping of the low-rank inducing norms are needed. This is known for the nuclear norm, but not for most other members of the low-rank inducing family. This work supplies a framework that reduces the evaluation of the proximal mapping to a nested binary search, in which each iteration requires the solution of a much simpler problem. This simpler problem can often be solved analytically, as demonstrated for the so-called low-rank inducing Frobenius and spectral norms. The framework also allows us to compute the proximal mapping of increasing convex functions composed with these norms, as well as projections onto their epigraphs.


Introduction

Background
Non-convex optimization problems with a rank or cardinality constraint appear in many data-driven areas such as machine learning, image analysis and multivariate linear regression [3, 7-9, 14, 24, 31, 46, 47, 52], as well as in areas within control such as system identification, model reduction, low-order controller design and low-complexity modelling [2, 4, 16, 23, 28, 30, 39, 40, 43, 56-59]. Besides the low-rank constraint, these problems are often convex and can be posed as

minimize_X L(X) subject to rank(X) ≤ r, (1)

where the loss function L is proper, closed and convex. Therefore, one of the most common techniques for solving such problems is to convexify them using regularizers or by taking convex envelopes [9, 16, 19, 22]. A promising class of such regularizers and convex envelopes are the so-called unitarily invariant low-rank inducing norms [19], which are defined for an arbitrary unitarily invariant norm ‖·‖_g as

‖·‖_{g,r*} := (‖·‖_g + χ_{rank(·)≤r})**, (2)

where χ_{rank(·)≤r} is the indicator function for matrices of rank at most r and (·)** is the biconjugate, which coincides with the convex envelope. If L = f_0 + f_1(‖·‖_g), where f_0 and f_1 are convex and f_1 is increasing, these envelopes have the advantage that a rank-r solution to the convex problem

minimize_X f_0(X) + f_1(‖X‖_{g,r*}) (3)

is guaranteed to be a solution to the non-convex problem (1). For instance, this may allow us to determine Frobenius-norm optimal low-rank approximations with convex constraints by a suitable choice of f_0 and f_1, where ‖·‖_2 denotes the Frobenius norm (see Section 5 and [21] for details).

Problem
Although low-rank inducing norms often admit a representation as semidefinite programs (SDPs) (see [19]), proximal splitting algorithms (see [10]) are often used for large-scale problems, where standard interior-point SDP solvers have too costly iterations (see [45, 50]). To apply such methods to (3), the proximal mapping of f_1(‖·‖_{g,r*}) is needed, which is the main objective of this work.
For some f_1, the proximal mapping of f_1(‖·‖_{g,r*}) can be evaluated directly, while for other f_1 it may be very involved or even intractable. This can be circumvented by lifting problem (3) to the epigraph form

minimize_{X,t} f_0(X) + f_1(t) + χ_{epi(‖·‖_{g,r*})}(X, t), (4)

where χ_{epi(‖·‖_{g,r*})} is the indicator function of the epigraph of ‖·‖_{g,r*}. Since the proximal mapping of the one-dimensional function f_1 is fast to evaluate, the tractability of this approach relies on the projection onto the epigraph being efficient. Thus, we are also interested in projections onto epi(‖·‖_{g,r*}), i.e., the proximal mapping of χ_{epi(‖·‖_{g,r*})}.

Contribution
In this work, we first introduce a generic search framework for computing a solution to problem (5), where γ > 0 and Z ∈ R^{n×m} are fixed, f is proper, closed and convex, and ‖·‖_{g^D,r} is the so-called dual norm (see Section 2) of ‖·‖_{g,r*}. Through the Moreau decomposition, this allows us to simultaneously treat the proximal mappings of functions f_1(‖·‖_{g,r*}), such as γ‖·‖_{g,r*} and (γ/2)‖·‖²_{g,r*}, as well as χ_{epi(‖·‖_{g,r*})}. The core step of our framework is the reduction of (5) to a nested binary search, where in each iteration a much simpler problem needs to be solved. In many cases, this simpler problem can be solved explicitly. This is demonstrated for ‖·‖_g being the Frobenius norm and the spectral norm. Finally, since after computing a singular value decomposition (SVD) our computations only involve singular values, they coincide with the computations for vector-valued problems with a cardinality constraint.
Note that for r = 1, all low-rank inducing norms coincide with the well-known nuclear norm (modulo a constant scalar), and efficient algorithms for computing its proximal mapping exist [44]. However, for the other members of the low-rank inducing family, such efficient methods are, to the best of our knowledge, still unknown. An analytic approach for the low-rank inducing spectral norm has been studied in [55]. Despite the similarity of using the Moreau decomposition, that approach has a higher computational cost than what is presented here, because of our derived binary search rules. Further, [53] proposes a non-analytic approach for an extended class of not necessarily unitarily invariant low-rank inducing norms (see [32]). This approach, however, depends on the complexity and convergence rates of other optimization algorithms. Finally, [1, 15, 34, 35] consider the special case of the squared low-rank inducing Frobenius norm, but neither the proximal mapping of the non-squared low-rank inducing Frobenius norm nor the proximal mapping of general f_1(‖·‖_{g,r*}) is considered. Interestingly, our framework shows that the computational complexities in the cases γ‖·‖_{g,r*}, (γ/2)‖·‖²_{g,r*}, and χ_{epi(‖·‖_{g,r*})} coincide. In particular, since the algorithms in [15, 34] are special cases of our framework, their computational complexity for the squared low-rank inducing Frobenius norm carries over directly to the non-squared case and the epigraph projection.
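For the r = 1 case mentioned above, the proximal mapping of the nuclear norm reduces to the well-known soft-thresholding of singular values. A minimal NumPy sketch of this standard computation (the function name is ours):

```python
import numpy as np

def prox_nuclear(Z, gamma):
    """Proximal mapping of gamma * (nuclear norm): soft-threshold the
    singular values of Z by gamma and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

# Thresholding zeroes out small singular values, lowering the rank:
Z = np.diag([3.0, 1.0, 0.2])
X = prox_nuclear(Z, 0.5)
print(np.round(np.diag(X), 4))   # diagonal shrunk to (2.5, 0.5, 0.0)
```

The O(n³) SVD dominates the cost here, which is why the later complexity discussion excludes it.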

Outline
The paper is organized as follows. We start by introducing some preliminaries on norms and convex optimization. Subsequently, a formal definition of the class of low-rank inducing norms as well as their application to rank-constrained optimization problems is given. Then we derive our main results, the binary search framework, and outline an algorithm for evaluating the epigraph projections. For the low-rank inducing Frobenius and spectral norms, we make these computations explicit and arrive at implementable algorithms whose computational cost is analyzed. Subsequently, a case study illustrates the performance of our algorithm when solving a problem of the form (3) through proximal splitting. Finally, we draw conclusions and point the reader to our freely available MATLAB and Python implementations of these algorithms.

Preliminaries
The set of reals is denoted by R, the set of real n-vectors by R^n, the set of vectors with nonnegative entries by R^n_{≥0}, and the set of real n×m matrices by R^{n×m}. In the remainder of the paper, we assume without loss of generality that n ≤ m. The singular value decomposition of X ∈ R^{n×m} has singular values σ_1(X) ≥ ... ≥ σ_n(X) ≥ 0 (counted with multiplicity). The corresponding vector of all singular values is given by σ(X) := (σ_1(X), ..., σ_n(X)).
For all x = (x_1, ..., x_n) ∈ R^n, we define the ℓ_p norms by

‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p} for 1 ≤ p < ∞, and ‖x‖_∞ := max_i |x_i|,

where |·| denotes the absolute value.
A matrix norm ‖·‖ : R^{n×m} → R_{≥0} is called unitarily invariant if for all unitary matrices U ∈ R^{n×n} and V ∈ R^{m×m} and all X ∈ R^{n×m} it holds that ‖UXV‖ = ‖X‖. Equivalently, unitary invariance can be characterized by symmetric gauge functions (see e.g. [29, Theorem 7.4.7.2]). A function g : R^n → R_{≥0} is called a symmetric gauge function if:

i. g is a norm on R^n.

ii. g(|x|) = g(x) for all x ∈ R^n, where |x| denotes the element-wise absolute value.

iii. g(P x) = g(x) for all permutation matrices P ∈ R^{n×n} and all x ∈ R^n.
Proposition 1. The norm ‖·‖ : R^{n×m} → R_{≥0} is unitarily invariant if and only if ‖X‖ = g(σ(X)) for all X ∈ R^{n×m}, where g is a symmetric gauge function.
Throughout this work, we use the notation ‖X‖_g := g(σ(X)). For X, Y ∈ R^{n×m}, the Frobenius inner product is defined as ⟨X, Y⟩ := trace(X^T Y), with Frobenius norm ‖X‖_2 := √⟨X, X⟩ = ‖σ(X)‖_2. Moreover, the nuclear norm and the spectral norm are given by ‖X‖_1 := ‖σ(X)‖_1 and ‖X‖_∞ := ‖σ(X)‖_∞ = σ_1(X). The dual norm of ‖·‖_g is defined as ‖Y‖_{g^D} := max_{‖X‖_g ≤ 1} ⟨X, Y⟩. In particular, this means that dual norms inherit the unitary invariance as well as the duality relationship of the ℓ_p norms, i.e., the dual norm of ‖·‖_p is ‖·‖_q with 1/p + 1/q = 1.
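As a concrete vector-case instance of this duality, the following NumPy snippet verifies numerically that the dual of the ℓ1 norm is the ℓ∞ norm (variable names are ours):

```python
import numpy as np

# Dual norm: ||y||_D = max{ <x, y> : ||x|| <= 1 }. For ||.|| = l1, the
# maximum is attained at a signed coordinate vector supported on the
# largest |y_i|, which recovers ||y||_inf (1/p + 1/q = 1, p = 1, q = inf).
y = np.array([2.0, -5.0, 1.0])
i = int(np.argmax(np.abs(y)))
x_star = np.zeros_like(y)
x_star[i] = np.sign(y[i])                      # feasible: ||x_star||_1 = 1
print(x_star @ y, np.linalg.norm(y, np.inf))   # both equal 5.0
```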
Furthermore, the following truncated dual gauge functions will play a key role in this work. To this end, for all 1 ≤ r ≤ n and (t, s) ∈ {1, ..., r} × {0, ..., n − r}, we define a truncation operator T_{t,s} : R^n → R^{r−t+1} acting on sort(x), where sort : R^n → R^n denotes sorting in descending order, and the corresponding truncated gauge function of g^D as g^D_{r,s,t}(x) := g^D(T_{t,s}(x), 0, ..., 0) for all x ∈ R^n. For the special case (t, s) = (1, 0), we simply write g^D_r. Note that g^D_{r,s,t} is indeed a gauge function, with dual gauge function given in [26, Lemma 2.2.2].
For the convenience of the reader, we next review some elementary definitions and results from convex optimization. For f : R^{n×m} → R ∪ {∞}, we use the standard notions of domain, epigraph, properness and closedness (see e.g. [27]). The conjugate (dual) function f* of f is defined as f*(Y) := sup_X (⟨X, Y⟩ − f(X)) for all Y ∈ R^{n×m}. The function f** := (f*)* is called the biconjugate function or convex envelope of f. For f : R → R ∪ {∞}, we say that f is increasing if f(x) ≤ f(y) whenever x ≤ y, and there exist x, y ∈ R such that x < y and f(x) < f(y). Moreover, its monotone conjugate is defined as in [48]. The indicator function of a set S ⊂ R^{n×m} is defined as χ_S(X) := 0 if X ∈ S and ∞ otherwise. We also use this notation for the indicator function of the set of matrices of rank at most r, i.e., χ_{rank(·)≤r}. For any Z ∈ R^{n×m}, the proximal mapping of a closed, proper and convex function f : R^{n×m} → R ∪ {∞} is defined as

prox_{γf}(Z) := argmin_X ( f(X) + (1/(2γ))‖X − Z‖_2² ).

In particular, prox_{γχ_C}(Z) coincides with the unique Euclidean projection of Z onto C for any closed, non-empty convex set C ⊂ R^{n×m}. Moreover, by the extended Moreau decomposition it holds for all proper, closed and convex f : R^{n×m} → R ∪ {∞}, Z ∈ R^{n×m} and γ > 0 that (see [6, Theorem 6.29])

prox_{γf}(Z) = Z − γ prox_{γ^{-1} f*}(γ^{-1} Z).

Low-Rank Inducing Norms

This section introduces the family of unitarily invariant low-rank inducing norms, which has been discussed in [19]. Besides recapping some elementary properties, it briefly motivates the usefulness of these norms as convex envelopes or additive regularizers that promote low-rank solutions in optimization problems. Low-rank inducing norms are defined through the dual norm of a rank-constrained dual norm,

‖Y‖_{g^D,r} := max_{rank(X)≤r, ‖X‖_g ≤ 1} ⟨X, Y⟩. (12)

This means that the low-rank inducing norm corresponding to ‖·‖_g is given by

‖X‖_{g,r*} := max_{‖Y‖_{g^D,r} ≤ 1} ⟨X, Y⟩.

For r = n, the rank constraint in (12) is redundant and ‖·‖_g ≡ ‖·‖_{g,r*}. Some important properties of these norms are summarized next [19].
Lemma 1. Let X, Y ∈ R^{n×m}, let r ∈ N be such that 1 ≤ r ≤ n, and let g : R^n → R_{≥0} be a symmetric gauge function. Then ‖·‖_{g^D,r} is a unitarily invariant norm, and its dual norm ‖·‖_{g,r*} satisfies a corresponding characterization in terms of the truncated gauge functions (see [19]). In this work, we especially consider the so-called low-rank inducing Frobenius norm ‖·‖_{ℓ2,r*} and the low-rank inducing spectral norm ‖·‖_{ℓ∞,r*}.
The following motivates the main interest in low-rank inducing norms (see [19,20,22] for details).
Proposition 2. Assume that f_0 : R^{n×m} → R ∪ {∞} is a proper, closed and convex function, and that r ∈ N is such that 1 ≤ r ≤ min{m, n}. Let f_1 : R_{≥0} → R ∪ {∞} be an increasing, proper, closed and convex function, and let θ > 0. Then the optimal value of (18) bounds that of (17) from below. If X* solves (18) and rank(X*) ≤ r, then equality holds and X* is also a solution to the problem on the left of (17).
In other words, Proposition 2 shows that low-rank inducing norms can be used both as additive regularizers and as direct convex envelopes to find (approximate) solutions to the non-convex problem (19). For regularization [16, 49], we set f_0 = f and choose a suitable f_1 and θ to find an approximate solution. In the second case, when f can be split into f_0 + f_1(‖·‖_g), the convexified problem may return an (exact) solution to (19).

Proximal Mappings
For problems of small size, it is often convenient to solve (19) through semidefinite programming (SDP). However, conventional SDP solvers are typically based on interior-point methods (see [45, 50]), whose iteration cost grows unfavorably with the problem dimension. For large-scale problems, proximal splitting methods can be used instead (see [6, 10]).
To efficiently solve (19), proximal splitting methods require efficient computation of the proximal mapping of f_1(‖·‖_{g,r*}). In this section, we present our main results on developing a nested binary search framework (see Theorem 1 and Algorithm 1) for computing this proximal mapping efficiently for simple f_1. Explicit and implementable steps for these computations are given for the simple, but most frequently appearing, cases [5, 19, 22, 42]:

• f_1 = (·), i.e., the proximal mapping of γ‖·‖_{g,r*}, and
• f_1 = (·)², i.e., the proximal mapping of (γ/2)‖·‖²_{g,r*}.

In Section 4.2, the computational complexity of our generic algorithm as well as of these particular cases is derived.
In cases where f_1 is not simple, (19) can be rewritten in the epigraph form, where χ_{epi(‖·‖_{g,r*})} is the indicator function of the epigraph of ‖·‖_{g,r*}. A consensus formulation for proximal splitting methods (see [10]) then requires evaluation of the proximal mappings of f_1 and χ_{epi(‖·‖_{g,r*})}. Since f_1 is one-dimensional, convex, proper and increasing, its proximal mapping is fast to evaluate. We will see as part of our complexity analysis in Section 4.2 that computing prox_{χ_{epi(‖·‖_{g,r*})}} has the same cost as the cases f_1 = (·) and f_1 = (·)². Finally, note that in contrast to ‖·‖_{g,r*}, its dual norm ‖·‖_{g^D,r} is explicitly known from its definition (14). Therefore, we derive our search framework for the dual norm, which by (11) and (16) yields the desired proximal mappings.
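The epigraph projections computed later have no simple closed form for general low-rank inducing norms, but the idea can be illustrated in a simpler setting: for the plain Euclidean norm (no rank term), the epigraph is the second-order cone, whose projection has a classical closed form. A sketch under that simplification (the function name is ours):

```python
import numpy as np

def project_epi_l2(z, v):
    """Euclidean projection of (z, v) onto epi(||.||_2) = {(x, t) : ||x||_2 <= t},
    i.e. the second-order cone (classical closed form)."""
    nz = np.linalg.norm(z)
    if nz <= v:                      # (z, v) already lies in the epigraph
        return z, v
    if nz <= -v:                     # projects onto the apex (0, 0)
        return np.zeros_like(z), 0.0
    beta = (nz + v) / 2.0            # scaling of the nearest boundary point
    return beta * z / nz, beta

x, t = project_epi_l2(np.array([3.0, 4.0]), 0.0)
print(x, t)                          # lands on the cone boundary: ||x||_2 == t
```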

Search framework
Next, we present our main result, which shows that (22), and hence (23a) and (23b), can be computed by a nested parameter search. Since the computations in (22) can be unified as problem (24), where f is closed, proper and convex, our results are stated for all such problems. In particular, for the cases of interest we choose f according to (11) and (16), such that the corresponding solution (Y*, w*) to (24) yields the desired proximal mappings, where χ_{‖·‖_{g^D} ≤ γ} is the indicator function of the set {X : ‖X‖_{g^D} ≤ γ}.
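The role of the Moreau decomposition here can be illustrated in the simplest vector case: for f = ‖·‖_1, the conjugate is the indicator of the unit ℓ∞ ball, so the proximal mapping of the norm is recovered from a projection onto the dual-norm ball, mirroring how (24) recovers the mappings above from the dual norm. A hedged NumPy sketch (function names are ours):

```python
import numpy as np

def soft_threshold(z, gamma):
    # prox of gamma * ||.||_1, computed directly
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def prox_via_moreau(z, gamma):
    # Moreau decomposition: prox_{gamma f}(z) = z - gamma * prox_{f*/gamma}(z / gamma).
    # For f = ||.||_1, the conjugate f* is the indicator of the unit
    # infinity-norm ball, whose prox is the projection onto that ball.
    proj = np.clip(z / gamma, -1.0, 1.0)
    return z - gamma * proj

z = np.array([2.0, -0.3, 0.7])
print(soft_threshold(z, 0.5))    # equals prox_via_moreau(z, 0.5)
print(prox_via_moreau(z, 0.5))
```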
Note that Theorem 1 reduces solving (24) to the tractability of conditions (C1)-(C3). In the following, (C1)-(C3) are made explicit for the cases (25) and the low-rank inducing Frobenius and spectral norms, i.e., g = ℓ2 and g = ℓ∞. More generally, we determine (C1)-(C3) for g = τℓ2 and g = τℓ∞ for all τ > 0, because this allows us to handle the first two cases simultaneously through an identity involving Π_Y(Y, w) := Y and τ > 0. Further, we will see that it is easy to adjust these computations for the third case, because prox_{τ‖·‖_{g,r*}}(Z) = prox_{‖·‖_{τg,r*}}(Z).
A proof of Proposition 3 can be found in Appendix A.3.

Low-rank inducing spectral norm
where j is chosen such that the corresponding inequalities hold, and (C3) derives as stated with µ = μ_{k*}, where

μ_k = (z_v + Σ_{i=1}^k ẑ_i) / (1 + Σ_{i=1}^k α_i),

and k* can be identified by a search over k with monotone rules for increasing/decreasing k.

Computational Complexity
In the following, we evaluate the computational complexity, i.e., we count all flops (see [51]), of the discussed approaches for computing the proximal mappings above. Since the same analysis also applies to the other cases discussed in (25), this allows us to compare our approach to existing methods. Our evaluation starts with a discussion of Algorithm 1 for a general gauge function, followed by explicit discussions of the cases g = ℓ2 and g = ℓ∞ in Sections 4.2.1 and 4.2.2. Assume that the cost of determining the pair (y_{r−t}, y_{r−t+1}) in (C1)-(C3) is bounded by C(n, r). Then the complexity of Algorithm 1 is the sum of:

1. SVD of Z providing all σ_i(Z) (see [51]): O(n³).
2. Binary search rules in Lemmas A.1 and A.2 for t and s (see [33]): O(log(r) log(n − r + 1)) evaluations, each of cost C(n, r).

By the coordinate transformation (26a), the cost C(n, r) is determined by the cost of finding (ỹ_{r−t}, ỹ_{r−t+1}) in (C1)-(C3). In addition, we are required to compute the full solution y^{(t*,s*)} once an optimal pair (t*, s*) is found. The cost of these pre- and post-computation steps is at most O(n), and therefore the overall computational complexity of Algorithm 1, including the binary search for (t*, s*), is bounded by O(n³ + C(n, r) log(r) log(n − r + 1) + n).

Remark 1. The cost of computing z̃_{r−t+1} is given by the cost of knowing Σ_{i=r+1}^{r+s} σ_i(Z) (for s > 0) and Σ_{i=r−t+1}^{r} σ_i(Z). Both sums could be computed a priori for all t and s through incremental summation with cost O(n). In practice, however, it may be cheaper to store and re-use the intermediate sums when computing Σ_{i=r−t+1}^{r} z_i and Σ_{i=r+1}^{r+s} z_i. This means we only need to compute additional intermediate sums whenever t or s is increased within the binary search.
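The control flow of the nested search, with the rule evaluations of Lemmas A.1 and A.2 abstracted into placeholder predicates, can be sketched as follows; it performs O(log(r) log(n − r + 1)) rule evaluations, matching the count above. All names are ours, and the toy rules below merely stand in for the paper's actual increase/decrease tests:

```python
def nested_binary_search(r, n, t_rule, s_rule):
    """Control-flow skeleton of the nested search over
    (t, s) in {1..r} x {0..n-r}. `t_rule` / `s_rule` are placeholders
    for the paper's tests (Lemmas A.1 and A.2): they return True if the
    parameter must increase. Cost: O(log r * log(n - r + 1)) evaluations."""
    def best_s(t):
        # inner binary search: optimal s for this fixed t
        s_min, s_max = 0, n - r
        while s_min < s_max:
            s = (s_min + s_max) // 2
            if s_rule(t, s):
                s_min = s + 1
            else:
                s_max = s
        return s_min

    # outer binary search over t
    t_min, t_max = 1, r
    while t_min < t_max:
        t = (t_min + t_max) // 2
        if t_rule(t, best_s(t)):
            t_min = t + 1
        else:
            t_max = t
    return t_min, best_s(t_min)

# toy monotone rules: the search converges to the smallest (t, s)
# for which the rules say "do not increase"
print(nested_binary_search(50, 500, lambda t, s: t < 7, lambda t, s: s < 12))
```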

Low-rank inducing Frobenius norms
In order to determine the computational cost C(n, r) for g = ℓ2, we need to distinguish between (29a) and (29b) and the case when µ has to be determined. Both cases require partial sums up to index r − t, either as part of the inequalities or as coefficients in the polynomial (29f). These sums can be computed once for all t ∈ {1, ..., r} with cost O(r). Testing (29a) and (29b) as well as solving the fourth-order polynomial (29f) then cost O(1) each. Thus, the complexity C(n, r) can be summarized as O(1).
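The O(r) precomputation of these partial sums can be done with a single cumulative sum, after which each binary-search step reads them in O(1). A small sketch with made-up data:

```python
import numpy as np

# Singular values sorted in descending order (made-up data), and the
# partial sums needed for every t in {1, ..., r}, precomputed once in O(r).
z = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
r = 4
prefix = np.concatenate(([0.0], np.cumsum(z)))   # prefix[k] = z_1 + ... + z_k
# Inside the binary search, the sum of the first r - t entries is then
# an O(1) lookup instead of an O(r) summation per step:
t = 2
print(prefix[r - t])                              # z_1 + z_2 = 9.0
```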

Low-rank inducing spectral norms
As in the previous case, in order to compute C(n, r) for g = ℓ∞, we distinguish between (31a) and (31b) and the case when µ has to be determined. Conditions (31a) and (31b) require certain partial sums, which can be computed once for all t ∈ {1, ..., r} with cost O(r). Verifying the corresponding inequalities is then of complexity O(1).
For determining µ, the partial sums of ẑ_i may need to be computed. Thus, C(n, r) is given by the complexity of determining µ, which by the preceding analysis is at most O(r); the overall complexity of the proximal mapping computation then follows from the bound in the previous subsection. Compared to [55], our approach reduces the cost of finding (t*, s*) significantly, from O(r(n − r + 1)) to O(r log(r) log(n − r + 1) + n). This is especially important for the corresponding vector-valued problem, where rank is replaced by cardinality.

Case Study: Matrix Completion
In the following, we will see how the binary search parameters (t, s, k) from Algorithm 1 and Proposition 4 evolve when solving an optimization problem with proximal splitting. We consider the convexified low-rank matrix completion problem (see, e.g., [8, 9, 19] for motivation and examples)

minimize_X ‖X‖_{ℓ∞,r*} subject to x_ij = n_ij, (i, j) ∈ I, (35)

with r = 50, I := {(i, j) : n_ij > 0}, and N = Σ_{i=1}^r u_i u_i^T defined through an SVD. Note that a smaller version of this example has been solved successfully in [19] using SDP solvers, but this larger example is far out of the scope of typical SDP solvers [45, 50]. Therefore, we apply the following Douglas-Rachford splitting scheme (see [10, 12, 37]), with L := {X ∈ R^{500×500} : x_ij = n_ij, (i, j) ∈ I}, Z_0 = 0 and lim_{i→∞} X_i = lim_{i→∞} Y_i being a solution to (35). By the construction of N, it can be shown that lim_{i→∞} X_i = N (see [19]). The parameter path of (t, s, k) when computing X_i is shown in Figure 1. We observe that as X_i approaches N, the values of t, s and k start plateauing. Thus, by re-using the values from one iterate in the subsequent iterate, the practical computational cost may be reduced significantly. Finally, after the initial transient, the variance of each parameter is small compared to the overall 500 singular values. As a result, it may be worth considering sparse SVD algorithms, which only compute a small predefined number of the largest singular values (see e.g. [38]).
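A hedged sketch of such a Douglas-Rachford iteration on a toy instance: since the explicit proximal mapping of ‖·‖_{ℓ∞,r*} is not restated here, we substitute the nuclear-norm prox (the r = 1 member of the family) and shrink the problem size; all names and data are ours.

```python
import numpy as np

def prox_nuclear(Z, gamma):
    # stand-in prox: soft-thresholding of singular values (nuclear norm,
    # the r = 1 member of the low-rank inducing family)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

def project_observed(Z, N, mask):
    # projection onto L = {X : x_ij = n_ij for observed (i, j)}
    X = Z.copy()
    X[mask] = N[mask]
    return X

def douglas_rachford(N, mask, gamma=0.05, iters=2000):
    # Douglas-Rachford splitting: alternate the two proximal mappings
    # on the running variable Z; the X iterates approach a solution.
    Z = np.zeros_like(N)
    for _ in range(iters):
        X = prox_nuclear(Z, gamma)
        Y = project_observed(2 * X - Z, N, mask)
        Z = Z + Y - X
    return X

# toy instance: rank-1 ground truth, roughly half the entries observed
rng = np.random.default_rng(0)
u = rng.standard_normal(8)
N = np.outer(u, u)
mask = rng.random(N.shape) < 0.5
X = douglas_rachford(N, mask)
# relative mismatch on the observed entries shrinks as the iterates converge
print(np.linalg.norm(X[mask] - N[mask]) / np.linalg.norm(N[mask]))
```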

Conclusion
This work presents a binary search framework for computing the proximal mappings of all unitarily invariant low-rank inducing norms and their epigraph projections. In particular, complete algorithms for the low-rank inducing Frobenius and spectral norms are presented. Our framework unifies and extends the known proximal mapping computations in the following sense: (i) So far, only the proximal mappings for the squared low-rank inducing Frobenius norm [15] and the (non-squared) low-rank inducing spectral norm [55] have been derived. Our framework is independent of the particular unitarily invariant norm and of its composition with a convex increasing function. (ii) Excluding the cost of an SVD, we recover the same complexity O(n log n) for the squared low-rank inducing Frobenius norm as in [15, 34], but decrease the complexity for the (non-squared) low-rank inducing spectral norm from O(r(n − r + 1)) in [55] to O(r log(r) log(n − r + 1) + n).

Figure 1: Parameter path of (t, s, k) from Algorithm 1 and Proposition 4 when computing prox_{‖·‖_{ℓ∞,r*}} within the Douglas-Rachford iterations (37). There are no values for the first iterate, because prox_{‖·‖_{ℓ∞,r*}}(Z_0) = 0, and the iterations are stopped when ‖X_i − Y_i‖_F ≤ 10^{-8}. The local plateauing after relatively few iterations suggests using (t, s, k) from the previous iteration as an initial guess for the current iteration to save computational time.
Finally, in our case study we have seen that within a proximal splitting method, this cost may reduce to O(n) after a small number of iterations and is therefore roughly the same as in the case of the nuclear norm. Implementations for the low-rank inducing Frobenius and spectral norms are available for MATLAB and Python at [17, 18].

ii. k* = max{k : ẑ_k − α_k μ_k ≥ 0}.
Proof. We first show some results needed to prove Items ii and iii. Since all g_i are strictly decreasing in µ, we conclude claim a.
Moreover, the breakpoint sorting in ẑ implies that if l and µ are such that ẑ_l − α_l µ ≥ 0, then also ẑ_i − α_i µ ≥ 0 for all i ≤ l. In conjunction with the uniqueness of µ_{k*}, this implies that (C1) and (C2) reduce to (31a) with z_v = −1, and (C3) is determined by (31c) and (31d), where µ = μ_{k*} can be found with the search rules from above and μ_k = (Σ_{i=1}^k ẑ_i)/(Σ_{i=1}^k α_i). Proposition 4 is proven in Appendix A.5.
In particular, (t*, s*) can be found by a nested search over t and s with the following rules for increasing/decreasing t and s, based on the candidate solutions (y^{(t,s_t)}, w^{(t,s_t)}).
Algorithm 1: Computation of (t*, s*) in Theorem 1 through binary search over (t, s)

3: Set t_min = 1, t_max = r, and t = ⌊(t_min + t_max)/2⌋  // binary search over t to find t*
4: while t_min ≠ t_max do
   ...
   Set s_min = 0, s_max = n − r, and s = ⌊(s_min + s_max)/2⌋  // binary search for s = s_t, performed as needed
   ...
   Set t* = t_min and, as needed, run the binary search for s* = s_{t*}
23: Output: (Y*, w*) = (y^{(t*,s*)}, w^{(t*,s*)})