Abstract
Low-rank inducing unitarily invariant norms have been introduced to convexify problems with a low-rank/sparsity constraint. The most well-known member of this family is the so-called nuclear norm. To solve optimization problems involving such norms with proximal splitting methods, efficient ways of evaluating the proximal mapping of the low-rank inducing norms are needed. This is known for the nuclear norm, but not for most other members of the low-rank inducing family. This work supplies a framework that reduces the proximal mapping evaluation to a nested binary search, in which each iteration requires the solution of a much simpler problem. The simpler problem can often be solved analytically, as demonstrated for the so-called low-rank inducing Frobenius and spectral norms. The framework also allows us to compute the proximal mapping of increasing convex functions composed with these norms, as well as projections onto their epigraphs.
1 Introduction
Non-convex optimization problems with a rank or cardinality constraint appear in many data-driven areas such as machine learning, image analysis and multivariate linear regression [6, 8, 13, 26, 40], as well as areas within control such as system identification, model reduction, low-order controller design, and low-complexity modeling [2, 15, 18, 35, 47]. Aside from the low-rank constraint, these problems are often convex, and therefore one of the most common techniques for solving them is to convexify them by using regularizers or taking convex envelopes [8, 15, 17, 19]. A promising class of such regularizers and convex envelopes is the class of so-called unitarily invariant low-rank inducing norms [17], i.e., convex envelopes of unitarily invariant norms whose domain is restricted to matrices of prescribed bounded rank. As many common loss functions, e.g., the squared Frobenius distance, contain terms of unitarily invariant norms, these norms have the attractive feature of convexifying exactly, i.e., the convexified problem in terms of the low-rank inducing norm coincides with the original at all matrices of the prescribed bounded rank. Therefore, if the convexified problem has a low-rank solution, it is guaranteed to be a solution to the non-convex one.
Although low-rank inducing norms often admit a representation as semi-definite programs (SDPs) [17], proximal splitting algorithms [9] are often used for large-scale problems, where standard interior-point solvers have too costly iterations [39]. The main objective of this work is to efficiently compute the needed proximal mappings of low-rank inducing norms composed with increasing convex functions. To this end, we develop a generic nested binary search algorithm, which in each iteration solves a simple problem. For well-known low-rank inducing norms such as the nuclear norm [38] and the low-rank inducing Frobenius norm [1, 14, 29, 30], our algorithm recovers the efficiency of existing methods; for other norms such as the low-rank inducing spectral norm [46], it improves the computational complexity significantly, especially in the vector-valued case. Finally, [45] proposes a non-analytic approach for an extended class of not necessarily unitarily invariant low-rank inducing norms (see [27]). This approach, however, depends on the complexity and convergence rates of other optimization algorithms.
The paper is organized as follows. We start by introducing some preliminaries on norms and convex optimization. Subsequently, a formal definition of the class of low-rank inducing norms, including their application to rank constrained optimization problems is outlined. Then, we discuss and derive our main results, the binary search framework and outline an algorithm for evaluating their epigraph projections. For the low-rank inducing Frobenius and spectral norms, we make these computations explicit and arrive at implementable algorithms for which the computational cost is analyzed. Subsequently, a case study is performed in order to illustrate the performance of our algorithm through proximal splitting. Finally, we draw a conclusion and point the reader to our freely available implementations of these algorithms in MATLAB and Python.
2 Preliminaries
The set of reals is denoted by \(\mathbb {R}\), the set of real vectors by \(\mathbb {R}^n\), the set of vectors with nonnegative entries by \(\mathbb {R}^n_{\ge 0}\) and the set of real matrices by \(\mathbb {R}^{n \times m}\). In the remainder of the paper, we assume without loss of generality that \(n \le m\). The singular value decomposition of \(X \in \mathbb {R}^{n \times m}\) is denoted by \(X = \sum _{i=1}^{n} \sigma _i(X) u_i v_i^\mathsf {T}\) with non-increasingly ordered singular values \(\sigma _1(X) \ge \dots \ge \sigma _{n}(X)\) (counted with multiplicity). The corresponding vector of all singular values is given by \(\sigma (X):=(\sigma _1(X),\ldots ,\sigma _{n}(X)).\) For all \(x=(x_1,\ldots ,x_n)\in \mathbb {R}^n\), we define the \(\ell _p\) norms with \(1\le p<\infty \) by \(\ell _p(x):=\left( \sum _{i=1}^{n} |x_i|^p\right) ^{\frac{1}{p}}\) and \(\ell _{\infty }(x):=\max _{i}|x_i|\), where \(|\cdot |\) denotes the absolute value.
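As a concrete illustration of these definitions, the following sketch (assuming NumPy; the matrix is a made-up example) computes the singular value vector \(\sigma(X)\) and evaluates \(\ell_p\) norms on it:

```python
import numpy as np

# Hypothetical 3x4 example (n = 3 <= m = 4, matching the convention above).
X = np.array([[3.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

# np.linalg.svd returns the singular values in non-increasing order.
sigma = np.linalg.svd(X, compute_uv=False)

def lp(x, p):
    """l_p norm of a vector, including p = inf."""
    if np.isinf(p):
        return np.max(np.abs(x))
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(lp(sigma, 1))       # nuclear norm of X: 3 + 2 + 1
print(lp(sigma, 2))       # Frobenius norm of X: sqrt(14)
print(lp(sigma, np.inf))  # spectral norm of X: 3
```

The three printed values are exactly the nuclear, Frobenius and spectral norms introduced below, evaluated on the same singular value vector.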
A matrix norm \(\Vert \cdot \Vert : \mathbb {R}^{n \times m}\rightarrow \mathbb {R}_{\ge 0}\) is called unitarily invariant if for all unitary matrices \(U \in \mathbb {R}^{n \times n}\) and \(V \in \mathbb {R}^{m \times m}\) and all \(X \in \mathbb {R}^{n \times m}\) it holds that \(\Vert UXV\Vert = \Vert X\Vert \). Equivalently, unitary invariance can be characterized by symmetric gauge functions (see e.g., [25, Theorem 7.4.7.2]):
Definition 1
A function \(g: \mathbb {R}^n \rightarrow \mathbb {R}_{\ge 0}\) is a symmetric gauge function if
-
i.
g is a norm.
-
ii.
\(\forall x \in \mathbb {R}^{n}: g(|x|) = g(x)\), where |x| denotes the element-wise absolute value.
-
iii.
\(g(Px) = g(x)\) for all permutation matrices \(P\in \mathbb {R}^{n\times n}\) and all \(x\in \mathbb {R}^n\).
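Properties (ii) and (iii) can be sanity-checked numerically for a candidate gauge function; a minimal sketch with \(g = \ell_2\) (the test vector and permutation are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    """Candidate symmetric gauge function: here simply the l_2 norm."""
    return np.linalg.norm(x, 2)

x = rng.standard_normal(5)
P = np.eye(5)[rng.permutation(5)]  # a random permutation matrix

assert np.isclose(g(np.abs(x)), g(x))  # (ii): invariance under sign changes
assert np.isclose(g(P @ x), g(x))      # (iii): invariance under permutations
print("gauge properties hold for this sample")
```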
Proposition 1
The norm \(\Vert \cdot \Vert :\mathbb {R}^{n \times m}\rightarrow \mathbb {R}_{\ge 0}\) is unitarily invariant if and only if \(\Vert \cdot \Vert = g(\sigma _1(\cdot ),\dots ,\sigma _{n }(\cdot ))\), where g is a symmetric gauge function.
Throughout this work, we use the notation \(\Vert X\Vert _g := g(\sigma (X))\). For \(X,Y \in \mathbb {R}^{n \times m}\) the Frobenius inner product is defined as \(\langle X , Y \rangle := \sum _{i=1}^{n}\sum _{j=1}^{m} x_{ij} y_{ij} = \text {trace}(X^\mathsf {T}Y)\) with Frobenius norm \(\Vert X\Vert _{\ell _2} := \ell _2(\sigma (X)) = \sqrt{\langle X , X \rangle }.\) The nuclear norm and the spectral norm are given by \(\Vert \cdot \Vert _{\ell _1}:=\ell _{1}(\sigma (\cdot ))\) and \(\Vert \cdot \Vert _{\ell _\infty }:=\ell _{\infty }(\sigma (\cdot ))=\sigma _1(\cdot )\). The dual norm to \(\Vert \cdot \Vert _g\) is defined as
$$\begin{aligned} \Vert X\Vert _{g^D} := \max _{\Vert Y\Vert _{g} \le 1} \langle Y , X \rangle . \end{aligned}$$
Dual norms inherit the unitary invariance as well as the duality relationship for \(\ell _p\) norms, i.e., \(g = \ell _p\) implies \(g^D = \ell _q\) with \(p,q \in [1,\infty ]\) satisfying \(\frac{1}{p} + \frac{1}{q} = 1\). We will also make use of truncated dual gauge functions. Let \(y\in \mathbb {R}^n\), \(r\in \{1,\ldots ,n\}\), and \(g^D:\mathbb {R}^n\rightarrow \mathbb {R}_{\ge 0}\). The truncated dual gauge function is then defined as
where \(\text {sort}: \mathbb {R}^n \rightarrow \mathbb {R}^n\) denotes sorting in descending order.
Next, we introduce some standard notation and results from convex optimization [5, 41]. For \(f: \mathbb {R}^{n \times m}\rightarrow \mathbb {R} \cup \lbrace \infty \rbrace \), we denote by \(\text {dom}(f)\) and \(\text {epi}(f)\) the effective domain and epigraph of f, respectively. Its subdifferential at \(X \in \mathbb {R}^{n \times m}\) is written as \(\partial f(X)\). In particular, by [24, Example VI.3.1]
Further, f is said to be proper if \(\text {dom}f \ne \emptyset \) and closed if \(\text {epi}(f)\) is a closed set. The conjugate (dual) function of f is denoted by \(f^*\) and \(f^{**} := (f^*)^*\) is called the biconjugate function or convex envelope of f. For \(f:\mathbb {R} \rightarrow \mathbb {R}\cup \lbrace \infty \rbrace \), we say that f is increasing if \(x \le y \ \Rightarrow \ f(x) \le f(y) \ \text { for all } \ x,y \in \text {dom}(f)\) and if there exist \(x,y\in \mathbb {R}\) such that \(x<y\) and \(f(x)<f(y)\). Moreover, its monotone conjugate is defined as [41] \(f^+(y) := \sup _{x \ge 0} \left[ xy - f(x) \right] \ \text { for all } y\in \mathbb {R}.\) The 0-infinity indicator (or characteristic) function of a set \(\mathcal {S} \subset \mathbb {R}^{n \times m}\) is denoted by \(\chi _{\mathcal {S}}\), which we also use for the indicator function of the set of matrices with at most rank r, i.e., \(\chi _{\text {rank}(\cdot )\le r}\). For any \(Z \in \mathbb {R}^{n \times m}\), the proximal mapping of a closed, proper and convex function \(f: \mathbb {R}^{n \times m}\rightarrow \mathbb {R} \cup \lbrace \infty \rbrace \) is defined as
In particular, \(\text {prox}_{\gamma \chi _{\mathcal {C}}}(Z)\) coincides with the unique Euclidean projection
$$\begin{aligned} \Pi _{\mathcal {C}}(Z) := \mathop {\mathrm {argmin}}_{X \in \mathcal {C}} \Vert X - Z\Vert _{\ell _2} \end{aligned}$$
onto \(\mathcal {C}\) for any closed, non-empty, convex set \(\mathcal {C} \subset \mathbb {R}^{n \times m}\). Moreover, by the extended Moreau decomposition it holds for all \(f: \mathbb {R}^{n \times m}\rightarrow \mathbb {R} \cup \lbrace \infty \rbrace \), \(Z \in \mathbb {R}^{n \times m}\) and \(\gamma > 0\) that (see [4, Theorem 6.29])
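For intuition, the extended Moreau decomposition \(\text{prox}_{\gamma f}(Z) = Z - \gamma \,\text{prox}_{\gamma^{-1} f^*}(\gamma^{-1} Z)\) can be checked numerically in a simple vector case; the sketch below (assuming NumPy, with made-up data) uses \(f = \ell_1\), whose conjugate is the indicator of the \(\ell_\infty\) unit ball:

```python
import numpy as np

gamma = 0.7
z = np.array([2.0, -0.3, 1.1, -5.0])  # arbitrary test vector

# Direct formula: prox of gamma * l1 is elementwise soft-thresholding.
lhs = np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

# Via the decomposition: the conjugate of the l1 norm is the indicator of
# the l_infinity unit ball, whose prox (at any scaling) is the projection,
# i.e., elementwise clipping to [-1, 1].
rhs = z - gamma * np.clip(z / gamma, -1.0, 1.0)

print(np.allclose(lhs, rhs))  # True
```

This is exactly the mechanism exploited later: a hard-to-evaluate proximal mapping is traded for a projection involving the dual (conjugate) object.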
Finally, we denote compositions of two functions f and g by \( (f \circ g)(\cdot ) := f(g(\cdot ))\).
3 Low-Rank Inducing Norms
This section introduces the family of unitarily invariant low-rank inducing norms, which has been discussed in [17]. Besides recalling some elementary properties, we briefly motivate the usefulness of these norms as convex envelopes or additive regularizers for promoting low-rank solutions in optimization problems.
Low-rank inducing norms are defined as the dual norm of a rank constrained dual norm
This means that the low-rank inducing norms corresponding to \(\Vert \cdot \Vert _g\) are
For \(r=n\), the rank constraint in (6) is redundant and \(\Vert \cdot \Vert _g \equiv \Vert \cdot \Vert _{g,r*}\). Some important properties of these norms are summarized next [17].
Lemma 1
Let \(X,Y \in \mathbb {R}^{n \times m}\), \(r \in \mathbb {N}\) be such that \(1\le r \le n\), and \(g: \mathbb {R}^n \rightarrow \mathbb {R}_{\ge 0}\) be a symmetric gauge function. Then \(\Vert \cdot \Vert _{g^D,r}\) is a unitarily invariant norm with
where \(g^D_r\) is defined in (2). Its dual norm \(\Vert \cdot \Vert _{g,r*}\) satisfies
In this work, we especially consider the so-called low-rank inducing Frobenius and spectral norms, i.e., the cases \(g = \ell _2\) and \(g = \ell _{\infty }\). Since \(\ell _2^D = \ell _2\) and \(\ell _{\infty }^D = \ell _1\), it follows from (8) that \(\Vert X\Vert _{\ell _2,r*} :=\max _{\Vert Y\Vert _{\ell _2,r} \le 1}\langle Y,X\rangle \) with \(\Vert Y\Vert _{\ell _2,r}:=\sqrt{\sum _{i=1}^r \sigma _i^2(Y)}\) and \(\left\| {} X {}\right\| _{\ell _{\infty },r*} :=\max _{\left\| {} Y{}\right\| _{\ell _1,r}\le 1}\langle Y,X\rangle \) with \(\left\| {} Y{}\right\| _{\ell _1,r} = \sum _{i=1}^r \sigma _i(Y)\).
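These two constrained dual norms are straightforward to evaluate from a singular value decomposition; a minimal sketch (assuming NumPy; the function names are ours):

```python
import numpy as np

def norm_l2_r(Y, r):
    """||Y||_{l2,r}: l_2 norm of the r largest singular values of Y."""
    s = np.linalg.svd(Y, compute_uv=False)  # non-increasing order
    return float(np.sqrt(np.sum(s[:r] ** 2)))

def norm_l1_r(Y, r):
    """||Y||_{l1,r}: sum of the r largest singular values of Y."""
    s = np.linalg.svd(Y, compute_uv=False)
    return float(np.sum(s[:r]))

Y = np.diag([4.0, 2.0, 1.0])
print(norm_l2_r(Y, 2))  # sqrt(4^2 + 2^2) = sqrt(20)
print(norm_l1_r(Y, 2))  # 4 + 2 = 6
```

For \(r = n\) these reduce to the Frobenius and nuclear norms, respectively, consistent with the remark after (6) that the rank constraint becomes redundant.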
The following motivates the main interest in low-rank inducing norms (see [16, 17, 19] for details).
Proposition 2
Assume that \(f_0: \mathbb {R}^{n \times m}\rightarrow \mathbb {R} \cup \lbrace \infty \rbrace \) is a proper closed convex function, and that \(r \in \mathbb {N}\) is such that \(1 \le r \le \min \{ m,n \}\). Let \(f_1: \mathbb {R}_{\ge 0} \rightarrow \mathbb {R} \cup \{ \infty \}\) be an increasing, proper closed convex function, and let \(\theta >0\). Then
and
If \(X^\star \) solves (12) such that \(\text {rank}(X^\star ) \le r\), then equality holds and \(X^\star \) is also a solution to the problem on the left of (11).
In other words, Proposition 2 shows that low-rank inducing norms can be used both as additive regularizers and direct convex envelopes to find (approximate) solutions to
For regularization as in [15, 42], we set \(f_0 = L\) and choose a suitable \(f_1\) and \(\theta \) to find an approximate solution. In the second case, when L can be split into \(L = f_0 + f_1(\Vert \cdot \Vert _g)\) as in Proposition 2, then
may return an (exact) solution to (13).
4 Proximal Mappings
For problems of small dimensions, it is often convenient to solve (14) through semi-definite programming (SDP). However, conventional SDP solvers are typically based on interior-point methods (see [39]) with an iteration cost that grows unfavorably with the problem dimension. For large-scale problems, proximal splitting methods can be used [4, 9]. To efficiently solve (14), proximal splitting methods require efficient computation of the proximal mapping of \(f_1(\Vert \cdot \Vert _{g,r*})\).
In this section, we present our main results on a nested binary search framework for efficiently computing this proximal mapping for simple choices of \(f_1\). Explicit and implementable steps for these computations are given for the common cases \(f_1 = (\cdot )\) and \(f_1 = (\cdot )^2\) with \(g = \ell _2\) and \(g = \ell _{\infty }\) [3, 17, 19, 37]. In Sect. 4.3, the computational complexity of our generic algorithm as well as of these particular cases is derived. In cases where \(f_1\) is not simple, we can write (14) as
where \(\chi _{\text {epi}(\Vert \cdot \Vert _{g,r*})}\) is the indicator function of the epigraph to \(\Vert \cdot \Vert _{g,r*}\). Then a consensus formulation for proximal splitting methods (see [9]) requires an evaluation of the proximal mappings for \(f_1\) and \(\chi _{\text {epi}(\Vert \cdot \Vert _{g,r*})}\). Since \(f_1\) is one-dimensional, convex, proper and increasing, its proximal mapping is fast to evaluate. We will see as part of our complexity analysis in Sect. 4.3 that computing \(\text {prox}_{\chi _{\text {epi}(\Vert \cdot \Vert _{g,r*})}}\), \(\text {prox}_{\Vert \cdot \Vert _{g,r*}}\) and \(\text {prox}_{\Vert \cdot \Vert _{g,r*}^2}\), i.e., \(f_1 = (\cdot )\) and \(f_1 = (\cdot )^2\), is equally costly.
Note that in contrast to \(\left\| {} \cdot {}\right\| _{ g,r*}\), its dual norm \(\left\| {} \cdot {}\right\| _{ g^D,r}\) is explicitly known by its definition (8), which is why we derive our search framework for
with
which by (5) and (10) yields the sought proximal mappings
4.1 Search Framework
Next, we present our main result, which shows that (16), and hence Eqs. (17a) and (17b), can be computed by a nested parameter search. Since the computations of (16) can be compactly unified as
where f is closed, proper and convex, our results are stated for all such problems. Table 1 summarizes common choices for f and their relationship to Eqs. (17a) and (17b) via (16).
Before we state the main theorem on how to solve (18) with a nested binary search method, we outline the steps that give rise to this algorithm. It is well-known that the solution \(Y^\star \) to (18) and Z have a simultaneous SVD [31, 36] and, therefore, only the singular values of \(Y^\star \) need to be computed. Let \(y_i=\sigma _i(Y)\) and \(z_i=\sigma _i(Z)\), then it follows that (18) reduces to the vector-valued problem
Since \(z \in \mathbb {R}^n_{\ge 0}\) is monotonically decreasing, it can be shown that the minimizer of (19) is nonnegative. The problem is, therefore, equivalent to solving
Since only the r first elements in y are included in the norm constraint, the solution may have a chain of equalities around \(y_r\), i.e., there exist integers \(t\ge 1\) and \(s\ge 0\) such that
The base case \(t=1\) and \(s=0\) implies that \(y_{r-1}>y_r>y_{r+1}\), i.e., the chain has length one. Thus, if we can solve (21) for an arbitrary but fixed pair (t, s), an optimal \((t^\star ,s^\star )\) could be determined by comparing all pairs. As this would be very inefficient, the proposed search rules are devised to find \((t^\star ,s^\star )\) by only considering a few pairs.
To state these rules, we need to introduce the truncated gauge function of \(g^D\) as
where \(x \in \mathbb {R}^n\) and the truncation operator \(T: \mathbb {R}^{n} \rightarrow \mathbb {R}^{r-t+1}\) is defined for all \(1 \le r \le n\) and \((t,s) \in \{1,\dots ,r\} \times \{0,\dots ,n-r\}\) as
Note that \(g^D_{r,s,t}\) is indeed a gauge function with dual gauge function (see [23, Lemma 2.2.2])
For the special case \((t,s) = (1,0)\), it reduces to \(g^D_r\) in (2). We are now ready to state our main theorem.
Theorem 1
Let \(Z = \sum _{i = 1}^n \sigma _i(Z) u_i v_i^\mathsf {T}\in \mathbb {R}^{n \times m}\), \(\gamma > 0\), \(1\le r\le n\), \(g: \mathbb {R}^n \rightarrow \mathbb {R}\) be a gauge function, and \(f:\mathbb {R} \rightarrow \mathbb {R} \) be proper, closed and convex. For each \((t,s) \in \{1, \dots , r\} \times \{0,\dots , n-r\}\) let \((y^{(t,s)},w^{(t,s)}) \in \mathbb {R}^{n+1}\) be defined as
where \((\tilde{y},\tilde{w}) \in \mathbb {R}^{r-t+2}\) solves
and \(\tilde{z} := T \sigma (Z)\) is given by (22). Then \((Y^\star ,w^\star ) = (\sum _{i = 1}^n y_i^{(t^\star ,s^\star )}u_i v_i^\mathsf {T},w^{(t^\star ,s^\star )})\) is the solution to (18), where
In particular, \((t^\star ,s^\star )\) can be found by a nested binary search over t and s with the following rules for increasing/decreasing t and s:
-
I.
\(y^{(t,s_t^\star )}_{r-t} \ge y_{r-t+1}^{(t,s_t^\star )}\) for all \(t \ge t^\star \).
-
II.
\(y^{(t,s_t^\star )}_{r-t} \le y_{r-t+1}^{(t,s_t^\star )}\) for all \(t < t^\star \).
-
III.
If \(t < t^\star \) and \(y^{(t,s_t^\star )}_{r-t} = y_{r-t+1}^{(t,s_t^\star )}\) then \(\left( y^{(t,s_t^\star )},w^{(t,s_t^\star )} \right) = \left( y^{(t^\star ,s^\star )},w^{(t^\star ,s^\star )} \right) \).
-
IV.
\(y^{(t,s)}_{r+s} \ge y_{r+s+1}^{(t,s)}\) for all \(s \ge s_t^\star \).
-
V.
\(y^{(t,s)}_{r+s} \le y_{r+s+1}^{(t,s)}\) for all \(s < s_t^\star \).
-
VI.
If \(s < s_t^\star \) and \(y^{(t,s)}_{r+s} = y_{r+s+1}^{(t,s)}\) then \(\left( y^{(t,s)},w^{(t,s)} \right) = \left( y^{(t,s_t^\star )},w^{(t,s_t^\star )} \right) \).
A few words on Theorem 1 may be helpful. The first part simply makes explicit that \((y^{(t,s)},w^{(t,s)})\) in Eqs. (23a) and (23b) is the solution of (21) with fixed t and s, i.e., it solves
via the solution of the lower-dimensional problem (24). For fixed t in (21), the search rules for s (Items IV. to VI.) can be used to find an optimal \(s = s_t^\star \) that minimizes the cost in (21) among all choices of s that fulfil the constraint \(y^{(t,s)}_{r+s} \ge y_{r+s+1}^{(t,s)} \ge \dots \ge y_{n}^{(t,s)}.\) Since \(y_i^{(t,s)} = z_i\) for \(i \ge r+s+1\) by (23a), it suffices to check that \(y^{(t,s)}_{r+s} \ge y_{r+s+1}^{(t,s)}\), where by (25b) \(s_t^\star \) is the smallest of such s. Similarly, the search rules for finding an optimal \(t =t^\star \) minimize the cost in (21) among all choices \((t,s) = (t,s_t^\star )\) that do not violate the constraint \(y^{(t,s_t^\star )}_{r-t} \ge y_{r-t+1}^{(t,s_t^\star )}\). Using nested binary search (see [28]) over s (inner loop) and t (outer loop), an optimal \((t^*,s^*)\) can be found efficiently under the assumption that (24) has an efficiently computable solution for all choices (t, s). For more details, see the derivation of the proof to Theorem 1 in Appendix 2 and our explicit implementation for determining \(\Pi _{-\text {epi}(\Vert \cdot \Vert _{g^D,r})}(Z,z_v)\) in Algorithm 1.
4.2 Low-Rank Inducing Frobenius and Spectral Norms
Next, we exemplify solutions to (24) for the instances in Table 1 with \(g = \ell _2\) and \(g = \ell _\infty \). A general result on the solvability of (24) is given in Appendix 3.
In particular, we will discuss solutions to (24) for all \(g = \tau \ell _2\) and \(g = \tau \ell _\infty \), \(\tau > 0\), because this enables us to handle the first two cases in Table 1 simultaneously through the identity
where \(\Pi _{Y} (Y,w) := Y\) and \(\tau > 0\). It is easy to adjust these computations for the third case, because \(\text {prox}_{\tau {\left\| {}\cdot {}\right\| _{ g,r*}}}(Z) = \text {prox}_{{\left\| {}\cdot {}\right\| _{ \tau g,r*}}}(Z)\).
Proposition 3
Let \(Z = \sum _{i = 1}^n \sigma _i(Z) u_i v_i^\mathsf {T}\in \mathbb {R}^{n \times m}\), \(g = \tau \ell _2\) with \(\tau > 0\), \(1\le r\le n\), \(\gamma = 1\), \(z_v \in \mathbb {R}\) and \(\tilde{z} := T \sigma (Z)\). Then, \(\Pi _{-\text {epi}(\Vert \cdot \Vert _{g^D,r})}(Z,z_v)\) can be computed via Theorem 1 with \(f(w) = \frac{1}{2}(w+z_v)^2\), where the solution to (24) is characterized by one of the following three distinct cases:
and otherwise
where the unique \(\mu \ge 0\) is a solution to the fourth-order polynomial
with \(c_1 := \sum _{i=1}^{r-t} \tilde{z}^2_i\) and \(c_2 := \sqrt{t+s}\tilde{z}_{r-t+1}\).
Similarly, \(\text {prox}_{\chi _{\Vert \cdot \Vert _{\tau g^D,r} \le \gamma }}(Z)\) can be determined by setting \(f(w) = \chi _{[0,\gamma ]}(w)\), where it suffices to consider the two cases: (27a) with \(z_v = -1\), and Eqs. (27c), (27d) and (27f) with \(\tilde{w} = 1\).
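The only nontrivial numerical step in Proposition 3 is extracting the unique nonnegative root of the fourth-order polynomial (27f). Since its coefficients are problem data not reproduced here, the sketch below (assuming NumPy) only shows the root selection itself, on a hypothetical quartic with a single nonnegative real root:

```python
import numpy as np

def nonnegative_real_root(coeffs, tol=1e-8):
    """Return the (assumed unique) nonnegative real root of a polynomial
    given by its coefficients, highest degree first."""
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < tol].real
    nonneg = real[real >= -tol]
    return float(max(nonneg.min(), 0.0))

# (mu - 2)(mu + 1)(mu^2 + 1) = mu^4 - mu^3 - mu^2 - mu - 2:
# the real roots are 2 and -1, so the nonnegative root is mu = 2.
print(nonnegative_real_root([1.0, -1.0, -1.0, -1.0, -2.0]))
```

In a performance-critical implementation, a closed-form quartic solver or a safeguarded Newton iteration would replace the companion-matrix eigenvalue computation behind `np.roots`; either way the step stays \(\mathcal{O}(1)\), as used in the complexity analysis of Sect. 4.3.1.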
Proposition 4
Let \(Z = \sum _{i = 1}^n \sigma _i(Z) u_i v_i^\mathsf {T}\in \mathbb {R}^{n \times m}\), \(g = \tau \ell _\infty \) with \(\tau > 0\), \(1\le r\le n\), \(\gamma = 1\), \(z_v \in \mathbb {R}\) and \(\tilde{z} := T \sigma (Z)\). Further, let
where j is chosen such that
Then, \(\Pi _{-\text {epi}(\Vert \cdot \Vert _{g^D,r})}(Z,z_v)\) can be computed via Theorem 1 with \(f(w) = \frac{1}{2}(w+z_v)^2\), where the solution to (24) is characterized by one of the following three distinct cases:
and otherwise
where \(\mu = \hat{\mu }_{k^\star }\) with \(\hat{\mu }_k= \frac{z_v+\sum _{i=1}^{k}\hat{z}_{i}}{1+\sum _{i=1}^k\alpha _i}\) and \(k^\star \) can be identified by a search over k with the following rules for increasing/decreasing k:
-
I.
\(k^\star =\max \{k : \hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\}\)
-
II.
\(\hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\) for all \(k \le k^\star \)
-
III.
\(\hat{z}_k-\alpha _k\hat{\mu }_k< 0\) for all \(k> k^\star \)
Similarly, \(\text {prox}_{\chi _{\Vert \cdot \Vert _{\tau g^D,r} \le \gamma }}(Z)\) can be determined by setting \(f(w) = \chi _{[0,\gamma ]}(w)\), where it suffices to consider the two cases: (29a) with \(z_v = -1\) and Eqs. (29c) and (29d), where \(\mu = \hat{\mu }_{k^\star }\) can be found with the search rules from above and \(\hat{\mu }_k = \frac{\sum _{i=1}^{k}\hat{z}_{i}}{\sum _{i=1}^k\alpha _i}\).
Propositions 3 and 4 are proved in Appendices 4 and 6, respectively, and implementations are available for MATLAB and Python at [20, 21].
4.3 Computational Complexity
In the following, we evaluate the computational complexity, i.e., counting all flops (see [44]) of the discussed approaches for computing
Since the same analysis also applies to the other cases discussed in Table 1, this will allow us to compare our approach to existing methods. Our evaluation starts with a discussion of Algorithm 1 for a general gauge function, followed by an explicit discussion for the cases of \(g=\ell _2\) and \(g = \ell _{\infty }\) in Sects. 4.3.1 and 4.3.2 of which a summary is given in Table 2.
In order to apply the binary search rules in Theorem 1, we only need to determine \((y^{(t,s)}_{r-t},y^{(t,s)}_{r-t+1},y^{(t,s)}_{r+s},y^{(t,s)}_{r+s+1}),\) whose computational cost we assume to be bounded by C(n, r). Then, the complexity of Algorithm 1 is the sum of:
-
1.
SVD for Z providing all \(\sigma _i(Z)\) and \(u_iv_i^\mathsf {T}\) such that \(Z = \sum _{i = 1}^n \sigma _i(Z)u_iv_i^\mathsf {T}\) (see [44]): \(\mathcal {O}(mn^2)\).
-
2.
Binary search rules (see [28]) in Theorem 1 for t and s:
$$\begin{aligned} \mathcal {O}(C(n,r) \log (r)\log (n-r)) \end{aligned}$$ -
3.
Determine the final full solution: \(\mathcal {O}(n)\).
-
4.
Compute \(\text {prox}_{\chi _{\text {epi}(\Vert \cdot \Vert _{g,r*})}}(Z,z_v)\) from \(\Pi _{-\text {epi}(\Vert \cdot \Vert _{g^D,r})}(Z,z_v)\): \(\mathcal {O}(n)\)
In practice, the first cost may be significantly reduced by employing sparse SVD solvers (see e.g., [32, 34]). In particular, in the vector-valued case, this step reduces to a simple sorting of the entries. The second cost is determined by the coordinate transformation (23a), i.e.,
and therefore the cost for C(n, r) equals the cost \(\tilde{C}(n,r)\) for solving (24) to find \((\tilde{y}_{r-t},\tilde{y}_{r-t+1})\). To compute the full solution \(y^{(t^\star ,s^\star )}\), once an optimal pair \((t^\star ,s^\star )\) is found, the cost for these pre- and post-computing steps is at most \(\mathcal {O}(n)\). Finally, computing \(\text {prox}_{\chi _{\text {epi}(\Vert \cdot \Vert _{g,r*})}}(Z,z_v)\) from \(\Pi _{-\text {epi}(\Vert \cdot \Vert _{g^D,r})}(Z,z_v)\) only contributes an additional \(n+1\) subtractions.
Remark 1
The cost for computing \(\tilde{z}_{r-t+1}\) is given by the cost for knowing \(\sum _{i=r+1}^{r+s} \sigma _i(Z)\) (for \(s > 0\)) and \(\sum _{i=r-t+1}^{r} \sigma _i(Z)\). Both sums could be computed a priori for all t and s through incremental summation with cost \(\mathcal {O}(n)\). However, in practice it may be cheaper to store and re-use the intermediate sums, when deriving \(\sum _{i=r-t+1}^{r} z_i\) and \(\sum _{i=r+1}^{r+s} z_i\). This means we only need to compute additional intermediate sums whenever t and s get increased within the binary search.
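The incremental-summation idea of Remark 1 amounts to a prefix-sum table over the sorted singular values; a minimal sketch with made-up numbers:

```python
import numpy as np

sigma = np.array([5.0, 4.0, 3.0, 2.0, 1.0])       # sorted singular values
csum = np.concatenate(([0.0], np.cumsum(sigma)))  # csum[k] = sigma_1 + ... + sigma_k

def range_sum(a, b):
    """sum_{i=a}^{b} sigma_i for 1-based inclusive indices, in O(1)."""
    return csum[b] - csum[a - 1]

r, t, s = 3, 2, 1
print(range_sum(r - t + 1, r))  # sigma_2 + sigma_3 = 7.0
print(range_sum(r + 1, r + s))  # sigma_4 = 2.0
```

Building `csum` once costs \(\mathcal{O}(n)\); every partial sum queried during the binary search is then a single subtraction.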
4.3.1 Low-Rank Inducing Frobenius Norms
In order to determine the computational cost \(\tilde{C}(n,r)\) for \(g = \ell _2\), we need to analyze the complexity of the three cases in Proposition 3. All cases require the evaluation of \(\sum _{i=1}^{r-t} \tilde{z}_i^2 = \sum _{i=1}^{r-t} \sigma _i^2(Z)\) as either part of the inequalities Eqs. (27a) and (27b) or as coefficients in polynomial (27f). These sums can be computed once for all \(t \in \{1,\dots ,r\}\) with cost \(\mathcal {O}(r)\). Then testing Eqs. (27a) and (27b) as well as solving the fourth-order polynomial (27f) are of cost \(\mathcal {O}(1)\). Our generic approach, therefore, recovers in this special case the same complexity as the algorithms in [14, 29] (see Table 2).
4.3.2 Low-Rank Inducing Spectral Norms
As in the previous case, determining \(\tilde{C}(n,r)\) for \(g = \ell _\infty \) requires us to compute the complexity of the three cases in Proposition 4. The cases Eqs. (29a) and (29b) require the evaluation of \(\sum _{i=1}^{r-t}\tilde{z}_i = \sum _{i=1}^{r-t}\sigma _i(Z)\). This can be done once for all \(t \in \{1,\dots ,r\}\) with cost \(\mathcal {O}(r)\), and verifying the corresponding inequalities is then of complexity \(\mathcal {O}(1)\).
Determining \(\mu \) in the third case of Proposition 4 requires:
-
a)
Find j in (28): \(\mathcal {O}(\log (r-t+1))\), because \(\tilde{z}_1 \ge \dots \ge \tilde{z}_{r-t}\).
-
b)
Determine \(\mu = \hat{\mu }_{k^{\star }}\) through the search rules over k: \(\mathcal {O}(r-t+1)\), because \(\sum _{i=1}^{r-t+1} \hat{z}_i\) may need to be computed.
Thus, \(\tilde{C}(n,r)\) is dominated by the complexity of determining \(\mu \), which by the preceding analysis is at most \(\mathcal {O}(r)\). Compared to [46], our approach reduces the overall cost significantly (see Table 2), which is especially important for the corresponding vector-valued problem.
5 Case Study: Matrix Completion
In the following, we will see how the binary search parameters (t, s, k) from Algorithm 1 and Proposition 4 evolve when solving an optimization problem with a proximal splitting method. We consider the convexified low-rank matrix completion problem (see, e.g., [7, 8, 17] for motivation and examples)
with \(r =50\), \(\mathcal {I} := \{n_{ij}: n_{ij} > 0 \}\) and \(N = \sum _{i=1}^r u_i u_i^\mathsf {T}\) being defined through the SVD of
Note that a smaller version of this example has been solved successfully in [17] by using an SDP solver, but this larger example is well beyond the reach of typical SDP solvers [39, 43]. Therefore, we apply the following Douglas–Rachford splitting scheme (see [9, 11, 33]):
with \(\mathcal {L} := \{X \in \mathbb {R}^{500\times 500}: x_{ij} = n_{ij}, \ (i,j) \in \mathcal {I} \}\), \(Z_0 = 0\) and \(\lim _{i \rightarrow \infty } X_i = \lim _{i \rightarrow \infty } Y_i\) being a solution to (30). By the construction of N, it can be shown that \(\lim _{i \rightarrow \infty } X_i = N\) (see [17]).
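To make the structure of such an iteration concrete, here is a minimal Douglas–Rachford sketch on a small synthetic completion instance. It is not the scheme above: the low-rank inducing norm prox is replaced by the classical singular value thresholding of the nuclear norm, and the data (size, mask, rank) are made up, purely to show how the projection onto \(\mathcal{L}\) and a prox step interleave:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic instance: observe roughly half the entries of a rank-1 matrix N.
u = rng.standard_normal((20, 1))
N = u @ u.T
mask = rng.random(N.shape) < 0.5

def proj_L(X):
    """Projection onto {X : x_ij = n_ij for observed (i, j)}."""
    Y = X.copy()
    Y[mask] = N[mask]
    return Y

def svt(X, gamma):
    """Prox of gamma * nuclear norm (singular value thresholding);
    it stands in here for the low-rank inducing norm prox."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - gamma, 0.0)) @ Vt

Z = np.zeros_like(N)
for _ in range(300):
    X = proj_L(Z)               # first prox: projection onto L
    Y = svt(2.0 * X - Z, 1.0)   # second prox on the reflected point
    Z = Z + Y - X               # Douglas-Rachford update

print(np.linalg.norm(X - N) / np.linalg.norm(N))  # relative error
```

In the actual scheme, `svt` is replaced by \(\text{prox}_{\gamma\Vert\cdot\Vert_{\ell_2,r*}}\), evaluated with Algorithm 1 and Proposition 3; the surrounding iteration is unchanged.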
The parameter path of (t, s, k) for computing \(X_i\) is shown in Fig. 1. We observe that as \(X_i\) approaches N, the values of t, s and k start plateauing. Thus, by using the values from one iterate in the subsequent iterate, the practical computational cost may be reduced significantly. Finally, after the initial transient, the variance of each parameter is small compared to the overall 500 singular values. As a result, sparse SVD algorithms, which only compute a small predefined number of the largest singular values (see, e.g., [32, 34]), can be applied effectively. This emphasizes that our complexity analysis is important for both vector- and matrix-valued problems.
6 Conclusion
This work presents a binary search framework for computing the proximal mappings of all unitarily invariant low-rank inducing norms and their epigraph projections. In particular, complete algorithms for the low-rank inducing Frobenius and spectral norms are presented. Our framework unifies and extends the known proximal mapping computations in the following sense: (i) So far, only proximal mappings for the squared low-rank inducing Frobenius norm [14] and the (non-squared) low-rank inducing spectral norm [46] have been derived. Our framework is independent of the particular unitarily invariant norm and of its composition with an increasing convex function. (ii) Excluding the cost for an SVD, i.e., the cost for the analogous vector-valued problem, we recover the same complexity for the squared low-rank inducing Frobenius norm as in [14, 29], but significantly decrease the complexity for the (non-squared) low-rank inducing spectral norm. Further, we show that these costs also transfer to compositions with simple functions.
Finally, our case study shows that within a proximal splitting method, the computational cost of our proximal mappings may, after a small number of iterations, be reduced to approximately linear cost beyond the singular value decomposition, and is therefore roughly the same as for the nuclear or spectral norm. Further, our example also demonstrates that sparse singular value decompositions (see e.g., [32, 34]) can be applied effectively, underlining the importance of our analysis even in the matrix case. Implementations for the low-rank inducing Frobenius and spectral norms are available for MATLAB and Python at [20, 21].
References
Andersson, F., Carlsson, M., Olsson, C.: Convex envelopes for fixed rank approximation. Optim. Lett. 11(8), 1783–1795 (2017). https://doi.org/10.1007/s11590-017-1146-5
Antoulas, A.C.: On the approximation of Hankel matrices. In: Helmke, U. (ed.) Operators. Systems and Linear Algebra: Three Decades of Algebraic Systems Theory, pp. 17–22. Vieweg+Teubner Verlag, Wiesbaden (2013)
Argyriou, A., Foygel, R., Srebro, N.: Sparse prediction with the k-support norm. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1457–1465. Curran Associates Inc, London (2012)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, New York (2011). https://doi.org/10.1007/978-3-319-48311-5
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004). https://doi.org/10.1017/CBO9780511804441
Candès, E.J., Plan, Y.: Matrix completion with noise. Proc. IEEE 98(6), 925–936 (2010). https://doi.org/10.1109/JPROC.2009.2035722
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717 (2009). https://doi.org/10.1007/s10208-009-9045-5
Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The convex geometry of linear inverse problems. Found. Comput. Math. 12(6), 805–849 (2012). https://doi.org/10.1007/s10208-012-9135-7
Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, New York (2011). https://doi.org/10.1007/978-1-4419-9569-8_10
Condat, L.: Fast projection onto the simplex and the \(\ell _1\) ball. Math. Program. 158(1), 575–585 (2016). https://doi.org/10.1007/s10107-015-0946-6
Douglas, J., Rachford, H.H.: On the numerical solution of heat conduction problems in two and three space variables. Trans. Am. Math. Soc. 82(2), 421–439 (1956)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the L1-ball for learning in high dimensions. In: ICML ’08: Proceedings of the 25th International Conference on Machine Learning, 25th International Conference on Machine Learning (ICML), pp. 272–279, New York (2008). https://doi.org/10.1145/1390156.1390191
Eldén, L.: Matrix methods in data mining and pattern recognition. SIAM (2007). https://doi.org/10.1137/1.9780898718867
Eriksson, A., Thanh Pham, T., Chin, T.-J., Reid, I.: The k-support norm and convex envelopes of cardinality and rank. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3349–3357 (2015)
Fazel, M., Hindi, H., Boyd, S.P.: A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the 2001 American Control Conference, vol. 6, pp. 4734–4739 (2001). https://doi.org/10.1109/ACC.2001.945730
Grussler, C., Giselsson, P.: Local convergence of proximal splitting methods for rank constrained problems. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 702–708. Melbourne (2017). https://doi.org/10.1109/CDC.2017.8263743
Grussler, C., Giselsson, P.: Low-rank inducing norms with optimality interpretations. SIAM J. Optim. 28(4), 3057–3078 (2018). https://doi.org/10.1137/17M1115770
Grussler, C., Zare, A., Jovanović, M.R., Rantzer, A.: The use of the \(r\ast \) heuristic in covariance completion problems. In: 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1978–1983. Las Vegas (2016). https://doi.org/10.1109/CDC.2016.7798554
Grussler, C., Rantzer, A., Giselsson, P.: Low-rank optimization with convex constraints. IEEE Trans. Autom. Control 63(11), 4000–4007 (2018). https://doi.org/10.1109/TAC.2018.2813009
Grussler, C.: LRINorm—A MATLAB package for rank constrained optimization by low-rank inducing norms and non-convex proximal splitting methods. https://github.com/LowRankOpt/LRINorm (2018a)
Grussler, C.: LRIPy—A Python package for rank constrained optimization by low-rank inducing norms and non-convex proximal splitting methods. https://github.com/LowRankOpt/LRIPy (2018b)
Held, M., Wolfe, P., Crowder, H.P.: Validation of subgradient optimization. Math. Program. 6(1), 62–88 (1974). https://doi.org/10.1007/BF01580223
Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg (1993). https://doi.org/10.1007/978-3-662-06409-2
Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms I: Fundamentals. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg (1996). https://doi.org/10.1007/978-3-662-02796-7
Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge (2012). https://doi.org/10.1017/9781139020411
Izenman, A.J.: Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. 5(2), 248–264 (1975). https://doi.org/10.1016/0047-259X(75)90042-1
Jacob, L., Obozinski, G., Vert, J.-P.: Group lasso with overlaps and graph lasso. In: Bottou, L., Littman, M. (eds) Proceedings of the 26th International Conference on Machine Learning, pp. 433–440. Montreal. Omnipress (2009)
Knuth, D.E.: The Art of Computer Programming: Sorting and Searching, vol. 3. Pearson Education, New York (1998)
Lai, H., Pan, Y., Lu, C., Tang, Y., Yan, S.: Efficient k-support matrix pursuit. In: Computer Vision – ECCV 2014, pp. 617–631. Springer, Berlin (2014)
Larsson, V., Olsson, C.: Convex low rank approximation. Int. J. Comput. Vis. 120(2), 194–214 (2016)
Lewis, A.S.: The convex analysis of unitarily invariant matrix functions. J. Convex Anal. 2(1), 173–183 (1995)
Li, Y., Liu, H., Wen, Z., Yuan, Y.: Low-rank matrix iteration using polynomial-filtered subspace extraction. SIAM J. Sci. Comput. 42(3), A1686–A1713 (2020). https://doi.org/10.1137/19M1259444
Lions, P.-L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979). https://doi.org/10.1137/0716071
Liu, X., Wen, Z., Zhang, Y.: Limited memory block Krylov subspace optimization for computing dominant singular value decompositions. SIAM J. Sci. Comput. 35(3), A1641–A1668 (2013a). https://doi.org/10.1137/120871328
Liu, Z., Hansson, A., Vandenberghe, L.: Nuclear norm system identification with missing inputs and outputs. Syst. Control Lett. 62(8), 605–612 (2013b). https://doi.org/10.1016/j.sysconle.2013.04.005
Lu, Z., Yong, Z., Li, X.: Penalty decomposition methods for rank minimization. Optim. Methods Softw. 30(3), 531–558 (2015). https://doi.org/10.1080/10556788.2014.936438
McDonald, A.M., Pontil, M., Stamos, D.: New perspectives on k-support and cluster norms. J. Mach. Learn. Res. 17(155), 1–38 (2016)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014). https://doi.org/10.1561/2400000003
Peaucelle, D., Henrion, D., Labit, Y., Taitz, K.: User's guide for SeDuMi Interface 1.04. LAAS-CNRS, Toulouse (2002)
Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010). https://doi.org/10.1137/070697835
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Toh, K.C., Tutuncu, R.H., Todd, M.J.: On the implementation of SDPT3 (version 3.1)—a MATLAB software package for semidefinite-quadratic-linear programming. In: IEEE International Conference on Robotics and Automation, pp. 290–296 (2004)
Trefethen, L.N., Bau, D., III: Numerical Linear Algebra. SIAM, Philadelphia (1997)
Villa, S., Rosasco, L., Mosci, S., Verri, A.: Proximal methods for the latent group lasso penalty. Comput. Optim. Appl. 58(2), 381–407 (2014). https://doi.org/10.1007/s10589-013-9628-6
Wu, B., Ding, C., Sun, D., Toh, K.-C.: On the Moreau–Yosida regularization of the vector \(k\)-norm related functions. SIAM J. Optim. 24(2), 766–794 (2014). https://doi.org/10.1137/110827144
Zoltowski, D.M., Dhingra, N., Lin, F., Jovanović, M.R.: Sparsity-promoting optimal control of spatially-invariant systems. In: 2014 American Control Conference, pp. 1255–1260 (2014). https://doi.org/10.1109/ACC.2014.6859491
Acknowledgements
This work was completed while both authors were members of the LCCC Linnaeus Center and the eLLIIT Excellence Center at Lund University. It was financially supported by the Swedish Foundation for Strategic Research and the Swedish Research Council through the Project 621-2012-5357.
Funding
Open access funding provided by Lund University.
Communicated by Shoham Sabach.
Lemmas, Proofs and Additional Discussion
1.1 Search Rules
Lemma 2
Let f be proper, closed and convex, \(z_1 \ge \dots \ge z_n\ge 0\) and \(\left( y^{(t)},w^{(t)}\right) \) denote the t-dependent solution to
where \(1\le t \le r\). Then there exists \(t^\star \) such that \(\left( y^{(t^\star )},w^{(t^\star )}\right) \) is the solution to
with \(y^{(t^\star )}_{r-t^\star } > y_{r-t^\star +1}^{(t^\star )}\) and \(y^{(t^\star )}_{r-t^\star } = y_{r-t^\star +1}^{(t^\star )}\) if \(t^\star = r\). Further,
i. \(t^\star = \min \left\{ \lbrace t: y^{(t)}_{r-t} > y_{r-t+1}^{(t)} \rbrace \cup \lbrace r \rbrace \right\} \).
ii. If \(y^{(t')}_{r-t'} \ge y_{r-t'+1}^{(t')}\) then \(y^{(t)}_{r-t} \ge y_{r-t+1}^{(t)}\) for all \(t \ge t'\).
iii. If \(y^{(t')}_{r-t'} < y_{r-t'+1}^{(t')}\) then \(y^{(t)}_{r-t} < y_{r-t+1}^{(t)}\) for all \(t \le t'\).
In particular, \(t^\star \) can be found by a search over t, where t is increased/decreased according to the following rules:
I. \(y^{(t)}_{r-t} \ge y_{r-t+1}^{(t)}\) for all \(t \ge t^\star \).
II. \(y^{(t)}_{r-t} \le y_{r-t+1}^{(t)}\) for all \(t < t^\star \).
III. If \(t < t^\star \) and \(y^{(t)}_{r-t} = y_{r-t+1}^{(t)}\) then \(\left( y^{(t)},w^{(t)} \right) = \left( y^{(t^\star )},w^{(t^\star )} \right) \).
Proof
First we show the equivalence between Eqs. (34) and (33). To this end, note that it is not necessary to explicitly restrict y to be nonnegative. The unique solution \((y^\star ,w^\star )\) to (34) fulfills \(0 \le y_i^\star \le z_i\) for \(1 \le i \le n\). The upper bound holds, because otherwise by [25, Theorem 7.4.8.4] \(g^D_r(\bar{y}^\star ) \le g^D_r(y^\star )\) with \(\bar{y}_i^\star := \min \{z_i,y_i^\star \}\), and thus \(\bar{y}^\star \) is a feasible solution to (34) with smaller cost. Similarly, the lower bound holds, because otherwise \(\bar{y}^\star \) with \(\bar{y}_i^\star = \max \{0,y_i^\star \}\) is a feasible solution to (34) with smaller cost by Definition 1 (ii). Then there exists \(t^\star \) such that \( y^\star _{r-t^\star } > y^\star _{r-t^\star +1} = \dots = y^\star _r\), where \(t^\star = r\) if \(y^\star _1 = y^\star _r\), which implies that the constraint \(y_{r-t^\star } \ge y_{r-t^\star +1}\) is inactive and can therefore be removed from (34). The constraints \(y_1 \ge \dots \ge y_{r-t^\star }\) can then also be removed, because the cost function and the sorting of z ensure that the solution always fulfills them. Hence, solving (34) reduces to finding \(t^\star \) such that the solution of (33) solves (34).
Next, we characterize \(t^\star \) in terms of the solutions to (33). In the following, we let p(t) denote the optimal cost of (33) as a function of t. Since adding constraints cannot reduce the optimal cost, p is a nondecreasing function.
Item i.: By the same reasoning that led to the equivalence between Eqs. (34) and (33), it holds that \(y_1^{(t)} \ge \dots \ge y_{r-t}^{(t)}, \ 1\le t \le r\), which is why the set \(\lbrace t: y^{(t)}_{r-t} > y_{r-t+1}^{(t)} \rbrace \cup \lbrace r \rbrace \) contains all t for which the solution of (33) is feasible for (34). Since p is nondecreasing and \(\left( y^{(t^\star )},w^{(t^\star )}\right) \) is unique, the first claim follows.
Item ii.: The second claim is proven by contradiction. Let \((y^{(t')},w^{(t')})\) be such that \(y^{(t')}_{r-t'} \ge y_{r-t'+1}^{(t')}\). Further assume that \(y^{(t'+1)}_{r-t'-1} < y_{r-t'}^{(t'+1)}\). In the following, we construct another solution \((\tilde{y},\tilde{w}) \in \mathbb {R}^{q+1}\) to (33) with \(t = t' +1\), which has a cost that is no larger than \(p(t' +1)\). However, (33) has a unique solution due to strong convexity of the cost function. This yields the desired contradiction. The contradicting solution is constructed as a convex combination \(\tilde{w}=(1-\alpha )w^{(t'+1)}+\alpha w^{(t')}\) with \(\alpha \in (0,1]\) and a partially sorted convex combination of \(y^{(t')}\) and \(y^{(t'+1)}\) with the same \(\alpha \). Let \(\hat{y}:=(1-\alpha )y^{(t'+1)}+\alpha y^{(t')}\) and let
be the partially sorted convex combination. To select \(\alpha \), we note that by assumption, \(y^{(t')}_{r-t'-1}\ge y^{(t')}_{r-t'} \ge y_{r-t'+1}^{(t')}\) and \(y^{(t'+1)}_{r-t'-1} < y_{r-t'}^{(t'+1)}=y_{r-t'+1}^{(t'+1)}\). Therefore, there exists an \(\alpha \in (0,1]\) such that
Since \(y_{r-t'+1}^{(t')}=\cdots =y_{r}^{(t')}\) and \( y_{r-t'-1}^{(t'+1)}=\cdots =y_{r}^{(t'+1)},\) it follows that \(\tilde{y}_{r-t'}=\cdots =\tilde{y}_{r}.\) Furthermore, the construction of \(\tilde{y}\) as well as the sorting yield that \(\tilde{y}_{r}\ge \cdots \ge \tilde{y}_q\) and \(\tilde{y}_{1}\ge \cdots \ge \tilde{y}_{r-t'-1}\), which is why \(\tilde{y}\) satisfies the chain of inequalities in (33) for \(t=t'+1\).
It remains to show that \(\tilde{y}\) satisfies the epigraph constraint and that the cost is not higher than \(p(t'+1)\). These properties are already fulfilled for \(\hat{y}\) being a convex combination of two feasible points with costs \(p(t')\) and \(p(t'+1)\), respectively, where \(p(t') \le p(t'+1)\). Therefore, it is left to show that the sorting involved in \(\tilde{y}\) maintains these properties. First, we show that sorting of any sub-vector in \(\hat{y}\) does not increase the cost. Suppose that \(z_i\ge z_j\), \(\hat{y}_i\le \hat{y}_j\), i.e., \(\hat{y}\) is not sorted the same way as z. Then
and thus the cost is not increased by sorting \(\hat{y}\) or any sub-vector of it. Further, a permutation of the first r elements of \(\hat{y}\) does not influence the epigraph constraint, because \(g^D_r(\hat{y})\) is permutation invariant by definition.
Next notice that \(\tilde{y}\) is obtained from \(\hat{y}\) by first swapping \(\hat{y}_{r-t'-1}\) and \(\hat{y}_{r-t'}\). From the choice of \(\alpha \), we conclude that
Thus, this swap is a sorting step which neither increases the cost nor violates the epigraph constraint. Analogously, sorting the first \(r-t'\) elements of the resulting vector to obtain \(\tilde{y}\) has the same effect, and therefore we obtain the desired contradiction.
Item iii.: Suppose that there exist t and \(t'\) with \(t'>t\) such that \(y_{r-t'}^{(t')}<y_{r-t'+1}^{(t')}\) and \(y_{r-t}^{(t)}\ge y_{r-t+1}^{(t)}\). Then Item ii. shows that \(y_{r-t'}^{(t')}\ge y_{r-t'+1}^{(t')}\), which is a contradiction.
Items I. to III.: The statements follow immediately from Items i. to iii.
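The search rules I.–III. make the feasibility check a monotone predicate in t, so \(t^\star \) can be located by bisection with only \(\mathcal {O}(\log r)\) solves of (33) instead of a linear scan. A minimal Python sketch under this assumption (the solver for (33) is abstracted as a user-supplied predicate; the function name and interface are hypothetical, not taken from the paper's implementation):

```python
def find_t_star(is_feasible, r):
    """Bisection based on the search rules of Lemma 2.

    is_feasible(t) should solve (33) for the given t and report whether
    y^(t)_{r-t} >= y^(t)_{r-t+1}.  Rule I makes this True for all t >= t*,
    and Rule II makes it fail strictly below t* except possibly at
    equality, where Rule III guarantees that the corresponding solution
    already coincides with the optimal one -- so returning the smallest
    index with a True predicate is safe.
    """
    lo, hi = 1, r
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid       # optimum at or below mid (Rule I)
        else:
            lo = mid + 1   # optimum strictly above mid (Rule II)
    return lo              # equals r when the predicate is never True
```

For example, with a predicate that first becomes true at \(t=4\) on the range \(1 \le t \le 10\), `find_t_star(lambda t: t >= 4, 10)` returns 4.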
Lemma 3
Let f and z be as in Lemma 2 and \(\left( y^{(t,s)},w^{(t,s)}\right) \) denote the (t, s)-dependent solution to
where \(0 \le s \le n-r\) and t is fixed within \(1 \le t \le r\). Then, there exists \(s^\star \) such that \(\left( y^{(t,s^\star )},w^{(t,s^\star )}\right) \) is the solution to (33) with \(y^{(t,s^\star )}_{r+s^\star } > y_{r+s^\star +1}^{(t,s^\star )}\) and \(y^{(t,s^\star )}_{r+s^\star } = y_{r+s^\star +1}^{(t,s^\star )}\) if \(s^\star = n-r\). Further,
i. \(s^\star = \min \left\{ \lbrace s: y^{(t,s)}_{r+s} > y_{r+s+1}^{(t,s)} \rbrace \cup \lbrace n-r \rbrace \right\} \).
ii. If \(y^{(t,s')}_{r+s'} \ge y_{r+s'+1}^{(t,s')}\) then \(y^{(t,s)}_{r+s} \ge y_{r+s+1}^{(t,s)}\) for all \(s \ge s'\).
iii. If \(y^{(t,s')}_{r+s'} < y_{r+s'+1}^{(t,s')}\) then \(y^{(t,s)}_{r+s} < y_{r+s+1}^{(t,s)}\) for all \(s \le s'\).
In particular, \(s^\star \) can be found by a search over s, where s is increased/decreased according to the following rules:
I. \(y^{(t,s)}_{r+s} \ge y_{r+s+1}^{(t,s)}\) for all \(s \ge s^\star \).
II. \(y^{(t,s)}_{r+s} \le y_{r+s+1}^{(t,s)}\) for all \(s < s^\star \).
III. If \(s < s^\star \) and \(y^{(t,s)}_{r+s} = y_{r+s+1}^{(t,s)}\) then \(\left( y^{(t,s)},w^{(t,s)} \right) = \left( y^{(t,s^\star )},w^{(t,s^\star )} \right) \).
The proof of Lemma 3 is analogous to that of Lemma 2 and is therefore omitted.
Lemma 4
Let f and z be as in Lemma 2, \(1 \le t \le r\) and \( 0 \le s \le n-r\). Moreover, let \(\tilde{z} := Tz \in \mathbb {R}^{r-t+1}\) be defined by (22) and let \((\tilde{y}^{(t,s)},w^{(t,s)})\) be the (t, s)-dependent solution to
Then \((y^{(t,s)},w^{(t,s)})\) is a solution to (35), where
Proof
Letting \(\tilde{y} \in \mathbb {R}^{r-t+1}\) be defined as
and noticing that
yields the reduced-dimensional problem (36).
1.2 Proof of Theorem 1
By Lemma 2, (34) can be solved by performing a search over the t-dependent solutions to (33), where by Lemma 3 these solutions can be determined for each t by a search over the s-dependent solutions to (35). In order to solve (35), we apply Lemma 4 to reduce it to solving (24) in Theorem 1. The remainder of the theorem then follows directly from Lemmas 2 and 3, and thus a nested search with the stated rules finds \((t^\star ,s^\star )\).
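The nested structure described above can be sketched as one bisection inside another; the subproblem solver and the two feasibility checks from Lemmas 2 and 3 are abstracted as user-supplied callables (all names and interfaces below are hypothetical), so each proximal evaluation costs on the order of \(\log r \cdot \log (n-r)\) subproblem solves, aside from the singular value decomposition:

```python
def bisect_min(pred, lo, hi):
    """Smallest x in [lo, hi] with pred(x) True; returns hi if none is."""
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def nested_search(solve, s_feasible, t_feasible, r, n):
    """Nested bisection for (t*, s*) as in the proof of Theorem 1.

    solve(t, s)     -- solves the inner subproblem, cf. (24)
    s_feasible(sol) -- inner check y_{r+s} >= y_{r+s+1} (Lemma 3 rules)
    t_feasible(sol) -- outer check y_{r-t} >= y_{r-t+1} (Lemma 2 rules)
    Repeated solves could be cached in a real implementation.
    """
    def solve_t(t):
        # inner search: smallest feasible s in [0, n - r]
        s = bisect_min(lambda s: s_feasible(solve(t, s)), 0, n - r)
        return s, solve(t, s)

    # outer search: smallest feasible t in [1, r]
    t = bisect_min(lambda t: t_feasible(solve_t(t)[1]), 1, r)
    s, sol = solve_t(t)
    return t, s, sol
```

With mock callables whose feasibility thresholds are \(t \ge 3\) and \(s \ge 2\), the routine returns \((t,s) = (3,2)\) together with the corresponding subproblem solution.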
1.3 General Solution to (24)
In every step of the binary search, (24) must be solved. Provided a very mild constraint qualification holds (which it does for our functions of interest), the solution falls into one of three cases, depending on f and the singular values of Z. The different cases are described in the following.
Proposition 5
Suppose that there exists \((\bar{y},\bar{w})\) such that \(\bar{w}\in \mathrm{{relint}} (\text {dom}f)\) and \(\bar{w}>g^D_{r,s,t}(\bar{y})\). Then \((\tilde{y},\tilde{w})\) is a solution to (24) if and only if one of the following cases applies:
where \(\tilde{z} := T \sigma (Z)\) is given by (22).
Proof
A solution \((\tilde{y},\tilde{w})\) to (24) fulfills \(0\in \partial (f(\tilde{w})+\tfrac{\gamma }{2}\Vert \tilde{y}-\tilde{z}\Vert ^2+\chi _{\text {epi}(g^D_{r,s,t})}(\tilde{y},\tilde{w}))\) by [24, Theorem VI.2.2.1], which under the assumed constraint qualification is equivalent to
where \(\mathcal {N}\) denotes the normal cone to \(\text {epi}(g^D_{r,s,t})\) and the summation is understood set-wise. Then by [24, Proposition VI.1.3.1]
which is why we need to distinguish the cases \(\tilde{y} = \tilde{z}\) and \(\tilde{w} = g^D_{r,s,t}(\tilde{y})\). Thus, the proof follows by invoking (3).
Remark 2
In the epigraph case with \(f(w)=\tfrac{1}{2}(w+z_v)^2\) and \(\gamma = 1\), (C1) corresponds to \((z,-z_v)\) lying in the cone given by the epigraph of \(g^D_{r,s,t}\), (C2) corresponds to \((z,z_v)\) lying in the cone given by the epigraph of the dual gauge function \(g_{r,s,t}\), and (C3) covers the remaining cases.
The problem of solving (18) therefore reduces to checking Eqs. (C1), (C2) and (C3) within the nested binary search, which has been made explicit for \(g^D=\ell _2\) in Appendix 4 and \(g^D=\ell _{1}\) in Appendix 6.
1.4 Proof of Proposition 3
For \(\tau > 0\) and a gauge function \(\tilde{g}\), \(g = \tau \tilde{g}\) is a gauge function with \(g^D = \frac{\tilde{g}}{\tau }\). Setting \(\gamma =1\) and \(f(w) = \frac{1}{2}(w +z_v)^2\) in Theorem 1, Eqs. (C1), (C2), and (C3) in Proposition 5 then become
For our particular case \(\tilde{g} = \ell _2\), it follows immediately that Eqs. (42a) and (42b) correspond to Eqs. (27a) and (27b). Furthermore, by taking the gradient of \(g^D_{r,s,t}\), (42c) becomes Eqs. (27c), (27e) and (27d) with the constraints \(\mu \ge 0\) and \(\tau \tilde{w} = {g^D_{r,s,t}(\tilde{y})}\). Thus, it remains to compute \(\mu \ge 0\). Plugging Eqs. (27c), (27e) and (27d) into \(\tau ^2 \tilde{w}^2 = {g^D_{r,s,t}(\tilde{y})}^2\) and rearranging yields
Then defining \(c_1 := \sum _{i=1}^{r-t} \tilde{z}^2_i\) and \(c_2 := \sqrt{t+s}\tilde{z}_{r-t+1}\), this can be rewritten as the fourth-order polynomial equation (27f), which can be solved explicitly for the unique \(\mu \ge 0\) after the substitution (27e) is performed. This proves the first part of Proposition 3. For \(f(w) = \chi _{[0,\gamma ]}(w)\), Eqs. (C1), (C2) and (C3) are
Note that (C2) is redundant here, because it coincides with (43a). Hence, for \(g = \ell _2\) (43a) becomes (27a) with \(z_v = -1\) and (43b) is equivalent to Eqs. (27f), (27c) and (27d) with \(\tilde{w} = 1\).
1.5 Break Point Search
Lemma 5
Let \((\tilde{z},z_v)\) fulfill neither (29a) nor (29b), and let \(\hat{z}\) and \(\alpha \) be as in Proposition 4. Further, let \(\mu ^\star \) be the solution to \( \sum _{i=1}^{r-t+1}\max (\hat{z}_{i}-\alpha _i\mu ,0) +z_v - \mu =0\) and \(\hat{\mu }_k\) be the solution to \(\sum _{i=1}^{k} \left( \hat{z}_{i}-\alpha _i\mu \right) +z_v - \mu =0\), i.e., \(\hat{\mu }_k = \frac{z_v+\sum _{i=1}^{k}\hat{z}_{i}}{1+\sum _{i=1}^k\alpha _i}.\) Then there exists \(k^\star \in \{1,\dots ,r-t+1 \}\) such that \(\hat{z}_{k^\star }- \alpha _{k^\star }{\mu ^\star } \ge 0\), \(\hat{z}_{i}-\alpha _{i}{\mu ^\star } < 0\) for all \(i > k^\star \) and
i. \(\hat{\mu }_{k^\star } = \mu ^\star .\)
ii. \(k^\star =\max \{k : \hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\}\).
iii. If \(\hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\), then \(\hat{z}_i-\alpha _i\hat{\mu }_i\ge 0\) for all \(i \le k\).
iv. If \(\hat{z}_k-\alpha _k\hat{\mu }_k< 0\), then \(\hat{z}_i-\alpha _i\hat{\mu }_i< 0\) for all \(i\ge k\).
In particular,
I. \(\hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\) for all \(k \le k^\star \).
II. \(\hat{z}_k-\alpha _k\hat{\mu }_k< 0\) for all \(k> k^\star \).
Proof
We first show some results needed to prove Items ii. and iii. Let \( g_k(\mu ):=\sum _{i=1}^{k}\max (\hat{z}_{i}-\alpha _i\mu ,0)+z_v-\mu ,\) and let \(\mu _k\) be the unique solution to the equation \(g_k(\mu )=0\). Since all \(g_i\) are strictly decreasing in \(\mu \) and \(g_k(\mu ) = g_{k-1}(\mu )+ \max (\hat{z}_{k}-\alpha _k\mu ,0) \ge g_{k-1}(\mu ),\) we have
a. \(\mu _{k-1}\le \mu _{k}.\)
b. \(\hat{z}_k-\alpha _k\mu _k\le 0 \ \Leftrightarrow \ g_{k-1}(\mu _k)=g_k(\mu _k)=0 \ \Leftrightarrow \ \mu _{k-1}=\mu _k\).
Moreover, the break point sorting in \(\hat{z}\) implies that if l and \(\mu \) are such that \(\hat{z}_l-\alpha _l\mu \ge 0\), then also \(\hat{z}_i-\alpha _i\mu \ge 0\) for all \(i\le l\). Thus,
In conjunction with the uniqueness of \({\mu }_k\), this implies that
c. \(\hat{z}_k-\alpha _k\mu _k\ge 0\) or \(\hat{z}_k-\alpha _{k}\hat{\mu }_k\ge 0 \ \Leftrightarrow \ \hat{\mu }_k=\mu _k\).
Item i.: This has already been proven in the discussion before Lemma 5.
Item ii.: By the definition of \(k^\star \) and Item i. it holds that
Thus, by Item c. \(\hat{\mu }_{k^{\star }} = \mu ^\star = \mu _{k^{\star }}\) and \(\hat{z}_i-\alpha _i \mu _{i} < 0\) for all \(i > k^\star .\) Then Item b. implies that \( \hat{\mu }_{k^{\star }} = \mu ^\star = \mu _{r-t+1} = \mu _{r-t} = \dots = \mu _{k^{\star }}.\) Therefore, if there exists \(k > k^\star \) with \(\hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\), it will hold by Item c. that \(\hat{\mu }_k = \mu _k = \hat{\mu }_{k^{\star }}\), which contradicts (44), because \(0 \le \hat{z}_k-\alpha _k\hat{\mu }_k = \hat{z}_k-\alpha _k\hat{\mu }_{k^\star } < 0.\) This proves that \(k^\star = \max \{k: \hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\}\).
Item iii.: Assume that \(\hat{z}_{k}-\alpha _{k}\hat{\mu }_k\ge 0\). Then, by the break point sorting it holds that \( \hat{z}_{k-1}-\alpha _{k-1}\hat{\mu }_k\ge 0\) and by Items a. and c. that \(\hat{\mu }_k = \mu _k \ge \mu _{k-1}\). Thus, we conclude that
where the last equality follows again by Item c. The other indices follow inductively.
Item iv.: Let on the contrary k be such that \(\hat{z}_k-\alpha _k\hat{\mu }_k<0\), but with \(i\in \{k,\ldots ,r-t+1\}\) such that \(\hat{z}_i-\alpha _i\hat{\mu }_i\ge 0\). Then, by Item iii., \(\hat{z}_k-\alpha _k\hat{\mu }_k\ge 0\), which is a contradiction.
Items I. and II.: Follow immediately from Items ii. to iv.
1.6 Proof of Proposition 4
Analogously to the proof of Proposition 3, Eqs. (42a) and (42b) correspond to Eqs. (C1) and (C2) in Proposition 5, which for \(\tilde{g} = \ell _{\infty }\) translate to
Since \(\tilde{z}\) is nonnegative and decreasingly sorted, the second case simplifies to (29b). For (42c), we need to note that \(\tilde{y} \in \mathbb {R}^{r-t+1}_{\ge 0}\) and therefore the conditions for \(\tilde{y}_i=0\) and \(\tilde{y}_i>0\) become
for all \(i \in \{1,\ldots ,r-t \}\). These equivalences also hold for \(\tilde{y}_{r-t+1}\) with \(\mu \) multiplied by \(t/\sqrt{s+t}\). Therefore, Eqs. (29c), (29d) and (29e) follow together with the constraints \(\tau \tilde{w} = \tilde{g}^D_{r,s,t}(\tilde{y})\) and \(\mu \ge 0\). Then, plugging Eqs. (29c) and (29d) into \(\tau \tilde{w} = \tilde{g}^D_{r,s,t}(\tilde{y})\) yields
which determines the unique solution \(\mu \ge 0\). We solve this equation with a so-called break point search algorithm, as has been done for similar problems in [10, 12, 22].
In our case, the break points are given by the smallest values of \(\mu \) for which each max expression, as a function of \(\mu \), becomes zero, i.e., \(\left( \gamma \tilde{z}_1, \dots , \gamma \tilde{z}_{r-t}, \frac{\gamma \sqrt{s+t}}{t} \tilde{z}_{r-t+1} \right) \). Then we define \(\hat{z} :=\frac{1}{\gamma }\left( \tilde{z}_1,\ldots ,\tilde{z}_j,\dfrac{t}{\sqrt{t+s}}\tilde{z}_{r-t+1},\tilde{z}_{j+1},\ldots ,\tilde{z}_{r-t}\right) \) to be the vector that sorts \(\frac{1}{\gamma }\left( \tilde{z}_1,\dots ,\tilde{z}_{r-t},\frac{t}{\sqrt{t+s}} \tilde{z}_{r-t+1}\right) \) by decreasing break points, i.e., j fulfills
Therefore, (45a) can be equivalently written as
with \(\alpha =\frac{1}{\gamma ^2}\left( 1,\ldots ,1,\dfrac{t^2}{(t+s)},1,\dots ,1\right) .\) Hence, there exists an index \(k^\star \in \{1,\dots ,r-t+1\}\) such that the unique solution \(\mu \ge 0\) to (46b) fulfills
which is why \(\mu \) can be determined as
Consequently, computing \(\mu \) amounts to a search for the \(k^\star \in \{1,\dots ,r-t+1\}\) for which (46d) satisfies (46c). This can be done with the search rules in Lemma 5.
Finally, if \(f(w) = \chi _{[0,\gamma ]}(w)\), then Eqs. (C1), (C2) and (C3) are given by Eqs. (43a) and (43b). For \(\tilde{g} = \ell _{\infty }\), this corresponds to (29a) with \(z_v = -1\), and Eqs. (29c) and (29d) with the constraint that \( \sum _{i=1}^{r-t+1}\max (\hat{z}_{i}-\alpha _i\mu ,0)=\tau ,\) respectively. Therefore, \(\hat{\mu }_k = \frac{\sum _{i=1}^{k}\hat{z}_{i}}{\sum _{i=1}^k\alpha _i}\), \(\mu =\dfrac{\sum _{i=1}^{k^\star }\hat{z}_{i}}{\sum _{i=1}^{k^\star }\alpha _i}\) and it is readily seen that \(k^\star \) obeys the same rules as in Lemma 5.
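A minimal sketch of the break point search that solves \(\sum _{i=1}^{r-t+1}\max (\hat{z}_{i}-\alpha _i\mu ,0) +z_v - \mu =0\) following Lemma 5. It assumes \(\hat{z}\) and \(\alpha \) have already been break-point sorted as above; a plain linear scan over k is shown, although rules I. and II. of Lemma 5 would equally permit bisection:

```python
def break_point_search(z_hat, alpha, z_v):
    """Return mu solving sum_i max(z_hat[i] - alpha[i]*mu, 0) + z_v - mu = 0.

    Assumes z_hat[i]/alpha[i] is decreasing (break point sorting).  The
    candidate mu_hat_k = (z_v + sum_{i<=k} z_hat[i]) / (1 + sum_{i<=k} alpha[i])
    is tracked while z_hat[k] - alpha[k]*mu_hat_k >= 0; by Lemma 5,
    mu* = mu_hat_{k*} with k* the largest such index, and by Rule II the
    first violation ends the scan.
    """
    num, den = z_v, 1.0
    mu = None
    for zk, ak in zip(z_hat, alpha):
        num += zk
        den += ak
        mu_k = num / den
        if zk - ak * mu_k >= 0:
            mu = mu_k   # still k <= k*: keep the latest candidate
        else:
            break       # Rule II: all later indices violate as well
    return mu
```

For instance, `break_point_search([3, 2, 1], [1, 1, 1], 0)` returns \(5/3\), and indeed \(\max (3-\tfrac{5}{3},0)+\max (2-\tfrac{5}{3},0)+\max (1-\tfrac{5}{3},0)-\tfrac{5}{3} = 0\).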
Grussler, C., Giselsson, P. Efficient Proximal Mapping Computation for Low-Rank Inducing Norms. J Optim Theory Appl 192, 168–194 (2022). https://doi.org/10.1007/s10957-021-01956-2