Bias Versus Non-Convexity in Compressed Sensing

Cardinality and rank functions are ideal ways of regularizing under-determined linear systems, but optimization of the resulting formulations is made difficult since both these penalties are non-convex and discontinuous. The most common remedy is to instead use the ℓ¹- and nuclear norms. While these are convex and can therefore be reliably optimized, they suffer from a shrinking bias that degrades the solution quality in the presence of noise. This well-known drawback has given rise to a fauna of non-convex alternatives, which usually feature better global minima at the price of possibly getting stuck in undesired local minima. We focus in particular on penalties based on the quadratic envelope, which have been shown to have global minima which even coincide with the "oracle solution," i.e., there is no bias at all. So, which one do we choose, convex with a definite bias, or non-convex with no bias but less predictability? In this article, we develop a framework which allows us to interpolate between these alternatives; that is, we construct sparsity inducing penalties where the degree of non-convexity/bias can be chosen according to the specifics of the particular problem.


Introduction and Background
Sparsity and rank penalties are common tools for regularizing ill-posed linear problems. The sparsity regularized problem is often formulated as

min_x μ‖x‖₀ + ‖Ax − b‖²,   (1)

where ‖x‖₀ is the number of nonzero elements of x.
Optimization of (1) is difficult since the term ‖x‖₀ is non-convex and discontinuous at any point containing entries that are zero, which in particular applies to the sought sparse solution.
The most common remedy is to replace ‖x‖₀ by the ℓ¹-norm, leading to the convex problem

min_x λ‖x‖₁ + ‖Ax − b‖².   (2)

It can be shown (Theorem 3.1 in [12]) that Q₂(f) + ‖·‖² is the convex envelope of f + ‖·‖²; this is useful for concrete calculations. For f(x) = μ‖x‖₀, we obtain the objective

min_x Q₂(μ‖·‖₀)(x) + ‖Ax − b‖²,   (3)

where Q₂(μ‖·‖₀)(x) coincides, in fact, with the so-called minimax concave penalty (MCP) [34]; calculation details can be found in [11], Example 2.4. In [8], it was argued that the so-called oracle solution is the best one could possibly wish for; it is what we get if we minimize ‖Ax − b‖² over the "true" support of x₀. It was shown in [13] that the oracle solution often is a global minimizer of (3), and moreover, that it is unique as a sparse minimizer, i.e., any other local minimizer necessarily has higher cardinality. This is true under the LRIP assumption (lower restricted isometry property, see [2]) on A, which states that there should be a positive constant δ⁻_K sufficiently close to 0 such that

‖Ax‖² ≥ (1 − δ⁻_K)‖x‖²   (4)

for all vectors x with ‖x‖₀ ≤ K, and hence, this is a weaker assumption than the standard RIP estimates (see, e.g., [8]). Not only is the LRIP condition less stringent than RIP; the estimates on the corresponding constant needed for the results in [13] to be valid are also easier to satisfy. The same holds true for the present paper, where we will show similar results for a class of penalties intermediate between λ‖·‖₁ and Q₂(μ‖·‖₀). Before outlining the details, let us mention that there is a parallel theory for low-rank matrices. In this setting, we seek to minimize

min_X μ rank(X) + ‖A(X) − b‖²,   (5)

where A : R^{m×n} → R^p is a linear operator. Here the standard approach relies on replacing μ rank(X) with the nuclear norm ‖X‖_*, which is the ℓ¹-norm applied to the singular values σ(X) of a given matrix X [7,26], whereas [24] proposed solving instead

min_X Q₂(μ rank)(X) + ‖A(X) − b‖²   (6)

and showed a number of desirable features, which was further strengthened in [14]. As for the vector case, this paper provides penalties in between the two extremes.
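To make the LRIP condition concrete, the following sketch brute-forces the constant δ⁻_K for a small matrix by checking every support of size K. The helper name `lrip_constant` is ours, and the form ‖Ax‖² ≥ (1 − δ⁻_K)‖x‖² for ‖x‖₀ ≤ K is assumed from the discussion above; for realistic dimensions this enumeration is of course infeasible.

```python
import itertools

import numpy as np

def lrip_constant(A, K):
    """Brute-force the lower-RIP constant delta_K^- such that
    ||Ax||^2 >= (1 - delta_K^-) ||x||^2 for all x with ||x||_0 <= K.
    Only feasible for small matrices: checks every support of size K."""
    n = A.shape[1]
    worst = 1.0  # smallest value of ||Ax||^2 / ||x||^2 found so far
    for S in itertools.combinations(range(n), K):
        # smallest squared singular value of the column submatrix A_S
        s = np.linalg.svd(A[:, S], compute_uv=False)
        worst = min(worst, s[-1] ** 2)
    return 1.0 - worst

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12))
A /= np.linalg.norm(A, axis=0)  # normalized columns
d2 = lrip_constant(A, 2)
```

Since any two columns of a generic A are linearly independent, δ⁻₂ lies strictly below 1, and the constant grows with K.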
While the non-convex relaxations (3) and (6) provide unbiased alternatives to the ℓ¹/nuclear norms which can be shown to only have one sparse/low-rank stationary point ([13,14,24]), it is clear that there always will be poor local minimizers. To see this, let x_h be a dense vector from the nullspace of A and x_p a minimizer of ‖Ax − b‖. Then by rescaling x_h so that all the elements of x_p + x_h have magnitude strictly larger than √μ, we obtain a vector that minimizes the data fit while the regularization Q₂(μ‖·‖₀) is (locally) constant around it. We recall that (3) and (6) are usually solved with iterative solvers such as forward-backward splitting (FBS) or the alternating direction method of multipliers (ADMM), which often are initialized by 0 or some rough approximation of the desired solution. We introduce the somewhat informal concept of a convergence basin, by which we mean the set of initial points which lead to the global minimizer, without further specifying which algorithm or parameter choice is used. For example, the point x_p + x_h above (and any point near it) lies outside the "convergence basin." In contrast, (2) (and its matrix counterpart) has the whole space as convergence basin. To summarize, the non-convex penalties enjoy better properties of the global minimizer but could have a small convergence basin, leading to suboptimal performance in practice.
To find a good trade-off between the benefits of both methods, we introduce here a sort of crossover. We will study relaxations of

min_x μ‖x‖₀ + λ‖x‖₁ + ‖Ax − b‖²   (7)

and

min_X μ rank(X) + λ‖X‖_* + ‖A(X) − b‖²   (8)

for sparsity and rank regularization, respectively. We propose to minimize these by replacing the penalties with their quadratic envelopes Q₂(μ‖·‖₀ + λ‖·‖₁) and Q₂(μ rank + λ‖·‖_*), respectively. A reason for this choice is that this regularization does not move the global minimizer, and hence, in many cases we actually find the minimizer of (7) and (8).
Our formulation can be seen as a trade-off between small bias and improved optimization properties. While the terms λ‖x‖₁ and λ‖X‖_* introduce a small bias to solutions, they also increase the convergence basin. Simple optimization is often related to good modeling. Adding a weak shrinking factor may also make sense from a modeling perspective for certain applications. In this paper, we exemplify with non-rigid structure from motion (NRSfM). Here each nonzero singular value corresponds to a mode of deformation. When choosing a smaller μ (larger rank) in order to capture all fine deformations, the resulting problem is often ill-posed due to unobserved depths. As noted in [24], this may result in a large difference to the true reconstruction despite good data fit. The addition of the λ‖X‖_* term allows us to separately incorporate a variable bias restricting the size of the deformations, which regularizes the problem further, see Sect. 7.5.
The main contributions of this paper are:
• We present a class of new regularizers that leverage the benefits of previous convex as well as unbiased non-convex formulations.
• We show that the local minimizers of the relaxed functionals form a subset of the local minimizers of (7).
• We introduce a concrete point, called the λ-regularized oracle solution, which is a local minimizer of the relaxation of (7).

Relaxations and Shrinking Bias
In this section, we will study properties of our proposed relaxations of (7) and (8). We will present our results in the context of the vector case (7). The corresponding matrix versions follow by applying the regularization term to the singular values, with similar proofs. Our first theorem shows that adding the term λ‖·‖₁ before or after taking the quadratic envelope makes no difference. We say that a function is sign-invariant if the sign of any coordinate can be changed without affecting the function value.

Theorem 2.1 Let f : R^d → R be a sign-invariant lower semicontinuous function. Then

Q₂(f + λ‖·‖₁)(x) = Q₂(f)(x) + λ‖x‖₁

for every x ∈ R^d.
Proof In [12] (Proposition 3.1 and Theorem 3.1), it is shown that for a lower semicontinuous functional g, the envelope Q₂(g)(y) can be computed via suprema of the form

sup_x ⟨x, y⟩ − g(x) − ‖x‖².   (11)

By sign-invariance, it suffices to consider y ∈ R^d₊. First notice that in (11) only the term ⟨x, y⟩ depends on the signs of the elements of x; thus, it is clear that any maximizing x will have sign(x_i) = sign(y_i). Therefore, we may assume without loss of generality that x ∈ R^d₊ as well. We now have ‖x‖₁ = ⟨x, 1⟩, which, applied to g = f + λ‖·‖₁, reduces (11) to

sup_x ⟨x, y − λ1⟩ − f(x) − ‖x‖².

Note that if y_j − λ < 0 for some j, then decreasing the jth coordinate along e_j, where e_j is the jth vector of the canonical basis, does not decrease the expression, which implies that the above supremum is the same if we restrict attention to x with supp(x) ⊂ S, where S = {i : y_i > λ}. Define χ_S x = Σ_{k∈S} e_k x_k and note that ⟨χ_S x, y − λ1⟩ = ⟨x, (y − λ1)₊⟩ for such x, where (x)₊ denotes thresholding at 0, that is, (x)₊ = (max(x₁, 0), ..., max(x_d, 0)). This gives a more concrete expression for (11), from which the statement follows.
The function f need not be μ‖·‖₀; if the sought cardinality is known, one could for example incorporate this information by letting f be the indicator functional of {x : ‖x‖₀ ≤ K} (cf. [13]), which leads to highly non-trivial non-separable new penalties. However, for the remainder of this paper we focus exclusively on f(x) = μ‖x‖₀. In view of the above and (3), it follows that Q₂(μ‖·‖₀ + λ‖·‖₁) = r_{μ,λ}, where

r_{μ,λ}(x) = λ‖x‖₁ + Σ_i (μ − (√μ − |x_i|)₊²).   (12)

We therefore propose to minimize the objective

min_x r_{μ,λ}(x) + ‖Ax − b‖².   (13)

This is motivated by the following simple observation.
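For reference, the penalty r_{μ,λ} is cheap to evaluate. The sketch below assumes the standard per-coordinate MCP form μ − (√μ − |x_i|)₊² plus the ℓ¹ term; note that it reduces to λ‖x‖₁ for μ = 0 and saturates at μ per nonzero coordinate when λ = 0.

```python
import numpy as np

def r_mu_lambda(x, mu, lam):
    """Evaluate r_{mu,lam}(x) = Q_2(mu||.||_0 + lam||.||_1)(x), assuming the
    MCP form: per coordinate mu - (sqrt(mu) - |x_i|)_+^2, plus lam*||x||_1."""
    x = np.asarray(x, dtype=float)
    mcp = mu - np.maximum(np.sqrt(mu) - np.abs(x), 0.0) ** 2
    return mcp.sum() + lam * np.abs(x).sum()

x = np.array([0.0, 0.1, 5.0])
# lam = 0 recovers the MCP relaxation of mu*||x||_0: small entries are
# penalized by 2*sqrt(mu)*|x_i| - x_i^2, large ones flatly by mu.
val = r_mu_lambda(x, 1.0, 0.0)
```

For entries with |x_i| ≥ √μ the penalty is locally constant, which is precisely the "zero gradient" behavior discussed around Fig. 1.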

Lemma 2.2 If A has columns of Euclidean norm (strictly) less than one, the local minimizers of (13) form a subset of those of (7); moreover, the global minimizers coincide.
Proof Let x be a local minimizer of (13). If 0 < |x_i| < √μ held for some index i, then it would follow from (12) that ∂²_i r_{μ,λ} = −2. If a_i denotes the ith column of A, we get on the other hand that ∂²_i ‖Ax − b‖² = 2‖a_i‖² < 2, and hence, this point cannot be a local minimizer of (13), a contradiction. Hence, we either have x_i = 0 or |x_i| ≥ √μ for all indices i. By (12), we get that r_{μ,λ}(x) = μ‖x‖₀ + λ‖x‖₁, and hence, x must also be a local minimizer of (7), since the objective in (13) is less than or equal to that of (7) but the two are equal at the point x; this follows from a general feature of the quadratic envelope Q₂, cf. Theorem 3.1 in [12] for additional details.
We remark that the assumption on A can always be achieved by a rescaling of the problem.¹ The property of not moving minimizers is inherent to quadratic envelope regularizations, see [12]. Note that r_{μ,λ}(x) + ‖Ax − b‖² reduces to (2) if μ = 0 and to (3) if λ = 0. Figure 1 shows an illustration of r_{μ,λ} for a couple of different values of λ. When λ = 0 the function is constant for values larger than √μ. Therefore, large elements give zero gradients, which can result in algorithms getting stuck in poor local minimizers. Increasing λ makes the regularizer closer to being convex, which, as we show numerically in Sect. 7, increases its convergence basin. We conclude this section with a simple 2D illustration of the general principle. Consider r_{μ,λ}(x) + ‖Ax − b‖² for a two-dimensional problem with a diagonal matrix A. Since the matrix is diagonal, the function is the sum of two functions of one variable, which are depicted in Fig. 2. The blue curves show the case μ = 1 and λ = 0. It is easy to verify that the problem has two local minimizers, x = (2, 3) and x = (0, 3) (which is also global). These points (and in addition (0, 0) and (2, 0)) are also local minima of (1) with μ = 1.

¹ Alternatively, we can work with the original A and the more general transform Q_γ, where γ > 0 is a parameter chosen with respect to the size of A (see [12]). We have chosen the above assumption on A and set γ = 2 for simplicity of the presentation.
The yellow curve shows the effect of using the convex ℓ¹ formulation (2), with λ = 0.7. Here we have used the smallest possible λ so that the optimum of the left residual is 0 while the right one is nonzero. The resulting solution x = (0, 2) has the correct support; however, the magnitude of the nonzero element is reduced from 3 to 2 due to the shrinking bias. With our approach, it is possible to choose an objective which has less bias but still a single local minimizer. Setting μ = 0.7 and λ = 0.4 gives the red dashed curves with optimal point x ≈ (0, 2.5).

Oracle Solutions
For sparsity problems, the so-called oracle solution [8] is what we would obtain if we somehow knew the "true" support of the sought solution and solved the least squares problem over the corresponding entries of x. Candès et al. [8] showed that under certain RIP conditions the solution of (2) (i.e., LASSO) approximates the oracle solution. In [13], it was then shown that (3) often gives exactly the oracle solution. In this section, we will show that our relaxation solves a similar ℓ¹-regularized least squares problem. In particular, for μ = 0 this gives a concrete characterization of the LASSO minimizer.
Let x₀ be the so-called ground truth, i.e., a sparse vector that we wish to recover from the measurement b = Ax₀ + ε, where ε denotes noise. Furthermore, let S be the set of nonzero indices of x₀, let K be the cardinality of S, and suppose that δ⁻_K ∈ [0, 1), which simply means that any K columns of A are linearly independent. We will use the notation A_S to denote the matrix which has the same entries as A in the columns indexed by S and zeros otherwise. We refer to the λ-regularized oracle solution x_λ as the minimizer of

min_x ‖A_S x − b‖² + λ‖x‖₁.   (15)

Note that x_λ indeed gets support in S (since we use A_S instead of A above), and hence, that the minimization problem (15) has a unique solution (due to the assumption that δ⁻_K ∈ [0, 1), which implies Condition 1 in [33]). It is easy to see that in the limit λ → 0⁺, this becomes the classical oracle solution, i.e., the least squares solution over the correct support. For a nonzero λ, the ℓ¹-norm modifies the solution by adding a shrinking bias.
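The λ-regularized oracle solution can be approximated with a few lines of ISTA applied to the support-restricted problem (15). The helper `oracle_lasso` below is our sketch, not the solver used in the paper; the "true" support S is handed to the routine, exactly as in the oracle setting.

```python
import numpy as np

def oracle_lasso(A, b, S, lam, iters=5000):
    """lambda-regularized oracle: minimize ||A_S x - b||^2 + lam*||x||_1
    via plain ISTA, where A_S keeps only the columns indexed by the
    known 'true' support S (a sketch; convergence is not tuned)."""
    AS = np.zeros_like(A)
    AS[:, S] = A[:, S]
    t = 1.0 / (2 * np.linalg.norm(AS, 2) ** 2)      # step <= 1/Lipschitz
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = x - 2 * t * AS.T @ (AS @ x - b)          # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60)); A /= np.linalg.norm(A, axis=0)
x0 = np.zeros(60); x0[[3, 7]] = [4.0, -3.0]
b = A @ x0 + 0.01 * rng.standard_normal(30)
xlam = oracle_lasso(A, b, [3, 7], lam=0.1)
```

Since the columns of A_S outside S are zero, the iterates never leave the support, and for λ → 0⁺ the routine reproduces the classical least squares oracle solution.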
We now show that the solutions of (15) are also stationary points of (13). For non-convex functions, we say that a point is stationary if its Fréchet subdifferential ∂̂ includes 0; we refer to Section 3 of [13] for more details.

Theorem 3.1 Suppose that |x_{λ,i}| ≥ √μ for all i ∈ S, that A has columns of Euclidean norm less than one, and that the residual errors satisfy |⟨a_j, Ax_λ − b⟩| ≤ √μ + λ/2 for all j ∉ S. Then x_λ is a stationary point of (13).
Proof It is easy to see that r_{μ,λ}(x) + ‖x‖² is convex and that the objective (13) is the difference of this convex function and the smooth function ‖x‖² − ‖Ax − b‖²; for such functions, it is easy to see that a point x is stationary if and only if the gradient of the smooth part is a member of the subdifferential of the convex part. Since the latter can be written as a Cartesian product ∏_i A_i with A_i ⊆ R, this condition can be verified coordinate-wise. For x = x_λ and j such that x_j = 0, the gradient coordinate is 2⟨a_j, b − Ax_λ⟩, where a_j denotes the jth column of A, whereas the corresponding interval for the subgradient of the convex part is [−(2√μ + λ), 2√μ + λ]; by the assumption on the residual errors, the former point is a member of this interval. It remains to check the nonzero x_j's, i.e., j ∈ S; this follows by the definition of x_λ and the assumption |x_{λ,i}| ≥ √μ for all i ∈ S.

Whether x_λ is the global optimum or not depends on the problem instance. However, for the particular case of μ = 0, the problem is convex and hence a stationary point is a global minimizer. In other words, the theorem says that the λ-regularized oracle solution is often the solution of the classical ℓ¹-problem (2) (LASSO). For μ > 0, we will in the following sections show that under a sufficiently strong LRIP it is the sparsest possible stationary point.
In Fig. 3, we illustrate with an experiment; the setup is very similar to the one described in Sect. 7.3: a random matrix A, together with a ground truth x₀ and a set of noisy measurements b, are fixed, and the parameter μ is also kept fixed. We study the impact of an increasingly large value of λ on the reconstruction performance and draw quantitative conclusions. Blue graphs relate to solving LASSO (2) for various values of λ (solution denoted x_{ℓ¹}), and the red graphs show the same but for (13) (denoted x_{r_{μ,λ}}). The noise level is fixed at 30%. The yellow line shows the distance from x_λ to the ground truth x₀. Clearly, this deviates from x₀ at a linear rate, as expected, demonstrating the need to keep λ small. The left graph shows log₁₀(1 + #{misfit}), where #{misfit} counts the number of incorrect positions in the support of the estimated sparse solution (so the value 0 indicates perfect support recovery). As is plain to see, ℓ¹ finds the correct support only for very high values of λ, and in this regime, we also have x_{ℓ¹} = x_λ as predicted by Theorem 3.1 (when μ = 0), but here x_λ is very far from the ground truth. This regime is also small and therefore hard to find in practice. On the contrary, x_{r_{μ,λ}} fails only for λ = 0 (the algorithm is initialized at the least squares solution, which is a local minimum) and very high values of λ. Another interesting observation is that x_{r_{μ,λ}} stops having the right support as soon as the condition |x_{λ,i}| ≥ √μ from Theorem 3.1 is violated for some i ∈ S.

Separation of Stationary Points
A feature of the left red graph in Fig. 3 which is not explained by Theorem 3.1 is the fact that when x_{r_{μ,λ}} fails to find x_λ for λ = 0, it has a very large support. In this section, we aim to provide theoretical support also for this fact. More precisely, we study the stationary points of the objective function (13) under the assumption that A fulfills the LRIP condition (4) with small values of δ⁻_K. We will extend the results of [13,24] to our class of functionals. Specifically, we show that under some technical conditions two stationary points x′ and x″ have to be separated by ‖x′ − x″‖₀ > 2K. From a practical point of view, this means that if we find a stationary point with ‖x‖₀ ≤ K, we can be certain that this is the sparsest one possible.

Stationary Points and Local Approximation
We will first characterize a stationary point as a thresholded version of a noisy vector z which depends on the data. As in [13], we introduce the auxiliary function G_{μ,λ}(x) = ½(r_{μ,λ}(x) + ‖x‖²); i.e., 2G_{μ,λ}(x) equals the l.s.c. convex envelope of μ‖·‖₀ + λ‖·‖₁ + ‖·‖², by Theorem 2.1 and the design of Q₂. Notice that the function G_{μ,λ} is convex and proper, and thus, the object ∂G_{μ,λ} is the classical subdifferential from convex analysis.
Given a point x (stationary or not), we introduce the auxiliary point

z(x) = x + A^T(b − Ax).   (16)

In the following proofs, we will use the shorter notations z = z(x), z′ = z(x′) and z″ = z(x″).

Proposition 4.1 The point x′ is stationary for (13) if and only if z′ ∈ ∂G_{μ,λ}(x′), which in turn holds if and only if x′ solves

min_x r_{μ,λ}(x) + ‖x − z′‖².   (17)

Proof By differentiating, we see that x′ is stationary for (13) if and only if 2A^T(b − Ax′) ∈ ∂r_{μ,λ}(x′). Since ∂G_{μ,λ}(x′) = ½∂r_{μ,λ}(x′) + x′, this is equivalent to z′ ∈ ∂G_{μ,λ}(x′). Similarly, differentiating (17), we see that x′ is stationary for (17) if and only if z′ ∈ ∂G_{μ,λ}(x′). Now recall that by construction the functional in (17) is convex, and therefore x′ being stationary is equivalent to x′ solving (17).
We will use properties of the vector z to establish conditions that ensure that x is the sparsest possible stationary point of (13). The overall idea, which follows [11,24], is to show that the subdifferential ∂G_{μ,λ} grows faster than z as a function of x, and therefore, we can only have z ∈ ∂G_{μ,λ}(x) at one (sparse) point. The result requires that the elements of the vector z are not too close to the threshold √μ + λ/2.

Theorem 4.2 Let δ⁻_{2K} be the LRIP constant (4) of the matrix A for cardinality 2K, and set β² = 1 − δ⁻_{2K}. Assume that x′ is a stationary point of (13) and that each element of z′ (defined as in (16)) fulfills |z′_i| ≤ β²√μ + λ/2 or |z′_i| ≥ √μ/β² + λ/2. Then every other stationary point x″ of (13) fulfills ‖x′ − x″‖₀ > 2K.

The proof of Theorem 4.2 requires an estimate of the growth of the subgradients of G_{μ,λ}, which we now present. The function G_{μ,λ} is separable and can be evaluated separately for each element of x. To simplify the notation, we write g_{μ,λ} for G_{μ,λ} restricted to one real variable x. The subdifferential of g_{μ,λ} then becomes

∂g_{μ,λ}(x) = {sign(x)(|x| + λ/2)},    |x| ≥ √μ,
∂g_{μ,λ}(x) = {sign(x)(√μ + λ/2)},    0 < |x| < √μ,   (18)
∂g_{μ,λ}(0) = [−(√μ + λ/2), √μ + λ/2].

Figure 4 shows the function g_{μ,λ} and ∂g_{μ,λ}. The parameter λ adds a constant offset λ/2 to the positive values of ∂g_{μ,λ}(x) and subtracts the same value from all negative values.
It is clear from Fig. 4 that on (−∞, −√μ] and [√μ, ∞) the subdifferential contains a single element. In addition, for any two points x′, x″ in one of these intervals, we have |∂g_{μ,λ}(x′) − ∂g_{μ,λ}(x″)| = |x′ − x″|. For the other parts, the subdifferential grows less. To ensure a certain growth, we need to add some assumptions on the subdifferential, which is done in the following result.
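As a sanity check, the subdifferential ∂g_{μ,λ} can be tabulated numerically. The helper below is ours, and the piecewise formula is our reading of (18): linear growth |x| + λ/2 beyond √μ, the flat value √μ + λ/2 on (0, √μ), and a full interval at 0.

```python
import numpy as np

def subdiff_g(x, mu, lam):
    """Return the subdifferential of g_{mu,lam} at a scalar x as an
    interval (lo, hi), following the piecewise description of (18)."""
    r, t = np.sqrt(mu), np.sqrt(mu) + lam / 2
    if x == 0:
        return (-t, t)                       # interval at the origin
    if abs(x) >= r:
        v = np.sign(x) * (abs(x) + lam / 2)  # single element, slope 1
    else:
        v = np.sign(x) * t                   # single element, flat part
    return (v, v)

# On [sqrt(mu), inf) the subdifferential grows linearly with x:
lo1, _ = subdiff_g(3.0, 1.0, 0.4)
lo2, _ = subdiff_g(5.0, 1.0, 0.4)
```

The flat part on (0, √μ) is exactly what prevents growth there and motivates the extra assumption in the next lemma.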

Lemma 4.3 Assume that x′ is such that z′ ∈ ∂G_{μ,λ}(x′) and β > 0, where again z′ is defined by (16). If the elements z′_i fulfill

|z′_i| ≤ β²√μ + λ/2   or   |z′_i| ≥ √μ/β² + λ/2   (19)

for every i, then for any x″ with z″ ∈ ∂G_{μ,λ}(x″) we have

⟨z′ − z″, x′ − x″⟩ > (1 − β²)‖x′ − x″‖²   (20)

as long as x″ ≠ x′.
Proof We first consider the scalar case z′ ∈ ∂g_{μ,λ}(x′); by the symmetry of (18), we may assume that z′ ≥ 0. First assume that z′ ≥ √μ/β² + λ/2. In view of (18), we then have x′ = z′ − λ/2 ≥ √μ, and we may consider the line l(x) = (1 − β²)(x − x′) + z′. Since l(x′) = z′ and l(0) = β²z′ + (1 − β²)λ/2 ≥ √μ + λ/2, Fig. 4 shows that l(x″) > z″ for all x″ < x′, and therefore z′ − z″ > (1 − β²)(x′ − x″) for all x″ < x′. Additionally, for x″ > x′ we clearly have z″ − z′ = x″ − x′ ≥ (1 − β²)(x″ − x′), so in both scenarios (x″ < x′ and x″ > x′), we obtain

(z′ − z″)(x′ − x″) ≥ (1 − β²)(x′ − x″)².   (21)

Now assume that 0 ≤ z′ ≤ β²√μ + λ/2; this implies x′ = 0, which follows from the structure of the subgradient of g_{μ,λ} in (18). If we define another linear function p(x) = (1 − β²)x + z′, we have p(0) = z′, and it is clear that, if x″ > 0, then p(x″) < z″ (there are no hypotheses on z″ here). Therefore z″ − z′ > (1 − β²)(x″ − x′) for all x″ > 0. Similarly, it is easy to see that p(x″) > z″ if x″ < 0, and therefore z′ − z″ > (1 − β²)(x′ − x″), which again yields (21). To obtain (20), we now sum (21) over the nonzero entries of x′ − x″.
Proof of Theorem 4.2 By Proposition 4.1, we have z′ ∈ ∂G_{μ,λ}(x′), so Lemma 4.3 applies to x′, z′ with β² = 1 − δ⁻_{2K}. Let z″ be related to x″ via (16). Then

⟨z′ − z″, x′ − x″⟩ = ‖x′ − x″‖² − ‖A(x′ − x″)‖².

Since A satisfies the LRIP condition, this is less than or equal to δ⁻_{2K}‖x′ − x″‖² whenever ‖x′ − x″‖₀ ≤ 2K, which is impossible by Lemma 4.3.
Let us summarize our conclusions so far: We have introduced a relaxed functional (13) which is intermediate between the classical LASSO and MCP penalties. We have shown that the local minimizers of (13) are a subset of those of (7), and we have concretely characterized one such minimizer x_λ. This is the sought solution and, although it may not be unique, it is unique as a sparse solution. In other words, if Theorem 4.2 applies with K sufficiently big and the algorithm gets stuck in an undesired local minimum, this will be visible through its high cardinality. It is clear that the bias of x_λ scales linearly with λ, but a small λ in LASSO gives too large a support, and this is where the μ-parameter comes in handy. Ideally, one should pick λ = 0, for then the oracle solution is among the local minimizers (in fact, it is often the unique global minimizer, see [13]), but in practice a trade-off may be more reliable due to the risk of getting stuck in undesired local minima of MCP.
Let us also underline that although we have studied one concrete and relatively simple separable penalty r μ,λ , the idea to extend the convergence basin of non-convex penalties applies to a whole array of sparsity inducing penalties such as those studied in [20].

Optimization
The optimization of functions of the type (13) is straightforward and can be done either via ADMM or FBS, once the proximal operator is known, which we now compute. Both have been proven to converge in the present setting; in the former case see [29], and in the latter one needs to combine the results of [12] with [1]. We have also run both algorithms in parallel and found that they almost always converge to the same point, despite the non-convex landscape. Generalizing these algorithms to the matrix case is also straightforward: one basically applies the vector proximal operator to the singular values, see [14].

The Proximal Operator
The proximal operator of r_{μ,λ}/ρ, where ρ is a step length parameter, is defined by

prox_{r_{μ,λ}/ρ}(y) = arg min_x r_{μ,λ}(x) + (ρ/2)‖x − y‖²,   (22)

where r_{μ,λ}(x) = Q₂(μ‖·‖₀ + λ‖·‖₁)(x). The following result shows that in general the proximal operator of Q₂(f + λ‖·‖₁) is easy to compute if the proximal operator of Q₂(f) is known. Note that Q₂(f) is a non-convex functional with maximum negative curvature −2 (see [12]), and hence we must require that ρ > 2 in order for the proximal operator to be single valued (Fig. 5). We recall that a function f : R^d → R is sign-invariant if f(Sx) = f(x) for all x ∈ R^d and all diagonal d × d matrices S with only 1 and −1 on the main diagonal.

Proposition 5.1 Let f : R^d → R be a lower semicontinuous sign-invariant function. Then

prox_{Q₂(f + λ‖·‖₁)/ρ}(y) = prox_{Q₂(f)/ρ}(prox_{λ‖·‖₁/ρ}(y))

for every y ∈ R^d and ρ ≥ 2.

Proof It is enough to compute the proximal operator of the function Q₂(f)(·) + λ‖·‖₁, as per Theorem 2.1. Without loss of generality we assume that y ∈ R^d₊. With the same notation as in the proof of Theorem 2.1, the quantity λ‖x‖₁/ρ − ⟨x, y⟩ will be minimized by an x with the same signs as y, i.e., x ∈ R^d₊. Moreover, λ‖x‖₁/ρ − ⟨x, y⟩ = −⟨x, y − λ1/ρ⟩, and again we want the latter to be as small as possible. Since the terms in y are constant (the minimization is over x), we see that x also solves

arg min_{x ∈ R^d₊} Q₂(f)(x) + (ρ/2)‖x − (y − (λ/ρ)1)₊‖².

Note that (y − (λ/ρ)1)₊ = prox_{λ‖·‖₁/ρ}(y) since y ∈ R^d₊. Also, since the elements of (y − (λ/ρ)1)₊ are nonnegative, it is clear that minimizing over x ∈ R^d instead of R^d₊ does not change the optimizer, and therefore the statement follows.

For our particular case, f(x) = μ‖x‖₀, the proximal operator is separable and each element of the vector x can be treated independently. As usual, the soft thresholding operator is given by sign(y_i) max(|y_i| − λ/ρ, 0). The computation of x = prox_{Q₂(μ‖·‖₀)/ρ}(y) is fairly straightforward and can be found, e.g., in [11,20]. For ρ > 2 we get

x_i = 0,                                |y_i| ≤ 2√μ/ρ,
x_i = sign(y_i)(ρ|y_i| − 2√μ)/(ρ − 2),  2√μ/ρ < |y_i| < √μ,   (23)
x_i = y_i,                              |y_i| ≥ √μ.
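In code, Proposition 5.1 turns into "soft-threshold, then MCP-prox," and FBS becomes a one-line iteration. The sketch below assumes the standard form of prox_{Q₂(μ‖·‖₀)/ρ} for ρ > 2 (zero below 2√μ/ρ, a stretched middle branch, identity above √μ) and a forward step of length 1/ρ with ρ > max(2, 2‖A‖²); the helper names are ours and nothing here is tuned.

```python
import numpy as np

def prox_r(y, mu, lam, rho):
    """prox of r_{mu,lam}/rho for rho > 2: soft-threshold by lam/rho,
    then apply the assumed prox of Q_2(mu||.||_0) elementwise."""
    w = np.sign(y) * np.maximum(np.abs(y) - lam / rho, 0.0)
    a = np.abs(w)
    mid = np.sign(w) * np.maximum(rho * a - 2 * np.sqrt(mu), 0.0) / (rho - 2)
    return np.where(a >= np.sqrt(mu), w, mid)

def fbs(A, b, mu, lam, rho=2.5, iters=2000):
    """Forward-backward splitting on r_{mu,lam}(x) + ||Ax - b||^2.
    rho > max(2, 2*||A||^2) keeps the prox single valued and the
    forward step 1/rho stable (a sketch, not a production solver)."""
    rho = max(rho, 2.0 * np.linalg.norm(A, 2) ** 2 + 0.5)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = prox_r(x - (2.0 / rho) * A.T @ (A @ x - b), mu, lam, rho)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 80)); A /= np.linalg.norm(A, axis=0)
x_true = np.zeros(80); x_true[[2, 9]] = [5.0, -4.0]
b = A @ x_true
x_hat = fbs(A, b, mu=1.0, lam=0.1)
```

For λ = 0 the routine reduces to FBS for the MCP relaxation (3), and for μ = 0 to ISTA for LASSO (2).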

Matrix Framework
In this section, we briefly show how the theory can be lifted to the matrix framework. We let σ(X) denote the vector of singular values of a given matrix X. Note that ‖σ(X)‖₀ = rank(X) and that ‖·‖₁ applied to the singular values gives the nuclear norm ‖X‖_*, which is a rank-reducing penalty, see the discussion around (5) and (6). Analogously, we can consider r_{μ,λ}(σ(X)), which is a rank-reducing penalty with less of a bias than ‖X‖_*. For the case λ = 0, it has been shown in [14] how to lift basically any statement about vectors to a corresponding statement for matrices, and along these lines, we could develop a theory for matrices parallel to the results in Sects. 2-5. We refrain from this and focus here on providing the necessary details to apply this framework in practice. We recall that a function f : R^d → R is said to be absolutely symmetric if f(|x|) = f(x) and f(Πx) = f(x) for all permutation matrices Π and all x ∈ R^d.

Proposition 6.1 Suppose that f is an absolutely symmetric functional on R^d and set F(X) = f(σ(X)). Then Q₂(F)(X) = Q₂(f)(σ(X)).

Proof See Proposition 4.1 of [14].

Proposition 6.2 Let f be an absolutely symmetric function on R^d and set, as in the previous proposition, F(Y) = f(σ(Y)). If Y = U diag(σ(Y)) V^T is a singular value decomposition of Y, then

prox_{F/ρ}(Y) = U diag(prox_{f/ρ}(σ(Y))) V^T.

Proof See Proposition 2.1 of [15].
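Proposition 6.2 in code: compute an SVD and apply the vector prox to the singular values. The helper `prox_matrix` is ours; it is illustrated with soft thresholding of the singular values, which (for an absolutely symmetric f = λ‖·‖₁) yields the well-known proximal operator of the nuclear norm.

```python
import numpy as np

def prox_matrix(Y, prox_vec, rho):
    """Lift a vector prox to matrices as in Proposition 6.2: apply the
    (absolutely symmetric) vector prox to the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(prox_vec(s, rho)) @ Vt

# Soft-thresholding the singular values gives the prox of lam*||.||_*,
# the matrix analogue of the prox of lam*||.||_1.
lam = 0.5
soft = lambda s, rho: np.maximum(s - lam / rho, 0.0)
X = prox_matrix(np.diag([3.0, 0.2]), soft, rho=1.0)
```

Plugging in `prox_r` from the previous section instead of `soft` gives the rank-reducing penalty r_{μ,λ}(σ(X)) used in the experiments.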

Experiments
In this section, we test the proposed formulation on a number of real and synthetic experiments. Our focus is to evaluate the proposed method's robustness to local minima and the effects of its shrinking bias.

Convergence Basins
One of the drawbacks of using non-convex penalties is that overall performance might be poor when the problem to be solved is particularly ill-posed. The idea that we presented in this note and that we want to highlight in the present section is that some issues related to non-convexity can be mitigated by adding a small convex "perturbation." In this subsection, we empirically demonstrate how the convergence basin can be greatly enlarged when r_{μ,λ} is employed instead of r_{μ,0}, with λ small; i.e., we show that the reconstruction algorithm seems less prone to getting stuck in undesired stationary points.
For this purpose, we constructed a "ground truth" x₀ that is not a sparse signal, but most of its magnitude is concentrated in the largest coefficients (more precisely, roughly 90% of the signal is distributed over 5% of the entries). The sensing matrix A was here a 500 × 4096 random (Gaussian) matrix with normalized columns. The measurements b were perturbed by additive Gaussian white noise ε such that ‖ε‖ = 0.05‖Ax₀‖. We generated 500 different random (uniformly distributed) points belonging to the ball B_{1.5}(x₀) (with center x₀ and radius 1.5, where ‖x₀‖ = 1) and used each of them as starting point for the FISTA algorithm, first to minimize the functional Q₂(μ‖·‖₀)(x) + ‖Ax − b‖² and then the relaxation (13), with λ = 0.01 and μ = 0.1. The algorithm terminates when ‖x_k − x_{k+1}‖ < 10⁻¹⁴, and the convergence point is simply called x_k in the figures below, or x_k(x_SP) if we want to specify the particular starting point x_SP. The outcome is illustrated in Figs. 6 and 7. We say that a starting point x_SP "is successful" if the relative error ‖x_k(x_SP) − x₀‖/‖x₀‖ is small enough that x_k(x_SP) likely is the best one could expect; the histogram in Fig. 7, which regroups the quantities ‖x_k(x_SP) − x₀‖/‖x₀‖ for the different starting points, shows a clear cut between the blue group and the red group, and this cut determined our definition of "success" versus "fail." The successful starting points are depicted in blue, the others in red. Figure 6 illustrates the cloud of the starting points; angles are random for graphical representation purposes while distances to x₀ are exact. Notice that 0, often used as starting point, would lead to a fail when λ = 0 in the above example.

Well-Posedness vs Ill-Posedness
As already mentioned in the previous sections, the relaxation (13) shows its effectiveness on highly ill-posed problems. In this subsection, we investigate this aspect experimentally. We consider x₀ as in Sect. 7.1 and real random matrices with 1000, 1500 and 2000 rows, respectively (and 4096 normalized columns); fewer rows lead to a more ill-posed problem.
The following pictures show the reconstruction precision in these three different scenarios as well as the cardinality of the retrieved approximate solutions along the segment √μ + λ/2 = √0.1 for μ ∈ [0, 0.1]. The rationale behind this parameter choice stems from the observation that the cardinality of the solution to

arg min_x Q₂(μ‖·‖₀)(x) + λ‖x‖₁ + ‖Ix − y‖²

is essentially determined by the number √μ + λ/2, as seen by setting ρ = 2 in (23). When the identity I is replaced with a matrix A, this may no longer be true, but we still expect the cardinality to be roughly determined by the quantity √μ + λ/2 (when A has normalized columns). The blue axis shows the cardinality of the reconstruction and the red axis shows the reconstruction misfit to ground truth for values of λ in the range 0 to 2√0.1 ≈ 0.63 (where λ = 0.63 represents traditional LASSO (2)) (Fig. 8).
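The claim that the survival threshold sits at √μ + λ/2 in the A = I case can be checked by brute force on a scalar grid. The helper `scalar_argmin` is hypothetical and assumes the MCP form μ − (√μ − |x|)₊² for Q₂(μ|·|₀); with μ = 0.09 and λ = 0.2 the threshold is √μ + λ/2 = 0.4.

```python
import numpy as np

def scalar_argmin(y, mu, lam):
    """Brute-force argmin_x Q_2(mu|.|_0)(x) + lam|x| + (x - y)^2 for a
    scalar measurement y (the A = I case), on a fine grid."""
    grid = np.linspace(-10, 10, 200001)
    q = mu - np.maximum(np.sqrt(mu) - np.abs(grid), 0.0) ** 2  # Q_2(mu|.|_0)
    vals = q + lam * np.abs(grid) + (grid - y) ** 2
    return grid[np.argmin(vals)]

mu, lam = 0.09, 0.2   # survival threshold sqrt(mu) + lam/2 = 0.4
below = scalar_argmin(0.35, mu, lam)   # |y| below threshold
above = scalar_argmin(0.60, mu, lam)   # |y| above threshold
```

Measurements with |y| below the threshold are set to zero, while those above survive (here with the soft-threshold bias λ/2, giving x ≈ 0.5 for y = 0.6), which is the behavior the parameter sweep above exploits.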
When the problem is ill-posed (1000 rows) we see the proposed crossover method at work: for λ in the range 0.2 to 0.4, the reconstruction precision is good while the cardinality stays roughly constant. For bigger values of λ, the reconstruction precision is still good, but at the cost of a higher cardinality. For 1500 rows, both reconstruction quality and cardinality are optimal at λ = 0.25, while LASSO gives a significantly worse output with respect to both parameters. With 2000 rows the problem is well-posed enough that optimal performance is found for very small λ, i.e., one may just as well skip the ℓ¹-penalty and only work with (3), as reported previously in [13]. Summing up, the r_{μ,λ}-penalty does a better job than ℓ¹ in the entire range.

Random Matrices
In this section, we compare the robustness to local minima of the relaxations (2), (3) and (13). Note that (2) and (3) are special cases of (13), obtained by letting λ or μ equal 0 (by Theorem 2.1).
We generated A-matrices of size 100 × 200 by drawing the columns from a uniform distribution over the unit sphere in R¹⁰⁰, and the vector x₀ was selected to have 10 random nonzero elements with random magnitudes between 2 and ≈ 0.5; see [13] for the rationale behind these choices. For (13) we again chose μ = 1 but used λ = λ_{ℓ¹}/6. Figure 9 plots ‖x − x_S‖ for the estimated x with the three methods, as a function of the noise level ε; x_S is here the oracle solution to the linear system of equations Ax = b [13]. Both (3) and (13) do better than traditional ℓ¹ in the entire range; (3) finds x_S with 100% accuracy until around ε ≈ 3, where (13) starts to perform better. This is likely due to the fact that the small ℓ¹ term helps the (non-convex) minimization of (13) to not get stuck in local minima. To test this conjecture, we ran the same experiment for 50 iterations at the fixed noise level ε = 3.5 and chose as initial point the least squares solution, which is known to be close to many local minima (we usually use 0 as initial point). The histograms to the right in Fig. 9 show the cardinality of the found solutions. Adding the λ‖x‖₁ term enabled the algorithm to avoid almost all of these high cardinality solutions, in perfect harmony with Theorem 4.2 and Fig. 3.

Point-Set Registration with Outliers
Next we consider registration of 2D point clouds. We assume that we have a set of model points p_i matched to data points q_i, and we seek a similarity transformation minimizing the residuals sRp_i + t − q_i. Here sR is a scaled rotation of the form [a −b; b a] and t ∈ R² is a translation vector. Since the residuals are linear in the parameters a, b, t, we can by column-stacking write the problem as minimizing ‖My − v‖², where the vector y contains the unknowns a, b, t. We further assume that the point matches contain outliers that need to be removed. Therefore we add a sparse vector x whose nonzero entries allow the solution to have large errors; we thus want to solve min_{x,y} ‖My − v − x‖² plus a sparsity penalty on x. The minimization over y can be carried out in closed form by noting that y = (MᵀM)⁻¹Mᵀ(v + x), which reduces the data term to ‖A(v + x)‖² with A = I − M(MᵀM)⁻¹Mᵀ. The matrix A is a projection onto the complement of the column space of M, and therefore has a four-dimensional null space. Figure 10 shows the results of a synthetic experiment with 500 problem instances. The data were generated by first selecting 100 random Gaussian 2D points. We then divided these into two groups of 60 and 40 points, respectively, and transformed them using two different random similarity transformations. In this way the data supports two strong hypotheses, which yields a problem that is much more difficult than one obtained by adding uniformly distributed random outliers. The transformations were generated by taking a and b to be Gaussian with mean 0 and variance 1, and t to be 2D Gaussian with mean (0, 0) and covariance 5I. We compare the three relaxations: (2) with λ = 2, (3) with μ = 1, and (13) with μ = 1 and λ = 0.5. (The reason for using λ = 2 in (2) and μ = 1 in (3) is that this gives the same threshold in the corresponding proximal operators.) All methods were initialized with the least squares solution. In the left histogram of Fig. 10 we plot the data fit with respect to the inlier residuals (corresponding to the first 60 points, which support the larger hypothesis). In other words, we reorder the data points so that the sequence (‖sRp_i + t − q_i‖)_{i=1}^{100} is increasing, and then compute sum_{i=1}^{60} ‖sRp_i + t − q_i‖². The histograms were produced with 500 trials, and a low value on the x-axis thus indicates a good fit. In the right histogram, the x-axis displays the number of residuals determined to be outliers (via a thresholding rule); thus a value near 40 indicates success. When starting from the least squares initialization, the formulation (3) frequently gets stuck in solutions with poor data fit that are dense and close to the least squares solution. However, when it converges to the correct solution it gives a much better data fit than the ℓ¹-norm formulation (2), due to its lack of bias. The added ℓ¹ term helps the sequence generated by the minimization of (13) converge to the correct solution with a good data fit.
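The parenthetical remark about matching thresholds can be made concrete. For a scalar z, the proximal step of the ℓ¹ penalty in (2) is soft thresholding, while a short computation shows that the proximal step of the quadratic-envelope penalty in (3) is hard thresholding; the function names below are ours. With λ = 2 and μ = 1 both operators threshold at 1, but only the soft variant shrinks the surviving values, which is exactly the bias discussed in the text:

```python
import numpy as np

def prox_l1(z, lam):
    """argmin_x (x - z)^2 + lam*|x|: soft thresholding at lam/2."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def prox_q2card(z, mu):
    """argmin_x (x - z)^2 + Q2(mu*|.|_0)(x): hard thresholding at
    sqrt(mu) (ties at |z| = sqrt(mu) broken toward z)."""
    return np.where(np.abs(z) >= np.sqrt(mu), z, 0.0)
```

For example, with z = 5 the ℓ¹ step returns 4 (shrunk by λ/2 = 1), whereas the envelope step returns 5 unchanged; both map z = 0.5 to 0.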
Note that the number of outliers is in many cases smaller than 40 due to the randomness of the data.
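The column-stacked linear system used in this registration problem can be sketched directly; the function name is ours, and the row layout (one x-row and one y-row per point) is one natural choice:

```python
import numpy as np

def build_M_v(p, q):
    """Stack the residuals s*R*p_i + t - q_i, which are linear in
    y = (a, b, t_x, t_y), into the form M y - v."""
    n = p.shape[0]
    M = np.zeros((2 * n, 4))
    # x-equation of point i: a*p_x - b*p_y + t_x = q_x
    M[0::2] = np.column_stack([p[:, 0], -p[:, 1], np.ones(n), np.zeros(n)])
    # y-equation of point i: b*p_x + a*p_y + t_y = q_y
    M[1::2] = np.column_stack([p[:, 1],  p[:, 0], np.zeros(n), np.ones(n)])
    return M, q.reshape(-1)
```

On outlier-free data, solving min_y ‖My − v‖² by least squares recovers the similarity transformation exactly, which is what the closed-form elimination of y in the text exploits.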
We also include a few problem instances with real data. Here we matched SIFT descriptors between two images, as shown in Fig. 11, to generate the two point sets. We then registered the point sets using the formulations (3) with μ = 20² and (2) with λ = 10 (which in both cases corresponds to a 20-pixel outlier threshold in a 3072 × 2048 image). For (13) we used μ = 20² and λ = 5.
The results are shown in Fig. 12. In the first problem instance (first row) we used an image pair which generates one strong hypothesis. Here both (13) and (2) produce good results. In contrast, (3) immediately gets stuck in the least squares solution, for which all residuals are above the threshold. In the second instance there are two strong hypotheses. The incorrect one introduces a systematic bias that affects (2) more than (13). As a result, the registration obtained by (13) is better than that of (2), and the number of determined inliers is larger.

Non-rigid Structure from Motion
In our final experiment, we consider non-rigid structure from motion with a rank prior. We follow the approach of Dai et al. [16] and let X# be the F × 3m matrix whose i-th row is [X_i Y_i Z_i], where X_i, Y_i, Z_i are 1 × m matrices containing the x-, y- and z-coordinates of the tracked image points in frame i. With an orthographic camera, the projection of the 3D points can be written M = RX, where X is the 3F × m matrix obtained by stacking the 3 × m blocks [X_i; Y_i; Z_i] and R is a 2F × 3F block-diagonal matrix with 2 × 3 blocks R_i, each consisting of two orthogonal rows that encode the camera orientation in image i. The resulting 2F × m measurement matrix M consists of the x- and y-image coordinates of the tracked points. Under the assumption of a linear shape basis model [5] with r deformation modes, the matrix X# can be written X# = CB, where B is r × 3m, and therefore rank(X#) = r. We search for the matrix X# of rank r that minimizes the residual error ‖RX − M‖²_F. In Fig. 14 we compare the three relaxations on the four MOCAP sequences displayed in Fig. 13, obtained from [16]. These consist of real motion capture data, and therefore the ground truth solution is only approximately of low rank. Figure 14 shows the results for the three methods. We solved the problem for 50 values of √μ between 10 and 100 (orange curve) and computed the resulting rank and data fit. (For (27) we kept λ = 5 fixed.) All three formulations were given the same (random) starting solution. The same tendencies are visible for all four sequences. While (26) generally gives a better data fit than (28), due to the nuclear norm's shrinking bias, its distance to the ground truth is larger for low values of μ, or equivalently large ranks, where the problem becomes ill-posed. The relaxed functional (27) consistently outperforms (28) in terms of both data fit and distance to the ground truth. In addition, its performance is similar to that of (26) for high values of μ, while it does not exhibit the same unstable behavior for high ranks.
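The rank-r structure and the rearrangement between X# and X can be illustrated directly. The sketch below uses made-up dimensions (F = 8 frames, m = 20 points, r = 3 modes) and assumes the layout described above: row i of X# is [X_i Y_i Z_i], while X stacks the 3 × m block of each frame:

```python
import numpy as np

F, m, r = 8, 20, 3
rng = np.random.default_rng(1)

# Linear shape basis model: X# = C B with r deformation modes.
C = rng.standard_normal((F, r))
B = rng.standard_normal((r, 3 * m))
X_sharp = C @ B                       # F x 3m, rank <= r

# Rearrange X# (rows [X_i Y_i Z_i]) into the 3F x m matrix X with
# stacked 3 x m blocks [X_i; Y_i; Z_i], one block per frame.
X = X_sharp.reshape(F, 3, m).reshape(3 * F, m)

# Block-diagonal R with 2 x 3 blocks of orthonormal rows
# (orthographic cameras, one orientation per frame).
R = np.zeros((2 * F, 3 * F))
for i in range(F):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R[2 * i:2 * i + 2, 3 * i:3 * i + 3] = Q[:, :2].T

M = R @ X                             # 2F x m measurement matrix
```

The rank constraint lives on X# (here exactly r), not on X, which is why the rearranged matrix is the natural variable for the relaxations compared in Fig. 14.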
Funding Open access funding provided by Lund University.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.