On the monotone and primal-dual active set schemes for $\ell^p$-type problems, $p \in (0,1]$

Nonsmooth nonconvex optimization problems involving the $\ell^p$ quasi-norm, $p \in (0, 1]$, of a linear map are considered. A monotonically convergent scheme for a regularized version of the original problem is developed, and necessary optimality conditions for the original problem are given in the form of a complementarity system amenable to computation. An algorithm for solving these necessary optimality conditions is then proposed; it combines the monotone scheme with a primal-dual active set strategy. The performance of the two algorithms is studied by means of a series of numerical tests in different settings, including optimal control problems, fracture mechanics and microscopy image reconstruction.


Introduction
We consider the following nonconvex nonsmooth optimization problem:
$$\min_{x \in \mathbb{R}^n} J(x) := \frac{1}{2}|Ax - b|_2^2 + \beta |\Lambda x|_p^p, \tag{1.1}$$
where $A \in M^{m \times n}$, $\Lambda \in M^{r \times n}$, $b \in \mathbb{R}^m$, $p \in (0, 1]$ and $\beta \in \mathbb{R}^+$. Here $|y|_p^p = \sum_{i=1}^r |y_i|^p$, which is a norm for $p = 1$ and a quasi-norm for $0 < p < 1$. Optimization problems of the form (1.1) arise frequently in many applications as an efficient way to extract the essential features of generalized solutions. In particular, many problems in sparse learning and compressed sensing can be written as (1.1) with $\Lambda = I$, $I$ being the identity (see e.g. [12,42] and the references therein). In image analysis, $\ell^p$-regularizers as in (1.1) have recently been proposed as nonconvex extensions of the total generalized variation (TGV) regularizer used to reconstruct piecewise smooth functions (e.g. in [43,24]). Also, the use of $\ell^p$-functionals with $p \in (0, 1)$ is of particular importance in fracture mechanics (see [44]). Recently, sparsity techniques have also been investigated by the optimal control community, see e.g. [10,22,49,34,27]. The literature on sparsity optimization problems such as (1.1) is rapidly increasing; here we mention also [7,45,18,1].

The nonsmoothness and nonconvexity make the study of problems such as (1.1) both an analytical and a numerical challenge. Many numerical techniques have been developed for the case $\Lambda = I$ (e.g. in [27,20,31,32]), and attention has recently been given to the case of more general operators; here we mention e.g. [43,24,36] and refer to the end of the introduction for further details. However, the presence of the matrix inside the $\ell^p$-term, combined with the nonconvexity and nonsmoothness, remains one of the main issues in the development of numerical schemes for (1.1). In the present work, we first propose a monotone algorithm to solve a regularized version of (1.1). The scheme is based on an iterative procedure solving a modified problem in which the singularity at the origin is regularized. The convergence of this algorithm and the monotone decay of the cost during the iterations are proved. Its performance is then successfully tested in four different situations: a time-dependent control problem, a fracture mechanics example for cohesive fracture models, an M-matrix example, and an elliptic control problem. We also focus on the investigation of suitable necessary optimality conditions for the original problem. Relying on an augmented Lagrangian formulation, optimality conditions of complementarity type are derived. For this purpose we consider the case where $\Lambda$ is a regular matrix, since in the general case optimality conditions of complementarity type are not readily obtainable. A primal-dual active set strategy which exploits the particular form of these optimality conditions is developed. A particular new feature of our method is that at each iteration level the monotone scheme is used to solve the nonlinear equation satisfied by the nonzero components. The convergence of the primal-dual active set strategy is proved in the case $\Lambda = I$ under a diagonal dominance condition. Finally, the algorithm was tested on the same time-dependent control problem as the one analysed for the monotone scheme, as well as on a microscopy image reconstruction example. In all the above-mentioned examples the matrix inside the $\ell^p$-term appears as a discretized gradient, with very different purposes, e.g. as a regularization term in imaging and for modelling purposes in fracture mechanics. This work was supported by the ERC advanced grant 668998 (OCLOC) under the EU's H2020 research programme.
Algorithms of a similar type were proposed in [27] and [20] for problems such as (1.1) without a matrix inside the $\ell^p$-term and in the infinite-dimensional sequence spaces $\ell^p$, with $p \in [0, 1]$. Our monotone and primal-dual active set monotone algorithms are inspired by the schemes studied in [27] and [20], respectively, but with the main novelties that we now treat the case of a regular matrix in the $\ell^p$-term and provide diverse numerical tests for both schemes. Moreover, we prove the convergence of the primal-dual active set strategy. Note also that the monotone scheme was not tested in the earlier papers. Let us recall some further literature concerning $\ell^p$, $p \in (0, 1]$, sparse regularizers. Iteratively reweighted least-squares algorithms with suitable smoothing of the singularity at the origin were analysed in [14,29,30]. In [37] a unified convergence analysis was given and new variants were also proposed. An iteratively reweighted $\ell^1$ algorithm ([9]) was developed in [15] for a class of nonconvex $\ell^2$-$\ell^p$ problems, with $p \in (0, 1)$. A generalized gradient projection method for a general class of nonsmooth nonconvex functionals and a generalized iterated shrinkage algorithm are analysed in [7] and [55], respectively. Also, in [45] a surrogate functional approach combined with a gradient technique is proposed. However, none of these works investigates the case of a linear operator inside the $\ell^p$-term. In [43] an iteratively reweighted convex majorization algorithm is proposed for a class of nonconvex problems including the $\ell^p$, $p \in (0, 1]$, regularizer acting on a linear map; however, an additional assumption of Lipschitz continuity of the objective functional is required to establish convergence of the whole sequence generated by the algorithm. Nonconvex $TV^p$-models with $p \in (0, 1)$ for image restoration are studied in [24] by a Newton-type solution algorithm for a regularized version of the original problem. We also mention [32], where a primal-dual active set method is studied for problems as in (1.1) with $\Lambda = I$ and for a large class of penalties including $\ell^p$ with $p \in [0, 1)$. There, a continuation strategy with respect to the regularization parameter $\beta$ is proposed and the convergence of the primal-dual active set strategy coupled with the continuation strategy is proved. However, differently from the present work, in [32] the nonlinear problem arising at each iteration level of the active set scheme is not investigated; moreover, in [32] the matrix $A$ has normalized column vectors, whereas in the present work $A$ is a general matrix. Finally, in [36] an alternating direction method of multipliers (ADMM) is studied in the case of a regular matrix inside the $\ell^p$-term; optimality conditions are derived and convergence is proved. Although the ADMM in [36] is also deduced from an augmented Lagrangian formulation, we remark that the optimality conditions of that paper are of a different nature than ours, and hence the two approaches cannot readily be compared; we refer to Remark 4.2 for a more detailed explanation. Concerning the general importance of $\ell^p$-functionals with $p \in (0, 1)$, numerical experience has shown that their use can promote sparsity better than the $\ell^1$-norm (see [11,19,50]), e.g. possibly allowing a smaller number of measurements in feature selection and compressed sensing (see also [41,12,13]).
Moreover, many works have demonstrated empirically that nonconvex regularization terms in total variation-based image restoration provide better edge preservation than $\ell^1$-regularization (see [40,41,6,46]). The use of nonconvex optimization can also be motivated from natural image statistics [26], and it appears to be more robust with respect to heavy-tailed distributed noise (see e.g. [54]). The paper is organized as follows. In Section 2 we present our proposed monotone algorithm and prove its convergence. In Section 3 we report our numerical results for the four test cases mentioned above. In Section 4 we derive the necessary optimality conditions for (1.1), describe our primal-dual active set strategy and prove convergence in the case $\Lambda = I$. Finally, in Section 5 we report the numerical results obtained by testing the active set monotone algorithm in the two situations mentioned above.

Existence and monotone algorithm for a regularized problem
For convenience of exposition, we recall the problem under consideration:
$$\min_{x \in \mathbb{R}^n} J(x) := \frac{1}{2}|Ax - b|_2^2 + \beta |\Lambda x|_p^p, \tag{2.1}$$
where $A \in M^{m \times n}$, $\Lambda \in M^{r \times n}$, $b \in \mathbb{R}^m$, $p \in (0, 1]$ and $\beta \in \mathbb{R}^+$. Throughout this section we assume
$$\operatorname{Ker}(A) \cap \operatorname{Ker}(\Lambda) = \{0\}. \tag{2.2}$$
The first result is existence for (2.1).

Theorem 2.1. Assume (2.2). Then problem (2.1) admits a global minimizer.
Proof. Since $J$ is bounded from below, existence will follow from the continuity and coercivity of $J$. Thus we prove that $J$ is coercive, that is, $J(x_k) \to +\infty$ whenever $|x_k|_2 \to +\infty$ for a sequence $\{x_k\} \subset \mathbb{R}^n$. By contradiction, suppose that $|x_k|_2 \to +\infty$ while $J(x_k)$ remains bounded. For each $k$, write $x_k = t_k z_k$ with $t_k \geq 0$, $z_k \in \mathbb{R}^n$ and $|z_k|_2 = 1$. Since $t_k \to +\infty$ and $p < 2$, the boundedness of
$$J(x_k) = \frac{1}{2}|t_k A z_k - b|_2^2 + \beta\, t_k^p\, |\Lambda z_k|_p^p$$
implies that $|A z_k|_2 \to 0$ and $|\Lambda z_k|_p \to 0$ for $k$ sufficiently large. By compactness, the sequence $\{z_k\}$ has an accumulation point $\bar z$ such that $|\bar z|_2 = 1$ and $\bar z \in \operatorname{Ker}(A) \cap \operatorname{Ker}(\Lambda)$, which contradicts (2.2).
Following [27], in order to overcome the singularity of $(|s|^p)' = \frac{p s}{|s|^{2-p}}$ near $s = 0$, we consider for $\varepsilon > 0$ the following regularized version of (2.1):
$$\min_{x \in \mathbb{R}^n} J_\varepsilon(x) := \frac{1}{2}|Ax - b|_2^2 + \beta \sum_{i=1}^r \Psi_\varepsilon(|(\Lambda x)_i|), \tag{2.3}$$
where for $t \geq 0$
$$\Psi_\varepsilon(t) = \begin{cases} \dfrac{p}{2\,\varepsilon^{2-p}}\, t^2 + \Big(1 - \dfrac{p}{2}\Big)\varepsilon^p, & t \leq \varepsilon, \\[4pt] t^p, & t > \varepsilon, \end{cases}$$
so that $\Psi_\varepsilon \in C^1$ with $\Psi_\varepsilon'(t) = \frac{p\, t}{\max(\varepsilon^{2-p},\, t^{2-p})}$. Remark 2.1. Notice that, by the coercivity of the functional $J$ in (2.1), the coercivity of $J_\varepsilon$, and hence existence for (2.3), follow as well.
The necessary optimality condition for (2.3) is given by
$$A^*(Ax - b) + \beta p\, \Lambda^* \Big( \frac{(\Lambda x)_i}{\max(\varepsilon^{2-p}, |(\Lambda x)_i|^{2-p})} \Big)_{i=1}^r = 0,$$
where the max-operation is interpreted coordinate-wise. We set $y = \Lambda x$. Then
$$A^*(Ax - b) + \beta p\, \Lambda^* \Big( \frac{y_i}{\max(\varepsilon^{2-p}, |y_i|^{2-p})} \Big)_{i=1}^r = 0, \qquad y = \Lambda x. \tag{2.5}$$
In order to solve eq. (2.5), the following iterative procedure is considered:
$$A^*(Ax^{k+1} - b) + \beta p\, \Lambda^* \Big( \frac{y_i^{k+1}}{\max(\varepsilon^{2-p}, |y_i^k|^{2-p})} \Big)_{i=1}^r = 0, \tag{2.6}$$
where we denote $y^k = \Lambda x^k$ and the second addend is short for the vector with the indicated components. Note that (2.6) is a linear equation for $x^{k+1}$.
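Each step of (2.6) is thus a linear solve with reweighted diagonal coefficients. As an illustration, a minimal sketch of our own (assuming dense NumPy matrices; the function name `monotone_step` is ours):

```python
import numpy as np

def monotone_step(A, Lam, b, beta, p, eps, x):
    """One step of iteration (2.6): freeze the weights at y^k = Lam @ x
    and solve the resulting linear system for x^{k+1}."""
    y = Lam @ x
    # diagonal weights N_k with entries beta*p / max(eps^{2-p}, |y_i|^{2-p})
    w = beta * p / np.maximum(eps ** (2 - p), np.abs(y) ** (2 - p))
    # (A^* A + Lam^* N_k Lam) x^{k+1} = A^* b
    return np.linalg.solve(A.T @ A + Lam.T @ (w[:, None] * Lam), A.T @ b)
```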
We have the following convergence result.

Theorem 2.2. For $\varepsilon > 0$, let $\{x^k\}$ be generated by (2.6). Then $J_\varepsilon(x^k)$ is strictly monotonically decreasing, unless there exists some $k$ such that $x^k = x^{k+1}$, in which case $x^k$ satisfies the necessary optimality condition (2.5). Moreover, every cluster point of $\{x^k\}$, of which there exists at least one, is a solution of (2.5).
Proof. The proof follows arguments similar to those of Theorem 4.1 in [27]. Taking the inner product of (2.6) with $x^{k+1} - x^k$ yields the strict decay of $J_\varepsilon(x^k)$ unless $x^{k+1} = x^k$. Since $\{x^k\}_{k=1}^\infty$ is bounded, there exist a subsequence $\{x^{k_l}\}$ and $\bar x \in \mathbb{R}^n$ such that $x^{k_l} \to \bar x$. By (2.12) and (2.2) we have that $x^{k_l + 1} \to \bar x$. Then, passing to the limit with respect to $k$ in (2.6), we get that $\bar x$ is a solution of (2.5).
In the following proposition we establish the convergence of solutions of (2.3) to solutions of (2.1) as $\varepsilon$ goes to zero.

Proposition 2.3. For $\varepsilon > 0$, let $x_\varepsilon$ be a solution of (2.3). Then every cluster point of $\{x_\varepsilon\}$ as $\varepsilon \to 0$, of which there exists at least one, is a solution of (2.1).

Proof. From the coercivity of $J_\varepsilon$ we have that $\{x_\varepsilon\}_\varepsilon$ is bounded for $\varepsilon$ small, and then there exist a subsequence $\{x_{\varepsilon_l}\}$ and $\bar x \in \mathbb{R}^n$ such that $x_{\varepsilon_l} \to \bar x$. Since $x_\varepsilon$ solves (2.3), by letting $\varepsilon \to 0$ and using the definition of $\Psi_\varepsilon$, we easily get that $\bar x$ is a solution of (2.1).

Monotone algorithm: numerical results
The focus of this section is to investigate the performance of the monotone algorithm in practice. For this purpose we choose four problems with matrices $A$ of very different structure: a time-dependent optimal control problem, a fracture mechanics example, an M-matrix example, and a stationary optimal control problem. The latter two problems are studied in the two-matrix case.
3.1. The numerical scheme. For later reference it is convenient to state the scheme in the following form (see Algorithm 1). Note that a continuation strategy with respect to the parameter $\varepsilon$ is performed; the initialization and the range of $\varepsilon$-values are described for each class of problems below. The algorithm stops when the $\ell^\infty$-norm of the residue of (2.5) is $O(10^{-3})$, in all the examples except the fracture problem, where it is $O(10^{-15})$. At this instance, the $\ell^2$-residue is typically much smaller. Thus, we find an approximate solution of the $\varepsilon$-regularized optimality condition (2.5). The initialization is chosen as
$$x^0 = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}|Ax - b|_2^2 + \beta |\Lambda x|_2^2, \tag{3.1}$$
that is, $x^0$ is the solution of problem (2.1) with the $\ell^p$-term replaced by the squared $\ell^2$-norm. Our numerical experience shows that for some values of $\beta$ this initialization is not suitable, that is, the residue obtained is too large. In order to reach a lower residue, we successfully tested a continuation strategy with respect to increasing $\beta$-values.
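For concreteness, the following sketch of our own embeds the step (2.6) into the $\varepsilon$-continuation loop of Algorithm 1, starting from the initialization (3.1); all names and default values are illustrative:

```python
import numpy as np

def algorithm1(A, Lam, b, beta, p, eps_list, tol=1e-3, max_it=10000):
    """Sketch of Algorithm 1: iteration (2.6) inside an eps-continuation loop,
    started from the l^2-initialization (3.1)."""
    x = np.linalg.solve(A.T @ A + 2 * beta * (Lam.T @ Lam), A.T @ b)  # (3.1)
    for eps in eps_list:                         # decreasing eps-path
        for _ in range(max_it):
            y = Lam @ x
            w = beta * p / np.maximum(eps**(2 - p), np.abs(y)**(2 - p))
            x = np.linalg.solve(A.T @ A + Lam.T @ (w[:, None] * Lam), A.T @ b)
            y = Lam @ x                          # residue of (2.5) at new iterate
            w = beta * p / np.maximum(eps**(2 - p), np.abs(y)**(2 - p))
            if np.linalg.norm(A.T @ (A @ x - b) + Lam.T @ (w * y), np.inf) < tol:
                break
    return x
```

For instance, `algorithm1(A, Lam, b, 1e-2, 0.5, 10.0**-np.arange(3, 9))` would mirror the $\varepsilon$-range $10^{-3}, \dots, 10^{-8}$ used in Section 3.2.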
In the presentation of our numerical results, the total number of iterations shown in the tables takes into account the continuation strategy with respect to $\varepsilon$; it does not, however, take into account the continuation with respect to $\beta$. We remark that in all the experiments presented in the following sections, the value of the functional was checked to be monotonically decreasing along the iterations, in accordance with Theorem 2.2.
The following notation will be used for the rest of the paper. For $x \in \mathbb{R}^n$ we denote $|x|_0 = \#\{i : |x_i| > 10^{-10}\}$, $|x|_0^c = \#\{i : |x_i| \leq 10^{-10}\}$, and by $|x|_2$ the Euclidean norm of $x$.
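In code, these quantities read (a trivial helper of our own):

```python
import numpy as np

def sparsity_counts(x, tol=1e-10):
    """|x|_0, |x|_0^c and |x|_2 with the 1e-10 cutoff used in the tables."""
    nnz = int(np.sum(np.abs(x) > tol))
    return nnz, x.size - nnz, float(np.linalg.norm(x))
```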

3.2.
Time-dependent control problem. We consider the linear control system
$$\frac{d}{dt} y(t) = Ay(t) + Bu(t), \qquad y(0) = 0, \tag{3.2}$$
where the linear closed operator $A$ generates a $C_0$-semigroup $e^{At}$, $t \geq 0$, on the state space $X$. More specifically, we consider the one-dimensional controlled heat equation for $y = y(t, x)$:
$$\partial_t y = \partial_{xx} y + b_1(x)u_1(t) + b_2(x)u_2(t),$$
with homogeneous boundary conditions $y(t, 0) = y(t, 1) = 0$, and thus $X = L^2(0, 1)$. The differential operator $Ay = y_{xx}$ is discretized in space by the second-order finite difference approximation with $n = 49$ interior spatial nodes ($\Delta x = \frac{1}{50}$). We use two time-dependent controls $\vec u = (u_1, u_2)$ with corresponding spatial control distributions $b_i$ chosen as step functions. The control problem consists in finding the control function $\vec u$ that steers the state $y(0) = 0$ to a neighborhood of the desired state $y_d$ at the terminal time $T = 1$. We discretize the problem in time by the mid-point rule.
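The following sketch (our own illustration) assembles the mid-point time-stepping map, denoted (3.4) below, together with the operator $\Lambda = m(I_2 \otimes D)$; the supports of the control distributions $b_1, b_2$ are assumptions, since their exact step functions are not reproduced here:

```python
import numpy as np

n, m = 49, 50                          # interior space nodes, time steps
dx, dt = 1 / (n + 1), 1 / m
xg = np.linspace(dx, 1 - dx, n)        # interior spatial grid
Ah = (np.eye(n, k=-1) - 2 * np.eye(n) + np.eye(n, k=1)) / dx**2   # discrete y_xx

B = np.zeros((n, 2))                   # control distributions (assumed supports)
B[(0.2 <= xg) & (xg <= 0.3), 0] = 1.0  # b_1: away from the target
B[(0.6 <= xg) & (xg <= 0.7), 1] = 1.0  # b_2: inside the support of y_d

# mid-point rule: (I - dt/2 Ah) y^{k+1} = (I + dt/2 Ah) y^k + dt B u^{k+1/2}
Ml = np.eye(n) - 0.5 * dt * Ah
Mr = np.eye(n) + 0.5 * dt * Ah
Amap = np.zeros((n, 2 * m))            # linear map (3.4): controls -> y(T)
for j in range(2 * m):
    U = np.zeros(2 * m); U[j] = 1.0
    U = U.reshape(2, m)                # rows: u_1, u_2 on the time grid
    y = np.zeros(n)
    for k in range(m):
        y = np.linalg.solve(Ml, Mr @ y + dt * (B @ U[:, k]))
    Amap[:, j] = y

D = np.eye(m) - np.eye(m, k=-1)        # backward differences on each control
Lam = m * np.kron(np.eye(2), D)        # Lambda = m (I_2 (x) D)
b_target = 0.4 * np.exp(-70 * (xg - 0.7)**2)   # discretized target y_d
```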
We denote the resulting discrete dynamics by (3.4); they define a linear map from the control vector $\vec u = (u_1^1, \cdots, u_1^m, u_2^1, \cdots, u_2^m)$, whose coordinates represent the values at the mid-points of the intervals $(t_k, t_{k+1})$, to the terminal state. Note that in (3.4), with some abuse of notation, we denote by $B$ a suitable rearrangement of the matrix $B$ in (3.2). A uniform step-size $\Delta t = \frac{1}{50}$ ($m = 50$) is utilized. The solution of the control problem is based on the sparsity formulation (2.1), where $\Lambda$ is the backward difference operator acting independently on each component of the control, that is, $\Lambda = m(I_2 \otimes D)$, where $I_2$ is the $2 \times 2$ identity matrix and $D : \mathbb{R}^m \to \mathbb{R}^m$ is given by
$$D = \begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}.$$
Also, $b$ in (2.1) is the discretized target function, chosen as the Gaussian distribution $y_d(x) = 0.4\exp(-70(x - .7)^2)$ centered at $x = .7$. That is, we apply our algorithm to the optimal control problem discretized in time and space, where $x$ from (2.1) is the discretized control vector $u \in \mathbb{R}^{2m}$, which is mapped by $A$ to the discretized output $y$ at time $1$ by means of (3.4), and $b$ from (2.1) is the discretized state $y_d$ with respect to the spatial grid $\Delta x$. The parameter $\varepsilon$ was initialized with $10^{-3}$ and decreased down to $10^{-8}$. Note that, since the second control distribution is well within the support of the desired state $y_d$, we expect the authority of this control to be stronger than that of the first one, which is away from the target.

In Table 1 we report the results of our tests for $p = .5$ and $\beta$ incrementally increasing by a factor of $10$ from $10^{-3}$ to $1$. We report only the values for the second control $u_2$, since the first control $u_1$ is always zero. In the third row we see that $|Du_2|_0^c$ increases with $\beta$, consistent with our expectation. Note also that the quantity $|Du_2|_p^p$ decreases as $\beta$ increases. For any $i = 1, \cdots, m$, we say that $i$ is a singular component of the vector $Du_2$ if $|(Du_2)_i| < \varepsilon$; in particular, the singular components are the ones where the $\varepsilon$-regularization is most influential. In the sixth row of Table 1 we show their number at the end of the $\varepsilon$-path-following scheme (denoted by Sp), and we observe that it coincides with the quantity $|Du_2|_0^c$, which reassures us of the validity of our $\varepsilon$-strategy. The algorithm was also tested for values of $p$ near $1$, e.g. for $p = .9$; the results show a less piecewise constant behaviour of the solution with respect to the one for $p = .5$. Finally, we remark that if we change the initialization (3.1), the method converges to the same solution with no remarkable change in the number of iterations.

Table 1. Sparsity in a time-dependent control problem, $p = .5$, mesh size $h = \frac{1}{50}$. Results obtained by Algorithm 1.

3.3. Quasi-static evolution of cohesive fracture models. In this section we focus on a modelling problem for quasi-static evolutions of cohesive fractures. This kind of problem requires the minimization of an energy functional with two components: the elastic energy and the cohesive fracture energy. The underlying idea is that the fracture energy is released gradually with the growth of the crack opening. The cohesive energy, denoted by $\theta$, is assumed to be a monotone non-decreasing function of the jump amplitude of the displacement, denoted by $[u]$. Cohesive energies were introduced independently by Dugdale [16] and Barenblatt [3]; we refer to [44] for more details on the models. Let us just remark that the two models differ mainly in the evolution of the derivative $\theta'([u])$, that is, the bridging force, across a crack amplitude $[u]$.
In Dugdale's model this force keeps a constant value up to a critical value of the crack opening and then drops to zero. In Barenblatt's model, the dependence of the force on $[u]$ is continuous and decreasing.
In this section we test the $\ell^p$-term, $0 < p < 1$, as a model for the cohesive energy. In particular, the cohesive energy is then not differentiable at zero, and the bridging force tends to infinity as the jump amplitude tends to zero; note also that the bridging force tends to zero as the jump amplitude tends to infinity. Let us introduce the elements needed for the rest of the section. We consider the one-dimensional domain $\Omega = [0, 2l]$ with $l > 0$, and we denote by $u : \Omega \to \mathbb{R}$ the displacement function. The deformation of the domain is driven by an external force, which we express in terms of an external displacement function $g : \Omega \times [0, T] \to \mathbb{R}$. We require that the displacement $u$ coincides with the external deformation on the boundary, that is, $u = g$ on $\partial\Omega = \{0, 2l\}$. We denote by $\Gamma$ the point of the (potential) crack, which we choose as the midpoint $\Gamma = l$, and by $\theta([u]_\Gamma)$ the value of the cohesive energy $\theta$ at the crack amplitude of the displacement $u$ at $\Gamma$.
Since we are in a quasi-static setting, we introduce the time discretization $0 = t_0 < t_1 < \cdots < t_{N_T} = T$ and look for the equilibrium configurations, which are minimizers of the energy of the system. This means that for each $i \in \{0, \cdots, N_T\}$ we need to minimize the energy of the system with respect to the given boundary datum $g(\cdot, t_i)$, where $\beta > 0$ in $J(u)$ is a material parameter. In particular, we consider the following type of cohesive energy:
$$\theta([u]) = \beta\, |[u]|^p, \qquad p \in (0, 1).$$
We divide $\Omega$ into $2N$ intervals and approximate the displacement with a function $u_h$ that is piecewise linear on $\Omega \setminus \Gamma$ and has two degrees of freedom at $\Gamma$, to represent correctly the two lips of the fracture; we denote by $u_N^-$ the degree of freedom on $[0, l]$ and by $u_N^+$ the one on $[l, 2l]$. We remark that in the discretization the jump of the displacement is not taken into account in the sum of the elastic contributions, and the gradient of $u$ is approximated by first-order finite differences. The Dirichlet condition is applied on $\partial\Omega = \{0, 2l\}$, and the external displacement is chosen as $g(x, t) = xt$. To enforce the boundary condition in the minimization process, we add it to the energy functional as a penalization term. Hence, we solve an unconstrained minimization problem (3.7), in which the operator $A \in \mathbb{R}^{(2N+1) \times (2N+1)}$ collects the first-order difference quotients on $\Omega \setminus \Gamma$ together with the boundary penalization, and $g \in \mathbb{R}^{2N+1}$ is given by $g = (0, \cdots, 0, \gamma 2l t_i)$, $\gamma$ being the penalization parameter. To compute the jump between the two lips of the fracture, we introduce the operator
$$D_f = (0, \cdots, 0, -1, 1, 0, \cdots, 0) \in \mathbb{R}^{1 \times (2N+1)},$$
where $-1$ and $1$ are in positions $N$ and $N+1$, respectively. Then we write the functional (3.7) as
$$J_h(u) = \frac{1}{2}|Au - g|_2^2 + \beta |D_f u|_p^p. \tag{3.8}$$
Note that $\operatorname{Ker} A = \{0\}$, hence assumption (2.2) is satisfied and existence of a minimizer for (3.8) is guaranteed.
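As a sketch of the resulting discrete objects (our own; $A$ and $g$ are assumed to have been assembled as described above, and 0-based indexing is used in the code):

```python
import numpy as np

def jump_operator(N):
    """D_f in R^{1 x (2N+1)}: entries -1 and 1 in positions N and N+1
    (1-based), extracting the jump between the two lips of the fracture."""
    Df = np.zeros((1, 2 * N + 1))
    Df[0, N - 1], Df[0, N] = -1.0, 1.0
    return Df

def fracture_energy(u, A, g, Df, beta, p):
    """Discrete energy (3.8): 0.5*|A u - g|_2^2 + beta*|D_f u|_p^p."""
    return 0.5 * np.linalg.norm(A @ u - g) ** 2 + beta * np.sum(np.abs(Df @ u) ** p)
```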
Our numerical experiments were conducted with a discretization into $2N$ intervals with $N = 100$ and a prescribed potential crack at $\Gamma = 0.5$. The time step in the discretization of $[0, T]$ with $T = 3$ is set to $dt = 0.01$. The parameters of the energy functional $J_h(u_h)$ are set to $\beta = 1$, $\gamma = 50$. The parameter $\varepsilon$ is decreased from $10^{-1}$ to $10^{-12}$.
In Figure 1 we report three time frames representing the evolution of the crack obtained with Algorithm 1 for two different values of $p$, namely $p = .01$ and $p = .1$. Each time frame consists of three different time steps $(t_1, t_2, t_3)$, where $t_2$ and $t_3$ are chosen as the first instants where the prefracture and the fracture appear, respectively. The evolution presents the three phases that we expect from a cohesive fracture model:
• Pure elastic deformation: the jump amplitude is zero and the gradient of the displacement is constant in $\Omega \setminus \Gamma$;
• Prefracture: the two lips of the fracture do not touch each other, but they are not free to move; the elastic energy is still present;
• Fracture: the two parts are free to move. In this final phase the gradient of the displacement (and hence the elastic energy) is zero.
Moreover, we remark that the formation of the crack occurs earlier for smaller values of $p$. As we see in Figure 1, for $p = .01$ prefracture and fracture are reached at $t = .3$ and $t = 1.5$, respectively. As $p$ is increased to $p = .1$, prefracture and fracture occur at $t = 1$ and $t = 3$, respectively. Finally, we remark that in our experiments the residue is $O(10^{-16})$ and the number of iterations is small, e.g. $12$ and $15$ for $p = .01$ and $p = .1$, respectively.
3.4. An M-matrix example. We next consider problem (2.1) where $A$ is the backward finite-difference gradient,
$$A = (n+1)\begin{pmatrix} G_1 \\ G_2 \end{pmatrix}, \tag{3.9}$$
with $G_1 \in \mathbb{R}^{n(n+1) \times n^2}$, $G_2 \in \mathbb{R}^{n(n+1) \times n^2}$ given by
$$G_1 = I \otimes D, \qquad G_2 = D \otimes I.$$
Here $I$ is the $n \times n$ identity matrix, $\otimes$ denotes the tensor product, and $D \in \mathbb{R}^{(n+1) \times n}$ is given by
$$D = \begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \\ & & & -1 \end{pmatrix}. \tag{3.11}$$
Then $A^*A$ is an M-matrix coinciding with the 5-point star discretization, on a uniform mesh on a square, of the Laplacian with Dirichlet boundary conditions. Note that (3.9) can be equivalently expressed as
$$\min_{x \in \mathbb{R}^{n^2}} \frac{1}{2}(x, A^*Ax) - (f, x) + \beta|\Lambda x|_p^p, \qquad f = A^*b. \tag{3.12}$$
If $\beta = 0$ this is the discretized variational form of the elliptic equation
$$-\Delta y = f \ \text{in} \ \Omega, \qquad y = 0 \ \text{on} \ \partial\Omega. \tag{3.13}$$
For $\beta > 0$ the variational problem (3.12) gives a solution with enhanced piecewise constant behaviour.
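The Kronecker structure makes these matrices straightforward to assemble; a small sketch of our own (the orientation of $G_1$ versus $G_2$ depends on the ordering convention and is an assumption here):

```python
import numpy as np

def gradient_matrices(n):
    """D in R^{(n+1) x n} as in (3.11); G1 = I (x) D and G2 = D (x) I, so
    that G1^T G1 + G2^T G2 is the 5-point star (up to the mesh factor)."""
    D = np.zeros((n + 1, n))
    D[np.arange(n), np.arange(n)] = 1.0
    D[np.arange(1, n + 1), np.arange(n)] = -1.0
    G1 = np.kron(np.eye(n), D)
    G2 = np.kron(D, np.eye(n))
    return D, np.vstack([G1, G2])

D, G = gradient_matrices(4)
A = 5 * G                     # scaling n + 1 with n = 4
L5 = A.T @ A                  # M-matrix: 5-point Laplacian with Dirichlet BC
```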
Our tests were conducted with $f$ chosen as the discretization of $f(x) = 10 x_1 \sin(5x_2)\cos(7x_1)$ and with $\Lambda = (n+1)\binom{D_1}{D_2}$, where $D_1 \in \mathbb{R}^{n^2 \times n^2}$, $D_2 \in \mathbb{R}^{n^2 \times n^2}$ are defined as
$$D_1 = I \otimes \tilde D, \qquad D_2 = \tilde D \otimes I, \tag{3.14}$$
and $\tilde D \in \mathbb{R}^{n \times n}$ is the backward difference operator defined in (3.11) without the $(n+1)$-st row. The parameter $\varepsilon$ was initialized with $10^{-1}$ and decreased to $10^{-6}$.
In Table 2 we show the performance of Algorithm 1 for $p = .1$, mesh size $h = 1/64$, and $\beta$ incrementally increasing by a factor of $10$ from $10^{-4}$ to $10$. In Figure 2 we report the plots of the solutions for values of $\beta$ between $.01$ and $.3$, the range where most changes occur.
We observe significant differences in the results with respect to different values of $\beta$. Consistently with our expectations, $|\Lambda x|_0^c$ increases with $\beta$ (see the third row of Table 2). For example, for $\beta = 1, 10$ we have $|\Lambda x|_0^c = 7938$, or equivalently $|\Lambda x|_0 = 0$, that is, the solution of (3.12) is constant. Moreover, the fourth row shows that $|\Lambda x|_p^p$ decreases as $\beta$ increases. The fifth row exhibits the $\ell^\infty$-norm of the residue, which is $O(10^{-4})$ for all the considered $\beta$. We remark that the number of iterations is sensitive with respect to $\beta$: in particular, it increases as $\beta$ increases from $10^{-4}$ to $10^{-1}$ and then decreases significantly for $\beta = 1, 10$. The algorithm was also tested for different values of $p$. The results show a dependence on $p$; in particular, $|\Lambda x|_0^c$ decreases as $p$ increases. For example, for $p = .5$ and $\beta = .1$ we have $|\Lambda x|_0^c = 188$, $|\Lambda x|_p^p = 528$. In the sixth row of Table 2 we show the number of singular components of the vector $\Lambda x$ at the end of the $\varepsilon$-path-following scheme, that is, $\mathrm{Sp} := \#\{i : |(\Lambda x)_i| < \varepsilon\}$. For most values of $\beta$, we note that Sp is comparable to $|\Lambda x|_0^c$. This again confirms that the $\varepsilon$-strategy is effective. Finally, we remark that if we modify the initialization (3.1), the method converges to the same solution with no remarkable change in the number of iterations, which is a sign of the global nature of the algorithm.

Remark 3.1. The algorithm was also tested in the following two particular cases: $\Lambda = I$, where $I$ is the identity matrix of size $n^2$, and $\Lambda = (n+1)D_1$, where $D_1$ is as in (3.14).
In the case $\Lambda = I$ the variational problem (3.12) for $\beta > 0$ gives a sparsity-enhancing solution of the elliptic equation (3.13), that is, the displacement $y$ will be $0$ where the forcing $f$ is small. Indeed, in this case the sparsity of the solution increases with $\beta$. Also, the residue is $O(10^{-8})$ and the number of iterations is considerably smaller than in the two-matrix case. For the case $\Lambda = (n+1)D_1$ we show the plots in Figure 3. Comparing the graphs for $\beta = .3$ in Figure 2 and Figure 3, we can find subdomains where the solution is only unidirectionally piecewise constant in Figure 3 but piecewise constant in Figure 2. The number of iterations, $|\Lambda x|_0^c$, $|\Lambda x|_p^p$ and the residue are comparable to those of Table 2.

3.5. Elliptic control problem. We consider a two-dimensional control problem (3.15) in which a tracking-type functional is minimized over $u \in L^p(\Omega)$ with $\nabla u \in L^p(\Omega)$; here $\Omega$ is the unit square, $y_d \in L^2(\Omega)$ is a given target function, and $y \in L^2(\Omega)$ satisfies the associated elliptic state equation (3.16). We discretize (3.15) by a minimization problem (3.17) on a mesh of size $\frac{1}{n}$. For numerical reasons, in order to avoid the inversion of the matrix $A^*A$, we multiply the necessary optimality condition (2.5) by $(E^{-1})^*$ and obtain a system (3.18), where $y = \Lambda u$. We introduce $z = Eu$, $p = (\Lambda^* N \Lambda)u$, where we denote by $N$ the diagonal matrix with entries $(N)_{ii} = \frac{\beta p}{\max(\varepsilon^{2-p}, |y_i|^{2-p})}$, $i = 1, \cdots, n^2$. Since $E^{-1} = A^*A$, we can express (3.18) in the form (3.19). To solve (3.19) the following iterative procedure is used: at each step we solve (3.20) with the weights frozen at the previous iterate, where we denote by $N_k$ the diagonal matrix with entries $(N_k)_{ii} = \frac{\beta p}{\max(\varepsilon^{2-p}, |y_i^k|^{2-p})}$ for $i = 1, \cdots, n^2$, and $y^k = \Lambda u^k$. Note that the system matrix of (3.20) is symmetric. The parameter $\varepsilon$ was initialized with $10^{-1}$ and decreased to $10^{-6}$. In Table 3 we report the results of our test for $h = \frac{1}{64}$, $p = .1$ and $\beta$ incrementally increasing by a factor of $10$ from $10^{-3}$ to $1$. As expected, when $\beta$ increases, $|\Lambda u|_0^c$ increases and $|\Lambda u|_p^p$ decreases. In Figure 5 we show the plots of the solution for different values of $\beta$, which illustrate the enhanced piecewise constant behaviour of the solution.

From our tests we conclude that the monotone algorithm is a reliable method for finding a solution of the $\varepsilon$-regularized optimality condition (2.5) for a diverse spectrum of problems. It is also stable with respect to the choice of initial conditions. According to the last rows of Tables 1, 2 and 3, the quantity $\#\{i : |(\Lambda x)_i| \leq 10^{-10}\}$ is typically very close to the number of singular components at the end of the $\varepsilon$-path-following scheme. Depending on the choice of $\beta$, the algorithm requires on the order of $O(10^2)$ to $O(10^3)$ iterations to reach convergence. In the following sections we analyse an alternative algorithm for which the iteration count is smaller, despite the fact that its convergence can be proved only in special cases.

The active set monotone algorithm for the optimality conditions
In the following we discuss an algorithm which aims at finding a solution of the original unregularized problem
$$\min_{x \in \mathbb{R}^n} J(x) = \frac{1}{2}|Ax - b|_2^2 + \beta|\Lambda x|_p^p, \tag{4.1}$$
where $A \in M^{m \times n}$, $b \in \mathbb{R}^m$, $p \in (0, 1]$ and $\beta \in \mathbb{R}^+$ are as in Section 2, and $\Lambda \in M^{n \times n}$ is a regular matrix. Existence for problem (4.1) follows from Theorem 2.1. First, the necessary optimality conditions for problem (4.1) are derived in the form of a complementarity system. Then an active set strategy relying on the form of these optimality conditions is proposed. Convergence of the primal-dual active set strategy is proven in the case $\Lambda = I$. Finally, the results of our numerical tests in two different situations are reported in Section 5.
4.1. Necessary optimality conditions. For any matrix $A \in M^{m \times n}$, we denote by $A_i$ the $i$-th column of $A$. We have the following necessary optimality conditions.
Proof. Note that if $\bar x$ is a global minimizer of (4.1), then $\bar y = \Lambda \bar x$ is a global minimizer of the transformed problem (4.4). We introduce the multiplier $\lambda$ and rewrite (4.4) in the form (4.5). Then the optimality conditions (4.2) follow from (4.5) with $\bar y = \Lambda \bar x$. The equality conditions follow similarly from $\bar y = \Lambda \bar x$ and the first equation in (4.5).
Remark 4.1. We remark that Theorem 4.1 still holds when considering (4.1) in the infinite-dimensional sequence spaces $\ell^p$ in the case $\Lambda = I$.
Moreover, we have the following corollary, which can be proved as in [27, Corollary 2.1].

4.2. The augmented Lagrangian formulation and the primal-dual active set strategy.
The active set strategy can be motivated by the following augmented Lagrangian formulation of problem (4.1). Let $P$ be a nonnegative self-adjoint matrix satisfying, for some $\eta, \xi > 0$,
$$(x, (A^*A + \eta P)x) \geq \xi |x|_2^2 \quad \text{for every } x \in \mathbb{R}^n.$$
We set
$$B_i = \big( (A^*A + \eta P)_{ii} \big)^{\frac{1}{2}}, \qquad i = 1, \dots, n, \tag{4.7}$$
and let $B$ denote the diagonal invertible operator with entries $B_i$. Thus, if $A$ is nearly singular, we use $\eta > 0$ and the functional $\frac{\eta}{2}(x, Px)$ to regularize (4.1). Consider the associated augmented Lagrangian functional
$$L(x, y, \lambda) = \frac{1}{2}|Ax - b|_2^2 + \frac{\eta}{2}(x, Px) + \beta|y|_p^p + (\lambda, \Lambda x - y) + \frac{1}{2}|B(\Lambda x - y)|_2^2.$$
Given $x$ and $\lambda$, we first minimize the Lagrangian $L$ coordinate-wise with respect to $y$. By Theorem 4.1, this amounts to minimizing the scalar expressions
$$\beta|y_i|^p + \frac{1}{2} B_i^2 \Big( y_i - \big( (\Lambda x)_i + \lambda_i / B_i^2 \big) \Big)^2, \qquad i = 1, \dots, n.$$
Given $y$ and $\lambda$, we minimize $L$ with respect to $x$ to obtain
$$(A^*A + \eta P + \Lambda^* B^2 \Lambda)\, x = A^*b + \Lambda^*(B^2 y - \lambda),$$
where $B^2$ is the diagonal operator with entries $B_i^2$. Thus, the augmented Lagrangian method [28] uses the updates: minimize with respect to $y$, solve for $x$, and update the multiplier via $\lambda^{n+1} = \lambda^n + B^2(\Lambda x^{n+1} - y^{n+1})$. If it converges, i.e. $x^n \to x$, $y^n \to \Lambda x$ and $\lambda^n \to \lambda$, then the limit satisfies
$$A^*(Ax - b) + \eta P x + \Lambda^* \lambda = 0, \tag{4.11}$$
together with the coordinate-wise optimality of $y = \Lambda x$, which is the optimality condition for
$$J_P(x) = \min_{x \in \mathbb{R}^n} \frac{1}{2}|Ax - b|_2^2 + \beta|\Lambda x|_p^p + \frac{\eta}{2}(x, Px).$$
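To illustrate the structure of these updates, the following sketch of our own implements the three steps; a single scalar augmentation parameter `c` stands in for the diagonal weights $B_i^2$, a simplifying assumption made to keep the example short. The $y$-update solves the scalar problems above by bracketing the nonzero stationary point:

```python
import numpy as np
from scipy.optimize import brentq

def prox_lp(v, beta, c, p):
    """Global minimizer of  beta*|y|^p + (c/2)*(y - v)^2  (scalar, sketch)."""
    s, a = np.sign(v), abs(v)
    if a == 0.0:
        return 0.0
    # phi'(y) = c*(y - a) + beta*p*y^(p-1) on y > 0 attains its minimum at y_min
    y_min = (beta * p * (1 - p) / c) ** (1 / (2 - p))
    g = lambda y: c * (y - a) + beta * p * y ** (p - 1)
    if g(y_min) >= 0:                      # no nonzero stationary point: y = 0
        return 0.0
    y_star = brentq(g, y_min, a)           # local minimizer on (y_min, a]
    phi = lambda y: beta * y ** p + 0.5 * c * (y - a) ** 2
    return s * y_star if phi(y_star) < phi(0.0) else 0.0

def augmented_lagrangian(A, Lam, b, beta, p, c=10.0, iters=100):
    """ALM updates: coordinate-wise y-minimization, linear x-solve, ascent on lambda."""
    n, r = A.shape[1], Lam.shape[0]
    x, lam = np.zeros(n), np.zeros(r)
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(iters):
        v = Lam @ x + lam / c
        y = np.array([prox_lp(vi, beta, c, p) for vi in v])
        x = np.linalg.solve(AtA + c * Lam.T @ Lam, Atb + Lam.T @ (c * y - lam))
        lam = lam + c * (Lam @ x - y)
    return x, y, lam
```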
Motivated by the form of the optimality conditions (4.11) obtained from the augmented Lagrangian formulation, we formulate a primal-dual active set strategy for the system (4.12), where $\varepsilon > 0$ in the third equation is a fixed, sufficiently small parameter. Note that (4.12) coincides with (4.11) for $\varepsilon = 0$; the purpose of the parameter $\varepsilon$ is to avoid the evaluation of the singular terms at components where $|(\Lambda x)_i|$ is small. The strategy is summarized in Algorithm 2:

Algorithm 2 (primal-dual active set strategy)
1: Choose $(x^0, \lambda^0)$, set $y^0 = \Lambda x^0$ and $n = 0$.
2: repeat
3: Determine the active set $\mathcal{A}_n$ and the inactive set $\mathcal{I}_n$.
4: Solve for $(x^{n+1}, \lambda^{n+1})$; set $y^{n+1} = \Lambda x^{n+1}$, $n = n + 1$.
5: until the stopping criterion is fulfilled.

Remark 4.2.
Note that $B_i$ has to be chosen exactly as in (4.7) in order to obtain convergence of the method to the optimality condition (4.2). In [36] an alternating direction method of multipliers is proposed for problems as in (4.1), and the augmented Lagrangian formulation is considered with a penalization parameter chosen "large enough". The convergence of the proposed ADMM to a stationary point, as defined in equation (4) of [36], is proved. We deduce that, due to the different choice of the penalization term, the stationary points considered in [36] (to which the ADMM proposed in [36] is proved to converge) do not coincide with the stationary points identified by our optimality condition (4.2). To make this evident, consider the following one-dimensional example. Suppose we want to minimize
$$\frac{1}{2}(x - b)^2 + \beta|x|^p, \tag{4.17}$$
for which $x = 0$ is the global minimizer precisely when $|b| \leq \mu$, where we denote $\mu := d_{\beta,p}$. The corresponding splitting with penalization parameter $c > 0$ reads
$$\min_{x, y} \ \frac{1}{2}(x - b)^2 + \beta|y|^p + \frac{c}{2}|x - y|^2. \tag{4.18}$$
Given $y$ fixed, we minimize with respect to $x$ to obtain $x = \frac{b + cy}{1 + c}$. Then, given $x$ fixed, we minimize with respect to $y$ the expression $\beta|y|^p + \frac{c}{2}|x - y|^2$; by Theorem 4.1, $y = 0$ whenever $|x|$ lies below the corresponding threshold. We thus obtain the optimality conditions (4.20), and we denote by $\mu_c$ the threshold on $b$ below which the pair with $y = 0$ satisfies (4.20). Note that if $c > 1$, then $\mu < \mu_c$. We then consider $\mu < b < \mu_c$ in the augmented Lagrangian formulation (4.18) and find that $y = 0$, $x = \frac{b}{1+c}$ is a solution of (4.20); note that $(x, y) \to (0, 0)$ as $c \to +\infty$. On the contrary, since $b > \mu$, $x = 0$ is not a solution of (4.17). We remark that considering a Lagrange multiplier in (4.18) leads to the same conclusion.
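A quick numerical check of this example (our own sketch; the parameter values are illustrative and chosen so that $\mu < b < \mu_c$):

```python
import numpy as np

beta, p, c, b = 1.0, 0.5, 4.0, 2.0     # illustrative values with mu < b < mu_c

F = lambda x: 0.5 * (x - b)**2 + beta * np.abs(x)**p    # problem (4.17)

xs = np.linspace(-1.0, 3.0, 400001)
x_glob = xs[np.argmin(F(xs))]
x_split = b / (1 + c)                   # stationary point of (4.18) with y = 0

print("global minimizer of (4.17):", x_glob)    # nonzero (approx. 1.6)
print("splitting point (y = 0):   ", x_split)   # 0.4, tends to 0 as c grows
```

The splitting thus settles on a spurious point with $y = 0$ even though the global minimizer of (4.17) is nonzero.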

4.3.
Convergence of the primal-dual active set strategy: the case $\Lambda = I$. While the numerical performance of the primal-dual active set strategy proved to be very successful, its convergence analysis remains a substantial challenge. We therefore focus on the case $\Lambda = I$, for which we can give a sufficient condition for uniqueness of the solution to (4.12) and for convergence. Moreover, the case $\Lambda = I$ will be successfully tested on an image reconstruction problem in Section 5.3.
We will use the following diagonal dominance condition (4.23). Remark 4.5. In the case that $Q$ is a diagonal matrix, $Q - B = 0$ and (4.23) is trivially satisfied.
Remark 4.6. We observe that for $p \to 0$ we have $\alpha = \gamma = -\frac{1}{2}$; in particular, (4.23) then coincides with the diagonal dominance condition considered in [27] to prove the convergence of the primal-dual active set strategy in the case $p = 0$. We fix the following notation, which will be used for the rest of this section. Under the diagonal dominance condition (4.23), we prove that if $(x, \lambda)$ is a solution to (4.12) satisfying one of the conditions (4.25)-(4.26), for some $\delta > 0$ large enough, then it is necessarily unique. In (4.25), $\min_{I(x,\lambda)} |B^\alpha(\lambda + Bx)|$ stands for $\min_{i \in I(x,\lambda)} |B_i^\alpha(\lambda_i + B_i x_i)|$. Henceforth we refer to (4.25)-(4.26) as the strict complementarity condition. Note that $\mu_i = \beta^{-\gamma} C B_i^{-\alpha}$. The precise statement of the uniqueness result is given in the following theorem, whose proof is inspired by [27, Theorem 5.1]. Remark 4.7. More precisely, it will be seen from the proof that $\delta$ in (4.25) has to satisfy $\delta > 2\rho$, where we recall that $-\alpha \geq 0$ and $\gamma < 0$.

4.3.2.
Convergence: the diagonally dominant case. Here we give a sufficient condition for the convergence of the primal-dual active set method. Following the ideas of [27] (in particular Proposition 5.1), we utilize the diagonal dominance condition (4.23) and consider a solution $(x, \lambda)$ of (4.12) which satisfies the strict complementarity condition; as such, it is unique according to Theorem 4.3. We use the same notation as in Section 4.3.1.
Remark 4.8. More specifically, it will be seen from the proof that $\delta$ has to satisfy a lower bound of the form $\delta > \rho\, \tilde A\, \beta^\gamma C^{-1}$, with $\tilde A$ as in the proof below. Proof. We divide the proof into three steps. In Step (i) we prove an estimate which will be used throughout the rest of the proof; in Step (ii) we prove the monotonicity of $S_n$ and $T_n$; and in Step (iii) we conclude the proof of convergence.
Assume $i \in A(\bar x, \bar\lambda) \setminus T_n$. Then the system (4.45) holds for $x^{n+1}$. By (4.37), the last equation in (4.46), the complementarity condition $|B_i^\alpha \bar\lambda_i| \leq (1 - \delta)\beta^{-\gamma} C$, (4.23) and (4.42), we obtain (4.47). The first equation in (4.45), the last equation in (4.46), (4.22) and the update rule of the algorithm imply (4.48). Thus, by (4.48), (4.47) and the third equation in (4.45), we reach a contradiction by taking $\delta > \rho \tilde A \beta^\gamma C^{-1}$. By the first equation in (4.49), the third in (4.50) and the update rule of the algorithm, we obtain (4.51), and by the strict complementarity condition $B_i^\alpha(\bar\lambda_i + B_i \bar x_i) > \beta^{-\gamma} C$, (4.22) and the first equation in (4.50), we obtain (4.52). Proceeding as in (4.47) and using (4.51) and (4.52), we again reach a contradiction by taking $\delta > \rho \tilde A \beta^\gamma C^{-1}$. Hence $S_n = I(\bar x, \bar\lambda)$. Once the active set structure is determined, the unique solution is given by (4.13).

Active set monotone algorithm: numerical results
Here we describe the active set monotone scheme (see Algorithm 3) and discuss the numerical results for two different test cases. The first is the time-dependent control problem from Section 3.2; the second is an example in microscopy image reconstruction. Typically, the active set monotone scheme requires fewer iterations and achieves a lower residue than the monotone scheme of Section 2.
5.1. The numerical scheme. The proposed active set monotone algorithm consists of an outer loop based on the primal-dual active set strategy and an inner loop which uses the monotone algorithm to solve the nonlinear part of the optimality condition. In order to achieve better numerical performance, we rewrite the optimality condition as explained in the following. At each iteration of the active set strategy (Algorithm 2) we solve the following system in $(x^{n+1}, \lambda^{n+1})$:
$$\begin{cases} A^*(Ax^{n+1} - b) + \eta P x^{n+1} + \Lambda^* \lambda^{n+1} = 0, \\ (\Lambda x^{n+1})_i = 0, & i \in \mathcal{A}_n, \\ \lambda_i^{n+1} = \dfrac{\beta p\, (\Lambda x^{n+1})_i}{\max(\varepsilon^{2-p}, |(\Lambda x^{n+1})_i|^{2-p})}, & i \in \mathcal{I}_n, \end{cases} \tag{5.1}$$
where $\mathcal{A}_n = \{i : |B_i y_i^n + \lambda_i^n| \leq \mu_i\}$ are the active indices and $\mathcal{I}_n = \mathcal{A}_n^c$ the inactive ones. We write (5.1) in the form
$$\begin{cases} (A^*A + \Lambda_{\mathcal{I}_n}^* N_{\mathcal{I}_n}^{n+1} \Lambda_{\mathcal{I}_n} + \eta P)\, x^{n+1} + \Lambda_{\mathcal{A}_n}^* \lambda_{\mathcal{A}_n}^{n+1} = A^* b, \\ \Lambda_{\mathcal{A}_n} x^{n+1} = 0, \end{cases} \tag{5.2}$$
where $\Lambda_{\mathcal{A}_n}$, $\Lambda_{\mathcal{I}_n}$ are the rows of $\Lambda$ corresponding to the active and inactive indices, and $N_{\mathcal{I}_n}^{n+1}$ is the diagonal operator with entries $(N_{\mathcal{I}_n}^{n+1})_{ii} = \frac{\beta p}{\max(\varepsilon^{2-p}, |(\Lambda x^{n+1})_i|^{2-p})}$, $i \in \mathcal{I}_n$. In order to solve (5.2) we apply the following iterative procedure, solved for $(x^{k+1,n+1}, \lambda^{k+1,n+1})$:
$$\begin{cases} (A^*A + \Lambda_{\mathcal{I}_n}^* N_{\mathcal{I}_n}^{k,n+1} \Lambda_{\mathcal{I}_n} + \eta P)\, x^{k+1,n+1} + \Lambda_{\mathcal{A}_n}^* \lambda_{\mathcal{A}_n}^{k+1,n+1} = A^* b, \\ \Lambda_{\mathcal{A}_n} x^{k+1,n+1} = 0, \end{cases} \tag{5.3}$$
where $N_{\mathcal{I}_n}^{k,n+1}$ is diagonal with entries $\frac{\beta p}{\max(\varepsilon^{2-p}, |(\Lambda x^{k,n+1})_i|^{2-p})}$, $i \in \mathcal{I}_n$.
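A sketch of this inner-outer structure (our own; we take $\eta = 0$ as in the experiments below, set $B_i = |A_i|_2$ as a reading of (4.7) with $\eta = 0$, and pass the thresholds $\mu_i$ as an input since their closed form is not reproduced here):

```python
import numpy as np

def pdas_monotone(A, Lam, b, beta, p, eps, mu,
                  max_outer=50, max_inner=200, tol=1e-12):
    """Outer primal-dual active set loop around the inner monotone solve of (5.3)."""
    n = A.shape[1]
    Bw = np.sqrt(np.sum(A * A, axis=0))       # B_i = |A_i|_2 (eta = 0; an assumption)
    x, lam = np.zeros(n), np.zeros(n)
    for _ in range(max_outer):
        act = np.abs(Bw * (Lam @ x) + lam) <= mu   # active set A_n
        Li, La = Lam[~act], Lam[act]
        for _ in range(max_inner):                 # inner monotone iteration (5.3)
            w = beta * p / np.maximum(eps**(2 - p), np.abs(Li @ x)**(2 - p))
            na = La.shape[0]
            K = np.block([[A.T @ A + Li.T @ (w[:, None] * Li), La.T],
                          [La, np.zeros((na, na))]])
            sol = np.linalg.solve(K, np.concatenate([A.T @ b, np.zeros(na)]))
            x_new, lam_a = sol[:n], sol[n:]
            done = np.linalg.norm(x_new - x) < tol
            x = x_new
            if done:
                break
        lam = np.zeros(n)
        lam[~act] = w * (Li @ x)                   # third equation of (5.1)
        lam[act] = lam_a
    return x, lam
```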
Remark 5.1. Note that the system matrix associated with (5.3) is symmetric.
The algorithm stops when the residue of (5.2) and of (4.2) (for the inner and the outer loop, respectively) is $O(10^{-12})$ in the control problem and $O(10^{-8})$ in the microscopy image example. We remark that in our numerical tests we always took $\eta = 0$. The initialization $(x^0, \lambda^0)$ of the outer loop is chosen with $x^0$ as in (3.1) and $\lambda^0$ the solution of the first equation in (4.2) for $x = x^0$. As in Section 3, for some values of $\beta$ this initialization is not suitable. Following the idea already used for the monotone scheme, we successfully tested an analogous continuation strategy with respect to increasing $\beta$-values.
In Algorithm 3 we jump out of the inner loop in case singular components appear. We recall that the singular components are those $i$ such that $|(\Lambda x)_i| < \varepsilon$, that is, the components where the $\varepsilon$-regularization is most influential.
Algorithm 3 (active set monotone algorithm)
1: Choose $(x^0, \lambda^0)$ and $\varepsilon$; set $n = 0$.
2: repeat
3: Compute $\mathcal{A}_n = \{i : |B_i y_i^n + \lambda_i^n| \leq \mu_i\}$, $\mathcal{I}_n = \mathcal{A}_n^c$; set $k = 0$, $x^{0,n+1} = x^n$.
4: repeat
5: Solve (5.3) for $(x^{k+1,n+1}, \lambda_{\mathcal{A}_n}^{k+1,n+1})$; set $y_{\mathcal{I}_n}^{k+1,n+1} = \Lambda_{\mathcal{I}_n} x^{k+1,n+1}$ and $\lambda_{\mathcal{I}_n}^{k+1,n+1} = \dfrac{\beta p\, y_{\mathcal{I}_n}^{k+1,n+1}}{\max(\varepsilon^{2-p}, |y_{\mathcal{I}_n}^{k+1,n+1}|^{2-p})}$.
6: if some component of $y_{\mathcal{I}_n}^{k+1,n+1}$ is singular, go to 9.
7: Set $k = k + 1$.
8: until the stopping criteria for the inner loop are fulfilled.
9: Set $n = n + 1$.
10: until the stopping criteria for the outer loop are fulfilled.
11: Reduce $\varepsilon$ and go to 3.

In the case where $\Lambda$ coincides with the identity, the system (5.1) can be written in the reduced form (5.6). Note that in (5.6) we couple the first and the third equations in (5.1) and eliminate the dual variable. The advantage is that the second equation in (5.6) is solved only for the inactive components $x_{\mathcal{I}_n}$, i.e. a system of $|\mathcal{I}_n|$ equations, whereas in (5.5) we solve $n + |\mathcal{A}_n|$ equations. Finally, we remark that in the case where $\Lambda$ coincides with the identity, $\varepsilon > 0$ is kept fixed; in particular, $\varepsilon$ is taken as the minimum over $i$ of a threshold depending on $\beta$, $p$ and $B_i$.

5.2. Sparsity in a time-dependent control problem. We test the active set monotone algorithm on the time-dependent control problem described in Section 3.2, with the same discretization in space and time ($\Delta x = \Delta t = \frac{1}{50}$) and the same target function $b$; the initialization of $x$ and the $\varepsilon$-range are also the same. In Table 4 we report the results of our tests for $p = .1$ and $\beta$ incrementally increasing by a factor of $10$ from $10^{-3}$ to $1$. We report only the values for the second control $u_2$, since the first control $u_1$ is always zero. As expected, $|Du_2|_0^c$ increases and $|Du_2|_p^p$ decreases as $\beta$ increases. Note that the numbers of iterations of the inner and outer loops are both small. The algorithm was also tested for the same $p$ as in Section 3.2, that is $p = .5$, for the same range of $\beta$ as in Table 4. Compared with the results achieved by Algorithm 1, we obtained the same values of the $\ell^0$-term for corresponding values of $\beta$, and a considerably smaller residue within significantly fewer inner iterations. Finally, we note that if $\Lambda = I$ the number of inner iterations is even smaller, namely $6$ on average.
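Returning to the case $\Lambda = I$ discussed above, the reduced system (5.6) leads to a particularly simple inner step; a sketch of our own:

```python
import numpy as np

def inner_step_identity(A, b, beta, p, eps, act, x):
    """One inner step of (5.6) for Lam = I: active components are fixed to
    zero; the inactive block solves a reduced weighted normal equation."""
    ina = ~act
    Ai = A[:, ina]
    w = beta * p / np.maximum(eps**(2 - p), np.abs(x[ina])**(2 - p))
    x_new = np.zeros_like(x)
    x_new[ina] = np.linalg.solve(Ai.T @ Ai + np.diag(w), Ai.T @ b)
    return x_new
```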

5.3.
Compressive sensing approach for microscopy image reconstruction. In this subsection we present an application of the active set monotone scheme to compressive sensing for microscopy image reconstruction. We focus on the STORM (stochastic optical reconstruction microscopy) method, which is based on the stochastic switching and high-precision detection of single molecules to achieve an image resolution beyond the diffraction limit. The literature on STORM has been growing intensively; see e.g. [47], [5], [23], [25]. The STORM reconstruction process consists of a series of imaging cycles. In each cycle only a fraction of the fluorophores in the field of view are switched on (stochastically), such that each of the active fluorophores is optically resolvable from the rest, allowing the position of these fluorophores to be determined with high accuracy. Despite the advantage of obtaining sub-diffraction-limit spatial resolution, in single-molecule detection-based techniques such as STORM the time needed to acquire a super-resolution image is limited by the maximum density of fluorescent emitters that can be accurately localized per imaging frame; see e.g. [48], [33], [39]. In order to obtain both better resolution and higher emitter density per imaging frame, compressed sensing methods based on $\ell^1$ techniques have recently been applied; see e.g. [53], [2], [21] and the references therein. In the following we propose a similar approach based on our $\ell^p$, $p < 1$, methods. We mention that $\ell^p$, $0 < p \leq 1$, techniques based on a concave-convex regularizing procedure, and hence different from ours, are used in [35].
To be more specific, each single-frame reconstruction can be achieved by solving the following constrained minimization problem:
$$\min_{x} |x|_p^p \quad \text{subject to} \quad |Ax - b|_2 \leq \varepsilon, \tag{5.7}$$
where $p \in (0, 1]$, $x$ is the up-sampled, reconstructed image, $b$ is the experimentally observed image, and $A$ is the impulse response (of size $m \times n$, where $m$ and $n$ are the numbers of pixels in $b$ and $x$, respectively). $A$ is usually called the point spread function (PSF) and describes the response of an imaging system to a point source or point object. The inequality constraint on the $\ell^2$-norm allows some inaccuracy in the image reconstruction, to accommodate the statistical corruption of the image by noise [53]. Solving problems such as (5.7) is referred to as compressed sensing in the microscopy imaging literature. Indeed, in the basic compressed sensing problem, an underdetermined, sparse signal vector is reconstructed from a noisy measurement in a basis in which the signal is not sparse. In the compressed sensing approach to microscopy image reconstruction, the sparse basis is a high-resolution grid on which the fluorophore locations are represented, while the noisy measurement basis consists of the lower-resolution camera pixels, on which the fluorescence signal is detected experimentally. In this framework, the optimally reconstructed image is the one that contains the fewest fluorophores while reproducing the measured image on the camera to a given accuracy (when convolved with the optical impulse response).
We reformulate problem (5.7) as
$$\min_{x \in \mathbb{R}^n} \frac{1}{2}|Ax - b|_2^2 + \beta|x|_p^p, \tag{5.8}$$
and we solve (5.8) by applying Algorithm 3. Note that (5.8) may be viewed as arising from (5.7), with $\beta$ related to the reciprocal of the Lagrange multiplier associated with the inequality constraint $|Ax - b|_2 \leq \varepsilon$. First we tested the procedure on images of the same resolution; in particular, the conventional and the true images are both $128 \times 128$ pixel images. Then the algorithm was tested in the case of a $16 \times 16$ pixel conventional image and a $128 \times 128$ true image. The values of the impulse response $A$ and the measured data $b$ were chosen according to the literature; in particular, $A$ was taken as the Gaussian PSF matrix with variance $\sigma = 8$ and size $3\sigma = 24$, and $b$ was simulated by convolving the impulse response $A$ with a random 0-1 mask over the image and adding white random noise so that the signal-to-noise ratio is $.01$. We carried out several tests with the same data for different values of $p$ and $\beta$. We report only our results for $p = .1$ with $\beta = 10^{-6}$ and $\beta = 10^{-9}$ for the same- and the different-resolution cases, respectively, since for these values the best reconstructions were achieved. The number of single-frame reconstructions carried out to obtain the full reconstruction was $5$ and $10$ for the same- and the different-resolution cases, respectively. In order to measure the performance of our algorithm, we plot the average over six recoveries of the location recovery and of the exact recovery (up to a certain tolerance) against the noise. Note that in compressed sensing these quantities are typically used as a measure of the efficacy of the reconstruction method; see for example [17] (where, under certain conditions, a linear decay with respect to the noise is proven) and [8].
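A sketch of how such synthetic data can be generated (our own; the PSF normalization, the downsampling convention and the noise scaling are assumptions mirroring the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hi, n_lo, sigma = 128, 16, 8.0       # high-res grid, camera grid, PSF width

# separable Gaussian blur as an explicit matrix, then downsampling by 8
t = np.arange(n_hi)[:, None] - np.arange(n_hi)[None, :]
C = np.exp(-t.astype(float)**2 / (2 * sigma**2))
C /= C.sum(axis=1, keepdims=True)
S = np.zeros((n_lo, n_hi))
S[np.arange(n_lo), np.arange(n_lo) * (n_hi // n_lo)] = 1.0
A = np.kron(S @ C, S @ C)              # acts on the vectorized 128x128 image

# simulated frame: sparse 0-1 emitter mask, blurred, plus white noise
x_true = (rng.random(n_hi * n_hi) < 1e-3).astype(float)
b = A @ x_true
b += 0.01 * np.linalg.norm(b) / np.sqrt(b.size) * rng.standard_normal(b.size)
# A and b are then passed to Algorithm 3 applied to (5.8)
```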
The first test is carried out on a sparse 0-1 cross-like image. The STORM reconstructions are presented in Figures 5 and 6 for the same- and the different-resolution cases, respectively. In Figure 7 the plots of the location and exact recovery are shown for the different-resolution case; similar plots are obtained in the same-resolution case. Note that our algorithm recovers the locations of the emitters quite well. Also, the location and intensity of the emitters decay linearly with respect to the noise level, in line with the result of [17]. In particular, for small noise both recoveries are very near to $n^2 = 16384$: the exact recovery is $16240$ and $16243$, and the location recovery $16384$ and $16360$, for the same- and the different-resolution cases, respectively. We also observe that the values of the location recovery are higher than those of the exact recovery for small values of the noise, as expected. A second test was carried out on a non-sparse standard phantom image. In Figure 8 we show the reconstruction in the case of same-resolution images. Note that a high percentage of emitters is correctly localized and the boundaries of the image are well recovered. Also in this case the location and exact recoveries show a linear decay with respect to the noise.
In Tables 5 and 6 we report the number of iterations needed for each single-frame reconstruction. For the cross image in the different-resolution case (Table 5), the numbers of iterations are on average $100$ and $164$ for the outer and inner loops, respectively. Note that for the phantom in the same-resolution case (Table 6) the numbers of iterations are lower, namely $7.2$ and $9.8$ on average for the outer and inner loops, respectively. The numbers of iterations for the cross image in the same-resolution case are comparable to those of Table 5. As shown in the third line of each table, the residue is always less than or equal to $10^{-8}$. We compared our results with those obtained by FISTA in the same situations and with the same parameter values as described above. Figure 9 shows a comparison between the numbers of surplus and missed emitters (Error+ and Error-, respectively) recovered by Algorithm 3 and by FISTA for the cross image in the different-resolution case. We remark that the levels of the location and exact recoveries achieved by FISTA are lower than those obtained by Algorithm 3, at least for noise values near $.01$. In particular, with FISTA the Error+ is always above $410$, whereas with Algorithm 3 it is zero for small values of the noise. On the other hand, FISTA is faster than our algorithm (as expected, since our algorithm solves a nonlinear equation within each minimization problem).

Figure 5. A STORM reconstruction procedure, same resolution, $p = .1$, $\beta = 10^{-6}$. Results obtained by Algorithm 3.

Table 5. Number of outer and inner iterations (ItOut, ItIn) and residue (Res) for each single frame (Fr). Cross image with different resolution, $p = .1$, $\beta = 10^{-9}$. Results obtained by Algorithm 3.