A proximal gradient method for control problems with nonsmooth and nonconvex control cost

We investigate the convergence of an application of a proximal gradient method to control problems with nonsmooth and nonconvex control cost. Here, we focus on control cost functionals that promote sparsity, which includes functionals of $L^p$-type for $p\in [0,1)$. We prove stationarity properties of weak limit points of the method. These properties are weaker than those provided by Pontryagin's maximum principle and weaker than $L$-stationarity.


Introduction
Let Ω ⊂ R^n be Lebesgue measurable with finite measure. We consider a possibly nonsmooth optimal control problem of the type

  min_{u ∈ L²(Ω)} f(u) + ∫_Ω g(u(x)) dx.   (P)

The function f : L²(Ω) → R is assumed to be smooth. Here, we have in mind choosing f as the reduced smooth part of an optimal control problem, incorporating the state equation and a possibly smooth cost functional. We will make the assumptions on the ingredients of the control problem precise below in Section 2.

Due to the properties of g, the optimization problem (P) is challenging in several ways. First, the resulting integral functional u ↦ ∫_Ω g(u(x)) dx is not weakly lower semicontinuous on L²(Ω), so it is impossible to prove existence of solutions of (P) by the direct method. Second, it is challenging to solve numerically, i.e., to compute local minima or stationary points.

In this paper, we address this second issue. We propose to use the proximal gradient method (also called the forward-backward algorithm [3]). The main idea of this method is as follows. Suppose the objective is to minimize a sum f + j of two functions f and j on a Hilbert space H, where f is smooth. Given an iterate u^k, the next iterate u^{k+1} is computed as

  u^{k+1} ∈ argmin_{u ∈ H} f(u^k) + (∇f(u^k), u − u^k) + (L/2)‖u − u^k‖² + j(u),

where L > 0 is a proximal parameter, and L^{-1} can be interpreted as a step-size. In our setting, the functional to be minimized in each step is an integral functional, whose minima can be computed by minimizing the integrand pointwise. Using the so-called prox map, defined by

  prox_{γj}(z) := argmin_{v ∈ H} ½‖v − z‖² + γ j(v),

where γ > 0, the next iterate of the algorithm can be written as

  u^{k+1} ∈ prox_{L^{-1}j}(u^k − L^{-1}∇f(u^k)).   (1.1)

If j ≡ 0, the method reduces to the steepest descent method. If j is the indicator function of a convex set, then the method is a gradient projection method. If f and j are convex, then the convergence properties of the method are well known: under mild assumptions the iterates (u^k) converge weakly to a global minimum of f + j, see, e.g., [3, Corollary 27.9].
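In finite dimensions, the iteration (1.1) can be sketched in a few lines (an illustration of ours; the function names and the quadratic test function below are not from the paper):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, u0, L, n_iter=100):
    """Proximal gradient (forward-backward) iteration:
    u_{k+1} in prox_{g/L}(u_k - (1/L) * grad_f(u_k)).
    `prox_g(v, gamma)` must return one element of prox_{gamma g}(v)."""
    u = u0.copy()
    for _ in range(n_iter):
        u = prox_g(u - grad_f(u) / L, 1.0 / L)
    return u

# Sanity check with g = 0: the method reduces to steepest descent
# for f(u) = 0.5*||u - d||^2 and converges to d.
d = np.array([1.0, -2.0, 0.5])
u_star = proximal_gradient(lambda u: u - d, lambda v, gamma: v, np.zeros(3), L=1.0)
```

For the nonconvex integrands considered in this paper, the prox map is set-valued and the pointwise minimization has to select one element; this is made precise in Sections 3 and 4.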
If f is non-convex, then weak sequential limit points u* of (u^k) are stationary, that is, they satisfy −∇f(u*) ∈ ∂j(u*). If in addition j is nonconvex, then much less can be proven. In finite-dimensional problems, limit points are fixed points of the iteration and satisfy so-called L-stationarity conditions, see [5] and [4, Chapter 10] for optimization problems with l_0-constraints. A feasible point u* is called L-stationary if

  u* ∈ prox_{L^{-1}j}(u* − L^{-1}∇f(u*)).

In a recent contribution [16], the method was analyzed when applied to control problems with L^0 control cost. There it was proven that weak sequential limit points of the iterates in L²(Ω) satisfy an L-stationarity type condition. An essential ingredient of the analysis in [16] is that the functional g is sparsity promoting: solutions of the proximal step are either zero or have a positive distance to zero. We will show how this property can be obtained under weak assumptions on the functional g in (P) near u = 0, see Section 3. Still, this is not enough to conclude L-stationarity of limit points. We will show that weak limit points satisfy a weaker condition in general, see Theorem 4.18. Under stronger assumptions, L-stationarity can be obtained (Theorems 4.19, 4.20). Let us emphasize that, under weak assumptions, the sequence of iterates (u^k) contains weakly converging subsequences but is not weakly convergent in general. Pointwise a.e. and strong convergence are obtained in Theorem 4.25. We apply these results to g(u) = |u|^p, p ∈ (0, 1), in Section 5.1.

Interestingly, the proximal gradient method sketched above is related to algorithms based on proximal minimization of the Hamiltonian in control problems. These algorithms are motivated by Pontryagin's maximum principle. First results for smooth problems can be found in [15], where stationarity of pointwise limits of (u^k) was proven. Under weaker conditions it was proved in [6] that the residual in the optimality conditions tends to zero.
These results were transferred to control problems with parabolic partial differential equations in [7].

Preliminary considerations
Throughout the paper, we will use the following assumption on the function f.

Assumption A. The functional f : L²(Ω) → R is bounded from below and weakly lower semicontinuous. Moreover, f is Fréchet differentiable and ∇f : L²(Ω) → L²(Ω) is Lipschitz continuous with modulus L_f > 0, i.e.,

  ‖∇f(u₁) − ∇f(u₂)‖_{L²(Ω)} ≤ L_f ‖u₁ − u₂‖_{L²(Ω)}

holds for all u₁, u₂ ∈ L²(Ω).
For the moment, let g : R → R̄ be lower semicontinuous and bounded from below. In Section 3 below, we will give the precise assumptions on g that allow sparse controls. Let u ∈ L²(Ω) be given. Then x ↦ g(u(x)) is a measurable function, and we define

  j(u) := ∫_Ω g(u(x)) dx.

Then j : L²(Ω) → R̄ is well-defined and lower semicontinuous, but not weakly lower semicontinuous in general. Hence, standard existence proofs cannot be applied; for a discussion, we refer to [11,16].

Remark 2.1. The results are also valid for the general case that g depends on x ∈ Ω, which results in the integral functional j(u) = ∫_Ω g(x, u(x)) dx, provided g : Ω × R → R̄ is a normal integrand; for the definition we refer to [10, Definition VIII.1.1].

Necessary optimality conditions
The mapping u ↦ ∫_Ω g(u(x)) dx is not directionally differentiable in general, and thus there is no first-order optimality condition. In the following we derive a necessary optimality condition for (P), known as the Pontryagin maximum principle, in which no derivatives of g are involved. We formulate the Pontryagin maximum principle (PMP) as in [16]. A control ū ∈ L²(Ω) satisfies (PMP) if and only if for almost all x ∈ Ω the inequality

  ∇f(ū)(x) ū(x) + g(ū(x)) ≤ ∇f(ū)(x) v + g(v)

holds true for all v ∈ R. The following result is shown in [16, Thm. 2.5] for the special choice g(u) := |u|_0: every local solution ū of (P) satisfies (PMP).
Proof. Letū be a local solution to (P). We will use needle perturbations of the optimal control.
Let (v_i, t_i)_{i∈N} be a dense sequence in epi(g). For arbitrary x ∈ Ω, r > 0 and i ∈ N define u_{r,i} ∈ L²(Ω) by u_{r,i} := (1 − χ_r)ū + χ_r v_i with χ_r := χ_{B_r(x)}. By local optimality, f(u_{r,i}) + j(u_{r,i}) ≥ f(ū) + j(ū) for r small enough. After dividing this inequality by |B_r(x)| and passing to the limit r ↓ 0, we obtain the pointwise inequality by Lebesgue's differentiation theorem. This holds for every Lebesgue point x ∈ Ω of the integrands, i.e., for all x ∈ Ω \ N_i, where N_i is a set of zero Lebesgue measure on which the above inequality is not satisfied. Since the union of the N_i is again a null set and (v_i, t_i) is dense in epi(g), the inequality holds for almost all x ∈ Ω and all (v, t) ∈ epi(g). Choosing t = g(v) yields the claim.

Sparsity promoting proximal operators
In this section, we will investigate the minimization problems that have to be solved in order to compute the proximal gradient step in (1.1). Let g : R → R̄ be proper and lower semicontinuous. For s > 0 and q ∈ R, we define the function

  h_{q,s}(u) := −qu + ½u² + s g(u).
Here, we have in mind to set q := −∇f(u^k)(x). Let us investigate scalar optimization problems of the form

  min_{u ∈ R} h_{q,s}(u).   (3.1)

The solution set is given by the proximal map prox_{sg} : R ⇒ R of g,

  prox_{sg}(q) = argmin_{u ∈ R} ½(u − q)² + s g(u),

since the two objectives differ only by the constant ½q². If g is convex, then (3.1) is a convex problem, and the proximal map is single-valued. If g is bounded from below and lower semicontinuous, then prox_{sg}(q) is nonempty for all q but may be multi-valued for some q.
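Since the prox map acts on scalars, it can be explored numerically. The following sketch (our own illustration; the grid search and the convex test case g(u) = |u|, whose prox is the soft-thresholding operator, are not from the paper) computes one minimizer of (3.1):

```python
import numpy as np

def prox_grid(g, q, s, grid):
    """One minimizer of h_{q,s}(u) = -q*u + 0.5*u**2 + s*g(u) over the grid.
    Equivalently one element of prox_{s g}(q): the objectives
    0.5*(u - q)**2 + s*g(u) and h_{q,s}(u) differ only by the constant 0.5*q**2.
    Note that prox_{s g}(q) may be multi-valued; argmin returns one element."""
    vals = 0.5 * (grid - q) ** 2 + s * g(grid)
    return grid[np.argmin(vals)]

# Convex example g(u) = |u|: prox_{s|.|}(q) = sign(q) * max(|q| - s, 0),
# so q = 2.0, s = 0.5 gives approximately 1.5.
grid = np.linspace(-5.0, 5.0, 100001)
u = prox_grid(np.abs, q=2.0, s=0.5, grid=grid)
```

For the nonconvex, sparsity-promoting choices of g studied below, the same grid search exhibits the jump behaviour established in Theorem 3.6.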
The focus of this section is to investigate under which assumptions prox_{sg} is sparsity promoting: we want to prove that there is σ > 0 such that

  u ∈ prox_{sg}(q) ⇒ u = 0 or |u| ≥ σ.
In [13], this was also investigated for some special cases of non-convex functions. We will show that the following assumption is enough to guarantee the sparsity-promoting property; it contains the result from [13] as a special case.

Assumption B. (B1) g : R → R̄ is lower semicontinuous and symmetric with g(0) = 0.
(B3) g satisfies one of the following properties:
(B3.a) g is twice differentiable on an interval (0, ε) for some ε > 0 and lim sup_{u↘0} g″(u) ∈ (−∞, 0),
(B3.b) g is twice differentiable on an interval (0, ε) for some ε > 0 and lim_{u↘0} g″(u) = −∞,
(B3.c) lim inf_{u→0, u≠0} g(u) > 0.
By Assumption B, the function g is non-convex in a neighborhood of 0 and nonsmooth at 0. One example is the indicator function of the integers, g(u) := δ_Z(u), which satisfies (B3.c).

We are interested in the characterization of global solutions to (3.1) in terms of q. It is well known that for given s > 0 the proximal map q ⇒ prox_{sg}(q) is monotone, i.e., the inequality

  (u₁ − u₂)(q₁ − q₂) ≥ 0 for all u₁ ∈ prox_{sg}(q₁), u₂ ∈ prox_{sg}(q₂)

is satisfied for all q₁, q₂ ∈ R. In addition, the graph of prox_{sg} is a closed set. Moreover, the following results hold true.

Proof. Due to (B1), we have u ∈ prox_{sg}(q) if and only if −u ∈ prox_{sg}(−q). The claim now follows from the monotonicity of the prox mapping.

Proof. Let u ∈ prox_{sg}(q). By optimality, h_{q,s}(u) ≤ h_{q,s}(0) = 0, i.e., −qu + ½u² + s g(u) ≤ 0. Since g(u) ≥ 0, the claim |u| ≤ 2|q| follows.

Proof. If f is of the claimed form, then clearly prox_f(q) = {0} for all q. Now, let 0 ∈ prox_f(q) for all q ∈ H. Then it holds

  ½‖q‖² + f(0) ≤ ½‖u − q‖² + f(u) for all u ∈ H.

This is equivalent to f(u) − f(0) ≥ (q, u) − ½‖u‖². Setting q := tu and letting t → +∞ shows f(u) = +∞ for all u ≠ 0.
Define q_0 := inf_{u≠0} (½u² + s g(u))/|u|. If q_0 > 0, then u = 0 is a global solution to (3.1) for all |q| ≤ q_0, and the unique global solution if |q| < q_0. Moreover, |q| ≤ q_0 is also necessary for u = 0 being a global solution to (3.1).

Proof. Let |q| ≤ q_0 and take u ≠ 0. Then we have

  h_{q,s}(u) = −qu + ½u² + s g(u) ≥ −|q||u| + q_0|u| = (q_0 − |q|)|u| ≥ 0 = h_{q,s}(0).

Note that the second inequality is strict if |q| < q_0. For the second claim, assume that u = 0 is a global solution to (3.1), and assume q > 0. Then it holds

  qu ≤ ½u² + s g(u) for all u > 0.

By the definition of q_0, the inequality q ≤ q_0 follows. Similarly, one can prove |q| ≤ q_0 for negative q.
Together with Assumption B, these results allow us to show the following key observation concerning the characterization of solutions to (3.1). A statement similar to the following can be found in [13, Theorem 1.1].

Theorem 3.6. Let g : R → R̄ satisfy Assumption B. Then there exists s_0 ≥ 0 such that for every s > s_0 there is u_0(s) > 0 such that for all q ∈ R a global minimizer u of (3.1) satisfies u = 0 or |u| ≥ u_0(s).
In case g satisfies (B3.b) or (B3.c), s 0 can be chosen to be zero. Moreover, for all s > 0 there is q 0 := q 0 (s) > 0 such that u = 0 is a global solution to (3.1) if and only if |q| ≤ q 0 . If |q| < q 0 then u = 0 is the unique global solution to (3.1).
Proof. Assume that the first claim does not hold. Then there are sequences (u_n) and (q_n) and s > 0 with u_n ∈ prox_{sg}(q_n) and u_n → 0, u_n ≠ 0. W.l.o.g., (u_n) is a monotonically decreasing sequence of positive numbers, and hence (q_n) is monotonically decreasing and non-negative by Lemma 3.2. Let u and q denote the limits of both sequences. Since u_n ≠ 0 is a global minimum of h_{q_n,s}, it follows h_{q_n,s}(u_n) ≤ h_{q_n,s}(0) = 0. Passing to the limit in this inequality, we obtain lim inf_{n→∞} h_{q_n,s}(u_n) ≤ 0, which implies lim inf_{n→∞} g(u_n) ≤ 0. With g(0) = 0 by (B1), this contradicts (B3.c).

Let now (B3.a) or (B3.b) be satisfied. Then for n sufficiently large the necessary second-order optimality condition h″_{q_n,s}(u_n) ≥ 0 holds, and we obtain

  lim sup_{n→∞} (1 + s g″(u_n)) ≥ 0.

This inequality is a contradiction to (B3.a) if s > −1/lim sup_{u↘0} g″(u) > 0, and to (B3.b) for all s.

By (B1), it holds prox_{sg}(q) ≠ ∅ for all q. Due to (B2) and Lemma 3.4, there is q ≥ 0 such that 0 ∈ prox_{sg}(q). The claim concerning q_0 follows from Assumptions (B4), (B3) and Lemma 3.5. First, consider the case that (B3.a) or (B3.b) is satisfied, i.e., there is ε₁ > 0 such that g is strictly concave on (0, ε₁]. By reducing ε₁ if necessary, we get g(ε₁) > 0. Since g(0) = 0, it holds g(u) ≥ (g(ε₁)/ε₁)|u| for all u ∈ (0, ε₁) by concavity. Due to symmetry, this holds for all u with |u| ≤ ε₁. Since g(u) ≥ 0 for all u by (B4), it holds ½u² + s g(u) ≥ ½ε₁|u| for all |u| ≥ ε₁. This proves

  ½u² + s g(u) ≥ min(½ε₁, s g(ε₁)/ε₁)|u| for all u.

Hence, the claim follows with q_0 := min(½ε₁, s g(ε₁)/ε₁) by Lemma 3.5. Second, if (B3.c) is satisfied, then there are ε₂, τ > 0 such that g(u) ≥ τ for all u with |u| ∈ (0, ε₂), as g is lower semicontinuous. Therefore, it holds g(u) ≥ τ ≥ (τ/ε₂)|u| if |u| ∈ (0, ε₂). The claim follows as above by Lemma 3.5.

Remark 3.7.
1. In general, the constant u 0 in Theorem 3.6 depends on s and the structure of g.

2.
We note that the second claim concerning q_0 in Theorem 3.6 holds for all s > 0 and does not depend on the first claim, due to Assumption (B4). One can replace g(u) ≥ 0 by the prerequisite of Lemma 3.5.
3. Assumption B also allows functions of the form g(u) = g̃(u) + δ_D(u) with some g̃ : R → R̄ and the indicator function δ_D of a set D ⊆ R. This means the analysis includes constrained optimization problems, e.g., with standard box constraints.

Example 3.8. The proximal map of (3.1) with g(u) = |u|_0 is given by the hard-thresholding operator, defined by

  prox_{s|·|_0}(q) = {0} if |q| < √(2s),  {q} if |q| > √(2s),  {0, q} if |q| = √(2s).

With the above considerations in mind, let us discuss the minimization problem that corresponds to the pointwise minimization of the integrand in (1.1).
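The hard-thresholding operator of Example 3.8 is straightforward to implement; the sketch below (our own, adopting the convention of keeping u = q in the two-valued boundary case |q| = √(2s)) applies it componentwise:

```python
import numpy as np

def hard_threshold(q, s):
    """One element of prox_{s|.|_0}(q): compare the cost s of keeping u = q
    (where |u|_0 = 1) with the cost 0.5*q**2 of setting u = 0.
    The prox is {0, q} when |q| = sqrt(2*s); here we keep q in that case."""
    return np.where(np.abs(q) >= np.sqrt(2.0 * s), q, 0.0)

# Threshold sqrt(2*0.5) = 1: entries with magnitude below 1 are set to zero.
out = hard_threshold(np.array([0.5, -3.0, 1.0]), s=0.5)
```

The jump between 0 and values of magnitude at least √(2s) is exactly the sparsity-promoting behaviour that Theorem 3.6 establishes for a much broader class of functions g.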
If 1/L > s_0, see Theorem 3.6, then all global solutions u of this problem satisfy u(x) = 0 or |u(x)| ≥ u_0(1/L) for almost all x ∈ Ω, since the pointwise problems are of the form (3.1). The claim follows by definition and from Theorem 3.6.

Analysis of the proximal gradient algorithm
In this section, we will analyze the proximal gradient algorithm.
The functional to be minimized in (4.1) can be written as an integral functional. In this representation the minimization can be carried out pointwise by using the previous results. The following statements are generalizations of [16, Lemmas 3.10, 3.11, Theorem 3.12], and the corresponding proofs can be carried over easily.

Lemma 4.2.
Let u^k ∈ U_ad be given. Then the subproblem (4.2) is solvable, and u^{k+1} ∈ L²(Ω) is a global solution if and only if

  u^{k+1}(x) ∈ prox_{L^{-1}g}( (1/L)(L u^k(x) − ∇f(u^k)(x)) )  for almost all x ∈ Ω.   (4.3)

Proof. Let us show that we can choose a measurable function satisfying the inclusion (4.3). The set-valued mapping prox_{L^{-1}g} has closed graph and is thus outer semicontinuous. Then by [14, Corollary 14.14], the set-valued mapping x ⇒ prox_{L^{-1}g}((1/L)(L u^k(x) − ∇f(u^k)(x))) is measurable, and [14, Corollary 14.6] implies the existence of a measurable function u such that u(x) ∈ prox_{L^{-1}g}((1/L)(L u^k(x) − ∇f(u^k)(x))) for almost all x ∈ Ω. Due to the growth condition of Lemma 3.3, we have u ∈ L²(Ω), and hence u solves (4.2). If u^{k+1} solves (4.2), then (4.3) follows by a standard argument.

We introduce the following notation: for a sequence (u^k) ⊂ L²(Ω) define I_k := {x ∈ Ω : u^k(x) ≠ 0} and χ_k := χ_{I_k}. Let us now investigate convergence properties of Algorithm 4.1. The following lemma will be helpful for what follows.

Theorem 4.4.
For L > L_f let (u^k) be a sequence of iterates generated by Algorithm 4.1. Then the following statements hold:
(i) The sequence (f(u^k) + j(u^k)) is monotonically decreasing and converging.
(ii) If f + j is weakly coercive, then (u^k) is bounded in L²(Ω).
(iii) It holds Σ_{k≥0} ‖u^{k+1} − u^k‖²_{L²(Ω)} < ∞; in particular, u^{k+1} − u^k → 0 in L²(Ω).
(iv) Let s_0 be as in Theorem 3.6 and assume 1/L > s_0. Then the sequence of characteristic functions (χ_k) converges in L¹(Ω) and pointwise a.e. to some characteristic function χ.
(i) Using the optimality of u^{k+1} and the Lipschitz continuity of ∇f, we find that the inequality

  f(u^{k+1}) + j(u^{k+1}) ≤ f(u^k) + j(u^k) − ((L − L_f)/2) ‖u^{k+1} − u^k‖²_{L²(Ω)}

holds. Hence, (f(u^k) + j(u^k)) is decreasing. Convergence follows because f and j are bounded from below.
(ii) Weak coercivity of the functional implies that (u^k) is bounded. (iii) Summing the descent inequality from (i) over k yields Σ_k ‖u^{k+1} − u^k‖²_{L²(Ω)} < ∞. (iv) If χ_{k+1}(x) ≠ χ_k(x), then exactly one of u^k(x), u^{k+1}(x) vanishes while the other has modulus at least u_0 by Theorem 3.6, so that |u^{k+1}(x) − u^k(x)| ≥ u_0 and hence

  ‖χ_{k+1} − χ_k‖_{L¹(Ω)} ≤ u_0^{-2} ‖u^{k+1} − u^k‖²_{L²(Ω)}.

Hence, (χ_k) is a Cauchy sequence in L¹(Ω), and therefore also converging in L¹(Ω), i.e., χ_k → χ for some characteristic function χ. Pointwise a.e. convergence of (χ_k) can be proven by Fatou's lemma.
As a consequence, we get the following result: for iterates as in Theorem 4.4, u^{k+1}(x) − u^k(x) → 0 for almost all x ∈ Ω.

Proof. By the lemma of Fatou, we have

  ∫_Ω Σ_{k=0}^∞ |u^{k+1}(x) − u^k(x)|² dx ≤ Σ_{k=0}^∞ ‖u^{k+1} − u^k‖²_{L²(Ω)} < ∞.

This implies Σ_{k=0}^∞ |u^{k+1}(x) − u^k(x)|² < ∞ for almost all x ∈ Ω, and the claim follows.

Stationarity conditions for weak limit points from inclusions
Under a weak coercivity assumption, Theorem 4.4 implies that Algorithm 4.1 generates a sequence (u^k) with a weak limit point u* ∈ L²(Ω). Due to the lack of weak lower semicontinuity of the term u ↦ ∫_Ω g(u) dx, however, we cannot conclude anything about the value of the objective functional at a weak limit point. Unfortunately, we are not able to argue as it was done in [16, Thm. 3.14] for the special choice g(u) := |u|_0. Nevertheless, by using results of set-valued analysis we will show that a weak limit point of a sequence (u^k) of iterates satisfies a certain inclusion in almost every point x ∈ Ω, which can be interpreted as a pointwise stationarity condition for weak limit points.

By definition, the iterates satisfy the inclusion (4.3) for almost all x ∈ Ω. However, this inclusion seems to be useless for a convergence analysis, as the function u^{k+1} on the left of the inclusion as well as the arguments L u^k − ∇f(u^k) only have weakly converging subsequences at best. The idea is to construct a set-valued mapping G : R ⇒ R such that a solution u^{k+1} of (4.2) satisfies the inclusion u^{k+1}(x) ∈ G(z^k(x)) in almost every point x ∈ Ω for some z^k ∈ L²(Ω), where (z^k) converges strongly or pointwise almost everywhere. Here, we will use

  z^k := −(∇f(u^k) + L(u^{k+1} − u^k)).

By Theorem 4.4, we have u^{k+1} − u^k → 0 in L²(Ω) and pointwise almost everywhere. With the additional assumption that subsequences of (∇f(u^k)) converge pointwise almost everywhere, the argument of the set-valued mapping converges pointwise almost everywhere. In the context of optimal control problems, such an assumption is not a severe restriction. So there is a chance to pass to the limit in the inclusion (4.5).
Lemma 4.7. Let u^{k+1} be a solution of (4.2). Then u^{k+1}(x) ∈ G(z^k(x)) for almost all x ∈ Ω, where the set-valued mapping G : R ⇒ R is given by

  G(z) := {u ∈ R : u ∈ prox_{L^{-1}g}(u + z/L)}.

Unfortunately, the set-valued map G is not monotone in general. If g were convex, then the optimality condition of (4.2) would be z^k(x) ∈ ∂g(u^{k+1}(x)) for almost all x ∈ Ω; hence one could choose G = ∂g*, where g* denotes the convex conjugate of g. For the rest of this section, we will always suppose that g satisfies Assumption B. As a first direct consequence of the definition of G we get the following.

A convergence result for inclusions
Let us recall a few helpful notions and results from set-valued analysis that can be found in the literature, see e.g., [2,14].

A set-valued mapping S is called locally bounded at a point x if there is a neighborhood U of x such that the image S(U) is bounded.
A set-valued mapping S is outer semicontinuous if and only if it has a closed graph.
The following convergence analysis relies on [2, Thm. 7.2.1]. We want to extend this result to set-valued maps into R^n that are not locally bounded. Let us define a set-valued map conv_∞ F : R^m ⇒ R^n that serves as a generalization of x ↦ conv(F(x)) in the locally unbounded situation. By definition, it holds gph F ⊂ gph conv_∞ F. In addition, we have conv(F(x)) ⊂ (conv_∞ F)(x). If F is locally bounded at x, then (conv_∞ F)(x) = conv(F(x)), which can be proven using Carathéodory's theorem. In general, dom conv_∞ F is strictly larger than dom F.

Theorem 4.14. Let (Ω, A, µ) be a measure space and F : R^m ⇒ R^n a set-valued map. Let sequences of measurable functions (x_n), (y_n) be given such that
1. x_n converges almost everywhere to some function x : Ω → R^m,
2. y_n converges weakly to a function y in L¹(µ, R^n),
3. y_n(t) ∈ F(x_n(t)) for almost all t ∈ Ω.
Then y(t) ∈ (conv_∞ F)(x(t)) for almost all t ∈ Ω.

Stationarity conditions for weak limit points
Recall that for iterates (u^k) of Algorithm 4.1 and the corresponding sequence (z^k) we have by construction u^{k+1}(x) ∈ G(z^k(x)) for almost all x ∈ Ω. By Theorem 4.14, we could then expect the inclusion u*(x) ∈ (conv_∞ G)(−∇f(u*)(x)) to hold pointwise almost everywhere in the subsequential limit. However, the convexification of G results in a set-valued map that is very large. In order to obtain a smaller inclusion in the limit, we will employ the result of Corollary 4.9: the graph of G can be split into three clearly separated components. In the sequel, we will show that we can pass to the limit with each component separately, which leads to a smaller set-valued map in the limit. This observation motivates the following splitting of the map G:
1. G+ : R ⇒ R with u ∈ G+(z) :⇔ u ∈ G(z) and u > 0,
2. G− : R ⇒ R with u ∈ G−(z) :⇔ u ∈ G(z) and u < 0,
3. G0 : R ⇒ R with u ∈ G0(z) :⇔ u ∈ G(z) and u = 0.
The mappings G+, G− and G0 are depicted in Figure 2 for the special choice of g considered in Section 5.1. Obviously, we have by construction

  gph G = gph G+ ∪ gph G− ∪ gph G0.   (4.6)

Moreover, the mappings G, G+, G− and G0 are outer semicontinuous.

Proof. G being outer semicontinuous is equivalent to the closedness of its graph. Let (u_n), (q_n) be sequences such that u_n → u, q_n → q and u_n ∈ G(q_n). By definition it holds

  g(u_n) ≤ g(v) + (L/2)(v − u_n)² − q_n(v − u_n)

for all v ∈ R. Passing to the limit in the above inequality yields

  g(u) ≤ g(v) + (L/2)(v − u)² − q(v − u) for all v ∈ R,

due to the lower semicontinuity of g. Hence u ∈ G(q), which is the claim for G. For G+, G−, G0 the claim follows as their graphs are intersections of closed sets with gph G, which follows from Corollary 4.9 (for suitably chosen L in the case of G+, G−).
In the sequel we want to apply Theorem 4.14 to each of the set-valued maps in (4.6) separately. Let us first show the next helpful result.
holds for almost all x ∈ Ω.
Let us remark that the assumption of pointwise convergence of (∇f (u k )) is not a severe restriction. If ∇f : L 2 (Ω) → L 2 (Ω) is completely continuous, then this assumption is fulfilled. For many control problems, this property of ∇f is guaranteed to hold.
Interestingly, we can get rid of the convexification operator conv_∞ if we assume that the whole sequence (∇f(u^k)) converges pointwise almost everywhere.

Theorem 4.19. Let (u^k) be a sequence of iterates generated by Algorithm 4.1 with weak limit point u* ∈ L²(Ω). Assume ∇f(u^k) → ∇f(u*) pointwise almost everywhere. Then

  u*(x) ∈ G(−∇f(u*)(x))

holds for almost all x ∈ Ω.
Let ε ∈ (0, ε̄). Set I := {x : |z̄ − z(x)| < ε} and I_K := {x ∈ I : |z^k(x) − z̄| < ε for all k > K}. The sequence (I_K) is monotonically increasing. Since z^k(x) → z(x) for almost all x ∈ Ω, we have χ_{I_K} → χ_I in L¹(Ω) and pointwise almost everywhere. Let x ∈ I. Then there is K such that x ∈ I_K. This implies u^k(x) ∈ B_ε(ũ) for all k > K. Here, the pointwise convergence of the whole sequence (z^k) is needed. The sum Σ_{k=K+1}^∞ (χ+_{k+1}χ−_k + χ−_{k+1}χ+_k)(x) counts the number of switches between values larger than ũ+ and smaller than ũ− from u^k(x) to u^{k+1}(x). Since this sum is finite for almost all x ∈ Ω, there is only a finite number of such switches. Hence there is K′ > K such that either u^k(x) ≥ ũ+ for all k > K′ or u^k(x) ≤ ũ− for all k > K′. Define the sets S+_K and S−_K accordingly. The sequences (S+_K) and (S−_K) are increasing, and the corresponding inclusions hold for almost all x ∈ I, which implies the claimed inclusion for almost all x ∈ Ω. Since we can cover the complement of gph G by countably many such sets, the claim follows.

Pointwise convergence of iterates
So far we were able to show that weak limit points of iterates (u k ) satisfy a certain inclusion in a pointwise sense. However, the resulting set in the limit might still be large or even unbounded in general. Assuming that G is (locally) single-valued on its components G + , G − , G 0 , we can show local pointwise convergence of a subsequence of iterates (u kn ) to a weak limit point u * ∈ L 2 (Ω).
In the next result this is illustrated for the map G+; it can be shown for the components G−, G0 similarly. To this end, we set in the following χ+_k := χ_{{x∈Ω : u^k(x)>0}}, with χ+_k → χ+ in L¹(Ω) and pointwise almost everywhere by Lemma 4.17.
Theorem 4.20. Let z̄ ∈ dom(G+). Assume that G+ : R → R is single-valued and locally bounded on B_ε̄(z̄) ∩ dom(G+) for some ε̄ > 0. Let u^{k_n} ⇀ u* in L²(Ω) and assume ∇f(u^{k_n})(x) → ∇f(u*)(x) pointwise almost everywhere. For ε ∈ (0, ε̄] define the set I_ε := {x ∈ supp(χ+) : |z(x) − z̄| < ε}, where z := −∇f(u*). Then u^{k_n}(x) → u*(x) and G+(z(x)) = {u*(x)} hold for almost all x ∈ I_ε. Furthermore, u* has the corresponding fixed-point property.

Proof. Let u^{k_n+1} ⇀ u* in L²(Ω). By the assumption and Corollary 4.9, it holds z^{k_n}(x) → z(x) := −∇f(u*)(x) pointwise almost everywhere. Let ε ∈ (0, ε̄) be given. Take x ∈ I_ε such that z^{k_n}(x) → z(x). Then there is K > 0 such that |z^{k_n}(x) − z̄| < ε̄ for all k_n > K. Since x ∈ supp(χ+) and χ+_k → χ+ in L¹(Ω) and pointwise almost everywhere, there is K′ > 0 such that x ∈ supp(χ+_k) for all k > K′. Hence, for k_n sufficiently large we have u^{k_n+1}(x) ∈ G+(z^{k_n}(x)). Since G+ is single-valued, locally bounded and outer semicontinuous on B_ε̄(z̄) ∩ dom(G+), it is continuous there, see also [14, Cor. 5.20]. This implies (conv_∞ G+)(z(x)) = G+(z(x)). Then by Theorem 4.18, G+(z(x)) = {u*(x)}, and the convergence u^{k_n}(x) → u*(x) follows. The fixed-point property is a consequence of the closedness of the graph of the proximal operator. As x ∈ I_ε was chosen arbitrarily, and I = ∪_{ε∈(0,ε̄)} I_ε, the claim is proven.
The above result requires local boundedness of the set-valued map G, which is not satisfied in general. For some interesting choices of g, e.g. g(u) := |u| p , it can be proven, see Section 5. Let us give an example of a locally unbounded map G below.

Strong convergence of iterates
Many optimal control problems of type (P) include a smooth cost term of the form u ↦ (α/2)‖u‖²_{L²(Ω)}, α > 0. For the remainder of this section, we treat this term explicitly in the convergence analysis to obtain almost everywhere and strong convergence of a subsequence. Therefore, let g̃ : R → R̄ satisfy Assumption B, write g(u) = (α/2)u² + g̃(u), and consider a sequence of iterates computed by

  u^{k+1} := argmin_{u∈L²(Ω)} ∇f(u^k)(u − u^k) + (L/2)‖u − u^k‖²_{L²(Ω)} + (α/2)‖u‖²_{L²(Ω)} + ∫_Ω g̃(u(x)) dx.   (4.7)

The solution to (4.7) is now given by

  u^{k+1}(x) ∈ prox_{(L+α)^{-1}g̃}( (L u^k(x) − ∇f(u^k)(x)) / (L+α) )

for almost every x ∈ Ω. It follows that all the analysis of this section still applies, and all results can be transferred up to a possible change of notation. Furthermore, we adapt the set-valued map G : R ⇒ R from Lemma 4.7, which is then defined by

  u ∈ G(z) :⇔ u ∈ prox_{(L+α)^{-1}g̃}( (z + Lu) / (L+α) ).

For simplicity, we assume dom(g̃) = [−b, b] with b ∈ (0, ∞], i.e., the subproblem (4.7) is equivalent to a box-constrained optimization problem of the form

  u^{k+1} := argmin_{u∈L²(Ω)} ∇f(u^k)(u − u^k) + (L/2)‖u − u^k‖²_{L²(Ω)} + (α/2)‖u‖²_{L²(Ω)} + ∫_Ω g̃(u(x)) dx  subject to |u(x)| ≤ b for almost every x ∈ Ω.

To obtain strong convergence of iterates in L¹(Ω) and an L-stationarity condition almost everywhere, we need to put stronger and more restrictive assumptions on g̃, as the next theorem shows. To this end, let us introduce the following extension of Assumption B.
First, we have the following necessary optimality condition for (4.7) due to Assumption (B5).

Corollary 4.22. Let u^{k+1} be a solution to (4.7) and let g̃ in addition satisfy (B5). Then u^{k+1} satisfies the corresponding pointwise necessary optimality condition.

Proof. Solving (4.7) pointwise is equivalent to solving the constrained problem min_{u : |u| ≤ b} in every Lebesgue point x. For x ∈ I_{k+1} it holds u^{k+1}(x) ≠ 0, and therefore the above problem is differentiable near u^{k+1}(x). The claimed inequality is the corresponding necessary optimality condition.
Let us for the rest of this section assume that g̃ satisfies (B5) and (B6) in addition to Assumption B. This enables us to give more information about the set-valued map G, as the next result shows: elements of G are (possibly unique) solutions of an associated variational inequality.

Lemma 4.23. Let u ∈ G(z) with |u| ≥ u_0. Then u satisfies the variational inequality (4.9). If moreover u_0 ≥ u_I with u_I := u_I(1/(L+α)) as in (B6), then we have u ∈ G(z) if and only if u satisfies (4.9).
Proof. Let us discuss the case u ≥ u_0 only. If u ∈ G(z) for some z ∈ R, then u is by definition a global solution of the corresponding pointwise problem. Hence, the first-order necessary optimality condition holds, which is the claim. Assume that u_I ≤ u_0 holds, and let u > 0 satisfy (4.9). Then u is in particular a solution of the problem restricted to [u_I, ∞). By convexity, u is the unique solution of the latter, and since by assumption (z + Lu)/(L + α) ≥ q_0(1/(L + α)), it follows from Theorem 3.6 that there is a global solution larger than u_0 to the unconstrained problem; together this implies u ∈ G(z).
Lemma 4.24. There exists a continuous mapping G : L²(Ω) → L²(Ω) such that every solution u^{k+1} of (4.7) satisfies u^{k+1} = χ_{k+1} G(z^k/α).

Proof. We set s := 1/α and u_I := u_I(s) as in (B6). Note that by the assumptions, for α > 0 and |u| ≥ u_0 ≥ u_I, the pointwise problems can be expressed via the restricted proximal map prox^{u_I}_{s g̃}, defined corresponding to (4.10). Due to Assumption (B6) and Lemma 4.23, u^{k+1}(x) is the only element of G(z^k(x)) \ {0} for almost all x ∈ I_{k+1}, and it holds u^{k+1}(x) = prox^{u_I}_{s g̃}(z^k(x)/α). It is easy to see that prox^{u_I}_{s g̃} is single-valued for |z| > 0. Since it is in addition outer semicontinuous and locally bounded for |z| ≥ z_I, it is also continuous on {z : |z| ≥ z_I}, see also [14, Corollary 5.20]. Let u ∈ prox^{u_I}_{s g̃}(z). By optimality of u, we have

  −zu + ½u² + s g̃(u) ≤ −z sign(u) u_I + ½u_I² + s g̃(u_I).
Dividing by |u| > 0 and using that u_I/|u| ≤ 1, we obtain the growth estimate |prox^{u_I}_{s g̃}(z)| ≤ 2|z| + c for all |z| ≥ z_I, with some c > 0 independent of z.
Let l : R → R denote the continuous function given by these formulas, and define (G z)(x) := l(z(x)) for z : Ω → R. Then by a well-known result on superposition operators, see e.g. [1, Theorem 3.1], the operator G is continuous from L²(Ω) to L²(Ω), and the claim follows.

Now we are able to prove strong convergence of a subsequence of (u^k), similar to [16, Thm. 3.17].
Theorem 4.25. Suppose complete continuity of ∇f and let (u^k) ⊂ L²(Ω) be a sequence generated by the iteration (4.7) with weak limit point u*. Under the same assumptions as in Lemma 4.24, u* is a strong sequential limit point of (u^k) in L¹(Ω).
Proof. By Lemma 4.24 there exists a continuous mapping G : L²(Ω) → L²(Ω) such that u^{k+1} = χ_{k+1} G(z^k/α). Let u^{k_n} ⇀ u* in L²(Ω). Again, by Theorem 4.4 and the complete continuity of ∇f, we obtain strong convergence of the sequence

  z^{k_n} := −(∇f(u^{k_n}) + L(u^{k_n+1} − u^{k_n})) → −∇f(u*) =: z*

in L²(Ω), as well as χ_k → χ in L^p(Ω) for all p < ∞ and u^{k_n+1} ⇀ u*. Then the convergence u^{k_n+1} = χ_{k_n+1} G(z^{k_n}/α) → χ G(z*/α) in L¹(Ω) follows by Hölder's inequality. Since strong and weak limit points coincide, it follows u^{k_n} → u* in L¹(Ω) and u* = χ G(z*/α).

With the assumptions of Theorem 4.25 we can find an almost everywhere converging subsequence of iterates, i.e., u^{k_n}(x) → u*(x) for almost every x ∈ Ω. By the closedness of the graph of the mapping prox_{sg}, we obtain the inclusion (4.11) in the limit, i.e., u* is L-stationary for the problem in almost every point. If L = 0 in (4.11), then we obtain by Lemma 4.2 the pointwise minimization property; hence, in this case u* satisfies the Pontryagin maximum principle.

The proximal gradient method with variable stepsize
The convergence results of this section require knowledge of the Lipschitz modulus L_f of ∇f. This can be overcome by a line search with respect to the parameter L subject to a suitable decrease condition, which is a widely applied technique.
is satisfied.
The convergence results as in Theorem 4.4 can be carried over: Theorem 4.4 then holds without the assumption L > L_f. The assumption 1/L > s_0 has to be replaced by (lim sup_k L_k)^{-1} > s_0. This is satisfied if s_0 = 0, which is true by Theorem 3.6 if one of (B3.b), (B3.c) is valid.

Applications of the proximal gradient method

Optimal control with L^p control cost, p ∈ (0, 1)

In [16], the discussed proximal method was analyzed and applied to optimal control problems with L^0 control cost, i.e., g(u) := (α/2)u² + |u|_0. In this section, we discuss the problem with

  g(u) := (α/2)u² + β|u|^p + δ_{[−b,b]}(u),

where p ∈ (0, 1) and b ∈ (0, ∞], and consider

  min_{u∈L²(Ω)} f(u) + ∫_Ω g(u(x)) dx.   (5.1)

To find a solution to (5.1) with Algorithm 4.1, the subproblem, interpreted in terms of (4.7) with g̃(u) := β|u|^p + δ_{[−b,b]}(u), has to be solved in every iteration. According to Lemma 4.2, u^{k+1} is a solution to the subproblem if and only if the pointwise inclusion (4.3) holds. Due to Theorem 3.6, it holds u^{k+1}(x) = 0 or |u^{k+1}(x)| ≥ u_0 for all k. The particular choice of g allows to compute the constant u_0 explicitly by solving min_{u≠0} (½u² + s g̃(u))/|u|; as a consequence of Lemma 3.5, it is given by

  u_0 = (2sβ(1 − p))^{1/(2−p)}.

We recall the definition of the set-valued map G = G_{L,α,s} : R ⇒ R, which in this case reads

  u ∈ G(z) :⇔ u ∈ prox_{(L+α)^{-1}g̃}( (z + Lu)/(L + α) ).

Note that g̃ satisfies Assumptions (B5) and (B6) due to its structure. This allows to give an equivalent but more precise characterization of G, as Lemma 4.23 applies to u^{k+1}(x) on I_{k+1}.
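The jump behaviour described by Theorem 3.6 can be observed numerically for g̃(u) = β|u|^p. The following sketch is our own illustration: the parameter `sbeta` stands for the product s·β, the values are chosen arbitrarily, and a grid search is used as a crude stand-in for the prox map.

```python
import numpy as np

def prox_lp(q, sbeta, p, grid):
    """One minimizer of 0.5*(u - q)**2 + sbeta*|u|**p over the grid,
    i.e., a numerical stand-in for the prox map of u -> sbeta*|u|**p."""
    vals = 0.5 * (grid - q) ** 2 + sbeta * np.abs(grid) ** p
    return grid[np.argmin(vals)]

# Sweep q: the minimizer is 0 up to a threshold q_0 and then jumps to a
# value of magnitude >= u_0 > 0; no minimizers occur in (0, u_0).
grid = np.linspace(-4.0, 4.0, 8001)
sols = [prox_lp(q, sbeta=0.1, p=0.5, grid=grid) for q in np.linspace(0.0, 2.0, 201)]
nonzero = [u for u in sols if abs(u) > 1e-8]
gap = min(nonzero)  # numerically close to u_0 (about 0.22 for these parameters)
```

For p = 1/2 and sβ = 0.1, the formula u_0 = (2sβ(1 − p))^{1/(2−p)} gives u_0 = 0.1^{2/3} ≈ 0.215, which matches the observed gap up to the grid and sweep resolution.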
A visualization of G is given in Figure 2 below. With a suitable choice of parameters, we can apply Theorem 4.25 to the L^p problem to obtain a strongly convergent subsequence.

Corollary 5.2.
Let α > 0 and let (u^k) be a sequence of iterates. Furthermore, assume L ≤ (2/p − 1)α. Then the assumptions of Theorem 4.25 are satisfied. If in addition ∇f is completely continuous from L²(Ω) to L²(Ω), then every weak sequential limit point u* ∈ L²(Ω) is a strong sequential limit point in L¹(Ω).
Proof. A short calculation yields that the assumptions on the parameters imply u_0 ≥ u_I. Here, u_I is the positive point of inflection of the integrand in (5.1), and h_{q,s} is convex for all q ∈ R on [u_I, ∞) and (−∞, −u_I], respectively, which corresponds to Assumption (B6). The claim now follows by Lemma 4.24 and Theorem 4.25.

Optimal control with discrete-valued controls
Let us investigate the optimization problem with controls taking discrete values. That is, we choose g(u) as the indicator function of the integers, i.e., g(u) := δ_Z(u). The problem now reads

  min_{u∈L²(Ω)} f(u)  subject to u(x) ∈ Z for almost all x ∈ Ω.

Note that this choice satisfies Assumption (B3.c). Applying Algorithm 4.1, the subproblem to solve is given by

  min_{u∈L²(Ω)} ∇f(u^k)(u − u^k) + (L/2)‖u − u^k‖²_{L²(Ω)}  subject to u(x) ∈ Z a.e.,

and can be solved pointwise and explicitly. The analysis carried out in Section 4 is applicable; moreover, the special choice of g comes along with the following desirable result.
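The pointwise subproblem for the indicator function of the integers reduces to rounding, independently of the prox parameter s. A one-line sketch (our own; it adopts numpy's round-half-to-even convention in the two-valued tie case q = n + 1/2):

```python
import numpy as np

def prox_integers(q):
    """prox_{s*delta_Z}(q) for every s > 0: the indicator term only restricts
    u to the integers, so 0.5*(u - q)**2 is minimized by the nearest
    integer(s). Ties q = n + 0.5 are two-valued; np.rint picks the even
    neighbour."""
    return np.rint(q)

vals = prox_integers(np.array([1.3, -0.8, 0.49]))
```

Applied componentwise to the gradient step u^k − L^{-1}∇f(u^k), this solves the subproblem above exactly.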
Thus, (u_k) is a Cauchy sequence in L¹(Ω) and therefore convergent in L¹(Ω), and it holds u_k → u* in L¹(Ω).
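For g = δ_ℤ, possibly intersected with a box [−b, b] as in the examples below, the pointwise prox reduces to rounding to the nearest admissible integer. A minimal sketch (illustration only; note the prox of an indicator function does not depend on L):

```python
import math

def prox_integer(z, b=math.inf):
    """Pointwise prox of delta_Z with box [-b, b]: the nearest integer to z,
    pushed back to the closest integer inside the box if necessary.
    At a tie z = k + 1/2 both neighbors are valid prox points; Python's
    round() picks the even one."""
    u = round(z)
    if u > b:
        return math.floor(b)
    if u < -b:
        return -math.floor(b)
    return u
```

Because the output set is discrete, two successive iterates that are close in value must coincide pointwise, which is the mechanism behind the Cauchy-sequence statement above.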

Numerical experiments
In this section we finally apply the proximal gradient method to optimal control problems of type (P) and carry out numerical experiments for cost functionals with different g. In the following, let f_l denote the reduced tracking-type functional f_l(u) := ‖S_l u − y_d‖²_{L²(Ω)}, where S_l is the weak solution operator of the linear Poisson equation, i.e., of the state equation
−Δy + d(x, y) = u in Ω,   (6.1)
y = 0 on ∂Ω,   (6.2)
with d ≡ 0. In the semilinear case we assume that the nonlinearity d satisfies
|∂d(x, y)/∂y| + |∂²d(x, y)/∂y²| ≤ C_M
for almost all x ∈ Ω and all |y| ≤ M. Then the equation is uniquely solvable; we refer to, e.g., [8, 9]. In addition, we define f_sl as the analogous reduced functional for the semilinear state equation. Furthermore, we choose Ω := (0, 1)² to be the underlying domain in all following examples.
To solve the partial differential equation, the domain is divided into a regular triangular mesh and the PDE (6.1), (6.2) is discretized with piecewise linear finite elements. The controls are discretized with piecewise constant functions on the triangles. The finite-element matrices were created with FEniCS [12]. If not mentioned otherwise, the mesh size is approximately h = √2/160 ≈ 0.00884. In each iteration a suitable constant L_k > 0 needs to be determined that satisfies the decrease condition (4.12). Note that L_k^{−1} can be seen as a step size. In [16] several step-size selection strategies are proposed. In our tests, we use a simple Armijo-like backtracking line search method (BT): given an initial L_0 > 0 and a widening factor θ ∈ (0, 1), determine L_k as the smallest accepted number of the form L_0 θ^{−i}, i = 0, 1, .... This method ensures a decrease in the objective values along the iterates, but it turns out to be very slow for large L_0, as the corresponding step size L_k^{−1} gets small. The values of L_0 used in our tests are reported with each example. The stopping criterion is as follows: if |f(u_{k+1}) + g(u_{k+1}) − (f(u_k) + g(u_k))| ≤ 10^{−12}, STOP.
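The overall iteration with the BT strategy can be sketched as follows. This is our own minimal illustration: acceptance of L_k is modeled simply as decrease of the full objective J = f + g, whereas the paper's condition (4.12) may be stronger; grad_f, J and the pointwise prox are supplied by the caller.

```python
import numpy as np

def prox_gradient_bt(u0, grad_f, J, prox, L0=1e-4, theta=0.5,
                     tol=1e-12, max_iter=100000):
    """Proximal gradient loop with Armijo-like backtracking (BT):
    L_k is the smallest accepted value of the form L0 * theta**(-i)."""
    u = np.asarray(u0, dtype=float)
    Ju = J(u)
    for _ in range(max_iter):
        gradient = grad_f(u)
        L = L0
        for _ in range(200):                 # cap on backtracking steps
            u_new = prox(u - gradient / L, L)
            J_new = J(u_new)
            if J_new <= Ju:                  # accept: objective decreased
                break
            L /= theta                       # theta in (0,1): enlarge L
        if abs(J_new - Ju) <= tol:           # stopping criterion from the text
            return u_new
        u, Ju = u_new, J_new
    return u
```

With g ≡ 0 and prox the identity, this reduces to gradient descent with a backtracked step size, which makes the slow behavior for large L_0 easy to reproduce.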

Example 1
Let g(u) := |u|^p + δ_{[−b,b]}(u) for p ∈ (0, 1) and consider the corresponding problem (5.1) with f = f_l. Setting U_ad := {u ∈ L²(Ω) : |u(x)| ≤ b a.e. on Ω}, the problem is equivalent to the minimization over U_ad with the indicator term dropped. The first example is taken from [16], where the proximal gradient algorithm was investigated for (sparse) optimal control problems with L⁰(Ω) control cost. Since ∫_Ω |u|^p dx → ∫_Ω |u|_0 dx as p ↘ 0, we expect similar solutions. We choose the same problem data as in [11, 16]. That is, if not mentioned otherwise, y_d(x, y) = 10x sin(5x) cos(7y) and α = 0.01, β = 0.01, b = 4. A computed solution for p = 0.8 is shown in Figure 3.
Convergence for decreasing p-values. In the following we consider solutions for different values of p. We use the same data and discretization as above and set L_0 = 0.0001. Table 1 shows the results, including the result of applying the iterative hard-thresholding algorithm IHT-LS from [16] to the problem with p = 0, which is in agreement with our expectation. In this implementation we used a mesh size of h = √2/500 ≈ 0.0028.
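For the limiting case p = 0, the pointwise prox step used by IHT-type methods has a closed form. The following sketch is our own derivation for illustration, for the cost g(u) = (α/2)u² + β|u|_0 and with the box constraint omitted:

```python
import numpy as np

def prox_l0(z, L, alpha, beta):
    """Hard-thresholding step: minimize over u
        L/2 (u - z)^2 + alpha/2 u^2 + beta |u|_0.
    For u != 0 the minimum is at u = L z / (L + alpha) with value
    L alpha z^2 / (2 (L + alpha)) + beta; comparing with the value
    L z^2 / 2 at u = 0 gives the threshold |z| > sqrt(2 beta (L + alpha)) / L."""
    z = np.asarray(z, dtype=float)
    threshold = np.sqrt(2.0 * beta * (L + alpha)) / L
    return np.where(np.abs(z) > threshold, L * z / (L + alpha), 0.0)
```

Comparing the output of this map with the L^p prox for small p makes the expected similarity of the solutions concrete.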
Discretization. Next, we solved the problem on different levels of discretization to investigate their influence; the results are shown in Table 2.
Convergence in the case L > (2/p − 1)α. So far, in every experiment the assumption on the parameters was naturally satisfied, so that strong convergence of the iterates can be proven according to Corollary 5.2. The numerical results confirmed the theory. We will now investigate the case where the assumption is not satisfied, i.e., we choose parameters such that L > (2/p − 1)α. In the following we present the results for the problem parameters α = 0.001, p = 0.9, L_0 = 0.005. Furthermore, we set b = 6. In our computations the algorithm took very long to reach the stopping criterion |J(u_{k+1}) − J(u_k)| ≤ 10^{−12}, as can be seen in Table 3. This might be due to the parameter choice and the step-size strategy. For smaller mesh sizes more iterations are needed.
Table 3: Performance for a bad choice of parameters across different mesh sizes.
Recall that the problem in the analysis with this choice of parameters is that the map G in Lemma 4.7 is not necessarily single-valued anymore on the set of points where an iterate does not vanish, see also Figure 2. Let u_I := u_I(β/α) > 0 denote the constant from Assumption (B6) and define the set Ω_{m,k} := {x ∈ Ω : 0 < |u_k(x)| < u_I}.
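For a piecewise constant discretized control, the measure of the set Ω_{m,k} just defined can be monitored along the iterations. A minimal sketch, assuming one control value per triangle and a uniform triangle area (matching the discretization described above):

```python
import numpy as np

def measure_omega_m(u_cells, u_I, cell_area):
    """Estimate |Omega_{m,k}| = |{x : 0 < |u_k(x)| < u_I}|: sum the areas of
    the cells whose value lies strictly between 0 and u_I in absolute value."""
    a = np.abs(np.asarray(u_cells, dtype=float))
    return cell_area * np.count_nonzero((a > 0.0) & (a < u_I))
```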
Then Ω_{m,k} is the set of points at which the crucial assumption in Lemma 4.24, which implies single-valuedness of G on ℝ \ {0}, is not satisfied. In our numerical experiments, however, we made the observation that the measure of the set Ω_{m,k} decreases as k → ∞, see Figure 4. Across different mesh sizes h, the measure decreases and tends to zero along the iterations.

Example 2

We now consider the problem with f = f_sl and g(u) = |u|^p, p ∈ (0, 1). This example can be found in [9] for semilinear control problems with L¹-cost. Here, f_sl is given by the standard tracking-type functional u ↦ ‖y_u − y_d‖²_{L²(Ω)}, where y_u is the solution of the semilinear elliptic state equation −Δy + y³ = u in Ω, y = 0 on ∂Ω.
The data is given by α = 0.002, β = 0.03, b = 12 and y_d = 4 sin(2πx₁) sin(πx₂) e^{x₁}. We use the parameter L_0 = 0.001. We made similar observations as in the linear case concerning the influence of the discretization and of different values of p. The behavior of the algorithm in the case of a bad choice of parameters is also as before (see Example 1).
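To reproduce such experiments without a finite-element library, the semilinear state equation can be solved by Newton's method on a uniform grid. This is our own finite-difference illustration of the state equation above; the paper itself uses piecewise linear finite elements via FEniCS.

```python
import numpy as np

def solve_semilinear(u, n=20):
    """Newton's method for a five-point finite-difference discretization of
    -Laplace(y) + y^3 = u on (0,1)^2 with y = 0 on the boundary.
    `u` is a vector of the (n-1)^2 interior right-hand-side values."""
    h = 1.0 / n
    m = n - 1
    # 1D second-difference matrix and the 2D Laplacian via Kronecker products
    T = (np.diag(2.0 * np.ones(m)) - np.diag(np.ones(m - 1), 1)
         - np.diag(np.ones(m - 1), -1)) / h ** 2
    I = np.eye(m)
    A = np.kron(I, T) + np.kron(T, I)
    y = np.zeros(m * m)
    for _ in range(50):
        F = A @ y + y ** 3 - u            # residual of the discrete equation
        J = A + np.diag(3.0 * y ** 2)     # Jacobian of the residual
        dy = np.linalg.solve(J, -F)
        y += dy
        if np.linalg.norm(dy) < 1e-12:
            break
    return y
```

Since the nonlinearity y ↦ y³ is monotone, Newton's method started from zero converges rapidly here; the dense solve keeps the sketch short but would be replaced by a sparse solver in practice.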

Example 3
In this last test, we consider an optimal control problem with discrete-valued controls. That is, we choose g(u) := δ_ℤ(u), where δ_M denotes the indicator function of a set M, i.e., δ_M(u) := 0 if u ∈ M, ∞ else. Here, the subproblem in Algorithm 4.1 can be solved pointwise and explicitly. We again adapt the setting from Example 1. In Figure 6, a plot of a computed optimal control is displayed. We used exactly the same problem data as before in Example 1, but set b = 2 and L_0 = 0.001. Again, we find that the algorithm is robust with respect to the discretization.