Gradient Methods on Strongly Convex Feasible Sets and Optimal Control of Affine Systems

The paper presents new results about convergence of the gradient projection and the conditional gradient methods for abstract minimization problems on strongly convex sets. In particular, linear convergence is proved, although the objective functional does not need to be convex. Such problems arise, in particular, when a recently developed discretization technique is applied to optimal control problems which are affine with respect to the control. This discretization technique has the advantage of providing higher accuracy than the known discretization schemes; it involves strongly convex constraints and a possibly non-convex objective functional. The applicability of the abstract results is proved in the case of linear-quadratic affine optimal control problems. A numerical example is given, confirming the theoretical findings.


Introduction
Numerically solving optimal control problems in which the control function appears linearly, and performing error analysis, are still challenging issues due to the typical discontinuity of the optimal control. Considerable progress was made in the past decade in the analysis of discretization schemes in combination with various methods of solving the resulting discrete-time optimization problems. The papers [1,2,25,27] apply to problems with linear dynamics, while [3,11] address nonlinear affine (in the control) dynamics. Usually the discretization is performed by Runge-Kutta schemes (mainly the Euler scheme) and the accuracy is at most of first order due to the discontinuity of the optimal control. Discretization schemes of higher accuracy were recently proposed in [21,24] for systems with linear dynamics and Mayer or Bolza problems. In both cases the error analysis is based on the assumption that the optimal control is of purely bang-bang type.
On the other hand, the papers [12,23] present convergence results for a version of the (abstract) Newton method for nonlinear problems, affine with respect to the control. Every step of the Newton method requires solving a linear-quadratic (affine in the control) optimal control problem for a linear system, namely a problem of the following type:
minimize $J(x,u) := \frac{1}{2} x(T)^\top Q x(T) + q^\top x(T) + \int_0^T \left( \frac{1}{2} x(t)^\top W(t) x(t) + x(t)^\top S(t) u(t) \right) dt$ (1)
subject to
$\dot{x}(t) = A(t) x(t) + B(t) u(t) + d(t), \quad x(0) = x_0, \quad t \in [0,T],$ (2)
$u(t) \in U := [-1,1]^m.$ (3)
Here, [0, T] is a fixed time horizon, Q ∈ R^{n×n}, q ∈ R^n, A(t), W(t) ∈ R^{n×n}, B(t), S(t) ∈ R^{n×m}, d(t) ∈ R^n for every t ∈ [0, T], and the superscript ⊤ denotes transposition. Admissible controls are all measurable functions u : [0, T] → U. The state of the system at time t is x(t) ∈ R^n, where x(·) is the (absolutely continuous) solution of (2), given an admissible control u(·). Linear terms are not included in the integrand in (1), since they can be shifted in a standard way into the differential equation (2). For solving the above problem one can apply the high-order discretization scheme developed in [21,24]. It results in a discrete-time optimal control problem (a mathematical programming problem), where the gradient of the objective function can be calculated following a standard procedure involving the solution of the associated adjoint system, so that gradient-type methods are conveniently applicable. And here we encounter a remarkable fact: although neither the objective functional (1) of the continuous-time problem (1)-(3) nor the control constraints (3) are strongly convex, it turns out that the feasible set of the discretized problem is strongly convex. This brings into consideration the issue of convergence of gradient methods for problems with strongly convex feasible sets and possibly non-convex objective functions (even if the functional J in (1) is convex on the set of admissible control-trajectory pairs, the discretized problem may fail to be convex!).
Versions of the gradient projection method (GPM) and the conditional gradient method (CGM) are widely studied (see e.g. [18,19] and the references therein), but results about linear convergence of the generated sequence of iterates seem to be available only for problems with strongly convex objective functions. Exceptions are the papers [6,15], where strong convexity is assumed for the feasible set instead of the objective function. However, as clarified at the end of Sect. 2.1 below, the additional assumptions in these two papers are rather strong and are not fulfilled for the problem arising in the optimal control context described above.
In this paper we present convergence results for the gradient projection and the conditional gradient methods for minimization problems in a Hilbert space, where the feasible set is strongly convex but the objective functional is not necessarily convex. These results are new even for convex or strongly convex objective functionals, but we relax the convexity assumption due to the needs of our main goal: to cover the problems arising in optimal control of affine systems, as described above. For that we consider objective functionals that we call, for brevity, (ε, δ)-approximately convex. These functions constitute a larger class than that of the weakly convex functions (see e.g. [4]). In Sect. 2.1 we prove linear convergence of the sequence of approximate solutions generated by the GPM, provided that the step sizes are appropriately chosen. Apart from the applicability to non-convex objective functionals, this result does not require the additional conditions in [6,15]. As usual, the "appropriate" choice of the step sizes is expressed by some constants related to the data of the problem, which are often not available (or only very roughly estimated). Therefore, we present an additional convergence result involving a rather general and constructive condition for the step sizes (well known in the literature).
The conditional gradient method may have some advantages (compared with the GPM) in our optimal control application. For this reason we also prove a linear convergence result for the CGM. This is done in Sect. 2.2.
In Sect. 3 we turn back to the optimal control problem (1)-(3). The first two subsections are preliminary: there we introduce notations, formulate assumptions and present the discrete approximation introduced in [21,24] and the error estimate proved in [24]. All this is needed for understanding the implementation of the GPM and the CGM and the proofs of the error estimates. Then, in Sects. 3.3 and 3.4 we prove the applicability of the abstract convergence results, obtained in Sect. 2, to our discretized optimal control problem and present details about the implementation of the GPM and the CGM. A numerical example that confirms the theoretical findings is given in Sect. 3.5.
The paper concludes with an indication of some open problems for further research (Sect. 4).

Gradient Methods for Problems with Strongly Convex Feasible Set
In this section we investigate the convergence of certain gradient methods for an abstract minimization problem of the form
$\min_{w \in K} f(w),$ (4)
where K is a convex subset of a real Hilbert space H and f : H → R is a function for which certain conditions weaker than convexity will be posed. We recall that if w* ∈ K is a (local) solution of (4) and f is Fréchet-differentiable at w*, then
$\langle \nabla f(w^*), w - w^* \rangle \ge 0$ for all $w \in K$.
Convergence results for gradient projection methods for this problem in finite dimensional spaces and convex f are known (see e.g. [19]). It has been proved that the iterative sequence generated by versions of the gradient projection method converges linearly to a solution, provided that the objective function f is strongly convex and its gradient is Lipschitz continuous. Extensions to infinite dimensional Hilbert spaces are straightforward. In contrast, in our results below the function f does not even need to be convex, while the set K is assumed strongly convex. Some convergence results for smooth convex functions f and strongly convex sets K are obtained in [6,15], but under suppositions that (apart from the convexity of f) are not satisfied in our main motivation as described in the introduction (see Remark 2.3 below). The convergence results presented in this section are substantially stronger.
As usual, ⟨·,·⟩ denotes the inner product in H and ‖·‖ the induced norm. Let K be a nonempty closed convex subset of H. For each u ∈ H, there exists a unique point in K (see [16, p. 8]), denoted by P_K(u), such that
$\| u - P_K(u) \| = \inf_{w \in K} \| u - w \|.$
It is well known that the metric projection P_K is a nonexpansive mapping, i.e., for all u, v ∈ H,
$\| P_K(u) - P_K(v) \| \le \| u - v \|.$
Moreover, for any u ∈ H and v ∈ K, it holds that
$\langle u - P_K(u), v - P_K(u) \rangle \le 0.$
Below we recall the following notions.
An alternative definition is often used in the literature: a set is strongly convex (with respect to the number R > 0) if it coincides with the intersection of all balls of radius Assume also that f is (ε, δ)-approximately convex on K atŵ and that the number Moreover, any solution of problem (4) is at distance at most δ fromŵ.
, we have ‖z‖ = 1. By the strong convexity of K we obtain that for any w ∈ K y := Due to (6), for all w ∈ K with ‖w − ŵ‖ ≥ δ we have Hence, The optimality of ŵ implies that Then from (8) we obtain that that is, (7). Now assume that w̃ is another solution of (4). The optimality of w̃ implies, in particular, that Assuming that ‖w̃ − ŵ‖ > δ we may substitute w = w̃ ∈ K in (7), which gives Adding the last two inequalities we obtain that which contradicts the assumption ‖w̃ − ŵ‖ > δ. The proof is completed. Property (7) will play an important role in the further analysis. In fact, the (ε, δ)-approximate convexity of f and the strong convexity of K were needed just to ensure existence of ν > 0 and δ ≥ 0 for which condition (7) is fulfilled. We mention that (7) is always fulfilled if the set K is convex and the function f is strongly convex, which is not the case here.

Lemma 2.1 Let f be differentiable on K and let condition (7) be fulfilled with some ν > 0. If for some w ∈ K and λ > 0 it holds that P_K(w − λ∇f(w)) = w, then ‖w − ŵ‖ ≤ δ.
Proof Contrary to the claim of the lemma, assume that ‖w − ŵ‖ > δ. Then from Proposition 2.1 we have that the first inequality in (7) is fulfilled by w. From the condition Applying this inequality for u = ŵ and adding it to the first inequality in (7) we obtain that which is a contradiction.

Lemma 2.2 Let f be differentiable on K and let condition (7) be fulfilled with some ν > 0. If for some w ∈ K it holds that ∇f(w) = 0, then ‖w − ŵ‖ ≤ δ.
Proof If we assume ‖w − ŵ‖ > δ, then from the first inequality in (7) we have which is a contradiction.

The Gradient Projection Method
For solving the minimization problem (4), we consider first the most classical algorithm, the gradient projection method (GPM) stated below. In the formulation of the algorithm we only assume that f is L-smooth. Algorithm GPM.
Step 1: If w k = P K (w k − ∇ f (w k )) then Stop. Otherwise, go to Step 2.
Step 2: Choose λ_k > 0 and calculate $w_{k+1} = P_K(w_k - \lambda_k \nabla f(w_k))$. (9) Replace k by k + 1; go to Step 1.
It is well known that for convex f and K the GPM has the error estimate O(1/k) in terms of the objective function when λ_k = λ ∈ (0, 1/L], see e.g. [7]. More precisely, if problem (4) has a solution and f̂ is the minimal value of f on K, then where m_0 is the distance from w_0 to the solution set of (4). If, in addition, f is strongly convex, then the sequence {w_k} converges linearly to the unique solution of (4). If f is only convex (but not necessarily strongly convex), the sequence {w_k} converges weakly [20]. When K is strongly convex, the linear convergence of {w_k} is obtained under additional conditions (too strong for our main application) in [5,6,15].
In this subsection, we prove that if condition (7) is fulfilled with ν > 0 then the sequence {w_k} generated by the GPM linearly approaches ŵ at least until entering a δ-neighborhood of ŵ. Proposition 2.1 gives conditions for the existence of such ν in terms of strong convexity of the set K and (ε, δ)-approximate convexity of the function f. We mention that if the above algorithm of the GPM stops at Step 1 for some k then, according to Lemma 2.1, ‖w_k − ŵ‖ ≤ δ, that is, a δ-approximate solution is attained (obviously this is meaningful only if δ is sufficiently small).
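The iteration in Steps 1 and 2 can be illustrated by a minimal sketch. It assumes a finite-dimensional setting in which K is the Euclidean unit ball (a strongly convex set) and f is a linear function, which is convex but not strongly convex. The function names `project_ball` and `gpm`, the step-size callback `step`, and the tolerance-based stopping test (used in place of the exact equality in Step 1) are illustrative assumptions, not part of the algorithm as stated above.

```python
import math

def project_ball(w, radius=1.0):
    """Metric projection of w onto the closed Euclidean ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [radius * x / norm for x in w]

def gpm(grad, project, w0, step, tol=1e-10, max_iter=10_000):
    """Gradient projection: w_{k+1} = P_K(w_k - lambda_k * grad f(w_k)).

    `step(k)` returns the step size lambda_k > 0 (hypothetical interface).
    Stops when two successive iterates are closer than `tol`.
    """
    w = project(w0)
    for k in range(max_iter):
        g = grad(w)
        lam = step(k)
        w_next = project([wi - lam * gi for wi, gi in zip(w, g)])
        if math.dist(w_next, w) <= tol:
            return w_next, k + 1
        w = w_next
    return w, max_iter

# Assumed test problem: minimize f(w) = <c, w> over the unit ball.
# The minimizer is -c / ||c||.
c = [0.0, 1.0]
w_star, iters = gpm(lambda w: c, project_ball, [1.0, 0.0], step=lambda k: 0.5)
```

In this toy problem the gradient never vanishes on K, so the observed geometric decay of the error comes from the curvature of the ball rather than from the objective.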

Proposition 2.2
Let f be L-smooth on K, let condition (7) be fulfilled with some ν > 0, and let ‖w_0 − ŵ‖ ≥ δ. Then the sequence {w_k} generated by the GPM satisfies the inequality at least as long as ‖w_{k+1} − ŵ‖ ≥ δ.
Substitution of w =ŵ ∈ K in this inequality yields Since w k+1 ∈ K and λ k > 0, if w k+1 −ŵ ≥ δ then due to (7) − By the Cauchy-Schwarz inequality and the Lipschitz continuity of ∇ f , we obtain that Inequalities (11), (12) and (13) imply that On the other hand, Combining (14) and (15) we obtain that where a, b are some positive constants. Define Let {w k } be the sequence generated by the GPM.
Moreover, for every k, if ‖w_{i+1} − ŵ‖ ≥ δ, i = 0, . . . , k, then the following a priori and a posteriori error estimates hold: and Before proving the theorem we mention that in the case of an ε-convex function f (that is, if δ = 0) the first claim of the theorem means that the sequence generated by the GPM converges linearly to the (unique) solution ŵ. In the case δ > 0 we also have linear convergence at least until the generated sequence enters the δ-neighborhood of ŵ. Thus in this case the theorem is meaningful only if δ is reasonably small.

Remark 2.1
If the constants L and ν can be reasonably estimated, then inequalities (19) and (20) can be used to estimate the number of iterations of the GPM needed to achieve a given accuracy.

Remark 2.2
The value μ in (17) can be regarded as a function μ = μ(a, b) of the variable (a, b) belonging to the domain It is a routine task to obtain that the minimum of μ(a, b) under the above constraints is achieved at (a*, b*) := (ν/L², ν/L²) and the minimal value is μ* := L/√(L² + ν²). Hence, λ_k = ν/L² would be an optimal choice of λ_k.
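The optimal step size and rate from Remark 2.2 can be turned into a rough iteration count, in the spirit of Remark 2.1. The sketch below assumes an error bound of the generic linear form ‖w_k − ŵ‖ ≤ μ^k ‖w_0 − ŵ‖; this form, the function names, and the sample values of L and ν are assumptions made for illustration and are not taken from estimates (19) and (20).

```python
import math

def optimal_step_and_rate(L, nu):
    """Step size nu / L^2 and rate L / sqrt(L^2 + nu^2), as in Remark 2.2."""
    lam_star = nu / L**2
    mu_star = L / math.sqrt(L**2 + nu**2)
    return lam_star, mu_star

def iterations_for_accuracy(mu, initial_error, target_error):
    """Smallest k with mu**k * initial_error <= target_error.

    Assumes a bound of the form error_k <= mu**k * error_0 with 0 < mu < 1.
    """
    if target_error >= initial_error:
        return 0
    return math.ceil(math.log(target_error / initial_error) / math.log(mu))

# Hypothetical constants, chosen only to exercise the formulas.
lam_star, mu_star = optimal_step_and_rate(L=2.0, nu=0.5)
k_needed = iterations_for_accuracy(mu_star, initial_error=1.0, target_error=1e-6)
```

Since μ* tends to 1 as ν/L tends to 0, the count grows quickly when ν is small compared with L.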
Since the parameters L and ν are usually not known in advance, we can take as the step size sequence {λ_k} any non-summable sequence of positive real numbers converging to zero, as in the next theorem.
Clearly, in the case δ = 0 the first claim of the theorem implies strong convergence of the sequence {w k }.
Let us now prove the first claim of the theorem. For each k set Since which is a contradiction. Thus w_k remains in the δ-neighborhood of ŵ for all k ≥ k_0. The proof is completed.

Remark 2.3
Using the contractivity of the projection onto strongly convex sets, Balashov and Golubev [6] and Golubev [15] obtained the linear convergence of the GPM for smooth convex optimization problems under the following additional conditions: (i) For any k, there exists a unit vector n(w_k) ∈ N_K(w_k) such that (ii) The problem (4) has a unique solution and it belongs to the boundary of K.
In our convergence analysis in Theorem 2.1, the assumptions (i), (ii) are eliminated, which is important for our main motivation (see the next section). Also important is that our result applies under the (ε, δ)-approximate convexity instead of convexity.

The Conditional Gradient Method
In this subsection, we consider the conditional gradient method (CGM) for solving problem (4) with a γ-strongly convex set K and an (ε, δ)-approximately convex and L-smooth function f. This method dates back to the original work of Frank and Wolfe [13], which presented an algorithm for minimizing a quadratic function over a polytope using only linear optimization steps over the feasible set. The CGM for solving (strongly) convex problems was investigated in [8,9,14]. Algorithm CGM.
Step 1: Step 2: If x k = w k , then Stop. Otherwise, go to Step 3.
Step 3: replace k by k + 1, and go to Step 1. Else the iteration process terminates.
Notice that if the above algorithm stops at Step 1 or Step 3 for some k then, under the assumptions of Lemma 2.2, ‖w_k − ŵ‖ ≤ δ, that is, an approximate solution is attained.
In general, problem (25) may fail to have a solution, in which case the CGM is not executable.
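A minimal sketch of a conditional gradient iteration is given next. It assumes that K is the Euclidean unit ball, for which the linear subproblem has the closed-form solution −g/‖g‖, and it uses the common "short step" rule η_k = min{1, ⟨−∇f(w_k), x_k − w_k⟩ / (L‖x_k − w_k‖²)}. This step-size rule, the function names, and the test problem are assumptions for illustration; the step-size rule of the algorithm above is not reproduced here.

```python
import math

def lmo_ball(g, radius=1.0):
    """argmin of <g, y> over the ball of the given radius (g must be nonzero)."""
    norm = math.sqrt(sum(x * x for x in g))
    return [-radius * x / norm for x in g]

def cgm(grad, lmo, w0, L, tol=1e-9, max_iter=10_000):
    """Conditional gradient method with an assumed short-step rule."""
    w = list(w0)
    for k in range(max_iter):
        g = grad(w)
        if all(gi == 0.0 for gi in g):      # stationary point: stop
            return w, k
        x = lmo(g)                           # linear subproblem over K
        d = [xi - wi for xi, wi in zip(x, w)]
        d2 = sum(di * di for di in d)
        if math.sqrt(d2) <= tol:             # x_k (numerically) equals w_k: stop
            return w, k
        gap = -sum(gi * di for gi, di in zip(g, d))   # nonnegative by optimality of x
        eta = min(1.0, gap / (L * d2))
        w = [wi + eta * di for wi, di in zip(w, d)]
    return w, max_iter

# Assumed test problem: f(w) = 0.5 * ||w - c||^2 with c outside the unit ball,
# so the gradient is bounded away from zero on K and the minimizer is c / ||c||.
c = [2.0, 0.0]
w_cg, iters_cg = cgm(lambda w: [wi - ci for wi, ci in zip(w, c)],
                     lmo_ball, [0.0, 1.0], L=1.0)
```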

Remark 2.4
The objective function in the subproblem (25) in the CGM is linear, thus if K is a polytope, we encounter a linear programming problem which should be easier to solve than the quadratic programming subproblem (9) in the GPM. In the case considered in this paper the set K is not a polytope, thus (25) is not a linear programming problem. However, in our main application (see the next section) the set K is a product of (a possibly large number of) simple two-dimensional strongly convex sets, so that (25) decomposes into two-dimensional subproblems that are easy to solve.
We will use the following global version of (ε, δ)-approximate convexity.
We begin the convergence analysis of the CGM with an inequality which will play a key role for obtaining convergence results. For convenience we assume that if the CGM terminates at some finite iteration k = i (due to ∇f(w_i) = 0), then the sequence {w_k} is extended as w_k = w_i for k > i.

Proposition 2.3 Assume that K is γ-strongly convex, f is L-smooth on K and ŵ is a solution of problem (4) such that ‖∇f(ŵ)‖ ≥ ρ for some number ρ > 0. Assume also that f is (ε, δ)-approximately convex on K and that the number ν := γρ/4 − ε is positive. Further, assume that at any iteration k a solution of the subproblem (25) does exist, and let {w_k} be the sequence generated by the CGM. Denote f̂ := f(ŵ)
at least as long as ‖w_k − ŵ‖ ≥ δ.
Proof If ∇f(w_i) = 0 for some i, we have x_k = w_k and k = 0 for all k ≥ i, hence (28). Thus we may assume that ∇f(w_k) ≠ 0 for the arbitrarily fixed k in the consideration below.
Since f is L-smooth on K we have (see, for example, [20, Lemma 1.30]) Subtracting f̂ from both sides of (29), we obtain By the optimality of x_k in (25), we have Assume from now on that ‖w_k − ŵ‖ ≥ δ. From (31) and the (ε, δ)-approximate convexity of f it follows that Setting z = −∇f(ŵ)/‖∇f(ŵ)‖, we have ‖z‖ = 1. By the strong convexity of K we obtain that Therefore, from the (ε, δ)-approximate convexity of f and the optimality of ŵ, we obtain Combining (33) with (32) we have Setting , we have ‖z_k‖ = 1. By the strong convexity of K we have that The optimality of x_k in (25) yields that where the last inequality follows from (34). Combining (30) with (35), we obtain that We are now in a position to establish the convergence results for the CGM.
Clearly, in the case δ = 0, the first and the second claims of the theorem mean that the sequences {f(w_k)} and {w_k} converge linearly to f̂ and ŵ, respectively. In the case δ > 0 we also have linear convergence at least until the generated sequence enters the δ-neighborhood of ŵ.
Therefore, it follows from (28) that, for all k, it holds In addition, if ‖w_i − ŵ‖ ≥ δ, i = 0, . . . , k, then we have This and (33) imply

The Affine Optimal Control Problem
In this section we turn back to the control-affine linear-quadratic problem (1)-(3) and prove that the gradient projection methods considered in the previous section are applicable to the (high order) discretization of the problem recently developed in [21,24]. (This also applies to the conditional gradient method, where the analysis is similar.) We also provide error estimates regarding both the errors due to discretization and those due to truncation of the gradient projection iterations. The first two subsections reproduce assumptions and results from [24] that are necessary for understanding the implementation of the GPM for the discretized version of problem (1)-(3). The next subsections prove the applicability of the abstract results obtained above, present details about the implementation of the gradient methods, and provide results of computational experiments.

Notations and Assumptions
The scalar product is normalized by division by N since below N will be a "large" number and the sum will be a proxy for integration on a fixed interval [0, T] by using values on a mesh with size h = T/N. We also denote |w i | : . The l_1, l_2, and l_∞ norms in H will be respectively Clearly, the inequality ‖w‖_1 ≤ ‖w‖_2 ≤ ‖w‖_∞ holds for every w ∈ H. As usual, L^2([0, T]; R^m) denotes the Hilbert space of all measurable square-integrable functions [0, T] → R^m with scalar product ⟨u_1, u_2⟩ = ∫_0^T ⟨u_1(t), u_2(t)⟩ dt, and the corresponding norm is denoted again by ‖·‖_2.

Assumption A1 The matrix functions A(t), B(t), W(t) and S(t), t ∈ [0, T], have Lipschitz continuous first derivatives, Q and W(t) are symmetric. Moreover, the matrix B(t)^⊤ S(t) is symmetric for all t ∈ [0, T].
Denote by F the set of all admissible control-trajectory pairs (u, x), that is, all pairs of an admissible control u and the corresponding (absolutely continuous) solution x of (2). By a standard argument, problem (1)-(3) has a solution, (x̂, û) ∈ F, which from now on will be considered as fixed.

Assumption A2
The first part of Assumption (A1) is standard, while the last requirement is demanding but known from the literature, usually expressed in terms of the Lie brackets of the involved controlled vector fields, see e.g. [26]. It is certainly fulfilled in the case of single-input systems, m = 1. Assumption (A2) is a directional convexity assumption at (x̂, û), which is somewhat weaker than the usual convexity assumption for the functional J in (1) regarded as a functional on the set of admissible controls (viewing x as a function of u).
The Pontryagin principle implies that there exists an absolutely continuous function p̂ : [0, T] → R^n such that the triple (x̂, û, p̂) satisfies the following system of generalized equations: for a.e. t ∈ [0, T],

$0 = \dot{p}(t) + A(t)^\top p(t) + W(t) x(t) + S(t) u(t),$
where N_U(u) is the normal cone to U at u. Following [10], we assume that the optimal control û is strictly bang-bang, with a finite number of switching times on [0, T], and that the so-called switching function, Assumptions (A1)-(A3) will be standing in this section.

High-Order Time-Discretization
In this subsection we recall the discretization scheme for problem (1)-(3) presented in [24], which has a higher accuracy than the Euler scheme without a substantial increase of the numerical complexity of the discretized problem. The approach uses second order truncated Volterra-Fliess series. The discretization scheme is described as follows.
For any natural number N denote h = T /N and define the mesh

Introducing the notations
we replace the differential equation (2) with the discrete-time controlled dynamics where Z^m is the Cartesian product ∏_1^m Z and Z is the Aumann integral As pointed out in [21], the set Z can be easily represented in the more convenient way as
$Z = \{ (\alpha, \beta) : \alpha \in [-1, 1], \ \varphi_1(\alpha) \le \beta \le \varphi_2(\alpha) \},$ (45)
where $\varphi_1(\alpha) := \frac{1}{4}(-1 + 2\alpha + \alpha^2)$ and $\varphi_2(\alpha) := \frac{1}{4}(1 + 2\alpha - \alpha^2)$. For the subsequent analysis it will be important that the set Z ⊂ R^2 is strongly convex. This is evident from Fig. 1, but the calculation of a modulus γ is cumbersome and we skip the details. In this calculation we use Theorem 1 in [28] (expressing γ by the Lipschitz constant of the mapping that maps a unit vector to that point on the boundary of Z at which this vector is normal to Z) and the explicit formula for the normal cone to Z given in [21, Sect. 4]. The number γ = 1/√32 turns out to be a modulus of strong convexity of Z.
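The description of Z as the region between the two parabolas can be checked numerically with a short sketch. The function names below are illustrative assumptions. The code verifies that the two parabolas meet at α = −1 and α = 1, and computes the distance between these two corner points.

```python
import math

def phi1(a):
    """Lower boundary of Z."""
    return 0.25 * (-1.0 + 2.0 * a + a * a)

def phi2(a):
    """Upper boundary of Z."""
    return 0.25 * (1.0 + 2.0 * a - a * a)

def in_Z(alpha, beta):
    """Membership test for Z = {(alpha, beta): |alpha| <= 1, phi1 <= beta <= phi2}."""
    return -1.0 <= alpha <= 1.0 and phi1(alpha) <= beta <= phi2(alpha)

# The two parabolas intersect at the corner points (-1, -1/2) and (1, 1/2).
corners = [(-1.0, phi1(-1.0)), (1.0, phi1(1.0))]
corner_distance = math.dist(*corners)
```

The distance between the corners equals √5, which agrees with the diameter of Z used later in the proof of Lemma 3.4.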
Fig. 1 The set Z as the area between the two parabolas ϕ_1 (lower) and ϕ_2 (upper)
We introduce the discrete-time counterpart of the objective functional J in (1): for x = (x_0, . . . , x_N), w = (w_0, . . . , w_{N−1}) = ((u_0, v_0), . . . , (u_{N−1}, v_{N−1})), Then we consider the problem of minimization of the functional J_h defined in (46) subject to the constraints (43)-(44). The set of admissible discrete controls in this problem is denoted by K ⊂ H, that is, We also introduce the discrete adjoint equation (see formula (3.11) in [24]) i = N − 1, . . . , 0, with the end condition Section 3.3 in [24] presents a construction which for every sequence w = (w_0, . . . , w_{N−1}) ∈ K defines an admissible control u = h (w) in problem (1)-(3), with values ±1 and with at most two switches in every interval [t_i, t_{i+1}] of each of its components. We do not reproduce this construction here, only mentioning that it requires only a few calculations (to define the switching points), and the restriction of any component ] depends only on w j i . Moreover, the following equalities hold (see (3.14) in [24]): for every w = ((u 0 , v 0 ) (49) In addition, the function h has the important property that there exists a constant c̄ independent of N such that for every i, j and w Clearly, this implies Below we will use the metric in the set of admissible controls in problem (1)-(3).
The following theorem is extracted from Theorem 3.1 in [24].
where c is independent of N .
We mention that the above discretization scheme is meaningful even without assuming (A2) and (A3). These assumptions are only needed for the error estimate in Theorem 3.1.

Applicability of the Results About Gradient-Type Methods
First of all, we reformulate the problem of minimization of (46) under the constraints (43)-(44) as a minimization problem on the set namely, minimize where x h [w] is the solution of the discrete-time equation (43) ∈ Z m , with the given initial condition x 0 . In this subsection we prove that the assumptions needed for applicability of the results in Sect. 2 to the above problem are fulfilled.
Let us denote by f the objective functional in problem (1)-(3), regarded as a function of the control, namely, f (u) is the solution of (2) corresponding to u ∈ L^2([0, T]; R^m). It is well known that the functional f : L^2([0, T]; R^m) → R is Fréchet differentiable at any u and its derivative has the functional representation where x and p are the solutions of (39), (40), (42) corresponding to u. Similarly, the function f_h : H → R is Fréchet differentiable, and its derivative has the representation (see the second term in the right hand side of (3.12) in [24]) We mention that Assumption (A2) implies that f is convex at û, hence In contrast, f_h does not need to be convex. Next, we present five technical lemmas which are needed in the proof of the main result in this section, Proposition 3.1. In the proofs, c_1, c_2, . . . denote non-negative constants that may depend on the data of the problem (1)-(3) (and their derivatives) but are independent of N. These constants may have different values in different proofs.

Lemma 3.1 There exist constants c and c independent of h, such that for every
where u′_i, u″_i, u_i are the first coordinates of the components w′_i, w″_i, w_i of the elements w′, w″ and w, respectively.
Proof Considering the discrete equation (43), it is a standard procedure to obtain the following estimate for the solutions x′ and x″ corresponding to w′ and w″: Similarly, also using the last estimate, we obtain from (47), (48) that Then using the explicit representation (55) we obtain that which together with (57) and (58) implies the first inequality in the lemma. The second one follows by application of the Cauchy-Schwarz inequality and the definition of the norms.

Lemma 3.2
There exists a number c * such that for every natural number N , for everȳ w ∈ K and for every ∈ is defined as Proof Denote byx andp the solutions of (39) and (40) where O(t; h 2 ) is measurable in t and |O(t; h 2 )| ≤ c 1 h 2 for a.e. t. Using this expression and (54) we obtain the following equality: Using the expressions (55) we obtain, after a simple rearrangement of terms, that Then the estimation |O(t; h 2 )| ≤ c 1 h 2 completes the proof.

Lemma 3.3 The function f h defined in (53) is L-smooth on K with the Lipschitz constant of its derivative being independent of N :
Proof The Fréchet differentiability of f_h was established in [24], together with the representation (55) of its derivative. The Lipschitz continuity on K follows from this representation, together with (57) and (58) (the notation is as in the proof of Lemma 3.1).
We recall that ŵ^N ∈ K denotes, as in Theorem 3.1, an optimal control sequence in the discrete problem (53). In what follows it will be convenient to skip the superscript N in this notation.

Lemma 3.4
There exist numbers N 0 and δ 1 such that for every N ≥ N 0 (α is the number from Assumption (A3) and γ ≥ 1/ √ 32 is a modulus of strong convexity of Z ).

Proof
The following expression is obtained in [24] (see formula (4.5) there, applied for t = t_{i+1}): where u = h (ŵ). Comparing this with the expression (55) we see that Then using Theorem 3.1 we obtain that where c̄ is an appropriate constant. Written for the jth components of the vectors in the left-hand side, the inequality becomes Assumption (A3) implies that there exist a natural number r and a real number τ_0 ∈ (0, τ) such that every component σ_j of σ has at most r zeros in [0, T], and |σ_j(t)| ≥ ατ for every τ ∈ (0, τ_0] and for every t which does not belong to a τ-neighborhood of a zero of σ_j. Now, let us define δ 1 := M √ 2mr/T , where M is the diameter of the set Z (which is √5). Moreover, define the natural number N_0 as bigger than 4c̄T/α, so that c̄h ≤ α/4.
be arbitrarily chosen. Due to the γ-strong convexity of Z we have that | (whenever the denominator is non-zero) we obtain exactly in the same way as in the proof of Proposition 2.1 that Denote by j the set of all indexes i such that |∇_{w_i^j} f_h(ŵ)| < αh/8. Then Consider an arbitrary i ∈ j . Since |∇_{w_i^j} f_h(ŵ)| < αh/8, according to (59) we have where we also use that c̄h ≤ α/4. Then t_{i+1} belongs to the h/2-neighborhood of some zero of σ_j (see the paragraph after (59)). Then no other point t_k ≠ t_{i+1} belongs to this neighborhood. Since σ_j has at most r zeros, the set j consists of at most r points.
Then continuing (60) we obtain The proof is complete.

Lemma 3.5
There exists a constant ν_1 > 0 such that Proof As before, let û be the optimal control in the continuous-time problem (1) Due to (49), we have Moreover, Now we denote the left-hand side of (61) by D and represent D = D_1 + D_2 + D_3, where We shall estimate each of these terms separately. From Lemma 3.1 we obtain ν approaches zero. At the same time, Proposition 3.1 estimates ν as proportional to h. Thus, although the convergence is linear, its rate, μ, may be close to one. Moreover, this rate of convergence is valid only until an accuracy δ is achieved (see Theorem 2.1). The number δ in Proposition 3.1 is estimated as proportional to √h. Thus the convergence of the GPM does not seem to be consistent with the O(h²)-approximation that the discretization method provides. On the other hand, the fact that the GPM is proved to converge (even linearly, in the sense of Theorem 2.1) is remarkable. Indeed, if the Euler discretization scheme is applied to the original problem (1)-(3) (as in most of the literature), the resulting discrete-time problem may fail to be convex, and no results about the rate of convergence of the GPM are available in the literature, to the authors' knowledge.
We do not present the convergence analysis of the CGM for problem (52) and (53), which is rather similar.

Implementation of the Gradient Methods
Now, we shall describe the implementation of the GPM and the CGM to the specific mathematical programming problem defined by (53) and (52).
The two key points in the implementation of the gradient methods are: (i) calculation of the gradient ∇f_h(w); (ii) calculation of projections on K (for the GPM) or solving a linear optimization problem on K (for the CGM). We do not discuss here the issue of the choice of the step sizes λ_k, for which numerous possibilities are known from the literature.

Calculation of ∇ f h (w)
Since f_h represents the objective function of a discrete-time optimal control problem as a function of the control variables (the state being implicitly regarded as a function of the control), we calculate its gradient in the way that is well known in control theory: ∇f_h(w) is the derivative of the Hamiltonian with respect to the control, evaluated at the current control-trajectory pair, together with the corresponding solution of the adjoint equation. The explicit formula is given in (55), reproducing [24, Sect. 3.2].
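The adjoint-based gradient calculation can be illustrated on a deliberately simplified model. The sketch below assumes scalar dynamics x_{i+1} = a x_i + b u_i and the cost ½ q_T x_N² + Σ_i h (½ w_run x_i² + s_run x_i u_i); it is not the high-order scheme (43) and does not reproduce formula (55). All names and coefficients are hypothetical and serve only to show the forward pass, the backward adjoint pass, and the resulting gradient.

```python
def simulate(u, x0, a, b):
    """Forward pass: x_{i+1} = a * x_i + b * u_i."""
    x = [x0]
    for ui in u:
        x.append(a * x[-1] + b * ui)
    return x

def cost(u, x0, a, b, q_T, w_run, s_run, h):
    """Terminal quadratic cost plus a running cost that is affine in the control."""
    x = simulate(u, x0, a, b)
    running = sum(h * (0.5 * w_run * x[i] ** 2 + s_run * x[i] * u[i])
                  for i in range(len(u)))
    return 0.5 * q_T * x[-1] ** 2 + running

def gradient(u, x0, a, b, q_T, w_run, s_run, h):
    """Gradient of the cost with respect to u via the adjoint recursion."""
    N = len(u)
    x = simulate(u, x0, a, b)
    p = [0.0] * (N + 1)
    p[N] = q_T * x[N]                         # end condition of the adjoint
    for i in range(N - 1, -1, -1):            # backward adjoint pass
        p[i] = a * p[i + 1] + h * (w_run * x[i] + s_run * u[i])
    # Derivative of the Hamiltonian with respect to the control at step i.
    return [b * p[i + 1] + h * s_run * x[i] for i in range(N)]

params = dict(x0=1.0, a=0.9, b=0.5, q_T=2.0, w_run=1.0, s_run=0.3, h=0.25)
u_test = [0.3, -0.7, 1.0, 0.2]
grad_test = gradient(u_test, **params)
```

One forward and one backward sweep give the whole gradient, so the cost of a gradient evaluation is comparable to that of a single cost evaluation.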

Calculation of the projection on K
The set K is a product of m × N copies of the strongly convex set Z, thus the projection of a vector w ∈ H onto K is represented by projections onto Z of the two-dimensional components of w. Thus we only have to calculate projections, The following representation of the normal cone to the set Z is obtained in [21, Sect. 4]: where ζ = sgn(α − 2β). Now, take arbitrarily a vector ξ = (u, v) ∈ R^2 and observe that P_Z(ξ) is the unique solution of the inclusion Therefore, using the formula (64), one can explicitly calculate P_Z(ξ) as where the functions ϕ_1 and ϕ_2 are defined after (45), α_1 is a solution in [−1, 1] of the third order equation and α_2 is a solution in [−1, 1] of the third order equation Indeed, the first three cases in the representation (66) are clear. In the fourth case u > −1 and u + v < 3/2 and v < ϕ_1(u), thus P_Z(u, v) has the form (α, ϕ_1(α)) (see Fig. 1). From (64), we have Combining this with (65), one has which leads to (67). The last case is treated similarly.
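For experimentation, the projection onto Z can also be approximated numerically, without the case distinction (66) and the cubic equations. The sketch below is such a substitute, not the explicit formula of this subsection: for a point outside Z it searches for the nearest point on each of the two boundary parabolas by a coarse scan followed by a ternary-search refinement. The function names and the grid size are assumptions.

```python
def phi1(a):
    return 0.25 * (-1.0 + 2.0 * a + a * a)   # lower boundary of Z

def phi2(a):
    return 0.25 * (1.0 + 2.0 * a - a * a)    # upper boundary of Z

def in_Z(u, v):
    return -1.0 <= u <= 1.0 and phi1(u) <= v <= phi2(u)

def _nearest_on_arc(phi, u, v, grid=2000):
    """Nearest point to (u, v) on the arc {(a, phi(a)) : a in [-1, 1]}."""
    def dist2(a):
        return (a - u) ** 2 + (phi(a) - v) ** 2
    # Coarse scan to bracket the minimizer.
    best = min(range(grid + 1), key=lambda i: dist2(-1.0 + 2.0 * i / grid))
    lo = max(-1.0, -1.0 + 2.0 * (best - 1) / grid)
    hi = min(1.0, -1.0 + 2.0 * (best + 1) / grid)
    # Ternary search inside the bracket.
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dist2(m1) <= dist2(m2):
            hi = m2
        else:
            lo = m1
    a = 0.5 * (lo + hi)
    return (a, phi(a)), dist2(a)

def project_Z(u, v):
    """Numerical approximation of the metric projection of (u, v) onto Z."""
    if in_Z(u, v):
        return (u, v)
    p1, d1 = _nearest_on_arc(phi1, u, v)
    p2, d2 = _nearest_on_arc(phi2, u, v)
    return p1 if d1 <= d2 else p2

p_corner = project_Z(2.0, 0.5)   # a point whose projection is the corner (1, 1/2)
p_upper = project_Z(0.0, 1.0)    # a point above Z, projected onto the upper parabola
```

Since K is a product of copies of Z, the projection onto K is obtained by applying `project_Z` to each two-dimensional component.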

Solving the auxiliary sub-problem in the CGM
Now, we consider the subproblem $\min_{y \in K} \langle \nabla f_h(w), y \rangle$, which appears in the implementation of the CGM (see (25)).
Observe that the necessary (and sufficient) optimality condition for this problem reads as
$$0 \in \nabla f_h(w) + N_K(y). \tag{69}$$
Therefore, the subproblem (25) can be solved explicitly, without solving any third-order algebraic equation as in the GPM.
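Because $K$ is a product of copies of $Z$, the linear subproblem separates into two-dimensional linear problems, one per component of the gradient. The sketch below shows this decoupling, again with a Euclidean disc as a hypothetical stand-in for $Z$; the explicit solution on the actual set $Z$ follows from the optimality condition and the normal cone formula (64) and is not reproduced here. The names `lmo_disc` and `lmo_product` are assumptions.

```python
import math


def lmo_disc(g, center=(0.0, 0.0), radius=1.0):
    """Minimize <g, y> over a disc (stand-in for Z).

    For g != 0 the minimizer is the boundary point in the direction -g.
    For g == 0 every point is optimal and the center is returned.
    """
    norm = math.hypot(g[0], g[1])
    if norm == 0.0:
        return (center[0], center[1])
    return (center[0] - radius * g[0] / norm, center[1] - radius * g[1] / norm)


def lmo_product(grad, lmo_Z):
    """Minimize the sum of <g_j, y_j> over a product of copies of Z.

    The objective is a sum of terms, each depending on one component only,
    so the minimization is carried out component by component.
    """
    return [lmo_Z(g) for g in grad]
```

On a strongly convex set with smooth boundary the minimizer is unique whenever the corresponding gradient component is nonzero, which is the situation in which the linear subproblem is well behaved.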

Numerical Examples
In this subsection, we present some numerical experiments for the example of an affine linear-quadratic optimal control problem given in [24].
As in [24], we choose $a = 1$, $b = 0.1$; then $\tau = 0.492487520$ is a simple zero of the switching function, thus Assumption (A3) is fulfilled. The exact optimal control $\hat{u}$ is known explicitly for this example. For each $N$, the iterates $\{w_k\}$ generated by the GPM or the CGM converge linearly to the unique (in this example) solution $\hat{w}_h$, with rates $\mu_N$ and $\theta_N$, respectively. The starting control is chosen as $u_0(t) = 1$, $t \in [0, T]$, for both algorithms. In the following tables, we report these rates for some values of $N$. The stopping condition is $\|w_{k+1} - w_k\| \le 10^{-6}$ for the GPM and $\|x_k - w_k\| \le 10^{-6}$ for the CGM.
Table 1 indicates that the (numerically obtained) rate of linear convergence, $\mu_N$, of the GPM depends on the mesh size $N$: it is monotonically increasing and likely approaches 1 as $N$ increases. This is to be expected, since according to Theorem 2.1 the rate $\mu_N$ of linear convergence approaches 1 when $\nu$ goes to zero, and according to Proposition 3.1 $\nu$ is estimated as proportional to $h$. Actually, the convergence of $\mu_N$ to 1 is also consistent with the fact that the GPM applied (theoretically) to the continuous-time problem (1)-(3) converges sub-linearly, as recently established in [22, Theorem 3.2]. We emphasize that, due to the second-order accuracy of the discretization, the mesh size $N$ does not need to be taken large, therefore the rate of linear convergence may be reasonably good (see Table 1 for $N$ = 10-30).
Table 2 presents the rate of linear convergence of the CGM applied to the same example. Although, as mentioned at the end of Sect. 3.4, the amount of computation at each step of the CGM is slightly lower than that for the GPM, the rate of linear convergence is worse.
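The text does not state how the rates $\mu_N$ and $\theta_N$ were extracted from the iterates. One common way to estimate a linear rate numerically, shown here only as an assumption, is to average the ratios of successive errors over the last few iterations. The function name `estimate_linear_rate` and the parameter `tail` are hypothetical.

```python
def estimate_linear_rate(errors, tail=5):
    """Estimate a linear convergence rate from a sequence of error norms.

    errors[k] is assumed to hold a norm such as ||w_k - w_hat||. The estimate
    is the mean of the ratios errors[k+1] / errors[k] over the last `tail`
    available ratios; zero errors are skipped to avoid division by zero.
    """
    ratios = [e1 / e0 for e0, e1 in zip(errors, errors[1:]) if e0 > 0.0]
    if not ratios:
        raise ValueError("at least two errors with a nonzero predecessor are required")
    last = ratios[-tail:]
    return sum(last) / len(last)
```

For a sequence that contracts exactly by a factor of one half per iteration, for example `[0.5 ** k for k in range(10)]`, the function returns 0.5. Values close to 1 correspond to the slow regime discussed above for large $N$.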

Concluding Remarks
In this paper we obtain a number of new results on the convergence of gradient methods for general optimization problems on strongly convex feasible sets. The main motivation is the application of a recently developed discretization scheme [21,24] for linear-quadratic affine optimal control problems, which results in discrete-time problems of the same type but with strongly convex point-wise control constraints that have rather simple representations by means of quadratic inequalities. This opens several directions of further research.
First, one may develop methods more efficient than gradient projection by using the specific linear-quadratic structure of the objective function and of the constraints.
Second, one may investigate the applicability of gradient projection methods to discretized nonlinear optimal control problems with the control appearing linearly. As indicated in [17], our discretization approach is also applicable to such problems, and results in mathematical programming problems with strongly convex feasible sets. The general convergence results obtained in the present paper are also applicable, in principle. The main open problem here is that the error analysis of the discretization has not been developed for nonlinear problems, which also makes it difficult to justify the applicability and the convergence of gradient methods.