Gradient-type penalty method with inertial effects for solving constrained convex optimization problems with smooth data

We consider the problem of minimizing a smooth convex objective function subject to the set of minima of another differentiable convex function. In order to solve this problem, we propose an algorithm which combines the gradient method with a penalization technique. Moreover, we insert in our algorithm an inertial term, which is able to take advantage of the history of the iterates. We show weak convergence of the generated sequence of iterates to an optimal solution of the optimization problem, provided a condition expressed via the Fenchel conjugate of the constraint function is fulfilled. We also prove convergence for the objective function values to the optimal objective value. The convergence analysis carried out in this paper relies on the celebrated Opial Lemma and generalized Fejér monotonicity techniques. We illustrate the functionality of the method via a numerical experiment addressing image classification via support vector machines.


Introduction and preliminaries
Let $H$ be a real Hilbert space with the norm and inner product given by $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$, respectively, and let $f$ and $g$ be convex functions acting on $H$, which we assume for simplicity to be everywhere defined and (Fréchet) differentiable. The object of our investigation is the optimization problem
$$\min_{x \in \operatorname{argmin} g} f(x). \tag{1}$$
We assume that
$$S := \operatorname{argmin}\{f(x) : x \in \operatorname{argmin} g\} \neq \emptyset$$
and that the gradients $\nabla f$ and $\nabla g$ are Lipschitz continuous operators with constants $L_f$ and $L_g$, respectively.
The work [5] of Attouch and Czarnecki has attracted considerable interest from the research community since its appearance, because it undertakes a qualitative analysis of the optimal solutions of (1) from the perspective of a penalty-term based dynamical system. It was the starting point for the design and development of numerical algorithms for solving the minimization problem (1) and several of its variants, ranging from problems involving nonsmooth data up to monotone inclusions related to optimality systems of constrained optimization problems. We refer the reader to [4–8, 10, 11, 13–15, 20–23, 33, 35] and the references therein for more insights into this research topic.
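A key assumption used in this context in order to guarantee the convergence properties of the numerical algorithms is the condition
$$\sum_{n=1}^{\infty} \lambda_n \beta_n \left[ g^*\left(\frac{p}{\beta_n}\right) - \sigma_{\operatorname{argmin} g}\left(\frac{p}{\beta_n}\right) \right] < +\infty \quad \forall p \in \operatorname{ran}(N_{\operatorname{argmin} g}),$$
where $\{\lambda_n\}_{n=1}^{\infty}$ and $\{\beta_n\}_{n=1}^{\infty}$ are positive sequences, $g^*(p) = \sup_{x \in H}\{\langle p, x\rangle - g(x)\}$ is the Fenchel conjugate of $g$, $\sigma_{\operatorname{argmin} g}(p) = \sup_{x \in \operatorname{argmin} g}\langle p, x\rangle$ is the support function of the set $\operatorname{argmin} g$, and $N_{\operatorname{argmin} g}$ is the normal cone to the set $\operatorname{argmin} g$, defined by $N_{\operatorname{argmin} g}(x) = \{p \in H : \langle p, y - x\rangle \le 0 \ \forall y \in \operatorname{argmin} g\}$ for $x \in \operatorname{argmin} g$ and $N_{\operatorname{argmin} g}(x) = \emptyset$ for $x \notin \operatorname{argmin} g$. Finally, $\operatorname{ran}(N_{\operatorname{argmin} g})$ denotes the range of the normal cone $N_{\operatorname{argmin} g}$, that is, $p \in \operatorname{ran}(N_{\operatorname{argmin} g})$ if and only if there exists $x \in \operatorname{argmin} g$ such that $p \in N_{\operatorname{argmin} g}(x)$. Let us notice that for $x \in \operatorname{argmin} g$ one has $p \in N_{\operatorname{argmin} g}(x)$ if and only if $\sigma_{\operatorname{argmin} g}(p) = \langle p, x\rangle$. We also assume without loss of generality that $\min g = 0$.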
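As an illustration that is not part of the original text, the summability condition can be evaluated in closed form for one standard choice of constraint function. The sketch below assumes the condition in its usual form $\sum_n \lambda_n\beta_n[g^*(p/\beta_n) - \sigma_{\operatorname{argmin} g}(p/\beta_n)] < +\infty$ and takes $g$ to be half the squared distance to a nonempty closed convex set $C \subseteq H$ (the set $C$ is a hypothetical example).

```latex
% Example: g(x) = (1/2) dist^2(x, C), so argmin g = C and min g = 0.
% g is the infimal convolution of (1/2)||.||^2 and the indicator of C, hence
\[
  g^*(p) = \tfrac{1}{2}\|p\|^2 + \sigma_C(p)
  \quad\Longrightarrow\quad
  g^*\!\left(\tfrac{p}{\beta_n}\right) - \sigma_C\!\left(\tfrac{p}{\beta_n}\right)
  = \frac{\|p\|^2}{2\beta_n^2}.
\]
% Substituting into the series:
\[
  \sum_{n=1}^{\infty} \lambda_n\beta_n \cdot \frac{\|p\|^2}{2\beta_n^2}
  = \frac{\|p\|^2}{2} \sum_{n=1}^{\infty} \frac{\lambda_n}{\beta_n}.
\]
```

For this particular $g$ the condition therefore holds for every $p$ as soon as $\sum_{n} \lambda_n/\beta_n < +\infty$.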
In this paper we propose a numerical algorithm for solving (1) that combines the gradient method with penalization strategies and additionally employs inertial and memory effects. Algorithms of inertial type result from the time discretization of differential inclusions of second order type (see [1, 3]) and were first investigated in the context of the minimization of a differentiable function by Polyak [36] and Bertsekas [12]. The resulting iterative schemes share the feature that the next iterate is defined by means of the last two iterates, a fact which induces the inertial effect in the algorithm. Since the works [1, 3], one can notice an increasing number of research efforts dedicated to algorithms of inertial type (see [1–3, 9, 16–19, 24–28, 30–32, 34]).
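Iterative step: For given current iterates $x_{n-1}, x_n \in H$ ($n \ge 1$), define $x_{n+1} \in H$ by
$$x_{n+1} := x_n + \alpha(x_n - x_{n-1}) - \lambda_n \nabla f(x_n) - \lambda_n \beta_n \nabla g(x_n).$$
We notice that in the above iterative scheme $\{\lambda_n\}_{n=1}^{\infty}$ represents the sequence of step sizes, $\{\beta_n\}_{n=1}^{\infty}$ the sequence of penalty parameters, while $\alpha$ controls the influence of the inertial term.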
For every $n \ge 1$ we denote $\Omega_n := f + \beta_n g$, which is also a (Fréchet) differentiable function, and notice that $\nabla \Omega_n$ is Lipschitz continuous with constant $L_n := L_f + \beta_n L_g$.
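The iterative step can be sketched in a few lines of Python. This is a minimal illustration, assuming the update has the form $x_{n+1} = x_n + \alpha(x_n - x_{n-1}) - \lambda_n(\nabla f(x_n) + \beta_n \nabla g(x_n))$; the function name, the toy problem in $\mathbb{R}^2$ and the parameter sequences are hypothetical choices made for the example and have not been checked against Assumption 2.

```python
def inertial_penalty_gradient(grad_f, grad_g, x0, x1, alpha, lam, beta, n_iter):
    """Run n_iter steps of the (assumed) inertial gradient penalty update.

    grad_f, grad_g: callables returning gradients as lists of floats.
    lam, beta: callables n -> step size lambda_n and penalty parameter beta_n.
    """
    x_prev, x = list(x0), list(x1)
    for n in range(1, n_iter + 1):
        gf, gg = grad_f(x), grad_g(x)
        # Inertial extrapolation plus a gradient step on f + beta_n * g.
        x_next = [
            xi + alpha * (xi - xpi) - lam(n) * (gfi + beta(n) * ggi)
            for xi, xpi, gfi, ggi in zip(x, x_prev, gf, gg)
        ]
        x_prev, x = x, x_next
    return x


# Hypothetical toy problem: f(x) = 0.5 * ||x - (1, 2)||^2 and g(x) = 0.5 * x_2^2,
# so argmin g is the horizontal axis and the constrained minimizer is (1, 0).
grad_f = lambda x: [x[0] - 1.0, x[1] - 2.0]
grad_g = lambda x: [0.0, x[1]]
beta = lambda n: n ** 0.6          # increasing penalty parameters (illustrative)
lam = lambda n: 0.5 / n ** 0.6     # decreasing step sizes (illustrative)

x_final = inertial_penalty_gradient(
    grad_f, grad_g, [0.0, 0.0], [0.0, 0.0], 0.1, lam, beta, 5000
)
print(x_final)
```

On this toy problem the first coordinate approaches 1 quickly, while the second coordinate tends to 0 only at the rate at which the step sizes vanish, which reflects the penalization of the constraint.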
In case $\alpha = 0$, Algorithm 1 reduces to the algorithm considered in [35] for solving (1). We prove weak convergence of the generated iterates to an optimal solution of (1), by making use of generalized Fejér monotonicity techniques and the Opial Lemma and by imposing the key assumption mentioned above as well as some mild conditions on the involved parameters. Moreover, the performed analysis also allows us to show the convergence of the objective function values to the optimal objective value of (1). As an illustration of the theoretical results, we present in the last section an application addressing image classification via support vector machines.

Convergence analysis
This section is devoted to the asymptotic analysis of Algorithm 1.
We would like to mention that in [21] we proposed a forward-backward-forward algorithm of penalty type, endowed with inertial and memory effects, for solving monotone inclusion problems, which gave rise to a primal-dual iterative scheme for solving convex optimization problems with complex structures. However, we succeeded in proving only weak ergodic convergence for the generated iterates, while with the specific choice of the sequences $\{\lambda_n\}_{n=1}^{\infty}$ and $\{\beta_n\}_{n=1}^{\infty}$ in Assumption 2 we will be able to prove weak convergence of the iterates generated in Algorithm 1 to an optimal solution of (1).

Remark 3
The conditions in Assumption 2 slightly extend the ones considered in [35] in the noninertial case. The only differences are given by the first inequality in (II), which here involves the constant $\alpha$ which controls the inertial terms (for the corresponding condition in [35] one only has to take $\alpha = 0$), and by the inequality $\frac{1}{\lambda_{n+1}} - \frac{1}{\lambda_n} \le \frac{2}{\alpha}$ for all $n \ge 1$. We refer to Remark 12 for situations where the fulfillment of the conditions in Assumption 2 is guaranteed.
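The step-size inequality mentioned in Remark 3 is easy to test numerically for a candidate sequence. The following sketch assumes the inequality reads $1/\lambda_{n+1} - 1/\lambda_n \le 2/\alpha$ with $\alpha \in (0, 1)$; the helper name and both example sequences are hypothetical, and a finite check of this kind is only a sanity test, not a proof.

```python
def satisfies_step_condition(lam, alpha, n_max):
    """Check 1/lam(n+1) - 1/lam(n) <= 2/alpha for n = 1, ..., n_max."""
    bound = 2.0 / alpha  # requires alpha > 0
    return all(
        1.0 / lam(n + 1) - 1.0 / lam(n) <= bound
        for n in range(1, n_max + 1)
    )


# Polynomially decaying step sizes: consecutive reciprocals differ by about 1 at most.
slow_decay = lambda n: 0.5 / n ** 0.6
# Geometrically decaying step sizes: the difference 2**n eventually exceeds any bound.
fast_decay = lambda n: 2.0 ** (-n)

ok_slow = satisfies_step_condition(slow_decay, alpha=0.1, n_max=1000)
ok_fast = satisfies_step_condition(fast_decay, alpha=0.1, n_max=20)
print(ok_slow, ok_fast)
```

The comparison suggests that the inequality rules out step sizes that shrink too abruptly from one iteration to the next, while polynomial decay is harmless.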
We start the convergence analysis with three technical results.

Lemma 4 Let $x \in S$ and set $p := -\nabla f(x)$.
where $\varphi_n := \|x_n - x\|^2$.
Proof Since $x \in S$, we have according to the first-order optimality conditions that where $y_n := x_n + \alpha(x_n - x_{n-1})$. This, together with the monotonicity of $\nabla f$, implies that On the other hand, since $g$ is convex and differentiable, we have for all $n \ge 1$ which means that As for all $n \ge 1$ Combining (4), (5) and (6), we obtain that for each $n \ge 1$ Finally, since $x \in \operatorname{argmin} g$, we have that for all $n \ge 1$ which completes the proof.

Lemma 5
We have for all $n \ge 1$

Proof From the descent lemma and the fact that $\nabla \Omega_n$ is $L_n$-Lipschitz continuous, we get that and then which is nothing else than By the Cauchy-Schwarz inequality it holds that

For $x \in S$ and all $n \ge 1$, we set and, for simplicity, we denote

Lemma 6 Let $x \in S$ and set $p := -\nabla f(x)$.

Proof According to Lemma 5 and Assumption 2(II), (8) becomes for all $n \ge 1$ On the other hand, after multiplying (2) by $K$, we obtain for all $n \ge 1$ After summing up the relations (11) and (12) and adding on both sides of the resulting inequality the expressions $\alpha(\Omega_{n-1}(x_{n-1}) - \Omega_n(x_n))$ and $\alpha(K\lambda_n\beta_n g(x_n) - K\lambda_{n-1}\beta_{n-1} g(x_{n-1}))$ for all $n \ge 2$, we obtain the required statement.
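For reference, the descent lemma invoked in the proof of Lemma 5 is the following standard inequality for a differentiable function with Lipschitz continuous gradient. The second display shows how it would presumably be applied to the penalized function $f + \beta_n g$ (written here as $\Omega_n$, with Lipschitz constant $L_n = L_f + \beta_n L_g$) at two consecutive iterates; the choice of evaluation points is an assumption, since the corresponding display is not available in this text.

```latex
% Descent lemma: h differentiable on H with L-Lipschitz continuous gradient.
\[
  h(y) \le h(x) + \langle \nabla h(x), y - x \rangle + \frac{L}{2}\,\|y - x\|^2
  \qquad \forall x, y \in H.
\]
% Assumed instance: h = \Omega_n, L = L_n, x = x_n, y = x_{n+1}.
\[
  \Omega_n(x_{n+1}) \le \Omega_n(x_n)
  + \langle \nabla \Omega_n(x_n), x_{n+1} - x_n \rangle
  + \frac{L_n}{2}\,\|x_{n+1} - x_n\|^2 .
\]
```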
(v) $\lim_{n \to +\infty} g(x_n) = 0$ and every sequential weak cluster point of the sequence $\{x_n\}_{n=1}^{\infty}$ lies in $\operatorname{argmin} g$.

Proof We set $p := -\nabla f(x)$ and recall that $g(x) = \min g = 0$.
(i) Since $f$ is convex and differentiable, it holds for all $n \ge 1$ which means that $\{\Gamma_n\}_{n=1}^{\infty}$ is bounded from below. Notice that the first inequality in the above relation is a consequence of Assumption 2(II), since 1
(ii) For all $n \ge 2$, we may set We fix a natural number $N_0 \ge 2$. Then Since $f$ is bounded from below and $g(x_{N_0}) \ge g(x) = 0$, it follows that $\sum_{n=2}^{\infty} u_n < +\infty$.
We notice that $-\delta_n + \alpha$ Thus, according to Lemma 6, we get for all $n \ge 2$ We fix another natural number $N_1 \ge 2$ and sum up the last inequality for $n = 2, \ldots, N_1$. We obtain which, by taking into account Assumption 2(III), means that $\{\mu_n\}_{n=2}^{\infty}$ is bounded from above by a positive number that we denote by $M$. Consequently, for all $n \ge 2$ we have We have for all $n \ge 2$ Consequently, for the arbitrarily chosen natural number $N_1 \ge 2$, we have [see (14)] which together with (15) and the fact that $c > 1$ implies that On the other hand, due to (13) we have $\delta_{n+1} \le \delta_n + 1$ for all $n \ge 1$. Consequently, using also that $c > 1$, (10) implies that According to Proposition 7 and by taking into account that $\{\Gamma_n\}_{n=1}^{\infty}$ is bounded from below, we obtain that $\lim_{n \to +\infty} \Gamma_n$ exists.
(iv) Since $\Omega_n(x_n) = \Gamma_n - K\varphi_n + K\lambda_n\beta_n g(x_n)$ for all $n \ge 1$, by using (ii) and (iii), we get that $\lim_{n \to +\infty} \Omega_n(x_n)$ exists.
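(v) Since $\liminf_{n \to +\infty} \lambda_n\beta_n > 0$, we also obtain that $\lim_{n \to +\infty} g(x_n) = 0$. Let $w$ be a sequential weak cluster point of $\{x_n\}_{n=1}^{\infty}$ and assume that the subsequence $\{x_{n_j}\}_{j=1}^{\infty}$ converges weakly to $w$. Since $g$ is weakly lower semicontinuous, we have $0 \le g(w) \le \liminf_{j \to +\infty} g(x_{n_j}) = \lim_{n \to +\infty} g(x_n) = 0$, which implies that $w \in \operatorname{argmin} g$. This completes the proof.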
In order to show also the convergence of the sequence $\{f(x_n)\}_{n=1}^{\infty}$, we first prove the following result.

Lemma 9 Let x ∈ S be given. We have
Proof Since f is convex and differentiable, we have for all n ≥ 1 Since g is convex and differentiable, we have for all n ≥ 1 which together imply that From here we obtain for all n ≥ 1 [see (6)] Hence, by using the previous lemma, the required result holds.
The Opial Lemma that we recall below will play an important role in the proof of the main result of this paper.

Proof (i) According to Lemma 8, $\lim_{n \to +\infty} \|x_n - x\|$ exists for all $x \in S$. Let $w$ be a sequential weak cluster point of $\{x_n\}_{n=1}^{\infty}$. Then there exists a subsequence $\{x_{n_j}\}_{j=1}^{\infty}$ of $\{x_n\}_{n=1}^{\infty}$ such that $x_{n_j}$ converges weakly to $w$ as $j \to +\infty$. By Lemma 8, we have that $w \in \operatorname{argmin} g$. This means that in order to come to the conclusion it suffices to show that $f(w) \le f(x)$ for all $x \in \operatorname{argmin} g$. From Lemma 9, Lemma 8 and the fact that which shows that $w \in S$. Hence, thanks to the Opial Lemma, $\{x_n\}_{n=1}^{\infty}$ converges weakly to a point in $S$.
(ii) The statement follows easily from the above considerations.
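For reference, the Opial Lemma used in the proof above can be stated in its standard form as follows. The set is denoted by $C$ here as a generic placeholder; in the proof it is applied with $C = S$.

```latex
% Opial Lemma (standard form).
% Let H be a real Hilbert space, C \subseteq H a nonempty set and
% (x_n)_{n \ge 1} a sequence in H such that
\[
  \text{(a)}\ \lim_{n \to +\infty} \|x_n - x\| \ \text{exists for every } x \in C;
  \qquad
  \text{(b)}\ \text{every sequential weak cluster point of } (x_n)_{n \ge 1}
  \text{ belongs to } C.
\]
% Then (x_n)_{n \ge 1} converges weakly to an element of C.
```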
At the end of this section we present some situations where Assumption 2 is verified.

Numerical example: image classification via support vector machines
In this section we employ the algorithm proposed in this paper in the context of image classification via support vector machines. Having a set of training data $a_i \in \mathbb{R}^n$, $i = 1, \ldots, k$, each belonging to one of two given classes denoted by "−1" and "+1", the aim is to construct from this information a decision function given in the form of a separating hyperplane, which assigns every new data point to one of the two classes with a misclassification rate as low as possible. In order to be able to handle the situation when a full separation is not possible, we make use of non-negative slack variables $\xi_i \ge 0$, $i = 1, \ldots, k$; thus the goal will be to find $(s, r, \xi) \in \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^k_+$ as an optimal solution of the following optimization problem where for $i = 1, \ldots, k$, $d_i$ is equal to −1 if $a_i$ belongs to the class "−1" and it is equal to +1 otherwise. Each new data point $a \in \mathbb{R}^n$ will be assigned to one of the two classes by means of the resulting decision function $z(a) = a^\top s + r$, namely, $a$ will be assigned to the class "−1" if $z(a) < 0$, and to the class "+1" otherwise. For more theoretical insights into support vector machines we refer the reader to [29]. By making use of the matrix the problem under investigation can be written as or, equivalently, By considering $f$ : and notice that $\nabla g$ is $\|A\|^2$-Lipschitz continuous, where $\operatorname{proj}_{\mathbb{R}^{2k}_+}$ denotes the projection operator on the set $\mathbb{R}^{2k}_+$.

For the numerical experiments we used a data set consisting of 6,000 training images and 2,060 test images of size 28 × 28 taken from the website http://www.cs.nyu.edu/~roweis/data.html representing the handwritten digits 2 and 7, labeled by −1 and +1, respectively (see Fig. 1). We evaluated the quality of the resulting decision function on the test data set by computing the percentage of misclassified images.
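The classification rule and the quality measure described above can be sketched as follows. This is a minimal illustration, assuming the decision function is the affine map $z(a) = a^\top s + r$ with the tie $z(a) = 0$ assigned to the class "+1"; the function names and the tiny two-dimensional data set are hypothetical and unrelated to the image data used in the experiments.

```python
def decision_value(a, s, r):
    """Affine decision function z(a) = <a, s> + r."""
    return sum(ai * si for ai, si in zip(a, s)) + r


def classify(a, s, r):
    """Assign class -1 if z(a) < 0 and class +1 otherwise."""
    return -1 if decision_value(a, s, r) < 0 else 1


def misclassification_rate(samples, labels, s, r):
    """Percentage of samples whose predicted class differs from the label."""
    errors = sum(1 for a, d in zip(samples, labels) if classify(a, s, r) != d)
    return 100.0 * errors / len(labels)


# Hypothetical hyperplane and data: the third point lies on the hyperplane,
# is therefore assigned to +1, and counts as an error because its label is -1.
s, r = [1.0, -1.0], 0.0
samples = [[2.0, 1.0], [0.0, 1.0], [1.0, 1.0], [3.0, 0.0]]
labels = [1, -1, -1, 1]
rate = misclassification_rate(samples, labels, s, r)
print(rate)
```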
We denote by $D = \{(X_i, Y_i), i = 1, \ldots, 6{,}000\} \subset \mathbb{R}^{784} \times \{-1, +1\}$ the set of available training data consisting of 3,000 images in the class −1 and 3,000 images in the class +1. For numerical reasons each image has been vectorized and normalized. We tested in MATLAB different combinations of parameters chosen as in Remark 12 by running the algorithm for 3,000 iterations. A sample of misclassified images is shown in Fig. 2.
In Table 1 we present the misclassification rate in percentage for different choices of the parameters $\alpha \in (0, 1)$ (we recall that in this case we take $K := 2/\alpha$) and $C > 0$, while for $\alpha = 0$, which corresponds to the noninertial version of the algorithm, we consider different choices of the parameters $K > 0$ and $C > 0$. We observe that combining $\alpha = 0.1$ with each of the regularization parameters $C = 5, 10, 100$ leads to the lowest misclassification rate of 2.1845%.
In Table 2 we present the misclassification rate in percentage for different choices of the parameters $C > 0$ and $c > 1$. The lowest misclassification rate of 2.1845% is obtained for each regularization parameter $C = 5, 10, 100$. Finally, Table 3 shows the misclassification rate in percentage for different choices of the parameters $C > 0$ and $q \in (1/2, 1)$. The lowest misclassification rate of 2.1845% is obtained when combining the value $q = 0.9$ with each regularization parameter $C = 5, 10, 100$.
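The reported percentage can be related to an absolute number of images with a short calculation. The sketch below assumes that the rate is computed over the 2,060 test images and rounded to four decimal places; under this assumption it searches for all error counts compatible with 2.1845%. The resulting count is an inference from the reported figures, not a number stated in the text.

```python
n_test = 2060          # number of test images reported in the text
reported_rate = 2.1845  # lowest misclassification rate in percent

# All error counts whose rounded percentage matches the reported value.
compatible_counts = [
    errors
    for errors in range(n_test + 1)
    if round(100.0 * errors / n_test, 4) == reported_rate
]
print(compatible_counts)
```

Under the stated assumption the only compatible count is 45 misclassified test images.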