1 Introduction

In this paper we consider a composite optimization problem with the following structure

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n}f(x)\equiv f_0(x) + f_1(x), \end{aligned}$$
(1)

where \(f_0:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is a continuously differentiable function and \(f_1:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{\infty \}\) is convex. This kind of problem is especially relevant in the framework of the variational approach to inverse problems, where a solution is obtained by minimizing a function which combines data fidelity and regularization terms.

Forward–backward (FB) methods are a class of optimization algorithms which fully exploit the structure of problem (1), namely the smoothness of \(f_0\) and the convexity of \(f_1\). They have reasonable theoretical convergence guarantees and, when implemented with some care, they exhibit good practical performance.

The general iteration of a FB method can be expressed in the following way

$$\begin{aligned} {\tilde{y}}^{(k)}\approx & {} {\textrm{prox}}_{\alpha _k f_1}^{D_k}\left( { x}^{(k)}-\alpha _k{D_k}^{-1}\nabla f_0( { x}^{(k)})\right) \end{aligned}$$
(2a)
$$\begin{aligned} { x}^{(k+1)}= & {} { x}^{(k)}+ \lambda _k( {\tilde{y}}^{(k)}- { x}^{(k)}), \end{aligned}$$
(2b)

where \(\alpha _k\) is a positive steplength parameter, \(D_k\) is a symmetric positive definite matrix and \({\textrm{prox}}_{f_1}^D(\cdot )\) denotes the proximal operator associated with the function \(f_1\), in the metric induced by D.
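For concreteness, here is a minimal sketch of one iteration of (2) in the simplest setting \(D_k=I\), \(\lambda _k=1\) and exact proximal computation; the callables grad_f0 and prox_f1 are illustrative assumptions (prox_f1(z, t) is assumed to return \({\textrm{prox}}_{t f_1}(z)\)), not part of any specific implementation discussed below.

def fb_step(x, grad_f0, prox_f1, alpha):
    # one forward-backward step (2) with D_k = I and lambda_k = 1
    z = x - alpha * grad_f0(x)   # forward (explicit) gradient step on f0
    return prox_f1(z, alpha)     # backward (implicit) proximal step on f1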

Many state-of-the-art algorithms can be stated in the form (2) [1, 16, 21, 24,25,26,27,28, 49], each depending on specific choices of the parameters \(\alpha _k,\lambda _k,D_k\) and of the strategy for computing, possibly in an approximate way, the proximity operator. These methods have been extensively analyzed in convex and nonconvex frameworks, and a wide numerical experience shows that they are a valid tool for solving large scale problems. In the last two decades, several variants of iteration (2) have been proposed with the aim of overcoming some known pitfalls of FB methods, such as the so-called “hemstitching” (zig-zagging) effect. These FB variants introduce a kind of extrapolation, or inertial, step exploiting the two previous iterates and can be traced back to two main families: FISTA-like methods and Heavy-Ball ones.

FISTA-like methods are reminiscent of the seminal work [37] and can be stated as

$$\begin{aligned} {\bar{x}}^{(k)}= & {} { x}^{(k)}+ \beta _k( { x}^{(k)}- { x}^{(k-1)}) \end{aligned}$$
(3a)
$$\begin{aligned} { x}^{(k+1)}\approx & {} {\textrm{prox}}_{\alpha _k f_1}^{D_k}\left( {\bar{x}}^{(k)}-\alpha _k{D_k}^{-1}\nabla f_0({\bar{x}}^{(k)}) \right) , \end{aligned}$$
(3b)

where \(\beta _k\) is an extrapolation parameter. Method (3) has been analyzed, for example, in [4, 7, 10, 20, 23, 45] in the convex case. Other recent contributions on this class of methods and related developments can be found in [34, 54,55,56], where the analysis also covers the nonconvex case. The most remarkable property of method (3) is the convergence rate estimate \(f( { x}^{(k)})-f^* = o(1/k^2)\), which can be proved under suitable assumptions on the parameters \(\beta _k,\alpha _k, D_k\), when \(f_0\) is convex and has Lipschitz continuous gradient. In the same setting, the standard FB rate is 1/k and, in this sense, FISTA is optimal among all first order methods. On the other hand, the implementation of (2) allows more flexibility in the choice of the parameters, which may lead to practical performance that is often comparable to, or even better than, that of FISTA.

The second class of inertial methods, named Heavy-Ball, corresponds to the following scheme

$$\begin{aligned} { x}^{(k+1)}\approx & {} {\textrm{prox}}_{\alpha _k f_1}^{D_k}\left( { x}^{(k)}-\alpha _k{D_k}^{-1}\nabla f_0( { x}^{(k)}) +\beta _k( { x}^{(k)}- { x}^{(k-1)})\right) . \end{aligned}$$
(4)

The main idea behind iteration (4) dates back to Polyak [42] and, in the simple case when \(f_1=0\), it can be derived as a discretization of the ordinary differential equation

$$\begin{aligned} \ddot{x}(t) + a \dot{x}(t) + b \nabla f_0(x(t)) = 0, \end{aligned}$$
(5)

while the standard gradient method is related to

$$\begin{aligned} \dot{x}(t) + \alpha \nabla f_0(x(t)) = 0. \end{aligned}$$

The connections between the above dynamical systems and the algorithms obtained by their discretization are investigated, for example, in [6, 36, 42]. Iteration (4) also has connections with the Conjugate Gradient method for quadratic unconstrained optimization and its extension to nonlinear smooth problems [32].

Heavy-Ball methods received considerable attention in recent years. In particular, the properties of iteration (4) under convexity assumptions are investigated in [5, 6, 31, 39], where the theoretical convergence rate is also provided. In the convex framework, the parameters \(\alpha _k\) and \(\beta _k\) are usually selected in order to guarantee that the optimal convergence rate estimate is achieved. From this point of view, the implementation of iteration (4) requires explicit knowledge of the minimum and maximum eigenvalues of \(\nabla ^2 f_0(x)\). This clearly implies that the method itself can be applied only to specific classes of problems and, in addition, it may result in a lack of effectiveness, e.g. when the minimum eigenvalue of the Hessian is small [6].

Other recent results can be found in [18, 35, 40, 41], where the analysis applies also to the nonconvex case, assuming that the Kurdyka–Łojasiewicz property holds. We also mention the inertial Bregman proximal gradient methods proposed in [54, 56], which are closely related to iteration (4). In these settings, the guidelines for selecting \(\alpha _k\), \(\beta _k\) may differ from those in the convex case, and adaptive rules based on a backtracking procedure have also been proposed [38, 40, 54].

Contribution. In this paper we propose the following generalization of the Heavy-Ball iteration (4)

$$\begin{aligned} {\tilde{y}}^{(k)}\approx & {} {\textrm{prox}}_{\alpha _k f_1}^{D_k}\left( { x}^{(k)}-\alpha _k{D_k}^{-1}\nabla f_0( { x}^{(k)}) +\beta _k( { x}^{(k)}- { x}^{(k-1)})\right) \end{aligned}$$
(6a)
$$\begin{aligned} { x}^{(k+1)}= & {} { x}^{(k)}+ \lambda _k( {\tilde{y}}^{(k)}- { x}^{(k)}), \end{aligned}$$
(6b)

and we analyze it in a nonconvex setting, providing also some insights on the related implementation issues.

The key feature of our proposed method is a line-search procedure to determine the steplength parameter \(\lambda _k\), in order to guarantee the sufficient decrease of a suitably defined merit function. The combination of a line-search procedure with an inertial step is, as far as we know, new. Thanks to the descent properties enforced by the line-search, we can prove the convergence of the iterates to a stationary point of problem (1), in the Kurdyka–Łojasiewicz (KL) framework.

Unlike most Heavy-Ball methods, the theoretical conditions employed to prove this result do not impose significant restrictions on the choice of the parameters \(\beta _k,\alpha _k, D_k\), which can be chosen freely within a compact set. This freedom allows the adoption of acceleration strategies for improving the practical performance.

Indeed, as a second contribution, we also devise practical rules for selecting the algorithm parameters, by adapting some ideas proposed in the literature as acceleration strategies for the standard FB methods. In particular, for the stepsize parameter \(\alpha _k\) we propose an adaptation of the well known Barzilai–Borwein formulae [8], while for selecting the inertial parameter \(\beta _k\) we borrow the ideas behind the nonlinear Conjugate Gradient (CG) method.

Our approach also encompasses the possibility of employing an approximation of the proximal operator in (6a) and of adopting a metric, associated with the matrix \(D_k\), which may vary throughout the iterations.

The numerical experience described in the last section shows that these implementation choices make the algorithm competitive with the most effective FB and FISTA algorithms, as well as with more standard implementations of the Heavy-Ball method itself.

Related work.

Our method shares with the Heavy-Ball algorithms proposed in [38, 40] and in [54] the idea of combining an inertial step with a backtracking loop to compute some of its parameters. The key difference is that our approach consists in an actual line-search, i.e., the next iterate is found along the direction joining the current iterate and the (approximate) inertial proximal-gradient point \( {\tilde{y}}^{(k)}\), see (6b). Moreover, unlike the above mentioned references, the line-search concerns only the steplength parameter \(\lambda _k\), allowing a totally different approach to the selection of the parameters \(\alpha _k,\beta _k,D_k\). On the other hand, the iterates \( { x}^{(k)}\) produced by the rules (6), unlike the plain Heavy-Ball iteration (4), are not necessarily obtained as the evaluation of the proximity operator at some point. As a consequence, the convergence analysis of our proposed method cannot be obtained as a special case of [38, 40, 54].

Alternatives to the Euclidean metric have been considered in [54, 56], where a Bregman distance is adopted. Here we adopt a scaled Euclidean metric, induced by the matrix \(D_k^{-1}\), which may vary at each iteration.

The convergence analysis of iteration (6) is performed by showing that it is a special instance of the abstract scheme presented in [18]. From this point of view, our approach shares similar convergence properties with the Heavy-Ball methods in [18, 38, 40]. More insights about the relations between our new algorithm and the existing ones are given in Sect. 3.

Moreover, our approach to the selection of the steplength \(\alpha _k\) is along the same lines as [29, 30], where the authors show that the effectiveness of line-search based FB methods of the form (2) can be significantly improved by devising clever steplength selection rules, despite lower theoretical rate estimates.

Paper organization. In Sect. 2 we state some basic definitions and results about subdifferential calculus in nonconvex settings and we define the KL framework. The proposed algorithm, named Phila, is described in Sect. 3, where the relations with existing algorithms are discussed. The theoretical convergence analysis is developed in Sect. 4, leading to the main result, which is stated in Theorem 13. In Sect. 5 we describe and motivate the implementation choices concerning the algorithm parameters. In order to evaluate the effectiveness of these choices, we present the results of a numerical experience carried out on a variety of test problems, ranging from unconstrained quadratic problems to nonconvex, nonlinear problems where the proximity operator of \(f_1\) is not available in explicit form. The performance assessment is obtained by comparing our approach with several state-of-the-art methods. Final remarks and perspectives are offered in Sect. 6.

2 Preliminaries

2.1 Notations

The notation \({\overline{{\mathbb {R}}}}\) denotes the set of extended real numbers, i.e., \({\overline{{\mathbb {R}}}}= {\mathbb {R}}\cup \{-\infty ,+\infty \}\). Given two vectors \(x,y\in {\mathbb {R}}^n\), we denote by \(\langle x,y\rangle \) the scalar product on \({\mathbb {R}}^n\) and by \(\Vert x\Vert \) the Euclidean norm of x:

$$\begin{aligned} \langle x,y\rangle = x^Ty, \quad \Vert x\Vert = \sqrt{\langle x,x\rangle }. \end{aligned}$$

If \(x\in {\mathbb {R}}^n\) and D is a symmetric positive definite matrix, we denote by \(\Vert x\Vert _D\) the norm induced by D:

$$\begin{aligned} \Vert x\Vert _D = \sqrt{\langle x,Dx\rangle }. \end{aligned}$$

Given \(\mu >1\), we define the set \({{\mathcal {M}}}_{\mu }\) of all symmetric positive definite matrices with eigenvalues belonging to the interval \([\frac{1}{\mu }, \mu ]\):

$$\begin{aligned} {{\mathcal {M}}}_{\mu }= \left\{ D\in {\mathbb {R}}^{n\times n}: D^T = D, \quad \mu I \succeq D\succeq \frac{1}{\mu }I \right\} . \end{aligned}$$

If \(D\in {{\mathcal {M}}}_{\mu }\), then \(D^{-1}\in {{\mathcal {M}}}_{\mu }\). Moreover, for any \(D\in {{\mathcal {M}}}_{\mu }\), we also have

$$\begin{aligned} \frac{1}{\mu }\Vert x\Vert ^2\le \Vert x\Vert ^2_D\le \mu \Vert x\Vert ^2. \end{aligned}$$

Given a function \(\phi :{\mathbb {R}}^n\rightarrow {\bar{{\mathbb {R}}}}\), the domain of \(\phi \) is denoted as \(\textrm{dom}(\phi ) = \{x\in {\mathbb {R}}^n:\phi (x)<+\infty \}\), and \(\phi \) is called proper if \(\textrm{dom}(\phi )\ne \emptyset \) and \(\phi (x)>-\infty \) for all \(x\in {\mathbb {R}}^n\). Given a proper, convex function \(\phi :{\mathbb {R}}^n\rightarrow {\overline{{\mathbb {R}}}}\), a symmetric positive definite matrix D and a point \(x\in {\mathbb {R}}^n\), the proximal operator of \(\phi \) at point x in the norm induced by D is defined as

$$\begin{aligned} {\textrm{prox}}_{\phi }^{D}(x) = \mathop {{\text {argmin}}}\limits _{y\in {\mathbb {R}}^n} \phi (y) + \frac{1}{2} \Vert y-x\Vert ^2_D. \end{aligned}$$
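As a simple worked example (an illustration added here, not drawn from the cited references), if \(\phi (y)=\rho \Vert y\Vert _1\) with \(\rho >0\) and \(D=\textrm{diag}(d_1,\ldots ,d_n)\) with \(d_i>0\), the above minimization separates across the components and the proximal operator reduces to a componentwise scaled soft-thresholding:

$$\begin{aligned} \left[ {\textrm{prox}}_{\phi }^{D}(x)\right] _i = \textrm{sign}(x_i)\max \left\{ |x_i|-\frac{\rho }{d_i},0\right\} , \quad i=1,\ldots ,n. \end{aligned}$$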

2.2 Subdifferential calculus and KL functions

In this section, we recall the fundamental concept of the subdifferential from variational analysis. We refer the reader to well-known textbooks such as [47, 48] for more details.

We start with the definitions of Fréchet subdifferential and limiting-subdifferential.

Definition 1

[48, Definition 8.3] Let \(f:{\mathbb {R}}^n\rightarrow {\overline{{\mathbb {R}}}}\) and \(x\in \textrm{dom}(f)\). The Fréchet subdifferential of f at x is the set

$$\begin{aligned} {\hat{\partial }} f(x) = \left\{ v\in {\mathbb {R}}^n: \ \liminf _{y\rightarrow x,y\ne x} \frac{1}{\Vert x-y\Vert }(f(y)-f(x)-(y-x)^T v)\ge 0\right\} . \end{aligned}$$

The limiting-subdifferential (or simply subdifferential) of f at x is defined as

$$\begin{aligned} \partial f(x) = \left\{ v\in {\mathbb {R}}^n: \ \exists \ \{y^{(k)}\}_{k\in {\mathbb {N}}} \subseteq {\mathbb {R}}^n, \ v^{(k)}\in {\hat{\partial }} f(y^{(k)}) \ \forall k\in {\mathbb {N}}, \ {\textrm{such}} \ {\textrm{that}} \ y^{(k)}\rightarrow x, \ f(y^{(k)})\rightarrow f(x) \ {\textrm{and}} \ v^{(k)}\rightarrow v \right\} , \end{aligned}$$

and \(\textrm{dom}(\partial f)=\{x\in \textrm{dom}(f):\partial f(x) \ne \emptyset \}\).

Definition 2

We say that a point \(x\in {\mathbb {R}}^n\) is stationary for \(f:{\mathbb {R}}^n\rightarrow {\overline{{\mathbb {R}}}}\) if \(x\in \textrm{dom}(f)\) and \(0\in \partial f(x)\).

Remark 3

Any (local) minimum point \(x\in {\mathbb {R}}^n\) of f is a stationary point, while the converse may not be true in general. If f is also convex, then x is a (local) minimum point if and only if x is a stationary point.

The following subdifferential calculus rule will be used later in the paper.

Lemma 4

[48, Exercise 8.8c] If \(f=f_0+f_1\) with \(f_1\) finite at x and \(f_0\) continuously differentiable on a neighbourhood of x, then

$$\begin{aligned} \partial f(x) = \nabla f_0(x)+\partial f_1(x). \end{aligned}$$

Definition 5

[57, p. 82] Let \(f:{\mathbb {R}}^n\rightarrow {\overline{{\mathbb {R}}}}\) be a proper, convex function. Given \(\epsilon \ge 0\) and \(x\in {\mathbb {R}}^n\), the \(\epsilon -\)subdifferential of f at x is the set

$$\begin{aligned} \partial _{\epsilon } f(x) = \{v\in {\mathbb {R}}^n: \ f(y)\ge f(x)+\langle v, y-x\rangle -\epsilon , \ \forall \ y\in {\mathbb {R}}^n\}. \end{aligned}$$

For our convergence analysis, we need to recall the definition of the so-called Kurdyka-Łojasiewicz inequality [13], an analytical property which has been extensively employed to ensure the convergence of iterative methods in the absence of convexity [3, 14, 18, 38].

Definition 6

[14, Definition 3] Let \(f:{\mathbb {R}}^n\rightarrow {\overline{{\mathbb {R}}}}\) be proper and lower semicontinuous. The function f satisfies the Kurdyka-Łojasiewicz (KL) inequality at the point \(x^*\in \textrm{dom}(\partial f)\) if there exist \(\nu >0\), a neighbourhood U of \(x^*\), and a continuous concave function \(\xi :[0,\nu )\rightarrow [0,+\infty ) \) such that \(\xi (0) = 0\), \(\xi \) is \(C^1\) on \((0,\nu )\), \(\xi '(s)>0\) for all \(s\in (0,\nu )\), and the following inequality holds

$$\begin{aligned} \xi '(f(x)-f(x^*))\textrm{dist}(0,\partial f(x))\ge 1, \end{aligned}$$
(7)

for all \(x\in U \cap \{y\in {\mathbb {R}}^n: \ f(x^*)< f(y) < f(x^*)+\nu \}\). If f satisfies the KL inequality for all \(x^*\in \textrm{dom}(\partial f)\), then f is called a KL function.

Remark 7

The KL inequality is trivially satisfied around nonstationary points [2, Remark 3.2], whereas it establishes a meaningful connection between the gradient norm and the function values of the objective function around stationary points. Several classes of functions arising in signal and image processing satisfy the KL inequality, including the indicator functions of semi-algebraic sets, real polynomials, \(p-\)norms and, more generally, semi-algebraic functions or real analytic functions (see [14] and references therein for more details).
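As a simple illustration of Definition 6 (added here for concreteness, not taken from the cited references), the one-dimensional function \(f(x)=|x|^p\) with \(p\ge 1\) satisfies the KL inequality at \(x^*=0\) with the concave desingularizing function \(\xi (s)=s^{1/p}\), since for every \(x\ne 0\)

$$\begin{aligned} \xi '(f(x)-f(x^*))\,\textrm{dist}(0,\partial f(x)) = \frac{1}{p}|x|^{1-p}\cdot p\,|x|^{p-1} = 1\ge 1. \end{aligned}$$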

3 The algorithm

In the following, we present Phila, a Proximal Heavy-ball Inexact Line-search Algorithm for solving problem (1). Our proposal is reported step by step in Algorithm 1.

Phila requires as inputs an initial guess \(x^{(0)}\in \textrm{dom}(f_1)\), the backtracking parameters \(\delta ,\sigma \in (0,1)\) and \(\gamma >0\), the constants \(0<\alpha _{min}\le \alpha _{max}\), \(\beta _{max}>0\), and \(\mu \ge 1\) related to the choice of the steplength, the inertial parameter, and the scaling matrix, respectively, and a tolerance parameter \(\tau \ge 0\) for the (possibly inexact) computation of the proximal-gradient point. Once the inputs are fixed, Phila starts an iterative process, where at each iteration k the iterate \( { x}^{(k+1)}\) is formed according to the following steps.

At Step 1, we select a steplength \(\alpha _k\in [\alpha _{min},\alpha _{max}]\), an inertial parameter \(\beta _{k}\in [0,\beta _{max}]\), and a symmetric positive definite scaling matrix \(D_k\in {\mathcal {M}}_\mu \). These three parameters contribute to the definition of the (possibly inexact) proximal-gradient point defined at the next step. No requirements are imposed on \(\alpha _k,\beta _k,D_k\), other than they must belong to compact subsets in their corresponding spaces.

At Step 2, we compute the point \( {\tilde{y}}^{(k)}\in {\mathbb {R}}^n\), which approximates or possibly coincides with the inertial proximal-gradient point \( {\hat{y}}^{(k)}= {\textrm{prox}}_{\alpha _k f_1}^{D_k}( { x}^{(k)}-\alpha _k D_k^{-1}\nabla f_0( { x}^{(k)})+\beta _k( { x}^{(k)}- { x}^{(k-1)}))\). To this aim, we define the function \(h^{(k)}(\,\cdot \,; { x}^{(k)}, { x}^{(k-1)}):{\mathbb {R}}^n\rightarrow {\bar{{\mathbb {R}}}}\) as follows

$$\begin{aligned} h^{(k)}(y; { x}^{(k)}, { x}^{(k-1)})&= \left\langle \nabla f_0( { x}^{(k)}) -\frac{\beta _k}{\alpha _k}D_k( { x}^{(k)}- { x}^{(k-1)}),y- { x}^{(k)}\right\rangle \\&\quad +\frac{1}{2\alpha _k}\Vert y- { x}^{(k)}\Vert _{D_k}^2+f_1(y) - f_1( { x}^{(k)}). \end{aligned}$$

It is easy to see that the inertial proximal-gradient point \( {\hat{y}}^{(k)}\) is the unique minimum point of the strongly convex function \(h^{(k)}(\,\cdot ; { x}^{(k)}, { x}^{(k-1)})\), i.e.,

$$\begin{aligned} {\hat{y}}^{(k)}&= {\textrm{prox}}_{\alpha _kf_1}^{D_k}\left( { x}^{(k)}-\alpha _k{D_k}^{-1}\nabla f_0( { x}^{(k)}) + \beta _k( { x}^{(k)}- { x}^{(k-1)})\right) \end{aligned}$$
(8)
$$\begin{aligned}&=\mathop {{\text {argmin}}}\limits _{y\in {\mathbb {R}}^n} h^{(k)}(y; { x}^{(k)}, { x}^{(k-1)}) \nonumber \\&\Leftrightarrow 0\in \partial h^{(k)}( {\hat{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)}). \end{aligned}$$
(9)

Then, we compute \( {\tilde{y}}^{(k)}\) by means of the following condition

$$\begin{aligned} h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)})-h^{(k)}({\hat{y}}^{(k)};x^{(k)}, { x}^{(k-1)})\le -\frac{\tau }{2} h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)}).\nonumber \\ \end{aligned}$$
(10)

If \(\tau = 0\), the above condition implies \( {\tilde{y}}^{(k)}= {\hat{y}}^{(k)}\), i.e., the inertial proximal-gradient point is computed exactly; otherwise, if \(\tau >0\), an approximation \( {\tilde{y}}^{(k)}\) of the exact point \( {\hat{y}}^{(k)}\) is computed. This condition has been employed for inexactly computing the proximal-gradient point in several first-order methods, see e.g. [12, 17, 18, 33]. By observing that \({\hat{y}}^{(k)}\) is a minimizer of \(h^{(k)}(\cdot ;x^{(k)}, { x}^{(k-1)})\) and \(h^{(k)}(x^{(k)};x^{(k)}, { x}^{(k-1)})=0\), the above condition entails that

$$\begin{aligned} h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)})\le \left( \frac{2}{2+\tau }\right) h^{(k)}({\hat{y}}^{(k)};x^{(k)}, { x}^{(k-1)})\le 0, \end{aligned}$$

so that the inexactness condition can be conveniently rewritten as

$$\begin{aligned} 0\in \partial _{\epsilon _k}h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)}), \end{aligned}$$
(11a)

with

$$\begin{aligned} \epsilon _k = -\frac{\tau }{2}h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)})\ge 0. \end{aligned}$$
(11b)

In other words, the computation of \( {\tilde{y}}^{(k)}\) is based on a relaxation of the optimality condition (9), obtained by appropriately enlarging the subdifferential of the function \(h^{(k)}(\,\cdot ; { x}^{(k)}, { x}^{(k-1)})\) at the point \( {\tilde{y}}^{(k)}\). Even if inequality (10) and its equivalent version (11) are stated in an implicit way, there exists a well defined, explicit procedure to compute a point satisfying them [20, 50, 52]. For completeness, this aspect is detailed in Appendix A. We observe that the inexactness criterion based on the inclusion (11a) is employed also in [56], where the accuracy parameter \(\epsilon _k\) is chosen as a prefixed sequence instead of our adaptive rule (11b).
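A simple way to turn (10) into a computable test (a sketch of the underlying idea, under the assumption that a lower bound \(\ell _k\le h^{(k)}({\hat{y}}^{(k)};x^{(k)}, { x}^{(k-1)})\) is available, e.g. from a dual feasible point) is to accept \( {\tilde{y}}^{(k)}\) as soon as

$$\begin{aligned} h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)})-\ell _k\le -\frac{\tau }{2} h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)}), \end{aligned}$$

since \(h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)})-h^{(k)}({\hat{y}}^{(k)};x^{(k)}, { x}^{(k-1)})\le h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)})-\ell _k\), so that this test implies (10).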

Once the quantity \(\Delta _k =h^{(k)}({\tilde{y}}^{(k)};x^{(k)}, { x}^{(k-1)})\) and the search direction \(d^{(k)}= {\tilde{y}}^{(k)}- { x}^{(k)}\) are assigned at Steps 3–4, an Armijo-like linesearch along the direction \(d^{(k)}\) is performed at Steps 5–6. Specifically, we compute the linesearch parameter as \(\lambda _k=\delta ^{i_k}\), where \(\delta \in (0,1)\) is the backtracking factor and \(i_k\) is the first non-negative integer such that the following Armijo-like condition holds:

$$\begin{aligned}&\min \left\{ f( { x}^{(k)}+\delta ^{i_k}d^{(k)}) + \frac{\gamma }{2}\Vert \delta ^{i_k}d^{(k)}\Vert ^2, \ \ f( {\tilde{y}}^{(k)})+\frac{\gamma }{2} \Vert d^{(k)}\Vert ^2 \right\} \nonumber \\&\quad \le f( { x}^{(k)}) + \frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 + \sigma \delta ^{i_k}\Delta _k. \end{aligned}$$
(12)

Since \(\Delta _k\le 0\), we stop the backtracking procedure when the minimum between \(f( { x}^{(k)}+\delta ^{i_k}d^{(k)}) + \frac{\gamma }{2}\Vert \delta ^{i_k}d^{(k)}\Vert ^2\) and \(f( {\tilde{y}}^{(k)})+\frac{\gamma }{2} \Vert d^{(k)}\Vert ^2 \) is sufficiently smaller than the reference value \(f( { x}^{(k)}) + \frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2\). We remark that the minimum operation is not necessary from a theoretical viewpoint, as our convergence analysis still holds if we neglect the value \(f( {\tilde{y}}^{(k)})+\frac{\gamma }{2} \Vert d^{(k)}\Vert ^2\) in (12); rather, it should be seen as a way of avoiding unnecessary backtracking iterations. Note that we are enforcing a descent condition on the merit function \(\Phi (x,s) = f(x)+\frac{\gamma }{2}\Vert x-s\Vert ^2\), rather than on the original objective function f. This is motivated by the fact that \(d^{(k)}\) might not be a descent direction for the function f in general, so it might be infeasible to impose the decrease of \(f( { x}^{(k)}+\lambda d^{(k)})\) for a sufficiently small \(\lambda >0\). On the other hand, it is still possible to ensure the decrease of the merit function \(\Phi \) (see Sect. 4 for more details).

Finally, Step 7 updates the next iterate as either \( { x}^{(k+1)}= { x}^{(k)}+ \lambda _k d^{(k)}\) or \( { x}^{(k+1)}= {\tilde{y}}^{(k)}\), depending on whether or not \(f( { x}^{(k)}+\lambda _kd^{(k)}) + \frac{\gamma }{2} \Vert \lambda _kd^{(k)}\Vert ^2\) is smaller than \(f( {\tilde{y}}^{(k)}) + \frac{\gamma }{2} \Vert d^{(k)}\Vert ^2\). Note that, as a consequence of Steps 5–7, the sequence generated by Phila satisfies the following inequalities

$$\begin{aligned} f( { x}^{(k+1)}) + \frac{\gamma }{2}\Vert { x}^{(k+1)}- { x}^{(k)}\Vert ^2&\le f( { x}^{(k)}) + \frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 + \sigma \lambda _k\Delta _k \end{aligned}$$
(13)
$$\begin{aligned} \Vert { x}^{(k+1)}- { x}^{(k)}\Vert&\le \Vert {\tilde{y}}^{(k)}- { x}^{(k)}\Vert . \end{aligned}$$
(14)
Algorithm 1 Phila: Proximal Heavy-ball Inexact Line-search Algorithm
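To fix ideas, the following minimal sketch illustrates one Phila iteration in the simplest setting \(D_k=I\) and \(\tau =0\) (exact proximal point); it assumes NumPy arrays and user-supplied callables f0, grad_f0, f1, prox_f1, whose names (as well as phila_step itself) are illustrative and not part of the paper.

import numpy as np

def phila_step(x, x_old, f0, grad_f0, f1, prox_f1, alpha_k, beta_k,
               gamma=1e-4, sigma=1e-4, delta=0.5, max_backtracks=50):
    g = grad_f0(x)
    # Step 2: exact inertial proximal-gradient point (6a)/(8) with D_k = I and tau = 0;
    # prox_f1(z, t) is assumed to return prox_{t*f1}(z)
    y = prox_f1(x - alpha_k * g + beta_k * (x - x_old), alpha_k)
    d = y - x                                                 # Step 4: search direction
    # Step 3: Delta_k = h^{(k)}(y; x, x_old), from the definition of h^{(k)} with D_k = I
    Delta = (np.dot(g - (beta_k / alpha_k) * (x - x_old), d)
             + np.dot(d, d) / (2.0 * alpha_k) + f1(y) - f1(x))
    f = lambda v: f0(v) + f1(v)
    ref = f(x) + 0.5 * gamma * np.dot(x - x_old, x - x_old)   # right-hand side of (12)
    cand_y = f(y) + 0.5 * gamma * np.dot(d, d)                # second term in the min of (12)
    lam = 1.0
    for _ in range(max_backtracks):                           # Steps 5-6: Armijo-like backtracking
        cand_x = f(x + lam * d) + 0.5 * gamma * lam ** 2 * np.dot(d, d)
        if min(cand_x, cand_y) <= ref + sigma * lam * Delta:  # condition (12)
            break
        lam *= delta
    # Step 7: keep whichever of x + lam*d and y gives the smaller merit value
    x_new = x + lam * d if cand_x <= cand_y else y
    return x_new, x                                           # (x^{(k+1)}, x^{(k)})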

Remark 8

(Related work) The main novelty of Phila lies in the original combination of a proximal-gradient iteration with inertia and a linesearch along a prefixed direction. In particular, our line-search approach involves only the steplength parameter \(\lambda _k\), while the other parameters \(\alpha _k\), \(\beta _k\), \(D_k\) can be selected in a completely independent manner. This is in contrast with existing inertial proximal-gradient methods, where the parameters \(\alpha _k,\beta _k,D_k\) are usually prefixed, or selected through a backtracking procedure based on a local version of the Descent Lemma, see e.g. [9, 38,39,40, 43]. To the best of our knowledge, the inertial proximal-gradient methods with backtracking which are most similar to ours are [18, Algorithm 3] and [55, Algorithm 1] (see also [54, Algorithm 2] for a generalization of it). All these methods employ the same merit function as Phila. However, in contrast to Phila, the algorithm in [18] introduces an auxiliary iterate \(s^{(k)}\), and computes the inexact proximal point as \( {\tilde{y}}^{(k)}\approx {\textrm{prox}}_{\alpha _k f_1}^{D_k}( { x}^{(k)}-\alpha _k D_{k}^{-1}\nabla f_0( { x}^{(k)})+\beta _k( { x}^{(k)}-s^{(k)}))\), where the term \(\beta _k( { x}^{(k)}-s^{(k)})\) may not coincide with the inertial force \(\beta _k( { x}^{(k)}- { x}^{(k-1)})\) that is typically employed in inertial proximal-gradient methods. Moreover, the backtracking procedure in [55, Algorithm 1] and [54, Algorithm 2] involves also the inertial parameter and, in addition, it is not performed along a line but along the path defined by the proximal-gradient operator. We also observe that the backtracking procedure of these two algorithms might be computationally more demanding, since each reduction step requires the evaluation of the proximity operator, while our approach needs it only once per outer iteration.

Our linesearch has two main advantages. On the one hand, it allows a relatively free selection of the parameters \(\alpha _k,\beta _k,D_k\), as the theoretical convergence relies on the linesearch parameter \(\lambda _k\) (see Sect. 4); on the other hand, it requires only one evaluation of the proximal operator per iteration k, as opposed to the standard backtracking loops based on the Descent Lemma, which require a proximal evaluation at each backtracking step. Note also that the Armijo-like condition (12) is applied to the merit function \(\Phi (x,s)=f(x)+\frac{\gamma }{2}\Vert x-s\Vert ^2\), rather than to the objective function f; again, this is in contrast with the prior literature, where condition (12) is typically employed by setting \(\gamma = 0\) and neglecting the term \(f( {\tilde{y}}^{(k)}) + \frac{\gamma }{2}\Vert d^{(k)}\Vert ^2\) on the left-hand side of the inequality, see e.g. [17, 19, 51]. The merit function \(\Phi (x,s)\) was inspired by the works on the algorithm iPiano [38, 40], where the choice of the parameters \(\alpha _k,\beta _k,D_k\) (either prefixed or selected according to a local version of the Descent Lemma) guarantees the sufficient decrease of a similar merit function along the iterates. A backtracking loop based on the same merit function, with a different acceptance rule and implementation, is also proposed in [54, 55]. In these works, unlike our approach, the backtracking steps involve not only the steplength \(\alpha _k\), but also the other parameters defining the iteration rule.

4 Convergence analysis

In this section, we are interested in providing the convergence analysis for Algorithm 1. The analysis will be carried out under the following assumptions on the functions \(f_0,f_1\) appearing in problem (1).

  1. [A1] \(f_1:{\mathbb {R}}^n\rightarrow {\overline{{\mathbb {R}}}}\) is a proper, lower semicontinuous, convex function.

  2. [A2] \(f_0:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is continuously differentiable on an open set \(\Omega _0\supset \overline{\textrm{dom}(f_1)}\).

  3. [A3] \(f_0\) has \(L-\)Lipschitz continuous gradient on \(\textrm{dom}(f_1)\), i.e., there exists \(L>0\) such that

    $$\begin{aligned} \Vert \nabla f_0(x)-\nabla f_0(y)\Vert \le L\Vert x-y\Vert , \ \ \forall x,y\in \textrm{dom}(f_1). \end{aligned}$$

  4. [A4] f is bounded from below.

First, let us state some lemmas holding for the sequences \(\{ { x}^{(k)}\}_{k\in {\mathbb {N}}}\), \(\{ {\tilde{y}}^{(k)}\}_{k\in {\mathbb {N}}}\), and \(\{ {\hat{y}}^{(k)}\}_{k\in {\mathbb {N}}}\) associated with Algorithm 1, which can be seen as special cases of [18, Lemmas 8–10].

Lemma 9

Suppose that Assumptions [A1]–[A2] hold true. Then, the following inequalities hold.

$$\begin{aligned} \frac{1}{2\mu \alpha _{max}} \Vert {\hat{y}}^{(k)}- { x}^{(k)}\Vert ^2\le & {} \left( 1+\frac{\tau }{2}\right) (-h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)})) \end{aligned}$$
(15)
$$\begin{aligned} \frac{1}{2\mu \alpha _{max}} \Vert {\tilde{y}}^{(k)}- {\hat{y}}^{(k)}\Vert ^2\le & {} \frac{\tau }{2} (-h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)})) \end{aligned}$$
(16)
$$\begin{aligned} \frac{\theta }{2\mu \alpha _{max}} \Vert {\tilde{y}}^{(k)}- { x}^{(k)}\Vert ^2\le & {} (-h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)})), \end{aligned}$$
(17)

with \(\displaystyle \theta = \left( \sqrt{1+\frac{\tau }{2}}+\sqrt{\frac{\tau }{2}}\right) ^{-2}\le 1\).

Lemma 10

Suppose Assumptions [A1]–[A3] hold true. Then, there exists a subgradient \({\hat{v}}^{(k)}\in \partial f( {\hat{y}}^{(k)})\) such that

$$\begin{aligned} \Vert {\hat{v}}^{(k)}\Vert\le & {} p(\Vert {\hat{y}}^{(k)}- { x}^{(k)}\Vert + \Vert { x}^{(k)}- { x}^{(k-1)}\Vert ) \end{aligned}$$
(18)
$$\begin{aligned}\le & {} q\left( \sqrt{ - h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)})} + \Vert { x}^{(k)}- { x}^{(k-1)}\Vert \right) , \end{aligned}$$
(19)

where the two constants p, q depend only on \(\alpha _{min},\alpha _{max},\beta _{max}\), \(\mu \), and the Lipschitz constant L.

Lemma 11

Suppose Assumptions [A1]–[A3] hold true. Then, there exist \(c,d,{\bar{c}},{\bar{d}} \in {\mathbb {R}}\) depending only on \(\alpha _{min}\), \(\alpha _{max}\), \(\beta _{max}\), \(\mu \), \(\tau \) such that

$$\begin{aligned} f( {\hat{y}}^{(k)})\ge & {} f( {\tilde{y}}^{(k)}) +ch^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)}) - d \Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 \end{aligned}$$
(20)
$$\begin{aligned} f( {\hat{y}}^{(k)})\le & {} f( { x}^{(k)}) - {\bar{c}} h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)})+ {\bar{d}}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2. \end{aligned}$$
(21)

Next, we show that the linesearch procedure at Step 5 terminates in a finite number of steps, and that the linesearch parameter \(\lambda _k\) is bounded away from zero.

Lemma 12

Let \(\{x^{(k)}\}_{k\in {\mathbb {N}}}\) be the sequence generated by Algorithm 1 and suppose Assumptions [A1]–[A3] hold true.

  1. (i) The linesearch at Step 5 is well-defined.

  2. (ii) There exists \(\lambda _{min}>0\) depending only on \(\alpha _{min}\), \(\alpha _{max}\), \(\beta _{max}\), \(\mu \) and \(\tau \), such that the parameter \(\lambda _k\) computed at Step 6 satisfies

    $$\begin{aligned} \lambda _k\ge \lambda _{min}, \ \ \forall \ k\ge 0. \end{aligned}$$
    (22)

Proof

Let \(\lambda \in (0,1)\). Then, we have

$$\begin{aligned}&f( { x}^{(k)}+\lambda d^{(k)}) + \frac{\gamma }{2} \lambda ^2\Vert d^{(k)}\Vert ^2\\&\quad = f_0( { x}^{(k)}+\lambda d^{(k)}) + f_1((1-\lambda ) { x}^{(k)}+ \lambda {\tilde{y}}^{(k)}) + \frac{\gamma }{2} \lambda ^2\Vert d^{(k)}\Vert ^2\\&\quad \le f_0( { x}^{(k)}) + \lambda \langle \nabla f_0( { x}^{(k)}),d^{(k)}\rangle + {(L+\gamma )} \frac{\lambda ^2}{2}\Vert d^{(k)}\Vert ^2 +\\&\qquad + (1-\lambda ) f_1( { x}^{(k)}) + \lambda f_1( {\tilde{y}}^{(k)})\\&\quad = f( { x}^{(k)}) + \lambda \left( f_1( {\tilde{y}}^{(k)})-f_1( { x}^{(k)}) + \langle \nabla f_0( { x}^{(k)}),d^{(k)}\rangle \right) +\\&\qquad + (L+\gamma ) \frac{\lambda ^2}{2} \Vert d^{(k)}\Vert ^2 \pm \frac{\beta _k}{\alpha _k} \lambda \langle D_k( { x}^{(k)}- { x}^{(k-1)}),d^{(k)}\rangle \\&\quad = f( { x}^{(k)}) + \lambda \left( f_1( {\tilde{y}}^{(k)})-f_1( { x}^{(k)}) + \langle \nabla f_0( { x}^{(k)})- \frac{\beta _k}{\alpha _k}D_k( { x}^{(k)}- { x}^{(k-1)}),d^{(k)}\rangle \right) +\\&\qquad +\frac{1}{2}(L+\gamma ) \lambda ^2 \Vert d^{(k)}\Vert ^2 + \frac{\beta _k}{\alpha _k} \lambda \langle { x}^{(k)}- { x}^{(k-1)},D_kd^{(k)}\rangle \\&\quad \le f( { x}^{(k)}) + \lambda \left( f_1( {\tilde{y}}^{(k)})-f_1( { x}^{(k)}) + \langle \nabla f_0( { x}^{(k)})- \frac{\beta _k}{\alpha _k}D_k( { x}^{(k)}- { x}^{(k-1)}),d^{(k)}\rangle \right) +\\&\qquad +\frac{1}{2}(L+\gamma ) \lambda ^2 \Vert d^{(k)}\Vert ^2 + \frac{\beta _k^2}{2\gamma \alpha _k^2} \lambda ^2 \Vert D_kd^{(k)}\Vert ^2 +\frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2\\&\quad \le f( { x}^{(k)}) + \lambda \left( f_1( {\tilde{y}}^{(k)})-f_1( { x}^{(k)}) + \langle \nabla f_0( { x}^{(k)})- \frac{\beta _k}{\alpha _k}D_k( { x}^{(k)}- { x}^{(k-1)}),d^{(k)}\rangle \right) +\\&\qquad +\frac{1}{2}\left( L+\gamma + \frac{\mu ^2\beta _{max}^2}{\gamma \alpha _{min}^2} \right) \lambda ^2 \Vert d^{(k)}\Vert ^2 +\frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2\\&\quad \le f( { x}^{(k)}) +\frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 + \lambda \Delta _k +\frac{1}{2}\left( L+\gamma + \frac{\mu ^2\beta _{max}^2}{\gamma \alpha _{min}^2} \right) \lambda ^2 \Vert d^{(k)}\Vert ^2 \\&\quad \le f( { x}^{(k)}) +\frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 + \lambda \Delta _k - \frac{\mu \alpha _{max}}{\theta }\left( L+\gamma + \frac{\mu ^2\beta _{max}^2}{\gamma \alpha _{min}^2} \right) \lambda ^2 \Delta _k \end{aligned}$$

where the first inequality follows from the Descent Lemma and the convexity of \(f_1\), the second one from the inequality

$$\begin{aligned} 0&\le \frac{1}{2} \Vert \sqrt{\gamma }( { x}^{(k)}- { x}^{(k-1)}) - \frac{\beta _k}{\alpha _k\sqrt{\gamma }}\lambda D_kd^{(k)}\Vert ^2\\&= \frac{\gamma }{2} \Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 + \frac{1}{2} \lambda ^2 \frac{\beta _k^2}{\gamma \alpha _k^2}\Vert D_kd^{(k)}\Vert ^2 - \lambda \frac{\beta _k}{\alpha _k}\langle { x}^{(k)}- { x}^{(k-1)},D_kd^{(k)}\rangle , \end{aligned}$$

and the third one from inequality (17). Setting \(\kappa = \dfrac{\mu \alpha _{max}}{\theta }\left( L+\gamma + \dfrac{\mu ^2\beta _{max}^2}{\gamma \alpha _{min}^2}\right) \), the last inequality reads

$$\begin{aligned} f( { x}^{(k)}+\lambda d^{(k)}) + \frac{\gamma }{2} \lambda ^2\Vert d^{(k)}\Vert ^2\le f( { x}^{(k)}) +\frac{\gamma }{2}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 + \lambda (1-\kappa \lambda ) \Delta _k. \end{aligned}$$

Let us denote by m the smallest non-negative integer such that \(\delta ^{m}\le (1 - \sigma )/\kappa \). Then, the parameter \(\lambda _k\) computed at Step 6 using inequality (12) satisfies \(\lambda _k\ge \delta ^m\). Therefore, (22) is satisfied with \(\lambda _{min} = \delta ^m\). \(\square \)

The main result about Phila is stated below.

Theorem 13

Let \(\{x^{(k)}\}_{k\in {\mathbb {N}}}\) be the sequence generated by Algorithm 1. Suppose that Assumptions [A1]–[A4] are satisfied and assume that the function \({\mathcal {F}}:{\mathbb {R}}^n\times {\mathbb {R}}\rightarrow {\overline{{\mathbb {R}}}}\) defined as

$$\begin{aligned} {{\mathcal {F}}}(x,\rho ) = f(x) + \frac{1}{2} \rho ^2 \end{aligned}$$
(23)

is a KL function. Moreover, assume that the sequence \(\{{ x}^{(k)}\}_{k\in {\mathbb {N}}}\) is bounded. Then, \(\{x^{(k)}\}_{k\in {\mathbb {N}}}\) converges to a stationary point of f.

The proof of Theorem 13 is given by showing that Phila can be cast in the framework of the abstract scheme in [18]. More precisely, we will prove Theorem 13 as a special case of [18, Theorem 5(iii)] which, for convenience, is restated below.

Theorem 14

Let \({{\mathcal {F}}}:{\mathbb {R}}^{n}\times {\mathbb {R}}^m\rightarrow {\overline{{\mathbb {R}}}}\) be a proper, lower semicontinuous KL function. Consider any sequence \(\{( { x}^{(k)},\rho ^{(k)})\}_{k\in {\mathbb {N}}}\subset {\mathbb {R}}^n\times {\mathbb {R}}^m\) and assume that there exist a proper, lower semicontinuous, bounded from below function \(\Phi :{\mathbb {R}}^n\times {\mathbb {R}}^q\rightarrow {\overline{{\mathbb {R}}}}\) and four sequences \(\{u^{(k)}\}_{{k\in {\mathbb {N}}}}\subset {\mathbb {R}}^n\), \(\{s^{(k)}\}_{{k\in {\mathbb {N}}}}\subset {\mathbb {R}}^q\), \(\{\rho ^{(k)}\}_{k\in {\mathbb {N}}}\subset {\mathbb {R}}^m\), \(\{d_k\}_{k\in {\mathbb {N}}}\subset {\mathbb {R}}_{\ge 0}\) such that the following relations are satisfied.

  1. [H1] There exists a positive real number a such that

    $$\begin{aligned} \Phi ( { x}^{(k+1)},s^{(k+1)})+a d_k^2 \le \Phi ( { x}^{(k)},s^{(k)}), \quad \forall \ k\ge 0. \end{aligned}$$

  2. [H2] There exists a sequence of non-negative real numbers \(\{r_k\}_{k\in {\mathbb {N}}}\) with \(\lim \limits _{k\rightarrow \infty } r_k = 0\) such that

    $$\begin{aligned} \Phi ( { x}^{(k+1)},s^{(k+1)})\le {{\mathcal {F}}}(u^{(k)},\rho ^{(k)})\le \Phi ( { x}^{(k)},s^{(k)})+r_k, \quad \forall \ k\ge 0. \end{aligned}$$

  3. [H3] There exists a subgradient \(w^{(k)}\in \partial {{\mathcal {F}}}(u^{(k)},\rho ^{(k)})\) such that

    $$\begin{aligned} \Vert w^{(k)}\Vert \le b \sum _{i\in {{\mathcal {I}}}}\theta _id_{k+1-i}, \quad \forall \ k\ge 0, \end{aligned}$$

    where b is a positive real number, \({{\mathcal {I}}}\subset {\mathbb {Z}}\) is a non-empty, finite index set and \(\theta _i\ge 0\), \(i\in {{\mathcal {I}}}\) with \(\sum _{i\in {{\mathcal {I}}}} \theta _i=1\) (\(d_j=0\) for \(j\le 0\)).

  4. [H4] If \(\{(x^{(k_j)},\rho ^{(k_j)})\}_{j\in {\mathbb {N}}}\) is a subsequence of \(\{( { x}^{(k)},\rho ^{(k)})\}_{k\in {\mathbb {N}}}\) converging to some \( (x^*,\rho ^*)\in {\mathbb {R}}^n\times {\mathbb {R}}^m\), then we have

    $$\begin{aligned} \lim _{j\rightarrow \infty }\Vert u^{(k_j)}-x^{(k_j)}\Vert =0, \quad \lim _{j\rightarrow \infty } {{\mathcal {F}}}(u^{(k_j)},\rho ^{(k_j)}) = {{\mathcal {F}}}(x^*,\rho ^*). \end{aligned}$$

  5. [H5] There exists a positive real number \(p>0\) such that

    $$\begin{aligned} \Vert { x}^{(k+1)}- { x}^{(k)}\Vert \le p d_{k}, \quad \forall \ k\ge 0. \end{aligned}$$

Moreover, assume that \(\{( { x}^{(k)},\rho ^{(k)})\}_{k\in {\mathbb {N}}}\) is bounded and \(\{\rho ^{(k)}\}_{k\in {\mathbb {N}}}\) converges. Then, the sequence \(\{( { x}^{(k)},\rho ^{(k)})\}_{k\in {\mathbb {N}}}\) converges to a stationary point of \({\mathcal {F}}\).

In the remainder of this section, we will show that the sequence generated by Phila, with a proper setting of the surrogate function \(\Phi \) and of all the auxiliary sequences, satisfies the assumptions of the previous theorem.

Proposition 15

Let \(\{ { x}^{(k)}\}_{k\in {\mathbb {N}}}\) be the sequence generated by Phila and define the function \(\Phi :{\mathbb {R}}^n\times {\mathbb {R}}^n\rightarrow {\overline{{\mathbb {R}}}}\) as

$$\begin{aligned} \Phi (x,s) = f(x)+ \frac{\gamma }{2} \Vert x-s\Vert ^2. \end{aligned}$$
(24)

If Assumptions [A1]–[A3] hold true, then for any k we have

$$\begin{aligned} \Phi ( { x}^{(k+1)}, { x}^{(k)}) - \sigma \lambda _{min}\Delta _k \le \Phi ( { x}^{(k)}, { x}^{(k-1)}). \end{aligned}$$
(25)

If, in addition, Assumption [A4] is satisfied, we also have

$$\begin{aligned} 0=\lim _{k\rightarrow \infty } \Vert { x}^{(k+1)}- { x}^{(k)}\Vert = \lim _{k\rightarrow \infty } h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)}) = \lim _{k\rightarrow \infty }\Vert {\tilde{y}}^{(k)}- { x}^{(k)}\Vert .\nonumber \\ \end{aligned}$$
(26)

Proof

Inequality (25) follows from (13) and from Lemma 12. Since, by Assumption [A4], f is bounded from below, \(\Phi \) is bounded from below as well. Therefore, (25) implies \(-\sum _{k=0}^{\infty }\Delta _k<\infty \), which, in turn, yields \(\lim _{k\rightarrow \infty }\Delta _k=0\). Recalling (17), this implies (26). \(\square \)

Proposition 16

Let \(\{ { x}^{(k)}\}_{{k\in {\mathbb {N}}}}\) be the sequence generated by Phila and suppose that Assumptions [A1]–[A4] hold true. Let \({{\mathcal {F}}}:{\mathbb {R}}^n\times {\mathbb {R}}\rightarrow {\overline{{\mathbb {R}}}}\) be defined as in (23). Then, there exist \(\{\rho _k\}_{k\in {\mathbb {N}}},\{r_k\}_{k\in {\mathbb {N}}}\subset {\mathbb {R}}\), with \(\lim _{k\rightarrow \infty }r_k=0\), such that

$$\begin{aligned} \Phi ( { x}^{(k+1)}, { x}^{(k)})\le {{\mathcal {F}}}( {\hat{y}}^{(k)},\rho _k)\le \Phi ( { x}^{(k)}, { x}^{(k-1)})+r_k, \quad \forall \ k\ge 0, \end{aligned}$$
(27)

where \(\Phi \) and \( {\hat{y}}^{(k)}\) are defined in (24) and (8), respectively.

Proof

From Step 7, we have

$$\begin{aligned} \Phi ( { x}^{(k+1)}, { x}^{(k)})\le & {} \Phi ( {\tilde{y}}^{(k)}, { x}^{(k)}) = f( {\tilde{y}}^{(k)}) + \frac{\gamma }{2} \Vert {\tilde{y}}^{(k)}- { x}^{(k)}\Vert ^2\\\le & {} f( {\hat{y}}^{(k)}) - \left( c+\gamma \frac{\mu \alpha _{max}}{\theta }\right) h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)}) \\ {}{} & {} + d\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2, \end{aligned}$$

where the last inequality follows from (20) and (17). Setting

$$\begin{aligned} \rho _k = \sqrt{2}\left( -\left( c+\gamma \frac{\mu \alpha _{max}}{\theta }\right) h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)})+ d\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2\right) ^{\frac{1}{2}},\nonumber \\ \end{aligned}$$
(28)

we obtain \( \Phi ( { x}^{(k+1)}, { x}^{(k)}) \le {{\mathcal {F}}}( {\hat{y}}^{(k)},\rho _k)\), which represents the left-most inequality in (27). On the other hand, from inequality (21) we obtain

$$\begin{aligned} {{\mathcal {F}}}( {\hat{y}}^{(k)},\rho _k)= & {} f( {\hat{y}}^{(k)}) +\frac{1}{2} \rho _k^2\\\le & {} f( { x}^{(k)}) -{\bar{c}} h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)}) +{\bar{d}}\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2+\frac{1}{2} \rho _k^2. \end{aligned}$$

Setting \(r_k = \rho _k^2/2 -{\bar{c}} h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)}) +({\bar{d}}-\frac{\gamma }{2})\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 \), we can write

$$\begin{aligned} {{\mathcal {F}}}( {\hat{y}}^{(k)},\rho _k) \le f( { x}^{(k)}) +\frac{\gamma }{2} \Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2 + r_k = \Phi ( { x}^{(k)}, { x}^{(k-1)}) + r_k, \end{aligned}$$

which is the rightmost inequality in (27). Finally, from (26) we have that \(\lim _{k\rightarrow \infty }r_k=0\). \(\square \)

Proposition 17

Let \(\{ { x}^{(k)}\}_{{k\in {\mathbb {N}}}}\) be the sequence generated by Phila and suppose that Assumptions [A1]–[A4] hold true. Let \(\rho _k\) be defined as in (28). Then, for any k, there exists a subgradient \(w^{(k)}\in \partial {{\mathcal {F}}}({\hat{y}}^{(k)},\rho _k)\) such that

$$\begin{aligned} \Vert w^{(k)}\Vert \le \eta (\sqrt{-\Delta _k} + \sqrt{-\Delta _{k-1}}) \end{aligned}$$
(29)

for some positive constant \(\eta \).

Proof

From (19) we know that for any k there exists a subgradient \({\hat{v}}^{(k)}\in \partial f(\hat{y}^{(k)})\) such that

$$\begin{aligned} \Vert {\hat{v}}^{(k)}\Vert \le q\sqrt{-\Delta _k} + q\Vert { x}^{(k)}- { x}^{(k-1)}\Vert , \end{aligned}$$
(30)

where p, q are the positive constants from Lemma 10, which do not depend on k. Setting

$$\begin{aligned}w^{(k)}= \begin{pmatrix} {\hat{v}}^{(k)}\\ \rho _k\end{pmatrix}, \end{aligned}$$

since \({{\mathcal {F}}}\) is separable, we have

$$\begin{aligned} w^{(k)}\in \partial {{\mathcal {F}}}( {\hat{y}}^{(k)},\rho _k). \end{aligned}$$

By the triangular inequality we also obtain

$$\begin{aligned} \Vert w^{(k)}\Vert \le \Vert {\hat{v}}^{(k)}\Vert + |\rho _k|. \end{aligned}$$
(31)

Let us analyze the two terms at the right-hand side of the inequality above. From (30) we have

$$\begin{aligned} \Vert {\hat{v}}^{(k)}\Vert\le & {} q(\sqrt{-\Delta _k} + \Vert {\tilde{y}}^{(k-1)}- { x}^{(k-1)}\Vert )\\\le & {} q\sqrt{-\Delta _k} + q\sqrt{\frac{2\mu \alpha _{max}}{\theta }\left( -h^{(k-1)}({\tilde{y}}^{(k-1)}; { x}^{(k-1)},x^{(k-2)})\right) }\\= & {} q\sqrt{-\Delta _k} + q\sqrt{\frac{2\mu \alpha _{max}}{\theta }(-\Delta _{k-1})}, \end{aligned}$$

where the first inequality follows from the fact that \( { x}^{(k)}- { x}^{(k-1)}= \lambda _{k-1}({\tilde{y}}^{(k-1)} - { x}^{(k-1)})\), where \(\lambda _{k-1}\le 1\), the second one is a consequence of (17), and the final equality follows from the definition of \(\Delta _{k-1}\). Setting \({\bar{q}} = q\cdot \max \left\{ 1, \sqrt{{2\mu \alpha _{max}}/{\theta }}\right\} \), the last inequality above yields

$$\begin{aligned} \Vert {\hat{v}}^{(k)}\Vert \le {\bar{q}} \left( \sqrt{-\Delta _k}+ \sqrt{-\Delta _{k-1}}\right) . \end{aligned}$$
(32)

Reasoning as above, from (28) we obtain

$$\begin{aligned} \rho _k= & {} \sqrt{2}\left( (c+ \gamma \mu \alpha _{max}/ \theta )(-\Delta _k) + d\Vert { x}^{(k)}- { x}^{(k-1)}\Vert ^2\right) ^{\frac{1}{2}}\\\le & {} \sqrt{2}\left( (c+ \gamma \mu \alpha _{max}/ \theta )(-\Delta _k) + d\Vert {\tilde{y}}^{(k-1)}- { x}^{(k-1)}\Vert ^2\right) ^{\frac{1}{2}}\\\le & {} \sqrt{2}\left( (c+ \gamma \mu \alpha _{max}/ \theta )(-\Delta _k) + d\frac{2\mu \alpha _{max}}{\theta }(-h^{(k-1)}({\tilde{y}}^{(k-1)}; { x}^{(k-1)},x^{(k-2)}))\right) ^{\frac{1}{2}}\\\le & {} \sqrt{2}\left( (c+ \gamma \mu \alpha _{max}/ \theta )(-\Delta _k) + d\frac{2\mu \alpha _{max}}{\theta }(-\Delta _{k-1})\right) ^{\frac{1}{2}}\\\le & {} {\bar{p}} \sqrt{-\Delta _k -\Delta _{k-1}}\le {\bar{p}} \left( \sqrt{-\Delta _k} +\sqrt{-\Delta _{k-1}}\right) , \end{aligned}$$

where \({\bar{p}} = \sqrt{2}\max \left\{ c+ \gamma \mu \alpha _{max}/{\theta }, 2\mu d\alpha _{max} / \theta \right\} ^{\frac{1}{2}}\) and the last inequality follows from the concavity of \(\sqrt{\cdot }\). Therefore, by combining the last inequality above with (32) and (31), it follows that (29) holds with \(\eta = {\bar{q}} + {\bar{p}}\). \(\square \)

Proposition 18

Let \(\{ { x}^{(k)}\}_{k\in {\mathbb {N}}}\) be the sequence generated by Phila and define \(\rho _k\) as in (28). Assume that \(({ x}^*,\rho ^*)\) is a limit point of \(\{( { x}^{(k)},\rho _k)\}_{k\in {\mathbb {N}}}\) and that \(\{(x^{(k_j)}, \rho _{k_j})\}_{j\in {\mathbb {N}}}\) is a subsequence converging to it. Then, we have

$$\begin{aligned} \lim _{k\rightarrow \infty } \Vert {\hat{y}}^{(k)}- { x}^{(k)}\Vert = 0, \quad \lim _{j\rightarrow \infty } {{\mathcal {F}}}({\hat{y}}^{(k_j)}, \rho _{k_j}) = {{\mathcal {F}}}(x^*,\rho ^*). \end{aligned}$$

Proof

Inequality (15), combined with (26), implies that

$$\begin{aligned} \lim _{k\rightarrow \infty } \Vert {\hat{y}}^{(k)}- { x}^{(k)}\Vert = 0. \end{aligned}$$
(33)

From (26) and (29), there exists a sequence \(\{w^{(k)}\}_{k\in {\mathbb {N}}}\) with \(w^{(k)}\in \partial {{\mathcal {F}}}( {\hat{y}}^{(k)},\rho _k)\) such that \(\lim _{k\rightarrow \infty }\Vert w^{(k)}\Vert = 0\). Since \({{\mathcal {F}}}\) has a separable structure, this implies that there exists a sequence \(\{{\hat{v}}^{(k)}\}_{k\in {\mathbb {N}}}\) with \({\hat{v}}^{(k)}\in \partial f({\hat{y}}^{(k)})\) such that \(w^{(k)}= ({\hat{v}}^{(k)},\rho _k)\) with \(\lim _{k\rightarrow \infty } \Vert {\hat{v}}^{(k)}\Vert = \lim _{k\rightarrow \infty }\rho _k=0\). Moreover, thanks to Lemma 4, we can write \({\hat{v}}^{(k)}= \nabla f_0( {\hat{y}}^{(k)}) + z^{(k)}\), with \(z^{(k)}\in \partial f_1({\hat{y}}^{(k)})\). Hence, by continuity of \(\nabla f_0\), the assumption on the subsequence \(\{( { x}^{(k_j)},\rho _{k_j})\}_{j\in {\mathbb {N}}}\), and (33), the following implication holds

$$\begin{aligned} \lim _{k\rightarrow \infty }{\hat{v}}^{(k)}=0\Rightarrow \lim _{j\rightarrow \infty } z^{(k_j)} = -\nabla f_0(x^*). \end{aligned}$$

By applying the subgradient inequality to \(z^{(k_j)}\), and adding the quantity \(\frac{1}{2} \rho _{k_j}^2\) to both sides of the same inequality, we get

$$\begin{aligned} f_1(x^*) + \frac{1}{2} \rho _{k_j}^2\ge & {} f_1( {\hat{y}}^{(k_j)}) + \langle z^{(k_j)},x^*-{\hat{y}}^{(k_j)}\rangle + \frac{1}{2} \rho _{k_j}^2\\= & {} {{\mathcal {F}}}( {\hat{y}}^{(k_j)},\rho _{k_j}) -f_0({\hat{y}}^{(k_j)}) + \langle z^{(k_j)}, x^*-{\hat{y}}^{(k_j)}\rangle , \end{aligned}$$

where the last equality is obtained by adding and subtracting \(f_0({\hat{y}}^{(k_j)})\) to the right-hand-side. Taking limits on both sides we obtain

$$\begin{aligned} f_1(x^*) +\frac{1}{2} (\rho ^*)^2\ge \lim _{j\rightarrow \infty }{{\mathcal {F}}}({\hat{y}}^{(k_j)},\rho _{k_j}) -f_0(x^*), \end{aligned}$$

which, rearranging terms, gives \(\lim _{j\rightarrow \infty }{{\mathcal {F}}}({\hat{y}}^{(k_j)},\rho _{k_j}) \le {{\mathcal {F}}}(x^*,\rho ^*)\). On the other hand, by assumption, \({\mathcal {F}}\) is lower semicontinuous, therefore \(\lim _{j\rightarrow \infty }{{\mathcal {F}}}({\hat{y}}^{(k_j)},\rho _{k_j}) \ge {{\mathcal {F}}}(x^*,\rho ^*)\), which completes the proof. \(\square \)

Proof of Theorem 13

By Propositions 15, 16, and 17, we know that conditions [H1], [H2], and [H3] hold for Phila, with the merit functions defined in (23) and (24) and with the following settings:

$$\begin{aligned} s^{(k)}= { x}^{(k-1)}, \ u^{(k)}= {\hat{y}}^{(k)}, \ d_k = \sqrt{-\Delta _k}, \ a = \sigma \lambda _{min}. \end{aligned}$$

Moreover, the inequalities in [H2] are satisfied with \(\rho ^{(k)}\) and \(r_k\) set as in the proof of Proposition 16, while the inequality [H3] holds with \({{\mathcal {I}}} = \{1,2\}\), \(\theta _i = 1/2\), for \(i=1,2\), \(b = 2\eta \).

Moreover, Proposition 18 leads to conclude that [H4] holds.

Finally, condition [H5] holds as a consequence of (17) and (14), since \(d_k= \sqrt{-\Delta _k}\). Then, Theorem 14 applies and guarantees that the sequence \(\{(x^{(k)},\rho _k)\}_{k\in {\mathbb {N}}}\) converges to a stationary point \((x^*,\rho ^*)\) of \({\mathcal {F}}\). Note that, since \({\mathcal {F}}\) is the sum of separable functions, its subdifferential can be written as \(\partial {\mathcal {F}}(x,\rho )=\partial f(x)\times \{\rho \}\). Then, \((x^*,\rho ^*)\) is stationary for \({\mathcal {F}}\) if and only if \(\rho ^*=0\) and \(0\in \partial f(x^*)\). Hence \(x^*\) is a stationary point for f and \(\{x^{(k)}\}_{k\in {\mathbb {N}}}\) converges to it. \(\square \)

5 Numerical experiments

In this section, we show the flexibility and effectiveness of our proposed algorithm on a series of optimization problems of varying difficulty, ranging from unconstrained quadratic minimization to nonconvex composite minimization (see Table 1 for a summary of the test problems presented in the following and of their main features). For each problem, we equip Algorithm 1 with different selection rules for the parameters \(\alpha _k\) and \(\beta _k\), and compare its numerical performance to other well-known methods available in the literature. Overall, we will see that Algorithm 1 performs best when its parameters are either partially or completely related to the nonlinear Conjugate Gradient iteration. The numerical experience is carried out on a PC equipped with an 11th Gen Intel(R) Core(TM) i7 processor (16GB RAM) in the Matlab (R2021b) environment.

Table 1 Summary of the test problems considered in Sect. 5

5.1 Quadratic toy problems

5.1.1 Unconstrained quadratic problem

As a first test, we address the unconstrained minimization of a quadratic function, i.e.,

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} f_0(x) \equiv \frac{1}{2} x^TAx-b^Tx \end{aligned}$$
(34)

where \(n = 100\), and \(A\in {\mathbb {R}}^{n\times n}\) is a symmetric positive definite matrix with eigenvalues \(\mu _{min}=\mu _{1}<\mu _2<\ldots <\mu _n=\mu _{max}\). The test problem is generated following the same procedure described in [44]: first, we compute an orthogonal matrix \(Q\in {\mathbb {R}}^{n\times n}\) via the QR factorization of a random \(n\times n\) matrix; then, we choose the eigenvalues \(\mu _1<\mu _2<\ldots <\mu _n\) as n random values in the interval \([\mu _{min},\mu _{max}]\), and we define \(\Lambda \in {\mathbb {R}}^{n\times n}\) as the diagonal matrix with \(\Lambda _{ii} = \mu _i\), \(i=1,\ldots ,n\); finally, we define \(A = Q^T\Lambda Q\). The solution of the test problem is then defined as a random vector \(x^*\in {\mathbb {R}}^n\) and the linear term is set as \(b = Ax^*\).
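A minimal sketch of this generator (illustrative names, assuming NumPy) could read as follows.

import numpy as np

def make_quadratic(n=100, mu_min=1.0, mu_max=1e3, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))      # orthogonal factor from a random matrix
    eigs = np.sort(rng.uniform(mu_min, mu_max, size=n))   # n random eigenvalues in [mu_min, mu_max]
    eigs[0], eigs[-1] = mu_min, mu_max                    # enforce the prescribed extreme eigenvalues
    A = Q.T @ np.diag(eigs) @ Q                           # A = Q^T Lambda Q, symmetric positive definite
    x_star = rng.standard_normal(n)                       # prescribed solution
    b = A @ x_star                                        # linear term, so that x_star solves (34)
    return A, b, x_star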

In the numerical experiments we consider different implementations of our method. In order to explain the rationale behind the setting of the algorithm parameters, we first restate the iteration of Algorithm 1 applied to (34). As there is no need to compute an (inexact) proximal point, Phila reduces in this case to the two-step iterative scheme

$$\begin{aligned} { x}^{(k+1)}= { x}^{(k)}+\lambda _k (-\alpha _k D_k^{-1}g^{(k)}+\beta _k s^{(k)}), \quad k =0,1,\ldots , \end{aligned}$$

where

$$\begin{aligned} g^{(k)}=\nabla f_0( { x}^{(k)}), \ \ \ s^{(k)}= { x}^{(k)}- { x}^{(k-1)}. \end{aligned}$$

The above relation shows that the Phila iteration has the same form as the preconditioned CG updating rule, with an iteration-dependent preconditioner. Assume now that at iteration k the steplength parameter \(\alpha _k\) and the scaling matrix \(D_k\) have been selected. Borrowing the ideas behind the Spectral Conjugate Gradient Method (see [11, Section 1]), we consider the following choice for the extrapolation parameter

$$\begin{aligned} \beta _k^{SGM} = \frac{(\alpha _k D_k^{-1}w^{(k)}- s^{(k)})^Tg^{(k)}}{{s^{(k)}}^Tw^{(k)}}, \end{aligned}$$
(35)

where \(w^{(k)}= g^{(k)}-g^{(k-1)}\). The above formula with \(D_k=I\) has been proposed in [11] and is a generalization of well known strategies in the framework of the nonlinear Conjugate Gradient method (see [32] for a survey). For example, if \({s^{(k)}}^TD_k^{-1}g^{(k)}={g^{(k)}}^TD_{k-1}^{-1}g^{(k-1)}=0\) for all k, it gives the following generalization of the Fletcher–Reeves formula

$$\begin{aligned} \beta _k^{PCG} = \beta _k^{FR}\frac{\alpha _k}{\alpha _{k-1}\lambda _{k-1}}, \quad \beta _k^{FR} = \frac{\Vert g^{(k)}\Vert ^2_{D_k^{-1}}}{\Vert g^{(k-1)}\Vert _{D_{k-1}^{-1}}^2}. \end{aligned}$$
(36)

Assuming that a fixed upper bound \(\beta _{max}\) is given as an input parameter of Phila, in all our experiments we set the inertial parameter as follows

$$\begin{aligned} \beta _k = \left\{ \begin{array}{ll} \min \left\{ \beta _{max},\beta _k^{SGM}\right\} &{} \text{ if } \beta _k^{SGM}\ge 0\\ \min \left\{ \beta _{max},\beta _k^{FR}\right\} &{} \text{ otherwise } \end{array}\right. , \end{aligned}$$
(37)

where the thresholding operation is needed to comply with the theoretical prescriptions.
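A small sketch of this selection rule, with \(D_k=I\) and illustrative variable names (s, g, g_old stand for \(s^{(k)}\), \(g^{(k)}\), \(g^{(k-1)}\)), is reported below.

import numpy as np

def inertial_parameter(alpha_k, s, g, g_old, beta_max):
    w = g - g_old                                         # w^{(k)} = g^{(k)} - g^{(k-1)}
    beta_sgm = np.dot(alpha_k * w - s, g) / np.dot(s, w)  # eq. (35) with D_k = I
    beta_fr = np.dot(g, g) / np.dot(g_old, g_old)         # Fletcher-Reeves ratio in (36)
    if beta_sgm >= 0:
        return min(beta_max, beta_sgm)                    # first branch of (37)
    return min(beta_max, beta_fr)                         # second branch of (37)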

As for the other parameters, in the experiments we set \(D_k=I\), \(\gamma = 10^{-4}\), while we consider different choices for the steplength \(\alpha _k\), which are detailed below.

  • Phila-L: We set \(\alpha _k\) as a fixed value related to the Lipschitz constant of \(\nabla f_0\):

    $$\begin{aligned} \alpha _k = \frac{1.99}{\mu _{max}}. \end{aligned}$$
    (38)
  • Phila-BB1: Here \(\alpha _k\) is computed by means of the first BB rule, namely,

    $$\begin{aligned} \alpha _k^{BB1} =\frac{{s^{(k)}}^Ts^{(k)}}{{s^{(k)}}^Tw^{(k)}}. \end{aligned}$$
    (39)

    The formula for computing \(\alpha _k\) is then obtained by projecting \(\alpha _k^{BB1}\) onto the admissible interval (a sketch of both BB-based rules is given after this list):

    $$\begin{aligned} \alpha _k = \max \left\{ \min \left\{ \alpha _k^{BB1},\alpha _{max}\right\} ,\alpha _{min}\right\} . \end{aligned}$$
  • Phila-BB2: We also consider the variant corresponding to the second BB rule:

    $$\begin{aligned} \alpha _k^{BB2} =\frac{{s^{(k)}}^Tw^{(k)}}{{w^{(k)}}^Tw^{(k)}}. \end{aligned}$$
    (40)

    We observed that the above quotient provides a good estimate of the inverse of the Lipschitz constant of the gradient, \(1/\mu _{max}\). Then, in analogy with (38), we employ twice its value to define \(\alpha _k\)

    $$\begin{aligned} \alpha _k = \max \left\{ \min \left\{ 2\alpha _k^{BB2},\alpha _{max}\right\} ,\alpha _{min}\right\} . \end{aligned}$$
  • Phila-CG: In this implementation we aim to closely reproduce the CG iteration. In particular, for \(k\ge 1\), we compute \(\beta ^{FR}_k\) as in (36) with \(D_k=D_{k-1} = I\) and we define \(p^{(k)}= -g^{(k)}+ \frac{\beta _k^{FR}}{\lambda _{k-1}\alpha _{k-1}}( { x}^{(k)}- { x}^{(k-1)}) \). Then, we define

    $$\begin{aligned} \alpha _k^{CG} = -\frac{{p^{(k)}}^Tg^{(k)}}{{p^{(k)}}^TA p^{(k)}}, \end{aligned}$$
    (41)

    which is the optimal steplength along the direction \(p^{(k)}\), i.e., \(f( { x}^{(k)}+\alpha _k^{CG}p^{(k)}) = \min _{\alpha >0} f( { x}^{(k)}+\alpha p^{(k)})\). Finally, we set

    $$\begin{aligned} \alpha _k = \max \{\min \{\alpha _k^{CG},\alpha _{max}\},\alpha _{min}\} \end{aligned}$$
    (42)

    Thus, the CG iterations are exactly reproduced when, for all k, \(\lambda _k=1\), \(\alpha _k^{CG}\in [\alpha _{min},\alpha _{max}]\) and \(\beta _k^{SGM}\in [0,\beta _{max}]\).

We observe that all the above implementations, except Phila-CG, which is specifically tailored to quadratic problems, can also be applied to smooth nonlinear problems.
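
As a complement to the list above, the sketch below shows the safeguarded BB rules used by Phila-BB1 and Phila-BB2, with the computed values clipped onto \([\alpha _{min},\alpha _{max}]\) as in (42); the numerical bounds are those adopted later in the image restoration experiments and are assumptions here.

```python
import numpy as np

def bb_steplengths(s, w, alpha_min=1e-5, alpha_max=1e5):
    """Safeguarded Barzilai-Borwein rules (39)-(40), clipped as in (42)."""
    bb1 = (s @ s) / (s @ w)                      # first BB rule (39)
    bb2 = (s @ w) / (w @ w)                      # second BB rule (40)
    clip = lambda a: max(min(a, alpha_max), alpha_min)
    return clip(bb1), clip(2.0 * bb2)            # Phila-BB1 and Phila-BB2 (twice BB2)
```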

As a benchmark, we compare our proposed algorithm with the following state-of-the-art methods.

  • SD: this is the classical Steepest Descent, based on the iteration (2) where \(D_k=I\), \(\lambda _k=1\), and \(\alpha _k\) is computed by the exact minimization rule

    $$\begin{aligned} \alpha _k = \mathop {{\text {argmin}}}\limits _{\alpha > 0} f_0( { x}^{(k)}-\alpha \nabla f_0( { x}^{(k)})) = \frac{\Vert \nabla f_0( { x}^{(k)})\Vert ^2}{\nabla f_0( { x}^{(k)})^TA\nabla f_0( { x}^{(k)})}. \end{aligned}$$
  • ISTA: with this acronym, we refer to iteration (2) with \(D_k=I\), \(\lambda _k=1\) and \(\alpha _k\equiv \alpha \) given by the constant value \(\alpha = 1.99/{\mu _{max}}\), where \(\mu _{max}\) represents the Lipschitz constant of \(\nabla f_0\) [10, 26, 28, p. 191].

  • FISTA: this is the well-known accelerated FB method (3), with the parameter setting described in [10, p.193].

  • Heavy-Ball: this consists of iteration (4) with the constant parameter choices proposed in [42], which are defined as follows (a small worked sketch of these constants is given after this list)

    $$\begin{aligned} \beta _k \equiv \beta = \left( \frac{\sqrt{\kappa }-1}{\sqrt{\kappa }+1}\right) ^2, \ \ \alpha _k \equiv \alpha = \frac{(1+\sqrt{\beta })^2}{\mu _{max}}, \end{aligned}$$

    where \(\kappa = \mu _{max}/\mu _{min}\).

  • ADR denotes the inertial algorithm corresponding to formula (4.19) in [6].

  • VMILA: the Variable Metric Inexact Linesearch Algorithm is a FB scheme (2) first proposed in [16] where \(\alpha _k\in [\alpha _{min},\alpha _{max}]\) and \(D_k\in {\mathcal {M}}_\mu \) can be freely selected, whereas \(\lambda _k\in (0,1]\) is computed by Armijo-like backtracking. For our tests, the scaling matrix \(D_k\) is equal to the identity matrix, whereas the steplength \(\alpha _k\) is computed by alternating the two standard Barzilai–Borwein (BB) rules [8].
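
For reference, the following sketch spells out the constant Heavy-Ball parameters above and the exact SD steplength for the quadratic case; the function names are illustrative.

```python
import numpy as np

def heavy_ball_constants(mu_min, mu_max):
    """Constant Heavy-Ball parameters from the formula above."""
    kappa = mu_max / mu_min
    beta = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2
    alpha = (1.0 + np.sqrt(beta)) ** 2 / mu_max
    return alpha, beta

def sd_exact_steplength(A, grad):
    """Exact minimization rule of SD for f_0(x) = x^T A x / 2 - b^T x."""
    return (grad @ grad) / (grad @ (A @ grad))
```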

In Fig. 1 (left column), we show the relative decrease of the objective function with respect to the iteration number for three different values of the condition number of the matrix A. All implementations of Phila outperform ISTA and FISTA, while being comparable or superior to VMILA, Heavy-Ball and ADR. We note that the acceleration of Phila-L with respect to ISTA is due only to the presence of the inertial term, whereas the improvement with respect to all the inertial methods (FISTA, Heavy-Ball and ADR) can be attributed to the automated CG-like choice of the Phila inertial parameter in place of the prefixed FISTA, Heavy-Ball and ADR rules. Furthermore, the implementation Phila-CG achieves the same convergence speed as the Conjugate Gradient method, which provides the best performance among all the competitors.

Fig. 1

Quadratic toy problems. Left column: unconstrained case with different condition numbers; minimum eigenvalue \(\mu _{min} = 1\) and maximum eigenvalue \(\mu _{max} = 10^2\) (top), \(10^3\) (middle), \(10^4\) (bottom). Right column: non-negatively constrained case with \(\mu _{min} = 1\), \(\mu _{max} = 10^3\) and different numbers of active constraints at the solution: \(n_a= 1\) (top), 20 (middle), 48 (bottom)

5.1.2 Non-negatively constrained quadratic problem

In this section we consider a quadratic problem constrained to the non-negative orthant,

$$\begin{aligned} \min _{x{\ge 0}} f_0(x) \equiv \frac{1}{2} x^TAx-b^Tx, \end{aligned}$$
(43)

where \(n = 100\), \(A\in {\mathbb {R}}^{n\times n}\) is a symmetric positive definite matrix with minimum and maximum eigenvalues denoted by \(\mu _{min}\) and \(\mu _{max}\), respectively. This is a convex composite optimization problem of the form (1), where the nonsmooth term corresponds to \(f_1(x) = \iota _{{\ge 0}}(x)\). The proximal operator associated to \(f_1\) is the orthogonal projection onto the constraint set \(P_{\ge 0}(x) = \max \{x,0\}\).

We aim to evaluate the effectiveness of Phila in comparison with some existing approaches, with respect not only to the conditioning of the Hessian A, but also to the number of constraints which are active at the solution.

To this end, the test problems are defined following [44]: first, a symmetric positive definite matrix with uniformly spaced eigenvalues in \([\mu _{min},\mu _{max}]\) is computed as described in the previous section; then, we randomly define a set of indexes \({{\mathcal {I}}}_a \subset \{1,\ldots ,n \}\) of the constraints which will be active at the solution. We compute the vector \(w^*\) of the Lagrange multipliers such that \(w^*_i = 1\) if \(i\in {{\mathcal {I}}}_a\) and \(w_i^* = 0\) if \(i\notin {{\mathcal {I}}}_a\). The solution \(x^*\) is computed in a similar way as a random vector such that \(x^*_i\in (0,1)\) if \(i\notin {{\mathcal {I}}}_a \) and \(x^*_i = 0\) for \(i\in {{\mathcal {I}}}_a\). The linear term is finally computed as \(b = Ax^*-w^*\).
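
A minimal NumPy sketch of this construction follows; the way the spectrum is imposed (a random orthogonal similarity) is an assumption, since the construction used for the unconstrained case is not repeated here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_active = 100, 20
mu_min, mu_max = 1.0, 1.0e3

# SPD Hessian with uniformly spaced eigenvalues in [mu_min, mu_max]
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(mu_min, mu_max, n)) @ Q.T

# active set, Lagrange multipliers w*, solution x*, and linear term b = A x* - w*
idx_active = rng.choice(n, size=n_active, replace=False)
w_star = np.zeros(n); w_star[idx_active] = 1.0
x_star = rng.uniform(size=n); x_star[idx_active] = 0.0
b = A @ x_star - w_star
```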

Also in this case we consider different implementations of Phila, which can be regarded as generalizations of those described in the previous sections. Indeed, the optimality conditions of problem (43) read as

$$\begin{aligned} \partial f(x) = \nabla f_0(x) + \partial \iota _{\ge 0}(x) \ni 0, \ \ [\partial \iota _{\ge 0}(x)]_i = \left\{ \begin{array}{ll} 0 &{} \text{ if } x_i > 0\\ {\mathbb {R}}_{\le 0} &{} \text{ if } x_i = 0. \end{array}\right. \end{aligned}$$

At each iteration k we compute the following set of indexes

$$\begin{aligned} {{\mathcal {B}}}_k = \{i: { x}^{(k)}_i = 0 \text{ and } \nabla _i f_0( { x}^{(k)}) \ge 0 \} \end{aligned}$$
(44)

and we define the residual vector \(r^{(k)}\in {\mathbb {R}}^n\) as

$$\begin{aligned} r^{(k)}_i = \left\{ \begin{array}{ll} 0 &{} \text{ if } i\in {{\mathcal {B}}}_k\\ \nabla _i f_0( { x}^{(k)}) &{} \text{ otherwise. } \end{array} \right. \end{aligned}$$
(45)

Clearly we have \(r^{(k)}\in \partial f( { x}^{(k)})\) and, in particular, \(\Vert r^{(k)}\Vert = \textrm{dist}(\partial f( { x}^{(k)}),0)\). Following the findings in [29, 30], we then propose to generalize the implementation of Phila described in the previous section by using the subgradient (45) in place of the gradient of \(f_0\).
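
Before detailing the resulting variants, the residual computation (44)–(45) can be sketched in NumPy as follows (a plain illustration, not a reference implementation):

```python
import numpy as np

def projected_residual(x, grad):
    """Residual r^(k) of (45): zero the components in the index set B_k of (44)."""
    active = (x == 0.0) & (grad >= 0.0)   # B_k: active constraints with grad >= 0
    r = grad.copy()
    r[active] = 0.0
    return r
```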

  • Phila-L, Phila-BB1 and Phila-BB2 are the generalizations of the corresponding algorithms described in the previous section: \(\beta _k\) is still defined as in (37), but with

    $$\begin{aligned} w^{(k)}= r^{(k)}- r^{(k-1)}, \ \ g^{(k)}= r^{(k)}\end{aligned}$$
    (46)

    in (35)–(36). Moreover, the steplength parameter for Phila-L is defined as in (38), while for Phila-BB1 and Phila-BB2 we revisit the BB rules (39), (40) by computing them with \(w^{(k)}\) as in (46).

  • Phila-CG: we compute \(\beta ^{FR}_k\) as in (36) with \(D_k=D_{k-1} = I\) and \(r^{(k)}\) in place of \(g^{(k)}\). Then, we define \(p^{(k)}= -r^{(k)}+ \frac{\beta _k^{FR}}{\lambda _{k-1}\alpha _{k-1}}( { x}^{(k)}- { x}^{(k-1)}) \) and \(\alpha _k^{CG}\) as in (41), again with \(r^{(k)}\) instead of \(g^{(k)}\). Finally, \(\alpha _k\) is defined as in (42) while the inertial parameter is

    $$\begin{aligned} \beta _k= & {} \left\{ \begin{array}{ll}\min \left\{ \beta _{max},\frac{\alpha _k^{CG}\beta _k^{CG}}{\alpha _{k-1}^{CG}\lambda _{k-1}}\right\} &{} \text{ if } {{\mathcal {B}}}_{k-1}= {{\mathcal {B}}}_k\\ 0 &{} \text{ otherwise } \end{array} \right. . \end{aligned}$$

Also in this case, the implementations Phila-L, Phila-BB1 and Phila-BB2 can be directly applied to non-negatively constrained nonlinear smooth problems. The above variants of Phila are then compared with the following methods.

  • ISTA, FISTA, VMILA, Heavy-Ball and ADR with the same parameters described in the previous section. Since in this case the nonsmooth term \(f_1\ne 0\) is convex but not strongly convex, the Heavy-Ball method actually corresponds to the algorithm named iPiasco proposed in [39].

  • BCCG: we include in the comparison the Bound Constrained Conjugate Gradient method [53], an adaptation of the CG iteration to box-constrained problems in which the search direction is based on the residual vector. The convergence properties of this method are not well established, but we include it since we observed good practical performance.

The results of the comparison are depicted in Fig. 1 (right column). Also in this case, Phila-BB2 achieves the overall best performance.

5.2 Image restoration problems

5.2.1 Image denoising with edge preserving Total Variation regularization

We consider the variational model for image denoising

$$\begin{aligned} \min _{x\in {\mathbb {R}}_{\ge 0}^n} \frac{1}{2} \rho \Vert x-g\Vert ^2 + TV_\epsilon (x), \end{aligned}$$
(47)

where \(g\in {\mathbb {R}}^n\) is the noisy data, \(\rho \) is a positive parameter and the discrete, smoothed version of the Total Variation function is defined as

$$\begin{aligned} TV_\epsilon (x) = \sum _{i=1}^n\sqrt{\Vert A_ix\Vert ^2+\epsilon ^2} \end{aligned}$$

where \(A_i\in {\mathbb {R}}^{2\times n}\) represents the discrete image gradient operator at the i-th pixel, while \(\epsilon >0\) is a smoothing parameter.
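
As an illustration, a NumPy sketch of \(TV_\epsilon \) and its gradient is given below; forward differences with a replicated boundary are an assumption, since the discrete gradient operators \(A_i\) are not specified in detail here.

```python
import numpy as np

def tv_eps_and_grad(x, eps=1.0):
    """Smoothed TV value and gradient for a 2D image x.
    Forward differences; last row/column differences are zero (replicated boundary)."""
    dh = np.diff(x, axis=1, append=x[:, -1:])    # horizontal differences
    dv = np.diff(x, axis=0, append=x[-1:, :])    # vertical differences
    norm = np.sqrt(dh ** 2 + dv ** 2 + eps ** 2)
    val = norm.sum()
    ph, pv = dh / norm, dv / norm                # smoothed normalized gradient field
    # gradient of val: adjoint of the difference operators applied to (ph, pv)
    gh = np.empty_like(x)
    gh[:, 0] = -ph[:, 0]
    gh[:, 1:] = ph[:, :-1] - ph[:, 1:]
    gv = np.empty_like(x)
    gv[0, :] = -pv[0, :]
    gv[1:, :] = pv[:-1, :] - pv[1:, :]
    return val, gh + gv
```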

The noisy data g has been generated by choosing a good quality image of size \(481\times 321\) (hence \(n = 154401\)), with pixels in the range [0, 255], and adding Gaussian noise with zero mean and standard deviation 25. The model parameters \(\rho \) and \(\epsilon \) have been empirically selected in order to obtain a good restoration: the values employed in the experiments are \(\rho =0.0531\) and \(\epsilon =1\). The ground truth and the noisy image are shown in Fig. 2a, b.

Fig. 2

Image denoising problem. a Ground truth; b noisy image (PSNR = 20.27); c restoration (PSNR = 27.67)

Problem (47) is a strongly convex, nonlinear, smooth, constrained optimization problem with a unique solution \(x^*\), where the nonsmooth term is \(f_1(x) = \iota _{\ge 0}(x)\).

In the numerical experiments we consider the implementations Phila-L, Phila-BB1 and Phila-BB2 exactly as described in Sect. 5.1.2. In particular, for Phila-L, we set \(\mu _{max} = 1+\rho /\epsilon \) in (38), which is motivated by the approximation of the Lipschitz constant of \(\nabla TV_\epsilon (x)\) with \( 1/\epsilon \) (see [22, Section 3.3]).

The other parameters are set as \(\gamma =10^{-4}\), \(\alpha _{min}= 10^{-5}\), \(\alpha _{max} = 10^{5}\) and \(\beta _{max} = 1.5\).

As a benchmark, we consider again ISTA, FISTA and VMILA. In particular, since the Lipschitz constant is available only as an estimate, we equip both ISTA and FISTA with a backtracking procedure to possibly adjust the steplength. Since the problem is strongly convex, we can still include the Heavy-Ball method and ADR in the comparison. In addition, we also consider the inertial methods ABPG-LS [54], PGels [55], nmAPG-LS [34] and v-iBPG [56].

Moreover, for Phila and VMILA, we consider also a variable metric implementation, obtained by defining a diagonal scaling matrix \(D_k\) as prescribed in [58]. These implementations are denoted with the suffix ’s’ added to the algorithm name.

As a stopping criterion we adopt the average relative difference of the objective function value between two successive iterates over the last 10 steps

$$\begin{aligned} \frac{1}{10}\sum _{j=0}^{9}\frac{|f(x^{(k-j)})-f(x^{(k+1-j)})|}{|f(x^{(k-j)})|}\le tol. \end{aligned}$$
(48)

Since the exact solution \(x^*\) of problem (47) is not available, it has been approximated by running all the considered methods for 200 iterations or until condition (48) is satisfied with \(tol = 10^{-15}\), and then selecting the point with the lowest function value. We then compare the speed of each method in approaching the optimal value.
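
A sketch of the stopping test (48), applied to the history of objective values (the list `f_hist` is assumed to collect \(f(x^{(0)}),\ldots ,f(x^{(k+1)})\)):

```python
def stop_avg_rel_decrease(f_hist, tol, window=10):
    """Averaged relative decrease of the objective over the last `window` steps, cf. (48)."""
    if len(f_hist) < window + 1:
        return False
    terms = [abs(f_hist[-2 - j] - f_hist[-1 - j]) / abs(f_hist[-2 - j])
             for j in range(window)]
    return sum(terms) / window <= tol
```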

The results of the numerical comparison are reported in Fig. 3. In the left panel, we plot the relative distance of the objective function from the optimal value with respect to the iteration number over the first 150 iterations, while the right panel plots the same quantity with respect to the computational time in seconds. All algorithms are stopped when a maximum of 1000 iterations or 4.5 s of computing time are exceeded. As a further reference, in Table 2 we report for each method the number of iterations performed to satisfy the stopping criterion (48) with \(tol=10^{-8}\) (i.e., for a medium-high accuracy solution), the total number of function evaluations, the relative Euclidean distance between the restored and the ground truth image, the PSNR, and the corresponding computational time. Actually, iPiasco and ADR do not require the computation of the objective function, but only of its gradient. However, since the objective function evaluation can be obtained as a byproduct of the gradient, we believe that the plots in Fig. 3 and Table 2 give a clear indication about the effectiveness of the algorithms. In Fig. 2c we report the output image obtained from Phila-BB2.

Our proposed line-search based methods outperform the benchmark methods with fixed parameters (Heavy-Ball, ADR, v-iBPG) as well as those employing a backtracking procedure (ABPG_LS, PGels, nmAPG_LS). In terms of execution time, the best performing algorithm is Phila-BB2. Indeed, the presence of the scaling matrix seems to slightly reduce the number of iterations needed to achieve a given accuracy, as shown in the left panel of Fig. 3, but in some cases this results in an increased number of backtracking steps. This explains why the scaled methods generally perform better than their nonscaled versions in terms of number of iterations (see left panel of Fig. 3), whereas the computational time can be larger. This can also be explained by the fact that the computation of the scaling matrix adds some complexity. However, we believe that a more reliable assessment of the algorithms based on the execution time would require a highly optimized implementation of each code, which is beyond the scope of the present paper.

Fig. 3

Image denoising problem. Relative decrease of the objective function with respect to the iteration number (left panel) and to the computational time (right panel)

Table 2 Image denoising problem: benchmark of the algorithms when (48) is satisfied with \(tol=10^{-8}\)

5.2.2 Image deblurring with bimodal nonconvex regularization

In this section we consider a simulated test problem inspired by the Helsinki Deblur Challenge (HDC) 2021. The goal is to recover an image of some text string from a noisy blurred acquisition. The target image has been generated by resizing and binarizing a randomly selected ground truth image provided by the HDC dataset (the ones acquired by Camera 1). The resulting image has size \(183\times 295\) and its pixels are zero (black) or one (white). The binary image has been convolved with a disk psf of radius 7 to simulate out-of-focus blur. Finally, Gaussian noise with zero mean and standard deviation 0.01 has been added. The restored image is obtained by solving the following optimization problem

$$\begin{aligned} \min _{x\in [0,1]^n} \frac{1}{2} \Vert Hx-g\Vert ^2 + \rho TV_\epsilon (x) + \chi \sum _{i=1}^nx_i(1-x_i), \end{aligned}$$
(49)

where g is the noisy blurred data and H represents the blurring matrix corresponding to the disk psf. In addition to the TV functional, here the objective function also includes a bimodal nonconvex regularization term, whose minima are attained when each pixel value is either zero or one. This kind of regularization has been considered in [15] in the HDC framework. The ground truth image and the simulated data are reported in Fig. 4a, b, respectively. Problem (49) is a nonconvex, constrained, smooth optimization problem. Due to nonconvexity, it may have multiple stationary points.

Fig. 4

Image deblurring problem - constrained. a Ground truth; b blurred noisy image; c restoration

The implementation of Phila is based on a generalization of the settings described in Sect. 5.1.2, taking into account the presence of box constraints instead of non-negativity only. In particular, for each iteration k we define the following set of indexes

$$\begin{aligned} {{\mathcal {B}}}_k = \{i: (x_i^{(k)}= 0 \wedge \nabla _i f_0( { x}^{(k)}) \ge 0) \vee (x_i^{(k)}= 1 \wedge \nabla _i f_0( { x}^{(k)}) \le 0) \} \end{aligned}$$

and we define the residual at the iteration k as in (45) and the vectors \(w^{(k)},g^{(k)}\) as in (46). Also in this case we have \(r^{(k)}\in \partial f( { x}^{(k)})\) and \(\Vert r^{(k)}\Vert = \textrm{dist}(\partial f( { x}^{(k)}),0)\). Based on these definitions, we consider the implementations Phila-L, Phila-BB1 and Phila-BB2 as described in Sect. 5.1.2. We also set \(D_k= I\) and \(\gamma = 10^{-4}\). For the fixed stepsize version Phila-L, we adopt \(\mu _{max}=1+\rho /\epsilon + 2\chi \) as an estimate of the Lipschitz constant of \(\nabla f_0(x)\), and we define \(\alpha _k\) as in (38).
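
The corresponding box-constrained residual can be sketched as below (again with \(D_k=I\); the interface is illustrative):

```python
import numpy as np

def projected_residual_box(x, grad, lo=0.0, hi=1.0):
    """Box-constrained analogue of (44)-(45): zero the components whose
    constraint is active and whose gradient points outside [lo, hi]."""
    active = ((x == lo) & (grad >= 0.0)) | ((x == hi) & (grad <= 0.0))
    r = grad.copy()
    r[active] = 0.0
    return r
```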

The other parameters are set exactly as in the previous section (\(\gamma =10^{-4}\), \(\alpha _{min}= 10^{-5}\), \(\alpha _{max} = 10^{5}\) and \(\beta _{max} = 1.5\)). The benchmark algorithms compared to Phila are ISTA, FISTA, VMILA, ABPG_LS, PGels, nmAPG_LS and v-iBPG, as in the previous experiment. FISTA is included even if its convergence properties are proved only for convex problems, since in this case we observed a good behaviour anyway. In particular, for ISTA and FISTA we use the same estimate of \(\mu _{max}\) employed in Phila-L to determine the steplength parameter; also in this case, we include a backtracking procedure as a safeguard for convergence. As a further benchmark, we also include the Heavy-Ball method iPiano [40], with constant parameters.

The results are reported in Fig. 5, where the plots show the decrease of the objective function with respect to the iteration number (left panel) and the computational time (right panel). Table 3 reports a comparison of the algorithms when the stopping condition (48) with \(tol = 10^{-8}\) is satisfied. From left to right, the columns of the table report the number of iterations, the corresponding total number of function evaluations, the Euclidean relative error, the PSNR between the restored and the clean image and the total computational time.

We can observe that all the Phila versions perform well with respect to the reference algorithms: in particular, Phila-BB2 exhibits the overall best performance.

Fig. 5

Image deblurring problem: relative decrease with respect to the iteration number (left panel) and to the computational time, in seconds (right panel)

Table 3 Image deblurring problem

5.2.3 Image deblurring with sparsity inducing regularization on the wavelet transform

In this section we consider a different approach to the deblurring problem, leading to a convex composite \(\ell _2\)-\(\ell _1\) minimization problem. Given a blurred noisy image \(g\in {\mathbb {R}}^n\), a restoration is obtained by solving the following optimization problem

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} \frac{1}{2} \Vert HR^Tx-g\Vert ^2 + \rho \Vert x\Vert _1, \end{aligned}$$
(50)

where \(H\in {\mathbb {R}}^{n\times n}\) is the blurring operator and \(R\in {\mathbb {R}}^{n\times n}\) represents an orthogonal wavelet transform. Here, the unknown \(x\in {\mathbb {R}}^n\) corresponds to the wavelet coefficients and the \(\ell _1\) regularization term aims to induce sparsity in it. Once a solution \(x^*\) of the above problem has been computed, the restored image is obtained as \(R^T x^*\). The test problem is defined following [10, Example 5.1]: as a ground truth we take the \(256\times 256\) cameraman image and we blur it with a Gaussian psf of size \(9\times 9\) and standard deviation 4. We assume reflective boundary conditions, therefore the matrix-vector products involving the blurring operator H can be easily implemented via the dct2 transform. Finally, we add to the blurred image zero mean Gaussian noise with standard deviation \(10^{-3}\). The ground truth and the noisy blurred image are shown in Fig. 6a, b. As for the synthesis operator R we adopt a three-stage Haar wavelet transform. Since R is orthogonal, the eigenvalues of the Hessian matrix can be computed using the dct2 transform. The regularization parameter \(\rho \) is set to \(2\cdot 10^{-5}\).

Problem (50) is convex, therefore the optimal value is unique.
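
A sketch of the smooth part of (50) and of the proximal operator of the \(\ell _1\) term is reported below; the blur H and the synthesis/analysis operators \(R^T,R\) are treated as abstract callables, since their dct2/wavelet realizations are not reproduced here (the interface is an assumption).

```python
import numpy as np

def f0_and_grad(x, H, Ht, R, Rt, g):
    """Smooth part of (50): value 0.5*||H R^T x - g||^2 and gradient R H^T (H R^T x - g).
    H, Ht, R, Rt are callables for H, H^T, R and R^T (hypothetical interfaces)."""
    res = H(Rt(x)) - g
    return 0.5 * (res @ res), R(Ht(res))

def soft_threshold(x, thr):
    """Proximal operator of thr*||.||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)
```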

Fig. 6

Image deblurring problem - \(\ell _2\)-\(\ell _1\). a Ground truth; b blurred noisy image (PSNR = 11.99); c restoration (PSNR = 31.71)

We compare the same algorithms considered in the previous section. In particular, for the implementation of Phila we first observe that in this case we have \(f_1(x) = \rho \Vert x\Vert _1\) and

$$\begin{aligned} \partial f(x) = \nabla f_0(x) + \rho \,\partial \Vert x\Vert _1(x) \ni 0, \ \ [\rho \,\partial \Vert x\Vert _1(x)]_i = \left\{ \begin{array}{ll} \{\rho \,{\text {sign}}(x_i)\} &{} \text{ if } x_i \ne 0\\ {[-\rho ,\rho ]} &{} \text{ if } x_i = 0. \end{array}\right. \end{aligned}$$

Then, using the same rationale behind the strategies proposed for non-negatively and box constrained problems, at each iteration we define the following set of indexes

$$\begin{aligned} {{\mathcal {B}}}_k = \{i: x_i^{(k)}= 0 \wedge \nabla _i f_0( { x}^{(k)}) \in [-\rho ,\rho ] \}, \end{aligned}$$

and we define \(r^{(k)}\) as

$$\begin{aligned} r_i^{(k)}={\left\{ \begin{array}{ll} 0 \ {} &{}i\in {\mathcal {B}}_k\\ \nabla _i f_0(x^{(k)})+\rho {\text {sign}}(x_i^{(k)}) \ &{}\text {otherwise} \end{array}\right. } \end{aligned}$$

so that \(r^{(k)}\in \partial f( { x}^{(k)})\) and \(\Vert r^{(k)}\Vert = \textrm{dist}(\partial f( { x}^{(k)}),0)\). Using this definition of \(r^{(k)}\) and with \(w^{(k)},g^{(k)}\) as in (46), we compute the steplength and inertial parameters as in (39), (40) and (35), leading to the corresponding implementations of Phila. Since R is orthogonal and thanks to the special structure of H, the maximum eigenvalue \(\mu _{max}\) of the Hessian, which is employed in Phila-L, can be computed via the dct2 transform. The other parameters are set as in the previous experiments (\(\alpha _{min}=10^{-5}\), \(\alpha _{max}=10^{5}\), \(\beta _{max}=1.5\) and \(D_k=I\)).
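
A sketch of this residual for the \(\ell _1\)-regularized problem is given below, with \(\nabla f_0(x^{(k)})\) passed as `grad_f0` (an illustrative interface):

```python
import numpy as np

def l1_residual(x, grad_f0, rho):
    """Residual r^(k) used by Phila for (50): components in B_k are set to zero,
    elsewhere r_i = grad_i f_0 + rho*sign(x_i)."""
    in_Bk = (x == 0.0) & (np.abs(grad_f0) <= rho)
    r = grad_f0 + rho * np.sign(x)
    r[in_Bk] = 0.0
    return r
```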

The comparison still includes ISTA, FISTA, Heavy Ball, ADR, VMILA, ABPG_LS, PGels, nmAPG_LS and v-iBPG. In particular, ISTA and FISTA are implemented using the exact value of the Lipschitz constant \(\mu _{max}\).

The results are reported in Fig. 7, while the restoration obtained from the Phila-BB2 solution is shown in Fig. 6c. To evaluate the performance, we first obtain a numerical approximation of the optimal value \(f^*\) by running 15,000 iterations of Phila-BB2. Then, we run all methods for a maximum of 4000 iterations with a time limit of 200 s and we plot the function values with respect to the iteration number (left panel of Fig. 7) and the computational time (right panel). In Table 4 we report the results of all algorithms stopped when condition (48) with \(tol=10^{-6}\) is satisfied. In the columns of the table we list the same quantities as in Tables 2 and 3 with, in the sixth column, the additional information about the number of zero components of the computed solution. Actually, FISTA, ISTA, Heavy-Ball and ADR do not require the function evaluation; however, as we noticed before, the objective function can be obtained as a byproduct of the gradient, which all methods need. Therefore, the performance assessment can still be obtained by comparing the total number of function evaluations of the backtracking methods (Phila and VMILA) with the number of iterations of the methods employing a fixed stepsize (ISTA, FISTA, Heavy-Ball and ADR).

We can observe that the fastest decrease of the objective function is achieved by Phila-BB2, which also provides the sparsest solution. Also in terms of computational effort, Phila-BB2 requires very few backtracking reductions, leading to the overall best performance.

Fig. 7

Image deblurring with sparsity inducing regularization: relative decrease with respect to the iteration number (left panel) and to the computational time (right panel)

Table 4 Image deblurring with sparsity inducing regularization: results when (48) is satisfied with \(tol=10^{-6}\)

5.2.4 Image deblurring in presence of signal dependent Gaussian noise

As a last image restoration example, we consider a deblurring problem where the data \(g\in {\mathbb {R}}^n\) are affected by signal dependent Gaussian noise. In this case, the likelihood principle leads to the following fit-to-data term

$$\begin{aligned} f_0(x) = \sum _{i=1}^n \frac{([Hx]_i - g_i)^2}{a_i[Hx]_i + c_i} + \log (a_i[Hx]_i + c_i), \end{aligned}$$

where \(H\in {\mathbb {R}}^{n\times n}\) models the blurring phenomenon and \(a_i,c_i > 0\) are prefixed parameters. As a regularization term we consider the nonsmooth TV functional \(TV_0\), i.e., the function \(TV_\epsilon \) of Sect. 5.2.1 with \(\epsilon = 0\). This is a case where an explicit formula for the proximity operator is not available, so it has to be approximated with the dual procedure described in the Appendix and detailed in Algorithm 2. Non-negativity constraints are also included in the model, therefore the nonsmooth part of the objective function reads as

$$\begin{aligned} f_1(x) = \rho TV_0(x) + \iota _{\ge 0}(x). \end{aligned}$$

We observe that \(f_0\) is nonconvex and \(\nabla f_0\) is Lipschitz continuous on \(\textrm{dom}(f_1)\), but the constant is not easy to estimate. The test problem, consisting of the ground truth and noisy blurred image shown in Fig. 8a, b, has been downloaded from [46]. Moreover, the data discrepancy parameters are \(a_i=c_i=1\) for all \(i=1,\ldots ,n\), H corresponds to a convolution with a \(7\times 7\) Gaussian kernel and the regularization parameter is \(\rho = 0.03\).
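
For completeness, a sketch of the fit-to-data term and of its gradient is reported below; H and its transpose are abstract callables, and the interface is an assumption.

```python
import numpy as np

def signal_dep_gauss_f0(x, H, Ht, g, a, c):
    """Signal-dependent Gaussian fit-to-data term and its gradient.
    H, Ht are callables for the blur and its transpose (hypothetical interface)."""
    t = H(x)
    d = a * t + c                                        # signal-dependent variance
    val = np.sum((t - g) ** 2 / d + np.log(d))
    # derivative of each summand with respect to t_i, then chain rule through H
    phi_prime = (2.0 * (t - g) * d - a * (t - g) ** 2) / d ** 2 + a / d
    return val, Ht(phi_prime)
```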

The numerical comparison includes ISTA, VMILA, iPiano (in the implementation described in [18]), the variable metric forward-backward algorithm (VMFB) proposed in [25], whose Matlab code can be downloaded from [46], and the inexact version of v-iBPG [56]. All these methods, as well as Phila, need an approximation of the proximity operator. For ISTA, VMILA, iPiano and Phila we adopt the same approach, i.e., Algorithm 2 with \(a=2.1\) and the adaptive stopping criterion (10) with \(\tau =10^{6}\) (for more details concerning this point, see Sect. A.1). The inexactness criterion in v-iBPG is very similar and it can be implemented using the same approach described above but with a different stopping criterion. In particular, at iteration k, the inner iterations are stopped when the primal-dual gap falls below the prefixed tolerance \(\nu _k = 100/(1+k^{1.1})\), fulfilling the theoretical prescriptions detailed in [56]. As for VMFB, in the code [46] the proximal gradient point is still approximated with inner dual iterations, until the relative difference of two successive iterates and the corresponding dual function values fall below a fixed tolerance. Then, the effectiveness of the considered algorithms can be measured in terms of the number of function/gradient evaluations and of the number of inner iterations required to achieve an approximate solution with similar accuracy.

Fig. 8

Image deblurring from data affected by signal dependent Gaussian noise: a Ground truth, b noisy blurred data (PSNR = 23.38); c restoration (PSNR = 29.09)

Fig. 9

Image deblurring from data affected by signal dependent Gaussian noise: relative decrease of the objective function with respect to the iteration number (left panel) and to the computational time (right panel)

In a first phase of our numerical experiments, we also considered the scaled versions of VMILA and Phila based on the variable metric employed in VMFB: however, since there was no improvement in the convergence behaviour, we do not report the corresponding results. For Phila we adopt the fixed stepsize \(\alpha _k=50\), which was empirically determined. The other parameters of Phila are set exactly as in the other image restoration experiments (\(\gamma =10^{-4}\), \(\alpha _{min}= 10^{-5}\), \(\alpha _{max} = 10^{5}\) and \(\beta _{max} = 1.5\)).

Table 5 Image deblurring from data affected by signal dependent Gaussian noise: output of the algorithms to satisfy inequality (48) with \(tol = 10^{-8}\)

The results of the comparison are shown in Fig. 9. In this case, even if the problem is nonconvex, all methods seem to converge to the same value. Therefore, for a better visual comparison we first computed an approximation of the optimal value \(f^*\) by running 20,000 iterations of VMFB; then we evaluate the relative distance from \(f^*\) at each iteration of all methods and we plot this quantity with respect to the iteration number (left panel) and the computational time (right panel). Table 5 collects some information about the effectiveness of each algorithm. The first three columns report the number of iterations, function evaluations and inner iterations needed to fulfill condition (48) with \(tol=10^{-8}\). In the last column we report a kind of optimality measure in terms of the quantity \(h_k\), which is defined as \(h_k = h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k)})\) for the non inertial methods ISTA, VMILA and VMFB, and as \(h_k = h^{(k)}( {\tilde{y}}^{(k)}; { x}^{(k)}, { x}^{(k-1)})\) for the three implementations of Phila and for iPiano. For all methods \(h_k\) must converge to zero, so that the smaller \(h_k\) is, the closer the algorithm is to its limit point. Moreover, for the inexact methods ISTA, VMILA and Phila, \(h_k\) controls the accuracy level in evaluating the proximity operator, which becomes more demanding, in terms of inner iterations, for small values of \(h_k\).

From Fig. 9 and Table 5 we can observe that condition (48) with the same tolerance stops different algorithms at different accuracy levels. In particular, the solutions provided by Phila are more accurate with respect to both the closeness to the target value \(f^*\) and the optimality measure \(h_k\). This also explains why the number of inner iterations is larger. Again, the overall best performance is obtained by Phila-BB2. The restored image provided by Phila-BB2 is shown in Fig. 8c.

6 Conclusions and future work

In this paper we proposed a new inertial proximal gradient algorithm whose convergence is guaranteed by the descent properties enforced by a suitable line-search procedure. This results in a flexible scheme where, unlike previously proposed Heavy-Ball methods, the parameters characterizing the iteration can be selected almost freely. This freedom can be exploited by adapting acceleration techniques which have already demonstrated their effectiveness when combined with non inertial forward–backward methods. In particular, we adopt a variant of the Barzilai–Borwein formulae for the stepsize selection, while the inertial parameter is computed by mimicking the nonlinear Conjugate Gradient rules. The results of our numerical experiments show that these practical rules make the proposed method more efficient than existing Heavy-Ball methods and competitive also with accelerated forward–backward methods, especially when seeking a medium-high accuracy solution. Future research will concern further insight into the parameter selection from a spectral point of view.