1 Introduction

In applied mathematics, many problems can be reduced to solving a nonlinear fixed-point equation \(g(x)=x\), where \(x\in \mathbb {R}^n\) and \(g:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is a given function. If g is a contractive mapping, i.e.,

$$\begin{aligned} \Vert g(x)-g(y)\Vert \le \kappa \Vert x-y\Vert \quad \forall ~x,y\in \mathbb {R}^n, \end{aligned}$$
(1)

where \(\kappa <1\), then the iteration

$$\begin{aligned} x^{k+1}=g(x^k) \end{aligned}$$

is guaranteed to converge to the unique fixed point of g by Banach’s fixed-point theorem. Anderson acceleration (\(\textsf{AA}\)) [2, 3, 49] is a technique for speeding up the convergence of such an iterative process. Instead of using the update \(x^{k+1}=g(x^k)\), it generates \(x^{k+1}\) as an affine combination of the latest \(m+1\) steps:

$$\begin{aligned} x^{k+1} = g(x^{k})+{\sum }_{i=1}^{m}\alpha _i^*(g(x^{k-i})-g(x^{k})) \end{aligned}$$
(2)

with the combination coefficients \(\alpha ^{*}=(\alpha ^{*}_1, \ldots ,\alpha ^{*}_m) \in \mathbb {R}^m\) being computed via an optimization problem

$$\begin{aligned} \min _{\alpha }~\left\| f(x^{k})+{\sum }_{i=1}^{m}\alpha _i(f(x^{k-i})-f(x^{k}))\right\| ^2, \end{aligned}$$
(3)

where \(f(x^{k}) = g(x^{k}) - x^{k}\) denotes the residual function.
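For concreteness, the following Python sketch implements the classical updates (2) and (3) for a generic map g, with the history truncated to \(\hat{m} = \min \{m, k\}\) entries during the first iterations; the function names and the plain least-squares solve are our own illustrative choices, not the regularized method developed in this paper.

```python
import numpy as np

def anderson_step(g_hist, f_hist):
    """One classical AA update, cf. Eqs. (2)-(3).

    g_hist = [g(x^k), g(x^{k-1}), ..., g(x^{k-m})],
    f_hist = [f(x^k), f(x^{k-1}), ..., f(x^{k-m})].
    """
    gk, fk = g_hist[0], f_hist[0]
    # Columns f(x^{k-i}) - f(x^k) for i = 1, ..., m.
    D = np.column_stack([fi - fk for fi in f_hist[1:]])
    # alpha* minimizes ||f(x^k) + D alpha||^2, cf. Eq. (3).
    alpha, *_ = np.linalg.lstsq(D, -fk, rcond=None)
    # Accelerated iterate, cf. Eq. (2).
    return gk + sum(a * (gi - gk) for a, gi in zip(alpha, g_hist[1:]))

def aa(g, x0, m=5, iters=100):
    """Drive the fixed-point map g with AA, keeping the latest m+1 records."""
    x, g_hist, f_hist = x0, [], []
    for _ in range(iters):
        gx = g(x)
        g_hist, f_hist = [gx] + g_hist[:m], [gx - x] + f_hist[:m]
        x = anderson_step(g_hist, f_hist) if len(g_hist) > 1 else gx
    return x
```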

\(\textsf{AA}\) was initially proposed to solve integral equations [2] and has gained popularity in recent years for accelerating fixed-point iterations [49]. Examples include tensor decomposition [42], linear system solving [35], and reinforcement learning [18], among many others [1, 7, 24, 25, 27, 29, 30, 33, 52, 54].

On the theoretical side, it has been shown that \(\textsf{AA}\) is a quasi-Newton method for finding a root of the residual function [14, 16, 38]. When applied to linear problems (i.e., if \(g(x) = Ax - b\)), \(\textsf{AA}\) is equivalent to the generalized minimal residual method (GMRES) [34]. For nonlinear problems, \(\textsf{AA}\) is also closely related to the nonlinear generalized minimal residual method [50]. A local convergence analysis of \(\textsf{AA}\) for general nonlinear problems was first given in [45, 46] under the basic assumptions that g is Lipschitz continuously differentiable and the \(\textsf{AA}\) mixing coefficients \(\alpha \), determined in (3), stay in a compact set. However, the convergence rate provided in [45, 46] is no faster than that of the original fixed-point iteration. A more recent analysis in [13] shows that \(\textsf{AA}\) can indeed accelerate the local linear convergence of a fixed-point iteration up to an additional quadratic error term. This is further improved in [32], where q-linear convergence of \(\textsf{AA}\) is established. The convergence result in [32] requires sufficient linear independence of the columns of \([f(x^{k-1})-f(x^k),\dots ,f(x^{k-m})-f(x^{k-m+1})]\), which is typically stronger than the previously mentioned boundedness assumption on the coefficients \(\alpha \). By assuming the mixing coefficient \(\alpha \) to be stationary during the iteration, an exact rate of \(\textsf{AA}\) is derived in [50].

One issue of classical \(\textsf{AA}\) is that it can suffer from instability and stagnation [34, 39]. Different techniques have been proposed to address this issue. For example, safeguarding checks were introduced in [30, 54] to only accept an \(\textsf{AA}\) step if it meets certain criteria, but without a theoretical guarantee for convergence. Another direction is to introduce regularization to the problem (3) for computing the combination coefficients. In [17], a quadratic regularization is used together with a safeguarding step to achieve global convergence of \(\textsf{AA}\) on Douglas–Rachford splitting, but there is no guarantee that the local convergence is faster than the original solver. In [39, 40], a similar quadratic regularization is introduced to achieve local convergence, although no global convergence proof is provided. A more detailed discussion of related literature and specific techniques connected to our algorithmic design and development is deferred to Sect. 2.1.

As far as we are aware, none of the existing approaches and modified versions of \(\textsf{AA}\) guarantee both global convergence and accelerated local convergence. In this paper, we propose a novel \(\textsf{AA}\) globalization scheme that achieves these two goals simultaneously. Specifically, we apply a quadratic regularization with its weight adjusted automatically according to the effectiveness of the \(\textsf{AA}\) step. We adapt the nonmonotone trust-region framework in [47] to update the weight and to determine the acceptance of the \(\textsf{AA}\) step. Our approach not only achieves global convergence but also attains the same local convergence rate established in [13] for \(\textsf{AA}\) without regularization. Furthermore, our local results also cover applications where the mapping g is nonsmooth and differentiability is only required at a target fixed-point of g. To the best of our knowledge, this is the first globalization technique for \(\textsf{AA}\) that achieves the same local convergence rate as the original \(\textsf{AA}\) scheme. Numerical experiments on both smooth and nonsmooth problems verify the effectiveness and efficiency of our method.

Notations Throughout this work, we restrict our discussion to the n-dimensional Euclidean space \(\mathbb {R}^n\). For a vector x, \(\Vert x\Vert \) denotes its Euclidean norm, and \(\mathbb {B}_{\epsilon }(x):= \{y: \Vert y -x \Vert \le \epsilon \}\) denotes the Euclidean ball centered at x with radius \(\epsilon \). For a matrix A, \(\Vert A\Vert \) is the operator norm with respect to the Euclidean norm. We use I to denote both the identity mapping (i.e., \(I(x) = x\)) and the identity matrix. For a function \(h: \mathbb {R}^n\rightarrow \mathbb {R}^\ell \), the mapping \(h^\prime : \mathbb {R}^n\rightarrow \mathbb {R}^{\ell \times n}\) represents its derivative. h is called L-smooth if it is differentiable and \(\Vert h^\prime (x)-h^\prime (y)\Vert \le L\Vert x-y\Vert \) for all \(x,y\in \mathbb {R}^n\). An operator \(h:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is called nonexpansive if for all \(x,y\in \mathbb {R}^n\) we have \(\Vert h(x)-h(y)\Vert \le \Vert x-y\Vert \). We say that the operator h is \(\rho \)-averaged for some \(\rho \in (0,1)\) if there exists a nonexpansive operator \(R: \mathbb {R}^n\rightarrow \mathbb {R}^n\) such that \(h = (1-\rho )I+\rho R\). The set of fixed points of the mapping h is defined via \(\textrm{Fix}(h):= \{x: h(x)=x\}\). The interested reader is referred to [4] for further details on operator theory.

The \(\textsf{AA}\) formulation in Eqs. (2) and (3) assumes \(k \ge m\). It can be adapted to account for the case \(k < m\) by using \(\hat{m}\) coefficients instead where \(\hat{m} = \min \{m, k\}\). Without loss of generality, we use m to refer to the actual number of coefficients being used.

2 Algorithm and Convergence Analysis

2.1 Adaptive Regularization for \(\textsf{AA}\)

In the following, we set \(f^k:= f(x^k)\) and \(g^k:= g(x^k)\) to simplify notation. We first note that the accelerated iterate computed via (2) and (3) is invariant under permutations of the indices of \(\{f^j\}\) and \(\{g^j\}\). Concretely, let \(\varPi _k:= (k_0,k_1,\dots ,k_m)\) be any permutation of the index sequence \((k,k-1,\dots ,k-m)\). Then the point \(x^{k+1}\) calculated in Eq. (2) also satisfies

$$\begin{aligned} x^{k+1} = g^{k_0}+{\sum }_{i=1}^{m} \bar{\alpha }_i^k(g^{k_i}-g^{k_0}), \end{aligned}$$
(4)

with coefficients \(\bar{\alpha }^k = (\bar{\alpha }_1^k, \ldots , \bar{\alpha }_m^k) \in \mathbb {R}^m\) computed via

$$\begin{aligned} \bar{\alpha }^k \in \mathop {\textrm{argmin}}\limits _{\alpha } \left\| f^{k_0}+{\sum }_{i=1}^{m}\alpha _i(f^{k_i}-f^{k_0})\right\| ^2, \end{aligned}$$
(5)

which amounts to solving a linear system. In this paper, we use a particular class of permutations where

$$\begin{aligned} k_0 =\max \left\{ j \mid j \in \mathop {\textrm{argmin}}\limits \nolimits _{i\in \{k,k-1, \ldots , k-m\}}\Vert f^i\Vert \right\} , \end{aligned}$$
(6)

i.e., \(k_0\) is the largest index that attains the minimum residual norm among \(\Vert f^{k-m}\Vert , \Vert f^{k-m+1}\Vert , \ldots , \Vert f^k\Vert \). As we will see later, this type of permutation allows us to apply certain nonmonotone globalization techniques and to ultimately establish local and global convergence of our approach. An ablation study on the potential effect of the permutation strategy is presented in “Appendix D”.
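In an implementation, \(k_0\) can be determined by a single scan over the stored residual norms, breaking ties toward the most recent iterate; a minimal sketch (the function name is ours):

```python
import numpy as np

def select_k0(f_window):
    """Pick k_0 per Eq. (6). f_window[j] stores f^{k-m+j}, i.e. the
    entries are ordered from oldest to newest."""
    norms = [np.linalg.norm(fi) for fi in f_window]
    best = min(norms)
    for j in reversed(range(len(norms))):  # newest (largest index) first
        if norms[j] == best:
            return j
```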

One potential cause of instability of \(\textsf{AA}\) is the (near) linear dependence of the vectors \(\{f^{k_i}-f^{k_0}: i = 1, \ldots , m\}\), which can result in (near) rank deficiency of the linear system matrix for the problem (5). To address this issue, we introduce a quadratic regularization to the problem (5) and compute the coefficients \(\alpha ^{k}\) via:

$$\begin{aligned} \alpha ^{k} = \mathop {\textrm{argmin}}\limits _{\alpha }~\Vert {{\hat{f}}}^k(\alpha )\Vert ^2 + \lambda _k \Vert \alpha \Vert ^2, \end{aligned}$$
(7)

where \(\lambda _k > 0\) is a regularization weight, and

$$\begin{aligned} {{\hat{f}}}^k(\alpha ):= f^{k_0}+{\sum }_{i=1}^{m}\alpha _i(f^{k_i}-f^{k_0}). \end{aligned}$$
(8)

The coefficients \(\alpha ^{k}\) are then used to compute a trial step in the same way as in Eq. (4). In the following, we denote this trial step as \({{\hat{g}}}^k(\alpha ^k)\) where

$$\begin{aligned} \hat{g}^k(\alpha ):= g^{k_0}+{\sum }_{i=1}^{m}\alpha _i(g^{k_i}-g^{k_0}). \end{aligned}$$
(9)
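A minimal sketch of this computation, assuming the window histories are stored with the \(k_0\) entry first (the normal-equations form anticipates the linear system \((J^TJ + \lambda _k I)\alpha ^k = -J^Tf^{k_0}\) discussed at the end of this section; function and variable names are ours):

```python
import numpy as np

def regularized_trial_step(g_hist, f_hist, lam):
    """Solve the regularized problem (7) and form the trial step (9).

    g_hist[0] = g^{k_0} and f_hist[0] = f^{k_0}; the remaining entries are
    the other m members of the window in any fixed order."""
    f0, g0 = f_hist[0], g_hist[0]
    J = np.column_stack([fi - f0 for fi in f_hist[1:]])   # n x m
    m = J.shape[1]
    # alpha^k solves (J^T J + lam I) alpha = -J^T f^{k_0}, cf. Eq. (7).
    alpha = np.linalg.solve(J.T @ J + lam * np.eye(m), -J.T @ f0)
    f_hat = f0 + J @ alpha                                # Eq. (8)
    g_trial = g0 + sum(a * (gi - g0)
                       for a, gi in zip(alpha, g_hist[1:]))  # Eq. (9)
    return g_trial, f_hat, alpha
```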

The trial step is accepted as the new iterate if it meets certain criteria (which we will develop in the following in detail). Regularization such as the one in Eq. (7) has been suggested in [3] and is applied in [17, 40]. A major difference between our approach and the regularization in [17, 40] is the choice of \(\lambda _{k}\): in [17] it is set in a heuristic manner, whereas in [40] it is either fixed or specified via grid search. We instead update \(\lambda _k\) adaptively based on the effectiveness of the latest \(\textsf{AA}\) step. Specifically, we observe that a larger value of \(\lambda _k\) can improve the stability of the resulting linear system; it also induces a stronger penalty on the magnitude of \(\Vert \alpha ^{k}\Vert \). In this case, the trial step \(\hat{g}^k(\alpha ^k)\) tends to be closer to \(g^{k_0}\), which, according to Eq. (6), is the fixed-point iteration step with the smallest residual among the latest \(m+1\) iterates. On the other hand, a larger regularization weight may also hinder the fast convergence of \(\textsf{AA}\) if it is already effective in reducing the residual without regularization. Thus, \(\lambda _k\) is dynamically adjusted according to the reduction of the residual in the current step.

Our adaptive regularization scheme is inspired by the similarity between the problem (7) and the Levenberg-Marquardt (LM) algorithm [20, 26], a popular approach for solving nonlinear least squares problems of the form \(\min _x \Vert F(x)\Vert ^2\), where F is a vector-valued function. Each iteration of LM computes a variable update \(d^k:= x^{k+1} - x^k\) by solving a quadratic problem

$$\begin{aligned} \mathop {\textrm{argmin}}\limits _d ~\Vert F(x^k) + F'(x^k) d\Vert ^2 + \bar{\lambda }_k \Vert d\Vert ^2. \end{aligned}$$

Here, the first term is a local quadratic approximation of the target function \(\Vert F(x)\Vert ^2\) using the first-order Taylor expansion of F, while the second term is a regularization with a weight \(\bar{\lambda }_k > 0\). LM can be considered as a regularized version of the classical Gauss-Newton (GN) method for nonlinear least squares optimization [23]. In GN, each iteration computes an initial step d by minimizing the local quadratic approximation term only, i.e.,

$$\begin{aligned} \mathop {\textrm{argmin}}\limits _d ~\Vert F(x^k) + F'(x^k) d\Vert ^2, \end{aligned}$$
(10)

which amounts to solving a linear system for d with the positive semidefinite matrix \((F'(x^k))^T F'(x^k)\). Similar to \(\textsf{AA}\), the (near) linear dependence between the columns of \(F'(x^k)\) can lead to (near) rank deficiency of the system matrix, causing potential instability. To address this issue, LM introduces a quadratic regularization term for d, which adds a scaled identity matrix to the linear system matrix and prevents it from being singular. Furthermore, LM measures the effectiveness of the computed update using the ratio between the actual reduction of the target function and a predicted reduction based on the quadratic approximation. This measure is utilized to determine the acceptance of the update, to enforce monotonic decrease of the target function, and to update the regularization weight for the next iteration. Such an adaptive regularization is an instance of a trust-region method [10].
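For reference, one LM trial step can be sketched as follows, where F and Jac are user-supplied callables for the residual and its Jacobian (the function name is ours):

```python
import numpy as np

def lm_step(F, Jac, x, lam_bar):
    """One Levenberg-Marquardt trial step:
    d minimizes ||F(x) + F'(x) d||^2 + lam_bar ||d||^2."""
    Fx, Jx = F(x), Jac(x)
    n = Jx.shape[1]
    d = np.linalg.solve(Jx.T @ Jx + lam_bar * np.eye(n), -Jx.T @ Fx)
    return x + d
```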

Taking a similar approach as LM, we define two functions \({\textrm{ared}_{k}}\) and \({\textrm{pred}_{k}}\) that measure the actual and predicted reduction of the residual resulting from the solution \(\alpha ^k\) to (7):

$$\begin{aligned} {\textrm{ared}_{k}}:={r_{k}}-\Vert f(\hat{g}^k(\alpha ^k))\Vert , \quad {\textrm{pred}_{k}}:={r_{k}} -c\Vert \hat{f}^k(\alpha ^k)\Vert , \end{aligned}$$
(11)

where \(c \in (0, 1)\) is a constant. Here \({r_{k}}\) measures the residuals from the latest \(m+1\) iterates via a convex combination:

$$\begin{aligned} {r_{k}}:= (1- m \gamma )\Vert f^{k_0}\Vert + {\sum }_{i=1}^m \gamma \Vert f^{k_i}\Vert , \end{aligned}$$
(12)

with \(\gamma \in (0, \frac{1}{m+1})\) such that a higher weight is assigned to the smallest residual \(f^{k_0}\) among them. Note that \(\hat{g}^k(\alpha ^k)\) is the trial step, and \(f(\cdot )\) is the residual function. Thus \({\textrm{ared}_{k}}\) compares the latest residuals with the residual resulting from the trial step. This specific choice of \(r_k\) is inspired by the local descent properties of \(\textsf{AA}\), see, e.g., [13, Theorem 4.4]. Moreover, note that \(\hat{f}^k(\cdot )\) (see Eq. (8)) is a linear approximation of the residual function based on the latest residual values, and it is used in problem (7) to derive the coefficients \(\alpha ^k\) for computing the trial step. Thus \(\hat{f}^k(\alpha ^k)\) is a predicted residual for the trial step based on the linear approximation, and \({\textrm{pred}_{k}}\) compares it with the latest residuals. The constant c guarantees that \({\textrm{pred}_{k}}\) has a positive value (as long as a solution to the problem has not been found; see “Appendix A” for a proof). Similar to LM, we calculate the ratio

$$\begin{aligned} \rho _k = \frac{{\textrm{ared}_{k}}}{{\textrm{pred}_{k}}} \end{aligned}$$
(13)

as a measure of effectiveness for the trial step \(\hat{g}^k(\alpha ^k)\) computed with Eqs. (7) and (9). In particular, if \(\rho _k \ge p_1\) with a threshold \(p_1 \in (0,1)\), then from Eq. (13) and using the positivity of \({\textrm{pred}_{k}}\) we can bound the residual of \(\hat{g}^k(\alpha ^k)\) via

$$\begin{aligned} \Vert f(\hat{g}^k(\alpha ^k))\Vert \le (1-p_1) r_k + p_1 c \Vert \hat{f}^k(\alpha ^k)\Vert < (1-p_1) r_k + p_1 \Vert f^{k_0}\Vert . \end{aligned}$$
(14)

Like \(r_k\), the last expression \((1-p_1) r_k + p_1 \Vert f^{k_0}\Vert \) in Eq. (14) is also a convex combination of the latest \(m+1\) residuals, but with a higher weight on the smallest residual \(f^{k_0}\) than on \(r_k\). Hence, when \(\rho _k \ge p_1\), we consider the decrease of the residual to be sufficient. In this case, we set \(x^{k+1} = \hat{g}^k(\alpha ^k)\) and say the iteration is successful. Otherwise, we discard the trial step and choose \(x^{k+1} = g^{k_0} = g(x^{k_0})\), which corresponds to the fixed-point iteration step with the smallest residual among the latest \(m+1\) iterates. Thus, by permuting the indices \((k,k-1,\ldots ,k-m)\) according to \(\varPi _k\), we ensure that the most progress in terms of reducing the residual is achieved when an \(\textsf{AA}\) trial step is rejected.
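The acceptance test (11)–(14) translates directly into code; a minimal sketch operating on the permuted residual norms (function and variable names are ours):

```python
def accept_trial(f_norms, f_hat_norm, f_trial_norm, c, gamma, p1):
    """Nonmonotone acceptance test, cf. Eqs. (11)-(14).

    f_norms = [||f^{k_0}||, ||f^{k_1}||, ..., ||f^{k_m}||] with the smallest
    norm first; c in (0,1), gamma in (0, 1/(m+1)), p1 in (0,1)."""
    m = len(f_norms) - 1
    r_k = (1 - m * gamma) * f_norms[0] + gamma * sum(f_norms[1:])  # Eq. (12)
    ared = r_k - f_trial_norm    # actual reduction, Eq. (11)
    pred = r_k - c * f_hat_norm  # predicted reduction, Eq. (11); positive away from a solution
    rho = ared / pred            # Eq. (13)
    return rho >= p1, rho
```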

We also adjust the regularization weight \(\lambda _k\) according to the ratio \(\rho _k\). Specifically, we set

$$\begin{aligned} \lambda _k= \mu _k \Vert f^{k_0}\Vert ^2, \end{aligned}$$
(15)

where the factor \(\mu _k > 0\) is automatically updated based on \(\rho _{k}\) as follows (a compact code sketch is given after the list):

  • If \(\rho _{k} < p_1\), then we consider the decrease of the residual to be insufficient and we increase the factor in the next iteration via \(\mu _{k+1} = \eta _1\mu _k\) with a constant \(\eta _1 > 1\).

  • If \(\rho _{k} > p_2\) with a threshold \(p_2 \in (p_1, 1)\), then we consider the decrease to be high enough and reduce the factor via \(\mu _{k+1} = \eta _2\mu _k\) with a constant \(\eta _2 \in (0,1)\). This will relax the regularization so that the next trial step will tend to be closer to the original \(\textsf{AA}\) step.

  • Otherwise, in the case \(\rho _{k} \in [p_1, p_2]\), the factor remains the same in the next iteration.

Here the choice of the parameters \(p_1, p_2\), where \(0< p_1< p_2 < 1\), follows the convention of basic trust-region methods [10].
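A compact version of this update rule, using the default parameter values later adopted in Sect. 3 (the function name is ours; the weight itself then follows Eq. (15)):

```python
def update_mu(mu, rho, p1=0.01, p2=0.25, eta1=2.0, eta2=0.25):
    """Trust-region style update of the factor mu_k based on rho_k."""
    if rho < p1:        # insufficient decrease: strengthen regularization
        return eta1 * mu
    if rho > p2:        # high decrease: relax towards the plain AA step
        return eta2 * mu
    return mu           # rho in [p1, p2]: keep the factor unchanged

# Regularization weight per Eq. (15): lam_k = mu_k * norm(f_k0) ** 2
```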

Our setting of \(\lambda _k\) in Eq. (15) is inspired by [15] which relates the LM regularization weight to the residual norm. For our method, this setting ensures that the two target function terms in problem (7) are of comparable scales, so that the adjustment of the factor \(\mu _k\) is meaningful. This choice of \(\lambda _k\) and the update rule of \(\mu _k\) are quite standard in LM methods. However, the classical convergence analysis in [15] is not directly applicable here. In the LM method, the decrease of the residual can be predicted via its linearized model \(\Vert F(x^k)+F'(x^k)d\Vert ^2\). For \(\textsf{AA}\), the linearized residual \({\hat{f}}^k(\alpha ^k)\) is not a model for the update \(x^{k+1}={\hat{g}}^k(\alpha ^k)\) but for the point \({\hat{x}}^k(\alpha ^k)\) instead, where \({\hat{x}}^k(\alpha ^k)\) denotes the analogous affine combination of the iterates \(x^{k_i}\) (introduced explicitly in the proof of Theorem 2). Since a linearized residual of \({\hat{g}}^k\) is not readily available, we use an upper bound for such a linearization of \({\hat{g}}^k(\alpha ^k)\) which is exactly given by \(c\Vert {\hat{f}}^k(\alpha ^k)\Vert \). The whole method is summarized in Algorithm 1.

[Algorithm 1: \(\textsf{AA}\) with adaptive regularization and nonmonotone acceptance]

Unlike LM, which enforces a monotonic decrease of the target function, our acceptance strategy allows the residual for \(x^{k+1}\) to increase compared to the previous iterate \(x^k\). Therefore, our scheme can be considered as a nonmonotone trust-region approach and follows the procedure investigated in [47]. In the next subsections, we will see that this nonmonotone strategy allows us to establish unified global and local convergence results. In particular, besides global convergence guarantees, we can show the transition to fast local convergence, and an acceleration effect similar to that of the original \(\textsf{AA}\) scheme can be achieved.

The main computational overhead of our method lies in the optimization problem (7), which amounts to constructing and solving an \(m \times m\) linear system \(( J^T J + {\lambda _k} I ) \alpha ^k = -J^T f^{k_0}\) where \(J = [ f^{k_1} - f^{k_0}, f^{k_2} - f^{k_0}, \ldots , f^{k_m} - f^{k_0} ] \in \mathbb {R}^{n \times m}\). A naïve implementation that computes the matrix J from scratch in each iteration will result in \(O(m^2 n)\) time for setting up the linear system, whereas the system itself can be solved in \(O(m^3)\) time. Since we typically have \(m \ll n\), the linear system setup will become the dominant overhead. To reduce the overhead, we note that each entry of \(J^T J\) is a linear combination of inner products between \(f^{k_0}, \ldots , f^{k_m}\). If we pre-compute and store these inner products, then it only requires additional \(O(m^2)\) time to evaluate all entries. Moreover, the pre-computed inner products can be updated in O(mn) time in each iteration, so we only need O(mn) total time to evaluate \(J^T J\). Similarly, we can evaluate \(J^T f^{k_0}\) in O(m) time. In this way, the linear system setup only requires O(mn) time in each iteration. Moreover, as the parameter m is often a small value independent of n (and significantly smaller than n), the complexity O(mn) is effectively linear with respect to n and only incurs a small computational overhead.
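A sketch of this bookkeeping: maintain the \((m+1)\times (m+1)\) matrix G of inner products \(\langle f^{k_i}, f^{k_j}\rangle \) (one new row and column per iteration, at O(mn) cost) and assemble the normal equations from it in \(O(m^2)\) time (function and variable names are ours):

```python
import numpy as np

def assemble_normal_system(G, perm):
    """Build J^T J and J^T f^{k_0} from cached inner products.

    G[i, j] = <f^{a_i}, f^{a_j}> for the m+1 stored residuals; perm lists
    their storage positions with perm[0] corresponding to k_0."""
    i0, rest = perm[0], perm[1:]
    m = len(rest)
    JTJ, JTf = np.empty((m, m)), np.empty(m)
    for a in range(m):
        JTf[a] = G[rest[a], i0] - G[i0, i0]   # <f^{k_a}-f^{k_0}, f^{k_0}>
        for b in range(m):
            # <f^{k_a}-f^{k_0}, f^{k_b}-f^{k_0}> expanded via the cache.
            JTJ[a, b] = (G[rest[a], rest[b]] - G[rest[a], i0]
                         - G[i0, rest[b]] + G[i0, i0])
    return JTJ, JTf
```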

2.2 Global Convergence Analysis

We now present our main assumptions on g and f that allow us to establish global convergence of Algorithm 1. Our conditions are mainly based on a monotonicity property and on pointwise convergence of the iterated functions \(g^{[k]}: \mathbb {R}^n\rightarrow \mathbb {R}^n\) defined as

$$\begin{aligned} g^{[0]}:= I, \quad g^{[k]}:= \underbrace{g \circ g \circ \cdots \circ g}_{k~\text {times}} \end{aligned}$$

for \(k \in \mathbb {N}\).

Assumption 1

The functions g and f satisfy the following conditions:

  1. (A.1)

    \(\Vert f(g(x))\Vert \le \Vert f(x)\Vert \) for all \(x \in \mathbb {R}^n\).

  2. (A.2)

    \(\lim \limits _{k\rightarrow \infty } \Vert f(g^{[k]}(x))\Vert = \nu \) for all \(x \in \mathbb {R}^n\), where \(\nu =\inf _{x\in \mathbb {R}^n}\Vert f(x)\Vert \).

It is easy to see that Assumption 1 holds for any contractive function with \(\nu = 0\). In particular, if g satisfies (1), we obtain

$$\begin{aligned} \Vert f(g^{[k]}(x))\Vert&= \Vert g(g^{[k]}(x)) - g(g^{[k-1]}(x)) \Vert \nonumber \\&\le \kappa \Vert f(g^{[k-1]}(x))\Vert \le \ldots \le \kappa ^{k} \Vert f(x)\Vert \rightarrow 0 \end{aligned}$$
(16)

as \(k \rightarrow \infty \). In the following, we will verify that Assumption 1 also holds for \(\rho \)-averaged operators which define a broader class of mappings than contractions.

Proposition 1

Let \(g:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be a \(\rho \)-averaged operator with \(\rho \in (0,1)\). Then g satisfies Assumption 1.

Proof

By definition, the \(\rho \)-averaged operator g is also nonexpansive and thus, (A.1) holds for g. To prove (A.2), let us set \(y^0:= x\) and \(y^{k+1}:= g^{[k+1]}(x) = g(y^{k})\) for all k. By (A.1), the sequence \(\{\Vert f(y^{k})\Vert \}_k\) is monotonically decreasing and bounded below by \(\nu \). Therefore, the limit \(\vartheta := \lim _{k\rightarrow \infty }\Vert f(y^{k})\Vert \) exists. If \(\vartheta >\nu \), then we may select \(x^0\in \mathbb {R}^n\) such that \(\Vert f(x^0)\Vert <\nu +\frac{1}{2}(\vartheta -\nu )\). Defining \(x^{k+1}=g^{[k+1]}(x^0)=g(x^k)\) and applying [4, Proposition 4.25(iii)], we have

$$\begin{aligned} \Vert x^{k+1}-y^{k+1}\Vert ^2\le \Vert x^k-y^k\Vert ^2 -\frac{1-\rho }{\rho }\Vert f(x^k)-f(y^k)\Vert ^2. \end{aligned}$$

This yields

$$\begin{aligned} \Vert x^{k+1}-y^{k+1}\Vert ^2\le \Vert x^0-y^0\Vert ^2 -\frac{1-\rho }{\rho }\sum _{i=0}^k\Vert f(x^i)-f(y^i)\Vert ^2. \end{aligned}$$

By the reverse triangle inequality and (A.1), we have

$$\begin{aligned} \Vert f(x^i)-f(y^i)\Vert \ge \Vert f(y^i)\Vert -\Vert f(x^i)\Vert \ge \vartheta -\Vert f(x^0)\Vert \ge \frac{1}{2}(\vartheta -\nu ). \end{aligned}$$

Combining with the previous inequality, we obtain

$$\begin{aligned} \Vert x^{k+1}-y^{k+1}\Vert ^2\le \Vert x^0-y^0\Vert ^2 -\frac{1-\rho }{4\rho }(k+1)(\vartheta -\nu )^2. \end{aligned}$$

Taking the limit \(k\rightarrow \infty \), the right-hand side tends to \(-\infty \) while the left-hand side is nonnegative, which is a contradiction. So, we must have \(\nu =\vartheta \), as desired. \(\square \)

Remark 1

Setting \(\kappa \) (the Lipschitz constant of g) to 1 in (16), we see that (A.1) is always satisfied if g is a nonexpansive operator. However, nonexpansiveness is not a necessary condition for (A.1). In fact, we can construct an operator g that is not nonexpansive but satisfies (A.1) and (A.2), e.g.,

$$\begin{aligned} g: \mathbb {R}\rightarrow \mathbb {R}, \quad g(x):=\left\{ \begin{array}{ll} 0.5 x &{} \text {if }x\in [0,1], \\ 0 &{} \text {otherwise.} \end{array} \right. \end{aligned}$$

For any \(x\in [0,1]\), we have \(g^{[k]}(x) = 2^{-k}x\), \(f(g^{[k]}(x)) = -2^{-(k+1)}x\) and it is not hard to verify (A.1) and (A.2). For any \(x\notin [0,1]\) it follows \(f(g(x))=f(0)=0\), thus (A.1) and (A.2) also hold in this situation. However, since g is not continuous, it cannot be nonexpansive.
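This example can also be verified numerically; a quick illustrative check:

```python
def g(x):
    return 0.5 * x if 0.0 <= x <= 1.0 else 0.0

def f(x):
    return g(x) - x

# (A.1): |f(g(x))| <= |f(x)| on a grid over [-2, 2].
xs = [i / 100 - 2.0 for i in range(401)]
assert all(abs(f(g(x))) <= abs(f(x)) for x in xs)

# (A.2): the residual along the iteration tends to nu = 0.
y = 0.7
for _ in range(50):
    y = g(y)
print(abs(f(y)))  # ~ 0.7 * 2^{-51}
```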

Because of Proposition 1, our global convergence theory is applicable to a large class of iterative schemes. As an example, we show in the following that Assumption 1 is satisfied by forward-backward splitting, a popular optimization solver in machine learning.

Example 1

Let us consider the nonsmooth optimization problem:

$$\begin{aligned} \min _{x \in \mathbb {R}^n}~r(x)+\varphi (x), \end{aligned}$$
(17)

where both \(r, \varphi : \mathbb {R}^n\rightarrow (-\infty ,\infty ]\) are proper, closed, and convex functions, and r is L-smooth. It is well known that \(x^*\) is a solution to this problem if and only if it satisfies the nonsmooth equation:

$$\begin{aligned} x^* - G_\mu (x^*) = 0, \quad G_\mu (x):= \textrm{prox}_{\mu \varphi }(x-\mu \nabla r(x)), \end{aligned}$$

where \(\textrm{prox}_{\mu \varphi }(x):= \mathop {\textrm{argmin}}\limits _y \varphi (y) + \frac{1}{2\mu } \Vert x-y\Vert ^2\), \(\mu > 0\), is the proximity operator of \(\varphi \), see also Corollary 26.3 of [4]. We can then compute \(x^*\) via the iterative scheme

$$\begin{aligned} x^{k+1} = G_\mu (x^k). \end{aligned}$$
(18)

\(G_\mu \) is known as the forward-backward splitting operator and it is a \(\rho \)-averaged operator for all \(\mu \in (0,\frac{2}{L})\), see [8]. Hence, Assumption 1 holds and our theory can be used to study the global convergence of Algorithm 1 applied to (18).
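As a concrete instance, the following sketch builds \(G_\mu \) for a toy lasso problem, i.e., \(r(x) = \frac{1}{2}\Vert Ax-b\Vert ^2\) and \(\varphi = \tau \Vert \cdot \Vert _1\), whose proximity operator is the soft-thresholding map (the data and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 60))
b = rng.standard_normal(40)
tau = 0.1
L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad r
mu = 1.0 / L                    # any mu in (0, 2/L) is admissible

def G(x):
    """Forward-backward operator G_mu: gradient step on r, then prox of mu*tau*||.||_1."""
    y = x - mu * (A.T @ (A @ x - b))                           # forward step
    return np.sign(y) * np.maximum(np.abs(y) - mu * tau, 0.0)  # backward step

x = np.zeros(60)
for _ in range(200):
    x = G(x)                        # plain iteration (18); AA would accelerate this
print(np.linalg.norm(G(x) - x))     # residual norm ||f(x)||
```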

Remark 2

For problem (17), it can be shown that Douglas–Rachford splitting, as well as its equivalent form of ADMM, can both be written as a \(\rho \)-averaged operator with \(\rho \in (0,1)\) (see, e.g., [22]). Therefore, the applications considered in [17] are also covered by Assumption 1.

We can now show the global convergence of Algorithm 1:

Theorem 1

Suppose Assumption 1 is satisfied and let \(\{x^k\}\) be generated by Algorithm 1 with \(\epsilon _f = 0\). Then

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert f^k\Vert =\nu , \end{aligned}$$

where \(\nu =\inf _{x\in \mathbb {R}^n}\Vert f(x)\Vert \).

Proof

In the following, we will use \({\mathcal {S}}\) to denote the set of indices for all successful iterations, i.e., \({\mathcal {S}}:= \{k: \rho _k \ge p_1\}\). To simplify the notation, we introduce a function \({\mathcal {P}}:\mathbb {N}\rightarrow \mathbb {N}\) defined as

$$\begin{aligned} {\mathcal {P}}(k):=\max \left\{ j \mid j \in {\mathop {\textrm{argmin}}\limits }_{i\in \{k,k-1,\ldots ,k-{{m}} \}}\Vert f^i\Vert \right\} . \end{aligned}$$

Notice that the number \({\mathcal {P}}(k)\) coincides with \(k_0\) for fixed k.

If Algorithm 1 terminates after a finite number of steps, the conclusion simply follows from the stopping criterion. Therefore, in the following, we assume that a sequence of iterates of infinite length is generated. We consider two different cases:

  1. Case 1:

    \(|\mathcal {S}|<\infty \). Let \(\bar{k}\) denote the index of the last successful iteration in \(\mathcal {S}\) (we set \({\bar{k}} = 0\) if \(\mathcal {S}=\emptyset \)). We first show that \(\mathcal {P}(k)=k\) for all \(k\ge {\bar{k}}+1\). Due to \({\bar{k}}+1\notin \mathcal {S}\), it follows \(x^{{\bar{k}}+1}=g(x^{{\mathcal {P}}({\bar{k}})})\) and by (A.1), this implies \(\Vert f(x^{{\bar{k}}+1})\Vert \le \Vert f(x^{{\mathcal {P}}({\bar{k}})})\Vert \). From the definition of \({\mathcal {P}}\), we have \(\Vert f(x^{{\mathcal {P}}({\bar{k}})})\Vert \le \Vert f^{{\bar{k}}-i}\Vert \) for every \(0\le i\le \min \{m,{\bar{k}}\}\) and hence \({\mathcal {P}}({\bar{k}}+1)={\bar{k}}+1\). An inductive argument then yields \(\mathcal {P}(k)=k\) for all \(k\ge {\bar{k}} +1\). Notice that for any \(k\ge {\bar{k}}+1\), we have \(k\notin {\mathcal {S}}\) and \(x^{k+1}=g(x^{{\mathcal {P}}(k)})=g(x^k)\). Utilizing (A.2), it follows that \(\Vert f^k\Vert = \Vert f(g^{[k-{\bar{k}}]}(x^{{\mathcal {P}}({\bar{k}})}))\Vert \rightarrow \nu \) as \(k \rightarrow \infty \).

  2. Case 2:

    \(|\mathcal {S}|=\infty \). Let us denote

    $$\begin{aligned} W_k:=\max _{k-m\le i\le k}\Vert f^i\Vert . \end{aligned}$$

    We first show that the sequence \(\{W_k\}\) is non-increasing.

    • If \(k\in \mathcal {S}\), then we have:

      $$\begin{aligned}&p_1 \le \rho _{k} = \frac{{\textrm{ared}_{k}}}{ {\textrm{pred}_{k}}}. \end{aligned}$$

      We already know from “Appendix A” that \({\textrm{pred}_{k}} > 0\). Since \(p_1 > 0\), it also holds that \({\textrm{ared}_{k}} >0\). Hence, if \(c \Vert {\hat{f}}^k(\alpha ^k)\Vert \le \Vert f^{k+1}\Vert \), then using \({r_{k}} \le W_k\) and (41) from “Appendix A”, we can derive:

      $$\begin{aligned} p_1&\le \frac{{\textrm{ared}_{k}}}{ {\textrm{pred}_{k}}} = 1 + \frac{c \Vert {\hat{f}}^k(\alpha ^k)\Vert - \Vert f^{k+1}\Vert }{{\textrm{pred}_{k}}}= 1 + \frac{c \Vert {\hat{f}}^k(\alpha ^k)\Vert - \Vert f^{k+1}\Vert }{r_k-c\Vert {\hat{f}}^k(\alpha ^k)\Vert } \\&\le 1 + \frac{c \Vert {\hat{f}}^k(\alpha ^k)\Vert - \Vert f^{k+1}\Vert }{W_k-c\Vert {\hat{f}}^k(\alpha ^k)\Vert }=\frac{W_k-\Vert f^{k+1}\Vert }{W_k-c\Vert {\hat{f}}^k(\alpha ^k)\Vert } \le \frac{W_k-\Vert f^{k+1}\Vert }{(1-c)W_k}, \end{aligned}$$

      which implies

      $$\begin{aligned} \Vert f^{k+1}\Vert \le c_p W_k, \end{aligned}$$
      (19)

      where \(c_p:= 1-(1-c)p_1 < 1\). Otherwise, if \(c \Vert {\hat{f}}^k(\alpha ^k)\Vert > \Vert f^{k+1}\Vert \), then we have

      $$\begin{aligned} \Vert f^{k+1}\Vert \le c W_k \le c_p W_k. \end{aligned}$$
      (20)
    • If \(k \notin {\mathcal {S}}\), we have \(x^{k+1}=g^{{\mathcal {P}}(k)}\). By Assumption (A.1), it then follows that

      $$\begin{aligned} \Vert f^{k+1}\Vert \le \Vert f^{{\mathcal {P}}(k)}\Vert \le W_{k}. \end{aligned}$$
      (21)

    Eqs. (19), (20) and (21) show that \(\Vert f^{k+1}\Vert \le W_{k}\). By definition of \(W_{k}\), we then have \(W_{k+1} \le \max \{\Vert f^{k+1}\Vert , W_k\} = W_k\). This shows that the sequence \(\{W_k\}\) is non-increasing. Next, we verify

    $$\begin{aligned} W_{k+m+1}\le c_p W_k \end{aligned}$$

    for all \(k \in {\mathcal {S}}\). It suffices to prove that for any i satisfying \(k+1\le i\le k+m+1\), we have \(\Vert f^i\Vert \le c_p W_{k}\). Since we consider a successful iteration \(k\in \mathcal {S}\), our previous discussion has already shown that \(\Vert f^{k+1}\Vert \le c_p W_k\). We now assume \(\Vert f^i\Vert \le c_pW_k\) for some \(k+1\le i\le k+m\). If \(i\in {\mathcal {S}}\), we obtain \(\Vert f^{i+1}\Vert \le c_p W_i \le c_p W_k\). Otherwise, it follows that \(\Vert f^{i+1}\Vert \le \Vert f^{{\mathcal {P}}(i)}\Vert \le \Vert f^i\Vert \le c_p W_k\). Hence, by induction, we have \(W_{k+m+1}\le c_p W_k\) for all \(k \in {\mathcal {S}}\). Since \(\{W_k\}\) is non-increasing and we assumed \(|{\mathcal {S}}|=\infty \), this establishes \(W_k\rightarrow 0\) and \(\Vert f^k\Vert \rightarrow 0\). In this case, we can infer \(\nu =0\) and the proof is complete.\(\square \)

Remark 3

This global result does not depend on the specific update rule for the regularization weight \(\lambda _k\). Indeed, global convergence mainly results from our acceptance mechanism and hence, as a consequence of our proof, different update strategies for \(\lambda _k\) can also be applied. Our choice of \(\lambda _k\) in (15), however, will be essential for establishing local convergence of the method.

2.3 Local Convergence Analysis

Next, we analyze the local convergence of our proposed approach, starting with several assumptions.

Assumption 2

The function \(g: \mathbb {R}^n\rightarrow \mathbb {R}^n\) satisfies the following conditions:

  1. (B.1)

    g is Lipschitz continuous with a constant \(\kappa <1\).

  2. (B.2)

    g is differentiable at \(x^*\) where \(x^*\) is the fixed point of the mapping g.

Remark 4

(B.1) is a standard assumption widely used in the local convergence analysis of \(\textsf{AA}\) [13, 39, 40, 46]. The existing analyses typically rely on the smoothness of g. In contrast, (B.2) allows g to be nonsmooth and only requires it to be differentiable at the fixed point \(x^*\), allowing our analysis to cover a wider variety of methodologies such as forward-backward splitting and Douglas–Rachford splitting under appropriate conditions, see “Appendix B”. We note that in [6] the Lipschitz differentiability of g is replaced by continuous differentiability around \(x^*\), while we only assume differentiability at one point. This technical difference is based on the observation that an expansion of the residual \(f^k\) is only required at the point \(x^*\) and not at the iterates \(x^k\), which allows us to work with weaker differentiability requirements. We further note that \(\textsf{AA}\) has been investigated for nonsmooth g in [17, 53] but without local convergence analysis. Recent convergence results of \(\textsf{AA}\) for a scheme related to the proximal gradient method discussed in Example 1 can also be found in [24]. While the local assumptions and derived convergence rates in [24] are somewhat similar to our local results, we want to highlight that the algorithm and analysis in [24] are tailored to convex composite problems of the form (17). Moreover, the global results in [24] are shown for a second, guarded version of \(\textsf{AA}\) and are based on the strong convexity of the problem. In contrast and under conditions that are not stronger than the local assumptions in [24], we will establish unified global–local convergence of our approach for general contractions. In Sect. 3, we verify the conditions (B.1) and (B.2) on the numerical examples, with a more detailed discussion in “Appendix B”.

Remark 5

(B.1) implies that the function g is contractive, which is a sufficient condition for (A.1) and (A.2). Thus, a function g satisfying Assumption 2 will also fulfill Assumption 1 with \(\nu = 0\).

Similar to the local convergence analyses in [13, 46], we also work with the following condition:

Assumption 3

For the solution \(\bar{\alpha }^k\) to the unregularized \(\textsf{AA}\) problem (5), there exists \(M > 0\) such that \(\Vert \bar{\alpha }^k\Vert _{\infty }\le M\) for all k sufficiently large.

Remark 6

The assumptions given in [13, 46] are formulated without permuting the last \(m+1\) indices. We further note that we do not require the solution \({\bar{\alpha }}^k\) to be unique.

The acceleration effect of the original \(\textsf{AA}\) scheme has only been studied very recently in [13] based on slightly stronger assumptions. In particular, their result can be stated as

$$\begin{aligned} \Vert f(\hat{g}^k(\bar{\alpha }^k))\Vert \le \kappa \theta _k\Vert f^{k_0}\Vert +\sum \nolimits _{i=0}^mO(\Vert f^{k-i}\Vert ^2), \end{aligned}$$
(22)

where

$$\begin{aligned} \theta _k:={\Vert \hat{f}^k(\bar{\alpha }^k)\Vert }/{\Vert f^{k_0}\Vert } \end{aligned}$$
(23)

is an acceleration factor. Since \(\bar{\alpha }^k\) is a solution to the problem (5), we have \(\Vert \hat{f}^k(\bar{\alpha }^k)\Vert \le \Vert \hat{f}^k(0)\Vert = \Vert f^{k_0}\Vert \) so that \(\theta _k \in [0,1]\). Then (22) implies that for a fixed-point iteration that converges linearly with a contraction constant \(\kappa \), \(\textsf{AA}\) can improve the convergence rate locally. In the following, we will show that our globalized \(\textsf{AA}\) method possesses similar characteristics under weaker assumptions.

We first verify that after finitely many iterations, every step \(x^{k+1} = {\hat{g}}^k(\alpha ^k)\) is accepted as a new iterate. Thus, our method eventually reduces to a pure regularized \(\textsf{AA}\) scheme.

Theorem 2

Suppose that Assumptions 2 and 3 hold and let the constant c in (11) be chosen such that \(c\ge \kappa \). Then, the sequence \(\{x^k\}\) generated by Algorithm 1 (with \(\epsilon _f = 0\)) either terminates after finitely many steps, or converges to the fixed point \(x^*\) and there exists some \(\ell \in \mathbb {N}\) such that \(\rho _k\ge p_2\) for all \(k\ge \ell \). In particular, every iteration \(k \ge \ell \) is successful with \(x^{k+1}=\hat{g}^k(\alpha ^k)\).

Proof

Our proof consists of three steps. We first show the convergence of the whole sequence \(\{x^k\}\) to the fixed point \(x^*\). Afterwards we derive a bound for the residual \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \) that can be used to estimate the actual reduction \({\textrm{ared}_{k}}\). In the third step, we combine our observations to prove the transition to the full \(\textsf{AA}\) method, i.e., we show that there exists some \(\ell \) with \(k\in \mathcal {S}\) for all \(k\ge \ell \).

Step 1::

Convergence of \(\{x^k\}\). By (B.1), g is a contraction, i.e., for all \(x \in \mathbb {R}^n\) we have

$$\begin{aligned} \Vert x - x^*\Vert \le \Vert x-g(x)\Vert + \Vert g(x)-g(x^*)\Vert \le \Vert f(x)\Vert + \kappa \Vert x-x^*\Vert \end{aligned}$$
(24)

and it follows \(\Vert f^k\Vert = \Vert f(x^k)\Vert \ge (1-\kappa )\Vert x^k-x^*\Vert \) for all k. Theorem 1 and Remark 5 guarantee \(\lim _{k\rightarrow \infty }\Vert f^k\Vert =0\) and hence, we can infer \(x^k \rightarrow x^*\).

Step 2::

Bounding \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \). Introducing

$$\begin{aligned} \hat{x}^k(\alpha ^k):= x^{k_0} + {\sum }_{i=1}^m \alpha _i^k (x^{k_i} - x^{k_0}) \end{aligned}$$

and using (B.1), we can bound the residual \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \) directly as follows:

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert&= \Vert g({\hat{g}}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \\&\le \Vert g({\hat{g}}^k(\alpha ^k))-g(\hat{x}^k(\alpha ^k))\Vert +\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \\&\le \kappa \Vert {\hat{g}}^k(\alpha ^k) - \hat{x}^k(\alpha ^k)\Vert +\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \\&= \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert +\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert . \end{aligned}$$

We now continue to estimate the second term \(\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \). From the algorithmic construction and the definition of \(\alpha ^k\) and \({\bar{\alpha }}^k\), it follows that

$$\begin{aligned} \Vert {\hat{f}}^k(\alpha ^k)\Vert ^2 + \lambda _k \Vert \alpha ^k\Vert ^2 \le \Vert {\hat{f}}^k({\bar{\alpha }}^k)\Vert ^2 + \lambda _k \Vert {\bar{\alpha }}^k\Vert ^2 \le \Vert {\hat{f}}^k(\alpha ^k)\Vert ^2 + \lambda _k \Vert {\bar{\alpha }}^k\Vert ^2, \end{aligned}$$
(25)

which implies \(\Vert \alpha ^k\Vert _{\infty }\le \Vert \alpha ^k\Vert \le \Vert \bar{\alpha }^k\Vert \le \sqrt{m}\Vert \bar{\alpha }^k\Vert _\infty \le \sqrt{m}M\) for all k. Defining \(\nu ^k=(\nu ^k_0,\dots ,\nu ^k_m)\in \mathbb {R}^{m+1}\) with \(\nu ^k_0=1-\sum _{i=1}^m\alpha ^k_i\) and \(\nu ^k_j = \alpha ^k_j\) for \(1\le j \le m\), we obtain

$$\begin{aligned} {\hat{g}}^k(\alpha )&= {\sum }_{i=0}^m\nu _i^k g(x^{k_i}), \quad {\hat{x}}^k(\alpha ) = {\sum }_{i=0}^m\nu _i^k x^{k_i}, \end{aligned}$$

\(\Vert \nu ^k\Vert _\infty \le 1+m^{\frac{3}{2}}M\), and \(\sum _{i=0}^m \nu ^k_i = 1\). Consequently, applying the estimate (24) derived in step 1, it follows

$$\begin{aligned} \Vert \hat{x}^k(\alpha ^k)-x^*\Vert = \left\| {\sum }_{i=0}^m \nu _i^k (x^{k_i}-x^*) \right\|&\le (1+ m^{\frac{3}{2}}M) {\sum }_{i=0}^m\Vert x^{k_i}-x^*\Vert \\ {}&\le (1+ m^{\frac{3}{2}}M)(1-\kappa )^{-1} {\sum }_{i=0}^m\Vert f^{k-i}\Vert \end{aligned}$$

which shows \({\hat{x}}^k(\alpha ^k) \rightarrow x^*\) as \(k \rightarrow \infty \). This also establishes

$$\begin{aligned} o(\Vert \hat{x}^k(\alpha ^k)-x^*\Vert ) = o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \quad k \rightarrow \infty . \end{aligned}$$
(26)

Note that the differentiability of g at \(x^*\)—as stated in Assumption (B.2)—implies \(\Vert g(y) - g(x^*) - g^\prime (x^*)(y-x^*)\Vert = o(\Vert y-x^*\Vert )\) as \(y \rightarrow x^*\). Applying this condition to different choices of y and the boundedness of \(\nu ^k\), we can obtain

$$\begin{aligned}& \Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k) \Vert \nonumber \\ = ~&\Vert g(\hat{x}^k(\alpha ^k))-g(x^*) + g(x^*) - {\hat{g}}^k(\alpha ^k)\Vert \nonumber \\ \le ~&\Vert g'(x^*)({\hat{x}}^k(\alpha ^k)-x^*) + g(x^*) - {\hat{g}}^k(\alpha ^k) \Vert +o(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ) \nonumber \\ = ~&\left\| {\sum }_{i=0}^m \nu ^k_i [g'(x^*)(x^{k_i}-x^*) + g(x^*) - g(x^{k_i})] \right\| +o(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ) \nonumber \\ \le ~&{\sum }_{i=0}^m o(\Vert x^{k_i}-x^*\Vert ) +o(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ) \le o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) . \end{aligned}$$
(27)

Here, we also used (24), (26), and \(\sum _{i=0}^m o(\Vert f^{k-i}\Vert ) = o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \). Combining our results, this yields

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert \le \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert + o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \quad k \rightarrow \infty . \end{aligned}$$
(28)
Step 3::

Transition to fast local convergence. As in the proof of Theorem 1, let us introduce

$$\begin{aligned} W_k:= \max _{k-m\le i\le k}\Vert f^i\Vert . \end{aligned}$$

Due to (28) and \(o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) = o(W_k)\), there exists \(\ell \in \mathbb {N}\) such that

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert \le \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert + (1-p_2)\min \{\gamma ,1-c\}W_k \end{aligned}$$

for all \(k \ge \ell \). Hence, using \(c\ge \kappa \), we have

$$\begin{aligned} {\textrm{ared}_{k}}&= {r_{k}} - \Vert f(\hat{g}^k(\alpha ^k))\Vert \ge {\textrm{pred}_{k}}- (1-p_2)\min \{\gamma ,1-c\}W_k. \end{aligned}$$

Similarly, for the predicted reduction \({\textrm{pred}_{k}}\) we can show

$$\begin{aligned} {\textrm{pred}_{k}}&= (1-\gamma m) \Vert f^{k_0}\Vert + \gamma {\sum }_{i=1}^{m} \Vert f^{k_i}\Vert -c\Vert \hat{f}^k(\alpha ^k)\Vert \\&\ge \gamma {\sum }_{i=0}^{m} \Vert f^{k_i}\Vert + (1-\gamma (m+1)) \Vert f^{k_0}\Vert - c \Vert f^{k_0}\Vert \\&\ge \gamma W_k+(1-\gamma )\Vert f^{k_0}\Vert -c\Vert f^{k_0}\Vert . \end{aligned}$$

Thus, if \(1-\gamma -c \ge 0\), we obtain \({\textrm{pred}_{k}} \ge \gamma W_k\). Otherwise, it follows \({\textrm{pred}_{k}} \ge (1-c) W_k\) and together this yields \({\textrm{pred}_{k}} \ge \min \{\gamma ,1-c\}W_k\). Combining the last estimates, we can finally deduce

$$\begin{aligned} \frac{{\textrm{ared}_{k}}}{ {\textrm{pred}_{k}}}\ge \frac{{\textrm{pred}_{k}}- (1-p_2)\min \{\gamma ,1-c\}W_k}{{\textrm{pred}_{k}}} \ge p_2, \end{aligned}$$

which completes the proof.\(\square \)

Remark 7

Our novel nonmonotone acceptance mechanism is the central component of our proof for Theorem 2, as it allows us to balance the additional error terms caused by an \(\textsf{AA}\) step.

Next, we show that our approach can enhance the convergence of the underlying fixed-point iteration and that it has a local convergence rate similar to the original \(\textsf{AA}\) method as given in [13].

Theorem 3

Suppose that Assumptions 2 and 3 hold and let the parameters \(c, \epsilon _{f}\) in Algorithm 1 satisfy \(c\ge \kappa \) and \(\epsilon _{f}=0\). Then, for \(k \rightarrow \infty \) it holds that:

$$\begin{aligned} \Vert f^{k+1}\Vert \le \kappa \theta _k\Vert f^{k_0}\Vert +o\left( \sum \nolimits _{i=0}^{m}\Vert f^{k-i}\Vert \right) , \end{aligned}$$

where \(\theta _k:=\Vert \hat{f}^k(\bar{\alpha }^k)\Vert /\Vert f^{k_0}\Vert \) is the corresponding acceleration factor. In addition, the sequence of residuals \(\{\Vert f^k\Vert \}\) converges r-linearly to zero with a rate arbitrarily close to \(\kappa \), i.e., for every \(\eta \in (\kappa ,1)\) there exist \(C > 0\) and \({\hat{\ell }} \in \mathbb {N}\) such that

$$\begin{aligned} \Vert f^k\Vert \le C \eta ^k \quad \forall ~k \ge {\hat{\ell }}. \end{aligned}$$

Proof

Theorem 2 implies \(\rho _k\ge p_2\) for all \(k \ge \ell \) and hence, from the update rule of Algorithm 1, it follows that

$$\mu _k=\eta _2\mu _{k-1} \quad \forall ~k > \ell .$$

Then by (15), we can infer \(\lambda _k=o(\Vert f^{k_0}\Vert ^2)\). Using Eq. (25) and Assumption 3, this shows

$$\begin{aligned} \Vert \hat{f}^k(\alpha ^k)\Vert \le \Vert \hat{f}^k(\bar{\alpha }^k)\Vert +\sqrt{\lambda _k }\Vert \bar{\alpha }^k\Vert = \Vert \hat{f}^k(\bar{\alpha }^k)\Vert +o(\Vert f^{k_0}\Vert ). \end{aligned}$$
(29)

Thus, by (28), we obtain

$$\begin{aligned} \Vert f^{k+1}\Vert \le ~&\kappa \Vert \hat{f}^{k}(\alpha ^k)\Vert + o\left( {\sum }_{i=0}^m \Vert f^{k-i}\Vert \right) \nonumber \\ \le ~&\kappa \Vert \hat{f}^{k}(\bar{\alpha }^k)\Vert +o(\Vert f^{k_0}\Vert )+o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \nonumber \\ =~&\kappa \theta _k \Vert f^{k_0}\Vert + o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) , \end{aligned}$$
(30)

as desired. In order to establish r-linear convergence, we follow the strategy presented in [46]. Let \(\eta \in (\kappa ,1)\) be a given rate. Then, due to \(\Vert f^k\Vert \rightarrow 0\) and using (30), there exists \({\hat{\ell }} \in \mathbb {N}\) such that

$$\begin{aligned} \Vert f^{k+1}\Vert \le \kappa \Vert {f}^{k_0}\Vert + {\bar{\nu }} \cdot {\sum }_{i=0}^m \Vert f^{k-i}\Vert \end{aligned}$$
(31)

for all \(k \ge {\hat{\ell }}\) where \({\bar{\nu }}:= \frac{1-\eta }{1-\eta ^{m+1}} \eta ^m (\eta -\kappa )\). Defining \(C:= \eta ^{-{\hat{\ell }}} \max _{{\hat{\ell }}-m\le i \le {\hat{\ell }}} \Vert f^i\Vert = W_{{\hat{\ell }}}\,\eta ^{-{\hat{\ell }}}\), we then have

$$\begin{aligned} \Vert f^j\Vert \le W_{{\hat{\ell }}} = (W_{{\hat{\ell }}}\,\eta ^{-j}) \eta ^j \le (W_{{\hat{\ell }}}\,\eta ^{-{\hat{\ell }}}) \eta ^j = C \eta ^j. \end{aligned}$$

for all \({\hat{\ell }}-m \le j \le {\hat{\ell }}\). We now claim that the statement \(\Vert f^{k}\Vert \le C \eta ^k\) holds for all \(k \ge {\hat{\ell }}\). As just shown, this is obviously satisfied for the base case \(k = {\hat{\ell }}\). As part of the inductive step, let us assume that the estimate \(\Vert f^{j}\Vert \le C \eta ^j\) holds for all \(j = {\hat{\ell }}, {\hat{\ell }}+1,\ldots ,k\). (In fact, this bound also holds for \(j = {\hat{\ell }}-m,\ldots ,{\hat{\ell }}-1\)). By the definition of the index \(k_0\), we have \(\Vert f^{k_0}\Vert \le C\eta ^k\) and, due to (31), it follows

$$\begin{aligned} \Vert f^{k+1}\Vert&\le \kappa \Vert {f}^{k_0}\Vert + {\bar{\nu }} \cdot {\sum }_{i=0}^m \Vert f^{k-i}\Vert \le C\kappa \eta ^k + C{\bar{\nu }} \eta ^k {\sum }_{i=0}^m \left( \frac{1}{\eta }\right) ^i \\&= C\eta ^k \left[ \kappa + {\bar{\nu }} \cdot \frac{1-\eta ^{-(m+1)}}{1-\eta ^{-1}} \right] = C\eta ^k \left[ \kappa + \frac{{\bar{\nu }}}{\eta ^m} \cdot \frac{1-\eta ^{m+1}}{1-\eta } \right] = C\eta ^{k+1}. \end{aligned}$$

Hence, our claim also holds for \(k+1\) which finishes the induction and proof. \(\square \)

Under a stronger differentiability condition and a stricter update rule for \(\lambda _k\), we can recover the same local rate as in [13]:

Corollary 1

Let the assumptions stated in Theorem 3 hold and let g satisfy the differentiability condition

$$\begin{aligned} \Vert g(x) - g(x^*) - g^\prime (x^*)(x-x^*)\Vert = O(\Vert x-x^*\Vert ^2) \;\; \text {as} \;\; x \rightarrow x^*. \end{aligned}$$

Suppose that the weight \(\lambda _k\) is updated via \(\lambda _k = \mu _k \Vert f^{k_0}\Vert ^4\). Then, for all k sufficiently large we have

$$\begin{aligned} \Vert f^{k+1}\Vert \le \kappa \theta _k\Vert f^{k_0}\Vert + \sum \nolimits _{i=0}^{m} O(\Vert f^{k-i}\Vert ^2). \end{aligned}$$

Proof

As mentioned in Remark 3, our global results still hold if a different update strategy is used for the weight parameter \(\lambda _k\). Moreover, the proof of Theorem 2 also does not depend on the specific choice of \(\lambda _k\). Consequently, we only need to improve the bound (28) for \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \) derived in step 2 of the proof of Theorem 2. Using the additional differentiability property \(\Vert g(y) - g(x^*) - g^\prime (x^*)(y-x^*)\Vert = O(\Vert y-x^*\Vert ^2)\), \(y \rightarrow x^*\), we can directly improve the estimate for \(\Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k)\Vert \) in (27) as follows:

$$\begin{aligned} \Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k)\Vert \le {\sum }_{i=0}^m O(\Vert x^{k_i}-x^*\Vert ^2) + O(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ^2). \end{aligned}$$

Using the bound \((\sum _{i=0}^m y_i)^2 \le (m+1) \sum _{i=0}^m y_i^2\) for \(y \in \mathbb {R}^{m+1}\), we obtain \(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ^2 = \sum _{i=0}^m O(\Vert f^{k-i}\Vert ^2)\) and thus, mimicking and combining the derivations in step 2 of the proof of Theorem 2, we have

$$\begin{aligned} \Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k)\Vert \le {\sum }_{i=0}^m O(\Vert f^{k-i}\Vert ^2) \end{aligned}$$

and

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert \le \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert + {\sum }_{i=0}^m O(\Vert f^{k-i}\Vert ^2) \end{aligned}$$
(32)

as \(k \rightarrow \infty \). As in the previous proof, we can now infer \(\mu _k \rightarrow 0\) (this follows from \(\rho _k \ge p_2\) for all k sufficiently large) and \(\lambda _k = o(\Vert f^{k_0}\Vert ^4)\). Furthermore, as in (29), due to Eq. (25) and Assumption 3, it holds that \(\Vert {\hat{f}}^k(\alpha ^k)\Vert \le \Vert {\hat{f}}^k({\bar{\alpha }}^k)\Vert + o(\Vert f^{k_0}\Vert ^2)\). Combining this result with (32), we can then establish the convergence rate stated in Corollary 1. \(\square \)

Remark 8

The stronger differentiability condition, which was also used in [13] and other local analyses, is, e.g., satisfied when the derivative \(g^\prime \) is locally Lipschitz continuous around \(x^*\). More discussions of this property can also be found in “Appendix B”. We note that under this type of stronger differentiability, we can only improve the order of the remainder linearization error terms and not the linear rate of convergence.

3 Numerical Experiments

We verify the effectiveness of our method by applying it to several existing numerical solvers and comparing its convergence speed with the original solvers. We also include the acceleration approaches from [17, 40] for comparison. The regularized nonlinear acceleration (RNA) proposed in [40] computes an accelerated iterate via an affine combination of the previous k iterates, and it also introduces a quadratic regularization when computing the affine combination coefficients. Unlike our approach, it performs an acceleration step every k iterations instead of every iteration, and its regularization weight is determined by a grid search that finds the weight that leads to the lowest target function value at the accelerated iterate. The A2DR scheme proposed in [17] is a globalization of \(\textsf{AA}\) applied on Douglas–Rachford splitting, using a quadratic regularization together with an acceptance mechanism based on sufficient decrease of the residual. All experiments are carried out on a laptop with a Core i7-9750H at 2.6 GHz and 16 GB of RAM. The source code for the examples in this section is available at https://github.com/bldeng/Nonmonotone-AA.

Our method involves several parameters. The parameters \(p_1\), \(p_2\), \(\eta _1\) and \(\eta _2\), used for determining acceptance of the trial step and updating the regularization weight, are standard parameters for trust-region methods. We choose \(p_1=0.01\), \(p_2=0.25\), \(\eta _1=2\), \(\eta _2=0.25\) by default. The parameter \(\gamma \) affects the convex combination weights in computing \({r_{k}}\) in Eq. (12), and we choose \(\gamma = 10^{-4}\). For the parameter c in the definition of \({\textrm{pred}_{k}}\), we choose \(c = \kappa \) where \(\kappa < 1\) is a Lipschitz constant for the function g, to satisfy the conditions for Theorems 2 and 3. We will derive the value of \(\kappa \) in each experiment. The initial regularization factor \(\mu _0\) is set to \(\mu _0 = 1\) unless stated otherwise. Concerning the number m of previous iterates used in an \(\textsf{AA}\) step, we can make the following observations: a larger m tends to reduce the number of iterations required for convergence, but also increases the computational cost per iteration; our experiments suggest that choosing \(5 \le m \le 20\) often achieves a good balance. For each experiment below, we will include multiple choices of m for comparison. “Appendix C” provides some further ablation studies for the parameters \(p_1\), \(p_2\), \(\eta _1\), \(\eta _2\), and c.

3.1 Logistic Regression

First, to compare our method with the RNA scheme proposed in [40], we consider the following logistic regression problem from [40] that optimizes a decision variable \(x \in \mathbb {R}^n\):

$$\begin{aligned} \min _{x}~F(x), \end{aligned}$$
(33)

where

$$\begin{aligned} F(x)=\frac{1}{N}\sum \nolimits _{i=1}^N \log (1+\exp (-b_ia_i^Tx))+\frac{\tau }{2}\Vert x\Vert ^2, \end{aligned}$$
(34)

and \(a_i \in \mathbb {R}^n\), \(b_i \in \{-1, 1\}\) are the attributes and label of the data point i, respectively. Following [40], we consider a gradient descent solver \(x^{k+1} = g(x^k)\) with a fixed step size:

$$g(x) = x-\frac{2}{L_F+\tau }\nabla F(x),$$

where

$$\begin{aligned} L_F= \tau + \frac{\Vert A\Vert _2^2}{4N} \end{aligned}$$
(35)

is the Lipschitz constant of \(\nabla F\), and \(A=[a_1,\ldots ,a_N]^T\in \mathbb {R}^{N\times n}\). Then g is Lipschitz continuous with modulus

$$\begin{aligned} \kappa = \frac{L_F-\tau }{L_F+\tau } < 1 \end{aligned}$$

and differentiable, which satisfies Assumption 2. We apply our approach (denoted by “LM-AA”) and RNA to this solver, and compare their performance on two datasets: covtype (54 features, 581,012 points) and sido0 (4932 features, 12,678 points). For each dataset, we normalize the attributes and solve the problem with \(\tau = L_F / 10^{6}\) and \(\tau = L_F / 10^{9}\), respectively. For the implementation of RNA, we use the source code released by the authors of [40]. For our method, we set \(\mu _0 = 100\) and test \(m = 10, 15, 20\). RNA performs an acceleration step every k iterations, and we test \(k=5, 10, 20\). All other RNA parameters are set to their default values as provided in the source code (in particular, with grid-search adaptive regularization weight and line search enabled). Figure 1 plots for each method the normalized target function value \((F(x^k) - F^*)/F^*\) with respect to the iteration count and computational time, where \(F^*\) is the ground-truth global minimum computed by running each method until full convergence and taking the minimum function value among all methods. All variants of LM-AA and RNA accelerate the decrease of the target function compared with the original gradient descent solver, with our methods achieving an overall faster decrease.
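For reference, a sketch of the fixed-point map used in this experiment, set up on synthetic random data in place of the benchmark datasets (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 500, 20
A = rng.standard_normal((N, n))
b = rng.choice([-1.0, 1.0], size=N)

base = np.linalg.norm(A, 2) ** 2 / (4 * N)
tau = base / 1e6
L_F = tau + base                      # Lipschitz constant of grad F, Eq. (35)

def grad_F(x):
    z = -b * (A @ x)                  # z_i = -b_i a_i^T x
    s = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z_i)
    return A.T @ (-b * s) / N + tau * x

def g(x):
    """Fixed-step gradient map with step size 2/(L_F + tau)."""
    return x - 2.0 / (L_F + tau) * grad_F(x)
```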

Fig. 1: Comparison between RNA [40] and our method on a gradient descent solver for the logistic regression problem (34) for the covtype and sido0 datasets, with a different choice of parameter \(\tau \) in each row

3.2 Image Reconstruction

Next, we consider a nonsmooth problem proposed in [51] for total variation based image reconstruction:

$$\begin{aligned} \min _{w,u} ~{\sum }_{i=1}^{N^2}\Vert w_i\Vert _2+\frac{\beta }{2}{\sum }_{i=1}^{N^2}\Vert w_i-D_i u\Vert _2^2+\frac{\nu }{2}\Vert Ku-s\Vert ^2_2, \end{aligned}$$
(36)

where \(s\in [0,1]^{N^2}\) is an \(N \times N\) input image, \(u\in \mathbb {R}^{N^2}\) is the output image to be optimized, \(K\in \mathbb {R}^{N^2\times N^2}\) is a linear operator, \(D_i \in \mathbb {R}^{2 \times N^2}\) represents the discrete gradient operator at pixel i, \(w = (w_{1}^T,\ldots ,w_{N^2}^T)^T \in \mathbb {R}^{2N^2}\) are auxiliary variables for the image gradients, \(\nu > 0\) is a fidelity weight, and \(\beta > 0\) is a penalty parameter. The solver in [51] can be written as alternating minimization between u and w as follows:

$$\begin{aligned} u^{k+1}&= \mathop {\textrm{argmin}}\limits _{u} \frac{\beta }{2}{\sum }_{i=1}^{N^2}\Vert w_i^k-D_i u\Vert _2^2+\frac{\nu }{2}\Vert Ku-s\Vert ^2_2, \end{aligned}$$
(37)
$$\begin{aligned} w^{k+1}&= \mathop {\textrm{argmin}}\limits _{w} {\sum }_{i=1}^{N^2}\Vert w_i\Vert _2+\frac{\beta }{2}{\sum }_{i=1}^{N^2}\Vert w_i-D_i u^{k+1}\Vert _2^2 . \end{aligned}$$
(38)

The solutions to the subproblems (37) and (38) can both be computed in closed form. When \(\beta \) and \(\nu \) are fixed, this can be treated as a fixed-point iteration \(w^{k+1} = g(w^k)\), and it satisfies Assumption 2 (see “Appendix B.2” for a detailed derivation of g and verification of Assumption 2). In the following, we consider the solver with \(K = I\) and \(\nu = 4\) for image denoising. In this case, condition (B.1) is satisfied with

$$\begin{aligned} \kappa = 1-\left( 1+\frac{4\beta }{\nu }\right) ^{-1} \end{aligned}$$

(see “Appendix B.2” for the derivation). We apply this solver to a \(1024 \times 1024\) image with added Gaussian noise that has a zero mean and a variance of \(\sigma = 0.05\) (see Fig. 2).
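For reference, the solution of the w-subproblem (38) is the standard two-dimensional shrinkage; a sketch, where Du stacks the per-pixel gradients \(D_i u^{k+1}\) (the implementation details are illustrative, see [51] for the full solver):

```python
import numpy as np

def w_step(Du, beta):
    """Closed-form solution of (38): for each pixel i,
    w_i = max(||d_i|| - 1/beta, 0) * d_i / ||d_i||  with  d_i = D_i u.

    Du has shape (num_pixels, 2), one image-gradient row per pixel."""
    norms = np.linalg.norm(Du, axis=1, keepdims=True)
    scale = np.maximum(norms - 1.0 / beta, 0.0) / np.maximum(norms, 1e-12)
    return scale * Du
```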

Fig. 2: Application of our method to an alternating minimization solver for an image denoising problem (36)

We use the source code released by the authors of [51] for the implementation of this solver, and apply our acceleration method with \(m = 1, 3, 5\), respectively. For comparison, we also apply the RNA scheme with \(k = 2, 5, 10\), respectively. Here we choose smaller values of m and k than in the logistic regression example because both RNA and our method have a high relative overhead on this problem, which means that larger values of m or k may induce overhead that offsets the performance gain from acceleration. As in the logistic regression example, we use the released source code of RNA for our experiments and set all RNA parameters to their default values. Figure 2 plots the residual norm \(\Vert f(w)\Vert \) for all methods, with \(\beta = 100\) and \(\beta =1000\), respectively. All instances of the acceleration methods converge faster to a fixed point than the original alternating minimization solver, except for RNA with \(k=2\), which is slower in terms of the actual computational time due to its overhead for the grid search of regularization parameters. Overall, the two acceleration approaches achieve a rather similar performance on this problem.

3.3 Nonnegative Least Squares

Finally, to compare our method with [17], we consider a nonnegative least squares (NNLS) problem that is used in [17] for evaluation:

$$\begin{aligned} \min _{x}~\psi (x)+\varphi (x), \end{aligned}$$
(39)

where \(x=(x_1,x_2)\in \mathbb {R}^{2q}\), \(\psi (x)=\Vert Hx_1-t\Vert _2^2+\mathcal {I}_{x_2\ge 0}(x_2)\), and \(\varphi (x)=\mathcal {I}_{x_1=x_2}(x)\), with \(\mathcal {I}_S\) being the indicator function of the set S. The Douglas–Rachford splitting (DRS) solver for this problem can be written as

$$\begin{aligned} v^{k+1} =g(v^k) ={\textstyle {\frac{1}{2}}}((2\textrm{prox}_{\beta \varphi }-I)(2\textrm{prox}_{\beta \psi }-I)+I)v^k \end{aligned}$$
(40)

where \(v^{k}=(v_1^k,v_2^k)\in \mathbb {R}^{2q}\) is an auxiliary variable for DRS and \(\beta \) is the penalty parameter. In [17], the authors use their regularized \(\textsf{AA}\) method (A2DR) to accelerate the DRS solver (40). To apply our method, we verify in “Appendix B.3” that if H is of full column rank, then g satisfies condition (B.1) with

$$\begin{aligned} \kappa = \frac{\sqrt{3+c_1^2}}{2} < 1 \end{aligned}$$

where \(c_1=\max \{\frac{\beta \sigma _1-1}{\beta \sigma _1+1},\frac{1-\beta \sigma _0}{1+\beta \sigma _0}\}\), and \(\sigma _0,\sigma _1\) are the minimal and maximal eigenvalues of \(2H^T H\), respectively. Moreover, g is also differentiable under a mild condition. We compare our method with A2DR on the solver (40), using the same \(\textsf{AA}\) parameters \(m = 10, 15, 20\). The methods are tested on a \(600 \times 300\) sparse random matrix H with \(1\%\) nonzero entries and a random vector t. We use the source code released by the authors of [17] for the implementation of A2DR, and set all A2DR parameters to their default values. While A2DR and DRS are implemented with parallel evaluation of the proximity operators in the released A2DR code, we implement our method as a single-threaded application for simplicity. Figure 3 plots the residual norm \(\Vert f(v)\Vert \) for DRS and the two acceleration methods. It also plots the norm of the overall residual \(r = (r_{\textrm{prim}}, r_{\textrm{dual}})\) used in [17] for measuring convergence, where \(r_{\textrm{prim}}\) and \(r_{\textrm{dual}}\) denote the primal and dual residuals as defined in Equations (7) and (8) of [17], respectively. For both residual measures, the original DRS solver converges slowly after the initial iterations, whereas the two acceleration methods achieve a significant speedup. Moreover, the single-threaded implementation of our method outperforms the parallel A2DR with the same parameter m, in terms of both iteration count and computational time.
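To make the map (40) explicit, the sketch below assembles the two proximity operators for the splitting (39): prox of \(\beta \psi \) decouples into a regularized least-squares solve in \(x_1\) and a projection onto the nonnegative orthant in \(x_2\), while prox of \(\beta \varphi \) is the projection onto the consensus set \(\{x_1 = x_2\}\), i.e., block averaging. This is a minimal sketch under our reading of (39); the helper names are illustrative, and the dense solve is for brevity only (H in the experiment is sparse).

```python
import numpy as np

def make_drs_map(H, t, beta):
    """DRS fixed-point map (40) for the NNLS splitting (39)."""
    q = H.shape[1]
    # prox_{beta*psi} on the x1 block: (2*beta*H^T H + I) x1 = v1 + 2*beta*H^T t.
    M = 2.0 * beta * (H.T @ H) + np.eye(q)
    Minv = np.linalg.inv(M)            # factor once; reused every iteration
    c = 2.0 * beta * (H.T @ t)
    def prox_psi(v1, v2):
        return Minv @ (v1 + c), np.maximum(v2, 0.0)
    def prox_phi(v1, v2):
        avg = 0.5 * (v1 + v2)          # projection onto the set {x1 = x2}
        return avg, avg
    def g(v):
        v1, v2 = v[:q], v[q:]
        p1, p2 = prox_psi(v1, v2)
        r1, r2 = 2.0 * p1 - v1, 2.0 * p2 - v2   # reflection of prox_psi
        s1, s2 = prox_phi(r1, r2)
        # g(v) = (1/2) * ((2*prox_phi - I)(2*prox_psi - I) + I) v
        return 0.5 * np.concatenate([2.0 * s1 - r1 + v1, 2.0 * s2 - r2 + v2])
    return g
```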

Fig. 3 Comparison between A2DR [17] and our method on the NNLS solver (40) with a \(600 \times 300\) sparse random matrix H and a random vector t

3.4 Statistics of Successful Steps

Our acceptance mechanism plays a key role in achieving the global and local convergence of the proposed method. To demonstrate its behavior, we provide statistics of the successful steps for the experiments shown in Figs. 1, 2 and 3. Specifically, for each instance of LM-AA, we count the total number of steps required to reach a certain level of accuracy and compare it with the number of successful \(\textsf{AA}\) steps among them. Tables 1 and 2 show the statistics for the two datasets in Fig. 1, while Tables 3 and 4 correspond to Figs. 2 and 3, respectively. Besides the total number of steps, we report the success rate, defined as the ratio between the number of successful steps and the total number of steps required to reach different levels of accuracy (a small bookkeeping sketch is given after the table captions below).

Table 1 Statistics of successful steps of LM-AA for the logistic regression problem (34) and the dataset covtype
Table 2 Statistics of successful steps of LM-AA for the logistic regression problem (34) and the dataset sido0
Table 3 Statistics of successful steps of LM-AA for the image denoising problem (36)
Table 4 Statistics of successful steps of LM-AA for the nonnegative least squares problem (39)
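As a reading aid, the following minimal sketch shows the bookkeeping behind the reported success rates; the helper and its inputs are hypothetical and not part of our solver.

```python
# Hypothetical helper: given the residual norm history and a boolean flag per
# iteration marking whether the AA step was accepted, compute the success
# rate at each accuracy level tol.
def success_rates(res_norms, aa_accepted, tols):
    rates = {}
    for tol in tols:
        # index of the first iteration reaching the target accuracy
        hit = next((k for k, r in enumerate(res_norms) if r <= tol), None)
        if hit is not None:
            rates[tol] = sum(aa_accepted[:hit + 1]) / (hit + 1)
    return rates
```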

The results in Table 4 demonstrate that essentially all \(\textsf{AA}\) steps are accepted in the nonnegative least squares problem, independently of the choice of the parameter m. The success rate of \(\textsf{AA}\) steps only decreases, with more fallback fixed-point iterations being performed, when we solve the problem to the highest accuracy \(\textit{tol} = 10^{-15}\). Since this accuracy is close to machine precision, the effect is mainly caused by numerical errors and inaccuracies that affect the computation and quality of an \(\textsf{AA}\) step. Table 3 illustrates that a similar behavior can be observed for the image denoising problem (36) when setting \(m=1\). The results in Table 3 also demonstrate a second typical effect: the success rate of \(\textsf{AA}\) steps is often lower when the chosen accuracy is relatively low. With increasing accuracy, the rate then rises to around 70–80%. This general observation is also supported by our results for logistic regression, see Tables 1 and 2 (here, the maximum success rate is more sensitive to the choice of m, \(\tau \), and the dataset).

In summary, the statistics provided in Tables 1, 2, 3, and 4 support our theoretical results. The success rate of \(\textsf{AA}\) steps gradually increases as the iterates approach the fixed point, which indicates a transition to a pure regularized \(\textsf{AA}\) scheme. Furthermore, since more \(\textsf{AA}\) steps tend to be rejected at the beginning of the iterative procedure, our globalization mechanism is essential for guaranteeing global progress and convergence of the approach.

4 Conclusions

We propose a novel globalization technique for Anderson acceleration which combines adaptive quadratic regularization with a nonmonotone acceptance strategy. We prove the global convergence of our approach under mild assumptions. Furthermore, we show that the proposed globalized \(\textsf{AA}\) scheme has the same local convergence rate as the original \(\textsf{AA}\) iteration, so the globalization mechanism does not hinder the acceleration effect of \(\textsf{AA}\). To our knowledge, this is one of the first \(\textsf{AA}\) globalization methods that achieve global convergence and fast local convergence simultaneously. Several numerical examples illustrate that our method is competitive and that it can improve the efficiency of a variety of numerical solvers.