1 Introduction

In applied mathematics, many problems can be reduced to solving a nonlinear fixed-point equation \(g(x)=x\), where \(x\in \mathbb {R}^n\) and \(g:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is a given function. If g is a contractive mapping, i.e.,

$$\begin{aligned} \Vert g(x)-g(y)\Vert \le \kappa \Vert x-y\Vert \quad \forall ~x,y\in \mathbb {R}^n, \end{aligned}$$
(1)

where \(\kappa <1\), then the iteration

$$\begin{aligned} x^{k+1}=g(x^k) \end{aligned}$$

is guaranteed to converge to the unique fixed point of g by Banach’s fixed-point theorem. Anderson acceleration (\(\textsf{AA}\)) [2, 3, 49] is a technique for speeding up the convergence of such an iterative process. Instead of using the update \(x^{k+1}=g(x^k)\), it generates \(x^{k+1}\) as an affine combination of the latest \(m+1\) steps:

$$\begin{aligned} x^{k+1} = g(x^{k})+{\sum }_{i=1}^{m}\alpha _i^*(g(x^{k-i})-g(x^{k})) \end{aligned}$$
(2)

with the combination coefficients \(\alpha ^{*}=(\alpha ^{*}_1, \ldots ,\alpha ^{*}_m) \in \mathbb {R}^m\) being computed via an optimization problem

$$\begin{aligned} \min _{\alpha }~\left\| f(x^{k})+{\sum }_{i=1}^{m}\alpha _i(f(x^{k-i})-f(x^{k}))\right\| ^2, \end{aligned}$$
(3)

where \(f(x^{k}) = g(x^{k}) - x^{k}\) denotes the residual function.
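For concreteness, the following Python sketch implements the classical updates (2) and (3) for a generic map g, with the history truncated to \(\hat{m} = \min \{m, k\}\) entries during the first iterations; the function names and the plain least-squares solve are our own illustrative choices, not the regularized method developed in this paper.

```python
import numpy as np

def anderson_step(g_hist, f_hist):
    """One classical AA update, cf. Eqs. (2)-(3).

    g_hist = [g(x^k), g(x^{k-1}), ..., g(x^{k-m})],
    f_hist = [f(x^k), f(x^{k-1}), ..., f(x^{k-m})].
    """
    gk, fk = g_hist[0], f_hist[0]
    # Columns f(x^{k-i}) - f(x^k) for i = 1, ..., m.
    D = np.column_stack([fi - fk for fi in f_hist[1:]])
    # alpha* minimizes ||f(x^k) + D alpha||^2, cf. Eq. (3).
    alpha, *_ = np.linalg.lstsq(D, -fk, rcond=None)
    # Accelerated iterate, cf. Eq. (2).
    return gk + sum(a * (gi - gk) for a, gi in zip(alpha, g_hist[1:]))

def aa(g, x0, m=5, iters=100):
    """Drive the fixed-point map g with AA, keeping the latest m+1 records."""
    x, g_hist, f_hist = x0, [], []
    for _ in range(iters):
        gx = g(x)
        g_hist, f_hist = [gx] + g_hist[:m], [gx - x] + f_hist[:m]
        x = anderson_step(g_hist, f_hist) if len(g_hist) > 1 else gx
    return x
```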

\(\textsf{AA}\) was initially proposed to solve integral equations [2] and has gained popularity in recent years for accelerating fixed-point iterations [49]. Examples include tensor decomposition [42], linear system solving [35], and reinforcement learning [18], among many others [1, 7, 24, 25, 27, 29, 30, 33, 52, 54].

On the theoretical side, it has been shown that \(\textsf{AA}\) is a quasi-Newton method for finding a root of the residual function [14, 16, 38]. When applied to linear problems (i.e., if \(g(x) = Ax - b\)), \(\textsf{AA}\) is equivalent to the generalized minimal residual method (GMRES) [34]. For nonlinear problems, \(\textsf{AA}\) is also closely related to the nonlinear generalized minimal residual method [50]. A local convergence analysis of \(\textsf{AA}\) for general nonlinear problems was first given in [45, 46] under the basic assumptions that g is Lipschitz continuously differentiable and the \(\textsf{AA}\) mixing coefficients \(\alpha \), determined in (3), stay in a compact set. However, the convergence rate provided in [45, 46] is no faster than that of the original fixed-point iteration. A more recent analysis in [13] shows that \(\textsf{AA}\) can indeed accelerate the local linear convergence of a fixed-point iteration up to an additional quadratic error term. This is further improved in [32], where q-linear convergence of \(\textsf{AA}\) is established. The convergence result in [32] requires sufficient linear independence of the columns of \([f(x^{k-1})-f(x^k),\dots ,f(x^{k-m})-f(x^{k-m+1})]\), which is typically stronger than the previously mentioned boundedness assumption on the coefficients \(\alpha \). By assuming the mixing coefficient \(\alpha \) to be stationary during the iteration, an exact rate of \(\textsf{AA}\) is derived in [50].

One issue of classical \(\textsf{AA}\) is that it can suffer from instability and stagnation [34, 39]. Different techniques have been proposed to address this issue. For example, safeguarding checks were introduced in [30, 54] to only accept an \(\textsf{AA}\) step if it meets certain criteria, but without a theoretical guarantee for convergence. Another direction is to introduce regularization to the problem (3) for computing the combination coefficients. In [17], a quadratic regularization is used together with a safeguarding step to achieve global convergence of \(\textsf{AA}\) on Douglas–Rachford splitting, but there is no guarantee that the local convergence is faster than the original solver. In [39, 40], a similar quadratic regularization is introduced to achieve local convergence, although no global convergence proof is provided. A more detailed discussion of related literature and specific techniques connected to our algorithmic design and development is deferred to Sect. 2.1.

As far as we are aware, none of the existing approaches and modified versions of \(\textsf{AA}\) guarantee both global convergence and accelerated local convergence. In this paper, we propose a novel \(\textsf{AA}\) globalization scheme that achieves these two goals simultaneously. Specifically, we apply a quadratic regularization with its weight adjusted automatically according to the effectiveness of the \(\textsf{AA}\) step. We adapt the nonmonotone trust-region framework in [47] to update the weight and to determine the acceptance of the \(\textsf{AA}\) step. Our approach not only achieves global convergence but also attains the same local convergence rate established in [13] for \(\textsf{AA}\) without regularization. Furthermore, our local results also cover applications where the mapping g is nonsmooth and differentiability is only required at a target fixed-point of g. To the best of our knowledge, this is the first globalization technique for \(\textsf{AA}\) that achieves the same local convergence rate as the original \(\textsf{AA}\) scheme. Numerical experiments on both smooth and nonsmooth problems verify the effectiveness and efficiency of our method.

Notations Throughout this work, we restrict our discussion to the n-dimensional Euclidean space \(\mathbb {R}^n\). For a vector x, \(\Vert x\Vert \) denotes its Euclidean norm, and \(\mathbb {B}_{\epsilon }(x):= \{y: \Vert y -x \Vert \le \epsilon \}\) denotes the Euclidean ball centered at x with radius \(\epsilon \). For a matrix A, \(\Vert A\Vert \) is the operator norm with respect to the Euclidean norm. We use I to denote both the identity mapping (i.e., \(I(x) = x\)) and the identity matrix. For a function \(h: \mathbb {R}^n\rightarrow \mathbb {R}^\ell \), the mapping \(h^\prime : \mathbb {R}^n\rightarrow \mathbb {R}^{\ell \times n}\) represents its derivative. h is called L-smooth if it is differentiable and \(\Vert h^\prime (x)-h^\prime (y)\Vert \le L\Vert x-y\Vert \) for all \(x,y\in \mathbb {R}^n\). An operator \(h:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is called nonexpansive if for all \(x,y\in \mathbb {R}^n\) we have \(\Vert h(x)-h(y)\Vert \le \Vert x-y\Vert \). We say that the operator h is \(\rho \)-averaged for some \(\rho \in (0,1)\) if there exists a nonexpansive operator \(R: \mathbb {R}^n\rightarrow \mathbb {R}^n\) such that \(h = (1-\rho )I+\rho R\). The set of fixed points of the mapping h is defined via \(\textrm{Fix}(h):= \{x: h(x)=x\}\). The interested reader is referred to [4] for further details on operator theory.

The \(\textsf{AA}\) formulation in Eqs. (2) and (3) assumes \(k \ge m\). It can be adapted to account for the case \(k < m\) by using \(\hat{m}\) coefficients instead where \(\hat{m} = \min \{m, k\}\). Without loss of generality, we use m to refer to the actual number of coefficients being used.

2 Algorithm and Convergence Analysis

2.1 Adaptive Regularization for \(\textsf{AA}\)

In the following, we set \(f^k:= f(x^k)\) and \(g^k:= g(x^k)\) to simplify notation. We first note that the accelerated iterate computed via (2) and (3) is invariant under permutations of the indices of \(\{f^j\}\) and \(\{g^j\}\). Concretely, let \(\varPi _k:= (k_0,k_1,\dots ,k_m)\) be any permutation of the index sequence \((k,k-1,\dots ,k-m)\). Then the point \(x^{k+1}\) calculated in Eq. (2) also satisfies

$$\begin{aligned} x^{k+1} = g^{k_0}+{\sum }_{i=1}^{m} \bar{\alpha }_i^k(g^{k_i}-g^{k_0}), \end{aligned}$$
(4)

with coefficients \(\bar{\alpha }^k = (\bar{\alpha }_1^k, \ldots , \bar{\alpha }_m^k) \in \mathbb {R}^m\) computed via

$$\begin{aligned} \bar{\alpha }^k \in \mathop {\textrm{argmin}}\limits _{\alpha } \left\| f^{k_0}+{\sum }_{i=1}^{m}\alpha _i(f^{k_i}-f^{k_0})\right\| ^2, \end{aligned}$$
(5)

which amounts to solving a linear system. In this paper, we use a particular class of permutations where

$$\begin{aligned} k_0 =\max \left\{ j \mid j \in \mathop {\textrm{argmin}}\limits \nolimits _{i\in \{k,k-1, \ldots , k-m\}}\Vert f^i\Vert \right\} , \end{aligned}$$
(6)

i.e., \(k_0\) is the largest index that attains the minimum residual norm among \(\Vert f^{k-m}\Vert , \Vert f^{k-m+1}\Vert , \ldots , \Vert f^k\Vert \). As we will see later, this type of permutation allows us to apply certain nonmonotone globalization techniques and to ultimately establish local and global convergence of our approach. An ablation study on the potential effect of the permutation strategy is presented in “Appendix D”.
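In an implementation, \(k_0\) can be determined by a single scan over the stored residual norms, breaking ties toward the most recent iterate; a minimal sketch (the function name is ours):

```python
import numpy as np

def select_k0(f_window):
    """Pick k_0 per Eq. (6). f_window[j] stores f^{k-m+j}, i.e. the
    entries are ordered from oldest to newest."""
    norms = [np.linalg.norm(fi) for fi in f_window]
    best = min(norms)
    for j in reversed(range(len(norms))):  # newest (largest index) first
        if norms[j] == best:
            return j
```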

One potential cause of instability of \(\textsf{AA}\) is the (near) linear dependence of the vectors \(\{f^{k_i}-f^{k_0}: i = 1, \ldots , m\}\), which can result in (near) rank deficiency of the linear system matrix for the problem (5). To address this issue, we introduce a quadratic regularization to the problem (5) and compute the coefficients \(\alpha ^{k}\) via:

$$\begin{aligned} \alpha ^{k} = \mathop {\textrm{argmin}}\limits _{\alpha }~\Vert {{\hat{f}}}^k(\alpha )\Vert ^2 + \lambda _k \Vert \alpha \Vert ^2, \end{aligned}$$
(7)

where \(\lambda _k > 0\) is a regularization weight, and

$$\begin{aligned} {{\hat{f}}}^k(\alpha ):= f^{k_0}+{\sum }_{i=1}^{m}\alpha _i(f^{k_i}-f^{k_0}). \end{aligned}$$
(8)

The coefficients \(\alpha ^{k}\) are then used to compute a trial step in the same way as in Eq. (4). In the following, we denote this trial step as \({{\hat{g}}}^k(\alpha ^k)\) where

$$\begin{aligned} \hat{g}^k(\alpha ):= g^{k_0}+{\sum }_{i=1}^{m}\alpha _i(g^{k_i}-g^{k_0}). \end{aligned}$$
(9)
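A minimal sketch of this computation, assuming the window histories are stored with the \(k_0\) entry first (the normal-equations form anticipates the linear system \((J^TJ + \lambda _k I)\alpha ^k = -J^Tf^{k_0}\) discussed at the end of this section; function and variable names are ours):

```python
import numpy as np

def regularized_trial_step(g_hist, f_hist, lam):
    """Solve the regularized problem (7) and form the trial step (9).

    g_hist[0] = g^{k_0} and f_hist[0] = f^{k_0}; the remaining entries are
    the other m members of the window in any fixed order."""
    f0, g0 = f_hist[0], g_hist[0]
    J = np.column_stack([fi - f0 for fi in f_hist[1:]])   # n x m
    m = J.shape[1]
    # alpha^k solves (J^T J + lam I) alpha = -J^T f^{k_0}, cf. Eq. (7).
    alpha = np.linalg.solve(J.T @ J + lam * np.eye(m), -J.T @ f0)
    f_hat = f0 + J @ alpha                                # Eq. (8)
    g_trial = g0 + sum(a * (gi - g0)
                       for a, gi in zip(alpha, g_hist[1:]))  # Eq. (9)
    return g_trial, f_hat, alpha
```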

The trial step is accepted as the new iterate if it meets certain criteria (which we will develop in the following in detail). Regularization such as the one in Eq. (7) has been suggested in [3] and is applied in [17, 40]. A major difference between our approach and the regularization in [17, 40] is the choice of \(\lambda _{k}\): in [17] it is set in a heuristic manner, whereas in [40] it is either fixed or specified via grid search. We instead update \(\lambda _k\) adaptively based on the effectiveness of the latest \(\textsf{AA}\) step. Specifically, we observe that a larger value of \(\lambda _k\) can improve the stability of the resulting linear system; it also induces a stronger penalty on the magnitude of \(\Vert \alpha ^{k}\Vert \). In this case, the trial step \(\hat{g}^k(\alpha ^k)\) tends to be closer to \(g^{k_0}\), which, according to Eq. (6), is the fixed-point iteration step with the smallest residual among the latest \(m+1\) iterates. On the other hand, a larger regularization weight may also hinder the fast convergence of \(\textsf{AA}\) if it is already effective in reducing the residual without regularization. Thus, \(\lambda _k\) is dynamically adjusted according to the reduction of the residual in the current step.

Our adaptive regularization scheme is inspired by the similarity between the problem (7) and the Levenberg-Marquardt (LM) algorithm [20, 26], a popular approach for solving nonlinear least squares problems of the form \(\min _x \Vert F(x)\Vert ^2\), where F is a vector-valued function. Each iteration of LM computes a variable update \(d^k:= x^{k+1} - x^k\) by solving a quadratic problem

$$\begin{aligned} \mathop {\textrm{argmin}}\limits _d ~\Vert F(x^k) + F'(x^k) d\Vert ^2 + \bar{\lambda }_k \Vert d\Vert ^2. \end{aligned}$$

Here, the first term is a local quadratic approximation of the target function \(\Vert F(x)\Vert ^2\) using the first-order Taylor expansion of F, while the second term is a regularization with a weight \(\bar{\lambda }_k > 0\). LM can be considered as a regularized version of the classical Gauss-Newton (GN) method for nonlinear least squares optimization [23]. In GN, each iteration computes an initial step d by minimizing the local quadratic approximation term only, i.e.,

$$\begin{aligned} \mathop {\textrm{argmin}}\limits _d ~\Vert F(x^k) + F'(x^k) d\Vert ^2, \end{aligned}$$
(10)

which amounts to solving a linear system for d with the positive semidefinite matrix \((F'(x^k))^T F'(x^k)\). Similar to \(\textsf{AA}\), the (near) linear dependence between the columns of \(F'(x^k)\) can lead to (near) rank deficiency of the system matrix, causing potential instability. To address this issue, LM introduces a quadratic regularization term for d, which adds a scaled identity matrix to the linear system matrix and prevents it from being singular. Furthermore, LM measures the effectiveness of the computed update using the ratio between the actual reduction of the target function and a predicted reduction based on the quadratic approximation. This measure is utilized to determine the acceptance of the update, to enforce monotonic decrease of the target function, and to update the regularization weight for the next iteration. Such an adaptive regularization is an instance of a trust-region method [10].
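For reference, one LM trial step can be sketched as follows, where F and Jac are user-supplied callables for the residual and its Jacobian (the function name is ours):

```python
import numpy as np

def lm_step(F, Jac, x, lam_bar):
    """One Levenberg-Marquardt trial step:
    d minimizes ||F(x) + F'(x) d||^2 + lam_bar ||d||^2."""
    Fx, Jx = F(x), Jac(x)
    n = Jx.shape[1]
    d = np.linalg.solve(Jx.T @ Jx + lam_bar * np.eye(n), -Jx.T @ Fx)
    return x + d
```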

Taking a similar approach as LM, we define two functions \({\textrm{ared}_{k}}\) and \({\textrm{pred}_{k}}\) that measure the actual and predicted reduction of the residual resulting from the solution \(\alpha ^k\) to (7):

$$\begin{aligned} {\textrm{ared}_{k}}:={r_{k}}-\Vert f(\hat{g}^k(\alpha ^k))\Vert , \quad {\textrm{pred}_{k}}:={r_{k}} -c\Vert \hat{f}^k(\alpha ^k)\Vert , \end{aligned}$$
(11)

where \(c \in (0, 1)\) is a constant. Here \({r_{k}}\) measures the residuals from the latest \(m+1\) iterates via a convex combination:

$$\begin{aligned} {r_{k}}:= (1- m \gamma )\Vert f^{k_0}\Vert + {\sum }_{i=1}^m \gamma \Vert f^{k_i}\Vert , \end{aligned}$$
(12)

with \(\gamma \in (0, \frac{1}{m+1})\) such that a higher weight is assigned to the smallest residual \(f^{k_0}\) among them. Note that \(\hat{g}^k(\alpha ^k)\) is the trial step, and \(f(\cdot )\) is the residual function. Thus \({\textrm{ared}_{k}}\) compares the latest residuals with the residual resulting from the trial step. This specific choice of \(r_k\) is inspired by the local descent properties of \(\textsf{AA}\), see, e.g., [13, Theorem 4.4]. Moreover, note that \(\hat{f}^k(\cdot )\) (see Eq. (8)) is a linear approximation of the residual function based on the latest residual values, and it is used in problem (7) to derive the coefficients \(\alpha ^k\) for computing the trial step. Thus \(\hat{f}^k(\alpha ^k)\) is a predicted residual for the trial step based on the linear approximation, and \({\textrm{pred}_{k}}\) compares it with the latest residuals. The constant c guarantees that \({\textrm{pred}_{k}}\) has a positive value (as long as a solution to the problem has not been found; see “Appendix A” for a proof). Similar to LM, we calculate the ratio

$$\begin{aligned} \rho _k = \frac{{\textrm{ared}_{k}}}{{\textrm{pred}_{k}}} \end{aligned}$$
(13)

as a measure of effectiveness for the trial step \(\hat{g}^k(\alpha ^k)\) computed with Eqs. (7) and (9). In particular, if \(\rho _k \ge p_1\) with a threshold \(p_1 \in (0,1)\), then from Eq. (13) and using the positivity of \({\textrm{pred}_{k}}\) we can bound the residual of \(\hat{g}^k(\alpha ^k)\) via

$$\begin{aligned} \Vert f(\hat{g}^k(\alpha ^k))\Vert \le (1-p_1) r_k + p_1 c \Vert \hat{f}^k(\alpha ^k)\Vert < (1-p_1) r_k + p_1 \Vert f^{k_0}\Vert . \end{aligned}$$
(14)

Like \(r_k\), the last expression \((1-p_1) r_k + p_1 \Vert f^{k_0}\Vert \) in Eq. (14) is also a convex combination of the latest \(m+1\) residuals, but with a higher weight on the smallest residual \(f^{k_0}\) than on \(r_k\). Hence, when \(\rho _k \ge p_1\), we consider the decrease of the residual to be sufficient. In this case, we set \(x^{k+1} = \hat{g}^k(\alpha ^k)\) and say the iteration is successful. Otherwise, we discard the trial step and choose \(x^{k+1} = g^{k_0} = g(x^{k_0})\), which corresponds to the fixed-point iteration step with the smallest residual among the latest \(m+1\) iterates. Thus, by permuting the indices \((k,k-1,\ldots ,k-m)\) according to \(\varPi _k\), we ensure that the most progress in terms of reducing the residual is achieved when an \(\textsf{AA}\) trial step is rejected.
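The acceptance test (11)–(14) translates directly into code; a minimal sketch operating on the permuted residual norms (function and variable names are ours):

```python
def accept_trial(f_norms, f_hat_norm, f_trial_norm, c, gamma, p1):
    """Nonmonotone acceptance test, cf. Eqs. (11)-(14).

    f_norms = [||f^{k_0}||, ||f^{k_1}||, ..., ||f^{k_m}||] with the smallest
    norm first; c in (0,1), gamma in (0, 1/(m+1)), p1 in (0,1)."""
    m = len(f_norms) - 1
    r_k = (1 - m * gamma) * f_norms[0] + gamma * sum(f_norms[1:])  # Eq. (12)
    ared = r_k - f_trial_norm    # actual reduction, Eq. (11)
    pred = r_k - c * f_hat_norm  # predicted reduction, Eq. (11); positive away from a solution
    rho = ared / pred            # Eq. (13)
    return rho >= p1, rho
```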

We also adjust the regularization weight \(\lambda _k\) according to the ratio \(\rho _k\). Specifically, we set

$$\begin{aligned} \lambda _k= \mu _k \Vert f^{k_0}\Vert ^2, \end{aligned}$$
(15)

where the factor \(\mu _k > 0\) is automatically updated based on \(\rho _{k}\) as follows (a compact code sketch is given after the list):

  • If \(\rho _{k} < p_1\), then we consider the decrease of the residual to be insufficient and we increase the factor in the next iteration via \(\mu _{k+1} = \eta _1\mu _k\) with a constant \(\eta _1 > 1\).

  • If \(\rho _{k} > p_2\) with a threshold \(p_2 \in (p_1, 1)\), then we consider the decrease to be high enough and reduce the factor via \(\mu _{k+1} = \eta _2\mu _k\) with a constant \(\eta _2 \in (0,1)\). This will relax the regularization so that the next trial step will tend to be closer to the original \(\textsf{AA}\) step.

  • Otherwise, in the case \(\rho _{k} \in [p_1, p_2]\), the factor remains the same in the next iteration.

Here the choice of the parameters \(p_1, p_2\), where \(0< p_1< p_2 < 1\), follows the convention of basic trust-region methods [10].
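A compact version of this update rule, using the default parameter values later adopted in Sect. 3 (the function name is ours; the weight itself then follows Eq. (15)):

```python
def update_mu(mu, rho, p1=0.01, p2=0.25, eta1=2.0, eta2=0.25):
    """Trust-region style update of the factor mu_k based on rho_k."""
    if rho < p1:        # insufficient decrease: strengthen regularization
        return eta1 * mu
    if rho > p2:        # high decrease: relax towards the plain AA step
        return eta2 * mu
    return mu           # rho in [p1, p2]: keep the factor unchanged

# Regularization weight per Eq. (15): lam_k = mu_k * norm(f_k0) ** 2
```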

Our setting of \(\lambda _k\) in Eq. (15) is inspired by [15] which relates the LM regularization weight to the residual norm. For our method, this setting ensures that the two target function terms in problem (7) are of comparable scales, so that the adjustment of the factor \(\mu _k\) is meaningful. This choice of \(\lambda _k\) and the update rule of \(\mu _k\) are quite standard in LM methods. However, the classical convergence analysis in [15] is not directly applicable here. In the LM method, the decrease of the residual can be predicted via its linearized model \(\Vert F(x^k)+F'(x^k)d\Vert ^2\). For \(\textsf{AA}\), the linearized residual \({\hat{f}}^k(\alpha ^k)\) is not a model for the update \(x^{k+1}={\hat{g}}^k(\alpha ^k)\) but for the point \({\hat{x}}^k(\alpha ^k)\) instead, where \({\hat{x}}^k(\alpha ^k)\) denotes the analogous affine combination of the iterates \(x^{k_i}\) (introduced explicitly in the proof of Theorem 2). Since a linearized residual of \({\hat{g}}^k\) is not readily available, we use an upper bound for such a linearization of \({\hat{g}}^k(\alpha ^k)\) which is exactly given by \(c\Vert {\hat{f}}^k(\alpha ^k)\Vert \). The whole method is summarized in Algorithm 1.

[Algorithm 1: \(\textsf{AA}\) with adaptive regularization and nonmonotone acceptance]

Unlike LM, which enforces a monotonic decrease of the target function, our acceptance strategy allows the residual for \(x^{k+1}\) to increase compared to the previous iterate \(x^k\). Therefore, our scheme can be considered as a nonmonotone trust-region approach and follows the procedure investigated in [47]. In the next subsections, we will see that this nonmonotone strategy allows us to establish unified global and local convergence results. In particular, besides global convergence guarantees, we can show the transition to fast local convergence, and an acceleration effect similar to that of the original \(\textsf{AA}\) scheme can be achieved.

The main computational overhead of our method lies in the optimization problem (7), which amounts to constructing and solving an \(m \times m\) linear system \(( J^T J + {\lambda _k} I ) \alpha ^k = -J^T f^{k_0}\) where \(J = [ f^{k_1} - f^{k_0}, f^{k_2} - f^{k_0}, \ldots , f^{k_m} - f^{k_0} ] \in \mathbb {R}^{n \times m}\). A naïve implementation that computes the matrix J from scratch in each iteration will result in \(O(m^2 n)\) time for setting up the linear system, whereas the system itself can be solved in \(O(m^3)\) time. Since we typically have \(m \ll n\), the linear system setup will become the dominant overhead. To reduce the overhead, we note that each entry of \(J^T J\) is a linear combination of inner products between \(f^{k_0}, \ldots , f^{k_m}\). If we pre-compute and store these inner products, then it only requires additional \(O(m^2)\) time to evaluate all entries. Moreover, the pre-computed inner products can be updated in O(mn) time in each iteration, so we only need O(mn) total time to evaluate \(J^T J\). Similarly, we can evaluate \(J^T f^{k_0}\) in O(m) time. In this way, the linear system setup only requires O(mn) time in each iteration. Moreover, as the parameter m is often a small value independent of n (and significantly smaller than n), the complexity O(mn) is effectively linear with respect to n and only incurs a small computational overhead.
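A sketch of this bookkeeping: maintain the \((m+1)\times (m+1)\) matrix G of inner products \(\langle f^{k_i}, f^{k_j}\rangle \) (one new row and column per iteration, at O(mn) cost) and assemble the normal equations from it in \(O(m^2)\) time (function and variable names are ours):

```python
import numpy as np

def assemble_normal_system(G, perm):
    """Build J^T J and J^T f^{k_0} from cached inner products.

    G[i, j] = <f^{a_i}, f^{a_j}> for the m+1 stored residuals; perm lists
    their storage positions with perm[0] corresponding to k_0."""
    i0, rest = perm[0], perm[1:]
    m = len(rest)
    JTJ, JTf = np.empty((m, m)), np.empty(m)
    for a in range(m):
        JTf[a] = G[rest[a], i0] - G[i0, i0]   # <f^{k_a}-f^{k_0}, f^{k_0}>
        for b in range(m):
            # <f^{k_a}-f^{k_0}, f^{k_b}-f^{k_0}> expanded via the cache.
            JTJ[a, b] = (G[rest[a], rest[b]] - G[rest[a], i0]
                         - G[i0, rest[b]] + G[i0, i0])
    return JTJ, JTf
```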

2.2 Global Convergence Analysis

We now present our main assumptions on g and f that allow us to establish global convergence of Algorithm 1. Our conditions are mainly based on a monotonicity property and on pointwise convergence of the iterated functions \(g^{[k]}: \mathbb {R}^n\rightarrow \mathbb {R}^n\) defined as

$$\begin{aligned} g^{[0]}:= I, \quad g^{[k]}:= \underbrace{g \circ g \circ \cdots \circ g}_{k~\text {times}} \end{aligned}$$

for \(k \in \mathbb {N}\).

Assumption 1

The functions g and f satisfy the following conditions:

  1. (A.1)

    \(\Vert f(g(x))\Vert \le \Vert f(x)\Vert \) for all \(x \in \mathbb {R}^n\).

  2. (A.2)

    \(\lim \limits _{k\rightarrow \infty } \Vert f(g^{[k]}(x))\Vert = \nu \) for all \(x \in \mathbb {R}^n\), where \(\nu =\inf _{x\in \mathbb {R}^n}\Vert f(x)\Vert \).

It is easy to see that Assumption 1 holds for any contractive function with \(\nu = 0\). In particular, if g satisfies (1), we obtain

$$\begin{aligned} \Vert f(g^{[k]}(x))\Vert&= \Vert g(g^{[k]}(x)) - g(g^{[k-1]}(x)) \Vert \nonumber \\&\le \kappa \Vert f(g^{[k-1]}(x))\Vert \le \ldots \le \kappa ^{k} \Vert f(x)\Vert \rightarrow 0 \end{aligned}$$
(16)

as \(k \rightarrow \infty \). In the following, we will verify that Assumption 1 also holds for \(\rho \)-averaged operators which define a broader class of mappings than contractions.

Proposition 1

Let \(g:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be a \(\rho \)-averaged operator with \(\rho \in (0,1)\). Then g satisfies Assumption 1.

Proof

By definition, the \(\rho \)-averaged operator g is also nonexpansive and thus, (A.1) holds for g. To prove (A.2), let us set \(y^0:= x\) and \(y^{k+1}:= g^{[k+1]}(x) = g(y^{k})\) for all k. By (A.1), the sequence \(\{\Vert f(y^{k})\Vert \}_k\) is monotonically decreasing and bounded below by \(\nu \). Therefore, the limit \(\vartheta := \lim _{k\rightarrow \infty }\Vert f(y^{k})\Vert \) exists. If \(\vartheta >\nu \), then we may select \(x^0\in \mathbb {R}^n\) such that \(\Vert f(x^0)\Vert <\nu +\frac{1}{2}(\vartheta -\nu )\). Defining \(x^{k+1}=g^{[k+1]}(x^0)=g(x^k)\) and applying [4, Proposition 4.25(iii)], we have

$$\begin{aligned} \Vert x^{k+1}-y^{k+1}\Vert ^2\le \Vert x^k-y^k\Vert ^2 -\frac{1-\rho }{\rho }\Vert f(x^k)-f(y^k)\Vert ^2. \end{aligned}$$

This yields

$$\begin{aligned} \Vert x^{k+1}-y^{k+1}\Vert ^2\le \Vert x^0-y^0\Vert ^2 -\frac{1-\rho }{\rho }\sum _{i=0}^k\Vert f(x^i)-f(y^i)\Vert ^2. \end{aligned}$$

By the reverse triangle inequality and (A.1), we have

$$\begin{aligned} \Vert f(x^i)-f(y^i)\Vert \ge \Vert f(y^i)\Vert -\Vert f(x^i)\Vert \ge \vartheta -\Vert f(x^0)\Vert \ge \frac{1}{2}(\vartheta -\nu ). \end{aligned}$$

Combining with the previous inequality, we obtain

$$\begin{aligned} \Vert x^{k+1}-y^{k+1}\Vert ^2\le \Vert x^0-y^0\Vert ^2 -\frac{1-\rho }{4\rho }(k+1)(\vartheta -\nu )^2. \end{aligned}$$

Taking the limit \(k\rightarrow \infty \), the right-hand side tends to \(-\infty \) while the left-hand side is nonnegative, which is a contradiction. So, we must have \(\nu =\vartheta \), as desired. \(\square \)

Remark 1

Setting \(\kappa \) (the Lipschitz constant of g) to 1 in (16), we see that (A.1) is always satisfied if g is a nonexpansive operator. However, nonexpansiveness is not a necessary condition for (A.1). In fact, we can construct an operator g that is not nonexpansive but satisfies (A.1) and (A.2), e.g.,

$$\begin{aligned} g: \mathbb {R}\rightarrow \mathbb {R}, \quad g(x):=\left\{ \begin{array}{ll} 0.5 x &{} \text {if }x\in [0,1], \\ 0 &{} \text {otherwise.} \end{array} \right. \end{aligned}$$

For any \(x\in [0,1]\), we have \(g^{[k]}(x) = 2^{-k}x\), \(f(g^{[k]}(x)) = -2^{-(k+1)}x\) and it is not hard to verify (A.1) and (A.2). For any \(x\notin [0,1]\) it follows \(f(g(x))=f(0)=0\), thus (A.1) and (A.2) also hold in this situation. However, since g is not continuous, it cannot be nonexpansive.
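This example can also be verified numerically; a quick illustrative check:

```python
def g(x):
    return 0.5 * x if 0.0 <= x <= 1.0 else 0.0

def f(x):
    return g(x) - x

# (A.1): |f(g(x))| <= |f(x)| on a grid over [-2, 2].
xs = [i / 100 - 2.0 for i in range(401)]
assert all(abs(f(g(x))) <= abs(f(x)) for x in xs)

# (A.2): the residual along the iteration tends to nu = 0.
y = 0.7
for _ in range(50):
    y = g(y)
print(abs(f(y)))  # ~ 0.7 * 2^{-51}
```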

Because of Proposition 1, our global convergence theory is applicable to a large class of iterative schemes. As an example, we show in the following that Assumption 1 is satisfied by forward-backward splitting, a popular optimization solver in machine learning.

Example 1

Let us consider the nonsmooth optimization problem:

$$\begin{aligned} \min _{x \in \mathbb {R}^n}~r(x)+\varphi (x), \end{aligned}$$
(17)

where both \(r, \varphi : \mathbb {R}^n\rightarrow (-\infty ,\infty ]\) are proper, closed, and convex functions, and r is L-smooth. It is well known that \(x^*\) is a solution to this problem if and only if it satisfies the nonsmooth equation:

$$\begin{aligned} x^* - G_\mu (x^*) = 0, \quad G_\mu (x):= \textrm{prox}_{\mu \varphi }(x-\mu \nabla r(x)), \end{aligned}$$

where \(\textrm{prox}_{\mu \varphi }(x):= \mathop {\textrm{argmin}}\limits _y \varphi (y) + \frac{1}{2\mu } \Vert x-y\Vert ^2\), \(\mu > 0\), is the proximity operator of \(\varphi \), see also Corollary 26.3 of [4]. We can then compute \(x^*\) via the iterative scheme

$$\begin{aligned} x^{k+1} = G_\mu (x^k). \end{aligned}$$
(18)

\(G_\mu \) is known as the forward-backward splitting operator and it is a \(\rho \)-averaged operator for all \(\mu \in (0,\frac{2}{L})\), see [8]. Hence, Assumption 1 holds and our theory can be used to study the global convergence of Algorithm 1 applied to (18).
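As a concrete instance, the following sketch builds \(G_\mu \) for a toy lasso problem, i.e., \(r(x) = \frac{1}{2}\Vert Ax-b\Vert ^2\) and \(\varphi = \tau \Vert \cdot \Vert _1\), whose proximity operator is the soft-thresholding map (the data and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 60))
b = rng.standard_normal(40)
tau = 0.1
L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad r
mu = 1.0 / L                    # any mu in (0, 2/L) is admissible

def G(x):
    """Forward-backward operator G_mu: gradient step on r, then prox of mu*tau*||.||_1."""
    y = x - mu * (A.T @ (A @ x - b))                           # forward step
    return np.sign(y) * np.maximum(np.abs(y) - mu * tau, 0.0)  # backward step

x = np.zeros(60)
for _ in range(200):
    x = G(x)                        # plain iteration (18); AA would accelerate this
print(np.linalg.norm(G(x) - x))     # residual norm ||f(x)||
```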

Remark 2

For problem (17), it can be shown that Douglas–Rachford splitting, as well as its equivalent form of ADMM, can both be written as a \(\rho \)-averaged operator with \(\rho \in (0,1)\) (see, e.g., [22]). Therefore, the applications considered in [17] are also covered by Assumption 1.

We can now show the global convergence of Algorithm 1:

Theorem 1

Suppose Assumption 1 is satisfied and let \(\{x^k\}\) be generated by Algorithm 1 with \(\epsilon _f = 0\). Then

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert f^k\Vert =\nu , \end{aligned}$$

where \(\nu =\inf _{x\in \mathbb {R}^n}\Vert f(x)\Vert \).

Proof

In the following, we will use \({\mathcal {S}}\) to denote the set of indices for all successful iterations, i.e., \({\mathcal {S}}:= \{k: \rho _k \ge p_1\}\). To simplify the notation, we introduce a function \({\mathcal {P}}:\mathbb {N}\rightarrow \mathbb {N}\) defined as

$$\begin{aligned} {\mathcal {P}}(k):=\max \left\{ j \mid j \in {\mathop {\textrm{argmin}}\limits }_{i\in \{k,k-1,\ldots ,k-{{m}} \}}\Vert f^i\Vert \right\} . \end{aligned}$$

Notice that the number \({\mathcal {P}}(k)\) coincides with \(k_0\) for fixed k.

If Algorithm 1 terminates after a finite number of steps, the conclusion simply follows from the stopping criterion. Therefore, in the following, we assume that a sequence of iterates of infinite length is generated. We consider two different cases:

  1. Case 1:

    \(|\mathcal {S}|<\infty \). Let \(\bar{k}\) denote the index of the last successful iteration in \(\mathcal {S}\) (we set \({\bar{k}} = 0\) if \(\mathcal {S}=\emptyset \)). We first show that \(\mathcal {P}(k)=k\) for all \(k\ge {\bar{k}}+1\). Due to \({\bar{k}}+1\notin \mathcal {S}\), it follows \(x^{{\bar{k}}+1}=g(x^{{\mathcal {P}}({\bar{k}})})\) and by (A.1), this implies \(\Vert f(x^{{\bar{k}}+1})\Vert \le \Vert f(x^{{\mathcal {P}}({\bar{k}})})\Vert \). From the definition of \({\mathcal {P}}\), we have \(\Vert f(x^{{\mathcal {P}}({\bar{k}})})\Vert \le \Vert f^{{\bar{k}}-i}\Vert \) for every \(0\le i\le \min \{m,{\bar{k}}\}\) and hence \({\mathcal {P}}({\bar{k}}+1)={\bar{k}}+1\). An inductive argument then yields \(\mathcal {P}(k)=k\) for all \(k\ge {\bar{k}} +1\). Notice that for any \(k\ge {\bar{k}}+1\), we have \(k\notin {\mathcal {S}}\) and \(x^{k+1}=g(x^{{\mathcal {P}}(k)})=g(x^k)\). Utilizing (A.2), it follows that \(\Vert f^k\Vert = \Vert f(g^{[k-{\bar{k}}]}(x^{{\mathcal {P}}({\bar{k}})}))\Vert \rightarrow \nu \) as \(k \rightarrow \infty \).

  2. Case 2:

    \(|\mathcal {S}|=\infty \). Let us denote

    $$\begin{aligned} W_k:=\max _{k-m\le i\le k}\Vert f^i\Vert . \end{aligned}$$

    We first show that the sequence \(\{W_k\}\) is non-increasing.

    • If \(k\in \mathcal {S}\), then we have:

      $$\begin{aligned}&p_1 \le \rho _{k} = \frac{{\textrm{ared}_{k}}}{ {\textrm{pred}_{k}}}. \end{aligned}$$

      We already know from “Appendix A” that \({\textrm{pred}_{k}} > 0\). Since \(p_1 > 0\), it also holds that \({\textrm{ared}_{k}} >0\). Hence, if \(c \Vert {\hat{f}}^k(\alpha ^k)\Vert \le \Vert f^{k+1}\Vert \), then using \({r_{k}} \le W_k\) and (41) from “Appendix A”, we can derive:

      $$\begin{aligned} p_1&\le \frac{{\textrm{ared}_{k}}}{ {\textrm{pred}_{k}}} = 1 + \frac{c \Vert {\hat{f}}^k(\alpha ^k)\Vert - \Vert f^{k+1}\Vert }{{\textrm{pred}_{k}}}= 1 + \frac{c \Vert {\hat{f}}^k(\alpha ^k)\Vert - \Vert f^{k+1}\Vert }{r_k-c\Vert {\hat{f}}^k(\alpha ^k)\Vert } \\&\le 1 + \frac{c \Vert {\hat{f}}^k(\alpha ^k)\Vert - \Vert f^{k+1}\Vert }{W_k-c\Vert {\hat{f}}^k(\alpha ^k)\Vert }=\frac{W_k-\Vert f^{k+1}\Vert }{W_k-c\Vert {\hat{f}}^k(\alpha ^k)\Vert } \le \frac{W_k-\Vert f^{k+1}\Vert }{(1-c)W_k}, \end{aligned}$$

      which implies

      $$\begin{aligned} \Vert f^{k+1}\Vert \le c_p W_k, \end{aligned}$$
      (19)

      where \(c_p:= 1-(1-c)p_1 < 1\). Otherwise, if \(c \Vert {\hat{f}}^k(\alpha ^k)\Vert > \Vert f^{k+1}\Vert \), then we have

      $$\begin{aligned} \Vert f^{k+1}\Vert \le c W_k \le c_p W_k. \end{aligned}$$
      (20)
    • If \(k \notin {\mathcal {S}}\), we have \(x^{k+1}=g^{{\mathcal {P}}(k)}\). By Assumption (A.1), it then follows that

      $$\begin{aligned} \Vert f^{k+1}\Vert \le \Vert f^{{\mathcal {P}}(k)}\Vert \le W_{k}. \end{aligned}$$
      (21)

    Eqs. (19), (20) and (21) show that \(\Vert f^{k+1}\Vert \le W_{k}\). By definition of \(W_{k}\), we then have \(W_{k+1} \le \max \{\Vert f^{k+1}\Vert , W_k\} = W_k\). This shows that the sequence \(\{W_k\}\) is non-increasing. Next, we verify

    $$\begin{aligned} W_{k+m+1}\le c_p W_k \end{aligned}$$

    for all \(k \in {\mathcal {S}}\). It suffices to prove that for any i satisfying \(k+1\le i\le k+m+1\), we have \(\Vert f^i\Vert \le c_p W_{k}\). Since we consider a successful iteration \(k\in \mathcal {S}\), our previous discussion has already shown that \(\Vert f^{k+1}\Vert \le c_p W_k\). We now assume \(\Vert f^i\Vert \le c_pW_k\) for some \(k+1\le i\le k+m\). If \(i\in {\mathcal {S}}\), we obtain \(\Vert f^{i+1}\Vert \le c_p W_i \le c_p W_k\). Otherwise, it follows that \(\Vert f^{i+1}\Vert \le \Vert f^{{\mathcal {P}}(i)}\Vert \le \Vert f^i\Vert \le c_p W_k\). Hence, by induction, we have \(W_{k+m+1}\le c_p W_k\) for all \(k \in {\mathcal {S}}\). Since \(\{W_k\}\) is non-increasing and we assumed \(|{\mathcal {S}}|=\infty \), this establishes \(W_k\rightarrow 0\) and \(\Vert f^k\Vert \rightarrow 0\). In this case, we can infer \(\nu =0\) and the proof is complete.\(\square \)

Remark 3

This global result does not depend on the specific update rule for the regularization weight \(\lambda _k\). Indeed, global convergence mainly results from our acceptance mechanism and hence, as a consequence of our proof, different update strategies for \(\lambda _k\) can also be applied. Our choice of \(\lambda _k\) in (15), however, will be essential for establishing local convergence of the method.

2.3 Local Convergence Analysis

Next, we analyze the local convergence of our proposed approach, starting with several assumptions.

Assumption 2

The function \(g: \mathbb {R}^n\rightarrow \mathbb {R}^n\) satisfies the following conditions:

  1. (B.1)

    g is Lipschitz continuous with a constant \(\kappa <1\).

  2. (B.2)

    g is differentiable at \(x^*\) where \(x^*\) is the fixed point of the mapping g.

Remark 4

(B.1) is a standard assumption widely used in the local convergence analysis of \(\textsf{AA}\) [13, 39, 40, 46]. The existing analyses typically rely on the smoothness of g. In contrast, (B.2) allows g to be nonsmooth and only requires it to be differentiable at the fixed point \(x^*\), allowing our analysis to cover a wider variety of methodologies such as forward-backward splitting and Douglas–Rachford splitting under appropriate conditions, see “Appendix B”. We note that in [6] the Lipschitz differentiability of g is replaced by continuous differentiability around \(x^*\), while we only assume differentiability at one point. This technical difference is based on the observation that an expansion of the residual \(f^k\) is only required at the point \(x^*\) and not at the iterates \(x^k\), which allows us to work with weaker differentiability requirements. We further note that \(\textsf{AA}\) has been investigated for nonsmooth g in [17, 53] but without local convergence analysis. Recent convergence results of \(\textsf{AA}\) for a scheme related to the proximal gradient method discussed in Example 1 can also be found in [24]. While the local assumptions and derived convergence rates in [24] are somewhat similar to our local results, we want to highlight that the algorithm and analysis in [24] are tailored to convex composite problems of the form (17). Moreover, the global results in [24] are shown for a second, guarded version of \(\textsf{AA}\) and are based on the strong convexity of the problem. In contrast and under conditions that are not stronger than the local assumptions in [24], we will establish unified global–local convergence of our approach for general contractions. In Sect. 3, we verify the conditions (B.1) and (B.2) on the numerical examples, with a more detailed discussion in “Appendix B”.

Remark 5

(B.1) implies that the function g is contractive, which is a sufficient condition for (A.1) and (A.2). Thus, a function g satisfying Assumption 2 will also fulfill Assumption 1 with \(\nu = 0\).

Similar to the local convergence analyses in [13, 46], we also work with the following condition:

Assumption 3

For the solution \(\bar{\alpha }^k\) to the unregularized \(\textsf{AA}\) problem (5), there exists \(M > 0\) such that \(\Vert \bar{\alpha }^k\Vert _{\infty }\le M\) for all k sufficiently large.

Remark 6

The assumptions given in [13, 46] are formulated without permuting the last \(m+1\) indices. We further note that we do not require the solution \({\bar{\alpha }}^k\) to be unique.

The acceleration effect of the original \(\textsf{AA}\) scheme has only been studied very recently in [13] based on slightly stronger assumptions. In particular, their result can be stated as

$$\begin{aligned} \Vert f(\hat{g}^k(\bar{\alpha }^k))\Vert \le \kappa \theta _k\Vert f^{k_0}\Vert +\sum \nolimits _{i=0}^mO(\Vert f^{k-i}\Vert ^2), \end{aligned}$$
(22)

where

$$\begin{aligned} \theta _k:={\Vert \hat{f}^k(\bar{\alpha }^k)\Vert }/{\Vert f^{k_0}\Vert } \end{aligned}$$
(23)

is an acceleration factor. Since \(\bar{\alpha }^k\) is a solution to the problem (5), we have \(\Vert \hat{f}^k(\bar{\alpha }^k)\Vert \le \Vert \hat{f}^k(0)\Vert = \Vert f^{k_0}\Vert \) so that \(\theta _k \in [0,1]\). Then (22) implies that for a fixed-point iteration that converges linearly with a contraction constant \(\kappa \), \(\textsf{AA}\) can improve the convergence rate locally. In the following, we will show that our globalized \(\textsf{AA}\) method possesses similar characteristics under weaker assumptions.

We first verify that after finitely many iterations, every step \(x^{k+1} = {\hat{g}}^k(\alpha ^k)\) is accepted as a new iterate. Thus, our method eventually reduces to a pure regularized \(\textsf{AA}\) scheme.

Theorem 2

Suppose that Assumptions 2 and 3 hold and let the constant c in (11) be chosen such that \(c\ge \kappa \). Then, the sequence \(\{x^k\}\) generated by Algorithm 1 (with \(\epsilon _f = 0\)) either terminates after finitely many steps, or converges to the fixed point \(x^*\) and there exists some \(\ell \in \mathbb {N}\) such that \(\rho _k\ge p_2\) for all \(k\ge \ell \). In particular, every iteration \(k \ge \ell \) is successful with \(x^{k+1}=\hat{g}^k(\alpha ^k)\).

Proof

Our proof consists of three steps. We first show the convergence of the whole sequence \(\{x^k\}\) to the fixed point \(x^*\). Afterwards we derive a bound for the residual \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \) that can be used to estimate the actual reduction \({\textrm{ared}_{k}}\). In the third step, we combine our observations to prove the transition to the full \(\textsf{AA}\) method, i.e., we show that there exists some \(\ell \) with \(k\in \mathcal {S}\) for all \(k\ge \ell \).

Step 1::

Convergence of \(\{x^k\}\). By (B.1), g is a contraction, i.e., for all \(x \in \mathbb {R}^n\) we have

$$\begin{aligned} \Vert x - x^*\Vert \le \Vert x-g(x)\Vert + \Vert g(x)-g(x^*)\Vert \le \Vert f(x)\Vert + \kappa \Vert x-x^*\Vert \end{aligned}$$
(24)

and it follows \(\Vert f^k\Vert = \Vert f(x^k)\Vert \ge (1-\kappa )\Vert x^k-x^*\Vert \) for all k. Theorem 1 and Remark 5 guarantee \(\lim _{k\rightarrow \infty }\Vert f^k\Vert =0\) and hence, we can infer \(x^k \rightarrow x^*\).

Step 2::

Bounding \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \). Introducing

$$\begin{aligned} \hat{x}^k(\alpha ^k):= x^{k_0} + {\sum }_{i=1}^m \alpha _i^k (x^{k_i} - x^{k_0}) \end{aligned}$$

and using (B.1), we can bound the residual \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \) directly as follows:

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert&= \Vert g({\hat{g}}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \\&\le \Vert g({\hat{g}}^k(\alpha ^k))-g(\hat{x}^k(\alpha ^k))\Vert +\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \\&\le \kappa \Vert {\hat{g}}^k(\alpha ^k) - \hat{x}^k(\alpha ^k)\Vert +\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \\&= \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert +\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert . \end{aligned}$$

We now continue to estimate the second term \(\Vert g(\hat{x}^k(\alpha ^k))-{\hat{g}}^k(\alpha ^k)\Vert \). From the algorithmic construction and the definition of \(\alpha ^k\) and \({\bar{\alpha }}^k\), it follows that

$$\begin{aligned} \Vert {\hat{f}}^k(\alpha ^k)\Vert ^2 + \lambda _k \Vert \alpha ^k\Vert ^2 \le \Vert {\hat{f}}^k({\bar{\alpha }}^k)\Vert ^2 + \lambda _k \Vert {\bar{\alpha }}^k\Vert ^2 \le \Vert {\hat{f}}^k(\alpha ^k)\Vert ^2 + \lambda _k \Vert {\bar{\alpha }}^k\Vert ^2, \end{aligned}$$
(25)

which implies \(\Vert \alpha ^k\Vert _{\infty }\le \Vert \alpha ^k\Vert \le \Vert \bar{\alpha }^k\Vert \le \sqrt{m}\Vert \bar{\alpha }^k\Vert _\infty \le \sqrt{m}M\) for all k. Defining \(\nu ^k=(\nu ^k_0,\dots ,\nu ^k_m)\in \mathbb {R}^{m+1}\) with \(\nu ^k_0=1-\sum _{i=1}^m\alpha ^k_i\) and \(\nu ^k_j = \alpha ^k_j\) for \(1\le j \le m\), we obtain

$$\begin{aligned} {\hat{g}}^k(\alpha )&= {\sum }_{i=0}^m\nu _i^k g(x^{k_i}), \quad {\hat{x}}^k(\alpha ) = {\sum }_{i=0}^m\nu _i^k x^{k_i}, \end{aligned}$$

\(\Vert \nu ^k\Vert _\infty \le 1+m^{\frac{3}{2}}M\), and \(\sum _{i=0}^m \nu ^k_i = 1\). Consequently, applying the estimate (24) derived in step 1, it follows

$$\begin{aligned} \Vert \hat{x}^k(\alpha ^k)-x^*\Vert = \left\| {\sum }_{i=0}^m \nu _i^k (x^{k_i}-x^*) \right\|&\le (1+ m^{\frac{3}{2}}M) {\sum }_{i=0}^m\Vert x^{k_i}-x^*\Vert \\ {}&\le (1+ m^{\frac{3}{2}}M)(1-\kappa )^{-1} {\sum }_{i=0}^m\Vert f^{k-i}\Vert \end{aligned}$$

which shows \({\hat{x}}^k(\alpha ^k) \rightarrow x^*\) as \(k \rightarrow \infty \). This also establishes

$$\begin{aligned} o(\Vert \hat{x}^k(\alpha ^k)-x^*\Vert ) = o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \quad k \rightarrow \infty . \end{aligned}$$
(26)

Note that the differentiability of g at \(x^*\)—as stated in Assumption (B.2)—implies \(\Vert g(y) - g(x^*) - g^\prime (x^*)(y-x^*)\Vert = o(\Vert y-x^*\Vert )\) as \(y \rightarrow x^*\). Applying this condition to different choices of y and the boundedness of \(\nu ^k\), we can obtain

$$\begin{aligned}& \Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k) \Vert \nonumber \\ = ~&\Vert g(\hat{x}^k(\alpha ^k))-g(x^*) + g(x^*) - {\hat{g}}^k(\alpha ^k)\Vert \nonumber \\ \le ~&\Vert g'(x^*)({\hat{x}}^k(\alpha ^k)-x^*) + g(x^*) - {\hat{g}}^k(\alpha ^k) \Vert +o(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ) \nonumber \\ = ~&\left\| {\sum }_{i=0}^m \nu ^k_i [g'(x^*)(x^{k_i}-x^*) + g(x^*) - g(x^{k_i})] \right\| +o(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ) \nonumber \\ \le ~&{\sum }_{i=0}^m o(\Vert x^{k_i}-x^*\Vert ) +o(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ) \le o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) . \end{aligned}$$
(27)

Here, we also used (24), (26), and \(\sum _{i=0}^m o(\Vert f^{k-i}\Vert ) = o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \). Combining our results, this yields

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert \le \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert + o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \quad k \rightarrow \infty . \end{aligned}$$
(28)
Step 3::

Transition to fast local convergence. As in the proof of Theorem 1, let us introduce

$$\begin{aligned} W_k:= \max _{k-m\le i\le k}\Vert f^i\Vert . \end{aligned}$$

Due to (28) and \(o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) = o(W_k)\), there exists \(\ell \in \mathbb {N}\) such that

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert \le \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert + (1-p_2)\min \{\gamma ,1-c\}W_k \end{aligned}$$

for all \(k \ge \ell \). Hence, using \(c\ge \kappa \), we have

$$\begin{aligned} {\textrm{ared}_{k}}&= {r_{k}} - \Vert f(\hat{g}^k(\alpha ^k))\Vert \ge {\textrm{pred}_{k}}- (1-p_2)\min \{\gamma ,1-c\}W_k. \end{aligned}$$

Similarly, for the predicted reduction \({\textrm{pred}_{k}}\) we can show

$$\begin{aligned} {\textrm{pred}_{k}}&= (1-\gamma m) \Vert f^{k_0}\Vert + \gamma {\sum }_{i=1}^{m} \Vert f^{k_i}\Vert -c\Vert \hat{f}^k(\alpha ^k)\Vert \\&\ge \gamma {\sum }_{i=0}^{m} \Vert f^{k_i}\Vert + (1-\gamma (m+1)) \Vert f^{k_0}\Vert - c \Vert f^{k_0}\Vert \\&\ge \gamma W_k+(1-\gamma )\Vert f^{k_0}\Vert -c\Vert f^{k_0}\Vert . \end{aligned}$$

Thus, if \(1-\gamma -c \ge 0\), we obtain \({\textrm{pred}_{k}} \ge \gamma W_k\). Otherwise, it follows \({\textrm{pred}_{k}} \ge (1-c) W_k\) and together this yields \({\textrm{pred}_{k}} \ge \min \{\gamma ,1-c\}W_k\). Combining the last estimates, we can finally deduce

$$\begin{aligned} \frac{{\textrm{ared}_{k}}}{ {\textrm{pred}_{k}}}\ge \frac{{\textrm{pred}_{k}}- (1-p_2)\min \{\gamma ,1-c\}W_k}{{\textrm{pred}_{k}}} \ge p_2, \end{aligned}$$

which completes the proof.\(\square \)

Remark 7

Our novel nonmonotone acceptance mechanism is the central component of our proof for Theorem 2, as it allows us to balance the additional error terms caused by an \(\textsf{AA}\) step.

Next, we show that our approach can enhance the convergence of the underlying fixed-point iteration and that it has a local convergence rate similar to the original \(\textsf{AA}\) method as given in [13].

Theorem 3

Suppose that Assumptions 2 and 3 hold and let the parameters \(c, \epsilon _{f}\) in Algorithm 1 satisfy \(c\ge \kappa \) and \(\epsilon _{f}=0\). Then, for \(k \rightarrow \infty \) it holds that:

$$\begin{aligned} \Vert f^{k+1}\Vert \le \kappa \theta _k\Vert f^{k_0}\Vert +o\left( \sum \nolimits _{i=0}^{m}\Vert f^{k-i}\Vert \right) , \end{aligned}$$

where \(\theta _k:=\Vert \hat{f}^k(\bar{\alpha }^k)\Vert /\Vert f^{k_0}\Vert \) is the corresponding acceleration factor. In addition, the sequence of residuals \(\{\Vert f^k\Vert \}\) converges r-linearly to zero with a rate arbitrarily close to \(\kappa \), i.e., for every \(\eta \in (\kappa ,1)\) there exist \(C > 0\) and \({\hat{\ell }} \in \mathbb {N}\) such that

$$\begin{aligned} \Vert f^k\Vert \le C \eta ^k \quad \forall ~k \ge {\hat{\ell }}. \end{aligned}$$

Proof

Theorem 2 implies \(\rho _k\ge p_2\) for all \(k \ge \ell \) and hence, from the update rule of Algorithm 1, it follows that

$$\mu _k=\eta _2\mu _{k-1} \quad \forall ~k > \ell .$$

Then by (15), we can infer \(\lambda _k=o(\Vert f^{k_0}\Vert ^2)\). Using Eq. (25) and Assumption 3, this shows

$$\begin{aligned} \Vert \hat{f}^k(\alpha ^k)\Vert \le \Vert \hat{f}^k(\bar{\alpha }^k)\Vert +\sqrt{\lambda _k }\Vert \bar{\alpha }^k\Vert = \Vert \hat{f}^k(\bar{\alpha }^k)\Vert +o(\Vert f^{k_0}\Vert ). \end{aligned}$$
(29)

Thus, by (28), we obtain

$$\begin{aligned} \Vert f^{k+1}\Vert \le ~&\kappa \Vert \hat{f}^{k}(\alpha ^k)\Vert + o\left( {\sum }_{i=0}^m \Vert f^{k-i}\Vert \right) \nonumber \\ \le ~&\kappa \Vert \hat{f}^{k}(\bar{\alpha }^k)\Vert +o(\Vert f^{k_0}\Vert )+o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) \nonumber \\ =~&\kappa \theta _k \Vert f^{k_0}\Vert + o\left( {\sum }_{i=0}^m\Vert f^{k-i}\Vert \right) , \end{aligned}$$
(30)

as desired. In order to establish r-linear convergence, we follow the strategy presented in [46]. Let \(\eta \in (\kappa ,1)\) be a given rate. Then, due to \(\Vert f^k\Vert \rightarrow 0\) and using (30), there exists \({\hat{\ell }} \in \mathbb {N}\) such that

$$\begin{aligned} \Vert f^{k+1}\Vert \le \kappa \Vert {f}^{k_0}\Vert + {\bar{\nu }} \cdot {\sum }_{i=0}^m \Vert f^{k-i}\Vert \end{aligned}$$
(31)

for all \(k \ge {\hat{\ell }}\) where \({\bar{\nu }}:= \frac{1-\eta }{1-\eta ^{m+1}} \eta ^m (\eta -\kappa )\). Defining \(C:= \eta ^{-{\hat{\ell }}} \max _{{\hat{\ell }}-m\le i \le {\hat{\ell }}} \Vert f^i\Vert = W_{{\hat{\ell }}}\,\eta ^{-{\hat{\ell }}}\), we then have

$$\begin{aligned} \Vert f^j\Vert \le W_{{\hat{\ell }}} = (W_{{\hat{\ell }}}\,\eta ^{-j}) \eta ^j \le (W_{{\hat{\ell }}}\,\eta ^{-{\hat{\ell }}}) \eta ^j = C \eta ^j. \end{aligned}$$

for all \({\hat{\ell }}-m \le j \le {\hat{\ell }}\). We now claim that the statement \(\Vert f^{k}\Vert \le C \eta ^k\) holds for all \(k \ge {\hat{\ell }}\). As just shown, this is obviously satisfied for the base case \(k = {\hat{\ell }}\). As part of the inductive step, let us assume that the estimate \(\Vert f^{j}\Vert \le C \eta ^j\) holds for all \(j = {\hat{\ell }}, {\hat{\ell }}+1,\ldots ,k\). (In fact, this bound also holds for \(j = {\hat{\ell }}-m,\ldots ,{\hat{\ell }}-1\)). By the definition of the index \(k_0\), we have \(\Vert f^{k_0}\Vert \le C\eta ^k\) and, due to (31), it follows

$$\begin{aligned} \Vert f^{k+1}\Vert&\le \kappa \Vert {f}^{k_0}\Vert + {\bar{\nu }} \cdot {\sum }_{i=0}^m \Vert f^{k-i}\Vert \le C\kappa \eta ^k + C{\bar{\nu }} \eta ^k {\sum }_{i=0}^m \left( \frac{1}{\eta }\right) ^i \\&= C\eta ^k \left[ \kappa + {\bar{\nu }} \cdot \frac{1-\eta ^{-(m+1)}}{1-\eta ^{-1}} \right] = C\eta ^k \left[ \kappa + \frac{{\bar{\nu }}}{\eta ^m} \cdot \frac{1-\eta ^{m+1}}{1-\eta } \right] = C\eta ^{k+1}. \end{aligned}$$

Hence, our claim also holds for \(k+1\) which finishes the induction and proof. \(\square \)

Under a stronger differentiability condition and a stricter update rule for \(\lambda _k\), we can recover the same local rate as in [13]:

Corollary 1

Let the assumptions stated in Theorem 3 hold and let g satisfy the differentiability condition

$$\begin{aligned} \Vert g(x) - g(x^*) - g^\prime (x^*)(x-x^*)\Vert = O(\Vert x-x^*\Vert ^2) \;\; \text {as} \;\; x \rightarrow x^*. \end{aligned}$$

Suppose that the weight \(\lambda _k\) is updated via \(\lambda _k = \mu _k \Vert f^{k_0}\Vert ^4\). Then, for all k sufficiently large we have

$$\begin{aligned} \Vert f^{k+1}\Vert \le \kappa \theta _k\Vert f^{k_0}\Vert + \sum \nolimits _{i=0}^{m} O(\Vert f^{k-i}\Vert ^2). \end{aligned}$$

Proof

As mentioned in Remark 3, our global results still hold if a different update strategy is used for the weight parameter \(\lambda _k\). Moreover, the proof of Theorem 2 also does not depend on the specific choice of \(\lambda _k\). Consequently, we only need to improve the bound (28) for \(\Vert f({\hat{g}}^k(\alpha ^k))\Vert \) derived in step 2 of the proof of Theorem 2. Using the additional differentiability property \(\Vert g(y) - g(x^*) - g^\prime (x^*)(y-x^*)\Vert = O(\Vert y-x^*\Vert ^2)\), \(y \rightarrow x^*\), we can directly improve the estimate for \(\Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k)\Vert \) in (27) as follows:

$$\begin{aligned} \Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k)\Vert \le {\sum }_{i=0}^m O(\Vert x^{k_i}-x^*\Vert ^2) + O(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ^2). \end{aligned}$$

Using the bound \((\sum _{i=0}^m y_i)^2 \le (m+1) \sum _{i=0}^m y_i^2\) for \(y \in \mathbb {R}^{m+1}\), we obtain \(\Vert {\hat{x}}^k(\alpha ^k)-x^*\Vert ^2 = \sum _{i=0}^m O(\Vert f^{k-i}\Vert ^2)\) and thus, mimicking and combining the derivations in step 2 of the proof of Theorem 2, we have

$$\begin{aligned} \Vert g({\hat{x}}^k(\alpha ^k)) - {\hat{g}}^k(\alpha ^k)\Vert \le {\sum }_{i=0}^m O(\Vert f^{k-i}\Vert ^2) \end{aligned}$$

and

$$\begin{aligned} \Vert f({\hat{g}}^k(\alpha ^k))\Vert \le \kappa \Vert {\hat{f}}^k(\alpha ^k)\Vert + {\sum }_{i=0}^m O(\Vert f^{k-i}\Vert ^2) \end{aligned}$$
(32)

as \(k \rightarrow \infty \). As in the previous proof, we can now infer \(\mu _k \rightarrow 0\) (this follows from \(\rho _k \ge p_2\) for all k sufficiently large) and \(\lambda _k = o(\Vert f^{k_0}\Vert ^4)\). Furthermore, as in (29), due to Eq. (25) and Assumption 3, it holds that \(\Vert {\hat{f}}^k(\alpha ^k)\Vert \le \Vert {\hat{f}}^k({\bar{\alpha }}^k)\Vert + o(\Vert f^{k_0}\Vert ^2)\). Combining this result with (32), we can then establish the convergence rate stated in Corollary 1. \(\square \)

Remark 8

The stronger differentiability condition, which was also used in [13] and other local analyses, is, e.g., satisfied when the derivative \(g^\prime \) is locally Lipschitz continuous around \(x^*\). More discussions of this property can also be found in “Appendix B”. We note that under this type of stronger differentiability, we can only improve the order of the remainder linearization error terms and not the linear rate of convergence.

3 Numerical Experiments

We verify the effectiveness of our method by applying it to several existing numerical solvers and comparing its convergence speed with the original solvers. We also include the acceleration approaches from [17, 40] for comparison. The regularized nonlinear acceleration (RNA) proposed in [40] computes an accelerated iterate via an affine combination of the previous k iterates, and it also introduces a quadratic regularization when computing the affine combination coefficients. Unlike our approach, it performs an acceleration step every k iterations instead of every iteration, and its regularization weight is determined by a grid search that finds the weight that leads to the lowest target function value at the accelerated iterate. The A2DR scheme proposed in [17] is a globalization of \(\textsf{AA}\) applied on Douglas–Rachford splitting, using a quadratic regularization together with an acceptance mechanism based on sufficient decrease of the residual. All experiments are carried out on a laptop with a Core i7-9750H at 2.6 GHz and 16 GB of RAM. The source code for the examples in this section is available at https://github.com/bldeng/Nonmonotone-AA.

Our method involves several parameters. The parameters \(p_1\), \(p_2\), \(\eta _1\) and \(\eta _2\), used for determining acceptance of the trial step and updating the regularization weight, are standard parameters for trust-region methods. We choose \(p_1=0.01\), \(p_2=0.25\), \(\eta _1=2\), \(\eta _2=0.25\) by default. The parameter \(\gamma \) affects the convex combination weights in computing \({r_{k}}\) in Eq. (12), and we choose \(\gamma = 10^{-4}\). For the parameter c in the definition of \({\textrm{pred}_{k}}\), we choose \(c = \kappa \) where \(\kappa < 1\) is a Lipschitz constant for the function g, to satisfy the conditions for Theorems 2 and 3. We will derive the value of \(\kappa \) in each experiment. The initial regularization factor \(\mu _0\) is set to \(\mu _0 = 1\) unless stated otherwise. Concerning the number m of previous iterates used in an \(\textsf{AA}\) step, we can make the following observations: a larger m tends to reduce the number of iterations required for convergence, but also increases the computational cost per iteration; our experiments suggest that choosing \(5 \le m \le 20\) often achieves a good balance. For each experiment below, we will include multiple choices of m for comparison. “Appendix C” provides some further ablation studies for the parameters \(p_1\), \(p_2\), \(\eta _1\), \(\eta _2\), and c.

3.1 Logistic Regression

First, to compare our method with the RNA scheme proposed in [40], we consider the following logistic regression problem from [40] that optimizes a decision variable \(x \in \mathbb {R}^n\):

$$\begin{aligned} \min _{x}~F(x), \end{aligned}$$
(33)

where

$$\begin{aligned} F(x)=\frac{1}{N}\sum \nolimits _{i=1}^N \log (1+\exp (-b_ia_i^Tx))+\frac{\tau }{2}\Vert x\Vert ^2, \end{aligned}$$
(34)

and \(a_i \in \mathbb {R}^n\), \(b_i \in \{-1, 1\}\) are the attributes and label of the data point i, respectively. Following [40], we consider a gradient descent solver \(x^{k+1} = g(x^k)\) with a fixed step size:

$$g(x) = x-\frac{2}{L_F+\tau }\nabla F(x),$$

where

$$\begin{aligned} L_F= \tau + \frac{\Vert A\Vert _2^2}{4N} \end{aligned}$$
(35)

is the Lipschitz constant of \(\nabla F\), and \(A=[a_1,\ldots ,a_N]^T\in \mathbb {R}^{N\times n}\). Then g is Lipschitz continuous with modulus

$$\begin{aligned} \kappa = \frac{L_F-\tau }{L_F+\tau } < 1 \end{aligned}$$

and differentiable, which satisfies Assumption 2. We apply our approach (denoted by “LM-AA”) and RNA to this solver, and compare their performance on two datasets: covtype (54 features, 581,012 points) and sido0 (4932 features, 12,678 points). For each dataset, we normalize the attributes and solve the problem with \(\tau = L_F / 10^{6}\) and \(\tau = L_F / 10^{9}\), respectively. For the implementation of RNA, we use the source code released by the authors of [40]. For our method, we set \(\mu _0 = 100\) and test \(m = 10, 15, 20\). RNA performs an acceleration step every k iterations, and we test \(k=5, 10, 20\). All other RNA parameters are set to their default values as provided in the source code (in particular, with grid-search adaptive regularization weight and line search enabled). Figure 1 plots for each method the normalized target function value \((F(x^k) - F^*)/F^*\) with respect to the iteration count and computational time, where \(F^*\) is the ground-truth global minimum computed by running each method until full convergence and taking the minimum function value among all methods. All variants of LM-AA and RNA accelerate the decrease of the target function compared with the original gradient descent solver, with our methods achieving an overall faster decrease.
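For reference, a sketch of the fixed-point map used in this experiment, set up on synthetic random data in place of the benchmark datasets (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 500, 20
A = rng.standard_normal((N, n))
b = rng.choice([-1.0, 1.0], size=N)

base = np.linalg.norm(A, 2) ** 2 / (4 * N)
tau = base / 1e6
L_F = tau + base                      # Lipschitz constant of grad F, Eq. (35)

def grad_F(x):
    z = -b * (A @ x)                  # z_i = -b_i a_i^T x
    s = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z_i)
    return A.T @ (-b * s) / N + tau * x

def g(x):
    """Fixed-step gradient map with step size 2/(L_F + tau)."""
    return x - 2.0 / (L_F + tau) * grad_F(x)
```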

Fig. 1: Comparison between RNA [40] and our method on a gradient descent solver for the logistic regression problem (34) for the covtype and sido0 datasets, with a different choice of parameter \(\tau \) in each row

3.2 Image Reconstruction

Next, we consider a nonsmooth problem proposed in [51] for total variation based image reconstruction:

$$\begin{aligned} \min _{w,u} ~{\sum }_{i=1}^{N^2}\Vert w_i\Vert _2+\frac{\beta }{2}{\sum }_{i=1}^{N^2}\Vert w_i-D_i u\Vert _2^2+\frac{\nu }{2}\Vert Ku-s\Vert ^2_2, \end{aligned}$$
(36)

where \(s\in [0,1]^{N^2}\) is an \(N \times N\) input image, \(u\in \mathbb {R}^{N^2}\) is the output image to be optimized, \(K\in \mathbb {R}^{N^2\times N^2}\) is a linear operator, \(D_i \in \mathbb {R}^{2 \times N^2}\) represents the discrete gradient operator at pixel i, \(w = (w_{1}^T,\ldots ,w_{N^2}^T)^T \in \mathbb {R}^{2N^2}\) are auxiliary variables for the image gradients, \(\nu > 0\) is a fidelity weight, and \(\beta > 0\) is a penalty parameter. The solver in [51] can be written as alternating minimization between u and w as follows:

$$\begin{aligned} u^{k+1}&= \mathop {\textrm{argmin}}\limits _{u} \frac{\beta }{2}{\sum }_{i=1}^{N^2}\Vert w_i^k-D_i u\Vert _2^2+\frac{\nu }{2}\Vert Ku-s\Vert ^2_2, \end{aligned}$$
(37)
$$\begin{aligned} w^{k+1}&= \mathop {\textrm{argmin}}\limits _{w} {\sum }_{i=1}^{N^2}\Vert w_i\Vert _2+\frac{\beta }{2}{\sum }_{i=1}^{N^2}\Vert w_i-D_i u^{k+1}\Vert _2^2 . \end{aligned}$$
(38)

The solutions to the subproblems (37) and (38) can both be computed in closed form. When \(\beta \) and \(\nu \) are fixed, this can be treated as a fixed-point iteration \(w^{k+1} = g(w^k)\), and it satisfies Assumption 2 (see “Appendix B.2” for a detailed derivation of g and verification of Assumption 2). In the following, we consider the solver with \(K = I\) and \(\nu = 4\) for image denoising. In this case, condition (B.1) is satisfied with

$$\begin{aligned} \kappa = 1-\left( 1+\frac{4\beta }{\nu }\right) ^{-1} \end{aligned}$$

(see “Appendix B.2” for the derivation). We apply this solver to a \(1024 \times 1024\) image with added Gaussian noise that has a zero mean and a variance of \(\sigma = 0.05\) (see Fig. 2).
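For reference, the solution of the w-subproblem (38) is the standard two-dimensional shrinkage; a sketch, where Du stacks the per-pixel gradients \(D_i u^{k+1}\) (the implementation details are illustrative, see [51] for the full solver):

```python
import numpy as np

def w_step(Du, beta):
    """Closed-form solution of (38): for each pixel i,
    w_i = max(||d_i|| - 1/beta, 0) * d_i / ||d_i||  with  d_i = D_i u.

    Du has shape (num_pixels, 2), one image-gradient row per pixel."""
    norms = np.linalg.norm(Du, axis=1, keepdims=True)
    scale = np.maximum(norms - 1.0 / beta, 0.0) / np.maximum(norms, 1e-12)
    return scale * Du
```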

Fig. 2: Application of our method to an alternating minimization solver for an image denoising problem (36)

We use the source code released by the authors of [51] for the implementation of this solver, and apply our acceleration method with \(m = 1, 3, 5\), respectively. For comparison, we also apply the RNA scheme with \(k = 2, 5, 10\), respectively. Here we choose smaller values of m and k than in the logistic regression example because both RNA and our method have a high relative overhead on this problem, which means that larger values of m or k may induce overhead that offsets the performance gain from acceleration. As in the logistic regression example, we use the released source code of RNA for our experiments and set all RNA parameters to their default values. Figure 2 plots the residual norm \(\Vert f(w)\Vert \) for all methods, with \(\beta = 100\) and \(\beta =1000\), respectively. All instances of the acceleration methods converge faster to a fixed point than the original alternating minimization solver, except for RNA with \(k=2\), which is slower in terms of the actual computational time due to its overhead for the grid search of regularization parameters. Overall, the two acceleration approaches achieve a rather similar performance on this problem.

3.3 Nonnegative Least Squares

Finally, to compare our method with [17], we consider a nonnegative least squares (NNLS) problem that is used in [17] for evaluation:

$$\begin{aligned} \min _{x}~\psi (x)+\varphi (x), \end{aligned}$$
(39)

where \(x=(x_1,x_2)\in \mathbb {R}^{2q}\), \(\psi (x)=\Vert Hx_1-t\Vert _2^2+\mathcal {I}_{x_2\ge 0}(x_2)\), and \(\varphi (x)=\mathcal {I}_{x_1=x_2}(x)\), with \(\mathcal {I}_S\) being the indicator function of the set S. The Douglas–Rachford splitting (DRS) solver for this problem can be written as

$$\begin{aligned} v^{k+1} =g(v^k) ={\textstyle {\frac{1}{2}}}((2\textrm{prox}_{\beta \varphi }-I)(2\textrm{prox}_{\beta \psi }-I)+I)v^k \end{aligned}$$
(40)

where \(v^{k}=(v_1^k,v_2^k)\in \mathbb {R}^{2q}\) is an auxiliary variable for DRS and \(\beta \) is the penalty parameter. In [17], the authors use their regularized \(\textsf{AA}\) method (A2DR) to accelerate the DRS solver (40). To apply our method, we verify in “Appendix B.3” that if H is of full column rank, then g satisfies condition (B.1) with

$$\begin{aligned} \kappa = \frac{\sqrt{3+c_1^2}}{2} < 1 \end{aligned}$$

where \(c_1=\max \{\frac{\beta \sigma _1-1}{\beta \sigma _1+1},\frac{1-\beta \sigma _0}{1+\beta \sigma _0}\}\), and \(\sigma _0,\sigma _1\) are the minimal and maximal eigenvalues of \(2H^T H\), respectively. Moreover, g is also differentiable under a mild condition. We compare our method with A2DR on the solver (40), using the same \(\textsf{AA}\) parameters \(m = 10, 15, 20\). The methods are tested on a \(600 \times 300\) sparse random matrix H with \(1\%\) nonzero entries and a random vector t. We use the source code released by the authors of [17] for the implementation of A2DR, and set all A2DR parameters to their default values. While A2DR and DRS are implemented with parallel evaluation of the proximity operators in the released A2DR code, we implement our method as a single-threaded application for simplicity. Figure 3 plots the residual norm \(\Vert f(v)\Vert \) for DRS and the two acceleration methods. It also plots the norm of the overall residual \(r = (r_{\textrm{prim}}, r_{\textrm{dual}})\) used in [17] for measuring convergence, where \(r_{\textrm{prim}}\) and \(r_{\textrm{dual}}\) denote the primal and dual residuals as defined in Equations (7) and (8) of [17], respectively. For both residual measures, the original DRS solver converges slowly after the initial iterations, whereas the two acceleration methods achieve a significant speedup. Moreover, the single-threaded implementation of our method outperforms the parallel A2DR with the same parameter m, in terms of both iteration count and computational time.
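To make the map (40) explicit, the sketch below assembles the two proximity operators for the splitting (39): prox of \(\beta \psi \) decouples into a regularized least-squares solve in \(x_1\) and a projection onto the nonnegative orthant in \(x_2\), while prox of \(\beta \varphi \) is the projection onto the consensus set \(\{x_1 = x_2\}\), i.e., block averaging. This is a minimal sketch under our reading of (39); the helper names are illustrative, and the dense solve is for brevity only (H in the experiment is sparse).

```python
import numpy as np

def make_drs_map(H, t, beta):
    """DRS fixed-point map (40) for the NNLS splitting (39)."""
    q = H.shape[1]
    # prox_{beta*psi} on the x1 block: (2*beta*H^T H + I) x1 = v1 + 2*beta*H^T t.
    M = 2.0 * beta * (H.T @ H) + np.eye(q)
    Minv = np.linalg.inv(M)            # factor once; reused every iteration
    c = 2.0 * beta * (H.T @ t)
    def prox_psi(v1, v2):
        return Minv @ (v1 + c), np.maximum(v2, 0.0)
    def prox_phi(v1, v2):
        avg = 0.5 * (v1 + v2)          # projection onto the set {x1 = x2}
        return avg, avg
    def g(v):
        v1, v2 = v[:q], v[q:]
        p1, p2 = prox_psi(v1, v2)
        r1, r2 = 2.0 * p1 - v1, 2.0 * p2 - v2   # reflection of prox_psi
        s1, s2 = prox_phi(r1, r2)
        # g(v) = (1/2) * ((2*prox_phi - I)(2*prox_psi - I) + I) v
        return 0.5 * np.concatenate([2.0 * s1 - r1 + v1, 2.0 * s2 - r2 + v2])
    return g
```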

Fig. 3 Comparison between A2DR [17] and our method on the NNLS solver (40) with a \(600 \times 300\) sparse random matrix H and a random vector t

3.4 Statistics of Successful Steps

Our acceptance mechanism plays a key role in achieving the global and local convergence of the proposed method. To demonstrate its behavior, we provide statistics of the successful steps for the experiments shown in Figs. 1, 2 and 3. Specifically, for each instance of LM-AA, we count the total number of steps required to reach a certain level of accuracy and compare it with the number of successful \(\textsf{AA}\) steps among them. Tables 1 and 2 show the statistics for the two datasets in Fig. 1, while Tables 3 and 4 correspond to Figs. 2 and 3, respectively. Besides the total number of steps, we report the success rate, defined as the ratio between the number of successful steps and the total number of steps required to reach different levels of accuracy (a small bookkeeping sketch is given after the table captions below).

Table 1 Statistics of successful steps of LM-AA for the logistic regression problem (34) and the dataset covtype
Table 2 Statistics of successful steps of LM-AA for the logistic regression problem (34) and the dataset sido0
Table 3 Statistics of successful steps of LM-AA for the image denoising problem (36)
Table 4 Statistics of successful steps of LM-AA for the nonnegative least squares problem (39)
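As a reading aid, the following minimal sketch shows the bookkeeping behind the reported success rates; the helper and its inputs are hypothetical and not part of our solver.

```python
# Hypothetical helper: given the residual norm history and a boolean flag per
# iteration marking whether the AA step was accepted, compute the success
# rate at each accuracy level tol.
def success_rates(res_norms, aa_accepted, tols):
    rates = {}
    for tol in tols:
        # index of the first iteration reaching the target accuracy
        hit = next((k for k, r in enumerate(res_norms) if r <= tol), None)
        if hit is not None:
            rates[tol] = sum(aa_accepted[:hit + 1]) / (hit + 1)
    return rates
```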

The results in Table 4 demonstrate that essentially all \(\textsf{AA}\) steps are accepted in the nonnegative least squares problem, independently of the choice of the parameter m. The success rate of \(\textsf{AA}\) steps only decreases, with more fallback fixed-point iterations being performed, when we solve the problem to the highest accuracy \(\textit{tol} = 10^{-15}\). Since this accuracy is close to machine precision, the effect is mainly caused by numerical errors and inaccuracies that affect the computation and quality of an \(\textsf{AA}\) step. Table 3 illustrates that a similar behavior can be observed for the image denoising problem (36) when setting \(m=1\). The results in Table 3 also demonstrate a second typical effect: the success rate of \(\textsf{AA}\) steps is often lower when the chosen accuracy is relatively low. With increasing accuracy, the rate then rises to around 70–80%. This general observation is also supported by our results for logistic regression, see Tables 1 and 2 (here, the maximum success rate is more sensitive to the choice of m, \(\tau \), and the dataset).

In summary, the statistics provided in Tables 1, 2, 3, and 4 support our theoretical results. The success rate of \(\textsf{AA}\) steps gradually increases as the iterates approach the fixed point, which indicates a transition to a pure regularized \(\textsf{AA}\) scheme. Furthermore, since more \(\textsf{AA}\) steps tend to be rejected at the beginning of the iterative procedure, our globalization mechanism is essential for guaranteeing global progress and convergence of the approach.

4 Conclusions

We propose a novel globalization technique for Anderson acceleration which combines adaptive quadratic regularization with a nonmonotone acceptance strategy. We prove the global convergence of our approach under mild assumptions. Furthermore, we show that the proposed globalized \(\textsf{AA}\) scheme has the same local convergence rate as the original \(\textsf{AA}\) iteration, so the globalization mechanism does not hinder the acceleration effect of \(\textsf{AA}\). To our knowledge, this is one of the first \(\textsf{AA}\) globalization methods that achieve global convergence and fast local convergence simultaneously. Several numerical examples illustrate that our method is competitive and that it can improve the efficiency of a variety of numerical solvers.