Nonmonotone Globalization for Anderson Acceleration via Adaptive Regularization

Anderson acceleration (AA) is a popular method for accelerating fixed-point iterations, but may suffer from instability and stagnation. We propose a globalization method for AA to improve stability and achieve unified global and local convergence. Unlike existing AA globalization approaches that rely on safeguarding operations and might hinder fast local convergence, we adopt a nonmonotone trust-region framework and introduce an adaptive quadratic regularization together with a tailored acceptance mechanism. We prove global convergence and show that our algorithm attains the same local convergence as AA under appropriate assumptions. The effectiveness of our method is demonstrated in several numerical experiments.


Introduction
In applied mathematics, many problems can be reduced to solving a nonlinear fixed-point equation g(x) = x, where x ∈ R^n and g : R^n → R^n is a given function. If g is a contractive mapping, i.e., if there exists κ < 1 such that ‖g(x) − g(y)‖ ≤ κ‖x − y‖ for all x, y ∈ R^n, then the fixed-point iteration x_{k+1} = g(x_k) is guaranteed to converge to the unique fixed point of g by Banach's fixed-point theorem.
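For concreteness, the plain fixed-point iteration can be sketched as follows (an illustrative Python sketch, not code from the paper):

```python
import numpy as np

def fixed_point_iterate(g, x0, tol=1e-10, max_iter=1000):
    """Plain fixed-point iteration x_{k+1} = g(x_k).

    By Banach's fixed-point theorem, this converges to the unique fixed
    point whenever g is a contraction (illustrative sketch).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = g(x)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x
```

For example, iterating `np.cos` from any starting point converges to the unique solution of cos(x) = x.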
On the theoretical side, it has been shown that AA is a quasi-Newton method for finding a root of the residual function f(x) := g(x) − x [14,16,38]. When applied to linear problems (i.e., if g(x) = Ax − b), AA is equivalent to the generalized minimal residual method (GMRES) [34]. For nonlinear problems, AA is also closely related to the nonlinear generalized minimal residual method [50]. A local convergence analysis of AA for general nonlinear problems was first given in [46,45] under the basic assumptions that g is Lipschitz continuously differentiable and that the AA mixing coefficients α, determined in (3), stay in a compact set. However, the convergence rate provided in [46,45] is not faster than that of the original fixed-point iteration. A more recent analysis in [13] shows that AA can indeed accelerate the local linear convergence of a fixed-point iteration up to an additional quadratic error term. This is further improved in [32], where q-linear convergence of AA is established. The convergence result in [32] requires sufficient linear independence of the columns of [f(x_{k−1}) − f(x_k), …, f(x_{k−m}) − f(x_{k−m+1})], which is typically stronger than the previously mentioned boundedness assumption on the coefficients α. By assuming the mixing coefficients α to be stationary during the iteration, an exact rate of AA is derived in [50].
One issue of classical AA is that it can suffer from instability and stagnation [34,39]. Different techniques have been proposed to address this issue. For example, safeguarding checks were introduced in [30,54] to only accept an AA step if it meets certain criteria, but without a theoretical guarantee for convergence. Another direction is to introduce regularization to the problem (3) for computing the combination coefficients. In [17], a quadratic regularization is used together with a safeguarding step to achieve global convergence of AA on Douglas-Rachford splitting, but there is no guarantee that the local convergence is faster than the original solver. In [39,40], a similar quadratic regularization is introduced to achieve local convergence, although no global convergence proof is provided. A more detailed discussion of related literature and specific techniques connected to our algorithmic design and development is deferred to Subsection 2.1.
As far as we are aware, none of the existing approaches and modified versions of AA guarantee both global convergence and accelerated local convergence. In this paper, we propose a novel AA globalization scheme that achieves these two goals simultaneously. Specifically, we apply a quadratic regularization with its weight adjusted automatically according to the effectiveness of the AA step. We adapt the nonmonotone trust-region framework in [47] to update the weight and to determine the acceptance of the AA step. Our approach not only achieves global convergence, but also attains the same local convergence rate established in [13] for AA without regularization. Furthermore, our local results also cover applications where the mapping g is nonsmooth and differentiability is only required at a target fixed-point of g. To the best of our knowledge, this is the first globalization technique for AA that achieves the same local convergence rate as the original AA scheme. Numerical experiments on both smooth and nonsmooth problems verify the effectiveness and efficiency of our method.
Notations. Throughout this work, we restrict our discussion to the n-dimensional Euclidean space R^n. For a vector x, ‖x‖ denotes its Euclidean norm, and B_ε(x) := {y : ‖y − x‖ ≤ ε} denotes the Euclidean ball centered at x with radius ε. For a matrix A, ‖A‖ is the operator norm with respect to the Euclidean norm. We use I to denote both the identity mapping (i.e., I(x) = x) and the identity matrix. For a function h : R^n → R^ℓ, the mapping h′ : R^n → R^{ℓ×n} represents its derivative. h is called L-smooth if it is differentiable and ‖h′(x) − h′(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^n. An operator h : R^n → R^n is called nonexpansive if for all x, y ∈ R^n we have ‖h(x) − h(y)‖ ≤ ‖x − y‖. We say that the operator h is ρ-averaged for some ρ ∈ (0, 1) if there exists a nonexpansive operator R : R^n → R^n such that h = (1 − ρ)I + ρR. The set of fixed points of the mapping h is defined via Fix(h) := {x : h(x) = x}. The interested reader is referred to [4] for further details on operator theory.
The AA formulation in Eqs. (2) and (3) assumes k ≥ m. It can be adapted to account for the case k < m by using m̃ coefficients instead, where m̃ = min{m, k}. Without loss of generality, we use m to refer to the actual number of coefficients being used.

Adaptive Regularization for AA
In the following, we set f_k := f(x_k) and g_k := g(x_k) to simplify notation. We first note that the accelerated iterate computed via (2) and (3) is invariant under permutations of the indices of {f_j} and {g_j}. Concretely, let Π_k := (k_0, k_1, …, k_m) be any permutation of the index sequence (k, k − 1, …, k − m). Then the point x_{k+1} calculated in Eq. (2) can equivalently be expressed in terms of the permuted indices, with coefficients ᾱ_k = (ᾱ_{k_0}, …, ᾱ_{k_m}) computed from the correspondingly permuted version of (3), which amounts to solving a linear system. In this paper, we use a particular class of permutations in which k_0 is the largest index that attains the minimum residual norm among ‖f_{k−m}‖, ‖f_{k−m+1}‖, …, ‖f_k‖. As we will see later, this type of permutation allows us to apply certain nonmonotone globalization techniques and to ultimately establish local and global convergence of our approach. An ablation study on the potential effect of the permutation strategy is presented in Appendix D.
One potential cause of instability of AA is the (near) linear dependence of the vectors {f_{k_i} − f_{k_0} : i = 1, …, m}, which can result in (near) rank deficiency of the linear system matrix for the problem (5). To address this issue, we introduce a quadratic regularization to the problem (5) and compute the coefficients α_k by minimizing ‖f̂_k(α)‖² + λ_k‖α‖², where λ_k > 0 is a regularization weight and f̂_k(·) is the linearized residual defined in Eq. (8). The coefficients α_k are then used to compute a trial step ĝ_k(α_k) in the same way as in Eq. (4). The trial step is accepted as the new iterate if it meets certain criteria, which we develop in detail below. Regularization such as the one in Eq. (7) has been suggested in [3] and is applied in [17,40]. A major difference between our approach and the regularization in [17,40] is the choice of λ_k: in [17] it is set in a heuristic manner, whereas in [40] it is either fixed or specified via grid search. We instead update λ_k adaptively based on the effectiveness of the latest AA step. Specifically, we observe that a larger value of λ_k improves the stability of the resulting linear system and induces a stronger penalty on the magnitude of α_k. In this case, the trial step ĝ_k(α_k) tends to be closer to g_{k_0}, which, according to Eq. (6), is the fixed-point iteration step with the smallest residual among the latest m + 1 iterates. On the other hand, a larger regularization weight may also hinder the fast convergence of AA if it is already effective in reducing the residual without regularization. Thus, λ_k is dynamically adjusted according to the reduction of the residual in the current step.
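A minimal sketch of the regularized coefficient computation, assuming the standard normal-equations form (J^T J + λI)α = −J^T f_{k_0} of problem (7), with J collecting the residual differences:

```python
import numpy as np

def regularized_aa_coefficients(F, lam):
    """Solve the quadratically regularized least-squares problem for the AA
    mixing coefficients (a sketch of problem (7); the normal-equations form
    is an assumption consistent with the text).

    F   : residual vectors [f_{k0}, f_{k1}, ..., f_{km}], f_{k0} first.
    lam : regularization weight lambda_k > 0.
    """
    f0 = F[0]
    # J collects the residual differences f_{ki} - f_{k0} as columns.
    J = np.stack([fi - f0 for fi in F[1:]], axis=1)
    m = J.shape[1]
    # Tikhonov-regularized normal equations: (J^T J + lam I) a = -J^T f0.
    return np.linalg.solve(J.T @ J + lam * np.eye(m), -J.T @ f0)
```

Note that a large `lam` drives the coefficients toward zero, so the trial step collapses to the plain fixed-point step, as discussed above.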
Our adaptive regularization scheme is inspired by the similarity between the problem (7) and the Levenberg-Marquardt (LM) algorithm [20,26], a popular approach for solving nonlinear least-squares problems of the form min_x ‖F(x)‖², where F is a vector-valued function. Each iteration of LM computes a variable update d_k := x_{k+1} − x_k by solving a quadratic problem of the form min_d ‖F(x_k) + F′(x_k)d‖² + λ̂_k‖d‖². Here, the first term is a local quadratic approximation of the target function ‖F(x)‖² using the first-order Taylor expansion of F, while the second term is a regularization with a weight λ̂_k > 0. LM can be considered a regularized version of the classical Gauss-Newton (GN) method for nonlinear least-squares optimization [23]. In GN, each iteration computes an initial step d by minimizing the local quadratic approximation term only, i.e., min_d ‖F(x_k) + F′(x_k)d‖², which amounts to solving a linear system for d with the positive semidefinite matrix F′(x_k)^T F′(x_k). Similar to AA, (near) linear dependence between the columns of F′(x_k) can lead to (near) rank deficiency of the system matrix, causing potential instability. To address this issue, LM introduces a quadratic regularization term for d, which adds a scaled identity matrix to the linear system matrix and prevents it from being singular. Furthermore, LM measures the effectiveness of the computed update using the ratio between the resulting reduction of the target function and a predicted reduction based on the quadratic approximation. This measure is used to determine the acceptance of the update, to enforce a monotonic decrease of the target function, and to update the regularization weight for the next iteration. Such an adaptive regularization is an instance of a trust-region method [10]. Taking a similar approach as LM, we define two functions ared_k and pred_k that measure the actual and predicted reduction of the residual resulting from the solution α_k to (7): ared_k := r_k − ‖f(ĝ_k(α_k))‖ and pred_k := r_k − c‖f̂_k(α_k)‖, where c ∈ (0, 1) is a constant.
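The LM step described above can be sketched as follows (illustrative; `F` and `J` are hypothetical callables supplied by the user):

```python
import numpy as np

def lm_step(F, J, x, lam):
    """One Levenberg-Marquardt trial step for min_x ||F(x)||^2 (sketch).

    F, J : callables returning the residual F(x) and its Jacobian F'(x).
    lam  : regularization weight (lambda-hat_k in the text).
    Minimizing ||F(x) + F'(x) d||^2 + lam ||d||^2 over d amounts to the
    normal equations (J^T J + lam I) d = -J^T F(x).
    """
    Fx, Jx = F(x), J(x)
    d = np.linalg.solve(Jx.T @ Jx + lam * np.eye(x.size), -Jx.T @ Fx)
    return x + d
```

For a linear residual F(x) = Ax − b and a tiny `lam`, a single step essentially recovers the least-squares solution, mirroring the GN limit discussed above.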
Here, r_k measures the residuals from the latest m + 1 iterates via a convex combination, r_k := (1 − mγ)‖f_{k_0}‖ + γ Σ_{i=1}^m ‖f_{k_i}‖, with γ ∈ (0, 1/(m+1)) so that a higher weight is assigned to the smallest residual ‖f_{k_0}‖ among them. Note that ĝ_k(α_k) is the trial step and f(·) is the residual function; thus ared_k compares the latest residuals with the residual resulting from the trial step. This specific choice of r_k is inspired by the local descent properties of AA, see, e.g., [13, Theorem 4.4]. Moreover, note that f̂_k(·) (see Eq. (8)) is a linear approximation of the residual function based on the latest residual values, and it is used in problem (7) to derive the coefficients α_k for computing the trial step. Thus f̂_k(α_k) is a predicted residual for the trial step based on the linear approximation, and pred_k compares it with the latest residuals. The constant c guarantees that pred_k has a positive value as long as a solution to the problem has not been found; see Appendix A for a proof. Similar to LM, we calculate the ratio ρ_k := ared_k / pred_k as a measure of effectiveness for the trial step ĝ_k(α_k) computed with Eqs. (7) and (9). In particular, if ρ_k ≥ p_1 with a threshold p_1 ∈ (0, 1), then from Eq. (13) and the positivity of pred_k we can bound the residual of ĝ_k(α_k) via ‖f(ĝ_k(α_k))‖ ≤ (1 − p_1) r_k + p_1 ‖f_{k_0}‖. Like r_k, the last expression (1 − p_1)r_k + p_1‖f_{k_0}‖ in Eq. (14) is also a convex combination of the latest m + 1 residuals, but with a higher weight on the smallest residual ‖f_{k_0}‖ than in r_k. Hence, when ρ_k ≥ p_1, we consider the decrease of the residual to be sufficient. In this case, we set x_{k+1} = ĝ_k(α_k) and say the iteration is successful. Otherwise, we discard the trial step and choose x_{k+1} = g_{k_0} = g(x_{k_0}), which corresponds to the fixed-point iteration step with the smallest residual among the latest m + 1 iterates. Thus, by permuting the indices (k, k − 1, …, k − m) according to Π_k, we ensure the most progress in terms of reducing the residual when an AA trial step is rejected.
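The acceptance test can be sketched as below; the exact convex-combination weights of r_k are an assumption consistent with the stated constraint γ ∈ (0, 1/(m+1)):

```python
def acceptance_ratio(res_hist, res_trial, res_pred, gamma=1e-4, c=0.5):
    """Compute r_k, ared_k, pred_k and the ratio rho_k (sketch of the
    acceptance mechanism; the weights of r_k are our assumption).

    res_hist  : residual norms [||f_{k0}||, ..., ||f_{km}||], smallest first.
    res_trial : ||f(g_hat_k(alpha_k))||, residual norm at the trial step.
    res_pred  : ||f_hat_k(alpha_k)||, the linearized (predicted) residual.
    """
    m = len(res_hist) - 1
    # Convex combination placing weight (1 - m*gamma) on the smallest residual.
    r_k = (1 - m * gamma) * res_hist[0] + gamma * sum(res_hist[1:])
    ared = r_k - res_trial          # actual reduction
    pred = r_k - c * res_pred       # predicted reduction (positive; Appendix A)
    return ared / pred
```

The trial step is then accepted whenever this ratio is at least p_1.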
We also adjust the regularization weight λ_k according to the ratio ρ_k. Specifically, we set λ_k = μ_k ‖f_{k_0}‖², where the factor μ_k > 0 is automatically updated based on ρ_k as follows: -If ρ_k < p_1, then we consider the decrease of the residual to be insufficient and we increase the factor in the next iteration via μ_{k+1} = η_1 μ_k with a constant η_1 > 1. -If ρ_k > p_2 with a threshold p_2 ∈ (p_1, 1), then we consider the decrease to be high enough and reduce the factor via μ_{k+1} = η_2 μ_k with a constant η_2 ∈ (0, 1). This will relax the regularization so that the next trial step will tend to be closer to the original AA step.
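The update of the factor μ_k can be sketched as follows (default thresholds taken from the experiments section; leaving μ_k unchanged for ρ_k between p_1 and p_2 is our assumption, following standard trust-region practice):

```python
def update_mu(rho, mu, p1=0.01, p2=0.25, eta1=2.0, eta2=0.25):
    """Trust-region-style update of the regularization factor mu_k (sketch).

    rho : measured ratio ared_k / pred_k.
    Returns (accepted, mu_next): whether the trial step is accepted and
    the factor for the next iteration.
    """
    if rho < p1:            # insufficient decrease: reject, regularize more
        return False, eta1 * mu
    if rho > p2:            # very effective step: relax the regularization
        return True, eta2 * mu
    return True, mu         # moderate success: keep the factor unchanged
```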
Here, the choice of the parameters p_1, p_2 with 0 < p_1 < p_2 < 1 follows the convention of basic trust-region methods [10]. Our setting of λ_k in Eq. (15) is inspired by [15], which relates the LM regularization weight to the residual norm. For our method, this setting ensures that the two target function terms in problem (7) are of comparable scales, so that the adjustment of the factor μ_k is meaningful. This choice of λ_k and the update rule of μ_k are quite standard in LM methods. However, the classical convergence analysis in [15] is not directly applicable here: in the LM method, the decrease of the residual can be predicted via the linearized model ‖F(x_k) + F′(x_k)d_k‖. For AA, the linearized residual f̂_k(α_k) is not a model for the update x_{k+1} = ĝ_k(α_k) but for x̂_k(α_k) instead. Since a linearized residual of ĝ_k is not readily available, we use an upper bound for such a linearization of ĝ_k(α_k), which is exactly given by c‖f̂_k(α_k)‖. The whole method is summarized in Algorithm 1.
Unlike LM which enforces a monotonic decrease of the target function, our acceptance strategy allows the residual for x k+1 to increase compared to the previous iterate x k . Therefore, our scheme can be considered as a nonmonotone trust-region approach and follows the procedure investigated in [47]. In the next subsections, we will see that this nonmonotone strategy allows us to establish unified global and local convergence results. In particular, besides global convergence guarantees, we can show transition to fast local convergence and an acceleration effect similar to the original AA scheme can be achieved.
The main computational overhead of our method lies in the optimization problem (7), which amounts to constructing and solving an m×m linear system with the matrix J^T J + λ_k I, where J := [f_{k_1} − f_{k_0}, …, f_{k_m} − f_{k_0}] ∈ R^{n×m}. A naïve implementation that computes the matrix J from scratch in each iteration results in O(m²n) time for setting up the linear system, whereas the system itself can be solved in O(m³) time. Since we typically have m ≪ n, the linear system setup becomes the dominant overhead. To reduce this overhead, we note that each entry of J^T J is a linear combination of inner products between f_{k_0}, …, f_{k_m}. If we pre-compute and store these inner products, then only an additional O(m²) time is required to evaluate all entries. Moreover, the pre-computed inner products can be updated in O(mn) time in each iteration, so we need only O(mn) total time to evaluate J^T J. Similarly, we can evaluate J^T f_{k_0} in O(m) time. In this way, the linear system setup requires only O(mn) time per iteration. Moreover, as the parameter m is often a small value independent of n (and significantly smaller than n), the complexity O(mn) is effectively linear in n and incurs only a small computational overhead.
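The inner-product caching strategy can be sketched as follows (an illustrative implementation of the bookkeeping, not the authors' code):

```python
import numpy as np

class ResidualGram:
    """Cache of pairwise inner products <f_i, f_j> of stored residuals, so
    that J^T J (with J = [f_{k1}-f_{k0}, ..., f_{km}-f_{k0}]) can be
    assembled in O(m^2) instead of O(m^2 n) per iteration (sketch)."""

    def __init__(self):
        self.F = []                    # stored residual vectors
        self.G = np.zeros((0, 0))      # Gram matrix of inner products

    def push(self, f):
        """Add a new residual: O(mn) for its row of inner products."""
        row = np.array([f @ fi for fi in self.F])
        m = len(self.F)
        G = np.zeros((m + 1, m + 1))
        G[:m, :m] = self.G
        G[m, :m] = row
        G[:m, m] = row
        G[m, m] = f @ f
        self.F.append(f)
        self.G = G

    def normal_matrix(self, i0):
        """Assemble J^T J for pivot index i0 from cached products: O(m^2).

        Uses (f_i - f_{i0}) . (f_j - f_{i0})
           = G[i,j] - G[i,i0] - G[i0,j] + G[i0,i0].
        """
        idx = [i for i in range(len(self.F)) if i != i0]
        G = self.G
        return np.array([[G[i, j] - G[i, i0] - G[i0, j] + G[i0, i0]
                          for j in idx] for i in idx])
```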

Global Convergence Analysis
We now present our main assumptions on g and f that allow us to establish global convergence of Algorithm 1. Our conditions are mainly based on a monotonicity property and on pointwise convergence of the iterated compositions g^k := g ∘ ⋯ ∘ g.

Assumption 1
The functions g and f satisfy the following conditions:
(A.1) ‖f(g(x))‖ ≤ ‖f(x)‖ for all x ∈ R^n.
(A.2) There exists ν ≥ 0 such that lim_{k→∞} ‖f(g^k(x))‖ = ν for every x ∈ R^n.
It is easy to see that Assumption 1 holds for any contractive function with ν = 0. In particular, if g satisfies (1), we obtain ‖f(g(x))‖ = ‖g(g(x)) − g(x)‖ ≤ κ‖g(x) − x‖ = κ‖f(x)‖, and hence ‖f(g^k(x))‖ ≤ κ^k‖f(x)‖ → 0 as k → ∞. In the following, we will verify that Assumption 1 also holds for ρ-averaged operators, which define a broader class of mappings than contractions.
Proposition 1 Let g be a ρ-averaged operator for some ρ ∈ (0, 1). Then g and f satisfy Assumption 1.
Proof By definition, the ρ-averaged operator g is also nonexpansive and thus (A.1) holds for g. To prove (A.2), let us set y_0 := x and y_{k+1} := g(y_k). By (A.1), the sequence {‖f(y_k)‖} is non-increasing and bounded below; therefore, we can assume that lim_{k→∞} ‖f(y_k)‖ = ϑ. If ϑ > ν, then by the reverse triangle inequality and (A.1) we can derive a lower bound on the residual norms that is bounded away from ν. Taking the limit k → ∞, we reach a contradiction. So we must have ν = ϑ, as desired.
Remark 1 Setting κ (the Lipschitz constant of g) to 1 in (16), we see that (A.1) is always satisfied if g is a nonexpansive operator. However, nonexpansiveness is not a necessary condition for (A.1); in fact, one can construct an operator g that is not nonexpansive but satisfies (A.1) and (A.2). Because of Proposition 1, our global convergence theory is applicable to a large class of iterative schemes. As an example, we show in the following that Assumption 1 is satisfied by forward-backward splitting, a popular optimization solver in machine learning.
Example 1 Let us consider the nonsmooth optimization problem min_x r(x) + ϕ(x), where both r, ϕ : R^n → (−∞, ∞] are proper, closed, and convex functions, and r is L-smooth. It is well known that x* is a solution to this problem if and only if it satisfies the nonsmooth equation x* = prox_{μϕ}(x* − μ∇r(x*)), where prox_{μϕ}(x) := argmin_y ϕ(y) + (1/2μ)‖x − y‖², μ > 0, is the proximity operator of ϕ; see also Corollary 26.3 of [4]. We can then compute x* via the iterative scheme x_{k+1} = G_μ(x_k) := prox_{μϕ}(x_k − μ∇r(x_k)). G_μ is known as the forward-backward splitting operator, and it is a ρ-averaged operator for all μ ∈ (0, 2/L); see [8]. Hence, Assumption 1 holds and our theory can be used to study the global convergence of Algorithm 1 applied to (18).
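A minimal sketch of the forward-backward iteration; the lasso instance with the soft-thresholding prox is our own illustrative choice, not an example from the paper:

```python
import numpy as np

def forward_backward(grad_r, prox_phi, x0, mu, iters=100):
    """Fixed-point iteration x <- G_mu(x) = prox_{mu*phi}(x - mu*grad r(x))
    (forward-backward splitting; minimal sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox_phi(x - mu * grad_r(x), mu)
    return x

def soft_threshold(v, mu):
    """prox of mu*||.||_1, i.e., componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)
```

For instance, with r(x) = 0.5‖x − b‖² (so ∇r(x) = x − b, L = 1) and ϕ = ‖·‖₁, the iteration converges for μ ∈ (0, 2) to the soft-thresholded vector.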
Remark 2 For problem (17), it can be shown that Douglas-Rachford splitting, as well as its equivalent form of ADMM, can both be written as a ρ-averaged operator with ρ ∈ (0, 1) (see, e.g., [22]). Therefore, the applications considered in [17] are also covered by Assumption 1.
We can now show the global convergence of Algorithm 1. Theorem 1 Suppose Assumption 1 is satisfied and let {x_k} be generated by Algorithm 1. Then the algorithm either terminates after finitely many steps, or the residual norms satisfy ‖f(x_k)‖ → ν as k → ∞. Proof In the following, we will use S to denote the set of indices of all successful iterations, i.e., S := {k : ρ_k ≥ p_1}. To simplify the notation, we introduce a function P : N → N that maps each iteration index k to the index of the iterate with the smallest residual norm among the latest m + 1 iterates; the number P(k) then coincides with k_0 for fixed k.
If Algorithm 1 terminates after a finite number of steps, the conclusion simply follows from the stopping criterion. Therefore, in the following, we assume that a sequence of iterates of infinite length is generated. We consider two different cases. Case 1: |S| < ∞. Let k̄ denote the index of the last successful iteration in S (we set k̄ = 0 if S = ∅). We first show that P(k) = k for all k ≥ k̄ + 1. Due to k̄ + 1 ∉ S, it follows that x_{k̄+1} = g(x_{P(k̄)}) and, by (A.1), this implies ‖f(x_{k̄+1})‖ ≤ ‖f(x_{P(k̄)})‖. From the definition of P, we have ‖f(x_{P(k̄)})‖ ≤ ‖f_{k̄−i}‖ for every 0 ≤ i ≤ min{m, k̄}, and hence P(k̄ + 1) = k̄ + 1. An inductive argument then yields P(k) = k for all k ≥ k̄ + 1. Notice that for any k ≥ k̄ + 1, we have k ∉ S, so the method reduces to the fixed-point iteration x_{k+1} = g(x_k) and (A.2) yields ‖f(x_k)‖ → ν. Case 2: |S| = ∞. We first show that the sequence {W_k} is non-increasing.
-If k ∈ S, then we already know from Appendix A that pred_k > 0. Together with the acceptance condition ρ_k ≥ p_1 and the bound (41) from Appendix A, we can derive a decrease estimate for W_{k+1} in terms of W_k; see Eqs. (19), (20), and (21) for all k ∈ S. It then suffices to prove that for any i satisfying k + 1 ≤ i ≤ k + m + 1, we have ‖f_i‖ ≤ c_p W_k. Since we consider a successful iteration k ∈ S, our previous discussion has already shown that ‖f_{k+1}‖ ≤ c_p W_k, and an inductive argument extends this bound to the remaining indices. Hence, by induction, we have W_{k+m+1} ≤ c_p W_k for all k ∈ S. Since {W_k} is non-increasing and we assumed |S| = ∞, this establishes W_k → 0 and ‖f_k‖ → 0. In this case, we can infer ν = 0 and the proof is complete.
Remark 3 This global result does not depend on the specific update rule for the regularization weight λ k . Indeed, global convergence mainly results from our acceptance mechanism and hence, as a consequence of our proof, different update strategies for λ k can also be applied. Our choice of λ k in (15), however, will be essential for establishing local convergence of the method.

Local Convergence Analysis
Next, we analyze the local convergence of our proposed approach, starting with several assumptions.

Assumption 2
The function g : R^n → R^n satisfies the following conditions:
(B.1) g is Lipschitz continuous with a constant κ < 1, i.e., ‖g(x) − g(y)‖ ≤ κ‖x − y‖ for all x, y ∈ R^n.
(B.2) g is differentiable at the (unique) fixed point x* of g.
Remark 4 (B.1) is a standard assumption widely used in the local convergence analysis of AA [46,13,39,40]. The existing analyses typically rely on the smoothness of g. In contrast, (B.2) allows g to be nonsmooth and only requires it to be differentiable at the fixed point x*, allowing our assumptions to cover a wider variety of methodologies, such as forward-backward splitting and Douglas-Rachford splitting under appropriate assumptions; see Appendix B. We note that in [6] the Lipschitz differentiability of g is replaced by continuous differentiability around x*, while we only assume differentiability at one point. This technical difference is based on the observation that an expansion of the residual f_k is only required at the point x* and not at the iterates x_k, which allows us to work with weaker differentiability requirements. We further note that AA has been investigated for nonsmooth g in [53,17], but without local convergence analysis. Recent convergence results of AA for a scheme related to the proximal gradient method discussed in Example 1 can also be found in [24]. While the local assumptions and derived convergence rates in [24] are somewhat similar to our local results, we want to highlight that the algorithm and analysis in [24] are tailored to convex composite problems of the form (17). Moreover, the global results in [24] are shown for a second, guarded version of AA and are based on the strong convexity of the problem. In contrast, and under conditions that are not stronger than the local assumptions in [24], we will establish unified global-local convergence of our approach for general contractions. In Section 3, we verify the conditions (B.1) and (B.2) on the numerical examples, with a more detailed discussion in Appendix B.
Remark 5 (B.1) implies that the function g is contractive, which is a sufficient condition for (A.1) and (A.2). Thus, a function g satisfying Assumption 2 will also fulfill Assumption 1 with ν = 0.
Similar to the local convergence analyses in [46,13], we also work with the following condition:
Assumption 3 The coefficients ᾱ_k computed from the unregularized problem (5) remain uniformly bounded, i.e., there exists M > 0 such that ‖ᾱ_k‖ ≤ M for all k.
Remark 6 The assumptions given in [46,13] are formulated without permuting the last m + 1 indices. We further note that we do not require the solution ᾱ_k to be unique.
The acceleration effect of the original AA scheme has only been studied very recently in [13] based on slightly stronger assumptions. In particular, their result can be stated as ‖f_{k+1}‖ ≤ θ_k κ ‖f_k‖ + O(Σ_{i=0}^{m} ‖f_{k−i}‖²), where θ_k := ‖f̂_k(ᾱ_k)‖ / ‖f_k‖ is an acceleration factor. Since ᾱ_k is a solution to the problem (5), we have θ_k ≤ 1. Then (22) implies that, for a fixed-point iteration that converges linearly with a contraction constant κ, AA can improve the convergence rate locally. In the following, we will show that our globalized AA method possesses similar characteristics under weaker assumptions.
We first verify that after finitely many iterations, every trial step x_{k+1} = ĝ_k(α_k) is accepted as a new iterate. Thus, our method eventually reduces to a pure regularized AA scheme.
Theorem 2 Suppose that Assumptions 2 and 3 hold and let the constant c in (11) be chosen such that c ≥ κ. Then, the sequence {x_k} generated by Algorithm 1 (with residual tolerance ε_f = 0) either terminates after finitely many steps, or converges to the fixed point x* and there exists some ℓ ∈ N such that ρ_k ≥ p_2 for all k ≥ ℓ. In particular, every iteration k ≥ ℓ is successful with x_{k+1} = ĝ_k(α_k).
Proof Our proof consists of three steps. We first show the convergence of the whole sequence {x_k} to the fixed point x*. Afterwards, we derive a bound for the residual ‖f(ĝ_k(α_k))‖ that can be used to estimate the actual reduction ared_k. In the third step, we combine our observations to prove the transition to the full AA method, i.e., we show that there exists some ℓ with k ∈ S for all k ≥ ℓ.
Step 1: Convergence of {x_k}. By (B.1), g is a contraction, and hence for all x ∈ R^n we have ‖f(x)‖ = ‖g(x) − g(x*) − (x − x*)‖ ≥ ‖x − x*‖ − ‖g(x) − g(x*)‖ ≥ (1 − κ)‖x − x*‖, so that ‖f_k‖ = ‖f(x_k)‖ ≥ (1 − κ)‖x_k − x*‖ for all k. Theorem 1 and Remark 5 guarantee lim_{k→∞} ‖f_k‖ = 0 and hence we can infer x_k → x*.
Step 2: Bounding ‖f(ĝ_k(α_k))‖. Introducing x̂_k(α_k) := Σ_{i=0}^m (α_k)_i x_{k_i}, so that ĝ_k(α_k) = x̂_k(α_k) + f̂_k(α_k), and using (B.1), we can bound the residual directly as follows: ‖f(ĝ_k(α_k))‖ ≤ ‖g(ĝ_k(α_k)) − g(x̂_k(α_k))‖ + ‖g(x̂_k(α_k)) − ĝ_k(α_k)‖ ≤ κ‖f̂_k(α_k)‖ + ‖g(x̂_k(α_k)) − ĝ_k(α_k)‖. We now continue to estimate the second term ‖g(x̂_k(α_k)) − ĝ_k(α_k)‖. From the algorithmic construction and the definition of α_k and ᾱ_k, it follows that the objective value of (7) at α_k is no larger than its value at ᾱ_k. Consequently, applying the estimate (24) derived in Step 1, it follows that the points x_{k_i} and x̂_k(α_k) converge to x*. Note that the differentiability of g at x*, as stated in Assumption (B.2), allows us to expand g around x*. Applying this expansion to different choices of y and using the boundedness of the coefficients, together with (24) and (26), we can combine our results to obtain the bound (28) on ‖f(ĝ_k(α_k))‖. Step 3: Transition to fast local convergence. As in the proof of Theorem 1, let us introduce the nonmonotone reference value W_k. Due to (28), the residual of the trial step is controlled for all k ≥ ℓ. Hence, using c ≥ κ, we can bound the actual reduction ared_k from below. Similarly, for the predicted reduction pred_k we can show: if 1 − γ − c ≥ 0, we obtain pred_k ≥ γW_k; otherwise, it follows that pred_k ≥ (1 − c)W_k, and together this yields pred_k ≥ min{γ, 1 − c}W_k. Combining the last estimates, we can finally deduce that ρ_k ≥ p_2 for all k sufficiently large, which completes the proof.

Remark 7
Our novel nonmonotone acceptance mechanism is the central component of our proof for Theorem 2, as it allows us to balance the additional error terms caused by an AA step.
Next, we show that our approach can enhance the convergence of the underlying fixed-point iteration and that it has a local convergence rate similar to the original AA method as given in [13].
Theorem 3 Suppose that Assumptions 2 and 3 hold and let the parameters c, ε_f in Algorithm 1 satisfy c ≥ κ and ε_f = 0. Then, for k → ∞, it holds that ‖f_{k+1}‖ ≤ θ_k κ ‖f_{k_0}‖ + o(‖f_{k_0}‖), where θ_k := ‖f̂_k(ᾱ_k)‖ / ‖f_{k_0}‖ is the corresponding acceleration factor. In addition, the sequence of residuals {‖f_k‖} converges r-linearly to zero with a rate arbitrarily close to κ, i.e., for every η ∈ (κ, 1) there exist C > 0 and ℓ̂ ∈ N such that ‖f_k‖ ≤ Cη^k for all k ≥ ℓ̂.
Proof Theorem 2 implies ρ_k ≥ p_2 for all k ≥ ℓ and hence, from the update rule of Algorithm 1, it follows that μ_k → 0. Then, by (15), we can infer λ_k = o(‖f_{k_0}‖²). Using Eq. (25) and Assumption 3, this shows ‖f̂_k(α_k)‖ ≤ θ_k‖f_{k_0}‖ + o(‖f_{k_0}‖). Thus, by (28), we obtain the stated bound. In order to establish r-linear convergence, we follow the strategy presented in [46]. Let η ∈ (κ, 1) be a given rate. Then, due to ‖f_k‖ → 0 and using (30), there exists ℓ̂ ∈ N such that ‖f_{k+1}‖ ≤ η‖f_{k_0}‖ for all k ≥ ℓ̂, and we can choose C > 0 with ‖f_j‖ ≤ Cη^j for all ℓ̂ − m ≤ j ≤ ℓ̂. We now claim that the statement ‖f_k‖ ≤ Cη^k holds for all k ≥ ℓ̂. As just shown, this is satisfied for the base case k = ℓ̂. As part of the inductive step, let us assume that the estimate ‖f_j‖ ≤ Cη^j holds for all j = ℓ̂, ℓ̂ + 1, …, k (in fact, this bound also holds for j = ℓ̂ − m, …, ℓ̂ − 1). By the definition of the index k_0, we have ‖f_{k_0}‖ ≤ Cη^k and, due to (31), it follows that ‖f_{k+1}‖ ≤ η‖f_{k_0}‖ ≤ Cη^{k+1}. Hence, our claim also holds for k + 1, which finishes the induction and the proof.
Under a stronger differentiability condition and a stricter update rule for λ_k, we can recover the same local rate as in [13]. Corollary 1 Let the assumptions stated in Theorem 3 hold and let g satisfy the differentiability condition g(y) − g(x*) − g′(x*)(y − x*) = O(‖y − x*‖²) as y → x*. Suppose that the weight λ_k is updated via λ_k = μ_k‖f_{k_0}‖⁴. Then, for all k sufficiently large, we have ‖f_{k+1}‖ ≤ θ_k κ ‖f_{k_0}‖ + O(Σ_{i=0}^m ‖f_{k_i}‖²). Proof As mentioned in the remark after Theorem 1, our global results still hold if a different update strategy is used for the weight parameter λ_k. Moreover, the proof of Theorem 2 also does not depend on the specific choice of λ_k. Consequently, we only need to improve the bound (28) on ‖f(ĝ_k(α_k))‖ derived in Step 2 of the proof of Theorem 2. Using the additional differentiability property g(y) − g(x*) − g′(x*)(y − x*) = O(‖y − x*‖²), y → x*, we can directly improve the estimate for ‖g(x̂_k(α_k)) − ĝ_k(α_k)‖ in (27) and thus, mimicking and combining the derivations in Step 2 of the proof of Theorem 2, obtain an improved bound on ‖f(ĝ_k(α_k))‖ as k → ∞. As in the previous proof, we can now infer μ_k → 0 (this follows from ρ_k ≥ p_2 for all k sufficiently large) and λ_k = o(‖f_{k_0}‖⁴). Furthermore, as in (29), due to Eq. (25) and Assumption 3, it holds that ‖f̂_k(α_k)‖ ≤ θ_k‖f_{k_0}‖ + O(‖f_{k_0}‖²). Combining this result with (32), we can then establish the convergence rate stated in Corollary 1.

Remark 8
The stronger differentiability condition, which was also used in [13] and other local analyses, is, e.g., satisfied when the derivative g′ is locally Lipschitz continuous around x*. Further discussion of this property can be found in Appendix B. We note that under this type of stronger differentiability, we can only improve the order of the remainder linearization error terms and not the linear rate of convergence.

Numerical Experiments
We verify the effectiveness of our method by applying it to several existing numerical solvers and comparing its convergence speed with the original solvers. We also include the acceleration approaches from [40,17] for comparison. The regularized nonlinear acceleration (RNA) proposed in [40] computes an accelerated iterate via an affine combination of the previous k iterates, and it also introduces a quadratic regularization when computing the affine combination coefficients. Unlike our approach, it performs an acceleration step every k iterations instead of every iteration, and its regularization weight is determined by a grid search that finds the weight that leads to the lowest target function value at the accelerated iterate. The A2DR scheme proposed in [17] is a globalization of AA applied on Douglas-Rachford splitting, using a quadratic regularization together with an acceptance mechanism based on sufficient decrease of the residual. All experiments are carried out on a laptop with a Core-i7 9750H at 2.6GHz and 16GB of RAM. The source code for the examples in this section is available at https://github.com/bldeng/Nonmonotone-AA.
Our method involves several parameters. The parameters p 1 , p 2 , η 1 and η 2 , used for determining acceptance of the trial step and updating the regularization weight, are standard parameters for trust-region methods. We choose p 1 = 0.01, p 2 = 0.25, η 1 = 2, η 2 = 0.25 by default. The parameter γ affects the convex combination weights in computing r k in Eq. (12), and we choose γ = 10 −4 . For the parameter c in the definition of pred k , we choose c = κ where κ < 1 is a Lipschitz constant for the function g, to satisfy the conditions for Theorems 2 and 3. We will derive the value of κ in each experiment. The initial regularization factor µ 0 is set to µ 0 = 1 unless stated otherwise. Concerning the number m of previous iterates used in an AA step, we can make the following observations: a larger m tends to reduce the number of iterations required for convergence, but also increases the computational cost per iteration; our experiments suggest that choosing 5 ≤ m ≤ 20 often achieves a good balance. For each experiment below, we will include multiple choices of m for comparison. Appendix C provides some further ablation studies for the parameters p 1 , p 2 , η 1 , η 2 , and c.

Logistic Regression
First, to compare our method with the RNA scheme proposed by [40], we consider the following logistic regression problem from [40] that optimizes a decision variable $x \in \mathbb{R}^n$:
$$\min_{x \in \mathbb{R}^n} F(x) = \frac{1}{N} \sum_{i=1}^{N} \log\big(1 + \exp(-b_i\, a_i^T x)\big) + \frac{\tau}{2} \|x\|^2, \qquad (34)$$
where $a_i \in \mathbb{R}^n$, $b_i \in \{-1, 1\}$ are the attributes and label of the data point $i$, respectively, and $\tau > 0$ is a regularization weight. Following [40], we consider the gradient descent solver $x_{k+1} = g(x_k) := x_k - \frac{1}{L_F} \nabla F(x_k)$ with a fixed step size $1/L_F$, where $L_F$ is the Lipschitz constant of $\nabla F$ and $A = [a_1, \ldots, a_N]^T \in \mathbb{R}^{N \times n}$. Then $g$ is Lipschitz continuous with a modulus $\kappa < 1$ and differentiable, which satisfies Assumption 2. We apply our approach (denoted by "LM-AA") and RNA to this solver, and compare their performance on two datasets: covtype (54 features, 581012 points) and sido0 (4932 features, 12678 points). For each dataset, we normalize the attributes and solve the problem with $\tau = L_F/10^6$ and $\tau = L_F/10^9$, respectively. For the implementation of RNA, we use the source code released by the authors of [40]. We set $\mu_0 = 100$ and test $m = 10, 15, 20$ for our method. RNA performs an acceleration step every $k$ iterations, and we test $k = 5, 10, 20$, respectively. All other RNA parameters are set to their default values as provided in the source code (in particular, with grid-search adaptive regularization weight and line search enabled). Fig. 1 plots for each method the normalized target function value $(F(x_k) - F^*)/F^*$ with respect to the iteration count and the computational time, where $F^*$ is the ground-truth global minimum computed by running each method until full convergence and taking the minimum function value among all methods. All variants of LM-AA and RNA accelerate the decrease of the target function compared with the original gradient descent solver, with our methods achieving an overall faster decrease.
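As a sanity check of this setup, the sketch below builds the gradient-descent fixed-point map on small synthetic data and iterates it to a fixed point. The data, the value of $\tau$, and the smoothness-constant formula $L_F = \|A\|_2^2/(4N) + \tau$ are illustrative assumptions, not the covtype/sido0 configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 5
A = rng.standard_normal((N, n))       # rows are the attributes a_i
b = np.sign(rng.standard_normal(N))   # labels b_i in {-1, 1}
tau = 0.1                             # regularization weight (illustrative)

# smoothness constant of grad F: the logistic-loss curvature is at most 1/4
L_F = np.linalg.norm(A, 2) ** 2 / (4 * N) + tau

def grad_F(x):
    # gradient of F(x) = (1/N) sum_i log(1 + exp(-b_i a_i^T x)) + (tau/2)||x||^2
    z = -b * (A @ x)
    return -(A.T @ (b / (1 + np.exp(-z)))) / N + tau * x

def g(x):
    # gradient-descent fixed-point map with step size 1/L_F
    return x - grad_F(x) / L_F

x = np.zeros(n)
for _ in range(500):
    x = g(x)
assert np.linalg.norm(g(x) - x) < 1e-8   # numerically a fixed point of g
```

Since $F$ is $\tau$-strongly convex and $L_F$-smooth, this map is a contraction, so the plain iteration converges; the acceleration methods in the experiment compete on how quickly the residual $\|g(x_k) - x_k\|$ decays.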

Image Reconstruction
Next, we consider a nonsmooth problem proposed in [51] for total variation based image reconstruction:
$$\min_{u, w}\ \sum_{i=1}^{N^2} \|w_i\| + \frac{\beta}{2} \sum_{i=1}^{N^2} \|w_i - D_i u\|^2 + \frac{\nu}{2} \|Ku - s\|^2, \qquad (36)$$
where $s \in [0, 1]^{N^2}$ is an $N \times N$ input image, $u \in \mathbb{R}^{N^2}$ is the output image to be optimized, $K \in \mathbb{R}^{N^2 \times N^2}$ is a linear operator, $D_i \in \mathbb{R}^{2 \times N^2}$ represents the discrete gradient operator at pixel $i$, $w = (w_1^T, \ldots, w_{N^2}^T)^T \in \mathbb{R}^{2N^2}$ are auxiliary variables for the image gradients, $\nu > 0$ is a fidelity weight, and $\beta > 0$ is a penalty parameter. The solver in [51] alternates minimization over $u$ and $w$ via the subproblems (37) and (38), whose solutions can both be computed in closed form. When $\beta$ and $\nu$ are fixed, this can be treated as a fixed-point iteration $w_{k+1} = g(w_k)$, and it satisfies Assumption 2 (see Appendix B.2 for a detailed derivation of $g$ and verification of Assumption 2). In the following, we consider the solver with $K = I$ and $\nu = 4$ for image denoising. In this case, condition (B.1) is satisfied with $c = 1 - (1 + 4\beta/\nu)^{-1}$ (see Appendix B.2 for the derivation). We apply this solver to a $1024 \times 1024$ image with added Gaussian noise that has a zero mean and a variance of $\sigma = 0.05$ (see Fig. 2). We use the source code released by the authors of [51] for the implementation of this solver, and apply our acceleration method with $m = 1, 3, 5$, respectively. For comparison, we also apply the RNA scheme with $k = 2, 5, 10$, respectively. Here we choose smaller values of $m$ and $k$ than in the logistic regression example because both RNA and our method have a high relative overhead on this problem, so larger values of $m$ or $k$ may induce overhead that offsets the performance gain from acceleration. As in the logistic regression example, we use the released source code of RNA for our experiments, and set all RNA parameters to their default values. Fig. 2 plots the residual norm $f(w)$ for all methods, with $\beta = 100$ and $\beta = 1000$, respectively.
All acceleration methods converge to a fixed point faster than the original alternating minimization solver, except RNA with $k = 2$, which is slower in terms of actual computational time due to the overhead of its grid search for the regularization parameter. Overall, the two acceleration approaches achieve rather similar performance on this problem.
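For intuition on the $w$-subproblem, its blockwise closed-form solution is the standard 2D shrinkage operator from half-quadratic TV splitting; the sketch below implements and checks it. The function name and the threshold $1/\beta$ are stated as assumptions about the solver's form, following the usual formulation in the TV literature.

```python
import numpy as np

def shrink(y, beta):
    # minimizer of ||w|| + (beta/2) ||w - y||^2 over w in R^2:
    # s_beta(y) = max(||y|| - 1/beta, 0) * y / ||y||   (and 0 for y = 0)
    ny = np.linalg.norm(y)
    if ny <= 1.0 / beta:
        return np.zeros_like(y)
    return (1.0 - 1.0 / (beta * ny)) * y

# below the threshold the gradient vector is shrunk to zero
assert np.allclose(shrink(np.array([0.3, 0.4]), beta=1.0), 0.0)

# above the threshold the norm is reduced by exactly 1/beta
w = shrink(np.array([3.0, 4.0]), beta=1.0)
assert np.isclose(np.linalg.norm(w), 4.0)

# optimality check against random perturbations of the minimizer
obj = lambda w, y, beta: np.linalg.norm(w) + beta / 2 * np.linalg.norm(w - y) ** 2
y = np.array([3.0, 4.0])
rng = np.random.default_rng(0)
assert all(obj(w, y, 1.0) <= obj(w + 0.1 * rng.standard_normal(2), y, 1.0)
           for _ in range(100))
```

The kink of the max-operation at $\|y\| = 1/\beta$ is exactly the source of nonsmoothness discussed in Appendix B.2.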

Nonnegative Least Squares
Finally, to compare our method with [17], we consider a nonnegative least squares (NNLS) problem that is used in [17] for evaluation:
$$\min_{x}\ \psi(x) + \phi(x), \qquad (39)$$
where $x = (x_1, x_2) \in \mathbb{R}^{2q}$, $\psi(x) = \|Hx_1 - t\|_2^2 + I_{x_2 \ge 0}(x_2)$, and $\phi(x) = I_{x_1 = x_2}(x)$, with $I_S$ being the indicator function of the set $S$. The Douglas-Rachford splitting (DRS) solver for this problem can be written as a fixed-point iteration $v_{k+1} = g(v_k)$ (40), where $v_k = (v_1^k, v_2^k) \in \mathbb{R}^{2q}$ is an auxiliary variable for DRS and $\beta$ is the penalty parameter. In [17], the authors use their regularized AA method (A2DR) to accelerate the DRS solver (40). To apply our method, we verify in Appendix B.3 that if $H$ is of full column rank, then $g$ satisfies condition (B.1) with a modulus expressed in terms of $c_1 = \max\left\{\frac{\beta\sigma_1 - 1}{\beta\sigma_1 + 1}, \frac{1 - \beta\sigma_0}{1 + \beta\sigma_0}\right\}$, where $\sigma_0$ and $\sigma_1$ are the minimal and maximal eigenvalues of $2H^TH$, respectively. Moreover, $g$ is also differentiable under a mild condition. We compare our method with A2DR on the solver (40), with the same AA parameters $m = 10, 15, 20$. The methods are tested using a $600 \times 300$ sparse random matrix $H$ with 1% nonzero entries and a random vector $t$. We use the source code released by the authors of [17] for the implementation of A2DR, and set all A2DR parameters to their default values. While A2DR and DRS are implemented using parallel evaluation of the proximity operators in the released A2DR code, we implement our method as a single-threaded application for simplicity. Fig. 3 plots the residual norm $f(v)$ for DRS and the two acceleration methods. It also plots the norm of the overall residual $r = (r_{\mathrm{prim}}, r_{\mathrm{dual}})$ used in [17] for measuring convergence, where $r_{\mathrm{prim}}$ and $r_{\mathrm{dual}}$ denote the primal and dual residuals as defined in Equations (7) and (8) of [17], respectively. For both residual measures, the original DRS solver converges slowly after the initial iterations, whereas the two acceleration methods achieve significant speedup.
Moreover, the single-threaded implementation of our method outperforms the parallel A2DR with the same $m$, in terms of both iteration count and computational time.
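A DRS fixed-point map for a problem of the form (39) can be sketched in the standard prox-reflect form. The update order, the toy data ($H = I$), and the helper names below are illustrative assumptions rather than the exact operator $g$ in (40).

```python
import numpy as np

q = 3
H = np.eye(q)                   # toy choice with full column rank
t = np.array([1.0, -2.0, 3.0])
beta = 1.0

def prox_psi(v):
    # blockwise prox of beta*psi: x1 from the least-squares term,
    # x2 from the projection onto the nonnegative orthant
    v1, v2 = v[:q], v[q:]
    x1 = np.linalg.solve(2 * beta * H.T @ H + np.eye(q), v1 + 2 * beta * H.T @ t)
    x2 = np.maximum(v2, 0.0)
    return np.concatenate([x1, x2])

def prox_phi(v):
    # prox of the indicator of {x1 = x2}: project both blocks onto their average
    a = 0.5 * (v[:q] + v[q:])
    return np.concatenate([a, a])

def g(v):
    # standard Douglas-Rachford map: v -> v + prox_phi(2 prox_psi(v) - v) - prox_psi(v)
    x = prox_psi(v)
    return v + prox_phi(2 * x - v) - x

v = np.zeros(2 * q)
for _ in range(500):
    v = g(v)
x = prox_phi(2 * prox_psi(v) - v)[:q]
# with H = I the NNLS solution is the projection max(t, 0)
assert np.allclose(x, np.maximum(t, 0.0), atol=1e-6)
```

The prox of the least-squares block is a linear solve, the prox of the constraint blocks are projections, which is why DRS iterations for NNLS are cheap and why accelerating the outer fixed-point loop pays off.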

Statistics of Successful Steps
Our acceptance mechanism plays a key role in achieving global and local convergence of the proposed method. To demonstrate its behavior, we provide statistics of the successful steps for the experiments in Figs. 1, 2 and 3. Specifically, for each instance of LM-AA, we count the total number of steps required to reach a certain level of accuracy and compare it with the number of successful AA steps among them. Tables 1-4 show the resulting statistics. Besides the total number of steps, we report the success rate, defined as the ratio between successful and total steps required to reach different levels of accuracy.
The results in Table 4 demonstrate that essentially all AA steps are accepted in the nonnegative least squares problem, independently of the choice of the parameter $m$. The success rate of AA steps only decreases, with more alternative fixed-point iterations performed, when we seek to solve the problem with the highest accuracy $\mathrm{tol} = 10^{-15}$. Table 3 illustrates that a similar behavior can also be observed for the image denoising problem (36) when setting $m = 1$. Notice that this high accuracy is close to machine precision, and hence this effect is mainly caused by numerical errors and inaccuracies that affect the computation and quality of an AA step. The results in Table 3 also demonstrate a second typical effect: the success rate of AA steps is often lower when the chosen accuracy is relatively low. With increasing accuracy, the rate then increases to around 70%-80%. This general observation is also supported by our results for logistic regression, see Tables 1 and 2. (Here the maximum success rate is more sensitive to the choice of $m$, $\tau$, and of the dataset.)

Table 1 Statistics of successful steps of LM-AA for the logistic regression problem (34) and the dataset covtype. In each of the columns "iter", we report the number of iterations required to satisfy $(F(x_k) - F^*)/F^* \le \mathrm{tol}$ with $\mathrm{tol} \in \{10^{-3}, 10^{-6}, 10^{-9}, 10^{-12}, 10^{-15}\}$. The columns "s-rate" show the corresponding success rate of AA steps, i.e., the ratio between successful and total steps required to reach the different accuracies.

Table 2 Statistics of successful steps of LM-AA for the logistic regression problem (34) and the dataset sido0; the columns are defined as in Table 1.

Table 3 Statistics of successful steps of LM-AA for the image denoising problem (36). In each of the columns "iter", we report the number of iterations required to satisfy $f(w_k) \le \mathrm{tol}$ with $\mathrm{tol} \in \{10^{-3}, 10^{-6}, 10^{-9}, 10^{-12}, 10^{-15}\}$; the columns "s-rate" are defined as in Table 1.

Table 4 Statistics of successful steps of LM-AA for the nonnegative least squares problem (39). In each of the columns "iter", we report the number of iterations required to satisfy $f(v_k) \le \mathrm{tol}$ with $\mathrm{tol} \in \{10^{-3}, 10^{-6}, 10^{-9}, 10^{-12}, 10^{-15}\}$; the columns "s-rate" are defined as in Table 1.

In summary, the statistics provided in Tables 1-4 support our theoretical results. The success rate of AA steps gradually increases as the iterates approach the fixed point, which indicates a transition to a pure regularized AA scheme. Furthermore, as more AA steps are rejected at the beginning of the iterative procedure, our globalization mechanism effectively guarantees global progress and convergence of the approach.

Conclusions
We propose a novel globalization technique for Anderson acceleration which combines adaptive quadratic regularization and a nonmonotone acceptance strategy. We prove the global convergence of our approach under mild assumptions. Furthermore, we show that the proposed globalized AA scheme has the same local convergence rate as the original AA iteration and that the globalization mechanism does not hinder the acceleration effect of AA. To our knowledge, this is one of the first AA globalization methods that achieves global convergence and fast local convergence simultaneously. Several numerical examples illustrate that our method is competitive and can improve the efficiency of a variety of numerical solvers.

B.2 Total Variation Based Image Reconstruction
The alternating minimization solver for the image reconstruction problem (36) can be written as a fixed-point iteration $w_{k+1} = g(w_k)$ (43), with $h(w) := DM^{-1}\big(D^T w + (\nu/\beta)K^T s\big)$, $D := (D_1^T, \ldots, D_{N^2}^T)^T$, and $M := D^T D + (\nu/\beta)K^T K$. Let us notice that the mapping $h$ and the fixed-point iteration (43) are well defined when the null spaces of $K$ and $D$ intersect only trivially, see, e.g., Assumption 1 of [51].
Next, we verify Assumption 2 for the mapping g in (43).
Proposition 2 Suppose that the operator $K^TK$ is invertible. Then, the spectral radius $\rho(T)$ of $T := DM^{-1}D^T$ fulfills $\rho(T) < 1$ and condition (B.1) is satisfied.
Proof Utilizing the Sherman-Morrison-Woodbury formula and the invertibility of $K^TK$, we obtain
$$T = DM^{-1}D^T = I - \big(I + \xi D(K^TK)^{-1}D^T\big)^{-1}, \qquad (44)$$
where $\xi := \beta/\nu$. Due to $\lambda_{\min}\big(I + \xi D(K^TK)^{-1}D^T\big) \ge 1$ and $\lambda_{\max}\big(I + \xi D(K^TK)^{-1}D^T\big) < \infty$, it holds that $\sigma\big((I + \xi D(K^TK)^{-1}D^T)^{-1}\big) \subset (0, 1]$, where $\sigma(\cdot)$ denotes the spectral set of a matrix. Combining this observation with (44), we have $\sigma(T) \subset [0, 1)$ and hence $\rho(T) < 1$. Furthermore, following the proof of Theorem 3.6 in [51], it holds that $g$ is Lipschitz continuous with modulus $\rho(T)$, and hence, assumption (B.1) is satisfied.
Concerning assumption (B.2), it can be shown that $g$ is twice continuously differentiable on the set $W$ of points at which the max-operation in the shrinkage operator $s_\beta$ is not active. Moreover, since $h$ is continuous, the set $W$ is open. Consequently, for every point $w \in W$, we can find a bounded open neighborhood $N(w)$ of $w$ such that $N(w) \subset W$. Hence, if the mapping $g$ has a fixed point $w^*$ satisfying $w^* \in W$, then we can infer that $g$ is differentiable on $N(w^*)$ and assumption (B.2) has to hold at $w^*$. Furthermore, the stronger assumption (C.1) for Corollary 1 is satisfied as well in this case. Finally, if $K$ is the identity matrix, then the finite difference matrix $D$ satisfies $\|D\| \le 2$ and we can infer $\rho(T) \le 1 - (1 + 4\beta/\nu)^{-1}$, which justifies the choice of $c$ in our algorithm.
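The identity and the spectral bound used in the proof of Proposition 2 can be checked numerically. The sketch below verifies the push-through/Sherman-Morrison-Woodbury identity $T = DM^{-1}D^T = I - (I + \xi D(K^TK)^{-1}D^T)^{-1}$ and $\rho(T) < 1$ on small random matrices; the sizes and data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 8, 6                                         # D is p x n, K is n x n (arbitrary)
D = rng.standard_normal((p, n))
K = rng.standard_normal((n, n)) + 3.0 * np.eye(n)   # shift makes K^T K invertible
beta, nu = 10.0, 4.0
xi = beta / nu

M = D.T @ D + (nu / beta) * (K.T @ K)
T = D @ np.linalg.solve(M, D.T)

# push-through / Sherman-Morrison-Woodbury identity:
# T = I - (I + xi * D (K^T K)^{-1} D^T)^{-1}
P = xi * D @ np.linalg.solve(K.T @ K, D.T)
assert np.allclose(T, np.eye(p) - np.linalg.inv(np.eye(p) + P))

# P is symmetric positive semidefinite, so the eigenvalues of T lie in [0, 1)
rho = np.max(np.abs(np.linalg.eigvals(T)))
assert rho < 1.0
```

Since $I + P$ has all eigenvalues at least $1$, its inverse has eigenvalues in $(0, 1]$, which is exactly the argument for $\sigma(T) \subset [0, 1)$ used above.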

B.3 Nonnegative Least Squares
We first note that, given the specific form of $\phi$, the proximity operator of $\phi$ can be calculated explicitly; here and below, $P_{[0,\infty)^q}$ denotes the Euclidean projection onto the nonnegative orthant $[0, \infty)^q$.
In the next proposition, we give a condition to establish (B.1) for g.

Proposition 3
Let $\sigma_0$ and $\sigma_1$ denote the minimum and maximum eigenvalue of $2H^TH$, respectively, and suppose $\sigma_0 > 0$. Then, the mapping $g$ is Lipschitz continuous with a modulus determined by $c_1 = \max\left\{\frac{\beta\sigma_1 - 1}{\beta\sigma_1 + 1}, \frac{1 - \beta\sigma_0}{1 + \beta\sigma_0}\right\}$. Proof We can explicitly calculate $g$ as in (45), where $R^i_\beta = 2\,\mathrm{prox}_{\beta\psi_i} - I$, $i = 1, 2$. By Proposition 4.2 of [4], the reflected operators $R^1_\beta$ and $R^2_\beta$ are nonexpansive. Moreover, since $\sigma_0 > 0$, $\psi_1$ is strongly convex with modulus $\sigma_0$ and $\sigma_1$-smooth. Then, by Theorem 1 of [19], $R^1_\beta$ is Lipschitz continuous with modulus $c_1$. Next, for any $v, \bar{v} \in \mathbb{R}^{2q}$, the claimed estimate follows from the Cauchy-Schwarz inequality, the nonexpansiveness of $R^2_\beta$, and the Lipschitz continuity of $R^1_\beta$; the estimate in the second-to-last line follows from Young's inequality.
Hence, assumption (B.1) is satisfied if $H$ has full column rank. Using the special form of the mapping $g$ in (45), we see that $g$ is twice continuously differentiable at every point $v$ for which none of the components of $v_2$ are zero; we denote the set of such points by $V$. As before, we can then infer that assumption (B.2) and the stronger condition (C.1) have to hold at every fixed point $v^*$ of $g$ satisfying $v^* \in V$.
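The modulus $c_1$ of the reflected operator $R^1_\beta$ can be verified numerically for the quadratic block $\psi_1(x) = \|Hx - t\|_2^2$, whose proximity operator is affine. The data sizes and $\beta$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
q = 6
H = rng.standard_normal((10, q))     # full column rank (almost surely)
beta = 0.5

Sigma = 2.0 * H.T @ H                # psi_1(x) = ||Hx - t||^2 has Hessian 2 H^T H
sig = np.linalg.eigvalsh(Sigma)
sigma0, sigma1 = sig[0], sig[-1]

# prox_{beta psi_1} is affine with linear part (I + beta Sigma)^{-1}, so the
# reflected operator R = 2 prox - I has linear part 2 (I + beta Sigma)^{-1} - I
R_lin = 2.0 * np.linalg.inv(np.eye(q) + beta * Sigma) - np.eye(q)
lip = np.linalg.norm(R_lin, 2)       # Lipschitz modulus of R^1_beta

c1 = max((beta * sigma1 - 1) / (beta * sigma1 + 1),
         (1 - beta * sigma0) / (1 + beta * sigma0))
assert np.isclose(lip, c1)
assert 0 <= c1 < 1
```

The eigenvalues of the reflected map are $(1 - \beta\sigma)/(1 + \beta\sigma)$ over the eigenvalues $\sigma$ of $2H^TH$; since this function is monotone in $\sigma$, the largest absolute value is attained at $\sigma_0$ or $\sigma_1$, which is exactly the two-term maximum defining $c_1$.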

B.4 Further Extensions
We now formulate a possible extension of the conditions presented in Section B.1 to the nonsmooth setting.
If the mapping $g$ has more structure and is connected to an underlying optimization problem, as in forward-backward and Douglas-Rachford splitting, nonsmoothness of $g$ typically results from proximity or projection operators. In such a case, further theoretical tools are available, and for certain function classes it is possible to fully characterize the differentiability of $g$ at $x^*$ via a so-called strict complementarity condition. In fact, the conditions $w^* \in W$ and $v^* \in V$ from Sections B.2 and B.3 are equivalent to such a strict complementarity condition. In the case of forward-backward splitting, a related and in-depth discussion of this important observation is provided in [41, 24], and we refer the interested reader to [31, 28, 21, 41] for further background.
Concerning the stronger assumption (C.1), we can establish the following characterization. Suppose that $g$ is locally Lipschitz continuous and consider the properties:
- The function $g$ is differentiable at $x^*$ and strongly semismooth at $x^*$.
- There exists an outer Lipschitz continuous selection $M^* : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ of $\partial g$ in a neighborhood of $x^*$, i.e., for all $x$ sufficiently close to $x^*$ we have $M^*(x) \in \partial g(x)$ and $\|M^*(x) - M^*(x^*)\| = O(\|x - x^*\|)$.
Then, the mapping $g$ satisfies the condition (C.1) at $x^*$.
Proof The combination of differentiability and semismoothness implies that $g$ is strictly differentiable at $x^*$ and, as a consequence, Clarke's subdifferential at $x^*$ reduces to the singleton $\partial g(x^*) = \{g'(x^*)\}$. We refer to [36, 37, 28] for further details. Thus, we can infer $M^*(x^*) = g'(x^*)$ and we obtain
$$\|g(x) - g(x^*) - g'(x^*)(x - x^*)\| \le \|g(x) - g(x^*) - M^*(x)(x - x^*)\| + \|(M^*(x) - M^*(x^*))(x - x^*)\| = O(\|x - x^*\|^2)$$
for $x \to x^*$, where we used the strong semismoothness and the outer Lipschitz continuity in the last step. This establishes (C.1).
The class of strongly semismooth functions is rather rich and includes, e.g., piecewise twice continuously differentiable (PC 2 ) functions [48], eigenvalue and singular value functions [43,44], and certain spectral operators of matrices [11]. Let us further note that the stated selection property is always satisfied when g is a piecewise linear mapping. In this case, the sets ∂g(x) are polyhedral and outer Lipschitz continuity follows from Theorem 3D.1 of [12].

C Ablation Study on Parameter Choices
This subsection provides more numerical experiments on the parameters of our method, using the examples given in Figs. 1, 2 and 3 of the paper. In each experiment, we run our method by varying a subset of the parameters while keeping all other parameters the same as in the original figures, to evaluate how the varied parameters influence the performance of our method. For the evaluation, we plot the same convergence graphs as in the original figures to compare the performance resulting from different parameter choices. The evaluation is performed on the parameters $c$, $(p_1, p_2)$, $(\eta_1, \eta_2)$, and $\mu_0$. We first consider the logistic regression problem in Fig. 1 with $m = 15$ and $\tau = L_F \times 10^{-6}$, using the parameters of Fig. 1 as the baseline. Overall, the performance of LM-AA is rather robust with respect to the choice of $c$; even values of $c$ smaller than our default can perform well, which can occur when we have over-estimated the Lipschitz constant $\kappa$ or when the acceleration effect of AA steps is much faster than $\kappa$ would suggest.
The results in Figs. 5, 9, and 13 demonstrate that the performance of LM-AA is not overly affected by the choice of the trust-region parameters $p_1$ and $p_2$ either. In general, good performance can be achieved if $p_1$ is moderately small and $p_2$ is not too large. Thus, we decide to work with the standard choice $p_1 = 0.01$ and $p_2 = 0.25$. In comparison, the trust-region parameters $\eta_1$ and $\eta_2$ can have a more significant impact on the performance of our algorithm. While the performance of LM-AA is not sensitive to the choice of $\eta_1$ and $\eta_2$ in the logistic regression problem using the dataset sido0 (Fig. 6) and in the denoising problem (Fig. 10), more variation can be seen in the remaining two examples. In general, the performance seems to deteriorate when $\eta_1$ and $\eta_2$ are chosen very close to each other. The standard choice $\eta_1 = 2$ and $\eta_2 = 0.25$ again achieves convincing performance on all numerical examples and strikes a good balance between increasing and decreasing the weight parameter $\lambda_k$.
The convergence plots for different values of $\mu_0$ are shown in Figs. 7, 11, and 15. Our observations are again similar: the performance of LM-AA on the logistic regression problem for sido0 (Fig. 7) and on the denoising problem (Fig. 11) is very robust with respect to the choice of $\mu_0$. In the nonnegative least squares problem, $\mu_0$ appears to mainly affect the final convergence stage of the algorithm, i.e., different choices of $\mu_0$ can lead to an earlier jump to a level of higher accuracy. Overall, the parameters $\mu_0 = 1$ (for denoising and NNLS) and $\mu_0 = 100$ (for logistic regression) yield the most robust results.

D Ablation Study on Permutation Strategy
We also provide an ablation study for the permutation strategy in line 6 of Algorithm 1. In general, when the parameter $\gamma$ is small, our nonmonotone globalization strategy is close to a monotone criterion on the residual. In this case, the minimal-residual iteration $k_0$ mostly coincides with the current iteration number $k$, and the permutation makes little difference. However, when $\gamma$ is relatively large, the usage of permutations can cause essential differences in the numerical performance. In particular, when the current trial step is rejected, Algorithm 1 performs $x_{k+1} = g_k$ as the next iterate if no permutation is used. In general, the update $x_{k+1} = g_k$ can be worse than $x_{k+1} = g_{k_0}$, since $x_{k_0}$ has the smallest residual among the $m$ latest iterations. We test Algorithm 1 without permutation in Figure 16 for the logistic regression experiment. We set $\gamma = 0.05$ in Figure 16 for LM-AA and keep all other parameters unchanged. As can be seen from the figure, the permutation improves the overall convergence and performance.
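The fallback rule behind the permutation can be illustrated with a toy sketch: on rejection, apply $g$ to the stored iterate with the smallest residual instead of the latest one. The helper names here are hypothetical, and the sketch omits all of Algorithm 1's other bookkeeping.

```python
def fallback_iterate(history, g, residual):
    # On rejection, pick the stored iterate with the smallest residual among
    # the latest m iterates and apply g to it (the permutation idea), rather
    # than always applying g to the most recent iterate.
    k0 = min(range(len(history)), key=lambda i: residual(history[i]))
    return g(history[k0])

# toy check: g halves the input, residual measures distance to the fixed point 0
g = lambda x: 0.5 * x
residual = lambda x: abs(g(x) - x)
hist = [8.0, 2.0, 4.0]
assert fallback_iterate(hist, g, residual) == 1.0   # picks x = 2.0, returns g(2.0)
```

In the toy history, the plain fallback would continue from the last iterate 4.0 and return 2.0, while the minimal-residual choice continues from 2.0 and returns 1.0, mirroring why the permutation can help after a rejected trial step.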