1 Introduction

Two general approaches are typical for the solution of the bilevel optimization problem

$$\begin{aligned} \min _{\alpha \in \mathscr {A}}~J(S_u(\alpha )) + R(\alpha ) \quad \text {with}\quad S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits _{u\in U} F(u; \alpha ) \end{aligned}$$
(1)

in Hilbert spaces \(\mathscr {A}\) and U. The first, familiar from the treatment of general mathematical programs with equilibrium constraints (MPECs), is to write out the Karush–Kuhn–Tucker conditions for the whole problem in a suitable form, and to apply a Newton-type method or other nonlinear equation solver to them [1,2,3,4,5].

The second approach, common in the application of (1) to inverse problems and imaging [6,7,8,9,10,11,12], treats the solution mapping \(S_u\) as an implicit function. On each outer iteration k it is then necessary to (i) solve the inner problem \(\min _u F(u; \alpha ^k)\) near-exactly using an optimization method of choice; (ii) solve an adjoint equation to calculate the gradient of the solution mapping; and (iii) apply another optimization method of choice to the outer problem \(\min _\alpha J(S_u(\alpha ))\) using the knowledge of \(S_u(\alpha ^k)\) and \(\nabla S_u(\alpha ^k)\). The inner problem is therefore generally assumed to have a unique solution, and the solution map to be differentiable. An algorithm for nonsmooth inner problems has been developed in [13], while [14] rely on proving directional Bouligand differentiability for otherwise nonsmooth problems.

The challenge of the first “whole-problem” approach is to scale it to large problems, typically involving the inversion of large matrices. The difficulty with the second “implicit function” approach is that the inner problem needs to be solved several times, which can be expensive. Solving the adjoint equation also requires matrix inversion. The variant in [15] avoids this through derivative-free methods for the outer problem. It also solves the inner problem to a low but controlled accuracy.
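To make the cost structure of the implicit-function approach concrete, the following minimal Python sketch (our own toy illustration with a scalar parameter and a quadratic inner problem; all names, dimensions, and step lengths are our assumptions, not taken from the cited works) performs the three stages on each outer iteration. The near-exact inner solve and the adjoint solve inside the loop are precisely the repeated linear-algebra costs discussed above.

```python
import numpy as np

# Toy instance: inner problem F(u; a) = 0.5*||u - z||^2 + 0.5*a*||D u||^2 with a
# scalar parameter a >= 0, outer fitness J(u) = 0.5*||u - b||^2, and R = 0.
n = 50
rng = np.random.default_rng(0)
b = np.sin(np.linspace(0.0, 3.0, n))            # "ground truth"
z = b + 0.1 * rng.standard_normal(n)            # noisy data
D = np.eye(n, k=1) - np.eye(n)                  # forward differences

def solve_inner(a):                             # S_u(a); here a closed-form linear solve
    return np.linalg.solve(np.eye(n) + a * D.T @ D, z)

def implicit_function_method(a=1.0, sigma=0.5, outer_iters=200):
    """Double-loop "implicit function" approach: (i) near-exact inner solve,
    (ii) adjoint solve with the inner Hessian, (iii) outer gradient step."""
    for _ in range(outer_iters):
        u = solve_inner(a)                                  # (i) inner problem
        hess = np.eye(n) + a * D.T @ D                      # nabla_u^2 F(u; a)
        mixed = D.T @ D @ u                                 # nabla_{alpha u} F(u; a)
        p = -np.linalg.solve(hess, mixed)                   # (ii) adjoint equation
        a = max(a - sigma * float(p @ (u - b)), 0.0)        # (iii) outer gradient step
    return a

print(implicit_function_method())
```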

In this paper, by preconditioning the implicit-form first-order optimality conditions, we develop an intermediate approach more efficient than the aforementioned, as we demonstrate in the numerical experiments of Sect. 4. It can be summarized as (i) take only one step of an optimization method on the inner problem, (ii) perform a cheap operation to advance towards the solution of the adjoint equation, and, finally, (iii) using this approximate information, take one step of an optimization method for the outer problem. Repeat.

The preconditioning, which we introduce in detail in Sect. 2, is based on insight from the derivation of the primal-dual proximal splitting of [16] as a preconditioned proximal point method [17,18,19]. We write the optimality conditions for (1) as the inclusion \(0 \in H(x)\) for a set-valued H, where \(x=(u,p,\alpha )\) for an adjoint variable p. The basic proximal point method then iteratively solves \(x^{k+1}\) from

$$\begin{aligned} 0 \in H(x^{k+1}) + (x^{k+1}-x^k). \end{aligned}$$

This can be as expensive as solving the original optimality condition. The idea then is to introduce a preconditioning operator M that decouples the components of x—in our case u, p and \(\alpha \)—such that each component can be solved in succession from

$$\begin{aligned} 0 \in H(x^{k+1}) + M(x^{k+1}-x^k). \end{aligned}$$

Gradient steps can be handled through nonlinear preconditioning [18, 19], as we will see in Sect. 2 when we develop the approach in detail along with two more specific algorithms, the FEFB (Forward-Exact-Forward-Backward) and the FIFB (Forward-Inexact-Forward-Backward). In Sect. 3 we prove their local linear convergence under a second-order growth condition on the composed objective \(J \circ S_u\), and other more technical conditions. The proof is based on the “testing” approach developed in [18] and also employed extensively in [19, 20]. Finally, we evaluate the numerical performance of the proposed schemes on imaging applications in Sect. 4, specifically the learning of a regularization parameter for total variation denoising, and the convolution kernel for deblurring. Since the purpose of these experiments is a simple performance comparison between different algorithms, instead of real applications, we only use a single training sample of various dimensions, as explained in Sect. 4.

Intermediate approaches, some reminiscent of ours, have recently also been developed in the machine learning community. Our approach, however, allows a non-smooth function R in the outer problem (1). Moreover, to our knowledge, our work is the first to show linear convergence for a fully “single-loop” algorithm. To be more precise, the STABLE [21], TTSA [22], FLSA [23], MRBO, VRBO [24], and SABA [25] are “single-loop” algorithms like ours, taking only a single step towards the solution of the inner problem on each outer iteration. The STABLE requires solving the adjoint equation exactly, as does our first approach, the FEFB. The others use a Neumann series approximation for the adjoint equation. Our second approach, the FIFB, takes a simple step reminiscent of gradient descent for the adjoint equation. The TTSA and STABLE obtain sublinear convergence of the outer iterates \(\{\alpha ^k\}_{k \in \mathbb {N}}\) assuming strong convexity (second-order growth) of both the inner and outer objective. For the SABA, linear convergence is similarly claimed with the outer strong convexity replaced by a Polyak-Łojasiewicz inequality. Without either of those assumptions, the theoretical results on the aforementioned methods from the literature are much weaker, and generally only show various forms of “stall” of the iterates at a sublinear rate, or the ergodic convergence of the gradient \(\nabla _{\alpha }[J\circ S_u](\alpha ^k)\) of the composed objective to zero. Such modes of convergence say very little about the convergence of function values to the optimum or of the iterates to a solution.

In the context of not fully single-loop algorithms, the AID, ITD [26], AccBio [27], and ABA [28] take a fixed (small) number of inner iterations on each outer iteration. For the AID and ITD, only sublinear convergence of the composed gradient is claimed. For the ABA and AccBio, linear convergence of outer function values is claimed under strong convexity of both the inner and outer objectives.

1.1 Fundamentals and applications

Fundamentals of MPECs and bilevel optimization are treated in the books [29,30,31,32]. An extensive literature review up to 2018 can be found in [33], and recent developments in [34]. Optimality conditions for bilevel problems, both necessary and sufficient, are developed in, e.g., [35,36,37,38,39]. A more limited type of “bilevel” problems only constrains \(\alpha \) to lie in the set of minimisers of another problem. Algorithms for such problems are treated in [40, 41].

Bilevel optimization has been used for learning regularization parameters and forward operators for inverse imaging problems. With total variation regularization in the inner problem, the parameter learning problem in its most basic form reads [7]

$$\begin{aligned} \min _{\alpha }~\frac{1}{2}\Vert S_u(\alpha )-b\Vert ^2 + R(\alpha ) \quad \text {with}\quad S_u(\alpha ) = \mathop {\mathrm {arg\,min}}\limits _{u\in U} \frac{1}{2}\Vert A_\alpha u-z\Vert ^2 + \alpha _1\Vert \nabla u\Vert _1. \end{aligned}$$

This problem finds the best possible \(\alpha \) for reconstructing the “ground truth” image b from the measurement data z, which may be noisy and possibly transformed and only partially known through the forward operator \(A_\alpha \), mapping images to measurements. To generalize to multiple images, the outer problem would sum over them and corresponding inner problems [12]. Multi-parameter regularization is discussed in [42], and natural conditions for \(\alpha >0\) in [43]. In other works, the forward operator \(A_\alpha \) is learned for blind image deblurring [44] or undersampling in magnetic resonance imaging [11]. In [8] regularization kernels are learned, while [14, 45] study the learning of optimal discretisation schemes. To circumvent the non-differentiability of \(S_u\), [46] replace the inner problem with a fixed number of iterations of an algorithm. Their approach has connections to the learning of deep neural networks.
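The inner problem above is nonsmooth because of the 1-norm of the image gradient; a common way to fit it into the twice-differentiable framework used in Sect. 2 is to smooth that norm. The following sketch (our own illustration; the Charbonnier-type smoothing parameter gamma, the discretization, and all names are assumptions, not taken from the text) evaluates such a smoothed total variation inner objective and its gradient for a 2-D image, together with a finite-difference check of the gradient.

```python
import numpy as np

def tv_inner_value_and_gradient(u, z, alpha, gamma=1e-2):
    """F(u; alpha) = 0.5*||u - z||^2 + alpha * sum_ij sqrt(|grad u|_ij^2 + gamma^2)
    and its gradient for a 2-D image u; gamma > 0 smooths the 1-norm of the
    image gradient so that F(., alpha) is twice differentiable."""
    dx = np.diff(u, axis=0, append=u[-1:, :])     # forward differences with
    dy = np.diff(u, axis=1, append=u[:, -1:])     # replicated (Neumann) boundary
    mag = np.sqrt(dx**2 + dy**2 + gamma**2)
    value = 0.5 * np.sum((u - z)**2) + alpha * np.sum(mag)
    px, py = dx / mag, dy / mag
    # discrete divergence (negative adjoint of the difference operator) of (px, py):
    div = (px - np.roll(px, 1, axis=0)) + (py - np.roll(py, 1, axis=1))
    grad = (u - z) - alpha * div
    return value, grad

# quick finite-difference check of the gradient
rng = np.random.default_rng(0)
u, z = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
e = np.zeros((8, 8)); e[3, 4] = 1.0
f0, g = tv_inner_value_and_gradient(u, z, alpha=0.1)
f1, _ = tv_inner_value_and_gradient(u + 1e-6 * e, z, alpha=0.1)
print((f1 - f0) / 1e-6, g[3, 4])                  # the two numbers should agree closely
```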

Bilevel problems can also be seen as leader–follower or Stackelberg games: the outer problem or agent leads by choosing \(\alpha \), and the inner agent reacts with the best possible u for that \(\alpha \). Multiple-agent Nash equilibria may also be modeled as bilevel problems. Both types of games can be applied to financial markets and resource use planning; we refer to the aforementioned books [29,30,31,32] for specific examples.

1.2 Notation and basic concepts

We write \(\mathbb {L}(X; Y)\) for the space of bounded linear operators between the normed spaces X and Y and \({{\,\textrm{Id}\,}}\) for the identity operator. Generally X will be Hilbert, so we can identify it with the dual \(X^*\).

For \(G \in C^1(X)\), we write \(G'(x) \in X^*\) for the Fréchet derivative at x, and \(\nabla G(x) \in X\) for its Riesz representation, i.e., the gradient. For \(E \in C^1(X; Y)\), since \(E'(x) \in \mathbb {L}(X; Y)\), we use the Hilbert adjoint to define \(\nabla E(x) :=E'(x)^* \in \mathbb {L}(Y; X)\). Then the Hessian \(\nabla ^2 G(x) :=\nabla [\nabla G](x) \in \mathbb {L}(X; X)\). When necessary we indicate the differentiation variable with a subscript, e.g., \(\nabla _u F(u, \alpha )\). For convex \(R: X \rightarrow \overline{\mathbb {R}}\), we write \({{\,\textrm{dom}\,}}R\) for the effective domain and \(\partial R(x)\) for the subdifferential at x. With slight abuse of notation, we identify \(\partial R(x)\) with the set of Riesz representations of its elements. We define the proximal operator as \({{\,\textrm{prox}\,}}_R(x) :=\mathop {\mathrm {arg\,min}}\limits _z \frac{1}{2}\Vert z-x\Vert ^2 + R(z)=({{\,\textrm{Id}\,}}+\partial R)^{-1}(x)\).
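For instance, for \(R = \beta \Vert \,\varvec{\cdot }\,\Vert _1\) on \(\mathbb {R}^n\) the proximal operator is the componentwise soft-thresholding map; a minimal sketch of this standard fact (our own code, not from the text):

```python
import numpy as np

def prox_l1(x, t):
    """prox_{t*||.||_1}(x) = argmin_z 0.5*||z - x||^2 + t*||z||_1  (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))   # [ 1.  -0.   0.2]
```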

We write \(\langle x,y\rangle \) for an inner product, and \(B(x, r)\) for a closed ball in a relevant norm \(\Vert \,\varvec{\cdot }\,\Vert \). For self-adjoint positive semi-definite \(M\in \mathbb {L}(X; X)\) we write \(\Vert x\Vert _{M} :=\sqrt{\langle x,x\rangle _{M}} :=\sqrt{\langle Mx,x\rangle }.\) Pythagoras’ or three-point identity then states

$$\begin{aligned} \langle x-y,x-z\rangle _{M} = \frac{1}{2}\Vert x-y\Vert ^2_M - \frac{1}{2}\Vert y-z\Vert ^2_M + \frac{1}{2}\Vert x-z\Vert ^2_M \end{aligned}$$
(2)

for all \(x,y,z\in X\). We extensively use Young’s inequality

$$\begin{aligned} \langle x,y\rangle \le \frac{a}{2}\Vert x\Vert ^2 + \frac{1}{2a}\Vert y\Vert ^2 \qquad \text {for all } x,y\in X,\, a > 0. \end{aligned}$$

We sometimes apply operations on \(x \in X\) to all elements of a set \(A \subset X\), writing \(\langle x+A,z\rangle :=\{\langle x+a,z\rangle \mid a \in A \}\), and for \(B \subset \mathbb {R}\), writing \(B \ge c\) if \(b \ge c\) for all \(b \in B.\)

2 Proposed methods

We now present our proposed methods for (1). They are based on taking a single gradient descent step for the inner problem, and using forward-backward splitting for the outer problem. The two methods differ in how an “adjoint equation” is handled. We present the algorithms and assumptions required to prove their convergence in Sects. 2.2 and 2.3 after deriving optimality conditions and the adjoint equation in Sect. 2.1. We prove convergence in Sect. 3.

2.1 Optimality conditions

Suppose \(u\mapsto F(u;\alpha )\in C^2(U)\) is proper, coercive, and weakly lower semicontinuous for each outer variable \(\alpha \in {{\,\textrm{dom}\,}}R \subset \mathscr {A}\). Then the direct method of the calculus of variations guarantees the inner problem \(\min _u F(u; \alpha )\) to have a solution. If, further, \(u\mapsto F(u;\alpha )\) is strictly convex, the solution is unique so that the solution mapping \(S_u\) from (1) is uniquely determined.

Suppose further that \(F, \nabla F\) and \(S_u\) are Fréchet differentiable. Writing \(T(\alpha ) :=(S_u(\alpha ), \alpha )\), Fermat’s principle and \(S_u(\tilde{\alpha }) \in \mathop {\mathrm {arg\,min}}\limits _u F(u; \tilde{\alpha })\) then show that

$$\begin{aligned}{}[\nabla _{u} F\circ T] (\alpha )= \nabla _{u} F(S_u(\alpha ); \alpha ) =0 \end{aligned}$$
(3)

for \(\alpha \) near \(\tilde{\alpha }\). Therefore, the chain rule for Fréchet differentiable functions yields

$$\begin{aligned} 0=\nabla _{\alpha }[\nabla _{u} F \circ T](\alpha ) = \nabla _{\alpha }S_u(\alpha ) \nabla _{u}^2 F(T(\alpha )) + \nabla _{\alpha u} F(T(\alpha )). \end{aligned}$$

That is, for \(u=S_u(\alpha )\), the operator \(p=\nabla _{\alpha }S_u(\alpha )\) solves the adjoint equation

$$\begin{aligned} 0=p \nabla _{u}^2 F(u; \alpha ) + \nabla _{\alpha u} F(u; \alpha ). \end{aligned}$$
(4)

We introduce the corresponding solution mapping for the adjoint variable p,

$$\begin{aligned} S_p(u,\alpha ) := - \nabla _{\alpha u} F(u; \alpha ) \left( \nabla _u^2 F(u; \alpha )\right) ^{-1}. \end{aligned}$$
(5)

We will later make assumptions that ensure that \(S_p\) is well-defined.
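In finite dimensions, say \(U = \mathbb {R}^n\) and \(\mathscr {A}= \mathbb {R}^m\), the adjoint variable is an \(m \times n\) matrix and (5) can be evaluated with a linear solve against the Hessian instead of forming an explicit inverse. A minimal sketch (shapes and names are our assumptions):

```python
import numpy as np

def S_p(hess_uu, mixed_alpha_u):
    """Adjoint solution map (5), S_p(u, alpha) = -nabla_{alpha u} F (nabla_u^2 F)^{-1},
    with nabla_u^2 F an (n, n) matrix and nabla_{alpha u} F an (m, n) matrix;
    computed via a linear solve with the transposed Hessian."""
    return -np.linalg.solve(hess_uu.T, mixed_alpha_u.T).T
```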

Since \(S_u: \mathscr {A}\rightarrow U\), the Fréchet derivative \(S_u'(\alpha ) \in \mathbb {L}(\mathscr {A}; U)\) and the Hilbert adjoint \(\nabla _\alpha S_u(\alpha ) \in \mathbb {L}(U; \mathscr {A})\) for all \(\alpha \). Consequently \(p \in \mathbb {L}(U; \mathscr {A})\), but we will need p to lie in an inner product space. Assuming \(\mathscr {A}\) to be a separable Hilbert space, we introduce such a structure

$$\begin{aligned} P=(\mathbb {L}(U; \mathscr {A}), \langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle ) \end{aligned}$$
(6a)

by using a countable orthonormal basis \(\{\varphi _i\}_{i\in I}\) of \(\mathscr {A}\) to define the inner product

$$\begin{aligned} \langle \hspace{-2.0pt}\langle p_1, p_2 \rangle \hspace{-2.0pt}\rangle :=\sum _{i\in I} \langle p_1^* \varphi _i,p_2^* \varphi _i\rangle = \sum _{i\in I} \langle \varphi _i,p_1 p_2^* \varphi _i\rangle . \quad (p_1, p_2 \in \mathbb {L}(U; \mathscr {A})). \end{aligned}$$
(6b)

We briefly study this inner product and the induced norm \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) in the appendix.
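For finite-dimensional \(\mathscr {A}= \mathbb {R}^m\) and \(U = \mathbb {R}^n\) with the standard basis, (6b) is simply the Frobenius inner product of the matrix representations; a quick numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5                                   # dim A = m, dim U = n
p1, p2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
e = np.eye(m)                                 # orthonormal basis of A
basis_sum = sum((p1.T @ e[:, i]) @ (p2.T @ e[:, i]) for i in range(m))   # (6b)
print(np.isclose(basis_sum, np.sum(p1 * p2)))  # Frobenius inner product: True
```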

By the sum rule for Clarke subdifferentials (denoted \(\partial _C\)) and their compatibility with convex subdifferentials and Fréchet differentiable functions [47], we obtain

$$\begin{aligned} \partial _C (J \circ S_u+R)(\widehat{\alpha }) = \nabla _{\alpha }(J \circ S_u)(\widehat{\alpha }) + \partial R(\widehat{\alpha }) = \nabla _{\alpha }S_u(\widehat{\alpha })\nabla _{u}J(S_u(\widehat{\alpha })) + \partial R(\widehat{\alpha }). \end{aligned}$$

The Fermat principle for Clarke subdifferentials then furnishes the necessary optimality condition

$$\begin{aligned} 0 \in \nabla _{\alpha }(J \circ S_u)(\widehat{\alpha }) + \partial R(\widehat{\alpha })= \nabla _{\alpha }S_u(\widehat{\alpha })\nabla _{u}J(S_u(\widehat{\alpha })) + \partial R(\widehat{\alpha }). \end{aligned}$$
(7)

We combine (3), (4) and (7) as the inclusion

$$\begin{aligned} 0 \in H(\widehat{u}, \widehat{p}, \widehat{\alpha }) \end{aligned}$$
(8)

with

$$\begin{aligned} H(u,p,\alpha ):= \begin{pmatrix} \nabla _{u} F(u; \alpha ) \\ p \nabla _{u}^2 F(u; \alpha ) + \nabla _{\alpha u} F(u; \alpha ) \\ p\nabla _{u}J(u) + \partial R(\alpha ) \end{pmatrix} \end{aligned}$$
(9)

This is the optimality condition that our proposed methods, presented in Sects. 2.2 and 2.3, attempt to satisfy. We generally abbreviate

$$\begin{aligned} x=(u,p,\alpha ), \quad \widehat{x}=(\widehat{u},\widehat{p},\widehat{\alpha }), \quad \text {etc.} \end{aligned}$$

2.2 Algorithm: forward-exact-forward-backward

Our first strategy for solving (8) takes just a single gradient descent step for the inner problem, solves the adjoint equation exactly, and then takes a forward-backward step for the outer problem. We call this Algorithm 2.1 the FEFB (forward-exact-forward-backward).

Algorithm 2.1
Forward-exact-forward-backward (FEFB) method (listing rendered as a figure in the source)
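Since the listing is rendered as a figure in the source, the following Python sketch reconstructs one FEFB iteration from the surrounding text and the implicit form (10) below: a gradient step on the inner problem, an exact adjoint solve via \(S_p\) from (5), and a forward-backward step on the outer problem. The function handles, finite-dimensional matrix shapes, and names are our assumptions.

```python
import numpy as np

def fefb_step(u, alpha, grad_u_F, hess_uu_F, mixed_alphau_F, grad_u_J, prox_R,
              tau, sigma):
    """One FEFB iteration (our reconstruction): u is the inner variable (n,),
    alpha the outer variable (m,), and the adjoint p an (m, n) matrix."""
    u_next = u - tau * grad_u_F(u, alpha)                 # forward step on the inner problem
    H = hess_uu_F(u_next, alpha)                          # nabla_u^2 F(u^{k+1}; alpha^k)
    B = mixed_alphau_F(u_next, alpha)                     # nabla_{alpha u} F(u^{k+1}; alpha^k)
    p_next = -np.linalg.solve(H.T, B.T).T                 # exact adjoint solve: S_p(u^{k+1}, alpha^k)
    alpha_next = prox_R(alpha - sigma * p_next @ grad_u_J(u_next), sigma)  # outer FB step
    return u_next, p_next, alpha_next
```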

Using H defined in (9), Algorithm 2.1 can be written implicitly as solving

$$\begin{aligned} 0 \in H_{k+1}(x^{k+1}) + M(x^{k+1} - x^k) \end{aligned}$$
(10)

for \(x^{k+1} = (u^{k+1}, p^{k+1}, \alpha ^{k+1})\), where, with \(x=(u, p, \alpha )\),

(11a)

and the preconditioning operator \(M\in \mathbb {L}(U\times P \times \mathscr {A}; U\times P \times \mathscr {A})\) is

$$\begin{aligned} M :={{\,\textrm{diag}\,}}( \tau ^{-1}{{\,\textrm{Id}\,}}, 0, \sigma ^{-1}{{\,\textrm{Id}\,}}). \end{aligned}$$
(11b)

The “nonlinear preconditioning” applied to H to construct \(H_{k+1}\) shifts iterate indices such that a forward step is performed instead of a proximal step; compare [18, 19].

We next state essential structural, initialisation, and step length assumptions. We start with a contractivity condition needed for the proximal step with respect to R.

Assumption 2.1

Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) be convex, proper, and lower semicontinuous. We say that R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\in \mathscr {A}\) for \(q \in \mathscr {A}\) within a neighborhood \(A \subset {{\,\textrm{dom}\,}}R\) of \(\widehat{\alpha }\) if there exists \(C_R > 0\) such that, for all \(\alpha \in A\),

$$\begin{aligned} \Vert D_{\sigma R}(\alpha )-D_{\sigma R}(\widehat{\alpha })\Vert \le \sigma C_R \Vert \alpha -\widehat{\alpha }\Vert \quad \text {for}\quad D_{\sigma R}(\alpha ) :={{\,\textrm{prox}\,}}_{\sigma R}(\alpha - \sigma q)-\alpha . \end{aligned}$$

If A can be taken to be all of \({{\,\textrm{dom}\,}}R\) with the same factor \(C_R\), we drop the word “locally”.

We verify Assumption 2.1 for some common cases in the appendix. When applying the assumption to \(\widehat{\alpha }\) satisfying (8), we will take \(q = -\widehat{p}\nabla _u J(\widehat{u}) \in \partial R(\widehat{\alpha })\). Then \(D_{\sigma R}(\widehat{\alpha })=0\) by standard properties of proximal mappings. The results for nonsmooth functions in the appendix in that case forbid strict complementarity. In particular, for \(R=\beta \Vert \,\varvec{\cdot }\,\Vert _1 + \delta _{[0, \infty )^n}\) we need to have \(q = (\beta , \ldots , \beta )\), and for \(R=\delta _C\) for a convex set C, we need to have \(q=0\). Intuitively, this restriction serves to forbid the finite identification property [48] of proximal-type methods: our techniques depend on the stability of the inner problem and the adjoint equation with respect to perturbations of \(\alpha \), and therefore \(\{\alpha ^n\}\) cannot be allowed to converge too fast.
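As a way of experimenting with Assumption 2.1, the following snippet (our own helper; the test points and parameter values are arbitrary) evaluates \(D_{\sigma R}\) for \(R=\beta \Vert \,\varvec{\cdot }\,\Vert _1 + \delta _{[0, \infty )^n}\) with the choice \(q = (\beta ,\ldots ,\beta )\) discussed above, and prints the ratios \(\Vert D_{\sigma R}(\alpha )-D_{\sigma R}(\widehat{\alpha })\Vert /\Vert \alpha -\widehat{\alpha }\Vert \) that \(\sigma C_R\) has to dominate.

```python
import numpy as np

def D_sigmaR(alpha, q, sigma, beta):
    """D_{sigma R}(alpha) = prox_{sigma R}(alpha - sigma*q) - alpha for
    R = beta*||.||_1 + delta_{[0, inf)^n}; the prox is one-sided soft-thresholding."""
    return np.maximum(alpha - sigma * q - sigma * beta, 0.0) - alpha

beta, sigma = 1.0, 0.1
alpha_hat = np.zeros(3)                        # reference point with D_{sigma R} = 0
q = beta * np.ones(3)
for alpha in (np.array([0.05, 0.0, 0.2]), np.array([0.3, 0.1, 0.0])):
    d = D_sigmaR(alpha, q, sigma, beta) - D_sigmaR(alpha_hat, q, sigma, beta)
    print(np.linalg.norm(d) / np.linalg.norm(alpha - alpha_hat))
```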

We now come to our main assumption for the FEFB. It collects conditions related to step lengths, initialization, and the problem functions F, J, and R. For a constant \(\varphi _u>0\) to be determined by the assumption, we introduce the testing operator

$$\begin{aligned} Z :={{\,\textrm{diag}\,}}(\varphi _u {{\,\textrm{Id}\,}}, {{\,\textrm{Id}\,}}, {{\,\textrm{Id}\,}}). \end{aligned}$$
(12)

The idea, introduced in [18] and further explained in [19], is to test the algorithm-defining inclusion (10) with the linear functional \(\langle Z\,\varvec{\cdot }\,,x^{k+1}-\widehat{x}\rangle \) to obtain a descent estimate with respect to the ZM-norm. The operator Z encodes component-specific scalings and convergence rates, although we do not exploit the latter in this manuscript.
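To sketch the mechanism (anticipating the estimates (30) and (31) of Sect. 3): the inclusion (10) provides some \(w^{k+1} \in H_{k+1}(x^{k+1})\) with \(w^{k+1} + M(x^{k+1}-x^k)=0\), and testing it with \(\langle Z\,\varvec{\cdot }\,,x^{k+1}-\widehat{x}\rangle \) gives, by the three-point identity (2) for the self-adjoint positive semi-definite ZM,

$$\begin{aligned} 0 = \langle Z w^{k+1},x^{k+1}-\widehat{x}\rangle + \frac{1}{2}\Vert x^{k+1}-x^{k}\Vert ^2_{ZM} - \frac{1}{2}\Vert x^{k}-\widehat{x}\Vert ^2_{ZM} + \frac{1}{2}\Vert x^{k+1}-\widehat{x}\Vert ^2_{ZM}. \end{aligned}$$

Hence any lower bound \(\langle Z w^{k+1},x^{k+1}-\widehat{x}\rangle \ge \mathscr {V}_{k+1}(\widehat{x}) - \frac{1}{2}\Vert x^{k+1}-x^{k}\Vert ^2_{ZM}\) with \(\mathscr {V}_{k+1}(\widehat{x})\ge 0\) immediately yields a descent estimate of the form \(\frac{1}{2}\Vert x^{k+1}-\widehat{x}\Vert ^2_{ZM} + \mathscr {V}_{k+1}(\widehat{x}) \le \frac{1}{2}\Vert x^{k}-\widehat{x}\Vert ^2_{ZM}\).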

Assumption 2.2

We assume that U is a Hilbert space, \(\mathscr {A}\) a separable Hilbert space, and treat the adjoint variable \(p\in \mathbb {L}(U; \mathscr {A})\) as an element of the inner product space P defined in (6a). Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) and \(J: U \rightarrow \mathbb {R}\) be convex, proper, and lower semicontinuous, and assume the same from \(F(\,\varvec{\cdot }\,, \alpha )\in C^2(U)\) for all \(\alpha \in {{\,\textrm{dom}\,}}R.\) Pick \((\widehat{u},\widehat{p},\widehat{\alpha }) \in H^{-1}(0)\) and let \(\{(u^m, p^m, \alpha ^m)\}_{m\in \mathbb {N}}\) be generated by Algorithm 2.1 for a given initial iterate \((u^{0}, p^{0}, \alpha ^{0}) \in U \times P \times {{\,\textrm{dom}\,}}R\). For given \(r, r_u>0\) we suppose that

  1. (i)

    The relative initialization bound \(\Vert u^{1}-S_u(\alpha ^{0})\Vert \le C_u \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) holds for some \(C_u>0\).

  2. (ii)

    There exists in \(B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) a continuously Fréchet-differentiable and \(L_{S_u}\)-Lipschitz inner problem solution mapping \(S_u: \alpha \mapsto S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits F(\,\varvec{\cdot }\,; \alpha )\).

  3. (iii)

    \(F(\widehat{u};\,\varvec{\cdot }\,)\) is Lipschitz continuously differentiable with factor \(L_{\nabla F,\widehat{u}} > 0\), and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) for all \((u, \alpha ) \in B(\widehat{u}, r_u) \times ( B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) for some \(\gamma _F, L_F > 0.\) Moreover, \((u,\alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u,\alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz in \(B(\widehat{u}, r_u)\times (B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) with factors \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\), where we equip \(U \times \mathscr {A}\) with the norm \((u, \alpha ) \mapsto \Vert u\Vert _U + \Vert \alpha \Vert _{\mathscr {A}}\).

  4. (iv)

    The inner step length \(\tau \in (0, 2\kappa /L_F]\) for some \(\kappa \in (0, 1)\).

  5. (v)

    The outer fitness function J is Lipschitz continuously differentiable with factor \(L_{\nabla J}\), and \(\gamma _\alpha \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)\le L_\alpha \cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _\alpha ,L_\alpha >0\). Moreover, R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(C_R\ge 0\).

  6. (vi)

    The constants \(\varphi _u, C_u > 0\) satisfy

    $$\begin{aligned} \gamma _F (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})C_u + \frac{L_{\nabla F,\widehat{u}}^2}{(1-\kappa )} \varphi _u < \gamma _F \gamma _\alpha , \end{aligned}$$

    where

    $$\begin{aligned} \begin{aligned} N_{\nabla _{\alpha u} F}&:=\max _{\begin{array}{c} u \in B(\widehat{u}, r_u),\\ \alpha \in B(\widehat{\alpha }, 2r)\cap {{\,\textrm{dom}\,}}R \end{array}} \Vert \hspace{-1.0pt}|\nabla _{\alpha u} F(u,\alpha ) \Vert \hspace{-1.0pt}|, \\ L_{S_p}&:=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}, \\ N_{\nabla J}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \nabla _u J (S_u(\alpha ))\Vert , \\ N_{\nabla S_u}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ) \Vert \hspace{-1.0pt}|, \text { and} \\ N_p&:=N_{\nabla S_u} + C r \text { with } C=L_{S_p} C_u. \end{aligned} \end{aligned}$$
  7. (vii)

    The outer step length \(\sigma \) fulfills

    $$\begin{aligned} 0 < \sigma \le \frac{(C_F-1)C_u}{(L_{S_u} +C_FC_u)C_{\alpha }} \end{aligned}$$

    where

    $$\begin{aligned} {\left\{ \begin{array}{ll} C_F :=\sqrt{1+2\tau \gamma _F(1 - \kappa )}, \quad \text {and} \\ C_{\alpha } :=(N_pL_{\nabla J} + N_{\nabla J} L_{S_p}) C_u + L_\alpha + C_R. \end{array}\right. } \end{aligned}$$
  8. (viii)

    The initial iterates \(u^0\) and \(\alpha ^0\) are such that the distance-to-solution

    $$\begin{aligned} r_0 :=\sqrt{\sigma \varphi _u\tau ^{-1}\Vert u^{0}-\widehat{u}\Vert ^2 + \Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2} = \sqrt{\sigma }\Vert x^{0}-\widehat{x}\Vert _{ZM} \end{aligned}$$

    satisfies

    $$\begin{aligned} r_0 \le r \quad \text {and}\quad r_0 \max \{2L_{S_u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }(1+\tau L_{F}) + \tau L_{\nabla F,\widehat{u}}\} \le r_u. \end{aligned}$$

Remark 2.3

(Interpretation) Part (i) of Assumption 2.2 ensures that the initial inner problem iterate is good relative to the outer problem iterate. If \(u^1\) solves the inner problem for \(\alpha ^0\), (i) holds for any \(C_u>0\). Therefore, (i) can always be satisfied by solving the inner problem for \(\alpha ^0\) to high accuracy. This condition does not require \(\alpha ^0\) to be close to a solution \(\widehat{\alpha }\) of the entire problem.

Part (ii) ensures that the inner problem solution map exists and is well-behaved; we discuss it more in the next Remark 2.4.

Parts (iii) and (v) are second order growth and boundedness conditions, standard in smooth optimization. The nonsmooth R is handled through the prox-\(\sigma \)-contractivity assumption. If \(S_u\) is twice Fréchet differentiable, the product and the chain rules show that \(\nabla _{\alpha }^2(J\circ S_u)(\alpha )\) is the sum of \(\nabla _{\alpha }S_u(\alpha )\,\nabla _u^2 J(S_u(\alpha ))\,\nabla _{\alpha }S_u(\alpha )^*\) and a term involving the second derivative of \(S_u\) paired with \(\nabla _u J(S_u(\alpha ))\).

If \(R=0\), first-order optimality conditions establish \(\nabla _u J(S_u(\widehat{\alpha }))=0\). Therefore, if, further, J is strongly convex and \(S_u'(\widehat{\alpha })\) is invertible, \(\gamma \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)(\widehat{\alpha })\) for some \(\gamma >0\). Then additional continuity assumptions establish the positivity required in (v) in a neighbourhood of \(\widehat{\alpha }\). It is also possible to further develop the condition to not depend on the solution mapping at all.

Depending on R, (v) may restrict the outer step length parameter \(\sigma \). Part (iii) ensures that \(\nabla _u^2 F(u; \alpha )\) is invertible and \(S_p\) is well-defined. We will see in Lemma 3.3 that the radius \(r_u\) is sufficiently large that \(\alpha \in B(\widehat{\alpha }, r)\) implies \(S_u(\alpha ) \in B(\widehat{u}, r_u)\). Part (v) implies that \(\alpha \mapsto \nabla _{\alpha }(J\circ S_u)(\alpha )\) is Lipschitz in \(B(\widehat{\alpha }, r)\).

Part (iv) is a standard step length condition for the inner problem while (vii) is a step length condition for the outer problem. It depends on several constants defined in the more technical part (vi). We can always satisfy the inequality in (vi) by good relative initialisation (small \(C_u>0\)), as discussed above, and by taking the testing parameter \(\varphi _u\) small. According to the local initialization condition (viii), the latter can be done if the initial iterates are close to a solution \((\widehat{u}, \widehat{\alpha })\) of the entire problem, or if \(r_u>0\) can be taken arbitrarily large. If we can take both \(r>0\) and \(r_u>0\) arbitrarily large, we obtain global convergence.

Remark 2.4

(Existence and differentiability of the solution map) Suppose F is twice continuously differentiable in both variables, and that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) for all \(u\in B(\widehat{u}, r_u)\) and \(\alpha \in B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _F>0\). Then the implicit function theorem shows the existence of a unique continuously differentiable \(S_u\) in a neighborhood of any \(\alpha \in B(\widehat{\alpha },r) \cap {{\,\textrm{dom}\,}}R\). Such an \(S_u\) is also Lipschitz in a neighborhood of \(\alpha \); see, e.g., [19, Lemma 2.11]. If \(\mathscr {A}\) is finite-dimensional, a compactness argument gluing together the neighborhoods then proves Assumption 2.2 (ii).
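As a concrete finite-dimensional illustration of this remark and of the adjoint formula (5) (our own toy example with the quadratic inner problem \(F(u;\alpha ) = \frac{1}{2}\Vert u-z\Vert ^2 + \frac{\alpha }{2}\Vert Du\Vert ^2\) and a scalar parameter): the Hessian \({{\,\textrm{Id}\,}}+ \alpha D^*D \ge {{\,\textrm{Id}\,}}\) is uniformly positive definite, so \(S_u(\alpha ) = ({{\,\textrm{Id}\,}}+\alpha D^*D)^{-1}z\) exists and is smooth, and \(S_p(S_u(\alpha ),\alpha )\) agrees with a finite-difference approximation of \(\nabla _\alpha S_u(\alpha )\):

```python
import numpy as np

n, a, h = 8, 0.7, 1e-6
rng = np.random.default_rng(2)
z = rng.standard_normal(n)
D = np.eye(n, k=1) - np.eye(n)
S_u = lambda t: np.linalg.solve(np.eye(n) + t * D.T @ D, z)   # inner solution map

u = S_u(a)
adjoint = -np.linalg.solve(np.eye(n) + a * D.T @ D, D.T @ D @ u)   # S_p(u, a) via (5)
finite_difference = (S_u(a + h) - S_u(a - h)) / (2 * h)            # approximates d/da S_u(a)
print(np.max(np.abs(adjoint - finite_difference)))                 # small, e.g. ~1e-9
```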

2.3 Algorithm: forward-inexact-forward-backward

Algorithm 2.2
Forward-inexact-forward-backward (FIFB) method (listing rendered as a figure in the source)

Our second strategy for solving (8) modifies the first approach to solve the adjoint variable inexactly, so that no costly matrix inversions are required. Instead we perform an update reminiscent of a gradient step. This approach, which we call the FIFB (forward-inexact-forward-backward), reads as Algorithm 2.2 and has the implicit form

$$\begin{aligned} {\left\{ \begin{array}{ll} 0 = \tau \nabla _{u} F(u^{k}; \alpha ^{k}) + u^{k+1} - u^{k} \\ 0 = \theta \left( p^{k} \nabla _{u}^2 F(u^{k+1}; \alpha ^{k}) + \nabla _{\alpha u} F(u^{k+1}; \alpha ^{k})\right) + p^{k+1} - p^{k} \\ 0 \in \sigma (\partial R(\alpha ^{k+1}) + p^{k+1}\nabla _{u}J(u^{k+1})) + \alpha ^{k+1} - \alpha ^{k}. \end{array}\right. } \end{aligned}$$
(13)

The implicit form can also be written as (10) with

(14a)

and the preconditioning operator \(M\in \mathbb {L}(U\times P \times \mathscr {A}; U\times P \times \mathscr {A})\),

$$\begin{aligned} M :={{\,\textrm{diag}\,}}( \tau ^{-1}{{\,\textrm{Id}\,}}, \theta ^{-1}{{\,\textrm{Id}\,}}, \sigma ^{-1}{{\,\textrm{Id}\,}}). \end{aligned}$$
(14b)
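Written out from (13), one FIFB iteration is a gradient step on the inner problem, a single adjoint update requiring only a multiplication by the Hessian (no linear solve), and a forward-backward step on the outer problem. A minimal finite-dimensional Python sketch (function handles, matrix shapes, and names are our assumptions):

```python
def fifb_step(u, p, alpha, grad_u_F, hess_uu_F, mixed_alphau_F, grad_u_J, prox_R,
              tau, theta, sigma):
    """One FIFB iteration written out from the implicit form (13): u is the inner
    variable (n,), alpha the outer variable (m,), p the adjoint, an (m, n) matrix."""
    u_next = u - tau * grad_u_F(u, alpha)                          # inner gradient step
    p_next = p - theta * (p @ hess_uu_F(u_next, alpha)             # inexact adjoint step:
                          + mixed_alphau_F(u_next, alpha))         # no matrix inversion
    alpha_next = prox_R(alpha - sigma * p_next @ grad_u_J(u_next), sigma)  # outer FB step
    return u_next, p_next, alpha_next
```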

For the testing operator Z we use the structure

$$\begin{aligned} Z :={{\,\textrm{diag}\,}}(\varphi _u {{\,\textrm{Id}\,}}, \varphi _p {{\,\textrm{Id}\,}}, {{\,\textrm{Id}\,}}) \end{aligned}$$
(15)

with the constants \(\varphi _u, \varphi _p>0\) determined in the following assumption. It is the FIFB counterpart of Assumption 2.2 for the FEFB, collecting essential structural, step length, and initialization assumptions.

Assumption 2.5

We assume that U is a Hilbert space, \(\mathscr {A}\) a separable Hilbert space, and treat the adjoint variable \(p\in \mathbb {L}(U; \mathscr {A})\) as an element of the inner product space P defined in (6a). Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) and \(J: U \rightarrow \mathbb {R}\) be convex, proper, and lower semicontinuous, and assume the same from \(F(\,\varvec{\cdot }\,, \alpha )\) for all \(\alpha \in {{\,\textrm{dom}\,}}R\). Pick \((\widehat{u},\widehat{p},\widehat{\alpha }) \in H^{-1}(0)\) and let \(\{(u^m, p^m, \alpha ^m)\}_{m\in \mathbb {N}}\) be generated by Algorithm 2.2 for a given initial iterate \((u^{0}, p^{0}, \alpha ^{0}) \in U \times P \times {{\,\textrm{dom}\,}}R\). For given \(r, r_u>0\) we suppose that

  1. (i)

    The relative initialization bounds \(\Vert u^{1}-S_u(\alpha ^{0})\Vert \le C_u \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) and \(\Vert \hspace{-1.0pt}|p^{1}-\nabla _{\alpha }S_u(\alpha ^{0}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) hold with some constants \(C_u>0\) and \(C_p>0.\)

  2. (ii)

    There exists in \(B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) a continuously Fréchet-differentiable and \(L_{S_u}\)-Lipschitz inner problem solution mapping \(S_u: \alpha \mapsto S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits F(\,\varvec{\cdot }\,; \alpha )\).

  3. (iii)

    \(F(\widehat{u};\,\varvec{\cdot }\,)\) is Lipschitz continuously differentiable with factor \(L_{\nabla F,\widehat{u}} > 0\), and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) for \(u\in B(\widehat{u}, r_u)\) and \(\alpha \in B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R.\) Moreover, \((u,\alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u,\alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz in \(B(\widehat{u}, r_u)\times (B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) with factors \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\), where we equip \(U \times \mathscr {A}\) with the norm \((u, \alpha ) \mapsto \Vert u\Vert _U + \Vert \alpha \Vert _{\mathscr {A}}\).

  4. (iv)

    The inner step length \(\tau \in (0, 2\kappa /L_F]\) for some \(\kappa \in (0, 1)\) whereas the adjoint step length \(\theta \in (0, 1/L_F)\).

  5. (v)

    The outer fitness function J is Lipschitz continuously differentiable with factor \(L_{\nabla J},\) and \(\gamma _\alpha \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)\le L_\alpha \cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _\alpha , L_\alpha > 0\). Moreover, R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(C_R\ge 0\).

  6. (vi)

    The constants \(\varphi _u, \varphi _p, C_u > 0\) satisfy

    $$\begin{aligned} \varphi _p \le \varphi _u \frac{\gamma _F^2(1-\kappa )}{2 L_F L_{S_p}} \end{aligned}$$

    and

    $$\begin{aligned}{} & {} L_F L_{S_p} \varphi _p + \sqrt{(L_F L_{S_p} \varphi _p)^2 + \gamma _F^2 (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})^2C_u^2} \\{} & {} \quad + \frac{L_{\nabla F,\widehat{u}}^2}{(1-\kappa )} \varphi _u < \gamma _F \gamma _\alpha , \end{aligned}$$

    where

    $$\begin{aligned} \begin{aligned} N_{\nabla _{\alpha u} F}&:=\max _{\begin{array}{c} u \in B(\widehat{u}, r_u),\\ \alpha \in B(\widehat{\alpha }, 2r)\cap {{\,\textrm{dom}\,}}R \end{array}} \Vert \hspace{-1.0pt}|\nabla _{\alpha u} F(u,\alpha ) \Vert \hspace{-1.0pt}|, \\ L_{S_p}&:=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}, \\ N_{\nabla J}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \nabla _u J (S_u(\alpha ))\Vert , \\ N_{\nabla S_u}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ) \Vert \hspace{-1.0pt}|, \text { and} \\ N_p&:=N_{\nabla S_u} + C r \text { with } C=L_{S_p} C_u. \end{aligned} \end{aligned}$$
  7. (vii)

    The outer step length \(\sigma \) satisfies

    $$\begin{aligned} 0 < \sigma \le \textstyle \frac{1}{C_\alpha }\min \left\{ \frac{(C_F-1)C_u}{L_{S_u} +C_FC_u}, \frac{(C_{F,S}-1)C_p- (1+C_{F,S})L_{S_p} C_u}{(1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u} \right\} \end{aligned}$$

    with

    $$\begin{aligned} \begin{aligned} C_F&:=\sqrt{1+2\tau \gamma _F(1 - \kappa )}, \qquad C_{F,S} :=\sqrt{(1+\theta \gamma _F)/(1-\theta \gamma _F)} \quad \text {and} \\ C_{\alpha }&:=N_pL_{\nabla J}C_u + N_{\nabla J}\max \{C_p, L_{S_p} C_u\} + L_\alpha + C_R. \end{aligned} \end{aligned}$$
  8. (viii)

    The initial iterate \((u^0, p^0, \alpha ^0)\) is such that the distance-to-solution

    $$\begin{aligned} r_0 :=\sqrt{\frac{\sigma \varphi _u}{\tau }\Vert u^{0}-\widehat{u}\Vert ^2 + \frac{\sigma \varphi _p}{\theta }\Vert \hspace{-1.0pt}|p^{0}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2} = \sqrt{\sigma }\Vert x^{0}-\widehat{x}\Vert _{ZM} \end{aligned}$$

    satisfies

    $$\begin{aligned} r_0 \le r \ \text {and}\ r_0 \max \{2(C_u + L_{S_u}), \sqrt{\tfrac{\tau }{\sigma \varphi _u}}(1+\tau L_{F}) + \tau L_{\nabla F,\widehat{u}}\} \le r_u. \end{aligned}$$

Remark 2.6

(Interpretation) The interpretation of Assumption 2.2 in Remark 2.3 also applies to Assumption 2.5. We stress that to satisfy the inequality in (vi), it suffices to ensure small \(C_u>0\) by good relative initialization of u and p with respect to \(\alpha \), and choosing the testing parameters \(\varphi _u, \varphi _p>0\) small enough. According to (viii), the latter can be done by initializing close to a solution, or if the radius \(r_u>0\) is large.

3 Convergence analysis

We now prove the convergence of the FEFB (Algorithm 2.1) and the FIFB (Algorithm 2.2) in the respective Sects. 3.2 and 3.3. Before this we start with common results. Our proofs are self-contained, but build on the “testing” approach of [18] (see also [19]). The main idea is to prove a monotonicity-type estimate for the operator \(H_{k+1}\) occurring in the implicit forms (10) and (14) of the algorithms, and then use the three-point identity (2) with respect to ZM-norms and inner products. This yields an estimate from which convergence rates can readily be observed. The main results for the FEFB and the FIFB are in the respective Theorems 3.16 and 3.21.

Throughout, we assume that either Assumption 2.2 (FEFB) or 2.5 (FIFB) holds, and tacitly use the constants from the relevant one. We also tacitly take it that \(\alpha ^k \in {{\,\textrm{dom}\,}}R\) for all \(k \in \mathbb {N}\), as this is guaranteed by the assumptions for \(k=0\), and by the proximal step in the algorithms for \(k \ge 1\).

3.1 General results

Our main goal here is to bound the error in the inner and adjoint iterates \(u^k\) and \(p^k\) in terms of the outer iterates \(\alpha ^k\). We also derive bounds on the outer steps, and local monotonicity estimates. We first show that the solution mapping for the adjoint equation (4) is Lipschitz.

Lemma 3.1

Suppose \((u, \alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u, \alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz continuous with the respective constants \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\) in some bounded closed set \(V_u \times V_{\alpha }.\) Also assume that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) and \(\Vert \hspace{-1.0pt}|\nabla _{\alpha u} F \Vert \hspace{-1.0pt}| \le N_{\nabla _{\alpha u} F}\) in \(V_u \times V_{\alpha }\) for some \(\gamma _F, N_{\nabla _{\alpha u} F}>0\). Then \(S_p\) is Lipschitz continuous in \(V_u \times V_{\alpha },\) i.e.

$$\begin{aligned} \Vert \hspace{-1.0pt}|S_p(u_1, \alpha _1) - S_p(u_2, \alpha _2) \Vert \hspace{-1.0pt}| \le L_{S_p}(\Vert u_1 - u_2\Vert +\Vert \alpha _1-\alpha _2\Vert ) \end{aligned}$$

for \(u_1,u_2\in V_u\) and \(\alpha _1,\alpha _2 \in V_{\alpha }\) with factor \(L_{S_p} :=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}.\)

Proof

Using the definition of \(S_p\) in (5), we rearrange

Thus the triangle inequality and the operator norm inequality Theorem 6.1 (ii) give

(16)

The assumption \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) implies \(\Vert (\nabla _u^2 F(u; \alpha ))^{-1}\Vert \le \gamma _F^{-1}.\) Therefore, also using the Lipschitz continuity of \((u, \alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha )\) in \(V_u\times V_{\alpha },\) we get

$$\begin{aligned} E_1 \le \gamma _F^{-1} L_{\nabla _{\alpha u} F}\left( \Vert u_1-u_2\Vert + \Vert \alpha _1 - \alpha _2\Vert \right) . \end{aligned}$$
(17)

Towards estimating the second term on the right hand side of (16), we observe that

$$\begin{aligned} A^{-1} - B^{-1}= A^{-1}B B^{-1} - A^{-1}A B^{-1} = A^{-1}(A-B)B^{-1} \end{aligned}$$

for any invertible linear operators AB. Then we use \(\Vert \hspace{-1.0pt}|\nabla _{\alpha u} F \Vert \hspace{-1.0pt}| \le N_{\nabla _{\alpha u} F}\) and the Lipschitz continuity of \(\nabla _{u}^2 F (u; \alpha )\) to obtain

Inserting this inequality and (17) into (16) establishes the claim. \(\square \)

We now prove two simple step length bounds.

Lemma 3.2

Let Assumption 2.2 or 2.5 hold. Then \(\sigma < 1/L_\alpha \) and \(1< C_F < \sqrt{1+\gamma _F/L_F}\).

Proof

We have \(C_F>1\) since \(\kappa <1\) forces \(2\tau \gamma _F(1 - \kappa )>0.\) Assumption 2.2 (iv) or 2.5 (iv) implies \(2\tau \gamma _F(1 - \kappa )<4\gamma _F(\kappa - \kappa ^2)/L_F \le \gamma _F/L_F.\) Therefore \( C_F < \sqrt{1+ \gamma _F/L_F}.\) For \(C_F,C_u, L_{S_u}>0\) it holds \(C_FC_u-C_u< L_{S_u} +C_FC_u.\)

Hence Assumption 2.2 (vii) or 2.5 (vii) gives

$$\begin{aligned} \sigma \le \frac{(C_F-1)C_u}{C_{\alpha }(L_{S_u} +C_FC_u)}< \frac{1}{C_{\alpha }} = \frac{1}{C_u (L_{S_p} N_{\nabla J} + N_{p}) + L_\alpha + C_R} < \frac{1}{L_\alpha }. \end{aligned}$$

\(\square \)

The next lemma explains the latter inequality for \(r_0\) in Assumption 2.2 (viii) and 2.5 (viii). For \(u^n\) and \(\alpha ^n\) close enough to the respective solutions, it shows that the next iterate \(u^{n+1}\) and the true inner problem solution \(S_u(\alpha ^n)\) for \(\alpha ^n\) lie in the \(r_u\)-neighborhood of \(\widehat{u}\).

Lemma 3.3

Suppose Assumption 2.2 or 2.5 hold and \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\), as well as \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). Then \(u^{n+1}\in B(\widehat{u}, r_u)\) and \(S_u(\alpha ^n)\in B(\widehat{u}, r_u).\)

Proof

The inner gradient step of Algorithm 2.1 or 2.2 together with \(u^{n}\in B(\widehat{u}, \sqrt{\frac{\tau }{\sigma \varphi _u}}r_0)\) gives

$$\begin{aligned} \Vert u^{n+1} - \widehat{u}\Vert \le \Vert u^{n+1} - u^{n}\Vert + \Vert u^{n} - \widehat{u}\Vert \le \tau \Vert \nabla _u F(u^n; \alpha ^n)\Vert + \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau } r_0. \end{aligned}$$

Using \(\nabla _u F(\widehat{u}; \widehat{\alpha })=0\), \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\), and the Lipschitz continuity of \(F(\widehat{u};\,\varvec{\cdot }\,)\) and \(F(\,\varvec{\cdot }\,;\alpha ^n)\) from Assumption 2.2 (iii) or 2.5 (iii) we continue to estimate, as required

Next, the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha }, 2r)\) from Assumption 2.2 (ii) or 2.5 (ii) with \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(r_0\le r\) from Assumption 2.2 (viii) or 2.5 (viii) imply

$$\begin{aligned} \Vert S_u(\alpha ^{n}) - \widehat{u}\Vert = \Vert S_u(\alpha ^{n}) - S_u(\widehat{\alpha })\Vert \le L_{S_u}\Vert \alpha ^{n} - \widehat{\alpha }\Vert \le L_{S_u}r_0 \le r_u. \square \end{aligned}$$

We now introduce a working condition that we later prove. It guarantees that the Lipschitz and Hessian properties of Assumption 2.2 (ii), (iii) and (v) or Assumption 2.5 (ii), (iii) and (v) hold at iterates.

Assumption 3.4

(Iterate locality) Let \(r_0\le r\) and \(N_p\) be defined in either Assumption 2.2 or 2.5. Then this assumption holds for a given \(n \in \mathbb {N}\) if

$$\begin{aligned} \alpha ^{n}\in B(\widehat{\alpha }, r_0),\quad u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0), \quad \text { and }\quad \Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}| \le N_p. \end{aligned}$$

Indeed, the next two lemmas show that if Assumption 3.4 holds for \(n=k\) along with some further conditions, then it holds for \(n=k+1\).

Lemma 3.5

Suppose either Assumption 2.2 or 2.5 holds. Let \(n \in \mathbb {N}\) and suppose

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| \le C\Vert \alpha ^n- \widehat{\alpha }\Vert \end{aligned}$$
(18)

with \(\alpha ^n \in B(\widehat{\alpha }, r)\). Then \(\Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}| \le N_p\).

Proof

We estimate using (18) and the definitions of the relevant constants in Assumption 2.2 or 2.5 that

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}|&\le \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| + \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| \\&\le N_{\nabla S_u} + C \Vert \alpha ^n- \widehat{\alpha }\Vert \le N_{\nabla S_u} + C r = N_p.\square \end{aligned}$$

Lemma 3.6

Let \(k \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds; Assumption 3.4 holds for \(n=k\); and that (18) holds for \(n=k+1\). If also \(\Vert x^{n+1} - \widehat{x}\Vert _{ZM} \le \Vert x^n - \widehat{x}\Vert _{ZM}\) for \(n\in \{0,\ldots ,k\}\), then Assumption 3.4 holds for \(n=k+1\).

Proof

Chaining \(\Vert x^{n+1} - \widehat{x}\Vert _{ZM} \le \Vert x^n - \widehat{x}\Vert _{ZM}\) over \(n=0,\ldots ,k\) gives \(\Vert x^{k+1} - \widehat{x}\Vert _{ZM} \le \Vert x^{0} - \widehat{x}\Vert _{ZM} = \sigma ^{-1/2}r_0.\) By the definitions of Z and M in (12) or (15), and (11b) or (14b) respectively, it follows that \(\alpha ^{k+1} \in B(\widehat{\alpha }, r_0)\) and \(u^{k+1}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\) as required. We finish by using Lemma 3.5 with \(n=k+1\) to establish \(\Vert \hspace{-1.0pt}|p^{k+2} \Vert \hspace{-1.0pt}| \le N_p\). \(\square \)

We next prove a monotonicity-type estimate for the inner objective. For this we need the following three-point monotonicity inequality.

Theorem 3.7

Let \(z,\widehat{x}\in X.\) Suppose \(F\in C^2(X),\) and for some \(L>0\) and \(\gamma \ge 0\) that \(\gamma \cdot {{\,\textrm{Id}\,}}\le \nabla ^2 F(\zeta ) \le L \cdot {{\,\textrm{Id}\,}}\) for all \(\zeta \in [\widehat{x}, z] :=\{\widehat{x} + s(z - \widehat{x}) \mid s\in [0,1] \}\). Then, for all \(\beta \in (0, 1]\) and \(x \in X\),

$$\begin{aligned} \langle \nabla F(z)- \nabla F(\widehat{x}),x-\widehat{x}\rangle \ge \gamma (1 - \beta )\Vert x-\widehat{x}\Vert ^2 - \frac{L}{4\beta }\Vert x-z\Vert ^2. \end{aligned}$$
(19)

Proof

The proof follows that of [19, Lemma 15.1], whose statement unnecessarily takes \(\zeta \) in a neighborhood of \(\widehat{x}\) instead of just the interval \([\widehat{x}, z]\). \(\square \)

Lemma 3.8

Let \(n \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds, together with Assumption 3.4. Then for any \(\kappa \in (0,1)\), we have

(20)

Proof

Assumption 2.2 (iii) or 2.5 (iii) with \(\alpha ^n\in B(\widehat{\alpha }, r)\) and \(u^{n}\in B(\widehat{u},r_u)\) from Assumption 3.4 give \(\gamma _F \cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha ^n) \le L_F\cdot {{\,\textrm{Id}\,}}\) for all \(u \in [\widehat{u}, u^n]\). We have \(\nabla _{u}F(\widehat{u}; \widehat{\alpha })=0\) since \(0\in H(\widehat{u}, \widehat{p}, \widehat{\alpha })\). Therefore Theorem 3.7 yields

$$\begin{aligned}{} & {} \langle \nabla _{u}F(u^{n}; \alpha ^{n}),u^{n+1}-\widehat{u}\rangle = \langle \nabla _{u}F(u^{n}; \alpha ^{n}) - \nabla _{u}F(\widehat{u}; \alpha ^{n}),u^{n+1}-\widehat{u}\rangle \\{} & {} \quad + \langle \nabla _{u}F(\widehat{u}; \alpha ^{n}) -\nabla _{u}F(\widehat{u}; \widehat{\alpha }) ,u^{n+1}-\widehat{u}\rangle \\{} & {} \quad \ge \gamma _F(1 - \kappa )\Vert u^{n+1}-\widehat{u}\Vert ^2 - \frac{L_F}{4\kappa }\Vert u^{n+1}-u^{n}\Vert ^2\\ {}{} & {} \qquad - |\langle \nabla _{u}F(\widehat{u}; \alpha ^{n}) -\nabla _{u}F(\widehat{u}; \widehat{\alpha }),u^{n+1}-\widehat{u}\rangle |. \end{aligned}$$

Young’s inequality and the definition of \(L_{\nabla F, \widehat{u}}\) in Assumption 2.2 (iii) or 2.5 (iii) now readily establish the claim. \(\square \)

The next lemma bounds the steps taken for the outer problem variable.

Lemma 3.9

Let \(n \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds, as do Assumption 3.4, (18), and

$$\begin{aligned} \Vert u^{n+1}-S_u(\alpha ^{n})\Vert \le C_u \Vert \alpha ^{n} - \widehat{\alpha }\Vert . \end{aligned}$$
(21)

Then

$$\begin{aligned} \Vert \alpha ^{n+1}-\alpha ^n\Vert \le \sigma [(N_p L_{\nabla J} C_u + N_{\nabla J} C + L_\alpha ) + C_R]\Vert \alpha ^{n} - \widehat{\alpha }\Vert \end{aligned}$$
(22)

and

$$\begin{aligned} \begin{aligned} C_u\Vert \alpha ^{n} - \widehat{\alpha }\Vert + L_{S_u}\Vert \alpha ^{n+1}-\alpha ^n\Vert&\le C_F C_u \bigl (\Vert \alpha ^n - \widehat{\alpha }\Vert - \Vert \alpha ^{n+1} - \alpha ^n\Vert \bigr ) \\&\le C_F C_u \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert . \end{aligned} \end{aligned}$$
(23)

Proof

Using the \(\alpha \)-update of Algorithm 2.1 or 2.2, we estimate

Since proximal maps are 1-Lipschitz, and R is by Assumption 2.2 (v) or 2.5 (v) locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) with factor \(C_R\), it follows

$$\begin{aligned} \begin{aligned} \Vert \alpha ^{n+1}-\alpha ^n\Vert&\le \sigma \Vert p^{n+1}\nabla _u J(u^{n+1})- \widehat{p}\nabla _u J(\widehat{u})\Vert + \sigma C_R \Vert \alpha ^n-\widehat{\alpha }\Vert \\&=: \sigma Q + \sigma C_R \Vert \alpha ^n-\widehat{\alpha }\Vert . \end{aligned} \end{aligned}$$
(24)

We have \(\widehat{p}\nabla _u J(\widehat{u})=\nabla _\alpha S_u(\widehat{\alpha }) \nabla _u J(S_u(\widehat{\alpha }))=\nabla _\alpha (J \circ S_u)(\widehat{\alpha })\), where \(\nabla _{\alpha }(J\circ S_u)\) is \(L_\alpha \)-Lipschitz in \(B(\widehat{\alpha }, r)\ni \alpha ^n\) by Assumption 2.2 (v) or 2.5 (v). Hence

$$\begin{aligned} \begin{aligned} Q&\le \Vert p^{n+1}\nabla _u J(u^{n+1}) - \nabla _{\alpha }(J\circ S_u)(\alpha ^n) + \nabla _{\alpha }(J\circ S_u)(\alpha ^n) -\widehat{p}\,\nabla _u J(\widehat{u})\Vert \\&\le \Vert p^{n+1}\nabla _u J(u^{n+1}) - \nabla _{\alpha }S_u(\alpha ^n)\nabla _u J(S_u(\alpha ^n))\Vert + L_\alpha \Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned} \end{aligned}$$

Using the Lipschitz continuity of \(\nabla _u J\) from Assumption 2.2 (v) or 2.5 (v), we continue

We have \(\Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}|\le N_{p}\) and \(\alpha ^n\in B(\widehat{\alpha }, r)\) by Assumption 3.4. Hence \(\Vert \nabla _u J(S_u(\alpha ^n))\Vert \le N_{\nabla J}\) by the definition in Assumption 2.2 (vi) or 2.5 (vi). Using (18) and (21) therefore gives

$$\begin{aligned} Q \le N_p L_{\nabla J} C_u \Vert \alpha ^n - \widehat{\alpha }\Vert + N_{\nabla J} C \Vert \alpha ^n - \widehat{\alpha }\Vert + L_\alpha \Vert \alpha ^n - \widehat{\alpha }\Vert = (C_{\alpha }-C_R)\Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned}$$

Inserting this into (24), we obtain (22).

Assumption 2.2 (vii) or Assumption 2.5 (vii) and (22) then yield

$$\begin{aligned} (L_{S_u} +C_FC_u)\Vert \alpha ^{n+1} - \alpha ^n\Vert \le \sigma (L_{S_u} +C_FC_u)C_{\alpha } \Vert \alpha ^n - \widehat{\alpha }\Vert \le (C_F-1)C_u \Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned}$$

Rearranging terms and finishing with the triangle inequality we get (23). \(\square \)

Remark 3.10

(Gradient steps with respect to R) We could (in both the FEFB and the FIFB) also take a gradient step instead of a proximal step with respect to R, provided R has an \(L_{\nabla R}\)-Lipschitz gradient. That is, we would perform for the outer problem the update

$$\begin{aligned} \alpha ^{n+1} = \alpha ^n - \sigma [p^{n+1}\nabla _u J(u^{n+1}) + \nabla R(\alpha ^n)]. \end{aligned}$$

This can be shown to be convergent by changing (24) to

$$\begin{aligned} \begin{aligned} \Vert \alpha ^{n+1}-\alpha ^n\Vert&= \sigma \Vert p^{n+1}\nabla _u J(u^{n+1}) + \nabla R(\alpha ^n)\Vert \\&= \sigma \Vert p^{n+1}\nabla _u J(u^{n+1}) - \widehat{p}\,\nabla _u J(\widehat{u}) + \nabla R(\alpha ^n)-\nabla R(\widehat{\alpha })\Vert \\&\le \sigma \bigl ( \Vert p^{n+1}\nabla _u J(u^{n+1})- \widehat{p}\,\nabla _u J(\widehat{u})\Vert + L_{\nabla R}\Vert \alpha ^n-\widehat{\alpha }\Vert \bigr ). \end{aligned} \end{aligned}$$

We next prove that if an inner problem iterate has small error, and we take a short step in the outer problem, then the next inner problem iterate also has small error.

Lemma 3.11

Let \(k \in \mathbb {N}\). Suppose Assumption 2.2 or 2.5 holds. If Assumption 3.4, (18), and (21) hold for \(n=k,\) then (21) holds for \(n=k+1\) and we have \(\alpha ^{k+1}\in B(\widehat{\alpha }, 2r_0)\).

Proof

We plan to use Theorem 3.7 on \(F(\,\varvec{\cdot }\,; \alpha ^{k+1})\) followed by the three-point identity and simple manipulations. We begin by proving the conditions of the theorem.

First, we show that both \(u^{k+1}\in B(\widehat{u},r_u)\) and \(S_u(\alpha ^{k+1})\in B(\widehat{u},r_u)\). The former is immediate from Assumption 3.4 and Lemma 3.3. For the latter we use (23) of Lemma 3.9. Its first inequality readily implies either \(\Vert \alpha ^{k} - \widehat{\alpha }\Vert > \Vert \alpha ^{k+1} - \alpha ^{k}\Vert \) or \(\alpha ^{k+1} = \widehat{\alpha }\). In the latter case \(S_u(\alpha ^{k+1})=\widehat{u} \in B(\widehat{u},r_u)\). In the former, using \(\alpha ^k\in B(\widehat{\alpha },r_0)\), we get

$$\begin{aligned} \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert \le \Vert \alpha ^{k+1} - \alpha ^{k}\Vert + \Vert \alpha ^{k}- \widehat{\alpha }\Vert < 2 \Vert \alpha ^{k}- \widehat{\alpha }\Vert \le 2r_0 \end{aligned}$$

Therefore we can use the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha },2r)\) from Assumption 2.2  (ii) or 2.5 (ii) to estimate

$$\begin{aligned} \Vert S_u(\alpha ^{k+1}) - \widehat{u}\Vert = \Vert S_u(\alpha ^{k+1}) - S_u(\widehat{\alpha })\Vert \le L_{S_u}\Vert \alpha ^{k+1} - \widehat{\alpha }\Vert \le L_{S_u}2r_0. \end{aligned}$$

This implies \(S_u(\alpha ^{k+1})\in B(\widehat{u},r_u)\) by Assumption 2.2 (viii) or 2.5 (viii).

Since both \(u^{k+1}, S_u(\alpha ^{k+1}) \in B(\widehat{u},r_u)\), Assumption 2.2 (iii) or 2.5 (iii) shows that \(\gamma _F \cdot {{\,\textrm{Id}\,}}\le \nabla ^2 F(u) \le L_F\cdot {{\,\textrm{Id}\,}}\) for \(u \in [S_u(\alpha ^{k+1}), u^{k+1}]\). Consequently Theorem 3.7 and \(\nabla _{u}F(S_u(\alpha ^{k+1}); \alpha ^{k+1}) = 0\) give

Inserting the u update of Algorithm 2.1 or 2.2, i.e., \( -\tau ^{-1}(u^{k+2}-u^{k+1}) = \nabla _{u}F(u^{k+1}; \alpha ^{k+1}) \), and using the three-point identity (2) we get

$$\begin{aligned}{} & {} \frac{1}{2\tau } \left( \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert ^2 + \Vert u^{k+2}-u^{k+1}\Vert ^2 - \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert ^2 \right) \\{} & {} \quad \le - \gamma _F(1 - \kappa ) \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert ^2 + \frac{L_F}{4\kappa }\Vert u^{k+2}-u^{k+1}\Vert ^2. \end{aligned}$$

Equivalently

$$\begin{aligned}{} & {} \left( 1+2\tau \gamma _F(1 - \kappa )\right) \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert ^2 + \Bigl (1-\frac{\tau L_F}{2\kappa } \Bigr ) \Vert u^{k+2}-u^{k+1}\Vert ^2 \\{} & {} \quad \le \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert ^2. \end{aligned}$$

Because Assumption 2.2 (iv) or 2.5 (iv) guarantees \(1-\tau L_F/(2\kappa ) > 0,\) this implies

$$\begin{aligned} \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert \le C_F^{-1} \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert . \end{aligned}$$

Therefore the triangle inequality, (21) for \(n=k\) and the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha },2r)\ni \alpha ^{k}, \alpha ^{k+1}\) yield

$$\begin{aligned} \begin{aligned} \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert&\le C_F^{-1} \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert \\&\le C_F^{-1} \bigl ( \Vert u^{k+1}-S_u(\alpha ^{k})\Vert + L_{S_u}\Vert \alpha ^{k+1}-\alpha ^{k}\Vert \bigr ) \\&\le C_F^{-1} \bigl ( C_u \Vert \alpha ^{k} - \widehat{\alpha }\Vert + L_{S_u}\Vert \alpha ^{k+1}-\alpha ^{k}\Vert \bigr ). \end{aligned} \end{aligned}$$

Inserting (23) here, we establish the claim. \(\square \)

The next lemma is a crucial monotonicity-type estimate for the outer problem. It depends on an \(\alpha \)-relative exactness condition on the inner and adjoint variables.

Lemma 3.12

Let \(n \in \mathbb {N}\). Suppose Assumption 2.2(v) and (vi), or 2.5 (v) and (vi) hold with Assumption 3.4 and

$$\begin{aligned} \Vert u^{n+1}-S_u(\alpha ^n)\Vert \le C_u \Vert \alpha ^n - \widehat{\alpha }\Vert \, \text { and } \, \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| \le C \Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned}$$
(25)

Then, for any \(d > 0\),

$$\begin{aligned}{} & {} \langle p^{n+1}\nabla _u J(u^{n+1}) + \partial R(\alpha ^{n+1}),\alpha ^{n+1}-\widehat{\alpha }\rangle \ge - \frac{L_\alpha }{2} \Vert \alpha ^{n+1}- \alpha ^n\Vert ^2 \nonumber \\{} & {} \quad + \left( \frac{\gamma _\alpha }{2}-\frac{L_{\nabla J}N_p C_u + C N_{\nabla J}}{2d}\right) \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2 \nonumber \\{} & {} \quad + \left( \frac{\gamma _\alpha }{2}-\frac{(L_{\nabla J}N_p C_u + C N_{\nabla J})d}{2}\right) \Vert \alpha ^n-\widehat{\alpha }\Vert ^2. \end{aligned}$$
(26)

Proof

The \(\alpha \)-update of both Algorithms 2.1 and 2.2 in implicit form reads

$$\begin{aligned} 0 = \sigma (q^{n+1} + p^{n+1}\nabla _u J(u^{n+1})) + \alpha ^{n+1} - \alpha ^n \quad \text {for some}\quad q^{n+1} \in \partial R(\alpha ^{n+1}). \end{aligned}$$

Similarly, \(0 \in H(\widehat{u}, \widehat{p}, \widehat{\alpha })\) implies \( \widehat{p}\,\nabla _{u}J(\widehat{u}) + \widehat{q} =0 \) for some \( \widehat{q} \in \partial R(\widehat{\alpha }). \) Writing \(E_0\) for the left hand side of (26), these expressions and the monotonicity of \(\partial R\) yield

(27)

We estimate \(E_1\) and \(E_2\) separately.

The one-dimensional mean value theorem gives

$$\begin{aligned} E_2 = \langle \nabla _\alpha (J\circ S_u)(\alpha ^n)-\nabla _\alpha (J\circ S_u)(\widehat{\alpha }),\alpha ^{n+1}-\widehat{\alpha }\rangle = \langle Q(\alpha ^n-\widehat{\alpha }),\alpha ^{n+1}-\widehat{\alpha }\rangle \end{aligned}$$

for some \(\zeta \in [\widehat{\alpha }, \alpha ^n]\) and \(Q :=\nabla ^2_\alpha (J\circ S_u)(\zeta )\).

Since \(\Vert \alpha ^n - \widehat{\alpha }\Vert \le r\) by Assumption 3.4, also \(\Vert \zeta - \widehat{\alpha }\Vert \le r\).

Therefore, the 3-point identity (2) and Assumption 2.2 (v) or 2.5 (v) yield

$$\begin{aligned} \begin{aligned} E_2&= \frac{1}{2}\Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2_{Q} + \frac{1}{2}\Vert \alpha ^n-\widehat{\alpha }\Vert ^2_{Q} - \frac{1}{2}\Vert \alpha ^{n+1}- \alpha ^n\Vert ^2_{Q} \\&\ge \frac{\gamma _\alpha }{2}(\Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2 + \Vert \alpha ^n-\widehat{\alpha }\Vert ^2) - \frac{L_\alpha }{2}\Vert \alpha ^{n+1}- \alpha ^n\Vert ^2. \end{aligned} \end{aligned}$$
(28)

To estimate \(E_1\) we rearrange

$$\begin{aligned} \begin{aligned} E_1&= \langle p^{n+1}\nabla _u J(u^{n+1})-\nabla _{\alpha }S_u(\alpha ^n)\nabla _u J(S_u(\alpha ^n)),\alpha ^{n+1}-\widehat{\alpha }\rangle \\&= \langle p^{n+1}(\nabla _u J(u^{n+1})-\nabla _u J(S_u(\alpha ^n)))\\ {}&\quad + (p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n))\nabla _u J(S_u(\alpha ^n)),\alpha ^{n+1}-\widehat{\alpha }\rangle . \end{aligned} \end{aligned}$$

We have \(\Vert \nabla _u J(S_u(\alpha ^n))\Vert \le N_{\nabla J}\) by the definition of the latter in Assumption 2.2 (vi) or 2.5 (vi) with \(\alpha ^n\in B(\widehat{\alpha }, r)\) from Assumption 3.4. The same assumptions establish that \(\nabla _u J\) is Lipschitz. Hence, using the operator norm inequality Theorem 6.1 (iii),

$$\begin{aligned} E_1 \ge -\left( L_{\nabla J}N_p \Vert u^{n+1}-S_u(\alpha ^n)\Vert + N_{\nabla J}\Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}|\right) \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert . \end{aligned}$$
Applying (25) and Young’s inequality now yields for any \(d>0\) the estimate

$$\begin{aligned} \begin{aligned} E_1&\ge -\left( L_{\nabla J}N_p C_u + C N_{\nabla J}\right) \Vert \alpha ^n-\widehat{\alpha }\Vert \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert \\&\ge -\left( L_{\nabla J}N_p C_u + C N_{\nabla J}\right) \left( \frac{d}{2}\Vert \alpha ^n-\widehat{\alpha }\Vert ^2 + \frac{1}{2d}\Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2\right) . \end{aligned} \end{aligned}$$
(29)

By inserting (28) and (29) into (27) we obtain the claim (26). \(\square \)

3.2 Convergence: forward-exact-forward-backward

We now prove the convergence of Algorithm 2.1. We start with a lemma that shows an \(\alpha \)-relative exactness estimate on the adjoint iterate when one holds for the inner iterate. This is needed to use Lemma 3.12. The main result of this subsection is in the final Theorem 3.16. It proves under Assumption 2.2 the linear convergence of \(\{(u^n, \alpha ^n)\}_{n \in \mathbb {N}}\) generated by Algorithm 2.1 to \((\widehat{u}, \widehat{\alpha })\) solving the first-order optimality condition (8) for some \(\widehat{p}\).

Lemma 3.13

Let \(n \in \mathbb {N}\). Suppose Assumption 2.2 and the inner exactness estimate (21) hold as well as \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\) and \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). Then (18) and (25) hold for \(C=L_{S_p}C_u.\)

Proof

Since (18) with (21) equals (25), it suffices to prove (18). We assumed \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(u^{n+1},S_u(\alpha ^n)\in B(\widehat{u}, r_u)\) by Lemma 3.3. Therefore the Lipschitz continuity of \(S_p\) in \(B(\widehat{u}, r_u)\times B(\widehat{\alpha }, r)\) from Lemma 3.1 with Assumption 2.2 (ii) and (iii) and (21) give

$$\begin{aligned} \begin{aligned} \Vert \hspace{-1.0pt}|p^{n+1} - \nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}|&= \Vert \hspace{-1.0pt}|S_p(u^{n+1},\alpha ^n)-S_p(S_u(\alpha ^{n}),\alpha ^{n}) \Vert \hspace{-1.0pt}| \\&\le L_{S_p}\Vert u^{n+1}-S_u(\alpha ^{n})\Vert \le L_{S_p} C_u \Vert \alpha ^{n} - \widehat{\alpha }\Vert . \end{aligned}\square \end{aligned}$$

We are able to collect the previous lemmas into a descent estimate from which we immediately observe local linear convergence. We recall the definitions of the preconditioning and testing operators M and Z in (11b) and (12).

Lemma 3.14

Let \(n \in \mathbb {N}\) and suppose Assumption 2.2 and 3.4, and the inner exactness estimate (21) hold. Then

$$\begin{aligned} \Vert x^{n+1}-\widehat{x}\Vert _{ZM}^2 +2\varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + 2\varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \le \Vert x^n-\widehat{x}\Vert _{ZM}^2 \end{aligned}$$
(30)

for \(\varphi _u>0\) as in Assumption 2.2 (vi),

$$\begin{aligned} \varepsilon _u :=\frac{\varphi _u \gamma _F(1-\kappa )}{2}> 0, \quad \text {and}\quad \varepsilon _{\alpha } :=\frac{\gamma _\alpha - (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})C_u}{2} > 0. \end{aligned}$$

Proof

We start by proving the monotonicity estimate

$$\begin{aligned} \langle ZH_{n+1}(x^{n+1}),x^{n+1}-\widehat{x}\rangle \ge \mathscr {V}_{n+1}(\widehat{x}) - \frac{1}{2}\Vert x^{n+1}-x^{n}\Vert ^2_{ZM} \end{aligned}$$
(31)

for \(\mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) :=\varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + \varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\). We observe that \(\varepsilon _u,\varepsilon _{\alpha }>0\) by Assumption 2.2. The monotonicity estimate (31) expands as

$$\begin{aligned} h_{n+1} \ge \mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) -\frac{\varphi _u}{2\tau }\Vert u^{n+1}-u^{n}\Vert ^2 -\frac{1}{2\sigma }\Vert \alpha ^{n+1}-\alpha ^{n}\Vert ^2 \end{aligned}$$
(32)

for (all elements of the set)

$$\begin{aligned} h_{n+1}:= \left\langle \begin{pmatrix} \varphi _u\nabla _{u}F(u^{n};\alpha ^{n}) \\ p^{n+1} \nabla _{u}^2 F(u^{n+1};\alpha ^{n}) + \nabla _{\alpha u}F(u^{n+1};\alpha ^{n}) \\ p^{n+1}\nabla _u J(u^{n+1}) + \partial R(\alpha ^{n+1}) \end{pmatrix}, \begin{pmatrix} u^{n+1}-\widehat{u} \\ p^{n+1}-\widehat{p}\\ \alpha ^{n+1}-\widehat{\alpha } \end{pmatrix} \right\rangle . \end{aligned}$$

We estimate each of the three lines of \(h_{n+1}\) separately. For the first line, we use (20) from Lemma 3.8. For the middle line we observe that \(p^{n+1}\nabla _{u}^2 F(u^{n+1};\alpha ^{n}) + \nabla _{\alpha u}F( u^{n+1};\alpha ^{n})=0\) by the p-update of Algorithm 2.1.

For the last line, we use (26) from Lemma 3.12 with \(d=2\). We can do this because (25) holds by (21) and Lemma 3.13. This gives

Summing with (20) we thus obtain

The factor of the first term is \(\varepsilon _u\) and the factor of the last term is zero. Since \(\sigma <1/L_\alpha \) by Lemma 3.2 and \(L_F/(2\kappa ) \le 1/\tau \) by Assumption 2.2 (iv), we obtain (32), i.e., (31).

We now come to the fundamental argument of the testing approach of [18], combining operator-relative monotonicity estimates with the three-point identity. Indeed, (31) combined with the implicit algorithm (10) gives

$$\begin{aligned} \langle ZM(x^{n+1} - x^n),x^{n+1} - \widehat{x}\rangle + \mathscr {V}_{n+1}(\widehat{x}) \le \frac{1}{2}\Vert x^{n +1}-x^{n}\Vert ^2_{ZM}. \end{aligned}$$

Inserting the three-point identity (2) and expanding \(\mathscr {V}_{n+1}\) yields (30). \(\square \)

Before stating our main convergence result for the FEFB, we simplify the assumptions of the previous lemma to just Assumption 2.2.

Lemma 3.15

Suppose Assumption 2.2 holds. Then (30) holds for any \(n\in \mathbb {N}\).

Proof

The claim readily follows if we prove by induction for all \(n \in \mathbb {N}\) that

$$\begin{aligned} \text {Assumption 3.4, (21), and (30) hold}. \end{aligned}$$
(*)

We first prove (*) for \(n=0\). Assumption 2.2 (i) directly establishes (21). The definition of \(r_0\) in Assumption 2.2 also establishes that \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\) and \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). We have just proved the conditions of Lemma 3.13, which establishes (18) for \(n=0\).

Now Lemma 3.5 establishes \(\Vert \hspace{-1.0pt}|p^1 \Vert \hspace{-1.0pt}| \le N_p\). Therefore Assumption 3.4 holds for \(n=0\). Finally Lemma 3.14 proves (30) for \(n=0\). This concludes the proof of the induction base.

We then make the induction assumption that (*) holds for \(n\in \{0,\ldots ,k\}\) and prove it for \(n=k+1\). Indeed, the induction assumption and Lemma 3.11 give (21) for \(n=k+1\). Next (30) for \(n=k\) implies \(\alpha ^{k+1}\in B( \widehat{\alpha },r_0)\) and \(u^{k+1} \in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau } r_0)\), where \(r_0\) and \(r_u\) are as in Assumption 2.2. Therefore Lemma 3.3 gives \(u^{k+2}\in B(\widehat{u}, r_u)\) while Lemma 3.13 establishes (18) for \(n=k+1\). For all \(n\in \{0,\ldots ,k\}\), the inequality (30) implies \(\Vert x^{n+1}-\widehat{x}\Vert _{ZM} \le \Vert x^n-\widehat{x}\Vert _{ZM}\). Therefore Lemma 3.6 proves Assumption 3.4 and finally Lemma 3.14 proves (30) and consequently (*) for \(n=k+1\). \(\square \)

Theorem 3.16

Suppose Assumption 2.2 holds. Then \(\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \rightarrow 0\) linearly.

Proof

Lemma 3.15, expansion of (30), and basic manipulation show that

$$\begin{aligned} \varphi _u\tau ^{-1}&\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \\&\ge (\varphi _u\tau ^{-1}+2\varepsilon _u)\Vert u^{n+1}-\widehat{u}\Vert ^2 + (\sigma ^{-1}+2\varepsilon _{\alpha })\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \\&= (1+2\varepsilon _u\varphi ^{-1}_u\tau )\varphi _u\tau ^{-1}\Vert u^{n+1}-\widehat{u}\Vert ^2 + (1+2\varepsilon _{\alpha }\sigma )\sigma ^{-1}\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \\&\ge \mu \bigl (\varphi _u\tau ^{-1}\Vert u^{n+1}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\bigr ) \end{aligned}$$

for \(\mu := \min \{1+2\varepsilon _u\varphi ^{-1}_u \tau , 1+2\varepsilon _{\alpha }\sigma \}\). Since \(\mu >1\), linear convergence follows. \(\square \)
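Unrolling the final estimate over the iterations makes the geometric rate explicit; as an immediate consequence, recorded here only for concreteness,

$$\begin{aligned} \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \le \mu ^{-n}\bigl (\varphi _u\tau ^{-1}\Vert u^{0}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2\bigr ) \quad \text {for all}\quad n \in \mathbb {N}. \end{aligned}$$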

3.3 Convergence: forward-inexact-forward-backward

We now prove the convergence of Algorithm 2.2. The overall structure and idea of the proofs follow Sect. 3.2 and use several lemmas from Sect. 3.1. We first prove a monotonicity estimate lemma for the adjoint step, and then show that a small enough step length in the outer problem guarantees that the inner and adjoint iterates stay in a small local neighbourhood if they are already in one. The main result of this subsection is in the final Theorem 3.21. It proves under Assumption 2.5 the linear convergence of \(\{(u^n, p^n, \alpha ^n)\}_{n \in \mathbb {N}}\) generated by Algorithm 2.2 to \((\widehat{u}, \widehat{p}, \widehat{\alpha })\) solving the first-order optimality condition (8).

Lemma 3.17

Let \(u\in U\), \(\alpha \in \mathscr {A}\) and \(p_1, p_2, \tilde{p} \in P.\) Moreover, suppose that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) and

$$\begin{aligned} \tilde{p} \nabla _u^2 F(u; \alpha ) + \nabla _{\alpha u}F(u; \alpha ) = 0 \end{aligned}$$
(33)

hold. Then

$$\begin{aligned} \langle \hspace{-2.0pt}\langle p_1 \nabla _{u}^2 F(u; \alpha ) +\nabla _{\alpha u} F(u; \alpha ), p_2-\tilde{p} \rangle \hspace{-2.0pt}\rangle \ge \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_1 - \tilde{p} \Vert \hspace{-1.0pt}|^2 + \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_2 - \tilde{p} \Vert \hspace{-1.0pt}|^2 - \frac{L_F}{2}\Vert \hspace{-1.0pt}|p_2- p_1 \Vert \hspace{-1.0pt}|^2. \end{aligned}$$

Proof

Following [4], using (33), the three-point identity (2) and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) gives for \(A :=\nabla _{u}^2 F(u; \alpha )\) the lower bound

$$\begin{aligned} \langle \hspace{-2.0pt}\langle&p_1 \nabla _{u}^2 F(u; \alpha ) +\nabla _{\alpha u} F(u; \alpha ), p_2-\tilde{p} \rangle \hspace{-2.0pt}\rangle \\&= \langle \hspace{-2.0pt}\langle (p_1-\tilde{p})\nabla _u^2 F(u; \alpha ), p_2-\tilde{p} \rangle \hspace{-2.0pt}\rangle \\&= \sum _{i\in I} \langle \nabla _{u}^2 F(u; \alpha )(p_1-\tilde{p})^*\varphi _i,(p_2-\tilde{p})^*\varphi _i\rangle \\&= \sum _{i\in I} \biggl ( \frac{1}{2} \Vert (p_1-\tilde{p})^*\varphi _i\Vert ^2_{A} - \frac{1}{2}\Vert (p_2-p_1)^*\varphi _i\Vert ^2_{A} + \frac{1}{2}\Vert (p_2-\tilde{p})^*\varphi _i\Vert ^2_{A} \biggr ) \\&\ge \sum _{i\in I}\left( \frac{\gamma _F}{2} \Vert (p_1-\tilde{p})^*\varphi _i\Vert ^2 - \frac{L_F}{2}\Vert (p_2-p_1)^*\varphi _i\Vert ^2 + \frac{\gamma _F}{2}\Vert (p_2-\tilde{p})^*\varphi _i\Vert ^2 \right) \\&= \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_2 - \tilde{p} \Vert \hspace{-1.0pt}|^2 + \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_1 - \tilde{p} \Vert \hspace{-1.0pt}|^2 - \frac{L_F}{2}\Vert \hspace{-1.0pt}|p_2- p_1 \Vert \hspace{-1.0pt}|^2. \end{aligned}$$

\(\square \)

Lemma 3.18

Let \(k \in \mathbb {N}\). Suppose Assumption 2.5 holds, and that Assumption 3.4 and

$$\begin{aligned} \Vert u^{n+1}-S_u(\alpha ^{n})\Vert \le C_u \Vert \alpha ^{n} - \widehat{\alpha }\Vert \, \text { and } \, \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^{n}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{n} - \widehat{\alpha }\Vert \end{aligned}$$
(34)

hold for \(n=k\). Then (34) holds for \(n = k+1.\)

Proof

Observe that (34) for \(n=k\) implies (21) as well as (18) for \(n=k\) and \(C=C_p\). Lemma 3.11 therefore proves the first part of (34) for \(n=k+1\), i.e.,

$$\begin{aligned} \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert \le C_u \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$
(35)

We still need to show the second part \(\Vert \hspace{-1.0pt}|p^{k+2}-\nabla _{\alpha }S_u(\alpha ^{k+1}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert \). We follow the fundamental argument of the testing approach (see the end of the proof of Lemma 3.14) and use Assumption 2.5 (ii) and (iii). For the latter we need \(\alpha ^k, \alpha ^{k+1}\in B(\widehat{\alpha }, 2r)\) and \(u^{k+2}, S_u(\alpha ^{k})\in B(\widehat{u}, r_u).\) We have \(\alpha ^{k}\in B(\widehat{\alpha },r_0)\) by Assumption 3.4 and \(\alpha ^{k+1}\in B(\widehat{\alpha },2r_0)\) by Lemma 3.11. Thus we may use the Lipschitz continuity of \(S_u\) with the triangle inequality and (35) to get \(S_u(\alpha ^k)\in B(S_u(\widehat{\alpha }), L_{S_u}r_0)\subset B(\widehat{u}, r_u)\) and

$$\begin{aligned} \begin{aligned} \Vert u^{k+2}-\widehat{u}\Vert&\le \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert + \Vert S_u(\alpha ^{k+1}) - S_u(\widehat{\alpha })\Vert \\&\le (C_u+ L_{S_u})\Vert \alpha ^{k+1} - \widehat{\alpha } \Vert \le (C_u+ L_{S_u})2r_0, \end{aligned} \end{aligned}$$

which yields \(u^{k+2}\in B(\widehat{u}, r_u).\) The definition of \(S_p\) in (5) implies

$$\begin{aligned} S_p(u^{k+2}, \alpha ^{k+1})\nabla _{u}^2 F(u^{k+2}; \alpha ^{k+1}) + \nabla _{\alpha u} F(u^{k+2}; \alpha ^{k+1})=0. \end{aligned}$$

Since also \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F\le L_F\cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{u}, r_u) \times B(\widehat{\alpha }, 2r)\) from Assumption 2.5 (iii), we get

(36)

from Lemma 3.17. By the p update of the FIFB in the implicit form (13), we have

$$\begin{aligned} p^{k+1} \nabla _{u}^2 F(u^{k+2}; \alpha ^{k+1}) +\nabla _{\alpha u} F(u^{k+2}; \alpha ^{k+1}) = -\theta ^{-1}(p^{k+2}-p^{k+1}). \end{aligned}$$

Combining with (36) gives

An application of the three-point identity (2) with \(\theta L_F \le 1\) from Assumption 2.5 (iv) now yields for \(C_{F,S} = \sqrt{(1+\theta \gamma _F)/(1-\theta \gamma _F)}\) the estimate

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{k+2}-S_p(u^{k+2},\alpha ^{k+1}) \Vert \hspace{-1.0pt}| \le C_{F,S}^{-1} \Vert \hspace{-1.0pt}|p^{k+1}-S_p(u^{k+2},\alpha ^{k+1}) \Vert \hspace{-1.0pt}|. \end{aligned}$$

This estimate and the triangle inequality give

(37)

The solution map \(S_u\) is Lipschitz in \(B(\widehat{\alpha }, 2r)\) and \(S_p\) is Lipschitz in \(B(\widehat{u}, r_u)\times B(\widehat{\alpha }, 2r)\) due to Assumption 2.5 (ii) and (iii), and Lemma 3.1. Combined with the triangle inequality, (34) for \(n=k\) and (35), we obtain

(38)

for

$$\begin{aligned} E_3 :=C_p \Vert \alpha ^{k} - \widehat{\alpha }\Vert + L_{S_p}(1+L_{S_u})\Vert \alpha ^{k+1} - \alpha ^{k}\Vert . \end{aligned}$$

Using again the Lipschitz continuity of \(S_p\) and (35), we get

$$\begin{aligned} E_2 \le L_{S_p}\Vert u^{k+2} - S_u(\alpha ^{k+1})\Vert \le L_{S_p} C_u \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$
(39)

Inserting (38) and (39) into (37) yields

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{k+2}-\nabla _{\alpha }S_u(\alpha ^{k+1}) \Vert \hspace{-1.0pt}| \le C_{F,S}^{-1} E_3 + (C_{F,S}^{-1} + 1) L_{S_p} C_u \Vert \alpha ^{k+1} - \widehat{\alpha } \Vert . \end{aligned}$$

Therefore the claim follows if we show that

$$\begin{aligned} C_{F,S}^{-1} E_3 \le (C_p - (C_{F,S}^{-1}+1) L_{S_p} C_u) \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$
(40)

Lemma 3.9 proves (22) with \(C=C_p\). Together with Assumption 2.5 (vii) it yields

$$\begin{aligned} \Vert \alpha ^{k+1} - \alpha ^k\Vert \le \sigma C_{\alpha } \Vert \alpha ^{k} - \widehat{\alpha }\Vert \le \frac{(C_{F,S}-1)C_p- (1+C_{F,S})L_{S_p}C_u}{(1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u} \Vert \alpha ^{k} - \widehat{\alpha }\Vert . \end{aligned}$$

Multiplying by \((1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u,\) rearranging terms, and continuing with the triangle inequality gives (40). Indeed,

$$\begin{aligned} E_3&\le C_{F,S}(C_p - (C_{F,S}^{-1}+1) L_{S_p} C_u)(\Vert \alpha ^{k} - \widehat{\alpha }\Vert - \Vert \alpha ^{k+1} - \alpha ^{k}\Vert ) \\&\le C_{F,S}(C_p - (C_{F,S}^{-1}+1) L_{S_p} C_u)\Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$

\(\square \)

We now show that the adjoint iterates stay local if the outer iterates do.

Again, by combining the previous lemmas, we prove an estimate from which local convergence is immediate. For this, we recall the definitions of the preconditioning and testing operators M and Z in (14b) and (12).

Lemma 3.19

Suppose Assumption 2.5 and 3.4, and the inner and adjoint exactness estimate (34) hold for \(n\in \mathbb {N}.\) Then

$$\begin{aligned} \Vert x^{n+1}-\widehat{x}\Vert _{ZM}^2 + 2\varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + 2\varepsilon _p\Vert p^{n}-\widehat{p}\Vert ^2 + 2\varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \le \Vert x^n-\widehat{x}\Vert _{ZM}^2 \end{aligned}$$
(41)

for

$$\begin{aligned} \begin{aligned} \varepsilon _u&:=\frac{\varphi _u\gamma _F(1-\kappa )}{2}-C_{\alpha ,1}> 0,&\varepsilon _p :=\frac{\varphi _p \gamma _F}{2}&> 0, \quad \text {and}\quad \\ \varepsilon _{\alpha }&:=\frac{\gamma _\alpha - C_{\alpha ,1} - \sqrt{C_{\alpha ,1}^2+4C_{\alpha ,2}^2}}{2} > 0, \end{aligned} \end{aligned}$$

where \(\varphi _u, \varphi _p>0\) are as in Assumption 2.5, \(C_{\alpha ,1} :=\varphi _p\tfrac{L_FL_{S_p}}{\gamma _F}\), and \(C_{\alpha ,2} :=\bigl (L_{\nabla J}N_p + L_{S_p} N_{\nabla J}\bigr ) \tfrac{C_u}{2}\).

Proof

We start by proving the monotonicity estimate

$$\begin{aligned} \langle ZH_{n+1}(x^{n+1}),x^{n+1}-\widehat{x}\rangle \ge \mathscr {V}_{n+1}(\widehat{x}) - \frac{1}{2}\Vert x^{n+1}-x^{n}\Vert ^2_{ZM} \end{aligned}$$
(42)

for \(\mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) = \varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + \varepsilon _p\Vert p^{n}-\widehat{p}\Vert ^2 + \varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\). We observe that \(\varepsilon _u,\varepsilon _p,\varepsilon _{\alpha }>0\) by Assumption 2.5. The monotonicity estimate (42) expands as

$$\begin{aligned} h_{n+1} \ge \mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) -\frac{\varphi _u}{2\tau }\Vert u^{n+1}-u^{n}\Vert ^2 -\frac{\varphi _p}{2\theta }\Vert \hspace{-1.0pt}|p^{n+1}-p^{n} \Vert \hspace{-1.0pt}|^2 -\frac{1}{2\sigma }\Vert \alpha ^{n+1}-\alpha ^{n}\Vert ^2\nonumber \\ \end{aligned}$$
(43)

for (all elements of the set)

$$\begin{aligned} \nonumber h_{n+1}:= \left\langle \begin{pmatrix} \varphi _u\nabla _{u}F(u^{n};\alpha ^{n}) \\ \varphi _p \left( p^{n} \nabla _{u}^2 F(u^{n+1};\alpha ^{n}) + \nabla _{\alpha u}F(u^{n+1};\alpha ^{n})\right) \\ p^{n+1}\nabla _u J(u^{n+1}) + \partial R(\alpha ^{n+1}) \end{pmatrix}, \begin{pmatrix} u^{n+1}-\widehat{u} \\ p^{n+1}-\widehat{p}\\ \alpha ^{n+1}-\widehat{\alpha } \end{pmatrix} \right\rangle . \end{aligned}$$

We estimate the three lines of \(h_{n+1}\) separately. We immediately take care of the first line by using (20) from Lemma 3.8.

For the second line, using the optimality condition (4) we have

(44)

We have \(u^{n+1}\), \(S_u(\alpha ^n)\in B(\widehat{u}, r_u)\), and \(\alpha ^n\in B(\widehat{\alpha }, r)\) by Lemma 3.3 and Assumption 3.4. Thus \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F\le L_F\cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{u}, r_u) \times B(\widehat{\alpha }, 2r)\) and \(\Vert \nabla _{u}^2 F(u^{n+1};\alpha ^{n})\Vert \le L_F\) by Assumption 2.5 (iii). We get

$$\begin{aligned} E_1 \ge \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 - \frac{L_F}{2}\Vert \hspace{-1.0pt}|p^{n+1} - p^{n} \Vert \hspace{-1.0pt}|^2 \end{aligned}$$
(45)

from Lemma 3.17. By Theorem 6.1 (i), \(\langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle \) is an inner product and \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) a norm on \(\mathbb {L}(U; \mathscr {A})\), so we can use the Cauchy–Schwarz inequality for them. Therefore, using also Theorem 6.1 (ii), Lemma 3.1 and Young's inequality, we can estimate

$$\begin{aligned} \begin{aligned} E_2&\ge - \big | \langle \hspace{-2.0pt}\langle (\widehat{p}-S_p(u^{n+1},\alpha ^{n})) \nabla _{u}^2 F(u^{n+1}; \alpha ^{n}), p^{n+1}-\widehat{p} \rangle \hspace{-2.0pt}\rangle \big | \\&\ge - \Vert \hspace{-1.0pt}|(\widehat{p}-S_p(u^{n+1},\alpha ^{n})) \nabla _{u}^2 F(u^{n+1}; \alpha ^{n}) \Vert \hspace{-1.0pt}| \Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}| \\&\ge - L_F\Vert \hspace{-1.0pt}|S_p(\widehat{u}, \widehat{\alpha })-S_p(u^{n+1},\alpha ^{n}) \Vert \hspace{-1.0pt}| \Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}| \\&\ge - L_FL_{S_p}\left( \Vert u^{n+1}-\widehat{u}\Vert + \Vert \alpha ^{n} - \widehat{\alpha }\Vert \right) \Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}| \\&\ge - \frac{L_FL_{S_p}}{\gamma _F}\left( \Vert u^{n+1}-\widehat{u}\Vert ^2 + \Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \right) - \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}|^2. \end{aligned} \end{aligned}$$
(46)

Inserting (45) and (46) into (44), we obtain

(47)

The factor of the last term equals \(C_{\alpha ,1}\) from Assumption 2.5 (v).

Since Assumption 3.4 and (34) hold, Lemma 3.12 gives (26) with \(C = C_p\) and any \(d>0\) for the third line of \(h_{n+1}\). Summing (20), (47) and (26) we finally deduce

We have

$$\begin{aligned} \frac{C_{\alpha ,2}}{d} = C_{\alpha ,1} + C_{\alpha ,2}d = \frac{C_{\alpha ,1}}{2} + \frac{\sqrt{C_{\alpha ,1}^2+4C_{\alpha ,2}^2}}{2} \quad \text {for}\quad d = \frac{-C_{\alpha ,1} + \sqrt{C_{\alpha ,1}^2+4C_{\alpha ,2}^2}}{2C_{\alpha ,2}}. \end{aligned}$$

Then also \(\frac{\gamma _\alpha }{2}- \frac{C_{\alpha ,2}}{d}=\varepsilon _{\alpha }\). It follows

Since \(\sigma <1/L_\alpha \) by Lemma 3.2, \(L_F/(2\kappa ) \le 1/\tau \) and \(\theta < 1/L_F\) by Assumption 2.5 (iv), we obtain (43), i.e., (42). We finish by applying the fundamental argument of the testing approach to (42) and the general implicit update (10), as in Lemma 3.14. \(\square \)

We simplify the assumptions of the previous lemma to just Assumption 2.5.

Lemma 3.20

Suppose Assumption 2.5 holds. Then (41) holds for any \(n\in \mathbb {N}\).

Proof

The claim readily follows if we prove by induction for all \(n\in \mathbb {N}\) that

$$\begin{aligned} \text {Assumption 3.4, (34), and (41) hold}. \end{aligned}$$
(**)

We first prove (**) for \(n=0.\) Assumption 2.5 (i) directly establishes (34). The definition of \(r_0\) in Assumption 2.5 also establishes that \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(u^n\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0).\) We have just proved the conditions of Lemma 3.5, which gives \(\Vert \hspace{-1.0pt}|p^1 \Vert \hspace{-1.0pt}|\le N_p\). Thus we have proved Assumption 3.4 for \(n=0.\) Now Lemma 3.19 proves (41) for \(n=0.\) This concludes the proof of the induction base.

We then make the induction assumption that (**) holds for \(n\in \{0,\ldots ,k\}\) and prove it for \(n=k+1.\) The induction assumption and Lemma 3.18 give (34) for \(n=k+1.\) The inequality (41) for \(n\in \{0,\ldots ,k\}\) also ensures \(\Vert x^{n+1}-\widehat{x}\Vert _{ZM} \le \Vert x^{n}-\widehat{x}\Vert _{ZM}\) for \(n\in \{0,\ldots ,k\}.\) Therefore Lemma 3.6 proves Assumption 3.4 for \(n=k+1\). Now Lemma 3.19 shows (41) and concludes the proof of (**) for \(n=k+1\). \(\square \)

We finally come to the main convergence result for the FIFB.

Theorem 3.21

Suppose Assumption 2.5 holds. Then \(\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \varphi _p\theta ^{-1}\Vert p^{n}-\widehat{p}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \rightarrow 0\) linearly.

Proof

We define \(\mu _1:= \min \{(1+2\varepsilon _u\varphi ^{-1}_u \tau ), (1+2\varepsilon _{\alpha }\sigma )\}\) and \(\mu _2:= 1 - 2\varepsilon _p\varphi ^{-1}_p\theta .\) Lemma 3.20, expansion of (41), and basic manipulation show that

$$\begin{aligned} \mu _1\bigl (\varphi _u\tau ^{-1}\Vert u^{n+1}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\bigr ) + \varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \\ \le \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2. \end{aligned}$$
(48)

There are two separate cases: (i) \(\mu _1 \mu _2 \le 1\) and (ii) \(\mu _1 \mu _2 > 1.\) In case (i), we have

$$\begin{aligned} \begin{aligned}&\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \\&= \mu ^{-1}_1 \bigl ( \mu _1 \bigl ( \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \bigr ) + \mu _1\mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \bigr ) \\&\le \mu ^{-1}_1 \bigl ( \mu _1 \bigl ( \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \bigr ) + \varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \bigr ), \end{aligned} \end{aligned}$$

which with (48) implies linear convergence since \(\mu ^{-1}_1\in (0,1).\) In case (ii), we obtain

$$\begin{aligned} \begin{aligned}&\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \\&\le \mu _2 \bigl ( \mu _1 \bigl ( \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \bigr ) + \varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \bigr ), \end{aligned} \end{aligned}$$

which with (48) implies linear convergence since \(\mu _2\in (0,1).\) \(\square \)

4 Numerical experiments

We evaluate the performance of our proposed algorithms on parameter learning for (anisotropic, smoothed) total variation image denoising and deconvolution. For a “ground truth” image \(b \in \mathbb {R}^{N^2}\) of dimensions \(N \times N\), we take

$$\begin{aligned} J(u) = \frac{1}{2}\Vert u-b\Vert _2^2 \end{aligned}$$

as the outer fitness function. For b we use a cropped portion of image 02 or 08 from the free Kodak dataset [49] converted to gray values in [0, 1]. The purpose of these numerical experiments is a simple performance comparison between our proposed methods and a few representative approaches from the literature. We therefore only consider a single ground-truth image b and a corresponding corrupted data z in the various inner problems, which we next describe. For proper generalizable parameter learning, multiple such training pairs \((b_i, z_i)\) should be used. This can in principle be done by summing over all the data in both the inner and outer problems, resulting in a higher-dimensional bilevel problem; see, e.g., [50]. In practice, a large sample count would require stochastic techniques.

4.1 Denoising

For denoising we take in (1) as the inner objective

$$\begin{aligned} F(u; \alpha ) = \frac{1}{2}\Vert u - z\Vert _2^2 + \alpha \Vert Du\Vert _{1,\gamma } \quad (u \in \mathbb {R}^{N^2}, \alpha \in [0, \infty )), \end{aligned}$$

and as the outer regulariser \(R \equiv 0\). The simulated measurement z is obtained from b by adding Gaussian noise of standard deviation 0.1. The matrix D is a backward difference operator with Dirichlet boundary conditions. Instead of the one-norm \(\Vert \cdot \Vert _1\), we use a \(C^2\) Huber- or Moreau–Yosida-type approximation, which ensures the twice differentiability of the objective and hence a simple adjoint equation, with

$$\begin{aligned} \Vert y\Vert _{1, \gamma }:= \sum _{i=1}^{2N^2} \rho _{\gamma }(y_i), \quad \text {where}\quad \rho _{\gamma }(x) :={\left\{ \begin{array}{ll} -\frac{|x|^3}{3\gamma ^2} + \frac{|x|^2}{\gamma } &{} \text {if } \, |x|\le \gamma , \\ |x| - \frac{\gamma }{3} &{} \text {if } \, |x| > \gamma . \end{array}\right. } \end{aligned}$$

We used \(\gamma =10^{-4}\) in our experiments (Fig. 1).
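To make the inner problem concrete, the following minimal sketch (not the authors' implementation, which is the Matlab code on Zenodo [54]) evaluates \(\rho _{\gamma }\) and the denoising objective; the construction of the difference matrix D and the vectorisation order of the image are illustrative assumptions.

```python
# Minimal sketch of the smoothed-TV denoising inner objective of Sect. 4.1.
# The difference matrix D (backward differences, Dirichlet boundary) and the
# vectorisation of the N-by-N image are illustrative assumptions.
import numpy as np
import scipy.sparse as sp

def difference_operator(N):
    """Stacked backward differences D : R^{N^2} -> R^{2 N^2}."""
    d1 = sp.eye(N) - sp.eye(N, k=-1)        # 1D backward difference, u_{-1} = 0
    I = sp.identity(N)
    Dx = sp.kron(I, d1)                     # differences along one image axis
    Dy = sp.kron(d1, I)                     # differences along the other axis
    return sp.vstack([Dx, Dy]).tocsr()

def huber(y, gamma=1e-4):
    """The C^2 smoothing rho_gamma of |.| together with its first two derivatives."""
    a = np.abs(y)
    inside = a <= gamma
    val = np.where(inside, -a**3/(3*gamma**2) + a**2/gamma, a - gamma/3)
    d1  = np.where(inside, np.sign(y)*(-a**2/gamma**2 + 2*a/gamma), np.sign(y))
    d2  = np.where(inside, -2*a/gamma**2 + 2/gamma, 0.0)
    return val, d1, d2

def F(u, alpha, z, D, gamma=1e-4):
    """Inner objective 0.5*||u - z||^2 + alpha*||D u||_{1,gamma}."""
    val, _, _ = huber(D @ u, gamma)
    return 0.5*np.dot(u - z, u - z) + alpha*val.sum()
```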

Fig. 1 Denoising data and results for the implicit and FIFB methods (\(N=128\))

4.2 Deconvolution

For deconvolution, we take as the inner objective

$$\begin{aligned} F(u; \alpha ) = \frac{1}{2}\Vert K(\alpha ) * u-z\Vert _2^2 + C\alpha _1 \Vert Du\Vert _{1,\gamma }, \quad (u \in \mathbb {R}^{N^2}, \alpha \in [0, \infty )^4 ), \end{aligned}$$

and as the outer regulariser \(R(\alpha )=\beta (\sum _{i=2}^4\alpha _i-1)^2+\delta _{[0, \infty )}(\alpha _1)\) for a regularisation parameter \(\beta =10^4\). We introduce the constant \(C=\tfrac{1}{10}\) to help convergence by ensuring the same order of magnitude for all components of \(\alpha \). The first element of \(\alpha \) is the total variation regularisation parameter, while the rest parametrise the convolution kernel \(K(\alpha )\) as illustrated in Fig. 2a. The sum of the elements of the kernel equals \(\alpha _2 + \alpha _3 + \alpha _4.\) The operator \(r_{\theta }\) rotates an image by \(\theta \) degrees, clockwise for \(\theta >0\) and counterclockwise for \(\theta <0.\) We form z by computing \(r_{-1}(K(\alpha )*r_1(b))\) for kernel parameters \((\alpha _2, \alpha _3, \alpha _4) = (0.15, 0.1, 0.75)\) and adding Gaussian noise of standard deviation \(1\cdot 10^{-2}\).
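The kernel parametrisation \(K(\alpha )\) itself is specified only through Fig. 2a, so we do not reproduce it here. For the outer regulariser R alone, the following sketch (our own illustration, not taken from the implementation) evaluates R and a closed-form proximal step for it, as could be used for the backward \(\alpha \)-step; the prox formula follows by solving the blockwise optimality conditions.

```python
# Sketch (illustration only) of the deconvolution outer regulariser
# R(alpha) = beta*(alpha_2 + alpha_3 + alpha_4 - 1)^2 + indicator_{[0,inf)}(alpha_1)
# and a closed-form prox that could serve as the backward alpha-step.
import numpy as np

BETA = 1e4

def R(alpha):
    if alpha[0] < 0:
        return np.inf                        # indicator of [0, inf) on alpha_1
    return BETA * (alpha[1:].sum() - 1.0)**2

def prox_R(y, sigma):
    """argmin_a  sigma*R(a) + 0.5*||a - y||^2, solved blockwise."""
    a = np.asarray(y, dtype=float).copy()
    a[0] = max(a[0], 0.0)                    # projection onto [0, inf) for alpha_1
    # Quadratic block: a_i + 2*sigma*BETA*(s - 1) = y_i for i = 2,3,4,
    # where s = a_2 + a_3 + a_4; summing the three equations determines s.
    c = 2.0 * sigma * BETA
    s = (y[1:].sum() + 3.0 * c) / (1.0 + 3.0 * c)
    a[1:] = y[1:] - c * (s - 1.0)
    return a
```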

Fig. 2 Deconvolution kernel parametrisation, data, and result for FIFB (\(N=128\))

For denoising, and for deconvolution assuming \(\ker D \cap \ker K(\alpha ) = \{0\}\), it is not difficult to verify the structural parts of Assumption 2.2 and 2.5, required for the convergence results of Sect. 3. We do not attempt to verify the conditions on the step lengths, choosing them by trial and error.

4.3 An implicit baseline method

We evaluate Algorithms 2.1 and 2.2 against a conventional method that solves both the inner problem and the adjoint equation (near-)exactly, as well as the AID [26]. We also experimented with solving the equivalent constrained optimisation problem \(\min _{\alpha , u} J(u)\) subject to \(\nabla _u F(u; \alpha )=0\) with IPOPT [51] and the NL-PDPS [52, 53]. However, we did not observe convergence without the inclusion of additional \(H^1\) regularisation in the inner problem, as in, e.g., [7]. Since that changes the problem, we have not included “whole problem” approaches in our comparison.

To solve the inner problem in the implicit baseline method, we use gradient descent, starting at \(v^0=0\) and updating \(v^{m+1} :=v^m - \tau _m \nabla F(v^m; \alpha ^k)\). We then set \(u^{k+1}=v^{m+1}\). The adjoint and outer iterate updates are as in Algorithm 2.1; however, we determine \(\sigma =\sigma _k\) by the line search rule [19, (12.41)] for nonsmooth problems, starting at \(\sigma _k=5\cdot 10^{-5}\) and multiplying by 0.1 on each line search step. For deconvolution we use a fixed step length parameter, as it performed better. The specific parameter choices (step lengths, number of inner and adjoint iterations) for all algorithms and experiments are listed in Table 1.
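As a minimal sketch of this inner solve, with a constant step length and a fixed iteration count standing in for the choices of Table 1, and with grad_F assumed to evaluate \(\nabla _u F(\,\varvec{\cdot }\,; \alpha )\):

```python
# Sketch of the baseline's inner solve: gradient descent on u -> F(u; alpha^k)
# started from zero.  The constant step length tau and iteration count m_max
# are placeholders for the actual choices in Table 1.
import numpy as np

def inner_solve(grad_F, alpha_k, n_unknowns, tau, m_max):
    v = np.zeros(n_unknowns)                 # v^0 = 0
    for _ in range(m_max):
        v = v - tau * grad_F(v, alpha_k)     # v^{m+1} = v^m - tau_m grad_u F(v^m; alpha^k)
    return v                                 # taken as u^{k+1}
```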

Table 1 Algorithm parametrisation (step length parameters, inner and adjoint iteration counts), time multiplier, and outer steps taken to reach threshold computational resources (CPU time) value
Fig. 3 Graph of the composed objective for the denoising problem (both problem sizes), along with the near-optimal \(\tilde{\alpha }\) found by recursive subdivision

4.4 Numerical setup

Our algorithm implementations are available on Zenodo [54]. To solve the adjoint equation in the FEFB and implicit methods, we use Matlab’s bicgstab implementation of the stabilized biconjugate gradient method [55] with tolerance \(10^{-5}\), and maximum iteration count \(10^{3}\). With the AID we use 50 conjugate gradient iterations. These choices, as well as the choice of the number of inner iterations for the implicit method and the AID, have been made by trial and error to be as small as possible while obtaining an apparently stable algorithm.
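For the denoising problem, where \(\alpha \) is scalar, the adjoint equation \(p\nabla _u^2F(u;\alpha ) + \nabla _{\alpha u}F(u;\alpha ) = 0\) is a single symmetric linear system. The following sketch solves it with SciPy's bicgstab standing in for Matlab's; the derivatives are our own computation from the F of Sect. 4.1, and the \(10^{-5}\) tolerance of the experiments would be passed through the tolerance keyword of the installed SciPy version.

```python
# Sketch of the adjoint solve for denoising (scalar alpha), with SciPy's
# bicgstab in place of Matlab's.  difference_operator and huber are as in the
# earlier sketch; the derivatives of F are our own computation.
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

def adjoint_solve(u, alpha, D, gamma=1e-4):
    _, d1, d2 = huber(D @ u, gamma)
    # Hessian of F in u:  I + alpha * D^T diag(rho'') D
    H = sp.identity(D.shape[1]) + alpha * (D.T @ sp.diags(d2) @ D)
    # Mixed derivative grad_{alpha u} F = D^T rho'(D u); the adjoint p solves
    # H p = -grad_{alpha u} F.
    rhs = -(D.T @ d1)
    p, info = bicgstab(H, rhs, maxiter=1000)
    assert info == 0, "bicgstab did not converge"
    return p
```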

To evaluate scalability, we consider for denoising both \(N=128\) and \(N=256\). For deconvolution we consider \(N=128\) and \(N=32.\) We take initial \(u^0 = S_u(\alpha ^0)\) and \(p^0 = S_p(u^0, \alpha ^0)\) where for denoising \(\alpha ^0=0\) and for deconvolution \(\alpha ^{0}=[0.4, 0.25, 0.25, 0.5]\) and \(\alpha ^{0}=[0.04, 0.25, 0.25, 0.5]\) with \(N=128\) and \(N=32\) respectively.

To compare algorithm performance, we plot relative errors against the cputime value of Matlab on an AMD Ryzen 5 5600 H CPU. We call this value “computational resources”, as it takes into account the use of several CPU cores by Matlab's internal linear algebra, making it fairer than the actual running time. For each algorithm and problem, we indicate in Table 1 the step length parameters, the number of outer steps to reach the computational resources value of 6000 for denoising and 15,000 for deconvolution, and an average multiplier to convert computational resources into running times.

Fig. 4 Denoising performance. The graphs correspond to the FIFB (green line), FEFB (orange line), AID (pink line), and implicit method (violet line)

Fig. 5 Deconvolution performance. The graphs correspond to the FIFB (green line), FEFB (orange line), AID (pink line), and implicit method (violet line). For the inner problem, we additionally plot the relative error of \(S_u(\alpha ^k)\) for the FIFB (green dotted line)

For performance comparison, we need estimates \(\tilde{\alpha }\) and \(\tilde{u}\) of the optimal \(\hat{\alpha }\) and \(\hat{u}=S_u(\hat{\alpha })\). For denoising we find them by searching for the one-dimensional variable \(\alpha \) on a regular grid and recursively subdividing until the node spacing goes below \(10^{-5}\). As \(\tilde{u}\), we take an estimate of \(S_u(\tilde{\alpha })\) obtained with 25,000 steps of the implicit baseline method. We visualise the so-obtained \(\tilde{\alpha }\) and \(J \circ S_u\) in Fig. 3. For the higher-dimensional deconvolution problem, such a scan is not feasible. Instead, we obtain the comparison estimates by running the implicit method from a very good initial iterate until the computational resources (CPU time) value of 6000 for \(N=32\) and 10,000 for \(N=128\). Specifically, we initialise the kernel parameters \((\alpha _2,\alpha _3,\alpha _4)\) as those used for generating the data z, and the regularisation parameter \(\alpha _1 = 0.045\) for \(N=32\) and \(\alpha _1=0.02\) for \(N=128\), the latter found by trial and error. This initialisation is different from that used for the actual numerical experiments; see above. Our experiments indicate that the other methods approach the \(\tilde{\alpha }\) so obtained faster than the implicit method itself, providing some justification for the choice.

With these solution estimates we define the inner and outer relative errors

$$\begin{aligned} e_{\alpha ,\text {rel}} :=\tfrac{\Vert \tilde{\alpha }- \alpha ^k\Vert }{\Vert \tilde{\alpha }\Vert } \quad \text {and}\quad e_{u,\text {rel}} :=\tfrac{\Vert \tilde{u} - u^k\Vert }{\Vert \tilde{u}\Vert }. \end{aligned}$$
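In code, with the estimates \(\tilde{\alpha }\) and \(\tilde{u}\) above, these are simply:

```python
# Inner and outer relative errors used in the performance plots.
import numpy as np

def relative_errors(alpha_k, u_k, alpha_tilde, u_tilde):
    e_alpha = np.linalg.norm(alpha_tilde - alpha_k) / np.linalg.norm(alpha_tilde)
    e_u = np.linalg.norm(u_tilde - u_k) / np.linalg.norm(u_tilde)
    return e_alpha, e_u
```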

4.5 Results

We report performance in Figs. 4 and 5 and the image data and reconstructions in Figs. 1 and 2. Figure 5 indicates that for deconvolution the FIFB significantly outperforms the other methods. The outer variable converges much faster than for the other evaluated methods, despite the fact that the inner variable, especially with \(N=32\), stays some distance away from \(\tilde{u}\). However, as the dotted line indicates, the exact solution \(S_u(\alpha ^k)\) of the inner problem for the corresponding outer iterate shows clear signs of convergence. (The few “spikes” in the graph for \(N=128\) temporarily have the regularisation parameter \(\alpha ^k_0\) much closer to zero than \(\tilde{\alpha }_0\).) This observation justifies the intuition that the inner problem does not need to be solved to high accuracy to obtain convergence for the outer problem, and that such accurate solutions can even be detrimental to convergence. The exact solution of the adjoint equation in both the implicit method and the FEFB causes them to be too slow to make any meaningful progress. The denoising experiments of Fig. 4 likewise suggest that the FIFB is initially the best performing algorithm, although the implicit method and the AID catch up later. On the small denoising problem (\(N=128\)), the implicit method is significantly faster than any other method. Overall, and for practical purposes, nevertheless, the FIFB appears to perform the best.