1 Introduction

Two general approaches are typical for the solution of the bilevel optimization problem

$$\begin{aligned} \min _{\alpha \in \mathscr {A}}~J(S_u(\alpha )) + R(\alpha ) \quad \text {with}\quad S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits _{u\in U} F(u; \alpha ) \end{aligned}$$
(1)

in Hilbert spaces \(\mathscr {A}\) and U. The first, familiar from the treatment of general mathematical programs with equilibrium constraints (MPECs), is to write out the Karush–Kuhn–Tucker conditions for the whole problem in a suitable form, and to apply a Newton-type method or other nonlinear equation solver to them [1,2,3,4,5].

The second approach, common in the application of (1) to inverse problems and imaging [6,7,8,9,10,11,12], treats the solution mapping \(S_u\) as an implicit function. On each outer iteration k it is then necessary to (i) solve the inner problem \(\min _u F(u; \alpha ^k)\) near-exactly using an optimization method of choice; (ii) solve an adjoint equation to calculate the gradient of the solution mapping; and (iii) apply another optimization method of choice to the outer problem \(\min _\alpha J(S_u(\alpha ))\) using the knowledge of \(S_u(\alpha ^k)\) and \(\nabla S_u(\alpha ^k)\). The inner problem is therefore generally assumed to have a unique solution, and the solution map to be differentiable. An algorithm for nonsmooth inner problems has been developed in [13], while [14] rely on proving directional Bouligand differentiability for otherwise nonsmooth problems.

The challenge of the first “whole-problem” approach is to scale it to large problems, typically involving the inversion of large matrices. The difficulty with the second “implicit function” approach is that the inner problem needs to be solved several times, which can be expensive. Solving the adjoint equation also requires matrix inversion. The variant in [15] avoids this through derivative-free methods for the outer problem. It also solves the inner problem to a low but controlled accuracy.
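To make the cost structure of the implicit-function approach concrete, the following minimal Python sketch (our own toy illustration with a scalar parameter and a quadratic inner problem; all names, dimensions, and step lengths are our assumptions, not taken from the cited works) performs the three stages on each outer iteration. The near-exact inner solve and the adjoint solve inside the loop are precisely the repeated linear-algebra costs discussed above.

```python
import numpy as np

# Toy instance: inner problem F(u; a) = 0.5*||u - z||^2 + 0.5*a*||D u||^2 with a
# scalar parameter a >= 0, outer fitness J(u) = 0.5*||u - b||^2, and R = 0.
n = 50
rng = np.random.default_rng(0)
b = np.sin(np.linspace(0.0, 3.0, n))            # "ground truth"
z = b + 0.1 * rng.standard_normal(n)            # noisy data
D = np.eye(n, k=1) - np.eye(n)                  # forward differences

def solve_inner(a):                             # S_u(a); here a closed-form linear solve
    return np.linalg.solve(np.eye(n) + a * D.T @ D, z)

def implicit_function_method(a=1.0, sigma=0.5, outer_iters=200):
    """Double-loop "implicit function" approach: (i) near-exact inner solve,
    (ii) adjoint solve with the inner Hessian, (iii) outer gradient step."""
    for _ in range(outer_iters):
        u = solve_inner(a)                                  # (i) inner problem
        hess = np.eye(n) + a * D.T @ D                      # nabla_u^2 F(u; a)
        mixed = D.T @ D @ u                                 # nabla_{alpha u} F(u; a)
        p = -np.linalg.solve(hess, mixed)                   # (ii) adjoint equation
        a = max(a - sigma * float(p @ (u - b)), 0.0)        # (iii) outer gradient step
    return a

print(implicit_function_method())
```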

In this paper, by preconditioning the implicit-form first-order optimality conditions, we develop an intermediate approach more efficient than the aforementioned, as we demonstrate in the numerical experiments of Sect. 4. It can be summarized as (i) take only one step of an optimization method on the inner problem, (ii) perform a cheap operation to advance towards the solution of the adjoint equation, and, finally, (iii) using this approximate information, take one step of an optimization method for the outer problem. Repeat.

The preconditioning, which we introduce in detail in Sect. 2, is based on insight from the derivation of the primal-dual proximal splitting of [16] as a preconditioned proximal point method [17,18,19]. We write the optimality conditions for (1) as the inclusion \(0 \in H(x)\) for a set-valued H, where \(x=(u,p,\alpha )\) for an adjoint variable p. The basic proximal point method then iteratively solves \(x^{k+1}\) from

$$\begin{aligned} 0 \in H(x^{k+1}) + (x^{k+1}-x^k). \end{aligned}$$

This can be as expensive as solving the original optimality condition. The idea then is to introduce a preconditioning operator M that decouples the components of x—in our case u, p and \(\alpha \)—such that each component can be solved in succession from

$$\begin{aligned} 0 \in H(x^{k+1}) + M(x^{k+1}-x^k). \end{aligned}$$

Gradient steps can be handled through nonlinear preconditioning [18, 19], as we will see in Sect. 2 when we develop the approach in detail along with two more specific algorithms, the FEFB (Forward-Exact-Forward-Backward) and the FIFB (Forward-Inexact-Forward-Backward). In Sect. 3 we prove their local linear convergence under a second-order growth condition on the composed objective \(J \circ S_u\), and other more technical conditions. The proof is based on the “testing” approach developed in [18] and also employed extensively in [19, 20]. Finally, we evaluate the numerical performance of the proposed schemes on imaging applications in Sect. 4, specifically the learning of a regularization parameter for total variation denoising, and the convolution kernel for deblurring. Since the purpose of these experiments is a simple performance comparison between different algorithms, instead of real applications, we only use a single training sample of various dimensions, as explained in Sect. 4.

Intermediate approaches, some reminiscent of ours, have recently also been developed in the machine learning community. Our approach, however, allows a non-smooth function R in the outer problem (1). Moreover, to our knowledge, our work is the first to show linear convergence for a fully “single-loop” algorithm. To be more precise, the STABLE [21], TTSA [22], FLSA [23], MRBO, VRBO [24], and SABA [25] are “single-loop” algorithms like ours, taking only a single step towards the solution of the inner problem on each outer iteration. The STABLE requires solving the adjoint equation exactly, as does our first approach, the FEFB. The others use a Neumann series approximation for the adjoint equation. Our second approach, the FIFB, takes a simple step reminiscent of gradient descent for the adjoint equation. The TTSA and STABLE obtain sublinear convergence of the outer iterates \(\{\alpha ^k\}_{k \in \mathbb {N}}\) assuming strong convexity (second-order growth) of both the inner and outer objective. For the SABA, linear convergence is similarly claimed with the outer strong convexity replaced by a Polyak-Łojasiewicz inequality. Without either of those assumptions, the theoretical results on the aforementioned methods from the literature are much weaker, and generally only show various forms of “stall” of the iterates at a sublinear rate, or the ergodic convergence of the gradient \(\nabla _{\alpha }[J\circ S_u](\alpha ^k)\) of the composed objective to zero. Such modes of convergence say very little about the convergence of function values to the optimum or of the iterates to a solution.

In the context of not fully single-loop algorithms, the AID, ITD [26], AccBio [27], and ABA [28] take a fixed (small) number of inner iterations on each outer iteration. For the AID and ITD, only sublinear convergence of the composed gradient is claimed. For the ABA and AccBio, linear convergence of outer function values is claimed under strong convexity of both the inner and outer objectives.

1.1 Fundamentals and applications

Fundamentals of MPECs and bilevel optimization are treated in the books [29,30,31,32]. An extensive literature review up to 2018 can be found in [33], and recent developments in [34]. Optimality conditions for bilevel problems, both necessary and sufficient, are developed in, e.g., [35,36,37,38,39]. A more limited type of “bilevel” problems only constrains \(\alpha \) to lie in the set of minimisers of another problem. Algorithms for such problems are treated in [40, 41].

Bilevel optimization has been used for learning regularization parameters and forward operators for inverse imaging problems. With total variation regularization in the inner problem, the parameter learning problem in its most basic form reads [7]

$$\begin{aligned} \min _{\alpha }~\frac{1}{2}\Vert S_u(\alpha )-b\Vert ^2 + R(\alpha ) \quad \text {with}\quad S_u(\alpha ) = \mathop {\mathrm {arg\,min}}\limits _{u\in U} \frac{1}{2}\Vert A_\alpha u-z\Vert ^2 + \alpha _1\Vert \nabla u\Vert _1. \end{aligned}$$

This problem finds the best possible \(\alpha \) for reconstructing the “ground truth” image b from the measurement data z, which may be noisy and possibly transformed and only partially known through the forward operator \(A_\alpha \), mapping images to measurements. To generalize to multiple images, the outer problem would sum over them and corresponding inner problems [12]. Multi-parameter regularization is discussed in [42], and natural conditions for \(\alpha >0\) in [43]. In other works, the forward operator \(A_\alpha \) is learned for blind image deblurring [44] or undersampling in magnetic resonance imaging [11]. In [8] regularization kernels are learned, while [14, 45] study the learning of optimal discretisation schemes. To circumvent the non-differentiability of \(S_u\), [46] replace the inner problem with a fixed number of iterations of an algorithm. Their approach has connections to the learning of deep neural networks.
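The inner problem above is nonsmooth because of the 1-norm of the image gradient; a common way to fit it into the twice-differentiable framework used in Sect. 2 is to smooth that norm. The following sketch (our own illustration; the Charbonnier-type smoothing parameter gamma, the discretization, and all names are assumptions, not taken from the text) evaluates such a smoothed total variation inner objective and its gradient for a 2-D image, together with a finite-difference check of the gradient.

```python
import numpy as np

def tv_inner_value_and_gradient(u, z, alpha, gamma=1e-2):
    """F(u; alpha) = 0.5*||u - z||^2 + alpha * sum_ij sqrt(|grad u|_ij^2 + gamma^2)
    and its gradient for a 2-D image u; gamma > 0 smooths the 1-norm of the
    image gradient so that F(., alpha) is twice differentiable."""
    dx = np.diff(u, axis=0, append=u[-1:, :])     # forward differences with
    dy = np.diff(u, axis=1, append=u[:, -1:])     # replicated (Neumann) boundary
    mag = np.sqrt(dx**2 + dy**2 + gamma**2)
    value = 0.5 * np.sum((u - z)**2) + alpha * np.sum(mag)
    px, py = dx / mag, dy / mag
    # discrete divergence (negative adjoint of the difference operator) of (px, py):
    div = (px - np.roll(px, 1, axis=0)) + (py - np.roll(py, 1, axis=1))
    grad = (u - z) - alpha * div
    return value, grad

# quick finite-difference check of the gradient
rng = np.random.default_rng(0)
u, z = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
e = np.zeros((8, 8)); e[3, 4] = 1.0
f0, g = tv_inner_value_and_gradient(u, z, alpha=0.1)
f1, _ = tv_inner_value_and_gradient(u + 1e-6 * e, z, alpha=0.1)
print((f1 - f0) / 1e-6, g[3, 4])                  # the two numbers should agree closely
```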

Bilevel problems can also be seen as leader–follower or Stackelberg games: the outer problem or agent leads by choosing \(\alpha \), and the inner agent reacts with the best possible u for that \(\alpha \). Multiple-agent Nash equilibria may also be modeled as bilevel problems. Both types of games can be applied to financial markets and resource use planning; we refer to the aforementioned books [29,30,31,32] for specific examples.

1.2 Notation and basic concepts

We write \(\mathbb {L}(X; Y)\) for the space of bounded linear operators between the normed spaces X and Y and \({{\,\textrm{Id}\,}}\) for the identity operator. Generally X will be Hilbert, so we can identify it with the dual \(X^*\).

For \(G \in C^1(X)\), we write \(G'(x) \in X^*\) for the Fréchet derivative at x, and \(\nabla G(x) \in X\) for its Riesz representation, i.e., the gradient. For \(E \in C^1(X; Y)\), since \(E'(x) \in \mathbb {L}(X; Y)\), we use the Hilbert adjoint to define \(\nabla E(x) :=E'(x)^* \in \mathbb {L}(Y; X)\). Then the Hessian \(\nabla ^2 G(x) :=\nabla [\nabla G](x) \in \mathbb {L}(X; X)\). When necessary we indicate the differentiation variable with a subscript, e.g., \(\nabla _u F(u, \alpha )\). For convex \(R: X \rightarrow \overline{\mathbb {R}}\), we write \({{\,\textrm{dom}\,}}R\) for the effective domain and \(\partial R(x)\) for the subdifferential at x. With slight abuse of notation, we identify \(\partial R(x)\) with the set of Riesz representations of its elements. We define the proximal operator as \({{\,\textrm{prox}\,}}_R(x) :=\mathop {\mathrm {arg\,min}}\limits _z \frac{1}{2}\Vert z-x\Vert ^2 + R(z)=({{\,\textrm{Id}\,}}+\partial R)^{-1}(x)\).
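For instance, for \(R = \beta \Vert \,\varvec{\cdot }\,\Vert _1\) on \(\mathbb {R}^n\) the proximal operator is the componentwise soft-thresholding map; a minimal sketch of this standard fact (our own code, not from the text):

```python
import numpy as np

def prox_l1(x, t):
    """prox_{t*||.||_1}(x) = argmin_z 0.5*||z - x||^2 + t*||z||_1  (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))   # [ 1.  -0.   0.2]
```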

We write \(\langle x,y\rangle \) for an inner product, and \(B(x, r)\) for a closed ball in a relevant norm \(\Vert \,\varvec{\cdot }\,\Vert \). For self-adjoint positive semi-definite \(M\in \mathbb {L}(X; X)\) we write \(\Vert x\Vert _{M} :=\sqrt{\langle x,x\rangle _{M}} :=\sqrt{\langle Mx,x\rangle }.\) Pythagoras’ or three-point identity then states

$$\begin{aligned} \langle x-y,x-z\rangle _{M} = \frac{1}{2}\Vert x-y\Vert ^2_M - \frac{1}{2}\Vert y-z\Vert ^2_M + \frac{1}{2}\Vert x-z\Vert ^2_M \end{aligned}$$
(2)

for all \(x,y,z\in X\). We extensively use Young’s inequality

$$\begin{aligned} \langle x,y\rangle \le \frac{a}{2}\Vert x\Vert ^2 + \frac{1}{2a}\Vert y\Vert ^2 \qquad \text {for all } x,y\in X,\, a > 0. \end{aligned}$$

We sometimes apply operations on \(x \in X\) to all elements of a set \(A \subset X\), writing \(\langle x+A,z\rangle :=\{\langle x+a,z\rangle \mid a \in A \}\), and for \(B \subset \mathbb {R}\), writing \(B \ge c\) if \(b \ge c\) for all \(b \in B.\)

2 Proposed methods

We now present our proposed methods for (1). They are based on taking a single gradient descent step for the inner problem, and using forward-backward splitting for the outer problem. The two methods differ in how an “adjoint equation” is handled. We present the algorithms and assumptions required to prove their convergence in Sects. 2.2 and 2.3 after deriving optimality conditions and the adjoint equation in Sect. 2.1. We prove convergence in Sect. 3.

2.1 Optimality conditions

Suppose \(u\mapsto F(u;\alpha )\in C^2(U)\) is proper, coercive, and weakly lower semicontinuous for each outer variable \(\alpha \in {{\,\textrm{dom}\,}}R \subset \mathscr {A}\). Then the direct method of the calculus of variations guarantees the inner problem \(\min _u F(u; \alpha )\) to have a solution. If, further, \(u\mapsto F(u;\alpha )\) is strictly convex, the solution is unique so that the solution mapping \(S_u\) from (1) is uniquely determined.

Suppose further that \(F, \nabla F\) and \(S_u\) are Fréchet differentiable. Writing \(T(\alpha ) :=(S_u(\alpha ), \alpha )\), Fermat’s principle and \(S_u(\tilde{\alpha }) \in \mathop {\mathrm {arg\,min}}\limits _u F(u; \tilde{\alpha })\) then show that

$$\begin{aligned}{}[\nabla _{u} F\circ T] (\alpha )= \nabla _{u} F(S_u(\alpha ); \alpha ) =0 \end{aligned}$$
(3)

for \(\alpha \) near \(\tilde{\alpha }\). Therefore, the chain rule for Fréchet differentiable functions yields

$$\begin{aligned} 0=\nabla _{\alpha }[\nabla _{u} F \circ T](\alpha ) = \nabla _{\alpha }S_u(\alpha ) \nabla _{u}^2 F(T(\alpha )) + \nabla _{\alpha u} F(T(\alpha )). \end{aligned}$$

That is, for \(u=S_u(\alpha )\), the operator \(p=\nabla _{\alpha }S_u(\alpha )\) solves the adjoint equation

$$\begin{aligned} 0=p \nabla _{u}^2 F(u; \alpha ) + \nabla _{\alpha u} F(u; \alpha ). \end{aligned}$$
(4)

We introduce the corresponding solution mapping for the adjoint variable p,

$$\begin{aligned} S_p(u,\alpha ) := - \nabla _{\alpha u} F(u; \alpha ) \left( \nabla _u^2 F(u; \alpha )\right) ^{-1}. \end{aligned}$$
(5)

We will later make assumptions that ensure that \(S_p\) is well-defined.
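In finite dimensions, say \(U = \mathbb {R}^n\) and \(\mathscr {A}= \mathbb {R}^m\), the adjoint variable is an \(m \times n\) matrix and (5) can be evaluated with a linear solve against the Hessian instead of forming an explicit inverse. A minimal sketch (shapes and names are our assumptions):

```python
import numpy as np

def S_p(hess_uu, mixed_alpha_u):
    """Adjoint solution map (5), S_p(u, alpha) = -nabla_{alpha u} F (nabla_u^2 F)^{-1},
    with nabla_u^2 F an (n, n) matrix and nabla_{alpha u} F an (m, n) matrix;
    computed via a linear solve with the transposed Hessian."""
    return -np.linalg.solve(hess_uu.T, mixed_alpha_u.T).T
```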

Since \(S_u: \mathscr {A}\rightarrow U\), the Fréchet derivative \(S_u'(\alpha ) \in \mathbb {L}(\mathscr {A}; U)\) and the Hilbert adjoint \(\nabla _\alpha S_u(\alpha ) \in \mathbb {L}(U; \mathscr {A})\) for all \(\alpha \). Consequently \(p \in \mathbb {L}(U; \mathscr {A})\), but we will need p to lie in an inner product space. Assuming \(\mathscr {A}\) to be a separable Hilbert space, we introduce such a structure

$$\begin{aligned} P=(\mathbb {L}(U; \mathscr {A}), \langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle ) \end{aligned}$$
(6a)

by using a countable orthonormal basis \(\{\varphi _i\}_{i\in I}\) of \(\mathscr {A}\) to define the inner product

$$\begin{aligned} \langle \hspace{-2.0pt}\langle p_1, p_2 \rangle \hspace{-2.0pt}\rangle :=\sum _{i\in I} \langle p_1^* \varphi _i,p_2^* \varphi _i\rangle = \sum _{i\in I} \langle \varphi _i,p_1 p_2^* \varphi _i\rangle . \quad (p_1, p_2 \in \mathbb {L}(U; \mathscr {A})). \end{aligned}$$
(6b)

We briefly study this inner product and the induced norm \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) in the appendix.
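For finite-dimensional \(\mathscr {A}= \mathbb {R}^m\) and \(U = \mathbb {R}^n\) with the standard basis, (6b) is simply the Frobenius inner product of the matrix representations; a quick numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5                                   # dim A = m, dim U = n
p1, p2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
e = np.eye(m)                                 # orthonormal basis of A
basis_sum = sum((p1.T @ e[:, i]) @ (p2.T @ e[:, i]) for i in range(m))   # (6b)
print(np.isclose(basis_sum, np.sum(p1 * p2)))  # Frobenius inner product: True
```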

By the sum rule for Clarke subdifferentials (denoted \(\partial _C\)) and their compatibility with convex subdifferentials and Fréchet differentiable functions [47], we obtain

$$\begin{aligned} \partial _C (J \circ S_u+R)(\widehat{\alpha }) = \nabla _{\alpha }(J \circ S_u)(\widehat{\alpha }) + \partial R(\widehat{\alpha }) = \nabla _{\alpha }S_u(\widehat{\alpha })\nabla _{u}J(S_u(\widehat{\alpha })) + \partial R(\widehat{\alpha }). \end{aligned}$$

The Fermat principle for Clarke subdifferentials then furnishes the necessary optimality condition

$$\begin{aligned} 0 \in \nabla _{\alpha }(J \circ S_u)(\widehat{\alpha }) + \partial R(\widehat{\alpha })= \nabla _{\alpha }S_u(\widehat{\alpha })\nabla _{u}J(S_u(\widehat{\alpha })) + \partial R(\widehat{\alpha }). \end{aligned}$$
(7)

We combine (3), (4) and (7) as the inclusion

$$\begin{aligned} 0 \in H(\widehat{u}, \widehat{p}, \widehat{\alpha }) \end{aligned}$$
(8)

with

$$\begin{aligned} H(u,p,\alpha ):= \begin{pmatrix} \nabla _{u} F(u; \alpha ) \\ p \nabla _{u}^2 F(u; \alpha ) + \nabla _{\alpha u} F(u; \alpha ) \\ p\nabla _{u}J(u) + \partial R(\alpha ) \end{pmatrix} \end{aligned}$$
(9)

This is the optimality condition that our proposed methods, presented in Sects. 2.2 and 2.3, attempt to satisfy. We generally abbreviate

$$\begin{aligned} x=(u,p,\alpha ), \quad \widehat{x}=(\widehat{u},\widehat{p},\widehat{\alpha }), \quad \text {etc.} \end{aligned}$$

2.2 Algorithm: forward-exact-forward-backward

Our first strategy for solving (8) takes just a single gradient descent step for the inner problem, solves the adjoint equation exactly, and then takes a forward-backward step for the outer problem. We call this Algorithm 2.1 the FEFB (forward-exact-forward-backward).

Algorithm 2.1
Forward-exact-forward-backward (FEFB) method (listing rendered as a figure in the source)
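Since the listing is rendered as a figure in the source, the following Python sketch reconstructs one FEFB iteration from the surrounding text and the implicit form (10) below: a gradient step on the inner problem, an exact adjoint solve via \(S_p\) from (5), and a forward-backward step on the outer problem. The function handles, finite-dimensional matrix shapes, and names are our assumptions.

```python
import numpy as np

def fefb_step(u, alpha, grad_u_F, hess_uu_F, mixed_alphau_F, grad_u_J, prox_R,
              tau, sigma):
    """One FEFB iteration (our reconstruction): u is the inner variable (n,),
    alpha the outer variable (m,), and the adjoint p an (m, n) matrix."""
    u_next = u - tau * grad_u_F(u, alpha)                 # forward step on the inner problem
    H = hess_uu_F(u_next, alpha)                          # nabla_u^2 F(u^{k+1}; alpha^k)
    B = mixed_alphau_F(u_next, alpha)                     # nabla_{alpha u} F(u^{k+1}; alpha^k)
    p_next = -np.linalg.solve(H.T, B.T).T                 # exact adjoint solve: S_p(u^{k+1}, alpha^k)
    alpha_next = prox_R(alpha - sigma * p_next @ grad_u_J(u_next), sigma)  # outer FB step
    return u_next, p_next, alpha_next
```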

Using H defined in (9), Algorithm 2.1 can be written implicitly as solving

$$\begin{aligned} 0 \in H_{k+1}(x^{k+1}) + M(x^{k+1} - x^k) \end{aligned}$$
(10)

for \(x^{k+1} = (u^{k+1}, p^{k+1}, \alpha ^{k+1})\), where, with \(x=(u, p, \alpha )\),

(11a)

and the preconditioning operator \(M\in \mathbb {L}(U\times P \times \mathscr {A}; U\times P \times \mathscr {A})\) is

$$\begin{aligned} M :={{\,\textrm{diag}\,}}( \tau ^{-1}{{\,\textrm{Id}\,}}, 0, \sigma ^{-1}{{\,\textrm{Id}\,}}). \end{aligned}$$
(11b)

The “nonlinear preconditioning” applied to H to construct \(H_{k+1}\) shifts iterate indices such that a forward step is performed instead of a proximal step; compare [18, 19].

We next state essential structural, initialisation, and step length assumptions. We start with a contractivity condition needed for the proximal step with respect to R.

Assumption 2.1

Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) be convex, proper, and lower semicontinuous. We say that R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\in \mathscr {A}\) for \(q \in \mathscr {A}\) within a neighborhood \(A \subset {{\,\textrm{dom}\,}}R\) of \(\widehat{\alpha }\) if there exists \(C_R > 0\) such that, for all \(\alpha \in A\),

$$\begin{aligned} \Vert D_{\sigma R}(\alpha )-D_{\sigma R}(\widehat{\alpha })\Vert \le \sigma C_R \Vert \alpha -\widehat{\alpha }\Vert \quad \text {for}\quad D_{\sigma R}(\alpha ) :={{\,\textrm{prox}\,}}_{\sigma R}(\alpha - \sigma q)-\alpha . \end{aligned}$$

If A can be taken to be all of \({{\,\textrm{dom}\,}}R\) with the same factor \(C_R\), we drop the word “locally”.

We verify Assumption 2.1 for some common cases in the appendix. When applying the assumption to \(\widehat{\alpha }\) satisfying (8), we will take \(q = -\widehat{p}\nabla _u J(\widehat{u}) \in \partial R(\widehat{\alpha })\). Then \(D_{\sigma R}(\widehat{\alpha })=0\) by standard properties of proximal mappings. The results for nonsmooth functions in the appendix in that case forbid strict complementarity. In particular, for \(R=\beta \Vert \,\varvec{\cdot }\,\Vert _1 + \delta _{[0, \infty )^n}\) we need to have \(q = (\beta , \ldots , \beta )\), and for \(R=\delta _C\) for a convex set C, we need to have \(q=0\). Intuitively, this restriction serves to forbid the finite identification property [48] of proximal-type methods: our techniques depend on the stability of the inner problem and the adjoint equation with respect to perturbations of \(\alpha \), and therefore \(\{\alpha ^n\}\) cannot be allowed to converge too fast.
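As a way of experimenting with Assumption 2.1, the following snippet (our own helper; the test points and parameter values are arbitrary) evaluates \(D_{\sigma R}\) for \(R=\beta \Vert \,\varvec{\cdot }\,\Vert _1 + \delta _{[0, \infty )^n}\) with the choice \(q = (\beta ,\ldots ,\beta )\) discussed above, and prints the ratios \(\Vert D_{\sigma R}(\alpha )-D_{\sigma R}(\widehat{\alpha })\Vert /\Vert \alpha -\widehat{\alpha }\Vert \) that \(\sigma C_R\) has to dominate.

```python
import numpy as np

def D_sigmaR(alpha, q, sigma, beta):
    """D_{sigma R}(alpha) = prox_{sigma R}(alpha - sigma*q) - alpha for
    R = beta*||.||_1 + delta_{[0, inf)^n}; the prox is one-sided soft-thresholding."""
    return np.maximum(alpha - sigma * q - sigma * beta, 0.0) - alpha

beta, sigma = 1.0, 0.1
alpha_hat = np.zeros(3)                        # reference point with D_{sigma R} = 0
q = beta * np.ones(3)
for alpha in (np.array([0.05, 0.0, 0.2]), np.array([0.3, 0.1, 0.0])):
    d = D_sigmaR(alpha, q, sigma, beta) - D_sigmaR(alpha_hat, q, sigma, beta)
    print(np.linalg.norm(d) / np.linalg.norm(alpha - alpha_hat))
```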

We now come to our main assumption for the FEFB. It collects conditions related to step lengths, initialization, and the problem functions F, J, and R. For a constant \(\varphi _u>0\) to be determined by the assumption, we introduce the testing operator

$$\begin{aligned} Z :={{\,\textrm{diag}\,}}(\varphi _u {{\,\textrm{Id}\,}}, {{\,\textrm{Id}\,}}, {{\,\textrm{Id}\,}}). \end{aligned}$$
(12)

The idea, introduced in [18] and further explained in [19], is to test the algorithm-defining inclusion (10) with the linear functional \(\langle Z\,\varvec{\cdot }\,,x^{k+1}-\widehat{x}\rangle \) to obtain a descent estimate with respect to the ZM-norm. The operator Z encodes component-specific scalings and convergence rates, although we do not exploit the latter in this manuscript.
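To sketch the mechanism (anticipating the estimates (30) and (31) of Sect. 3): the inclusion (10) provides some \(w^{k+1} \in H_{k+1}(x^{k+1})\) with \(w^{k+1} + M(x^{k+1}-x^k)=0\), and testing it with \(\langle Z\,\varvec{\cdot }\,,x^{k+1}-\widehat{x}\rangle \) gives, by the three-point identity (2) for the self-adjoint positive semi-definite ZM,

$$\begin{aligned} 0 = \langle Z w^{k+1},x^{k+1}-\widehat{x}\rangle + \frac{1}{2}\Vert x^{k+1}-x^{k}\Vert ^2_{ZM} - \frac{1}{2}\Vert x^{k}-\widehat{x}\Vert ^2_{ZM} + \frac{1}{2}\Vert x^{k+1}-\widehat{x}\Vert ^2_{ZM}. \end{aligned}$$

Hence any lower bound \(\langle Z w^{k+1},x^{k+1}-\widehat{x}\rangle \ge \mathscr {V}_{k+1}(\widehat{x}) - \frac{1}{2}\Vert x^{k+1}-x^{k}\Vert ^2_{ZM}\) with \(\mathscr {V}_{k+1}(\widehat{x})\ge 0\) immediately yields a descent estimate of the form \(\frac{1}{2}\Vert x^{k+1}-\widehat{x}\Vert ^2_{ZM} + \mathscr {V}_{k+1}(\widehat{x}) \le \frac{1}{2}\Vert x^{k}-\widehat{x}\Vert ^2_{ZM}\).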

Assumption 2.2

We assume that U is a Hilbert space, \(\mathscr {A}\) a separable Hilbert space, and treat the adjoint variable \(p\in \mathbb {L}(U; \mathscr {A})\) as an element of the inner product space P defined in (6a). Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) and \(J: U \rightarrow \mathbb {R}\) be convex, proper, and lower semicontinuous, and assume the same from \(F(\,\varvec{\cdot }\,, \alpha )\in C^2(U)\) for all \(\alpha \in {{\,\textrm{dom}\,}}R.\) Pick \((\widehat{u},\widehat{p},\widehat{\alpha }) \in H^{-1}(0)\) and let \(\{(u^m, p^m, \alpha ^m)\}_{m\in \mathbb {N}}\) be generated by Algorithm 2.1 for a given initial iterate \((u^{0}, p^{0}, \alpha ^{0}) \in U \times P \times {{\,\textrm{dom}\,}}R\). For given \(r, r_u>0\) we suppose that

  1. (i)

    The relative initialization bound \(\Vert u^{1}-S_u(\alpha ^{0})\Vert \le C_u \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) holds for some \(C_u>0\).

  2. (ii)

    There exists in \(B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) a continuously Fréchet-differentiable and \(L_{S_u}\)-Lipschitz inner problem solution mapping \(S_u: \alpha \mapsto S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits F(\,\varvec{\cdot }\,; \alpha )\).

  3. (iii)

    \(F(\widehat{u};\,\varvec{\cdot }\,)\) is Lipschitz continuously differentiable with factor \(L_{\nabla F,\widehat{u}} > 0\), and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) for all \((u, \alpha ) \in B(\widehat{u}, r_u) \times ( B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) for some \(\gamma _F, L_F > 0.\) Moreover, \((u,\alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u,\alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz in \(B(\widehat{u}, r_u)\times (B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) with factors \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\), where we equip \(U \times \mathscr {A}\) with the norm \((u, \alpha ) \mapsto \Vert u\Vert _U + \Vert \alpha \Vert _{\mathscr {A}}\).

  4. (iv)

    The inner step length \(\tau \in (0, 2\kappa /L_F]\) for some \(\kappa \in (0, 1)\).

  5. (v)

    The outer fitness function J is Lipschitz continuously differentiable with factor \(L_{\nabla J}\), and \(\gamma _\alpha \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)\le L_\alpha \cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _\alpha ,L_\alpha >0\). Moreover, R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(C_R\ge 0\).

  6. (vi)

    The constants \(\varphi _u, C_u > 0\) satisfy

    $$\begin{aligned} \gamma _F (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})C_u + \frac{L_{\nabla F,\widehat{u}}^2}{(1-\kappa )} \varphi _u < \gamma _F \gamma _\alpha , \end{aligned}$$

    where

    $$\begin{aligned} \begin{aligned} N_{\nabla _{\alpha u} F}&:=\max _{\begin{array}{c} u \in B(\widehat{u}, r_u),\\ \alpha \in B(\widehat{\alpha }, 2r)\cap {{\,\textrm{dom}\,}}R \end{array}} \Vert \hspace{-1.0pt}|\nabla _{\alpha u} F(u,\alpha ) \Vert \hspace{-1.0pt}|, \\ L_{S_p}&:=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}, \\ N_{\nabla J}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \nabla _u J (S_u(\alpha ))\Vert , \\ N_{\nabla S_u}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ) \Vert \hspace{-1.0pt}|, \text { and} \\ N_p&:=N_{\nabla S_u} + C r \text { with } C=L_{S_p} C_u. \end{aligned} \end{aligned}$$
  7. (vii)

    The outer step length \(\sigma \) fulfills

    $$\begin{aligned} 0 < \sigma \le \frac{(C_F-1)C_u}{(L_{S_u} +C_FC_u)C_{\alpha }} \end{aligned}$$

    where

    $$\begin{aligned} {\left\{ \begin{array}{ll} C_F :=\sqrt{1+2\tau \gamma _F(1 - \kappa )}, \quad \text {and} \\ C_{\alpha } :=(N_pL_{\nabla J} + N_{\nabla J} L_{S_p}) C_u + L_\alpha + C_R. \end{array}\right. } \end{aligned}$$
  8. (viii)

    The initial iterates \(u^0\) and \(\alpha ^0\) are such that the distance-to-solution

    $$\begin{aligned} r_0 :=\sqrt{\sigma \varphi _u\tau ^{-1}\Vert u^{0}-\widehat{u}\Vert ^2 + \Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2} = \sqrt{\sigma }\Vert x^{0}-\widehat{x}\Vert _{ZM} \end{aligned}$$

    satisfies

    $$\begin{aligned} r_0 \le r \quad \text {and}\quad r_0 \max \{2L_{S_u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }(1+\tau L_{F}) + \tau L_{\nabla F,\widehat{u}}\} \le r_u. \end{aligned}$$

Remark 2.3

(Interpretation) Part (i) of Assumption 2.2 ensures that the initial inner problem iterate is good relative to the outer problem iterate. If \(u^1\) solves the inner problem for \(\alpha ^0\), (i) holds for any \(C_u>0\). Therefore, (i) can always be satisfied by solving the inner problem for \(\alpha ^0\) to high accuracy. This condition does not require \(\alpha ^0\) to be close to a solution \(\widehat{\alpha }\) of the entire problem.

Part (ii) ensures that the inner problem solution map exists and is well-behaved; we discuss it more in the next Remark 2.4.

Parts (iii) and (v) are second order growth and boundedness conditions, standard in smooth optimization. The nonsmooth R is handled through the prox-\(\sigma \)-contractivity assumption. If \(S_u\) is twice Fréchet differentiable, the product and the chain rules show that \(\nabla _{\alpha }^2(J\circ S_u)(\alpha )\) is the sum of \(\nabla _{\alpha }S_u(\alpha )\,\nabla _u^2 J(S_u(\alpha ))\,\nabla _{\alpha }S_u(\alpha )^*\) and a term involving the second derivative of \(S_u\) paired with \(\nabla _u J(S_u(\alpha ))\).

If \(R=0\), first-order optimality conditions establish \(\nabla _u J(S_u(\widehat{\alpha }))=0\). Therefore, if, further, J is strongly convex and \(S_u'(\widehat{\alpha })\) is invertible, \(\gamma \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)(\widehat{\alpha })\) for some \(\gamma >0\). Then additional continuity assumptions establish the positivity required in (v) in a neighbourhood of \(\widehat{\alpha }\). It is also possible to further develop the condition to not depend on the solution mapping at all.

Depending on R, (v) may restrict the outer step length parameter \(\sigma \). Part (iii) ensures that \(\nabla _u^2 F(u; \alpha )\) is invertible and \(S_p\) is well-defined. We will see in Lemma 3.3 that the radius \(r_u\) is sufficiently large that \(\alpha \in B(\widehat{\alpha }, r)\) implies \(S_u(\alpha ) \in B(\widehat{u}, r_u)\). Part (v) implies that \(\alpha \mapsto \nabla _{\alpha }(J\circ S_u)(\alpha )\) is Lipschitz in \(B(\widehat{\alpha }, r)\).

Part (iv) is a standard step length condition for the inner problem while (vii) is a step length condition for the outer problem. It depends on several constants defined in the more technical part (vi). We can always satisfy the inequality in (vi) by good relative initialisation (small \(C_u>0\)), as discussed above, and by taking the testing parameter \(\varphi _u\) small. According to the local initialization condition (viii), the latter can be done if the initial iterates are close to a solution \((\widehat{u}, \widehat{\alpha })\) of the entire problem, or if \(r_u>0\) can be taken arbitrarily large. If we can take both \(r>0\) and \(r_u>0\) arbitrarily large, we obtain global convergence.

Remark 2.4

(Existence and differentiability of the solution map) Suppose F is twice continuously differentiable in both variables, and that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) for all \(u\in B(\widehat{u}, r_u)\) and \(\alpha \in B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _F>0\). Then the implicit function theorem shows the existence of a unique continuously differentiable \(S_u\) in a neighborhood of any \(\alpha \in B(\widehat{\alpha },r) \cap {{\,\textrm{dom}\,}}R\). Such an \(S_u\) is also Lipschitz in a neighborhood of \(\alpha \); see, e.g., [19, Lemma 2.11]. If \(\mathscr {A}\) is finite-dimensional, a compactness argument gluing together the neighborhoods then proves Assumption 2.2 (ii).
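As a concrete finite-dimensional illustration of this remark and of the adjoint formula (5) (our own toy example with the quadratic inner problem \(F(u;\alpha ) = \frac{1}{2}\Vert u-z\Vert ^2 + \frac{\alpha }{2}\Vert Du\Vert ^2\) and a scalar parameter): the Hessian \({{\,\textrm{Id}\,}}+ \alpha D^*D \ge {{\,\textrm{Id}\,}}\) is uniformly positive definite, so \(S_u(\alpha ) = ({{\,\textrm{Id}\,}}+\alpha D^*D)^{-1}z\) exists and is smooth, and \(S_p(S_u(\alpha ),\alpha )\) agrees with a finite-difference approximation of \(\nabla _\alpha S_u(\alpha )\):

```python
import numpy as np

n, a, h = 8, 0.7, 1e-6
rng = np.random.default_rng(2)
z = rng.standard_normal(n)
D = np.eye(n, k=1) - np.eye(n)
S_u = lambda t: np.linalg.solve(np.eye(n) + t * D.T @ D, z)   # inner solution map

u = S_u(a)
adjoint = -np.linalg.solve(np.eye(n) + a * D.T @ D, D.T @ D @ u)   # S_p(u, a) via (5)
finite_difference = (S_u(a + h) - S_u(a - h)) / (2 * h)            # approximates d/da S_u(a)
print(np.max(np.abs(adjoint - finite_difference)))                 # small, e.g. ~1e-9
```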

2.3 Algorithm: forward-inexact-forward-backward

Algorithm 2.2
Forward-inexact-forward-backward (FIFB) method (listing rendered as a figure in the source)

Our second strategy for solving (8) modifies the first approach to solve the adjoint variable inexactly, so that no costly matrix inversions are required. Instead we perform an update reminiscent of a gradient step. This approach, which we call the FIFB (forward-inexact-forward-backward), reads as Algorithm 2.2 and has the implicit form

$$\begin{aligned} {\left\{ \begin{array}{ll} 0 = \tau \nabla _{u} F(u^{k}; \alpha ^{k}) + u^{k+1} - u^{k} \\ 0 = \theta \left( p^{k} \nabla _{u}^2 F(u^{k+1}; \alpha ^{k}) + \nabla _{\alpha u} F(u^{k+1}; \alpha ^{k})\right) + p^{k+1} - p^{k} \\ 0 \in \sigma (\partial R(\alpha ^{k+1}) + p^{k+1}\nabla _{u}J(u^{k+1})) + \alpha ^{k+1} - \alpha ^{k}. \end{array}\right. } \end{aligned}$$
(13)

The implicit form can also be written as (10) with

(14a)

and the preconditioning operator \(M\in \mathbb {L}(U\times P \times \mathscr {A}; U\times P \times \mathscr {A})\),

$$\begin{aligned} M :={{\,\textrm{diag}\,}}( \tau ^{-1}{{\,\textrm{Id}\,}}, \theta ^{-1}{{\,\textrm{Id}\,}}, \sigma ^{-1}{{\,\textrm{Id}\,}}). \end{aligned}$$
(14b)
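Written out from (13), one FIFB iteration is a gradient step on the inner problem, a single adjoint update requiring only a multiplication by the Hessian (no linear solve), and a forward-backward step on the outer problem. A minimal finite-dimensional Python sketch (function handles, matrix shapes, and names are our assumptions):

```python
def fifb_step(u, p, alpha, grad_u_F, hess_uu_F, mixed_alphau_F, grad_u_J, prox_R,
              tau, theta, sigma):
    """One FIFB iteration written out from the implicit form (13): u is the inner
    variable (n,), alpha the outer variable (m,), p the adjoint, an (m, n) matrix."""
    u_next = u - tau * grad_u_F(u, alpha)                          # inner gradient step
    p_next = p - theta * (p @ hess_uu_F(u_next, alpha)             # inexact adjoint step:
                          + mixed_alphau_F(u_next, alpha))         # no matrix inversion
    alpha_next = prox_R(alpha - sigma * p_next @ grad_u_J(u_next), sigma)  # outer FB step
    return u_next, p_next, alpha_next
```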

For the testing operator Z we use the structure

$$\begin{aligned} Z :={{\,\textrm{diag}\,}}(\varphi _u {{\,\textrm{Id}\,}}, \varphi _p {{\,\textrm{Id}\,}}, {{\,\textrm{Id}\,}}) \end{aligned}$$
(15)

with the constants \(\varphi _u, \varphi _p>0\) determined in the following assumption. It is the FIFB counterpart of Assumption 2.2 for the FEFB, collecting essential structural, step length, and initialization assumptions.

Assumption 2.5

We assume that U is a Hilbert space, \(\mathscr {A}\) a separable Hilbert space, and treat the adjoint variable \(p\in \mathbb {L}(U; \mathscr {A})\) as an element of the inner product space P defined in (6a). Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) and \(J: U \rightarrow \mathbb {R}\) be convex, proper, and lower semicontinuous, and assume the same from \(F(\,\varvec{\cdot }\,, \alpha )\) for all \(\alpha \in {{\,\textrm{dom}\,}}R\). Pick \((\widehat{u},\widehat{p},\widehat{\alpha }) \in H^{-1}(0)\) and let \(\{(u^m, p^m, \alpha ^m)\}_{m\in \mathbb {N}}\) be generated by Algorithm 2.2 for a given initial iterate \((u^{0}, p^{0}, \alpha ^{0}) \in U \times P \times {{\,\textrm{dom}\,}}R\). For given \(r, r_u>0\) we suppose that

  1. (i)

    The relative initialization bounds \(\Vert u^{1}-S_u(\alpha ^{0})\Vert \le C_u \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) and \(\Vert \hspace{-1.0pt}|p^{1}-\nabla _{\alpha }S_u(\alpha ^{0}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) hold with some constants \(C_u>0\) and \(C_p>0.\)

  2. (ii)

    There exists in \(B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) a continuously Fréchet-differentiable and \(L_{S_u}\)-Lipschitz inner problem solution mapping \(S_u: \alpha \mapsto S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits F(\,\varvec{\cdot }\,; \alpha )\).

  3. (iii)

    \(F(\widehat{u};\,\varvec{\cdot }\,)\) is Lipschitz continuously differentiable with factor \(L_{\nabla F,\widehat{u}} > 0\), and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) for \(u\in B(\widehat{u}, r_u)\) and \(\alpha \in B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R.\) Moreover, \((u,\alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u,\alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz in \(B(\widehat{u}, r_u)\times (B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) with factors \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\), where we equip \(U \times \mathscr {A}\) with the norm \((u, \alpha ) \mapsto \Vert u\Vert _U + \Vert \alpha \Vert _{\mathscr {A}}\).

  4. (iv)

    The inner step length \(\tau \in (0, 2\kappa /L_F]\) for some \(\kappa \in (0, 1)\) whereas the adjoint step length \(\theta \in (0, 1/L_F)\).

  5. (v)

    The outer fitness function J is Lipschitz continuously differentiable with factor \(L_{\nabla J},\) and \(\gamma _\alpha \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)\le L_\alpha \cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _\alpha , L_\alpha > 0\). Moreover, R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(C_R\ge 0\).

  6. (vi)

    The constants \(\varphi _u, \varphi _p, C_u > 0\) satisfy

    $$\begin{aligned} \varphi _p \le \varphi _u \frac{\gamma _F^2(1-\kappa )}{2 L_F L_{S_p}} \end{aligned}$$

    and

    $$\begin{aligned}{} & {} L_F L_{S_p} \varphi _p + \sqrt{(L_F L_{S_p} \varphi _p)^2 + \gamma _F^2 (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})^2C_u^2} \\{} & {} \quad + \frac{L_{\nabla F,\widehat{u}}^2}{(1-\kappa )} \varphi _u < \gamma _F \gamma _\alpha , \end{aligned}$$

    where

    $$\begin{aligned} \begin{aligned} N_{\nabla _{\alpha u} F}&:=\max _{\begin{array}{c} u \in B(\widehat{u}, r_u),\\ \alpha \in B(\widehat{\alpha }, 2r)\cap {{\,\textrm{dom}\,}}R \end{array}} \Vert \hspace{-1.0pt}|\nabla _{\alpha u} F(u,\alpha ) \Vert \hspace{-1.0pt}|, \\ L_{S_p}&:=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}, \\ N_{\nabla J}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \nabla _u J (S_u(\alpha ))\Vert , \\ N_{\nabla S_u}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ) \Vert \hspace{-1.0pt}|, \text { and} \\ N_p&:=N_{\nabla S_u} + C r \text { with } C=L_{S_p} C_u. \end{aligned} \end{aligned}$$
  7. (vii)

    The outer step length \(\sigma \) satisfies

    $$\begin{aligned} 0 < \sigma \le \textstyle \frac{1}{C_\alpha }\min \left\{ \frac{(C_F-1)C_u}{L_{S_u} +C_FC_u}, \frac{(C_{F,S}-1)C_p- (1+C_{F,S})L_{S_p} C_u}{(1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u} \right\} \end{aligned}$$

    with

    $$\begin{aligned} \begin{aligned} C_F&:=\sqrt{1+2\tau \gamma _F(1 - \kappa )}, \qquad C_{F,S} :=\sqrt{(1+\theta \gamma _F)/(1-\theta \gamma _F)} \quad \text {and} \\ C_{\alpha }&:=N_pL_{\nabla J}C_u + N_{\nabla J}\max \{C_p, L_{S_p} C_u\} + L_\alpha + C_R. \end{aligned} \end{aligned}$$
  8. (viii)

    The initial iterate \((u^0, p^0, \alpha ^0)\) is such that the distance-to-solution

    $$\begin{aligned} r_0 :=\sqrt{\frac{\sigma \varphi _u}{\tau }\Vert u^{0}-\widehat{u}\Vert ^2 + \frac{\sigma \varphi _p}{\theta }\Vert \hspace{-1.0pt}|p^{0}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2} = \sqrt{\sigma }\Vert x^{0}-\widehat{x}\Vert _{ZM} \end{aligned}$$

    satisfies

    $$\begin{aligned} r_0 \le r \ \text {and}\ r_0 \max \{2(C_u + L_{S_u}), \sqrt{\tfrac{\tau }{\sigma \varphi _u}}(1+\tau L_{F}) + \tau L_{\nabla F,\widehat{u}}\} \le r_u. \end{aligned}$$

Remark 2.6

(Interpretation) The interpretation of Assumption 2.2 in Remark 2.3 also applies to Assumption 2.5. We stress that to satisfy the inequality in (vi), it suffices to ensure small \(C_u>0\) by good relative initialization of u and p with respect to \(\alpha \), and choosing the testing parameters \(\varphi _u, \varphi _p>0\) small enough. According to (viii), the latter can be done by initializing close to a solution, or if the radius \(r_u>0\) is large.

3 Convergence analysis

We now prove the convergence of the FEFB (Algorithm 2.1) and the FIFB (Algorithm 2.2) in the respective Sects. 3.2 and 3.3. Before this we start with common results. Our proofs are self-contained, but build on the “testing” approach of [18] (see also [19]). The main idea is to prove a monotonicity-type estimate for the operator \(H_{k+1}\) occurring in the implicit forms (10) and (14) of the algorithms, and then use the three-point identity (2) with respect to ZM-norms and inner products. This yields an estimate from which convergence rates can readily be observed. The main results for the FEFB and the FIFB are in the respective Theorems 3.16 and 3.21.

Throughout, we assume that either Assumption 2.2 (FEFB) or 2.5 (FIFB) holds, and tacitly use the constants from the relevant one. We also tacitly take it that \(\alpha ^k \in {{\,\textrm{dom}\,}}R\) for all \(k \in \mathbb {N}\), as this is guaranteed by the assumptions for \(k=0\), and by the proximal step in the algorithms for \(k \ge 1\).

3.1 General results

Our main goal here is to bound the error in the inner and adjoint iterates \(u^k\) and \(p^k\) in terms of the outer iterates \(\alpha ^k\). We also derive bounds on the outer steps, and local monotonicity estimates. We first show that the solution mapping for the adjoint equation (4) is Lipschitz.

Lemma 3.1

Suppose \((u, \alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u, \alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz continuous with the respective constants \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\) in some bounded closed set \(V_u \times V_{\alpha }.\) Also assume that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) and \(\Vert \hspace{-1.0pt}|\nabla _{\alpha u} F \Vert \hspace{-1.0pt}| \le N_{\nabla _{\alpha u} F}\) in \(V_u \times V_{\alpha }\) for some \(\gamma _F, N_{\nabla _{\alpha u} F}>0\). Then \(S_p\) is Lipschitz continuous in \(V_u \times V_{\alpha },\) i.e.

$$\begin{aligned} \Vert \hspace{-1.0pt}|S_p(u_1, \alpha _1) - S_p(u_2, \alpha _2) \Vert \hspace{-1.0pt}| \le L_{S_p}(\Vert u_1 - u_2\Vert +\Vert \alpha _1-\alpha _2\Vert ) \end{aligned}$$

for \(u_1,u_2\in V_u\) and \(\alpha _1,\alpha _2 \in V_{\alpha }\) with factor \(L_{S_p} :=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}.\)

Proof

Using the definition of \(S_p\) in (5), we rearrange

Thus the triangle inequality and the operator norm inequality Theorem 6.1 (ii) give

(16)

The assumption \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) implies \(\Vert (\nabla _u^2 F(u; \alpha ))^{-1}\Vert \le \gamma _F^{-1}.\) Therefore, also using the Lipschitz continuity of \((u, \alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha )\) in \(V_u\times V_{\alpha },\) we get

$$\begin{aligned} E_1 \le \gamma _F^{-1} L_{\nabla _{\alpha u} F}\left( \Vert u_1-u_2\Vert + \Vert \alpha _1 - \alpha _2\Vert \right) . \end{aligned}$$
(17)

Towards estimating the second term on the right hand side of (16), we observe that

$$\begin{aligned} A^{-1} - B^{-1}= A^{-1}B B^{-1} - A^{-1}A B^{-1} = A^{-1}(A-B)B^{-1} \end{aligned}$$

for any invertible linear operators AB. Then we use \(\Vert \hspace{-1.0pt}|\nabla _{\alpha u} F \Vert \hspace{-1.0pt}| \le N_{\nabla _{\alpha u} F}\) and the Lipschitz continuity of \(\nabla _{u}^2 F (u; \alpha )\) to obtain

Inserting this inequality and (17) into (16) establishes the claim. \(\square \)

We now prove two simple step length bounds.

Lemma 3.2

Let Assumption 2.2 or 2.5 hold. Then \(\sigma < 1/L_\alpha \) and \(1< C_F < \sqrt{1+\gamma _F/L_F}\).

Proof

We have \(C_F>1\) since \(\kappa <1\) forces \(2\tau \gamma _F(1 - \kappa )>0.\) Assumption 2.2 (iv) or 2.5 (iv) implies \(2\tau \gamma _F(1 - \kappa )<4\gamma _F(\kappa - \kappa ^2)/L_F \le \gamma _F/L_F.\) Therefore \( C_F < \sqrt{1+ \gamma _F/L_F}.\) For \(C_F,C_u, L_{S_u}>0\) it holds \(C_FC_u-C_u< L_{S_u} +C_FC_u.\)

Hence Assumption 2.2 (vii) or 2.5 (vii) gives

$$\begin{aligned} \sigma \le \frac{(C_F-1)C_u}{C_{\alpha }(L_{S_u} +C_FC_u)}< \frac{1}{C_{\alpha }} = \frac{1}{C_u (L_{S_p} N_{\nabla J} + N_{p}) + L_\alpha + C_R} < \frac{1}{L_\alpha }. \end{aligned}$$

\(\square \)

The next lemma explains the latter inequality for \(r_0\) in Assumption 2.2 (viii) and 2.5 (viii). For \(u^n\) and \(\alpha ^n\) close enough to the respective solutions, it shows that the next iterate \(u^{n+1}\) and the true inner problem solution \(S_u(\alpha ^n)\) for \(\alpha ^n\) lie in the \(r_u\)-neighborhood of \(\widehat{u}\).

Lemma 3.3

Suppose Assumption 2.2 or 2.5 hold and \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\), as well as \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). Then \(u^{n+1}\in B(\widehat{u}, r_u)\) and \(S_u(\alpha ^n)\in B(\widehat{u}, r_u).\)

Proof

The inner gradient step of Algorithm 2.1 or 2.2 together with \(u^{n}\in B(\widehat{u}, \sqrt{\frac{\tau }{\sigma \varphi _u}}r_0)\) gives

$$\begin{aligned} \Vert u^{n+1} - \widehat{u}\Vert \le \Vert u^{n+1} - u^{n}\Vert + \Vert u^{n} - \widehat{u}\Vert \le \tau \Vert \nabla _u F(u^n; \alpha ^n)\Vert + \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau } r_0. \end{aligned}$$

Using \(\nabla _u F(\widehat{u}; \widehat{\alpha })=0\), \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\), and the Lipschitz continuity of \(F(\widehat{u};\,\varvec{\cdot }\,)\) and \(F(\,\varvec{\cdot }\,;\alpha ^n)\) from Assumption 2.2 (iii) or 2.5 (iii) we continue to estimate, as required

Next, the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha }, 2r)\) from Assumption 2.2 (ii) or 2.5 (ii) with \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(r_0\le r\) from Assumption 2.2 (viii) or 2.5 (viii) imply

$$\begin{aligned} \Vert S_u(\alpha ^{n}) - \widehat{u}\Vert = \Vert S_u(\alpha ^{n}) - S_u(\widehat{\alpha })\Vert \le L_{S_u}\Vert \alpha ^{n} - \widehat{\alpha }\Vert \le L_{S_u}r_0 \le r_u. \square \end{aligned}$$

We now introduce a working condition that we later prove. It guarantees that the Lipschitz and Hessian properties of Assumption 2.2 (ii), (iii) and (v) or Assumption 2.5 (ii), (iii) and (v) hold at iterates.

Assumption 3.4

(Iterate locality) Let \(r_0\le r\) and \(N_p\) be defined in either Assumption 2.2 or 2.5. Then this assumption holds for a given \(n \in \mathbb {N}\) if

$$\begin{aligned} \alpha ^{n}\in B(\widehat{\alpha }, r_0),\quad u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0), \quad \text { and }\quad \Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}| \le N_p. \end{aligned}$$

Indeed, the next two lemmas show that if Assumption 3.4 holds for \(n=k\) along with some further conditions, then it holds for \(n=k+1\).

Lemma 3.5

Suppose either Assumption 2.2 or 2.5 holds. Let \(n \in \mathbb {N}\) and suppose

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| \le C\Vert \alpha ^n- \widehat{\alpha }\Vert \end{aligned}$$
(18)

with \(\alpha ^n \in B(\widehat{\alpha }, r)\). Then \(\Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}| \le N_p\).

Proof

We estimate using (18) and the definitions of the relevant constants in Assumption 2.2 or 2.5 that

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}|&\le \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| + \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| \\&\le N_{\nabla S_u} + C \Vert \alpha ^n- \widehat{\alpha }\Vert \le N_{\nabla S_u} + C r = N_p.\square \end{aligned}$$

Lemma 3.6

Let \(k \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds; Assumption 3.4 holds for \(n=k\); and that (18) holds for \(n=k+1\). If also \(\Vert x^{n+1} - \widehat{x}\Vert _{ZM} \le \Vert x^n - \widehat{x}\Vert _{ZM}\) for \(n\in \{0,\ldots ,k\}\), then Assumption 3.4 holds for \(n=k+1\).

Proof

Chaining \(\Vert x^{n+1} - \widehat{x}\Vert _{ZM} \le \Vert x^n - \widehat{x}\Vert _{ZM}\) over \(n=0,\ldots ,k\) gives \(\Vert x^{k+1} - \widehat{x}\Vert _{ZM} \le \Vert x^{0} - \widehat{x}\Vert _{ZM} = \sigma ^{-1/2}r_0.\) By the definitions of Z and M in (12) or (15), and (11b) or (14b) respectively, it follows that \(\alpha ^{k+1} \in B(\widehat{\alpha }, r_0)\) and \(u^{k+1}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\) as required. We finish by using Lemma 3.5 with \(n=k+1\) to establish \(\Vert \hspace{-1.0pt}|p^{k+2} \Vert \hspace{-1.0pt}| \le N_p\). \(\square \)

We next prove a monotonicity-type estimate for the inner objective. For this we need the following three-point monotonicity inequality.

Theorem 3.7

Let \(z,\widehat{x}\in X.\) Suppose \(F\in C^2(X),\) and for some \(L>0\) and \(\gamma \ge 0\) that \(\gamma \cdot {{\,\textrm{Id}\,}}\le \nabla ^2 F(\zeta ) \le L \cdot {{\,\textrm{Id}\,}}\) for all \(\zeta \in [\widehat{x}, z] :=\{\widehat{x} + s(z - \widehat{x}) \mid s\in [0,1] \}\). Then, for all \(\beta \in (0, 1]\) and \(x \in X\),

$$\begin{aligned} \langle \nabla F(z)- \nabla F(\widehat{x}),x-\widehat{x}\rangle \ge \gamma (1 - \beta )\Vert x-\widehat{x}\Vert ^2 - \frac{L}{4\beta }\Vert x-z\Vert ^2. \end{aligned}$$
(19)

Proof

The proof follows that of [19, Lemma 15.1], whose statement unnecessarily takes \(\zeta \) in a neighborhood of \(\widehat{x}\) instead of just the interval \([\widehat{x}, z]\). \(\square \)

Lemma 3.8

Let \(n \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds, together with Assumption 3.4. Then for any \(\kappa \in (0,1)\), we have

(20)

Proof

Assumption 2.2 (iii) or 2.5 (iii) with \(\alpha ^n\in B(\widehat{\alpha }, r)\) and \(u^{n}\in B(\widehat{u},r_u)\) from Assumption 3.4 give \(\gamma _F \cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha ^n) \le L_F\cdot {{\,\textrm{Id}\,}}\) for all \(u \in [\widehat{u}, u^n]\). We have \(\nabla _{u}F(\widehat{u}; \widehat{\alpha })=0\) since \(0\in H(\widehat{u}, \widehat{p}, \widehat{\alpha })\). Therefore Theorem 3.7 yields

$$\begin{aligned}{} & {} \langle \nabla _{u}F(u^{n}; \alpha ^{n}),u^{n+1}-\widehat{u}\rangle = \langle \nabla _{u}F(u^{n}; \alpha ^{n}) - \nabla _{u}F(\widehat{u}; \alpha ^{n}),u^{n+1}-\widehat{u}\rangle \\{} & {} \quad + \langle \nabla _{u}F(\widehat{u}; \alpha ^{n}) -\nabla _{u}F(\widehat{u}; \widehat{\alpha }) ,u^{n+1}-\widehat{u}\rangle \\{} & {} \quad \ge \gamma _F(1 - \kappa )\Vert u^{n+1}-\widehat{u}\Vert ^2 - \frac{L_F}{4\kappa }\Vert u^{n+1}-u^{n}\Vert ^2\\ {}{} & {} \qquad - |\langle \nabla _{u}F(\widehat{u}; \alpha ^{n}) -\nabla _{u}F(\widehat{u}; \widehat{\alpha }),u^{n+1}-\widehat{u}\rangle |. \end{aligned}$$

Young’s inequality and the definition of \(L_{\nabla F, \widehat{u}}\) in Assumption 2.2 (iii) or 2.5 (iii) now readily establish the claim. \(\square \)

The next lemma bounds the steps taken for the outer problem variable.

Lemma 3.9

Let \(n \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds, as do Assumption 3.4, (18), and

$$\begin{aligned} \Vert u^{n+1}-S_u(\alpha ^{n})\Vert \le C_u \Vert \alpha ^{n} - \widehat{\alpha }\Vert . \end{aligned}$$
(21)

Then

$$\begin{aligned} \Vert \alpha ^{n+1}-\alpha ^n\Vert \le \sigma [(N_p L_{\nabla J} C_u + N_{\nabla J} C + L_\alpha ) + C_R]\Vert \alpha ^{n} - \widehat{\alpha }\Vert \end{aligned}$$
(22)

and

$$\begin{aligned} \begin{aligned} C_u\Vert \alpha ^{n} - \widehat{\alpha }\Vert + L_{S_u}\Vert \alpha ^{n+1}-\alpha ^n\Vert&\le C_F C_u \bigl (\Vert \alpha ^n - \widehat{\alpha }\Vert - \Vert \alpha ^{n+1} - \alpha ^n\Vert \bigr ) \\&\le C_F C_u \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert . \end{aligned} \end{aligned}$$
(23)

Proof

Using the \(\alpha \)-update of Algorithm 2.1 or 2.2, we estimate

Since proximal maps are 1-Lipschitz, and R is by Assumption 2.2 (v) or 2.5 (v) locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) with factor \(C_R\), it follows

$$\begin{aligned} \begin{aligned} \Vert \alpha ^{n+1}-\alpha ^n\Vert&\le \sigma \Vert p^{n+1}\nabla _u J(u^{n+1})- \widehat{p}\nabla _u J(\widehat{u})\Vert + \sigma C_R \Vert \alpha ^n-\widehat{\alpha }\Vert \\&=: \sigma Q + \sigma C_R \Vert \alpha ^n-\widehat{\alpha }\Vert . \end{aligned} \end{aligned}$$
(24)

We have \(\widehat{p}\nabla _u J(\widehat{u})=\nabla _\alpha S_u(\widehat{\alpha }) \nabla _u J(S_u(\widehat{\alpha }))=\nabla _\alpha (J \circ S_u)(\widehat{\alpha })\), where \(\nabla _{\alpha }(J\circ S_u)\) is \(L_\alpha \)-Lipschitz in \(B(\widehat{\alpha }, r)\ni \alpha ^n\) by Assumption 2.2 (v) or 2.5 (v). Hence

$$\begin{aligned} \begin{aligned} Q&\le \Vert p^{n+1}\nabla _u J(u^{n+1}) - \nabla _{\alpha }(J\circ S_u)(\alpha ^n) + \nabla _{\alpha }(J\circ S_u)(\alpha ^n) -\widehat{p}\,\nabla _u J(\widehat{u})\Vert \\&\le \Vert p^{n+1}\nabla _u J(u^{n+1}) - \nabla _{\alpha }S_u(\alpha ^n)\nabla _u J(S_u(\alpha ^n))\Vert + L_\alpha \Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned} \end{aligned}$$

Using the Lipschitz continuity of \(\nabla _u J\) from Assumption 2.2 (v) or 2.5 (v), we continue

We have \(\Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}|\le N_{p}\) and \(\alpha ^n\in B(\widehat{\alpha }, r)\) by Assumption 3.4. Hence \(\Vert \nabla _u J(S_u(\alpha ^n))\Vert \le N_{\nabla J}\) by the definition in Assumption 2.2 (vi) or 2.5 (vi). Using (18) and (21) therefore gives

$$\begin{aligned} Q \le N_p L_{\nabla J} C_u \Vert \alpha ^n - \widehat{\alpha }\Vert + N_{\nabla J} C \Vert \alpha ^n - \widehat{\alpha }\Vert + L_\alpha \Vert \alpha ^n - \widehat{\alpha }\Vert = (C_{\alpha }-C_R)\Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned}$$

Inserting this into (24), we obtain (22).

Assumption 2.2 (vii) or Assumption 2.5 (vii) and (22) then yield

$$\begin{aligned} (L_{S_u} +C_FC_u)\Vert \alpha ^{n+1} - \alpha ^n\Vert \le \sigma (L_{S_u} +C_FC_u)C_{\alpha } \Vert \alpha ^n - \widehat{\alpha }\Vert \le (C_F-1)C_u \Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned}$$

Rearranging terms and finishing with the triangle inequality we get (23). \(\square \)

Remark 3.10

(Gradient steps with respect to R) We could (in both the FEFB and the FIFB) also take a gradient step instead of a proximal step with respect to R, provided R has an \(L_{\nabla R}\)-Lipschitz gradient. That is, we would perform for the outer problem the update

$$\begin{aligned} \alpha ^{n+1} = \alpha ^n - \sigma [p^{n+1}\nabla _u J(u^{n+1}) + \nabla R(\alpha ^n)]. \end{aligned}$$

This can be shown to be convergent by changing (24) to

$$\begin{aligned} \begin{aligned} \Vert \alpha ^{n+1}-\alpha ^n\Vert&= \sigma \Vert p^{n+1}\nabla _u J(u^{n+1}) + \nabla R(\alpha ^n)\Vert \\&= \sigma \Vert p^{n+1}\nabla _u J(u^{n+1}) - \widehat{p}\,\nabla _u J(\widehat{u}) + \nabla R(\alpha ^n)-\nabla R(\widehat{\alpha })\Vert \\&\le \sigma \bigl ( \Vert p^{n+1}\nabla _u J(u^{n+1})- \widehat{p}\,\nabla _u J(\widehat{u})\Vert + L_{\nabla R}\Vert \alpha ^n-\widehat{\alpha }\Vert \bigr ). \end{aligned} \end{aligned}$$

We next prove that if an inner problem iterate has small error, and we take a short step in the outer problem, then the next inner problem iterate also has small error.

Lemma 3.11

Let \(k \in \mathbb {N}\). Suppose Assumption 2.2 or 2.5 holds. If Assumption 3.4, (18), and (21) hold for \(n=k,\) then (21) holds for \(n=k+1\) and we have \(\alpha ^{k+1}\in B(\widehat{\alpha }, 2r_0)\).

Proof

We plan to use Theorem 3.7 on \(F(\,\varvec{\cdot }\,; \alpha ^{k+1})\) followed by the three-point identity and simple manipulations. We begin by proving the conditions of the theorem.

First, we show that both \(u^{k+1}\in B(\widehat{u},r_u)\) and \(S_u(\alpha ^{k+1})\in B(\widehat{u},r_u)\). The former is immediate from Assumption 3.4 and Lemma 3.3. For the latter we use (23) of Lemma 3.9. Its first inequality readily implies either \(\Vert \alpha ^{k} - \widehat{\alpha }\Vert > \Vert \alpha ^{k+1} - \alpha ^{k}\Vert \) or \(\alpha ^{k+1} = \widehat{\alpha }\). In the latter case \(S_u(\alpha ^{k+1})=\widehat{u} \in B(\widehat{u},r_u)\). In the former, using \(\alpha ^k\in B(\widehat{\alpha },r_0)\), we get

$$\begin{aligned} \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert \le \Vert \alpha ^{k+1} - \alpha ^{k}\Vert + \Vert \alpha ^{k}- \widehat{\alpha }\Vert < 2 \Vert \alpha ^{k}- \widehat{\alpha }\Vert \le 2r_0 \end{aligned}$$

Therefore we can use the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha },2r)\) from Assumption 2.2  (ii) or 2.5 (ii) to estimate

$$\begin{aligned} \Vert S_u(\alpha ^{k+1}) - \widehat{u}\Vert = \Vert S_u(\alpha ^{k+1}) - S_u(\widehat{\alpha })\Vert \le L_{S_u}\Vert \alpha ^{k+1} - \widehat{\alpha }\Vert \le L_{S_u}2r_0. \end{aligned}$$

This implies \(S_u(\alpha ^{k+1})\in B(\widehat{u},r_u)\) by Assumption 2.2 (viii) or 2.5 (viii).

Since both \(u^{k+1}, S_u(\alpha ^{k+1}) \in B(\widehat{u},r_u)\), Assumption 2.2 (iii) or 2.5 (iii) shows that \(\gamma _F \cdot {{\,\textrm{Id}\,}}\le \nabla ^2 F(u) \le L_F\cdot {{\,\textrm{Id}\,}}\) for \(u \in [S_u(\alpha ^{k+1}), u^{k+1}]\). Consequently Theorem 3.7 and \(\nabla _{u}F(S_u(\alpha ^{k+1}); \alpha ^{k+1}) = 0\) give

Inserting the u update of Algorithm 2.1 or 2.2, i.e., \( -\tau ^{-1}(u^{k+2}-u^{k+1}) = \nabla _{u}F(u^{k+1}; \alpha ^{k+1}) \), and using the three-point identity (2) we get

$$\begin{aligned}{} & {} \frac{1}{2\tau } \left( \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert ^2 + \Vert u^{k+2}-u^{k+1}\Vert ^2 - \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert ^2 \right) \\{} & {} \quad \le - \gamma _F(1 - \kappa ) \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert ^2 + \frac{L_F}{4\kappa }\Vert u^{k+2}-u^{k+1}\Vert ^2. \end{aligned}$$

Equivalently

$$\begin{aligned}{} & {} \left( 1+2\tau \gamma _F(1 - \kappa )\right) \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert ^2 + \Bigl (1-\frac{\tau L_F}{2\kappa } \Bigr ) \Vert u^{k+2}-u^{k+1}\Vert ^2 \\{} & {} \quad \le \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert ^2. \end{aligned}$$

Because Assumption 2.2 (iv) or 2.5 (iv) guarantees \(1-\tau L_F/(2\kappa ) > 0,\) this implies

$$\begin{aligned} \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert \le C_F^{-1} \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert . \end{aligned}$$

Therefore the triangle inequality, (21) for \(n=k\) and the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha },2r)\ni \alpha ^{k}, \alpha ^{k+1}\) yield

$$\begin{aligned} \begin{aligned} \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert&\le C_F^{-1} \Vert u^{k+1}-S_u(\alpha ^{k+1})\Vert \\&\le C_F^{-1} \bigl ( \Vert u^{k+1}-S_u(\alpha ^{k})\Vert + L_{S_u}\Vert \alpha ^{k+1}-\alpha ^{k}\Vert \bigr ) \\&\le C_F^{-1} \bigl ( C_u \Vert \alpha ^{k} - \widehat{\alpha }\Vert + L_{S_u}\Vert \alpha ^{k+1}-\alpha ^{k}\Vert \bigr ). \end{aligned} \end{aligned}$$

Inserting (23) here, we establish the claim. \(\square \)

The next lemma is a crucial monotonicity-type estimate for the outer problem. It depends on an \(\alpha \)-relative exactness condition on the inner and adjoint variables.

Lemma 3.12

Let \(n \in \mathbb {N}\). Suppose Assumption 2.2(v) and (vi), or 2.5 (v) and (vi) hold with Assumption 3.4 and

$$\begin{aligned} \Vert u^{n+1}-S_u(\alpha ^n)\Vert \le C_u \Vert \alpha ^n - \widehat{\alpha }\Vert \, \text { and } \, \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}| \le C \Vert \alpha ^n - \widehat{\alpha }\Vert . \end{aligned}$$
(25)

Then, for any \(d > 0\),

$$\begin{aligned}{} & {} \langle p^{n+1}\nabla _u J(u^{n+1}) + \partial R(\alpha ^{n+1}),\alpha ^{n+1}-\widehat{\alpha }\rangle \ge - \frac{L_\alpha }{2} \Vert \alpha ^{n+1}- \alpha ^n\Vert ^2 \nonumber \\{} & {} \quad + \left( \frac{\gamma _\alpha }{2}-\frac{L_{\nabla J}N_p C_u + C N_{\nabla J}}{2d}\right) \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2 \nonumber \\{} & {} \quad + \left( \frac{\gamma _\alpha }{2}-\frac{(L_{\nabla J}N_p C_u + C N_{\nabla J})d}{2}\right) \Vert \alpha ^n-\widehat{\alpha }\Vert ^2. \end{aligned}$$
(26)

Proof

The \(\alpha \)-update of both Algorithms 2.1 and 2.2 in implicit form reads

$$\begin{aligned} 0 = \sigma (q^{n+1} + p^{n+1}\nabla _u J(u^{n+1})) + \alpha ^{n+1} - \alpha ^n \quad \text {for some}\quad q^{n+1} \in \partial R(\alpha ^{n+1}). \end{aligned}$$

Similarly, \(0 \in H(\widehat{u}, \widehat{p}, \widehat{\alpha })\) implies \( \widehat{p}\,\nabla _{u}J(\widehat{u}) + \widehat{q} =0 \) for some \( \widehat{q} \in \partial R(\widehat{\alpha }). \) Writing \(E_0\) for the left hand side of (26), these expressions and the monotonicity of \(\partial R\) yield

(27)

We estimate \(E_1\) and \(E_2\) separately.

The one-dimensional mean value theorem gives

$$\begin{aligned} E_2 = \langle \nabla _\alpha (J\circ S_u)(\alpha ^n)-\nabla _\alpha (J\circ S_u)(\widehat{\alpha }),\alpha ^{n+1}-\widehat{\alpha }\rangle = \langle Q(\alpha ^n-\widehat{\alpha }),\alpha ^{n+1}-\widehat{\alpha }\rangle \end{aligned}$$

for some \(\zeta \in [\widehat{\alpha }, \alpha ^n]\) and \(Q :=\nabla ^2_\alpha (J\circ S_u)(\zeta )\).

Since \(\Vert \alpha ^n - \widehat{\alpha }\Vert \le r\) by Assumption 3.4, also \(\Vert \zeta - \widehat{\alpha }\Vert \le r\).

Therefore, the 3-point identity (2) and Assumption 2.2 (v) or 2.5 (v) yield

$$\begin{aligned} \begin{aligned} E_2&= \frac{1}{2}\Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2_{Q} + \frac{1}{2}\Vert \alpha ^n-\widehat{\alpha }\Vert ^2_{Q} - \frac{1}{2}\Vert \alpha ^{n+1}- \alpha ^n\Vert ^2_{Q} \\&\ge \frac{\gamma _\alpha }{2}(\Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2 + \Vert \alpha ^n-\widehat{\alpha }\Vert ^2) - \frac{L_\alpha }{2}\Vert \alpha ^{n+1}- \alpha ^n\Vert ^2. \end{aligned} \end{aligned}$$
(28)

To estimate \(E_1\) we rearrange

$$\begin{aligned} \begin{aligned} E_1&= \langle p^{n+1}\nabla _u J(u^{n+1})-\nabla _{\alpha }S_u(\alpha ^n)\nabla _u J(S_u(\alpha ^n)),\alpha ^{n+1}-\widehat{\alpha }\rangle \\&= \langle p^{n+1}(\nabla _u J(u^{n+1})-\nabla _u J(S_u(\alpha ^n)))\\ {}&\quad + (p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n))\nabla _u J(S_u(\alpha ^n)),\alpha ^{n+1}-\widehat{\alpha }\rangle . \end{aligned} \end{aligned}$$

We have \(\Vert \nabla _u J(S_u(\alpha ^n))\Vert \le N_{\nabla J}\) by the definition of the latter in Assumption 2.2 (vi) or 2.5 (vi) with \(\alpha ^n\in B(\widehat{\alpha }, r)\) from Assumption 3.4. The same assumptions establish that \(\nabla _u J\) is Lipschitz. Hence, using the operator norm inequality Theorem 6.1 (iii),

$$\begin{aligned} E_1 \ge -\left( L_{\nabla J}N_p \Vert u^{n+1}-S_u(\alpha ^n)\Vert + N_{\nabla J}\Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}|\right) \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert . \end{aligned}$$
Applying (25) and Young’s inequality now yields for any \(d>0\) the estimate

$$\begin{aligned} \begin{aligned} E_1&\ge -\left( L_{\nabla J}N_p C_u + C N_{\nabla J}\right) \Vert \alpha ^n-\widehat{\alpha }\Vert \Vert \alpha ^{n+1}-\widehat{\alpha }\Vert \\&\ge -\left( L_{\nabla J}N_p C_u + C N_{\nabla J}\right) \left( \frac{d}{2}\Vert \alpha ^n-\widehat{\alpha }\Vert ^2 + \frac{1}{2d}\Vert \alpha ^{n+1}-\widehat{\alpha }\Vert ^2\right) . \end{aligned} \end{aligned}$$
(29)

By inserting (28) and (29) into (27) we obtain the claim (26). \(\square \)

3.2 Convergence: forward-exact-forward-backward

We now prove the convergence of Algorithm 2.1. We start with a lemma that shows an \(\alpha \)-relative exactness estimate on the adjoint iterate when one holds for the inner iterate. This is needed to use Lemma 3.12. The main result of this subsection is in the final Theorem 3.16. It proves under Assumption 2.2 the linear convergence of \(\{(u^n, \alpha ^n)\}_{n \in \mathbb {N}}\) generated by Algorithm 2.1 to \((\widehat{u}, \widehat{\alpha })\) solving the first-order optimality condition (8) for some \(\widehat{p}\).

Lemma 3.13

Let \(n \in \mathbb {N}\). Suppose Assumption 2.2 and the inner exactness estimate (21) hold as well as \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\) and \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). Then (18) and (25) hold for \(C=L_{S_p}C_u.\)

Proof

Since (18) with (21) equals (25), it suffices to prove (18). We assumed \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(u^{n+1},S_u(\alpha ^n)\in B(\widehat{u}, r_u)\) by Lemma 3.3. Therefore the Lipschitz continuity of \(S_p\) in \(B(\widehat{u}, r_u)\times B(\widehat{\alpha }, r)\) from Lemma 3.1 with Assumption 2.2 (ii) and (iii) and (21) give

$$\begin{aligned} \begin{aligned} \Vert \hspace{-1.0pt}|p^{n+1} - \nabla _{\alpha }S_u(\alpha ^n) \Vert \hspace{-1.0pt}|&= \Vert \hspace{-1.0pt}|S_p(u^{n+1},\alpha ^n)-S_p(S_u(\alpha ^{n}),\alpha ^{n}) \Vert \hspace{-1.0pt}| \\&\le L_{S_p}\Vert u^{n+1}-S_u(\alpha ^{n})\Vert \le L_{S_p} C_u \Vert \alpha ^{n} - \widehat{\alpha }\Vert . \end{aligned}\square \end{aligned}$$

We are able to collect the previous lemmas into a descent estimate from which we immediately observe local linear convergence. We recall the definitions of the preconditioning and testing operators M and Z in (11b) and (12).

Lemma 3.14

Let \(n \in \mathbb {N}\) and suppose Assumption 2.2 and 3.4, and the inner exactness estimate (21) hold. Then

$$\begin{aligned} \Vert x^{n+1}-\widehat{x}\Vert _{ZM}^2 +2\varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + 2\varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \le \Vert x^n-\widehat{x}\Vert _{ZM}^2 \end{aligned}$$
(30)

for \(\varphi _u>0\) as in Assumption 2.2 (vi),

$$\begin{aligned} \varepsilon _u :=\frac{\varphi _u \gamma _F(1-\kappa )}{2}> 0, \quad \text {and}\quad \varepsilon _{\alpha } :=\frac{\gamma _\alpha - (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})C_u}{2} > 0. \end{aligned}$$

Proof

We start by proving the monotonicity estimate

$$\begin{aligned} \langle ZH_{n+1}(x^{n+1}),x^{n+1}-\widehat{x}\rangle \ge \mathscr {V}_{n+1}(\widehat{x}) - \frac{1}{2}\Vert x^{n+1}-x^{n}\Vert ^2_{ZM} \end{aligned}$$
(31)

for \(\mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) :=\varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + \varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\). We observe that \(\varepsilon _u,\varepsilon _{\alpha }>0\) by Assumption 2.2. The monotonicity estimate (31) expands as

$$\begin{aligned} h_{n+1} \ge \mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) -\frac{\varphi _u}{2\tau }\Vert u^{n+1}-u^{n}\Vert ^2 -\frac{1}{2\sigma }\Vert \alpha ^{n+1}-\alpha ^{n}\Vert ^2 \end{aligned}$$
(32)

for (all elements of the set)

$$\begin{aligned} h_{n+1}:= \left\langle \begin{pmatrix} \varphi _u\nabla _{u}F(u^{n};\alpha ^{n}) \\ p^{n+1} \nabla _{u}^2 F(u^{n+1};\alpha ^{n}) + \nabla _{\alpha u}F(u^{n+1};\alpha ^{n}) \\ p^{n+1}\nabla _u J(u^{n+1}) + \partial R(\alpha ^{n+1}) \end{pmatrix}, \begin{pmatrix} u^{n+1}-\widehat{u} \\ p^{n+1}-\widehat{p}\\ \alpha ^{n+1}-\widehat{\alpha } \end{pmatrix} \right\rangle . \end{aligned}$$

We estimate each of the three lines of \(h_{n+1}\) separately. For the first line, we use (20) from Lemma 3.8. For the middle line we observe that \(p^{n+1}\nabla _{u}^2 F(u^{n+1};\alpha ^{n}) + \nabla _{\alpha u}F( u^{n+1};\alpha ^{n})=0\) by the p-update of Algorithm 2.1.

For the last line, we use (26) from Lemma 3.12 with \(d=2\). We can do this because (25) holds by (21) and Lemma 3.13. This gives

Summing with (20) we thus obtain

The factor of the first term is \(\varepsilon _u\) and the factor of the last term is zero. Since \(\sigma <1/L_\alpha \) by Lemma 3.2 and \(L_F/(2\kappa ) \le 1/\tau \) by Assumption 2.2 (iv), we obtain (32), i.e., (31).

We now come to the fundamental argument of the testing approach of [18], combining operator-relative monotonicity estimates with the three-point identity. Indeed, (31) combined with the implicit algorithm (10) gives

$$\begin{aligned} \langle ZM(x^{n+1} - x^n),x^{n+1} - \widehat{x}\rangle + \mathscr {V}_{n+1}(\widehat{x}) \le \frac{1}{2}\Vert x^{n +1}-x^{n}\Vert ^2_{ZM}. \end{aligned}$$

Inserting the three-point identity (2) and expanding \(\mathscr {V}_{n+1}\) yields (30). \(\square \)

Before stating our main convergence result for the FEFB, we simplify the assumptions of the previous lemma to just Assumption 2.2.

Lemma 3.15

Suppose Assumption 2.2 holds. Then (30) holds for any \(n\in \mathbb {N}\).

Proof

The claim readily follows if we prove by induction for all \(n \in \mathbb {N}\) that

$$\begin{aligned} \text {Assumption 3.4, (21), and (30) hold}. \end{aligned}$$
(*)

We first prove (*) for \(n=0\). Assumption 2.2 (i) directly establishes (21). The definition of \(r_0\) in Assumption 2.2 also establishes that \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\) and \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). We have just proved the conditions of Lemma 3.13, which establishes (18) for \(n=0\).

Now Lemma 3.5 establishes \(\Vert \hspace{-1.0pt}|p^1 \Vert \hspace{-1.0pt}| \le N_p\). Therefore Assumption 3.4 holds for \(n=0\). Finally Lemma 3.14 proves (30) for \(n=0\). This concludes the proof of the induction base.

We then make the induction assumption that (*) holds for \(n\in \{0,\ldots ,k\}\) and prove it for \(n=k+1\). Indeed, the induction assumption and Lemma 3.11 give (21) for \(n=k+1\). Next (30) for \(n=k\) implies \(\alpha ^{k+1}\in B( \widehat{\alpha },r_0)\) and \(u^{k+1} \in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau } r_0)\), where \(r_0\) and \(r_u\) are as in Assumption 2.2. Therefore Lemma 3.3 gives \(u^{k+2}\in B(\widehat{u}, r_u)\) while Lemma 3.13 establishes (18) for \(n=k+1\). For all \(n\in \{0,\ldots ,k\}\), the inequality (30) implies \(\Vert x^{n+1}-\widehat{x}\Vert _{ZM} \le \Vert x^n-\widehat{x}\Vert _{ZM}\). Therefore Lemma 3.6 proves Assumption 3.4 and finally Lemma 3.14 proves (30) and consequently (*) for \(n=k+1\). \(\square \)

Theorem 3.16

Suppose Assumption 2.2 holds. Then \(\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \rightarrow 0\) linearly.

Proof

Lemma 3.15, expansion of (30), and basic manipulation show that

$$\begin{aligned} \varphi _u\tau ^{-1}&\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \\&\ge (\varphi _u\tau ^{-1}+2\varepsilon _u)\Vert u^{n+1}-\widehat{u}\Vert ^2 + (\sigma ^{-1}+2\varepsilon _{\alpha })\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \\&= (1+2\varepsilon _u\varphi ^{-1}_u\tau )\varphi _u\tau ^{-1}\Vert u^{n+1}-\widehat{u}\Vert ^2 + (1+2\varepsilon _{\alpha }\sigma )\sigma ^{-1}\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \\&\ge \mu \bigl (\varphi _u\tau ^{-1}\Vert u^{n+1}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\bigr ) \end{aligned}$$

for \(\mu := \min \{1+2\varepsilon _u\varphi ^{-1}_u \tau , 1+2\varepsilon _{\alpha }\sigma \}\). Since \(\mu >1\), linear convergence follows. \(\square \)
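Unrolling the final estimate over the iterations makes the geometric rate explicit; as an immediate consequence, recorded here only for concreteness,

$$\begin{aligned} \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \le \mu ^{-n}\bigl (\varphi _u\tau ^{-1}\Vert u^{0}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2\bigr ) \quad \text {for all}\quad n \in \mathbb {N}. \end{aligned}$$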

3.3 Convergence: forward-inexact-forward-backward

We now prove the convergence of Algorithm 2.2. The overall structure and idea of the proofs follow Sect. 3.2 and use several lemmas from Sect. 3.1. We first prove a monotonicity estimate lemma for the adjoint step, and then show that a small enough step length in the outer problem guarantees that the inner and adjoint iterates stay in a small local neighbourhood if they are already in one. The main result of this subsection is in the final Theorem 3.21. It proves under Assumption 2.5 the linear convergence of \(\{(u^n, p^n, \alpha ^n)\}_{n \in \mathbb {N}}\) generated by Algorithm 2.2 to \((\widehat{u}, \widehat{p}, \widehat{\alpha })\) solving the first-order optimality condition (8).

Lemma 3.17

Let \(u\in U\), \(\alpha \in \mathscr {A}\) and \(p_1, p_2, \tilde{p} \in P.\) Moreover, suppose that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) and

$$\begin{aligned} \tilde{p} \nabla _u^2 F(u; \alpha ) + \nabla _{\alpha u}F(u; \alpha ) = 0 \end{aligned}$$
(33)

hold. Then

$$\begin{aligned} \langle \hspace{-2.0pt}\langle p_1 \nabla _{u}^2 F(u; \alpha ) +\nabla _{\alpha u} F(u; \alpha ), p_2-\tilde{p} \rangle \hspace{-2.0pt}\rangle \ge \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_1 - \tilde{p} \Vert \hspace{-1.0pt}|^2 + \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_2 - \tilde{p} \Vert \hspace{-1.0pt}|^2 - \frac{L_F}{2}\Vert \hspace{-1.0pt}|p_2- p_1 \Vert \hspace{-1.0pt}|^2. \end{aligned}$$

Proof

Following [4], using (33), the three-point identity (2) and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) gives for \(A :=\nabla _{u}^2 F(u; \alpha )\) the lower bound

$$\begin{aligned} \langle \hspace{-2.0pt}\langle&p_1 \nabla _{u}^2 F(u; \alpha ) +\nabla _{\alpha u} F(u; \alpha ), p_2-\tilde{p} \rangle \hspace{-2.0pt}\rangle \\&= \langle \hspace{-2.0pt}\langle (p_1-\tilde{p})\nabla _u^2 F(u; \alpha ), p_2-\tilde{p} \rangle \hspace{-2.0pt}\rangle \\&= \sum _{i\in I} \langle \nabla _{u}^2 F(u; \alpha )(p_1-\tilde{p})^*\varphi _i,(p_2-\tilde{p})^*\varphi _i\rangle \\&= \sum _{i\in I} \biggl ( \frac{1}{2} \Vert (p_1-\tilde{p})^*\varphi _i\Vert ^2_{A} - \frac{1}{2}\Vert (p_2-p_1)^*\varphi _i\Vert ^2_{A} + \frac{1}{2}\Vert (p_2-\tilde{p})^*\varphi _i\Vert ^2_{A} \biggr ) \\&\ge \sum _{i\in I}\left( \frac{\gamma _F}{2} \Vert (p_1-\tilde{p})^*\varphi _i\Vert ^2 - \frac{L_F}{2}\Vert (p_2-p_1)^*\varphi _i\Vert ^2 + \frac{\gamma _F}{2}\Vert (p_2-\tilde{p})^*\varphi _i\Vert ^2 \right) \\&= \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_2 - \tilde{p} \Vert \hspace{-1.0pt}|^2 + \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p_1 - \tilde{p} \Vert \hspace{-1.0pt}|^2 - \frac{L_F}{2}\Vert \hspace{-1.0pt}|p_2- p_1 \Vert \hspace{-1.0pt}|^2. \end{aligned}$$

\(\square \)

Lemma 3.18

Let \(k \in \mathbb {N}\). Suppose Assumption 2.5 holds, and that Assumption 3.4 and

$$\begin{aligned} \Vert u^{n+1}-S_u(\alpha ^{n})\Vert \le C_u \Vert \alpha ^{n} - \widehat{\alpha }\Vert \, \text { and } \, \Vert \hspace{-1.0pt}|p^{n+1}-\nabla _{\alpha }S_u(\alpha ^{n}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{n} - \widehat{\alpha }\Vert \end{aligned}$$
(34)

hold for \(n=k\). Then (34) holds for \(n = k+1.\)

Proof

Observe that (34) for \(n=k\) implies (21) as well as (18) for \(n=k\) and \(C=C_p\). Lemma 3.11 therefore proves the first part of (34) for \(n=k+1\), i.e.,

$$\begin{aligned} \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert \le C_u \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$
(35)

We still need to show the second part \(\Vert \hspace{-1.0pt}|p^{k+2}-\nabla _{\alpha }S_u(\alpha ^{k+1}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert \). We follow the fundamental argument of the testing approach (see the end of the proof of Lemma 3.14) and use Assumption 2.5 (ii) and (iii). For the latter we need \(\alpha ^k, \alpha ^{k+1}\in B(\widehat{\alpha }, 2r)\) and \(u^{k+2}, S_u(\alpha ^{k})\in B(\widehat{u}, r_u).\) We have \(\alpha ^{k}\in B(\widehat{\alpha },r_0)\) by Assumption 3.4 and \(\alpha ^{k+1}\in B(\widehat{\alpha },2r_0)\) by Lemma 3.11. Thus we may use the Lipschitz continuity of \(S_u\) with the triangle inequality and (35) to get \(S_u(\alpha ^k)\in B(S_u(\widehat{\alpha }), L_{S_u}r_0)\subset B(\widehat{u}, r_u)\) and

$$\begin{aligned} \begin{aligned} \Vert u^{k+2}-\widehat{u}\Vert&\le \Vert u^{k+2}-S_u(\alpha ^{k+1})\Vert + \Vert S_u(\alpha ^{k+1}) - S_u(\widehat{\alpha })\Vert \\&\le (C_u+ L_{S_u})\Vert \alpha ^{k+1} - \widehat{\alpha } \Vert \le (C_u+ L_{S_u})2r_0, \end{aligned} \end{aligned}$$

which yields \(u^{k+2}\in B(\widehat{u}, r_u).\) The definition of \(S_p\) in (5) implies

$$\begin{aligned} S_p(u^{k+2}, \alpha ^{k+1})\nabla _{u}^2 F(u^{k+2}; \alpha ^{k+1}) + \nabla _{\alpha u} F(u^{k+2}; \alpha ^{k+1})=0. \end{aligned}$$

Since also \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F\le L_F\cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{u}, r_u) \times B(\widehat{\alpha }, 2r)\) from Assumption 2.5 (iii), we get

(36)

from Lemma 3.17. By the p update of the FIFB in the implicit form (13), we have

$$\begin{aligned} p^{k+1} \nabla _{u}^2 F(u^{k+2}; \alpha ^{k+1}) +\nabla _{\alpha u} F(u^{k+2}; \alpha ^{k+1}) = -\theta ^{-1}(p^{k+2}-p^{k+1}). \end{aligned}$$

Combining with (36) gives

An application of the three-point identity (2) with \(\theta L_F \le 1\) from Assumption 2.5 (iv) now yields for \(C_{F,S} = \sqrt{(1+\theta \gamma _F)/(1-\theta \gamma _F)}\) the estimate

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{k+2}-S_p(u^{k+2},\alpha ^{k+1}) \Vert \hspace{-1.0pt}| \le C_{F,S}^{-1} \Vert \hspace{-1.0pt}|p^{k+1}-S_p(u^{k+2},\alpha ^{k+1}) \Vert \hspace{-1.0pt}|. \end{aligned}$$

This estimate and the triangle inequality give

(37)

The solution map \(S_u\) is Lipschitz in \(B(\widehat{\alpha }, 2r)\) and \(S_p\) is Lipschitz in \(B(\widehat{u}, r_u)\times B(\widehat{\alpha }, 2r)\) due to Assumption 2.5 (ii) and (iii), and Lemma 3.1. Combined with the triangle inequality, (34) for \(n=k\) and (35), we obtain

(38)

for

$$\begin{aligned} E_3 :=C_p \Vert \alpha ^{k} - \widehat{\alpha }\Vert + L_{S_p}(1+L_{S_u})\Vert \alpha ^{k+1} - \alpha ^{k}\Vert . \end{aligned}$$

Using again the Lipschitz continuity of \(S_p\) and (35), we get

$$\begin{aligned} E_2 \le L_{S_p}\Vert u^{k+2} - S_u(\alpha ^{k+1})\Vert \le L_{S_p} C_u \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$
(39)

Inserting (38) and (39) into (37) yields

$$\begin{aligned} \Vert \hspace{-1.0pt}|p^{k+2}-\nabla _{\alpha }S_u(\alpha ^{k+1}) \Vert \hspace{-1.0pt}| \le C_{F,S}^{-1} E_3 + (C_{F,S}^{-1} + 1) L_{S_p} C_u \Vert \alpha ^{k+1} - \widehat{\alpha } \Vert . \end{aligned}$$

Therefore the claim follows if we show that

$$\begin{aligned} C_{F,S}^{-1} E_3 \le (C_p - (C_{F,S}^{-1}+1) L_{S_p} C_u) \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$
(40)

Lemma 3.9 proves (22) with \(C=C_p\). Together with Assumption 2.5 (vii) it yields

$$\begin{aligned} \Vert \alpha ^{k+1} - \alpha ^k\Vert \le \sigma C_{\alpha } \Vert \alpha ^{k} - \widehat{\alpha }\Vert \le \frac{(C_{F,S}-1)C_p- (1+C_{F,S})L_{S_p}C_u}{(1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u} \Vert \alpha ^{k} - \widehat{\alpha }\Vert . \end{aligned}$$

Multiplying by \((1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u,\) rearranging terms, and continuing with the triangle inequality gives (40). Indeed,

$$\begin{aligned} E_3&\le C_{F,S}(C_p - (C_{F,S}^{-1}+1) L_{S_p} C_u)(\Vert \alpha ^{k} - \widehat{\alpha }\Vert - \Vert \alpha ^{k+1} - \alpha ^{k}\Vert ) \\&\le C_{F,S}(C_p - (C_{F,S}^{-1}+1) L_{S_p} C_u)\Vert \alpha ^{k+1} - \widehat{\alpha }\Vert . \end{aligned}$$

\(\square \)

We now show that the adjoint iterates stay local if the outer iterates do.

Again, by combining the previous lemmas, we prove an estimate from which local convergence is immediate. For this, we recall the definitions of the preconditioning and testing operators M and Z in (14b) and (12).

Lemma 3.19

Suppose Assumption 2.5 and 3.4, and the inner and adjoint exactness estimate (34) hold for \(n\in \mathbb {N}.\) Then

$$\begin{aligned} \Vert x^{n+1}-\widehat{x}\Vert _{ZM}^2 + 2\varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + 2\varepsilon _p\Vert p^{n}-\widehat{p}\Vert ^2 + 2\varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2 \le \Vert x^n-\widehat{x}\Vert _{ZM}^2 \end{aligned}$$
(41)

for

$$\begin{aligned} \begin{aligned} \varepsilon _u&:=\frac{\varphi _u\gamma _F(1-\kappa )}{2}-C_{\alpha ,1}> 0,&\varepsilon _p :=\frac{\varphi _p \gamma _F}{2}&> 0, \quad \text {and}\quad \\ \varepsilon _{\alpha }&:=\frac{\gamma _\alpha - C_{\alpha ,1} - \sqrt{C_{\alpha ,1}^2+4C_{\alpha ,2}^2}}{2} > 0, \end{aligned} \end{aligned}$$

where \(\varphi _u, \varphi _p>0\) are as in Assumption 2.5, \(C_{\alpha ,1} :=\varphi _p\tfrac{L_FL_{S_p}}{\gamma _F}\), and \(C_{\alpha ,2} :=\bigl (L_{\nabla J}N_p + L_{S_p} N_{\nabla J}\bigr ) \tfrac{C_u}{2}\).

Proof

We start by proving the monotonicity estimate

$$\begin{aligned} \langle ZH_{n+1}(x^{n+1}),x^{n+1}-\widehat{x}\rangle \ge \mathscr {V}_{n+1}(\widehat{x}) - \frac{1}{2}\Vert x^{n+1}-x^{n}\Vert ^2_{ZM} \end{aligned}$$
(42)

for \(\mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) = \varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + \varepsilon _p\Vert p^{n}-\widehat{p}\Vert ^2 + \varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\). We observe that \(\varepsilon _u,\varepsilon _p,\varepsilon _{\alpha }>0\) by Assumption 2.5. The monotonicity estimate (42) expands as

$$\begin{aligned} h_{n+1} \ge \mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) -\frac{\varphi _u}{2\tau }\Vert u^{n+1}-u^{n}\Vert ^2 -\frac{\varphi _p}{2\theta }\Vert \hspace{-1.0pt}|p^{n+1}-p^{n} \Vert \hspace{-1.0pt}|^2 -\frac{1}{2\sigma }\Vert \alpha ^{n+1}-\alpha ^{n}\Vert ^2\nonumber \\ \end{aligned}$$
(43)

for (all elements of the set)

$$\begin{aligned} \nonumber h_{n+1}:= \left\langle \begin{pmatrix} \varphi _u\nabla _{u}F(u^{n};\alpha ^{n}) \\ \varphi _p \left( p^{n} \nabla _{u}^2 F(u^{n+1};\alpha ^{n}) + \nabla _{\alpha u}F(u^{n+1};\alpha ^{n})\right) \\ p^{n+1}\nabla _u J(u^{n+1}) + \partial R(\alpha ^{n+1}) \end{pmatrix}, \begin{pmatrix} u^{n+1}-\widehat{u} \\ p^{n+1}-\widehat{p}\\ \alpha ^{n+1}-\widehat{\alpha } \end{pmatrix} \right\rangle . \end{aligned}$$

We estimate the three lines of \(h_{n+1}\) separately. We immediately take care of the first line by using (20) from Lemma 3.8.

For the second line, using the optimality condition (4) we have

(44)

We have \(u^{n+1}\), \(S_u(\alpha ^n)\in B(\widehat{u}, r_u)\), and \(\alpha ^n\in B(\widehat{\alpha }, r)\) by Lemma 3.3 and Assumption 3.4. Thus \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F\le L_F\cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{u}, r_u) \times B(\widehat{\alpha }, 2r)\) and \(\Vert \nabla _{u}^2 F(u^{n+1};\alpha ^{n})\Vert \le L_F\) by Assumption 2.5 (iii). We get

$$\begin{aligned} E_1 \ge \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 - \frac{L_F}{2}\Vert \hspace{-1.0pt}|p^{n+1} - p^{n} \Vert \hspace{-1.0pt}|^2 \end{aligned}$$
(45)

from Lemma 3.17. By Theorem 6.1 (i), \(\langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle \) is an inner product and \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) a norm on \(\mathbb {L}(U; \mathscr {A})\), so we can use the Cauchy–Schwarz inequality for them. Therefore, using also Theorem 6.1 (ii), Lemma 3.1 and Young's inequality, we can estimate

$$\begin{aligned} \begin{aligned} E_2&\ge - \big | \langle \hspace{-2.0pt}\langle (\widehat{p}-S_p(u^{n+1},\alpha ^{n})) \nabla _{u}^2 F(u^{n+1}; \alpha ^{n}), p^{n+1}-\widehat{p} \rangle \hspace{-2.0pt}\rangle \big | \\&\ge - \Vert \hspace{-1.0pt}|(\widehat{p}-S_p(u^{n+1},\alpha ^{n})) \nabla _{u}^2 F(u^{n+1}; \alpha ^{n}) \Vert \hspace{-1.0pt}| \Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}| \\&\ge - L_F\Vert \hspace{-1.0pt}|S_p(\widehat{u}, \widehat{\alpha })-S_p(u^{n+1},\alpha ^{n}) \Vert \hspace{-1.0pt}| \Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}| \\&\ge - L_FL_{S_p}\left( \Vert u^{n+1}-\widehat{u}\Vert + \Vert \alpha ^{n} - \widehat{\alpha }\Vert \right) \Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}| \\&\ge - \frac{L_FL_{S_p}}{\gamma _F}\left( \Vert u^{n+1}-\widehat{u}\Vert ^2 + \Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \right) - \frac{\gamma _F}{2}\Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}|^2. \end{aligned} \end{aligned}$$
(46)

Inserting (45) and (46) into (44), we obtain

(47)

The factor of the last term equals \(C_{\alpha ,1}\) from Assumption 2.5 (v).

Since Assumption 3.4 and (34) hold, Lemma 3.12 gives (26) with \(C = C_p\) and any \(d>0\) for the third line of \(h_{n+1}\). Summing (20), (47) and (26) we finally deduce

We have

$$\begin{aligned} \frac{C_{\alpha ,2}}{d} = C_{\alpha ,1} + C_{\alpha ,2}d = \frac{C_{\alpha ,1}}{2} + \frac{\sqrt{C_{\alpha ,1}^2+4C_{\alpha ,2}^2}}{2} \quad \text {for}\quad d = \frac{-C_{\alpha ,1} + \sqrt{C_{\alpha ,1}^2+4C_{\alpha ,2}^2}}{2C_{\alpha ,2}}. \end{aligned}$$

Then also \(\frac{\gamma _\alpha }{2}- \frac{C_{\alpha ,2}}{d}=\varepsilon _{\alpha }\). It follows

Since \(\sigma <1/L_\alpha \) by Lemma 3.2, \(L_F/(2\kappa ) \le 1/\tau \) and \(\theta < 1/L_F\) by Assumption 2.5 (iv), we obtain (43), i.e., (42). We finish by applying the fundamental argument of the testing approach to (42) and the general implicit update (10), as in Lemma 3.14. \(\square \)

We simplify the assumptions of the previous lemma to just Assumption 2.5.

Lemma 3.20

Suppose Assumption 2.5 holds. Then (41) holds for any \(n\in \mathbb {N}\).

Proof

The claim readily follows if we prove by induction for all \(n\in \mathbb {N}\) that

$$\begin{aligned} \text {Assumption 3.4, (34), and (41) hold}. \end{aligned}$$
(**)

We first prove (**) for \(n=0.\) Assumption 2.5 (i) directly establishes (34). The definition of \(r_0\) in Assumption 2.5 also establishes that \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(u^n\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0).\) We have just proved the conditions of Lemma 3.5, which gives \(\Vert \hspace{-1.0pt}|p^1 \Vert \hspace{-1.0pt}|\le N_p\). Thus we have proved Assumption 3.4 for \(n=0.\) Now Lemma 3.19 proves (41) for \(n=0.\) This concludes the proof of the induction base.

We then make the induction assumption that (**) holds for \(n\in \{0,\ldots ,k\}\) and prove it for \(n=k+1.\) The induction assumption and Lemma 3.18 give (34) for \(n=k+1.\) The inequality (41) for \(n\in \{0,\ldots ,k\}\) also ensures \(\Vert x^{n+1}-\widehat{x}\Vert _{ZM} \le \Vert x^{n}-\widehat{x}\Vert _{ZM}\) for \(n\in \{0,\ldots ,k\}.\) Therefore Lemma 3.6 proves Assumption 3.4 for \(n=k+1\). Now Lemma 3.19 shows (41) and concludes the proof of (**) for \(n=k+1\). \(\square \)

We finally come to the main convergence result for the FIFB.

Theorem 3.21

Suppose Assumption 2.5 holds. Then \(\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \varphi _p\theta ^{-1}\Vert p^{n}-\widehat{p}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \rightarrow 0\) linearly.

Proof

We define \(\mu _1:= \min \{(1+2\varepsilon _u\varphi ^{-1}_u \tau ), (1+2\varepsilon _{\alpha }\sigma )\}\) and \(\mu _2:= 1 - 2\varepsilon _p\varphi ^{-1}_p\theta .\) Lemma 3.20, expansion of (41), and basic manipulation show that

$$\begin{aligned} \mu _1\bigl (\varphi _u\tau ^{-1}\Vert u^{n+1}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\bigr ) + \varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n+1}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \\ \le \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2. \end{aligned}$$
(48)

There are two separate cases: (i) \(\mu _1 \mu _2 \le 1\) and (ii) \(\mu _1 \mu _2 > 1.\) In case (i), we have

$$\begin{aligned} \begin{aligned}&\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \\&= \mu ^{-1}_1 \bigl ( \mu _1 \bigl ( \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \bigr ) + \mu _1\mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \bigr ) \\&\le \mu ^{-1}_1 \bigl ( \mu _1 \bigl ( \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \bigr ) + \varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \bigr ), \end{aligned} \end{aligned}$$

which with (48) implies linear convergence since \(\mu ^{-1}_1\in (0,1).\) In case (ii), we obtain

$$\begin{aligned} \begin{aligned}&\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \mu _2\varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \\&\le \mu _2 \bigl ( \mu _1 \bigl ( \varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \bigr ) + \varphi _p\theta ^{-1}\Vert \hspace{-1.0pt}|p^{n}-\widehat{p} \Vert \hspace{-1.0pt}|^2 \bigr ), \end{aligned} \end{aligned}$$

which with (48) implies linear convergence since \(\mu _2\in (0,1).\) \(\square \)

4 Numerical experiments

We evaluate the performance of our proposed algorithms on parameter learning for (anisotropic, smoothed) total variation image denoising and deconvolution. For a “ground truth” image \(b \in \mathbb {R}^{N^2}\) of dimensions \(N \times N\), we take

$$\begin{aligned} J(u) = \frac{1}{2}\Vert u-b\Vert _2^2 \end{aligned}$$

as the outer fitness function. For b we use a cropped portion of image 02 or 08 from the free Kodak dataset [49] converted to gray values in [0, 1]. The purpose of these numerical experiments is a simple performance comparison between our proposed methods and a few representative approaches from the literature. We therefore only consider a single ground-truth image b and a corresponding corrupted data z in the various inner problems, which we next describe. For proper generalizable parameter learning, multiple such training pairs \((b_i, z_i)\) should be used. This can in principle be done by summing over all the data in both the inner and outer problems, resulting in a higher-dimensional bilevel problem; see, e.g., [50]. In practice, a large sample count would require stochastic techniques.

4.1 Denoising

For denoising we take in (1) as the inner objective

$$\begin{aligned} F(u; \alpha ) = \frac{1}{2}\Vert u - z\Vert _2^2 + \alpha \Vert Du\Vert _{1,\gamma } \quad (u \in \mathbb {R}^{N^2}, \alpha \in [0, \infty )), \end{aligned}$$

and as the outer regulariser \(R \equiv 0\). The simulated measurement z is obtained from b by adding Gaussian noise of standard deviation 0.1. The matrix D is a backward difference operator with Dirichlet boundary conditions. Instead of the one-norm \(\Vert \cdot \Vert _1\), we use a \(C^2\) Huber- or Moreau–Yosida-type approximation, which ensures the twice differentiability of the objective and hence a simple adjoint equation, with

$$\begin{aligned} \Vert y\Vert _{1, \gamma }:= \sum _{i=1}^{2N^2} \rho _{\gamma }(y_i), \quad \text {where}\quad \rho _{\gamma }(x) :={\left\{ \begin{array}{ll} -\frac{|x|^3}{3\gamma ^2} + \frac{|x|^2}{\gamma } &{} \text {if } \, |x|\le \gamma , \\ |x| - \frac{\gamma }{3} &{} \text {if } \, |x| > \gamma . \end{array}\right. } \end{aligned}$$

We used \(\gamma =10^{-4}\) in our experiments (Fig. 1).
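To make the inner problem concrete, the following minimal sketch (not the authors' implementation, which is the Matlab code on Zenodo [54]) evaluates \(\rho _{\gamma }\) and the denoising objective; the construction of the difference matrix D and the vectorisation order of the image are illustrative assumptions.

```python
# Minimal sketch of the smoothed-TV denoising inner objective of Sect. 4.1.
# The difference matrix D (backward differences, Dirichlet boundary) and the
# vectorisation of the N-by-N image are illustrative assumptions.
import numpy as np
import scipy.sparse as sp

def difference_operator(N):
    """Stacked backward differences D : R^{N^2} -> R^{2 N^2}."""
    d1 = sp.eye(N) - sp.eye(N, k=-1)        # 1D backward difference, u_{-1} = 0
    I = sp.identity(N)
    Dx = sp.kron(I, d1)                     # differences along one image axis
    Dy = sp.kron(d1, I)                     # differences along the other axis
    return sp.vstack([Dx, Dy]).tocsr()

def huber(y, gamma=1e-4):
    """The C^2 smoothing rho_gamma of |.| together with its first two derivatives."""
    a = np.abs(y)
    inside = a <= gamma
    val = np.where(inside, -a**3/(3*gamma**2) + a**2/gamma, a - gamma/3)
    d1  = np.where(inside, np.sign(y)*(-a**2/gamma**2 + 2*a/gamma), np.sign(y))
    d2  = np.where(inside, -2*a/gamma**2 + 2/gamma, 0.0)
    return val, d1, d2

def F(u, alpha, z, D, gamma=1e-4):
    """Inner objective 0.5*||u - z||^2 + alpha*||D u||_{1,gamma}."""
    val, _, _ = huber(D @ u, gamma)
    return 0.5*np.dot(u - z, u - z) + alpha*val.sum()
```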

Fig. 1 Denoising data and results for the implicit and FIFB methods (\(N=128\))

4.2 Deconvolution

For deconvolution, we take as the inner objective

$$\begin{aligned} F(u; \alpha ) = \frac{1}{2}\Vert K(\alpha ) * u-z\Vert _2^2 + C\alpha _1 \Vert Du\Vert _{1,\gamma }, \quad (u \in \mathbb {R}^{N^2}, \alpha \in [0, \infty )^4 ), \end{aligned}$$

and as the outer regulariser \(R(\alpha )=\beta (\sum _{i=2}^4\alpha _i-1)^2+\delta _{[0, \infty )}(\alpha _1)\) for a regularisation parameter \(\beta =10^4\). We introduce the constant \(C=\tfrac{1}{10}\) to help convergence by ensuring the same order of magnitude for all components of \(\alpha \). The first element of \(\alpha \) is the total variation regularisation parameter, while the rest parametrise the convolution kernel \(K(\alpha )\) as illustrated in Fig. 2a. The sum of the elements of the kernel equals \(\alpha _2 + \alpha _3 + \alpha _4.\) The operator \(r_{\theta }\) rotates an image by \(\theta \) degrees, clockwise for \(\theta >0\) and counterclockwise for \(\theta <0.\) We form z by computing \(r_{-1}(K(\alpha )*r_1(b))\) for kernel parameters \((\alpha _2, \alpha _3, \alpha _4) = (0.15, 0.1, 0.75)\) and adding Gaussian noise of standard deviation \(1\cdot 10^{-2}\).
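The kernel parametrisation \(K(\alpha )\) itself is specified only through Fig. 2a, so we do not reproduce it here. For the outer regulariser R alone, the following sketch (our own illustration, not taken from the implementation) evaluates R and a closed-form proximal step for it, as could be used for the backward \(\alpha \)-step; the prox formula follows by solving the blockwise optimality conditions.

```python
# Sketch (illustration only) of the deconvolution outer regulariser
# R(alpha) = beta*(alpha_2 + alpha_3 + alpha_4 - 1)^2 + indicator_{[0,inf)}(alpha_1)
# and a closed-form prox that could serve as the backward alpha-step.
import numpy as np

BETA = 1e4

def R(alpha):
    if alpha[0] < 0:
        return np.inf                        # indicator of [0, inf) on alpha_1
    return BETA * (alpha[1:].sum() - 1.0)**2

def prox_R(y, sigma):
    """argmin_a  sigma*R(a) + 0.5*||a - y||^2, solved blockwise."""
    a = np.asarray(y, dtype=float).copy()
    a[0] = max(a[0], 0.0)                    # projection onto [0, inf) for alpha_1
    # Quadratic block: a_i + 2*sigma*BETA*(s - 1) = y_i for i = 2,3,4,
    # where s = a_2 + a_3 + a_4; summing the three equations determines s.
    c = 2.0 * sigma * BETA
    s = (y[1:].sum() + 3.0 * c) / (1.0 + 3.0 * c)
    a[1:] = y[1:] - c * (s - 1.0)
    return a
```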

Fig. 2 Deconvolution kernel parametrisation, data, and result for FIFB (\(N=128\))

For denoising, and for deconvolution assuming \(\ker D \cap \ker K(\alpha ) = \{0\}\), it is not difficult to verify the structural parts of Assumption 2.2 and 2.5, required for the convergence results of Sect. 3. We do not attempt to verify the conditions on the step lengths, choosing them by trial and error.

4.3 An implicit baseline method

We evaluate Algorithms 2.1 and 2.2 against a conventional method that solves both the inner problem and the adjoint equation (near-)exactly, as well as the AID [26]. We also experimented with solving the equivalent constrained optimisation problem \(\min _{\alpha , u} J(u)\) subject to \(\nabla _u F(u; \alpha )=0\) with IPOPT [51] and the NL-PDPS [52, 53]. However, we did not observe convergence without the inclusion of additional \(H^1\) regularisation in the inner problem, as in, e.g., [7]. Since that changes the problem, we have not included “whole problem” approaches in our comparison.

To solve the inner problem in the implicit baseline method, we use gradient descent, starting at \(v^0=0\) and updating \(v^{m+1} :=v^m - \tau _m \nabla F(v^m; \alpha ^k)\). We then set \(u^{k+1}=v^{m+1}\). The adjoint and outer iterate updates are as in Algorithm 2.1; however, we determine \(\sigma =\sigma _k\) by the line search rule [19, (12.41)] for nonsmooth problems, starting at \(\sigma _k=5\cdot 10^{-5}\) and multiplying by 0.1 on each line search step. For deconvolution we use a fixed step length parameter, as it performed better. The specific parameter choices (step lengths, number of inner and adjoint iterations) for all algorithms and experiments are listed in Table 1.
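As a minimal sketch of this inner solve, with a constant step length and a fixed iteration count standing in for the choices of Table 1, and with grad_F assumed to evaluate \(\nabla _u F(\,\varvec{\cdot }\,; \alpha )\):

```python
# Sketch of the baseline's inner solve: gradient descent on u -> F(u; alpha^k)
# started from zero.  The constant step length tau and iteration count m_max
# are placeholders for the actual choices in Table 1.
import numpy as np

def inner_solve(grad_F, alpha_k, n_unknowns, tau, m_max):
    v = np.zeros(n_unknowns)                 # v^0 = 0
    for _ in range(m_max):
        v = v - tau * grad_F(v, alpha_k)     # v^{m+1} = v^m - tau_m grad_u F(v^m; alpha^k)
    return v                                 # taken as u^{k+1}
```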

Table 1 Algorithm parametrisation (step length parameters, inner and adjoint iteration counts), time multiplier, and outer steps taken to reach threshold computational resources (CPU time) value
Fig. 3 Graph of the composed objective for the denoising problem (both problem sizes), along with the near-optimal \(\tilde{\alpha }\) found by recursive subdivision

4.4 Numerical setup

Our algorithm implementations are available on Zenodo [54]. To solve the adjoint equation in the FEFB and implicit methods, we use Matlab’s bicgstab implementation of the stabilized biconjugate gradient method [55] with tolerance \(10^{-5}\), and maximum iteration count \(10^{3}\). With the AID we use 50 conjugate gradient iterations. These choices, as well as the choice of the number of inner iterations for the implicit method and the AID, have been made by trial and error to be as small as possible while obtaining an apparently stable algorithm.
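For the denoising problem, where \(\alpha \) is scalar, the adjoint equation \(p\nabla _u^2F(u;\alpha ) + \nabla _{\alpha u}F(u;\alpha ) = 0\) is a single symmetric linear system. The following sketch solves it with SciPy's bicgstab standing in for Matlab's; the derivatives are our own computation from the F of Sect. 4.1, and the \(10^{-5}\) tolerance of the experiments would be passed through the tolerance keyword of the installed SciPy version.

```python
# Sketch of the adjoint solve for denoising (scalar alpha), with SciPy's
# bicgstab in place of Matlab's.  difference_operator and huber are as in the
# earlier sketch; the derivatives of F are our own computation.
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

def adjoint_solve(u, alpha, D, gamma=1e-4):
    _, d1, d2 = huber(D @ u, gamma)
    # Hessian of F in u:  I + alpha * D^T diag(rho'') D
    H = sp.identity(D.shape[1]) + alpha * (D.T @ sp.diags(d2) @ D)
    # Mixed derivative grad_{alpha u} F = D^T rho'(D u); the adjoint p solves
    # H p = -grad_{alpha u} F.
    rhs = -(D.T @ d1)
    p, info = bicgstab(H, rhs, maxiter=1000)
    assert info == 0, "bicgstab did not converge"
    return p
```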

To evaluate scalability, we consider for denoising both \(N=128\) and \(N=256\). For deconvolution we consider \(N=128\) and \(N=32.\) We take initial \(u^0 = S_u(\alpha ^0)\) and \(p^0 = S_p(u^0, \alpha ^0)\) where for denoising \(\alpha ^0=0\) and for deconvolution \(\alpha ^{0}=[0.4, 0.25, 0.25, 0.5]\) and \(\alpha ^{0}=[0.04, 0.25, 0.25, 0.5]\) with \(N=128\) and \(N=32\) respectively.

To compare algorithm performance, we plot relative errors against the cputime value of Matlab on an AMD Ryzen 5 5600 H CPU. We call this value “computational resources”, as it takes into account the use of several CPU cores by Matlab's internal linear algebra, making it fairer than the actual running time. For each algorithm and problem, we indicate in Table 1 the step length parameters, the number of outer steps to reach the computational resources value of 6000 for denoising and 15,000 for deconvolution, and an average multiplier to convert computational resources into running times.

Fig. 4 Denoising performance. The graphs correspond to the FIFB (green line), FEFB (orange line), AID (pink line), and implicit method (violet line)

Fig. 5 Deconvolution performance. The graphs correspond to the FIFB (green line), FEFB (orange line), AID (pink line), and implicit method (violet line). For the inner problem, we additionally plot the relative error of \(S_u(\alpha ^k)\) for the FIFB (green dotted line)

For performance comparison, we need estimates \(\tilde{\alpha }\) and \(\tilde{u}\) of the optimal \(\hat{\alpha }\) and \(\hat{u}=S_u(\hat{\alpha })\). For denoising we find them by searching for the one-dimensional variable \(\alpha \) on a regular grid and recursively subdividing until the node spacing goes below \(10^{-5}\). As \(\tilde{u}\), we take an estimate of \(S_u(\tilde{\alpha })\) obtained with 25,000 steps of the implicit baseline method. We visualise the so-obtained \(\tilde{\alpha }\) and \(J \circ S_u\) in Fig. 3. For the higher-dimensional deconvolution problem, such a scan is not feasible. Instead, we obtain the comparison estimates by running the implicit method from a very good initial iterate until the computational resources (CPU time) value of 6000 for \(N=32\) and 10,000 for \(N=128\). Specifically, we initialise the kernel parameters \((\alpha _2,\alpha _3,\alpha _4)\) as those used for generating the data z, and the regularisation parameter \(\alpha _1 = 0.045\) for \(N=32\) and \(\alpha _1=0.02\) for \(N=128\), the latter found by trial and error. This initialisation is different from that used for the actual numerical experiments; see above. Our experiments indicate that the other methods approach the \(\tilde{\alpha }\) so obtained faster than the implicit method itself, providing some justification for the choice.

With these solution estimates we define the inner and outer relative errors

$$\begin{aligned} e_{\alpha ,\text {rel}} :=\tfrac{\Vert \tilde{\alpha }- \alpha ^k\Vert }{\Vert \tilde{\alpha }\Vert } \quad \text {and}\quad e_{u,\text {rel}} :=\tfrac{\Vert \tilde{u} - u^k\Vert }{\Vert \tilde{u}\Vert }. \end{aligned}$$
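In code, with the estimates \(\tilde{\alpha }\) and \(\tilde{u}\) above, these are simply:

```python
# Inner and outer relative errors used in the performance plots.
import numpy as np

def relative_errors(alpha_k, u_k, alpha_tilde, u_tilde):
    e_alpha = np.linalg.norm(alpha_tilde - alpha_k) / np.linalg.norm(alpha_tilde)
    e_u = np.linalg.norm(u_tilde - u_k) / np.linalg.norm(u_tilde)
    return e_alpha, e_u
```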

4.5 Results

We report performance in Figs. 4 and 5 and the image data and reconstructions in Figs. 1 and 2. Figure 5 indicates that for deconvolution the FIFB significantly outperforms the other methods. The outer variable converges much faster than for the other evaluated methods, despite the fact that the inner variable, especially with \(N=32\), stays some distance away from \(\tilde{u}\). However, as the dotted line indicates, the exact solution \(S_u(\alpha ^k)\) of the inner problem for the corresponding outer iterate shows clear signs of convergence. (The few “spikes” in the graph for \(N=128\) temporarily have the regularisation parameter \(\alpha ^k_0\) much closer to zero than \(\tilde{\alpha }_0\).) This observation justifies the intuition that the inner problem does not need to be solved to high accuracy to obtain convergence for the outer problem, and that such accurate solutions can even be detrimental to convergence. The exact solution of the adjoint equation in both the implicit method and the FEFB causes them to be too slow to make any meaningful progress. The denoising experiments of Fig. 4 likewise suggest that the FIFB is initially the best performing algorithm, although the implicit method and the AID catch up later. On the small denoising problem (\(N=128\)), the implicit method is significantly faster than any other method. Overall, and for practical purposes, nevertheless, the FIFB appears to perform the best.