Abstract
We propose a new approach to solving bilevel optimization problems, intermediate between solving full-system optimality conditions with a Newton-type approach, and treating the inner problem as an implicit function. The overall idea is to solve the full-system optimality conditions, but to precondition them to alternate between taking steps of simple conventional methods for the inner problem, the adjoint equation, and the outer problem. While the inner objective has to be smooth, the outer objective may be nonsmooth subject to a prox-contractivity condition. We prove linear convergence of the approach for combinations of gradient descent and forward-backward splitting with exact and inexact solution of the adjoint equation. We demonstrate good performance on learning the regularization parameter for anisotropic total variation image denoising, and the convolution kernel for image deconvolution.
1 Introduction
Two general approaches are typical for the solution of the bilevel optimization problem
in Hilbert spaces \(\mathscr {A}\) and U. The first, familiar from the treatment of general mathematical programming with equilibrium constraints (MPECs), is to write out the Karush–Kuhn–Tucker conditions for the whole problem in a suitable form, and to apply a Newton-type method or other nonlinear equation solver to them [1,2,3,4,5].
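Throughout, based on the discussion here and in Sect. 2, the bilevel problem (1) has the composed form
$$\begin{aligned} \min _{\alpha \in \mathscr {A}}\ J(S_u(\alpha )) + R(\alpha ) \quad \text {where}\quad S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits _{u \in U} F(u; \alpha ), \end{aligned}$$
with inner objective F, outer fitness function J, and a possibly nonsmooth outer regularizer R.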
The second approach, common in the application of (1) to inverse problems and imaging [6,7,8,9,10,11,12], treats the solution mapping \(S_u\) as an implicit function. On each outer iteration k it is then necessary to (i) solve the inner problem \(\min _u F(u; \alpha ^k)\) near-exactly using an optimization method of choice; (ii) solve an adjoint equation to calculate the gradient of the solution mapping; and (iii) take a step of another optimization method of choice on the outer problem \(\min _\alpha J(S_u(\alpha ))\), using the knowledge of \(S_u(\alpha ^k)\) and \(\nabla S_u(\alpha ^k)\). The inner problem is therefore generally assumed to have a unique solution, and the solution map to be differentiable. An algorithm for nonsmooth inner problems has been developed in [13], while the authors of [14] rely on proving directional Bouligand differentiability for otherwise nonsmooth problems.
The challenge of the first “whole-problem” approach is to scale it to large problems, typically involving the inversion of large matrices. The difficulty with the second “implicit function” approach is that the inner problem needs to be solved several times, which can be expensive. Solving the adjoint equation also requires matrix inversion. The variant in [15] avoids this through derivative-free methods for the outer problem. It also solves the inner problem to a low but controlled accuracy.
In this paper, by preconditioning the implicit-form first-order optimality conditions, we develop an intermediate approach more efficient than the aforementioned, as we demonstrate in the numerical experiments of Sect. 4. It can be summarized as (i) take only one step of an optimization method on the inner problem, (ii) perform a cheap operation to advance towards the solution of the adjoint equation, and, finally, (iii) using this approximate information, take one step of an optimization method for the outer problem. Repeat.
The preconditioning, which we introduce in detail in Sect. 2, is based on insight from the derivation of the primal-dual proximal splitting of [16] as a preconditioned proximal point method [17,18,19]. We write the optimality conditions for (1) as the inclusion \(0 \in H(x)\) for a set-valued H, where \(x=(u,p,\alpha )\) for an adjoint variable p. The basic proximal point method then iteratively solves \(x^{k+1}\) from
This can be as expensive as solving the original optimality condition. The idea then is to introduce a preconditioning operator M that decouples the components of x—in our case u, p and \(\alpha \)—such that each component can be solved in succession from
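That is, in the notation of [17,18,19], the basic and the preconditioned proximal point iterations read, schematically,
$$\begin{aligned} 0 \in H(x^{k+1}) + (x^{k+1} - x^{k}) \qquad \text {and}\qquad 0 \in H(x^{k+1}) + M(x^{k+1} - x^{k}). \end{aligned}$$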
Gradient steps can be handled through nonlinear preconditioning [18, 19], as we will see in Sect. 2 when we develop the approach in detail along with two more specific algorithms, the FEFB (Forward-Exact-Forward-Backward) and the FIFB (Forward-Inexact-Forward-Backward). In Sect. 3 we prove their local linear convergence under a second-order growth condition on the composed objective \(J \circ S_u\), and other more technical conditions. The proof is based on the “testing” approach developed in [18] and also employed extensively in [19, 20]. Finally, we evaluate the numerical performance of the proposed schemes on imaging applications in Sect. 4, specifically the learning of a regularization parameter for total variation denoising, and the convolution kernel for deblurring. Since the purpose of these experiments is a simple performance comparison between different algorithms, instead of real applications, we only use a single training sample of various dimensions, as explained in Sect. 4.
Intermediate approaches, some reminiscent of ours, have recently also been developed in the machine learning community. Our approach, however, allows a nonsmooth function R in the outer problem (1). Moreover, to our knowledge, our work is the first to show linear convergence for a fully “single-loop” algorithm. To be more precise, the STABLE [21], TTSA [22], FLSA [23], MRBO, VRBO [24], and SABA [25] are “single-loop” algorithms like ours, taking only a single step towards the solution of the inner problem on each outer iteration. The STABLE requires solving the adjoint equation exactly, as does our first approach, the FEFB. The others use a Neumann series approximation for the adjoint equation. Our second approach, the FIFB, takes a simple step reminiscent of gradient descent for the adjoint equation. The TTSA and STABLE obtain sublinear convergence of the outer iterates \(\{\alpha ^k\}_{k \in \mathbb {N}}\) assuming strong convexity (second-order growth) of both the inner and outer objective. For the SABA, linear convergence is claimed with the outer strong convexity replaced by a Polyak–Łojasiewicz inequality. Without either of those assumptions, the theoretical results on the aforementioned methods are much weaker, generally only showing various forms of “stall” of the iterates at a sublinear rate, or the ergodic convergence of the gradient \(\nabla _{\alpha }[J\circ S_u](\alpha ^k)\) of the composed objective to zero. Such modes of convergence say very little about the convergence of function values to the optimum or of the iterates to a solution.
Among approaches that are not fully single-loop, the AID, ITD [26], AccBio [27], and ABA [28] take a fixed (small) number of inner iterations for each outer iteration. For the AID and ITD, only sublinear convergence of the composed gradient is claimed. For the ABA and AccBio, linear convergence of outer function values is claimed under strong convexity of both the inner and outer objectives.
1.1 Fundamentals and applications
Fundamentals of MPECs and bilevel optimization are treated in the books [29,30,31,32]. An extensive literature review up to 2018 can be found in [33], and recent developments in [34]. Optimality conditions for bilevel problems, both necessary and sufficient, are developed in, e.g., [35,36,37,38,39]. A more limited type of “bilevel” problems only constrains \(\alpha \) to lie in the set of minimisers of another problem. Algorithms for such problems are treated in [40, 41].
Bilevel optimization has been used for learning regularization parameters and forward operators for inverse imaging problems. With total variation regularization in the inner problem, the parameter learning problem in its most basic form reads [7]
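A representative instance, sketched here with a scalar regularization weight \(\alpha \ge 0\), is
$$\begin{aligned} \min _{\alpha \ge 0}\ \tfrac{1}{2}\Vert S_u(\alpha ) - b\Vert ^2 \quad \text {where}\quad S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits _u\ \tfrac{1}{2}\Vert A_\alpha u - z\Vert ^2 + \alpha \,\mathrm {TV}(u), \end{aligned}$$
where \(\mathrm {TV}\) denotes (anisotropic) total variation.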
This problem finds the best possible \(\alpha \) for reconstructing the “ground truth” image b from the measurement data z, which may be noisy and possibly transformed and only partially known through the forward operator \(A_\alpha \), mapping images to measurements. To generalize to multiple images, the outer problem would sum over them and corresponding inner problems [12]. Multi-parameter regularization is discussed in [42], and natural conditions for \(\alpha >0\) in [43]. In other works, the forward operator \(A_\alpha \) is learned for blind image deblurring [44] or undersampling in magnetic resonance imaging [11]. In [8] regularization kernels are learned, while [14, 45] study the learning of optimal discretisation schemes. To circumvent the non-differentiability of \(S_u\), [46] replace the inner problem with a fixed number of iterations of an algorithm. Their approach has connections to the learning of deep neural networks.
Bilevel problems can also be seen as leader–follower or Stackelberg games: the outer problem or agent leads by choosing \(\alpha \), and the inner agent reacts with the best possible u for that \(\alpha \). Multiple-agent Nash equilibria may also be modeled as bilevel problems. Both types of games can be applied to financial markets and resource use planning; we refer to the aforementioned books [29,30,31,32] for specific examples.
1.2 Notation and basic concepts
We write \(\mathbb {L}(X; Y)\) for the space of bounded linear operators between the normed spaces X and Y and \({{\,\textrm{Id}\,}}\) for the identity operator. Generally X will be Hilbert, so we can identify it with the dual \(X^*\).
For \(G \in C^1(X)\), we write \(G'(x) \in X^*\) for the Fréchet derivative at x, and \(\nabla G(x) \in X\) for its Riesz representation, i.e., the gradient. For \(E \in C^1(X; Y)\), since \(E'(x) \in \mathbb {L}(X; Y)\), we use the Hilbert adjoint to define \(\nabla E(x) :=E'(x)^* \in \mathbb {L}(Y; X)\). Then the Hessian \(\nabla ^2 G(x) :=\nabla [\nabla G](x) \in \mathbb {L}(X; X)\). When necessary we indicate the differentiation variable with a subscript, e.g., \(\nabla _u F(u, \alpha )\). For convex \(R: X \rightarrow \overline{\mathbb {R}}\), we write \({{\,\textrm{dom}\,}}R\) for the effective domain and \(\partial R(x)\) for the subdifferential at x. With slight abuse of notation, we identify \(\partial R(x)\) with the set of Riesz representations of its elements. We define the proximal operator as \({{\,\textrm{prox}\,}}_R(x) :=\mathop {\mathrm {arg\,min}}\limits _z \frac{1}{2}\Vert z-x\Vert ^2 + R(z)=({{\,\textrm{Id}\,}}+\partial R)^{-1}(x)\).
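For instance, for \(R = \beta \Vert \,\varvec{\cdot }\,\Vert _1\) the proximal operator reduces to componentwise soft-thresholding. A minimal numerical sketch (the weight \(\beta \) and the test vector are illustrative, not from the paper):

```python
import numpy as np

def prox_l1(x, beta):
    # prox_{beta*||.||_1}(x) = argmin_z 0.5*||z - x||^2 + beta*||z||_1,
    # computed componentwise: shrink large entries by beta, zero out small ones.
    return np.sign(x) * np.maximum(np.abs(x) - beta, 0.0)

# Entries with |x_i| <= beta are set to zero; others move toward zero by beta.
print(prox_l1(np.array([3.0, -0.5, 1.0]), 1.0))
```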
We write \(\langle x,y\rangle \) for an inner product, and B(x, r) for a closed ball in a relevant norm \(\Vert \,\varvec{\cdot }\,\Vert \). For self-adjoint positive semi-definite \(M\in \mathbb {L}(X; X)\) we write \(\Vert x\Vert _{M} :=\sqrt{\langle x,x\rangle _{M}} :=\sqrt{\langle Mx,x\rangle }.\) Pythagoras’ or three-point identity then states
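In the weighted norms just introduced, it reads
$$\begin{aligned} \langle x-y,x-z\rangle _M = \tfrac{1}{2}\Vert x-y\Vert _M^2 - \tfrac{1}{2}\Vert y-z\Vert _M^2 + \tfrac{1}{2}\Vert x-z\Vert _M^2 \end{aligned}$$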
for all \(x,y,z\in X\). We extensively use Young’s inequality
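in the standard form
$$\begin{aligned} \langle x,y\rangle \le \frac{a}{2}\Vert x\Vert ^2 + \frac{1}{2a}\Vert y\Vert ^2 \qquad (x, y \in X,\ a>0). \end{aligned}$$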
We sometimes apply operations on \(x \in X\) to all elements of a set \(A \subset X\), writing \(\langle x+A,z\rangle :=\{\langle x+a,z\rangle \mid a \in A \}\), and for \(B \subset \mathbb {R}\), writing \(B \ge c\) if \(b \ge c\) for all \(b \in B.\)
2 Proposed methods
We now present our proposed methods for (1). They are based on taking a single gradient descent step for the inner problem, and using forward-backward splitting for the outer problem. The two methods differ in how an “adjoint equation” is handled. We present the algorithms and the assumptions required to prove their convergence in Sects. 2.2 and 2.3 after deriving optimality conditions and the adjoint equation in Sect. 2.1. We prove convergence in Sect. 3.
2.1 Optimality conditions
Suppose \(u\mapsto F(u;\alpha )\in C^2(U)\) is proper, coercive, and weakly lower semicontinuous for each outer variable \(\alpha \in {{\,\textrm{dom}\,}}R \subset \mathscr {A}\). Then the direct method of the calculus of variations guarantees the inner problem \(\min _u F(u; \alpha )\) to have a solution. If, further, \(u\mapsto F(u;\alpha )\) is strictly convex, the solution is unique so that the solution mapping \(S_u\) from (1) is uniquely determined.
Suppose further that \(F, \nabla F\) and \(S_u\) are Fréchet differentiable. Writing \(T(\alpha ) :=(S_u(\alpha ), \alpha )\), Fermat’s principle and \(S_u(\tilde{\alpha }) \in \mathop {\mathrm {arg\,min}}\limits _u F(u; \tilde{\alpha })\) then show that
for \(\alpha \) near \(\tilde{\alpha }\). Therefore, the chain rule for Fréchet differentiable functions yields
That is, \(p=\nabla _{\alpha }S_u(\alpha )\) solves for \(u=S_u(\alpha )\) the adjoint equation
We introduce the corresponding solution mapping for the adjoint variable p,
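Differentiating the inner optimality condition \(\nabla _u F(S_u(\alpha ); \alpha ) = 0\) and passing to Hilbert adjoints, (4) and (5) take the form
$$\begin{aligned} p\,\nabla _u^2 F(u; \alpha ) + \nabla _{\alpha u} F(u; \alpha ) = 0 \qquad \text {and}\qquad S_p(u, \alpha ) :=-\nabla _{\alpha u} F(u; \alpha )\,\bigl (\nabla _u^2 F(u; \alpha )\bigr )^{-1}, \end{aligned}$$
consistent with the proof of Lemma 3.1 below.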
We will later make assumptions that ensure that \(S_p\) is well-defined.
Since \(S_u: \mathscr {A}\rightarrow U\), the Fréchet derivative \(S_u'(\alpha ) \in \mathbb {L}(\mathscr {A}; U)\) and the Hilbert adjoint \(\nabla _\alpha S_u(\alpha ) \in \mathbb {L}(U; \mathscr {A})\) for all \(\alpha \). Consequently \(p \in \mathbb {L}(U; \mathscr {A})\), but we will need p to lie in an inner product space. Assuming \(\mathscr {A}\) to be a separable Hilbert space, we introduce such structure
by using a countable orthonormal basis \(\{\varphi _i\}_{i\in I}\) of \(\mathscr {A}\) to define the inner product
We briefly study this inner product and the induced norm \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) in Sect. 2.
By the sum rule for Clarke subdifferentials (denoted \(\partial _C\)) and their compatibility with convex subdifferentials and Fréchet differentiable functions [47], we obtain
The Fermat principle for Clarke subdifferentials then furnishes the necessary optimality condition
We combine (3), (4) and (7) as the inclusion
with
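Collecting the inner-problem condition (3), the adjoint equation (4), and the outer condition (7), H stacks the three conditions in the block form
$$\begin{aligned} H(u, p, \alpha ) :=\begin{pmatrix} \nabla _u F(u; \alpha ) \\ p\,\nabla _u^2 F(u; \alpha ) + \nabla _{\alpha u} F(u; \alpha ) \\ p\,\nabla _u J(u) + \partial R(\alpha ) \end{pmatrix}. \end{aligned}$$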
This is the optimality condition that our proposed methods, presented in Sects. 2.2 and 2.3, attempt to satisfy. We generally abbreviate
2.2 Algorithm: forward-exact-forward-backward
Our first strategy for solving (8) takes just a single gradient descent step for the inner problem, solves the adjoint equation exactly, and then takes a forward-backward step for the outer problem. We call this Algorithm 2.1 the FEFB (forward-exact-forward-backward).
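As a minimal illustration of the three steps, consider the following sketch on a one-dimensional quadratic toy problem. The problem data, step sizes, and initialization are our own illustrative choices (not from the experiments of Sect. 4), and the outer gradient is taken as \(p\nabla _u J(u)\), in line with the optimality condition above:

```python
# Illustrative 1-D toy instance:
#   inner objective  F(u; a) = 0.5*(u - z)**2 + 0.5*a*u**2,
#   outer fitness    J(u) = 0.5*(u - b)**2,  R = indicator of [0, inf).
# The exact solution map is S_u(a) = z/(1 + a), so with z = 4, b = 2
# the optimal parameter is a = 1 (then S_u(1) = 2 = b).
z, b = 4.0, 2.0
tau, sigma = 0.3, 0.05      # inner and outer step lengths
u, alpha = 0.0, 0.5         # initial iterates

for _ in range(5000):
    # (i) one gradient descent step for the inner problem
    u -= tau * ((1.0 + alpha) * u - z)
    # (ii) exact adjoint solve: p solves p*D^2_u F + D_{au} F = 0 (scalar here)
    p = -u / (1.0 + alpha)
    # (iii) forward-backward step for the outer problem; the prox of the
    #       indicator of [0, inf) is the projection max(0, .)
    alpha = max(0.0, alpha - sigma * p * (u - b))
```

Numerically the iterates approach \((u, \alpha ) = (2, 1)\), matching the exact solution. In the FIFB of Sect. 2.3, the exact solve in step (ii) is replaced by a cheap gradient-type update.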
Using H defined in (9), Algorithm 2.1 can be written implicitly as solving
for \(x^{k+1} = (u^{k+1}, p^{k+1}, \alpha ^{k+1})\), where, with \(x=(u, p, \alpha )\),
and the preconditioning operator \(M\in \mathbb {L}(U\times P \times \mathscr {A}; U\times P \times \mathscr {A})\) is
The “nonlinear preconditioning” applied to H to construct \(H_{k+1}\) shifts iterate indices such that a forward step is performed instead of a proximal step; compare [18, 19].
We next state the essential structural, initialization, and step length assumptions. We start with a contractivity condition needed for the proximal step with respect to R.
Assumption 2.1
Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) be convex, proper, and lower semicontinuous. We say that R is locally prox-\(\sigma \)- contractive at \(\widehat{\alpha }\in \mathscr {A}\) for \(q \in \mathscr {A}\) (within \(A \subset {{\,\textrm{dom}\,}}R\)) if there exist \( C_R > 0\) and a neighborhood \(A \subset {{\,\textrm{dom}\,}}R\) of \(\widehat{\alpha }\) such that, for all \(\alpha \in A\),
If \(\rho >0\) can be arbitrary with the same factor \(C_R\), we drop the word “locally”.
We verify Assumption 2.1 for some common cases in Sect. 1. When applying the assumption to \(\widehat{\alpha }\) satisfying (8), we will take \(q = -\widehat{p}\nabla _u J(\widehat{u}) \in \partial R(\widehat{\alpha })\). Then \(D_{\sigma R}(\widehat{\alpha })=0\) by standard properties of proximal mappings. The results for nonsmooth functions in Sect. 1 then forbid strict complementarity: for \(R=\beta \Vert \,\varvec{\cdot }\,\Vert _1 + \delta _{[0, \infty )^n}\) we need to have \(q = (\beta , \ldots , \beta )\), and for \(R=\delta _C\) for a convex set C, we need to have \(q=0\). Intuitively, this restriction forbids the finite identification property [48] of proximal-type methods: our techniques for the stability of the inner problem and the adjoint equation with respect to perturbations of \(\alpha \) do not allow \(\{\alpha ^k\}\) to converge too fast.
We now come to our main assumption for the FEFB. It collects conditions related to step lengths, initialization, and the problem functions F, J, and R. For a constant \(c>0\) to be determined by the assumption, we introduce the testing operator
The idea, introduced in [18] and further explained in [19], is to test the algorithm-defining inclusion (10) with the linear functional \(\langle Z\,\varvec{\cdot }\,,x^{k+1}-\widehat{x}\rangle \) to obtain a descent estimate with respect to the ZM-norm. The operator Z encodes component-specific scalings and convergence rates, although we do not exploit the latter in this manuscript.
Assumption 2.2
We assume that U is a Hilbert space, \(\mathscr {A}\) a separable Hilbert space, and treat the adjoint variable \(p\in \mathbb {L}(U; \mathscr {A})\) as an element of the inner product space P defined in (6a). Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) and \(J: U \rightarrow \mathbb {R}\) be convex, proper, and lower semicontinuous, and assume the same from \(F(\,\varvec{\cdot }\,, \alpha )\in C^2(U)\) for all \(\alpha \in {{\,\textrm{dom}\,}}R.\) Pick \((\widehat{u},\widehat{p},\widehat{\alpha }) \in H^{-1}(0)\) and let \(\{(u^m, p^m, \alpha ^m)\}_{m\in \mathbb {N}}\) be generated by Algorithm 2.1 for a given initial iterate \((u^{0}, p^{0}, \alpha ^{0}) \in U \times P \times {{\,\textrm{dom}\,}}R\). For a given \(r, r_u>0\) we suppose that
-
(i)
The relative initialization bound \(\Vert u^{1}-S_u(\alpha ^{0})\Vert \le C_u \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) holds for some \(C_u>0\).
-
(ii)
There exists in \(B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) a continuously Fréchet-differentiable and \(L_{S_u}\)-Lipschitz inner problem solution mapping \(S_u: \alpha \mapsto S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits F(\,\varvec{\cdot }\,; \alpha )\).
-
(iii)
\(F(\widehat{u};\,\varvec{\cdot }\,)\) is Lipschitz continuously differentiable with factor \(L_{\nabla F,\widehat{u}} > 0\), and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) for all \((u, \alpha ) \in B(\widehat{u}, r_u) \times ( B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) for some \(\gamma _F, L_F > 0.\) Moreover, \((u,\alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u,\alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz in \(B(\widehat{u}, r_u)\times (B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) with factors \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\), where we equip \(U \times \mathscr {A}\) with the norm \((u, \alpha ) \mapsto \Vert u\Vert _U + \Vert \alpha \Vert _{\mathscr {A}}\).
-
(iv)
The inner step length \(\tau \in (0, 2\kappa /L_F]\) for some \(\kappa \in (0, 1)\).
-
(v)
The outer fitness function J is Lipschitz continuously differentiable with factor \(L_{\nabla J}\), and \(\gamma _\alpha \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)\le L_\alpha \cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _\alpha ,L_\alpha >0\). Moreover, R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(C_R\ge 0\).
-
(vi)
The constants \(\varphi _u, C_u > 0\) satisfy
$$\begin{aligned} \gamma _F (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})C_u + \frac{L_{\nabla F,\widehat{u}}^2}{(1-\kappa )} \varphi _u < \gamma _F \gamma _\alpha , \end{aligned}$$where
$$\begin{aligned} \begin{aligned} N_{\nabla _{\alpha u} F}&:=\max _{\begin{array}{c} u \in B(\widehat{u}, r_u),\\ \alpha \in B(\widehat{\alpha }, 2r)\cap {{\,\textrm{dom}\,}}R \end{array}} \Vert \hspace{-1.0pt}|\nabla _{\alpha u} F(u,\alpha ) \Vert \hspace{-1.0pt}|, \\ L_{S_p}&:=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}, \\ N_{\nabla J}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \nabla _u J (S_u(\alpha ))\Vert , \\ N_{\nabla S_u}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ) \Vert \hspace{-1.0pt}|, \text { and} \\ N_p&:=N_{\nabla S_u} + C r \text { with } C=L_{S_p} C_u. \end{aligned} \end{aligned}$$ -
(vii)
The outer step length \(\sigma \) fulfills
$$\begin{aligned} 0 < \sigma \le \frac{(C_F-1)C_u}{(L_{S_u} +C_FC_u)C_{\alpha }} \end{aligned}$$where
$$\begin{aligned} {\left\{ \begin{array}{ll} C_F :=\sqrt{1+2\tau \gamma _F(1 - \kappa )}, \quad \text {and} \\ C_{\alpha } :=(N_pL_{\nabla J} + N_{\nabla J} L_{S_p}) C_u + L_\alpha + C_R. \end{array}\right. } \end{aligned}$$ -
(viii)
The initial iterates \(u^0\) and \(\alpha ^0\) are such that the distance-to-solution
$$\begin{aligned} r_0 :=\sqrt{\sigma \varphi _u\tau ^{-1}\Vert u^{0}-\widehat{u}\Vert ^2 + \Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2} = \sqrt{\sigma }\Vert x^{0}-\widehat{x}\Vert _{ZM} \end{aligned}$$satisfies
$$\begin{aligned} r_0 \le r \quad \text {and}\quad r_0 \max \{2L_{S_u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }(1+\tau L_{F}) + \tau L_{\nabla F,\widehat{u}}\} \le r_u. \end{aligned}$$
Remark 2.3
(Interpretation) Part (i) of Assumption 2.2 ensures that the initial inner problem iterate is good relative to the outer problem iterate. If \(u^1\) solves the inner problem for \(\alpha ^0\), (i) holds for any \(C_u>0\). Therefore, (i) can always be satisfied by solving the inner problem for \(\alpha ^0\) to high accuracy. This condition does not require \(\alpha ^0\) to be close to a solution \(\widehat{\alpha }\) of the entire problem.
Part (ii) ensures that the inner problem solution map exists and is well-behaved; we discuss it more in the next Remark 2.4.
Parts (iii) and (v) are second-order growth and boundedness conditions, standard in smooth optimization. The nonsmooth R is handled through the prox-\(\sigma \)-contractivity assumption. If \(S_u\) is twice Fréchet differentiable, the product and the chain rules establish
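Schematically, this reads
$$\begin{aligned} \nabla _{\alpha }(J\circ S_u)(\alpha ) = \nabla _{\alpha }S_u(\alpha )\,\nabla _u J(S_u(\alpha )) \end{aligned}$$
and
$$\begin{aligned} \nabla _{\alpha }^2(J\circ S_u)(\alpha ) = \nabla _{\alpha }S_u(\alpha )\,\nabla _u^2 J(S_u(\alpha ))\,\nabla _{\alpha }S_u(\alpha )^* + \bigl (\nabla ^2_{\alpha }S_u(\alpha )\bigr )^*\,\nabla _u J(S_u(\alpha )), \end{aligned}$$
where the last term pairs the second derivative of \(S_u\) with \(\nabla _u J(S_u(\alpha ))\).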
If \(R=0\), first-order optimality conditions establish \(\nabla _u J(S_u(\widehat{\alpha }))=0\). Therefore, if, further, J is strongly convex and \(S_u'(\widehat{\alpha })\) is invertible, \(\gamma \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)(\widehat{\alpha })\) for some \(\gamma >0\). Then additional continuity assumptions establish the positivity required in (v) in a neighbourhood of \(\widehat{\alpha }\). It is also possible to further develop the condition to not depend on the solution mapping at all.
Depending on R, (v) may restrict the outer step length parameter \(\sigma \). Part (iii) ensures that \(\nabla _u^2 F(u; \alpha )\) is invertible and \(S_p\) is well-defined. We will see in Lemma 3.3 that the radius \(r_u\) is sufficiently large that \(\alpha \in B(\widehat{\alpha }, r)\) implies \(S_u(\alpha ) \in B(\widehat{u}, r_u)\). Part (v) implies that \(\alpha \mapsto \nabla _{\alpha }(J\circ S_u)(\alpha )\) is Lipschitz in \(B(\widehat{\alpha }, r)\).
Part (iv) is a standard step length condition for the inner problem, while (vii) is a step length condition for the outer problem. The latter depends on several constants defined in the more technical part (vi). We can always satisfy the inequality in (vi) by good relative initialization (small \(C_u>0\)), as discussed above, and by taking the testing parameter \(\varphi _u\) small. According to the local initialization condition (viii), the latter can be done if the initial iterates are close to a solution \((\widehat{u}, \widehat{\alpha })\) of the entire problem, or if \(r_u>0\) can be taken arbitrarily large. If we can take both \(r>0\) and \(r_u>0\) arbitrarily large, we obtain global convergence.
Remark 2.4
(Existence and differentiability of the solution map) Suppose F is twice continuously differentiable in both variables, and that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) for all \(u\in B(\widehat{u}, r_u)\) and \(\alpha \in B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _F>0\). Then the implicit function theorem shows the existence of a unique continuously differentiable \(S_u\) in a neighborhood of any \(\alpha \in B(\widehat{\alpha },r) \cap {{\,\textrm{dom}\,}}R\). Such an \(S_u\) is also Lipschitz in a neighborhood of \(\alpha \); see, e.g., [19, Lemma 2.11]. If \(\mathscr {A}\) is finite-dimensional, a compactness argument gluing together the neighborhoods then proves Assumption 2.2 (ii).
2.3 Algorithm: forward-inexact-forward-backward
Our second strategy for solving (8) modifies the first approach to solve for the adjoint variable inexactly, so that no costly matrix inversions are required. Instead we perform an update reminiscent of a gradient step. This approach, which we call the FIFB (forward-inexact-forward-backward), reads as Algorithm 2.2 and has the implicit form
The implicit form can also be written as (10) with
and the preconditioning operator \(M\in \mathbb {L}(U\times P \times \mathscr {A}; U\times P \times \mathscr {A})\),
For the testing operator Z we use the structure
with the constants \(\varphi _u, \varphi _p>0\) determined in the following assumption. It is the FIFB counterpart of Assumption 2.2 for the FEFB, collecting essential structural, step length, and initialization assumptions.
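On a one-dimensional quadratic toy instance (our own illustration, with data chosen so that the optimal parameter is known), the three FIFB updates can be sketched as follows; the adjoint update in step (ii) is our reading of the gradient-type step on the adjoint equation \(p\,\nabla _u^2 F + \nabla _{\alpha u}F = 0\):

```python
# Illustrative 1-D toy instance:
#   inner F(u; a) = 0.5*(u - z)**2 + 0.5*a*u**2, outer J(u) = 0.5*(u - b)**2,
#   R = indicator of [0, inf).  Optimum: a = 1, S_u(1) = 2 = b, adjoint p = -1.
z, b = 4.0, 2.0
tau, theta, sigma = 0.3, 0.3, 0.05   # inner, adjoint, and outer step lengths
u, p, alpha = 0.0, 0.0, 0.5          # initial iterates

for _ in range(5000):
    # (i) one gradient descent step for the inner problem
    u -= tau * ((1.0 + alpha) * u - z)
    # (ii) gradient-type step on the adjoint equation p*D^2_u F + D_{au} F = 0;
    #      no inversion of D^2_u F is needed
    p -= theta * (p * (1.0 + alpha) + u)
    # (iii) forward-backward step for the outer problem (prox = projection)
    alpha = max(0.0, alpha - sigma * p * (u - b))
```

Here p merely tracks the exact adjoint solution \(-u/(1+\alpha )\), yet the iterates still approach the solution \((u, p, \alpha ) = (2, -1, 1)\).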
Assumption 2.5
We assume that U is a Hilbert space, \(\mathscr {A}\) a separable Hilbert space, and treat the adjoint variable \(p\in \mathbb {L}(U; \mathscr {A})\) as an element of the inner product space P defined in (6a). Let \(R: \mathscr {A}\rightarrow \overline{\mathbb {R}}\) and \(J: U \rightarrow \mathbb {R}\) be convex, proper, and lower semicontinuous, and assume the same from \(F(\,\varvec{\cdot }\,, \alpha )\) for all \(\alpha \in {{\,\textrm{dom}\,}}R\). Pick \((\widehat{u},\widehat{p},\widehat{\alpha }) \in H^{-1}(0)\) and let \(\{(u^m, p^m, \alpha ^m)\}_{m\in \mathbb {N}}\) be generated by Algorithm 2.2 for a given initial iterate \((u^{0}, p^{0}, \alpha ^{0}) \in U \times P \times {{\,\textrm{dom}\,}}R\). For given \(r, r_u>0\) we suppose that
-
(i)
The relative initialization bounds \(\Vert u^{1}-S_u(\alpha ^{0})\Vert \le C_u \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) and \(\Vert \hspace{-1.0pt}|p^{1}-\nabla _{\alpha }S_u(\alpha ^{0}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{0} - \widehat{\alpha }\Vert \) hold with some constants \(C_u>0\) and \(C_p>0.\)
-
(ii)
There exists in \(B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R\) a continuously Fréchet-differentiable and \(L_{S_u}\)-Lipschitz inner problem solution mapping \(S_u: \alpha \mapsto S_u(\alpha ) \in \mathop {\mathrm {arg\,min}}\limits F(\,\varvec{\cdot }\,; \alpha )\).
-
(iii)
\(F(\widehat{u};\,\varvec{\cdot }\,)\) is Lipschitz continuously differentiable with factor \(L_{\nabla F,\widehat{u}} > 0\), and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) for \(u\in B(\widehat{u}, r_u)\) and \(\alpha \in B(\widehat{\alpha }, 2r) \cap {{\,\textrm{dom}\,}}R.\) Moreover, \((u,\alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u,\alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz in \(B(\widehat{u}, r_u)\times (B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R)\) with factors \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\), where we equip \(U \times \mathscr {A}\) with the norm \((u, \alpha ) \mapsto \Vert u\Vert _U + \Vert \alpha \Vert _{\mathscr {A}}\).
-
(iv)
The inner step length \(\tau \in (0, 2\kappa /L_F]\) for some \(\kappa \in (0, 1)\) whereas the adjoint step length \(\theta \in (0, 1/L_F)\).
-
(v)
The outer fitness function J is Lipschitz continuously differentiable with factor \(L_{\nabla J},\) and \(\gamma _\alpha \cdot {{\,\textrm{Id}\,}}\le \nabla _{\alpha }^2(J\circ S_u)\le L_\alpha \cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(\gamma _\alpha , L_\alpha > 0\). Moreover, R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) for some \(C_R\ge 0\).
-
(vi)
The constants \(\varphi _u, \varphi _p, C_u > 0\) satisfy
$$\begin{aligned} \varphi _p \le \varphi _u \frac{\gamma _F^2(1-\kappa )}{2 L_F L_{S_p}} \end{aligned}$$and
$$\begin{aligned}{} & {} L_F L_{S_p} \varphi _p + \sqrt{(L_F L_{S_p} \varphi _p)^2 + \gamma _F^2 (L_{\nabla J}N_p + L_{S_p} N_{\nabla J})^2C_u^2} \\{} & {} \quad + \frac{L_{\nabla F,\widehat{u}}^2}{(1-\kappa )} \varphi _u < \gamma _F \gamma _\alpha , \end{aligned}$$where
$$\begin{aligned} \begin{aligned} N_{\nabla _{\alpha u} F}&:=\max _{\begin{array}{c} u \in B(\widehat{u}, r_u),\\ \alpha \in B(\widehat{\alpha }, 2r)\cap {{\,\textrm{dom}\,}}R \end{array}} \Vert \hspace{-1.0pt}|\nabla _{\alpha u} F(u,\alpha ) \Vert \hspace{-1.0pt}|, \\ L_{S_p}&:=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}, \\ N_{\nabla J}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \nabla _u J (S_u(\alpha ))\Vert , \\ N_{\nabla S_u}&:=\max _{\alpha \in B(\widehat{\alpha }, r)\cap {{\,\textrm{dom}\,}}R} \Vert \hspace{-1.0pt}|\nabla _{\alpha }S_u(\alpha ) \Vert \hspace{-1.0pt}|, \text { and} \\ N_p&:=N_{\nabla S_u} + C r \text { with } C=L_{S_p} C_u. \end{aligned} \end{aligned}$$ -
(vii)
The outer step length \(\sigma \) satisfies
$$\begin{aligned} 0 < \sigma \le \textstyle \frac{1}{C_\alpha }\min \left\{ \frac{(C_F-1)C_u}{L_{S_u} +C_FC_u}, \frac{(C_{F,S}-1)C_p- (1+C_{F,S})L_{S_p} C_u}{(1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u} \right\} \end{aligned}$$with
$$\begin{aligned} \begin{aligned} C_F&:=\sqrt{1+2\tau \gamma _F(1 - \kappa )}, \qquad C_{F,S} :=\sqrt{(1+\theta \gamma _F)/(1-\theta \gamma _F)} \quad \text {and} \\ C_{\alpha }&:=N_pL_{\nabla J}C_u + N_{\nabla J}\max \{C_p, L_{S_p} C_u\} + L_\alpha + C_R. \end{aligned} \end{aligned}$$ -
(viii)
The initial iterate \((u^0, p^0, \alpha ^0)\) is such that the distance-to-solution
$$\begin{aligned} r_0 :=\sqrt{\frac{\sigma \varphi _u}{\tau }\Vert u^{0}-\widehat{u}\Vert ^2 + \frac{\sigma \varphi _p}{\theta }\Vert \hspace{-1.0pt}|p^{0}-\widehat{p} \Vert \hspace{-1.0pt}|^2 + \Vert \alpha ^{0} - \widehat{\alpha }\Vert ^2} = \sqrt{\sigma }\Vert x^{0}-\widehat{x}\Vert _{ZM} \end{aligned}$$satisfies
$$\begin{aligned} r_0 \le r \ \text {and}\ r_0 \max \{2(C_u + L_{S_u}), \sqrt{\tfrac{\tau }{\sigma \varphi _u}}(1+\tau L_{F}) + \tau L_{\nabla F,\widehat{u}}\} \le r_u. \end{aligned}$$
Remark 2.6
(Interpretation) The interpretation of Assumption 2.2 in Remark 2.3 also applies to Assumption 2.5. We stress that to satisfy the inequality in (vi), it suffices to ensure small \(C_u>0\) by good relative initialization of u and p with respect to \(\alpha \), and to choose the testing parameters \(\varphi _u, \varphi _p>0\) small enough. According to (viii), the latter can be done by initializing close to a solution, or if the radius \(r_u>0\) is large.
3 Convergence analysis
We now prove the convergence of the FEFB (Algorithm 2.1) and the FIFB (Algorithm 2.2) in the respective Sects. 3.2 and 3.3. Before this we start with common results. Our proofs are self-contained, but follow the “testing” approach of [18] (see also [19]). The main idea is to prove a monotonicity-type estimate for the operator \(H_{k+1}\) occurring in the implicit forms (10) and (14) of the algorithms, and then to use the three-point identity (2) with respect to ZM-norms and inner products. This yields an estimate from which convergence rates can be read off. The main results for the FEFB and the FIFB are in the respective Theorems 3.16 and 3.21.
Throughout, we assume that either Assumption 2.2 (FEFB) or 2.5 (FIFB) holds, and tacitly use the constants from the relevant one. We also tacitly take it that \(\alpha ^k \in {{\,\textrm{dom}\,}}R\) for all \(k \in \mathbb {N}\), as this is guaranteed by the assumptions for \(k=0\), and by the proximal step in the algorithms for \(k \ge 1\).
3.1 General results
Our main goal here is to bound the error in the inner and adjoint iterates \(u^k\) and \(p^k\) in terms of the outer iterates \(\alpha ^k\). We also derive bounds on the outer steps, and local monotonicity estimates. We first show that the solution mapping for the adjoint equation (4) is Lipschitz.
Lemma 3.1
Suppose \((u, \alpha ) \mapsto \nabla _{u}^2 F (u; \alpha )\) and \((u, \alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha ) \in P\) are Lipschitz continuous with the respective constants \(L_{\nabla ^2 F}\) and \(L_{\nabla _{\alpha u} F}\) in some bounded closed set \(V_u \times V_{\alpha }.\) Also assume that \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) and \(\Vert \hspace{-1.0pt}|\nabla _{\alpha u} F \Vert \hspace{-1.0pt}| \le N_{\nabla _{\alpha u} F}\) in \(V_u \times V_{\alpha }\) for some \(\gamma _F, N_{\nabla _{\alpha u} F}>0\). Then \(S_p\) is Lipschitz continuous in \(V_u \times V_{\alpha },\) i.e.
for \(u_1,u_2\in V_u\) and \(\alpha _1,\alpha _2 \in V_{\alpha }\) with factor \(L_{S_p} :=\gamma _F^{-2} L_{\nabla ^2 F}N_{\nabla _{\alpha u} F} + \gamma _F^{-1} L_{\nabla _{\alpha u} F}.\)
Proof
Using the definition of \(S_p\) in (5), we rearrange
Thus the triangle inequality and the operator norm inequality Theorem 6.1 (ii) give
The assumption \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\) implies \(\Vert (\nabla _u^2 F(u; \alpha ))^{-1}\Vert \le \gamma _F^{-1}.\) Therefore, also using the Lipschitz continuity of \((u, \alpha ) \mapsto \nabla _{\alpha u} F (u; \alpha )\) in \(V_u\times V_{\alpha },\) we get
Towards estimating the second term on the right hand side of (16), we observe that
for any invertible linear operators A, B. Then we use \(\Vert \hspace{-1.0pt}|\nabla _{\alpha u} F \Vert \hspace{-1.0pt}| \le N_{\nabla _{\alpha u} F}\) and the Lipschitz continuity of \(\nabla _{u}^2 F (u; \alpha )\) to obtain
Inserting this inequality and (17) into (16) establishes the claim. \(\square \)
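The displayed definition (5) of \(S_p\) is elided above; the sketch below assumes it takes the standard implicit-function form \(S_p(u, \alpha ) = -\nabla _{\alpha u} F(u; \alpha )\,(\nabla _u^2 F(u; \alpha ))^{-1}\), which is consistent with the constant \(L_{S_p}\) of Lemma 3.1. On a toy quadratic inner problem (the matrices A and B below are illustrative assumptions), the adjoint solution can then be checked numerically against the Jacobian of the solution map \(S_u\):

```python
import numpy as np

# Toy quadratic inner problem (A, B are illustrative assumptions):
# F(u; a) = 0.5*u^T A u - a^T B u, so grad_u F(u; a) = A u - B^T a,
# S_u(a) = A^{-1} B^T a, and the adjoint equation
# p grad_u^2 F + grad_{alpha u} F = 0 with grad_{alpha u} F = -B gives
# p = B A^{-1}, which should equal the Jacobian of S_u.
rng = np.random.default_rng(0)
n, m = 4, 3                              # dim(U), dim(A)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # symmetric positive definite Hessian
B = rng.standard_normal((m, n))

def S_u(a):
    """Inner solution map: solves grad_u F(u; a) = 0."""
    return np.linalg.solve(A, B.T @ a)

p = B @ np.linalg.inv(A)                 # adjoint solution S_p

# Finite-difference Jacobian of S_u; row i approximates d S_u / d a_i.
eps = 1e-6
J_fd = np.array([(S_u(eps * np.eye(m)[i]) - S_u(np.zeros(m))) / eps
                 for i in range(m)])
print(np.abs(J_fd - p).max())            # ~0: adjoint equals the Jacobian
```

Since the toy problem is linear-quadratic, the finite-difference Jacobian matches the adjoint solution up to floating-point error.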
We now prove two simple step length bounds.
Lemma 3.2
Let Assumption 2.2 or 2.5 hold. Then \(\sigma < 1/L_\alpha \) and \(1< C_F < \sqrt{1+\gamma _F/L_F}\).
Proof
We have \(C_F>1\) since \(\kappa <1\) forces \(2\tau \gamma _F(1 - \kappa )>0.\) Assumption 2.2 (iv) or 2.5 (iv) implies \(2\tau \gamma _F(1 - \kappa )<4\gamma _F(\kappa - \kappa ^2)/L_F \le \gamma _F/L_F.\) Therefore \( C_F < \sqrt{1+ \gamma _F/L_F}.\) For \(C_F,C_u, L_{S_u}>0\) we have \(C_FC_u-C_u< L_{S_u} +C_FC_u.\)
Hence Assumption 2.2 (vii) or 2.5 (vii) gives
\(\square \)
The next lemma explains the latter inequality for \(r_0\) in Assumption 2.2 (viii) and 2.5 (viii). For \(u^n\) and \(\alpha ^n\) close enough to the respective solutions, it shows that the next iterate \(u^{n+1}\) and the true inner problem solution \(S_u(\alpha ^n)\) for \(\alpha ^n\) lie in the \(r_u\)-neighborhood of \(\widehat{u}\).
Lemma 3.3
Suppose Assumption 2.2 or 2.5 hold and \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\), as well as \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). Then \(u^{n+1}\in B(\widehat{u}, r_u)\) and \(S_u(\alpha ^n)\in B(\widehat{u}, r_u).\)
Proof
The inner gradient step of Algorithm 2.1 or 2.2 with \(u^{n}\in B(\widehat{u}, \sqrt{\frac{\tau }{\sigma \varphi _u}}r_0)\) gives
Using \(\nabla _u F(\widehat{u}; \widehat{\alpha })=0\), \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\), and the Lipschitz continuity of \(F(\widehat{u};\,\varvec{\cdot }\,)\) and \(F(\,\varvec{\cdot }\,;\alpha ^n)\) from Assumption 2.2 (iii) or 2.5 (iii) we continue to estimate, as required
Next, the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha }, 2r)\) from Assumption 2.2 (ii) or 2.5 (ii) with \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(r_0\le r\) from Assumption 2.2 (viii) or 2.5 (viii) imply
We now introduce a working condition that we later prove. It guarantees that the Lipschitz and Hessian properties of Assumption 2.2 (ii), (iii) and (v) or Assumption 2.5 (ii), (iii) and (v) hold at iterates.
Assumption 3.4
(Iterate locality) Let \(r_0\le r\) and \(N_p\) be as defined in either Assumption 2.2 or 2.5. Then this assumption holds for a given \(n \in \mathbb {N}\) if
Indeed, the next two lemmas show that if Assumption 3.4 holds for \(n=k\) along with some further conditions, then it holds for \(n=k+1\).
Lemma 3.5
Suppose either Assumption 2.2 or 2.5 holds. Let \(n \in \mathbb {N}\) and suppose
with \(\alpha ^n \in B(\widehat{\alpha }, r)\). Then \(\Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}| \le N_p\).
Proof
We estimate using (18) and the definitions of the relevant constants in Assumption 2.2 or 2.5 that
Lemma 3.6
Let \(k \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds; Assumption 3.4 holds for \(n=k\); and that (18) holds for \(n=k+1\). If also \(\Vert x^{n+1} - \widehat{x}\Vert _{ZM} \le \Vert x^n - \widehat{x}\Vert _{ZM}\) for \(n\in \{0,\ldots ,k\}\), then Assumption 3.4 holds for \(n=k+1\).
Proof
Summing \(\Vert x^{n+1} - \widehat{x}\Vert _{ZM} \le \Vert x^n - \widehat{x}\Vert _{ZM}\) over \(n=0,\ldots ,k\) gives \(\Vert x^{k+1} - \widehat{x}\Vert _{ZM} \le \Vert x^{0} - \widehat{x}\Vert _{ZM} = \sigma ^{-1/2}r_0.\) By the definitions of Z and M in (12) or (15), and (11b) or (14b) respectively, it follows \(\alpha ^{k+1} \in B(\widehat{\alpha }, r_0)\) and \(u^{k+1}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\) as required. We finish by using Lemma 3.5 with \(n=k+1\) to establish \(\Vert \hspace{-1.0pt}|p^{k+2} \Vert \hspace{-1.0pt}| \le N_p\). \(\square \)
We next prove a monotonicity-type estimate for the inner objective. For this we need the following three-point monotonicity inequality.
Theorem 3.7
Let \(z,\widehat{x}\in X.\) Suppose \(F\in C^2(X),\) and for some \(L>0\) and \(\gamma \ge 0\) that \(\gamma \cdot {{\,\textrm{Id}\,}}\le \nabla ^2 F(\zeta ) \le L \cdot {{\,\textrm{Id}\,}}\) for all \(\zeta \in [\widehat{x}, z] :=\{\widehat{x} + s(z - \widehat{x}) \mid s\in [0,1] \}\). Then, for all \(\beta \in (0, 1]\) and \(x \in X\),
Proof
The proof follows that of [19, Lemma 15.1], whose statement unnecessarily takes \(\zeta \) in a neighborhood of \(\widehat{x}\) instead of just the interval \([\widehat{x}, z]\). \(\square \)
Lemma 3.8
Let \(n \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5, and 3.4 hold. Then for any \(\kappa \in (0,1)\), we have
Proof
Assumption 2.2 (iii) or 2.5 (iii) with \(\alpha ^n\in B(\widehat{\alpha }, r)\) and \(u^{n}\in B(\widehat{u},r_u)\) from Assumption 3.4 give \(\gamma _F \cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha ^n) \le L_F\cdot {{\,\textrm{Id}\,}}\) for all \(u \in [\widehat{u}, u^n]\). We have \(\nabla _{u}F(\widehat{u}; \widehat{\alpha })=0\) since \(0\in H(\widehat{u}, \widehat{p}, \widehat{\alpha })\). Therefore Theorem 3.7 yields
Young’s inequality and the definition of \(L_{\nabla F, \widehat{u}}\) in Assumption 2.2 (iii) or 2.5 (iii) now readily establish the claim. \(\square \)
The next lemma bounds the steps taken for the outer problem variable.
Lemma 3.9
Let \(n \in \mathbb {N}\). Suppose either Assumption 2.2 or 2.5 holds, as do Assumption 3.4, (18), and
Then
and
Proof
Using the \(\alpha \)-update of Algorithm 2.1 or 2.2, we estimate
Since proximal maps are 1-Lipschitz, and R is by Assumption 2.2 (v) or 2.5 (v) locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) for \(\widehat{p}\nabla _u J(\widehat{u})\) within \(B(\widehat{\alpha }, r) \cap {{\,\textrm{dom}\,}}R\) with factor \(C_R\), it follows
We have \(\widehat{p}\nabla _u J(\widehat{u})=\nabla _\alpha S_u(\widehat{\alpha }) \nabla _u J(S_u(\widehat{\alpha }))=\nabla _\alpha (J \circ S_u)(\widehat{\alpha })\), where \(\nabla _{\alpha }(J\circ S_u)\) is \(L_\alpha \)-Lipschitz in \(B(\widehat{\alpha }, r)\ni \alpha ^n\) by Assumption 2.2 (v) or 2.5 (v). Hence
Using the Lipschitz continuity of \(\nabla _u J\) from Assumption 2.2 (v) or 2.5 (v), we continue
We have \(\Vert \hspace{-1.0pt}|p^{n+1} \Vert \hspace{-1.0pt}|\le N_{p}\) and \(\alpha ^n\in B(\widehat{\alpha }, r)\) by Assumption 3.4. Hence \(\Vert \nabla _u J(S_u(\alpha ^n))\Vert \le N_{\nabla J}\) by the definition in Assumption 2.2 (vi) or 2.5 (vi). Using (18) and (21), we therefore get
Inserting this into (24), we obtain (22).
Assumption 2.2 (vii) or Assumption 2.5 (vii) and (22) then yield
Rearranging terms and finishing with the triangle inequality we get (23). \(\square \)
Remark 3.10
(Gradient steps with respect to R) We could (in both FEFB and FIFB) also take a gradient step instead of a proximal step with respect to R with \(L_{\nabla R}\)-Lipschitz gradient. That is, we would perform for the outer problem the update
This can be shown to be convergent by changing (24) to
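As a minimal illustration of Remark 3.10, the sketch below compares the backward (proximal) and forward (gradient) updates on a hypothetical smooth regulariser \(R(\alpha ) = \tfrac{\lambda }{2}\Vert \alpha \Vert ^2\), with a fixed vector g standing in for the term \(p^{n+1}\nabla _u J(u^{n+1})\); all constants are illustrative assumptions. Both updates share the fixed point \(g + \nabla R(\alpha ^*) = 0\):

```python
import numpy as np

# Toy outer regulariser R(a) = 0.5 * lam * ||a||^2 (smooth, so both a
# proximal step and a plain gradient step on R are available).
lam, sigma = 2.0, 0.1

def prox_step(a, g):
    """Backward step: prox_{sigma R}(a - sigma * g)."""
    return (a - sigma * g) / (1.0 + sigma * lam)

def grad_step(a, g):
    """Forward step on R: a - sigma * (g + grad R(a))."""
    return a - sigma * (g + lam * a)

# Both updates have the same fixed point: g + grad R(a*) = 0.
a_prox = a_grad = np.array([1.0, -2.0])
g = np.array([0.4, 0.6])             # stands in for p^{n+1} grad_u J(u^{n+1})
for _ in range(200):
    a_prox = prox_step(a_prox, g)
    a_grad = grad_step(a_grad, g)
a_star = -g / lam
print(np.abs(a_prox - a_star).max(), np.abs(a_grad - a_star).max())
```

For this quadratic R both variants contract linearly to the same point; the proximal step is stable for any \(\sigma > 0\), while the gradient step requires \(\sigma < 2/\lambda \).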
We next prove that if an inner problem iterate has small error, and we take a short step in the outer problem, then also the next inner problem iterate has small error.
Lemma 3.11
Let \(k \in \mathbb {N}\). Suppose Assumption 2.2 or 2.5 hold. If Assumption 3.4, (18), and (21) hold for \(n=k,\) then (21) holds for \(n=k+1\) and we have \(\alpha ^{k+1}\in B(\widehat{\alpha }, 2r_0)\).
Proof
We plan to use Theorem 3.7 on \(F(\,\varvec{\cdot }\,; \alpha ^{k+1})\) followed by the three-point identity and simple manipulations. We begin by proving the conditions of the theorem.
First, we show that both \(u^{k+1}\in B(\widehat{u},r_u)\) and \(S_u(\alpha ^{k+1})\in B(\widehat{u},r_u)\). The former is immediate from Assumption 3.4 and Lemma 3.3. For the latter we use (23) of Lemma 3.9. Its first inequality readily implies either \(\Vert \alpha ^{k} - \widehat{\alpha }\Vert > \Vert \alpha ^{k+1} - \alpha ^{k}\Vert \) or \(\alpha ^{k+1} = \widehat{\alpha }\). In the latter case \(S_u(\alpha ^{k+1})=\widehat{u} \in B(\widehat{u},r_u)\). In the former, using \(\alpha ^k\in B(\widehat{\alpha },r_0)\), we get
Therefore we can use the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha },2r)\) from Assumption 2.2 (ii) or 2.5 (ii) to estimate
This implies \(S_u(\alpha ^{k+1})\in B(\widehat{u},r_u)\) by Assumption 2.2 (viii) or 2.5 (viii).
Since both \(u^{k+1}, S_u(\alpha ^{k+1}) \in B(\widehat{u},r_u)\), Assumption 2.2 (iii) or 2.5 (iii)
shows that \(\gamma _F \cdot {{\,\textrm{Id}\,}}\le \nabla ^2 F(u) \le L_F\cdot {{\,\textrm{Id}\,}}\) for \(u \in [S_u(\alpha ^{k+1}), u^{k+1}]\). Consequently Theorem 3.7 and \(\nabla _{u}F(S_u(\alpha ^{k+1}); \alpha ^{k+1}) = 0\) give
Inserting the u update of Algorithm 2.1 or 2.2, i.e., \( -\tau ^{-1}(u^{k+2}-u^{k+1}) = \nabla _{u}F(u^{k+1}; \alpha ^{k+1}) \) and using the three-point identity (2) we get
Equivalently
Because Assumption 2.2 (iv) or 2.5 (iv) guarantees \(1-\tau L_F/(2\kappa ) > 0,\) this implies
Therefore the triangle inequality, (21) for \(n=k\) and the Lipschitz continuity of \(S_u\) in \(B(\widehat{\alpha },2r)\ni \alpha ^{k}, \alpha ^{k+1}\) yield
Inserting (23) here, we establish the claim. \(\square \)
The next lemma is a crucial monotonicity-type estimate for the outer problem. It depends on an \(\alpha \)-relative exactness condition on the inner and adjoint variables.
Lemma 3.12
Let \(n \in \mathbb {N}\). Suppose Assumption 2.2(v) and (vi), or 2.5 (v) and (vi) hold with Assumption 3.4 and
Then, for any \(d > 0\),
Proof
The \(\alpha \)-update of both Algorithms 2.1 and 2.2 in implicit form reads
Similarly, \(0 \in H(\widehat{u}, \widehat{p}, \widehat{\alpha })\) implies \( \widehat{p}\,\nabla _{u}J(\widehat{u}) + \widehat{q} =0 \) for some \( \widehat{q} \in \partial R(\widehat{\alpha }). \) Writing \(E_0\) for the left hand side of (26), these expressions and the monotonicity of \(\partial R\) yield
We estimate \(E_1\) and \(E_2\) separately.
The one-dimensional mean value theorem gives
for some \(\zeta \in [\widehat{\alpha }, \alpha ^n]\) and \(Q :=\nabla ^2_\alpha (J\circ S_u)(\zeta )\).
Since \(\Vert \alpha ^n - \widehat{\alpha }\Vert \le r\) by Assumption 3.4, also \(\Vert \zeta - \widehat{\alpha }\Vert \le r\).
Therefore, the 3-point identity (2) and Assumption 2.2 (v) or 2.5 (v) yield
To estimate \(E_1\) we rearrange
We have \(\Vert \nabla _u J(S_u(\alpha ^n))\Vert \le N_{\nabla J}\) by the definition of the latter in Assumption 2.2 (vi) or 2.5 (vi) with \(\alpha ^n\in B(\widehat{\alpha }, r)\) from Assumption 3.4. The same assumptions establish that \(\nabla _u J\) is Lipschitz. Hence, using the operator norm inequality Theorem 6.1 (iii),
Applying (25) and Young’s inequality now yields for any \(d>0\) the estimate
By inserting (28) and (29) into (27) we obtain the claim (26). \(\square \)
3.2 Convergence: forward-exact-forward-backward
We now prove the convergence of Algorithm 2.1. We start with a lemma that shows an \(\alpha \)-relative exactness estimate on the adjoint iterate when one holds for the inner iterate. This is needed to use Lemma 3.12. The main result of this subsection is in the final Theorem 3.16. It proves under Assumption 2.2 the linear convergence of \(\{(u^n, \alpha ^n)\}_{n \in \mathbb {N}}\) generated by Algorithm 2.1 to \((\widehat{u}, \widehat{\alpha })\) solving the first-order optimality condition (8) for some \(\widehat{p}\).
Lemma 3.13
Let \(n \in \mathbb {N}\). Suppose Assumption 2.2 and the inner exactness estimate (21) hold as well as \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\) and \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). Then (18) and (25) hold for \(C=L_{S_p}C_u.\)
Proof
Since (18) with (21) equals (25), it suffices to prove (18). We assumed \(\alpha ^n\in B(\widehat{\alpha }, r_0)\), and \(u^{n+1},S_u(\alpha ^n)\in B(\widehat{u}, r_u)\) holds by Lemma 3.3. Therefore the Lipschitz continuity of \(S_p\) in \(B(\widehat{u}, r_u)\times B(\widehat{\alpha }, r)\) from Lemma 3.1 with Assumption 2.2 (ii) and (iii) and (21) give
We are able to collect the previous lemmas into a descent estimate from which we immediately observe local linear convergence. We recall the definitions of the preconditioning and testing operators M and Z in (11b) and (12).
Lemma 3.14
Let \(n \in \mathbb {N}\) and suppose Assumption 2.2 and 3.4, and the inner exactness estimate (21) hold. Then
for \(\varphi _u>0\) as in Assumption 2.2 (vi),
Proof
We start by proving the monotonicity estimate
for \(\mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) :=\varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + \varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\). We observe that \(\varepsilon _u,\varepsilon _{\alpha }>0\) by Assumption 2.2. The monotonicity estimate (31) expands as
for (all elements of the set)
We estimate each of the three lines of \(h_{n+1}\) separately. For the first line, we use (20) from Lemma 3.8. For the middle line we observe that \(p^{n+1}\nabla _{u}^2 F(u^{n+1};\alpha ^{n}) + \nabla _{\alpha u}F( u^{n+1};\alpha ^{n})=0\) by the p-update of Algorithm 2.1.
For the last line, we use (26) from Lemma 3.12 with \(d=2\). We can do this because (25) holds by (21) and Lemma 3.13. This gives
Summing with (20) we thus obtain
The factor of the first term is \(\varepsilon _u\) and the factor of the last term is zero. Since \(\sigma <1/L_\alpha \) by Lemma 3.2 and \(L_F/(2\kappa ) \le 1/\tau \) by Assumption 2.2 (iv), we obtain (32), i.e., (31).
We now come to the fundamental argument of the testing approach of [18], combining operator-relative monotonicity estimates with the three-point identity. Indeed, (31) combined with the implicit algorithm (10) gives
Inserting the three-point identity (2) and expanding \(\mathscr {V}_{n+1}\) yields (30). \(\square \)
Before stating our main convergence result for the FEFB, we simplify the assumptions of the previous lemma to just Assumption 2.2.
Lemma 3.15
Suppose Assumption 2.2 holds. Then (30) holds for any \(n\in \mathbb {N}\).
Proof
The claim readily follows if we prove by induction for all \(n \in \mathbb {N}\) that
We first prove (*) for \(n=0\). Assumption 2.2 (i) directly establishes (21). The definition of \(r_0\) in Assumption 2.2 also establishes that \(\alpha ^{n}\in B(\widehat{\alpha }, r_0)\) and \(u^{n}\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0)\). We have just proved the conditions of Lemma 3.13, which establishes (18) for \(n=0\).
Now Lemma 3.5 establishes \(\Vert \hspace{-1.0pt}|p^1 \Vert \hspace{-1.0pt}| \le N_p\). Therefore Assumption 3.4 holds for \(n=0\). Finally Lemma 3.14 proves (30) for \(n=0\). This concludes the proof of the induction base.
We then make the induction assumption that (*) holds for \(n\in \{0,\ldots ,k\}\) and prove it for \(n=k+1\). Indeed, the induction assumption and Lemma 3.11 give (21) for \(n=k+1\). Next (30) for \(n=k\) implies \(\alpha ^{k+1}\in B( \widehat{\alpha },r_0)\) and \(u^{k+1} \in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau } r_0)\), where \(r_0\) and \(r_u\) are as in Assumption 2.2. Therefore Lemma 3.3 gives \(u^{k+2}\in B(\widehat{u}, r_u)\) while Lemma 3.13 establishes (18) for \(n=k+1\). For all \(n\in \{0,\ldots ,k\}\), the inequality (30) implies \(\Vert x^{n+1}-\widehat{x}\Vert _{ZM} \le \Vert x^n-\widehat{x}\Vert _{ZM}\). Therefore Lemma 3.6 proves Assumption 3.4 and finally Lemma 3.14 proves (30) and consequently (*) for \(n=k+1\). \(\square \)
Theorem 3.16
Suppose Assumption 2.2 holds. Then \(\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \rightarrow 0\) linearly.
Proof
Lemma 3.15, expansion of (30), and basic manipulation show that
for \(\mu := \min \{1+2\varepsilon _u\varphi ^{-1}_u \tau , 1+2\varepsilon _{\alpha }\sigma \}\). Since \(\mu >1\), linear convergence follows. \(\square \)
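To make the FEFB structure concrete, the following sketch runs the alternating scheme on a hypothetical one-dimensional bilevel problem; the quadratics, step lengths and all constants below are illustrative assumptions, not taken from the paper. Each iteration takes one inner gradient step, solves the adjoint equation exactly, and takes one outer step, and the joint error decays linearly as in Theorem 3.16:

```python
import numpy as np

# Hypothetical 1-D instance (all constants are illustrative):
# inner objective F(u; a) = 0.5*(u-a)^2 + 0.5*gamma*u^2, so
# S_u(a) = a/(1+gamma); outer objective J(u) = 0.5*(u-b)^2 with R = 0,
# so the bilevel minimiser is a_hat = (1+gamma)*b with u_hat = b.
gamma, b = 0.5, 1.0
tau, sigma = 0.5, 0.5                    # hand-picked step lengths
u, a = 0.0, 0.0
a_hat, u_hat = (1.0 + gamma) * b, b

errs = []
for n in range(60):
    u = u - tau * ((u - a) + gamma * u)  # one inner gradient step
    p = 1.0 / (1.0 + gamma)              # exact adjoint: p*(1+gamma) = 1
    a = a - sigma * p * (u - b)          # one outer step (R = 0, no prox)
    errs.append(np.hypot(u - u_hat, a - a_hat))

print(errs[-1], errs[-1] / errs[-2])     # small error, ratio < 1
```

The consecutive-error ratio settles to a constant below one, which is exactly the linear-rate behaviour the theorem asserts for the ZM-norm error.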
3.3 Convergence: forward-inexact-forward-backward
We now prove the convergence of Algorithm 2.2. The overall structure and idea of the proofs follow Sect. 3.2 and use several lemmas from Sect. 3.1. We first prove a monotonicity estimate for the adjoint step, and then show that a small enough step length in the outer problem guarantees that the inner and adjoint iterates stay in a small local neighborhood if they are already in one. The main result of this subsection is in the final Theorem 3.21. It proves under Assumption 2.5 the linear convergence of \(\{(u^n, p^n, \alpha ^n)\}_{n \in \mathbb {N}}\) generated by Algorithm 2.2 to \((\widehat{u}, \widehat{p}, \widehat{\alpha })\) solving the first-order optimality condition (8).
Lemma 3.17
Let \(u\in U, \alpha \in \mathscr {A}\) and \(p_1, p_2, \tilde{p} \in P.\) Moreover, \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) and
holds. Then
Proof
[4] Using (33), the three-point identity (2) and \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F(u; \alpha )\le L_F\cdot {{\,\textrm{Id}\,}}\) gives for \(A :=\nabla _{u}^2 F(u; \alpha )\) the lower bound
\(\square \)
Lemma 3.18
Let \(k \in \mathbb {N}\). Suppose Assumption 2.5 holds, and Assumption 3.4 and
hold for \(n=k\). Then (34) holds for \(n = k+1.\)
Proof
Observe that (34) for \(n=k\) implies (21) as well as (18) for \(n=k\) and \(C=C_p\). Lemma 3.11 therefore proves the first part of (34) for \(n=k+1\), i.e.,
We still need to show the second part \(\Vert \hspace{-1.0pt}|p^{k+2}-\nabla _{\alpha }S_u(\alpha ^{k+1}) \Vert \hspace{-1.0pt}| \le C_p \Vert \alpha ^{k+1} - \widehat{\alpha }\Vert \). We follow the fundamental argument of the testing approach (see the end of the proof of Lemma 3.14) and use Assumption 2.5 (ii) and (iii). For the latter we need \(\alpha ^k, \alpha ^{k+1}\in B(\widehat{\alpha }, 2r)\) and \(u^{k+2}, S_u(\alpha ^{k})\in B(\widehat{u}, r_u).\) We have \(\alpha ^{k}\in B(\widehat{\alpha },r_0)\) by Assumption 3.4 and \(\alpha ^{k+1}\in B(\widehat{\alpha },2r_0)\) by Lemma 3.11. Thus we may use the Lipschitz continuity of \(S_u\) with the triangle inequality and (35) to get \(S_u(\alpha ^k)\in B(S_u(\widehat{\alpha }), L_{S_u}r_0)\subset B(\widehat{u}, r_u)\) and
which yields \(u^{k+2}\in B(\widehat{u}, r_u).\) The definition of \(S_p\) in (5) implies
Since also \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F\le L_F\cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{u}, r_u) \times B(\widehat{\alpha }, 2r)\) from Assumption 2.5 (iii), we get
from Lemma 3.17. By the p update of the FIFB in the implicit form (13), we have
Combining with (36) gives
An application of the three-point identity (2) with \(\theta L_F \le 1\) from Assumption 2.5 (iv) now yields for \(C_{F,S} = \sqrt{(1+\theta \gamma _F)/(1-\theta \gamma _F)}\) the estimate
This estimate and the triangle inequality give
The solution map \(S_u\) is Lipschitz in \(B(\widehat{\alpha }, 2r)\) and \(S_p\) is Lipschitz in \(B(\widehat{u}, r_u)\times B(\widehat{\alpha }, 2r)\) due to Assumption 2.5 (ii) and (iii), and Lemma 3.1. Combined with the triangle inequality, (34) for \(n=k\) and (35), we obtain
for
Using again the Lipschitz continuity of \(S_p\) and (35), we get
Inserting (38) and (39) into (37) yields
Therefore the claim follows if we show that
Lemma 3.9 proves (22) with \(C=C_p\). Together with Assumption 2.5 (vii) it yields
Multiplying by \((1+L_{S_u})L_{S_p}+C_{F,S}C_p- (1+C_{F,S})L_{S_p}C_u,\) rearranging terms, and continuing with the triangle inequality gives (40). Indeed,
\(\square \)
We now show that the adjoint iterates stay local if the outer iterates do.
Again, by combining the previous lemmas, we prove an estimate from which local convergence is immediate. For this, we recall the definitions of the preconditioning and testing operators M and Z in (14b) and (12).
Lemma 3.19
Suppose Assumption 2.5 and 3.4, and the inner and adjoint exactness estimate (34) hold for \(n\in \mathbb {N}.\) Then
for
where \(\varphi _u, \varphi _p>0\) are as in Assumption 2.5, \(C_{\alpha ,1} :=\varphi _p\tfrac{L_FL_{S_p}}{\gamma _F}\), and \(C_{\alpha ,2} :=\bigl (L_{\nabla J}N_p + L_{S_p} N_{\nabla J}\bigr ) \tfrac{C_u}{2}\).
Proof
We start by proving the monotonicity estimate
for \(\mathscr {V}_{n+1}(\widehat{u}, \widehat{p}, \widehat{\alpha }) = \varepsilon _u\Vert u^{n+1}-\widehat{u}\Vert ^2 + \varepsilon _p\Vert p^{n}-\widehat{p}\Vert ^2 + \varepsilon _{\alpha }\Vert \alpha ^{n+1} - \widehat{\alpha }\Vert ^2\). We observe that \(\varepsilon _u,\varepsilon _p,\varepsilon _{\alpha }>0\) by Assumption 2.5. The monotonicity estimate (42) expands as
for (all elements of the set)
We estimate the three lines of \(h_{n+1}\) separately. We immediately take care of the first line by using (20) from Lemma 3.8.
For the second line, using the optimality condition (4) we have
We have \(u^{n+1}\), \(S_u(\alpha ^n)\in B(\widehat{u}, r_u)\), and \(\alpha ^n\in B(\widehat{\alpha }, r)\) by Lemma 3.3 and Assumption 3.4. Thus \(\gamma _F\cdot {{\,\textrm{Id}\,}}\le \nabla _u^2 F\le L_F\cdot {{\,\textrm{Id}\,}}\) in \(B(\widehat{u}, r_u) \times B(\widehat{\alpha }, 2r)\) and \(\Vert \nabla _{u}^2 F(u^{n+1};\alpha ^{n})\Vert \le L_F\) by Assumption 2.5 (iii). We get
from Lemma 3.17. By Theorem 6.1 (i) \(\langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle \) is an inner product and \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) a norm on \(\mathbb {L}(U; \mathscr {A})\), so we can use the Cauchy–Schwarz inequality for them. Therefore, using also Theorem 6.1 (ii), Lemma 3.1 and Young’s inequality, we can estimate
Inserting (45) and (46) into (44), we obtain
The factor of the last term equals \(C_{\alpha ,1}\) from Assumption 2.5 (v).
Since Assumption 3.4 and (34) hold, Lemma 3.12 gives (26) with \(C = C_p\) and any \(d>0\) for the third line of \(h_{n+1}\). Summing (20), (47) and (26) we finally deduce
We have
Then also \(\frac{\gamma _\alpha }{2}- \frac{C_{\alpha ,2}}{d}=\varepsilon _{\alpha }\). It follows
Since \(\sigma <1/L_\alpha \) by Lemma 3.2, \(L_F/(2\kappa ) \le 1/\tau \) and \(\theta < 1/L_F\) by Assumption 2.5 (iv), we obtain (43), i.e., (42). We finish by applying the fundamental arguments of the testing approach to (42) and the general implicit update (10) as in the proof of Lemma 3.14. \(\square \)
We simplify the assumptions of the previous lemma to just Assumption 2.5.
Lemma 3.20
Suppose Assumption 2.5 holds. Then (41) holds for any \(n\in \mathbb {N}\).
Proof
The claim readily follows if we prove by induction for all \(n\in \mathbb {N}\) that
We first prove () for \(n=0.\) Assumption 2.5 (i) directly establishes (34). The definition of \(r_0\) in Assumption 2.5 also establishes that \(\alpha ^n\in B(\widehat{\alpha }, r_0)\) and \(u^n\in B(\widehat{u}, \sqrt{\sigma ^{-1}\varphi ^{-1}_u\tau }r_0).\) We have just proved the conditions of Lemma 3.5, which gives \(\Vert \hspace{-1.0pt}|p^1 \Vert \hspace{-1.0pt}|\le N_p\). Thus we have proved Assumption 3.4 for \(n=0.\) Now Lemma 3.19 proves (41) for \(n=0.\) This concludes the proof of the induction base.
We then make the induction assumption that () holds for \(n\in \{0,\ldots ,k\}\) and prove it for \(n=k+1.\) The induction assumption and Lemma 3.18 give (34) for \(n=k+1.\) The inequality (41) for \(n\in \{0,\ldots ,k\}\) also ensures \(\Vert x^{n+1}-\widehat{x}\Vert _{ZM} \le \Vert x^{n}-\widehat{x}\Vert _{ZM}\) for \(n\in \{0,\ldots ,k\}.\) Therefore Lemma 3.6 proves Assumption 3.4 for \(n=k+1\). Now Lemma 3.19 shows (41) and concludes the proof of () for \(n=k+1\). \(\square \)
We finally come to the main convergence result for the FIFB.
Theorem 3.21
Suppose Assumption 2.5 holds. Then \(\varphi _u\tau ^{-1}\Vert u^{n}-\widehat{u}\Vert ^2 + \varphi _p\theta ^{-1}\Vert p^{n}-\widehat{p}\Vert ^2 + \sigma ^{-1}\Vert \alpha ^{n} - \widehat{\alpha }\Vert ^2 \rightarrow 0\) linearly.
Proof
We define \(\mu _1:= \min \{(1+2\varepsilon _u\varphi ^{-1}_u \tau ), (1+2\varepsilon _{\alpha }\sigma )\}\) and \(\mu _2:= 1 - 2\varepsilon _p\varphi ^{-1}_p\theta .\) Lemma 3.20, expansion of (41), and basic manipulation show that
There are two separate cases (i) \(\mu _1 \mu _2 \le 1\) and (ii) \(\mu _1 \mu _2 > 1.\) In case (i), we have
which with (48) implies linear convergence since \(\mu ^{-1}_1\in (0,1).\) In case (ii), we obtain
which with (48) implies linear convergence since \(\mu _2\in (0,1).\) \(\square \)
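The FIFB can be sketched on the same kind of hypothetical one-dimensional instance as above (all constants below are illustrative assumptions): the only change from the exact-adjoint scheme is that the adjoint equation is not solved exactly, with p instead taking one relaxation step on the residual per outer iteration, and the triple still converges linearly as in Theorem 3.21:

```python
import numpy as np

# Hypothetical 1-D instance: F(u; a) = 0.5*(u-a)^2 + 0.5*gamma*u^2,
# J(u) = 0.5*(u-b)^2, R = 0. The adjoint equation p*(1+gamma) = 1 is
# solved only inexactly: one step on the residual per outer iteration.
gamma, b = 0.5, 1.0
tau, theta, sigma = 0.5, 0.5, 0.5        # illustrative step lengths
u, p, a = 0.0, 0.0, 0.0
u_hat, p_hat, a_hat = b, 1.0 / (1.0 + gamma), (1.0 + gamma) * b

errs = []
for n in range(120):
    u = u - tau * ((u - a) + gamma * u)          # inner gradient step
    p = p - theta * (p * (1.0 + gamma) - 1.0)    # inexact adjoint step
    a = a - sigma * p * (u - b)                  # outer step (R = 0)
    errs.append(np.sqrt((u - u_hat)**2 + (p - p_hat)**2 + (a - a_hat)**2))

print(errs[-1])  # the triple (u, p, a) converges jointly, linearly
```

Here the adjoint step costs one residual evaluation instead of a linear solve, which is the trade-off the FIFB exploits on the large imaging problems of Sect. 4.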
4 Numerical experiments
We evaluate the performance of our proposed algorithms on parameter learning for (anisotropic, smoothed) total variation image denoising and deconvolution. For a “ground truth” image \(b \in \mathbb {R}^{N^2}\) of dimensions \(N \times N\), we take
as the outer fitness function. For b we use a cropped portion of image 02 or 08 from the free Kodak dataset [49] converted to gray values in [0, 1]. The purpose of these numerical experiments is a simple performance comparison between our proposed methods and a few representative approaches from the literature. We therefore only consider a single ground-truth image b and a corresponding corrupted data z in the various inner problems, which we next describe. For proper generalizable parameter learning, multiple such training pairs \((b_i, z_i)\) should be used. This can in principle be done by summing over all the data in both the inner and outer problem, resulting in a higher-dimensional bilevel problem; see, e.g., [50]. In practice, a large sample count would require stochastic techniques.
4.1 Denoising
For denoising we take in (1) as the inner objective
and as the outer regulariser \(R \equiv 0\). The simulated measurement z is obtained from b by adding Gaussian noise of standard deviation 0.1. The matrix D is a backward difference operator with Dirichlet boundary conditions. To ensure twice differentiability of the objective, and hence a simple adjoint equation, we replace the one-norm \(\Vert \cdot \Vert _1\) by a \(C^2\) Huber- or Moreau–Yosida-type approximation with
We used \(\gamma =10^{-4}\) in our experiments (Fig. 1).
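The displayed formula for the paper's \(C^2\) smoothing is elided above; as an illustrative stand-in with the same role, the sketch below uses the smooth approximation \(\varphi _\gamma (t) = \sqrt{t^2 + \gamma ^2}\), which tends to \(|t|\) as \(\gamma \searrow 0\) and makes the smoothed anisotropic TV term \(\alpha \sum _i \varphi _\gamma ((Du)_i)\) twice differentiable:

```python
import numpy as np

# Illustrative C^2 (indeed C^infinity) smoothing of |t|; the paper's
# exact Huber/Moreau-Yosida-type formula is not reproduced here.
gamma = 1e-4

def phi(t):
    """Smooth surrogate for |t|: sqrt(t^2 + gamma^2)."""
    return np.sqrt(t**2 + gamma**2)

def dphi(t):
    """Its derivative, a smooth surrogate for sign(t)."""
    return t / np.sqrt(t**2 + gamma**2)

t = np.array([-1.0, -1e-3, 0.0, 1e-3, 1.0])
print(phi(t) - np.abs(t))   # close to zero away from the kink at t = 0
print(dphi(t))              # smoothly interpolates between -1 and 1
```

With \(\gamma = 10^{-4}\) the surrogate deviates from \(|t|\) by at most \(\gamma \), so the inner minimisers are essentially unchanged while the adjoint equation involves a well-defined Hessian.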
4.2 Deconvolution
For deconvolution, we take as the inner objective
and as the outer regulariser \(R(\alpha )=\beta (\sum _{i=2}^4\alpha _i-1)^2+\delta _{[0, \infty )}(\alpha _1)\) for a regularisation parameter \(\beta =10^4\). We introduce the constant \(C=\tfrac{1}{10}\) to help convergence by ensuring the same order of magnitude for all components of \(\alpha \). The first element of \(\alpha \) is the total variation regularization parameter, while the rest parametrize the convolution kernel \(K(\alpha )\) as illustrated in Fig. 2a. The sum of the elements of the kernel equals \(\alpha _2 + \alpha _3 + \alpha _4.\) The operator \(r_{\theta }\) rotates the image by \(\theta \) degrees, clockwise for \(\theta >0\) and counterclockwise for \(\theta <0.\) We form z by computing \(r_{-1}(K(\alpha )*r_1(b))\) for \(\alpha = [0.15, 0.1, 0.75] \) and adding Gaussian noise of standard deviation \(1\cdot 10^{-2}\).
For denoising, and deconvolution assuming \(\ker D \cap \ker K(\alpha ) = \{0\}\), it is not difficult to verify the structural parts of Assumption 2.2 and 2.5, required for the convergence results of Sect. 3. We do not attempt to verify the conditions on the step lengths, choosing them by trial and error.
4.3 An implicit baseline method
We evaluate Algorithms 2.1 and 2.2 against a conventional method that solves both the inner problem and the adjoint equation (near-)exactly, as well as the AID [26]. We also experimented with solving the equivalent constrained optimisation problem \(\min _{\alpha , u} J(u)\) subject to \(\nabla _u F(u; \alpha )=0\) with IPOPT [51] and the NL-PDPS [52, 53]. However, we did not observe convergence without the inclusion of additional \(H^1\) regularisation in the inner problem, as in, e.g., [7]. Since that changes the problem, we have not included “whole problem” approaches in our comparison.
To solve the inner problem in the implicit baseline method, we use gradient descent, starting at \(v^0=0\) and updating \(v^{m+1} :=v^m - \tau _m \nabla F(v^m; \alpha ^k)\). We then set \(u^{k+1}=v^{m+1}\). The adjoint and outer iterate updates are as in Algorithm 2.1; however, we determine \(\sigma =\sigma _k\) by the line search rule [19, (12.41)] for nonsmooth problems, starting at \(\sigma _k=5\cdot 10^{-5}\) and multiplying by 0.1 on each line search step. For deconvolution we use a fixed step length parameter, as it performed better. The specific parameter choices (step lengths, number of inner and adjoint iterations) for all algorithms and experiments are listed in Table 1.
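The baseline's outer loop can be sketched on a hypothetical one-dimensional problem; the quadratics, step lengths and backtracking constants below are illustrative assumptions, and a standard backtracking rule on the reduced objective \(J(S_u(\alpha ))\) stands in for the paper's line search rule [19, (12.41)], which is not reproduced here:

```python
import numpy as np

# Sketch of the implicit baseline on a hypothetical 1-D toy problem:
# the inner problem is solved near-exactly by gradient descent before
# every outer step, and the adjoint equation is solved exactly.
# Inner: F(u; a) = 0.5*(u-a)^2 + 0.5*gamma*u^2; outer: J(u) = 0.5*(u-b)^2.
gamma, b = 0.5, 1.0
S_u = lambda a: a / (1.0 + gamma)                  # exact inner solution map
red_J = lambda a: 0.5 * (S_u(a) - b) ** 2          # reduced outer objective

a = 0.0
for k in range(60):
    u = 0.0                                        # near-exact inner solve
    for m in range(100):
        u -= 0.5 * ((u - a) + gamma * u)
    p = 1.0 / (1.0 + gamma)                        # exact adjoint solve
    g = p * (u - b)                                # gradient of J(S_u(a))
    sigma = 5.0                                    # backtracking line search,
    while red_J(a - sigma * g) > red_J(a) - 0.5 * sigma * g * g:
        sigma *= 0.1                               # shrink by 0.1 per trial
    a -= sigma * g

print(abs(a - (1.0 + gamma) * b))  # outer iterate near the minimiser 1.5
```

The nested structure makes each outer step much more expensive than in the FEFB/FIFB sketches, which is the cost the proposed methods avoid by taking only single inner and adjoint steps.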
4.4 Numerical setup
Our algorithm implementations are available on Zenodo [54]. To solve the adjoint equation in the FEFB and implicit methods, we use Matlab’s bicgstab implementation of the stabilized biconjugate gradient method [55] with tolerance \(10^{-5}\), and maximum iteration count \(10^{3}\). With the AID we use 50 conjugate gradient iterations. These choices, as well as the choice of the number of inner iterations for the implicit method and the AID, have been made by trial and error to be as small as possible while obtaining an apparently stable algorithm.
To evaluate scalability, we consider for denoising both \(N=128\) and \(N=256\). For deconvolution we consider \(N=128\) and \(N=32.\) We take initial \(u^0 = S_u(\alpha ^0)\) and \(p^0 = S_p(u^0, \alpha ^0)\) where for denoising \(\alpha ^0=0\) and for deconvolution \(\alpha ^{0}=[0.4, 0.25, 0.25, 0.5]\) and \(\alpha ^{0}=[0.04, 0.25, 0.25, 0.5]\) with \(N=128\) and \(N=32\) respectively.
To compare algorithm performance, we plot relative errors against the cputime value of Matlab on an AMD Ryzen 5 5600 H CPU. We call this value “computational resources”, as it takes into account the use of several CPU cores by Matlab’s internal linear algebra, making it fairer than the actual running time. For each algorithm and problem, we indicate in Table 1 the step length parameters, the number of outer steps to reach the computational resources value of 6000 for denoising and 15,000 for deconvolution, and an average multiplier to convert computational resources into running times.
For performance comparison, we need estimates \(\tilde{\alpha }\) and \(\tilde{u}\) of optimal \(\hat{\alpha }\) and \(\hat{u}=S_u(\hat{\alpha })\). For denoising we find them by searching for the one-dimensional variable \(\alpha \) on a regular grid and recursively subdividing until node spacing goes below \(10^{-5}\). As \(\tilde{u}\), we take an estimate of \(S_u(\tilde{\alpha })\) obtained with 25,000 steps of the implicit baseline method. We visualise the so-obtained \(\tilde{\alpha }\) and \(J \circ S_u\) in Fig. 3. For the higher-dimensional deconvolution problem, such a scan is not feasible. Instead, we obtain the comparison estimates by running the implicit method from a very good initial iterate until a computational resources (CPU time) value of 6000 for \(N=32\) and 10,000 for \(N=128\). Specifically, we initialise the kernel parameters \((\alpha _2,\alpha _3,\alpha _4)\) as those used for generating the data z, and the regularization parameter \(\alpha _1 = 0.045\) for \(N=32\) and \(\alpha _1=0.02\) for \(N=128\), the latter found by trial and error. This initialisation is different from that used for the actual numerical experiments; see above. Our experiments indicate that the other methods approach the so-obtained \(\tilde{\alpha }\) faster than the implicit method itself, providing some justification for the choice.
With these solution estimates we define the inner and outer relative errors
4.5 Results
We report performance in Figs. 4 and 5, and the image data and reconstructions in Figs. 1 and 2. Figure 5 indicates that for deconvolution the FIFB significantly outperforms the other methods. The outer variable converges much faster than for the other evaluated methods, even though the inner variable, especially with \(N=32\), stays some distance away from \(\tilde{u}\). However, as the dashed line indicates, the exact solution \(S_u(\alpha ^k)\) of the inner problem for the corresponding outer iterate shows clear signs of convergence. (At the few “spikes” in the graph for \(N=128\), the regularisation parameter \(\alpha ^k_0\) is temporarily much closer to zero than \(\tilde{\alpha }_0\).) This observation supports the intuition that the inner problem does not need to be solved to high accuracy to obtain convergence for the outer problem, and that highly accurate inner solutions can even be detrimental to convergence. The exact solution of the adjoint equation in both the implicit method and the FEFB makes them too slow to make any meaningful progress. The denoising experiments of Fig. 4 likewise suggest that the FIFB is initially the best performing algorithm, although the implicit method and the AID catch up later on the denoising problem. On the small denoising problem (\(N=128\)), the implicit method is significantly faster than any other method. Nevertheless, overall and for practical purposes, the FIFB appears to perform best.
Notes
An error in [43, Lemma 10] requires some conditions therein to be taken “in the limit” as \(t \searrow 0\).
References
Bard, J.F., Falk, J.E.: An explicit solution to the multi-level programming problem. Comput. Oper. Res. 9(1), 77–100 (1982). https://doi.org/10.1016/0305-0548(82)90007-7
Allende, G.B., Still, G.: Solving bilevel programs with the KKT-approach. Math. Program. 138(1–2), 309–332 (2012). https://doi.org/10.1007/s10107-012-0535-x
Fliege, J., Tin, A., Zemkoho, A.: Gauss–Newton-type methods for bilevel optimization. Comput. Optim. Appl. 78, 793–824 (2021). https://doi.org/10.1007/s10589-020-00254-3
Jiang, Y., Li, X., Huang, C., Wu, X.: Application of particle swarm optimization based on CHKS smoothing function for solving nonlinear bilevel programming problem. Appl. Math. Comput. 219(9), 4332–4339 (2013). https://doi.org/10.1016/j.amc.2012.10.010
De Los Reyes, J.C., Villacís, D.: Bilevel Imaging Learning Problems as Mathematical Programs with Complementarity Constraints (2021)
Falk, J.E., Liu, J.: On bilevel programming, Part I: general nonlinear cases. Math. Program. 70(1–3), 47–72 (1995). https://doi.org/10.1007/bf01585928
De Los Reyes, J.C., Schönlieb, C.-B.: Image denoising: learning noise distribution via PDE-constrained optimization. Inverse Prob. Imag. 7, 1183–1214 (2013). arXiv:1207.3425
Kunisch, K., Pock, T.: A bilevel optimization approach for parameter learning in variational models. SIAM J. Imag. Sci. 6(2), 938–983 (2013). https://doi.org/10.1137/120882706
Holler, G., Kunisch, K., Barnard, R.C.: A bilevel approach for parameter learning in inverse problems. Inverse Prob. 34(11), 115012 (2018). https://doi.org/10.1088/1361-6420/aade77
Hintermüller, M., Rautenberg, C.N., Wu, T., Langer, A.: Optimal selection of the regularization function in a weighted total variation model. Part II: algorithm, its analysis and numerical tests. J. Math. Imag. Vis. 59, 515 (2017). https://doi.org/10.1007/s10851-017-0736-2
Sherry, F., Benning, M., De los Reyes, J.C., Graves, M.J., Maierhofer, G., Williams, G., Schönlieb, C.-B., Ehrhardt, M.J.: Learning the sampling pattern for MRI. IEEE Trans. Med. Imaging 39(12), 4310–4321 (2020). https://doi.org/10.1109/TMI.2020.3017353
Calatroni, L., De Los Reyes, J.C., Schönlieb, C.-B.: Dynamic sampling schemes for optimal noise learning under multiple nonsmooth constraints. In: Proceedings of the 26th IFIP TC 7 Conference on System Modeling and Optimization, Klagenfurt, Austria (2014)
Liu, R., Mu, P., Yuan, X., Zeng, S., Zhang, J.: A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 6305–6315. PMLR, online (2020)
De los Reyes, J.C., Villacís, D.: Optimality conditions for bilevel imaging learning problems with total variation regularization. SIAM J. Imag. Sci. 15(4), 1646–1689 (2022). https://doi.org/10.1137/21M143412X. arXiv:2107.08100
Ehrhardt, M.J., Roberts, L.: Inexact derivative-free optimization for bilevel learning. J. Math. Imag. Vis. (2021). https://doi.org/10.1007/s10851-021-01020-8
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40, 120–145 (2011). https://doi.org/10.1007/s10851-010-0251-1
He, B., Yuan, X.: Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM J. Imag. Sci. 5(1), 119–149 (2012). https://doi.org/10.1137/100814494
Valkonen, T.: Testing and non-linear preconditioning of the proximal point method. Appl. Math. Optim. 82(2), 1 (2020). https://doi.org/10.1007/s00245-018-9541-6. arXiv:1703.05705
Clason, C., Valkonen, T.: Introduction to Nonsmooth Analysis and Optimization. Work in progress (2020). https://arxiv.org/abs/2001.00216
Valkonen, T.: First-order primal-dual methods for nonsmooth nonconvex optimisation. In: Chen, K., Schönlieb, C.-B., Tai, X.-C., Younes, L. (eds.) Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-03009-4_93-1
Chen, T., Sun, Y., Yin, W.: A Single-Timescale Stochastic Bilevel Optimization Method (2021)
Hong, M., Wai, H.-T., Wang, Z., Yang, Z.: A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic (2020)
Li, J., Gu, B., Huang, H.: A fully single loop algorithm for bilevel optimization without hessian inverse. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 7426–7434 (2022)
Yang, J., Ji, K., Liang, Y.: Provably faster algorithms for bilevel optimization. Adv. Neural. Inf. Process. Syst. 34, 13670–13682 (2021)
Dagréou, M., Ablin, P., Vaiter, S., Moreau, T.: A Framework for Bilevel Optimization that Enables Stochastic and Global Variance Reduction Algorithms (2022)
Ji, K., Yang, J., Liang, Y.: Bilevel optimization: convergence analysis and enhanced design. In: International Conference on Machine Learning, pp. 4882–4892. PMLR (2021)
Ji, K., Liang, Y.: Lower bounds and accelerated algorithms for bilevel optimization. J. Mach. Learn. Res. 23, 1–56 (2022)
Ghadimi, S., Wang, M.: Approximation methods for bilevel programming (2018)
Luo, Z.Q., Pang, J.S., Ralph, D.: Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge (1996)
Aussel, D., Lalitha, C.S. (eds.): Generalized Nash Equilibrium Problems, Bilevel Programming and MPEC. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-4774-9
Dempe, S.: Foundations of Bilevel Programming. Nonconvex Optimization and Its Applications. Springer, US (2006)
Dempe, S., Kalashnikov, V., Pérez-Valdés, G.A., Kalashnykova, N.: Bilevel Programming Problems. Energy Systems. Springer, Berlin (2015). https://doi.org/10.1007/978-3-662-45827-3
Dempe, S.: Bilevel optimization: theory, algorithms and applications. Preprint 2018-11, TU Bergakademie Freiberg, Fakultät für Mathematik und Informatik (2018). http://www.optimization-online.org/DB_FILE/2018/08/6773.pdf
Dempe, S., Zemkoho, A. (eds.): Bilevel Optimization: Advances and Next Challenges. Springer, Berlin (2020)
Ye, J.J., Zhu, D.L.: Optimality conditions for bilevel programming problems. Optimization 33(1), 9–27 (1995). https://doi.org/10.1080/02331939508844060
Zemkoho, A.B.: Solving ill-posed bilevel programs. Set-valued Var. Anal. 24(3), 423–448 (2016). https://doi.org/10.1007/s11228-016-0371-x
Dempe, S., Zemkoho, A.B.: The bilevel programming problem: reformulations, constraint qualifications and optimality conditions. Math. Program. 138(1–2), 447–473 (2012). https://doi.org/10.1007/s10107-011-0508-5
Mehlitz, P., Zemkoho, A.B.: Sufficient optimality conditions in bilevel programming. Math. Oper. Res. (2021). https://doi.org/10.1287/moor.2021.1122
Bai, K., Ye, J.J.: Directional necessary optimality conditions for bilevel programs. Math. Oper. Res. (2021). https://doi.org/10.1287/moor.2021.1164. To appear (published online)
Sabach, S., Shtern, S.: A first order method for solving convex bilevel optimization problems. SIAM J. Optim. 27(2), 640–660 (2017). https://doi.org/10.1137/16m105592x
Shehu, Y., Vuong, P.T., Zemkoho, A.: An inertial extrapolation method for convex simple bilevel optimization. Optim. Methods Softw. 36(1), 1–19 (2019). https://doi.org/10.1080/10556788.2019.1619729
De Los Reyes, J.C., Schönlieb, C.-B., Valkonen, T.: Bilevel parameter learning for higher-order total variation regularisation models. J. Math. Imag. Vis. 57, 1–25 (2017). https://doi.org/10.1007/s10851-016-0662-8. arXiv:1508.07243
De Los Reyes, J.C., Schönlieb, C.-B., Valkonen, T.: The structure of optimal parameters for image restoration problems. J. Math. Anal. Appl. 434, 464–500 (2016). https://doi.org/10.1016/j.jmaa.2015.09.023. arXiv:1505.01953
Hintermüller, M., Wu, T.: Bilevel optimization for calibrating point spread functions in blind deconvolution. Inverse Probl. Imag. 9(4), 1139–1169 (2015). https://doi.org/10.3934/ipi.2015.9.1139
Chambolle, A., Pock, T.: Learning consistent discretizations of the total variation. SIAM J. Imag. Sci. 14(2), 778–813 (2021). https://doi.org/10.1137/20m1377199
Ochs, P., Ranftl, R., Brox, T., Pock, T.: Techniques for gradient-based bilevel optimization with non-smooth lower level problems. J. Math. Imag. Vis. 56(2), 175–194 (2016). https://doi.org/10.1007/s10851-016-0663-7
Clarke, F.: Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, US (1990). https://doi.org/10.1137/1.9781611971309
Hare, W.L., Lewis, A.S.: Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal. 11(2), 251–266 (2004)
Franzen, R.: Kodak lossless true color image suite. PhotoCD PCD0992. Lossless, true color images released by the Eastman Kodak Company (1999). http://r0k.us/graphics/kodak/
Calatroni, L., Cao, C., De Los Reyes, J.C., Schönlieb, C.-B., Valkonen, T.: Bilevel approaches for learning of variational imaging models. In: Variational Methods in Imaging and Geometric Control. Radon Series on Computational and Applied Mathematics, vol. 18, pp. 252–290 (2016). https://doi.org/10.1515/9783110430394-008
Wächter, A.: An Interior Point Algorithm for Large-Scale Nonlinear Optimization with Applications in Process Engineering. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA (2002)
Clason, C., Mazurenko, S., Valkonen, T.: Acceleration and global convergence of a first-order primal-dual method for nonconvex problems. SIAM J. Optim. 29, 933–963 (2019). https://doi.org/10.1137/18M1170194. arXiv:1802.03347
Valkonen, T.: A primal-dual hybrid gradient method for non-linear operators with applications to MRI. Inverse Prob. 30(5), 055012 (2014). https://doi.org/10.1088/0266-5611/30/5/055012. arXiv:1309.5032
Suonperä, E.: Codes for “Linearly convergent bilevel optimization with single-step inner methods” (2023). https://doi.org/10.5281/zenodo.7974062
Van der Vorst, H.A.: Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Comput. 13(2), 631–644 (1992)
Funding
Open Access funding provided by University of Helsinki including Helsinki University Central Hospital. This research has been supported by the Academy of Finland Grants 314701, 320022, and 345486.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest.
Appendices
Appendix A: Prox-\(\sigma \)-contractivity
In the next three theorems we verify prox-\(\sigma \)-contractivity (Assumption 2.1) for some common cases.
The next theorem readily extends to \(R=\beta \Vert \,\varvec{\cdot }\,\Vert _1 + \delta _{[0, \infty )^n}\) on \(\mathbb {R}^n\) and product sets A, since \({{\,\textrm{prox}\,}}_{\sigma R}\) acts independently on each coordinate.
Theorem 5.1
(prox-\(\sigma \)-contractivity of positivity-constrained soft-thresholding) Let \(\sigma ,\beta >0\) and \(R=\beta |\,\varvec{\cdot }\,|_1 + \delta _{[0, \infty )}\) on \(\mathbb {R}\). Then R is prox-\(\sigma \)-contractive at any \(\widehat{\alpha }\ge \max \{0, \sigma (q+\beta )\}\) for any \(q \in \mathbb {R}\) with any factor \(0< C_R < \sigma ^{-1}\) within
In particular, if \(\widehat{\alpha }\in {{\,\textrm{dom}\,}}R=[0, \infty )\) and \(-q = \beta \in \partial R(\widehat{\alpha })\), then R is locally prox-\(\sigma \)-contractive at \(\widehat{\alpha }\) with any factor \(0< C_R < \sigma ^{-1}\) within
Proof
We have (see, e.g., [19])
We have by assumption \(\widehat{\alpha }\ge \sigma (q+\beta )\). If also \(\alpha \ge \sigma (q + \beta )\), we have \(D_{\sigma R}(\alpha ) - D_{\sigma R}(\widehat{\alpha })=0\), which satisfies the required inequality.
Suppose then that \(\alpha < \sigma (q + \beta )\).
Since \(D_{\sigma R}(\widehat{\alpha })=-\sigma (q+\beta )\), we need to show
We have \(D_{\sigma R}(\alpha )=-\alpha \), so (49) rearranges as
Since \(1 > \sigma C_R\), this inequality can be rearranged as the condition \(\alpha \ge \widehat{\alpha }- \frac{\widehat{\alpha }- \sigma (q + \beta )}{1-\sigma C_R}\). Any \(\alpha \in A\) satisfies this bound.
Let then \(-q = \beta \in \partial R(\widehat{\alpha })\). Since \(\widehat{\alpha }\ge 0\), we have \(\widehat{\alpha }\ge \sigma (q+\beta ) = 0\). Since \(\frac{\widehat{\alpha }- \sigma (q + \beta )}{1-\sigma C_R} = \frac{\widehat{\alpha }}{1-\sigma C_R} \ge \widehat{\alpha }\), the claimed simpler expression for A follows from the general one. \(\square \)
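The proof hinges on the closed form \({{\,\textrm{prox}\,}}_{\sigma R}(t)=\max \{0, t-\sigma \beta \}\) for positivity-constrained soft-thresholding. The following is our own minimal numerical sketch, taking \(D_{\sigma R}(\alpha ) = {{\,\textrm{prox}\,}}_{\sigma R}(\alpha -\sigma q)-\alpha \) consistently with the computations in the proof (Assumption 2.1 itself is stated earlier in the paper), and checking the contractivity inequality on the set A from the theorem:

```python
import numpy as np

def prox(t, sigma, beta):
    # prox of sigma*R for R = beta*|.| + delta_{[0, inf)} on the real line:
    # positivity-constrained soft-thresholding.
    return max(0.0, t - sigma * beta)

sigma, beta, q = 0.5, 1.0, 0.2
C_R = 0.9 / sigma                     # any 0 < C_R < 1/sigma
alpha_hat = 2.0                       # satisfies alpha_hat >= sigma*(q + beta)
D = lambda a: prox(a - sigma * q, sigma, beta) - a

# Lower end of the set A from the theorem
lo = alpha_hat - (alpha_hat - sigma * (q + beta)) / (1 - sigma * C_R)
for a in np.linspace(lo + 1e-6, 4.0, 200):
    # |D(a) - D(alpha_hat)| <= sigma * C_R * |a - alpha_hat| on A
    assert abs(D(a) - D(alpha_hat)) <= sigma * C_R * abs(a - alpha_hat) + 1e-9
```

The bound is tight exactly at the lower end of A, matching the rearrangement in the proof.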
Similarly to the previous result, the restriction \(q=0\) in the next theorem on projections onto a convex set C forbids strictly complementary cases of \(\widehat{\alpha }\in {{\,\textrm{bd}\,}}C\), i.e., we cannot have \(0 \ne -q \in N_C(\widehat{\alpha }) :=\partial \delta _C(\widehat{\alpha })\).
Theorem 5.2
(prox-\(\sigma \)-contractivity of projections) Let \(\sigma >0\) and \(R=\delta _C\) for a convex and closed \(C \subset \mathbb {R}^n\). Then R is prox-\(\sigma \)-contractive at any \(\widehat{\alpha }\in C\) for \(q=0\) within \(A=C\) with any factor \(C_R > 0\).
Proof
We have \({{\,\textrm{prox}\,}}_{\sigma R}={{\,\textrm{proj}\,}}_C\) for the Euclidean projection onto C. Since \(\alpha , \widehat{\alpha }\in C = {{\,\textrm{dom}\,}}R\) and \(q=0\), we have \(\alpha = {{\,\textrm{proj}\,}}_C(\alpha ) = {{\,\textrm{proj}\,}}_C(\alpha - \sigma q)\), and likewise for \(\widehat{\alpha }\). The claim is now immediate. \(\square \)
Example 5.3
(ReLU) The proximal mapping of \(\delta _{[0, \infty )}\) is known as the rectified linear unit (ReLU) activation function. By the above theorem, it is prox-\(\sigma \)-contractive at any \(\widehat{\alpha }\ge 0\) for \(q=0\).
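A short sketch of this example (our own illustration): with \(q=0\), the displacement \({{\,\textrm{prox}\,}}_{\sigma R}(\alpha -\sigma q)-\alpha \) vanishes on C, so the contractivity inequality holds trivially for any \(C_R>0\).

```python
import numpy as np

# ReLU as the proximal map of delta_{[0, inf)^n}: the Euclidean
# projection onto the nonnegative orthant.
def relu(a):
    return np.maximum(a, 0.0)

# Points of C = [0, inf)^3 are fixed by the projection, so the
# displacement prox(a - sigma*q) - a is identically zero on C.
a = np.array([0.0, 0.3, 2.0])
assert np.allclose(relu(a), a)
```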
Theorem 5.4
(prox-\(\sigma \)-contractivity of smooth functions) Let \(\sigma >0\) and \(R: \mathbb {R}^n \rightarrow \mathbb {R}\) be convex with Lipschitz gradient. Then R is prox-\(\sigma \)-contractive at any \(\widehat{\alpha }\in \mathbb {R}^n\) for any \(q \in \mathbb {R}^n\) within \(A=\mathbb {R}^n\) with factor \(C_R=L_{\nabla R}\), the Lipschitz factor of \(\nabla R\).
Proof
Write \(p(\alpha ) :={{\,\textrm{prox}\,}}_{\sigma R}(t(\alpha ))\) and \(t(\alpha ) :=\alpha - \sigma q\). According to the definition of the proximal operator, \( 0 = \sigma \nabla R(p(\alpha )) + p(\alpha ) - t(\alpha ). \) Hence \( p(\alpha )-\alpha = -\sigma [q + \nabla R(p(\alpha ))] \) which yields for any \(\alpha \in \mathbb {R}^n\), as required,
\(\square \)
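The fixed-point characterisation \(0 = \sigma \nabla R(p(\alpha )) + p(\alpha ) - t(\alpha )\) from the proof also suggests a simple way to evaluate the prox of a smooth R numerically. The following sketch is our own illustration, not the paper's code: it uses the convex \(R(\alpha ) = \log (1+e^{\alpha }) - \alpha /2\) with \(L_{\nabla R}=1/4\), and spot-checks the contractivity factor \(C_R = L_{\nabla R}\).

```python
import numpy as np

def prox_smooth(grad_R, t, sigma, iters=200):
    # Solve 0 = sigma*grad_R(p) + p - t by the fixed-point iteration
    # p <- t - sigma*grad_R(p); a contraction when sigma*L_{grad R} < 1.
    p = t
    for _ in range(iters):
        p = t - sigma * grad_R(p)
    return p

# grad of R(a) = log(1 + e^a) - a/2; Lipschitz with L = 1/4.
grad_R = lambda a: 1.0 / (1.0 + np.exp(-a)) - 0.5
sigma, q, L = 1.0, 0.3, 0.25
D = lambda a: prox_smooth(grad_R, a - sigma * q, sigma) - a
for a, a_hat in [(0.0, 1.0), (-2.0, 3.0), (0.5, 0.7)]:
    # |D(a) - D(a_hat)| <= sigma * L * |a - a_hat|, as in Theorem 5.4
    assert abs(D(a) - D(a_hat)) <= sigma * L * abs(a - a_hat) + 1e-9
```

The final bound follows exactly the chain in the proof: the displacement difference equals \(-\sigma [\nabla R(p(\alpha )) - \nabla R(p(\widehat{\alpha }))]\), which Lipschitz continuity and nonexpansiveness of the prox bound by \(\sigma L_{\nabla R}\Vert \alpha - \widehat{\alpha }\Vert \).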
Appendix B: A norm on separable spaces
We show basic properties of the inner product and norm defined by (6b).
Theorem 6.1
Let U be a Hilbert space and \(\mathscr {A}\) a separable Hilbert space. On \(\mathbb {L}(U; \mathscr {A})\) define \(\langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle \) and \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) according to (6b). Then
(i) \(\langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle \) is an inner product and \(\Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\) a norm on \(\mathbb {L}(U; \mathscr {A})\).
(ii) For \(M \in \mathbb {L}(U; U)\) and \(p \in \mathbb {L}(U; \mathscr {A})\), we have \(\Vert \hspace{-1.0pt}|pM \Vert \hspace{-1.0pt}| \le \Vert \hspace{-1.0pt}|p \Vert \hspace{-1.0pt}|\Vert M\Vert _{\mathbb {L}(U; U)}\).
(iii) The operator norm on \(\mathbb {L}(U; \mathscr {A})\) satisfies \(\Vert \,\varvec{\cdot }\,\Vert _{\mathbb {L}(U; \mathscr {A})} \le \Vert \hspace{-1.0pt}|\,\varvec{\cdot }\, \Vert \hspace{-1.0pt}|\).
Proof
(i) \(\langle \hspace{-2.0pt}\langle \,\varvec{\cdot }\,, \,\varvec{\cdot }\, \rangle \hspace{-2.0pt}\rangle \) is bilinear and symmetric. Also \(\Vert \hspace{-1.0pt}|p \Vert \hspace{-1.0pt}|^2 = \langle \hspace{-2.0pt}\langle p, p \rangle \hspace{-2.0pt}\rangle \ge 0\) for all \(p \in \mathbb {L}(U; \mathscr {A})\).
To prove that \(\Vert \hspace{-1.0pt}|p \Vert \hspace{-1.0pt}| > 0\) for \(p \ne 0\), we observe that the contrary implies \(\Vert p^*\varphi _i\Vert =0\) for all \(i \in I\). Since \(\{\varphi _i\}_{i \in I}\) is a basis for \(\mathscr {A}\), this implies \(p^*=0\), hence \(p=0\).
(ii) We have \( \Vert \hspace{-1.0pt}|pM \Vert \hspace{-1.0pt}|^2 =\sum _{i \in I} \Vert M^*p^*\varphi _i\Vert ^2 \le \sum _{i \in I} \Vert M^*\Vert _{\mathbb {L}(U; U)}^2 \Vert p^*\varphi _i\Vert ^2 = \Vert M\Vert _{\mathbb {L}(U; U)}^2 \Vert \hspace{-1.0pt}|p \Vert \hspace{-1.0pt}|^2. \)
(iii) Let \(p \in \mathbb {L}(U; \mathscr {A})\). Then \(\Vert p\Vert _{\mathbb {L}(U; \mathscr {A})} =\Vert p^*\Vert _{\mathbb {L}(\mathscr {A}; U)} =\sup _{\alpha \in \mathscr {A}, \Vert \alpha \Vert =1} \Vert p^*\alpha \Vert _U\). Since \(\{\varphi _i\}_{i \in I}\) is an orthonormal basis for \(\mathscr {A}\), we can write \(\alpha =\sum _{i \in I} a_i \varphi _i\) for some \(a_i \in \mathbb {R}\) with \(\sum _{i \in I} a_i^2=1\). Thus \( \Vert p^*\alpha \Vert _U^2 =\sum _{i \in I}\sum _{j \in I} a_i a_j \langle p^*\varphi _i,p^*\varphi _j\rangle _U \le \sum _{i \in I} \Bigl( \sum _{j \in I} a_j^2\Bigr) \Vert p^*\varphi _i\Vert _U^2 = \Vert \hspace{-1.0pt}|p \Vert \hspace{-1.0pt}|^2, \) where the inequality uses Young's inequality and the final equality uses \(\sum _{j \in I} a_j^2=1\). The claim follows. \(\square \)
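In finite dimensions, with the standard basis, \(\Vert \hspace{-1.0pt}|p \Vert \hspace{-1.0pt}|\) reduces to the Frobenius norm of the matrix p, and the claims of Theorem 6.1 can be checked numerically. A minimal sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal((4, 6))          # p in L(U; A) with U = R^6, A = R^4

# With the standard basis {e_i} of A, |||p|||^2 = sum_i ||p^T e_i||^2,
# which is the squared Frobenius norm of the matrix p.
triple = np.sqrt(sum(np.linalg.norm(p.T @ e) ** 2 for e in np.eye(4)))
assert np.isclose(triple, np.linalg.norm(p, 'fro'))

# Theorem 6.1(iii): the operator (spectral) norm is dominated by ||| . |||.
assert np.linalg.norm(p, 2) <= triple + 1e-12

# Theorem 6.1(ii): |||p M||| <= |||p||| * ||M||_{L(U;U)}.
M = rng.standard_normal((6, 6))
assert np.linalg.norm(p @ M, 'fro') <= triple * np.linalg.norm(M, 2) + 1e-9
```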
Suonperä, E., Valkonen, T. Linearly convergent bilevel optimization with single-step inner methods. Comput Optim Appl 87, 571–610 (2024). https://doi.org/10.1007/s10589-023-00527-7