linearly convergent bilevel optimization with single-step inner methods

We propose a new approach to solving bilevel optimization problems, intermediate between solving full-system optimality conditions with a Newton-type approach and treating the inner problem as an implicit function. The overall idea is to solve the full-system optimality conditions, but to precondition them so as to alternate between taking steps of simple conventional methods for the inner problem, the adjoint equation, and the outer problem. While the inner objective has to be smooth, the outer objective may be nonsmooth subject to a prox-contractivity condition. We prove linear convergence of the approach for combinations of gradient descent and forward-backward splitting with exact and inexact solution of the adjoint equation. We demonstrate good performance on learning the regularization parameter for anisotropic total variation image denoising, and the convolution kernel for image deconvolution.


introduction
Two general approaches are typical for the solution of the bilevel optimization problem (1.1) in Hilbert spaces. The first, familiar from the treatment of general mathematical programming with equilibrium constraints (MPECs), is to write out the Karush-Kuhn-Tucker conditions for the whole problem in a suitable form, and to apply a Newton-type method or another nonlinear equation solver to them [4,1,25,36,18]. The second approach, common in the application of (1.1) to inverse problems and imaging [24,14,37,32,30,45,6], treats the solution mapping of the inner problem as an implicit function. It is then necessary to (i) on each outer iteration solve the inner problem near-exactly using an optimization method of choice; (ii) solve an adjoint equation to calculate the gradient of the solution mapping; and (iii) take a step of another optimization method of choice on the outer problem with this gradient knowledge. The inner problem is therefore generally assumed to have a unique solution, and the solution map to be differentiable. An algorithm for nonsmooth inner problems has been developed in [39], while [17] rely on proving directional Bouligand differentiability for otherwise nonsmooth problems.
The challenge of the first "whole-problem" approach is to scale it to large problems, as it typically involves the inversion of large matrices. The difficulty with the second "implicit function" approach is that the inner problem needs to be solved several times, which can be expensive. Solving the adjoint equation also requires matrix inversion. The variant in [23] avoids this through derivative-free methods for the outer problem. It also solves the inner problem to a low but controlled accuracy.
In this paper, by preconditioning the implicit-form first-order optimality conditions, we develop an intermediate approach more efficient than the aforementioned, as we demonstrate in the numerical experiments of Section 4. It can be summarized as (i) take only one step of an optimization method on the inner problem, (ii) perform a cheap operation to advance towards the solution of the adjoint equation, and, finally, (iii) using this approximate information, take one step of an optimization method for the outer problem. Repeat.
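The single-loop pattern of steps (i)-(iii) can be sketched on a one-dimensional toy problem. Everything in the sketch below (the quadratic objectives, the symbols z, b, u, a, and the step lengths) is an illustrative assumption, not the paper's setting:

```python
# Toy single-loop bilevel iteration: per outer step, (i) one inner gradient
# step, (ii) an adjoint solve (scalar here, hence exact), (iii) one outer step.
# Hypothetical 1-D problems, not the paper's:
#   inner: F(u; a) = 0.5*(u - z)**2 + 0.5*a*u**2,  so S(a) = z/(1 + a)
#   outer: J(u) = 0.5*(u - b)**2, minimised where S(a) = b, i.e. a = 1.
z, b = 2.0, 1.0          # data and "ground truth"
u, a = 0.0, 0.5          # inner and outer iterates
tau, sigma = 0.5, 0.5    # illustrative step lengths

for _ in range(2000):
    u -= tau * ((u - z) + a * u)        # (i) inner gradient step
    p = (u - b) / (1.0 + a)             # (ii) adjoint: (d2F/du2) p = J'(u)
    a = max(a - sigma * (-u * p), 0.0)  # (iii) outer step with hypergradient
                                        #       S'(a) * J'(u) = -u * p

print(round(a, 3), round(u, 3))         # 1.0 1.0
```

No step is ever nested inside another: each outer iteration costs one inner gradient evaluation, one (here trivial) linear solve, and one outer update.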
This can be as expensive as solving the original optimality condition. The idea then is to introduce a preconditioning operator M that decouples the components of the iterate x (in our case the inner variable u, the adjoint variable P, and the outer variable α) such that each component can be solved in succession from 0 ∈ H(x^{k+1}) + M(x^{k+1} − x^k).
Gradient steps can be handled through nonlinear preconditioning [49,12], as we will see in Section 2 when we develop the approach in detail along with two more specific algorithms, the FEFB (Forward-Exact-Forward-Backward) and the FIFB (Forward-Inexact-Forward-Backward).In Section 3 we prove their local linear convergence under a second-order growth condition on the composed objective  •   , and other more technical conditions.The proof is based on the "testing" approach developed in [49] and also employed extensively in [12,50].Finally, we evaluate the numerical performance of the proposed schemes on imaging applications in Section 4, specifically the learning of a regularization parameter for total variation denoising, and the convolution kernel for deblurring.Since the purpose of these experiments is a simple performance comparison between different algorithms, instead of real applications, we only use a single training sample of various dimensions, as explained in Section 4.
Intermediate approaches, some reminiscent of ours, have recently also been developed in the machine learning community. Our approach, however, allows a nonsmooth function in the outer problem (1.1). Moreover, to our knowledge, our work is the first to show linear convergence for a fully "single-loop" algorithm. To be more precise, the STABLE [9], TTSA [33], FLSA [38], MRBO, VRBO [53], and SABA [13] are "single-loop" algorithms like ours, taking only a single step towards the solution of the inner problem on each outer iteration. The STABLE requires solving the adjoint equation exactly, as does our first approach, the FEFB. The others use a Neumann series approximation for the adjoint equation. Our second approach, the FIFB, takes a simple step reminiscent of gradient descent for the adjoint equation. The TTSA and STABLE obtain sublinear convergence of the outer iterates assuming strong convexity (second-order growth) of both the inner and outer objective. For the SABA similar linear convergence is claimed with the outer strong convexity replaced by a Polyak-Łojasiewicz inequality. Without either of those assumptions, the theoretical results on the aforementioned methods from the literature are much weaker, and generally only show various forms of "stall" of the iterates at a sublinear rate, or the ergodic convergence of the gradient of the composed objective to zero. Such modes of convergence say very little about the convergence of function values to the optimum or of the iterates to a solution.
In the context of not fully single-loop algorithms, the AID, ITD [35], AccBio [34], and ABA [27] take a fixed (small) number of inner iterations for each outer iteration. For the AID and ITD, only sublinear convergence of the composed gradient is claimed. For the ABA and AccBio, linear convergence of outer function values is claimed under strong convexity of both the inner and outer objectives.

fundamentals and applications
Fundamentals of MPECs and bilevel optimization are treated in the books [40,2,19,21]. An extensive literature review up to 2018 can be found in [20], and recent developments in [46]. Optimality conditions for bilevel problems, both necessary and sufficient, are developed in, e.g., [54,55,22,41,3]. A more limited type of "bilevel" problem only constrains the outer variable to lie in the set of minimisers of another problem. Algorithms for such problems are treated in [43,44].
Bilevel optimization has been used for learning regularization parameters and forward operators for inverse imaging problems. With total variation regularization in the inner problem, the parameter learning problem in its most basic form reads [14] min_{α ≥ 0} ½‖u_α − b‖² subject to u_α minimizing the total variation regularized fidelity with weight α. This problem finds the best possible α for reconstructing the "ground truth" image b from the measurement data z, which may be noisy and possibly transformed and only partially known through the forward operator A, mapping images to measurements. To generalize to multiple images, the outer problem would sum over them and corresponding inner problems [6]. Multi-parameter regularization is discussed in [16], and natural conditions for α > 0 in [15].¹ In other works, the forward operator A is learned for blind image deblurring [31] or undersampling in magnetic resonance imaging [45]. In [37] regularization kernels are learned, while [17,8] study the learning of optimal discretisation schemes. To circumvent the non-differentiability of the solution mapping, [42] replace the inner problem with a fixed number of iterations of an algorithm. Their approach has connections to the learning of deep neural networks.
Bilevel problems can also be seen as leader-follower or Stackelberg games: the outer problem or agent leads by choosing the parameter, and the inner agent reacts with the best possible state for that parameter. Multiple-agent Nash equilibria may also be modeled as bilevel problems. Both types of games can be applied to financial markets and resource use planning; we refer to the aforementioned books [40,2,19,21] for specific examples.

notation and basic concepts
We write 𝕃(X; Y) for the space of bounded linear operators between the normed spaces X and Y, and Id for the identity operator. Generally X will be Hilbert, so we can identify it with its dual X*.
¹ An error in [15, Lemma 10] requires some conditions therein to be taken "in the limit" as the smoothing parameter tends to zero.

proposed methods
We now present our proposed methods for (1.1). They are based on taking a single gradient descent step for the inner problem, and using forward-backward splitting for the outer problem. The two methods differ in how an "adjoint equation" is handled. We present the algorithms and the assumptions required to prove their convergence in Sections 2.2 and 2.3, after deriving optimality conditions and the adjoint equation in Section 2.1. We prove convergence in Section 3.

optimality conditions
for parameters near ᾱ. Therefore, the chain rule for Fréchet differentiable functions yields the gradient of the composed objective. That is, the adjoint of the derivative of the solution map solves, for u = S(α), the adjoint equation. We introduce the corresponding solution mapping for the adjoint variable P. We will later make assumptions that ensure that this mapping is well-defined. Since S maps the parameter space into the image space, the Fréchet derivative S′(α) and its Hilbert adjoint are bounded linear operators for all α. Consequently P is a bounded linear operator, but we will need P to lie in an inner product space. Assuming the image space to be a separable Hilbert space, we introduce such structure by using a countable orthonormal basis of that space to define the inner product. We briefly study this inner product and the induced norm ∥|·∥| in Appendix B.
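In finite dimensions, the adjoint equation amounts to linear systems with the (positive definite) inner Hessian. A hypothetical 2x2 sketch of one such solve, with a made-up matrix and right-hand side standing in for the Hessian and the outer gradient:

```python
# Sketch (hypothetical 2x2 example): the adjoint variable q solves
#   (inner Hessian) q = (outer gradient),
# i.e. one small symmetric positive definite linear system per outer iterate.
H = [[2.0, 0.5],
     [0.5, 1.0]]          # stand-in inner Hessian, positive definite
j = [1.0, 3.0]            # stand-in outer gradient

det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
q = [(H[1][1] * j[0] - H[0][1] * j[1]) / det,
     (H[0][0] * j[1] - H[1][0] * j[0]) / det]

# residual check: H q reproduces the right-hand side
r0 = H[0][0] * q[0] + H[0][1] * q[1] - j[0]
r1 = H[1][0] * q[0] + H[1][1] * q[1] - j[1]
print(abs(r0) < 1e-12 and abs(r1) < 1e-12)   # True
```

In realistic dimensions this solve is the expensive part, which is precisely what the inexact variant below avoids.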
By the sum rule for Clarke subdifferentials and their compatibility with convex subdifferentials and Fréchet differentiable functions [10], the Fermat principle for Clarke subdifferentials then furnishes the necessary optimality condition (2.6).

algorithm: forward-exact-forward-backward
Our first strategy for solving (2.6) takes just a single gradient descent step for the inner problem, solves the adjoint equation exactly, and then takes a forward-backward step for the outer problem.We call this Algorithm 2.1 the FEFB (forward-exact-forward-backward).
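As a rough illustration of the FEFB pattern (one inner forward step, an exact adjoint solve, one outer forward-backward step), consider the following sketch on a hypothetical quadratic smoothing problem. All objectives, symbols, step lengths, and the parameter cap below are our illustrative assumptions, not the paper's test problems:

```python
import numpy as np

# Rough FEFB-style sketch (hypothetical quadratic problems, not the paper's):
#   inner:  F(u; a) = 0.5*||u - z||^2 + 0.5*a*||D u||^2
#   outer:  J(u) = 0.5*||u - b||^2,  R = indicator of [0, 2]
# Each outer iteration: one inner gradient step, an exact adjoint solve,
# and one forward-backward (here: projected gradient) step on a.

n = 16
rng = np.random.default_rng(1)
b = np.linspace(0.0, 1.0, n)                 # "ground truth"
z = b + 0.05 * rng.standard_normal(n)        # noisy data
D = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]     # forward differences
a, u = 0.5, np.zeros(n)
tau, sigma = 0.1, 0.1                        # illustrative step lengths

for _ in range(500):
    # forward step on the inner problem
    u = u - tau * ((u - z) + a * (D.T @ (D @ u)))
    # exact adjoint solve: (I + a*D^T D) q = grad J(u)
    q = np.linalg.solve(np.eye(n) + a * (D.T @ D), u - b)
    # forward-backward step on a; the hypergradient is -(D u)^T (D q)
    a = min(max(a - sigma * (-(D @ u) @ (D @ q)), 0.0), 2.0)

print(0.0 <= a <= 2.0)                       # parameter stays feasible: True
```

The proximal step here reduces to a projection because the outer regulariser is an indicator function; a nonsmooth penalty would replace the clipping by the corresponding proximal map.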
The "nonlinear preconditioning" applied to construct the next iterate shifts iterate indices such that a forward step is performed instead of a proximal step; compare [49,12]. We next state essential structural, initialisation, and step length assumptions. We start with a contractivity condition needed for the proximal step with respect to the outer variable.
We verify Assumption 2.1 for some common cases in Appendix A. When applying the assumption to a point satisfying (2.6), we will take as the base point the corresponding gradient step, for which the proximal residual vanishes by standard properties of proximal mappings. The results for nonsmooth functions in Appendix A in that case forbid strict complementarity: both for the one-norm combined with a nonnegativity constraint and for the indicator function of a convex set, the relevant gradient components must vanish at the solution. Intuitively, this restriction serves to forbid the finite identification property [28] of proximal-type methods, as the iterates cannot converge too fast in our techniques for the stability of the inner problem and the adjoint with respect to perturbations of the outer variable.
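For reference, the proximal mappings of the two nonsmooth choices mentioned above have simple closed forms. A sketch in plain Python (the function names are ours; the formulas are standard):

```python
# Closed-form proximal mappings for two common choices of the nonsmooth
# outer regulariser (standard formulas; function names are ours):
def prox_l1_nonneg(x, lam, gamma):
    """prox of gamma*(lam*||.||_1 + indicator of [0, inf)): shifted clipping."""
    return [max(xi - gamma * lam, 0.0) for xi in x]

def prox_box(x, lo, hi):
    """prox of the indicator of [lo, hi]: projection onto the box."""
    return [min(max(xi, lo), hi) for xi in x]

print([round(v, 10) for v in prox_l1_nonneg([0.3, -0.2, 1.5], 1.0, 0.1)])
# [0.2, 0.0, 1.4]
```

The components set exactly to zero by the first map illustrate the finite identification behaviour that the assumption rules out at the solution.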
We now come to our main assumption for the FEFB. It collects conditions related to step lengths, initialization, and the problem functions. For a testing constant to be determined by the assumption, we introduce the testing operator (2.10) Z := diag(φ Id, Id, Id).
The idea, introduced in [49] and further explained in [12], is to derive convergence rates by testing the monotonicity estimates of the algorithm with this operator. Part (i) ensures that the initial inner problem iterate is good relative to the outer problem iterate. If the first inner iterate solves the inner problem for the initial outer iterate, (i) holds for any positive constant. Therefore, (i) can always be satisfied by solving the inner problem for the initial outer iterate to high accuracy. This condition does not require the initial outer iterate to be close to a solution of the entire problem.
Part (ii) ensures that the inner problem solution map exists and is well-behaved; we discuss it further in Remark 2.4.
Parts (iii) and (v) are second-order growth and boundedness conditions, standard in smooth optimization. The nonsmooth outer regulariser is handled through the prox-contractivity assumption. If the solution map is twice Fréchet differentiable, the product and chain rules establish the required growth for some positive constant. Additional continuity assumptions then establish the positivity required in (v) in a neighbourhood of the solution. It is also possible to further develop the condition so as not to depend on the solution mapping at all.
Dependent on the problem, (v) may restrict the outer step length parameter. Part (iii) ensures that the inner Hessian is invertible and the adjoint solution mapping is well-defined. We will see in Lemma 3.3 that the corresponding radius is sufficiently large for the iterates to stay within it. Part (iv) is a standard step length condition for the inner problem, while (vii) is a step length condition for the outer problem. It depends on several constants defined in the more technical part (vi). We can always satisfy the inequality in (vi) by good relative initialisation (a small constant in (i)), as discussed above, and by taking the testing parameter small. According to the local initialization condition (viii), the latter can be done if the initial iterates are close to a solution of the entire problem, or if the radii can be taken arbitrarily large. If both can be taken arbitrarily large, we obtain global convergence.

Remark 2.4 (Existence and differentiability of the solution map). Suppose the inner objective is twice continuously differentiable in both variables, and that its Hessian in the inner variable is uniformly positive definite in suitable neighbourhoods of the solution. Then the implicit function theorem shows the existence of a unique continuously differentiable solution map in a neighborhood of any admissible parameter. Such a solution map is also Lipschitz in a neighborhood; see, e.g., [12, Lemma 2.11]. If the parameter space is finite-dimensional, a compactness argument gluing together the neighborhoods then proves Assumption 2.2 (ii).

algorithm: forward-inexact-forward-backward
Our second strategy for solving (2.6) modifies the first approach to solve the adjoint variable inexactly, so that no costly matrix inversions are required. Instead we perform an update reminiscent of a gradient step. This approach, which we call the FIFB (forward-inexact-forward-backward), reads as Algorithm 2.2 and has the implicit form (2.11). The implicit form can also be written as (2.8). For the testing operator we use the structure (2.13) Z := diag(φ Id, ψ Id, Id).
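The inexact adjoint update can be illustrated as a Richardson-type iteration on the adjoint linear system. The following sketch, with a made-up positive definite stand-in Hessian and right-hand side (our illustrative assumptions), shows that repeated cheap steps converge to the exact adjoint solution:

```python
import numpy as np

# FIFB-style inexact adjoint update (sketch): instead of solving
# H q = j exactly, take gradient-type (Richardson) steps
#   q <- q - theta * (H q - j),
# each costing only one Hessian-vector product.
n = 8
rng = np.random.default_rng(2)
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)       # stand-in positive definite inner Hessian
j = rng.standard_normal(n)        # stand-in right-hand side
theta = 1.0 / np.linalg.norm(H, 2)

q = np.zeros(n)
for _ in range(200):
    q = q - theta * (H @ q - j)   # cheap: one Hessian-vector product

print(np.allclose(H @ q, j, atol=1e-6))   # converges to the exact adjoint
```

A single such step per outer iteration suffices for the FIFB; the loop above merely demonstrates that the update contracts towards the exact solve used by the FEFB.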
(v) The outer fitness function is Lipschitz continuously differentiable with a given factor. Moreover, the outer regulariser is locally prox-contractive at the solution for the corresponding gradient within a neighbourhood intersected with its domain, for some nonnegative factor.

(vi) The testing constants satisfy the stated inequality. (vii) The outer step length satisfies the stated bounds. (viii) The initial iterate (u⁰, P⁰, α⁰) is such that the distance-to-solution is small enough.

Remark 2.6 (Interpretation). The interpretation of Assumption 2.2 in Remark 2.4 also applies to Assumption 2.5. We stress that to satisfy the inequality in (vi), it suffices to ensure a small constant in (i) by good relative initialization of the inner and adjoint variables with respect to the outer variable, and to choose the testing parameters small enough. According to (viii), the latter can be done by initializing close to a solution, or if the radius is large.

convergence analysis
We now prove the convergence of the FEFB (Algorithm 2.1) and the FIFB (Algorithm 2.2) in the respective Sections 3.2 and 3.3. Before this we start with common results. Our proofs are self-contained, but follow the "testing" approach of [49] (see also [12]). The main idea is to prove a monotonicity-type estimate for the operator occurring in the implicit forms (2.8) and (2.12) of the algorithms, and then to use the three-point identity (1.2) with respect to the testing-operator-weighted norms and inner products. This yields an inequality from which convergence rates can be read off. The main results for the FEFB and the FIFB are in the respective Theorems 3.16 and 3.21. Throughout, we assume that either Assumption 2.2 (FEFB) or 2.5 (FIFB) holds, and tacitly use the constants from the relevant one. We also tacitly take it that the outer iterates lie in the domain of the outer regulariser for all iterations, as this is guaranteed by the assumptions initially, and by the proximal step in the algorithms afterwards.
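The three-point identity (1.2) invoked throughout is, in generic notation (a self-adjoint positive semi-definite weighting M; our symbols, as the paper's are not reproduced in this excerpt), the standard

```latex
\langle M(x - y),\, x - z \rangle
  = \tfrac{1}{2}\|x - y\|_M^2
  - \tfrac{1}{2}\|y - z\|_M^2
  + \tfrac{1}{2}\|x - z\|_M^2 ,
```

which converts inner products between successive steps and distances-to-solution into telescoping squared norms, the mechanism behind the descent estimates below.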

general results
Our main goal here is to bound the error in the inner and adjoint iterates in terms of the outer iterates. We also derive bounds on the outer steps, and local monotonicity estimates. We first show that the solution mapping for the adjoint equation is locally Lipschitz. The triangle inequality and the operator norm inequality of Theorem B.1 (ii) give a first bound (3.1), and the Lipschitz continuity of the first derivatives of the inner objective a second, (3.2). Towards estimating the remaining term on the right hand side of (3.1), we observe that A⁻¹ − B⁻¹ = A⁻¹(B − A)B⁻¹ for any invertible linear operators A and B. Together with the boundedness and Lipschitz continuity of the second derivatives of the inner objective, inserting this and (3.2) into (3.1) establishes the claim. □

We now prove two simple step length bounds.

Lemma 3.2. Let Assumption 2.2 or 2.5 hold. Then the inner step length is strictly less than the reciprocal of the corresponding Lipschitz factor, and the stated constant lies strictly between one and its square-root bound.

Proof. The inner gradient step of Algorithm 2.1 or 2.2, together with the vanishing of the inner gradient at the solution and the Lipschitz continuity of the partial derivatives of the inner objective from Assumption 2.2 (iii) or 2.5 (iii), yields the first estimate. Next, the Lipschitz continuity of the solution mapping from Assumption 2.2 (ii) or 2.5 (ii) and the initialisation bound of Assumption 2.2 (viii) or 2.5 (viii) imply the second. □

We now introduce a working condition that we later prove. It guarantees that the Lipschitz and Hessian properties of Assumption 2.2 (ii), (iii) and (v), or Assumption 2.5 (ii), (iii) and (v), hold at the iterates.

Assumption 3.4 (Iterate locality). Let the radii be as defined in either Assumption 2.2 or 2.5. This assumption holds for a given iteration if the iterates lie within the corresponding neighbourhoods of the solution.

Indeed, the next two lemmas show that if Assumption 3.4 holds for one iteration along with some further conditions, then it holds for the next.

Lemma 3.5. Suppose either Assumption 2.2 or 2.5 holds. Then the stated inclusions follow.

Proof. We estimate using (3.3) and the definitions of the relevant constants in Assumption 2.2 or 2.5 that the iterates lie in the required neighbourhoods. We finish by applying Lemma 3.5 at the next iteration to bound the next adjoint iterate. □

We next prove a monotonicity-type estimate for the inner objective. For this we need the following three-point monotonicity inequality.

Theorem 3.7. Let the stated growth and smoothness conditions hold on the connecting interval.

Proof. The proof follows that of [12, Lemma 15.1], whose statement unnecessarily takes the intermediate point in a neighbourhood instead of just the connecting interval. □

Lemma 3.8. Suppose either Assumption 2.2 or 2.5, and Assumption 3.4, hold. Then the stated estimate (3.5) holds for any weight in (0, 1).

Proof. Assumption 2.2 (iii) or 2.5 (iii) applies at the iterates by Assumption 3.4. The inner gradient vanishes at the solution, so Theorem 3.7 applies. Young's inequality and the definition of the Lipschitz constants in Assumption 2.2 (iii) or 2.5 (iii) then readily establish the claim. □

The next lemma bounds the steps taken for the outer problem variable.

Lemma 3.9. Suppose either Assumption 2.2 or 2.5 holds, as do Assumption 3.4, (3.3), and the stated conditions. Then (3.7) and (3.8) hold.

Proof. Using the outer update of Algorithm 2.1 or 2.2, we first estimate the outer step. Since proximal maps are 1-Lipschitz, and the outer regulariser is by Assumption 2.2 (v) or 2.5 (v) locally prox-contractive at the solution within its domain, the bound (3.9) follows. Using the Lipschitz continuity of the outer gradient from Assumption 2.2 (v) or 2.5 (v), we continue the estimate, and inserting the result into (3.9) we obtain (3.7). Assumption 2.2 (vii) or 2.5 (vii) and (3.7) then yield the final estimate. Rearranging terms and finishing with the triangle inequality we get (3.8). □

Remark 3.10 (Gradient steps with respect to the outer variable). In both the FEFB and the FIFB we could also take a gradient step instead of a proximal step with respect to the outer variable when the outer regulariser has a Lipschitz gradient. This can be shown to be convergent by a corresponding change to (3.9).

We next prove that if an inner problem iterate has small error, and we take a short step in the outer problem, then the next inner problem iterate also has small error.

Lemma 3.11. Suppose Assumption 2.2 or 2.5 holds. If Assumption 3.4, (3.3), and (3.6) hold for the current iteration, then (3.6) holds for the next iteration, and the next inner iterate stays within twice the initial radius of the solution.

Proof. We plan to use Theorem 3.7 on the inner objective at the next outer iterate, followed by the three-point identity and simple manipulations. We begin by verifying the conditions of the theorem.

convergence: forward-exact-forward-backward
We now prove the convergence of Algorithm 2.1. We start with a lemma that shows a relative exactness estimate on the adjoint iterate when one holds for the inner iterate. This is needed to use Lemma 3.12. The main result of this subsection is the final Theorem 3.16.
We are now able to collect the previous lemmas into a descent estimate from which we immediately observe local linear convergence. We recall the definitions of the preconditioning and testing operators in (2.9b) and (2.10).

Lemma 3.14. Suppose Assumptions 2.2 and 3.4, and the inner exactness estimate (3.6), hold. Then the descent estimate (3.15) holds for the constants of Assumption 2.2 (vi).

Proof. We start by proving the monotonicity estimate (3.16). We observe that the testing parameters are positive by Assumption 2.2. The monotonicity estimate (3.16) expands into three lines, which we estimate separately. For the first line, we use (3.5) from Lemma 3.8. For the middle line we observe that the adjoint residual vanishes by the adjoint update of Algorithm 2.1.

For the last line, we use (3.11) from Lemma 3.12. We can do this because (3.10) holds by (3.6) and Lemma 3.13. Summing with (3.5) we thus obtain the combined estimate, in which the factor of the first term is the testing parameter and the factor of the last term is zero. Since the inner step length bound of Lemma 3.2 and the condition of Assumption 2.2 (iv) hold, we obtain (3.17), i.e., (3.16).

We now come to the fundamental argument of the testing approach of [49], combining operator-relative monotonicity estimates with the three-point identity. Indeed, (3.16) combined with the implicit algorithm (2.8) gives a one-step estimate. Inserting the three-point identity (1.2) and expanding yields (3.15). □

Before stating our main convergence result for the FEFB, we simplify the assumptions of the previous lemma to just Assumption 2.2. We first prove the claim for the initial iteration. Assumption 2.2 (i) directly establishes (3.6). The definition of the initialisation radius in Assumption 2.2 also establishes the required inclusions for the initial iterates. We have thus proved the conditions of Lemma 3.13, which establishes (3.3) initially. Now Lemma 3.5 bounds the first adjoint iterate. Therefore Assumption 3.4 holds initially. Finally Lemma 3.14 proves (3.15) for the initial iteration. This concludes the proof of the induction base.

convergence: forward-inexact-forward-backward
We now prove the convergence of Algorithm 2.2. The overall structure and idea of the proofs follow Section 3.2 and use several lemmas from Section 3.1. We first prove a monotonicity estimate for the adjoint step, and then show that a small enough step length in the outer problem guarantees that the inner and adjoint iterates stay in a small local neighbourhood if they are already in one. The main result of this subsection is the final Theorem 3.21. It proves, under Assumption 2.5, the linear convergence of the iterates generated by Algorithm 2.2 to a point satisfying the first-order optimality condition (2.6). Lemma 3.17.
Proof. Using (3.18) and the three-point identity (1.2), we obtain the first part of the claim. We still need to show the second part, the bound on the next adjoint residual. We follow the fundamental argument of the testing approach (see the end of the proof of Lemma 3.14) and use Assumption 2.5 (ii) and (iii). For the latter we need the iterates to lie in the stated neighbourhoods. The current inner iterate does so by Assumption 3.4, and the next one by Lemma 3.11. Thus we may use the Lipschitz continuity of the solution mapping with the triangle inequality and (3.20), together with the bound from Lemma 3.17. By the adjoint update of the FIFB in the implicit form (2.11), combining with (3.21) and an application of the three-point identity (1.2) with the step length bound of Assumption 2.5 (iv), we obtain an estimate which, with the triangle inequality, gives the claim. The solution maps for the inner problem and the adjoint equation are Lipschitz in the relevant neighbourhoods due to Assumption 2.5 (ii) and (iii) and Lemma 3.1. Combined with the triangle inequality, (3.19) and (3.20), we obtain the required bound. The monotonicity estimate again expands into three lines, which we estimate separately. We immediately take care of the first line by using (3.5) from Lemma 3.8.

We then make the induction assumption that (**) holds up to the current iteration, and prove it for the next. The induction assumption and the preceding lemmas give the required estimates. We finally come to the main convergence result for the FIFB.

Theorem 3.21. Suppose Assumption 2.5 holds. Then the iterates generated by Algorithm 2.2 converge locally linearly to a point satisfying the first-order optimality condition (2.6).

numerical experiments
We evaluate the performance of our proposed algorithms on parameter learning for (anisotropic, smoothed) total variation image denoising and deconvolution. For a "ground truth" image b of dimensions n × n, we take the squared distance to b as the outer fitness function. For b we use a cropped portion of image 02 or 08 from the free Kodak dataset [26], converted to gray values in [0, 1]. The purpose of these numerical experiments is a simple performance comparison between our proposed methods and a few representative approaches from the literature. We therefore only consider a single ground-truth image b and corresponding corrupted data z in the various inner problems, which we describe next. For proper generalizable parameter learning, multiple such training pairs should be used. This can in principle be done by summing over all the data in both the inner and outer problem, resulting in a higher-dimensional bilevel problem; see, e.g., [5]. In practice, a large sample count would require stochastic techniques.

denoising
For denoising we take as the inner objective the squared fidelity to the data plus smoothed anisotropic total variation, and as the outer regulariser the zero function. The simulated measurement z is obtained from b by adding Gaussian noise of standard deviation 0.1. The difference matrix is a backward difference operator with Dirichlet boundary conditions. Instead of the one-norm ∥·∥₁, to ensure the twice differentiability of the objective and hence a simple adjoint equation, we use a C² Huber- or Moreau-Yosida-type approximation with smoothing parameter γ. We used γ = 10⁻⁴ in our experiments.
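The smoothing of the absolute value can be written in the standard Huber form; a sketch (the paper's exact constants and scaling may differ):

```python
# Huber / Moreau-Yosida type smoothing of |t| with parameter gamma > 0
# (standard construction; the paper's exact constants may differ):
#   h_gamma(t) = t**2 / (2*gamma)   if |t| <= gamma,
#                |t| - gamma / 2    otherwise.
def huber(t, gamma=1e-4):
    if abs(t) <= gamma:
        return t * t / (2.0 * gamma)
    return abs(t) - gamma / 2.0

print(huber(0.0), round(huber(1.0), 6))   # 0.0 0.99995
```

The quadratic piece makes the objective twice differentiable near zero, while away from zero the function differs from |t| only by the constant γ/2, so small γ changes the inner problem little.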

deconvolution
For deconvolution, we take as the inner objective the squared fidelity with the convolved image plus smoothed total variation, and as the outer regulariser a quadratic penalty forcing the kernel weights to sum to one, plus a nonnegativity constraint on the first component, with regularisation parameter 10⁴. We introduce a constant scaling factor of 1/10 to help convergence by ensuring the same order of magnitude for all components of the outer variable. The first element of the outer variable is the total variation regularization parameter, while the rest parametrize the convolution kernel, as illustrated in Figure 2a. The sum of the elements of the kernel equals the sum of the remaining components. The rotation operator rotates an image by the given number of degrees, clockwise for positive and counterclockwise for negative angles. We form z by rotating b, convolving with the kernel for the parameters [0.15, 0.1, 0.75], rotating back, and adding Gaussian noise of standard deviation 10⁻².
For denoising, and for deconvolution under a trivial-kernel-intersection condition on the difference operator and the convolution, it is not difficult to verify the structural parts of Assumptions 2.2 and 2.5, required for the convergence results of Section 3. We do not attempt to verify the conditions on the step lengths, choosing them instead by trial and error.

an implicit baseline method
We evaluate Algorithms 2.1 and 2.2 against a conventional method that solves both the inner problem and the adjoint equation (near-)exactly, as well as against the AID [35]. We also experimented with solving the equivalent constrained optimisation problem, minimising the outer objective subject to the vanishing of the inner gradient, with IPOPT [52] and the NL-PDPS [11,48]. However, we did not observe convergence without the inclusion of additional ℓ¹ regularisation in the inner problem, as in, e.g., [14]. Since that changes the problem, we have not included "whole problem" approaches in our comparison.

Table 1: Algorithm parametrisation (step length parameters, inner and adjoint iteration counts), time multiplier, and outer steps taken to reach the threshold computational resources (CPU time) value. The threshold is 6000 for denoising, and 15000 for deconvolution. The time multiplier allows conversion from computational resources to real time. It differs between algorithms and problem dimensions due to different levels of parallelisability. The 'inner steps' and 'adjoint steps' columns indicate the (maximum) number of iterations taken towards a solution of the inner problem or the adjoint equation on every outer iteration. The bicgstab method for solving the adjoint equation in the implicit method and the FEFB may use a smaller number of iterations, as determined by the threshold 10⁻⁵. "LS" for the step length parameter means that line search was used.

To solve the inner problem in the implicit baseline method, we use gradient descent, starting at zero and iterating gradient steps on the inner objective. We then take the final iterate as the inner solution estimate. The adjoint and outer iterate updates are as in Algorithm 2.1; however, we determine the outer step length by the line search rule [12, (12.41)] for nonsmooth problems, starting at 5 · 10⁻⁵ and multiplying by 0.1 on each line search step. For deconvolution we use a fixed step length parameter, as it performed better. The specific parameter choices (step lengths, numbers of inner and adjoint iterations) for all algorithms and experiments are listed in Table 1.
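The multiply-by-0.1 backtracking pattern described above can be sketched as follows (a simplified sufficient-decrease rule with made-up test data, not the exact nonsmooth rule [12, (12.41)]):

```python
# Simplified backtracking in the spirit of the baseline's outer line search:
# start from a trial step and multiply by 0.1 until the objective decreases.
# (The actual rule [12, (12.41)] handles nonsmooth terms; this is a sketch.)
def backtrack(f, x, grad, step=5e-5, shrink=0.1, max_tries=10):
    fx = f(x)
    for _ in range(max_tries):
        if f(x - step * grad) < fx:
            return step
        step *= shrink
    return step

# gradient of (t - 1)^2 at t = 5 is 8; the first trial step already decreases f
s = backtrack(lambda t: (t - 1.0) ** 2, 5.0, 8.0)
print(s == 5e-5)   # True
```

Each shrink multiplies the trial step by 0.1, matching the schedule stated above; the fixed-step variant used for deconvolution simply skips this loop.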

numerical setup
Our algorithm implementations are available on Zenodo [47]. To solve the adjoint equation in the FEFB and the implicit method, we use Matlab's bicgstab implementation of the stabilized biconjugate gradient method [51] with tolerance 10⁻⁵ and maximum iteration count 10³. With the AID we use 50 conjugate gradient iterations. These choices, as well as the number of inner iterations for the implicit method and the AID, have been made by trial and error to be as small as possible while obtaining an apparently stable algorithm.
To evaluate scalability, we consider for denoising both n = 128 and n = 256. For deconvolution we consider n = 128 and n = 32. We take the initial inner iterate as the inner solution for the initial parameter, and the initial adjoint iterate as the corresponding adjoint solution, with the initial parameter depending on the problem. To compare algorithm performance, we plot relative errors against the cputime value of Matlab on an AMD Ryzen 5 5600H CPU. We call this value "computational resources", as it takes into account the use of several CPU cores by Matlab's internal linear algebra, making it fairer than the actual running time. For each algorithm and problem, we indicate in Table 1 the step length parameters, the number of outer steps to reach the computational resources value of 6000 for denoising and 15000 for deconvolution, and an average multiplier to convert computational resources into running times.
For performance comparison, we need estimates α̃ and ũ of the optimal ᾱ and û = S(ᾱ). For denoising we find them by searching for the one-dimensional variable on a regular grid, recursively subdividing until the node spacing goes below 10⁻⁵. As ũ, we take an estimate of S(α̃) obtained with 25000 steps of the implicit baseline method. We visualise the so obtained α̃ and the composed objective in Figure 3. For the higher-dimensional deconvolution problem, such a scan is not feasible. Instead, we obtain the comparison estimates by running the implicit method from a very good initial iterate until the computational resources (CPU time) value of 6000 for n = 32 and 10000 for n = 128. Specifically, we initialise the kernel parameters as those used for generating the data z, and the regularization parameter as 0.045 for n = 32 and 0.02 for n = 128, the latter found by trial and error. This initialisation is different from that used for the actual numerical experiments; see above. Our experiments indicate that the other methods approach the so obtained α̃ faster than the implicit method itself, providing some justification for the choice.
With these solution estimates we define the outer and inner relative errors α_rel := ∥α̃ − α∥/∥α̃∥ and u_rel := ∥ũ − u∥/∥ũ∥.

results
We report performance in Figures 4 and 5, and the image data and reconstructions in Figures 1 and 2. Figure 5 indicates that for deconvolution the FIFB significantly outperforms the other methods. The outer variable converges much faster than for the other evaluated methods, despite the fact that the inner variable, especially with n = 32, stays some distance away from ũ. However, as the dashed line indicates, the exact solution of the inner problem for the corresponding outer iterate shows clear signs of convergence. (The few "spikes" in the graph for n = 128 occur where the regularisation parameter is temporarily much closer to zero than its estimated optimal value.) This observation justifies the intuition that the inner problem does not need to be solved exactly on each outer iteration.

Figure 3: Graph of the composed objective for the denoising problem (both problem sizes), along with the near-optimal α̃ found by recursive subdivision.